Build Your Own Tree — Beyond the Look-Alike: Gulf Fish Phylogenetics

This page walks through the full pipeline: from downloading sequences off GenBank to a finished tree in R. All four scripts are in the GitHub repo and run on any laptop with R 4.0+.

What You Need

R 4.0+ from r-project.org. Four packages: ape and phangorn from CRAN; msa and Biostrings from Bioconductor. Run this once, then comment it out.

install.packages(c("ape", "phangorn"))
if (!requireNamespace("BiocManager")) install.packages("BiocManager")
BiocManager::install(c("msa", "Biostrings"))

For Bayesian analysis you also need MrBayes (runs in the terminal, outside R). Mac: brew install mrbayes. Other systems: binaries at mrbayes.sourceforge.net. Confirm by parsing mb in a terminal.

The four scripts (00_prepare_alignment.R, 01_tree_NJ.R, 02_tree_ML.R, 03_tree_Bayesian.R) are in the R/ folder of the repo. Run them from that folder as your working directory. Output folders (out/rdata/, out/trees/) are created automatically.

BiocManager::install() fails with a repository error or "cannot open URL."

Test CRAN first: install.packages("ape"). If that works, update BiocManager: install.packages("BiocManager"). SSL error? Try options(repos = c(CRAN = "http://cran.r-project.org")). On a university network, try from a personal connection or VPN. Check R/Bioconductor version compatibility with BiocManager::version().

Installed msa and Biostrings but R says "there is no package called 'msa'."

Bioconductor installs to a separate library path. Run .libPaths() to check. Try reinstalling in a fresh R session. If you have multiple R versions, make sure BiocManager ran in the same version you're using now. Windows: run RStudio as Administrator if you see permission errors. Permission errors can be an issue with university owned laptops as well. You may need to enter admin mode.

Getting Sequences from GenBank

All sequences come from NCBI Nucleotide. The gene is COI, the standard DNA barcode for animal species, ~650 bp long. Search with: [species name][ORGN] AND COI[GENE] AND 600:700[SLEN]. Download 2–3 sequences per species in FASTA format. Combine them: cat *.fasta > combined.fasta. Edit headers to clean short names like >Red_Drum_1.

Accession numbers used in this project:

Species	Family	Role	Accession numbers
Sciaenops ocellatus (Red Drum)	Sciaenidae	Ingroup	MT879846.1 MT456179.1 MT456021.1
Pogonias cromis (Black Drum)	Sciaenidae	Ingroup	MT456022.1 MT456011.1 MT455667.1
Cynoscion nebulosus (Speckled Trout)	Sciaenidae	Ingroup	GU672720.1 MT456255.1 MT879816.1
Micropogonias undulatus (Atlantic Croaker)	Sciaenidae	Ingroup	MT456167.1 MT455752.1 MT455479.1
Paralichthys lethostigma (Southern Flounder)	Paralichthyidae	Ingroup	MN869893.1 MG837973.1 MG837971.1
Archosargus probatocephalus (Sheepshead)	Sparidae	Ingroup	MN869862.1 MN869861.1 MT456170.1
Mugil cephalus (Flathead Mullet)	Mugilidae	Ingroup / alt. outgroup	PQ347331.1 PQ347332.1 JX983367.1
Carcharhinus leucas (Bull Shark)	Carcharhinidae	Outgroup	PV837658.1 MT456249.1 MT456152.1

NCBI search returns wrong-length results or the length filter isn't working.

Make sure 600:700[SLEN] is in the query string itself, not a separate field. NCBI sometimes annotates the gene as "cox1" not "COI"; try COI[GENE] OR cox1[GENE]. If no results at all, check BOLD Systems (boldsystems.org) as an alternative source.

After cat *.fasta > combined.fasta, some headers look garbled or sequences are missing.

NCBI FASTA files often don't end with a newline. When concatenated, the last line of one sequence runs into the next header. Fix: use awk 'FNR==1 && NR!=1{print ""}1' instead of cat. Then open the file and visually confirm each sequence starts with > on its own line.

Script 00: Alignment and Setup

00_prepare_alignment.R is the shared setup all three tree scripts depend on. Run it once (~30 seconds). It reads sequences, aligns them, and writes the NEXUS file for MrBayes.

# Read unaligned sequences directly into Biostrings format
dna_strings <- Biostrings::readDNAStringSet(
  "FASTA/gulf_fish_COI_combined.fasta"
)

# Align with MUSCLE
msa_result <- msa(dna_strings, method = "Muscle")
cat("Alignment length:", ncol(as.matrix(msa_result)), "bp\n")

# Convert to ape (for NJ distances) and phangorn (for ML)
aln_dnabin <- msaConvert(msa_result, type = "ape::DNAbin")
aln_phydat <- msaConvert(msa_result, type = "phangorn::phyDat")

# Save both objects plus the color helper for downstream scripts
save(aln_dnabin, aln_phydat, family_colors, tip_color,
     file = "out/rdata/alignment_objects.RData")

Output: FASTA/gulf_fish_COI_aligned.fasta, MrBayes/gulf_fish_COI_aligned.nex (for Script 03), and out/rdata/alignment_objects.RData, the main handoff to all three tree scripts, containing the aligned matrix in both DNAbin and phyDat formats plus the shared color palette.

readDNAStringSet() throws "cannot open file: no such file or directory."

Run getwd(). Scripts expect the R/ folder as working directory. If you're in the project root, either setwd("R") or update the path to R/FASTA/gulf_fish_COI_combined.fasta. Windows: use forward slashes or double backslashes in paths.

Script 01: Neighbor-Joining

01_tree_NJ.R builds two NJ trees (rooted on Bull Shark and Flathead Mullet) with 1,000 bootstrap replicates each. NJ is the fastest method and serves as both a sanity check and a starting topology for ML.

# Compute K80 pairwise distances
dist_k80 <- dist.dna(aln_dnabin, model = "K80")

# Build unrooted NJ tree, then root on Bull Shark
tree_nj_raw   <- nj(dist_k80)
outgrp_shark  <- grep("Bull_Shark", tree_nj_raw$tip.label, value = TRUE)
tree_nj_shark <- root(tree_nj_raw, outgroup = outgrp_shark,
                       resolve.root = TRUE)

# Bootstrap: resample columns 1000 times, rebuild NJ each time
boot_nj_shark <- boot.phylo(
  tree_nj_shark, aln_dnabin,
  function(x) root(
    nj(dist.dna(x, model = "K80")),
    outgroup = grep("Bull_Shark", rownames(x), value = TRUE),
    resolve.root = TRUE
  ),
  B = 1000, quiet = TRUE
)

When reading the output: check that same-species sequences cluster together (validates labels), that all four Sciaenidae group together (tests family monophyly), and that the outgroup has a long branch while closely related ingroup species have short ones. Bootstrap at the Sciaenidae node should be near 100%; below 70% at internal nodes just means branching order is unresolved, not an error.

boot.phylo() has been running for over 10 minutes on a 24-taxon dataset.

It should finish in 1–3 minutes. Check that quiet = TRUE is set. Try B = 100 as a test.

Script 02: Maximum Likelihood

02_tree_ML.R runs model selection, fits a GTR+Gamma+I tree starting from the NJ topology, and runs 1,000 bootstrap replicates. Plan for 10–20 minutes on a laptop.

modelTest() from phangorn ranks models by AIC. Regardless of the winner, the script uses GTR+Gamma+I: all six substitution rates free, rate variation across sites modeled with gamma, invariant sites estimated.

# Start ML search from the NJ topology (faster convergence)
fit_start <- pml(tree_nj_start, data = aln_phydat)
fit_ml    <- optim.pml(
  fit_start,
  model         = "GTR",
  optGamma      = TRUE,        # estimate Gamma shape parameter
  optInv        = TRUE,        # estimate proportion invariant
  rearrangement = "stochastic",# thorough topology search
  control       = pml.control(trace = 0)
)

# 1000 bootstrap replicates (~10-20 min)
bs_ml      <- bootstrap.pml(fit_ml, bs = 1000, optNni = TRUE,
                              control = pml.control(trace = 0))
tree_ml_bs <- plotBS(fit_ml$tree, bs_ml, type = "n")

rearrangement = "stochastic" adds random perturbations to avoid local optima. Starting from the NJ tree speeds up convergence. Bootstrap scale: ≥95% = strong; 70–94% = moderate; 50–69% = weak; <50% = not usually reported. Weak internal support within a recently diversified family like Sciaenidae is normal for a single 650 bp gene.

ML tree topology differs substantially from NJ; is one wrong?

Not necessarily. Check bootstrap at the differing nodes. If both are <70%, both trees are uncertain and the difference is noise. If ML is high where NJ disagrees, ML is better supported. Disagreement at family-level is concerning; within-family is expected for a single barcode gene.

Script 03: Bayesian Inference

03_tree_Bayesian.R runs in two steps. First run: appends the MrBayes block to the NEXUS file and exits with instructions. Second run (after MrBayes finishes): reads the consensus tree and produces the figure.

What MrBayes is doing

MrBayes uses MCMC to explore tree space, randomly proposing and accepting/rejecting trees based on posterior probability. After a burn-in period, the surviving sampled trees form a consensus. This run: 500,000 generations, sampled every 500th (1,000 total), first 250 discarded as burn-in, leaving 750 trees for the consensus. Key settings: nst=6 (GTR), rates=gamma, ngen=500000. Increase ngen to 2,000,000+ for larger datasets or poor convergence.

begin mrbayes;
  set autoclose=yes;
  outgroup Bull_Shark_1;
  lset nst=6 rates=gamma;
  mcmc ngen=500000 samplefreq=500 printfreq=1000;
  sumt burnin=250;
  sump burnin=250;
end;

Running MrBayes

After the first R run, switch to a terminal and run:

cd R/MrBayes
mb gulf_fish_COI_aligned.nex

Takes ~5–10 minutes for this dataset. Output goes to R/MrBayes/, including .con.tre (consensus tree) and .p (parameter log). Then run 03_tree_Bayesian.R a second time to produce the figure.

Checking MCMC convergence

Load the .p file into Tracer (beast.community/tracer). A good run shows a "fuzzy caterpillar" trace with no drift, and ESS > 200 for all parameters. Low ESS or a drifting trace means the chain hasn't converged; re-run with larger ngen.

PP = 0.95 means 95% of sampled trees contained that clade, a direct probability statement. PP values run numerically higher than ML bootstrap and aren't directly comparable. Use them as complementary evidence.

MrBayes finished but .con.tre is empty or R can't find it.

Make sure you ran mb from R/MrBayes/ with the exact NEXUS filename the script wrote. If .con.tre is empty, MrBayes crashed; scroll back through the terminal. Common causes: MrBayes block wasn't appended, or the outgroup name doesn't exactly match a tip label (check spaces vs. underscores).

Tracer shows ESS below 200. Chain hasn't converged. What do I do?

Run longer; try ngen=2,000,000 with burnin scaled proportionally. Check the trace for drift after burn-in. Still low ESS? Increase nchains=4. On TAMU HPRC, ngen=5,000,000 runs in a few hours.

Extending This Analysis

Adding your own species

Download 2–3 COI sequences, clean the headers (e.g., >Spotted_Seatrout_1), and append them to FASTA/gulf_fish_COI_combined.fasta. Re-run from Script 00. If the new species is a new family, add it to family_colors in Script 00 first.

Using different genes

COI is the default, but 16S rRNA (more conserved, better for deeper splits) and cytochrome b (large GenBank database) both work well. Multi-gene analyses are more powerful but require sequences from the same individuals; don't mix accessions across genes from different specimens.

Scaling up with TAMU-HPRC

For 50+ species or Bayesian runs with ngen=5,000,000+, a laptop isn't practical. TAMU-HPRC is free for students and faculty and can finish the same MrBayes run in hours. Get started at hprc.tamu.edu; create an account, complete the orientation, and submit via SLURM.

I added a species and it clusters somewhere completely unexpected. Is my sequence wrong?

Check in order: (1) verify the FASTA header matches the label you expect; copy-paste errors are common; (2) BLAST the sequence at blast.ncbi.nlm.nih.gov; if the top hit is a different species, the GenBank entry was mislabeled

I want to use 16S or cytb instead of COI. What changes in the scripts?

Just swap the FASTA file in Script 00; update the path in readDNAStringSet(). Alignment length will differ (16S ~500 bp, cytb ~1,100 bp) but msa() handles any length. You may need larger ngen if using shorter sequences.