This page walks through the full pipeline: from downloading sequences off GenBank to a finished tree in R. All four scripts are in the GitHub repo and run on any laptop with R 4.0+.
What You Need
R 4.0+ from r-project.org. Four packages: ape and
phangorn from CRAN; msa and Biostrings
from Bioconductor. Run this once, then comment it out.
install.packages(c("ape", "phangorn"))
if (!requireNamespace("BiocManager")) install.packages("BiocManager")
BiocManager::install(c("msa", "Biostrings"))
For Bayesian analysis you also need MrBayes (runs in the terminal, outside R).
Mac: brew install mrbayes. Other systems: binaries at
mrbayes.sourceforge.net. Confirm by parsing mb in a terminal.
The four scripts (00_prepare_alignment.R, 01_tree_NJ.R,
02_tree_ML.R, 03_tree_Bayesian.R) are in the
R/ folder of the repo. Run them from that folder as your working
directory. Output folders (out/rdata/, out/trees/) are
created automatically.
BiocManager::install() fails with a repository error or "cannot open URL."
Test CRAN first: install.packages("ape"). If that works, update
BiocManager: install.packages("BiocManager"). SSL error? Try
options(repos = c(CRAN = "http://cran.r-project.org")). On a
university network, try from a personal connection or VPN. Check R/Bioconductor
version compatibility with BiocManager::version().
Installed msa and Biostrings but R says "there is no package called 'msa'."
Bioconductor installs to a separate library path. Run .libPaths()
to check. Try reinstalling in a fresh R session. If you have multiple R
versions, make sure BiocManager ran in the same version you're using now.
Windows: run RStudio as Administrator if you see permission errors. Permission errors can be an issue with university owned laptops as well. You may need to enter admin mode.
Getting Sequences from GenBank
All sequences come from NCBI Nucleotide. The gene is COI, the standard DNA
barcode for animal species, ~650 bp long. Search with:
[species name][ORGN] AND COI[GENE] AND 600:700[SLEN]. Download
2–3 sequences per species in FASTA format. Combine them:
cat *.fasta > combined.fasta. Edit headers to clean short names
like >Red_Drum_1.
Accession numbers used in this project:
| Species | Family | Role | Accession numbers |
|---|---|---|---|
| Sciaenops ocellatus (Red Drum) | Sciaenidae | Ingroup | MT879846.1 MT456179.1 MT456021.1 |
| Pogonias cromis (Black Drum) | Sciaenidae | Ingroup | MT456022.1 MT456011.1 MT455667.1 |
| Cynoscion nebulosus (Speckled Trout) | Sciaenidae | Ingroup | GU672720.1 MT456255.1 MT879816.1 |
| Micropogonias undulatus (Atlantic Croaker) | Sciaenidae | Ingroup | MT456167.1 MT455752.1 MT455479.1 |
| Paralichthys lethostigma (Southern Flounder) | Paralichthyidae | Ingroup | MN869893.1 MG837973.1 MG837971.1 |
| Archosargus probatocephalus (Sheepshead) | Sparidae | Ingroup | MN869862.1 MN869861.1 MT456170.1 |
| Mugil cephalus (Flathead Mullet) | Mugilidae | Ingroup / alt. outgroup | PQ347331.1 PQ347332.1 JX983367.1 |
| Carcharhinus leucas (Bull Shark) | Carcharhinidae | Outgroup | PV837658.1 MT456249.1 MT456152.1 |
NCBI search returns wrong-length results or the length filter isn't working.
Make sure 600:700[SLEN] is in the query string itself, not a
separate field. NCBI sometimes annotates the gene as "cox1" not "COI";
try COI[GENE] OR cox1[GENE]. If no results at all, check
BOLD Systems (boldsystems.org) as an alternative source.
After cat *.fasta > combined.fasta, some headers look garbled or sequences are missing.
NCBI FASTA files often don't end with a newline. When concatenated, the last
line of one sequence runs into the next header. Fix: use
awk 'FNR==1 && NR!=1{print ""}1' instead of cat.
Then open the file and visually confirm each sequence starts with
> on its own line.
Script 00: Alignment and Setup
00_prepare_alignment.R is the shared setup all three tree scripts
depend on. Run it once (~30 seconds). It reads sequences, aligns them, and writes
the NEXUS file for MrBayes.
# Read unaligned sequences directly into Biostrings format
dna_strings <- Biostrings::readDNAStringSet(
"FASTA/gulf_fish_COI_combined.fasta"
)
# Align with MUSCLE
msa_result <- msa(dna_strings, method = "Muscle")
cat("Alignment length:", ncol(as.matrix(msa_result)), "bp\n")
# Convert to ape (for NJ distances) and phangorn (for ML)
aln_dnabin <- msaConvert(msa_result, type = "ape::DNAbin")
aln_phydat <- msaConvert(msa_result, type = "phangorn::phyDat")
# Save both objects plus the color helper for downstream scripts
save(aln_dnabin, aln_phydat, family_colors, tip_color,
file = "out/rdata/alignment_objects.RData")
Output: FASTA/gulf_fish_COI_aligned.fasta,
MrBayes/gulf_fish_COI_aligned.nex (for Script 03), and
out/rdata/alignment_objects.RData, the main handoff to all three
tree scripts, containing the aligned matrix in both DNAbin and phyDat formats
plus the shared color palette.
readDNAStringSet() throws "cannot open file: no such file or directory."
Run getwd(). Scripts expect the R/ folder as working
directory. If you're in the project root, either setwd("R") or
update the path to R/FASTA/gulf_fish_COI_combined.fasta. Windows:
use forward slashes or double backslashes in paths.
Script 01: Neighbor-Joining
01_tree_NJ.R builds two NJ trees (rooted on Bull Shark and Flathead
Mullet) with 1,000 bootstrap replicates each. NJ is the fastest method and serves
as both a sanity check and a starting topology for ML.
# Compute K80 pairwise distances
dist_k80 <- dist.dna(aln_dnabin, model = "K80")
# Build unrooted NJ tree, then root on Bull Shark
tree_nj_raw <- nj(dist_k80)
outgrp_shark <- grep("Bull_Shark", tree_nj_raw$tip.label, value = TRUE)
tree_nj_shark <- root(tree_nj_raw, outgroup = outgrp_shark,
resolve.root = TRUE)
# Bootstrap: resample columns 1000 times, rebuild NJ each time
boot_nj_shark <- boot.phylo(
tree_nj_shark, aln_dnabin,
function(x) root(
nj(dist.dna(x, model = "K80")),
outgroup = grep("Bull_Shark", rownames(x), value = TRUE),
resolve.root = TRUE
),
B = 1000, quiet = TRUE
)
When reading the output: check that same-species sequences cluster together (validates labels), that all four Sciaenidae group together (tests family monophyly), and that the outgroup has a long branch while closely related ingroup species have short ones. Bootstrap at the Sciaenidae node should be near 100%; below 70% at internal nodes just means branching order is unresolved, not an error.
boot.phylo() has been running for over 10 minutes on a 24-taxon dataset.
It should finish in 1–3 minutes. Check that quiet = TRUE is set.
Try B = 100 as a test.
Script 02: Maximum Likelihood
02_tree_ML.R runs model selection, fits a GTR+Gamma+I tree starting
from the NJ topology, and runs 1,000 bootstrap replicates. Plan for 10–20 minutes
on a laptop.
modelTest() from phangorn ranks models by AIC. Regardless of the
winner, the script uses GTR+Gamma+I: all six substitution rates free, rate
variation across sites modeled with gamma, invariant sites estimated.
# Start ML search from the NJ topology (faster convergence)
fit_start <- pml(tree_nj_start, data = aln_phydat)
fit_ml <- optim.pml(
fit_start,
model = "GTR",
optGamma = TRUE, # estimate Gamma shape parameter
optInv = TRUE, # estimate proportion invariant
rearrangement = "stochastic",# thorough topology search
control = pml.control(trace = 0)
)
# 1000 bootstrap replicates (~10-20 min)
bs_ml <- bootstrap.pml(fit_ml, bs = 1000, optNni = TRUE,
control = pml.control(trace = 0))
tree_ml_bs <- plotBS(fit_ml$tree, bs_ml, type = "n")
rearrangement = "stochastic" adds random perturbations to avoid
local optima. Starting from the NJ tree speeds up convergence. Bootstrap scale:
≥95% = strong; 70–94% = moderate; 50–69% = weak; <50% = not usually reported.
Weak internal support within a recently diversified family like Sciaenidae is
normal for a single 650 bp gene.
ML tree topology differs substantially from NJ; is one wrong?
Not necessarily. Check bootstrap at the differing nodes. If both are <70%, both trees are uncertain and the difference is noise. If ML is high where NJ disagrees, ML is better supported. Disagreement at family-level is concerning; within-family is expected for a single barcode gene.
Script 03: Bayesian Inference
03_tree_Bayesian.R runs in two steps. First run: appends the MrBayes
block to the NEXUS file and exits with instructions. Second run (after MrBayes
finishes): reads the consensus tree and produces the figure.
What MrBayes is doing
MrBayes uses MCMC to explore tree space, randomly proposing and accepting/rejecting
trees based on posterior probability. After a burn-in period, the surviving sampled
trees form a consensus. This run: 500,000 generations, sampled every 500th (1,000
total), first 250 discarded as burn-in, leaving 750 trees for the consensus.
Key settings: nst=6 (GTR), rates=gamma,
ngen=500000. Increase ngen to 2,000,000+ for larger
datasets or poor convergence.
begin mrbayes;
set autoclose=yes;
outgroup Bull_Shark_1;
lset nst=6 rates=gamma;
mcmc ngen=500000 samplefreq=500 printfreq=1000;
sumt burnin=250;
sump burnin=250;
end;
Running MrBayes
After the first R run, switch to a terminal and run:
cd R/MrBayes
mb gulf_fish_COI_aligned.nex
Takes ~5–10 minutes for this dataset. Output goes to R/MrBayes/,
including .con.tre (consensus tree) and .p (parameter
log). Then run 03_tree_Bayesian.R a second time to produce the figure.
Checking MCMC convergence
Load the .p file into Tracer (beast.community/tracer).
A good run shows a "fuzzy caterpillar" trace with no drift, and ESS > 200 for
all parameters. Low ESS or a drifting trace means the chain hasn't converged;
re-run with larger ngen.
PP = 0.95 means 95% of sampled trees contained that clade, a direct probability statement. PP values run numerically higher than ML bootstrap and aren't directly comparable. Use them as complementary evidence.
MrBayes finished but .con.tre is empty or R can't find it.
Make sure you ran mb from R/MrBayes/ with the exact
NEXUS filename the script wrote. If .con.tre is empty, MrBayes
crashed; scroll back through the terminal. Common causes: MrBayes block wasn't
appended, or the outgroup name doesn't exactly match a tip label (check spaces
vs. underscores).
Tracer shows ESS below 200. Chain hasn't converged. What do I do?
Run longer; try ngen=2,000,000 with burnin scaled proportionally.
Check the trace for drift after burn-in. Still low ESS? Increase
nchains=4. On TAMU HPRC, ngen=5,000,000 runs in
a few hours.
Extending This Analysis
Adding your own species
Download 2–3 COI sequences, clean the headers (e.g., >Spotted_Seatrout_1),
and append them to FASTA/gulf_fish_COI_combined.fasta. Re-run from
Script 00. If the new species is a new family, add it to family_colors
in Script 00 first.
Using different genes
COI is the default, but 16S rRNA (more conserved, better for deeper splits) and cytochrome b (large GenBank database) both work well. Multi-gene analyses are more powerful but require sequences from the same individuals; don't mix accessions across genes from different specimens.
Scaling up with TAMU-HPRC
For 50+ species or Bayesian runs with ngen=5,000,000+, a laptop
isn't practical. TAMU-HPRC is free for students and faculty and can finish the
same MrBayes run in hours. Get started at
hprc.tamu.edu; create an account, complete
the orientation, and submit via SLURM.
I added a species and it clusters somewhere completely unexpected. Is my sequence wrong?
Check in order: (1) verify the FASTA header matches the label you expect;
copy-paste errors are common; (2) BLAST the sequence at
blast.ncbi.nlm.nih.gov; if the top hit is a different species,
the GenBank entry was mislabeled
I want to use 16S or cytb instead of COI. What changes in the scripts?
Just swap the FASTA file in Script 00; update the path in
readDNAStringSet(). Alignment length will differ (16S ~500 bp,
cytb ~1,100 bp) but msa() handles any length. You may need larger
ngen if using shorter sequences.