How DNA Builds a Tree — Beyond the Look-Alike: Gulf Fish Phylogenetics

From Specimen to Sequence

DNA is extracted from a small piece of fish tissue. Universal PCR primers amplify the COI barcode region¹, and Sanger sequencing reads out a ~650 base-pair string of A, C, G, and T. That string gets deposited in GenBank where anyone can download it.

A fish genome is ~800 million bp. How can 650 bp of COI reliably identify species?

COI evolves at a useful rate: typically 2–15% divergence between species but less than 1% within a species. Once two populations stop interbreeding, their COI sequences accumulate differences independently.
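The divergence percentages above are simply the fraction of aligned sites that differ between two sequences. A minimal sketch of that calculation follows. The function name `p_distance` and the 20 bp sequences are made up for illustration and are not real COI data.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only columns where neither sequence has a gap.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    differences = sum(1 for a, b in pairs if a != b)
    return differences / len(pairs)


# Made-up 20 bp fragments, for illustration only.
same_species = p_distance("ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGT")
diff_species = p_distance("ACGTACGTACGTACGTACGT", "ACGTACGAACGTACCTACGT")
print(f"within: {same_species:.0%}, between: {diff_species:.0%}")
```

Here the second pair differs at 2 of 20 sites, so its divergence is 10%, which would fall in the between-species range quoted above.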

Alignment and Matrix Construction

Before sequences can be compared, they need to be aligned. Alignment inserts gap characters so that column n in the matrix represents the same position across all sequences. This matrix is the direct input to every tree-building method. This tutorial uses the MUSCLE⁷ alignment algorithm.
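To make the column idea concrete, here is a small sketch that treats an already-aligned set of sequences as a matrix and labels each column as conserved or variable. The species names, sequences, and the helper `classify_columns` are invented for illustration; a real alignment would come from a tool such as MUSCLE.

```python
# Toy alignment: made-up sequences, already aligned (gaps shown as "-").
alignment = {
    "species_1": "ACGT-ACGTA",
    "species_2": "ACGTTACGTA",
    "species_3": "ACCT-ACGAA",
}


def classify_columns(aln: dict[str, str]) -> list[str]:
    """Label each column 'conserved' or 'variable', ignoring gaps."""
    rows = list(aln.values())
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    labels = []
    for column in zip(*rows):  # zip(*rows) walks the matrix column by column
        bases = {b for b in column if b != "-"}
        labels.append("variable" if len(bases) > 1 else "conserved")
    return labels


labels = classify_columns(alignment)
variable_sites = [i for i, lab in enumerate(labels) if lab == "variable"]
print(variable_sites)  # 0-based column indices
```

In this toy matrix only columns 2 and 8 differ between sequences. The gap in column 4 keeps every later column lined up with the same position in all three rows.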

The Actual Alignment

The panel below shows the COI alignment. Scroll left/right to explore. Conserved columns (same across all species) appear as solid color blocks; variable columns show mixed colors. Variable columns are what the tree-building algorithms use.

Colors represent nucleotides: green=A, blue=C, yellow=G, red=T. Gaps (–) appear as white.

Want to try running an alignment yourself? Download the alignment FASTA used here and open it in Clustal Omega.

A classmate says alignment is arbitrary and could introduce error. How do you respond?

Alignment is mandatory. Every distance and likelihood calculation used to build the trees assumes that column n is the same position in every sequence. Without alignment, you would be comparing unrelated sites and the result would be meaningless.

Inferring the Tree

Three methods were used. Neighbor-Joining (NJ)⁴ converts the alignment into pairwise distances and builds a tree that keeps the total branch length short. It is computationally light, fast, and a good first attempt. Maximum Likelihood (ML)⁵ asks which tree makes the observed data most probable under a DNA evolution model. It is slower but often more accurate. Bayesian inference⁶ uses MCMC sampling to estimate a probability distribution over possible trees, returning posterior probabilities. When all three agree, confidence is high. When they disagree, the disagreement flags a region the data can't resolve with a single gene.

All three methods agree on families but disagree on two internal Sciaenidae nodes (NJ 38%, ML 29%, PP 0.54). What do you conclude?

Low support across all three methods means the data genuinely can't resolve those branches. That is not necessarily a failure. The family grouping is still at 100%. The fine branching order within Sciaenidae just requires more data (more genes or whole genomes) to resolve.

NJ and ML agree but Bayesian places one species differently at PP = 0.62. Which do you report?

Report all three and describe the disagreement. PP = 0.62 means the data support both placements to some degree. Hiding it would misrepresent your certainty. The disagreement can tell you where more data would help.

Reading a Phylogeny

A phylogenetic tree has three parts. Tips are the species being analyzed. Internal nodes are the hypothetical common ancestors where lineages split. Branch lengths represent evolutionary distance (longer branches mean more DNA change). The root gives the tree a direction in time. Without an outgroup to place it, the tree is unrooted and has no ancestor-descendant direction.
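The parts named above map naturally onto a nested data structure. The sketch below is one possible representation, not the format used by any particular phylogenetics package. All labels and branch lengths are invented for illustration.

```python
# A node is (label, branch_length, children). Tips have no children.
# Labels and branch lengths are invented for illustration.
tree = ("root", 0.0, [
    ("outgroup", 0.30, []),
    ("ancestor_1", 0.05, [
        ("tip_A", 0.02, []),
        ("ancestor_2", 0.03, [
            ("tip_B", 0.01, []),
            ("tip_C", 0.04, []),
        ]),
    ]),
])


def root_to_tip(node, distance_so_far=0.0):
    """Return {tip label: summed branch length from the root}."""
    label, length, children = node
    total = distance_so_far + length
    if not children:           # a tip: nothing below it
        return {label: total}
    result = {}
    for child in children:     # an internal node: follow each lineage
        result.update(root_to_tip(child, total))
    return result


distances = root_to_tip(tree)
for tip, dist in distances.items():
    print(tip, round(dist, 2))
```

Summing branch lengths from the root down to each tip gives the amount of change along that lineage. Here `tip_C` has accumulated more change than `tip_B`, even though they share the same most recent ancestor.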

Figure: anatomy of a faux phylogenetic tree. See the Explore page for trees built from real Gulf fish COI sequences.

Bull Shark is the outgroup here. Because it diverged from all bony fish roughly 420 million years ago,¹⁰ the root falls on the branch connecting it to the rest of the tree. A node with 100% bootstrap⁹ support means that clade appeared in all 1,000 bootstrap replicates.

If you rooted the tree on Speckled Trout instead of Bull Shark, what changes and what stays the same?

The topology doesn't change; that's in the data. The direction is what changes. Rooting on Speckled Trout forces the drum family to appear split by the root, even though the relationships are identical.

How Neighbor-Joining works

NJ converts the alignment into a pairwise distance matrix using the K80 model³, which corrects for the fact that transitions (A↔G, C↔T) happen more often than transversions in mitochondrial DNA. It then repeatedly joins the closest pair of taxa (after adjusting for each taxon's average distance to the others) until the tree is complete. Bootstrap support is calculated by resampling alignment columns with replacement 1,000 times and rebuilding the tree each time.
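The K80 correction and the column resampling described above can be sketched in a few lines. This is an illustrative sketch rather than the code behind the trees in this tutorial. The function names `k80_distance` and `bootstrap_columns` and the toy sequences are assumptions made for the example. The distance uses the standard K80 formula d = -½ ln(1 - 2P - Q) - ¼ ln(1 - 2Q), where P and Q are the proportions of transitions and transversions.

```python
import math
import random

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k80_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter (K80) distance between two aligned sequences."""
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


def bootstrap_columns(aln: dict[str, str], rng: random.Random) -> dict[str, str]:
    """One bootstrap replicate: resample alignment columns with replacement."""
    length = len(next(iter(aln.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in aln.items()}


# Made-up sequences: one transition (site 1) and one transversion (site 9).
toy = {"taxon_1": "AAGGCCTTAA", "taxon_2": "AGGGCCTTAT"}
d = k80_distance(toy["taxon_1"], toy["taxon_2"])
replicate = bootstrap_columns(toy, random.Random(1))
print(round(d, 4), replicate)
```

The corrected distance comes out larger than the raw 20% difference because the model allows for substitutions that happened but were overwritten by later ones. A full bootstrap analysis would build a tree from each of the 1,000 replicates and count how often each clade reappears.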

How Maximum Likelihood works

ML asks: given a DNA evolution model, what tree makes the observed data most probable? This tutorial uses GTR+Γ+I: all six substitution rates can differ, rate variation across sites follows a gamma distribution, and some sites can be invariant. Every variable column in the alignment contributes. To summarize, ML is slower than NJ but can be more accurate because it uses the full alignment rather than a single distance summary.
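The core idea of ML can be shown on the smallest possible problem: two sequences and one branch length. The sketch below deliberately swaps in the much simpler Jukes-Cantor (JC69) model instead of GTR+Γ+I, and the site counts are hypothetical. It scores a grid of candidate distances and keeps the one that makes the observed data most probable, then compares it with the closed-form JC69 estimate.

```python
import math


def jc69_log_likelihood(d: float, n_sites: int, n_diff: int) -> float:
    """Log-likelihood of two aligned sequences separated by distance d
    (expected substitutions per site) under the Jukes-Cantor model."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # P(site shows the same base)
    p_diff = 0.25 - 0.25 * decay   # P(site shows one specific other base)
    return (n_sites * math.log(0.25)              # equal base frequencies
            + (n_sites - n_diff) * math.log(p_same)
            + n_diff * math.log(p_diff))


# Hypothetical counts: 650 aligned sites, 65 of which differ.
n_sites, n_diff = 650, 65

# Crude grid search over candidate distances 0.001 .. 1.000.
grid = [i / 1000 for i in range(1, 1001)]
best_d = max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))

# Closed-form JC69 estimate, for comparison.
p = n_diff / n_sites
analytic_d = -0.75 * math.log(1 - 4 * p / 3)
print(best_d, round(analytic_d, 4))
```

Real ML software does the same thing on a far larger scale: it searches over tree shapes, all branch lengths, and model parameters at once, which is why it is slower than NJ.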

Why not always use the simplest model (JC), and why not always use the most complex one?

JC assumes all substitution rates are equal, which is not true for COI. An oversimplified model can bias the result. Too many parameters can overfit the data, producing unstable estimates. Some complex models are also extremely computationally intensive, so it makes sense to fit simpler models first.

How Bayesian Inference works

Bayesian inference⁶ uses the same model as ML but instead of finding the single best tree, it estimates a probability distribution over all possible trees. This is done with Markov chain Monte Carlo (MCMC): a chain runs for millions of steps, accepting or rejecting proposed trees based on posterior probability. After the early samples are discarded (burn-in), the surviving trees are summarized into a consensus tree. The fraction of sampled trees containing a given clade is that clade's posterior probability. PP = 0.95 means 95% of sampled trees contained that grouping. MrBayes was the software package used here.
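The last step, turning sampled trees into posterior probabilities, is just counting. The sketch below assumes each sampled tree has already been reduced to a set of clades, with each clade stored as a frozenset of tip labels. The function name `clade_posterior`, the 25% burn-in fraction, and the eight-sample chain are all illustrative assumptions, not output from MrBayes.

```python
def clade_posterior(sampled_trees, clade, burn_in_fraction=0.25):
    """Fraction of post-burn-in trees that contain `clade`.

    Each sampled tree is a set of clades; each clade is a frozenset
    of tip labels.
    """
    start = int(len(sampled_trees) * burn_in_fraction)
    kept = sampled_trees[start:]   # drop the early, unconverged samples
    if not kept:
        raise ValueError("no trees left after burn-in")
    hits = sum(1 for tree in kept if clade in tree)
    return hits / len(kept)


# Invented chain of 8 samples over tips A, B, C, D.
ab = frozenset({"A", "B"})
ac = frozenset({"A", "C"})
abc = frozenset({"A", "B", "C"})
chain = [
    {ac, abc}, {ac, abc},              # first 25%: discarded as burn-in
    {ab, abc}, {ab, abc}, {ab, abc},
    {ac, abc}, {ab, abc}, {ab, abc},
]

pp_ab = clade_posterior(chain, ab)
pp_abc = clade_posterior(chain, abc)
print(round(pp_ab, 2), round(pp_abc, 2))
```

In this toy chain the clade (A, B) appears in 5 of the 6 retained trees, so its posterior probability is about 0.83, while (A, B, C) appears in every retained tree and gets PP = 1.00.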

A Bayesian run returns PP = 1.00. A critic says this is overconfident. How do you respond?

PP = 1.00 means every sampled tree in the run contained that clade. The data strongly favor it. The critic's concern is valid only if the MCMC didn't converge (chain stuck, not exploring alternatives). To evaluate convergence in MrBayes, look at the effective sample size (ESS) values: a low ESS suggests the chain has not yet produced enough independent samples.

Comparing the three methods

When all three methods agree on a grouping, that agreement across independent mathematical algorithms is strong evidence of evolutionary history. Where the methods disagree, that flags a region where a single barcode gene doesn't have enough signal and more data would help.

A reviewer asks why you ran all three methods instead of just reporting one. What do you say?

Each method uses a different mathematical framework, so agreement rules out method-specific artifacts. When all three agree, confidence is higher than any single method can provide. Disagreements can pinpoint where more data is needed.