Phylogenetic Analysis

Author

Karl Schmid

Published

23 May 2025

Introduction

This document presents a phylogenetic analysis performed in R. The following code demonstrates a simple plot.

R packages for this computer lab

For this computer lab the following R packages are required:

ape
phangorn

Make sure that they are installed and loaded.

# install.packages("ape") # run only once
library(ape)
# install.packages("phangorn")
library(phangorn)

The data set

We continue to use our data set with teosinte individuals, maize landraces and maize improved varieties. First, load the data set that you saved as an Rdata object previously.

load("data/zea_snps.RData")

Histogram of heterozygous positions

We first check how many heterozygous positions we observe in our data. This is important to decide upon which method to use to calculate the distance matrix for the phylogenetic reconstructions, i.e. whether we can ignore the heterozygous positions or not.

# Create a sequence of indices representing individuals.
# Each individual is represented by a pair of consecutive rows.
# For example, row 1-2, row 3-4, etc. 'indiv' holds the index of the second row of each pair.
indiv <- seq(2, nrow(zea_snps), 2)

# Calculate the number of nucleotide differences (heterozygous positions) within each individual
# and plot a histogram of these differences.
# For each individual (every pair of rows in zea_snps), the function dist.dna calculates the 
# DNA distance (using the "n" method to simply count the number of differences) between the two rows.
hist(
  sapply(indiv, function(i) {
    # Calculate the DNA distance for the pair corresponding to one individual
    dist.dna(zea_snps[(i-1):i,], "n")
  }),
  # Set the main title of the histogram with the total number of SNP positions available
  main = paste("Number of differences within individuals of", ncol(zea_snps), "SNP positions"),
  xlab = "Number of differences",            # Label for the x-axis
  ylab = "Number of individuals",            # Label for the y-axis
  breaks = 100                               # Specifies the number of bins in the histogram
)

# Loop through each individual to print the number of observed heterozygous positions.
for (i in indiv) {
  # Print the identifier for the individual (using the rowname of the second row)
  # along with the count of heterozygous positions computed as the DNA distance between the pair.
  cat(
    rownames(zea_snps)[i], " observed heterozygous positions ",
    dist.dna(zea_snps[(i-1):i,], "n"), "\n"
  )
}

Building a phylogeny with two different methods: UPGMA and Neighbour Joining

We first need to calculate a distance matrix. Since we have diploids and observe a reasonable number of heterozygous positions within our data set, we use the allele sharing distance:

# Define a function to count the occurrences of the first unique allele 
# in a specified column of a DNAbin object for each individual.
f_allele_counts_col <- function(col1, dnab1) {
  # Create a sequence of indices representing individuals.
  # Each individual is represented by a pair of consecutive rows.
  # 'indiv_ind' holds the index of the second row for each individual.
  indiv_ind <- seq(2, nrow(dnab1), 2)
  
  # Get the unique bases (alleles) from the specified column.
  # 'bases1[1]' is used as a reference allele to count its occurrences.
  bases1 <- unique(dnab1[, col1])
  
  # Define a helper function to count the occurrences of the reference allele
  # for a given individual's rows (i-1 and i).
  f_count1 <- function(i) {
    sum(dnab1[(i-1):i, col1] == bases1[1])
  }
  
  # Apply the counting function over each individual's second row index and return the counts.
  sapply(indiv_ind, f_count1)
}

# Apply the f_allele_counts_col function to each column of the DNAbin object 'zea_snps'
# to obtain a matrix where each row corresponds to an individual and each column to an allele count.
matrix_allele_counts <- sapply(1:ncol(zea_snps), f_allele_counts_col, zea_snps)

# Calculate the Manhattan distance between individuals based on their allele counts.
# The distance is normalized by the number of columns in the matrix.
dist1 <- dist(matrix_allele_counts, method = "manhattan") / ncol(matrix_allele_counts)

First, we build the phylogeny with the UPGMA method:

# Build an UPGMA tree using the previously computed Manhattan distance matrix
tree_zea_upgma <- upgma(dist1) 

# Assign tip labels to the tree:
# "LR" for landraces, "TEO" for teosinte, and "VAR" for varieties.
tree_zea_upgma$tip.label <- c(paste0("LR", 1:23),       # First 23 labels: LR1 to LR23
                              paste0("TEO", 1:19),      # Next 19 labels: TEO1 to TEO19
                              paste0("LR", 24:25),      # Next 2 labels: LR24 to LR25
                              paste0("VAR", 1:30))      # Last 30 labels: VAR1 to VAR30

# Create a vector of colors corresponding to the tip labels:
# "red" for landraces, "blue" for teosinte, and "orange" for varieties.
col1 <- c(rep("red", 23),   # First 23 individuals (landraces) in red
          rep("blue", 19),  # Next 19 individuals (teosinte) in blue
          rep("red", 2),    # Following 2 individuals (landraces) in red
          rep("orange", 30))# Last 30 individuals (varieties) in orange

And then with the Neighbour Joining method:

tree_zea_nj <- nj(dist1) 
tree_zea_nj$tip.label <- c(paste0("LR", 1:23),   # First 23 labels: LR1 to LR23
                           paste0("TEO", 1:19),  # Next 19 labels: TEO1 to TEO19
                           paste0("LR", 24:25),  # Next 2 labels: LR24 to LR25
                           paste0("VAR", 1:30))  # Last 30 labels: VAR1 to VAR30
col1 <- c(rep("red", 23),   
          rep("blue", 19),  
          rep("red", 2),    
          rep("orange", 30))

We plot the two phylogenies:

# Set graphics layout to display 1 row and 2 columns of plots
par(mfrow = c(1, 2))

# Plot the UPGMA tree in an unrooted style with specified tip colors, point size, and rotation
plot.phylo(tree_zea_upgma, type = "unrooted", tip.color = col1, cex = 0.7, rotate.tree = 90,
           main = "UPGMA tree")
# Add a scale bar to the UPGMA tree plot
add.scale.bar()

# Plot the NJ tree in an unrooted style with specified tip colors and point size
plot.phylo(tree_zea_nj, type = "unrooted", tip.color = col1, cex = 0.7, rotate.tree = 90, main = "NJ tree")
# Add a scale bar to the NJ tree plot
add.scale.bar()

# Reset the plotting layout to default (single plot)
par(mfrow = c(1, 1))

Exercise

Compare the two phylogenies. Are they identical? What are the main similarities/differences?
What can you recognize about the placement of the teosintes? Do you have possible explanations for this? (Hint: Only a small proportion of the genome is used for constructing the tree)

We can also use the function comparePhylo to compare the topology of two phylogenies:

comparePhylo(tree_zea_nj, unroot(tree_zea_upgma), plot=TRUE, use.edge.length=FALSE)

Exercise

Does the result agree with you thoughts from before?
How similar/different are the phylogenies?

The Newick format

Phylogenetic trees are usually saved in the so called Newick format, which represents trees with parentheses and commas.¹ It can further also include edge lengths. The following two commands save our Neighbour Joining tree in Newick format and read the tree back in to display the Newick format:

¹ The format was conceived in a restuarant in Newick, New Hampshire, USA. See Wikipedia

write.tree(tree_zea_nj,file = "tree_zea_nj.txt")
readLines("tree_zea_nj.txt")

Bootstrapping

Next we want to perform bootstrapping to our Neighbour Joining tree to estimate the confidence level of each node. Due to time constraints we will only do five bootstrap replicates.

Warning

The following code may run for a few minutes

# this function calculates the distance matrix for a bootstrap replicate
f_nj_allele_shar <- function(dnab2){
  matrix_allele_counts2 <- sapply(1:ncol(dnab2), f_allele_counts_col, dnab2)
  rownames(matrix_allele_counts2) <- c(paste0("LR", 1:23), paste0("TEO", 1:19), paste0("LR", 24:25),
                                       paste0("VAR", 1:30))
  dist2 <- dist(matrix_allele_counts2,method = "manhattan") / ncol(matrix_allele_counts2)
  #tree2 <- nj(dist2)
  return(nj(dist2))
  }

# bootstrapping
boot_zea_nj <- boot.phylo(tree_zea_nj,zea_snps,f_nj_allele_shar,B=5,trees=TRUE,jumble=FALSE)

We plot again the Neighbour Joining tree with the bootstrap values:

plot.phylo(tree_zea_nj,type="unrooted",tip.color=col1,cex=0.7,rotate.tree=90,main="NJ tree")
add.scale.bar()
nodelabels(boot_zea_nj$BP, frame = "n")

Warning

The following code is not working yet fully and needs to be optimized.

This tree is difficult to look at, therefore we use a more modern approach to zoom into the tree:

# Install the Bioconductor package manager and then ggtree from the Bioconductor repository
# The installation may take some time
# install.packages("Biocmanager")
# BiocManager::install("ggtree")
library(ggtree)   # For plotting phylogenetic trees using ggplot2
# install.packages("plotly")
library(plotly)   # For making ggplot objects interactive

# Suppose 'tree_object' is your phylogenetic tree, e.g., generated by nj() or upgma()
p <- ggtree(tree_zea_nj) + geom_tiplab()   # Create tree plot with tip labels

# Convert the ggplot object to an interactive Plotly object
p_interactive <- ggplotly(p)

# Display the interactive tree
p_interactive

Exercise

How confident can we be about most of the nodes?

We can also plot the five bootstrap replicate trees in comparison to the Neighbour Joining tree:

par(mfrow = c(3, 2))  # Set plotting area to 3 rows and 2 columns

plot.phylo(tree_zea_nj,          # Plot the main Neighbor-Joining tree (NJ tree)
           type = "unrooted",    # Display as an unrooted tree
           tip.color = col1,     # Color tips based on col1 vector
           cex = 0.7,            # Set character expansion for tip labels
           rotate.tree = 90,     # Rotate the tree by 90 degrees
           main = "NJ tree")     # Main title for the plot

add.scale.bar()              # Add a scale bar to the tree plot

nodelabels(boot_zea_nj$BP,   # Add node labels showing bootstrap support values
           frame = "n")      # No frame around the labels

for (i in 1:length(boot_zea_nj$trees)) {              # Loop through each bootstrap tree
  plot.phylo(boot_zea_nj$trees[[i]],                  # Plot the ith bootstrap tree
             type = "unrooted",                       # Display as an unrooted tree
             tip.color = col1,                        # Use the same tip colors as before
             cex = 0.7,                               # Set character expansion for tip labels
             rotate.tree = 90,                        # Rotate the tree by 90 degrees
             main = paste("bootstrap tree", i))       # Set title with bootstrap tree number
}

par(mfrow = c(1, 1))  # Reset plotting area to default (1 plot per page)

Exercise

Which differences and commonalities do you observe between the NJ tree of the complete sequence data and the individual bootstrap trees?