Phylogenetic Analysis
Introduction
This document presents a phylogenetic analysis performed in R. The following code demonstrates a simple plot.
R packages for this computer lab
For this computer lab the following R packages are required:
ape
phangorn
Make sure that they are installed and loaded.
# install.packages("ape") # run only once
library(ape)
# install.packages("phangorn")
library(phangorn)
The data set
We continue to use our data set with teosinte individuals, maize landraces and maize improved varieties. First, load the data set that you saved as an Rdata
object previously.
load("data/zea_snps.RData")
Histogram of heterozygous positions
We first check how many heterozygous positions we observe in our data. This is important to decide upon which method to use to calculate the distance matrix for the phylogenetic reconstructions, i.e. whether we can ignore the heterozygous positions or not.
# Create a sequence of indices representing individuals.
# Each individual is represented by a pair of consecutive rows.
# For example, row 1-2, row 3-4, etc. 'indiv' holds the index of the second row of each pair.
<- seq(2, nrow(zea_snps), 2)
indiv
# Calculate the number of nucleotide differences (heterozygous positions) within each individual
# and plot a histogram of these differences.
# For each individual (every pair of rows in zea_snps), the function dist.dna calculates the
# DNA distance (using the "n" method to simply count the number of differences) between the two rows.
hist(
sapply(indiv, function(i) {
# Calculate the DNA distance for the pair corresponding to one individual
dist.dna(zea_snps[(i-1):i,], "n")
}),# Set the main title of the histogram with the total number of SNP positions available
main = paste("Number of differences within individuals of", ncol(zea_snps), "SNP positions"),
xlab = "Number of differences", # Label for the x-axis
ylab = "Number of individuals", # Label for the y-axis
breaks = 100 # Specifies the number of bins in the histogram
)
# Loop through each individual to print the number of observed heterozygous positions.
for (i in indiv) {
# Print the identifier for the individual (using the rowname of the second row)
# along with the count of heterozygous positions computed as the DNA distance between the pair.
cat(
rownames(zea_snps)[i], " observed heterozygous positions ",
dist.dna(zea_snps[(i-1):i,], "n"), "\n"
) }
Building a phylogeny with two different methods: UPGMA and Neighbour Joining
We first need to calculate a distance matrix. Since we have diploids and observe a reasonable number of heterozygous positions within our data set, we use the allele sharing distance:
# Define a function to count the occurrences of the first unique allele
# in a specified column of a DNAbin object for each individual.
<- function(col1, dnab1) {
f_allele_counts_col # Create a sequence of indices representing individuals.
# Each individual is represented by a pair of consecutive rows.
# 'indiv_ind' holds the index of the second row for each individual.
<- seq(2, nrow(dnab1), 2)
indiv_ind
# Get the unique bases (alleles) from the specified column.
# 'bases1[1]' is used as a reference allele to count its occurrences.
<- unique(dnab1[, col1])
bases1
# Define a helper function to count the occurrences of the reference allele
# for a given individual's rows (i-1 and i).
<- function(i) {
f_count1 sum(dnab1[(i-1):i, col1] == bases1[1])
}
# Apply the counting function over each individual's second row index and return the counts.
sapply(indiv_ind, f_count1)
}
# Apply the f_allele_counts_col function to each column of the DNAbin object 'zea_snps'
# to obtain a matrix where each row corresponds to an individual and each column to an allele count.
<- sapply(1:ncol(zea_snps), f_allele_counts_col, zea_snps)
matrix_allele_counts
# Calculate the Manhattan distance between individuals based on their allele counts.
# The distance is normalized by the number of columns in the matrix.
<- dist(matrix_allele_counts, method = "manhattan") / ncol(matrix_allele_counts) dist1
First, we build the phylogeny with the UPGMA method:
# Build an UPGMA tree using the previously computed Manhattan distance matrix
<- upgma(dist1)
tree_zea_upgma
# Assign tip labels to the tree:
# "LR" for landraces, "TEO" for teosinte, and "VAR" for varieties.
$tip.label <- c(paste0("LR", 1:23), # First 23 labels: LR1 to LR23
tree_zea_upgmapaste0("TEO", 1:19), # Next 19 labels: TEO1 to TEO19
paste0("LR", 24:25), # Next 2 labels: LR24 to LR25
paste0("VAR", 1:30)) # Last 30 labels: VAR1 to VAR30
# Create a vector of colors corresponding to the tip labels:
# "red" for landraces, "blue" for teosinte, and "orange" for varieties.
<- c(rep("red", 23), # First 23 individuals (landraces) in red
col1 rep("blue", 19), # Next 19 individuals (teosinte) in blue
rep("red", 2), # Following 2 individuals (landraces) in red
rep("orange", 30))# Last 30 individuals (varieties) in orange
And then with the Neighbour Joining method:
<- nj(dist1)
tree_zea_nj $tip.label <- c(paste0("LR", 1:23), # First 23 labels: LR1 to LR23
tree_zea_njpaste0("TEO", 1:19), # Next 19 labels: TEO1 to TEO19
paste0("LR", 24:25), # Next 2 labels: LR24 to LR25
paste0("VAR", 1:30)) # Last 30 labels: VAR1 to VAR30
<- c(rep("red", 23),
col1 rep("blue", 19),
rep("red", 2),
rep("orange", 30))
We plot the two phylogenies:
# Set graphics layout to display 1 row and 2 columns of plots
par(mfrow = c(1, 2))
# Plot the UPGMA tree in an unrooted style with specified tip colors, point size, and rotation
plot.phylo(tree_zea_upgma, type = "unrooted", tip.color = col1, cex = 0.7, rotate.tree = 90,
main = "UPGMA tree")
# Add a scale bar to the UPGMA tree plot
add.scale.bar()
# Plot the NJ tree in an unrooted style with specified tip colors and point size
plot.phylo(tree_zea_nj, type = "unrooted", tip.color = col1, cex = 0.7, rotate.tree = 90, main = "NJ tree")
# Add a scale bar to the NJ tree plot
add.scale.bar()
# Reset the plotting layout to default (single plot)
par(mfrow = c(1, 1))
- Compare the two phylogenies. Are they identical? What are the main similarities/differences?
- What can you recognize about the placement of the teosintes? Do you have possible explanations for this? (Hint: Only a small proportion of the genome is used for constructing the tree)
We can also use the function comparePhylo
to compare the topology of two phylogenies:
comparePhylo(tree_zea_nj, unroot(tree_zea_upgma), plot=TRUE, use.edge.length=FALSE)
- Does the result agree with you thoughts from before?
- How similar/different are the phylogenies?
The Newick format
Phylogenetic trees are usually saved in the so called Newick format, which represents trees with parentheses and commas.1 It can further also include edge lengths. The following two commands save our Neighbour Joining tree in Newick format and read the tree back in to display the Newick format:
1 The format was conceived in a restuarant in Newick, New Hampshire, USA. See Wikipedia
write.tree(tree_zea_nj,file = "tree_zea_nj.txt")
readLines("tree_zea_nj.txt")
Bootstrapping
Next we want to perform bootstrapping to our Neighbour Joining tree to estimate the confidence level of each node. Due to time constraints we will only do five bootstrap replicates.
The following code may run for a few minutes
# this function calculates the distance matrix for a bootstrap replicate
<- function(dnab2){
f_nj_allele_shar <- sapply(1:ncol(dnab2), f_allele_counts_col, dnab2)
matrix_allele_counts2 rownames(matrix_allele_counts2) <- c(paste0("LR", 1:23), paste0("TEO", 1:19), paste0("LR", 24:25),
paste0("VAR", 1:30))
<- dist(matrix_allele_counts2,method = "manhattan") / ncol(matrix_allele_counts2)
dist2 #tree2 <- nj(dist2)
return(nj(dist2))
}
# bootstrapping
<- boot.phylo(tree_zea_nj,zea_snps,f_nj_allele_shar,B=5,trees=TRUE,jumble=FALSE) boot_zea_nj
We plot again the Neighbour Joining tree with the bootstrap values:
plot.phylo(tree_zea_nj,type="unrooted",tip.color=col1,cex=0.7,rotate.tree=90,main="NJ tree")
add.scale.bar()
nodelabels(boot_zea_nj$BP, frame = "n")
The following code is not working yet fully and needs to be optimized.
This tree is difficult to look at, therefore we use a more modern approach to zoom into the tree:
# Install the Bioconductor package manager and then ggtree from the Bioconductor repository
# The installation may take some time
# install.packages("Biocmanager")
# BiocManager::install("ggtree")
library(ggtree) # For plotting phylogenetic trees using ggplot2
# install.packages("plotly")
library(plotly) # For making ggplot objects interactive
# Suppose 'tree_object' is your phylogenetic tree, e.g., generated by nj() or upgma()
<- ggtree(tree_zea_nj) + geom_tiplab() # Create tree plot with tip labels
p
# Convert the ggplot object to an interactive Plotly object
<- ggplotly(p)
p_interactive
# Display the interactive tree
p_interactive
How confident can we be about most of the nodes?
We can also plot the five bootstrap replicate trees in comparison to the Neighbour Joining tree:
par(mfrow = c(3, 2)) # Set plotting area to 3 rows and 2 columns
plot.phylo(tree_zea_nj, # Plot the main Neighbor-Joining tree (NJ tree)
type = "unrooted", # Display as an unrooted tree
tip.color = col1, # Color tips based on col1 vector
cex = 0.7, # Set character expansion for tip labels
rotate.tree = 90, # Rotate the tree by 90 degrees
main = "NJ tree") # Main title for the plot
add.scale.bar() # Add a scale bar to the tree plot
nodelabels(boot_zea_nj$BP, # Add node labels showing bootstrap support values
frame = "n") # No frame around the labels
for (i in 1:length(boot_zea_nj$trees)) { # Loop through each bootstrap tree
plot.phylo(boot_zea_nj$trees[[i]], # Plot the ith bootstrap tree
type = "unrooted", # Display as an unrooted tree
tip.color = col1, # Use the same tip colors as before
cex = 0.7, # Set character expansion for tip labels
rotate.tree = 90, # Rotate the tree by 90 degrees
main = paste("bootstrap tree", i)) # Set title with bootstrap tree number
}
par(mfrow = c(1, 1)) # Reset plotting area to default (1 plot per page)
Which differences and commonalities do you observe between the NJ tree of the complete sequence data and the individual bootstrap trees?