Analysis of genetic diversity
Module: Plant Genetic Resources (3502-470)
R packages for this computer lab
For this computer lab we will need the R package:
pegas
to perform the analyses
Install and load the package1. It also installs the package ape
, which is a dependency and provides several functions for the population genetic analysis of the data2.
# install.packages("pegas") # run this only once on your computer
library(pegas)
The data set
The data set for this exercise consists of 148 genotyped individuals that were prepared in the previous exercise. As a reminder the data span a region of 3 Megabases on chromosome 1 of the maize genome. The individuals belong to teosinte (the wild ancestor of cultivated maize), maize landraces and modern maize varieites. First, load the data set that you saved as an Rdata
objects in the first computer lab:
load("data/zea_snps.RData")
load("data/teosinte_acc.Rdata")
load("data/landraces_acc.Rdata")
load("data/varieties_acc.Rdata")
Genetic diversity
We will look at the genetic diversity within each of the three data sets.
The function seg.sites
returns the indices (positions) segrating sites in a data frame, which can then be counted using the length()
function.
First we calculate the number of variants, i.e. the number of segregating sites \(S\). For a nice presentation of the output, a data frame is constructed and it is converted to a nice-looking table using the knitr::kable()
function
# Calculate the number of segregating sites for each pool
<- length(seg.sites(zea_snps[teosinte_acc,]))
teosinte_seg <- length(seg.sites(zea_snps[landraces_acc,]))
landraces_seg <- length(seg.sites(zea_snps[varieties_acc,]))
varieties_seg
# Create a data frame with the results
<- data.frame(
result_table Pool = c("Teosinte", "Landrace", "Variety"),
`Segregating sites` = c(teosinte_seg, landraces_seg, varieties_seg)
)
# Display the table nicely using knitr::kable()
::kable(result_table, caption = "Segregating Sites by Pool") knitr
Next, we calculate the nucleotide diversity \(\pi\):
<- ncol(zea_snps) # get the number of SNPs in the 3 MB long region
n_snps <- (nuc.div(zea_snps[teosinte_acc,])*n_snps)/3000000
pi_teosinte
pi_teosinte<- (nuc.div(zea_snps[landraces_acc,])*n_snps)/3000000
pi_landraces
pi_landraces<- (nuc.div(zea_snps[varieties_acc,])*n_snps)/3000000
pi_varieties pi_varieties
The reason why we first multiply with the reported number of variants in this data set and then divide by the sequence length is that we want to have a per site estimate and the original vcf input data file did only contain the variant positions.
It is easy to compare the effect of correcting for the length of the sequenced region.
Compare the output of the command nuc.div(zea_snps[teosinte_acc,])
with the output of `pi_teosinte <- (nuc.div(zea_snps[teosinte_acc,])*n_snps)/3000000``
Why is the output of the first command higher than of the second command?
Please review the formula to calculate nucleotide diversity that you learned about in the lecture.
Modify the above code to include the value of pi as another column in the table.
- Which data set has the highest genetic diversity? Which one the lowest?
- Can you come up with an explanation of the differences in genetic diversity?
Sliding window analysis
The above calculation gives an average nucleotide diversity for the whole 3 megabases long region. There is, however, the expectation that there is some variation in genetic diversity within this region. We therefore use a sliding window approach to calculate nucleotide diversity in windows of 10000 bp that slide along the chromosomes in steps of 100 bp.
Before we can start with the sliding window analysis, we have to get the position of each polymorphism. Unfortunately, the DNAbin
object does not store the chromosome and position on the chromosome in the object, but it is encoded in the column name of the object because the vcfR
package reads the chromosome number and position from the vcf file and then encodes it. Therefore, S1_265004016
reflects the position 1_265004016 on chromosome 1.
We have extract from the column names of the DNAbin
object and then extract the position and chromosome.
First, load the stringr
library for string manipulation.
# install.packages("stringr") # run only once
library(stringr)
To be able to analyse different populations, we define a large function analyse_sliding_window
. The function first creates windows and then iterates over them. Instead of using a for
loop for the iteration, which is slow in R, we use the sapply
function to serially apply a function to the data.
<- 10000 # Window size: 10,000 base pairs
window_size <- 10000 # Step size: 100 base pairs step_size
<- function(snp_data, window_size, step_size) {
analyse_sliding_windows # Get the polymorphism names from the DNAbin object
<- colnames(snp_data)
polymorphisms
# Use regex to extract the chromosome and position components from each name.
# The regex "^S(\\d+)_(\\d+)$" captures two groups: chromosome (group 1) and position (group 2)
<- str_match(polymorphisms, "^S(\\d+)_(\\d+)$")
matches
# Convert the captured values to numeric
<- as.numeric(matches[, 2])
chromosome <- as.numeric(matches[, 3])
position
# Get starting and end positions on the chromosome.
# Here we assume the positions are sorted in ascending order.
<- position[1]
startpos <- tail(position, 1)
endpos
# Generate the starting positions for the sliding windows
<- seq(startpos, endpos - window_size + 1, by = step_size)
starts
# Calculate the midpoint for each window (used as the position indicator)
<- starts + floor(window_size / 2)
midpoints
# Calculate nucleotide diversity per nucleotide for each window.
# Here we divide the nucleotide diversity (calculated by nuc.div on the window) by the window size.
<- sapply(starts, function(start) {
nuc_diversity # Define the end position of the window
<- start + window_size - 1
end
# Subset polymorphisms based on chromosome 1 and the current window
<- polymorphisms[chromosome == 1 & position >= start & position <= end]
selected_polymorphisms
# Extract the corresponding columns from the DNAbin object
<- snp_data[, selected_polymorphisms, drop = FALSE]
selected_snps
# Calculate nucleotide diversity over the window (using nuc.div from the pegas package)
<- nuc.div(selected_snps)
diversity
# Return the nucleotide diversity per nucleotide
/ window_size
diversity
})
# Combine midpoints and nucleotide diversity values into a data frame
<- data.frame(
results_df Midpoint = midpoints,
Nucleotide_Diversity = nuc_diversity
)
return(results_df)
}
Run the function for the complete dataset and then plot the diversity along the chromosome.
<- analyse_sliding_windows(zea_snps, window_size, step_size)
zea_snps.div <- analyse_sliding_windows(zea_snps[teosinte_acc,], window_size, step_size)
teosinte_snps.div <- analyse_sliding_windows(zea_snps[landraces_acc,], window_size, step_size) landraces_snps.div
Now plot the level of diversity along the chromosome. We use ggplot2
for the plotting, and for this purpose, we create a dataframe with the diversity values. Since the midpoints are the same for all, we can easily combine them into a new dataframe for the plotting
<- data.frame(
df_plot Midpoint = zea_snps.div$Midpoint,
Zea_Div = zea_snps.div$Nucleotide_Diversity,
Teosinte_Div = teosinte_snps.div$Nucleotide_Diversity,
Landraces_Div = landraces_snps.div$Nucleotide_Diversity
)
Then plot the dataframe
library(ggplot2)
ggplot(df_plot, aes(x = Midpoint)) +
#geom_line(aes(y = Zea_Div, color = "Zea Diversity"), size = 1) +
geom_line(aes(y = Teosinte_Div, color = "Teosinte Diversity"), size = 1) +
geom_line(aes(y = Landraces_Div, color = "Landraces Diversity"), size = 1) +
labs(x = "Midpoint", y = "Nucleotide Diversity", color = "Group") +
theme_minimal()
- Add the code for the
varieties
to plot the variation along the chromosome. - Which general patterns do you recognize with respect to the overall levels of diversity and the variation in genetic diversity along the chromosome in the different pools of the sample?
- Which explanations can you come up with to explain the high variation of diversity levels along the chromosome?
- Can you think of any biases with respect to the total levels of diversity and the levels of diversity in individual genomic regions? (Hint: Think of the effect of the filtering during data preparation or in later stages of the analysis)
Identification of outlier windows
The comparison of diversity levels of individual windows between pools may provide further insights.
We can use a scatterplot and a linear model to identify windows, whose level of diversity differs strongly between pools.
# Calculate the maximum value for the limits (ignoring NA values)
<- max(c(df_plot$Teosinte_Div, df_plot$Landraces_Div), na.rm = TRUE)
max_val
ggplot(df_plot, aes(x = Teosinte_Div, y = Landraces_Div)) +
geom_point() +
# Add the x = y line (dashed line)
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
# Fit and add the linear model line without confidence interval (blue line)
geom_smooth(method = "lm", se = TRUE, color = "blue") +
# Set x and y limits the same based on the maximum value
scale_x_continuous(limits = c(0, max_val)) +
scale_y_continuous(limits = c(0, max_val)) +
labs(
x = "Teosinte Nucleotide Diversity",
y = "Landraces Nucleotide Diversity",
title = "Scatterplot of Nucleotide Diversities: Teosinte vs Landraces",
subtitle = "Dashed line: x = y | Blue line: Linear Model Fit"
+
) theme_minimal()
- Make a similar plot for the comparison of Landraces versus Varieties diversity.
- Interprete the resulting plots. What to the lines indicate?
- Which approach would you take to (statistically) identify and further characterize outliner windows?
- Which process could explain the presence of the outlier windows?