Data preparation and first steps
Module: Plant Genetic Resources (3502-470)
Packages in R
We will need various R packages throughout the module. To install an R package you have to use the install.packages()
command, e.g. to install the R package vcfR
you use install.packages("vcfR")
. An R package is then loaded with library()
, e.g. library(vcfR)
. If you need help or want to see the documentation of a package just type in ?vcfR
and if you type in library(help="vcfR")
you get a list with all commands from the package.
The data set
We will work with several data sets throughout this module. To demonstrate how to prepare data for an analysis we will use a dataset from maize, which can be found in the data directory. This dataset contains several teosinte (Zea mays parviglumis) individuals, maize landraces and maize improved varieties. It is part of the MaizeHapMap Dataset (Bukowski et al., 2018) and can be downloaded from CyVerse Link. The data spans a 3Mb region on chromosome 1. The data is stored in a file format called vcf
: zea.vcf
. In addition to this file, we prepared three text files containing the sample names of each of the three groups: teosinte.csv
, landraces.csv
and improved_varieties.csv
. Make sure that all four files are in your working directory on your local computer. Otherwise you will get error messages saying that the file is not found or that the connection cannot be opened.
R packages for this computer lab
For this computer lab we will need the R packages:
vcfR
to read the vcf-formatted filepegas
to perform the analyses
Install and load both packages:
# install.packages("vcfR") # run only once
library(vcfR)
# install.packages("pegas") # run only once
library(pegas)
Prepare the data
Before we can start with any analysis, we first need to prepare the data.
We load the vcf and then write the data into an DNAbin
object (to learn about this class of objects, type ?DNAbin
):
<- read.vcfR("zea.vcf")
zea_vcf <- vcfR2DNAbin(zea_vcf,unphased_as_NA = FALSE,extract.haps = TRUE) zea
Have a look at the screen output; it gives you some information about number of variants, number of individuals, etc.
Next, we want to keep only variants that are useful for our analyses, that is we want to exclude any positions with missing data (NA) or variants that have more than two states, i.e. are not biallelic.
Please note that filtering criteria always depend on the analyses you want to perform. For some analyses you may need to keep all variants or even invariant sites. This also means that a thorough understanding of the type of analysis and method is essential to perform good data analyses.
For this, we first write a function, then run this function to get this information for each position and finally exclude all positions that do not match the criteria:
# Function to only retain biallelic SNPs with no missing data
<- function(v){v <- as.character(v)
only_biall_noNA <- all(v %in% c("a","c","t","g"))
includeSNP <- includeSNP & (length(unique(v))==2)
includeSNP return(includeSNP)}
# whichSNPs contains for each SNP whether it is biallelic and has no missing data
<- rep(FALSE, ncol(zea))
whichSNPs for (i in 1:ncol(zea)){
<- only_biall_noNA(zea[,i])
whichSNPs[i]
}
# Only keep such SNPs
<- zea[,whichSNPs] zea_snps
We will now save this DNAbin
object as an R data object, which are recognizable by their ending .RData
to load it directly in the future computer labs.
save(zea_snps,file="zea_snps.RData")
Since we have teosinte, landraces and improved varieties in our data set, we also need to define accession groups.
We can first look at the names of the individual accessions:
<- rownames(zea_snps)
acc_names acc_names
Then we read in our text files which contain the accession names of each group and define accession lists that we can later use to specifically analyse only the accessions from one of the lists:
# all teosinte individuals
<- unname(unlist(read.table("teosinte.csv",as.is = TRUE)))
teo_list
teo_list<- c(paste0(teo_list,"_0"),paste0(teo_list,"_1"))
teo_list2 <- which(acc_names %in% teo_list2)
teosinte_acc
# all landraces
<- unname(unlist(read.table("landraces.csv",as.is = TRUE)))
landraces_list
landraces_list<- c(paste0(landraces_list,"_0"),paste0(landraces_list,"_1"))
landraces_list2 <- which(acc_names %in% landraces_list2)
landraces_acc
# all improved varieties
<- unname(unlist(read.table("improved_varieties.csv",as.is = TRUE)))
varieties_list
varieties_list<- c(paste0(varieties_list,"_0"),paste0(varieties_list,"_1"))
varieties_list2 <- which(acc_names %in% varieties_list2) varieties_acc
We also save the accession lists so that we can directly load them in the future:
save(teosinte_acc,file="teosinte_acc.Rdata")
save(landraces_acc,file="landraces_acc.Rdata")
save(varieties_acc,file="varieties_acc.Rdata")
A first look at the data set
We can visually display the data with the following commands:
image.DNAbin(zea_snps[teosinte_acc,],main="teosinte indviduals")
image.DNAbin(zea_snps[landraces_acc,],main="landraces")
image.DNAbin(zea_snps[varieties_acc,],main="improved varieties")
- Try to understand what you see: What do the colours show? What is on the y-axis and what is on the x-axis?
- Compare the three data sets: Can you see any differences? What could they mean?
To better compare them, it helps to plot them together into one graph. This can be achieved with the par(mfrow=c())
command:
par(mfrow=c(3,1)) # this allows to plot the three graphs together
# and one below the other: 3 rows and 1 column of graphs
image.DNAbin(zea_snps[teosinte_acc,],main="teosinte indviduals")
image.DNAbin(zea_snps[landraces_acc,],main="landraces")
image.DNAbin(zea_snps[varieties_acc,],main="improved varieties")
par(mfrow=c(1,1)) # this "resets" the previous par command
You may need to make the panel that shows the plots bigger in order to see them. Alternatively, you can also save them in a pdf and then open the pdf. This you can do by running pdf("zea-data.pdf")
before the code chunk and dev.off()
after the code chunk.