Data preparation and first steps

Module: Plant Genetic Resources (3502-470)

Author

Karl Schmid, University of Hohenheim

Published

23 May 2025

Packages in R

We will need various R packages throughout the module. To install an R package you have to use the install.packages() command, e.g. to install the R package vcfR you use install.packages("vcfR"). An R package is then loaded with library(), e.g. library(vcfR). If you need help or want to see the documentation of a package just type in ?vcfR and if you type in library(help="vcfR") you get a list with all commands from the package.

The data set

We will work with several data sets throughout this module. To demonstrate how to prepare data for an analysis we will use a dataset from maize, which can be found in the data directory. This dataset contains several teosinte (Zea mays parviglumis) individuals, maize landraces and maize improved varieties. It is part of the MaizeHapMap Dataset (Bukowski et al., 2018) and can be downloaded from CyVerse Link. The data spans a 3Mb region on chromosome 1. The data is stored in a file format called vcf: zea.vcf. In addition to this file, we prepared three text files containing the sample names of each of the three groups: teosinte.csv, landraces.csv and improved_varieties.csv. Make sure that all four files are in your working directory on your local computer. Otherwise you will get error messages saying that the file is not found or that the connection cannot be opened.

R packages for this computer lab

For this computer lab we will need the R packages:

vcfR to read the vcf-formatted file
pegas to perform the analyses

Install and load both packages:

# install.packages("vcfR") # run only once
library(vcfR)
# install.packages("pegas") # run only once
library(pegas)

Prepare the data

Before we can start with any analysis, we first need to prepare the data.

We load the vcf and then write the data into an DNAbin object (to learn about this class of objects, type ?DNAbin):

zea_vcf <- read.vcfR("zea.vcf")
zea <- vcfR2DNAbin(zea_vcf,unphased_as_NA = FALSE,extract.haps = TRUE)

Have a look at the screen output; it gives you some information about number of variants, number of individuals, etc.

Next, we want to keep only variants that are useful for our analyses, that is we want to exclude any positions with missing data (NA) or variants that have more than two states, i.e. are not biallelic.

Tip

Please note that filtering criteria always depend on the analyses you want to perform. For some analyses you may need to keep all variants or even invariant sites. This also means that a thorough understanding of the type of analysis and method is essential to perform good data analyses.

For this, we first write a function, then run this function to get this information for each position and finally exclude all positions that do not match the criteria:

# Function to only retain biallelic SNPs with no missing data
only_biall_noNA <- function(v){v <- as.character(v)
                            includeSNP <- all(v %in% c("a","c","t","g"))
                            includeSNP <- includeSNP & (length(unique(v))==2)
                            return(includeSNP)}

# whichSNPs contains for each SNP whether it is biallelic and has no missing data
whichSNPs <- rep(FALSE, ncol(zea))
for (i in 1:ncol(zea)){
  whichSNPs[i] <- only_biall_noNA(zea[,i])  
}

# Only keep such SNPs
zea_snps <- zea[,whichSNPs]

We will now save this DNAbin object as an R data object, which are recognizable by their ending .RDatato load it directly in the future computer labs.

save(zea_snps,file="zea_snps.RData")

Since we have teosinte, landraces and improved varieties in our data set, we also need to define accession groups.

We can first look at the names of the individual accessions:

acc_names <- rownames(zea_snps)
acc_names

Then we read in our text files which contain the accession names of each group and define accession lists that we can later use to specifically analyse only the accessions from one of the lists:

# all teosinte individuals
teo_list <- unname(unlist(read.table("teosinte.csv",as.is = TRUE)))
teo_list
teo_list2 <- c(paste0(teo_list,"_0"),paste0(teo_list,"_1"))
teosinte_acc <- which(acc_names %in% teo_list2)

# all landraces
landraces_list <- unname(unlist(read.table("landraces.csv",as.is = TRUE)))
landraces_list
landraces_list2 <- c(paste0(landraces_list,"_0"),paste0(landraces_list,"_1"))
landraces_acc <- which(acc_names %in% landraces_list2)

# all improved varieties
varieties_list <- unname(unlist(read.table("improved_varieties.csv",as.is = TRUE)))
varieties_list
varieties_list2 <- c(paste0(varieties_list,"_0"),paste0(varieties_list,"_1"))
varieties_acc <- which(acc_names %in% varieties_list2)

We also save the accession lists so that we can directly load them in the future:

save(teosinte_acc,file="teosinte_acc.Rdata")
save(landraces_acc,file="landraces_acc.Rdata")
save(varieties_acc,file="varieties_acc.Rdata")

A first look at the data set

We can visually display the data with the following commands:

image.DNAbin(zea_snps[teosinte_acc,],main="teosinte indviduals")
image.DNAbin(zea_snps[landraces_acc,],main="landraces")
image.DNAbin(zea_snps[varieties_acc,],main="improved varieties")

Exercise

Try to understand what you see: What do the colours show? What is on the y-axis and what is on the x-axis?
Compare the three data sets: Can you see any differences? What could they mean?

To better compare them, it helps to plot them together into one graph. This can be achieved with the par(mfrow=c()) command:

par(mfrow=c(3,1)) # this allows to plot the three graphs together
                  # and one below the other: 3 rows and 1 column of graphs
image.DNAbin(zea_snps[teosinte_acc,],main="teosinte indviduals")
image.DNAbin(zea_snps[landraces_acc,],main="landraces")
image.DNAbin(zea_snps[varieties_acc,],main="improved varieties")
par(mfrow=c(1,1)) # this "resets" the previous par command

Tip

You may need to make the panel that shows the plots bigger in order to see them. Alternatively, you can also save them in a pdf and then open the pdf. This you can do by running pdf("zea-data.pdf") before the code chunk and dev.off() after the code chunk.

References

Bukowski R, Guo X, Lu Y, Zou C, He B, Rong Z, Wang B, Xu D, Yang B, Xie C, Fan L, Gao S, Xu X, Zhang G, Li Y, Jiao Y, Doebley JF, Ross-Ibarra J, Lorant A, Buffalo V, Romay MC, Buckler ES, Ware D, Lai J, Sun Q, Xu Y. 2018. Construction of the third-generation Zea mays haplotype map. GigaScience 7. doi:10.1093/gigascience/gix134