Genetic diversity
Motivation
Modern plant breeding programs follow a standard scheme to create improved varieties (Figure 1). Parents with favorable properties are crossed to generate new phenotypic variation and to combine favorable traits in single genotypes. Among the offspring, favorable genotypes are selected and tested in multiple environments to prove their value in comparison to existing varieties and the stability of their phenotype across environments. The best-performing genotypes are then registered as new varieties and marketed. These varieties are in turn used as parents in new breeding programs.

One consequence of breeding cycles is that genetic diversity is increased at the beginning of a breeding program and then reduced over time by selection and genetic drift. Furthermore, according to the DUS criteria, new varieties need to be distinct, uniform and stable to qualify for plant variety registration, i.e., the legal protection of the product of plant breeding. Because of the uniformity requirement, the vast majority of released varieties, in particular hybrid and line varieties, are genetically homogeneous; population varieties are the exception.
As the available genetic diversity in a breeding population declines, further breeding progress (also called genetic gain) becomes smaller and more difficult to achieve.
Learning goals
to be added
Variability of genetic diversity in breeding populations
Our motivation section showed that plant breeding strongly depends on the introduction of novel genetic variation from genetic resources. This leads to the question of which type and how much new genetic variation is optimal for sustaining long-term breeding progress toward productive, healthy and adapted varieties.
The effect of repeated breeding cycles is seen in Figure 2. It shows the results of a meta-study on the genetic diversity of varieties of various field crops (wheat, maize, barley, oat, flax, pea, rice, soybean) that were released in the 20th century in different regions of the world. The regional genetic diversity of varieties (e.g., in Europe or North America) was assessed with molecular markers. Both the analysis of the complete dataset of 44 individual studies and of a subset of 20 wheat studies indicate a drop in diversity in the 1960s after a period of substantial diversity. Starting in the 1970s, plant breeders managed to increase diversity by introgressing new genetic variation, with the result that no substantial net reduction in regional diversity has taken place.
It should be noted, however, that this analysis is based on a fairly small number of molecular markers per species and that the error margins of the estimates tend to be large.

The paradox of genetic diversity
Based on the knowledge of Mendelian inheritance and of genetic diversity, one may ask why new genetic diversity needs to be introgressed into plant breeding programs if the interplay of existing diversity and recombination is able to produce potentially unlimited amounts of diversity. This discrepancy between observed and expected diversity is the paradox of genetic diversity and is explained in the following.
Figure 3 shows a cross of two parents that are homozygous at two genes and carry different alleles at both of them. The F1 generation is heterozygous, and in the F2 generation the alleles start to segregate according to Mendel’s rules.
We assume that the two genes control two different traits (flower color and leaf shape). From this it follows that the number of possible genotype combinations for the two traits is $3^2 = 9$, because each gene can occur in three genotypes (e.g., $AA$, $Aa$ and $aa$).
Now we assume that the two parents differ at 21 loci. Again, the F1 individuals are heterozygous at all loci, and the alleles start to segregate in the F2 generation. Based on the number of genotypes, there are more than 10 billion ($3^{21} \approx 1.05 \times 10^{10}$) possible combinations:
Parent 1: AABBCCDDEEFFGGHHIIJJKKLLMMNNOOPPQQRRSSTTUU
Parent 2: aabbccddeeffgghhiijjkkllmmnnooppqqrrssttuu
F1:       AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUu
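The size of these numbers can be checked with a short calculation. The following sketch assumes biallelic loci, so that each locus contributes two possible alleles per gamete and three possible genotypes; the function name `count_f2_classes` is made up for this illustration.

```python
def count_f2_classes(n_loci: int) -> dict:
    """Count gamete and genotype classes for n biallelic loci that
    are heterozygous in the F1 (assumption of this sketch)."""
    return {
        "gametes": 2 ** n_loci,    # each locus contributes allele A or a
        "genotypes": 3 ** n_loci,  # each locus is AA, Aa or aa
    }


for n in (2, 21):
    classes = count_f2_classes(n)
    print(n, classes["gametes"], classes["genotypes"])
```

For 21 loci, the number of possible genotypes exceeds ten billion, far more than the number of F2 plants that could ever be grown and evaluated from a single cross.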
What then is the paradox of genetic diversity? While it is easy to generate novel phenotypic variation by combining different alleles of different genes in the offspring of a cross, breeding populations often lack sufficient useful variation for the traits that need to be improved.
In other words, a breeding population may not harbor enough genetic variation at the genes underlying, say, 21 different traits, even though it is easy to generate new genetic diversity (i.e., new combinations of different phenotypes) by crossing.
For this reason, new genetic variation needs to be identified, crossed with existing material and evaluated at the phenotypic level in so-called prebreeding programs, which increase phenotypic diversity by introgressing new genetic diversity.
Loss of genetic variation due to breeding
There are several reasons why breeding populations are impoverished for genetic variation in useful traits. Possible reasons are:
- Bottlenecks
  - Only a small proportion of the total genetic variation is included in a breeding population. The establishment of a new population from a small number of founders selected from the ancestral population is also called the founder effect. The new population likely has a much lower level of genetic variation and very different allele frequencies.
- Genetic drift
  - Random fixation of alleles, which is especially pronounced in small populations.
- Selection
  - Fixation of advantageous alleles or loss of disadvantageous alleles by selection.
For an explanation of how these processes reduce genetic variation, you may consult introductory population genetics textbooks.
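As a rough illustration of how drift alone erodes variation, the following sketch uses the standard expectation for an idealized (Wright-Fisher) population of $N$ diploid individuals, in which the expected heterozygosity declines by a factor of $1 - 1/(2N)$ per generation. Selection, mutation and migration are ignored, and the function name and the population sizes are illustrative assumptions.

```python
def expected_heterozygosity(h0: float, n_individuals: int, generations: int) -> float:
    """Expected heterozygosity after a number of generations of drift
    in an idealized diploid population of constant size."""
    decay_per_generation = 1.0 - 1.0 / (2.0 * n_individuals)
    return h0 * decay_per_generation ** generations


# Compare a small and a large breeding population over 50 generations.
for size in (10, 1000):
    print(size, round(expected_heterozygosity(0.5, size, 50), 3))
```

The comparison shows why bottlenecks and small breeding populations are so effective at removing variation: the smaller the population, the faster the expected decline.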
The reduction of genetic diversity occurred primarily at two stages during the history of crop plants. These are the domestication of a crop from its wild ancestor and modern breeding programs that apply strong selection to breeding populations. The two stages can be compared to a funnel that represents a genetic bottleneck (Figure 4).

In the discussion of genetic erosion of crop species, Figure 4 has become iconic, and measures of genetic diversity of crop plants and their wild ancestors confirm this notion. However, this figure is too simplistic because in many crops that were domesticated thousands of years ago, new genetic diversity originated by mutation, which may have been advantageous for cultivation and was selected by farmers and early breeders. There is surprisingly little research in this direction and the narrative of a loss of genetic variation prevails.
Genetic vulnerability due to a narrow genetic basis
Populations that lack genetic variation are called vulnerable because they either suffer from inbreeding depression (in outbreeding species like maize) or have little genetic variation at resistance genes, which makes it easy for pathogens to spread in a population, once they overcome resistance mechanisms.
Two historical examples:
- Potato late blight in Ireland (1845-1849)
  - It was caused by the pathogen Phytophthora infestans. As a result, about one million people starved to death and about one million people emigrated to the US.
- Southern leaf blight epidemic in the Southern US (1970)
  - Maize varieties with the so-called Texas T cytoplasm were wiped out by Southern leaf blight. The epidemic caused harvest losses in that year. Although the economic damage was limited, the epidemic raised awareness of genetic vulnerability.
An important consequence of genetic vulnerability is the high cost of the fungicides and other pesticides required to keep pathogens and pests under control in modern large-scale farming systems.
The Southern leaf blight epidemic had a relatively small (economic) impact, but the Great Famine caused by potato late blight had major historical consequences (Zadoks, 2008).
A lack of genetic diversity in potatoes allowed a single strain of the pathogen P. infestans to cause the epidemic. At that time, a single potato variety was planted, which was favored by the fact that potato can be clonally propagated and then produces genetically identical offspring.1
1 For a very short summary, see here
The genomic analysis of a herbarium strain of P. infestans showed that the epidemic was caused by a single strain originating from Mexico, which has since been replaced by other strains (Birch and Cooke, 2013; Goss et al., 2014; Yoshida et al., 2013).
Genetic vulnerability and genetic erosion
In the context of crop genetic diversity, two terms are frequently used.2
2 See, for example, FAO (2010)
Genetic vulnerability results if a widely planted crop is uniformly susceptible to a pest, pathogen or environmental hazard as a result of its genetic constitution, thereby creating the potential for widespread crop losses.
Genetic erosion is the loss of individual genes and the loss of particular combinations of genes (i.e. of gene complexes) such as those maintained in locally adapted landraces.
There are alternative uses of the term ‘genetic erosion’ (Khoury et al., 2022):
- Loss of old varieties or landraces of a given crop
- Loss of (useful) genes or alleles present in a crop species, landrace, variety
Types of genetic variation
If the genetic diversity in a breeding population or a crop species in general is limited, it is necessary to search for new variation, such as in old varieties or in wild relatives.
However, genetic diversity is not useful per se, and it is necessary to differentiate among different types of genetic variation.
One may differentiate the total genetic variation into three major types of genetic variation:
Useful genetic variation influences traits that are selected during breeding in a positive manner. Such variation is expected in the following classes of genes:
- Yield genes
- Adaptation genes
- Resistance genes
- etc.
Neutral genetic variation is the fraction of genetic variation without phenotypic effects. Since it is not selected, it mainly evolves under genetic drift (i.e., random evolution) in a population.
Deleterious genetic variation is any genetic variation that reduces the favorable traits of a crop such as the yield or quality of the crop harvest. If it can be recognized as such by the breeder, it is removed by selection during the breeding process.
There is a large body of theory and empirical work in population genetics on the question of which proportion of the variation segregating in populations belongs to each of these three classes, and which methods are suitable for classifying genetic variants into these classes.3
3 See, for example, our work on deleterious mutations segregating in wild and crop plants (Günther and Schmid, 2010).
What are plant genetic resources (PGR)?
To introduce the topic of genetic diversity in the context of PGR, we first define plant genetic resources and then show why new genetic diversity in the form of PGR is important and required in the context of the paradox of genetic diversity discussed above.
The 4th international technical conference of FAO on PGR in Leipzig 1996 used the following definition for plant genetic resources4:
4 See the report on the conference: Link.
… generatively or vegetatively reproducible plant material of current or potential value, including landraces, related wild species and wild forms, and special genetic material of crop plants
A much simpler definition was given by Pflanzenzüchtung (1993):
… the complete genetic material that is available for breeding of a crop plant.
In the age of genetic engineering and synthetic biology, we can broaden this definition even further to
any genetic material and variants that are naturally available or can be artificially generated and used in plant breeding.
Sources of new genetic variation
If genetic diversity (e.g., in a given crop species) is limited, several sources of new genetic variation can be utilized. These sources include
- New and minor crops (Neodomestication)
- Close relatives of crops (Neodomestication, Redomestication, Introgression)
- Wild ancestors of crop species
- Land races and traditional cultivars
- Modern elite varieties of different geographic origin
- Induced mutations by chemical (Mutagens), physical (Irradiation) or biological (Genome editing) treatments
In the breeding of commercial varieties, the released varieties of competitors (if they are not patented) are likely the main source of new genetic variation. This is possible because of the breeder’s exemption granted by plant variety protection laws in many countries. It is also a meaningful approach, because released commercial varieties are usually purged of much of the deleterious variation that frequently segregates in traditional cultivars that have not been improved by breeding (Figure 5).

Crop gene pools as sources of new genetic diversity
In an effort to provide more rational and systematic access to new genetic diversity for crop plants, Harlan and de Wet developed the concept of crop gene pools (Harlan and de Wet, 1971). Essentially, several gene pools are defined for each crop that describe the genetic and taxonomic relationship to the crop plant (Figure 6).

According to this definition, there are four gene pools:
- Gene pool 1 (GP1):
  - This pool is defined by the biological species concept, which states that all genotypes that can be crossed with each other and produce fertile offspring belong to the same species. Therefore, GP1 usually includes the ancestors of crops as well as all subspecies of a species, which may be either cultivated types or phenotypically variable spontaneous types.
- Gene pool 2 (GP2):
  - It includes closely related species that can be successfully crossed with the crop species, but produce offspring with limited viability or limited fertility, as indicated by partial sterility. This pool frequently includes close relatives that may harbor interesting disease resistance genes. Taxonomically, such species may be other species of the same genus.
- Gene pool 3 (GP3):
  - This gene pool includes more distantly related species. Crosses with the crop species may produce offspring, but the offspring either have an anomalous phenotype, are completely sterile, or die as embryos before developing into adult plants. Quite frequently, these barriers can be overcome by biotechnological treatments in order to produce offspring that can be included in further breeding.
- Gene pool 4 (GP4):
  - It includes all other species for which crosses with the crop species are not successful.
The first three gene pools can be utilized by classical breeding or biotechnological approaches such as tissue culture or protoplast fusion. The fourth gene pool includes species that cannot be crossed with a given crop but may provide useful genes that can be introduced into the crop by genetic engineering.
In the following, the concept of gene pools as sources of novel and useful genetic diversity is demonstrated for canola, Brassica napus.
The primary gene pool consists of
- Modern elite varieties of canola (Brassica napus)
- Landraces and traditional cultivars
- German Brassica napus (rapeseed) (Wikipedia)
The secondary gene pool includes
- Brassica rapa (Turnip; several subspecies used as vegetables) (Wikipedia)
- Brassica oleracea (Cabbage and derived vegetables) (Wikipedia)
- Brassica nigra (Black mustard) (Wikipedia)
The four Brassica species have a special chromosomal relationship that is called the ‘Triangle of U’, which essentially states that the genomes of three ancestral species combined to form the genomes of several modern Brassica crops (Wikipedia; see also the lecture on gene flow).
The four species can be crossed with each other relatively easily and produce F1 hybrids (Hauser et al., 1998).
The tertiary gene pool includes
- Raphanus (Radish) (Wikipedia)
- Crambe (Sea kale; Meerkohl) (Wikipedia)
- Arabidopsis (thale cress) (Wikipedia)
Somatic hybrid cell lines between Arabidopsis thaliana and Brassica nigra could be produced by protoplast fusion (Siemens and Sacristán, 1995), which demonstrates that genetic material can in principle be transferred between the two species. However, further complications may arise because the chromosomes of the two species may not recombine due to extensive sequence differences.
Key questions in the context of PGR
Based on the above considerations, several questions arise with respect to the level and use of genetic diversity in plant genetic resources and their use in breeding:
- What is the level of genetic diversity within and between modern elite breeding populations?
- Does ‘exotic’ breeding material have a higher level of genetic variation that can be introgressed into modern varieties?
- What type of genetic diversity is useful or deleterious?
- How can useful genetic variation be identified?
Much contemporary research on genetic resources is focused on answering these questions.
An example of a successful introduction of new variation
The effect of a loss in genetic diversity and a subsequent gain by introgression of plant genetic resources can be demonstrated with the breeding history of tomato in the Netherlands (Schouten et al., 2019). The wild ancestors of cultivated tomato are Solanum lycopersicum var. lycopersicum, S. lycopersicum var. cerasiforme, and S. pimpinellifolium. From these ancestors, heirloom types and landraces originated by domestication processes. Inbreeding and selection among these old (vintage) varieties led to commercial varieties in the 1960s with very low genetic and phenotypic diversity. During this time, Dutch tomatoes developed a reputation for being tasteless and watery, which led to the “waterbomb” (in German: “Wasserbombe”) crisis. Breeders reacted and introduced more exotic genetic variation into the breeding material to improve multiple traits, including taste, shape and disease resistance. From the 1970s onwards, resistances to diseases and pests were introgressed from distant species, including S. peruvianum, S. pennellii, S. chilense, and S. habrochaites, which increased genetic diversity among commercial tomato varieties considerably. After the 1980s, fruit size, color, and flavor started to vary substantially, further increasing the genetic diversity of modern varieties.
As a result, the genetic diversity of Dutch tomato varieties increased. The temporal dynamics are shown in Figure 8; the data are based on genotyping 90 varieties with a SNP array and calculating Nei’s gene diversity index.

Quantification of genetic variation
In current studies of genetic variation, essentially two types of data are used: genetic markers such as single nucleotide polymorphisms (SNPs), and DNA sequences. Markers target individual polymorphisms with a known location in a genome using an experimental assay, whereas DNA sequencing captures the complete set of polymorphisms in a genomic region or even the whole genome. To derive the theory for the quantification of genetic variation, we first focus on SNPs, i.e., individual genetic polymorphisms.
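We first introduce the following conventions: A locus is indicated with a capital letter, e.g., $A$, and its alleles with indices, e.g., $A_1$ and $A_2$.
The relative frequency of allele $A_1$ is $p$ and the relative frequency of allele $A_2$ is $q$, with $p + q = 1$.
If there are more than two alleles, frequencies are expressed as $p_1, p_2, \ldots, p_k$.
There are three genotypes in a population of diploid organisms: two homozygous marker genotypes, $A_1A_1$ and $A_2A_2$, and one heterozygous genotype, $A_1A_2$.
The relative frequencies in the population are written as
Genotype: | $A_1A_1$ | $A_1A_2$ | $A_2A_2$
Relative frequency: | $P_{11}$ | $P_{12}$ | $P_{22}$
The sum of relative frequencies is 1: $P_{11} + P_{12} + P_{22} = 1$.
From genotype frequencies, the allele frequencies can be calculated directly. The frequency $p$ of allele $A_1$ is $p = P_{11} + \frac{1}{2}P_{12}$, and the frequency $q$ of allele $A_2$ is $q = P_{22} + \frac{1}{2}P_{12}$.
Some markers or marker types (such as simple sequence repeats, SSRs) have more than two alleles, but the calculation of relative allele frequencies is straightforward.
For a marker with $k$ alleles, the alleles and their relative frequencies are written as
Alleles: | $A_1$ | $A_2$ | $A_3$ | $\ldots$ | $A_k$
Frequencies: | $p_1$ | $p_2$ | $p_3$ | $\ldots$ | $p_k$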
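The calculation of allele frequencies from genotype data can be sketched in a few lines of Python. This is a minimal illustration that assumes a biallelic, codominant marker in a diploid species; the function name and the genotype counts are made up for the example.

```python
def allele_frequencies(n11: int, n12: int, n22: int) -> tuple:
    """Return the frequencies (p, q) of alleles A1 and A2 from the
    counts of the genotypes A1A1, A1A2 and A2A2."""
    total = n11 + n12 + n22
    if total == 0:
        raise ValueError("at least one genotyped individual is required")
    # Each homozygote carries two copies of its allele, each
    # heterozygote one copy of either allele.
    p = (2 * n11 + n12) / (2 * total)
    return p, 1.0 - p


# Hypothetical sample of 100 genotyped plants.
p, q = allele_frequencies(30, 50, 20)
print(p, q)
```

Counting alleles in this way is equivalent to adding the frequency of one homozygote and half the frequency of the heterozygotes.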
Indices
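Under the assumption that in heterozygotes both alleles can be distinguished (i.e., the marker is codominant), allele frequencies can be estimated directly from the observed genotypes and used to calculate diversity indices.
With $k$ alleles at frequencies $p_1, \ldots, p_k$, the most widely used index is Nei’s gene diversity, $H$, which is also called expected heterozygosity.
It is essentially an estimate of the probability that two randomly chosen alleles in a population are not identical, and is calculated for a single marker as
$$H = 1 - \sum_{i=1}^{k} p_i^2$$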
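As a small worked example, gene diversity can be computed as one minus the sum of squared allele frequencies. The sketch below assumes that the allele frequencies of a single marker are already known and sum to one; the function name and the example frequencies are illustrative.

```python
def gene_diversity(freqs) -> float:
    """Probability that two randomly drawn alleles differ,
    calculated as 1 minus the sum of squared allele frequencies."""
    if abs(sum(freqs) - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


print(gene_diversity([0.5, 0.5]))                # two equally frequent alleles
print(gene_diversity([0.25, 0.25, 0.25, 0.25]))  # four equally frequent alleles
print(gene_diversity([1.0]))                     # a fixed (monomorphic) marker
```

The examples illustrate two general properties: the index is zero for a monomorphic marker, and it increases with the number of alleles when these occur at equal frequencies.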
A related measure is polymorphism information content (PIC), which was originally introduced for use in human genetics (Botstein et al., 1980).
It refers to the value of a marker for detecting polymorphism within a population, depending on the number of detectable alleles and the distribution of their frequency for a given marker.
A measure of informativeness is highly useful for selecting parents for genetic mapping: markers that differ between the parents are likely to be polymorphic among the offspring, which maximizes the power of a marker set to detect a quantitative trait locus (QTL) in a genetic mapping study.
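For outbred and heterozygous individuals, the PIC value of a marker with $k$ alleles is defined as
$$PIC = 1 - \sum_{i=1}^{k} p_i^2 - \sum_{i=1}^{k-1}\sum_{j=i+1}^{k} 2 p_i^2 p_j^2$$
where $p_i$ and $p_j$ are the frequencies of alleles $i$ and $j$.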
The occurrence of rare alleles has less impact on the PIC than alleles occurring with high frequencies.
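The PIC of Botstein et al. (1980) subtracts from one both the sum of squared allele frequencies and, for every pair of different alleles, twice the product of their squared frequencies. The following sketch implements this definition for a single marker; the function name and the example frequencies are assumptions made for illustration.

```python
from itertools import combinations


def pic(freqs) -> float:
    """Polymorphism information content of one marker,
    following the definition of Botstein et al. (1980)."""
    sum_of_squares = sum(p * p for p in freqs)
    # One term per unordered pair of different alleles.
    pair_term = sum(2 * (pi ** 2) * (pj ** 2)
                    for pi, pj in combinations(freqs, 2))
    return 1.0 - sum_of_squares - pair_term


print(pic([0.5, 0.5]))    # two common alleles
print(pic([0.99, 0.01]))  # one rare allele contributes very little
```

For two equally frequent alleles, the value is 0.375, which is lower than the corresponding gene diversity of 0.5 because of the additional pair term.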
A simplified version was developed assuming that the inbred individuals selected for mapping are homozygous (Anderson et al., 1993).
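Then, the PIC of a marker $i$ is
$$PIC_i = 1 - \sum_{j=1}^{k} p_{ij}^2$$
where $p_{ij}$ is the frequency of the $j$-th allele of marker $i$.
Note that this measure corresponds to Nei’s gene diversity.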
The PIC for multiple markers can be calculated by taking the average of PIC values of each marker.
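Averaging over markers can be sketched as follows. The example uses the simplified PIC for homozygous inbred lines, i.e., one minus the sum of squared allele frequencies per marker; the marker names and allele frequencies are hypothetical.

```python
def simplified_pic(freqs) -> float:
    """Simplified PIC of one marker: 1 minus the sum of squared
    allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def mean_pic(markers: dict) -> float:
    """Average the simplified PIC over all markers in the panel."""
    values = [simplified_pic(freqs) for freqs in markers.values()]
    return sum(values) / len(values)


# Hypothetical panel: one SNP and one SSR marker with four alleles.
panel = {
    "snp_1": [0.5, 0.5],
    "ssr_1": [0.25, 0.25, 0.25, 0.25],
}
print(mean_pic(panel))
```

Such an average can be used to compare candidate marker panels, with the caveat that it weights all markers equally regardless of their position in the genome.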
Depending on the type of mapping study, PIC values can be used to select markers for genotyping or individuals for creating a mapping population. Therefore, this measure is frequently used in the design of genotyping arrays in breeding programs.
Quantification of DNA sequence variation
In comparison to SNP markers, DNA sequencing identifies all polymorphisms of a locus, and one obtains a group of coupled markers.
The different combinations of alleles of different polymorphisms on the chromosome, in a genomic region or in a gene are called haplotypes.
The following sequence alignment shows the hypothetical DNA sequence of a short region of 10 base pairs from two outbred, heterozygous, diploid individuals.
                A)            B)
Position        1234567890    1234567890
Individual 1    GATCGAACAG    G.T......G
Individual 1    GAACGAACAT    G.A......T
Individual 2    TATCGAACAG    T.T......G
Individual 2    TATCGAACAG    T.T......G
The difference between A) and B) is that in the latter, only the variable positions in the sequence alignment are shown with letters, whereas invariable sites are shown as dots.
In this sample, three single-nucleotide polymorphisms are observed and three haplotypes (i.e. combinations of polymorphisms). Individual 2 is homozygous at this region, because both sequences are identical (they are the same haplotype). Using this sequence alignment, several descriptive statistics can be calculated.
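The polymorphic positions and haplotypes of the example alignment can be identified programmatically. The sketch below assumes that the four sequences are already aligned and of equal length; the function names are illustrative.

```python
# The four sequences of the example alignment (two per individual).
sequences = ["GATCGAACAG", "GAACGAACAT", "TATCGAACAG", "TATCGAACAG"]


def variable_positions(seqs):
    """Return the 1-based alignment positions with more than one nucleotide."""
    return [i + 1 for i, column in enumerate(zip(*seqs))
            if len(set(column)) > 1]


def haplotypes(seqs):
    """Return the distinct allele combinations at the variable positions."""
    positions = variable_positions(seqs)
    return sorted({"".join(s[p - 1] for p in positions) for s in seqs})


print(variable_positions(sequences))
print(haplotypes(sequences))
```

The output confirms three polymorphic positions (1, 3 and 10) and three haplotypes, two of which occur in individual 1.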
If the sequenced region is long enough, each sequence differs from every other one, and heterozygosity as a measure of genetic variation becomes uninformative because nearly every individual is heterozygous.
Instead, a new measure is defined, which is called nucleotide polymorphism.
It describes the proportion of nucleotide positions in a sample that are polymorphic and is calculated as
$$P_n = \frac{S}{L}$$
where $S$ is the number of polymorphic (segregating) sites and $L$ is the total number of nucleotide positions in the alignment.
In the above sequence alignment, $P_n = 3/10 = 0.3$.
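A second and more widely used measure of DNA sequence variation is nucleotide diversity, $\pi$.
It is defined as the average pairwise nucleotide difference and can be calculated as
$$\pi = \frac{\sum_{i<j} \pi_{ij}}{n(n-1)/2}$$
where $n$ is the number of sequences and $\pi_{ij}$ is the proportion of nucleotide positions that differ between sequences $i$ and $j$.
This measure describes the average proportion of nucleotide positions that are polymorphic (i.e., different) between any two randomly chosen alleles of a locus in a population.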
A more efficient calculation of nucleotide diversity is achieved by first counting the number of haplotypes and the differences between the different haplotypes. If $n$ sequences of length $L$ nucleotides are sampled, the calculation proceeds in the following steps.5
5 The method is taken from Hartl and Clark (2007), p. 34 f
- Calculate the average number of pairwise mismatches between haplotypes, $\Pi$:
  - Count the number of mismatches, $d_{ij}$, for every pair of aligned sequences $i$ and $j$
  - For $n$ sequences, there are $n(n-1)/2$ possible pairs of sequences
  - Calculate $\Pi$: $\Pi = \frac{\sum_{i<j} d_{ij}}{n(n-1)/2}$
- Calculate $\pi$ on a per-nucleotide basis to compare genes of different length: $\pi = \Pi / L$, where $L$ is the length of the alignment in nucleotides
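The steps above can be sketched in Python for the example alignment of four sequences. The code assumes aligned sequences of equal length without gaps or missing data; the function names are chosen for this illustration.

```python
from itertools import combinations

sequences = ["GATCGAACAG", "GAACGAACAT", "TATCGAACAG", "TATCGAACAG"]


def mismatches(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))


def nucleotide_diversity(seqs):
    """Return the average number of pairwise mismatches and the
    same value per nucleotide position."""
    pairs = list(combinations(seqs, 2))  # n(n-1)/2 pairs of sequences
    mean_mismatches = sum(mismatches(a, b) for a, b in pairs) / len(pairs)
    return mean_mismatches, mean_mismatches / len(seqs[0])


mean_mismatches, pi = nucleotide_diversity(sequences)
print(round(mean_mismatches, 3), round(pi, 3))
```

In this example, the six pairwise comparisons show 10 mismatches in total, which gives an average of about 1.67 mismatches per pair and a nucleotide diversity of about 0.167 per site.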
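This value of $\pi$ is an average over pairs of different sequences. If nucleotide diversity is instead calculated from sample haplotype frequencies as $\sum_{i}\sum_{j} x_i x_j \pi_{ij}$, where $x_i$ and $x_j$ are the frequencies of haplotypes $i$ and $j$ and $\pi_{ij}$ is the proportion of nucleotide positions that differ between them, the estimate is biased downward in small samples because the double sum includes comparisons of a sequence with itself.
Nei (1987) proposed a simple correction that results in an unbiased estimator and is calculated as
$$\hat{\pi} = \frac{n}{n-1}\sum_{i}\sum_{j} x_i x_j \pi_{ij}$$
where $n$ is the number of sequences in the sample.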
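Both statistics have in common that they can be used to estimate a very important parameter of populations, namely the population mutation parameter, $\theta = 4N_e\mu$, where $N_e$ is the effective population size and $\mu$ is the mutation rate per generation.6
6 Consult introductory population genetics textbooks for an explanation.
It should be noted, however, that nucleotide polymorphism, i.e., the proportion of polymorphic sites, increases with the number of sequences sampled and therefore has to be corrected for sample size. The corrected statistic is Watterson’s estimator,
$$\theta_W = \frac{S}{a_n L}, \qquad a_n = \sum_{i=1}^{n-1} \frac{1}{i}$$
where $S$ is the number of polymorphic sites and $L$ is the length of the alignment in nucleotides.
The variable $a_n$ depends only on the number of sequences, $n$.
Both $\pi$ and $\theta_W$ are estimators of $\theta$.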
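Watterson's estimator divides the number of polymorphic sites by the sum $1 + 1/2 + \ldots + 1/(n-1)$, which depends on the number of sequences $n$, and by the alignment length to obtain a per-site value. The following sketch applies this to the example alignment of four sequences used above; it assumes gap-free aligned sequences, and the function name is illustrative.

```python
sequences = ["GATCGAACAG", "GAACGAACAT", "TATCGAACAG", "TATCGAACAG"]


def watterson_theta(seqs) -> float:
    """Watterson's estimator of the population mutation parameter
    per nucleotide site."""
    n = len(seqs)
    length = len(seqs[0])
    # Number of alignment columns with more than one nucleotide.
    segregating = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    # Correction for sample size: 1 + 1/2 + ... + 1/(n-1).
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating / a_n / length


print(round(watterson_theta(sequences), 4))
```

With three polymorphic sites in four sequences of ten nucleotides, the estimate is about 0.164 per site, close to the nucleotide diversity of about 0.167 obtained from the pairwise comparisons of the same alignment.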
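There are other statistics that can be used to describe genetic variation in sequencing data. They include haplotype diversity and the mismatch distribution. For example, the haplotype diversity is calculated as
$$H_d = \frac{n}{n-1}\left(1 - \sum_{i} x_i^2\right)$$
where $x_i$ is the frequency of the $i$-th haplotype in a sample of $n$ sequences.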
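Haplotype diversity can be computed from haplotype counts in the same way as gene diversity, with the sample-size correction $n/(n-1)$ proposed by Nei (1987). The sketch below treats every distinct sequence as one haplotype and uses the four-sequence example alignment; the function name is an assumption of this illustration.

```python
from collections import Counter

sequences = ["GATCGAACAG", "GAACGAACAT", "TATCGAACAG", "TATCGAACAG"]


def haplotype_diversity(seqs) -> float:
    """Haplotype diversity with the n/(n-1) sample-size correction."""
    n = len(seqs)
    counts = Counter(seqs)  # number of copies of each distinct haplotype
    sum_of_squares = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_of_squares)


print(round(haplotype_diversity(sequences), 3))
```

For the example, the haplotype frequencies are 0.25, 0.25 and 0.5, which gives a haplotype diversity of about 0.833.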
In most studies of crop genetic diversity that use sequences, nucleotide diversity, $\pi$, is used as the measure of genetic variation.
Limitations of current estimators of genetic diversity
The sequencing of complete plant genomes and the resequencing of the genomes of multiple individuals revealed that very complex types of genomic polymorphisms segregate in plant populations. They result from insertions and deletions of DNA, from the activity of transposable elements (TEs) and from gene duplications, which create many repetitive regions in plant genomes. For this reason, plant genomes tend to be much more repetitive than animal genomes (Figure 9).

As an example, Figure 10 shows a genomic region of the model plant Arabidopsis thaliana that was sequenced at high quality in three individuals collected at different locations of the species distribution range and in which all genes (and gene fragments) were annotated. The figure illustrates the complexity of the rearrangements. As a consequence of these rearrangements, no simple multiple alignment can be constructed to calculate the nucleotide diversity measures discussed above, and new measures for describing genetic diversity have to be developed.

Similar rearrangements are observed in crop species such as grapevine (Vitis vinifera) or maize. For example, Figure 11 shows the schematic alignment of a genomic region from Chardonnay and Cabernet Sauvignon. The alignment shows a high degree of genetic differentiation caused by the presence and absence of genes, both between the two haplotypes of each variety and between the two varieties. In total, the two varieties differ by 2,217 genes (6% of all genes), which are present in one variety and absent in the other.

In summary, these comparisons show that complex polymorphisms result from different and interacting types of mutations such as point mutations, insertions/deletions, rearrangements and transposon insertions. Such polymorphisms cannot be described with the simple, nucleotide-based measures of genetic diversity. No suitable statistics have been established yet to quantify this type of genetic variation, but approaches to model the diversity of these complex variants are being developed.
Key concepts
Summary
- Nowadays, DNA-based markers such as single nucleotide polymorphisms and DNA sequences are the method of choice for analysing genetic variation.
- Plant breeding depletes genetic variation over time and leads to a slowdown of genetic gain per time.
- Genetic resources are genetic material available for breeding, which increases genetic diversity in breeding pools or breeding programs.
- The gene pool concept describes the relationship between a crop species and its genetic resources.
- The quantification of genetic variation differentiates between allele and genotype frequencies.
- The number of polymorphic loci (polymorphism) and the heterozygosity are used as measures to estimate and compare levels of genetic variation in populations.
- Marker- and sequence-based estimates of DNA variation are derived from polymorphism and heterozygosity.
- For markers, the most important estimate of diversity is gene diversity, $H$, and for DNA sequences, nucleotide diversity, $\pi$, and Watterson's $\theta_W$, which are both estimators of the population mutation parameter, $\theta$.
- DNA sequencing revealed very complex polymorphisms segregating in plant populations whose diversity is difficult or impossible to quantify using classical diversity statistics.
Further reading
- Any recent introductory textbook has a section on genetic diversity measurements.
- Lynch and Walsh: Genetics and Analysis of Quantitative Traits (1998), p. 492-495 - A description and evaluation of marker informativeness
- Turner-Hissong et al. (2020) - A short review that derives the importance of genetic diversity and evolutionary concepts for plant breeding and outlines some concepts relevant for the module.
Study questions
- What exactly is the paradox in the paradox of genetic diversity?
- What is the difference between a polymorphism and a genetic marker?
- What are different types of marker systems?
- What is the difference between the gene diversity and PIC measures of genetic diversity?
- What is the difference between nucleotide polymorphism and nucleotide diversity estimators of genetic variation?
- Which parameter is estimated by nucleotide diversity or nucleotide polymorphism, and why is it useful to know this parameter?
Problems
- Calculate gene diversity for a tri-allelic marker and the following allele frequencies:
$p_1$ | $p_2$ | $p_3$ | Gene diversity, $H$
---|---|---|---
0.1 | 0.1 | 0.8 |
0.333 | 0.333 | 0.333 |
0.5 | 0.25 | 0.25 |
0.998 | 0.001 | 0.001 |
For which allele frequencies is gene diversity highest?
- Calculate the parameters of the sequence alignment:
seq1 GATCTATATA
seq2 GAACTATATA
seq3 CATCATCATA
seq4 GACCTATATC
Calculate the following parameters