Published

23 May 2025

Genetic diversity

Motivation

Modern plant breeding programs follow a standard scheme to create improved varieties (Figure 1). Parents with favorable properties are crossed to generate new phenotypic variation and to combine favorable traits in single genotypes. Among the offspring, favorable genotypes are selected and tested in multiple environments to prove their value in comparison to existing varieties and the stability of their phenotype across environments. The best-performing genotypes are then registered as new varieties and marketed. These varieties are in turn used as parents in new breeding programs.

Figure 1: The innovation cycle in plant breeding. Modified after Jorasch ().

One consequence of breeding cycles is that genetic diversity is increased at the beginning of a breeding program and is then reduced over time because of selection and genetic drift. Furthermore, according to the DUS criteria, new varieties need to be distinct, uniform and stable for plant variety registration, i.e., the legal protection of the product of plant breeding. The uniformity requirement is responsible for the fact that, with the exception of population varieties, the vast majority of released varieties, in particular hybrid and line varieties, are genetically homogeneous.

Over time, however, the available genetic diversity in a breeding population is reduced due to selection, and further breeding progress (also called genetic gain) becomes smaller and more difficult to achieve.

Learning goals

to be added

Variability of genetic diversity in breeding populations

Our motivation section showed that plant breeding strongly depends on the introduction of novel genetic variation from genetic resources. This leads to the question of which type and how much new genetic variation is optimal for sustaining long-term breeding progress to produce productive, healthy and adapted varieties.

The effect of repeated breeding cycles on diversity is seen in Figure 2. It shows the results of a meta-analysis of the genetic diversity of varieties of various field crops (wheat, maize, barley, oat, flax, pea, rice, soybean) that were released in the 20th century in different regions of the world. The regional genetic diversity of varieties (e.g., in Europe or North America) was assessed with molecular markers. Both the analysis of the complete dataset of 44 individual studies and of a subset of 20 studies of wheat indicate a drop in diversity in the 1960s after a period of substantial diversity. Starting from the 1970s, plant breeders managed to increase the diversity by introgression of new genetic diversity, with the result that no substantial reduction in the regional diversity has taken place.

It should be noted, however, that this analysis is based on a fairly small number of molecular markers per species and that the error margins of the estimates tend to be large.

Figure 2: Crop genetic diversity in the twentieth century based on weighted meta-analysis of 44 (A) studies of different crops and of 20 (B) studies of wheat. The diversity in the decade with the lowest diversity was set to 100. Source: Wouw et al. ()

The paradox of genetic diversity

Based on the knowledge of Mendelian inheritance and of genetic diversity, one may ask why new genetic diversity needs to be introgressed into plant breeding programs if the interplay of existing diversity and recombination is able to produce potentially unlimited amounts of diversity. This discrepancy between observed and expected diversity is the paradox of genetic diversity and is explained in the following.

Figure 3 shows a cross of two parents, which are homozygous for two genes. The two parents carry different alleles at both genes. The F1 generation is heterozygous, and in the F2 generation the alleles start to segregate according to Mendel’s rules.

We assume that the two genes control two different traits (flower color and leaf shape). Since each gene has three possible genotypes, the number of possible genotype combinations for the two traits is $3^2 = 9$.

Figure 3: Number of segregating phenotypes if homozygous parents differ in 2 traits.

Now we assume that the two parents differ at 21 loci. Again, the F1 individuals are heterozygous at all loci, and the loci start to segregate in the F2 generation. With three genotypes per locus, there are more than 10 billion ($10^{10}$) possible genotypes. Under the assumption that each locus controls a different trait, there is the same number of potentially different phenotypes. These are many more phenotypic variants originating from a single cross than a breeder can handle in a breeding program.

Table 1: Segregation of genetic variation if two alleles are present at each of 21 loci.
P1 AABBCCDDEEFFGGHHIIJJKKLLMMNNOOPPQQRRSSTTUU
×
P2 aabbccddeeffgghhiijjkkllmmnnooppqqrrssttuu
F1 AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUu
F2 $3^{21}$ = 10,460,353,203 genotypes
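
The count in Table 1 follows from three possible genotypes per biallelic locus. The short sketch below reproduces the numbers for 2 and 21 loci; the function name `n_f2_genotypes` is made up for illustration.

```python
def n_f2_genotypes(n_loci: int) -> int:
    """Number of distinct multilocus genotypes in an F2 population.

    Assumes biallelic loci, so each locus contributes three
    genotypes (AA, Aa, aa).
    """
    return 3 ** n_loci


for n_loci in (2, 21):
    print(f"{n_loci} loci: {n_f2_genotypes(n_loci):,} genotypes")
```
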

What then is the paradox of genetic diversity? While it is easy to generate novel phenotypic variation by combining different alleles of different genes in the offspring of a cross, breeding populations often do not contain sufficient useful variation for the traits that need to be improved.

In other words, there may not be enough genetic variation at genes involving 21 different traits within a breeding population even though it would be easy to generate new genetic diversity (i.e., combinations of different phenotypes) by crossing.

For this reason, new genetic variation needs to be identified, crossed with existing material and evaluated on a phenotypic level in so-called prebreeding programs to increase phenotypic diversity by introgressing new genetic diversity.

Loss of genetic variation due to breeding

There are several reasons why breeding populations are impoverished for genetic variation in useful traits. Possible reasons are:

Bottlenecks
Only a small proportion of total genetic variation is included in a breeding population. The establishment of a new population by selecting a small number of founders from the ancestral population is also called the founder effect. The new population likely has a much lower level of genetic variation and very different allele frequencies.
Genetic drift
Random fixation of alleles, which is especially pronounced in small populations.
Selection
Fixation of advantageous alleles or loss of disadvantageous alleles by selection.

For an explanation of how these processes reduce genetic variation, you may consult introductory population genetics textbooks.

The reduction of genetic diversity occurred primarily at two stages during the history of crop plants. These are the domestication of a crop from its wild ancestor, and modern breeding programs that apply strong selection to breeding populations. The two stages can be compared to funnels that represent genetic bottlenecks (Figure 4).

Figure 4: Presentation of genetic variation and its loss as passage through a series of funnels. Source: Tanksley and McCouch ()

In the discussion of genetic erosion of crop species, Figure 4 has become iconic, and measures of genetic diversity of crop plants and their wild ancestors confirm this notion. However, this figure is too simplistic because in many crops that were domesticated thousands of years ago, new genetic diversity originated by mutation, which may have been advantageous for cultivation and was selected by farmers and early breeders. There is surprisingly little research in this direction and the narrative of a loss of genetic variation prevails.

Genetic vulnerability due to a narrow genetic basis

Populations that lack genetic variation are called vulnerable because they either suffer from inbreeding depression (in outbreeding species like maize) or have little genetic variation at resistance genes, which makes it easy for pathogens to spread in a population, once they overcome resistance mechanisms.

Two historical examples:

Potato leaf blight in Ireland (1845-1849)
It was caused by the pathogen Phytophthora infestans. As a result, one million people starved to death and one million people emigrated to the US.
Southern leaf blight epidemics in the Southern US (1970)
Maize varieties with the so-called Texas T cytoplasm were wiped out by Southern leaf blight. The epidemic resulted in a harvest loss in that year. Although the economic damage was limited, this epidemic raised awareness of genetic vulnerability.

An important consequence of genetic vulnerability is the high cost of the fungicides required to keep pathogens under control in modern large-scale farming systems.

The Southern leaf blight epidemic had a relatively small (economic) impact, but the Great Famine that was caused by potato leaf blight had major historical consequences ().

A lack of genetic diversity in potatoes allowed a single strain of the pathogen P. infestans to cause the epidemic. At that time a single variety of potato was planted, which was favored by the fact that potato can be clonally propagated and produces genetically identical offspring.

1 For a very short summary, see here

The genomic analysis of a herbarium strain of P. infestans showed that a single strain originating from Mexico was causing the epidemic, which today has been replaced by other strains (; ; ).

Genetic vulnerability and genetic erosion

In the context of crop genetic diversity, two terms are frequently used.

2 See, for example, FAO ()

Genetic vulnerability results if a widely planted crop is uniformly susceptible to a pest, pathogen or environmental hazard as a result of its genetic constitution, thereby creating the potential for widespread crop losses.

Genetic erosion is the loss of individual genes and the loss of particular combinations of genes (i.e. of gene complexes) such as those maintained in locally adapted landraces.

There are alternative uses of the term ‘genetic erosion’ ():

  • Loss of old varieties or landraces of a given crop
  • Loss of (useful) genes or alleles present in a crop species, landrace, variety

Types of genetic variation

If the genetic diversity in a breeding population or a crop species in general is limited, it is necessary to search for new variation, such as in old varieties or in wild relatives.

However, genetic diversity is not useful per se, and it is necessary to differentiate among different types of genetic variation.

One may differentiate the total genetic variation into three major types of genetic variation:

Useful genetic variation influences traits that are selected during breeding in a positive manner. Such variation is expected in the following classes of genes:

-   Yield genes
-   Adaptation genes
-   Resistance genes
-   etc.

Neutral genetic variation is the fraction of genetic variation without phenotypic effects. Since it is not selected, it mainly evolves under genetic drift (i.e., random evolution) in a population.

Deleterious genetic variation is any genetic variation that reduces the favorable traits of a crop such as the yield or quality of the crop harvest. If it can be recognized as such by the breeder, it is removed by selection during the breeding process.

There is a large body of theory and also empirical studies in population genetics on the question of which proportion of the variation segregating in populations belongs to each of these three classes, and which methods are suitable to classify genetic variants into these classes.

3 See, for example, our work on deleterious mutations segregating in wild and crop plants ().

What are plant genetic resources (PGR)?

To introduce the topic of genetic diversity in the context of PGR, we first define plant genetic resources and then show why new genetic diversity in the form of PGR is important and required in the context of the paradox of genetic diversity discussed above.

The 4th international technical conference of FAO on PGR in Leipzig 1996 used the following definition for plant genetic resources:

4 See the report on the conference: Link.

… generatively or vegetatively reproducible material of plant of current or potential value, including landraces, related wild species and wild forms and special genetic material of crop plants

A much simpler definition was given by Pflanzenzüchtung ():

… the complete genetic material that is available for breeding of a crop plant.

In the age of genetic engineering and synthetic biology, we can broaden this definition even further to

any genetic material and variants that are naturally available or can be artificially generated and used in plant breeding.

Sources of new genetic variation

If genetic diversity (e.g., in a given crop species) is limited, there are several sources to utilize new genetic variation. The sources include

  • New and minor crops (Neodomestication)
  • Close relatives of crops (Neodomestication, Redomestication, Introgression)
  • Wild ancestors of crop species
  • Land races and traditional cultivars
  • Modern elite varieties of different geographic origin
  • Induced mutations by chemical (Mutagens), physical (Irradiation) or biological (Genome editing) treatments

In the breeding of commercial varieties, the released varieties of competitors (if they are not patented) are likely the main source of new genetic variation. This is possible because of the breeder’s exemption granted by plant variety protection laws in many countries. It is also a meaningful approach because released commercial varieties are usually purged of much of the deleterious variation that frequently segregates in traditional cultivars that have not been improved by breeding (Figure 5).

Figure 5: Segregation of a recessive deleterious mutation in the traditional cultivar Rheintaler Ribelmais. Each row represents the offspring of an S1 family obtained by self-fertilization of a single individual in the parental generation. One of the families carries a recessive deleterious allele that causes chlorosis, i.e., the lack of chlorophyll leading to a white plant that will die soon after germination.

Crop gene pools as sources of new genetic diversity

In an effort to have a more rational and systematic access to new genetic diversity for crop plants, Harlan and de Wet developed the concept of crop gene pools (Harlan and Wet ()). Essentially, for each crop several gene pools are defined that describe the genetic and taxonomic relationship to a given crop plant (Figure 6).

Figure 6: A general definition of crop gene pools. Source: Acquaah ()

According to this definition, there are four gene pools:

Gene pool 1 (GP1):
This pool is defined by the biological species concept, which states that all genotypes that can be crossed with each other and produce fertile offspring belong to the same species. Therefore, GP1 usually includes the ancestors of crops as well as all subspecies of a species, which may be either cultivated types or spontaneous types that are phenotypically variable.
Gene pool 2 (GP2):
It includes closely related species that can be successfully crossed with the crop species, but produce offspring with limited viability or limited fertility, as indicated by partial sterility. This pool frequently includes close relatives that may harbor interesting disease resistance genes. Taxonomically, such species may be other species in the same genus.
Gene pool 3 (GP3):
This gene pool includes more distantly related species. Crosses with the crop species may produce offspring, but these either have an anomalous phenotype, are completely sterile, or die as embryos before developing into adult plants. Quite frequently, these barriers can be overcome by biotechnological treatments to produce offspring that can be included in further breeding.
Gene pool 4 (GP4):
It includes all other species where crosses with the crop species are not successful.

The first three gene pools can be utilized by classical breeding or by biotechnological approaches such as tissue culture or protoplast fusion. The fourth gene pool includes species that cannot be crossed with a given crop but may provide useful genes that can be introduced into the crop by genetic engineering.

In the following, the concept of gene pools as sources of novel and useful genetic diversity is demonstrated for canola, Brassica napus.

Figure 7: Canola (Brassica napus). Source: Wikipedia.

The primary gene pool consists of

  • Modern elite varieties of canola (Brassica napus)
  • Landraces and traditional cultivars
  • German Brassica napus (rapeseed) (Wikipedia)

The secondary gene pool includes

  • Brassica rapa (Turnip; several subspecies used as vegetables) (Wikipedia)
  • Brassica oleracea (Cabbage and derived vegetables) (Wikipedia)
  • Brassica nigra (Black mustard) (Wikipedia)

The four Brassica species have a special chromosomal relationship that is called the ‘Triangle of U’, which essentially states that the genomes of three ancestral species merged to form the genomes of several modern Brassica vegetables (Wikipedia; see also the lecture on gene flow).

The four species can be crossed with each other relatively easily and produce F1 hybrids ().

The tertiary gene pool includes

Somatic cell lines between Arabidopsis thaliana and Brassica nigra could be produced by protoplast fusion (), which demonstrates that genes can be transferred between the two species. However, further complications may arise by the fact that the chromosomes between the two species may not recombine because of extensive sequence differences.

Key questions in the context of PGR

Based on the above considerations, several questions arise with respect to the level and use of genetic diversity in plant genetic resources and their use in breeding:

  • What is the level of genetic diversity within and between modern elite breeding populations?
  • Does ‘exotic’ breeding material have a higher level of genetic variation that can be introgressed into modern varieties?
  • What type of genetic diversity is useful or deleterious?
  • How can useful genetic variation be identified?

Much contemporary research on genetic resources is focused on answering these questions.

An example of a successful introduction of new variation

The effect of a loss in genetic diversity and the subsequent gain by introgression of plant genetic resources can be demonstrated with the breeding history of tomato in the Netherlands (). The wild ancestors of cultivated tomato are Solanum lycopersicum var. lycopersicum, S. lycopersicum var. cerasiforme, and S. pimpinellifolium. From these ancestors, heirloom types and landraces originated by domestication processes. Inbreeding and selection among these old (vintage) varieties led to commercial varieties in the 1960s with a very low genetic and phenotypic diversity. During this time Dutch tomatoes developed a reputation for being tasteless and watery, which led to the “waterbomb” (in German: “Wasserbombe”) crisis. Breeders reacted and introduced more exotic genetic variation into the breeding material to improve multiple aspects, including taste, shape and disease resistance. From the 1970s onwards, resistances to diseases and pests were introgressed from distant species, including S. peruvianum, S. pennellii, S. chilense, and S. habrochaites, increasing genetic diversity among commercial tomato varieties considerably. After the 1980s, fruit size, color, and flavor started to vary substantially, further increasing the genetic diversity of modern varieties.

As a result, the genetic diversity of Dutch tomato varieties increased. The temporal dynamics are shown in Figure 8. The data are based on genotyping 90 varieties with a SNP array and calculating Nei’s index H for genetic diversity, which will be introduced below.

Figure 8: The evolution of genetic diversity in tomato. The upper left group represents the wild ancestor species, which gave rise to vintage types and landraces. Source: Schouten et al. ()

Quantification of genetic variation

In current studies of genetic variation, essentially two types of data are used: genetic markers such as single nucleotide polymorphisms (SNPs), and DNA sequences. Markers target individual polymorphisms with a known location in a genome using an experimental assay, whereas DNA sequencing captures the complete set of polymorphisms in a genomic region or even the whole genome. To derive the theory for the quantification of genetic variation, we first focus on SNPs, i.e., individual genetic polymorphisms.

We first introduce the following conventions: A locus is indicated with a capital letter. A, B, etc. are different markers (e.g., SNPs) with the alleles $A_1$, $A_2$, etc.

The relative frequency of allele $A_1$ is $p$, and that of allele $A_2$ is $q$.

If there are more than two alleles, frequencies are expressed as $p_1, p_2, p_3, \ldots$

There are three genotypes in a population of diploid organisms: two homozygous marker genotypes $A_1A_1$ and $A_2A_2$, and the heterozygous genotype $A_1A_2$.

The relative frequencies in the population are written as $x_{ij}$:

Genotype: $A_1A_1$ $A_1A_2$ $A_2A_2$
Relative frequency: $x_{11}$ $x_{12}$ $x_{22}$

The sum of relative frequencies is 1:

$$x_{11} + x_{12} + x_{22} = 1 \tag{1}$$

From genotype frequencies, the allele frequencies can be calculated directly. The frequency $p$ of allele $A_1$ is

$$p = x_{11} + \frac{1}{2} x_{12}, \tag{2}$$

and the frequency $q$ of allele $A_2$ is

$$q = 1 - p = x_{22} + \frac{1}{2} x_{12}. \tag{3}$$
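
As a small illustration of the relationship between genotype and allele frequencies at a biallelic marker, the following sketch computes p and q from the three genotype frequencies. The function name and the example frequencies are made up for illustration.

```python
def allele_frequencies(x11: float, x12: float, x22: float) -> tuple[float, float]:
    """Return (p, q) for alleles A1 and A2 from relative genotype frequencies."""
    total = x11 + x12 + x22
    if abs(total - 1.0) > 1e-9:
        raise ValueError("genotype frequencies must sum to 1")
    # Homozygotes carry two copies of an allele, heterozygotes one copy,
    # hence the factor 1/2 for the heterozygote frequency.
    p = x11 + 0.5 * x12
    q = x22 + 0.5 * x12
    return p, q


# Hypothetical genotype frequencies for A1A1, A1A2 and A2A2
p, q = allele_frequencies(0.5, 0.3, 0.2)
print(f"p = {p:.2f}, q = {q:.2f}")
```
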

Some markers or marker types (such as simple sequence repeats, SSRs) have more than two alleles, but the calculation of relative allele frequencies is straightforward.

For a marker with n alleles we write

Alleles: $A_1$ $A_2$ … $A_i$ … $A_j$ … $A_n$
Frequencies: $p_1$ $p_2$ … $p_i$ … $p_j$ … $p_n$

Indices $i$ and $j$ indicate different alleles, and $x_{ij}$ gives the frequency of genotype $A_iA_j$.

Under the assumption that $i \neq j$ in heterozygotes and, as a convention, $i < j$, we obtain a total frequency of 1 when summing over the frequencies of all genotypes.

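With $n$ alleles segregating at a locus we write

$$x_{11} + x_{22} + \dots + x_{nn} + x_{12} + x_{13} + \dots + x_{(n-1)n} = \sum_{i=1}^{n} \sum_{j=i}^{n} x_{ij} = 1. \tag{4}$$

Another measure of genetic variation is Nei's gene diversity, $H$, which is an estimator of the average expected heterozygosity of markers ().

It is essentially an estimate of the probability that two randomly chosen alleles in a population are not identical, and is calculated for a single marker as

$$H = 1 - \sum_{i=1}^{n} p_i^2, \tag{5}$$

where $n$ is the number of alleles of a marker and $p_i$ is the observed frequency of allele $i$. An unbiased estimator is given by

$$\hat{H} = \frac{n}{n-1} \left(1 - \sum_{i=1}^{n} p_i^2\right). \tag{6}$$

In the prefactor $n/(n-1)$ of this small-sample correction, $n$ is usually taken to be the number of sampled allele copies rather than the number of distinct alleles. Gene diversity is the most widely used measure of genetic variation of markers, because it is easy to calculate and can be used for biallelic and multiallelic markers.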
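
The two versions of gene diversity can be sketched in a few lines. The function names are illustrative, and the sketch assumes that the correction factor uses the number of sampled allele copies (here the hypothetical argument `n_sampled`), which is one common reading of the small-sample correction.

```python
def gene_diversity(freqs):
    """Gene diversity H = 1 - sum(p_i^2) for one marker."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


def gene_diversity_corrected(freqs, n_sampled):
    """Small-sample corrected H; n_sampled is assumed to be the
    number of sampled allele copies (an assumption of this sketch)."""
    if n_sampled < 2:
        raise ValueError("need at least two sampled allele copies")
    return n_sampled / (n_sampled - 1) * gene_diversity(freqs)


# Hypothetical SNP with two equally frequent alleles, 20 sampled copies
print(gene_diversity([0.5, 0.5]))
print(gene_diversity_corrected([0.5, 0.5], n_sampled=20))
```

The same functions work for multiallelic markers such as SSRs, because the sum simply runs over all allele frequencies.
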

A related measure is polymorphism information content (PIC), which was originally introduced for use in human genetics ().

It refers to the value of a marker for detecting polymorphism within a population, depending on the number of detectable alleles and the distribution of their frequency for a given marker.

A measure of informativeness is highly useful when selecting parents for genetic mapping. Markers that differ between the parents have a high chance of being polymorphic among the offspring, which maximizes the power of a marker set to detect a quantitative trait locus (QTL) in a genetic mapping study.

For outbred and heterozygous individuals, the PIC value of a marker is defined as

$$\mathrm{PIC} = 1 - \sum_{i=1}^{n} p_i^2 - \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} 2 p_i^2 p_j^2, \tag{7}$$

where $n$ is the total number of alleles of a locus, and $p_i$ and $p_j$ are the frequencies of alleles $i$ and $j$, respectively.

The occurrence of rare alleles has less impact on the PIC than alleles occurring with high frequencies.
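
The PIC definition for outbred individuals can be sketched as follows. The function name `pic` and the example allele frequencies are made up for illustration.

```python
from itertools import combinations


def pic(freqs):
    """Polymorphism information content of one marker.

    Computes 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2.
    """
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    sum_squares = sum(p ** 2 for p in freqs)
    # combinations() yields every unordered pair (i < j) exactly once.
    pair_term = sum(2 * pi ** 2 * pj ** 2 for pi, pj in combinations(freqs, 2))
    return 1.0 - sum_squares - pair_term


# Hypothetical biallelic markers: balanced versus skewed allele frequencies
print(pic([0.5, 0.5]))
print(pic([0.9, 0.1]))
```

In this example the marker with one rare allele has a clearly lower PIC than the marker with balanced allele frequencies.
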

A simplified version was developed assuming that the inbred individuals selected for mapping are homozygous ().

Then, the PIC of a marker $i$ is

$$\mathrm{PIC}_i = 1 - \sum_{j=1}^{n} p_{ij}^2, \tag{8}$$

where $p_{ij}$ is the frequency of the $j$-th pattern for marker $i$ and the summation extends over $n$ patterns.

Note that this measure corresponds to Nei’s gene diversity measurement.

The PIC for multiple markers can be calculated by taking the average of PIC values of each marker.

Depending on the type of mapping study, PIC values can be used to select markers for genotyping or individuals for creating a mapping population. Therefore, this measure is frequently used in the design of genotyping arrays in breeding programs.

Quantification of DNA sequence variation

In comparison to SNP markers, DNA sequencing identifies all polymorphisms of a locus, and one obtains a group of coupled markers.

The different combinations of alleles of different polymorphisms on the chromosome, in a genomic region or in a gene are called haplotypes.

The following sequence alignment shows the hypothetical DNA sequence of a short region of 10 base pairs from two outbred, heterozygous, diploid individuals.

                A)            B)
Position        1234567890    1234567890
Individual 1    GATCGAACAG    G.T......G
Individual 1    GAACGAACAT    G.A......T
Individual 2    TATCGAACAG    T.T......G
Individual 2    TATCGAACAG    T.T......G

The difference between A) and B) is that in the latter, only the variable positions in the sequence alignments are shown with letters and invariable sites with dots.

In this sample, three single-nucleotide polymorphisms are observed and three haplotypes (i.e. combinations of polymorphisms). Individual 2 is homozygous at this region, because both sequences are identical (they are the same haplotype). Using this sequence alignment, several descriptive statistics can be calculated.

If each sequence differs from every other one (given that the sequenced region is long enough), heterozygosity as a measure of genetic variation becomes obsolete because nearly every individual is heterozygous.

Instead, a new measure is defined, which is called nucleotide polymorphism.

It describes the proportion of nucleotide positions in a sample that are polymorphic and is calculated as

$$P_n = \frac{n_P}{n_t} \tag{9}$$

with $n_P$ as the number of polymorphic nucleotide positions and $n_t$ as the total number of nucleotide positions in the sequenced region.

In the above sequence alignment, $P_n = 3/10 = 0.3$.
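
The value for the example alignment can be checked with a short sketch that scans the alignment column by column. The function names are illustrative, and the sketch assumes equal-length sequences without gaps or missing data.

```python
# The four sequences of the example alignment (two per individual)
alignment = [
    "GATCGAACAG",  # Individual 1, first sequence
    "GAACGAACAT",  # Individual 1, second sequence
    "TATCGAACAG",  # Individual 2, first sequence
    "TATCGAACAG",  # Individual 2, second sequence
]


def polymorphic_positions(seqs):
    """Return the 1-based positions at which more than one nucleotide occurs."""
    return [
        pos
        for pos, column in enumerate(zip(*seqs), start=1)
        if len(set(column)) > 1
    ]


def nucleotide_polymorphism(seqs):
    """Proportion of polymorphic positions among all aligned positions."""
    return len(polymorphic_positions(seqs)) / len(seqs[0])


print(polymorphic_positions(alignment))
print(nucleotide_polymorphism(alignment))
```
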

A second, and more widely used measure for DNA sequence variation is nucleotide diversity, π.

It is defined as the average pairwise nucleotide difference and can be calculated as

$$\pi = \frac{1}{n(n-1)/2} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} p_{ij}, \tag{10}$$

where $p_{ij}$ is the proportion of nucleotide differences between alleles $i$ and $j$ in a sample of $n$ alleles, and $n(n-1)/2$ is the number of pairwise comparisons. For example, if two alleles differ at 3 of 10 nucleotide positions, $p_{ij} = 0.3$.

This measure describes the average proportion of nucleotide positions that are polymorphic (i.e., different) between any two randomly chosen alleles of a locus in a population.

A more efficient calculation of nucleotide diversity is achieved by first counting the number of haplotypes and the differences between the different haplotypes. If $\pi_{ij}$ is the proportion of nucleotide differences between haplotypes $i$ and $j$, and $p_i$ and $p_j$ are the relative frequencies of the two haplotypes, then nucleotide diversity is calculated in a sample with $k$ distinct haplotypes as

$$\pi = \sum_{i=1}^{k} \sum_{j=1}^{k} p_i p_j \pi_{ij}. \tag{11}$$

Yet another simple method to calculate $\pi$ is given in the following steps.

5 The method is taken from Hartl and Clark (2007), p. 34 f

  1. Calculate average number of pairwise mismatches between haplotypes
  2. Count number of mismatches for every pair of aligned sequences
  3. For $n$ sequences, there are $n(n-1)/2$ possible pairs of sequences
  4. Calculate $\Pi$: $\Pi = \frac{\text{Total number of nucleotide mismatches}}{\text{Total number of pairwise comparisons}}$
  5. To compare genes of different length, calculate on a per-nucleotide basis: $\pi = \Pi / L$, where $L$ is the length of the alignment in nucleotides
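
The steps above can be sketched directly in code. The example uses the four sequences of the hypothetical alignment shown earlier; the function name is illustrative, and the sketch assumes equal-length sequences without gaps.

```python
from itertools import combinations

alignment = ["GATCGAACAG", "GAACGAACAT", "TATCGAACAG", "TATCGAACAG"]


def nucleotide_diversity_pairwise(seqs):
    """Average number of pairwise mismatches, per nucleotide."""
    pairs = list(combinations(seqs, 2))  # n(n-1)/2 pairs of sequences
    total_mismatches = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    big_pi = total_mismatches / len(pairs)  # mismatches per comparison
    return big_pi / len(seqs[0])            # scale by alignment length L


print(nucleotide_diversity_pairwise(alignment))
```

For this alignment there are 10 mismatches in 6 pairwise comparisons, so Π = 10/6 and π = Π/10 ≈ 0.167.
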

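The value of $\pi$ calculated from haplotype frequencies as in Equation 11 describes the nucleotide diversity of a population. However, when the haplotype frequencies are estimated from a sample, it is biased and cannot be used directly as an estimator of the population parameter.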

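Nei () proposed a simple correction that results in an unbiased estimator and is calculated as

$$\pi = \frac{n}{n-1} \sum_{i=1}^{k} \sum_{j=1}^{k} p_i p_j \pi_{ij}, \tag{12}$$

where $n$ is the number of sequences in the sample.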
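
The haplotype-based calculation, with and without the correction factor, can be sketched as follows. The function name and the `corrected` flag are illustrative, and the sketch assumes that n is the number of sampled sequences and that haplotype frequencies are estimated from the same sample.

```python
from collections import Counter

alignment = ["GATCGAACAG", "GAACGAACAT", "TATCGAACAG", "TATCGAACAG"]


def pi_from_haplotypes(seqs, corrected=True):
    """Nucleotide diversity from haplotype frequencies.

    With corrected=True the double sum is multiplied by n/(n-1),
    where n is the number of sampled sequences.
    """
    n = len(seqs)
    length = len(seqs[0])
    counts = Counter(seqs)  # distinct haplotypes and their counts
    total = 0.0
    for hap_i, c_i in counts.items():
        for hap_j, c_j in counts.items():
            # Proportion of differing sites between the two haplotypes
            pi_ij = sum(a != b for a, b in zip(hap_i, hap_j)) / length
            total += (c_i / n) * (c_j / n) * pi_ij
    return n / (n - 1) * total if corrected else total


print(pi_from_haplotypes(alignment, corrected=False))
print(pi_from_haplotypes(alignment, corrected=True))
```

For the example alignment the uncorrected value is 0.125, and the corrected value is 1/6, which equals the average over all pairwise comparisons of the four sequences.
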

Both statistics have in common that they can be used to estimate a very important parameter of populations, namely $\theta = 4N_e\mu$, where $N_e$ is the effective population size and $\mu$ the mutation rate, expressed as the number of mutations per nucleotide per generation. The parameter $\theta$ is often called the population mutation parameter and describes the expected level of genetic diversity in equilibrium populations, in which the processes of mutation and genetic drift are in equilibrium.

6 Consult introductory population genetics textbooks for an explanation.

It should be noted, however, that nucleotide polymorphism, $P_n$, needs to be modified to be an estimator of $\theta$ as

$$\hat{\theta}_W = \frac{P_n}{a_n}, \tag{13}$$

where the factor $a_n$ is

$$a_n = \sum_{i=1}^{n-1} \frac{1}{i}. \tag{14}$$
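
A minimal sketch of this estimator, applied to the example alignment from above, is shown below. The function name is illustrative, and n is taken to be the number of sequences in the sample.

```python
alignment = ["GATCGAACAG", "GAACGAACAT", "TATCGAACAG", "TATCGAACAG"]


def watterson_theta(seqs):
    """Watterson's estimator per nucleotide: P_n divided by a_n."""
    n = len(seqs)
    n_polymorphic = sum(len(set(column)) > 1 for column in zip(*seqs))
    p_n = n_polymorphic / len(seqs[0])
    # a_n is the harmonic sum 1 + 1/2 + ... + 1/(n-1)
    a_n = sum(1.0 / i for i in range(1, n))
    return p_n / a_n


print(watterson_theta(alignment))
```

With four sequences, a_n = 1 + 1/2 + 1/3 = 11/6, so the estimate is 0.3 / (11/6) ≈ 0.164.
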

The symbol $\theta_W$ stands for Watterson’s $\theta$, named after Watterson (), who showed that it is an estimator of $\theta = 4N_e\mu$. If nucleotide diversity is used as an estimator of $\theta$, it is often designated as $\hat{\theta}_\pi$.

Both $\pi$ and $\theta_W$ are unbiased estimators of the population mutation parameter, $\theta = 4N_e\mu$, in populations that evolve without selection. If the mutation rate is known, they can be used, for example, to estimate the effective population size of a population.
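
Rearranging θ = 4Neμ gives Ne = θ / (4μ). The sketch below applies this to purely hypothetical values of θ and μ; both numbers are chosen for illustration only and are not taken from any study.

```python
def effective_population_size(theta, mu):
    """Solve theta = 4 * Ne * mu for Ne.

    theta: estimate of the population mutation parameter per nucleotide
    mu:    mutation rate per nucleotide per generation
    """
    if mu <= 0:
        raise ValueError("mutation rate must be positive")
    return theta / (4.0 * mu)


# Hypothetical values: theta = 0.01, mu = 1e-8
print(round(effective_population_size(0.01, 1e-8)))
```
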

There are other statistics that can be used to describe genetic variation in sequencing data. They include haplotype diversity and the mismatch distribution. For example, the haplotype diversity is calculated as

\begin{equation} \label{hapdiv} h = \frac{n}{n-1}\left(1-\sum_{i=1}^n x_i^2\right) \end{equation}

where $x_i$ is the frequency of haplotype $i$ and $n$ is the number of different haplotypes. Frequently, these estimators are highly correlated, because they measure similar properties of the data.
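
Haplotype diversity can be sketched in the same way as gene diversity. The function name is illustrative, and the sketch assumes that the factor n/(n-1) uses the number of sampled sequences, which is how this correction is commonly applied.

```python
from collections import Counter

alignment = ["GATCGAACAG", "GAACGAACAT", "TATCGAACAG", "TATCGAACAG"]


def haplotype_diversity(seqs):
    """Haplotype diversity h with the n/(n-1) correction.

    Assumption of this sketch: n is the number of sampled sequences.
    """
    n = len(seqs)
    counts = Counter(seqs)  # each distinct sequence is one haplotype
    sum_squares = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_squares)


print(haplotype_diversity(alignment))
```

The example alignment contains three haplotypes with frequencies 0.25, 0.25 and 0.5, which gives h = 4/3 × 0.625 ≈ 0.833.
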

In most studies of crop genetic diversity that use sequences, nucleotide diversity, π, is used as measure of genetic diversity.

Limitations of current estimators of genetic diversity

The sequencing of complete plant genomes and the resequencing of the genomes of multiple individuals revealed that very complex types of genomic polymorphisms segregate in plant populations. They result from insertions and deletions of DNA, from the activity of transposable elements (TEs) and from gene duplications, which create many repetitive regions in plant genomes. For this reason, plant genomes tend to be much more repetitive than animal genomes (Figure 9).

Figure 9: Comparison of size and repetitiveness of plant and animal genomes based on whole genome sequence assemblies. Repetitiveness was measured as proportion of non-unique 31-mers among all 31-mers in a genome sequence (31-mers: Sequence “words” with a length of 31 nucleotides). Source: Jiao and Schneeberger ()
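
The repetitiveness measure described in the caption can be sketched for a toy sequence. This is only one possible reading of the definition: the sketch counts every occurrence of a k-mer that appears more than once, and it ignores reverse complements and ambiguous bases. The function name and the toy sequence are made up for illustration.

```python
from collections import Counter


def kmer_repetitiveness(sequence, k=31):
    """Proportion of k-mer occurrences whose k-mer is not unique."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    counts = Counter(kmers)
    # Sum the occurrences of all k-mers that appear more than once.
    non_unique = sum(c for c in counts.values() if c > 1)
    return non_unique / len(kmers)


# Toy example with k=4 instead of 31 to keep the sequence short
print(kmer_repetitiveness("ACGTACGTAC", k=4))
```
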

As an example, consider a genomic region of the model plant Arabidopsis thaliana that was sequenced at high quality in three individuals collected at different locations of the species distribution range and in which all genes (and gene fragments) were annotated. It shows the complexity of rearrangements (Figure 10). As a consequence of these rearrangements, no simple multiple alignment for calculating the nucleotide diversity measures discussed above is possible, and new measures for describing genetic diversity have to be developed.

Figure 10: Schematic alignment of mapping interval. Genes are indicated with green boxes. Insertions relative to Col-0 accessions are indicated with red triangles, deletions in beige, and duplications in blue. The last two digits of AT5G417XX gene identifiers are given above each line. Parentheses for Uk-1 indicate that these open reading frames are incomplete; asterisk indicates stop codon. Transposons are labeled below each line: MULE (M), copia (C), LINE (L), gypsy (G), and gypsy-associated UlpI protease sequence (U) Source: Bomblies et al. ()

Similar rearrangements are observed in crop species such as grapevine (Vitis vinifera) or maize. For example, Figure 11 shows the schematic alignment of a genomic region from Chardonnay and Cabernet Sauvignon. The alignment shows a high degree of genetic differentiation caused by the presence and absence of genes between the two haplotypes within each variety and, in addition, between the two varieties. In total, the two varieties differ by 2,217 genes (6% of all genes), which are present in one variety and absent in the other.

Figure 11: Haplotypes of the sex (SD) region among grapevine cultivars. Chardonnay is homozygous hermaphroditic (HH) and both haplotypes from Char04 are shown. Cabernet Sauvignon is heterozygous (HF), with Haplotype 1 of Cab08 representing the presumed H haplotype. Protein-coding genes are coloured according to their functional annotation. Genes that are not shared among genome assemblies are coloured in black. Source: Zhou et al. ()

In summary, these comparisons show that complex polymorphisms result from different, interacting types of mutations such as point mutations, insertions and deletions, rearrangements and transposon insertions. Such polymorphisms cannot be described with the simple, nucleotide-based measures of genetic differentiation. Unfortunately, no suitable statistics have been developed yet to quantify this type of genetic variation, but there are ongoing developments to model the diversity of these complex variants, based either on k-mers (sequence words) or on genome graphs.
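To give an intuition for what a k-mer based comparison can look like, the following Python sketch computes the Jaccard distance between the k-mer sets of two sequences. This is one common alignment-free measure, chosen here only for illustration. It is not a statistic proposed in the text above, and the function names and toy haplotypes are hypothetical.

```python
def kmer_set(seq, k):
    """Return the set of all k-mers (sequence words of length k) in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard_distance(seq_a, seq_b, k=21):
    """1 - |A & B| / |A | B| for the k-mer sets A and B of two sequences.

    0.0 means identical k-mer content, 1.0 means no shared k-mer.
    Assumption: only the forward strand is considered.
    """
    a = kmer_set(seq_a, k)
    b = kmer_set(seq_b, k)
    union = a | b
    if not union:
        raise ValueError("both sequences are shorter than k")
    return 1.0 - len(a & b) / len(union)


# Toy usage: the second haplotype carries an insertion relative to the first.
hap1 = "ACGTTGCAAGCTTAGGCTAACGGATCCATG"
hap2 = "ACGTTGCAAGCTTTTTTTTTAGGCTAACGGATCCATG"
print(kmer_jaccard_distance(hap1, hap2, k=5))
```

Because no alignment is needed, such a distance can still be computed for regions with insertions, deletions and rearrangements, although it does not distinguish between these mutation types.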

Key concepts

Summary

  • Nowadays, DNA-based markers such as single nucleotide polymorphisms and DNA sequences are the method of choice for analysing genetic variation.
  • Plant breeding depletes genetic variation over time, which slows genetic gain per unit of time.
  • Genetic resources are genetic material available for breeding, which increases genetic diversity in breeding pools or breeding programs.
  • The gene pool concept describes the relationship between a crop species and its genetic resources.
  • The quantification of genetic variation differentiates between allele and genotype frequencies.
  • The number of polymorphic loci (polymorphism) and the heterozygosity are used as measures to estimate and compare levels of genetic variation in populations.
  • Marker- and sequence-based estimates of DNA variation are derived from polymorphism and heterozygosity.
  • For markers, the most important estimate of diversity is gene diversity, H. For DNA sequences, the most important estimates are nucleotide diversity, π, and Watterson's θ_W, which are both estimators of the population mutation parameter θ = 4N_eμ.
  • DNA sequencing revealed very complex polymorphisms segregating in plant populations whose diversity is difficult or impossible to quantify using classical diversity statistics.
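The marker-based measure named in the summary can be sketched in a few lines of Python. The sketch assumes the commonly used form of Nei's gene diversity, H = 1 - Σ p_i², with an optional sample-size correction n/(n - 1). Whether the module uses the corrected or uncorrected form should be checked against the definition given earlier in the text. The function name `gene_diversity` and the example frequencies are hypothetical.

```python
def gene_diversity(freqs, n=None, tol=0.01):
    """Gene diversity H = 1 - sum(p_i^2) for allele frequencies at one locus.

    freqs: allele frequencies that should sum to 1 (within tol, which
           allows for rounded input values).
    n:     optional number of sampled alleles; if given, the result is
           multiplied by n / (n - 1) as a small-sample correction.
    """
    if any(p < 0 for p in freqs):
        raise ValueError("allele frequencies must be non-negative")
    if abs(sum(freqs) - 1.0) > tol:
        raise ValueError("allele frequencies must sum to 1")
    h = 1.0 - sum(p * p for p in freqs)
    if n is not None:
        if n < 2:
            raise ValueError("n must be at least 2")
        h *= n / (n - 1)
    return h


# Toy usage with a bi-allelic marker at two different allele frequencies.
print(gene_diversity([0.5, 0.5]))
print(gene_diversity([0.9, 0.1]))
```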

Further reading

  • Any recent introductory textbook has a section on genetic diversity measurements.
  • Lynch and Walsh: Genetics and Analysis of Quantitative Traits (1998), p. 492-495 - A description and evaluation of marker informativeness
  • Turner-Hissong et al. () - A short review that derives the importance of genetic diversity and evolutionary concepts for plant breeding and outlines some concepts relevant for the module.

Study questions

  1. What exactly is the paradox in the paradox of genetic diversity?
  2. What is the difference between a polymorphism and a genetic marker?
  3. What are different types of marker systems?
  4. What is the difference between the gene diversity and PIC measures of genetic diversity?
  5. What is the difference between nucleotide polymorphism and nucleotide diversity estimators of genetic variation?
  6. Which parameter is estimated by nucleotide diversity or nucleotide polymorphism, and why is it useful to know this parameter?

Problems

  1. Calculate gene diversity for a tri-allelic marker and the following allele frequencies:
p1      p2      p3      Gene diversity, H
0.1     0.1     0.8
0.333   0.333   0.333
0.5     0.25    0.25
0.998   0.001   0.001

For which allele frequencies is H the highest? Which conclusion can you draw from the H values?

  2. Calculate the parameters of the sequence alignment:
seq1 GATCTATATA
seq2 GAACTATATA
seq3 CATCATCATA
seq4 GACCTATATC

Calculate the following parameters: n (number of sequences), L (sequence length), S (number of segregating sites), nucleotide polymorphism, P, and nucleotide diversity, π.
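Hand calculations of this kind can be checked with a short Python sketch. It assumes an alignment without gaps or ambiguous bases, reports π as the mean number of pairwise differences per site, and reports Watterson's estimator per site as S / (a_n · L) with a_n = Σ 1/i for i = 1, ..., n - 1. The proportion of segregating sites, S / L, is returned separately, because the definition of P used in this module should be taken from the main text. The function name `alignment_stats` and the toy alignment are hypothetical, and the alignment is deliberately different from the one in the problem.

```python
from itertools import combinations


def alignment_stats(seqs):
    """Basic diversity statistics for a gap-free alignment of equal-length sequences."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")

    # A site is segregating if more than one nucleotide occurs in its column.
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)

    # Mean number of differences over all pairs of sequences, per site.
    pairs = list(combinations(seqs, 2))
    total_diff = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    pi = total_diff / len(pairs) / length

    # Watterson's estimator per site.
    a_n = sum(1.0 / i for i in range(1, n))
    theta_w = s / a_n / length

    return {"n": n, "L": length, "S": s,
            "seg_per_site": s / length, "pi": pi, "theta_w": theta_w}


# Toy usage with three sequences of length five.
toy = ["ACGTA", "ACGTT", "ACCTA"]
for name, value in alignment_stats(toy).items():
    print(name, value)
```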

References

Acquaah G. 2009. Principles of plant genetics and breeding. John Wiley & Sons.
Anderson JA, Churchill GA, Autrique JE, Tanksley SD, Sorrells ME. 1993. Optimizing parental selection for genetic linkage maps. Genome 36:181–6.
Birch PR, Cooke DE. 2013. The early days of late blight. eLife 2:e00954. doi:10.7554/eLife.00954
Bomblies K, Lempe J, Epple P, Warthmann N, Lanz C, Dangl JL, Weigel D. 2007. Autoimmune Response as a Mechanism for a Dobzhansky-Muller-Type Incompatibility Syndrome in Plants. PLoS Biology 5:e236. doi:10.1371/journal.pbio.0050236
Botstein D, White RL, Skolnick M, Davis RW. 1980. Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet 32:314–31.
FAO. 2010. The Second Report on the State of the World's Plant Genetic Resources.
Goss EM, Tabima JF, Cooke DEL, Restrepo S, Fry WE, Forbes GA, Fieland VJ, Cardenas M, Grunwald NJ. 2014. The Irish potato famine pathogen Phytophthora infestans originated in central Mexico rather than the Andes. Proceedings of the National Academy of Sciences 111:8791–8796. doi:10.1073/pnas.1401884111
Günther T, Schmid KJ. 2010. Deleterious amino acid polymorphisms in Arabidopsis thaliana and rice. Theoretical and Applied Genetics 121:157–168. doi:10.1007/s00122-010-1299-4
Harlan JR, Wet JMJ de. 1971. Toward a Rational Classification of Cultivated Plants. Taxon 20:509. doi:10.2307/1218252
Hauser TP, Shaw RG, Ostergard H. 1998. Fitness of F1 hybrids between weedy Brassica rapa and oilseed rape (B. napus). Heredity 81:429–435. doi:10.1046/j.1365-2540.1998.00424.x
Jiao W-B, Schneeberger K. 2017. The impact of third generation genomic technologies on plant genome assembly. Current Opinion in Plant Biology 36:64–70. doi:10.1016/j.pbi.2017.02.002
Jorasch P. 2019. The global need for plant breeding innovation. Transgenic Research 28:81–86. doi:10.1007/s11248-019-00138-1
Khoury CK, Brush S, Costich DE, Curry HA, Haan S de, Engels JMM, Guarino L, Hoban S, Mercer KL, Miller AJ, Nabhan GP, Perales HR, Richards C, Riggins C, Thormann I. 2022. Crop genetic erosion: Understanding and responding to loss of crop diversity. New Phytologist 233:84–118. doi:10.1111/nph.17733
Nei M. 1987. Molecular Evolutionary Genetics. New York: Columbia University Press.
Nei M. 1973. Analysis of gene diversity in subdivided populations. Proc Natl Acad Sci USA 70:3321–3.
Pflanzenzüchtung. 1993. 1st ed. Ulmer Verlag.
Schouten HJ, Tikunov Y, Verkerke W, Finkers R, Bovy A, Bai Y, Visser RGF. 2019. Breeding Has Increased the Diversity of Cultivated Tomato in The Netherlands. Frontiers in Plant Science 10. doi:10.3389/fpls.2019.01606
Siemens J, Sacristán MD. 1995. Production and characterization of somatic hybrids between Arabidopsis thaliana and Brassica nigra. Plant Science 111:95–106. doi:10.1016/0168-9452(95)04221-F
Tanksley SD, McCouch SR. 1997. Seed banks and molecular maps: Unlocking genetic potential from the wild. Science 277:1063–1066.
Turner-Hissong SD, Mabry ME, Beissinger TM, Ross-Ibarra J, Pires JC. 2020. Evolutionary insights into plant breeding. Current Opinion in Plant Biology 54:93–100. doi:10.1016/j.pbi.2020.03.003
Watterson GA. 1975. On the number of segregating sites in genetical models without recombination. Theoret Pop Biol 7:256–276.
Wouw M van de, Hintum T van, Kik C, Treuren R van, Visser B. 2010. Genetic diversity trends in twentieth century crop cultivars: A meta analysis. Theoretical and Applied Genetics 120:1241–1252. doi:10.1007/s00122-009-1252-6
Yoshida K, Schuenemann VJ, Cano LM, Pais M, Mishra B, Sharma R, Lanz C, Martin FN, Kamoun S, Krause J, et al. 2013. The rise and fall of the Phytophthora infestans lineage that triggered the Irish potato famine. eLife 2:e00731.
Zadoks JC. 2008. The Potato Murrain on the European Continent and the Revolutions of 1848. Potato Research 51:5–45. doi:10.1007/s11540-008-9091-4
Zhou Y, Minio A, Massonnet M, Solares E, Lv Y, Beridze T, Cantu D, Gaut BS. 2019. The population genetics of structural variants in grapevine domestication. Nature Plants 5:965–979. doi:10.1038/s41477-019-0507-8