This document summarizes a seminar on association mapping in plants. It discusses how association mapping offers greater precision in locating quantitative trait loci (QTLs) than family-based linkage analysis by taking advantage of linkage disequilibrium across diverse populations. The key steps in association mapping are described, including population selection and structure analysis, high-throughput phenotyping and genotyping, measuring linkage disequilibrium, and association analysis to identify marker-trait links. Software for conducting association mapping and case studies in rice are also reviewed.
3. Introduction
• Polygenic inheritance of agronomic traits: they are controlled by multiple genes
whose expression is affected by many factors. Hence phenotypic selection
becomes a tedious job.
• Family mapping (Limitations-Biparental population, Low resolution,
Analysis of only 2 alleles, time consuming).
• Population or association mapping: (i) increased mapping resolution, (ii)
reduced research time, and (iii) greater allele number (Yu and Buckler,
2006).
• Association mapping identifies quantitative trait loci (QTLs) by examining
the marker-trait associations that can be attributed to the strength of linkage
disequilibrium between markers and phenotype across a set of diverse
4. Association mapping (AM)
• Association mapping, also known as "linkage disequilibrium
mapping", is a method of mapping quantitative trait loci (QTLs)
that takes advantage of linkage disequilibrium to link phenotypes
to genotypes.
• Offers greater precision in QTL location than family-based
linkage analysis.
• Does not require family or pedigree information; can be
applied to a range of experimental and non-experimental
populations.
5. How it works?
• Association studies are based on the assumption that a marker locus
is ‘sufficiently close’ to a trait locus so that some marker allele
would be ‘travelling’ along with the trait allele through many
generations during recombination.
6. Direct and Indirect Allelic Association
Direct Association
• The allele of interest is itself involved in the phenotype.
• Measure the disease relevance (*) of the variant D directly, ignoring
correlated markers nearby.
Indirect Association & LD
• The marker allele itself is not involved, but a nearby correlated
variant changes the phenotype.
• Assess trait effects on D via correlated markers (M1, M2, ..., Mn)
rather than susceptibility/etiologic variants.
7. Linkage mapping
In 1913, Alfred Sturtevant became the first person to construct a
(very small) genetic map.
A genetic map places genes/markers in order, indicates the relative genetic
distances between them, and assigns them to their chromosome.
Distance = Recombination frequency =
(No. of recombinants / Total progeny) × 100
Suppose the recombination between loci
A and B is 6%, that between loci B and C
is 20%, and that between A and C 24%,
then we can order the loci along the
chromosome as…
(Hartal et al., 2010)
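The ordering rule for three loci can be sketched in a few lines of Python. This is only an illustration of the reasoning on the slide: the function names are made up for this example, and the rule used (the pair with the largest recombination frequency forms the two outer loci) assumes three loci on the same chromosome.

```python
def recombination_frequency(recombinants, total_progeny):
    """Recombination frequency in percent (read as map distance in cM)."""
    return 100.0 * recombinants / total_progeny


def order_three_loci(pairwise_rf):
    """Order three linked loci from their pairwise recombination frequencies.

    pairwise_rf maps frozenset({locus1, locus2}) -> recombination frequency (%).
    The pair with the largest value is taken as the two outer loci, and the
    remaining locus is placed between them.
    """
    outer = max(pairwise_rf, key=pairwise_rf.get)
    all_loci = set().union(*pairwise_rf)
    (middle,) = all_loci - outer
    first, last = sorted(outer)
    return [first, middle, last]


# Values from the slide: A-B = 6%, B-C = 20%, A-C = 24%.
pairwise_rf = {
    frozenset("AB"): 6.0,
    frozenset("BC"): 20.0,
    frozenset("AC"): 24.0,
}
locus_order = order_three_loci(pairwise_rf)
distance_ab = recombination_frequency(6, 100)
print(locus_order, distance_ab)
```

Note that the two short intervals sum to 26%, slightly more than the 24% observed between the outer loci; undetected double crossovers make long intervals appear shorter than the sum of their parts.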
9. Marker-trait associations in experimental and natural populations
Experimental populations (e.g. F2, RIL): two parental alleles; small
genetic variation; few meiotic cycles; low resolution
Natural populations: many alleles; large genetic variation; many meiotic
cycles; high resolution
11. Advantages of AM over linkage mapping
Linkage Mapping | Association Mapping
Structured population (e.g. biparental population) | Unstructured population (e.g. germplasm lines)
Low resolution (few to several centimorgans away from gene/QTL) | High resolution (much closer than those by linkage mapping)
Only few alleles can be detected | Many alleles can be detected
Moderate marker density | High/moderate marker density
Feasible in annual and biennial species, not feasible in perennial species | Feasible in annual, biennial and perennial species
Narrow range | Wide range
Time consuming | Less time required
(Yu et al., 2006)
12. Types of association mapping
1. Genome wide association mapping: Search whole genome for causal
genetic variation. A large number of markers are tested for association
with various complex traits and it doesn’t require any prior information
on the candidate genes.
2. Candidate gene association mapping: Dissects the genetic control
of complex traits based on the available results from genetic,
biochemical, or physiological studies in model and non-model plant
species (Mackay, 2001). Requires identification of SNPs between lines
within specific genes.
15. Mapping population and Population structure
• Randomly or non-randomly mated germplasm can be used.
• Randomly mated populations represent a rather narrow group of
germplasm, which is likely to lower resolution and to harbor only a narrow
range of alleles.
• If non-randomly mated germplasm is used, population structure needs to
be controlled in the statistical analysis.
• Cluster analysis is used to assess the variation in the population, and the
most diverse individuals are selected from each cluster to represent it.
(Yu et al., 2006)
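To make "population structure" concrete, one common summary of how much allele frequencies differ between sub-populations is Wright's Fst. The slides do not name this statistic, so the sketch below is an illustrative assumption: it handles one biallelic marker, assumes equally sized sub-populations, and uses a hypothetical function name.

```python
def fst_biallelic(subpop_allele_freqs):
    """Fst for one biallelic marker.

    subpop_allele_freqs: frequency of the same allele in each sub-population.
    Assumes equally sized sub-populations (a simplification for illustration).
    Fst = (H_total - H_within) / H_total, with H = 2p(1 - p).
    """
    n = len(subpop_allele_freqs)
    p_bar = sum(subpop_allele_freqs) / n
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # marker is monomorphic across the whole sample
    h_within = sum(2 * p * (1 - p) for p in subpop_allele_freqs) / n
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies in two sub-populations.
fst_structured = fst_biallelic([0.9, 0.1])    # strongly differentiated
fst_unstructured = fst_biallelic([0.5, 0.5])  # no differentiation
print(fst_structured, fst_unstructured)
```

A marker with a high value differs between sub-populations for reasons unrelated to the trait, which is why structure has to be accounted for before a marker-trait association is trusted.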
16. Phenotyping
• Success of AM depends on accuracy and throughput of genotyping
• Replications across multiple years, locations and environments in
randomized plots.
• Field design: incomplete block design (lattice), RBD (Eskridge,
2003).
Should be done on the basis of
• Diversity: on the basis of phenotype and genotype
• Population structure: systematic difference in allele frequencies
between sub-populations…
17. Genotyping
• Mostly multiallelic, reproducible, PCR-based markers are used.
• Microsatellites, or simple sequence repeats (SSRs), and SNPs are
codominant, so they are more informative than dominant markers and,
therefore, more powerful.
• Due to higher genome density, lower mutation rate and wide
distribution throughout the genome SNPs are rapidly becoming the
marker of choice for complex trait
19. Linkage Disequilibrium Map & Allelic Association
Primary Aim of LD maps: To identify the
relationship between marker and QTL or trait of
interest.
[Figure: markers 1, 2, 3, ..., n in LD with the trait locus D]
20. Linkage disequilibrium (LD)
• LD refers to the non-random association of alleles at different loci.
• LD follows from the fact that closely located genes are transmitted as a block,
which only rarely breaks up in meiosis.
• Closely located genes are often in linkage disequilibrium with each other. As a
reference case, consider two independently segregating genes A and B with two
alleles each (A, a and B, b respectively).
• At equilibrium, the frequency of the AB haplotype equals the product of the
allele frequencies of A and B: $P_{AB} = P_A P_B$.
• Equivalently, $P_{AB} P_{ab} = P_{Ab} P_{aB}$ (1:1 ratio = no LD).
• Any deviation from these values implies LD.
 | A | a | Total
B | AB | aB | B
b | Ab | ab | b
Total | A | a |
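The 2 × 2 haplotype table can be turned into the usual LD statistics with a short Python sketch. It uses the coefficient D = P_AB − P_A·P_B given in the notes and the standard definitions of D' and r²; the function name is hypothetical, and D' is returned as an absolute value so that it falls between 0 and 1.

```python
def ld_statistics(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, |D'|, r^2) from the four haplotype frequencies."""
    total = p_AB + p_Ab + p_aB + p_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")

    # Marginal allele frequencies (row and column totals of the table).
    p_A = p_AB + p_Ab
    p_B = p_AB + p_aB

    # Coefficient of linkage disequilibrium.
    D = p_AB - p_A * p_B

    # Largest |D| possible for these allele frequencies.
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(D) / d_max if d_max > 0 else 0.0

    # Squared correlation between alleles at the two loci.
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = D * D / denom if denom > 0 else 0.0
    return D, d_prime, r2


# No LD: every haplotype at the product of its allele frequencies.
no_ld = ld_statistics(0.25, 0.25, 0.25, 0.25)
# Complete LD: only the AB and ab haplotypes exist.
complete_ld = ld_statistics(0.5, 0.0, 0.0, 0.5)
print(no_ld, complete_ld)
```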
23. LD Decay with time for four different recombination
fractions (ϴ)
(Powell et al., 2006)
24. Factors affecting LD
LD increases with population structure, relatedness (kinship), small
founder population size or genetic drift, and selection (natural or artificial).
Outcrossing, a high recombination rate, a high mutation rate, gene
conversion, etc. decrease or disrupt LD. Thus, LD declines with
(1) increasing genetic distance and (2) an increasing number of generations.
(Huttley et al., 2005)
25. Evaluation of linkage disequilibrium and associating
genotype- phenotype
• TASSEL (http://www.maizegenetics.net) is used to measure the
extent of LD as squared allele frequency correlation estimates ($r^2$;
Weir, 1996) and to test the significance of $r^2$.
• Besides TASSEL, many other software packages, such as DnaSP and
Arlequin, are used to calculate $D'$ and $r^2$.
26. Software used in AM
Sr. | Software | Focus | Description
1 | TASSEL | Association analysis | Free; LD statistics, sequence analysis, association mapping
2 | Haploview 4.2 | Haplotype analysis and LD | LD and haplotype block analysis, haplotype population frequency estimation, single-SNP and haplotype association tests
3 | SVS 7 | Stratification, LD and AM | Estimates stratification, LD, haplotype blocks and multiple AM approaches for up to 1.8 million SNPs and 10,000 samples
4 | GenStat | Stratification, LD and AM | SSR markers, GLM and MLM-PCA methods
5 | JMP Genomics | Stratification, LD and structured AM | SNPs, CG and GWAS, analysis of common and rare variants
6 | GenAMap | Stratification, LD and structured AM | SNPs, tree of functional branches, multiple visualization tools
7 | PLINK | Stratification, LD and structured AM | SNPs, multiple AM approaches, IBD and IBS analyses
8 | STRUCTURE | Population structure | Runs an MCMC Bayesian analysis to estimate the proportion of an individual's genome originating from each inferred population
9 | SPAGeDi | Relative kinship | Genetic relationship analysis
(Braulio et al., 2012)
27. Advantages of AM
1. Saves time, effort, and cost needed for the development of specific
mapping populations.
2. The QTL-linked markers identified by AM can be directly used for MAS
3. AM has high resolution
4. AM would assess the entire range of diversity in the trait of interest
5. Associated markers identified during AM can be used for either selection
of parents for hybridization or for selection of desirable segregants
28. Disadvantages of AM
1. The results from AM are affected by several factors, such as selection
history, population structure and kinship, which may lead to false-positive
associations.
2. A large number (hundreds of thousands or even millions) of markers
would be required to adequately cover the entire genome.
3. High-quality phenotypic data are required (multiple environments and
multiple locations).
4. The rate of recombination is not uniform throughout the genome.
29. Need of Association mapping in Rice
• Rice (Oryza sativa) is a staple food that feeds 3 billion people.
• Rice has the largest publicly available single-species germplasm
collection in the world and a very rich genomic database.
• Major agronomic traits in rice (grain yield, days to maturity,
height, etc.) show quantitative inheritance.
• The challenge and opportunity are to use this information to
understand and predict how genotypic variation gives rise to the
abundance of phenotypic variation, and to apply it in MAS.
30. Association mapping studies in Rice
Population | Sample size | Markers used | Trait | Reference
Germplasm | 523 | 5,291 SNPs | 12 agronomic traits | (Qing et al., 2015)
Diverse accessions | 203 | 154 SSRs | Harvest index | (Li et al., 2012)
Diverse rice accessions | 383 | 44,000 SNPs | Aluminum tolerance | (Famoso et al., 2011)
Diverse accessions | 413 | 44,000 SNPs | Agronomic traits | (Zhao et al., 2011)
Diverse accessions | 210 | 86 SSRs | Yield and grain quality | (Borba et al., 2010)
Landraces | 517 | 3,625,200 SNPs | 14 agronomic traits | (Xuehui et al., 2010)
Mini core collection | 90 | 108 SSRs | Stigma and spikelet characteristics | (Yan et al., 2009)
Diverse accessions | 103 | 123 SSRs | Yield and its components | (Agrama et al., 2007)
31. Case Study
517 landraces were phenotyped and genotyped by sequencing on the
Illumina Genome Analyzer II.
Sequence reads were aligned to the rice reference genome for SNP
identification.
Discrepancies with the rice reference genome were called as candidate
SNPs.
32. A total of 3,625,200 SNPs were identified, an average of
9.32 SNPs per kb, with 87.9% of the SNPs located within 0.2 kb of the
nearest SNP.
A total of 167,514 SNPs were found in the coding regions of 25,409
genes.
3,625 large-effect SNPs (mutations predicted to cause large
effects) were identified.
Principal-component analysis separated the rice germplasm into two
groups, indica and japonica.
Both indica and japonica were further divided into three subgroups.
33. Because of strong population differentiation between the two
subspecies of cultivated rice, GWAS was conducted only for the 373
indica lines, using a mixed linear model (MLM).
80 associations for the 14 agronomic traits were identified.
Heading date was strongly correlated with both population structure
and geographic distribution.
35. Conclusion
• The ultimate aim of plant breeding is the prediction of phenotype from
genotype.
• The major economic traits in agriculture are complex in nature.
• It is essential to dissect these complex traits and assign function to
the underlying loci.
• Advanced genomic tools like association mapping are a valuable
option that can be used effectively and efficiently to accelerate crop
improvement.
• Association mapping is a long-term commitment, so secure all the
required resources before starting.
Editor's notes
The phenotypic variation of many complex traits is controlled by multiple genes, so selection becomes very complicated. We need a technique that can predict the phenotype from the genotype with high accuracy. Linkage analysis and association mapping are the two most commonly used tools for dissecting complex traits. Linkage analysis in plants typically localizes QTLs to 10 to 20 cM intervals because of the limited number of recombination events. To overcome these limitations, linkage disequilibrium (LD) mapping, or association mapping (AM), has been used extensively to dissect genotype-phenotype correlations among different individuals. Its advantages are high mapping resolution, a greater number of alleles and time savings.
Association mapping (genetics), also known as "linkage disequilibrium mapping", is a method of mapping quantitative trait loci (QTLs) that takes advantage of historic linkage disequilibrium to link phenotypes (observable characteristics) to genotypes (the genetic constitution of organisms).
More recombinants = no linkage; more parental types = linkage.
LD mapping detects and locates quantitative trait loci (QTL) by the strength of the correlation between a trait and a marker.
Uses the diverse lines from the natural populations or germplasm collections.
Discovers markers associated with (i.e., linked to) the gene controlling the trait.
In F6 or higher-generation lines derived by continued outcrossing of the F2 (Darvasi and Soller, 1995), sufficient meioses have occurred to reduce disequilibrium between moderately linked markers. When these advanced-generation lines are created by selfing, the reduction in disequilibrium is not nearly as great as that under random mating.
Assuming many generations, and therefore meioses, have elapsed since these events, recombination will have removed association between a QTL and any marker not tightly linked to it. Association mapping thus allows for much finer mapping than standard bi-parental cross approaches.
More recombinants = more distance between loci, because the chance of crossing over increases with the distance between two loci.
Consider a population of 100 individuals: if 6 of the 100 are recombinant for A and B, the formula gives a map distance of 6 cM.
(1) Availability of broader genetic variation with a wider background for marker-trait correlations (i.e., many alleles evaluated simultaneously),
(2) likelihood of higher-resolution mapping because it uses the many recombination events from a large number of meioses throughout the germplasm development history,
(3) possibility of exploiting historically measured trait data for association, and
(4) no need to develop expensive and tedious biparental populations, which makes the approach time-saving and cost-effective.
The most important consideration when deciding between a candidate gene approach and a whole-genome study is the extent of LD in the organism of interest, because the extent of LD determines not only the mapping resolution that can be achieved, but also the number of markers needed for adequate coverage of the genome in a genome-wide study.
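A back-of-the-envelope sketch of how LD extent drives marker number, using figures quoted elsewhere in these notes and slides (3,625,200 SNPs at 9.32 SNPs per kb; LD decay of about 123 kb in indica and 167 kb in japonica). The rule of one marker per LD block is a deliberately crude lower bound assumed here for illustration, and the function names and the 1 kb comparison value are hypothetical.

```python
import math


def implied_genome_length_kb(n_snps, snps_per_kb):
    """Genome length implied by a SNP count and an average SNP density."""
    return n_snps / snps_per_kb


def min_markers_for_coverage(genome_length_kb, ld_extent_kb):
    """Crude lower bound: one marker per stretch of genome the size of the LD extent."""
    return math.ceil(genome_length_kb / ld_extent_kb)


# Figures from the rice case study.
genome_kb = implied_genome_length_kb(3_625_200, 9.32)
markers_indica = min_markers_for_coverage(genome_kb, 123)
markers_japonica = min_markers_for_coverage(genome_kb, 167)

# Hypothetical comparison: the same genome with LD decaying within 1 kb.
markers_short_ld = min_markers_for_coverage(genome_kb, 1)

print(round(genome_kb), markers_indica, markers_japonica, markers_short_ld)
```

With long-range LD a few thousand well-spaced markers already tag the genome at coarse resolution, whereas rapid LD decay pushes the requirement into the hundreds of thousands, in exchange for much finer mapping resolution.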
The final consideration in selecting the sample population is whether to use randomly or non-randomly mated germplasm.
In practice, a sample of 100 diverse inbred lines has been found to have enough statistical power to identify associations that explain 10% of the phenotypic variation.
Larger samples and/or more replications of phenotypic evaluation could be used to identify associations with smaller effects.
Randomly mated populations represent a rather narrow group of germplasm, which is likely to lower resolution and to harbor only a narrow range of alleles.
However, if nonrandomly mated germplasm is used, population structure needs to be controlled in the statistical analyses.
Because association mapping often involves a relatively large number of diverse accessions, phenotypic data collection with adequate replications across multiple years and multiple locations is challenging. Efficient field design with incomplete block design (e.g., α-lattice), appropriate statistical methods (e.g., nearest neighbor analysis and spatial models), and consideration of QTL × environment interaction should be explored to increase the mapping power, particularly if the field conditions are not homogeneous (Eskridge, 2003). The increase in power of detecting QTLs with repeated measurements is well known and also has been demonstrated by simulation studies in mapping with pedigree-based breeding germplasm (Arbelbide et al., 2006).
A set of unlinked, selectively neutral background markers scaled to achieve genome-wide coverage are employed to broadly characterize the genetic composition of individuals
$P_{AB} P_{ab} = P_{Ab} P_{aB}$ (1:1 ratio = no LD), the same as in a test cross, where we consider parental and recombinant types.
$D = P_{AB} - P_A P_B$, where $D$ is the LD between loci A and B, $P_{AB}$ is the frequency of gamete AB, and $P_A$ and $P_B$ are the frequencies of alleles A and B.
Not in use…
$D$ = coefficient of linkage disequilibrium.
ϴ = recombination fraction between two loci; ϴ = 0.5 means 50% recombinant gametes, i.e. independent assortment.
With ϴ = 0.5, LD is essentially lost within about 10 generations.
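The "10 generations" figure can be checked with the standard decay relation for a large random-mating population, D_t = D_0 (1 − ϴ)^t. The notes do not state this formula explicitly, so treat the sketch below as an assumption-laden illustration; the function names and the 0.1% cut-off for "lost" are hypothetical choices.

```python
def ld_after_generations(d0, theta, generations):
    """Expected LD coefficient after t generations of random mating."""
    return d0 * (1.0 - theta) ** generations


def generations_until_fraction(theta, fraction):
    """Smallest whole number of generations until D_t / D_0 <= fraction."""
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    remaining = 1.0
    generations = 0
    while remaining > fraction:
        remaining *= 1.0 - theta
        generations += 1
    return generations


# Unlinked loci (theta = 0.5): generations until less than 0.1% of the initial LD is left.
gens_unlinked = generations_until_fraction(0.5, 0.001)
# Tightly linked loci (theta = 0.01) keep their LD far longer.
gens_tight = generations_until_fraction(0.01, 0.001)
print(gens_unlinked, gens_tight)
```

The contrast between the two cases is what association mapping relies on: after many generations, only markers tightly linked to a QTL remain associated with it.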
The significance (p-value) of a marker-trait association may be obtained from linear regression, ANOVA or one of several non-parametric statistical methods.
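As one example of a non-parametric route to a p-value, the sketch below runs a permutation test on a single biallelic marker. It is a minimal illustration, not the procedure used by TASSEL or any other package: the data are invented, the function names are hypothetical, and population structure and kinship are ignored.

```python
import random


def mean(values):
    return sum(values) / len(values)


def allele_effect(genotypes, phenotypes):
    """Difference in mean phenotype between lines carrying the allele (1) and not (0)."""
    carriers = [y for g, y in zip(genotypes, phenotypes) if g == 1]
    others = [y for g, y in zip(genotypes, phenotypes) if g == 0]
    return mean(carriers) - mean(others)


def permutation_p_value(genotypes, phenotypes, n_permutations=2000, seed=1):
    """Two-sided p-value: how often shuffled phenotypes give an effect at least as large."""
    rng = random.Random(seed)
    observed = abs(allele_effect(genotypes, phenotypes))
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(allele_effect(genotypes, shuffled)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Invented example: 20 inbred lines, marker allele present (1) or absent (0),
# phenotype = plant height in cm.
genotypes = [1] * 10 + [0] * 10
heights = [102, 98, 105, 110, 99, 101, 107, 103, 100, 104,
           88, 91, 85, 93, 90, 87, 92, 86, 89, 94]
effect = allele_effect(genotypes, heights)
p_value = permutation_p_value(genotypes, heights)
print(effect, p_value)
```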
TASSEL General Linear Model (Yu and Buckler, 2006; TASSEL: http://www.maizegenetics.net), a multiple regression model combined with the estimates for the false discovery rate suggested by Kraakman et al. (2006).
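When many markers are tested, the p-values need a multiple-testing correction such as false discovery rate control. The notes do not say which FDR procedure is meant, so the sketch below assumes the widely used Benjamini-Hochberg step-up procedure; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking which tests pass at the given FDR."""
    m = len(p_values)
    # Indices of the tests, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k with p_(k) <= fdr * k / m.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= fdr * rank / m:
            cutoff_rank = rank

    # Every test ranked at or below the cut-off is declared significant.
    significant = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[index] = True
    return significant


# Invented p-values for five markers.
marker_p_values = [0.6, 0.001, 0.041, 0.008, 0.039]
calls = benjamini_hochberg(marker_p_values, fdr=0.05)
print(calls)
```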
$D'$ is the standardized disequilibrium coefficient, which mainly measures recombinational history and is therefore useful for assessing the probability of historical recombination in a given population.
$r^2$ is essentially the squared correlation between the alleles at two loci; it summarizes both recombinational and mutational history and is useful in the context of association studies.
Both parameters range from 0 to 1 (taking the absolute value of $D'$).
Rice has the largest publicly available single-species germplasm collection in the world and one of the richest genomic databases. Rice was the first crop plant to be fully sequenced.
Quantitatively inherited traits are strongly affected by the environment. AM will play a very important role in selection for such traits.
About 78% of all SNPs were found in intergenic regions; of the remaining SNPs, the largest number were in introns of annotated genes, followed by coding regions and untranslated regions of annotated genes
It has previously been suggested that the photoperiod and temperature clines along latitudes may have been the primary factors driving differentiation of cultivated rice in China.
Genome-wide LD decay rates of indica and japonica were estimated at ~123 kb and ~167 kb, where the $r^2$ drops to 0.25 and 0.28, respectively. This is in agreement with the previous estimation that cultivated rice has a long-range LD from close to 100 kb to over 200 kb, which might be a result of self-fertilization coupled with a relatively small effective population size.
GWAS was carried out on 14 agronomic traits, which can be divided into five categories: morphological characteristics (tiller number and leaf angle), yield components (grain width, grain length, grain weight and spikelet number), grain quality (gelatinization temperature and amylose content), coloration (apiculus color, pericarp color and hull color) and physiological features (heading date, drought tolerance and degree of seed shattering)
Numbers of loci used to assign contributions to phenotypic variance are indicated at ends of bars.
GxE interaction, high heritability, LD decay, Population structure, Vast amount of snp to cover whole genome.