Association mapping, GWAS, Mapping, natural population mapping

Seminar-II
Mahesh Biradar
II PhD
PGS19AGR7965
Dept. of GPB, UASD
“Association Mapping for crop improvement”

Selection: The core of Plant Breeding
• Improvement of traits is effected by selection
• Phenotypic (direct) selection is less effective for complex traits, as they are controlled by a large number of genes with substantial G×E interaction
• DNA markers are being increasingly used as surrogates for
selection of genotypes for combination of desirable but
conventionally difficult-to-breed traits
Advantages of indirect selections through markers
• Off season selection
• Early selection
• Cost effective
• High throughput
• Pyramiding and stacking
• For MABC: Transfer of recessive gene (Avoids linkage drag)
• MARS
• Genomic selection
Singh A. K. and Singh B. D., 2015 Irri.org

Requirements for indirect selection
Reliable marker trait association
https://www.semanticscholar.org/paper/3-Association-Mapping-in-Plant-Genomes-Soto-Cerda-Cloutier/0ecf0db269a995ebb23e1f2334d7c663bc1376ea
How to go about
GENETIC MAPPING
Family based Mapping
• Select parents
• Develop mapping population (MP)
• Genotype
• Phenotype
• Computations and QTL identification
Population based mapping
(Association Mapping)
• GWAS
• Candidate Gene based mapping

Limitations of biparental mapping approaches
θ Capture only those QTL alleles for which the parents differ.
θ Require a large population size.
θ Low resolution.
θ Longer research time.
θ Not feasible in perennial crops and animals.
θ Suitable mostly for coarse mapping, except in BC-derived populations.
Singh A. K. and Singh B. D., 2015

Harnessing Linkage Disequilibrium For
Mapping Traits: Association Mapping
An Approach For Establishing Marker Trait Association

What is Association mapping
θ A tool to resolve complex trait variation down to the sequence level by exploiting historical and evolutionary recombination events at the population level
θ Association mapping, also known as "linkage disequilibrium mapping", is a method of mapping quantitative trait loci (QTLs) that takes advantage of historic linkage disequilibrium (LD) to find associations between phenotypes and genotypes
θ Greater precision in QTL location than family-based linkage analysis.
θ Can be applied to a range of experimental and non-experimental populations.

Advantages of association mapping
• Time and cost saved in developing mapping populations
• High resolution
• A larger number of alleles detected
• Applicable even to small-effect QTLs
Yu and Buckler, 2006

Towards a better resolution
• GWAS can give a resolution of up to 1 cM, compared with 10 cM in DH and 5 cM in RIL populations
• In rice, 1 cM = 500 kb; genes are mapped with a resolution of 100 kb (Crowell et al., 2016)
Sujan Mamidi, 2020

Concept of Linkage Disequilibrium (LD)
θ The concept of LD was first described by Jennings (1917).
θ Term: Lewontin and Kojima (1960).
θ Measure: D, the coefficient of LD, developed by Lewontin (1964).
θ LD is the ‘non-random association of alleles at different loci’, also called gametic phase disequilibrium of loci in a population.
θ The non-random association may be between alleles of two markers, two QTLs, or a marker and a QTL.
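Haplotype frequencies (observed / expected under random association):
Allele | B, f(B) = 0.6 | b, f(b) = 0.4 | Total
A, f(A) = 0.2 | 0.1 / 0.12 | 0.1 / 0.08 | 0.2
a, f(a) = 0.8 | 0.5 / 0.48 | 0.3 / 0.32 | 0.8
Total | 0.6 | 0.4 | 1.0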
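As a rough sketch, the LD in the haplotype table above can be quantified from the observed A-B frequency and the two allele frequencies. The function name `ld_measures` is hypothetical, and the D' and r² formulas are the standard textbook definitions (D scaled by its maximum possible value, and D² scaled by the product of the four allele frequencies) rather than something stated on the slide.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab : observed frequency of the A-B haplotype
    p_a  : frequency of allele A
    p_b  : frequency of allele B
    """
    # D is observed minus expected haplotype frequency
    d = p_ab - p_a * p_b
    # Maximum possible |D| depends on the sign of D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    # r^2 is the squared correlation between alleles at the two loci
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Values taken from the table: f(AB) = 0.1, f(A) = 0.2, f(B) = 0.6
d, d_prime, r2 = ld_measures(p_ab=0.10, p_a=0.2, p_b=0.6)
print(round(d, 4), round(d_prime, 4), round(r2, 4))
```

Here D is negative because the A-B haplotype is observed less often (0.1) than expected (0.12).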

Defining Linkage Disequilibrium (Contd..)
Oraguzie et al., 2007, Association Mapping in Plants, Springer

Measures of Linkage Disequilibrium
Measure | Formula | Reference | Remarks
D | D = Pr(A1B1) − Pr(A1) Pr(B1); for three loci, D_ABC = p_ABC − p_A·D_BC − p_B·D_AC − p_C·D_AB − p_A·p_B·p_C | Weir, 1996 | General account of LD
D’ | - | Lewontin, 1964 | Unitless measure
r | - | Hill and Robertson, 1968 | Commonly used for biallelic markers
δ | - | Levin and Bertell, 1978 | -
Odds ratio | - | Devlin and Risch, 1995 | -

Worked examples, using D = (pM1Q1 × pM2Q2) − (pM1Q2 × pM2Q1) for marker alleles M1, M2 and QTL alleles Q1, Q2:
Case 1: M1Q1 = 0.25, M1Q2 = 0.25, M2Q1 = 0.25, M2Q2 = 0.25
D = (0.25 × 0.25) − (0.25 × 0.25) = 0
Case 2: M1Q1 = 0.5, M1Q2 = 0, M2Q1 = 0, M2Q2 = 0.5
D = (0.5 × 0.5) − (0 × 0) = 0.25
Case 3: M1Q1 = 0.4, M1Q2 = 0.1, M2Q1 = 0.1, M2Q2 = 0.4
D = (0.4 × 0.4) − (0.1 × 0.1) = 0.15
r = 0.2 (combined frequency of the recombinant classes M1Q2 and M2Q1)
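The three cases can be checked with a short sketch. It also illustrates that the cross-product form of D used here gives the same value as the "observed minus expected" form D = Pr(M1Q1) − Pr(M1) Pr(Q1). Function and variable names are hypothetical.

```python
def d_cross_product(p11, p12, p21, p22):
    """D from the four haplotype frequencies (M1Q1, M1Q2, M2Q1, M2Q2)."""
    return p11 * p22 - p12 * p21


def d_observed_minus_expected(p11, p12, p21, p22):
    """D as observed M1Q1 frequency minus the product of allele frequencies."""
    p_m1 = p11 + p12  # frequency of marker allele M1
    p_q1 = p11 + p21  # frequency of QTL allele Q1
    return p11 - p_m1 * p_q1


# Haplotype frequencies from the three worked cases
cases = {
    "case 1": (0.25, 0.25, 0.25, 0.25),
    "case 2": (0.5, 0.0, 0.0, 0.5),
    "case 3": (0.4, 0.1, 0.1, 0.4),
}

results = {name: d_cross_product(*freqs) for name, freqs in cases.items()}
for name, value in results.items():
    print(name, round(value, 4))
```

The two forms agree whenever the four haplotype frequencies sum to 1.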

Illustration
• Functional allele T is in LD with
berry number from simple
association test
• Allele C/T is in LD with functional
allele
Myles et al., 2009
Principle of association
mapping

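Individual (natural population) | Phenotypic data | M1 | M2 | M3 | M4 | M5
1 | R | 1 | 1 | 0 | -1 | -1
2 | R | 1 | 1 | 0 | -1 | -1
3 | R | -1 | 1 | 0 | -1 | -1
4 | R | 0 | 1 | 1 | -1 | 0
5 | R | 0 | 1 | 1 | -1 | -1
6 | S | 1 | -1 | -1 | 1 | 1
7 | S | -1 | -1 | -1 | 1 | 1
8 | S | 0 | -1 | 0 | 1 | 1
9 | S | -1 | -1 | -1 | 1 | 1
10 | S | 0 | -1 | 1 | 1 | 1
Marker coding: M1M1 = 1, M1M2 = 0, M2M2 = -1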
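A minimal sketch of the principle, assuming the simplest possible test: a Pearson correlation between the phenotype class (R coded 1, S coded 0, a coding chosen here only for illustration) and each marker score from the table. It ignores population structure and kinship, so it corresponds to a naive model, not to any specific software.

```python
from math import sqrt

# Phenotype and marker scores copied from the table (individuals 1 to 10)
phenotype = ["R", "R", "R", "R", "R", "S", "S", "S", "S", "S"]
markers = {
    "M1": [1, 1, -1, 0, 0, 1, -1, 0, -1, 0],
    "M2": [1, 1, 1, 1, 1, -1, -1, -1, -1, -1],
    "M3": [0, 0, 0, 1, 1, -1, -1, 0, -1, 1],
    "M4": [-1, -1, -1, -1, -1, 1, 1, 1, 1, 1],
    "M5": [-1, -1, -1, 0, -1, 1, 1, 1, 1, 1],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical numeric coding of the phenotype: R = 1, S = 0
y = [1 if p == "R" else 0 for p in phenotype]
correlation = {name: pearson(scores, y) for name, scores in markers.items()}

for name, r in correlation.items():
    print(name, round(r, 3))
```

Markers whose scores track the phenotype classes perfectly (M2 and M4 in this table) give correlations of +1 or −1, while M1 shows only a weak correlation.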

Testing the Statistical Significance of LD
θ Chi-square test
θ Fisher’s exact test (Fisher, 1935)
θ Likelihood ratio test
θ Multi-factorial permutation analysis
θ A threshold p-value of ≤ 0.05 is often used to declare significant LD
θ These statistical methods are implemented in software such as PowerMarker (Liu & Muse, 2005), TASSEL (Bradbury et al., 2007) and R
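As an illustration of the first option, a chi-square test of LD can be sketched on a 2 × 2 table of haplotype counts. The counts below are hypothetical (200 haplotypes, matching observed frequencies 0.1, 0.1, 0.5 and 0.3), and the p-value uses the identity for a chi-square distribution with 1 degree of freedom, P(X > x) = erfc(√(x/2)). This is a sketch, not the routine used by PowerMarker or TASSEL.

```python
from math import erfc, sqrt


def ld_chi_square(n11, n12, n21, n22):
    """Pearson chi-square test (1 df) on a 2 x 2 table of haplotype counts.

    n11, n12 : counts of AB and Ab haplotypes
    n21, n22 : counts of aB and ab haplotypes
    Returns (chi-square statistic, p-value).
    """
    total = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)
    cols = (n11 + n21, n12 + n22)
    observed = ((n11, n12), (n21, n22))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Upper-tail probability of a chi-square variable with 1 degree of freedom
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts for a sample of 200 haplotypes
chi2, p_value = ld_chi_square(20, 20, 100, 60)
print(round(chi2, 4), round(p_value, 4))
```

With these counts the p-value is above 0.05, so the LD would not be declared significant at that threshold.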

LD plots and Heatmaps
θ Indicate disequilibrium measures
θ Upper half: LD measures
θ Lower half: P-values
Rajendra Singh et al., 2019, https://www.researchgate.net/publication/332014392_Development_of_model_web-server_for_crop_variety_identification_using_throughput_SNP_genotyping_data_OPEN

LD Decay plot
Zegeye, Habtemariam; Rasheed, Awais; Makdis, Farid; Badebo, Ayele; C. Ogbonnaya, Francis (2015): Linkage disequilibrium (LD) decay as a function of genetic distance. PLOS ONE. Figure. https://doi.org/10.1371/journal.pone.0105593.g003

Manhattan Plots: Representing GWAS results
θ Scatter plot arranged chromosome-wise to summarize results
θ The X-axis is the genomic position of each SNP, and the Y-axis is the negative logarithm of the P-value obtained from the GWAS model
Skyscrapers
GAPIT manual, Zhiwu Zhang, 2020, pp. 27

Factors affecting LD
θ Recombination & mutation
θ Mating system
θ Selection
θ Population structure & kinship
θ Genetic drift / gene flow

Mutation & Recombination
A new mutation is initially in LD with all other loci; recombination then causes LD to decay as new haplotypes are created.
LD is broken down by recombination, hence blocks of LD are expected.
D is expressed in standardized units as D' or r².
Decay per generation: D(t+1) = (1 − r) D(t), where r is the recombination fraction (curves shown for r = 0.0005, 0.005, 0.05 and 0.5).
r = 0.5 for unlinked loci, so LD decays by half each generation.
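The recurrence can be iterated to see how quickly LD decays for different recombination fractions. A minimal sketch, with hypothetical function names and a starting value D0 = 0.25 chosen only for illustration:

```python
from math import log


def ld_after(d0, r, generations):
    """LD remaining after a number of generations, from D(t+1) = (1 - r) D(t)."""
    return d0 * (1 - r) ** generations


def generations_to_halve(r):
    """Number of generations needed for D to fall to half its value."""
    return log(0.5) / log(1 - r)


# Recombination fractions from the slide legend
for r in (0.5, 0.05, 0.005, 0.0005):
    remaining = ld_after(d0=0.25, r=r, generations=10)
    print(r, round(remaining, 5), round(generations_to_halve(r), 1))
```

Tightly linked loci (small r) keep their LD for many generations, which is why closely linked markers remain informative about a QTL.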

Mating system
Selfing reduces opportunities for effective recombination, so LD extends over much larger distances (a low marker density suffices, but resolution is lower).
LD declines more rapidly in outcrossing plant species, so a high marker density is required and higher resolution is expected.
[Figure: D' (Y-axis, 0 to 1) against generation (X-axis, 1 to 40) for recombination fractions r = 0.05, 0.25 and 0.50 (no linkage), each under outcrossing (s = 0.00) and 99% selfing (s = 0.99)]

Selection:
θ It generates LD between unlinked loci: a “hitchhiking” effect and epistatic selection of co-adapted genes.
θ Fixation of alleles flanking a favored variant
θ Domestication and modern plant breeding have considerably modified genome architecture and reduced genetic diversity, creating population structure that in turn affects LD.

Genetic Drift and Bottleneck:
Genetic drift results in the consistent loss of rare allelic combinations, which increases the level of LD (Flint-Garcia et al., 2003).
A bottleneck is a marked reduction in the size of a population for one or more generations.
It enhances genetic drift, since few allelic combinations are transmitted.

Inferences from Linkage Disequilibrium
θ Populations with slow decay can be helpful in AM
θ Populations with long LD blocks are amenable for coarse
mapping with fewer markers
θ short LD blocks-fine mapping

Association Mapping
• Genome Wide Association Mapping
• Candidate Gene Association mapping
Zhu et al., 2008

Other Populations can be
used too..
θ MAGIC
θ Nested Association panel

Population structure
• Population structure affects LD throughout the genome.
• Population structure arises from the unequal distribution of alleles among subpopulations of different ancestries.
• When such subgroups are sampled together to construct a panel of lines for AM, their differing allele frequencies create LD, which leads to false positives.
• Subgroups within a population form because of allele frequency differences.
• It needs to be accounted for if present in the association panel.
• PCA is generally used to represent and estimate structure.
• STRUCTURE (developed by the Pritchard Lab, Stanford University) identifies k clusters to which individuals are assigned; it is time-consuming.
Shi et. al., 2017
Detection: AMOVA,
Wright’s F statistic

• Sub population 1: f(M1) = 0.7, f(M2) = 0.3, f(Q1) = 0.7, f(Q2) = 0.3
• M1Q1 = 0.49
• M1Q2 = 0.21
• M2Q1 = 0.21
• M2Q2 = 0.09
• LD = 0
• f(M1M1Q1Q1) = 0.7 × 0.7 × 0.7 × 0.7 = 0.2401
• Sub population 2: f(M1) = 0.3, f(M2) = 0.7, f(Q1) = 0.3, f(Q2) = 0.7
• M1Q1 = 0.09
• M1Q2 = 0.21
• M2Q1 = 0.21
• M2Q2 = 0.49
• LD = 0
• f(M1M1Q1Q1) = 0.3 × 0.3 × 0.3 × 0.3 = 0.0081
In mixed whole population
f(M1) = (0.7 + 0.3)/2 = 0.5
f(M2) = (0.3 + 0.7)/2 = 0.5
f(Q1) = (0.7 + 0.3)/2 = 0.5
f(Q2) = (0.3 + 0.7)/2 = 0.5
Expected f(M1M1Q1Q1) = 0.5 × 0.5 × 0.5 × 0.5 = 0.0625
Observed f(M1M1Q1Q1) = (0.2401 + 0.0081)/2 = 0.1241
Observed / Expected = 0.1241 / 0.0625 = 1.9856
The genotype is about 2 times more probable than expected, which inflates the estimation of marker trait association
Population Structure (contd..)
Of 557 available upland cotton germplasm accessions at the Agricultural Research Station, University of Agricultural Sciences (UAS), Dharwad, 201 were used for association studies. They included indigenous and exotic accessions, released varieties and breeding lines.
Population structure was inferred using the program fastSTRUCTURE (Raj, Stephens, & Pritchard, 2014).
23,254 polymorphic SNPs with minor allele frequencies greater than 0.05 were used for population structure analysis.

Population Structure (contd..)
Population structure determination in the population

Another confounding effect in marker trait association: Kinship
θ Kinship is co-ancestry, or half the relatedness
θ The coefficient of kinship is the probability that alleles at a particular locus, chosen randomly from two individuals, are identical by descent
θ Can be estimated based on pedigree information
θ Kf = ∑k ∑a (f_ai × f_aj) / D

Kinship matrices
θ Marker based approaches: method of moments based
estimate
θ Kinship estimates for pairs of individuals

Statistical Models of association mapping
θ Case control approaches: affected by Q
θ Transmission disequilibrium tests
θ Structured association models

Statistical Models of association mapping
General linear models (GLM): fixed effects only
θ Include naïve models and models with population structure (Q) as a fixed effect
θ Trait = Markers + Error, or Trait = Markers + Population structure (Q) + Error
Mixed linear models (MLM): both fixed and random effects
θ Trait = Markers + Population structure (Q) + Kinship (K) + Error
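The word equations above are often written in matrix form. The notation below is an assumption for illustration and is not taken from the slides: y is the phenotype vector, X the marker matrix with effects β, Q the population structure matrix with effects v, Z an incidence matrix for random polygenic effects u, K the kinship matrix, and e the residual.

```latex
\begin{align}
\text{Naive model:}\quad & y = X\beta + e \\
\text{Q model:}\quad & y = X\beta + Qv + e \\
\text{Q + K model:}\quad & y = X\beta + Qv + Zu + e,
  \qquad u \sim N(0,\, 2K\sigma^2_g),\quad e \sim N(0,\, I\sigma^2_e)
\end{align}
```

In the first two models all terms except e are fixed effects; in the third, kinship enters through the covariance of the random effect u.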

Overview of models
θ Models like MLMM, SUPER, FarmCPU and BLINK are feedback models: after each regression, significant markers are used as covariates to negate their effect

Selecting the best model
θ The quantile-quantile (QQ) plot assesses how well the model used in GWAS accounts for population structure and familial relatedness
θ Negative logarithms of the P-values from the models fitted in GWAS are plotted against their expected values under the null hypothesis of no association with the trait
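The points of a QQ plot can be sketched as below. Under the null hypothesis P-values are uniform, so the expected value of the i-th smallest of n P-values is taken here as i / (n + 1). That is one common convention and an assumption of this sketch; the P-values are hypothetical.

```python
from math import log10


def qq_points(p_values):
    """Pair expected and observed -log10(p) values for a QQ plot.

    Expected quantiles under the null use i / (n + 1), an assumed convention.
    Returns (expected, observed) pairs sorted in ascending order.
    """
    n = len(p_values)
    observed = sorted(-log10(p) for p in p_values)
    expected = sorted(-log10(i / (n + 1)) for i in range(1, n + 1))
    return list(zip(expected, observed))


# Hypothetical P-values from a GWAS model
points = qq_points([0.5, 0.01, 0.2, 0.9])
for expected, observed in points:
    print(round(expected, 3), round(observed, 3))
```

Points lying on the diagonal indicate a well-controlled model. Early and genome-wide departure above the diagonal suggests unaccounted structure or relatedness, while departure only in the upper tail is consistent with true associations.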

Controlling false discoveries
θ The composite (family-wise) error rate increases when each marker is tested in a separate equation
θ Bonferroni correction: αc = αE/m (to control the family-wise error rate, FWER)
θ Benjamini-Hochberg
θ Order the m unadjusted p-values from hypothesis testing: P(1) ≤ P(2) ≤ … ≤ P(m)
θ Let k be the largest i for which P(i) ≤ (i/m) q* (q* is the expected proportion of false discoveries, FDR)
θ Reject all H(i) for i = 1, 2, …, k
Null hypothesis | Accept | Reject | Total
True | U | V | m0
False | T | S | m − m0
Total | m − R | R | m
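Both corrections can be sketched in a few lines. The P-values below are hypothetical, and the function names are chosen for illustration only.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H(i) when p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    # Indices of the p-values in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    # k is the largest rank i with P(i) <= (i / m) * q
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k = rank
    # Reject the hypotheses with the k smallest p-values
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# Hypothetical P-values for ten markers
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(bonferroni(p_values)), sum(benjamini_hochberg(p_values)))
```

With these values Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects two, illustrating that FDR control is less conservative than FWER control.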

Marker effects
θ The additive effect of the variant allele is calculated as half
the difference between mean of the variant allele and mean
of the reference allele
θRef allele mean = 10
θVar allele mean = 16
θAdditive effect = 3

Variation explained and effect of allele
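θ Regression R² is used to denote the phenotypic variance explained (PVE)
θ Likelihood ratio R² (Sun et al., 2010): R²_LR = 1 − exp(−(2/n) (log LM − log L0)), where n is the number of observations
θ Log LM is the log-likelihood of the full model, e.g. y = m1 + PCA + kinship + ε
θ Log L0 is the log-likelihood of the reduced model, e.g. y = PCA + kinship + ε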
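A minimal sketch of the likelihood-ratio R², assuming the form R²_LR = 1 − exp(−(2/n)(log LM − log L0)) with n observations. The log-likelihood values and sample size below are hypothetical; in practice they would come from fitting the full and reduced mixed models.

```python
from math import exp


def r2_likelihood_ratio(log_l_full, log_l_reduced, n):
    """Likelihood-ratio R^2 from the log-likelihoods of two nested models.

    log_l_full    : log-likelihood of the model with the marker (log LM)
    log_l_reduced : log-likelihood of the model without the marker (log L0)
    n             : number of observations
    """
    return 1 - exp(-2.0 / n * (log_l_full - log_l_reduced))


# Hypothetical values for a panel of 100 lines
r2_lr = r2_likelihood_ratio(log_l_full=-120.0, log_l_reduced=-125.0, n=100)
print(round(r2_lr, 4))
```

When the marker does not improve the likelihood (log LM = log L0), the statistic is 0.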

Phenotyping
θ Since large populations are involved, a suitable strategy to minimize G×E interaction has to be designed
θ Incomplete block designs are suitable, or phenomics should be used
θ Multi-location and/or multi-environment testing is suggested
θ Boxplots are an effective means of identifying outliers
θ Test for normality; if the data are not normal, apply transformations

Success stories
θ In humans, GWAS has identified SNPs linked to diabetes, Alzheimer’s, Parkinson’s, obesity and many more
θ Phenotypic variation in flowering time, endosperm color, starch production, maysin and chlorogenic acid accumulation, cell wall digestibility, and forage quality has been associated with SNP markers of candidate genes

Current issues
θ Missing heritability
θ Algorithms for efficient estimation of epistasis are lacking
θ Controlling false positives
θ Accurate phenotyping of panel

Limitations
θ Results of AM are affected by selection history, K, Q
θ Linkage may not be always the basis of significant LD
θ Demands large number of markers-increases cost on
genotyping
θ Rate of recombination is not uniform throughout the
genome-reduces reliability of using LD estimates
θ Low frequency alleles with larger effects cannot be detected

Materials and methods
θ The material for the study comprised 64 core set germplasm accessions of Dolichos bean and two check varieties (HA-4 and kadalavare (KA)) maintained at the All India Coordinated Research Project (AICRP) on pigeon pea, UAS, Bengaluru.
θ The core germplasm accessions include Indian accessions (78%, collected from Karnataka, Andhra Pradesh, Maharashtra, Gujarat, Tamil Nadu and Kerala states of India), exotic accessions (6%) and accessions of unknown origin (16%).
θ Traits recorded: days to 50% flowering, primary branches per plant, racemes per plant, raceme length, nodes per raceme, pods per plant, fresh pod yield per plant, fresh seed yield per plant and 100-fresh seed weight

Genotyping
θ The core germplasm accessions were genotyped with a total of 234 SSR markers, which included 198 in-house developed SSR markers and 36 SSR markers transferable across legume species/genera.
θ The population structure of the Dolichos bean core set was worked out using 95 polymorphic SSR markers
θ K and Q were estimated with STRUCTURE 2.3.2
θ To confirm population structure, AMOVA and Wright’s F statistic were computed
θ Marker-trait association: TASSEL 3.1

θ Objective: To identify the QTLs related to soybean protein and oil content
θ Population size: 185 diverse soybean germplasm accessions (China, America, Canada, Japan and some European countries)
θ Genotyping approach: Whole genome sequencing using SLAF-seq approach (12,072 SNPs used)
θ Phenotyping for 2 years: Protein and oil content (Infratec 1241 NIR Grain Analyzer)
Population structure evaluation
PCA was used to assess the population structure using the GAPIT package

Association mapping
θ Compressed mixed linear model (cMLM) in GAPIT: based
on SNPs from the 185 soybean accessions and ~12k SNPs
θ A P-value of 0.001 constituted the Type I error significance
threshold

Results: Manhattan plots
Li et al., 2019

In summary..
θ 9 of 23 SNPs were in line with previous reports
θ Due to overlapping detection of some SNPs in protein and oil
content, pleiotropy could be suspected

Materials and Methods
229 accessions from a worldwide B. napus collection were divided into two panels of 96 and 133 accessions.
Biochemical analysis: tocopherol content and composition were assessed by HPLC: α-tocopherol content (ATC), γ-tocopherol content (GTC) and δ-tocopherol content (DTC), with tocopherol composition expressed as the ratio of α- to γ-tocopherol (AGR).
Glucosinolate (GSL), seed oil (SOC), and seed protein (SPC) contents were also assessed by near-infrared spectroscopy (NIRS).

Genotyping
θ The 13 tocopherol candidate genes (BnaX.VTE1.a, BnaX.VTE1.b, BnaA.VTE2.a, BnaX.VTE2.b, BnaX.VTE3.a, BnaX.VTE3.b, BnaA.VTE4.a, BnaX.VTE4.b, BnaX.VTE4.c, BnaC.VTE5, BnaX.PDS1.a, BnaX.PDS1.b, and BnaA.PDS1.c) were identified by BAC library screening and characterized by functional and mapping approaches (Fritsche et al; Wang et al.).
θ Population structure: assessed by 31 publicly available genome-wide microsatellite (SSR) markers (Cheng et al., 2009)
θ Principal component analysis (PCA) was performed on the SSR marker data, and the first and second principal components were used (D matrix) for the association analysis.
θ Kinship matrices were also calculated

LD and association analysis
θ r² values of LD and corresponding p-values for all pairs of loci were calculated using the software R
θ Two models, the general linear model (GLM) and the PK-mixed model, were used to analyze associations between polymorphic sites and the traits in panel 1
Identification of Polymorphisms within Tocopherol Genes
θ Among the 13 genes, specific primer pairs yielding high-quality sequences could be developed for only nine
θ The remaining four candidate genes had poor sequence quality
θ At a frequency threshold of 5%, polymorphic sites were found in only two candidate genes (BnaA.PDS1.c, BnaX.VTE3.a), whereas low-frequency polymorphic sites (frequency < 5%) were detected in three genes. No polymorphisms were found in the amplified fragments of the remaining four genes

θ BnaA.PDS1.c- 25 SNPs
θ BnaX.VTE3.a – 6 SNPs

They concluded that the polymorphisms within the tocopherol genes clearly impact tocopherol content and composition in B. napus seeds, and therefore suggest that these nucleotide variations may be used as selectable markers for breeding rapeseed with enhanced tocopherol quality.
