Pedigree-Based Methods
• Positional Cloning: Identification of a gene for a particular disease based on its
location in the genome, determined by a collection of methods including linkage
analysis, genomic (physical) mapping, and bioinformatics
• Founder Gene Approach: Gene mapping in founder populations, exploiting the loss of
genetic diversity or limited genetic diversity that occurs when a population descends
from a small group of individuals drawn from a genetically diverse population
Pedigree-Independent Methods
• Candidate Gene Approach: Associations between genetic variations within
pre-specified genes of interest and phenotypes or disease states
• Genome Wide Association Studies: Examination of many common genetic variants
in different individuals to see if any variant is associated with disease phenotype
Sr No  Name                                                          Web Address                                   Reference
1      T1Dbase                                                       http://www.t1dbase.org                        (Hulbert et al. 2007)
2      COSMIC                                                        http://www.sanger.ac.uk/genetics/CGP/cosmic/  (Forbes et al. 2008)
3      The European Genome-Phenome Archive                           https://www.ebi.ac.uk/ega/                    (Church et al. 2010)
4      ModSNP                                                        modsnp.expasy.org/                            (Yip et al. 2004)
5      SwissVar                                                      http://swissvar.expasy.org/                   (Mottaz et al. 2010)
6      HGMD                                                          http://www.hgmd.cf.ac.uk/ac/index.php         (Stenson et al. 2003)
7      Catalog of published Genome Wide Association Studies (NHGRI)  http://www.genome.gov/gwastudies/             (Gong et al. 2011)
Scope of a Genetic Association Study
Candidate gene
◦ Known functional variants
◦ Variants with unknown function in exons, introns, regulatory regions
Linkage candidate region
◦ Functional variants, or those with unknown function in candidate genes
◦ More general coverage of region using many markers
Genome-wide
◦ Test for association with hundreds of thousands to millions of SNPs
spread across the entire genome.
Background
There are two main types of genetic association studies:
population-based case–control studies
family-based studies
Either type can be hypothesis driven, e.g. candidate gene (CG) studies, or
conducted without a prior hypothesis, e.g. GWAS
Population-based (defined here as nonfamily-based)
case–control studies have become the most popular
design to find common polymorphisms thought to underlie
complex traits (the ‘common disease, common
variant’ hypothesis).
CGs (candidate genes)
Target genes with a previously established role in
the trait in question
Cost-effective when the focus is on a few
genes
Only a small number of markers is needed to
capture the most common variation
Candidate genes can be selected from
biological pathways that harbor other
previously associated risk loci.
Goals
Use bioinformatics databases to:
◦ Determine basic properties of genes
◦ Identify common genetic variants in and
around genes
◦ Characterize genetic variants in terms of
frequency and functionality
Possible Stages in Candidate-Gene Study Design
1. Select a candidate system
◦ Knowledge of the biology of the phenotype
2. Select candidate genes in the system
◦ Expert opinion
◦ Literature search
◦ Pathway analysis
◦ (Positional)
3. Select genetic variants in the candidate genes
◦ Literature search
◦ Bioinformatic databases
◦ SNP tagging
GWAS – Genome Wide Association Studies
Studies of genetic variation across
the (entire) human genome
Designed to identify associations
between genetic markers &
observable traits, or the
presence/absence of a
disease or condition
Often markers of modest effect
Genetic Association Studies
Short-term Goal: Identify genetic variants that explain differences in
phenotype among individuals in a study population
◦ Qualitative: disease status, presence/absence of congenital
defect
◦ Quantitative: blood glucose levels, % body fat
If association found, then further study can follow to
◦ Understand mechanism of action and disease etiology in
individuals
◦ Characterize relevance and/or impact in more general population
Long-term goal: to inform process of identifying and delivering better
prevention and treatment strategies
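For a qualitative phenotype such as disease status, one common way to test a single variant is an allelic chi-squared test on a 2x2 table of allele counts in cases and controls. The sketch below is illustrative only: the function name, the genotype-count input format, and the choice of 'a' as the risk allele are assumptions, not something taken from these slides. It uses no continuity correction and assumes no cell of the table is zero.

```python
import math

def allelic_association(case_counts, control_counts):
    """Allelic chi-squared test (1 df) from genotype counts.

    Each argument is (n_AA, n_Aa, n_aa); 'a' is the hypothetical risk allele.
    Returns (chi2, p_value, odds_ratio). Assumes no zero margins or cells.
    """
    def allele_counts(genotype_counts):
        n_AA, n_Aa, n_aa = genotype_counts
        return 2 * n_AA + n_Aa, 2 * n_aa + n_Aa  # (A alleles, a alleles)

    case_A, case_a = allele_counts(case_counts)
    ctrl_A, ctrl_a = allele_counts(control_counts)
    n = case_A + case_a + ctrl_A + ctrl_a

    # Pearson chi-squared for a 2x2 table, without continuity correction
    numerator = n * (case_a * ctrl_A - case_A * ctrl_a) ** 2
    denominator = ((case_A + case_a) * (ctrl_A + ctrl_a)
                   * (case_A + ctrl_A) * (case_a + ctrl_a))
    chi2 = numerator / denominator

    # Survival function of a chi-squared variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (case_a * ctrl_A) / (case_A * ctrl_a)
    return chi2, p_value, odds_ratio

# Made-up counts: 100 cases and 100 controls
chi2, p_value, odds_ratio = allelic_association((40, 40, 20), (60, 30, 10))
print(round(chi2, 2), odds_ratio)
```

A quantitative phenotype (e.g., blood glucose) would instead call for a regression of the trait on genotype, which is not shown here.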
Steps
Specify case definition
Consider the literature for a consensus definition of
the disease of interest. Following standard diagnostic
guidelines allows other groups to more easily
replicate initial findings, though it is not always the
most powerful approach for initial gene detection.
If a consensus definition does not exist, consider all
evidence and decide on a specific definition that
optimizes biological and clinical relevance.
Determine if the disease is heritable
Decide from all available evidence in familial aggregation
studies whether there is sufficient evidence that the
disease of interest is heritable.
Concordance rates: presence of the same trait in both
members of a pair of twins
If the heritability of a disease or subphenotype appears to
be low (<20%) and the disease is common, it is likely that
very large sample sizes (in excess of 5,000 cases and
5,000 controls) will be required to find predisposing genetic
variants using a population-based approach.
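The concordance rate mentioned above can be computed directly from twin-pair data. The following is a minimal sketch of the pairwise concordance rate, assuming each pair is recorded as two booleans (affected or not); the function name and data layout are hypothetical. A higher rate in monozygotic (MZ) than in dizygotic (DZ) pairs is usually read as evidence for a genetic contribution.

```python
def pairwise_concordance(pairs):
    """Pairwise concordance rate for a list of twin pairs.

    pairs: list of (twin1_affected, twin2_affected) booleans.
    Only pairs with at least one affected twin are informative.
    """
    concordant = sum(1 for a, b in pairs if a and b)
    discordant = sum(1 for a, b in pairs if a != b)
    informative = concordant + discordant
    if informative == 0:
        raise ValueError("no pairs with an affected twin")
    return concordant / informative

# Made-up data for illustration
mz_pairs = [(True, True)] * 6 + [(True, False)] * 4 + [(False, False)] * 10
dz_pairs = [(True, True)] * 2 + [(False, True)] * 8
print(pairwise_concordance(mz_pairs), pairwise_concordance(dz_pairs))
```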
• Control selection
Controls should be matched to cases on age, gender and ethnicity
Genetic association studies
Direct genotyping occurs when an actual causal polymorphism is
typed. Indirect genotyping occurs when nearby genetic markers that are
highly correlated with the causal polymorphism are typed
Hirschhorn & Daly, Nat Rev Genet 2005
Candidate Gene or GWAS
Takes advantage of the
correlation between SNPs,
called linkage
disequilibrium (LD)
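The correlation between two SNPs is usually summarised by the LD measures D, D' and r². The sketch below computes them for two biallelic SNPs from phased haplotype counts; the function name and the assumption that haplotype counts are already available (rather than estimated from unphased genotypes) are illustrative choices.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D_prime, r2) for two biallelic SNPs.

    Inputs are counts of the four haplotypes formed by alleles A/a at the
    first SNP and B/b at the second SNP.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n

    # D: deviation of the observed haplotype frequency from independence
    D = p_AB - p_A * p_B

    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("LD is undefined when a SNP is monomorphic")
    r2 = D * D / denom

    # D' rescales |D| by its maximum possible value given the allele frequencies
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(D) / d_max
    return D, d_prime, r2

# Made-up haplotype counts
print(ld_stats(40, 10, 10, 40))
```

An r² close to 1 means that typing one SNP carries almost the same information as typing the other, which is what indirect genotyping relies on.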
Quality controls
Quality control refers to the procedures used
to evaluate the genotyping performance of the samples
and the genotyping array.
As there can be degradation of input DNA, plating
errors and hybridization failures of genotyping chips,
it is important to review the performance
of the samples prior to definitive downstream
analysis with the genotypes.
The process of calling genotypes is not error free.
It is thus vital to identify and exclude SNPs with
potentially high rates of missingness or erroneous
genotypes.
Sample quality control
The extent of missing genotypes and heterozygosity
for a sample are useful indicators for poorly genotyped
samples.
Samples with anomalously high rates for either
of these two measures are often excluded from the
outset.
High rates of missingness generally imply hybridization
problems, which may be caused by faulty arrays or poor
quality DNA.
Excess heterozygosity can indicate sample contamination.
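The two sample-level indicators can be sketched as follows, assuming genotypes are coded 0/1/2 (copies of one allele) with None for a missing call. The function name, the 3% missingness cut-off and the 3-standard-deviation heterozygosity cut-off are illustrative assumptions; real studies choose thresholds after inspecting the distributions.

```python
from statistics import mean, stdev

def sample_qc(genotypes, max_missing=0.03, het_sd=3.0):
    """Flag samples with high missingness or excess heterozygosity.

    genotypes: dict mapping sample id -> list of 0/1/2 or None (missing).
    Returns (rates, flagged): rates maps id -> (missing_rate, het_rate),
    flagged maps id -> list of reasons.
    """
    rates = {}
    for sample_id, calls in genotypes.items():
        called = [g for g in calls if g is not None]
        missing = 1 - len(called) / len(calls)
        het = sum(1 for g in called if g == 1) / len(called) if called else None
        rates[sample_id] = (missing, het)

    hets = [het for _, het in rates.values() if het is not None]
    mu, sd = mean(hets), stdev(hets)  # stdev needs at least two samples

    flagged = {}
    for sample_id, (missing, het) in rates.items():
        reasons = []
        if missing > max_missing:
            reasons.append("missingness")
        # Only unusually HIGH heterozygosity is flagged in this sketch
        if het is not None and sd > 0 and het - mu > het_sd * sd:
            reasons.append("heterozygosity")
        if reasons:
            flagged[sample_id] = reasons
    return rates, flagged
```

Note that with very few samples a single outlier cannot exceed 3 standard deviations of the sample itself, so the default cut-off only makes sense for realistically sized cohorts.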
Sample quality control
Related samples may be used unintentionally, or samples
accidentally duplicated, in large-scale studies.
Such cryptic relatedness is easy to infer by
measuring allele sharing between pairs of samples.
Typically, the sample in each related pair with the
fewest missing genotypes is retained in the
study.
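Allele sharing between two samples can be approximated by the average identity-by-state (IBS) proportion across SNPs. The sketch below is a crude illustration under assumed names and an assumed 0.95 threshold; production pipelines typically estimate identity by descent from LD-pruned markers instead. Within each flagged pair it drops the sample with more missing genotypes, as described above.

```python
from itertools import combinations

def ibs_proportion(g1, g2):
    """Mean proportion of alleles shared IBS over SNPs called in both samples."""
    shared = 0
    n_snps = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        shared += 2 - abs(a - b)  # 2, 1 or 0 alleles shared at this SNP
        n_snps += 1
    if n_snps == 0:
        raise ValueError("no SNPs called in both samples")
    return shared / (2 * n_snps)

def samples_to_drop(genotypes, threshold=0.95):
    """Return ids to exclude so that no pair exceeds the IBS threshold."""
    drop = set()
    for s1, s2 in combinations(sorted(genotypes), 2):
        if s1 in drop or s2 in drop:
            continue
        if ibs_proportion(genotypes[s1], genotypes[s2]) >= threshold:
            missing1 = genotypes[s1].count(None)
            missing2 = genotypes[s2].count(None)
            # Keep the sample with fewer missing calls (ties keep s1)
            drop.add(s1 if missing1 > missing2 else s2)
    return drop
```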
In family-based studies, the authenticity of the
pedigree relationships can be checked by
calculating the extent of Mendelian inconsistency
(e.g., with PedCheck software)
Exclude those that are inconsistent
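The idea of a Mendelian inconsistency can be illustrated for a parent-offspring trio at autosomal biallelic SNPs. This is a simplified sketch and not PedCheck's algorithm; the names and the 0/1/2 genotype coding (copies of one allele, None for missing) are assumptions.

```python
# Number of copies of the counted allele a parent can transmit, by genotype
TRANSMIT = {0: {0}, 1: {0, 1}, 2: {1}}

def is_mendelian_consistent(father, mother, child):
    """True if the child's genotype can arise from the parents' genotypes."""
    possible = {f + m for f in TRANSMIT[father] for m in TRANSMIT[mother]}
    return child in possible

def mendelian_error_rate(father, mother, child):
    """Fraction of fully genotyped SNPs showing a Mendelian inconsistency."""
    checked = 0
    errors = 0
    for f, m, c in zip(father, mother, child):
        if f is None or m is None or c is None:
            continue  # skip SNPs with any missing call in the trio
        checked += 1
        if not is_mendelian_consistent(f, m, c):
            errors += 1
    return errors / checked if checked else float("nan")

# Made-up trio: the second child genotype vector contains two errors
father = [0, 2, 1, 0]
mother = [0, 2, 1, 2]
print(mendelian_error_rate(father, mother, [0, 2, 2, 1]))
print(mendelian_error_rate(father, mother, [1, 2, 0, 0]))
```

A high error rate across many SNPs points to a misspecified relationship or sample swap rather than isolated genotyping errors.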
SNP quality control
Remove SNPs with a low call rate (e.g., <97%)
◦ Call rate: proportion of genotypes actually called by
the software
Remove SNPs and individuals that have too much
missing data
Test for deviation from Hardy-Weinberg equilibrium
(e.g., chi-squared test)
Remove SNPs with very low minor allele
frequency
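The SNP-level filters can be combined into one per-SNP check, sketched below. The 97% call-rate threshold comes from the slide; the minor allele frequency cut-off of 1%, the Hardy-Weinberg p-value cut-off of 1e-6, the function name and the 0/1/2 genotype coding are illustrative assumptions. The HWE test here is the simple 1-df chi-squared test, which is unreliable when expected counts are small.

```python
import math

def snp_qc(calls, min_call_rate=0.97, min_maf=0.01, hwe_alpha=1e-6):
    """QC summary for one SNP.

    calls: list over samples of 0/1/2 (copies of allele 'a') or None.
    Returns a dict with call_rate, maf, hwe_p and an overall 'passed' flag.
    """
    called = [g for g in calls if g is not None]
    n = len(called)
    call_rate = n / len(calls)
    if n == 0:
        return {"call_rate": 0.0, "maf": None, "hwe_p": None, "passed": False}

    n_AA, n_Aa, n_aa = (called.count(k) for k in (0, 1, 2))
    q = (n_Aa + 2 * n_aa) / (2 * n)  # frequency of allele 'a'
    p = 1 - q
    maf = min(p, q)

    if maf == 0:
        hwe_p = 1.0  # monomorphic SNP: the HWE test is not informative
    else:
        observed = (n_AA, n_Aa, n_aa)
        expected = (n * p * p, 2 * n * p * q, n * q * q)
        chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        hwe_p = math.erfc(math.sqrt(chi2 / 2))  # chi-squared, 1 df

    passed = call_rate >= min_call_rate and maf >= min_maf and hwe_p >= hwe_alpha
    return {"call_rate": call_rate, "maf": maf, "hwe_p": hwe_p, "passed": passed}
```

In case-control studies the HWE test is often applied to controls only, since a true association can itself distort genotype proportions in cases.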
Population structure
Population structure refers to the genetic differences that exist between
individuals from different groups, populations or geographical regions.
There are a number of established statistical strategies for detecting
population structure; those commonly used in genome-wide
studies include genomic control (GC), which estimates the
degree of inflation of the test statistic.
Figure: A representation of how differences in genotypic (or allelic) frequencies across
different populations can introduce false signals of association
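Genomic control is commonly implemented by comparing the median of the observed association statistics with the median expected under the null. The sketch below assumes 1-df chi-squared statistics (whose null median is about 0.4549); the function name is hypothetical, and leaving statistics unchanged when the estimated inflation factor is below 1 is a conventional choice rather than something stated in these slides.

```python
from statistics import median

# Median of a chi-squared distribution with 1 degree of freedom
CHI2_1DF_MEDIAN = 0.4549364

def genomic_control(chi2_stats):
    """Return (lambda_gc, corrected_stats) for a list of 1-df chi-squared statistics."""
    lambda_gc = median(chi2_stats) / CHI2_1DF_MEDIAN
    # Only deflate the statistics when there is evidence of inflation
    divisor = max(lambda_gc, 1.0)
    corrected = [x / divisor for x in chi2_stats]
    return lambda_gc, corrected

# Made-up statistics whose median is twice the null median
lam, corrected = genomic_control([0.1, 0.5, 0.9098728, 2.0, 30.0])
print(round(lam, 3), round(corrected[-1], 3))
```

A lambda close to 1 suggests little inflation; values well above 1 suggest population structure or other systematic bias.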
Selection of Markers for Association studies
The human genome consists of over 3 billion base pairs
It has about 20,000 protein-coding genes (earlier estimates were closer to 28,000)
Individuals are identical for ~99.5% of their
sequence, with the small remaining part variable to
differing extents
Could this variation have a role in explaining differences
in genetic susceptibility to disease?
◦ Addressed by comparing variation between diseased (cases) and
healthy (control) individuals from the same population
A variant whose frequency at a specific locus is >1% is
said to be a polymorphism
The most common class of polymorphisms is SNPs,
which comprise ~90% of all human variation
Other types include larger blocks of
sequence variation (mini-/micro-satellites) and indels
LD: non-random association of alleles at two or more loci, which may or may not
be on the same chromosome
SNPs in LD?
dbSNP contains about 10 million SNPs
The HapMap project identified tagSNPs, which can be
used as proxies for other SNPs in LD with them, reducing the
number of markers to be examined
SNP-SNP association, or linkage disequilibrium,
is fundamental to our ability to sample the whole
genome with relatively few SNPs.
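One simple way to see how LD reduces the number of markers is a greedy selection of tag SNPs from pairwise r² values. This is a sketch under stated assumptions, not the HapMap project's own procedure: the function name, the r² threshold of 0.8, the SNP ids and the dictionary layout are all hypothetical.

```python
def select_tag_snps(snps, r2, threshold=0.8):
    """Greedily pick tag SNPs so every SNP is a tag or in high LD with one.

    snps: list of SNP ids.
    r2: dict mapping frozenset({id1, id2}) -> pairwise r-squared;
        missing pairs are treated as r-squared = 0.
    """
    def tagged_by(snp, pool):
        return {t for t in pool
                if t == snp or r2.get(frozenset((snp, t)), 0.0) >= threshold}

    untagged = set(snps)
    tags = []
    while untagged:
        # Pick the SNP that captures the most still-untagged SNPs;
        # ties go to the first id in sorted order
        best = max(sorted(untagged), key=lambda s: len(tagged_by(s, untagged)))
        tags.append(best)
        untagged -= tagged_by(best, untagged)
    return tags

# Made-up LD structure: snp1-3 form one block, snp4-5 another
r2 = {
    frozenset(("snp1", "snp2")): 0.90,
    frozenset(("snp1", "snp3")): 0.85,
    frozenset(("snp2", "snp3")): 0.95,
    frozenset(("snp4", "snp5")): 0.82,
    frozenset(("snp3", "snp4")): 0.30,
}
print(select_tag_snps(["snp1", "snp2", "snp3", "snp4", "snp5"], r2))
```

In this toy example five SNPs are covered by two tags, which is the kind of reduction that makes genome-wide arrays feasible.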
Genome-wide association study of 14,000 cases of seven common
diseases and 3,000 shared controls (Nature, Vol 447, 7 June 2007),
using the Affymetrix GeneChip 500K Mapping Array Set
TaqMan assay system and mechanism of action
• One of the best methods for SNP genotyping
• Robust, reliable and very easy to prepare
• Can be done in a 384-well plate
• Very low genotyping error rate
• The reaction can be run on a regular thermal
cycler, but a Real-Time PCR detection system
is necessary to scan the plates