A statistical learning approach to
detect carriers of the HH1 haplotype
in Italian Holstein Friesian cattle
Stefano Biffani*, Stefania Chessa,
Alessandra Stella and Filippo Biscarini¹
*IBBA-CNR (UOS-Lodi)
¹Department of Bioinformatics (PTP-LODI)
Background
The advent of Genotyping
GWAS
GEBV
PARENTAGE/
POP STRUCTURE
CARRIERS/
NON CARRIERS
Background
The advent of Genotyping
Statistical Learning:
Tools for understanding data
Background
• Haplotype or gene allele prediction from
marker genotypes
• K-casein
• Embryo losses
• Perinatal mortality
Background
• Haplotype reconstruction & Imputation
– Extended pedigree information
– Time consuming
– Computationally demanding
• BEAGLE
• IMPUTE2
• MACH
• FINDHAP
• PEDIMPUTE
• FIMPUTE
• ….
Nicolazzi et al. (2015), Animal Genetics. doi: 10.1111/age.12295
The idea
CLASS LABEL            FEATURES (INPUT DATA)
CARRIER/NOT CARRIER    SNP1   SNP2   …   SNP100   …   SNP1000   …   SNPn
YES                    0      1      …   1        …   2         …   1
YES                    2      2      …   0        …   2         …   0
NO                     2      1      …   1        …   1         …   1
YES                    1      2      …   1        …   2         …   1
YES                    1      0      …   2        …   1         …   2
NO                     1      0      …   2        …   2         …   1
NO                     1      0      …   2        …   0         …   0
The idea
The idea
• LET’S APPLY A SIMPLE LINEAR CLASSIFICATION
METHOD FOR:
– the prediction of specific haplotype carriers from
SNP data
– HH1 on BTA5 - reduced cow fertility
– Original idea applied to Italian Brown Swiss
(Genetics Selection Evolution 2015, 47:4)
Data
• 1104 Holstein bulls
• genotyped with the 54K SNP-chip
• Only SNPs on BTA5 (2225)
• Carriers = 5.3% - non-carriers = 94.7%
Methods
Using 2.5, 10, 15, 30, 50 and 100% of the SNPs on BTA5
– For each threshold:
1. Check linear dependencies and collinearity (r = 0.7)
2. Backward Stepwise Selection (BSS) to select the SNPs
that best fit the model
3. Linear Discriminant Analysis (LDA) to classify
observations
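A minimal sketch of step 1, assuming that "r = 0.7" is a cutoff on the absolute pairwise Pearson correlation between SNP columns (the slide does not say how the check was implemented). The function names `pearson_r` and `prune_collinear` and the toy genotypes are hypothetical, and the greedy keep-first rule is only one of several reasonable pruning strategies.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def prune_collinear(columns, threshold=0.7):
    """Greedily keep SNPs whose |r| with every already-kept SNP is <= threshold.

    `columns` maps a SNP name to its list of genotype codes (one per animal).
    The 0.7 default is an assumed reading of the slide's "r = 0.7".
    """
    kept = []
    for name, values in columns.items():
        if len(set(values)) < 2:          # constant column: nothing to learn from it
            continue
        if all(abs(pearson_r(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy genotypes coded 0/1/2 (hypothetical values, seven animals)
snps = {
    "SNP1": [0, 2, 2, 1, 1, 1, 1],
    "SNP2": [1, 2, 1, 2, 0, 0, 0],
    "SNP3": [0, 2, 2, 1, 1, 1, 1],   # exact copy of SNP1, so r = 1
    "SNP4": [1, 1, 1, 1, 1, 1, 1],   # constant
}
kept = prune_collinear(snps)
print(kept)  # ['SNP1', 'SNP2']
```

The surviving columns would then be passed to the selection and classification steps.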
Methods
• For each threshold a 10-fold cross-validation
scheme was used (repeated 100 times)
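As a rough illustration of the resampling scheme, the sketch below generates the train/test index sets for a 10-fold cross-validation repeated 100 times. The function name `repeated_kfold`, the seed, and the plain (unstratified) shuffling are assumptions; the slides do not state whether folds were balanced for carrier status.

```python
import random

def repeated_kfold(n_samples, k=10, repeats=100, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for rep in range(repeats):
        rng.shuffle(indices)                        # fresh random partition per repeat
        folds = [indices[i::k] for i in range(k)]   # k nearly equal-sized folds
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield rep, f, train_idx, test_idx

splits = list(repeated_kfold(1104))   # 1104 bulls, as in the Data slide
print(len(splits))  # 1000
```

With only about 5% carriers, some unstratified folds may contain very few carriers, which is one reason a stratified split could be preferred in practice.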
Statistics
1. Total Error Rate (TER): number of misclassified
animals over the total number of animals
2. False Positive Rate (FPR): number of non-carriers
misclassified as carriers over the total number of
non-carriers
3. False Negative Rate (FNR): number of carriers
misclassified as non-carriers over the total
number of carriers
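The three statistics can be sketched as a small helper. The function name `error_rates` and the label strings are hypothetical, and the example uses 59 carriers among 1104 bulls as an approximation of the 5.3% reported on the Data slide. It shows why TER alone can mislead with unbalanced classes: a classifier that never predicts a carrier still has a low total error.

```python
def error_rates(y_true, y_pred, positive="YES"):
    """Return TER, FPR and FNR for a two-class carrier / non-carrier problem.

    Assumes both classes are present in y_true.
    """
    fp = fn = n_pos = n_neg = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            n_pos += 1
            fn += pred != positive      # carrier called non-carrier
        else:
            n_neg += 1
            fp += pred == positive      # non-carrier called carrier
    return {
        "TER": (fp + fn) / (n_pos + n_neg),
        "FPR": fp / n_neg,
        "FNR": fn / n_pos,
    }

# Hypothetical labels: 59 carriers among 1104 bulls (about 5.3%)
y_true = ["YES"] * 59 + ["NO"] * 1045
always_no = ["NO"] * 1104              # trivial rule: nobody is a carrier
rates = error_rates(y_true, always_no)
print(rates)  # TER is about 0.053, FPR is 0.0, FNR is 1.0
```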
Results - Total Error Rate
Results - False Positive Rate (FPR)
Results - False Negative Rate (FNR)
Additional considerations
• For each threshold we have 1000 replicates (10 folds × 100 repetitions)
– Count the # of times a SNP was selected
– genomic regions that harbour SNPs most relevant
for prediction can be identified
[Figure: SNP positions on BTA5 (Mbp), axis ticks at 60 and 70, with the HH1 region marked]
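Counting how often each SNP is selected across replicates could look like the sketch below. The function name `selection_frequency` and the SNP names are hypothetical; in the study there would be 1000 replicates per threshold rather than four.

```python
from collections import Counter

def selection_frequency(selected_per_replicate):
    """Fraction of replicates in which each SNP was selected, most frequent first."""
    counts = Counter()
    for selected in selected_per_replicate:
        counts.update(set(selected))    # count a SNP at most once per replicate
    n = len(selected_per_replicate)
    return {snp: c / n for snp, c in counts.most_common()}

# Hypothetical output of the selection step in four replicates
replicates = [["snpA", "snpB"], ["snpA"], ["snpA", "snpC"], ["snpB", "snpA"]]
freq = selection_frequency(replicates)
print(freq)  # {'snpA': 1.0, 'snpB': 0.5, 'snpC': 0.25}
```

Plotting these frequencies against map position would highlight the genomic regions that contribute most to prediction.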
Conclusions
• Simple, fast and reliable method to accurately
identify haplotype carriers from SNP
genotypes
• The classification step can in principle be
replaced by any other linear or nonlinear
classification method (e.g. LR, SVM)
Conclusions
• The methodology can be applied to different
situations (e.g., causal mutation)
• Cross-validation, the right way: assign
observations to the training and testing sets
before the variable selection and classification
steps
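The ordering matters because selecting SNPs on the full data lets information from the test animals leak into the model. The sketch below shows the intended order under strong simplifications that are not the authors' method: a single-SNP filter (largest difference in class means) stands in for backward stepwise selection, and a one-feature, two-class LDA stands in for the full LDA step. All function names and the toy data are hypothetical.

```python
from math import log

def select_snp(X, y):
    """Stand-in for variable selection: index of the SNP with the largest
    absolute difference between carrier (1) and non-carrier (0) means."""
    def mean(values):
        return sum(values) / len(values)
    best, best_gap = 0, -1.0
    for j in range(len(X[0])):
        gap = abs(mean([r[j] for r, c in zip(X, y) if c == 1])
                  - mean([r[j] for r, c in zip(X, y) if c == 0]))
        if gap > best_gap:
            best, best_gap = j, gap
    return best

def fit_lda_1d(x, y):
    """One-feature LDA: class means and priors plus pooled within-class variance."""
    params, ss = {}, 0.0
    for c in (0, 1):
        xc = [v for v, label in zip(x, y) if label == c]
        mu = sum(xc) / len(xc)
        ss += sum((v - mu) ** 2 for v in xc)
        params[c] = (mu, len(xc) / len(x))
    return params, ss / (len(x) - 2)    # pooled variance, assumed > 0

def predict_lda_1d(model, v):
    """Assign v to the class with the larger linear discriminant score."""
    params, var = model
    def score(c):
        mu, prior = params[c]
        return v * mu / var - mu * mu / (2 * var) + log(prior)
    return max((0, 1), key=score)

def cross_validate(X, y, folds):
    """Return (total error rate, SNP chosen in each fold)."""
    errors, chosen = 0, []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        j = select_snp(X_tr, y_tr)      # selection sees training animals only
        model = fit_lda_1d([r[j] for r in X_tr], y_tr)
        errors += sum(predict_lda_1d(model, X[i][j]) != y[i] for i in test_idx)
        chosen.append(j)
    return errors / len(y), chosen

# Toy data: 12 animals, 2 SNPs; the first SNP tracks carrier status
y = [1, 0, 0, 0] * 3
X = [[2, 1], [0, 0], [1, 2], [0, 1],
     [2, 1], [0, 2], [0, 0], [1, 1],
     [2, 1], [1, 1], [0, 2], [0, 0]]
folds = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
ter, chosen = cross_validate(X, y, folds)
print(ter, chosen)  # 0.0 [0, 0, 0]
```

The key line is the call to `select_snp` inside the fold loop: moving it outside the loop, onto all animals, would reproduce the mistake the slide warns against.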
THANKS
