J. B. Cole
Animal Improvement Programs Laboratory
Agricultural Research Service, USDA
Beltsville, MD 20705-2350
john.cole@ars.usda.gov
Data Structures and Visualization
Really Big Data: Processing and Analysis of Very Large Datasets, ADSA/ASAS Joint Annual Meeting, July 2011 (2) Cole
Introduction
• We’re drowning in information
• Genetics are viewed as a commodity
• We need to get better data from
fewer cows
• Do we have the resources we need?
U.S. dairy population
[Figure: U.S. dairy cows (millions, axis 0 to 30) by year, 1940 to 2000]
We need to do more with less
• 47% of U.S. dairy cows are enrolled in DHIA testing
• Class III milk is priced at $17/cwt
• Grain prices are very high
 Corn averaged $6/bu in May
 Soybeans averaged $13/bu in May
• Enrollment and cow numbers are unlikely to increase
Major topics
• Different sources of data
• Data source integration and quality
• Data mining models
• Visualization examples
• Computational resources
Data currently in national database
• Identification and registration
• Conformation scores
• Milk production and composition
• Fertility
• Longevity
• Some genotypes
What are big data?
Type of record                Number of records¹
Cows with lactation data              28,394,976
Lactations                            68,373,863
Individual test days                 508,574,732
Calving ease records                  20,770,758
Animals in pedigree file              58,893,009
Bull genotypes                            50,393
Cow genotypes                             70,687
¹ Totals include animals from all breeds.
Data not routinely available
• Farm and herd management
 Geography and climate
 Housing systems
 Feed intake
• Milk composition
 Milk fats, proteins, vitamins, minerals
 Conductivity, lactose, MUN
• DNA data
 Cow SNP genotypes, DNA sequence data
Photo: NOAA
Data “trapped” on the farm
• Fertility
 Insemination information
 Use of estrus synchronization
• Cow health and longevity
 Body condition scores
 Birth weights and mature weights
 Disease occurrence data
Electronic milk meters
• Can currently provide:
 Milk yield
 Milking speed
 Electrical conductivity
• May also supply:
 Progesterone levels
 Milk temperature
 Fat and protein concentrations
Photo: afimilk
Other sources of data
• RFID tags reduce the ID error rates associated with meter data
• Pedometers are useful for detecting estrus, the onset of calving, and some early-stage disease
Top: Allflex; Bottom: afimilk
Current sources of data
[Diagram: AIPL, CDCB, NAAB, PDCA, DHI, and universities]
AIPL: Animal Improvement Programs Laboratory, USDA
CDCB: Council on Dairy Cattle Breeding
DHI: Dairy Herd Improvement (milk recording organizations)
NAAB: National Association of Animal Breeders (AI)
PDCA: Purebred Dairy Cattle Association (breed registries)
Sources of genomic data
[Diagram: samples moving among dairy producers, requesters (e.g., AI, breeds), DNA laboratories, and AIPL]
Data source integration
• Incoming data from different sources are checked against one another
• The AIPL edits system consists of ~64,000 source lines of code (SLOC)
 Mostly C, some Fortran 90
• Data stored in a relational database
Typical edits
• Match birth date with dam’s calving date
• Compare with other sources (e.g., breed association)
• Investigate maternal sibs born within 9 mo of each other (may assume ET)
• Animals with IDs within 100 of each other and the same sire, dam, and birth date are assumed to be twins
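Two of these edits, the twin assumption and the maternal-sib check, can be sketched in a few lines. This is only an illustration and not the AIPL edits system: the record layout, the numeric IDs, and the 270-day stand-in for 9 months are all assumptions.

```python
from datetime import date
from itertools import combinations

# Hypothetical records: (animal_id, sire_id, dam_id, birth_date).
animals = [
    (1001, 10, 20, date(2010, 3, 1)),
    (1002, 10, 20, date(2010, 3, 1)),   # same sire, dam, birth date as 1001
    (1500, 11, 20, date(2010, 9, 15)),  # same dam, born 198 days later
    (2000, 12, 21, date(2010, 5, 5)),
]

def likely_twins(a, b, max_id_gap=100):
    """IDs within max_id_gap and identical sire, dam, and birth date."""
    return abs(a[0] - b[0]) <= max_id_gap and a[1:] == b[1:]

def close_maternal_sibs(a, b, max_days=270):
    """Same dam, different birth dates fewer than max_days apart."""
    gap = abs((a[3] - b[3]).days)
    return a[2] == b[2] and 0 < gap < max_days

pairs = list(combinations(animals, 2))
twins = [(a[0], b[0]) for a, b in pairs if likely_twins(a, b)]
flagged = [(a[0], b[0]) for a, b in pairs if close_maternal_sibs(a, b)]

print("assumed twins:", twins)
print("investigate (possible ET):", flagged)
```

Comparing every pair is quadratic in the number of animals, so a real edit would first group records by dam before comparing.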
How do we assess data quality?
• Consistency
 e.g., calving, progeny birth, breeding, dry dates
• Parentage verification
• Electronic ID
• Within-herd heritability
Data mining
• The discovery of useful, possibly
unexpected patterns in data
• Four principal tasks
 Association
 Clustering
 Classification
 Regression
Bonferroni’s principle
• You will find interesting patterns if
you look hard enough
• Not all relationships are legitimate
• You must have enough data to
support the questions you’re
asking
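A rough way to see why patterns appear when you look hard enough is to count how many "significant" results pure chance would produce. The sketch below assumes independent tests with no real effects. The test count of 43,382 is borrowed from the SNP count quoted later in this talk, and the 0.05 level is an assumption.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests that pass by chance alone."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, alpha):
    """Per-test significance level that keeps the family-wise rate near alpha."""
    return alpha / n_tests

n_tests = 43_382  # assumed: one test per SNP
alpha = 0.05      # assumed nominal significance level

chance_hits = expected_false_positives(n_tests, alpha)
per_test_alpha = bonferroni_threshold(n_tests, alpha)

print(f"expected chance findings: {chance_hits:.1f}")
print(f"corrected per-test threshold: {per_test_alpha:.2e}")
```

If the number of findings is not well above the number expected by chance, the data do not support the question being asked.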
Association analysis
• Discover interesting relationships
among variables in large databases
 e.g., predicting protein function and
identifying SNP-disease associations
 Not statistical association analysis!
• Lots of algorithms, many based on
counting attributes
• Watch for false positives
 Measures co-occurrence, not causality
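A counting-based approach can be sketched with the usual support and confidence measures. The health events below are invented for illustration, and the two measures are one common choice rather than anything prescribed in this talk.

```python
from collections import Counter
from itertools import combinations

# Hypothetical health events recorded per cow-lactation.
records = [
    {"mastitis", "ketosis"},
    {"mastitis", "lameness"},
    {"mastitis", "ketosis", "metritis"},
    {"ketosis"},
    {"lameness"},
]

item_counts = Counter()
pair_counts = Counter()
for events in records:
    item_counts.update(events)
    # Sort so each unordered pair is always stored under the same key.
    pair_counts.update(combinations(sorted(events), 2))

def support(item_a, item_b):
    """Fraction of all records in which both items occur."""
    pair = tuple(sorted((item_a, item_b)))
    return pair_counts[pair] / len(records)

def confidence(antecedent, consequent):
    """Fraction of records with the antecedent that also have the consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]

pair_support = support("ketosis", "mastitis")
rule_confidence = confidence("ketosis", "mastitis")
print(pair_support, rule_confidence)
```

Both numbers only describe how often events occur together; neither says that one event causes the other.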
Clustering
• Place items into distinct groups
such that
 Items in a group are similar
 Items in one group are dissimilar to
those in other groups
• Hierarchical or partitional
approaches
Partitional clustering
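In a partitional method each item is assigned to exactly one group. A minimal sketch using k-means, one common partitional algorithm (the talk does not name a specific one), follows. The points and the starting centroids are made up.

```python
def kmeans(points, centroids, n_iter=10):
    """Plain k-means on 2-D points; returns (centroids, labels)."""
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid (squared distance).
        labels = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2
                              + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    (sum(m[0] for m in members) / len(members),
                     sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[k])  # leave empty clusters alone
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels

# Hypothetical 2-D observations forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
centroids, labels = kmeans(points, [(0.0, 0.0), (6.0, 6.0)])
print(labels)
```

The result depends on the starting centroids, so in practice the algorithm is usually run from several random starts.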
Hierarchical clustering
• Nested clusters organized into
hierarchical trees
• Data objects may belong to
multiple subsets
• Examples
 Relationships among species
 Evolutionary history of proteins
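The nesting can be seen in a small agglomerative example. The sketch below uses single linkage, chosen here only for brevity, and a made-up distance matrix for four items labeled A to D.

```python
def single_linkage(dist, names):
    """Agglomerative clustering; returns the merges in the order they occur."""
    clusters = [frozenset([i]) for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] | clusters[j]
        merges.append((sorted(names[k] for k in merged), d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

names = ["A", "B", "C", "D"]
# Hypothetical symmetric distance matrix.
dist = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]
merges = single_linkage(dist, names)
for members, height in merges:
    print(members, height)
```

Each item ends up in several nested clusters (A is in {A, B} and in {A, B, C, D}), which is the tree structure a dendrogram draws.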
• BFGL-Illumina: deep SNP discovery
 Angus, Holstein, Limousin, Jersey, Nelore, Brahman, Romagnola, Gir
• BFGL: genome assemblies
 Nelore, Water Buffalo
• Pfizer: light SNP discovery
 Angus, Holstein, Jersey, Hereford, Charolais, Simmental, Brahman, Wagyu
• Partners: deep SNP discovery
 N’Dama, Sahiwal, Simmental, Hanwoo, Blonde d’Aquitaine, Montbeliard
Classification
• Training set used to develop a rule
for assigning individuals to classes
• Validation set used to assess the
accuracy of the classification rule
• Examples
 Identify cows with subclinical mastitis
 Mate assignment
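The training and validation roles can be sketched with a one-nearest-neighbor classifier. The feature values and labels below are invented, and nearest-neighbor is just one simple choice of rule.

```python
def nearest_neighbor(train, x):
    """Return the label of the training record closest to x."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(train, key=lambda rec: sq_dist(rec[0], x))
    return label

# Hypothetical standardized features for each cow, with a known class.
training = [
    ((0.1, 0.2), "healthy"),
    ((0.3, 0.1), "healthy"),
    ((0.9, 0.8), "subclinical"),
    ((0.8, 1.0), "subclinical"),
]
# Held-out records used only to measure accuracy.
validation = [
    ((0.2, 0.3), "healthy"),
    ((0.7, 0.9), "subclinical"),
    ((0.6, 0.6), "healthy"),
]

predictions = [nearest_neighbor(training, x) for x, _ in validation]
correct = sum(pred == y for pred, (_, y) in zip(predictions, validation))
accuracy = correct / len(validation)
print(predictions, accuracy)
```

Because the validation records play no part in building the rule, the accuracy measured on them is a fairer estimate than accuracy on the training set.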
Classification methods
• Bayesian belief networks
• Decision trees
• Nearest-neighbor classification
• Neural networks
• Rule-based classification
• Support vector machines
Decision tree classification
Pinzón-Sánchez et al., 2011, JDS, 94:1873-1892.
Rule-based classification
• Classify records using a series of
“if…then” rules
• Rules come directly from the data,
or from other classification models
• e.g., if (PTA NM$ ≥ $800) and (EFI ≤
0.05) then (breed to cow)
• Easy to generate and interpret
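The example rule translates almost word for word into code. In this sketch the bull names and values are hypothetical, and the "do not breed" outcome is a placeholder because the rule above states only the positive case.

```python
def mating_rule(pta_nm_dollars, efi):
    """Apply the example rule: NM$ >= 800 and EFI <= 0.05 -> breed to cow."""
    if pta_nm_dollars >= 800 and efi <= 0.05:
        return "breed to cow"
    return "do not breed"  # placeholder outcome, not stated in the rule

# Hypothetical bulls: (name, PTA NM$, EFI).
bulls = [
    ("Bull A", 850, 0.04),
    ("Bull B", 900, 0.07),  # fails the EFI condition
    ("Bull C", 650, 0.03),  # fails the NM$ condition
]

decisions = {name: mating_rule(nm, efi) for name, nm, efi in bulls}
for name, decision in decisions.items():
    print(name, "->", decision)
```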
Regression models
• Prediction of real-valued outputs
• Given one or more attributes, we
can predict, for example—
 Breeding values
 Feed intake
 Milk and components yields
• Very mature analytical tools
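The simplest case is ordinary least squares with a single attribute. The sketch below fits a straight line to made-up numbers that lie exactly on y = 2x + 1, so the fit is easy to check by eye.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical attribute and response values.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(x, y)
prediction = intercept + slope * 5.0  # predict the response at x = 5
print(slope, intercept, prediction)
```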
Visualization
• How do we present lots of numbers
in a compact form?
• “Graphical methods can retain the
information in the data.” ― Deming
• Complements numerical
techniques
 Tukey (1977), Tufte (1983, 1990, 1997, 2006), Cleveland (1985, 1993), Wickham (2009)
One image, millions of points
43,382 SNP solutions × 4,064 animals = 176,304,448 data points
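The point count is a plain product, and the same arithmetic gives a feel for the memory such a matrix needs. The 4 bytes per value (single precision) is an assumption, not something stated in the talk.

```python
n_snp = 43_382
n_animals = 4_064
bytes_per_value = 4  # assumed: single-precision floats

n_points = n_snp * n_animals
size_mb = n_points * bytes_per_value / 1e6

print(f"{n_points:,} data points")
print(f"about {size_mb:.0f} MB at {bytes_per_value} bytes per value")
```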
Use size to denote importance
Colors differentiate among chromosomes, and marker sizes are proportional to effect sizes.
O-Style Haplotypes (chromosome 15)
Correlations among calving traits
Provide multiple cues
Cole and VanRaden. 2011. J. Anim. Breed. Genet. Online, 1-10.
Lines are differentiated by color and pattern.
Interstitial figures
Cole and VanRaden. 2010. J. Dairy Sci. 93(6):2727-2740.
Computational capacity is abundant
WikiMedia Commons, Wgsimon, Transistor_Count_and_Moore%27s_Law_-_2011.svg
Supercomputer performance
• Cray-1 (1976): 136 megaFLOPS (10^6)
• Fujitsu K machine (2011): 8.16 petaFLOPS (10^15)
• Commodity hardware also has experienced gains in performance
Top: Sherwin Gooch; Bottom: Riken
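The two figures quoted above imply a speedup and an average doubling time that are easy to work out. This sketch treats the two machines as the only data points, so the doubling time is a rough average and not a fitted trend.

```python
import math

CRAY_1_FLOPS = 136e6        # 136 megaFLOPS (1976), as quoted above
K_MACHINE_FLOPS = 8.16e15   # 8.16 petaFLOPS (2011), as quoted above

speedup = K_MACHINE_FLOPS / CRAY_1_FLOPS
years = 2011 - 1976
# Number of doublings needed to reach the speedup, spread evenly over the years.
doubling_time_years = years / math.log2(speedup)

print(f"speedup: {speedup:.1e}x over {years} years")
print(f"average doubling time: {doubling_time_years:.2f} years")
```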
Storage costs are plummeting
Matthew Komorowski, http://www.mkomo.com/cost-per-gigabyte
Data storage technologies
• Storage costs are now as low as $100/TB
 Quality costs!
• Solid state disks are promising, but relatively low-capacity
• What do you do about backups?
Top: Snopes/IBM; Bottom: Tom’s Hardware
Memory is very cheap
Lev Lafayette, http://www.organdi.net/article.php3?id_article=82
Random access memory
• RAM is still much faster than disk (ns vs. ms access times)
• A 64-bit OS can address 16 EiB (2^64 bytes, about 18.4 EB), in theory
• How much can your motherboard hold?
Top: Stan Yack; Bottom: Samsung
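The theoretical limit of a flat 64-bit address space is 2^64 bytes. A quick conversion into binary (EiB) and decimal (EB) units, assuming byte-addressable memory:

```python
address_space_bytes = 2 ** 64

eib = address_space_bytes / 2 ** 60   # exbibytes (binary units)
eb = address_space_bytes / 10 ** 18   # exabytes (decimal units)
tib = address_space_bytes // 2 ** 40  # tebibytes, for comparison

print(f"{address_space_bytes:,} bytes")
print(f"{eib:.0f} EiB, or about {eb:.1f} EB")
print(f"{tib:,} TiB")
```

Real systems expose far less than this: processors implement fewer physical address bits, and the motherboard limits how much RAM can be installed.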
Software
• Complexity is increasing
 Parallelism is hard and debugging is
much harder
• Productive developers are expensive
and difficult to find
 A top programmer may be 10x as
productive as an average worker
Conclusions
• The more data we get, the more data
we want
• Relationships among traits may become
as important as individual traits
• Software may be more limiting than
hardware
Questions?

Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 

Recently uploaded (20)

9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 

Data Structures and Visualization

  • 1. Data Structures and Visualization
      J. B. Cole
      Animal Improvement Programs Laboratory
      Agricultural Research Service, USDA
      Beltsville, MD 20705-2350
      john.cole@ars.usda.gov
  • 2. Introduction
      • We’re drowning in information
      • Genetics are viewed as a commodity
      • We need to get better data from fewer cows
      • Do we have the resources we need?
      Really Big Data: Processing and Analysis of Very Large Datasets, ADSA/ASAS Joint Annual Meeting, July 2011 (2) Cole
  • 3. U.S. dairy population
      [Figure: cows (millions, axis 0 to 30) by year (axis ticks 40 to 00)]
  • 4. We need to do more with less
      • 47% of U.S. dairy cows are enrolled in DHIA testing
      • The Class III milk price is $17/cwt
      • Grain prices are very high
          ◦ Corn averaged $6/bu in May
          ◦ Soybeans averaged $13/bu in May
      • Enrollment and cow numbers are unlikely to increase
  • 5. Major topics
      • Different sources of data
      • Data source integration and quality
      • Data mining models
      • Visualization examples
      • Computational resources
  • 6. Data currently in national database
      • Identification and registration
      • Conformation scores
      • Milk production and composition
      • Fertility
      • Longevity
      • Some genotypes
  • 7. What are big data?
      Type of record              Number of records¹
      Cows with lactation data    28,394,976
      Lactations                  68,373,863
      Individual test days        508,574,732
      Calving ease records        20,770,758
      Animals in pedigree file    58,893,009
      Bull genotypes              50,393
      Cow genotypes               70,687
      ¹ Totals include animals from all breeds.
  • 8. Data not routinely available
      • Farm and herd management
          ◦ Geography and climate
          ◦ Housing systems
          ◦ Feed intake
      • Milk composition
          ◦ Milk fats, proteins, vitamins, minerals
          ◦ Conductivity, lactose, MUN
      • DNA data
          ◦ Cow SNP genotypes, DNA sequence data
      Photo: NOAA
  • 9. Data “trapped” on the farm
      • Fertility
          ◦ Insemination information
          ◦ Use of estrus synchronization
      • Cow health and longevity
          ◦ Body condition scores
          ◦ Birth weights and mature weights
          ◦ Disease occurrence data
  • 10. Electronic milk meters
      • Currently can provide:
          ◦ Milk yield
          ◦ Milking speed
          ◦ Electrical conductivity
      • May possibly supply:
          ◦ Progesterone levels
          ◦ Milk temperature
          ◦ Fat and protein concentrations
      Photo: afimilk
  • 11. Other sources of data
      • RFID tags lower the ID error rates associated with meter data
      • Pedometers are useful for detecting estrus, the onset of calving, and some early-stage disease
      Top: Allflex; Bottom: afimilk
  • 12. Current sources of data
      [Diagram labels: AIPL, CDCB, NAAB, PDCA, DHI, Universities]
      AIPL: Animal Improvement Programs Lab., USDA
      CDCB: Council on Dairy Cattle Breeding
      DHI: Dairy Herd Improvement (milk recording organizations)
      NAAB: National Association of Animal Breeders (AI)
      PDCA: Purebred Dairy Cattle Association (breed registries)
  • 13. Sources of genomic data
      [Diagram labels: AIPL; Requester (e.g., AI, breeds); Dairy producers; DNA laboratories; samples]
  • 14. Data source integration
      • Incoming data from different sources are checked against one another
      • The AIPL edits system consists of ~64,000 SLOC
          ◦ Mostly C, some Fortran 90
      • Data are stored in a relational database
  • 15. Typical edits
      • Match birth date with dam’s calving
      • Compare with other sources (e.g., breed association)
      • Investigate maternal sibs born within 9 mo of each other (may assume embryo transfer, ET)
      • Animals with IDs within 100 of each other and the same sire, dam, and birth date are assumed to be twins
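The last two edits on this slide can be sketched in a few lines of Python. This is only an illustration, not the AIPL edits system (which the slides describe as mostly C with some Fortran 90). The record layout, the numeric IDs, the inclusive 100-ID window, and the 274-day approximation of 9 months are all assumptions made for the example.

```python
from datetime import date

def likely_twins(a, b, max_id_gap=100):
    """Flag two records as probable twins: same sire, dam, and birth date,
    with numeric IDs no more than max_id_gap apart (assumed inclusive)."""
    return (
        a["sire"] == b["sire"]
        and a["dam"] == b["dam"]
        and a["birth"] == b["birth"]
        and abs(a["id"] - b["id"]) <= max_id_gap
    )

def sibs_need_review(a, b, max_days=274):
    """Flag maternal sibs born on different dates less than about 9 months
    apart (274 days is an assumed stand-in for 9 months)."""
    if a["dam"] != b["dam"] or a["birth"] == b["birth"]:
        return False
    return abs((a["birth"] - b["birth"]).days) < max_days

# Hypothetical records for illustration only.
calf1 = {"id": 1001, "sire": "S1", "dam": "D1", "birth": date(2011, 3, 1)}
calf2 = {"id": 1042, "sire": "S1", "dam": "D1", "birth": date(2011, 3, 1)}
calf3 = {"id": 5000, "sire": "S2", "dam": "D1", "birth": date(2011, 8, 15)}

print(likely_twins(calf1, calf2))      # same parents and birth date, IDs 41 apart
print(sibs_need_review(calf1, calf3))  # same dam, births 167 days apart
```

A flagged sib pair is a prompt for investigation (possibly embryo transfer), not an automatic rejection.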
  • 16. How do we assess data quality?
      • Consistency
          ◦ e.g., calving, progeny birth, breeding, dry dates
      • Parentage verification
      • Electronic ID
      • Within-herd heritability
  • 17. Data mining
      • The discovery of useful, possibly unexpected patterns in data
      • Four principal tasks
          ◦ Association
          ◦ Clustering
          ◦ Classification
          ◦ Regression
  • 18. Bonferroni’s principle
      • You will find interesting patterns if you look hard enough
      • Not all relationships are legitimate
      • You must have enough data to support the questions you’re asking
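A small calculation shows why looking hard enough always turns something up. The sketch below assumes independent tests with no real effects and a per-test significance level of 0.05 (an assumption; the slides give no threshold). The count of 43,382 tests matches the number of SNP solutions quoted later in the deck. The second function is the related Bonferroni correction, which is a standard remedy rather than something the slide itself describes.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of 'significant' results when no effect is real."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold that holds the family-wise error rate near alpha."""
    return alpha / n_tests

n_snp = 43_382   # number of tests, taken from the SNP count in the deck
alpha = 0.05     # assumed per-test significance level

print(expected_false_positives(n_snp, alpha))  # roughly 2,169 spurious hits
print(bonferroni_threshold(n_snp, alpha))      # roughly 1.15e-06
```

If the number of hits found is close to the number expected by chance, the data do not support the question being asked.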
  • 19. Association analysis
      • Discover interesting relationships among variables in large databases
          ◦ e.g., predicting protein function and identifying SNP-disease associations
          ◦ Not statistical association analysis!
      • Lots of algorithms, many based on counting attributes
      • Watch for false positives
          ◦ Measures co-occurrence, not causality
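The counting idea behind many association algorithms can be illustrated with the two usual measures, support and confidence. The sketch below is a minimal version under stated assumptions: the event names and records are invented, and real algorithms such as Apriori add pruning that is omitted here.

```python
def support(transactions, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent), from co-occurrence counts.
    Raises ZeroDivisionError if the antecedent never occurs."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical per-cow event sets, for illustration only.
records = [
    {"mastitis", "high_scc"},
    {"mastitis", "high_scc", "lameness"},
    {"high_scc"},
    {"lameness"},
    {"mastitis", "high_scc"},
]

print(support(records, {"mastitis", "high_scc"}))       # 3 of 5 records
print(confidence(records, {"high_scc"}, {"mastitis"}))  # 3 of the 4 high_scc records
```

Both numbers describe how often events occur together; neither says which event, if either, causes the other.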
  • 20. Clustering
      • Place items into distinct groups such that
          ◦ Items in a group are similar
          ◦ Items in one group are dissimilar to those in other groups
      • Hierarchical or partitional approaches
  • 21. Partitional clustering
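The slide shows only a figure, so as one possible illustration of a partitional method, here is a minimal k-means (Lloyd's algorithm) sketch. The choice of k-means, the 2-D points, and the seeding from the first k points are assumptions for the example; production code would normally use a library implementation with better initialization.

```python
def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids

# Hypothetical, well-separated points for illustration only.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.8, 1.2), (9.2, 8.8), (8.8, 9.2)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Each point ends up in exactly one cluster, which is what distinguishes partitional from hierarchical approaches.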
  • 22. Hierarchical clustering
      • Nested clusters organized into hierarchical trees
      • Data objects may belong to multiple subsets
      • Examples
          ◦ Relationships among species
          ◦ Evolutionary history of proteins
  • 23. [Figure labels]
      • BFGL-Illumina Deep SNP Discovery: Angus, Holstein, Limousin, Jersey, Nelore, Brahman, Romagnola, Gir
      • BFGL Genome Assemblies: Nelore, Water Buffalo
      • Pfizer Light SNP Discovery: Angus, Holstein, Jersey, Hereford, Charolais, Simmental, Brahman, Wagyu
      • Partners Deep SNP Discovery: N’Dama, Sahiwal, Simmental, Hanwoo, Blonde d’Aquitaine, Montbeliard
  • 24. Classification
      • Training set used to develop a rule for assigning individuals to classes
      • Validation set used to assess the accuracy of the classification rule
      • Examples
          ◦ Identify cows with subclinical mastitis
          ◦ Mate assignment
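The training/validation split can be made concrete with a very small classifier. The sketch below uses 1-nearest-neighbor classification purely as an example; the two features (somatic cell score and electrical conductivity), their values, and the labels are hypothetical.

```python
def nearest_neighbor(train, x):
    """Return the label of the training record closest to x (1-NN)."""
    _, label = min(
        train,
        key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], x)),
    )
    return label

def accuracy(train, validation):
    """Share of validation records whose predicted label matches the truth."""
    correct = sum(1 for x, y in validation if nearest_neighbor(train, x) == y)
    return correct / len(validation)

# (somatic cell score, electrical conductivity) -> status; invented values.
train = [
    ((2.0, 4.8), "healthy"),
    ((2.5, 5.0), "healthy"),
    ((5.5, 6.2), "subclinical"),
    ((6.0, 6.5), "subclinical"),
]
validation = [
    ((2.2, 4.9), "healthy"),
    ((5.8, 6.3), "subclinical"),
    ((4.2, 5.6), "healthy"),  # borderline cow that the rule misclassifies
]

print(accuracy(train, validation))  # 2 of 3 validation records are correct
```

Accuracy is measured only on records the rule has not seen, which is the point of holding out a validation set.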
  • 25. Classification methods
      • Bayesian belief networks
      • Decision trees
      • Nearest-neighbor classification
      • Neural networks
      • Rule-based classification
      • Support vector machines
  • 26. Decision tree classification
      Pinzón-Sánchez et al., 2011, JDS, 94:1873-1892.
  • 27. Rule-based classification
      • Classify records using a series of “if…then” rules
      • Rules come directly from the data, or from other classification models
      • e.g., if (PTA NM$ ≥ $800) and (EFI ≤ 0.05) then (breed to cow)
      • Easy to generate and interpret
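The example rule on this slide translates almost directly into code. In the sketch below the thresholds come from the slide, while the function name, the bull names, and their values are hypothetical.

```python
def breed_to_cow(pta_nm, efi, min_nm=800.0, max_efi=0.05):
    """Apply the slide's rule: PTA NM$ >= $800 and EFI <= 0.05."""
    return pta_nm >= min_nm and efi <= max_efi

# (name, PTA NM$, EFI); invented bulls for illustration only.
bulls = [
    ("Bull A", 850.0, 0.04),
    ("Bull B", 900.0, 0.07),  # fails the EFI condition
    ("Bull C", 750.0, 0.03),  # fails the NM$ condition
]

selected = [name for name, nm, efi in bulls if breed_to_cow(nm, efi)]
print(selected)
```

Because each rule is a plain condition, anyone can read off why a bull was or was not selected.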
  • 28. Regression models
      • Prediction of real-valued outputs
      • Given one or more attributes, we can predict, for example:
          ◦ Breeding values
          ◦ Feed intake
          ◦ Milk and components yields
      • Very mature analytical tools
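As a minimal illustration of predicting a real-valued output from one attribute, the sketch below fits a straight line by ordinary least squares. The body weights and feed intakes are invented and deliberately lie on an exact line; real evaluations use far richer models (for example mixed models) that are not shown here.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: body weight (kg) and daily feed intake (kg/d).
weight = [500.0, 550.0, 600.0, 650.0]
intake = [20.0, 21.0, 22.0, 23.0]

slope, intercept = fit_line(weight, intake)
predicted = intercept + slope * 620.0  # predicted intake for a 620 kg cow
print(slope, intercept, predicted)
```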
  • 29. Visualization
      • How do we present lots of numbers in a compact form?
      • “Graphical methods can retain the information in the data.” (Deming)
      • Complements numerical techniques
          ◦ Tukey (1977), Tufte (1983, 1990, 1997, 2006), Cleveland (1985, 1993), Wickham (2009)
  • 30. One image, millions of points
      43,382 SNP solutions × 4,064 animals = 176,304,448 data points
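The arithmetic on this slide is easy to check, and it also gives a rough idea of the memory needed to hold the values before plotting. The 8 bytes per value (a double-precision float) is an assumption; the slides do not say how the solutions were stored.

```python
n_snp = 43_382
n_animals = 4_064

n_points = n_snp * n_animals
print(n_points)  # matches the 176,304,448 data points quoted on the slide

bytes_per_value = 8  # assumed: one double-precision float per value
total_bytes = n_points * bytes_per_value
print(total_bytes / 10 ** 9)  # about 1.41 GB held in memory
```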
  • 31. Use size to denote importance
      Colors differentiate among chromosomes, and marker sizes are proportional to effect sizes.
  • 32. O-Style Haplotypes (chromosome 15)
  • 33. Correlations among calving traits
  • 34. Provide multiple cues
      Lines are differentiated by color and pattern.
      Cole and VanRaden. 2011. J. Anim. Breed. Genet. Online, 1-10.
  • 35. Interstitial figures
      Cole and VanRaden. 2010. J. Dairy Sci. 93(6):2727-2740.
  • 36. Computational capacity is abundant
      WikiMedia Commons, Wgsimon, Transistor_Count_and_Moore%27s_Law_-_2011.svg
  • 37. Supercomputer performance
      • Cray-1 (1976): 136 megaFLOPS (10⁶)
      • Fujitsu K machine (2011): 8.16 petaFLOPS (10¹⁵)
      • Commodity hardware also has experienced gains in performance
      Top: Sherwin Gooch; Bottom: Riken
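The two figures on this slide imply a growth rate that is simple to work out. The sketch below uses only the numbers quoted above; the steady-doubling model is a simplifying assumption for illustration, not a claim from the talk.

```python
import math

cray1_flops = 136e6   # Cray-1, 1976
k_flops = 8.16e15     # Fujitsu K machine, 2011

speedup = k_flops / cray1_flops
years = 2011 - 1976

# If performance had doubled at a steady rate, how long was each doubling?
doubling_time = years / math.log2(speedup)

print(speedup)        # about 6e7, a 60-million-fold increase
print(doubling_time)  # about 1.35 years per doubling
```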
  • 38. Storage costs are plummeting
      Matthew Komorowski, http://www.mkomo.com/cost-per-gigabyte
  • 39. Data storage technologies
      • Storage costs are now as low as $100/TB
          ◦ Quality costs!
      • Solid state disks are promising, but relatively low-capacity
      • What do you do about backups?
      Top: Snopes/IBM; Bottom: Tom’s Hardware
  • 40. Memory is very cheap
      Lev Lafayette, http://www.organdi.net/article.php3?id_article=82
  • 41. Random access memory
      • RAM is still much faster than disk (ns vs. ms access times)
      • A 64-bit OS can address 2⁶⁴ bytes (16 EiB, about 18.4 EB), in theory
      • How much can your motherboard hold?
      Top: Stan Yack; Bottom: Samsung
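The theoretical 64-bit limit can be checked with integer arithmetic. The figure quoted depends on whether decimal units (EB) or binary units (EiB, TiB) are used, which the sketch below makes explicit; it assumes a flat, byte-addressed 64-bit address space.

```python
address_space = 2 ** 64  # bytes addressable with a 64-bit pointer

print(address_space)             # 18446744073709551616 bytes
print(address_space / 10 ** 18)  # about 18.4 decimal exabytes (EB)
print(address_space // 2 ** 60)  # 16 binary exbibytes (EiB)
print(address_space // 2 ** 40)  # 16777216 tebibytes, i.e. about 16.8 million TiB
```

Real systems expose far less than this, since processors and motherboards implement fewer physical address lines.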
  • 42. Software
      • Complexity is increasing
          ◦ Parallelism is hard and debugging is much harder
      • Productive developers are expensive and difficult to find
          ◦ A top programmer may be 10x as productive as an average worker
  • 43. Conclusions
      • The more data we get, the more data we want
      • Relationships among traits may become as important as individual traits
      • Software may be more limiting than hardware
  • 44. Questions?