1. J. B. Cole
Animal Improvement Programs Laboratory
Agricultural Research Service, USDA
Beltsville, MD 20705-2350
john.cole@ars.usda.gov
Data Structures and Visualization
2. Really Big Data: Processing and Analysis of Very Large Datasets, ADSA/ASAS Joint Annual Meeting, July 2011 (2) Cole
Introduction
• We’re drowning in information
• Genetics are viewed as a commodity
• We need to get better data from
fewer cows
• Do we have the resources we need?
U.S. dairy population
[Figure: cows (millions, axis 0 to 30) by year, 1940 to 2000]
We need to do more with less
• 47% of U.S. dairy cows are enrolled
in DHIA testing
• The Class III milk price is $17/cwt
• Grain prices are very high
Corn averaged $6/bu in May
Soybeans averaged $13/bu in May
• Enrollment and cow numbers are
unlikely to increase
Major topics
• Different sources of data
• Data source integration and quality
• Data mining models
• Visualization examples
• Computational resources
Data currently in national database
• Identification and registration
• Conformation scores
• Milk production and composition
• Fertility
• Longevity
• Some genotypes
What are big data?
Type of record              Number of records¹
Cows with lactation data    28,394,976
Lactations                  68,373,863
Individual test days        508,574,732
Calving ease records        20,770,758
Animals in pedigree file    58,893,009
Bull genotypes              50,393
Cow genotypes               70,687
¹ Totals include animals from all breeds.
Data not routinely available
• Farm and herd management
Geography and climate
Housing systems
Feed intake
• Milk composition
Milk fats, proteins, vitamins, minerals
Conductivity, lactose, MUN
• DNA data
Cow SNP genotypes, DNA sequence data
Photo: NOAA
Data “trapped” on the farm
• Fertility
Insemination information
Use of estrus synchronization
• Cow health and longevity
Body condition scores
Birth weights and mature weights
Disease occurrence data
Electronic milk meters
• Can currently provide:
Milk yield
Milking speed
Electrical conductivity
• Might also supply:
Progesterone levels
Milk temperature
Fat and protein concentrations
Photo: afimilk
Other sources of data
• RFID tags reduce the ID
error rates associated with
meter data
• Pedometers are useful for
detecting estrus, the
onset of calving, and
some early-stage
disease
Top: Allflex; Bottom: afimilk
Current sources of data
[Diagram: AIPL, CDCB, NAAB, PDCA, DHI, universities]
AIPL: Animal Improvement Programs Laboratory, USDA
CDCB: Council on Dairy Cattle Breeding
DHI: Dairy Herd Improvement (milk recording organizations)
NAAB: National Association of Animal Breeders (AI)
PDCA: Purebred Dairy Cattle Association (breed registries)
Sources of genomic data
[Diagram: AIPL; requesters (e.g., AI, breeds); dairy producers; DNA laboratories; samples]
Data source integration
• Incoming data from different sources
are checked against one another
• The AIPL edits system consists of
~64,000 SLOC
Mostly C, some Fortran 90
• Data stored in a relational database
Typical edits
• Match birth date with dam’s calving date
• Compare with other sources (e.g., breed
association)
• Investigate maternal sibs born within 9
mo of each other (ET may be assumed)
• Animals with IDs within 100 of each other and
the same sire, dam, and birth date are
assumed to be twins
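The last two edits can be sketched in a few lines of Python. This is only an illustration, not the AIPL edits system (which the talk describes as mostly C): the record layout, the field names, and the 274-day stand-in for "9 mo" are all assumptions.

```python
from datetime import date

# Hypothetical record layout; field names are assumptions for illustration.
def likely_twins(a, b, max_id_gap=100):
    """Flag probable twins: IDs within 100, same sire, dam, and birth date."""
    return (
        abs(a["id"] - b["id"]) <= max_id_gap
        and a["sire"] == b["sire"]
        and a["dam"] == b["dam"]
        and a["birth"] == b["birth"]
    )

def needs_sib_review(a, b, max_days=274):
    """Flag maternal sibs born on different dates less than about 9 months apart."""
    if a["dam"] != b["dam"]:
        return False
    gap = abs((a["birth"] - b["birth"]).days)
    return 0 < gap < max_days

# Made-up animals.
calf1 = {"id": 1000, "sire": "S1", "dam": "D1", "birth": date(2010, 3, 1)}
calf2 = {"id": 1042, "sire": "S1", "dam": "D1", "birth": date(2010, 3, 1)}
calf3 = {"id": 5000, "sire": "S2", "dam": "D1", "birth": date(2010, 8, 15)}

print(likely_twins(calf1, calf2))      # same parents and birth date, close IDs
print(needs_sib_review(calf1, calf3))  # same dam, births about 5.5 months apart
```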
How do we assess data quality?
• Consistency
e.g., calving, progeny birth,
breeding, dry dates
• Parentage verification
• Electronic ID
• Within-herd heritability
Data mining
• The discovery of useful, possibly
unexpected patterns in data
• Four principal tasks
Association
Clustering
Classification
Regression
Bonferroni’s principle
• You will find interesting patterns if
you look hard enough
• Not all relationships are legitimate
• You must have enough data to
support the questions you’re
asking
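A rough way to see why looking hard enough always finds something: count how many "significant" results pure chance would produce. The sketch below uses the familiar Bonferroni correction, which is related to but not the same thing as the principle named on this slide; the 0.05 significance level is an assumption, and the test count is borrowed from the SNP example later in this talk.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of 'significant' results when no real effect exists."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that holds the family-wise error rate at alpha."""
    return alpha / n_tests

n_snp = 43_382  # number of tests, one per SNP
print(expected_false_positives(n_snp))  # chance findings expected at alpha = 0.05
print(bonferroni_threshold(n_snp))      # much stricter per-test cutoff
```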
Association analysis
• Discover interesting relationships
among variables in large databases
e.g., predicting protein function and
identifying SNP-disease associations
Not statistical association analysis!
• Lots of algorithms, many based on
counting attributes
• Watch for false positives
Measures co-occurrence, not causality
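A minimal sketch of the counting idea, assuming the usual support and confidence measures (the slide does not name a specific algorithm). The health events are made up, and `pair_rules` is a hypothetical helper, not a library function.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5):
    """Count co-occurring item pairs; return {(a, b): (support, confidence of a -> b)}."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(set(t)), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # share of records containing both items
        if support >= min_support:
            rules[(a, b)] = (support, count / item_counts[a])
    return rules

# Made-up health events per cow, for illustration only.
cows = [
    {"mastitis", "lameness"},
    {"mastitis", "lameness", "ketosis"},
    {"mastitis"},
    {"ketosis"},
]
rules = pair_rules(cows)
print(rules)
```

A high confidence here says only that the two events were recorded together often, which is the co-occurrence caveat above.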
Clustering
• Place items into distinct groups
such that
Items in a group are similar
Items in one group are dissimilar to
those in other groups
• Hierarchical or partitional
approaches
Partitional clustering
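One common partitional method is k-means; the slide does not say which method its figure used, so this is a stand-in. The sketch runs Lloyd's algorithm on made-up two-trait records, seeds the centers with the first k points (a simplification), and does not scale the traits.

```python
def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm on 2-D points; the first k points seed the centers."""
    centers = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(
                range(k),
                key=lambda j: (p[0] - centers[j][0]) ** 2 + (p[1] - centers[j][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centers

# Made-up (milk yield in kg/d, somatic cell score) pairs.
points = [(30.0, 2.0), (45.0, 5.0), (31.0, 2.2), (44.0, 4.8), (29.5, 1.9), (46.0, 5.1)]
labels, centers = kmeans(points, k=2)
print(labels)
```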
Hierarchical clustering
• Nested clusters organized into
hierarchical trees
• Data objects may belong to
multiple subsets
• Examples
Relationships among species
Evolutionary history of proteins
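The nesting can be illustrated with agglomerative clustering, here using single linkage as one of several possible linkage choices. The distances among breeds below are invented for the example and carry no biological meaning; `single_linkage` is a hypothetical helper.

```python
def single_linkage(names, dist):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Returns the merges in order; together they define the tree.
    """
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist[frozenset((a, b))] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

# Made-up pairwise distances among three breeds, for illustration only.
dist = {
    frozenset(("Holstein", "Jersey")): 0.2,
    frozenset(("Holstein", "Nelore")): 0.9,
    frozenset(("Jersey", "Nelore")): 0.8,
}
merges = single_linkage(["Holstein", "Jersey", "Nelore"], dist)
for left, right, d in merges:
    print(left, right, d)
```

Each animal or breed ends up in every cluster along its path to the root, which is what "may belong to multiple subsets" means here.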
BFGL-Illumina
Deep SNP Discovery
Angus
Holstein
Limousin
Jersey
Nelore
Brahman
Romagnola
Gir
BFGL
Genome Assemblies
Nelore
Water Buffalo
Pfizer
Light SNP Discovery
Angus
Holstein
Jersey
Hereford
Charolais
Simmental
Brahman
Wagyu
Partners
Deep SNP Discovery
N’Dama
Sahiwal
Simmental
Hanwoo
Blonde d’Aquitaine
Montbeliard
Classification
• Training set used to develop a rule
for assigning individuals to classes
• Validation set used to assess the
accuracy of the classification rule
• Examples
Identify cows with subclinical mastitis
Mate assignment
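The training and validation roles can be sketched with a 1-nearest-neighbor classifier, chosen here only because it is short. The traits, values, and class labels are made up, and a real mastitis classifier would need far more data and scaled inputs.

```python
def nearest_neighbor(train, x):
    """Return the class label of the training record closest to x (1-NN)."""
    features, label = min(
        train, key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], x))
    )
    return label

def accuracy(train, validation):
    """Share of validation records whose predicted class matches the recorded one."""
    hits = sum(nearest_neighbor(train, x) == y for x, y in validation)
    return hits / len(validation)

# Made-up records: ((somatic cell score, electrical conductivity in mS/cm), class).
train = [
    ((2.0, 4.5), "healthy"),
    ((2.5, 4.7), "healthy"),
    ((5.5, 6.0), "subclinical"),
    ((6.0, 6.4), "subclinical"),
]
validation = [
    ((2.2, 4.6), "healthy"),
    ((5.8, 6.1), "subclinical"),
    ((4.3, 5.4), "healthy"),
]
print(accuracy(train, validation))
```

The rule is built from `train` only; `validation` records are never used to fit it, so the reported accuracy is an honest check.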
Classification methods
• Bayesian belief networks
• Decision trees
• Nearest-neighbor classification
• Neural networks
• Rule-based classification
• Support vector machines
Decision tree classification
Pinzón-Sánchez et al., 2011, JDS, 94:1873-1892.
Rule-based classification
• Classify records using a series of
“if…then” rules
• Rules come directly from the data,
or from other classification models
• e.g., if (PTA NM$ ≥ $800) and (EFI ≤
0.05) then (breed to cow)
• Easy to generate and interpret
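The example rule translates directly into code. The function name and the "do not breed" outcome for animals that fail the rule are assumptions; the slide gives only the "then" branch.

```python
def mating_rule(pta_nm, efi):
    """Apply the example rule: PTA NM$ >= 800 and EFI <= 0.05 -> breed to cow."""
    if pta_nm >= 800 and efi <= 0.05:
        return "breed to cow"
    return "do not breed"  # assumed fallback; not stated in the rule

print(mating_rule(850, 0.03))
print(mating_rule(850, 0.08))
```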
Regression models
• Prediction of real-valued outputs
• Given one or more attributes, we
can predict, for example—
Breeding values
Feed intake
Milk and components yields
• Very mature analytical tools
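As the simplest case, a one-attribute ordinary least squares fit is sketched below. The milk yield and feed intake values are made up, and real breeding value prediction uses mixed models well beyond this.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up data: milk yield (kg/d) and dry matter intake (kg/d).
milk = [25.0, 30.0, 35.0, 40.0]
intake = [19.0, 21.0, 23.0, 25.0]
intercept, slope = fit_line(milk, intake)

# Predict intake for a cow giving 33 kg/d of milk.
print(intercept + slope * 33.0)
```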
Visualization
• How do we present lots of numbers
in a compact form?
• “Graphical methods can retain the
information in the data.” ― Deming
• Complements numerical
techniques
Tukey (1977), Tufte (1983, 1990,
1997, 2006), Cleveland (1985,
1993), Wickham (2009)
One image, millions of points
43,382 SNP solutions × 4,064 animals = 176,304,448 data points
Use size to denote importance
Colors differentiate among chromosomes, and marker sizes are proportional to effect sizes.
O-Style Haplotypes (chromosome 15)
Correlations among calving traits
Provide multiple cues
Cole and VanRaden. 2011. J. Anim. Breed. Genet. Online, 1-10.
Lines are differentiated by color and pattern.
Interstitial figures
Cole and VanRaden. 2010. J. Dairy Sci. 93(6):2727-2740.
Computational capacity is abundant
WikiMedia Commons, Wgsimon, Transistor_Count_and_Moore%27s_Law_-_2011.svg
Supercomputer performance
• Cray-1 (1976): 136
megaFLOPS (10^6)
• Fujitsu K machine
(2011): 8.16
petaFLOPS (10^15)
• Commodity hardware
has also experienced
gains in performance
Top: Sherwin Gooch; Bottom: Riken
Storage costs are plummeting
Matthew Komorowski, http://www.mkomo.com/cost-per-gigabyte
Data storage technologies
• Storage costs are now as
low as $100/TB
Quality costs!
• Solid state disks are
promising, but relatively
low-capacity
• What do you do about
backups?
Top: Snopes/IBM; Bottom: Tom’s Hardware
Memory is very cheap
Lev Lafayette, http://www.organdi.net/article.php3?id_article=82
Random access memory
• RAM is still much
faster than disk (ns
vs. ms access times)
• A 64-bit OS can
address 16 EiB (about
18.4 EB), in theory
• How much can your
motherboard hold?
Top: Stan Yack; Bottom: Samsung
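The theoretical 64-bit limit is easy to check, assuming byte-addressable memory and a full 64-bit address; real processors and motherboards support far less.

```python
address_bits = 64
total_bytes = 2 ** address_bits  # one address per byte

print(total_bytes)             # addressable bytes
print(total_bytes / 2 ** 60)   # size in EiB (2**60 bytes each)
print(total_bytes / 10 ** 18)  # size in EB (10**18 bytes each)
```

The binary and decimal units differ by about 15% at this scale, so it matters which one is quoted.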
Software
• Complexity is increasing
Parallelism is hard and debugging is
much harder
• Productive developers are expensive
and difficult to find
A top programmer may be 10x as
productive as an average worker
Conclusions
• The more data we get, the more data
we want
• Relationships among traits may become
as important as individual traits
• Software may be more limiting than
hardware
Questions?