Data Structures and Visualization

J. B. Cole
Animal Improvement Programs Laboratory
Agricultural Research Service, USDA
Beltsville, MD 20705-2350
john.cole@ars.usda.gov

Really Big Data: Processing and Analysis of Very Large Datasets, ADSA/ASAS Joint Annual Meeting, July 2011
Introduction
• We’re drowning in information
• Genetics are viewed as a commodity
• We need to get better data from fewer cows
• Do we have the resources we need?

U.S. dairy population
[Figure: U.S. dairy cows (millions, y-axis 0 to 30) by year (x-axis ticks '40 to '00)]

We need to do more with less
• 47% of U.S. dairy cows are enrolled in DHIA testing
• The Class III milk price is $17/cwt
• Grain prices are very high
 ▪ Corn averaged $6/bu in May
 ▪ Soybeans averaged $13/bu in May
• Enrollment and cow numbers are unlikely to increase

Major topics
• Different sources of data
• Data source integration and quality
• Data mining models
• Visualization examples
• Computational resources

Data currently in national database
• Identification and registration
• Conformation scores
• Milk production and composition
• Fertility
• Longevity
• Some genotypes

What are big data?
Type of record              Number of records¹
Cows with lactation data            28,394,976
Lactations                          68,373,863
Individual test days               508,574,732
Calving ease records                20,770,758
Animals in pedigree file            58,893,009
Bull genotypes                          50,393
Cow genotypes                           70,687
¹ Totals include animals from all breeds.

Data not routinely available
• Farm and herd management
 ▪ Geography and climate
 ▪ Housing systems
 ▪ Feed intake
• Milk composition
 ▪ Milk fats, proteins, vitamins, minerals
 ▪ Conductivity, lactose, MUN
• DNA data
 ▪ Cow SNP genotypes, DNA sequence data
Photo: NOAA

Data “trapped” on the farm
• Fertility
 ▪ Insemination information
 ▪ Use of estrus synchronization
• Cow health and longevity
 ▪ Body condition scores
 ▪ Birth weights and mature weights
 ▪ Disease occurrence data

Electronic milk meters
• Can currently provide:
 ▪ Milk yield
 ▪ Milking speed
 ▪ Electrical conductivity
• May also supply:
 ▪ Progesterone levels
 ▪ Milk temperature
 ▪ Fat and protein concentrations
Photo: afimilk

Other sources of data
• RFID tags lower the ID error rates associated with meter data
• Pedometers are useful for detecting estrus, the onset of calving, and some early-stage disease
Top: Allflex; Bottom: afimilk

Current sources of data
[Diagram: current data sources: AIPL, CDCB, NAAB, PDCA, DHI, and universities]
AIPL: Animal Improvement Programs Laboratory, USDA
CDCB: Council on Dairy Cattle Breeding
DHI: Dairy Herd Improvement (milk recording organizations)
NAAB: National Association of Animal Breeders (AI)
PDCA: Purebred Dairy Cattle Association (breed registries)

Sources of genomic data
[Diagram: parties involved in genomic data: AIPL, requesters (e.g., AI organizations, breeds), dairy producers, and DNA laboratories; one arrow is labeled "samples"]

Data source integration
• Incoming data from different sources are checked against one another
• The AIPL edits system consists of ~64,000 source lines of code (SLOC)
 ▪ Mostly C, some Fortran 90
• Data are stored in a relational database

Typical edits
• Match birth date with dam’s calving date
• Compare with other sources (e.g., breed association)
• Investigate maternal sibs born within 9 mo of each other (may assume embryo transfer, ET)
• IDs within 100 of each other with the same sire, dam, and birth date are assumed to be twins
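
The edits above can be sketched as simple record checks. This is only an illustration, not the AIPL edits system (which the slides describe as C and Fortran 90): the record layout, the function names, and the 274-day stand-in for "9 mo" are assumptions, and the animals are made up.

```python
from datetime import date

# Thresholds come from the slide; how they are applied here is an assumption.
MAX_TWIN_ID_GAP = 100         # "IDs within 100"
MIN_SIB_INTERVAL_DAYS = 274   # roughly 9 months (assumed day count)


def birth_matches_calving(birth_date, dam_calving_dates):
    """True if the calf's birth date equals one of its dam's calving dates."""
    return birth_date in dam_calving_dates


def classify_sib_pair(a, b):
    """Label two animal records (dicts) as 'twins', 'possible ET', or 'ok'."""
    same_parents = a["sire"] == b["sire"] and a["dam"] == b["dam"]
    gap_days = abs((a["birth"] - b["birth"]).days)
    id_gap = abs(a["id"] - b["id"])
    if same_parents and gap_days == 0 and id_gap <= MAX_TWIN_ID_GAP:
        return "twins"
    if a["dam"] == b["dam"] and gap_days < MIN_SIB_INTERVAL_DAYS:
        # Maternal sibs born too close together: flag for investigation.
        return "possible ET"
    return "ok"


# Hypothetical records for illustration only.
calf1 = {"id": 1001, "sire": "S1", "dam": "D1", "birth": date(2010, 3, 1)}
calf2 = {"id": 1050, "sire": "S1", "dam": "D1", "birth": date(2010, 3, 1)}
calf3 = {"id": 5000, "sire": "S2", "dam": "D1", "birth": date(2010, 7, 15)}
calf4 = {"id": 9000, "sire": "S1", "dam": "D1", "birth": date(2011, 6, 1)}

print(classify_sib_pair(calf1, calf2))
print(classify_sib_pair(calf1, calf3))
print(classify_sib_pair(calf1, calf4))
```

In this sketch a pair that fails the twin rule but shares a dam and falls inside the 9-month window is only flagged, leaving the ET decision to a person or a later edit.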

How do we assess data quality?
• Consistency
 ▪ e.g., calving, progeny birth, breeding, dry dates
• Parentage verification
• Electronic ID
• Within-herd heritability

Data mining
• The discovery of useful, possibly unexpected patterns in data
• Four principal tasks
 ▪ Association
 ▪ Clustering
 ▪ Classification
 ▪ Regression

Bonferroni’s principle
• You will find interesting patterns if you look hard enough
• Not all relationships are legitimate
• You must have enough data to support the questions you’re asking
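
One way to make this concrete is to count how many "interesting" results pure chance would produce. The sketch below assumes independent tests with no true effects. It uses the 43,382 SNP count that appears later in this deck, while the 0.05 significance level is an assumption. The per-test threshold it prints is the related Bonferroni correction, shown only for comparison.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of chance 'hits' when every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold that keeps the family-wise error rate near alpha."""
    return alpha / n_tests


n_snps = 43_382   # SNP count quoted elsewhere in this deck
alpha = 0.05      # assumed significance level

false_hits = expected_false_positives(n_snps, alpha)
per_test = bonferroni_threshold(n_snps, alpha)

print(f"Expected chance hits at alpha={alpha}: about {false_hits:.0f}")
print(f"Bonferroni per-test threshold: {per_test:.2e}")
```

If the number of patterns found is not well above the number expected by chance, the data do not support the question being asked.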

Association analysis
• Discover interesting relationships among variables in large databases
 ▪ e.g., predicting protein function and identifying SNP-disease associations
 ▪ Not statistical association analysis!
• Lots of algorithms, many based on counting attributes
• Watch for false positives
 ▪ Measures co-occurrence, not causality
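
A minimal sketch of the counting idea, restricted to pairs of attributes: support, confidence, and lift are computed from co-occurrence counts alone. The event names and records are made up, and `pair_rules` is a hypothetical helper rather than any particular published algorithm.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.3):
    """Return (lhs, rhs, support, confidence, lift) for frequent item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            lift = confidence / (item_counts[rhs] / n)
            rules.append((lhs, rhs, support, confidence, lift))
    return rules


# Hypothetical per-cow event sets, for illustration only.
records = [
    {"mastitis", "high_scc"},
    {"mastitis", "high_scc", "lameness"},
    {"high_scc"},
    {"lameness"},
    {"mastitis", "high_scc"},
]

rules = pair_rules(records)
for lhs, rhs, support, confidence, lift in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, "
          f"confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift above 1 only says the two attributes occur together more often than independence would predict; it says nothing about which one causes the other.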

Clustering
• Place items into distinct groups such that
 ▪ Items in a group are similar
 ▪ Items in one group are dissimilar to those in other groups
• Hierarchical or partitional approaches

Partitional clustering
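
k-means is one common partitional method: every item is assigned to exactly one of k groups. The sketch below is a plain-Python illustration on made-up two-dimensional points. The slides do not name a specific algorithm, so the choice of k-means is an assumption.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Put each point in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and center updates."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = assign(points, centers)
        new_centers = [
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else centers[i]          # keep the center of an empty cluster
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:               # converged
            break
        centers = new_centers
    return centers, assign(points, centers)


# Made-up points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]

centers, clusters = kmeans(points, k=2)
for center, members in zip(centers, clusters):
    print(center, members)
```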

Hierarchical clustering
• Nested clusters organized into hierarchical trees
• Data objects may belong to multiple subsets
• Examples
 ▪ Relationships among species
 ▪ Evolutionary history of proteins
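
The nesting can be seen in a small agglomerative example. The sketch below uses single linkage (the distance between two clusters is the smallest distance between their members), which is one of several possible linkage rules and is an assumption here. The items A to D and their distances are made up.

```python
def single_linkage(labels, dist):
    """Agglomerative clustering; returns the merge history.

    Each entry is (members_of_first, members_of_second, merge_distance).
    """
    clusters = [frozenset([i]) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(labels[x] for x in clusters[i]),
                       sorted(labels[x] for x in clusters[j]),
                       d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Made-up symmetric distance matrix among four items.
labels = ["A", "B", "C", "D"]
dist = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]

merges = single_linkage(labels, dist)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

Read from first merge to last, the history is the tree: A belongs to {A}, then {A, B}, then {A, B, C, D}, which is the sense in which an object belongs to multiple nested subsets.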

[Figure: breeds sampled, by project]
• BFGL-Illumina, deep SNP discovery: Angus, Holstein, Limousin, Jersey, Nelore, Brahman, Romagnola, Gir
• BFGL, genome assemblies: Nelore, Water Buffalo
• Pfizer, light SNP discovery: Angus, Holstein, Jersey, Hereford, Charolais, Simmental, Brahman, Wagyu
• Partners, deep SNP discovery: N’Dama, Sahiwal, Simmental, Hanwoo, Blonde d’Aquitaine, Montbeliard

Classification
• Training set used to develop a rule for assigning individuals to classes
• Validation set used to assess the accuracy of the classification rule
• Examples
 ▪ Identify cows with subclinical mastitis
 ▪ Mate assignment
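
The training/validation split can be illustrated with the simplest possible rule, a single cut-off on one measurement. Everything in the sketch is hypothetical: the measurement (labeled here as a somatic cell score), the values, and the true/false labels are invented, and a real classifier would use more than one attribute.

```python
def fit_threshold(train):
    """Choose the cut-off that classifies the training set best.

    train is a list of (value, label) pairs; the rule predicts True
    when value >= cut-off.
    """
    best_cut, best_correct = None, -1
    for cut in sorted({x for x, _ in train}):
        correct = sum((x >= cut) == label for x, label in train)
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut


def accuracy(cut, data):
    """Fraction of records in data that the rule classifies correctly."""
    return sum((x >= cut) == label for x, label in data) / len(data)


# Invented (score, has_subclinical_mastitis) pairs, for illustration only.
training_set = [(2.1, False), (3.0, False), (3.8, False),
                (4.4, True), (5.2, True), (6.0, True)]
validation_set = [(2.5, False), (4.0, False), (4.6, True),
                  (4.2, True), (5.5, True)]

cut = fit_threshold(training_set)
print("cut-off learned from training set:", cut)
print("training accuracy:", accuracy(cut, training_set))
print("validation accuracy:", accuracy(cut, validation_set))
```

The rule is perfect on the data it was fitted to and less accurate on the held-out records, which is why accuracy is reported on the validation set.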

Classification methods
• Bayesian belief networks
• Decision trees
• Nearest-neighbor classification
• Neural networks
• Rule-based classification
• Support vector machines

Decision tree classification
Pinzón-Sánchez et al., 2011, JDS, 94:1873-1892.

Rule-based classification
• Classify records using a series of “if…then” rules
• Rules come directly from the data, or from other classification models
• e.g., if (PTA NM$ ≥ $800) and (EFI ≤ 0.05) then (breed to cow)
• Easy to generate and interpret
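
The example rule translates almost directly into code. In this sketch the thresholds are the ones on the slide, while the record keys, the rule-list layout, and the default action when no rule fires are assumptions.

```python
# Thresholds from the slide; all names below are hypothetical.
MIN_NET_MERIT = 800.0   # PTA NM$, in dollars
MAX_EFI = 0.05          # EFI (expected future inbreeding)

# Rules are (condition, action) pairs, checked in order.
RULES = [
    (lambda bull: bull["pta_nm"] >= MIN_NET_MERIT and bull["efi"] <= MAX_EFI,
     "breed to cow"),
]
DEFAULT_ACTION = "do not breed"   # assumed outcome when no rule fires


def classify(bull):
    """Return the action of the first rule whose condition holds."""
    for condition, action in RULES:
        if condition(bull):
            return action
    return DEFAULT_ACTION


print(classify({"pta_nm": 850.0, "efi": 0.04}))
print(classify({"pta_nm": 850.0, "efi": 0.06}))
print(classify({"pta_nm": 750.0, "efi": 0.03}))
```

Keeping the rules in a list makes it easy to add rules generated from the data or exported from another model, and each rule stays readable on its own.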

Regression models
• Prediction of real-valued outputs
• Given one or more attributes, we can predict, for example:
 ▪ Breeding values
 ▪ Feed intake
 ▪ Milk and components yields
• Very mature analytical tools
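
As a minimal example of predicting a real-valued output from one attribute, the sketch below fits a straight line by ordinary least squares. The numbers are made up and deliberately lie on an exact line; the models used for breeding values are far more elaborate than this.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Predicted output for a new attribute value."""
    return intercept + slope * x


# Made-up attribute/output pairs, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(xs, ys)
print("intercept:", intercept, "slope:", slope)
print("prediction at x = 5:", predict(intercept, slope, 5.0))
```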

Visualization
• How do we present lots of numbers in a compact form?
• “Graphical methods can retain the information in the data.” (Deming)
• Complements numerical techniques
 ▪ Tukey (1977), Tufte (1983, 1990, 1997, 2006), Cleveland (1985, 1993), Wickham (2009)

One image, millions of points
43,382 SNP solutions × 4,064 animals = 176,304,448 data points

Use size to denote importance
Colors differentiate chromosomes, and marker sizes are proportional to effect sizes.

O-Style Haplotypes (chromosome 15)

Correlations among calving traits

Provide multiple cues
Cole and VanRaden. 2011. J. Anim. Breed. Genet. Online, 1-10.
Lines are differentiated by color and pattern.

Interstitial figures
Cole and VanRaden. 2010. J. Dairy Sci. 93(6):2727-2740.

Computational capacity is abundant
WikiMedia Commons, Wgsimon, Transistor_Count_and_Moore%27s_Law_-_2011.svg

Supercomputer performance
• Cray-1 (1976): 136 megaFLOPS (10^6)
• Fujitsu K machine (2011): 8.16 petaFLOPS (10^15)
• Commodity hardware also has experienced gains in performance
Top: Sherwin Gooch; Bottom: Riken
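
The size of that gain is easier to appreciate as a ratio. The short calculation below uses only the two figures on the slide; the implied doubling time assumes smooth exponential growth between 1976 and 2011, which is a simplification.

```python
import math

cray_1_flops = 136e6        # Cray-1 (1976): 136 megaFLOPS
k_machine_flops = 8.16e15   # Fujitsu K machine (2011): 8.16 petaFLOPS
years = 2011 - 1976

speedup = k_machine_flops / cray_1_flops
# Doubling time if growth had been steady over the whole period (assumption).
doubling_time_years = years / math.log2(speedup)

print(f"Speed-up: {speedup:.1e}x over {years} years")
print(f"Implied doubling time: {doubling_time_years:.2f} years")
```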

Storage costs are plummeting
Matthew Komorowski, http://www.mkomo.com/cost-per-gigabyte

Data storage technologies
• Storage costs are now as low as $100/TB
 ▪ Quality costs!
• Solid state disks are promising, but relatively low-capacity
• What do you do about backups?
Top: Snopes/IBM; Bottom: Tom’s Hardware

Memory is very cheap
Lev Lafayette, http://www.organdi.net/article.php3?id_article=82

Random access memory
• RAM is still much faster than disk (ns vs. ms access times)
• A 64-bit OS can address 2^64 bytes = 16 EiB (about 18.4 EB), in theory
• How much can your motherboard hold?
Top: Stan Yack; Bottom: Samsung

Software
• Complexity is increasing
 ▪ Parallelism is hard and debugging is much harder
• Productive developers are expensive and difficult to find
 ▪ A top programmer may be 10x as productive as an average programmer

Conclusions
• The more data we get, the more data we want
• Relationships among traits may become as important as individual traits
• Software may be more limiting than hardware

Questions?
