Graphs are Feeding the World

Tim Williamson (@TimWilliate)
Data Scientist, Monsanto

Our Growing Planet Faces Difficult Challenges
Sources: http://esa.un.org/unpd/wpp/; UN FAO Food Balance Sheet, “World Health Organization
Global and regional food consumption patterns and trends”; The World Bank, Food and Agriculture
Organization of the United Nations (FAO-STAT), Monsanto Internal Calculations; @TimWilliate #MonDataScience
Rising Population: Growing enough for a growing world
Global Population: 4.4B (1980), 7.1B (today), 9.6B+ (2050)
Limited Farmland: Farmers will need to produce enough food with fewer resources to support our world population
Acres per Person: 1 (1961), <1/3 (2050)
Changing Economies and Diets: A growing global middle class is choosing animal protein (meat, eggs, and dairy) as a larger part of their diet
Dietary Percentage of Protein: 9% (1965), 14% (2030)
Changing Climate: Farmers are impacted by climate change in many ways:
• Water availability issues
• Increasingly unpredictable weather
• Insect range expansion
• Weed pressure changes
• Crop disease increases
• Planting zone shifts

Improved Genetic Gain is One of Several Tools Humanity has to Address These Challenges
Sources: http://www.ers.usda.gov/data-products/feed-grains-database/feed-grains-yearbook-tables.aspx
• 8 commodity crops and 18 vegetable crop families, sold in 160 countries
[Chart: Average US Corn Yield 1866 - 2014; y-axis: Yield (Bushels/Acre), 0 to 180; x-axis: Year, 1865 to 2015]
10,000 Years

Genetic Gain is Created Through Breeding Cycles
[Diagram: breeding cycle pipeline starting from a cross (X), with stages Screening and Field Trials]
• All progeny of two parents enter; the best one leaves to become a future parent
• Select the best, discard the rest, using Lab Data (Genotypes) and Field Data (Phenotypes)
• 1000’s of crosses/year; dozens of progeny/cross; 5-10 locations/progeny; $3-5 million/year

Every Breeding Cycle Extends a Tree of Genetic Ancestry
[Diagram: ancestry trees built from nodes A, B, and C]

Forcing Genetic Ancestry Data into Rows and Columns
• In our relational store, genetic ancestry data was spread across a hierarchy of ~11 tables representing a total of ~895 million rows
• Every read became an unpleasant exercise in CONNECT BY PRIOR
Table "Plant": plant id, attributes…
Table "Plant:Plant Relationship": plant id, parent plant id, parental role
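The hierarchical read against these two tables can be sketched with a recursive query. This is a minimal illustration, not the original Oracle schema or query: it uses Python's built-in sqlite3 module and a recursive common table expression as a stand-in for CONNECT BY PRIOR, and the table name plant_relationship, its column names, and the sample rows are assumptions based on the column labels on the slide.

```python
import sqlite3

# Hypothetical miniature of the "Plant:Plant Relationship" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plant_relationship ("
    " plant_id INTEGER, parent_plant_id INTEGER, parental_role TEXT)"
)
conn.executemany(
    "INSERT INTO plant_relationship VALUES (?, ?, ?)",
    [(1, 2, "female"), (1, 3, "male"), (2, 4, "female"), (2, 5, "male")],
)

# Recursive CTE: start from the direct parents, then repeatedly join the
# relationship table to itself until max_depth levels have been climbed.
ANCESTORS_SQL = """
WITH RECURSIVE ancestors(plant_id, depth) AS (
    SELECT parent_plant_id, 1
    FROM plant_relationship
    WHERE plant_id = :start
  UNION
    SELECT r.parent_plant_id, a.depth + 1
    FROM plant_relationship AS r
    JOIN ancestors AS a ON r.plant_id = a.plant_id
    WHERE a.depth < :max_depth
)
SELECT plant_id, MIN(depth) FROM ancestors GROUP BY plant_id ORDER BY plant_id
"""


def ancestors(start, max_depth):
    """Return {ancestor_id: depth} for every ancestor within max_depth."""
    rows = conn.execute(ANCESTORS_SQL, {"start": start, "max_depth": max_depth})
    return dict(rows.fetchall())


print(ancestors(1, 15))
```

Each additional level of depth adds another pass of the self-join, which is the pattern the CONNECT BY PRIOR reads followed.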

Given a Starting Population, Return All Ancestors
[Chart: Response Time (s), 0 to 30, vs. Depth, 1 to 15; series: SQL on Oracle Exadata]

Genetic Ancestry is a Naturally Occurring Graph
• ~700 million nodes
• ~1.2 billion relationships
• ~1.7 billion properties
[Diagram: graph model. Node labels: :Plant, :Plant Inventory, :Planting, :Selection. Relationship types: :PARENT, :PLANTED, :SELECTED, :HARVESTED, :INVENTORY]

Given a Starting Population, Return All Ancestors
[Chart: Response Time (s), 0 to 30, vs. Depth, 1 to 15; series: SQL on Oracle Exadata, Traversal Framework on Neo4j; annotation: ~90x Difference]
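On a graph, the same question is a traversal along :PARENT relationships. The sketch below is not the Neo4j Traversal Framework (a Java API) used for the benchmark; it is an in-memory breadth-first search in Python over a hypothetical PARENTS adjacency map, shown only to illustrate the shape of the query.

```python
from collections import deque

# Hypothetical ancestry graph: child id -> list of parent ids.
PARENTS = {1: [2, 3], 2: [4, 5], 4: [6]}


def all_ancestors(start_population, max_depth=15):
    """Breadth-first walk up the parent edges from a starting population."""
    depth_of = {}
    queue = deque((node, 0) for node in start_population)
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for parent in PARENTS.get(node, []):
            if parent not in depth_of:  # each ancestor is recorded once
                depth_of[parent] = depth + 1
                queue.append((parent, depth + 1))
    return depth_of


print(all_ancestors([1]))
```

The traversal touches only the nodes it reaches, so the work grows with the size of the ancestry tree rather than with the size of the whole data set.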

Retrieving Genetic Ancestry in a ‘RESTful’ Style
[Diagram: nodes 1 to 6 joined by five :PARENT relationships, two of them labeled {parental_role: male} and {parental_role: female}]
RESTful Resource: /population/1/ancestors
{"nodes": [
{"id": 1},
{"id": 2},
{"id": 3},
{"id": 4},
{"id": 5},
{"id": 6}
],
"relationships": [
{"from": 1, "to": 2, "relation": "PARENT"},
{"from": 3, "to": 5, "relation": "PARENT"}
]}
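A response of this shape can be assembled from a traversal. The sketch below is an illustration under assumptions, not the service's implementation: EDGES is a made-up six-node graph (the slide's response lists only two of the diagram's relationships, so the rest are guesses), and ancestors_resource is a hypothetical helper name.

```python
import json

# Hypothetical (child, parent) pairs standing in for :PARENT relationships.
EDGES = [(1, 2), (2, 3), (2, 4), (3, 5), (5, 6)]


def ancestors_resource(population_id):
    """Build a body for GET /population/{id}/ancestors."""
    nodes, relationships = [population_id], []
    frontier = [population_id]
    while frontier:
        child = frontier.pop(0)
        for source, parent in EDGES:
            if source != child:
                continue
            relationships.append(
                {"from": child, "to": parent, "relation": "PARENT"}
            )
            if parent not in nodes:
                nodes.append(parent)
                frontier.append(parent)
    return {
        "nodes": [{"id": n} for n in sorted(nodes)],
        "relationships": relationships,
    }


print(json.dumps(ancestors_resource(1), indent=2))
```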

Building a Grammar for Ancestral Milestones
RESTful Resource: /population/1/binary-cross
{
"male": {"id": 4},
"female": {"id": 3}
}
[Diagram: nodes 1 to 6 joined by five :PARENT relationships]
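One way to read this milestone: walk up from the starting population until reaching a node that has both a male and a female parent, and return that pair. That interpretation, the PARENTS map, and the function name first_binary_cross are assumptions made for illustration; the slide shows only the request and the response.

```python
# Hypothetical graph: child id -> list of (parent id, parental_role or None).
PARENTS = {
    1: [(2, None)],
    2: [(4, "male"), (3, "female")],
    3: [(5, None)],
    5: [(6, None)],
}


def first_binary_cross(population_id):
    """Return the nearest cross as {"male": ..., "female": ...}, or None."""
    current = population_id
    seen = set()
    while current is not None and current not in seen:
        seen.add(current)
        parents = PARENTS.get(current, [])
        roles = {role: parent for parent, role in parents}
        if "male" in roles and "female" in roles:
            return {"male": {"id": roles["male"]},
                    "female": {"id": roles["female"]}}
        # Follow a lone parent upward; stop when there is none.
        current = parents[0][0] if len(parents) == 1 else None
    return None


print(first_binary_cross(1))
```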

Pruning Genetic Ancestry Trees ‘On the Fly’
RESTful Resource: /population/1/ancestors?until-first=binary-cross
{"nodes": [
{"id": 1},
{"id": 2},
{"id": 3},
{"id": 4}
]}
[Diagram: nodes 1 to 6 joined by five :PARENT relationships]
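The pruning can be pictured as a traversal that stops expanding once it has collected the parents of the first cross it meets. The sketch below assumes that reading of until-first=binary-cross; the PARENTS map and both function names are hypothetical.

```python
# Hypothetical graph: child id -> list of (parent id, parental_role or None).
PARENTS = {
    1: [(2, None)],
    2: [(4, "male"), (3, "female")],
    3: [(5, None)],
    5: [(6, None)],
}


def is_binary_cross(node):
    """A node is treated as a cross when it has a male and a female parent."""
    roles = {role for _, role in PARENTS.get(node, [])}
    return {"male", "female"} <= roles


def ancestors_until_first_cross(population_id):
    """Collect ancestors, but do not expand past the parents of a cross."""
    nodes, frontier = {population_id}, [population_id]
    while frontier:
        node = frontier.pop()
        stop_here = is_binary_cross(node)
        for parent, _ in PARENTS.get(node, []):
            if parent not in nodes:
                nodes.add(parent)
                if not stop_here:
                    frontier.append(parent)  # keep climbing
    return {"nodes": [{"id": n} for n in sorted(nodes)]}


print(ancestors_until_first_cross(1))
```

With this sample graph, nodes 5 and 6 are never visited, which matches the four-node response on the slide.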

Ancestry-as-a-Service is Released September 2014
[Diagram: Data Scientists and Application Developers connect to the REST API (Ancestry-as-a-Service)]
• >30 elements of RESTful grammar
• ~120 applications and data scientists
• >600 million REST requests
• 10x performance boost
• A 1-month analysis now takes 3 hours

Real-Time Reads Require Real-Time Data
• Ingestion volume is ~10 million writes/day (not a write-heavy flow)
• https://github.com/MonsantoCo/goldengate-kafka-adapter
Field + Lab Applications
{
"table": "foo",
"type": "INSERT",
"columns": [
{
"name": "bar",
"before": "fizz",
"after": "buzz"
}
]
}
REST API
POST /population
PUT /population/1234
PUT /population/parents
DELETE /population
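The step between a change message and a REST call might look roughly like the sketch below. Everything about the mapping is an assumption: the slide shows only the message layout and the list of endpoints, so the type names other than "INSERT", the choice of HTTP method per type, the fixed /population path, and the function name to_rest_call are all hypothetical.

```python
import json

# Assumed mapping from change type to HTTP method. Only "INSERT" appears
# on the slide; "UPDATE" and "DELETE" are guesses at the other type names.
METHOD_FOR_TYPE = {"INSERT": "POST", "UPDATE": "PUT", "DELETE": "DELETE"}


def to_rest_call(raw_message):
    """Turn one change message into (method, path, body) for the REST API."""
    message = json.loads(raw_message)
    method = METHOD_FOR_TYPE[message["type"]]
    # Deletes carry the old values; inserts and updates carry the new ones.
    side = "before" if message["type"] == "DELETE" else "after"
    body = {col["name"]: col[side] for col in message["columns"]}
    # The path is fixed here for simplicity (an assumption of this sketch).
    return method, "/population", body


example = """
{"table": "foo", "type": "INSERT",
 "columns": [{"name": "bar", "before": "fizz", "after": "buzz"}]}
"""
print(to_rest_call(example))
```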

We’ve Got Ancestry Figured Out… What’s Next?
[Diagram: Genotype, Phenotype, Environment, Ancestry]

Layering Genotype Data Over Ancestry Trees
Genotype nodes act as simple pointers to remote systems which store the raw data
[Diagram: graph model extended with genotypes. Node labels: :Plant, :Plant Inventory, :Planting, :Selection, :Genotype. Relationship types: :PARENT, :PLANTED, :SELECTED, :HARVESTED, :INVENTORY, :HAS_GENOTYPE]

Retrieving Ancestry Trees Annotated with Genotypes

/population/1/ancestors?until=genotyped-ancestor&props=genotypes
{"nodes": [
{"id": 1, "genotypes": [{"id": 123}]},
{"id": 2},
{"id": 3},
{"id": 4, "genotypes": [{"id": 456}]},
{"id": 5, "genotypes": [{"id": 789}]}
],
"relationships": [
{"from": 1, "to": 2, "relation": "PARENT"},
{"from": 2, "to": 3, "relation": "PARENT"}
]}
[Diagram: ancestry tree of nodes 1 to 5 with three :Genotype nodes, {marker_count: 300}, {marker_count: 60,000}, and {marker_count: 60,000}]
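A rough sketch of how such a query could behave: climb each branch until it reaches an ancestor that has a genotype, then attach genotype ids to the nodes that have them. The stopping rule is an interpretation of until=genotyped-ancestor, and the PARENTS and GENOTYPES maps (including the edges from 3 to 4 and 5, and nodes 6 and 7) are invented for the example.

```python
# Hypothetical data: child id -> parent ids, and node id -> genotype ids.
PARENTS = {1: [2], 2: [3], 3: [4, 5], 4: [6], 5: [7]}
GENOTYPES = {1: [123], 4: [456], 5: [789], 6: [999]}


def ancestors_until_genotyped(population_id):
    """Climb until each branch reaches an ancestor that has a genotype."""
    order, frontier = [population_id], [population_id]
    relationships = []
    while frontier:
        node = frontier.pop(0)
        for parent in PARENTS.get(node, []):
            relationships.append(
                {"from": node, "to": parent, "relation": "PARENT"}
            )
            if parent in order:
                continue
            order.append(parent)
            if parent not in GENOTYPES:
                frontier.append(parent)  # not genotyped yet: keep climbing
    nodes = []
    for node in sorted(order):
        entry = {"id": node}
        if node in GENOTYPES:  # props=genotypes: annotate genotyped nodes
            entry["genotypes"] = [{"id": g} for g in GENOTYPES[node]]
        nodes.append(entry)
    return {"nodes": nodes, "relationships": relationships}


print(ancestors_until_genotyped(1))
```

In this sample, nodes 6 and 7 are left out because the branches through 4 and 5 already ended at a genotyped ancestor.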

Estimate the Genotype of Every Seed Produced
[Diagram: components Field + Lab Applications, REST API, Genotypes, and Genotype Estimation Engine; flows labeled Genotype Annotated Ancestry Trees, Required Genotype DataSets, Estimated Genotypes, and New Estimated Genotypes Messages]

Let’s Revisit the Flow of a Breeding Cycle
[Diagram: breeding cycle pipeline starting from a cross (X), with stages Estimate Hi-Res Genotypes and Genome-Wide Selection]
• All progeny of two parents enter; the best one leaves to become a future parent
• Select the best, discard the rest
• 1000’s of crosses/year; dozens of progeny/cross; 1 genotype/progeny; < $1 million/year
• Width of pipeline increases to accommodate more crosses

A Glimpse Inside Our Active ‘Graphy’ Work
Sources: http://biodiversitylibrary.org/page/27066167#page/125/mode/1up

Constructing Coancestry Matrices
[Diagram: ancestry tree in which A has parents B and C, B has parents D and E, and C has parents F and G]
Coancestry(A)
     A    B    C    D     E     F     G
A    1    0.5  0.5  0.25  0.25  0.25  0.25
B         1    0    0.5   0.5   0     0
C              1    0     0     0.5   0.5
D                   1     0     0     0
E                         1     0     0
F                               1     0
G                                     1
• Consider a reduced ancestor tree only between crosses
• A progeny inherits 50% of its genetics from each parent
• Key input for a large class of predictive genetic analysis algorithms
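The slide does not show how the matrix is computed. Its values (1 on the diagonal, 0.5 between parent and progeny, 0.25 between grandparent and progeny) are what the standard tabular method for a pedigree relationship matrix produces, so that method is used below as a stand-in. The PEDIGREE map encodes the seven-node tree from the slide; treating D, E, F, and G as unrelated founders is an assumption, and relationship_matrix is a hypothetical name.

```python
# Pedigree from the slide's tree: individual -> (parent, parent).
# Founders (D, E, F, G) have no recorded parents and are assumed unrelated.
PEDIGREE = {"A": ("B", "C"), "B": ("D", "E"), "C": ("F", "G"),
            "D": None, "E": None, "F": None, "G": None}


def relationship_matrix(pedigree):
    """Tabular method: fill the matrix from founders toward progeny."""
    order, placed = [], set()

    def place(ind):  # parents must be placed before their progeny
        if ind in placed:
            return
        for parent in pedigree[ind] or ():
            place(parent)
        placed.add(ind)
        order.append(ind)

    for ind in pedigree:
        place(ind)

    m = {}
    for i, ind in enumerate(order):
        parents = pedigree[ind]
        for other in order[:i]:
            # Relationship to an earlier individual is the mean of the
            # two parents' relationships to it (0 for a founder).
            value = sum(m[p, other] for p in parents) / 2 if parents else 0.0
            m[ind, other] = m[other, ind] = value
        # Diagonal: 1 plus half the relationship between the two parents.
        between = m[parents[0], parents[1]] if parents else 0.0
        m[ind, ind] = 1.0 + between / 2
    return m


matrix = relationship_matrix(PEDIGREE)
for name in "ABCDEFG":
    print(name, [matrix[name, other] for other in "ABCDEFG"])
```

Because each entry depends only on entries for the parents, the matrix can be filled in a single pass once individuals are ordered parents-first, which is why a reduced tree between crosses is a convenient input.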

Thank You All
@TimWilliate
http://engineering.monsanto.com/
Special thanks to my teammates
• Jason Clark
• Marshall Marietta
