4. World’s largest online family history resource
Historical Content
Over 30,000 historical content collections
11 billion records and images
Records dating back to the 16th century
5. World’s largest online family history resource
User Contributed Content
45 million family trees
More than 4 billion profiles
200 million stories and photos
6. DNA Data
Over 120,000 DNA samples
700,000 SNPs for each sample
2,000,000 4th cousin matches
"Spit in a tube, pay $99, learn your past" - Derrick Harris, GigaOm
DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism). (http://en.wikipedia.org/wiki/Single-nucleiotide_polymorphism)
7. User Behavior Data
40 million searches / day
10 million people added to trees / day
5 million Hints accepted / day
3.5 million Records attached / day
12. Record linkage
• Record linkage: finding and matching records in multiple data sets with non-unique identifiers
• Goal: bring together information about the same person
• Some non-unique identifiers:
– Names: first name, last name (John Smith: 300,000 records)
– Dates: date of birth, date of death
– Places: place of birth, residence, place of death
– Extra: family members, life events
• Records are often incomplete
• Records contain mistakes
• Exact and fuzzy match
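The exact and fuzzy match modes listed above can be illustrated with a small sketch. It assumes that fuzzy matching means a bounded Levenshtein edit distance on normalized names; the function names and the `max_edits` threshold are illustrative and are not taken from the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep when equal)
            ))
        prev = curr
    return prev[-1]


def normalize(name: str) -> str:
    """Trim whitespace and ignore case before comparing names."""
    return name.strip().lower()


def exact_match(a: str, b: str) -> bool:
    return normalize(a) == normalize(b)


def fuzzy_match(a: str, b: str, max_edits: int = 1) -> bool:
    # max_edits is an assumed tolerance, not a value from the slides.
    return edit_distance(normalize(a), normalize(b)) <= max_edits
```

With these definitions, "Smith" and "Smyth" fail the exact match but pass the fuzzy match, which is the kind of spelling mistake the slide refers to.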
13. Life events in collections
• Life events
– Birth: 2.59 bln
– Marriage: 114 mln
– Census: 2.74 bln
– Death: 467 mln
• Total: 5.91 bln events
14. Candidate set funnel: exact match
John Smith: 300,000
John Smith, 1870: 2,200
John Smith, 1870, Boston, MA: 10
Search: high precision
15. Candidate set funnel: fuzzy match
John Smith: 380,000
John Smith, 1870: 97,000
John Smith, 1870, Boston, MA: 1,400
Exploration: high recall
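The two funnels can be sketched on toy data to show why fuzzy criteria keep more candidates at every stage. Everything here is an assumption made for illustration: the `Record` fields, the invented sample records, the 0.75 similarity threshold, the two-year slack on dates, and the rule that a missing field does not exclude a record.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


@dataclass
class Record:
    first: str
    last: str
    birth_year: Optional[int]
    place: Optional[str]


# Toy records, invented for this example only.
RECORDS = [
    Record("John", "Smith", 1870, "Boston, MA"),
    Record("John", "Smith", 1871, "Boston, MA"),
    Record("J.", "Smith", 1870, None),
    Record("John", "Smyth", 1869, "Boston, MA"),
    Record("John", "Smith", 1902, "Denver, CO"),
    Record("John", "Smith", 1870, "Salem, MA"),
    Record("Mary", "Jones", 1870, "Boston, MA"),
]


def exact_funnel(records: List[Record], first: str, last: str,
                 year: int, place: str) -> Tuple[int, int, int]:
    """Candidate counts after name, then year, then place, all exact."""
    by_name = [r for r in records if (r.first, r.last) == (first, last)]
    by_year = [r for r in by_name if r.birth_year == year]
    by_place = [r for r in by_year if r.place == place]
    return len(by_name), len(by_year), len(by_place)


def fuzzy_name(r: Record, first: str, last: str) -> bool:
    # Accept a bare initial for the first name and a close spelling
    # of the last name (assumed threshold: similarity >= 0.75).
    first_ok = r.first == first or r.first.rstrip(".") == first[:1]
    similarity = SequenceMatcher(None, r.last.lower(), last.lower()).ratio()
    return first_ok and similarity >= 0.75


def fuzzy_funnel(records: List[Record], first: str, last: str, year: int,
                 place: str, year_slack: int = 2) -> Tuple[int, int, int]:
    """Same funnel, but tolerant of variants and missing fields."""
    by_name = [r for r in records if fuzzy_name(r, first, last)]
    by_year = [r for r in by_name
               if r.birth_year is None or abs(r.birth_year - year) <= year_slack]
    by_place = [r for r in by_year if r.place is None or r.place == place]
    return len(by_name), len(by_year), len(by_place)
```

On the toy data, `exact_funnel(RECORDS, "John", "Smith", 1870, "Boston, MA")` narrows to 4, 2, 1 candidates, while `fuzzy_funnel` with the same arguments keeps 6, 5, 4. The exact funnel is precise but misses the initial-only and misspelled records; the fuzzy funnel retains them at the cost of a larger set to rank.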
16. Results set
Names edit distance
Extended dates
Missing fields
Short names, initials
Exact match
18. A place for machine learning
• Supervised machine learning
• Learn similarity measure (how to combine identifiers)
• Training & testing sets:
– User accepts, rejects
• Features (> 500):
– First last name, DOB, POB, DOD, POD
– Parents, children, siblings, spouses
– Fuzzy matches
• Similar to “learning to rank” problem
[Diagram labels: ML suggest; Candidate k-set; Person; Record?]
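As a rough illustration of how identifiers could become features and how user accepts and rejects could become labels, here is a minimal sketch. The six features, the dictionary field names, and both function names are assumptions; the slide mentions more than 500 features, which are not enumerated in the source.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def pair_features(person: dict, record: dict) -> Dict[str, float]:
    """Compare one tree person with one historical record, field by field."""
    def same(field: str) -> float:
        a, b = person.get(field), record.get(field)
        return 1.0 if a is not None and a == b else 0.0

    # Fuzzy last-name similarity in [0, 1].
    last_sim = SequenceMatcher(
        None, person.get("last", "").lower(), record.get("last", "").lower()
    ).ratio()

    # Gap between birth years, or -1.0 when either side is missing.
    p_year, r_year = person.get("birth_year"), record.get("birth_year")
    if p_year is not None and r_year is not None:
        year_gap = float(abs(p_year - r_year))
    else:
        year_gap = -1.0

    # Number of relatives (parents, children, siblings, spouses) in common.
    shared = set(person.get("relatives", [])) & set(record.get("relatives", []))

    return {
        "first_exact": same("first"),
        "last_exact": same("last"),
        "last_similarity": last_sim,
        "birth_year_gap": year_gap,
        "birth_place_exact": same("birth_place"),
        "shared_relatives": float(len(shared)),
    }


def training_rows(
    feedback: List[Tuple[dict, dict, bool]]
) -> List[Tuple[Dict[str, float], int]]:
    """Turn (person, record, accepted) feedback into (features, label) rows."""
    return [(pair_features(p, r), 1 if accepted else 0)
            for p, r, accepted in feedback]
```

Rows in this shape can be split into training and testing sets and passed to any supervised learner; the learner then decides how much weight each identifier deserves.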
19. Similarity measure learning
[Diagram labels: Ancestry collections; Member trees; Feature generation; Person ID, Record ID, Label; ML (random forest); Model; Index; Top-k records candidate set; Feature generation; Ranked list; stages: Training, Scoring; platform: Hadoop, Hive]
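The scoring stage named in the diagram (candidate set, feature generation, model, ranked list) can be sketched as follows. This is only one plausible reading of the labels: `rank_candidates`, `toy_featurize`, and `toy_score` are hypothetical names, and the hand-set weights stand in for a trained model.

```python
import heapq
from typing import Callable, Dict, Iterable, List, Tuple

Features = Dict[str, float]


def rank_candidates(
    person: dict,
    candidates: Iterable[Tuple[str, dict]],
    featurize: Callable[[dict, dict], Features],
    score: Callable[[Features], float],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (record_id, record) candidate and return the top k."""
    scored = ((score(featurize(person, rec)), rec_id)
              for rec_id, rec in candidates)
    return [(rec_id, s) for s, rec_id in heapq.nlargest(k, scored)]


def toy_featurize(person: dict, record: dict) -> Features:
    # Two illustrative features; a real system would generate many more.
    return {
        "last_exact": float(person["last"] == record["last"]),
        "year_exact": float(person["birth_year"] == record["birth_year"]),
    }


def toy_score(features: Features) -> float:
    # Hand-set weights standing in for a trained model's output.
    return 0.6 * features["last_exact"] + 0.4 * features["year_exact"]
```

Passing the featurizer and the scorer as arguments keeps the ranking step independent of the model, so a trained model's prediction function could replace `toy_score` without changing `rank_candidates`.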
20. Large scale machine learning
[Diagram labels: Random forest (R), four instances; Model; Hadoop streaming; Hadoop HDFS]
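One way to read this diagram is that several learners are trained in parallel on separate chunks of data and then combined into a single model. The sketch below illustrates only that general idea and is not the talk's implementation: plain Python lists stand in for HDFS splits, and a one-feature threshold classifier stands in for each R random forest. All names are hypothetical.

```python
from typing import List, Tuple

Row = Tuple[float, int]  # (similarity score, label 0 or 1)


def train_stump(rows: List[Row]) -> float:
    """Pick the threshold with the fewest errors on one shard of data."""
    best_errors, best_threshold = None, None
    for threshold in sorted({x for x, _ in rows}):
        errors = sum((x >= threshold) != bool(y) for x, y in rows)
        if best_errors is None or errors < best_errors:
            best_errors, best_threshold = errors, threshold
    return best_threshold


def train_on_shards(shards: List[List[Row]]) -> List[float]:
    """Train one learner per shard; each call is independent of the others."""
    return [train_stump(shard) for shard in shards]


def ensemble_predict(thresholds: List[float], x: float) -> int:
    """Combine the per-shard learners by majority vote."""
    votes = sum(x >= t for t in thresholds)
    return int(votes * 2 > len(thresholds))


# Toy shards, invented for illustration.
SHARDS = [
    [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)],
    [(0.2, 0), (0.5, 1), (0.7, 1)],
    [(0.3, 0), (0.35, 0), (0.8, 1)],
]
```

Because each shard is trained without reference to the others, the training calls can run as separate parallel tasks, and only the small per-shard models need to be gathered at the end.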
27. Historical immigration to the US
• Immigration is the movement of people into a country or region to which they are not native in order to settle there
• Immigrants are those who were born outside the US and died in the US
• Based on family tree profiles:
– Birth/death dates range 1500-1990
– Select only complete profiles with FLN, POB, DOB, POD, DOD
– Perform de-duplication, remove same ancestors from different family trees
– Select only those with POB != US, POD == US
• 15 mln profiles (0.3% of 4.9 bln profiles)
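The selection steps above can be sketched as a small filter. The field names, the use of country codes for POB and POD, the use of years for DOB and DOD, and de-duplication by an exact key are all simplifying assumptions; the talk does not say how duplicates were actually detected.

```python
from typing import List

# FLN, POB, DOB, POD, DOD mapped to hypothetical dictionary keys.
REQUIRED = ("name", "pob", "dob", "pod", "dod")


def is_complete(profile: dict) -> bool:
    """Keep only profiles where every required field is filled in."""
    return all(profile.get(field) not in (None, "") for field in REQUIRED)


def in_date_range(profile: dict, first: int = 1500, last: int = 1990) -> bool:
    """Both the birth year and the death year must fall in the range."""
    return (first <= profile["dob"] <= last
            and first <= profile["dod"] <= last)


def deduplicate(profiles: List[dict]) -> List[dict]:
    """Drop repeats of the same ancestor that appear in different trees."""
    seen = set()
    unique = []
    for p in profiles:
        key = (p["name"].lower(), p["pob"], p["dob"], p["pod"], p["dod"])
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique


def select_immigrants(profiles: List[dict]) -> List[dict]:
    """Born outside the US (POB != US) and died in the US (POD == US)."""
    usable = [p for p in profiles if is_complete(p) and in_date_range(p)]
    return [p for p in deduplicate(usable)
            if p["pob"] != "US" and p["pod"] == "US"]
```

The completeness check runs first so that the later steps can read every field without guarding against missing values. The reported share follows directly from the totals: 15 mln out of 4.9 bln is about 0.3%.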
32. Data Science
• Ancestry is building a data science team
• We work on product data and BI
• We are hiring
• Special thanks to Mercator Group for infographics