Cassandra Summit 2014: Huge Online Genealogical Database Driven By Cassandra

1
© 2014 by Intellectual Reserve, Inc. All rights reserved.
Huge Online Genealogical Database
Driven By Cassandra
Cassandra Summit 2014
John Sumsion

2
Outline
• Introduction to FamilySearch Family Tree
• Outline of Cassandra reimplementation
• Journal-based Consistency Model
• Experience with Cassandra

3
What is FamilySearch?
FamilySearch.org website
Very large single pedigree (Family Tree)
Largest collection of free genealogical records
Largest genealogical library
Family History Department of the Church of Jesus Christ of Latter-day Saints (known as Mormons)

4
Why does FamilySearch exist?
Visit http://mormon.org/family-history/

5
Family Tree
Records Indexing
Family Tree
Memories
Community
Where it fits

6
Record Preservation
Neglect
Time
Disasters (e.g. WWII)

7
Record Preservation (continued)
• 100 million images published online / year

8
Indexing
3.5 billion indexed records – 35M / month
Turns this… …into this!

11
Family Tree
Records Indexing
Family Tree
Memories
Community
Where it fits

12
Family Tree Data
Family Tree:
• 900M+ person records, open-edit
• 500M+ relationships, open-edit
• 8.4B change log entries, 100M+ per quarter
• Dynamic OLTP system
• Data-dependent performance issues

13
Family Tree: Example 9 Gen Pedigree
up to 511 person slots
Dynamic content!

14
Family Tree: Example Pedigree App
31+ persons per section
Dynamic content!

15
Family Tree: Example Ancestor Page
10+ persons in families
100-1000+ changes
Dynamic content!

16
Family Tree: Example Change History
100-1000+ changes
Dynamic content!

17
Contents

18
Performance & Scale
• Slow page views
  • pedigree (500-3000ms for 3 generations)
  • change history (2000+ms for first page of changes)
  • large family view
• Query problems
  • relationships connect persons, range scan by person id
  • every person => person traversal is a 200-300M btree scan (global index)
  • change history queries traverse an 8+B btree (global index)

19
Performance & Scale
• Query performance problems
Person
Relationship
Person
Wide range scan
Pedigree
Change History
Change
History
Wide range scan

20
Cassandra Reimplementation
• selected Cassandra after extensive testing
• full data scale proof-of-concept & tests
• required: new data model (performance)
• required: new consistency model (critical!)

21
• event-sourced data model – journal / views
• new data model – no indexes
• new consistency model – satisfies consistency
JE #8
P1 P1 Views
A B
JE #6
P2 P2 Views
A B

22
• denormalized relationships
P1 P2
R1
R2
R3
R5
R4

23
P1 P2
R1
R2
R3
R5
R4
R2
R3

24
• exact duplication allows bidirectional traversal
Person
/Rels
Person
/Rels
Person
Relationship
Person
Wide query
P1 P2
R1
R2
R3
R5
R4
R2
R3
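The duplication idea on this slide can be sketched in a few lines. This is only an illustration, assuming a dict stands in for a table partitioned by person id; `store`, `add_relationship`, and `relatives` are hypothetical names, not FamilySearch code.

```python
from collections import defaultdict

# Stand-in for a table partitioned by person id:
# person id -> {relationship id: relationship record}
store = defaultdict(dict)

def add_relationship(rel_id, person_a, person_b, kind):
    """Write the same relationship record under both persons' partitions."""
    record = {"id": rel_id, "persons": (person_a, person_b), "kind": kind}
    store[person_a][rel_id] = record
    store[person_b][rel_id] = record

def relatives(person_id):
    """Traverse from one person by reading only that person's partition."""
    found = set()
    for record in store[person_id].values():
        found.update(p for p in record["persons"] if p != person_id)
    return sorted(found)

add_relationship("R1", "P1", "P0", "couple")
add_relationship("R2", "P1", "P2", "parent-child")
```

Because each record is stored twice, `relatives("P1")` and `relatives("P2")` each need a single partition lookup, which is the trade the slide describes: extra writes in exchange for no wide query.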

25
• change history is a core feature
• denormalized change history
• optimizes for displaying recent changes
JE #8
P1 P1 Change History View
1000s of changes (spread over multiple Cassandra cells)
Last 100-1000 changes (local to a single Cassandra cell)
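One way to picture "optimized for displaying recent changes" is a view that keeps a bounded list of recent changes together and spills older ones elsewhere. The sketch below is an assumption about the shape of such a view, not the actual schema; `RECENT_LIMIT`, `record_change`, and `first_page` are hypothetical.

```python
RECENT_LIMIT = 3  # illustrative only; the slide mentions the last 100-1000 changes

def record_change(view, change):
    """Put the newest change first in 'recent'; spill the oldest into 'older'."""
    view["recent"].insert(0, change)
    while len(view["recent"]) > RECENT_LIMIT:
        view["older"].insert(0, view["recent"].pop())
    return view

def first_page(view):
    """Showing recent changes reads only the single 'recent' cell."""
    return list(view["recent"])

view = {"recent": [], "older": []}
for n in range(1, 6):
    record_change(view, f"change-{n}")
```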

26
Contents

27
Journal-based Consistency Model
Rough Process Flow
Command: captures edits safely
Journal: stores edits canonically
View (multiple): view-optimized summations

28
Command
• write-once with quorum
• application to journal requires 3 tables:
pending / completed / aborted
• idempotent application to journal
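The slide does not say how the three tables are used, so the following is one plausible reading, sketched with dicts in place of tables. All names (`submit`, `apply_to_journal`, `abort`) are hypothetical.

```python
pending, completed, aborted = {}, {}, {}   # stand-ins for the three tables
journal = {}                               # command id -> journal entry

def submit(command_id, payload):
    """Write-once capture: a repeated submit of the same id changes nothing."""
    if command_id not in completed and command_id not in aborted:
        pending.setdefault(command_id, payload)

def apply_to_journal(command_id):
    """Idempotent: calling it again after an interruption is harmless."""
    payload = pending.get(command_id)
    if payload is None:
        return False                       # already completed, aborted, or unknown
    journal[command_id] = payload          # same key and bytes on every retry
    completed[command_id] = payload
    del pending[command_id]
    return True

def abort(command_id):
    """Move a command that cannot be applied out of the pending table."""
    payload = pending.pop(command_id, None)
    if payload is not None:
        aborted[command_id] = payload

submit("cmd-1", b'{"attribution": {}}')
first = apply_to_journal("cmd-1")
second = apply_to_journal("cmd-1")   # retry after a simulated interruption
```

The retry leaves the journal unchanged because the entry is keyed by the command id, so a handler that crashes partway through can simply be run again.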

29
Command Schema
• key: command v1 uuid (as text)
• value: blob (binary json)
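A minimal sketch of building such a row with the Python standard library. The slide only says "binary json"; UTF-8 encoded JSON is used here as an assumption, and `make_command_row` is a hypothetical helper.

```python
import json
import uuid

def make_command_row(body):
    """Return (key, value): a version-1 UUID as text and a serialized JSON blob."""
    key = str(uuid.uuid1())                                   # time-based v1 UUID
    value = json.dumps(body, sort_keys=True).encode("utf-8")  # assumed encoding
    return key, value

key, value = make_command_row({"attribution": {}})
```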

30
Journal
• write-once with quorum & C* batch
• denormalized byte-exact across affected persons & relationships
• each entry stored in separate cell (compaction required for fast journal reads)

31
Journal
• CmRDT (commutative replicated data type)
• partitions converge without conflict because of unique uuid
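The convergence property can be checked with a toy model: if every entry is write-once and keyed by a unique id, applying entries in any order, or more than once, gives the same journal. The ids and contents below are made up for illustration.

```python
from itertools import permutations

def apply_entry(journal, entry):
    """Add one write-once entry; a redelivered entry with the same id is a no-op."""
    journal.setdefault(entry["id"], entry["content"])
    return journal

def replay(order):
    """Build a journal by applying entries in the given order."""
    journal = {}
    for entry in order:
        apply_entry(journal, entry)
    return journal

entries = [
    {"id": "uuid-a", "content": "name changed"},
    {"id": "uuid-b", "content": "birth date added"},
    {"id": "uuid-c", "content": "source attached"},
]

# Every delivery order produces the same journal contents.
results = [replay(order) for order in permutations(entries)]
```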

32
Partition Key   Command UUID      Content (blob)
KWZ3-P71        eda6f967-0955…    { "attribution": {}, … } (binary json)
KWZ3-P71        6af8d90c-8f3a…    { "attribution": {}, … } (binary json)
KCDT-J59        fd35ac61-7def…    { "attribution": {}, … } (binary json)
KCDT-J59        b2db2fa5-da5f…    { "attribution": {}, … } (binary json)

33
View
• multiple views for multiple uses (person, person card, change history)
• populated by applying journal entries
• incrementally updated in steady state
• not canonical data, can be recalculated
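The relationship between incremental updates and recalculation can be sketched as a fold over the journal. This assumes journal entries for an entity can be read in a stable order and that each entry sets one field; both are simplifications, and the function names are hypothetical.

```python
def apply_to_view(view, entry):
    """Fold one journal entry into a 'current state' person view."""
    updated = dict(view)
    updated[entry["field"]] = entry["value"]
    return updated

def rebuild_view(journal):
    """Recalculate the view from scratch; the journal is the canonical data."""
    view = {}
    for entry in journal:
        view = apply_to_view(view, entry)
    return view

journal = [
    {"field": "name", "value": "Jon Smith"},
    {"field": "birth", "value": "1850"},
]
view = rebuild_view(journal)

# Steady state: a new entry is appended to the journal and applied incrementally.
new_entry = {"field": "name", "value": "John Smith"}
journal.append(new_entry)
view = apply_to_view(view, new_entry)
```

The incrementally updated view equals `rebuild_view(journal)`, which is why a view can be thrown away and regenerated at any time.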

34
[Diagram: person P1 with views A and B]

35
[Diagram: journal entry JE #8 for P1; views A and B each receive JE #8]

36
[Diagram: P1 views A and B, each holding JE #8, alongside views A (new) and B (new)]

37
[Diagram: person P1 with views A and B]

38
View
• views have same schema as journal
• journal entries are written to view for incremental refresh
• core of the consistency model

39
View
• CvRDT (convergent replicated data type)
• partitions converge with conflict; resolved by full view refresh from canonical journal
• steady state: one view of a given type per entity
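A rough sketch of the resolution rule: when more than one view of the same type exists for an entity, discard them and refresh from the journal. How the real system detects the conflict is not described on the slide; `converge` and `refresh_from_journal` are hypothetical names.

```python
def refresh_from_journal(journal_entries):
    """Full refresh: fold every journal entry, in order, into a fresh view."""
    view = {}
    for field, value in journal_entries:
        view[field] = value
    return view

def converge(candidate_views, journal_entries):
    """Return the single view that should remain for one entity and view type."""
    if len(candidate_views) == 1:
        return candidate_views[0]                  # steady state, nothing to resolve
    return refresh_from_journal(journal_entries)   # conflict: rebuild from canonical data

journal_entries = [("name", "John Smith"), ("birth", "1850"), ("death", "1910")]
side_a = {"name": "John Smith", "birth": "1850"}   # missed the death entry
side_b = {"name": "John Smith", "death": "1910"}   # missed the birth entry
merged = converge([side_a, side_b], journal_entries)
```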

40
[Diagram: P1 views A and B, each holding JE #8, alongside views A (new) and B (new)]

41
• Performance & Scale
  • lookup by partition key only, no indexes
  • any cross-entity change happens in duplicate on all affected entities
  • stored “current-state” views – cheapest possible read
  • custom views – tunable to different use cases
  • disposable views – able to tweak view over time

42
• Business Rule Enforcement
  • Read / Write / Read & Revert
  • pre-command checks prevent invalid changes
  • write with appropriate quorum ensures consistent write
  • post-command checks prevent business-rules conflicts
  • administrative revert marks command as “not applicable” and thereby causes full refresh which ignores changes
  • appropriate quorum: depending on the change, either LOCAL_QUORUM or EACH_QUORUM
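The Read / Write / Read & Revert sequence might look roughly like this. The business rule, the in-memory journal, and the way a failed post-check marks a command "not applicable" are all assumptions made for illustration; the slide describes the revert as administrative and does not show code.

```python
def birth_before_death(person):
    """Example business rule (hypothetical): birth year must not follow death year."""
    birth, death = person.get("birth"), person.get("death")
    return birth is None or death is None or birth <= death

def current_state(journal, reverted):
    """Full refresh that ignores commands marked 'not applicable'."""
    person = {}
    for command_id, change in journal:
        if command_id not in reverted:
            person.update(change)
    return person

def submit(journal, reverted, command_id, change):
    # Read: pre-command check against the state the change would produce.
    proposed = {**current_state(journal, reverted), **change}
    if not birth_before_death(proposed):
        return "rejected"
    # Write: append the change (a real system would use a quorum write here).
    journal.append((command_id, change))
    # Read & revert: a post-command check catches conflicts from concurrent writers.
    if not birth_before_death(current_state(journal, reverted)):
        reverted.add(command_id)
        return "reverted"
    return "applied"

journal, reverted = [], set()
outcome_ok = submit(journal, reverted, "c1", {"birth": 1850})
outcome_bad = submit(journal, reverted, "c2", {"death": 1800})
```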

43
• Strong consistency
  • command store – atomic capture of a single user action
  • command handling – idempotent writes to journal, picked up later even if interrupted
  • no global lock needed for optimistic concurrency
• Read after write
  • consistency ONE for normal reads
  • quorum when the client knows it’s refreshing after write
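The read-after-write choice follows from the usual replica-overlap rule: a read is guaranteed to see the latest write when the read and write replica counts together exceed the replication factor. The replication factor of 3 below is an assumption; the slides do not state one.

```python
def quorum(replication_factor):
    """Cassandra's quorum size: a majority of replicas."""
    return replication_factor // 2 + 1

def read_sees_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > replication_factor

RF = 3  # assumed replication factor
quorum_read_ok = read_sees_latest_write(quorum(RF), quorum(RF), RF)  # 2 + 2 > 3
one_read_ok = read_sees_latest_write(1, quorum(RF), RF)              # 1 + 2 > 3 fails
```

So a read at ONE after a quorum write may return stale data, which is acceptable for normal page views, while a quorum read is the safe choice when the client is refreshing right after its own write.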

44
• Journal / View Concerns
  • native support for change history
  • no journal tombstones in steady state – write-once
  • blob schema implementable on any db engine that supports two-level keys (partition, composite)
  • consistency model implementable on any db engine that supports batches & quorum writes/reads
  • view tombstones on every write, biggest concern
  • leveled compaction?
  • WISH: size-tiered compaction with data locality hoisting

45
Contents

46
Experience with Cassandra
• tested Community 1.2 and 2.0
• fantastic performance
• easy cloud setup
• great developer response
• easy to bulk load through CQL3
• harder to get running inside AWS VPC

47
• Bulk import experience
  • 8.4B change log records => 5.8B journal entries (2.5TB lzo)
  • ‘hi1.4xlarge’ cluster (2x 1TB SSDs)
  • import through CQL was fast enough
  • 11h to import on a 5-node cluster (5h on a 30-node cluster)
  • 140k writes / sec, fed from 128 writer threads
  • 20 records / unlogged batch write, 1-2k record size
  • minimal post-import compaction (size-tiered)
  • ended up with 3.5-4TB on C* disk after import
• OpsCenter – great visibility for tuning
• Community – harder to automate repairs, etc.
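The import figures can be cross-checked with simple arithmetic. All inputs below are taken from the slide; the derived numbers are just division.

```python
journal_entries = 5.8e9      # journal entries to import
writes_per_sec = 140_000     # sustained write rate
records_per_batch = 20       # records per unlogged batch
writer_threads = 128         # client writer threads

hours = journal_entries / writes_per_sec / 3600
batches_per_sec = writes_per_sec / records_per_batch
writes_per_thread = writes_per_sec / writer_threads

print(f"{hours:.1f} h, {batches_per_sec:.0f} batches/s, "
      f"{writes_per_thread:.0f} writes/s per thread")
# prints "11.5 h, 7000 batches/s, 1094 writes/s per thread"
```

About 11.5 hours at 140k writes/sec is close to the 11h quoted for the 5-node cluster, which suggests, though the slide does not say so, that the 140k figure refers to that run.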

48
• Full-scale load test experience
  • got to 25x our peak hourly load on a 25- to 28-node cluster
  • production peak load included significant write load
  • working-set size was about 2M persons in a month
  • enabled row cache, ran almost entirely without disk access
  • bottlenecked on interconnect socket w/ round robin client
  • got 50% boost from token-aware, round robin client
• OpsCenter – great visibility for tuning
• Large SSD cluster – able to handle repair during scale tests
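Why token-aware routing relieves the interconnect can be seen in a toy model: a plain round-robin client often picks a coordinator that does not own the key and must forward the request, while a token-aware client goes straight to the owner. This sketch assumes one replica per key and uses MD5 as a stand-in hash (Cassandra's default partitioner is Murmur3); node names are hypothetical.

```python
import hashlib
from itertools import cycle

NODES = ["node-1", "node-2", "node-3", "node-4"]   # hypothetical cluster

def owner(partition_key):
    """Stand-in for token ownership: hash the key onto one of the nodes."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

def count_forwarded(keys, pick_coordinator):
    """Count requests whose coordinator must forward to another node."""
    return sum(1 for key in keys if pick_coordinator(key) != owner(key))

keys = [f"person-{n}" for n in range(1000)]
round_robin = cycle(NODES)
forwarded_rr = count_forwarded(keys, lambda key: next(round_robin))
forwarded_ta = count_forwarded(keys, owner)   # token-aware: coordinator is the owner
```

In this model the token-aware client forwards nothing, whereas the round-robin client forwards most requests across the interconnect.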

49
[Chart: current system vs. cassandra impl (1x, 10x, 20x)]

50
[Chart, log scale: current system vs. cassandra impl (1x, 10x, 20x)]

51
Current Status
• still working on implementation & rollout
• migration, reconciliation, integration…
• consistency model code separate

52
Contents
Questions?

53
Contact Info
John Sumsion
Sr. Software Engineer
sumsionjg@familysearch.org
@jdsumsion
Thanks to the team at FamilySearch!
esp. Randy & James for doing the model
Thanks to the awesome presenters & organizers at
#CassandraSummit!
