On Brewing Fresh Espresso: LinkedIn’s Distributed Data
Serving Platform
Swaroop Jagadish
http://www.linkedin.com/in/swaroopjagadish
LinkedIn Confidential ©2013 All Rights Reserved
Outline
LinkedIn Data Ecosystem
Espresso: Design Points
Data Model and API
Architecture
Deep Dive: Fault Tolerance
Deep Dive: Secondary Indexing
Espresso In Production
Future work
The World’s Largest Professional Network
225M+ Members Worldwide
2 new Members Per Second
100M+ Monthly Unique Visitors
2M+ Company Pages
Connecting Talent → Opportunity. At scale…
LinkedIn Data Ecosystem
Espresso: Key Design Points
 Source-of-truth
– Master-Slave, Timeline consistent
– Query-after-write
– Backup/Restore
– High Availability
 Horizontally Scalable
 Rich functionality
– Hierarchical data model
– Document oriented
– Transactions within a hierarchy
– Secondary Indexes
Espresso: Key Design Points
 Agility – no “pause the world” operations
– “On the fly” Schema Evolution
– Elasticity
 Integration with the data ecosystem
– Change stream with freshness in O(seconds)
– ETL to Hadoop
– Bulk import
 Modular and Pluggable
– Off-the-shelf: MySQL, Lucene, Avro
Data Model and API
Application View
key → value
REST API: /mailbox/msg_meta/bob/2
Partitioning
/mailbox/msg_meta/bob/2
MemberId is the partitioning key
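Not in the deck, but to make "MemberId is the partitioning key" concrete, here is a minimal sketch of hash-based routing on the leading key; the hash function and partition count are illustrative assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch: only the partitioning key ("bob" in /mailbox/msg_meta/bob/2)
// determines the partition, so all of a member's rows live together.
public final class PartitionRouter {
    private final int numPartitions;

    public PartitionRouter(int numPartitions) { this.numPartitions = numPartitions; }

    public int partitionFor(String partitioningKey) {
        CRC32 crc = new CRC32();                      // illustrative choice of hash
        crc.update(partitioningKey.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numPartitions);
    }
}
```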
Document based data model
Richer than a plain key-value store
Hierarchical keys
Values are rich documents and may contain nested types

Example document:
from : { name : "Chris", email : "chris@linkedin.com" }
subject : "Go Giants!"
body : "World Series 2012! w00t!"
unread : true

Schema (Messages):
mailboxID : String
messageID : long
from : { name : String, email : String }
subject : String
body : String
unread : boolean
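Avro is listed later as one of the off-the-shelf components, so a document schema like the one above could plausibly be written as an Avro record; this rendering is hypothetical, not Espresso's actual schema definition:

```java
import org.apache.avro.Schema;

public final class MessageSchema {
    // Nested "from" record models the document hierarchy.
    static final String SCHEMA_JSON = """
        {"type": "record", "name": "Message", "fields": [
          {"name": "mailboxID", "type": "string"},
          {"name": "messageID", "type": "long"},
          {"name": "from", "type": {"type": "record", "name": "From", "fields": [
            {"name": "name",  "type": "string"},
            {"name": "email", "type": "string"}]}},
          {"name": "subject", "type": "string"},
          {"name": "body",    "type": "string"},
          {"name": "unread",  "type": "boolean"}
        ]}""";

    public static Schema parse() {
        return new Schema.Parser().parse(SCHEMA_JSON);
    }
}
```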
REST based API
• Secondary Index query
– GET /MailboxDB/MessageMeta/bob/?query=“+isUnread:true +isInbox:true”&start=0&count=15
• Partial updates
POST /MailboxDB/MessageMeta/bob/1
Content-Type: application/json
Content-Length: 21
{“unread” : “false”}
• Conditional operations
– Get a message, only if recently updated
GET /MailboxDB/MessageMeta/bob/1
If-Modified-Since: Wed, 31 Oct 2012 02:54:12 GMT
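The same calls can be issued from any HTTP client; a sketch using Java's built-in HttpClient, with a placeholder host and port:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class EspressoRestCalls {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String base = "http://espresso.example.com:8080";   // placeholder endpoint

        // Partial update: send only the fields that change.
        HttpRequest update = HttpRequest.newBuilder(
                URI.create(base + "/MailboxDB/MessageMeta/bob/1"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString("{\"unread\" : \"false\"}"))
            .build();
        System.out.println(client.send(update, HttpResponse.BodyHandlers.ofString()).statusCode());

        // Conditional read: expect 304 Not Modified if the document is unchanged.
        HttpRequest read = HttpRequest.newBuilder(
                URI.create(base + "/MailboxDB/MessageMeta/bob/1"))
            .header("If-Modified-Since", "Wed, 31 Oct 2012 02:54:12 GMT")
            .build();
        System.out.println(client.send(read, HttpResponse.BodyHandlers.ofString()).statusCode());
    }
}
```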
Transactional writes within a hierarchy

MessageCounter:
mboxId   value
George   { “numUnread”: 2 }

Message:
mboxId   msgId   value                       etag
George   0       {…, “unread”: false, …}     7abf8091
George   1       {…, “unread”: true, …}      b648bc5f
George   2       {…, “unread”: true, …}      4fde8701

1. Read, record etags: /Message/George/0 → {…, “unread”: false, …} (etag 7abf8091)
2. Prepare after-image: /Message/George/0 → {…, “unread”: true, …}; /MessageCounter/George → {…, “numUnread”: “+1”, …}
3. Update: both rows commit atomically; the counter becomes { “numUnread”: 3 }
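A sketch of the read/prepare/update cycle above as optimistic concurrency with etags; the store interface and method names are illustrative, not Espresso's real internal API:

```java
import java.util.Map;

// Hypothetical interface; the deck does not show Espresso's internal API.
interface DocumentGroupStore {
    Map<String, Versioned> readGroup(String... keys);             // step 1: read + etags
    boolean commitIfUnchanged(Map<String, String> afterImage,     // step 3: atomic update,
                              Map<String, String> expectedEtags); // aborts on etag mismatch
}

record Versioned(String doc, String etag) {}

final class MarkAsReadTxn {
    // Both keys share the partitioning key "George", so they live on one node
    // and can be committed together.
    static boolean run(DocumentGroupStore store) {
        var before = store.readGroup("/Message/George/0", "/MessageCounter/George");
        var after = Map.of(
            "/Message/George/0",      "{\"unread\": true}",
            "/MessageCounter/George", "{\"numUnread\": 3}");
        return store.commitIfUnchanged(after,
            Map.of("/Message/George/0", before.get("/Message/George/0").etag()));
    }
}
```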
Espresso Architecture
Cluster Management and Fault Tolerance
Generic Cluster Manager: Apache Helix
 Generic cluster management
– State model + constraints
– Ideal state of distribution of partitions across the cluster
– Migrate cluster from current state to ideal state
• More Info: SoCC 2012, http://helix.incubator.apache.org
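Helix participants implement the state machine as transition callbacks; a minimal sketch of a MasterSlave state model using Helix's annotations (the comments describe what an Espresso storage node would plausibly do, not verbatim behavior):

```java
import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelInfo;
import org.apache.helix.participant.statemachine.Transition;

@StateModelInfo(initialState = "OFFLINE", states = {"MASTER", "SLAVE", "OFFLINE"})
public class PartitionStateModel extends StateModel {

    @Transition(from = "OFFLINE", to = "SLAVE")
    public void onBecomeSlaveFromOffline(Message msg, NotificationContext ctx) {
        // e.g. restore the partition from a snapshot, then follow the change stream
    }

    @Transition(from = "SLAVE", to = "MASTER")
    public void onBecomeMasterFromSlave(Message msg, NotificationContext ctx) {
        // e.g. catch up to the end of the replication stream, then accept writes
    }

    @Transition(from = "MASTER", to = "SLAVE")
    public void onBecomeSlaveFromMaster(Message msg, NotificationContext ctx) {
        // e.g. stop accepting writes and resume following the new master
    }
}
```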
Espresso Partition Layout: Master, Slave
 3 Storage Engine nodes, 2-way replication
Helix ideal state: Partition P1 → Node 1, …, Partition P12 → Node 3
Current state per node: Node 1 → M: P1 – Active, …, S: P5 – Active, …
Placement:
Node 1 – Masters: P1 P2 P3 P4; Slaves: P5 P6 P9 P10
Node 2 – Masters: P5 P6 P7 P8; Slaves: P1 P2 P11 P12
Node 3 – Masters: P9 P10 P11 P12; Slaves: P3 P4 P7 P8
Cluster Management
Cluster Expansion
Node Failover
Cluster Expansion
 Initial state with 3 storage nodes. Step 1: Compute new ideal state
[Diagram: placement as above, with an empty Node 4 added to the cluster]
Cluster Expansion
 Step 2: Bootstrap new node’s partitions by restoring from backups
[Diagram: Node 4 restores P4, P8, P12, P1, P7, P9 from snapshots; existing nodes unchanged]
Cluster Expansion
 Step 3: Catch up from live replication stream
[Diagram: Node 4’s restored partitions replay the live change stream until they are current]
Cluster Expansion
 Step 4: Migrate masters and slaves to rebalance
Node 1 – Masters: P1 P2 P3; Slaves: P5 P6 P10
Node 2 – Masters: P5 P6 P7; Slaves: P2 P11 P12
Node 3 – Masters: P9 P10 P11; Slaves: P3 P4 P8
Node 4 – Masters: P4 P8 P12; Slaves: P1 P7 P9
Cluster Expansion
 Partitions are balanced. Router starts sending traffic to the new node
[Diagram: placement as after Step 4]
Node Failover
• During failure or planned maintenance
[Diagram: placement as after expansion; Helix now maps P4, P8, P12 to Node 4 (Node 4 → M: P4 – Active, …, S: P7 – Active, …)]
Node Failover
• Step 1: Detect node failure
[Diagram: Node 4 marked failed; placement otherwise unchanged]
Node Failover
• Step 2: Compute new ideal state, promoting slaves to master
[Diagram: Node 4’s master partitions are taken over by the nodes holding their slaves (P12 on Node 2; P4 and P8 on Node 3)]
Failover Performance
[Graph: failover latency]
Secondary Indexing
Espresso Secondary Indexing
• Local Secondary Index Requirements
• Read after write
• Consistent with primary data under failure
• Rich query support: match, prefix, range, text search
• Cost-to-serve proportional to working set
• Pluggable Index Implementations
• MySQL B-Tree
• Inverted index using Apache Lucene with MySQL backing store
• Inverted index using Prefix Index
• FastBit-based bitmap index
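One hypothetical shape for the pluggable-index contract implied by this list; the actual Espresso interface is not shown in the deck:

```java
import java.util.List;

// Illustrative only: a local, per-partition index kept consistent with primary writes.
interface SecondaryIndex {
    void onWrite(String resourceKey, String subKey, byte[] document); // in the write path
    void onDelete(String resourceKey, String subKey);
    List<String> query(String resourceKey, String query, int start, int count);
}
```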
Lucene-based implementation
• Requires the entire index to be memory-resident to support low-latency query response times
• For the Mailbox application, we have two options
Optimizations for Lucene-based implementation
• Concurrent transactions on the same Lucene index lead to inconsistency
• Need to acquire a lock
• Opening an index repeatedly is expensive
• Group commit to amortize the index-opening cost
[Diagram: Requests 1–5 queue behind the lock and are committed to the index in a single group write]
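A sketch of the group commit described above, using Lucene's IndexWriter: each writer queues its document, and whichever thread holds the lock drains the queue and pays the commit cost once for the whole batch. Error handling is omitted:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

final class GroupCommitIndexer {
    private final IndexWriter writer;
    private final BlockingQueue<Document> pending = new LinkedBlockingQueue<>();

    GroupCommitIndexer(IndexWriter writer) { this.writer = writer; }

    void write(Document doc) throws Exception {
        pending.put(doc);
        synchronized (writer) {                  // one committer at a time
            List<Document> batch = new ArrayList<>();
            pending.drainTo(batch);              // grab everything queued so far
            if (batch.isEmpty()) return;         // a previous lock holder committed us
            for (Document d : batch) writer.addDocument(d);
            writer.commit();                     // one commit amortized over the batch
        }
    }
}
```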
Optimizations for Lucene-based implementation
 High-value users of the site accumulate large mailboxes
– Query performance degrades with a large index
 Performance shouldn’t get worse with more usage!
 Time-Partitioned Indexes: Partition the index into buckets based on created time
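A sketch of the bucketing by creation time: each bucket is its own small index, so queries over recent mail only open recent buckets. The 90-day window is an illustrative choice, not a number from the deck:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class TimePartitionedIndex {
    private static final Duration BUCKET = Duration.ofDays(90);  // illustrative width

    /** Bucket id for a document, derived from its creation time. */
    static long bucketFor(Instant createdTime) {
        return createdTime.getEpochSecond() / BUCKET.getSeconds();
    }

    /** Newest-first bucket ids a query over [from, to] must consult. */
    static List<Long> bucketsFor(Instant from, Instant to) {
        List<Long> buckets = new ArrayList<>();
        for (long b = bucketFor(to); b >= bucketFor(from); b--) buckets.add(b);
        return buckets;
    }
}
```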
Espresso in Production
Espresso in Production
 Unified Social Content Platform – social activity aggregation
 High Read:Write ratio
Espresso in Production
 InMail - Allows members to communicate with each other
 Large storage footprint
 Low latency requirement for secondary index queries involving text search and relational predicates
Performance
 Average failover latency with 1024 partitions is around 300ms
 Primary data reads and writes
 For a single storage node on SSD
 Average row size = 1KB

Operation   Average Latency   Average Throughput
Reads       ~3ms              40,000 per second
Writes      ~6ms              20,000 per second
Performance
 Partition-key level secondary index using Lucene
 One index per mailbox use-case
 Base data on SAS disks, indexes on SSDs
 Average throughput per index = ~1000 per second (after the group-commit and partitioned-index optimizations)

Operation                                Average Latency
Queries (average of 5 indexed fields)    ~20ms
Writes (around 30 indexed fields)        ~20ms
Durability and Consistency
 Within a Data Center
 Across Data Centers
Durability and Consistency
 Within a Data Center
– Write latency vs Durability
 Asynchronous replication
– May lead to data loss
– Tooling can mitigate some of this
 Semi-synchronous replication
– Wait for at least one relay to acknowledge
– During failover, slaves wait for catchup
 Consistency over availability
 Helix selects slave with least replication lag to take over
mastership
 Failover time is ~300ms in practice
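The slave-selection rule above, sketched with replication lag modeled as the highest applied sequence number per replica; the types are illustrative, not Helix or Espresso code:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

record SlaveReplica(String node, long lastAppliedSeq) {}  // illustrative model

final class FailoverPolicy {
    /** Promote the slave that has applied the most of the replication stream. */
    static Optional<SlaveReplica> chooseNewMaster(List<SlaveReplica> slaves) {
        return slaves.stream().max(Comparator.comparingLong(SlaveReplica::lastAppliedSeq));
    }
}
```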
Durability and Consistency
 Across data centers
– Asynchronous replication
– Stale reads possible
– Active-active: Conflict resolution via last-writer-wins
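Last-writer-wins can be sketched as a per-document timestamp comparison; the tie-break on origin data center is our assumption, since the deck only names the policy:

```java
record ReplicatedWrite(byte[] doc, long timestampMillis, int originColo) {}  // illustrative

final class LastWriterWins {
    // Keep the write with the newer timestamp; break ties deterministically (assumed).
    static ReplicatedWrite resolve(ReplicatedWrite a, ReplicatedWrite b) {
        if (a.timestampMillis() != b.timestampMillis()) {
            return a.timestampMillis() > b.timestampMillis() ? a : b;
        }
        return a.originColo() >= b.originColo() ? a : b;
    }
}
```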
Lessons learned
Dealing with transient failures
Planned upgrades
Slave reads
Storage Devices
– SSDs vs SAS disks
Scaling Cluster Management
Future work
Coprocessors
– Synchronous, Asynchronous
Richer query processing
– Group-by, Aggregation
Key Takeaways
Espresso is a timeline-consistent, document-oriented distributed database
Feature rich: secondary indexing, transactions over related documents, seamless integration with the data ecosystem
In production since June 2012, serving several key use-cases
Questions?