LinkedIn has several data-driven products that improve the experience of its users, whether they are professionals or enterprises. Supporting these products is a large ecosystem of systems and processes that deliver data and insights to them in a timely manner.
This talk provides an overview of the main components of this ecosystem:
- Hadoop
- Teradata
- Kafka
- Databus
- Camus
- Lumos
etc.
The Big Data Analytics Ecosystem at LinkedIn
1. The Big Data Analytics Ecosystem at LinkedIn
Rajappa Iyer
September 17, 2013
2. Agenda
- LinkedIn by the numbers
- An Overview of Data-Driven Products / Insights
- The Big Data Analytics Ecosystem
  - Storage and Compute Platforms
  - Data Transport Pipelines
  - Data Processing Pipelines
  - Operational Tooling - Metadata
- Q&A
3. LinkedIn: The World's Largest Professional Network
- 238M+ Members Worldwide
- 2 new Members Per Second
- 100M+ Monthly Unique Visitors
- 3M+ Company Pages
Connecting Talent → Opportunity. At scale…
9. A Simplified Overview of Data Flow
[Architecture diagram] The site (member-facing products) emits activity data to Kafka and stores member data in Espresso / Voldemort / Oracle, whose changes are streamed through Databus. Camus and Lumos land this data in Hadoop, alongside external partner data brought in through ingest utilities. DWH ETL builds the core and derived data sets and loads Teradata for product, sciences, and enterprise analytics; computed results flow back to member-facing and enterprise products. (A small sketch of the Kafka activity-data leg follows.)
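To make the activity-data leg of this flow concrete, here is a minimal sketch of a member-facing product publishing an event to a Kafka topic that a Camus-style batch job would later copy into Hadoop. The broker address, topic name, and event fields are illustrative assumptions, not LinkedIn's actual schema or clients (which are Java/Avro based).

```python
# Minimal sketch only: publish a member-activity event to Kafka.
# Broker address, topic name, and event fields are assumptions for illustration.
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# A hypothetical page-view event emitted by a member-facing product.
event = {
    "member_id": 12345,
    "event_type": "page_view",
    "page": "/in/some-profile",
    "timestamp_ms": int(time.time() * 1000),
}

# Downstream, a Camus-style batch job would periodically pull this topic into HDFS.
producer.send("member-activity", value=event)
producer.flush()
```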
18. Operational Support - Metadata
- ETL pipeline is a complex graph of workflows
  - Our comprehensive dashboard production flow is nearly 30 levels deep with complex dependencies
- To manage this, we needed to capture:
  - Process dependencies
  - Data dependencies
  - Process execution history
  - Data load status
  - Data consumption status (watermarks)
19. Operational Metadata - v1
- Capture process dependency graph
  - Also capture useful metadata such as process owners
- Capture stats for each execution of a workflow
  - Time of execution
  - Status
  - Pointer to error logs
- Has proved quite useful for monitoring critical chains
[Diagram: Workflow F modeled as workunits W1-W5 between Start and Stop, linked by on-success and on-failure transitions. A minimal sketch of this v1 model follows.]
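As a rough illustration of the v1 model, the sketch below stores a process dependency graph and per-execution stats in a small relational store. All table and column names are assumptions for this example, not LinkedIn's actual schema.

```python
# Illustrative v1 operational-metadata model: a workflow dependency graph plus
# per-execution stats. Table and column names are assumptions, not LinkedIn's schema.
import sqlite3

conn = sqlite3.connect("ops_metadata.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS workflow (
    workflow_id   INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    owner         TEXT                       -- useful metadata such as process owners
);
CREATE TABLE IF NOT EXISTS workflow_dependency (
    upstream_id   INTEGER REFERENCES workflow(workflow_id),
    downstream_id INTEGER REFERENCES workflow(workflow_id),
    edge_type     TEXT CHECK (edge_type IN ('on_success', 'on_failure'))
);
CREATE TABLE IF NOT EXISTS workflow_execution (
    execution_id  INTEGER PRIMARY KEY,
    workflow_id   INTEGER REFERENCES workflow(workflow_id),
    started_at    TEXT,                       -- time of execution
    status        TEXT,                       -- e.g. SUCCEEDED / FAILED
    error_log_uri TEXT                        -- pointer to error logs
);
""")
conn.commit()
```

Monitoring a critical chain then amounts to walking workflow_dependency and checking the latest workflow_execution status for each node.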
20. Operational Metadata - v2
[Diagram: Workflow F consumes data entities D1 and D2 and produces data entity D3.]
- For each flow, capture input and output data elements
- For each execution, capture stats on each data element, e.g.
  - Number of records / lines read
  - Number of records / lines written
  - Error counts
  - Last processed record
- Can be time based or sequence based
- This can be per flow, as more than one flow can consume a data element (extended in the sketch below)
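Continuing the same illustrative schema, the v2 additions below capture data entities, the consumes/produces edges for each flow, and per-execution stats on each data element, including a watermark. Again, all names are assumptions for this sketch.

```python
# Illustrative v2 additions to the sketch above: data entities, consumes/produces
# edges, and per-execution, per-data-element stats. Names are assumptions only.
import sqlite3

conn = sqlite3.connect("ops_metadata.db")  # same store as the v1 sketch
conn.executescript("""
CREATE TABLE IF NOT EXISTS data_entity (
    entity_id       INTEGER PRIMARY KEY,
    name            TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS workflow_data_edge (
    workflow_id     INTEGER REFERENCES workflow(workflow_id),
    entity_id       INTEGER REFERENCES data_entity(entity_id),
    direction       TEXT CHECK (direction IN ('consumes', 'produces'))
);
CREATE TABLE IF NOT EXISTS execution_data_stats (
    execution_id    INTEGER REFERENCES workflow_execution(execution_id),
    entity_id       INTEGER REFERENCES data_entity(entity_id),
    records_read    INTEGER,
    records_written INTEGER,
    error_count     INTEGER,
    last_processed  TEXT    -- watermark: time based or sequence based, tracked per flow
);
""")
conn.commit()
```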
21. Operational Metadata - The Payoff
- Restartable ETL jobs
  - Process only the new data since the last successful run
- Catch-up mode for ETL jobs
  - A single run can consume data from multiple intervals in one batch
  - The next run will resume from the correct place (see the watermark sketch below)
- Data freshness and availability dashboard
- Coarse form of data lineage
  - Impact analysis for the unfortunately all-too-common upstream changes
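A hedged sketch of how the recorded watermarks could drive restartable, catch-up-capable ETL runs: read the last successful watermark, process every interval since, and return the new watermark so the next run resumes from the correct place. Function and table names follow the illustrative schema above, not LinkedIn's actual tooling.

```python
# Illustrative restartable / catch-up ETL driver built on the watermark columns
# sketched above. All names are assumptions, not LinkedIn's actual tooling.
import sqlite3

def last_watermark(conn, workflow_id, entity_id):
    """Return the highest watermark recorded by a successful run of this flow."""
    row = conn.execute(
        """SELECT MAX(s.last_processed)
             FROM execution_data_stats s
             JOIN workflow_execution e ON e.execution_id = s.execution_id
            WHERE e.workflow_id = ? AND s.entity_id = ? AND e.status = 'SUCCEEDED'""",
        (workflow_id, entity_id),
    ).fetchone()
    return row[0]  # None on the very first run

def run_etl(conn, workflow_id, entity_id, list_intervals, process_interval):
    """Catch-up mode: consume every unprocessed interval in one batch."""
    watermark = last_watermark(conn, workflow_id, entity_id)
    for interval in list_intervals(since=watermark):   # e.g. hourly partitions
        process_interval(interval)
        watermark = interval                            # advance only after success
    # Caller records the returned watermark in execution_data_stats so the
    # next run resumes from the correct place.
    return watermark
```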
23. `whoami`
- Sr. Manager / DWH Architect @ LinkedIn since 2011
- Prior to that:
  - Director of Engineering at Digg
  - Enterprise Data Architect at eBay
- www.linkedin.com/in/rajappaiyer/