Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecosystem at LinkedIn
1. The Data Driven Network
Kapil Surlaker
Director of Engineering
Powering the Data Driven Network
Kapil Surlaker and Shirshanka Das
Hadoop Summit 2015
16. Solving for real-time
Inefficiencies in batch
YARN based
Apache Helix
Continuous
Auto-scaling
[Diagram: YARN, Helix, Executor 1, Executor 2, Executor 3, HDFS, Stream Source]
17. Data Quality
Per record, per task, or per job
Composable quality checkers
Schema compatibility
Audit check
Sensitive fields
Unique key
Policy driven
[Diagram: Record, Task, Job, Writer, Quality Checker, Policy Checker, Fail, Quarantine]
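The composable, policy-driven checks listed on this slide can be sketched roughly as follows. This is a minimal illustration in Python, not Gobblin's actual API: the names `Policy`, `has_fields`, `no_sensitive_fields`, `unique_key`, and `run_record_checks` are hypothetical, and only the per-record level is shown. Each checker is a small function, and a policy decides whether a failed check aborts the task or quarantines the record.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, object]
Checker = Callable[[Record], bool]  # returns True when the record passes


def has_fields(*names: str) -> Checker:
    """Schema-style check: every named field must be present."""
    return lambda rec: all(name in rec for name in names)


def no_sensitive_fields(*names: str) -> Checker:
    """Sensitive-field check: none of the named fields may be present."""
    return lambda rec: not any(name in rec for name in names)


def unique_key(key: str) -> Checker:
    """Unique-key check: remembers the key values it has already seen."""
    seen = set()

    def check(rec: Record) -> bool:
        value = rec.get(key)
        if value in seen:
            return False
        seen.add(value)
        return True

    return check


@dataclass
class Policy:
    """Hypothetical policy: pairs a checker with the action taken on failure."""
    checker: Checker
    on_failure: str  # "fail" aborts the run, "quarantine" sets the record aside


class QualityError(Exception):
    pass


def run_record_checks(
    records: Iterable[Record], policies: List[Policy]
) -> Tuple[List[Record], List[Record]]:
    """Apply every policy to every record; return (written, quarantined)."""
    written, quarantined = [], []
    for rec in records:
        passed = True
        for policy in policies:
            if not policy.checker(rec):
                if policy.on_failure == "fail":
                    raise QualityError(f"record failed a mandatory check: {rec}")
                passed = False
                break  # quarantine on the first failed check
        (written if passed else quarantined).append(rec)
    return written, quarantined
```

Because checkers are plain functions, new checks compose by appending another `Policy` to the list; task-level or job-level checks could follow the same pattern over aggregates instead of single records.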
18. Current Activity
Open source @ github.com/linkedin/gobblin
In production @ LinkedIn
Tens of TB per day
Hundreds of datasets
~20 different sources
Gobblin on YARN
24. Where is the billings data?
How did it get here?
What data is used to create inferred skills data?
Who owns that flow?
When will the latest profile data show up?
40. More dimensions!
Device   Geo  Carrier   View
Android  US   ATT       1
Android  IN   Reliance  1
iOS      US   Verizon   1

Dimension   View
Android     2
iOS         1
US          2
IN          1
ATT         1
Reliance    1
Verizon     1
Android,US  1
...         ...
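To make the growth in dimension combinations concrete, the roll-up from the raw rows to the per-dimension view counts can be sketched as below. This is a minimal illustration, assuming the second table enumerates every non-empty combination of dimensions (the slide elides most rows with "..."); the function name `rollup` is hypothetical and this is not how any particular system computes it.

```python
from collections import Counter
from itertools import combinations

# The three raw rows shown on the slide.
rows = [
    {"Device": "Android", "Geo": "US", "Carrier": "ATT", "View": 1},
    {"Device": "Android", "Geo": "IN", "Carrier": "Reliance", "View": 1},
    {"Device": "iOS", "Geo": "US", "Carrier": "Verizon", "View": 1},
]


def rollup(rows, dimensions, metric):
    """Sum the metric for every non-empty combination of dimension values."""
    totals = Counter()
    for row in rows:
        for size in range(1, len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                # Key on (dimension, value) pairs so equal values in
                # different dimensions cannot collide.
                key = tuple((d, row[d]) for d in dims)
                totals[key] += row[metric]
    return totals


totals = rollup(rows, ["Device", "Geo", "Carrier"], "View")
for key, views in totals.items():
    print(",".join(value for _, value in key), views)
```

Each row contributes to 2^d - 1 keys for d dimensions, so three rows with three dimensions already produce 19 distinct keys here. That is the blow-up the slide title points at.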
44. (S)QL: Filters and Aggs
SELECT count(*)
FROM companyFollowHistoricalEvents
WHERE entityId = 121011 AND
'day' >= 15949 AND 'day' <= 15963 AND
paid = 'y' AND
action = 'stop'
45. (S)QL: Group By
SELECT count(*)
FROM companyFollowHistoricalEvents
WHERE entityId = 121011 AND
'day' >= 15949 AND 'day' <= 15963 AND
paid = 'y'
GROUP BY action
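The filter-and-aggregate and group-by query shapes above can be tried locally with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in engine, not the system presented in the talk. The sample rows are made up for illustration, the table carries only the four columns the queries touch, and `day` is written unquoted because SQLite would treat `'day'` as a string literal. The group-by query also selects `action` so that each count is labeled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE companyFollowHistoricalEvents (
           entityId INTEGER, day INTEGER, paid TEXT, action TEXT)"""
)

# Hypothetical sample rows, invented for this sketch.
sample = [
    (121011, 15949, "y", "stop"),
    (121011, 15950, "y", "start"),
    (121011, 15963, "y", "stop"),
    (121011, 15964, "y", "stop"),  # outside the day range
    (121011, 15955, "n", "stop"),  # not paid
    (999, 15955, "y", "stop"),     # different entity
]
conn.executemany(
    "INSERT INTO companyFollowHistoricalEvents VALUES (?, ?, ?, ?)", sample
)

# Shared predicate from both slides.
FILTER = "entityId = 121011 AND day >= 15949 AND day <= 15963 AND paid = 'y'"

# Filters and aggregation: count matching 'stop' events.
stop_count = conn.execute(
    f"SELECT count(*) FROM companyFollowHistoricalEvents "
    f"WHERE {FILTER} AND action = 'stop'"
).fetchone()[0]

# Group by: one count per action value.
by_action = dict(
    conn.execute(
        f"SELECT action, count(*) FROM companyFollowHistoricalEvents "
        f"WHERE {FILTER} GROUP BY action"
    ).fetchall()
)

print(stop_count)
print(by_action)
```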
46. (S)QL: ORDER BY and LIMIT
SELECT *
FROM companyFollowHistoricalEvents
WHERE entityId = 121011 AND
entityId = 1000 AND
action = 'start'
ORDER BY creationTime DESC LIMIT 1