Bigger Faster Easier: LinkedIn Hadoop Summit 2015


We discuss LinkedIn's big data ecosystem and its evolution through the years. We introduce three open source projects, Gobblin for ingestion, Cubert for computation and Pinot for fast OLAP serving. We also showcase our in-house data discovery and lineage portal WhereHows.

Published in: Data & Analytics

  1. The Data Driven Network. Kapil Surlaker, Director of Engineering. Powering the Data Driven Network. Kapil Surlaker and Shirshanka Das, Hadoop Summit 2015
  2. [no text]
  3. How does PYMK work?
  4. Houston, we have a problem
  5. Step 1: Central transport pipeline
  6. Still have a problem
  7. Hadoop Ingest Pipeline Complexity
  8. Step 2: Central Ingestion Framework
  9. Requirements: Source Diversity; Batch and Streaming; Data Quality
  10. Gobblin Architecture
  11. Diagram: Source; three Work Units, each run as a Task (Extract, Convert, Quality, Write); Data Publish
  12. Taming Source Diversity: REST, SFTP, JDBC; Protocol, Config, Source, Extractor, checkpoint
  13. Solving for real-time: Inefficiencies in batch; YARN based; Apache Helix; Continuous; Auto-scaling. Diagram: YARN, Helix, Executor 1, Executor 2, Executor 3, HDFS, Stream Source
  14. Data Quality: Per record, per task, or per job; Composable quality checkers; Schema compatibility; Audit check; Sensitive fields; Unique key; Policy driven. Diagram: Record, Writer, Job, Task, Quality Checker, Fail, Quarantine, Policy Checker
  15. Current Activity: Open source @ github.com/linkedin/gobblin; In production @ LinkedIn; Tens of TB per day; Hundreds of datasets; ~20 different sources; Gobblin on YARN
  16. Transformation: No one size fits all
  17. Cubert: Converting hours to minutes. http://github.com/linkedin/cubert. Physical language; Block organization; Specialized operators
  18. Got Diversity?
  19. Where is the billings data? How did it get here? What data is used to create inferred skills data? Who owns that flow? When will the latest profile data show up?
  20. [no text]
  21. Where is my data? How did it get here? …. WhereHows
  22. WhereHows architecture
  23. [no text]
  24. [no text]
  25. [no text]
  26. Lineage
  27. WhereHows: Roadmap. Streaming ecosystem integration (Kafka, Samza); Recommendations for Datasets, Metrics; Exploring Open Source
  28. Real-time. Interactive.
  29. Slice and Dice metrics
  30. Precompute! Device Geo View: Android US 1; Android IN 1; iOS US 1. Dimension View: Android 2; iOS 1; US 2; IN 1; Android,US 1; iOS,US 1; Android,IN 1
  31. More dimensions! Device Geo Carrier View: Android US ATT 1; Android IN Reliance 1; iOS US Verizon 1. Dimension View: Android 2; iOS 1; US 2; IN 1; ATT 1; Reliance 1; Verizon 1; Android,US 1; ... ...
  32. Challenges: Horizontally scalable; Low latency; Data freshness; Fault tolerance; OLAP features
  33. Introducing Pinot
  34. Key features: SQL-like interface; Columnar storage and indexing; Real-time data load
  35. (S)QL: Filters and Aggs. SELECT count(*) FROM companyFollowHistoricalEvents WHERE entityId = 121011 AND 'day' >= 15949 AND 'day' <= 15963 AND paid = 'y' AND action = 'stop'
  36. (S)QL: Group By. SELECT count(*) FROM companyFollowHistoricalEvents WHERE entityId = 121011 AND 'day' >= 15949 AND 'day' <= 15963 AND paid = 'y' GROUP BY action
  37. (S)QL: ORDER BY and LIMIT. SELECT * FROM companyFollowHistoricalEvents WHERE entityId = 121011 AND entityId = 1000 AND action = 'start' ORDER BY creationTime DESC LIMIT 1
  38. Columnar Storage
  39. Forward Index
  40. Pinot Architecture. Diagram: Queries, Broker, Helix, Real time, Historical, Kafka, Hadoop, Samza, Raw Data
  41. Fast but needs a ton of RAM
  42. To pre-compute or not?
  43. Data aware pre-computation
  44. Pinot@LinkedIn: Site-facing apps; Reporting dashboards; Monitoring
  45. Breaking the cycle: Form hypothesis; Query; Repeat; OR …
  46. Hmm... what's up with Portuguese- and Spanish-speaking countries?
  47. Brazil?
  48. [no text]
  49. Holidays in Brazil 2015
  50. Pinot Roadmap. Pinot is Open Source!!! github.com/linkedin/pinot
  51. Kapil Surlaker (@kapilsurlaker) and Shirshanka Das (@shirshanka). github.com/linkedin/: gobblin, cubert, pinot. Thanks!
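
Slides 11 and 14 describe a task that extracts records from a work unit, converts them, runs quality checks, and writes the records that pass, while failing records can be quarantined. The following is a minimal Python sketch of that flow, not Gobblin's actual API (Gobblin is a Java project). Every name here (`run_task`, `TaskResult`, `make_unique_key_checker`, `has_required_fields`, `trim_values`) and the sample records are hypothetical, and only the per-record quarantine policy is shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Checker = Callable[[Record], bool]


@dataclass
class TaskResult:
    """Outcome of one task: records written and records quarantined."""
    written: List[Record] = field(default_factory=list)
    quarantined: List[Record] = field(default_factory=list)


def has_required_fields(record: Record) -> bool:
    """Schema-style check: the record must carry both expected fields."""
    return {"id", "name"} <= record.keys()


def make_unique_key_checker(key: str) -> Checker:
    """Build a stateful checker that rejects a key value seen before."""
    seen = set()

    def check(record: Record) -> bool:
        value = record.get(key)
        if value is None or value in seen:
            return False
        seen.add(value)
        return True

    return check


def run_task(work_unit: Iterable[Record],
             convert: Callable[[Record], Record],
             checkers: List[Checker]) -> TaskResult:
    """Extract -> convert -> quality check -> write, one record at a time."""
    result = TaskResult()
    for raw in work_unit:                                # extract
        record = convert(raw)                            # convert
        if all(check(record) for check in checkers):     # quality (composable)
            result.written.append(record)                # write
        else:
            result.quarantined.append(record)            # quarantine policy
    return result


def trim_values(raw: Record) -> Record:
    """Hypothetical converter: strip whitespace from every value."""
    return {k: v.strip() for k, v in raw.items()}


# Hypothetical work unit: one good record, one duplicate key, one missing key.
work_unit = [
    {"id": "1", "name": " Ada "},
    {"id": "1", "name": "Grace"},
    {"name": "Linus"},
]
result = run_task(work_unit, trim_values,
                  [has_required_fields, make_unique_key_checker("id")])
print(result.written)
print(len(result.quarantined))
```

Because the checkers are plain callables in a list, new checks can be added without changing `run_task`, which is one way to read "composable quality checkers" on slide 14.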
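
Slides 30 and 31 show a raw view being rolled up into a "Dimension View" with one count per combination of dimension values. A small Python sketch of that pre-computation follows, using the rows from the slides. The function name `precompute` and the dictionary layout are assumptions made for illustration and say nothing about how Pinot stores or computes its data.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Key = Tuple[Tuple[str, str], ...]


def precompute(rows: List[Dict[str, object]], dims: Sequence[str]) -> Counter:
    """Sum the count of every non-empty combination of dimension values."""
    view: Counter = Counter()
    for row in rows:
        for size in range(1, len(dims) + 1):
            for combo in combinations(dims, size):
                # Key on (dimension, value) pairs so equal values in
                # different dimensions cannot collide.
                key: Key = tuple((d, str(row[d])) for d in combo)
                view[key] += int(row["count"])
    return view


# Device Geo View from slide 30.
geo_rows = [
    {"device": "Android", "geo": "US", "count": 1},
    {"device": "Android", "geo": "IN", "count": 1},
    {"device": "iOS", "geo": "US", "count": 1},
]
geo_view = precompute(geo_rows, ["device", "geo"])

# Device Geo Carrier View from slide 31.
carrier_rows = [
    {"device": "Android", "geo": "US", "carrier": "ATT", "count": 1},
    {"device": "Android", "geo": "IN", "carrier": "Reliance", "count": 1},
    {"device": "iOS", "geo": "US", "carrier": "Verizon", "count": 1},
]
carrier_view = precompute(carrier_rows, ["device", "geo", "carrier"])

print(geo_view[(("device", "Android"),)])   # Android total
print(len(geo_view), len(carrier_view))     # entries before and after adding carrier
```

The same three input rows yield 7 precomputed entries with two dimensions and 19 with three, which illustrates why slide 42 asks whether to pre-compute at all.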
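
The filter and GROUP BY query from slide 36 can be tried out on a toy table. The sketch below uses Python's built-in `sqlite3` module as a stand-in, not Pinot, so it only illustrates the query's semantics. Two assumptions are worth flagging: the sample rows are invented, and the `day` column is written with double quotes because SQLite would read the single-quoted `'day'` on the slide as a string literal. The `action` column is also added to the SELECT list so each count can be matched to its group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companyFollowHistoricalEvents ("
    'entityId INTEGER, "day" INTEGER, paid TEXT, action TEXT)'
)

# Invented sample rows: (entityId, day, paid, action).
rows = [
    (121011, 15950, "y", "start"),
    (121011, 15951, "y", "stop"),
    (121011, 15960, "y", "start"),
    (121011, 15970, "y", "start"),   # outside the day range
    (121011, 15955, "n", "start"),   # not paid
    (999, 15950, "y", "start"),      # different entity
]
conn.executemany(
    "INSERT INTO companyFollowHistoricalEvents VALUES (?, ?, ?, ?)", rows
)

query = (
    "SELECT action, count(*) "
    "FROM companyFollowHistoricalEvents "
    'WHERE entityId = 121011 AND "day" >= 15949 AND "day" <= 15963 '
    "AND paid = 'y' "
    "GROUP BY action"
)
counts = dict(conn.execute(query).fetchall())
print(counts)
conn.close()
```

Only three of the six rows satisfy all filters, so the grouped result has two "start" events and one "stop" event.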
