Big Data Platform at Pinterest

Presented by Mao Ye at the AWS Loft meetup on Thursday, September 17th, 2015

Published in: Data & Analytics

  1. Big Data Platform at Pinterest, Mao Ye (Confidential)
  2. Data Architecture Design Choices for Hadoop Platform Pinball for Workflow Management
  3. Data Architecture
  4. Data at Pinterest • 60 Billion Pins • 1 Billion boards • 100M MAU • 60 PB of data on S3 • 3 PB processed every day • 2000 node Hadoop cluster • 250 engineers
  5. Pinterest Data Architecture App
  6. Pinterest Data Architecture App events Kafka Secor Singer
  7. Pinterest Data Architecture App events Kafka Secor Singer
  8. Pinterest Data Architecture App events Kafka Secor Skyline Pinball Redshift Pinalytics Features Qubole (Hadoop) Singer
  9. Design Choices for Hadoop Platform
  10. Hadoop Platform Requirements • Isolated multi-tenancy • Elasticity • Support multiple clusters / • Ephemeral clusters • Access control layer • Shared data store • Easy deployment
  11. Decoupling compute & storage Hadoop Cluster 1 Transient HDFS Hadoop Cluster 2 Transient HDFS S3 Persistent Store
  12. Centralized Hive Metastore Hive Metastore Pig Cascading Hive HDFS/S3 Data Metadata
  13. Multi-layered Packaging Mapreduce Jobs Hadoop Jars/Libs Job/User level Configs Software Packages/Libs Configs (OS/Hadoop) Misc Sys Admin OS Bootstrap Script Core SW Runtime Staging (on S3) Automated Configuration (Masterless Puppet) Baked AMI
  14. Executor Abstraction Layer Hive Metastore HDFS/S3 Qubole Managed Hadoop EMR Executor Pinball Dev Server
  15. Why Qubole? • Hadoop & Spark as managed services • Tight integration with Hive • Graceful cluster scaling / • API for simplified executor abstraction • Advanced support for spot instances • Baked AMI customization
  16. Pinball for Workflow Management
  17. ● Scale: o 60 Billion Pins o Hundreds of workflows o Thousands of jobs o 500+ jobs in a workflow o 3 petabytes processed daily ● Support: o Hadoop, Cascading, Hive, Spark … Scale of Processing job workflow
  18. Why Pinball? ● Requirements o Simple abstractions o Extensible in future o Reliable stateless computing o Easy to debug o Scales horizontally o Can be upgraded w/o aborting workflows o Rich features like auto-retries, per-job emails, overrun policies… ● Options o Apache Oozie, Azkaban, Luigi
  19. Pinball Design Master Worker Scheduler Command Line Clients UI
  20. Workflow Model ● Workflow o A directed graph of nodes called jobs ● Edge o Run after dependence ● Node o Job is a node
  21. Job State ● Job state is captured in a token ● Tokens are named hierarchically Master Job Token version: 123 name: /workflow/w1/job owner: worker_0 expiration: 1234567 data: JobTemplate(....)
  22. Job State Machine RUNNABLE RUNNING WAITING
  23. Master Worker Interaction ● Master keeps the state ● Workers claim and execute tasks ● Horizontally scalable Worker Master Persistent Store 1: request 2: update 3: ack
  24. Master ● Entire state is kept in memory ● Each state update is synchronously persisted before master replies to client ● Master runs on a single thread (no concurrency issues)
  25. Worker
  26. Open Source Git repo: https://github.com/pinterest/pinball Mailing list: https://groups.google.com/forum/#!forum/pinball-users
  27. Thank You
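
The workflow model on slides 20 and 22 (a directed graph of jobs joined by run-after edges, with the states WAITING, RUNNABLE and RUNNING) can be illustrated with a small Python sketch. This is not Pinball's code or API: the `JobState` enum, the `run_workflow` function and the example job names are hypothetical, and the scheduling loop is a plain dependency walk chosen for brevity.

```python
from enum import Enum


class JobState(Enum):
    # State names taken from the "Job State Machine" slide.
    WAITING = "WAITING"
    RUNNABLE = "RUNNABLE"
    RUNNING = "RUNNING"


def run_workflow(run_after):
    """Execute a workflow given as {job: set of jobs it must run after}.

    Returns the order in which jobs ran. Hypothetical helper, not Pinball API.
    """
    state = {job: JobState.WAITING for job in run_after}
    done = set()
    order = []
    while len(done) < len(run_after):
        progressed = False
        for job in sorted(run_after):
            if job in done:
                continue
            # WAITING -> RUNNABLE once every upstream job has finished.
            if state[job] is JobState.WAITING and run_after[job] <= done:
                state[job] = JobState.RUNNABLE
            # RUNNABLE -> RUNNING: a real worker would execute the job here.
            if state[job] is JobState.RUNNABLE:
                state[job] = JobState.RUNNING
                order.append(job)
                done.add(job)
                progressed = True
        if not progressed:
            raise ValueError("cycle or unknown dependency in workflow")
    return order


# Example workflow with made-up job names.
example = {
    "extract": set(),
    "transform": {"extract"},
    "report": {"extract"},
    "load": {"transform"},
}
print(run_workflow(example))
```

A job with no upstream dependencies becomes runnable immediately. A cycle leaves every remaining job in WAITING, which the sketch reports as an error rather than looping forever.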

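Slides 21, 23 and 24 describe job state held in hierarchically named tokens, a master that keeps all state in memory, and a request / update / ack exchange in which each update is persisted before the master replies. The sketch below illustrates that flow under stated assumptions: the `Token` fields mirror the slide (`version`, `name`, `owner`, `expiration`, `data`), but the `Master` class, its `add` and `claim` methods, the lease rule and the JSON-lines journal file are hypothetical stand-ins, not Pinball's actual interfaces or storage.

```python
import json
import os
import time
from dataclasses import asdict, dataclass, replace
from typing import Dict, Optional


@dataclass
class Token:
    name: str                     # hierarchical, e.g. "/workflow/w1/job"
    data: str                     # opaque payload, e.g. a serialized job template
    version: int = 0              # assumption: bumped on every update
    owner: Optional[str] = None   # worker currently holding the token
    expiration: float = 0.0       # assumption: ownership lapses at this time


class Master:
    """Toy single-threaded master: state in memory, persisted before replying."""

    def __init__(self, journal_path: str) -> None:
        self.journal_path = journal_path
        self.tokens: Dict[str, Token] = {}

    def _persist(self, token: Token) -> None:
        # Append the update and force it to disk before the caller gets a reply.
        with open(self.journal_path, "a", encoding="utf-8") as journal:
            journal.write(json.dumps(asdict(token)) + "\n")
            journal.flush()
            os.fsync(journal.fileno())

    def add(self, name: str, data: str) -> Token:
        token = Token(name=name, data=data)
        self._persist(token)
        self.tokens[name] = token
        return replace(token)  # hand back a copy, not the in-memory object

    def claim(self, prefix: str, worker: str, lease_seconds: float,
              now: Optional[float] = None) -> Optional[Token]:
        """Step 1 (request): a worker asks for an unowned token under prefix."""
        now = time.time() if now is None else now
        for name in sorted(self.tokens):
            token = self.tokens[name]
            free = token.owner is None or token.expiration <= now
            if name.startswith(prefix) and free:
                updated = replace(token, owner=worker,
                                  expiration=now + lease_seconds,
                                  version=token.version + 1)
                self._persist(updated)       # step 2 (update)
                self.tokens[name] = updated
                return replace(updated)      # step 3 (ack)
        return None
```

Because the master in this sketch handles one call at a time, a claim needs no locking, which matches the single-thread point on slide 24. The expiration-based lease is one way to let another worker pick up a token whose owner has disappeared.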