50 Billion Pins and Counting - Building data driven products at Pinterest

  1. Using Hadoop to build data-driven products: 50 billion Pins and counting (Krishna Gade)
  2. What is Pinterest? A visual bookmarking tool: discover an inspiring idea, save it to a board, go do it.
  3. Who am I? Krishna Gade • Data engineering at Pinterest • Search and data platforms at Twitter and Bing • Follow @krishnagade
  4. Pinterest is a data product
  5. Why do we care about data? How is Hadoop helping us harness the power of data? What are some of the tools we built on top of the Hadoop platform?
  6. Why do we care about data?
  7. 3.375
  8. 5’10”
  9. < uncertainty
  10. > odds of making the best decisions
  11. "It is a capital mistake to theorize before one has data." - Sherlock Holmes
  12. How is Hadoop helping us harness the power of data?
  13. Data at Pinterest • 50 billion Pins • 1 billion boards • 40 PB of data on S3 • 3 PB processed every day • 2,000-node Hadoop cluster • 200 engineers
  14. Pinterest Data Architecture: App
  15. Pinterest Data Architecture: App events flow through Singer into Kafka and are persisted by Secor
  16. Pinterest Data Architecture: App events flow through Singer into Kafka and are persisted by Secor
  17. Pinterest Data Architecture: App events flow through Singer into Kafka, are persisted by Secor, and are processed on Qubole (Hadoop) under Pinball, feeding Skyline, Redshift, Pinalytics, and Features
  18. Hadoop Platform Requirements • Ephemeral clusters • Access control layer • Shared data store • Easy deployment • Isolated multi-tenancy • Elasticity • Support for multiple clusters
  19. Design Choices
  20. Decoupling compute & storage: Hadoop Cluster 1 and Hadoop Cluster 2 each run only transient HDFS, while S3 is the persistent store
  21. Centralized Hive Metastore: Pig, Cascading, and Hive share a single Hive Metastore for metadata, with the data itself on HDFS/S3
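
      To make slides 20-21 concrete, here is a minimal sketch of registering an S3-backed table in the shared Hive Metastore so that any transient cluster can query the same data. The HiveServer2 host, bucket, table name, and schema are made-up placeholders, and PyHive is just one client library that could be used here.

      # Sketch only: register an S3-backed table in the central Hive Metastore.
      # Host, bucket, table name, and columns are hypothetical placeholders.
      from pyhive import hive

      DDL = """
      CREATE EXTERNAL TABLE IF NOT EXISTS pin_events (
          pin_id BIGINT,
          user_id BIGINT,
          event_type STRING
      )
      PARTITIONED BY (dt STRING)
      LOCATION 's3://example-bucket/warehouse/pin_events/'
      """

      conn = hive.connect(host='hiveserver2.example.com', port=10000)
      cursor = conn.cursor()
      # Metadata lands in the shared metastore; the data itself stays on S3,
      # so any transient cluster pointing at the same metastore can query it.
      cursor.execute(DDL)
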
  22. Multi-layered Packaging: a baked AMI, automated configuration via masterless Puppet, and runtime staging on S3 together cover the OS, core software, misc sysadmin setup, software packages/libs, OS/Hadoop configs, the bootstrap script, Hadoop jars/libs, job/user-level configs, and the MapReduce jobs themselves
  23. Executor Abstraction Layer: Pinball and the dev server submit work through a common executor layer (Qubole-managed Hadoop, EMR), backed by the shared Hive Metastore and HDFS/S3
  24. Why Qubole? • Hadoop & Spark as managed services • Tight integration with Hive • Graceful cluster scaling • API for simplified executor abstraction • Advanced support for spot instances • Baked AMI customization
  25. Scale of Processing • 50 billion Pins • Hundreds of workflows • Thousands of jobs • 500+ jobs in a workflow • 3 petabytes processed daily • Support for Hadoop, Cascading, Hive, Spark …
  26. Pinball
  27. Why Pinball? • Requirements: simple abstractions, extensible in the future, reliable stateless computing, easy to debug, scales horizontally, can be upgraded without aborting workflows, rich features like auto-retries, per-job emails, overrun policies… • Options considered: Apache Oozie, Azkaban, Luigi
  28. Pinball Design
  29. Workflow Model • Workflow: a directed graph of nodes called jobs • Node: each job is a node • Edge: a run-after dependency
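
      As an illustration of this model, the sketch below represents a workflow as a directed graph of jobs with run-after edges and derives a valid execution order. The job names and the dict-based format are invented for illustration; this is not Pinball's actual workflow configuration syntax.

      # Illustrative only: a workflow as a directed graph of jobs with
      # run-after edges (job -> the jobs it must run after).
      from collections import deque

      deps = {
          'extract': [],
          'clean':   ['extract'],
          'join':    ['extract'],
          'report':  ['clean', 'join'],
      }

      def execution_order(deps):
          """Topologically sort jobs so every job runs after its parents."""
          remaining = {job: set(parents) for job, parents in deps.items()}
          ready = deque(job for job, parents in remaining.items() if not parents)
          order = []
          while ready:
              job = ready.popleft()
              order.append(job)
              for child, parents in remaining.items():
                  if job in parents:
                      parents.remove(job)
                      if not parents and child not in order and child not in ready:
                          ready.append(child)
          if len(order) != len(deps):
              raise ValueError('workflow graph has a cycle')
          return order

      print(execution_order(deps))  # ['extract', 'clean', 'join', 'report'] (one valid order)
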
  30. Job State • Job state is captured in a token • Tokens are named hierarchically • Example job token held by the master: version: 123, name: /workflow/w1/job, owner: worker_0, expiration: 1234567, data: JobTemplate(....)
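
      The token above can be pictured as a small record. The sketch below mirrors the fields shown on the slide, but it is a simplified stand-in; Pinball's real token type lives in the open-source repo and carries more machinery.

      # Simplified stand-in for the job token on slide 30 (not Pinball's real class).
      import time
      from dataclasses import dataclass, field

      @dataclass
      class Token:
          version: int                 # bumped on every state change
          name: str                    # hierarchical, e.g. /workflow/w1/job
          owner: str = ''              # worker currently holding the claim
          expiration: float = 0.0      # unix time when the claim lapses
          data: dict = field(default_factory=dict)  # serialized job template / state

      job_token = Token(
          version=123,
          name='/workflow/w1/job',
          owner='worker_0',
          expiration=time.time() + 600,
          data={'template': 'JobTemplate(....)'},
      )
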
  31. Job State Machine
  32. Master-Worker Interaction • Master keeps the state • Workers claim and execute tasks • Horizontally scalable • Flow: (1) the worker sends a request to the master, (2) the master writes the update to the persistent store, (3) the ack flows back to the worker
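
      A rough sketch of the claim-execute-report loop this implies is below. The master-side methods (claim_token, update_token) and the execute_job stub are hypothetical names used only to show the request/update/ack pattern; per slide 33, the master persists each update before acknowledging.

      # Hypothetical worker loop illustrating the request/update/ack pattern;
      # claim_token/update_token/execute_job are invented names, not Pinball's API.
      import time

      def execute_job(job_data):
          """Placeholder: run the command or query described by the token."""
          return 'ok'

      def run_worker(master, worker_id):
          while True:
              token = master.claim_token(owner=worker_id)   # (1) request a runnable job
              if token is None:
                  time.sleep(10)                            # nothing claimable right now
                  continue
              try:
                  token.data['result'] = execute_job(token.data)
                  token.data['status'] = 'succeeded'
              except Exception as exc:
                  token.data['status'] = 'failed'
                  token.data['error'] = str(exc)
              # (2) send the update; the master persists it synchronously
              # and only then (3) acknowledges back to this worker.
              master.update_token(token)
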
  33. Master • Entire state is kept in memory • Each state update is synchronously persisted before the master replies to the client • The master runs on a single thread, so there are no concurrency issues
  34. Worker
  35. Open Source • Git repo: https://github.com/pinterest/pinball • Mailing list: https://groups.google.com/forum/#!forum/pinball-users
  36. Data-Driven Products
  37. Guided Search
  38. Related Pins
  39. What are some of the tools we built on top of the Hadoop platform?
  40. Pinalytics: a scalable data analytics engine
  41. Architecture: main components • (1) Backend: Thrift services and HBase databases • (2) Webapp: rich UI components • (3) Reporter: generates formatted data • (4) Metrics: customized optimizations
  42. User Interface • Visualizations: Highcharts, time series updated automatically daily • Customizability: dashboards, built-in or user-defined reports
  43. User Interface • Pinomaly: anomalous-metric tracking, email alerts • Reporting: formatted dashboards, PDF printing, duplicated weekly • Metric manipulation: Metric Composer, global operations (segmentation, rollup/aggregation, etc.)
  44. Data Model • Row: date, seg1, seg2, ... => value • Store the value for every possible segmentation • On-the-fly aggregation • E.g.: 2015-01-01, US, Male => 1; 2015-01-01, US, Female => 2; 2015-01-01, UK, Male => 3; 2015-01-01, UK, Female => 4; 2015-01-01, UK, * => 7; 2015-01-01, *, Male => 4
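
      The on-the-fly aggregation above can be reproduced in a few lines of Python. The sketch below sums the finest-grained (date, country, gender) values to answer wildcard queries and reproduces the numbers on the slide; it only illustrates the semantics and is not Pinalytics code.

      # Illustration of slide 44's aggregation: '*' means "sum over this segment".
      base = {
          ('2015-01-01', 'US', 'Male'):   1,
          ('2015-01-01', 'US', 'Female'): 2,
          ('2015-01-01', 'UK', 'Male'):   3,
          ('2015-01-01', 'UK', 'Female'): 4,
      }

      def read_metric(date, country, gender):
          return sum(
              value for (d, c, g), value in base.items()
              if d == date and country in ('*', c) and gender in ('*', g)
          )

      print(read_metric('2015-01-01', 'UK', '*'))    # 7, as on the slide
      print(read_metric('2015-01-01', '*', 'Male'))  # 4, as on the slide
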
  45. Backend Architecture • (1) The webapp server sends a request • (2) the Pinalytics Thrift service calls readMetrics() on HBase • (3) the region servers scan & aggregate the metric table, each region running a coprocessor (CP) • (4) region aggregation combines the per-region results • (5) the metrics are returned to the webapp
  46. HBase • Horizontal scalability: no app-level sharding • Flexibility in aggregation: FuzzyRowFilter, coprocessor • Tables: report metadata, reports
  47. Fuzzy Row Filter • Composite row key: METRIC|TIME|SEG1|SEG2|... • Filters rows given a row key and a fuzzy mask: 0 = match the byte, 1 = don’t match the byte • E.g. MAU of male users on 2015-01-01: start row MAU|2015-01-01|, end row MAU|2015-01-01||, row key MAU|2015-01-01|--|M-, fuzzy filter 000|0000000000|11|00
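
      To show how those fuzzy-filter semantics work, the sketch below builds the (row key, mask) pair for "MAU of male users on 2015-01-01" and applies the byte-wise rule from the slide locally. It only illustrates the matching logic; the real FuzzyRowFilter is applied server-side by HBase.

      # Local illustration of FuzzyRowFilter semantics from slide 47:
      # mask byte 0 = the key byte must match, 1 = the key byte is ignored.
      PATTERN = 'MAU|2015-01-01|--|M-'
      MASK    = '000|0000000000|11|00'   # '|' separators are fixed bytes too

      def fuzzy_match(row_key, pattern=PATTERN, mask=MASK):
          if len(row_key) != len(pattern):
              return False
          return all(
              m == '1' or k == p          # ignored byte, or byte matches exactly
              for k, p, m in zip(row_key, pattern, mask)
          )

      print(fuzzy_match('MAU|2015-01-01|US|M-'))  # True: any country, male users
      print(fuzzy_match('MAU|2015-01-01|UK|F-'))  # False: gender byte differs
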
  48. HBase Coprocessor • Region-local aggregation with a coprocessor • Final aggregation at the Thrift service • Reduces network I/O • Low latency
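
      The real coprocessor is Java running inside each HBase region server, but the two-level pattern can be sketched conceptually: each "region" returns only a partial sum, and the service combines them, which is why network I/O drops.

      # Conceptual sketch of region-local plus final aggregation (slide 48);
      # the actual coprocessor is Java deployed inside HBase, not Python.
      regions = [
          {'MAU|2015-01-01|US|M': 1, 'MAU|2015-01-01|US|F': 2},  # region 1
          {'MAU|2015-01-01|UK|M': 3, 'MAU|2015-01-01|UK|F': 4},  # region 2
      ]

      def region_local_sum(region, prefix):
          """What the coprocessor does: aggregate inside the region and
          return one number instead of every matching row."""
          return sum(v for k, v in region.items() if k.startswith(prefix))

      def read_metric(prefix):
          """What the Thrift service does: combine the per-region partials."""
          return sum(region_local_sum(r, prefix) for r in regions)

      print(read_metric('MAU|2015-01-01|'))  # 10; only two partial sums crossed the network
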
  49. Reporter • Flexible Python client library for generating reports: arbitrary metrics and segments • Easy-to-access data: data is automatically copied to S3 and a Hive external table is generated
  50. Reporter Example: WAU, WARC and MAU segmented by gender and country

      class DemoWAUReport(PinalyticsWideReport):
          _METRIC_NAMES = ['wau', 'warc', 'mau']
          _SEGKEY_NAMES = ['gender', 'country']
          _QUERY_TEMPLATE = """
              SELECT dt, gender, country, wau, warc, mau
              FROM activity_metrics
              WHERE dt>='2015-01-01';"""

      Sample query output: ['2015-01-01', 'male', 'US', 102, 53, 110]
  51. Core Metrics • Pre-compute a lot of core metrics: activity, event counts, retention, signups • Standard segmentation: gender, country, app • Spam-filtering
  52. Outcomes
  53. Internal Tools Matter: solving problems inside our company • 400 unique users • 800 page views per day • 1,500 custom charts created and updated daily
  54. Thank You
