
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift

4,267 views


Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that costs less than $1,000 per TB per year, under a tenth the price of most traditional data warehousing solutions. Learn how Yahoo! uses Amazon Redshift to build a billion-event-a-day analytics infrastructure that is fast, easy, and cost-effective. Dive into how Yahoo! performs advanced user retention and cohort analysis to make near-real-time product and marketing decisions.

Published in: Technology


  1. 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Adam Savitzky, Yahoo! Tina Adams, AWS October 2015 DAT308 How Yahoo! Analyzes Billions of Events with Amazon Redshift
  2. 2. Fast, simple, petabyte-scale data warehousing for $1,000/TB/Year Amazon Redshift a lot faster a lot cheaper a lot simpler
  3. 3. Amazon Redshift architecture Leader node Simple SQL end point Stores metadata Optimizes query plan Coordinates query execution Compute nodes Local columnar storage Parallel/distributed execution of all queries, loads, backups, restores, resizes Start at $0.25/hour, grow to 2 PB (compressed) DC1: SSD; scale from 160 GB to 326 TB DS2: HDD; scale from 2 TB to 2 PB 10 GigE (HPC) Ingestion Backup Restore JDBC/ODBC
  4. 4. Amazon Redshift is priced to analyze all your data. Pricing is simple: number of nodes × hourly price; no charge for the leader node; 3x data compression on average; three copies of your data. DS2 (HDD), smallest single node: On-Demand $0.850/hr ($3,725 effective annual price per TB compressed); 1-year reservation $0.500/hr ($2,190); 3-year reservation $0.228/hr ($999). DC1 (SSD), smallest single node: On-Demand $0.250/hr ($13,690 effective annual price per TB compressed); 1-year reservation $0.161/hr ($8,795); 3-year reservation $0.100/hr ($5,500).
  5. 5. Amazon Redshift is easy to use Provision in minutes Monitor query performance Point and click resize Built-in security Automatic backups
  6. 6. Selected Amazon Redshift customers
  7. 7. Analytics at Yahoo
  8. 8. What to expect from the session • What does analytics mean for Yahoo? • Learn how our extract, transform, load (ETL) process runs • Learn about our Amazon Redshift architecture • Do’s, don’ts, and best practices for working with Amazon Redshift • Deep dive into advanced analytics, featuring how we define and report user retention
  9. 9. Setting the stage “We are returning an iconic company to greatness.” —Marissa Mayer
  10. 10. Guiding principles
  11. 11. Guiding principles “You can’t grow a product that hasn’t reached product market fit.” —Arjun Sethi, @arjset
  12. 12. Guiding principles Analytics is critical for growth
  13. 13. Overall volume (chart, in billions): Yahoo events compared with auto miles driven, Google searches, McDonald's fries served, and babies born.
  14. 14. Audience data breakdown (chart, by product): Desktop Mail, Tumblr, Sports, Weather, Front Page, Aviate, Other.
  15. 15. Hadoop at Yahoo: 14 clusters, 42,000 nodes, 3 data centers, 500 PB of data.
  16. 16. Hive: slow, hard to use, hard to share, hard to repeat.
  17. 17. Hive
  18. 18. And many others…
  19. 19. Benchmarks (chart, seconds, log scale from 1 to 10,000; lower is better): count distinct devices, count all events, filter clauses, and joins, comparing Amazon Redshift, Vertica, and Impala.
  20. 20. Amazon Redshift at Yahoo: 21 dc1.8xlarge nodes, 2B events per day, 1,200 queries per day, 27 TB of data.
  21. 21. Architecture
  22. 22. Extract, transform, load (ETL): Hadoop (Pig) → S3 (Airflow) → Amazon Redshift (Looker)
  23. 23. ETL—upstream Clickstream Data (Hadoop) Intermediate Storage (HDFS) AWS (S3) Hourly Batch Process (Oozie) Custom Uploader (python/boto)
  24. 24. ETL—downstream Data available? Copy to Amazon Redshift Sanitize Export new installs Process new installs Update hourly table Update install table Update params Subdivide params Clean up Subdivide events Data flows in hourly from S3 to Amazon Redshift, where it’s processed and subdivided
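The hourly S3-to-Redshift step above is done with Redshift's COPY command, which loads delimited files from S3 in parallel across the compute nodes. A minimal sketch of building that statement; the table, bucket, and IAM role names are illustrative placeholders, not Yahoo's actual configuration:

```python
def build_copy_statement(table, s3_prefix, iam_role):
    """Build a Redshift COPY statement for one hourly S3 partition.

    All identifiers here are hypothetical examples.
    """
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "GZIP DELIMITER '\\t' "
        "TIMEFORMAT 'auto';"
    )

stmt = build_copy_statement(
    "event_raw_mail",
    "s3://example-bucket/clickstream/2015-10-08/07/",
    "arn:aws:iam::123456789012:role/example-redshift-load",
)
print(stmt)
```

In practice the statement would be issued once per hourly batch by the scheduler, after the "data available?" check succeeds.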
  25. 25. ETL—downstream Visualization of running and complete tasks
  26. 26. Schema: event_raw mail, event hourly, event daily, install, install attribution, event_raw flickr, event_raw homerun, event_raw stark, event_raw livetext, event_raw union view, user retention, funnel, first_event date, param mail, param flickr, param homerun, param stark, param livetext, param union view, is_active, param keys, telemetry daily, revenue daily. Grouped on the slide as raw tables, summary tables, and derived tables.
  27. 27. ETL—Nightly 24 hours available? Wipe old data Build daily table Build user retention Build funnel Vacuum Runs all daily aggregations and cleans up/vacuums
  28. 28. Do’s and don’ts
  29. 29. DO Summarize user_id event_date action 1 2015-10-08 spam 1 2015-10-08 spam 1 2015-10-08 spam 1 2015-10-08 spam 1 2015-10-08 spam user_id event_date action event_count 1 2015-10-08 spam 5
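The summarization on this slide collapses repeated raw events into one row per (user_id, event_date, action) with a count. A quick sketch of the same collapse in plain Python, using the slide's sample rows:

```python
from collections import Counter

# Raw events: one row per action (the slide's five identical "spam" rows).
raw_events = [
    (1, "2015-10-08", "spam"),
    (1, "2015-10-08", "spam"),
    (1, "2015-10-08", "spam"),
    (1, "2015-10-08", "spam"),
    (1, "2015-10-08", "spam"),
]

# Collapse to one row per (user_id, event_date, action) with an
# event_count, matching the slide's summarized table.
summary = Counter(raw_events)
rows = [(u, d, a, n) for (u, d, a), n in summary.items()]
print(rows)  # [(1, '2015-10-08', 'spam', 5)]
```

In Redshift this is a GROUP BY with COUNT(*); the point is that the summary table is five times smaller with no loss of analytical information.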
  30. 30. DO Choose good sort keys (and use them) CREATE TABLE revenue ( customer_id BIGINT, transaction_id BIGINT, location VARCHAR(64), event_date DATE, event_ts TIMESTAMP, revenue_usd DECIMAL ) DISTKEY(customer_id) SORTKEY( location, event_date, customer_id )
  31. 31. DO Vacuum nightly (or weekly and tell people you do it nightly)
  32. 32. DO Avoid joins where possible—and learn mitigation strategies for when you must join
  33. 33. Join mitigation strategies Key distribution Records distributed by distkey Choose a field that you join on Avoid causing excess skew All distribution All records distributed to all nodes Most robust, but most space- intensive Fastest joins occur when records are colocated Key distribution A.1 B.1 A.3 B.3 A.5 B.5 A.2 B.2 A.4 B.4 A.6 B.6 All distribution A.1 B.1 A.2 B.2 A.3 B.3 A.4 B.4 A.5 B.5 A.6 B.6 A.1 B.1 A.2 B.2 A.3 B.3 A.4 B.4 A.5 B.5 A.6 B.6 Even distribution A.1 B.6 A.5 B.2 A.3 B.3 A.4 B.1 A.2 B.5 A.6 B.4
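As a toy illustration of why key distribution makes joins cheap: if two tables share a distkey, matching rows hash to the same node and the join needs no network shuffle. A sketch under made-up assumptions (3 nodes, modulo standing in for Redshift's real hash function):

```python
# Toy model of key distribution across 3 compute nodes.
NODES = 3

def key_distribute(rows, n=NODES):
    """Place each (table, key) row on the node its distkey maps to."""
    placement = {i: [] for i in range(n)}
    for table, key in rows:
        placement[key % n].append((table, key))
    return placement

# Tables A and B, keys 1..6, both distributed on the same key.
rows = [("A", k) for k in range(1, 7)] + [("B", k) for k in range(1, 7)]
placement = key_distribute(rows)

# The join is fully local: on every node, A's keys equal B's keys.
for node, local in placement.items():
    keys_a = {k for t, k in local if t == "A"}
    keys_b = {k for t, k in local if t == "B"}
    assert keys_a == keys_b
```

With even (round-robin) distribution the same keys land on different nodes, which is what forces the redistribution the slide warns about.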
  34. 34. DO Automate
  35. 35. DON’T Fill the cluster (leave more than you think)
  36. 36. DON’T Run ETL in the default queue Workload management (WLM) is your friend
  37. 37. Example WLM configuration. Queue 1: concurrency 1, user group etl, memory 50%. Queue 2: concurrency 10, timeout 60,000 ms, memory 50%. Two queues: ETL and ad hoc. Purpose: insulate normal users from ETL and free up plenty of memory for big batch jobs.
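The two-queue setup could be expressed as a WLM JSON configuration along these lines (a sketch; verify the exact property names against the current Redshift WLM documentation before use):

```json
[
  {
    "query_concurrency": 1,
    "user_group": ["etl"],
    "memory_percent_to_use": 50
  },
  {
    "query_concurrency": 10,
    "max_execution_time": 60000,
    "memory_percent_to_use": 50
  }
]
```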
  38. 38. DON’T Use CREATE TABLE AS For permanent tables
  39. 39. DON’T Email SQL around Find a good reporting tool
  40. 40. Deep dive: user retention
  41. 41. User retention is…
  42. 42. User retention is… The most important* quality metric for your product * kinda
  43. 43. User retention and growth: day-14 retention over time (an example of N-day retention).
  44. 44. User retention and growth (chart): daily active users over product age in days (0–14) for Product A and Product B.
  45. 45. High churn = wasted ad dollars (chart): ad spend ($0–$25,000) over product age in days (1–14) for Product A and Product B.
  46. 46. The Sputnik method For generating a multidimensional user retention analysis table event_date install_date os_name country active_users cohort_size monday monday android us 100 100 tuesday monday android us 83 100 monday monday ios us 75 75 tuesday monday ios us 75 75
  47. 47. Get one-day retention SELECT SUM(active_users) AS active_users, SUM(cohort_size) AS cohort_size, SUM(active_users)::FLOAT / SUM(cohort_size) AS retention FROM user_retention WHERE event_date - install_date = 1 AND CURRENT_DATE - 1 > event_date;
  48. 48. Get one-day retention event_date install_date os_name country active_users cohort_size monday monday android us 100 100 tuesday monday android us 83 100 monday monday ios us 75 75 tuesday monday ios us 75 75 Active Users: 83 + 75 = 158 Cohort Size: 100 + 75 = 175 ------------------------------- Pct Retention = 158 / 175 ≈ 90%
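The worked example above can be checked with a short Python sketch over the slide's four sample rows:

```python
# The slide's user_retention rows: (event_date, install_date, os_name,
# country, active_users, cohort_size).
user_retention = [
    ("monday",  "monday", "android", "us", 100, 100),
    ("tuesday", "monday", "android", "us",  83, 100),
    ("monday",  "monday", "ios",     "us",  75,  75),
    ("tuesday", "monday", "ios",     "us",  75,  75),
]

# One-day retention: keep only rows where event_date is one day after
# install_date (here, the "tuesday" rows), then divide the sums.
day1 = [r for r in user_retention if r[0] == "tuesday"]
active = sum(r[4] for r in day1)   # 83 + 75 = 158
cohort = sum(r[5] for r in day1)   # 100 + 75 = 175
retention = active / cohort        # 158 / 175, about 90%
```

Note that summing before dividing weights the result by cohort size, which is why the blended figure (≈90%) sits between Android's 83% and iOS's 100%.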
  49. 49. Get one-day retention by OS SELECT os_name, SUM(active_users) AS active_users, SUM(cohort_size) AS cohort_size, SUM(active_users)::FLOAT / SUM(cohort_size) AS retention FROM user_retention WHERE event_date - install_date = 1 AND CURRENT_DATE - 1 > event_date GROUP BY 1;
  50. 50. Get one-day retention event_date install_date os_name country active_users cohort_size monday monday android us 100 100 tuesday monday android us 83 100 monday monday ios us 75 75 tuesday monday ios us 75 75 Active Users: 83 Cohort Size: 100 ------------------- Pct Retention = 83% Active Users: 75 Cohort Size: 75 -------------------- Pct Retention = 100% iOS: Android:
  51. 51. The Sputnik method You will need: a daily event summary table and a user table keyed by user_id
  52. 52. The Sputnik method Calculate cohort sizes • Count users by all dimensions • For example: Male, iOS, in USA, who installed today Determine user activity • For each day, for each user, were they active • Create a table with user_id and event_date Join and aggregate • Join user table to user_activity on user_id • SUM active users by cohort and join to cohort sizes
  53. 53. Calculate cohort sizes user_id install_date os_name country 1 2015-10-02 iOS us 2 2015-10-01 android ca 3 2015-10-01 android ca SELECT install_date, os_name, country, COUNT(*) AS cohort_size FROM user GROUP BY 1,2,3;
  54. 54. Calculate cohort sizes install_date os_name country cohort_size 2015-10-02 iOS us 1 2015-10-01 android ca 2 SELECT install_date, os_name, country, COUNT(*) AS cohort_size FROM user GROUP BY 1,2,3;
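The cohort-size query runs the same way on any SQL engine; here is a self-contained check using Python's built-in sqlite3 with the slide's three sample users (the ordinal GROUP BY 1,2,3 from the slide is spelled out as column names for portability):

```python
import sqlite3

# Load the slide's sample user table into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user (user_id INT, install_date TEXT, "
    "os_name TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?, ?)",
    [
        (1, "2015-10-02", "iOS", "us"),
        (2, "2015-10-01", "android", "ca"),
        (3, "2015-10-01", "android", "ca"),
    ],
)

# Count users by every cohort dimension.
cohorts = conn.execute(
    "SELECT install_date, os_name, country, COUNT(*) AS cohort_size "
    "FROM user GROUP BY install_date, os_name, country "
    "ORDER BY install_date"
).fetchall()
print(cohorts)
# [('2015-10-01', 'android', 'ca', 2), ('2015-10-02', 'iOS', 'us', 1)]
```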
  55. 55. Determine user activity user_id event_date action 1 2015-10-02 app_open 1 2015-10-02 spam 1 2015-10-03 app_open CREATE TEMP TABLE user_activity AS SELECT DISTINCT user_id, event_date FROM event_daily WHERE action = 'app_open';
  56. 56. Determine user activity user_id event_date action 1 2015-10-02 app_open 1 2015-10-02 spam 1 2015-10-03 app_open CREATE TEMP TABLE all_users AS SELECT DISTINCT user_id FROM event_daily; CREATE TEMP TABLE all_dates AS SELECT DISTINCT event_date FROM event_daily;
  57. 57. Determine user activity user_id event_date action 1 2015-10-02 app_open 1 2015-10-02 spam 1 2015-10-03 app_open CREATE TABLE active_users_by_day AS SELECT xproduct.user_id, xproduct.event_date FROM ( SELECT * FROM all_users CROSS JOIN all_dates ) xproduct INNER JOIN user_activity u ON u.user_id = xproduct.user_id AND u.event_date = xproduct.event_date;
  58. 58. Determine cohort activity user_id event_date 1 2015-10-02 1 2015-10-03 CREATE TEMP TABLE cohort_activity AS SELECT u.*, all_dates.event_date, CASE WHEN au.user_id IS NOT NULL THEN 1 ELSE 0 END AS is_active FROM user AS u LEFT JOIN all_dates ON all_dates.event_date >= u.install_date LEFT JOIN active_users_by_day AS au ON au.user_id = u.user_id AND au.event_date = all_dates.event_date WHERE all_dates.event_date >= u.install_date; user_id install_date os_name country 1 2015-10-02 iOS us
  59. 59. Determine cohort activity user_id event_date install_date os_name country is_active 1 2015-10-02 2015-10-02 iOS us 1 1 2015-10-03 2015-10-02 iOS us 1 1 2015-10-04 2015-10-02 iOS us 0 CREATE TEMP TABLE active_users AS SELECT event_date, install_date, os_name, country, SUM(is_active) AS active_users FROM cohort_activity GROUP BY 1, 2, 3, 4;
  60. 60. Determine cohort activity event_date install_date os_name country active_users 2015-10-03 2015-10-02 iOS us 100 2015-10-03 2015-10-02 android us 350 2015-10-03 2015-10-02 iOS ca 50 Join these two tables on matching cohort dimensions install_date os_name country cohort_size 2015-10-02 iOS us 200 2015-10-02 android us 400 2015-10-02 iOS ca 60
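The final step joins the per-cohort activity counts to the cohort sizes on the shared dimensions and divides. A sketch in plain Python with the slide's sample numbers:

```python
# Active users on 2015-10-03, keyed by (install_date, os_name, country).
activity = {
    ("2015-10-02", "iOS", "us"): 100,
    ("2015-10-02", "android", "us"): 350,
    ("2015-10-02", "iOS", "ca"): 50,
}

# Cohort sizes for the same dimensions.
cohort_size = {
    ("2015-10-02", "iOS", "us"): 200,
    ("2015-10-02", "android", "us"): 400,
    ("2015-10-02", "iOS", "ca"): 60,
}

# "Join" on the shared key and divide to get per-cohort retention.
retention = {k: activity[k] / cohort_size[k] for k in activity}
# e.g. iOS/us: 100 / 200 = 0.5; android/us: 350 / 400 = 0.875
```

Because the result keeps every dimension, any slice (by OS, country, install date, or all of them) is a SUM-and-divide over this one table, which is the whole point of the Sputnik method.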
  61. 61. Big wins for Yahoo Real-time insights Easier deployment and maintenance Data-driven product development Cutting edge analytics
  62. 62. Thank you!
  63. 63. Related sessions Hear from other customers discussing their Amazon Redshift use cases: • DAT201—Introduction to Amazon Redshift (RetailMeNot) • ISM303—Migrating Your Enterprise Data Warehouse to Amazon Redshift (Boingo Wireless and Edmunds) • ARC303—Pure Play Video OTT: A Microservices Architecture in the Cloud (Verizon) • ARC305—Self-Service Cloud Services: How J&J Is Managing AWS at Scale for Enterprise Workloads • BDT306—The Life of a Click: How Hearst Publishing Manages Clickstream Analytics with AWS • DAT311—Large-Scale Genomic Analysis with Amazon Redshift (Human Longevity) • BDT314—Running a Big Data and Analytics Application on Amazon EMR and Amazon Redshift with a Focus on Security (Nasdaq) • BDT316—Offloading ETL to Amazon Elastic MapReduce (Amgen) • BDT401—Amazon Redshift Deep Dive (TripAdvisor) • Building a Mobile App using Amazon EC2, Amazon S3, Amazon DynamoDB, and Amazon Redshift (Tinder)
