Clickstream Analytics at Bazaarvoice
Evan Pollan, Engineering Lead




                                 @EvanPollan
Agenda
 •    Infrastructure: lessons learned operating Hadoop in EC2
 •    Case study: uniques at scale using Hadoop and HBase




Confidential and Proprietary. © 2012 Bazaarvoice, Inc.
Project Magpie
 •    Bazaarvoice products – extremely large web surface area
 •    Client-side instrumentation to measure interactions
 •    Many event sources (apps) => one sink: Magpie
 •    Consolidated HTTP event collection
       – Network-wide event correlation
       – Network ~ many apps and many “sites” (clients)
 •    Clickstream == Topically segmented JSON event log files
 •    Sense of scale
       – 10 - 20K events per second
       – 500M – 1B impressions per day
       – 25 – 50 GB compressed event log data per day


Infrastructure Whys
 •    Why Hadoop?
       – Experience scaling brute-force log processing via Hadoop
          • Everybody’s favorite: Akamai edge request logs
          • EMR, Apache Whirr
       – Needed online analytics – HBase fit the bill
       – Apache OSS ecosystem familiar to BV

 •    Why Amazon Web Services?
       – Existing infrastructure hosting solution too inflexible and slow
       – Couldn’t scale R&D without an elastic infrastructure


High-level architecture
 •    Event collectors in auto-scale groups behind elastic load balancers
 •    Event stream compressed and uploaded hourly to S3
 •    S3: store of record
 •    Hadoop cluster:
        –   HDFS: stores raw event logs, derived file-based data sets, and HBase
            HFiles/WALs
        –   Oozie: job scheduling, data dependency management
        –   MapReduce: analytics (mix of Pig, Java => 100% Java)
        –   HBase: stores hourly/daily analytics results
 •    Job Portal: job schedule viz, gap analysis & alerting
 •    UI/API: Analytics available via JSON API and in Backbone.js UI


EMR vs. roll our own
 •    Neither
 •    Cons: EMR
       – Price premium
       – Opaque Hadoop configuration
       – No way to mitigate SPOFs
 •    Cons: Roll our own
       – Small group of engineers, no ops manpower at beginning
 •    Solution: Cloudera
       – Cloudera Manager for config management and provisioning
       – CDH 3.X distribution


Missteps
 •    Problem: non-HA NameNode
 •    Solution: EBS!
 •    Problem: EC2 MTBF iffy
 •    Solution: EBS!
 •    Reality: When something goes wrong in AWS, it is invariably an
      outage or degradation in EBS.
       – Violates the whole concept of data locality. Hadoop + SAN =
          sadness
 •    Problem: Where should HBase live?
 •    Solution: Co-resident with MapReduce!


Where we’ve ended up
 •    Moved to the latest Cloudera CDH 4.X – HA NameNode!
       – ZooKeeper for leader election
       – Quorum Journal Manager for edit logs
 •    Learn to let go
       – Mitigate SPOF where possible, but plan for failure
       – End-to-end automation for DR/migration
 •    Avoid EBS like the plague
 •    HBase and MapReduce segmentation
       – Enables different hardware step size
       – Batch processing doesn’t affect HBase response time
       – Better understanding of HBase/HDFS locality (or lack thereof)


Let’s talk sets
 •    Common problem: uniques (e.g. unique visitors, users, etc.)
 •    Naïve solution: SELECT DISTINCT(X) FROM Y
 •    Not tenable given:
       – Massive, semi-structured data set
       – Thousands of grouping axes
 •    OK: pre-calculate via MapReduce
 •    But…
       – What would you pre-calculate?
       – Daily for each grouping?
       – How would you answer queries for other time ranges?
          Pre-calculate them, too?
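
A rough count shows why pre-calculating every time range does not scale. The sketch below assumes a one-year window of daily data and 1,000 grouping axes; both figures are illustrative picks (the slide only says "thousands"), not numbers from the talk.

```python
# Count the contiguous date ranges that would each need their own
# pre-calculated uniques if daily results could not be combined.
def contiguous_ranges(days: int) -> int:
    # Every range is a (start, end) pair of days with start <= end.
    return days * (days + 1) // 2


DAYS = 365          # illustrative window, an assumption
GROUPINGS = 1_000   # low-end reading of "thousands of grouping axes"

ranges_per_grouping = contiguous_ranges(DAYS)
total_results = ranges_per_grouping * GROUPINGS

print(ranges_per_grouping)  # 66795
print(total_results)        # 66795000
```

Tens of millions of separate results, each produced by a job that re-reads overlapping raw data, is what motivates looking for a representation that can be combined after the fact.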


Set Unions
 •    Definition: cardinality of a set is the number of elements in that set
       – A = {1, 2, 3}; |A| = 3
 •    Cardinality of the union of two sets cannot be determined from the
      cardinality of the two sets
       – |A U B| not necessarily equal to |A| + |B|
       – Only equal if A and B are disjoint
       – How do you know if they’re disjoint?
       – You need both sets
 •    Imagine:
       – Set “a” are the visitors from yesterday
       – Set “b” are the visitors from today
       – To get uniques for both days, you have to look at both data sets
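
The yesterday/today case can be made concrete with a tiny sketch. The visitor IDs are made up for illustration.

```python
# Two days of visitors; "u2" and "u3" came back on the second day.
yesterday = {"u1", "u2", "u3"}
today = {"u2", "u3", "u4", "u5"}

sum_of_counts = len(yesterday) + len(today)   # what the two cardinalities alone give you
uniques = len(yesterday | today)              # true uniques across both days
returning = len(yesterday & today)            # overlap, unknowable from the counts alone

print(sum_of_counts)  # 7
print(uniques)        # 5
print(returning)      # 2

# Inclusion-exclusion: |A U B| = |A| + |B| - |A n B|
assert uniques == sum_of_counts - returning
```

The overlap term is the part that cannot be recovered from |A| and |B|, which is why both data sets (or some combinable summary of them) are needed.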


An entirely different set: bit sets
 •    Translate set members’ identifiers to an index in a bit set
 •    Bit sets are combinable – yahtzee!
 •    HBase is good at storing bits
       – MapReduce to build bit set for each grouping in your smallest
           desirable unit of time
       – Persist w/ row key as a function of date and grouping
 •    Uniques for last month?
       – Scan: start and stop rows accounting for date range and grouping
       – Merge each day’s bit set with a single bit set representing the union
       – Count the number of “on” bits in the merged bit set => cardinality
 •    But…
       – # bits for items whose identifiers number in the billions?
       – A billion bits is a lot of bits
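
A minimal sketch of the bit set approach, using a Python integer as the bit set. It assumes each visitor has already been assigned a dense integer index; how that assignment happens is not covered in the slides, and the function names are this sketch's own.

```python
def build_bit_set(member_indexes):
    """Return an int whose bit i is on when index i is a member."""
    bits = 0
    for index in member_indexes:
        bits |= 1 << index
    return bits


def union_cardinality(bit_sets):
    """OR the bit sets together, then count the 'on' bits."""
    merged = 0
    for bits in bit_sets:
        merged |= bits
    return bin(merged).count("1")


# One bit set per day for a single grouping (indexes are made up).
daily = [
    build_bit_set([0, 1, 2]),   # day 1
    build_bit_set([1, 2, 3]),   # day 2: two returning visitors, one new
    build_bit_set([7]),         # day 3
]

print(union_cardinality(daily))  # 5
```

The merge is a plain OR, so the order in which days arrive from a scan does not matter. The cost is that the bit set must be as wide as the largest index, no matter how few bits are on.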


Bit sets – solving the size problem
 •    10^9 bits is an expensive way to store a combinable cardinality
 •    Query I/O example: Uniques for last quarter
       – 120 MB/day * 90 days = 10.8 GB
       – Too much to pull out of HBase to answer an “online” query
 •    Storage example: 10K different grouping axes
       – Clients, sites, favorite colors, whatever
       – 120 MB * 10K = 1.2 TB/day of storage
 •    Possible mitigation: compression
       – Still need to generate a 120 MB data structure, then compress
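
The figures above can be reproduced with a few lines of arithmetic. This sketch assumes decimal units (1 GB = 1,000 MB) and takes the slide's 120 MB as a rounding of the size of a 10^9-bit set.

```python
BITS = 10**9

bytes_per_bit_set = BITS // 8            # 125,000,000 bytes
mib_per_bit_set = bytes_per_bit_set / 2**20

MB_PER_DAY = 120                         # rounded figure used on the slide

quarter_gb = MB_PER_DAY * 90 / 1_000             # one grouping, 90 daily bit sets
storage_tb = MB_PER_DAY * 10_000 / 1_000_000     # 10K groupings, one day

print(bytes_per_bit_set)            # 125000000
print(round(mib_per_bit_set, 1))    # 119.2
print(quarter_gb)                   # 10.8
print(storage_tb)                   # 1.2
```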



Cardinality Estimation
 •    Many different approaches to estimate the cardinality of a set
       – General goal: calculate cardinality in small RAM footprint
 •    Big breakthrough in 2007: the HyperLogLog estimator
 •    What’s the big deal?
       – Tunable accuracy
       – Incredible information density
       – Combinable
 •    Analog: lossy compression of bit sets
 •    How good?
       – Estimate cardinality of 10^9 unique elements +/-2% in 1.5 KB
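
To make "tunable" and "combinable" concrete, here is a minimal HyperLogLog sketch. It is an illustration under stated assumptions, not the stream-lib implementation: the class and method names are this sketch's own, SHA-1 is an arbitrary hash choice, and each register takes a full byte for simplicity. With 2,048 registers the expected relative standard error is about 1.04 / sqrt(2048), roughly 2.3%; packed at 6 bits per register the same state would take 1,536 bytes, which is in line with the 1.5 KB quoted above.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p one-byte registers (p >= 7)."""

    def __init__(self, p: int = 11):
        self.p = p
        self.m = 1 << p                      # p=11 gives 2048 registers
        self.registers = bytearray(self.m)   # all zero to start

    def add(self, item: str) -> None:
        # 64-bit hash taken from SHA-1 (an arbitrary choice for this sketch).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                 # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64-p bits
        # Rank = position of the leftmost 1-bit in `rest`, counting from 1.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other: "HyperLogLog") -> None:
        # Union of two sketches: keep the larger value in each register.
        if other.p != self.p:
            raise ValueError("precision mismatch")
        for i, r in enumerate(other.registers):
            if r > self.registers[i]:
                self.registers[i] = r

    def cardinality(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


monday, tuesday = HyperLogLog(), HyperLogLog()
for i in range(60_000):
    monday.add(f"visitor-{i}")
for i in range(40_000, 100_000):
    tuesday.add(f"visitor-{i}")

monday.merge(tuesday)                   # monday now represents both days
print(round(monday.cardinality()))      # should land near 100000, not 120000
print(1.04 / math.sqrt(monday.m))       # expected relative standard error
```

Raising `p` buys accuracy at the cost of more registers, which is the tunable trade-off. The merge is a per-register maximum, so merging two sketches yields exactly the registers that a single sketch fed both streams would hold.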


Nuts & Bolts
 •    http://github.com/clearspring/stream-lib
       – Java impls of top-K, frequency, and cardinality for streams
 •    Aha moment: combining estimators from distributed counters is
      no different than combining them across different time periods!
 •    MapReduce algorithm
       – map(Event) : (key, identifier)
            • key is whatever grouping you want uniques for
       – Shuffle sorts all key, identifier tuples by key
       – reduce(key, Iterable<identifier>) : estimator bytes
 •    Reducer simply updates the estimator in-place – tiny RAM footprint
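
The data flow above can be simulated in-process. This is a sketch of the shape of the job, not Hadoop code: the event fields (`brand`, `day`, `visitor`) are hypothetical, and `ExactEstimator` is a stand-in that keeps every identifier so the example stays short and exact. A real job would use a HyperLogLog estimator with the same add/merge/count shape.

```python
from collections import defaultdict


class ExactEstimator:
    """Stand-in for a cardinality estimator. It keeps every identifier,
    so it is exact but NOT small; a real job would use HyperLogLog."""

    def __init__(self):
        self.seen = set()

    def offer(self, identifier):
        self.seen.add(identifier)

    def merge(self, other):
        self.seen |= other.seen

    def cardinality(self):
        return len(self.seen)


def map_event(event):
    # Hypothetical grouping: uniques per brand per day.
    key = (event["brand"], event["day"])
    yield key, event["visitor"]


def shuffle(pairs):
    # Group identifiers by key and order the keys, as the framework would.
    grouped = defaultdict(list)
    for key, identifier in pairs:
        grouped[key].append(identifier)
    return sorted(grouped.items())


def reduce_group(key, identifiers):
    # One estimator per key, updated in place as identifiers stream by.
    estimator = ExactEstimator()
    for identifier in identifiers:
        estimator.offer(identifier)
    return key, estimator


events = [
    {"brand": "brandX", "day": "20130101", "visitor": "u1"},
    {"brand": "brandX", "day": "20130101", "visitor": "u2"},
    {"brand": "brandX", "day": "20130101", "visitor": "u1"},
    {"brand": "brandX", "day": "20130102", "visitor": "u2"},
    {"brand": "brandX", "day": "20130102", "visitor": "u3"},
    {"brand": "brandY", "day": "20130101", "visitor": "u1"},
]

pairs = [pair for event in events for pair in map_event(event)]
results = dict(reduce_group(key, ids) for key, ids in shuffle(pairs))

# Merging per-day estimators answers a two-day query without re-reading events.
two_days = ExactEstimator()
two_days.merge(results[("brandX", "20130101")])
two_days.merge(results[("brandX", "20130102")])
print(two_days.cardinality())  # 3
```

The final merge is the same operation whether the inputs come from different machines or from different days, which is the point of the "aha" bullet.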


Nuts & Bolts
 •    Reducer output: HBase Put
 •    HBase “schema”, e.g. daily uniques aggregated by brand:

          Row Key              Estimator
          brandX-20130101      [0100110111000]
          brandX-20130102      [0110100111000]
          brandX-20130103      [0100000101011]
          brandY-20130101      [0101100011000]
          brandY-20130102      [0100100111001]

 •    Scan:
       – brandX
       – Jan 2-3
       – [0110100111000] merged with [0100000101011] => [0110100111011]
       – Cardinality = N
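
A small in-memory stand-in for the table and scan, using the row keys and bit patterns from the slide. A sorted list of keys plays the role of HBase's ordered rows; the scan treats the start row as inclusive and the stop row as exclusive, as an HBase scan does. The helper names are this sketch's own. The OR mirrors the slide's schematic bit patterns; an actual HyperLogLog merge takes the per-register maximum instead.

```python
import bisect


def row_key(brand, day):
    # Grouping first, then date, so one grouping's days sort together.
    return f"{brand}-{day}"


TABLE = {
    "brandX-20130101": 0b0100110111000,
    "brandX-20130102": 0b0110100111000,
    "brandX-20130103": 0b0100000101011,
    "brandY-20130101": 0b0101100011000,
    "brandY-20130102": 0b0100100111001,
}
SORTED_KEYS = sorted(TABLE)


def scan(start_row, stop_row):
    """Yield (key, value) for start_row <= key < stop_row, in key order."""
    lo = bisect.bisect_left(SORTED_KEYS, start_row)
    hi = bisect.bisect_left(SORTED_KEYS, stop_row)
    for key in SORTED_KEYS[lo:hi]:
        yield key, TABLE[key]


def merged_estimator(brand, first_day, day_after_last):
    merged = None
    for _, estimator in scan(row_key(brand, first_day), row_key(brand, day_after_last)):
        # First result: start from it. Remaining results: fold them in.
        merged = estimator if merged is None else merged | estimator
    return merged


merged = merged_estimator("brandX", "20130102", "20130104")
print(format(merged, "013b"))  # 0110100111011
```

Because the key leads with the grouping and ends with a sortable date, a date range for one brand is a single contiguous run of rows.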
Nuts & Bolts
 •    HBase scan is the key to making this fast
       – First result: instantiate HyperLogLog estimator
       – Remaining results: update estimator in-place
 •    O(n) to compute result, n ~ number of bits in estimator (1.5KB)
 •    Freedom to build a data set of unique estimators that can be
      arbitrarily sliced quickly
       – Quarterly, daily, weekly, ad-hoc date ranges
       – HBase client pulls 1.5KB * number of days, returns a long
       – Perf anecdote: REST API call to get network-wide uniques for
          current month-to-date
            • 66 ms over the internet
            • 12 ms server-side latency
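
For a sense of how little data such a query moves, this sketch compares the scan payload for daily estimators against daily 10^9-bit sets. It assumes 1.5 KB means 1,536 bytes and reuses the 120 MB per-day bit set figure from the earlier slide; the date ranges are illustrative.

```python
ESTIMATOR_BYTES = 1_536          # 1.5 KB per daily estimator (assumed 1.5 * 1024)
BIT_SET_BYTES = 120 * 10**6      # 120 MB per daily bit set


def scan_payload_bytes(days, bytes_per_row):
    # One row per day for a single grouping.
    return days * bytes_per_row


for label, days in [("month", 30), ("quarter", 90), ("year", 365)]:
    estimators = scan_payload_bytes(days, ESTIMATOR_BYTES)
    bit_sets = scan_payload_bytes(days, BIT_SET_BYTES)
    print(label, estimators, bit_sets)

print(BIT_SET_BYTES // ESTIMATOR_BYTES)  # 78125
```

A quarter of estimators is 138,240 bytes, against 10.8 GB for the equivalent bit sets.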


Austin Scales- Clickstream Analytics at Bazaarvoice


Editor's Notes

  1. August – 2012 Version
  2. A magpie is a bird that suffers an irresistible urge to collect and hoard things. Sense of scale: at our current level of instrumentation and app penetration.
  3. HBase fit the bill: given its storage model and affinity for time-series data, and given its clean, out-of-the-box integration with MapReduce.
  4. I can’t, and therefore don’t, do diagrams. You’re stuck with word-dense slides. Hadoop cluster: of note, we sync S3 to HDFS for optimized job execution and to enable Oozie’s data dependency management. Job Portal: the Oozie web UI is painful to use.
  5. When most people get ready to deploy Hadoop to EC2, they choose between Elastic MapReduce and a custom deployment. CDH distribution: curated, so you don’t have to worry about mixing and matching various Apache component versions.
  6. Non-HA NameNode: even CDH3 was not immune from SPOFs. EC2 MTBF iffy: the Magpie team was definitely not the first to foray into EC2; BV had been using EC2 for quite some time at this point.
  7. Quorum Journal Manager for edit logs: doesn’t push the SPOF further upstream with an NFS NAS solution for shared storage of the edit logs. This system works really well. Leader election is lightning fast, and we haven’t encountered any failures of reads or writes during our “pull the plug” testing. End-to-end automation for DR: by DR, I mean AZ outages, loss of 3+ data nodes, or loss of 2+ “master nodes”. When our SLAs require it, we’ll run an HBase replica in another region, but still treat the MapReduce cluster as expendable. HBase/HDFS locality: Region Server and HFile blocks are not co-resident after a region has been reassigned.
  8. We have a solid Hadoop infrastructure running in AWS, so let’s crunch some big data. Not tenable given...: well, not tenable without a very, very large OLAP data store. We’ve got a Hadoop cluster, though. Pre-calculate them, too?: large, expensive jobs re-processing the same data sets, and a lack of flexibility for the end user.
  9. We have a solid Hadoop infrastructure running in AWS, so let’s crunch some big data. Not tenable given...: well, not tenable without a very, very large OLAP data store. We’ve got a Hadoop cluster, though. Pre-calculate them, too?: large, expensive jobs re-processing the same data sets, and a lack of flexibility for the end user.
  10. Conclusion: need some way to calculate and persist a representation of cardinality for an incremental time period that would not be prohibitive to scan over arbitrary time ranges and combine into a single representation of the cardinality of all the subsets.
  11. Bit sets are combinable: meaning you could take a bit set representation of one day’s cardinality, OR it with another day’s bit set, and have a bit set that would tell you the cardinality of the union of the two days. MapReduce to build...: for example, unique users at site XYZ on January 31, 2013. Scan, start and stop...: HBase is very good at scans over reasonable sets of data, even without the benefit of block cache, when (a) rows are reasonably narrow, and (b) the ordering of the keys leads to linear reads. A billion bits is a lot of bits: it’s not big data, but it can quickly become big data.
  12. Possible mitigation, compression: still need to generate a 120 MB data structure in RAM, then compress. Retrieval is non-trivial given decompression costs and heap pressure.
  13. Calculate cardinality in small RAM footprint: e.g. for stream processing. Big breakthrough in 2007, HyperLogLog: a new algorithm and representational data structure from a team of French mathematicians led by Flajolet. Timely: engineers at Google just published a refinement called HLL++ that is more accurate on the low and high end. Combinable: not unique to HyperLogLog. Analog, lossy compression: BUT it doesn’t require a large intermediate heap and the associated CPU cycles for compression.
  14. I don’t peruse the proceedings of math conferences, but I do keep up on Hacker News and highscalability.com. Last April, Matt Abrams of Clearspring wrote a blog post on using HyperLogLog estimators to merge cardinality estimators from a bunch of distributed stream-processing machines.