Real-time Analytics at Facebook:
Data Freeway and Puma


Zheng Shao
12/2/2011
Agenda
 1   Analytics and Real-time

 2   Data Freeway

 3   Puma

 4   Future Work
Analytics and Real-time
what and why
Facebook Insights
• Use cases
▪   Websites/Ads/Apps/Pages
▪   Time series
▪   Demographic break-downs
▪   Unique counts/heavy hitters

• Major challenges
▪   Scalability
▪   Latency
Analytics based on Hadoop/Hive
[Diagram] HTTP → Scribe (seconds) → NFS (seconds) → Hourly Copier/Loader → Hive/Hadoop → Daily Pipeline Jobs → MySQL

• 3000-node Hadoop cluster

• Copier/Loader: Map-Reduce hides machine failures

• Pipeline Jobs: Hive allows SQL-like syntax

• Good scalability, but poor latency! 24 – 48 hours.
How to Get Lower Latency?




• Small-batch Processing
▪   Run Map-reduce/Hive every hour, every 15 min, every 5 min, …
▪   How do we reduce per-batch overhead?

• Stream Processing
▪   Aggregate the data as soon as it arrives
▪   How to solve the reliability problem?
Decisions
• Stream Processing wins!



• Data Freeway
▪   Scalable Data Stream Framework

• Puma
▪   Reliable Stream Aggregation Engine
Data Freeway
scalable data stream
Scribe

[Diagram] Scribe Clients → Scribe Mid-Tier → Scribe Writers → NFS; NFS → Batch Copier → HDFS; NFS → tail/fopen → Log Consumer

• Simple push/RPC-based logging system


• Open-sourced in 2008. 100 log categories at that time.

• Routing driven by static configuration.
Data Freeway
[Diagram] Scribe Clients → Calligraphus Mid-tier → Calligraphus Writers → HDFS → PTail → Log Consumer; HDFS → Continuous Copier → HDFS → PTail (in the plan); Zookeeper coordinates the Calligraphus tiers

• 9GB/sec at peak, 10 sec latency, 2500 log categories
Calligraphus
• RPC → File System
▪   Each log category is represented by 1 or more FS directories
▪   Each directory is an ordered list of files

• Bucketing support
▪   Application buckets are application-defined shards.
▪   Infrastructure buckets allow a single log stream to scale from a few B/s to many GB/s (see the sketch below).

• Performance
▪   Latency: Call sync every 7 seconds
▪   Throughput: Easily saturate 1Gbit NIC
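
Bucketing is easiest to see as a routing function from a log line to an FS directory. Below is a minimal sketch of that idea; the directory layout, the hash, and the parameter names are assumptions for illustration, not Calligraphus internals.

```python
import hashlib

def infra_bucket_dir(category, app_bucket, num_infra_buckets, line_key):
    """Route one log line to an FS directory.

    Each (category, application bucket) pair fans out over a configurable
    number of infrastructure buckets, which is what lets one stream grow
    from a few B/s to many GB/s. The directory layout is invented here.
    """
    h = int(hashlib.md5(line_key.encode()).hexdigest(), 16)
    infra_bucket = h % num_infra_buckets
    return "/calligraphus/{}/app={}/infra={}".format(category, app_bucket, infra_bucket)

# A small category can use 1 infrastructure bucket; a huge one can use many.
print(infra_bucket_dir("ad_clicks", app_bucket=7, num_infra_buckets=32, line_key="line-12345"))
```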
Continuous Copier
• File System → File System

• Low latency and smooth network usage

• Deployment
▪   Implemented as long-running map-only job
▪   Can move to any simple job scheduler

• Coordination
▪   Use lock files on HDFS for now (see the sketch below)
▪   Plan to move to Zookeeper
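
The lock-file coordination can be sketched as "create a lock file if absent; whoever succeeds owns that source file". The snippet below models this with a local directory standing in for HDFS; the paths and naming are invented.

```python
import os

def try_claim(lock_dir, file_name, worker_id):
    """Claim a source file by atomically creating its lock file.

    O_CREAT | O_EXCL fails if the lock already exists, so exactly one worker
    wins; HDFS offers a comparable create-if-absent primitive.
    """
    os.makedirs(lock_dir, exist_ok=True)
    lock_path = os.path.join(lock_dir, file_name + ".lock")
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False                      # another copier already owns this file
    os.write(fd, worker_id.encode())
    os.close(fd)
    return True

if try_claim("/tmp/copier-locks", "category_A.00042", "worker-3"):
    print("copying category_A.00042 ...")  # copy the file, then record completion
```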
PTail
[Diagram] PTail tails the ordered file lists of multiple directories; a checkpoint records the current position in each

• File System → Stream ( → RPC )

• Reliability
▪   Checkpoints inserted into the data stream
▪   Can roll back and tail again from any checkpoint (see the sketch below)
▪   No data loss/duplicates
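
A rough sketch of checkpointed tailing for a single directory, to make the loss/duplicate guarantee concrete. The checkpoint layout (current file plus byte offset) is an assumption, not PTail's actual format.

```python
import os

def tail_with_checkpoint(directory, checkpoint):
    """Yield (line, new_checkpoint) pairs, resuming from `checkpoint`.

    checkpoint = {"file": <file name or None>, "offset": <bytes already read>}.
    Re-running from a saved checkpoint reproduces exactly the same stream,
    which is where "no data loss/duplicates" at the roll-back point comes from.
    """
    files = sorted(os.listdir(directory))        # a directory is an ordered list of files
    start = files.index(checkpoint["file"]) if checkpoint["file"] in files else 0
    for name in files[start:]:
        offset = checkpoint["offset"] if name == checkpoint["file"] else 0
        with open(os.path.join(directory, name), "rb") as f:
            f.seek(offset)
            for line in f:
                offset += len(line)
                yield line, {"file": name, "offset": offset}

# Usage (hypothetical path): persist the latest checkpoint with the consumer's state.
# for line, ckpt in tail_with_checkpoint("/data/category_A", {"file": None, "offset": 0}):
#     consume(line, ckpt)
```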
Channel Comparison
             Push / RPC   Pull / FS
Latency      1-2 sec      10 sec
Loss/Dups    Few          None
Robustness   Low          High
Complexity   Low          High

[Diagram] Four components bridge the two channels: Scribe (Push/RPC → Push/RPC), Calligraphus (Push/RPC → Pull/FS), Continuous Copier (Pull/FS → Pull/FS), PTail + ScribeSend (Pull/FS → Push/RPC)
Puma
real-time aggregation/storage
Overview


[Diagram] Log Stream → Aggregations → Storage; Serving reads from Aggregations and Storage
• ~ 1M log lines per second, but a light read load

• Multiple Group-By operations per log line (see the sketch after this list)

• The first key in Group By is always time/date-related

• Complex aggregations: Unique user count, most frequent
  elements
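
A toy example of what "multiple Group-By operations per log line, first key time-related" looks like: one impression line updates several (time bucket, dimension) counters at once. The column names are invented.

```python
from collections import Counter

counters = {"by_gender": Counter(), "by_age": Counter(), "by_country": Counter()}

def process(line):
    """Apply every Group-By to one log line; the leading key is the hour bucket."""
    hour = line["time"] - line["time"] % 3600
    counters["by_gender"][(hour, line["adid"], line["gender"])] += 1
    counters["by_age"][(hour, line["adid"], line["age"])] += 1
    counters["by_country"][(hour, line["adid"], line["country"])] += 1

process({"time": 1322870461, "adid": 42, "gender": "f", "age": 23, "country": "US"})
print(counters["by_age"])
```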
MySQL and HBase: one page
                   MySQL                        HBase
Parallel           Manual sharding              Automatic load balancing
Fail-over          Manual master/slave switch   Automatic
Read efficiency    High                         Low
Write efficiency   Medium                       High
Columnar support   No                           Yes
Puma2 Architecture




[Diagram] PTail → Puma2 → HBase → Serving

• PTail provides parallel data streams

• For each log line, Puma2 issues “increment” operations to HBase (see the sketch below). Puma2 is symmetric (no sharding).

• HBase: single increment on multiple columns
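
A sketch of the Puma2 write path, with a plain dict standing in for HBase: each log line turns into one increment that touches several columns of a time-bucketed row. The row-key scheme and column names are assumptions.

```python
from collections import defaultdict

hbase = defaultdict(lambda: defaultdict(int))    # row key -> column -> counter

def puma2_increment(line):
    """One log line -> one multi-column "increment" on a time-bucketed row."""
    hour = line["time"] - line["time"] % 3600
    row = "{}:{}".format(hour, line["adid"])     # the first key is time-related
    cols = {"impressions": 1,
            "gender:{}".format(line["gender"]): 1,
            "age:{}".format(line["age"]): 1}
    # Real Puma2 sends these as a single HBase increment touching all columns;
    # because Puma2 boxes are symmetric, several of them may hit the same row.
    for col, delta in cols.items():
        hbase[row][col] += delta

puma2_increment({"time": 1322870461, "adid": 42, "gender": "f", "age": 23})
print({row: dict(cols) for row, cols in hbase.items()})
```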
Puma2: Pros and Cons
• Pros
▪   Puma2 code is very simple.
▪   Puma2 service is very easy to maintain.

• Cons
▪   “Increment” operation is expensive.
▪   Does not support complex aggregations.
▪   Hacky implementation of “most frequent elements”.
▪   Can cause small data duplicates.
Improvements in Puma2
• Puma2
▪   Batching of requests. Didn't work well because of long-tail distribution.

• HBase
▪   “Increment” operation optimized by reducing locks.
▪   HBase region/HDFS file locality; short-circuited read.
▪   Reliability improvements under high load.

• Still not good enough!
Puma3 Architecture



[Diagram] PTail → Puma3 → HBase; Serving reads from Puma3 and HBase

• Puma3 is sharded by aggregation key.

• Each shard is a hashmap in memory.

• Each entry in the hashmap is a pair of an aggregation key and a user-defined aggregation (see the sketch below).

• HBase as persistent key-value storage.
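
A minimal sketch of the Puma3 data structure: keys are sharded across processes, and each shard holds a hashmap from aggregation key to a user-defined aggregation object (anything with an add method). Class names and the count aggregation are illustrative only, not Puma3's actual code.

```python
class CountAgg:
    """A user-defined aggregation: anything with add() and value() works."""
    def __init__(self):
        self.n = 0
    def add(self, value):
        self.n += 1
    def value(self):
        return self.n

class Puma3Shard:
    def __init__(self):
        self.table = {}                      # aggregation key -> aggregation object

    def write(self, key, value, agg_factory=CountAgg):
        """Write path: look up the key, create the aggregation if new, feed the value."""
        self.table.setdefault(key, agg_factory()).add(value)

def shard_for(key, shards):
    """Sharding by aggregation key: a given key only ever lands on one Puma3 process."""
    return shards[hash(key) % len(shards)]

shards = [Puma3Shard() for _ in range(4)]
key = ("2011-12-02 10:00", "adid=42", "age=23")
shard_for(key, shards).write(key, value=1)
print(shard_for(key, shards).table[key].value())   # -> 1
```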
Puma3 Architecture



[Diagram] PTail → Puma3 → HBase; Serving

• Write workflow
▪   For each log line, extract the columns for key and value.
▪   Look up the key in the hashmap and call the user-defined aggregation with the value.
Puma3 Architecture



[Diagram] PTail → Puma3 → HBase; Serving

• Checkpoint workflow (see the sketch below)
▪   Every 5 min, save the modified hashmap entries and the PTail checkpoint to HBase
▪   On startup (after node failure), load the state from HBase
▪   Get rid of items in memory once the time window has passed
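
A sketch of the checkpoint/recovery idea, with a dict standing in for HBase: dirty hashmap entries and the PTail checkpoint are persisted together, closed time windows are dropped from memory, and recovery reloads the saved state plus the stream position. The key layout and window handling are assumptions.

```python
class CheckpointingShard:
    def __init__(self, storage, window_secs=3600):
        self.storage = storage          # stands in for HBase: key -> value
        self.window_secs = window_secs
        self.table = {}                 # (time_bucket, key) -> running count
        self.dirty = set()

    def write(self, time_bucket, key, delta=1):
        k = (time_bucket, key)
        self.table[k] = self.table.get(k, 0) + delta
        self.dirty.add(k)

    def checkpoint(self, ptail_ckpt, now):
        """Every ~5 min: persist modified entries and the PTail checkpoint together."""
        for k in self.dirty:
            self.storage[k] = self.table[k]
        self.storage["__ptail_checkpoint__"] = ptail_ckpt
        self.dirty.clear()
        # Drop entries whose time window has passed; HBase keeps the final value.
        cutoff = now - self.window_secs
        for k in [k for k in self.table if k[0] < cutoff]:
            del self.table[k]

    def recover(self):
        """After a crash: reload state from storage, resume PTail at the checkpoint."""
        self.table = {k: v for k, v in self.storage.items()
                      if k != "__ptail_checkpoint__"}
        return self.storage.get("__ptail_checkpoint__")

hbase = {}
shard = CheckpointingShard(hbase)
shard.write(time_bucket=1322870400, key="adid=42")
shard.checkpoint(ptail_ckpt={"file": "a.0001", "offset": 1024}, now=1322874300)
print(hbase[(1322870400, "adid=42")], shard.recover())
```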
Puma3 Architecture



[Diagram] PTail → Puma3 → HBase; Serving

• Read workflow (see the sketch below)
▪   Read uncommitted: serve directly from the in-memory hashmap; load from HBase on a miss.
▪   Read committed: read from HBase and serve.
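
The two read modes, sketched on top of the structures above (names are illustrative): read-uncommitted serves from memory and only touches HBase on a miss, while read-committed returns only what the last checkpoint persisted.

```python
def read_uncommitted(shard_table, hbase, key):
    """Fresh (seconds-old) value: memory first, HBase only on a miss
    (typically because the key's time window has already passed)."""
    if key in shard_table:
        return shard_table[key]
    return hbase.get(key)

def read_committed(hbase, key):
    """Only what the last checkpoint persisted; may lag by up to ~5 minutes."""
    return hbase.get(key)
```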
Puma3 Architecture



[Diagram] PTail → Puma3 → HBase; Serving

• Join (see the sketch below)
▪   Static join table in HBase.
▪   Distributed hash lookup in a user-defined function (udf).
▪   Local cache improves the throughput of the udf a lot.
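
A sketch of the table-stream lookup join: a UDF does a point lookup into a static dimension table kept in HBase, with a local cache in front so hot keys skip the remote call. The table contents and cache size are invented.

```python
import functools

STATIC_DIM_TABLE = {"adid=42": {"campaign": "holiday_2011"}}   # stands in for the HBase join table

def hbase_get(row_key):
    """Stand-in for the remote HBase point lookup (the expensive part)."""
    return STATIC_DIM_TABLE.get(row_key)

@functools.lru_cache(maxsize=100_000)
def join_lookup(row_key):
    """UDF used inside the aggregation query; the local cache absorbs hot keys."""
    return hbase_get(row_key)

row = join_lookup("adid=42")
print(row["campaign"] if row else None)
```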
Puma2 / Puma3 comparison
• Puma3 is much better in write throughput
▪   Use 25% of the boxes to handle the same load.
▪   HBase is really good at write throughput.

• Puma3 needs a lot of memory
▪   Use 60GB of memory per box for the hashmap
▪   SSD can scale to 10x per box.
Puma3 Special Aggregations
• Unique Counts Calculation
▪   Adaptive sampling
▪   Bloom filter (in the plan)

• Most frequent item (in the plan)
▪   Lossy counting
▪   Probabilistic lossy counting
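
For "most frequent elements", lossy counting is a published algorithm (Manku & Motwani), so here is a compact version to show the idea: frequencies are tracked with bounded memory and an undercount of at most ε·N. This is the textbook algorithm, not Puma's code.

```python
def lossy_count(stream, epsilon=0.001):
    """Classic lossy counting: approximate counts with error <= epsilon * N."""
    width = int(1 / epsilon)                  # bucket width
    counts, deltas = {}, {}
    for n, item in enumerate(stream, 1):
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = (n - 1) // width   # maximum possible undercount so far
        if n % width == 0:                    # end of bucket: prune infrequent items
            bucket = n // width
            for it in [it for it, c in counts.items() if c + deltas[it] <= bucket]:
                del counts[it], deltas[it]
    return counts                             # item -> lower bound on its true count

print(lossy_count(["a", "b", "a", "c", "a"] * 1000, epsilon=0.01))
```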
PQL – Puma Query Language
• CREATE INPUT TABLE t ('time', 'adid', 'userid');

• CREATE VIEW v AS
  SELECT *, udf.age(userid)
  FROM t
  WHERE udf.age(userid) > 21

• CREATE HBASE TABLE h …

• CREATE LOGICAL TABLE l …

• CREATE AGGREGATION 'abc'
  INSERT INTO l (a, b, c)
  SELECT
    udf.hour(time),
    adid,
    age,
    count(1),
    udf.count_distinct(userid)
  FROM v
  GROUP BY
    udf.hour(time),
    adid,
    age;
Future Work
challenges and opportunities
Future Work
• Scheduler Support
▪   Just need simple scheduling because the workload is continuous

• Mass adoption
▪   Migrate most daily reporting queries from Hive

• Open Source
▪   Biggest bottleneck: Java Thrift dependency
▪   Will come one by one
Similar Systems
• STREAM from Stanford

• Flume from Cloudera

• S4 from Yahoo

• Rainbird/Storm from Twitter

• Kafka from LinkedIn
Key differences
• Scalable Data Streams
▪   9 GB/sec with < 10 sec of latency
▪   Both Push/RPC-based and Pull/File System-based
▪   Components to support arbitrary combination of channels

• Reliable Stream Aggregations
▪   Good support for Time-based Group By, Table-Stream Lookup Join
▪   Query Language:    Puma : Realtime-MR = Hive : MR
▪   No support for sliding window, stream joins
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0

Speaker notes

  1. Good morning everyone. My name’s Zheng Shao. Today I am going to talk about Real-time Analytics at Facebook.
  2. This is the agenda of the talk. We will start with why we need realtime analytics, then get into details of how we implemented it, and finally future works and comparisons with other systems.
  3. First of all, what is realtime analytics and why we want to do it.
  4. This is the main use case for our analytics. We have a product called Facebook Insights, which allows website owners, advertisers, Facebook application developers, and Facebook page owners to view the time series of impression/click/action counters, the counters broken down by demographics like gender and age, as well as the unique user counters and heavy hitters like the most popular URLs. The major challenges of building the backend of this Insights product are twofold. On one hand, we have a huge amount of data coming from both Facebook and non-Facebook websites. On the other hand, customers of the Insights product really want low-latency summaries, so that they can immediately know how popular a new article or a new game is.
  5. We did have an existing, complete data warehouse solution at Facebook to handle the Insights workload. In short, log streams got generated from HTTP servers and transferred to NFS via a log collection framework called Scribe, all within seconds, and then got copied/loaded into Hadoop. Summaries got generated by daily pipeline jobs and eventually got loaded into MySQL for serving. Specifically, we have a 3000-node Hadoop cluster to handle the scalability issue. Copier/Loader are map-reduce jobs which handle machine failures automatically. And Pipeline Jobs are written in Hive, which has a SQL-like syntax. Pretty good scalability, until we hit the data center power limit. But latency is terrible.
  6. We got 2 ideas on how to improve the latency. The first one is small-batch processing. Instead of using a batch of 1 day, we can produce much smaller batches. The question is how to reduce the per-batch overhead, so that tiny batches like 1 min or less make sense. The second one is stream processing. We can aggregate the data as soon as it arrives. This will produce near-realtime results. The question is how to make the system reliable against hardware failures. It turns out the per-batch overhead of Map-Reduce is so high that it’s not practical to have even 5-minute batches on our Hadoop cluster, so we finally decided to go with stream processing.
  7. The rest of the talk will focus on two key systems that we built for realtime analytics.The first one, Data Freeway, is a scalable data stream framework on top of Scribe and HDFS.The second one, Puma, is a reliable stream aggregation engine on top of HBase.
  8. This was our old data stream framework. It has several layers of data transportation. The first transport, from clients to the mid-tier, is to reduce the fanout from tens of thousands to hundreds; the second transport is to shuffle the data based on log categories, so that one log category goes to a single writer. Then log data gets written into NFS, which is consumed by the batch copier as well as unix tail/fopen. In short, it’s a simple push/RPC-based logging system. Scribe was open-sourced in 2008, when we had 100 log categories. It quickly got adopted by a lot of other companies. The routing is driven by static configuration, which is flexible but has two problems: 1. it is not scalable, because we need to maintain a config for each box in the writers, and a single writer is not scalable; 2. the writers are a single point of failure.
  9. We came up with Data Freeway in 2011. Right now it’s handling 9GB/sec of data at peak with 10 sec end-to-end latency, and has over 2500 log categories.It contains 4 major components.The first one is scribe. It’s used only at the client, responsible for sending out data via RPCs. The second one is called Calligraphus. It utilizes Zookeeper to manage the ownership of categories, shuffles the data and write to HDFS.The third one is called Continuous Copier, which continuously copies files from one HDFS to another, as the file grows.The fourth one is called PTail, which in parallel tails multiple directories on HDFS and writes out to stdout. Right now we directly ptail from the HDFS written by Calligraphus, but we plan to tail from the HDFS written by Continuous Copier in the future.Let’s get into details of these components.
  10. Calligraphus is responsible for getting log data from RPC and write to File System.Each log category is represented by 1 or more FS directories.Each directory is an ordered list of files, with date in the file name. The files can be compressed.This is a very simple protocol for storing log data. Probably the simplest that I can think of.The most interesting feature about Calligraphus is the bucketing support.We have application buckets, which are application-defined shards. These are used for sharded log consumers. Most of the big log consumers are sharded because their log stream is too big.We also support infrastructure buckets, which allow a single application bucket to have a throughput from several bytes per second to several gigabytes per second. Each infrastructure bucket is a directory. So big streams can go to multiple directories at the same time.Calligraphus has a pretty high performance. We call File System sync every 7 seconds, which is the major source of data latency right now. The network throughput can easily saturate 1Gbit NIC, and we are planning to use 10Gbit NIC some time soon.
  11. Continuous Copier is for continuous data transfer from one File System to another.Compared with the batch-based map-reduce copier, it provide much lower latency as well as smooth network usage.Right now it’s implemented as a long-running map-only job, but it can be easily moved to any simple job scheduling system other than map-reduce.Right now it uses lock files in HDFS for coordination among different nodes, and we plan to move to Zookeeper very soon.The peak throughput of continuous copier in production is about 3GB/sec compressed right now.
  12. The last component in Data Freeway is PTail, which transfers data from a File System to an output stream.The key feature of PTail is the checkpoint. A PTail checkpoint contains the current files and the file offsets in each of the directories. This makes it possible for PTail to roll back to an earlier checkpoint, and reproduce the data stream without any data loss/duplicates at the boundary.
  13. To wrap up Data Freeway, we support 2 channels for data transfers.Push via RPC has lower latency, can potentially have some loss/dups when network has a problem, is less robust with respect to machine failures, and has a very low complexity in code.Pull via FS has a longer latency, but it does not have any loss/dups, and is robust to machine failures. The problems is that the code of the File System, especially HDFS, can be pretty complex, and we still need to identify and fix some bugs there.Data Freeway consists of 4 components that allows data transfer between these 2 channels.
  14. This is the simplified architecture of a typical stream aggregation engine.Log streams get aggregated on a set of machines. The summaries is usually saved to storage for persistence. Online serving get summaries from either the aggregations directly or from the storage. Usually the write throughput is much higher than the read, because analytics data is only viewed by the owners of the website, e.g.In our environment, we have on the order of 1M log lines per second. For each of the log lines, we need to do multiple group-by operations, like by age, or by gender. The first key in group by is always time/date-related which means the summaries will become static after some time. Also we need to support complex aggregations like unique counts and heavy hitters.
  15. Let’s look at our storage choices first.We considered using either MySQL or HBase as our storage engine. HBase is much easier to manage in a distributed environment, which was the major reason that we chose HBase. It also has better write efficiency as well as Columnar support. The read efficiency is inferior because HBase’s cache has less memory space efficiency.
  16. The first architecture that we came up is called Puma2.We run Puma2 on a set of machines, and use PTail to provide parallel data streams. For each log line, Puma2 issues “increment” operations to HBase. Note that Puma2 servers are all symmetric, which means the same row in HBase can be incremented by multiple Puma2 at the same time.HBase can do single increment operation on multiple columns of the same row. So we can use a single increment operation in HBase to handle multiple Group-By’s.Puma2 went into production in March 2011 and is handling 600K log lines on 100 boxes (Puma2 + HBase)
  17. Here are the pros and cons of the Puma2 architecture. The good thing about Puma2 is that it is extremely simple and easy to maintain. The root reason is that Puma2 servers are symmetric and almost stateless. The only state is the PTail checkpoint, which is saved to HBase periodically. As a result, we can easily add more boxes or reboot a box if it went down. However, Puma2 also has its problems. First of all, the HBase increment operation is expensive because it’s a read-and-write, and the read is expensive. It’s also not possible to support aggregations other than counts, because that would need a lot of customized code in HBase. We did a hacky implementation of “most frequent elements” with multiple layers of “frequent element tables”. Finally, Puma2 can have small data duplicates because “increments” and checkpoint writes are not in a single transaction.
  18. We did some small improvements to Puma2.On the Puma2 service, an obvious idea is to batch the increment requests to reduce the load on HBase. However, it didn’t work well because of the long-tail distribution of Group-By keys. It also made data less accurate because we cannot save checkpoints in the middle of a batch.On the HBase side, we first optimized the “increment” operation by reducing the number of locks. Another big efficiency improvement came from the short-circuited read from HBase directly to HDFS block files on the disk, instead of via DataNode daemon. We also improved the HBase reliability under the high load.All in all, we are still not happy about Puma2, especially when we try to support unique counters. So we switched to a new architecture called Puma3.
  19. The biggest difference between Puma2 and Puma3 is that in Puma3, we do aggregations in the memory of Puma3 process instead of in HBase. Local memory operations are much faster so that we can achieve a much higher throughput.In order to make in-memory aggregations, we made Puma3 sharded by aggregation key. That means the input PTail data stream has to be sharded as well. That is supported by the application bucketing feature from Calligraphus.Each shard of Puma3 is basically a hashmap in memory. Each entry of the hashmap is a pair of an aggregation key and a user-defined aggregation, which can be count, sum, avg, or anything.We use HBase as a persistent storage but usually don’t read from it.
  20. The write workflow for Puma3 is pretty simple. Basically, for each log line, we extract the columns for key and value. We use the key to look up the in-memory hashmap, and call the user-defined aggregation with the value. Note that, since the log streams are sharded by aggregation key, the same aggregation key won’t appear in more than one Puma3 process. This is the key to making Puma3 work.
  21. We checkpoint the state of Puma3 process into HBase every 5 minutes. Basically, we save all the modified hashmap entries as well as the PTail checkpoint. That means if Puma3 crashes and restarts, it can load the state from HBase via sequential read, which is pretty fast in HBase.In order to save memory, we also get rid of hashmap entries from memory once the time window for the aggregation has passed, because we are not going to receive new log lines for that time window again.
  22. There are 2 choices for the read workflow. If we want to read uncommitted aggregations, which usually have about 10 seconds of latency, we serve directly from the in-memory hashmap. We go to HBase only on a miss, which will only happen if the time window of the aggregation has passed. If we want to read committed data, Puma3 will read from HBase and serve. Note that an uncommitted aggregation result can decrease in value if the Puma3 process dies before making the next checkpoint. We plan to have a cache layer between serving and Puma3 to make sure numbers don’t decrease.
  23. Puma3 also supports joining with a static table in HBase. The join key has to be the row key in the static HBase table. It’s implemented as a simple distributed hash lookup in a user-defined function. We have found that local cache improves the throughput of the udf a lot.
  24. Comparing Puma2 and Puma3, we found that Puma3 is much better in writer throughput. We only need to use 25% of the boxes to handle the same work load. The main reason is that HBase is really good at write throughput.At the same time, Puma3 needs a lot of memory. Basically, all aggregations that can change needs to be stored in memory, to ensure the log stream write throughput. Right now we use 60GB of memory per box for the hashmap. In the future, we may use SSD that can easily scale to 10x more space per box.
  25. With Puma3, we can easily support these special aggregations, with some approximation.For unique counts, we have implemented a simple adaptive sampling algorithm, that samples more aggressively when the numbe of unique item increases. We can also easily implement the standard bloom filter for counting.For the most frequent items, we plan to implement the classic lossy counting algorithm and probabilistic lossy counting algorithm.
  26. The most important feature of Puma that distinguishes it from other stream processing projects is the language.We have built a SQL-like query language that allows us to define the input stream, the output table, as well as the query itself. Note that the query contains user-defined functions for Join as well as Aggregations.Puma3 is right now in pre-production stage. We plan to push it out in production as soon as we verified all the summaries against Puma2 and Hive.
  27. Here are a list of things we plan to do next.First is simple scheduling for Puma3. We just need very simple scheduling because the work load is continuous. Most likely we will reuse some existing frameworks.Second is the mass adoption inside the company. We plan to migrate most daily reporting queries from Hive, as long as the query is simple enough to be supported by Puma. This will reduce the latency as well as improve the efficiency, because of the saving in compression/decompression.The third one is open-source. Right now, the biggest bottleneck is Java Thrift which has diverged between Facebook and open-source. We plan to open-source the projects one by one, starting from Calligraphus.
  28. There are lots of similar systems in academia as well as other companies.
  29. Instead of comparing them one by one, I will end the presentation by a summary of the key differences.Data Freeway is a scalable data stream framework with 9GB/sec throughput and 10 sec latency. It supports both Push/RPC-based and Pull/File System-based channels. We have components to support arbitrary combination of channels to adapt to the use case.Puma is a reliable stream aggregation engine. It has good support for time-window-based Group By as well as table-stream Lookup Join. It has a query language that makes Puma comparable to Hive when comparing Realtime-MR and MR. Puma has no support and no plan to support sliding window and stream joins because those are very hard problems that we don’t see in our environment.