SlideShare a Scribd company logo
1 of 38
Download to read offline
Operations and Big Data:
Hadoop, Hive and Scribe


Zheng Shao 微博:@邵铮9
12/7/2011 Velocity China 2011
Agenda
 1   Operations: Challenges and Opportunities

 2   Big Data Overview

 3   Operations with Big Data

 4   Big Data Details: Hadoop, Hive, Scribe

 5   Conclusion
Operations
challenges and opportunities
Operations




Measure
                      Model
   and                         Under-
           Collect     and              Improve   Monitor
Instrume                       stand
                     Analyze
    nt
Operations




Measure
                      Model
   and                         Under-
           Collect     and              Improve   Monitor
Instrume                       stand
                     Analyze
    nt
Challenges
•  Huge amount of data
▪  Sampling    may not be good enough

•  Distributed environment
▪  Log   collection is hard
▪  Hardware    failures are normal
▪  Distributed   failures are hard to understand
Example 1: Cache miss and performance

                     •  Memcache layer has a bug that
                        decreased the cache hit rate by
                        half
          Memcache   •  MySQL layer got hit hard and
  Web                   performance of MySQL degraded
            MySQL
                     •  Web performance degraded
Example 2: Map-Reduce Retries

                       •  Attempt 1 hits a transient
           Attempt 1      distributed file system issue and
                          failed
           Attempt 2   •  Attempt 2 hits a real hardware
Map Task                  issue and failed
           Attempt 3
                       •  Attempt 3 hits a transient
                          application logic issue and failed
           Attempt 4
                       •  Attempt 4, by chance, succeeded

                       •  The whole process slowed down
Example 3: RPC Hierarchy

                          •  RPC 3A failed

                PRC 1A •  The whole RPC 0 failed because
        RPC 1               of that
                PRC 1B
RPC 0   RPC 2          •  The blame was on owner of
                RPC 3A
        RPC 3             service 3 because the log in
                RPC 3B service 0 shows that.
Example 4: Inconsistent results in RPC

                     •  RPC 0 got results from both RPC
                        1 and RPC 2


            RPC 1    •  Both RPC 1 and RPC 2
                        succeeded
 RPC 0
            RPC 2    •  But RPC 0 detects that the
                        results are inconsistent and fails

                     •  We may not have logged any
                        trace information for RPC 1 and
                        RPC 2 to continue debugging.
Opportunities
•  Big Data Technologies
                                         Logging	
  Logging	
  Logging	
  
▪  Distributed   logging systems
▪  Distributed   storage systems                                         Collect

▪  Distributed   computing systems
                                         Storage	
  Storage	
  Storage	
  
•  Deeper Analysis
                                                                         Model
▪  Data   mining and outlier detection
▪  Time-series   analysis
                                         Compu0ng	
   Compu0ng	
  Compu0ng	
  
Big Data Overview
An example from Facebook
Big Data
•  What is Big Data?
▪    Volume is big enough and hard to be managed by traditional technologies
▪    Value is big enough not to be sampled/dropped

•  Where is Big Data used?
▪    Product analysis
▪    User behavior analysis

▪    Business intelligence

•  Why use Big Data for Operations?
▪    Reuse existing infrastructure.
Overall Architecture

                                                  ~3GB/sec Near-realtime Processing
                       Scribe	
  Policy	
  
      PHP	
  

                                                     PTail	
              Puma	
          HBase	
  

      C++	
  
                              Scribe-­‐HDFS	
     ~6GB/sec        Batch Processing

                           ~9GB/sec
      Java	
  
   Nectar	
                                       Copy/Load	
       Central	
  HDFS	
     Hive	
  
Scribe	
  Client	
  
Operations
with Big Data
logview
•  Features
▪  PHP   Fatal StackTrace
▪  Group   StackTrace by similarity, order by counts
▪  Integrated    with SVN/Task/Oncall tools
▪  Low-pri:   Scribe can drop logview data




                                            Scribe	
     Log	
  View	
     HTTP	
  
         PHP	
  	
  Scribe	
  Client	
     Mid0er	
  
logmonitor
•  Rules
▪  Regular-expression           based: ".*Missing Block.*"
▪  Rule   has levels: WARN, ERROR, etc
▪  Dynamic     rules
                            Propagate	
                                           Modify	
  
                              rules	
                    Rules	
                   rules	
  
                                                        Storage	
  
      Apply	
  rules	
  

                                  <RuleName
                                    ,	
  Count,	
                         Top	
  Rules	
  
   PTail	
  /	
  Local	
  Log	
   Examples>	
   Logmonitor	
  
  Logmonitor	
  Client	
                                                                      Web	
  
                                                    Stats	
  Server	
  
Self Monitoring
•  Goal:
▪  Set   KPIs for SOA
                                                                logs	
        Scribe	
  
▪  Isolate   issues in distributed systems
▪  Make   it easy for service owners to monitor                            PTail,	
  Hive	
  
                                                  Service	
  
•  Approach
                                                           JMX	
  
▪  Log4J     integration with Scribe                   ThriR/Fb303	
  
                                                         counter	
  
▪  JMX/Thrift/Fb303     counters                          query	
  
                                                                               Service	
  
▪  Client-side   logging + Server-side logging                                 Owner	
  
Global Debugging with PTail
•  Logging instruction
▪  Logging   levels
▪  Logging   destination (log name)                                           Log	
  data	
  
                                                             Scribe	
                             PTail	
  
▪  Additional   fields: Request ID

                                    Log	
  data	
                 Log	
  data	
         Log	
  data	
  



                                            RPC	
  +	
  
                                           logging	
  
                                         instruc0ons	
   Service	
  3	
  
                                                                               RPC	
  +	
   Service	
  2	
  
                      Service	
  1	
  
                                                                              logging	
  
                                                                            instruc0ons	
  
Hive Pipelines
•  Daily and historical data analysis
▪  What   is the trend of a metric?
▪  When   did this bug first happen?

•  Examples
▪  SELECT    percentile(latency, “50,75,90,99”) FROM latency_log;
▪  SELECTrequest_id, GROUP_CONCAT(log_line) as total_log
 FROM trace GROUP BY request_id
 HAVING total_log LIKE "%FATAL%“;
Big Data Details
Hadoop, Hive, Scribe
Key Requirements

•  Ease of use                        •  Latency
▪  Smooth    learning curve           ▪  Real-time    data
▪  Easy    integration                ▪  Historical   data
▪  Structured/unstructured    data
                                      •  Reliability
▪  Schema     evolution
                                      ▪  Low   data loss
•  Scalable                           ▪  Consistent    computation
▪  Spiky   traffic and QoS
▪  Raw    data / Drill-down support
Overall Architecture

                                                  ~3GB/sec Near-realtime Processing
                       Scribe	
  Policy	
  
      PHP	
  

                                                     PTail	
              Puma	
          HBase	
  

      C++	
  
                              Scribe-­‐HDFS	
     ~6GB/sec        Batch Processing

                           ~9GB/sec
      Java	
  
   Nectar	
                                       Copy/Load	
       Central	
  HDFS	
     Hive	
  
Scribe	
  Client	
  
Distributed Logging System - Scribe




•  https://github.com/facebook/scribe
Distributed Logging System - Scribe

                                 Scribe	
  Policy	
  
                                       	
  
                                Traffic/Schema	
  
          Meta	
  Data	
         management	
                Meta	
  Data	
  



                                  ThriR	
  RPC	
        Scribe	
  Service	
               Log	
  Data	
  
      Applica0on	
                                      Log	
  Collec0on	
               ThriR	
  RPC	
  
        Nectar	
  Library	
        Log	
  Data	
  
Easy	
  integra0on/schema	
       <category,	
                         Log	
  Data	
  
           evolu0on	
             message>	
  
                                                           Local	
  Disk	
  
Scribe Improvements
•  Network efficiency
▪  Per-RPC   Compression (use quicklz)

•  Operation interface
▪  Category-based   blacklisting and sampling

•  Adaptive logging
▪  Use   BufferStore and NullStore to drop messages as needed

•  QoS
▪  Use   separate hardware for now
Distributed Storage Systems - Scribe-HDFS
•  Architecture
▪  Client                                        C1                C2           DataNode

▪  Mid-tier


▪  Writers                                       C1                C2           DataNode


•  Features                 Scribe	
  
                            Clients	
     Calligraphus	
     Calligraphus	
       HDFS	
  
                                            Mid-­‐0er	
        Writers	
  
▪  Scalability:   9GB/sec
▪  No
    single point of failure
 (except NameNode)                                 Zookeeper	
  

•  Not open-sourced yet
Distributed Storage Systems - HDFS
•  Architecture
▪  NameNode:      namespace, block locations
▪  DataNodes:     data blocks replicated 3 times

•  Features                                  Name	
  Node	
  
▪  3000-node,    PBs of spaces
                                                                   HDFS	
  Client	
  
▪  Highly   reliable
▪  No   random writes
                                               Data	
  Nodes	
  
•  https://github.com/facebook/hadoop-20
HDFS Improvements
•  Efficiency
▪  Random    read keep-alive: HDFS-941
▪  Faster   checksum - HDFS-2080
▪  Use   fadvise - HADOOP-7714

•  Credits:
▪  http://www.cloudera.com/resource/hadoop-world-2011-presentation-

 slides-hadoop-and-performance
Distributed Storage Systems - HBase
•  Architecture
▪  <row,   col-family, col, value>
▪  Write-Ahead    Log
▪  Records    are sorted in memory/files       Master	
  

•  Features                                                        HBase	
  Client	
  
▪  100-node.


▪  Random     read/write.
                                           Region	
  Servers	
  
▪  Great   write performance.
▪  http://svn.apache.org/viewvc/hbase/branches/0.89-fb/
Distributed Computing Systems – MR
•  Architecture
▪  JobTracker


▪  TaskTracker                    JobTracker	
  
▪  MR    Client                                     MR	
  Client	
  

•  Features
▪  Push   computation to data     TaskTracker	
  
▪  Reliable   - Automatic retry
▪  Not   easy to use
MR Improvements
•  Efficiency
▪  Faster   compareBytes: HADOOP-7761
▪  MR   sort cache locality: MAPREDUCE-3235
▪  Shuffle:   MAPREDUCE-64, MAPREDUCE-318



•  Credits:
▪  http://www.cloudera.com/resource/hadoop-world-2011-presentation-

 slides-hadoop-and-performance
Distributed Computing Systems – Hive
•  Architecture
▪  MetaStore


▪  Compiler                     MetaStore	
  
▪  Execution                                          Hive	
  cmd	
  line	
  
                                                        Compiler	
  
                                                        MR	
  Client	
  
•  Features
▪  SQL    Map-Reduce          Map-­‐Reduce	
  
                               Task	
  Trackers	
  
▪  Select,   Group By, Join
▪  UDF,   UDAF, UDTF, Script
Useful Features in Hive
•  Complex column types
▪  Array,   Struct, Map, Union
▪  CREATE      TABLE (a struct<c1:map<string,string>,c2:array<string>>);

•  UDFs
▪  UDF,   UDAF, UDTF

•  Efficient Joins
▪  Bucketed    Map Join: HIVE-917
Distributed Computing Systems – Puma
•  Architecture
▪  HDFS


▪  PTail


▪  Puma

                                         HDFS	
                              HBase	
  
▪  HBase



•  Features
                                                    PTail	
  +	
  Puma	
  
▪  StreamSQL:     Select, Group By, Join
▪  UDF,    UDAF
▪  Reliable   – No data loss/duplicate
Conclusion
Big Data can help operations
Big Data can help Operations
•  5 Steps to make it effective:
▪  Make    Big Data easy to use
▪  Log   more data and keep more sample whenever needed
▪  Build   debugging infrastructure on top of Big Data
▪  Both    real-time and historical analysis
▪  Continue    to improve Big Data
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0

More Related Content

What's hot

Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015 clairvoyantllc
 
HBase @ Twitter
HBase @ TwitterHBase @ Twitter
HBase @ Twitterctrezzo
 
Hadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment EvolutionHadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment EvolutionBenoit Perroud
 
MongoDB performance tuning and monitoring with MMS
MongoDB performance tuning and monitoring with MMSMongoDB performance tuning and monitoring with MMS
MongoDB performance tuning and monitoring with MMSNicholas Tang
 

What's hot (6)

Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
 
HBase @ Twitter
HBase @ TwitterHBase @ Twitter
HBase @ Twitter
 
Hadoop, Taming Elephants
Hadoop, Taming ElephantsHadoop, Taming Elephants
Hadoop, Taming Elephants
 
Play With Streams
Play With StreamsPlay With Streams
Play With Streams
 
Hadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment EvolutionHadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment Evolution
 
MongoDB performance tuning and monitoring with MMS
MongoDB performance tuning and monitoring with MMSMongoDB performance tuning and monitoring with MMS
MongoDB performance tuning and monitoring with MMS
 

Viewers also liked

Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
 
Apache Flume - DataDayTexas
Apache Flume - DataDayTexasApache Flume - DataDayTexas
Apache Flume - DataDayTexasArvind Prabhakar
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)alexbaranau
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceCloudera, Inc.
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

Viewers also liked (7)

Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
Apache Flume - DataDayTexas
Apache Flume - DataDayTexasApache Flume - DataDayTexas
Apache Flume - DataDayTexas
 
Inside Flume
Inside FlumeInside Flume
Inside Flume
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Similar to Hadoop, hive和scribe在运维方面的应用

Fixing twitter
Fixing twitterFixing twitter
Fixing twitterRoger Xia
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...smallerror
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...xlight
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudyJohn Adams
 
(ATS3-PLAT08) Optimizing Protocol Performance
(ATS3-PLAT08) Optimizing Protocol Performance(ATS3-PLAT08) Optimizing Protocol Performance
(ATS3-PLAT08) Optimizing Protocol PerformanceBIOVIA
 
Observability with Spring-based distributed systems
Observability with Spring-based distributed systemsObservability with Spring-based distributed systems
Observability with Spring-based distributed systemsRakuten Group, Inc.
 
Fixing Twitter Velocity2009
Fixing Twitter Velocity2009Fixing Twitter Velocity2009
Fixing Twitter Velocity2009John Adams
 
Correlate Log Data with Business Metrics Like a Jedi
Correlate Log Data with Business Metrics Like a JediCorrelate Log Data with Business Metrics Like a Jedi
Correlate Log Data with Business Metrics Like a JediTrevor Parsons
 
Mining Your Logs - Gaining Insight Through Visualization
Mining Your Logs - Gaining Insight Through VisualizationMining Your Logs - Gaining Insight Through Visualization
Mining Your Logs - Gaining Insight Through VisualizationRaffael Marty
 
Big server-is-watching-you
Big server-is-watching-youBig server-is-watching-you
Big server-is-watching-youmkherlakian
 
Scaling habits of ASP.NET
Scaling habits of ASP.NETScaling habits of ASP.NET
Scaling habits of ASP.NETDavid Giard
 
Building a highly scalable and available cloud application
Building a highly scalable and available cloud applicationBuilding a highly scalable and available cloud application
Building a highly scalable and available cloud applicationNoam Sheffer
 
PostgreSQL as a Big Data Platform
PostgreSQL as a Big Data Platform PostgreSQL as a Big Data Platform
PostgreSQL as a Big Data Platform Chris Travers
 
Facebook architecture
Facebook architectureFacebook architecture
Facebook architecturedrewz lin
 
Facebook architecture
Facebook architectureFacebook architecture
Facebook architecturemysqlops
 
Qcon 090408233824-phpapp01
Qcon 090408233824-phpapp01Qcon 090408233824-phpapp01
Qcon 090408233824-phpapp01jgregory1234
 
Facebook的架构
Facebook的架构Facebook的架构
Facebook的架构yiditushe
 

Similar to Hadoop, hive和scribe在运维方面的应用 (20)

Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
(ATS3-PLAT08) Optimizing Protocol Performance
(ATS3-PLAT08) Optimizing Protocol Performance(ATS3-PLAT08) Optimizing Protocol Performance
(ATS3-PLAT08) Optimizing Protocol Performance
 
Datastage Online Training
Datastage Online TrainingDatastage Online Training
Datastage Online Training
 
Observability with Spring-based distributed systems
Observability with Spring-based distributed systemsObservability with Spring-based distributed systems
Observability with Spring-based distributed systems
 
Fixing Twitter Velocity2009
Fixing Twitter Velocity2009Fixing Twitter Velocity2009
Fixing Twitter Velocity2009
 
Correlate Log Data with Business Metrics Like a Jedi
Correlate Log Data with Business Metrics Like a JediCorrelate Log Data with Business Metrics Like a Jedi
Correlate Log Data with Business Metrics Like a Jedi
 
Mining Your Logs - Gaining Insight Through Visualization
Mining Your Logs - Gaining Insight Through VisualizationMining Your Logs - Gaining Insight Through Visualization
Mining Your Logs - Gaining Insight Through Visualization
 
Big server-is-watching-you
Big server-is-watching-youBig server-is-watching-you
Big server-is-watching-you
 
Scaling habits of ASP.NET
Scaling habits of ASP.NETScaling habits of ASP.NET
Scaling habits of ASP.NET
 
Building a highly scalable and available cloud application
Building a highly scalable and available cloud applicationBuilding a highly scalable and available cloud application
Building a highly scalable and available cloud application
 
PostgreSQL as a Big Data Platform
PostgreSQL as a Big Data Platform PostgreSQL as a Big Data Platform
PostgreSQL as a Big Data Platform
 
Facebook architecture
Facebook architectureFacebook architecture
Facebook architecture
 
Facebook architecture
Facebook architectureFacebook architecture
Facebook architecture
 
Qcon 090408233824-phpapp01
Qcon 090408233824-phpapp01Qcon 090408233824-phpapp01
Qcon 090408233824-phpapp01
 
Facebook的架构
Facebook的架构Facebook的架构
Facebook的架构
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 

Recently uploaded

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 

Recently uploaded (20)

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 

Hadoop, hive和scribe在运维方面的应用

  • 1. Operations and Big Data: Hadoop, Hive and Scribe Zheng Shao 微博:@邵铮9 12/7/2011 Velocity China 2011
  • 2. Agenda 1 Operations: Challenges and Opportunities 2 Big Data Overview 3 Operations with Big Data 4 Big Data Details: Hadoop, Hive, Scribe 5 Conclusion
  • 4. Operations Measure Model and Under- Collect and Improve Monitor Instrume stand Analyze nt
  • 5. Operations Measure Model and Under- Collect and Improve Monitor Instrume stand Analyze nt
  • 6. Challenges •  Huge amount of data ▪  Sampling may not be good enough •  Distributed environment ▪  Log collection is hard ▪  Hardware failures are normal ▪  Distributed failures are hard to understand
  • 7. Example 1: Cache miss and performance •  Memcache layer has a bug that decreased the cache hit rate by half Memcache •  MySQL layer got hit hard and Web performance of MySQL degraded MySQL •  Web performance degraded
  • 8. Example 2: Map-Reduce Retries •  Attempt 1 hits a transient Attempt 1 distributed file system issue and failed Attempt 2 •  Attempt 2 hits a real hardware Map Task issue and failed Attempt 3 •  Attempt 3 hits a transient application logic issue and failed Attempt 4 •  Attempt 4, by chance, succeeded •  The whole process slowed down
  • 9. Example 3: RPC Hierarchy •  RPC 3A failed PRC 1A •  The whole RPC 0 failed because RPC 1 of that PRC 1B RPC 0 RPC 2 •  The blame was on owner of RPC 3A RPC 3 service 3 because the log in RPC 3B service 0 shows that.
  • 10. Example 4: Inconsistent results in RPC •  RPC 0 got results from both RPC 1 and RPC 2 RPC 1 •  Both RPC 1 and RPC 2 succeeded RPC 0 RPC 2 •  But RPC 0 detects that the results are inconsistent and fails •  We may not have logged any trace information for RPC 1 and RPC 2 to continue debugging.
  • 11. Opportunities •  Big Data Technologies Logging  Logging  Logging   ▪  Distributed logging systems ▪  Distributed storage systems Collect ▪  Distributed computing systems Storage  Storage  Storage   •  Deeper Analysis Model ▪  Data mining and outlier detection ▪  Time-series analysis Compu0ng   Compu0ng  Compu0ng  
  • 12. Big Data Overview An example from Facebook
  • 13. Big Data •  What is Big Data? ▪  Volume is big enough and hard to be managed by traditional technologies ▪  Value is big enough not to be sampled/dropped •  Where is Big Data used? ▪  Product analysis ▪  User behavior analysis ▪  Business intelligence •  Why use Big Data for Operations? ▪  Reuse existing infrastructure.
  • 14. Overall Architecture ~3GB/sec Near-realtime Processing Scribe  Policy   PHP   PTail   Puma   HBase   C++   Scribe-­‐HDFS   ~6GB/sec Batch Processing ~9GB/sec Java   Nectar   Copy/Load   Central  HDFS   Hive   Scribe  Client  
  • 16. logview •  Features ▪  PHP Fatal StackTrace ▪  Group StackTrace by similarity, order by counts ▪  Integrated with SVN/Task/Oncall tools ▪  Low-pri: Scribe can drop logview data Scribe   Log  View   HTTP   PHP    Scribe  Client   Mid0er  
  • 17. logmonitor •  Rules ▪  Regular-expression based: ".*Missing Block.*" ▪  Rule has levels: WARN, ERROR, etc ▪  Dynamic rules Propagate   Modify   rules   Rules   rules   Storage   Apply  rules   <RuleName ,  Count,   Top  Rules   PTail  /  Local  Log   Examples>   Logmonitor   Logmonitor  Client   Web   Stats  Server  
  • 18. Self Monitoring •  Goal: ▪  Set KPIs for SOA logs   Scribe   ▪  Isolate issues in distributed systems ▪  Make it easy for service owners to monitor PTail,  Hive   Service   •  Approach JMX   ▪  Log4J integration with Scribe ThriR/Fb303   counter   ▪  JMX/Thrift/Fb303 counters query   Service   ▪  Client-side logging + Server-side logging Owner  
  • 19. Global Debugging with PTail •  Logging instruction ▪  Logging levels ▪  Logging destination (log name) Log  data   Scribe   PTail   ▪  Additional fields: Request ID Log  data   Log  data   Log  data   RPC  +   logging   instruc0ons   Service  3   RPC  +   Service  2   Service  1   logging   instruc0ons  
  • 20. Hive Pipelines •  Daily and historical data analysis ▪  What is the trend of a metric? ▪  When did this bug first happen? •  Examples ▪  SELECT percentile(latency, “50,75,90,99”) FROM latency_log; ▪  SELECTrequest_id, GROUP_CONCAT(log_line) as total_log FROM trace GROUP BY request_id HAVING total_log LIKE "%FATAL%“;
  • 21. Big Data Details Hadoop, Hive, Scribe
  • 22. Key Requirements •  Ease of use •  Latency ▪  Smooth learning curve ▪  Real-time data ▪  Easy integration ▪  Historical data ▪  Structured/unstructured data •  Reliability ▪  Schema evolution ▪  Low data loss •  Scalable ▪  Consistent computation ▪  Spiky traffic and QoS ▪  Raw data / Drill-down support
  • 23. Overall Architecture ~3GB/sec Near-realtime Processing Scribe  Policy   PHP   PTail   Puma   HBase   C++   Scribe-­‐HDFS   ~6GB/sec Batch Processing ~9GB/sec Java   Nectar   Copy/Load   Central  HDFS   Hive   Scribe  Client  
  • 24. Distributed Logging System - Scribe •  https://github.com/facebook/scribe
  • 25. Distributed Logging System - Scribe Scribe  Policy     Traffic/Schema   Meta  Data   management   Meta  Data   ThriR  RPC   Scribe  Service   Log  Data   Applica0on   Log  Collec0on   ThriR  RPC   Nectar  Library   Log  Data   Easy  integra0on/schema   <category,   Log  Data   evolu0on   message>   Local  Disk  
  • 26. Scribe Improvements •  Network efficiency ▪  Per-RPC Compression (use quicklz) •  Operation interface ▪  Category-based blacklisting and sampling •  Adaptive logging ▪  Use BufferStore and NullStore to drop messages as needed •  QoS ▪  Use separate hardware for now
  • 27. Distributed Storage Systems - Scribe-HDFS •  Architecture ▪  Client C1 C2 DataNode ▪  Mid-tier ▪  Writers C1 C2 DataNode •  Features Scribe   Clients   Calligraphus   Calligraphus   HDFS   Mid-­‐0er   Writers   ▪  Scalability: 9GB/sec ▪  No single point of failure (except NameNode) Zookeeper   •  Not open-sourced yet
  • 28. Distributed Storage Systems - HDFS •  Architecture ▪  NameNode: namespace, block locations ▪  DataNodes: data blocks replicated 3 times •  Features Name  Node   ▪  3000-node, PBs of spaces HDFS  Client   ▪  Highly reliable ▪  No random writes Data  Nodes   •  https://github.com/facebook/hadoop-20
  • 29. HDFS Improvements •  Efficiency ▪  Random read keep-alive: HDFS-941 ▪  Faster checksum - HDFS-2080 ▪  Use fadvise - HADOOP-7714 •  Credits: ▪  http://www.cloudera.com/resource/hadoop-world-2011-presentation- slides-hadoop-and-performance
  • 30. Distributed Storage Systems - HBase •  Architecture ▪  <row, col-family, col, value> ▪  Write-Ahead Log ▪  Records are sorted in memory/files Master   •  Features HBase  Client   ▪  100-node. ▪  Random read/write. Region  Servers   ▪  Great write performance. ▪  http://svn.apache.org/viewvc/hbase/branches/0.89-fb/
  • 31. Distributed Computing Systems – MR •  Architecture ▪  JobTracker ▪  TaskTracker JobTracker   ▪  MR Client MR  Client   •  Features ▪  Push computation to data TaskTracker   ▪  Reliable - Automatic retry ▪  Not easy to use
  • 32. MR Improvements •  Efficiency ▪  Faster compareBytes: HADOOP-7761 ▪  MR sort cache locality: MAPREDUCE-3235 ▪  Shuffle: MAPREDUCE-64, MAPREDUCE-318 •  Credits: ▪  http://www.cloudera.com/resource/hadoop-world-2011-presentation- slides-hadoop-and-performance
  • 33. Distributed Computing Systems – Hive •  Architecture ▪  MetaStore ▪  Compiler MetaStore   ▪  Execution Hive  cmd  line   Compiler   MR  Client   •  Features ▪  SQL  Map-Reduce Map-­‐Reduce   Task  Trackers   ▪  Select, Group By, Join ▪  UDF, UDAF, UDTF, Script
  • 34. Useful Features in Hive •  Complex column types ▪  Array, Struct, Map, Union ▪  CREATE TABLE (a struct<c1:map<string,string>,c2:array<string>>); •  UDFs ▪  UDF, UDAF, UDTF •  Efficient Joins ▪  Bucketed Map Join: HIVE-917
  • 35. Distributed Computing Systems – Puma •  Architecture ▪  HDFS ▪  PTail ▪  Puma HDFS   HBase   ▪  HBase •  Features PTail  +  Puma   ▪  StreamSQL: Select, Group By, Join ▪  UDF, UDAF ▪  Reliable – No data loss/duplicate
  • 36. Conclusion Big Data can help operations
  • 37. Big Data can help Operations •  5 Steps to make it effective: ▪  Make Big Data easy to use ▪  Log more data and keep more sample whenever needed ▪  Build debugging infrastructure on top of Big Data ▪  Both real-time and historical analysis ▪  Continue to improve Big Data
  • 38. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0