Real-time Analytics with HBase
Alex Baranau, Sematext International




Sunday, May 20, 12
About me
- Software Engineer at Sematext International
- http://blog.sematext.com/author/abaranau
- @abaranau
- http://github.com/sematext (abaranau)




                                                         Alex Baranau, Sematext International, 2012
Plan
- Problem background: what? why?
- Going real-time with the append-only updates approach: how?
- Open-source implementation: how exactly?
- Q&A




Background: our services
- Systems Monitoring Service (Solr, HBase, ...)
- Search Analytics Service

(diagram: multiple data collectors feed data into an Analytics & Storage component, which serves the reports)
Background: Report Example
Search engine (Solr) request latency




Background: Report Example
HBase flush operations




Background: requirements
- High volume of input data
- Multiple filters/dimensions
- Interactive (fast) reports
- Shows a wide range of data intervals
- Real-time visibility of data changes
- No sampling; accurate data needed



Background: serve raw data?
- Simply storing all data points doesn't work: to show one year's worth of data points collected every second, 31,536,000 points would have to be fetched
- Pre-aggregation (at least partial) is needed

(diagram: input data enters Data Analytics & Storage, where a data processing / pre-aggregating step produces the aggregated data served to reports)
Background: pre-aggregation
OLAP-like solution:

(diagram: each input data item passes through processing logic which, driven by aggregation rules (filters/dimensions, time-range granularities, ...), emits multiple aggregated values)
Background: pre-aggregation
Simplified example: an input data item (time: 1332882078, sensor: sensor55, value: 80.0) is processed, using the aggregation rules "by sensor" and "by minute/day", into aggregated record groups:
- by minute: (minute: 22214701, value: 30.0, ...)
- by day: (day: 2012-04-26, value: 10.5, ...)
- by minute & sensor: (minute: 22214701, sensor: sensor55, cpu: 70.3, ...)
Background: RMW updates are slow
- More dimensions/filters means a greater output-to-input data ratio
- Individual read-modify-write (Get+Put) operations are slow and inefficient (10-20+ times slower than Puts alone)

(diagram: each incoming sensor value, e.g. sensor2 value: 15.0 and value: 41.0, triggers a Get+Put against its aggregated record in HBase storage, e.g. sensor2 avg: 28.7, min: 15.0, max: 41.0; reports are served via Get/Scan)
Background: improve updates
- Using in-place increment operations? Not fast enough and not flexible...
- Buffering input records on the way in and writing them in small batches? Doesn't scale, and data may be lost...
Background: batch updates
- More efficient data processing: multiple updates are processed at once, not individually
- Decreases the aggregation output (per input record)
- Reliable, no data loss
- Using "dictionary records" helps reduce the number of Get operations
Background: batch updates
"Dictionary Records": using data de-normalization to reduce random Get operations while doing Get+Put updates:
- Keep compound records which hold the data of multiple "normal" records that are usually updated together
- N Get+Put operations are replaced with M (Get+Put) and N Put operations, where M << N
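The effect of dictionary records can be sketched in plain Java (a standalone analog with a HashMap standing in for the HBase table; the record layout and method names are ours, not an actual API): N read-modify-write round trips against individual rows collapse into one round trip against a compound row.

```java
import java.util.HashMap;
import java.util.Map;

public class DictionaryRecordSketch {

    // the "table": row key -> record (a map of qualifiers to values)
    static final Map<String, Map<String, Double>> table = new HashMap<>();

    // naive approach: one read-modify-write per sensor -> N Gets + N Puts
    static int updateIndividually(Map<String, Double> updates) {
        int ops = 0;
        for (Map.Entry<String, Double> e : updates.entrySet()) {
            Map<String, Double> rec =
                table.computeIfAbsent(e.getKey(), k -> new HashMap<>()); // "Get"
            rec.merge("sum", e.getValue(), Double::sum);                 // modify + "Put"
            ops += 2;                                                    // Get + Put
        }
        return ops;
    }

    // dictionary-record approach: the co-updated sensors live in one compound row
    static int updateViaDictionaryRecord(Map<String, Double> updates) {
        Map<String, Double> compound =
            table.computeIfAbsent("all-sensors", k -> new HashMap<>());  // 1 "Get"
        for (Map.Entry<String, Double> e : updates.entrySet()) {
            compound.merge(e.getKey() + ":sum", e.getValue(), Double::sum); // in memory
        }
        return 2; // 1 Get + 1 Put, regardless of how many sensors were updated
    }

    public static void main(String[] args) {
        Map<String, Double> batch = Map.of("sensor1", 15.0, "sensor2", 41.0, "sensor3", 7.5);
        System.out.println("individual ops: " + updateIndividually(batch));        // 6
        System.out.println("dictionary ops: " + updateViaDictionaryRecord(batch)); // 2
    }
}
```

The saving grows with N: a batch touching 100 co-updated rows costs 200 store operations individually, but still only 2 via the compound record.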
Background: batch updates
- Not real-time
- If done frequently (closer to real-time), there are still a lot of costly Get+Put update operations
- Bad (any?) rollback support
- Handling failures of tasks which partially wrote data to HBase is complex
Going Real-time with Append-based Updates
Append-only: main goals
- Increase record update throughput
- Process updates more efficiently: reduce the number of operations and the resource usage
- Ideally, apply a high volume of incoming data changes in real-time
- Add the ability to roll back changes
- Handle high update peaks well
Append-only: how?
1. Replace read-modify-write (Get+Put) operations at write time with simple append-only writes (Put)
2. Defer processing of updates to periodic jobs
3. Process updates on the fly only if the user asks for data before the updates have been processed
Append-only: writing updates
1. Replace update (Get+Put) operations at write time with simple append-only writes (Put)

(diagram: incoming values, e.g. sensor2 value: 15.0 and value: 41.0, are written with plain Puts as new records alongside the existing aggregated record, e.g. sensor2 avg: 22.7, max: 31.0)
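The trick that makes step 1 possible is writing each update under a unique row key derived from the original one (this is what a row-key "adjust" step does). The sketch below is our own standalone analog, not HBaseHUT's actual key format: it appends a time/sequence suffix so writes never collide, while all updates for a row still fall into one scannable key range.

```java
import java.util.concurrent.atomic.AtomicLong;

public class AppendOnlyKeys {

    static final AtomicLong seq = new AtomicLong();

    // make the key unique: original key + '#' + time-based suffix + sequence number
    static String adjustRow(String rowKey) {
        return rowKey + "#" + System.currentTimeMillis() + "-" + seq.incrementAndGet();
    }

    // every adjusted key for a row sorts into the range [row + "#", row + "$"),
    // so one Scan-like range query finds all of a record's pending updates
    static boolean inUpdateRange(String originalKey, String adjustedKey) {
        return adjustedKey.compareTo(originalKey + "#") >= 0
            && adjustedKey.compareTo(originalKey + "$") < 0;
    }

    public static void main(String[] args) {
        String first = adjustRow("sensor2");
        String second = adjustRow("sensor2");
        // two writes to the same logical row never collide,
        // yet both land inside sensor2's update range
        System.out.println(!first.equals(second));           // true
        System.out.println(inUpdateRange("sensor2", first)); // true
        System.out.println(inUpdateRange("sensor1", first)); // false
    }
}
```

Because '#' sorts just below '$', the adjusted keys stay adjacent to the base key in the table's sort order, which is what keeps the later Scan cheap.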
Append-only: writing updates
2. Defer processing of updates to periodic jobs

(diagram: an MR job merges the appended records, e.g. sensor2 value: 15.0 and value: 41.0, into the aggregated record, which becomes sensor2 avg: 23.4, max: 41.0)
Append-only: writing updates
3. Perform aggregations on the fly if the user asks for data before the updates have been processed

(diagram: at read time the stored aggregate sensor2 avg: 22.7, max: 31.0 is combined on the fly with the not-yet-merged records value: 15.0 and value: 41.0, and the report sees sensor2 avg: 23.4, max: 41.0)
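Steps 2 and 3 can be sketched together in plain Java (an assumed analog with a TreeMap standing in for the HBase table and "max" as the aggregate): appended updates live next to the aggregate under suffixed keys, a periodic "compaction" folds them in, and a read merges them on the fly if it arrives first.

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class MergeOnReadSketch {

    final TreeMap<String, Double> table = new TreeMap<>(); // stands in for the HBase table
    long seq = 0;                                          // makes appended row keys unique

    // step 1: append-only "Put" -- a new record next to the aggregate, never a collision
    void putUpdate(String row, double value) {
        table.put(row + "#" + (seq++), value);
    }

    // step 3: read the current max, merging any pending updates on the fly
    double readMax(String row) {
        double max = table.getOrDefault(row, Double.NEGATIVE_INFINITY);
        SortedMap<String, Double> pending = table.subMap(row + "#", row + "$");
        for (double v : pending.values()) {
            max = Math.max(max, v);
        }
        return max;
    }

    // step 2: the periodic job folds pending updates back into the aggregate record
    void compact(String row) {
        double merged = readMax(row);
        table.subMap(row + "#", row + "$").clear();
        table.put(row, merged);
    }

    public static void main(String[] args) {
        MergeOnReadSketch store = new MergeOnReadSketch();
        store.table.put("sensor2", 31.0); // previously merged aggregate (max so far)
        store.putUpdate("sensor2", 15.0);
        store.putUpdate("sensor2", 41.0);
        System.out.println(store.readMax("sensor2")); // 41.0 even before compaction runs
        store.compact("sensor2");
        System.out.println(store.readMax("sensor2")); // still 41.0, now from one record
    }
}
```

Note how the reader and the compaction job share the same merge logic, which is why deferring the job never changes what the user sees.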
Append-only: benefits
- High update throughput
- Real-time visibility of updates
- Efficient updates processing
- Handles high peaks of update operations
- Ability to roll back any range of changes
- Automatic handling of failures of tasks which only partially updated data (e.g. in MR jobs)
Append-only: high update throughput (1/6)
- Avoid Get+Put operations upon writing
- Use only Put operations (i.e. insert new records only), which are very fast in HBase
- Process updates when flushing the client-side buffer to reduce the number of actual writes
Append-only: real-time updates (2/6)
- The increased update throughput allows applying updates in real-time
- The user always sees the latest data changes
- Updates processed on the fly during a Get or Scan can be stored back right away
- Periodic updates processing helps avoid doing a lot of work during reads, making reading very fast
Append-only: efficient updates (3/6)
- To apply N changes, N Get+Put operations are replaced with N Puts and 1 (shared) Scan + 1 Put operation
- Applying N changes at once is much more efficient than performing N individual changes
  - especially when the updated value is complex (like bitmaps) and takes time to load into memory
  - compacting can be skipped if there are too few records to process
- Avoids a lot of redundant Get operations when a large portion of the operations insert new data
Append-only: high peaks handling (4/6)
- Actual updates do not happen at write time
- Merging is deferred to periodic jobs, which can be scheduled to run at off-peak times (nights/weekends)
- Merge speed is not critical and doesn't affect the visibility of changes
Append-only: rollback (5/6)
- Rollbacks are easy while updates have not been processed (merged) yet
- To preserve the rollback ability after they are processed (and the result is written back), updates can be compacted into groups

(diagram: sensor2 update records written at 9:00, 10:00, and 11:00 are compacted into per-hour groups, so each hour's changes can still be rolled back as a unit)
Append-only: rollback (5/6)
Example:
- keep an all-time avg value per sensor
- data is collected every 10 seconds for 30 days

Solution:
- perform periodic compactions every 4 hours
- compact groups based on a 1-hour interval

Result: at any point in time there are no more than 24 * 30 + 4 * 60 * 6 = 2160 non-compacted records that need to be processed on the fly.
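The bound above can be double-checked with a few lines of Java (the parameter names are ours): 30 days of 1-hour compacted groups contribute 24 * 30 = 720 records, plus up to 4 hours of raw records written since the last compaction run, at one record per 10 seconds.

```java
public class RollbackBound {

    // worst-case number of non-compacted records that a read must process on the fly
    static long worstCaseRecords(int days, int compactionEveryHours, int secondsPerRecord) {
        long compactedGroups = 24L * days;                              // one group per hour
        long rawSinceLastRun = compactionEveryHours * 3600L / secondsPerRecord;
        return compactedGroups + rawSinceLastRun;
    }

    public static void main(String[] args) {
        // 24 * 30 + 4 * 60 * 6 = 720 + 1440 = 2160, matching the slide
        System.out.println(worstCaseRecords(30, 4, 10)); // 2160
    }
}
```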
Append-only: idempotency (6/6)
- The append-only approach helps recover from failed tasks which write data to HBase:
  - without rolling back partial updates
  - without applying duplicate updates
  - a task failure is fixed by simply restarting the task
- Note: the new task should write records with the same row keys as the failed one (easy, especially since the input data is likely to be the same)
- Very convenient when writing from MapReduce
- The periodic update-processing jobs are also idempotent
Append-only: cons
- Processing on the fly makes reading slower
- Looking for data to compact (during periodic compactions) may be inefficient
- Increased amount of stored data, depending on the use-case (in 0.92+)
Append-only + Batch?
They work very well together; the batch approach benefits from:
- increased update throughput
- automatic handling of task failures
- rollback ability

Use this combination when the HBase cluster cannot cope with processing updates in real-time, or when update operations are the bottleneck in your batch. We use it ;)
Append-only updates implementation: HBaseHUT
HBaseHUT: Overview
- Simple
- Easy to integrate into existing projects
  - Packaged as a single jar to be added to the HBase client classpath (also add it to the RegionServer classpath to benefit from server-side optimizations)
  - Supports the native HBase API: HBaseHUT classes implement native HBase interfaces
- Apache License, v2.0
HBaseHUT: Overview
- Processing of updates on the fly (behind the ResultScanner interface)
  - Allows storing the processed Result back
  - Can use CPs (coprocessors) to process updates on the server side
- Periodic processing of updates with a Scan or a MapReduce job
  - Including processing updates in groups based on their write ts
- Rolling back changes with a MapReduce job
HBaseHUT vs(?) OpenTSDB
- "vs" is wrong; they are simply different things:
  - OpenTSDB is a time-series database
  - HBaseHUT is a library which implements the append-only updates approach, to be used in your project
- OpenTSDB uses the "serve raw data" approach (with storage improvements) and is limited to handling numeric values
- HBaseHUT is meant for (but not limited to) the "serve aggregated data" approach and works with any data
HBaseHUT: API overview
Writing data:

    Put put = new Put(HutPut.adjustRow(rowKey));
    // ...
    hTable.put(put);

Reading data:

    Scan scan = new Scan(startKey, stopKey);
    ResultScanner resultScanner =
        new HutResultScanner(hTable.getScanner(scan), updateProcessor);
    for (Result current : resultScanner) { ... }
HBaseHUT: API overview
Example UpdateProcessor:

    public class MaxFunction extends UpdateProcessor {
      // ... constructor & utility methods

      @Override
      public void process(Iterable<Result> records,
                          UpdateProcessingResult result) {
        Double maxVal = null;
        for (Result record : records) {
          double val = getValue(record);
          if (maxVal == null || maxVal < val) {
            maxVal = val;
          }
        }
        result.add(colfam, qual, Bytes.toBytes(maxVal));
      }
    }
HBaseHUT: how we use it

(diagram: input data goes through initial processing and is written into HBase via HBaseHUT; HBaseHUT MapReduce jobs handle the periodic processing of appended updates; reports read the aggregated data from HBase through HBaseHUT)
HBaseHUT: Next Steps
- Wider utilization of CPs (coprocessors, HBase 0.92+)
  - Process updates during memstore flush
- Make use of the Append operation (HBase 0.94+)
- Integrate with the asynchbase lib
- Reduce the storage overhead from adjusting row keys
- etc.
Qs?
- http://github.com/sematext/HBaseHUT
- http://blog.sematext.com
- @abaranau
- http://github.com/sematext (abaranau)
- http://sematext.com, we are hiring! ;)

Weitere ähnliche Inhalte

Andere mochten auch

Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, LucidworksVisualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, LucidworksLucidworks
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitDataWorks Summit
 
HBaseCon 2012 | Real-time Analytics with HBase - Sematext
HBaseCon 2012 | Real-time Analytics with HBase - SematextHBaseCon 2012 | Real-time Analytics with HBase - Sematext
HBaseCon 2012 | Real-time Analytics with HBase - SematextCloudera, Inc.
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachDataWorks Summit
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for ArchitectsNick Dimiduk
 
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataCombine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataHortonworks
 
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster Cloudera, Inc.
 
HBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBaseHBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBaseCloudera, Inc.
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)alexbaranau
 
Analyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-timeAnalyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-timeDataWorks Summit
 
Integrating the Solr search engine
Integrating the Solr search engineIntegrating the Solr search engine
Integrating the Solr search engineth0masr
 
Cassandra/Hadoop Integration
Cassandra/Hadoop IntegrationCassandra/Hadoop Integration
Cassandra/Hadoop IntegrationJeremy Hanna
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014P. Taylor Goetz
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopDataWorks Summit
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignMichael Noll
 

Andere mochten auch (20)

Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, LucidworksVisualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit


Real-time analytics with HBase (long version)

  • 1. Real-time Analytics with HBase Alex Baranau, Sematext International Sunday, May 20, 12
  • 2. About me Software Engineer at Sematext International http://blog.sematext.com/author/abaranau @abaranau http://github.com/sematext (abaranau) Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 3. Plan Problem background: what? why? Going real-time with append-only updates approach: how? Open-source implementation: how exactly? Q&A Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 4. Background: our services Systems Monitoring Service (Solr, HBase, ...) Search Analytics Service data collector Reports Data 100 75 data collector Analytics & 50 Storage 25 data collector 0 2007 2008 2009 2010 Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 5. Background: Report Example Search engine (Solr) request latency Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 6. Background: Report Example HBase flush operations Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 7. Background: requirements High volume of input data Multiple filters/dimensions Interactive (fast) reports Show wide range of data intervals Real-time data changes visibility No sampling, accurate data needed Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 8. Background: serve raw data? Simply storing all data points doesn’t work: to show 1 year’s worth of data points collected every second, 31,536,000 points have to be fetched, so pre-aggregation (at least partial) is needed. Data Analytics & Storage Reports aggregated data input data data processing (pre-aggregating) Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 9. Background: pre-aggregation OLAP-like Solution aggregation rules * filters/dimensions * time range granularities aggregated * ... value aggregated input data processing value item logic aggregated value Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 10. Background: pre-aggregation Simplified Example aggregated record groups by minute minute: 22214701 aggregation rules value: 30.0 * by sensor ... * by minute/day by day day: 2012-04-26 input data item value: 10.5 time: 1332882078 processing ... sensor: sensor55 value: 80.0 logic by minute & sensor minute: 22214701 sensor: sensor55 cpu: 70.3 ... ... Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
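The pre-aggregation step on the slide above can be sketched as follows. This is an illustrative simulation only (not code from the talk); all names (`aggregates`, `process`, the rule keys) are made up for the example, and a plain dict stands in for the storage:

```python
# Illustrative sketch: pre-aggregating one input item into several
# dimension/granularity buckets, as in the slide's simplified example.
from collections import defaultdict

# aggregated values keyed by (rule, bucket); here we track (sum, count)
aggregates = defaultdict(lambda: [0.0, 0])

def process(item):
    """Apply one input item to every aggregation rule it matches."""
    minute = item["time"] // 60      # minute-granularity bucket
    day = item["time"] // 86400      # day-granularity bucket
    keys = [
        ("by_minute", minute),
        ("by_day", day),
        ("by_minute_and_sensor", (minute, item["sensor"])),
    ]
    for key in keys:
        agg = aggregates[key]
        agg[0] += item["value"]
        agg[1] += 1

process({"time": 1332882078, "sensor": "sensor55", "value": 10.5})
process({"time": 1332882090, "sensor": "sensor55", "value": 80.0})

minute_bucket = 1332882078 // 60     # 22214701, as on the slide
s, n = aggregates[("by_minute_and_sensor", (minute_bucket, "sensor55"))]
print(s / n)   # 45.25
```

Note that a single input item fans out to several aggregated records, which is exactly why the output/input ratio grows with the number of dimensions, as the next slide discusses.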
  • 11. Background: RMW updates are slow more dimensions/filters -> greater output data vs input data ratio individual read-modify-write (Get+Put) operations are slow and not efficient (10-20+ times slower than only Puts) sensor1 sensor2 ... <...> sensor2 value:15.0 ... value:41.0 input Get Put Get Put Get Put ... sensor1 sensor2 ... reports <...> avg : 28.7 Get/Scan sensor2 min: 15.0 storage max: 41.0 (HBase) Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 12. Background: improve updates Using in-place increment operations? Not fast enough and not flexible... Buffering input records on the way in and writing in small batches? Doesn’t scale and possible loss of data... Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 13. Background: batch updates More efficient data processing: multiple updates processed at once, not individually Decreases aggregation output (per input record) Reliable, no data loss Using “dictionary records” helps to reduce number of Get operations Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 14. Background: batch updates “Dictionary Records” Using data de-normalization to reduce random Get operations while doing “Get+Put” updates: Keep compound records which hold data of multiple “normal” records that are usually updated together N Get+Put operations replaced with M (Get+Put) and N Put operations, where M << N Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
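The "dictionary record" denormalization above can be sketched like this. A hypothetical example with made-up names (`store`, `batch_update`, the compound key); a dict stands in for the HBase table:

```python
# Illustrative sketch of the "dictionary record" idea: denormalize N rows
# that are usually updated together into one compound row, so a batch costs
# M (Get+Put) operations on compound rows plus N plain Puts, with M << N.
store = {
    # one compound row holding many per-sensor entries updated together
    "sensors_of_host1": {"sensor1": 10.0, "sensor2": 20.0},
}

def batch_update(compound_key, updates):
    record = store[compound_key]     # 1 Get fetches the whole group
    record.update(updates)           # N in-memory changes
    store[compound_key] = record     # 1 Put writes them all back

batch_update("sensors_of_host1", {"sensor1": 11.0, "sensor3": 5.0})
print(store["sensors_of_host1"])
# {'sensor1': 11.0, 'sensor2': 20.0, 'sensor3': 5.0}
```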
  • 15. Background: batch updates Not real-time If done frequently (closer to real-time), still a lot of costly Get+Put update operations Bad (any?) rollback support Handling of failures of tasks which partially wrote data to HBase is complex Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 16. Going Real-time with Append-based Updates Sunday, May 20, 12
  • 17. Append-only: main goals Increase record update throughput Process updates more efficiently: reduce operations number and resources usage Ideally, apply high volume of incoming data changes in real-time Add ability to roll back changes Handle well high update peaks Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 18. Append-only: how? 1. Replace read-modify-write (Get+Put) operations at write time with simple append-only writes (Put) 2. Defer processing of updates to periodic jobs 3. Perform processing of updates on the fly only if user asks for data earlier than updates are processed. Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 19. Append-only: writing updates 1 Replace update (Get+Put) operations at write time with simple append-only writes (Put) sensor1 sensor2 sensor2 ... ... value:15.0 ... value:41.0 input Put Put Put ... sensor1 sensor2 ... <...> avg : 22.7 max: 31.0 sensor1 <...> ... ... sensor2 value: 15.0 storage sensor2 value: 41.0 (HBase) ... Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
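Step 1 above can be sketched as follows. This is an illustrative simulation, not HBaseHUT code: a dict stands in for the HBase table, and a counter stands in for the library's row-key adjustment:

```python
# Illustrative sketch: replacing read-modify-write with append-only writes.
# Instead of Get+Put against a single row, each update is written under a
# unique row key (original key plus a suffix), so writes never read state.
import itertools

store = {}                   # stands in for the HBase table
_seq = itertools.count()     # stands in for the unique row-key suffix

def put_append_only(row_key, value):
    # unique key: original key + monotonically increasing suffix
    store[(row_key, next(_seq))] = value

put_append_only("sensor2", 15.0)
put_append_only("sensor2", 41.0)

# both updates are kept as separate records; nothing was read or overwritten
pending = [v for (k, _), v in store.items() if k == "sensor2"]
print(sorted(pending))   # [15.0, 41.0]
```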
  • 20. Append-only: writing updates 2 Defer processing of updates to periodic jobs processing updates with MR job ... sensor1 sensor2 ... <...> avg : 22.7 ... sensor1 sensor2 ... max: 31.0 <...> avg : 23.4 sensor1 <...> ... max: 41.0 ... sensor2 value: 15.0 sensor2 value: 41.0 ... Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 21. Append-only: writing updates 3 Perform aggregations on the fly if user asks for data earlier than updates are processed ... sensor1 sensor2 ... reports <...> avg : 22.7 sensor1 ... max: 31.0 sensor1 <...> ... ... sensor2 value: 15.0 sensor2 storage value: 41.0 ... sensor2 avg : 23.4 max: 41.0 Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
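Step 3, merging unprocessed records at read time, can be sketched like this. Again an illustrative simulation with made-up names (`store`, `scan`), mimicking what a scanner wrapper does when it aggregates appended records on the fly:

```python
# Illustrative sketch: merging not-yet-compacted appended records while
# reading, the way updates are processed on the fly during a Scan.
store = {
    ("sensor2", 0): 15.0,   # appended records, not yet compacted
    ("sensor2", 1): 41.0,
}

def scan(row_key):
    """Group appended records for a row and aggregate them while reading."""
    values = [v for (k, _), v in sorted(store.items()) if k == row_key]
    if not values:
        return None
    # the merged result can also be stored back so later reads are cheap
    return {"avg": sum(values) / len(values), "max": max(values)}

print(scan("sensor2"))   # {'avg': 28.0, 'max': 41.0}
```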
  • 22. Append-only: benefits High update throughput Real-time updates visibility Efficient updates processing Handling high peaks of update operations Ability to roll back any range of changes Automatically handling failures of tasks which only partially updated data (e.g. in MR jobs) Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 23. 1/6 Append-only: high update throughput Avoid Get+Put operations upon writing Use only Put operations (i.e. insert new records only) which is very fast in HBase Process updates when flushing client-side buffer to reduce the number of actual writes Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 24. 2/6 Append-only: real-time updates Increased update throughput makes it possible to apply updates in real-time User always sees the latest data changes Updates processed on the fly during Get or Scan can be stored back right away Periodic updates processing helps avoid doing a lot of work during reads, making reading very fast Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 25. 3/6 Append-only: efficient updates To apply N changes: N Get+Put operations replaced with N Puts and 1 Scan (shared) + 1 Put operation Applying N changes at once is much more efficient than performing N individual changes Especially when updated value is complex (like bitmaps), takes time to load in memory Skip compacting if too few records to process Avoid a lot of redundant Get operations when a large portion of operations is inserting new data Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 26. 4/6 Append-only: high peaks handling Actual updates do not happen at write time Merge is deferred to periodic jobs, which can be scheduled to run at off-peak time (nights/weekends) Merge speed is not critical, doesn’t affect the visibility of changes Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 27. 5/6 Append-only: rollback Rollbacks are easy when updates were not processed yet (not merged) To preserve rollback ability after they are processed (and result is written back), updates can be compacted into groups written at: processing updates 9:00 ... sensor2 ... ... ... sensor2 ... ... sensor2 ... ... 10:00 sensor2 sensor2 ... ... sensor2 ... ... 11:00 sensor2 sensor2 ... ... ... ... Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 28. 5/6 Append-only: rollback Example: * keep all-time avg value for sensor * data collected every 10 seconds for 30 days Solution: * perform periodic compactions every 4 hours * compact groups based on 1-hour interval Result: At any point of time there are no more than 24 * 30 + 4 * 60 * 6 = 2160 non-compacted records that need to be processed on the fly Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
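The slide's worst-case bound can be verified with a few lines of arithmetic (variable names are mine, the numbers are the slide's):

```python
# Checking the slide's arithmetic: one record every 10 s for 30 days,
# compacted every 4 hours into 1-hour groups.
hourly_groups = 24 * 30             # one compacted record per hour, 30 days
records_per_hour = 60 * 6           # one record every 10 seconds
uncompacted = 4 * records_per_hour  # worst case: just before a compaction run
print(hourly_groups + uncompacted)  # 2160
```

So even in the worst case only 720 compacted group records plus 1440 raw records need on-the-fly processing, while any 1-hour group can still be rolled back.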
  • 29. 6/6 Append-only: idempotency Using append-only approach helps recover from failed tasks which write data to HBase without rolling back partial updates avoids applying duplicate updates fixes task failure with simple restart of task Note: the new task should write records with the same row keys as the failed one (easy, esp. given that input data is likely to be the same) Very convenient when writing from MapReduce Updates processing periodic jobs are also idempotent Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
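The idempotency argument above can be sketched like this. An illustrative simulation with made-up names (`store`, `run_task`): as long as the restarted task derives the same row keys from the same input, re-written records overwrite the failed task's partial output instead of duplicating it:

```python
# Illustrative sketch: why append-only writes make task restarts safe.
store = {}

def run_task(records):
    # deterministic row keys derived from the input (same input -> same keys)
    for i, value in enumerate(records):
        store[("sensor1", i)] = value

batch = [10.0, 20.0, 30.0]
run_task(batch[:2])   # task "fails" halfway: two records written
run_task(batch)       # restart with the same input: clean overwrite

print(len(store), sum(store.values()))   # 3 60.0
```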
  • 30. Append-only: cons Processing on the fly makes reading slower Looking for data to compact (during periodic compactions) may be inefficient Increased amount of stored data depending on use-case (in 0.92+) Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 31. Append-only + Batch? Works very well together, batch approach benefits from: increased update throughput automatic task failures handling rollback ability Use when HBase cluster cannot cope with processing updates in real-time or update operations are a bottleneck in your batch We use it ;) Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 32. Append-only updates implementation HBaseHUT Sunday, May 20, 12
  • 33. HBaseHUT: Overview Simple Easy to integrate into existing projects Packaged as a single jar to be added to HBase client classpath (also add it to RegionServer classpath to benefit from server-side optimizations) Supports native HBase API: HBaseHUT classes implement native HBase interfaces Apache License, v2.0 Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 34. HBaseHUT: Overview Processing of updates on-the-fly (behind ResultScanner interface) Allows storing back processed Result Can use CPs to process updates on server-side Periodic processing of updates with Scan or MapReduce job Including processing updates in groups based on write ts Rolling back changes with MapReduce job Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 35. HBaseHUT vs(?) OpenTSDB “vs” is wrong, they are simply different things OpenTSDB is a time-series database HBaseHUT is a library which implements append-only updates approach to be used in your project OpenTSDB uses “serve raw data” approach (with storage improvements), limited to handling numeric values HBaseHUT is meant for (but not limited to) “serve aggregated data” approach, works with any data Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 36. HBaseHUT: API overview
     Writing data:
       Put put = new Put(HutPut.adjustRow(rowKey));
       // ...
       hTable.put(put);
     Reading data:
       Scan scan = new Scan(startKey, stopKey);
       ResultScanner resultScanner =
           new HutResultScanner(hTable.getScanner(scan), updateProcessor);
       for (Result current : resultScanner) {...}
     Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 37. HBaseHUT: API overview
     Example UpdateProcessor:
       public class MaxFunction extends UpdateProcessor {
         // ... constructor & utility methods
         @Override
         public void process(Iterable<Result> records, UpdateProcessingResult result) {
           Double maxVal = null;
           for (Result record : records) {
             double val = getValue(record);
             if (maxVal == null || maxVal < val) {
               maxVal = val;
             }
           }
           result.add(colfam, qual, Bytes.toBytes(maxVal));
         }
       }
     Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 38. HBaseHUT: how we use it Data Analytics & Storage Reports aggregated data HBaseHUT HBaseHUT input initial data data processing HBase HBaseHUT HBaseHUT periodic MapReduce updates jobs processing Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 39. HBaseHUT: Next Steps Wider CPs (HBase 0.92+) utilization Process updates during memstore flush Make use of Append operation (HBase 0.94+) Integrate with asynchbase lib Reduce storage overhead from adjusting row keys etc. Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  • 40. Qs? http://github.com/sematext/HBaseHUT http://blog.sematext.com @abaranau http://github.com/sematext (abaranau) http://sematext.com, we are hiring! ;) Alex Baranau, Sematext International, 2012 Sunday, May 20, 12