Real-time Analytics with HBase
                       Alex Baranau, Sematext International




Sunday, May 20, 12
About me


                     Software Engineer at Sematext International

                     http://blog.sematext.com/author/abaranau

                     @abaranau

                     http://github.com/sematext (abaranau)




                                                         Alex Baranau, Sematext International, 2012
Plan


                     Problem background: what? why?

                     Going real-time with append-only updates
                     approach: how?

                     Open-source implementation: how exactly?

                     Q&A




Background: our services
                     Systems Monitoring Service (Solr, HBase, ...)

                     Search Analytics Service



                     [Diagram: several data collectors feed Data Analytics & Storage,
                     which serves Reports (illustrated by a sample chart)]




Background: Report Example
                     Search engine (Solr) request latency




Background: Report Example
                     HBase flush operations




Background: requirements

                      High volume of input data

                      Multiple filters/dimensions

                      Interactive (fast) reports

                      Show wide range of data intervals

                      Real-time data changes visibility

                      No sampling, accurate data needed



Background: serve raw data?
                        simply storing all data points doesn’t work:
                           to show one year's worth of data points collected every second,
                           31,536,000 points have to be fetched


                        pre-aggregation (at least partial) needed

                     [Diagram: input data -> data processing (pre-aggregating) ->
                     Data Analytics & Storage -> aggregated data -> Reports]

Background: pre-aggregation
                     OLAP-like Solution

                     [Diagram: each input data item goes through processing logic, which is
                     driven by aggregation rules (filters/dimensions, time range
                     granularities, ...) and produces several aggregated values]


Background: pre-aggregation
                     Simplified Example

                     input data item:
                       time: 1332882078
                       sensor: sensor55
                       value: 80.0

                     aggregation rules (applied by the processing logic):
                       * by sensor
                       * by minute/day

                     aggregated record groups:
                       by minute:          minute: 22214701, value: 30.0, ...
                       by day:             day: 2012-04-26, value: 10.5, ...
                       by minute & sensor: minute: 22214701, sensor: sensor55, cpu: 70.3, ...
                       ...
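
The simplified example can be sketched in plain Java. This is only an illustration of the "by minute" and "by minute & sensor" rules: the class `AggregationKeys`, its methods, and the key format are hypothetical and are not part of HBase or HBaseHUT.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: derives the aggregated-record keys that one input item updates.
final class AggregationKeys {
  private AggregationKeys() {}

  // Converts a Unix timestamp in seconds to a minute bucket.
  static long minuteBucket(long epochSeconds) {
    return epochSeconds / 60;
  }

  // Returns one key per aggregation rule: "by minute" and "by minute & sensor".
  static List<String> keysFor(long epochSeconds, String sensor) {
    long minute = minuteBucket(epochSeconds);
    List<String> keys = new ArrayList<>();
    keys.add("minute:" + minute);
    keys.add("minute:" + minute + "/sensor:" + sensor);
    return keys;
  }

  public static void main(String[] args) {
    // Input item from the slide: time 1332882078, sensor55.
    // Prints [minute:22214701, minute:22214701/sensor:sensor55]
    System.out.println(keysFor(1332882078L, "sensor55"));
  }
}
```

Every additional rule adds one more key per input item, which is why the output volume grows with the number of dimensions and filters.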
Background: RMW updates are slow
                        more dimensions/filters -> greater ratio of output data to input data

                        individual read-modify-write (Get+Put) operations are slow
                        and not efficient (10-20+ times slower than only Puts)



                     [Diagram: every input record (e.g. sensor2 value: 15.0, then sensor2
                     value: 41.0) triggers a Get and a Put against the matching record in
                     storage (HBase). The stored sensor2 record holds avg: 28.7, min: 15.0,
                     max: 41.0; reports read it with Get/Scan]
Background: improve updates


                       Using in-place increment operations? Not fast
                       enough and not flexible...

                       Buffering input records on the way in and
                       writing in small batches? Doesn’t scale and
                       possible loss of data...




Background: batch updates

                      More efficient data processing: multiple
                      updates processed at once, not individually

                      Decreases aggregation output (per input
                      record)

                      Reliable, no data loss

                      Using “dictionary records” helps to reduce
                      number of Get operations


Background: batch updates
              “Dictionary Records”

                     Using data de-normalization to reduce random Get
                     operations while doing “Get+Put” updates:

                       Keep compound records which hold data of
                       multiple “normal” records that are usually
                       updated together

                       N Get+Put operations replaced with M (Get+Put)
                       and N Put operations, where M << N


Background: batch updates

                      Not real-time

                      If done frequently (closer to real-time), still
                      a lot of costly Get+Put update operations

                      Bad (any?) rollback support

                      Handling of failures of tasks which partially
                      wrote data to HBase is complex



Going Real-time with Append-based Updates




Append-only: main goals

                     Increase record update throughput

                     Process updates more efficiently: reduce the
                     number of operations and resource usage

                     Ideally, apply high volume of incoming data
                     changes in real-time

                     Add ability to roll back changes

                     Handle high update peaks well


Append-only: how?

            1. Replace read-modify-write (Get+Put) operations
                     at write time with simple append-only writes (Put)

            2. Defer processing of updates to periodic jobs
            3. Perform processing of updates on the fly only
                     if user asks for data earlier than updates are
                     processed.
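
The three steps above can be sketched with a small in-memory model that keeps a running max per key. It is a sketch under stated assumptions, not how HBase or HBaseHUT store data: the class `AppendOnlyMaxStore` and all of its methods are hypothetical, and two maps stand in for the table.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of append-only updates (running max per key).
final class AppendOnlyMaxStore {
  // Merged (already processed) value per key.
  private final Map<String, Double> merged = new HashMap<>();
  // Appended, not yet processed updates per key.
  private final Map<String, List<Double>> pending = new HashMap<>();

  // Step 1: a write only appends; nothing is read or modified.
  void append(String key, double value) {
    pending.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
  }

  // Step 2: a periodic job folds all pending updates into the merged records.
  void compact() {
    for (String key : pending.keySet()) {
      merged.put(key, read(key));
    }
    pending.clear();
  }

  // Step 3: a read merges pending updates on the fly, so it sees the latest data
  // even if the periodic job has not run yet. Returns null for an unknown key.
  Double read(String key) {
    Double max = merged.get(key);
    for (double v : pending.getOrDefault(key, Collections.<Double>emptyList())) {
      if (max == null || v > max) {
        max = v;
      }
    }
    return max;
  }

  // Number of update records still waiting to be processed.
  int pendingCount() {
    int n = 0;
    for (List<Double> updates : pending.values()) {
      n += updates.size();
    }
    return n;
  }
}
```

With a stored max of 31.0 and appended values 15.0 and 41.0, `read` returns 41.0 both before and after `compact()`. Compaction only changes how much work the read has to do.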




Append-only: writing updates

           1         Replace update (Get+Put) operations at write time
                     with simple append-only writes (Put)

                     [Diagram: each input record (sensor1 ..., sensor2 value: 15.0,
                     sensor2 value: 41.0) is written with a single Put. Storage (HBase)
                     keeps the existing aggregated records (sensor1 <...>; sensor2
                     avg: 22.7, max: 31.0) and, next to them, the appended update records
                     (sensor1 <...>; sensor2 value: 15.0; sensor2 value: 41.0)]
Append-only: writing updates

                     2        Defer processing of updates to periodic jobs


                     [Diagram: processing updates with MR job. The stored record sensor2
                     (avg: 22.7, max: 31.0) and its appended updates (value: 15.0,
                     value: 41.0) are merged into a single record sensor2 (avg: 23.4,
                     max: 41.0); sensor1 is processed the same way]

Append-only: writing updates
                 3          Perform aggregations on the fly if user asks
                            for data earlier than updates are processed

                     [Diagram: storage holds sensor2 (avg: 22.7, max: 31.0) plus the
                     unprocessed updates (value: 15.0, value: 41.0). When reports read
                     the data, the updates are merged on the fly and the reader sees
                     sensor2 (avg: 23.4, max: 41.0)]

Append-only: benefits
                     High update throughput

                     Real-time updates visibility

                     Efficient updates processing

                     Handling high peaks of update operations

                     Ability to roll back any range of changes

                     Automatically handling failures of tasks which
                     only partially updated data (e.g. in MR jobs)


Append-only: high update throughput (1/6)



                       Avoid Get+Put operations upon writing

                       Use only Put operations (i.e. insert new
                       records only) which is very fast in HBase

                       Process updates when flushing client-side
                       buffer to reduce the number of actual
                       writes




Append-only: real-time updates (2/6)

                       Increased update throughput makes it possible to
                       apply updates in real-time

                       User always sees the latest data changes

                       Updates processed on the fly during Get or
                       Scan can be stored back right away

                       Periodic updates processing helps avoid doing
                       a lot of work during reads, making reading
                       very fast


Append-only: efficient updates (3/6)
                      To apply N changes:
                        N Get+Put operations replaced with

                        N Puts and 1 Scan (shared) + 1 Put operation

                      Applying N changes at once is much more
                      efficient than performing N individual changes
                        Especially when the updated value is complex (like bitmaps)
                        and takes time to load in memory

                        Skip compacting if too few records to process

                      Avoid a lot of redundant Get operations when a
                      large portion of operations is inserting new data
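
The operation counts above can be written out as a short calculation. This only counts operations for N changes to one record, it says nothing about the cost of each operation; the class `UpdateOps` and its methods are hypothetical names used for illustration.

```java
// Hypothetical helper comparing operation counts for applying n changes to one record.
final class UpdateOps {
  private UpdateOps() {}

  // Read-modify-write: every change costs one Get and one Put.
  static long readModifyWrite(long n) {
    return 2 * n;
  }

  // Append-only: n Puts at write time, then one Scan and one Put
  // when the accumulated updates are processed together.
  static long appendOnly(long n) {
    return n + 2;
  }

  public static void main(String[] args) {
    long n = 1000;
    // Prints 2000 and 1002 for n = 1000.
    System.out.println("read-modify-write: " + readModifyWrite(n));
    System.out.println("append-only:       " + appendOnly(n));
  }
}
```

The Scan is shared across records processed in the same pass, so the per-record read cost is usually lower than this count suggests.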
Append-only: high peaks handling (4/6)


                      Actual updates do not happen at write time

                      Merge is deferred to periodic jobs, which can
                      be scheduled to run at off-peak time (nights/
                      week-ends)

                      Merge speed is not critical, doesn’t affect the
                      visibility of changes



Append-only: rollback (5/6)
                        Rollbacks are easy when updates were not
                        processed yet (not merged)

                        To preserve rollback ability after they are
                        processed (and result is written back), updates
                        can be compacted into groups
                     [Diagram: sensor2 updates written at 9:00, 10:00 and 11:00 are
                     processed per write-time group, leaving one compacted sensor2
                     record per group instead of one fully merged record]
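
Compacting into write-time groups can be sketched as follows. This is a minimal model that assumes a running sum as the aggregate and a fixed interval length; the class `GroupedCompaction` and its methods are hypothetical and do not reflect HBaseHUT's storage layout.

```java
import java.util.TreeMap;

// Hypothetical model: updates are compacted per write-time interval,
// so a whole interval can still be rolled back after compaction.
final class GroupedCompaction {
  // Compacted sum per interval start (write time rounded down to the interval).
  private final TreeMap<Long, Double> groups = new TreeMap<>();
  private final long intervalMillis;

  GroupedCompaction(long intervalMillis) {
    this.intervalMillis = intervalMillis;
  }

  // Folds one update into the group of its write-time interval.
  void compact(long writeTimeMillis, double value) {
    long groupStart = writeTimeMillis - (writeTimeMillis % intervalMillis);
    groups.merge(groupStart, value, Double::sum);
  }

  // Drops every group that starts at or after the given time.
  // Rollback granularity is therefore one interval.
  void rollbackFrom(long timeMillis) {
    groups.tailMap(timeMillis, true).clear();
  }

  // Merges the remaining groups on the fly.
  double total() {
    double sum = 0;
    for (double v : groups.values()) {
      sum += v;
    }
    return sum;
  }

  int groupCount() {
    return groups.size();
  }
}
```

A shorter interval gives finer rollback granularity but leaves more group records to merge at read time.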
Append-only: rollback (5/6, continued)
                     Example:
                     * keep all-time avg value for sensor
                     * data collected every 10 seconds for 30 days

                     Solution:
                     * perform periodic compactions every 4 hours
                     * compact groups based on 1-hour interval

                     Result:
                     At any point of time there are no more than
                     24 * 30 + 4 * 60 * 6 = 2160 records (hourly
                     compacted groups plus not yet compacted updates)
                     that need to be processed on the fly
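
The bound in this example can be reproduced with a short calculation. The class `CompactionBound` and its parameter names are hypothetical; the numbers are the ones from the slide (30 days, 1-hour groups, compaction every 4 hours, one data point every 10 seconds).

```java
// Hypothetical helper reproducing the arithmetic of the rollback example.
final class CompactionBound {
  private CompactionBound() {}

  // Upper bound on records merged on the fly for one all-time aggregate:
  // one record per compacted group plus the updates written since the last compaction.
  static long maxRecordsOnTheFly(int days, int groupsPerDay,
                                 int compactionPeriodHours, int updatesPerMinute) {
    long compactedGroups = (long) days * groupsPerDay;
    long notYetCompacted = (long) compactionPeriodHours * 60 * updatesPerMinute;
    return compactedGroups + notYetCompacted;
  }

  // Number of raw update records written over the whole period.
  static long rawRecords(int days, int updatesPerMinute) {
    return (long) days * 24 * 60 * updatesPerMinute;
  }

  public static void main(String[] args) {
    // 24 * 30 + 4 * 60 * 6 = 2160
    System.out.println(maxRecordsOnTheFly(30, 24, 4, 6));
    // 30 * 24 * 60 * 6 = 259200 raw records without any compaction
    System.out.println(rawRecords(30, 6));
  }
}
```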
Append-only: idempotency (6/6)
                      Using append-only approach helps recover from
                      failed tasks which write data to HBase
                         without rolling back partial updates

                         avoids applying duplicate updates

                         fixes task failure with simple restart of task

                      Note: new task should write records with same row
                      keys as failed one
                         easy, esp. given that input data is likely to be same

                      Very convenient when writing from MapReduce

                      Updates processing periodic jobs are also idempotent
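
The restart behaviour can be illustrated with a small model in which a map stands in for the table. It assumes that a task derives each row key deterministically from its input, as the note above requires; the class `IdempotentWrites`, the key format, and the `failAfter` parameter are hypothetical.

```java
import java.util.Map;

// Hypothetical model of a task that appends one update record per input item.
final class IdempotentWrites {
  private IdempotentWrites() {}

  // Writes inputs[i] under the row key "<taskId>-<i>".
  // Throws before writing item number failAfter to simulate a task failure;
  // pass a negative failAfter to run to completion.
  static void runTask(Map<String, Double> table, String taskId,
                      double[] inputs, int failAfter) {
    for (int i = 0; i < inputs.length; i++) {
      if (i == failAfter) {
        throw new IllegalStateException("simulated task failure");
      }
      // Same key on every attempt, so a retry overwrites instead of duplicating.
      table.put(taskId + "-" + i, inputs[i]);
    }
  }
}
```

If the first attempt fails after writing two of three records and the task is simply restarted, the table ends up with exactly three update records, without any cleanup of the partial write.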

Append-only: cons


                     Processing on the fly makes reading slower

                     Looking for data to compact (during periodic
                     compactions) may be inefficient

                     Increased amount of stored data depending
                     on use-case (in 0.92+)




Append-only + Batch?
                     Works very well together, batch approach
                     benefits from:

                       increased update throughput

                       automatic task failures handling

                       rollback ability

                     Use when the HBase cluster cannot cope with
                     processing updates in real-time or update
                     operations are the bottleneck in your batch

                     We use it ;)

Append-only updates implementation: HBaseHUT



HBaseHUT: Overview

                     Simple

                     Easy to integrate into existing projects
                       Packed as a single jar to be added to HBase client
                       classpath (also add it to RegionServer classpath to
                       benefit from server-side optimizations)

                       Supports native HBase API: HBaseHUT classes
                       implement native HBase interfaces

                     Apache License, v2.0


HBaseHUT: Overview
                     Processing of updates on-the-fly (behind
                     ResultScanner interface)

                       Allows storing back processed Result

                       Can use coprocessors (CPs) to process updates on server-side

                     Periodic processing of updates with Scan or
                     MapReduce job
                       Including processing updates in groups based on write timestamp

                     Rolling back changes with MapReduce job


HBaseHUT vs(?) OpenTSDB
                      “vs” is wrong, they are simply different things
                        OpenTSDB is a time-series database

                        HBaseHUT is a library which implements append-only
                        updates approach to be used in your project

                      OpenTSDB uses “serve raw data” approach (with
                      storage improvements), limited to handling
                      numeric values

                      HBaseHUT is meant for (but not limited to)
                      “serve aggregated data” approach, works with
                      any data

HBaseHUT: API overview
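            Writing data:
            // The row key is adjusted so that the Put is stored as a separate update record
            Put put = new Put(HutPut.adjustRow(rowKey));
            // ... add column values as with a regular Put
            hTable.put(put);


            Reading data:
            Scan scan = new Scan(startKey, stopKey);
            // Wrap the regular scanner: updates are processed on the fly by updateProcessor
            ResultScanner resultScanner =
                new HutResultScanner(hTable.getScanner(scan),
                                     updateProcessor);

            for (Result current : resultScanner) {...}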




HBaseHUT: API overview
                 Example UpdateProcessor:
                 public class MaxFunction extends UpdateProcessor {
                   // ... constructor & utility methods

                   @Override
                   public void process(Iterable<Result> records,
                                       UpdateProcessingResult result) {
                     // Largest value seen across the records being merged
                     Double maxVal = null;

                     for (Result record : records) {
                       double val = getValue(record);
                       if (maxVal == null || maxVal < val) {
                         maxVal = val;
                       }
                     }

                     // Nothing to write if there were no records;
                     // unboxing a null Double would throw
                     if (maxVal != null) {
                       result.add(colfam, qual, Bytes.toBytes(maxVal));
                     }
                   }
                 }
HBaseHUT: how we use it
                     [Diagram: inside Data Analytics & Storage, input data goes through
                     initial data processing and is written to HBase via HBaseHUT;
                     periodic updates processing runs as HBaseHUT MapReduce jobs
                     against HBase; Reports read aggregated data from HBase via HBaseHUT]

HBaseHUT: Next Steps
                     Wider use of coprocessors (CPs, HBase 0.92+)
                       Process updates during memstore flush

                     Make use of Append operation (HBase 0.94+)

                     Integrate with asynchbase lib

                     Reduce storage overhead from adjusting
                     row keys

                     etc.


Qs?
                     http://github.com/sematext/HBaseHUT

                     http://blog.sematext.com

                     @abaranau

                     http://github.com/sematext (abaranau)

                     http://sematext.com, we are hiring! ;)



                                                          Alex Baranau, Sematext International, 2012
Sunday, May 20, 12

Weitere ähnliche Inhalte

Was ist angesagt?

Solution Use Case Demo: The Power of Relationships in Your Big Data
Solution Use Case Demo: The Power of Relationships in Your Big DataSolution Use Case Demo: The Power of Relationships in Your Big Data
Solution Use Case Demo: The Power of Relationships in Your Big DataInfiniteGraph
 
Microsoft R - ScaleR Overview
Microsoft R - ScaleR OverviewMicrosoft R - ScaleR Overview
Microsoft R - ScaleR OverviewKhalid Salama
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time ApplicationsDataWorks Summit
 
#dbhouseparty - Using Oracle’s Converged “AI” Database to Pick a Good but Ine...
#dbhouseparty - Using Oracle’s Converged “AI” Database to Pick a Good but Ine...#dbhouseparty - Using Oracle’s Converged “AI” Database to Pick a Good but Ine...
#dbhouseparty - Using Oracle’s Converged “AI” Database to Pick a Good but Ine...Tammy Bednar
 
Big Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure MLBig Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure MLJongwook Woo
 
Tajo_Meetup_20141120
Tajo_Meetup_20141120Tajo_Meetup_20141120
Tajo_Meetup_20141120Hyoungjun Kim
 
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...Data Con LA
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopCloudera, Inc.
 
Evolving Hadoop into an Operational Platform with Data Applications
Evolving Hadoop into an Operational Platform with Data ApplicationsEvolving Hadoop into an Operational Platform with Data Applications
Evolving Hadoop into an Operational Platform with Data ApplicationsDataWorks Summit
 
Using hadoop to expand data warehousing
Using hadoop to expand data warehousingUsing hadoop to expand data warehousing
Using hadoop to expand data warehousingDataWorks Summit
 
Tableau on Hadoop Meet Up: Advancing from Extracts to Live Connect
Tableau on Hadoop Meet Up: Advancing from Extracts to Live ConnectTableau on Hadoop Meet Up: Advancing from Extracts to Live Connect
Tableau on Hadoop Meet Up: Advancing from Extracts to Live ConnectRemy Rosenbaum
 
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...DataWorks Summit
 
Revenue Earned From Students in USA
Revenue Earned From Students in USARevenue Earned From Students in USA
Revenue Earned From Students in USAApekshitBhingardive
 
Graph Data Modeling in DataStax Enterprise
Graph Data Modeling in DataStax EnterpriseGraph Data Modeling in DataStax Enterprise
Graph Data Modeling in DataStax EnterpriseArtem Chebotko
 
Breakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreBreakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreCloudera, Inc.
 
Integrating Oracle Data Integrator with Oracle GoldenGate 12c
Integrating Oracle Data Integrator with Oracle GoldenGate 12cIntegrating Oracle Data Integrator with Oracle GoldenGate 12c
Integrating Oracle Data Integrator with Oracle GoldenGate 12cEdelweiss Kammermann
 
Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Kent Graziano
 
The Future of Data Warehousing: ETL Will Never be the Same
The Future of Data Warehousing: ETL Will Never be the SameThe Future of Data Warehousing: ETL Will Never be the Same
The Future of Data Warehousing: ETL Will Never be the SameCloudera, Inc.
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHumza Naseer
 
Con8862 no sql, json and time series data
Con8862   no sql, json and time series dataCon8862   no sql, json and time series data
Con8862 no sql, json and time series dataAnuj Sahni
 

HBaseCon 2012 | Real-time Analytics with HBase - Sematext

  • 1. Real-time Analytics with HBase. Alex Baranau, Sematext International. Sunday, May 20, 2012.
  • 2. About me: Software Engineer at Sematext International, http://blog.sematext.com/author/abaranau, @abaranau, http://github.com/sematext (abaranau)
  • 3. Plan: (1) Problem background: what? why? (2) Going real-time with the append-only updates approach: how? (3) Open-source implementation: how exactly? (4) Q&A
  • 4. Background: our services. Systems Monitoring Service (Solr, HBase, ...); Search Analytics Service. [Diagram: data collectors send data to "Data Analytics & Storage", which serves reports.]
  • 5. Background: report example. Search engine (Solr) request latency.
  • 6. Background: report example. HBase flush operations.
  • 7. Background: requirements. High volume of input data; multiple filters/dimensions; interactive (fast) reports; wide range of data intervals to show; real-time visibility of data changes; no sampling, accurate data needed.
  • 8. Background: serve raw data? Simply storing all data points doesn’t work: to show one year's worth of data points collected every second, 31,536,000 points have to be fetched. Pre-aggregation (at least partial) is needed. [Diagram: input data -> data processing (pre-aggregating) -> aggregated data in "Data Analytics & Storage" -> reports.]
  • 9. Background: pre-aggregation. OLAP-like solution. Aggregation rules: filters/dimensions, time range granularities, ... [Diagram: each input data item passes through processing logic, driven by the aggregation rules, and yields several aggregated values.]
  • 10. Background: pre-aggregation, simplified example. Aggregation rules: by sensor, by minute/day. Input data item: time: 1332882078, sensor: sensor55, value: 80.0, ... The processing logic yields aggregated record groups: by minute (minute: 22214701, value: 30.0, ...); by day (day: 2012-04-26, value: 10.5, ...); by minute & sensor (minute: 22214701, sensor: sensor55, cpu: 70.3, ...).
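The key derivation in the simplified example on slide 10 can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names (`PreAggregationSketch`, `minuteBucket`, `dayBucket`, `groupKeys`) and the string key layout are hypothetical, and UTC is assumed for the day bucket because the slide names no time zone. The slide's sample records appear to be independent illustrations, so only the minute bucket is expected to line up with the slide's input item.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of HBaseHUT: derives the keys of the
// aggregated records that one input item contributes to.
class PreAggregationSketch {

    /** Minute bucket: whole minutes since the Unix epoch. */
    static long minuteBucket(long epochSeconds) {
        return Math.floorDiv(epochSeconds, 60L);
    }

    /** Day bucket as an ISO date. UTC is an assumption; the slide names no time zone. */
    static String dayBucket(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds).atZone(ZoneOffset.UTC).toLocalDate().toString();
    }

    /** One key per aggregation rule: by minute, by day, by minute & sensor. */
    static List<String> groupKeys(long epochSeconds, String sensor) {
        long minute = minuteBucket(epochSeconds);
        return Arrays.asList(
                "minute:" + minute,
                "day:" + dayBucket(epochSeconds),
                "minute:" + minute + "/sensor:" + sensor);
    }

    public static void main(String[] args) {
        // Input item from the slide: time 1332882078, sensor55.
        for (String key : groupKeys(1332882078L, "sensor55")) {
            System.out.println(key);
        }
    }
}
```

Each input item fans out into one update per rule, which is why adding dimensions raises the ratio of output data to input data.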
  • 11. Background: RMW updates are slow. More dimensions/filters -> greater ratio of output data to input data. Individual read-modify-write (Get+Put) operations are slow and inefficient (10-20+ times slower than Puts alone). [Diagram: each input record (e.g. sensor2 value: 15.0, sensor2 value: 41.0) triggers a Get and a Put against storage (HBase), which holds one aggregated record per sensor (e.g. sensor2: avg: 28.7, min: 15.0, max: 41.0); reports read it with Get/Scan.]
  • 12. Background: improve updates. Using in-place increment operations? Not fast enough and not flexible... Buffering input records on the way in and writing in small batches? Doesn’t scale, and data may be lost...
  • 13. Background: batch updates. More efficient data processing: multiple updates are processed at once, not individually; decreases aggregation output (per input record); reliable, no data loss; using “dictionary records” helps to reduce the number of Get operations.
  • 14. Background: batch updates, “dictionary records”. Use data de-normalization to reduce random Get operations during “Get+Put” updates: keep compound records that hold the data of multiple “normal” records that are usually updated together. N Get+Put operations are replaced with M Get+Put and N Put operations, where M << N.
  • 15. Background: batch updates. Not real-time; if run frequently (closer to real-time), still a lot of costly Get+Put update operations; bad (any?) rollback support; handling failures of tasks that partially wrote data to HBase is complex.
  • 16. Going Real-time with Append-based Updates
  • 17. Append-only: main goals. Increase record update throughput; process updates more efficiently by reducing the number of operations and resource usage; ideally, apply a high volume of incoming data changes in real time; add the ability to roll back changes; handle high update peaks well.
  • 18. Append-only: how? (1) Replace read-modify-write (Get+Put) operations at write time with simple append-only writes (Put). (2) Defer processing of updates to periodic jobs. (3) Process updates on the fly only if the user asks for data before the updates have been processed.
  • 19. Append-only: writing updates (1). Replace update (Get+Put) operations at write time with simple append-only writes (Put). [Diagram: each input record (sensor2 value: 15.0, sensor2 value: 41.0) is written with a single Put; storage (HBase) keeps the existing aggregated record (sensor2: avg: 22.7, max: 31.0) and stores the new records next to it, unmerged.]
  • 20. Append-only: writing updates (2). Defer processing of updates to periodic jobs. [Diagram: an MR job processes the updates, merging sensor2 (avg: 22.7, max: 31.0) with the appended records (value: 15.0, value: 41.0) into a single record (avg: 23.4, max: 41.0).]
  • 21. Append-only: writing updates (3). Perform aggregations on the fly if the user asks for data before the updates have been processed. [Diagram: a report reading sensor2 gets avg: 23.4, max: 41.0, computed at read time from the stored record (avg: 22.7, max: 31.0) and the unprocessed records (value: 15.0, value: 41.0).]
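The three steps on slides 18 to 21 can be sketched with an in-memory sorted map standing in for an HBase table. This is a minimal sketch under assumptions, not the HBaseHUT API: `AppendOnlyStoreSketch` and its methods are hypothetical names, the `key#sequence` row layout is invented for illustration, and "max" is used as the merge function because it needs no extra state (an average would also have to carry a count).

```java
import java.util.Locale;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory sketch, not the HBaseHUT API: a sorted map stands in
// for an HBase table, and "max" is the merge function.
class AppendOnlyStoreSketch {
    private final TreeMap<String, Double> table = new TreeMap<>();
    private long sequence = 0;

    /** Step 1: a write is a blind insert of a new row; nothing is read first. */
    void append(String key, double value) {
        // Assumes keys never contain '#'; the zero-padded suffix keeps rows sorted.
        table.put(key + "#" + String.format(Locale.ROOT, "%019d", sequence++), value);
    }

    /** All rows currently stored for a key ('$' is the character right after '#'). */
    private SortedMap<String, Double> rowsOf(String key) {
        return table.subMap(key + "#", key + "$");
    }

    /** Step 3: a read merges whatever rows exist for the key on the fly. */
    Double readMax(String key) {
        Double max = null;
        for (double value : rowsOf(key).values()) {
            if (max == null || value > max) {
                max = value;
            }
        }
        return max;
    }

    /** Step 2: a periodic job replaces all rows of a key with one merged row. */
    void compact(String key) {
        SortedMap<String, Double> rows = rowsOf(key);
        if (rows.size() < 2) {
            return; // nothing to merge
        }
        double merged = readMax(key);
        rows.clear(); // the view is backed by the table, so this deletes the rows
        append(key, merged);
    }

    int rowCount(String key) {
        return rowsOf(key).size();
    }

    public static void main(String[] args) {
        AppendOnlyStoreSketch store = new AppendOnlyStoreSketch();
        store.append("sensor2", 31.0); // previously merged max, as on the slides
        store.append("sensor2", 15.0);
        store.append("sensor2", 41.0);
        System.out.println(store.readMax("sensor2") + " from " + store.rowCount("sensor2") + " rows");
        store.compact("sensor2");
        System.out.println(store.readMax("sensor2") + " from " + store.rowCount("sensor2") + " rows");
    }
}
```

The read result is the same before and after compaction; compaction only reduces how many rows a read has to merge.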
  • 22. Append-only: benefits. High update throughput; real-time visibility of updates; efficient processing of updates; handling of high peaks of update operations; ability to roll back any range of changes; automatic handling of failures of tasks that only partially updated data (e.g. in MR jobs).
  • 23. 1/6 Append-only: high update throughput. Avoid Get+Put operations upon writing; use only Put operations (i.e. insert new records only), which are very fast in HBase; process updates when flushing the client-side buffer to reduce the number of actual writes.
  • 24. 2/6 Append-only: real-time updates. Increased update throughput makes it possible to apply updates in real time; the user always sees the latest data changes; updates processed on the fly during a Get or Scan can be stored back right away; periodic processing of updates helps avoid doing a lot of work during reads, making reads very fast.
  • 25. 3/6 Append-only: efficient updates. To apply N changes, N Get+Put operations are replaced with N Puts plus 1 Scan (shared) + 1 Put operation. Applying N changes at once is much more efficient than performing N individual changes, especially when the updated value is complex (like bitmaps) and takes time to load into memory. Skip compacting if there are too few records to process. Avoid a lot of redundant Get operations when a large portion of the operations is inserting new data.
  • 26. 4/6 Append-only: high peaks handling. Actual updates do not happen at write time; the merge is deferred to periodic jobs, which can be scheduled to run at off-peak times (nights/weekends); merge speed is not critical and doesn’t affect the visibility of changes.
  • 27. 5/6 Append-only: rollback. Rollbacks are easy when updates have not been processed (merged) yet. To preserve the ability to roll back after they are processed (and the result is written back), updates can be compacted into groups by write time. [Diagram: processing merges a sensor's updates into one group per write-time interval (9:00, 10:00, 11:00) rather than into a single record.]
  • 28. 5/6 Append-only: rollback. Example: keep the all-time avg value for a sensor; data is collected every 10 seconds for 30 days. Solution: perform periodic compactions every 4 hours; compact into groups based on a 1-hour interval. Result: at any point in time there are no more than 24 * 30 + 4 * 60 * 6 = 2160 non-compacted records (720 hourly groups plus up to 1440 raw updates from the last 4 hours) that need to be processed on the fly.
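The bound on slide 28 is the number of hourly groups kept for rollback plus the raw updates that can pile up between two compaction runs. A small sketch of that arithmetic, with hypothetical names (`CompactionBoundSketch`, `maxRecordsOnRead`) and parameters chosen to mirror the slide's example:

```java
// Hypothetical helper that reproduces the arithmetic from the rollback example.
class CompactionBoundSketch {

    /**
     * Upper bound on records a read has to merge on the fly:
     * one compacted group per interval over the whole retention period,
     * plus every raw update written since the last compaction run.
     */
    static long maxRecordsOnRead(int retentionDays,
                                 int groupIntervalHours,
                                 int compactionPeriodHours,
                                 int sampleIntervalSeconds) {
        long compactedGroups = (long) retentionDays * 24 / groupIntervalHours;
        long rawUpdates = (long) compactionPeriodHours * 3600 / sampleIntervalSeconds;
        return compactedGroups + rawUpdates;
    }

    public static void main(String[] args) {
        // 30 days of data, 1-hour groups, compaction every 4 hours, one sample per 10 seconds.
        System.out.println(maxRecordsOnRead(30, 1, 4, 10));
    }
}
```

Larger groups or more frequent compactions lower the bound, at the price of coarser rollback granularity or more compaction work.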
  • 29. 6/6 Append-only: idempotency. Using the append-only approach helps recover from failed tasks that write data to HBase: no need to roll back partial updates; avoids applying duplicate updates; a task failure is fixed by simply restarting the task. Note: the new task should write records with the same row keys as the failed one (easy, esp. given that the input data is likely to be the same). Very convenient when writing from MapReduce. Periodic update-processing jobs are also idempotent.
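The note on slide 29 about writing records with the same row keys can be illustrated as follows. This is a sketch under assumptions, not HBaseHUT code: a `TreeMap` stands in for the table, and the row key is derived from the sensor id and the record's position in the input, which is one hypothetical way to make keys reproducible across task attempts.

```java
import java.util.Locale;
import java.util.TreeMap;

// Hypothetical sketch: a retried task rewrites the same rows instead of adding duplicates.
class IdempotentWriteSketch {

    /** Row key derived only from the input: sensor id plus the record's position in the input. */
    static String rowKey(String sensor, int inputPosition) {
        return sensor + "#" + String.format(Locale.ROOT, "%010d", inputPosition);
    }

    /** Writes the first {@code limit} input records; a limit below input.length simulates a task dying midway. */
    static void runTask(TreeMap<String, Double> table, String sensor, double[] input, int limit) {
        for (int i = 0; i < limit; i++) {
            table.put(rowKey(sensor, i), input[i]); // same key on retry, so the row is overwritten
        }
    }

    public static void main(String[] args) {
        TreeMap<String, Double> table = new TreeMap<>();
        double[] input = {15.0, 41.0, 23.5};
        runTask(table, "sensor2", input, 2);            // first attempt fails after two records
        runTask(table, "sensor2", input, input.length); // plain restart, no rollback needed
        System.out.println(table.size() + " rows stored");
    }
}
```

Had the read-modify-write approach been used, the two records written by the failed attempt would have been applied twice after the restart.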
  • 30. Append-only: cons. Processing on the fly makes reading slower; looking for data to compact (during periodic compactions) may be inefficient; increased amount of stored data depending on use-case (in 0.92+).
  • 31. Append-only + Batch? They work very well together; the batch approach benefits from increased update throughput, automatic handling of task failures, and rollback ability. Use when the HBase cluster cannot cope with processing updates in real time or update operations are the bottleneck in your batch. We use it ;)
  • 32. Append-only updates implementation: HBaseHUT
  • 33. HBaseHUT: Overview. Simple; easy to integrate into existing projects: packed as a single jar to be added to the HBase client classpath (also add it to the RegionServer classpath to benefit from server-side optimizations); supports the native HBase API: HBaseHUT classes implement native HBase interfaces; Apache License, v2.0.
  • 34. HBaseHUT: Overview. Processing of updates on the fly (behind the ResultScanner interface): allows storing back the processed Result; can use CPs (coprocessors) to process updates on the server side. Periodic processing of updates with a Scan or MapReduce job, including processing updates in groups based on write ts (timestamp). Rolling back changes with a MapReduce job.
  • 35. HBaseHUT vs(?) OpenTSDB. “vs” is wrong, they are simply different things. OpenTSDB is a time-series database; HBaseHUT is a library that implements the append-only updates approach, to be used in your project. OpenTSDB uses the “serve raw data” approach (with storage improvements) and is limited to handling numeric values; HBaseHUT is meant for (but not limited to) the “serve aggregated data” approach and works with any data.
  • 36. HBaseHUT: API overview
      Writing data:
        Put put = new Put(HutPut.adjustRow(rowKey));
        // ...
        hTable.put(put);
      Reading data:
        Scan scan = new Scan(startKey, stopKey);
        ResultScanner resultScanner =
            new HutResultScanner(hTable.getScanner(scan), updateProcessor);
        for (Result current : resultScanner) {...}
  • 37. HBaseHUT: API overview. Example UpdateProcessor:
        public class MaxFunction extends UpdateProcessor {
          // ... constructor & utility methods
          @Override
          public void process(Iterable<Result> records, UpdateProcessingResult result) {
            Double maxVal = null;
            for (Result record : records) {
              double val = getValue(record);
              if (maxVal == null || maxVal < val) {
                maxVal = val;
              }
            }
            result.add(colfam, qual, Bytes.toBytes(maxVal));
          }
        }
  • 38. HBaseHUT: how we use it. [Diagram: input data goes through initial data processing and is written via HBaseHUT into HBase ("Data Analytics & Storage"); periodic MapReduce jobs process updates via HBaseHUT; reports read aggregated data via HBaseHUT.]
  • 39. HBaseHUT: Next Steps. Wider utilization of CPs (HBase 0.92+); process updates during memstore flush; make use of the Append operation (HBase 0.94+); integrate with the asynchbase lib; reduce storage overhead from adjusting row keys, etc.
  • 40. Qs? http://github.com/sematext/HBaseHUT, http://blog.sematext.com, @abaranau, http://github.com/sematext (abaranau), http://sematext.com, we are hiring! ;)