Using distributed technologies
to analyze Big Data

                    Abhijit Sharma
                    Innovation Lab
                    BMC Software




Data Explosion in Data Center
• Performance / Time Series Data
    § Incoming data rates ~ millions of data points / min
    § Data generated/server/year ~ 2 GB
    § 50 K servers ~ 100 TB data / year
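The yearly total follows directly from the per-server figure. A quick back-of-envelope check, assuming decimal units (1 TB = 1000 GB) and hypothetical variable names:

```python
# Back-of-envelope check of the slide's figures (decimal units assumed).
GB_PER_SERVER_PER_YEAR = 2
SERVERS = 50_000

total_gb = GB_PER_SERVER_PER_YEAR * SERVERS
total_tb = total_gb / 1_000  # 1 TB = 1000 GB in decimal units

print(total_tb)  # 100.0
```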




Online Warehouse - Time Series
   § Extreme storage requirements – TS data for a data center, e.g. for the last year
   § Online TS data availability, i.e. no separate ETL
   § Support for common analytics operations
           § Roll-up data, e.g. CPU/min to CPU/hour, CPU/day, etc.
           § Slice and dice – CPU util. for UNIX servers in SFO data center last week
           § Statistical operations: sum, count, avg., var., std., moving avg., frequency distributions, forecasting, etc.
   § Ease of use – SQL interface, design schema for TS data
   § Horizontal scaling – lower-cost commodity hardware
   § High R/W volume
   [Figure: Data Cube – CPU, with axes OS, Time and Data Center]
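The roll-up operation (CPU/min to CPU/hour or CPU/day) can be sketched in a few lines. This is only an illustration under assumptions: the sample data, the `roll_up` helper and the choice of averaging as the aggregate are hypothetical, not taken from the talk.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-minute samples: (epoch_seconds, cpu_percent).
samples = [(0, 10.0), (60, 20.0), (3600, 40.0), (3660, 60.0)]

def roll_up(points, bucket_seconds):
    """Average values per time bucket, e.g. 3600 for CPU/min -> CPU/hour."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Truncate the timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: mean(vals) for start, vals in sorted(buckets.items())}

hourly = roll_up(samples, 3600)
print(hourly)  # {0: 15.0, 3600: 50.0}
```

Calling `roll_up(samples, 86400)` with the same data gives the daily roll-up; only the bucket width changes.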




Why not use RDBMS based Data Warehousing?
Star schema – dimensions & facts
      § Offline data availability – ETL required – not online
      § Expensive to scale vertically – high-end hardware & software
      § Limits to vertical scaling – big data may not fit
      § Features like transactions are unnecessary and an overhead for certain applications
      § Large-scale distribution/partitioning is painful – suboptimal at high W/R ratios
      § A flexible schema that can be changed on the fly is not possible
Page 4 | 6/5/11
High Level Architecture


  [Architecture diagram, layers from top to bottom]
  Inputs: real-time continuous load of metric & dimension data; schema & query
  Hive – Distributed SQL
  NoSQL Column Store – HBase
  Hadoop HDFS & Map Reduce Framework
  Map Reduce & HDFS Nodes

Map Reduce - Recap
Map Function
    § Apply to input data; emits reduction key and value
    § Output of Map is sorted and partitioned for use by Reducers
Reduce Function
    § Apply to data grouped by reduction key
    § Often ‘reduces’ data (for example – sum(values))
Mappers and Reducers can be chained together
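The map, sort/partition and reduce steps above can be mimicked in a single process. The sketch below is not Hadoop code; `map_fn`, `reduce_fn` and `run_map_reduce` are hypothetical names, and the input records are invented for illustration.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    """Emit (reduction key, value) pairs; here one pair per (server, cpu) record."""
    server, cpu = record
    yield server, cpu

def reduce_fn(key, values):
    """Reduce all values that share a key to a single result."""
    return key, sum(values)

def run_map_reduce(records):
    # Map phase: apply map_fn to every input record.
    pairs = [kv for record in records for kv in map_fn(record)]
    # Shuffle/sort phase: order by key so equal keys are adjacent.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reduce_fn call per distinct key.
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

result = run_map_reduce([("host1", 20), ("host2", 10), ("host1", 5)])
print(result)  # [('host1', 25), ('host2', 10)]
```

Chaining corresponds to feeding `result` into another `run_map_reduce`-style stage with different map and reduce functions.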




HDFS Sweet spot
      § Big data storage: optimized for large files (ETL)
      § Writes are create, append, and large
      § Reads are mostly big and streaming
      § Throughput is more important than latency
      § Distributed, HA, transparent replication




When is raw HDFS unsuitable?
• Mutable data – Create, Update, Delete
• Small writes
• Random reads, % of small reads
• Structured data
• Online access to data – HDFS loading is an offline / batch process


NoSQL Data stores - Column
         § Excellent concurrent W/R performance – fast writes and fast reads (random and sequential) – required for near-real-time updates to TS data
         § Distributed architecture, horizontal scaling, transparent replication of data
         § Highly Available (HA) and Fault Tolerant (FT), with no SPOF – shared-nothing architecture
         § Reasonably rich data model
         § Flexible schema – amenable to ad-hoc changes even at runtime



HBase
         § (Table, Row, Column Family:Column, Timestamp) tuple maps to a stored value
         § Table is split into multiple, roughly equal-sized regions, each of which is a range of sorted keys (partitioned automatically by the key)
         § Rows ordered by key, columns ordered within a Column Family
         § Table schema defines Column Families
         § Rows can have different numbers of columns
         § Columns have a value and versions (any number)
         § Column range and key range queries

          Row Key        Column Family (dimensions)       Column Family (metric)
          112334-7782    server:host1   dc:PUNE           value:20
          112334-7783    server:host2                     value:10
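The data model above can be pictured as a sorted map of row keys to versioned cells. The toy class below is an in-memory sketch only and does not use the HBase client API; `TinyColumnStore` and its methods are hypothetical names, and the family names `dimensions` and `metric` are taken from the table headers above.

```python
import bisect

class TinyColumnStore:
    """Toy in-memory model: row key -> {"family:column": {timestamp: value}}."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._rows = {}   # row key -> cells

    def put(self, row_key, column, value, timestamp):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].setdefault(column, {})[timestamp] = value

    def get(self, row_key, column):
        """Return the newest version of a cell, or None if absent."""
        versions = self._rows.get(row_key, {}).get(column)
        return versions[max(versions)] if versions else None

    def scan(self, start_key, stop_key):
        """Key range query: row keys with start_key <= key < stop_key, in order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return self._keys[lo:hi]

store = TinyColumnStore()
store.put("112334-7782", "dimensions:server", "host1", timestamp=1)
store.put("112334-7782", "dimensions:dc", "PUNE", timestamp=1)
store.put("112334-7782", "metric:value", 20, timestamp=1)
store.put("112334-7783", "dimensions:server", "host2", timestamp=1)
store.put("112334-7783", "metric:value", 10, timestamp=1)

print(store.get("112334-7782", "metric:value"))   # 20
print(store.scan("112334-7782", "112334-7783"))   # ['112334-7782']
```

The second row has no `dimensions:dc` cell, which mirrors the point that rows can have different numbers of columns.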

Hive – Distributed SQL > MR
       § MR is not easy to code for analytics tasks (e.g. group, aggregate, etc.); chaining several Mappers & Reducers is required
       § Hive provides familiar SQL queries, which are automatically translated into a flow of Mappers and Reducers that execute the query using MR
       § Leverages Hadoop ecosystem – MR, HDFS, HBase
       § Hive defines a schema for the meta-tables it uses to store metadata and to build the schema its SQL queries run against
       § Storage Handlers for HDFS, HBase
       § Hive SQL supports common SQL clauses: select, filter, grouping, aggregation, insert, etc.
       § Hive stores data in partitions (you can specify the partitioning key while loading Hive tables) and buckets (useful for statistical operations like sampling)
       § Hive queries can also include custom map/reduce tasks as scripts

Hive Queries - CREATE
TABLE

CREATE TABLE wordfreq (word STRING, freq INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH 'freq.txt' OVERWRITE INTO TABLE wordfreq;

EXTERNAL TABLE

CREATE EXTERNAL TABLE iops (key string, os string, deploymentsize string, ts int, value int)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:os,data:deploymentSize,data:ts,data:value");




Hive Queries - SELECT
TABLE

select * from wordfreq where freq > 100 sort by freq desc limit 3;

explain select * from wordfreq where freq > 100 sort by freq desc limit 3;

select freq, count(*) AS f2 from wordfreq group by freq sort by f2 desc limit 3;

EXTERNAL TABLE

select ts, avg(value) as cpu from cpu_util_5min group by ts;

select architecture, avg(value) as cpu from cpu_util_5min group by architecture;




Hive – SQL -> Map Reduce
CPU utilization / 5 min with dimensions server, server-type, cluster, data-center; filter by server-type = Unix and group by timestamp

     SELECT `timestamp`, AVG(value)
     FROM timeseries WHERE `server-type` = 'Unix'
     GROUP BY `timestamp`

[Diagram: timeseries -> Map -> Shuffle / Sort -> Reduce]
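One way to picture how the query above maps onto the Map, Shuffle/Sort and Reduce stages is sketched below. This is a single-process illustration under assumptions, not the plan Hive actually generates: the rows are invented, and `mapper`, `shuffle` and `reducer` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical rows of the `timeseries` table from the slide.
rows = [
    {"timestamp": 300, "server-type": "Unix", "value": 20.0},
    {"timestamp": 300, "server-type": "Unix", "value": 40.0},
    {"timestamp": 300, "server-type": "Windows", "value": 90.0},
    {"timestamp": 600, "server-type": "Unix", "value": 10.0},
]

def mapper(row):
    """WHERE clause as a filter; the GROUP BY column becomes the reduction key."""
    if row["server-type"] == "Unix":
        yield row["timestamp"], row["value"]

def shuffle(pairs):
    """Group mapped values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """AVG(value) for one timestamp."""
    return key, sum(values) / len(values)

mapped = [pair for row in rows for pair in mapper(row)]
averages = [reducer(key, values) for key, values in shuffle(mapped)]
print(averages)  # [(300, 30.0), (600, 10.0)]
```

The filter runs in the map stage, so non-Unix rows never reach the shuffle.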




Thanks




Using the cloud and distributed technologies to analyze big data in the enterprise - Indicthreads cloud computing conference 2011
