Comparing Hadoop Data Storage
                (HDFS, HBase, Hive and Pig)

Rakesh Jadhav
SAS
Agenda

 •   Hadoop Ecosystem
 •   HDFS
 •   HBase
 •   Hive
 •   Pig
Hadoop Ecosystem
Hadoop Ecosystem Components
   HDFS:      Hadoop Distributed File System
   MapReduce: Hadoop Distributed Programming Paradigm
   HBase:     Hadoop Column-Oriented Database for Random Read/Write Access to Smaller Data
   Hive:      Hadoop Petabyte-Scalable Data Warehousing Infrastructure
   Pig:       Hadoop Data Flow/Analysis Infrastructure
   ZooKeeper: Hadoop Coordination and Configuration Service Infrastructure
   Chukwa:    Hadoop Monitoring Service
   Avro:      Hadoop Data Serialization/Deserialization Infrastructure
   Mahout:    Hadoop Scalable Machine Learning Library
HDFS (Data Storage)
     Design Features

 •   Failure Is the Norm
 •   Designed for Large Datasets Rather than Small Ones
 •   Designed for Batch Processing Rather than Interactive Use
 •   Supports Write Once, Read Many
 •   Provides Interfaces to Move Processing Closer to the Data
HDFS

 APPLICATION AREAS
  • Large Log Processing
  • Web search indexing
 LIMITATIONS
  •   Small Files Problem
  •   Single Point of Failure (the NameNode)
  •   No Random Access
  •   No In-Place Updates (Write Once)
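
The small-files limitation can be illustrated with some arithmetic. The sketch below counts the namespace objects (one per file plus one per block) that the NameNode has to keep in memory. The 128 MB block size and the cost of roughly 150 bytes per object are commonly quoted rules of thumb assumed here for illustration; they are not figures from the slides, and the function names are hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # assumed HDFS block size (configurable in practice)
BYTES_PER_OBJECT = 150           # assumed rule-of-thumb NameNode heap cost per object


def namenode_objects(file_sizes):
    """Count the file and block objects the NameNode tracks for the given files."""
    objects = 0
    for size in file_sizes:
        blocks = max(1, math.ceil(size / BLOCK_SIZE))
        objects += 1 + blocks    # one file entry plus one entry per block
    return objects


def namenode_bytes(file_sizes):
    """Estimate NameNode heap use under the assumed per-object cost."""
    return namenode_objects(file_sizes) * BYTES_PER_OBJECT


one_big_file = [1024 * 1024 * 1024]         # a single 1 GiB file
many_small_files = [100 * 1024] * 10_000    # 10,000 files of 100 KiB (less data in total)

print(namenode_objects(one_big_file))       # 9
print(namenode_objects(many_small_files))   # 20000
```

Under these assumptions, the small files hold slightly less data than the single large file but need over two thousand times as many NameNode entries.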
HBase (Data Storage)
  Design Features
 • Key-Value Store (Like Map)
 • Semi Structured Data
 • Column Family, Time Stamp
 • Key=RowKey.ColumnFamily.ColumnName.TimeStamp
 • De-normalized Data
 • Faster Data Retrieval Using Column Families
 • Static Column Families, Dynamic Columns
RDBMS v/s HBase: Example
RDBMS
ID  Name  Age  Birth-Place  Marital Status  Location  Weight  Employer
1   Sam   35   Mumbai       Married         Pune      76      XYZ
2   Bob   56   Chicago      Married         New York  79      PQR

HBase
Row Key  Column Family         Column          Versions (Timestamp=Value)
1        Personal Information  Name            T1=Sam
1        Personal Information  Age             T2=35, T1=25
1        Personal Information  Birth-Place     T1=Mumbai
1        Personal Information  Marital Status  T2=Married, T1=Unmarried
1        Personal Information  Weight          T2=76, T1=65
1        Other Information     Location        T2=Pune, T1=Mumbai
1        Other Information     Employer        T1=XYZ
2        …                     …               …
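
The versioned-cell layout in the HBase example can be sketched as a nested map. This is a toy in-memory model, not the HBase client API: the class name `VersionedTable`, the family names `personal` and `other`, and the use of the integers 1 and 2 for timestamps T1 and T2 are all assumptions made for illustration.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model of an HBase-style table: (row, family, column) -> {timestamp: value}."""

    def __init__(self, families):
        self.families = set(families)      # column families are fixed up front
        self.cells = defaultdict(dict)

    def put(self, row, family, column, timestamp, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Columns inside a family are created on demand
        self.cells[(row, family, column)][timestamp] = value

    def get(self, row, family, column, timestamp=None):
        versions = self.cells.get((row, family, column), {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)      # default to the newest version
        return versions.get(timestamp)


table = VersionedTable(["personal", "other"])
table.put(1, "personal", "Age", 1, 25)
table.put(1, "personal", "Age", 2, 35)
table.put(1, "other", "Location", 1, "Mumbai")
table.put(1, "other", "Location", 2, "Pune")

print(table.get(1, "personal", "Age"))                # 35
print(table.get(1, "personal", "Age", timestamp=1))   # 25
print(table.get(1, "other", "Location"))              # Pune
```

Older values stay readable by timestamp, which is why the HBase row can hold both Sam's earlier and current age where the RDBMS row keeps only one.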
HBase: Application Areas

 • Applications That Need to Store, Access, and
   Search by Key
 • Applications Needing Fast Random Access and
   Updates to Scalable Structured Data
 • Applications Needing a Flexible Table Schema
 • Applications Needing Range-Search Capabilities
   Supported by Key Ordering
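
Range search follows from keeping row keys in sorted order. The sketch below is a toy illustration of that idea using binary search over a sorted key list; it is not the HBase scan API, and the class `SortedKeyStore` and the `user#NNNN` key format are hypothetical.

```python
import bisect


class SortedKeyStore:
    """Toy store that keeps row keys sorted so that range scans are cheap."""

    def __init__(self):
        self._keys = []    # row keys in sorted order
        self._rows = {}    # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan(self, start, stop):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


store = SortedKeyStore()
for key in ["user#0003", "user#0001", "order#0007", "user#0002"]:
    store.put(key, {"seen": True})

print([k for k, _ in store.scan("user#", "user#~")])
# ['user#0001', 'user#0002', 'user#0003']
```

Because related rows share a key prefix, a scan touches only the contiguous slice of keys it needs, which is why row-key design matters so much for this kind of workload.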
HBase: Limitations

 •   Expensive Full Row Read
 •   No Secondary Keys
 •   No SQL Support
 •   Not Efficient for Big Cell Values
Hive (Data Access)
  Design Features

  • Scalable data warehouse on top of Hadoop
    developed by Facebook
  • SQL like Query Language HiveQL
  • Limited JDBC support
  • Support for rich data types
  • Ability to insert custom map-reduce jobs
Hive: Application Areas

 • Ad hoc analysis of huge structured data with
   no low-latency requirement
 • Log processing
 • Text Mining
 • Document Indexing
 • Customer-facing business intelligence (e.g., Google
   Analytics)
 • Predictive Modeling, hypothesis testing
Hive: Limitations

 • No Support To Update Data
 • Only Bulk Load Support
 • Not Efficient For Small Data
Hive: Example

 • CREATE TABLE employee (id BIGINT, name STRING,
   age INT…) ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\t' STORED AS
   TEXTFILE;
 • LOAD DATA LOCAL INPATH
   '/sas/employee.txt' OVERWRITE INTO
   TABLE employee;
 • INSERT OVERWRITE TABLE oldest_employee
   SELECT * FROM employee SORT BY age
   DESC LIMIT 100;
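
For readers less familiar with HiveQL, the intent of the three statements (load tab-delimited employee rows, then keep the 100 oldest) can be sketched in plain Python. This is only an in-memory analogy under stated assumptions: it treats each row as exactly `id`, `name`, `age` (the slide elides any further columns), and the sample rows and function names are hypothetical.

```python
import csv
import io


def load_employees(text):
    """Parse tab-delimited rows of (id, name, age), like the employee table."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [(int(emp_id), name, int(age)) for emp_id, name, age in reader]


def oldest(employees, limit=100):
    """Return up to `limit` rows ordered by age, oldest first."""
    return sorted(employees, key=lambda row: row[2], reverse=True)[:limit]


# Hypothetical stand-in for the contents of employee.txt
sample = "1\tSam\t35\n2\tBob\t56\n3\tAnn\t41\n"

print(oldest(load_employees(sample), limit=2))
# [(2, 'Bob', 56), (3, 'Ann', 41)]
```

Hive runs the same logic as distributed batch jobs over files in HDFS, which is where the latency and bulk-load limitations listed earlier come from.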
Pig (Data Access)

  • Pig Latin: a high-level data flow language
  • Client-side library; no server-side deployment needed
  • Batch processing of large unstructured data
  • Procedural language
  • Runtime schema creation, checkpointing, support for splits in the pipeline
  • Custom code support (user-defined functions)
  • Rich data types
  • Support for Joins
Pig: Application Areas

 • Extract Transform Load (ETL)
 • Unstructured Data Analysis
Pig: Limitations

 • Not efficient for processing small datasets
Pig: Example

 Load employee data from a text file, filter it by
  age and joining year, group it by joining year,
  and report the maximum age in each group.
 -- Step 1: load the tab-delimited file with a typed schema
 records = LOAD 'sas/input/files/employee.txt'
   AS (joiningYear:int, employeeId:int, age:int);
 -- Step 2: keep employees over 30 who joined from 2000 through 2012
 filtered_records = FILTER records BY age > 30 AND
   (joiningYear >= 2000 AND joiningYear <= 2012);
 -- Step 3: group by joining year and take the maximum age per group
 grouped_records = GROUP filtered_records BY joiningYear;
 max_age = FOREACH grouped_records GENERATE group,
   MAX(filtered_records.age);
 DUMP max_age;
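
The same data flow can be traced in plain Python to make each Pig step concrete. This is a sketch only: it assumes the intended filter is a joining year from 2000 through 2012 inclusive, and the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def max_age_by_joining_year(records, min_age=30, first_year=2000, last_year=2012):
    """records: iterable of (joining_year, employee_id, age) tuples."""
    # FILTER: keep employees older than min_age who joined within the year range
    filtered = [
        r for r in records
        if r[2] > min_age and first_year <= r[0] <= last_year
    ]
    # GROUP ... BY joiningYear
    grouped = defaultdict(list)
    for joining_year, _employee_id, age in filtered:
        grouped[joining_year].append(age)
    # FOREACH ... GENERATE group, MAX(age)
    return {year: max(ages) for year, ages in grouped.items()}


sample = [  # hypothetical rows: (joiningYear, employeeId, age)
    (2005, 1, 35),
    (2005, 2, 56),
    (2010, 3, 28),
    (2011, 4, 41),
    (1998, 5, 60),
]

print(max_age_by_joining_year(sample))
# {2005: 56, 2011: 41}
```

Each Python step maps to one named relation in the Pig script, which reflects the procedural, step-by-step style listed among Pig's features.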
Conclusion

 Organizations should:
 • Revisit their data strategy
 • Evaluate the Hadoop ecosystem
 • Build economical, scalable solutions for Big Data problems
References

• Hadoop: The Definitive Guide, by Tom White
• http://hadoop.apache.org/
• http://developer.yahoo.com/hadoop/tutorial/
• http://www-01.ibm.com/software/data/infosphere/hadoop/
• http://www.information-management.com/blogs/
• http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation
Thank You




