This document compares data storage and data access options in the Hadoop ecosystem: HDFS, HBase, Hive, and Pig. It gives an overview of each system's design features, application areas, and limitations. HDFS is designed for large datasets and batch processing. HBase is a column-oriented database for random read/write access to smaller data. Hive is a data warehousing system with SQL-like queries. Pig is a data flow language for ETL processes and analysis of unstructured data. The document aims to help organizations evaluate which Hadoop ecosystem components to use for their big data problems.
4. Hadoop Ecosystem Components
HDFS: Hadoop Distributed File System
MapReduce: Hadoop Distributed Programming Paradigm
HBase: Hadoop Column-Oriented Database for Random-Access Read/Write of Smaller Data
Hive: Hadoop Petabyte-Scalable Data Warehousing Infrastructure
Pig: Hadoop Data Flow/Analysis Infrastructure
ZooKeeper: Hadoop Coordination Service, Configuration Service Infrastructure
Chukwa: Hadoop Monitoring Service
Avro: Hadoop Data Serialization/Deserialization Infrastructure
Mahout: Hadoop Scalable Machine Learning Library
5. HDFS (Data Storage)
Design Features
• Failure Is the Norm, Not the Exception
• Designed for Large Datasets Rather than Small Ones
• Designed for Batch Processing Rather than Interactive Use
• Supports Write Once, Read Many
• Provides Interfaces to Move Processing Closer to Data
6. HDFS
APPLICATION AREAS
• Large Log Processing
• Web search indexing
LIMITATIONS
• Small Files Problem
• Single Point of Failure (NameNode)
• No Random Access
• No In-Place Update Support (Write Once)
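The small files limitation can be made concrete with a rough calculation. The sketch below is plain Python, not a Hadoop API. The 128 MB block size and the figure of about 150 bytes of NameNode memory per file or block object are commonly quoted rules of thumb assumed here for illustration, and the function name `namenode_objects` is hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # assumed HDFS block size, in bytes
BYTES_PER_OBJECT = 150           # assumed rule-of-thumb NameNode memory per object


def namenode_objects(file_sizes):
    """Count namespace objects: one per file plus one per block."""
    objects = 0
    for size in file_sizes:
        blocks = math.ceil(size / BLOCK_SIZE)
        objects += 1 + blocks
    return objects


one_gb = 1024 ** 3
one_mb = 1024 ** 2

# The same 1 GB of data stored as one large file or as 1,024 files of 1 MB.
large = namenode_objects([one_gb])
small = namenode_objects([one_mb] * 1024)

print("one large file:", large, "objects,", large * BYTES_PER_OBJECT, "bytes")
print("many small files:", small, "objects,", small * BYTES_PER_OBJECT, "bytes")
```

Under these assumptions the same volume of data needs far more NameNode entries when it is split into small files, which is why HDFS favors large files.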
7. HBase (Data Storage)
Design Features
• Key-Value Store (Like Map)
• Semi Structured Data
• Column Family, Time Stamp
• Key=RowKey.ColumnFamily.ColumnName.TimeStamp
• De-normalized Data
• Faster Data Retrieval Using Column Families
• Static Column Families, Dynamic Columns
8. RDBMS v/s HBase: Example
RDBMS
ID  Name  Age  Birth-Place  Marital Status  Location  Weight  Employer
1   Sam   35   Mumbai       Married         Pune      76      XYZ
2   Bob   56   Chicago      Married         New York  79      PQR
HBase
Row Key 1
  Personal Information (Column Family)
    Name: T1=Sam
    Age: T2=35, T1=25
    Birth-Place: T1=Mumbai
    Marital Status: T2=Married, T1=Unmarried
  Other Information (Column Family)
    Weight: T2=76, T1=65
    Location: T2=Pune, T1=Mumbai
    Employer: T1=XYZ
Row Key 2
  …
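The HBase row for key 1 in the example above can be pictured as a nested map: row key, then column family, then column name, then timestamp. The following Python sketch is only a toy model of that layout, not the HBase client API. The timestamps T1 and T2 are represented as the integers 1 and 2, and the helper `get` is a hypothetical name.

```python
# Toy model: row key -> column family -> column -> {timestamp: value}
table = {
    "1": {
        "Personal Information": {
            "Name": {1: "Sam"},
            "Age": {1: 25, 2: 35},
            "Birth-Place": {1: "Mumbai"},
            "Marital Status": {1: "Unmarried", 2: "Married"},
        },
        "Other Information": {
            "Weight": {1: 65, 2: 76},
            "Location": {1: "Mumbai", 2: "Pune"},
            "Employer": {1: "XYZ"},
        },
    }
}


def get(table, row, family, column, timestamp=None):
    """Return the newest value at or before `timestamp` (newest overall if None)."""
    versions = table[row][family][column]
    if timestamp is not None:
        versions = {ts: v for ts, v in versions.items() if ts <= timestamp}
    if not versions:
        return None
    return versions[max(versions)]


print(get(table, "1", "Personal Information", "Age"))               # latest version
print(get(table, "1", "Personal Information", "Age", timestamp=1))  # version as of T1
```

Reading a single cell touches only one column family, which is the intuition behind faster retrieval using column families.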
9. HBase: Application Areas
• Applications That Need to Store/Access/Search by Key
• Applications Needing Fast Random Access/Update to Scalable Structured Data
• Applications Needing Flexible Table Schema
• Applications Needing Range-Search Capabilities Supported by Key Ordering
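Range search works because rows are kept sorted by key, so all keys in a range sit next to each other. The sketch below illustrates the idea in plain Python with the standard `bisect` module. It is not the HBase API, and the class name `SortedStore` and the sample keys are hypothetical.

```python
import bisect


class SortedStore:
    """Toy key-value store that keeps its keys in sorted order."""

    def __init__(self):
        self._keys = []      # sorted list of row keys
        self._values = {}    # row key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, stop):
        """Return (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = SortedStore()
for key in ["user2#2012", "user1#2011", "user3#2010", "user1#2012"]:
    store.put(key, "row for " + key)

# All rows for user1: every key from "user1#" up to, but excluding, "user2#".
print(store.scan("user1#", "user2#"))
```

Choosing a row key whose prefix matches the most common query (here, the user) is what makes such scans cheap.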
10. HBase: Limitations
• Expensive Full Row Read
• No Secondary Keys
• No SQL Support
• Not Efficient for Big Cell Values
11. Hive (Data Access)
Design Features
• Scalable data warehouse on top of Hadoop, developed by Facebook
• SQL like Query Language HiveQL
• Limited JDBC support
• Support for rich data types
• Ability to insert custom map-reduce jobs
12. Hive: Application Areas
• Ad hoc analysis of huge structured data sets with no low-latency requirement
• Log processing
• Text mining
• Document indexing
• Customer-facing business intelligence (e.g., Google Analytics)
• Predictive modeling, hypothesis testing
13. Hive: Limitations
• No Support To Update Data
• Only Bulk Load Support
• Not Efficient For Small Data
14. Hive: Example
• CREATE TABLE employee (id BIGINT, name STRING, age INT…) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
• LOAD DATA LOCAL INPATH '/sas/employee.txt' OVERWRITE INTO TABLE employee;
• INSERT OVERWRITE TABLE oldest_employee SELECT * FROM employee ORDER BY age DESC LIMIT 100;
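To see what the final statement computes without a Hadoop cluster, the following sketch runs an analogous query with Python's built-in `sqlite3` module. SQLite is not Hive and its syntax differs (for example, there is no INSERT OVERWRITE, so the target table is emptied first). The three-column schema, the sample row for "Ann", and the limit of 2 instead of 100 are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT, age INTEGER)")
conn.execute("CREATE TABLE oldest_employee (id INTEGER, name TEXT, age INTEGER)")

# Sample rows; "Ann" is made up for this sketch.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Sam", 35), (2, "Bob", 56), (3, "Ann", 41)],
)

# Emulate INSERT OVERWRITE: clear the target, then insert the top rows by age.
conn.execute("DELETE FROM oldest_employee")
conn.execute(
    "INSERT INTO oldest_employee "
    "SELECT * FROM employee ORDER BY age DESC LIMIT 2"
)

oldest = conn.execute(
    "SELECT name, age FROM oldest_employee ORDER BY age DESC"
).fetchall()
print(oldest)
```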
15. Pig (Data Access)
• Pig Latin: high-level data flow language
• Client-side library, no server-side deployment needed
• Batch processing of large unstructured data
• Procedural language
• Runtime schema creation, checkpointing, split pipeline support
• Custom code support
• Rich data types
• Support for joins
18. PIG: Example
Load employee data from a text file, filter it by age and joining year, and group it by joining year.
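-- Load employee records (tab-delimited by default) with a declared schema
records = LOAD 'sas/input/files/employee.txt'
    AS (joiningYear:int, employeeId:int, age:int);
-- Keep employees older than 30 who joined between 2000 and 2012
filtered_records = FILTER records BY age > 30 AND
    (joiningYear >= 2000 AND joiningYear <= 2012);
-- Group by joining year and compute the maximum age in each group
grouped_records = GROUP filtered_records BY joiningYear;
max_age = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.age);
DUMP max_age;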
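The data flow in the Pig script (load, filter, group, aggregate) can be traced step by step in plain Python. This is only a sketch of the logic, not how Pig executes it. It assumes the filter is meant to keep joining years from 2000 through 2012, and the sample rows are made up for illustration.

```python
from collections import defaultdict

# Made-up sample rows: (joiningYear, employeeId, age)
records = [
    (2001, 1, 35),
    (2001, 2, 42),
    (2010, 3, 29),
    (2010, 4, 56),
    (1998, 5, 60),
]

# FILTER: older than 30 and joined between 2000 and 2012
filtered_records = [
    (year, emp_id, age)
    for year, emp_id, age in records
    if age > 30 and 2000 <= year <= 2012
]

# GROUP ... BY joiningYear
grouped_records = defaultdict(list)
for year, emp_id, age in filtered_records:
    grouped_records[year].append(age)

# FOREACH ... GENERATE group, MAX(age)
max_age = {year: max(ages) for year, ages in grouped_records.items()}

# DUMP
for year, age in sorted(max_age.items()):
    print(year, age)
```

Each Pig statement maps to one transformation here, which is the sense in which Pig Latin is a procedural data flow language.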
19. Conclusion
Organizations should:
• Revisit their data strategy
• Evaluate the Hadoop ecosystem
• Build economical, scalable solutions for Big Data problems