Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses the core challenges of big data by providing reliability, scalability, and fault tolerance: large datasets are processed in place using MapReduce, and a cluster can scale from a single server to thousands of machines, each offering local computation and storage. Hadoop is widely used for applications such as log analysis, data warehousing, and web indexing.
2. Overview
• Why Hadoop? Data Growth, What is Big Data?, Hadoop usage
• What is Hadoop? Components, NoSQL, Cluster, Vendors
• How to Hadoop? Tool Comparison, Typical Implementation, Data Analysis with Pig & Hive, Opportunities
• Examples: MapReduce deep dive, Wordcount, Search index, Recommendation Engine
4. Data Growth
• OLTP: databases for operations; historical data is thrown away; relational (Oracle, DB2)
• OLAP: data warehouses for analytics; cheaper centralized storage -> data warehouses (ETL tools); relational/MPP appliances; less than a few hundred TB
• Big Data: data explosion (social media, etc.); petabyte scale; network speeds haven't increased, so data locality is needed; distributed processing on commodity hardware (Hadoop); non-relational
5. Big Data
What is Big Data?
• Volume: petabyte scale
• Variety: structured, semi-structured, and unstructured data
• Velocity: social and sensor data; high throughput
• Veracity: unclean, imprecise, unclear data
6. Where is Hadoop Used?
Use cases by industry:
• Technology: search, "people you may know", movie recommendations
• Banks: fraud detection, regulatory, risk management
• Media
• Retail: marketing analytics, customer service, product recommendations
• Manufacturing: preventive maintenance
8. What is Hadoop?
An open-source distributed computing framework for storage and processing.
• HDFS (distributed storage)
o Economical: commodity hardware
o Scalable: rebalances data onto new nodes
o Fault tolerant: detects faults and auto-recovers
o Reliable: maintains multiple copies of data
o High throughput: because data is distributed
• MapReduce (distributed processing)
o Data locality: process where the data resides
o Fault tolerant: auto-recovers from job failures
o Scalable: add nodes to increase parallelism
o Economical: commodity hardware
9. NoSQL DBs - HBase
• Modeled after Google's Bigtable
• Random, real-time read/write access to Big Data
• Billions of rows x millions of columns
• Open source, distributed, versioned, column-oriented
• Runs on commodity hardware
• Unlike an RDBMS:
o De-normalized
o No secondary indexes
o No transactions
• Integrates with MapReduce; has Java and REST APIs
• Automatic sharding
Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
10. How Does Hadoop Work?
A cluster consists of one Master Node and many Slave Nodes:
• Master Node: runs the Job Tracker (MapReduce) and the Name Node (HDFS)
• Slave Nodes: each runs a Task Tracker and a Data Node
12. Comparison: Traditional ETL/BI vs Hadoop
• Cost: expensive licenses and expensive hardware vs open source on cheap commodity hardware
• Volume: < 100 TB with central storage vs petabyte scale with distributed storage
• Speed: quick response when processing small data but not as fast on large data vs super fast on large data (though even the smallest job takes ~15 seconds)
• Throughput: thousands of reads/writes per minute vs millions of reads/writes per minute
15. Data Analysis: Pig & Hive
Both are abstractions on top of MapReduce: they generate MapReduce jobs in the backend and are useful for analysts who are not programmers.
Pig:
• Data flow language
• No schema
• Better with less structured data
• Example:
o LOAD 'file' USING PigStorage('\t') AS (id, name);
o FILTER, FOREACH, GROUP, ORDER, STORE
Hive:
• SQL-like language
• Schema, tables, and joins are stored in a meta-store
• Example:
o CREATE TABLE customer (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
o SELECT * FROM customer WHERE id < 100 LIMIT 10;
18. Word count - Java
• Copy input files to HDFS
o hadoop fs -put file1.txt input
• Create driver
o Set configuration variables and the mapper and reducer class names
• Create mapper
o Read input and emit key-value pairs
• Create reducer (optional)
o Aggregate all values for a particular key
• Execute
o hadoop jar WordCount.jar WordCount input output
• Analyze output
o hadoop fs -cat output/* | head
19. Word count - Streaming
• Hadoop is written in Java. I don't know Java. What do I do?
o Use Hadoop Streaming (Python, Ruby, R, etc.)
• Copy input files to HDFS
o hadoop fs -put file1.txt input
• Create mapper
o Read the input stream (stdin) and emit (print) key-value pairs
• Create reducer (optional)
o Aggregate all values for a particular key
• Execute
o hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-stream*.jar -mapper mapper.py -file mapper.py -reducer reducer.py -file reducer.py -input input -output output
• Analyze output
o hadoop fs -cat output/* | head
20. Hadoop for R
• RHadoop packages
o rmr2
o rhdfs
o rhbase
• Uses Hadoop Streaming
• The example below determines how many countries have a greater GDP than Apple's 2012 revenue

Sys.setenv(HADOOP_HOME="/home/istvan/hadoop")
Sys.setenv(HADOOP_CMD="/home/istvan/hadoop/bin/hadoop")
library(rmr2)
library(rhdfs)
setwd("/home/istvan/rhadoop/blogs/")
gdp <- read.csv("GDP_converted.csv")
head(gdp)
hdfs.init()
gdp.values <- to.dfs(gdp)
# AAPL revenue in 2012 in millions USD
aaplRevenue = 156508
gdp.map.fn <- function(k,v) {
key <- ifelse(v[4] < aaplRevenue, "less", "greater")
keyval(key, 1)
}
count.reduce.fn <- function(k,v) {
keyval(k, length(v))
}
count <- mapreduce(input=gdp.values,
map = gdp.map.fn,
reduce = count.reduce.fn)
from.dfs(count)

Source: http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/
21. Search index example
• Crawl the web
o Crawl and save websites to a local directory
• Ingest the files into HDFS
• Map
o Split the words and associate each word with the file names it appears in
• Reduce
o Build an index of words with the files and counts of occurrences
• Search
o Pass a word to the index to get the files it shows up in; display the file listing in descending order of the word's occurrence count per file
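In-memory, the map, reduce, and search steps above look roughly like the following Python sketch (the file names and two-page corpus are made up for illustration; a real job would run this over HDFS as a MapReduce job):

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Build an inverted index: word -> {filename: occurrence count}.
    `docs` maps a file name to its text (stand-ins for crawled pages)."""
    index = defaultdict(Counter)
    for name, text in docs.items():
        # "map" step: split words and tag each with its file name;
        # "reduce" step: accumulate per-file counts under each word
        for word in text.lower().split():
            index[word][name] += 1
    return index

def search(index, word):
    """Return matching files, most occurrences first."""
    return [f for f, _ in index[word.lower()].most_common()]

# Hypothetical crawled corpus
docs = {"a.html": "hadoop hadoop scales", "b.html": "hadoop rocks"}
idx = build_index(docs)
print(search(idx, "hadoop"))  # a.html first: two occurrences vs one
```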
22. Recommender example
• Use web server logs with user ratings info for items
• Create Hive tables to build structure on top of this log data
• Generate a Mahout-specific CSV input file (user, item, rating)
• Run Mahout to build item recommendations for users
o mahout recommenditembased --input /user/hive/warehouse/mahout_input --output recommendations -s SIMILARITY_PEARSON_CORRELATION -n 20
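The idea behind Mahout's item-based recommender can be sketched in a few lines of Python. This is a toy illustration of the technique, not Mahout's implementation: cosine similarity is used here for brevity, whereas the command above asks Mahout for Pearson correlation; the ratings rows mirror the (user, item, rating) CSV.

```python
import math
from collections import defaultdict

def item_similarities(ratings):
    """Cosine similarity between item rating vectors.
    `ratings` is a list of (user, item, rating) rows."""
    by_item = defaultdict(dict)
    for user, item, r in ratings:
        by_item[item][user] = r
    sims = defaultdict(dict)
    for i in by_item:
        for j in by_item:
            if i == j or not set(by_item[i]) & set(by_item[j]):
                continue
            common = set(by_item[i]) & set(by_item[j])
            dot = sum(by_item[i][u] * by_item[j][u] for u in common)
            norm = (math.sqrt(sum(v * v for v in by_item[i].values()))
                    * math.sqrt(sum(v * v for v in by_item[j].values())))
            sims[i][j] = dot / norm
    return sims

def recommend(ratings, user, n=20):
    """Score unseen items by similarity-weighted ratings of the user's items."""
    seen = {i: r for u, i, r in ratings if u == user}
    sims = item_similarities(ratings)
    scores = defaultdict(float)
    for i, r in seen.items():
        for j, s in sims.get(i, {}).items():
            if j not in seen:
                scores[j] += s * r
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

At Hadoop scale, Mahout distributes exactly these two stages (similarity computation, then weighted scoring) as MapReduce jobs instead of in-memory loops.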
23. Recap
• Why Hadoop? Data Growth, What is Big Data?, Hadoop usage
• What is Hadoop? Components, NoSQL, Cluster, Vendors
• How to Hadoop? Tool Comparison, Typical Implementation, Data Analysis with Pig & Hive, Opportunities
• Demo: MapReduce deep dive, Wordcount, Search index, Recommendation Engine