Will Y Lin
Hadoop Product Family
and Ecosystem
Agenda
• What is BigData?
• What is the problem?
• Hadoop
– Introduction to Hadoop
– Hadoop components
– What sort of problems can be solved with Hadoop?
• Hadoop ecosystem
• Conclusion
What is BigData?
A set of files, a database, a single file
Big Data Expands on Four Fronts
• Volume: MB, GB, TB, PB
• Velocity: batch, periodic, near real-time, real-time
• Variety
• Veracity
http://whatis.techtarget.com/definition/3Vs
The Data-Driven World
• Modern systems have to deal with far more data than
was the case in the past
– Organizations are generating huge amounts of data
– That data has inherent value, and cannot be discarded
• Examples:
– Yahoo – over 170PB of data
– Facebook – over 30PB of data
– eBay – over 5PB of data
• Many organizations are generating data at a rate of
terabytes per day
What is the problem?
• Traditionally, computation has been processor-bound
• For decades, the primary push was to increase the
computing power of a single machine
– Faster processor, more RAM
• Distributed systems evolved to allow developers to use
multiple machines for a single job
– At compute time, data is copied to the compute nodes
What is the problem?
• Getting the data to the processors
becomes the bottleneck
• Quick calculation
– Typical disk data transfer rate:
• 75MB/sec
– Time taken to transfer 100GB of data
to the processor:
• approx. 22 minutes!
What is the problem?
• Failure of a component may cost a lot
• What do we need when a job fails?
– Failure may result in a graceful degradation of application performance,
but the entire system should not completely fail
– It should not result in the loss of any data
– It should not affect the outcome of the job
Hadoop Solutions
The most common problems Hadoop can solve
Threat Analysis/Trade Surveillance
• Challenge:
– Detecting threats in the form of fraudulent activity or attacks
• Large data volumes involved
• Like looking for a needle in a haystack
• Solution with Hadoop:
– Parallel processing over huge datasets
– Pattern recognition to identify anomalies, i.e., threats
• Typical Industry:
– Security, Financial Services
Recommendation Engine
• Challenge:
– Using user data to predict which products to recommend
• Solution with Hadoop:
– Batch processing framework
• Allows execution in parallel over large datasets
– Collaborative filtering
• Collecting ‘taste’ information from many users
• Utilizing information to predict what similar users like
• Typical Industry
– ISP, Advertising
Walmart Case
Revenue?
Friday – Beer – Diapers
Hadoop!
• Apache Hadoop project
– inspired by Google's MapReduce and Google File System
papers.
• Open-source, flexible, and available architecture for large-scale
computation and data processing on a network of commodity hardware
• Open Source Software + Commodity Hardware
– IT cost reduction
Hadoop Concepts
• Distribute the data as it is initially stored in the system
• Individual nodes can work on data local to those nodes
• Users can focus on developing applications.
Hadoop Components
• Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce Software Framework
• There are many other projects based around core
Hadoop
– Often referred to as the ‘Hadoop Ecosystem’
– Pig, Hive, HBase, Flume, Oozie, Sqoop, etc
MapReduce Runtime (Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zookeeper (Coordination)
HBase (Column NoSQL DB)
Sqoop/Flume (Data Integration)
Oozie (Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue (Web Console)
Mahout (Data Mining)
Hadoop Components: HDFS
• HDFS, the Hadoop Distributed File System, is
responsible for storing data on the cluster
• Two roles in HDFS
– Namenode: Record metadata
– Datanode: Store data
How Files Are Stored: Example
• NameNode holds metadata for the
data files
• DataNodes hold the actual blocks
• Each block is replicated three
times on the cluster
HDFS: Points To Note
• When a client application wants to read a file:
– It communicates with the NameNode to determine which blocks make up
the file, and which DataNodes those blocks reside on
– It then communicates directly with the DataNodes to read the data
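To make the read path concrete, here is a minimal sketch using the HDFS Java API; the file path is a made-up placeholder and the NameNode address is assumed to come from the cluster's core-site.xml:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from the cluster configuration
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // open() asks the NameNode for the block locations; the returned stream
    // then reads those blocks directly from the DataNodes
    Path file = new Path("/user/demo/input.txt");  // hypothetical path
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}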
Hadoop Components: MapReduce
• MapReduce is a method for distributing a task across
multiple nodes
• It works like a Unix pipeline:
– cat input | grep | sort | uniq -c | cat > output
– Input | Map | Shuffle & Sort | Reduce | Output
Features of MapReduce
• Automatic parallelization and distribution
• Automatic re-execution on failure
• Locality optimizations
• MapReduce abstracts all the ‘housekeeping’ away from
the developer
– Developer can concentrate simply on writing the Map and
Reduce functions
Example : word count
• Word count is challenging over massive amounts of
data
– Using a single compute node would be too time-consuming
– Number of unique words can easily exceed the RAM
• MapReduce breaks complex tasks down into smaller
elements which can be executed in parallel
• More nodes, faster results
Word Count Example
• Map input – Key: offset, Value: line
– 0: The cat sat on the mat
– 22: The aardvark sat on the sofa
• Map output – Key: word, Value: count
• Reduce output – Key: word, Value: sum of counts
The Hadoop Ecosystem
Growing Hadoop Ecosystem
• The term ‘Hadoop’ is taken to be the combination of
HDFS and MapReduce
• There are numerous other projects surrounding Hadoop
– Typically referred to as the ‘Hadoop Ecosystem’
• Zookeeper
• Hive and Pig
• HBase
• Flume
• Other Ecosystem Projects
– Sqoop
– Oozie
– Hue
– Mahout
The Ecosystem is the System
• Hadoop has become the kernel of the distributed
operating system for Big Data
• No one uses the kernel alone
• A collection of projects at Apache
Relation Map
MapReduce Runtime (Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zookeeper (Coordination)
HBase (Column NoSQL DB)
Sqoop/Flume (Data Integration)
Oozie (Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue (Web Console)
Mahout (Data Mining)
Zookeeper – Coordination Framework
What is ZooKeeper
• A centralized service for
– Maintaining configuration information
– Providing distributed synchronization
• A set of tools to build distributed applications that can
safely handle partial failures
• ZooKeeper was designed to store coordination data
– Status information
– Configuration
– Location information
Why use ZooKeeper?
• Manage configuration across nodes
• Implement reliable messaging
• Implement redundant services
• Synchronize process execution
ZooKeeper Architecture
– All servers store a copy of the data (in memory)
– A leader is elected at startup
– 2 roles – leader and follower
• Followers service clients, all updates go through leader
• Update responses are sent when a majority of servers have persisted the
change
– HA support
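As a rough illustration of how an application stores and reads a small piece of coordination data, here is a minimal sketch with the ZooKeeper Java client; the ensemble address and znode name are assumptions:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the ensemble (host list is a placeholder)
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> { });

    // Publish a small configuration value as a persistent znode
    String path = "/demo-config";                       // hypothetical znode
    byte[] value = "hdfs://namenode/feeds".getBytes();
    if (zk.exists(path, false) == null) {
      zk.create(path, value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Any process in the cluster can read the same value back
    byte[] stored = zk.getData(path, false, null);
    System.out.println(new String(stored));

    zk.close();
  }
}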
Hbase – Column NoSQL DB
Structured-data vs Raw-data
HBase – Inspired by
• Apache open source project
• Inspired by Google BigTable
• Non-relational, distributed database written in Java
• Coordinated by Zookeeper
Row & Column Oriented
Hbase – Data Model
• Cells are “versioned”
• Table rows are sorted by row key
• Region – a row range [start-key:end-key]
Architecture
• Master Server (HMaster)
– Assigns regions to regionservers
– Monitors the health of regionservers
• RegionServers
– Contain regions and handle client read/write request
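A minimal sketch of a client write and read with the (pre-1.0) HBase Java API of that era; the table, column family, and values are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    // Reads the ZooKeeper quorum from hbase-site.xml; HBase locates regions through it
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");           // hypothetical table

    // Write one versioned cell: row "user-001", column family "info", column "email"
    Put put = new Put(Bytes.toBytes("user-001"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("a@example.com"));
    table.put(put);

    // Random, low-latency read of the same row
    Result result = table.get(new Get(Bytes.toBytes("user-001")));
    byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
    System.out.println(Bytes.toString(email));

    table.close();
  }
}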
Hbase – workflow
When to use HBase
• Need random, low-latency access to the data
• Application has a variable schema where each row is slightly different
• Columns can be added as needed
• Most columns are NULL in each row
Flume / Sqoop – Data Integration Framework
What’s the problem for data collection
• Data collection is currently a priori and ad hoc
• A priori – decide what you want to collect ahead of time
• Ad hoc – each kind of data source goes through its own
collection path
Flume (and how can it help?)
• A distributed data collection service
• It efficiently collects, aggregates, and moves large amounts of data
• Fault tolerant, with many failover and recovery mechanisms
• One-stop solution for data collection of all formats
Flume: High-Level Overview
• Logical Node
• Source
• Sink
Architecture
• Basic diagram
– One master controls multiple nodes
Architecture
• Multiple masters control multiple nodes
An example flow
Flume / Sqoop – Data Integration Framework
Sqoop
• Easy, parallel database import/export
• What do you want to do?
– Import data from an RDBMS into HDFS
– Export data from HDFS back into an RDBMS
What is Sqoop
• A suite of tools that connect Hadoop and database
systems
• Import tables from databases into HDFS for deep
analysis
• Export MapReduce results back to a database for
presentation to end-users
• Provides the ability to import from SQL databases
straight into your Hive data warehouse
How Sqoop helps
• The Problem
– Structured data in traditional databases cannot be easily
combined with complex data stored in HDFS
• Sqoop (SQL-to-Hadoop)
– Easy import of data from many databases to HDFS
– Generate code for use in MapReduce applications
Sqoop - import process
Sqoop - export process
• Exports are performed in parallel using MapReduce
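As a rough illustration of the two directions, hedged command examples; the JDBC URL, credentials, table names, and HDFS paths are placeholders:

# Import a database table into HDFS (runs as parallel map tasks)
sqoop import --connect jdbc:mysql://dbhost/shop --username reporter -P \
  --table purchases --target-dir /user/demo/purchases -m 4

# Export results from HDFS back into a database table
sqoop export --connect jdbc:mysql://dbhost/shop --username reporter -P \
  --table daily_totals --export-dir /user/demo/output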
Why Sqoop
• JDBC-based implementation
– Works with many popular database vendors
• Auto-generation of tedious user-side code
– Write MapReduce applications to work with your data, faster
• Integration with Hive
– Allows you to stay in a SQL-based environment
Sqoop - JOB
• Job management options
• E.g. sqoop job --create myjob -- import --connect xxxxxxx
--table mytable
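Once saved, the job can be listed and re-run later, for example (a sketch using standard sqoop job options; the job name follows the slide above):

sqoop job --list
sqoop job --exec myjob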
Pig / Hive – Analytical Language
Why Hive and Pig?
• Although MapReduce is very powerful, it can also be
complex to master
• Many organizations have business or data analysts who
are skilled at writing SQL queries, but not at writing Java
code
• Many organizations have programmers who are skilled
at writing code in scripting languages
• Hive and Pig are two projects which evolved separately
to help such people analyze huge amounts of data via
MapReduce
– Hive was initially developed at Facebook, Pig at Yahoo!
Hive – Developed by Facebook
• What is Hive?
– An SQL-like interface to Hadoop
• Data Warehouse infrastructure that provides data
summarization and ad hoc querying on top of Hadoop
– MapReduce for execution
– HDFS for storage
• Hive Query Language
– Basic-SQL : Select, From, Join, Group-By
– Equi-Join, Multi-Table Insert, Multi-Group-By
– Batch query
SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid
Pig
• A high-level scripting language (Pig Latin)
• Processes data one step at a time
• Simple to write MapReduce programs
• Easy to understand
• Easy to debug
A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';
– Initiated by Yahoo!
Hive vs. Pig
• Language – Hive: HiveQL (SQL-like); Pig: Pig Latin, a scripting language
• Schema – Hive: table definitions stored in a metastore; Pig: a schema is
optionally defined at runtime
• Programmatic access – Hive: JDBC, ODBC; Pig: PigServer
WordCount Example
• Input:
Hello World Bye World
Hello Hadoop Goodbye Hadoop
• For the given sample input the map emits:
< Hello, 1> < World, 1> < Bye, 1> < World, 1>
< Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>
• The reduce just sums up the values:
< Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>
WordCount Example In MapReduce
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      // Tokenize the input line and emit <word, 1> for each token
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      // Sum the counts for each word and emit <word, total>
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
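A typical way to package and run the example on a cluster (the jar name and HDFS paths are placeholders):

hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output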
WordCount Example By Pig
A = LOAD 'wordcount/input' USING PigStorage() AS (token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) AS count;
DUMP C;
WordCount Example By Hive
CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input'
OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;
Oozie – Job Workflow & Scheduling
What is Oozie?
• A Java web application
• Oozie is a workflow scheduler for Hadoop
• Like cron for Hadoop
(diagram: a workflow graph of Job 1 through Job 5)
Why Oozie?
• Why use Oozie instead of just cascading jobs one after another?
• Major flexibility
– Start, Stop, Suspend, and re-run jobs
• Oozie allows you to restart from a failure
– You can tell Oozie to restart a job from a specific node in the
graph or to skip specific failed nodes
High Level Architecture
• Web Service API
• Database store:
– Workflow definitions
– Currently running workflow instances, including instance states
and variables
(diagram: Oozie runs as a Tomcat web-app that exposes a WS API, is backed
by a DB, and submits work to Hadoop/Pig/HDFS)
How it is triggered
• Time
– Execute your workflow every 15 minutes (00:15, 00:30, 00:45, 01:00, ...)
• Time and Data
– Materialize your workflow every hour (01:00, 02:00, 03:00, 04:00, ...),
but only run it when the input data exists in Hadoop
Example Workflow
Oozie use criteria
• Need to launch, control, and monitor jobs from your Java apps
– Java Client API / Command Line Interface
• Need to control jobs from anywhere
– Web Service API
• Have jobs that you need to run every hour, day, or week
• Need to receive notification when a job is done
– Email when a job is complete
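A minimal sketch of the Java Client API mentioned above; the server URL, HDFS application path, and cluster addresses are assumptions, and the workflow.xml is assumed to already be deployed in HDFS:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflowSketch {
  public static void main(String[] args) throws Exception {
    // Talk to the Oozie server's web-service API (URL is a placeholder)
    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

    // Job properties; APP_PATH points at the directory holding workflow.xml
    Properties props = oozie.createConfiguration();
    props.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/demo/wordcount-wf");
    props.setProperty("jobTracker", "jobtracker-host:8021");
    props.setProperty("nameNode", "hdfs://namenode:8020");

    // Submit and start the workflow, then poll until it finishes
    String jobId = oozie.run(props);
    System.out.println("Submitted workflow " + jobId);
    while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
      Thread.sleep(10000);
    }
    System.out.println("Final status: " + oozie.getJobInfo(jobId).getStatus());
  }
}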
Hue – Web Console
Hue – Developed by Cloudera
• Hadoop User Experience
• Open source project (Apache-licensed)
• HUE is a web UI for Hadoop
• Platform for building custom applications with a nice UI
library
Hue
• HUE comes with a suite of applications
– File Browser: Browse HDFS; change permissions and
ownership; upload, download, view and edit files.
– Job Browser: View jobs, tasks, counters, logs, etc.
– Beeswax: Wizards to help create Hive tables, load data, run and
manage Hive queries, and download results in Excel format.
Hue: File Browser UI
Hue: Beeswax UI
Mahout – Data Mining
What is Mahout?
• Machine-learning tool
• Distributed and scalable machine learning algorithms on
the Hadoop platform
• Makes building intelligent applications easier and faster
Why Mahout?
• Current state of ML libraries
– Lack community
– Lack documentation and examples
– Lack scalability
– Are research-oriented
Mahout – scale
• Scale to large datasets
– Hadoop MapReduce implementations that scale linearly with
data
• Scalable to support your business case
– Mahout is distributed under a commercially friendly Apache
Software license
• Scalable community
– Vibrant, responsive and diverse
Mahout – four use cases
• Mahout machine learning algorithms
– Recommendation mining: takes users' behavior and finds items that a
specified user might like
– Clustering: takes e.g. text documents and groups them based
on related document topics
– Classification: learns from existing categorized documents what
documents of a specific category look like and is able to assign
unlabeled documents to the appropriate category
– Frequent itemset mining: takes a set of item groups (e.g. terms
in a query session, shopping cart contents) and identifies which
individual items typically appear together
Use case Example
• Predict what the user likes based on
– His/Her historical behavior
– Aggregate behavior of people similar to him
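As a hedged illustration of this use case, here is a minimal user-based collaborative-filtering sketch with Mahout's in-memory Taste API; the ratings file, user id, and neighborhood size are made up, and Mahout also ships Hadoop-based distributed versions of the same idea:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    // ratings.csv: userID,itemID,preference (hypothetical file)
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // "People similar to this user": similarity plus the 10 nearest neighbors
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

    // Recommend items the neighbors liked but this user has not rated yet
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    List<RecommendedItem> top = recommender.recommend(42L, 3);
    for (RecommendedItem item : top) {
      System.out.println(item.getItemID() + " -> " + item.getValue());
    }
  }
}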
Conclusion
Today, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sort of problems can be solved with Hadoop
• What other projects are included in the Hadoop
ecosystem
Recap – Hadoop Ecosystem
MapReduce Runtime (Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zookeeper (Coordination)
HBase (Column NoSQL DB)
Sqoop/Flume (Data Integration)
Oozie (Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue (Web Console)
Mahout (Data Mining)
Questions?
Thank you!