2. What is Big Data?
• A very large amount of data.
• A popular term used to describe the exponential growth of data.
• Big data is difficult to collect, store, maintain, analyze, and visualize.
4/15/2019
3. Big Data Characteristics
• Volume:
  Large amount of data.
• Velocity:
  The rate at which data is generated.
• Variety:
  Different types of data:
  - Structured data, e.g. MySQL tables
  - Semi-structured data, e.g. XML, JSON
  - Unstructured data, e.g. text, audio, video
5. Big Data sources
• Social Media
• Banks
• Instruments
• Websites
• Stock Market
6. Hadoop Introduction
• An open-source framework that allows distributed processing of large datasets on clusters of commodity hardware.
• Hadoop is a data management tool and uses scale-out storage.
7. Why Use Hadoop?
• Cheaper: scales to petabytes or more
• Faster: parallel data processing
• Better: suited for particular types of big data problems
8. Where is Hadoop used?
Industry use cases:
• Technology: search, "people you may know", movie recommendations
• Banks: fraud detection, regulatory compliance, risk management
• Media and retail: marketing analytics, customer service, product recommendations
• Manufacturing: preventive maintenance
9. Defining Hadoop Cluster
• The size of the data is the most important factor when defining a Hadoop cluster.

Example: 5 servers with 10 TB of storage capacity each
Total storage capacity: 50 TB
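The raw capacity above is not all usable for unique data, because HDFS replicates every block. A minimal sketch of the arithmetic, assuming the HDFS default replication factor of 3 (an assumption, not stated on the slide):

```java
// Sketch: raw vs. usable capacity for the 5-node, 10 TB/node cluster above,
// assuming HDFS's default replication factor of 3 (each block stored 3 times).
public class ClusterCapacity {
    public static void main(String[] args) {
        int nodes = 5;
        double tbPerNode = 10.0;
        int replication = 3;                // assumed HDFS default

        double rawTb = nodes * tbPerNode;   // 50 TB of raw disk
        double usableTb = rawTb / replication; // ~16.7 TB of unique data

        System.out.printf("Raw: %.1f TB, usable at replication %d: %.1f TB%n",
                rawTb, replication, usableTb);
    }
}
```

In practice, sizing also reserves headroom for intermediate MapReduce output and OS overhead, so usable capacity is lower still.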
14. Hadoop Cluster
• Assume a Hadoop cluster with 4 nodes:
  - Master node: NameNode, ResourceManager
  - Slave nodes: DataNode, NodeManager
15. Modes of Operation
• Standalone
• Pseudo Distributed
• Fully Distributed
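In pseudo-distributed mode, all daemons run on a single machine, with HDFS used in place of the local filesystem. A minimal configuration sketch (the property names are standard Hadoop settings; the hostname, port, and replication value shown are illustrative):

```xml
<!-- core-site.xml: point the default filesystem at a local HDFS instance -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node cannot hold 3 replicas of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

Standalone mode needs no configuration at all (everything runs in one JVM against the local filesystem), while fully distributed mode points these settings at the master node instead of localhost.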
16. Secondary Name Node
• The Secondary NameNode is not a hot backup for the NameNode.
• It takes an hourly checkpoint (backup) of the NameNode metadata.
• Its checkpoint can be used to restart a crashed Hadoop cluster.
• The Secondary NameNode is an important daemon in Hadoop 1; in Hadoop 2 it is much less important.
19. MapReduce
A MapReduce job consists of two tasks:
• Map task
• Reduce task
Blocks of data distributed across several machines are processed by map tasks in parallel.
Results are aggregated in the reducer.
MapReduce works only on key/value pairs.
20. Data Flow in a MapReduce Program in Hadoop
21. MapReduce: Word Count

Can we do word count in parallel?

Input:
  Deer Bear River
  Car Car River
  Deer Car Bear

Map output (one (word, 1) pair per token):
  (Deer, 1) (Bear, 1) (River, 1)
  (Car, 1) (Car, 1) (River, 1)
  (Deer, 1) (Car, 1) (Bear, 1)

Reduce output (counts aggregated per word):
  Bear 2
  Car 3
  Deer 2
  River 2
23. Mapper Class
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (token, 1) for every whitespace-separated token in the line
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
24. Reducer Class
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the counts emitted for this word by all mappers
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
25. Driver Class
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // the reducer also works as a combiner here
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
} // closes the WordCount class opened on the Mapper Class slide
26. Use Cases
• Utilities want to predict power consumption.
• Banks and insurance companies want to understand risk (fraud detection).
• Marketing departments want to understand customers (recommendations, location-based ad targeting).
• Threat analysis