2. What is Big Data?
• A very large amount of data.
• A popular term used to describe the exponential growth of data.
• Big data is difficult to collect, store, maintain, analyze, and visualize.
4/15/2019
3. Big Data Characteristics
• Volume:
  Large amount of data.
• Velocity:
  The rate at which data is generated.
• Variety:
  Different types of data:
  - Structured data, e.g. MySQL tables
  - Semi-structured data, e.g. XML, JSON
  - Unstructured data, e.g. text, audio, video
5. Big Data sources
• Social Media
• Banks
• Instruments
• Websites
• Stock Market
6. Hadoop Introduction
• An open-source framework that allows distributed processing of large datasets on clusters of commodity hardware.
• Hadoop is a data management tool and uses scale-out storage.
7. Why Use Hadoop?
• Cheaper: scales to petabytes or more
• Faster: parallel data processing
• Better: suited for particular types of big data problems
8. Where is Hadoop used?
Industry use cases:
• Technology: search, "people you may know", movie recommendations
• Banks: fraud detection, regulatory compliance, risk management
• Media and retail: marketing analytics, customer service, product recommendations
• Manufacturing: preventive maintenance
9. Defining Hadoop Cluster
• The size of the data is the most important factor when defining a Hadoop cluster.

Example: 5 servers with 10 TB of storage capacity each
Total storage capacity: 50 TB
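The raw capacity above is not all usable for unique data, because HDFS replicates every block. A minimal sketch of the arithmetic, assuming the HDFS default replication factor of 3 (an assumption, not stated on the slide):

```java
// Sketch: raw vs. usable capacity for the 5-node, 10 TB/node cluster above,
// assuming HDFS's default replication factor of 3 (each block stored 3 times).
public class ClusterCapacity {
    public static void main(String[] args) {
        int nodes = 5;
        double tbPerNode = 10.0;
        int replication = 3;                // assumed HDFS default

        double rawTb = nodes * tbPerNode;   // 50 TB of raw disk
        double usableTb = rawTb / replication; // ~16.7 TB of unique data

        System.out.printf("Raw: %.1f TB, usable at replication %d: %.1f TB%n",
                rawTb, replication, usableTb);
    }
}
```

In practice, sizing also reserves headroom for intermediate MapReduce output and OS overhead, so usable capacity is lower still.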
14. Hadoop Cluster
• Assume a Hadoop cluster with 4 nodes:
  - Master node: NameNode, ResourceManager
  - Slave nodes: DataNode, NodeManager
15. Modes of Operation
• Standalone
• Pseudo Distributed
• Fully Distributed
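In pseudo-distributed mode, all daemons run on a single machine, with HDFS used in place of the local filesystem. A minimal configuration sketch (the property names are standard Hadoop settings; the hostname, port, and replication value shown are illustrative):

```xml
<!-- core-site.xml: point the default filesystem at a local HDFS instance -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node cannot hold 3 replicas of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

Standalone mode needs no configuration at all (everything runs in one JVM against the local filesystem), while fully distributed mode points these settings at the master node instead of localhost.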
16. Secondary Name Node
• The Secondary NameNode is not a hot backup for the NameNode.
• It takes an hourly checkpoint (backup) of the NameNode metadata.
• Its checkpoint can be used to restart a crashed Hadoop cluster.
• The Secondary NameNode is an important daemon in Hadoop 1; in Hadoop 2 it is much less important.
19. MapReduce
A MapReduce job consists of two tasks:
• Map task
• Reduce task
Blocks of data distributed across several machines are processed by map tasks in parallel.
Results are aggregated in the reducer.
MapReduce works only on key/value pairs.
20. Data Flow in a MapReduce Program in Hadoop
21. MapReduce: Word Count

Can we do word count in parallel?

Input:
  Deer Bear River
  Car Car River
  Deer Car Bear

Map output (one (word, 1) pair per token):
  (Deer, 1) (Bear, 1) (River, 1)
  (Car, 1) (Car, 1) (River, 1)
  (Deer, 1) (Car, 1) (Bear, 1)

Reduce output (counts aggregated per word):
  Bear 2
  Car 3
  Deer 2
  River 2
23. Mapper Class
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (token, 1) for every whitespace-separated token in the line
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
24. Reducer Class
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the counts emitted for this word by all mappers
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
25. Driver Class
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // the reducer also works as a combiner here
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
} // closes the WordCount class opened on the Mapper Class slide
26. Use Cases
• Utilities want to predict power consumption.
• Banks and insurance companies want to understand risk (fraud detection).
• Marketing departments want to understand customers (recommendations, location-based ad targeting).
• Threat analysis