
Hadoop development series (1)


What is big data?
What is Hadoop?
Hadoop cluster.


  1. Introduction to Big Data and Hadoop
  2. What is Big Data? • A large amount of data. • A popular term used to describe the exponential growth of data. • Big data is difficult to collect, store, maintain, analyze, and visualize.
  3. Big Data characteristics • Volume: the large amount of data. • Velocity: the rate at which data is generated. • Variety: the different types of data, namely structured data (e.g. MySQL tables), semi-structured data (e.g. XML, JSON), and unstructured data (e.g. text, audio, video).
  4. (diagram slide with no text content)
  5. Big Data sources • Social media • Banks • Instruments • Websites • Stock markets
  6. Hadoop Introduction • An open-source framework that allows distributed processing of large datasets on clusters of commodity hardware. • Hadoop is a data management tool and uses scale-out storage.
  7. Why Use Hadoop? • Cheaper: scales to petabytes or more. • Faster: parallel data processing. • Better: suited for particular types of big data problems.
  8. Where is Hadoop used? Industry and use cases: • Technology: search, "people you may know", movie recommendations. • Banks: fraud detection, regulatory risk management. • Media/Retail: marketing analytics, customer service, product recommendations. • Manufacturing: preventive maintenance.
  9. Defining a Hadoop Cluster • The size of the data is the most important factor when defining a Hadoop cluster. • Example: 5 servers with 10 TB of storage capacity each, for a total storage capacity of 50 TB. (A worked sizing sketch follows the slide list.)
  10. Hadoop Components • Hadoop 1 components: HDFS (Hadoop Distributed File System) and MapReduce. • Hadoop 2 components: HDFS and YARN/MRv2. • HDFS provides storage (reads/writes); MapReduce/YARN provides processing.
  11. Hadoop Daemons • Hadoop 1 daemons: NameNode, DataNode, and Secondary NameNode (HDFS); JobTracker and TaskTracker (MapReduce).
  12. Hadoop Daemons • Hadoop 2 daemons: NameNode, DataNode, and Secondary NameNode (HDFS); ResourceManager and NodeManager (YARN).
  13. Hadoop Master-Slave Architecture • HDFS: NameNode (master), DataNode (slave). • MR/YARN: ResourceManager (master), NodeManager (slave).
  14. Hadoop Cluster • Assume a Hadoop cluster with 4 nodes: a master node running the NameNode and ResourceManager, and slave nodes each running a DataNode and a NodeManager.
  15. Modes of Operation • Standalone • Pseudo-distributed • Fully distributed
  16. Secondary NameNode • The Secondary NameNode is not a hot backup for the NameNode. • It takes an hourly checkpoint (backup) of the NameNode metadata. • That checkpoint can be used to restart a crashed Hadoop cluster. • The Secondary NameNode is an important daemon in Hadoop 1; in Hadoop 2 it is much less important.
  17. Ecosystems • Hadoop 1: HDFS and MapReduce, with Pig, Hive, Mahout, and Oozie on top and Sqoop and Flume alongside. • Hadoop 2: HDFS and YARN (resource management), with MapReduce and other YARN frameworks (e.g. MPI, Giraph) running on YARN, and Pig, Hive, Mahout, Oozie, Sqoop, and Flume on top.
  18. MapReduce Job – Logical View (diagram slide)
  19. MapReduce • A MapReduce job consists of two tasks: a map task and a reduce task. • Blocks of data distributed across several machines are processed by map tasks in parallel. • Results are aggregated in the reducer. • MapReduce works only on key/value pairs. (A framework-free sketch of this flow follows the slide list.)
  20. Data Flow in a MapReduce Program in Hadoop (diagram slide)
  21. MapReduce: Word Count • Can we do word count in parallel? • Input lines: "Deer Bear River", "Car Car River", "Deer Car Bear". • Map output: (Deer, 1) (Bear, 1) (River, 1) (Car, 1) (Car, 1) (River, 1) (Deer, 1) (Car, 1) (Bear, 1). • Reduce output: Bear 2, Car 3, Deer 2, River 2.
  22. MapReduce: Word Count Program
  23. Mapper Class
      // Imports are omitted on the slides; see the import list after the slide deck.
      public class WordCount {

        public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          // Tokenize each input line and emit (word, 1) for every token.
          public void map(Object key, Text value, Context context)
              throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              context.write(word, one);
            }
          }
        }
  24. Reducer Class
        public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

          private IntWritable result = new IntWritable();

          // Sum the counts emitted for each word and emit (word, total).
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
          }
        }
  25. Driver Class
        // Configure and submit the job. (Imports and a build/run note follow the slide list.)
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = Job.getInstance(conf, "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setCombinerClass(IntSumReducer.class);  // combiner pre-aggregates map output locally
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }  // closes the WordCount class opened on slide 23
  26. Use Cases • Utilities want to predict power consumption. • Banks and insurance companies want to understand risk. • Fraud detection. • Marketing departments want to understand their customers. • Recommendations. • Location-based ad targeting. • Threat analysis.
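
To make the sizing example on slide 9 concrete, here is a minimal sketch of the arithmetic. It assumes the HDFS default replication factor of 3, and the 25% of each disk reserved for non-HDFS use (operating system, intermediate MapReduce output) is a common rule of thumb rather than something stated on the slides.

    // Rough cluster-sizing sketch for the example on slide 9.
    // Assumptions (not from the slides): replication factor 3 (the HDFS default,
    // dfs.replication) and 25% of raw disk reserved for non-HDFS use.
    public class ClusterSizing {
        public static void main(String[] args) {
            int servers = 5;                 // from slide 9
            double perServerTb = 10.0;       // from slide 9
            double rawTb = servers * perServerTb;             // 50 TB raw capacity

            int replicationFactor = 3;       // HDFS default
            double nonHdfsReserve = 0.25;    // assumed rule of thumb

            double hdfsTb = rawTb * (1.0 - nonHdfsReserve);   // space available to HDFS
            double uniqueDataTb = hdfsTb / replicationFactor; // unique data the cluster can hold

            System.out.printf("Raw capacity        : %.1f TB%n", rawTb);
            System.out.printf("Available to HDFS   : %.1f TB%n", hdfsTb);
            System.out.printf("Unique data (3x rep): %.1f TB%n", uniqueDataTb);
        }
    }

Under these assumptions the 50 TB of raw disk holds roughly 12.5 TB of unique data, which is why the size of the data drives the number of nodes in a cluster.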
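
As referenced on slide 19, the following is a minimal, framework-free sketch of the map, shuffle/sort, and reduce phases for word count, using only plain Java collections and the example data from slide 21. It is meant only to illustrate the key/value model; the real Hadoop classes are on slides 23-25.

    import java.util.AbstractMap;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Framework-free illustration of the word-count flow on slides 19-21:
    // map emits (word, 1), the shuffle groups values by key, reduce sums them.
    public class WordCountFlow {
        public static void main(String[] args) {
            List<String> input = Arrays.asList("Deer Bear River", "Car Car River", "Deer Car Bear");

            // Map phase: emit a (word, 1) pair for every token.
            List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
            for (String line : input) {
                for (String word : line.split("\\s+")) {
                    mapOutput.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }

            // Shuffle/sort phase: group the emitted values by key.
            Map<String, List<Integer>> grouped = new TreeMap<>();
            for (Map.Entry<String, Integer> pair : mapOutput) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }

            // Reduce phase: sum the grouped values for each key.
            for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
                int sum = 0;
                for (int count : entry.getValue()) {
                    sum += count;
                }
                System.out.println(entry.getKey() + " " + sum);  // Bear 2, Car 3, Deer 2, River 2
            }
        }
    }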
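
The WordCount program on slides 23-25 is shown without its import block. The imports below are the ones used by the standard Hadoop MapReduce WordCount example that this code follows; they are listed here for reference rather than taken from the slides.

    // Imports for the WordCount program on slides 23-25
    // (as in the standard Hadoop MapReduce WordCount example).
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Once compiled and packaged into a jar (the official Hadoop tutorial uses "hadoop com.sun.tools.javac.Main WordCount.java" followed by "jar cf wc.jar WordCount*.class"), the job is submitted with "hadoop jar wc.jar WordCount <input dir> <output dir>", where both directories are HDFS paths and the output directory must not already exist; the jar name wc.jar is a placeholder.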
