Basics of Big Data

What is big data?
Comparison of RDBMS vs. Hadoop
WordCount example

Published in: Engineering

  1. Basics of Big Data Analytics & Hadoop
     Ambuj Kumar
     Ambuj_kumar@aol.com
     http://ambuj4bigdata.blogspot.in
     http://ambujworld.wordpress.com
  2. Agenda
     • Big Data
       • Concepts overview
     • Analytics
       • Concepts overview
     • Hadoop
       • Concepts overview
     • HDFS
       • Concepts overview
       • Data flow: read & write operation
     • MapReduce
       • Concepts overview
       • WordCount program
     • Use Cases
     • Landscape
     • Hadoop Features & Summary
  3. What is Big Data?
     Big data is data that is too large, complex and dynamic for conventional data tools to capture, store, manage and analyze.
  4. Challenges of Big Data
     1. Storage (~ petabytes)
     2. Processing (in a timely manner)
     3. Variety of data (structured, semi-structured, unstructured)
     4. Cost
  5. Big Data Analytics
     Big data analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations and other useful information.
     Big Data Analytics Solutions
     There are many different Big Data Analytics solutions on the market.
     • Tableau: visualization tools
     • SAS: statistical computing
     • IBM and Oracle: a range of tools for Big Data analysis
     • Revolution: statistical computing
     • R: open-source tool for statistical computing
  6. What is Hadoop?
     • Open-source data storage and processing API
     • Massively scalable, automatically parallelizable
     • Based on work from Google
       • GFS + MapReduce + BigTable
     • Current distributions based on open source and vendor work
       • Apache Hadoop
       • Cloudera – CDH4
       • Hortonworks
       • MapR
       • AWS
       • Windows Azure HDInsight
  7. Why Use Hadoop?
     • Cheaper: scales to petabytes or more
     • Faster: parallel data processing
     • Better: suited for particular types of Big Data problems
  8. Hadoop History
     In 2008, Hadoop became an Apache Top-Level Project.
  9. Comparing: RDBMS vs. Hadoop

                            Traditional RDBMS          Hadoop / MapReduce
     Data size              Gigabytes (terabytes)      Petabytes (exabytes)
     Access                 Interactive and batch      Batch, not interactive
     Updates                Read / write many times    Write once, read many times
     Structure              Static schema              Dynamic schema
     Integrity              High (ACID)                Low
     Scaling                Nonlinear                  Linear
     Query response time    Can be near immediate      Has latency (due to batch processing)
  10. Where is Hadoop used?

      Industry         Use cases
      Technology       Search; People you may know; Movie recommendations
      Banks            Fraud detection; Regulatory; Risk management
      Media, Retail    Marketing analytics; Customer service; Product recommendations
      Manufacturing    Preventive maintenance
  11. Companies Using Hadoop
      • Search: Yahoo, Amazon, Zvents
      • Log processing: Facebook, Yahoo, ContextWeb, Joost, Last.fm
      • Recommendation systems: Facebook, LinkedIn
      • Data warehouse: Facebook, AOL
      • Video & image analysis: New York Times, Eyealike
      In almost every domain!
  12. Hadoop is a set of Apache Frameworks and more…
      • Data storage (HDFS)
        • Runs on commodity hardware (usually Linux)
        • Horizontally scalable
      • Processing (MapReduce)
        • Parallelized (scalable) processing
        • Fault tolerant
      • Other tools / frameworks
        • Data access: HBase, Hive, Pig, Mahout
        • Tools: Hue, Sqoop
        • Monitoring: Greenplum, Cloudera
      Diagram layers: Hadoop Core (HDFS), MapReduce API, Data Access, Tools & Libraries, Monitoring & Alerting
  13. Core parts of a Hadoop distribution
      • HDFS storage
        • Redundant (3 copies)
        • For large files: large blocks, 64 or 128 MB per block
        • Can scale to 1000s of nodes
      • MapReduce API
        • Batch (job) processing
        • Distributed and localized to clusters (Map)
        • Auto-parallelizable for huge amounts of data
        • Fault-tolerant (auto retries)
        • Adds high availability and more
      • Other libraries
        • Pig
        • Hive
        • HBase
        • Others
  14. Hadoop Cluster
      Diagram: Name Node, Secondary Name Node, Data Node 1, Data Node 2, Data Node 3
      HDFS (physical) storage
      • One Name Node
        • Contains a web site to view cluster information
        • Hadoop V2 uses multiple Name Nodes for HA
      • Many Data Nodes
        • 3 copies of each block by default
      • Work with data in HDFS
        • Using commands modeled on common Linux shell commands
        • Block size is 64 or 128 MB
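The block size and replication figures on slides 13 and 14 translate into simple arithmetic. The following is a minimal sketch in plain Java, not part of the original deck; the class name `HdfsBlockMath`, its helper methods and the 1 GB file size are illustrative assumptions.

```java
/** Back-of-the-envelope HDFS sizing; illustrative only, not a Hadoop API. */
final class HdfsBlockMath {

    /** Number of blocks a file is split into (ceiling division). */
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    /** Raw bytes stored across the cluster once every block is replicated. */
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileBytes = 1024 * mb;  // hypothetical 1 GB file
        long blockBytes = 128 * mb;  // block size from the slide
        int replication = 3;         // default number of copies from the slide

        long blocks = blockCount(fileBytes, blockBytes);
        System.out.println("blocks         = " + blocks);                                // 8
        System.out.println("block replicas = " + blocks * replication);                  // 24
        System.out.println("raw MB stored  = " + rawBytes(fileBytes, replication) / mb); // 3072
    }
}
```

With a 64 MB block size the same file would need 16 blocks, so the NameNode would track twice as many entries for the same data.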
  15. MapReduce Job – Logical View
  16. Hadoop Ecosystem
  17. Common Hadoop Distributions
      • Open source
        • Apache
      • Commercial
        • Cloudera
        • Hortonworks
        • MapR
        • AWS MapReduce
        • Microsoft HDInsight
  18. HDFS: Architecture
      • Master: NameNode
      • Slaves: a bunch of DataNodes
      • HDFS layers
        • Name Space (NS), held by the NameNode
        • Block Storage: block management on the NameNode, storage on the DataNodes
  19. HDFS: Basic Features
      • Highly fault-tolerant
      • High throughput
      • Suitable for applications with large data sets
      • Streaming access to file system data
      • Can be built out of commodity hardware
  20. HDFS Write (1/2)
      Diagram: Client, Name Node, Data Nodes A to D, blocks A1 to A4
      1. Client contacts the NameNode to write data
      2. NameNode says: write it to these nodes
      3. Client sequentially writes blocks to the DataNodes
  21. HDFS Write (2/2)
      DataNodes replicate data blocks, orchestrated by the NameNode.
      Block placement in the diagram:
      • Data Node A: A1, A2, A4
      • Data Node B: A2, A1, A3
      • Data Node C: A3, A2, A4
      • Data Node D: A4, A1, A3
  22. HDFS Read
      Diagram: Client, Name Node, Data Nodes A to D, with the same block placement as on the previous slide
      1. Client contacts the NameNode to read data
      2. NameNode says: you can find it here
      3. Client sequentially reads blocks from the DataNodes
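The write and read slides both rest on one idea: the NameNode keeps a map from block to DataNodes and hands out locations, while the data itself flows between client and DataNodes. Here is a toy model of that bookkeeping in plain Java. It is a sketch under stated assumptions and not the HDFS API: `ToyNameNode`, `allocateBlock` and `locate` are hypothetical names, and the round-robin placement stands in for the real rack-aware policy, so the resulting layout differs from the one drawn on the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of NameNode metadata; illustrative only, not the HDFS API. */
final class ToyNameNode {
    private final List<String> dataNodes;
    private final int replication;
    private final Map<String, List<String>> blockLocations = new LinkedHashMap<>();
    private int cursor = 0; // round-robin start position (assumed placement policy)

    ToyNameNode(List<String> dataNodes, int replication) {
        if (replication < 1 || replication > dataNodes.size()) {
            throw new IllegalArgumentException("replication must be between 1 and the node count");
        }
        this.dataNodes = new ArrayList<>(dataNodes);
        this.replication = replication;
    }

    /** Write path, step 2: pick the DataNodes that will hold a new block. */
    List<String> allocateBlock(String blockId) {
        List<String> targets = new ArrayList<>();
        for (int i = 0; i < replication; i++) {
            targets.add(dataNodes.get((cursor + i) % dataNodes.size()));
        }
        cursor = (cursor + 1) % dataNodes.size();
        blockLocations.put(blockId, targets);
        return targets;
    }

    /** Read path, step 2: report which DataNodes hold a block. */
    List<String> locate(String blockId) {
        List<String> nodes = blockLocations.get(blockId);
        if (nodes == null) {
            throw new IllegalArgumentException("unknown block " + blockId);
        }
        return nodes;
    }

    public static void main(String[] args) {
        ToyNameNode nameNode = new ToyNameNode(Arrays.asList("A", "B", "C", "D"), 3);
        for (String block : new String[] {"A1", "A2", "A3", "A4"}) {
            System.out.println("write " + block + " -> " + nameNode.allocateBlock(block));
        }
        // The client would then read the block from the first reported DataNode.
        System.out.println("read A3 from " + nameNode.locate("A3").get(0));
    }
}
```

Because every block lives on three of the four nodes, losing any single DataNode still leaves two copies of each block for `locate` to report.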
  23. HA (High Availability) for NameNode
      Diagram: NameNode (Active), NameNode (Standby), DataNodes
      • Active NameNode
        • Performs the normal NameNode operations
      • Standby NameNode
        • Maintains the NameNode's data
        • Ready to become the active NameNode
  24. MapReduce
      A MapReduce job consists of two tasks:
      • Map task
      • Reduce task
      Blocks of data distributed across several machines are processed by map tasks in parallel.
      • Results are aggregated in the reducer
      • Works only on key/value pairs
  25. MapReduce: Word Count
      Can we do word count in parallel?
      • Input (three lines): "Deer Bear River", "Car Car River", "Deer Car Bear"
      • Map output: Deer 1, Bear 1, River 1, Car 1, Car 1, River 1, Deer 1, Car 1, Bear 1
      • Reduce output: Bear 2, Car 3, Deer 2, River 2
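The slide's example can be traced without a cluster. The sketch below imitates the map, shuffle and reduce steps in plain Java with a parallel stream standing in for parallel map tasks. It does not use Hadoop, and the class and method names (`WordCountSketch`, `map`, `reduce`, `wordCount`) are illustrative assumptions.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Single-JVM imitation of the word count data flow; not Hadoop code. */
final class WordCountSketch {

    /** Map phase: emit a (word, 1) pair for every token of one input line. */
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1L));
            }
        }
        return pairs;
    }

    /** Shuffle and reduce: group the pairs by key and sum the values per key. */
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> totals = new TreeMap<>(); // sorted by key, like reducer output
        for (Map.Entry<String, Long> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return totals;
    }

    /** Runs the map step on each line in parallel, then reduces all pairs. */
    static Map<String, Long> wordCount(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = lines.parallelStream()
                .flatMap(line -> map(line).stream())
                .collect(Collectors.toList());
        return reduce(pairs);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Deer Bear River", "Car Car River", "Deer Car Bear");
        System.out.println(wordCount(lines)); // {Bear=2, Car=3, Deer=2, River=2}
    }
}
```

The map step needs no shared state, which is why each line can be handled independently; only the per-key summation has to see all pairs for a given word.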
  26. MapReduce: Word Count Program
  27. Data Flow in a MapReduce Program in Hadoop
  28. Mapper Class

      package ambuj.com.wc;

      import java.io.IOException;
      import java.util.StringTokenizer;

      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

          // Reused across calls to avoid allocating a new object per token.
          private final static LongWritable one = new LongWritable(1);
          private Text word = new Text();

          // Input key: byte offset of the line; input value: the line itself.
          @Override
          public void map(LongWritable inputKey, Text inputVal, Context context)
                  throws IOException, InterruptedException {
              String line = inputVal.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                  word.set(tokenizer.nextToken());
                  context.write(word, one); // emit (word, 1)
              }
          }
      }
  29. Reducer Class

      package ambuj.com.wc;

      import java.io.IOException;

      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;

      public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

          // Called once per word with all the counts emitted for it.
          @Override
          public void reduce(Text key, Iterable<LongWritable> listOfValues, Context context)
                  throws IOException, InterruptedException {
              long sum = 0;
              for (LongWritable val : listOfValues) {
                  sum = sum + val.get();
              }
              context.write(key, new LongWritable(sum)); // emit (word, total)
          }
      }
  30. Driver Class

      package ambuj.com.wc;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.conf.Configured;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.util.Tool;
      import org.apache.hadoop.util.ToolRunner;

      public class WordCountDriver extends Configured implements Tool {

          @Override
          public int run(String[] args) throws Exception {
              // Use the configuration prepared by ToolRunner so that generic
              // command-line options (such as -D key=value) take effect.
              Configuration conf = getConf();
              // Deprecated in Hadoop 2.x, where Job.getInstance(conf, "WordCount") replaces it.
              Job job = new Job(conf, "WordCount");
              job.setJarByClass(WordCountDriver.class);

              job.setMapperClass(WordCountMapper.class);
              job.setReducerClass(WordCountReducer.class);
              job.setInputFormatClass(TextInputFormat.class);

              job.setMapOutputKeyClass(Text.class);
              job.setMapOutputValueClass(LongWritable.class);
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(LongWritable.class);

              // args[0]: input path, args[1]: output path (must not exist yet)
              FileInputFormat.addInputPath(job, new Path(args[0]));
              FileOutputFormat.setOutputPath(job, new Path(args[1]));

              // Report job failure through the exit code.
              return job.waitForCompletion(true) ? 0 : 1;
          }

          public static void main(String[] args) throws Exception {
              System.exit(ToolRunner.run(new WordCountDriver(), args));
          }
      }
  31. A view of Hadoop
      • Client: submits the job
      • Master: Job Tracker (MapReduce) and Name Node (HDFS)
      • Slaves: each runs a Task Tracker with its tasks (MapReduce) and a Data Node holding blocks (HDFS)
  32. Use Cases
      • Utilities want to predict power consumption
      • Banks and insurance companies want to understand risk
      • Fraud detection
      • Marketing departments want to understand customers
      • Recommendations
      • Location-based ad targeting
      • Threat analysis
  33. Big Data Landscape
  34. Hadoop Features & Summary
      Hadoop is a distributed framework for processing and storing data, generally on commodity hardware. It is completely open source and written in Java.
      • Store anything
        • Unstructured or semi-structured data
      • Storage capacity
        • Scales linearly; cost is not exponential
      • Data locality and processing your way
        • Code moves to the data
        • In MR you specify the actual steps for processing the data and drive the output
      • Stream access: process data in any language
      • Failure and fault tolerance
        • Detects failures and heals itself
        • Reliable: data is replicated and failed tasks are rerun, so there is no need to maintain backups of the data
      • Cost effective: Hadoop is designed as a scale-out architecture operating on a cluster of commodity PC machines
      The Hadoop framework transparently provides applications with both reliability and data motion.
      It is primarily used for batch processing, not for real-time or transactional user applications.
  35. References - Hadoop
      • Tom White, Hadoop: The Definitive Guide, Third Edition
      • http://hadoop.apache.org
      • http://www.cloudera.com
      • http://ambuj4bigdata.blogspot.com
      • http://ambujworld.wordpress.com
  36. Thank You