An introduction to Hadoop for large scale data analysis

1. Hadoop – Large scale data analysis
   Abhijit Sharma
2. Big Data Trends
   - Unprecedented growth in:
     - Data set size – Facebook runs a 21+ PB data warehouse, growing by 12+ TB/day
     - Un(semi)-structured data – logs, documents, graphs
     - Connected data – web, tags, graphs
   - Relevant to enterprises – logs, social media, machine-generated data, breaking down of data silos
3. Putting Big Data to work
   - Data-driven organizations – decision support, new offerings
   - Analytics on large data sets – e.g. Facebook Insights for Page and App statistics
   - Data mining – clustering, e.g. grouping of Google News articles
   - Search – Google
4. Problem characteristics and examples
   - Embarrassingly data-parallel problems
   - Data chunked & distributed across a cluster
   - Parallel processing with data locality – each task is dispatched to where its data lives
   - Horizontal/linear scaling on commodity hardware
   - Write once, read many
   - Examples:
     - Distributed logs – grep, # of accesses per URL
     - Search – term vector generation, reverse links
5. What is Hadoop?
   - An open source system for large-scale batch distributed computing on big data:
     - Map Reduce programming paradigm & framework
     - Map Reduce infrastructure
     - Distributed file system (HDFS)
   - Based on Google's MapReduce/GFS papers; endorsed and used extensively by web giants such as Facebook and Yahoo!
6. Map Reduce – Definition
   - MapReduce is a programming model, and an implementation of it, for parallel processing of large data sets
   - Map processes each logical record of its input split and generates a set of intermediate key/value pairs
   - Reduce merges all intermediate values associated with the same intermediate key
7. Map Reduce – Functional Programming Origins
   - Map: apply a function to each list member – parallelizable
         [1, 2, 3].collect { it * it }
         [1, 2, 3] -> Map (Square) -> [1, 4, 9]
   - Reduce: fold a function with an accumulator over the list members
         [1, 2, 3].inject(0) { sum, item -> sum + item }
         [1, 2, 3] -> Reduce (Sum) -> 6
   - Map & Reduce combined
         [1, 2, 3].collect { it * it }.inject(0) { sum, item -> sum + item }
         [1, 2, 3] -> Map (Square) -> [1, 4, 9] -> Reduce (Sum) -> 14
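The Groovy snippets above translate directly to Java, the language Hadoop itself is written in. A minimal sketch using java.util.stream (a standard-library analogy for the map/reduce idea, not a Hadoop API):

    import java.util.stream.IntStream;

    public class MapReduceOrigins {
        public static void main(String[] args) {
            // Map (square) each element, then reduce (sum) the results: 1 + 4 + 9 = 14
            int result = IntStream.of(1, 2, 3)
                    .map(i -> i * i)          // map: [1, 4, 9]
                    .reduce(0, Integer::sum); // reduce: 14
            System.out.println(result);       // prints 14
        }
    }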
8. Word Count – Shell
       cat *   |  grep <pattern>  |  sort            |  uniq -c
       input   |  map             |  shuffle & sort  |  reduce
9. Word Count – Map Reduce (diagram slide)
10. Word Count – Pseudo code
        mapper(filename, file-contents):
            for each word in file-contents:
                emit(word, 1)   // one count per occurrence, e.g. ("the", 1) for every occurrence of "the"

        reducer(word, values):  // iterator over the list of counts for a word, e.g. ("the", [1, 1, ...])
            sum = 0
            for each value in values:
                sum = sum + value
            emit(word, sum)
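As a concrete counterpart to the pseudocode, a sketch of word count against Hadoop's org.apache.hadoop.mapreduce Java API; the class names are illustrative, and the job wiring appears under Job Configuration below:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map: for every word in the input split, emit (word, 1)
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {

            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts collected for each word and emit (word, sum)
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }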
11. Examples – Map Reduce Definition
    - Word count / distributed log search for # of accesses to various URLs
      - Map – emits (word/URL, 1) for each doc/log split
      - Reduce – sums up the counts for a specific word/URL
    - Term vector generation – term -> [doc-id]
      - Map – emits (term, doc-id) for each doc split
      - Reduce – identity reducer – accumulates (term, [doc-id, doc-id, ...])
    - Reverse links – inverting source -> target to target -> source (see the sketch after this list)
      - Map – emits (target, source) for each doc split
      - Reduce – identity reducer – accumulates (target, [source, source, ...])
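A minimal sketch of the reverse-links pair against the same Hadoop API; the input layout (one "source target" pair per line) and the class names are assumptions for illustration:

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ReverseLinks {

        // Map: invert each "source target" line to (target, source)
        public static class InvertMapper extends Mapper<Object, Text, Text, Text> {
            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] edge = value.toString().split("\\s+"); // assumed "source target" layout
                if (edge.length == 2) {
                    context.write(new Text(edge[1]), new Text(edge[0]));
                }
            }
        }

        // Reduce: "identity" reducer that accumulates every source pointing at a target
        public static class CollectReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            public void reduce(Text target, Iterable<Text> sources, Context context)
                    throws IOException, InterruptedException {
                StringBuilder sb = new StringBuilder();
                for (Text source : sources) {
                    if (sb.length() > 0) sb.append(',');
                    sb.append(source);
                }
                context.write(target, new Text(sb.toString()));
            }
        }
    }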
12. Map Reduce – Hadoop Implementation
    Hides the complexity of distributed computing:
    - Automatic parallelization of the job
    - Automatic data chunking & distribution (via HDFS)
    - Data locality – each MR task is dispatched to where its data is
    - Fault tolerance against server, storage, and network failures
    - Network and disk transfer optimization
    - Load balancing
13. Hadoop Map Reduce Architecture (diagram slide)
14. HDFS Characteristics
    - Very large files – block size 64 MB/128 MB
    - Data access pattern – write once, read many (see the client-side sketch after this list)
      - Writes are large, create & append only
      - Reads are large & streaming
    - Commodity hardware
      - Tolerant of server, storage, and network failure
      - Highly available through transparent replication
    - Throughput is more important than latency
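A sketch of the write-once/read-many pattern from the client side, using Hadoop's org.apache.hadoop.fs.FileSystem API; the path is illustrative:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsAccessPattern {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            // Write once: a single large sequential create (no random updates)
            Path path = new Path("/logs/access.log"); // illustrative path
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeBytes("GET /index.html 200\n");
            }

            // Read many: a large streaming scan of the whole file
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }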
15. HDFS Architecture (diagram slide)
16. Thanks
17. Backup Slides
18. Map & Reduce Functions
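The body of this backup slide is not preserved in the transcript. As a sketch of what the title refers to, these are the two hooks a job supplies to the framework; the four type parameters are the input and output key/value classes, and the class names are illustrative:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class Skeleton {

        // Mapper<input key, input value, output key, output value>
        public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // invoked once per record of the input split
                context.write(new Text("k"), line);
            }
        }

        // Reducer<intermediate key, intermediate value, output key, output value>
        public static class MyReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                // invoked once per intermediate key, with every value emitted for it
                for (Text v : values) context.write(key, v);
            }
        }
    }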
19. Job Configuration
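This backup slide's body is also missing from the transcript. A typical driver configuring the word-count job sketched earlier might look like this; the input and output paths come from the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            // wire in the classes from the word-count sketch above
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(WordCount.IntSumReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }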
20. Hadoop Map Reduce Components
    - Job Tracker – tracks MR jobs; runs on the master node
    - Task Tracker
      - Runs on the data nodes and tracks the Mapper and Reducer tasks assigned to its node
      - Sends heartbeats to the Job Tracker
      - Maintains a task queue and picks up tasks from it
21. HDFS
    - Name Node
      - Manages the file system namespace and regulates client access to files – stores the metadata
      - Maps blocks to Data Nodes and their replicas, and manages replication
      - Executes file system namespace operations such as opening, closing, and renaming files and directories
    - Data Node
      - One per node; manages the local storage attached to that node
      - Internally a file is split into one or more blocks, and these blocks are stored across a set of Data Nodes
      - Serves read and write requests from the file system's clients; also performs block creation, deletion, and replication on instruction from the Name Node
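The block-to-Data-Node mapping the Name Node maintains is visible from a client through FileSystem.getFileBlockLocations. A minimal sketch, with an illustrative path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/logs/access.log")); // illustrative

            // Ask the Name Node which Data Nodes hold each block of the file
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }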
