Hadoop MapReduce Fundamentals

Deck from my 5-part YouTube series (SoCalDevGal channel) on Hadoop MapReduce

Published in: Technology, Education

  1. Hadoop MapReduce Fundamentals, @LynnLangit, a five part series – Part 1 of 5
  2. Course Outline
  3. What is Hadoop?
     • Open-source data storage and processing API
     • Massively scalable, automatically parallelizable
     • Based on work from Google: GFS + MapReduce + BigTable
     • Current distributions based on open source and vendor work: Apache Hadoop, Cloudera (CDH4 w/ Impala), Hortonworks, MapR, AWS, Windows Azure HDInsight
  4. Why Use Hadoop?
     • Cheaper: scales to petabytes or more
     • Faster: parallel data processing
     • Better: suited for particular types of BigData problems
  5. What types of business problems for Hadoop? Source: Cloudera “Ten Common Hadoopable Problems”
  6. Companies Using Hadoop: Facebook, Yahoo, Amazon, eBay, American Airlines, The New York Times, Federal Reserve Board, IBM, Orbitz
  7. Forecast growth of Hadoop Job Market. Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html
  8. Hadoop is a set of Apache Frameworks and more…
     • Data storage (HDFS): runs on commodity hardware (usually Linux); horizontally scalable
     • Processing (MapReduce): parallelized (scalable) processing; fault tolerant
     • Other tools / frameworks: data access (HBase, Hive, Pig, Mahout); tools (Hue, Sqoop); monitoring (Greenplum, Cloudera)
     • Diagram layers: Hadoop Core - HDFS; MapReduce API; Data Access; Tools & Libraries; Monitoring & Alerting
  9. What are the core parts of a Hadoop distribution?
  10. Hadoop Cluster HDFS (Physical) Storage
  11. MapReduce Job – Logical View. Image from http://mm-tom.s3.amazonaws.com/blog/MapReduce.png
  12. Hadoop Ecosystem
  13. Common Hadoop Distributions
     • Open source: Apache
     • Commercial: Cloudera, Hortonworks, MapR, AWS MapReduce, Microsoft HDInsight (Beta)
  14. A View of Hadoop (from Hortonworks). Source: “Intro to Map Reduce” -- http://www.youtube.com/watch?v=ht3dNvdNDzI
  15. Setting up Hadoop Development
  16. Demo – Setting up Cloudera Hadoop. Note: Demo VMs can be downloaded from https://ccp.cloudera.com/display/SUPPORT/Demo+VMs
  17. Hadoop MapReduce Fundamentals, @LynnLangit, a five part series – Part 2 of 5
  18. So, what’s the problem?
     • “I can just use some ‘SQL-like’ language to query Hadoop, right?”
     • “Yeah, SQL-on-Hadoop… that’s what I want”
     • “I don’t want to learn a new query language and…”
     • “I want massive scale for my shiny, new BigData”
  19. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used
  20. Demo – Using Hive QL on CDH4
  21. What is Hive?
     • A data warehouse system for Hadoop that facilitates easy data summarization and supports ad-hoc queries (still batch though…); created by Facebook
     • A mechanism to project structure onto this data and query the data using a SQL-like language, HiveQL: use the interactive console or execute scripts; kicks off one or more MapReduce jobs in the background
     • An ability to use indexes, built-in and user-defined functions
  22. Is HQL == ANSI SQL? – NO!
     -- non-equality joins ARE allowed in ANSI SQL
     -- but are NOT allowed in Hive (HQL)
     SELECT a.*
     FROM a
     JOIN b ON (a.id <> b.id)
     Note: Joins are quite different in MapReduce, more on that coming up…
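One way to see why equality joins fit MapReduce and the `<>` join above does not: in a reduce-side join, the join key is used as the shuffle key, so only rows with equal keys ever meet in the same reduce call. The sketch below simulates that in plain Python, with no Hadoop involved. The function names (`map_tagged`, `reduce_join`, `equi_join`) and the sample rows are hypothetical, chosen only for illustration, and this is not how Hive itself is implemented.

```python
from collections import defaultdict


def map_tagged(table_name, rows, key_field):
    """Map step: emit (join_key, (table_tag, row)) for every row of one table."""
    return [(row[key_field], (table_name, row)) for row in rows]


def reduce_join(tagged_rows):
    """Reduce step: pair every 'a' row with every 'b' row that shares this key."""
    left = [row for tag, row in tagged_rows if tag == "a"]
    right = [row for tag, row in tagged_rows if tag == "b"]
    return [(l, r) for l in left for r in right]


def equi_join(a_rows, b_rows, key_field="id"):
    """Simulate map -> shuffle (group by join key) -> reduce for a.id = b.id."""
    grouped = defaultdict(list)
    emitted = map_tagged("a", a_rows, key_field) + map_tagged("b", b_rows, key_field)
    for key, tagged in emitted:
        grouped[key].append(tagged)  # the "shuffle": equal keys land together
    joined = []
    for key in sorted(grouped):
        joined.extend(reduce_join(grouped[key]))
    return joined


# Hypothetical sample data
a = [{"id": 1, "x": "a1"}, {"id": 2, "x": "a2"}]
b = [{"id": 2, "y": "b2"}, {"id": 3, "y": "b3"}]
print(equi_join(a, b))
```

With a `<>` condition there is no single key that brings the matching rows together, so this grouping trick no longer applies.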
  23. Preparing for MapReduce
  24. Common Hadoop Shell Commands
     hadoop fs -cat file:///file2
     hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
     hadoop fs -copyFromLocal <fromDir> <toDir>
     hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
     sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
     hadoop fs -ls /user/hadoop/dir1
     hadoop fs -cat hdfs://nn1.example.com/file1
     hadoop fs -get /user/hadoop/file <localfile>
     Tips
     -- ‘sudo’ means ‘run as administrator’ (super user)
     -- some hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’; file paths to hadoop differ for the former, see the link included for more detail
  25. Demo – Working with Files and HDFS
  26. Thinking in MapReduce Hint: “It’s Functional”
  27. Understanding MapReduce – P1/3
     • Map>>: (K1, V1) info in (input split) -> list (K2, V2) key / value out (intermediate values); one list per local node; can implement local Reducer (or Combiner)
  28. Understanding MapReduce – P2/3
     • Map>>: (K1, V1) info in (input split) -> list (K2, V2) key / value out (intermediate values); one list per local node; can implement local Reducer (or Combiner)
     • Shuffle/Sort>>
  29. Understanding MapReduce – P3/3
     • Map>>: (K1, V1) info in (input split) -> list (K2, V2) key / value out (intermediate values); one list per local node; can implement local Reducer (or Combiner)
     • Shuffle/Sort>>: the Shuffle / Sort phase precedes the Reduce phase; combines Map output into a list
     • Reduce: (K2, list(V2)) -> list (K3, V3); usually aggregates intermediate values
     • (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
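The map -> combine -> shuffle/sort -> reduce flow can be sketched in a few lines of plain Python. This is a single-process simulation using word counting as the job; it is not the Hadoop API, and the function names (`map_fn`, `combine_fn`, `shuffle_sort`, `reduce_fn`, `run_job`) and the sample splits are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_fn(_key, line):
    """Map: (K1, V1) -> list of (K2, V2). K1 is a line index, V1 the line text."""
    return [(word.lower(), 1) for word in line.split()]


def combine_fn(pairs):
    """Combiner: a local reduce over one mapper's output, to shrink shuffle traffic."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle_sort(pairs):
    """Shuffle/Sort: group values by key, keys in sorted order -> (K2, list(V2))."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_fn(key, values):
    """Reduce: (K2, list(V2)) -> (K3, V3). Here it sums the partial counts."""
    return (key, sum(values))


def run_job(splits):
    """Run the whole pipeline over a list of input splits (lists of lines)."""
    combined = []
    for split in splits:  # one simulated mapper per input split
        mapped = chain.from_iterable(map_fn(i, line) for i, line in enumerate(split))
        combined.extend(combine_fn(mapped))
    return [reduce_fn(key, values) for key, values in shuffle_sort(combined)]


# Hypothetical input: two splits, as if stored on two nodes
splits = [["the quick brown fox", "the lazy dog"], ["the fox"]]
print(run_job(splits))
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```

Note that the combiner only pre-aggregates within one split: "the" leaves the first mapper as a count of 2 and the second as 1, and the reducer produces the final 3.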
  30. MapReduce Example - WordCount. Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
  31. MapReduce Objects. Each daemon spawns a new JVM
  32. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used
  33. Demo – Running MapReduce WordCount
  34. Hadoop MapReduce Fundamentals, @LynnLangit, a five part series – Part 3 of 5
  35. Ways to run MapReduce Jobs
     • Configure JobConf options
     • From a development environment (IDE)
     • From a GUI utility: Cloudera – Hue; Microsoft Azure – HDInsight console
     • From the command line: hadoop jar <filename.jar> input output
  36. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used
  37. Setting up Hadoop On Windows Azure: About HDInsight
  38. Demo – MapReduce in the Cloud: WordCount MapReduce using HDInsight
  39. MapReduce (WordCount) with JavaScript. Note: JavaScript is part of the Azure Hadoop distribution
  40. Common Data Sources for MapReduce Jobs
  41. Where is your Data coming from?
     • On premises: local file system, local HDFS instance
     • Private cloud: cloud storage
     • Public cloud: input storage buckets, script / code buckets, output buckets
  42. Common Data Jobs for MapReduce
  43. Demo – Other Types of MapReduce. Tip: Review the Java MapReduce code in these samples as well.
  44. Methods to write MapReduce Jobs
     • Typical: usually written in Java, using the MapReduce 2.0 API or the MapReduce 1.0 API
     • Streaming: uses stdin and stdout; can use any language to write Map and Reduce functions (C#, Python, JavaScript, etc…)
     • Pipes: often used with C++
     • Abstraction libraries: Hive, Pig, etc…; write in a higher level language, generate one or more MapReduce jobs
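As a rough illustration of the Streaming method, here is a minimal word-count mapper and reducer that talk only through stdin and stdout, using tab-separated `key<TAB>value` lines. The file name `wc_streaming.py`, the `map` / `reduce` command-line argument, and the function names are assumptions for this sketch, not part of the deck's demos. The reducer relies on its input arriving sorted by key, which the shuffle/sort phase provides on a cluster.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Yield one 'word<TAB>1' record per word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(lines):
    """Sum the counts per word; assumes the input is already sorted by key."""
    records = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(records, key=lambda record: record[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run(role, stdin=sys.stdin, stdout=sys.stdout):
    """Read records from stdin, apply the chosen step, write records to stdout."""
    step = mapper if role == "map" else reducer
    for record in step(stdin):
        stdout.write(record + "\n")


# Only act as a streaming script when called as: python wc_streaming.py map|reduce
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    run(sys.argv[1])
```

A local dry run without Hadoop could look like `cat input.txt | python wc_streaming.py map | sort | python wc_streaming.py reduce`, where `sort` stands in for the shuffle/sort phase. On a cluster, the same two commands would be handed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options.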
  45. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used
  46. Demo – MapReduce via C# & PowerShell
  47. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used
  48. Using AWS MapReduce. Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
  49. What is Pig?
     • ETL library for HDFS developed at Yahoo: Pig Runtime, Pig Language; generates MapReduce jobs
     • ETL steps: LOAD <file>; FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…; DUMP {to screen for testing}; STORE <newFile>
  50. MapReduce Python Sample. Remember that white space matters in Python!
  51. Demo – Using AWS MapReduce with Pig. Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
  52. AWS Data Pipeline with HIVE
  53. Hadoop MapReduce Fundamentals, @LynnLangit, a five part series – Part 4 of 5
  54. Better MapReduce - Optimizations
  55. Optimization BEFORE running a MapReduce Job
  56. More about Input File Compression
     From Cloudera… Their version of LZO is ‘splittable’
     Type   File      Size GB   Compress   Decompress
     None   Log       8.0       -          -
     Gzip   Log.gz    1.3       241        72
     LZO    Log.lzo   2.0       55         35
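The trade-off in the table can be made explicit with a little arithmetic. The sketch below only reuses the figures from the slide; the slide gives no unit for the Compress and Decompress columns, so they are compared as ratios rather than interpreted. The names `CODECS`, `compression_ratio`, and `relative_cost` are hypothetical.

```python
# Figures copied from the slide's table. The slide gives no unit for the
# Compress / Decompress columns, so they are treated as unitless costs here.
ORIGINAL_SIZE_GB = 8.0
CODECS = {
    "gzip": {"size_gb": 1.3, "compress": 241, "decompress": 72},
    "lzo": {"size_gb": 2.0, "compress": 55, "decompress": 35},
}


def compression_ratio(codec):
    """How many times smaller the compressed log is than the 8.0 GB original."""
    return ORIGINAL_SIZE_GB / CODECS[codec]["size_gb"]


def relative_cost(column, slower="gzip", faster="lzo"):
    """Ratio of the two codecs' figures in one column ('compress' or 'decompress')."""
    return CODECS[slower][column] / CODECS[faster][column]


for name in CODECS:
    print(f"{name}: {compression_ratio(name):.2f}x smaller")
print(f"gzip/lzo compress cost: {relative_cost('compress'):.2f}")
print(f"gzip/lzo decompress cost: {relative_cost('decompress'):.2f}")
```

On these figures Gzip produces the smaller file (about 6.15x versus 4x), while LZO is roughly 4.4 times cheaper to compress and about 2 times cheaper to decompress, in addition to being splittable in Cloudera's version.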
  57. Optimization WITHIN a MapReduce Job
  58. 59
  59. Mapper Task Optimization
  60. Data Types
     • Writable: Text (String), IntWritable, LongWritable, FloatWritable, BooleanWritable
     • WritableComparable for keys
     • Custom types supported: write a RawComparator
  61. Reducer Task Optimization
  62. MapReduce Job Optimization
  63. Demo – Unit Testing MapReduce: using MRUnit + Asserts, optionally using ApprovalTests. Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png
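MRUnit is a Java library, so the sketch below is not MRUnit. It only illustrates the same idea in Python's standard `unittest` module: test the map and reduce functions in isolation, feeding known input and asserting on the emitted key/value pairs, with no cluster involved. The functions `map_words` and `reduce_counts` are hypothetical stand-ins for a word-count job.

```python
import unittest


def map_words(line):
    """Mapper under test: emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reducer under test: sum all counts emitted for one word."""
    return (word, sum(counts))


class WordCountTest(unittest.TestCase):
    def test_mapper_emits_one_pair_per_word(self):
        self.assertEqual(map_words("Cat cat dog"), [("cat", 1), ("cat", 1), ("dog", 1)])

    def test_mapper_handles_empty_line(self):
        self.assertEqual(map_words(""), [])

    def test_reducer_sums_counts(self):
        self.assertEqual(reduce_counts("cat", [1, 1]), ("cat", 2))


def run_tests():
    """Run the test case programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(WordCountTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests().wasSuccessful()` reports whether all three checks passed.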
  64. A note about MapReduce 2.0
     • Splits the existing JobTracker’s roles: resource management and job lifecycle management
     • MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as better scalability through distributed job lifecycle management, and support for multiple Hadoop MapReduce API versions in a single cluster
  65. What is Mahout?
     • Library with common machine learning algorithms
     • Over 20 algorithms: Recommendation (likelihood – Pandora); Classification (known data and new data – spam id); Clustering (new groups of similar data – Google News)
     • Can non-statisticians find value using this library?
  66. Mahout Algorithms
  67. Setting up Hadoop on Windows
     • For local development
     • Install from binaries from Web Platform Installer
     • Install .NET Azure SDK (for Azure BLOB storage)
     • Install other tools: Neudesic Azure Storage Viewer
  68. Demo – Mahout Using HDInsight
  69. What about the output?
  70. Clients (Visualizations) for HDFS
     • Many clients use Hive; often included in GUI console tools for Hadoop distributions as well
     • Microsoft includes clients in Office (Excel 2013): direct Hive client; connect using ODBC; PowerPivot (data mashups and presentation); Data Explorer (connect, transform, mashup and filter); Hadoop SDK on Codeplex
     • Other popular clients: Qlikview, Tableau, Karmasphere
  71. Demo – Executing Hive Queries
  72. Demo – Using HDFS output in Excel 2013. To download Data Explorer: http://www.microsoft.com/en-us/download/details.aspx?id=36803
  73. About Visualization
  74. Demo – New Visualizations – D3
  75. Hadoop MapReduce Fundamentals, @LynnLangit, a five part series – Part 5 of 5
  76. Limitations of MapReduce
  77. Comparing: RDBMS vs. Hadoop
                           Traditional RDBMS          Hadoop / MapReduce
     Data Size             Gigabytes (Terabytes)      Petabytes (Exabytes)
     Access                Interactive and Batch      Batch – NOT Interactive
     Updates               Read / Write many times    Write once, Read many times
     Structure             Static Schema              Dynamic Schema
     Integrity             High (ACID)                Low
     Scaling               Nonlinear                  Linear
     Query Response Time   Can be near immediate      Has latency (due to batch processing)
  78. Microsoft alternatives to MapReduce
     • Use existing relational system: scale via cloud or edition (i.e. Enterprise or PDW)
     • Use in-memory OLAP: SQL Server Analysis Services Tabular Models
     • Use “productized” Dremel: Microsoft Polybase – status = beta?
  79. Looking Forward - Dremel or Apache Drill: based on original research from Google
  80. Apache Drill Architecture
  81. In-market MapReduce Alternatives: Cloudera Impala, Google BigQuery
  82. Demo – Google’s BigQuery: Dremel for the rest of us
  83. Hadoop MapReduce Call to Action
  84. More MapReduce Developer Resources
     • Based on the distribution – on premises
        Apache: MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
        Cloudera: Cloudera University - http://university.cloudera.com/
        Cloudera: Cloudera Developer Course (4 day) - *RECOMMENDED* - http://university.cloudera.com/training/apache_hadoop/developer.html
        Hortonworks
        MapR
     • Based on the distribution – cloud
        AWS MapReduce: Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs
        Windows Azure HDInsight: Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
        Windows Azure HDInsight: More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-to-hadoop/
  85. The Changing Data Landscape
