Hadoop MapReduce Fundamentals


Deck from my five-part YouTube series (SoCalDevGal channel) on Hadoop MapReduce.

Published in: Technology, Education

  • http://en.wikipedia.org/wiki/MapReduce
  • http://allthingsd.com/files/2012/04/big-numbers.jpg
  • http://www.cloudera.com/content/dam/cloudera/Resources/PDF/cloudera_White_Paper_Ten_Hadoopable_Problems_Real_World_Use_Cases.pdf Also -- http://gigaom.com/2012/06/05/10-ways-companies-are-using-hadoop-to-do-more-than-serve-ads/
  • Image: http://siliconangle.com/files/2012/08/hadoop-300x300.jpg
  • http://www.platfora.com/wp-content/themes/PlatforaV2.0/img/enter/deployment_pick_graphic.png
  • http://indoos.files.wordpress.com/2010/08/hadoop_map1.png?w=819&h=612
  • http://datameer2.datameer.com/blog/wp-content/uploads/2012/06/hadoop_ecosystem_d3_photoshop.jpg http://datameer2.datameer.com/blog/wp-content/uploads/2013/01/hadoop_ecosystem_clean.png http://www.datameer.com/blog/perspectives/hadoop-ecosystem-as-of-january-2013-now-an-app.html
  • Image from: http://vichargrave.com/wp-content/uploads/2013/02/Hadoop-Development.png http://wiki.apache.org/hadoop/HowToSetupYourDevelopmentEnvironment https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4
  • https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads
  • http://queryio.com/hadoop-big-data-images/hadoop-sql.jpg
  • http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  • http://hive.apache.org/ https://cwiki.apache.org/confluence/display/Hive/GettingStarted
  • https://cwiki.apache.org/confluence/display/Hive/LanguageManual http://en.wikipedia.org/wiki/Apache_Hive
  • http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html http://nsinfra.blogspot.in/2012/06/difference-between-hadoop-dfs-and.html
  • http://www.fincher.org/tips/General/SoftwareEngineering/FunctionalProgramming.shtml http://rbxbx.info/images/fault-tolerance.png
  • The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
  • http://www.windowsazure.com/en-us/manage/services/hdinsight/get-started-hdinsight/
  • Image from http://curiousellie.typepad.com/.a/6a0133ec911c1f970b0168ebe6a2e4970c-500wi
  • http://hadoop.apache.org/docs/r1.1.2/streaming.html How to run and compile a Hadoop Java program -- https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program Sample code to compile a Java class (on Linux, where classpath entries are separated by a colon): javac -classpath ~/hadoop/hadoop-core-1.0.1.jar:commons-cli-1.2.jar -d classes <nameOfJavaFile>.java && jar -cvf <nameOfJarFile>.jar -C classes/ .
  • http://blogs.msdn.com/b/carlnol/archive/2013/02/05/submitting-hadoop-mapreduce-jobs-using-powershell.aspx
  • About: Pig - http://en.wikipedia.org/wiki/Pig_(programming_tool) PigLatin language reference - http://pig.apache.org/docs/r0.10.0/start.html#pl-statements
  • http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
  • http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ http://www.slideshare.net/cloudera/mr-perf
  • http://4.bp.blogspot.com/-2S6IuPD71A8/TZiNw8AyWkI/AAAAAAAAB0k/tS5QTP9SzHA/s1600/Detailed%2BHadoop%2BMapreduce%2BData%2BFlow.png
  • The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
  • Tips from Cloudera -- http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ & http://www.slideshare.net/Hadoop_Summit/optimizing-mapreduce-job-performance
  • http://blog.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/ http://hadoop.apache.org/docs/r0.23.6/api/index.html
  • http://mahout.apache.org/
  • Download local Hadoop via the Web Platform Installer. Also download the Azure .NET SDK for VS 2012. Link to download Windows Azure storage explorer: http://azurestorageexplorer.codeplex.com/ Link for downloading .NET SDK for Hadoop: http://hadoopsdk.codeplex.com/wikipage?title=roadmap&referringTitle=Home
  • Image from - http://bluewatersql.files.wordpress.com/2013/04/image12.png
  • http://www.research-live.com/Journals/1/Files/2013/1/11/covermania.jpg
  • https://github.com/mbostock/d3/wiki/Gallery
  • Original Reference: Tom White’s Hadoop: The Definitive Guide (I made some modifications based on my experience)
  • http://research.google.com/pubs/pub36632.html
  • https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit
  • http://cloudera.com/content/cloudera/en/campaign/introducing-impala.html GigaOm ‘The Future…of Hadoop is real-time’ -- http://gigaom.com/2013/03/07/5-reasons-why-the-future-of-hadoop-is-real-time-relatively-speaking/ http://devopsangle.com/2012/08/20/googles-dremel-here-comes-a-new-challenger-to-yarnhadoop/
  • Hadoop MapReduce Fundamentals

    1. Hadoop MapReduce Fundamentals (@LynnLangit): a five-part series, Part 1 of 5
    2. Course Outline
    3. What is Hadoop? Open-source data storage and processing API. Massively scalable, automatically parallelizable. Based on work from Google: GFS + MapReduce + BigTable. Current distributions based on open source and vendor work: Apache Hadoop, Cloudera (CDH4 w/ Impala), Hortonworks, MapR, AWS, Windows Azure HDInsight.
    4. Why Use Hadoop? Cheaper: scales to petabytes or more. Faster: parallel data processing. Better: suited for particular types of Big Data problems.
    5. What types of business problems for Hadoop? Source: Cloudera “Ten Common Hadoopable Problems”
    6. Companies Using Hadoop: Facebook, Yahoo, Amazon, eBay, American Airlines, The New York Times, Federal Reserve Board, IBM, Orbitz
    7. Forecast growth of Hadoop Job Market. Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html
    8. Hadoop is a set of Apache Frameworks and more… Data storage (HDFS): runs on commodity hardware (usually Linux); horizontally scalable. Processing (MapReduce): parallelized (scalable) processing; fault tolerant. Other tools / frameworks: data access (HBase, Hive, Pig, Mahout); tools (Hue, Sqoop); monitoring (Greenplum, Cloudera). Diagram layers: Hadoop Core - HDFS; MapReduce API; Data Access; Tools & Libraries; Monitoring & Alerting.
    9. What are the core parts of a Hadoop distribution?
    10. Hadoop Cluster HDFS (Physical) Storage
    11. MapReduce Job – Logical View. Image from - http://mm-tom.s3.amazonaws.com/blog/MapReduce.png
    12. Hadoop Ecosystem
    13. Common Hadoop Distributions. Open source: Apache. Commercial: Cloudera, Hortonworks, MapR, AWS MapReduce, Microsoft HDInsight (Beta).
    14. A View of Hadoop (from Hortonworks). Source: “Intro to Map Reduce” -- http://www.youtube.com/watch?v=ht3dNvdNDzI
    15. Setting up Hadoop Development
    16. Demo – Setting up Cloudera Hadoop. Note: Demo VMs can be downloaded from - https://ccp.cloudera.com/display/SUPPORT/Demo+VMs
    17. Hadoop MapReduce Fundamentals (@LynnLangit): a five-part series, Part 2 of 5
    18. So, what’s the problem? “I can just use some ‘SQL-like’ language to query Hadoop, right?” “Yeah, SQL-on-Hadoop…that’s what I want.” “I don’t want to learn a new query language and….” “I want massive scale for my shiny, new Big Data.”
    19. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used.
    20. Demo – Using Hive QL on CDH4
    21. What is Hive? A data warehouse system for Hadoop that facilitates easy data summarization and supports ad-hoc queries (still batch though…); created by Facebook. A mechanism to project structure onto this data and query the data using a SQL-like language, HiveQL: interactive console or execute scripts; kicks off one or more MapReduce jobs in the background. An ability to use indexes, plus built-in and user-defined functions.
    22. Is HQL == ANSI SQL? NO!
        -- non-equality joins ARE allowed in ANSI SQL
        -- but are NOT allowed in Hive (HQL)
        SELECT a.* FROM a JOIN b ON (a.id <> b.id)
        Note: Joins are quite different in MapReduce, more on that coming up…
    23. Preparing for MapReduce
    24. Common Hadoop Shell Commands
        hadoop fs -cat file:///file2
        hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
        hadoop fs -copyFromLocal <fromDir> <toDir>
        hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
        sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
        hadoop fs -ls /user/hadoop/dir1
        hadoop fs -cat hdfs://nn1.example.com/file1
        hadoop fs -get /user/hadoop/file <localfile>
        Tips:
        -- ‘sudo’ means ‘run as administrator’ (super user)
        -- some Hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’; file paths to Hadoop differ for the former, see the link included for more detail
    25. Demo – Working with Files and HDFS
    26. Thinking in MapReduce Hint: “It’s Functional”
    27. Understanding MapReduce – P1/3. Map >> Input: (K1, V1), info in, from an input split. Output: list (K2, V2), key / value out (intermediate values); one list per local node; can implement a local Reducer (or Combiner).
    28. Understanding MapReduce – P2/3. Map >> Input: (K1, V1), info in, from an input split. Output: list (K2, V2), key / value out (intermediate values); one list per local node; can implement a local Reducer (or Combiner). Shuffle/Sort >>
    29. Understanding MapReduce – P3/3. Map >> Input: (K1, V1), info in, from an input split. Output: list (K2, V2), key / value out (intermediate values); one list per local node; can implement a local Reducer (or Combiner). Shuffle/Sort >> Reduce >> Input: (K2, list(V2)); the Shuffle / Sort phase precedes the Reduce phase and combines Map output into a list. Output: list (K3, V3); usually aggregates intermediate values. Flow: (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
    30. MapReduce Example - WordCount. Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
    31. MapReduce Objects. Each daemon spawns a new JVM.
    32. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used.
    33. Demo – Running MapReduce WordCount
    34. Hadoop MapReduce Fundamentals (@LynnLangit): a five-part series, Part 3 of 5
    35. Ways to run MapReduce Jobs. Configure JobConf options. From a development environment (IDE). From a GUI utility: Cloudera Hue; Microsoft Azure HDInsight console. From the command line: hadoop jar <filename.jar> input output
    36. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used.
    37. Setting up Hadoop on Windows Azure: About HDInsight
    38. Demo – MapReduce in the Cloud: WordCount MapReduce using HDInsight
    39. MapReduce (WordCount) with JavaScript. Note: JavaScript is part of the Azure Hadoop distribution.
    40. Common Data Sources for MapReduce Jobs
    41. Where is your Data coming from? On premises: local file system, local HDFS instance. Private cloud: cloud storage. Public cloud: input storage buckets, script / code buckets, output buckets.
    42. Common Data Jobs for MapReduce
    43. Demo – Other Types of MapReduce. Tip: Review the Java MapReduce code in these samples as well.
    44. Methods to write MapReduce Jobs. Typical: usually written in Java (MapReduce 2.0 API, MapReduce 1.0 API). Streaming: uses stdin and stdout; can use any language to write Map and Reduce functions (C#, Python, JavaScript, etc…). Pipes: often used with C++. Abstraction libraries: Hive, Pig, etc…; write in a higher-level language, generate one or more MapReduce jobs.
    45. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used.
    46. Demo – MapReduce via C# & PowerShell
    47. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used.
    48. Using AWS MapReduce. Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud.
    49. What is Pig? ETL library for HDFS developed at Yahoo: Pig runtime; Pig language; generates MapReduce jobs. ETL steps: LOAD <file>; FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…; DUMP {to screen for testing}; STORE <newFile>.
    50. MapReduce Python Sample. Remember that white space matters in Python!
    51. Demo – Using AWS MapReduce with Pig. Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud.
    52. AWS Data Pipeline with HIVE
    53. Hadoop MapReduce Fundamentals (@LynnLangit): a five-part series, Part 4 of 5
    54. Better MapReduce - Optimizations
    55. Optimization BEFORE running a MapReduce Job
    56. More about Input File Compression. From Cloudera… their version of LZO is ‘splittable’.
        Type   File      Size GB   Compress   Decompress
        None   Log       8.0       -          -
        Gzip   Log.gz    1.3       241        72
        LZO    Log.lzo   2.0       55         35
    57. Optimization WITHIN a MapReduce Job
    58. 59
    59. Mapper Task Optimization
    60. Data Types. Writable: Text (String), IntWritable, LongWritable, FloatWritable, BooleanWritable. WritableComparable for keys. Custom types supported: write a RawComparator.
    61. Reducer Task Optimization
    62. MapReduce Job Optimization
    63. Demo – Unit Testing MapReduce. Using MRUnit + Asserts. Optionally using ApprovalTests. Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png
    64. A note about MapReduce 2.0. Splits the existing JobTracker’s roles: resource management; job lifecycle management. MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as better scalability through distributed job lifecycle management, and support for multiple Hadoop MapReduce API versions in a single cluster.
    65. What is Mahout? Library with common machine learning algorithms. Over 20 algorithms: recommendation (likelihood – Pandora); classification (known data and new data – spam ID); clustering (new groups of similar data – Google News). Can non-statisticians find value using this library?
    66. Mahout Algorithms
    67. Setting up Hadoop on Windows. For local development: install from binaries from Web Platform Installer; install .NET Azure SDK (for Azure BLOB storage); install other tools (Neudesic Azure Storage Viewer).
    68. Demo – Mahout Using HDInsight
    69. What about the output?
    70. Clients (Visualizations) for HDFS. Many clients use Hive; often included in GUI console tools for Hadoop distributions as well. Microsoft includes clients in Office (Excel 2013): direct Hive client; connect using ODBC; PowerPivot (data mashups and presentation); Data Explorer (connect, transform, mashup and filter); Hadoop SDK on Codeplex. Other popular clients: Qlikview, Tableau, Karmasphere.
    71. Demo – Executing Hive Queries
    72. Demo – Using HDFS output in Excel 2013. To download Data Explorer: http://www.microsoft.com/en-us/download/details.aspx?id=36803
    73. About Visualization
    74. Demo – New Visualizations – D3
    75. Hadoop MapReduce Fundamentals (@LynnLangit): a five-part series, Part 5 of 5
    76. Limitations of MapReduce
    77. Comparing: RDBMS vs. Hadoop
                              Traditional RDBMS          Hadoop / MapReduce
        Data Size             Gigabytes (Terabytes)      Petabytes (Exabytes)
        Access                Interactive and Batch      Batch – NOT Interactive
        Updates               Read / Write many times    Write once, Read many times
        Structure             Static Schema              Dynamic Schema
        Integrity             High (ACID)                Low
        Scaling               Nonlinear                  Linear
        Query Response Time   Can be near immediate      Has latency (due to batch processing)
    78. Microsoft alternatives to MapReduce. Use existing relational system: scale via cloud or edition (i.e. Enterprise or PDW). Use in-memory OLAP: SQL Server Analysis Services Tabular Models. Use “productized” Dremel: Microsoft Polybase – status = beta?
    79. Looking Forward - Dremel or Apache Drill. Based on original research from Google.
    80. Apache Drill Architecture
    81. In-market MapReduce Alternatives: Cloudera Impala, Google BigQuery
    82. Demo – Google’s BigQuery: Dremel for the rest of us
    83. Hadoop MapReduce Call to Action
    84. More MapReduce Developer Resources
        Based on the distribution (on premises):
          Apache: MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
          Cloudera: Cloudera University - http://university.cloudera.com/
          Cloudera Developer Course (4 day) - *RECOMMENDED* - http://university.cloudera.com/training/apache_hadoop/developer.html
          Hortonworks
          MapR
        Based on the distribution (cloud):
          AWS MapReduce: Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs
          Windows Azure HDInsight: Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
          More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-to-hadoop/
    85. The Changing Data Landscape
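
Sketch for slides 27 to 30 (Understanding MapReduce, WordCount). The map -> combine -> shuffle/sort -> reduce flow can be imitated in plain Python without a cluster. This is only an in-memory illustration, not Hadoop code: the function names (map_words, combine, shuffle_sort, reduce_counts, word_count) are made up for this sketch, each inner list stands in for one input split, and the line index stands in for the byte offset that Hadoop would normally pass as K1.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_words(offset, line):
    """Map: (K1, V1) = (line index, line text) -> list of (K2, V2) = (word, 1)."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: a local reduce over the map output of a single split."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def shuffle_sort(pairs):
    """Shuffle/sort: order by key and group the values -> (K2, list(V2))."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def reduce_counts(word, counts):
    """Reduce: (K2, list(V2)) -> (K3, V3) = (word, total)."""
    return word, sum(counts)


def word_count(splits):
    """Run map and combine per split, then shuffle/sort and reduce globally."""
    combined = []
    for split in splits:
        mapped = []
        for offset, line in enumerate(split):
            mapped.extend(map_words(offset, line))
        combined.extend(combine(mapped))
    return [reduce_counts(key, values) for key, values in shuffle_sort(combined)]


if __name__ == "__main__":
    # Two hypothetical input splits; each would feed its own map task.
    splits = [["the quick brown fox", "the lazy dog"], ["the fox"]]
    print(word_count(splits))
```

Because the combiner runs per split, the reducer for "the" receives the partial counts [2, 1] instead of three separate 1s, which is the kind of saving the "local Reducer (or Combiner)" note on the slides points at.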
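
Sketch for slide 44 (Streaming: "uses stdin and stdout"). A streaming-style word count can be written as a mapper and a reducer that exchange tab-separated key/value lines. The layout below is an assumption made for illustration: the function names, the single-file arrangement, and the "map" / "reduce" command-line switch are hypothetical, and the reducer relies on its input arriving sorted by key, which the shuffle/sort phase is expected to provide.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper writes to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key; assumes the lines are already sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical convention: the first argument selects the stage to run.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    stage = mapper if sys.argv[1] == "map" else reducer
    for output_line in stage(sys.stdin):
        print(output_line)
```

Saved under a hypothetical name such as wc_stream.py, the two stages can be tried locally, with the shell's sort standing in for shuffle/sort: cat input.txt | python wc_stream.py map | sort | python wc_stream.py reduce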