Hadoop MapReduce Fundamentals

Deck from my five-part YouTube series (SoCalDevGal channel) on Hadoop MapReduce

  • 1. Hadoop MapReduce Fundamentals, @LynnLangit, a five-part series – Part 1 of 5
  • 2. Course Outline
  • 3. What is Hadoop? Open-source data storage and processing API. Massively scalable, automatically parallelizable. Based on work from Google: GFS + MapReduce + BigTable. Current distributions based on open source and vendor work: Apache Hadoop; Cloudera (CDH4 w/ Impala); Hortonworks; MapR; AWS; Windows Azure HDInsight.
  • 4. Why Use Hadoop? Cheaper: scales to petabytes or more. Faster: parallel data processing. Better: suited for particular types of Big Data problems.
  • 5. What types of business problems for Hadoop? Source: Cloudera, “Ten Common Hadoopable Problems”
  • 6. Companies Using Hadoop: Facebook, Yahoo, Amazon, eBay, American Airlines, The New York Times, Federal Reserve Board, IBM, Orbitz
  • 7. Forecast growth of Hadoop Job Market. Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html
  • 8. Hadoop is a set of Apache Frameworks and more… Data storage (HDFS): runs on commodity hardware (usually Linux); horizontally scalable. Processing (MapReduce): parallelized (scalable) processing; fault tolerant. Other Tools / Frameworks: Data Access (HBase, Hive, Pig, Mahout); Tools (Hue, Sqoop); Monitoring (Greenplum, Cloudera). Stack: Hadoop Core - HDFS; MapReduce API; Data Access; Tools & Libraries; Monitoring & Alerting.
  • 9. What are the core parts of a Hadoop distribution?
  • 10. Hadoop Cluster HDFS (Physical) Storage
  • 11. MapReduce Job – Logical View. Image from - http://mm-tom.s3.amazonaws.com/blog/MapReduce.png
  • 12. Hadoop Ecosystem
  • 13. Common Hadoop Distributions. Open Source: Apache. Commercial: Cloudera, Hortonworks, MapR, AWS MapReduce, Microsoft HDInsight (Beta).
  • 14. A View of Hadoop (from Hortonworks). Source: “Intro to Map Reduce” -- http://www.youtube.com/watch?v=ht3dNvdNDzI
  • 15. Setting up Hadoop Development
  • 16. Demo – Setting up Cloudera Hadoop. Note: Demo VMs can be downloaded from - https://ccp.cloudera.com/display/SUPPORT/Demo+VMs
  • 17. Hadoop MapReduce Fundamentals, @LynnLangit, a five-part series – Part 2 of 5
  • 18. So, what’s the problem? “I can just use some ‘SQL-like’ language to query Hadoop, right?” “Yeah, SQL-on-Hadoop…that’s what I want.” “I don’t want to learn a new query language and….” “I want massive scale for my shiny, new BigData.”
  • 19. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used
  • 20. Demo – Using Hive QL on CDH4
  • 21. What is Hive? A data warehouse system for Hadoop that facilitates easy data summarization and supports ad-hoc queries (still batch though…); created by Facebook. A mechanism to project structure onto this data and query the data using a SQL-like language, HiveQL: interactive console -or- execute scripts; kicks off one or more MapReduce jobs in the background. An ability to use indexes, built-in and user-defined functions.
  • 22. Is HQL == ANSI SQL? – NO! Non-equality joins ARE allowed in ANSI SQL but are NOT allowed in Hive (HQL): SELECT a.* FROM a JOIN b ON (a.id <> b.id). Note: Joins are quite different in MapReduce, more on that coming up…
  • 23. Preparing for MapReduce
  • 24. Common Hadoop Shell Commands
      hadoop fs -cat file:///file2
      hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
      hadoop fs -copyFromLocal <fromDir> <toDir>
      hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
      sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
      hadoop fs -ls /user/hadoop/dir1
      hadoop fs -cat hdfs://nn1.example.com/file1
      hadoop fs -get /user/hadoop/file <localfile>
      Tips:
      -- ‘sudo’ means ‘run as administrator’ (super user)
      -- some hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’ – file paths to hadoop differ for the former, see the link included for more detail
  • 25. Demo – Working with Files and HDFS
  • 26. Thinking in MapReduce Hint: “It’s Functional”
  • 27. Understanding MapReduce – P1/3. Map>> (K1, V1): info in (input split). list (K2, V2): key / value out (intermediate values); one list per local node; can implement local Reducer (or Combiner).
  • 30. MapReduce Example - WordCount. Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
  • 31. MapReduce Objects. Each daemon spawns a new JVM
  • 32. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used
  • 33. Demo – Running MapReduce WordCount
  • 34. Hadoop MapReduce Fundamentals, @LynnLangit, a five-part series – Part 3 of 5
  • 35. Ways to run MapReduce Jobs. Configure JobConf options. From Development Environment (IDE). From a GUI utility: Cloudera – Hue; Microsoft Azure – HDInsight console. From the command line: hadoop jar <filename.jar> input output
  • 36. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used
  • 37. Setting up Hadoop on Windows Azure: About HDInsight
  • 38. Demo – MapReduce in the Cloud: WordCount MapReduce using HDInsight
  • 39. MapReduce (WordCount) with JavaScript. Note: JavaScript is part of the Azure Hadoop distribution
  • 40. Common Data Sources for MapReduce Jobs
  • 41. Where is your Data coming from? On premises: local file system; local HDFS instance. Private Cloud: cloud storage. Public Cloud: input storage buckets; script / code buckets; output buckets.
  • 42. Common Data Jobs for MapReduce
  • 43. Demo – Other Types of MapReduce. Tip: Review the Java MapReduce code in these samples as well.
  • 44. Methods to write MapReduce Jobs. Typical – usually written in Java: MapReduce 2.0 API; MapReduce 1.0 API. Streaming: uses stdin and stdout; can use any language to write Map and Reduce functions (C#, Python, JavaScript, etc…). Pipes: often used with C++. Abstraction libraries (Hive, Pig, etc…): write in a higher level language, generate one or more MapReduce jobs.
  • 45. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used
  • 46. Demo – MapReduce via C# & PowerShell
  • 47. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used
  • 48. Using AWS MapReduce. Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
  • 49. What is Pig? ETL Library for HDFS developed at Yahoo: Pig Runtime; Pig Language; generates MapReduce jobs. ETL steps: LOAD <file>; FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…; DUMP {to screen for testing}; STORE <newFile>.
  • 50. MapReduce Python Sample. Remember that white space matters in Python!
  • 51. Demo – Using AWS MapReduce with Pig. Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
  • 52. AWS Data Pipeline with HIVE
  • 53. Hadoop MapReduce Fundamentals, @LynnLangit, a five-part series – Part 4 of 5
  • 54. Better MapReduce - Optimizations
  • 55. Optimization BEFORE running a MapReduce Job
  • 56. More about Input File Compression. From Cloudera… Their version of LZO ‘splittable’
      Type   File      Size GB   Compress   Decompress
      None   Log       8.0       -          -
      Gzip   Log.gz    1.3       241        72
      LZO    Log.lzo   2.0       55         35
  • 57. Optimization WITHIN a MapReduce Job
  • 59. Mapper Task Optimization
  • 60. Data Types. Writable: Text (String), IntWritable, LongWritable, FloatWritable, BooleanWritable. WritableComparable for keys. Custom Types supported – write RawComparator.
  • 61. Reducer Task Optimization
  • 62. MapReduce Job Optimization
  • 63. Demo – Unit Testing MapReduce. Using MRUnit + Asserts. Optionally using ApprovalTests. Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png
  • 64. A note about MapReduce 2.0. Splits the existing JobTracker’s roles: resource management; job lifecycle management. MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as better scalability through distributed job lifecycle management, and support for multiple Hadoop MapReduce API versions in a single cluster.
  • 65. What is Mahout? Library with common machine learning algorithms. Over 20 algorithms: Recommendation (likelihood – Pandora); Classification (known data and new data – spam id); Clustering (new groups of similar data – Google news). Can non-statisticians find value using this library?
  • 66. Mahout Algorithms
  • 67. Setting up Hadoop on Windows. For local development. Install from binaries from Web Platform Installer. Install .NET Azure SDK (for Azure BLOB storage). Install other tools: Neudesic Azure Storage Viewer.
  • 68. Demo – Mahout Using HDInsight
  • 69. What about the output?
  • 70. Clients (Visualizations) for HDFS. Many clients use Hive: often included in GUI console tools for Hadoop distributions as well. Microsoft includes clients in Office (Excel 2013): direct Hive client; connect using ODBC; PowerPivot – data mashups and presentation; Data Explorer – connect, transform, mashup and filter; Hadoop SDK on Codeplex. Other popular clients: Qlikview, Tableau, Karmasphere.
  • 71. Demo – Executing Hive Queries
  • 72. Demo – Using HDFS output in Excel 2013. To download Data Explorer: http://www.microsoft.com/en-us/download/details.aspx?id=36803
  • 73. About Visualization
  • 74. Demo – New Visualizations – D3
  • 75. Hadoop MapReduce Fundamentals, @LynnLangit, a five-part series – Part 5 of 5
  • 76. Limitations of MapReduce
  • 77. Comparing: RDBMS vs. Hadoop
                            Traditional RDBMS          Hadoop / MapReduce
      Data Size             Gigabytes (Terabytes)      Petabytes (Exabytes)
      Access                Interactive and Batch      Batch – NOT Interactive
      Updates               Read / Write many times    Write once, Read many times
      Structure             Static Schema              Dynamic Schema
      Integrity             High (ACID)                Low
      Scaling               Nonlinear                  Linear
      Query Response Time   Can be near immediate      Has latency (due to batch processing)
  • 78. Microsoft alternatives to MapReduce. Use existing relational system: scale via cloud or edition (i.e. Enterprise or PDW). Use in-memory OLAP: SQL Server Analysis Services Tabular Models. Use “productized” Dremel: Microsoft Polybase – status = beta?
  • 79. Looking Forward - Dremel or Apache Drill. Based on original research from Google
  • 80. Apache Drill Architecture
  • 81. In-market MapReduce Alternatives: Cloudera Impala, Google BigQuery
  • 82. Demo – Google’s BigQuery: Dremel for the rest of us
  • 83. Hadoop MapReduce Call to Action
  • 84. More MapReduce Developer Resources
      Based on the distribution – on premises
        Apache: MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
        Cloudera: Cloudera University - http://university.cloudera.com/
        Cloudera: Cloudera Developer Course (4 day) - *RECOMMENDED* - http://university.cloudera.com/training/apache_hadoop/developer.html
        Hortonworks
        MapR
      Based on the distribution – cloud
        AWS MapReduce: Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs
        Windows Azure HDInsight: Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
        Windows Azure HDInsight: More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-to-hadoop/
  • 85. The Changing Data Landscape
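
Appendix sketch: the WordCount flow from slides 27 and 30 (Map takes (K1, V1) in and emits a list of (K2, V2), with an optional local Reducer or Combiner per node) can be imitated in plain Python to get a feel for "thinking in MapReduce". This is only an in-memory illustration, not Hadoop code and not the sample shown in the videos. The function names (map_words, combine, shuffle, reduce_counts, word_count) are hypothetical, and the shuffle and reduce steps are filled in from the general MapReduce model because those slides are not in this transcript.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_words(offset, line):
    """Map: (K1, V1) = (line offset, line text) -> list of (K2, V2) = (word, 1)."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Optional combiner (local reducer): pre-sum counts produced on one node."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle(pairs):
    """Shuffle/sort: bring all values for the same key together."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def reduce_counts(word, counts):
    """Reduce: (K2, list of V2) -> (word, total)."""
    return (word, sum(counts))


def word_count(splits):
    """Run map + combine per input split (one split stands in for one node),
    then shuffle and reduce across all splits."""
    intermediate = []
    for split in splits:
        pairs = []
        for offset, line in enumerate(split):
            pairs.extend(map_words(offset, line))
        intermediate.extend(combine(pairs))
    return dict(reduce_counts(key, values) for key, values in shuffle(intermediate))


if __name__ == "__main__":
    # Two made-up input splits, each a list of lines.
    splits = [["the quick brown fox", "the lazy dog"], ["The fox"]]
    print(word_count(splits))
```

Each function takes values in and returns new values without shared state, which is the "It's Functional" hint from slide 26. On a real cluster the framework handles the splitting, shuffling, and scheduling, and you supply only the map and reduce logic.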