Are you new to Hadoop and need to start processing data fast and effectively? Have you been playing with CDH and are ready to move on to development supporting a technical or business use case? Are you prepared to unlock the full potential of all your data by building and deploying powerful Hadoop-based applications?
If you're wondering whether Cloudera's Developer Training for Apache Hadoop is right for you and your team, this presentation will help you decide. You will learn who is best suited to attend the live training, what prior knowledge you should have, and what topics the course covers. Cloudera Curriculum Manager Ian Wrigley will discuss the skills you will attain during the course and how they will help you become a full-fledged Hadoop application developer.
During the session, Ian will also present a short portion of the actual Cloudera Developer course, discussing the difference between New and Old APIs, why there are different APIs, and which you should use when writing your MapReduce code. Following the presentation, Ian will answer your questions about this or any of Cloudera’s other training courses.
Visit the resources section of cloudera.com to view the on-demand webinar.
Introduction to Hadoop Developer Training Webinar
1. An Introduction to Cloudera’s Hadoop Developer Training Course
Ian Wrigley
Curriculum Manager
2. Welcome to the Webinar!
All lines are muted
Q&A after the presentation
Ask questions at any time by typing them in the WebEx panel
A recording of this Webinar will be available on demand at cloudera.com
3. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
4. Cloudera’s Training is the Industry Standard
Big Data professionals from 55% of the Fortune 100 have attended live Cloudera training
Cloudera has trained employees from 100% of the top 20 global technology firms to use Hadoop
Cloudera has trained over 15,000 students
5. Cloudera Training: The Benefits
1 Broadest Range of Courses: Cover all the key Hadoop components; most classes offered
2 Most Experienced Instructors: Over 15,000 students trained since 2009
3 Leader in Certification: Over 5,000 accredited Cloudera professionals
4 State of the Art Curriculum: Classes updated regularly as Hadoop evolves
5 Widest Geographic Coverage: 20 countries plus virtual classroom
6 Most Relevant Platform & Community: CDH deployed more than all other distributions combined
7 Depth of Training Material: Hands-on labs and VMs support live instruction
8 Ongoing Learning: Video tutorials and e-learning complement training
6. “The professionalism and expansive technical knowledge of our classroom instructor was incredible. The quality of the training was on par with a university.”
7. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
8. Common Attendee Profiles
Software developers/engineers
Business analysts
IT managers
Hadoop system administrators
9. Course Prerequisites
Programming experience
Knowledge of Java highly recommended
Understanding of common computer science principles is helpful
Prior knowledge of Hadoop is not required
10. Who Should Not Attend?
If you have no programming experience, you’re likely to find the course very difficult
You might consider our Hive and Pig training course instead
If you will be focused solely on configuring and managing your cluster, our Administrator training course would probably be a better alternative
11. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
12. Developer Training: Overview
The course assumes no pre-existing knowledge of Hadoop
Starts by discussing the motivation for Hadoop
What problems exist that are difficult (or impossible) to solve with existing systems
Explains basic Hadoop concepts
The Hadoop Distributed File System (HDFS)
MapReduce
Introduces the Hadoop API (Application Programming Interface)
13. Developer Training: Overview (cont’d)
Moves on to discuss more complex Hadoop concepts
Custom Partitioners
Custom Writables and WritableComparables
Custom InputFormats and OutputFormats
Investigates common MapReduce algorithms
Sorting, searching, indexing, joining data sets, etc.
Then covers the Hadoop ‘ecosystem’
Hive, Pig, Sqoop, Flume, Mahout, Oozie
15. Hands-On Exercises
The course features many Hands-On Exercises
Analyzing log files
Unit-testing Hadoop code
Writing and implementing Combiners
Writing custom Partitioners
Using SequenceFiles and file compression
Creating an inverted index
Creating custom WritableComparables
Importing data with Sqoop
Writing Hive queries
…and more
16. Certification
Our Developer course is good preparation for the Cloudera Certified Developer for Apache Hadoop (CCDH) exam
A voucher for one attempt at the exam is currently included in the course fee
17. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
23. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
30. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
31. Conclusion
Cloudera’s Developer training course is:
Technical
Hands-on
Interactive
Comprehensive
Attendees leave the course with the skillset required to write, test, and run Hadoop jobs
The course is good preparation for the CCDH certification exam
32. Questions?
For more information on Cloudera’s training courses, or to book a place on an upcoming course: http://university.cloudera.com
My e-mail address: ian@cloudera.com
Feel free to ask questions!
Hit the Q&A button, and type away
Editor’s Notes
This topic is discussed in further detail in TDG 3e on pages 27-30 (TDG 2e, 25-27). NOTE: The New API / Old API distinction is completely unrelated to MRv1 (MapReduce in CDH3 and earlier) vs MRv2 (next-generation MapReduce, also called YARN, which will be available along with MRv1 starting in CDH4). Instructors are advised to avoid confusion by not mentioning MRv2 during this section of class, and, if asked about it, to simply say that it’s unrelated to the old/new API and defer further discussion until later.
On this slide, you should point out the similarities as well as the differences between the two APIs. You should emphasize that they are both doing the same thing, and that there are just a few differences in how they go about it.
You can tell whether a class belongs to the “Old API” or the “New API” based on the package name: the old API uses “mapred”, while the new API uses “mapreduce” instead. This is the most important thing to keep in mind, because some classes/interfaces have the same name in both APIs. Consequently, when you are writing your import statements (or generating them with the IDE), you will want to be careful to use the one that corresponds to whichever API you are using to write your code.
The functions of the OutputCollector and Reporter objects have been consolidated into a single Context object. For this reason, the new API is sometimes called the “Context Objects” API (TDG 3e, page 27 or TDG 2e, page 25). A side-by-side sketch follows below.
NOTE: The “Keytype” and “Valuetype” shown in the map method signature aren’t actual classes defined in the Hadoop API; they are just placeholders for whatever types you use for the key and value (e.g. IntWritable and Text). Also, the generics for the keys and values are not shown in the class definition for the sake of brevity, but they are used in the new API just as they are in the old API.
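A minimal side-by-side sketch of that contrast (the class names and the pass-through map logic are illustrative, not taken from the course materials; fully qualified names are used to make the mapred/mapreduce package difference explicit):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Old API: org.apache.hadoop.mapred. Mapper is an interface, and output
// and progress reporting go through separate OutputCollector and
// Reporter parameters.
class OldApiMapper extends org.apache.hadoop.mapred.MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
      org.apache.hadoop.mapred.OutputCollector<Text, IntWritable> output,
      org.apache.hadoop.mapred.Reporter reporter) throws IOException {
    output.collect(value, new IntWritable(1));
  }
}

// New API: org.apache.hadoop.mapreduce. Mapper is a class, and a single
// Context object takes over the roles of OutputCollector and Reporter.
class NewApiMapper
    extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(value, new IntWritable(1));
  }
}

Because class names like Mapper are identical in both packages, the import (or fully qualified name) is the only reliable way to tell which API a piece of code uses.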
An example of maintaining sorted order globally across all reducers was given earlier in the course, when Partitioners were introduced.
NOTE: Worker nodes are configured to reserve a portion (typically 20%-30%) of their available disk space for storing intermediate data. If too many Mappers are feeding into too few Reducers, you can produce more data than the reducer(s) can store. That’s a problem.
At any rate, having all your Mappers feed into a single Reducer (or just a few Reducers) doesn’t spread the work efficiently across the cluster.
Use of the TotalOrderPartitioner is described in detail on pages 274-277 of TDG 3e (TDG 2e, 237-241). It is essentially based on sampling your keyspace so you can divide it up efficiently among several reducers, based on the global sort order of those keys.
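A hedged driver sketch of that approach using the new-API TotalOrderPartitioner and InputSampler (the input format, sampler parameters, reducer count, and partition-file path are all assumptions for illustration; an identity mapper over the keys is assumed so the sampled input keys match the reduce keys):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class GlobalSortDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "globally sorted output");
    job.setJarByClass(GlobalSortDriver.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setMapOutputKeyClass(Text.class);
    job.setNumReduceTasks(8); // illustrative reducer count
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Sample the keyspace (~10% of records, up to 10,000 samples from at
    // most 10 splits) and write the partition boundaries to a file.
    InputSampler.Sampler<Text, Text> sampler =
        new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
        new Path("/tmp/_partitions")); // hypothetical path
    InputSampler.writePartitionFile(job, sampler);

    // Each reducer now receives a contiguous slice of the global key order.
    job.setPartitionerClass(TotalOrderPartitioner.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}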
But beware that this can be a naïve approach: if you process sales data this way, business-to-business operations (like plumbing supply warehouses) would likely have little or no data for the weekend, since they are typically closed. Conversely, a retail store in a shopping mall will likely have far more data for a Saturday than for a Tuesday.
The upper bound on the number of reducers is based on your cluster (machines are configured to have a certain number of “reduce slots” based on the CPU, RAM, and other performance characteristics of the machine). The general advice is to choose something a bit less than the maximum number of reduce slots to allow for speculative execution.
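As a hedged illustration of that rule of thumb (the helper name, the slot count parameter, and the 95% factor are assumptions, not a course recommendation):

import org.apache.hadoop.mapreduce.Job;

public class ReducerCountHeuristic {
  // Hypothetical helper: choose slightly fewer reducers than the cluster's
  // reduce-slot capacity so a speculative or retried task still fits
  // within the first wave.
  public static void applyHeuristic(Job job, int clusterReduceSlots) {
    int numReducers = Math.max(1, (int) (clusterReduceSlots * 0.95));
    job.setNumReduceTasks(numReducers);
  }
}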
One factor in determining the reducer count is the reduce capacity the developer has access to (or the number of “reduce slots” in either the cluster or the user’s pool). One technique is to make the reducer count a multiple of this capacity. If the developer has access to N slots but picks N+1 reducers, the reduce phase will go into a second “wave”, which means that one extra reducer can potentially double the execution time of the reduce phase. However, if the developer chooses 2N or 3N reducers, each wave takes less time; there are more waves, so you don’t see a big degradation in job performance if you need an additional wave due to an extra reducer, a failed task, etc.
Suggestion: draw a picture on the whiteboard that shows reducers running in waves, showing the cluster slot count, reducer execution times, etc., to tie together the performance issues as they have been explained in the last few slides:
1 reducer will run very slowly on an entire data set
Setting the number of reducers to the available slot count can maximize parallelism in one reducer wave; however, if you have a failure, the reduce phase of the job runs into a second wave, which doubles its execution time
Setting the number of reducers to a high number means many waves of shorter-running reducers; this scales nicely because you don’t have to be aware of the cluster size and you don’t pay the full cost of a second wave, though it might be less efficient for some jobs
A toy model of this wave arithmetic follows below.
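A hypothetical back-of-the-envelope model of that wave arithmetic (unit total work split evenly across reducers; all numbers are illustrative):

public class ReduceWaveModel {
  // Reduce-phase time if total work is split evenly across `reducers`
  // tasks and the cluster runs `slots` of them at a time.
  static double phaseTime(double totalWork, int reducers, int slots) {
    int waves = (reducers + slots - 1) / slots; // ceiling of reducers / slots
    return waves * (totalWork / reducers);      // one task per slot per wave
  }

  public static void main(String[] args) {
    int slots = 100; // illustrative cluster capacity
    System.out.println(phaseTime(1.0, 100, slots)); // 1 wave  -> ~0.0100
    System.out.println(phaseTime(1.0, 101, slots)); // 2 waves -> ~0.0198, nearly double
    System.out.println(phaseTime(1.0, 300, slots)); // 3 waves -> ~0.0100, degrades gracefully
  }
}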