2. DISCLAIMER
I have no experience with Hadoop in a real-world project
The installation notes I present are not necessarily suitable for production
The example scripts have not been used on real (big) data
Hence the title "Playing with Hadoop"
4. The Problem (it's not new)
We have (access to) more and more data
Processing this data takes longer and longer
Not enough memory
Running out of disk space
Our trusty old server can't keep up
!!!!!
6. Scaling out
Add more (commodity) servers
Redundancy is replaced by replication
You can keep on scaling out, it's cheap
How do we enable our software to run across multiple servers?
7. Google solved this
Google published two papers
Google File System (GFS), 2003
http://research.google.com/archive/gfs.html
MapReduce, 2004
http://research.google.com/archive/mapreduce.html
GFS and MapReduce provided a platform for processing huge amounts of data efficiently
8. Hadoop was born
Doug Cutting read the Google papers
Based on those, he created Hadoop (named after his son's toy elephant)
It is an implementation of GFS/MapReduce
(Open Source / Apache License)
Written in Java and deployed on Linux
Originally part of Lucene, now an Apache project
https://hadoop.apache.org/
9. Hadoop Components
Hadoop Common – utilities to control the rest
HDFS – Hadoop Distributed File System
YARN – Yet Another Resource Negotiator
MapReduce – YARN-based parallel processing
This enables us to write software that can handle Big Data by scaling out
10. Big Data isn't just big
Huge amounts of data (volume)
Unstructured data (form)
Highly dynamic data (burst/change rate)
Big Data is actually hard-to-handle (with traditional tools/methods) data
11. Examples of Big Data
Log files, e.g.
web server access logs
application logs
Internet feeds
Twitter, Facebook, etc.
RSS
Images (face recognition, tagging)
13. Needed to run Hadoop
You need the following to run Hadoop
Java JDK
Linux server
Hadoop tarball
I'm using the following
JDK 1.6.24, 64-bit
Ubuntu 12.04 LTS, 64-bit
Hadoop 1.0.4
Could not get JDK7 + Hadoop 2.2 to work
25. Three modes of operation
Pi was calculated in Local standalone mode
it is the default mode (i.e. no configuration needed)
all components of Hadoop run in a single JVM
Pseudo-distributed mode
components communicate using sockets
a separate JVM is spawned for each component
it is a mini-cluster on a single host
Fully distributed mode
components are spread across multiple machines
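Pseudo-distributed mode is the first one that needs configuration. A minimal sketch for Hadoop 1.x, following the classic single-node settings (ports and paths are the conventional defaults and may differ in your install):

```xml
<!-- conf/core-site.xml: point the filesystem at a local HDFS -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: single host, so only one replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: JobTracker on the same host -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```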
43. Hadoop MapReduce
A reducer will get all values associated with a given key
A precursor job can be used to normalize data
Combiners can be used to perform early, partial aggregation of map output before it is sent to the reducer
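The key guarantee above — the reducer sees every value for its key — can be sketched in Python as a toy word count (function names are mine, not Hadoop's API):

```python
from itertools import groupby
from operator import itemgetter

def wc_map(line):
    # Map phase: emit a (word, 1) pair for every word in the line
    for word in line.split():
        yield (word.lower(), 1)

def wc_reduce(word, counts):
    # Reduce phase: the framework hands us ALL values for one key
    return (word, sum(counts))

def run_job(lines):
    # Emulate the shuffle/sort phase: collect and sort all map output by key
    pairs = sorted(kv for line in lines for kv in wc_map(line))
    # Group by key so each reduce call sees every value for that key
    return [wc_reduce(word, (c for _, c in group))
            for word, group in groupby(pairs, key=itemgetter(0))]

print(run_job(["the quick brown fox", "the lazy dog"]))
```

The `sorted` + `groupby` pair plays the role of Hadoop's shuffle/sort; everything else is just the map and reduce functions.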
45. Playing with MapReduce
We don't need Hadoop to play with MapReduce
Instead we can emulate Hadoop using two scripts
wc_mapper.pl – a Word Count Mapper
wc_reducer.pl – a Word Count Reducer
We connect them using a pipe (|)
Very Unix-like!
46. Run MapReduce without Hadoop
https://gist.github.com/soren/7596270
https://gist.github.com/soren/7596285
47. Hadoop's Streaming interface
Enables you to write jobs in any programming language, e.g. Perl
Input from STDIN
Output to STDOUT
Key/Value pairs separated by TAB
Reducers will get values one-by-one
Not to be confused with Hadoop Pipes, which provides a native C++ interface to Hadoop
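As a sketch, here is what a streaming-style word-count pair could look like in Python (the deck's actual scripts are Perl; the structure and names here are mine). Note the subtlety of the last bullet: the reducer gets sorted lines one at a time, so it must detect key changes itself:

```python
import sys

def mapper(stream):
    # Emit one TAB-separated key/value pair per word on STDOUT
    for line in stream:
        for word in line.split():
            print(f"{word}\t1")

def reducer(stream):
    # Streaming hands the reducer sorted lines one-by-one,
    # so we track the current key and flush on each change
    current, total = None, 0
    for line in stream:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__" and len(sys.argv) > 1:
    # Pipe emulation: cat input | python wc.py map | sort | python wc.py reduce
    {"map": mapper, "reduce": reducer}[sys.argv[1]](sys.stdin)
```

The same two functions work under Hadoop Streaming because the contract is just STDIN in, TAB-separated key/value pairs out.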
48. Run Perl Word Count
https://gist.github.com/soren/7596270
https://gist.github.com/soren/7596285
50. Hadoop::Streaming
Perl interface to Hadoop's Streaming interface
Implemented in Moose
You can now implement your MapReduce as
a class with a map() and reduce() method
a mapper script
a reducer script
51. Installing Hadoop::Streaming
Btw, Perl was already installed on the server ;-)
But we want to install Hadoop::Streaming
I also had to install local::lib to make it work
All you have to do is
sudo cpan local::lib Hadoop::Streaming
Nice and easy
55. The Web User Interface
HDFS: http://localhost:8070/
MapReduce: http://localhost:8030/
File Browser: http://localhost:8075/browseDirectory.jsp?namenodeInfo
Note: this is with port forwarding in VirtualBox
50030 → 8030, 50070 → 8070, 50075 → 8075
56. Joins in Hadoop
It's possible to implement joins in MapReduce
Reduce-joins – simple
Map-joins – less data to transfer
Do you need joins?
Maybe your data has structure → SQL?
Try Hive (HiveQL)
Or Pig (Pig Latin)
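A reduce-join (the "simple" kind above) can be sketched in Python: tag each record with its source dataset, shuffle on the join key, and pair records up in the reducer. The datasets and field names here are hypothetical:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical inputs: users and orders, joined on user id
users  = [(1, "alice"), (2, "bob")]
orders = [(1, 250), (1, 120), (2, 80)]

def join_map(users, orders):
    # Tag each record with its source so the reducer can tell them apart
    for uid, name in users:
        yield (uid, ("U", name))
    for uid, total in orders:
        yield (uid, ("O", total))

def join_reduce(pairs):
    # After the shuffle, each key's records from BOTH inputs arrive together
    for uid, group in groupby(sorted(pairs), key=itemgetter(0)):
        records = [v for _, v in group]
        names  = [v for tag, v in records if tag == "U"]
        totals = [v for tag, v in records if tag == "O"]
        for name in names:
            for total in totals:
                yield (name, total)

print(list(join_reduce(join_map(users, orders))))
```

This is why reduce-joins are simple but costly: every record from both inputs travels through the shuffle, which is exactly what map-joins avoid.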
57. Hadoop in the Cloud
Elastic MapReduce (EMR)
http://aws.amazon.com/elasticmapreduce/
Essentially Hadoop in the Cloud
Built on EC2 and S3
You can upload JARs or scripts
58. There's more
Distributions
Cloudera Distribution for Hadoop (CDH)
http://www.cloudera.com/
Hortonworks Data Platform (HDP)
http://hortonworks.com/
HBase, Hive, Pig and other related projects
https://hadoop.apache.org/
But a basic Hadoop setup is a good start and a nice place to just play with Hadoop
59. I like big data and I can not lie
Oh, my God, Becky, look at the data, it's so big
It looks like one of those Hadoop guys' setups
Who understands those Hadoop guys
They only map/reduce it because it is on a
distributed file system
I mean the data, it's just so big
I can't believe it's so huge
It's just out there, I mean, it's gross
Look, it's just so blah
60. The End
Questions?
Slides will be available at http://www.slideshare.net/slu/
Find me on Twitter https://twitter.com/slu