2. Overview
Data Storage and Analysis
Comparison with other Systems
HPC and Grid Computing
Volunteer Computing
History of Hadoop
Analyzing Data with Hadoop
Hadoop in the Enterprise
The Collective Wisdom of the Valley
3. The Problem
IDC estimates the digital universe grew to 1.8 zettabytes by the end of 2011
◦ 1 zettabyte = 1,000 exabytes = 1 million petabytes
Individual data footprints are growing
Storing and analyzing datasets in the petabyte range requires new and innovative solutions
4. The Problem
Storage capacities of hard drives have
increased but transfer rates have not
kept up
◦ Solution: read from multiple disks at once
Hardware Failure
Most analysis tasks need to be able to
combine the data in some way.
5. What Hadoop provides:
The ability to read and write data in parallel, to and from multiple disks
Enables applications to work with
thousands of nodes and petabytes of
data.
A reliable shared storage and analysis
system (HDFS and MapReduce)
A free license
7. MapReduce vs. RDBMS
MapReduce Premise: the entire
dataset—or at least a good portion of
it—is processed for each query.
◦ Batch Query Processor
Another trend: seek time is improving more slowly than transfer rate
MapReduce is good for analyzing the
whole dataset, whereas RDBMS is
good for point queries or updates.
8. MapReduce vs. RDBMS
            Traditional RDBMS           MapReduce
Data Size   Gigabytes                   Petabytes
Access      Interactive and batch       Batch
Updates     Read and write many times   Write once, read many times
Structure   Static schema               Dynamic schema
Integrity   High                        Low
Scaling     Nonlinear                   Linear
• MapReduce suits applications where the data is written once and read many times, whereas an RDBMS is good for datasets that are continually updated.
9. Data Structure
Structured Data – data organized into
entities that have a defined format.
◦ Realm of RDBMS
Semi-Structured Data – there may be a schema, but it is often ignored; the schema is used only as a guide to the structure of the data.
Unstructured Data – doesn’t have any
particular internal structure.
MapReduce works well with semi-
structured and unstructured data.
10. More differences…
Relational data is often normalized to
retain its integrity and remove
redundancy
Normalization poses problems for
MapReduce
MapReduce is a linearly scalable
programming model.
Over time, the differences between
RDBMS and MapReduce are likely to
blur
11. HPC and Grid Computing
The approach in HPC is to distribute the work across a cluster of machines that access a shared filesystem, hosted by a SAN.
◦ With very large datasets, network bandwidth becomes the bottleneck and compute nodes sit idle
MapReduce tries to collocate the data
with the compute node, so data access
is fast since it is local.
◦ Works to conserve bandwidth by explicitly
modeling network topology.
12. Handling Partial Failure
MapReduce – implementation detects
failed map or reduce tasks and
reschedules replacements on
machines that are healthy
Shared-Nothing Architecture – tasks
have no dependence on one another
To contrast, MPI programs have to
explicitly manage their own
checkpointing and recovery.
13. Why is MapReduce cool?
Invented by engineers at Google as a
system for building production search
indexes because they found
themselves solving the same problem
over and over again.
Wide range of algorithms expressed:
◦ Image Analysis
◦ Graph-based problems
◦ Machine Learning
14. Volunteer Computing
SETI@home
MapReduce is designed to run jobs that
last minutes or hours on trusted,
dedicated hardware running in a single
data center with very high aggregate
bandwidth interconnects.
SETI@home runs a perpetual
computation on untrusted machines on
the Internet with highly variable
connection speeds and no data locality
15. History of Hadoop
Created by Doug Cutting
2002 – Apache Nutch, open source web
search engine
2003 – Google publishes a paper describing
the architecture of their distributed filesystem,
GFS.
2004 – Nutch Distributed Filesystem (NDFS)
2004 – Google publishes a paper on
MapReduce
2005 – Nutch MapReduce implementation
2006 – Hadoop is created; Cutting joins
Yahoo!
2008 – Yahoo! announces that its production search index is generated by a 10,000-core Hadoop cluster
17. Analyzing Data with Hadoop
Case: NCDC Weather Data
◦ What’s the highest recorded global temp for each
year in the dataset?
Express our query as a MapReduce job
MapReduce breaks the processing into two
phases: Map and Reduce
Input to our Map phase is raw NCDC data
Map function: pull out the year and air temperature, filtering out readings that are missing, suspect, or erroneous
Reduce function: find the maximum temperature for each year
18. MapReduce Example
Map function extracts the year and
temp:
◦ (1950, 0), (1950, 22), (1950, -11), (1949,
111), (1949, 78)
MapReduce sorts and groups the
data:
◦ (1949, [111,78])
◦ (1950, [0, 22, -11])
Reduce function iterates through each list and picks the maximum:
◦ (1949, 111), (1950, 22)
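The three steps above can be sketched in a few lines of plain Python. This is an in-memory illustration of the map–shuffle–reduce flow, not the Hadoop API; the function names (`map_fn`, `reduce_fn`, `run_mapreduce`) are hypothetical, and the input is already parsed into (year, temperature) pairs rather than raw NCDC lines.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # Emit a (year, temperature) key-value pair for each input record.
    year, temp = record
    return (year, temp)

def reduce_fn(year, temps):
    # Reduce a year's list of temperatures to its maximum.
    return (year, max(temps))

def run_mapreduce(records):
    # Map phase: one key-value pair per record.
    pairs = [map_fn(r) for r in records]
    # Shuffle/sort: group values by key, as the framework does between phases.
    pairs.sort(key=itemgetter(0))
    grouped = [(year, [t for _, t in group])
               for year, group in groupby(pairs, key=itemgetter(0))]
    # Reduce phase: one output pair per key.
    return [reduce_fn(year, temps) for year, temps in grouped]

records = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]
print(run_mapreduce(records))  # [(1949, 111), (1950, 22)]
```

In real Hadoop the sort-and-group step is performed by the framework between the two phases; the programmer supplies only the map and reduce functions.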
20. Hadoop in the Enterprise
Accelerate nightly batch business processes
Storage of extremely high volumes of data
Creation of automatic, redundant backups
Improving the scalability of applications
Use of Java for data processing instead of
SQL
Producing JIT feeds for dashboards and BI
Handling urgent, ad hoc requests for data
Turning unstructured data into relational data
Taking on tasks that require massive
parallelism
Moving existing algorithms, code,
frameworks, and components to a highly
distributed computing environment
22. Hadoop in the News
A TechCrunch piece notes that the open-source LAMP stack transformed web startup economics 10 years ago, and argues that Hadoop is now displacing expensive proprietary solutions.
Hadoop's architecture of map-reducing across a cluster of commodity nodes is more flexible and cost-effective than traditional data warehouses.
3 areas of application in startups:
◦ Analysis of Customer Behavior
◦ Powering new user-facing features
◦ Enabling entire new lines of business
23. An interesting point to close on…
From TechCrunch: "What is most remarkable is how the startup world is collectively creating this ecosystem: Yahoo, Facebook, Twitter, LinkedIn, and other companies are actively adding to the tool chain. This illustrates a new thesis or collective wisdom rising from the valley: If a technology is not your core value-add, it should be open-sourced because then others can improve it, and potential future employees can learn it. This rising tide has lifted all boats, and is just getting started"
24. Training and Certifications
Hortonworks – Believes that Apache
Hadoop will process half of the world’s
data within the next five years
◦ Hortonworks Data Platform – open source
distribution of Apache Hadoop
◦ Support, Training, Partner Enablement
programs designed to assist enterprises
and solution providers
Hortonworks University
25. Extra Resources
Running Hadoop on Ubuntu Linux
(Single-Node Cluster)
Running Hadoop on Amazon EC2
26. Works Cited
White, Tom (2011). Hadoop: The Definitive Guide. Sebastopol, CA: O'Reilly.
TechCrunch (July 2011). "Hadoop and Startups: Where Open Source Meets Business Data."
Wikipedia – Apache Hadoop
Apache Hadoop website
Speaker notes
Storage capacities: One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in about 5 minutes. 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than 2.5 hours to read all the data off the disk. Solution: imagine we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under 2 minutes. Using only one hundredth of each disk may seem wasteful, but we can store one hundred datasets, each one terabyte, and provide shared access to them.
Hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way to avoid data loss is replication: redundant copies of the data are kept by the system so that in the event of failure there is another copy available. This is how RAID works, though the Hadoop Distributed Filesystem (HDFS) takes a slightly different approach.
Combining data: data read from one disk may need to be combined with data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. MapReduce provides a programming model that abstracts the problem away from disk reads and writes, transforming it into a computation over sets of keys and values. The important point is that there are two parts to the computation, the map and the reduce, and it's the interface between the two where the "mixing" occurs.
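The drive arithmetic in the note above can be checked with a few lines of Python; the helper function is hypothetical and just divides capacity by aggregate transfer rate.

```python
def read_time_minutes(capacity_mb, rate_mb_per_s, drives=1):
    # Time to read a full drive sequentially: capacity / transfer rate.
    # With N drives in parallel, each holds 1/N of the data.
    return capacity_mb / drives / rate_mb_per_s / 60

# 1990: 1,370 MB drive at 4.4 MB/s -> about 5 minutes
print(round(read_time_minutes(1370, 4.4)))                      # 5
# ~2010: 1 TB drive at 100 MB/s -> well over 2.5 hours
print(round(read_time_minutes(1_000_000, 100)))                 # 167 (minutes)
# 100 drives, each holding 1/100 of the data -> under 2 minutes
print(round(read_time_minutes(1_000_000, 100, drives=100), 1))  # 1.7
```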
Yahoo! – 10,000-core Linux cluster. Facebook – claims to have the largest Hadoop cluster in the world at 30 PB.
MapReduce enables you to run an ad hoc query against your whole dataset and get the results in a reasonable time. E.g. Mailtrust, Rackspace's mail division, used Hadoop for processing email logs. One ad hoc query they wrote was to find the geographic distribution of their users. They said: "This data was so useful that we've scheduled the MapReduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow."
Seeking is the process of moving the disk's head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth. If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
Structured data – such as XML documents or database tables that conform to a particular predefined schema (the realm of the RDBMS).
Semi-structured data – for example, a spreadsheet, in which the structure is the grid of cells, although the cells themselves may hold any form of data.
Unstructured data – e.g. plain text or image data.
MapReduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not an intrinsic property of the data; they are chosen by the person analyzing the data.
Problems for MapReduce – normalization makes reading a record a non-local operation, and one of the central assumptions MapReduce makes is that it is possible to perform (high-speed) streaming reads and writes.
A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same client may appear many times), and this is one reason that logfiles of all kinds are particularly well suited to analysis with MapReduce.
MapReduce is a linearly scalable programming model. The programmer writes two functions: a map function and a reduce function, each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster they are operating on, so they can be used unchanged for a small dataset and for a massive one. More importantly, if you double the size of the input data, a job will run twice as slowly; but if you also double the size of the cluster, the job will run as fast as the original one. This is not generally true of SQL queries.
The lines will blur as relational databases start incorporating some of the ideas from MapReduce and, from the other direction, as higher-level query languages built on MapReduce (such as Pig and Hive) make MapReduce systems more approachable to traditional database programmers.
The High Performance Computing (HPC) and Grid Computing communities have been doing large-scale data processing for years, using APIs such as the Message Passing Interface (MPI). HPC works well for predominantly compute-intensive jobs, but becomes a problem when nodes need to access larger data volumes (hundreds of GB).
Data locality is at the heart of MapReduce and is the reason for its good performance. Recognizing that network bandwidth is the most precious resource in a data center environment (e.g. it is easy to saturate network links by copying data around), MapReduce implementations go to great lengths to conserve it by explicitly modeling network topology.
MPI gives great control to the programmer, but requires that he or she explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs such as sockets, as well as the higher-level algorithm for the analysis. MapReduce operates only at the higher level: the programmer thinks in terms of functions of key and value pairs, and the data flow is implicit.
How do you handle partial failure—when you don't know whether a remote process has failed—while still making progress with the overall computation?
The shared-nothing architecture makes MapReduce able to handle partial failure. From a programmer's point of view, the order in which the tasks run doesn't matter.
MPI programs give more control to the programmer, but that makes them more difficult to write. In some ways MapReduce is a restrictive programming model, since you are limited to key and value types that are related in specified ways, and mappers and reducers run with very limited coordination between one another (the mappers pass keys and values to reducers).
Search for Extra-Terrestrial Intelligence – volunteers donate CPU time from their otherwise idle computers to analyze radio telescope data for signs of intelligent life outside Earth. The most prominent of many volunteer computing projects. Similar to MapReduce in that it breaks a problem into independent pieces to be worked on in parallel.
Nutch – the architecture wouldn't scale to index billions of pages. The paper about GFS provided the information needed to solve Nutch's storage needs for the very large files generated as part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. NDFS was an open source implementation of GFS.
Google introduced MapReduce to the world; by mid-2005 the Nutch project had developed an open source implementation. Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale. This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.
The NY Times used Amazon's EC2 compute cloud to crunch through 4 terabytes of scanned archives from the paper, converting them to PDFs for the Web. The processing took less than 24 hours to run using 100 machines, and the project probably wouldn't have been embarked on without the combination of Amazon's pay-by-the-hour model and Hadoop's easy-to-use parallel programming model.
Hadoop broke a world record to become the fastest system to sort a terabyte of data: running on a 910-node cluster, it sorted one terabyte in 209 seconds. In November of the same year, Google announced that its MapReduce implementation sorted one terabyte in 68 seconds. By 2009, Yahoo! used Hadoop to sort one terabyte in 62 seconds.
MapReduce – a distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS – a distributed filesystem that runs on large clusters of commodity machines.
Pig – a data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive – a distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL for querying the data.
HBase – a distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
Sqoop – a tool for efficiently moving data between relational databases and HDFS.
Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The two functions are also specified by the programmer.
For the example, we choose a text input format that gives us each line in the dataset as a text value. The key is the offset of the beginning of the line from the beginning of the file.
The map function is just a data preparation phase, setting up the data in such a way that the reduce function can do its work on it: finding the max temperature for each year.
http://techcrunch.com/2011/07/17/hadoop-startups-where-open-source-meets-business-data/
LAMP (Linux, Apache, MySQL, PHP/Python) – as new open-source web servers, databases, and web-friendly programming languages liberated developers from proprietary software and big-iron hardware, startup costs plummeted. This lower barrier to entry changed the startup funding game and led to the emergence of the current angel/seed funding ecosystem. It also enabled the generation of web apps we use today.
With Hadoop, startups are creating more intelligent businesses and more intelligent products. Even a modestly successful startup has a user base comparable in population to nation states. The problem this poses is that understanding the value of every user and transaction becomes more complex. The opportunity is that the collective intelligence of the population can be leveraged into better user experiences. Before Hadoop, analyzing this scale of data required the same kind of enterprise solutions that LAMP was created to avoid.
The key to understanding the significance of Hadoop is that it's not just a specific piece of technology, but a movement of developers trying to collectively solve the Big Data problems of their organizations.