This is the basis for some talks I've given at the Microsoft Technology Center, the Chicago Mercantile Exchange, and local user groups over the past two years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
2. Brief background on me
Phil has over 16 years of experience in data-centric system development. His work has flowed from simulation and video-game-like systems, to high-performance computing (HPC), to traditional database (Oracle, SQL Server, Postgres, MySQL) and CRM (warehouse/analytical) systems, and most recently to the Hadoop stack. Recently, as an employee at TripAdvisor, he led the research into Hadoop/Hive that resulted in the successful migration from the traditional RDBMS platform to a system based on Hadoop/Hive and integrated with MS SQL Server/SSAS. Currently, he's focused on the Hadoop stack and is creating a solution that involves integrating Hadoop into a more traditional enterprise environment.
3. Agenda
To make you as excited about Hadoop as I am
What is Hadoop (high-level)?
What have we actually done with it?
How does “it” (HDFS, M/R, Hive, and HBase) work?
Future of Hadoop
5. Q: What is Hadoop?
A#1 - The thing that empowers Yahoo, FB, and others
Yahoo has >25k Hadoop nodes…wow…
6. Q: What is Hadoop?
A#2 - Last year’s revolution (sort of)
The Linux/Hadoop vs Closed-Source “conflict” is a false one, IMO, and I’ll explain why as we go on
7. Q: What is Hadoop?
A#3 – the revolution of 5+ years ago
8. “Success has many fathers”
And you can look them up, because it’s FOSS!
People are fighting to contribute, and to get credit… be a contributor…
(http://hortonworks.com/reality-check-contributions-to-apache-hadoop/)
9. Q: What is Hadoop?
A#4 – the wave everyone is riding
Nearly all the big players (and many smaller ones) are on board…
10. In fact, beware of this
http://nosql.mypopescu.com/post/2955078419/origin-of-nosql
12. Hadoop projects performed by BlueMetal Architects
Hadoop at a Web 2.0 company (prior to BMA)
Ported traditional 30TB Warehouse to Hive
Big transform jobs in Hive
E.g., joins 50M rows to 12B rows
Big Data jobs, e.g., Social Graph processing with many “Cartesians” to empower emails
Hadoop in HealthCare (at BMA)
Applied HBase as part of a new system
Feeds data (via WS) to:
E.D.
Patient Web Portal
Other HealthCare affiliates
Note: Both projects include Hadoop as part of larger systems.
13. Warehouse Goals
Use the right tool for the right job
–Hadoop (M/R, Hive) is a batch system
• Inherently high-latency
–RDBMS (& other tools) are still needed
Empower users
–Minimize complexity
• Eliminate joins (almost)
• Eliminate “dimensions” (maybe)
–Expose *all* data
–Provide low-latency options
–Provide self-service options
14. A strategy for MASSIVE processing:
Best tool for the job
This is what we implemented and, it turns out, is also what Yahoo has done.
Yahoo’s SSAS cube is the largest in the world (14TB/quarter, 3B rows/day)
18. Map-Reduce (M/R) example
Note: this job is not optimized
Take-home message: “Simple API - Mappers read the input and emit K/V pairs. Framework sends Reducers K/V pairs partitioned and ordered* by Key”
(From: http://www.infosun.fim.uni-passau.de/cl/MapReduceFoundation/)
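The take-home message can be sketched in a few lines of plain Python. This is a single-process imitation of the idea, not Hadoop's API: `mapper`, `reducer`, and `run_job` are names made up for this illustration, and the "framework" here is just a dictionary of partitions plus a sort.

```python
from collections import defaultdict
from itertools import groupby


def mapper(line):
    """Read one input record and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def reducer(key, values):
    """Receive one key with all of its values and emit a result."""
    return key, sum(values)


def run_job(lines, num_reducers=2):
    # Partition: route each emitted pair to a reducer by hashing its key.
    partitions = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            partitions[hash(key) % num_reducers].append((key, value))

    # Order each partition by key, then hand each key group to the reducer.
    results = {}
    for pairs in partitions.values():
        pairs.sort(key=lambda kv: kv[0])
        for key, group in groupby(pairs, key=lambda kv: kv[0]):
            word, total = reducer(key, (v for _, v in group))
            results[word] = total
    return results


counts = run_job(["the quick brown fox", "the lazy dog", "The fox"])
```

The point of the sketch is that the author of a job only writes `mapper` and `reducer`; everything inside `run_job` is the framework's responsibility.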
19. Hadoop M/R with some details:
Note: Partition, Combine and Shuffle
(From: http://www.lecturemaker.com/2011/02/rhipe/)
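A rough sketch of what Partition, Combine, and Shuffle each contribute, again in plain Python rather than Hadoop code. All function names are invented for this example, and `zlib.crc32` stands in for whatever hash the real partitioner uses. The thing to watch is how many pairs cross the "network" with and without the combiner.

```python
import zlib
from collections import Counter, defaultdict


def map_task(lines):
    """One map task: emit a (word, 1) pair per word in its input split."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate one map task's output before it leaves the node."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def partition(key, num_reducers):
    """Partitioner: a stable hash decides which reducer owns a key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Shuffle: move every pair to its reducer and group values by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            reducers[partition(key, num_reducers)][key].append(value)
    return reducers


splits = [["a b a", "a c"], ["b b a"]]          # two input splits, two map tasks
raw = [map_task(s) for s in splits]
combined = [combine(p) for p in raw]
reducer_inputs = shuffle(combined, num_reducers=2)

totals = {k: sum(vs) for r in reducer_inputs for k, vs in sorted(r.items())}
pairs_before = sum(len(p) for p in raw)         # pairs shuffled without a combiner
pairs_after = sum(len(p) for p in combined)     # pairs shuffled with a combiner
```

The totals are the same either way; the combiner only shrinks what has to be shuffled, which matters because the shuffle is the network-heavy step.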
20. Hadoop M/R Primer
Let’s discuss HDFS (blocks, replication) and how that helps “data-local tasks”
(From: Yahoo)
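To make "blocks, replication, and data-local tasks" concrete, here is a toy model. It is a sketch under stated assumptions: the 64 MB block size and replication factor of 3 are values assumed for the example, round-robin placement is a simplification and not HDFS's real placement policy, and `place_blocks` and `schedule` are hypothetical names.

```python
import math

BLOCK_SIZE_MB = 64   # assumed block size for this sketch
REPLICATION = 3      # assumed replication factor


def place_blocks(file_size_mb, nodes):
    """Split a file into blocks and assign each block to REPLICATION nodes.

    Round-robin placement is a simplification, not HDFS's real policy.
    """
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
        for b in range(num_blocks)
    }


def schedule(block_locations, free_nodes):
    """Prefer a free node that already holds a replica of the block."""
    plan = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if n in free_nodes]
        if local:
            plan[block] = (local[0], "data-local")
        else:
            plan[block] = (free_nodes[0], "remote")
    return plan


nodes = ["node1", "node2", "node3", "node4"]
locations = place_blocks(200, nodes)             # a 200 MB file becomes 4 blocks
plan = schedule(locations, free_nodes=["node2", "node4"])
```

Because every block lives on three nodes, the scheduler usually finds a free node that already has the data, so the task reads from local disk instead of pulling the block over the network.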
21. Hadoop Terasort Job Profile
- or “hey, I thought it was just M/R”
(From: http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/)
22. Why Hadoop?
Because you don’t want to handle this…
This is actually a profile of a job running on an old version of Hadoop, but jobs with many failures look similar. This also shows improvement in Hadoop.
(From: http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/)
23. Hadoop M/R executive summary
Distributed storage system, with distributed processing capability, on commodity hardware (or in the cloud).
Moves the computation to the data!
That, in turn, saves network bandwidth, which is the limiting factor in distributed apps.
The same code can run on data of any size. The cluster is scaled with the data, not the code.
24. Hadoop Stack Key Components
(http://hortonworks.com/technology/hortonworksdataplatform/)
HCatalog is a recent project that allows Pig scripts to use Hive metadata/schemas.
Hadoop is not just about non- or semi-structured data!
26. Common RDBMS warehouse query
select top 10
    t.*
from (
    select ip_address, count(*) as cnt
    from f_pageviews pv
    join d_ipaddress ip on (pv.ip_key = ip.id)
    where date_key = 2992
    group by ip_address
) t
order by cnt desc
– wait a few minutes
– time is usually 1-4x nominal time depending on load
– … assumes the job can succeed at all!
27. Hive Version…
The luxury of Hadoop space/power means dimensional processing might not be required
NOTE: Hive does support “column-oriented” storage, which is very efficient.
select t.*
from (
    select ip_address, count(*) as cnt
    from f_lookback
    where ds = '2011-03-11'
    group by ip_address
) t
order by cnt desc
limit 10
– BUT – runtime is trickier
Time to run your job = HQL parse + M/R job submit + [wait in the queue for availability] + M/R job runtime
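The runtime sum is simple, but plugging in numbers shows why the bracketed queue term is the one that hurts. The timings below are hypothetical, picked only to illustrate the arithmetic; `hive_query_latency` is a name invented for this sketch.

```python
def hive_query_latency(parse_s, submit_s, queue_wait_s, job_runtime_s):
    """Total wall-clock seconds, following the sum on the slide."""
    return parse_s + submit_s + queue_wait_s + job_runtime_s


# Hypothetical timings in seconds, chosen only to illustrate the sum.
idle_cluster = hive_query_latency(2, 15, 0, 180)     # no wait in the queue
busy_cluster = hive_query_latency(2, 15, 600, 180)   # same job, 10-minute wait
```

The job itself costs the same in both cases; on a shared cluster the wait for a slot can dominate, which is one reason low-latency options still belong elsewhere in the architecture.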
28. What else can Hadoop do?
FB: Invented Cassandra but went with HBase for their new messaging system.
Does that mean HBase is “better”? – no, it’s about using the right tool for the job.
http://www.facebook.com/note.php?note_id=454991608919
That’s to hold 135B messages per month!
http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
Scale is relative (to your hardware and load), but when you want a consistent “OLTP” solution that doesn’t require redesign to scale, consider HBase.
29. HBase Architecture
Not shown: HMaster (HM), ZooKeeper (ZK), and HDFS
(From: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
30. HBase: a more detailed view
(http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
31. HBase: one way to look at it
A BigTable Implementation: memcached + LSM + framework
(From: http://java.dzone.com/news/bigtable-model-cassandra-and)
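The "in-memory store + LSM" view can be sketched as a toy log-structured merge store. This is an illustration of the general LSM idea, not HBase's implementation: `TinyLSM` and its methods are invented for the example, and real systems add write-ahead logs, indexes, and on-disk files that are left out here.

```python
class TinyLSM:
    """Toy log-structured merge store: a write buffer plus sorted, immutable runs."""

    def __init__(self, flush_threshold=2):
        self.memstore = {}          # recent writes, held in memory
        self.runs = []              # flushed runs, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffer out as one sorted run and start a fresh buffer.
        self.runs.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, key):
        # Newest data wins: check memory first, then runs from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for run in reversed(self.runs):
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all runs into one, keeping only the newest value per key.
        merged = {}
        for run in self.runs:
            merged.update(run)
        self.runs = [sorted(merged.items())]


store = TinyLSM()
store.put("row1", "a")
store.put("row2", "b")   # buffer is full, so this triggers a flush
store.put("row1", "c")   # newer value for row1 stays in memory for now
```

Writes never modify a flushed run; they only add to memory and later append a new run. That is why writes are cheap and why periodic compaction is needed to keep reads from checking too many runs.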
32. HBase: Hadoop BigTable
Not just a CRUD back-end:
…coprocessors, versioned cells, range scans, optimization (e.g. selective compression) via column families, etc.
The most important of these is distributed processing.
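Two of the features listed above, versioned cells and range scans, are easy to show with a toy table. This is a minimal sketch and not the HBase client API: `TinyTable`, the `user1#msg1`-style row keys, and the `m:body` column name are all assumptions made for the example.

```python
import bisect


class TinyTable:
    """Toy sorted table with versioned cells, loosely following the BigTable model."""

    def __init__(self, max_versions=3):
        self.row_keys = []   # kept sorted so range scans are cheap
        self.rows = {}       # row -> {column: [(timestamp, value), ...] newest first}
        self.max_versions = max_versions

    def put(self, row, column, value, timestamp):
        if row not in self.rows:
            bisect.insort(self.row_keys, row)
            self.rows[row] = {}
        versions = self.rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(reverse=True)          # newest first
        del versions[self.max_versions:]     # drop versions beyond the limit

    def get(self, row, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        return self.rows.get(row, {}).get(column, [])[:versions]

    def scan(self, start_row, stop_row):
        """Yield rows with start_row <= key < stop_row, in key order."""
        lo = bisect.bisect_left(self.row_keys, start_row)
        hi = bisect.bisect_left(self.row_keys, stop_row)
        for key in self.row_keys[lo:hi]:
            yield key, {c: v[0][1] for c, v in self.rows[key].items()}


table = TinyTable(max_versions=2)
table.put("user1#msg1", "m:body", "hi", timestamp=1)
table.put("user1#msg1", "m:body", "hi!", timestamp=2)
table.put("user1#msg1", "m:body", "hi!!", timestamp=3)
table.put("user1#msg2", "m:body", "yo", timestamp=1)
table.put("user2#msg1", "m:body", "hey", timestamp=1)

latest = table.get("user1#msg1", "m:body")
history = table.get("user1#msg1", "m:body", versions=5)
user1_rows = list(table.scan("user1#", "user1$"))   # '$' sorts just after '#'
```

Since rows are stored in key order, putting the user id at the front of the row key turns "all messages for one user" into a single contiguous scan. Row-key design is where most of the schema thinking goes in this model.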
33. Hadoop in (pre*) action
Hadoop indexed “THE DATA” for Watson
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/i%E2%80%99ll-take-hadoop-for-400-alex/
*Runtime processing used Apache JMS + UIMA.
35. Overlapping Ecosystems
Hadoop (usage and contributions) will be “shared” between FOSS and Closed Source communities.
Image from: http://cyhshonorsbio.wikispaces.com/The+Chemistry+of+Life
36. False Conflicts, with Solutions
Sodium (explosive) + Chlorine (poison) => Salt (vital)
From http://strangetimes.lastsuperpower.net/?p=1663
Closed Source + Open Source => Free + Enterprise + Support + Integration
Visit: http://en.wikipedia.org/wiki/Business_models_for_open_source_software#Hybrid
37. IMO, an important message from a brilliant man
Anant Jhingran, Hadoop Summit 2011: IBM Watson & Big Data, with Q&A
http://www.youtube.com/watch?v=IVS__xF3Byg
Add value by fostering the ecosystem.
Do not fragment Hadoop (as Unix did).
There is room for folks from many areas to contribute and benefit.
39. MS embraced Hadoop despite having developed technology similar to NextGen Hadoop. Wow.
Hadoop release on Azure is 3/12.
BlueMetal Architects is part of the MS TAP program for Hadoop on Azure. Please contact us, as we’ll be blogging about it.
41. Hadoop NextGen: A Brave New (!?) world
Hadoop “NextGen” will support more than M/R, e.g. “Apache Giraph”
BUT, the diagram is from MS Dryad blogs. Graph processing will also be “big”.
42. Hadoop >> (un)structured data store.
Why do this (except ad-hoc) …?
RDBMS and Hadoop each have strengths; use them rather than negating both.
See the above Warehouse Architecture diagram…
(From: http://nosql.mypopescu.com/post/344388408/hadoop-and-oracle-parallel-processing)
44. Useful/Supporting Links
Bing crawls the web for Yahoo (for US, Canada, and some other countries)
http://www.ehow.com/info_8208930_isnt-yahoo-crawling-website.html
World’s largest SSAS Cube: 14TB/quarter, 3B rows/day
http://jobs.climber.com/jobs/Media-Communication/-CA-US/MS-SQL-SSAS-SSIS-Engineer/22735283
http://hadoop.apache.org/
http://www.docstoc.com/docs/66356954/Advanced-HBase
https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial
http://wiki.apache.org/hadoop/WordCount
https://blogs.apache.org/foundation/entry/apache_innovation_bolsters_ibm_s