The Hadoop Ecosystem
- 2. The Hadoop Ecosystem
• Introduction
– What Hadoop is, and what it’s not
– Origins and History
– Hello Hadoop
• The Hadoop Bestiary
• The Hadoop Providers
• Hosted Hadoop Frameworks
© J Singh, 2011 2
- 3. What Hadoop is, and what it’s not
• A Framework for Map Reduce
• A Top-level Apache Project
• Hadoop is:
– A framework, not a “solution”
• Think Linux or J2EE
– Scalable
– Great for pipelining massive amounts of data to achieve the end result
– Sometimes the only option
• Hadoop is not:
– A painless replacement for SQL
– Uniformly fast or efficient
– Great for ad hoc analysis
- 4. You are ready for Hadoop when…
• You no longer get enthused by the prospect of more data
– Rate of data accumulation is increasing
– The idea of moving data from hither to yon is positively scary
– A hit man threatens to delete your data in the middle of the night
• And you want to pay him to do it
• Seriously, you are ready for Hadoop when analysis is the bottleneck
– Could be because of data size
– Could be because of the complexity of the data
– Could be because of the level of analysis required
– Could be because the analysis requirements are fluid
- 5. MapReduce Conceptual Underpinnings
• Based on Functional Programming model
– From Lisp
• (map square '(1 2 3 4)) → (1 4 9 16)
• (reduce plus '(1 4 9 16)) → 30
– From APL
• +/ 1 2 3 4 → 10 (plus-reduction over a vector)
• Easy to distribute (based on each element of the vector)
• New for Map/Reduce: Nice failure/retry semantics
– Hundreds or thousands of low-end servers running at the same time
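The Lisp idiom above translates directly into Python’s built-in functional primitives. A minimal sketch (the squaring and addition functions are stand-ins for any per-element and combining operations):

```python
from functools import reduce

# "map" step: apply a function to each element of the vector independently.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]

# "reduce" step: fold the mapped values into a single result.
total = reduce(lambda a, b: a + b, squares)  # 30
```

Because each map call touches only one element, the map step can be sliced across any number of machines; this is the property Hadoop exploits.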
- 6. MapReduce Flow
• Word Count Example
Lines      MapOut          Result
foo bar    foo 1, bar 1    foo 3
quux foo   quux 1, foo 1   quux 2
foo labs   foo 1, labs 1   labs 1
quux       quux 1          bar 1
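The flow above can be simulated in a few lines of plain Python (a sketch of the semantics, not Hadoop’s actual API): map each line to (word, 1) pairs, group the pairs by word, and reduce each group by summing.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (word, 1) for every word in the input line.
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    # Sum all the counts emitted for one word.
    return (word, sum(counts))

lines = ["foo bar", "quux foo", "foo labs", "quux"]

# Map phase: run the mapper over every input line.
map_out = [pair for line in lines for pair in mapper(line)]

# Shuffle phase: group the intermediate pairs by key.
map_out.sort(key=itemgetter(0))
result = dict(
    reducer(word, (count for _, count in pairs))
    for word, pairs in groupby(map_out, key=itemgetter(0))
)
# result holds the final counts: foo 3, quux 2, labs 1, bar 1
```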
- 7. Hello Hadoop
• Word Count
– Example with Unstructured Data
– Load 5 books from Gutenberg.org
into /tmp/gutenberg
– Load them into HDFS
– Run Hadoop
• Results are put into HDFS
– Copy results into file system
– What could be simpler?
– DIY instructions for Amazon EC2
available on DataThinks.org blog
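The steps above boil down to three commands; a sketch only, requiring a running Hadoop installation. The HDFS paths and the examples-jar name are assumptions (the jar name varies by distribution), but the `fs` subcommands are standard.

```shell
# Load the downloaded books from the local file system into HDFS.
hadoop fs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg

# Run the stock word-count example that ships with Hadoop.
hadoop jar hadoop-examples.jar wordcount \
    /user/hduser/gutenberg /user/hduser/gutenberg-output

# Copy the results back out of HDFS into the local file system.
hadoop fs -getmerge /user/hduser/gutenberg-output /tmp/wordcount.txt
```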
- 8. The Hadoop Ecosystem
• Introduction
• The Hadoop Bestiary
– Core: Hadoop Map Reduce and Hadoop Distributed File System
– Data Access: HBase, Pig, Hive
– Algorithms: Mahout
– Data Import: Flume, Sqoop and Nutch
• The Hadoop Providers
• Hosted Hadoop Frameworks
- 9. The Core: Hadoop and HDFS
• Hadoop Map Reduce
– One master, n slaves
– The master:
• Schedules mappers & reducers
• Connects pipeline stages
• Handles failure semantics
• Hadoop Distributed File System (HDFS)
– Robust data storage across machines, insulating against failure
– Keeps n copies of each file
• Configurable number of copies
• Distributes copies across racks and locations
- 10. Hadoop Bestiary (p1a): HBase, Pig
• Database Primitives
– HBase
• Wide-column data structure built on HDFS
• Processing
– Pig
• A high(-ish) level data-flow language and execution framework for parallel computation
• Accesses HDFS and HBase
• Batch as well as interactive
• Integrates UDFs written in Java, Python, JavaScript
• Compiles to map & reduce functions – not 100% efficiently
- 11. In Pig (Latin)
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group,
    COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
- 12. Pig Translation into Map Reduce
• Job 1: Load Users → Filter by age; Load Pages → Join on name
– Users = load … Fltrd = filter … Pages = load … Joined = join …
• Job 2: Group on url → Count clicks
– Grouped = group … Summed = … count()…
• Job 3: Order by clicks → Take top 5
– Sorted = order … Top5 = limit …
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
- 13. Hadoop Bestiary (p1b): HBase, Hive
• Database Primitives
– HBase
• Wide-column data structure built on HDFS
• Processing
– Hive
• Data warehouse infrastructure
• QL, a subset of SQL that supports primitives supportable by Map Reduce
• Support for custom mappers and reducers for more sophisticated analysis
• Compiles to map & reduce functions – not 100% efficiently
Hive Example
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
:: ::
STORED AS SEQUENCEFILE;
- 14. Hadoop Bestiary (p2): Mahout
• Algorithms
– Mahout
• Scalable machine learning and data mining
• Runs on top of Hadoop
• Written in Java
• In active development – algorithms being added
• Examples
– Clustering Algorithms
• Canopy Clustering
• K-Means Clustering
• …
– Recommenders / Collaborative Filtering Algorithms
– Other
• Regression Algorithms
• Neural Networks
• Hidden Markov Models
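To make the k-means entry concrete, here is the algorithm in plain Python on one-dimensional data — an illustration of the idea only, not Mahout’s API (Mahout runs the same assign/update loop as distributed Map Reduce jobs):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 1-D points: alternate assignment and update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two obvious clusters around 1.0 and 9.0:
centers = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], k=2)
```

In Mahout the assignment step becomes the map phase and the center update becomes the reduce phase, which is what lets it scale across a cluster.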
- 15. Hadoop Bestiary (p3): Data Import
• Data Import Mechanisms
– Sqoop: Structured Data
• Import from RDBMS to HDFS
• Export too
– Flume: Streams
• Import streams
– Text Files
– System Logs
– Nutch
• Import from Web
• Note: Nutch + Hadoop = Lucene
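Sqoop’s import/export pair looks like the sketch below; it needs a Sqoop and Hadoop installation to run. The hostname, database, and table names are hypothetical placeholders; `--connect`, `--table`, `--target-dir`, and `--export-dir` are standard Sqoop options.

```shell
# Import the "orders" table from a relational database into HDFS.
sqoop import \
    --connect jdbc:mysql://db.example.com/sales \
    --table orders \
    --target-dir /data/orders

# Export works the other way: push HDFS files back into an RDBMS table.
sqoop export \
    --connect jdbc:mysql://db.example.com/sales \
    --table order_summaries \
    --export-dir /data/order_summaries
```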
- 17. The Hadoop Ecosystem
• Introduction
• The Hadoop Bestiary
• The Hadoop Providers
– Apache
– Cloudera
– Options when your data lives in a Database
• Hosted Hadoop Frameworks
- 18. Apache Distribution
• The Definitive Repository
– The hub for Code, Documentation, Tutorials
– Many contributors, for example
• Pig was a Yahoo! Contribution
• Hive came from Facebook
• Sqoop came from Cloudera
• Bare metal install option:
– Download to your machine(s) from Apache
– Install and Operate
• Modify to fit your business better
- 19. Cloudera
• Cloudera : Hadoop :: Red Hat : Linux
• Cloudera’s Distribution Including Apache Hadoop (CDH)
– A packaged set of Hadoop modules that work together
– Now at CDH3
– Largest contributor of code to Apache Hadoop
• $76M in Venture funding so far
- 20. When the data lives in a Database…
• Objective: keeping Analytics and Data as close as possible
• Options for RDBMS:
– Sqoop data to/from HDFS
• Need to move the data
• Can utilize all parts of Hadoop
– In-database analytics
• Available for Teradata, Greenplum, etc.
• If you have the need – and the $$$
• Options for NoSQL Databases:
– Sqoop-like connectors
• Need to move the data
– Built-in Map Reduce available for most NoSQL databases
• Knows about and is tuned to the storage mechanism
• But typically only offers map and reduce – no Pig, Hive, …
- 21. The Hadoop Ecosystem
• Introduction
• The Hadoop Bestiary
• The Hadoop Providers
• Hadoop Platforms as a Service
– Amazon Elastic MapReduce
– Hadoop in Windows Azure
– Google App Engine
– Other
• Infochimps
• IBM SmartCloud
- 22. Amazon Elastic Map Reduce (EMR)
• Hosted Map Reduce
– CLI on your laptop
• Control over size of cluster
• Automatic spin-up/spin-down of instances
– Map & Reduce programs on S3
• Pig, Hive, or
• Custom in Java, Ruby, Python, Perl, PHP, R, C++, Cascading
– Data In/Out on S3, or
– Data In/Out on DynamoDB
• Keep in mind:
– Hadoop on EC2 is also an option
- 23. Hadoop in Windows Azure
• Basic Level
– Hive Add-in for Excel
– Hive ODBC Driver
• Hadoop-based Distribution for Windows Server and Azure
– Strategic Partnership with HortonWorks
– Windows-based CLI on your laptop
• Broadest Level
– JavaScript framework for Hadoop
– Hadoop connectors for SQL Server and Parallel Data Warehouse
- 24. Google App Engine MapReduce
• Map Reduce as a Service
– Distinct from Google’s internal Map Reduce
– Part of Google App Engine
• Works with Google Datastore
– A Wide Column Store
• A “purely programmatic” environment
– Write Map and Reduce functions in Python / Java
- 26. Take Aways
• There are many flavors of Hadoop.
– The important part is Functional Programming and Map Reduce.
– Don’t let the proliferation of choices stump you.
– Experiment with it!
- 27. Thank you
• J Singh
– President, Early Stage IT
• Technology Services and Strategy for Startups
• DataThinks.org is a new service of Early Stage IT
– “Big Data” analytics solutions
Editor’s Notes
- Sources: “Top 5 Reasons Not to Use Hadoop for Analytics”; “The Dark Side of Hadoop”; “Hadoop Don’t’s: What not to do to harvest Hadoop’s full potential”
- Get started with Hadoop
- http://pig.apache.org/docs/r0.9.2/index.html; Apache Hadoop; Cascading
- http://pig.apache.org/docs/r0.9.2/index.html
- Flume Users Guide; Thrift Paper
- Missing components: Cascading