1. THE FUTURE OF
HADOOP: CHOOSING
THE RIGHT OPTIONS
Subash D’Souza
Hadoop Innovation Summit
2014
2. WHO AM I?
Recognized as a Champion of Big Data by Cloudera
Co-Organizer – Los Angeles Hadoop User Group
Organizer – Los Angeles HBase User Group
Organizer – Los Angeles Big Data Users Group
Organizer – Big Data Camp LA
Speaker – Big Data Camp LA 2013
Leading a BOF Session at Hadoop Summit Europe 2014
Author – HBase Developer’s Cookbook (Out Fall 2014)
Technical Reviewer – Apache Flume: Distributed Log Collection for Hadoop
3. HADOOP: OLD & NEW
Hadoop was first released in 2006.
It is based on the GFS and MapReduce papers published by Google
Adoption has been massive and rapid ever since
Companies like Facebook, Netflix, eBay, Yahoo, Expedia, Spotify and even the
Social Security Administration are adopting Hadoop
Hadoop 2.0 (which introduced YARN) went GA in October 2013
It is backwards compatible with the Hadoop 1.0 APIs
It replaced the JobTracker and TaskTrackers with the ApplicationMaster, ResourceManager
and NodeManagers
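A toy sketch of that split, to make the division of labour concrete. This is not the YARN API: the class names only borrow YARN's vocabulary, and the node names, slot counts and task names are made up for illustration. The point is that one component only hands out containers while a per-application component tracks its own tasks.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Cluster-wide scheduler: hands out containers, knows nothing about tasks."""
    free_slots: dict  # node name -> free container slots (hypothetical capacities)

    def allocate(self, app_id, wanted):
        # Grant up to `wanted` containers, filling nodes in name order.
        granted = []
        for node in sorted(self.free_slots):
            while self.free_slots[node] > 0 and len(granted) < wanted:
                self.free_slots[node] -= 1
                granted.append((app_id, node))
        return granted

    def release(self, container):
        # Return a container's slot to its node.
        _, node = container
        self.free_slots[node] += 1


@dataclass
class ApplicationMaster:
    """Per-application coordinator: tracks its own tasks, asks the RM for containers."""
    app_id: str
    pending: list
    done: list = field(default_factory=list)

    def run(self, rm):
        while self.pending:
            containers = rm.allocate(self.app_id, len(self.pending))
            if not containers:
                raise RuntimeError("no capacity left in the cluster")
            for container in containers:
                task = self.pending.pop(0)
                self.done.append((task, container[1]))  # pretend the task ran on that node
                rm.release(container)
        return self.done


# Two applications share one cluster (multi-tenancy); each has its own master.
rm = ResourceManager(free_slots={"node1": 2, "node2": 1})
etl = ApplicationMaster("etl", ["map-0", "map-1", "map-2", "reduce-0"])
adhoc = ApplicationMaster("adhoc", ["query-0"])
etl_result = etl.run(rm)
adhoc_result = adhoc.run(rm)
print(etl_result)
print(adhoc_result)
```

In MR1 both roles lived in the single JobTracker; here the scheduler stays small because task bookkeeping lives with each application.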
4. A BRIEF HISTORY
Timeline, 2002 to 2014:
Doug Cutting launches Nutch project
Google releases GFS paper
Google releases MapReduce paper
Nutch adds distributed file system
MapReduce implemented in Nutch
Hadoop spun out of Nutch project at Yahoo
Cloudera founded
Hadoop breaks Terasort world record
MapR founded
HBase, ZooKeeper, Flume and more added to CDH
Hortonworks founded
Hadoop 2.0 w/HA available
Impala (SQL on Hadoop) launched
YARN goes GA
Stinger/Tez to be released
5. PREVIOUSLY, THE STATE OF
DATA
As a data analyst, you previously could not ask the
questions you wanted to ask because you did not
have the data points available
As a corollary, you could not think of questions to ask of
your data because you did not know you had access
to those data points
7. FOCUS
No standard way to get to the data
This is both a plus and a minus: a plus because there is variety to choose from, a minus because the
number of tools for pulling the data is huge and ever-expanding
As a company what do you choose?
What do you focus on?
Question – Do you replace your current data
infrastructure or do you augment it?
16. CHOICES
Hortonworks – Completely open source. Everything on their platform is available
from the Apache Hadoop distribution. Available as a free download or with paid
support.
Cloudera – Offers the open source Apache Hadoop distribution as well as
management tools built for the Cloudera distribution. Available as a free download,
or with paid support along with the additional tools.
MapR – Offers a version of Hadoop that replaces HDFS with the proprietary
MFS (MapR File System). Everything else on their stack is based on the open
source Apache distribution. Offers a free M3 version along with paid M5 and M7
versions.
17. ADVANTAGES OF YARN
Ability to handle multiple tenants, i.e. running
multiple applications atop the same framework
(multi-tenancy)
Splits the work of the JobTracker between the ResourceManager
and the ApplicationMaster, so a single daemon no longer
has to both allocate resources and manage the tasks
Ability to restart jobs from the place where they
failed
Scales well beyond the limitations of MR1(4000
22. SQL ON HADOOP VS.
TRADITIONAL RDBMS
Queries on data in Hadoop are not as responsive as on an RDBMS
Data in Hadoop can scale much better than in an
RDBMS
Data in Hadoop can be accessed using a variety of
mechanisms such as Hive, Impala, Drill, etc., i.e.
the query engines are abstracted from the
Hadoop (HDFS) storage layer. The same cannot be
said of an RDBMS, where you would need to move data from
one system to another; for example, Oracle cannot pull
from SQL Server and vice versa
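A minimal sketch of what "query engines abstracted from the storage layer" means in practice. Everything here is a stand-in: a local CSV file plays the role of data at rest in HDFS, and a hand-written scan and SQLite play the role of two independent engines (neither is a Hadoop tool, and the sample rows are invented).

```python
import csv
import os
import sqlite3
import tempfile

# A plain file stands in for data at rest on HDFS (hypothetical sample rows).
data_dir = tempfile.mkdtemp()
path = os.path.join(data_dir, "clicks.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows([("us", 3), ("eu", 5), ("us", 4)])


def engine_scan(path):
    """'Engine' 1: a hand-written scan that aggregates while reading the file."""
    totals = {}
    with open(path, newline="") as f:
        for region, clicks in csv.reader(f):
            totals[region] = totals.get(region, 0) + int(clicks)
    return totals


def engine_sql(path):
    """'Engine' 2: a SQL engine that loads the same file; storage is left unchanged."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE clicks (region TEXT, clicks INTEGER)")
    with open(path, newline="") as f:
        db.executemany("INSERT INTO clicks VALUES (?, ?)", csv.reader(f))
    rows = db.execute("SELECT region, SUM(clicks) FROM clicks GROUP BY region")
    return dict(rows.fetchall())


scan_totals = engine_scan(path)
sql_totals = engine_sql(path)
print(scan_totals, sql_totals)
```

Both engines answer the same question from the same stored file, which is the property that lets Hive, Impala or Drill be swapped in and out over one copy of the data.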
23. QUESTION?
Do we augment or replace our current data
infrastructure?
Answer – Augment
Why? – Combine the best of both worlds: keep
aggregated data in your data stores and all the
detail and lifetime data in Hadoop
Of course, you will have different SLAs based on the
query you ask.
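One way to picture the augment approach is a small router that sends each query to whichever system holds the data it needs. This is a sketch under stated assumptions: the backend names, the SLA figures and the 90-day retention window are all hypothetical, chosen only to show that different queries end up with different SLAs.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    sla_seconds: float  # hypothetical response-time target


# Assumed split: the existing data store keeps aggregates, Hadoop keeps full detail.
WAREHOUSE = Backend("warehouse (aggregated data)", sla_seconds=2)
HADOOP = Backend("hadoop (detail and lifetime data)", sla_seconds=600)


def route(needs_detail, days_back, warehouse_retention_days=90):
    """Pick a backend for a query; the 90-day retention is an illustrative assumption."""
    if needs_detail or days_back > warehouse_retention_days:
        return HADOOP
    return WAREHOUSE


dashboard = route(needs_detail=False, days_back=30)
audit = route(needs_detail=True, days_back=30)
history = route(needs_detail=False, days_back=3 * 365)
for label, backend in [("dashboard", dashboard), ("audit", audit), ("history", history)]:
    print(f"{label}: {backend.name}, SLA {backend.sla_seconds}s")
```

Recent aggregate questions stay on the existing store, while anything needing row-level detail or long history goes to Hadoop and accepts the slower SLA.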
25. STARTUPS VS. MATURE
Startups working with data should consider
going with YARN to gain its
advantages
Mature companies tend to be conservative and
hence will look to the more established use cases of
MR1
Both startups and mature companies should look at the
advantages of YARN as well as at applying more near
real-time SQL-on-Hadoop
26. GETTING STARTED WITH
HADOOP VS. ESTABLISHED
HADOOP PRACTICES
Getting started with Hadoop – Opportunity to hit
the ground running with YARN plus bleeding-edge
technologies.
Established companies with a Hadoop practice tend
to be conservative, but that shouldn’t prevent them
from coming up with a migration plan to YARN
27. REAL TIME ANALYTICS
Kiji
HBase
Storm
Shark
Redshift
Impala
Stinger
Drill
Accumulo
Presto
Hawq
IBM BigSQL
32. FUTURE OF HADOOP: YARN &
NEAR REAL TIME SQL-ON-HADOOP
Multi Tenancy
HA (High Availability)
Tools for SQL-On-Hadoop
Impala
Stinger/Tez
Drill
Shark
33. WHAT DO YOU CHOOSE?
The choices are huge
The toolsets are varied
First focus on the problems you are trying to solve. Don’t
choose Hadoop because it is the latest buzzword. Make
sure there is a real problem to solve
Focus on developers and administrators: ensure that,
whatever toolset you choose, they have the relevant
skillset, training will be provided, or relevant resources will
be brought in from outside (whether through hiring or
consulting)
REMEMBER PROBLEMSET!!! i.e what you are trying to
34. CAVEATS
Work is still being done on bringing real-time SQL-on-Hadoop to YARN.
Impala has Llama for this.
Stinger for Hive Preview is currently available
HBase on YARN (Hoya) is also actively being
worked on.
Since YARN is a low-level API, some abstraction is
needed, which is available with tools such as Samza
and Weave
35. BIG DATA = BIG IMPACT
Ken Rudin, Director of Analytics, Facebook
“You need to go the last mile and evangelize your
insights so that people actually act on them and
there is impact.”
“It doesn’t matter how brilliant our analyses are. If
nothing changes we have made no impact”
36. GIVING BACK
Hadoop is an open source project
Work on Hadoop and the ecosystem tools is done by
committers and contributors, some of whom do this in
their own personal time, reporting and fixing bugs as
well as adding new functionality.
Please give back, either by becoming a
contributor (testing, filing bugs) or by sharing your use
case for Hadoop (at meetups and/or conferences such
as this one), so others can learn from the issues you
have faced as well as see the rapid adoption of the