A brief but informative overview of Apache Hadoop, an open source software framework for processing large amounts of data.
Given by Andrew Oliver, President of Mammoth Data, at All Things Open 2014.
3. www.mammothdata.com | @mammothdataco
Andrew C. Oliver, President & Founder
● @acoliver
● Programming since age 8
● Java since ~1997
● Founded POI project (currently hosted at Apache) with Marc Johnson ~2000
○ Former member Jakarta PMC
○ Emeritus member of Apache Software Foundation
● Joined JBoss ~2002
● Former Board Member/current helper/lifetime member: Open Source Initiative
(http://opensource.org)
● Column in InfoWorld: http://www.infoworld.com/author-bios/andrew-oliver
○ I make fanboys cry
4. Open Software Integrators
Founded Nov 2007 by Andrew C. Oliver (me)
in Durham, NC
Pivoted from Java/Linux consulting to full-on Hadoop/NoSQL consulting this year
We’re Hiring
mid- to senior-level (Java/Linux and database background)
DevOps-type people (Puppet, Chef, Salt, etc.; Linux background; database understanding; Ruby/Python/etc.)
up to 50% travel, salary + bonus, 401k, health, etc.
preferred: Java, Tomcat, JBoss, Hibernate, Spring, RDBMS, jQuery
nice to have: Hadoop, Neo4j, MongoDB, Cassandra, Ruby, at least one cloud platform
5. Overview
What is Hadoop anyhow?
What is Hadoop Good For?
What isn’t it good for?
How do you get data into Hadoop?
How do you get data out of Hadoop?
How do you process data in Hadoop?
How do you analyze data in Hadoop?
How do you secure Hadoop?
6. But first...
This is an overview talk intended as a roadmap to point you at the most important bits to learn along the way…
It is not comprehensive training…
It is not an in-depth look at any part of Hadoop
It is a rather high level selective overview of the Hadoop ecosystem
9. Hadoop is...
HDFS
Distributed Filesystem similar to Gluster, Ceph, etc.
You can use other distributed filesystems in place of HDFS
Blocks are distributed and replicated; by default each block is stored on 3 nodes (replication factor 3)
128 MB default block size
RESTful API, CLI tools, third-party tools to “mount” HDFS on Linux (stable), Windows (YMMV), Mac (?)
DO NOT PUT YOUR DATA NODES ON A SAN! IT IS WRONG! DO NOT DO IT! EVEN ON THURSDAY!
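Not from the deck, just an illustration: the block math above can be sketched in a few lines of plain Python, assuming the 128 MB default block size and the default replication factor of 3.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size: 128 MB
REPLICATION = 3                  # default replication factor

def hdfs_footprint(file_size_bytes):
    """Number of HDFS blocks a file splits into, and the total bytes
    stored across the cluster once every block is replicated."""
    blocks = max(1, math.ceil(file_size_bytes / BLOCK_SIZE))
    # HDFS blocks are not padded: the last block only occupies the
    # remaining bytes, so stored size is file size times replication.
    return blocks, file_size_bytes * REPLICATION

blocks, stored = hdfs_footprint(1 * 1024**3)   # a 1 GB file
print(blocks, stored)  # 8 blocks, 3 GiB stored across the cluster
```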
10. Hadoop is...
YARN
Yet Another Resource Negotiator
schedules “work” among nodes and distributes the “processing”
MapReduce is
an API
an algorithm: data is mapped across nodes in parallel, then the intermediate answers are “reduced” to a single answer
Hive is
HDFS/Hadoop based data warehousing
SQL, JDBC, ODBC
Tables map to files on HDFS
No updates, deletes, transactions (but coming in “Stinger.next”)
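Not from the deck: the map/shuffle/reduce flow described above can be sketched as a single-process word count in plain Python (no Hadoop involved; on a real cluster the map and reduce steps run in parallel on different nodes).

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # shuffle: group all values by key, so each reducer sees one key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # reduce: collapse each key's values into a single answer
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(shuffle(map_phase(["big data big", "data lake"])))
print(counts)  # {'big': 2, 'data': 2, 'lake': 1}
```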
14. What Is Hadoop Good For?
Working with large amounts of data in batch
ETL processing / Data Transformation
Analytics / BI
Integration (Data Lake, Enterprise Data Hub)
Working with streams of data
Events
Log data
Time series or similar data (HBase)
15. What Is Hadoop Bad At?
Quick jobs: Hive/MapReduce job setup time is measured in seconds to minutes.
Lots of small files: every file costs the NameNode a fixed chunk of in-memory metadata regardless of its size, so millions of tiny files overwhelm it
General DBMS stuff: HBase is a much more “specific” database than MySQL/etc.
High Availability
WHA???
Knox, Oozie, etc. all have shaky support, if any, for HA NameNodes.
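To make the small-files point concrete, here is a back-of-the-envelope sketch. The ~150 bytes per namespace object is a commonly cited rule of thumb, not an exact figure.

```python
# Rule of thumb: each file and each block costs the NameNode roughly
# 150 bytes of heap, no matter how big the file actually is.
BYTES_PER_OBJECT = 150

def namenode_heap_gb(num_files, blocks_per_file=1):
    """Rough NameNode heap needed to track num_files files."""
    objects = num_files * (1 + blocks_per_file)  # file entry + its blocks
    return objects * BYTES_PER_OBJECT / 1024**3

# 100 million one-block files vs. the same data in 100k big files
print(round(namenode_heap_gb(100_000_000), 1))  # ~27.9 GB of heap
print(round(namenode_heap_gb(100_000), 1))      # ~0.0 GB
```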
17. How Do You Get Data Into Hadoop?
Sqoop it from an RDBMS
Use JDBC or ODBC and push into Hive from an external DB
Push data into Hive with the RESTful API
Put an extract file onto HDFS with the REST API
process it into Hive directly with a LOAD DATA statement
transform/process it into Hive using Pig
use Java
Message it in there with Kafka, RabbitMQ, or a similar MQ and a custom “spout” for Storm
Use any of a multitude of APIs that write data into HDFS, HBase, Hive, etc.
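As an illustration of the HDFS REST route above: HDFS's REST interface is WebHDFS, and this sketch only builds the URL that initiates a file CREATE. The host, port, path, and user are placeholder assumptions, and the two-step redirect dance of a real upload is deliberately omitted.

```python
from urllib.parse import urlencode

def webhdfs_create_url(host, path, user, port=50070):
    """Build the WebHDFS v1 URL that initiates a file CREATE.
    A real client PUTs to this URL, receives a 307 redirect to a
    DataNode, then PUTs the file bytes to the redirect target."""
    query = urlencode({"op": "CREATE", "user.name": user, "overwrite": "true"})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

url = webhdfs_create_url("namenode.example.com", "/data/extract.csv", "etl")
print(url)
```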
18. How Do You Get Data Out Of Hadoop?
Should you be getting it out or should you process it there?
JDBC/ODBC to Hive
HBase tables can be mapped into Hive
REST APIs for Hive/HDFS
APIs for Kafka, Spark, Storm, etc (subscribe)
DistCp to another HDFS cluster
Mount it with FUSE and use your favorite Linux tool
hadoop fs -cat /path/to/file/on/hdfs | grep stuff > mynewlocalfile
20. How Do You Process Data In Hadoop?
MapReduce Java API
Hive supports a subset of SQL (with fuller SQL support on the roadmap)
Pig can munge files on HDFS and can work with Hive
Storm and Spark have their own APIs for dealing with events or so-called micro-batches of data
There are numerous toolkits
Mahout: common machine-learning algorithms (many of which don't parallelize well)
MLlib: machine learning built on Spark
GraphX: graph processing built on Spark
21. How Do You Analyze Data In Hadoop?
Most major BI tools now support Hadoop
Tableau
Pentaho
Datameer
Your favorite is probably on the list
All that stuff is for l4m3rs, use the command line interface :-)
hive -e 'select * from sometable'
pig hdfs://some/dir/myscript.pig
Use RStudio and write some R to predict what sales will be next month (you will probably be somewhat wrong)
Use your favorite SQL tool that supports JDBC/ODBC
Use Hue
23. How Do You Secure Hadoop?
HDFS supports POSIX-style (that means Linux-style) filesystem permissions
The most complete authentication story throughout Hadoop is based on Kerberos (yeah, I know).
You can do it with straight LDAP too, but it isn't integrated.
Knox supplies “perimeter-based security” for (only):
Hive
HDFS
Oozie
HBase
HCatalog
Supposedly Argus will save us from all of this!