5 Things that Make Hadoop a Game Changer
Webinar by Elliott Cordo, Caserta Concepts
There is much hype and mystery surrounding Hadoop's role in analytic architecture. In this webinar, Elliott presented, in detail, the services and concepts that make Hadoop a truly unique solution - a game changer for the enterprise. He talked about the real benefits of a distributed file system, the multi-workload processing capabilities enabled by YARN, and the 3 other important things you need to know about Hadoop.
To access the recorded webinar, visit the event site: https://www.brighttalk.com/webcast/9061/131029
For more information on the services and solutions that Caserta Concepts offers, please visit http://casertaconcepts.com/
2. And about me
• 15 years of data management
• 10 years devoted to data engineering
• I’m still a nerd, not an academic
• Startups to the Fortune 100
Elliott Cordo
Chief Architect, Caserta Concepts
elliott@casertaconcepts.com
3. Caserta Concepts
Technology services company with focused expertise in:
Data Warehousing
Business Intelligence
Big Data Analytics
Data Science & Analytics
Data on the Cloud
Data Interaction & Visualization
Data is all we do.
Established in 2001
Industry-recognized workforce
Strategy, Assessments, Implementation
Writing, Education, Mentoring
Broad experience across industries:
Financial Services / Insurance / Services
eCommerce / Advertising / Higher Education / Healthcare
6. Caserta Innovation Lab (CIL)
• Internal laboratory established to test & develop solution concepts and ideas
• Used to accelerate client projects
• Examples:
• Search (SOLR) based BI
• Big Data Governance Toolkit
• Text Analytics on Social Network Data
• Continuous Integration / End-to-end streaming
• Recommendation Engine Optimization
• Others (confidential)
• CIL is hosted on
7. What is “a Hadoop”?
• A distributed file system
– Like any file system – any data type can be stored
• A distributed processing framework
– Processing is spread across a cluster of machines
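The two halves above come together in MapReduce, the processing model Hadoop popularized. A minimal sketch of that model in plain Python - simulating the shuffle step the framework normally performs across the cluster; this is not the Hadoop API itself:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce phase: sum the counts collected for one key.
    return word, sum(counts)

def run_job(lines):
    # The framework's shuffle step, simulated: group mapper output by key,
    # then hand each key's values to a reducer.
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)
    return dict(reducer(w, c) for w, c in groups.items())

print(run_job(["big data is big", "data is everywhere"]))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster the mapper and reducer run on many machines in parallel and the shuffle moves data between them; the programming model, however, is exactly this small.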
8. Born in the server rooms of tech giants
• They were the first guys to deal with this BIG DATA stuff
– GB turned to TB turned to PB
– Technologies weren’t scalable enough
– New workloads
– New data types
• Now many of us have to deal with it too!
9. What is “a Big Data”?
4V’s
Let's keep it simple:
• Data is big – measured in multiple TB or PB
• Data isn’t very relational:
– JSON
– Log files
– Raw text
– Various binary formats
10. Some misconceptions about Hadoop
• Data Lake – just dump your data in there and everything will be fine.
• You have to be a Java programmer
• You don’t have to be a programmer
• SQL is no good (actually, SQL is really good!)
• Hadoop is the solution for everything
12. #1 Low-cost computing and data management
• Before Hadoop we only had a few tools in our quiver for dealing with massive datasets:
– Scaling up to ginormous servers and storage systems (via SANs and the like)
– MPP (Massively Parallel Processing) databases
13. Now you no longer need $$$ to be competitive
• Store anything, for as long as needed!
• Run massive query and computing workloads
• Leverage commodity servers or cloud
• Enablement for startups and innovators
14. And about those MPPs
• I’m an MPP proponent and often use them in conjunction with large Hadoop implementations – as storage for “hot data”
• However, they are EXPENSIVE!
– Orders of magnitude more expensive than Hadoop
– Often use proprietary hardware
• …and they are limited:
– Not generally as scalable as Hadoop
– Only good at structured data
15. #2 Promote agile data culture
Enable your data analysts and data scientists to thrive:
• Onboard data fast
• Blend with governed data
• Produce business insights
16. The Traditional Way
[Diagram: “Cool new data” must pass through Data Governance, Data Integration, and Modeling before entering the Enterprise Data Warehouse]
• Data warehouse is central/protected/highly governed
• Rigorous SDLC and governance
• A lot of work just to validate this data is valuable?
17. The Data Lake
Only a subset of the data in Hadoop will be fully governed. We refer to this as the Trusted layer of the Big Data Warehouse.
[Diagram: layers of the Big Data Warehouse]
• Landing Area – source data in “full fidelity”; raw machine data collection, collect everything; limited barriers to ingest
• Staging – data is ready to be turned into information: organized, well defined, complete
• Big Data Warehouse – fully data governed; user community runs arbitrary queries and reporting
• Data Science Workspace – agile business insight through data munging, machine learning, blending with external data, development of facts
• Cross-cutting for each layer: Metadata Catalog; ILM (who has access, how long do we “manage it”); Data Quality and Monitoring (monitoring of completeness of data)
18. The Data Science Factory
• Data scientists have access to ALL layers of the data lake
• New data is easy to onboard – the only requirements are basic metadata and ILM
• The products of data science may evolve into new data products, including new data warehouse events (facts) and reference data (dimensions)
[Diagram: “Cool new data” enters the Landing Area and moves up through Staging, the Data Science Workspace, and the BDW, yielding New Insights]
19. #3 Hadoop + Cloud = elasticity
• Pay for the computing you need
• Scale elastically and quickly
20. Unique opportunities
• Cloud storage is cheap and dependable
– Archiving services are even cheaper
• Pay only for the data processing you need
• Test new processes quickly and inexpensively
• Optimize operations and scale rapidly
21. Ephemeral data processing
Why have servers sitting around all day doing nothing?
• Build a cluster in one command!
• Bootstrap-install anything special you need:
– Spark, Impala, Mahout
• Run the jobs you need to
• Tear it down
aws emr create-cluster --applications Name=Pig --ami-version 3.2.1 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
  --steps Type=PIG,Name="Pig Program",ActionOnFailure=CONTINUE,Args=[-f,s3://elasticmapreduce/samples/pig-apache/do-reports2.pig,-p,INPUT=s3://elasticmapreduce/samples/pig-apache/input,-p,OUTPUT=s3://mybucket/pig-apache/output]
22. #4 No rigid data structures
• As a distributed file system, Hadoop can store anything.
• Structure isn’t required, unless you want it…
• Remove barriers for new data ingest
• Keep data in its full fidelity
23. Embrace semi-structured data
• Logs are ubiquitous
• JSON is pretty darn awesome
• VERY difficult to model in the relational world
– Extreme denormalization/brittle data structures
– Alternative: fidelity loss
• Plenty of tools for dealing with these formats (“late bind”)
– Hive SerDes
– Pig
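Hive SerDes and Pig both apply structure when the data is read, not when it is written. A minimal Python sketch of that “late bind” idea - the event fields and schema here are hypothetical, invented for illustration:

```python
import json

# Raw events stored exactly as they arrived -- no schema imposed at write time.
raw_events = [
    '{"user": "alice", "action": "click", "target": "home"}',
    '{"user": "bob", "action": "search", "query": "hadoop"}',
]

def read_with_schema(lines, fields):
    # "Late bind": the schema (a list of field names) is applied only when
    # the data is read, so new or missing fields never break ingestion.
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(f) for f in fields)

# Two consumers can project different schemas over the same raw data.
actions = list(read_with_schema(raw_events, ["user", "action"]))
print(actions)  # [('alice', 'click'), ('bob', 'search')]
```

Note that the first event has no "query" field and the second has no "target" field, yet both ingest fine - the relational alternative would be either brittle wide tables or throwing those fields away.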
24. Embrace unstructured data
• Raw text
• PDF
• Various binary data formats
• Custom engineered processes, machine
learning
25. Why is it important that Hadoop lets you store as-is?
• Data that is not important today could be important tomorrow
• What if your processes are flawed, and you need to recast?
• Versioning as the data evolves
• This is a traditional data management best practice – it just hasn’t always been practical
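A tiny sketch of the “recast” point, assuming hypothetical pipe-delimited log lines: because the raw data was kept in full fidelity, a flawed first parse can be corrected later and all of history rebuilt.

```python
# Raw log lines kept in full fidelity, exactly as collected.
raw_log = ["2014-01-01|alice|200", "2014-01-02|bob|404"]

def parse_v1(line):
    # First attempt: keeps only the user -- date and status are discarded.
    _, user, _ = line.split("|")
    return {"user": user}

def parse_v2(line):
    # The flaw is found later; because the raw lines were stored as-is,
    # history can be recast with the corrected parser.
    date, user, status = line.split("|")
    return {"date": date, "user": user, "status": int(status)}

recast = [parse_v2(line) for line in raw_log]
print(recast[0])  # {'date': '2014-01-01', 'user': 'alice', 'status': 200}
```

Had only the v1 output been stored, the status codes would be gone for good; cheap as-is storage is what makes the recast possible.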
26. #5 Hadoop: The Data OS
• Hadoop has evolved
• It’s not just about MapReduce anymore
• YARN has enabled a new generation of applications to run within Hadoop
27. YARN
• Yet Another Resource Negotiator
• YARN + HDFS
– Shareable cluster resources
– A shared distributed file system
• What we get:
– Real time engines
– Distributed databases
– ETL applications
28. And why does this matter?
• Hadoop isn’t just batch anymore
• Real time workloads
• Interactive queries
• Machine Learning
• Machine to Machine communication
29. Some challenges
• Data governance
• Sophistication of tools
– Ease of use
– Resources
• Culture
30. Some tips
• Don’t throw away your data warehouse, unless it sucks…
• Don’t forget governance
• Establish a Data Science Factory to produce business insights
31. The most important tip
Polyglot Persistence – “where any decent sized
enterprise will have a variety of different data
storage technologies for different kinds of data.
There will still be large amounts of it managed in
relational stores, but increasingly we'll be first asking
how we want to manipulate the data and only then
figuring out what technology is the best bet for it.”
-- Martin Fowler
32. Thank You
Elliott Cordo
Chief Architect, Caserta Concepts
elliott@casertaconcepts.com