5 Things that Make Hadoop a Game Changer
Webinar by Elliott Cordo, Caserta Concepts
There is much hype and mystery surrounding Hadoop's role in analytic architecture. In this webinar, Elliott presented, in detail, the services and concepts that make Hadoop a truly unique solution - a game changer for the enterprise. He talked about the real benefits of a distributed file system, the multi-workload processing capabilities enabled by YARN, and the 3 other important things you need to know about Hadoop.
To access the recorded webinar, visit the event site: https://www.brighttalk.com/webcast/9061/131029
For more information on the services and solutions that Caserta Concepts offers, please visit http://casertaconcepts.com/
2. And about me
• 15 years of data management
• 10 years devoted to data engineering
• I’m still a nerd, not an academic
• Startups to the Fortune 100
Elliott Cordo
Chief Architect, Caserta Concepts
elliott@casertaconcepts.com
3. Caserta Concepts
Technology services company with focused expertise in:
Data Warehousing
Business Intelligence
Big Data Analytics
Data Science & Analytics
Data on the Cloud
Data Interaction & Visualization
Data is all we do.
Established in 2001
Industry-recognized workforce
Strategy, Assessments, Implementation
Writing, Education, Mentoring
Broad experience across industries:
Financial Services / Insurance / Services
eCommerce / Advertising / Higher Education / Healthcare
6. Caserta Innovation Lab (CIL)
• Internal laboratory established to test & develop solution concepts and ideas
• Used to accelerate client projects
• Examples:
• Search (SOLR) based BI
• Big Data Governance Toolkit
• Text Analytics on Social Network Data
• Continuous Integration / End-to-end streaming
• Recommendation Engine Optimization
• Others (confidential)
• CIL is hosted on
7. What is “a Hadoop”?
• A distributed file system
– Like any file system – any data type can be stored
• A distributed processing framework
– Processing is spread across a cluster of machines
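The two halves above come together in MapReduce, the processing model Hadoop popularized. A minimal sketch of that model in plain Python - simulating the shuffle step the framework normally performs across the cluster; this is not the Hadoop API itself:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce phase: sum the counts collected for one key.
    return word, sum(counts)

def run_job(lines):
    # The framework's shuffle step, simulated: group mapper output by key,
    # then hand each key's values to a reducer.
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)
    return dict(reducer(w, c) for w, c in groups.items())

print(run_job(["big data is big", "data is everywhere"]))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster the mapper and reducer run on many machines in parallel and the shuffle moves data between them; the programming model, however, is exactly this small.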
8. Born in the server rooms of tech giants
• They were the first guys to deal with this BIG DATA stuff
– GB turned to TB turned to PB
– Technologies weren’t scalable enough
– New workloads
– New data types
• Now many of us have to deal with it too!
9. What is “a Big Data”?
4V’s
Let's keep it simple:
• Data is big – measured in multiple TB or PB
• Data isn’t very relational:
– JSON
– Log files
– Raw text
– Various binary formats
10. Some misconceptions about Hadoop
• Data Lake – just dump your data in there and everything will be fine.
• You have to be a Java programmer
• You don’t have to be a programmer
• SQL is no good (actually, SQL is really good!)
• Hadoop is the solution for everything
12. #1 Low-cost computing and data management
• Before Hadoop we only had a few tools in our quiver for dealing with massive datasets:
– Scaling up to ginormous servers and storage systems (via SANs and the like)
– MPP (Massively Parallel Processing) databases
13. Now you no longer need $$$ to be competitive
• Store anything, for as long as needed!
• Run massive query and computing workloads
• Leverage commodity servers or cloud
• Enablement for startups and innovators
14. And about those MPPs
• I’m an MPP proponent and often use them in conjunction with large Hadoop implementations – as storage for “hot data”
• However, they are EXPENSIVE!
– Orders of magnitude more expensive than Hadoop
– Often use proprietary hardware
• …and they are limited:
– Not generally as scalable as Hadoop
– Only good at structured data
15. #2 Promote agile data culture
Enable your data analysts and data scientists to thrive:
• Onboard data fast
• Blend with governed data
• Produce business insights
16. The Traditional Way
[Diagram: “Cool new data” must pass through Data Governance, Data Integration, and Modeling before entering the Enterprise Data Warehouse]
• Data warehouse is central/protected/highly governed
• Rigorous SDLC and governance
• A lot of work just to validate this data is valuable?
17. The Data Lake
Only a subset of the data in Hadoop will be fully governed. We refer to this as the Trusted layer of the Big Data Warehouse.
[Diagram: layers of the Big Data Warehouse]
• Landing Area – source data in “full fidelity”; raw machine data collection, collect everything; limited barriers to ingest
• Staging – data is ready to be turned into information: organized, well defined, complete
• Big Data Warehouse – fully data governed; user community runs arbitrary queries and reporting
• Data Science Workspace – agile business insight through data munging, machine learning, blending with external data, development of facts
• Cross-cutting for each layer: Metadata Catalog; ILM (who has access, how long do we “manage it”); Data Quality and Monitoring (monitoring of completeness of data)
18. The Data Science Factory
• Data scientists have access to ALL layers of the data lake
• New data is easy to onboard – the only requirements are basic metadata and ILM
• The products of data science may evolve into new data products, including new data warehouse events (facts) and reference data (dimensions)
[Diagram: “Cool new data” enters the Landing Area and moves up through Staging, the Data Science Workspace, and the BDW, yielding New Insights]
19. #3 Hadoop + Cloud = elasticity
• Pay for the computing you need
• Scale elastically and quickly
20. Unique opportunities
• Cloud storage is cheap and dependable
– Archiving services are even cheaper
• Pay only for the data processing you need
• Test new processes quickly and inexpensively
• Optimize operations and scale rapidly
21. Ephemeral data processing
Why have servers sitting around all day doing nothing?
• Build a cluster in one command!
• Bootstrap-install anything special you need:
– Spark, Impala, Mahout
• Run the jobs you need to
• Tear it down
aws emr create-cluster --applications Name=Pig --ami-version 3.2.1 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
  --steps Type=PIG,Name="Pig Program",ActionOnFailure=CONTINUE,Args=[-f,s3://elasticmapreduce/samples/pig-apache/do-reports2.pig,-p,INPUT=s3://elasticmapreduce/samples/pig-apache/input,-p,OUTPUT=s3://mybucket/pig-apache/output]
22. #4 No rigid data structures
• As a distributed file system, Hadoop can store anything.
• Structure isn’t required, unless you want it…
• Remove barriers for new data ingest
• Keep data in its full fidelity
23. Embrace semi-structured data
• Logs are ubiquitous
• JSON is pretty darn awesome
• VERY difficult to model in the relational world
– Extreme denormalization/brittle data structures
– Alternative: fidelity loss
• Plenty of tools for dealing with these formats (“late bind”)
– Hive SerDes
– Pig
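Hive SerDes and Pig both apply structure when the data is read, not when it is written. A minimal Python sketch of that “late bind” idea - the event fields and schema here are hypothetical, invented for illustration:

```python
import json

# Raw events stored exactly as they arrived -- no schema imposed at write time.
raw_events = [
    '{"user": "alice", "action": "click", "target": "home"}',
    '{"user": "bob", "action": "search", "query": "hadoop"}',
]

def read_with_schema(lines, fields):
    # "Late bind": the schema (a list of field names) is applied only when
    # the data is read, so new or missing fields never break ingestion.
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(f) for f in fields)

# Two consumers can project different schemas over the same raw data.
actions = list(read_with_schema(raw_events, ["user", "action"]))
print(actions)  # [('alice', 'click'), ('bob', 'search')]
```

Note that the first event has no "query" field and the second has no "target" field, yet both ingest fine - the relational alternative would be either brittle wide tables or throwing those fields away.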
24. Embrace unstructured data
• Raw text
• PDF
• Various binary data formats
• Custom engineered processes, machine
learning
25. Why is it important that Hadoop lets you store as-is?
• Data that is not important today could be important tomorrow
• What if your processes are flawed, and you need to recast?
• Versioning as the data evolves
• This is a traditional data management best practice – it just hasn’t always been practical
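A tiny sketch of the “recast” point, assuming hypothetical pipe-delimited log lines: because the raw data was kept in full fidelity, a flawed first parse can be corrected later and all of history rebuilt.

```python
# Raw log lines kept in full fidelity, exactly as collected.
raw_log = ["2014-01-01|alice|200", "2014-01-02|bob|404"]

def parse_v1(line):
    # First attempt: keeps only the user -- date and status are discarded.
    _, user, _ = line.split("|")
    return {"user": user}

def parse_v2(line):
    # The flaw is found later; because the raw lines were stored as-is,
    # history can be recast with the corrected parser.
    date, user, status = line.split("|")
    return {"date": date, "user": user, "status": int(status)}

recast = [parse_v2(line) for line in raw_log]
print(recast[0])  # {'date': '2014-01-01', 'user': 'alice', 'status': 200}
```

Had only the v1 output been stored, the status codes would be gone for good; cheap as-is storage is what makes the recast possible.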
26. #5 Hadoop: The Data OS
• Hadoop has evolved
• It’s not just about MapReduce anymore
• YARN has enabled a new generation of applications to run within Hadoop
27. YARN
• Yet Another Resource Negotiator
• YARN + HDFS
– Shareable cluster resources
– A shared distributed file system
• What we get:
– Real time engines
– Distributed databases
– ETL applications
28. And why does this matter?
• Hadoop isn’t just batch anymore
• Real time workloads
• Interactive queries
• Machine Learning
• Machine to Machine communication
29. Some challenges
• Data governance
• Sophistication of tools
– Ease of use
– Resources
• Culture
30. Some tips
• Don’t throw away your data warehouse, unless it sucks…
• Don’t forget governance
• Establish a Data Science Factory to produce business insights
31. The most important tip
Polyglot Persistence – “where any decent sized
enterprise will have a variety of different data
storage technologies for different kinds of data.
There will still be large amounts of it managed in
relational stores, but increasingly we'll be first asking
how we want to manipulate the data and only then
figuring out what technology is the best bet for it.”
-- Martin Fowler
32. Thank You
Elliott Cordo
Chief Architect, Caserta Concepts
elliott@casertaconcepts.com