In our most recent Big Data Warehousing Meetup, we learned about the transition from Big Data 1.0, built on Hadoop 1.x and its nascent technologies, to Hadoop 2.x with YARN, which enables distributed ETL, SQL, and analytics solutions. Caserta Concepts Chief Architect Elliott Cordo and an Actian engineer covered the complete data value chain of an enterprise-ready platform, including data connectivity, collection, preparation, optimization, and analytics with end-user access.
For more information on our services or upcoming events, please visit our website at http://www.casertaconcepts.com/.
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
1. Big Data Warehousing
January 20, 2014
Sponsored By:
Today's Topic: Big Data 2.0: YARN
Distributed ETL & SQL with Hadoop
2. Agenda
7:00
Networking (15 min)
Grab some food and a drink... Make some friends.
7:15
Welcome + Intro
President, Caserta Concepts
7:30
Joe Caserta (15 min)
About the Meetup, about Caserta Concepts
Elliott Cordo (20 min)
Hadoop 2.0: The Evolution of
Hadoop, SQL, and NoSQL
Chief Architect, Caserta Concepts
7:50
Paul Dingman (20 min)
Chief Technologist, Actian Innovation Lab
Using Actian to process data in
Hadoop
The latest features of Actian to enable maximum
throughput
8:10
Tyler Mitchell (35 min)
See how it works!
Senior Engineer, Actian Innovation Lab
8:45
Q&A, More Networking (15 min)
Tell us what you're up to…
3. About the BDW Meetup
• Big Data is a complex, rapidly changing landscape
• We want to share our stories and hear about yours
• Great networking opportunity for like-minded data nerds
• Opportunities to collaborate on exciting projects
• Founded by Caserta Concepts, DW, BI & Big Data Analytics Consulting
• Next BDW Meetup: February 10, 2014
• Data Governance on Big Data with Cloudera
4. About Caserta Concepts
Focused Expertise
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Strategic Data Ecosystems
Industries Served
• Financial Services
• Healthcare / Insurance
• Retail / eCommerce
• Digital Media / Marketing
• K-12 / Higher Education
Founded in 2001
• President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
5. Implementation Expertise & Offerings
• Strategic Roadmap / Assessment / Consulting
• Big Data Analytics
• Storm
• Database
• BI / Visualization / Analytics
• Master Data Management
8. Caserta Concepts
Listed as a Top 20 Most Promising
Data Analytics Consulting Company
CIOReview looked at hundreds of data analytics consulting companies and shortlisted the ones at the forefront of tackling real analytics challenges. A distinguished panel comprising CEOs, CIOs, VCs, industry analysts, and the editorial board of CIOReview selected the final 20.
10. BIG DATA 2.0, EVOLUTION OF HADOOP,
SQL, AND NOSQL
Elliott Cordo
Chief Architect, Caserta Concepts
11. Hadoop 1.0
WHAT DID WE ACHIEVE
• Established Hadoop's place in analytic architecture
• Realized cheap, reliable, scalable storage and processing
• Made us more data driven
• Store and process anything → new data types, structured, unstructured
• New types of analysis, including machine learning → Mahout
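The "store and process anything" model rests on the map/shuffle/reduce pattern. A minimal Python sketch of that pattern (word count, the canonical Hadoop example; on a real cluster each phase runs in parallel across machines):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) for every word, regardless of input structure
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big warehouse", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'warehouse': 1}
```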
12. What did this mean to the Big Data Warehouse
• Extending the Data Warehouse
• Establish new facts and "projections" in Hadoop on unstructured and high-volume data sources
• Hive, Impala
• Datameer
• BIG ETL → using MapReduce pipelines to process massive amounts of data
• Using our favorite ELT tool, Pig
• Data storage for staging
• Reducing the costs and increasing the performance of our EDW
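The BIG ETL flow a Pig script expresses is essentially LOAD → FILTER → GROUP → aggregate. A hedged sketch of that flow in plain Python (the data and field names are hypothetical; on a cluster, Pig compiles the same steps into MapReduce jobs):

```python
# Pig-style ETL flow (LOAD -> FILTER -> GROUP -> aggregate) in plain Python.
from collections import defaultdict

raw = [  # hypothetical staged clickstream rows: (user, page, seconds)
    ("alice", "/home", 12),
    ("bob", "/checkout", 95),
    ("alice", "/checkout", 60),
    ("carol", "/home", 3),
]

# FILTER: keep only sessions longer than 10 seconds
filtered = [row for row in raw if row[2] > 10]

# GROUP BY page, then aggregate (SUM of seconds per page)
totals = defaultdict(int)
for user, page, seconds in filtered:
    totals[page] += seconds

print(dict(totals))  # {'/home': 12, '/checkout': 155}
```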
13. Where did it fall short
• Pretty much only MapReduce
• Batch oriented → not tuned for real-time or interactive processes → look at what was achieved with Impala side-stepping MR for SQL queries on Hadoop
• Hive performance made users sad
• Legacy vendors were slow to adapt due to the massive paradigm shift required in their product architecture.
14. Hadoop 2.0 - what is the big deal
• YARN → "Yet Another Resource Negotiator"
• The JobTracker and TaskTracker have been split up
• Increased scalability
• MapReduce removed from the core architecture
• Now there is a
• Global ResourceManager
• Per-application ApplicationMaster → MapReduce will have its own
• Per-node slave NodeManager (with per-application containers)
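The split above can be sketched as a toy model: a global ResourceManager granting containers from NodeManagers, and a per-application master negotiating for them. All class and method names here are illustrative, not the real YARN API:

```python
# Toy model of the YARN split: one global ResourceManager, one
# ApplicationMaster per job, NodeManagers hosting containers.
class NodeManager:
    def __init__(self, name, slots):
        self.name, self.free = name, slots

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, n_containers):
        # Grant containers wherever nodes still have capacity
        granted = []
        for node in self.nodes:
            while node.free and len(granted) < n_containers:
                node.free -= 1
                granted.append(node.name)
        return granted

class ApplicationMaster:
    """Per-application master: negotiates with the RM for its own job
    (MapReduce is just one such application under YARN)."""
    def __init__(self, rm):
        self.rm = rm

    def run(self, tasks):
        containers = self.rm.allocate(len(tasks))
        return list(zip(containers, tasks))

rm = ResourceManager([NodeManager("node1", 2), NodeManager("node2", 2)])
am = ApplicationMaster(rm)
print(am.run(["map-0", "map-1", "reduce-0"]))
```

The point of the model: the ResourceManager knows nothing about MapReduce; any framework can bring its own ApplicationMaster.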
15. YARN - why is it significant
• Provides a management layer between applications and Hadoop
• These applications could still be MapReduce
• Or all sorts of applications such as streaming, ETL engines, and new database engines, all running NATIVELY in Hadoop!
• These applications can access HDFS and are safely contained by cluster resource limits.
• First-generation Impala ran OUTSIDE of Hadoop's resource management and competed for cluster resources
• More intelligent use of cluster resources → not just slots… more productivity out of the same hardware.
16. Why is it important that we are moving beyond MapReduce?
• MapReduce is a generalized computing framework
• A query engine, for instance, can benefit from a "non-generalized" pattern; the full flexibility isn't needed
• In-memory / disk data access
• Index usage
• Serialization
• Shuffling / data movement
→ again, look at the Impala approach
• MapReduce is not well suited for other tasks such as real-time stream processing, iterative machine learning, and graph processing
17. ETL can benefit from this approach too!
• ETL has a broader scope than query engines, but gains can still be made from a purpose-built processing framework
• Batch is not the only way! Streaming apps can now interact with HDFS and be managed within cluster resources
• Storm
• Spark
• Existing assets: SIGNIFICANT existing IP can be more easily leveraged from both open source and commercial software
18. Back to query engines
MPP: Massively Parallel Processing - scalable, distributed processing engines.
• Typically the underlying storage is columnar in nature (performance, compression, easier to distribute data)
• Present themselves relationally and handle all the brutal work of aggregation and joins → ANSI-compliant SQL
• Impala, HAWQ… the industry is really just taking the approach of building MPPs on Hadoop
• Columnar storage: ORC, Parquet, proprietary formats
• Advanced query optimizations
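Why columnar layout helps: an aggregate over one field only has to scan that field's contiguous values, instead of visiting every row. A minimal sketch of the two layouts (formats like ORC and Parquet take this further with per-column compression and encoding):

```python
# Row vs. columnar layout for the same table.
rows = [
    {"region": "east", "sales": 100},
    {"region": "west", "sales": 250},
    {"region": "east", "sales": 175},
]

# Row layout: every row object is visited just to read one field
row_total = sum(r["sales"] for r in rows)

# Columnar layout: each column is a contiguous array
columns = {
    "region": ["east", "west", "east"],
    "sales": [100, 250, 175],
}
col_total = sum(columns["sales"])  # only the 'sales' column is scanned

assert row_total == col_total == 525
```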
19. MPPs leveraging dedicated storage
• Modern MPPs like Actian's Matrix are also taking advantage of Hadoop
• Building tight integrations with Hadoop infrastructure → On-Demand Integration
• Developing tools and frameworks that leverage YARN for heavy lifting → ETL
20. So, about NoSQL
• In "Big Data 1.0," NoSQL found its place as a mainstream analytic store:
• Cassandra
• HBase
• Redis
• Riak
• They gave us raw, unbeatable performance for handling real-time analytic workloads
21. NoSQL use cases - BDW
• Highly scalable and flexible staging and ODS layers
• High-performance analytic store → real-time data analytic systems
• Recommendation and customer profile data → web-facing performance characteristics, flexible schema
• BIG ETL components → reference data lookup cache, stream joins
22. 2.0 NoSQL evolutions
SQL!!!
• Easier adoption
• Standardizing interfaces
→ Cassandra CQL3
→ Phoenix on HBase
Evolving
• Greater flexibility on in-memory / disk persistence
• In-memory will also likely usher in more flexibility for server-side processing: MapReduce, aggregation, joins
• Analytic support
23. So… in conclusion
HADOOP IS THE NEW DATA OS?
What we have:
• A distributed file system
• A robust multitenant resource manager
• A generalized framework for distributed computing and data processing
Even greater mainstream adoption of NoSQL
SQL rules!