In our most recent Big Data Warehousing Meetup, we learned about the transition from Big Data 1.0, built on Hadoop 1.x and its nascent technologies, to Hadoop 2.x with YARN, which enables distributed ETL, SQL, and analytics solutions. Caserta Concepts Chief Architect Elliott Cordo and an Actian engineer covered the complete data value chain of an enterprise-ready platform, including data connectivity, collection, preparation, optimization, and analytics with end-user access.
For more information on our services or upcoming events, please visit our website at http://www.casertaconcepts.com/.
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
1. Big Data Warehousing
January 20, 2014
Sponsored By:
Today's Topic: Big Data 2.0: YARN
Distributed ETL & SQL with Hadoop
2. Agenda
7:00
Networking (15 min)
Grab some food and a drink... Make some friends.
7:15
Welcome + Intro
President, Caserta Concepts
7:30
Joe Caserta (15 min)
About the Meetup, about Caserta Concepts
Elliott Cordo (20 min)
Hadoop 2.0: The Evolution of
Hadoop, SQL, and NoSQL
Chief Architect, Caserta Concepts
7:50
Paul Dingman (20 min)
Chief Technologist, Actian Innovation Lab
Using Actian to process data in
Hadoop
The latest features of Actian to enable maximum
throughput
8:10
Tyler Mitchell (35 min)
See how it works!
Senior Engineer, Actian Innovation Lab
8:45
Q&A, More Networking (15 min)
Tell us what you're up to…
3. About the BDW Meetup
• Big Data is a complex, rapidly changing landscape
• We want to share our stories and hear about yours
• Great networking opportunity for like-minded data nerds
• Opportunities to collaborate on exciting projects
• Founded by Caserta Concepts, DW, BI & Big Data Analytics Consulting
• Next BDW Meetup: February 10, 2014
• Data Governance on Big Data with Cloudera
4. About Caserta Concepts
Focused Expertise
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Strategic Data Ecosystems
Industries Served
• Financial Services
• Healthcare / Insurance
• Retail / eCommerce
• Digital Media / Marketing
• K-12 / Higher Education
Founded in 2001
• President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
5. Implementation Expertise & Offerings
• Strategic Roadmap / Assessment / Consulting
• Big Data Analytics
• Storm
• Database
• BI / Visualization / Analytics
• Master Data Management
8. Caserta Concepts
Listed as a Top 20 Most Promising
Data Analytics Consulting Company
CIOReview looked at hundreds of data analytics consulting companies and shortlisted the ones at the forefront of tackling real analytics challenges. A distinguished panel comprising CEOs, CIOs, VCs, industry analysts, and the editorial board of CIOReview selected the final 20.
10. BIG DATA 2.0, EVOLUTION OF HADOOP,
SQL, AND NOSQL
Elliott Cordo
Chief Architect, Caserta Concepts
11. Hadoop 1.0
WHAT DID WE ACHIEVE
• Established Hadoop's place in analytic architecture
• Realized cheap, reliable, scalable storage and processing
• Made us more data driven
• Store and process anything → new data types, structured, unstructured
• New types of analysis, including machine learning → Mahout
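The "store and process anything" model rests on the map/shuffle/reduce pattern. A minimal Python sketch of that pattern (word count, the canonical Hadoop example; on a real cluster each phase runs in parallel across machines):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) for every word, regardless of input structure
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big warehouse", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'warehouse': 1}
```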
12. What did this mean to the Big Data Warehouse
• Extending the Data Warehouse
• Establish new facts and "projections" in Hadoop on unstructured and high-volume data sources
• Hive, Impala
• Datameer
• BIG ETL → using MapReduce pipelines to process massive amounts of data
• Using our favorite ELT tool, Pig
• Data storage for staging
• Reducing the costs and increasing the performance of our EDW
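The BIG ETL flow a Pig script expresses is essentially LOAD → FILTER → GROUP → aggregate. A hedged sketch of that flow in plain Python (the data and field names are hypothetical; on a cluster, Pig compiles the same steps into MapReduce jobs):

```python
# Pig-style ETL flow (LOAD -> FILTER -> GROUP -> aggregate) in plain Python.
from collections import defaultdict

raw = [  # hypothetical staged clickstream rows: (user, page, seconds)
    ("alice", "/home", 12),
    ("bob", "/checkout", 95),
    ("alice", "/checkout", 60),
    ("carol", "/home", 3),
]

# FILTER: keep only sessions longer than 10 seconds
filtered = [row for row in raw if row[2] > 10]

# GROUP BY page, then aggregate (SUM of seconds per page)
totals = defaultdict(int)
for user, page, seconds in filtered:
    totals[page] += seconds

print(dict(totals))  # {'/home': 12, '/checkout': 155}
```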
13. Where did it fall short
• Pretty much only MapReduce
• Batch oriented → not tuned for real-time or interactive processes → look at what was achieved with Impala side-stepping MR for SQL queries on Hadoop
• Hive performance made users sad
• Legacy vendors were slow to adapt due to the massive paradigm shift required in their product architecture.
14. Hadoop 2.0 - what is the big deal
• YARN → "Yet Another Resource Negotiator"
• The JobTracker and TaskTracker have been split up
• Increased scalability
• MapReduce removed from the core architecture
• Now there is a
• Global ResourceManager
• Per-application ApplicationMaster → MapReduce will have its own
• Per-node slave NodeManager (with per-application containers)
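The split above can be sketched as a toy model: a global ResourceManager granting containers from NodeManagers, and a per-application master negotiating for them. All class and method names here are illustrative, not the real YARN API:

```python
# Toy model of the YARN split: one global ResourceManager, one
# ApplicationMaster per job, NodeManagers hosting containers.
class NodeManager:
    def __init__(self, name, slots):
        self.name, self.free = name, slots

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, n_containers):
        # Grant containers wherever nodes still have capacity
        granted = []
        for node in self.nodes:
            while node.free and len(granted) < n_containers:
                node.free -= 1
                granted.append(node.name)
        return granted

class ApplicationMaster:
    """Per-application master: negotiates with the RM for its own job
    (MapReduce is just one such application under YARN)."""
    def __init__(self, rm):
        self.rm = rm

    def run(self, tasks):
        containers = self.rm.allocate(len(tasks))
        return list(zip(containers, tasks))

rm = ResourceManager([NodeManager("node1", 2), NodeManager("node2", 2)])
am = ApplicationMaster(rm)
print(am.run(["map-0", "map-1", "reduce-0"]))
```

The point of the model: the ResourceManager knows nothing about MapReduce; any framework can bring its own ApplicationMaster.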
15. YARN - why is it significant
• Provides a management layer between applications and Hadoop
• These applications could still be MapReduce
• Or all sorts of applications such as streaming, ETL engines, and new database engines, all running NATIVELY in Hadoop!
• These applications can access HDFS and are safely contained by cluster resource limits.
• First-generation Impala ran OUTSIDE of Hadoop's resource management and competed for cluster resources
• More intelligent use of cluster resources → not just slots… more productivity out of the same hardware.
16. Why is it important that we are moving beyond MapReduce?
• MapReduce is a generalized computing framework
• A query engine, for instance, can benefit from a "non-generalized" pattern; the full flexibility isn't needed
• In-memory / disk data access
• Index usage
• Serialization
• Shuffling / data movement
→ again, look at the Impala approach
• MapReduce is not well suited for other tasks such as real-time stream processing, iterative machine learning, and graph processing
17. ETL can benefit from this approach too!
• ETL has a broader scope than query engines, but gains can still be made from a purpose-built processing framework
• Batch is not the only way! Streaming apps can now interact with HDFS and be managed within cluster resources
• Storm
• Spark
• Existing assets: SIGNIFICANT existing IP can be more easily leveraged from both open source and commercial software
18. Back to query engines
MPP: Massively Parallel Processing - scalable, distributed processing engines.
• Typically the underlying storage is columnar in nature (performance, compression, easier to distribute data)
• Present themselves relationally and handle all the brutal work of aggregation and joins → ANSI-compliant SQL
• Impala, HAWQ… the industry is really just taking the approach of building MPPs on Hadoop
• Columnar storage: ORC, Parquet, proprietary formats
• Advanced query optimizations
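Why columnar layout helps: an aggregate over one field only has to scan that field's contiguous values, instead of visiting every row. A minimal sketch of the two layouts (formats like ORC and Parquet take this further with per-column compression and encoding):

```python
# Row vs. columnar layout for the same table.
rows = [
    {"region": "east", "sales": 100},
    {"region": "west", "sales": 250},
    {"region": "east", "sales": 175},
]

# Row layout: every row object is visited just to read one field
row_total = sum(r["sales"] for r in rows)

# Columnar layout: each column is a contiguous array
columns = {
    "region": ["east", "west", "east"],
    "sales": [100, 250, 175],
}
col_total = sum(columns["sales"])  # only the 'sales' column is scanned

assert row_total == col_total == 525
```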
19. MPPs leveraging dedicated storage
• Modern MPPs like Actian's Matrix are also taking advantage of Hadoop
• Building tight integrations with Hadoop infrastructure → On-Demand Integration
• Developing tools and frameworks that leverage YARN for heavy lifting → ETL
20. So, about NoSQL
• In "Big Data 1.0," NoSQL found its place as a mainstream analytic store:
• Cassandra
• HBase
• Redis
• Riak
• They gave us raw, unbeatable performance for handling real-time analytic workloads
21. NoSQL use cases - BDW
• Highly scalable and flexible staging and ODS layers
• High-performance analytic store → real-time data analytic systems
• Recommendation and customer profile data → web-facing performance characteristics, flexible schema
• BIG ETL components → reference data lookup cache, stream joins
22. 2.0 NoSQL evolutions
SQL!!!
• Easier adoption
• Standardizing interfaces
→ Cassandra CQL3
→ Phoenix on HBase
Evolving
• Greater flexibility on in-memory / disk persistence
• In-memory will also likely usher in more flexibility for server-side processing: MapReduce, aggregation, joins
• Analytic support
23. So… in conclusion
HADOOP IS THE NEW DATA OS?
What we have:
• A distributed file system
• A robust multitenant resource manager
• A generalized framework for distributed computing and data processing
Even greater mainstream adoption of NoSQL
SQL rules!