Architectural Considerations for
Data Warehousing with Hadoop
Prepared by:
Mark Grover, Software Engineer
Jonathan Seidman, Solutions Architect
Cloudera, Inc.
github.com/hadooparchitecturebook/hadoop-arch-book/tree/master/ch11-data-warehousing
Session ID#: 10251
@mark_grover
@jseidman
About Us
■ Mark
▪ Software Engineer at
Cloudera
▪ Committer on Apache
Bigtop, PMC member on
Apache Sentry
(incubating)
▪ Contributor to Apache
Hadoop, Spark, Hive,
Sqoop, Pig and Flume
■ Jonathan
▪ Senior Solutions
Architect/Partner
Engineering at Cloudera
▪ Previously, Technical Lead
on the big data team at
Orbitz Worldwide
▪ Co-founder of the Chicago
Hadoop User Group and
Chicago Big Data
About the Book
■ @hadooparchbook
■ hadooparchitecturebook.com
■ github.com/hadooparchitecturebook
■ slideshare.com/hadooparchbook
Agenda
■ Typical data warehouse architecture.
■ Challenges with the existing data warehouse architecture.
■ How Hadoop complements an existing data warehouse
architecture.
■ (Very) Brief intro to Hadoop.
■ Example use case.
■ Walkthrough of example use case implementation.
Typical Data Warehouse
Architecture
Example High Level Data Warehouse
Architecture
Extract
Data
Staging
Area
Operational
Source
Systems
Load
Data
Warehouse
Data
Analysis/Visualization Tools
Transformations
Challenges with the Data
Warehouse Architecture
Challenge – ETL/ELT Processing
OLTP
Enterprise
Applications
ODS
Data Warehouse
Query
Extract
Transform
Load
Business
Intelligence
Transform
1 Slow Data Transformations = Missed ETL SLAs.
2 Slow Queries = Frustrated Business Users.
Challenges – Data Archiving
Data
Warehouse
Tape
Archive
■ Full-fidelity data only kept for a short duration
■ Expensive or sometimes impossible to look at historical raw
data
Challenge – Disparate Data Sources
Data
Warehouse
■ How do you join data from disparate sources with EDW?
Business
Intelligence
???
Challenge – Lack of Agility
■ Responding to changing requirements, mistakes, etc.
requires lengthy processes.
Challenge – Exploratory Analysis in the
EDW
■ Difficult for users to do exploratory analysis of data in the data
warehouse.
Business
Users
Developers Analysts
Data
Warehouse
Complementing the EDW with
Hadoop
Data Warehouse Architecture with Hadoop
Extract
Hadoop
Operational
Source
Systems
EDW
BI/Analytics Tools
Logs,
machine
data, etc.
Extract
Transformation/Analysis
Load
Hadoop
ETL/ELT Optimization with Hadoop
OLTP
Enterprise
Applications
ODS
Business
Intelligence
Transform
Query
Store
ETL
Data Warehouse
Query
(High $/Byte)
Active Archiving with Hadoop
Data
Warehouse
Hadoop
Joining Disparate Data Sources with
Hadoop
Data
Warehouse
Business
Intelligence
Hadoop
Agile Data Access with Hadoop
Schema-on-Write (RDBMS):
• Prescriptive Data Modeling:
• Create static DB schema
• Transform data into RDBMS
• Query data in RDBMS format
• New columns must be added
explicitly before new data can
propagate into the system.
• Good for Known Unknowns
(Repetition)
Schema-on-Read (Hadoop):
• Descriptive Data Modeling:
• Copy data in its native format
• Create schema + parser
• Query Data in its native format
• New data can start flowing any time
and will appear retroactively once the
schema/parser properly describes it.
• Good for Unknown Unknowns
(Exploration)
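A minimal sketch of the schema-on-read idea in Python, assuming raw records are kept as pipe-delimited text. The `parse` helper and the two schema lists are hypothetical names used only for illustration. The point is that a column described by a later schema becomes visible for data that was already stored, without reloading anything.

```python
# Raw records are stored untouched, in their native pipe-delimited form.
RAW_LINES = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
]

def parse(line, schema):
    """Apply a schema at read time: pair each column name with its field."""
    fields = line.split("|")
    # zip stops at the shorter sequence, so undescribed fields are
    # simply not surfaced; they remain in the raw data.
    return dict(zip(schema, fields))

# The first schema describes only three columns.
schema_v1 = ["user_id", "age", "gender"]
# A later schema describes more of the same raw data.
schema_v2 = ["user_id", "age", "gender", "occupation", "zip_code"]

rows_v1 = [parse(line, schema_v1) for line in RAW_LINES]
rows_v2 = [parse(line, schema_v2) for line in RAW_LINES]

print(rows_v1[0])
print(rows_v2[0]["occupation"])
```

In a schema-on-write system the equivalent change would require altering the table and reloading or backfilling the new column.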
Exploratory Analysis with Hadoop
Hadoop
Business
Users
Developers Analysts
Data
Warehouse
A Very Brief Intro to Hadoop
What is Apache Hadoop?
■ Has the Flexibility to Store and Mine Any Type of Data
▪ Ask questions across structured and unstructured data that were previously
impossible to ask or solve
▪ Not bound by a single schema
■ Excels at Processing Complex Data
▪ Scale-out architecture divides workloads across multiple nodes
▪ Flexible file system eliminates ETL bottlenecks
■ Scales Economically
▪ Can be deployed on commodity hardware
▪ Open source platform guards against vendor lock-in
Hadoop
Distributed File
System (HDFS)
Self-Healing, High
Bandwidth Clustered
Storage
Parallel
Processing
(MapReduce,
Spark, Impala,
etc.)
Distributed Computing
Frameworks
Apache Hadoop is an open source
platform for data storage and
processing that is…
▪ Scalable
▪ Fault tolerant
▪ Distributed
CORE HADOOP SYSTEM COMPONENTS
Oracle Big Data Appliance
■ All of the capabilities we’re talking about here are available as
part of the Oracle BDA.
Challenges of Hadoop Implementation
Other Challenges – Architectural
Considerations
Data
Sources
Ingestion
Raw Data
Storage
(Formats,
Schema)
Processed
Data
Storage
(Formats,
Schema)
Processing
Data
Consumption
Orchestration
(Scheduling,
Managing,
Monitoring)
Metadata
Management
Hadoop Third Party Ecosystem
Data
Systems
Applications
Infrastructure
Operational
Tools
Walkthrough of Example Use
Case
Use-case
■ Movielens dataset
■ Users register by entering some demographic information
▪ Users can update demographic information later on
■ Rate movies
▪ Ratings can be updated later on
■ Auxiliary information about movies available
▪ e.g. release date, IMDB URL, etc.
Movielens data set
u.user
user id | age | gender | occupation | zip code
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
Movielens data set
u.item
movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
Movielens data set
u.data
user id | item id | rating | timestamp
196|242|3|881250949
186|302|3|891717742
22|377|1|878887116
244|51|2|880606923
166|346|1|886397596
OLTP schema
Movielens data set - OLTP
Data Modeling
Data Modeling Considerations
■ We need to consider the following in our architecture:
▪ Storage layer – HDFS? HBase? Etc.
▪ File system schemas – how will we lay out the data?
▪ File formats – what storage formats to use for our data, both raw
and processed data?
▪ Data compression formats?
■ Hadoop is not a database, so these considerations will be
different from an RDBMS.
Denormalization
■ Why denormalize?
■ When to denormalize?
■ How much to denormalize?
Why Denormalize?
■ Regular joins are expensive in Hadoop
■ When you have 2 data sets, there is no guarantee that
corresponding records will be present on the same node
■ Such a guarantee exists when storing such data in a single
data set
When to Denormalize?
■ Well, it’s difficult to say
■ It depends
Movielens Data Set - Denormalization
Denormalize
Data Set in Hadoop
Tracking Updates (CDC)
■ Can’t update data in-place in HDFS
■ HDFS is an append-only filesystem
■ We have to track all updates
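One way to picture this, as a hedged Python sketch: because records cannot be updated in place, every change is appended as a new record, and the current state is derived by keeping the most recent record per key. The record layout (`id`, `zipcode`, `last_modified`) and the function name `latest_per_key` are assumptions for illustration, not the deck's actual schema.

```python
def latest_per_key(history):
    """Collapse an append-only history to the newest record for each id."""
    current = {}
    for record in history:
        seen = current.get(record["id"])
        if seen is None or record["last_modified"] > seen["last_modified"]:
            current[record["id"]] = record
    return current

# Updates are never applied in place; each change is appended.
history = []
history.append({"id": 1, "zipcode": "85711", "last_modified": 100})
history.append({"id": 2, "zipcode": "94043", "last_modified": 105})
history.append({"id": 1, "zipcode": "60601", "last_modified": 230})  # user 1 moved

current = latest_per_key(history)
print(current[1]["zipcode"])
print(len(history), len(current))
```

The full history stays available for auditing or point-in-time queries, while the collapsed view serves queries that only need the latest state.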
Tracking Updates in Hadoop
Hadoop File Types
■ Formats designed specifically to store and process data on
Hadoop:
▪ File based – SequenceFile
▪ Serialization formats – Thrift, Protocol Buffers, Avro
▪ Columnar formats – RCFile, ORC, Parquet
Final Schema in Hadoop
Our Storage Format Recommendation
■ Columnar format (Parquet) for merged/compacted data sets
▪ user, user_rating, movie
■ Row format (Avro) for history/append-only data sets
▪ user_history, user_rating_fact
Ingestion
Ingestion – Apache Flume
Flume Agent
■ Sources: Twitter, logs, JMS, webserver, Kafka
■ Interceptors: Mask, re-format, validate…
■ Selectors: DR, critical
■ Channels: Memory, file, Kafka
■ Sinks: HDFS, HBase, Solr
Ingestion – Apache Kafka
Source System Source System Source System Source System
Hadoop
Security
Systems
Real-time
monitoring
Data
Warehouse
Kafka
Ingestion – Apache Sqoop
■ Apache project designed to ease import and export of data
between Hadoop and external data stores such as an
RDBMS.
■ Provides functionality to do bulk imports and exports of data.
■ Leverages MapReduce to transfer data in parallel.
Client Sqoop
MapReduce Map Map Map
Hadoop
Run import Collect metadata
Generate code,
Execute MR job
Pull data
Write to Hadoop
Sqoop Import Example – Movie
# Import each movie with its genre names concatenated, written as Avro files
sqoop import --connect \
  jdbc:mysql://mysql_server:3306/movielens \
  --username myuser --password mypass --query \
  'SELECT movie.*, group_concat(genre.name)
   FROM movie
   JOIN movie_genre ON (movie.id = movie_genre.movie_id)
   JOIN genre ON (movie_genre.genre_id = genre.id)
   WHERE $CONDITIONS
   GROUP BY movie.id' \
  --split-by movie.id --as-avrodatafile \
  --target-dir /data/movielens/movie
Data Processing
Popular Processing Engines
■ MapReduce
▪ Programming paradigm
■ Pig
▪ Workflow language based
■ Hive
▪ Batch SQL-engine
■ Impala
▪ Near real-time concurrent SQL engine
■ Spark
▪ DAG engine
Final Schema in Hadoop
Merge Updates
hive> INSERT OVERWRITE TABLE user_tmp
SELECT user.*
FROM user
LEFT OUTER JOIN user_upserts
  ON (user.id = user_upserts.id)
WHERE user_upserts.id IS NULL
UNION ALL
SELECT id, age, occupation, zipcode,
  TIMESTAMP(last_modified)
FROM user_upserts;
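The merge logic of this query can be sketched in Python to make its semantics explicit: keep every base row whose id has no upsert (the LEFT OUTER JOIN ... IS NULL branch), then append all upsert rows (the UNION ALL branch). The sample rows and the `merge_upserts` name are illustrative assumptions, and this says nothing about how Hive actually executes the query.

```python
def merge_upserts(base_rows, upsert_rows):
    """Return base rows without a matching upsert, followed by all upserts."""
    upsert_ids = {row["id"] for row in upsert_rows}
    untouched = [row for row in base_rows if row["id"] not in upsert_ids]
    return untouched + list(upsert_rows)

user = [
    {"id": 1, "age": 24, "occupation": "technician", "zipcode": "85711"},
    {"id": 2, "age": 53, "occupation": "other", "zipcode": "94043"},
]
user_upserts = [
    {"id": 2, "age": 54, "occupation": "other", "zipcode": "94043"},   # update
    {"id": 3, "age": 23, "occupation": "writer", "zipcode": "32067"},  # insert
]

user_tmp = merge_upserts(user, user_upserts)
print(sorted(row["id"] for row in user_tmp))
```

Writing the result to a separate table (here `user_tmp`) and swapping it in afterwards avoids reading and overwriting the same data set in one step.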
Aggregations
user_rating_fact
user_rating
user_history
movie
user
Merge updates
One record/user/movie
Merge updates
One record/user
avg_movie_rating
latest_trending_movies
Aggregations
hive> CREATE TABLE avg_movie_rating AS
SELECT
  movie_id,
  ROUND(AVG(rating), 1) AS rating
FROM user_rating
GROUP BY movie_id;
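For readers less familiar with SQL aggregation, the same grouping can be sketched in plain Python. The sample rows and the function name are hypothetical. Note that Python's `round` may break exact ties differently from Hive's `ROUND`, so this only illustrates the group-and-average step.

```python
from collections import defaultdict

def avg_movie_rating(user_rating):
    """Group ratings by movie_id and average them, rounded to one decimal."""
    totals = defaultdict(lambda: [0, 0])  # movie_id -> [sum, count]
    for row in user_rating:
        acc = totals[row["movie_id"]]
        acc[0] += row["rating"]
        acc[1] += 1
    return {movie_id: round(s / n, 1) for movie_id, (s, n) in totals.items()}

user_rating = [
    {"user_id": 196, "movie_id": 242, "rating": 3},
    {"user_id": 186, "movie_id": 242, "rating": 4},
    {"user_id": 22, "movie_id": 377, "rating": 1},
]

result = avg_movie_rating(user_rating)
print(result)
```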
Export to Data Warehouse
user_rating_fact
user_rating
user_history
movie
user
Merge updates
One record/user/movie
Merge updates
One record/user
Data Warehouse
avg_movie_rating
latest_trending_movies
Export
Sqoop Export
sqoop export --connect \
  jdbc:mysql://mysql_server:3306/movie_dwh \
  --username myuser --password mypass \
  --table avg_movie_rating --export-dir \
  /user/hive/warehouse/avg_movie_rating \
  -m 16 --update-key movie_id --update-mode \
  allowinsert --input-fields-terminated-by \
  '\001' --lines-terminated-by '\n'
Final Architecture
Thank you!
@hadooparchbook
What is Hadoop?
Hadoop is an open-source system designed
to store and process petabyte-scale data.
That’s pretty much what you need to know.
Well, almost…
Compression Codecs
snappy
Well, maybe.
Not splittable.
X
Splittable.
Getting
better…
Very good
choice
Splittable,
but no...
Our Compression Codec Recommendation
■ Snappy for all data sets (columnar as well as row based)
File Format Choices
Data set | Storage format | Compression codec
movie | Parquet | Snappy
user_history | Avro | Snappy
user | Parquet | Snappy
user_rating_fact | Avro | Snappy
user_rating | Parquet | Snappy

Weitere ähnliche Inhalte

Was ist angesagt?

Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureVARUN SAXENA
 
Big Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with HadoopBig Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with HadoopDataWorks Summit
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Building Your Data Streams for all the IoT
Building Your Data Streams for all the IoTBuilding Your Data Streams for all the IoT
Building Your Data Streams for all the IoTDevOps.com
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
MongoDB and Azure Databricks
MongoDB and Azure DatabricksMongoDB and Azure Databricks
MongoDB and Azure DatabricksMongoDB
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsZhenxiao Luo
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in SparkDatabricks
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQLkristinferrier
 
Introduction to Kibana
Introduction to KibanaIntroduction to Kibana
Introduction to KibanaVineet .
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache KuduAndriy Zabavskyy
 
NiFi Best Practices for the Enterprise
NiFi Best Practices for the EnterpriseNiFi Best Practices for the Enterprise
NiFi Best Practices for the EnterpriseGregory Keys
 
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDesigning the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDatabricks
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionCloudera, Inc.
 
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo..."Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...Lucidworks
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Simplilearn
 

Was ist angesagt? (20)

Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and Future
 
Apache NiFi Crash Course Intro
Apache NiFi Crash Course IntroApache NiFi Crash Course Intro
Apache NiFi Crash Course Intro
 
Big Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with HadoopBig Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with Hadoop
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Building Your Data Streams for all the IoT
Building Your Data Streams for all the IoTBuilding Your Data Streams for all the IoT
Building Your Data Streams for all the IoT
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
MongoDB and Azure Databricks
MongoDB and Azure DatabricksMongoDB and Azure Databricks
MongoDB and Azure Databricks
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQL
 
Introduction to Kibana
Introduction to KibanaIntroduction to Kibana
Introduction to Kibana
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
NiFi Best Practices for the Enterprise
NiFi Best Practices for the EnterpriseNiFi Best Practices for the Enterprise
NiFi Best Practices for the Enterprise
 
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDesigning the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An Introduction
 
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo..."Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
 

Ähnlich wie Data warehousing with Hadoop

Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Cloudera, Inc.
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
Trafodion overview
Trafodion overviewTrafodion overview
Trafodion overviewRohit Jain
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an examplehadooparchbook
 
Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User GroupApplication Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Grouphadooparchbook
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoopAdam Muise
 
Webinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop SolutionWebinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop SolutionMapR Technologies
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionJames Serra
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's includedJames Serra
 
Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016Joan Novino
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & HadoopBlackvard
 

Ähnlich wie Data warehousing with Hadoop (20)

2014 08-20-pit-hug
2014 08-20-pit-hug2014 08-20-pit-hug
2014 08-20-pit-hug
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Trafodion overview
Trafodion overviewTrafodion overview
Trafodion overview
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an example
 
Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User GroupApplication Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Group
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoop
 
Drill at the Chicago Hug
Drill at the Chicago HugDrill at the Chicago Hug
Drill at the Chicago Hug
 
Apache drill
Apache drillApache drill
Apache drill
 
Webinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop SolutionWebinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop Solution
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & Hadoop
 

Mehr von hadooparchbook

Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platformhadooparchbook
 
Architecting a next-generation data platform
Architecting a next-generation data platformArchitecting a next-generation data platform
Architecting a next-generation data platformhadooparchbook
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationshadooparchbook
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platformhadooparchbook
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platformhadooparchbook
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patternshadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applicationshadooparchbook
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detectionhadooparchbook
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detectionhadooparchbook
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoophadooparchbook
 

Data warehousing with Hadoop

  • 1. REMINDER Check in on the COLLABORATE mobile app Architectural Considerations for Data Warehousing with Hadoop Prepared by: Mark Grover, Software Engineer Jonathan Seidman, Solutions Architect Cloudera, Inc. github.com/hadooparchitecturebook/hadoop-arch-book/tree/master/ch11-data-warehousing Session ID#: 10251 @mark_grover @jseidman
  • 2. About Us ■ Mark ▪ Software Engineer at Cloudera ▪ Committer on Apache Bigtop, PMC member on Apache Sentry (incubating) ▪ Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume ■ Jonathan ▪ Senior Solutions Architect/Partner Engineering at Cloudera ▪ Previously, Technical Lead on the big data team at Orbitz Worldwide ▪ Co-founder of the Chicago Hadoop User Group and Chicago Big Data
  • 3. About the Book ■ @hadooparchbook ■ hadooparchitecturebook.com ■ github.com/hadooparchitecturebook ■ slideshare.com/hadooparchbook
  • 4. Agenda ■ Typical data warehouse architecture. ■ Challenges with the existing data warehouse architecture. ■ How Hadoop complements an existing data warehouse architecture. ■ (Very) Brief intro to Hadoop. ■ Example use case. ■ Walkthrough of example use case implementation.
  • 6. Example High Level Data Warehouse Architecture Extract Data Staging Area Operational Source Systems Load Data Warehouse Data Analysis/Visualization Tools Transformations
  • 7. Challenges with the Data Warehouse Architecture
  • 8. Challenge – ETL/ELT Processing OLTP Enterprise Applications ODS Data Warehouse Query Extract Transform Load Business Intelligence Transform
  • 9. Challenges – ETL/ELT Processing OLTP Enterprise Applications ODS Data Warehouse Query Extract Transform Load Business Intelligence Transform (1) Slow Data Transformations = Missed ETL SLAs. (2) Slow Queries = Frustrated Business Users.
  • 10. Challenges – Data Archiving Data Warehouse Tape Archive ■ Full-fidelity data only kept for a short duration ■ Expensive or sometimes impossible to look at historical raw data
  • 11. Challenge – Disparate Data Sources Data Warehouse ■ How do you join data from disparate sources with EDW? Business Intelligence ???
  • 12. Challenge – Lack of Agility ■ Responding to changing requirements, mistakes, etc. requires lengthy processes.
  • 13. Challenge – Exploratory Analysis in the EDW ■ Difficult for users to do exploratory analysis of data in the data warehouse. Business Users Developers Analysts Data Warehouse
  • 14. Complementing the EDW with Hadoop
  • 15. Data Warehouse Architecture with Hadoop Extract Hadoop Operational Source Systems EDW BI/Analytics Tools Logs, machine data, etc. Extract Transformation/Analysis Load
  • 16. Hadoop ETL/ELT Optimization with Hadoop OLTP Enterprise Applications ODS Business Intelligence Transform Query Store ETL Data Warehouse Query (High $/Byte)
  • 17. Active Archiving with Hadoop Data Warehouse Hadoop
  • 18. Joining Disparate Data Sources with Hadoop Data Warehouse Business Intelligence Hadoop
  • 19. Agile Data Access with Hadoop
      Schema-on-Write (RDBMS):
      • Prescriptive Data Modeling:
        • Create static DB schema
        • Transform data into RDBMS
        • Query data in RDBMS format
      • New columns must be added explicitly before new data can propagate into the system.
      • Good for Known Unknowns (Repetition)
      Schema-on-Read (Hadoop):
      • Descriptive Data Modeling:
        • Copy data in its native format
        • Create schema + parser
        • Query data in its native format
      • New data can start flowing any time and will appear retroactively once the schema/parser properly describes it.
      • Good for Unknown Unknowns (Exploration)
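The schema-on-read idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw records reuse the pipe-delimited MovieLens `u.user` layout shown later in the deck, and `RAW_LINES`, `read_with_schema`, `schema_v1` and `schema_v2` are hypothetical names, not part of any Hadoop API.

```python
# Raw records are stored untouched, in their native pipe-delimited format.
RAW_LINES = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
]

def read_with_schema(lines, schema):
    """Apply a schema (a list of (name, type) pairs) at read time."""
    rows = []
    for line in lines:
        fields = line.split("|")
        # zip() stops at the shorter sequence, so columns the schema
        # does not describe yet are simply ignored.
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows

# A first schema (hypothetical) describes only two columns.
schema_v1 = [("user_id", int), ("age", int)]
# A later schema describes every column; the stored data is not reloaded.
schema_v2 = schema_v1 + [("gender", str), ("occupation", str), ("zip_code", str)]

print(read_with_schema(RAW_LINES, schema_v1))
print(read_with_schema(RAW_LINES, schema_v2))
```

The second call shows the "appears retroactively" property: the extra columns were in the stored data all along and become visible once the schema describes them.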
  • 20. Exploratory Analysis with Hadoop Hadoop Business Users Developers Analysts Data Warehouse
  • 21. A Very Brief Intro to Hadoop
  • 22. What is Apache Hadoop? Apache Hadoop is an open source platform for data storage and processing that is scalable, fault tolerant, and distributed. CORE HADOOP SYSTEM COMPONENTS: Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage. Parallel Processing (MapReduce, Spark, Impala, etc.): distributed computing frameworks. Has the Flexibility to Store and Mine Any Type of Data ▪ Ask questions across structured and unstructured data that were previously impossible to ask or solve ▪ Not bound by a single schema. Excels at Processing Complex Data ▪ Scale-out architecture divides workloads across multiple nodes ▪ Flexible file system eliminates ETL bottlenecks. Scales Economically ▪ Can be deployed on commodity hardware ▪ Open source platform guards against vendor lock-in
  • 23. Oracle Big Data Appliance ■ All of the capabilities we’re talking about here are available as part of the Oracle BDA.
  • 24. Challenges of Hadoop Implementation
  • 25. Challenges of Hadoop Implementation
  • 26. Other Challenges – Architectural Considerations Data Sources Ingestion Raw Data Storage (Formats, Schema) Processed Data Storage (Formats, Schema) Processing Data Consumption Orchestration (Scheduling, Managing, Monitoring) Metadata Management
  • 27. Hadoop Third Party Ecosystem Data Systems Applications Infrastructure Operational Tools
  • 29. Use-case ■ Movielens dataset ■ Users register by entering some demographic information ▪ Users can update demographic information later on ■ Rate movies ▪ Ratings can be updated later on ■ Auxiliary information about movies available ▪ e.g. release date, IMDB URL, etc.
  • 30. Movielens data set: u.user
      user id | age | gender | occupation | zip code
      1|24|M|technician|85711
      2|53|F|other|94043
      3|23|M|writer|32067
      4|24|M|technician|43537
      5|33|F|other|15213
      6|42|M|executive|98101
      7|57|M|administrator|91344
  • 31. Movielens data set: u.item
      movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
      1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
      2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
      3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
  • 32. Movielens data set: u.data
      user id | item id | rating | timestamp
      196|242|3|881250949
      186|302|3|891717742
      22|377|1|878887116
      244|51|2|880606923
      166|346|1|886397596
  • 36. Data Modeling Considerations ■ We need to consider the following in our architecture: ▪ Storage layer – HDFS? HBase? Etc. ▪ File system schemas – how will we lay out the data? ▪ File formats – what storage formats to use for our data, both raw and processed data? ▪ Data compression formats? ■ Hadoop is not a database, so these considerations will be different from an RDBMS.
  • 37. Denormalization ■ Why denormalize? ■ When to denormalize? ■ How much to denormalize?
  • 38. Why Denormalize? ■ Regular joins are expensive in Hadoop ■ When you have 2 data sets, there is no guarantee that corresponding records will be present on the same node ■ Such a guarantee exists when storing such data in a single data set
  • 39. When to Denormalize? ■ Well, it’s difficult to say ■ It depends
  • 40. Movielens Data Set - Denormalization Denormalize Denormalize
  • 41. Data Set in Hadoop
  • 42. Tracking Updates (CDC) ■ Can’t update data in-place in HDFS ■ HDFS is an append-only filesystem ■ We have to track all updates
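A minimal Python sketch of what tracking updates on an append-only store can look like: every change is appended as a new record, and the current state is derived by keeping the newest record per key. The name `user_history` comes from the deck's schema; the record fields, the integer timestamps, and the helpers `record_change` and `current_view` are assumptions made for illustration.

```python
# Append-only history: an update is a new record, never an in-place edit.
user_history = []  # (user_id, occupation, last_modified); field choice is hypothetical

def record_change(user_id, occupation, last_modified):
    """Append a change record; existing records are never modified."""
    user_history.append((user_id, occupation, last_modified))

def current_view(history):
    """Derive the current state by keeping the newest record per user_id."""
    latest = {}
    for user_id, occupation, last_modified in history:
        if user_id not in latest or last_modified >= latest[user_id][1]:
            latest[user_id] = (occupation, last_modified)
    return latest

record_change(1, "technician", 100)
record_change(2, "other", 100)
record_change(1, "engineer", 200)  # a later update for user 1

print(len(user_history))           # all three records are retained
print(current_view(user_history))  # one record per user
```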
  • 44. Hadoop File Types ■ Formats designed specifically to store and process data on Hadoop: ▪ File based – SequenceFile ▪ Serialization formats – Thrift, Protocol Buffers, Avro ▪ Columnar formats – RCFile, ORC, Parquet
  • 45. Final Schema in Hadoop
  • 46. Our Storage Format Recommendation ■ Columnar format (Parquet) for merged/compacted data sets ▪ user, user_rating, movie ■ Row format (Avro) for history/append-only data sets ▪ user_history, user_rating_fact
  • 48. Ingestion – Apache Flume. Flume Agent: Sources (Twitter, logs, JMS, webserver, Kafka) → Interceptors (mask, re-format, validate…) → Selectors (DR, critical) → Channels (memory, file, Kafka) → Sinks (HDFS, HBase, Solr)
  • 49. Ingestion – Apache Kafka Source System Source System Source System Source System Hadoop Security Systems Real-time monitoring Data Warehouse Kafka
  • 50. Ingestion – Apache Sqoop ■ Apache project designed to ease import and export of data between Hadoop and external data stores such as an RDBMS. ■ Provides functionality to do bulk imports and exports of data. ■ Leverages MapReduce to transfer data in parallel. Client Sqoop MapReduce Map Map Map Hadoop Run import Collect metadata Generate code, Execute MR job Pull data Write to Hadoop
  • 51. Sqoop Import Example – Movie sqoop import --connect jdbc:mysql://mysql_server:3306/movielens --username myuser --password mypass --query 'SELECT movie.*, group_concat(genre.name) FROM movie JOIN movie_genre ON (movie.id = movie_genre.movie_id) JOIN genre ON (movie_genre.genre_id = genre.id) WHERE $CONDITIONS GROUP BY movie.id' --split-by movie.id --as-avrodatafile --target-dir /data/movielens/movie
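To see what the `--query` in the import produces, the join can be tried locally. The sketch below uses Python's built-in `sqlite3` as a stand-in for MySQL (an assumption made only so the example runs without a database server; SQLite also provides `group_concat`). The table names follow the slide, while the column lists and sample rows are made up for illustration, and the Sqoop-specific `WHERE $CONDITIONS` clause is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal versions of the normalized source tables.
conn.executescript("""
CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE genre (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE movie_genre (movie_id INTEGER, genre_id INTEGER);
INSERT INTO movie VALUES (1, 'Toy Story (1995)'), (2, 'GoldenEye (1995)');
INSERT INTO genre VALUES (1, 'Animation'), (2, 'Comedy'), (3, 'Action');
INSERT INTO movie_genre VALUES (1, 1), (1, 2), (2, 3);
""")

# Same join shape as the slide: genres are folded into one column per movie,
# so the imported data set is already denormalized.
rows = conn.execute("""
SELECT movie.id, movie.title, group_concat(genre.name) AS genres
FROM movie
JOIN movie_genre ON (movie.id = movie_genre.movie_id)
JOIN genre ON (movie_genre.genre_id = genre.id)
GROUP BY movie.id
ORDER BY movie.id
""").fetchall()

for row in rows:
    print(row)
```

Each movie comes back as a single row with its genres concatenated into one comma-separated field, which avoids a join on the Hadoop side later.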
  • 53. Popular Processing Engines ■ MapReduce ▪ Programming paradigm ■ Pig ▪ Workflow language based ■ Hive ▪ Batch SQL-engine ■ Impala ▪ Near real-time concurrent SQL engine ■ Spark ▪ DAG engine
  • 54. Final Schema in Hadoop
  • 55. Merge Updates hive> INSERT OVERWRITE TABLE user_tmp SELECT user.* FROM user LEFT OUTER JOIN user_upserts ON (user.id = user_upserts.id) WHERE user_upserts.id IS NULL UNION ALL SELECT id, age, occupation, zipcode, TIMESTAMP(last_modified) FROM user_upserts;
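The merge logic can be tried locally with a small sketch that uses SQLite in place of Hive (an assumption made purely so it runs without a cluster). The table and column names follow the slide; the column types and sample rows are made up, `INSERT INTO` replaces Hive's `INSERT OVERWRITE TABLE`, and the `TIMESTAMP()` cast is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (id INTEGER, age INTEGER, occupation TEXT, zipcode TEXT, last_modified TEXT);
CREATE TABLE user_upserts (id INTEGER, age INTEGER, occupation TEXT, zipcode TEXT, last_modified TEXT);
CREATE TABLE user_tmp (id INTEGER, age INTEGER, occupation TEXT, zipcode TEXT, last_modified TEXT);
INSERT INTO user VALUES (1, 24, 'technician', '85711', '2014-01-01 00:00:00'),
                        (2, 53, 'other', '94043', '2014-01-01 00:00:00');
INSERT INTO user_upserts VALUES (2, 54, 'executive', '94043', '2015-01-01 00:00:00'),
                                (3, 23, 'writer', '32067', '2015-01-01 00:00:00');
""")

# First branch: existing rows with no pending upsert (anti-join).
# Second branch: every upsert row, covering both updates and new users.
conn.execute("""
INSERT INTO user_tmp
SELECT user.* FROM user
LEFT OUTER JOIN user_upserts ON (user.id = user_upserts.id)
WHERE user_upserts.id IS NULL
UNION ALL
SELECT id, age, occupation, zipcode, last_modified FROM user_upserts
""")

merged = conn.execute("SELECT id, age, occupation FROM user_tmp ORDER BY id").fetchall()
print(merged)
```

User 1 is carried over unchanged, user 2 is replaced by its upsert row, and user 3 is added, so the result holds exactly one record per user.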
  • 57. Aggregations hive> CREATE TABLE avg_movie_rating AS SELECT movie_id, ROUND(AVG(rating), 1) AS rating FROM user_rating GROUP BY movie_id;
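The aggregation is plain SQL, so its behavior can be checked on a handful of rows. This sketch again assumes SQLite as a local stand-in for Hive; `user_rating` and `avg_movie_rating` are the slide's table names, while the `user_id` column and the sample ratings are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_rating (user_id INTEGER, movie_id INTEGER, rating INTEGER)")
# Made-up sample ratings: two for movie 242, three for movie 377.
conn.executemany(
    "INSERT INTO user_rating VALUES (?, ?, ?)",
    [(196, 242, 3), (186, 242, 4), (22, 377, 1), (244, 377, 2), (166, 377, 2)],
)

# Same statement shape as the slide: one averaged, rounded rating per movie.
conn.execute("""
CREATE TABLE avg_movie_rating AS
SELECT movie_id, ROUND(AVG(rating), 1) AS rating
FROM user_rating
GROUP BY movie_id
""")

result = conn.execute(
    "SELECT movie_id, rating FROM avg_movie_rating ORDER BY movie_id"
).fetchall()
print(result)
```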
  • 58. Export to Data Warehouse
  • 59. user_rating_fact user_rating user_history movie user Merge updates One record/user/movie Merge updates One record/user Data Warehouse avg_movie_rating latest_trending_movies Export
  • 60. Sqoop Export sqoop export --connect jdbc:mysql://mysql_server:3306/movie_dwh --username myuser --password mypass --table avg_movie_rating --export-dir /user/hive/warehouse/avg_movie_rating -m 16 --update-key movie_id --update-mode allowinsert --input-fields-terminated-by '\001' --lines-terminated-by '\n'
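Two details of the export are easy to miss: the files under the Hive warehouse directory use Ctrl-A (`\001`) as the default field delimiter, and `--update-key movie_id` with `--update-mode allowinsert` asks for upsert behavior (update the row if the key exists, insert it otherwise). The Python sketch below mimics both on in-memory data; `exported_lines`, `warehouse` and `upsert` are hypothetical names, and the dictionary is only a stand-in for the target table.

```python
# Lines as Hive writes them by default: fields separated by Ctrl-A (\x01).
exported_lines = ["242\x013.5\n", "377\x011.7\n"]

# Hypothetical stand-in for the warehouse table, keyed by movie_id.
warehouse = {242: 3.0}

def upsert(table, lines, field_sep="\x01", line_sep="\n"):
    """Update rows whose key already exists and insert the rest."""
    for line in lines:
        movie_id, rating = line.rstrip(line_sep).split(field_sep)
        # Assigning by key covers both cases: overwrite or add.
        table[int(movie_id)] = float(rating)

upsert(warehouse, exported_lines)
print(warehouse)
```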
  • 63. Please complete the session evaluation Thank you! @hadooparchbook You may complete the session evaluation either on paper or online via the mobile app
  • 69. What is Hadoop? Hadoop is an open-source system designed to store and process petabyte-scale data. That’s pretty much what you need to know. Well, almost…
  • 70. Compression Codecs snappy Well, maybe. Not splittable. X Splittable. Getting better… Very good choice Splittable, but no...
  • 71. Our Compression Codec Recommendation ■ Snappy for all data sets (columnar as well as row based)
  • 72. File Format Choices
      Data set | Storage format | Compression codec
      movie | Parquet | Snappy
      user_history | Avro | Snappy
      user | Parquet | Snappy
      user_rating_fact | Avro | Snappy
      user_rating | Parquet | Snappy