Architectural
Considerations
for Hadoop
Applications
Strata+Hadoop World, London– May 5th, 2015
tiny.cloudera.com/app-arch-london
Mark Grover | @mark_grover
Ted Malaska | @TedMalaska
Jonathan Seidman | @jseidman
Gwen Shapira | @gwenshap
2
About the book
•  @hadooparchbook
•  hadooparchitecturebook.com
•  github.com/hadooparchitecturebook
•  slideshare.com/hadooparchbook
©2014 Cloudera, Inc. All Rights Reserved.
3
About the presenters
Ted Malaska
•  Principal Solutions Architect at Cloudera
•  Previously, lead architect at FINRA
•  Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark
Jonathan Seidman
•  Senior Solutions Architect/Partner Enablement at Cloudera
•  Contributor to Apache Sqoop
•  Previously, Technical Lead on the big data team at Orbitz, co-founder of the Chicago Hadoop User Group and Chicago Big Data
©2014 Cloudera, Inc. All Rights Reserved.
Ted Malaska Jonathan Seidman
4
About the presenters
Gwen Shapira
•  Solutions Architect turned Software Engineer at Cloudera
•  Committer on Apache Sqoop
•  Contributor to Apache Flume and Apache Kafka
Mark Grover
•  Software Engineer at Cloudera
•  Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)
•  Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume
©2014 Cloudera, Inc. All Rights Reserved.
Gwen Shapira Mark Grover
5
Logistics
•  Break at 10:30-11:00 AM
•  Questions at the end of each section
©2014 Cloudera, Inc. All Rights Reserved.
6
Case Study
Clickstream Analysis
7
Analytics
©2014 Cloudera, Inc. All Rights Reserved.
8
Analytics
©2014 Cloudera, Inc. All Rights Reserved.
9
Analytics
©2014 Cloudera, Inc. All Rights Reserved.
10
Analytics
©2014 Cloudera, Inc. All Rights Reserved.
11
Analytics
©2014 Cloudera, Inc. All Rights Reserved.
12
Analytics
©2014 Cloudera, Inc. All Rights Reserved.
13
Analytics
©2014 Cloudera, Inc. All Rights Reserved.
14
Web Logs – Combined Log Format
©2014 Cloudera, Inc. All Rights Reserved.
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
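For reference, a minimal Python sketch that parses one line of this Combined Log Format with a regular expression (the field names are our own labels, not part of the format specification):

import re

# host ident authuser [date] "request" status bytes "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = ('244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" '
          '200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0"')
print(parse_line(sample)["request"])   # GET /seatposts HTTP/1.0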
15
Clickstream Analytics
©2014 Cloudera, Inc. All Rights Reserved.
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
16
Similar use-cases
•  Sensors – heart monitors, agriculture, etc.
•  Casinos – a player's session at a table
©2014 Cloudera, Inc. All Rights Reserved.
17
Pre-Hadoop
Architecture
Clickstream Analysis
18
Click Stream Analysis (Before Hadoop)
©2014 Cloudera, Inc. All Rights Reserved.
Web logs
(full fidelity)
(2 weeks)
Data Warehouse
Transform/
Aggregate Business
Intelligence
Tape Archive
19
Problems with Pre-Hadoop Architecture
•  Full-fidelity data is stored for only a small amount of time (~weeks).
•  Older data is sent to tape or, even worse, deleted!
•  Inflexible workflow – all aggregations have to be thought of beforehand
©2014 Cloudera, Inc. All Rights Reserved.
20
Effects of Pre-Hadoop Architecture
•  Regenerating aggregates is expensive or, worse, impossible
•  Can’t correct bugs in the workflow/aggregation logic
•  Can’t do experiments on existing data
©2014 Cloudera, Inc. All Rights Reserved.
21
Why is Hadoop A
Great Fit?
Clickstream Analysis
22
Why is Hadoop a great fit?
•  Volume of clickstream data is huge
•  Velocity at which it comes in is high
•  Variety of data is diverse - semi-structured data
•  Hadoop enables
–  Active archival of data
–  Aggregation jobs
–  Querying the above aggregates or the full-fidelity raw data
©2014 Cloudera, Inc. All Rights Reserved.
23
Click Stream Analysis (with Hadoop)
©2014 Cloudera, Inc. All Rights Reserved.
Web logs Hadoop Business
Intelligence
Active archive (no tape)
Aggregation engine
Querying engine
24
Challenges of Hadoop Implementation
©2014 Cloudera, Inc. All Rights Reserved.
25
Challenges of Hadoop Implementation
©2014 Cloudera, Inc. All Rights Reserved.
26
Other challenges - Architectural Considerations
•  Storage managers?
–  HDFS? HBase?
•  Data storage and modeling:
–  File formats? Compression? Schema design?
•  Data movement
–  How do we actually get the data into Hadoop? How do we get it out?
•  Metadata
–  How do we manage data about the data?
•  Data access and processing
–  How will the data be accessed once in Hadoop? How can we transform it? How do
we query it?
•  Orchestration
–  How do we manage the workflow for all of this?
©2014 Cloudera, Inc. All Rights Reserved.
27
Case Study
Requirements
Overview of Requirements
28
Overview of Requirements
Data
Sources
Ingestion
Raw Data
Storage
(Formats,
Schema)
Processed
Data Storage
(Formats,
Schema)
©2014 Cloudera, Inc. All Rights Reserved.
Processing
Data
Consumption
Orchestration
(Scheduling,
Managing,
Monitoring)
29
Case Study
Requirements
Data Ingestion
30
Data Ingestion Requirements
©2014 Cloudera, Inc. All Rights Reserved.
(Diagram: many web servers producing logs, for example:
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 …
plus CRM data and an ODS, all feeding into Hadoop)
31
Data Ingestion Requirements
•  So we need to be able to support:
–  Reliable ingestion of large volumes of semi-structured event data arriving with
high velocity (e.g. logs).
–  Timeliness of data availability – data needs to be available for processing to
meet business service level agreements.
–  Periodic ingestion of data from relational data stores.
©2014 Cloudera, Inc. All Rights Reserved.
32
Case Study
Requirements
Data Storage
33
Data Storage Requirements
©2014 Cloudera, Inc. All Rights Reserved.
Store all the data
Make the data
accessible for
processing
Compress
the data
34
Case Study
Requirements
Data Processing
35
Processing requirements
Be able to answer questions like:
•  What is my website’s bounce rate?
–  i.e., what percentage of visitors don't go past the landing page?
•  Which marketing channels are leading to the most sessions?
•  Do attribution analysis
–  Which channels are responsible for the most conversions?
©2014 Cloudera, Inc. All Rights Reserved.
36
Sessionization
©2014 Cloudera, Inc. All Rights Reserved.
Website visit
Visitor 1
Session 1
Visitor 1
Session 2
Visitor 2
Session 1
> 30 minutes
37
Case Study
Requirements
Orchestration
38
Orchestration is simple
We just need to execute actions
One after another
©2014 Cloudera, Inc. All Rights Reserved.
39©2014 Cloudera, Inc. All Rights Reserved.
Actually,
we also need to handle errors
And user notifications
….
40©2014 Cloudera, Inc. All Rights Reserved.
And…
•  Re-start workflows after errors
•  Reuse of actions in multiple workflows
•  Complex workflows with decision points
•  Trigger actions based on events
•  Tracking metadata
•  Integration with enterprise software
•  Data lifecycle
•  Data quality control
•  Reports
41©2014 Cloudera, Inc. All Rights Reserved.
OK, maybe we need a product
To help us do all that
42
Architectural
Considerations
Data Modeling
43
Data Modeling Considerations
•  We need to consider the following in our architecture:
–  Storage layer – HDFS? HBase? Etc.
–  File system schemas – how will we lay out the data?
–  File formats – what storage formats to use for our data, both raw and
processed data?
–  Data compression formats?
©2014 Cloudera, Inc. All Rights Reserved.
44
Architectural
Considerations
Data Modeling – Storage Layer
45
Data Storage Layer Choices
•  Two likely choices for raw data:
©2014 Cloudera, Inc. All Rights Reserved.
46
Data Storage Layer Choices
HDFS
•  Stores data directly as files
•  Fast scans
•  Poor random reads/writes
HBase
•  Stores data as HFiles on HDFS
•  Slow scans
•  Fast random reads/writes
©2014 Cloudera, Inc. All Rights Reserved.
47
Data Storage – Storage Manager Considerations
•  Incoming raw data:
–  Processing requirements call for batch transformations across multiple
records – for example sessionization.
•  Processed data:
–  Access to processed data will be via things like analytical queries – again
requiring access to multiple records.
•  We choose HDFS
–  Processing needs in this case served better by fast scans.
©2014 Cloudera, Inc. All Rights Reserved.
48
Architectural
Considerations
Data Modeling – Raw Data Storage
49
Storage Formats – Raw Data and Processed Data
©2014 Cloudera, Inc. All Rights Reserved.
Processed
Data
Raw
Data
50
Data Storage – Format Considerations
Logs
(plain
text)
51
Data Storage – Format Considerations
(Diagram: many separate plain-text log files)
52
Data Storage – Compression
(Codec comparison: Snappy? Well, maybe, but it's not splittable. The other codecs shown are splittable, but each has its own drawbacks.)
53
Raw Data Storage – More About Snappy
•  Designed at Google to provide high compression speeds with reasonable
compression.
•  Not the highest compression, but provides very good performance for
processing on Hadoop.
•  Snappy is not splittable though, which brings us to…
©2014 Cloudera, Inc. All Rights Reserved.
54
Hadoop File Types
•  Formats designed specifically to store and process data on Hadoop:
–  File based – SequenceFile
–  Serialization formats – Thrift, Protocol Buffers, Avro
–  Columnar formats – RCFile, ORC, Parquet
55
SequenceFile
•  Stores records as
binary key/value pairs.
•  SequenceFile “blocks”
can be compressed.
•  This enables
splittability with non-
splittable compression.
©2014 Cloudera, Inc. All Rights Reserved.
56
Avro
•  Kinda SequenceFile on
Steroids.
•  Self-documenting –
stores schema in
header.
•  Provides very efficient
storage.
•  Supports splittable
compression. ©2014 Cloudera, Inc. All Rights Reserved.
57
Our Format Recommendations for Raw Data…
•  Avro with Snappy
–  Snappy provides optimized compression.
–  Avro provides compact storage, self-documenting files, and supports schema
evolution.
–  Avro also provides better failure handling than other choices.
•  SequenceFiles would also be a good choice, and are directly
supported by ingestion tools in the ecosystem.
–  But they only support Java.
©2014 Cloudera, Inc. All Rights Reserved.
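As a small illustration of that recommendation, a sketch of writing raw events as Snappy-compressed Avro using the Python fastavro library (the library choice and the record schema are ours, purely for illustration; the Snappy codec additionally requires the python-snappy bindings):

from fastavro import parse_schema, writer

# The schema travels in the file header, which is what makes Avro self-documenting.
schema = parse_schema({
    "type": "record",
    "name": "RawLogEvent",
    "fields": [
        {"name": "host",    "type": "string"},
        {"name": "time",    "type": "string"},
        {"name": "request", "type": "string"},
        {"name": "status",  "type": "int"},
    ],
})

records = [
    {"host": "244.157.45.12", "time": "17/Oct/2014:21:08:30",
     "request": "GET /seatposts HTTP/1.0", "status": 200},
]

# Avro compresses per block, so the file stays splittable even with Snappy.
with open("rawlogs.avro", "wb") as out:
    writer(out, schema, records, codec="snappy")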
58
But Note…
•  For simplicity, we’ll use plain text for raw data in our example.
©2014 Cloudera, Inc. All Rights Reserved.
59
Architectural
Considerations
Data Modeling – Processed Data Storage
60
Storage Formats – Raw Data and Processed Data
©2014 Cloudera, Inc. All Rights Reserved.
Processed
Data
Raw
Data
61
Access to Processed Data
©2014 Cloudera, Inc. All Rights Reserved.
Column A Column B Column C Column D
Value Value Value Value
Value Value Value Value
Value Value Value Value
Value Value Value Value
Analytical
Queries
62
Columnar Formats
•  Eliminates I/O for columns that are not part of a query.
•  Works well for queries that access a subset of columns.
•  Often provide better compression.
•  These add up to dramatically improved performance for many
queries.
©2014 Cloudera, Inc. All Rights Reserved.
Row layout:
1 | 2014-10-13 | abc
2 | 2014-10-14 | def
3 | 2014-10-15 | ghi

Columnar layout:
1 | 2 | 3
2014-10-13 | 2014-10-14 | 2014-10-15
abc | def | ghi
63
Columnar Choices – RCFile
•  Designed to provide efficient processing for Hive queries.
•  Only supports Java.
•  No Avro support.
•  Limited compression support.
•  Sub-optimal performance compared to newer columnar formats.
©2014 Cloudera, Inc. All Rights Reserved.
64
Columnar Choices – ORC
•  A better RCFile.
•  Also designed to provide efficient processing of Hive queries.
•  Only supports Java.
©2014 Cloudera, Inc. All Rights Reserved.
65
Columnar Choices – Parquet
•  Designed to provide efficient processing across Hadoop
programming interfaces – MapReduce, Hive, Impala, Pig.
•  Multiple language support – Java, C++
•  Good object model support, including Avro.
•  Broad vendor support.
•  These features make Parquet a good choice for our processed data.
©2014 Cloudera, Inc. All Rights Reserved.
66
Architectural
Considerations
Data Modeling – Schema Design
67
HDFS Schema Design – One Recommendation
/etl – Data in various stages of ETL workflow
/data – processed data to be shared with the entire organization
/tmp – temp data from tools or shared between users
/user/<username> - User specific data, jars, conf files
/app – Everything but data: UDF jars, HQL files, Oozie workflows
©2014 Cloudera, Inc. All Rights Reserved.
68
Partitioning
•  Split the dataset into smaller consumable chunks.
•  Rudimentary form of “indexing”. Reduces I/O needed to process
queries.
©2014 Cloudera, Inc. All Rights Reserved.
69
Partitioning
©2014 Cloudera, Inc. All Rights Reserved.
dataset
col=val1/file.txt
col=val2/file.txt
…
col=valn/file.txt
dataset
file1.txt
file2.txt
…
filen.txt
Un-partitioned HDFS
directory structure
Partitioned HDFS
directory structure
70
Partitioning considerations
•  What column to partition by?
–  Don’t have too many partitions (<10,000)
–  Don’t have too many small files in the partitions
–  Good to have partition sizes at least ~1 GB, generally a multiple of block size.
•  We’ll partition by timestamp. This applies to both our raw and
processed data.
©2014 Cloudera, Inc. All Rights Reserved.
71
Partitioning For Our Case Study
•  Raw dataset:
–  /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10
•  Processed dataset:
–  /data/bikeshop/clickstream/year=2014/month=10/day=10
©2014 Cloudera, Inc. All Rights Reserved.
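To tie the Parquet and partitioning choices together, a minimal PySpark sketch (the table name clickstream_processed and the timestamp column ts are assumptions for illustration; the output path follows the case study):

from pyspark.sql import SparkSession
from pyspark.sql.functions import year, month, dayofmonth

spark = SparkSession.builder.appName("clickstream-parquetize").getOrCreate()

# Assumed to hold the processed records, with a timestamp column `ts`.
clicks = spark.table("clickstream_processed")

(clicks
 .withColumn("year", year("ts"))
 .withColumn("month", month("ts"))
 .withColumn("day", dayofmonth("ts"))
 .write
 .partitionBy("year", "month", "day")   # creates year=2014/month=10/day=10 directories
 .parquet("/data/bikeshop/clickstream"))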
72
Architectural
Considerations
Data Ingestion
73©2014 Cloudera, Inc. All rights reserved.
•  Omniture data on FTP
•  Apps
•  App Logs
•  RDBMS
Typical Clickstream data sources
74©2014 Cloudera, Inc. All rights reserved.
Getting Files from FTP
75©2014 Cloudera, Inc. All rights reserved.
curl ftp://myftpsite.com/sitecatalyst/myreport_2014-10-05.tar.gz \
--user name:password | hdfs dfs -put - /etl/clickstream/raw/sitecatalyst/myreport_2014-10-05.tar.gz
Don’t over-complicate things
76©2014 Cloudera, Inc. All rights reserved.
Apache NiFi
77©2014 Cloudera, Inc. All rights reserved.
Reliable, distributed and highly available systems
That allow streaming events to Hadoop
Event Streaming – Flume and Kafka
78©2014 Cloudera, Inc. All rights reserved.
•  Many available data collection sources
•  Well integrated into Hadoop
•  Supports file transformations
•  Can implement complex topologies
•  Very low latency
•  No programming required
Flume:
79©2014 Cloudera, Inc. All rights reserved.
“We just want to grab data
from this directory
and write it to HDFS”
We use Flume when:
80©2014 Cloudera, Inc. All rights reserved.
•  Very high-throughput publish-subscribe messaging
•  Highly available
•  Stores data and can replay
•  Can support many consumers with no extra latency
Kafka is:
81©2014 Cloudera, Inc. All rights reserved.
“Kafka is awesome.
We heard it cures cancer”
Use Kafka When:
82©2014 Cloudera, Inc. All rights reserved.
•  Use Flume with a Kafka Source
•  Allows us to get data from Kafka,
run some transformations, and
write to HDFS, HBase or Solr
Actually, why choose?
83©2014 Cloudera, Inc. All rights reserved.
•  We want to ingest events from log files
•  Flume’s Spooling Directory source fits
•  With HDFS Sink
•  We would have used Kafka if…
–  We wanted the data in non-Hadoop systems too
In Our Example…
84
Short Intro to Flume
Flume Agent components:
•  Sources – Twitter, logs, JMS, webserver, Kafka
•  Interceptors – mask, re-format, validate…
•  Selectors – DR, critical
•  Channels – memory, file, Kafka
•  Sinks – HDFS, HBase, Solr
85
Configuration
•  Declarative
– No coding required.
– Configuration specifies
how components are
wired together.
©2014 Cloudera, Inc. All Rights Reserved.
86
Interceptors
•  Mask fields
•  Validate information
against external source
•  Extract fields
•  Modify data format
•  Filter or split events
©2014 Cloudera, Inc. All rights reserved.
87©2014 Cloudera, Inc. All rights reserved.
Any sufficiently complex configuration
Is indistinguishable from code
88
A Brief Discussion of Flume Patterns – Fan-in
•  Flume agent runs on
each of our servers.
•  These client agents
send data to multiple
agents to provide
reliability.
•  Flume provides support
for load balancing.
©2014 Cloudera, Inc. All Rights Reserved.
89
A Brief Discussion of Flume Patterns – Splitting
•  Common need is to split
data on ingest.
•  For example:
– Sending data to multiple
clusters for DR.
– To multiple destinations.
•  Flume also supports
partitioning, which is key
to our implementation.
©2014 Cloudera, Inc. All Rights Reserved.
90
Flume Architecture – Client Tier
On each web server, a Flume agent reads the web logs with a Spooling Directory source, applies a Timestamp interceptor, buffers events in a File channel, and forwards them through multiple load-balanced Avro sinks.
91
Flume Architecture – Collector Tier
Multiple collector Flume agents receive events through an Avro source, buffer them in a File channel, and write to HDFS through HDFS sinks.
92©2014 Cloudera, Inc. All rights reserved.
•  Add Kafka producer to our webapp
•  Send clicks and searches as messages
•  Flume can ingest events from Kafka
•  We can add a second consumer for real-time
processing in SparkStreaming
•  Another consumer for alerting…
•  And maybe a batch consumer too
What if…. We were to use Kafka?
93
The Kafka Channel (producer side)
Kafka producers (Producer A, B, C) write events directly into the Flume agent's Kafka channel; the agent's sinks deliver them to HDFS, HBase, or Solr.
94
The Kafka Channel (consumer side)
Flume sources (Twitter, logs, JMS, webserver) and interceptors (mask, re-format, validate…) feed events into the Kafka channel, from which any number of Kafka consumers (Consumer A, B, C) can read.
95
The Kafka Channel (full agent)
A complete Flume agent using Kafka as the channel: sources (Twitter, logs, JMS, webserver), interceptors (mask, re-format, validate…), selectors (DR, critical), the Kafka channel, and sinks to HDFS, HBase, or Solr.
96
Architectural
Considerations
Data Processing – Engines
tiny.cloudera.com/app-arch-slides
97
Processing Engines
•  MapReduce
•  Abstractions
•  Spark
•  Spark Streaming
•  Impala
98
MapReduce
•  Oldie but goody
•  Restrictive Framework / Innovated Work Around
•  Extreme Batch
99
MapReduce Basic High Level
A mapper reads a block of data from HDFS (replicated) and spills temporary, partitioned, sorted data to the native file system; each reducer pulls a local copy of its partitions and writes the output file back to HDFS.
100
MapReduce Innovation
•  Mapper Memory Joins
•  Reducer Memory Joins
•  Buckets Sorted Joins
•  Cross Task Communication
•  Windowing
•  And Much More
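For instance, the "Mapper Memory Joins" item above can be sketched as a Hadoop Streaming mapper in Python: the small table is loaded into memory once per mapper and joined against records streamed on stdin (the file name users.tsv and the field layout are hypothetical):

#!/usr/bin/env python
# Hadoop Streaming mapper: map-side (in-memory) join of clicks against a small user table.
import sys

# Small side: loaded fully into memory in each mapper (shipped e.g. via -files).
users = {}
with open("users.tsv") as f:          # hypothetical lookup file: user_id \t segment
    for line in f:
        user_id, segment = line.rstrip("\n").split("\t")
        users[user_id] = segment

# Big side: click records streamed by the framework, one per line: user_id \t url
for line in sys.stdin:
    user_id, url = line.rstrip("\n").split("\t", 1)
    segment = users.get(user_id, "unknown")
    print("%s\t%s\t%s" % (user_id, segment, url))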
101
Abstractions
•  SQL
–  Hive
•  Script/Code
–  Pig: Pig Latin
–  Crunch: Java/Scala
–  Cascading: Java/Scala
102
Spark
•  The New Kid that isn’t that New Anymore
•  Easily 10x less code
•  Extremely Easy and Powerful API
•  Very good for machine learning
•  Scala, Java, and Python
•  RDDs
•  DAG Engine
103
Spark - DAG
104
Spark - DAG
Two TextFile inputs: one goes through Filter and KeyBy, the other through KeyBy; the two are then joined, filtered again, and a Take collects the result.
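The same DAG, expressed as a minimal PySpark sketch (file paths, keys and predicates are illustrative only):

from pyspark import SparkContext

sc = SparkContext(appName="dag-example")

clicks = (sc.textFile("/data/clicks.txt")              # TextFile
            .filter(lambda line: "GET" in line)        # Filter
            .keyBy(lambda line: line.split(" ")[0]))   # KeyBy (ip)

users = (sc.textFile("/data/users.txt")                # TextFile
           .keyBy(lambda line: line.split(",")[0]))    # KeyBy (ip)

result = (clicks.join(users)                           # Join
                .filter(lambda kv: kv[1][1] != "bot")  # Filter on the users side of the pair
                .take(10))                             # Take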
105
Spark - DAG
The same DAG annotated for fault recovery: partitions that completed are good; when a block is lost, only the lineage leading to it is replayed (good-replay), the lost block is recomputed, and stages that have not run yet remain future work.
106
Spark Streaming
•  Calling Spark in a Loop
•  Extends RDDs with DStream
•  Very Little Code Changes from ETL to Streaming
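A minimal PySpark Streaming sketch of that idea; the batch-style Filter and Count logic is reused almost unchanged (host, port and the 10-second batch interval are arbitrary choices):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="clickstream-streaming")
ssc = StreamingContext(sc, batchDuration=10)   # one micro-batch every 10 seconds

# Each batch becomes an RDD inside the DStream; the same Filter -> Count logic applies.
lines = ssc.socketTextStream("localhost", 9999)
lines.filter(lambda line: "GET" in line).count().pprint()

ssc.start()
ssc.awaitTermination()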
107
Spark Streaming
Micro-batch model: before the first batch closes, the receiver is already turning source data into RDDs; when the first batch completes, a single pass (Filter → Count → Print) runs over its RDDs, and the same happens for the second batch and every batch after it.
108
Spark Streaming
Stateful variant: each batch's single pass (Filter → Count → Print) also updates a stateful RDD that is carried forward from batch to batch (Stateful RDD 1 feeds into Stateful RDD 2).
109
Impala
•  MPP Style SQL Engine on top of Hadoop
•  Very Fast
•  High Concurrency
•  Analytical windowing functions (as of CDH 5.2).
110
Impala – Broadcast Join
Broadcast join: the smaller table is read once and 100% of it is cached on every Impala daemon; each daemon then streams its local blocks of the bigger table through a hash join function against the cached copy and produces its share of the output.
111
Impala – Partitioned Hash Join
Partitioned hash join: both tables are run through a hash partitioner, so each Impala daemon caches only its share (~33% with three daemons) of the smaller table; blocks of the bigger table are hash-partitioned to the matching daemon, joined there, and each daemon emits its portion of the output.
112
Impala vs Hive
•  Very different approaches
•  We may see convergence at some point
•  But for now
–  Impala for speed
–  Hive for batch
113
Architectural
Considerations
Data Processing – Patterns and
Recommendations
114
What processing needs to happen?
•  Sessionization
•  Filtering
•  Deduplication
•  BI / Discovery
115
Sessionization
Website visit
Visitor 1
Session 1
Visitor 1
Session 2
Visitor 2
Session 1
> 30 minutes
116
Why sessionize?
Helps answer questions like:
•  What is my website's bounce rate?
–  i.e., what percentage of visitors don't go past the landing page?
•  Which marketing channels (e.g. organic search, display ads, etc.) are
leading to the most sessions?
–  Which of those lead to the most conversions (e.g. people buying things,
signing up, etc.)?
•  Do attribution analysis – which channels are responsible for the most
conversions?
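As a tiny illustration of the first question: once clicks are grouped into sessions, bounce rate is a one-line computation (a sketch with made-up numbers; sessions holds the page-view count of each session):

# sessions: one entry per session = number of page views in that session (made-up data)
sessions = [1, 5, 1, 3, 2, 1]

# Bounce rate = share of sessions that never went past the landing page.
bounce_rate = sum(1 for views in sessions if views == 1) / float(len(sessions))
print("bounce rate: %.0f%%" % (100 * bounce_rate))   # prints: bounce rate: 50%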
117
Sessionization
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X
10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12+1413580110
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/
1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5;
en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0
Mobile Safari/533.1” 244.157.45.12+1413583199
118
How to Sessionize?
1.  Given a list of clicks, determine which clicks
came from the same user (Partitioning,
ordering)
2.  Given a particular user's clicks, determine if a
given click is a part of a new session or a
continuation of the previous session (Identifying
session boundaries)
119
#1 – Which clicks are from same user?
•  We can use:
–  IP address (244.157.45.12)
–  Cookies (A9A3BECE0563982D)
–  IP address (244.157.45.12)and user agent string ((KHTML, like
Gecko) Chrome/36.0.1944.0 Safari/537.36")
©2014 Cloudera, Inc. All Rights Reserved.
121
#1 – Which clicks are from same user?
©2014 Cloudera, Inc. All Rights Reserved.
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0"
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC
Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”
122
#2 – Which clicks part of the same session?
©2014 Cloudera, Inc. All Rights Reserved.
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0"
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC
Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”
> 30 mins apart = different
sessions
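Putting #1 and #2 together, a minimal standalone Python sketch of the 30-minute session-boundary rule (the tutorial's actual implementations are MapReduce and Spark jobs on the book's GitHub; here clicks is simply a list of (user_key, epoch_seconds) pairs):

from itertools import groupby

SESSION_TIMEOUT = 30 * 60   # 30 minutes, in seconds

def sessionize(clicks):
    """clicks: iterable of (user_key, epoch_seconds). Returns {user_key: [[ts, ...], ...]}."""
    sessions = {}
    for user, events in groupby(sorted(clicks), key=lambda c: c[0]):    # step 1: partition + order
        user_sessions, current = [], []
        last_ts = None
        for _, ts in events:
            if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:  # step 2: session boundary
                user_sessions.append(current)
                current = []
            current.append(ts)
            last_ts = ts
        user_sessions.append(current)
        sessions[user] = user_sessions
    return sessions

# The two sample clicks above are about 51 minutes apart, so they fall into two sessions.
print(sessionize([("244.157.45.12", 1413580110), ("244.157.45.12", 1413583199)]))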
123©2014 Cloudera, Inc. All rights reserved.
Sessionization engine recommendation
•  We have sessionization code in MR and Spark on GitHub. The
complexity of the code varies; the right choice depends on the expertise in the
organization.
•  We choose MR
–  MR API is stable and widely known
–  No Spark + Oozie (orchestration engine) integration currently
124
Filtering – filter out incomplete records
©2014 Cloudera, Inc. All Rights Reserved.
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0"
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…
125
Filtering – filter out records from bots/spiders
©2014 Cloudera, Inc. All Rights Reserved.
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0"
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC
Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”
Google spider IP address
126©2014 Cloudera, Inc. All rights reserved.
Filtering recommendation
•  Bot/Spider filtering can be done easily in any of the engines
•  Incomplete records are harder to filter in schema systems like
Hive, Impala, Pig, etc.
•  Flume interceptors can also be used
•  Pretty close choice between MR, Hive and Spark
•  Can be done in Spark using rdd.filter()
•  We can simply embed this in our MR sessionization job
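For example, the Spark version of both filters is just a couple of rdd.filter() calls (this reuses the parse_line() helper sketched earlier; the bot IP comes from the example above, and sc is an existing SparkContext):

BOT_IPS = {"209.85.238.11"}      # known spider/bot addresses (illustrative; real lists are longer)

raw = sc.textFile("/etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10")

cleaned = (raw
           .filter(lambda line: parse_line(line) is not None)          # drop incomplete records
           .filter(lambda line: line.split(" ")[0] not in BOT_IPS))    # drop bot/spider traffic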
127
Deduplication – remove duplicate records
©2014 Cloudera, Inc. All Rights Reserved.
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
128©2014 Cloudera, Inc. All rights reserved.
Deduplication recommendation
•  Can be done in all engines.
•  We already have a Hive table with all the columns; a simple
DISTINCT query will perform deduplication
•  distinct() (or a reduceByKey()) in Spark
•  We use Pig
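For comparison, the same deduplication expressed in PySpark (the talk's pipeline uses Pig for this step; cleaned is the filtered RDD from the previous sketch):

# Exact-duplicate log lines collapse to a single record.
deduped = cleaned.distinct()

# The same idea spelled out as a key/value shuffle, closer to a reduce-style approach:
deduped_kv = (cleaned.map(lambda line: (line, None))
                     .reduceByKey(lambda a, b: a)
                     .keys())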
129©2014 Cloudera, Inc. All rights reserved.
BI/Discovery engine recommendation
•  Main requirements for this are:
–  Low latency
–  SQL interface (e.g. JDBC/ODBC)
–  Users don’t know how to code
•  We chose Impala
–  It’s a SQL engine
–  Much faster than other engines
–  Provides standard JDBC/ODBC interfaces
130©2014 Cloudera, Inc. All rights reserved.
End-to-end processing
Deduplication → Filtering → Sessionization →
BI tools
131
Architectural
Considerations
Orchestration
132
•  Data arrives through Flume
•  Triggers a processing event:
–  Sessionize
–  Enrich – Location, marketing channel…
–  Store as Parquet
•  Each day we process events from the previous day
Orchestrating Clickstream
133©2014 Cloudera, Inc. All rights reserved.
•  Workflow is fairly simple
•  Need to trigger workflow based on data
•  Be able to recover from errors
•  Perhaps notify on the status
•  And collect metrics for reporting
Choosing Right
134
Oozie or Azkaban?
©2014 Cloudera, Inc. All rights reserved.
135©2014 Cloudera, Inc. All rights reserved.
Oozie Architecture
136©2014 Cloudera, Inc. All rights reserved.
•  Part of all major Hadoop distributions
•  Hue integration
•  Built-in actions – Hive, Sqoop, MapReduce, SSH
•  Complex workflows with decisions
•  Event and time based scheduling
•  Notifications
•  SLA Monitoring
•  REST API
Oozie features
137©2014 Cloudera, Inc. All rights reserved.
•  Overhead in launching jobs
•  Steep learning curve
•  XML Workflows
Oozie Drawbacks
138©2014 Cloudera, Inc. All rights reserved.
Azkaban Architecture
Azkaban Web Server (with the HDFS viewer plugin) and Azkaban Executor Server (with the job types plugin), both backed by MySQL; the client talks to the web server, and the executor runs jobs on Hadoop.
139©2014 Cloudera, Inc. All rights reserved.
•  Simplicity
•  Great UI – including pluggable visualizers
•  Lots of plugins – Hive, Pig…
•  Reporting plugin
Azkaban features
140©2014 Cloudera, Inc. All rights reserved.
•  Doesn’t support workflow decisions
•  Can’t represent data dependency
Azkaban Limitations
141©2014 Cloudera, Inc. All rights reserved.
•  Workflow is fairly simple
•  Need to trigger workflow based on data
•  Be able to recover from errors
•  Perhaps notify on the status
•  And collect metrics for reporting
Choosing…
Easier in Oozie
142©2014 Cloudera, Inc. All rights reserved.
•  Workflow is fairly simple
•  Need to trigger workflow based on data
•  Be able to recover from errors
•  Perhaps notify on the status
•  And collect metrics for reporting
Choosing the right Orchestration Tool
Better in Azkaban
143©2014 Cloudera, Inc. All rights reserved.
The best orchestration tool
is the one you are an expert on
Important Decision Consideration!
144©2014 Cloudera, Inc. All rights reserved.
Orchestration Patterns – Fan Out
145©2014 Cloudera, Inc. All rights reserved.
Capture & Decide Pattern
146
Putting It All
Together
Final Architecture
147
Final Architecture – High Level Overview
Data
Sources
Ingestion
Raw Data
Storage
(Formats,
Schema)
Processed
Data Storage
(Formats,
Schema)
©2014 Cloudera, Inc. All Rights Reserved.
Processing
Data
Consumption
Orchestration
(Scheduling,
Managing,
Monitoring)
148
Final Architecture – High Level Overview
Data
Sources
Ingestion
Raw Data
Storage
(Formats,
Schema)
Processed
Data Storage
(Formats,
Schema)
©2014 Cloudera, Inc. All Rights Reserved.
Processing
Data
Consumption
Orchestration
(Scheduling,
Managing,
Monitoring)
149
Final Architecture – Ingestion/Storage
(Diagram: web servers each running a client Flume agent, fanning in to a smaller tier of collector Flume agents)
Fan-in
Pattern
Multi Agents for
Failover and rolling
restarts
HDFS
©2014 Cloudera, Inc. All Rights Reserved.
/etl/BI/casualcyclist/
clicks/rawlogs/
year=2014/month=10/
day=10!
150
Final Architecture – High Level Overview
Data
Sources
Ingestion
Raw Data
Storage
(Formats,
Schema)
Processed
Data Storage
(Formats,
Schema)
©2014 Cloudera, Inc. All Rights Reserved.
Processing
Data
Consumption
Orchestration
(Scheduling,
Managing,
Monitoring)
151
Final Architecture – Processing and Storage
/etl/BI/
casualcyclist/
clicks/rawlogs/
year=2014/
month=10/day=10
…
dedup->filtering-
>sessionization
/data/bikeshop/
clickstream/
year=2014/
month=10/day=10
…
©2014 Cloudera, Inc. All Rights Reserved.
parquetize
152
Final Architecture – High Level Overview
Data
Sources
Ingestion
Raw Data
Storage
(Formats,
Schema)
Processed
Data Storage
(Formats,
Schema)
©2014 Cloudera, Inc. All Rights Reserved.
Processing
Data
Consumption
Orchestration
(Scheduling,
Managing,
Monitoring)
153
Final Architecture – Data Access
Hive/Impala expose the data to BI/analytics tools over JDBC/ODBC; Sqoop or a DB import tool exports data to the DWH; data can also be pulled to local disk for R, etc.
©2014 Cloudera, Inc. All Rights Reserved.
154
Demo
©2014 Cloudera, Inc. All rights reserved.
155
Join the Discussion
Get community
help or provide
feedback
cloudera.com/community
156 Built for Production Success. Powered by Apache Hadoop
157 Try Cloudera Live today: cloudera.com/live
158
BOOK SIGNINGS THEATER SESSIONS
TECHNICAL DEMOS GIVEAWAYS
Visit us at
Booth
#101
HIGHLIGHTS:
Apache Kafka is now fully
supported with Cloudera
Learn why Cloudera is
the leader for security &
governance in Hadoop
159
Free books and Lots of Q&A!
•  Book signings
–  Wednesday, May 6th, 3:15 PM at Cloudera booth (#101)
–  Wednesday, May 6th, 6PM at O’Reilly booth
•  Ask Us Anything!
–  Wednesday, May 6th, 14:35-15:15 at Westminster Suite
•  Office hours!
–  Thursday, May 7th, 11:45-12:25 at Table A
•  Stay in touch!
–  @hadooparchbook
–  hadooparchitecturebook.com
–  slideshare.com/hadooparchbook
•  Please leave us a review!
©2014 Cloudera, Inc. All Rights Reserved.

TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Kürzlich hochgeladen (20)

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Hadoop Application Architectures tutorial - Strata London

  • 20. 20 Effects of Pre-Hadoop Architecture •  Regenerating aggregates is expensive or worse, impossible •  Can’t correct bugs in the workflow/aggregation logic •  Can’t do experiments on existing data ©2014 Cloudera, Inc. All Rights Reserved.
  • 21. 21 Why is Hadoop A Great Fit? Clickstream Analysis
  • 22. 22 Why is Hadoop a great fit? •  Volume of clickstream data is huge •  Velocity at which it comes in is high •  Variety of data is diverse - semi-structured data •  Hadoop enables –  active archival of data –  Aggregation jobs –  Querying the above aggregates or full-fidelity raw data ©2014 Cloudera, Inc. All Rights Reserved.
  • 23. 23 Click Stream Analysis (with Hadoop) ©2014 Cloudera, Inc. All Rights Reserved. Web logs Hadoop Business Intelligence Active archive (no tape) Aggregation engine Querying engine
  • 24. 24 Challenges of Hadoop Implementation ©2014 Cloudera, Inc. All Rights Reserved.
  • 25. 25 Challenges of Hadoop Implementation ©2014 Cloudera, Inc. All Rights Reserved.
  • 26. 26 Other challenges - Architectural Considerations •  Storage managers? –  HDFS? HBase? •  Data storage and modeling: –  File formats? Compression? Schema design? •  Data movement –  How do we actually get the data into Hadoop? How do we get it out? •  Metadata –  How do we manage data about the data? •  Data access and processing –  How will the data be accessed once in Hadoop? How can we transform it? How do we query it? •  Orchestration –  How do we manage the workflow for all of this? ©2014 Cloudera, Inc. All Rights Reserved.
  • 28. 28 Overview of Requirements Data Sources Ingestion Raw Data Storage (Formats, Schema) Processed Data Storage (Formats, Schema) ©2014 Cloudera, Inc. All Rights Reserved. Processing Data Consumption Orchestration (Scheduling, Managing, Monitoring)
  • 30. 30 Data Ingestion Requirements ©2014 Cloudera, Inc. All Rights Reserved. [Diagram: farms of web servers emitting logs (e.g. 244.157.45.12 - - [17/Oct/2014:21:08:30] "GET /seatposts HTTP/1.0" 200 4463 …) plus CRM data in an ODS, all feeding into Hadoop]
  • 31. 31 Data Ingestion Requirements •  So we need to be able to support: –  Reliable ingestion of large volumes of semi-structured event data arriving with high velocity (e.g. logs). –  Timeliness of data availability – data needs to be available for processing to meet business service level agreements. –  Periodic ingestion of data from relational data stores. ©2014 Cloudera, Inc. All Rights Reserved.
  • 33. 33 Data Storage Requirements ©2014 Cloudera, Inc. All Rights Reserved. Store all the data Make the data accessible for processing Compress the data
  • 35. 35 Processing requirements Be able to answer questions like: •  What is my website’s bounce rate? –  i.e. how many % of visitors don’t go past the landing page? •  Which marketing channels are leading to most sessions? •  Do attribution analysis –  Which channels are responsible for most conversions? ©2014 Cloudera, Inc. All Rights Reserved.
  • 36. 36 Sessionization ©2014 Cloudera, Inc. All Rights Reserved. Website visit Visitor 1 Session 1 Visitor 1 Session 2 Visitor 2 Session 1 > 30 minutes
  • 38. 38 Orchestration is simple We just need to execute actions One after another ©2014 Cloudera, Inc. All Rights Reserved.
  • 39. 39©2014 Cloudera, Inc. All Rights Reserved. Actually, we also need to handle errors And user notifications ….
  • 40. 40©2014 Cloudera, Inc. All Rights Reserved. And… •  Re-start workflows after errors •  Reuse of actions in multiple workflows •  Complex workflows with decision points •  Trigger actions based on events •  Tracking metadata •  Integration with enterprise software •  Data lifecycle •  Data quality control •  Reports
  • 41. 41©2014 Cloudera, Inc. All Rights Reserved. OK, maybe we need a product To help us do all that
  • 43. 43 Data Modeling Considerations •  We need to consider the following in our architecture: –  Storage layer – HDFS? HBase? Etc. –  File system schemas – how will we lay out the data? –  File formats – what storage formats to use for our data, both raw and processed data? –  Data compression formats? ©2014 Cloudera, Inc. All Rights Reserved.
  • 45. 45 Data Storage Layer Choices •  Two likely choices for raw data: ©2014 Cloudera, Inc. All Rights Reserved.
  • 46. 46 Data Storage Layer Choices •  HDFS – stores data directly as files, fast scans, poor random reads/writes •  HBase – stores data as HFiles on HDFS, slow scans, fast random reads/writes ©2014 Cloudera, Inc. All Rights Reserved.
  • 47. 47 Data Storage – Storage Manager Considerations •  Incoming raw data: –  Processing requirements call for batch transformations across multiple records – for example sessionization. •  Processed data: –  Access to processed data will be via things like analytical queries – again requiring access to multiple records. •  We choose HDFS –  Processing needs in this case served better by fast scans. ©2014 Cloudera, Inc. All Rights Reserved.
  • 49. 49 Storage Formats – Raw Data and Processed Data ©2014 Cloudera, Inc. All Rights Reserved. Processed Data Raw Data
  • 50. 50 Data Storage – Format Considerations Click to enter confidentiality information Logs (plain text)
  • 51. 51 Data Storage – Format Considerations Click to enter confidentiality information [Diagram: many plain-text log files piling up (Logs, Logs, Logs…)]
  • 52. 52 Data Storage – Compression Click to enter confidentiality information [Diagram: compression codec logos with verdicts – snappy: "Well, maybe. But not splittable."; others: "Splittable. Getting better…", "Hmmm…. Splittable, but no..."]
  • 53. 53 Raw Data Storage – More About Snappy •  Designed at Google to provide high compression speeds with reasonable compression. •  Not the highest compression, but provides very good performance for processing on Hadoop. •  Snappy is not splittable though, which brings us to… ©2014 Cloudera, Inc. All Rights Reserved.
  • 54. 54 Hadoop File Types •  Formats designed specifically to store and process data on Hadoop: –  File based – SequenceFile –  Serialization formats – Thrift, Protocol Buffers, Avro –  Columnar formats – RCFile, ORC, Parquet Click to enter confidentiality information
  • 55. 55 SequenceFile •  Stores records as binary key/value pairs. •  SequenceFile “blocks” can be compressed. •  This enables splittability with non- splittable compression. ©2014 Cloudera, Inc. All Rights Reserved.
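To make the key/value idea concrete, here is a minimal PySpark sketch (not the tutorial's own code) that writes and reads back a SequenceFile; it assumes a running SparkContext `sc` and an illustrative HDFS path:

```python
# Minimal sketch, assuming a PySpark SparkContext `sc`; the path is illustrative.
pairs = sc.parallelize([
    ("244.157.45.12", "GET /seatposts 200"),        # key = client IP, value = request summary
    ("244.157.45.12", "GET /Store/cart.jsp 200"),
])

# Each record is stored as a binary key/value pair; SequenceFile block compression
# keeps the file splittable even with a non-splittable codec like Snappy.
pairs.saveAsSequenceFile("/etl/example/clicks_seq")

# Reading it back returns the same (key, value) pairs.
print(sc.sequenceFile("/etl/example/clicks_seq").take(2))
```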
  • 56. 56 Avro •  Kinda SequenceFile on Steroids. •  Self-documenting – stores schema in header. •  Provides very efficient storage. •  Supports splittable compression. ©2014 Cloudera, Inc. All Rights Reserved.
  • 57. 57 Our Format Recommendations for Raw Data… •  Avro with Snappy –  Snappy provides optimized compression. –  Avro provides compact storage, self-documenting files, and supports schema evolution. –  Avro also provides better failure handling than other choices. •  SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem. –  But only supports Java. ©2014 Cloudera, Inc. All Rights Reserved.
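As a hedged illustration of that recommendation (not the tutorial's code), the sketch below writes a few click events as Snappy-compressed Avro with the fastavro library; the schema and field names are invented for the example, and python-snappy must be installed for the snappy codec:

```python
# Minimal sketch: write raw click events as Snappy-compressed Avro.
# Assumes fastavro and python-snappy are installed; the schema is illustrative,
# not the schema used in the tutorial's own code.
from fastavro import writer, parse_schema

schema = parse_schema({
    "name": "ClickEvent",
    "type": "record",
    "fields": [
        {"name": "ip", "type": "string"},
        {"name": "ts", "type": "long"},
        {"name": "url", "type": "string"},
        {"name": "referrer", "type": ["null", "string"], "default": None},
        {"name": "user_agent", "type": ["null", "string"], "default": None},
    ],
})

events = [
    {"ip": "244.157.45.12", "ts": 1413580110, "url": "/seatposts",
     "referrer": "http://bestcyclingreviews.com/top_online_shops",
     "user_agent": "Mozilla/5.0 ..."},
]

with open("clicks.avro", "wb") as out:
    # codec="snappy" compresses per Avro block, so the file remains splittable
    # and the schema travels in the file header.
    writer(out, schema, events, codec="snappy")
```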
  • 58. 58 But Note… •  For simplicity, we’ll use plain text for raw data in our example. ©2014 Cloudera, Inc. All Rights Reserved.
  • 60. 60 Storage Formats – Raw Data and Processed Data ©2014 Cloudera, Inc. All Rights Reserved. Processed Data Raw Data
  • 61. 61 Access to Processed Data ©2014 Cloudera, Inc. All Rights Reserved. [Diagram: a table with Column A–Column D full of values, accessed by Analytical Queries]
  • 62. 62 Columnar Formats •  Eliminates I/O for columns that are not part of a query. •  Works well for queries that access a subset of columns. •  Often provide better compression. •  These add up to dramatically improved performance for many queries. ©2014 Cloudera, Inc. All Rights Reserved. [Example: row layout stores (1, 2014-10-13, abc), (2, 2014-10-14, def), (3, 2014-10-15, ghi); columnar layout stores (1, 2, 3), (2014-10-13, 2014-10-14, 2014-10-15), (abc, def, ghi)]
  • 63. 63 Columnar Choices – RCFile •  Designed to provide efficient processing for Hive queries. •  Only supports Java. •  No Avro support. •  Limited compression support. •  Sub-optimal performance compared to newer columnar formats. ©2014 Cloudera, Inc. All Rights Reserved.
  • 64. 64 Columnar Choices – ORC •  A better RCFile. •  Also designed to provide efficient processing of Hive queries. •  Only supports Java. ©2014 Cloudera, Inc. All Rights Reserved.
  • 65. 65 Columnar Choices – Parquet •  Designed to provide efficient processing across Hadoop programming interfaces – MapReduce, Hive, Impala, Pig. •  Multiple language support – Java, C++ •  Good object model support, including Avro. •  Broad vendor support. •  These features make Parquet a good choice for our processed data. ©2014 Cloudera, Inc. All Rights Reserved.
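A small PySpark sketch of why Parquet pays off for the processed data (a SparkSession `spark` is assumed, and the path and column names are illustrative, not from the tutorial): only the columns named in the query are read from disk.

```python
# Minimal sketch, assuming a SparkSession `spark`; path and columns are illustrative.
sessions = spark.read.parquet("/data/bikeshop/clickstream")

# Because Parquet is columnar, this query only reads the `referrer` and
# `session_id` column chunks, skipping wide columns like user_agent or url.
top_referrers = (sessions
                 .select("referrer", "session_id")
                 .groupBy("referrer")
                 .count()
                 .orderBy("count", ascending=False)
                 .limit(10))

top_referrers.show()
```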
  • 67. 67 HDFS Schema Design – One Recommendation /etl – Data in various stages of ETL workflow /data – processed data to be shared with the entire organization /tmp – temp data from tools or shared between users /user/<username> - User specific data, jars, conf files /app – Everything but data: UDF jars, HQL files, Oozie workflows ©2014 Cloudera, Inc. All Rights Reserved.
  • 68. 68 Partitioning •  Split the dataset into smaller consumable chunks. •  Rudimentary form of “indexing”. Reduces I/O needed to process queries. ©2014 Cloudera, Inc. All Rights Reserved.
  • 69. 69 Partitioning ©2014 Cloudera, Inc. All Rights Reserved. [Example: un-partitioned HDFS directory structure – dataset/file1.txt, dataset/file2.txt, …, dataset/filen.txt; partitioned HDFS directory structure – dataset/col=val1/file.txt, dataset/col=val2/file.txt, …, dataset/col=valn/file.txt]
  • 70. 70 Partitioning considerations •  What column to partition by? –  Don’t have too many partitions (<10,000) –  Don’t have too many small files in the partitions –  Good to have partition sizes at least ~1 GB, generally a multiple of block size. •  We’ll partition by timestamp. This applies to both our raw and processed data. ©2014 Cloudera, Inc. All Rights Reserved.
  • 71. 71 Partitioning For Our Case Study •  Raw dataset: –  /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10 •  Processed dataset: –  /data/bikeshop/clickstream/year=2014/month=10/day=10 ©2014 Cloudera, Inc. All Rights Reserved.
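One way to produce exactly that processed-data layout from PySpark – a sketch under assumptions (a SparkSession, and a DataFrame `sessionized` that already carries year/month/day columns), not the tutorial's own job:

```python
# Minimal sketch: write processed clickstream data partitioned by date so files
# land under /data/bikeshop/clickstream/year=.../month=.../day=... directories.
# Assumes `sessionized` is a DataFrame with year, month and day columns.
(sessionized
    .write
    .mode("append")
    .partitionBy("year", "month", "day")
    .parquet("/data/bikeshop/clickstream"))
```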
  • 73. 73©2014 Cloudera, Inc. All rights reserved. •  Omniture data on FTP •  Apps •  App Logs •  RDBMS Typical Clickstream data sources
  • 74. 74©2014 Cloudera, Inc. All rights reserved. Getting Files from FTP
  • 75. 75©2014 Cloudera, Inc. All rights reserved. curl ftp://myftpsite.com/sitecatalyst/myreport_2014-10-05.tar.gz --user name:password | hdfs dfs -put - /etl/clickstream/raw/sitecatalyst/myreport_2014-10-05.tar.gz Don't over-complicate things
  • 76. 76©2014 Cloudera, Inc. All rights reserved. Apache NiFi
  • 77. 77©2014 Cloudera, Inc. All rights reserved. Reliable, distributed and highly available systems That allow streaming events to Hadoop Event Streaming – Flume and Kafka
  • 78. 78©2014 Cloudera, Inc. All rights reserved. •  Many available data collection sources •  Well integrated into Hadoop •  Supports file transformations •  Can implement complex topologies •  Very low latency •  No programming required Flume:
  • 79. 79©2014 Cloudera, Inc. All rights reserved. “We just want to grab data from this directory and write it to HDFS” We use Flume when:
  • 80. 80©2014 Cloudera, Inc. All rights reserved. •  Very high-throughput publish-subscribe messaging •  Highly available •  Stores data and can replay •  Can support many consumers with no extra latency Kafka is:
  • 81. 81©2014 Cloudera, Inc. All rights reserved. “Kafka is awesome. We heard it cures cancer” Use Kafka When:
  • 82. 82©2014 Cloudera, Inc. All rights reserved. •  Use Flume with a Kafka Source •  Allows you to get data from Kafka, run some transformations, and write to HDFS, HBase or Solr Actually, why choose?
  • 83. 83©2014 Cloudera, Inc. All rights reserved. •  We want to ingest events from log files •  Flume’s Spooling Directory source fits •  With HDFS Sink •  We would have used Kafka if… –  We wanted the data in non-Hadoop systems too In Our Example…
  • 84. 84 Short Intro to Flume – components of a Flume Agent: Sources (Twitter, logs, JMS, webserver, Kafka), Interceptors (mask, re-format, validate…), Selectors (DR, critical), Channels (memory, file, Kafka), Sinks (HDFS, HBase, Solr)
  • 85. 85 Configuration •  Declarative – No coding required. – Configuration specifies how components are wired together. ©2014 Cloudera, Inc. All Rights Reserved.
  • 86. 86 Interceptors •  Mask fields •  Validate information against external source •  Extract fields •  Modify data format •  Filter or split events ©2014 Cloudera, Inc. All rights reserved.
  • 87. 87©2014 Cloudera, Inc. All rights reserved. Any sufficiently complex configuration Is indistinguishable from code
  • 88. 88 A Brief Discussion of Flume Patterns – Fan-in •  Flume agent runs on each of our servers. •  These client agents send data to multiple agents to provide reliability. •  Flume provides support for load balancing. ©2014 Cloudera, Inc. All Rights Reserved.
  • 89. 89 A Brief Discussion of Flume Patterns – Splitting •  Common need is to split data on ingest. •  For example: – Sending data to multiple clusters for DR. – To multiple destinations. •  Flume also supports partitioning, which is key to our implementation. ©2014 Cloudera, Inc. All Rights Reserved.
  • 90. 90 Flume Architecture – Client Tier [Diagram: a Flume Agent on each Web Server reads Web Logs with a Spooling Dir Source, stamps events with a Timestamp Interceptor, buffers them in a File Channel, and forwards them through multiple Avro Sinks]
  • 91. 91 Flume Architecture – Collector Tier [Diagram: collector Flume Agents receive events on an Avro Source, buffer them in a File Channel, and write to HDFS through HDFS Sinks]
  • 92. 92©2014 Cloudera, Inc. All rights reserved. •  Add Kafka producer to our webapp •  Send clicks and searches as messages •  Flume can ingest events from Kafka •  We can add a second consumer for real-time processing in SparkStreaming •  Another consumer for alerting… •  And maybe a batch consumer too What if…. We were to use Kafka?
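For the "what if we used Kafka" variant, a hedged sketch of the webapp side using the kafka-python client; the broker address, topic name and event fields are assumptions for illustration, not part of the tutorial:

```python
# Minimal sketch, assuming kafka-python is installed and a broker at
# kafkabroker:9092; the topic and event fields are illustrative.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafkabroker:9092"],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def track_click(ip, url, referrer, user_agent):
    # Publish the click; Flume (Kafka channel/source), Spark Streaming, an
    # alerting consumer and a batch consumer can all read the same topic.
    producer.send("clickstream", {
        "ip": ip,
        "url": url,
        "referrer": referrer,
        "user_agent": user_agent,
    })

track_click("244.157.45.12", "/seatposts",
            "http://bestcyclingreviews.com/top_online_shops", "Mozilla/5.0 ...")
producer.flush()
```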
  • 93. 93 The Kafka Channel [Diagram: Kafka Producers (Producer A, B, C) publish directly into Kafka used as the Flume Channel; the Flume Agent then only needs Sinks to deliver to HDFS, HBase, Solr]
  • 94. 94 The Kafka Channel [Diagram: a Flume Agent with Sources (Twitter, logs, JMS, webserver) and Interceptors (mask, re-format, validate…) writes into Kafka used as the Channel, from which Kafka Consumers (Consumer A, B, C) read]
  • 95. 95 The Kafka Channel [Diagram: a full Flume Agent – Sources (Twitter, logs, JMS, webserver), Interceptors (mask, re-format, validate…), Selectors (DR, critical), Kafka as the Channel, and Sinks to HDFS, HBase, Solr]
  • 96. 96 Architectural Considerations Data Processing – Engines tiny.cloudera.com/app-arch-slides
  • 97. 97 Processing Engines •  MapReduce •  Abstractions •  Spark •  Spark Streaming •  Impala Confidentiality Information Goes Here
  • 98. 98 MapReduce •  Oldie but goody •  Restrictive Framework / Innovative Workarounds •  Extreme Batch Confidentiality Information Goes Here
  • 99. 99 MapReduce Basic High Level Confidentiality Information Goes Here [Diagram: a Mapper reads a Block of Data from HDFS (replicated), writes Temp Spill Data and Partitioned Sorted Data to the Native File System, Reducers fetch a Local Copy of that data and write the Output File back to HDFS]
  • 100. 100 MapReduce Innovation •  Mapper Memory Joins •  Reducer Memory Joins •  Buckets Sorted Joins •  Cross Task Communication •  Windowing •  And Much More Confidentiality Information Goes Here
  • 101. 101 Abstractions •  SQL –  Hive •  Script/Code –  Pig: Pig Latin –  Crunch: Java/Scala –  Cascading: Java/Scala Confidentiality Information Goes Here
  • 102. 102 Spark •  The New Kid that isn’t that New Anymore •  Easily 10x less code •  Extremely Easy and Powerful API •  Very good for machine learning •  Scala, Java, and Python •  RDDs •  DAG Engine Confidentiality Information Goes Here
  • 103. 103 Spark - DAG Confidentiality Information Goes Here
  • 104. 104 Spark - DAG Confidentiality Information Goes Here [Diagram: example DAG – two TextFile inputs go through Filter and KeyBy stages, are Joined, then Filtered, then Take]
  • 105. 105 Spark - DAG Confidentiality Information Goes Here [Diagram: the same DAG annotated for recovery – completed partitions are Good, a Lost Block is rebuilt by replaying only its lineage (Good-Replay), and partitions not yet computed are Future]
  • 106. 106 Spark Streaming •  Calling Spark in a Loop •  Extends RDDs with DStream •  Very Little Code Changes from ETL to Streaming Confidentiality Information Goes Here
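To show how little changes between batch and streaming code, a DStream sketch under assumptions (a SparkContext `sc`, and a socket source on port 9999 standing in for the real click feed); this is not the tutorial's code:

```python
# Minimal sketch using the classic DStream API: the same filter/count logic as
# a batch job, re-run on every 30-second micro-batch.
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 30)                      # 30-second batches
clicks = ssc.socketTextStream("localhost", 9999)    # stand-in for the click feed

# Count non-bot clicks per batch; a batch job would apply the same
# filter/count to an RDD of log lines.
clicks.filter(lambda line: "Googlebot" not in line) \
      .count() \
      .pprint()

ssc.start()
ssc.awaitTermination()
```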
  • 107. 107 Spark Streaming Confidentiality Information Goes Here [Diagram: a Receiver turns the Source into one RDD per batch; each batch (pre-first, first, second) runs the same single-pass Filter → Count → Print pipeline]
  • 108. 108 Spark Streaming Confidentiality Information Goes Here [Diagram: the same pipeline with state – each batch's Filter → Count updates a Stateful RDD (Stateful RDD 1, then Stateful RDD 2) that carries results across batches before Print]
  • 109. 109 Impala Confidentiality Information Goes Here •  MPP Style SQL Engine on top of Hadoop •  Very Fast •  High Concurrency •  Analytical windowing functions (as of CDH 5.2).
  • 110. 110 Impala – Broadcast Join Confidentiality Information Goes Here [Diagram: the smaller table is broadcast so every Impala Daemon caches 100% of it; each daemon streams its local Bigger Table data blocks through a Hash Join Function against the cached table and produces Output]
  • 111. 111 Impala – Partitioned Hash Join Confidentiality Information Goes Here [Diagram: a Hash Partitioner splits the smaller table so each Impala Daemon caches only ~33% of it; Bigger Table data blocks are hash-partitioned to the daemon holding the matching partition, whose Hash Join Function produces Output]
  • 112. 112 Impala vs Hive Confidentiality Information Goes Here •  Very different approaches •  We may see convergence at some point •  But for now –  Impala for speed –  Hive for batch
  • 114. 114 What processing needs to happen? Confidentiality Information Goes Here •  Sessionization •  Filtering •  Deduplication •  BI / Discovery
  • 115. 115 Sessionization Confidentiality Information Goes Here Website visit Visitor 1 Session 1 Visitor 1 Session 2 Visitor 2 Session 1 > 30 minutes
  • 116. 116 Why sessionize? Confidentiality Information Goes Here Helps answers questions like: •  What is my website’s bounce rate? –  i.e. how many % of visitors don’t go past the landing page? •  Which marketing channels (e.g. organic search, display ad, etc.) are leading to most sessions? –  Which ones of those lead to most conversions (e.g. people buying things, signing up, etc.) •  Do attribution analysis – which channels are responsible for most conversions?
  • 117. 117 Sessionization Confidentiality Information Goes Here 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12+1413580110 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/ 1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” 244.157.45.12+1413583199
  • 118. 118 How to Sessionize? Confidentiality Information Goes Here 1.  Given a list of clicks, determine which clicks came from the same user (Partitioning, ordering) 2.  Given a particular user's clicks, determine if a given click is a part of a new session or a continuation of the previous session (Identifying session boundaries)
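A hedged PySpark sketch of those two steps (the tutorial's own implementations are the MR and Spark jobs on its GitHub repo; this version assumes an RDD `parsed_clicks` of pre-parsed (ip, user_agent, timestamp, log_line) tuples and only shows the partition/order/boundary logic):

```python
# Minimal sketch of sessionization: group clicks by a user key, sort by time,
# and start a new session whenever the gap exceeds 30 minutes.
SESSION_TIMEOUT = 30 * 60  # seconds

def assign_sessions(user_key, clicks):
    """clicks: iterable of (timestamp, log_line) tuples for one user key."""
    session_id = 0
    last_ts = None
    for ts, line in sorted(clicks):              # step 2 needs time ordering
        if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
            session_id += 1                      # boundary: start a new session
        last_ts = ts
        yield (user_key, session_id, ts, line)

# Step 1: key each click by (ip, user_agent) so one user's clicks land together.
sessionized = (parsed_clicks
    .map(lambda c: ((c[0], c[1]), (c[2], c[3])))
    .groupByKey()
    .flatMap(lambda kv: assign_sessions(kv[0], kv[1])))
```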
  • 119. 119 #1 – Which clicks are from same user? •  We can use: –  IP address (244.157.45.12) –  Cookies (A9A3BECE0563982D) –  IP address (244.157.45.12)and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36") ©2014 Cloudera, Inc. All Rights Reserved.
  • 120. 120 #1 – Which clicks are from same user? •  We can use: –  IP address (244.157.45.12) –  Cookies (A9A3BECE0563982D) –  IP address (244.157.45.12)and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36") ©2014 Cloudera, Inc. All Rights Reserved.
  • 121. 121 #1 – Which clicks are from same user? ©2014 Cloudera, Inc. All Rights Reserved. 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”
  • 122. 122 #2 – Which clicks part of the same session? ©2014 Cloudera, Inc. All Rights Reserved. 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” > 30 mins apart = different sessions
  • 123. 123©2014 Cloudera, Inc. All rights reserved. Sessionization engine recommendation •  We have sessionization code in MR and Spark on github. The complexity of the code varies depending on the expertise in the organization. •  We choose MR –  MR API is stable and widely known –  No Spark + Oozie (orchestration engine) integration currently
  • 124. 124 Filtering – filter out incomplete records ©2014 Cloudera, Inc. All Rights Reserved. 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…
  • 125. 125 Filtering – filter out records from bots/spiders ©2014 Cloudera, Inc. All Rights Reserved. 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” Google spider IP address
  • 126. 126©2014 Cloudera, Inc. All rights reserved. Filtering recommendation •  Bot/Spider filtering can be done easily in any of the engines •  Incomplete records are harder to filter in schema systems like Hive, Impala, Pig, etc. •  Flume interceptors can also be used •  Pretty close choice between MR, Hive and Spark •  Can be done in Spark using rdd.filter() •  We can simply embed this in our MR sessionization job
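A sketch of what that rdd.filter() could look like; the bot IP list and the "incomplete record" rule below are illustrative assumptions, not the tutorial's actual rules:

```python
# Minimal sketch: drop known bot/spider IPs and records missing required fields.
import re

BOT_IPS = {"209.85.238.11"}              # e.g. a known Google spider IP
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')

def is_complete(line):
    # A record is "complete" if it matches the Combined Log Format prefix.
    return LOG_PATTERN.match(line) is not None

def not_a_bot(line):
    ip = line.split(" ", 1)[0]
    return ip not in BOT_IPS

clean_logs = (raw_logs                    # RDD of raw log lines (assumed)
              .filter(is_complete)
              .filter(not_a_bot))
```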
  • 127. 127 Deduplication – remove duplicate records ©2014 Cloudera, Inc. All Rights Reserved. 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
  • 128. 128©2014 Cloudera, Inc. All rights reserved. Deduplication recommendation •  Can be done in all engines. •  We already have a Hive table with all the columns; a simple DISTINCT query will perform deduplication •  distinct() / reduceByKey() in Spark •  We use Pig
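For comparison, a Spark version of the same idea is one line – a sketch, not the Pig/Hive job the tutorial actually uses; `clean_logs` is the assumed RDD of filtered log lines from the earlier sketch:

```python
# Minimal sketch: exact-duplicate removal. distinct() shuffles records by their
# full content and keeps one copy of each, the same effect as a Hive/Impala
# SELECT DISTINCT or Pig's DISTINCT operator.
deduped_logs = clean_logs.distinct()
```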
  • 129. 129©2014 Cloudera, Inc. All rights reserved. BI/Discovery engine recommendation •  Main requirements for this are: –  Low latency –  SQL interface (e.g. JDBC/ODBC) –  Users don’t know how to code •  We chose Impala –  It’s a SQL engine –  Much faster than other engines –  Provides standard JDBC/ODBC interfaces
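Analysts would normally hit Impala from a BI tool over JDBC/ODBC, but the same queries can be scripted with the impyla client. A hedged illustration follows; the host, port, table and column names are assumptions, and the bounce-rate definition (sessions with a single click) is illustrative:

```python
# Minimal sketch: compute bounce rate over a sessionized table through Impala.
# Assumes impyla is installed, an impalad listening at impala-host:21050, and an
# illustrative `clickstream_sessions` table with a session_id column.
from impala.dbapi import connect

conn = connect(host="impala-host", port=21050)
cur = conn.cursor()
cur.execute("""
    SELECT round(100.0 * sum(CASE WHEN clicks = 1 THEN 1 ELSE 0 END) / count(*), 2)
           AS bounce_rate_pct
    FROM (SELECT session_id, count(*) AS clicks
          FROM clickstream_sessions
          GROUP BY session_id) s
""")
print(cur.fetchall())
```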
  • 130. 130©2014 Cloudera, Inc. All rights reserved. End-to-end processing [Diagram: Sessionization, Filtering, Deduplication, BI tools]
  • 132. 132 •  Data arrives through Flume •  Triggers a processing event: –  Sessionize –  Enrich – Location, marketing channel… –  Store as Parquet •  Each day we process events from the previous day Orchestrating Clickstream
  • 133. 133©2014 Cloudera, Inc. All rights reserved. •  Workflow is fairly simple •  Need to trigger workflow based on data •  Be able to recover from errors •  Perhaps notify on the status •  And collect metrics for reporting Choosing Right
  • 134. 134 Oozie or Azkaban? ©2014 Cloudera, Inc. All rights reserved.
  • 135. 135©2014 Cloudera, Inc. All rights reserved. Oozie Architecture
  • 136. 136©2014 Cloudera, Inc. All rights reserved. •  Part of all major Hadoop distributions •  Hue integration •  Built-in actions – Hive, Sqoop, MapReduce, SSH •  Complex workflows with decisions •  Event- and time-based scheduling •  Notifications •  SLA Monitoring •  REST API Oozie features
  • 137. 137©2014 Cloudera, Inc. All rights reserved. •  Overhead in launching jobs •  Steep learning curve •  XML Workflows Oozie Drawbacks
  • 138. 138©2014 Cloudera, Inc. All rights reserved. Azkaban Architecture [Diagram: an Azkaban Web Server (HDFS viewer plugin) and an Azkaban Executor Server (job types plugin) backed by MySQL; the Client talks to the web server and jobs run on Hadoop]
  • 139. 139©2014 Cloudera, Inc. All rights reserved. •  Simplicity •  Great UI – including pluggable visualizers •  Lots of plugins – Hive, Pig… •  Reporting plugin Azkaban features
  • 140. 140©2014 Cloudera, Inc. All rights reserved. •  Doesn’t support workflow decisions •  Can’t represent data dependency Azkaban Limitations
  • 141. 141©2014 Cloudera, Inc. All rights reserved. •  Workflow is fairly simple •  Need to trigger workflow based on data •  Be able to recover from errors •  Perhaps notify on the status •  And collect metrics for reporting Choosing… Easier in Oozie
  • 142. 142©2014 Cloudera, Inc. All rights reserved. •  Workflow is fairly simple •  Need to trigger workflow based on data •  Be able to recover from errors •  Perhaps notify on the status •  And collect metrics for reporting Choosing the right Orchestration Tool Better in Azkaban
  • 143. 143©2014 Cloudera, Inc. All rights reserved. The best orchestration tool is the one you are an expert on Important Decision Consideration!
  • 144. 144©2014 Cloudera, Inc. All rights reserved. Orchestration Patterns – Fan Out
  • 145. 145©2014 Cloudera, Inc. All rights reserved. Capture & Decide Pattern
  • 147. 147 Final Architecture – High Level Overview Data Sources Ingestion Raw Data Storage (Formats, Schema) Processed Data Storage (Formats, Schema) ©2014 Cloudera, Inc. All Rights Reserved. Processing Data Consumption Orchestration (Scheduling, Managing, Monitoring)
  • 148. 148 Final Architecture – High Level Overview Data Sources Ingestion Raw Data Storage (Formats, Schema) Processed Data Storage (Formats, Schema) ©2014 Cloudera, Inc. All Rights Reserved. Processing Data Consumption Orchestration (Scheduling, Managing, Monitoring)
  • 149. 149 Final Architecture – Ingestion/Storage [Diagram: a Flume Agent on every Web Server fans in to a smaller tier of collector Flume Agents (fan-in pattern, multiple agents for failover and rolling restarts), which write to HDFS under /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10] ©2014 Cloudera, Inc. All Rights Reserved.
  • 150. 150 Final Architecture – High Level Overview Data Sources Ingestion Raw Data Storage (Formats, Schema) Processed Data Storage (Formats, Schema) ©2014 Cloudera, Inc. All Rights Reserved. Processing Data Consumption Orchestration (Scheduling, Managing, Monitoring)
  • 151. 151 Final Architecture – Processing and Storage [Diagram: raw data in /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10 … flows through dedup -> filtering -> sessionization and is parquetized into /data/bikeshop/clickstream/year=2014/month=10/day=10 …] ©2014 Cloudera, Inc. All Rights Reserved.
  • 152. 152 Final Architecture – High Level Overview Data Sources Ingestion Raw Data Storage (Formats, Schema) Processed Data Storage (Formats, Schema) ©2014 Cloudera, Inc. All Rights Reserved. Processing Data Consumption Orchestration (Scheduling, Managing, Monitoring)
  • 153. 153 Final Architecture – Data Access [Diagram: Hive/Impala serve BI/Analytics Tools over JDBC/ODBC, Sqoop exports to the DWH, and a DB import tool pulls data to Local Disk for R, etc.] ©2014 Cloudera, Inc. All Rights Reserved.
  • 154. 154 Demo ©2014 Cloudera, Inc. All rights reserved.
  • 155. 155Confidentiality Information Goes Here Join the Discussion Get community help or provide feedback cloudera.com/community
  • 156. 156 Built for Production Success. Powered by Apache Hadoop
  • 157. 157 Try Cloudera Live today – cloudera.com/live
  • 158. 158 BOOK SIGNINGS, THEATER SESSIONS, TECHNICAL DEMOS, GIVEAWAYS – Visit us at Booth #101. HIGHLIGHTS: Apache Kafka is now fully supported with Cloudera; learn why Cloudera is the leader for security & governance in Hadoop
  • 159. 159 Free books and Lots of Q&A! •  Book signings –  Wednesday, May 6th, 3:15 PM at Cloudera booth (#101) –  Wednesday, May 6th, 6PM at O’Reilly booth •  Ask Us Anything! –  Wednesday, May 6th, 14:35-15:15 at Westminster Suite •  Office hours! –  Thursday, May 7th, 11:45-12:25 at Table A •  Stay in touch! –  @hadooparchbook –  hadooparchitecturebook.com –  slideshare.com/hadooparchbook •  Please leave us a review! ©2014 Cloudera, Inc. All Rights Reserved.