2. Introduction
Chris K Wensel
chris@wensel.net
• Cascading, Lead Developer
• http://cascading.org/
• Concurrent, Inc., Founder
• Hadoop/Cascading support and tools
• http://concurrentinc.com/
3. Computing Systems
[diagram: data → info → value]
• Exist to create value out of data
• Everything else is an implementation detail
4. In Today’s Computing Environment
• Lots of relevant medium-to-large data sets
– that individually could fit in an RDBMS
• Lots of applications touching that data
– where do you think Perl came from?
• Underutilized hardware owning (intermediate) data
– Xen/VMware add complexity (sprawl)
5. continued...
• Raw data continuously arriving (and in bursts)
– we mostly care about the new stuff
• Raw data is dirty
– bots and bugs
• Demands on timely/predictable result availability
– downstream systems must be fed
• The ‘Cloud’ is enabling an on-demand model
6. Data Warehousing != Data Processing
[diagram: ETL hub-and-spoke [monolithic] vs. process streams [distributed]]
• Data Warehousing
– monolithic systems and data schema
– distribution through manual federation/sharding
• Data Processing
– cluster of peer systems
– dynamic, even distribution of data and processing
7. Data Warehousing
[diagram: loggers → raw data → ETL → warehouse → ETL → reporting [BI, KPI, etc] and cache → Consumer; Analysts pull some data for data mining with R, SAS, Excel, etc; the warehouse also feeds the product]
• Agility: no “one size fits all” schema; resistant to change
• Complex analytics: cannot be represented in SQL
• Massive data sets: won’t fit, or are too costly
8. Production Data Processing
[diagram: loggers → raw data → data processing → valuable data → Consumer]
• Online / Real-Time processing
– low latency (milliseconds to seconds for results)
– smaller datasets - streams
• Offline / Batch
– high latency (minutes to days for results)
– larger datasets - files
9. Hadoop Adoption
[diagram: a Cluster of Racks of Nodes forming a Global Compute-space and a Global Namespace]
• Distributed, replicated storage for large files
• Distributed, fault-tolerant execution of batch processes
• Scale out vs. (legacy) scale up
• Java API allows complex analysis
10. But Stuffed into Legacy Roles
[diagram: loggers → raw data → ETL → Hadoop + Pig/Hive acting as the data warehouse → ETL → data mining by an Analyst]
• Hadoop deployments mirror legacy architectures
– ETL into cached “structured storage”
• Pig/Hive are syntaxes for mining “Big” data
– SQL-like, but hard to customize and not “advanced”
11. Hadoop for Data Processing
[diagram: Simplicity → Scalability → Value Creation]
• More Value through Innovation
• Scalability, Not Performance
• Simplifies Infrastructure
12. Simplicity
[diagram: a Cluster of Racks of Nodes; disks form a Global Namespace, CPUs a Global Compute-space]
• Virtualization across resources, not within (PaaS)
– A single FileSystem across disks - no DBA
– A single Execution System across CPUs - less IT
13. Scalability
[diagram: many Clients submitting jobs to a Cluster of Racks of Nodes]
• Scalability - continued reliability and met expectations as demand changes
• Application Scalability - as data grows, app/infra expand
• Organizational Scalability - simpler infra
14. Creating Value
[diagram: loggers → raw data → data processing with Hadoop + Cascading (ETL, analytics, events, reporting) → product and operational Value, flowing from Producer to Consumer]
• Unconstrained processing model
• Data processing requires integration
• Processing must not fail or fall behind
15. Consequences
• Improved reliability of production processes
– “we had a failed disk yet jobs never failed”
• Greater utilization of hardware resources
– dynamically moves code to available cores
• Increased rate of innovation
– diverse analytics over larger sets, less bureaucracy
• Fewer staff
16. Hadoop MapReduce
[diagram: a Count Job then a Sort Job, each a Map → Reduce pass reading and writing [ k, v ] pairs in Files, with [ k, [v] ] between Map and Reduce]
[ k, v ] = key and value pair
[ k, [v] ] = key and associated values collection
• Nearly impossible to “think in”
• Apps are many dependent MR jobs
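The [ k, v ] → [ k, [v] ] shape above can be sketched in plain Java. This is a toy in-memory model of the word-count job, not the Hadoop API; the class and method names here are illustrative only:

```java
import java.util.*;

// Toy in-memory model of a MapReduce word-count job (illustrative, not Hadoop).
public class WordCountModel {

    // Map phase: emit one (word, 1) pair per token -> a list of [ k, v ] pairs.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // Shuffle: group values by key -> [ k, [v] ].
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return groups;
    }

    // Reduce phase: fold each key's value collection into a final count.
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> counts = new TreeMap<>();
        groups.forEach((word, values) -> {
            int sum = 0;
            for (int v : values) sum += v;
            counts.put(word, sum);
        });
        return counts;
    }

    public static Map<String, Integer> run(List<String> lines) {
        return reduce(shuffle(map(lines)));
    }
}
```

A real application chains many such jobs, with each reduce output written to files and re-read by the next map; that file-to-file plumbing is exactly what makes raw MapReduce hard to “think in”.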
17. Cascading
[diagram: Word Count/Sort Flow; Data → Parse → Group → Count → Sort → Data, spanning two Map/Reduce stages]
[ f1, f2,... ] = tuples with field names
• Alternative model & API to MapReduce
– pipes/filters of re-usable operations
• For rapidly implementing Data Processing Systems
• Open-Source
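The Parse → Group → Count → Sort flow above can be sketched as a chain of operations over field-named tuples. This is a toy plain-Java illustration of the pipe/filter idea, not the actual Cascading API; all names are illustrative:

```java
import java.util.*;
import java.util.stream.*;

// Toy pipe/filter sketch of the Cascading model (illustrative, not Cascading):
// each operation consumes and emits field-named tuples, and a "flow" is just
// the composed chain Parse -> Group/Count -> Sort.
public class TuplePipe {

    // Parse: split lines into tuples of the form {word=...}.
    static Stream<Map<String, Object>> parse(Stream<String> lines) {
        return lines.flatMap(l -> Arrays.stream(l.split("\\s+")))
                    .filter(w -> !w.isEmpty())
                    .map(w -> Map.of("word", (Object) w.toLowerCase()));
    }

    // Group by the "word" field and count each group.
    static Map<String, Long> groupAndCount(Stream<Map<String, Object>> tuples) {
        return tuples.collect(Collectors.groupingBy(
            t -> (String) t.get("word"), Collectors.counting()));
    }

    // Sort by descending count (ties broken by word), preserving order.
    static LinkedHashMap<String, Long> sortByCountDesc(Map<String, Long> counts) {
        LinkedHashMap<String, Long> out = new LinkedHashMap<>();
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                      .thenComparing(Map.Entry.comparingByKey()))
              .forEach(e -> out.put(e.getKey(), e.getValue()));
        return out;
    }

    // The "flow": source -> parse -> group/count -> sort -> sink.
    public static LinkedHashMap<String, Long> flow(List<String> lines) {
        return sortByCountDesc(groupAndCount(parse(lines.stream())));
    }
}
```

The point of the model: the author writes only the operation chain; the planner decides how to slice it into the two Map/Reduce stages shown in the diagram.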
18. Emerging Tool Support
• Karmasphere IDE (soon)
– Developing and Debugging
• Bixo (Bixo Labs) Data Mining Toolkit
– Apache Nutch replacement
– Easier to customize to meet new business models
• Clojure & JRuby Domain-Specific Languages (DSLs)
– Machine Learning
– Simple/Complex Ad-Hoc queries
19. Practical Applications
• Log/event analysis, device and system monitoring
• Web crawling and content mining
• Behavioral ad-targeting segmentation
• Ad campaign ROI
• Demand and event prediction
• POS analytics for product demand pricing
20. Successes
• Publicis/Razorfish - Behavioral Ad-Targeting
– Cascading + AWS (Elastic MapReduce)
– Daily automated User Behavior Segmentation
– 6 wks dev, 3 TB/day, $13k/mo
– 500% increase in return on ad spend from a similar campaign a year before
21. continued...
• FlightCaster - Predicting flight delays
– Clojure + Cascading + AWS
– Machine learning and production processing
– 3 mos dev, 10 GB/day, <1 TB total currently, <$2k/mo
• Etsy - Online Marketplace
– JRuby + Cascading
– Data mining (Hadoop as a DW!)
– 750M page-views/mo, 60 GB/day of logs
22. Resources
• Chris K Wensel
– chris@wensel.net
– @cwensel
• Cascading
– an API for optimizing production data processing
– http://cascading.org
• Concurrent, Inc.
– Support and Mentoring
– http://concurrentinc.com