“Cascading:
 Enterprise Data Workflows
 based on Functional Programming”

 Paco Nathan
 Concurrent, Inc.
 San Francisco, CA
 @pacoid




Copyright ©2013, Concurrent, Inc.
Cascading: Workflow Abstraction

1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data

[flow diagram: Document Collection → Tokenize (M) → Scrub token → HashJoin Left (RHS: Stop Word List → Regex token) → GroupBy token (R) → Count → Word Count]
Q3 1997: inflection point

Four independent teams were working toward horizontal
scale-out of workflows based on commodity hardware.
This effort prepared the way for huge Internet successes
in the 1997 holiday season… AMZN, EBAY, Inktomi
(YHOO Search), then GOOG

MapReduce and the Apache Hadoop open source stack
emerged from this.




Circa 1996: pre-inflection point

[diagram: BI Analysts deliver Excel pivot tables and PowerPoint slide decks to the Stakeholder, who sets strategy for Product; Product hands requirements to Engineering; Engineering ships optimized code to the Web App serving Customers; the Web App records transactions in an RDBMS, from which the BI Analysts pull SQL Query result sets]
Circa 1996: pre-inflection point

[repeats the previous slide, adding:]

“Throw it over the wall”
Circa 2001: post big ecommerce successes

[diagram: Algorithmic Modeling delivers dashboards to the Stakeholder and models to Engineering; Engineering deploys recommenders + classifiers as servlets in Web Apps, which serve UX to Customers; Web Apps write event history to Logs and customer transactions to an RDBMS via Middleware; ETL moves Logs and RDBMS data into a DW, whose aggregation feeds SQL Query result sets back to Algorithmic Modeling]
Circa 2001: post big ecommerce successes

[repeats the previous slide, adding:]

“Data products”
Circa 2013: clusters everywhere

[diagram: Domain Expert, Data Scientist, App Dev, and Ops (introduced capability) collaborate with Prod, s/w dev, Eng, and Ops (existing SDLC) around Data Products for Customers; a Workflow combines business process, dashboard metrics, data science discovery + modeling, History, and a Planner with optimized taps and capacity; use cases run across topologies – Hadoop, etc. (batch), Log Events (near time), In-Memory Data Grid, DW, RDBMS – under a Cluster Scheduler; Web Apps, Mobile, etc. serve services, social interactions, transactions, and content]
Circa 2013: clusters everywhere

[repeats the previous slide, adding:]

“Optimizing topologies”
references…

   by Leo Breiman
   Statistical Modeling: The Two Cultures
   Statistical Science, 2001
   bit.ly/eUTh9L




references…

  Amazon
  “Early Amazon: Splitting the website” – Greg Linden
  glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

  eBay
  “The eBay Architecture” – Randy Shoup, Dan Pritchett
  addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
  addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

  Inktomi (YHOO Search)
  “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
  youtube.com/watch?v=E91oEn1bnXM

  Google
  “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
  youtube.com/watch?v=qsan-GQaeyk
  perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx




Cascading: Workflow Abstraction

1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data

[agenda slide repeated, with the same flow diagram]
Cascading – origins

API author Chris Wensel worked as a system architect
at an Enterprise firm well-known for many popular
data products.
Wensel was following the Nutch open source project –
where Hadoop started.
Observation: it would be difficult to find enough Java
developers to write complex Enterprise apps directly in
MapReduce – a potential blocker for leveraging the new
open source technology.




Cascading – functional programming

Key insight: MapReduce is based on functional programming
– back to LISP in the 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced in late
2007 as a new Java API to implement functional programming
for large-scale data workflows:

• leverages JVM and Java-based tools without any
    need to create new languages
•   allows programmers who have J2EE expertise
    to leverage the economics of Hadoop clusters




functional programming… in production

• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
    have invested in open source projects atop Cascading
    – used for their large-scale production deployments
•   new case studies for Cascading apps are mostly
    based on domain-specific languages (DSLs) in JVM
    languages which emphasize functional programming:

    Cascalog in Clojure (2010)
    Scalding in Scala (2012)


github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki




Cascading – definitions

• a pattern language for Enterprise Data Workflows
• simple to build, easy to test, robust in production
• design principles ⟹ ensure best practices at scale

[diagram: an Enterprise data workflow – a Web App serves Customers and writes Logs through a Cache; a source tap reads the Logs into a Data Workflow running on a Hadoop Cluster, with a trap tap for exceptions; sink taps write results back out, and a source tap reads customer profile DBs (Customer Prefs); the workflow feeds Support, Modeling (PMML), Analytics Cubes, and Reporting]
Cascading – usage

• Java API, DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL
• ASL 2 license, GitHub src, http://conjars.org
• 5+ yrs production use, multiple Enterprise verticals

[same workflow diagram as the previous slide]
Cascading – integrations

• partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC,
  SpringSource, Cloudera
• taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc.
• serialization: Avro, Thrift, Kryo, JSON, etc.
• topologies: Apache Hadoop, tuple spaces, local mode

[same workflow diagram as the previous slide]
Cascading – deployments

• case studies: Climate Corp, Twitter, Etsy,
  Williams-Sonoma, uSwitch, Airbnb, Nokia,
  YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud,
  social media, retail pricing, search analytics,
  recommenders, eCRM, utility grids, telecom,
  genomics, climatology, agronomics, etc.




Cascading – deployments

[repeats the previous slide, adding:]

workflow abstraction addresses:
• staffing bottleneck;
• system integration;
• operational complexity;
• test-driven development
Cascading: Workflow Abstraction

1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data

[agenda slide repeated, with the same flow diagram]
The Ubiquitous Word Count

Definition:

   count how often each word appears
   in a collection of text documents

This simple program provides an excellent test case for
parallel processing, since it:

 • requires a minimal amount of code
 • demonstrates use of both symbolic and numeric values
 • shows a dependency graph of tuples as an abstraction
 • is not many steps away from useful search indexing
 • serves as a “Hello World” for Hadoop apps

Any distributed computing framework which can run Word
Count efficiently in parallel at scale can handle much
larger and more interesting compute problems.

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));

[flow diagram: Document Collection → Tokenize (M) → GroupBy token (R) → Count → Word Count]
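As a point of comparison only (not from the deck): a minimal single-JVM sketch of the same definition, written with Java 8 streams, to make the functional shape of the pseudocode concrete. The input path argument and the \W+ token regex are illustrative assumptions, not the tokenizer used in the Cascading samples below.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LocalWordCount {
  public static void main(String[] args) throws Exception {
    // count how often each token appears in the given text file
    try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
      Map<String, Long> counts = lines
        .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
        .filter(token -> !token.isEmpty())
        .collect(Collectors.groupingBy(token -> token, Collectors.counting()));
      counts.forEach((token, count) -> System.out.println(token + "\t" + count));
    }
  }
}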
word count – conceptual flow diagram

[flow diagram: Document Collection → Tokenize (M) → GroupBy token (R) → Count → Word Count]

1 map
1 reduce
18 lines code

cascading.org/category/impatient
gist.github.com/3900702
word count – Cascading app in Java

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
 .addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
word count – generated flow diagram

[generated flow diagram for the "wc" flow (DOT output from dot/wc.dot), flattened here:]

[head]
 ↓
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']
 ↓ [{2}:'doc_id', 'text']
map: Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
 ↓ [{1}:'token']
GroupBy('wc')[by:['token']]
 ↓ wc[{1}:'token']
reduce: Every('wc')[Count[decl:'count']]
 ↓ [{2}:'token', 'count']
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']
 ↓ [{2}:'token', 'count']
[tail]
word count – Cascalog / Clojure

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient
word count – Cascalog / Clojure

github.com/nathanmarz/cascalog/wiki

• implements Datalog in Clojure, with predicates backed
  by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL –
  approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development
  (TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs –
  Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
word count – Scalding / Scala

import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"),
       ('doc_id, 'text),
       skipHeader = true)
    .read
    .flatMap('text -> 'token) {
       text : String => text.split("[ \\[\\]\\(\\),.]")
     }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}
word count – Scalding / Scala

github.com/twitter/scalding/wiki

• extends the Scala collections API so that distributed lists
  become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram
  and function calls
• extensive libraries are available for linear algebra, abstract
  algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog
word count – Scalding / Scala

github.com/twitter/scalding/wiki

[repeats the previous slide, adding:]

Cascalog and Scalding DSLs
leverage the functional aspects
of MapReduce, helping limit
complexity in process
Cascading: Workflow Abstraction

1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data

[agenda slide repeated, with the same flow diagram]
workflow abstraction – pattern language

Cascading uses a “plumbing” metaphor in the Java API
to define workflows out of familiar elements: Pipes, Taps,
Tuple Flows, Filters, Joins, Traps, etc.

[flow diagram: Document Collection → Tokenize (M) → Scrub token → HashJoin Left (RHS: Stop Word List → Regex token) → GroupBy token (R) → Count → Word Count]

Data is represented as flows of tuples. Operations within
the flows bring functional programming aspects into Java.

In formal terms, this provides a pattern language.
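A hedged sketch (not on the slide) of how those elements compose in the Java API, following the flow diagram above and assuming Cascading 2.x; the token-scrubbing Function is passed in, standing in for a custom Operation such as the ScrubFunction built in the Impatient tutorial series referenced earlier.

import cascading.operation.Function;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.HashJoin;
import cascading.pipe.Pipe;
import cascading.pipe.joiner.LeftJoin;
import cascading.tuple.Fields;

public class StopWordAssembly {
  // returns the tail pipe of the assembly sketched in the diagram
  public static Pipe build( Function scrub ) {
    Fields token = new Fields( "token" );
    Fields stop = new Fields( "stop" );

    // scrub each token with the custom Function
    Pipe docPipe = new Pipe( "token" );
    docPipe = new Each( docPipe, token, scrub, Fields.RESULTS );

    // left join against the stop word list on the RHS
    Pipe stopPipe = new Pipe( "stop" );
    Pipe tokenPipe = new HashJoin( docPipe, token, stopPipe, stop, new LeftJoin() );

    // keep only tuples where no stop word matched (empty RHS field)
    tokenPipe = new Each( tokenPipe, stop, new RegexFilter( "^$" ) );

    // group by token and count occurrences
    Pipe wcPipe = new GroupBy( new Pipe( "wc", tokenPipe ), token );
    return new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
  }
}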
references…

  pattern language: a structured method for solving
  large, complex design problems, where the syntax of
  the language promotes the use of best practices

  amazon.com/dp/0195019199



  design patterns: the notion originated in consensus
  negotiation for architecture, later applied in OOP
  software engineering by “Gang of Four”
  amazon.com/dp/0201633612




workflow abstraction – pattern language

[repeats the previous slide, adding:]

design principles of the pattern
language ensure best practices
for robust, parallel data workflows
at scale
workflow abstraction – literate programming

Cascading workflows generate their own visual
documentation: flow diagrams

[same flow diagram as before]

In formal terms, flow diagrams leverage a methodology
called literate programming.
Provides intuitive, visual representations for apps –
great for cross-team collaboration.
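The hook that emits these diagrams appears in the word count app shown earlier: a Flow writes its own plan as a Graphviz DOT file before it runs.

// from the word count app above: the planner's view of the flow
// is written to dot/wc.dot, which Graphviz can render as an image
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();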
references…

  by Don Knuth
  Literate Programming
  Univ of Chicago Press, 1992
  literateprogramming.com/

  “Instead of imagining that our main task is
   to instruct a computer what to do, let us
   concentrate rather on explaining to human
   beings what we want a computer to do.”




workflow abstraction – business process

Following the essence of literate programming, Cascading
workflows provide statements of business process
This recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
This is especially apparent in large-scale Cascalog apps:
  “Specify what you require, not how to achieve it.”
By virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale




references…

  by Edgar Codd
  “A relational model of data for large shared data banks”
  Communications of the ACM, 1970
  dl.acm.org/citation.cfm?id=362685
  Rather than arguing between SQL vs. NoSQL…
  structured vs. unstructured data frameworks…
  this approach focuses on what apps do:
    the process of structuring data




workflow abstraction – functional relational programming

The combination of functional programming, pattern language,
DSLs, literate programming, business process, etc., traces back
to the original definition of the relational model (Codd, 1970)
prior to SQL.
Cascalog, in particular, implements more of what Codd intended
for a “data sublanguage” and is considered to be close to a full
implementation of the functional relational programming
paradigm defined in:
   Moseley & Marks, 2006
   “Out of the Tar Pit”
   goo.gl/SKspn




workflow abstraction – functional relational programming

[repeats the previous slide, adding:]

several theoretical aspects converge
into software engineering practices
which minimize the complexity of
building and maintaining Enterprise
data workflows
Cascading: Workflow Abstraction

1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data

[agenda slide repeated, with the same flow diagram]
Enterprise Data Workflows

Let’s consider a “strawman” architecture
for an example app… at the front end
LOB use cases drive demand for apps

[the Enterprise data workflow diagram: Web App, Cache, Logs, source/sink/trap taps, Data Workflow on a Hadoop Cluster, customer profile DBs, Support, Modeling (PMML), Analytics Cubes, Reporting]
Enterprise Data Workflows

Same example… in the back office
Organizations have substantial investments
in people, infrastructure, process

[same diagram as the previous slide]
Enterprise Data Workflows

Same example… the heavy lifting!
“Main Street” firms are migrating
workflows to Hadoop, for cost
savings and scale-out

[same diagram as the previous slide]
Cascading workflows – taps

• taps integrate other data frameworks, as tuple streams
• these are “plumbing” endpoints in the pattern language
• sources (inputs), sinks (outputs), traps (exceptions)
• text delimited, JDBC, Memcached,
  HBase, Cassandra, MongoDB, etc.
• data serialization: Avro, Thrift,
  Kryo, JSON, etc.
• extend a new kind of tap in just
  a few lines of Java

schema and provenance get
derived from analysis of the taps

[same workflow diagram as before]
Cascading workflows – taps

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps – for TSV data in HDFS
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
 .addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
Cascading workflows – topologies

• topologies execute workflows on clusters
• flow planner is like a compiler for queries
  - Hadoop (MapReduce jobs)
  - local mode (dev/test or special config)
  - in-memory data grids (real-time)
• flow planner can be extended to support other topologies – see the sketch below

blend flows in different topologies into the same app – for example,
batch (Hadoop) + transactions (IMDG)

(diagram: the same enterprise data workflow, with taps spanning the Hadoop cluster, logs/cache, and customer profile DBs)



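a minimal sketch of swapping topologies, assuming Cascading 2.x – note that local-mode taps and schemes live in the cascading.*.local packages rather than the Hadoop ones:

// the same flow definition, planned onto different topologies
FlowConnector hadoopConnector = new HadoopFlowConnector( properties ); // MapReduce jobs
FlowConnector localConnector  = new LocalFlowConnector( properties );  // in-process dev/test

// the planner compiles the flow for whichever topology the connector targets
Flow flow = hadoopConnector.connect( flowDef );
flow.complete();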

                                                                                              47
Cascading workflows – topologies

// the HadoopFlowConnector invokes the flow planner for the Apache Hadoop topology
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
 .addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();



                                                                                                          48
example topologies…




                      49
Cascading workflows – test-driven development

• assert patterns (regex) on the tuple streams
• adjust assert levels, like log4j levels – see the sketch below
• trap edge cases as “data exceptions”
• TDD at scale:
  1. start from raw inputs in the flow graph
  2. define stream assertions for each stage of transforms
  3. verify exceptions, code to remove them
  4. when impl is complete, app has full test coverage

redirect traps in production to Ops, QA, Support, Audit, etc.

(diagram: the same enterprise data workflow, with the trap tap highlighted alongside the source and sink taps)

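a minimal sketch of a stream assertion, assuming the Cascading 2.x assertion API (AssertMatches, AssertionLevel) – docPipe and token are the names from the word count example:

// fail fast if any "token" value still contains whitespace after scrubbing
docPipe = new Each( docPipe, token,
  AssertionLevel.STRICT, new AssertMatches( "^\\S+$" ) );

// dial assertions up or down per planner run, much like log4j levels –
// e.g., planning at VALID strips the STRICT assertions out of production flows:
// flowDef.setAssertionLevel( AssertionLevel.VALID );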

                                                                                               50
Two Avenues to the App Layer…

Enterprise: must contend with
complexity at scale everyday…
incumbents extend current practices and
infrastructure investments – using J2EE,
ANSI SQL, SAS, etc. – to migrate
workflows onto Apache Hadoop while
leveraging existing staff

Start-ups: crave complexity and
scale to become viable…
new ventures move into Enterprise space
to compete using relatively lean staff,
while leveraging sophisticated engineering
practices, e.g., Cascalog and Scalding

(chart: the two avenues plotted against complexity ➞ and scale ➞ axes)

                                                                      51
Cascading: Workflow Abstraction
1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data

(diagram: word count flow – Document Collection → Tokenize → Scrub token → HashJoin Left against the Stop Word List (RHS) → GroupBy token → Count → Word Count)

                                                                                                              52
Cascading workflows – ANSI SQL

• collab with Optiq – industry-proven code base
• ANSI SQL parser/optimizer atop the Cascading flow planner
• JDBC driver to integrate into existing tools and app servers
• relational catalog over a collection of unstructured data
• SQL shell prompt to run queries
• enable analysts without retraining on Hadoop, etc.
• transparency for Support, Ops, Finance, et al.

a language for queries – not a database,
but ANSI SQL as a DSL for workflows

(diagram: the same enterprise data workflow, with SQL queries planned onto the Hadoop cluster)

                                                                                          53
Lingual – CSV data in local file system




cascading.org/lingual
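the screenshot is not reproduced here; the key detail survives in the JDBC example a few slides ahead – a “local” connection string maps a schema over a directory of CSV files, no cluster required:

jdbc:lingual:local;schemas=src/main/resources/data/example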


                                         54
Lingual – shell prompt, catalog




cascading.org/lingual


                                  55
Lingual – queries




cascading.org/lingual
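the screenshot is not reproduced here; as an illustration only (reconstructed from the JDBC example on the slides which follow, not from the screenshot), queries at the shell are plain ANSI SQL:

SELECT *
FROM "EXAMPLE"."SALES_FACT_1997" AS s
JOIN "EXAMPLE"."EMPLOYEE" AS e
ON e."EMPID" = s."CUST_ID";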


                        56
abstraction layers in queries…
          abstraction               RDBMS                   JVM Cluster
              parser               ANSI SQL                  ANSI SQL
                                 compliant parser          compliant parser
            optimizer             logical plan,              logical plan,
                            optimized based on stats   optimized based on stats
             planner               physical plan            API “plumbing”

            machine               query history,              app history,
             data                   table stats               tuple stats
             topology              b-trees, etc.       heterogeneous, distributed:
                                                        Hadoop, in-memory, etc.
           visualization               ERD                   flow diagram

             schema                table schema              tuple schema

             catalog             relational catalog          tap usage DB


           provenance             (manual audit)               data set
                                                         producers/consumers



                                                                                    57
Lingual – JDBC driver

public void run() throws ClassNotFoundException, SQLException {
    Class.forName( "cascading.lingual.jdbc.Driver" );
    Connection connection =
      DriverManager.getConnection(
       "jdbc:lingual:local;schemas=src/main/resources/data/example" );
    Statement statement = connection.createStatement();

    ResultSet resultSet = statement.executeQuery(
        "select *\n"
          + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
          + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
          + "on e.\"EMPID\" = s.\"CUST_ID\"" );

    while( resultSet.next() ) {
      int n = resultSet.getMetaData().getColumnCount();
      StringBuilder builder = new StringBuilder();

      for( int i = 1; i <= n; i++ ) {
        builder.append( ( i > 1 ? "; " : "" )
            + resultSet.getMetaData().getColumnLabel( i )
            + "="
            + resultSet.getObject( i ) );
        }

      System.out.println( builder );
      }

    resultSet.close();
    statement.close();
    connection.close();
    }




                                                                         58
Lingual – JDBC result set

$ gradle clean jar
$ hadoop jar build/libs/lingual-examples-1.0.0-wip-dev.jar
 
CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian




                            Caveat: if you absolutely positively must have sub-second
                            SQL query response for PB-scale data on a 1000+ node
                            cluster… good luck with that! (call the MPP vendors)
                            This ANSI SQL library is primarily intended for batch
                            workflows – high throughput, not low-latency –
                            for many under-represented use cases in Enterprise IT.
                            In other words, SQL as a DSL.




 cascading.org/lingual
                                                                                        59
Lingual – connecting Hadoop and R

   # load the JDBC package
   library(RJDBC)
    
   # set up the driver
   drv <- JDBC("cascading.lingual.jdbc.Driver",
     "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")
    
   # set up a database connection to a local repository
   connection <- dbConnect(drv,
     "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")
    
   # query the repository: in this case the MySQL sample database (CSV files)
   df <- dbGetQuery(connection,
     "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
   head(df)
    
   # use R functions to summarize and visualize part of the data
   df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
   summary(df$hire_age)

   library(ggplot2)
   m <- ggplot(df, aes(x=hire_age))
   m <- m + ggtitle("Age at hire, people named Gina")
   m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()


                                                                                             60
Lingual – connecting Hadoop and R

   > summary(df$hire_age)
      Min. 1st Qu. Median     Mean 3rd Qu.    Max.
     20.86   27.89   31.70   31.61   35.01   43.92




cascading.org/lingual
                                                     61
Cascading: Workflow Abstraction
1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data

(diagram: word count flow – Document Collection → Tokenize → Scrub token → HashJoin Left against the Stop Word List (RHS) → GroupBy token → Count → Word Count)

                                                                                                              62
Pattern – model scoring

• migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
• great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc.
• integrate with other libraries – Matrix API, etc.
• leverage PMML as another kind of DSL

(diagram: the same enterprise data workflow, with PMML models passing from Modeling into the Hadoop cluster)


cascading.org/pattern


                                                                                          63
Pattern – create a model in R

   ## train a RandomForest model

   library(randomForest)  # packages needed by this snippet
   library(pmml)
   library(XML)

   f <- as.formula("as.factor(label) ~ .")
   fit <- randomForest(f, data_train, ntree=50)

   ## test the model on the holdout test set

   print(fit$importance)
   print(fit)

   predicted <- predict(fit, data)
   data$predicted <- predicted
   confuse <- table(pred = predicted, true = data[,1])
   print(confuse)

   ## export predicted labels to TSV

   write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
     quote=FALSE, sep="\t", row.names=FALSE)

   ## export RF model to PMML

   saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))




                                                                          64
Pattern – capture model parameters as PMML
   <?xml version="1.0"?>
   <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.dmg.org/PMML-4_0
    http://www.dmg.org/v4-0/pmml-4-0.xsd">
    <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
     <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
     <Application name="Rattle/PMML" version="1.2.30"/>
     <Timestamp>2012-10-22 19:39:28</Timestamp>
    </Header>
    <DataDictionary numberOfFields="4">
     <DataField name="label" optype="categorical" dataType="string">
      <Value value="0"/>
      <Value value="1"/>
     </DataField>
     <DataField name="var0" optype="continuous" dataType="double"/>
     <DataField name="var1" optype="continuous" dataType="double"/>
     <DataField name="var2" optype="continuous" dataType="double"/>
    </DataDictionary>
    <MiningModel modelName="randomForest_Model" functionName="classification">
     <MiningSchema>
      <MiningField name="label" usageType="predicted"/>
      <MiningField name="var0" usageType="active"/>
      <MiningField name="var1" usageType="active"/>
      <MiningField name="var2" usageType="active"/>
     </MiningSchema>
     <Segmentation multipleModelMethod="majorityVote">
      <Segment id="1">
       <True/>
       <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
        <MiningSchema>
         <MiningField name="label" usageType="predicted"/>
         <MiningField name="var0" usageType="active"/>
         <MiningField name="var1" usageType="active"/>
         <MiningField name="var2" usageType="active"/>
        </MiningSchema>
   ...

                                                                                                                                                 65
Pattern – score a model, within an app
   public class Main {
     public static void main( String[] args ) {
       String pmmlPath = args[ 0 ];
       String ordersPath = args[ 1 ];
       String classifyPath = args[ 2 ];
       String trapPath = args[ 3 ];

         Properties properties = new Properties();
         AppProps.setApplicationJarClass( properties, Main.class );
         HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

         // create source and sink taps
        Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
        Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
        Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

         // define a "Classifier" model from PMML to evaluate the orders
         ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
         Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

         // connect the taps, pipes, etc., into a flow
         FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
          .addSource( classifyPipe, ordersTap )
          .addTrap( classifyPipe, trapTap )
          .addSink( classifyPipe, classifyTap );

         // write a DOT file and run the flow
         Flow classifyFlow = flowConnector.connect( flowDef );
         classifyFlow.writeDOT( "dot/classify.dot" );
         classifyFlow.complete();
       }
   }

                                                                                                                      66
Pattern – score a model, using pre-defined Cascading app



(diagram: Customer Orders → Classify (using the PMML Model) → Scored Orders → Assert → GroupBy token → Count, with Failure Traps captured and a Confusion Matrix produced)




cascading.org/pattern


                                                                                67
Pattern – score a model, using pre-defined Cascading app

   ## run an RF classifier at scale

   hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
     --pmml data/sample.rf.xml


   ## run an RF classifier at scale, assert regression test, measure confusion matrix

   hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
     --pmml data/sample.rf.xml --assert --measure out/measure


   ## run a predictive model at scale, measure RMSE

   hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
     --pmml data/iris.lm_p.xml --rmse out/measure




                                                                                        68
PMML – model coverage

•   Association Rules: AssociationModel element
•   Cluster Models: ClusteringModel element
•   Decision Trees: TreeModel element
•   Naïve Bayes Classifiers: NaiveBayesModel element
•   Neural Networks: NeuralNetwork element
•   Regression: RegressionModel and GeneralRegressionModel elements
•   Rulesets: RuleSetModel element
•   Sequences: SequenceModel element
•   Support Vector Machines: SupportVectorMachineModel element
•   Text Models: TextModel element
•   Time Series: TimeSeriesModel element

ibm.com/developerworks/industry/library/ind-PMML2/


                                                                      69
PMML – vendor coverage




                         70
experiments – Random Forest model

   ## train a Random Forest model
   ## example: http://mkseo.pe.kr/stats/?p=220
    
   f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
   fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
   print(fit)
   saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))



            OOB estimate of   error rate: 14%
   Confusion matrix:
      0   1 class.error
   0 69 16     0.1882353
   1 12 103    0.1043478




                                                                          71
experiments – Logistic Regression model

   ## train a Logistic Regression model (special case of GLM)
   ## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
    
   f <- as.formula("as.factor(label) ~ var0 + var2")
   fit <- glm(f, family=binomial, data=data)
   print(summary(fit))
   saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))



   Coefficients:
               Estimate Std. Error z value Pr(>|z|)
   (Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
   var0         -1.3755     0.4355  -3.159  0.00159 **
   var2         -3.7742     0.5794  -6.514 7.30e-11 ***
   ---
   Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1




   NB: this model has “var1” intentionally omitted


                                                                                 72
experiments – evaluating results

• use a confusion matrix to compare results for the classifiers
• Logistic Regression has a lower “false negative” rate (5% vs. 11%);
  however, it has a much higher “false positive” rate (52% vs. 14%)
• assign a cost model to select a winner –
  for example, in an ecommerce anti-fraud classifier:
    FN ∼ chargeback risk
    FP ∼ customer support costs
  (see the sketch below)
• can extend this to evaluate N models, M labels in an N × M × M matrix
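a minimal sketch of such a cost model in R – the $80/$5 figures are hypothetical, and confuse is the confusion matrix built earlier with table(pred=..., true=...):

## expected cost of a classifier, given its confusion matrix (rows = pred, cols = true)
cost <- function(confuse, fn_cost=80, fp_cost=5) {
  fn <- confuse["0", "1"]   # predicted negative, actually positive: chargeback risk
  fp <- confuse["1", "0"]   # predicted positive, actually negative: support costs
  fn * fn_cost + fp * fp_cost
}

## select the winner by lower expected cost, e.g.:
## cost(confuse_rf) vs. cost(confuse_lr)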




                                                                       73
Cascading: Workflow Abstraction
1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data

(diagram: word count flow – Document Collection → Tokenize → Scrub token → HashJoin Left against the Stop Word List (RHS) → GroupBy token → Count → Word Count)

                                                                                                              74
Palo Alto is quite a pleasant place

• temperate weather
• lots of parks, enormous trees
• great coffeehouses
• walkable downtown
• not particularly crowded


On a nice summer day, who wants to be stuck
indoors on a phone call?
Instead, take it outside – go for a walk

An example open source project:
github.com/Cascading/CoPA/wiki


                                              75
1. Open Data about municipal infrastructure
(GIS data: trees, roads, parks)
                             ✚
2. Big Data about where people like to walk
(smartphone GPS logs)
                             ✚
3. some curated metadata
(which surfaces the value)

4. personalized recommendations:
“Find a shady spot on a summer day in which to walk
 near downtown Palo Alto. While on a long conference call.
 Sipping a latte or enjoying some fro-yo.”

(diagram: the word count flow repeated as a motif)

                                                                                                                                           76
discovery
The City of Palo Alto recently began to support Open Data
to give the local community greater visibility into how
their city government operates.
This effort is intended to encourage students, entrepreneurs,
local organizations, etc., to build new apps which contribute
to the public good.


paloalto.opendata.junar.com/dashboards/7576/geographic-information/




                                                                            77
discovery
GIS about trees in Palo Alto:




                                            78
discovery
Geographic_Information,,,

"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl","               Private:     -1     Tree ID:     29
Street_Name:    ADDISON AV      Situs Number:      203      Tree Site:      2      Species:    Celtis australis
Source:    davey tree      Protected:         Designated:            Heritage:           Appraised Value:
Hardscape:    None     Identifier:    40      Active Numeric:        1     Location Feature ID:        13872
Provisional:         Install Date:        ","37.4409634615283,-122.15648458861,0.0 ","Point"
"Wilkie Way from West Meadow Drive to Victoria Place","             Sequence:      20     Street_Name:     Wilkie
Way     From Street PMMS:     West Meadow Drive        To Street PMMS:        Victoria Place       Street ID:
598 (Wilkie Wy, Palo Alto)       From Street ID PMMS:        689       To Street ID PMMS:      567      Year
Constructed:    1950      Traffic Count:    596      Traffic Index:        residential local        Traffic
Class:    local residential      Traffic Date:      08/24/90        Paving Length:       208     Paving Width:
40     Paving Area:    8320     Surface Type:      asphalt concrete         Surface Thickness:        2.0     Base
Type Pvmt:    crusher run base      Base Thickness:        6.0      Soil Class:       2    Soil Value:     15
Curb Type:         Curb Thickness:         Gutter Width:        36.0      Book:     22     Page:    1     District
Number:    18    Land Use PMMS:     1     Overlay Year:        1990      Overlay Thickness:       1.5     Base
Failure Year:    1990      Base Failure Thickness:       6      Surface Treatment Year:             Surface
Treatment Type:         Alligator Severity:      none       Alligator Extent:         0    Block Severity:
none     Block Extent:     0    Longitude and Transverse Severity:            none      Longitude and Transverse
Extent:    0
Trench Severity:    none    Trench Extent:    0    Ravelling Severity:    none    Ravelling Extent:    0
Rutting Severity:    none    Rutting Extent:    0    Ridability Severity:    none
                                          (unstructured data…)
Road Performance:     UL (Urban Local)      Bike Lane:       0      Bus Route:      0     Truck Route:     0
Remediation:         Deduct Value:    100      Priority:            Pavement Condition:       excellent
Street Cut Fee per SqFt:      10.00    Source Date:        6/10/2009       User Modified By:       mnicols
Identifier System:     21410    ","-122.1249640794,37.4155803115645,0.0
-122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0
-122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0
-122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line"
                                                                                                                     79
discovery
(defn parse-gis
  "leverages parse-csv for complex CSV format in GIS export"
  [line]
  (first (csv/parse-csv line)))


(defn etl-gis
  "subquery to parse data sets from the GIS source tap"
  [gis trap]
  (<- [?blurb ?misc ?geo ?kind]
      (gis ?line)
      (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
      (:trap (hfs-textline trap))))
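a usage sketch – the paths are hypothetical, and ?- plus hfs-textline come from cascalog.api; the subquery runs on the cluster, diverting unparseable records to the trap:

(let [gis (hfs-textline "data/copa.csv")]
  (?- (hfs-textline "out/gis" :sinkmode :replace)
      (etl-gis gis "out/trap")))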




                      (specify what you require,
                        not how to achieve it…
                      data prep costs are 80/20)


                                                                             80
discovery



 (ad-hoc queries get refined
into composable predicates)


    Identifier:   474
    Tree ID:      412
    Tree:         412 site 1 at 115 HAWTHORNE AV
    Tree Site:    1
    Street_Name: HAWTHORNE AV
    Situs Number: 115
    Private:      -1
    Species:      Liquidambar styraciflua
    Source:       davey tree
    Hardscape:    None
    37.446001565119,-122.167713417554,0.0
    Point



                                                         81
discovery




(curate valuable metadata)
                                         82
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 

Cascading: Enterprise Data Workflows based on Functional Programming

  • 1. “Cascading: Enterprise Data Workflows based on Functional Programming” Paco Nathan Concurrent, Inc. San Francisco, CA @pacoid Copyright @2013, Concurrent, Inc. 1
  • 2. Cascading: Workflow Abstraction – agenda: 1. Machine Data; 2. Cascading; 3. Sample Code; 4. A Little Theory…; 5. Workflows; 6. Lingual; 7. Pattern; 8. Open Data. (Flow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left against Stop Word List (RHS) → GroupBy token → Count → Word Count.)
  • 3. Q3 1997: inflection point Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware. This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG MapReduce and the Apache Hadoop open source stack emerged from this. 3
  • 4. Circa 1996: pre-inflection point. (Diagram: BI Analysts deliver Excel pivot tables and PowerPoint slide decks to Stakeholders; strategy flows to Product, requirements to Engineering; optimized code ships to the Web App serving Customers; transactions land in an RDBMS, which Analysts query via SQL for result sets.)
  • 5. Circa 1996: pre-inflection point – same diagram, with the callout “Throw it over the wall” between Analysts and Engineering.
  • 6. Circa 2001: post- big ecommerce successes. (Diagram: Stakeholders and Product drive UX and Engineering; Algorithmic Modeling produces models, recommenders, and classifiers for servlets in Web Apps and Middleware; event history aggregates from Logs through ETL into a DW, alongside customer transactions in the RDBMS; dashboards and SQL result sets feed back to the business.)
  • 7. Circa 2001: post- big ecommerce successes – same diagram, with the callout “Data products”.
  • 8. Circa 2013: clusters everywhere. (Diagram: Data Products for Customers, built by Domain Experts, Data Scientists, and App Devs; workflows span Web Apps/Mobile, Hadoop, In-Memory Data Grids, DW, and RDBMS topologies, with a Cluster Scheduler coordinating batch and near-time capacity across Ops.)
  • 9. Circa 2013: clusters everywhere – same diagram, with the callout “Optimizing topologies”.
  • 10. references… by Leo Breiman Statistical Modeling: The Two Cultures Statistical Science, 2001 bit.ly/eUTh9L 10
  • 11. references…
    Amazon – “Early Amazon: Splitting the website” – Greg Linden: glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
    eBay – “The eBay Architecture” – Randy Shoup, Dan Pritchett: addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html | addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
    Inktomi (YHOO Search) – “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff): youtube.com/watch?v=E91oEn1bnXM
    Google – “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff): youtube.com/watch?v=qsan-GQaeyk | perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
  • 12. Cascading: Workflow Abstraction – section divider, repeating the agenda from slide 2 (1. Machine Data; 2. Cascading; 3. Sample Code; 4. A Little Theory…; 5. Workflows; 6. Lingual; 7. Pattern; 8. Open Data).
  • 13. Cascading – origins API author Chris Wensel worked as a system architect at an Enterprise firm well-known for many popular data products. Wensel was following the Nutch open source project – where Hadoop started. Observation: would be difficult to find Java developers to write complex Enterprise apps in MapReduce – potential blocker for leveraging new open source technology. 13
  • 14. Cascading – functional programming Key insight: MapReduce is based on functional programming – back to LISP in 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature. To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows: • leverages JVM and Java-based tools without any need to create new languages • allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters 14
  • 15. functional programming… in production • Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010) Scalding in Scala (2012) github.com/nathanmarz/cascalog/wiki github.com/twitter/scalding/wiki 15
  • 16. Cascading – definitions: • a pattern language for Enterprise Data Workflows • simple to build, easy to test, robust in production • design principles ⟹ ensure best practices at scale. (Diagram: a sample app – Web App logs and Customer Prefs flow through source, trap, and sink taps into a PMML-driven Workflow on a Hadoop Cluster, feeding Analytics Cubes, Customer DBs, and Reporting for Support, Modeling, et al.)
  • 17. Cascading – usage: • Java API, DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL • ASL 2 license, GitHub src, http://conjars.org • 5+ yrs production use, multiple Enterprise verticals. (Same sample-app diagram as slide 16.)
  • 18. Cascading – integrations: • partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera • taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc. • serialization: Avro, Thrift, Kryo, JSON, etc. • topologies: Apache Hadoop, tuple spaces, local mode. (Same sample-app diagram as slide 16.)
  • 19. Cascading – deployments • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc. • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc. 19
  • 20. Cascading – deployments: • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc. • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc. Callout: the workflow abstraction addresses • staffing bottleneck • system integration • operational complexity • test-driven development.
  • 21. Cascading: Workflow Abstraction – section divider (same agenda as slide 2).
  • 22. The Ubiquitous Word Count. Definition: count how often each word appears in a collection of text documents. This simple program provides an excellent test case for parallel processing, since it illustrates: • requires a minimal amount of code • demonstrates use of both symbolic and numeric values • shows a dependency graph of tuples as an abstraction • is not many steps away from useful search indexing • serves as a “Hello World” for Hadoop apps. Pseudocode:
    void map (String doc_id, String text):
      for each word w in segment(text):
        emit(w, "1");

    void reduce (String word, Iterator group):
      int count = 0;
      for each pc in group:
        count += Int(pc);
      emit(word, String(count));
    Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.
  • 23. word count – conceptual flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; 1 map, 1 reduce, 18 lines of code. gist.github.com/3900702 | cascading.org/category/impatient
  • 24. word count – Cascading app in Java:
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex to split "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
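For quick iteration without a cluster, the same assembly can also run under Cascading’s local-mode planner. The following is a minimal sketch, assuming the Cascading 2.x local-mode classes (LocalFlowConnector, FileTap, and the local TextDelimited scheme) are on the classpath; only the connector and taps change, while the pipe assembly stays identical:

    import java.util.Properties;
    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.local.LocalFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.local.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.local.FileTap;
    import cascading.tuple.Fields;

    public class LocalWordCount {
      public static void main( String[] args ) {
        // local-mode taps read/write plain files instead of HDFS
        Tap docTap = new FileTap( new TextDelimited( true, "\t" ), args[ 0 ] );
        Tap wcTap = new FileTap( new TextDelimited( true, "\t" ), args[ 1 ] );

        // same pipe assembly as the Hadoop version above
        Fields token = new Fields( "token" );
        Pipe docPipe = new Each( "token", new Fields( "text" ),
          new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" ), Fields.RESULTS );
        Pipe wcPipe = new GroupBy( new Pipe( "wc", docPipe ), token );
        wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

        FlowDef flowDef = FlowDef.flowDef().setName( "wc-local" )
          .addSource( docPipe, docTap )
          .addTailSink( wcPipe, wcTap );

        // swap HadoopFlowConnector for LocalFlowConnector; nothing else changes
        Flow wcFlow = new LocalFlowConnector( new Properties() ).connect( flowDef );
        wcFlow.complete();
      }
    }

This makes unit tests and small sample runs fast, and the identical assembly can later be bound to a HadoopFlowConnector for the cluster.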
  • 25. word count – generated flow diagram. (DOT output from the flow planner: [head] → Hfs source tap over data/rain.txt with fields {doc_id, text} → map: Each('token') RegexSplitGenerator → GroupBy('wc') by token → reduce: Every('wc') Count → Hfs sink tap to output/wc with fields {token, count} → [tail].)
  • 26. word count – Cascalog / Clojure:
    (ns impatient.core
      (:use [cascalog.api]
            [cascalog.more-taps :only (hfs-delimited)])
      (:require [clojure.string :as s]
                [cascalog.ops :as c])
      (:gen-class))

    (defmapcatop split [line]
      "reads in a line of string and splits it by regex"
      (s/split line #"[\[\](),.)\s]+"))

    (defn -main [in out & args]
      (?<- (hfs-delimited out)
           [?word ?count]
           ((hfs-delimited in :skip-header? true) _ ?line)
           (split ?line :> ?word)
           (c/count ?count)))

    ; Paul Lam
    ; github.com/Quantisan/Impatient
  • 27. word count – Cascalog / Clojure Document Collection github.com/nathanmarz/cascalog/wiki Tokenize GroupBy M token Count R Word Count • implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language • run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL • composable subqueries, used for test-driven development (TDD) practices at scale • Leiningen build: simple, no surprises, in Clojure itself • more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog • has a learning curve, limited number of Clojure developers • aggregators are the magic, and those take effort to learn 27
  • 28. word count – Scalding / Scala:
    import com.twitter.scalding._

    class WordCount(args : Args) extends Job(args) {
      Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
        .read
        .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
        .groupBy('token) { _.size('count) }
        .write(Tsv(args("wc"), writeHeader = true))
    }
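As a usage sketch (jar name and paths hypothetical): Scalding jobs are typically launched through com.twitter.scalding.Tool, with --local for in-memory test runs or --hdfs for the cluster, and named arguments matching args("doc") and args("wc") in the code above:

    # run locally on sample data (hypothetical jar and paths)
    hadoop jar wordcount-assembly.jar com.twitter.scalding.Tool WordCount --local --doc data/rain.txt --wc output/wc

    # run the same job on a Hadoop cluster
    hadoop jar wordcount-assembly.jar com.twitter.scalding.Tool WordCount --hdfs --doc data/rain.txt --wc output/wc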
  • 29. word count – Scalding / Scala Document Collection github.com/twitter/scalding/wiki Tokenize GroupBy M token Count R Word Count • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading • code is compact, easy to understand • nearly 1:1 between elements of conceptual flow diagram and function calls • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc. • significant investments by Twitter, Etsy, eBay, etc. • great for data services at scale • less learning curve than Cascalog 29
  • 30. word count – Scalding / Scala: github.com/twitter/scalding/wiki – repeats the points from slide 29, with the callout: the Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping limit complexity in the process.
  • 31. Cascading: Workflow Abstraction – section divider (same agenda as slide 2).
  • 32. workflow abstraction – pattern language: Cascading uses a “plumbing” metaphor in the Java API to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc. Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java. In formal terms, this provides a pattern language. (Flow diagram as on slide 2.)
  • 33. references… pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices amazon.com/dp/0195019199 design patterns: the notion originated in consensus negotiation for architecture, later applied in OOP software engineering by “Gang of Four” amazon.com/dp/0201633612 33
  • 34. workflow abstraction – pattern language: same as slide 32, with the callout: design principles of the pattern language ensure best practices for robust, parallel data workflows at scale.
  • 35. workflow abstraction – literate programming: Cascading workflows generate their own visual documentation: flow diagrams. In formal terms, flow diagrams leverage a methodology called literate programming. Provides intuitive, visual representations for apps – great for cross-team collaboration. (Flow diagram as on slide 2.)
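As a small usage note: the writeDOT call in the sample app emits that diagram as a Graphviz file, so rendering the visual documentation is a one-liner, assuming Graphviz is installed:

    dot -Tpng dot/wc.dot -o dot/wc.png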
  • 36. references… by Don Knuth Literate Programming Univ of Chicago Press, 1992 literateprogramming.com/ “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.” 36
  • 37. workflow abstraction – business process Following the essence of literate programming, Cascading workflows provide statements of business process This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data) Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.) This is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.” By virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale 37
  • 38. references… by Edgar Codd “A relational model of data for large shared data banks” Communications of the ACM, 1970 dl.acm.org/citation.cfm?id=362685 Rather than arguing between SQL vs. NoSQL… structured vs. unstructured data frameworks… this approach focuses on what apps do: the process of structuring data 38
  • 39. workflow abstraction – functional relational programming The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970) prior to SQL. Cascalog, in particular, implements more of what Codd intended for a “data sublanguage” and is considered to be close to a full implementation of the functional relational programming paradigm defined in: Moseley & Marks, 2006 “Out of the Tar Pit” goo.gl/SKspn 39
  • 40. workflow abstraction – functional relational programming: same as slide 39, with the callout: several theoretical aspects converge into software engineering practices which minimize the complexity of building and maintaining Enterprise data workflows.
  • 41. Cascading: Workflow Abstraction – section divider (same agenda as slide 2).
  • 42. Enterprise Data Workflows: let’s consider a “strawman” architecture for an example app… at the front end, LOB use cases drive demand for apps. (Sample-app diagram as on slide 16.)
  • 43. Enterprise Data Workflows: same example… in the back office – organizations have substantial investments in people, infrastructure, process. (Sample-app diagram as on slide 16.)
  • 44. Enterprise Data Workflows: same example… the heavy lifting! “Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out. (Sample-app diagram as on slide 16.)
  • 45. Cascading workflows – taps: • taps integrate other data frameworks, as tuple streams • these are “plumbing” endpoints in the pattern language • sources (inputs), sinks (outputs), traps (exceptions) • text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc. • data serialization: Avro, Thrift, Kryo, JSON, etc. • extend a new kind of tap in just a few lines of Java – see the sketch below. Schema and provenance get derived from analysis of the taps. (Sample-app diagram as on slide 16.)
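To make the decoupling concrete, here is a minimal sketch, assuming Cascading 2.x tap classes (Hfs for HDFS, Lfs for the local filesystem under the Hadoop planner): the tuple schema is declared once, and endpoints swap without touching the pipe assembly:

    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tap.hadoop.Lfs;
    import cascading.tuple.Fields;

    public class TapSwap {
      // one declaration of the tuple schema, shared by all endpoints
      static final Fields DOC_FIELDS = new Fields( "doc_id", "text" );

      // HDFS endpoint, e.g., for production runs
      static Tap hdfsSource( String path ) {
        return new Hfs( new TextDelimited( DOC_FIELDS, true, "\t" ), path );
      }

      // local filesystem endpoint, e.g., for dev/test – same scheme, different tap
      static Tap localSource( String path ) {
        return new Lfs( new TextDelimited( DOC_FIELDS, true, "\t" ), path );
      }
    }

Because the pipe assembly only sees tuple streams, swapping Hfs for Lfs (or a Memcached, HBase, or JDBC tap) is a one-line change at the flow definition.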
  • 46. Cascading workflows – taps: the Word Count app source from slide 24 again, with the callout “source and sink taps for TSV data in HDFS” highlighting the two Hfs taps.
  • 47. Cascading workflows – topologies: • topologies execute workflows on clusters • the flow planner is like a compiler for queries – Hadoop (MapReduce jobs), local mode (dev/test or special config), in-memory data grids (real-time) • the flow planner can be extended to support other topologies. Blend flows in different topologies into the same app – for example, batch (Hadoop) + transactions (IMDG). (Sample-app diagram as on slide 16.)
  • 48. Cascading workflows – topologies: the Word Count app source from slide 24 again, with the callout “flow planner for Apache Hadoop topology” highlighting the HadoopFlowConnector.
  • 50. Cascading workflows – test-driven development: • assert patterns (regex) on the tuple streams • adjust assert levels, like log4j levels • trap edge cases as “data exceptions” • TDD at scale: 1. start from raw inputs in the flow graph; 2. define stream assertions for each stage of transforms; 3. verify exceptions, code to remove them; 4. when the impl is complete, the app has full test coverage. Redirect traps in production to Ops, QA, Support, Audit, etc. – see the sketch below. (Sample-app diagram as on slide 16.)
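A minimal sketch of a stream assertion plus a trap, assuming Cascading 2.x assertion classes (AssertionLevel, AssertMatches); tuples that fail an operation in the asserted pipe are diverted to the trap tap rather than failing the job:

    import cascading.flow.FlowDef;
    import cascading.operation.AssertionLevel;
    import cascading.operation.assertion.AssertMatches;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    public class AssertionExample {
      public static Pipe withAssertions( Pipe docPipe, FlowDef flowDef, String trapPath ) {
        // assert each tuple matches the pattern; STRICT levels also apply in production runs
        docPipe = new Each( docPipe, AssertionLevel.STRICT, new AssertMatches( "^\\S+$" ) );

        // tuples that fail are redirected here for Ops/QA/Support/Audit review
        Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );
        flowDef.addTrap( docPipe, trapTap );
        return docPipe;
      }
    }

Dialing the assertion level down (e.g., to VALID or NONE at planning time) removes the checks without touching the business logic, much like adjusting log4j levels.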
  • 51. Two Avenues to the App Layer… complexity ➞ Enterprise: must contend with complexity at scale everyday… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff. scale ➞ Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding.
  • 52. Cascading: Workflow Abstraction – section divider (same agenda as slide 2).
  • 53. Cascading workflows – ANSI SQL: • collab with Optiq – industry-proven code base • ANSI SQL parser/optimizer atop the Cascading flow planner • JDBC driver to integrate into existing tools and app servers • relational catalog over a collection of unstructured data • SQL shell prompt to run queries • enable analysts without retraining on Hadoop, etc. • transparency for Support, Ops, Finance, et al. A language for queries – not a database, but ANSI SQL as a DSL for workflows. (Sample-app diagram as on slide 16.)
  • 54. Lingual – CSV data in local file system cascading.org/lingual 54
  • 55. Lingual – shell prompt, catalog cascading.org/lingual 55
  • 57. abstraction layers in queries…
    abstraction    | RDBMS                                  | JVM Cluster
    parser         | ANSI SQL compliant parser              | ANSI SQL compliant parser
    optimizer      | logical plan, optimized based on stats | logical plan, optimized based on stats
    planner        | physical plan                          | API “plumbing”
    machine data   | query history, table stats             | app history, tuple stats
    topology       | b-trees, etc.                          | heterogeneous, distributed: Hadoop, in-memory, etc.
    visualization  | ERD                                    | flow diagram
    schema         | table schema                           | tuple schema
    catalog        | relational catalog                     | tap usage DB
    provenance     | (manual audit)                         | data set producers/consumers
  • 58. Lingual – JDBC driver:
    public void run() throws ClassNotFoundException, SQLException {
      Class.forName( "cascading.lingual.jdbc.Driver" );
      Connection connection = DriverManager.getConnection(
        "jdbc:lingual:local;schemas=src/main/resources/data/example" );
      Statement statement = connection.createStatement();

      ResultSet resultSet = statement.executeQuery(
        "select *\n"
        + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
        + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
        + "on e.\"EMPID\" = s.\"CUST_ID\"" );

      while( resultSet.next() ) {
        int n = resultSet.getMetaData().getColumnCount();
        StringBuilder builder = new StringBuilder();

        for( int i = 1; i <= n; i++ ) {
          builder.append( ( i > 1 ? "; " : "" )
            + resultSet.getMetaData().getColumnLabel( i )
            + "=" + resultSet.getObject( i ) );
        }
        System.out.println( builder );
      }

      resultSet.close();
      statement.close();
      connection.close();
    }
  • 59. Lingual – JDBC result set:
    $ gradle clean jar
    $ hadoop jar build/libs/lingual-examples-1.0.0-wip-dev.jar
    CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
    CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian
    Caveat: if you absolutely positively must have sub-second SQL query response for Pb-scale data on a 1000+ node cluster… good luck with that! (call the MPP vendors) This ANSI SQL library is primarily intended for batch workflows – high throughput, not low latency – for many under-represented use cases in Enterprise IT. In other words, SQL as a DSL. cascading.org/lingual
  • 60. Lingual – connecting Hadoop and R:
    # load the JDBC package
    library(RJDBC)

    # set up the driver
    drv <- JDBC("cascading.lingual.jdbc.Driver",
      "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

    # set up a database connection to a local repository
    connection <- dbConnect(drv,
      "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

    # query the repository: in this case the MySQL sample database (CSV files)
    df <- dbGetQuery(connection,
      "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
    head(df)

    # use R functions to summarize and visualize part of the data
    df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
    summary(df$hire_age)

    library(ggplot2)
    m <- ggplot(df, aes(x=hire_age))
    m <- m + ggtitle("Age at hire, people named Gina")
    m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
  • 61. Lingual – connecting Hadoop and R:
    > summary(df$hire_age)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      20.86   27.89   31.70   31.61   35.01   43.92
    cascading.org/lingual
  • 62. Cascading: Workflow Abstraction – section divider (same agenda as slide 2).
  • 63. Pattern – model scoring: • migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML • great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc. • integrate with other libraries – Matrix API, etc. • leverage PMML as another kind of DSL. cascading.org/pattern (Sample-app diagram as on slide 16.)
  • 64. Pattern – create a model in R:
    ## train a RandomForest model
    f <- as.formula("as.factor(label) ~ .")
    fit <- randomForest(f, data_train, ntree=50)

    ## test the model on the holdout test set
    print(fit$importance)
    print(fit)
    predicted <- predict(fit, data)
    data$predicted <- predicted
    confuse <- table(pred = predicted, true = data[,1])
    print(confuse)

    ## export predicted labels to TSV
    write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
      quote=FALSE, sep="\t", row.names=FALSE)

    ## export RF model to PMML
    saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
  • 65. Pattern – capture model parameters as PMML:
    <?xml version="1.0"?>
    <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
     <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
      <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
      <Application name="Rattle/PMML" version="1.2.30"/>
      <Timestamp>2012-10-22 19:39:28</Timestamp>
     </Header>
     <DataDictionary numberOfFields="4">
      <DataField name="label" optype="categorical" dataType="string">
       <Value value="0"/>
       <Value value="1"/>
      </DataField>
      <DataField name="var0" optype="continuous" dataType="double"/>
      <DataField name="var1" optype="continuous" dataType="double"/>
      <DataField name="var2" optype="continuous" dataType="double"/>
     </DataDictionary>
     <MiningModel modelName="randomForest_Model" functionName="classification">
      <MiningSchema>
       <MiningField name="label" usageType="predicted"/>
       <MiningField name="var0" usageType="active"/>
       <MiningField name="var1" usageType="active"/>
       <MiningField name="var2" usageType="active"/>
      </MiningSchema>
      <Segmentation multipleModelMethod="majorityVote">
       <Segment id="1">
        <True/>
        <TreeModel modelName="randomForest_Model" functionName="classification"
         algorithmName="randomForest" splitCharacteristic="binarySplit">
         <MiningSchema>
          <MiningField name="label" usageType="predicted"/>
          <MiningField name="var0" usageType="active"/>
          <MiningField name="var1" usageType="active"/>
          <MiningField name="var2" usageType="active"/>
         </MiningSchema>
    ...
  • 66. Pattern – score a model, within an app:
    public class Main {
      public static void main( String[] args ) {
        String pmmlPath = args[ 0 ];
        String ordersPath = args[ 1 ];
        String classifyPath = args[ 2 ];
        String trapPath = args[ 3 ];

        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

        // create source and sink taps
        Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
        Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
        Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

        // define a "Classifier" model from PMML to evaluate the orders
        ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
        Pipe classifyPipe = new Each( new Pipe( "classify" ),
          classFunc.getInputFields(), classFunc, Fields.ALL );

        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
          .addSource( classifyPipe, ordersTap )
          .addTrap( classifyPipe, trapTap )
          .addSink( classifyPipe, classifyTap );

        // write a DOT file and run the flow
        Flow classifyFlow = flowConnector.connect( flowDef );
        classifyFlow.writeDOT( "dot/classify.dot" );
        classifyFlow.complete();
      }
    }
  • 67. Pattern – score a model, using the pre-defined Cascading app. (Flow diagram: Customer Orders → Classify (PMML Model) → Assert → GroupBy token → Count → Scored Orders, with Failure Traps and a Confusion Matrix as additional sinks.) cascading.org/pattern
  • 68. Pattern – score a model, using the pre-defined Cascading app:
    ## run an RF classifier at scale
    hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap --pmml data/sample.rf.xml

    ## run an RF classifier at scale, assert regression test, measure confusion matrix
    hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap --pmml data/sample.rf.xml --assert --measure out/measure

    ## run a predictive model at scale, measure RMSE
    hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap --pmml data/iris.lm_p.xml --rmse out/measure
  • 69. PMML – model coverage • Association Rules: AssociationModel element • Cluster Models: ClusteringModel element • Decision Trees: TreeModel element • Naïve Bayes Classifiers: NaiveBayesModel element • Neural Networks: NeuralNetwork element • Regression: RegressionModel and GeneralRegressionModel elements • Rulesets: RuleSetModel element • Sequences: SequenceModel element • Support Vector Machines: SupportVectorMachineModel element • Text Models: TextModel element • Time Series: TimeSeriesModel element ibm.com/developerworks/industry/library/ind-PMML2/ 69
  • 70. PMML – vendor coverage 70
  • 71. experiments – Random Forest model:
    ## train a Random Forest model
    ## example: http://mkseo.pe.kr/stats/?p=220
    f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
    fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
    print(fit)
    saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))

    OOB estimate of error rate: 14%
    Confusion matrix:
        0   1 class.error
    0  69  16   0.1882353
    1  12 103   0.1043478
  • 72. experiments – Logistic Regression model:
    ## train a Logistic Regression model (special case of GLM)
    ## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
    f <- as.formula("as.factor(label) ~ var0 + var2")
    fit <- glm(f, family=binomial, data=data)
    print(summary(fit))
    saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
    var0         -1.3755     0.4355  -3.159  0.00159 **
    var2         -3.7742     0.5794  -6.514 7.30e-11 ***
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    NB: this model has “var1” intentionally omitted
  • 73. experiments – evaluating results • use a confusion matrix to compare results for the classifiers • Logistic Regression has a lower “false negative” rate (5% vs. 11%) however it has a much higher “false positive” rate (52% vs. 14%) • assign a cost model to select a winner – for example, in an ecommerce anti-fraud classifier: FN ∼ chargeback risk FP ∼ customer support costs • can extend this to evaluate N models, M labels in an N × M × M matrix 73
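To make the cost model concrete, here is a small sketch in Java (error rates from the slide; the per-event dollar costs are hypothetical, and the arithmetic treats the rates as per-transaction probabilities for simplicity):

    public class CostModel {
      // expected cost per transaction, given error rates and unit costs
      static double expectedCost( double fnRate, double fpRate, double fnCost, double fpCost ) {
        return fnRate * fnCost + fpRate * fpCost;
      }

      public static void main( String[] args ) {
        double chargeback = 50.0;  // hypothetical cost of a missed fraud case (FN)
        double support = 5.0;      // hypothetical cost of a false alarm (FP)

        // rates from the slide: RF 11% FN / 14% FP, LR 5% FN / 52% FP
        double rf = expectedCost( 0.11, 0.14, chargeback, support );  // 6.20
        double lr = expectedCost( 0.05, 0.52, chargeback, support );  // 5.10

        System.out.printf( "RF: $%.2f per txn, LR: $%.2f per txn%n", rf, lr );
      }
    }

With these unit costs the Logistic Regression model wins despite its much higher false-positive rate; shift the FP cost upward and the Random Forest wins instead, which is exactly why the cost model, not raw accuracy, should select the winner.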
  • 74. Cascading: Workflow Abstraction – section divider (same agenda as slide 2).
  • 75. Palo Alto is quite a pleasant place: • temperate weather • lots of parks, enormous trees • great coffeehouses • walkable downtown • not particularly crowded. On a nice summer day, who wants to be stuck indoors on a phone call? Instead, take it outside – go for a walk. An example open source project: github.com/Cascading/CoPA/wiki
  • 76. 1. Open Data about municipal infrastructure (GIS data: trees, roads, parks) ✚ 2. Big Data about where people like to walk (smartphone GPS logs) ✚ 3. some curated metadata (which surfaces the value) ✚ 4. personalized recommendations: “Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sipping a latte or enjoying some fro-yo.” (Flow diagram as on slide 2.)
  • 77. discovery The City of Palo Alto recently began to support Open Data to give the local community greater visibility into how their city government operates This effort is intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good paloalto.opendata.junar.com/dashboards/7576/geographic-information/ 77
  • 78. discovery GIS about trees in Palo Alto: 78
  • 79. discovery – raw GIS export (unstructured data…):
    Geographic_Information,,,
    "Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis Source: davey tree Protected: Designated: Heritage: Appraised Value: Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872 Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point"
    "Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie Way From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID: 598 (Wilkie Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year Constructed: 1950 Traffic Count: 596 Traffic Index: residential local Traffic Class: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width: 40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: 2.0 Base Type Pvmt: crusher run base Base Thickness: 6.0 Soil Class: 2 Soil Value: 15 Curb Type: Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1 District Number: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure Thickness: 6 Surface Treatment Year: Surface Treatment Type: Alligator Severity: none Alligator Extent: 0 Block Severity: none Block Extent: 0 Longitude and Transverse Severity: none Longitude and Transverse Extent: 0 Trench Severity: none Trench Extent: 0 Ravelling Severity: none Ravelling Extent: 0 Rutting Severity: none Rutting Extent: 0 Ridability Severity: none Road Performance: UL (Urban Local) Bike Lane: 0 Bus Route: 0 Truck Route: 0 Remediation: Deduct Value: 100 Priority: Pavement Condition: excellent Street Cut Fee per SqFt: 10.00 Source Date: 6/10/2009 User Modified By: mnicols Identifier System: 21410 ","-122.1249640794,37.4155803115645,0.0 -122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0 -122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0 -122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line"
  • 80. discovery:
    (defn parse-gis [line]
      "leverages parse-csv for complex CSV format in GIS export"
      (first (csv/parse-csv line)))

    (defn etl-gis [gis trap]
      "subquery to parse data sets from the GIS source tap"
      (<- [?blurb ?misc ?geo ?kind]
          (gis ?line)
          (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
          (:trap (hfs-textline trap))))

    (specify what you require, not how to achieve it… data prep costs are 80/20)
  • 81. discovery (ad-hoc queries get refined into composable predicates) Identifier: 474 Tree ID: 412 Tree: 412 site 1 at 115 HAWTHORNE AV Tree Site: 1 Street_Name: HAWTHORNE AV Situs Number: 115 Private: -1 Species: Liquidambar styraciflua Source: davey tree Hardscape: None 37.446001565119,-122.167713417554,0.0 Point 81