Functional programming for optimization problems in Big Data
1. “Functional programming for optimization problems in Big Data”
Paco Nathan
Concurrent, Inc.
San Francisco, CA
@pacoid
Copyright ©2013, Concurrent, Inc.
2. The Workflow Abstraction
[flow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (Stop Word List as RHS) → GroupBy token → Count → Word Count; M/R mark the map and reduce boundaries]
1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example
Let’s consider a trendline subsequent to the 1997 Q3 inflection point, which enabled huge ecommerce successes and commercialized Big Data.
Where did Big Data come from, and where is this kind of work headed?
3. Q3 1997: inflection point
Four independent teams were working toward horizontal
scale-out of workflows based on commodity hardware.
This effort prepared the way for huge Internet successes
in the 1997 holiday season… AMZN, EBAY, Inktomi
(YHOO Search), then GOOG
MapReduce and the Apache Hadoop open source stack
emerged from this.
Q3 1997: Greg Linden, et al., @ Amazon, Randy Shoup, et al., @ eBay -- independent teams arrived at the same conclusion:
parallelize workloads onto clusters of commodity servers to scale-out horizontally.
Google and Inktomi (YHOO Search) were working along the same lines.
4. Circa 1996: pre-inflection point
[diagram, circa 1996: Stakeholders ↔ Product ↔ Engineering ↔ Web App ↔ Customers; BI analysts run SQL queries against the RDBMS for result sets, delivering Excel pivot tables and PowerPoint slide decks as strategy; Product hands requirements to Engineering, which writes code for a Web App; the Web App records transactions in the RDBMS]
Perl and C++ for CGI :)
Feedback loops shown in red represent data innovations at the time… these are rather static.
Characterized by slow, manual processes:
data modeling / business intelligence; “throw it over the wall”…
this thinking led to impossible silos
5. Circa 2001: post big-ecommerce successes
[diagram, circa 2001: Stakeholders and Product view dashboards and UX; Engineering builds models, servlets, recommenders, and classifiers in Web Apps and Middleware serving Customers; logs capture customer event history; ETL moves transactions from the RDBMS into a DW; SQL queries return result sets; algorithmic modeling feeds on aggregation of the logs]
Machine data (unstructured logs) captured social interactions. Data from aggregated logs fed into algorithmic modeling to produce recommenders, classifiers, and other predictive models -- e.g., ad networks automating parts of the
marketing funnel, as in our case study.
LinkedIn, Facebook, Twitter, Apple, etc., followed early successes. Algorithmic modeling, leveraging machine data, allowed for Big Data to become monetized.
6. Circa 2013: clusters everywhere
[diagram, circa 2013: inter-disciplinary teams – Domain Expert, Data Scientist, App Dev, Ops – build Data Products for Customers via Web Apps, Mobile, etc.; business process and workflows drive dashboards and metrics; data science discovery and modeling tap optimized capacity; use cases run across topologies – Hadoop etc. for batch, log events and an in-memory data grid for near time – under a cluster scheduler and planner driven by app history; the RDBMS and existing SDLC remain alongside the introduced capability]
Here’s what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams.
Also, machine data (app history) driving planners and schedulers for advanced multi-tenant cluster computing fabric.
Not unlike a practice at LLL, where much more data gets collected about the machine than about the experiment.
We see this feeding into cluster optimization in YARN, Apache Mesos, etc.
7. references…
by Leo Breiman
Statistical Modeling: The Two Cultures
Statistical Science, 2001
bit.ly/eUTh9L
Leo Breiman wrote an excellent paper in 2001, “Two Cultures”, chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization)
8. references…
Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtube.com/watch?v=E91oEn1bnXM
Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtube.com/watch?v=qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
In their own words…
9. core values
Data Science teams develop actionable insights, building confidence for decisions.
That work may influence a few decisions worth billions (e.g., M&A) or billions of small decisions (e.g., AdWords) – probably somewhere in-between…
solving for pattern, at scale.
By definition, this is a multi-disciplinary pursuit which requires teams, not sole players.
10. team process = needs
discovery – help people ask the right questions
modeling – allow automation to place informed bets
integration – deliver products at scale to customers
apps – build smarts into product features
systems – keep infrastructure running, cost-effective
11. team composition = roles
[flow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (Stop Word List as RHS) → GroupBy token → Count → Word Count]
Domain Expert – business process, stakeholder
Data Scientist – data prep, discovery, modeling, etc.
App Dev – software engineering, automation
Ops – systems engineering, access
This is an example of multi-disciplinary team composition for data science
Other emerging problem spaces will require other, more specific kinds of team roles.
12. matrix: evaluate needs × roles
[matrix: rows list the roles – stakeholder, scientist, developer, ops – and columns list the needs – discovery, modeling, integration, apps, systems – with each cell rating how strongly a role covers a need]
13. most valuable skills
approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log file analysis, etc.
unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean-up
most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable
the rest of the skills – modeling, algorithms, etc. – those are secondary
14. science in data science?
in a nutshell, what we do…
‣ estimate probability
‣ calculate analytic variance
‣ manipulate order complexity
‣ leverage use of learning theory
+ collab with DevOps, Stakeholders
+ reduce work to cron entries
[background art: a mirrored dump of game telemetry event names, e.g. “NUI:DressUpMode”, “Client Inventory Panel Apply Product”, “Customer Made Purchase Cart Page Step 2”]
15. references…
by DJ Patil
Data Jujitsu
O’Reilly, 2012
amazon.com/dp/B008HMN5BE
Building Data Science Teams
O’Reilly, 2011
amazon.com/dp/B005O4U3ZE
16. The Workflow Abstraction
[flow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (Stop Word List as RHS) → GroupBy token → Count → Word Count]
1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example
Origin and overview of Cascading API as a workflow abstraction for Enterprise Big Data apps.
17. Cascading – origins
API author Chris Wensel worked as a system architect
at an Enterprise firm well-known for several popular
data products.
Wensel was following the Nutch open source project –
before Hadoop even had a name.
He noted that it would become difficult to find Java
developers to write complex Enterprise apps directly
in Apache Hadoop – a potential blocker for leveraging
this new open source technology.
Cascading initially grew from interaction with the Nutch project, before Hadoop had a name
API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context without an abstraction layer.
18. Cascading – functional programming
Key insight: MapReduce is based on functional programming – back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows.
Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.
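Since the pipeline claim is easy to demo, here is a minimal sketch in plain Scala (local collections only, no Hadoop; names are illustrative) showing Word Count as the same map/group/reduce shape that MapReduce distributes across a cluster:

// Word Count as a functional pipeline over an in-memory collection.
// The same shape -- map, group, reduce -- is what MapReduce runs at
// cluster scale, which is why data pipelines translate so naturally.
object LocalWordCount extends App {
  val docs = List("a b a", "b c")

  val counts: Map[String, Int] =
    docs
      .flatMap(_.split("\\s+"))              // "map" phase: emit tokens
      .groupBy(identity)                     // shuffle: group by key
      .map { case (w, ws) => (w, ws.size) }  // "reduce" phase: count per key

  counts.toSeq.sortBy(-_._2).foreach { case (w, n) => println(w + "\t" + n) }
}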
19. examples…
• Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested
in functional programming open source projects atop
Cascading – used for their large-scale production
deployments
• new case studies for Cascading apps are mostly
based on domain-specific languages (DSLs) in JVM
languages which emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
Many case studies, many Enterprise production deployments now for 5+ years.
20. The Ubiquitous Word Count
[flow diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]
Definition: count how often each word appears in a collection of text documents.
This simple program provides an excellent test case for parallel processing, since it:
• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a “Hello World” for Hadoop apps
Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));
Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already...
21. word count – conceptual flow diagram
[conceptual flow diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]
1 map, 1 reduce, 18 lines of code: gist.github.com/3900702
cascading.org/category/impatient
Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
22. word count – Cascading app in Java
[flow diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
Based on a Cascading implementation of Word Count, here is sample code --
approx 1/3 the code size of the Word Count example from Apache Hadoop
2nd to last line: generates a DOT file for the flow diagram
23. word count – generated flow diagram
[generated flow diagram for Word Count, node labels as emitted by the app:]

[head]
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']
  [{2}:'doc_id', 'text']
map
Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
  [{1}:'token']
GroupBy('wc')[by:['token']]
  wc[{1}:'token']
reduce
Every('wc')[Count[decl:'count']]
  [{2}:'token', 'count']
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']
  [{2}:'token', 'count']
[tail]
As a concrete example of literate programming in Cascading,
here is the DOT representation of the flow plan -- generated by the app itself.
24. word count – Cascalog / Clojure
[flow diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\](),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient
Here is the same Word Count app written in Clojure, using Cascalog.
25. word count – Cascalog / Clojure
github.com/nathanmarz/cascalog/wiki
[flow diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]
• implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development (TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
From what we see about language features, customer case studies, and best practices in general --
Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments.
Great for large-scale, complex apps, where small teams must limit the complexities in their process.
26. word count – Scalding / Scala
[flow diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]

import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"),
      ('doc_id, 'text),
      skipHeader = true)
    .read
    .flatMap('text -> 'token) {
      text : String => text.split("[ \\[\\](),.]")
    }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}
Here is the same Word Count app written in Scala, using Scalding.
Very compact, easy to understand; however, also more imperative than Cascalog.
27. word count – Scalding / Scala
github.com/twitter/scalding/wiki
[flow diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]
• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram and function calls
• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog, not as much of a high-level language
If you wanted to see what a data services architecture for machine learning work at, say, Google scale would look like as an open source project -- that’s Scalding. That’s what they’re doing.
28. word count – Scalding / Scala
github.com/twitter/scalding/wiki
[flow diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count]
• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram and function calls
• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale (imagine SOA infra @ Google as an open source project)
• less learning curve than Cascalog, not as much of a high-level language
Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process.
Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
29. The Workflow Abstraction
[flow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (Stop Word List as RHS) → GroupBy token → Count → Word Count]
1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example
CS theory related to data workflow abstraction, to manage complexity
30. Cascading workflows – pattern language
Cascading uses a “plumbing” metaphor in the Java API,
to define workflows out of familiar elements: Pipes, Taps,
Tuple Flows, Filters, Joins, Traps, etc.
[flow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (Stop Word List as RHS) → GroupBy token → Count → Word Count]
Data is represented as flows of tuples. Operations within the tuple flows bring functional programming aspects into Java apps.
In formal terms, this provides a pattern language.
A pattern language, based on the metaphor of “plumbing”
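To make the plumbing metaphor concrete, here is a toy sketch in plain Scala (an analogy only, not the Cascading API): each pipe is a function over a stream of tuples, and a workflow is simply the composition of pipes.

// a "tuple" is a set of named fields; a "pipe" transforms a tuple stream
object Plumbing extends App {
  type Tuple = Map[String, String]
  type Pipe  = Iterator[Tuple] => Iterator[Tuple]

  // split the "text" field into one tuple per token
  val tokenize: Pipe =
    _.flatMap(t => t("text").split("\\s+").map(w => Map("token" -> w)))

  // scrub each token down to lower case
  val scrub: Pipe =
    _.map(t => t.updated("token", t("token").toLowerCase))

  val flow: Pipe = tokenize andThen scrub   // compose pipes into a workflow
  flow(Iterator(Map("text" -> "Rain Shadow"))).foreach(println)
}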
31. references…
pattern language: a structured method for solving
large, complex design problems, where the syntax of
the language promotes the use of best practices.
amazon.com/dp/0195019199
design patterns: the notion originated in consensus
negotiation for architecture, later applied in OOP
software engineering by “Gang of Four”.
amazon.com/dp/0201633612
Chris Alexander originated the use of pattern language in a project called “The Oregon Experiment”, in the 1970s.
32. Cascading workflows – literate programming
Cascading workflows generate their own visual
documentation: flow diagrams
[flow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (Stop Word List as RHS) → GroupBy token → Count → Word Count]
In formal terms, flow diagrams leverage a methodology called literate programming.
Provides intuitive, visual representations for apps, great for cross-team collaboration.
Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming.
Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the “cascading-users” email list is most telling -- expert developers generally ask a novice to provide a flow diagram first
33. references…
by Don Knuth
Literate Programming
Univ of Chicago Press, 1992
literateprogramming.com/
“Instead of imagining that our main task is
to instruct a computer what to do, let us
concentrate rather on explaining to human
beings what we want a computer to do.”
Don Knuth originated the notion of literate programming, or code as “literature” which explains itself.
34. examples…
• Scalding apps have nearly 1:1 correspondence between function calls and the elements in their flow diagrams – excellent elision and literate representation
• noticed on the cascading-users email list: when troubleshooting issues, Cascading experts ask novices to provide an app’s flow diagram (generated as a DOT file), sometimes in lieu of showing code
In formal terms, a flow diagram is a directed, acyclic graph (DAG) on which lots of interesting math applies for query optimization, predictive models about app execution, parallel efficiency metrics, etc.
[sidebar: the generated Word Count flow diagram from slide 23]
Literate programming examples observed on the email list are some of the best illustrations of this methodology.
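As a small illustration of why the DAG view pays off, here is a sketch in plain Scala (node names borrowed from the Word Count flow; this is not the actual Cascading planner) that topologically sorts a flow graph – the kind of traversal a planner performs before scheduling parallel jobs:

import scala.collection.mutable

// a flow plan as a DAG: upstream operation -> downstream operations
object FlowDag extends App {
  val edges: Map[String, List[String]] = Map(
    "head"     -> List("tokenize"),
    "tokenize" -> List("groupby"),
    "groupby"  -> List("count"),
    "count"    -> List("tail"),
    "tail"     -> Nil
  )

  // Kahn's algorithm: repeatedly emit nodes with no remaining predecessors
  def topoSort(g: Map[String, List[String]]): List[String] = {
    val indegree = mutable.Map(g.keys.map(_ -> 0).toSeq: _*)
    for ((_, outs) <- g; n <- outs) indegree(n) += 1
    val queue = mutable.Queue(indegree.collect { case (n, 0) => n }.toSeq: _*)
    val order = mutable.ListBuffer[String]()
    while (queue.nonEmpty) {
      val n = queue.dequeue()
      order += n
      for (m <- g(n)) { indegree(m) -= 1; if (indegree(m) == 0) queue += m }
    }
    order.toList
  }

  println(topoSort(edges).mkString(" -> "))   // head -> tokenize -> ...
}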
35. Cascading workflows – business process
Following the essence of literate programming, Cascading
workflows provide statements of business process
This recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
As a separation of concerns between business process
and implementation details (Hadoop, etc.)
This is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
By virtue of the pattern language, the flow planner used in a Cascading app determines how to translate business process into efficient, parallel jobs at scale.
Business Stakeholder POV:
business process management for workflow orchestration (think BPM/BPEL)
36. references…
by Edgar Codd
“A relational model of data for large shared data banks”
Communications of the ACM, 1970
dl.acm.org/citation.cfm?id=362685
Rather than arguing between SQL vs. NoSQL…
structured vs. unstructured data frameworks…
this approach focuses on:
the process of structuring data
That’s what apps do – Making Data Work
Focus on *the process of structuring data*
which must happen before the large-scale joins, predictive models, visualizations, etc.
Just because your data is loaded into a “structured” store, that does not imply that your app has finished structuring it for the purpose of making data work.
BTW, anybody notice that the O’Reilly “animal” for the Cascading book is an Atlantic Cod? (pun intended)
37. Cascading workflows – functional relational programming
The combination of functional programming, pattern language,
DSLs, literate programming, business process, etc., traces back
to the original definition of the relational model (Codd, 1970)
prior to SQL.
Cascalog, in particular, implements more of what Codd intended
for a “data sublanguage” and is considered to be close to a full
implementation of the functional relational programming
paradigm defined in:
Moseley & Marks, 2006
“Out of the Tar Pit”
goo.gl/SKspn
A more contemporary statement along similar lines...
38. Two Avenues…
Enterprise: must contend with complexity at scale everyday…
incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff

Start-ups: crave complexity and scale to become viable…
new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding

[chart: the two avenues plotted on axes of complexity ➞ and scale ➞]
Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity
39. Cascading workflows – functional relational programming
The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970) prior to SQL.
Cascalog, in particular, implements more of what Codd intended for a “data sublanguage” and is considered to be close to a full implementation of the functional relational programming paradigm defined in:
Moseley & Marks, 2006
“Out of the Tar Pit”
goo.gl/SKspn

several theoretical aspects converge into software engineering practices which mitigate the complexity of building and maintaining Enterprise data workflows
40. The Workflow Abstraction
[flow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (Stop Word List as RHS) → GroupBy token → Count → Word Count]
1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example
Here are a few use cases to consider, for Enterprise data workflows
41. Cascading – deployments
• 5+ year history of Enterprise production deployments, ASL 2 license, GitHub src, http://conjars.org
• partners: Amazon AWS, Microsoft Azure, Hortonworks,
MapR, EMC, SpringSource, Cloudera
• case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma,
uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
• use cases: ETL, marketing funnel, anti-fraud, social media,
retail pricing, search analytics, recommenders, eCRM,
utility grids, genomics, climatology, etc.
Several published case studies about Cascading, Cascalog, Scalding, etc.
Wide range of use cases.
Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading.
Partnerships with the various Hadoop distro vendors, cloud providers, etc.
42. Finance: Ecommerce Risk
Problem:
<1% chargeback rate allowed by Visa, others follow
• may leverage CAPTURE/AUTH wait period
• Cybersource,Vindicia, others haven’t stopped fraud
>15% chargeback rate common for mobile in US:
• not much info shared with merchant
• carrier as judge/jury/executioner; customer assumed correct
most common: professional fraud (identity theft, etc.)
• patterns of attack change all the time
• widespread use of IP proxies, to mask location
• global market for stolen credit card info
other common case is friendly fraud
• teenager billing to parent’s cell phone
43. Finance: Ecommerce Risk
KPI:
chargeback rate (CB)
• ground truth for how much fraud the bank/carrier claims
• 7-120 day latencies from the bank
false positive rate (FP)
• estimated cost: predicts customer support issues
• complaints due to incorrect fraud scores on valid orders (or lies)
false negative rate (FN)
• estimated risk: how much fraud may pass undetected in future orders
• changes with new product features/services/inventory/marketing
44. Finance: Ecommerce Risk
Data Science Issues:
• chargeback limits imply few training cases
• sparse data implies lots of missing values – must impute
• long latency on chargebacks – “good” flips to “bad”
• most detection occurs within large-scale batch,
decisions required during real-time event processing
• not just one pattern to detect – many, ever-changing
• many unknowns: blocked orders scare off professional fraud,
inferences cannot be confirmed
• cannot simply use raw data as input – requires lots of
data preparation and statistical modeling
• each ecommerce firm has shopping/policy nuances
which get exploited differently – hard to generalize solutions
45. Finance: Ecommerce Risk
Predictive Analytics:
batch
• cluster/segment customers for expected behaviors
• adjust for seasonal variation
• geospatial indexing / bayesian point estimates (fraud by lat/lng)
• impute missing values (“guesses” to fill in sparse data)
• run anti-fraud classifier (customer 360)
real-time
• exponential smoothing (estimators for velocity) – see the sketch below
• calculate running medians (anomaly detection)
• run anti-fraud classifier (per order)
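For concreteness, here are minimal sketches of those two real-time estimators in plain Scala (assumed shapes and parameters, not code from any production risk system):

// exponential smoothing: a cheap velocity estimate over an event stream;
// running median over a sliding window: a robust baseline for anomalies
object StreamEstimators extends App {
  def smooth(xs: Seq[Double], alpha: Double): Seq[Double] =
    xs.tail.scanLeft(xs.head)((s, x) => alpha * x + (1 - alpha) * s)

  def runningMedian(xs: Seq[Double], window: Int): Seq[Double] =
    xs.sliding(window).map { w =>
      val s = w.sorted
      if (s.size % 2 == 1) s(s.size / 2)
      else (s(s.size / 2 - 1) + s(s.size / 2)) / 2.0
    }.toSeq

  // a spike (90.0) barely moves the median but stands out against the smoothed line
  val orderAmounts = Seq(10.0, 12.0, 11.0, 90.0, 12.0, 13.0)
  println(smooth(orderAmounts, alpha = 0.3).mkString(", "))
  println(runningMedian(orderAmounts, window = 3).mkString(", "))
}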
46. Finance: Ecommerce Risk
1. Data Preparation (batch)
‣ ETL from bank, log sessionization, customer profiles, etc.
  - large-scale joins of customers + orders
‣ apply time window
  - too long: patterns lose currency
  - too short: not enough wait for chargebacks
‣ segment customers
  - temporary fraud (identity theft which has been resolved)
  - confirmed fraud (chargebacks from the bank)
  - estimated fraud (blocked/banned by Customer Support)
  - valid orders (but different clusters of expected behavior)
‣ subsample to rebalance data – sketched below
  - produce training set + test holdout
  - adjust balance for FP/FN bias (company risk profile)
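A minimal sketch of that subsample/rebalance/holdout step in plain Scala (the labels, ratio, and split are hypothetical):

import scala.util.Random

object Rebalance extends App {
  case class Example(features: Vector[Double], fraud: Boolean)

  val rnd = new Random(42)   // fixed seed keeps the split reproducible

  // keep `negPerPos` valid orders per fraud case (fraud is scarce), then
  // carve a test holdout off the shuffled, rebalanced set
  def rebalanceAndSplit(data: Seq[Example], negPerPos: Double, holdout: Double)
      : (Seq[Example], Seq[Example]) = {
    val (pos, neg) = data.partition(_.fraud)
    val keptNeg    = rnd.shuffle(neg).take((pos.size * negPerPos).toInt)
    val balanced   = rnd.shuffle(pos ++ keptNeg)
    val cut        = (balanced.size * (1 - holdout)).toInt
    (balanced.take(cut), balanced.drop(cut))   // (training set, test holdout)
  }

  val demo = Example(Vector(1.0), fraud = true) +:
             (1 to 10).map(i => Example(Vector(i.toDouble), fraud = false))
  val (train, test) = rebalanceAndSplit(demo, negPerPos = 3.0, holdout = 0.25)
  println(s"train=${train.size} test=${test.size}")
}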
47. Finance: Ecommerce Risk
2. Model Creation (analyst)
‣ distinguish between different IV data types
  - continuous (e.g., age)
  - boolean (e.g., paid lead)
  - categorical (e.g., gender)
  - computed (e.g., geo risk, velocities)
‣ use geospatial smoothing for lat/lng
‣ determine distributions for IV
‣ adjust IV for seasonal variation, where appropriate
‣ impute missing values based on density functions / medians
‣ factor analysis: determine which IV to keep (too many creates problems)
‣ train model: random forest (RF) classifiers predict likely fraud
‣ calculate the confusion matrix (TP/FP/TN/FN) – see the sketch below
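The confusion matrix step reduces to simple counting over scored examples; a plain-Scala sketch (the 0.5 threshold and sample scores are arbitrary assumptions):

object Confusion extends App {
  // scored: (model score, actual fraud?) pairs from the test holdout
  def confusion(scored: Seq[(Double, Boolean)], threshold: Double): (Int, Int, Int, Int) = {
    val flagged = scored.map { case (score, actual) => (score >= threshold, actual) }
    val tp = flagged.count { case (p, a) => p && a }    // caught fraud
    val fp = flagged.count { case (p, a) => p && !a }   // blocked a valid order
    val tn = flagged.count { case (p, a) => !p && !a }
    val fn = flagged.count { case (p, a) => !p && a }   // fraud slipped through
    (tp, fp, tn, fn)
  }

  val (tp, fp, tn, fn) =
    confusion(Seq((0.9, true), (0.2, false), (0.7, false), (0.4, true)), threshold = 0.5)
  println(s"TP=$tp FP=$fp TN=$tn FN=$fn")   // FP and FN feed the KPIs above
}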
48. Finance: Ecommerce Risk
3. Test Model (analyst/batch loop)
‣ calculate estimated fraud rates
‣ identify potential found fraud cases
‣ report to Customer Support for review
‣ generate risk vs. benefit curves
‣ visualize estimated impact of new model
4. Decision (stakeholder)
‣ decide risk vs. benefit (minimize fraud + customer support costs)
‣ coordinate with bank/carrier if there are current issues
‣ determine go/no-go, when to deploy in production, size of rollout
49. Finance: Ecommerce Risk
5. Production Deployment (near-time)
‣ run model on in-memory grid / transaction processing
‣ A/B test to verify model in production (progressive rollout)
‣ detect anomalies
- use running medians on continuous IVs
- use exponential smoothing on computed IVs (velocities)
- trigger notifications
‣ monitor KPI and other metrics in dashboards
50. Finance: Ecommerce Risk
[architecture diagram: two Cascading risk classifier apps – one batch, dimension “customer 360”; one real-time, dimension “per-order”. Batch workloads on Hadoop handle ETL from the DW (chargebacks, partner data), data prep, training data sets, customer segmentation, and fraudster detection; the analyst’s laptop is used to predict model costs; the trained model is exchanged as PMML. Real-time workloads on an IMDG score new orders against the customer transactions DB, with anomaly detection and velocity metrics.]
51. Ecommerce: Marketing Funnel
Problem:
• must optimize large ad spend budget
• different vendors report different kinds of metrics
• some campaigns are much smaller than others
• seasonal variation distorts performance
• inherent latency in spend vs. effect
• ads channels cannot scale up immediately
• must “scrub” leads to dispute payments/refunds
• hard to predict ROI for incremental ad spend
• many issues of diminishing returns in general
52. Ecommerce: Marketing Funnel
KPI:
cost per paying user (CPP)
• must align metrics for different ad channels
• generally need to estimate to end-of-month
customer lifetime value (LTV)
• big differences based on geographic region, age, gender, etc.
• assumes that new customers behave like previous customers
return on investment (ROI)
• relationship between CPP and LTV – see the arithmetic sketch below
• adjust to invest in marketing (>CPP) vs. extract profit (>LTV)
other metrics
• reach: how many people get a brand message
• customer satisfaction: would recommend to a friend, etc.
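The KPI arithmetic itself is simple; a plain-Scala sketch with made-up numbers:

object FunnelKpi extends App {
  val adSpend     = 50000.0   // monthly spend on one channel (hypothetical)
  val payingUsers = 400       // conversions attributed to that channel
  val ltvPerUser  = 180.0     // estimated customer lifetime value

  val cpp = adSpend / payingUsers      // cost per paying user
  val roi = (ltvPerUser - cpp) / cpp   // one simple expression of ROI
  println(f"CPP = $$$cpp%.2f, ROI = ${roi * 100}%.1f%%")
}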
53. Ecommerce: Marketing Funnel
Predictive Analytics:
batch
• log aggregation, followed with cohort analysis
• bayesian point estimates compare different-sized ad tests – see the sketch below
• time series analysis normalizes for seasonal variation
• geolocation adjusts for regional cost/benefit
• customer lifetime value estimates ROI of new leads
• linear programming models estimate elasticity of demand
real-time
• determine whether this is actually a new customer…
• new: modify initial UX based on ad channel, region, friends, etc.
• old: recommend products/services/friends based on behaviors
• adjust spend on poorly performing channels
• track back to top referring sites/partners
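On why bayesian point estimates help compare different-sized ad tests: a raw conversion rate from 10 clicks is noisy, so shrink it toward a prior. A plain-Scala sketch using a Beta(1, 1) prior (an assumed prior; tune it to the funnel):

object AdTestEstimate extends App {
  // posterior mean of a Beta-Binomial model: (k + a) / (n + a + b)
  def posteriorMean(conversions: Int, trials: Int,
                    priorA: Double = 1.0, priorB: Double = 1.0): Double =
    (conversions + priorA) / (trials + priorA + priorB)

  println(posteriorMean(3, 10))      // 0.333: small test, pulled toward the prior
  println(posteriorMean(300, 1000))  // 0.300: large test, the prior barely matters
}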
54. Airlines
Problem:
• minimize schedule delays
• re-route around weather and airport conditions
• manage supplier channels and inventories to minimize AOG
KPI:
forecast future passenger demand
customer loyalty
aircraft on ground (AOG)
mean time between failures (MTBF)
55. Airlines
Predictive Analytics:
batch
• predict “last mile” failures
• optimize capacity utilization
• operations research problem: optimize stocking / minimize fuel waste
• boost customer loyalty by adjusting incentives, e.g., frequent flyer programs
real-time
• forecast schedule delays
• monitor factors for travel conditions: weather, airports, etc.