SlideShare ist ein Scribd-Unternehmen logo
1 von 79
Downloaden Sie, um offline zu lesen
Paco Nathan
liber118.com/pxn/
“Using Cascalog with
Palo Alto Open Data”
Licensed under a Creative Commons Attribution-
NonCommercial-NoDerivs 3.0 Unported License.
LA Clojure User Group
1Friday, 19 July 13
Cascading / Cascalog / Scalding
Enterprise Data Workflows with Cascading
Cluster Computing with Mesos
Using Cascalog with Palo Alto Open Data
2Friday, 19 July 13
Cascading – origins
API author Chris Wensel worked as a system architect
at an Enterprise firm well-known for many popular
data products.
Wensel was following the Nutch open source project –
where Hadoop started.
Observation: would be difficult to find Java developers
to write complex Enterprise apps in MapReduce –
potential blocker for leveraging new open source
technology.
3Friday, 19 July 13
Cascading – functional programming
Key insight: MapReduce is based on functional programming
– back to LISP in 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:
• leverages JVM and Java-based tools without any
need to create new languages
• allows programmers who have J2EE expertise
to leverage the economics of Hadoop clusters
4Friday, 19 July 13
Cascading – functional programming
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop Cascading
– used for their large-scale production deployments
• new case studies for Cascading apps are mostly
based on domain-specific languages (DSLs) in JVM
languages which emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming PracticesWill ImproveYour Return fromTechnology
Dan Woods, 2013-04-17 Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-
practices-will-improve-your-return-from-technology/
5Friday, 19 July 13
Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading – integrations
• partners: Microsoft Azure, Hortonworks,
Amazon AWS, MapR, EMC, SpringSource,
Cloudera
• taps: Memcached, Cassandra, MongoDB,
HBase, JDBC, Parquet, etc.
• serialization: Avro, Thrift, Kryo,
JSON, etc.
• topologies: Apache Hadoop,
tuple spaces, local mode
6Friday, 19 July 13
Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.
7Friday, 19 July 13
Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.
workflow abstraction addresses:
• staffing bottleneck;
• system integration;
• operational complexity;
• test-driven development
8Friday, 19 July 13
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
1 map
1 reduce
18 lines code gist.github.com/3900702
WordCount – conceptual flow diagram
cascading.org/category/impatient
9Friday, 19 July 13
WordCount – Cascading app in Java
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath );
// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [](),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
.addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
10Friday, 19 July 13
mapreduce
Every('wc')[Count[decl:'count']]
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
GroupBy('wc')[by:['token']]
Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
[head]
[tail]
[{2}:'token', 'count']
[{1}:'token']
[{2}:'doc_id', 'text']
[{2}:'doc_id', 'text']
wc[{1}:'token']
[{1}:'token']
[{2}:'token', 'count']
[{2}:'token', 'count']
[{1}:'token']
[{1}:'token']
WordCount – generated flow diagram
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
11Friday, 19 July 13
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))
(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[[](),.)s]+"))
(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
; Paul Lam
; github.com/Quantisan/Impatient
WordCount – Cascalog / Clojure
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
12Friday, 19 July 13
github.com/nathanmarz/cascalog/wiki
• implements Datalog in Clojure, with predicates backed
by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL –
approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development
(TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs –
Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
WordCount – Cascalog / Clojure
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
13Friday, 19 July 13
import com.twitter.scalding._
 
class WordCount(args : Args) extends Job(args) {
Tsv(args("doc"),
('doc_id, 'text),
skipHeader = true)
.read
.flatMap('text -> 'token) {
text : String => text.split("[ [](),.]")
}
.groupBy('token) { _.size('count) }
.write(Tsv(args("wc"), writeHeader = true))
}
WordCount – Scalding / Scala
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
14Friday, 19 July 13
github.com/twitter/scalding/wiki
• extends the Scala collections API so that distributed lists
become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram
and function calls
• extensive libraries are available for linear algebra, abstract
algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog
WordCount – Scalding / Scala
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
15Friday, 19 July 13
Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in the Java API,
to define workflows out of familiar elements: Pipes, Taps,
Tuple Flows, Filters, Joins, Traps, etc.
Scrub
token
Document
Collection
Tokenize
Word
Count
GroupBy
token
Count
Stop Word
List
Regex
token
HashJoin
Left
RHS
M
R
Data is represented as flows of tuples. Operations within
the flows bring functional programming aspects into Java
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
16Friday, 19 July 13
Workflow Abstraction – literate programming
Cascading workflows generate their own visual
documentation: flow diagrams
in formal terms, flow diagrams leverage a methodology
called literate programming
provides intuitive, visual representations for apps –
great for cross-team collaboration
Scrub
token
Document
Collection
Tokenize
Word
Count
GroupBy
token
Count
Stop Word
List
Regex
token
HashJoin
Left
RHS
M
R
Literate Programming
Don Knuth
literateprogramming.com
17Friday, 19 July 13
Workflow Abstraction – business process
following the essence of literate programming, Cascading
workflows provide statements of business process
this recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale
18Friday, 19 July 13
Cascading / Cascalog / Scalding
Enterprise Data Workflows with Cascading
Cluster Computing with Mesos
Using Cascalog with Palo Alto Open Data
19Friday, 19 July 13
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
20Friday, 19 July 13
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
ANSI SQL for ETL
21Friday, 19 July 13
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
usesJ2EE for business logic
22Friday, 19 July 13
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
SAS for predictive models
23Friday, 19 July 13
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
SAS for predictive modelsANSI SQL for ETL most of the licensing costs…
24Friday, 19 July 13
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
usesJ2EE for business logic
most of the project costs…
25Friday, 19 July 13
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
a compiler sees it all…
cascading.org
26Friday, 19 July 13
a compiler sees it all…
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
FlowDef flowDef = FlowDef.flowDef()
.setName( "etl" )
.addSource( "example.employee", emplTap )
.addSource( "example.sales", salesTap )
.addSink( "results", resultsTap );
 
SQLPlanner sqlPlanner = new SQLPlanner()
.setSql( sqlStatement );
 
flowDef.addAssemblyPlanner( sqlPlanner );
cascading.org
27Friday, 19 July 13
a compiler sees it all…
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
FlowDef flowDef = FlowDef.flowDef()
.setName( "classifier" )
.addSource( "input", inputTap )
.addSink( "classify", classifyTap );
 
PMMLPlanner pmmlPlanner = new PMMLPlanner()
.setPMMLInput( new File( pmmlModel ) )
.retainOnlyActiveIncomingFields();
 
flowDef.addAssemblyPlanner( pmmlPlanner );
28Friday, 19 July 13
cascading.org
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
visual collaboration for the business logic is a great
way to improve how teams work together
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
29Friday, 19 July 13
Lingual – CSV data in local file system
cascading.org/lingual
30Friday, 19 July 13
Lingual – shell prompt, catalog
cascading.org/lingual
31Friday, 19 July 13
Lingual – queries
cascading.org/lingual
32Friday, 19 July 13
# load the JDBC package
library(RJDBC)
 
# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
"~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")
 
# set up a database connection to a local repository
connection <- dbConnect(drv,
"jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/
tables;schema=EMPLOYEES")
 
# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
"SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)
 
# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)
library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
Lingual – connecting Hadoop and R
33Friday, 19 July 13
> summary(df$hire_age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.86 27.89 31.70 31.61 35.01 43.92
Lingual – connecting Hadoop and R
cascading.org/lingual
34Friday, 19 July 13
Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Pattern – model scoring
• migrate workloads: SAS,Teradata, etc.,
exporting predictive models as PMML
• great open source tools – R, Weka,
KNIME, Matlab, RapidMiner, etc.
• integrate with other libraries –
Matrix API, etc.
• leverage PMML as another kind
of DSL
cascading.org/pattern
35Friday, 19 July 13
• established XML standard for predictive model markup
• organized by Data Mining Group (DMG), since 1997
http://dmg.org/
• members: IBM, SAS, Visa, NASA, Equifax, Microstrategy,
Microsoft, etc.
• PMML concepts for metadata, ensembles, etc., translate
directly into Cascading tuple flows
“PMML is the leading standard for statistical and data mining models and
supported by over 20 vendors and organizations.With PMML, it is easy
to develop a model on one system using one application and deploy the
model on another system using another application.”
PMML – standard
wikipedia.org/wiki/Predictive_Model_Markup_Language
36Friday, 19 July 13
PMML – vendor coverage
37Friday, 19 July 13
• Association Rules: AssociationModel element
• Cluster Models: ClusteringModel element
• Decision Trees: TreeModel element
• Naïve Bayes Classifiers: NaiveBayesModel element
• Neural Networks: NeuralNetwork element
• Regression: RegressionModel and GeneralRegressionModel elements
• Rulesets: RuleSetModel element
• Sequences: SequenceModel element
• SupportVector Machines: SupportVectorMachineModel element
• Text Models: TextModel element
• Time Series: TimeSeriesModel element
PMML – model coverage
ibm.com/developerworks/industry/library/ind-PMML2/
38Friday, 19 July 13
## train a RandomForest model
 
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)
 
## test the model on the holdout test set
 
print(fit$importance)
print(fit)
 
predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)
 
## export predicted labels to TSV
 
write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
quote=FALSE, sep="t", row.names=FALSE)
 
## export RF model to PMML
 
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
Pattern – create a model in R
39Friday, 19 July 13
<?xml version="1.0"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.dmg.org/PMML-4_0
http://www.dmg.org/v4-0/pmml-4-0.xsd">
 <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
  <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
  <Application name="Rattle/PMML" version="1.2.30"/>
  <Timestamp>2012-10-22 19:39:28</Timestamp>
 </Header>
 <DataDictionary numberOfFields="4">
  <DataField name="label" optype="categorical" dataType="string">
   <Value value="0"/>
   <Value value="1"/>
  </DataField>
  <DataField name="var0" optype="continuous" dataType="double"/>
  <DataField name="var1" optype="continuous" dataType="double"/>
  <DataField name="var2" optype="continuous" dataType="double"/>
 </DataDictionary>
 <MiningModel modelName="randomForest_Model" functionName="classification">
  <MiningSchema>
   <MiningField name="label" usageType="predicted"/>
   <MiningField name="var0" usageType="active"/>
   <MiningField name="var1" usageType="active"/>
   <MiningField name="var2" usageType="active"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="majorityVote">
   <Segment id="1">
    <True/>
    <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
     <MiningSchema>
      <MiningField name="label" usageType="predicted"/>
      <MiningField name="var0" usageType="active"/>
      <MiningField name="var1" usageType="active"/>
      <MiningField name="var2" usageType="active"/>
     </MiningSchema>
...
Pattern – capture model parameters as PMML
40Friday, 19 July 13
public static void main( String[] args ) throws RuntimeException {
String inputPath = args[ 0 ];
String classifyPath = args[ 1 ];
// set up the config properties
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
  // create source and sink taps
Tap inputTap = new Hfs( new TextDelimited( true, "t" ), inputPath );
Tap classifyTap = new Hfs( new TextDelimited( true, "t" ), classifyPath );
  // handle command line options
OptionParser optParser = new OptionParser();
optParser.accepts( "pmml" ).withRequiredArg();
  OptionSet options = optParser.parse( args );
 
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
.addSource( "input", inputTap )
.addSink( "classify", classifyTap );
 
if( options.hasArgument( "pmml" ) ) {
String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );
PMMLPlanner pmmlPlanner = new PMMLPlanner()
.setPMMLInput( new File( pmmlPath ) )
.retainOnlyActiveIncomingFields()
.setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model
flowDef.addAssemblyPlanner( pmmlPlanner );
}
 
// write a DOT file and run the flow
Flow classifyFlow = flowConnector.connect( flowDef );
classifyFlow.writeDOT( "dot/classify.dot" );
classifyFlow.complete();
}
Pattern – score a model, within an app
41Friday, 19 July 13
Customer
Orders
Classify
Scored
Orders
GroupBy
token
Count
PMML
Model
M R
Failure
Traps
Assert
Confusion
Matrix
Pattern – score a model, using pre-defined Cascading app
cascading.org/pattern
42Friday, 19 July 13
Roadmap – existing algorithms for scoring
• 	

Random Forest
• Decision Trees
• Linear Regression
• GLM
• Logistic Regression
• K-Means Clustering
• Hierarchical Clustering
• Multinomial
• SupportVector Machines (prepared for release)
also, model chaining and general support for ensembles
cascading.org/pattern
43Friday, 19 July 13
Roadmap – next priorities for scoring
• 	

Time Series (ARIMA forecast)
• Association Rules (basket analysis)
• Naïve Bayes
• Neural Networks
algorithms extended based on customer use cases –
contact groups.google.com/forum/?fromgroups#!forum/pattern-user
cascading.org/pattern
44Friday, 19 July 13
Cascading / Cascalog / Scalding
Enterprise Data Workflows with Cascading
Cluster Computing with Mesos
Using Cascalog with Palo Alto Open Data
45Friday, 19 July 13
Q3 1997: inflection point
Four independent teams were working toward horizontal
scale-out of workflows based on commodity hardware
This effort prepared the way for huge Internet successes
in the 1997 holiday season… AMZN, EBAY, Inktomi
(YHOO Search), then GOOG
MapReduce and the Apache Hadoop open source stack
emerged from this
46Friday, 19 July 13
RDBMS
Stakeholder
SQL Query
result sets
Excel pivot tables
PowerPoint slide decks
Web App
Customers
transactions
Product
strategy
Engineering
requirements
BI
Analysts
optimized
code
Circa 1996: pre- inflection point
47Friday, 19 July 13
RDBMS
Stakeholder
SQL Query
result sets
Excel pivot tables
PowerPoint slide decks
Web App
Customers
transactions
Product
strategy
Engineering
requirements
BI
Analysts
optimized
code
Circa 1996: pre- inflection point
“throw it over the wall”
48Friday, 19 July 13
RDBMS
SQL Query
result sets
recommenders
+
classifiers
Web Apps
customer
transactions
Algorithmic
Modeling
Logs
event
history
aggregation
dashboards
Product
Engineering
UX
Stakeholder Customers
DW ETL
Middleware
servletsmodels
Circa 2001: post- big ecommerce successes
49Friday, 19 July 13
RDBMS
SQL Query
result sets
recommenders
+
classifiers
Web Apps
customer
transactions
Algorithmic
Modeling
Logs
event
history
aggregation
dashboards
Product
Engineering
UX
Stakeholder Customers
DW ETL
Middleware
servletsmodels
Circa 2001: post- big ecommerce successes
“data products”
50Friday, 19 July 13
Workflow
RDBMS
near timebatch
services
transactions,
content
social
interactions
Web Apps,
Mobile, etc.History
Data Products Customers
RDBMS
Log
Events
In-Memory
Data Grid
Hadoop,
etc.
Cluster Scheduler
Prod
Eng
DW
Use Cases Across Topologies
s/w
dev
data
science
discovery
+
modeling
Planner
Ops
dashboard
metrics
business
process
optimized
capacitytaps
Data
Scientist
App Dev
Ops
Domain
Expert
introduced
capability
existing
SDLC
Circa 2013: clusters everywhere
51Friday, 19 July 13
Workflow
RDBMS
near timebatch
services
transactions,
content
social
interactions
Web Apps,
Mobile, etc.History
Data Products Customers
RDBMS
Log
Events
In-Memory
Data Grid
Hadoop,
etc.
Cluster Scheduler
Prod
Eng
DW
Use Cases Across Topologies
s/w
dev
data
science
discovery
+
modeling
Planner
Ops
dashboard
metrics
business
process
optimized
capacitytaps
Data
Scientist
App Dev
Ops
Domain
Expert
introduced
capability
existing
SDLC
Circa 2013: clusters everywhere
“optimize topologies”
52Friday, 19 July 13
Operating Systems, redux
meanwhile, GOOG is 3+ generations ahead,
with much improved ROI on data centers
John Wilkes, et al.
Borg/Omega: data center “secret sauce”
youtu.be/0ZFMlO98Jkc
0%
25%
50%
75%
100%
RAILS CPU
LOAD
MEMCACHED
CPU LOAD
0%
25%
50%
75%
100%
HADOOP CPU
LOAD
0%
25%
50%
75%
100%
t t
0%
25%
50%
75%
100%
Rails
Memcached
Hadoop
COMBINED CPU LOAD (RAILS,
MEMCACHED, HADOOP)
Florian Leibert, Chronos/Mesos @ Airbnb
Mesos, open source cloud OS – like Borg
goo.gl/jPtTP
53Friday, 19 July 13
Mesos
mesos.apache.org
Return of the Borg: HowTwitter Rebuilt Google’s SecretWeapon
Cade Metz
wired.com/wiredenterprise/2013/03/google-
borg-twitter-mesos/
54Friday, 19 July 13
Mesos
a common substrate for cluster computing
• scale to 10,000s of nodes using fast, event-driven C++ impl
• improve utilization across workloads
• run long-lived services (e.g., Hypertable and HBase) on the
same nodes as batch app and share resources
• build new cluster computing frameworks without reinventing low-level
facilities, and have them coexist with existing work
• run multiple instances/versions of Hadoop on the same cluster to isolate
production and experimental jobs
• reshape cluster resources based on ML from app history
• reduce latency in transferring data products from one cluster to another
• enable new kinds of apps, which combine frameworks with lower latency
55Friday, 19 July 13
Cascading / Cascalog / Scalding
Enterprise Data Workflows with Cascading
Cluster Computing with Mesos
Using Cascalog with Palo Alto Open Data
56Friday, 19 July 13
Palo Alto is quite a pleasant place
• temperate weather
• lots of parks, enormous trees
• great coffeehouses
• walkable downtown
• not particularly crowded
On a nice summer day, who wants to be stuck
indoors on a phone call?
Instead, take it outside – go for a walk
And example open source project:
github.com/Cascading/CoPA/wiki
57Friday, 19 July 13
1. Open Data about municipal infrastructure
(GIS data: trees, roads, parks)
✚
2. Big Data about where people like to walk
(smartphone GPS logs)
✚
3. some curated metadata
(which surfaces the value)
4. personalized recommendations:
“Find a shady spot on a summer day in which to walk
near downtown Palo Alto.While on a long conference call.
Sipping a latte or enjoying some fro-yo.”
Scrub
token
Document
Collection
Tokenize
Word
Count
GroupBy
token
Count
Stop Word
List
Regex
token
HashJoin
Left
RHS
M
R
58Friday, 19 July 13
The City of Palo Alto recently began to support Open Data
to give the local community greater visibility into how
their city government operates
This effort is intended to encourage students, entrepreneurs,
local organizations, etc., to build new apps which contribute
to the public good
paloalto.opendata.junar.com/dashboards/7576/geographic-information/
discovery
59Friday, 19 July 13
GIS about trees in Palo Alto:
discovery
60Friday, 19 July 13
Geographic_Information,,,
"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29
Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis
Source: davey tree Protected: Designated: Heritage: Appraised Value:
Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872
Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point"
"Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie
Way From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID:
598 (Wilkie Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year
Constructed: 1950 Traffic Count: 596 Traffic Index: residential local Traffic
Class: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width:
40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: 2.0 Base
Type Pvmt: crusher run base Base Thickness: 6.0 Soil Class: 2 Soil Value: 15
Curb Type: Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1 District
Number: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 Base
Failure Year: 1990 Base Failure Thickness: 6 Surface Treatment Year: Surface
Treatment Type: Alligator Severity: none Alligator Extent: 0 Block Severity:
none Block Extent: 0 Longitude and Transverse Severity: none Longitude and Transverse
Extent: 0 Ravelling Severity: none Ravelling Extent: 0 Ridability Severity: none
Trench Severity: none Trench Extent: 0 Rutting Severity: none Rutting Extent: 0
Road Performance: UL (Urban Local) Bike Lane: 0 Bus Route: 0 Truck Route: 0
Remediation: Deduct Value: 100 Priority: Pavement Condition: excellent
Street Cut Fee per SqFt: 10.00 Source Date: 6/10/2009 User Modified By: mnicols
Identifier System: 21410 ","-122.1249640794,37.4155803115645,0.0
-122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0
-122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0
-122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line"
discovery
(unstructured data…)
61Friday, 19 July 13
(defn parse-gis [line]
"leverages parse-csv for complex CSV format in GIS export"
(first (csv/parse-csv line))
)
 
 
(defn etl-gis [gis trap]
"subquery to parse data sets from the GIS source tap"
(<- [?blurb ?misc ?geo ?kind]
(gis ?line)
(parse-gis ?line :> ?blurb ?misc ?geo ?kind)
(:trap (hfs-textline trap))
))
discovery
(specify what you require,
not how to achieve it…
80/20 rule of data prep cost)
62Friday, 19 July 13
discovery
(ad-hoc queries get refined
into composable predicates)
Identifier: 474
Tree ID: 412
Tree: 412 site 1 at 115 HAWTHORNE AV
Tree Site: 1
Street_Name: HAWTHORNE AV
Situs Number: 115
Private: -1
Species: Liquidambar styraciflua
Source: davey tree
Hardscape: None
37.446001565119,-122.167713417554,0.0
Point
63Friday, 19 July 13
discovery
(curate valuable metadata)
64Friday, 19 July 13
(defn get-trees [src trap tree_meta]
"subquery to parse/filter the tree data"
(<- [?blurb ?tree_id ?situs ?tree_site
?species ?wikipedia ?calflora ?avg_height
?tree_lat ?tree_lng ?tree_alt ?geohash
]
(src ?blurb ?misc ?geo ?kind)
(re-matches #"^s+Private.*Tree ID.*" ?misc)
(parse-tree ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species)
((c/comp s/trim s/lower-case) ?raw_species :> ?species)
(tree_meta ?species ?wikipedia ?calflora ?min_height ?max_height)
(avg ?min_height ?max_height :> ?avg_height)
(geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt)
(read-string ?tree_lat :> ?lat)
(read-string ?tree_lng :> ?lng)
(geohash ?lat ?lng :> ?geohash)
(:trap (hfs-textline trap))
))
discovery
?blurb!! Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22
?tree_id! " 412
?situs"" 115
?tree_site" 1
?species" " liquidambar styraciflua
?wikipedia" http://en.wikipedia.org/wiki/Liquidambar_styraciflua
?calflora http://calflora.org/cgi-bin/species_query.cgi?where-calre
?avg_height" 27.5
?tree_lat" 37.446001565119
?tree_lng" -122.167713417554
?tree_alt" 0.0
?geohash" " 9q9jh0
65Friday, 19 July 13
// run analysis and visualization in R
library(ggplot2)
dat_folder <- '~/src/concur/CoPA/out/tree'
data <- read.table(file=paste(dat_folder, "part-00000", sep="/"),
sep="t", quote="", na.strings="NULL", header=FALSE, encoding="UTF8")
 
summary(data)
t <- head(sort(table(data$V5), decreasing=TRUE)
trees <- as.data.frame.table(t, n=20))
colnames(trees) <- c("species", "count")
 
m <- ggplot(data, aes(x=V8))
m <- m + ggtitle("Estimated Tree Height (meters)")
m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density()
 
par(mar = c(7, 4, 4, 2) + 0.1)
plot(trees, xaxt="n", xlab="")
axis(1, labels=FALSE)
text(1:nrow(trees), par("usr")[3] - 0.25, srt=45, adj=1,
labels=trees$species, xpd=TRUE)
grid(nx=nrow(trees))
discovery
66Friday, 19 July 13
discovery
sweetgum
analysis of the tree data:
67Friday, 19 July 13
M
tree
GIS
export
Regex
parse-gis
src
Scrub
species
Geohash
Regex
parse-tree
tree
Tree
Metadata
Join
Failure
Traps
Estimate
height
M
discovery
(flow diagram, gis tree)
68Friday, 19 July 13
9q9jh0
geohash with 6-digit resolution
approximates a 5-block square
centered lat: 37.445, lng: -122.162
modeling
69Friday, 19 July 13
Each road in the GIS export is listed as a block between two
cross roads, and each may have multiple road segments to
represent turns:
" -122.161776959558,37.4518836690781,0.0
" -122.161390381489,37.4516410983794,0.0
" -122.160786011735,37.4512589903357,0.0
" -122.160531178368,37.4510977281699,0.0
modeling
( lat0, lng0, alt0 )
( lat1, lng1, alt1 )
( lat2, lng2, alt2 )
( lat3, lng3, alt3 )
NB: segments in the raw GIS have the order of geo coordinates
scrambled: (lng, lat, alt)
70Friday, 19 July 13
9q9jh0
X X
X
Filter trees which are too far away to provide shade. Calculate a sum
of moments for tree height × distance, as an estimator for shade:
modeling
71Friday, 19 July 13
(defn get-shade [trees roads]
"subquery to join tree and road estimates, maximize for shade"
(<- [?road_name ?geohash ?road_lat ?road_lng
?road_alt ?road_metric ?tree_metric]
(roads ?road_name _ _ _
?albedo ?road_lat ?road_lng ?road_alt ?geohash
?traffic_count _ ?traffic_class _ _ _ _)
(road-metric
?traffic_class ?traffic_count ?albedo :> ?road_metric)
(trees _ _ _ _ _ _ _
?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash)
(read-string ?avg_height :> ?height)
;; limit to trees which are higher than people
(> ?height 2.0)
(tree-distance
?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance)
;; limit to trees within a one-block radius (not meters)
(<= ?distance 25.0)
(/ ?height ?distance :> ?tree_moment)
(c/sum ?tree_moment :> ?sum_tree_moment)
;; magic number 200000.0 used to scale tree moment
;; based on median
(/ ?sum_tree_moment 200000.0 :> ?tree_metric)
))
modeling
72Friday, 19 July 13
M
tree
Join
Calculate
distance
shade
Filter
height
Sum
moment
REstimate
traffic
R
road
Filter
distance
M M
Filter
sum_moment
(flow diagram, shade)
modeling
73Friday, 19 July 13
(defn get-gps [gps_logs trap]
"subquery to aggregate and rank GPS tracks per user"
(<- [?uuid ?geohash ?gps_count ?recent_visit]
(gps_logs
?date ?uuid ?gps_lat ?gps_lng ?alt ?speed ?heading
?elapsed ?distance)
(read-string ?gps_lat :> ?lat)
(read-string ?gps_lng :> ?lng)
(geohash ?lat ?lng :> ?geohash)
(c/count :> ?gps_count)
(date-num ?date :> ?visit)
(c/max ?visit :> ?recent_visit)
))
modeling
?uuid ?geohash ?gps_count ?recent_visit
cf660e041e994929b37cc5645209c8ae 9q8yym 7 1972376866448
342ac6fd3f5f44c6b97724d618d587cf 9q9htz 4 1972376690969
32cc09e69bc042f1ad22fc16ee275e21 9q9hv3 3 1972376670935
342ac6fd3f5f44c6b97724d618d587cf 9q9hv3 3 1972376691356
342ac6fd3f5f44c6b97724d618d587cf 9q9hwn 13 1972376690782
342ac6fd3f5f44c6b97724d618d587cf 9q9hwp 58 1972376690965
482dc171ef0342b79134d77de0f31c4f 9q9jh0 15 1972376952532
b1b4d653f5d9468a8dd18a77edcc5143 9q9jh0 18 1972376945348
74Friday, 19 July 13
Recommenders often combine multiple signals, via weighted
averages, to rank personalized results:
• GPS of person ∩ road segment
• frequency and recency of visit
• traffic class and rate
• road albedo (sunlight reflection)
• tree shade estimator
Adjusting the mix allows for further personalization at the end use
modeling
(defn get-reco [tracks shades]
"subquery to recommend road segments based on GPS tracks"
(<- [?uuid ?road ?geohash ?lat ?lng ?alt
?gps_count ?recent_visit ?road_metric ?tree_metric]
(tracks ?uuid ?geohash ?gps_count ?recent_visit)
(shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)
))
75Friday, 19 July 13
‣ addr: 115 HAWTHORNE AVE
‣ lat/lng: 37.446, -122.168
‣ geohash: 9q9jh0
‣ tree: 413 site 2
‣ species: Liquidambar styraciflua
‣ est. height: 23 m
‣ shade metric: 4.363
‣ traffic: local residential, light traffic
‣ recent visit: 1972376952532
‣ a short walk from my train stop ✔
apps
76Friday, 19 July 13
Could combine this with a variety of data APIs:
• Trulia neighborhood data, housing prices
• Factual local business (FB Places, etc.)
• CommonCrawl open source full web crawl
• Wunderground local weather data
• WalkScore neighborhood data, walkability
• Data.gov US federal open data
• Data.NASA.gov NASA open data
• DBpedia datasets derived fromWikipedia
• GeoWordNet semantic knowledge base
• Geolytics demographics, GIS, etc.
• Foursquare,Yelp, CityGrid, Localeze,YP
• various photo sharing
apps
77Friday, 19 July 13
Enterprise DataWorkflows
with Cascading
O’Reilly, 2013
shop.oreilly.com/product/
0636920028536.do
Follow-Up…
also, check out this newsletter
for updates:
liber118.com/pxn/
78Friday, 19 July 13
Follow-Up…
blog, developer community, code/wiki/gists, maven repo,
commercial products, etc.:
cascading.org
zest.to/group11
github.com/Cascading
conjars.org
goo.gl/KQtUL
concurrentinc.com
79Friday, 19 July 13

Weitere ähnliche Inhalte

Andere mochten auch

High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...
High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...
High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...
Idabagus Mahartana
 

Andere mochten auch (19)

Data First [FutureStack16]
Data First [FutureStack16]Data First [FutureStack16]
Data First [FutureStack16]
 
Tiempo del verbo 3
Tiempo del verbo 3Tiempo del verbo 3
Tiempo del verbo 3
 
High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...
High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...
High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...
 
Nancy duarte
Nancy duarteNancy duarte
Nancy duarte
 
CV_CC_port
CV_CC_portCV_CC_port
CV_CC_port
 
Evaluacion de geometrtia.
Evaluacion de geometrtia.Evaluacion de geometrtia.
Evaluacion de geometrtia.
 
BFUMTravelfolder
BFUMTravelfolderBFUMTravelfolder
BFUMTravelfolder
 
Catàleg BEEP Abril 2015 en Català
Catàleg BEEP Abril 2015 en CatalàCatàleg BEEP Abril 2015 en Català
Catàleg BEEP Abril 2015 en Català
 
Out2 Cocoa
Out2 CocoaOut2 Cocoa
Out2 Cocoa
 
Sync in an NFV World (Ram, ITSF 2016)
Sync in an NFV World (Ram, ITSF 2016)Sync in an NFV World (Ram, ITSF 2016)
Sync in an NFV World (Ram, ITSF 2016)
 
A wideband hybrid plasmonic fractal patch nanoantenn
A wideband hybrid plasmonic fractal patch nanoantennA wideband hybrid plasmonic fractal patch nanoantenn
A wideband hybrid plasmonic fractal patch nanoantenn
 
Daum Communications Case Study
Daum Communications Case StudyDaum Communications Case Study
Daum Communications Case Study
 
ACMA
ACMAACMA
ACMA
 
Test-Driven Microservices: System Confidence
Test-Driven Microservices: System ConfidenceTest-Driven Microservices: System Confidence
Test-Driven Microservices: System Confidence
 
las tic
las tic las tic
las tic
 
Estudio de la situación laboral y de cualificación de los RRHH del deporte de...
Estudio de la situación laboral y de cualificación de los RRHH del deporte de...Estudio de la situación laboral y de cualificación de los RRHH del deporte de...
Estudio de la situación laboral y de cualificación de los RRHH del deporte de...
 
Compassion lesson #4
Compassion lesson #4Compassion lesson #4
Compassion lesson #4
 
Discapacidad intelectual limite
Discapacidad intelectual limiteDiscapacidad intelectual limite
Discapacidad intelectual limite
 
Catàleg BEEP especial Nadal 2014 en Català
Catàleg BEEP especial Nadal 2014 en CatalàCatàleg BEEP especial Nadal 2014 en Català
Catàleg BEEP especial Nadal 2014 en Català
 

Ähnlich wie July Clojure Users Group Meeting: "Using Cascalog with Palo Alto Open Data"

GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
Paco Nathan
 

Ähnlich wie July Clojure Users Group Meeting: "Using Cascalog with Palo Alto Open Data" (20)

OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data
OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open DataOSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data
OSCON 2013: Using Cascalog to build an app with City of Palo Alto Open Data
 
Using Cascalog to build an app with City of Palo Alto Open Data
Using Cascalog to build an app with City of Palo Alto Open DataUsing Cascalog to build an app with City of Palo Alto Open Data
Using Cascalog to build an app with City of Palo Alto Open Data
 
Functional programming
 for optimization problems 
in Big Data
Functional programming
  for optimization problems 
in Big DataFunctional programming
  for optimization problems 
in Big Data
Functional programming
 for optimization problems 
in Big Data
 
Hadoop Summit: Pattern – an open source project for migrating predictive mode...
Hadoop Summit: Pattern – an open source project for migrating predictive mode...Hadoop Summit: Pattern – an open source project for migrating predictive mode...
Hadoop Summit: Pattern – an open source project for migrating predictive mode...
 
Pattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and HadoopPattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and Hadoop
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and CascadingBoulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
 
Pattern - an open source project for migrating predictive models from SAS, et...
Pattern - an open source project for migrating predictive models from SAS, et...Pattern - an open source project for migrating predictive models from SAS, et...
Pattern - an open source project for migrating predictive models from SAS, et...
 
Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014
 
Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015
 
Hadoop and Beyond
Hadoop and BeyondHadoop and Beyond
Hadoop and Beyond
 
Triplewave: a step towards RDF Stream Processing on the Web
Triplewave: a step towards RDF Stream Processing on the WebTriplewave: a step towards RDF Stream Processing on the Web
Triplewave: a step towards RDF Stream Processing on the Web
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic Web
 
Big Data to SMART Data : Process Scenario
Big Data to SMART Data : Process ScenarioBig Data to SMART Data : Process Scenario
Big Data to SMART Data : Process Scenario
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on Hadoop
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 

Mehr von Paco Nathan

Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage ML
Paco Nathan
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage ML
Paco Nathan
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AI
Paco Nathan
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industry
Paco Nathan
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
Paco Nathan
 

Mehr von Paco Nathan (20)

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with ML
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage ML
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage ML
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AI
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industry
 
Computable Content
Computable ContentComputable Content
Computable Content
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons Learned
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine Learning
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?
 

Kürzlich hochgeladen

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Kürzlich hochgeladen (20)

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

July Clojure Users Group Meeting: "Using Cascalog with Palo Alto Open Data"

  • 1. Paco Nathan liber118.com/pxn/ “Using Cascalog with Palo Alto Open Data” Licensed under a Creative Commons Attribution- NonCommercial-NoDerivs 3.0 Unported License. LA Clojure User Group 1Friday, 19 July 13
  • 2. Cascading / Cascalog / Scalding Enterprise Data Workflows with Cascading Cluster Computing with Mesos Using Cascalog with Palo Alto Open Data 2Friday, 19 July 13
  • 3. Cascading – origins API author Chris Wensel worked as a system architect at an Enterprise firm well-known for many popular data products. Wensel was following the Nutch open source project – where Hadoop started. Observation: would be difficult to find Java developers to write complex Enterprise apps in MapReduce – potential blocker for leveraging new open source technology. 3Friday, 19 July 13
  • 4. Cascading – functional programming Key insight: MapReduce is based on functional programming – back to LISP in 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature. To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows: • leverages JVM and Java-based tools without any need to create new languages • allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters 4Friday, 19 July 13
  • 5. Cascading – functional programming • Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010) Scalding in Scala (2012) github.com/nathanmarz/cascalog/wiki github.com/twitter/scalding/wiki Why Adopting the Declarative Programming PracticesWill ImproveYour Return fromTechnology Dan Woods, 2013-04-17 Forbes forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming- practices-will-improve-your-return-from-technology/ 5Friday, 19 July 13
  • 6. Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap Modeling PMML Cascading – integrations • partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera • taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc. • serialization: Avro, Thrift, Kryo, JSON, etc. • topologies: Apache Hadoop, tuple spaces, local mode 6Friday, 19 July 13
  • 7. Cascading – deployments • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc. • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc. 7Friday, 19 July 13
  • 8. Cascading – deployments • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc. • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc. workflow abstraction addresses: • staffing bottleneck; • system integration; • operational complexity; • test-driven development 8Friday, 19 July 13
  • 9. Document Collection Word Count Tokenize GroupBy token Count R M 1 map 1 reduce 18 lines code gist.github.com/3900702 WordCount – conceptual flow diagram cascading.org/category/impatient 9Friday, 19 July 13
  • 10. WordCount – Cascading app in Java String docPath = args[ 0 ]; String wcPath = args[ 1 ]; Properties properties = new Properties(); AppProps.setApplicationJarClass( properties, Main.class ); HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties ); // create source and sink taps Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath ); Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath ); // specify a regex to split "document" text lines into token stream Fields token = new Fields( "token" ); Fields text = new Fields( "text" ); RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [](),.]" ); // only returns "token" Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS ); // determine the word counts Pipe wcPipe = new Pipe( "wc", docPipe ); wcPipe = new GroupBy( wcPipe, token ); wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL ); // connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef().setName( "wc" ) .addSource( docPipe, docTap )  .addTailSink( wcPipe, wcTap ); // write a DOT file and run the flow Flow wcFlow = flowConnector.connect( flowDef ); wcFlow.writeDOT( "dot/wc.dot" ); wcFlow.complete(); Document Collection Word Count Tokenize GroupBy token Count R M 10Friday, 19 July 13
  • 11. mapreduce Every('wc')[Count[decl:'count']] Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']'] GroupBy('wc')[by:['token']] Each('token')[RegexSplitGenerator[decl:'token'][args:1]] Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']'] [head] [tail] [{2}:'token', 'count'] [{1}:'token'] [{2}:'doc_id', 'text'] [{2}:'doc_id', 'text'] wc[{1}:'token'] [{1}:'token'] [{2}:'token', 'count'] [{2}:'token', 'count'] [{1}:'token'] [{1}:'token'] WordCount – generated flow diagram Document Collection Word Count Tokenize GroupBy token Count R M 11Friday, 19 July 13
  • 12. (ns impatient.core   (:use [cascalog.api]         [cascalog.more-taps :only (hfs-delimited)])   (:require [clojure.string :as s]             [cascalog.ops :as c])   (:gen-class)) (defmapcatop split [line]   "reads in a line of string and splits it by regex"   (s/split line #"[[](),.)s]+")) (defn -main [in out & args]   (?<- (hfs-delimited out)        [?word ?count]        ((hfs-delimited in :skip-header? true) _ ?line)        (split ?line :> ?word)        (c/count ?count))) ; Paul Lam ; github.com/Quantisan/Impatient WordCount – Cascalog / Clojure Document Collection Word Count Tokenize GroupBy token Count R M 12Friday, 19 July 13
  • 13. github.com/nathanmarz/cascalog/wiki • implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language • run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL • composable subqueries, used for test-driven development (TDD) practices at scale • Leiningen build: simple, no surprises, in Clojure itself • more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog • has a learning curve, limited number of Clojure developers • aggregators are the magic, and those take effort to learn WordCount – Cascalog / Clojure Document Collection Word Count Tokenize GroupBy token Count R M 13Friday, 19 July 13
  • 14. import com.twitter.scalding._   class WordCount(args : Args) extends Job(args) { Tsv(args("doc"), ('doc_id, 'text), skipHeader = true) .read .flatMap('text -> 'token) { text : String => text.split("[ [](),.]") } .groupBy('token) { _.size('count) } .write(Tsv(args("wc"), writeHeader = true)) } WordCount – Scalding / Scala Document Collection Word Count Tokenize GroupBy token Count R M 14Friday, 19 July 13
  • 15. github.com/twitter/scalding/wiki • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading • code is compact, easy to understand • nearly 1:1 between elements of conceptual flow diagram and function calls • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc. • significant investments by Twitter, Etsy, eBay, etc. • great for data services at scale • less learning curve than Cascalog WordCount – Scalding / Scala Document Collection Word Count Tokenize GroupBy token Count R M 15Friday, 19 July 13
  • 16. Workflow Abstraction – pattern language Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc. Scrub token Document Collection Tokenize Word Count GroupBy token Count Stop Word List Regex token HashJoin Left RHS M R Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java A Pattern Language Christopher Alexander, et al. amazon.com/dp/0195019199 16Friday, 19 July 13
  • 17. Workflow Abstraction – literate programming Cascading workflows generate their own visual documentation: flow diagrams in formal terms, flow diagrams leverage a methodology called literate programming provides intuitive, visual representations for apps – great for cross-team collaboration Scrub token Document Collection Tokenize Word Count GroupBy token Count Stop Word List Regex token HashJoin Left RHS M R Literate Programming Don Knuth literateprogramming.com 17Friday, 19 July 13
  • 18. Workflow Abstraction – business process following the essence of literate programming, Cascading workflows provide statements of business process this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data) Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.) this is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.” by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale 18Friday, 19 July 13
  • 19. Cascading / Cascalog / Scalding Enterprise Data Workflows with Cascading Cluster Computing with Mesos Using Cascalog with Palo Alto Open Data 19Friday, 19 July 13
  • 20. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses 20Friday, 19 July 13
  • 21. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses ANSI SQL for ETL 21Friday, 19 July 13
  • 22. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end usesJ2EE for business logic 22Friday, 19 July 13
  • 23. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses SAS for predictive models 23Friday, 19 July 13
  • 24. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses SAS for predictive modelsANSI SQL for ETL most of the licensing costs… 24Friday, 19 July 13
  • 25. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end usesJ2EE for business logic most of the project costs… 25Friday, 19 July 13
  • 26. ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source a compiler sees it all… cascading.org 26Friday, 19 July 13
  • 27. a compiler sees it all… ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source FlowDef flowDef = FlowDef.flowDef() .setName( "etl" ) .addSource( "example.employee", emplTap ) .addSource( "example.sales", salesTap ) .addSink( "results", resultsTap );   SQLPlanner sqlPlanner = new SQLPlanner() .setSql( sqlStatement );   flowDef.addAssemblyPlanner( sqlPlanner ); cascading.org 27Friday, 19 July 13
  • 28. a compiler sees it all… ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source FlowDef flowDef = FlowDef.flowDef() .setName( "classifier" ) .addSource( "input", inputTap ) .addSink( "classify", classifyTap );   PMMLPlanner pmmlPlanner = new PMMLPlanner() .setPMMLInput( new File( pmmlModel ) ) .retainOnlyActiveIncomingFields();   flowDef.addAssemblyPlanner( pmmlPlanner ); 28Friday, 19 July 13
  • 29. cascading.org ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source visual collaboration for the business logic is a great way to improve how teams work together Failure Traps bonus allocation employee PMML classifier quarterly sales Join Count leads 29Friday, 19 July 13
  • 30. Lingual – CSV data in local file system cascading.org/lingual 30Friday, 19 July 13
  • 31. Lingual – shell prompt, catalog cascading.org/lingual 31Friday, 19 July 13
  • 33. # load the JDBC package library(RJDBC)   # set up the driver drv <- JDBC("cascading.lingual.jdbc.Driver", "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")   # set up a database connection to a local repository connection <- dbConnect(drv, "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/ tables;schema=EMPLOYEES")   # query the repository: in this case the MySQL sample database (CSV files) df <- dbGetQuery(connection, "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'") head(df)   # use R functions to summarize and visualize part of the data df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25 summary(df$hire_age) library(ggplot2) m <- ggplot(df, aes(x=hire_age)) m <- m + ggtitle("Age at hire, people named Gina") m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density() Lingual – connecting Hadoop and R 33Friday, 19 July 13
  • 34. > summary(df$hire_age) Min. 1st Qu. Median Mean 3rd Qu. Max. 20.86 27.89 31.70 31.61 35.01 43.92 Lingual – connecting Hadoop and R cascading.org/lingual 34Friday, 19 July 13
  • 35. Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap Modeling PMML Pattern – model scoring • migrate workloads: SAS,Teradata, etc., exporting predictive models as PMML • great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc. • integrate with other libraries – Matrix API, etc. • leverage PMML as another kind of DSL cascading.org/pattern 35Friday, 19 July 13
  • 36. • established XML standard for predictive model markup • organized by Data Mining Group (DMG), since 1997 http://dmg.org/ • members: IBM, SAS, Visa, NASA, Equifax, Microstrategy, Microsoft, etc. • PMML concepts for metadata, ensembles, etc., translate directly into Cascading tuple flows “PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations.With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.” PMML – standard wikipedia.org/wiki/Predictive_Model_Markup_Language 36Friday, 19 July 13
  • 37. PMML – vendor coverage 37Friday, 19 July 13
  • 38. • Association Rules: AssociationModel element • Cluster Models: ClusteringModel element • Decision Trees: TreeModel element • Naïve Bayes Classifiers: NaiveBayesModel element • Neural Networks: NeuralNetwork element • Regression: RegressionModel and GeneralRegressionModel elements • Rulesets: RuleSetModel element • Sequences: SequenceModel element • SupportVector Machines: SupportVectorMachineModel element • Text Models: TextModel element • Time Series: TimeSeriesModel element PMML – model coverage ibm.com/developerworks/industry/library/ind-PMML2/ 38Friday, 19 July 13
  • 39. ## train a RandomForest model   f <- as.formula("as.factor(label) ~ .") fit <- randomForest(f, data_train, ntree=50)   ## test the model on the holdout test set   print(fit$importance) print(fit)   predicted <- predict(fit, data) data$predicted <- predicted confuse <- table(pred = predicted, true = data[,1]) print(confuse)   ## export predicted labels to TSV   write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"), quote=FALSE, sep="t", row.names=FALSE)   ## export RF model to PMML   saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/")) Pattern – create a model in R 39Friday, 19 July 13
  • 40. <?xml version="1.0"?> <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">  <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">   <Extension name="user" value="ceteri" extender="Rattle/PMML"/>   <Application name="Rattle/PMML" version="1.2.30"/>   <Timestamp>2012-10-22 19:39:28</Timestamp>  </Header>  <DataDictionary numberOfFields="4">   <DataField name="label" optype="categorical" dataType="string">    <Value value="0"/>    <Value value="1"/>   </DataField>   <DataField name="var0" optype="continuous" dataType="double"/>   <DataField name="var1" optype="continuous" dataType="double"/>   <DataField name="var2" optype="continuous" dataType="double"/>  </DataDictionary>  <MiningModel modelName="randomForest_Model" functionName="classification">   <MiningSchema>    <MiningField name="label" usageType="predicted"/>    <MiningField name="var0" usageType="active"/>    <MiningField name="var1" usageType="active"/>    <MiningField name="var2" usageType="active"/>   </MiningSchema>   <Segmentation multipleModelMethod="majorityVote">    <Segment id="1">     <True/>     <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">      <MiningSchema>       <MiningField name="label" usageType="predicted"/>       <MiningField name="var0" usageType="active"/>       <MiningField name="var1" usageType="active"/>       <MiningField name="var2" usageType="active"/>      </MiningSchema> ... Pattern – capture model parameters as PMML 40Friday, 19 July 13
  • 41. public static void main( String[] args ) throws RuntimeException { String inputPath = args[ 0 ]; String classifyPath = args[ 1 ]; // set up the config properties Properties properties = new Properties(); AppProps.setApplicationJarClass( properties, Main.class ); HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );   // create source and sink taps Tap inputTap = new Hfs( new TextDelimited( true, "t" ), inputPath ); Tap classifyTap = new Hfs( new TextDelimited( true, "t" ), classifyPath );   // handle command line options OptionParser optParser = new OptionParser(); optParser.accepts( "pmml" ).withRequiredArg();   OptionSet options = optParser.parse( args );   // connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef().setName( "classify" ) .addSource( "input", inputTap ) .addSink( "classify", classifyTap );   if( options.hasArgument( "pmml" ) ) { String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 ); PMMLPlanner pmmlPlanner = new PMMLPlanner() .setPMMLInput( new File( pmmlPath ) ) .retainOnlyActiveIncomingFields() .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model flowDef.addAssemblyPlanner( pmmlPlanner ); }   // write a DOT file and run the flow Flow classifyFlow = flowConnector.connect( flowDef ); classifyFlow.writeDOT( "dot/classify.dot" ); classifyFlow.complete(); } Pattern – score a model, within an app 41Friday, 19 July 13
  • 42. Customer Orders Classify Scored Orders GroupBy token Count PMML Model M R Failure Traps Assert Confusion Matrix Pattern – score a model, using pre-defined Cascading app cascading.org/pattern 42Friday, 19 July 13
  • 43. Roadmap – existing algorithms for scoring • Random Forest • Decision Trees • Linear Regression • GLM • Logistic Regression • K-Means Clustering • Hierarchical Clustering • Multinomial • SupportVector Machines (prepared for release) also, model chaining and general support for ensembles cascading.org/pattern 43Friday, 19 July 13
  • 44. Roadmap – next priorities for scoring • Time Series (ARIMA forecast) • Association Rules (basket analysis) • Naïve Bayes • Neural Networks algorithms extended based on customer use cases – contact groups.google.com/forum/?fromgroups#!forum/pattern-user cascading.org/pattern 44Friday, 19 July 13
  • 45. Cascading / Cascalog / Scalding Enterprise Data Workflows with Cascading Cluster Computing with Mesos Using Cascalog with Palo Alto Open Data 45Friday, 19 July 13
  • 46. Q3 1997: inflection point Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG MapReduce and the Apache Hadoop open source stack emerged from this 46Friday, 19 July 13
  • 47. RDBMS Stakeholder SQL Query result sets Excel pivot tables PowerPoint slide decks Web App Customers transactions Product strategy Engineering requirements BI Analysts optimized code Circa 1996: pre- inflection point 47Friday, 19 July 13
  • 48. RDBMS Stakeholder SQL Query result sets Excel pivot tables PowerPoint slide decks Web App Customers transactions Product strategy Engineering requirements BI Analysts optimized code Circa 1996: pre- inflection point “throw it over the wall” 48Friday, 19 July 13
  • 49. RDBMS SQL Query result sets recommenders + classifiers Web Apps customer transactions Algorithmic Modeling Logs event history aggregation dashboards Product Engineering UX Stakeholder Customers DW ETL Middleware servletsmodels Circa 2001: post- big ecommerce successes 49Friday, 19 July 13
  • 50. RDBMS SQL Query result sets recommenders + classifiers Web Apps customer transactions Algorithmic Modeling Logs event history aggregation dashboards Product Engineering UX Stakeholder Customers DW ETL Middleware servletsmodels Circa 2001: post- big ecommerce successes “data products” 50Friday, 19 July 13
  • 51. Workflow RDBMS near timebatch services transactions, content social interactions Web Apps, Mobile, etc.History Data Products Customers RDBMS Log Events In-Memory Data Grid Hadoop, etc. Cluster Scheduler Prod Eng DW Use Cases Across Topologies s/w dev data science discovery + modeling Planner Ops dashboard metrics business process optimized capacitytaps Data Scientist App Dev Ops Domain Expert introduced capability existing SDLC Circa 2013: clusters everywhere 51Friday, 19 July 13
  • 52. Workflow RDBMS near timebatch services transactions, content social interactions Web Apps, Mobile, etc.History Data Products Customers RDBMS Log Events In-Memory Data Grid Hadoop, etc. Cluster Scheduler Prod Eng DW Use Cases Across Topologies s/w dev data science discovery + modeling Planner Ops dashboard metrics business process optimized capacitytaps Data Scientist App Dev Ops Domain Expert introduced capability existing SDLC Circa 2013: clusters everywhere “optimize topologies” 52Friday, 19 July 13
  • 53. Operating Systems, redux meanwhile, GOOG is 3+ generations ahead, with much improved ROI on data centers John Wilkes, et al. Borg/Omega: data center “secret sauce” youtu.be/0ZFMlO98Jkc 0% 25% 50% 75% 100% RAILS CPU LOAD MEMCACHED CPU LOAD 0% 25% 50% 75% 100% HADOOP CPU LOAD 0% 25% 50% 75% 100% t t 0% 25% 50% 75% 100% Rails Memcached Hadoop COMBINED CPU LOAD (RAILS, MEMCACHED, HADOOP) Florian Leibert, Chronos/Mesos @ Airbnb Mesos, open source cloud OS – like Borg goo.gl/jPtTP 53Friday, 19 July 13
  • 54. Mesos mesos.apache.org Return of the Borg: HowTwitter Rebuilt Google’s SecretWeapon Cade Metz wired.com/wiredenterprise/2013/03/google- borg-twitter-mesos/ 54Friday, 19 July 13
  • 55. Mesos a common substrate for cluster computing • scale to 10,000s of nodes using fast, event-driven C++ impl • improve utilization across workloads • run long-lived services (e.g., Hypertable and HBase) on the same nodes as batch app and share resources • build new cluster computing frameworks without reinventing low-level facilities, and have them coexist with existing work • run multiple instances/versions of Hadoop on the same cluster to isolate production and experimental jobs • reshape cluster resources based on ML from app history • reduce latency in transferring data products from one cluster to another • enable new kinds of apps, which combine frameworks with lower latency 55Friday, 19 July 13
  • 56. Cascading / Cascalog / Scalding Enterprise Data Workflows with Cascading Cluster Computing with Mesos Using Cascalog with Palo Alto Open Data 56Friday, 19 July 13
  • 57. Palo Alto is quite a pleasant place • temperate weather • lots of parks, enormous trees • great coffeehouses • walkable downtown • not particularly crowded On a nice summer day, who wants to be stuck indoors on a phone call? Instead, take it outside – go for a walk And example open source project: github.com/Cascading/CoPA/wiki 57Friday, 19 July 13
  • 58. 1. Open Data about municipal infrastructure (GIS data: trees, roads, parks) ✚ 2. Big Data about where people like to walk (smartphone GPS logs) ✚ 3. some curated metadata (which surfaces the value) 4. personalized recommendations: “Find a shady spot on a summer day in which to walk near downtown Palo Alto.While on a long conference call. Sipping a latte or enjoying some fro-yo.” Scrub token Document Collection Tokenize Word Count GroupBy token Count Stop Word List Regex token HashJoin Left RHS M R 58Friday, 19 July 13
  • 59. The City of Palo Alto recently began to support Open Data to give the local community greater visibility into how their city government operates This effort is intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good paloalto.opendata.junar.com/dashboards/7576/geographic-information/ discovery 59Friday, 19 July 13
  • 60. GIS about trees in Palo Alto: discovery 60Friday, 19 July 13
  • 61. Geographic_Information,,, "Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis Source: davey tree Protected: Designated: Heritage: Appraised Value: Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872 Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point" "Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie Way From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID: 598 (Wilkie Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year Constructed: 1950 Traffic Count: 596 Traffic Index: residential local Traffic Class: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width: 40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: 2.0 Base Type Pvmt: crusher run base Base Thickness: 6.0 Soil Class: 2 Soil Value: 15 Curb Type: Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1 District Number: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure Thickness: 6 Surface Treatment Year: Surface Treatment Type: Alligator Severity: none Alligator Extent: 0 Block Severity: none Block Extent: 0 Longitude and Transverse Severity: none Longitude and Transverse Extent: 0 Ravelling Severity: none Ravelling Extent: 0 Ridability Severity: none Trench Severity: none Trench Extent: 0 Rutting Severity: none Rutting Extent: 0 Road Performance: UL (Urban Local) Bike Lane: 0 Bus Route: 0 Truck Route: 0 Remediation: Deduct Value: 100 Priority: Pavement Condition: excellent Street Cut Fee per SqFt: 10.00 Source Date: 6/10/2009 User Modified By: mnicols Identifier System: 21410 ","-122.1249640794,37.4155803115645,0.0 -122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0 -122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0 -122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line" discovery (unstructured data…) 61Friday, 19 July 13
  • 62. (defn parse-gis [line] "leverages parse-csv for complex CSV format in GIS export" (first (csv/parse-csv line)) )     (defn etl-gis [gis trap] "subquery to parse data sets from the GIS source tap" (<- [?blurb ?misc ?geo ?kind] (gis ?line) (parse-gis ?line :> ?blurb ?misc ?geo ?kind) (:trap (hfs-textline trap)) )) discovery (specify what you require, not how to achieve it… 80/20 rule of data prep cost) 62Friday, 19 July 13
  • 63. discovery (ad-hoc queries get refined into composable predicates) Identifier: 474 Tree ID: 412 Tree: 412 site 1 at 115 HAWTHORNE AV Tree Site: 1 Street_Name: HAWTHORNE AV Situs Number: 115 Private: -1 Species: Liquidambar styraciflua Source: davey tree Hardscape: None 37.446001565119,-122.167713417554,0.0 Point 63Friday, 19 July 13
  • 65. (defn get-trees [src trap tree_meta] "subquery to parse/filter the tree data" (<- [?blurb ?tree_id ?situs ?tree_site ?species ?wikipedia ?calflora ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash ] (src ?blurb ?misc ?geo ?kind) (re-matches #"^s+Private.*Tree ID.*" ?misc) (parse-tree ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species) ((c/comp s/trim s/lower-case) ?raw_species :> ?species) (tree_meta ?species ?wikipedia ?calflora ?min_height ?max_height) (avg ?min_height ?max_height :> ?avg_height) (geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt) (read-string ?tree_lat :> ?lat) (read-string ?tree_lng :> ?lng) (geohash ?lat ?lng :> ?geohash) (:trap (hfs-textline trap)) )) discovery ?blurb!! Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 ?tree_id! " 412 ?situs"" 115 ?tree_site" 1 ?species" " liquidambar styraciflua ?wikipedia" http://en.wikipedia.org/wiki/Liquidambar_styraciflua ?calflora http://calflora.org/cgi-bin/species_query.cgi?where-calre ?avg_height" 27.5 ?tree_lat" 37.446001565119 ?tree_lng" -122.167713417554 ?tree_alt" 0.0 ?geohash" " 9q9jh0 65Friday, 19 July 13
  • 66. // run analysis and visualization in R library(ggplot2) dat_folder <- '~/src/concur/CoPA/out/tree' data <- read.table(file=paste(dat_folder, "part-00000", sep="/"), sep="t", quote="", na.strings="NULL", header=FALSE, encoding="UTF8")   summary(data) t <- head(sort(table(data$V5), decreasing=TRUE) trees <- as.data.frame.table(t, n=20)) colnames(trees) <- c("species", "count")   m <- ggplot(data, aes(x=V8)) m <- m + ggtitle("Estimated Tree Height (meters)") m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density()   par(mar = c(7, 4, 4, 2) + 0.1) plot(trees, xaxt="n", xlab="") axis(1, labels=FALSE) text(1:nrow(trees), par("usr")[3] - 0.25, srt=45, adj=1, labels=trees$species, xpd=TRUE) grid(nx=nrow(trees)) discovery 66Friday, 19 July 13
  • 67. discovery sweetgum analysis of the tree data: 67Friday, 19 July 13
  • 69. 9q9jh0 geohash with 6-digit resolution approximates a 5-block square centered lat: 37.445, lng: -122.162 modeling 69Friday, 19 July 13
  • 70. Each road in the GIS export is listed as a block between two cross roads, and each may have multiple road segments to represent turns: " -122.161776959558,37.4518836690781,0.0 " -122.161390381489,37.4516410983794,0.0 " -122.160786011735,37.4512589903357,0.0 " -122.160531178368,37.4510977281699,0.0 modeling ( lat0, lng0, alt0 ) ( lat1, lng1, alt1 ) ( lat2, lng2, alt2 ) ( lat3, lng3, alt3 ) NB: segments in the raw GIS have the order of geo coordinates scrambled: (lng, lat, alt) 70Friday, 19 July 13
  • 71. 9q9jh0 X X X Filter trees which are too far away to provide shade. Calculate a sum of moments for tree height × distance, as an estimator for shade: modeling 71Friday, 19 July 13
  • 72. (defn get-shade [trees roads] "subquery to join tree and road estimates, maximize for shade" (<- [?road_name ?geohash ?road_lat ?road_lng ?road_alt ?road_metric ?tree_metric] (roads ?road_name _ _ _ ?albedo ?road_lat ?road_lng ?road_alt ?geohash ?traffic_count _ ?traffic_class _ _ _ _) (road-metric ?traffic_class ?traffic_count ?albedo :> ?road_metric) (trees _ _ _ _ _ _ _ ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash) (read-string ?avg_height :> ?height) ;; limit to trees which are higher than people (> ?height 2.0) (tree-distance ?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance) ;; limit to trees within a one-block radius (not meters) (<= ?distance 25.0) (/ ?height ?distance :> ?tree_moment) (c/sum ?tree_moment :> ?sum_tree_moment) ;; magic number 200000.0 used to scale tree moment ;; based on median (/ ?sum_tree_moment 200000.0 :> ?tree_metric) )) modeling 72Friday, 19 July 13
  • 74. (defn get-gps [gps_logs trap] "subquery to aggregate and rank GPS tracks per user" (<- [?uuid ?geohash ?gps_count ?recent_visit] (gps_logs ?date ?uuid ?gps_lat ?gps_lng ?alt ?speed ?heading ?elapsed ?distance) (read-string ?gps_lat :> ?lat) (read-string ?gps_lng :> ?lng) (geohash ?lat ?lng :> ?geohash) (c/count :> ?gps_count) (date-num ?date :> ?visit) (c/max ?visit :> ?recent_visit) )) modeling ?uuid ?geohash ?gps_count ?recent_visit cf660e041e994929b37cc5645209c8ae 9q8yym 7 1972376866448 342ac6fd3f5f44c6b97724d618d587cf 9q9htz 4 1972376690969 32cc09e69bc042f1ad22fc16ee275e21 9q9hv3 3 1972376670935 342ac6fd3f5f44c6b97724d618d587cf 9q9hv3 3 1972376691356 342ac6fd3f5f44c6b97724d618d587cf 9q9hwn 13 1972376690782 342ac6fd3f5f44c6b97724d618d587cf 9q9hwp 58 1972376690965 482dc171ef0342b79134d77de0f31c4f 9q9jh0 15 1972376952532 b1b4d653f5d9468a8dd18a77edcc5143 9q9jh0 18 1972376945348 74Friday, 19 July 13
  • 75. Recommenders often combine multiple signals, via weighted averages, to rank personalized results: • GPS of person ∩ road segment • frequency and recency of visit • traffic class and rate • road albedo (sunlight reflection) • tree shade estimator Adjusting the mix allows for further personalization at the end use modeling (defn get-reco [tracks shades] "subquery to recommend road segments based on GPS tracks" (<- [?uuid ?road ?geohash ?lat ?lng ?alt ?gps_count ?recent_visit ?road_metric ?tree_metric] (tracks ?uuid ?geohash ?gps_count ?recent_visit) (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric) )) 75Friday, 19 July 13
  • 76. ‣ addr: 115 HAWTHORNE AVE ‣ lat/lng: 37.446, -122.168 ‣ geohash: 9q9jh0 ‣ tree: 413 site 2 ‣ species: Liquidambar styraciflua ‣ est. height: 23 m ‣ shade metric: 4.363 ‣ traffic: local residential, light traffic ‣ recent visit: 1972376952532 ‣ a short walk from my train stop ✔ apps 76Friday, 19 July 13
  • 77. Could combine this with a variety of data APIs: • Trulia neighborhood data, housing prices • Factual local business (FB Places, etc.) • CommonCrawl open source full web crawl • Wunderground local weather data • WalkScore neighborhood data, walkability • Data.gov US federal open data • Data.NASA.gov NASA open data • DBpedia datasets derived fromWikipedia • GeoWordNet semantic knowledge base • Geolytics demographics, GIS, etc. • Foursquare,Yelp, CityGrid, Localeze,YP • various photo sharing apps 77Friday, 19 July 13
  • 78. Enterprise DataWorkflows with Cascading O’Reilly, 2013 shop.oreilly.com/product/ 0636920028536.do Follow-Up… also, check out this newsletter for updates: liber118.com/pxn/ 78Friday, 19 July 13
  • 79. Follow-Up… blog, developer community, code/wiki/gists, maven repo, commercial products, etc.: cascading.org zest.to/group11 github.com/Cascading conjars.org goo.gl/KQtUL concurrentinc.com 79Friday, 19 July 13