“Cascading:
 Enterprise Data Workflows
 based on Functional Programming”

 Paco Nathan
 Concurrent, Inc.
 San Francisco, CA
 @pacoid




Copyright ©2013, Concurrent, Inc.
Cascading: Workflow Abstraction

1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data

[flow diagram: Document Collection → Tokenize (M) → Scrub token → HashJoin Left (RHS: Stop Word List → Regex token) → GroupBy token (R) → Count → Word Count]
Q3 1997: inflection point

Four independent teams were working toward horizontal
scale-out of workflows based on commodity hardware.
This effort prepared the way for huge Internet successes
in the 1997 holiday season… AMZN, EBAY, Inktomi
(YHOO Search), then GOOG

MapReduce and the Apache Hadoop open source stack
emerged from this.




Circa 1996: pre-inflection point

[diagram: BI Analysts deliver Excel pivot tables and PowerPoint slide decks to the Stakeholder, who sets strategy for Product; Product hands requirements to Engineering; Engineering ships optimized code to the Web App serving Customers; the Web App records transactions in an RDBMS, from which the BI Analysts pull SQL Query result sets]
Circa 1996: pre-inflection point

[repeats the previous slide, adding:]

“Throw it over the wall”
Circa 2001: post big ecommerce successes

[diagram: Algorithmic Modeling delivers dashboards to the Stakeholder and models to Engineering; Engineering deploys recommenders + classifiers as servlets in Web Apps, which serve UX to Customers; Web Apps write event history to Logs and customer transactions to an RDBMS via Middleware; ETL moves Logs and RDBMS data into a DW, whose aggregation feeds SQL Query result sets back to Algorithmic Modeling]
Circa 2001: post big ecommerce successes

[repeats the previous slide, adding:]

“Data products”
Circa 2013: clusters everywhere

[diagram: Domain Expert, Data Scientist, App Dev, and Ops (introduced capability) collaborate with Prod, s/w dev, Eng, and Ops (existing SDLC) around Data Products for Customers; a Workflow combines business process, dashboard metrics, data science discovery + modeling, History, and a Planner with optimized taps and capacity; use cases run across topologies – Hadoop, etc. (batch), Log Events (near time), In-Memory Data Grid, DW, RDBMS – under a Cluster Scheduler; Web Apps, Mobile, etc. serve services, social interactions, transactions, and content]
Circa 2013: clusters everywhere

[repeats the previous slide, adding:]

“Optimizing topologies”
references…

   by Leo Breiman
   Statistical Modeling: The Two Cultures
   Statistical Science, 2001
   bit.ly/eUTh9L




references…

  Amazon
  “Early Amazon: Splitting the website” – Greg Linden
  glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

  eBay
  “The eBay Architecture” – Randy Shoup, Dan Pritchett
  addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
  addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

  Inktomi (YHOO Search)
  “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
  youtube.com/watch?v=E91oEn1bnXM

  Google
  “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
  youtube.com/watch?v=qsan-GQaeyk
  perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx




Cascading: Workflow Abstraction

1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data

[agenda slide repeated, with the same flow diagram]
Cascading – origins

API author Chris Wensel worked as a system architect
at an Enterprise firm well-known for many popular
data products.
Wensel was following the Nutch open source project –
where Hadoop started.
Observation: it would be difficult to find enough Java
developers to write complex Enterprise apps directly in
MapReduce – a potential blocker for leveraging the new
open source technology.




Cascading – functional programming

Key insight: MapReduce is based on functional programming
– back to LISP in the 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced in late
2007 as a new Java API to implement functional programming
for large-scale data workflows:

• leverages JVM and Java-based tools without any
    need to create new languages
•   allows programmers who have J2EE expertise
    to leverage the economics of Hadoop clusters




functional programming… in production

• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
    have invested in open source projects atop Cascading
    – used for their large-scale production deployments
•   new case studies for Cascading apps are mostly
    based on domain-specific languages (DSLs) in JVM
    languages which emphasize functional programming:

    Cascalog in Clojure (2010)
    Scalding in Scala (2012)


github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki




Cascading – definitions

• a pattern language for Enterprise Data Workflows
• simple to build, easy to test, robust in production
• design principles ⟹ ensure best practices at scale

[diagram: an Enterprise data workflow – a Web App serves Customers and writes Logs through a Cache; a source tap reads the Logs into a Data Workflow running on a Hadoop Cluster, with a trap tap for exceptions; sink taps write results back out, and a source tap reads customer profile DBs (Customer Prefs); the workflow feeds Support, Modeling (PMML), Analytics Cubes, and Reporting]
Cascading – usage

• Java API, DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL
• ASL 2 license, GitHub src, http://conjars.org
• 5+ yrs production use, multiple Enterprise verticals

[same workflow diagram as the previous slide]
Cascading – integrations

• partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC,
  SpringSource, Cloudera
• taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc.
• serialization: Avro, Thrift, Kryo, JSON, etc.
• topologies: Apache Hadoop, tuple spaces, local mode

[same workflow diagram as the previous slide]
Cascading – deployments

• case studies: Climate Corp, Twitter, Etsy,
  Williams-Sonoma, uSwitch, Airbnb, Nokia,
  YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud,
  social media, retail pricing, search analytics,
  recommenders, eCRM, utility grids, telecom,
  genomics, climatology, agronomics, etc.




Cascading – deployments

[repeats the previous slide, adding:]

workflow abstraction addresses:
• staffing bottleneck;
• system integration;
• operational complexity;
• test-driven development
Cascading: Workflow Abstraction

1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data

[agenda slide repeated, with the same flow diagram]
The Ubiquitous Word Count

Definition:

   count how often each word appears
   in a collection of text documents

This simple program provides an excellent test case for
parallel processing, since it:

 • requires a minimal amount of code
 • demonstrates use of both symbolic and numeric values
 • shows a dependency graph of tuples as an abstraction
 • is not many steps away from useful search indexing
 • serves as a “Hello World” for Hadoop apps

Any distributed computing framework which can run Word
Count efficiently in parallel at scale can handle much
larger and more interesting compute problems.

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));

[flow diagram: Document Collection → Tokenize (M) → GroupBy token (R) → Count → Word Count]
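As a point of comparison only (not from the deck): a minimal single-JVM sketch of the same definition, written with Java 8 streams, to make the functional shape of the pseudocode concrete. The input path argument and the \W+ token regex are illustrative assumptions, not the tokenizer used in the Cascading samples below.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LocalWordCount {
  public static void main(String[] args) throws Exception {
    // count how often each token appears in the given text file
    try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
      Map<String, Long> counts = lines
        .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
        .filter(token -> !token.isEmpty())
        .collect(Collectors.groupingBy(token -> token, Collectors.counting()));
      counts.forEach((token, count) -> System.out.println(token + "\t" + count));
    }
  }
}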
word count – conceptual flow diagram

[flow diagram: Document Collection → Tokenize (M) → GroupBy token (R) → Count → Word Count]

1 map
1 reduce
18 lines code

cascading.org/category/impatient
gist.github.com/3900702
word count – Cascading app in Java

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
 .addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
word count – generated flow diagram

[generated flow diagram for the "wc" flow (DOT output from dot/wc.dot), flattened here:]

[head]
 ↓
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']
 ↓ [{2}:'doc_id', 'text']
map: Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
 ↓ [{1}:'token']
GroupBy('wc')[by:['token']]
 ↓ wc[{1}:'token']
reduce: Every('wc')[Count[decl:'count']]
 ↓ [{2}:'token', 'count']
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']
 ↓ [{2}:'token', 'count']
[tail]
word count – Cascalog / Clojure

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient
word count – Cascalog / Clojure

github.com/nathanmarz/cascalog/wiki

• implements Datalog in Clojure, with predicates backed
  by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL –
  approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development
  (TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs –
  Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
word count – Scalding / Scala

import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"),
       ('doc_id, 'text),
       skipHeader = true)
    .read
    .flatMap('text -> 'token) {
       text : String => text.split("[ \\[\\]\\(\\),.]")
     }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}
word count – Scalding / Scala

github.com/twitter/scalding/wiki

• extends the Scala collections API so that distributed lists
  become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram
  and function calls
• extensive libraries are available for linear algebra, abstract
  algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog
word count – Scalding / Scala

github.com/twitter/scalding/wiki

[repeats the previous slide, adding:]

Cascalog and Scalding DSLs
leverage the functional aspects
of MapReduce, helping limit
complexity in process
Cascading: Workflow Abstraction

1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data

[agenda slide repeated, with the same flow diagram]
workflow abstraction – pattern language

Cascading uses a “plumbing” metaphor in the Java API
to define workflows out of familiar elements: Pipes, Taps,
Tuple Flows, Filters, Joins, Traps, etc.

[flow diagram: Document Collection → Tokenize (M) → Scrub token → HashJoin Left (RHS: Stop Word List → Regex token) → GroupBy token (R) → Count → Word Count]

Data is represented as flows of tuples. Operations within
the flows bring functional programming aspects into Java.

In formal terms, this provides a pattern language.
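A hedged sketch (not on the slide) of how those elements compose in the Java API, following the flow diagram above and assuming Cascading 2.x; the token-scrubbing Function is passed in, standing in for a custom Operation such as the ScrubFunction built in the Impatient tutorial series referenced earlier.

import cascading.operation.Function;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.HashJoin;
import cascading.pipe.Pipe;
import cascading.pipe.joiner.LeftJoin;
import cascading.tuple.Fields;

public class StopWordAssembly {
  // returns the tail pipe of the assembly sketched in the diagram
  public static Pipe build( Function scrub ) {
    Fields token = new Fields( "token" );
    Fields stop = new Fields( "stop" );

    // scrub each token with the custom Function
    Pipe docPipe = new Pipe( "token" );
    docPipe = new Each( docPipe, token, scrub, Fields.RESULTS );

    // left join against the stop word list on the RHS
    Pipe stopPipe = new Pipe( "stop" );
    Pipe tokenPipe = new HashJoin( docPipe, token, stopPipe, stop, new LeftJoin() );

    // keep only tuples where no stop word matched (empty RHS field)
    tokenPipe = new Each( tokenPipe, stop, new RegexFilter( "^$" ) );

    // group by token and count occurrences
    Pipe wcPipe = new GroupBy( new Pipe( "wc", tokenPipe ), token );
    return new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
  }
}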
references…

  pattern language: a structured method for solving
  large, complex design problems, where the syntax of
  the language promotes the use of best practices

  amazon.com/dp/0195019199



  design patterns: the notion originated in consensus
  negotiation for architecture, later applied in OOP
  software engineering by “Gang of Four”
  amazon.com/dp/0201633612




workflow abstraction – pattern language

[repeats the previous slide, adding:]

design principles of the pattern
language ensure best practices
for robust, parallel data workflows
at scale
workflow abstraction – literate programming

Cascading workflows generate their own visual
documentation: flow diagrams

[same flow diagram as before]

In formal terms, flow diagrams leverage a methodology
called literate programming.
Provides intuitive, visual representations for apps –
great for cross-team collaboration.
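The hook that emits these diagrams appears in the word count app shown earlier: a Flow writes its own plan as a Graphviz DOT file before it runs.

// from the word count app above: the planner's view of the flow
// is written to dot/wc.dot, which Graphviz can render as an image
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();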
references…

  by Don Knuth
  Literate Programming
  Univ of Chicago Press, 1992
  literateprogramming.com/

  “Instead of imagining that our main task is
   to instruct a computer what to do, let us
   concentrate rather on explaining to human
   beings what we want a computer to do.”




workflow abstraction – business process

Following the essence of literate programming, Cascading
workflows provide statements of business process
This recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
This is especially apparent in large-scale Cascalog apps:
  “Specify what you require, not how to achieve it.”
By virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale




references…

  by Edgar Codd
  “A relational model of data for large shared data banks”
  Communications of the ACM, 1970
  dl.acm.org/citation.cfm?id=362685
  Rather than arguing between SQL vs. NoSQL…
  structured vs. unstructured data frameworks…
  this approach focuses on what apps do:
    the process of structuring data




workflow abstraction – functional relational programming

The combination of functional programming, pattern language,
DSLs, literate programming, business process, etc., traces back
to the original definition of the relational model (Codd, 1970)
prior to SQL.
Cascalog, in particular, implements more of what Codd intended
for a “data sublanguage” and is considered to be close to a full
implementation of the functional relational programming
paradigm defined in:
   Moseley & Marks, 2006
   “Out of the Tar Pit”
   goo.gl/SKspn




workflow abstraction – functional relational programming

[repeats the previous slide, adding:]

several theoretical aspects converge
into software engineering practices
which minimize the complexity of
building and maintaining Enterprise
data workflows
Cascading: Workflow Abstraction

1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data

[agenda slide repeated, with the same flow diagram]
Enterprise Data Workflows

Let’s consider a “strawman” architecture
for an example app… at the front end
LOB use cases drive demand for apps

[the Enterprise data workflow diagram: Web App, Cache, Logs, source/sink/trap taps, Data Workflow on a Hadoop Cluster, customer profile DBs, Support, Modeling (PMML), Analytics Cubes, Reporting]
Enterprise Data Workflows

Same example… in the back office
Organizations have substantial investments
in people, infrastructure, process

[same diagram as the previous slide]
Enterprise Data Workflows

Same example… the heavy lifting!
“Main Street” firms are migrating
workflows to Hadoop, for cost
savings and scale-out

[same diagram as the previous slide]
Cascading workflows – taps

• taps integrate other data frameworks, as tuple streams
• these are “plumbing” endpoints in the pattern language
• sources (inputs), sinks (outputs), traps (exceptions)
• text delimited, JDBC, Memcached,
  HBase, Cassandra, MongoDB, etc.
• data serialization: Avro, Thrift,
  Kryo, JSON, etc.
• extend a new kind of tap in just
  a few lines of Java

schema and provenance get
derived from analysis of the taps

[same workflow diagram as before]
Cascading workflows – taps

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps – for TSV data in HDFS
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
 .addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
Cascading workflows – topologies

• topologies execute workflows on clusters
• flow planner is like a compiler for queries
  - Hadoop (MapReduce jobs)
  - local mode (dev/test or special config)
  - in-memory data grids (real-time)
• flow planner can be extended to support other topologies – see the sketch below

blend flows in different topologies into the same app – for example,
batch (Hadoop) + transactions (IMDG)

(diagram: the same enterprise data workflow, with taps spanning the Hadoop cluster, logs/cache, and customer profile DBs)



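a minimal sketch of swapping topologies, assuming Cascading 2.x – note that local-mode taps and schemes live in the cascading.*.local packages rather than the Hadoop ones:

// the same flow definition, planned onto different topologies
FlowConnector hadoopConnector = new HadoopFlowConnector( properties ); // MapReduce jobs
FlowConnector localConnector  = new LocalFlowConnector( properties );  // in-process dev/test

// the planner compiles the flow for whichever topology the connector targets
Flow flow = hadoopConnector.connect( flowDef );
flow.complete();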

                                                                                              47
Cascading workflows – topologies

// the HadoopFlowConnector invokes the flow planner for the Apache Hadoop topology
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
 .addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();



                                                                                                          48
example topologies…




                      49
Cascading workflows – test-driven development

• assert patterns (regex) on the tuple streams
• adjust assert levels, like log4j levels – see the sketch below
• trap edge cases as “data exceptions”
• TDD at scale:
  1. start from raw inputs in the flow graph
  2. define stream assertions for each stage of transforms
  3. verify exceptions, code to remove them
  4. when impl is complete, app has full test coverage

redirect traps in production to Ops, QA, Support, Audit, etc.

(diagram: the same enterprise data workflow, with the trap tap highlighted alongside the source and sink taps)

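a minimal sketch of a stream assertion, assuming the Cascading 2.x assertion API (AssertMatches, AssertionLevel) – docPipe and token are the names from the word count example:

// fail fast if any "token" value still contains whitespace after scrubbing
docPipe = new Each( docPipe, token,
  AssertionLevel.STRICT, new AssertMatches( "^\\S+$" ) );

// dial assertions up or down per planner run, much like log4j levels –
// e.g., planning at VALID strips the STRICT assertions out of production flows:
// flowDef.setAssertionLevel( AssertionLevel.VALID );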

                                                                                               50
Two Avenues to the App Layer…

Enterprise: must contend with
complexity at scale everyday…
incumbents extend current practices and
infrastructure investments – using J2EE,
ANSI SQL, SAS, etc. – to migrate
workflows onto Apache Hadoop while
leveraging existing staff

Start-ups: crave complexity and
scale to become viable…
new ventures move into Enterprise space
to compete using relatively lean staff,
while leveraging sophisticated engineering
practices, e.g., Cascalog and Scalding

(chart: the two avenues plotted against complexity ➞ and scale ➞ axes)

                                                                      51
Cascading: Workflow Abstraction
1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data

(diagram: word count flow – Document Collection → Tokenize → Scrub token → HashJoin Left against the Stop Word List (RHS) → GroupBy token → Count → Word Count)

                                                                                                              52
Cascading workflows – ANSI SQL

• collab with Optiq – industry-proven code base
• ANSI SQL parser/optimizer atop the Cascading flow planner
• JDBC driver to integrate into existing tools and app servers
• relational catalog over a collection of unstructured data
• SQL shell prompt to run queries
• enable analysts without retraining on Hadoop, etc.
• transparency for Support, Ops, Finance, et al.

a language for queries – not a database,
but ANSI SQL as a DSL for workflows

(diagram: the same enterprise data workflow, with SQL queries planned onto the Hadoop cluster)

                                                                                          53
Lingual – CSV data in local file system




cascading.org/lingual
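the screenshot is not reproduced here; the key detail survives in the JDBC example a few slides ahead – a “local” connection string maps a schema over a directory of CSV files, no cluster required:

jdbc:lingual:local;schemas=src/main/resources/data/example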


                                         54
Lingual – shell prompt, catalog




cascading.org/lingual


                                  55
Lingual – queries




cascading.org/lingual
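the screenshot is not reproduced here; as an illustration only (reconstructed from the JDBC example on the slides which follow, not from the screenshot), queries at the shell are plain ANSI SQL:

SELECT *
FROM "EXAMPLE"."SALES_FACT_1997" AS s
JOIN "EXAMPLE"."EMPLOYEE" AS e
ON e."EMPID" = s."CUST_ID";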


                        56
abstraction layers in queries…
          abstraction               RDBMS                   JVM Cluster
              parser               ANSI SQL                  ANSI SQL
                                 compliant parser          compliant parser
            optimizer             logical plan,              logical plan,
                            optimized based on stats   optimized based on stats
             planner               physical plan            API “plumbing”

            machine               query history,              app history,
             data                   table stats               tuple stats
             topology              b-trees, etc.       heterogeneous, distributed:
                                                        Hadoop, in-memory, etc.
           visualization               ERD                   flow diagram

             schema                table schema              tuple schema

             catalog             relational catalog          tap usage DB


           provenance             (manual audit)               data set
                                                         producers/consumers



                                                                                    57
Lingual – JDBC driver

public void run() throws ClassNotFoundException, SQLException {
    Class.forName( "cascading.lingual.jdbc.Driver" );
    Connection connection =
      DriverManager.getConnection(
       "jdbc:lingual:local;schemas=src/main/resources/data/example" );
    Statement statement = connection.createStatement();

    ResultSet resultSet = statement.executeQuery(
        "select *\n"
          + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
          + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
          + "on e.\"EMPID\" = s.\"CUST_ID\"" );

    while( resultSet.next() ) {
      int n = resultSet.getMetaData().getColumnCount();
      StringBuilder builder = new StringBuilder();

      for( int i = 1; i <= n; i++ ) {
        builder.append( ( i > 1 ? "; " : "" )
            + resultSet.getMetaData().getColumnLabel( i )
            + "="
            + resultSet.getObject( i ) );
        }

      System.out.println( builder );
      }

    resultSet.close();
    statement.close();
    connection.close();
    }




                                                                         58
Lingual – JDBC result set

$ gradle clean jar
$ hadoop jar build/libs/lingual-examples-1.0.0-wip-dev.jar
 
CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian




                            Caveat: if you absolutely positively must have sub-second
                            SQL query response for PB-scale data on a 1000+ node
                            cluster… good luck with that! (call the MPP vendors)
                            This ANSI SQL library is primarily intended for batch
                            workflows – high throughput, not low-latency –
                            for many under-represented use cases in Enterprise IT.
                            In other words, SQL as a DSL.




 cascading.org/lingual
                                                                                        59
Lingual – connecting Hadoop and R

   # load the JDBC package
   library(RJDBC)
    
   # set up the driver
   drv <- JDBC("cascading.lingual.jdbc.Driver",
     "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")
    
   # set up a database connection to a local repository
   connection <- dbConnect(drv,
     "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")
    
   # query the repository: in this case the MySQL sample database (CSV files)
   df <- dbGetQuery(connection,
     "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
   head(df)
    
   # use R functions to summarize and visualize part of the data
   df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
   summary(df$hire_age)

   library(ggplot2)
   m <- ggplot(df, aes(x=hire_age))
   m <- m + ggtitle("Age at hire, people named Gina")
   m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()


                                                                                             60
Lingual – connecting Hadoop and R

   > summary(df$hire_age)
      Min. 1st Qu. Median     Mean 3rd Qu.    Max.
     20.86   27.89   31.70   31.61   35.01   43.92




cascading.org/lingual
                                                     61
Cascading: Workflow Abstraction
1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data

(diagram: word count flow – Document Collection → Tokenize → Scrub token → HashJoin Left against the Stop Word List (RHS) → GroupBy token → Count → Word Count)

                                                                                                              62
Pattern – model scoring

• migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
• great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc.
• integrate with other libraries – Matrix API, etc.
• leverage PMML as another kind of DSL

(diagram: the same enterprise data workflow, with PMML models passing from Modeling into the Hadoop cluster)


cascading.org/pattern


                                                                                          63
Pattern – create a model in R

   ## train a RandomForest model

   library(randomForest)  # packages needed by this snippet
   library(pmml)
   library(XML)

   f <- as.formula("as.factor(label) ~ .")
   fit <- randomForest(f, data_train, ntree=50)

   ## test the model on the holdout test set

   print(fit$importance)
   print(fit)

   predicted <- predict(fit, data)
   data$predicted <- predicted
   confuse <- table(pred = predicted, true = data[,1])
   print(confuse)

   ## export predicted labels to TSV

   write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
     quote=FALSE, sep="\t", row.names=FALSE)

   ## export RF model to PMML

   saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))




                                                                          64
Pattern – capture model parameters as PMML
   <?xml version="1.0"?>
   <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.dmg.org/PMML-4_0
    http://www.dmg.org/v4-0/pmml-4-0.xsd">
    <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
     <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
     <Application name="Rattle/PMML" version="1.2.30"/>
     <Timestamp>2012-10-22 19:39:28</Timestamp>
    </Header>
    <DataDictionary numberOfFields="4">
     <DataField name="label" optype="categorical" dataType="string">
      <Value value="0"/>
      <Value value="1"/>
     </DataField>
     <DataField name="var0" optype="continuous" dataType="double"/>
     <DataField name="var1" optype="continuous" dataType="double"/>
     <DataField name="var2" optype="continuous" dataType="double"/>
    </DataDictionary>
    <MiningModel modelName="randomForest_Model" functionName="classification">
     <MiningSchema>
      <MiningField name="label" usageType="predicted"/>
      <MiningField name="var0" usageType="active"/>
      <MiningField name="var1" usageType="active"/>
      <MiningField name="var2" usageType="active"/>
     </MiningSchema>
     <Segmentation multipleModelMethod="majorityVote">
      <Segment id="1">
       <True/>
       <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
        <MiningSchema>
         <MiningField name="label" usageType="predicted"/>
         <MiningField name="var0" usageType="active"/>
         <MiningField name="var1" usageType="active"/>
         <MiningField name="var2" usageType="active"/>
        </MiningSchema>
   ...

                                                                                                                                                 65
Pattern – score a model, within an app
   public class Main {
     public static void main( String[] args ) {
       String pmmlPath = args[ 0 ];
       String ordersPath = args[ 1 ];
       String classifyPath = args[ 2 ];
       String trapPath = args[ 3 ];

         Properties properties = new Properties();
         AppProps.setApplicationJarClass( properties, Main.class );
         HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

         // create source and sink taps
        Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
        Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
        Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

         // define a "Classifier" model from PMML to evaluate the orders
         ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
         Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

         // connect the taps, pipes, etc., into a flow
         FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
          .addSource( classifyPipe, ordersTap )
          .addTrap( classifyPipe, trapTap )
          .addSink( classifyPipe, classifyTap );

         // write a DOT file and run the flow
         Flow classifyFlow = flowConnector.connect( flowDef );
         classifyFlow.writeDOT( "dot/classify.dot" );
         classifyFlow.complete();
       }
   }

                                                                                                                      66
Pattern – score a model, using pre-defined Cascading app



(diagram: Customer Orders → Classify (using the PMML Model) → Scored Orders → Assert → GroupBy token → Count, with Failure Traps captured and a Confusion Matrix produced)




cascading.org/pattern


                                                                                67
Pattern – score a model, using pre-defined Cascading app

   ## run an RF classifier at scale

   hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
     --pmml data/sample.rf.xml


   ## run an RF classifier at scale, assert regression test, measure confusion matrix

   hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
     --pmml data/sample.rf.xml --assert --measure out/measure


   ## run a predictive model at scale, measure RMSE

   hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
     --pmml data/iris.lm_p.xml --rmse out/measure




                                                                                        68
PMML – model coverage

•   Association Rules: AssociationModel element
•   Cluster Models: ClusteringModel element
•   Decision Trees: TreeModel element
•   Naïve Bayes Classifiers: NaiveBayesModel element
•   Neural Networks: NeuralNetwork element
•   Regression: RegressionModel and GeneralRegressionModel elements
•   Rulesets: RuleSetModel element
•   Sequences: SequenceModel element
•   Support Vector Machines: SupportVectorMachineModel element
•   Text Models: TextModel element
•   Time Series: TimeSeriesModel element

ibm.com/developerworks/industry/library/ind-PMML2/


                                                                      69
PMML – vendor coverage




                         70
experiments – Random Forest model

   ## train a Random Forest model
   ## example: http://mkseo.pe.kr/stats/?p=220
    
   f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
   fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
   print(fit)
   saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))



            OOB estimate of   error rate: 14%
   Confusion matrix:
      0   1 class.error
   0 69 16     0.1882353
   1 12 103    0.1043478




                                                                          71
experiments – Logistic Regression model

   ## train a Logistic Regression model (special case of GLM)
   ## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
    
   f <- as.formula("as.factor(label) ~ var0 + var2")
   fit <- glm(f, family=binomial, data=data)
   print(summary(fit))
   saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))



   Coefficients:
               Estimate Std. Error z value Pr(>|z|)
   (Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
   var0         -1.3755     0.4355  -3.159  0.00159 **
   var2         -3.7742     0.5794  -6.514 7.30e-11 ***
   ---
   Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1




   NB: this model has “var1” intentionally omitted


                                                                                 72
experiments – evaluating results

• use a confusion matrix to compare results for the classifiers
• Logistic Regression has a lower “false negative” rate (5% vs. 11%);
  however, it has a much higher “false positive” rate (52% vs. 14%)
• assign a cost model to select a winner –
  for example, in an ecommerce anti-fraud classifier:
    FN ∼ chargeback risk
    FP ∼ customer support costs
  (see the sketch below)
• can extend this to evaluate N models, M labels in an N × M × M matrix
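a minimal sketch of such a cost model in R – the $80/$5 figures are hypothetical, and confuse is the confusion matrix built earlier with table(pred=..., true=...):

## expected cost of a classifier, given its confusion matrix (rows = pred, cols = true)
cost <- function(confuse, fn_cost=80, fp_cost=5) {
  fn <- confuse["0", "1"]   # predicted negative, actually positive: chargeback risk
  fp <- confuse["1", "0"]   # predicted positive, actually negative: support costs
  fn * fn_cost + fp * fp_cost
}

## select the winner by lower expected cost, e.g.:
## cost(confuse_rf) vs. cost(confuse_lr)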




                                                                       73
Cascading: Workflow Abstraction
1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data

(diagram: word count flow – Document Collection → Tokenize → Scrub token → HashJoin Left against the Stop Word List (RHS) → GroupBy token → Count → Word Count)

                                                                                                              74
Palo Alto is quite a pleasant place

• temperate weather
• lots of parks, enormous trees
• great coffeehouses
• walkable downtown
• not particularly crowded


On a nice summer day, who wants to be stuck
indoors on a phone call?
Instead, take it outside – go for a walk

An example open source project:
github.com/Cascading/CoPA/wiki


                                              75
1. Open Data about municipal infrastructure
(GIS data: trees, roads, parks)
                             ✚
2. Big Data about where people like to walk
(smartphone GPS logs)
                             ✚
3. some curated metadata
(which surfaces the value)

4. personalized recommendations:
“Find a shady spot on a summer day in which to walk
 near downtown Palo Alto. While on a long conference call.
 Sipping a latte or enjoying some fro-yo.”

(diagram: the word count flow repeated as a motif)

                                                                                                                                           76
discovery
The City of Palo Alto recently began to support Open Data
to give the local community greater visibility into how
their city government operates.
This effort is intended to encourage students, entrepreneurs,
local organizations, etc., to build new apps which contribute
to the public good.


paloalto.opendata.junar.com/dashboards/7576/geographic-information/




                                                                            77
discovery
GIS about trees in Palo Alto:




                                            78
discovery
Geographic_Information,,,

"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl","               Private:     -1     Tree ID:     29
Street_Name:    ADDISON AV      Situs Number:      203      Tree Site:      2      Species:    Celtis australis
Source:    davey tree      Protected:         Designated:            Heritage:           Appraised Value:
Hardscape:    None     Identifier:    40      Active Numeric:        1     Location Feature ID:        13872
Provisional:         Install Date:        ","37.4409634615283,-122.15648458861,0.0 ","Point"
"Wilkie Way from West Meadow Drive to Victoria Place","             Sequence:      20     Street_Name:     Wilkie
Way     From Street PMMS:     West Meadow Drive        To Street PMMS:        Victoria Place       Street ID:
598 (Wilkie Wy, Palo Alto)       From Street ID PMMS:        689       To Street ID PMMS:      567      Year
Constructed:    1950      Traffic Count:    596      Traffic Index:        residential local        Traffic
Class:    local residential      Traffic Date:      08/24/90        Paving Length:       208     Paving Width:
40     Paving Area:    8320     Surface Type:      asphalt concrete         Surface Thickness:        2.0     Base
Type Pvmt:    crusher run base      Base Thickness:        6.0      Soil Class:       2    Soil Value:     15
Curb Type:         Curb Thickness:         Gutter Width:        36.0      Book:     22     Page:    1     District
Number:    18    Land Use PMMS:     1     Overlay Year:        1990      Overlay Thickness:       1.5     Base
Failure Year:    1990      Base Failure Thickness:       6      Surface Treatment Year:             Surface
Treatment Type:         Alligator Severity:      none       Alligator Extent:         0    Block Severity:
none     Block Extent:     0    Longitude and Transverse Severity:            none      Longitude and Transverse
Extent:    0
Trench Severity:    none    Trench Extent:    0    Ravelling Severity:    none    Ravelling Extent:    0
Rutting Severity:    none    Rutting Extent:    0    Ridability Severity:    none
                                          (unstructured data…)
Road Performance:     UL (Urban Local)      Bike Lane:       0      Bus Route:      0     Truck Route:     0
Remediation:         Deduct Value:    100      Priority:            Pavement Condition:       excellent
Street Cut Fee per SqFt:      10.00    Source Date:        6/10/2009       User Modified By:       mnicols
Identifier System:     21410    ","-122.1249640794,37.4155803115645,0.0
-122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0
-122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0
-122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line"
                                                                                                                     79
discovery
(defn parse-gis
  "leverages parse-csv for complex CSV format in GIS export"
  [line]
  (first (csv/parse-csv line)))


(defn etl-gis
  "subquery to parse data sets from the GIS source tap"
  [gis trap]
  (<- [?blurb ?misc ?geo ?kind]
      (gis ?line)
      (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
      (:trap (hfs-textline trap))))
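a usage sketch – the paths are hypothetical, and ?- plus hfs-textline come from cascalog.api; the subquery runs on the cluster, diverting unparseable records to the trap:

(let [gis (hfs-textline "data/copa.csv")]
  (?- (hfs-textline "out/gis" :sinkmode :replace)
      (etl-gis gis "out/trap")))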




                      (specify what you require,
                        not how to achieve it…
                      data prep costs are 80/20)


                                                                             80
discovery



 (ad-hoc queries get refined
into composable predicates)


    Identifier:   474
    Tree ID:      412
    Tree:         412 site 1 at 115 HAWTHORNE AV
    Tree Site:    1
    Street_Name: HAWTHORNE AV
    Situs Number: 115
    Private:      -1
    Species:      Liquidambar styraciflua
    Source:       davey tree
    Hardscape:    None
    37.446001565119,-122.167713417554,0.0
    Point



                                                         81
discovery




(curate valuable metadata)
                                         82
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 

Cascading: Enterprise Data Workflows based on Functional Programming

  • 1. “Cascading: Enterprise Data Workflows based on Functional Programming” Paco Nathan Concurrent, Inc. San Francisco, CA @pacoid Copyright @2013, Concurrent, Inc. 1
  • 2. Cascading: Workflow Abstraction – agenda: 1. Machine Data; 2. Cascading; 3. Sample Code; 4. A Little Theory…; 5. Workflows; 6. Lingual; 7. Pattern; 8. Open Data. (Flow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left against Stop Word List (RHS) → GroupBy token → Count → Word Count.)
  • 3. Q3 1997: inflection point Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware. This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG MapReduce and the Apache Hadoop open source stack emerged from this. 3
  • 4. Circa 1996: pre-inflection point. (Diagram: BI Analysts deliver Excel pivot tables and PowerPoint slide decks to Stakeholders; strategy flows to Product, requirements to Engineering; optimized code ships to the Web App serving Customers; transactions land in an RDBMS, which Analysts query via SQL for result sets.)
  • 5. Circa 1996: pre-inflection point – same diagram, with the callout “Throw it over the wall” between Analysts and Engineering.
  • 6. Circa 2001: post- big ecommerce successes. (Diagram: Stakeholders and Product drive UX and Engineering; Algorithmic Modeling produces models, recommenders, and classifiers for servlets in Web Apps and Middleware; event history aggregates from Logs through ETL into a DW, alongside customer transactions in the RDBMS; dashboards and SQL result sets feed back to the business.)
  • 7. Circa 2001: post- big ecommerce successes – same diagram, with the callout “Data products”.
  • 8. Circa 2013: clusters everywhere. (Diagram: Data Products for Customers, built by Domain Experts, Data Scientists, and App Devs; workflows span Web Apps/Mobile, Hadoop, In-Memory Data Grids, DW, and RDBMS topologies, with a Cluster Scheduler coordinating batch and near-time capacity across Ops.)
  • 9. Circa 2013: clusters everywhere – same diagram, with the callout “Optimizing topologies”.
  • 10. references… by Leo Breiman Statistical Modeling: The Two Cultures Statistical Science, 2001 bit.ly/eUTh9L 10
  • 11. references…
    Amazon – “Early Amazon: Splitting the website” – Greg Linden: glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
    eBay – “The eBay Architecture” – Randy Shoup, Dan Pritchett: addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html | addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
    Inktomi (YHOO Search) – “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff): youtube.com/watch?v=E91oEn1bnXM
    Google – “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff): youtube.com/watch?v=qsan-GQaeyk | perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
  • 12. Cascading: Workflow Abstraction – section divider, repeating the agenda from slide 2 (1. Machine Data; 2. Cascading; 3. Sample Code; 4. A Little Theory…; 5. Workflows; 6. Lingual; 7. Pattern; 8. Open Data).
  • 13. Cascading – origins API author Chris Wensel worked as a system architect at an Enterprise firm well-known for many popular data products. Wensel was following the Nutch open source project – where Hadoop started. Observation: would be difficult to find Java developers to write complex Enterprise apps in MapReduce – potential blocker for leveraging new open source technology. 13
  • 14. Cascading – functional programming Key insight: MapReduce is based on functional programming – back to LISP in 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature. To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows: • leverages JVM and Java-based tools without any need to create new languages • allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters 14
  • 15. functional programming… in production • Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010) Scalding in Scala (2012) github.com/nathanmarz/cascalog/wiki github.com/twitter/scalding/wiki 15
  • 16. Cascading – definitions: • a pattern language for Enterprise Data Workflows • simple to build, easy to test, robust in production • design principles ⟹ ensure best practices at scale. (Diagram: a sample app – Web App logs and Customer Prefs flow through source, trap, and sink taps into a PMML-driven Workflow on a Hadoop Cluster, feeding Analytics Cubes, Customer DBs, and Reporting for Support, Modeling, et al.)
  • 17. Cascading – usage: • Java API, DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL • ASL 2 license, GitHub src, http://conjars.org • 5+ yrs production use, multiple Enterprise verticals. (Same sample-app diagram as slide 16.)
  • 18. Cascading – integrations: • partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera • taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc. • serialization: Avro, Thrift, Kryo, JSON, etc. • topologies: Apache Hadoop, tuple spaces, local mode. (Same sample-app diagram as slide 16.)
  • 19. Cascading – deployments • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc. • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc. 19
  • 20. Cascading – deployments: • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc. • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc. Callout: the workflow abstraction addresses • staffing bottleneck • system integration • operational complexity • test-driven development.
  • 21. Cascading: Workflow Abstraction – section divider (same agenda as slide 2).
  • 22. The Ubiquitous Word Count. Definition: count how often each word appears in a collection of text documents. This simple program provides an excellent test case for parallel processing, since it illustrates: • requires a minimal amount of code • demonstrates use of both symbolic and numeric values • shows a dependency graph of tuples as an abstraction • is not many steps away from useful search indexing • serves as a “Hello World” for Hadoop apps. Pseudocode:
    void map (String doc_id, String text):
      for each word w in segment(text):
        emit(w, "1");

    void reduce (String word, Iterator group):
      int count = 0;
      for each pc in group:
        count += Int(pc);
      emit(word, String(count));
    Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.
  • 23. word count – conceptual flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; 1 map, 1 reduce, 18 lines of code. gist.github.com/3900702 | cascading.org/category/impatient
  • 24. word count – Cascading app in Java:
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex to split "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
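For quick iteration without a cluster, the same assembly can also run under Cascading’s local-mode planner. The following is a minimal sketch, assuming the Cascading 2.x local-mode classes (LocalFlowConnector, FileTap, and the local TextDelimited scheme) are on the classpath; only the connector and taps change, while the pipe assembly stays identical:

    import java.util.Properties;
    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.local.LocalFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.local.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.local.FileTap;
    import cascading.tuple.Fields;

    public class LocalWordCount {
      public static void main( String[] args ) {
        // local-mode taps read/write plain files instead of HDFS
        Tap docTap = new FileTap( new TextDelimited( true, "\t" ), args[ 0 ] );
        Tap wcTap = new FileTap( new TextDelimited( true, "\t" ), args[ 1 ] );

        // same pipe assembly as the Hadoop version above
        Fields token = new Fields( "token" );
        Pipe docPipe = new Each( "token", new Fields( "text" ),
          new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" ), Fields.RESULTS );
        Pipe wcPipe = new GroupBy( new Pipe( "wc", docPipe ), token );
        wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

        FlowDef flowDef = FlowDef.flowDef().setName( "wc-local" )
          .addSource( docPipe, docTap )
          .addTailSink( wcPipe, wcTap );

        // swap HadoopFlowConnector for LocalFlowConnector; nothing else changes
        Flow wcFlow = new LocalFlowConnector( new Properties() ).connect( flowDef );
        wcFlow.complete();
      }
    }

This makes unit tests and small sample runs fast, and the identical assembly can later be bound to a HadoopFlowConnector for the cluster.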
  • 25. word count – generated flow diagram. (DOT output from the flow planner: [head] → Hfs source tap over data/rain.txt with fields {doc_id, text} → map: Each('token') RegexSplitGenerator → GroupBy('wc') by token → reduce: Every('wc') Count → Hfs sink tap to output/wc with fields {token, count} → [tail].)
  • 26. word count – Cascalog / Clojure:
    (ns impatient.core
      (:use [cascalog.api]
            [cascalog.more-taps :only (hfs-delimited)])
      (:require [clojure.string :as s]
                [cascalog.ops :as c])
      (:gen-class))

    (defmapcatop split [line]
      "reads in a line of string and splits it by regex"
      (s/split line #"[\[\](),.)\s]+"))

    (defn -main [in out & args]
      (?<- (hfs-delimited out)
           [?word ?count]
           ((hfs-delimited in :skip-header? true) _ ?line)
           (split ?line :> ?word)
           (c/count ?count)))

    ; Paul Lam
    ; github.com/Quantisan/Impatient
  • 27. word count – Cascalog / Clojure Document Collection github.com/nathanmarz/cascalog/wiki Tokenize GroupBy M token Count R Word Count • implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language • run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL • composable subqueries, used for test-driven development (TDD) practices at scale • Leiningen build: simple, no surprises, in Clojure itself • more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog • has a learning curve, limited number of Clojure developers • aggregators are the magic, and those take effort to learn 27
  • 28. word count – Scalding / Scala:
    import com.twitter.scalding._

    class WordCount(args : Args) extends Job(args) {
      Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
        .read
        .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
        .groupBy('token) { _.size('count) }
        .write(Tsv(args("wc"), writeHeader = true))
    }
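As a usage sketch (jar name and paths hypothetical): Scalding jobs are typically launched through com.twitter.scalding.Tool, with --local for in-memory test runs or --hdfs for the cluster, and named arguments matching args("doc") and args("wc") in the code above:

    # run locally on sample data (hypothetical jar and paths)
    hadoop jar wordcount-assembly.jar com.twitter.scalding.Tool WordCount --local --doc data/rain.txt --wc output/wc

    # run the same job on a Hadoop cluster
    hadoop jar wordcount-assembly.jar com.twitter.scalding.Tool WordCount --hdfs --doc data/rain.txt --wc output/wc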
  • 29. word count – Scalding / Scala Document Collection github.com/twitter/scalding/wiki Tokenize GroupBy M token Count R Word Count • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading • code is compact, easy to understand • nearly 1:1 between elements of conceptual flow diagram and function calls • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc. • significant investments by Twitter, Etsy, eBay, etc. • great for data services at scale • less learning curve than Cascalog 29
  • 30. word count – Scalding / Scala: github.com/twitter/scalding/wiki – repeats the points from slide 29, with the callout: the Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping limit complexity in the process.
  • 31. Cascading: Workflow Abstraction – section divider (same agenda as slide 2).
  • 32. workflow abstraction – pattern language: Cascading uses a “plumbing” metaphor in the Java API to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc. Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java. In formal terms, this provides a pattern language. (Flow diagram as on slide 2.)
  • 33. references… pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices amazon.com/dp/0195019199 design patterns: the notion originated in consensus negotiation for architecture, later applied in OOP software engineering by “Gang of Four” amazon.com/dp/0201633612 33
  • 34. workflow abstraction – pattern language: same as slide 32, with the callout: design principles of the pattern language ensure best practices for robust, parallel data workflows at scale.
  • 35. workflow abstraction – literate programming: Cascading workflows generate their own visual documentation: flow diagrams. In formal terms, flow diagrams leverage a methodology called literate programming. Provides intuitive, visual representations for apps – great for cross-team collaboration. (Flow diagram as on slide 2.)
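As a small usage note: the writeDOT call in the sample app emits that diagram as a Graphviz file, so rendering the visual documentation is a one-liner, assuming Graphviz is installed:

    dot -Tpng dot/wc.dot -o dot/wc.png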
  • 36. references… by Don Knuth Literate Programming Univ of Chicago Press, 1992 literateprogramming.com/ “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.” 36
  • 37. workflow abstraction – business process Following the essence of literate programming, Cascading workflows provide statements of business process This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data) Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.) This is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.” By virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale 37
  • 38. references… by Edgar Codd “A relational model of data for large shared data banks” Communications of the ACM, 1970 dl.acm.org/citation.cfm?id=362685 Rather than arguing between SQL vs. NoSQL… structured vs. unstructured data frameworks… this approach focuses on what apps do: the process of structuring data 38
  • 39. workflow abstraction – functional relational programming The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970) prior to SQL. Cascalog, in particular, implements more of what Codd intended for a “data sublanguage” and is considered to be close to a full implementation of the functional relational programming paradigm defined in: Moseley & Marks, 2006 “Out of the Tar Pit” goo.gl/SKspn 39
  • 40. workflow abstraction – functional relational programming: same as slide 39, with the callout: several theoretical aspects converge into software engineering practices which minimize the complexity of building and maintaining Enterprise data workflows.
  • 41. Cascading: Workflow Abstraction – section divider (same agenda as slide 2).
  • 42. Enterprise Data Workflows: let’s consider a “strawman” architecture for an example app… at the front end, LOB use cases drive demand for apps. (Sample-app diagram as on slide 16.)
  • 43. Enterprise Data Workflows: same example… in the back office – organizations have substantial investments in people, infrastructure, process. (Sample-app diagram as on slide 16.)
  • 44. Enterprise Data Workflows: same example… the heavy lifting! “Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out. (Sample-app diagram as on slide 16.)
  • 45. Cascading workflows – taps: • taps integrate other data frameworks, as tuple streams • these are “plumbing” endpoints in the pattern language • sources (inputs), sinks (outputs), traps (exceptions) • text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc. • data serialization: Avro, Thrift, Kryo, JSON, etc. • extend a new kind of tap in just a few lines of Java – see the sketch below. Schema and provenance get derived from analysis of the taps. (Sample-app diagram as on slide 16.)
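To make the decoupling concrete, here is a minimal sketch, assuming Cascading 2.x tap classes (Hfs for HDFS, Lfs for the local filesystem under the Hadoop planner): the tuple schema is declared once, and endpoints swap without touching the pipe assembly:

    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tap.hadoop.Lfs;
    import cascading.tuple.Fields;

    public class TapSwap {
      // one declaration of the tuple schema, shared by all endpoints
      static final Fields DOC_FIELDS = new Fields( "doc_id", "text" );

      // HDFS endpoint, e.g., for production runs
      static Tap hdfsSource( String path ) {
        return new Hfs( new TextDelimited( DOC_FIELDS, true, "\t" ), path );
      }

      // local filesystem endpoint, e.g., for dev/test – same scheme, different tap
      static Tap localSource( String path ) {
        return new Lfs( new TextDelimited( DOC_FIELDS, true, "\t" ), path );
      }
    }

Because the pipe assembly only sees tuple streams, swapping Hfs for Lfs (or a Memcached, HBase, or JDBC tap) is a one-line change at the flow definition.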
  • 46. Cascading workflows – taps: the Word Count app source from slide 24 again, with the callout “source and sink taps for TSV data in HDFS” highlighting the two Hfs taps.
  • 47. Cascading workflows – topologies: • topologies execute workflows on clusters • the flow planner is like a compiler for queries – Hadoop (MapReduce jobs), local mode (dev/test or special config), in-memory data grids (real-time) • the flow planner can be extended to support other topologies. Blend flows in different topologies into the same app – for example, batch (Hadoop) + transactions (IMDG). (Sample-app diagram as on slide 16.)
  • 48. Cascading workflows – topologies: the Word Count app source from slide 24 again, with the callout “flow planner for Apache Hadoop topology” highlighting the HadoopFlowConnector.
  • 50. Cascading workflows – test-driven development: • assert patterns (regex) on the tuple streams • adjust assert levels, like log4j levels • trap edge cases as “data exceptions” • TDD at scale: 1. start from raw inputs in the flow graph; 2. define stream assertions for each stage of transforms; 3. verify exceptions, code to remove them; 4. when the impl is complete, the app has full test coverage. Redirect traps in production to Ops, QA, Support, Audit, etc. – see the sketch below. (Sample-app diagram as on slide 16.)
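A minimal sketch of a stream assertion plus a trap, assuming Cascading 2.x assertion classes (AssertionLevel, AssertMatches); tuples that fail an operation in the asserted pipe are diverted to the trap tap rather than failing the job:

    import cascading.flow.FlowDef;
    import cascading.operation.AssertionLevel;
    import cascading.operation.assertion.AssertMatches;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    public class AssertionExample {
      public static Pipe withAssertions( Pipe docPipe, FlowDef flowDef, String trapPath ) {
        // assert each tuple matches the pattern; STRICT levels also apply in production runs
        docPipe = new Each( docPipe, AssertionLevel.STRICT, new AssertMatches( "^\\S+$" ) );

        // tuples that fail are redirected here for Ops/QA/Support/Audit review
        Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );
        flowDef.addTrap( docPipe, trapTap );
        return docPipe;
      }
    }

Dialing the assertion level down (e.g., to VALID or NONE at planning time) removes the checks without touching the business logic, much like adjusting log4j levels.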
  • 51. Two Avenues to the App Layer… complexity ➞ Enterprise: must contend with complexity at scale everyday… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff. scale ➞ Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding.
  • 52. Cascading: Workflow Abstraction – section divider (same agenda as slide 2).
  • 53. Cascading workflows – ANSI SQL: • collab with Optiq – industry-proven code base • ANSI SQL parser/optimizer atop the Cascading flow planner • JDBC driver to integrate into existing tools and app servers • relational catalog over a collection of unstructured data • SQL shell prompt to run queries • enable analysts without retraining on Hadoop, etc. • transparency for Support, Ops, Finance, et al. A language for queries – not a database, but ANSI SQL as a DSL for workflows. (Sample-app diagram as on slide 16.)
  • 54. Lingual – CSV data in local file system cascading.org/lingual 54
  • 55. Lingual – shell prompt, catalog cascading.org/lingual 55
  • 57. abstraction layers in queries…
    abstraction    | RDBMS                                  | JVM Cluster
    parser         | ANSI SQL compliant parser              | ANSI SQL compliant parser
    optimizer      | logical plan, optimized based on stats | logical plan, optimized based on stats
    planner        | physical plan                          | API “plumbing”
    machine data   | query history, table stats             | app history, tuple stats
    topology       | b-trees, etc.                          | heterogeneous, distributed: Hadoop, in-memory, etc.
    visualization  | ERD                                    | flow diagram
    schema         | table schema                           | tuple schema
    catalog        | relational catalog                     | tap usage DB
    provenance     | (manual audit)                         | data set producers/consumers
  • 58. Lingual – JDBC driver:
    public void run() throws ClassNotFoundException, SQLException {
      Class.forName( "cascading.lingual.jdbc.Driver" );
      Connection connection = DriverManager.getConnection(
        "jdbc:lingual:local;schemas=src/main/resources/data/example" );
      Statement statement = connection.createStatement();

      ResultSet resultSet = statement.executeQuery(
        "select *\n"
        + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
        + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
        + "on e.\"EMPID\" = s.\"CUST_ID\"" );

      while( resultSet.next() ) {
        int n = resultSet.getMetaData().getColumnCount();
        StringBuilder builder = new StringBuilder();

        for( int i = 1; i <= n; i++ ) {
          builder.append( ( i > 1 ? "; " : "" )
            + resultSet.getMetaData().getColumnLabel( i )
            + "=" + resultSet.getObject( i ) );
        }
        System.out.println( builder );
      }

      resultSet.close();
      statement.close();
      connection.close();
    }
  • 59. Lingual – JDBC result set:
    $ gradle clean jar
    $ hadoop jar build/libs/lingual-examples-1.0.0-wip-dev.jar
    CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
    CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian
    Caveat: if you absolutely positively must have sub-second SQL query response for Pb-scale data on a 1000+ node cluster… good luck with that! (call the MPP vendors) This ANSI SQL library is primarily intended for batch workflows – high throughput, not low latency – for many under-represented use cases in Enterprise IT. In other words, SQL as a DSL. cascading.org/lingual
  • 60. Lingual – connecting Hadoop and R:
    # load the JDBC package
    library(RJDBC)

    # set up the driver
    drv <- JDBC("cascading.lingual.jdbc.Driver",
      "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

    # set up a database connection to a local repository
    connection <- dbConnect(drv,
      "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

    # query the repository: in this case the MySQL sample database (CSV files)
    df <- dbGetQuery(connection,
      "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
    head(df)

    # use R functions to summarize and visualize part of the data
    df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
    summary(df$hire_age)

    library(ggplot2)
    m <- ggplot(df, aes(x=hire_age))
    m <- m + ggtitle("Age at hire, people named Gina")
    m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
  • 61. Lingual – connecting Hadoop and R:
    > summary(df$hire_age)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      20.86   27.89   31.70   31.61   35.01   43.92
    cascading.org/lingual
  • 62. Cascading: Workflow Abstraction – section divider (same agenda as slide 2).
  • 63. Pattern – model scoring: • migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML • great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc. • integrate with other libraries – Matrix API, etc. • leverage PMML as another kind of DSL. cascading.org/pattern (Sample-app diagram as on slide 16.)
  • 64. Pattern – create a model in R:
    ## train a RandomForest model
    f <- as.formula("as.factor(label) ~ .")
    fit <- randomForest(f, data_train, ntree=50)

    ## test the model on the holdout test set
    print(fit$importance)
    print(fit)
    predicted <- predict(fit, data)
    data$predicted <- predicted
    confuse <- table(pred = predicted, true = data[,1])
    print(confuse)

    ## export predicted labels to TSV
    write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
      quote=FALSE, sep="\t", row.names=FALSE)

    ## export RF model to PMML
    saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
  • 65. Pattern – capture model parameters as PMML:
    <?xml version="1.0"?>
    <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
     <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
      <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
      <Application name="Rattle/PMML" version="1.2.30"/>
      <Timestamp>2012-10-22 19:39:28</Timestamp>
     </Header>
     <DataDictionary numberOfFields="4">
      <DataField name="label" optype="categorical" dataType="string">
       <Value value="0"/>
       <Value value="1"/>
      </DataField>
      <DataField name="var0" optype="continuous" dataType="double"/>
      <DataField name="var1" optype="continuous" dataType="double"/>
      <DataField name="var2" optype="continuous" dataType="double"/>
     </DataDictionary>
     <MiningModel modelName="randomForest_Model" functionName="classification">
      <MiningSchema>
       <MiningField name="label" usageType="predicted"/>
       <MiningField name="var0" usageType="active"/>
       <MiningField name="var1" usageType="active"/>
       <MiningField name="var2" usageType="active"/>
      </MiningSchema>
      <Segmentation multipleModelMethod="majorityVote">
       <Segment id="1">
        <True/>
        <TreeModel modelName="randomForest_Model" functionName="classification"
         algorithmName="randomForest" splitCharacteristic="binarySplit">
         <MiningSchema>
          <MiningField name="label" usageType="predicted"/>
          <MiningField name="var0" usageType="active"/>
          <MiningField name="var1" usageType="active"/>
          <MiningField name="var2" usageType="active"/>
         </MiningSchema>
    ...
  • 66. Pattern – score a model, within an app:
    public class Main {
      public static void main( String[] args ) {
        String pmmlPath = args[ 0 ];
        String ordersPath = args[ 1 ];
        String classifyPath = args[ 2 ];
        String trapPath = args[ 3 ];

        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

        // create source and sink taps
        Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
        Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
        Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

        // define a "Classifier" model from PMML to evaluate the orders
        ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
        Pipe classifyPipe = new Each( new Pipe( "classify" ),
          classFunc.getInputFields(), classFunc, Fields.ALL );

        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
          .addSource( classifyPipe, ordersTap )
          .addTrap( classifyPipe, trapTap )
          .addSink( classifyPipe, classifyTap );

        // write a DOT file and run the flow
        Flow classifyFlow = flowConnector.connect( flowDef );
        classifyFlow.writeDOT( "dot/classify.dot" );
        classifyFlow.complete();
      }
    }
  • 67. Pattern – score a model, using the pre-defined Cascading app. (Flow diagram: Customer Orders → Classify (PMML Model) → Assert → GroupBy token → Count → Scored Orders, with Failure Traps and a Confusion Matrix as additional sinks.) cascading.org/pattern
  • 68. Pattern – score a model, using the pre-defined Cascading app:
    ## run an RF classifier at scale
    hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap --pmml data/sample.rf.xml

    ## run an RF classifier at scale, assert regression test, measure confusion matrix
    hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap --pmml data/sample.rf.xml --assert --measure out/measure

    ## run a predictive model at scale, measure RMSE
    hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap --pmml data/iris.lm_p.xml --rmse out/measure
  • 69. PMML – model coverage • Association Rules: AssociationModel element • Cluster Models: ClusteringModel element • Decision Trees: TreeModel element • Naïve Bayes Classifiers: NaiveBayesModel element • Neural Networks: NeuralNetwork element • Regression: RegressionModel and GeneralRegressionModel elements • Rulesets: RuleSetModel element • Sequences: SequenceModel element • Support Vector Machines: SupportVectorMachineModel element • Text Models: TextModel element • Time Series: TimeSeriesModel element ibm.com/developerworks/industry/library/ind-PMML2/ 69
  • 70. PMML – vendor coverage 70
  • 71. experiments – Random Forest model:
    ## train a Random Forest model
    ## example: http://mkseo.pe.kr/stats/?p=220
    f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
    fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
    print(fit)
    saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))

    OOB estimate of error rate: 14%
    Confusion matrix:
        0   1 class.error
    0  69  16   0.1882353
    1  12 103   0.1043478
  • 72. experiments – Logistic Regression model:
    ## train a Logistic Regression model (special case of GLM)
    ## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
    f <- as.formula("as.factor(label) ~ var0 + var2")
    fit <- glm(f, family=binomial, data=data)
    print(summary(fit))
    saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
    var0         -1.3755     0.4355  -3.159  0.00159 **
    var2         -3.7742     0.5794  -6.514 7.30e-11 ***
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    NB: this model has “var1” intentionally omitted
  • 73. experiments – evaluating results • use a confusion matrix to compare results for the classifiers • Logistic Regression has a lower “false negative” rate (5% vs. 11%) however it has a much higher “false positive” rate (52% vs. 14%) • assign a cost model to select a winner – for example, in an ecommerce anti-fraud classifier: FN ∼ chargeback risk FP ∼ customer support costs • can extend this to evaluate N models, M labels in an N × M × M matrix 73
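To make the cost model concrete, here is a small sketch in Java (error rates from the slide; the per-event dollar costs are hypothetical, and the arithmetic treats the rates as per-transaction probabilities for simplicity):

    public class CostModel {
      // expected cost per transaction, given error rates and unit costs
      static double expectedCost( double fnRate, double fpRate, double fnCost, double fpCost ) {
        return fnRate * fnCost + fpRate * fpCost;
      }

      public static void main( String[] args ) {
        double chargeback = 50.0;  // hypothetical cost of a missed fraud case (FN)
        double support = 5.0;      // hypothetical cost of a false alarm (FP)

        // rates from the slide: RF 11% FN / 14% FP, LR 5% FN / 52% FP
        double rf = expectedCost( 0.11, 0.14, chargeback, support );  // 6.20
        double lr = expectedCost( 0.05, 0.52, chargeback, support );  // 5.10

        System.out.printf( "RF: $%.2f per txn, LR: $%.2f per txn%n", rf, lr );
      }
    }

With these unit costs the Logistic Regression model wins despite its much higher false-positive rate; shift the FP cost upward and the Random Forest wins instead, which is exactly why the cost model, not raw accuracy, should select the winner.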
  • 74. Cascading: Workflow Abstraction – section divider (same agenda as slide 2).
  • 75. Palo Alto is quite a pleasant place: • temperate weather • lots of parks, enormous trees • great coffeehouses • walkable downtown • not particularly crowded. On a nice summer day, who wants to be stuck indoors on a phone call? Instead, take it outside – go for a walk. An example open source project: github.com/Cascading/CoPA/wiki
  • 76. 1. Open Data about municipal infrastructure (GIS data: trees, roads, parks) ✚ 2. Big Data about where people like to walk (smartphone GPS logs) ✚ 3. some curated metadata (which surfaces the value) ✚ 4. personalized recommendations: “Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sipping a latte or enjoying some fro-yo.” (Flow diagram as on slide 2.)
  • 77. discovery The City of Palo Alto recently began to support Open Data to give the local community greater visibility into how their city government operates This effort is intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good paloalto.opendata.junar.com/dashboards/7576/geographic-information/ 77
  • 78. discovery GIS about trees in Palo Alto: 78
  • 79. discovery – raw GIS export (unstructured data…):
    Geographic_Information,,,
    "Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis Source: davey tree Protected: Designated: Heritage: Appraised Value: Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872 Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point"
    "Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie Way From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID: 598 (Wilkie Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year Constructed: 1950 Traffic Count: 596 Traffic Index: residential local Traffic Class: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width: 40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: 2.0 Base Type Pvmt: crusher run base Base Thickness: 6.0 Soil Class: 2 Soil Value: 15 Curb Type: Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1 District Number: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure Thickness: 6 Surface Treatment Year: Surface Treatment Type: Alligator Severity: none Alligator Extent: 0 Block Severity: none Block Extent: 0 Longitude and Transverse Severity: none Longitude and Transverse Extent: 0 Trench Severity: none Trench Extent: 0 Ravelling Severity: none Ravelling Extent: 0 Rutting Severity: none Rutting Extent: 0 Ridability Severity: none Road Performance: UL (Urban Local) Bike Lane: 0 Bus Route: 0 Truck Route: 0 Remediation: Deduct Value: 100 Priority: Pavement Condition: excellent Street Cut Fee per SqFt: 10.00 Source Date: 6/10/2009 User Modified By: mnicols Identifier System: 21410 ","-122.1249640794,37.4155803115645,0.0 -122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0 -122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0 -122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line"
  • 80. discovery:
    (defn parse-gis [line]
      "leverages parse-csv for complex CSV format in GIS export"
      (first (csv/parse-csv line)))

    (defn etl-gis [gis trap]
      "subquery to parse data sets from the GIS source tap"
      (<- [?blurb ?misc ?geo ?kind]
          (gis ?line)
          (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
          (:trap (hfs-textline trap))))

    (specify what you require, not how to achieve it… data prep costs are 80/20)
  • 81. discovery (ad-hoc queries get refined into composable predicates) Identifier: 474 Tree ID: 412 Tree: 412 site 1 at 115 HAWTHORNE AV Tree Site: 1 Street_Name: HAWTHORNE AV Situs Number: 115 Private: -1 Species: Liquidambar styraciflua Source: davey tree Hardscape: None 37.446001565119,-122.167713417554,0.0 Point 81