Cascading Meetup #4

                       BlueKai
                       Cupertino, CA
                       2013-03-05




                                       Copyright @2013, Concurrent, Inc.




Tuesday, 05 March 13                                                       1
Cascading Meetup
[Diagram: word count flow. Labels: Document Collection, Tokenize, Scrub token, HashJoin Left, Stop Word List (RHS), Regex token, GroupBy token, Count, Word Count; markers M and R]

1. Enterprise Data Workflows
2. ANSI SQL Support
3. Test-Driven Development

Enterprise Data Workflows
Let’s consider an example app… at the front end
LOB use cases drive demand for apps

[Diagram: enterprise data workflow. Labels: Customers, Web App, Logs, Cache, Support, trap tap, source tap, sink tap, Data Workflow, Modeling (PMML), Analytics Cubes, Reporting, Customer Prefs, customer profile DBs, Hadoop Cluster]

LOB use cases drive the demand for Big Data apps
Enterprise Data Workflows
An example… in the back office
Organizations have substantial investments in people, infrastructure, process

[Diagram: enterprise data workflow, repeated]

Enterprise organizations have seriously ginormous investments in existing back office practices:
people, infrastructure, processes
Enterprise Data Workflows
An example… for the heavy lifting!
“Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out

[Diagram: enterprise data workflow, repeated]

“Main Street” firms have invested in Hadoop to address Big Data needs,
offsetting their rising costs for Enterprise licenses from SAS, Teradata, etc.
Two Avenues…

Enterprise: must contend with complexity at scale every day…
incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff

Start-ups: crave complexity and scale to become viable…
new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding

[Chart axes: complexity ➞, scale ➞]

Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity
Two Avenues…

Enterprise: must contend with complexity at scale every day…
incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff

Start-ups: crave complexity and scale to become viable…
new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding

Callout: Hadoop almost never gets used in isolation; data workflows define the “glue” required for system integration of Enterprise apps

[Chart axes: complexity ➞, scale ➞]

Hadoop is almost never used in isolation.
Enterprise data workflows are about system integration.
There are a couple different ways to arrive at the party.
Cascading Meetup
[Diagram: word count flow, repeated]

1. Enterprise Data Workflows
2. ANSI SQL Support
3. Test-Driven Development

Cascading workflows – ANSI SQL

• collab with Optiq – industry-proven code base
• ANSI SQL parser/optimizer atop Cascading flow planner
• JDBC driver to integrate into existing tools and app servers
• relational catalog over a collection of unstructured data
• SQL shell prompt to run queries

[Diagram: enterprise data workflow, repeated]

ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration.

Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.
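The idea of SQL as “machine code”, emitted by tools rather than typed by people, can be sketched in plain Java. This is only an illustration under stated assumptions: the class `GeneratedSql` and its helpers `quote` and `select` are hypothetical names, not part of Cascading, Lingual, or Optiq, and the `NAME` column is made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class: shows a program, not a person, writing the SQL text.
final class GeneratedSql {

    // Quote an identifier ANSI-style: wrap in double quotes, double any embedded quote.
    static String quote(String identifier) {
        return "\"" + identifier.replace("\"", "\"\"") + "\"";
    }

    // Build a SELECT over schema.table for the given column names.
    static String select(String schema, String table, List<String> columns) {
        String cols = columns.stream()
            .map(GeneratedSql::quote)
            .collect(Collectors.joining(", "));
        return "select " + cols + " from " + quote(schema) + "." + quote(table);
    }

    public static void main(String[] args) {
        // EXAMPLE and EMPLOYEE echo names used elsewhere in this deck; NAME is an assumption.
        System.out.println(select("EXAMPLE", "EMPLOYEE", Arrays.asList("EMPID", "NAME")));
        // prints: select "EMPID", "NAME" from "EXAMPLE"."EMPLOYEE"
    }
}
```

A generated statement like this would then be handed to whatever SQL endpoint the tool talks to; quoting every identifier keeps the generator from depending on case-folding rules.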
Cascading workflows – ANSI SQL

• collab with Optiq – industry-proven code base
• ANSI SQL parser/optimizer atop Cascading flow planner
• JDBC driver to integrate into existing tools and app servers
• relational catalog over a collection of unstructured data
• SQL shell prompt to run queries

Callout: Premise: most SQL in the world gets written by machines… This isn’t a database; this is about making machine-to-machine communications simpler and more robust at scale.

[Diagram: enterprise data workflow, repeated]

Cascading workflows – ANSI SQL

• enable analysts without retraining on Hadoop, etc.
• transparency for Support, Ops, Finance, et al.

a language for queries – not a database, but ANSI SQL as a DSL for workflows

[Diagram: enterprise data workflow, repeated]

ANSI SQL – reviews
            Open Source 'Lingual' Helps SQL Devs Unlock Hadoop
            Thor Olavsrud, 2013-02-22
            cio.com/article/729283/Open_Source_Lingual_Helps_SQL_Devs_Unlock_Hadoop


            Hadoop Apps Without MapReduce Mindsets
            Adrian Bridgwater, 2013-02-28
            drdobbs.com/open-source/hadoop-apps-without-mapreduce-mindsets/240149708


            Concurrent gives old SQL users new Hadoop tricks
            Jack Clark, 2013-02-20
            theregister.co.uk/2013/02/20/hadoop_sql_translator_lingual_launches/


            Concurrent Open Source Project Ties SQL to Hadoop
            Michael Vizard, 2013-02-21
            itbusinessedge.com/blogs/it-unmasked/concurrent-open-source-project-ties-sql-to-hadoop.html


            Concurrent Releases Lingual, a SQL DSL for Hadoop
            Boris Lublinsky, 2013-02-28
            infoq.com/news/2013/02/Lingual

ANSI SQL – CSV data in local file system




               cascading.org/lingual


The test database for MySQL is available for download from https://launchpad.net/test-db/

Here we have a bunch o’ CSV flat files in a directory in the local file system.

Use the “lingual” command line interface to overlay DDL to describe the expected table schema.
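To make “overlaying a schema on flat files” concrete, here is a minimal plain-Java sketch of the concept. It is not the Lingual command line and does not show its flags: the class `SchemaOverlay`, the helper `applySchema`, and the two-column declaration (`EMPID` as an int, `NAME` as a string) are all assumptions made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a declared schema gives names and types to untyped CSV fields.
final class SchemaOverlay {

    // Assumed column declarations: {name, type}, where type is "int" or "string".
    static final String[][] EMPLOYEE_COLUMNS = {
        {"EMPID", "int"},
        {"NAME", "string"}
    };

    // Apply the declared columns to one CSV line.
    // Naive split: does not handle quoted fields that contain commas.
    static Map<String, Object> applySchema(String csvLine, String[][] columns) {
        String[] fields = csvLine.split(",", -1);
        if (fields.length != columns.length) {
            throw new IllegalArgumentException(
                "expected " + columns.length + " fields, got " + fields.length);
        }
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            String raw = fields[i].trim();
            if ("int".equals(columns[i][1])) {
                row.put(columns[i][0], Integer.valueOf(raw));
            } else {
                row.put(columns[i][0], raw);
            }
        }
        return row;
    }

    public static void main(String[] args) {
        System.out.println(applySchema("1, Alice", EMPLOYEE_COLUMNS));
        // prints: {EMPID=1, NAME=Alice}
    }
}
```

The files themselves stay untouched; only the description of what each column means is added on top, which is the sense in which a relational catalog can sit over unstructured data.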
ANSI SQL – shell prompt, catalog




                cascading.org/lingual


Use the “lingual” SQL shell prompt to run SQL queries interactively, show catalog, etc.
ANSI SQL – queries




              cascading.org/lingual


Here’s an example SQL query on that “employee” test database from MySQL.
ANSI SQL – layers

              abstraction      RDBMS                                     JVM Cluster
              parser           ANSI SQL compliant parser                 ANSI SQL compliant parser
              optimizer        logical plan, optimized based on stats    logical plan, optimized based on stats
              planner          physical plan                             API “plumbing”
              machine data     query history, table stats                app history, tuple stats
              topology         b-trees, etc.                             heterogenous, distributed: Hadoop, IMDG, etc.
              visualization    ERD                                       flow diagram
              schema           table schema                              tuple schema
              catalog          relational catalog                        tap usage DB
              provenance       (manual audit)                            data set producers/consumers
When you peel back the onion skin on a SQL query, each of the abstraction layers used in an RDBMS has an analogue (or better) in the context of Enterprise Data Workflows running on JVM clusters
ANSI SQL – JDBC driver
             import java.sql.Connection;
             import java.sql.DriverManager;
             import java.sql.ResultSet;
             import java.sql.SQLException;
             import java.sql.Statement;

             public void run() throws ClassNotFoundException, SQLException {
                 // register the Lingual JDBC driver, then point it at a directory of flat files
                 Class.forName( "cascading.lingual.jdbc.Driver" );
                 Connection connection =
                   DriverManager.getConnection( "jdbc:lingual:local;schemas=src/main/resources/data/example" );
                 Statement statement = connection.createStatement();

                 ResultSet resultSet = statement.executeQuery(
                     "select *\n"
                       + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
                       + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
                       + "on e.\"EMPID\" = s.\"CUST_ID\"" );

                 // print each row as "LABEL=value; LABEL=value; ..."
                 while( resultSet.next() ) {
                   int n = resultSet.getMetaData().getColumnCount();
                   StringBuilder builder = new StringBuilder();

                   for( int i = 1; i <= n; i++ ) {
                     builder.append( ( i > 1 ? "; " : "" )
                         + resultSet.getMetaData().getColumnLabel( i ) + "=" + resultSet.getObject( i ) );
                   }

                   System.out.println( builder );
                 }

                 resultSet.close();
                 statement.close();
                 connection.close();
             }



Note that in this example the schema for the DDL has been derived directly from the CSV files.

In other words, point the JDBC connection at a directory of flat files and query as if they were already loaded into SQL.
ANSI SQL – JDBC driver
            $ gradle clean jar
            $ hadoop jar build/libs/lingual-examples-1.0.0-wip-dev.jar
             
            CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
            CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian




                                Caveat: if you absolutely positively must have sub-second
                                SQL query response for PB-scale data on a 1000+ node
                                cluster… Good luck with that! (call the MPP vendors)
                                This ANSI SQL library is primarily intended for batch
                                workflows – high throughput, not low-latency –
                                for many under-represented use cases in Enterprise IT.
                                It’s essentially ANSI SQL as a DSL.




Cascading Meetup
              [Figure: word count flow diagram: Document Collection, Tokenize, Scrub token, HashJoin Left (Stop Word List as RHS), Regex token, GroupBy token, Count, Word Count; M and R mark the map and reduce stages]

              1. Enterprise Data Workflows
              2. ANSI SQL Support
              3. Test-Driven Development




Test-Driven Development (TDD)




                                source: Wikipedia

A general view of the TDD process.
Test-Driven Development (TDD)




                                In terms of Big Data apps, TDD is not
                                generally part of the conversation




TDD is not usually high on the list when people start discussing Big Data apps.
Traps – Cascading “exceptional data”

               •   assert patterns (regex) on the tuple streams
               •   adjust assert levels, like log4j levels
               •   define traps on branches
               •   tuples which fail asserts get trapped

               [Figure: enterprise data workflow diagram. Labeled elements: Customers, Web App, logs, Cache, Support, Modeling (PMML), Analytics Cubes, Reporting, customer profile DBs (Customer Prefs), and a Data Workflow on a Hadoop Cluster with source taps, sink taps, and a trap tap]




An innovation in Cascading was to introduce the notion of a “data exception”,
based on setting stream assertion levels as part of the business logic of an app.
Traps – example code
            // set up... 

            Pipe etlPipe = new Pipe( "etlPipe" );

            // some processing... 

            AssertMatches assertMatches = new AssertMatches( ".*true" );
            etlPipe = new Each( etlPipe, AssertionLevel.STRICT, assertMatches );
             
            // some processing... 

            FlowDef flowDef = FlowDef.flowDef().setName( "etl" )
              .addSource( etlPipe, jsonTap )
              .addTrap( etlPipe, trapTap )
              .addTailSink( etlPipe, cacheTap );
             
            if( options.has( "assert" ) )
              flowDef.setAssertionLevel( AssertionLevel.STRICT );
            else
              flowDef.setAssertionLevel( AssertionLevel.NONE );


Example use in Cascading code
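The idea behind a trap can also be shown without Cascading. What follows is a minimal, library-free sketch and not Cascading's implementation: the names `TrapSketch`, `Level`, `Result`, and `run` are hypothetical, and tuples are reduced to plain strings. It applies a regex assertion (the same `.*true` pattern as above) to each tuple and diverts the failures into a separate "trapped" list; with the level set to `NONE` the assertion is skipped, roughly mirroring the `options.has( "assert" )` switch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical, library-free illustration of "assert, then trap"; this is not Cascading code.
class TrapSketch {
  enum Level { NONE, STRICT }

  // Holds the two output streams: tuples that continue downstream, and trapped tuples.
  static final class Result {
    final List<String> passed = new ArrayList<>();
    final List<String> trapped = new ArrayList<>();
  }

  // Applies the regex assertion to every tuple; with Level.NONE the assertion is skipped entirely.
  static Result run(List<String> tuples, String regex, Level level) {
    Pattern pattern = Pattern.compile(regex);
    Result result = new Result();
    for (String tuple : tuples) {
      boolean ok = level == Level.NONE || pattern.matcher(tuple).matches();
      (ok ? result.passed : result.trapped).add(tuple);
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> tuples = Arrays.asList("id=1,valid=true", "id=2,valid=false", "id=3,valid=true");
    Result strict = run(tuples, ".*true", Level.STRICT);
    System.out.println("passed=" + strict.passed + " trapped=" + strict.trapped);
  }
}
```

The point of the design is that a failed assertion does not abort the run: the offending tuple is set aside for later inspection while the rest of the stream continues.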
Traps – redirect exceptions in production
            shunt the trapped exceptional data to other
            parts of the organization:

             •   Ops: notifications
             •   QA: investigate data anomalies
             •   Support: review customer records
             •   Finance: audit

            [Figure: enterprise data workflow diagram. Labeled elements: Customers, Web App, logs, Cache, Support, Modeling (PMML), Analytics Cubes, Reporting, customer profile DBs (Customer Prefs), and a Data Workflow on a Hadoop Cluster with source taps, sink taps, and a trap tap]




TDD – practice at scale
             1. assert expected patterns in raw input
             2. run just that, to find edge cases
             3. handle the edge cases for input data
             4. assert expected patterns after first chunk of processing
             5. run just that, to verify failure
             6. code until test passes
             7. repeat #4 for each chunk

             [Figure: flow diagram of a large app with many map (M) and reduce (R) stages. Labeled elements include GIS export, regex parsers (parse-gis, parse-tree, parse-road), Failure Traps, Tree Metadata, Road Metadata, joins, geohash steps, estimates (height, albedo, traffic), distance and moment filters, GPS logs, and the branches tree, road, shade, gps, and reco]

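The first six steps can be tried on a small in-memory sample before touching a cluster. The following is a sketch under stated assumptions, not code from the talk: the class `TddAtScaleSketch`, the methods `violations` and `scrub`, the sample records, and the expected pattern are all made up for illustration. It asserts a pattern on raw input, collects the violations as edge cases, handles them in a scrub step, then re-asserts after that first chunk of processing.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch of steps 1-6 on an in-memory sample; names and data are illustrative only.
class TddAtScaleSketch {
  // Steps 1-2: assert an expected pattern and collect the lines that violate it (the edge cases).
  static List<String> violations(List<String> lines, String regex) {
    Pattern pattern = Pattern.compile(regex);
    return lines.stream()
        .filter(line -> !pattern.matcher(line).matches())
        .collect(Collectors.toList());
  }

  // Step 3: handle the edge cases found in the input, here stray whitespace, mixed case, empty lines.
  static List<String> scrub(List<String> lines) {
    return lines.stream()
        .map(line -> line.trim().toLowerCase(Locale.ROOT))
        .filter(line -> !line.isEmpty())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> raw = Arrays.asList("oak,12", " Elm,7 ", "", "pine,9");
    String expected = "[a-z]+,\\d+";

    // Running only the assertion against raw input surfaces the edge cases.
    System.out.println("edge cases in raw input: " + violations(raw, expected));

    // Steps 4-6: assert the same pattern after the first chunk of processing; code until this is empty.
    System.out.println("violations after scrub: " + violations(scrub(raw), expected));
  }
}
```

In a real flow the violations would land in a trap rather than a list, and step 7 would repeat the assert-run-fix cycle for each later chunk of the pipeline.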
TDD – Cascalog features
             consider that TDD is about asserting and negating logical
             predicates…
               •   Cascalog is based on logical predicates
               •   function definitions as composable subqueries
               •   functions are not particularly far from being unit tests
               •   Midje: facts, mocks

               sritchie.github.com/2011/09/30/testing-cascalog-with-midje.html
               sritchie.github.com/2012/01/22/cascalog-testing-20.html




Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., comes close to using TDD as its methodology:
ad-hoc queries start out as logical predicates, and those predicates are then composed into large-scale apps.
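As a rough analogy only, the same idea can be sketched in plain Java with `java.util.function.Predicate`. This is not Cascalog or Midje; the class `PredicateSketch`, its predicates, and the stop-word list are hypothetical. Small predicates are asserted, negated, and composed into a larger query, and each one can be checked against known facts on its own, which is why such functions sit close to unit tests.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical Java analogy, not Cascalog or Midje: small predicates compose into a larger query,
// and each predicate can be checked on its own against known facts.
class PredicateSketch {
  static final Predicate<String> NON_EMPTY = token -> !token.isEmpty();
  static final Predicate<String> ALPHABETIC = token -> token.chars().allMatch(Character::isLetter);
  static final List<String> STOP_WORDS = Arrays.asList("a", "the", "of");
  static final Predicate<String> STOP_WORD = STOP_WORDS::contains;

  // Composition: assert two predicates and negate a third.
  static final Predicate<String> KEEP = NON_EMPTY.and(ALPHABETIC).and(STOP_WORD.negate());

  // The "query": apply the composed predicate to a collection of tokens.
  static List<String> query(List<String> tokens) {
    return tokens.stream().filter(KEEP).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(query(Arrays.asList("the", "rain", "", "in", "spain", "42")));
  }
}
```

Checking `KEEP` against a handful of known tokens plays the role of a fact in a test suite, while `query` is the same predicate applied at scale.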
Cascading Meetup
              [Figure: word count flow diagram: Document Collection, Tokenize, Scrub token, HashJoin Left (Stop Word List as RHS), Regex token, GroupBy token, Count, Word Count; M and R mark the map and reduce stages]

              1. Enterprise Data Workflows
              2. ANSI SQL Support
              3. Test-Driven Development
              …plus, a proposal




ANSI SQL – multiple flows



              [Figure: flow diagram of a large app with many map (M) and reduce (R) stages. Labeled elements include GIS export, regex parsers (parse-gis, parse-tree, parse-road), Failure Traps, Tree Metadata, Road Metadata, joins, geohash steps, estimates (height, albedo, traffic), distance and moment filters, GPS logs, and the branches tree, road, shade, gps, and reco]

              Suppose your organization is responsible
              for a large-scale app…
              Multiple teams develop reusable libraries…
Suppose you have an app with a complex flow diagram like this, with contributions to the business logic from different departments…
ANSI SQL – multiple flows



[Figure: workflow diagram. GIS export feeds Regex parse-gis (src), with tree, road, and Failure Traps outputs. Tree branch: Regex parse-tree, Scrub species, Join with Tree Metadata, Estimate height, Geohash, Filter height. Road branch: Regex parse-road, Join with Road Metadata, Estimate Albedo, Road Segments, Geohash, Estimate traffic. Tree and road: Join, Calculate distance, Filter distance, Sum moment, Filter sum_moment, giving shade. GPS logs: Geohash, Count gps_count, Max recent_visit, giving gps. Shade and gps: Join, giving reco.]

              Data Analysts: ANSI SQL queries
              for data prep
              (displaces Hive, etc.)
Analysts generally work with ANSI SQL queries in a data warehouse (DW), e.g., for ETL, data prep, or pulling data cubes.
These can migrate into a Cascading app to run on Hadoop.
ANSI SQL – multiple flows



[Figure: workflow diagram (GIS export, tree and road branches, failure traps, GPS logs, joined into reco).]

              Server-side Engineering: HBase tap
              for customer profiles
              (integrating other components)
Engineering provides integration with customer profiles, e.g., transactional data objects in HBase.
These can migrate into a Cascading app to run on Hadoop.
ANSI SQL – multiple flows



[Figure: workflow diagram (GIS export, tree and road branches, failure traps, GPS logs, joined into reco).]

              Ops + Support: Traps get
              routed to customer review
              (ties into notifications, etc.)
Support needs to review exceptional data, via reports/notifications.
These can migrate into a Cascading app to run on Hadoop.
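As a rough illustration of the trap idea in these notes, here is a minimal plain-Java sketch. It does not use the Cascading API; the class `TrapSketch`, its methods, and the "species,height" line format are all hypothetical, chosen only to echo the tree branch of the diagram. Records that fail to parse are set aside for review instead of failing the whole run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the Cascading API: bad records are diverted
// to a "trap" list so the rest of the flow keeps running.
class TrapSketch {

    // One successfully parsed record (assumed format: "species,height").
    static final class Tree {
        final String species;
        final double heightMeters;

        Tree(String species, double heightMeters) {
            this.species = species;
            this.heightMeters = heightMeters;
        }
    }

    // The two outputs of the step: clean records and trapped raw lines.
    static final class Result {
        final List<Tree> accepted = new ArrayList<>();
        final List<String> trapped = new ArrayList<>();
    }

    // Parses one line; throws IllegalArgumentException on malformed input.
    static Tree parse(String line) {
        String[] fields = line.split(",");
        if (fields.length != 2) {
            throw new IllegalArgumentException("expected 2 fields: " + line);
        }
        // Double.parseDouble throws NumberFormatException, which is a
        // subclass of IllegalArgumentException.
        return new Tree(fields[0].trim(), Double.parseDouble(fields[1].trim()));
    }

    // Processes every line; failures go to the trap instead of aborting.
    static Result run(List<String> lines) {
        Result result = new Result();
        for (String line : lines) {
            try {
                result.accepted.add(parse(line));
            } catch (IllegalArgumentException e) {
                result.trapped.add(line); // keep the raw input for review
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = run(Arrays.asList("oak,12.5", "bad-row", "elm,tall"));
        System.out.println("accepted=" + r.accepted.size()
                + " trapped=" + r.trapped);
    }
}
```

The trapped list is what Ops + Support would review, e.g., by feeding it into a report or notification.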
ANSI SQL – multiple flows



[Figure: workflow diagram (GIS export, tree and road branches, failure traps, GPS logs, joined into reco).]

              Data Scientists: R => PMML
              for predictive models
              (displaces SAS, etc.)
Scientists create their models in R, Weka, SAS, MicroStrategy, etc., which can export as PMML.
These can migrate into a Cascading app to run on Hadoop.
ANSI SQL – multiple flows



[Figure: workflow diagram (GIS export, tree and road branches, failure traps, GPS logs, joined into reco).]

App Engineering: Java/Scala/Clojure for business logic in data pipelines (displaces Pig, etc.)
Revenue apps generally require some custom business logic that represents the business processes of a line of business (LOB).
That logic can migrate into a Cascading app to run on Hadoop.
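
As a rough illustration of the kind of business logic these notes refer to, the plain-Java sketch below groups GPS events by geohash cell, counts them, and keeps the latest timestamp, echoing the Geohash, Count (gps_count), and Max (recent_visit) labels in the slide diagram. It deliberately does not use the Cascading API; the class names, field names, and sample values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GpsSummary {
    // Hypothetical input row: one GPS log entry already mapped to a geohash cell.
    static final class GpsEvent {
        final String geohash;
        final long timestamp;

        GpsEvent(String geohash, long timestamp) {
            this.geohash = geohash;
            this.timestamp = timestamp;
        }
    }

    // Hypothetical output row: event count and latest timestamp for one cell.
    static final class CellSummary {
        final String geohash;
        final long gpsCount;
        final long recentVisit;

        CellSummary(String geohash, long gpsCount, long recentVisit) {
            this.geohash = geohash;
            this.gpsCount = gpsCount;
            this.recentVisit = recentVisit;
        }
    }

    // Group by geohash, then count events and keep the maximum timestamp.
    static List<CellSummary> summarize(List<GpsEvent> events) {
        // TreeMap keeps keys sorted so the output order is stable.
        Map<String, long[]> acc = new TreeMap<>();
        for (GpsEvent e : events) {
            long[] a = acc.computeIfAbsent(e.geohash, k -> new long[] {0L, Long.MIN_VALUE});
            a[0] += 1;                           // running count per cell
            a[1] = Math.max(a[1], e.timestamp);  // running max per cell
        }
        List<CellSummary> out = new ArrayList<>();
        for (Map.Entry<String, long[]> entry : acc.entrySet()) {
            long[] a = entry.getValue();
            out.add(new CellSummary(entry.getKey(), a[0], a[1]));
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        List<GpsEvent> events = List.of(
            new GpsEvent("9q9jh0", 100L),
            new GpsEvent("9q9jh0", 250L),
            new GpsEvent("9q9jh1", 175L));
        for (CellSummary s : summarize(events)) {
            System.out.println(s.geohash + "\t" + s.gpsCount + "\t" + s.recentVisit);
        }
    }
}
```

In a pipeline framework the same group, count, and max steps would be expressed as workflow operations rather than a single in-memory loop; the sketch only shows the logic itself.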

Cascading meetup #4 @ BlueKai

  • 1. Cascading Meetup #4 BlueKai Cupertino, CA 2013-03-05 Copyright @2013, Concurrent, Inc. Tuesday, 05 March 13 1
  • 2. Cascading Meetup
      1. Enterprise Data Workflows
      2. ANSI SQL Support
      3. Test-Driven Development
      [Diagram: word-count flow with Document Collection, Tokenize, Scrub token, HashJoin (left) against a Stop Word List (RHS), Regex token, GroupBy token, Count, and Word Count]
  • 3. Enterprise Data Workflows
      Let’s consider an example app… at the front end
      LOB use cases drive demand for apps
      [Diagram: enterprise data workflow with Customers, Web App, logs, Cache, Support, Modeling (PMML), Analytics Cubes, customer profile DBs, Customer Prefs, and Reporting, connected through source, trap, and sink taps to a Data Workflow on a Hadoop Cluster]
      Notes: Line-of-business (LOB) use cases drive the demand for Big Data apps.
  • 4. Enterprise Data Workflows
      An example… in the back office
      Organizations have substantial investments in people, infrastructure, process
      [Diagram: same enterprise data workflow as slide 3]
      Notes: Enterprise organizations have seriously ginormous investments in existing back office practices: people, infrastructure, processes.
  • 5. Enterprise Data Workflows
      An example… for the heavy lifting!
      “Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out
      [Diagram: same enterprise data workflow as slide 3]
      Notes: “Main Street” firms have invested in Hadoop to address Big Data needs, offsetting their rising costs for Enterprise licenses from SAS, Teradata, etc.
  • 6. Two Avenues…
      Enterprise: must contend with complexity at scale every day… incumbents extend current practices and infrastructure investments (using J2EE, ANSI SQL, SAS, etc.) to migrate workflows onto Apache Hadoop while leveraging existing staff
      Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
      [Chart axes: complexity ➞, scale ➞]
      Notes: Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity.
  • 7. Two Avenues…
      Enterprise: must contend with complexity at scale every day… incumbents extend current practices and infrastructure investments (using J2EE, ANSI SQL, SAS, etc.) to migrate workflows onto Apache Hadoop while leveraging existing staff
      Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
      Callout: Hadoop almost never gets used in isolation; data workflows define the “glue” required for system integration of Enterprise apps
      [Chart axes: complexity ➞, scale ➞]
      Notes: Hadoop is almost never used in isolation. Enterprise data workflows are about system integration. There are a couple of different ways to arrive at the party.
  • 8. Cascading Meetup
      1. Enterprise Data Workflows
      2. ANSI SQL Support
      3. Test-Driven Development
      [Diagram: same word-count flow as slide 2]
  • 9. Cascading workflows – ANSI SQL
      • collab with Optiq – industry-proven code base
      • ANSI SQL parser/optimizer atop Cascading flow planner
      • JDBC driver to integrate into existing tools and app servers
      • relational catalog over a collection of unstructured data
      • SQL shell prompt to run queries
      [Diagram: same enterprise data workflow as slide 3]
      Notes: ANSI SQL as “machine code”: the lingua franca of Enterprise system integration. Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.
  • 10. Cascading workflows – ANSI SQL
      • collab with Optiq – industry-proven code base
      • ANSI SQL parser/optimizer atop Cascading flow planner
      • JDBC driver to integrate into existing tools and app servers
      • relational catalog over a collection of unstructured data
      • SQL shell prompt to run queries
      Callout: Premise: most SQL in the world gets written by machines… This isn’t a database; this is about making machine-to-machine communications simpler and more robust at scale.
      [Diagram: same enterprise data workflow as slide 3]
      Notes: ANSI SQL as “machine code”: the lingua franca of Enterprise system integration. Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.
  • 11. Cascading workflows – ANSI SQL
      • enable analysts without retraining on Hadoop, etc.
      • transparency for Support, Ops, Finance, et al.
      Callout: a language for queries – not a database, but ANSI SQL as a DSL for workflows
      [Diagram: same enterprise data workflow as slide 3]
      Notes: ANSI SQL as “machine code”: the lingua franca of Enterprise system integration. Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.
  • 12. ANSI SQL – reviews
      • Open Source 'Lingual' Helps SQL Devs Unlock Hadoop. Thor Olavsrud, 2013-02-22. cio.com/article/729283/Open_Source_Lingual_Helps_SQL_Devs_Unlock_Hadoop
      • Hadoop Apps Without MapReduce Mindsets. Adrian Bridgwater, 2013-02-28. drdobbs.com/open-source/hadoop-apps-without-mapreduce-mindsets/240149708
      • Concurrent gives old SQL users new Hadoop tricks. Jack Clark, 2013-02-20. theregister.co.uk/2013/02/20/hadoop_sql_translator_lingual_launches/
      • Concurrent Open Source Project Ties SQL to Hadoop. Michael Vizard, 2013-02-21. itbusinessedge.com/blogs/it-unmasked/concurrent-open-source-project-ties-sql-to-hadoop.html
      • Concurrent Releases Lingual, a SQL DSL for Hadoop. Boris Lublinsky, 2013-02-28. infoq.com/news/2013/02/Lingual
  • 13. ANSI SQL – CSV data in local file system
      cascading.org/lingual
      Notes: The test database for MySQL is available for download from https://launchpad.net/test-db/. Here we have a bunch o’ CSV flat files in a directory in the local file system. Use the “lingual” command line interface to overlay DDL to describe the expected table schema.
  • 14. ANSI SQL – shell prompt, catalog
      cascading.org/lingual
      Notes: Use the “lingual” SQL shell prompt to run SQL queries interactively, show the catalog, etc.
  • 15. ANSI SQL – queries
      cascading.org/lingual
      Notes: Here’s an example SQL query on that “employee” test database from MySQL.
  • 16. ANSI SQL – layers
      abstraction   | RDBMS                                   | JVM Cluster
      parser        | ANSI SQL compliant parser               | ANSI SQL compliant parser
      optimizer     | logical plan, optimized based on stats  | logical plan, optimized based on stats
      planner       | physical plan                           | API “plumbing”
      machine data  | query history, table stats              | app history, tuple stats
      topology      | b-trees, etc.                           | heterogeneous, distributed: Hadoop, IMDG, etc.
      visualization | ERD                                     | flow diagram
      schema        | table schema                            | tuple schema
      catalog       | relational catalog                      | tap usage DB
      provenance    | (manual audit)                          | data set producers/consumers
      Notes: When you peel back the onion skin on a SQL query, each of the abstraction layers used in an RDBMS has an analogue (or better) in the context of Enterprise Data Workflows running on JVM clusters.
  • 17. ANSI SQL – JDBC driver
      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.SQLException;
      import java.sql.Statement;

      public void run() throws ClassNotFoundException, SQLException {
        // load the Lingual JDBC driver and connect in local mode
        Class.forName( "cascading.lingual.jdbc.Driver" );
        Connection connection = DriverManager.getConnection(
          "jdbc:lingual:local;schemas=src/main/resources/data/example" );
        Statement statement = connection.createStatement();

        // quoted identifiers must be escaped inside the Java string literals
        ResultSet resultSet = statement.executeQuery(
          "select *\n"
          + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
          + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
          + "on e.\"EMPID\" = s.\"CUST_ID\"" );

        // print each row as label=value pairs separated by "; "
        while( resultSet.next() ) {
          int n = resultSet.getMetaData().getColumnCount();
          StringBuilder builder = new StringBuilder();

          for( int i = 1; i <= n; i++ ) {
            builder.append( ( i > 1 ? "; " : "" )
              + resultSet.getMetaData().getColumnLabel( i )
              + "=" + resultSet.getObject( i ) );
          }
          System.out.println( builder );
        }

        resultSet.close();
        statement.close();
        connection.close();
      }
      Notes: Note that in this example the schema for the DDL has been derived directly from the CSV files. In other words, point the JDBC connection at a directory of flat files and query as if they were already loaded into SQL.
  • 18. ANSI SQL – JDBC driver
      $ gradle clean jar
      $ hadoop jar build/libs/lingual-examples-1.0.0-wip-dev.jar
      CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
      CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian
      Caveat: if you absolutely positively must have sub-second SQL query response for PB-scale data on a 1000+ node cluster… Good luck with that! (call the MPP vendors)
      This ANSI SQL library is primarily intended for batch workflows (high throughput, not low latency) for many under-represented use cases in Enterprise IT. It’s essentially ANSI SQL as a DSL.
      Notes: success
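The row-printing loop on slide 17 can be pulled out into a small pure function, which makes the output format shown on slide 18 easy to check without a cluster or a JDBC driver. The following is only a sketch: the class name `RowFormatter` and the method `formatRow` are hypothetical and are not part of Lingual or the original example.

```java
import java.util.List;

// Hypothetical helper (not part of Lingual): formats one result row the way
// the slide's loop does, as label=value pairs separated by "; ".
final class RowFormatter {
    static String formatRow(List<String> labels, List<?> values) {
        if (labels.size() != values.size()) {
            throw new IllegalArgumentException("labels and values must have the same length");
        }
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < labels.size(); i++) {
            if (i > 0) {
                builder.append("; ");
            }
            builder.append(labels.get(i)).append('=').append(values.get(i));
        }
        return builder.toString();
    }
}
```

Inside the `while( resultSet.next() )` loop, the labels would come from `getColumnLabel( i )` and the values from `getObject( i )`; keeping the formatting separate means it can be unit tested on its own.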
  • 19. Cascading Meetup
      1. Enterprise Data Workflows
      2. ANSI SQL Support
      3. Test-Driven Development
      [Diagram: same word-count flow as slide 2]
  • 20. Test-Driven Development (TDD)
      [Figure: TDD process diagram; source: Wikipedia]
      Notes: A general view of the TDD process.
  • 21. Test-Driven Development (TDD)
      In terms of Big Data apps, TDD is not generally part of the conversation
      Notes: TDD is not usually high on the list when people start discussing Big Data apps.
  • 22. Traps – Cascading “exceptional data”
      • assert patterns (regex) on the tuple streams
      • adjust assert levels, like log4j levels
      • define traps on branches
      • tuples which fail asserts get trapped
      [Diagram: same enterprise data workflow as slide 3]
      Notes: An innovation in Cascading was to introduce the notion of a “data exception”, based on setting stream assertion levels as part of the business logic of an app.
  • 23. Traps – example code
      // set up...
      Pipe etlPipe = new Pipe( "etlPipe" );

      // some processing...
      AssertMatches assertMatches = new AssertMatches( ".*true" );
      etlPipe = new Each( etlPipe, AssertionLevel.STRICT, assertMatches );

      // some processing...
      FlowDef flowDef = FlowDef.flowDef().setName( "etl" )
        .addSource( etlPipe, jsonTap )
        .addTrap( etlPipe, trapTap )
        .addTailSink( etlPipe, cacheTap );

      if( options.has( "assert" ) )
        flowDef.setAssertionLevel( AssertionLevel.STRICT );
      else
        flowDef.setAssertionLevel( AssertionLevel.NONE );
      Notes: Example use in Cascading code.
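To make the trap behaviour on slides 22 and 23 concrete, here is a minimal plain-Java sketch of the idea: records that fail a regex assertion are diverted to a separate "trap" list instead of failing the whole run, and turning the assertion off lets everything through. It deliberately does not use the Cascading API. The class `TrapSketch`, its fields, and the whole-string matching rule are assumptions made for illustration, not a description of how Cascading implements `AssertMatches` or traps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java illustration of the trap idea; not the Cascading API.
final class TrapSketch {
    final List<String> passed = new ArrayList<>();
    final List<String> trapped = new ArrayList<>();

    // strict == true  : records failing the pattern go to the trap list
    // strict == false : the assertion is skipped, loosely analogous to
    //                   running with AssertionLevel.NONE
    static TrapSketch run(List<String> records, String regex, boolean strict) {
        Pattern pattern = Pattern.compile(regex);
        TrapSketch result = new TrapSketch();
        for (String record : records) {
            // Assumption of this sketch: the whole record must match.
            if (!strict || pattern.matcher(record).matches()) {
                result.passed.add(record);
            } else {
                result.trapped.add(record);
            }
        }
        return result;
    }
}
```

In a real flow the trapped records would be written to the trap tap, from where they can be routed onward for review.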
  • 24. Traps – redirect exceptions in production
      shunt the trapped exceptional data to other parts of the organization:
      • Ops: notifications
      • QA: investigate data anomalies
      • Support: review customer records
      • Finance: audit
      [Diagram: same enterprise data workflow as slide 3]
  • 25. TDD – practice at scale
      1. assert expected patterns in raw input
      2. run just that, to find edge cases
      3. handle the edge cases for input data
      4. assert expected patterns after first chunk of processing
      5. run just that, to verify failure
      6. code until test passes
      7. repeat #4 for each chunk
      [Diagram: multi-branch flow with labels including GIS export, tree, road, shade, gps logs, and reco; Regex, Scrub, Geohash, Estimate, Join, Filter, Calculate, Sum, Count, and Max operations; Tree Metadata, Road Metadata, and Road Segments; and Failure Traps]
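The numbered steps on slide 25 can be sketched in plain Java as a pair of stage-level assertions. This is an illustrative sketch only: the class `StagedAsserts`, the scrubbing rule (trim and drop blanks), and the "first chunk of processing" (lower-casing) are hypothetical stand-ins for whatever the real pipeline does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch of asserting patterns before and after a processing stage.
final class StagedAsserts {
    // Steps 1-2 and 4-5: return the records that violate the expected pattern.
    static List<String> failures(List<String> records, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> failed = new ArrayList<>();
        for (String record : records) {
            if (!pattern.matcher(record).matches()) {
                failed.add(record);
            }
        }
        return failed;
    }

    // Step 3: handle the input edge cases found above (assumed here to be
    // stray whitespace and blank records).
    static List<String> scrub(List<String> raw) {
        List<String> cleaned = new ArrayList<>();
        for (String record : raw) {
            String trimmed = record.trim();
            if (!trimmed.isEmpty()) {
                cleaned.add(trimmed);
            }
        }
        return cleaned;
    }

    // Step 6: the first chunk of processing, coded until its assertion passes.
    static List<String> normalize(List<String> scrubbed) {
        List<String> out = new ArrayList<>();
        for (String record : scrubbed) {
            out.add(record.toLowerCase(Locale.ROOT));
        }
        return out;
    }
}
```

Running `failures` on the raw input surfaces the edge cases; running it with the post-processing pattern before `normalize` exists confirms that the test fails first, and it is then re-run after each chunk is written.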
  • 26. TDD – Cascalog features
      consider that TDD is about asserting and negating logical predicates…
      • Cascalog is based on logical predicates
      • function definitions as composable subqueries
      • functions are not particularly far from being unit tests
      • Midje: facts, mocks
      sritchie.github.com/2011/09/30/testing-cascalog-with-midje.html
      sritchie.github.com/2012/01/22/cascalog-testing-20.html
      Notes: Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., nearly uses TDD as its methodology: it moves from ad-hoc queries written as logic predicates to composing those predicates into large-scale apps.
  • 27. Cascading Meetup
      1. Enterprise Data Workflows
      2. ANSI SQL Support
      3. Test-Driven Development
      …plus, a proposal
      [Diagram: same word-count flow as slide 2]
  • 28. ANSI SQL – multiple flows
      Suppose your organization is responsible for a large-scale app… Multiple teams develop reusable libraries…
      [Diagram: same multi-branch flow as slide 25]
      Notes: Suppose you have an app with a complex flow diagram like this, with contributions to the business logic from different departments…
  • 29. ANSI SQL – multiple flows
      Data Analysts: ANSI SQL queries for data prep (displaces Hive, etc.)
      [Diagram: same multi-branch flow as slide 25]
      Notes: Analysts are generally working with ANSI SQL queries in a DW, e.g., for ETL, data prep, pulling data cubes. These can migrate into a Cascading app to run on Hadoop.
  • 30. ANSI SQL – multiple flows
      Server-side Engineering: HBase tap for customer profiles (integrating other components)
      [Diagram: same multi-branch flow as slide 25]
      Notes: Engineering provides integration with customer profiles, e.g., transactional data objects in HBase. These can migrate into a Cascading app to run on Hadoop.
  • 31. ANSI SQL – multiple flows
      Ops + Support: Traps get routed to customer review (ties into notifications, etc.)
      [Diagram: same multi-branch flow as slide 25]
      Notes: Support needs to review exceptional data, via reports/notifications. These can migrate into a Cascading app to run on Hadoop.
  • 32. ANSI SQL – multiple flows
      Data Scientists: R => PMML for predictive models (displaces SAS, etc.)
      [Diagram: same multi-branch flow as slide 25]
      Notes: Scientists perform their model creation work in R, Weka, SAS, Microstrategy, etc., which can export as PMML. These can migrate into a Cascading app to run on Hadoop.
  • 33. ANSI SQL – multiple flows
      App Engineering: Java/Scala/Clojure for business logic in data pipelines (displaces Pig, etc.)
      [Diagram: same multi-branch flow as slide 25]
      Notes: Generally the revenue apps require some custom business logic, representing the business process for the LOB. These can migrate into a Cascading app to run on Hadoop.