Cascading meetup #4 @ BlueKai
1. Cascading Meetup #4
BlueKai
Cupertino, CA
2013-03-05
Copyright © 2013, Concurrent, Inc.
Tuesday, 05 March 13 1
2. Cascading Meetup
[Diagram: word count flow. Labels: Document Collection, Tokenize, Scrub token, HashJoin Left, Stop Word List (RHS), Regex token, GroupBy token, Count, Word Count; M and R mark map and reduce steps]
1. Enterprise Data Workflows
2. ANSI SQL Support
3. Test-Driven Development
3. Enterprise Data Workflows
Let’s consider an example app… at the front end
LOB use cases drive demand for apps
[Diagram: example enterprise data workflow. Labels: Customers, Web App, Cache, Logs, Support, source/sink/trap taps, Data Workflow, Modeling (PMML), Analytics Cubes, customer profile DBs, Customer Prefs, Reporting, Hadoop Cluster]
LOB use cases drive the demand for Big Data apps
4. Enterprise Data Workflows
An example… in the back office
Organizations have substantial investments in people, infrastructure, process
Enterprise organizations have seriously ginormous investments in existing back office practices:
people, infrastructure, processes
5. Enterprise Data Workflows
An example… for the heavy lifting!
“Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out
“Main Street” firms have invested in Hadoop to address Big Data needs,
off-setting their rising costs for Enterprise licenses from SAS, Teradata, etc.
6. Two Avenues…
Enterprise: must contend with complexity at scale every day…
incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff
Start-ups: crave complexity and scale to become viable…
new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
[Chart axes: complexity, scale]
Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity
7. Two Avenues…
Hadoop almost never gets used in isolation; data workflows define the “glue” required for system integration of Enterprise apps
Hadoop is almost never used in isolation.
Enterprise data workflows are about system integration.
There are a couple different ways to arrive at the party.
8. Cascading Meetup
1. Enterprise Data Workflows
2. ANSI SQL Support
3. Test-Driven Development
9. Cascading workflows – ANSI SQL
• collab with Optiq – industry-proven code base
• ANSI SQL parser/optimizer atop Cascading flow planner
• JDBC driver to integrate into existing tools and app servers
• relational catalog over a collection of unstructured data
• SQL shell prompt to run queries
ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration.
Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.
10. Cascading workflows – ANSI SQL
Premise: most SQL in the world gets written by machines…
This isn’t a database; this is about making machine-to-machine communications simpler and more robust at scale.
11. Cascading workflows – ANSI SQL
• enable analysts without retraining on Hadoop, etc.
• transparency for Support, Ops, Finance, et al.
a language for queries – not a database, but ANSI SQL as a DSL for workflows
12. ANSI SQL – reviews
Open Source 'Lingual' Helps SQL Devs Unlock Hadoop
Thor Olavsrud, 2013-02-22
cio.com/article/729283/Open_Source_Lingual_Helps_SQL_Devs_Unlock_Hadoop
Hadoop Apps Without MapReduce Mindsets
Adrian Bridgwater, 2013-02-28
drdobbs.com/open-source/hadoop-apps-without-mapreduce-mindsets/240149708
Concurrent gives old SQL users new Hadoop tricks
Jack Clark, 2013-02-20
theregister.co.uk/2013/02/20/hadoop_sql_translator_lingual_launches/
Concurrent Open Source Project Ties SQL to Hadoop
Michael Vizard, 2013-02-21
itbusinessedge.com/blogs/it-unmasked/concurrent-open-source-project-ties-sql-to-hadoop.html
Concurrent Releases Lingual, a SQL DSL for Hadoop
Boris Lublinsky, 2013-02-28
infoq.com/news/2013/02/Lingual
13. ANSI SQL – CSV data in local file system
cascading.org/lingual
The test database for MySQL is available for download from https://launchpad.net/test-db/
Here we have a bunch o’ CSV flat files in a directory in the local file system.
Use the “lingual” command line interface to overlay DDL to describe the expected table schema.
14. ANSI SQL – shell prompt, catalog
cascading.org/lingual
Use the “lingual” SQL shell prompt to run SQL queries interactively, show catalog, etc.
15. ANSI SQL – queries
cascading.org/lingual
Here’s an example SQL query on that “employee” test database from MySQL.
16. ANSI SQL – layers
abstraction    RDBMS                                    JVM Cluster
parser         ANSI SQL compliant parser                ANSI SQL compliant parser
optimizer      logical plan, optimized based on stats   logical plan, optimized based on stats
planner        physical plan                            API “plumbing”
machine data   query history, table stats               app history, tuple stats
topology       b-trees, etc.                            heterogenous, distributed: Hadoop, IMDG, etc.
visualization  ERD                                      flow diagram
schema         table schema                             tuple schema
catalog        relational catalog                       tap usage DB
provenance     (manual audit)                           data set producers/consumers
When you peel back the onion skin on a SQL query, each of the abstraction layers used in an RDBMS has an analogue (or better) in the context of Enterprise Data Workflows running on JVM clusters
17. ANSI SQL – JDBC driver
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public void run() throws ClassNotFoundException, SQLException {
  // load the Lingual JDBC driver
  Class.forName( "cascading.lingual.jdbc.Driver" );
  // connect in local mode; the schema comes from a directory of flat files
  Connection connection =
    DriverManager.getConnection( "jdbc:lingual:local;schemas=src/main/resources/data/example" );
  Statement statement = connection.createStatement();

  ResultSet resultSet = statement.executeQuery(
    "select *\n"
      + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
      + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
      + "on e.\"EMPID\" = s.\"CUST_ID\"" );

  // print each row as "LABEL=value; LABEL=value; ..."
  while( resultSet.next() ) {
    int n = resultSet.getMetaData().getColumnCount();
    StringBuilder builder = new StringBuilder();
    for( int i = 1; i <= n; i++ ) {
      builder.append( ( i > 1 ? "; " : "" )
        + resultSet.getMetaData().getColumnLabel( i ) + "=" + resultSet.getObject( i ) );
    }
    System.out.println( builder );
  }

  resultSet.close();
  statement.close();
  connection.close();
}
Note that in this example the schema for the DDL has been derived directly from the CSV files.
In other words, point the JDBC connection at a directory of flat files and query as if they were already loaded into SQL.
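To make the idea of deriving a table schema from a flat file more concrete, here is a naive plain-Java sketch. It is not how Lingual does it: the class name, the method names, and the rule of guessing column types from the first data row are all assumptions made for illustration. The column names and sample values are taken from the example output on the next slide.

```java
import java.util.ArrayList;
import java.util.List;

class CsvSchemaSketch {
    /** Guess a SQL type from one sample value: INTEGER if it parses as an int, else VARCHAR. */
    static String guessType(String sample) {
        try {
            Integer.parseInt(sample.trim());
            return "INTEGER";
        } catch (NumberFormatException e) {
            return "VARCHAR";
        }
    }

    /** Build a CREATE TABLE statement from a CSV header line and the first data row. */
    static String deriveDdl(String table, String headerLine, String firstRow) {
        String[] names = headerLine.split(",");
        String[] samples = firstRow.split(",");
        List<String> columns = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            // a missing sample falls back to VARCHAR (hypothetical rule)
            String sample = i < samples.length ? samples[i] : "";
            columns.add("\"" + names[i].trim() + "\" " + guessType(sample));
        }
        return "CREATE TABLE \"" + table + "\" (" + String.join(", ", columns) + ")";
    }

    public static void main(String[] args) {
        System.out.println(deriveDdl("EMPLOYEE", "EMPID,NAME", "100,Bill"));
    }
}
```

A real implementation would need quoted-field handling and a richer type vocabulary; the point here is only that a header row plus sample data is enough to propose DDL.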
18. ANSI SQL – JDBC driver
$ gradle clean jar
$ hadoop jar build/libs/lingual-examples-1.0.0-wip-dev.jar
CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian
Caveat: if you absolutely positively must have sub-second
SQL query response for PB-scale data on a 1000+ node
cluster… Good luck with that! (call the MPP vendors)
This ANSI SQL library is primarily intended for batch
workflows – high throughput, not low-latency –
for many under-represented use cases in Enterprise IT.
It’s essentially ANSI SQL as a DSL.
19. Cascading Meetup
1. Enterprise Data Workflows
2. ANSI SQL Support
3. Test-Driven Development
21. Test-Driven Development (TDD)
In terms of Big Data apps, TDD is not
generally part of the conversation
TDD is not usually high on the list when people start discussing Big Data apps.
22. Traps – Cascading “exceptional data”
• assert patterns (regex) on the tuple streams
• adjust assert levels, like log4j levels
• define traps on branches
• tuples which fail asserts get trapped
An innovation in Cascading was to introduce the notion of a “data exception”,
based on setting stream assertion levels as part of the business logic of an app.
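As a rough illustration of how assertion levels and traps interact, the plain-Java sketch below routes tuples that fail a regex assertion into a trap list, and skips the assertion entirely when the flow runs at a lower level. It does not use the Cascading API: `Level`, `Result`, and `run` are hypothetical names, and the level ordering (NONE, VALID, STRICT) is modeled loosely on logging levels.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class TrapSketch {
    /** Ordered like logging levels: a flow run at VALID skips STRICT assertions. */
    enum Level { NONE, VALID, STRICT }

    /** Tuples that continue downstream, and tuples diverted to the trap. */
    static final class Result {
        final List<String> passed = new ArrayList<>();
        final List<String> trapped = new ArrayList<>();
    }

    /**
     * Apply a regex assertion declared at assertLevel to every tuple.
     * The assertion is active only if the flow level is at least assertLevel.
     */
    static Result run(List<String> tuples, Pattern expected, Level assertLevel, Level flowLevel) {
        boolean active = assertLevel != Level.NONE && flowLevel.compareTo(assertLevel) >= 0;
        Result result = new Result();
        for (String tuple : tuples) {
            if (active && !expected.matcher(tuple).matches()) {
                result.trapped.add(tuple);   // "exceptional data" goes to the trap
            } else {
                result.passed.add(tuple);    // everything else flows on unchanged
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tuples = Arrays.asList("a\ttrue", "b\tfalse", "c\ttrue");
        Pattern expected = Pattern.compile(".*true");
        Result strict = run(tuples, expected, Level.STRICT, Level.STRICT);
        Result relaxed = run(tuples, expected, Level.STRICT, Level.NONE);
        System.out.println("strict: passed=" + strict.passed.size() + " trapped=" + strict.trapped.size());
        System.out.println("none:   passed=" + relaxed.passed.size() + " trapped=" + relaxed.trapped.size());
    }
}
```

The design point is that the same business logic runs in test and production; only the flow-level setting decides whether the assertions cost anything at runtime.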
23. Traps – example code
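import cascading.flow.FlowDef;
import cascading.operation.AssertionLevel;
import cascading.operation.assertion.AssertMatches;
import cascading.pipe.Each;
import cascading.pipe.Pipe;

// set up...
Pipe etlPipe = new Pipe( "etlPipe" );

// some processing...

// assert that each tuple matches the regex; failures become "exceptional data"
AssertMatches assertMatches = new AssertMatches( ".*true" );
etlPipe = new Each( etlPipe, AssertionLevel.STRICT, assertMatches );

// some processing...

// jsonTap, trapTap, and cacheTap are defined elsewhere in the app
FlowDef flowDef = FlowDef.flowDef().setName( "etl" )
  .addSource( etlPipe, jsonTap )
  .addTrap( etlPipe, trapTap )
  .addTailSink( etlPipe, cacheTap );

// enable assertions only when the "assert" option is given
if( options.has( "assert" ) )
  flowDef.setAssertionLevel( AssertionLevel.STRICT );
else
  flowDef.setAssertionLevel( AssertionLevel.NONE );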
Example use in Cascading code
24. Traps – redirect exceptions in production
shunt the trapped exceptional data to other
parts of the organization:
• Ops: notifications
• QA: investigate data anomalies
• Support: review customer records
• Finance: audit
25. TDD – practice at scale
1. assert expected patterns in raw input
2. run just that, to find edge cases
3. handle the edge cases for input data
4. assert expected patterns after first chunk of processing
5. run just that, to verify failure
6. code until test passes
7. repeat #4 for each chunk
[Diagram: flow diagram of an example app. Labels include GIS export, parse-gis, parse-tree, Tree Metadata, Road Metadata, parse-road, Road Segments, Estimate height/traffic/Albedo, Geohash, gps logs, gps_count, recent_visit, Failure Traps, and reco; M and R mark map and reduce steps]
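The numbered practice above can be sketched in plain Java, independent of any Cascading API. The sample records, the two regex patterns, and the `scrub` step are all hypothetical; the sketch only shows the rhythm of asserting on raw input first, then asserting again after the first chunk of processing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

class StagedAssertSketch {
    /** Return the records that violate the expected pattern (the edge cases to handle). */
    static List<String> violations(List<String> records, Pattern expected) {
        List<String> bad = new ArrayList<>();
        for (String record : records) {
            if (!expected.matcher(record).matches()) {
                bad.add(record);
            }
        }
        return bad;
    }

    /** Hypothetical first chunk of processing: trim whitespace and lower-case each record. */
    static List<String> scrub(List<String> records) {
        List<String> out = new ArrayList<>();
        for (String record : records) {
            out.add(record.trim().toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("Oak ", " elm", "PINE", "??");

        // Steps 1-3: assert on raw input, run only that, and set the edge cases aside.
        Pattern rawPattern = Pattern.compile("\\s*[A-Za-z]+\\s*");
        List<String> edgeCases = violations(raw, rawPattern);
        List<String> clean = new ArrayList<>(raw);
        clean.removeAll(edgeCases);

        // Steps 4-6: assert on the output of the first chunk of processing.
        Pattern scrubbedPattern = Pattern.compile("[a-z]+");
        List<String> failures = violations(scrub(clean), scrubbedPattern);

        System.out.println("edge cases: " + edgeCases.size()
            + ", failures after scrub: " + failures.size());
    }
}
```

Step 7 then repeats the second half for each later stage, with a new pattern describing what that stage's output should look like.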
26. TDD – Cascalog features
consider that TDD is about asserting and negating logical
predicates…
• Cascalog is based on logical predicates
• function definitions as composable subqueries
• functions are not particularly far from being unit tests
• Midje: facts, mocks
sritchie.github.com/2011/09/30/testing-cascalog-with-midje.html
sritchie.github.com/2012/01/22/cascalog-testing-20.html
Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., comes close to using TDD as its methodology:
developers start with ad-hoc queries expressed as logical predicates, then compose those predicates into large-scale apps.
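Cascalog itself is Clojure, but the underlying idea of composing small predicates and then asserting or negating them can be shown with a loose Java analogy. This sketch uses only `java.util.function.Predicate`; the predicate names are hypothetical and nothing here is Cascalog or Midje syntax.

```java
import java.util.function.Predicate;

class PredicateSketch {
    // Two small predicates, each trivial to test on its own.
    static final Predicate<String> NON_EMPTY = s -> !s.isEmpty();
    static final Predicate<String> ALPHABETIC = s -> s.chars().allMatch(Character::isLetter);

    // Compose them into a larger predicate, the way subqueries compose.
    static final Predicate<String> IS_TOKEN = NON_EMPTY.and(ALPHABETIC);

    public static void main(String[] args) {
        // Each check asserts or negates a predicate, which is also the shape of a unit test.
        System.out.println(IS_TOKEN.test("hadoop"));
        System.out.println(IS_TOKEN.negate().test("42"));
    }
}
```

Because the composed predicate is built from parts that were already checked in isolation, the tests for the parts double as documentation for the whole.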
27. Cascading Meetup
1. Enterprise Data Workflows
2. ANSI SQL Support
3. Test-Driven Development
…plus, a proposal
28. ANSI SQL – multiple flows
Suppose your organization is responsible
for a large-scale app…
Multiple teams develop reusable libraries…
Suppose you have an app with a complex flow diagram like this, with contributions to the business logic from different departments…
29. ANSI SQL – multiple flows
Data Analysts: ANSI SQL queries
for data prep
(displaces Hive, etc.)
Analysts are generally working with ANSI SQL queries in a DW, e.g., for ETL, data prep, pulling data cubes.
These can migrate into a Cascading app to run on Hadoop.
30. ANSI SQL – multiple flows
Server-side Engineering: HBase tap
for customer profiles
(integrating other components)
Engineering provides integration with customer profiles, e.g., transactional data objects in HBase.
These can migrate into a Cascading app to run on Hadoop.
31. ANSI SQL – multiple flows
Ops + Support: Traps get
routed to customer review
(ties into notifications, etc.)
Support needs to review exceptional data, via reports/notifications.
These can migrate into a Cascading app to run on Hadoop.
32. ANSI SQL – multiple flows
Data Scientists: R => PMML
for predictive models
(displaces SAS, etc.)
Scientists perform their model creation work in R, Weka, SAS, Microstrategy, etc., which can export as PMML.
These can migrate into a Cascading app to run on Hadoop.
33. ANSI SQL – multiple flows
App Engineering: Java/Scala/Clojure
for business logic in data pipelines
(displaces Pig, etc.)
Generally the revenue apps require some custom business logic -- representing business process for LOB.
These can migrate into a Cascading app to run on Hadoop.