Big Data Warehousing Meetup
December 10, 2013

Real-time Trade Data Monitoring
with Storm & Cassandra
Agenda
7:00  Networking
      Grab a slice of pizza and a drink...
7:15  Welcome & Intro: About the Meetup and about Caserta Concepts
      Joe Caserta
      President, Caserta Concepts
      Author, Data Warehouse ETL Toolkit
7:30  Cassandra
      Elliott Cordo
      Chief Architect, Caserta Concepts
8:00  Storm
      Noel Vega
      Consultant, Caserta Concepts
      Consultant, Dimension Data, LLC
8:30 - 9:00  Q&A / More Networking
About the BDW Meetup
• Big Data is a complex, rapidly changing landscape
• We want to share our stories and hear about yours
• Great networking opportunity for like-minded data nerds
• Opportunities to collaborate on exciting projects
• Founded by Caserta Concepts: Big Data Analytics, DW & BI Consulting
• Next BDW Meetup: JANUARY 20
About Caserta Concepts
Focused Expertise
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Strategic Data Ecosystems

Industries Served
• Financial Services
• Healthcare / Insurance
• Retail / eCommerce
• Digital Media / Marketing
• K-12 / Higher Education

Founded in 2001
• President: Joe Caserta, industry thought leader, consultant, educator and co-author of The Data Warehouse ETL Toolkit (Wiley, 2004)
Caserta Concepts
Listed as one of the 20 Most Promising
Data Analytics Consulting Companies

CIOReview looked at hundreds of data analytics consulting companies and shortlisted
the ones who are at the forefront of tackling the real analytics challenges.
A distinguished panel comprising CEOs, CIOs, VCs, industry analysts and the editorial
board of CIOReview selected the Final 20.
Expertise & Offerings
Strategic Roadmap/
Assessment/Consulting/
Implementation

Big Data
Analytics

Data Warehousing/
ETL/Data Integration

BI/Visualization/
Analytics
Client Portfolio
Finance
& Insurance

Retail/eCommerce
& Manufacturing

Education
& Services
We are hiring
Does this word cloud excite you?

Speak with us about our open positions: jobs@casertaconcepts.com
Why talk about Storm & Cassandra?
[Diagram: Traditional BI (ERP, Finance and Legacy sources; ETL; Traditional EDW; Ad-Hoc/Canned Reporting) next to Big Data BI (Big Data Cluster with a NoSQL Database, Storm, Mahout, MapReduce and Pig/Hive on nodes N1 to N5 over the Hadoop Distributed File System (HDFS); Data Analytics; Data Science). Caption: Horizontally Scalable Environment - Optimized for Analytics]
What is Storm
• Distributed Event Processor
• Real-time data ingestion and dissemination
• In-Stream ETL
• Reliably process unbounded streams of data
• Storm is fast: benchmarked at over a million tuples per second per node
• It is scalable and fault-tolerant, and guarantees your data will be processed
• Preferred technology for real-time big data processing by organizations worldwide:
• Partial list at https://github.com/nathanmarz/storm/wiki/Powered-By
• Incubator:
• http://wiki.apache.org/incubator/StormProposal
Components of Storm
• Spout – Collects data from upstream feeds and submits it for processing
• Tuple – A collection of data that is passed within Storm
• Bolt – Processes tuples (Transformations)
• Stream – Identifies outputs from Spouts/Bolts

• Storm usually outputs to a NoSQL database
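To make the four terms concrete, here is a minimal sketch in plain Java. It does not use the Storm API: `MiniTopology`, `spout`, `bolt` and `run` are hypothetical names invented for illustration, and a "tuple" is assumed to be nothing more than an ordered list of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-ins for the Storm concepts above; this is NOT the Storm API.
class MiniTopology {
    // "Spout": reads raw upstream lines and emits one tuple (a list of values) per line.
    static List<List<String>> spout(List<String> upstreamLines) {
        List<List<String>> stream = new ArrayList<>();
        for (String line : upstreamLines) {
            stream.add(Arrays.asList(line.split(",")));
        }
        return stream;
    }

    // "Bolt": applies a transformation to every tuple of an input stream
    // and emits the results as a new stream.
    static List<List<String>> bolt(List<List<String>> in,
                                   Function<List<String>, List<String>> transform) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> tuple : in) {
            out.add(transform.apply(tuple));
        }
        return out;
    }

    // Wires one spout to one bolt: upper-case the ticker (field 0), keep the quantity (field 1).
    static List<List<String>> run(List<String> upstreamLines) {
        return bolt(spout(upstreamLines),
                    t -> Arrays.asList(t.get(0).toUpperCase(), t.get(1)));
    }
}
```

In a real topology the final bolt would write its output tuples to the NoSQL store instead of returning them.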
Why NoSQL?
• Performance:
• Relational databases have a lot of features and overhead that we don’t
need in many cases. Although we will miss some…
• Scalability:
• Most relational databases scale vertically, which limits how
large they can get. Federation and sharding are awkward manual
processes.
• Agile
• Sparse data / data with a lot of variation
• Most NoSQL databases scale horizontally on commodity hardware
What is Cassandra?
• Column families are the equivalent of a table in an RDBMS
• The primary unit of storage is a column; columns are stored contiguously
Skinny Rows: Most like a relational database, except that
columns are optional and not stored if omitted.

Wide Rows: Rows can be billions of columns wide; used
for time series, relationships, secondary indexes.
REAL TIME TRADE DATA MONITORING
Elliott Cordo
Chief Architect, Caserta Concepts
The Use Case
• Trade data (orders and executions)
• High volume of incoming data
• 500 thousand records per second
• 12 billion messages per day
• Required that data be aggregated and monitored in real time (end-to-end latency measured in 100's of ms)
• Both raw messages and analytics stored, persisted to a database
The Data
• Primarily FIX messages: Financial Information Exchange
• Established in the early 90's as a standard for trade data communication; widely used throughout the industry
• Basically a delimited file of variable attribute-value pairs
• Looks something like this:
8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 |
11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 |
44=15 | 58=PHLX EQUITY TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 |
10=128 |

• A single trade can comprise 1000's of such messages, although typical trades have about a dozen
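As a rough illustration of how such a message can be split into its tag/value pairs, the sketch below parses the pipe-delimited form printed above. `FixParserSketch` and `parse` are hypothetical names; on the wire FIX separates fields with the SOH character (0x01) rather than " | ", so a production parser would split on that instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, assuming the " | "-delimited rendering used on the slide.
class FixParserSketch {
    // Returns tag -> value in message order; only tags actually present appear in the map.
    static Map<String, String> parse(String message) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : message.split("\\|")) {
            String trimmed = pair.trim();
            if (trimmed.isEmpty()) {
                continue; // blank segment between or after delimiters
            }
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                continue; // skip malformed segments that have no tag
            }
            fields.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
        }
        return fields;
    }
}
```

For the sample message, `parse(...).get("55")` yields the symbol (MSFT) and `get("35")` the message type (8, an execution report).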
Additional Requirements
• Linearly scalable
• Highly available: no single point of failure, quick recovery
• Quicker time to benefit
• Processing guarantees: NO DATA IS LOST!
Some Sample Analytic Use Cases
• Sum(Notional volume) by Ticker: Daily, Hourly, Minute
• Average trade latency (Execution TS – Order TS)
• Wash Sales (sell within x seconds of last buy) for same Client/Ticker
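The wash-sale rule above can be sketched as a single pass over time-ordered executions. Everything here is an assumption made for illustration: the `Execution` record shape, the millisecond timestamps, and the idea that input arrives already sorted by time. It is not the monitoring logic used in the actual system.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "sell within x seconds of last buy for same Client/Ticker".
class WashSaleSketch {
    static final class Execution {
        final String client;
        final String ticker;
        final boolean isBuy;
        final long epochMillis;

        Execution(String client, String ticker, boolean isBuy, long epochMillis) {
            this.client = client;
            this.ticker = ticker;
            this.isBuy = isBuy;
            this.epochMillis = epochMillis;
        }
    }

    // Returns the sells that occur within windowMillis of the most recent buy
    // for the same client and ticker. Assumes executions are ordered by time.
    static List<Execution> findWashSales(List<Execution> executions, long windowMillis) {
        Map<String, Long> lastBuyByKey = new HashMap<>();
        List<Execution> flagged = new ArrayList<>();
        for (Execution e : executions) {
            String key = e.client + "|" + e.ticker;
            if (e.isBuy) {
                lastBuyByKey.put(key, e.epochMillis);
            } else {
                Long lastBuy = lastBuyByKey.get(key);
                if (lastBuy != null && e.epochMillis - lastBuy <= windowMillis) {
                    flagged.add(e);
                }
            }
        }
        return flagged;
    }
}
```

The per-key map is exactly the kind of in-memory aggregated state that has to survive a task restart, which is the subject of the Storm deep dive later in the deck.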
How has this system traditionally been handled?
• Typically by manually partitioning the application: having a number of independent systems and databases “dividing” the problem
[Diagram: Message Queue; Use Case 1, Partition A with Database A; Use Case 1, Partition B with Database B; Use Case 2, All Partitions with Database C]

Main issues:
• Growth requires changing these systems to accept the new
partitioning scheme: Development!
• A lot of different applications replicating complex architecture, tons of
boilerplate code
• Performing analysis across the partitioning schemes is very difficult
Need to Establish a Platform as a Service
Architecture
[Diagram labels: Sensor Data, Storm Cluster, Atomic data, Aggregates, Event Monitors, d3.js Analytics, Low Latency Analytics]

• Redis queue is used for ingestion
• Storm is used for real-time ETL and outputs atomic data
and derived data needed for analytics
• Redis is used as a reference data lookup cache and for
state
• Real-time analytics are produced from the aggregated
data
• Higher latency ad-hoc analytics are done in Hadoop
using Pig and Hive
Deeper Dive: Cassandra as an Analytic
Database
• Based on a blend of Dynamo and BigTable
• Distributed, master-less
• Super fast writes: can ingest lots of data!
• Very fast reads

Why did we choose it:
• Data throughput requirements
• High availability
• Simple expansion
• Interesting data models for time series data (more on this later)
Design Practices
• Cassandra does not support aggregation or joins: the data model must be tuned to usage
• Denormalize your data (flatten your primary dimensional attributes into your fact)
• Storing the same data redundantly is OK

Might sound weird, but we've been doing this all along
in the traditional world: modeling our data to make
analytic queries simple!
Wide rows are our friends
• Cassandra composite columns are powerful for analytic models
• Facilitate multi-dimensional analysis
• A wide row table may have N number of rows, and a
variable number of columns (millions of columns)

          20130101  20130102  20130103  20130104  20130104  20130105  …
ClientA      10003      9493     43143     45553     54553     34343  …
ClientB      45453     34313     54543     23233      4233     34423  …
ClientC       3323     35313     43123     54543     43433      4343  …
…                …         …         …         …         …         …  …

• And now with CQL3 we have “unpacked” wide rows into named columns: easy to work with!
More about wide rows!
• The left-most column is the ROW KEY
• It is the mechanism by which the row is distributed across the Cassandra cluster…
• Care must be taken to prevent hot spots: dates, for example, are generally not good candidates because all load will go to a given set of servers on a particular day!
• Data can be filtered using equality and “in” clauses

          20130101  20130102  20130103  20130104  20130104  20130105  …
ClientA      10003      9493     43143     45553     54553     34343  …
ClientB      45453     34313     54543     23233      4233     34423  …
ClientC       3323     35313     43123     54543     43433      4343  …
…                …         …         …         …         …         …  …

Create table Client_Daily_Summary (
Client text,
Date_ID int,
Trade_Count int,
Primary key (Client, Date_ID));

• The top row is the COLUMN KEY
• There can be a variable number of columns
• It is acceptable to have millions or even billions of columns in a table
• Column keys are sorted and can accept a range query (greater than / less than)
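One way to picture the access pattern described above is as a map of sorted maps: equality lookup on the row key, range scan on the sorted column keys. The sketch below is only a mental model in plain Java, not how Cassandra stores data and not a Cassandra client; `WideRowSketch`, `put` and `range` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of a wide row table:
// row key (Client) -> sorted column keys (Date_ID) -> value (Trade_Count).
class WideRowSketch {
    private final Map<String, NavigableMap<Integer, Integer>> rows = new HashMap<>();

    void put(String client, int dateId, int tradeCount) {
        rows.computeIfAbsent(client, k -> new TreeMap<>()).put(dateId, tradeCount);
    }

    // Equality on the row key, inclusive range on the column key. This mirrors
    // "where Client = ? and Date_ID >= ? and Date_ID <= ?".
    NavigableMap<Integer, Integer> range(String client, int fromDateId, int toDateId) {
        NavigableMap<Integer, Integer> row = rows.get(client);
        if (row == null) {
            return new TreeMap<>();
        }
        return row.subMap(fromDateId, true, toDateId, true);
    }
}
```

The model also shows why a date makes a poor row key: every write for a given day would land in the same outer-map entry, which in a cluster means the same set of servers.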
Traditional Cassandra Analytic Model
If we wanted to track trade count by day and by hour, we could
stream our ETL to two (or more) summary fact tables

          20130101  20130102  20130103  20130104  20130104  20130105
ClientA      10003      9493     43143     45553     54553     34343
ClientB      45453     34313     54543     23233      4233     34423
ClientC       3323     35313     43123     54543     43433      4343

Sample analytic query: Give me daily trade counts for ClientA between Jan 1 and Jan 3:
Select Date_ID, Trade_Count from Client_Daily_Summary
where Client='ClientA' and Date_ID >= 20130101 and Date_ID <= 20130103

                   0900  1000  1100  1200  1300  1400
ClientA|20131101   1000   949  4314  4555  5455  3434
ClientA|20131102   4545  3431  5454  2323   423  3442
ClientB|20131101    332  3531  4312  5454  4343   434

Sample analytic query: Give me hourly trade counts for ClientA on 20131101 between 9 and 11 AM:
Select Hour, Trade_Count from Client_Hourly_Summary
where Client_Date='ClientA|20131101' and Hour >= 900 and Hour <= 1100
But there are other methods too
• Assuming some level of client-side aggregation (and additive measures), we could also further unpack and leverage column keys using CQL 3. A slightly different use case:
Create table Client_Ticker_Summary (
Client text,
Date_ID int,
Ticker text,
Trade_Count int,
Notional_Volume float,
Primary Key (Client, Date_ID, Ticker))

The first column in the PK definition
is the Row Key, aka Partition Key

Look at all this flexible SQL goodness:
select * from Client_Ticker_Summary
where Client in ('ClientA','ClientB')
select * from Client_Ticker_Summary
where Client in ('ClientA','ClientB') and Date_ID >= 20130101 and Date_ID <= 20130103
select * from Client_Ticker_Summary
where Client = 'ClientA' and Date_ID >= 20130101 and Date_ID <= 20130103
select * from Client_Ticker_Summary
where Client = 'ClientA' and Date_ID = 20130101 and Ticker in ('AAPL','GE','PG')
ALSO possible, but not recommended!
select * from Client_Ticker_Summary
where Date_ID > 20120101 allow filtering;
select * from Client_Ticker_Summary
where Date_ID = 20120101 and Ticker in ('AAPL','GE') allow filtering;
Storing the Atomic data
8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 |
20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING |
59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 |

• We must land all atomic data:
• Persistence
• Future replay (new metrics, corrections)
• Drill down capabilities/auditability

• The sparse nature of the FIX data fits the Cassandra data model very well.
• We will store only the tags which are actually present in the data, saving space. A few approaches, depending on usage pattern:
Create table Trades_Skinny(
OrderID text Primary Key,
Date_ID int,
Ticker text,
Client text,
…Many more columns)
Create index ix_Date_ID on
Trades_Skinny (Date_ID)

Create table Trades_Wide(
Order_ID text,
Tag text,
Value text,
Primary key (Order_ID, Tag))

Create table Trades_Map(
OrderID text Primary Key,
Date_ID int,
Ticker text,
Client text,
Tags map <text, text>)
Create index ix_Date_ID on
Trades_Map (Date_ID)
Big data solutions usually employ multiple DB types
Some considerations:
• Size type requirements:
  • Volume: a disk space requirement.
  • Velocity: a message rate requirement.
• Data-structure & query pattern complexity: simple K/V pair -vs- relational -vs- …
• C.A.P. theorem alignment: which two of the three does your use case benefit from?
• Value-add features:
  • API: (Interface: e.g. HTTP ReST -vs- client classes). (Power: e.g. mget, incrementBy).
  • Replication and/or H/A support. (B.C./D.R.)
  • Support for data processing patterns (e.g. Riak has Map/Reduce; Redis zSets have Top-N)
  • Transaction support (Redis: Multi; Command list; Exec).
  • and so on.
Contact

Elliott Cordo
Principal Consultant, Caserta
Concepts
P: (855) 755-2246 x267
E: elliott@casertaconcepts.com

info@casertaconcepts.com
1(855) 755-2246
www.casertaconcepts.com
DEEP-DIVE INTO STORM TOPOLOGY
Noel Milton Vega
Consultant, Dimension Data, LLC.
Consultant, Caserta Concepts
Practical Deep Dive: Continuity-of-Service across Storm failures
An approach to making topologies more resilient to task failure
• Tasks in Storm are the units that do the actual work.
• Tasks can individually fail due to:
  • Resource starvation (OOM, CPU)
  • Unhandled exceptions
  • Timeouts (such as waiting for I/O)
  • and so on
• Tasks also fail because parent Executors, Workers or Supervisors fail.
• Nimbus will spawn a replacement task, but in the context of C.o.S. is that enough?
Answer: No. But maybe we can work around that.

• My “storm-user” Google group question: http://bit.ly/1bsBooT
Storyboard: Continuity-of-Service
ACME Check Deposit Corp (H.Q.)
[Diagram: three check depositors, one of them marked with an X. Each performs Step1: deposit client checks and Step2: update checkbook balance, for clients [A-I], [J-R] and [S-Z] respectively]

Blue:
• Deposits a check for an [A-I] client, and is given a deposit receipt for it (Step1).
• Before he’s able to journal the receipt to the check register journal, he quits (Step2).
1) ACME H.Q. notices that [A-I] checks aren’t being processed. Should the workload be
redistributed? No! (exception policy).
2) Policy consequence: there must be no difference before & after the event, so context has to be
remembered:
• The new hire’s role is as check depositor for ACME (not a plumber for sub-company FOOBAR).
• Their specific ACME role is to deposit checks for clients [A-I].
• The role did have state: there’s an aggregate check register, and an incomplete transaction.
Storyboard: Continuity-of-Service

Why this example? It has the operational requirements of real-world use cases:
• Distributed model (where processors are autonomous). Suitable for Big Data.
• Specific failure / recovery requirements:
  • Incomplete transactions are completed
  • Aggregated state is remembered
• Behavior persistence: same behavior before & after an exception event (stickiness).
Modeling this use-case story in Storm
http://bit.ly/1bsBooT

Blue:
• Deposits a batch of checks for clients [A-I] and is given a deposit receipt for them (Step1).
• Before he’s able to journal the receipt to the check register journal, he quits (Step2).

1) ACME H.Q. notices that [A-I] checks aren’t being processed. Should the workload be
redistributed? No! (by policy).

2) Policy consequence: there must be no difference before & after the event, so context has to be
remembered:
• The role is check depositor for ACME (not a plumber for sister-company FOO). In Storm: acmeBolt
• The specific ACME role is to deposit checks for clients [A-I]. In Storm: acmeBolt task (fields grouped)
• The role did have state: there’s an aggregate check register, and an incomplete transaction. In Storm: Java objects in the JVM associated with the acmeBolt task
What does Storm remember across task fail/restarts? (if anything)
http://bit.ly/1bsBooT

[Diagram: three supervisor nodes (1-of-3, 2-of-3, 3-of-3), each running three workers; each worker has an executor running a task (t0, t1 or t2). One task on supervisor node 1 is marked with an X.]

- What is Storm’s grouping/re-grouping policy?
- Will replacement tasks use the same identifier?
Programmatically, what we’re asking is this …
http://bit.ly/1bsBooT

// ===============================
// Constructor.
// ===============================
public bolt01(Properties properties) {
}

// ===============================
// prepare() method
// ===============================
public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
}

// ===============================
// execute() method.
// ===============================
public void execute(Tuple inTuple) {
}

[Diagram: the same three supervisor nodes, with workers, executors and tasks t0, t1, t2; one task on supervisor node 1 is marked with an X.]

Is identification remembered here?
Is grouping remembered here? (i.e. redistribution policy)
Lab behavior observations show Storm does remember …
http://bit.ly/1bsBooT

ComponentID (task pointer and task index):
taskPntr1   0
taskPntr2   1
taskPntr3   2
…
taskPntrN   N-1

componentID = context.getThisComponentId(); // Defined in topology class. E.g. bolt01
taskID = context.getThisTaskId();           // An integer in [1, N], where N is the number of tasks, topology-wide.
taskIndex = context.getThisTaskIndex();     // An integer in [0, N-1], where N is the number of tasks, component-wide.
fqid = componentID + ".0" + Integer.toString(taskIndex); // Ex: bolt02.05; spout01.03; bolt01.00
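A small, hedged variation on the FQID construction above: concatenating ".0" with the index produces two-digit suffixes only while the task index stays below 10 (index 12 would give "spout01.012"). Assuming a zero-padded two-digit suffix is what is wanted, a formatter does the same job for any index up to 99. `FqidSketch` and `fqid` are hypothetical names, not part of Storm.

```java
import java.util.Locale;

// Hypothetical helper: builds "<componentId>.<two-digit task index>".
class FqidSketch {
    static String fqid(String componentId, int taskIndex) {
        // Locale.ROOT keeps the digits ASCII regardless of the JVM's default locale.
        return String.format(Locale.ROOT, "%s.%02d", componentId, taskIndex);
    }
}
```

Inside a bolt, the two arguments would come from `context.getThisComponentId()` and `context.getThisTaskIndex()` as shown above.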
Quick digression …
Lab tests show Storm does remember, but what’s missing?
http://bit.ly/1bsBooT

So in lab tests we observed the following behaviors in Storm:
• Storm preserves the FQID (e.g.: bolt01.02) before & after task failures. IDENTITY PERSISTENCE!
• Tasks with a given FQID will receive the same grouping of data throughout the life of a
topology. (Analogy: the new hire will be an ACME check depositor for clients [A-I]).

And yet, something is still missing.
While Storm can replay unprocessed Tuples that timed out during the
fail/restart period, it can’t regenerate in-memory (in-JVM) aggregated state.
What to do?
REDIS to the rescue :: Continuity-of-Service

Since we observed the following behaviors in Storm:
• Storm preserves the FQID (e.g.: bolt01.02) before & after task failures. IDENTITY PERSISTENCE!
• Tasks with a given FQID will receive the same grouping of data throughout
the life of a topology.
REDIS to the rescue :: Continuity-of-Service
• FQID is maintained across task fail/restarts (i.e. for the lifetime of the topology).
• Tuple grouping/partitioning is maintained across task fail/restarts (i.e. for the lifetime of the topology).

// ===============================
// prepare() method
// ===============================
public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
    [ ... snip ... ]
    this.componentID = context.getThisComponentId(); // e.g. bolt01; spout03
    this.taskIndex = context.getThisTaskIndex();     // [0-(N-1)]; where N = Number of component tasks.
    this.fqid = componentID + ".0" + Integer.toString(this.taskIndex); // bolt01.04; spout03.00
    this.redisKeyPrefix = this.fqid; // Use your unique Fully Qualified I.D. as a Redis key prefix.
    // Establish connection to Redis [not shown], and recover lost data structures, if any.
    this.hashMap = this.jedisClient.hgetAll(this.redisKeyPrefix + "-myMap"); // bolt01.01-myMap
}

// ===============================
// execute() method
// ===============================
public void execute(Tuple inTuple) {
    [ ... snip ... ]
    String customer = inTuple.getString(0);
    double amount = Double.parseDouble(inTuple.getString(1));
    // hgetAll() returns a Map<String, String>, so balances are kept as strings.
    String previous = this.hashMap.get(customer);
    double balance = (previous == null ? 0.0 : Double.parseDouble(previous)) + amount;
    this.hashMap.put(customer, Double.toString(balance)); // Recovered, as necessary, in prepare().
    this.jedisClient.hset(this.redisKeyPrefix + "-myMap", customer, Double.toString(balance));
}
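The recovery pattern in prepare() and execute() can be exercised without a Storm cluster or a Redis server. The sketch below is a stand-alone simulation under loud assumptions: a static in-memory map (`fakeRedis`) plays the role of Redis, the constructor plays the role of prepare(), and `deposit` plays the role of execute(). All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation of "recover state by FQID"; no Storm or Redis involved.
class RecoverableTaskSketch {
    // Stand-in for Redis hashes: hash name -> (field -> value).
    // It outlives any single task instance, as an external store would.
    static final Map<String, Map<String, String>> fakeRedis = new HashMap<>();

    private final String redisKey;
    private final Map<String, String> balances;

    // Plays the role of prepare(): rebuild in-memory state from the external store.
    RecoverableTaskSketch(String fqid) {
        this.redisKey = fqid + "-myMap";
        this.balances = new HashMap<>(fakeRedis.getOrDefault(redisKey, new HashMap<>()));
    }

    // Plays the role of execute(): update local state, then write through to the store.
    double deposit(String customer, double amount) {
        double balance = Double.parseDouble(balances.getOrDefault(customer, "0")) + amount;
        String stored = Double.toString(balance);
        balances.put(customer, stored);
        fakeRedis.computeIfAbsent(redisKey, k -> new HashMap<>()).put(customer, stored);
        return balance;
    }
}
```

Creating a second instance with the same FQID after the first has been discarded models a replacement task: it starts with the balances the failed task had already written, while an instance with a different FQID starts empty.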
Summary :: Storm / Redis and Continuity-of-Service
[Diagram: a Redis master (host:6379) with a local read-only slave, alongside the topology's spout tasks (spout01.00 to spout01.05) and bolt tasks (bolt01.00 to bolt01.02, bolt02.00 to bolt02.02)]
• Fields grouping within a stream is based on field-1 of the Tuple.
• taskIndex -vs- taskID
• Redis keys shown in the diagram:
  • KEY: dataSourceQueue01
  • KEY: dataSourceQueue02
  • KEY: spout01.tupleAchHash (tupleGUID and Tuple: GUID1 with tuple1, GUID2 with tuple2, ..., GUID-n with Tuple-n)
  • KEY: bolt01.02-dataStruct1, bolt01.02-dataStruct2, bolt01.02-dataStructN
  • KEY: bolt02.00-dataStruct1, bolt02.00-dataStruct2, bolt01.00-dataStructN
• Redis data types:
  • Strings (Byte-arrays)
  • Lists (2-way queue, as linked list)
  • Sets
  • Hashes
  • Sorted Sets (Hashes w/ sorted values)
• Se/De-serialize objects as JSON
• Other in-memory solution: e.g. MemSQL.
Noel Milton Vega
Consultant, Dimension Data, LLC.
Consultant, Caserta Concepts
P: (212) 699-2660
E1: noel@casertaconcepts.com
E2: nmvega@didata.us

info@casertaconcepts.com
1(855) 755-2246
www.casertaconcepts.com
Q&A / THANK YOU
501 Fifth Ave
17th Floor
New York, NY 10017
1-855-755-2246
info@casertaconcepts.com

Weitere ähnliche Inhalte

Was ist angesagt?

Big Data Platforms: An Overview
Big Data Platforms: An OverviewBig Data Platforms: An Overview
Big Data Platforms: An OverviewC. Scyphers
 
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...DataStax
 
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...Mark Rittman
 
Top 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionTop 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionDataStax
 
Design Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and KijiDesign Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and KijiHBaseCon
 
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016StampedeCon
 
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...Mark Rittman
 
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Big Data Spain
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 
Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016StampedeCon
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016StampedeCon
 
Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Adam Muise
 
Real World Use Case with Cassandra (Eddie Satterly, DataNexus) | C* Summit 2016
Real World Use Case with Cassandra (Eddie Satterly, DataNexus) | C* Summit 2016Real World Use Case with Cassandra (Eddie Satterly, DataNexus) | C* Summit 2016
Real World Use Case with Cassandra (Eddie Satterly, DataNexus) | C* Summit 2016DataStax
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design PatternsJohn Yeung
 
Next Generation Hadoop Introduction
Next Generation Hadoop IntroductionNext Generation Hadoop Introduction
Next Generation Hadoop IntroductionAdam Muise
 
Why data warehouses cannot support hot analytics
Why data warehouses cannot support hot analyticsWhy data warehouses cannot support hot analytics
Why data warehouses cannot support hot analyticsImply
 

Was ist angesagt? (20)

Big Data Platforms: An Overview
Big Data Platforms: An OverviewBig Data Platforms: An Overview
Big Data Platforms: An Overview
 
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
 
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
 
Instrumenting your Instruments
Instrumenting your Instruments Instrumenting your Instruments
Instrumenting your Instruments
 
Top 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionTop 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data Solution
 
Design Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and KijiDesign Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and Kiji
 
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
 
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
 
Accelerating Data Warehouse Modernization
Accelerating Data Warehouse ModernizationAccelerating Data Warehouse Modernization
Accelerating Data Warehouse Modernization
 
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Big Data Introduction
Big Data IntroductionBig Data Introduction
Big Data Introduction
 
Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015
 
Real World Use Case with Cassandra (Eddie Satterly, DataNexus) | C* Summit 2016
Real World Use Case with Cassandra (Eddie Satterly, DataNexus) | C* Summit 2016Real World Use Case with Cassandra (Eddie Satterly, DataNexus) | C* Summit 2016
Real World Use Case with Cassandra (Eddie Satterly, DataNexus) | C* Summit 2016
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
 
Next Generation Hadoop Introduction
Next Generation Hadoop IntroductionNext Generation Hadoop Introduction
Next Generation Hadoop Introduction
 
Why data warehouses cannot support hot analytics
Why data warehouses cannot support hot analyticsWhy data warehouses cannot support hot analytics
Why data warehouses cannot support hot analytics
 

Andere mochten auch

PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.DECK36
 
The Upstream Game, 2hr version
The Upstream Game, 2hr versionThe Upstream Game, 2hr version
The Upstream Game, 2hr versionSean Roberts
 
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cassandra

  • 1. Big Data Warehousing Meetup December 10, 2013 Real-time Trade Data Monitoring with Storm & Cassandra
  • 2. Agenda 7:00 Networking: Grab a slice of pizza and a drink... 7:15 Welcome & Intro: Joe Caserta (President, Caserta Concepts; Author, Data Warehouse ETL Toolkit), About the Meetup and about Caserta Concepts 7:30 Cassandra: Elliott Cordo (Chief Architect, Caserta Concepts) 8:00 Storm: Noel Vega (Consultant, Caserta Concepts; Consultant, Dimension Data, LLC) 8:30-9:00 Q&A / More Networking
  • 3. About the BDW Meetup • Big Data is a complex, rapidly changing landscape • We want to share our stories and hear about yours • Great networking opportunity for like minded data nerds • Opportunities to collaborate on exciting projects • Founded by Caserta Concepts, Big Data Analytics, DW & BI Consulting • Next BDW Meetup: JANUARY 20
  • 4. About Caserta Concepts Focused Expertise • • • • Big Data Analytics Data Warehousing Business Intelligence Strategic Data Ecosystems Industries Served • • • • • Financial Services Healthcare / Insurance Retail / eCommerce Digital Media / Marketing K-12 / Higher Education Founded in 2001 • President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
  • 5. Caserta Concepts Listed as one of the 20 Most Promising Data Analytics Consulting Companies CIOReview looked at hundreds of data analytics consulting companies and shortlisted the ones who are at the forefront of tackling the real analytics challenges. A distinguished panel comprising of CEOs, CIOs, VCs, industry analysts and the editorial board of CIOReview selected the Final 20.
  • 6. Expertise & Offerings Strategic Roadmap/ Assessment/Consulting/ Implementation Big Data Analytics Data Warehousing/ ETL/Data Integration BI/Visualization/ Analytics
  • 7. Client Portfolio Finance & Insurance Retail/eCommerce & Manufacturing Education & Services
  • 8. We are hiring Does this word cloud excite you? Speak with us about our open positions: jobs@casertaconcepts.com
  • 9. Why talk about Storm & Cassandra? [Diagram: a Traditional BI stack (ERP and Finance sources, ETL, Traditional EDW, Ad-Hoc/Canned Reporting) shown as legacy next to a Big Data BI stack: a Big Data Cluster of nodes N1-N5 on the Hadoop Distributed File System (HDFS) running MapReduce, Pig/Hive and Mahout, together with Storm, a NoSQL Database, Data Analytics and Data Science. Caption: Horizontally Scalable Environment - Optimized for Analytics]
  • 10. What is Storm • Distributed Event Processor • Real-time data ingestion and dissemination • In-Stream ETL • Reliably processes unbounded streams of data • Storm is fast: clocked at over a million tuples per second per node • It is scalable and fault-tolerant, and guarantees your data will be processed • Preferred technology for real-time big data processing by organizations worldwide: • Partial list at https://github.com/nathanmarz/storm/wiki/Powered-By • Incubator: • http://wiki.apache.org/incubator/StormProposal
  • 11. Components of Storm • Spout – Collects data from upstream feeds and submits it for processing • Tuple – A collection of data that is passed within Storm • Bolt – Processes tuples (Transformations) • Stream – Identifies outputs from Spouts/Bolts • Storm usually outputs to a NoSQL database
  • 12. Why NoSQL? • Performance: • Relational databases carry a lot of features and overhead that we don't need in many cases (although we will miss some of them) • Scalability: • Most relational databases scale vertically, which limits how large they can get. Federation and sharding are awkward manual processes. • Agile • Sparse Data / Data with a lot of variation • Most NoSQL databases scale horizontally on commodity hardware
  • 13. What is Cassandra? • Column families are the equivalent of a table in an RDBMS • The primary unit of storage is a column; columns are stored contiguously • Skinny Rows: most like a relational database, except that columns are optional and are not stored if omitted • Wide Rows: rows can be billions of columns wide; used for time series, relationships and secondary indexes
  • 14. REAL TIME TRADE DATA MONITORING Elliott Cordo Chief Architect, Caserta Concepts
  • 15. The Use Case • Trade data (orders and executions) • High volume of incoming data • 500 thousand records per second • 12 billion messages per day • Required that data be aggregated and monitored in real time (end-to-end latency measured in hundreds of milliseconds) • Both raw messages and analytics are persisted to a database
  • 16. The Data • Primarily FIX messages: Financial Information Exchange • Established in the early 1990s as a standard for trade data communication; widely used throughout the industry • Basically a delimited file of variable attribute-value pairs • Looks something like this: 8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 | • A single trade can comprise thousands of such messages, although typical trades have about a dozen
  • 17. Additional Requirements • Linearly scalable • Highly available: no single point of failure, quick recovery • Quicker time to benefit • Processing guarantees: NO DATA IS LOST!
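The attribute-value layout on the slide above can be read with a very small parser. The following is a minimal sketch, assuming the pipe-delimited rendering shown on the slide (FIX on the wire separates fields with the SOH control character rather than `|`); the class and method names are hypothetical and not part of the presented system.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not from the deck: parses the pipe-delimited
// rendering of a FIX message into an ordered tag -> value map.
final class FixMessageSketch {

    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : line.split("\\|")) {
            String trimmed = pair.trim();
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty or malformed segments
            }
            // Only tags that are present end up in the map (sparse data).
            fields.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> msg =
            parse("8=FIX.4.2 | 35=8 | 55=MSFT | 54=1 | 38=15 | 44=15 | 58=PHLX EQUITY TESTING | 10=128 |");
        // Tag 55 is the symbol, 38 the order quantity, 44 the price.
        System.out.println(msg.get("55") + " qty=" + msg.get("38") + " px=" + msg.get("44"));
    }
}
```

Because only the tags actually present are stored, the resulting map mirrors the sparse, variable shape of the messages.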
  • 18. Some Sample Analytic Use Cases • Sum(Notional volume) by Ticker: Daily, Hourly, Minute • Average trade latency (Execution TS – Order TS) • Wash Sales (sell within x seconds of last buy) for same Client/Ticker
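The wash-sale rule above (a sell within x seconds of the last buy for the same Client/Ticker) can be sketched as a small piece of per-key state. This is only an illustration under assumptions: the class name, the millisecond timestamps and the in-memory map are hypothetical, and a production version would keep this state somewhere durable.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the wash-sale check: remember the last buy time
// per client/ticker and flag a sell that arrives inside the window.
final class WashSaleSketch {
    private final long windowMillis;
    private final Map<String, Long> lastBuyAt = new HashMap<>();

    WashSaleSketch(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true when this trade is a sell within the window after the last buy. */
    boolean onTrade(String client, String ticker, boolean isBuy, long timestampMillis) {
        String key = client + "|" + ticker;
        if (isBuy) {
            lastBuyAt.put(key, timestampMillis);
            return false;
        }
        Long lastBuy = lastBuyAt.get(key);
        return lastBuy != null
            && timestampMillis >= lastBuy
            && timestampMillis - lastBuy <= windowMillis;
    }
}
```

The check only works if all trades for one client/ticker pair reach the same instance, which is why the grouping of tuples matters later in the deck.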
  • 19. How has this system traditionally been handled? • Typically by manually partitioning the application: a number of independent systems and databases "dividing" the problem [Diagram: a Message Queue feeding Use Case 1: Partition A (Database A), Use Case 1: Partition B (Database B) and Use Case 2: All Partitions (Database C)] Main issues: • Growth requires changing these systems to accept the new partitioning scheme: Development! • A lot of different applications replicating a complex architecture, with tons of boilerplate code • Performing analysis across the partitioning schemes is very difficult
  • 20. Need to Establish a Platform as a Service Architecture [Diagram labels: Sensor Data, Storm Cluster, Atomic data, Aggregates, Event Monitors, d3.js Analytics, Low Latency Analytics] • A Redis queue is used for ingestion • Storm is used for real-time ETL and outputs atomic data and the derived data needed for analytics • Redis is used as a reference data lookup cache and for state • Real-time analytics are produced from the aggregated data • Higher-latency ad-hoc analytics are done in Hadoop using Pig and Hive
  • 21. Deeper Dive: Cassandra as an Analytic Database • Based on a blend of Dynamo and BigTable • Distributed, master-less • Super fast writes: can ingest lots of data! • Very fast reads Why did we choose it: • Data throughput requirements • High availability • Simple expansion • Interesting data models for time series data (more on this later)
  • 22. Design Practices • Cassandra does not support aggregation or joins, so the data model must be tuned to usage • Denormalize your data (flatten your primary dimensional attributes into your fact) • Storing the same data redundantly is OK. Might sound weird, but we've been doing this all along in the traditional world, modeling our data to make analytic queries simple!
  • 23. Wide rows are our friends • Cassandra composite columns are powerful for analytic models • Facilitate multi-dimensional analysis • A wide row table may have N rows and a variable number of columns (millions of columns):
                 20130101  20130102  20130103  20130104  20130104  20130105  …
      ClientA       10003      9493     43143     45553     54553     34343  …
      ClientB       45453     34313     54543     23233      4233     34423  …
      ClientC        3323     35313     43123     54543     43433      4343  …
      …                 …         …         …         …         …         …  …
    • And now with CQL3 we have "unpacked" wide rows into named columns: easy to work with!
  • 24. More about wide rows! [Sample table: row keys ClientA, ClientB, ClientC, …; column keys 20130101 … 20130105, …; each cell holds a count] Create table Client_Daily_Summary ( Client text, Date_ID int, Trade_Count int, Primary key (Client, Date_ID)) • The left-most column is the ROW KEY • It is the mechanism by which the row is distributed across the Cassandra cluster • Care must be taken to prevent hot spots: dates, for example, are generally not good candidates because all load will go to a given set of servers on a particular day! • Data can be filtered using equality and "in" clauses • The top row is the COLUMN KEY • There can be a variable number of columns • It is acceptable to have millions or even billions of columns in a table • Column keys are sorted and can accept a range query (greater than / less than)
  • 25. Traditional Cassandra Analytic Model If we wanted to track trade count by day and by hour, we could stream our ETL to two (or more) summary fact tables. Daily summary: [Sample table: row keys ClientA, ClientB, ClientC; column keys 20130101 … 20130105; each cell holds a trade count] Sample analytic query: Give me daily trade counts for ClientA between Jan 1 and Jan 3:
      Select Date_ID, Trade_Count from Client_Daily_Summary where Client='ClientA' and Date_ID >= 20130101 and Date_ID <= 20130103
    Hourly summary:
                          0900  1000  1100  1200  1300  1400
      ClientA|20130101    1000   949  4314  4555  5455  3434
      ClientA|20130102    4545  3431  5454  2323   423  3442
      ClientB|20130101     332  3531  4312  5454  4343   434
    Sample analytic query: Give me hourly trade counts for ClientA for Jan 1 between 9 and 11 AM:
      Select Hour, Trade_Count from Client_Hourly_Summary where Client_Date='ClientA|20130101' and Hour >= 900 and Hour <= 1100
  • 26. But there are other methods too • Assuming some level of client-side aggregation (and additive measures), we could also further unpack and leverage column keys using CQL 3. A slightly different use case: Create table Client_Ticker_Summary ( Client text, Date_ID int, Ticker text, Trade_Count int, Notional_Volume float, Primary Key (Client, Date_ID, Ticker)) The first column in the PK definition is the Row Key, aka Partition Key. Look at all this flexible SQL goodness:
      select * from Client_Ticker_Summary where Client in ('ClientA','ClientB')
      select * from Client_Ticker_Summary where Client in ('ClientA','ClientB') and Date_ID >= 20130101 and Date_ID <= 20130103
      select * from Client_Ticker_Summary where Client = 'ClientA' and Date_ID >= 20130101 and Date_ID <= 20130103
      select * from Client_Ticker_Summary where Client = 'ClientA' and Date_ID = 20130101 and Ticker in ('APPL','GE','PG')
    Also possible, but not recommended:
      select * from Client_Ticker_Summary where Date_ID > 20120101 allow filtering;
      select * from Client_Ticker_Summary where Date_ID = 20120101 and Ticker in ('APPL','GE') allow filtering;
  • 27. Storing the Atomic data 8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 | • We must land all atomic data: • Persistence • Future replay (new metrics, corrections) • Drill-down capabilities / auditability • The sparse nature of the FIX data fits the Cassandra data model very well • We will store only the tags that are actually present in the data, saving space. A few approaches, depending on usage pattern:
      Create table Trades_Skinny ( OrderID text Primary Key, Date_ID int, Ticker text, Client text, …Many more columns)
      Create index ix_Date_ID on Trades_Skinny (Date_ID)
      Create table Trades_Wide ( Order_ID text, Tag text, Value text, Primary key (Order_ID, Tag))
      Create table Trades_Map ( OrderID text Primary Key, Date_ID int, Ticker text, Client text, Tags map<text, text>)
      Create index ix_Date_ID on Trades_Map (Date_ID)
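The access pattern on this slide (look a row up by its key, then scan a sorted range of column keys) can be illustrated in memory. This is a sketch only, with hypothetical names: a hash map stands in for the partitioned row keys and a `TreeMap` for the sorted column keys. It says nothing about how Cassandra stores data internally.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of a wide-row summary table:
// row key -> (sorted column key -> trade count).
final class WideRowSketch {
    private final Map<String, NavigableMap<Integer, Long>> rows = new HashMap<>();

    /** Adds one trade to the cell identified by row key and column key. */
    void increment(String rowKey, int columnKey) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .merge(columnKey, 1L, Long::sum);
    }

    /** Range query over column keys, inclusive on both ends. */
    NavigableMap<Integer, Long> range(String rowKey, int from, int to) {
        NavigableMap<Integer, Long> row = rows.get(rowKey);
        if (row == null) {
            return new TreeMap<>();
        }
        return row.subMap(from, true, to, true);
    }
}
```

A single incoming trade would update both summaries, for example `daily.increment("ClientA", 20130101)` and `hourly.increment("ClientA|20130101", 900)`, which corresponds to streaming the ETL into two fact tables.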
  • 28. Big data solutions usually employ multiple DB types Some considerations: • Size requirements: • Volume: a disk space requirement • Velocity: a message rate requirement • Data-structure & query-pattern complexity: simple K/V pair -vs- relational -vs- … • C.A.P. theorem alignment: which two does your use case benefit from? • Value-add features: • API (Interface: e.g. HTTP ReST -vs- client classes; Power: e.g. mget, incrementBy) • Replication and/or H/A support (B.C./D.R.) • Support for data processing patterns (e.g. Riak has Map/Reduce; Redis zSets have Top-N) • Transaction support (Redis: Multi; command list; Exec) • and so on
  • 29. Contact Elliott Cordo Principal Consultant, Caserta Concepts P: (855) 755-2246 x267 E: elliott@casertaconcepts.com info@casertaconcepts.com 1(855) 755-2246 www.casertaconcepts.com
  • 30. DEEP-DIVE INTO STORM TOPOLOGY Noel Milton Vega Consultant, Dimension Data, LLC. Consultant, Caserta Concepts
  • 31. Practical Deep Dive: Continuity-of-Service across Storm failures An approach to making topologies more resilient to task failure • Tasks in Storm are the units that do the actual work • Tasks can individually fail due to: • Resource starvation (OOM, CPU) • Unhandled exceptions • Timeouts (such as waiting for I/O) • and so on • Tasks also fail because parent Executors, Workers or Supervisors fail • Nimbus will spawn a replacement task, but in the context of C.o.S. is that enough? Answer: No. But maybe we can work around that. • My "storm-user" Google group question: http://bit.ly/1bsBooT
  • 32. Storyboard: Continuity-of-Service [Diagram: ACME Check Deposit Corp (H.Q.) with three check depositors, one each for clients [A-I], [J-R] and [S-Z]; each performs Step 1: deposit client checks and Step 2: update checkbook balance; an X marks a failure] Blue: • Deposits a check for an [A-I] client, and is given a deposit receipt for it (Step 1) • Before he's able to journal the receipt to the check register journal (Step 2), he quits 1) ACME H.Q. notices that [A-I] checks aren't being processed. Should the workload be redistributed? No! (exception policy) 2) Policy consequence: there's no difference before & after the event, so context has to be remembered: • The new hire's role is check depositor for ACME (not a plumber for sub-company FOOBAR) • Their specific ACME role is to deposit checks for clients [A-I] • The role did have state: there's an aggregate check register, and an incomplete transaction
  • 33. Storyboard: Continuity-of-Service Why this example? It has the operational requirements of real-world use cases: • Distributed model (where processors are autonomous). Suitable for Big Data. • Specific failure / recovery requirements: • Incomplete transactions are completed • Aggregated state is remembered • Behavior persistence: same behavior before & after an exception event (stickiness)
  • 34. Modeling this use-case story in Storm Blue: • Deposits a batch of checks for clients [A-I] and is given a deposit receipt for them (Step 1) • Before he's able to journal the receipt to the check register journal (Step 2), he quits 1) ACME H.Q. notices that [A-I] checks aren't being processed. Should the workload be redistributed? No! (by policy) 2) Policy consequence: there's no difference before & after the event, so context has to be remembered: • The role is check depositor for ACME (not a plumber for sister-company FOO): acmeBolt • The specific ACME role is to deposit checks for clients [A-I]: acmeBolt task (fields grouped) • The role did have state: there's an aggregate check register, and an incomplete transaction: Java objects in the JVM associated with the acmeBolt task
  • 35. Modeling this use-case story in Storm http://bit.ly/1bsBooT
  • 36. What does Storm remember across task fail/restarts? (if anything) http://bit.ly/1bsBooT [Diagram: three supervisor nodes (1-of-3, 2-of-3, 3-of-3), each running workers that host executors and tasks (t0, t1, t2); an X marks a failed t0 task on node 1] - What is Storm's grouping/re-grouping policy? - Will replacement tasks use the same identifier?
  • 37. Programmatically, what we're asking is this … http://bit.ly/1bsBooT [Diagram: the same three supervisor nodes, with an X on a failed t0 task]
      // Constructor.
      public bolt01(Properties properties) { }
      // prepare() method. Is identification remembered here?
      public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) { }
      // execute() method. Is grouping remembered here? (i.e. redistribution policy)
      public void execute(Tuple inTuple) { }
  • 38. Lab behavior observations show Storm does remember … http://bit.ly/1bsBooT [Diagram: a ComponentID pointing to tasks taskPntr1 … taskPntrN with task indexes 0 … N-1]
      componentID = context.getThisComponentId();  // Defined in topology class. E.g. bolt01
      taskID = context.getThisTaskId();            // An integer in [1, N], where N is the number of tasks, topology-wide.
      taskIndex = context.getThisTaskIndex();      // An integer in [0, N-1], where N is the number of tasks, component-wide.
      fqid = componentID + ".0" + Integer.toString(taskIndex);  // Ex: bolt02.05; spout01.03; bolt01.00
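The FQID built on this slide concatenates `".0"` with the task index, which matches the slide's examples for indexes 0 to 9 but would yield `bolt01.010` for index 10. A zero-padded format keeps the two-digit shape for larger components. The helper below is a hypothetical sketch, not code from the presentation, and takes plain values rather than a Storm `TopologyContext` so it can run on its own.

```java
import java.util.Locale;

// Hypothetical helper: builds a fully qualified task id such as "bolt02.05"
// from a component id and the component-wide task index.
final class FqidSketch {

    static String fqid(String componentId, int taskIndex) {
        // %02d pads to two digits; indexes of 100 or more simply grow wider.
        return String.format(Locale.ROOT, "%s.%02d", componentId, taskIndex);
    }

    public static void main(String[] args) {
        System.out.println(fqid("bolt02", 5));
        System.out.println(fqid("bolt01", 10));
    }
}
```

In a bolt, the two arguments would come from `context.getThisComponentId()` and `context.getThisTaskIndex()`, as shown on the slide.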
  • 40. Lab tests show Storm does remember, but what's missing? http://bit.ly/1bsBooT In lab tests we observed the following behaviors in Storm: • It preserves the FQID (e.g. bolt01.02) before & after task failures. IDENTITY PERSISTENCE! • Tasks with a given FQID will receive the same grouping of data throughout the life of a topology. (Analogy: the new hire will be an ACME check depositor for clients [A-I].) And yet, something is still missing: while Storm can replay unprocessed Tuples that timed out during the fail/restart period, it can't regenerate in-memory (in-JVM) aggregated state. What to do?
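The observation that a given task keeps receiving the same group of data is what a deterministic hash-based routing rule would produce. The sketch below illustrates that idea only: it assumes routing by the hash of the grouping field values modulo the number of target tasks, and it is not taken from Storm's source, whose actual implementation may differ. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of fields-grouping style routing:
// equal grouping values always map to the same task index.
final class FieldsGroupingSketch {

    static int targetTaskIndex(List<Object> groupingValues, int numTasks) {
        // floorMod keeps the result in [0, numTasks) even for negative hashes.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        List<Object> key = Arrays.<Object>asList("ClientA");
        System.out.println(targetTaskIndex(key, 3) == targetTaskIndex(key, 3));
    }
}
```

Because the result depends only on the field values and the task count, a replacement task that takes over the same index sees the same keys as its predecessor, which is the property the Redis recovery approach relies on.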
  • 41. REDIS to the rescue :: Continuity-of-Service Since we observed the following behaviors in Storm: • It preserves the FQID (e.g. bolt01.02) before & after task failures. IDENTITY PERSISTENCE! • Tasks with a given FQID will receive the same grouping of data throughout the life of a topology.
  • 42. REDIS to the rescue :: Continuity-of-Service
      // prepare() method.
      // The FQID is maintained across task fail/restarts (i.e. for the lifetime of the topology).
      public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
          [ ... snip ... ]
          this.componentID = context.getThisComponentId();  // e.g. bolt01; spout03
          this.taskIndex = context.getThisTaskIndex();      // [0-(N-1)]; where N = number of component tasks.
          this.fqid = componentID + ".0" + Integer.toString(this.taskIndex);  // bolt01.04; spout03.00
          this.redisKeyPrefix = this.fqid;  // Use your unique Fully Qualified I.D. as a Redis key prefix.
          // Establish connection to Redis [not shown], and recover lost data structures, if any.
          this.hashMap = this.jedisClient.hgetAll(this.redisKeyPrefix + "-myMap");  // bolt01.01-myMap
      }
      // execute() method.
      // Tuple grouping/partitioning is maintained across task fail/restarts (i.e. for the lifetime of the topology).
      public void execute(Tuple inTuple) {
          [ ... snip ... ]
          String customer = inTuple.getString(0);
          // Add the incoming amount to the balance recovered, as necessary, in prepare().
          double balance = Double.parseDouble(this.hashMap.getOrDefault(customer, "0")) + Double.parseDouble(inTuple.getString(1));
          this.hashMap.put(customer, Double.toString(balance));
          this.jedisClient.hset(this.redisKeyPrefix + "-myMap", customer, Double.toString(balance));
      }
  • 43. Summary :: Storm / Redis and Continuity-of-Service [Diagram: a Redis master with a read-only slave on the local host (host:6379). Labels: KEY: dataSourceQueue01 and KEY: dataSourceQueue02; spouts spout01.00 … spout01.05; KEY: spout01.tupleAchHash, mapping tupleGUID (GUID1, GUID2, ... GUID-n) to Tuple (tuple1, tuple2, ... Tuple-n); bolts bolt01.00 … bolt01.02 and bolt02.00 … bolt02.02; per-task keys such as KEY: bolt01.02-dataStruct1, bolt01.02-dataStruct2, … bolt01.02-dataStructN and KEY: bolt02.00-dataStruct1, bolt02.00-dataStruct2, …] • Fields grouping within a stream is based on field-1 of the Tuple • taskIndex -vs- taskID • Redis data types: Strings (byte arrays); Lists (2-way queue, as linked list); Sets; Hashes; Sorted Sets (Hashes w/ sorted values) • Se/De-serialize objects as JSON • Other in-memory solutions: e.g. MemSQL
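The recovery pattern on this slide (write every state change through to a store keyed by the FQID, and reload that state when a replacement task starts) can be exercised without Storm or Redis. In the sketch below everything is hypothetical: `FakeHashStore` is an in-memory stand-in that only borrows the names of the Redis hash commands `hgetAll` and `hset`, and the constructor plays the role of `prepare()`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of FQID-keyed state recovery; not the presenter's code.
final class RecoverableBoltSketch {

    /** In-memory stand-in for a Redis hash store: key -> (field -> value). */
    static final class FakeHashStore {
        private final Map<String, Map<String, String>> data = new HashMap<>();

        Map<String, String> hgetAll(String key) {
            Map<String, String> stored = data.get(key);
            Map<String, String> copy = new HashMap<>();
            if (stored != null) {
                copy.putAll(stored);
            }
            return copy;
        }

        void hset(String key, String field, String value) {
            data.computeIfAbsent(key, k -> new HashMap<>()).put(field, value);
        }
    }

    private final FakeHashStore store;
    private final String stateKey;
    private final Map<String, String> balances;

    /** Plays the role of prepare(): recover any state saved under this FQID. */
    RecoverableBoltSketch(FakeHashStore store, String fqid) {
        this.store = store;
        this.stateKey = fqid + "-myMap";
        this.balances = store.hgetAll(stateKey);
    }

    /** Plays the role of execute(): update local state and write it through. */
    double deposit(String customer, double amount) {
        double balance = Double.parseDouble(balances.getOrDefault(customer, "0")) + amount;
        String text = Double.toString(balance);
        balances.put(customer, text);
        store.hset(stateKey, customer, text);
        return balance;
    }
}
```

A second instance created with the same store and the same FQID starts from the balances the first one wrote, which is the behavior a replacement task needs after a failure.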
  • 44. Noel Milton Vega Consultant, Dimension Data, LLC. Consultant, Caserta Concepts P: (212) 699-2660 E1: noel@casertaconcepts.com E2: nmvega@didata.us info@casertaconcepts.com 1(855) 755-2246 www.casertaconcepts.com
  • 45. Q&A / THANK YOU 501 Fifth Ave 17th Floor New York, NY 10017 1-855-755-2246 info@casertaconcepts.com

Editor's notes

  1. Alternative NoSQL: HBase, Cassandra, Druid, VoltDB