O'Reilly Media Webcast: Building Real-Time Data Pipelines

Building Real-Time Data Pipelines
Through In-Memory Architectures
Ben Lorica, Chief Data Scientist, O'Reilly Media
@bigdata
Eric Frenkiel, CEO & Co-Founder, MemSQL
@ericfrenkiel

What’s In Store
Why In-Memory for Real Time
Using an In-Memory Database with Spark and Kafka
Real-Time Use Cases and Demonstrations
About MemSQL

Going Real-Time is the Next Phase for Big Data
More Sensors
More Interconnectivity
More User Demand
…and companies are at risk of being left behind

Expensive
Not scalable
Batch only
SAN-burdened
1%

Success will be driven by real-time analytic applications

A Fresh Look at Lambda Architectures
Speed: Fast Updates
Serving: Unified queries, full SQL
Batch: Fast Appends

Comprehensive Architecture
Real Time: Speed/Streaming Layer, Fast Updates, Rowstore
Historical: Batch Layer, Fast Appends, Columnstore
Analytics
Transactions
Execution engine that spans the data spectrum

Simplified Lambda Architectures with MemSQL
Layer     Traditional Lambda    MemSQL Lambda
Batch     Hadoop                MemSQL Column Store
Speed     Storm, Spark          Kafka > Spark > MemSQL
Serving   Cassandra, HBase      MemSQL

Designing the Ideal Real-Time Pipeline
Message Queue > Transformation > Speed/Serving Layer
End-to-End Data Pipeline Under One Second

Kafka
• A high-throughput distributed messaging system
• Publish and subscribe to Kafka “topics”
• Centralized data transport for the organization

Spark
• In-memory execution engine
• High level operators for procedural and programmatic analytics
• Faster than MapReduce

MemSQL
• In-memory, distributed database
• Full transactions and complete durability
• Enable real-time, performant applications

Lambda Applies to Real-Time Data Pipelines
Message Queue
Batch
Inputs Database Transformation Application

Kafka, Spark, and MemSQL Make it Simple
Batch
Inputs Application

Put Apache Spark in the fast lane with MemSQL Streamliner

Introducing the MemSQL Streamliner
• One click deployment of integrated Apache Spark
• Put Spark in the Fast Lane
  - GUI pipeline setup
  - Multiple data pipelines
  - Real-time transformation
• Eliminates batch ETL
• Open source on GitHub

Simple Deployment Process
1. Deploy MemSQL: In-Memory | Distributed | Relational
2. Deploy Spark
Kafka Connects to Each Node
Diagram labels: Application, Cluster

Streamliner Architecture
First of many integrated Apache Spark solutions
Other Real-Time Data Sources
Application
Apache Spark
Future Solution
Future Machine Learning Solution
STREAMLINER

Streamliner ETL Detail
Other Real-Time Data Sources
Application
Apache Spark
Future Solution
Future Machine Learning Solution
STREAMLINER
Custom
Future Extractor
JSON
Custom
Future Transformer
STREAMLINER
Extract Transform Load

Streamliner: Dynamic Resource Management
Without Streamliner: Pipeline 1 and Pipeline 2 each run their own Driver (P1 only, P2 only), and each Spark Worker Executor is dedicated to one pipeline (P1 only or P2 only)
With Streamliner: one Streamliner Driver serves All Pipelines, and each Spark Worker Executor can run P1 or P2

One Architecture for Many Applications

Monitoring real-time Xfinity programming and video health
• Collect streaming data at scale (hundreds of MemSQL machines)
• Proactively diagnose issues
• Query ad-hoc and in real-time with full SQL
From 30 minutes to less than 1 second
Real-time Analytics

Massive Ingest and Concurrent Analytics
• Instant accuracy to the latest repin
• Build real-time analytic applications
Real-time analytics

Using Real-Time for Personalization
Ad Servers
EC2
Real-time
analytics
PostgreSQL
Legacy reports
Monitoring S3 (replay)
HDFS
Data Science
Vertica
Operational Data Store (ODS)
Star Schema MicroStrategy
• Reach overlap and ad optimization
• Over 60,000 queries per second
• Millisecond response times

MemCity
Capturing data from 1.4 million households
Total AWS hardware costs at $2.35 per hour

Subscribing to Kafka
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
0111001010101111101111100000001010
111100001110101100000010010010111…
Publish to Kafka Topic
0111001010101111101111100000001010
111100001110101100000010010010111…
1110010101000101010001010100010111
111010100011110101100011010101000…
0101111000011100101010111110001111
011010111100000000101110101100000…
Event added to message queue
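The publish step above can be sketched in a few lines. This is a minimal illustration, not the demo's actual producer: a standard-library queue stands in for the Kafka topic, JSON is an assumed wire format (the slides only show raw bits), and the field order (time, house_id, device_id, watts) is inferred from the enriched record on the next slide. The names `topic`, `serialize`, and `publish` are hypothetical.

```python
import json
import queue

# Stand-in for a Kafka topic (assumption): a real pipeline would use a Kafka
# producer client instead of an in-process queue.
topic = queue.Queue()


def serialize(event):
    """Encode a (time, house_id, device_id, watts) tuple as UTF-8 JSON bytes."""
    time, house_id, device_id, watts = event
    return json.dumps([time, house_id, device_id, watts]).encode("utf-8")


def publish(topic, event):
    """Append the serialized event to the message queue."""
    topic.put(serialize(event))


# The sample event shown on the slide.
event = ("2015-07-06T16:43:40.33Z", 329280, 23, 60)
publish(topic, event)
```

A consumer would later call `topic.get()` and decode the bytes, which mirrors the subscribe side of the slide.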

Enrich and Transform the Data
Spark polling Kafka for new messages
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
(2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60)
Deserialization
Enrichment
0111001010101111101111100000001010
111100001110101100000010010010111…
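The deserialization and enrichment steps can be sketched as two small functions. This is an illustration under stated assumptions: in the demo this work runs inside Spark, whereas the sketch is plain Python; the JSON payload format and the lookup tables `HOUSE_ZIP` and `DEVICE_TYPE` are hypothetical, since the slides do not say where the zip code and device type come from.

```python
import json

# Hypothetical reference data (assumption): maps used to add the zip code
# and device type that appear in the enriched record on the slide.
HOUSE_ZIP = {329280: 94110}
DEVICE_TYPE = {23: "kitchen_appliance"}


def deserialize(payload):
    """Decode UTF-8 JSON bytes into a (time, house_id, device_id, watts) tuple."""
    time, house_id, device_id, watts = json.loads(payload.decode("utf-8"))
    return (time, house_id, device_id, watts)


def enrich(event):
    """Add zip and device_type, giving (time, house_id, zip, device_id, device_type, watts)."""
    time, house_id, device_id, watts = event
    return (
        time,
        house_id,
        HOUSE_ZIP.get(house_id),       # None if the house is unknown
        device_id,
        DEVICE_TYPE.get(device_id),    # None if the device is unknown
        watts,
    )


payload = b'["2015-07-06T16:43:40.33Z", 329280, 23, 60]'
enriched = enrich(deserialize(payload))
```

With the sample event, `enriched` matches the six-field record shown on the slide.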

Persist and Prepare for Production
RDD.saveToMemSQL()
INSERT INTO memcity_table ...
time                     house_id  zip    device_id  device_type          watts
2015-07-06T16:43:40.33Z  329280    94110  23         ‘kitchen_appliance’  60
…                        …         …      …          …                    …
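The persist step amounts to inserting enriched rows into `memcity_table`. The sketch below shows the equivalent SQL from plain Python. It is not the Spark connector call named on the slide: the standard-library `sqlite3` module stands in for MemSQL so the example runs anywhere, and the column types are assumptions based on the sample row.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MemSQL (assumption).
conn = sqlite3.connect(":memory:")

# Column names come from the slide; the types are guesses from the sample row.
conn.execute(
    """
    CREATE TABLE memcity_table (
        time        TEXT,
        house_id    INTEGER,
        zip         INTEGER,
        device_id   INTEGER,
        device_type TEXT,
        watts       INTEGER
    )
    """
)

# The enriched sample record from the slide.
rows = [("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60)]

# Parameterized insert, one statement per enriched row.
conn.executemany(
    "INSERT INTO memcity_table (time, house_id, zip, device_id, device_type, watts) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)
conn.commit()

stored = conn.execute("SELECT * FROM memcity_table").fetchall()
```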

Go to Production
Compress development timelines
SELECT ... FROM memcity_table ...
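Once rows land in `memcity_table`, an application can query them with ordinary SQL. The sketch below shows one possible query, total watts per device type within a zip code. The query itself is an assumption, since the slide elides the SELECT; `sqlite3` again stands in for MemSQL, and only the first row is from the slides, while the other two are made-up sample data.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MemSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memcity_table ("
    "time TEXT, house_id INTEGER, zip INTEGER, "
    "device_id INTEGER, device_type TEXT, watts INTEGER)"
)
conn.executemany(
    "INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        # Sample record from the slides.
        ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
        # Made-up rows, for illustration only.
        ("2015-07-06T16:43:41.00Z", 329280, 94110, 7, "lighting", 15),
        ("2015-07-06T16:43:41.50Z", 100001, 94110, 23, "kitchen_appliance", 40),
    ],
)

# Hypothetical dashboard query: power draw per device type in one zip code.
totals = conn.execute(
    "SELECT device_type, SUM(watts) AS total_watts "
    "FROM memcity_table WHERE zip = ? "
    "GROUP BY device_type ORDER BY total_watts DESC",
    (94110,),
).fetchall()
```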

Building Real-Time Data Pipelines and Predictive Applications

Adding Real-Time Scoring to Predictive Applications
Streamliner
Input
User Jar
SAS Generated PMML
Industrial Equipment Sensor Data
S1 S2 S3 P1 P2 P3
Scoring Real-Time Data with Predictive Models
Sensor 1 Predictive Model 1
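The pairing of each sensor with its own predictive model can be sketched as a lookup from sensor id to a scoring function. This is an illustration only: the slide describes models exported as PMML and run through a user jar, whereas the sketch uses hand-written logistic functions in Python. The weights, biases, and the names `MODELS` and `score_stream` are hypothetical.

```python
import math


def make_logistic_model(weight, bias):
    """Return a scoring function mapping a sensor value to a probability in (0, 1)."""
    def score(value):
        return 1.0 / (1.0 + math.exp(-(weight * value + bias)))
    return score


# Hypothetical per-sensor models (S1 -> P1, S2 -> P2, S3 -> P3); in the slide
# these would come from exported PMML rather than hard-coded coefficients.
MODELS = {
    "S1": make_logistic_model(0.1, -5.0),
    "S2": make_logistic_model(0.2, -8.0),
    "S3": make_logistic_model(0.05, -2.0),
}


def score_stream(readings):
    """Score each (sensor_id, value) reading as it arrives with that sensor's model."""
    for sensor_id, value in readings:
        yield sensor_id, value, MODELS[sensor_id](value)


# Made-up readings, for illustration only.
scored = list(score_stream([("S1", 50.0), ("S2", 40.0), ("S3", 40.0)]))
```

Because `score_stream` is a generator, each reading is scored as it is consumed, which matches the streaming shape of the pipeline.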

MemSQL at a Glance
• Enable every company to be a real-time enterprise
• Founded 2011, based in San Francisco
• Founders are ex-Facebook, SQL Server engineers
• Deliver a database technology for modern architecture
Enterprise Focus

The Real-Time Database for Transactions and Analytics
In-Memory | Distributed | Relational
Data Center | Software | Cloud

MemSQL for the Spectrum of Transactions
Each Transaction Paramount
• Guarantee that every individual transaction is persisted
• No individual transaction can be lost
  - Financial credits and debits
  - Inventory movement
  - Employee status
Transactional Aggregates Paramount
• Capture massive event streams for immediate analysis
• Transaction repetition/redundancy at the device level
  - Event data and clickstreams
  - Sensor data, Internet of Things
  - Mobile applications
  - Real-time streams

Gartner Magic Quadrant for ODBMS
Leading Relational Database in Visionaries Quadrant

Forrester Wave: In-Memory Database Platforms
“MemSQL Named Strong Performer”

GET YOUR FREE COPY:
memsql.com/oreilly
