As our customers tap into new sources of data or modify existing data pipelines, we are often asked questions like: What technologies should we consider? Where can we reduce data latency? How can we simplify our data architecture?
To eliminate the guesswork, we teamed up with Ben Lorica, Chief Data Scientist at O’Reilly Media, to host a webcast centered around building real-time data pipelines.
O'Reilly Media Webcast: Building Real-Time Data Pipelines
1. Building Real-Time Data Pipelines Through In-Memory Architectures
Ben Lorica, Chief Data Scientist, O'Reilly Media
@bigdata
Eric Frenkiel, CEO & Co-Founder, MemSQL
@ericfrenkiel
2. What’s In Store
Why In-Memory for Real Time
Using an In-Memory Database with Spark and Kafka
Real-Time Use Cases and Demonstrations
About MemSQL
3. Going Real-Time is the Next Phase for Big Data
• More sensors
• More interconnectivity
• More user demand
…and companies are at risk of being left behind
22. Introducing the MemSQL Streamliner
One-click deployment of integrated Apache Spark
Put Spark in the Fast Lane
• GUI pipeline setup
• Multiple data pipelines
• Real-time transformation
Eliminates batch ETL
Open source on GitHub
27. Streamliner Architecture
First of many integrated Apache Spark solutions
[Diagram labels: Other Real-Time Data Sources; Application; Apache Spark; Streamliner; Future Solution; Future Machine Learning Solution]
37. Collect streaming data at scale (hundreds of MemSQL machines)
Proactively diagnose issues
Query ad hoc and in real time with full SQL
From 30 minutes to less than 1 second
Real-time analytics
42. Using Real-Time for Personalization
Ad Servers
EC2
Real-time analytics
PostgreSQL
Legacy reports
Monitoring S3 (replay)
HDFS
Data Science
Vertica
Operational Data Store (ODS)
Star Schema MicroStrategy
Reach overlap and ad optimization
Over 60,000 queries per second
Millisecond response times
44. Subscribing to Kafka
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
0111001010101111101111100000001010111100001110101100000010010010111…
Publish to Kafka Topic
[Illustration: message queue holding the serialized event and two other binary messages]
Event added to message queue
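The publish step on this slide might look roughly like the sketch below. The slides do not show the wire format or the producer code, so the JSON encoding and the in-memory queue standing in for a Kafka topic are assumptions made for illustration, and serialize and publish are hypothetical helper names. The field names are taken from the table columns shown later in the deck.

```python
import json
import queue

# Hypothetical stand-in for a Kafka topic: an in-memory FIFO queue.
topic = queue.Queue()


def serialize(event):
    """Encode a (time, house_id, device_id, watts) tuple as UTF-8 JSON bytes."""
    time, house_id, device_id, watts = event
    record = {
        "time": time,
        "house_id": house_id,
        "device_id": device_id,
        "watts": watts,
    }
    return json.dumps(record).encode("utf-8")


def publish(topic, event):
    """Append the serialized event to the queue, as a producer would to a topic."""
    topic.put(serialize(event))


# The sample event from the slide.
publish(topic, ("2015-07-06T16:43:40.33Z", 329280, 23, 60))
```

A real producer would send the same bytes through a Kafka client library instead of a local queue; the serialization step is the part that carries over.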
45. Enrich and Transform the Data
Spark polling Kafka for new messages
0111001010101111101111100000001010111100001110101100000010010010111…
Deserialization
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
Enrichment
(2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60)
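As a minimal sketch of the two steps on this slide, the code below deserializes a message and then enriches it with a zip code and a device type. It is plain Python rather than Spark code, and it assumes a JSON wire format and two small lookup dictionaries (HOUSE_ZIP and DEVICE_TYPE); all of these are hypothetical, since the slides show only the tuples before and after enrichment.

```python
import json

# Hypothetical reference data; a real pipeline would load these from a lookup store.
HOUSE_ZIP = {329280: 94110}
DEVICE_TYPE = {23: "kitchen_appliance"}


def deserialize(payload):
    """Turn UTF-8 JSON bytes back into a (time, house_id, device_id, watts) tuple."""
    record = json.loads(payload.decode("utf-8"))
    return (record["time"], record["house_id"], record["device_id"], record["watts"])


def enrich(event):
    """Add zip and device_type, looked up by house_id and device_id."""
    time, house_id, device_id, watts = event
    return (
        time,
        house_id,
        HOUSE_ZIP.get(house_id),      # None if the house is unknown
        device_id,
        DEVICE_TYPE.get(device_id),   # None if the device is unknown
        watts,
    )


# Assumed message bytes for the sample event from the slide.
payload = json.dumps(
    {"time": "2015-07-06T16:43:40.33Z", "house_id": 329280, "device_id": 23, "watts": 60}
).encode("utf-8")

enriched = enrich(deserialize(payload))
```

Unknown ids map to None here so that a missing lookup does not drop the event; whether to drop, default, or flag such events is a design choice the slides do not address.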
46. Persist and Prepare for Production
RDD.saveToMemSQL()
INSERT INTO memcity_table ...
time                     house_id  zip    device_id  device_type          watts
2015-07-06T16:43:40.33Z  329280    94110  23         ‘kitchen_appliance’  60
…                        …         …      …          …                    …
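To make the persist step concrete, here is a rough sketch that inserts the enriched row into a table named memcity_table and then queries it with SQL. It uses Python's built-in sqlite3 module as a stand-in for MemSQL and does not use the saveToMemSQL call named on the slide; the column types and the save_rows helper are assumptions.

```python
import sqlite3

# In-memory SQLite database used here only as a stand-in for MemSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE memcity_table (
        time        TEXT,
        house_id    INTEGER,
        zip         INTEGER,
        device_id   INTEGER,
        device_type TEXT,
        watts       INTEGER
    )
    """
)


def save_rows(conn, rows):
    """Insert a batch of enriched events in a single transaction."""
    with conn:
        conn.executemany(
            "INSERT INTO memcity_table "
            "(time, house_id, zip, device_id, device_type, watts) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            rows,
        )


# The enriched sample row from the slide.
save_rows(conn, [("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60)])

# Once persisted, the data can be queried ad hoc with SQL.
watts_by_zip = conn.execute(
    "SELECT zip, SUM(watts) FROM memcity_table GROUP BY zip"
).fetchall()
```

Batching rows into one transaction, as save_rows does, is a common way to keep insert overhead low when events arrive continuously.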
49. Adding Real-Time Scoring to Predictive Applications
Streamliner
Input
User Jar
SAS Generated PMML
Industrial Equipment Sensor Data
S1 S2 S3 P1 P2 P3
Scoring Real-Time Data with Predictive Models
Sensor 1 Predictive Model 1
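The pairing of sensors with predictive models (S1 with P1, and so on) can be sketched as a simple routing step. The slide refers to SAS-generated PMML models; this sketch does not parse PMML. Instead it assumes placeholder scoring functions built by a hypothetical make_threshold_model helper, with made-up limits, purely to show how each reading is sent to the model for its sensor.

```python
def make_threshold_model(limit):
    """Return a placeholder model that scores a reading as a fraction of `limit`, clamped to [0, 1]."""
    def score(reading):
        return min(max(reading / limit, 0.0), 1.0)
    return score


# Hypothetical models P1..P3, keyed by the sensor each one scores.
MODELS = {
    "S1": make_threshold_model(100.0),
    "S2": make_threshold_model(200.0),
    "S3": make_threshold_model(40.0),
}


def score_stream(readings):
    """Yield (sensor_id, score) for each (sensor_id, value) reading as it arrives."""
    for sensor_id, value in readings:
        model = MODELS[sensor_id]
        yield sensor_id, model(value)


# Illustrative readings, not data from the webcast.
scores = list(score_stream([("S1", 50.0), ("S2", 300.0), ("S3", 10.0)]))
```

Because score_stream is a generator, each reading is scored as it is consumed rather than after a batch has accumulated, which is the property the slide emphasizes.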
50. What’s In Store
Why In-Memory for Real Time
Using an In-Memory Database with Spark and Kafka
Real-Time Use Cases and Demonstrations
About MemSQL
51. MemSQL at a Glance
• Enable every company to be a real-time enterprise
• Founded 2011, based in San Francisco
• Founders are ex-Facebook, SQL Server engineers
• Deliver a database technology for modern architecture
Enterprise Focus
52. The Real-Time Database for Transactions and Analytics
In-Memory, Distributed, Relational
Data Center, Software, Cloud
53. MemSQL for the Spectrum of Transactions
Each Transaction Paramount
Guarantee that every individual transaction is persisted
No individual transaction can be lost
• Financial credits and debits
• Inventory movement
• Employee status
Transactional Aggregates Paramount
Capture massive event streams for immediate analysis
Transaction repetition/redundancy at the device level
• Event data and clickstreams
• Sensor data, Internet of Things
• Mobile applications
• Real-time streams