SimplifyStreamingArchitecture

SIMPLIFY YOUR STREAMING DATA
ARCHITECTURE WITH KAFKA & VOLTDB
Maheedhar Gunturu

© 2014 VoltDB PROPRIETARY
WHO AM I?
• Maheedhar Gunturu – Software & Solutions Architect @VoltDB
mgunturu@voltdb.com
@Vanguard_space
http://chat.voltdb.com/
• Previously:
  - Solutions Architect @ MapR
  - Working on Big Data systems since 2010
• Current interests include:
  - NVRAM
  - GPU co-processors for databases
  - Operationalizing deep-learning-based applications using Fast Data

COMPANY OVERVIEW
FAST: World Record Cloud Benchmark:
YCSB (Yahoo! Cloud Serving Benchmark) - 2.4 million tps (transactions per second)
Founded in 2009 by database luminary Mike Stonebraker*
VoltDB in the Magic Quadrant: “Operational Databases”
Other Stonebraker Companies
* Mike Stonebraker
• Professor at MIT in the Computer Science and Artificial Intelligence Laboratory
• Co-founder of VoltDB
• Creator of the C-Store and H-Store projects
• 2014 Turing Award winner
• His students include Mike Olson, Diane Greene, Daniel Abadi…

WHAT IS VOLTDB?
• An operational database purpose-built to run 100% in-memory at web scale
  - In-memory
  - Relational, SQL, ACID compliant
  - Scale-out on commodity hardware
  - Reliability, HA, fault tolerant
  - Integration with downstream systems, e.g. OLAP, Hadoop, DW
Best use cases: operational and transactional workloads

Fast (in motion)
Streaming Analytics:
real time summary and
aggregation
Transaction Processing:
per-event decisions using
context + history.
Big (at rest)
Exploration:
data science, investigation of
large data sets
Reporting:
recommendation matrixes,
search indexes, trend and BI

WHO NEEDS VOLTDB?
[Figure: 2x2 quadrant. Axes: "Time Critical" vs. "Not Time Critical" and "Unimportant Data" vs. "Important Data".]
Quadrant labels:
• Most streaming systems (exactly once semantics is very expensive)
• Most anything, Spark, etc.
• Data warehouses, RDBMS
• Transaction processing engines

Perishable insights can have exponentially more
value than after-the-fact traditional historical
analytics.

WHAT ARE SOME COMMON USE-CASES?
• Data (messages) stream in from humans or devices
  - Internet of Things (IoT)
  - Ad-bidding platforms
  - Telecommunications – OSS/BSS/NFV
  - Fintech – trading and risk assessment platforms
  - Consumer-facing online billing and booking systems
  - Massively multiplayer games
• At high volume
  - 100K ~ 1M messages/sec requires specialized software
• High availability
  - Nobody wants to go down these days

Streaming Analytics Transactions Transformations
• Materialized Views
• Capped Tables
• Ranking Indexes
• Stored Procedures
• Java + SQL
• ACID guarantees
• No client-side transaction control
• Stored Procedures
• Loaders/Importers
• Export Connectors
• Sessionization
• Enrichment
• JDBC connector
VoltDB architecture
Commodity HW HA + ACID Scale-out VM-friendly
How do we do it?
• XDCR – configurable out of the box.
• Geospatial capabilities for location based BI and decision making

The general-purpose RDBMS can’t scale

MODERN OLTP
1. Processing streams requires integrated access to state.
2. Using real time analytics requires a query interface.
3. Reacting to incoming events requires transactions.
State + Query + Transactions = OLTP
Fast
Streaming Analytics
Transaction Processing

Absolute Non-negotiables in VoltDB
• Transactional Consistency
• Extremely high throughput
• Linear Scalability
• Resiliency
• Minimum dev-ops administration
State, Speed, Scale, Stable, Simple

Our Customers are streaming important data
• Typical Deployments
• 100K to 1M transactions/sec
• Commodity Hardware
32GB - 256 GB RAM
16 Cores – 64 Cores
10 GigE ~ 40 GigE
• VoltDB runs on AWS, Azure, GCP, IBM Bluemix
Translytics = Transactions + Analytics

“A single, unified database that supports
transactions and analytics in real-time without
sacrificing transactional integrity, performance,
and scale.” - Mike Gualtieri, Forrester
Transactional
Operational
Analytical
Translytical Database
Traditional Stack For streaming data
Single data layer

How can queuing systems like
Kafka Simplify the architecture?

How do we perceive Kafka?

APACHE KAFKA
• High Throughput
• Low Latency
• Scalable
• Centralized
• Real-time

STATE OF THE ART KAFKA @ LINKEDIN

CAN KAFKA SCALE?
• Handles 1.4 trillion messages per day for various applications at LinkedIn
• Over 1400 brokers
• Can handle well over a few million messages/sec
• At-least-once delivery of messages
• Strong durability contract with replication
• Rich ecosystem
  - Espresso: offload MySQL replication
  - Venice: compute derived data
  - Nuage: a portal to manage topics and associated metadata
  - Gobblin: ingestion framework
  - MirrorMaker: replication

© 2015 Forrester Research, Inc. Reproduction Prohibited
JUST ENOUGH KAFKA
› Producer
› Broker
› Topics
› Partitions – Random & Semantic
› ISR – Leader & Controller, high watermark
› Zookeeper – offsets
› Consumer
› Consumer Groups
› Commit Log – TTLs, compactions

SO WHAT ARE THE CAVEATS WITH KAFKA?
• Make sure that the producer is set to acks = all
• Make sure “replica.lag.time.max.ms” is set to a minimum (match it with the VoltDB timeout)
• Make sure “replica.lag.max.messages” is set to a minimum (this parameter was removed as of Kafka 0.9)
• Disable unclean leader election: unclean.leader.election.enable = false
• Use default.replication.factor = 3
• Make sure that consumers read only committed messages: set min.insync.replicas = 2 (this is applied at the topic level and needed to be done manually before 0.9)
• Disable consumer auto-commit: “auto.commit.enable” = false (“enable.auto.commit” in the newer consumer)
• Disable automatic topic creation in Kafka
• “block.on.buffer.full” = true
• “max.in.flight.requests.per.connection” = 1
• Use a rebalance listener to limit duplicates
• Connection to ZooKeeper
• Monitor consumer lag via offsets
• Report consumer counts and errors to a separate topic
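A checklist like this one is easy to drift away from, so it can help to audit a configuration automatically. The following is a minimal sketch, not part of the original deck: `RECOMMENDED` and `audit` are hypothetical names, the keys use current Kafka property names (an assumption, since the slide spells some of them differently), and the code only compares dictionaries rather than talking to a cluster.

```python
# Recommended values taken from the checklist. Which properties file each key
# belongs in (producer, consumer, broker, topic) depends on the deployment and
# is not modeled here (an assumption of this sketch).
RECOMMENDED = {
    "acks": "all",
    "unclean.leader.election.enable": "false",
    "default.replication.factor": "3",
    "min.insync.replicas": "2",
    "enable.auto.commit": "false",
    "auto.create.topics.enable": "false",
    "max.in.flight.requests.per.connection": "1",
}


def audit(config):
    """Return {key: (actual, expected)} for every setting that deviates."""
    problems = {}
    for key, expected in RECOMMENDED.items():
        actual = str(config.get(key, "<unset>")).lower()
        if actual != expected:
            problems[key] = (actual, expected)
    return problems


# Hypothetical configuration with one wrong value and one missing key.
example = dict(RECOMMENDED, acks="1")
example.pop("min.insync.replicas")
issues = audit(example)
```

Here `issues` reports the weakened `acks` setting and the missing `min.insync.replicas`; an empty result would mean the configuration matches the checklist.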

HOW TO CONFIGURE KAFKA -> VOLTDB? (IMPORTER)
Sample config file
<import>
<configuration type="kafka" format="csv" enabled="true">
<property name="brokers">kafkasvr:9092</property>
<property name="topics">employees</property>
<property name="procedure">EMPLOYEE.insert</property>
</configuration>
<configuration type="kafka" enabled="true">
<property name="topics">employees</property>
<property name="procedure">EMPLOYEE.insert</property>
</configuration>
<configuration type="kafka" enabled="true">
<property name="topics">managers</property>
<property name="procedure">MANAGER.insert</property>
</configuration>
</import>
• Supports multiple data formats such as CSV (default), TSV, and JSON (refer to the documentation)
• Supports various types of data sources, such as Kafka and Kinesis
• Supply a list of brokers to pick up offsets
• Supply the name of the topic that contains the messages
• Supply the name of the stored procedure to invoke per event/message, which then inserts the result into the database
• “fetch.message.max.bytes”: the maximum size of a message fetched from Kafka (default 64 KB)
• “groupid”: the group the consumer belongs to
• “socket.timeout.ms”: the maximum time (in milliseconds) the socket connection waits before timing out
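To see what such an importer configuration routes where, the XML can be read with a few lines of Python. This is only an illustrative sketch: `topic_routes` is a hypothetical helper, the XML is a trimmed copy of the sample, and treating `topics` as a comma-separated list is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample importer configuration.
IMPORT_XML = """
<import>
  <configuration type="kafka" format="csv" enabled="true">
    <property name="brokers">kafkasvr:9092</property>
    <property name="topics">employees</property>
    <property name="procedure">EMPLOYEE.insert</property>
  </configuration>
  <configuration type="kafka" enabled="true">
    <property name="topics">managers</property>
    <property name="procedure">MANAGER.insert</property>
  </configuration>
</import>
"""


def topic_routes(xml_text):
    """Map each enabled Kafka topic to the stored procedure it invokes."""
    routes = {}
    root = ET.fromstring(xml_text.strip())
    for cfg in root.iter("configuration"):
        if cfg.get("type") != "kafka" or cfg.get("enabled") != "true":
            continue
        props = {p.get("name"): (p.text or "").strip() for p in cfg.iter("property")}
        # Assumption: "topics" may hold several comma-separated names.
        for topic in props.get("topics", "").split(","):
            if topic.strip():
                routes[topic.strip()] = props.get("procedure")
    return routes


routes = topic_routes(IMPORT_XML)
```

The resulting mapping makes the per-message flow explicit: every message on a topic triggers one invocation of the associated stored procedure.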

HOW TO CONFIGURE VOLTDB -> KAFKA?
(EXPORTER)

IS KAFKA & VOLTDB INTEGRATION IN PRODUCTION?
• According to our customer success team, as of today approximately 15-20% of our customers are using Kafka and VoltDB together.
Examples
• King Games (of Candy Crush fame): 5 nodes, 384 GB RAM, 32 cores; 300+ topics with more than 400,000 txns/sec @ 50% CPU utilization.
• MaxCDN (now StackPath, a global CDN): 11 nodes, 128 GB RAM, 16 cores; a couple of hundred topics with more than 500,000 txns/sec @ 30% CPU utilization.
• Nimble Storage (InfoSight dashboard & support): 9 nodes, 128 GB RAM, 64 cores; 50+ topics with more than 200,000 txns/sec @ 20~30% CPU utilization.
• We highly recommend this architecture if it meets the SLA requirements

SO WHAT DOES KAFKA BRING TO AN IN-MEMORY
DATABASE LIKE VOLTDB?
• Centralized infrastructure
• Recreate state
• Resiliency with at-least once delivery
• Impedance mismatch between applications
• Integrations with various applications
• Export and Import capabilities
• Cost optimization for HW

IDEMPOTENCE!
• Idempotence is the property of certain operations in mathematics and computer science that can be applied multiple times without changing the result beyond the initial application.
• At-Least-Once Delivery + Idempotent Operations = Exactly Once Semantics

Idempotent
• set x = 5; is the same as set x = 5; set x = 5;
• if (x % 2 == 0) x++; is the same as if (x % 2 == 0) x++; if (x % 2 == 0) x++;
• spill coffee on brown pants
Not Idempotent
• x++; is not the same as x++; x++;
• if (x % 2 == 0) x *= 2; is not the same as if (x % 2 == 0) x *= 2; if (x % 2 == 0) x *= 2; (a nonzero even x stays even, so the second application doubles it again)
• eat whole plate of spaghetti
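The claim that at-least-once delivery plus idempotent operations yields exactly-once results can be checked with a small Python sketch. It assumes each message carries a unique `event_id` (a hypothetical field, not something the slides specify) and simulates at-least-once delivery by delivering one message twice.

```python
def apply_increment(state, event):
    """Not idempotent: a redelivered event is counted again."""
    state["total"] = state.get("total", 0) + event["amount"]


def apply_upsert(state, event):
    """Idempotent: writes are keyed by event_id, so a redelivery overwrites
    the same entry instead of adding a new one."""
    state.setdefault("by_id", {})[event["event_id"]] = event["amount"]
    state["total"] = sum(state["by_id"].values())


# At-least-once delivery: event 2 arrives twice (hypothetical data).
delivered = [
    {"event_id": 1, "amount": 10},
    {"event_id": 2, "amount": 5},
    {"event_id": 2, "amount": 5},
]

naive, deduped = {}, {}
for event in delivered:
    apply_increment(naive, event)
    apply_upsert(deduped, event)
```

The increment handler ends up over-counting because of the duplicate, while the keyed upsert arrives at the same total it would have reached had each event been delivered exactly once.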

What interesting problems do we solve?
• Correlation – streaming Join (state management)
• Out of order delivery
• At least once delivery – How to dedup
• Precise Accounting
• Precise Statistics – Event time vs processing time
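Several of these problems (out-of-order delivery, deduplication, event time versus processing time) can be seen together in a short Python sketch. It is an illustration under stated assumptions rather than how VoltDB solves them: the `(event_id, event_time)` tuple shape, the 10-second window, and the function name are all hypothetical.

```python
from collections import defaultdict


def count_by_event_time(events, window_secs=10):
    """Count unique events per window, bucketing on each event's own timestamp
    (event time) rather than on arrival order (processing time)."""
    seen = set()
    counts = defaultdict(int)
    for event_id, event_ts in events:
        if event_id in seen:  # duplicate caused by at-least-once delivery
            continue
        seen.add(event_id)
        window_start = event_ts - (event_ts % window_secs)
        counts[window_start] += 1
    return dict(counts)


# (event_id, event_time_in_seconds); "c" arrives late and "b" arrives twice.
arrivals = [("a", 101), ("b", 112), ("b", 112), ("d", 121), ("c", 109)]
result = count_by_event_time(arrivals)
```

Because the bucket is derived from the event's timestamp, the late event "c" still lands in the window starting at 100, and the duplicate "b" is counted once. Doing this correctly requires state (the set of seen IDs and the per-window counts), which is the point the slide makes about state management.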

FAST DATA: APACHE-STYLE
[Diagram: applications, message queues, and data sources feed a pipeline with the stages Ingest, Analyze, Decide, and Export & Pipeline.]
Functions shown: counters, aggregations, time series, statistics; store results; query and recombine; fast serving; per-event policy evaluations; responses (synchronous); side-effects (asynchronous).
Components shown: Kafka / RabbitMQ; Storm, Flume, Sqoop; Storm + Serving Layer; Spark + Serving Layer; Cassandra, HBase; Hadoop, message queues.

FAST DATA: VOLTDB-STYLE
[Diagram: applications, message queues, and data sources feed a pipeline with the stages Ingest, Analyze, Decide, and Export & Pipeline.]
Functions shown: counters, aggregations, time series, statistics; store results; query and recombine; fast serving; per-event policy evaluations; responses (synchronous); side-effects (asynchronous).
Components shown: Kafka / RabbitMQ; VoltDB (SQL, Java for analytics; transactions / ACID); Hadoop, message queues.

THREE MAJOR DRAWBACKS OF STREAMING
SOLUTIONS
• Streaming solutions lack context
• filter, aggregate and join operations require state.
• need backend databases to support decisions
• good for fast ingestion only.
• Streaming solutions are not architected for real-time decisions
• not ACID (atomicity, consistency, isolation, durability)
• no support for JDBC/ODBC
• Good for algorithmic processing of windowed data
• Streaming solutions lack operational transparency
• good for statically configured topological results
• need back-end databases for storing aggregates/counters

CEP / STREAM PROCESSING VS. VOLTDB
Common characteristics
• Process high-speed streaming data
• Ingest thousands to millions of events per second
• Can function as part of a data pipeline
• Basic event alerting or enrichment
Stream processing is the right choice when...
• Unstructured audio, video, image, signal processing data streams
• Microsecond latencies are needed
• Examine stream for temporal pattern detection
VoltDB is the right choice when...
• Real-time streaming analytics: calculation and serving
• Transactional decisions per event: data informed
• Ad hoc queries of state
• Common interface across data architecture stack (SQL)

INTEGRATING DATA SOURCES WITH VOLTDB
BULK LOADERS
• CSV loader
• Kafka loader
• JDBC loader
• Vertica UDx
• Extensible loader API
APPLICATION INTERFACES
• JDBC
• ODBC
• HTTP JSON
• Native client drivers / SDKs

INTEGRATING VOLTDB WITH EXPORT TARGETS
• Local file system export
• JDBC export
• Kafka export
• Elasticsearch export
• HDFS export
• HTTP export
• Extensible API

VOLTDB EXPORT UI
ddl.sql
CREATE TABLE events (
EventID INTEGER,
time TIMESTAMP,
msg VARCHAR(128));
EXPORT TABLE events;
deployment.xml
<export enabled="true" target="file">
Application SQL
INSERT into TABLE values…

ACID PROCESSING
• Sync intra-cluster replication
• Replicated durability
• High availability (configurable)
• Serializable isolation
• Ad-hoc SQL or stored procedures.
• Partitioned & distributed transactions
• Load balanced reads across replicas

MATERIALIZED VIEWS
• Declarative SQL
• Fully transactional
• Supports ad-hoc query
CREATE VIEW registrations_by_zipcode (
zipcode, registered_voters
) AS
SELECT zipcode, count(*) from voters
where registration=1 GROUP BY zipcode;
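Conceptually, a materialized view like this one keeps its aggregate current on every insert instead of rescanning the base table at query time. The Python sketch below models that idea only; it is not a description of VoltDB's implementation, and the class name and sample rows are hypothetical.

```python
from collections import Counter


class RegistrationsByZipcode:
    """Keeps a per-zipcode count current as rows are inserted, instead of
    rescanning the voters table on every query."""

    def __init__(self):
        self.counts = Counter()

    def on_insert(self, row):
        # Mirrors the view's WHERE registration=1 filter.
        if row["registration"] == 1:
            self.counts[row["zipcode"]] += 1

    def query(self, zipcode):
        # A Counter returns 0 for zipcodes it has never seen.
        return self.counts[zipcode]


view = RegistrationsByZipcode()
for row in [
    {"zipcode": "01730", "registration": 1},
    {"zipcode": "01730", "registration": 0},
    {"zipcode": "02139", "registration": 1},
    {"zipcode": "01730", "registration": 1},
]:
    view.on_insert(row)
```

Reads such as `view.query("01730")` are then a single lookup, which is why views of this kind suit streaming aggregation: the cost is paid incrementally at write time.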

MV FOR STREAMING AGGREGATION
• Partitioned on cluster
• Immediately up-to-date
• Active/active HA
Global Read: SELECT
sum(count) WHERE sec > 130
and sec < 140;

REAL-TIME ANALYTICS IN VOLTDB
• Counters
• Counting is exceedingly hard at scale
• VoltDB was designed to excel at counters
• Aggregates
• Materialized views maintain slices of fast moving data
and enable fast access
• Group by keys + time functions (day, hour, minute,
second)
• Query views for time-series rollups, e.g. “last 30
minutes”
• Leaderboards
• Leaderboards rank items by size, value or amplitude
• Index optimizations enable fast ranking of records
within large sets
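The time-series rollup and leaderboard patterns can be sketched in plain Python to show the shape of the computation. This is a hedged illustration with hypothetical names and data: in VoltDB the per-minute buckets would be a materialized view and the ranking would be served by an index, not by in-process dictionaries.

```python
import heapq
from collections import Counter

per_minute = Counter()  # minute bucket -> event count


def record(event_ts):
    """Add one event (epoch seconds) to its minute bucket."""
    per_minute[event_ts // 60] += 1


def count_last_minutes(now_ts, minutes=30):
    """Sum the buckets covering the most recent `minutes` minutes."""
    current = now_ts // 60
    return sum(per_minute[m] for m in range(current - minutes + 1, current + 1))


def top_n(scores, n=3):
    """Return the n highest (name, score) pairs, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


for ts in (0, 59, 60, 1799, 1800, 3600):  # hypothetical event times
    record(ts)

recent = count_last_minutes(now_ts=1800)
leaders = top_n({"ann": 120, "bob": 95, "cy": 300, "dee": 210}, n=2)
```

Grouping by a truncated timestamp keeps the number of buckets small, so a "last 30 minutes" query touches at most 30 rows regardless of how many events arrived.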

REVIEW
[Diagram: application event sources connect through the VoltDB client interface to Partition Replica 1 and Partition Replica 2, which feed an export destination (OLAP, HTTP).]
• SQL + Java transactions
• JSON column values
• HA in-memory processing
• ACID (durable to disk)
• Ranking indexes
• Indexes on functions
• Capped tables
• Mat. views: RT aggregation
• Append only export
• 1-5 ms @ 99% responses

QUESTIONS?
• Developer
• The source code for Kafka – VoltDB connector?
https://github.com/VoltDB/voltdb-kafka-connector
• The Developer Guide to Streaming Data Applications
http://learn.voltdb.com/WPThreeContenders.html
• Architect:
• Download our new ‘recipes’ eBook
https://voltdb.com/ebook/fast-data-recipes
• Email us your questions: askanengineer@voltdb.com
• http://chat.voltdb.com/
careers@voltdb.com

OPENET
Application/Use Case
• Openet enables the world's largest network operators to innovate
service offerings in an increasingly mobile, data-driven society
• Applications include Policy Manager, Evolved Charging,
Convergent Mediation
Why VoltDB?
• Performance and scalability that provides real-time control of
network resource consumption, and real-time interaction between
network systems and their users
• Virtualized platform for elasticity and ease of operations
• Simplified deployments with ACID and built-in availability for risk
averse Telco customers
• Saves $0.5 million/customer installation; unlimited scale in the
cloud
“VoltDB is the logical choice for a cloud-deployable, transactional
database that can flexibly handle high-volume data streams for
service providers to monitor and leverage in real time.”

ASIAINFO
• Advanced IT software solutions and services for the
telecommunications industry
• Veris Convergent Context-awareness Center (C3)
• User session management and stream processing
that enables real-time matching and data processing
Why VoltDB?
• High transaction performance with immediate user device
make and model identification
• URL/website matching and real-time campaign triggers
based on VoltDB to enable very rapid processing
• SQL, ACID, data integrity, and disaster recovery
• Reduced TCO: more than 500,000 transactions per second
on commodity servers

EAGLE INVESTMENT SYSTEMS, BNY MELLON
• Eagle Investment Systems is a leading provider of financial
services technology and a subsidiary of BNY Mellon.
• VoltDB powers Eagle’s cloud-based software for tracking the
performance of investment portfolios and analyzing
performance risks
Why VoltDB?
• Performance and transparent scalability to meet application
workloads and client SLAs
• High speed data cache for risk calculations with large and
rapidly changing data sets
• Lower TCO
“We deliver the best in class technology to our clients, and when we evaluated VoltDB, we
discovered that it suited our requirements. With its in-memory, high-velocity database,
VoltDB provides us a great foundation to enhance our current and future offerings.”
Marc Firenze, CTO

ERICSSON Application/Use Case
• Ericsson MediaFirst is an end-to-end cloud-based
platform for the creation, management and delivery of
next generation Pay TV
Why VoltDB?
• Enabled Ericsson to move from batch and manual
processing to real-time user session monitoring of tens to
hundreds of millions of users
• Ensure user experience across devices
• Attract, retain and monetize new subscribers
• Cloud ready (Azure)
• Agility - quickly develop and deploy TV services across all
end-points at web speeds to respond to changing market
trends and conditions
“VoltDB gives us up-to-the-second operational visibility into the performance of the systems across our
customers’ carrier grade TV networks as well as enable real-time user targeting. The database gives our platform
competitive advantage by letting us analyze device and user data as it comes in from Tier One providers”
Mark Hydar, Head of Engineering, Ericsson MediaFirst

SMART METER MARKET LEADERS PICK VOLTDB*
* > 60 million meters under management
Leader in the Gartner
Magic Quadrant
Announced Utility Customers
• UK Smart Meter
• Shikoku Electric Power
• Hokkaido Electric Power

FLYTXT Application/Use Case
• Customer Experience, Revenue Management and Data
Monetization Solutions for mobile operators
• Drive campaigns to increase revenue, reduce churn,
enhance loyalty and create new revenue streams
Business Impact
• 1.2% incremental revenue
• 18.9% higher ARPU
• Conversion rates 40%-300% higher
Why VoltDB?
• Performance and scale to drive its real-time analytics
platform to extract actionable intelligence from 4 billion
events/day streaming from more than 200 million mobile
subscribers
• Monetize data faster, more efficiently and at lower cost
“The partnership with VoltDB enhances our platform’s capability
to act upon insights derived from subscriber actions in real time.”
Prateek Kapadia, CTO

AIRPUSH
• Managing online mobile advertising
• Manages over 120,000 live applications
Why VoltDB?
• Replaced costly MySQL infrastructure with scalable VoltDB
cluster
• Enabled accurate ad-campaign balance tracking, dramatically
improving “last-dollar” decisions, saving millions in budget
overages
• Eliminated the opportunity cost of placing wrong ads
• Reduced infrastructure cost by 93% (7 servers vs. 100)
“Achieved a previously impossible level of budget
management accuracy”

SIMPLIFYING THE LAMBDA ARCHITECTURE
Use Case
• Content delivery network service provider
• Counting content views
Why VoltDB?
• Real-time analytics+ transactions w/scale
• Replaced Storm, Cassandra with VoltDB for
real-time streaming aggregations with
“exactly once” semantics
Bottom line
• Accurate – guaranteed correct results with
VoltDB’s ‘exactly-once’ semantics
• Faster time to market
• 32 TB of data processed with 7 servers
• 1/10th the resources of the alternatives

MOBILE
Use Case
• The Emagine real-time event decision
making platform for Communications
Service Providers (CSPs)
Why VoltDB?
• Real-time analytics+ transactions
• Scale - billions of network events per day,
analyzing hundreds of thousands of
transactions simultaneously, and then
intelligently interacting with customers
Bottom line
• 3 ms system response time
• 253% increase in offer purchases
Real-Time Event Decisioning

VOLTDB: A BEAUTIFUL ARCHITECTURE
[Diagram: a VoltDB cluster of servers, each hosting partitions (Partition 1, Partition 2, Partition 3). Inside a partition: work queue, execution engine, table and index data.]

WHY VOLTDB?
Faster
Smarter Better
• Superior architecture for fast data/translytics
• In-Memory, Scale-out, ACID,
SQL+JSON
• Rapid data ingestion with transactions
• Data durability and HA
• VoltDB customers realize exceptional
business value
