SlideShare a Scribd company logo
1 of 23
Ron Bodkin 
Founder & President, Think Big 
September 2014 
Real-time Big Data Analytics with Storm
2 
Agenda 
Ÿ Storm refresher 
Ÿ When is Storm appropriate? 
Ÿ API Options 
Ÿ Use Cases 
Copyright 2013-2014 Think Big, a Teradata Company
3 
Storm Overview 
Ÿ DAG Processing of never ending streams of data 
- Open Source: https://github.com/nathanmarz/storm/wiki 
- Used at Twitter plus dozens of other companies 
- Reliable - At Least Once semantics 
- Think MapReduce for data streams 
- Java / Clojure based 
- Bolts in Java and ‘Shell Bolts’ 
- Not a queue, but usually reads from a queue. 
Ÿ Related: 
- Spark Streaming, Samza, S4, CEP 
Ÿ Compromises 
- Static topologies & cluster sizing – Avoid messy dynamic rebalancing 
- Reliability: relies on reliable or durable data source, Nimbus SPOF 
- State not built in 
Ÿ Community Support, Hortonworks commercial support 
Copyright 2013-2014 Think Big, a Teradata Company
4 
Storm Concepts 
Ÿ Cluster 
- Nimbus 
- Zookeeper 
- Supervisor, Workers 
Ÿ Topology 
Ÿ Streams 
Ÿ Tuple 
Ÿ Spout 
Ÿ Bolt 
Ÿ Stream Groupings 
- Shuffle, Field 
Ÿ Trident 
Ÿ DRPC 
Copyright 2013-2014 Think Big, a Teradata Company
5 
What is Real-Time? 
Ÿ Low latency 
- Query response 
- Data refresh 
- End-to-end response 
Ÿ … nanoseconds, milliseconds, seconds, or minutes depending on 
your problem 
Copyright 2013-2014 Think Big, a Teradata Company
6 
Why Real-time? 
Ÿ Better end-user experience 
- Ex: View an ad, see the counter move. 
Ÿ Operational intelligence 
- Low latency analysis 
- Operational intelligence 
Ÿ Event response 
- Content half life measured at 3 hours (H Mason: http://bit.ly/nu7IDw) 
- Path to additional real-time capabilities 
- Scalable analysis 
- Example: Trend analysis to recommend ‘hot’ articles. 
Copyright 2013-2014 Think Big, a Teradata Company
7 
Why Storm? 
Ÿ Scalable way to manage queue readers and logic pipeline 
Ÿ Much better than roll your own 
Ÿ Reliable (Message guarantees, fault tolerant) 
Ÿ Fast (over 1MM messages/node/s in some tests) 
Ÿ Multi-node scaling 
Ÿ It works 
Ÿ Open source community 
Ÿ It runs on YARN 
Ÿ See also: https://github.com/nathanmarz/storm/wiki/Rationale 
Copyright 2013-2014 Think Big, a Teradata Company
8 
Programming Options 
Ÿ Java 
- Raw API (tuples and raw computation) 
- Trident – Java DSL (SQL-like primitives) micro-batch, exactly once 
Ÿ Python 
Ÿ Scala 
Ÿ Clojure 
Ÿ Ruby 
Ÿ Trident-ML 
Ÿ Storm-Esper 
Copyright 2013-2014 Think Big, a Teradata Company
9 
Use Cases 
Ÿ Scale analysis pipeline 
Ÿ Lively stats 
Ÿ Recommendations 
Ÿ Better predictions 
Ÿ Realtime analytics 
Ÿ Online machine learning 
Ÿ Distributed RPC 
Copyright 2013-2014 Think Big, a Teradata Company
10 
Overall Architecture 
Monitoring & Alerting (Metrics, Alarms, Notifications) 
Memcache (Tuple fail tracking) 
NoSQL 
Queue 
Archive Logs 
Hadoop 
Ad Serving 
LB 
Edge 
Edge 
Impression 
S3 
S3 
S3 
DFS 
Management 
Server 
LB 
Edge 
Edge 
Relational 
Store 
Ad Management 
Ad Selling 
Storm 
- Queue Management 
- Simple Bot Filtering 
- Real-time Bucketization 
- Performance Counters 
- Event Logging 
View Ad 
Cleansing 
Model Training 
Recommendations 
Events 
Model Parameters 
getPrediction 
Performance Counters 
Impression Buckets 
Copyright 2013-2014 Think Big, a Teradata Company
11 
Integration Patterns 
Copyright 2013-2014 Think Big, a Teradata Company
12 
Analytics Architecture 
Storm 
Time Series 
Simple Bot BucketBolt 
Annotator 
Web 
Server 
DFS 
Adapter 
Impression 
Spout 
Time Series 
Buckets 
(Batch) 
Time Series 
Buckets 
(Realtime) 
Impression 
Prediction 
Predictive 
Model 
Parameters 
Impressions 
Impressions 
Impressions 
Hadoop 
Impression 
Bucketization 
Predictive 
Model 
Training 
NoSQL 
Bolt 
Time 
Copyright 2013-2014 Think Big, a Teradata Company
13 
Analyze 
Massive Historical 
Data Set 
Analyze 
Recent 
Past 
Realtime 
Prediction 
Solution Approach 
Historical Data Set = S3 
Analyze = Hadoop + Pig + R 
Recent Past = Storm + NoSQL 
Analyze = R + Web Service 
Copyright 2013-2014 Think Big, a Teradata Company
Ad Hoc Classic BI Access 
Queries 
14 
Simple Real-Time Recommendation Updates 
Web Servers 
Read & 
Update 
• Requests read single user 
profile 
• May update that profile record 
• Classic key/value store 
problem 
• Can split transient state into 
read/write NoSQL with batch 
view NoSQL 
Batch 
Log Feed 
Batch 
Model 
Feed 
Hadoop Cluster 
Batch Export 
MPP Database 
NoSQL 
Database 
Copyright 2013-2014 Think Big, a Teradata Company
15 
Storm Topology (Greatly Simplified) 
S3 
S3 
S3 
S3 
DynamoDB 
Performance 
Counters<T> 
S3 
Adapter<T> 
Event 
Spout 
DynamoDB 
Adapter<T> 
Command 
Spout 
SQS 
BucketBolt 
SimpleBot 
Annotator 
DynamoDB 
Adapter<T> 
SQS 
Copyright 2013-2014 Think Big, a Teradata Company
Ad Hoc Classic BI Access 
Queries 
16 
Real-Time Recommendation Updates w/ Storm 
Web Servers 
• More complex 
recommendation logic 
• Processing benefits from 
distributed logic of 
streaming 
• Parallelize updates 
• Parallelize analysis 
Batch 
Log Feed 
Update 
Batch 
Model 
Feed 
Hadoop Cluster 
Batch Export 
Storm 
MPP Database 
NoSQL 
Database 
Read & 
write 
Copyright 2013-2014 Think Big, a Teradata Company
17 
Use of Storm in Recommendations 
Ÿ More common: asynchronous updates of model 
- Time-sensitive 
- E.g., updating many time series, single event impacts many items 
Ÿ More advanced: score many items 
- Multi-dimensional models (item, user, context) so costly to precompute 
- Distribute computation at read-time: still blend history with recent 
Copyright 2013-2014 Think Big, a Teradata Company
18 
Use of Storm in “targeted user 360” use case 
Copyright 2013-2014 Think Big, a Teradata Company
19 
Storm for Energy Grid Management 
Ÿ Kafka - Data Ingestion 
- Meter / sensor data from grid points and major end users 
- GIS, CRM, Lightning strike and Weather data 
- Energy production data from solar & wind (transient sources) 
- Forecast data for weather, energy production, energy markets 
Ÿ Storm - Streams processing 
- Parse, filter, distribute data to 
• Search (Solr or Elasticsearch) 
• Archive 
• Model training (Hadoop) 
• Data serving (NoSQL) 
- Topologies implement distributed predictive models 
• Computationally intensive 
• Predict outages 
• Predict grid stability problems 
• Predict market prices 
- Publish alerts & warnings 
Ÿ DRPC to query complex grid for Dashboards and investigation 
Copyright 2013-2014 Think Big, a Teradata Company
20 
Finance: Trade Compliance Monitoring 
Order Mgmt 
System 
• Storm reduces market data to 
usable subset 
• Integrates high, medium and 
low frequency data 
• Real-time Storm filter triggers 
deeper search via NoSQL 
Batch 
Log 
Feed 
Batch 
Model 
Feed 
Hadoop Cluster 
Batch 
Export 
Classic BI 
Access 
Storm 
Ad Hoc 
Queries 
MPP Database 
NoSQL 
Database 
Market Data 
Feed 
Copyright 2013-2014 Think Big, a Teradata Company
21 
Use of Storm in Trade Compliance Monitoring 
Ÿ Market data ingestion and processing 
- Up to millions of messages per second 
- Requires filtering, averaging, aggregation 
Ÿ Trade information 
- Orders and executions – up to tens of thousands of messages/second 
- Filtering and aggregation 
Ÿ Reference and static data also incorporated 
Ÿ Market and trade information have to be joined by security and time 
Ÿ Models used to detect anomalous trading behavior over time 
Ÿ Simple filtering may be used to detect restricted behavior 
Ÿ Alerts, reports must be immediate to reduce economic impact 
Ÿ Recent comparative analysis for financial data: 
- Within 50% of costly commercial software performance 
- 25x better price/performance 
Copyright 2013-2014 Think Big, a Teradata Company
23 
Lessons 
Ÿ Easy to develop, hard to debug – start locally (timeouts) 
Ÿ Storm infinite loop of failures 
- Use Memcached to count # tuple failures 
Ÿ At Least Once processing 
- Hadoop based read-repair job 
Ÿ Performance Counters not getting flushed 
- Tick Tuples 
Ÿ Always ACK 
Ÿ Batching to S3/HDFS 
- Run a compaction & event de-duplication job in Hadoop 
- Or use exactly once semantics e.g., with Trident 
Ÿ Queues to connect topology parts with different time frames 
Copyright 2013-2014 Think Big, a Teradata Company
24 
Conclusions 
Ÿ There’s many kinds of real-time problems 
Ÿ Use of Hadoop and/or NoSQL can solve 
- Low latency queries 
- Event response with localized intelligence 
- Operational intelligence 
Ÿ Storm is valuable for 
- Ingesting data within seconds 
- Complex real-time distributed logic 
- Operational intelligence 
- Use cases in many industries and functions 
Ÿ We’re Hiring! 
Copyright 2013-2014 Think Big, a Teradata Company

More Related Content

What's hot

Apache Storm
Apache StormApache Storm
Apache StormEdureka!
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormEugene Dvorkin
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Brian O'Neill
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Folio3 Software
 
Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridDataWorks Summit
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014P. Taylor Goetz
 
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Adrianos Dadis
 
Experience with Kafka & Storm
Experience with Kafka & StormExperience with Kafka & Storm
Experience with Kafka & StormOtto Mok
 
Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormMd. Shamsur Rahim
 
STORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOPSTORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOPDataWorks Summit
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignMichael Noll
 
Storm presentation
Storm presentationStorm presentation
Storm presentationShyam Raj
 
Real-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaReal-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaAndrew Montalenti
 
Real-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormReal-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormJohn Georgiadis
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark StreamingP. Taylor Goetz
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 

What's hot (20)

Apache Storm
Apache StormApache Storm
Apache Storm
 
Storm and Cassandra
Storm and Cassandra Storm and Cassandra
Storm and Cassandra
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
 
Introduction to Storm
Introduction to StormIntroduction to Storm
Introduction to Storm
 
Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop Grid
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
 
Experience with Kafka & Storm
Experience with Kafka & StormExperience with Kafka & Storm
Experience with Kafka & Storm
 
Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache Storm
 
STORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOPSTORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOP
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Storm presentation
Storm presentationStorm presentation
Storm presentation
 
Real-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaReal-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and Kafka
 
Real-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormReal-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and Storm
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 

Similar to Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics

Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopHazelcast
 
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...Altan Khendup
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduCloudera, Inc.
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San JoseCloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San JoseMingliang Liu
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworksIJDKP
 
Cassandra summit-2013
Cassandra summit-2013Cassandra summit-2013
Cassandra summit-2013dfilppi
 
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
[Webinar] Measure Twice, Build Once: Real-Time Predictive AnalyticsInfochimps, a CSC Big Data Business
 
Lambda kappa architecture - the jury are still out
Lambda   kappa architecture - the jury are still outLambda   kappa architecture - the jury are still out
Lambda kappa architecture - the jury are still outYoav chernobroda
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Cloudera, Inc.
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in SparkSnappyData
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationInside Analysis
 
Hybrid architecture integrateduserviewdata-peyman_mohajerian
Hybrid architecture integrateduserviewdata-peyman_mohajerianHybrid architecture integrateduserviewdata-peyman_mohajerian
Hybrid architecture integrateduserviewdata-peyman_mohajerianData Con LA
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detectionhadooparchbook
 
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo BrignoliL'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo BrignoliData Driven Innovation
 
Use Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdfUse Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdfM Waleed Kadous
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...confluent
 

Similar to Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics (20)

Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
 
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San JoseCloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworks
 
Cassandra summit-2013
Cassandra summit-2013Cassandra summit-2013
Cassandra summit-2013
 
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
 
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
 
Lambda kappa architecture - the jury are still out
Lambda   kappa architecture - the jury are still outLambda   kappa architecture - the jury are still out
Lambda kappa architecture - the jury are still out
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in Spark
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter Integration
 
Hybrid architecture integrateduserviewdata-peyman_mohajerian
Hybrid architecture integrateduserviewdata-peyman_mohajerianHybrid architecture integrateduserviewdata-peyman_mohajerian
Hybrid architecture integrateduserviewdata-peyman_mohajerian
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo BrignoliL'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
 
Use Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdfUse Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdf
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
 

More from Data Con LA

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA
 

More from Data Con LA (20)

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup Showcase
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learning
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentation
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWS
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data Science
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
 

Recently uploaded

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 

Recently uploaded (20)

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 

Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics

  • 1. Ron Bodkin Founder & President, Think Big September 2014 Real-time Big Data Analytics with Storm
  • 2. 2 Agenda Ÿ Storm refresher Ÿ When is Storm appropriate? Ÿ API Options Ÿ Use Cases Copyright 2013-2014 Think Big, a Teradata Company
  • 3. 3 Storm Overview Ÿ DAG Processing of never ending streams of data - Open Source: https://github.com/nathanmarz/storm/wiki - Used at Twitter plus dozens of other companies - Reliable - At Least Once semantics - Think MapReduce for data streams - Java / Clojure based - Bolts in Java and ‘Shell Bolts’ - Not a queue, but usually reads from a queue. Ÿ Related: - Spark Streaming, Samza, S4, CEP Ÿ Compromises - Static topologies & cluster sizing – Avoid messy dynamic rebalancing - Reliability: relies on reliable or durable data source, Nimbus SPOF - State not built in Ÿ Community Support, Hortonworks commercial support Copyright 2013-2014 Think Big, a Teradata Company
  • 4. 4 Storm Concepts Ÿ Cluster - Nimbus - Zookeeper - Supervisor, Workers Ÿ Topology Ÿ Streams Ÿ Tuple Ÿ Spout Ÿ Bolt Ÿ Stream Groupings - Shuffle, Field Ÿ Trident Ÿ DRPC Copyright 2013-2014 Think Big, a Teradata Company
  • 5. 5 What is Real-Time? Ÿ Low latency - Query response - Data refresh - End-to-end response Ÿ … nanoseconds, milliseconds, seconds, or minutes depending on your problem Copyright 2013-2014 Think Big, a Teradata Company
  • 6. 6 Why Real-time? Ÿ Better end-user experience - Ex: View an ad, see the counter move. Ÿ Operational intelligence - Low latency analysis - Operational intelligence Ÿ Event response - Content half life measured at 3 hours (H Mason: http://bit.ly/nu7IDw) - Path to additional real-time capabilities - Scalable analysis - Example: Trend analysis to recommend ‘hot’ articles. Copyright 2013-2014 Think Big, a Teradata Company
  • 7. 7 Why Storm? Ÿ Scalable way to manage queue readers and logic pipeline Ÿ Much better than roll your own Ÿ Reliable (Message guarantees, fault tolerant) Ÿ Fast (over 1MM messages/node/s in some tests) Ÿ Multi-node scaling Ÿ It works Ÿ Open source community Ÿ It runs on YARN Ÿ See also: https://github.com/nathanmarz/storm/wiki/Rationale Copyright 2013-2014 Think Big, a Teradata Company
  • 8. 8 Programming Options Ÿ Java - Raw API (tuples and raw computation) - Trident – Java DSL (SQL-like primitives) micro-batch, exactly once Ÿ Python Ÿ Scala Ÿ Clojure Ÿ Ruby Ÿ Trident-ML Ÿ Storm-Esper Copyright 2013-2014 Think Big, a Teradata Company
  • 9. 9 Use Cases Ÿ Scale analysis pipeline Ÿ Lively stats Ÿ Recommendations Ÿ Better predictions Ÿ Realtime analytics Ÿ Online machine learning Ÿ Distributed RPC Copyright 2013-2014 Think Big, a Teradata Company
  • 10. 10 Overall Architecture Monitoring & Alerting (Metrics, Alarms, Notifications) Memcache (Tuple fail tracking) NoSQL Queue Archive Logs Hadoop Ad Serving LB Edge Edge Impression S3 S3 S3 DFS Management Server LB Edge Edge Relational Store Ad Management Ad Selling Storm - Queue Management - Simple Bot Filtering - Real-time Bucketization - Performance Counters - Event Logging View Ad Cleansing Model Training Recommendations Events Model Parameters getPrediction Performance Counters Impression Buckets Copyright 2013-2014 Think Big, a Teradata Company
  • 11. 11 Integration Patterns Copyright 2013-2014 Think Big, a Teradata Company
  • 12. 12 Analytics Architecture Storm Time Series Simple Bot BucketBolt Annotator Web Server DFS Adapter Impression Spout Time Series Buckets (Batch) Time Series Buckets (Realtime) Impression Prediction Predictive Model Parameters Impressions Impressions Impressions Hadoop Impression Bucketization Predictive Model Training NoSQL Bolt Time Copyright 2013-2014 Think Big, a Teradata Company
  • 13. 13 Analyze Massive Historical Data Set Analyze Recent Past Realtime Prediction Solution Approach Historical Data Set = S3 Analyze = Hadoop + Pig + R Recent Past = Storm + NoSQL Analyze = R + Web Service Copyright 2013-2014 Think Big, a Teradata Company
  • 14. Ad Hoc Classic BI Access Queries 14 Simple Real-Time Recommendation Updates Web Servers Read & Update • Requests read single user profile • May update that profile record • Classic key/value store problem • Can split transient state into read/write NoSQL with batch view NoSQL Batch Log Feed Batch Model Feed Hadoop Cluster Batch Export MPP Database NoSQL Database Copyright 2013-2014 Think Big, a Teradata Company
  • 15. 15 Storm Topology (Greatly Simplified) S3 S3 S3 S3 DynamoDB Performance Counters<T> S3 Adapter<T> Event Spout DynamoDB Adapter<T> Command Spout SQS BucketBolt SimpleBot Annotator DynamoDB Adapter<T> SQS Copyright 2013-2014 Think Big, a Teradata Company
  • 16. Ad Hoc Classic BI Access Queries 16 Real-Time Recommendation Updates w/ Storm Web Servers • More complex recommendation logic • Processing benefits from distributed logic of streaming • Parallelize updates • Parallelize analysis Batch Log Feed Update Batch Model Feed Hadoop Cluster Batch Export Storm MPP Database NoSQL Database Read & write Copyright 2013-2014 Think Big, a Teradata Company
  • 17. 17 Use of Storm in Recommendations Ÿ More common: asynchronous updates of model - Time-sensitive - E.g., updating many time series, single event impacts many items Ÿ More advanced: score many items - Multi-dimensional models (item, user, context) so costly to precompute - Distribute computation at read-time: still blend history with recent Copyright 2013-2014 Think Big, a Teradata Company
  • 18. 18 Use of Storm in “targeted user 360” use case Copyright 2013-2014 Think Big, a Teradata Company
  • 19. 19 Storm for Energy Grid Management Ÿ Kafka - Data Ingestion - Meter / sensor data from grid points and major end users - GIS, CRM, Lightning strike and Weather data - Energy production data from solar & wind (transient sources) - Forecast data for weather, energy production, energy markets Ÿ Storm - Streams processing - Parse, filter, distribute data to • Search (Solr or Elasticsearch) • Archive • Model training (Hadoop) • Data serving (NoSQL) - Topologies implement distributed predictive models • Computationally intensive • Predict outages • Predict grid stability problems • Predict market prices - Publish alerts & warnings Ÿ DRPC to query complex grid for Dashboards and investigation Copyright 2013-2014 Think Big, a Teradata Company
  • 20. 20 Finance: Trade Compliance Monitoring Order Mgmt System • Storm reduces market data to usable subset • Integrates high, medium and low frequency data • Real-time Storm filter triggers deeper search via NoSQL Batch Log Feed Batch Model Feed Hadoop Cluster Batch Export Classic BI Access Storm Ad Hoc Queries MPP Database NoSQL Database Market Data Feed Copyright 2013-2014 Think Big, a Teradata Company
  • 21. 21 Use of Storm in Trade Compliance Monitoring Ÿ Market data ingestion and processing - Up to millions of messages per second - Requires filtering, averaging, aggregation Ÿ Trade information - Orders and executions – up to tens of thousands of messages/second - Filtering and aggregation Ÿ Reference and static data also incorporated Ÿ Market and trade information have to be joined by security and time Ÿ Models used to detect anomalous trading behavior over time Ÿ Simple filtering may be used to detect restricted behavior Ÿ Alerts, reports must be immediate to reduce economic impact Ÿ Recent comparative analysis for financial data: - Within 50% of costly commercial software performance - 25x better price/performance Copyright 2013-2014 Think Big, a Teradata Company
  • 22. 23 Lessons Ÿ Easy to develop, hard to debug – start locally (timeouts) Ÿ Storm infinite loop of failures - Use Memcached to count # tuple failures Ÿ At Least Once processing - Hadoop based read-repair job Ÿ Performance Counters not getting flushed - Tick Tuples Ÿ Always ACK Ÿ Batching to S3/HDFS - Run a compaction & event de-duplication job in Hadoop - Or use exactly once semantics e.g., with Trident Ÿ Queues to connect topology parts with different time frames Copyright 2013-2014 Think Big, a Teradata Company
  • 23. 24 Conclusions Ÿ There’s many kinds of real-time problems Ÿ Use of Hadoop and/or NoSQL can solve - Low latency queries - Event response with localized intelligence - Operational intelligence Ÿ Storm is valuable for - Ingesting data within seconds - Complex real-time distributed logic - Operational intelligence - Use cases in many industries and functions Ÿ We’re Hiring! Copyright 2013-2014 Think Big, a Teradata Company