SlideShare ist ein Scribd-Unternehmen logo
1 von 26
Data Stream Algorithms
in Storm and R
Radek Maciaszek
Who Am I?
 Radek Maciaszek
 Consulting at DataMine Lab (www.dataminelab.com) - Data mining,
business intelligence and data warehouse consultancy.
 Data scientist at a hedge fund in London
 BSc Computer Science, MSc in Cognitive and Decisions Sciences, MSc
in Bioinformatics
 During the career worked with many companies on Big Data and real
time processing projects; indcluding Orange, ad4game, Unanimis,
SkimLinks, CognitiveMatch, OpenX and many others.
Agenda
• Why streaming algorithms?
• Streaming algorithms crash course
• Apache Storm
• Storm + R – for more traditional statistics
• Use Cases
Data Explosion
• Exponential growth of information
[IDC, 2012] Information Data Corporation, a market research company
Data, data everywhere [Economist]
• “In 2013, the available storage capacity could hold 33% of all
data. By 2020, it will be able to store less than 15%” [IDC, 2014]
Data Streams – crash course
• Reasons to use data streams processing
• Data doesn’t fit into available memory and/or disk
• (Near) real-time data processing
• Scalability, cloud processing
• Examples
• Network traffic (ISP)
• Fraud detection
• Web traffic (i.e. online advertising)
Use Case – Dynamic Sampling
• OpenX – the ad server
• Customers with tens of millions of ad views per hour
• Challenge
• Create samples for statistical analysis. E.g: A/B testing, ANOVA, etc.
• How to sample data in real-time on the input stream of the data of
unknown size
• Solution
• Reservoir Sampling – allows to find a sample of a constant length from a
stream of unknown length of elements
Data Streaming algorithms
• Sampling
• Use statistic of a sample to estimate the statistic of population. The
bigger the sample the better the estimate.
• Reservoir Sampling – sample populations, without knowing it’s size.
• Algorithm:
• Store first n elements into the reservoir.
• Insert each k-th from the input stream in a random spot of the reservoir
with a probability of n/k (decreasing probability)
Source: Maldonado, et al; 2011
Moving average
• Example, online mean of the moving average of time-series at
time “t”
• Where: M – window size of the moving average.
• At time “t” predict “t+1” by removing last “t-M” element, and
adding “t” element.
• Requires last M elements to be stored in memory
• There are many more algorithms: mean, variance, regression,
percentile
Use Case - Counting Unique Users
• Large UK ad-network
• Challenge – calculate number of unique visitors - one of the
most important metrics in online advertising
• Hadoop MapReduce. It worked but took long time and too
much memory.
• Better solutions:
• Cardinality estimation on stream data, e.g. HyperLogLog algorithm
• Highly effective algorithm to count distinct number of elements
• Many other use cases:
• ISP estimates of traffic usage
• Cardinality in DB queries optimisation
• Algorithm and implementation details
• Transform input data into i.i.d. (independent and identically distributed)
uniform random bits of information
• Hash(x)= bit1 bit2 …
• Where P(bit1)=P(bit2)=1/2
• 1xxx -> P = 1/2, n >= 2
11xx -> P = 1/4, n >= 4
111x -> P = 1/8, n >= 8
n >=
• Record biggest
• Flajolet (1983) estimated the bias
• and
p = position of a first “0”
• 1983 - Flajolet & Martin. (first streaming algorithm)
Probabilistic counting
unhashed hashed
Source: http://git.io/veCtc
Probabilistic counting - algorithm
• Algorithm: p – calculates position of first zero in the bitmap
• Estimate the size using:
• R proof-of-concept implementation: http://git.io/ve8Ia
• Example:
Can we do better?
• LogLog – instead of keeping track of all 01s, keep track only of
the largest 0
• This will take LogLog bits, but at the cost of lost precision
• SuperLogLog – remove x% (typically 70%) of largest number
before estimating, more complex analysis
• HyperLogLog – harmonic mean of estimates
• Fast, cheap and 98% correct
• What if you want more traditional statistics?
Reference: Flajolet; Fusy et al. 2007
R – Open Source Statistics
• Open Source = low cost of adopting. Useful in prototyping.
• Large global community - more than 2.5 million users
• ~5,000 open source free packages
• Extensively used for modelling and visualisations
Source: Rexer Analytics
Use Case – Real-time Machine Learning
• Gaming ad-network
• 150m+ ad impressions per day
• Lambda architecture (fast and batch layers): Storm used in
parallel to Hadoop
• Challenge
• Make real-time decision on which ad to display – vs old system that used
to make decisions every 1h
• Use sophisticated statistical environment for A/B testing
• Solution
• Beta Distribution to compare effectiveness of the ads
• Use Storm to do real-time statistics
Use Case – Beta Distributions
• Comparing two ads:
• Ratio: CTR = Clicks / Views
• Wolphram Alpha: beta distribution (5, (30-5))
Source: Wolphram Alpha
Beta distributions prototyping – the R code
• Bootstrapping in R
Apache Storm
• Real-time calculations – the Hadoop of real time
• Fault tolerance, easy to scale
• Easy to develop - has local and distributed mode
• Storm multi-lang can be used with any language, including R
Getty Images
Storm Architecture
• Nimbus
• Master - equivalent of Hadoop JobTracker
• Distributes workload across cluster
• Heartbeat, reallocation of workers when needed
• Supervisor
• Runs the workers
• Communicates with Nimbus
using ZK
• Zookeeper
• coordination,
nodes discovery
Source: Apache Storm
Storm Topology
Image source: Storm github wiki
Can integrate with third party
languages and databases:
• Java
• Python
• Ruby
• Redis
• Hbase
• Cassandra
• Graph of stream computations
• Basic primitives nodes
• Spout – source of streams (Twitter API, queue, logs)
• Bolt – consumes streams, does the work, produces
streams
• Storm Trident
Storm + R
• Storm Multi-Language protocol
• Multiple Storm-R multi-language packages
provide Storm/R plumbing
• Recommended package: http://cran.r-
project.org/web/packages/Storm
• Example R code
Storm and R
storm = Storm$new();
storm$lambda = function(s) {
t = s$tuple;
t$output =
vector(mode="character",length=1);
clicks = as.numeric(t$input[1]);
views = as.numeric(t$input[2]);
t$output[1] = rbeta(1, clicks, views -
clicks);
s$emit(t);
#alternative: mark the tuple as failed.
s$fail(t);
}
storm$run();
Storm and Java integration
• Define Spout/Bolt in any programming language
• Executed as subprocess – JSON over stdin/stdout
public static class RBolt extends ShellBolt
implements IRichBolt {
public RBolt() {
super("Rscript", ”script.R");
}
}
Source: Apache Storm
Storm + R = flexibility
• Integration with existing Storm ecosystem – NoSQL, Kafka
• SOA framework - DRPC
• Scaling up your existing R processes
• Trident
Source: Apache Storm
Storm References
• https://storm.apache.org
• Storm and Java stream algorithms implementations:
• https://github.com/addthis/stream-lib
• https://github.com/aggregateknowledge/java-hll
• https://github.com/pmerienne/trident-ml
Thank you
• Summary:
• Data stream algorithms
• Storm – can be used with stream algorithms
• Storm + R – more traditional
• Questions and discussion
• https://uk.linkedin.com/in/radekmaciaszek
• http://www.dataminelab.com

Weitere ähnliche Inhalte

Was ist angesagt?

Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormMd. Shamsur Rahim
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014StampedeCon
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming AlgorithmsJoe Kelley
 
Experience with Kafka & Storm
Experience with Kafka & StormExperience with Kafka & Storm
Experience with Kafka & StormOtto Mok
 
Storm-on-YARN: Convergence of Low-Latency and Big-Data
Storm-on-YARN: Convergence of Low-Latency and Big-DataStorm-on-YARN: Convergence of Low-Latency and Big-Data
Storm-on-YARN: Convergence of Low-Latency and Big-DataDataWorks Summit
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDataWorks Summit
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeFlink Forward
 
Apache Storm
Apache StormApache Storm
Apache StormEdureka!
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopDataWorks Summit
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Brian O'Neill
 
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Sonal Raj
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Databricks
 
S4: Distributed Stream Computing Platform
S4: Distributed Stream Computing PlatformS4: Distributed Stream Computing Platform
S4: Distributed Stream Computing PlatformAleksandar Bradic
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series WorldMapR Technologies
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream ProcessingZbigniew Jerzak
 
Apache Storm and Oracle Event Processing for Real-time Analytics
Apache Storm and Oracle Event Processing for Real-time AnalyticsApache Storm and Oracle Event Processing for Real-time Analytics
Apache Storm and Oracle Event Processing for Real-time AnalyticsPrabhu Thukkaram
 
And Then There Are Algorithms
And Then There Are AlgorithmsAnd Then There Are Algorithms
And Then There Are AlgorithmsInfluxData
 
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data CloudSchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data CloudAnsgar Scherp
 

Was ist angesagt? (20)

Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache Storm
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming Algorithms
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Experience with Kafka & Storm
Experience with Kafka & StormExperience with Kafka & Storm
Experience with Kafka & Storm
 
Storm-on-YARN: Convergence of Low-Latency and Big-Data
Storm-on-YARN: Convergence of Low-Latency and Big-DataStorm-on-YARN: Convergence of Low-Latency and Big-Data
Storm-on-YARN: Convergence of Low-Latency and Big-Data
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data Analytics
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
 
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
 
S4: Distributed Stream Computing Platform
S4: Distributed Stream Computing PlatformS4: Distributed Stream Computing Platform
S4: Distributed Stream Computing Platform
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series World
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream Processing
 
Apache Storm and Oracle Event Processing for Real-time Analytics
Apache Storm and Oracle Event Processing for Real-time AnalyticsApache Storm and Oracle Event Processing for Real-time Analytics
Apache Storm and Oracle Event Processing for Real-time Analytics
 
And Then There Are Algorithms
And Then There Are AlgorithmsAnd Then There Are Algorithms
And Then There Are Algorithms
 
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data CloudSchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
 

Andere mochten auch

Computer Programming For Power Systems Analysts.
Computer Programming For Power Systems Analysts.Computer Programming For Power Systems Analysts.
Computer Programming For Power Systems Analysts.H. Kheir
 
Big Data and Stream Data Analysis at Politecnico di Milano
Big Data and Stream Data Analysis at Politecnico di MilanoBig Data and Stream Data Analysis at Politecnico di Milano
Big Data and Stream Data Analysis at Politecnico di MilanoMarco Brambilla
 
Detecting Anomalies in Streaming Data
Detecting Anomalies in Streaming DataDetecting Anomalies in Streaming Data
Detecting Anomalies in Streaming DataSubutai Ahmad
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Raja Chiky
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithmsSandeep Joshi
 
Extending lifespan with Hadoop and R
Extending lifespan with Hadoop and RExtending lifespan with Hadoop and R
Extending lifespan with Hadoop and RRadek Maciaszek
 
R Analytics in the Cloud
R Analytics in the CloudR Analytics in the Cloud
R Analytics in the CloudDataMine Lab
 
[RakutenTechConf2013] [D-3_2] Counting Big Data by Streaming Algorithms
[RakutenTechConf2013] [D-3_2] Counting Big Databy Streaming Algorithms[RakutenTechConf2013] [D-3_2] Counting Big Databy Streaming Algorithms
[RakutenTechConf2013] [D-3_2] Counting Big Data by Streaming AlgorithmsRakuten Group, Inc.
 
Aggregation computation over distributed data streams
Aggregation computation over distributed data streamsAggregation computation over distributed data streams
Aggregation computation over distributed data streamsYueshen Xu
 
Data Science with R for Java Developers
Data Science with R for Java DevelopersData Science with R for Java Developers
Data Science with R for Java DevelopersNLJUG
 
Detecting Malicious Websites using Machine Learning
Detecting Malicious Websites using Machine LearningDetecting Malicious Websites using Machine Learning
Detecting Malicious Websites using Machine LearningAndrew Beard
 
Data Stream Outlier Detection Algorithm
Data Stream Outlier Detection Algorithm Data Stream Outlier Detection Algorithm
Data Stream Outlier Detection Algorithm Hamza Aslam
 
Stream processing using Apache Storm - Big Data Meetup Athens 2016
Stream processing using Apache Storm - Big Data Meetup Athens 2016Stream processing using Apache Storm - Big Data Meetup Athens 2016
Stream processing using Apache Storm - Big Data Meetup Athens 2016Adrianos Dadis
 
Storm Real Time Computation
Storm Real Time ComputationStorm Real Time Computation
Storm Real Time ComputationSonal Raj
 
Architectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop DistributionArchitectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop Distributionmcsrivas
 
Discover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.finalDiscover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.finalHortonworks
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR Technologies
 
Márton Balassi Streaming ML with Flink-
Márton Balassi Streaming ML with Flink- Márton Balassi Streaming ML with Flink-
Márton Balassi Streaming ML with Flink- Flink Forward
 
Machine learning fro computer vision - a whirlwind of key concepts for the un...
Machine learning fro computer vision - a whirlwind of key concepts for the un...Machine learning fro computer vision - a whirlwind of key concepts for the un...
Machine learning fro computer vision - a whirlwind of key concepts for the un...potaters
 

Andere mochten auch (20)

Computer Programming For Power Systems Analysts.
Computer Programming For Power Systems Analysts.Computer Programming For Power Systems Analysts.
Computer Programming For Power Systems Analysts.
 
Big Data and Stream Data Analysis at Politecnico di Milano
Big Data and Stream Data Analysis at Politecnico di MilanoBig Data and Stream Data Analysis at Politecnico di Milano
Big Data and Stream Data Analysis at Politecnico di Milano
 
Chapter 2.1 : Data Stream
Chapter 2.1 : Data StreamChapter 2.1 : Data Stream
Chapter 2.1 : Data Stream
 
Detecting Anomalies in Streaming Data
Detecting Anomalies in Streaming DataDetecting Anomalies in Streaming Data
Detecting Anomalies in Streaming Data
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithms
 
Extending lifespan with Hadoop and R
Extending lifespan with Hadoop and RExtending lifespan with Hadoop and R
Extending lifespan with Hadoop and R
 
R Analytics in the Cloud
R Analytics in the CloudR Analytics in the Cloud
R Analytics in the Cloud
 
[RakutenTechConf2013] [D-3_2] Counting Big Data by Streaming Algorithms
[RakutenTechConf2013] [D-3_2] Counting Big Databy Streaming Algorithms[RakutenTechConf2013] [D-3_2] Counting Big Databy Streaming Algorithms
[RakutenTechConf2013] [D-3_2] Counting Big Data by Streaming Algorithms
 
Aggregation computation over distributed data streams
Aggregation computation over distributed data streamsAggregation computation over distributed data streams
Aggregation computation over distributed data streams
 
Data Science with R for Java Developers
Data Science with R for Java DevelopersData Science with R for Java Developers
Data Science with R for Java Developers
 
Detecting Malicious Websites using Machine Learning
Detecting Malicious Websites using Machine LearningDetecting Malicious Websites using Machine Learning
Detecting Malicious Websites using Machine Learning
 
Data Stream Outlier Detection Algorithm
Data Stream Outlier Detection Algorithm Data Stream Outlier Detection Algorithm
Data Stream Outlier Detection Algorithm
 
Stream processing using Apache Storm - Big Data Meetup Athens 2016
Stream processing using Apache Storm - Big Data Meetup Athens 2016Stream processing using Apache Storm - Big Data Meetup Athens 2016
Stream processing using Apache Storm - Big Data Meetup Athens 2016
 
Storm Real Time Computation
Storm Real Time ComputationStorm Real Time Computation
Storm Real Time Computation
 
Architectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop DistributionArchitectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop Distribution
 
Discover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.finalDiscover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.final
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document Database
 
Márton Balassi Streaming ML with Flink-
Márton Balassi Streaming ML with Flink- Márton Balassi Streaming ML with Flink-
Márton Balassi Streaming ML with Flink-
 
Machine learning fro computer vision - a whirlwind of key concepts for the un...
Machine learning fro computer vision - a whirlwind of key concepts for the un...Machine learning fro computer vision - a whirlwind of key concepts for the un...
Machine learning fro computer vision - a whirlwind of key concepts for the un...
 

Ähnlich wie Data Stream Algorithms in Storm and R

Big Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsBig Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsPetr Novotný
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onDony Riyanto
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series DatabasePramit Choudhary
 
IoT interoperability
IoT interoperabilityIoT interoperability
IoT interoperability1248 Ltd.
 
Solving Cybersecurity at Scale
Solving Cybersecurity at ScaleSolving Cybersecurity at Scale
Solving Cybersecurity at ScaleDataWorks Summit
 
Mining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetMining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetJ On The Beach
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapSrinath Perera
 
Infrastructure crossroads... and the way we walked them in DKPro
Infrastructure crossroads... and the way we walked them in DKProInfrastructure crossroads... and the way we walked them in DKPro
Infrastructure crossroads... and the way we walked them in DKProopenminted_eu
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Budapest Big Data Meetup Real-time stream processing
Budapest Big Data Meetup Real-time stream processingBudapest Big Data Meetup Real-time stream processing
Budapest Big Data Meetup Real-time stream processingGabor Boros
 
Data science in ruby, is it possible? is it fast? should we use it?
Data science in ruby, is it possible? is it fast? should we use it?Data science in ruby, is it possible? is it fast? should we use it?
Data science in ruby, is it possible? is it fast? should we use it?Rodrigo Urubatan
 
Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016Stavros Kontopoulos
 
Filtering From the Firehose: Real Time Social Media Streaming
Filtering From the Firehose: Real Time Social Media StreamingFiltering From the Firehose: Real Time Social Media Streaming
Filtering From the Firehose: Real Time Social Media StreamingCloud Elements
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesDavid Martínez Rego
 
Mining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAMining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAAlbert Bifet
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OSri Ambati
 
Accelerating Analytics for the Future of Genomics
Accelerating Analytics for the Future of GenomicsAccelerating Analytics for the Future of Genomics
Accelerating Analytics for the Future of GenomicsAmazon Web Services
 

Ähnlich wie Data Stream Algorithms in Storm and R (20)

Big Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsBig Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big Graphs
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
HUG France - Apache Drill
HUG France - Apache DrillHUG France - Apache Drill
HUG France - Apache Drill
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series Database
 
IoT interoperability
IoT interoperabilityIoT interoperability
IoT interoperability
 
Solving Cybersecurity at Scale
Solving Cybersecurity at ScaleSolving Cybersecurity at Scale
Solving Cybersecurity at Scale
 
Mining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetMining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert Bifet
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and Roadmap
 
Infrastructure crossroads... and the way we walked them in DKPro
Infrastructure crossroads... and the way we walked them in DKProInfrastructure crossroads... and the way we walked them in DKPro
Infrastructure crossroads... and the way we walked them in DKPro
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Budapest Big Data Meetup Real-time stream processing
Budapest Big Data Meetup Real-time stream processingBudapest Big Data Meetup Real-time stream processing
Budapest Big Data Meetup Real-time stream processing
 
Data science in ruby, is it possible? is it fast? should we use it?
Data science in ruby, is it possible? is it fast? should we use it?Data science in ruby, is it possible? is it fast? should we use it?
Data science in ruby, is it possible? is it fast? should we use it?
 
Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016
 
Filtering From the Firehose: Real Time Social Media Streaming
Filtering From the Firehose: Real Time Social Media StreamingFiltering From the Firehose: Real Time Social Media Streaming
Filtering From the Firehose: Real Time Social Media Streaming
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
 
Mining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAMining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOA
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
 
Accelerating Analytics for the Future of Genomics
Accelerating Analytics for the Future of GenomicsAccelerating Analytics for the Future of Genomics
Accelerating Analytics for the Future of Genomics
 

Kürzlich hochgeladen

2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 

Kürzlich hochgeladen (20)

2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 

Data Stream Algorithms in Storm and R

  • 1. Data Stream Algorithms in Storm and R Radek Maciaszek
  • 2. Who Am I?  Radek Maciaszek  Consulting at DataMine Lab (www.dataminelab.com) - Data mining, business intelligence and data warehouse consultancy.  Data scientist at a hedge fund in London  BSc Computer Science, MSc in Cognitive and Decisions Sciences, MSc in Bioinformatics  During the career worked with many companies on Big Data and real time processing projects; indcluding Orange, ad4game, Unanimis, SkimLinks, CognitiveMatch, OpenX and many others.
  • 3. Agenda • Why streaming algorithms? • Streaming algorithms crash course • Apache Storm • Storm + R – for more traditional statistics • Use Cases
  • 4. Data Explosion • Exponential growth of information [IDC, 2012] Information Data Corporation, a market research company
  • 5. Data, data everywhere [Economist] • “In 2013, the available storage capacity could hold 33% of all data. By 2020, it will be able to store less than 15%” [IDC, 2014]
  • 6. Data Streams – crash course • Reasons to use data streams processing • Data doesn’t fit into available memory and/or disk • (Near) real-time data processing • Scalability, cloud processing • Examples • Network traffic (ISP) • Fraud detection • Web traffic (i.e. online advertising)
  • 7. Use Case – Dynamic Sampling • OpenX – the ad server • Customers with tens of millions of ad views per hour • Challenge • Create samples for statistical analysis. E.g: A/B testing, ANOVA, etc. • How to sample data in real-time on the input stream of the data of unknown size • Solution • Reservoir Sampling – allows to find a sample of a constant length from a stream of unknown length of elements
  • 8. Data Streaming algorithms • Sampling • Use statistic of a sample to estimate the statistic of population. The bigger the sample the better the estimate. • Reservoir Sampling – sample populations, without knowing it’s size. • Algorithm: • Store first n elements into the reservoir. • Insert each k-th from the input stream in a random spot of the reservoir with a probability of n/k (decreasing probability) Source: Maldonado, et al; 2011
  • 9. Moving average • Example, online mean of the moving average of time-series at time “t” • Where: M – window size of the moving average. • At time “t” predict “t+1” by removing last “t-M” element, and adding “t” element. • Requires last M elements to be stored in memory • There are many more algorithms: mean, variance, regression, percentile
  • 10. Use Case - Counting Unique Users • Large UK ad-network • Challenge – calculate number of unique visitors - one of the most important metrics in online advertising • Hadoop MapReduce. It worked but took long time and too much memory. • Better solutions: • Cardinality estimation on stream data, e.g. HyperLogLog algorithm • Highly effective algorithm to count distinct number of elements • Many other use cases: • ISP estimates of traffic usage • Cardinality in DB queries optimisation • Algorithm and implementation details
  • 11. • Transform input data into i.i.d. (independent and identically distributed) uniform random bits of information • Hash(x)= bit1 bit2 … • Where P(bit1)=P(bit2)=1/2 • 1xxx -> P = 1/2, n >= 2 11xx -> P = 1/4, n >= 4 111x -> P = 1/8, n >= 8 n >= • Record biggest • Flajolet (1983) estimated the bias • and p = position of a first “0” • 1983 - Flajolet & Martin. (first streaming algorithm) Probabilistic counting unhashed hashed Source: http://git.io/veCtc
  • 12. Probabilistic counting - algorithm • Algorithm: p – calculates position of first zero in the bitmap • Estimate the size using: • R proof-of-concept implementation: http://git.io/ve8Ia • Example:
  • 13. Can we do better? • LogLog – instead of keeping track of all 01s, keep track only of the largest 0 • This will take LogLog bits, but at the cost of lost precision • SuperLogLog – remove x% (typically 70%) of largest number before estimating, more complex analysis • HyperLogLog – harmonic mean of estimates • Fast, cheap and 98% correct • What if you want more traditional statistics? Reference: Flajolet; Fusy et al. 2007
  • 14. R – Open Source Statistics • Open Source = low cost of adopting. Useful in prototyping. • Large global community - more than 2.5 million users • ~5,000 open source free packages • Extensively used for modelling and visualisations Source: Rexer Analytics
  • 15. Use Case – Real-time Machine Learning • Gaming ad-network • 150m+ ad impressions per day • Lambda architecture (fast and batch layers): Storm used in parallel to Hadoop • Challenge • Make real-time decision on which ad to display – vs old system that used to make decisions every 1h • Use sophisticated statistical environment for A/B testing • Solution • Beta Distribution to compare effectiveness of the ads • Use Storm to do real-time statistics
  • 16. Use Case – Beta Distributions • Comparing two ads: • Ratio: CTR = Clicks / Views • Wolphram Alpha: beta distribution (5, (30-5)) Source: Wolphram Alpha
  • 17. Beta distributions prototyping – the R code • Bootstrapping in R
  • 18. Apache Storm • Real-time calculations – the Hadoop of real time • Fault tolerance, easy to scale • Easy to develop - has local and distributed mode • Storm multi-lang can be used with any language, including R Getty Images
  • 19. Storm Architecture • Nimbus • Master - equivalent of Hadoop JobTracker • Distributes workload across cluster • Heartbeat, reallocation of workers when needed • Supervisor • Runs the workers • Communicates with Nimbus using ZK • Zookeeper • coordination, nodes discovery Source: Apache Storm
  • 20. Storm Topology Image source: Storm github wiki Can integrate with third party languages and databases: • Java • Python • Ruby • Redis • Hbase • Cassandra • Graph of stream computations • Basic primitives nodes • Spout – source of streams (Twitter API, queue, logs) • Bolt – consumes streams, does the work, produces streams • Storm Trident
  • 21. Storm + R • Storm Multi-Language protocol • Multiple Storm-R multi-language packages provide Storm/R plumbing • Recommended package: http://cran.r- project.org/web/packages/Storm • Example R code
  • 22. Storm and R storm = Storm$new(); storm$lambda = function(s) { t = s$tuple; t$output = vector(mode="character",length=1); clicks = as.numeric(t$input[1]); views = as.numeric(t$input[2]); t$output[1] = rbeta(1, clicks, views - clicks); s$emit(t); #alternative: mark the tuple as failed. s$fail(t); } storm$run();
  • 23. Storm and Java integration • Define Spout/Bolt in any programming language • Executed as subprocess – JSON over stdin/stdout public static class RBolt extends ShellBolt implements IRichBolt { public RBolt() { super("Rscript", ”script.R"); } } Source: Apache Storm
  • 24. Storm + R = flexibility • Integration with existing Storm ecosystem – NoSQL, Kafka • SOA framework - DRPC • Scaling up your existing R processes • Trident Source: Apache Storm
  • 25. Storm References • https://storm.apache.org • Storm and Java stream algorithms implementations: • https://github.com/addthis/stream-lib • https://github.com/aggregateknowledge/java-hll • https://github.com/pmerienne/trident-ml
  • 26. Thank you • Summary: • Data stream algorithms • Storm – can be used with stream algorithms • Storm + R – more traditional • Questions and discussion • https://uk.linkedin.com/in/radekmaciaszek • http://www.dataminelab.com