Implement a scalable statistical
aggregation system using Akka
Scala by the Bay, 12 Nov 2016
Stanley Nguyen, Vu Ho
Email Security@Symantec Singapore
The system
Provides a service that answers time-series analytical questions such as
COUNT, TOPK, SET MEMBERSHIP, and CARDINALITY over a dynamic set
of data streams, using a statistical approach.
Motivation
• The system collects data from multiple sources in streaming log format
• Some common questions in an email anti-abuse system:
  • Most frequent items (IP, domain, sender, etc.)
  • Number of unique items
  • Have we seen an item before?
=> We need to be able to answer such questions in a timely manner
Data statistics
• 6K email logs/second
• One email log is flattened out into sub-events:
  • IP, sender, sender domain, etc.
  • Time period (last 5 minutes, 1 hour, 4 hours, 1 day, 1 week, etc.)
Total: ~200K messages/second
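The figures above imply a fan-out of roughly 33 sub-events per email log (200K / 6K). A minimal sketch of such a flattening step follows, emitting tuples in the (msg-type, @timestamp, key, value) shape shown later in the deck. The field names, the period labels, the `field:period` encoding of the message type, and the constant value of 1 are all illustrative assumptions, not details from the talk.

```python
from itertools import product

# Hypothetical field names and windows, taken from the examples on the slide.
FIELDS = ("ip", "sender", "sender_domain")
PERIODS = ("5m", "1h", "4h", "1d", "1w")


def flatten(log: dict) -> list[tuple[str, int, str, int]]:
    """Expand one email log into (msg-type, timestamp, key, value) sub-events."""
    events = []
    for field, period in product(FIELDS, PERIODS):
        if field in log:
            msg_type = f"{field}:{period}"  # assumed encoding of the message type
            events.append((msg_type, log["timestamp"], log[field], 1))
    return events


log = {"timestamp": 1478908800, "ip": "203.0.113.7",
       "sender": "alice@example.com", "sender_domain": "example.com"}
events = flatten(log)
print(len(events))  # 3 fields x 5 periods = 15 sub-events
```

With only three fields and five periods the fan-out is 15; the real system presumably tracks more fields or periods to reach the reported total.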
Challenges
• Our system needs to be
  • Responsive
  • Space efficient
  • Reactive
  • Extensible
  • Scalable
  • Resilient
Sketching data structures
• How many times have we seen a certain IP?
  • Count-Min Sketch (CMS): counting things + top-K
• How many unique senders did we see yesterday?
  • HyperLogLog (HLL): set cardinality
• Did we see a certain IP last month?
  • Bloom Filter (BF): set membership
SPACE / SPEED
What is available
• Implementations of the data structures for finding cardinality (i.e. counting
  things), set membership, and top-k elements: solved by using
  streamlib / Twitter Algebird
What we try to solve
• Implement a dynamic, reactive, distributed system for answering
  cardinality (i.e. counting things), set membership, and top-k queries
Sketching data structures
• Responsive
• Space efficient
• Reactive
• Extensible
• Scalable
• Resilient
Akka Actor
BACK PRESSURE?
Akka Stream
GraphDSL
FLOW-SHAPE NODE
Using GraphDSL
(msg-type, @timestamp, key, value)
GraphDSL - Limitations
Our design – Dynamic stream
Merge Hub
• Provided by Akka Stream: allows a dynamic set of TCP producers
Splitter Hub
• Splits the stream by event type to a dynamic set of downstream consumers.
• Consumers are actors that implement the CMS, BF, HLL, etc. logic.
• Not available in akka-stream.
Splitter Hub API
• Similar to Akka Stream's built-in BroadcastHub, but with a different
  back-pressure implementation.
• [[SplitterHub]].source can be supplied with a predicate/selector function
  so that it returns a filtered subset of the data.
selector
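The selector idea can be illustrated with a small synchronous model. This is a toy stand-in, not the talk's SplitterHub and not an Akka Stream API: the class names, the `offer` method, the per-consumer `capacity`, and especially the back-pressure policy (an element is refused while any consumer that matches it is full) are assumptions made for the sketch. The slides only state that the back-pressure behaviour differs from BroadcastHub.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Event = Tuple[str, int, str, int]  # (msg-type, timestamp, key, value)


class Consumer:
    """A bounded buffer plus the selector that decides what it receives."""

    def __init__(self, selector: Callable[[Event], bool], capacity: int) -> None:
        self.selector = selector
        self.buffer: Deque[Event] = deque()
        self.capacity = capacity

    def has_room(self) -> bool:
        return len(self.buffer) < self.capacity

    def poll(self) -> Event:
        return self.buffer.popleft()


class SplitterHubModel:
    """Toy stand-in for the talk's SplitterHub (not the real implementation)."""

    def __init__(self) -> None:
        self.consumers: List[Consumer] = []

    def source(self, selector: Callable[[Event], bool], capacity: int = 2) -> Consumer:
        # Each call registers a new consumer; this may happen at any time.
        consumer = Consumer(selector, capacity)
        self.consumers.append(consumer)
        return consumer

    def offer(self, event: Event) -> bool:
        """Deliver to matching consumers; refuse if any of them is full."""
        targets = [c for c in self.consumers if c.selector(event)]
        if not all(c.has_room() for c in targets):
            return False  # assumed policy: upstream must retry later
        for c in targets:
            c.buffer.append(event)
        return True


hub = SplitterHubModel()
ip_events = hub.source(lambda e: e[0] == "ip", capacity=1)
sender_events = hub.source(lambda e: e[0] == "sender", capacity=1)

print(hub.offer(("ip", 1, "10.0.0.1", 1)))            # True
print(hub.offer(("ip", 2, "10.0.0.2", 1)))            # False: the ip consumer is full
print(hub.offer(("sender", 3, "a@example.com", 1)))   # True: the sender consumer has room
```

Once the ip consumer polls its buffer, the refused element can be offered again and will be accepted.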
Splitter Hub’s Implementation
Splitter Hub
• The [[Source]] can be materialized any number of times. Each
  materialization creates a new consumer, which is registered with the
  hub and then receives the upstream items that match its selector function.
Consumers can be added at run time
Consumers
• Can be either local or remote.
• Managed by a coordinator actor.
• Each implements a specific data structure (CMS/BF/HLL) for a particular event
  type over a specific time range.
• Responsibilities:
  • Answer a specific query.
  • Regularly persist a serialized snapshot of its internal data structure
    (e.g. the count-min table).
[Diagram labels: COUNT-QUERY, forward, ref, snapshot]
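The responsibilities above can be sketched without actors: a coordinator keeps one reference per (event type, time range) and forwards count queries to it, and each consumer can serialize and restore its state. All class and method names here are hypothetical, an exact `Counter` stands in for the count-min table, and JSON is an assumed serialization format.

```python
import json
from collections import Counter
from typing import Dict, Tuple


class CountConsumer:
    """Holds counts for one (event type, time range); a Counter stands in for a CMS."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, key: str, value: int = 1) -> None:
        self.counts[key] += value

    def count(self, key: str) -> int:
        return self.counts[key]

    def snapshot(self) -> str:
        # Serialize the internal state so it can be written to a store.
        return json.dumps(self.counts)

    @classmethod
    def restore(cls, blob: str) -> "CountConsumer":
        consumer = cls()
        consumer.counts = Counter(json.loads(blob))
        return consumer


class Coordinator:
    """Keeps a reference to each consumer and forwards queries to it."""

    def __init__(self) -> None:
        self.refs: Dict[Tuple[str, str], CountConsumer] = {}

    def consumer_for(self, event_type: str, time_range: str) -> CountConsumer:
        return self.refs.setdefault((event_type, time_range), CountConsumer())

    def count_query(self, event_type: str, time_range: str, key: str) -> int:
        return self.consumer_for(event_type, time_range).count(key)


coordinator = Coordinator()
coordinator.consumer_for("ip", "1h").record("10.0.0.1")
coordinator.consumer_for("ip", "1h").record("10.0.0.1")
print(coordinator.count_query("ip", "1h", "10.0.0.1"))  # 2

restored = CountConsumer.restore(coordinator.consumer_for("ip", "1h").snapshot())
print(restored.count("10.0.0.1"))  # 2
```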
• Responsive
• Space efficient
• Reactive
• Extensible
• Scalable
• Resilient
Scaling out
• What if the data does not fit on one machine?
• What if a server crashes?
• How do we maintain back-pressure end to end?
Scaling out
Akka stream TCP
• Handled by the kernel (back-pressure, reliable delivery).
• For each worker, we create a source for each message type it is
  responsible for, using the SplitterHub source() API.
• Connect each source to a TCP connection and send to the worker.
• Back-pressure is maintained across the network.
Master-Worker communication
Master Failover
• The Coordinator is a single point of failure.
• Run the Coordinator actor as a Cluster Singleton across multiple nodes,
  so that another node takes over if the active one fails.
• Workers communicate with the master (heartbeat) using Cluster Client.
Worker Failover
• Each worker persists all events to a DB journal plus snapshots.
  • Akka Persistence.
  • Redis for storing the journal and snapshots.
• When a worker is down, its keys are redistributed.
• The master then redirects traffic to other workers.
• CMS actors are restored on a new worker from snapshot + journal.
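The recovery path described above (latest snapshot, then replay of newer journal entries) can be sketched as follows. This is a simplified model and not Akka Persistence code: `Store` is an in-memory stand-in for Redis, an exact `Counter` replaces the CMS, and the round-robin `reassign` helper is an assumed redistribution policy that the slides do not specify.

```python
from collections import Counter
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # (key, value)


class Store:
    """In-memory stand-in for the journal and snapshot store (Redis in the talk)."""

    def __init__(self) -> None:
        self.journal: List[Event] = []
        self.snapshot: Dict[str, int] = {}
        self.snapshot_seq = 0  # number of journal entries covered by the snapshot


class CountingWorker:
    def __init__(self, store: Store) -> None:
        self.store = store
        # Recovery: load the latest snapshot, then replay newer journal entries.
        self.counts = Counter(store.snapshot)
        for key, value in store.journal[store.snapshot_seq:]:
            self.counts[key] += value

    def handle(self, event: Event) -> None:
        self.store.journal.append(event)  # persist first, then update state
        self.counts[event[0]] += event[1]

    def save_snapshot(self) -> None:
        self.store.snapshot = dict(self.counts)
        self.store.snapshot_seq = len(self.store.journal)


def reassign(keys: List[str], workers: List[str]) -> Dict[str, str]:
    """Spread keys over the surviving workers (simple round-robin)."""
    return {key: workers[i % len(workers)] for i, key in enumerate(sorted(keys))}


store = Store()
worker = CountingWorker(store)
worker.handle(("10.0.0.1", 1))
worker.save_snapshot()
worker.handle(("10.0.0.1", 1))       # journaled after the snapshot

replacement = CountingWorker(store)  # a new worker recovers from the same store
print(replacement.counts["10.0.0.1"])  # 2
print(reassign(["ip", "sender"], ["worker-b"]))  # {'ip': 'worker-b', 'sender': 'worker-b'}
```

Snapshots bound the replay cost: only the journal entries written after the last snapshot need to be applied during recovery.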
Benchmark
Setup                                                                  Throughput (one msg-type)
Akka-stream on single node                                             100K+ msg/second
Akka-stream on remote node (remote TCP)                                15-20K msg/second
Akka-stream on remote node (remote TCP) with Akka Persistence journal  2000+ msg/second
Conclusion
• Our system is
  • Responsive
  • Reactive
  • Scalable
  • Resilient
• Future work:
  • Make workers metric-agnostic
  • Scale out the master
  • Exactly-once delivery for workers
  • More flexible filtering using SplitterHub
Q&A

NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 

[ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

  • 1. Implement a scalable statistical aggregation system using Akka. Scala by the Bay, 12 Nov 2016. Stanley Nguyen, Vu Ho, Email Security @ Symantec Singapore.
  • 2. The system: provides a service that answers time-series analytical questions such as COUNT, TOPK, SET MEMBERSHIP, and CARDINALITY over a dynamic set of data streams, using a statistical approach.
  • 3. Motivation
      - The system collects data from multiple sources in streaming log format.
      - Some common questions in an email anti-abuse system:
          - Most frequent items (IP, domain, sender, etc.)
          - Number of unique items
          - Have we seen an item before?
      - => We need to be able to answer such questions in a timely manner.
  • 4. Data statistics
      - 6K email logs/second
      - Each email log is flattened out into sub-events:
          - IP, sender, sender domain, etc.
          - Time period (last 5 minutes, 1 hour, 4 hours, 1 day, 1 week, etc.)
      - Total: ~200K messages/second
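The flattening step on this slide can be pictured with a small sketch. It assumes that each log produces one message per (field, time period) pair, which the slide suggests but does not state outright. The names `EmailLog`, `SubEvent`, `Flatten`, and the window labels are hypothetical and not taken from the talk.

```scala
// Hypothetical shapes; the real log format is not shown in the slides.
final case class EmailLog(timestamp: Long, ip: String, sender: String, senderDomain: String)
final case class SubEvent(msgType: String, timestamp: Long, key: String, window: String)

object Flatten {
  // Assumed labels for the time periods listed on the slide.
  val windows: Seq[String] = Seq("5m", "1h", "4h", "1d", "1w")

  // One sub-event per (field, window) pair.
  def flatten(log: EmailLog): Seq[SubEvent] =
    for {
      (msgType, key) <- Seq("ip" -> log.ip, "sender" -> log.sender, "sender-domain" -> log.senderDomain)
      window <- windows
    } yield SubEvent(msgType, log.timestamp, key, window)
}

object FlattenDemo {
  def main(args: Array[String]): Unit = {
    val log = EmailLog(1478908800000L, "10.0.0.1", "alice@example.com", "example.com")
    println(Flatten.flatten(log).size) // 3 fields x 5 windows = 15
  }
}
```

With three fields and five windows this gives 15 messages per log. The figures on the slide (about 200K messages from 6K logs per second) imply roughly 33 messages per log, so the real system evidently tracks more fields or periods than this sketch does.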
  • 5. Challenges
      - Our system needs to be:
          - Responsive
          - Space efficient
          - Reactive
          - Extensible
          - Scalable
          - Resilient
  • 6. Sketching data structures
      - How many times have we seen a certain IP?
          - Count Min Sketch (CMS): counting things + TopK
      - How many unique senders have we seen yesterday?
          - HyperLogLog (HLL): set cardinality
      - Did we see a certain IP last month?
          - Bloom Filter (BF): set membership
      - SPACE / SPEED
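To make the first of these structures concrete, here is a minimal Count-Min Sketch in plain Scala. It is an illustrative sketch only, not the authors' implementation (the talk relies on existing libraries for this). The class name, the `depth` and `width` parameters, and the use of seeded `MurmurHash3` as the row hash are assumptions made for the example.

```scala
import scala.util.hashing.MurmurHash3

// Minimal Count-Min Sketch: `depth` hash rows, each with `width` counters.
final class CountMinSketch(depth: Int, width: Int) {
  require(depth > 0 && width > 0, "depth and width must be positive")

  private val table = Array.ofDim[Long](depth, width)

  // One hash function per row, obtained by seeding MurmurHash3 with the row index.
  private def bucket(item: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(item, row)
    ((h % width) + width) % width // map a possibly negative hash into [0, width)
  }

  // Increment one counter in every row.
  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(bucket(item, row)) += count

  // Take the minimum over the rows. Collisions can only inflate a counter,
  // so the estimate is never below the true count.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}

object CountMinSketchDemo {
  def main(args: Array[String]): Unit = {
    val cms = new CountMinSketch(depth = 4, width = 1024)
    Seq("10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1").foreach(ip => cms.add(ip))
    println(cms.estimate("10.0.0.1")) // at least 3, never an under-count
  }
}
```

Memory use is fixed at `depth * width` counters regardless of how many distinct keys arrive, which is the space side of the trade-off named on the slide. A wider table lowers the over-count, and more rows lower the chance of a bad estimate.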
  • 7. What is available / What we try to solve
      - What is available: implementations of the data structures for finding cardinality (i.e. counting things), set membership, and top-k elements; solved by using streamlib / Twitter Algebird.
      - What we try to solve: implement a dynamic, reactive, distributed system for answering cardinality (i.e. counting things), set membership, and top-k queries.
  • 9. Responsive, Space efficient, Reactive, Extensible, Scalable, Resilient
  • 14. Our design – Dynamic stream
  • 15. Merge Hub
      - Provided by Akka Stream: allows a dynamic set of TCP producers.
  • 16. Splitter Hub
      - Splits the stream based on event type to a dynamic set of downstream consumers.
      - Consumers are actors which implement the CMS, BF, HLL, etc. logic.
      - Not available in akka-stream.
  • 17. Splitter Hub API
      - Similar to akka-stream's built-in BroadcastHub, but with a different back-pressure implementation.
      - [[SplitterHub]].source can be supplied with a predicate/selector function so that it returns a filtered subset of the data.
  • 19. Splitter Hub
      - The [[Source]] can be materialized any number of times. Each materialization creates a new consumer which is registered with the hub and then receives the items from upstream that match its selector function.
      - Consumers can be added at run time.
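The selector idea can be modelled in a few lines of plain Scala. This is only a model of the routing behaviour described above: it is not the authors' SplitterHub, it does not use akka-stream, and it has no back-pressure at all. `Event` follows the (msg-type, @timestamp, key, value) tuple shown earlier in the slides; `SelectorRouter` and its methods are hypothetical names.

```scala
import scala.collection.mutable

// Hypothetical event shape, following the (msg-type, @timestamp, key, value) tuple.
final case class Event(msgType: String, timestamp: Long, key: String, value: Long)

// Plain-Scala model of selector-based routing. No streaming, no back-pressure.
final class SelectorRouter {
  private val consumers =
    mutable.LinkedHashMap.empty[String, (Event => Boolean, Event => Unit)]

  // Consumers may register at any time; they only see events published afterwards.
  def register(name: String)(selector: Event => Boolean)(handler: Event => Unit): Unit =
    consumers(name) = (selector, handler)

  def unregister(name: String): Unit = {
    consumers.remove(name)
    ()
  }

  // Deliver the event to every consumer whose selector accepts it.
  def publish(event: Event): Unit =
    consumers.values.foreach { case (selector, handler) =>
      if (selector(event)) handler(event)
    }
}

object SelectorRouterDemo {
  def main(args: Array[String]): Unit = {
    val router = new SelectorRouter
    val ipEvents = mutable.ArrayBuffer.empty[Event]
    router.register("ip-counter")(_.msgType == "ip")(e => ipEvents += e)

    router.publish(Event("ip", 1L, "10.0.0.1", 1L))
    router.publish(Event("sender", 2L, "alice@example.com", 1L))
    println(ipEvents.size) // 1: only the "ip" event matched the selector
  }
}
```

In the real design each consumer is a materialized stream source, so a slow consumer can signal demand upstream. That is the part this model leaves out.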
  • 20. Consumers
      - Can be either local or remote.
      - Managed by a coordination actor.
      - Each implements a specific data structure (CMS/BF/HLL) for a particular event type over a specific time range.
      - Responsibilities:
          - Answer a specific query.
          - Regularly persist a serialized form of the internal data structure (for example, the count-min table).
      - Diagram labels: COUNT-QUERY, forward, ref, snapshot
  • 21. Responsive, Space efficient, Reactive, Extensible, Scalable, Resilient
  • 22. Scaling out
      - What if the data does not fit on one machine?
      - What if a server crashes?
      - How do we maintain back-pressure end-to-end?
  • 24. Akka stream TCP
      - TCP is handled by the kernel (back-pressure, reliable delivery).
      - For each worker, we create one source per message type it is responsible for, using the SplitterHub source() API.
      - Each source is connected to a TCP connection and sent to the worker.
      - Back-pressure is maintained across the network.
  • 26. Master Failover
      - The Coordinator is the single point of failure.
      - Run multiple Coordinator actors as a Cluster Singleton.
      - Workers communicate with the master (heartbeat) using Cluster Client.
  • 27. Worker Failover
      - Each worker persists all events to a DB journal plus snapshots.
          - Akka Persistence.
          - Redis stores the journal and snapshots.
      - When a worker goes down, its keys are redistributed.
      - The master then redirects traffic to the other workers.
      - CMS actors are restored on the new worker from snapshot + journal.
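The snapshot + journal restore in the last bullet follows the usual event-sourcing pattern: load the latest snapshot, then replay only the journal entries written after it. The sketch below models that in plain Scala with an exact per-key counter standing in for the CMS state. It does not use the Akka Persistence or Redis APIs, and `Counted`, `Snapshot`, and `Recovery` are hypothetical names.

```scala
// Hypothetical journal entry and snapshot types for a per-key counter.
final case class Counted(seqNr: Long, key: String, amount: Long)
final case class Snapshot(seqNr: Long, counts: Map[String, Long])

object Recovery {
  // Apply one journal entry to the state.
  def applyEvent(counts: Map[String, Long], e: Counted): Map[String, Long] =
    counts.updated(e.key, counts.getOrElse(e.key, 0L) + e.amount)

  // Start from the latest snapshot (if any), then replay only newer journal entries in order.
  def recover(snapshot: Option[Snapshot], journal: Seq[Counted]): Map[String, Long] = {
    val base = snapshot.getOrElse(Snapshot(0L, Map.empty[String, Long]))
    journal
      .filter(_.seqNr > base.seqNr)
      .sortBy(_.seqNr)
      .foldLeft(base.counts)(applyEvent)
  }
}

object RecoveryDemo {
  def main(args: Array[String]): Unit = {
    val journal = Seq(
      Counted(1L, "10.0.0.1", 1L),
      Counted(2L, "10.0.0.2", 1L),
      Counted(3L, "10.0.0.1", 1L),
      Counted(4L, "10.0.0.1", 1L)
    )
    // Snapshot taken after entry 3, so only entry 4 is replayed.
    val snapshot = Snapshot(3L, Map("10.0.0.1" -> 2L, "10.0.0.2" -> 1L))
    println(Recovery.recover(Some(snapshot), journal)("10.0.0.1")) // 3
  }
}
```

Snapshots keep recovery time bounded: without one the whole journal has to be replayed, while with one only the tail written since the snapshot is needed.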
  • 28. Benchmark
      - Akka-stream on a single node: 100K+ msg/second (one msg-type)
      - Akka-stream on a remote node (remote TCP): 15-20K msg/second (one msg-type)
      - Akka-stream on a remote node (remote TCP) with Akka Persistence journal: 2000+ msg/second (one msg-type)
  • 29. Conclusion
      - Our system is:
          - Responsive
          - Reactive
          - Scalable
          - Resilient
      - Future work:
          - Make workers metric-agnostic
          - Scale out the master
          - Exactly-once delivery for workers
          - More flexible filtering using SplitterHub
  • 30. Q&A