In this tutorial we walk through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. We also discuss how advances in technology might impact the streaming architectures and applications of the future. Along the way, we explore the interplay between storage and stream processing and discuss future developments.
Tutorial - Modern Real Time Streaming Architectures (Karthik Ramasamy)
Across diverse industry segments, there has been a shift in focus from big data to fast data, stemming, in part, from the deluge of high-velocity data streams as well as the need for instant data-driven insights. In response, there has been a proliferation of messaging and streaming frameworks that enterprises use to satisfy the needs of various applications.
Drawing on their experience operating streaming systems at Twitter scale, Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. They also discuss how advances in technology might impact the streaming architectures and applications of the future. Along the way, they explore the interplay between storage and stream processing and speculate about future developments.
Topics include:
Basic requirements of stream processing
Streaming and one-pass algorithms (see the sketch after this list)
Different types of streaming architectures
An in-depth review of streaming frameworks
Deploying and operating stream processing applications
Lessons learned from building a real-time stack using Apache Pulsar and Apache Heron at Twitter Scale
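As a taste of the one-pass algorithms listed above, here is a minimal sketch (not from the tutorial itself) of reservoir sampling, a classic single-pass technique for drawing a uniform random sample from a stream of unknown length:

```python
import random

def reservoir_sample(stream, k):
    """Algorithm R: keep a uniform random sample of k items
    from a stream of unknown length, in a single pass."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing element with probability k/(i+1),
            # which keeps every item equally likely to survive.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: sample 5 events from a stream of 1,000,000 without storing it.
print(reservoir_sample(range(1_000_000), 5))
```

The stream is seen exactly once and only k items are ever held in memory, which is the defining constraint of streaming algorithms.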
Data Lakehouse, Data Mesh, and Data Fabric (r2) (James Serra)
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean, and how do they compare to a modern data warehouse? In this session I'll cover all of them in detail and compare the pros and cons of each. They all may sound great in theory, but I'll dig into the concerns you need to be aware of before taking the plunge. I'll also include use cases so you can see what approach will work best for your big data needs. And I'll discuss Microsoft's version of the data mesh.
Modernizing to a Cloud Data Architecture (Databricks)
Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how the benefits of elastic compute models helped one customer scale their analytics and AI workloads, along with best practices from their successful migration of data and workloads to the cloud.
Delta Lake is an open-source innovation that brings new capabilities for transactions, version control, and indexing to your data lakes. We uncover Delta Lake's benefits and why they matter to you. Through this session, we showcase some of those benefits and how they can improve your modern data engineering pipelines. Delta Lake provides snapshot isolation, which supports concurrent read/write operations and enables efficient inserts, updates, deletes, and rollbacks. It allows background file optimization through compaction and Z-order partitioning, achieving better performance. In this presentation, we will learn how Delta Lake solves common data lake challenges and, most importantly, explore the new Delta Time Travel capability.
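To make these features concrete, here is a minimal PySpark sketch of time travel and file optimization, assuming a delta-spark environment and a hypothetical table path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes delta-spark is configured

path = "/tmp/events_delta"  # hypothetical table location

# Write and then append to a small Delta table; each commit is a new version.
spark.range(100).write.format("delta").mode("overwrite").save(path)
spark.range(100, 200).write.format("delta").mode("append").save(path)

# Time travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())  # 100 rows, the pre-append snapshot

# Compaction plus Z-ordering is a SQL command on Databricks
# and recent delta-spark releases.
spark.sql(f"OPTIMIZE delta.`{path}` ZORDER BY (id)")
```

Every write is recorded as a new version in the transaction log, which is what makes versionAsOf reads and rollbacks cheap.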
Building Lakehouses on Delta Lake with SQL Analytics Primer (Databricks)
You've heard the marketing buzz, and maybe you have been to a workshop and worked with some Spark, Delta, SQL, Python, or R, but you still need some help putting all the pieces together. Join us as we review some common techniques to build a lakehouse using Delta Lake, use SQL Analytics to perform exploratory analysis, and build connectivity for BI applications.
Building Cloud-Native App Series - Part 3 of 11
Microservices Architecture Series
AWS Kinesis Data Streams (see the sketch after this list)
AWS Kinesis Firehose
AWS Kinesis Data Analytics
Apache Flink - Analytics
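For orientation on the Kinesis topics above, a minimal sketch of publishing records to a Kinesis Data Stream with boto3; the stream name and record shape are hypothetical:

```python
import json
import boto3  # AWS SDK for Python

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "orders"  # hypothetical stream name

def publish(event: dict) -> None:
    # PartitionKey determines the shard; use a stable business key
    # so related events keep their relative order.
    kinesis.put_record(
        StreamName=STREAM,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["customer_id"]),
    )

publish({"customer_id": 42, "sku": "A-100", "qty": 1})
```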
ksqlDB: A Stream-Relational Database System (Confluent)
Speaker: Matthias J. Sax, Software Engineer, Confluent
ksqlDB is a distributed event streaming database system that allows users to express SQL queries over relational tables and event streams. The project was released by Confluent in 2017 and is hosted on GitHub and developed with an open-source spirit. ksqlDB is built on top of Apache Kafka®, a distributed event streaming platform. In this talk, we discuss ksqlDB's architecture, which is influenced by Apache Kafka and its stream processing library, Kafka Streams. We explain how ksqlDB executes continuous queries while achieving fault tolerance and high availability. Furthermore, we explore ksqlDB's streaming SQL dialect and the different types of supported queries.
Matthias J. Sax is a software engineer at Confluent working on ksqlDB. He mainly contributes to Kafka Streams, Apache Kafka's stream processing library, which serves as ksqlDB's execution engine. Furthermore, he helps evolve ksqlDB's "streaming SQL" language. In the past, Matthias also contributed to Apache Flink and Apache Storm, and he is an Apache committer and PMC member. Matthias holds a Ph.D. from Humboldt University of Berlin, where he studied distributed data stream processing systems.
https://db.cs.cmu.edu/events/quarantine-db-talk-2020-confluent-ksqldb-a-stream-relational-database-system/
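To illustrate the kinds of queries the talk distinguishes, here is a hedged sketch against ksqlDB's REST API (persistent statements go to /ksql, push queries to /query); the server address, stream, and column names are hypothetical:

```python
import requests  # third-party HTTP client

KSQLDB = "http://localhost:8088"  # assumed address of a local ksqlDB server

# A continuous aggregation: in ksqlDB, a GROUP BY over a stream yields a table.
requests.post(f"{KSQLDB}/ksql", json={
    "ksql": "CREATE TABLE clicks_by_user AS "
            "SELECT user_id, COUNT(*) AS clicks "
            "FROM clickstream GROUP BY user_id EMIT CHANGES;",
    "streamsProperties": {},
})

# A push query: subscribe to changes of that table as they happen.
with requests.post(f"{KSQLDB}/query", json={
    "ksql": "SELECT * FROM clicks_by_user EMIT CHANGES;",
    "streamsProperties": {"ksql.streams.auto.offset.reset": "earliest"},
}, stream=True) as resp:
    for line in resp.iter_lines():
        if line:
            print(line.decode())  # streamed rows of the continuous result
```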
Architect's Open-Source Guide for a Data Mesh Architecture (Databricks)
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with the implementation of Data Mesh systems and focus on the role of open-source projects in it. Projects like Apache Spark can play a key part in implementing the standardized infrastructure platform of a Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted at architects, decision-makers, data engineers, and system designers.
The document provides an overview of the Databricks platform, which offers a unified environment for data engineering, analytics, and AI. It describes how Databricks addresses the complexity of managing data across siloed systems by providing a single "data lakehouse" platform where all data and analytics workloads can be run. Key features highlighted include Delta Lake for ACID transactions on data lakes, auto loader for streaming data ingestion, notebooks for interactive coding, and governance tools to securely share and catalog data and models.
Doug Bateman, a principal data engineering instructor at Databricks, presented on how to build a Lakehouse architecture. He began by introducing himself and his background. He then discussed the goals of describing key Lakehouse features, explaining how Delta Lake enables them, and developing a sample Lakehouse using Databricks. The key aspects of a Lakehouse are that it supports diverse data types and workloads while enabling the use of BI tools directly on source data. Delta Lake provides reliability, consistency, and performance through its ACID transactions, automatic file consolidation, and integration with Spark. Bateman concluded with a demo of creating a Lakehouse.
This document summarizes Netflix's use of Kafka in their data pipeline. It discusses how Netflix evolved from using S3 and EMR to introducing Kafka and Kafka producers and consumers to handle 400 billion events per day. It covers challenges of scaling Kafka clusters and tuning Kafka clients and brokers. Finally, it outlines Netflix's roadmap which includes contributing to open source projects like Kafka and testing failure resilience.
The Top 5 Apache Kafka Use Cases and Architectures in 2022 (Kai Wähner)
This document discusses the top 5 use cases and architectures for data in motion in 2022. It describes:
1) The Kappa architecture as an alternative to the Lambda architecture that uses a single stream to handle both real-time and batch data (see the sketch after this list).
2) Hyper-personalized omnichannel experiences that integrate customer data from multiple sources in real-time to provide personalized experiences across channels.
3) Multi-cloud deployments using Apache Kafka and data mesh architectures to share data across different cloud platforms.
4) Edge analytics that deploy stream processing and Kafka brokers at the edge to enable low-latency use cases and offline functionality.
5) Real-time cybersecurity applications that use streaming data.
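A minimal sketch of the Kappa idea from item 1, assuming a kafka-python client and a hypothetical topic: there is no separate batch layer, and reprocessing is just the same streaming job replaying the retained log from the beginning under a fresh consumer group:

```python
from kafka import KafkaConsumer  # pip install kafka-python

def handle(payload: bytes) -> None:
    ...  # business logic, identical for live and replayed data

def run_job(reprocess: bool = False) -> None:
    # In a Kappa architecture, "batch" is the same streaming job
    # replaying the retained log rather than a separate codebase.
    consumer = KafkaConsumer(
        "events",                                   # hypothetical topic
        bootstrap_servers="localhost:9092",
        group_id="kappa-job-v2" if reprocess else "kappa-job-v1",
        auto_offset_reset="earliest" if reprocess else "latest",
        enable_auto_commit=True,
    )
    for record in consumer:
        handle(record.value)
```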
Building Reliable Lakehouses with Apache Flink and Delta Lake (Flink Forward)
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object store. The Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta's reliability, ACID transactions, and scalability while maintaining Flink's end-to-end exactly-once processing. It ensures that data from Flink is written to Delta tables in an idempotent manner: even if the Flink pipeline is restarted from its checkpoint information, no data is lost or duplicated, preserving Flink's exactly-once semantics.
by
Scott Sandre & Denny Lee
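The connector's idempotence rests on recording, per writer, the highest Flink checkpoint already committed to the Delta transaction log. The real implementation lives in the connector's Java code; this toy Python sketch only illustrates the dedup rule:

```python
# Illustrative only: record the highest checkpoint id committed per writer
# ("appId") and refuse to re-commit data from an older or equal checkpoint
# after a restart, which keeps replayed writes idempotent.

committed = {}  # appId -> highest committed checkpoint id (kept in the Delta log)

def append_files_to_delta_log(files: list) -> None:
    print(f"committing {len(files)} data files")  # stand-in for a Delta commit

def commit(app_id: str, checkpoint_id: int, files: list) -> bool:
    if committed.get(app_id, -1) >= checkpoint_id:
        # Flink replayed this checkpoint; the data is already in the table.
        return False
    append_files_to_delta_log(files)
    committed[app_id] = checkpoint_id  # recorded atomically with the commit
    return True

print(commit("flink-job", 7, ["part-0.parquet"]))  # True: new checkpoint
print(commit("flink-job", 7, ["part-0.parquet"]))  # False: duplicate dropped
```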
Snowflake: The Good, the Bad, and the Ugly (Tyler Wishnoff)
Learn how to solve the top 3 challenges Snowflake customers face, and what you can do to ensure high-performance, intelligent analytics at any scale. Ideal for those currently using Snowflake and those considering it. Learn more at: https://kyligence.io/
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluent)
In this talk we'll look at the relationship between three of the most disruptive software engineering paradigms: event sourcing, stream processing, and serverless. We'll debunk some of the myths around event sourcing. We'll look at the inevitability of event-driven programming in the serverless space, and we'll see how stream processing links these two concepts together with a single 'database for events'. As the story unfolds we'll dive into some use cases, examine the practicalities of each approach, particularly the stateful elements, and finally extrapolate how their future relationship is likely to unfold. Key takeaways include: the different flavors of event sourcing and where their value lies; the difference between stream processing at application and infrastructure levels; the relationship between stream processors and serverless functions; and the practical limits of storing data in Kafka and stream processors like KSQL.
A brief introduction to Apache Kafka that describes its usage as a platform for streaming data. It introduces some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
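Kafka Streams itself is a Java library, but the consume-transform-produce loop it abstracts can be sketched in a few lines of Python with the (assumed) kafka-python client and hypothetical topic names:

```python
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

consumer = KafkaConsumer("raw-events",
                         bootstrap_servers="localhost:9092",
                         group_id="enricher")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for record in consumer:
    cleaned = record.value.strip().lower()  # the "transform" step on raw bytes
    producer.send("clean-events", cleaned)  # emit to a downstream topic
```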
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has sprung up. Along with the Hive Metastore, these table formats are trying to solve problems that have stood in traditional data lakes for a long time, with features like ACID transactions, schema evolution, upserts, time travel, and incremental consumption.
This is Part 4 of the GoldenGate series on Data Mesh - a series of webinars helping customers understand how to move off of old-fashioned monolithic data integration architecture and get ready for more agile, cost-effective, event-driven solutions. The Data Mesh is a kind of Data Fabric that emphasizes business-led data products running on event-driven streaming architectures, serverless, and microservices based platforms. These emerging solutions are essential for enterprises that run data-driven services on multi-cloud, multi-vendor ecosystems.
Join this session to get a fresh look at Data Mesh; we'll start with core architecture principles (vendor agnostic) and transition into detailed examples of how Oracle's GoldenGate platform is providing capabilities today. We will discuss essential technical characteristics of a Data Mesh solution, and the benefits that business owners can expect by moving IT in this direction. For more background on Data Mesh, Part 1, 2, and 3 are on the GoldenGate YouTube channel: https://www.youtube.com/playlist?list=PLbqmhpwYrlZJ-583p3KQGDAd6038i1ywe
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
Mr. Pollock is an expert technology leader for data platforms, big data, data integration and governance. Jeff has been CTO at California startups and a senior exec at Fortune 100 tech vendors. He is currently Oracle VP of Products and Cloud Services for Data Replication, Streaming Data and Database Migrations. While at IBM, he was head of all Information Integration, Replication and Governance products, and previously Jeff was an independent architect for US Defense Department, VP of Technology at Cerebra and CTO of Modulant – he has been engineering artificial intelligence based data platforms since 2001. As a business consultant, Mr. Pollock was a Head Architect at Ernst & Young’s Center for Technology Enablement. Jeff is also the author of “Semantic Web for Dummies” and "Adaptive Information,” a frequent keynote at industry conferences, author for books and industry journals, formerly a contributing member of W3C and OASIS, and an engineering instructor with UC Berkeley’s Extension for object-oriented systems, software development process and enterprise architecture.
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021 (StreamNative)
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL, with results that can be joined with other data sources. This plugin will soon be renamed to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename, and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
- Delta Lake is an open source project that provides ACID transactions, schema enforcement, and time travel capabilities to data stored in data lakes such as S3 and ADLS.
- It allows building a "Lakehouse" architecture where the same data can be used for both batch and streaming analytics.
- Key features include ACID transactions, scalable metadata handling, time travel to view past data states, schema enforcement, schema evolution, and change data capture for streaming inserts, updates and deletes.
Making Apache Spark Better with Delta Lake (Databricks)
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake (see the sketch after this list)
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
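For the conversion item in the list above, a hedged PySpark sketch with hypothetical paths; both the rewrite route and the in-place conversion route are shown:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes delta-spark is configured

# Option 1: rewrite an existing Parquet dataset as a Delta table.
df = spark.read.parquet("/data/events_parquet")       # hypothetical source path
df.write.format("delta").save("/data/events_delta")   # hypothetical target path

# Option 2: convert in place; this only writes the transaction log,
# leaving the existing Parquet files where they are.
DeltaTable.convertToDelta(spark, "parquet.`/data/events_parquet`")
```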
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks' open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka (Kai Wähner)
If there were a buzzword of the hour, it would certainly be "data mesh"! This new architectural paradigm unlocks analytic data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios.
As such, the data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. And the heart of a data mesh infrastructure must be real-time, decoupled, reliable, and scalable.
This presentation explores how Apache Kafka, as an open and scalable decentralized real-time platform, can be the basis of a data mesh infrastructure and - complemented by many other data platforms like a data warehouse, data lake, and lakehouse - solve real business problems.
There is no silver bullet or single technology/product/cloud service for implementing a data mesh. The key outcome of a data mesh architecture is the ability to build data products with the right tool for the job.
A good data mesh combines data streaming technology like Apache Kafka or Confluent Cloud with cloud-native data warehouse and data lake architectures from Snowflake, Databricks, Google BigQuery, et al.
Making Data Timelier and More Reliable with Lakehouse Technology (Matei Zaharia)
Enterprise data architectures usually contain many systems—data lakes, message queues, and data warehouses—that data must pass through before it can be analyzed. Each transfer step between systems adds a delay and a potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high-performance access similar to a data warehouse. Thousands of organizations including the largest Internet companies are now using lakehouses to replace separate data lake, warehouse and streaming systems and deliver high-quality data faster internally. I’ll discuss the key trends and recent advances in this area based on Delta Lake, the most widely used open source lakehouse platform, which was developed at Databricks.
Azure data analytics platform - A reference architecture (Rajesh Kumar)
This document provides an overview of Azure data analytics architecture using the Lambda architecture pattern. It covers Azure data and services, including ingestion, storage, processing, analysis and interaction services. It provides a brief overview of the Lambda architecture including the batch layer for pre-computed views, speed layer for real-time views, and serving layer. It also discusses Azure data distribution, SQL Data Warehouse architecture and design best practices, and data modeling guidance.
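A toy sketch of the Lambda layers described above, with made-up page-view counts: the serving layer answers queries by merging the pre-computed batch view with the incremental real-time view:

```python
from datetime import datetime

# Batch layer: pre-computed view, rebuilt periodically from the master dataset.
batch_view = {"page_a": 10_000, "page_b": 4_200}  # counts up to the last batch run
batch_cutoff = datetime(2024, 1, 1)               # events after this feed the speed layer

# Speed layer: incremental counts for events that arrived after the cutoff.
realtime_view = {"page_a": 37, "page_c": 5}

def serve(page: str) -> int:
    # Serving layer: merge both views to produce a complete answer.
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

print(serve("page_a"))  # 10037
```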
The document provides an introduction and overview of Apache Kafka presented by Jeff Holoman. It begins with an agenda and background on the presenter. It then covers basic Kafka concepts like topics, partitions, producers, consumers and consumer groups. It discusses efficiency and delivery guarantees. Finally, it presents some use cases for Kafka and positioning around when it may or may not be a good fit compared to other technologies.
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit... (Amazon Web Services)
Snowflake is a cloud-based data warehouse that is built for the cloud. It was founded in 2012 and has raised $1 billion in funding. Snowflake's architecture separates storage, compute, and metadata services, allowing it to offer unlimited scalability, multiple clusters that can access shared data with no downtime, and full transactional consistency across the system. Snowflake has over 2000 customers including large enterprises that use it for analytics, data science, and sharing large volumes of data securely.
This document provides an overview of patterns for scalability, availability, and stability in distributed systems. It discusses general recommendations like immutability and referential transparency. It covers scalability trade-offs around performance vs scalability, latency vs throughput, and availability vs consistency. It then describes various patterns for scalability including managing state through partitioning, caching, sharding databases, and using distributed caching. It also covers patterns for managing behavior through event-driven architecture, compute grids, load balancing, and parallel computing. Availability patterns like fail-over, replication, and fault tolerance are discussed. The document provides examples of popular technologies that implement many of these patterns.
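As one concrete instance of the partitioning and sharding patterns mentioned, a small sketch of hash-based routing to a fixed set of hypothetical shards:

```python
import hashlib

SHARDS = ["db0", "db1", "db2", "db3"]  # hypothetical database shards

def shard_for(key: str) -> str:
    # Hash-based partitioning: a deterministic hash spreads keys evenly
    # and always routes the same key to the same shard.
    digest = hashlib.md5(key.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(SHARDS)
    return SHARDS[index]

print(shard_for("user:42"))
```

Note that plain modulo hashing remaps most keys whenever the shard count changes, which is the motivation for consistent hashing in elastic deployments.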
Real-time processing of large amounts of data (Confluent)
This document discusses real-time processing of large amounts of data using a streaming platform. It begins with an agenda for the presentation, then discusses how streaming platforms can be used as a central nervous system in enterprises. Several use cases are presented, including using Apache Kafka and the Confluent Platform for applications like fraud detection, customer analytics, and migrating from batch to stream-based data processing. The rest of the document goes into details on Kafka, Confluent Platform, and how they can be used to build stream processing applications.
EDA Meets Data Engineering – What's the Big Deal? (Confluent)
Presenter: Guru Sattanathan, Systems Engineer, Confluent
Event-driven architectures have been around for many years, much like Apache Kafka®, which was first open sourced in 2011. The reality is that the true potential of Kafka is only being realised now. Kafka is becoming the central nervous system of many of today's enterprises, bringing a profound paradigm shift to the way we think about enterprise IT. What has changed in Kafka to enable this paradigm shift? Isn't it just a message broker, and how are enterprises using it today? This session will explore these key questions.
Sydney: https://content.deloitte.com.au/20200221-tel-event-tech-community-syd-registration
Melbourne: https://content.deloitte.com.au/20200221-tel-event-tech-community-mel-registration
This document provides a high-level summary of streaming data processing and the Lambda architecture. It begins with a brief history of batch and streaming systems for big data. It then introduces the Lambda architecture as a way to handle both batch and streaming data using separate batch and speed layers. The document discusses advantages and disadvantages of the Lambda architecture, as well as use cases, implementation tips, and approaches that have emerged beyond the Lambda architecture like Kappa and FastData architectures.
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben..., Confluent)
Tinder's Quickfire Pipeline powers all things data at Tinder. It was originally built using AWS Kinesis Firehose and has since been extended to use both Kafka and other event buses. It is the core of Tinder's data infrastructure. This rich data flow of both client and backend data has been extended to serve a variety of needs at Tinder, including experimentation, ML, CRM, and observability, allowing backend developers easier access to shared client-side data. We perform this using many systems, including Kafka, Spark, Flink, Kubernetes, and Prometheus. Many of Tinder's systems were natively designed in an RPC-first architecture.
Things we'll discuss about decoupling your system at scale via event-driven architectures include:
– Powering ML, backend, observability, and analytical applications at scale, including an end to end walk through of our processes that allow non-programmers to write and deploy event-driven data flows.
– Showing end-to-end usage of dynamic event processing that creates other stream processes, via a dynamic control-plane topology pattern and the broadcast state pattern
– How to manage the unavailability of cached data that would normally come from repeated API calls for data that’s being backfilled into Kafka, all online! (and why this is not necessarily a “good” idea)
– Integrating common OSS frameworks and libraries like Kafka Streams, Flink, Spark and friends to encourage the best design patterns for developers coming from traditional service oriented architectures, including pitfalls and lessons learned along the way.
– Why and how to avoid overloading microservices with excessive RPC calls from event-driven streaming systems
– Best practices in common data flow patterns, such as shared state via RocksDB + Kafka Streams as well as the complementary tools in the Apache Ecosystem.
– The simplicity and power of streaming SQL with microservices
This paper covers our experience of building real-time pipelines for financial data, the various open source libraries we experimented with and the impacts we saw in a very brief time.
This document provides an overview of streaming analytics, including definitions, common use cases, and key concepts like streaming engines, processing models, and guarantees. It also provides examples of analyzing data streams using Apache Spark Structured Streaming, Apache Flink, and Kafka Streams APIs. Code snippets demonstrate windowing, triggers, and working with event-time.
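In the spirit of those snippets, here is a hedged Spark Structured Streaming example of event-time windowing with a watermark and a processing-time trigger; the socket source and port are stand-ins for a real stream:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.getOrCreate()

# Assumed input: the socket source with includeTimestamp yields
# columns `value` (string) and `timestamp` (event arrival time).
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999)
         .option("includeTimestamp", True)
         .load())

counts = (lines
          .withWatermark("timestamp", "10 minutes")        # tolerate 10 min lateness
          .groupBy(window(col("timestamp"), "5 minutes"),  # 5-minute event-time windows
                   col("value"))
          .count())

query = (counts.writeStream
         .outputMode("update")
         .trigger(processingTime="1 minute")  # fire once per minute
         .format("console")
         .start())
query.awaitTermination()
```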
Horizontal Scaling for Millions of Customers! (elangovans)
This document provides an overview of Elangovan Shanmugam's experience and expertise in software architecture. Some key points:
- Elangovan has over 25 years of experience in software development and has designed resilient systems that can handle millions of customers and transactions per second.
- He discusses his work on Tax products that can import documents in under 2 seconds for 45 million filers, and his role as Chief Architect for Mint which serves 35 million customers processing billions of transactions daily.
- The document outlines Elangovan's approach to software architecture including strategies for microservices, scalability, high availability, and application architecture for multiple platforms and millions of users.
Self-tuning data centers aim to minimize human intervention through machine learning techniques. Current challenges include meeting service level agreements for performance and uptime while maximizing efficiency of resources and minimizing costs. A self-tuning architecture uses monitoring data to detect issues and make recommendations for scaling, migration, or tuning of resources without human input. This approach aims to optimize data centers so they can scale efficiently to support growing workloads and applications.
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect... (HostedbyConfluent)
To remain competitive, organizations need to democratize access to fast analytics, not only to gain real-time insights on their business but also to power smart apps that need to react in the moment. In this session, you will learn how Kafka and SingleStore enable a modern yet simple data architecture to analyze both fast-paced incoming data and large historical datasets. In particular, you will understand why SingleStore is well suited to process data streams coming from Kafka.
Bridge Your Kafka Streams to Azure Webinar (Confluent)
With a fully managed Apache Kafka® as-a-service on Microsoft Azure, businesses can focus on building applications and not managing clusters. Build a persistent bridge from on-premises data systems to the cloud with a hybrid Kafka service, or stream across public clouds for multi-cloud data pipelines.
In this session for business and technical data leaders, you can learn about powering business applications with the managed Kafka service that streams data into Azure SQL Data Warehouse, Cosmos DB, Azure Data Lake Storage and Azure Blob Storage.
Fast Data – Fast Cars: How Apache Kafka Is Revolutionizing the Data World (Confluent)
For the automotive industry, as for every other sector, digital transformation is also a digital revolution: new market players, new technologies, and ever-growing volumes of data create new opportunities as well as new challenges, and they demand not only new IT architectures but entirely new ways of thinking.
60% of Fortune 500 companies rely on the comprehensive distributed streaming platform Apache Kafka® for their data streaming projects, among them AUDI AG.
In this webinar you will learn:
How Kafka serves as the foundation both for data pipelines and for applications that consume and process real-time data streams
How Kafka Connect and Kafka Streams support business-critical applications
How Audi used Kafka and Confluent to build a fast data IoT platform that is revolutionizing the connected car space
Speakers:
David Schmitz, Principal Architect, Audi Electronics Venture GmbH
Kai Waehner, Technology Evangelist, Confluent
Rodrigo Campos presented on Linux systems capacity planning. He discussed performance monitoring tools like Sysstat and common metrics like CPU usage. He explained concepts from queueing theory, such as utilization and Little's Law, and the use of modeling tools like PDQ to create what-if scenarios of system performance. Campos provided an example of modeling a web application using a customer behavior model to understand and optimize performance bottlenecks.
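The queueing-theory concepts are easy to make concrete. With assumed arrival and service rates, utilization and Little's Law reduce to a few lines:

```python
# Utilization: rho = lambda / (m * mu), for arrival rate lambda,
# m servers, and per-server service rate mu.
arrival_rate = 80.0   # requests per second (assumed)
service_rate = 25.0   # requests per second one worker can handle (assumed)
workers = 4

utilization = arrival_rate / (workers * service_rate)
print(f"utilization = {utilization:.0%}")  # 80%

# Little's Law: L = lambda * W. If requests spend W = 0.2 s in the
# system on average, the average number in the system is:
avg_time_in_system = 0.2
avg_in_system = arrival_rate * avg_time_in_system
print(f"avg requests in system = {avg_in_system}")  # 16.0
```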
Spark Streaming and IoT by Mike FreedmanSpark Summit
This document discusses using Spark Streaming for IoT applications and the challenges involved. It notes that while Spark simplifies programming across different processing intervals, from batch to stream, programming models alone are not sufficient because IoT data streams can have varying rates and delays. It proposes a unified data infrastructure with abstractions like data series that support joining real-time and historical data while handling delays transparently. It also suggests approaches for Spark Streaming to better support processing many independent low-volume IoT streams concurrently and to improve resource utilization for such applications. Finally, it introduces the Device-Model-Infra framework for addressing these IoT analytics challenges through combined programming models and data abstractions.
Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ... (Confluent)
A powerful stream processing platform and an end-user-friendly spreadsheet interface: if this combination rings a bell, you should definitely attend our "Streamsheets and Apache Kafka" webinar. While development is interactive with a web user interface, Streamsheets applications can run as mission-critical applications. They directly consume and produce event streams in Apache Kafka. One popular option is to run everything in the cloud, leveraging the fully managed Confluent Cloud service on AWS, GCP, or Azure. Without any coding or scripting, end-users leverage their existing spreadsheet skills to build customized streaming apps for analysis, dashboarding, condition monitoring, or any kind of real-time pre- and post-processing of Kafka or ksqlDB streams and tables.
Hear Kai Waehner of Confluent and Kristian Raue of Cedalo on these topics:
• Where Apache Kafka and Streamsheets fit in the data ecosystem (Industrial IoT, Smart Energy, Clinical Applications, Finance Applications)
• Customer Story: How the Freiburg University Hospital uses Kafka and Streamsheets for dashboarding the utilization of clinical assets
• 15-Minute Live Demonstration: Building a financial fraud detection dashboard based on Confluent Cloud, ksqlDB and Cedalo Cloud Streamsheets just using spreadsheet formulas.
Speakers:
Kai Waehner, Technology Evangelist, Confluent
Kristian Raue, Founder & Chief Technologist, cedalo
Unlocking the Power of IoT: A comprehensive approach to real-time insights (Confluent)
In today's data-driven world, the Internet of Things (IoT) is revolutionizing industries and unlocking new possibilities. Join Data Reply, Confluent, and Imply as we unveil a comprehensive solution for IoT that harnesses the power of real-time insights.
A presentation on the integration of real-time data with the cloud, with significant potential in the areas of industrial IT, real-time sensor information processing, and smart grids applied to various vertical industries. This is related to my blog post at www.cloudshoring.in
Open Source Bristol 30 March 2022
https://www.meetup.com/Open-Source-Bristol/events/284198269/
18:35 // 'Building a Scalable Event Streaming and Messaging Platform using Apache Pulsar for Fintech' // Tim Spann and John Kinson
Today, companies are adopting Apache Pulsar, an open-source messaging and event streaming platform. Pulsar’s scalability and cloud-native capabilities make it uniquely positioned to meet a range of emerging business needs, including AdTech, fraud detection, IoT analytics, microservices development, and payment processing.
Tim Spann and John Kinson will share insights into the modern data streaming landscape, how Apache Pulsar fits into it, and how it can be used for Fintech. John will also talk about the origins of StreamNative as a Commercial Open Source Software company, and how that has shaped the go-to-market strategy.
[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...Vinu Charanya
Twitter is powered by thousands of microservices that run on our internal Cloud platform which consists of a suite of multi-tenant platform services that offer Compute, Storage, Messaging, Monitoring, etc as a service. These platforms have thousands of tenants and run atop hundreds of thousands of servers, across on-prem & the public cloud. The scale & diversity in multi-tenant infrastructure services make it extremely difficult to effectively forecast capacity, compute resource utilization & cost and drive efficiency.
In this talk, I would like to share how my team is building a system (Kite - A unified service manager) to help define, model, provision, meter & charge infrastructure resources. The infrastructure resources include primitive bare metal servers / VMs on the public cloud and abstract resources offered by multi-tenant services such as our Compute platform (powered by Apache Aurora/Mesos), Storage (Manhattan for key/val, Cache, RDBMS), Observability. Along with how we solved this problem, I also intend to share a few case-studies on how we were able to use this data to better plan capacity & drive a cultural change in engineering that helped improve overall resource utilization & drive significant savings in infrastructure spending.
In the wake of IoT becoming ubiquitous, there has been large interest in the industry in developing novel techniques for anomaly detection at the Edge. Example applications include, but are not limited to, smart cities/grids of sensors, industrial process control in manufacturing, smart homes, wearables, connected vehicles, and agriculture (sensing for soil moisture and nutrients). What makes anomaly detection at the Edge different? The following constraints, be they due to the sensors or the applications, necessitate the development of new algorithms for AD:
* Very low power and low compute/memory resources
* High data volume making centralized AD infeasible owing to the communication overhead
* Need for low latency to drive fast action taking
* Guaranteeing privacy
In this talk we shall throw light on the above in detail. Subsequently, we shall walk through the algorithm design process for anomaly detection at the Edge. Specifically, we shall dive into the need to build small models/ensembles owing to limited memory on the sensors. Further, we shall discuss how to train on data in an online fashion, as long-term historical data is not available due to limited storage. Given the need for data compression to contain the communication overhead, can one carry out anomaly detection on compressed data? We shall throw light on the building of small models, sequential and one-shot learning algorithms, compressing the data with the models, and limiting the communication to only the data corresponding to the anomalies and the model description. We shall illustrate the above with concrete examples from the wild!
Serverless Streaming Architectures and Algorithms for the EnterpriseArun Kejariwal
In recent years, serverless has gained momentum in the realm of cloud computing. Broadly speaking, it comprises function as a service (FaaS) and backend as a service (BaaS). The distinction between the two is that under FaaS, one writes and maintains the code (e.g., the functions) for serverless compute; in contrast, under BaaS, the platform provides the functionality and manages the operational complexity behind it. Serverless provides a great means to boost development velocity. With greatly reduced infrastructure costs, more agile and focused teams, and faster time to market, enterprises are increasingly adopting serverless approaches to gain a key advantage over their competitors.
Example early use cases of serverless include, for example, data transformation in batch and ETL scenarios and data processing using MapReduce patterns. As a natural extension, serverless is being used in the streaming context such as, but not limited to, real-time bidding, fraud detection, intrusion detection. Serverless is, arguably, naturally suited to extracting insights from fast data, that is, high-volume, high-velocity data. Example tasks in this regard include filtering and reducing noise in the data and leveraging machine learning and deep learning models to provide continuous insights about business operations.
We walk the audience through the landscape of streaming systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage. We overview the inception and growth of the serverless paradigm. Further, we deep dive into Apache Pulsar, which provides native serverless support in the form of Pulsar functions, and paint a bird’s-eye view of the application domains where Pulsar functions can be leveraged.
Baking in intelligence in a serverless flow is paramount from a business perspective. To this end, we detail different serverless patterns—event processing, machine learning, and analytics—for different use cases and highlight the trade-offs. We present perspectives on how advances in hardware technology and the emergence of new applications will impact the evolution of serverless streaming architectures and algorithms. The topics covered include an introduction to streaming, an introduction to serverless, serverless and streaming requirements, Apache Pulsar, application domains, serverless event processing patterns, serverless machine learning patterns, and serverless analytics patterns.
Sequence-to-Sequence Modeling for Time SeriesArun Kejariwal
In this talk we overview Sequence-2-Sequence (S2S) and explore its early use cases. We walk the audience through how to leverage S2S modeling for several use cases, particularly with regard to real-time anomaly detection and forecasting.
Sequence-to-Sequence Modeling for Time SeriesArun Kejariwal
This document provides an overview of time series forecasting using deep learning techniques. It discusses recurrent neural networks (RNNs) and their application to time series forecasting, including different RNN architectures like LSTMs and attention mechanisms. It also summarizes various approaches to training RNNs, such as backpropagation through time, and regularization techniques. Finally, it lists several examples of time series forecasting applications and provides references for further reading on the topic.
In this talk we walk through an architecture in which models are served in real time and the models are updated, using Apache Pulsar, without restarting the application at hand. We then describe how to apply Pulsar functions to support two example use cases—sampling and filtering—and explore a concrete case study of the same.
Designing Modern Streaming Data ApplicationsArun Kejariwal
Many industry segments have been grappling with fast data (high-volume, high-velocity data). The enterprises in these industry segments need to process this fast data just in time to derive insights and act upon it quickly. Such tasks include but are not limited to enriching data with additional information, filtering and reducing noisy data, enhancing machine learning models, providing continuous insights on business operations, and sharing these insights just in time with customers. In order to realize these results, an enterprise needs to build an end-to-end data processing system, from data acquisition, data ingestion, data processing, and model building to serving and sharing the results. This presents a significant challenge, due to the presence of multiple messaging frameworks and several streaming computing frameworks and storage frameworks for real-time data.
In this tutorial we lead a journey through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline: messaging frameworks, streaming computing frameworks, storage frameworks for real-time data, and more. We also share case studies from the IoT, gaming, and healthcare, as well as our experience operating these systems at internet scale at Twitter and Yahoo. We conclude by offering our perspectives on how advances in hardware technology and the emergence of new applications will impact the evolution of messaging systems, streaming systems, storage systems for streaming data, and reinforcement learning-based systems that will power fast processing and analysis of a large (potentially of the order of hundreds of millions) set of data streams.
Topics include:
* An introduction to streaming
* Common data processing patterns
* Different types of end-to-end stream processing architectures
* How to seamlessly move data across different frameworks
* Case studies: Healthcare and the IoT
* Data sketches for mining insights from data streams
There has been a shift from big data to live streaming data to facilitate faster data-driven decision making. As the number of live data streams grow—partly a result of the expanding IoT—it is critical to develop techniques to better extract actionable insights.
One current application, anomaly detection, is a necessary but insufficient step, due to the fact that anomaly detection over a set of live data streams may result in an anomaly fatigue, limiting effective decision making. One way to address the above is to carry out anomaly detection in a multidimensional space. However, this is typically very expensive computationally and hence not suitable for live data streams. Another approach is to carry out anomaly detection on individual data streams and then leverage correlation analysis to minimize false positives, which in turn helps in surfacing actionable insights faster.
In this talk, we explain how marrying correlation analysis with anomaly detection can help and share techniques to guide effective decision making.
Topics include:
* An overview of correlation analysis
* Robust correlation analysis
* Overview of alternative measures, such as co-median
* Trade-offs between speed and accuracy
* Correlation analysis in large dimensions
In this talk we walk the audience through how to marry correlation analysis with anomaly detection, discuss how the topics are intertwined, and detail the challenges one may encounter based on production data. We also showcase how deep learning can be leveraged to learn nonlinear correlation, which in turn can be used to further contain the false positive rate of an anomaly detection system. Further, we provide an overview of how correlation can be leveraged for common representation learning.
There has been a shift from big data to live streaming data to facilitate faster data-driven decision making. As the number of live data streams grow—partly a result of the expanding IoT—it is critical to develop techniques to better extract actionable insights.
One current application, anomaly detection, is a necessary but insufficient step, due to the fact that anomaly detection over a set of live data streams may result in an anomaly fatigue, limiting effective decision making. One way to address the above is to carry out anomaly detection in a multidimensional space. However, this is typically very expensive computationally and hence not suitable for live data streams. Another approach is to carry out anomaly detection on individual data streams and then leverage correlation analysis to minimize false positives, which in turn helps in surfacing actionable insights faster.
In this talk we explain how marrying correlation analysis with anomaly detection can help and share techniques to guide effective decision making.
Topics include:
* An overview of correlation analysis
* Robust correlation analysis
* Trade-offs between speed and accuracy
* Multi-modal correlation analysis
Detection and filtering of anomalies in live data is of paramount importance for robust decision making. To this end, in this talk we share techniques for anomaly detection in live data.
Anomaly detection in real-time data streams using HeronArun Kejariwal
Twitter has become the de facto medium for consumption of news in real time, and billions of events are generated and analyzed on a daily basis. To analyze these events, Twitter designed its own next-generation streaming system, Heron. Arun Kejariwal and Karthik Ramasamy walk you through how Heron is used to detect anomalies in real-time data streams. Although there’s been over 75 years of prior work in anomaly detection, most of the techniques cannot be used off the shelf because they’re not suitable for high-velocity data streams. Arun and Karthik explain how to make trade-offs between accuracy and speed and discuss incremental approaches that marry sampling with robust measures such as median and MCD for anomaly detection.
Data Data Everywhere: Not An Insight to Take Action UponArun Kejariwal
The big data era is characterized by ever-increasing velocity and volume of data. Over the last two or three years, several talks at Velocity have explored how to analyze operations data at scale, focusing on anomaly detection, performance analysis, and capacity planning, to name a few topics. Knowledge sharing of the techniques for the aforementioned problems helps the community to build highly available, performant, and resilient systems.
A key aspect of operations data is that data may be missing—referred to as “holes”—in the time series. This may happen for a wide variety of reasons, including (but not limited to):
# Packets being dropped due to unresponsive downstream services
# A network hiccup
# Transient hardware or software failure
# An issue with the data collection service
“Holes” in the time series can potentially skew the analysis of data. This in turn can materially impact decision making. Arun Kejariwal presents approaches for analyzing operations data in the presence of “holes” in the time series, highlighting how missing data impacts common data analysis such as anomaly detection and forecasting, discussing the implications of missing data on time series of different granularities, such as minutely and hourly, and exploring a gamut of techniques that can be used to address the missing data issue (e.g., approximate the data using interpolation, regression, ensemble methods, etc.). Arun then walks you through how the techniques can be leveraged using real data.
Real Time Analytics: Algorithms and SystemsArun Kejariwal
In this tutorial, an in-depth overview of the streaming analytics landscape (applications, algorithms, and platforms) is presented. We walk through how the field has evolved over the last decade and then discuss the current challenges: the impact of the other three Vs, viz., Volume, Variety, and Veracity, on Big Data streaming analytics.
Finding bad apples early: Minimizing performance impactArun Kejariwal
The big data era is characterized by the ever-increasing velocity and volume of data. In order to store and analyze the ever-growing data, the operational footprint of data stores and Hadoop has also grown over time. (As per a recent report from IDC, the spending on big data infrastructure is expected to reach $41.5 billion by 2018.) The clusters comprise several thousand nodes. The high performance of such clusters is vital for delivering the best user experience and productivity of teams.
The performance of such clusters is often limited by slow/bad nodes. Finding slow nodes in large clusters is akin to finding a needle in a haystack; hence, manual identification of slow/bad nodes is not practical. To this end, we developed a novel statistical technique to automatically detect slow/bad nodes in clusters comprising hundreds to thousands of nodes. We modeled the problem as a classification problem and employed a simple, yet very effective, distance measure to determine slow/bad nodes. The key highlights of the proposed technique are the following:
# Robustness against anomalies (note that anomalies may occur, for example, due to an ad-hoc heavyweight job on a Hadoop cluster)
# Given the varying data characteristics of different services, no one model fits all. Consequently, we parameterized the threshold used for classification
The proposed technique works well with both hourly and daily data, and has been in use in production by multiple services. This has not only eliminated manual investigation efforts, but has also mitigated the impact of slow nodes, which used to get detected after several weeks/months of lag!
We shall walk the audience through how the techniques are being used with REAL data.
This document discusses stream processing and anomaly detection. It covers real-time analytics using streaming systems like Storm. Storm provides a framework for processing streaming data reliably and at scale. The document describes Storm's architecture and data model. It also discusses how Twitter uses Storm to process billions of messages daily. The document then covers anomaly detection in Storm systems, including identifying performance bottlenecks, anomalous nodes, and input traffic spikes in real-time. Statistical and correlation techniques are used to detect anomalies while minimizing false positives.
Statistical Learning Based Anomaly Detection @ TwitterArun Kejariwal
This document discusses Twitter's approach to statistical learning based anomaly detection. It begins with an overview of anomaly detection challenges at scale given Twitter's massive time series data. It then reviews traditional approaches and their limitations, particularly in dealing with seasonality. The document proposes addressing seasonality through time series decomposition before applying a robust statistical approach like ESD on the residual. It provides an example and discusses applications and production deployment at Twitter. In closing, it promotes joining Twitter's efforts in open sourcing their anomaly detection work.
Days In Green (DIG): Forecasting the life of a healthy serviceArun Kejariwal
This document describes Twitter's Days In Green (DIG) methodology for forecasting the lifespan of a healthy service before it exceeds a predefined capacity threshold. It involves collecting time series data on a service's key performance metric, detecting anomalies and breakouts, fitting an ARIMA model to capture trends and seasonality, and forecasting the number of days before the threshold is breached to determine capacity needs. The methodology has been deployed at Twitter to help plan capacity for hundreds of services and detect those nearing disaster recovery thresholds.
Gimme More! Supporting User Growth in a Performant and Efficient FashionArun Kejariwal
This document discusses capacity planning approaches for supporting user growth at Twitter. It describes the need to plan capacity proactively through forecasting to ensure good user experience without overprovisioning resources. The document evaluates several forecasting models like linear regression, splines, Holt-Winters, and ARIMA and their suitability for Twitter's data based on characteristics like outliers, seasonality, and boundary conditions. It emphasizes that accurate forecasting requires continuous refinement of models as the data stream evolves over time.
This document summarizes Twitter's approach to capacity planning for large events like the Super Bowl. It discusses using historical traffic patterns to predict capacity needs, analyzing key metrics like tweets per second, and planning for potential traffic spikes through statistical analysis and scenario modeling. For Super Bowl 2013, Twitter's models predicted a traffic spike could push tweets per second into the 20,000+ range, higher than previous years, and the company was able to maintain high availability during the game despite the brief blackout.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features available on those devices, but many features trade security for convenience and capability. This best practices guide outlines steps users can take to better protect personal devices and information.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
What do a Lego brick and the XZ backdoor have in common?Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might appear to have in common only the fact that both are building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the XZ backdoor case have much more in common than that.
Join the presentation to immerse yourself in a story of interoperability, standards, and open formats, and then discuss the important role contributors play in a sustainable open source community.
BIO: Advocate for free software and for standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she has been involved in several LibreOffice-related events, migrations, and trainings. She previously worked on LibreOffice migrations and training courses for several public administrations and private organizations. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when not pursuing her passion for computers and for Geeko she cultivates her curiosity about astronomy (the source of her nickname, deneb_alpha).
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
4. Internet of Things (IoT)
Large market potential: $1.9T in value by 2020 - Mfg (15%), Health Care (15%), Insurance (11%); 26B-75B units [2, 3, 4, 5]
Improve operational efficiencies, customer experience, and enable new business models
Beacons: retailers and bank branches - 60M units market by 2019 [6]
Smart buildings: reduce energy costs, cut maintenance costs, increase safety & security
5. The Future
Exponential growth of the mobile sensor network [1]; Biostamps [2]
[1] http://opensignal.com/assets/pdf/reports/2015_08_fragmentation_report.pdf
[2] http://www.ericsson.com/thinkingahead/networked_society/stories/#/film/mc10-biostamp
6. Intelligent Health Care
Continuous monitoring; tracking movements; measuring the effect of social influences
Google Lens: measure glucose level in tears
Watch/wristband
Smart textiles: skin temperature, perspiration
Ingestible sensors: medication compliance [1], heart function
8. Increasingly Connected World
Internet of Things: 30B connected devices by 2020
Health Care: 153 Exabytes (2013) -> 2,314 Exabytes (2020)
Machine Data: 40% of the digital universe by 2020
Connected Vehicles: data transferred per vehicle per month: 4 MB -> 5 GB
Digital Assistants (Predictive Analytics): $2B (2012) -> $6.5B (2019) [1]; Siri/Cortana/Google Now
Augmented/Virtual Reality: $150B by 2020 [2]; Oculus/HoloLens/Magic Leap
10. Traditional Data Processing
Challenges of data at rest (Store -> Analyze -> Act):
Introduces too much "decision latency"
Responses are delivered "after the fact"
Maximum value of the identified situation is lost
Decisions are made on old and stale data
11. The New Era: Streaming Data/Fast Data
Data in motion: events are analyzed and processed in real time as they arrive
Decisions are timely, contextual, and based on fresh data
Decision latency is eliminated
12. Real Time Use Cases
Algorithmic trading, online fraud detection, geo-fencing, proximity/location tracking, intrusion detection systems, traffic management, real-time recommendations, churn detection, Internet of Things, social media/data analytics, gaming data feeds
13. Requirements of Stream Processing
In-stream: process data as it passes by
Handle imperfections: delayed, missing, and out-of-order data
Predictable: repeatable results
Performance and scalability
14. Requirements of Stream Processing
High-level languages: SQL or a DSL for comparing the present with the past
Integrate stored and streaming data
Data safety and availability
Process and respond: the application should keep up at high volumes
17. Current Messaging Systems
ActiveMQ, RabbitMQ, Pulsar, RocketMQ, Azure Event Hub, Google Pub-Sub, Satori, Kafka
18. Why Apache Pulsar?
Ordering: guaranteed ordering
Multi-tenancy: a single cluster can support many tenants and use cases
High throughput: can reach 1.8M messages/s in a single partition
Durability: data replicated and synced to disk
Geo-replication: out-of-box support for geographically distributed applications
Unified messaging model: supports both topic & queue semantics in a single model
Delivery guarantees: at least once, at most once, and effectively once
Low latency: low publish latency of 5 ms at the 99th percentile
Highly scalable: can support millions of topics
20. Pulsar Producer
PulsarClient client = PulsarClient.create(
    "http://broker.usw.example.com:8080");
Producer producer = client.createProducer(
    "persistent://my-property/us-west/my-namespace/my-topic");
// handles retries in case of failure
producer.send("my-message".getBytes());
// Async version:
producer.sendAsync("my-message".getBytes()).thenRun(() -> {
    // Message was persisted
});
21. Pulsar Consumer
PulsarClient client = PulsarClient.create(
    "http://broker.usw.example.com:8080");
Consumer consumer = client.subscribe(
    "persistent://my-property/us-west/my-namespace/my-topic",
    "my-subscription-name");
while (true) {
    // Wait for a message
    Message msg = consumer.receive();
    System.out.println("Received message: " + msg.getData());
    // Acknowledge the message so that it can be deleted by broker
    consumer.acknowledge(msg);
}
22. Pulsar Architecture
Stateless serving: Apache Pulsar brokers sit between producers/consumers and an Apache BookKeeper storage layer (bookies)
BROKER: clients interact only with brokers; no state is stored in brokers
BOOKIES: Apache BookKeeper as the storage; storage is append-only; provides high performance and low latency
Durability: no data loss; fsync before acknowledgement
23. Pulsar Architecture
Separation of storage and serving
SERVING: brokers can be added independently; traffic can be shifted quickly across brokers
STORAGE: bookies can be added independently; new bookies will ramp up traffic quickly
24. Pulsar Architecture
CLIENTS: look up the correct broker through service discovery; establish connections to brokers; enforce authentication/authorization during connection establishment; establish producer/consumer sessions; reconnect with a backoff strategy
(Broker internals: dispatcher, load balancer, managed ledger, cache, global replication, service discovery)
25. Pulsar Architecture
Message dispatching
DISPATCHER: end-to-end async message processing; messages relayed across producers, bookies, and consumers with no copies; pooled reference-count buffers
MANAGED LEDGER: abstraction of single-topic storage; caches recent messages
26. Pulsar Architecture
GEO REPLICATION: asynchronous replication; integrated in the broker message flow; simple configuration to add/remove regions
(Example: topic T1 replicated across data centers A, B, and C, with producers P1-P3 and consumers C1-C2 attached to local subscriptions S1)
27. Pulsar Use Cases - Message Queue
Decouple online events from offline workers consuming the same topic
MESSAGE QUEUES: decouple online and background processing; high availability; reliable data transport; notifications; long-running tasks; low-latency publish
28. Pulsar Use Cases - Feedback System
A controller publishes to an event topic to propagate states to a fleet of serving systems
FEEDBACK SYSTEM: coordinate a large number of machines; propagate states
Examples: state propagation, personalization, ad systems, feedback updates
29. Pulsar in Production
3+ years in production
Serves 2.3 million topics
100 billion messages/day
Average latency < 5 ms; 99th percentile 15 ms (with strong durability guarantees)
Zero data loss
80+ applications
Self-served provisioning
Full-mesh cross-datacenter replication across 8+ data centers
33. Apache Beam
Promises: abstracting the computation
Express computation: expressive windowing/triggering; incremental processing for late data
Selectable engine: select by criteria such as latency and resource cost
Supported engines: Google DataFlow, Apache Spark, Apache Flink, Apache Apex
34. Apache Beam
Computation abstraction
All data is a 4-tuple: key, value, event time, and the window the tuple belongs to
Core operators:
ParDo: user-supplied DoFn; emits zero or more elements
GroupByKey: groups tuples by key within the window
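To make the 4-tuple/ParDo/GroupByKey model concrete, here is a minimal word-count sketch against the Beam Java SDK. The input literals and the printing DoFn are our own illustration, and Count.perElement() stands in for a hand-rolled GroupByKey plus combiner; treat this as a sketch, not the deck's own example.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class WordCountSketch {
  public static void main(String[] args) {
    // Uses the direct runner by default (requires the direct-runner dependency);
    // a different engine (Dataflow, Spark, Flink, Apex) is selected via options.
    Pipeline p = Pipeline.create();
    p.apply(Create.of("to", "be", "or", "not", "to", "be"))
     // Count.perElement() is GroupByKey plus a combiner under the hood
     .apply(Count.perElement())
     .apply(ParDo.of(new DoFn<KV<String, Long>, Void>() {
       @ProcessElement
       public void processElement(ProcessContext c) {
         System.out.println(c.element().getKey() + ": " + c.element().getValue());
       }
     }));
    p.run().waitUntilFinish();
  }
}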
36. Apache Beam
Challenges:
Multiple layers: API vs. execution; troubleshooting complexities
Need higher-level APIs: multiple efforts on their way
Other cloud vendor buy-in? Azure/AWS?
37. IBM S-Store
Promises: combine stream processing and transactions
Extended an OLTP engine (H-Store), adding tuple ordering, windowing, push-based processing, and exactly-once semantics
38. IBM S-Store
Data and processing model:
Tuples are grouped into atomic batches: a grouping of non-overlapping tuples, treated like a transaction
Atomic batches belong to one stream
Processing is modeled as a DAG: DAG nodes consume one or more streams and possibly output more; node logic is treated as a transaction
39. IBM S-Store
Exactly-once guarantees:
Strong: inputs and outputs are logged at every DAG node; on component failure, the log is replayed from a snapshot
Weak: distributed snapshotting
40. IBM S-Store
Challenges:
Throughput: non-OLTP processing is much slower compared to modern systems
Scalability: multi-node support still in research (2016)
41. Heron Terminology
Topology: directed acyclic graph; vertices = computation, edges = streams of data tuples
Spouts: sources of data tuples for the topology; examples - Pulsar/Kafka/MySQL/Postgres
Bolts: process incoming tuples and emit outgoing tuples; examples - filtering/aggregation/join/any function
44. Heron Groupings
Shuffle grouping: random distribution of tuples
Fields grouping: group tuples by a field or multiple fields
All grouping: replicates tuples to all tasks
Global grouping: send the entire stream to one task
46. Writing Heron Topologies
Procedural - low-level API: directly write your spouts and bolts (see the sketch below)
Functional - mid-level API: use of maps, flat maps, transform, windows
Declarative - SQL (coming): use of a declarative language; specify what you want, and the system will figure it out
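A minimal sketch of the procedural (low-level) API, assuming Heron's Storm-compatible Java API. WordSpout and CountBolt are hypothetical user-written components, and the package names vary across Heron releases (later releases moved to org.apache.heron):

import com.twitter.heron.api.Config;
import com.twitter.heron.api.HeronSubmitter;
import com.twitter.heron.api.topology.TopologyBuilder;
import com.twitter.heron.api.tuple.Fields;

public class WordCountTopology {
  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    // WordSpout and CountBolt are hypothetical user-written components.
    builder.setSpout("word", new WordSpout(), 2);
    // Fields grouping: tuples with the same "word" value go to the same task.
    builder.setBolt("count", new CountBolt(), 4)
           .fieldsGrouping("word", new Fields("word"));
    Config conf = new Config();
    HeronSubmitter.submitTopology("word-count", conf, builder.createTopology());
  }
}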
47. Heron Design Goals
Efficiency: reduce resource consumption
Support for diverse workloads: throughput- vs. latency-sensitive
Support for multiple semantics: at most once, at least once, effectively once
Native multi-language support: C++, Java, Python
Task isolation: ease of debug-ability/isolation/profiling
Support for back pressure: topologies should be self-adjusting
Use of containers: runs in schedulers - Kubernetes & DC/OS & many more
Multi-level APIs: procedural, functional, and declarative, for diverse applications
Diverse deployment models: run as a service or as a pure library
61. Pulsar Operations
Reacting to failures: brokers, bookies
Common issues: consumer backlog, I/O prioritization and throttling, multi-tenancy
62. Reacting to Failures - Brokers
Brokers don't have durable state, so they are easily replaceable; topics are immediately reassigned to healthy brokers
Expanding capacity: simply add a new broker node; if other brokers are overloaded, traffic will be automatically reassigned
Load manager: monitors traffic load on all brokers (CPU, memory, network, topics); initially places topics on the least loaded brokers; reassigns topics when a broker is overloaded
63. Reacting to Failures - Bookies
When a bookie fails, brokers will immediately continue on other bookies
The auto-recovery mechanism re-establishes the replication factor in the background
If a bookie keeps giving errors or timeouts, it will be "quarantined": not considered for new ledgers for some period of time
64. Consumer Backlog
Metrics are available to make assessments: when the problem started; how big the backlog is (messages? disk space?); how fast it is draining; what the ETA is to catch up with publishers
Establish where the bottleneck is: the application is not fast enough, or disk read I/O
65. I/O Prioritization and Throttling
Prioritize access to I/O: during an outage, many tenants might try to drain their backlog as fast as they can, and read I/O becomes the bottleneck
Throttling can be used to prioritize draining: critical use cases can recover quickly; fewer concurrent readers lead to higher throughput; once consumers catch up, messages will be dispatched from cache
66. Enforcing Multi-Tenancy
Ensure tenants don't cause performance issues for other tenants
Soft isolation: backlog quotas, flow control, throttling
Hard isolation: used when user behavior is triggering performance degradation, as a last resort for quick reaction while a proper fix is deployed; isolate the tenant on a subset of brokers; can also be applied at the BookKeeper level
67. Heron @Twitter
Largest cluster: 100's of topologies; billions of messages; 100's of terabytes
Reduced incidents; good night's sleep
3X - 5X reduction in resource usage
73. Heron Happy Facts :)
No more pages during midnight for the Heron team
Very rare incidents for Heron customer teams
Easy to debug during incidents for quick turnaround
Reduced resource utilization, saving cost
79. Data Skew
Multiple keys: several keys map into a single instance and their combined count is high
Single key: a single key maps into an instance and its count is high
81. Self Regulating Streaming Systems
Tuning: manual, time-consuming, and error-prone task of tuning various system knobs to achieve SLOs
SLO maintenance: maintaining SLOs in the face of unpredictable load variations and hardware or software performance degradation
Self-regulating streaming systems: systems that adjust themselves to environmental changes and continue to produce results
82. Self Regulating Streaming Systems
Self-tuning: there are several tuning knobs and a time-consuming tuning phase; the system should take an SLO as input and automatically configure the knobs
Self-stabilizing: stream jobs are long-running and load variations are common; the system should react to external shocks and automatically reconfigure itself
Self-healing: system performance can be affected by hardware or software delivering degraded quality of service; the system should identify internal faults and attempt to recover from them
83. Enter Dhalion
Dhalion is a policy-based framework integrated into Heron
Dhalion periodically executes well-specified policies that optimize execution based on some objective
We created policies that dynamically provision resources in the presence of load variations and auto-tune streaming applications so that a throughput SLO is met
84. Dhalion Policy Framework
Metrics drive three phases, executed in sequence:
Symptom detection: symptom detectors 1..N emit symptoms
Diagnosis generation: diagnosers 1..M turn symptoms into diagnoses
Resolution: resolver selection picks a resolver (1..M), followed by resolver invocation
85. Dynamic Resource Provisioning
Policy: reacts to unexpected load variations (workload spikes)
Goal: scale the topology resources up and down as needed, while keeping the topology in a steady state where back pressure is not observed
86. Dynamic Resource Provisioning - Implementation
Symptom detection: pending tuples detector, backpressure detector, processing rate skew detector
Diagnosis generation: resource overprovisioning, resource underprovisioning, data skew, and slow instances diagnosers
Resolution: bolt scale-down, bolt scale-up, data skew, and restart instances resolvers
93. Experimental Setup
Topology: Spout -> Splitter Bolt (shuffle grouping) -> Counter Bolt (fields grouping)
Hardware and software configuration: Microsoft HDInsight; Intel Xeon E5-2673 CPU @ 2.40 GHz; 28 GB of memory
Evaluation metrics: throughput of spouts (no. of tuples emitted over 1 min); throughput of bolts (no. of tuples emitted over 1 min); number of Heron instances provisioned
94. Dynamic Provisioning Profile
(Figures: normalized throughput of the Spout, Splitter Bolt, and Counter Bolt over 120 minutes, with scale-down and scale-up events S1-S3; number of Splitter and Counter bolts over time)
The Dynamic Resource Provisioning policy is able to adjust the topology resources on the fly when workload spikes occur.
The policy can correctly detect and resolve bottlenecks even on multi-stage topologies where backpressure is gradually propagated from one stage of the topology to another.
Heron instances are gradually scaled up and down according to the input load.
102. Data Sketches
Early work:
Membership [Bloom, 1970]
Counting [Morris, 1977]
Median of a sequence [Munro and Paterson, 1980]
Counting frequent elements [Misra and Gries, 1982]
Probabilistic counting [Flajolet and Martin, 1985]
The space complexity of approximating the frequency moments [Alon et al., 1996]
Computing on data streams [Henzinger et al., 1998]
107. Sampling
Obtain a representative sample from a data stream; maintain a dynamic sample
A data stream is a continuous process: it is not known in advance how many points may elapse before an analyst may need to use a representative sample
Reservoir sampling [1]: probabilistic insertions and deletions on arrival of new stream points; the probability of successive insertion of new points reduces with the progression of the stream
An unbiased sample contains a larger and larger fraction of points from the distant history of the stream
Practical perspective: the data stream may evolve, and hence the majority of the points in the sample may represent stale history
[1] J. S. Vitter. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, Vol. 11(1):37-57, March 1985.
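A minimal Java sketch of classic reservoir sampling (Algorithm R in Vitter [1]): keep the first k points, then admit the i-th point with probability k/i, evicting a uniformly random resident. Names here are our own.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Keeps a uniform random sample of size k from a stream of unknown length. */
public class ReservoirSampler<T> {
  private final int k;
  private final List<T> reservoir;
  private final Random rng = new Random();
  private long seen = 0;

  public ReservoirSampler(int k) {
    this.k = k;
    this.reservoir = new ArrayList<>(k);
  }

  public void offer(T item) {
    seen++;
    if (reservoir.size() < k) {
      reservoir.add(item);                       // fill phase: keep the first k items
    } else {
      long j = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
      if (j < k) {
        reservoir.set((int) j, item);            // replace with probability k/seen
      }
    }
  }

  public List<T> sample() { return reservoir; }
}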
108. Sampling
Sliding-window approaches (sample size k, window width n):
Sequence-based: replace the expired element with the newly arrived element; disadvantage: highly periodic
Chain-sample approach [1]: select element i with probability min(i, n)/n; select uniformly at random an index from [i+1, i+n] for the element that will replace the i-th item; maintain k independent chain samples
Timestamp-based: the number of elements in a moving window may vary over time; priority-sample approach
(Illustration omitted: a sliding window moving over the stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3)
[1] B. Babcock. Sampling From a Moving Window Over Streaming Data. In Proceedings of SODA, 2002.
109. Sampling
Biased reservoir sampling [1]: use a temporal bias function so that recent points have a higher probability of being represented in the sample reservoir
Memoryless bias functions: the future probability of retaining a current point in the reservoir is independent of its past history or arrival time
The probability of the r-th point belonging to the reservoir at time t is proportional to the bias function; the exponential bias function for the r-th data point at time t is f(r, t) = e^(-λ(t-r)), where r ≤ t and λ ∈ [0, 1] is the bias rate
The maximum reservoir requirement R(t) is bounded
[1] C. C. Aggarwal. On Biased Reservoir Sampling in the presence of Stream Evolution. In Proceedings of VLDB, 2006.
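A sketch of the simple replacement scheme for memoryless (exponential) bias as we read it from [1], with reservoir capacity n = 1/λ; written in Java for illustration, with our own names:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Biased reservoir sampling with an exponential (memoryless) bias function,
 * following the simple replacement scheme of Aggarwal (VLDB 2006).
 */
public class BiasedReservoirSampler<T> {
  private final int capacity;                      // n = 1/lambda bounds the size
  private final List<T> reservoir = new ArrayList<>();
  private final Random rng = new Random();

  public BiasedReservoirSampler(double lambda) {
    this.capacity = (int) Math.ceil(1.0 / lambda);
  }

  public void offer(T item) {
    double fillFraction = (double) reservoir.size() / capacity;
    if (rng.nextDouble() < fillFraction) {
      // Replace a uniformly random resident: older points decay exponentially.
      reservoir.set(rng.nextInt(reservoir.size()), item);
    } else {
      // Otherwise the new point is simply appended (the reservoir grows).
      reservoir.add(item);
    }
  }

  public List<T> sample() { return reservoir; }
}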
110. Filtering
Set membership: determine, with some false-positive probability, whether an item in a data stream has been seen before
Used in databases (e.g., to speed up semi-join operations), caches, routers, and storage systems
Reduce space requirements in probabilistic routing tables; speed up longest-prefix matching of IP addresses; encode multicast forwarding information in packets; summarize content to aid collaboration in overlay and peer-to-peer networks; improve network state management and monitoring
111. Filtering
Set membership: application to hyphenation programs; early UNIX spell checkers [1]
[1] Illustration borrowed from http://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
112. Filtering
Set membership: the Bloom filter is a natural generalization of hashing
False positives are possible; no false negatives; no deletions allowed
For false positive rate ε, # hash functions k = log2(1/ε), where n = # elements, k = # hash functions, and m = # bits in the array
113. Filtering
Set membership: minimizing the false positive rate ε w.r.t. k [1]
k = ln 2 * (m/n)
ε = (1/2)^k ≈ (0.6185)^(m/n)
1.44 * log2(1/ε) bits per item, independent of item size or # items
Information-theoretic minimum: log2(1/ε) bits per item, hence a 44% overhead
The false positive rate can be estimated from X = # of 0 bits in the array: the probability that a query for an absent item hits only set bits is approximately (1 - X/m)^k
[1] A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. In Internet Mathematics Vol. 1, No. 4, 2005.
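A minimal Bloom filter sketch in Java, sized from the formulas above. Double hashing stands in for k independent hash functions (a standard substitution, not from the slides), and all names are our own:

import java.util.BitSet;

/** Minimal Bloom filter sized for n expected items and false-positive rate eps. */
public class BloomFilter {
  private final BitSet bits;
  private final int m;      // number of bits
  private final int k;      // number of hash functions

  public BloomFilter(int n, double eps) {
    // m = -n * ln(eps) / (ln 2)^2 and k = (m/n) * ln 2, per the formulas above
    this.m = (int) Math.ceil(-n * Math.log(eps) / (Math.log(2) * Math.log(2)));
    this.k = Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    this.bits = new BitSet(m);
  }

  // Double hashing: h_i(x) = h1(x) + i * h2(x) mod m
  private int index(Object item, int i) {
    int h1 = item.hashCode();
    int h2 = (h1 >>> 16) | 1;                // force an odd second hash
    return Math.floorMod(h1 + i * h2, m);
  }

  public void add(Object item) {
    for (int i = 0; i < k; i++) bits.set(index(item, i));
  }

  /** True means "probably seen"; false means "definitely not seen". */
  public boolean mightContain(Object item) {
    for (int i = 0; i < k; i++) {
      if (!bits.get(index(item, i))) return false;
    }
    return true;
  }
}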
114. Filtering
Set membership: Cuckoo filter [1]
Key highlights: add and remove items dynamically; for false positive rate ε < 3%, more space efficient than a Bloom filter; higher performance than a Bloom filter for many real workloads; asymptotically worse than a Bloom filter (the minimum fingerprint size grows with the log of the # entries in the table)
Overview: stores only a fingerprint of each inserted item; the original key and value bits of each item are not retrievable; set membership query for item x: search the hash table for the fingerprint of x
[1] Fan et al., Cuckoo Filter: Practically Better Than Bloom. In Proceedings of the 10th ACM International Conference on Emerging Networking Experiments and Technologies, 2014.
115. Filtering
Set membership
Cuckoo hashing [1]: high space occupancy; practical implementations use multiple items per bucket; example uses: software-based Ethernet switches
Cuckoo filter [2]: uses a multi-way associative cuckoo hash table; employs partial-key cuckoo hashing: store the fingerprint of an item and relocate existing fingerprints to their alternative locations
[1] R. Pagh and F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122-144, 2004.
[2] Illustration borrowed from "Fan et al., Cuckoo Filter: Practically Better Than Bloom. In Proceedings of the 10th ACM International Conference on Emerging Networking Experiments and Technologies, 2014."
116. Filtering
Set membership: Cuckoo filter (continued)
Deletion: the item must have been previously inserted
Partial-key cuckoo hashing: hashing the fingerprint ensures a uniform distribution of items in the table; the fingerprint is significantly shorter than h1 or h2; multiple entries of the same fingerprint may appear in a bucket; the alternate bucket is derived from the current bucket and the fingerprint alone
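A compact cuckoo filter sketch in Java illustrating partial-key cuckoo hashing (4-way buckets, 16-bit fingerprints, alternate index i2 = i1 XOR hash(fp)); the parameters and mixing constants are illustrative choices, not from the paper:

import java.util.Random;

/** Minimal cuckoo filter: 4-way buckets, 16-bit fingerprints, partial-key hashing. */
public class CuckooFilter {
  private static final int SLOTS = 4;        // entries per bucket
  private static final int MAX_KICKS = 500;  // relocation attempts before giving up
  private final short[][] buckets;           // 0 marks an empty slot
  private final int numBuckets;              // must be a power of two
  private final Random rng = new Random();

  public CuckooFilter(int numBucketsPowerOfTwo) {
    this.numBuckets = numBucketsPowerOfTwo;
    this.buckets = new short[numBuckets][SLOTS];
  }

  private short fingerprint(Object item) {
    int h = item.hashCode() * 0x9E3779B9;
    short fp = (short) (h & 0xFFFF);
    return fp == 0 ? 1 : fp;                 // reserve 0 for "empty"
  }

  private int bucketOf(Object item) {
    return (item.hashCode() >>> 16) & (numBuckets - 1);
  }

  private int altBucket(int i, short fp) {
    // Partial-key cuckoo hashing: the alternate index depends only on (i, fp),
    // and applying it twice returns to the original bucket.
    return (i ^ (Short.hashCode(fp) * 0x5BD1E995)) & (numBuckets - 1);
  }

  private boolean placeIn(int i, short fp) {
    for (int s = 0; s < SLOTS; s++) {
      if (buckets[i][s] == 0) { buckets[i][s] = fp; return true; }
    }
    return false;
  }

  public boolean insert(Object item) {
    short fp = fingerprint(item);
    int i1 = bucketOf(item), i2 = altBucket(i1, fp);
    if (placeIn(i1, fp) || placeIn(i2, fp)) return true;
    // Both buckets full: kick a random resident fingerprint to its alternate.
    int i = rng.nextBoolean() ? i1 : i2;
    for (int kick = 0; kick < MAX_KICKS; kick++) {
      int s = rng.nextInt(SLOTS);
      short victim = buckets[i][s];
      buckets[i][s] = fp;
      fp = victim;
      i = altBucket(i, fp);
      if (placeIn(i, fp)) return true;
    }
    return false;                            // table considered full
  }

  public boolean mightContain(Object item) {
    short fp = fingerprint(item);
    int i1 = bucketOf(item), i2 = altBucket(i1, fp);
    for (int s = 0; s < SLOTS; s++) {
      if (buckets[i1][s] == fp || buckets[i2][s] == fp) return true;
    }
    return false;
  }

  public boolean delete(Object item) {
    short fp = fingerprint(item);
    int i1 = bucketOf(item);
    for (int i : new int[] { i1, altBucket(i1, fp) }) {
      for (int s = 0; s < SLOTS; s++) {
        if (buckets[i][s] == fp) { buckets[i][s] = 0; return true; }
      }
    }
    return false;
  }
}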
118. Cardinality
Distinct elements arise across domains: database systems/search engines (# distinct queries); network monitoring applications; natural language processing; # distinct motifs in a DNA sequence; # distinct elements in RFID/sensor networks
119. Cardinality
Previous work:
Probabilistic counting [Flajolet and Martin, 1985]
LogLog counting [Durand and Flajolet, 2003]
HyperLogLog [Flajolet et al., 2007]
Sliding HyperLogLog [Chabchoub and Hebrail, 2010]
HyperLogLog in practice [Heule et al., 2013]
Self-organizing bitmap [Chen and Cao, 2009]
Discrete max-count [Ting, 2014]: the sequence of sketches forms a Markov chain when h is a strongly universal hash; estimate cardinality using a martingale
121. Cardinality - HyperLogLog
Apply a hash function h to every element in a multiset
The cardinality of the multiset is estimated as 2^max(ϱ), where 0^(ϱ-1)1 is the bit pattern observed at the beginning of a hash value (i.e., ϱ is the position of the leftmost 1-bit)
The above suffers from high variance; to address this, employ stochastic averaging: partition the input stream into m = 2^p sub-streams S_i using the first p bits of the hash values, and combine the per-sub-stream estimates
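A minimal HyperLogLog sketch in Java with p index bits and only the small-range (linear counting) correction; the bias-correction constant is the standard approximation for m >= 128, the caller is assumed to supply a 64-bit hash, and all names are our own:

/** Minimal HyperLogLog with m = 2^p registers (assume 7 <= p <= 16). */
public class HyperLogLog {
  private final int p;
  private final int m;
  private final byte[] registers;

  public HyperLogLog(int p) {
    this.p = p;
    this.m = 1 << p;
    this.registers = new byte[m];
  }

  public void add(long hash) {                     // caller supplies a 64-bit hash
    int idx = (int) (hash >>> (64 - p));           // first p bits pick the register
    long rest = hash << p;                         // remaining 64 - p bits
    int rho = Long.numberOfLeadingZeros(rest) + 1; // position of leftmost 1-bit
    if (rho > 64 - p + 1) rho = 64 - p + 1;        // all-zero remainder
    if (rho > registers[idx]) registers[idx] = (byte) rho;
  }

  public double estimate() {
    double alpha = 0.7213 / (1 + 1.079 / m);       // bias correction, m >= 128
    double sum = 0;
    int zeros = 0;
    for (byte r : registers) {
      sum += Math.pow(2.0, -r);
      if (r == 0) zeros++;
    }
    double e = alpha * m * m / sum;                // harmonic-mean estimator
    if (e <= 2.5 * m && zeros > 0) {
      e = m * Math.log((double) m / zeros);        // linear counting for small n
    }
    return e;
  }
}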
122. Cardinality - HyperLogLog Optimizations [1, 2]
Use of a 64-bit hash function: the total memory requirement goes from 5 * 2^p to 6 * 2^p bits, where p is the precision
Empirical bias correction: uses empirically determined data for cardinalities smaller than 5m, and the unmodified raw estimate otherwise
Sparse representation: for n ≪ m, store an integer obtained by concatenating the bit patterns for idx and ϱ(w); use variable-length encoding for integers (a variable number of bytes per integer); use difference encoding (store the difference between successive elements)
[1] http://druid.io/blog/2014/02/18/hyperloglog-optimizations-for-real-world-systems.html
[2] http://antirez.com/news/75
123. Cardinality - Self-Learning Bitmap (S-bitmap) [1]
Achieves constant relative estimation error for unknown cardinalities in a wide range, say from 10s to >10^6
The bitmap is obtained via an adaptive sampling process: bits corresponding to sampled items are set to 1; sampling rates are learned from the # distinct items already passed, and are reduced sequentially as more bits are set to 1
For given input parameters Nmax and estimation precision ε, with r = 1 - 2ε²(1 + ε²)⁻¹ and sampling probability p_k = m(m + 1 - k)⁻¹(1 + ε²)r^k, where k ∈ [1, m] and m is the size of the bit mask, the relative error ≈ ε
[1] Chen et al. "Distinct counting with a self-learning bitmap". Journal of the American Statistical Association, 106(495):879-890, 2011.
124. Quantiles
Quantiles and histograms have a large set of real-world applications: database applications, sensor networks, operations
Desired properties: provide tunable and explicit guarantees on the precision of approximation; single pass
Early work:
[Greenwald and Khanna, 2001] - worst-case space requirement
[Arasu and Manku, 2004] - sliding-window-based model, worst-case space requirement
125. Quantiles - q-digest [1]
Groups values into variable-size buckets of almost equal weights; unlike a traditional histogram, buckets can overlap
Key features: detailed information about frequent values is preserved; less frequent values are lumped into larger buckets
Using a message of size m, answers queries within a bounded error
Defined over a complete binary tree on the value range (σ = max signal value, n = # elements, k = compression factor); except for root and leaf nodes, a node v belongs to the q-digest iff count(v) ≤ n/k and count(v) + count(parent(v)) + count(sibling(v)) > n/k
[1] Shrivastava et al., Medians and Beyond: New Aggregation Techniques for Sensor Networks. In Proceedings of SenSys, 2004.
126. Quantiles - q-digest
Building a q-digest: q-digests can be constructed in a distributed fashion, and independently built q-digests can be merged
127. Quantiles - t-digest [1]
Approximation of rank-based statistics: compute quantile q with an accuracy relative to max(q, 1-q); compute hybrid statistics such as trimmed statistics
Key features: robust with respect to highly skewed distributions; independent of the range of input values (unlike q-digest); relative error is bounded
Non-equal bin sizes: few samples contribute to the bins corresponding to the extreme quantiles
Merging independent t-digests preserves reasonable accuracy
[1] T. Dunning and O. Ertl, "Computing Extremely Accurate Quantiles using t-digests", 2017. https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf
128. Quantiles - t-digest
Group samples into sub-sequences: smaller sub-sequences near the ends, larger sub-sequences in the middle
Scaling function: a monotonic mapping k from quantile to notional index, with k(0) = 1 and k(1) = δ (the compression parameter); the k-size of each sub-sequence is < 1
129. Quantiles - t-digest
Estimating a quantile via interpolation: sub-sequences carry the centroid of their samples; estimate the boundaries of the sub-sequences
Error scales quadratically in the # samples per sub-sequence: the small # samples in the sub-sequences near q = 0 and q = 1 improves accuracy there; accuracy is lower in the middle of the distribution, where sub-sequences are larger
Two flavors: progressive merging (buffering based) and a clustering variant
130. Frequent Elements
Applications: track bandwidth hogs; determine popular tourist destinations; itemset mining; entropy estimation; compressed sensing; search log mining; network data analysis; DBMS optimization
131. Frequent Elements - Count-min Sketch [1]
A two-dimensional array of counts with w columns and d rows; each entry of the array is initially zero
d hash functions h_1, ..., h_d : {1...n} -> {1...w} are chosen uniformly at random from a pairwise independent family
Update: for a new element i, for each row j and k = h_j(i), increment the k-th column by one
Point query: the estimate for element i is the minimum over rows j of sketch[j, h_j(i)]
Parameters, for an (ε, δ) guarantee: w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉
[1] Cormode, Graham; S. Muthukrishnan (2005). "An Improved Data Stream Summary: The Count-Min Sketch and its Applications". J. Algorithms 55: 29-38.
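A minimal count-min sketch in Java, sized per w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉; the per-row multiply-shift hashes below stand in for a pairwise independent family, and all names are our own:

import java.util.Random;

/** Count-min sketch with w = ceil(e/eps) columns and d = ceil(ln(1/delta)) rows. */
public class CountMinSketch {
  private final int w, d;
  private final long[][] counts;
  private final long[] seedA, seedB;   // per-row hash parameters

  public CountMinSketch(double eps, double delta) {
    this.w = (int) Math.ceil(Math.E / eps);
    this.d = (int) Math.ceil(Math.log(1.0 / delta));
    this.counts = new long[d][w];
    Random rng = new Random();
    this.seedA = new long[d];
    this.seedB = new long[d];
    for (int j = 0; j < d; j++) {
      seedA[j] = rng.nextLong() | 1;   // odd multiplier
      seedB[j] = rng.nextLong();
    }
  }

  private int bucket(int row, long item) {
    long h = seedA[row] * item + seedB[row];
    return (int) Math.floorMod(h >>> 32, (long) w);
  }

  public void update(long item, long count) {
    for (int j = 0; j < d; j++) counts[j][bucket(j, item)] += count;
  }

  /** Point query: an overestimate of the true frequency with high probability. */
  public long estimate(long item) {
    long min = Long.MAX_VALUE;
    for (int j = 0; j < d; j++) min = Math.min(min, counts[j][bucket(j, item)]);
    return min;
  }
}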
132. Frequent Elements - Variants of the Count-min Sketch [1]
Count-min sketch with conservative update (CU sketch): when updating an item with frequency c, avoid unnecessary updating of counter values, which reduces over-estimation error; still prone to over-estimation error on low-frequency items
Lossy Conservative Update (LCU) - SWS: divide the stream into windows; at window boundaries, for all 1 ≤ i ≤ w and 1 ≤ j ≤ d, decrement sketch[i, j] if 0 < sketch[i, j] ≤ (threshold elided on the slide)
[1] Cormode, G. 2009. Encyclopedia entry on 'Count-Min Sketch'. In Encyclopedia of Database Systems. Springer, 511-516.
133. Open Source
TWITTER - Algebird#: filtering, unique, histogram, most frequent
YAHOO! - Data Sketches*: unique, quantile, histogram, sampling, theta sketches, tuple sketches, most frequent
HUAWEI - streamDM^: SGD learner and perceptron, naive Bayes, CluStream, Hoeffding decision trees, bagging, StreamKM++
StreamLib**
* https://datasketches.github.io/
# https://github.com/twitter/algebird
^ http://huawei-noah.github.io/streamDM/
** https://github.com/jiecchen/StreamLib
134. Anomaly Detection
Very rich history - over 150 yrs:
- Manufacturing
- Statistics
- Econometrics, financial engineering
- Signal processing
- Control systems, autonomous systems - fault detection [1]
- Networking
- Computational biology (e.g., microarray analysis)
- Computer vision
[1] A. S. Willsky, "A survey of design methods for failure detection systems", Automatica, vol. 12, pp. 601-611, 1976.
135. Anomaly Detection
Very rich history - over 150 yrs; anomalies are contextual in nature:
"DISCORDANT observations may be defined as those which present the appearance of differing in respect of their law of frequency from other observations with which they are combined. In the treatment of such observations there is great diversity between authorities; but this discordance of methods may be reduced by the following reflection. Different methods are adapted to different hypotheses about the cause of a discordant observation; and different hypotheses are true, or appropriate, according as the subject-matter, or the degree of accuracy required, is different."
- F. Y. Edgeworth, "On Discordant Observations", 1887.
137. Anomaly Detection
Common approaches (domains: stats, manufacturing, ops):
- Moving averages: SMA, EWMA, PEWMA (parameters: window width, decay); an EWMA sketch follows this list
- Rule based: µ ± σ, which assumes a normal distribution - an assumption that is NOT VALID in real life
Long history: Stone 1868, Glaisher 1872, Edgeworth 1887, Stewart 1920, Irwin 1925, Jeffreys 1932, Rider 1933
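To make the moving-average approach concrete, here is a minimal EWMA-based detector; it is my sketch, not the talk's, and the class name, `alpha`, and `k` are assumptions (k = 3 mimics the classic 3-sigma rule).

```python
class EwmaDetector:
    """Flag a point when it falls outside mean +/- k * std of the
    exponentially weighted history; alpha is the decay parameter."""

    def __init__(self, alpha=0.1, k=3.0):
        self.alpha, self.k = alpha, k
        self.mean = None
        self.var = 0.0

    def observe(self, x):
        if self.mean is None:
            self.mean = x
            return False
        diff = x - self.mean
        # Only flag once some variance has accumulated (warm-up).
        anomalous = self.var > 0 and diff * diff > (self.k ** 2) * self.var
        # Standard exponentially weighted updates for mean and variance.
        incr = self.alpha * diff
        self.mean += incr
        self.var = (1 - self.alpha) * (self.var + diff * incr)
        return anomalous
```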
138. Anomaly Detection
Robust measures (a MAD-based sketch follows the references):
- MEDIAN
- MAD: Median Absolute Deviation [1]
- MCD: Minimum Covariance Determinant [2]
- MVEE: Minimum Volume Enclosing Ellipsoid [3, 4]
[1] P. J. Rousseeuw and C. Croux, "Alternatives to the Median Absolute Deviation", 1993.
[2] http://onlinelibrary.wiley.com/wol1/doi/10.1002/wics.61/abstract
[3] P. J. Rousseeuw and A. M. Leroy, "Robust Regression and Outlier Detection", 1987.
[4] M. J. Todd and E. A. Yıldırım, "On Khachiyan's algorithm for the computation of minimum-volume enclosing ellipsoids", 2007.
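As an illustration of why these measures are robust, a MAD-based outlier rule (my sketch; `mad_outliers` and the cutoff are assumptions) ignores extreme values when estimating both location and scale, unlike µ ± σ:

```python
import statistics

def mad_outliers(xs, cutoff=3.0):
    """Flag points whose robust z-score exceeds `cutoff`.
    MAD is scaled by 1.4826 so that it is consistent with the
    standard deviation under a normal distribution."""
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    if mad == 0:
        return []  # degenerate case: more than half the points tie
    return [x for x in xs if abs(x - med) / (1.4826 * mad) > cutoff]
```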
140. Anomaly Detection
Challenges:
- Live data
- Multi-dimensional data
- Low memory footprint
- Accuracy vs. speed trade-off
- Encoding the context
- Data types: video, audio, text
- Data veracity
- Wearables; smart cities, connected home, Internet of Things
144. Lambda Architecture
- Batch layer: accurate but delayed (HDFS/MapReduce)
- Fast layer: inexact but fast (Storm/Kafka)
- Query/merge layer: merge results from the batch and fast layers at query time (illustrated below)
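Purely as an illustration of the merge layer - every name here is hypothetical, not part of any of the systems above - a query combines the batch layer's precomputed value with the speed layer's deltas for events newer than the last completed batch run:

```python
def lambda_query(key, batch_view, speed_view, batch_high_watermark):
    """Merge-at-query-time sketch: batch_view maps key -> precomputed
    count (accurate but delayed); speed_view maps key -> list of
    (timestamp, delta) pairs from the fast layer."""
    batch_count = batch_view.get(key, 0)
    recent = speed_view.get(key, [])
    # Only count fast-layer deltas the batch layer has not seen yet.
    delta = sum(d for ts, d in recent if ts > batch_high_watermark)
    return batch_count + delta
```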
145. Lambda Architecture
Characteristics:
- During ingestion, data is cloned into two copies: one goes to the batch layer, the other goes to the fast layer
- Processing is done at both layers: expressed as map-reduces in the batch layer, and as topologies in the speed layer
146. Lambda Architecture
Challenges:
- Inherently inefficient: data is replicated twice, computation is replicated twice
- Operationally inefficient: maintain both batch and streaming systems, tune topologies for both systems
147. Kappa Architecture
- Streaming is everything
- Computation is expressed as a topology
- Computation is mostly done only once, when the data arrives
- Data then moves into permanent storage
149. Kappa Architecture
Challenges:
- Data reprocessing could be very expensive
- Code/logic changes: either data needs to be brought back from storage to the bus, or the computation needs to be expressed to run on bulk storage (sketched below)
- Historic analysis: how to do data analytics over all of last year's data?
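Sketch of why reprocessing is expensive, with hypothetical names only: in a Kappa design, a logic change means replaying the entire retained log through the new computation into a fresh output table, then switching queries over once the replay catches up with the live stream.

```python
def reprocess(log, process_v2, new_table):
    """Kappa-style reprocessing sketch: replay the retained log from
    offset 0 through the new logic into a fresh output table. The
    cost is proportional to the full history being replayed."""
    for record in log.read(from_offset=0):
        new_table.write(process_v2(record))
```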
151. Observations
- Lambda is complicated and inefficient: replication of data and computation; multiple systems to operate and tune
- Kappa is too simplistic: data reprocessing is too expensive; historical analysis is not possible
152. Observations
- Computation across batch/realtime is similar: expressed as DAGs, run in parallel on the cluster, intermediate results need not be materialized, functional/declarative APIs
- Storage is the key: messaging and storage are two faces of the same coin - they serve the same data
Real-Time Storage Requirements
Requirements
for
a
real-‐Hme
storage
plakorm
Be
able
to
write
and
read
streams
of
records
with
low
latency,
storage
durability
Data
storage
should
be
durable,
consistent
and
fault
tolerant
Enable
clients
to
stream
or
tail
ledgers
to
propagate
data
as
they’re
wriaen
Store
and
provide
access
to
both
historic
and
real-‐@me
data
154. Apache BookKeeper - Stream Storage
A storage system for log streams:
- Replicated, durable storage of log streams
- Fast tailing/streaming facility
- Optimized for immutable data
- Low-latency durability
- Simple repeatable-read consistency
- High write and read availability
155. Record: Smallest I/O and Address Unit
- A stream is a sequence of indivisible records
- A record is a sequence of bytes
- The record is the smallest I/O unit, as well as the unit of address
- Each record contains sequence numbers for addressing
157. Ledger: Finite Sequence of Records
A ledger is a finite sequence of records that gets terminated when:
- a client explicitly closes it, or
- the writer that writes records into it has crashed
158. Stream: Infinite Sequence of Records
A stream is an unbounded, infinite sequence of records, physically comprised of multiple ledgers (a conceptual model of all three units follows)
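To make the record/ledger/stream hierarchy concrete, here is a conceptual Python model. This is illustration only - it is not the BookKeeper API, and all names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Record:
    """Smallest I/O and addressing unit: a byte sequence plus the
    sequence numbers (ledger id, entry id) used to address it."""
    ledger_id: int
    entry_id: int
    payload: bytes

@dataclass
class Ledger:
    """A finite, append-only sequence of records; sealed when the
    client closes it or the writer crashes."""
    ledger_id: int
    records: List[Record] = field(default_factory=list)
    sealed: bool = False

    def append(self, payload: bytes) -> Record:
        assert not self.sealed, "cannot append to a sealed ledger"
        rec = Record(self.ledger_id, len(self.records), payload)
        self.records.append(rec)
        return rec

@dataclass
class Stream:
    """An unbounded stream: physically a chain of ledgers, where the
    last ledger is open for writes and earlier ones are sealed."""
    ledgers: List[Ledger] = field(default_factory=list)
```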
159. Bookies: Store Fragments of Records
- Bookie: a storage server that stores data records
- Ensemble: a group of bookies storing the data records of a ledger
- Individual bookies store fragments of ledgers
161. Tying It All Together
A typical installation of Apache BookKeeper
162. BookKeeper - Use Cases
Stream storage combines the functionality of messaging and storage:
- WAL (write-ahead log)
- Message store
- Object store
- Snapshots
- Stream processing
164. BookKeeper in Production
Enterprise-grade stream storage:
- 4+ years at Twitter and Yahoo, 2+ years at Salesforce
- Multiple use cases from messaging to storage: database replication, message store, stream computing, …
- 600+ bookies in a single cluster
- Data is stored from days to a year
- Millions of log streams
- 1 trillion records/day, 17 PB/day
166. Real Time is Messy and Unpredictable
[Diagram: aggregation systems and messaging systems feeding a result engine, HDFS, and queryable engines]
167. Streamlio - Unified Architecture
[Diagram: interactive querying and an application builder on top of the Storm API, Trident/Apache Beam, SQL, the Pulsar API, the Kafka API, and a BK/HDFS API, deployed on Kubernetes with metadata management, operational monitoring, chargeback, security, authentication, quota management, and a rules engine]
168. RESOURCES
Sketching algorithms:
- https://www.cs.upc.edu/~gavalda/papers/portoschool.pdf
- https://mapr.com/blog/some-important-streaming-algorithms-you-should-know-about/
- https://gist.github.com/debasishg/8172796
Books and surveys:
- Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches - G. Cormode, M. Garofalakis and P. J. Haas
- Data Streams: Models and Algorithms - Charu Aggarwal, http://www.springer.com/us/book/9780387287591
- Data Streams: Algorithms and Applications - Muthu Muthukrishnan, http://algo.research.googlepages.com/eight.ps
- Graph Streaming Algorithms - A. McGregor
- Sketching as a Tool for Numerical Linear Algebra - D. Woodruff
169. Readings: Streaming Engines
- Twitter Heron: Stream Processing at Scale - SIGMOD'15
- Twitter Heron: Towards Extensible Streaming Engines - ICDE'17
- Dhalion: Self-Regulating Stream Processing in Heron - VLDB'17
- MillWheel: Fault-Tolerant Stream Processing at Internet Scale - VLDB'13
- The Dataflow Model: A Practical Approach to Balancing Correctness, Latency and Cost in Massive-Scale, Unbounded Out-of-Order Data Processing - VLDB'15
- Anomaly Detection in Real-Time Data Streams Using Heron - Strata San Jose'17
170. Readings
- Clustering Data Streams - FOCS'00
- Querying and mining data streams: You only get one look - SIGMOD'02
- Stream Order and Order Statistics: Quantile Estimation in Random-Order Streams - SIAM Journal of Computing'09
- Models and Issues in Data Stream Systems - PODS'02
- Statistical Analysis of Sketch Estimators - SIGMOD'07
- An optimal algorithm for the distinct elements problem - PODS'10
171. Readings
- Coresets and Sketches for high dimensional subspace approximation problems - SODA'10
- Time Adaptive Sketches (Ada-Sketches) for Summarizing Data Streams - SIGMOD'16
- Heavy-Hitter Detection Entirely in the Data Plane - SOSR'17
- Graph Sketches: Sparsification, Spanners, and Subgraphs - PODS'12
- Coresets and Sketches - Arxiv'16
- Data Sketching: The approximate approach is often faster and more efficient - ACM Queue'17
173. GET IN TOUCH - CONTACT US
@arun_kejariwal, @kramasamy, @sanjeerk, @sijieg, @merlimat, @nlu90
karthik@stremlio.io
arun_kejariwal@acm.org
175. ENJOY THE PRESENTATION
The End