Stream or segment: what is the best way to access your events in Pulsar

Infinite event streams are the core data abstraction in Apache Pulsar. Pulsar provides two levels of reading API for accessing events in Pulsar topics: the pub/sub API and the segment API. The pub/sub API provides a unified messaging API for accessing events in a streaming way, and users can choose among different subscription modes for consuming events. The segment API provides a way to access events directly from Apache BookKeeper and tiered storage, which is more suitable for batch-oriented workloads. You can also combine the pub/sub API and the segment API to create a unified data processing experience. Over the past year, we at StreamNative have been helping many customers run Pulsar for different use cases, from online queuing and event sourcing to stream and batch processing. We have also worked on integrating Pulsar with different components in the big data ecosystem. In this talk, we will share our experiences and best practices for choosing the right API for accessing your event streams in Pulsar for different use cases.

streamnative.io
Stream/Segment - best way to access
events in Pulsar
Neng Lu

Who Am I
❏ StreamNative Software Engineer
❏ Ex-Twitter
❏ Contributed to Apache Projects - Heron, Pulsar
❏ Interested in event streaming technologies

Apache Pulsar
“Flexible Pub/Sub Messaging Backed by Durable Log Storage”

Apache Pulsar
“Cloud-native Messaging and Event Streaming Platform”

Pulsar Use Cases
❏ Unified Event Center/Bus (Queuing + Streaming)
❏ Billing Service
❏ Push Notification
❏ Worker Queue
❏ Logging Pipeline
❏ IoT
❏ Streaming-first, unified data processing

Data Processing Categories
❏ Interactive
  ❏ Time critical
  ❏ Medium data size
  ❏ Rerun on failures
❏ Batch
  ❏ The amount of data is huge
  ❏ Can run on a huge cluster
  ❏ Fine-grained fault tolerance
❏ Streaming
  ❏ Long-running jobs
  ❏ Time critical
  ❏ Scalable as well as fault tolerant

Apache Pulsar Layered Architecture
❏ Stateless serving layer: brokers
❏ Durable storage layer: BookKeeper bookies

Pulsar Messaging API
❏ Read data from brokers with different Subscription Modes
❏ Consume / Seek / Receive
❏ Reprocess data by rewinding (seeking) the cursors
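
The rewind idea above can be sketched with a toy model. This is not the Pulsar client API: `ToyTopic`, `ToySubscription`, and `drain` are hypothetical names, and a list index stands in for a message ID. It only illustrates that a subscription is a cursor over an append-only log, so seeking the cursor back makes older messages available again.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTopic:
    """Append-only list of messages; the list index stands in for a message ID."""
    messages: list = field(default_factory=list)

    def publish(self, payload):
        self.messages.append(payload)
        return len(self.messages) - 1  # position of the new message


@dataclass
class ToySubscription:
    """A cursor over a ToyTopic (hypothetical stand-in for a subscription)."""
    topic: ToyTopic
    cursor: int = 0  # position of the next message to deliver

    def receive(self):
        """Return the next message and advance the cursor, or None at the end."""
        if self.cursor >= len(self.topic.messages):
            return None
        payload = self.topic.messages[self.cursor]
        self.cursor += 1
        return payload

    def seek(self, position):
        """Move the cursor so that messages from `position` are delivered again."""
        if not 0 <= position <= len(self.topic.messages):
            raise ValueError("position out of range")
        self.cursor = position


def drain(subscription):
    """Receive until the subscription has caught up with the topic."""
    received = []
    while (msg := subscription.receive()) is not None:
        received.append(msg)
    return received


topic = ToyTopic()
for event in ["e0", "e1", "e2", "e3"]:
    topic.publish(event)

sub = ToySubscription(topic)
first_pass = drain(sub)  # consumes e0 through e3
sub.seek(2)              # rewind the cursor to position 2
replayed = drain(sub)    # delivers e2 and e3 a second time
```

The topic itself is never modified by a seek; only the cursor moves, which is why reprocessing does not require republishing data.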

Subscription Mode
❏ Exclusive
❏ Failover
❏ Shared
❏ Key_Shared
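
As a rough illustration of how the four modes differ, the sketch below assigns messages to consumers under each mode. It is a simplified model under stated assumptions, not broker code: `dispatch` is a hypothetical helper, the first consumer in the list is treated as the active one for Failover, and CRC32 modulo the consumer count stands in for whatever key hashing the broker actually uses.

```python
import zlib
from collections import defaultdict


def dispatch(mode, consumers, messages):
    """Assign (key, payload) messages to consumer names under a subscription mode.

    Toy model only: acknowledgements, redelivery, and consumer priority
    are not represented here.
    """
    if not consumers:
        raise ValueError("at least one consumer is required")
    assignment = defaultdict(list)
    if mode == "Exclusive":
        # Only a single consumer may attach to the subscription.
        if len(consumers) > 1:
            raise ValueError("Exclusive allows only one consumer")
        assignment[consumers[0]] = list(messages)
    elif mode == "Failover":
        # One active consumer receives everything; the others are standbys.
        assignment[consumers[0]] = list(messages)
    elif mode == "Shared":
        # Messages are spread round-robin, so per-key ordering is not kept.
        for i, msg in enumerate(messages):
            assignment[consumers[i % len(consumers)]].append(msg)
    elif mode == "Key_Shared":
        # Messages with the same key always land on the same consumer.
        # Assumption: CRC32 modulo consumer count replaces the real hashing.
        for key, payload in messages:
            slot = zlib.crc32(key.encode()) % len(consumers)
            assignment[consumers[slot]].append((key, payload))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return dict(assignment)


messages = [("user-a", 1), ("user-a", 2), ("user-b", 3), ("user-b", 4)]
failover = dispatch("Failover", ["c1", "c2"], messages)
shared = dispatch("Shared", ["c1", "c2"], messages)
key_shared = dispatch("Key_Shared", ["c1", "c2"], messages)
```

In this model Shared splits the two `user-a` messages across both consumers, while Key_Shared keeps each key on a single consumer, which is the property to look for when per-key ordering matters.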

Pulsar Segment API
❏ Read data from storage (BookKeeper or tiered storage)
❏ Fine-grained Parallelism
❏ Predicate pushdown (publish timestamp)
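
The two ideas above, fine-grained parallelism and predicate pushdown on publish timestamp, can be sketched as follows. This is an illustrative model, not the segment reader API: `Segment`, `prune`, `read_segment`, and `parallel_read` are hypothetical names, segments are assumed non-empty and time-ordered, and a thread pool stands in for the parallel tasks of a batch engine.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical sealed segment: (publish_time_ms, payload) pairs in time order."""
    segment_id: int
    entries: tuple

    @property
    def min_publish_time(self):
        return self.entries[0][0]

    @property
    def max_publish_time(self):
        return self.entries[-1][0]


def prune(segments, start_ms, end_ms):
    """Predicate pushdown: keep segments whose time range overlaps [start_ms, end_ms]."""
    return [s for s in segments
            if s.max_publish_time >= start_ms and s.min_publish_time <= end_ms]


def read_segment(segment, start_ms, end_ms):
    """Read one segment and filter its entries by publish time."""
    return [payload for ts, payload in segment.entries if start_ms <= ts <= end_ms]


def parallel_read(segments, start_ms, end_ms, workers=4):
    """Read the surviving segments concurrently, one task per segment."""
    selected = prune(segments, start_ms, end_ms)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results stay in segment order.
        chunks = pool.map(lambda s: read_segment(s, start_ms, end_ms), selected)
        return [payload for chunk in chunks for payload in chunk]


segments = [
    Segment(0, ((100, "a"), (200, "b"))),
    Segment(1, ((300, "c"), (400, "d"))),
    Segment(2, ((500, "e"), (600, "f"))),
]
selected_ids = [s.segment_id for s in prune(segments, 350, 550)]
result = parallel_read(segments, 350, 550)
```

Because sealed segments are immutable, each one can be read independently; skipping whole segments by their time range is what keeps a time-bounded batch query from scanning the full topic.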

Segment Centric Storage
❏ Topic Partition (Managed Ledger)
  ❏ The storage layer for a single topic partition
❏ Segment (Ledger)
  ❏ Single writer, append-only
  ❏ Replicated to multiple bookies
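
A minimal sketch of this layout, assuming a simple rotation for replica placement and a fixed entry count per segment (both are illustrative choices, not how BookKeeper actually decides). `Ledger`, `ManagedLedger`, and `add_bookie` are hypothetical names. The point is that a topic partition is a sequence of sealed segments plus one open segment, and that a newly added bookie only receives future segments.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Hypothetical segment: append-only while open, immutable once sealed."""
    ledger_id: int
    bookies: tuple  # bookies holding replicas of this segment
    entries: list = field(default_factory=list)
    sealed: bool = False

    def append(self, payload):
        if self.sealed:
            raise RuntimeError("sealed ledgers are read-only")
        self.entries.append(payload)


class ManagedLedger:
    """Toy storage for one topic partition: an ordered list of ledgers."""

    def __init__(self, bookies, replicas=2, max_entries=3):
        if replicas > len(bookies):
            raise ValueError("not enough bookies for the requested replicas")
        self.bookies = list(bookies)
        self.replicas = replicas
        self.max_entries = max_entries
        self.ledgers = []
        self._roll_over()

    def _roll_over(self):
        """Seal the current ledger and open a new one on the next set of bookies."""
        if self.ledgers:
            self.ledgers[-1].sealed = True
        next_id = len(self.ledgers)
        # Assumption: simple rotation picks the replica set.
        chosen = tuple(self.bookies[(next_id + i) % len(self.bookies)]
                       for i in range(self.replicas))
        self.ledgers.append(Ledger(next_id, chosen))

    def add_bookie(self, name):
        """New bookies only receive future ledgers; existing data is not moved."""
        self.bookies.append(name)

    def append(self, payload):
        if len(self.ledgers[-1].entries) >= self.max_entries:
            self._roll_over()
        self.ledgers[-1].append(payload)


ml = ManagedLedger(["bookie1", "bookie2", "bookie3"])
for i in range(7):
    ml.append(f"event-{i}")
placement_before = [ledger.bookies for ledger in ml.ledgers]

ml.add_bookie("bookie4")
for i in range(7, 10):
    ml.append(f"event-{i}")
```

After `bookie4` joins, the first three ledgers keep their original placement and only the ledger opened afterwards uses the new bookie, which is the intuition behind scaling storage without rebalancing existing data.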

Tiered Storage
❏ Long retention
❏ Low cost
❏ Easy to access

Apache Pulsar Data APIs
[Diagram: Producer and Consumer connect to Broker 1, Broker 2, and Broker 3 through the Messaging API. The Segment API reads from Bookie1 through Bookie5 and from tiered storage (Hadoop, GCS, S3).]

Benefits
❏ Unlimited Topic Partition Storage
❏ Instant Scaling without Data Rebalancing
❏ Broker Failure Recovery
❏ Bookie Failure Recovery
❏ Cluster Expansion
❏ Low latency reading for messaging data
❏ High throughput reading for batch data
❏ Reduced cost for whole data storage

Pulsar Flink Case
[Diagram: Flink Job 1 and Flink Job 2 reading a Pulsar topic whose events are numbered 1 to 12.]

Conclusion
❏ Apache Pulsar is a cloud-native messaging and streaming system
❏ Multi-layered architecture
❏ Segment centric storage
❏ Two levels of reading API: Pub/Sub + Segment
❏ Apache Pulsar provides a unified view of data

Community
❏ Pulsar Website: https://pulsar.apache.org
❏ Twitter: @apache_pulsar / @streamnativeio
❏ Slack: https://apache-pulsar.herokuapp.com
❏ Mailing lists: dev@pulsar.apache.org, users@pulsar.apache.org
❏ GitHub: https://github.com/apache/pulsar
❏ Medium: https://medium.com/streamnative

Reference
❏ http://pulsar.apache.org/docs/en/concepts-overview/
❏ https://www.splunk.com/en_us/blog/it/comparing-pulsar-and-kafka-how-a-segment-based-architecture-delivers-better-performance-scalability-and-resilience.html
❏ https://pulsar.apache.org/docs/en/concepts-tiered-storage/
❏ https://flink.apache.org/2019/05/03/pulsar-flink.html

Weitere ähnliche Inhalte

Was ist angesagt?

How Orange Financial combat financial frauds over 50M transactions a day usin...

JinfengHuang3

A Unified Platform for Real-time Storage and Processing

StreamNative

Apache PulsarFirst Overview

Ricardo Paiva

From Three Nines to Five Nines - A Kafka Journey

Allen (Xiaozhong) Wang

What's new in apache pulsar 2.4.0

StreamNative

Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber

confluent

Real Time Data Streaming using Kafka & Storm

Ran Silberman

Stream processing using Apache Storm - Big Data Meetup Athens 2016

Adrianos Dadis

Kafka Summit SF 2017 - Infrastructure for Streaming Applications

confluent

Zhiyong Bai As a high performance and scalable key value database, Zhihu use HBase to provide online data store system along with Mysql and Redis. Zhihu’s platform team had accumulated some experience in technology of container, and this time, based on Kubernetes, we build flexible platform of online HBase system, create multiple logic isolated HBase clusters on the shared physical cluster with fast rapid，and provide customized service for different business needs. Combined with Consul and DNS server, we implement high available access of HBase using client mainly written with Python. This presentation is mainly shared the architecture of online HBase platform in Zhihu and some practical experience in production environment. hbaseconasia2017 hbasecon hbase

hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes

HBaseCon

Jay Kreps is a Principal Staff Engineer at LinkedIn where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It will cover both how Kafka works, as well as how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.

Apache Kafka at LinkedIn

Discover Pinterest

Building a distributed Key-Value store with Cassandra

aaronmorton

"There's little talk about capacity planning Kafka clusters, it's very much learn as you go, every cluster is different. In this talk Kafka DevOps Engineer Jason Bell takes you through the things that will help you, from broker capacity, thinking about topics and how the other Confluent components can affect throughput and performance. With a number of production deployments under his watchful gaze for over six years Jason has plenty of experience, stories and useful information that will help you. By the end of the talk you'll have a good understanding of designing the cluster for various scenarios, where the points of latency are to watch and monitor. And also how to prevent teams breaking the cluster behind your back. This talk is designed for everyone, anyone who is just starting to those who are operating Kafka on a daily basis."

Capacity Planning Your Kafka Cluster | Jason Bell, Digitalis

HostedbyConfluent

What’s easier than building a data pipeline nowadays? You add a few Apache Kafka clusters and a way to ingest data (probably over HTTP), design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse… wait, this looks like a lot of things, doesn’t it? And you probably want to make it highly scalable and available too. Join this session to learn best practices for building a data pipeline, drawn from my experience at Activision/Demonware. I’ll share the lessons learned about scaling pipelines, not only in terms of volume, but also in terms of supporting more games and more use cases. You’ll also hear about message schemas and envelopes, Apache Kafka organization, topics naming conventions, routing, reliable and scalable producers and the ingestion layer, as well as stream processing.

Building Scalable and Extendable Data Pipeline for Call of Duty Games (Yarosl...

confluent

Ingesting Healthcare Data, Micah Whitacre

confluent

Apache BookKeeper: A High Performance and Low Latency Storage Service

Sijie Guo

Modern search systems provide incredible feature sets, developer-friendly APIs, and low latency indexing and query response. By some measures, these systems operate "at scale," but rarely is that quantified. Customers of Rocana typically look to push ingest rates in excess of 1 million events per second, retaining years of data online for query, with the expectation of sub-second response times for any reasonably sized subset of data. We quickly found that the tradeoffs made by general purpose search systems, while right for common use cases, were less appropriate for these high cardinality, large scale use cases. This session details the architecture, tradeoffs, and interesting implementation decisions made in building a new time series optimized distributed search system using Apache Lucene, Kafka, and HDFS. Data ingestion and durability, index and metadata organization, storage, query scheduling and optimization, and failure modes will be covered. Finally, a summary of the results achieved will be shown.

High cardinality time series search: A new level of scale - Data Day Texas 2016

Eric Sammer

Connecting kafka message systems with scylla

Maheedhar Gunturu

Kafka At Scale in the Cloud

confluent

Redis for horizontally scaled data processing at jFrog bintray

Redis Labs

Was ist angesagt? (20)

How Orange Financial combat financial frauds over 50M transactions a day usin...

A Unified Platform for Real-time Storage and Processing

Apache PulsarFirst Overview

From Three Nines to Five Nines - A Kafka Journey

What's new in apache pulsar 2.4.0

Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber

Real Time Data Streaming using Kafka & Storm

Stream processing using Apache Storm - Big Data Meetup Athens 2016

Kafka Summit SF 2017 - Infrastructure for Streaming Applications

hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes

Apache Kafka at LinkedIn

Building a distributed Key-Value store with Cassandra

Capacity Planning Your Kafka Cluster | Jason Bell, Digitalis

Building Scalable and Extendable Data Pipeline for Call of Duty Games (Yarosl...

Ingesting Healthcare Data, Micah Whitacre

Apache BookKeeper: A High Performance and Low Latency Storage Service

High cardinality time series search: A new level of scale - Data Day Texas 2016

Connecting kafka message systems with scylla

Kafka At Scale in the Cloud

Redis for horizontally scaled data processing at jFrog bintray

Ähnlich wie Stream or segment : what is the best way to access your events in Pulsar_Neng

As organizations are getting better at capturing streaming data and the data velocity and volume are ever-increasing, the traditional messaging queues or log storage systems are suffering from scalability or operational and maintenance problems. Apache Pulsar is a multi-tenant, high-performance distributed pub-sub messaging system. Pulsar includes multiple features, such as native support for multiple clusters in a Pulsar instance, seamless geo-replication of messages across clusters, very low publishing and end-to-end latency, seamless scalability to over a million topics, and guaranteed message delivery with persistent message storage provided by Apache BookKeeper. In this talk, I will use one of the most popular stream processing engines, Apache Flink, as an example, to share our experience in building a stream processing and storage stack. Some of the traits are: * How to ensure end-to-end exactly-once semantics based on Pulsar's durable and replayable storage as well as Pulsar transaction. * How to implement Pulsar topics as infinite tables based on Pulsar's schema. * How to efficiently store stream states in Pulsar based on Pulsar's layered storage API. * A usage scenario that chaining all functionalities in the streaming platform.

Virtual Flink Forward 2020: Build your next-generation stream platform based ...

Flink Forward

How Orange Financial combat financial frauds over 50M transactions a day usin...

StreamNative

Both Apache Pulsar and Apache Flink share a similar view on how the data and the computation level of an application can be “streaming-first” with batch as a special case streaming. With Apache Pulsar’s Segmented-Stream storage and Apache Flink’s steps to unify batch and stream processing workloads under one framework, there are numerous ways of integrating the two technologies to provide elastic data processing at massive scale, and build a real streaming warehouse. In this talk, Sijie Guo from the Apache Pulsar community will share the latest integrations between Apache Pulsar and Apache Flink. He will explain how Apache Flink can integrate and leverage Pulsar’s built-in efficient schemas to allow users of Flink SQL query Pulsar streams in realtime.

Query Pulsar Streams using Apache Flink

StreamNative

bigdata 2022_ FLiP Into Pulsar Apps In this session, Timothy will introduce you to the world of Apache Pulsar and how to build real-time messaging and streaming applications with a variety of OSS libraries, schemas, languages, frameworks, and tools. FLiP Into Pulsar Apps 08:30 – 09:15 • 23 Nov, 2022 In this session, Timothy will introduce you to the world of Apache Pulsar and how to build real-time messaging and streaming applications with a variety of OSS libraries, schemas, languages, frameworks, and tools.

bigdata 2022_ FLiP Into Pulsar Apps

Timothy Spann

Biography Tim Spann is a Principal DataFlow Field Engineer at Cloudera where he works with Apache NiFi, MiniFi, Pulsar, Apache Flink, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a senior solutions architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science. Talk Real-Time Streaming in Any and All Clouds, Hybrid and Beyond Today, data is being generated from devices and containers living at the edge of networks, clouds and data centers. We need to run business logic, analytics and deep learning at the scale and as events arrive. Tools: Apache Flink, Apache Pulsar, Apache NiFi, MiNiFi, DJL.ai Apache MXNet. References: https://www.datainmotion.dev/2019/11/introducing-mm-flank-apache-flink-stack.html https://www.datainmotion.dev/2019/08/rapid-iot-development-with-cloudera.html https://www.datainmotion.dev/2019/09/powering-edge-ai-for-sensor-reading.html https://www.datainmotion.dev/2019/05/dataworks-summit-dc-2019-report.html https://www.datainmotion.dev/2019/03/using-raspberry-pi-3b-with-apache-nifi.html Source Code: https://github.com/tspannhw/MmFLaNK FLiP Stack StreamNative

Big data conference europe real-time streaming in any and all clouds, hybri...

Timothy Spann

Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...

Timothy Spann

Timothy Spann: Apache Pulsar for ML

Edunomica

Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrumgaard | Current 2022 At home, I monitor the temperature, humidity, gas levels, ozone, air quality, and other features around my desk. Let's bring this to the different spots around the conference including lunch tables, vendor booths, hotel rooms, and more. I need to know about these readings now, not when I get back home from the conference. We need to get these sensor readings immediately in case we need to turn on a fan or move to another area. We will also see if my talk produces a lot of hot air!?!?? My setup is pretty simple, a raspberry pi, a breakout garden sensor mount, and as many sensors as I am willing to fly to Austin. The software stack is Python and Java, Apache Pulsar, MQTT, HTML, JQuery, and Apache Kafka. https://dzone.com/articles/five-sensors-real-time-with-pulsar-and-python-on-a https://www.datainmotion.dev/2022/04/flip-py-pi-enviroplus-using-apache.html https://dzone.com/articles/pulsar-in-python-on-pi

Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...

HostedbyConfluent

(Current22) Let's Monitor The Conditions at the Conference Let's Monitor The Conditions at the Conference Session Time11:15 am - 12:00 pm Session DateWednesday, 5 October 2022 Session Type:In-Person Location:Ballroom G Session Description: At home, I monitor the temperature, humidity, gas levels, ozone, air quality, and other features around my desk. Let's bring this to the different spots around the conference including lunch tables, vendor booths, hotel rooms, and more. I need to know about these readings now, not when I get back home from the conference. We need to get these sensor readings immediately in case we need to turn on a fan or move to another area. We will also see if my talk produces a lot of hot air!? My setup is pretty simple, a raspberry pi, a breakout garden sensor mount, and as many sensors as I am willing to fly to Austin. The software stack is Python and Java, Apache Pulsar, MQTT, HTML, JQuery, and Apache Kafka. Timothy Spann, StreamNative Developer Advocate Tim Spann is a Developer Advocate @ StreamNative where he works with Apache Pulsar, Apache Flink, Apache NiFi, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC.

(Current22) Let's Monitor The Conditions at the Conference

Timothy Spann

Music city data Hail Hydrate! from stream to lake

Timothy Spann

Linked In Stream Processing Meetup - Apache Pulsar

Karthik Ramasamy

Messaging, storage, or both? The real time story of Pulsar and Apache Distri...

Streamlio

Openstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaS

Sadique Puthen

Pravega is a stream storage system that we designed and built from the ground up for modern day stream processors such as Flink. Its storage layer is tiered and designed to provide low latency for writing and reading, while being able to store an unbounded amount of stream data that eventually becomes cold. We rely on a high-throughput component to store cold stream data, which is critical to enable applications to rely on Pravega alone for storing stream data. Pravega’s API enables applications to manipulate streams with a set of desirable features such as avoiding duplication and writing data transactionally. Both features are important for applications that require exactly-once semantics. This talk goes into the details of Pravega’s architecture and establishes the need for such a storage system.

Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...

Flink Forward

While the last few years have seen great advancements in computing paradigms for big data stores, there remains one critical bottleneck in this architecture - the ingestion process. Instead of immediate insights into the data, a poor ingestion process can cause headaches and problems to no end. On the other hand, a well-designed ingestion infrastructure should give you real-time visibility into how your systems are functioning at any given time. This can significantly increase the overall effectiveness of your ad-campaigns, fraud-detection systems, preventive-maintenance systems, or other critical applications underpinning your business. In this session we will explore various modes of ingest including pipelining, pub-sub, and micro-batching, and identify the use-cases where these can be applied. We will present this in the context of open source frameworks such as Apache Flume, Kafka, among others that can be used to build related solutions. We will also present when and how to use multiple modes and frameworks together to form hybrid solutions that can address non-trivial ingest requirements with little or no extra overhead. Through this discussion we will drill-down into details of configuration and sizing for these frameworks to ensure optimal operations and utilization for long-running deployments.

Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...

Data Con LA

apidays New York 2022 - Beyond API Regulations for Finance, Insurance, and Healthcare July 27 & 28, 2022 Leveraging Event Streaming to Super-Charge your Business Mary Grygleski, Streaming Developer Advocate at DataStax ------------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/ Deep dive into the API industry with our reports: https://www.apidays.global/industry-reports/ Subscribe to our global newsletter: https://apidays.typeform.com/to/i1MPEW

apidays New York 2022 - Leveraging Event Streaming to Super-Charge your Busin...

apidays

Having used apache pulsar in production for an year for our pub sub use cases such as stream analytics, event sourcing etc, this slide deck presents the lesson learned per se understanding the architecture, tuning the cluster, managing to keep it highly available and fault tolerant and much more. While the slides are presented in terms of apache pulsar, a lot of the concepts can be easily extended to a lot of distributed systems. The views here are my own and do not represent the view of nutanix corporation.

lessons from managing a pulsar cluster

Shivji Kumar Jha

With a current zoo of technologies and different ways of their interaction it's a big challenge to architect a system (or adopt existed one) that will conform to low-latency BigData analysis requirements. Apache Kafka and Kappa Architecture in particular take more and more attention over classic Hadoop-centric technologies stack. New Consumer API put significant boost in this direction. Microservices-based streaming processing and new Kafka Streams tend to be a synergy in BigData world.

Big Data Streams Architectures. Why? What? How?

Anton Nazaruk

Princeton Dec 2022 Meetup_ StreamNative and Cloudera Streaming https://www.meetup.com/new-york-city-apache-pulsar-meetup/ https://www.meetup.com/new-york-city-apache-pulsar-meetup/events/289674210/ |WHAT THE SESSION WILL COVER| Apache NiFi Apache Pulsar Apache Flink Flink SQL We will show you how to build apps, so download beforehand to Docker, K8, your Laptop, or the cloud. Cloudera CSP Setup Getting Started with Cloudera Stream Processing Community Edition You may download CSP-CE here: Cloudera Stream Processing Community Edition The Cloudera CDP User's page: CDP Resources Page https://youtu.be/s80sz3NWwHo https://docs.cloudera.com/csp-ce/latest/index.html https://www.cloudera.com/downloads/cdf/csp-community-edition.html Apache Pulsar https://pulsar.apache.org/docs/getting-started-standalone/ or https://streamnative.io/free-cloud/ Cloudera + Pulsar https://community.cloudera.com/t5/Cloudera-Stream-Processing-Forum/Using-Apache-Pulsar-with-SQL-Stream-Builder/m-p/349917 https://community.cloudera.com/t5/Community-Articles/Using-Apache-NiFi-with-Apache-Pulsar-for-Streaming/ta-p/337891 |AGENDA| 6:00 - 6:30 PM EST: Food, Drink, and Networking!!! 6:30 - 7:15 PM EST: Presentation - Tim Spann, StreamNative Developer Advocate 7:15 - 8:00 PM EST: Presentation - John Kuchmek, Cloudera Principal Solutions Engineer 8:00 - 8:30 PM EST: Round Table on Real-Time Streaming, Q&A |ABOUT THE SPEAKERS| John Kuchmek is a Principal Solutions Engineer for Cloudera. Before joining Cloudera, John transitioned to the Autonomous Intelligence team where he was in charge of integrating the platforms to allow data scientists to work with various types of data. Tim Spann is a Developer Advocate for StreamNative. He works with StreamNative Cloud, Apache Pulsar™, Apache Flink®, Flink® SQL, Big Data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. 
See: https://www.meetup.com/new-york-city-apache-pulsar-meetup/events/283837865/ https://github.com/tspannhw/SpeakerProfile https://www.meetup.com/futureofdata-newyork/ https://github.com/tspannhw/pulsar-transit-function https://www.meetup.com/futureofdata-princeton/ https://github.com/tspannhw/create-nifi-pulsar-flink-apps https://medium.com/@tspann/using-apache-pulsar-with-cloudera-sql-builder-apache-flink-b518aa9eadff https://github.com/tspannhw/meetups/blob/main/15December2022.md All resources for the meetup

Princeton Dec 2022 Meetup_ StreamNative and Cloudera Streaming

Timothy Spann

Introduction to Apache Kafka

Ricardo Bravo

Ähnlich wie Stream or segment : what is the best way to access your events in Pulsar_Neng (20)

Virtual Flink Forward 2020: Build your next-generation stream platform based ...

How Orange Financial combat financial frauds over 50M transactions a day usin...

Query Pulsar Streams using Apache Flink

bigdata 2022_ FLiP Into Pulsar Apps

Big data conference europe real-time streaming in any and all clouds, hybri...

Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...

Timothy Spann: Apache Pulsar for ML

Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...

(Current22) Let's Monitor The Conditions at the Conference

Music city data Hail Hydrate! from stream to lake

Linked In Stream Processing Meetup - Apache Pulsar

Messaging, storage, or both? The real time story of Pulsar and Apache Distri...

Openstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaS

Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...

Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...

apidays New York 2022 - Leveraging Event Streaming to Super-Charge your Busin...

lessons from managing a pulsar cluster

Big Data Streams Architectures. Why? What? How?

Princeton Dec 2022 Meetup_ StreamNative and Cloudera Streaming

Introduction to Apache Kafka

Mehr von StreamNative

So, you are a responsible software engineer building microservices for Apache Kafka, and life is good. Eventually, you hear the community talking about the outstanding experience they are having with Apache Pulsar features. They talk about infinite event stream retention, a rebalance-free architecture, native support for event processing, and multi-tenancy. Exciting, right? Most people would want to migrate their code to Pulsar. Especially when you know that Pulsar also supports Kafka clients natively via the protocol handler known as KoP — which enables the Kafka client APIs on Pulsar. But, as said before, you are responsible; and you don't believe in fairy tales, just like you don't believe that migrations like this happen effortlessly. This session will discuss the architecture behind protocol handlers, what it means having one enabled on Pulsar, and how the KoP works. It will detail the effort required to migrate a microservice written for Kafka to Pulsar, and whether the code need to change for this.

Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022

StreamNative

This talk describes Klaviyo’s internal messaging system, an asynchronous application framework built around Pulsar that provides a set of high-quality tools for building business-critical asynchronous data flows in unreliable environments. This framework includes: a pulsar ORM and schema migrator for topic configuration; a retry/replay system; a versioned schema registry; a consumer framework oriented around preventing message loss and in hostile environments while maximizing observability; an experimental “online schema change” for topics; and more. Development of this system was informed by lessons learned during heavy use of datastores like RabbitMQ and Kafka, and frameworks like Celery, Spark, and Flink. In addition to the capabilities of this system, this talk will also cover (sometimes painful) lessons learned about the process of converting a heterogenous async-computing environment onto Pulsar and a unified model.

Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...

StreamNative

In this talk, learn how Toast leverages our Envoy control-plane to manage blue-green deploys of Pulsar consumers, and how this has helped drive adoption across the engineering organization. Dive into the history of Pulsar at Toast, starting from its introduction in 2019 to provide event-driven architecture across a rapidly scaling restaurant software platform. We will detail some of the hurdles that we encountered gaining buy-in across a diverse set of teams, and dive deep into how we enforce best practices and integrate with our service control plane.

Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...

StreamNative

Event streaming architectures launched a reexamination of applications and systems architectures across the board. We live in a world where answers are needed now in a constant real-time flow. Yet beyond the event streaming system itself, what are the corequisites to ensure our large scale distributed database systems can keep pace with this always-on, always-current real time flow of data? What are the requirements and expectations for this next tech cycle?

Distributed Database Design Decisions to Support High Performance Event Strea...

StreamNative

Pulsar Functions is a succinct framework provided by Apache Pulsar to conduct real-time data processing. Its use cases include ETL pipeline, event-driven applications, and simple data analytics. While Pulsar Functions already provides an extremely simple programming interface, we want to further lower the barrier for users to access real-time data. Since SQL is one of the universal languages in the technology world and well accepted by the vast majority of data engineers, we decided to add a SQL expressing layer on top of Pulsar Functions runtime. In this talk, we will discuss the architecture and implementation of this new service. We will see how SQL syntax, Pulsar Functions, and Function Mesh can work together to deliver a unique user development experience for real-time data jobs in the cloud environment. We will also walk through use cases like filtering, routing, and projecting messages as well as integrating with the Pulsar IO Connectors framework.

Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022

StreamNative

Starting with version 2.10, the Apache ZooKeeper dependency has been eliminated and replaced with a pluggable framework that enables you to reduce the infrastructure footprint of Apache Pulsar by leveraging alternative metadata and coordination systems based on your deployment environment. In this talk, walk through the steps required to utilize the existing etcd service running inside Kubernetes to act as Pulsar's metadata store, thereby eliminating the need to run ZooKeeper entirely, leaving you with a ZooKeeper-less Pulsar.

Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022

StreamNative

Apache Pulsar is a highly available, distributed messaging system that provides guarantees of no message loss and strong message ordering with predictable read and write latency. In this talk, learn how this can be validated for Apache Pulsar Kubernetes deployments. Various failures are injected using Chaos Mesh to simulate network and other infrastructure failure conditions. There are many questions that are asked about failure scenarios, but it could be hard to find answers to these important questions. When a failure happens, how long does it take to recover? Does it cause unavailability? How does it impact throughput and latency? Are the guarantees of no message loss and strong message ordering kept, even when components fail? If a complete availability zone fails, is the system configured correctly to handle AZ failures? This talk will help you find answers to these questions and apply the tooling and practices to your own testing and validation.

Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...

StreamNative

Despite what the Ghostbusters said, we’re going to go ahead and cross (or, join) the streams. This session covers getting started with streaming data pipelines, maximizing Pulsar’s messaging system alongside one of the most flexible streaming frameworks available, Apache Flink. Specifically, we’ll demonstrate the use of Flink SQL, which provides various abstractions and allows your pipeline to be language-agnostic. So, if you want to leverage the power of a high-speed, highly customizable stream processing engine without the usual overhead and learning curves of the technologies involved (and their interconnected relationships), then this talk is for you. Watch the step-by-step demo to build a unified batch and streaming pipeline from scratch with Pulsar, via the Flink SQL client. This means you don’t need to be familiar with Flink (or even a specific programming language). The examples provided are built for highly complex systems, but the talk itself will be accessible to any experience level.

Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...

StreamNative

Apache Pulsar depends upon message acknowledgments to provide at-least-once or exactly-once processing guarantees. With these guarantees, any transmission between the broker and its producers and consumers requires an acknowledgment. But what happens if an acknowledgment is not received? Resending the message introduces the potential for duplicate processing and increases the likelihood of out-of-order processing. Therefore, it is critical to understand the Pulsar message redelivery semantics in order to prevent either of these conditions. In this talk, we will walk you through the redelivery semantics of Apache Pulsar, and highlight some of the control mechanisms available to application developers to control this behavior. Finally, we will present best practices for configuring message redelivery to suit various use cases.

Message Redelivery: An Unexpected Journey - Pulsar Summit SF 2022

StreamNative

Lakehouses are quickly growing in popularity as a new approach to Data Platform Architecture, bringing some of the long-established benefits of the OLTP world to OLAP, including transactions, record-level updates/deletes, and change streaming. In this talk, we will discuss Apache Hudi and how it unlocks possibilities of building your own fully open-source Lakehouse featuring a rich set of integrations with existing technologies, including Apache Pulsar. In this session, we will present: - What Lakehouses are, and why they are needed. - What Apache Hudi is and how it works. - A use case and demo that applies Apache Hudi’s DeltaStreamer tool to ingest data from Apache Pulsar.

Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...

StreamNative

Pulsar is a horizontally scalable messaging system, so the traffic in a logical cluster must be balanced across all the available Pulsar brokers as evenly as possible, in order to ensure full utilization of the broker layer. You can use multiple settings and tools to control the traffic distribution, which requires a bit of context to understand how the traffic is managed in Pulsar. In this talk, we will walk you through the load balancing capabilities of Apache Pulsar, and highlight some of the control mechanisms available to control the distribution of load across the Pulsar brokers. Finally, we will discuss the various load shedding strategies that are available. At the end of the talk, you will have a better understanding of how Pulsar's broker-level auto-balancing works, and how to properly configure it to meet your workload demands.

Understanding Broker Load Balancing - Pulsar Summit SF 2022

StreamNative

Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...

StreamNative

In today’s world, we are seeing a big shift toward the Cloud. With this shift comes a big shift in the expectations we have for a messaging system, especially when the messaging system is presented as a managed service in a large-scale, multi-tenant environment. For any large-scale enterprise, it’s very important to evaluate a messaging system and be confident in it before expanding complex distributed data systems like Apache Pulsar from on-premise to elastically scalable, fully managed services on cloud services. We must consider aspects such as: migration from and integration with large-scale on-premise clusters, security, cost efficiency, and the cloud friendliness of the architecture, modeling cost and capacity, tenant isolation, deployment robustness, availability, monitoring, etc. Not every messaging system is built to be cloud-native and run as a managed service with cost efficiency. We have been running large-scale Apache Pulsar at Yahoo for the last 8 years on various platforms and hardware configurations while meeting application SLAs and serving more than 1M topics in a cluster. In this talk, we will talk about Pulsar’s journey in Yahoo! from an on-premise platform to a hybrid cloud and on-premise system. We will talk about Pulsar’s architecture and features that make Pulsar a good cloud-native messaging-system choice for any enterprise.

Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022

StreamNative

Pulsar Summit San Francisco is the event dedicated to Apache Pulsar. This one-day, action-packed event will include 5 keynotes, 12 breakout sessions, and 1 amazing happy hour. Speakers are from top companies, including Google, AWS, Databricks, Onehouse, StarTree, Intel, ScyllaDB, and more! It’s the perfect opportunity to network with Pulsar thought leaders in person. Join developers, architects, data engineers, DevOps professionals, and anyone who wants to learn about messaging and event streaming for this one-day, in-person event. Pulsar Summit San Francisco brings the Apache Pulsar Community together to share best practices and discuss the future of streaming technologies.

Event-Driven Applications Done Right - Pulsar Summit SF 2022

StreamNative

Our services team creates, builds, and maintains the as-a-service offering for base platform services within our organization. Several thousand applications use these custom services daily, generating more than 700 million requests per minute. One of these services was our publish/subscribe offering, BQ, with a custom SDK and custom metrics based on Apache Pulsar. BQ is the core communication service within our organization, handling more than 200M RPM. All the core processes of the organization depend on this service for operation: the CDC of any of our RDBMS or NoSQL offerings, all the eventing efforts of the organization, async communication between apps, notification systems, etc. The backend of the solution was Apache Pulsar running on EC2 on AWS, and on top of that we built several components as wrappers of the actual backend, creating our own SDKs and abstractions and in many ways extending the features provided by Pulsar. We had a multi-cluster setup 100% on AWS, with custom Pulsar Docker images running on large ASG setups, along with our own wrapping and admin APIs and DBs. All of this, in turn, made the solution volatile.

Pulsar @ Scale. 200M RPM and 1K instances - Pulsar Summit SF 2022

StreamNative

There is an increasing need to unleash analytical capabilities directly to the end-users to democratize decision-making. User-Facing Analytics is a new frontier that will shape the products of tomorrow and push the limits of existing technology. It demands a solution that will scale to millions of users to provide fast, real-time insights. In this session, Xiang will talk about his journey to build Apache Pinot to tackle the analytics problem space with the architectural changes and technology inventions made over the past decade. He will also talk about how other big data companies such as LinkedIn, Uber, and Stripe power their user-facing analytical applications.

Data Democracy: Journey to User-Facing Analytics - Pulsar Summit SF 2022

StreamNative

Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022

StreamNative

Welcome and Opening Remarks - Pulsar Summit SF 2022

StreamNative

Milvus is an open-source vector database that leverages a novel data fabric to build and manage vector similarity search applications. As the world's most popular vector database, it has already been adopted in production by thousands of companies around the world, including Lucidworks, Shutterstock, and Cloudinary. With the launch of Milvus 2.0, the community aims to introduce a cloud-native, highly scalable and extendable vector similarity solution, and the key design concept is log as data. Milvus relies on Pulsar as the log pub/sub system. Pulsar helps Milvus reduce system complexity by loosely decoupling each microservice and makes the system stateless by disaggregating log storage and computation, which also makes the system further extendable. We will introduce the overall design, the implementation details of Milvus, and its roadmap in this talk. Takeaways: 1) Get a general idea of what a vector database is and its real-world use cases. 2) Understand the major design principles of Milvus 2.0. 3) Learn how to build a complex system with the help of a modern log system like Pulsar.

Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...

StreamNative

MQTT (Message Queuing Telemetry Transport) is a message protocol based on the pub/sub model with the advantages of compact message structure, low resource consumption, and high efficiency, which makes it suitable for IoT applications with low bandwidth and unstable network environments. This session will introduce MQTT on Pulsar (MoP), which allows developers who use the MQTT transport protocol to use Apache Pulsar. I will share the architecture, principles, and future planning of MoP, to help you understand Apache Pulsar's capabilities and practices in the IoT industry.

MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...

StreamNative

More from StreamNative (20)