Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apache Pulsar - Pulsar Summit SF 2022


Despite what the Ghostbusters said, we’re going to go ahead and cross (or, join) the streams. This session covers getting started with streaming data pipelines, maximizing Pulsar’s messaging system alongside one of the most flexible streaming frameworks available, Apache Flink. Specifically, we’ll demonstrate the use of Flink SQL, which provides various abstractions and allows your pipeline to be language-agnostic. So, if you want to leverage the power of a high-speed, highly customizable stream processing engine without the usual overhead and learning curves of the technologies involved (and their interconnected relationships), then this talk is for you. Watch the step-by-step demo to build a unified batch and streaming pipeline from scratch with Pulsar, via the Flink SQL client. This means you don’t need to be familiar with Flink (or even with a specific programming language). The examples provided are built for highly complex systems, but the talk itself will be accessible to any experience level.


  1. Cross the Streams! Creating Data Pipelines with Apache Flink + Pulsar ● Caito Scherr – Developer Advocate – Ververica
  2. Agenda ● 00 Who am I? ● 01 Intro to Flink SQL ● 02 Flink SQL Demo ● 03 Flink + Pulsar ● @CAITO_200_OK
  3-6. Who am I? ● 00 Caito Scherr ● 01 Apache Flink ● 02 DevRel @ Ververica ● 03 Portland, Oregon ● @CAITO_200_OK
  7. Intro to Flink SQL
  8. Stream Processing
  9. Stream Processing > The Challenges ● You can’t pause to fix it ● Lots of data, FAST ● Ingesting multiple formats ● Failure recovery ● Needs to scale
  10-11. Flink > Addressing Stream Processing’s Challenges
  12. Flink > Basics ● Building Blocks (events, state, (event) time) ● DataStream API (streams, windows) ● Table API (dynamic tables) ● Flink SQL ● PyFlink ● Diagram labels: Ease of Use, Expressiveness, Streaming Analytics & ML, Stateful Stream Processing
  13. Flink > Summary ● Flexible APIs: Ease of use/Expressiveness; Wide Range of Use Cases ● High Performance: Local State Access; High Throughput/Low Latency ● Stateful Processing: State = First-class Citizen; Event-time Support ● Fault Tolerance: Distributed State Snapshots; Exactly-once Guarantees
  14-20. Flink SQL ● Stream processing: real-time processing ● Stream processing is complex ● Flink is a highly performant streaming engine ● Flink solves many problems in streaming ● Flink is complex ● Flink SQL: access to Flink’s benefits ● Abstracts away the complexity
  21. Flink SQL Demo ● Making the complex simple ● You could start a data pipeline anywhere! ● Language agnostic ● From: Free Guy movie
  22. Flink SQL Demo > Regular SQL ● Input table clicks (user, cTime, url): (Mary, 12:00:00, https://…), (Bob, 12:00:00, https://…), (Mary, 12:00:02, https://…), (Liz, 12:00:03, https://…) ● Query: SELECT user_id, COUNT(url) AS cnt FROM clicks GROUP BY user_id; ● Result (user, cnt): (Mary, 2), (Bob, 1) ● Take a snapshot when the query starts ● A final result is produced ● A row that was added after the query was started is not considered ● The query terminates ● Image: Marta Paes @morsapaes
  23. Flink SQL Demo > Flink SQL ● Input table clicks (user, cTime, url): (Mary, 12:00:00, https://…), (Bob, 12:00:00, https://…), (Mary, 12:00:02, https://…), (Liz, 12:00:03, https://…) ● Query: SELECT user_id, COUNT(url) AS cnt FROM clicks GROUP BY user_id; ● Result rows shown (user, cnt): (Bob, 1), (Liz, 1), (Mary, 1), (Mary, 2) ● Ingest all changes as they happen ● Continuously update the result ● The result is identical to the one-time query (at this point) ● Image: Marta Paes @morsapaes
  24. Flink SQL Demo
  25. Flink SQL Demo ● Check Java version ● Download Flink Snapshot ● Un-tar it
  26. What Next? >> Flink SQL Cookbook
  32. Flink SQL Demo ● Flink SQL + DataGen ● Same startup steps ● True stream processing example
  35. Pulsar + Flink
  36. Flink + Pulsar ● “Stream as a unified view on data” ● “Batch as a special case of streaming”
  37. Flink + Pulsar ● Pub/Sub messaging layer (streaming) ● Durable storage layer (batch)
  38. Flink + Pulsar > Unified Processing with Flink ● Mix historic & real-time ● Reuse code & logic ● Simplify operations ● Timeline diagram labels: start of the stream, past, now, future, bounded query, unbounded query
  39. Flink + Pulsar > Unified data stack ● Unified Processing Engine (Batch / Streaming) ● Unified Storage (Segments / Pub/Sub)
  40-41. Demo > Twitter Firehose ● Demo: Marta Paes @morsapaes
  42. Demo > Twitter Firehose ● Catalog DDL: CREATE CATALOG pulsar WITH ( 'type' = 'pulsar', 'service-url' = 'pulsar://pulsar:6650', 'admin-url' = 'http://pulsar:8080', 'format' = 'json' );
  43. Demo > Twitter Firehose ● Not cool. 👹
  44. Demo > Get Relevant Timestamps ● CREATE TABLE pulsar_tweets ( publishTime TIMESTAMP(3) METADATA, WATERMARK FOR publishTime AS publishTime - INTERVAL '5' SECOND ) WITH ( 'connector' = 'pulsar', 'topic' = 'persistent://public/default/tweets', 'value.format' = 'json', 'service-url' = 'pulsar://pulsar:6650', 'admin-url' = 'http://pulsar:8080', 'scan.startup.mode' = 'earliest-offset' ) LIKE tweets; ● Derive schema from the original topic ● Define the source connector (Pulsar) ● Read and use Pulsar message metadata
  45. Demo > Windowed Aggregation ● Sink Table DDL: CREATE TABLE pulsar_tweets_agg ( tmstmp TIMESTAMP(3), tweet_cnt BIGINT ) WITH ( 'connector'='pulsar', 'topic'='persistent://public/default/tweets_agg', 'value.format'='json', 'service-url'='pulsar://pulsar:6650', 'admin-url'='http://pulsar:8080' ); ● Continuous SQL Query: INSERT INTO pulsar_tweets_agg SELECT TUMBLE_START(publishTime, INTERVAL '10' SECOND) AS wStart, COUNT(id) AS tweet_cnt FROM pulsar_tweets GROUP BY TUMBLE(publishTime, INTERVAL '10' SECOND);
  46. Demo > Tweet Count in Windows
  47. What’s Next?
  48-50. What Next
  51. How to Get Involved ● Getting involved page: one source for Flink community resources ● https://flink.apache.org/community.html
  52. Contribute ● GitHub ● Issue Tracker ● Becoming a Committer
  54. New Slack Space! ● Go-to space for user troubleshooting ● 800 members in less than 2 months ● Members include most of the Flink committers + PMC members
  56. Hangout With Us ● Regional meetups ● Virtual and in person options ● https://www.meetup.com/topics/apache-flink/
  57. Stay Connected ● Twitter ● Website ● Blog - Flink ● Blog - Ververica ● YouTube
  58. Thank you ● info@ververica.com ● www.ververica.com ● @VervericaData
  59. Questions? ● caito@ververica.com ● @CAITO_200_OK ● info@ververica.com ● www.ververica.com ● @VervericaData
  60. Resources ● Flink Ahead: What Comes After Batch & Streaming: https://youtu.be/h5OYmy9Yx7Y ● Apache Pulsar as one Storage System for Real Time & Historical Data Analysis: https://medium.com/streamnative/apache-pulsar-as-one-storage-455222c59017 ● Flink Table API & SQL: https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/queries.html#operations ● Flink SQL Cookbook: https://github.com/ververica/flink-sql-cookbook ● When Flink & Pulsar Come Together: https://flink.apache.org/2019/05/03/pulsar-flink.html ● How to Query Pulsar Streams in Flink: https://flink.apache.org/news/2019/11/25/query-pulsar-streams-using-apache-flink.html ● What’s New in the Flink/Pulsar Connector: https://flink.apache.org/2021/01/07/pulsar-flink-connector-270.html ● Marta’s Demo: https://github.com/morsapaes/flink-sql-pulsar
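
The contrast drawn on slides 22 and 23 (a one-time SQL query versus a continuous Flink SQL query) and the 10-second tumbling window from slide 45 can be imitated in plain Python. The sketch below is not Flink code and says nothing about how Flink evaluates queries internally. The function names `snapshot_count`, `continuous_count`, and `tumbling_counts`, the placeholder URLs, and the integer timestamps are all illustrative assumptions; watermarks and late events are ignored entirely.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

Click = Tuple[str, str, str]  # (user, click time, url)

# Sample rows modeled on the slides' `clicks` table; the URLs are placeholders.
CLICKS: List[Click] = [
    ("Mary", "12:00:00", "https://example.com/a"),
    ("Bob", "12:00:00", "https://example.com/b"),
    ("Mary", "12:00:02", "https://example.com/c"),
    ("Liz", "12:00:03", "https://example.com/d"),
]


def snapshot_count(rows: Iterable[Click]) -> Dict[str, int]:
    """One-time query: count clicks per user over the rows present at start."""
    return dict(Counter(user for user, _, _ in rows))


def continuous_count(rows: Iterable[Click]) -> Iterator[Tuple[str, int]]:
    """Continuous query: emit an updated (user, cnt) row for each new click."""
    counts: Counter = Counter()
    for user, _, _ in rows:
        counts[user] += 1
        yield user, counts[user]


def tumbling_counts(event_seconds: Iterable[int], size: int = 10) -> Dict[int, int]:
    """Count events per fixed, non-overlapping window, keyed by window start."""
    windows: Counter = Counter()
    for ts in event_seconds:
        windows[ts - ts % size] += 1  # align the timestamp to its window start
    return dict(sorted(windows.items()))


if __name__ == "__main__":
    # The snapshot sees only the first three rows; the fourth arrives later.
    print(snapshot_count(CLICKS[:3]))
    # The continuous version keeps emitting updates as rows arrive.
    for update in continuous_count(CLICKS):
        print(update)
    # Hypothetical event times in seconds, grouped into 10-second windows.
    print(tumbling_counts([0, 3, 9, 10, 25]))
```

Keeping only the latest emitted count per user from `continuous_count` gives the same answer a one-time query would return over the same rows, which is the point the slide makes with "identical to the one-time query (at this point)".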
