You’re Spiky and We Know It With Ravindra Bhanot | Current 2022

Twilio is a communications company whose SMS and Voice APIs see traffic spikes and troughs depending on the time of day and week: when companies send out communications (think early-morning stock updates) or when end customers interact with companies (think restaurant delivery orders placed at lunch or dinner time).
When building a generic monitoring system using Kafka Streams with exactly-once processing, this spikiness means that some customers (think companies with a lot of interactions, as above) have more traffic at certain times of the day, which can let them overpower traffic for other customers (think a group of hospitals sending communications about appointments).
This talk covers the challenges that Twilio faced when building such a monitoring platform, which aggregates customer data and sends alerts in a timely manner under an SLA. The optimizations covered in depth are:
1. TOPOLOGY - Spreading of work across the Kafka Streams application topology
2. STORAGE - State store reads and writes from topology nodes to avoid data skew
3. COMPUTE - Use of the Punctuator from the Processor API to avoid competition between customers

Listen to this talk if building a real-time alerting pipeline, spreading a mix of global and local state stores, aggregations over upserts, or the use of punctuators in the wild piques your interest.


  1. © 2019 TWILIO INC. ALL RIGHTS RESERVED. You’re spiky and we know it!! Ravindra Bhanot © 2022 TWILIO INC. ALL RIGHTS RESERVED
  2. Key Contributors: Ravi Bhanot (Principal Software Engineer), Thomas D’Silva (Principal Software Engineer), Scott Reynolds (Architect). Contributions: evolved the app through its stages to support scale and requirements; critical in designing the composable and flexible filters; guided the key and value design of schemas with scalability and latency considerations.
  3. Introduction to Twilio
  4. Twilio communication data patterns - spikes and seasonality: spikes at hour boundaries; elevated traffic at some hours of the day
  5. Alerting use case / Minimum viable product - Alert customers on defined threshold criteria as close to real time as possible - Allow setting alerts over a subset of data in the event stream using field-level filters - Allow flexible time rollups - Allow statistical operations over rollups - SUM, COUNT, AVERAGE, PERCENTAGE
  6. Approaches to creating a monitoring solution for customers ● Cache/database-based approach with cron/routines to check thresholds ● Streams-based approach
  7. Chosen way ● Kafka Streams due to familiarity ● Ease of maintaining state in state stores ● Ease of calculating and updating aggregates
  8. Evolution of the solution: the journey of how we went from a simple two-stage (filter and aggregate) application to a three-stage scalable solution. Image credit - Shutterstock
  9. Glossary for forthcoming slides ● Input topic - Stream of records on which aggregates will be calculated and thresholds will be checked. ● Alert Config - User/customer-defined configuration of thresholds, time constraints and filters. ● Account - Notion of a single user of the system whose interactions will cause multiple events on the input topic. ● Alert - A single violation or recovery event of an alert config’s threshold criteria. ● Sink - Send the stream of records to the Kafka broker rather than just forwarding it to the next stage of the app locally inside a machine.
  10. Sample Alert Config/Alert Criteria definition { "alert_config_sid": "AK10001", "account_sid": "AC40c3a0fa4f71f6f0b7cbd895724fb211", "recordFilter": { "field_name": {"string": "phonenumber"}, "equality_type": "EQUALS", "field_value": {"string": "+1949xxxxxxx"}, "operator_type": "LEAF", "operands": null }, "dataset": "BillingTransactions", "measure_field_name": {"string": "amount"}, "threshold": { "alert_level_threshold": "100", "comparison": "ABOVE", "operation": "SUM", "time_period": "FIVE_MINS" }, "flap_detection_method": "WEIGHTED_AVERAGE", "notification_preferences": [..], "date_created_epoch_milli": 1599071100000, "date_updated_epoch_milli": 1599071100000 }
  11. Sample Triggered Alert. ERROR SAMPLE - { "alert_config_sid": "AK10001", "dataset": "BillingTransactions", "account_sid": "AC40c3a0fa4f71f6f0b7cbd895724fb211", "alert_time_epoch": 1599071100000, "metric_value": "200", "alert_status": "ERROR" } RECOVERED SAMPLE - { "alert_config_sid": "AK10001", "dataset": "BillingTransactions", "account_sid": "AC40c3a0fa4f71f6f0b7cbd895724fb211", "alert_time_epoch": 159907113600, "metric_value": "10", "alert_status": "RECOVERED" }
  12. Sample Alert Config with Composable Filters { . . "recordFilter": { "field_name": null, "equality_type": "EQUALS", "field_value": null, "operator_type": "OR", "operands": { "array": [ { "field_name": {"string": "status"}, "equality_type": "EQUALS", "field_value": {"string": "SENT"}, "operator_type": "LEAF", "operands": null }, { "field_name": {"string": "status"}, "equality_type": "EQUALS", "field_value": {"string": "DELIVERED"}, "operator_type": "LEAF", "operands": null } ] } }, . .
  13. Australopithecus state - Filter, Aggregate & Inspect
  14. Filter, Aggregate & Inspect - Key design considerations - Alert configs topic is read into a global state store - All alert configs are available on all machines in the app cluster - The key of the output topic of every stage includes account_id and alert_config_id for contextual reference when processing - Sink the output topic of the first stage (Filter) to the broker - Records get distributed across the machines in the app cluster as the source for the second stage (Aggregator) - The third stage (Inspect) consumes the output of the second stage (Aggregator) local to each machine
  15. Filter, Aggregate & Inspect - Downsides - Too many records in the first stage (Filter) for one account cause too many inspections in the third stage (Inspect), which causes lag in inspecting other accounts when their records show up - Aggregates for the present minute are still being populated while they are inspected, which can trigger false alerts. Image credit - Adobe
  16. Filter, Aggregate & Inspect - Crucial Learnings - Distribution of records amongst partitions: sink to the broker if trying a repartition; forward to the next stage if the processing metadata is locally present. - State store/RocksDB tuning: a) use an LRU cache with a Bloom filter; b) use kHashSearch instead of the default binary search in case of frequent lookups. Blog - https://www.twilio.com/blog/kafka-streams-near-real-time
  17. Questions?
  18. Neanderthal state - Filter, Aggregate & Punctuate
  19. Filter, Aggregate & Punctuate - Design considerations - Punctuator to inspect the state of local aggregates: punctuate every 60 seconds based on wall-clock time so that only aggregates local to a node are inspected. Aggregates are not read immediately, to allow record state to finalize. - Aggregation pauses when the punctuator runs: punctuation should be quick, and as current-minute aggregates are not read during punctuation, it is okay to pause processing of a stage to free up cores for the punctuator.
  20. Filter, Aggregate & Punctuate - Downsides - Limit the number of alert configs per account, as the time taken scales with the number of alert configs: performance-test to determine the limit on the number of alert configs per account for a single Twilio use case. - Flappy/toggling alerts: alerts can toggle between ERROR and RECOVERED states, causing a lot of notification spam for the end customer.
  21. Homo Sapiens state - Filter, Aggregate/Punctuate & Conduct
  22. Filter, Aggregate/Punctuate & Conduct - Design features - Decaying weighted average to avoid over-notifying customers: based on the history of the last X time windows, assign decreasing weights based on the time of state transitions (example - ERROR -> RECOVERED). Use this cumulative score to decide whether the alert is flappy or not. Image credit - Nagios
  23. Scalability testing
  24. Future state - Support predictive or proactive operations to avoid entering bad states - Make the punctuator adaptive to real-time needs - Detect anomalies in data
  25. Q & A. Other notable contributors over the years ● Minakshi Korad ● Georgiana Ogrean ● Jyotsna Shevade ● Ram Kolla ● Dante Bourret ● Sriram Ramarathnam ● Tom Tobin
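
The composable filter on slide 12 is a tree: LEAF nodes compare one field and OR nodes combine operands. A minimal Python sketch of how such a tree could be evaluated against a record is given here. The helper names `_unwrap` and `matches` and the AND operator are assumptions for illustration; the slides only show LEAF, OR and EQUALS, and this is not Twilio's implementation.

```python
from typing import Any, Dict, Optional


def _unwrap(union: Optional[Dict[str, Any]], branch: str) -> Any:
    """Unwrap the {"string": ...} / {"array": ...} wrappers seen in the slide."""
    return None if union is None else union.get(branch)


def matches(record_filter: Dict[str, Any], record: Dict[str, Any]) -> bool:
    """Return True if the record satisfies the (possibly nested) filter."""
    operator = record_filter["operator_type"]
    if operator == "LEAF":
        if record_filter["equality_type"] != "EQUALS":
            raise ValueError("only EQUALS is shown in the slides")
        field = _unwrap(record_filter["field_name"], "string")
        expected = _unwrap(record_filter["field_value"], "string")
        return field in record and record[field] == expected
    operands = _unwrap(record_filter.get("operands"), "array") or []
    if operator == "OR":
        return any(matches(child, record) for child in operands)
    if operator == "AND":  # assumption: AND is not shown in the slides
        return all(matches(child, record) for child in operands)
    raise ValueError(f"unknown operator_type: {operator}")
```

Because the evaluator recurses on `operands`, filters can nest to any depth without changing the record-matching code, which is one plausible reading of "composable" here.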
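
Slide 19 describes a punctuator that inspects only finalized aggregates and skips the minute still being populated. The Python sketch below simulates that idea with plain dictionaries, using the SUM / ABOVE / FIVE_MINS threshold from slide 10. It does not use the Kafka Streams API; the class `MinuteAggregates`, the function `punctuate`, and the per-minute bucket layout are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

MINUTE_MS = 60_000


class MinuteAggregates:
    """Per-minute SUM buckets keyed by (alert_config_sid, minute_start_ms)."""

    def __init__(self) -> None:
        self._sums: Dict[Tuple[str, int], float] = defaultdict(float)

    def add(self, config_sid: str, event_time_ms: int, amount: float) -> None:
        minute_start = event_time_ms - event_time_ms % MINUTE_MS
        self._sums[(config_sid, minute_start)] += amount

    def window_sum(self, config_sid: str, now_ms: int, minutes: int) -> float:
        """Sum the last `minutes` complete buckets, excluding the current minute."""
        current_minute = now_ms - now_ms % MINUTE_MS
        total = 0.0
        for i in range(1, minutes + 1):
            bucket = (config_sid, current_minute - i * MINUTE_MS)
            total += self._sums.get(bucket, 0.0)
        return total


def punctuate(store: MinuteAggregates, thresholds: Dict[str, float],
              now_ms: int, minutes: int = 5) -> List[Tuple[str, float]]:
    """Return (alert_config_sid, value) for every config whose SUM is ABOVE its threshold."""
    breaches = []
    for config_sid, threshold in thresholds.items():
        value = store.window_sum(config_sid, now_ms, minutes)
        if value > threshold:
            breaches.append((config_sid, value))
    return breaches
```

In this sketch `punctuate` would be called once per wall-clock minute. Skipping the current bucket means a record that arrives mid-minute can never trigger an alert on a half-filled aggregate, which is the false-alert downside listed on slide 15.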
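
Slide 22 mentions a decaying weighted average over recent state transitions to decide whether an alert is flapping. One way such a score could be computed is sketched below in Python. The geometric decay factor, the normalization, the 0.5 cut-off, and the function names `flap_score` and `is_flapping` are all assumptions; the slides do not give the actual weights or threshold.

```python
from typing import Sequence


def flap_score(history: Sequence[str], decay: float = 0.8) -> float:
    """Weighted share of window-to-window state changes, in the range 0.0 to 1.0.

    `history` is ordered oldest to newest (for example "ERROR" / "RECOVERED").
    The most recent transition slot gets weight 1.0 and each older slot is
    multiplied by `decay` once more, so older flips count for less.
    """
    if len(history) < 2:
        return 0.0
    score = 0.0
    total = 0.0
    for i in range(1, len(history)):
        age = len(history) - 1 - i  # 0 for the newest transition slot
        weight = decay ** age
        total += weight
        if history[i] != history[i - 1]:
            score += weight
    return score / total


def is_flapping(history: Sequence[str], threshold: float = 0.5,
                decay: float = 0.8) -> bool:
    """Treat the alert as flappy when the weighted score reaches the threshold."""
    return flap_score(history, decay) >= threshold
```

A notifier built on this could suppress ERROR/RECOVERED messages while `is_flapping` is true and resume once the score decays, which addresses the notification spam described on slide 20.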
