Building real time Data Pipeline using Spark Streaming
Challenges and Solutions

1. Building real time data pipeline
Challenges and solutions
2.
● Data Architect @ Intuit
● Built a real-time data pipeline
● Deployed it in production
Email: Veeramani_Moorthy@intuit.com
Gmail: veeru.moorthy
LinkedIn: https://www.linkedin.com/in/veeramani-moorthy-0ab4a72/
3.
● Business goal
● NRT pipeline architecture
● Challenges involved & solutions
● Metrics & monitoring architecture
● Challenges involved & solutions
● Q & A
4.
● Build a low-latency (1-minute SLA) data pipe that listens to database changes, transforms them, and publishes the final outcome to Salesforce
● Zero data loss
● Ordering guarantee
5. Technologies used
● Confluent 2.0.1
○ Kafka
○ Schema Registry
○ Kafka Connect
○ ZooKeeper
● Spark Streaming 1.6.1
● Datomic 0.9
● DSE 4.2
6. CDC Schema
● Payload
○ Before record
○ After record
● Header
○ Frag number
○ Seq number
○ Table name
○ Shard id
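As a rough illustration, the event shape above might map to case classes like these (a sketch only; field names and types are assumptions, not the exact wire format):

    // Sketch of the CDC event shape; names and types are illustrative assumptions.
    case class CdcHeader(
      fragNumber: Long,  // fragment number within a transaction
      seqNumber: Long,   // monotonically increasing sequence per shard
      tableName: String, // source table the change came from
      shardId: String    // shard that produced the change
    )

    case class CdcPayload(
      before: Option[Map[String, String]], // row image before the change (None for inserts)
      after: Option[Map[String, String]]   // row image after the change (None for deletes)
    )

    case class CdcEvent(header: CdcHeader, payload: CdcPayload)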
7. Out-of-sequence events
● Will I be able to detect them?
● How do I handle them?
○ Single-partition Kafka topic
○ Multi-partition topic, hash-partitioned on the PK (see the sketch below)
○ Read first, before writing
○ Go with an EVAT data model with change history
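A minimal sketch of the multi-partition option: key every record by the row's primary key, so Kafka's default partitioner hashes a given PK to the same partition every time and per-key ordering is preserved. The topic name and broker address are placeholders:

    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import java.util.Properties

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    def publish(primaryKey: String, changeJson: String): Unit =
      // All changes for one row share a key, hence a partition, hence an order.
      producer.send(new ProducerRecord[String, String]("cdc-events", primaryKey, changeJson))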
8. Late arrival
● Can we allow delayed events?
● Embrace eventual consistency
○ Eventual is OK
○ Never is not OK
We will maintain state for only 5 minutes. Is that an option? (see the sketch below)
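One way to bound that state, sketched with Spark 1.6's updateStateByKey; the state shape and eviction rule are assumptions from the slide, not the talk's actual code. Returning None from the update function evicts the key's state:

    // Keep per-key reconciliation state for at most 5 minutes.
    case class Pending(lastSeen: Long, events: List[String])

    val maxAgeMs = 5 * 60 * 1000L

    def update(newEvents: Seq[String], state: Option[Pending]): Option[Pending] = {
      val now = System.currentTimeMillis()
      if (newEvents.nonEmpty)
        Some(Pending(now, state.map(_.events).getOrElse(Nil) ++ newEvents))
      else
        state.filter(s => now - s.lastSeen < maxAgeMs) // silent keys age out
    }

    // usage on a DStream keyed by primary key:
    //   keyedStream.updateStateByKey(update _)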
9. Spark Streaming (throughput vs. latency)
Especially in the context of updating a remote data store:
● Foreach
● MapReduce
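The tension: per-record writes keep latency low but pay a connection and a round trip per element, while partition-level batching trades a little latency for much higher throughput. A sketch of the batched pattern, where RemoteStoreClient and its methods are hypothetical stand-ins, not a real API:

    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical remote-store client; stands in for your actual store's driver.
    trait RemoteStoreClient {
      def upsertBatch(rows: Seq[String]): Unit
      def close(): Unit
    }
    object RemoteStoreClient { def connect(host: String): RemoteStoreClient = ??? }

    def writeToStore(stream: DStream[String]): Unit =
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          // One connection per partition, not per record.
          val client = RemoteStoreClient.connect("store-host:9042")
          try records.grouped(500).foreach(batch => client.upsertBatch(batch.toSeq))
          finally client.close()
        }
      }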
10. Schema evolves over time
[Diagram: the schema at time t1 vs. at time t2]
Does downstream processing fail?
11. Use a Schema Registry, which supports versioning
[Diagram: Kafka topic with schemas stored in the Schema Registry]
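A minimal sketch of wiring the Confluent Schema Registry into a producer: the Avro serializer registers each schema version under the topic's subject, so consumers can decode records written under both old and new versions. Addresses are placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.KafkaProducer

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
    props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
    props.put("schema.registry.url", "http://schema-registry:8081")

    val producer = new KafkaProducer[AnyRef, AnyRef](props)
    // Each record carries the id of its registered schema version; a consumer
    // using KafkaAvroDeserializer fetches whichever version wrote the record.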
12. When you go live
● It's essential to bootstrap your system
● Built a bootstrap connector
● Due to the huge data load, it takes a few minutes to hours
● During bootstrap, the DB state might be changing
So, does that cause data loss?
13. Enable CDC before you bootstrap
● Duplicates are okay, but data loss is not okay
● Ensure an at-least-once guarantee (see the sketch below)
Good to support selective bootstrap as well
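With at-least-once delivery, bootstrap overlap and replays will produce duplicates, so writes need to be idempotent. A sketch of one way to do it, comparing the header's sequence number against what was last applied (the mutable map stands in for the real target store):

    import scala.collection.mutable

    // Store keyed by PK, holding (last applied seq number, row image).
    def applyEvent(store: mutable.Map[String, (Long, String)],
                   pk: String, seqNumber: Long, after: String): Unit =
      store.get(pk) match {
        case Some((appliedSeq, _)) if appliedSeq >= seqNumber =>
          () // duplicate or stale replay: dropping it is safe
        case _ =>
          store(pk) = (seqNumber, after) // newer change: the upsert wins
      }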
14. Published corrupted data for the past N hours
● A defect in your code
● A misconfiguration
● Some system failure
You can fix the problem & push the fix. But will it fix the data retrospectively?
15. Answer: build replay
● Build replay at every stage of the pipeline
● If not, at least at the very first stage
● Now, how do you build replay?
○ Checkpoint (topic, partition & offset)
○ Traceability
○ Restart the pipe from a given offset (see the sketch below)
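A sketch of the restart step with the Spark 1.6 direct Kafka API: createDirectStream accepts explicit starting offsets, so the pipe can resume, or replay, from whatever your own checkpoint store says. Topic, partition, and offset values here are placeholders:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.KafkaUtils

    def replayFrom(ssc: StreamingContext) = {
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
      // Loaded from your own checkpoint store; values are placeholders.
      val fromOffsets = Map(TopicAndPartition("cdc-events", 0) -> 12345L)
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
        ssc, kafkaParams, fromOffsets,
        (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))
    }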
16. Spark Streaming: checkpointing
17. Pitfalls
● Spark checkpoints the entire DAG (as a binary)
○ Up to which offset has it processed?
○ To replay, can you set the offset to some older value?
● Will you be able to upgrade or reconfigure your Spark app easily?
● Also, it auto-acks
Don't rely on Spark checkpointing; build your own (see the sketch below)
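A sketch of the "build your own" alternative: capture each batch's (topic, partition, offset) ranges via HasOffsetRanges and persist them yourself, only after the batch's output has been written downstream. saveOffsets is a hypothetical function backed by your durable store:

    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.streaming.kafka.HasOffsetRanges

    // Hypothetical: persist (topic, partition, offset) to a durable store.
    def saveOffsets(topic: String, partition: Int, untilOffset: Long): Unit = ???

    def checkpointOffsets(stream: DStream[(String, String)]): Unit =
      stream.foreachRDD { rdd =>
        val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        // ... process and write the batch downstream first ...
        ranges.foreach(r => saveOffsets(r.topic, r.partition, r.untilOffset))
      }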
18. All Kafka brokers went down; then what?
● We usually restart them one by one
● Noticed data loss on some topics
Does Kafka lose data?
19. Kafka broker setup
20. Kafka broker - failover scenario
21. So, if all Kafka brokers go down
Restart them in the reverse order of their failures.
Is that good enough?
22. What if followers are lagging behind?
● Again, this can cause data loss
● The min.insync.replicas config to the rescue (see the sketch below)
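The config works together with producer acks. A sketch of the producer side (values are illustrative): with acks=all, a write succeeds only once the in-sync replica set, at least min.insync.replicas brokers, has it, so a lagging follower cannot silently become leader with missing data.

    val props = new java.util.Properties()
    props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092")
    props.put("acks", "all") // wait for all in-sync replicas, not just the leader
    // Topic/broker side (configuration, not code):
    //   replication.factor=3, min.insync.replicas=2
    // With fewer than 2 in-sync replicas, such writes fail instead of losing data.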
23. Kafka Connect setup
● Standalone mode
● Distributed mode
24. Diagnosing data issues
● Data loss
● Data corruption
● SLA misses
How do you quickly diagnose the issue?
25. Diagnosing data issues quickly
● Need a mechanism to track each event uniquely, end to end (see the sketch below)
● Log aggregation
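A sketch of that tracking mechanism (field and stage names are assumptions): stamp each event with a unique id at ingest and emit one structured log line per stage, so a single search on the id in the log-aggregation system reconstructs the event's full path.

    import java.util.UUID

    case class TrackedEvent(traceId: String, payload: String)

    def ingest(payload: String): TrackedEvent =
      TrackedEvent(UUID.randomUUID().toString, payload)

    def logStage(e: TrackedEvent, stage: String): TrackedEvent = {
      // One structured log line per stage; the aggregator indexes traceId,
      // so one query shows the event's whole journey.
      println(s"""{"traceId":"${e.traceId}","stage":"$stage","ts":${System.currentTimeMillis()}}""")
      e
    }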
26. Batch vs. streaming
● In general, when do you choose streaming?
○ Time-critical data
○ Quick decisions
● For a lot of use cases, 30-minute batch processing will do just fine
● Run both batch & real-time streaming on the same data
27. Batch & Streaming
28. Metrics & Monitoring Architecture
[Diagram: CDC Connector, Reconciler, Transformer, JMS Connector, and CDC EBS Consumer all emit audit events to an Audit Streaming Job]
29. SLA computation
● Source DB timestamp
● Stage timestamp
● SLA = stage TS - source TS
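The computation itself is a simple subtraction, but it is only meaningful if both clocks agree (next slide). A small worked example:

    // Per-stage lag = stage timestamp minus source DB commit timestamp.
    def slaMillis(sourceDbTs: Long, stageTs: Long): Long = stageTs - sourceDbTs

    // An event committed at t = 1,000 ms and transformed at t = 46,000 ms
    // has 45 s of lag: within the 1-minute SLA.
    assert(slaMillis(1000L, 46000L) == 45000L)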
30. Use NTP to keep clocks in sync across all nodes
31. Are these the only challenges?
32. Questions?
