Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.
1 Private and Confidential1 Public
In-Stream Processing Service Blueprint
Reference architecture for real-time Big Data ap...
2
What we’ll talk about today
• What’s In-Stream Processing?
• How it’s used to process huge data streams in real time
• H...
3
In a complex landscape of Big Data systems…
4
…In-Stream Processing Service is
an approach to build real-time Big Data applications
Today’s focus
5
• Fraud detection
• Sentiment analytics
• Preventive maintenance
• Facilities optimization
• Network monitoring
• Intell...
6
Open source world is diverse and confusing
7
What credentials do we have to talk about this?
Big Data history @Grid Dynamics
8
Blueprint goals
Pre-integrated Enterprise-gradeCloud-ready
Proven mission-
critical use
Built 100% from
leading open
sou...
9
Target performance & reliability SLAs
Throughput Up to 100,000 events per second
Latency 1-60 seconds
Retention Raw data...
10
Conceptual architecture
11
Selected stack
• REST API
• Message
Queue
• HDFS
12
Every component is scalable in its own way
• No single point of failure
• Automatic failover
• Data replication
13
Multi-node Kafka cluster
• Undisputed modern choice
for real time MOM
• Retention and replay
• Scalable via partitionin...
14
Single Zookeeper cluster for all components
• Distributed coordination
service, facilitates HA
of other clustered servi...
15
Spark streaming:
Central component of the platform
Why Spark streaming?
• Leading In-Stream Processing middleware
• Act...
16
Cassandra as operational store
• Massively scalable, highly available
NoSQL database
• Ideal choice as large operationa...
17
REDIS as a lookup database
• Simple, cheap and super-
performant lookup store
• Needed when event processing
requires f...
18
Putting all the pieces together:
end-to-end platform configuration
• No single points of failure
• No bottlenecks
• Sca...
19
Out of scope
• Dashboards for visualization, monitoring, and reporting
of In-Stream Processing results
• Tools and mode...
20
Case study:
Large media agency
Business opportunity
• Real-time popularity trends for
the content across all client’s
p...
21
Case study:
Implementation details
22
Summary
• In-Stream Processing is a hot new technology
• It can process mind-boggling volumes of events in real-time an...
23
What we didn’t get to talk about today
• Docker, Docker, Docker: how to make auto-deployment and auto-scaling work
• Da...
24
Thank you!
Anton Ovchinnikov: aovchinnikov@griddynamics.com
Grid Dynamics blog: blog.griddynamics.com
Follow up on twit...
In-Stream Processing Service Blueprint, Reference architecture for real-time Big Data applications
Nächste SlideShare
Wird geladen in …5
×

In-Stream Processing Service Blueprint, Reference architecture for real-time Big Data applications

931 Aufrufe

Veröffentlicht am

What is it about? In-Stream Event Processing is a new approach for building near real time big data systems with rapidly growing user base and applications like clickstream analytics, preventive maintenance or fraud detection. Maturity of some open source projects enables building an enterprise grade In-Stream Processing service in-house. However the open source world comprises of many competing projects of different maturity, having different perspectives so the task to select effective and efficient projects is not straightforward. In the talk I’ll present a blueprint of an In-Stream Processing Service, enterprise grade reliable and scalable, cloud ready, build from 100% open source components.

Veröffentlicht in: Technologie

In-Stream Processing Service Blueprint, Reference architecture for real-time Big Data applications

  1. 1. 1 Private and Confidential1 Public In-Stream Processing Service Blueprint Reference architecture for real-time Big Data applications Anton Ovchinnikov, Data Scientist, Grid Dynamics
  2. 2. 2 What we’ll talk about today • What’s In-Stream Processing? • How it’s used to process huge data streams in real time • How to build in-stream processing with open source • All about scale and reliability considerations • Example of large-scale customer implementation • Where can you learn more (hint: blog.griddynamics.com)
  3. 3. 3 In a complex landscape of Big Data systems…
  4. 4. 4 …In-Stream Processing Service is an approach to build real-time Big Data applications Today’s focus
  5. 5. 5 • Fraud detection • Sentiment analytics • Preventive maintenance • Facilities optimization • Network monitoring • Intelligence and surveillance • Risk management • E-commerce • Clickstream analytics • Dynamic pricing • Supply chain optimization • Predictive medicine • Transaction cost analysis • Market data management • Algorithmic trading • Data warehouse augmentation Multiple industries and use cases
  6. 6. 6 Open source world is diverse and confusing
  7. 7. 7 What credentials do we have to talk about this? Big Data history @Grid Dynamics
  8. 8. 8 Blueprint goals Pre-integrated Enterprise-gradeCloud-ready Proven mission- critical use Built 100% from leading open source projects Production-ready Portable across clouds Extendable
  9. 9. 9 Target performance & reliability SLAs Throughput Up to 100,000 events per second Latency 1-60 seconds Retention Raw data and results archived for 30 days Reliability Built-in data loss mitigation mechanism in case of faults Availability 99.999 on commodity cloud infrastructure
  10. 10. 10 Conceptual architecture
  11. 11. 11 Selected stack • REST API • Message Queue • HDFS
  12. 12. 12 Every component is scalable in its own way • No single point of failure • Automatic failover • Data replication
  13. 13. 13 Multi-node Kafka cluster • Undisputed modern choice for real time MOM • Retention and replay • Scalable via partitioning • Persistent • Super-fast
  14. 14. 14 Single Zookeeper cluster for all components • Distributed coordination service, facilitates HA of other clustered services • Guaranteed consistent storage • Client monitoring • Leader election
  15. 15. 15 Spark streaming: Central component of the platform Why Spark streaming? • Leading In-Stream Processing middleware • Active community • Rate of adoption • Vendor support • Excellent integration with Hadoop eco- system Key architectural considerations • Runs on top of Hadoop • Out-of-the-box integration with Kafka • Support for machine learning pipelines
  16. 16. 16 Cassandra as operational store • Massively scalable, highly available NoSQL database • Ideal choice as large operational store (100s of GB) for streaming applications • Needed when event processing is stateful, and the state is quite large Example: stores user profiles as they are being updated in real time from clickstreams
  17. 17. 17 REDIS as a lookup database • Simple, cheap and super- performant lookup store • Needed when event processing requires frequent access to GBs of reference data • Can be updated from outside • Master/slave architecture Example: IP geo-mapping, dictionaries, training sets
  18. 18. 18 Putting all the pieces together: end-to-end platform configuration • No single points of failure • No bottlenecks • Scaling or recovering any component cluster mitigates availability issues • Caveat: pathologies do happen, even in this design – for example, dynamic repartitioning is not supported
  19. 19. 19 Out of scope • Dashboards for visualization, monitoring, and reporting of In-Stream Processing results • Tools and modeling environments for data scientists • Logging and monitoring • Data encryption, in motion or at rest • Dynamic re-partitioning is not supported and will require down-time • Disaster recovery • Infrastructure provisioning and deployment automation
  20. 20. 20 Case study: Large media agency Business opportunity • Real-time popularity trends for the content across all client’s properties drives audience coverage with the most interesting, trending articles Work done • Implementation used Amazon Kinesis & Amazon EMR/Spark Streaming stack • Data egress: S3 and Amazon Redshift
  21. 21. 21 Case study: Implementation details
  22. 22. 22 Summary • In-Stream Processing is a hot new technology • It can process mind-boggling volumes of events in real-time and discover insights • You can build a whole platform with 100% open source components • We give you a complete blueprint on how to put it together • It will run on any public cloud
  23. 23. 23 What we didn’t get to talk about today • Docker, Docker, Docker: how to make auto-deployment and auto-scaling work • Data scientist’s kitchen: what they are doing when no one is watching • Cloud sandbox for our In-Stream Processing Blueprint: how to take it for a spin on AWS • Demo app: see how social analytics is done using the blueprint • All this, and more: coming up soon in our blog (blog.griddynamics.com) • Please, subscribe!
  24. 24. 24 Thank you! Anton Ovchinnikov: aovchinnikov@griddynamics.com Grid Dynamics blog: blog.griddynamics.com Follow up on twitter: @griddynamics We are hiring! https://www.griddynamics.com/careers

×