Big Data Real Time Architectures
Lambda, Kappa motivation and practical applications
@dmarcous
Big data real time architectures:
How can big data be processed in real time?
What architectures are out there to support this paradigm?
Which one should we choose?
What advantages and pitfalls does each one have?

Published in: Data & Analytics


  1. Big Data Real Time Architectures
     Lambda, Kappa motivation and practical applications
     @dmarcous
  2. Problems
     ● Volume
     ● Variety
     ● Velocity
  3. Solutions
     ● Batch processing
     ● NoSQL
     ● Stream processing
  4. More Problems?
     ● Machines FAIL
     ● Humans make mistakes
     ● We want everything in real time!
       ○ We can’t do everything in real time :(
     ● We might think of a new way to analyse old data
     ● We might want to take a look at older versions of the raw / aggregated data
     ● Looking at raw data is cool, looking at aggregated data is cooler, looking at indexed data with ad-hoc filters is the coolest. What if we want them all on the same set of data?
  5. Batch processing
       ◦ Large amount of static data
       ◦ Scalable solution
       ◦ Volume
     Real-time processing
       ◦ Computing streaming data
       ◦ Low latency
       ◦ Velocity
     Hybrid computation
       ◦ Lambda Architecture
       ◦ Kappa Architecture
     [Figure: Big Data Timeline, marking the years 2003, 2006, 2010 and 2012 with the labels Inception, 1st Generation, 2nd Generation and 3rd Generation]
  6. Lambda Architecture
     ● Nathan Marz (Twitter)
     ● How to beat the CAP theorem
       ○ http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
     ● Concepts:
       ○ Immutable data
       ○ Everything can be re-run
       ○ Using the best tool for purpose
       ○ Query = Function(All Data)
       ○ Real time isn’t accurate, batch will fix any mistakes
     ● Layers
       ○ Batch
       ○ Speed
       ○ Serving
  7. Lambda Architecture is:
     A complementary pair of:
       - in-memory real-time processing: fast, but small and volatile
       - large HDD/SSD batch processing: slow, but large and persistent
     Proposed by Nathan Marz
  8. Lambda Pitfalls
     ● Data duplication
       ○ Columnar + Compressed
       ○ Don’t be cheap...
     ● Too many tools!
       ○ Stay on 1 platform - Hadoop/YARN
     ● Do I really need to write everything twice? (Cross DB ORM)
       ○ Frameworks
         ■ Twitter Summingbird (MR + Storm)
         ■ Apache Spark (batch / Streaming)
         ■ Google Dataflow
     ● No place for ad-hoc analysis
       ○ Add more specialised data sources
         ■ Solr / Elasticsearch
     ● Incremental Algorithms are HARD - stream process based on smart thresholds (= history)
       ○ Mix it up - Key value access during speed process
     ● A new event may be related to an old one, that might be related to an older one…
       ○ Add graph processing (GraphX / Giraph / Titan)
  9. Kappa Architecture
     ● Jay Kreps (LinkedIn)
     ● Questioning The Lambda Architecture
       ○ http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
     ● Concepts
       ○ Retain the full log of the data
       ○ Processing = new instance of the same stream
       ○ Input - choose where to start reading from the log (now, 1 day ago, 1 year ago...)
       ○ Real time is accurate!
       ○ Re-processing only when code changes
  10. Lambda Kappa
  11. Lambda vs. Kappa
      ● Common
        ○ Greek letters
        ○ Real time processing at scale
        ○ Immutable Architectures
          ■ “Replay” possible
        ○ Born out of need
        ○ Both use Materialised views / indexed results for serving
      ● Different

                                  Lambda                          Kappa
        Processing Paradigm       Batch + Streaming               Streaming
        Re-processing Paradigm    Every Batch Cycle               Only when code changes
        Reliability               Batch is reliable,              Streaming with consistency
                                  streaming is approximate        (exactly once)
        Resource Consumption      Function = Query (All data)     Incremental algorithms,
                                                                  running on deltas
  12. Tooling
      ● Data Ingestion
        ○ Kafka
        ○ Apache Flume
        ○ Samza
      ● Batch
        ○ MR (Hive, Pig etc.)
        ○ Tez
        ○ Spark
        ○ Dataflow (=Google Flume)
      ● Stream
        ○ Storm
        ○ Spark Streaming
        ○ Samza
        ○ Dataflow (=Google Flume)
        ○ Flink
      ● Serving
        ○ DBs
          ■ ElephantDB
          ■ SploutSQL
          ■ HBase / Cassandra
        ○ Queries
          ■ Impala
          ■ Presto
          ■ Big Query
  13. Users
      ● Lambdas
        ○ Twitter
        ○ Spotify (music recommendations)
        ○ Liveperson
        ○ Inneractive
      ● Kappas
        ○ LinkedIn
        ○ Yahoo
      ● Platforms
        ○ Oryx2 (Cloudera)
          ■ Lambda ML Platform using Kafka + Spark
        ○ Novelti.io (Previously Lambdoop)
          ■ Streaming intelligence for everything (mainly IoT)
  14. More Architectures
      ● Zeta Architecture
        ○ Includes cluster management
          ■ Monitoring
          ■ Scheduling
          ■ Container system etc.
        ○ Inspired by Google
      ● iot-a
        ○ Internet of Things
        ○ Layered
          ■ MQ (kafka - RT)
          ■ DB (HBase - Interactive)
          ■ DFS (Batch)
      ● Mu Architecture
        ○ Lambda with only 1 set of aggregated views?
  15. Appendix - Videos
      ● Lambda
        ○ http://www.infoq.com/interviews/marz-lambda-architecture
      ● Kappa
        ○ http://www.kappa-architecture.com/
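
The Lambda concepts on slide 6 (immutable data, Query = Function(All Data), and the batch / speed / serving layers) can be illustrated with a minimal sketch. This is not code from the deck: the event format, the page-count use case and every name below (`master_log`, `batch_view`, `SpeedLayer`, `query`) are assumptions made for illustration, and plain in-memory Python stands in for the real batch and stream engines.

```python
from collections import Counter

# Hypothetical immutable master dataset: append-only (timestamp, page) events.
master_log = []


def batch_view(events, cutoff):
    """Batch layer: recompute the view from ALL raw events before `cutoff`."""
    return Counter(page for ts, page in events if ts < cutoff)


class SpeedLayer:
    """Speed layer: incremental counts for events the batch run has not covered."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        _, page = event
        self.counts[page] += 1

    def reset(self):
        # Called once a newer batch view has absorbed these events.
        self.counts.clear()


def query(page, batch, speed):
    """Serving layer: merge the batch view with the real-time view."""
    return batch[page] + speed.counts[page]


# Events 1-2 arrive before the batch cutoff, events 3-4 after it.
events = [(1, "home"), (2, "cart"), (3, "home"), (4, "home")]
batch_cutoff = 3
speed = SpeedLayer()
for event in events:
    master_log.append(event)      # raw data is only ever appended
    if event[0] >= batch_cutoff:
        speed.on_event(event)     # recent data is handled incrementally

batch = batch_view(master_log, batch_cutoff)
home_total = query("home", batch, speed)
```

Because the master log is never mutated, a bug in `batch_view` can be fixed and the view recomputed from scratch, which is the sense in which "batch will fix any mistakes".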

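
The Kappa concepts on slide 9 (retain the full log, choose where to start reading, re-process only when the code changes) can be sketched in the same spirit. Again, this is an illustrative assumption rather than the deck's code: the `Log` class is a toy in-memory stand-in for a retained log such as a Kafka topic, and `count_v1` / `count_v2` are hypothetical versions of one streaming job.

```python
from collections import Counter


class Log:
    """Toy append-only log; records are retained and addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset=0):
        """Replay the log starting at `offset` (0 = the very beginning)."""
        return iter(self._records[offset:])


def count_v1(records):
    """First version of the job: count pages exactly as logged."""
    return Counter(r["page"] for r in records)


def count_v2(records):
    """Changed code: normalise page names before counting."""
    return Counter(r["page"].lower() for r in records)


log = Log()
for page in ["Home", "home", "cart"]:
    log.append({"page": page})

# The original job and its output view.
view_v1 = count_v1(log.read_from(0))

# The code changed, so a new instance of the same stream job replays the
# full log into a new view; the old view is dropped once this catches up.
view_v2 = count_v2(log.read_from(0))

# A consumer may also choose to start later in the log.
recent_only = count_v2(log.read_from(2))
```

There is a single code path here: re-processing is just the same job started again from an earlier offset, rather than a separate batch implementation.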