SlideShare ist ein Scribd-Unternehmen logo
1 von 74
Reactive Fast Data & the Data Lake
with Akka, Kafka, Spark
DevNexus, Feb 23, 2017
Todd Fritz
tafritz@outlook.com
• Questions, etc
www.linkedin.com/in/tfritz
http://www.slideshare.net/ToddFritz
https://github.com/todd-fritz
2
3
• Senior Solutions Architect @ Cox Automotive, Inc.
• Strategic Data Services
• The opinions contained herein may not represent my employer.
• Background: platform development, middleware, messaging, app dev
• DevOps mentality
• History of exploring, learning, assimilating new technology and styles
• Life-long learner
• Novice bass player
• Scuba diver
About Me
All presentations (including today) are available on SlideShare:
https://www.slideshare.net/ToddFritz
AJUG (Jan 17, 2017)
Building a Reactive Fast Data Lake with Akka, Kafka and Spark
Video – https://vimeo.com/200863800
DevNexus 2015
Containerizing a Monolithic Application into a Microservice-based PaaS
Great Wide Open - Atlanta (April 3, 2014)
Server to Cloud: converting a legacy platform to an open source PaaS
AJUG (April 15, 2014)
Convert a legacy platform to a micro-PaaS using Docker
Video - https://vimeo.com/94556976
Previous Presentations
4
• Preface
• Reactive Systems, Patterns, Implementations
• The Enterprise
• Fast Data
• The Data Lake: Analytics & App Dev
• Questions
• Resources
Agenda
5
Preface
“Our greatest glory is not in never failing,
but in rising every time we fall.”
-Confucius
6
• Why Reactive?
• What is Fast Data?
• What is a Data Lake?
• What is the intersection?
• Business Value?
• Architecture considerations?
• Importance of Messaging
• Use Case: A system that can scale to millions of users, billions of
messages
7
Reactive Systems, Patterns, Implementations
“People who think they know everything
are a great annoyance to those of us who do.”
- Isaac Asimov
8
• The Reactive Manifesto: http://www.reactivemanifesto.org
• Many organizations independently building software to similar
patterns
• Yesterday’s architectures just don’t cut it
• Actors (Akka, Erlang)
• Good starting point: https://www.lightbend.com/blog/architect-reactive-design-patterns
9
“…a set of design principles for creating cohesive systems.” *
“…a way of thinking about systems architecture and design in a distributed
environment…” *
“…implementation techniques, tooling, and design patterns...” * 
components of a larger whole
What is Reactive?
10* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
• “…based on an architectural style that allows … multiple individual
services to coalesce as a single unit and react to its surroundings while
remaining aware of each other…” *
• Event Driven
• Asynch & Non-Blocking
• “… scale up/down, load balance...” *
Components may qualify as reactive, but when combined 
does not guarantee a Reactive System
Reactive Systems
11* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
1. Reactive Systems
2. Reactive Programming
3. Functional Reactive Programming (FRP)
Reactive Begets*
12* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
• Subset of asynch programming
• Information drives logic flow
• Decompose into multiple, asynch, nonblocking steps
• Combine  into a composed workflow
• Reactive API libraries are either declarative or callback-based
• Reactive programming is event-driven
• Reactive systems are message driven
Reactive Programming*
13* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
• More efficient utilization of resources
• Increased performance via serialization reduction
• Productivity: Reactive libraries handle complexity
• Ideal for back-pressured, scalable, high-performance
components
Benefits of Reactive Programming*
14* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
• Reactive & Dataflow Programming
• Examples
• Futures / Promises
• (Reactive) Streams –back-pressurized
• Dataflow Variables – state change  event driven actions
• Technologies
• Reactive Streams Specification
• The standard for interoperability amongst Reactive Programming
libraries on the JVM
Dataflow Programming*
15* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
• Reactive Programming: event-driven
• Dataflow chains
• Messages not directed
• Events are observable facts
• Produced by State machine
• Listeners attach  to react
• Reactive Systems: message-driven
• Basis of communication; prefer asynch
• Decoupled
• Resilience and Elastic
• Long-lived, addressable components
• Reacts to directed messages
Event-Driven vs. Message-Driven*
16* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
• Common pattern: Events within Messages
• Example
• AWS Lambda, Pub/Sub
• Pros
• Abstraction, simplicity
• Cons
• Lose some control
• Forces developers to deal with distributed programming complexities
• Can’t hide behind “leaky” abstractions that pretend a network does not
exist
Event-Driven*
17* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
Reactive Systems: Characteristics*
18
Responsive
• Low latency
• Consistent, Predictable
• Usability
• Essential for Utility
• Fail Fast
• Happier Customers
Resilient
• Responsive through failure
• Resilient (H/A) through
replication
• Failure isolated to
Component (bulkheads)
• Delegate recovery to
external Component
(supervisor hierarchies)
• Client does not handle
failures
Elastic
• Responsive to workload
• Devoid of bottlenecks
• Perf metrics drive
predictive / reactive
autonomic scaling
• Commodity
infrastructure
Message Driven
• Asynchronous
• Loose Coupling
• Isolation
• Location Transparency
• Non Blocking
• Recipients can passivate
• Message Passing
• Flow Control
• Exception Management
• Elasticity
• Back Pressure
* Source: The Reactive Manifesto
Reactive Systems: Patterns
19
Architecture
Single Component
• Component does one thing, fully and well
• Single Responsibility Principle
• Max cohesion, min coupling
Architecture
Let-it-Crash
• Prefer a full component restart to complex internal error
handling
• Design for failure
• Leads to more reliable components
• Avoids hard to find/fix errors
• Failure is unavoidable
Implementation
Circuit Breaker
• Protect services by breaking connections during
failures
• From EE
• Protects clients from timeouts
• Allows time for service to recover
Source: https://www.lightbend.com/blog/architect-reactive-design-patterns
Implementation
Saga
• Divide long-lived, distributed transactions into
quick local ones with compensating actions for
recovery.
• Compensating txns run during saga rollback
• Concurrent sagas can see intermediate state
• Sags need to be persistent to recover from
hardware failures. Save points.
The Enterprise
“Even if you are on the right track,
you’ll get run over if you just sit there.”
- Will Rogers
20
21
• Companies characteristics matter
• Young companies (e.g. start-ups)
• Mid-size companies
• Large companies
The Enterprise
22
• Multiple software applications
• Specialization to function
• Analytics, DataMart
• Silo’d data needs to be moved
• Older application bad at interoperability
• Trend toward ”real time”
• Evolution toward Data as a Service (DaaS)
Evolving the Enterprise
23
• Real time applications & analytics
• Increased agility + productivity = speed to market
• Decoupling enables interoperability
• Data flows to system that need it
• Powerful data processing
• Reduced TCO
• Cloud friendly
How Reactive Helps the Enterprise
24
• Friends don’t let Analysts do map reduce
• Larger companies may have communities of SAS users
• Analysts are not coders
Remember the Analysts
25
26Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
All about that Data…
Fast Data
“If you are in a spaceship that is traveling at the speed of light,
and you turn on the headlights,
does anything happen?”
- Steven Wright
27
• Big Data  Fast Data
• Continuous data streams measured in time
• Old way: store  act  analyze
• Act in real-time
• Uses technology proven to scale
• Akka, Kafka, Spark, Flink
• Send to destinations
• Prefer shared-nothing clustering, in-memory storage
Fast Data
28* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
Time is Gold
29* Source: “Towards Benchmarking Modern Distributed Systems”, Grace Huang (Intel)
• Time to talk tech
• Reference architecture: Lightbend’s Fast Data platform
• Core tech stack:
• Kafka
• Spark
• Akka (and Akka-HTTP)
• Alpakka / Camel
• (AWS, Spring, others can/should be incorporated with scope)
Now, the Fun Stuff
30* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
31
Source: Lightbend, Inc.
• JMS -> AMQP –> Kafka
• Streaming platform
• Popular use cases
• Data pipelines (messaging)
• Real-time actions
• Metrics, weblogs, stream processing, event sourcing
• Commit log
• Spread across a cluster
• Record streams stored in topics
• Each record has a key, value, timestamp
• Each topic has offsets and a retention policy
Kafka 101
32
• Producer
• Consumer
• Streams
• Connector
Kafka APIs
33
• Limitations of older message providers
• Limitations of log forwarders (Scribe or Flume
• Supports Polyglot
• Robust ecosystem
• https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem
Why Use Kafka?
34
• A fast and engine for large-scale data processing
• Hadoop / YARN
• Map-Reduce – mapper, reducer, disk IO, queue, fetch resource
• Great for parallel file processing of large files
• Synchronization barrier during persistence
• Spark
• In-memory data processing
• Interactive/iterative data query
• Better supports more complex, interactive (real-time) apps
• 100x faster than Hadoop MR (in memory)
• 10x faster on disk
• Microbatching
Spark
35Source: spark.apache.org
• Combine SQL, streaming, complex analytics
• SQL
• Dataframes
• MLlib
• GraphX
• Spark Streaming
Spark
36Source: spark.apache.org
• Slow due to replication, serialization, filesystem IO
• Inefficient use cases:
• Iterative algorithms (ML, Graphs, Network analysis)
• Interactive (ad-hoc) data mining
Hadoop MR
37Source: “Spark Overview”, Lisa Hua
Spark: Clustered
38Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
Spark: Terminology
39Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
• Job
• Stage
• Task
Spark: SQL
40Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
• Spark SQL: a module for structured data querying
• Supports basic SQL & HiveQL
• A distributed query engine via JDBC/ODBC, or CLI
Spark: Dataframe, Datasets (RDDs)
41Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
• Dataframe is a distributed assembly of data into named columns
• E.g. a relational table
• Dataset was added in 1.6 to provide benefit of RDDs and Spark SQL’s
execution engine
• Build from JVM objects, then manipulate with functions
Spark Streaming
42Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
• Merge inbound data with other data
• Polyglot: Scala, Python, Java
• Lower level access via DStream (via StreamingContext)
• Create RDD’s from the DStream
• Two primary metrics to monitor and tune
• Kyro
Spark Streaming
43Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
44Source: “Akka Actor Introduction”, Gene
Akka
45Source: “Introducing Akka”, Jonas Bonér
• Simplifies distributed programming
• Concurrency, Scalability, Fault-Tolerance
• A single, unified model
• Manages System Overload
• Elastic and dynamic
• Scale up & out
• Program to a higher level
• Not threadbound
Akka: Failure Management
46Source: “Introducing Akka”, Jonas Bonér
In Java/C/C+ etc.
• Single thread of control
• If that thread blows up you are screwed
• Requires explicit error handling within your single thread
• Errors isolated to your thread; the others have no clue!
Akka: Perfect for the Cloud
47Source: “Introducing Akka”, Jonas Bonér
• Elastic and dynamic
• Fault tolerant & self healing (autonomic)
• Adaptive load-balancing, cluster rebalancing, Actor migration
• Build loosely coupled systems that dynamically adapt at runtime
Akka 101
48Source: “Introducing Akka”, Jonas Bonér
• Unit of code = Actor
• Actors have been around since 1973 (Erlang, telco); 9 nines of uptime!
• About Actors
About Actors
49Source: “Introducing Akka”, Jonas Bonér
• An actor is an alternative to …
• Fundamental unit of computation that embodies:
1. Processing
2. Storage
3. Communication
• Axioms – When an actor receives a message it can:
1. Create new Actors
2. Send messages to Actors it knows
3. Designate how it should handle the next message it receives
Akka: Performance
50Source: “Introducing Akka”, Jonas Bonér
+50 million messages per second !!!
Akka: Core Actor Operations
51Source: “Introducing Akka”, Jonas Bonér
0. Define
1. Create
2. Send
3. Become
4. Supervise
Akka: Define Operation
52Source: “Introducing Akka”, Jonas Bonér
• Define the Actor
• Define the message (class) the actor should respond to
Akka: Create Operation
53Source: “Introducing Akka”, Jonas Bonér
• Yes, creates a new actor.
• Lightweight: 2.6M per Gb RAM
• Strong encapsulation
Akka: Send Operation
54Source: “Introducing Akka”, Jonas Bonér
• Sends a message to an Actor
• Everything is non-blocking
• Everything is Reactive
• Actor passivates until receiving a message  then awakens
• Messages are energy
This is not an
endorsement…
Akka: Become Operation
55Source: “Introducing Akka”, Jonas Bonér
Dynamically redefine an actor’s behavior
• Reactively triggered by receipt of a message
• Will not react differently to messages it receives
• Behaviors are stacked – can by pushed and popped…
Akka: Become Operation…
56Source: “Introducing Akka”, Jonas Bonér
Why?
• A busy actor can become an Actor Pool or Router!
• Implement FSM (Finite State Machine)
• Implement graceful degradation
• Spawn empty pool that can ”Become” something else
• Very useful. Limited only by your imagination.
Akka: Supervise Operation
57Source: “Introducing Akka”, Jonas Bonér
• Manage another Actor’s failures
• A Supervisor monitors, then responds
• Supervisor is notified of failures
• Separates processing from error handling
Akka: Remote Deployment
58Source: “Introducing Akka”, Jonas Bonér
Akka Streams
59Source: Akka Documentation
• Akka Streams API is decoupled from Reactive Streams interfaces
• Interoperable with any conformant Reactive Streams implementation
• Principles
• All features explicit in the API
• Compositionality; combined pieces retain the function of each part
• Model of domain of distributed bounded stream processing
• Reactive Streams  JDK9 Flow API
Akka Streams
60Source: Akka Documentation
• Immutable building blocks
• Source
• Sink
• Flow
• BidiFlow
• Graph
• Built in backpressure capability
• Difference between error & failure
• Error: accessible within the stream, as a data element
• Failure: stream itself has collapsed
Akka HTTP
61Source: Akka Documentation
• Built atop Akka Streams
• Expose an incoming connection in form of a Source instance
• Backpressurized
• Best practices:
• Libraries shall provide their users with reusable pieces
• Avoid destruction of compositionality.
• Libraries may provide facilities that consume and materialize graphs
• Including convenience “sugar” for use cases
• Adds interesting capabilities to Akka Streams
• Akka alternative to Apache Camel (EIP)
• Community driven, focused on connectors to external libraries,
integration patterns and conversions.
• https://github.com/akka/alpakka/releases/tag/v0.1
• Akka Streams Integration
• https://github.com/akka/alpakka
• http://developer.lightbend.com/docs/alpakka/current/
Alpakka 101
62
The Data Lake for Analytics, App Dev
“Lake Wobegon, the little town that time forgot
and the decades cannot improve.”
- Garrison Keillor
63
64
AWS re:Invent…
Burning Man for nerds?
The Data Lake
65
• The “Old” way
• Let’s combine data (de-silo) to increase sharing & usage
• Cost reduction through centralized consolidation
• Data Vacuum
• Forces complexity onto consumers
• Data acted on after storage
• Current thinking
• The old way does not work
• Need governance and other functions to reduce complexity
• 3 V’s matter: Volume, Variety AND Velocity
• Able to act on data as it occurs (Fast Data)
• Following slides elaborate & map to techs (verbal)
The Data Lake
66
• Fast Data fills the Data Lake
• Massive & easily accessible
• Built on commodity hardware (or now, “the Cloud”)
• Data not stored in a way that is optimized for data analysis
• Retains all attributes
• Awareness of Data Lake fallacy: http://www.gartner.com/newsroom/id/2809117
Dark Side of the Old Ways
67Source: “Gartner Says Beware of the Data Lake Fallacy”
• Vacuum data + use later = SWAMP
• Redundant tooling & low interoperability
• Blissful ignorant: how/why data is used, governed, defined and secured
• Remove silos? Great, so now it’s comingled.
• Inability to measure data quality (or fix)
• Accepts data without governance or oversight
• Accepts any data without metadata (description)
• Inability to share lineage of findings by other analysis to share found value
• Security, access control (and tracking of both)
• Data Ownership, Entitlements?
• Tenancy?
• Compliance? Audits?
• What to do?
68
Disrupt
69
Not all Analysts enjoy going “Bogging”…
70
Blueprint: Light at End of Tunnel
https://knowledgent.com/whitepaper/design-successful-data-lake/
Success Highlights
71Source: “Gartner Says Beware of the Data Lake Fallacy”
• Data Lake should enable analysis of data stored in the lake.
• Requires features to secure, curate, run analytics, visualization, reporting
• Use multiple tools and products – no product (today) does it all
• Domain specific – tailored to your industry
• Metadata management – without this you get a swamp
• Configurable ingestion workflows
• Interoperate with Existing environment
• Timely
• Flexible
• Quality
• Findable
72
Questions?
“I'm sorry, if you were right, I'd agree with you.”
- Robin Williams
73
• The Reactive Manifesto (http://www.reactivemanifesto.org/)
• Chaos Monkey? Use Linear Fault Driven Testing instead.
• https://www.lightbend.com/blog/architect-reactive-design-patterns
• http://www.infoworld.com/article/2608040/big-data/fast-data--the-next-step-after-big-data.html
• https://www.lightbend.com/blog/lessons-learned-from-paypal-implementing-back-pressure-with-akka-streams-and-kafka
• https://kafka.apache.org
• http://www.slideshare.net/ducasf/introduction-to-kafka
• http://www.slideshare.net/SparkSummit/grace-huang-49762421
• http://www.slideshare.net/HadoopSummit/performance-comparison-of-streaming-big-data-platforms
• https://github.com/akka/alpakka
• http://developer.lightbend.com/docs/alpakka/current/
• https://github.com/akka/alpakka/releases/tag/v0.1
• http://www.slideshare.net/LisaHua/spark-overview-37479609
• http://spark.apache.org/
• https://www.realdbamagic.com/intro-to-apache-spark-2016-slides/
• http://www.slideshare.net/gene7299/akka-actor-presentation
• http://www.slideshare.net/jboner/introducing-akka
• http://bit.ly/hewitt-on-actors
• http://tech.measurence.com/2016/06/01/a-dive-into-akka-streams.html
• https://infocus.emc.com/rachel_haines/is-the-data-lake-the-best-architecture-to-support-big-data/
• http://www.gartner.com/newsroom/id/2809117
• https://knowledgent.com/whitepaper/design-successful-data-lake/
Resources
74

Weitere ähnliche Inhalte

Was ist angesagt?

Going Reactive in the Land of No
Going Reactive in the Land of NoGoing Reactive in the Land of No
Going Reactive in the Land of NoLightbend
 
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data PipelinesETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelinesconfluent
 
dotScale 2017 Keynote: The Rise of Real Time by Neha Narkhede
dotScale 2017 Keynote: The Rise of Real Time by Neha NarkhededotScale 2017 Keynote: The Rise of Real Time by Neha Narkhede
dotScale 2017 Keynote: The Rise of Real Time by Neha Narkhedeconfluent
 
War Stories: DIY Kafka
War Stories: DIY KafkaWar Stories: DIY Kafka
War Stories: DIY Kafkaconfluent
 
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...Hakka Labs
 
Nine Neins - where Java EE will never take you
Nine Neins - where Java EE will never take youNine Neins - where Java EE will never take you
Nine Neins - where Java EE will never take youMarkus Eisele
 
Internet of Things and Multi-model Data Infrastructure
Internet of Things and Multi-model Data InfrastructureInternet of Things and Multi-model Data Infrastructure
Internet of Things and Multi-model Data InfrastructureSingleStore
 
Leveraging services in stream processor apps at Ticketmaster (Derek Cline, Ti...
Leveraging services in stream processor apps at Ticketmaster (Derek Cline, Ti...Leveraging services in stream processor apps at Ticketmaster (Derek Cline, Ti...
Leveraging services in stream processor apps at Ticketmaster (Derek Cline, Ti...confluent
 
Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...
Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...
Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...HostedbyConfluent
 
Full Stack Reactive In Practice
Full Stack Reactive In PracticeFull Stack Reactive In Practice
Full Stack Reactive In PracticeLightbend
 
DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ...
DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ...DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ...
DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ...HostedbyConfluent
 
Fan-out, fan-in & the multiplexer: Replication recipes for global platform di...
Fan-out, fan-in & the multiplexer: Replication recipes for global platform di...Fan-out, fan-in & the multiplexer: Replication recipes for global platform di...
Fan-out, fan-in & the multiplexer: Replication recipes for global platform di...HostedbyConfluent
 
Machine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data StreamsMachine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data StreamsLightbend
 
Apache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformApache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformconfluent
 
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...confluent
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...confluent
 
Lightbend Fast Data Platform
Lightbend Fast Data PlatformLightbend Fast Data Platform
Lightbend Fast Data PlatformLightbend
 
Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming PlatformDr. Mirko Kämpf
 

Was ist angesagt? (20)

Going Reactive in the Land of No
Going Reactive in the Land of NoGoing Reactive in the Land of No
Going Reactive in the Land of No
 
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data PipelinesETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
 
dotScale 2017 Keynote: The Rise of Real Time by Neha Narkhede
dotScale 2017 Keynote: The Rise of Real Time by Neha NarkhededotScale 2017 Keynote: The Rise of Real Time by Neha Narkhede
dotScale 2017 Keynote: The Rise of Real Time by Neha Narkhede
 
War Stories: DIY Kafka
War Stories: DIY KafkaWar Stories: DIY Kafka
War Stories: DIY Kafka
 
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
 
Nine Neins - where Java EE will never take you
Nine Neins - where Java EE will never take youNine Neins - where Java EE will never take you
Nine Neins - where Java EE will never take you
 
Event Driven Architecture
Event Driven ArchitectureEvent Driven Architecture
Event Driven Architecture
 
Internet of Things and Multi-model Data Infrastructure
Internet of Things and Multi-model Data InfrastructureInternet of Things and Multi-model Data Infrastructure
Internet of Things and Multi-model Data Infrastructure
 
Leveraging services in stream processor apps at Ticketmaster (Derek Cline, Ti...
Leveraging services in stream processor apps at Ticketmaster (Derek Cline, Ti...Leveraging services in stream processor apps at Ticketmaster (Derek Cline, Ti...
Leveraging services in stream processor apps at Ticketmaster (Derek Cline, Ti...
 
Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...
Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...
Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...
 
Full Stack Reactive In Practice
Full Stack Reactive In PracticeFull Stack Reactive In Practice
Full Stack Reactive In Practice
 
DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ...
DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ...DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ...
DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ...
 
Fan-out, fan-in & the multiplexer: Replication recipes for global platform di...
Fan-out, fan-in & the multiplexer: Replication recipes for global platform di...Fan-out, fan-in & the multiplexer: Replication recipes for global platform di...
Fan-out, fan-in & the multiplexer: Replication recipes for global platform di...
 
Machine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data StreamsMachine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data Streams
 
Apache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformApache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platform
 
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
 
Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 
Lightbend Fast Data Platform
Lightbend Fast Data PlatformLightbend Fast Data Platform
Lightbend Fast Data Platform
 
Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming Platform
 

Ähnlich wie Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, SparkBuilding Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, SparkTodd Fritz
 
Introduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OSIntroduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OSSteve Wong
 
Cloud 2.0: Containers, Microservices and Cloud Hybridization
Cloud 2.0: Containers, Microservices and Cloud HybridizationCloud 2.0: Containers, Microservices and Cloud Hybridization
Cloud 2.0: Containers, Microservices and Cloud HybridizationMark Hinkle
 
OpenStack Block Storage 101
OpenStack Block Storage 101OpenStack Block Storage 101
OpenStack Block Storage 101NetApp
 
Unraveling OpenStack Clouds
 Unraveling OpenStack Clouds Unraveling OpenStack Clouds
Unraveling OpenStack CloudsIndicThreads
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven productsLars Albertsson
 
Clould Computing and its application in Libraries
Clould Computing and its application in LibrariesClould Computing and its application in Libraries
Clould Computing and its application in LibrariesAmit Shaw
 
Give your microservices a bus ride with MassTransit
Give your microservices a bus ride with MassTransitGive your microservices a bus ride with MassTransit
Give your microservices a bus ride with MassTransitAlexey Zimarev
 
Software Architectures, Week 5 - Advanced Architectures
Software Architectures, Week 5 - Advanced ArchitecturesSoftware Architectures, Week 5 - Advanced Architectures
Software Architectures, Week 5 - Advanced ArchitecturesAngelos Kapsimanis
 
Microservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SREMicroservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SREAraf Karsh Hamid
 
Cloud Foundry and OpenStack – Marriage Made in Heaven !
Cloud Foundry and OpenStack – Marriage Made in Heaven !Cloud Foundry and OpenStack – Marriage Made in Heaven !
Cloud Foundry and OpenStack – Marriage Made in Heaven ! Animesh Singh
 
Database@Home : Data Driven Apps - Data-driven Microservices Architecture wit...
Database@Home : Data Driven Apps - Data-driven Microservices Architecture wit...Database@Home : Data Driven Apps - Data-driven Microservices Architecture wit...
Database@Home : Data Driven Apps - Data-driven Microservices Architecture wit...Tammy Bednar
 
The DIY Punk Rock DevOps Playbook
The DIY Punk Rock DevOps PlaybookThe DIY Punk Rock DevOps Playbook
The DIY Punk Rock DevOps Playbookbcantrill
 
2022-Devnexus-StatefulMicroservices.pptx.pdf
2022-Devnexus-StatefulMicroservices.pptx.pdf2022-Devnexus-StatefulMicroservices.pptx.pdf
2022-Devnexus-StatefulMicroservices.pptx.pdfGrace Jansen
 
170215 msa intro
170215 msa intro170215 msa intro
170215 msa introSonic leigh
 
Closer Look at Cloud Centric Architectures
Closer Look at Cloud Centric ArchitecturesCloser Look at Cloud Centric Architectures
Closer Look at Cloud Centric ArchitecturesTodd Kaplinger
 
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...VMware Tanzu
 
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...VMware Tanzu
 
An Open and Collaborative Ecosystem for IoT
An Open and Collaborative Ecosystem for IoTAn Open and Collaborative Ecosystem for IoT
An Open and Collaborative Ecosystem for IoTCharles Eckel
 
Building FoundationDB
Building FoundationDBBuilding FoundationDB
Building FoundationDBFoundationDB
 

Ähnlich wie Reactive Fast Data & the Data Lake with Akka, Kafka, Spark (20)

Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, SparkBuilding Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
 
Introduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OSIntroduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OS
 
Cloud 2.0: Containers, Microservices and Cloud Hybridization
Cloud 2.0: Containers, Microservices and Cloud HybridizationCloud 2.0: Containers, Microservices and Cloud Hybridization
Cloud 2.0: Containers, Microservices and Cloud Hybridization
 
OpenStack Block Storage 101
OpenStack Block Storage 101OpenStack Block Storage 101
OpenStack Block Storage 101
 
Unraveling OpenStack Clouds
 Unraveling OpenStack Clouds Unraveling OpenStack Clouds
Unraveling OpenStack Clouds
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
Clould Computing and its application in Libraries
Clould Computing and its application in LibrariesClould Computing and its application in Libraries
Clould Computing and its application in Libraries
 
Give your microservices a bus ride with MassTransit
Give your microservices a bus ride with MassTransitGive your microservices a bus ride with MassTransit
Give your microservices a bus ride with MassTransit
 
Software Architectures, Week 5 - Advanced Architectures
Software Architectures, Week 5 - Advanced ArchitecturesSoftware Architectures, Week 5 - Advanced Architectures
Software Architectures, Week 5 - Advanced Architectures
 
Microservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SREMicroservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SRE
 
Cloud Foundry and OpenStack – Marriage Made in Heaven !
Cloud Foundry and OpenStack – Marriage Made in Heaven !Cloud Foundry and OpenStack – Marriage Made in Heaven !
Cloud Foundry and OpenStack – Marriage Made in Heaven !
 
Database@Home : Data Driven Apps - Data-driven Microservices Architecture wit...
Database@Home : Data Driven Apps - Data-driven Microservices Architecture wit...Database@Home : Data Driven Apps - Data-driven Microservices Architecture wit...
Database@Home : Data Driven Apps - Data-driven Microservices Architecture wit...
 
The DIY Punk Rock DevOps Playbook
The DIY Punk Rock DevOps PlaybookThe DIY Punk Rock DevOps Playbook
The DIY Punk Rock DevOps Playbook
 
2022-Devnexus-StatefulMicroservices.pptx.pdf
2022-Devnexus-StatefulMicroservices.pptx.pdf2022-Devnexus-StatefulMicroservices.pptx.pdf
2022-Devnexus-StatefulMicroservices.pptx.pdf
 
170215 msa intro
170215 msa intro170215 msa intro
170215 msa intro
 
Closer Look at Cloud Centric Architectures
Closer Look at Cloud Centric ArchitecturesCloser Look at Cloud Centric Architectures
Closer Look at Cloud Centric Architectures
 
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
 
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
 
An Open and Collaborative Ecosystem for IoT
An Open and Collaborative Ecosystem for IoTAn Open and Collaborative Ecosystem for IoT
An Open and Collaborative Ecosystem for IoT
 
Building FoundationDB
Building FoundationDBBuilding FoundationDB
Building FoundationDB
 

Kürzlich hochgeladen

Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noidabntitsolutionsrishis
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 

Kürzlich hochgeladen (20)

Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 

Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

  • 1. Reactive Fast Data & the Data Lake with Akka, Kafka, Spark DevNexus, Feb 23, 2017 Todd Fritz
  • 3. 3 • Senior Solutions Architect @ Cox Automotive, Inc. • Strategic Data Services • The opinions contained herein may not represent my employer. • Background: platform development, middleware, messaging, app dev • DevOps mentality • History of exploring, learning, assimilating new technology and styles • Life-long learner • Novice bass player • Scuba diver About Me
  • 4. All presentations (including today) are available on SlideShare: https://www.slideshare.net/ToddFritz AJUG (Jan 17, 2017) Building a Reactive Fast Data Lake with Akka, Kafka and Spark Video – https://vimeo.com/200863800 DevNexus 2015 Containerizing a Monolithic Application into a Microservice-based PaaS Great Wide Open - Atlanta (April 3, 2014) Server to Cloud: converting a legacy platform to an open source PaaS AJUG (April 15, 2014) Convert a legacy platform to a micro-PaaS using Docker Video - https://vimeo.com/94556976 Previous Presentations 4
  • 5. • Preface • Reactive Systems, Patterns, Implementations • The Enterprise • Fast Data • The Data Lake: Analytics & App Dev • Questions • Resources Agenda 5
  • 6. Preface “Our greatest glory is not in never failing, but in rising every time we fall.” -Confucius 6
  • 7. • Why Reactive? • What is Fast Data? • What is a Data Lake? • What is the intersection? • Business Value? • Architecture considerations? • Importance of Messaging • Use Case: A system that can scale to millions of users, billions of messages 7
  • 8. Reactive Systems, Patterns, Implementations “People who think they know everything are a great annoyance to those of us who do.” - Isaac Asimov 8
  • 9. • The Reactive Manifesto: http://www.reactivemanifesto.org • Many organizations independently building software to similar patterns • Yesterday’s architectures just don’t cut it • Actors (Akka, Erlang) • Good starting point: https://www.lightbend.com/blog/architect-reactive-design-patterns 9
  • 10. “…a set of design principles for creating cohesive systems.” * “…a way of thinking about systems architecture and design in a distributed environment…” * “…implementation techniques, tooling, and design patterns...” *  components of a larger whole What is Reactive? 10* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  • 11. • “…based on an architectural style that allows … multiple individual services to coalesce as a single unit and react to its surroundings while remaining aware of each other…” * • Event Driven • Asynch & Non-Blocking • “… scale up/down, load balance...” * Components may qualify as reactive, but when combined  does not guarantee a Reactive System Reactive Systems 11* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  • 12. 1. Reactive Systems 2. Reactive Programming 3. Functional Reactive Programming (FRP) Reactive Begets* 12* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  • 13. • Subset of asynch programming • Information drives logic flow • Decompose into multiple, asynch, nonblocking steps • Combine  into a composed workflow • Reactive API libraries are either declarative or callback-based • Reactive programming is event-driven • Reactive systems are message driven Reactive Programming* 13* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  • 14. • More efficient utilization of resources • Increased performance via serialization reduction • Productivity: Reactive libraries handle complexity • Ideal for back-pressured, scalable, high-performance components Benefits of Reactive Programming* 14* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  • 15. • Reactive & Dataflow Programming • Examples • Futures / Promises • (Reactive) Streams –back-pressurized • Dataflow Variables – state change  event driven actions • Technologies • Reactive Streams Specification • The standard for interoperability amongst Reactive Programming libraries on the JVM Dataflow Programming* 15* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  • 16. • Reactive Programming: event-driven • Dataflow chains • Messages not directed • Events are observable facts • Produced by State machine • Listeners attach  to react • Reactive Systems: message-driven • Basis of communication; prefer asynch • Decoupled • Resilience and Elastic • Long-lived, addressable components • Reacts to directed messages Event-Driven vs. Message-Driven* 16* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  • 17. • Common pattern: Events within Messages • Example • AWS Lambda, Pub/Sub • Pros • Abstraction, simplicity • Cons • Lose some control • Forces developers to deal with distributed programming complexities • Can’t hide behind “leaky” abstractions that pretend a network does not exist Event-Driven* 17* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  • 18. Reactive Systems: Characteristics* 18 Responsive • Low latency • Consistent, Predictable • Usability • Essential for Utility • Fail Fast • Happier Customers Resilient • Responsive through failure • Resilient (H/A) through replication • Failure isolated to Component (bulkheads) • Delegate recovery to external Component (supervisor hierarchies) • Client does not handle failures Elastic • Responsive to workload • Devoid of bottlenecks • Perf metrics drive predictive / reactive autonomic scaling • Commodity infrastructure Message Driven • Asynchronous • Loose Coupling • Isolation • Location Transparency • Non Blocking • Recipients can passivate • Message Passing • Flow Control • Exception Management • Elasticity • Back Pressure * Source: The Reactive Manifesto
  • 19. Reactive Systems: Patterns 19 Architecture Single Component • Component does one thing, fully and well • Single Responsibility Principle • Max cohesion, min coupling Architecture Let-it-Crash • Prefer a full component restart to complex internal error handling • Design for failure • Leads to more reliable components • Avoids hard to find/fix errors • Failure is unavoidable Implementation Circuit Breaker • Protect services by breaking connections during failures • From EE • Protects clients from timeouts • Allows time for service to recover Source: https://www.lightbend.com/blog/architect-reactive-design-patterns Implementation Saga • Divide long-lived, distributed transactions into quick local ones with compensating actions for recovery. • Compensating txns run during saga rollback • Concurrent sagas can see intermediate state • Sags need to be persistent to recover from hardware failures. Save points.
  • 20. The Enterprise “Even if you are on the right track, you’ll get run over if you just sit there.” - Will Rogers 20
  • 21. 21
  • 22. • Companies characteristics matter • Young companies (e.g. start-ups) • Mid-size companies • Large companies The Enterprise 22
  • 23. • Multiple software applications • Specialization to function • Analytics, DataMart • Silo’d data needs to be moved • Older application bad at interoperability • Trend toward ”real time” • Evolution toward Data as a Service (DaaS) Evolving the Enterprise 23
  • 24. • Real time applications & analytics • Increased agility + productivity = speed to market • Decoupling enables interoperability • Data flows to system that need it • Powerful data processing • Reduced TCO • Cloud friendly How Reactive Helps the Enterprise 24
  • 25. • Friends don’t let Analysts do map reduce • Larger companies may have communities of SAS users • Analysts are not coders Remember the Analysts 25
  • 26. 26Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam All about that Data…
  • 27. Fast Data “If you are in a spaceship that is traveling at the speed of light, and you turn on the headlights, does anything happen?” - Steven Wright 27
  • 28. • Big Data  Fast Data • Continuous data streams measured in time • Old way: store  act  analyze • Act in real-time • Uses technology proven to scale • Akka, Kafka, Spark, Flink • Send to destinations • Prefer shared-nothing clustering, in-memory storage Fast Data 28* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  • 29. Time is Gold 29* Source: “Towards Benchmarking Modern Distributed Systems”, Grace Huang (Intel)
  • 30. • Time to talk tech • Reference architecture: Lightbend’s Fast Data platform • Core tech stack: • Kafka • Spark • Akka (and Akka-HTTP) • Alpakka / Camel • (AWS, Spring, others can/should be incorporated with scope) Now, the Fun Stuff 30* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  • 32. • JMS -> AMQP –> Kafka • Streaming platform • Popular use cases • Data pipelines (messaging) • Real-time actions • Metrics, weblogs, stream processing, event sourcing • Commit log • Spread across a cluster • Record streams stored in topics • Each record has a key, value, timestamp • Each topic has offsets and a retention policy Kafka 101 32
  • 33. • Producer • Consumer • Streams • Connector Kafka APIs 33
  • 34. • Limitations of older message providers • Limitations of log forwarders (Scribe or Flume • Supports Polyglot • Robust ecosystem • https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem Why Use Kafka? 34
  • 35. • A fast and engine for large-scale data processing • Hadoop / YARN • Map-Reduce – mapper, reducer, disk IO, queue, fetch resource • Great for parallel file processing of large files • Synchronization barrier during persistence • Spark • In-memory data processing • Interactive/iterative data query • Better supports more complex, interactive (real-time) apps • 100x faster than Hadoop MR (in memory) • 10x faster on disk • Microbatching Spark 35Source: spark.apache.org
  • 36. • Combine SQL, streaming, complex analytics • SQL • Dataframes • MLlib • GraphX • Spark Streaming Spark 36Source: spark.apache.org
  • 37. • Slow due to replication, serialization, filesystem IO • Inefficient use cases: • Iterative algorithms (ML, Graphs, Network analysis) • Interactive (ad-hoc) data mining Hadoop MR 37Source: “Spark Overview”, Lisa Hua
  • 38. Spark: Clustered 38Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
  • 39. Spark: Terminology 39Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam • Job • Stage • Task
  • 40. Spark: SQL 40Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam • Spark SQL: a module for structured data querying • Supports basic SQL & HiveQL • A distributed query engine via JDBC/ODBC, or CLI
  • 41. Spark: Dataframe, Datasets (RDDs) 41Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam • Dataframe is a distributed assembly of data into named columns • E.g. a relational table • Dataset was added in 1.6 to provide benefit of RDDs and Spark SQL’s execution engine • Build from JVM objects, then manipulate with functions
  • 42. Spark Streaming 42Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
  • 43. • Merge inbound data with other data • Polyglot: Scala, Python, Java • Lower level access via DStream (via StreamingContext) • Create RDD’s from the DStream • Two primary metrics to monitor and tune • Kyro Spark Streaming 43Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
  • 44. 44Source: “Akka Actor Introduction”, Gene
  • 45. Akka 45Source: “Introducing Akka”, Jonas Bonér • Simplifies distributed programming • Concurrency, Scalability, Fault-Tolerance • A single, unified model • Manages System Overload • Elastic and dynamic • Scale up & out • Program to a higher level • Not threadbound
  • 46. Akka: Failure Management 46Source: “Introducing Akka”, Jonas Bonér In Java/C/C+ etc. • Single thread of control • If that thread blows up you are screwed • Requires explicit error handling within your single thread • Errors isolated to your thread; the others have no clue!
  • 47. Akka: Perfect for the Cloud 47Source: “Introducing Akka”, Jonas Bonér • Elastic and dynamic • Fault tolerant & self healing (autonomic) • Adaptive load-balancing, cluster rebalancing, Actor migration • Build loosely coupled systems that dynamically adapt at runtime
  • 48. Akka 101 48Source: “Introducing Akka”, Jonas Bonér • Unit of code = Actor • Actors have been around since 1973 (Erlang, telco); 9 nines of uptime! • About Actors
  • 49. About Actors 49Source: “Introducing Akka”, Jonas Bonér • An actor is an alternative to … • Fundamental unit of computation that embodies: 1. Processing 2. Storage 3. Communication • Axioms – When an actor receives a message it can: 1. Create new Actors 2. Send messages to Actors it knows 3. Designate how it should handle the next message it receives
  • 50. Akka: Performance 50Source: “Introducing Akka”, Jonas Bonér +50 million messages per second !!!
  • 51. Akka: Core Actor Operations 51Source: “Introducing Akka”, Jonas Bonér 0. Define 1. Create 2. Send 3. Become 4. Supervise
  • 52. Akka: Define Operation 52Source: “Introducing Akka”, Jonas Bonér • Define the Actor • Define the message (class) the actor should respond to
  • 53. Akka: Create Operation 53Source: “Introducing Akka”, Jonas Bonér • Yes, creates a new actor. • Lightweight: 2.6M per Gb RAM • Strong encapsulation
  • 54. Akka: Send Operation 54Source: “Introducing Akka”, Jonas Bonér • Sends a message to an Actor • Everything is non-blocking • Everything is Reactive • Actor passivates until receiving a message  then awakens • Messages are energy This is not an endorsement…
  • 55. Akka: Become Operation 55Source: “Introducing Akka”, Jonas Bonér Dynamically redefine an actor’s behavior • Reactively triggered by receipt of a message • Will not react differently to messages it receives • Behaviors are stacked – can by pushed and popped…
  • 56. Akka: Become Operation… 56Source: “Introducing Akka”, Jonas Bonér Why? • A busy actor can become an Actor Pool or Router! • Implement FSM (Finite State Machine) • Implement graceful degradation • Spawn empty pool that can ”Become” something else • Very useful. Limited only by your imagination.
  • 57. Akka: Supervise Operation 57Source: “Introducing Akka”, Jonas Bonér • Manage another Actor’s failures • A Supervisor monitors, then responds • Supervisor is notified of failures • Separates processing from error handling
  • 58. Akka: Remote Deployment 58Source: “Introducing Akka”, Jonas Bonér
  • 59. Akka Streams 59Source: Akka Documentation • Akka Streams API is decoupled from Reactive Streams interfaces • Interoperable with any conformant Reactive Streams implementation • Principles • All features explicit in the API • Compositionality; combined pieces retain the function of each part • Model of domain of distributed bounded stream processing • Reactive Streams  JDK9 Flow API
  • 60. Akka Streams 60Source: Akka Documentation • Immutable building blocks • Source • Sink • Flow • BidiFlow • Graph • Built in backpressure capability • Difference between error & failure • Error: accessible within the stream, as a data element • Failure: stream itself has collapsed
  • 61. Akka HTTP 61Source: Akka Documentation • Built atop Akka Streams • Expose an incoming connection in form of a Source instance • Backpressurized • Best practices: • Libraries shall provide their users with reusable pieces • Avoid destruction of compositionality. • Libraries may provide facilities that consume and materialize graphs • Including convenience “sugar” for use cases
  • 62. • Adds interesting capabilities to Akka Streams • Akka alternative to Apache Camel (EIP) • Community driven, focused on connectors to external libraries, integration patterns and conversions. • https://github.com/akka/alpakka/releases/tag/v0.1 • Akka Streams Integration • https://github.com/akka/alpakka • http://developer.lightbend.com/docs/alpakka/current/ Alpakka 101 62
  • 63. The Data Lake for Analytics, App Dev “Lake Wobegon, the little town that time forgot and the decades cannot improve.” - Garrison Keillor 63
  • 65. The Data Lake 65 • The “Old” way • Let’s combine data (de-silo) to increase sharing & usage • Cost reduction through centralized consolidation • Data Vacuum • Forces complexity onto consumers • Data acted on after storage • Current thinking • The old way does not work • Need governance and other functions to reduce complexity • 3 V’s matter: Volume, Variety AND Velocity • Able to act on data as it occurs (Fast Data) • Following slides elaborate & map to techs (verbal)
  • 66. The Data Lake 66 • Fast Data fills the Data Lake • Massive & easily accessible • Built on commodity hardware (or now, “the Cloud”) • Data not stored in a way that is optimized for data analysis • Retains all attributes • Awareness of Data Lake fallacy: http://www.gartner.com/newsroom/id/2809117
  • 67. Dark Side of the Old Ways 67Source: “Gartner Says Beware of the Data Lake Fallacy” • Vacuum data + use later = SWAMP • Redundant tooling & low interoperability • Blissful ignorant: how/why data is used, governed, defined and secured • Remove silos? Great, so now it’s comingled. • Inability to measure data quality (or fix) • Accepts data without governance or oversight • Accepts any data without metadata (description) • Inability to share lineage of findings by other analysis to share found value • Security, access control (and tracking of both) • Data Ownership, Entitlements? • Tenancy? • Compliance? Audits? • What to do?
  • 68. 68
  • 69. Disrupt 69 Not all Analysts enjoy going “Bogging”…
  • 70. 70 Blueprint: Light at End of Tunnel https://knowledgent.com/whitepaper/design-successful-data-lake/
  • 71. Success Highlights 71Source: “Gartner Says Beware of the Data Lake Fallacy” • Data Lake should enable analysis of data stored in the lake. • Requires features to secure, curate, run analytics, visualization, reporting • Use multiple tools and products – no product (today) does it all • Domain specific – tailored to your industry • Metadata management – without this you get a swamp • Configurable ingestion workflows • Interoperate with Existing environment • Timely • Flexible • Quality • Findable
  • 72. 72
  • 73. Questions? “I'm sorry, if you were right, I'd agree with you.” - Robin Williams 73
  • 74. • The Reactive Manifesto (http://www.reactivemanifesto.org/) • Chaos Monkey? Use Linear Fault Driven Testing instead. • https://www.lightbend.com/blog/architect-reactive-design-patterns • http://www.infoworld.com/article/2608040/big-data/fast-data--the-next-step-after-big-data.html • https://www.lightbend.com/blog/lessons-learned-from-paypal-implementing-back-pressure-with-akka-streams-and-kafka • https://kafka.apache.org • http://www.slideshare.net/ducasf/introduction-to-kafka • http://www.slideshare.net/SparkSummit/grace-huang-49762421 • http://www.slideshare.net/HadoopSummit/performance-comparison-of-streaming-big-data-platforms • https://github.com/akka/alpakka • http://developer.lightbend.com/docs/alpakka/current/ • https://github.com/akka/alpakka/releases/tag/v0.1 • http://www.slideshare.net/LisaHua/spark-overview-37479609 • http://spark.apache.org/ • https://www.realdbamagic.com/intro-to-apache-spark-2016-slides/ • http://www.slideshare.net/gene7299/akka-actor-presentation • http://www.slideshare.net/jboner/introducing-akka • http://bit.ly/hewitt-on-actors • http://tech.measurence.com/2016/06/01/a-dive-into-akka-streams.html • https://infocus.emc.com/rachel_haines/is-the-data-lake-the-best-architecture-to-support-big-data/ • http://www.gartner.com/newsroom/id/2809117 • https://knowledgent.com/whitepaper/design-successful-data-lake/ Resources 74

Hinweis der Redaktion

  1. Background: Reactive Systems, Patterns, Implementations
  2. Why Reactive? Not new, concepts go back 50 years (Erlang – Actor model, Telco). Proven. Underpins every use case, every business capability, every product feature Concepts go back almost 50 years; Erlang, etc. EDA/CEP Data Lake Large, easily accessible, centralized generally store-everything approach Structured and unstructured Data classified, organized, analyzed on access Business value Real time data, for application components, middleware, data processing, analytics Common platform for both app dev and analytics Bedrock is messaging We’ve been using this technique for decades, via many technologies Improvements over time around component isolation, decoupling to benefit scalability and concurrency Architecture Importance of vision, governance, tenancy, entitlements, security messaging We’ve been using this technique for decades, via many technologies Improvements over time around component isolation, decoupling to benefit scalability and concurrency A journey of innovation and successive refinement
  3. Many organizations independently building software to similar patterns Increasing pressures to simplify, scale, innovate and improve customer experience Increasing proliferation and interoperability of system environments and connected devices More data with contextual use cases Yesterday’s architectures just don’t cut it Need more flexible, resilient, robust systems Solution through evolution
  4. Usually non-blocking
  5. Reactive Systems: Architecture, Design Reactive Programming: declarative event-based) Regarding FRP: NOTE: The inventor of this term, Conal Elliott, says this term is misapplied today (e.g. RxJS, RxJava, Bacon.js, etc). Refer to his presentation (July 22, 2015) for the details: https://begriffs.com/posts/2015-07-22-essence-of-frp.html
  6. Reactive Programming != Functional Reactive Programming New information drives logic flow  vs. control flow driven by thread-of-execution Avoids resource contention (Amdahl’s Law) that impedes scalability. Reactive API libraries are either declarative (functional composition, combinators) callback-based (attached to events, executed during dataflow chain) with stream-based operators (windowing, triggers, etc)
  7. Increased (efficient) utilization of compute resources (incl. multi-core) Increased performance via serialization reduction Amdahl’s Law Neil Günter’s Universal Scalability Law. quantify the effects of contention and coordination in concurrent, distributed systems. explains how cost of coherency in a system can lead to negative results as (new) resources are added to the system. Productivity: Reactive libraries handle complexity asynch, nonblocking compute, IO, coordination between components. Ideal for creating components that are composed to workflows, that are back-pressured, scalable, high-performance
  8. Reactive Programming is related to Dataflow Programming Both emphasize flow of data vs. flow of control Examples Futures / Promises (Reactive) Streams – unbounded data processing. Asynch, non-blocking, back-pressured pipelines connecting sources and destinations. Dataflow Variables – single assignment variables (AKA a cell in Excel) a value change can trigger dependent functions to produce new values (state) Technologies that do this, include Akka Streams RxJava Vert.X Reactive Streams Specification The standard for interoperability amongst Reactive Programming libraries on the JVM “…an initiative to provide a standard for asynchronous stream processing with non-blocking back pressure.”
  9. Reactive Programming - event-driven Computation via dataflow chains Events are not directed; “addressable event sources” Events are mere facts that can be observed Emitted by changes in state (state machine) Listeners attach to even sources, which in turn, react to them Emitted locally Reactive Systems - message-driven Basis of comm across network/components; prefer asynch Sender/Receiver are decoupled Focus on resilience and elasticity via communication/communication inherent to distributed systems Long-lived, addressable components Waits for messages to be sent, then reacts to them Messages are ”directed” A message has a clear destination; “addressable recipient”
  10. Common pattern is to us Messaging as means to communicate Events across network/components Events within Messages Examples AWS Lambda, distributed streaming (Spark Streaming, Flink, Kafka, Akka Streams, Pub/Sub) Cons Lose some control Messaging forces developers to deal with complex realities of distributed programming Failure detection Message delivery contracts (dupes, retry, ordering) Consistency guarantee Can’t hide behind “leaky” abstractions that pretend a network does not exist (EJB, XA, RPC, etc)
  11. Devoid of bottlenecks or hot spot
  12. Companies that have a size, complexity and age Young companies (e.g. start-ups) Less legacy overhead Able to use newer technology; innovate, disrupt, find a niche Fewer expensive, enterprise-class solutions; value through IP or a unique product More likely to build cloud native (various reasons) Staff typically has larger sphere of influence Sometimes filled with Unicorns Mid-size companies May have legacy overhead Complexity if growth through acquisition vs. growth through product evolution Strategic use of of Enterprise products May be cloud native, or hybrid Increased division of labor, defined roles, paying customers to keep happy large companies Probably where the terms “technical debt” and “culture debt” came from legacy overhead, complex environments Fear of change, less room to fail – backed by valid business reasons complexity due to both acquisition and product evolution ”different” tech stack s Enterprise products, prefers “supported” flavors of open source A blend of on-premises, hybrid, cloud More politics Slower to see value from new technology; transformative change is more difficult; planning, budgeting, process, management structure Matrixed division of labor, defined roles, paying customers to keep happy
  13. Multiple software applications Customer facing, revenue generating Specialization to function operation, development, maintenance activities Data typically moved from system of record, into one or more centralized data “hubs” OLAP / Data Warehouses Data Lake, common use of Hadoop Older application bad at interoperability Decreases value to enterprise Drop in ETL, ESB, SOA, iPaaS to do so ($$$) Perhaps refactor service layers into more modern middleware Trend toward ”real time” The old, batch-oriented, application silos are too complex and just can’t do this well Unlocks new capabilities
  14. Products (and supporting software) operates in Real Time Opens door to do things to increase customer satisfaction/retention Save money, become more agile (speed to market) – productivity enhancer Data flows to system that need it, acted on in real-time Reduced TCO auto scaling, resiliency, high availability, autonomic & automated
  15. Analysts are not coders They just want to connect a friendly BI tool (e.g. Tableau) to run their SQL against data. Can’t expect people to switch careers or skill sets to accommodate new technology.
  16. Use case reminder: millions of concurrent actors handling billions of messages, in real-time Continuous data streams Aka Fire Hose Velocity AND volume - Gb/hr, Tb/hr Old way: store  act  analyze Say hello to Map Reduce! Act in real-time As things are happening Loses value if stored first, and queried later (hdfs, redshift, etc) Uses technology proven to scale Kafka, Storm, etc -> Delivery Akka, Spark, Flink -> Act send to destinations Data Lake Other systems Fast Data is the obvious future Many of us have been using these techs for years
  17. Mi
  18. Core techs Kafka - open source or Confluent Spark - open source or Databricks (a nice managed service in AWS) Akka (and Akka-HTTP) (open source or Lightbend stack) Alpakka
  19. Streaming platform Process messages when provided Fault-tolerant storage Pub/Sub capability Popular use cases Data pipelines (messaging) between a source and destination Real-time action / transformation Metrics, weblogs, stream processing, event sourcing A commit log Spread across a cluster Record streams stored in topics Each record has a key, value, timestamp Each topic has offsets and a retention policy
  20. Producer API To publish a record to Kafka Consumer API To subscribe to a topic(s) Consumer groups Handle records pushed to topics Streams API Stream processing Consume from Stream An, do processing, publish to Stream Bn E.g. aggregation, joining Connector API To build custom Producers/Consumers Purpose built integration components
  21. Limitations of older message providers RabbitMQ, AMQ Doesn’t scale well; complex Need to use other abstractions for batching Lacks replayability (reset offset, etc) Limitations of log forwarders (Scribe or Flume Push architecture; high performance, scales well Business logic clogs up these systems in endpoints; needs to push data fast Built to push data to large sink (e.g. Hadoop) Query at a later time; not real time Supports Polyglot Python, Go, .NET, Node.js, C/C++, etc
  22. Interactive (ad-hoc) data mining (R, Excel, Searching, analyst queries) Slow for anything real time
  23. Job a set of tasks to be executed as a result of an action Stage a set of tasks in a job that can be run in parallel Task a individual unit of work sent to a single executor
  24. Dataframe is a distributed assembly of data into named columns Analogous to a relational table, or data frame in R/Python (with richer optimizations) Dataset was added in 1.6 to provide benefit of RDDs and Spark SQL’s execution engine Build datasets from JVM objects and then manipulate with functional transformations
  25. Scalable, high-throughput, fault tolerant processing of real time streams Microbatched. Think of in terms of EDA (e.g. Esper), for “windowing” (etc) vs. handling a single message/event.
  26. Merge inbound data with historical data Two primary metrics to monitor and tune: Processing time (per batch) Scheduling delay (processed upon arrival?) Uses Kyro for serialization
  27. who uses it?
  28. Simplifies distributed programming Akka handles/enables: concurr/scalability/fault tolerance Self Healing Complex, easy to make mistakes Adaptive load-balancing, cluster rebalancing, Actor migration With a single unified Programming Model Managed Runtime Open Source Distribution Manage System Overload (backpressure) Program to a higher level No more shared state, state visibility, threads, locks, concurrent collections, thread notifications Low level concurrency built into the plumbing; it becomes simple workflow just deal with messages Increases CPU utilization, lowers latency, high throughput, scalable! Superior, Proven model to detect and recover from errors
  29. Results in tons of defensive code, scattered throughout the codebase entangled in your business logic
  30. Encapsulates code like servlets or session beans; policy decisions separated from biz logic The vehicle to create concurrent, scalable, fault-tolerant apps atop the fabric About Actors Encapsulated, decoupled Managing own memory and behavior Communicates asynchronously with non-blocking messages Elastic – grow/shrink on demand Hot deploy, change runtime behavior
  31. Alternative to: A thread An object instance or component Callback or listener Singleton or service Router, load-balancer or pool Session bean or MDB Out of process service A Finite State Machine (FSM)
  32. 1. Create Yes, creates new actor. From ActorSystem, then ActorRef. Strong encapsulation of: state/behavior (indistinguishable), message queue
  33. Everything is (usually) asynch (ask), lockless; fire & forget
  34. (Think in terms of the object changed it’s type – interface, protocol, implementation)
  35. A busy actor can become an Actor Pool or Router! Implement FSM (Finite State Machine) Implement graceful degradation Spawn empty workers that can ”Become” whatever the Master desires Very useful. Limited only by your imagination.
  36. Manage another Actor’s failures or the person sitting next to you
  37. Akka Streams API is decoupled from Reactive Streams interfaces Implementation details to pass stream data between processing stages
  38. Immutable building blocks / blueprints enabled for libraries, include Source: something with exactly one output stream Sink: something with exactly one input stream Flow: something with exactly one input and one output stream BidiFlow: something with exactly two input streams and two output streams that conceptually behave like two Flows of opposite direction Graph: a packaged stream processing topology that exposes a certain set of input and output ports, characterized by an object of type Shape. Built in backpressure capability No stage can push downstream unless it received a pull beforehand Difference between error and failure Error is accessible within the stream as a data element (signaled via onNext) Failure means the stream itself has collapsed (signaled via onError). Want failure to propagate faster than data (essential, to deal with backpressure) Data elements emitted before a failure can still be lost of the onError overtakes them Recovery element acts as bulkhead to confine a stream collapse to a given region of stream topology, to isolate outside from impact of collapsed region (e.g. buffered elements)
  39. Can expose an incoming connection in form of a Source instance To start listening on network with Akka HTTP, create a Route and bind it to a port (similar syntax to Spray). Backpressure on source? Akka HTTP stops consuming data from network; eventually leads to 0 TCP window – applying backpressure to sending party itself (e.g. a sensor) Best practices Libraries shall provide their users with reusable pieces, i.e. expose factories that return graphs, allowing full compositionality Avoid destruction of compositionality. Express functionality of a library such that materialization can be done by user outside of library’s control. Libraries may optionally and additionally provide facilities that consume and materialize graphs Allows a library to provide convenience “sugar” for use cases
  40. Data acted on after storage Map Reduce, Hive, Spark trickery Don’t get me started on “Lambda” architectures
  41. Data not stored in a way that is optimized for data analysis Avro & Parquet, for example May be many different structures of data based on access path
  42. Blissful ignorant: how/why data is used, governed, defined and secured Does this sound like a good solution? Data Lake solves old problem of siloing data. Great, so now it’s all in comingled. Federated query? AWS Athena? Why move data if not necessary?
  43. Blissful ignorant: how/why data is used, governed, defined and secured Does this sound like a good solution? Data Lake solves old problem of siloing data. Great, so now it’s all in comingled. Federated query? AWS Athena? Why move data if not necessary?
  44. Blissful ignorant: how/why data is used, governed, defined and secured Does this sound like a good solution? Data Lake solves old problem of siloing data. Great, so now it’s all in comingled. Federated query? AWS Athena? Why move data if not necessary?
  45. Is this the Light at the end of the Tunnel? Or, another train coming down the tracks?
  46. enable analysis of the full variety and volume of Extracting maximum value out of the Data Lake requires customized management and integration that are currently unavailable from any single open-source platform or commercial product vendor. Domain specification A Data Lake customized for biomedical research would be significantly different from one tailored to financial services. business-aware data-locating capability -- enables business users to find, explore, understand, and trust the data. Search capability requires business ontologies, within which business terminology can be mapped to the physical data. Self Service Metadata Management capture a robust set of attributes for every piece of content within the lake. data lineage, data quality, and usage history are vital to usability. automated metadata extraction, capture, and tracking facility. Without a high-degree of automated and mandatory metadata management, a Data Lake will rapidly become a Data Swamp. Ingestion new sources of external information continually discovered by business users. need rapidly on-boarding to avoid frustration and monetize opportunities. A configuration-driven, ingestion workflow mechanism can provide a high level of reuse, enabling easy, secure, and trackable content ingestion from new sources. Interoperate needs a supervisor that integrates and manages, when required, existing data management tools, such as data profiling, data mastering and cleansing, and data masking technologies.