3. • Senior Solutions Architect @ Cox Automotive, Inc.
• Strategic Data Services
• The opinions contained herein are my own and may not represent those of my employer.
• Background: platform development, middleware, messaging, app dev
• DevOps mentality
• History of exploring, learning, assimilating new technology and styles
• Life-long learner
• Novice bass player
• Scuba diver
About Me
4. All presentations (including today’s) are available on SlideShare:
https://www.slideshare.net/ToddFritz
AJUG (Jan 17, 2017)
Building a Reactive Fast Data Lake with Akka, Kafka and Spark
Video – https://vimeo.com/200863800
DevNexus 2015
Containerizing a Monolithic Application into a Microservice-based PaaS
Great Wide Open - Atlanta (April 3, 2014)
Server to Cloud: converting a legacy platform to an open source PaaS
AJUG (April 15, 2014)
Convert a legacy platform to a micro-PaaS using Docker
Video - https://vimeo.com/94556976
Previous Presentations
5. • Preface
• Reactive Systems, Patterns, Implementations
• The Enterprise
• Fast Data
• The Data Lake: Analytics & App Dev
• Questions
• Resources
Agenda
7. • Why Reactive?
• What is Fast Data?
• What is a Data Lake?
• What is the intersection?
• Business Value?
• Architecture considerations?
• Importance of Messaging
• Use Case: A system that can scale to millions of users, billions of
messages
8. Reactive Systems, Patterns, Implementations
“People who think they know everything
are a great annoyance to those of us who do.”
- Isaac Asimov
9. • The Reactive Manifesto: http://www.reactivemanifesto.org
• Many organizations independently building software to similar
patterns
• Yesterday’s architectures just don’t cut it
• Actors (Akka, Erlang)
• Good starting point: https://www.lightbend.com/blog/architect-reactive-design-patterns
10. “…a set of design principles for creating cohesive systems.” *
“…a way of thinking about systems architecture and design in a distributed
environment…” *
“…implementation techniques, tooling, and design patterns...” *
components of a larger whole
What is Reactive?
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
11. • “…based on an architectural style that allows … multiple individual
services to coalesce as a single unit and react to its surroundings while
remaining aware of each other…” *
• Event Driven
• Asynch & Non-Blocking
• “… scale up/down, load balance...” *
Individual components may qualify as reactive, but combining them
does not guarantee a Reactive System
Reactive Systems
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
12. 1. Reactive Systems
2. Reactive Programming
3. Functional Reactive Programming (FRP)
Reactive Begets*
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
13. • Subset of asynch programming
• Information drives logic flow
• Decompose into multiple, asynch, nonblocking steps
• Combine into a composed workflow
• Reactive API libraries are either declarative or callback-based
• Reactive programming is event-driven
• Reactive systems are message driven
Reactive Programming*
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
14. • More efficient utilization of resources
• Increased performance via serialization reduction
• Productivity: Reactive libraries handle complexity
• Ideal for back-pressured, scalable, high-performance
components
Benefits of Reactive Programming*
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
15. • Reactive & Dataflow Programming
• Examples
• Futures / Promises
• (Reactive) Streams – back-pressured
• Dataflow Variables – state change event driven actions
• Technologies
• Reactive Streams Specification
• The standard for interoperability amongst Reactive Programming
libraries on the JVM
Dataflow Programming*
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
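A minimal sketch of the dataflow idea using plain Scala Futures: each asynch, non-blocking step is composed declaratively, and new information drives the next step. The fetchUser/fetchScore functions are illustrative assumptions, not part of the deck.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureComposition extends App {
  // Two hypothetical async, non-blocking steps (names are illustrative)
  def fetchUser(id: Long): Future[String]   = Future { s"user-$id" }
  def fetchScore(user: String): Future[Int] = Future { user.length * 10 }

  // Declarative composition: the result of one step drives the next,
  // without blocking a thread between steps.
  val workflow: Future[Int] =
    for {
      user  <- fetchUser(42L)
      score <- fetchScore(user)
    } yield score

  // Blocking here only to print the result in a demo app.
  println(Await.result(workflow, 5.seconds))
}
```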
16. • Reactive Programming: event-driven
• Dataflow chains
• Messages not directed
• Events are observable facts
• Produced by State machine
• Listeners attach to react
• Reactive Systems: message-driven
• Basis of communication; prefer asynch
• Decoupled
• Resilient and Elastic
• Long-lived, addressable components
• Reacts to directed messages
Event-Driven vs. Message-Driven*
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
17. • Common pattern: Events within Messages
• Example
• AWS Lambda, Pub/Sub
• Pros
• Abstraction, simplicity
• Cons
• Lose some control
• Forces developers to deal with distributed programming complexities
• Can’t hide behind “leaky” abstractions that pretend a network does not
exist
Event-Driven*
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
18. Reactive Systems: Characteristics*
Responsive
• Low latency
• Consistent, Predictable
• Usability
• Essential for Utility
• Fail Fast
• Happier Customers
Resilient
• Responsive through failure
• Resilient (H/A) through
replication
• Failure isolated to
Component (bulkheads)
• Delegate recovery to
external Component
(supervisor hierarchies)
• Client does not handle
failures
Elastic
• Responsive to workload
• Devoid of bottlenecks
• Perf metrics drive
predictive / reactive
autonomic scaling
• Commodity
infrastructure
Message Driven
• Asynchronous
• Loose Coupling
• Isolation
• Location Transparency
• Non Blocking
• Recipients can passivate
• Message Passing
• Flow Control
• Exception Management
• Elasticity
• Back Pressure
* Source: The Reactive Manifesto
19. Reactive Systems: Patterns
Architecture
Single Component
• Component does one thing, fully and well
• Single Responsibility Principle
• Max cohesion, min coupling
Architecture
Let-it-Crash
• Prefer a full component restart to complex internal error
handling
• Design for failure
• Leads to more reliable components
• Avoids hard to find/fix errors
• Failure is unavoidable
Implementation
Circuit Breaker
• Protect services by breaking connections during
failures
• From electrical engineering (EE)
• Protects clients from timeouts
• Allows time for the service to recover (see the breaker sketch after this slide)
Source: https://www.lightbend.com/blog/architect-reactive-design-patterns
Implementation
Saga
• Divide long-lived, distributed transactions into
quick local ones with compensating actions for
recovery.
• Compensating txns run during saga rollback
• Concurrent sagas can see intermediate state
• Sagas need to be persistent to recover from
hardware failures. Save points.
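To make the Circuit Breaker pattern above concrete, here is a minimal sketch using Akka’s akka.pattern.CircuitBreaker. The thresholds and the callRemoteService function are illustrative assumptions, not a prescription.

```scala
import akka.actor.ActorSystem
import akka.pattern.CircuitBreaker
import scala.concurrent.Future
import scala.concurrent.duration._

object CircuitBreakerSketch extends App {
  implicit val system: ActorSystem = ActorSystem("breaker-demo")
  import system.dispatcher

  // Trip after 5 consecutive failures, time calls out after 2s,
  // and allow a single test call after 30s in the open state.
  val breaker = new CircuitBreaker(
    system.scheduler,
    maxFailures = 5,
    callTimeout = 2.seconds,
    resetTimeout = 30.seconds
  ).onOpen(println("Circuit breaker opened: failing fast"))

  // A hypothetical remote call protected by the breaker.
  def callRemoteService(): Future[String] = Future { "payload" }

  // Clients fail fast while the breaker is open, giving the service time to recover.
  breaker.withCircuitBreaker(callRemoteService()).foreach(println)
}
```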
20. The Enterprise
“Even if you are on the right track,
you’ll get run over if you just sit there.”
- Will Rogers
22. • Company characteristics matter
• Young companies (e.g. start-ups)
• Mid-size companies
• Large companies
The Enterprise
23. • Multiple software applications
• Specialization to function
• Analytics, DataMart
• Siloed data needs to be moved
• Older applications are bad at interoperability
• Trend toward ”real time”
• Evolution toward Data as a Service (DaaS)
Evolving the Enterprise
24. • Real time applications & analytics
• Increased agility + productivity = speed to market
• Decoupling enables interoperability
• Data flows to systems that need it
• Powerful data processing
• Reduced TCO
• Cloud friendly
How Reactive Helps the Enterprise
25. • Friends don’t let Analysts do map reduce
• Larger companies may have communities of SAS users
• Analysts are not coders
Remember the Analysts
27. Fast Data
“If you are in a spaceship that is traveling at the speed of light,
and you turn on the headlights,
does anything happen?”
- Steven Wright
28. • Big Data → Fast Data
• Continuous data streams measured in time
• Old way: store → act → analyze
• Act in real-time
• Uses technology proven to scale
• Akka, Kafka, Spark, Flink
• Send to destinations
• Prefer shared-nothing clustering, in-memory storage
Fast Data
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
29. Time is Gold
* Source: “Towards Benchmarking Modern Distributed Systems”, Grace Huang (Intel)
30. • Time to talk tech
• Reference architecture: Lightbend’s Fast Data platform
• Core tech stack:
• Kafka
• Spark
• Akka (and Akka-HTTP)
• Alpakka / Camel
• (AWS, Spring, others can/should be incorporated with scope)
Now, the Fun Stuff
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
32. • JMS -> AMQP -> Kafka
• Streaming platform
• Popular use cases
• Data pipelines (messaging)
• Real-time actions
• Metrics, weblogs, stream processing, event sourcing
• Commit log
• Spread across a cluster
• Record streams stored in topics
• Each record has a key, value, timestamp
• Each topic has offsets and a retention policy
Kafka 101
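A minimal producer sketch from Scala using the standard kafka-clients API, showing a keyed record published to a topic. The broker address and the vehicle-events topic name are illustrative assumptions.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

object KafkaProducerSketch extends App {
  val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.StringSerializer")
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)

  // Each record carries a key and value; Kafka adds the timestamp and
  // assigns an offset within the topic partition.
  producer.send(new ProducerRecord[String, String]("vehicle-events", "vin-123", """{"speed":42}"""))

  producer.flush()
  producer.close()
}
```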
34. • Limitations of older message providers
• Limitations of log forwarders (Scribe, Flume)
• Supports Polyglot
• Robust ecosystem
• https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem
Why Use Kafka?
35. • A fast and general engine for large-scale data processing
• Hadoop / YARN
• Map-Reduce – mapper, reducer, disk IO, queue, fetch resource
• Great for parallel file processing of large files
• Synchronization barrier during persistence
• Spark
• In-memory data processing
• Interactive/iterative data query
• Better supports more complex, interactive (real-time) apps
• 100x faster than Hadoop MR (in memory)
• 10x faster on disk
• Microbatching
Spark
Source: spark.apache.org
40. Spark: SQL
Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
• Spark SQL: a module for structured data querying
• Supports basic SQL & HiveQL
• A distributed query engine via JDBC/ODBC, or CLI
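A minimal Spark SQL sketch, assuming the Spark 2.x SparkSession API; the JSON path, view name, and query are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch extends App {
  val spark = SparkSession.builder()
    .appName("spark-sql-sketch")
    .master("local[*]")        // local mode just for the example
    .getOrCreate()

  // Load structured data (path is illustrative) and register it as a view.
  val events = spark.read.json("/data/events.json")
  events.createOrReplaceTempView("events")

  // Query with plain SQL; the same engine also serves JDBC/ODBC clients.
  spark.sql("SELECT vin, COUNT(*) AS cnt FROM events GROUP BY vin ORDER BY cnt DESC")
    .show()

  spark.stop()
}
```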
41. Spark: Dataframe, Datasets (RDDs)
Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
• A DataFrame is a distributed collection of data organized into named columns
• E.g. a relational table
• Datasets were added in Spark 1.6 to combine the benefits of RDDs with Spark SQL’s optimized execution engine
• Built from JVM objects, then manipulated with functional transformations
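A minimal Dataset sketch, again assuming Spark 2.x; the Listing case class and sample rows are illustrative.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Case class at top level so Spark can derive an encoder for it.
case class Listing(vin: String, price: Double)

object DatasetSketch extends App {
  val spark = SparkSession.builder()
    .appName("dataset-sketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._   // encoders for case classes and primitives

  // A Dataset is built from JVM objects and manipulated with functional
  // transformations, while still using Spark SQL's optimized execution engine.
  val listings: Dataset[Listing] = Seq(
    Listing("vin-1", 19999.0),
    Listing("vin-2", 24999.0)
  ).toDS()

  listings
    .filter(_.price > 20000.0)
    .map(_.vin)
    .show()

  spark.stop()
}
```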
43. • Merge inbound data with other data
• Polyglot: Scala, Python, Java
• Lower level access via DStream (via StreamingContext)
• Create RDDs from the DStream
• Two primary metrics to monitor and tune
• Kryo serialization
Spark Streaming
Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
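A minimal DStream sketch using the core StreamingContext API with a socket source (in practice a Kafka direct stream would replace it). The 5-second micro-batch interval and port are illustrative choices.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch extends App {
  val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")

  // Micro-batch interval: records arriving within each 5s window form one RDD.
  val ssc = new StreamingContext(conf, Seconds(5))

  // A simple text source for the example; swap in a Kafka source for real pipelines.
  val lines = ssc.socketTextStream("localhost", 9999)

  // Work with the DStream; each micro-batch is exposed as an RDD underneath.
  lines.flatMap(_.split("\\s+"))
       .map(word => (word, 1))
       .reduceByKey(_ + _)
       .print()

  ssc.start()
  ssc.awaitTermination()
}
```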
45. Akka
Source: “Introducing Akka”, Jonas Bonér
• Simplifies distributed programming
• Concurrency, Scalability, Fault-Tolerance
• A single, unified model
• Manages System Overload
• Elastic and dynamic
• Scale up & out
• Program to a higher level
• Not threadbound
46. Akka: Failure Management
Source: “Introducing Akka”, Jonas Bonér
In Java/C/C++ etc.
• Single thread of control
• If that thread blows up you are screwed
• Requires explicit error handling within your single thread
• Errors isolated to your thread; the others have no clue!
47. Akka: Perfect for the Cloud
Source: “Introducing Akka”, Jonas Bonér
• Elastic and dynamic
• Fault tolerant & self healing (autonomic)
• Adaptive load-balancing, cluster rebalancing, Actor migration
• Build loosely coupled systems that dynamically adapt at runtime
48. Akka 101
Source: “Introducing Akka”, Jonas Bonér
• Unit of code = Actor
• Actors have been around since 1973 (Erlang, telco); 9 nines of uptime!
• About Actors
49. About Actors
Source: “Introducing Akka”, Jonas Bonér
• An actor is an alternative to …
• Fundamental unit of computation that embodies:
1. Processing
2. Storage
3. Communication
• Axioms – When an actor receives a message it can:
1. Create new Actors
2. Send messages to Actors it knows
3. Designate how it should handle the next message it receives
51. Akka: Core Actor Operations
Source: “Introducing Akka”, Jonas Bonér
0. Define
1. Create
2. Send
3. Become
4. Supervise
52. Akka: Define Operation
Source: “Introducing Akka”, Jonas Bonér
• Define the Actor
• Define the message (class) the actor should respond to
53. Akka: Create Operation
Source: “Introducing Akka”, Jonas Bonér
• Yes, creates a new actor.
• Lightweight: ~2.6M actors per GB of RAM
• Strong encapsulation
54. Akka: Send Operation
Source: “Introducing Akka”, Jonas Bonér
• Sends a message to an Actor
• Everything is non-blocking
• Everything is Reactive
• Actor passivates until receiving a message then awakens
• Messages are energy
This is not an
endorsement…
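A minimal classic-Akka sketch of the Define, Create and Send operations from the last three slides; the Greeter actor and Greet message are illustrative.

```scala
import akka.actor.{Actor, ActorSystem, Props}

// Define the message (class) the actor responds to.
final case class Greet(name: String)

// Define the actor: its receive block reacts to one message at a time.
class Greeter extends Actor {
  def receive: Receive = {
    case Greet(name) => println(s"Hello, $name")
  }
}

object ActorSketch extends App {
  val system = ActorSystem("demo")

  // Create: returns an ActorRef, not the actor instance itself (strong encapsulation).
  val greeter = system.actorOf(Props[Greeter], "greeter")

  // Send: fire-and-forget, non-blocking; the actor wakes when the message arrives.
  greeter ! Greet("AJUG")
}
```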
55. Akka: Become Operation
Source: “Introducing Akka”, Jonas Bonér
Dynamically redefine an actor’s behavior
• Reactively triggered by receipt of a message
• Will now react differently to the messages it receives
• Behaviors are stacked – can be pushed and popped…
56. Akka: Become Operation…
Source: “Introducing Akka”, Jonas Bonér
Why?
• A busy actor can become an Actor Pool or Router!
• Implement FSM (Finite State Machine)
• Implement graceful degradation
• Spawn empty pool that can ”Become” something else
• Very useful. Limited only by your imagination.
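A minimal sketch of Become using context.become; the normal/degraded behaviors are an illustrative example of graceful degradation.

```scala
import akka.actor.{Actor, ActorSystem, Props}

case object Overloaded
case object Recovered
final case class Work(payload: String)

// An actor that swaps its behavior at runtime with context.become.
class Worker extends Actor {
  def receive: Receive = normal

  def normal: Receive = {
    case Work(p)    => println(s"processing $p")
    case Overloaded => context.become(degraded)  // react differently to the next messages
  }

  def degraded: Receive = {
    case Work(_)   => println("shedding load, try again later")
    case Recovered => context.become(normal)     // switch back
  }
}

object BecomeSketch extends App {
  val system = ActorSystem("become-demo")
  val worker = system.actorOf(Props[Worker], "worker")

  worker ! Work("a")
  worker ! Overloaded
  worker ! Work("b")     // handled by the degraded behavior
  worker ! Recovered
}
```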
57. Akka: Supervise Operation
Source: “Introducing Akka”, Jonas Bonér
• Manage another Actor’s failures
• A Supervisor monitors, then responds
• Supervisor is notified of failures
• Separates processing from error handling
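A minimal supervision sketch: the parent declares a OneForOneStrategy, so a crashed child is simply restarted and the sender never handles the failure. The Parent/Child actors are illustrative.

```scala
import akka.actor.{Actor, ActorSystem, OneForOneStrategy, Props, SupervisorStrategy}
import akka.actor.SupervisorStrategy.{Escalate, Restart}
import scala.concurrent.duration._

class Child extends Actor {
  def receive: Receive = {
    case "boom" => throw new IllegalStateException("failed")
    case msg    => println(s"child got $msg")
  }
}

// The parent supervises: processing lives in the child, the recovery policy lives here.
class Parent extends Actor {
  override val supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 3, withinTimeRange = 1.minute) {
      case _: IllegalStateException => Restart   // let it crash, then restart the child
      case _: Exception             => Escalate
    }

  private val child = context.actorOf(Props[Child], "child")

  def receive: Receive = { case msg => child.forward(msg) }
}

object SuperviseSketch extends App {
  val system = ActorSystem("supervise-demo")
  val parent = system.actorOf(Props[Parent], "parent")

  parent ! "hello"
  parent ! "boom"   // child restarts; the client never handles the failure
}
```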
59. Akka Streams
Source: Akka Documentation
• Akka Streams API is decoupled from Reactive Streams interfaces
• Interoperable with any conformant Reactive Streams implementation
• Principles
• All features explicit in the API
• Compositionality; combined pieces retain the function of each part
• Models the domain of distributed, bounded stream processing
• Reactive Streams became the JDK 9 Flow API
60. Akka Streams
Source: Akka Documentation
• Immutable building blocks
• Source
• Sink
• Flow
• BidiFlow
• Graph
• Built in backpressure capability
• Difference between error & failure
• Error: accessible within the stream, as a data element
• Failure: stream itself has collapsed
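A minimal sketch of the building blocks above, assuming Akka 2.6+ where the ActorSystem provides the materializer; the numbers and transformations are illustrative.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}

object StreamsSketch extends App {
  implicit val system: ActorSystem = ActorSystem("streams-demo")

  // Blueprints are immutable; nothing runs until the graph is materialized.
  val source = Source(1 to 1000000)                     // exactly one output
  val flow   = Flow[Int].map(_ * 2).filter(_ % 3 == 0)  // one input, one output
  val sink   = Sink.foreach[Int](println)               // exactly one input

  // Backpressure is built in: downstream demand regulates how fast the
  // source may emit, so a fast producer cannot overwhelm a slow consumer.
  source.via(flow).runWith(sink)
}
```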
61. Akka HTTP
Source: Akka Documentation
• Built atop Akka Streams
• Exposes an incoming connection in the form of a Source instance
• Back-pressured
• Best practices:
• Libraries shall provide their users with reusable pieces
• Avoid destruction of compositionality.
• Libraries may provide facilities that consume and materialize graphs
• Including convenience “sugar” for use cases
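A minimal Akka HTTP sketch, assuming the Akka HTTP 10.2+ newServerAt API; the /health route and port are illustrative.

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._

object HttpSketch extends App {
  implicit val system: ActorSystem = ActorSystem("http-demo")

  // A Route describes how requests are handled; it is another stream blueprint
  // built on Akka Streams, so backpressure applies down to the TCP connection.
  val route =
    path("health") {
      get {
        complete("OK")
      }
    }

  // Bind the route to a port and start listening.
  Http().newServerAt("localhost", 8080).bind(route)
}
```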
62. • Adds interesting capabilities to Akka Streams
• Akka alternative to Apache Camel (EIP)
• Community driven, focused on connectors to external libraries,
integration patterns and conversions.
• https://github.com/akka/alpakka/releases/tag/v0.1
• Akka Streams Integration
• https://github.com/akka/alpakka
• http://developer.lightbend.com/docs/alpakka/current/
Alpakka 101
63. The Data Lake for Analytics, App Dev
“Lake Wobegon, the little town that time forgot
and the decades cannot improve.”
- Garrison Keillor
65. The Data Lake
• The “Old” way
• Let’s combine data (de-silo) to increase sharing & usage
• Cost reduction through centralized consolidation
• Data Vacuum
• Forces complexity onto consumers
• Data acted on after storage
• Current thinking
• The old way does not work
• Need governance and other functions to reduce complexity
• 3 V’s matter: Volume, Variety AND Velocity
• Able to act on data as it occurs (Fast Data)
• Following slides elaborate & map to techs (verbal)
66. The Data Lake
• Fast Data fills the Data Lake
• Massive & easily accessible
• Built on commodity hardware (or now, “the Cloud”)
• Data not stored in a way that is optimized for data analysis
• Retains all attributes
• Awareness of Data Lake fallacy: http://www.gartner.com/newsroom/id/2809117
67. Dark Side of the Old Ways
Source: “Gartner Says Beware of the Data Lake Fallacy”
• Vacuum data + use later = SWAMP
• Redundant tooling & low interoperability
• Blissful ignorance of how/why data is used, governed, defined and secured
• Remove silos? Great, so now it’s all commingled.
• Inability to measure data quality (or fix)
• Accepts data without governance or oversight
• Accepts any data without metadata (description)
• Inability to share the lineage of findings from prior analyses, so discovered value cannot be reused
• Security, access control (and tracking of both)
• Data Ownership, Entitlements?
• Tenancy?
• Compliance? Audits?
• What to do?
70. Blueprint: Light at End of Tunnel
https://knowledgent.com/whitepaper/design-successful-data-lake/
71. Success Highlights
Source: “Gartner Says Beware of the Data Lake Fallacy”
• Data Lake should enable analysis of data stored in the lake.
• Requires features to secure, curate, run analytics, visualization, reporting
• Use multiple tools and products – no product (today) does it all
• Domain specific – tailored to your industry
• Metadata management – without this you get a swamp
• Configurable ingestion workflows
• Interoperate with Existing environment
• Timely
• Flexible
• Quality
• Findable
Why Reactive?
Not new, concepts go back 50 years (Erlang – Actor model, Telco). Proven.
Underpins every use case, every business capability, every product feature
EDA/CEP
Data Lake
Large, easily accessible, centralized
generally store-everything approach
Structured and unstructured
Data classified, organized, analyzed on access
Business value
Real time data, for application components, middleware, data processing, analytics
Common platform for both app dev and analytics
Bedrock is messaging
We’ve been using this technique for decades, via many technologies
Improvements over time around component isolation, decoupling to benefit scalability and concurrency
Architecture
Importance of vision, governance, tenancy, entitlements, security
A journey of innovation and successive refinement
Many organizations independently building software to similar patterns
Increasing pressures to simplify, scale, innovate and improve customer experience
Increasing proliferation and interoperability of system environments and connected devices
More data with contextual use cases
Yesterday’s architectures just don’t cut it
Need more flexible, resilient, robust systems
Solution through evolution
Usually non-blocking
Reactive Systems: Architecture, Design
Reactive Programming: declarative, event-based
Regarding FRP:
NOTE: The inventor of this term, Conal Elliott, says this term is misapplied today (e.g. RxJS, RxJava, Bacon.js, etc).
Refer to his presentation (July 22, 2015) for the details: https://begriffs.com/posts/2015-07-22-essence-of-frp.html
Reactive Programming != Functional Reactive Programming
New information drives logic flow vs. control flow driven by thread-of-execution
Avoids resource contention (Amdahl’s Law) that impedes scalability.
Reactive API libraries are either
declarative (functional composition, combinators)
callback-based (attached to events, executed during dataflow chain)
with stream-based operators (windowing, triggers, etc)
Increased (efficient) utilization of compute resources (incl. multi-core)
Increased performance via serialization reduction
Amdahl’s Law
Neil Gunther’s Universal Scalability Law
Quantifies the effects of contention and coordination in concurrent, distributed systems
Explains how the cost of coherency in a system can lead to negative results as (new) resources are added to the system
Productivity: Reactive libraries handle complexity
asynch, nonblocking compute, IO, coordination between components.
Ideal for creating components that are composed to workflows,
that are back-pressured, scalable, high-performance
Reactive Programming is related to Dataflow Programming
Both emphasize flow of data vs. flow of control
Examples
Futures / Promises
(Reactive) Streams – unbounded data processing. Asynch, non-blocking, back-pressured pipelines connecting sources and destinations.
Dataflow Variables – single assignment variables (AKA a cell in Excel)
a value change can trigger dependent functions to produce new values (state)
Technologies that do this, include
Akka Streams
RxJava
Vert.X
Reactive Streams Specification
The standard for interoperability amongst Reactive Programming libraries on the JVM
“…an initiative to provide a standard for asynchronous stream processing with non-blocking back pressure.”
Reactive Programming - event-driven
Computation via dataflow chains
Events are not directed; “addressable event sources”
Events are mere facts that can be observed
Emitted by changes in state (state machine)
Listeners attach to event sources and, in turn, react to them
Emitted locally
Reactive Systems - message-driven
Basis of comm across network/components; prefer asynch
Sender/Receiver are decoupled
Focus on resilience and elasticity via communication/communication inherent to distributed systems
Long-lived, addressable components
Waits for messages to be sent, then reacts to them
Messages are ”directed”
A message has a clear destination; “addressable recipient”
Common pattern is to use Messaging as a means to communicate Events across networks/components
Events within Messages
Examples
AWS Lambda, distributed streaming (Spark Streaming, Flink, Kafka, Akka Streams, Pub/Sub)
Cons
Lose some control
Messaging forces developers to deal with complex realities of distributed programming
Failure detection
Message delivery contracts (dupes, retry, ordering)
Consistency guarantee
Can’t hide behind “leaky” abstractions that pretend a network does not exist (EJB, XA, RPC, etc)
Devoid of bottlenecks or hot spots
Company size, complexity and age matter
Young companies (e.g. start-ups)
Less legacy overhead
Able to use newer technology; innovate, disrupt, find a niche
Fewer expensive, enterprise-class solutions; value through IP or a unique product
More likely to build cloud native (various reasons)
Staff typically has larger sphere of influence
Sometimes filled with Unicorns
Mid-size companies
May have legacy overhead
Complexity if growth came through acquisition rather than product evolution
Strategic use of Enterprise products
May be cloud native, or hybrid
Increased division of labor, defined roles, paying customers to keep happy
Large companies
Probably where the terms “technical debt” and “culture debt” came from
legacy overhead, complex environments
Fear of change, less room to fail – backed by valid business reasons
complexity due to both acquisition and product evolution
“Different” tech stacks
Enterprise products, prefers “supported” flavors of open source
A blend of on-premises, hybrid, cloud
More politics
Slower to see value from new technology; transformative change is more difficult; planning, budgeting, process, management structure
Matrixed division of labor, defined roles, paying customers to keep happy
Multiple software applications
Customer facing, revenue generating
Specialization to function
operation, development, maintenance activities
Data typically moved from system of record, into one or more centralized data “hubs”
OLAP / Data Warehouses
Data Lake, common use of Hadoop
Older applications are bad at interoperability
Decreases value to enterprise
Drop in ETL, ESB, SOA, iPaaS to do so ($$$)
Perhaps refactor service layers into more modern middleware
Trend toward ”real time”
The old, batch-oriented, application silos are too complex and just can’t do this well
Unlocks new capabilities
Products (and supporting software) operates in Real Time
Opens door to do things to increase customer satisfaction/retention
Save money, become more agile (speed to market) – productivity enhancer
Data flows to system that need it, acted on in real-time
Reduced TCO: auto scaling, resiliency, high availability, autonomic & automated
Analysts are not coders
They just want to connect a friendly BI tool (e.g. Tableau) to run their SQL against data.
Can’t expect people to switch careers or skill sets to accommodate new technology.
Use case reminder: millions of concurrent actors handling billions of messages, in real-time
Continuous data streams
Aka Fire Hose
Velocity AND volume - GB/hr, TB/hr
Old way: store → act → analyze
Say hello to Map Reduce!
Act in real-time
As things are happening
Loses value if stored first and queried later (HDFS, Redshift, etc.)
Uses technology proven to scale
Kafka, Storm, etc -> Delivery
Akka, Spark, Flink -> Act
send to destinations
Data Lake
Other systems
Fast Data is the obvious future
Many of us have been using these techs for years
Core techs
Kafka - open source or Confluent
Spark - open source or Databricks (a nice managed service in AWS)
Akka (and Akka-HTTP) (open source or Lightbend stack)
Alpakka
Streaming platform
Process messages when provided
Fault-tolerant storage
Pub/Sub capability
Popular use cases
Data pipelines (messaging)
between a source and destination
Real-time action / transformation
Metrics, weblogs, stream processing, event sourcing
A commit log
Spread across a cluster
Record streams stored in topics
Each record has a key, value, timestamp
Each topic has offsets and a retention policy
Producer API
To publish a record to Kafka
Consumer API
To subscribe to a topic(s)
Consumer groups
Handle records pushed to topics
Streams API
Stream processing
Consume from Stream An, do processing, publish to Stream Bn
E.g. aggregation, joining
Connector API
To build custom Producers/Consumers
Purpose built integration components
Limitations of older message providers
RabbitMQ, AMQ
Doesn’t scale well; complex
Need to use other abstractions for batching
Lacks replayability (reset offset, etc)
Limitations of log forwarders (Scribe, Flume)
Push architecture; high performance, scales well
Business logic clogs up these systems in endpoints; needs to push data fast
Built to push data to large sink (e.g. Hadoop)
Query at a later time; not real time
Supports Polyglot
Python, Go, .NET, Node.js, C/C++, etc
Interactive (ad-hoc) data mining (R, Excel, Searching, analyst queries)
Slow for anything real time
Job
a set of tasks to be executed as a result of an action
Stage
a set of tasks in a job that can be run in parallel
Task
an individual unit of work sent to a single executor
A DataFrame is a distributed collection of data organized into named columns
Analogous to a relational table, or data frame in R/Python (with richer optimizations)
Datasets were added in Spark 1.6 to combine the benefits of RDDs with Spark SQL’s optimized execution engine
Build datasets from JVM objects and then manipulate with functional transformations
Scalable, high-throughput, fault tolerant processing of real time streams
Microbatched. Think of it in terms of EDA (e.g. Esper), for “windowing” (etc.) vs. handling a single message/event.
Merge inbound data with historical data
Two primary metrics to monitor and tune:
Processing time (per batch)
Scheduling delay (processed upon arrival?)
Uses Kryo for serialization
who uses it?
Simplifies distributed programming
Akka handles/enables: concurrency/scalability/fault tolerance
Self Healing
Complex, easy to make mistakes
Adaptive load-balancing, cluster rebalancing, Actor migration
With a single unified
Programming Model
Managed Runtime
Open Source Distribution
Manage System Overload (backpressure)
Program to a higher level
No more shared state, state visibility, threads, locks, concurrent collections, thread notifications
Low-level concurrency is built into the plumbing; it becomes a simple workflow: just deal with messages
Increases CPU utilization, lowers latency, high throughput, scalable!
Superior, Proven model to detect and recover from errors
Results in tons of defensive code,
scattered throughout the codebase
entangled in your business logic
Encapsulates code like servlets or session beans; policy decisions separated from biz logic
The vehicle to create concurrent, scalable, fault-tolerant apps atop the fabric
About Actors
Encapsulated, decoupled
Managing own memory and behavior
Communicates asynchronously with non-blocking messages
Elastic – grow/shrink on demand
Hot deploy, change runtime behavior
Alternative to:
A thread
An object instance or component
Callback or listener
Singleton or service
Router, load-balancer or pool
Session bean or MDB
Out of process service
A Finite State Machine (FSM)
1. Create
Yes, creates new actor. From ActorSystem, then ActorRef.
Strong encapsulation of: state/behavior (indistinguishable), message queue
Everything is (usually) asynch (ask), lockless; fire & forget
(Think in terms of the object changing its type –
interface, protocol, implementation)
A busy actor can become an Actor Pool or Router!
Implement FSM (Finite State Machine)
Implement graceful degradation
Spawn empty workers that can ”Become” whatever the Master desires
Very useful. Limited only by your imagination.
Manage another Actor’s failures
or the person sitting next to you
Akka Streams API is decoupled from Reactive Streams interfaces
Implementation details to pass stream data between processing stages
Immutable building blocks / blueprints enabled for libraries, include
Source: something with exactly one output stream
Sink: something with exactly one input stream
Flow: something with exactly one input and one output stream
BidiFlow: something with exactly two input streams and two output streams that conceptually behave like two Flows of opposite direction
Graph: a packaged stream processing topology that exposes a certain set of input and output ports, characterized by an object of type Shape.
Built in backpressure capability
No stage can push downstream unless it received a pull beforehand
Difference between error and failure
Error is accessible within the stream as a data element (signaled via onNext)
Failure means the stream itself has collapsed (signaled via onError).
Want failure to propagate faster than data (essential, to deal with backpressure)
Data elements emitted before a failure can still be lost if the onError overtakes them
Recovery element acts as bulkhead to confine a stream collapse to a given region of stream topology, to isolate outside from impact of collapsed region (e.g. buffered elements)
Can expose an incoming connection in form of a Source instance
To start listening on network with Akka HTTP, create a Route and bind it to a port (similar syntax to Spray).
Backpressure on source? Akka HTTP stops consuming data from network; eventually leads to 0 TCP window – applying backpressure to sending party itself (e.g. a sensor)
Best practices
Libraries shall provide their users with reusable pieces,
i.e. expose factories that return graphs, allowing full compositionality
Avoid destruction of compositionality.
Express functionality of a library such that materialization can be done by user outside of library’s control.
Libraries may optionally and additionally provide facilities that consume and materialize graphs
Allows a library to provide convenience “sugar” for use cases
Data acted on after storage
Map Reduce, Hive, Spark trickery
Don’t get me started on “Lambda” architectures
Data not stored in a way that is optimized for data analysis
Avro & Parquet, for example
May be many different structures of data based on access path
Blissful ignorance of how/why data is used, governed, defined and secured
Does this sound like a good solution?
Data Lake solves the old problem of siloing data. Great, so now it’s all commingled.
Federated query? AWS Athena? Why move data if not necessary?
Is this the Light at the end of the Tunnel?
Or, another train coming down the tracks?
Enable analysis of the full variety and volume of data stored in the lake
Extracting maximum value out of the Data Lake requires customized management and integration
that are currently unavailable
from any single open-source platform or commercial product vendor.
Domain specification
A Data Lake customized for biomedical research would be significantly different from one tailored to financial services.
business-aware data-locating capability -- enables business users to find, explore, understand, and trust the data.
Search capability requires business ontologies, within which business terminology can be mapped to the physical data.
Self Service
Metadata Management
capture a robust set of attributes for every piece of content within the lake.
data lineage, data quality, and usage history are vital to usability.
automated metadata extraction, capture, and tracking facility. Without a high-degree of automated and mandatory metadata management, a Data Lake will rapidly become a Data Swamp.
Ingestion
New sources of external information are continually discovered by business users.
They need rapid on-boarding to avoid frustration and to monetize opportunities.
A configuration-driven, ingestion workflow mechanism can provide a high level of reuse, enabling easy, secure, and trackable content ingestion from new sources.
Interoperate
needs a supervisor that integrates and manages, when required, existing data management tools,
such as data profiling, data mastering and cleansing, and data masking technologies.