2. This is an overview of some interesting advanced
features of Scala. It is not meant to be a tutorial and
assumes that you are familiar with the key constructs
of the language.
Some of the examples are extracted from Scala for
Machine Learning (Packt Publishing).
3. Scala has a lot of features ...
Actors
Composed futures
F-bound
Reactive
Advanced functional programming?
... among them
5. Higher kind projection
Functors and monads are defined for single-type higher kinds,
M[_]. The problem is to define monadic composition for
objects belonging to categories that have two or more types,
M[_, _] (e.g. Function1[U, V]).
Scala supports functorial and monadic operations for multi-type
categories using higher kind type projection.
6. Higher kind projection
Let us consider a covariant functor F that applies a morphism f
within a category C defined as
$\forall X, Y \in C, \quad f: X \to Y$
$F(X \to Y) = F(X) \to F(Y)$
The definition of a functor in Scala relies on a single type
higher kind M
(*) Functors are important concepts in algebraic topology used
in defining algebra for tensors for example.
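As an illustration of a functor defined on a single-type higher kind M[_], here is a minimal sketch. The trait name Functor, the Option instance and the object name FunctorSketch are assumptions made for this example, not the code from the original slide.

```scala
import scala.language.higherKinds

object FunctorSketch {
  // A functor is parameterized by a single-type higher kind M[_]
  trait Functor[M[_]] {
    def map[U, V](m: M[U])(f: U => V): M[V]
  }

  // Hypothetical instance for Option, used only to exercise the trait
  val optionFunctor: Functor[Option] = new Functor[Option] {
    def map[U, V](m: Option[U])(f: U => V): Option[V] = m.map(f)
  }

  // Applies the morphism Int => Int inside the Option container
  val doubled: Option[Int] = optionFunctor.map(Option(21))(_ * 2)
}
```

The morphism f: U => V is lifted to M[U] => M[V], which is the covariant case described on this slide.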
7. Higher kind projection
How can we define a functor for classes that have multiple
parameterized types?
Let's consider the definition of a tensor using Scala Function1.
The covariant CoVector (resp. contravariant Vector) vectors are
created through a projection onto the covariant (resp.
contravariant) parameterized type T of Function1.
8. Higher kind projection
The implementation of the functor for the Vector type uses the
projection of the higher kind Function1 to its covariant
component by accessing the inner type Vector of Tensor with the # operator.
The map applies covariant composition, using compose of Function1.
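The projection described above can be sketched as follows. The layout of Tensor, the class name VectorFunctor and the sample functions are assumptions for illustration; only the use of the # projection and of Function1.compose comes from the slide.

```scala
import scala.language.higherKinds

object VectorFunctorSketch {
  trait Functor[M[_]] {
    def map[U, V](m: M[U])(f: U => V): M[V]
  }

  // Two projections of the two-type kind Function1[-A, +B] onto a single type
  trait Tensor[T] {
    type Vector[X] = Function1[T, X]    // varies on the covariant output type
    type CoVector[X] = Function1[X, T]  // varies on the contravariant input type
  }

  // Tensor[T]#Vector projects the inner type Vector out of Tensor[T],
  // which yields the single-type kind expected by Functor
  class VectorFunctor[T] extends Functor[Tensor[T]#Vector] {
    override def map[U, V](vu: Function1[T, U])(f: U => V): Function1[T, V] =
      f.compose(vu)
  }

  val length: String => Int = _.length
  val functor = new VectorFunctor[String]
  val isEven: String => Boolean = functor.map(length)(_ % 2 == 0)
}
```

Here map turns a String => Int into a String => Boolean by composing the morphism Int => Boolean after it.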
10. Contravariant functors
Some categories of objects, such as covariant tensors or
functions parameterized on the input or contravariant
type (i.e. T => Function1[T, U] for a given type U),
require the order of morphisms to be reversed.
Morphisms on a contravariant argument type are transported
through contravariant functors.
11. Contravariant functors
Let us consider a contravariant functor F that applies a
morphism f within a category C defined as
$\forall X, Y \in C, \quad f: X \to Y$
$F(X \to Y) = F(Y) \to F(X)$
The definition of a contravariant functor in Scala relies on a
single type higher kind M
12. Contravariant functors
The implementation of the contravariant functor for the CoVector
type uses the projection of the higher kind Function1 to its
contravariant component by accessing the inner type CoVector of
Tensor with the # operator.
The map applies contravariant composition, using andThen of Function1.
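A contravariant counterpart can be sketched along the same lines. The names CoFunctor, CoVectorFunctor and the sample functions are hypothetical; the reversed morphism V => U and the use of andThen follow the description above.

```scala
import scala.language.higherKinds

object CoVectorFunctorSketch {
  // A contravariant functor reverses the direction of the morphism f
  trait CoFunctor[M[_]] {
    def map[U, V](m: M[U])(f: V => U): M[V]
  }

  trait Tensor[T] {
    type CoVector[X] = Function1[X, T]  // varies on the contravariant input type
  }

  class CoVectorFunctor[T] extends CoFunctor[Tensor[T]#CoVector] {
    override def map[U, V](vu: Function1[U, T])(f: V => U): Function1[V, T] =
      f.andThen(vu)
  }

  val norm: Double => Double = (x: Double) => math.abs(x)
  val functor = new CoVectorFunctor[Double]
  // The morphism String => Double is applied before the covector
  val normOfLength: String => Double =
    functor.map(norm)((s: String) => s.length.toDouble)
}
```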
14. Monadic composition
It is quite common to compose functions, methods or data
transformations, iteratively or recursively.
Monads extend the concept of functor to support the
composition (or chaining) of computations.
15. Monadic composition
Monads are abstract structures in algebraic topology related to
category theory.
A category C is a structure which has
- objects {a, b, c, ...}
- morphisms or maps on objects f: a->b
- composition of morphisms
f: a->b, g: b->c => g o f: a->c
Monads enable the "monadic" composition or chaining of
functions or computations on a single type argument.
16. Monadic composition
Let's consider the definition of a kernel function Kf as the composition
of 2 functions g o h.
$K_f(\mathbf{x}, \mathbf{y}) = g\left(\sum_i h(x_i, y_i)\right)$
We create a monad to generate any kind of kernel function Kf by
composing their components g: g1 o g2 o ... o gn o h
17. Monadic composition
A monad extends a functor with a binding method (flatMap).
The monadic definition of the kernel function component h
19. Monadic composition
The monadic composition consists of chaining the flatMap invocations
with the functor method, map, which preserves morphisms on kernel functions.
The for comprehension is syntactic sugar for the iterative
monadic composition.
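The chaining of flatMap and map through a for comprehension can be sketched as follows. This is a simplified illustration under stated assumptions: the class KF, the implicit class KFMonad and the type aliases F1 and F2 follow the names used in the notes, but their bodies, the two sample kernels and the choice to keep the h function of the last kernel are guesses, not the original implementation.

```scala
object KernelMonadSketch {
  type F1 = Double => Double
  type F2 = (Double, Double) => Double

  // Kernel function K(x, y) = g(sum_i h(x_i, y_i))
  case class KF[G](g: G, h: F2) {
    def metric(x: Seq[Double], y: Seq[Double])(implicit ev: G <:< F1): Double = {
      val gf: F1 = ev(g)
      gf(x.zip(y).map { case (a, b) => h(a, b) }.sum)
    }
  }

  // Kernel functions "inherit" map and flatMap through this implicit class
  implicit class KFMonad[G](kf: KF[G]) {
    def map[H](f: G => H): KF[H] = KF(f(kf.g), kf.h)
    def flatMap[H](f: G => KF[H]): KF[H] = f(kf.g)
  }

  // Hypothetical polynomial and radial basis function kernels
  val polynomial = KF[F1]((x: Double) => math.pow(1.0 + x, 2), (a, b) => a * b)
  val rbf = KF[F1]((x: Double) => math.exp(-0.5 * x), (a, b) => { val d = a - b; d * d })

  // Desugars to polynomial.flatMap(g1 => rbf.map(g2 => g1.andThen(g2)))
  val composed: KF[F1] = for {
    g1 <- polynomial
    g2 <- rbf
  } yield g1.andThen(g2)

  val value: Double = composed.metric(Seq(1.0, 2.0), Seq(1.0, 2.0))
}
```

In this sketch the composed kernel applies the g of the polynomial kernel, then the g of the radial basis function kernel, to the sum computed with the h of the last kernel in the for comprehension.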
21. Streams
Streams reduce memory consumption by allocating and
releasing chunks of data (slices of a time series) while allowing
reuse of intermediate results.
Some problems require processing very large data
sets of unknown size, for which the execution may have to be
aborted or re-applied.
22. Streams
The large data set is converted into a stream then broken
down into manageable slices. The slices are instantiated,
processed (e.g. by a loss function) and released back to the
garbage collector, one at a time.
[Figure: a data stream X0, X1, ..., Xn, ..., Xm on the heap. Each slice Xi is allocated with .take, traversed by the loss function $\frac{1}{2m}\sum_n |y_n - f(w|x_n)|^2 + \lambda \|w\|^2$, then released to the garbage collector with .drop.]
23. Views and streams
Slices of NOBS observations are allocated (take), processed,
then released (drop), one at a time.
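The take/process/drop cycle can be sketched with a tail recursion over a Stream. The observation type, the squared-error loss and the data generator below are assumptions made for the example; the stream is declared as a def, one of the alternatives to a weak reference, so that no value holds on to its head.

```scala
import scala.annotation.tailrec

object StreamLossSketch {
  type Obs = (Double, Double)   // hypothetical (x, y) observation

  // Sum of squared errors for one slice of observations
  def sliceLoss(slice: List[Obs], f: Double => Double): Double =
    slice.map { case (x, y) => val e = y - f(x); e * e }.sum

  // Tail recursion: take a slice, process it, then drop it from the stream
  @tailrec
  def loss(data: Stream[Obs], f: Double => Double, step: Int, acc: Double = 0.0): Double =
    if (data.isEmpty) acc
    else loss(data.drop(step), f, step, acc + sliceLoss(data.take(step).toList, f))

  // Declared as a def so that no val keeps a reference to the head of the stream
  def observations: Stream[Obs] =
    Stream.iterate(0.0)(_ + 1.0).map(x => (x, 2.0 * x + 1.0)).take(1000)

  // The model 2x misses each label 2x + 1 by exactly 1
  val total: Double = loss(observations, x => 2.0 * x, step = 100)
}
```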
24. Views and streams
The reference streamRef has to be weak in order to have the slices
garbage collected. Otherwise the memory consumption increases
with each new batch of data.
(*) Alternatives: define strmRef as a def or use StreamIterator
25. Views and streams
Comparing list, stream and stream with weak references.
[Figure: comparison of the three strategies, with the operating zone highlighted]
27. Views
Scientific computations require chaining complex data
transformations on large data sets. There is not always a
need to process all elements of the data set.
Scala allows the creation of a view on collections that are
the result of a data transformation. The elements are
instantiated only when needed.
28. Views
Accessing an element of the list requires allocating
the entire list in memory.
Accessing an element of the view requires allocating
only this element in memory.
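The lazy behavior of a view can be illustrated with a counter that records how many elements are actually transformed. The names below are hypothetical and the transformation is arbitrary.

```scala
object ViewSketch {
  var evaluated = 0   // counts the elements that are actually transformed

  def transform(x: Int): Double = { evaluated += 1; math.sqrt(x.toDouble) }

  val data = 0 until 1000000

  // No element is transformed when the view is created
  val lazyResult = data.view.map(transform)

  // Only the requested element is computed
  val tenth: Double = lazyResult(10)
  val count: Int = evaluated
}
```

Calling data.map(transform) instead would apply the transformation to the whole collection before the element could be accessed.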
30. Type classes
Scala library classes cannot always be sub-classed.
Wrapping a library component in a helper class clutters the
design.
Type classes extend class functionality without
cluttering namespaces (an alternative to helper classes).
The purpose of reusability goes beyond refactoring code.
It includes leveraging existing, well understood concepts
and semantics.
31. Type classes
Let's consider the definition of a tensor as being either a vector
or a covector.
Let's extend the concept of tensor with a metric. A metric is computed as
the inner product or composition of a CoVector and a Vector.
The computation is implemented by the method Metric.apply
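One possible way to express this with a type class is sketched below. Only the names Tensor, Vector, CoVector and Metric.apply come from the slide; the field names, the helper metric and the sample values are assumptions.

```scala
object TensorMetricSketch {
  // A tensor is either a vector or a covector (a linear form on vectors)
  sealed trait Tensor
  case class Vector(components: Seq[Double]) extends Tensor
  case class CoVector(form: Seq[Double] => Double) extends Tensor

  // Type class: the metric is declared outside the Tensor hierarchy
  trait Metric[C, V] {
    def apply(co: C, v: V): Double
  }

  // Instance computing the inner product of a covector and a vector
  implicit val innerProduct: Metric[CoVector, Vector] = new Metric[CoVector, Vector] {
    def apply(co: CoVector, v: Vector): Double = co.form(v.components)
  }

  // Hypothetical helper that resolves the instance implicitly
  def metric[C, V](co: C, v: V)(implicit m: Metric[C, V]): Double = m(co, v)

  val v = Vector(Seq(1.0, 2.0, 3.0))
  val co = CoVector(xs => xs.zip(Seq(0.5, 0.5, 0.5)).map { case (a, b) => a * b }.sum)
  val value: Double = metric(co, v)
}
```

The Tensor classes stay untouched: the metric is added through the implicit instance rather than through sub-classing or a wrapper.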
34. Stacked mixins models
Scala stacked traits and abstract values preserve the core
formalism of mathematical expressions.
Traditional programming languages compare unfavorably to
scientific languages such as R because of their inability
to follow a strict mathematical formalism:
1. Variable declaration
2. Model definition
3. Instantiation
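The three steps can be sketched with abstract values and stacked traits, using the model f(x, y) = x + y mentioned in the notes. The trait and object names are hypothetical.

```scala
object FormalismSketch {
  // 1. Variable declaration: x and y are declared but not yet bound
  trait Variables {
    val x: Double
    val y: Double
  }

  // 2. Model definition: f(x, y) = x + y
  trait Model { self: Variables =>
    def f: Double = x + y
  }

  // 3. Instantiation: x = 5, y = 7
  val instance = new Variables with Model {
    val x = 5.0
    val y = 7.0
  }

  val result: Double = instance.f
}
```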
37. Stacked mixins models
Building machine learning apps requires configurable,
dynamic workflows.
Leverage mixins, inheritance and abstract values to create
models and weave data transformations.
Factory design patterns have been used to model dynamic
systems (GoF). Dependency injection has gained popularity
for creating configurable systems (e.g. Spring framework).
38. Stacked mixins models
Multiple models and algorithms are typically evaluated by
weaving computation tasks.
A learning platform is a framework that
• Defines computational tasks
• Wires the tasks (data flow)
• Deploys the tasks (*)
Overcomes limitations of monadic composition (3 levels of
dynamic binding...)
(*) Actor-based deployment
39. Stacked mixins models
Even the simplest workflow, defined as a pipeline of data transformations,
requires a flexible design ...
40. Stacked mixins models
Summary of the 3 configurability layers of the Cake pattern
1. Given the objective of the computation, select the best
sequence of modules/tasks (e.g. Modeling: Preprocessing +
Training + Validating)
2. Given the profile of data input, select the best data
transformation for each module (e.g. Data preprocessing:
Kalman, DFT, Moving average...)
3. Given the computing platform, select the best
implementation for each data transformation (e.g. Kalman:
KalmanOnAkka, Spark...)
44. Stacked mixins models
A simple clustering workflow requires a preprocessor and a
reducer. The computation sequence exec transforms a time
series of elements of type U and returns a time series of
type W as an option.
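A clustering workflow of this kind can be sketched by stacking two modules, each declaring its component as an abstract value. The module names, the alias TS and the two toy transformations (a moving average and a down-sampling step) are assumptions made for illustration.

```scala
object ClusteringSketch {
  type TS[T] = Vector[T]   // hypothetical alias for a time series

  trait PreprocessingModule[U, V] {
    val preprocessor: Preprocessor   // abstract value, bound at instantiation
    trait Preprocessor { def execute(xt: TS[U]): Option[TS[V]] }
  }

  trait ReducerModule[V, W] {
    val reducer: Reducer
    trait Reducer { def execute(xt: TS[V]): Option[TS[W]] }
  }

  // The workflow stacks the two modules and chains their transformations
  trait Clustering[U, V, W] extends PreprocessingModule[U, V] with ReducerModule[V, W] {
    def exec(xt: TS[U]): Option[TS[W]] =
      preprocessor.execute(xt).flatMap(v => reducer.execute(v))
  }

  val clustering = new Clustering[Double, Double, Double] {
    // Hypothetical moving average over a window of 2 observations
    val preprocessor: Preprocessor = new Preprocessor {
      def execute(xt: TS[Double]): Option[TS[Double]] =
        if (xt.size < 2) None
        else Some(xt.sliding(2).map(w => w.sum / w.size).toVector)
    }
    // Hypothetical reduction that keeps one observation out of two
    val reducer: Reducer = new Reducer {
      def execute(xt: TS[Double]): Option[TS[Double]] =
        Some(xt.grouped(2).map(_.head).toVector)
    }
  }

  val result: Option[TS[Double]] = clustering.exec(Vector(1.0, 3.0, 5.0, 7.0))
}
```

Swapping the preprocessor or the reducer only requires binding a different value at instantiation; the exec sequence itself does not change.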
45. Stacked mixins models
A model is created by processing the original time series of type TS[T]
through a preprocessor, a training supervisor and a validator.
48. Magnet pattern
Method overloading in Scala has limitations:
• Type erasure in the JVM causes collisions between the argument
types of overloaded methods
• Overloaded methods cannot be lifted into a function
• Code may be unnecessarily duplicated
The magnet pattern overcomes these limitations by
encapsulating the return type and redefining the overloaded
methods as implicit functions.
49. Magnet pattern
Let's consider the following three incarnations of the method test.
These methods have different return types. The first and last
methods conflict because of type erasure on T => List[Double]
50. Magnet pattern
Step 1: Define generic return type and constructor
Step 2: Implement the test methods as implicits
51. Magnet pattern
Step 3: Implement the lifted function test as follows
The first call invokes the implicit fromTN and the second
triggers the implicit fromT.
The return type is inferred from the type of argument
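The three steps can be sketched as follows. The implicit names fromT and fromTN come from the slide, but their signatures and bodies, as well as the trait name Magnet, are assumptions: here fromT sums a list and fromTN keeps its first n elements.

```scala
import scala.language.implicitConversions

object MagnetSketch {
  // Step 1: generic return type and constructor
  sealed trait Magnet {
    type Result
    def apply(): Result
  }

  // Step 2: each overloaded variant becomes an implicit conversion
  implicit def fromT(xs: List[Double]): Magnet { type Result = Double } =
    new Magnet {
      type Result = Double
      def apply(): Double = xs.sum
    }

  implicit def fromTN(arg: (List[Double], Int)): Magnet { type Result = List[Double] } =
    new Magnet {
      type Result = List[Double]
      def apply(): List[Double] = arg._1.take(arg._2)
    }

  // Step 3: a single method whose return type depends on the magnet
  def test(magnet: Magnet): magnet.Result = magnet()

  val first: List[Double] = test((List(1.0, 2.0, 3.0), 2))  // resolved through fromTN
  val second: Double = test(List(1.0, 2.0, 3.0))            // resolved through fromT
}
```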
53. View bound
Context bounds cannot be used to bind the parameterized
type of a generic class to a primitive type.
Scala view bounds allow developers to create
classes with parameterized types associated with a Scala or
Java primitive type.
54. View bound
Let's consider a class whose parameterized type can be
manipulated as a Float.
A context bound is not permissible.
Constraining the type with an upper bound Float does not
work as Float is a final class.
55. View bound
The solution is to bind the class type to a Float using an
implicit conversion (or view).
The <% directive is the short notation for an implicit conversion parameter.
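The equivalence can be sketched as follows. The class name Scaled and the String conversion are hypothetical; the view bound form is shown only as a comment because it is specific to Scala 2, while the expanded form with an implicit parameter is what the code actually compiles.

```scala
object ViewBoundSketch {
  // With a view bound the class would be declared as
  //   class Scaled[T <% Float](x: T)
  // which is shorthand for the implicit conversion parameter below.
  class Scaled[T](x: T)(implicit toFloat: T => Float) {
    def half: Float = toFloat(x) * 0.5f
  }

  // Hypothetical view from a type that is not a Float
  implicit val stringToFloat: String => Float = _.toFloat

  val scaled = new Scaled("2.5")
  val fromString: Float = scaled.half
}
```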
57. F-Bound polymorphism
F-Bound polymorphism is a parametric type polymorphism
that constrains the subtypes to themselves using bounds.
It is important to write code that catches errors at compile
time. How can we enforce type integrity in subclasses?
58. F-Bound polymorphism
Let's create a trait that defines a discriminative learning model
with methods to manipulate data.
The classes Svm and Mlp implement the Discriminative trait.
The problem is that nothing prevents the creation of a class Nnet
that impersonates an Svm class.
59. F-Bound polymorphism
One solution is to restrict (or bound) the type to a Discriminative
class.
It prevents a new class from inserting itself into the hierarchy,
but does not guarantee the type integrity for existing classes.
60. F-Bound polymorphism
The self reference guarantees the integrity of each existing
and new subclass. F-Bound polymorphism is a self-referenced
bound polymorphism.
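The combination of the bound and the self reference can be sketched as follows. The names Discriminative, Svm, Mlp and Nnet come from the slides; the members weight and merge are hypothetical.

```scala
object FBoundSketch {
  // The bound T <: Discriminative[T] and the self type force each
  // subclass to pass itself as the type argument
  trait Discriminative[T <: Discriminative[T]] { self: T =>
    def weight: Double
    def merge(other: T): T
  }

  final class Svm(val weight: Double) extends Discriminative[Svm] {
    def merge(other: Svm): Svm = new Svm(weight + other.weight)
  }

  final class Mlp(val weight: Double) extends Discriminative[Mlp] {
    def merge(other: Mlp): Mlp = new Mlp(weight + other.weight)
  }

  // Rejected at compile time, because Nnet does not conform to the self type Svm:
  // class Nnet(val weight: Double) extends Discriminative[Svm] { ... }

  val merged: Svm = new Svm(1.0).merge(new Svm(2.0))
}
```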
61. Higher kind projection
Contravariant functors
Monadic composition
Streams
Views
Type classes
Stacked mixins models
Cake pattern
Magnet pattern
View bounds
F-bound polymorphism
Data flow control
Continuation passing style
62. Data flow back pressure
A data flow control mechanism handling back pressure
on bounded mailboxes of upstream actors.
Scala actors provide a reliable way to deploy workflows
in a distributed environment. However, some nodes
may experience slow processing and create performance
bottlenecks.
63. Data flow back pressure
Actor-based workflow has to consider
- Cascading failures => supervision strategy
- Cascading bottleneck => Mailbox back-pressure strategy
[Figure labels: Workers; Router, Dispatcher, ...]
64. Data flow back pressure
Message passing scheme to process various data streams
with transformations.
[Figure: Controller, Workers and Watcher actors processing a Dataset; the messages Load, Compute, Completed, GetStatus and Status are exchanged through bounded mailboxes]
65. Data flow back pressure
Worker actors process the data chunk msg.xt sent by the
controller with the transformation msg.fct.
Message sent by the controller to trigger the computation
66. Data flow back pressure
The Watcher actor monitors the message queues and reports to the controller with
a Status message.
The GetStatus message sent by the controller has no payload.
67. Data flow back pressure
The Controller creates the workers, a bounded mailbox for each worker
actor (msgQueues) and the watcher actor.
68. Data flow back pressure
The Controller loads the data sets per chunk upon receiving the
message Load from the main program. It processes the results of
the computation from the workers (Completed) and throttles the
input to the workers for each Status message.
69. Data flow back pressure
The handler for the Load message is implemented as a loop that creates data chunks
whose size is adjusted according to the load computed by the
watcher and forwarded to the controller in a Status message.
70. Data flow back pressure
A simple throttle increases or decreases the size of the batch of
observations given the current load and a specified watermark.
Selecting a faster/slower and less/more accurate version of the algorithm
can also be used in the regulation strategy.
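A throttle of this kind can be sketched as a pure function. The signature, the ratio watermark / load and the bounds minSize and maxSize are assumptions made for illustration; the notes only state that the batch size is updated with a simple ratio relative to the watermark.

```scala
object ThrottleSketch {
  // Hypothetical throttle: scales the batch size by the ratio watermark / load
  // and keeps the result within [minSize, maxSize]
  def throttle(load: Double, batchSize: Int, watermark: Double,
               minSize: Int = 1, maxSize: Int = 4096): Int =
    if (load <= 0.0) math.min(batchSize * 2, maxSize)
    else {
      val resized = math.round(batchSize * watermark / load).toInt
      math.max(minSize, math.min(maxSize, resized))
    }

  // Load below the watermark: the batch size increases
  val increased: Int = throttle(load = 0.25, batchSize = 128, watermark = 0.5)
  // Load above the watermark: the batch size decreases
  val decreased: Int = throttle(load = 1.0, batchSize = 128, watermark = 0.5)
}
```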
71. Data flow back pressure
The feedback control loop adjusts the size of the batches given the
load in the mailboxes and the complexity of the computation.
72. Data flow back pressure
Notes
• The feedback control loop should be smoothed (moving
average, Kalman...)
• A larger variety of data flow control actions is needed, such as
adding more workers, increasing queue capacity, ...
• The watch dog should handle dead letters in case of a
failure of the feedback control or the workers.
• Reactive streams, introduced in Akka 2.2+, have
sophisticated TCP-based propagation and back pressure
control flows
74. Delimited continuation
Continuation Passing Style (CPS) is a technique that
abstracts a computation unit as a data structure in order to
control the state of a computer program, workflow or
sequence of data transformations.
Continuations are used to "jump" to a method that
produces a call to the current method. They can be
regarded as a "functional GOTO".
75. Delimited continuation
A data transformation (or computation unit) can be
extended (continued) with another transformation known
as a continuation. The continuation is provided as an argument
of the original transformation.
Let's consider the following workflow.
The first workflow is not a continuation, the second is.
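The difference between the two styles can be sketched with a toy transformation. The function names and the normalization step are hypothetical; the sketch uses plain functions only and does not rely on the shift/reset control delimiters.

```scala
object ContinuationSketch {
  // Toy data transformation: scales the values by their maximum
  def normalize(xs: Vector[Double]): Vector[Double] = {
    val max = xs.max
    xs.map(_ / max)
  }

  // Direct style: the caller decides what happens to the result
  def transform(xs: Vector[Double]): Vector[Double] = normalize(xs)

  // Continuation passing style: the next step is passed as an argument
  def transformCps[R](xs: Vector[Double])(continuation: Vector[Double] => R): R =
    continuation(normalize(xs))

  val direct: Double = transform(Vector(1.0, 2.0, 4.0)).sum
  val cps: Double = transformCps(Vector(1.0, 2.0, 4.0))(_.sum)
}
```

Both calls produce the same value; in the second one the summation is handed to the transformation as its continuation.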
76. Delimited continuation
A delimited continuation is a section of the workflow that
is reified into a function returning a value. This technique
relies on control delimiters (shift/reset) to make the
continuation composable and reusable.
77. More Scala nuggets...
• Domain specific language
• Reactive streams
• Back-pressure strategy using connection state
Wait a minute, there is more...
Editor's notes
Context of the presentation:
The transition from Java and Python to Scala is not that easy: it goes beyond selecting Scala for its obvious benefits.
- support for functional concepts
- leverage of open source libraries and frameworks if needed
- fast and distributed enough to handle large data sets
Scala was the most logical choice.
Scientific programming may very well involve different roles in a project:
Mathematicians for formulas
Data scientists for data processing and modeling
Software engineers for implementation
DevOps and performance engineers for deployment in production
In order to ease the pain, we tend to learn/adopt Scala incrementally within a development team. The problem is that you end up with an inconsistent code base with different levels of quality, and the team develops a somewhat negative attitude toward the language.
The solution is to select a list of problems or roadblocks (in our case, machine learning) and compare the solution in Scala with Java, Python ... (you sell the outcome, not the process).
Presentation
A set of diverse Scala features or constructs that led the entire team to agree that Scala is a far better solution than Python or Java.
Disclaimer...
Being an object oriented and functional language, Scala has a lot of features and powerful constructs to choose from...
Here is a list of some of the features of Scala that are particularly valuable for writing scientific workflows, machine learning algorithms and complex analytics solutions.
Geometric entities on a differential (Riemann) manifold are defined as tensors. A tensor can be
Covariant
Contravariant
A bilinear form such as a tensor product
Inner product
n-differential forms
...
Some useful references on the theory of categories and monads:
1. Monads are Elephants Part 2, J. Iry, One Div Zero blog, 2007 http://james-iry.blogspot.com/2007/10/monads-are-elephants-part-2.html
2. Monad Design for the Web, §7 A Review of Collections as Monads, L.G. Meredith, Artima, 2012
For a better understanding of kernel functions in machine learning:
Introduction to Machine Learning, §Nonparametric Regression: Smoothing Models, E. Alpaydin, MIT Press, 2007
A Short Introduction to Learning with Kernels, B. Schölkopf (Max Planck Institut für Biologische Kybernetik), A. Smola (Australian National University), 2005 http://alex.smola.org/papers/2003/SchSmo03c.pdf
The purpose here is to generate and experiment with any kind of explicit kernel by defining and composing two functions, g and h.
A function h operates on each feature or component of the vector.
A function g is the transformation of the dot product of the two vectors. The dot product is computed by applying the function h to all the elements and computing the sum.
The "dot" product K is computed by traversing the two observations (vectors of features), computing the sum and finally applying the g transform. The type variable is the type of the function g (F1 = Double => Double).
Once the functor is defined, the monad is created by adding the flatMap method. The monad, KFMonad, which takes a kernel function as argument, is defined as an implicit class so that kernel functions "inherit" the monadic methods.
The map and flatMap transformations apply to the g function, that is the transformation on the inner product.
The flatMap method is implemented by creating a new Kernel and applying the transformation to only one of the components of the Kernel function: function h (in red). This "partial" monadic operation is good enough for building Kernel functions on the fly.
Kernel functions that project the inner product to the manifold for non-linear models belong to the family of exponential functions.
Polynomial functions and radial basis functions are two of the most commonly used kernel functions. Note: The source code is shown here to illustrate the fact that the implementation in any other language would be a lot messier and would not fit in any of those slides.
Finally we can chain flatMap and map to compose two kernel functions and compute the dot/inner product of the resulting kernel function.
Note that the composed kernel function uses the h function of the last invocation in the for instruction.
The method does not expose the functor or KF classes that wrap the components of the kernel function.
Real-time streaming is becoming popular (e.g. Apache Spark streaming library, Akka reactive streams...). Short of using one of these frameworks, you can create a simple streaming mechanism for large data sets.
Streams vs. Iterator: an iterator does not allow you to dynamically select the chunk of memory or to preserve it, if necessary, for future computation.
It is not uncommon to have to train a model as labeled data becomes available (online training). In this case, the size of the data set can be very large and unknown. The processing of the data would result in high memory consumption and a heavy load on the garbage collector.
Finally, traversing the entire data set (i.e. allocating a large memory chunk) may not even be needed, as some computations may abort if some condition is met.
Scala's streams can help!
In order to minimize the memory footprint, two actions have to take place:
Allocate a slice of the data set (memory chunk) from the heap using the take method.
Release the memory chunk back to the garbage collector through the drop method.
This example is taken from "Scala for Machine Learning", Packt Publishing.
Most machine learning algorithms consist of minimizing a loss function or maximizing a likelihood. For each iteration (or recursion) the cumulative loss is computed and passed to the optimizer (e.g. gradient descent or a variant) to update the model parameters.
Each slice of n observations is allocated, processed by the loss function, then released back to the garbage collector.
The Loss class has a single method, exec, that traverses the stream. Once again, the loss is computed using a tail recursion.
An observation is defined as y = f(x), where x is the feature set (containing, for instance, the age, ethnicity and body temperature of a patient) and y is the label value, such as the diagnosed disease. The tail recursion allocates the next slice of STEP observations through the take method, computes the loss, nextLoss, then drops the slice. The reference is recursively redefined as the reference to the remaining stream.
The problem is that the garbage collector cannot reclaim the memory because the first reference to the stream is created outside the recursion. The solution is to declare the reference to the stream as weak so that the chunks of memory associated with the slices/batches of observations already processed can be reclaimed.
The reference to the stream is created as a weak Java reference to an instance created by the stream constructor, Stream.iterate.
In this case, the weak reference has been used to show that Java concepts are still relevant.
Let's compare the memory consumption of three strategies to compute the loss function on a very large dataset.
A list
A stream with standard reference
A stream with weak reference
In the first scenario, and as expected, the memory for the entire data set is allocated before processing. The memory requirement for the stream with a strong reference increases each time a new slice is instantiated, because the memory block is held by the reference to the original stream. Only the stream with a weak reference guarantees that only the memory for a slice of STEP observations is needed through the entire execution.
The first thing to come to mind in creating a complex system from existing objects (or classes) is a factory design pattern.
Design patterns were introduced by the "Gang of Four" in the eponymous "Design Patterns: Elements of Reusable Object-Oriented Software" some 20 years ago... The catalog of design patterns includes Builder, Prototype, Factory Method, Composite, Bridge and, obviously, Singleton.
Those patterns are not very convenient for weaving data transformations (these transformations being defined as classes or interfaces). This is where dependency injection, popularized by the Spring framework, comes into play.
Beyond composition and inheritance, Scala enables us to implement and chain data transformations/reductions by stacking the traits that declare these transformations or reductions.
The implementation in Scala perfectly matches the universal mathematical formalism.
Here is another example:
Declaration of variables: $x \in \mathbb{R}, y \in \mathbb{R}$
Declaration of the model: $f(x, y) = x + y$
Instantiation of the variables: $x = 5, y = 7$; $f(5, 7) = 12$
We briefly mentioned that the for comprehension can be used to chain/stack data transformations. Dependency injection, sometimes referred to as the Cake pattern, provides a very flexible approach to creating workflows dynamically.
Note: As for the 3rd point, the deployment of tasks usually involves an actor-based (non-blocking) distributed architecture such as Akka and Spark. We will mention it briefly later in this presentation when introducing the mailbox back-pressure mechanism.
Let's look at the Training module/task as an example. The task of training a model is executed by a Supervisor instance that can be either a support vector machine or a multi-layer perceptron, in this simplistic case. Each of these two "supervisors" can have several implementations (single host, distributed through a low-latency network, ...)
Once defined, the modules are to be woven/chained by making sure that the output of a module/task matches the input of the subsequent task.
Notes:
The training module can be broken down further into generative and discriminative models.
Real-world applications are significantly more complex and would include REST services, DAOs to access relational databases, caches...
The terms "module", "tasks" or "computational tasks" are used interchangeably in this section.
Let's consider the Preprocessing module (or task) implemented as a trait.
Preprocessing of a data set is performed by a processor of type Preprocessor that is defined at run time. Therefore the preprocessor has to be declared as an abstract value.
The three preprocessors defined in the preprocessing module are the Kalman filter, Moving Average (MovAv) and Discrete Fourier filter (DFTF). These inner classes act as adapters or stubs to the actual implementation of those algorithms.
Here is an implementation of the Kalman and Discrete Fourier transform band-pass filters.
The 2 inner classes, Kalman and DFTF, act as adapters or stubs to the actual implementation of those algorithms. This allows an implementation to come in multiple versions. For instance, filtering.Kalman is a trait with several implementations of the algorithm (single host, distributed, using Spark...)
Such a design allows us to
Select the type of preprocessing algorithm within the Preprocessing module or namespace
Select the implementation of this particular type of algorithm/preprocessor in the filtering package
From the data management perspective, Clustering implements two consecutive data transformations: preprocessing and dimension reduction.
Modeling workflow is created by chaining an implementation of the filter, training and validator, all selected at run-time.
Modeling is therefore implemented as a stack of 3 traits, each representing a transformation or reduction on data sets.
Computational tasks related to machine learning can be complex and lengthy. The process should be able to select the appropriate data flow (or sequence of data transformations or reductions) at run time, according to the state of the computation.
In this simple case, a clustering task is triggered if anomalyDetection is needed; otherwise the training of a model is launched.
These conditional execution paths are important for complex analyses or lengthy computations that require unattended execution (e.g. overnight or over the weekend).
Note: The overrides of the abstract values for the Modeling workflow are omitted here for the sake of clarity.
Summary: This factory pattern operates on 3 levels of componentization. Dynamic selection of
1- Workflow or sequence of tasks according to the objective of the computation (i.e. Clustering => Preprocessing)
2- Task processing algorithm according to the data (i.e. Preprocessing => Kalman filter)
3- Implementation of task processing according to the environment (i.e. Kalman filter => Implementation on Apache Spark)
The objective is to avoid bottlenecks in the computation data flow, which would result in overflowing the actors' mailboxes/local buffers.
A strategy to control the flow (or back-pressure flow) is needed to regulate the data flow across all modules.
This example uses a back-pressure handling mechanism that consists of monitoring bounded mailboxes. This is a simplistic approach to flow control, described for the sake of illustrating the concept. As we will see later, there is a far more effective mechanism to deal with back-pressure.
We mentioned earlier that a learning platform requires implementing, wiring and deploying tasks. Akka, or frameworks derived from Akka, are commonly used to deploy workflows for large datasets because of the immutability, non-blocking and supervision characteristics of actors.
Scala/Akka actors are resilient because they are defined within a hierarchical context in which an actor becomes a supervisor of other actors. In this slide, a router is a supervising actor for the workers and, depending on the selected strategy, is responsible for restarting a worker in case of failure.
But, what about the case for which the load (number of messages in the
In this example, an actor, Controller, loads chunks of data, partitions them and distributes them across multiple Worker actors, along with a data transformation.
Upon receiving the message "Compute", the workers process the data given a transformation function. The workers return the processed data through a Completed message.
The purpose of the watch dog actor, "Watcher", is to monitor the utilization of the mailboxes and report it to the Controller.
This is a simple feedback control:
1- The watcher monitors the utilization of the mailboxes (average length)
2- The controller adjusts the size of each batch in the Load message handler (throttling)
3- The workers process the next batch
Let's start with the Worker actor.
The load on a worker depends on three variables
1- The amount of data to process
2- The complexity of the data transformation
3- The underlying system (cores, memory..)
The controller provides the slice of data to be processed by the workers, msg.xt, as well as the data transformation, msg.fct.
Let's look at our watch dog actor, Watcher: it computes the load as the average mailbox utilization and sends it back to the controller through a Status message.
As its name implies, the controller configures and manages the dynamic workflow and controls the back pressure from the worker actors.
As far as the configuration is concerned, the Controller generates a list of workers, the bounded mailboxes for the workers (msgQueues) and ultimately the watcher actor.
The workers and the watcher are created within the Controller context using the Akka actorOf constructor.
As far as the management of the data flow and the feedback control loop is concerned, the Controller
- Loads, partitions and distributes batches of data points to the worker actors (message: Load)
- Processes the results of the computation in the workers (message: Completed)
- Throttles the flow up or down upon receiving the status on the utilization of the mailboxes from the watcher (message: Status)
The handling of the messages processed by the controller is self-explanatory. It
Adjusts the size of the next batch if required (throttle method)
Extracts the next batch of data from the input stream
Partitions and distributes the batch across the worker actors
Sends the partition along with the data transformation to the workers.
The implementation of the throttle method is rather simple. It takes the load computed by the watcher actor and the current batch size (number of data points to be processed) as input. It updates the batch size using a simple ratio relative to the watermark. For instance, if the load is below the watermark, the batch size is increased.
The bottom graph describes the throttle action and the complexity of the data transformation.
The complexity of the data transformation has an impact on the load on the workers. It varies from 0 (simple map operation) to 2 (complex data processing involving recursion or iterations).
The throttle intensity ranges between -6 (rapid decrease of the size of batches) and +6 (rapid increase of the size of batches of data).
The top graph displays the actual utilization of the mailboxes, with a capacity of 512 messages, as regulated by the feedback control loop (executed by the controller).
The deployment of a reactive data flow in production would require significant improvements to our naive model.
The feedback control loop could be smoothed with a moving average technique or a Kalman filter to avoid erratic behavior.
We would need to provide a larger range of options for control actions besides adjusting the size of data batches: increasing the number of workers, the mailbox capacity, the caching strategy, ... A fine-grained set of actions also reduces the risk of unstable systems.
The watch dog should be able to handle dead letters in case of failure (mailbox overflowing).
Reactive streams control the flow back pressure at the TCP connection level. It is far more accurate and responsive than monitoring mailbox utilization.