2. Raise your hands if you have used
- Cascalog
- Hadoop
- Spark
- Flink
- Samza
- Storm
- Sqoop
3. What’s it good for?
Realtime event stream processing
Continuous computation
Extract, transform, load (ETL)
Data transformation à la map-reduce
Data ingestion and storage medium transfer
Data cleaning
21. Flow Conditions
From -> To (if the predicate holds)
Flow conditions isolate the logic that
decides whether segments should pass
through different tasks in a workflow,
handle exceptions, and support a rich
degree of composition with runtime
parameterization.
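A minimal sketch of the From -> To idea in plain Python (not the Onyx API). The task names and the `adult`/`child` predicates are made up for illustration.

```python
# Hypothetical predicates: each one decides whether a segment may flow onward.
def adult(segment):
    return segment["age"] >= 18

def child(segment):
    return not adult(segment)

# Each condition says: from this task, to that task, if the predicate holds.
FLOW_CONDITIONS = [
    {"from": "read-people", "to": "process-adults", "predicate": adult},
    {"from": "read-people", "to": "process-children", "predicate": child},
]

def route(task, segment, conditions=FLOW_CONDITIONS):
    """Return the downstream tasks whose predicate accepts the segment."""
    return [c["to"] for c in conditions
            if c["from"] == task and c["predicate"](segment)]

print(route("read-people", {"name": "Ada", "age": 30}))
```

The routing logic lives in the conditions list, so the task functions themselves stay free of branching.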
22. Windows / Triggers
partitions a possibly unbounded
sequence of data into finite
pieces, allowing aggregations to
be specified
- Timer
- Segment
- Punctuation
- Watermark
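To make the trigger idea concrete, here is a rough plain-Python sketch of a segment-count trigger over a running count aggregation. The function name and the threshold are assumptions for illustration, not Onyx configuration.

```python
def run_count_window(segments, threshold):
    """Count segments and 'fire' every `threshold` segments.

    Returns the window state (the running count) seen at each firing.
    """
    fired = []
    count = 0
    for _ in segments:
        count += 1                  # aggregation: a simple count
        if count % threshold == 0:  # segment trigger: every N segments
            fired.append(count)
    return fired

print(run_count_window(range(7), threshold=3))
```

A timer trigger would fire on elapsed wall-clock time instead of on a segment count; the aggregation stays the same.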
23. Life Cycles
allows you to hook in and execute
arbitrary code at critical points
during a task (kinda middleware)
:lifecycle/start-task?
:lifecycle/before-task-start
:lifecycle/before-batch
:lifecycle/after-read-batch
:lifecycle/after-batch
:lifecycle/after-task-stop
:lifecycle/after-ack-segment
:lifecycle/after-retry-segment
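The middleware comparison can be sketched in plain Python. The hook names below mirror a few of the keywords above, but `run_task` and its signature are hypothetical, not how Onyx invokes lifecycles.

```python
def run_task(batches, fn, lifecycle):
    """Apply fn to every segment, calling optional hooks around the work."""
    def call(hook, *args):
        handler = lifecycle.get(hook)
        if handler is not None:
            handler(*args)

    call("before-task-start")
    results = []
    for batch in batches:
        call("before-batch", batch)
        out = [fn(segment) for segment in batch]
        call("after-batch", out)
        results.extend(out)
    call("after-task-stop")
    return results

log = []
hooks = {
    "before-task-start": lambda: log.append("start"),
    "before-batch": lambda batch: log.append(("before", len(batch))),
    "after-batch": lambda out: log.append(("after", len(out))),
    "after-task-stop": lambda: log.append("stop"),
}
doubled = run_task([[1, 2], [3]], lambda x: x * 2, hooks)
```

Typical uses are opening a connection before the task starts and closing it after the task stops.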
24. Job
A job will be translated into multiple
tasks. Peers will take care of these
tasks.
If the number of tasks > available peers,
the job won’t complete (Buy me a beer
or 10)
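A small plain-Python check of the rule above, assuming each task needs at least one peer. The workflow is a list of (from, to) edges and the task names are made up.

```python
def tasks_in(workflow):
    """Collect the distinct task names from a list of (from, to) edges."""
    tasks = []
    for src, dst in workflow:
        for task in (src, dst):
            if task not in tasks:
                tasks.append(task)
    return tasks

def can_start(workflow, available_peers):
    """A job only makes progress if every task can get a peer."""
    return len(tasks_in(workflow)) <= available_peers

workflow = [("in", "inc"), ("inc", "out")]
print(tasks_in(workflow), can_start(workflow, available_peers=2))
```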
25. Bulk functions
perform a fn more efficiently over a
batch of segments rather than
processing one segment at a time.
- Write to DB
Onyx will ignore the output of your function and
pass the same segments that you received
downstream
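A sketch of the bulk behaviour in plain Python. `FAKE_DB`, `write_to_db` and `run_bulk` are hypothetical stand-ins, not Onyx functions.

```python
FAKE_DB = []  # stand-in for a real database table

def write_to_db(batch):
    """One round trip for the whole batch instead of one per segment."""
    FAKE_DB.extend(batch)
    return "rows written"  # this return value is deliberately ignored

def run_bulk(batch, bulk_fn):
    """Call bulk_fn for its side effect, then pass the input batch on."""
    bulk_fn(batch)
    return batch

batch = [{"id": 1}, {"id": 2}]
downstream = run_bulk(batch, write_to_db)
```

Downstream tasks see exactly the segments that came in, whatever the bulk function returned.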
26. Group by
“like” values are always
routed to the same virtual
peer
- Group by key
- Group by a fn
Specify in the catalog!
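One way to picture the routing, in plain Python: hash the grouping value and take it modulo the number of peers. The CRC32 hash is an assumption for the sketch, not a claim about how Onyx hashes.

```python
import zlib

def peer_for(segment, group_by, n_peers):
    """Pick a peer index so that equal grouping values always land together.

    group_by is either a key name or a function of the segment.
    """
    value = group_by(segment) if callable(group_by) else segment[group_by]
    return zlib.crc32(repr(value).encode("utf-8")) % n_peers

# Group by key
a = peer_for({"user": "ada", "amount": 10}, "user", n_peers=4)
b = peer_for({"user": "ada", "amount": 25}, "user", n_peers=4)

# Group by a function
by_lower = lambda s: s["user"].lower()
c = peer_for({"user": "Ada"}, by_lower, n_peers=4)
d = peer_for({"user": "ADA"}, by_lower, n_peers=4)
```

Because the same value always maps to the same peer, that peer can keep per-key state locally.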
27. Fixed Windows
a data point will fall into
exactly one instance of a
window (often called an
extent in the literature)
Between t1=0 and t2=4 how many
events have happened?
t1=5 t2=9, t1=10 t2=14
And so on..
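The extents above (0 to 4, 5 to 9, 10 to 14) follow from integer division. A plain-Python sketch, assuming a range of 5 time units, windows aligned at 0, and inclusive bounds:

```python
from collections import Counter

def fixed_extent(t, window_range=5):
    """Return the single (lower, upper) extent that timestamp t falls into."""
    lower = (t // window_range) * window_range
    return (lower, lower + window_range - 1)

def count_per_extent(timestamps, window_range=5):
    """How many events happened in each extent?"""
    return dict(Counter(fixed_extent(t, window_range) for t in timestamps))

print(count_per_extent([0, 3, 4, 5, 12]))
```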
28. Sliding Window
a slide value sets how long to wait
before spawning a new window
extent
Between t1=0 and t2=14 how many events have
happened?
t1=5 t2=19 ?
t1=10 t2=24 ?
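With a range of 15 and a slide of 5, one event can belong to several extents. A plain-Python sketch, assuming non-negative timestamps, windows aligned at 0, and inclusive bounds:

```python
def sliding_extents(t, window_range=15, slide=5):
    """Return every (lower, upper) extent that contains timestamp t."""
    extents = []
    lower = (t // slide) * slide  # latest extent that starts at or before t
    while lower >= 0 and lower + window_range - 1 >= t:
        extents.append((lower, lower + window_range - 1))
        lower -= slide            # step back to the previous extent
    return sorted(extents)

print(sliding_extents(12))
```

An event at t=12 is counted in all three of the windows listed above, while an event at t=3 only falls into the first.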
30. Session Window
dynamically resize their upper and
lower bounds in reaction to
incoming data
Sessions capture a time span of activity for a
specific key, such as a user ID. If no activity
occurs within a timeout gap, the session
closes. If an event occurs within the bounds of
a session, the window size is fused with the
new event, and the session is extended by its
timeout gap either in the forward or backward
direction
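A batch-style sketch of session fusing for a single key, in plain Python. It sorts the events first, which a streaming system cannot do, so treat it as an illustration of the resulting sessions only. Treating a gap exactly equal to the timeout as "still in session" is an assumption.

```python
def sessionize(timestamps, timeout_gap):
    """Group event times into (start, end) sessions.

    An event joins the current session if it arrives no more than
    timeout_gap after that session's last event; otherwise it opens
    a new session.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][1] <= timeout_gap:
            sessions[-1][1] = t      # extend the open session
        else:
            sessions.append([t, t])  # start a new session
    return [tuple(s) for s in sessions]

print(sessionize([1, 3, 20, 4, 22], timeout_gap=5))
```

For several keys (for example user IDs), the same function would be applied to each key's events separately.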
33. Peer
is a node in the cluster responsible for processing data
34. Virtual Peer
A Virtual Peer refers to a single peer process running on a single physical
machine. A single Virtual Peer executes at most one task at a time.
36. Aeron
Efficient reliable UDP unicast, UDP
multicast, and IPC message transport
The messaging layer takes care of the direct
peer to peer transfer of segment
batches, acks, segment completion and
segment retries to the relevant virtual
peers.
38. Scheduling
If there is no master, how does
scheduling work ?
Peers contend to work on tasks.
39. Types of Job Schedulers
- Greedy ( I need ALL!!!! Gimme all!!)
- Balanced Robin ( Fair play)
- Percentage ( Not so fair play)
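Rough arithmetic for the three policies, in plain Python. This only illustrates how peers might be split between jobs; it is not Onyx's actual scheduling algorithm, and the job names are made up.

```python
def greedy(peers, jobs):
    """The first job takes every peer; the rest wait."""
    return {job: (peers if i == 0 else 0) for i, job in enumerate(jobs)}

def balanced(peers, jobs):
    """Split peers as evenly as possible; leftovers go to the earliest jobs."""
    base, extra = divmod(peers, len(jobs))
    return {job: base + (1 if i < extra else 0) for i, job in enumerate(jobs)}

def percentage(peers, shares):
    """Give each job its declared percentage of the peers, rounded down."""
    return {job: peers * pct // 100 for job, pct in shares.items()}

print(greedy(10, ["a", "b"]))
print(balanced(10, ["a", "b", "c"]))
print(percentage(10, {"a": 70, "b": 30}))
```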
40. Types of Task Schedulers
- Balanced
- Percentage
- Colocation (assigns them to the peers on a single physical machine, low latency, min network)
41. Tags
a set of machines in your cluster are
privileged
Run some tasks on specific
machines
Declare a peer with capabilities
- Datomic
- Special Hardware (GPU, Memory)
- Network
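Tag matching boils down to a subset check. A plain-Python sketch with hypothetical peer names and tags:

```python
def eligible_peers(required_tags, peers):
    """Return the peers whose declared tags cover everything the task needs."""
    required = set(required_tags)
    return [name for name, tags in peers.items() if required <= set(tags)]

# Hypothetical cluster: each peer declares its capabilities.
peers = {
    "p1": ["datomic"],
    "p2": ["gpu", "datomic"],
    "p3": [],
}
print(eligible_peers(["datomic"], peers))
```

A task with no required tags can run on any peer.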
43. Example - Let’s process some logs
49556677821280438558577372995495836672945903576549425154
44. Check out the repository
https://github.com/bcambel/onyx-test
45. End users configuring what workflows should look like
- Language agnostic
- Location agnostic
- Tolerant to machine generation
- Temporally agnostic ( should wait for a time to be realized)
46. If you are not enjoying your experience
There is something fundamentally wrong with the tool
Think about Apple’s smooth product experience.
Pain detected, thought through. (Pain -> Pleasure)
49. Questions
- How does Onyx distribute reads (input tasks)? Parallelization?
- Evenly break up a database table into chunks that can be read by multiple peers
- Segments realized. Tasks created. Peers get into action
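The table-chunking answer can be sketched in plain Python. Row ranges are half-open (start, end), and dealing chunks out round-robin is an assumption for illustration, not a description of what an Onyx input plugin does.

```python
def chunk_ranges(n_rows, chunk_size):
    """Split rows 0..n_rows into half-open (start, end) ranges."""
    return [(start, min(start + chunk_size, n_rows))
            for start in range(0, n_rows, chunk_size)]

def assign(chunks, n_peers):
    """Deal the chunks out to peers round-robin."""
    return {peer: chunks[peer::n_peers] for peer in range(n_peers)}

chunks = chunk_ranges(n_rows=10, chunk_size=4)
print(assign(chunks, n_peers=2))
```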