3. Algorithms: a tribute
Numbers and algorithms:
the 9th-century Persian Muslim mathematician Abu Abdullah Muhammad ibn Musa Al-Khwarizmi,
whose work built upon that of the 7th-century Indian mathematician Brahmagupta.
We owe a lot to these guys!
4. Why do we need parallelism?
The transistor count keeps growing,
but single-core speed doesn't get much faster.
BUT
we get more cores in a chip.
More cores = more parallelism.
We are happy now, right?
5. Moore's law
Every 18 months, the number of CPU cores doubles.
Another interpretation:
Every 18 months, the number of idle CPU cores doubles.
7. Modern applications
Scalability:
Vertical: concurrency
(use all the cores, memory and I/O of a given machine)
Horizontal: distribution
(use all the machines in the cluster)
High availability:
Fault tolerance at all levels (local, distributed)
(the Terminator effect: you can stop it, but you can't kill it)
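Vertical scalability (concurrency on one machine) can be sketched in a few lines of Python: a process pool spreads a CPU-bound task over every core of the box. The `work` function here is a made-up placeholder task, not part of any framework mentioned in these slides.

```python
from multiprocessing import Pool, cpu_count

def work(n):
    # hypothetical CPU-bound task: sum of squares below n
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # vertical scaling: use all the cores of this one machine
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(work, [10_000] * cpu_count())
    print(len(results) == cpu_count())  # one result per core's task
```

Horizontal scalability is the same idea one level up: instead of a pool of processes on one machine, a pool of machines in the cluster.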
8. Streaming applications
Performance:
Efficient use of resources:
CPU and memory, but also OS threads and sockets
Asynchronous:
event-driven: reacts to new data
Distributed:
more machines = more performance
the algorithm is partitioned and/or replicated across the cluster
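The "asynchronous, event-driven" point can be sketched with Python's asyncio: the consumer reacts whenever new data arrives on a queue, instead of polling or blocking a thread per connection. The event names are made up for illustration.

```python
import asyncio

async def producer(queue):
    # push a few events, then signal end-of-stream with None
    for event in ("click", "view", "buy"):
        await queue.put(event)
    await queue.put(None)

async def consumer(queue, seen):
    # reacts only when new data arrives; no busy waiting
    while (event := await queue.get()) is not None:
        seen.append(event)

async def main():
    queue = asyncio.Queue()
    seen = []
    await asyncio.gather(producer(queue), consumer(queue, seen))
    return seen

print(asyncio.run(main()))  # ['click', 'view', 'buy']
```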
9. What to increase?
More CPU: helps when there is computation involved.
More MEMORY: helps when there is more state to keep.
More I/O: helps when there are more messages to transfer.
10. Streaming or batch?
Processing Data
Natalino Busa - 12 Feb. 2013
[Diagram: data flows from the source system, through our system, to the target system]
What differentiates Streaming from Batch?
● Granularity of Data
● Granularity of Processing
Granularity impacts:
Throughput, Latency, and the Cost of the system!
11. The choice is yours
Workload: 1000 events/sec (1 KB/event), running on 100 cores all day long.
BATCH: Hadoop
"Wait a day, then process"
86 M events ≈ 86 GB of data
Latency: 24 hours
Throughput: 1 update/day
STREAMING: Akka
"Do not wait"
Process the 1 KB of data each msec.
Latency: 1 ms
Throughput: 1000 updates/sec
"Both are valid options. It depends on the application domain and the requirements/specs of the target and source systems."
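The numbers on this slide can be checked with a few lines of arithmetic (86,400 seconds per day, 1000 events/sec, 1 KB per event):

```python
EVENTS_PER_SEC = 1000
EVENT_SIZE_KB = 1
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400

events_per_day = EVENTS_PER_SEC * SECONDS_PER_DAY       # 86.4 M events
data_per_day_gb = events_per_day * EVENT_SIZE_KB / 1e6  # KB -> GB (decimal)

print(events_per_day)    # 86400000, i.e. ~86 M events
print(data_per_day_gb)   # 86.4, i.e. ~86 GB per day
```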
12. Mapping it to existing applications

                            Traditional DB systems    Big Data (Hadoop)
Granularity of Data         256 GB                    256 GB
Granularity of Processing   1 CPU                     100 CPUs

                            Traditional mail server   Web application server
Granularity of Data         1 KB                      1 KB
Granularity of Processing   1 CPU                     100 CPUs
16. Technology matrix

                       Granularity of Processing
Granularity of Data    Small                Big
Small                  Akka                 Akka, Gigaspaces
Big                    ?                    Storm

System end-to-end throughput:
High (~10,000 events/sec): Akka
Medium (~100 events/sec): Storm / Gigaspaces
Low (~10 events/sec): scripting languages
17. Big Data in motion
All three are distributed, fault-tolerant, streaming technologies.
- Storm
++ multi-language
-- not user/admin friendly
-- slow supervising
processing elements are JVMs
ideal when data is coarse-grained
- Akka
++ high throughput, fine-grained actors
++ dynamic topologies
-- low-level, but high performance
processing elements are small and lightweight
ideal for millions of transactions per second
- Gigaspaces
++ combines memory + application distribution
-- framework API is not very flexible
processing elements are JVMs
ideal as an all-in-one solution, with little customization
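Akka's fine-grained processing elements are actors: lightweight units that each drain a mailbox one message at a time. A hedged sketch of the model in plain Python (real Akka actors are Scala/Java objects scheduled on a shared dispatcher, not one OS thread each; `Counter` is a made-up example actor):

```python
from queue import Queue
from threading import Thread

class Actor:
    # minimal actor: a mailbox drained by a single loop,
    # so receive() never runs concurrently with itself
    def __init__(self):
        self.mailbox = Queue()
        self._thread = Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        self.mailbox.put(message)   # fire-and-forget, asynchronous send

    def _run(self):
        while (msg := self.mailbox.get()) is not None:
            self.receive(msg)

    def receive(self, message):
        raise NotImplementedError

class Counter(Actor):
    # made-up example actor: counts the messages it receives
    def __init__(self):
        super().__init__()
        self.count = 0

    def receive(self, message):
        self.count += 1

counter = Counter()
for _ in range(5):
    counter.tell("tick")
counter.tell(None)        # poison pill: stops the mailbox loop
counter._thread.join()
print(counter.count)      # 5
```

Because the only way in is the mailbox, an actor's state needs no locks, which is what makes millions of small actors per machine practical.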
18. Opportunity: Lambda Architecture
Logic layer
Software as a Service
e.g. a real-time predictor
from http://www.manning.com/marz/
19. Opportunity: Batch + Streaming
[Diagram:
Batch Computing, fed from Data Warehouses
Streaming Computing, fed from Messaging Busses
both feeding In-Memory Distributed DB's
Front End Services: low-latency HTTP API services
HTML5 Client / Responsive App: FETCH (refresh) and PUSH (SSE, notifications)]
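The batch + streaming split can be sketched as the lambda architecture's query-time merge: a batch view precomputed over historical data, a small real-time view over events that arrived since the last batch run, and a serving function that merges the two. All names and numbers here are illustrative, not from any of the products above.

```python
from collections import Counter

# batch view: precomputed over the full historical dataset (e.g. by Hadoop)
batch_view = Counter({"page_a": 1000, "page_b": 500})

# real-time view: incrementally updated by the streaming layer (e.g. Akka/Storm)
realtime_view = Counter()

def on_event(page):
    # streaming layer: update the real-time view as each event arrives
    realtime_view[page] += 1

def query(page):
    # serving layer: merge both views at query time
    return batch_view[page] + realtime_view[page]

for event in ["page_a", "page_a", "page_c"]:
    on_event(event)

print(query("page_a"))  # 1002
print(query("page_c"))  # 1
```

When the next batch run absorbs the recent events, the real-time view is discarded and rebuilt, keeping it small.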