6. Computing Trends
Past                                              Current
Computation (CPUs) expensive                      Computation cheap (many-core computers)
Disk storage expensive                            Disk storage cheap (cheap commodity disks)
DRAM expensive                                    DRAM / SSD getting cheap
Coordination easy (latches don't often hit)       Coordination hard (latches stall a lot, etc.)
Source: Immutability Changes Everything - Pat Helland, RICON2012
A real-time architecture using Hadoop & Storm. #JaxLondon
7. Credits
Nathan Marz
Ex-Backtype & Twitter
Startup in stealth mode
Storm
Cascalog
ElephantDB
manning.com/marz
8. A Data System
9. Data is more than Information
Not all information is equal.
Some information is derived from other pieces of
information.
10. Data is more than Information
Eventually you will reach the most
This is the information you hold true, simply because it exists.
11. Events - Before
Events used to manipulate the
master data.
12. Events - After
Today, events are the master
data.
18. Query
The data you query is often transformed,
aggregated, ...
19. Query
Query = function ( all data )
20. Number of people living in each city.
All data:
Person    Location       Time
Nathan    Antwerp        2005-01-01
Geert     Dendermonde    2011-10-08
John      Ghent          2010-05-02
Nathan    Ghent          2013-02-03

Query result:
Location       Count
Ghent          2
Dendermonde    1
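The idea "Query = function ( all data )" can be sketched for this example roughly as follows. This is a minimal illustration in plain Python, not code from the talk; the function name people_per_city and the tuple layout are assumptions.

```python
from collections import Counter

# All data: immutable location events (person, location, date), as on the slide.
events = [
    ("Nathan", "Antwerp", "2005-01-01"),
    ("Geert", "Dendermonde", "2011-10-08"),
    ("John", "Ghent", "2010-05-02"),
    ("Nathan", "Ghent", "2013-02-03"),
]

def people_per_city(all_events):
    """Hypothetical query: count people by their most recent known location."""
    latest = {}  # person -> (date, location)
    for person, location, date in all_events:
        # ISO-8601 dates compare correctly as plain strings.
        if person not in latest or date > latest[person][0]:
            latest[person] = (date, location)
    return Counter(location for _, location in latest.values())

print(dict(people_per_city(events)))  # {'Ghent': 2, 'Dendermonde': 1}
```

Nathan's older Antwerp record is never deleted or updated; the query simply picks the most recent event per person.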
35. MapReduce
MAP
1. Take a large data set and divide it into subsets …
2. Perform the same function on all subsets: DoWork(), DoWork(), DoWork(), …
REDUCE
3. Combine the output from all subsets … into the Output
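The three steps can be sketched in a single process as below. This is only an illustration of the pattern, not Hadoop's API; using word counting as the work done per subset, and the names split, do_work and combine, are assumptions.

```python
from collections import defaultdict
from functools import reduce

def split(data, n):
    """Step 1: divide a data set into n subsets."""
    return [data[i::n] for i in range(n)]

def do_work(subset):
    """Step 2 (map): the same function for every subset; here, partial word counts."""
    counts = defaultdict(int)
    for line in subset:
        for word in line.split():
            counts[word] += 1
    return dict(counts)

def combine(left, right):
    """Step 3 (reduce): merge two partial results into one."""
    merged = dict(left)
    for word, count in right.items():
        merged[word] = merged.get(word, 0) + count
    return merged

lines = ["storm hadoop", "hadoop hdfs", "storm storm"]
partials = [do_work(subset) for subset in split(lines, 2)]
output = reduce(combine, partials, {})
print(output)  # {'storm': 3, 'hadoop': 2, 'hdfs': 1}
```

Because do_work only sees its own subset and combine is associative, the subsets could be processed on different machines in any order.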
36. Serialization & Schema
Catch errors as quickly as they happen.
Validation on write vs on read.
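Validation on write could look roughly like the sketch below. It uses a plain Python dataclass as a stand-in for a real schema format; the LocationEvent type, its fields and the append helper are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class LocationEvent:
    """Hypothetical record type; the field names are assumptions."""
    person: str
    location: str
    day: date

    def __post_init__(self):
        # Validation on write: reject malformed records before they are stored.
        if not isinstance(self.person, str) or not self.person:
            raise ValueError("person must be a non-empty string")
        if not isinstance(self.location, str) or not self.location:
            raise ValueError("location must be a non-empty string")
        if not isinstance(self.day, date):
            raise ValueError("day must be a datetime.date")

def append(store, person, location, day):
    """Append a validated event; a bad record fails here, not at read time."""
    store.append(LocationEvent(person, location, day))

store = []
append(store, "Nathan", "Ghent", date(2013, 2, 3))
try:
    append(store, "John", "Ghent", "2010-05-02")  # wrong type for day
except ValueError as err:
    rejected = str(err)
```

The second record never reaches the store, so no reader has to cope with it later.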
37. Serialization & Schema
CSV is actually a serialization language that is just
poorly defined.
38. Serialization & Schema
Use a format with a schema.
- Thrift
- Avro
- Protocol Buffers
39. Batch View Database
Read only database.
No random writes required.
40. Batch View Database
Every iteration produces the views from scratch.
42. Batch Layer
Just a few hours of data.
[Figure: timeline along the Time axis. Data absorbed into Batch Views, followed by data not yet absorbed, up to Now.]
48. Speed Layer
Storing a limited window of data.
Compensating for the last few hours of data.
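How a query might combine the two layers is sketched below. It is a toy in-memory illustration under stated assumptions: page-view events shaped as (url, timestamp), and the names batch_view, SpeedView and query are all hypothetical.

```python
from collections import Counter

def batch_view(master_data):
    """Batch view: recomputed from scratch over all data absorbed so far."""
    return Counter(url for url, _ in master_data)

class SpeedView:
    """Incrementally updated view that covers only the recent window."""

    def __init__(self):
        self.counts = Counter()

    def update(self, event):
        url, _ = event
        self.counts[url] += 1

    def expire(self):
        """Drop the window once the batch layer has absorbed it."""
        self.counts.clear()

def query(url, batch, speed):
    """Answer = batch view + compensation for the last few hours."""
    return batch[url] + speed.counts[url]

absorbed = [("/home", 1), ("/home", 2), ("/about", 3)]
recent = [("/home", 4)]

batch = batch_view(absorbed)
speed = SpeedView()
for event in recent:
    speed.update(event)

total = query("/home", batch, speed)
print(total)  # 3
```

After the next batch run has absorbed the recent events, calling expire() keeps them from being counted twice.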
49. Speed Layer
All the complexity is isolated in the Speed
layer.
-corrected.
50. CAP
You have a choice between:
Availability
- Queries are eventually consistent.
Consistency
- Queries are consistent.
51. Eventual accuracy
Some algorithms are hard to implement
in real time. For those cases we could
estimate the results.
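One way to estimate a result in real time is an approximate distinct count. The sketch below uses linear counting, chosen here only as an illustration (the talk does not name a specific algorithm); the class name and the bitmap size are assumptions.

```python
import hashlib
import math

class LinearCounter:
    """Approximate distinct count using a fixed-size bitmap (linear counting)."""

    def __init__(self, size=8192):
        self.size = size
        self.bits = bytearray(size)  # one byte per slot, for clarity

    def add(self, item):
        # Hash the item to a slot; duplicates always hit the same slot.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % self.size
        self.bits[slot] = 1

    def estimate(self):
        empty = self.bits.count(0)
        if empty == 0:
            raise ValueError("bitmap is full; use a larger size")
        # n is approximately -m * ln(fraction of empty slots).
        return -self.size * math.log(empty / self.size)

counter = LinearCounter()
for i in range(1000):
    counter.add(f"user-{i}")
    counter.add(f"user-{i}")  # a repeat visit does not change the estimate

estimate = counter.estimate()
```

Memory use stays fixed regardless of how many events arrive, at the price of a small error; an exact count can still be produced later by the batch layer.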
59. Speed Layer Views
The views are stored in a read & write database.
- Cassandra
- HBase
- Redis
- MySQL
- ElasticSearch
Much more complex than a read-only view.
69. Lambda Architecture
Can discard any view, batch and real time,
and just recreate everything from the master
data.
70. Lambda Architecture
Mistakes are corrected via recomputation.
Write bad data? Remove the data & recompute.
Bug in view generation? Just recompute the view.
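Correction by recomputation can be sketched as follows. This is a toy example: the (url, visitor) event shape, the "bot" record standing in for bad data, and the build_view name are assumptions.

```python
def build_view(master_data):
    """View generation: page views per URL, always computed from scratch."""
    view = {}
    for url, _visitor in master_data:
        view[url] = view.get(url, 0) + 1
    return view

master_data = [
    ("/home", "alice"),
    ("/home", "bot"),  # bad data that should never have been written
    ("/about", "bob"),
    ("/home", "bob"),
]
view_before = build_view(master_data)

# Remove the bad data from the master data, then recompute the view.
master_data = [event for event in master_data if event[1] != "bot"]
view_after = build_view(master_data)

print(view_before)  # {'/home': 3, '/about': 1}
print(view_after)   # {'/home': 2, '/about': 1}
```

No in-place repair of the view is needed; a bug in build_view would be handled the same way, by fixing the function and running it again.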
74. DataCrunchers
We enable companies in envisioning, defining and
implementing a data strategy.
A one-stop-shop for all your Big Data needs.
The first Big Data Consultancy agency in Belgium.
How much data do you have?
44 times as much data in the next decade, 15 Zb in 2015
Data silos (ERP, CRM, …)
Customers
Trimble (3 Tb in their database system)
Truvo (changing an index takes 24 hours)
Traditional systems cannot handle this volume.
How much data do you have?
Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
Convert 350 billion annual meter readings to better predict power consumption
3
Real time
Time-sensitive decision making
Fraud detection
Energy allocation
Marketing campaigns
Market transactions
Solution:
Real-time solutions in combination with batch (Hadoop)
NoSQL systems
4
Structured
Unstructured
80% is unstructured data.
A key drawback of using traditional relational database systems is that they're not good at handling variable data.
A flexible data model
Word, email, photo, text, video, APIs, …?
What are your needs regarding variety?
The end result: bringing structure into unstructured data
Monitor hundreds of live video feeds from surveillance cameras to target points of interest
Exploit the 80% data growth in images, video and documents to improve customer satisfaction
5
We can afford to keep immutable copies of lots of data.
We NEED immutability to coordinate with fewer challenges.
Semaphores & Locks are the things to avoid:
Instruction opportunities lost waiting for a semaphore increase with more cores…
6
The # of followers on Twitter = all follows & unfollows combined.
Account balance
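Deriving a follower count from raw events might look like the sketch below. The event layout (kind, follower, followee) and the followers function are assumptions made for illustration.

```python
def followers(events, account):
    """Derive the current follower set of an account by replaying all events."""
    current = set()
    for kind, follower, followee in events:
        if followee != account:
            continue
        if kind == "follow":
            current.add(follower)
        elif kind == "unfollow":
            current.discard(follower)
    return current

events = [
    ("follow", "geert", "nathan"),
    ("follow", "john", "nathan"),
    ("unfollow", "geert", "nathan"),
    ("follow", "nathan", "john"),
]
count = len(followers(events, "nathan"))
print(count)  # 1
```

The count itself is never stored as master data; it is derived from the follow and unfollow events whenever needed.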
9
Data = event
In an ever-changing world we found a 'safe haven' for data
Everything we do generates events:
Pay with Credit Card
Commit to Git
Click on a webpage
Tweet
10
It is easier to store all data in a cost-effective way.
Compare to the data warehouse (DWH) world.
13
Immutability greatly restricts the range of errors that can cause data loss or data corruption.
Example:
Only CR (create, read), no more CRUD.
Information might of course change.
Fault tolerance
Data loss
Human error, hardware failure
Data corruption
Parallel with functional programming.
14
Allows state regeneration. E.g. what was my bank balance on 1 May 2005?
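State regeneration from immutable events can be sketched as below. The transactions and amounts are made-up sample data, and balance_on is a hypothetical helper.

```python
from datetime import date

# Made-up sample data: (date, amount) events, never updated in place.
transactions = [
    (date(2005, 3, 1), 1000),
    (date(2005, 4, 15), -250),
    (date(2005, 6, 1), 400),
]

def balance_on(events, day):
    """Replay every event up to and including the given day."""
    return sum(amount for when, amount in events if when <= day)

balance = balance_on(transactions, date(2005, 5, 1))
print(balance)  # 750
```

Because nothing was overwritten, the balance for any past date can be recomputed the same way.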
15
Queries as pure functions that take all data as input is the most general formulation.
Different functions may look at different portions and aggregate information in different ways.
19
Too slow; might be petabyte scale
Impala/Drill: why not
23
The batch layer can calculate anything (given enough time).
28
The batch layer stores the data normalized, but in the views it generates, data is often, if not always, denormalized.
29
Not vertically
30
It’s OK to croak and restart
32
Is something really immutable when its name can change?
33
Doesn't have to be Hadoop. What matters here is a distributed file system combined with a processing framework.
Spark,
34
http://www.quora.com/Apache-Hadoop/What-is-the-advantage-of-writing-custom-input-format-and-writable-versus-the-TextInputFormat-and-Text-writable/answer/Eric-Sammer?srid=PU&st=ns
Value of schemas
• Structural integrity
• Guarantees on what can and can’t be stored
• Prevents corruption
Otherwise you’ll detect corruption issues at read-time
37
But this can be solved, e.g. by generating your views in advance with ES.
42
In some circumstances.
49
All the complexity of *dealing* with the CAP theorem (like read repair) is isolated in the realtime layer.
51
Consistency (all nodes see the same data at the same time)
Availability (a guarantee that every request receives a response about whether it was successful or failed)
Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
http://codahale.com/you-cant-sacrifice-partition-tolerance/
HBase vs Cassandra
52
E.g. unique counts
ML
53
Nimbus:
Manages the cluster
Worker Node:
Supervisor:
Manages workers; restarts them if needed
Executor
Physical JVM process.
Executes tasks (those are spread evenly across the workers)
Tasks
Each in its own thread.
Is the actual Bolt or Spout.
Processes the stream.
56
Tuple:
Named list of values
Dynamically typed
Stream
Sequence of Tuples
57
Spout
Source of Streams
Sometimes replayable
Bolt
Stream transformations
At least 1 input stream
0 - * output streams
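The tuple, stream, spout and bolt concepts can be mimicked with plain Python generators, as in the sketch below. This is not Storm's API; it only illustrates the data flow, and all names (Sentence, Word, sentence_spout, split_bolt, count_bolt) are hypothetical.

```python
from collections import namedtuple

# A tuple is a named list of values.
Sentence = namedtuple("Sentence", ["text"])
Word = namedtuple("Word", ["word"])

def sentence_spout():
    """Spout: the source of a stream (here a small, replayable one)."""
    for text in ["storm processes streams", "streams of tuples"]:
        yield Sentence(text)

def split_bolt(stream):
    """Bolt: one input stream of sentences, one output stream of words."""
    for sentence in stream:
        for word in sentence.text.split():
            yield Word(word)

def count_bolt(stream):
    """Bolt: consumes Word tuples and emits running (word, count) pairs."""
    counts = {}
    for tup in stream:
        counts[tup.word] = counts.get(tup.word, 0) + 1
        yield tup.word, counts[tup.word]

# Wire the topology: spout -> split bolt -> count bolt.
final_counts = dict(count_bolt(split_bolt(sentence_spout())))
print(final_counts["streams"])  # 2
```

In a real cluster each of these stages would run as many parallel tasks; here they run lazily in a single thread.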
58
The serving layer needs to be able to answer any query in a short amount of time.
64
AVG = sum / count; pre-aggregate, but not everything is possible.
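An average can be pre-aggregated by storing the sum and the count separately and dividing only at query time. A minimal sketch, with hypothetical function names:

```python
def preaggregate(values):
    """Store (sum, count) instead of the average itself."""
    values = list(values)
    return (sum(values), len(values))

def merge(a, b):
    """Partial aggregates combine by adding sums and counts."""
    return (a[0] + b[0], a[1] + b[1])

def average(agg):
    total, count = agg
    return total / count if count else None

batch = preaggregate([10, 20, 30])  # e.g. from a batch view
recent = preaggregate([40])         # e.g. from a real-time view
avg = average(merge(batch, recent))
print(avg)  # 25.0
```

Merging two already-computed averages (20.0 and 40.0) would not give the right answer, and some aggregates, such as a median, cannot be merged from partial results like this at all.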
67
Lambda was first named by Alonzo Church, who needed a letter for function abstraction in the theory of computation in the 1930s.
70