The document discusses a real-time architecture using Hadoop and Storm. It proposes a layered architecture with a batch layer, speed layer, and serving layer. The batch layer uses Hadoop for batch processing and view generation. The speed layer uses Storm for stream processing and real-time views. The serving layer queries both the batch and real-time views to provide merged results. This architecture is known as the Lambda architecture and allows discarding and recomputing views from the immutable raw data as needed.
6. Computing Trends
Past                                            Current
Computation (CPUs) Expensive                    Computation Cheap (Many Core Computers)
Disk Storage Expensive                          Disk Storage Cheap (Cheap Commodity Disks)
DRAM Expensive                                  DRAM / SSD Getting Cheap
Coordination Easy (Latches Don’t Often Hit)     Coordination Hard (Latches Stall a Lot, etc)
Source: Immutability Changes Everything - Pat Helland, RICON2012
A real-time architecture using Hadoop & Storm. #JaxLondon
7. Credits
Nathan Marz
Ex-BackType & Twitter
Startup in stealth mode
Storm
Cascalog
ElephantDB
manning.com/marz
8. A Data System
9. Data is more than Information
Not all information is equal.
Some information is derived from other pieces of information.
10. Data is more than Information
Eventually you will reach the most ‘raw’ form of information.
This is the information you hold true, simply because it exists.
Let’s call this ‘data’, very similar to ‘event’.
11. Events - Before
Events used to manipulate
the master data.
12. Events - After
Today, events are the master
data.
13. Data System
Let’s store everything.
18. Query
The data you query is often transformed, aggregated, ...
It is rarely used in its original form.
19. Query
Query = function(all data)
20. Number of people living in each city.
Input (all data):
Person    Location       Time
Nathan    Antwerp        2005-01-01
Geert     Dendermonde    2011-10-08
John      Ghent          2010-05-02
Nathan    Ghent          2013-02-03

Result:
Location       Count
Ghent          2
Dendermonde    1
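The "people per city" example can be written as a query over the full list of events. This is a minimal sketch, assuming each record is a (person, location, time) tuple and that a person's most recent record gives their current city. The name `people_per_city` is hypothetical and not from the talk.

```python
from collections import Counter

# Raw, immutable location events: (person, location, ISO date).
EVENTS = [
    ("Nathan", "Antwerp", "2005-01-01"),
    ("Geert", "Dendermonde", "2011-10-08"),
    ("John", "Ghent", "2010-05-02"),
    ("Nathan", "Ghent", "2013-02-03"),
]

def people_per_city(events):
    """Query = function(all data): count people by their latest known city."""
    latest = {}  # person -> (time, location)
    for person, location, time in events:
        # ISO-formatted dates compare correctly as plain strings.
        if person not in latest or time > latest[person][0]:
            latest[person] = (time, location)
    return Counter(location for _, location in latest.values())

print(dict(people_per_city(EVENTS)))
```

Nothing is updated in place: Nathan's move to Ghent is a new event, and the query decides which event wins.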
35. MapReduce
MAP
1. Take a large data set and divide it into subsets
2. Perform the same function on all subsets: DoWork(), DoWork(), DoWork(), …
REDUCE
3. Combine the output from all subsets into a single Output
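The three steps can be mimicked in a single Python process. This is only an illustration of the idea, not Hadoop's API: the helper names `split`, `do_work` and `combine` are hypothetical, and on a real cluster the subsets would be processed in parallel on different machines.

```python
from collections import defaultdict
from functools import reduce

def split(data, n):
    """Step 1: divide the data set into n subsets."""
    return [data[i::n] for i in range(n)]

def do_work(subset):
    """Step 2 (map): the same function runs on every subset.
    Here it counts how often each city occurs in the subset."""
    counts = defaultdict(int)
    for city in subset:
        counts[city] += 1
    return dict(counts)

def combine(left, right):
    """Step 3 (reduce): merge two partial results into one."""
    merged = dict(left)
    for city, count in right.items():
        merged[city] = merged.get(city, 0) + count
    return merged

cities = ["Ghent", "Antwerp", "Ghent", "Dendermonde", "Ghent", "Antwerp"]
partials = [do_work(s) for s in split(cities, 3)]  # sequential here, parallel on a cluster
output = reduce(combine, partials, {})
print(output)
```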
36. Serialization & Schema
Catch errors as quickly as they happen.
Validation on write vs on read.
37. Serialization & Schema
CSV is actually a serialization language that is
just poorly defined.
38. Serialization & Schema
Use a format with a schema.
- Thrift
- Avro
- Protocol Buffers
Added bonus: it’s faster & uses less space.
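Validation on write can be illustrated without any of the frameworks above. This sketch uses a plain Python dataclass as a stand-in for a Thrift, Avro or Protocol Buffers schema. The `LocationEvent` type and the `append` helper are hypothetical, and a real schema format would also handle the binary encoding.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class LocationEvent:
    """Hypothetical record type standing in for a real schema definition."""
    person: str
    location: str
    time: date

    def __post_init__(self):
        # Validate on write: reject malformed records before they are stored.
        if not isinstance(self.person, str) or not self.person:
            raise ValueError("person must be a non-empty string")
        if not isinstance(self.location, str) or not self.location:
            raise ValueError("location must be a non-empty string")
        if not isinstance(self.time, date):
            raise ValueError("time must be a datetime.date")

def append(store, event):
    """Only well-formed events ever reach the master data set."""
    store.append(event)

store = []
append(store, LocationEvent("Nathan", "Ghent", date(2013, 2, 3)))

try:
    # A date passed as a string is caught here, not months later at read time.
    append(store, LocationEvent("Geert", "Dendermonde", "2011-10-08"))
except ValueError as err:
    print("rejected at write time:", err)
```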
39. Batch View Database
Read only database.
No random writes required.
40. Batch View Database
Every iteration produces the views from scratch.
42. Batch Layer
We are not done yet…
Just a few hours of data.
[Figure: timeline ending at "Now". Earlier data: absorbed into batch views. Most recent data: not yet absorbed.]
48. Speed Layer
Storing a limited window of
data.
Compensating for the last few hours of data.
49. Speed Layer
All the complexity is isolated in the
Speed layer.
If anything goes wrong, it’s auto-corrected.
50. CAP
You have a choice between:
- Availability: queries are eventually consistent.
- Consistency: queries are consistent.
51. Eventual accuracy
Some algorithms are hard to
implement in real time. For those
cases we could estimate the results.
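Unique counts are one case where an estimate is often good enough. The sketch below uses a K-minimum-values estimator. That technique is chosen here for illustration and is not named in the talk, and `estimate_unique` is a hypothetical helper. It remembers only the k smallest hash values, so memory stays bounded however many items stream past.

```python
import hashlib
import heapq

def _hash01(item):
    """Map an item to a pseudo-random float in [0, 1) via SHA-1."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def estimate_unique(items, k=256):
    """Estimate the number of distinct items from the k smallest hashes."""
    smallest = []    # max-heap (negated values) holding the k smallest hashes
    members = set()  # the same hashes, used to skip duplicates
    for item in items:
        h = _hash01(item)
        if h in members:
            continue
        if len(smallest) < k:
            heapq.heappush(smallest, -h)
            members.add(h)
        elif h < -smallest[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(smallest, -h)
            members.discard(evicted)
            members.add(h)
    if len(smallest) < k:
        return len(smallest)  # fewer than k distinct items: the count is exact
    kth = -smallest[0]        # k-th smallest hash value
    return int((k - 1) / kth)

visitors = [f"user-{i % 5000}" for i in range(20000)]  # 5000 distinct ids
print(estimate_unique(visitors))
```

A larger k gives a tighter estimate at the cost of more memory. The batch layer can later replace the estimate with the exact count.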
59. Speed Layer Views
The views are stored in a read & write database.
- Cassandra
- HBase
- Redis
- MySQL
- ElasticSearch
- …
Much more complex than a read-only view.
69. Lambda Architecture
Any view, batch or real-time, can be discarded and recreated from the master data.
70. Lambda Architecture
Mistakes are corrected via recomputation.
Write bad data? Remove the data & recompute.
Bug in view generation? Just recompute the view.
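A toy version of the whole architecture fits in a few lines. This is a sketch under stated assumptions: the page-view events, the `batch_cutoff` timestamp and the function names are all hypothetical, and plain in-memory counters stand in for the batch and real-time view databases.

```python
from collections import Counter

def build_view(events):
    """View generation: page-view counts per URL, computed from raw events."""
    return Counter(url for url, _ in events)

def query(batch_view, realtime_view, url):
    """Serving layer: merge the batch view with the real-time view."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

# Master data: immutable (url, timestamp) events.
master = [("/home", 1), ("/home", 2), ("/about", 3), ("/home", 4)]
batch_cutoff = 3  # events before the cutoff are already absorbed by the batch layer

batch_view = build_view(e for e in master if e[1] < batch_cutoff)
realtime_view = build_view(e for e in master if e[1] >= batch_cutoff)
print(query(batch_view, realtime_view, "/home"))

# Wrote bad data? Remove it from the master data and recompute both views.
master = [e for e in master if e != ("/home", 2)]
batch_view = build_view(e for e in master if e[1] < batch_cutoff)
realtime_view = build_view(e for e in master if e[1] >= batch_cutoff)
print(query(batch_view, realtime_view, "/home"))
```

The views hold no state of their own, so a bug in `build_view` is fixed the same way: correct the function and run it again.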
74. DataCrunchers
We enable companies in envisioning, defining and
implementing a data strategy.
A one-stop-shop for all your Big Data needs.
The first Big Data Consultancy agency in Belgium.
How much data do you have? 44 times as much data in the next decade; 15 ZB in 2015.
Data silos (ERP, CRM, …).
Customers: Trimble (3 TB in their database system), Truvo (changing an index takes 24 hours).
Traditional systems cannot handle this volume.
Turn 12 terabytes of Tweets created each day into improved product sentiment analysis.
Convert 350 billion annual meter readings to better predict power consumption.
Real time: time-sensitive decision making.
Fraud detection, energy allocation, marketing campaigns, market transactions.
Solution: real-time solutions in combination with batch (Hadoop); NoSQL systems.
Structured vs. unstructured: 80% is unstructured data. A key drawback of using traditional relational database systems is that they're not good at handling variable data. A flexible data model.
Word, email, photo, text, video, APIs, …?
What are your needs regarding variety?
The end result: bringing structure into unstructured data.
Monitor hundreds of live video feeds from surveillance cameras to target points of interest.
Exploit the 80% data growth in images, video and documents to improve customer satisfaction.
We can afford to keep immutable copies of lots of data.
We NEED immutability to coordinate with fewer challenges.
Semaphores & locks are the things to avoid: instruction opportunities lost waiting for a semaphore increase with more cores…
The number of followers on Twitter = all follows & unfollows combined.
Account balance.
Data = event.
In an ever-changing world we found a ‘safe haven’ for data.
Everything we do generates events:
- Pay with credit card
- Commit to Git
- Click on a webpage
- Tweet
It is easier to store all data in a cost-effective way.
Compare to the DWH world.
Immutability greatly restricts the range of errors that can cause data loss or data corruption.
E.g. only CR, no more CRUD.
Information might of course change.
Fault tolerance:
- Data loss: human error, hardware failure
- Data corruption
Parallel with functional programming.
Allows state regeneration. E.g. what was my bank balance on 1 May 2005?
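With an immutable event log, the bank balance question becomes a replay of events up to a date. A minimal sketch with a made-up ledger. The transactions and the `balance_on` helper are hypothetical.

```python
from datetime import date

# Hypothetical immutable ledger: (date, signed amount).
TRANSACTIONS = [
    (date(2005, 1, 10), 1000),
    (date(2005, 3, 2), -250),
    (date(2005, 6, 15), -400),
    (date(2006, 1, 5), 300),
]

def balance_on(transactions, day):
    """Regenerate state by replaying every transaction up to and including `day`."""
    return sum(amount for when, amount in transactions if when <= day)

print(balance_on(TRANSACTIONS, date(2005, 5, 1)))
```

A system that only stores the current balance cannot answer this question at all.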
Queries as pure functions that take all data as input is the most general formulation. Different functions may look at different portions and aggregate information in different ways.
Too slow; might be petabyte scale.
Impala/Drill: why not?
The batch layer can calculate anything (given enough time).
The batch layer stores the data normalized, but the views it generates are often, if not always, denormalized.
Not vertically
It’s OK to croak and restart
Is something really immutable when its name can change?
Doesn’t have to be Hadoop. What matters here is a distributed file system combined with a processing framework.
Spark,
http://www.quora.com/Apache-Hadoop/What-is-the-advantage-of-writing-custom-input-format-and-writable-versus-the-TextInputFormat-and-Text-writable/answer/Eric-Sammer?srid=PU&st=ns
Value of schemas:
• Structural integrity
• Guarantees on what can and can’t be stored
• Prevents corruption
Otherwise you’ll detect corruption issues at read-time.
But this can be solved, for example by using ES to generate your views in advance.
In some circumstances.
All the complexity of *dealing* with the CAP theorem (like read repair) is isolated in the realtime layer.
Consistency (all nodes see the same data at the same time)
Availability (a guarantee that every request receives a response about whether it was successful or failed)
Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
http://codahale.com/you-cant-sacrifice-partition-tolerance/
HBase vs Cassandra
E.g. unique counts, ML.
Nimbus: manages the cluster.
Worker node:
- Supervisor: manages workers; restarts them if needed.
- Executor: physical JVM process. Executes tasks (those are spread evenly across the workers).
- Tasks: each in its own thread. A task is the actual bolt or spout and processes the stream.
Tuple: named list of values; dynamically typed.
Stream: sequence of tuples.
Spout: source of streams; sometimes replayable.
Bolt: stream transformations; at least 1 input stream; 0 - * output streams.
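The tuple, stream, spout and bolt concepts can be imitated with Python generators. This is not Storm's API (Storm topologies are normally written against its Java interfaces). It is only a single-process analogy, and all function names are hypothetical.

```python
from collections import Counter

def sentence_spout():
    """Spout: a source of tuples (a fixed list here, so it is replayable)."""
    for sentence in ["storm processes streams", "hadoop processes batches"]:
        yield {"sentence": sentence}  # tuple = named list of values

def split_bolt(stream):
    """Bolt: one input stream in, one output stream of words out."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}

def count_bolt(stream):
    """Bolt: keeps running counts and emits the updated count per word."""
    counts = Counter()
    for tup in stream:
        counts[tup["word"]] += 1
        yield {"word": tup["word"], "count": counts[tup["word"]]}

# Wire spout -> bolt -> bolt, then keep the last count seen for each word.
final = {}
for tup in count_bolt(split_bolt(sentence_spout())):
    final[tup["word"]] = tup["count"]
print(final)
```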
The serving layer needs to be able to answer any query in a short amount of time.
AVG = sum / count; pre-aggregate the sum and the count, but not everything can be pre-aggregated.
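Why store sum and count rather than the average itself: partial sums and counts can be merged across views, while averages cannot. A small sketch with made-up numbers; the helper names are hypothetical.

```python
def partial(values):
    """Pre-aggregate a chunk of values into a (sum, count) pair."""
    return (sum(values), len(values))

def merge(a, b):
    """Partial aggregates combine by component-wise addition."""
    return (a[0] + b[0], a[1] + b[1])

def average(agg):
    """Final step at query time; None when there is no data."""
    total, count = agg
    return total / count if count else None

batch = partial([10, 20, 30])  # would live in the batch view
recent = partial([40])         # would live in the real-time view

print(average(merge(batch, recent)))             # correct overall average
print((average(batch) + average(recent)) / 2)    # averaging averages gives a different number
```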
Lambda was first named by Alonzo Church, who needed a letter for functional abstraction in the theory of computation in the 1930s.