Describes the thinking behind MapR's architecture. MapR's Hadoop distribution achieves better reliability on commodity hardware than anything else on the planet, including custom, proprietary hardware from other vendors. Apache HDFS and Cassandra replication are also discussed, as are SAN and NAS storage systems such as NetApp and EMC.
2. 2
MapR Distribution for Apache Hadoop
• 100% Apache Hadoop
• With significant enterprise-grade enhancements
• Comprehensive management
• Industry-standard interfaces
• Higher performance
3. 3
MapR: Lights Out Data Center Ready
Reliable Compute
• Automated stateful failover
• Automated re-replication
• Self-healing from HW and SW failures
• Load balancing
• Rolling upgrades
• No lost jobs or data
• 99.999% uptime
Dependable Storage
• Business continuity with snapshots and mirrors
• Recover to a point in time
• End-to-end checksumming
• Strong consistency
• Built-in compression
• Mirror between two sites by RTO policy
4. 4
MapR does MapReduce (fast)
TeraSort Record
1 TB in 54 seconds
1003 nodes
MinuteSort Record
1.5 TB in 59 seconds
2103 nodes
5. 5
MapR does MapReduce (faster)
MinuteSort Record
1.65 TB sorted in a minute
300 nodes
6. 6
The Cloud Leaders Pick MapR
• Google chose MapR to provide Hadoop on Google Compute Engine
• Amazon EMR is the largest Hadoop provider in revenue and # of clusters
• Deploying OpenStack? MapR partners with Canonical and Mirantis on OpenStack support
9. 9
How to make a cluster reliable?
1. Make the storage reliable
   – Recover from disk and node failures
2. Make services reliable
   – Services need to checkpoint their state rapidly
   – Restart failed services, possibly on another node
   – Move checkpointed state to the restarted service, using (1) above
3. Do it fast
   – Instant-on … (1) and (2) must happen very, very fast
   – Without maintenance windows
     • No compactions (e.g., Cassandra, Apache HBase)
     • No "anti-entropy" process that periodically sweeps the cluster (e.g., Cassandra)
14. 14
Reliability With Commodity Hardware
• No NVRAM
• Cannot assume special connectivity
  – no separate data paths for "online" vs. replica traffic
• Cannot even assume more than 1 drive per node
  – no RAID possible
• Use replication, but …
  – cannot assume peers have equal drive sizes
  – what if the drive on the first machine is 10x larger than the drive on the other?
• No choice but to replicate for reliability
15. 15
ď§ Replication is easy, right? All we have to do is send the same bits
to the master and replica.
Reliability via Replication
Clients
Primary
Server
Replica
Clients
Primary
Server
Replica
Normal replication, primary forwards Cassandra-style replication
16. 16
But crashes occur…
• When the replica comes back, it is stale
  – it must be brought up-to-date
  – until then, the data is exposed to further failures
Who re-syncs?
[Diagram: Clients, Primary Server, Replica. Either the primary re-syncs the replica, or the replica remains stale until an "anti-entropy" process is kicked off by the administrator.]
22. 22
Unless it's Apache HDFS …
• HDFS solves the problem a third way
• Make everything read-only
  – Nothing to re-sync
• Single writer, no reads allowed while writing
• File close is the transaction that allows readers to see data
  – unclosed files are lost
  – cannot write any further to a closed file
• Real-time is not possible with HDFS
  – to make data visible, files must be closed immediately after writing
  – too many files is a serious problem with HDFS (a well-documented limitation)
• HDFS therefore cannot do NFS, ever
  – there is no "close" in NFS … data can be lost at any time
23. 23
This is the 21st century…
• To support normal apps, we need full read/write support
• Let's return to the issue: re-syncing the replica when it comes back
Who re-syncs?
[Diagram repeated: Clients, Primary Server, Replica. Either the primary re-syncs the replica, or the replica remains stale until an "anti-entropy" process is kicked off by the administrator.]
28. 28
How long to re-sync?
• 24 TB / server
  – @ 1000 MB/s = 7 hours
  – in practical terms, @ 200 MB/s = 35 hours
• Did you say you want to do this online?
  – throttle the re-sync rate to 1/10th
  – 350 hours to re-sync (= 15 days)
• What is your Mean Time To Data Loss (MTTDL)?
  – how long before a double disk failure?
  – a triple disk failure?
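The arithmetic on this slide is easy to verify with a quick sketch (the slide rounds 6.7 to 7, 33 to 35, and 333 to 350 hours):

```python
def resync_hours(tb=24, mb_per_s=200, throttle=1.0):
    """Hours to re-copy a server's data at the given rate (1 TB ~ 1e6 MB)."""
    seconds = (tb * 1e6) / (mb_per_s * throttle)
    return seconds / 3600

print(resync_hours(24, 1000))        # ~6.7 h at full disk bandwidth
print(resync_hours(24, 200))         # ~33 h at a realistic sustained rate
print(resync_hours(24, 200, 0.1))    # ~333 h (~14 days) throttled to 1/10th
```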
32. 32
Traditional solutions
• Use dual-ported disk to side-step this problem
[Diagram: Clients, Primary Server, Replica, both servers attached to a dual-ported disk array; RAID-6 with idle spares; servers use NVRAM]
COMMODITY HARDWARE? LARGE SCALE CLUSTERING?
Large purchase contracts, 5-year spare-parts plan
34. 34
Forget Performance?
[Diagram: Traditional Architecture, where Apps (function) talk to an RDBMS (function) backed by a SAN/NAS full of data, vs. Hadoop, where every node pairs data with function]
Geographically dispersed also?
38. 38
What MapR does
• Chop the data on each node into 1000's of pieces
  – not millions of pieces, only 1000's
  – the pieces are called containers
• Spread replicas of each container across the cluster
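A toy sketch of that placement idea follows; the function names and the random replica policy are illustrative, not MapR's actual algorithm:

```python
import random

def place_containers(num_nodes=100, containers_per_node=1000, replicas=3):
    """Assign each container a primary node plus replicas elsewhere.

    Returns {container_id: [primary, replica, replica]}.
    """
    placement = {}
    nodes = list(range(num_nodes))
    cid = 0
    for primary in nodes:
        for _ in range(containers_per_node):
            others = random.sample([n for n in nodes if n != primary],
                                   replicas - 1)
            placement[cid] = [primary] + others
            cid += 1
    return placement

# Every node ends up holding replicas of containers whose primaries are
# spread across the rest of the cluster, so a dead node's data can later
# be rebuilt by many peers in parallel.
placement = place_containers(num_nodes=10, containers_per_node=100)
```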
45. 45
MapR Replication Example
• 100-node cluster
• each node holds 1/100th of every other node's data
• when a server dies …
• the entire cluster re-syncs the dead node's data
50. 50
MapR Re-sync Speed
• 99 nodes re-syncing in parallel
  – 99x the number of drives
  – 99x the number of ethernet ports
  – 99x the CPUs
• Each is re-syncing only 1/100th of the data
• Net speed-up is about 100x
  – 350 hours vs. 3.5
• MTTDL is 100x better
52. 52
MapR's Read-Write Replication
• Writes are synchronous
• Data is replicated in a "chain" fashion
  – utilizes the full-duplex network
• Meta-data is replicated in a "star" manner
  – better response time
[Diagram: client1 … clientN writing through a chain of replicas, and client1 … clientN writing through a star]
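The two topologies can be modeled in a few lines. This is a toy sketch, not MapR's wire protocol; the `Node` class and function names are illustrative:

```python
class Node:
    def __init__(self):
        self.log = []

    def store(self, data):
        self.log.append(data)
        return True

def chain_write(data, replicas):
    """Chain replication: each node persists, then forwards downstream.

    The ack propagates back up the chain, so the write is synchronous
    and each link only uses its own full-duplex connection.
    """
    if not replicas:
        return True
    head, rest = replicas[0], replicas[1:]
    head.store(data)
    return chain_write(data, rest)

def star_write(data, replicas):
    """Star replication: the primary fans out to all replicas at once,
    trading network fan-out for lower latency (used for metadata)."""
    return all(r.store(data) for r in replicas)
```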
53. 53
ďŹ As data size increases, writes
spread more, like dropping a
pebble in a pond
ďŹ Larger pebbles spread the
ripples farther
ďŹ Space balanced by moving idle
containers
Container Balancing
⢠Servers keep a bunch of containers "ready to go".
⢠Writes get distributed around the cluster.
54. 54
MapR Container Resync
• MapR is 100% random-write
  – a very tough problem
• On a complete crash, all replicas diverge from each other
• On recovery, which one should be master?
[Diagram: complete crash, with replicas stopped at different points]
55. 55
MapR Container Resync
• MapR can detect exactly where replicas diverged
  – even at a 2000 MB/s update rate
• Resync means
  – roll the rest back to the divergence point
  – roll forward to converge with the chosen master
• Done while online
  – with very little impact on normal operations
[Diagram: new master chosen after the crash]
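A minimal sketch of the roll-back/roll-forward idea, using per-replica operation logs. This is illustrative only; MapR's actual mechanism operates on containers, not Python lists:

```python
def divergence_point(master_log, replica_log):
    """Index of the first operation where the two logs disagree."""
    i = 0
    while (i < min(len(master_log), len(replica_log))
           and master_log[i] == replica_log[i]):
        i += 1
    return i

def resync(master_log, replica_log):
    """Roll the replica back to the divergence point, then roll it
    forward by replaying the chosen master's operations from there."""
    i = divergence_point(master_log, replica_log)
    del replica_log[i:]                  # roll back
    replica_log.extend(master_log[i:])   # roll forward
    return replica_log

master  = ["w1", "w2", "w3", "w4"]
replica = ["w1", "w2", "x3"]     # diverged after a crash
assert resync(master, replica) == master
```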
56. 56
MapR does Automatic Resync Throttling
• Resync traffic is "secondary"
• Each node continuously measures RTT to all its peers
• Slower peers get throttled more
  – an idle system runs at full speed
• All automatic
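One plausible shape for such a throttle, as an illustrative sketch only (the slide does not specify the actual policy or constants):

```python
def resync_rate(base_mb_s, rtt_ms, idle_rtt_ms=1.0):
    """Scale the resync rate down as a peer's RTT grows above its idle
    baseline: a loaded (slow) peer is throttled harder, while an idle
    cluster resyncs at full speed."""
    return base_mb_s * min(1.0, idle_rtt_ms / rtt_ms)

print(resync_rate(200, rtt_ms=1.0))   # idle peer: full rate
print(resync_rate(200, rtt_ms=10.0))  # loaded peer: 1/10th of the rate
```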
58. 58
MapR's No-NameNode Architecture
HDFS Federation:
• Multiple single points of failure
• Limited to 50-200 million files
• Performance bottleneck
• Commercial NAS required
MapR (distributed metadata):
• HA w/ automatic failover
• Instant cluster restart
• Up to 1T files (> 5000x advantage)
• 10-20x higher performance
• 100% commodity hardware
[Diagram: HDFS Federation, with NameNode pairs (A-B, C-D, E-F) backed by a NAS appliance and DataNodes holding blocks A-F below, vs. MapR, with metadata distributed across all nodes]
61. 61
MapR's NFS allows Direct Deposit
• Connectors not needed
• No extra scripts or clusters to deploy and maintain
• Random read/write
• Compression
• Distributed HA
[Diagram: Web Server, Database Server, Application Server, … depositing data directly into the cluster over NFS]
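With the cluster mounted over NFS, any stock program can "deposit" data using plain POSIX file I/O; a sketch, where the default mount root and path are assumed examples, not real paths:

```python
import os

def deposit(record, mount_root="/mapr/my.cluster.com"):
    """Append a record through the cluster's NFS mount using ordinary
    file I/O; no Hadoop client or connector is involved.

    The default mount_root is an illustrative example.
    """
    path = os.path.join(mount_root, "apps", "weblogs", "access.log")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a") as f:   # plain POSIX append over NFS
        f.write(record + "\n")
    return path
```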
63. 63
MapR Volumes
100K volumes are OK,
create as many as
desired!
/projects
/tahoe
/yosemite
/user
/msmith
/bjohnson
Volumes dramatically simplify the
management of Big Data
⢠Replication factor
⢠Scheduled mirroring
⢠Scheduled snapshots
⢠Data placement control
⢠User access and tracking
⢠Administrative permissions
65. 65
M7 Tables
• M7 tables are integrated into the storage layer
  – always available on every node, zero admin
• Unlimited number of tables
  – Apache HBase is typically 10-20 tables (max 100)
• No compactions
• Instant-On
  – zero recovery time
• 5-10x better performance
• Consistent low latency
  – at the 95th and 99th percentiles
73. 73
MapR makes Hadoop truly HA
• ALL Hadoop components are Highly Available, e.g., YARN
• The ApplicationMaster (the old JobTracker role) and TaskTracker record their state in MapR
• On node failure, the AM recovers its state from MapR
  – works even if the entire cluster is restarted
• All jobs resume from where they were
  – only from MapR
• Allows pre-emption
  – MapR can pre-empt any job without losing its progress
  – the ExpressLane™ feature in MapR exploits this
79. 79
Where/how can YOU exploit MapR's unique advantage?
• ALL your code can easily be scale-out HA
• Save service state in MapR
• Save data in MapR
• Use ZooKeeper to notice service failure
• Restart anywhere; data and state will move there automatically
• That's what we did!
• Only from MapR: HA for Impala, Hive, Oozie, Storm, MySQL, SOLR/Lucene, Kafka, …
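The recipe on this slide can be sketched in a few lines. Assumptions: the checkpoint path is illustrative, and failure detection (e.g., via a ZooKeeper ephemeral node) is stubbed out so the sketch stays self-contained:

```python
import json
import os

class HAService:
    """Checkpoint/restart pattern: keep all service state in shared
    cluster storage so the service can be restarted on any node and
    resume from where it was. Failure detection (e.g., a ZooKeeper
    watch on an ephemeral node) is out of scope for this sketch."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.state = self._load()

    def _load(self):
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                return json.load(f)      # resume from prior state
        return {"processed": 0}          # fresh start

    def handle(self, item):
        self.state["processed"] += 1
        with open(self.checkpoint_path, "w") as f:
            json.dump(self.state, f)     # rapid checkpoint after each step
```

After node A dies, constructing `HAService` with the same checkpoint path on node B picks up exactly where node A left off.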
80. 80
MapR: Unlimited Scale
Build the cluster brick by brick, one node at a time:
• Use commodity hardware at rock-bottom prices
• Get enterprise-class reliability: instant restart, snapshots, mirrors, no single point of failure, …
• Export via NFS, ODBC, Hadoop and other standard protocols
# files, # tables: trillions
# rows per table: trillions
# data: 1-10 exabytes
# nodes: 10,000+