MapR Architecture Presentation given at Strata + Hadoop World 2013 by MapR CTO & Co-Founder M.C. Srivas
Prior to co-founding MapR, Srivas ran one of the major search infrastructure teams at Google, where GFS, BigTable and MapReduce were used extensively. He wanted to provide that powerful capability to everyone, and started MapR on his vision to build the next-generation platform for Big Data. His strategy was to evolve Hadoop, bringing simplicity of use, extreme speed and complete reliability to Hadoop users everywhere, and making it seamless for enterprises to use this powerful new way to get deep insights. Srivas brings to MapR his experience at Google, Spinnaker Networks and Transarc building game-changing products that advance the state of the art.
Shift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
1. Shift into High Gear: Dramatically Improve Hadoop and NoSQL Performance
M. C. Srivas, CTO/Founder
2. MapR History
[Timeline graphic, '06 to '13: Google publishes the MapReduce paper; (struggling with Nutch) copies the MR paper, calls it Hadoop ('06); Web companies adopt Hadoop; DOW drops from 14,300 to 6,800; MapR founded ('09); MapR partners with …; breaks all world records ('12); MapR M7 launches, ultra-reliable, the fastest NoSQL on the planet ('13); 2500 nodes, the largest non-Web installation]
3. MapR Distribution for Apache Hadoop
[Stack diagram: vertical apps, analytics, enterprise apps, Web 2.0, tools, and operations running over the MapR NoSQL data platform, which serves batch, interactive, real-time, and OLTP workloads]
Hardening, testing, integrating, innovating, and contributing as part of the Apache Open Source Community
Platform qualities: 99.999% HA, data protection, disaster recovery, scalability & performance, enterprise integration, multi-tenancy, management
4. MapR Distribution for Apache Hadoop
100% Apache Hadoop, with significant enterprise-grade enhancements:
• Comprehensive management
• Industry-standard interfaces
• Higher performance
5. The Cloud Leaders Pick MapR
• Amazon EMR is the largest Hadoop provider in revenue and # of clusters
• Google chose MapR to provide Hadoop on Google Compute Engine
• MapR's partnership with Canonical and Mirantis provides advantages for OpenStack deployments
7. How to Make a Cluster Reliable?
1. Make the storage reliable
– Recover from disk and node failures
2. Make services reliable
– Services need to checkpoint their state rapidly
– Restart a failed service, possibly on another node
– Move the check-pointed state to the restarted service, using (1) above (see the sketch after this list)
3. Do it fast
– Instant-on … (1) and (2) must happen very, very fast
– Without maintenance windows
• No compactions (e.g., Cassandra, Apache HBase)
• No “anti-entropy” that periodically wipes out the cluster (e.g., Cassandra)
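A minimal sketch of the checkpoint/restart pattern from step (2), assuming the checkpoint path sits on shared, replicated storage so another node can read it (the path and the byte-array state are hypothetical):

    import java.io.IOException;
    import java.nio.file.*;

    public class CheckpointedService {
        // Assumed to live on reliable, cluster-visible storage.
        static final Path CHECKPOINT = Paths.get("/mnt/cluster/service/state.ckpt");

        // Write the new state to a temp file, then atomically rename it
        // over the old checkpoint, so a crash never leaves a torn state.
        static void checkpoint(byte[] state) throws IOException {
            Path tmp = CHECKPOINT.resolveSibling("state.ckpt.tmp");
            Files.write(tmp, state);
            Files.move(tmp, CHECKPOINT, StandardCopyOption.ATOMIC_MOVE);
        }

        // On restart, possibly on a different node, recover the last state.
        static byte[] recover() throws IOException {
            return Files.exists(CHECKPOINT) ? Files.readAllBytes(CHECKPOINT) : new byte[0];
        }
    }

Rapid checkpointing then reduces to how often checkpoint() is called; the reliable storage layer from (1) is what lets the restarted instance call recover() from any node.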
8. Reliability with Commodity Hardware
• No NVRAM
• Cannot assume special connectivity
– Cannot even assume more than 1 drive per node
– No RAID possible
• Use replication, but …
– No separate data paths for "online" vs. replica traffic
– Cannot assume peers have equal drive sizes (drive on the first machine is 10x larger than the drive on the other?)
• No choice but to replicate for reliability
9. Reliability via Replication
Replication is easy, right? All we have to do is send the same bits to the master and replica.
[Diagram: normal replication, where clients send writes to the primary server and the primary forwards them to the replica, vs. Cassandra-style replication, where clients send writes to the primary and the replica directly]
10. But Crashes Occur…
When the replica comes back, it is stale
– It must be brought up-to-date
– Until then, the data is exposed to further failure
Who re-syncs?
[Diagram: the primary re-syncs the replica, vs. the replica remaining stale until an "anti-entropy" process is kicked off by the administrator]
11. Unless it’s HDFS …
HDFS solves the problem a third way: make everything read-only
– Static data is trivial to re-sync
Single writer, no reads allowed while writing
– File close is the transaction that allows readers to see the data (sketched below)
– Unclosed files are lost
– Cannot write any further to a closed file
Realtime is not possible with HDFS
– To make data visible, the file must be closed immediately after writing
– Too many files is a serious problem with HDFS (a well-documented limitation)
HDFS therefore cannot do NFS, ever
– There is no “close” in NFS … data can be lost at any time
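To make the close-is-the-commit point concrete, here is the standard write pattern against the stock Hadoop FileSystem API (the path is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCloseSemantics {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataOutputStream out = fs.create(new Path("/tmp/events.log")); // single writer
            out.writeBytes("event-1\n");   // written, but not yet visible to readers
            out.close();                   // the "transaction": readers can now see the data
            // The closed file cannot be written to again, and anything
            // not closed before a crash is lost.
        }
    }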
12. This is the 21st Century…
To support normal apps, we need full read/write support. So let's return to the issue: who re-syncs the replica when it comes back?
[Diagram, as before: the primary re-syncs the replica, vs. the replica remaining stale until an "anti-entropy" process is kicked off by the administrator]
13. How Long to Re-sync?
24 TB / server
– @ 1000 MB/s = 7 hours
– Practical terms, @ 200 MB/s = 35 hours
Did you say you want to do this online?
– Throttle the re-sync rate to 1/10th
– 350 hours to re-sync (= 15 days)
What is your Mean Time To Data Loss (MTTDL)?
– How long before a double disk failure?
– A triple disk failure?
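A quick sanity check on those numbers (the slide rounds the 200 MB/s case up to 35 hours):

    24 TB / 1000 MB/s = 24,000,000 MB / 1000 MB/s =  24,000 s ≈  7 hours
    24 TB /  200 MB/s = 24,000,000 MB /  200 MB/s = 120,000 s ≈ 33 hours

Throttling to 1/10th of that rate multiplies the time by 10: about 350 hours, i.e. roughly 15 days of exposure to a second failure.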
14. Traditional Solutions
Use dual-ported disk to side-step this problem
[Diagram: clients write to a primary server and replica; servers use NVRAM; RAID-6 with idle spares on a dual-ported disk array. Stamped across it: COMMODITY HARDWARE, LARGE-SCALE CLUSTERING]
Large purchase contracts, 5-year spare-parts plan
16. What MapR Does
Chop the data on each node into 1000s of pieces
– Not millions of pieces, only 1000s
– The pieces are called containers
Spread replicas of each container across the cluster
18. MapR Replication Example
100-node cluster
– Each node holds 1/100th of every other node's data
– When a server dies, the entire cluster re-syncs the dead node's data
19. MapR Re-sync Speed
99 nodes re-sync'ing in parallel
– Each is re-sync'ing 1/100th of the data
– 99x the number of drives
– 99x the number of Ethernet ports
– 99x the CPUs
Net speedup is about 100x: 350 hours vs. 3.5
MTTDL is 100x better
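Restating the speedup as arithmetic: the dead node's data is rebuilt by the 99 surviving nodes in parallel, each moving only about 1/100th of it, so

    t(re-sync) ≈ 350 hours / 100 = 3.5 hours

and because the window in which a second or third failure can cause data loss shrinks by the same factor, MTTDL improves by roughly 100x.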
20. MapR Reliability
Mean Time To Data Loss (MTTDL) is far better
– Improves as the cluster size increases
Does not require idle spare drives
– The rest of the cluster has sufficient spare capacity to absorb one node’s data
– On a 100-node cluster, 1 node’s data == 1% of cluster capacity
Utilizes all resources
– No wasted “master-slave” nodes
– No wasted idle spare drives … all spindles put to use
Better reliability with fewer resources
– On commodity hardware!
22. MapR's Read-write Replication
Writes are synchronous
– Visible immediately
Data is replicated in a "chain" fashion
– Utilizes the full-duplex network
Meta-data is replicated in a "star" manner
– Better response time
[Diagram: client1, client2, … clientN writing through chained replicas for data and star-connected replicas for meta-data]
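A toy sketch of the chain write path (the Node interface and in-process recursion are hypothetical stand-ins for MapR's replicas and wire protocol): each replica persists locally, forwards downstream, and acks only after its successor does, so the client's ack implies every replica holds the write. Since each link sends downstream while receiving from upstream, the chain keeps the full-duplex network busy in both directions.

    import java.util.List;

    // Hypothetical replica abstraction: persist one write locally.
    interface Node {
        void writeLocal(byte[] data);
    }

    class ChainReplicator {
        // replicas.get(0) is the chain head the client talks to. The call
        // returns (the "ack") only after every downstream replica has
        // stored the data.
        static void write(List<Node> replicas, int i, byte[] data) {
            replicas.get(i).writeLocal(data);  // persist locally first
            if (i + 1 < replicas.size()) {
                write(replicas, i + 1, data);  // forward down the chain
            }
            // returning here = passing the ack back upstream
        }
    }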
23. Container Balancing
• Servers keep a bunch of containers "ready to go"
• Writes get distributed around the cluster
• As the data size increases, writes spread more (like dropping a pebble in a pond; larger pebbles spread the ripples farther)
• Space is balanced by moving idle containers
24. MapR Container Re-sync
MapR is 100% random write
– It uses distributed transactions
On a complete crash, all replicas diverge from each other. On recovery, which one should be master?
[Diagram: replica logs diverging after a complete crash]
25. MapR Container Re-sync
MapR can detect exactly where replicas diverged
– Even at a 2000 MB/s update rate
Re-sync means
– Rolling each replica back to the divergence point
– Rolling forward to converge with the chosen master (the new master after the crash)
Done while online
– With very little impact on normal operations
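A sketch of the roll-back/roll-forward idea over in-memory transaction logs (all names are hypothetical; the real mechanism operates on MapR's internal transaction log):

    import java.util.List;

    class ContainerResync {
        // Find the first index at which the replica's log disagrees with
        // the chosen master's log: the divergence point.
        static int divergencePoint(List<Long> master, List<Long> replica) {
            int i = 0, n = Math.min(master.size(), replica.size());
            while (i < n && master.get(i).equals(replica.get(i))) i++;
            return i;
        }

        // Roll the replica back to the divergence point, then roll it
        // forward by replaying the master's suffix.
        static void resync(List<Long> master, List<Long> replica) {
            int dp = divergencePoint(master, replica);
            replica.subList(dp, replica.size()).clear();        // roll back
            replica.addAll(master.subList(dp, master.size()));  // roll forward
        }
    }

After resync() every replica's log matches the new master's, so the container converges without taking the cluster offline.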
26. MapR Does Automatic Re-sync Throttling
Re-sync traffic is “secondary”
– Each node continuously measures the RTT to all its peers
– More throttle is applied to slower peers
– An idle system runs at full speed
All automatic
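One plausible shape for such a throttle (the constants and the linear rule are assumptions, not MapR's published algorithm): scale the allowed re-sync rate down as a peer's measured RTT rises above an idle-network baseline.

    class ResyncThrottle {
        static final double BASELINE_RTT_MS = 0.5; // RTT seen on an idle network
        static final long FULL_RATE_MBPS = 1000;   // unthrottled re-sync rate

        // Slower-looking peers (higher RTT = busier node or link) get a
        // proportionally lower re-sync rate; an idle peer gets full speed.
        static long allowedRateMbps(double measuredRttMs) {
            double slowdown = Math.max(1.0, measuredRttMs / BASELINE_RTT_MS);
            return (long) (FULL_RATE_MBPS / slowdown);
        }
    }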
28. MapR's Containers
Files and directories are sharded and placed into containers
Each container holds
– Directories & files
– Data blocks
Containers are 16-32 GB segments of disk, placed on nodes
– Replicated on servers
– Automatically managed
29. MapR’s Architectural Params
[Scale graphic, bytes on a log scale: i/o (KB), map-red (10^3), resync (10^6), admin (10^9); the HDFS 'block' is marked on the same scale]
Unit of I/O
– 4K/8K (8K in MapR)
Unit of chunking (a map-reduce split)
– 10-100's of megabytes
Unit of re-sync (a replica)
– 10-100's of gigabytes
– A container in MapR, automatically managed
Unit of administration (snap, repl, mirror, quota, backup)
– 1 gigabyte - 1000's of terabytes
– A volume in MapR
– Answers: what data is affected by my missing blocks?
30. Where & How Does MapR Exploit This Unique Advantage?
31. MapR's No-NameNode HA™ Architecture
[Diagram: HDFS Federation, with multiple NameNodes backed by a NAS appliance and DataNodes holding blocks A-F, vs. MapR's distributed metadata, with files A-F spread across all nodes]
MapR (distributed metadata):
• HA w/automatic failover
• Instant cluster restart
• Up to 1T files (> 5000x advantage)
• 10-20x higher performance
• 100% commodity hardware
HDFS Federation:
• Multiple single points of failure
• Limited to 50-200 million files
• Performance bottleneck
• Commercial NAS required
32. Relative Performance and Scale

                      MapR     Other     Advantage
    Rate (creates/s)  14-16K   335-360   40x
    Scale (files)     6B       1.3M      4615x

[Line charts: file creates/s vs. number of files, for the MapR distribution (roughly 14,000-16,000 creates/s, sustained out to thousands of millions of files) and the other distribution (roughly 350 creates/s, tailing off by about 1.5M files)]
Benchmark: 100-byte file creates
Hardware: 10 nodes, each 2 x 4 cores, 24 GB RAM, 12 x 1 TB 7200 RPM, 1 GigE NIC
33. Where & How Does MapR Exploit This Unique Advantage?
34. MapR’s NFS Allows Direct Deposit
[Diagram: Web Server, Database Server, Application Server, … writing directly into the cluster over NFS]
• Random read/write
• Compression
• Distributed, HA
• Connectors not needed
• No extra scripts or clusters to deploy and maintain
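Because the cluster is just an NFS mount, an ordinary program can deposit data with plain file I/O, with no Hadoop client or connector involved. A minimal sketch (the /mapr/my.cluster.com mount point and log line are illustrative; adjust to wherever the cluster is mounted):

    import java.io.FileWriter;
    import java.io.IOException;

    public class DirectDeposit {
        public static void main(String[] args) throws IOException {
            // Append a log line straight into the cluster via the NFS mount.
            try (FileWriter log = new FileWriter("/mapr/my.cluster.com/logs/app.log", true)) {
                log.write("2013-10-28 12:00:00 GET /index.html 200\n");
            }
        }
    }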
35. Where & How Does MapR Exploit This Unique Advantage?
36. MapR Volumes & Snapshots
/projects
  /tahoe
  /yosemite
/user
  /msmith
  /bjohnson
Volumes dramatically simplify the management of Big Data
• Replication factor
• Scheduled mirroring
• Scheduled snapshots
• Data placement control
• User access and tracking
• Administrative permissions
100K volumes are OK, create as many as desired!
37. Where & How Does MapR Exploit This Unique Advantage?
38. MapR M7 Tables
Binary compatible with Apache HBase
– No recompilation needed to access M7 tables
– Just set CLASSPATH
M7 tables are accessed via pathname
– openTable("hello") … uses HBase
– openTable("/hello") … uses M7
– openTable("/user/srivas/hello") … uses M7
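In the HBase 0.94-era Java client this looks as follows; the slide's openTable() corresponds to constructing an HTable, and only the table name changes (row and column names here are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class M7PathExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable hello = new HTable(conf, "hello");              // plain name: Apache HBase
            HTable m7    = new HTable(conf, "/user/srivas/hello"); // pathname: M7 table
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("v"));
            m7.put(put);    // same client API either way
            hello.close();
            m7.close();
        }
    }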
39. M7 Tables
M7 tables are integrated into the storage layer
– Always available on every node, zero admin
Unlimited number of tables
– Apache HBase is typically 10-20 tables (max 100)
No compactions
– Updates in place
Instant-on
– Zero recovery time
5-10x better performance
Consistent low latency
– At the 95th & 99th percentiles
40. YCSB 50-50 Mix (throughput)
[Charts: throughput (ops/sec) and latency (usec), MapR M7 vs. the other distribution's HBase 0.94.9]
42. Where & How Does MapR Exploit This Unique Advantage?
43. MapR Makes Hadoop Truly HA
ALL Hadoop components have High Availability
– e.g., YARN
The ApplicationMaster (the old JobTracker) and TaskTracker record their state in MapR
– On node failure, the AM recovers its state from MapR
– All jobs resume from where they were
– Works even if the entire cluster is restarted
– Only from MapR
Allows preemption
– MapR can preempt any job without losing its progress
– The ExpressLane™ feature in MapR exploits this
44. MapR Does MapReduce (Fast!)
TeraSort record: 1 TB in 54 seconds, 1003 nodes
MinuteSort record: 1.5 TB in 59 seconds, 2103 nodes
45. MapR Does MapReduce (Faster!)
TeraSort record: 1 TB in 54 seconds, 1003 nodes
MinuteSort record: 1.5 TB in 59 seconds, 2103 nodes
[Comparison chart]
46. YOUR CODE Can Exploit MapR’s Unique Advantages
ALL your code can easily be scale-out HA
– Save service state in MapR
– Save data in MapR
– Use Zookeeper to notice service failure (a sketch follows below)
– Restart anywhere; data + state will move there automatically
That’s what we did!
Only from MapR: HA for Impala, Hive, Oozie, Storm, MySQL, SOLR/Lucene, HCatalog, Kafka, …
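A minimal sketch of the ZooKeeper half of this recipe (the connect string, paths and timeout are hypothetical, and the parent /services znode is assumed to exist): the service registers an ephemeral znode; if its process or node dies, the znode disappears, and a supervisor watching it can restart the service on another node, where the state and data saved in the cluster are already reachable.

    import org.apache.zookeeper.*;

    public class ServiceWatcher {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zkhost:5181", 30000, event -> {});

            // The service instance registers itself; EPHEMERAL means the
            // znode is removed automatically when its session dies.
            zk.create("/services/mysvc", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            // A supervisor (anywhere) watches for the node's disappearance.
            zk.exists("/services/mysvc", event -> {
                if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                    System.out.println("service down: restart it on another node");
                }
            });
        }
    }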
47. MapR: Unlimited Hadoop
Reliable compute. Dependable storage.
• Build the cluster brick by brick, one node at a time
• Use commodity hardware at rock-bottom prices
• Get enterprise-class reliability: instant restart, snapshots, mirrors, no single point of failure, …
• Export via NFS, ODBC, Hadoop and other standard protocols
Editor's notes
MapR provides a complete platform for Apache Hadoop. MapR has integrated, tested and hardened a broad array of packages as part of this distribution: Hive, Pig, Oozie, Sqoop, plus additional packages such as Cascading. All of these packages, with MapR updates, are available on GitHub. We have spent over two years in a well-funded effort to make deep architectural improvements and create the next-generation distribution for Hadoop. This is in stark contrast with the alternative distributions from Cloudera, Hortonworks, and Apache, which are all equivalent.
The NameNode today in Hadoop is a single point of failure, a scalability limitation, and a performance bottleneck. With MapR there is no dedicated NameNode; the NameNode function is distributed across the cluster. This provides major advantages in terms of HA, data-loss avoidance, scalability and performance. With other distributions you have a bottleneck regardless of the number of nodes in the cluster, and the maximum number of files you can support is 200M, and that is with an extremely high-end server. 50% of the Hadoop processing at Facebook is to pack and unpack files to try to work around this limitation. MapR scales uniformly.
Another major advantage with MapR is the distributed NameNode. The NameNode today in Hadoop is a single point of failure, a scalability limitation, and a performance bottleneck. With MapR there is no dedicated NameNode; the NameNode function is distributed across the cluster. This provides major advantages in terms of HA, data-loss avoidance, scalability and performance. With other distributions you have a bottleneck regardless of the number of nodes in the cluster, and the maximum number of files you can support is between 70-100M. 50% of the Hadoop processing at Facebook is to pack and unpack files to try to work around this limitation. MapR scales uniformly.
With MapR’s Direct Access NFS we use an industry-standard protocol to load data into the cluster. There are no clients that need to be deployed, and no special agents. Is writing over NFS highly available? We have that. Do you need the data compressed? We do it automatically. We support random read/write.
Another aspect of ease of use is the support for volumes. Only MapR provides the logical volume construct to group content. Policies can be used to enforce user access, admin permissions, data protection policies, etc. Data placement can also be controlled, say onto certain nodes that have SSD drives. Other Hadoop distros require you to set permissions on individual files or across the entire cluster. MapR’s Hadoop distribution includes support for volumes, a unique idea that dramatically simplifies the management of your big data. Volumes allow you to efficiently define data management policies. With other Hadoop distributions you either set policies for the entire cluster or for individual files. Setting policies by file or directory is too granular; there are just too many files. On the other hand, cluster-wide policies are not granular enough. A volume is a logical construct that lets you collect and refer to data files regardless of where each file is physically located, and with no concern for the relationship between the data files.