This is the presentation I made on the Hadoop User Group Ireland meetup in Dublin. It covers the main ideas of both MPP, Hadoop and the distributed systems in general, and also how to chose the best option for you
4. 4Pivotal Confidential–Internal Use Only
Distributed Systems
Avoid distributed systems in all the problems that potentially
could be solved using non-distributed systems
5. 5Pivotal Confidential–Internal Use Only
Distributed Systems
Ÿ Consensus problem
– Paxos
– RAFT
– ZAB
– etc.
Ÿ Transaction consistency
– 2PC
– 3PC
7. 7Pivotal Confidential–Internal Use Only
Distributed Systems
http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
8. 8Pivotal Confidential–Internal Use Only
Distributed Systems
Reasons to use
Ÿ Performance issues
– More than 100’000 TPS
– More than 4 GB/sec scan rate
– More than 100’000 IOPS
Ÿ Capacity issues
– More than 50TB of data
Ÿ DR and Geo-Distribution
10. 10Pivotal Confidential–Internal Use Only
MPP
Main principles
Ÿ Shared Nothing
Ÿ Data Sharding
Ÿ Data Replication
Ÿ Distributed Transactions
Ÿ Parallel Processing
12. 12Pivotal Confidential–Internal Use Only
MPP
Works well for
Ÿ Relational data
Ÿ Batch processing
Ÿ Ad hoc analytical SQL
Ÿ Low concurrency
Ÿ Applications requiring ANSI SQL
13. 13Pivotal Confidential–Internal Use Only
MPP
Not the best choice for
Ÿ Non-relational data
Ÿ OLTP and event stream processing
Ÿ High concurrency
Ÿ 100+ server clusters
Ÿ Non-analytical use cases
Ÿ Geo-Distributed use cases
16. 16Pivotal Confidential–Internal Use Only
Hadoop
HDFS
Ÿ Distributed filesystem
Ÿ Block-level storage with big blocks
Ÿ Non-updatable
Ÿ Synchronous block replication
Ÿ No built-in Geo-Distribution support
Ÿ No built-in DR solution
18. 18Pivotal Confidential–Internal Use Only
Hadoop
YARN
Ÿ Cluster resource manager
Ÿ Manages CPU and RAM allocation
Ÿ Schedulers are pluggable
Ÿ Can handle different resource pools
Ÿ Supports both MR and non-MR workload
20. 20Pivotal Confidential–Internal Use Only
Hadoop
MapReduce
Ÿ Framework for distributed data processing
Ÿ Two main operations: map and reduce
Ÿ Data hits disk after “map” and before “reduce”
Ÿ Scales to thousands of servers
Ÿ Can process petabytes of data
Ÿ Extremely reliable
22. 22Pivotal Confidential–Internal Use Only
Hadoop
HBase
Ÿ Distributed key-value store
Ÿ Data is sharded by key
Ÿ Data is stored in sorted order
Ÿ Stores multiple versions of the row
Ÿ Easily scales
24. 24Pivotal Confidential–Internal Use Only
Hadoop
Hive
Ÿ Query engine with SQL-like syntax
Ÿ Translates HiveQL query to MR / Tez / Spark job
Ÿ Processes HDFS data
Ÿ Supports UDFs and UDAFs
26. 26Pivotal Confidential–Internal Use Only
Hadoop
Works well for
Ÿ Write Once Read Many
Ÿ 100+ server clusters
Ÿ Both relational and non-relational data
Ÿ High concurrency
Ÿ Batch processing and analytical workload
Ÿ Elastic scalability
27. 27Pivotal Confidential–Internal Use Only
Hadoop
Not the best choice for
Ÿ Write-heavy workloads
Ÿ Small clusters
Ÿ Analytical DWH cases
Ÿ OLTP and event stream processing
Ÿ Cost savings
30. 30Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business
MPP Hadoop
Platform Openness Mostly Closed Open
Hardware Options Mostly Appliances Commodity
31. 31Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business
MPP Hadoop
Platform Openness Mostly Closed Open
Hardware Options Mostly Appliances Commodity
Vendor Lock-in Typical Not Common
32. 32Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business
MPP Hadoop
Platform Openness Mostly Closed Open
Hardware Options Mostly Appliances Commodity
Vendor Lock-in Typical Not Common
Technology Price $200K – $10M $50K – $500K
33. 33Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business
MPP Hadoop
Platform Openness Mostly Closed Open
Hardware Options Mostly Appliances Commodity
Vendor Lock-in Typical Not Common
Technology Price $200K – $10M $50K – $500K
Implementation Cost Moderate High
34. 34Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business
MPP Hadoop
Platform Openness Mostly Closed Open
Hardware Options Mostly Appliances Commodity
Vendor Lock-in Typical Not Common
Technology Price $200K – $10M $50K – $500K
Implementation Cost Moderate High
Extensibility Vendor-provided APIs Open Source
35. 35Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business
MPP Hadoop
Platform Openness Mostly Closed Open
Hardware Options Mostly Appliances Commodity
Vendor Lock-in Typical Not Common
Technology Price $200K – $10M $50K – $500K
Implementation Cost Moderate High
Extensibility Vendor-provided APIs Open Source
Supportability Easy Complex
36. 36Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business
MPP Hadoop
Platform Openness Mostly Closed Open
Hardware Options Mostly Appliances Commodity
Vendor Lock-in Typical Not Common
Technology Price $200K – $10M $50K – $500K
Implementation Cost Moderate High
Extensibility Vendor-provided APIs Open Source
Supportability Easy Complex
Scalability Up to 100 servers Up to 5000 servers
37. 37Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business
MPP Hadoop
Platform Openness Mostly Closed Open
Hardware Options Mostly Appliances Commodity
Vendor Lock-in Typical Not Common
Technology Price $200K – $10M $50K – $500K
Implementation Cost Moderate High
Extensibility Vendor-provided APIs Open Source
Supportability Easy Complex
Scalability Up to 100 servers Up to 5000 servers
Scalability Up to 100-300 TB Up to 100 PB
38. 38Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business
MPP Hadoop
Platform Openness Mostly Closed Open
Hardware Options Mostly Appliances Commodity
Vendor Lock-in Typical Not Common
Technology Price $200K – $10M $50K – $500K
Implementation Cost Moderate High
Extensibility Vendor-provided APIs Open Source
Supportability Easy Complex
Scalability Up to 100 servers Up to 5000 servers
Scalability Up to 100-300 TB Up to 100 PB
Target Systems DWH Purpose-Built Batch
39. 39Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business
MPP Hadoop
Platform Openness Mostly Closed Open
Hardware Options Mostly Appliances Commodity
Vendor Lock-in Typical Not Common
Technology Price $200K – $10M $50K – $500K
Implementation Cost Moderate High
Extensibility Vendor-provided APIs Open Source
Supportability Easy Complex
Scalability Up to 100 servers Up to 5000 servers
Scalability Up to 100-300 TB Up to 100 PB
Target Systems DWH Purpose-Built Batch
Target End Users Business Analysts Developers
41. 41Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect
MPP Hadoop
Query Optimization Good Poor to None
Debugging Easy Very Hard
42. 42Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect
MPP Hadoop
Query Optimization Good Poor to None
Debugging Easy Very Hard
Accessibility SQL Mainly Java
43. 43Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect
MPP Hadoop
Query Optimization Good Poor to None
Debugging Easy Very Hard
Accessibility SQL Mainly Java
DBA Skill Level Low High
44. 44Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect
MPP Hadoop
Query Optimization Good Poor to None
Debugging Easy Very Hard
Accessibility SQL Mainly Java
DBA Skill Level Low High
Single Job Redundancy Low High
45. 45Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect
MPP Hadoop
Query Optimization Good Poor to None
Debugging Easy Very Hard
Accessibility SQL Mainly Java
DBA Skill Level Low High
Single Job Redundancy Low High
Query Latency 10-20 ms 10-20 sec
46. 46Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect
MPP Hadoop
Query Optimization Good Poor to None
Debugging Easy Very Hard
Accessibility SQL Mainly Java
DBA Skill Level Low High
Single Job Redundancy Low High
Query Latency 10-20 ms 10-20 sec
Query Runtime 5-7 sec 10-15 mins
47. 47Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect
MPP Hadoop
Query Optimization Good Poor to None
Debugging Easy Very Hard
Accessibility SQL Mainly Java
DBA Skill Level Low High
Single Job Redundancy Low High
Query Latency 10-20 ms 10-20 sec
Query Runtime 5-7 sec 10-15 mins
Query Max Runtime 1-2 hours 1-2 weeks
48. 48Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect
MPP Hadoop
Query Optimization Good Poor to None
Debugging Easy Very Hard
Accessibility SQL Mainly Java
DBA Skill Level Low High
Single Job Redundancy Low High
Query Latency 10-20 ms 10-20 sec
Query Runtime 5-7 sec 10-15 mins
Query Max Runtime 1-2 hours 1-2 weeks
Min Collection Size Megabytes Gigabytes
49. 49Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect
MPP Hadoop
Query Optimization Good Poor to None
Debugging Easy Very Hard
Accessibility SQL Mainly Java
DBA Skill Level Low High
Single Job Redundancy Low High
Query Latency 10-20 ms 10-20 sec
Query Runtime 5-7 sec 10-15 mins
Query Max Runtime 1-2 hours 1-2 weeks
Min Collection Size Megabytes Gigabytes
Max Concurrency 10-15 queries 70-100 jobs
51. 51Pivotal Confidential–Internal Use Only
Summary
Use MPP for
Ÿ Analytical DWH
Ÿ Ad hoc analyst SQL queries and BI
Ÿ Keep under 100TB of data
Use Hadoop for
Ÿ Specialized data processing systems
Ÿ Over 100TB of data