Innovations in Apache Hadoop MapReduce, Pig and Hive for improving query performance
- 1. Innovations In Apache Hadoop MapReduce,
Pig and Hive for improving query
performance
gopalv@apache.org
vinodkv@apache.org
- 5. • Scalability
– Already works great, just don’t break it for performance gains
• Isolation + Security
– Queries from different users run as different users
• Fault tolerance
– Keep all of MR’s safety nets to work around bad nodes in clusters
• UDFs
– Make sure they are “User” defined and not “Admin” defined
© Hortonworks Inc. 2013
- 7. Benchmark spec
• The TPC-DS benchmark data+query set
• Query 27 (big joins small)
– For all items sold in stores located in specified states during a given
year, find the average quantity, average list price, average list sales
price, average coupon amount for a given gender, marital status,
education and customer demographic.
• Query 82 (big joins big)
– List all items and current prices sold through the store channel from
certain manufacturers in a given price range and consistently had a
quantity between 100 and 500 on hand in a 60-day period.
- 9. TL;DR - II
• TPC-DS Query 82, Scale=200, 10 EC2 nodes (40 disks)
- 10. Forget the actual benchmark
• First of all, YMMV
– Software
– Hardware
– Setup
– Tuning
• Text formats seem to be the staple of all comparisons
– Really?
– Everybody’s using it but only for benchmarks!
- 11. What did the trick?
• MapReduce?
• HDFS?
• Or is it just Hive?
- 13. RCFile
• Binary RCFiles
• Hive pushes down column projections
• Less I/O, Less CPU
• Smaller files
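As a rough illustration of why pushing column projections down to a columnar file means less I/O and less CPU, here is a minimal sketch. It is a toy in-memory model, not the RCFile reader API; `read_row_oriented`, `to_row_group` and `read_columnar` are hypothetical names, and "values read" is only a stand-in for bytes read and deserialized.

```python
# Toy model: count how many values must be read to answer
# "SELECT price FROM items" under two layouts. Not the RCFile API.

rows = [
    {"item_id": 1, "name": "lamp", "price": 20},
    {"item_id": 2, "name": "desk", "price": 150},
    {"item_id": 3, "name": "chair", "price": 45},
]

def read_row_oriented(rows, wanted):
    """Row layout: every field of every row is read, then projected."""
    values_read = 0
    out = []
    for row in rows:
        values_read += len(row)  # the whole row is deserialized
        out.append(tuple(row[c] for c in wanted))
    return out, values_read

def to_row_group(rows):
    """Columnar layout: one list per column within a row group."""
    return {col: [r[col] for r in rows] for col in rows[0]}

def read_columnar(row_group, wanted):
    """Column layout: only the projected columns are touched."""
    values_read = sum(len(row_group[c]) for c in wanted)
    out = list(zip(*(row_group[c] for c in wanted)))
    return out, values_read

row_result, row_cost = read_row_oriented(rows, ["price"])
col_result, col_cost = read_columnar(to_row_group(rows), ["price"])
print(row_cost, col_cost)
```

Both reads return the same projected rows; the columnar read touches one column out of three.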
- 14. Data organization
• No data system at scale is loaded once & left alone
• Partitions are essential
• Data flows into new partitions every day
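The partitioning points above can be sketched as a toy model: each day's data lands in its own partition, and a query over a date range reads only the partitions in that range. The `ds=YYYY-MM-DD` key format and the function names `load_daily` and `scan` are assumptions made for illustration, not Hive internals.

```python
# Toy model of date-partitioned storage and partition pruning.
# Keys mimic a "ds=YYYY-MM-DD" layout; all names are illustrative.

from collections import defaultdict

partitions = defaultdict(list)  # partition key -> rows

def load_daily(ds, rows):
    """New data lands in its own partition; older partitions are untouched."""
    partitions[f"ds={ds}"].extend(rows)

def scan(ds_from, ds_to):
    """Read only partitions whose key falls in [ds_from, ds_to]."""
    lo, hi = f"ds={ds_from}", f"ds={ds_to}"
    touched = [k for k in sorted(partitions) if lo <= k <= hi]
    rows = [r for k in touched for r in partitions[k]]
    return rows, touched

load_daily("2013-06-24", [("lamp", 2)])
load_daily("2013-06-25", [("desk", 1), ("chair", 4)])
load_daily("2013-06-26", [("lamp", 1)])

rows, touched = scan("2013-06-25", "2013-06-26")
print(touched)
```

The scan never opens the 2013-06-24 partition, which is the effect that makes partitioning pay off as data accumulates.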
- 15. A closer look
• Now revisiting the benchmark and its results
- 22. What changed?
• Job Count/Correct plan
• Correct data formats
• Correct data organization
• Correct configuration
- 23. What changed?
Data Formats
Data Organization
Query Plan
- 25. Is that all?
• NO!
• In Hive
– Metastore
– RCFile issues
– CPU intensive code
• In YARN+MR
– Parallelism
– Spin-up times
– Data locality
• In HDFS
– Bad disks/deteriorating nodes
- 28. Hive Metastore
• 1+N Select problem
– SELECT partitions FROM tables;
– /* for each needed partition */ SELECT * FROM Partition ..
– For query 27, this generates > 5000 queries! 4-5 seconds lost on each call!
– Lazy loading or Include/Join are general solutions
• Datanucleus/ORM issues
– 100K NPEs: try.. catch.. ignore..
• Metastore DB Schema revisit
– Denormalize some/all of it?
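The 1+N select pattern and the join-based alternative can be sketched with an in-memory SQLite database. The `tbls` and `parts` tables below are a hypothetical, heavily simplified schema invented for this example; they are not the real metastore schema, and the helper names are likewise assumptions.

```python
# Sketch of the 1+N select problem versus a single joined query.
# Hypothetical two-table schema, not the actual Hive metastore schema.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbls (tbl_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE parts (
        part_id INTEGER PRIMARY KEY,
        tbl_id INTEGER,
        part_name TEXT,
        location TEXT
    );
""")
conn.execute("INSERT INTO tbls VALUES (1, 'store_sales')")
conn.executemany(
    "INSERT INTO parts (tbl_id, part_name, location) VALUES (1, ?, ?)",
    [(f"ds={d}", f"/warehouse/store_sales/ds={d}") for d in range(100)],
)

def fetch_one_plus_n(conn, table):
    """One query for the partition ids, then one more query per partition."""
    queries = 1
    ids = [r[0] for r in conn.execute(
        "SELECT p.part_id FROM parts p JOIN tbls t ON p.tbl_id = t.tbl_id "
        "WHERE t.name = ? ORDER BY p.part_id", (table,))]
    result = []
    for pid in ids:
        queries += 1
        result.append(conn.execute(
            "SELECT part_name, location FROM parts WHERE part_id = ?",
            (pid,)).fetchone())
    return result, queries

def fetch_joined(conn, table):
    """The same rows fetched in a single round trip."""
    result = conn.execute(
        "SELECT p.part_name, p.location FROM parts p "
        "JOIN tbls t ON p.tbl_id = t.tbl_id "
        "WHERE t.name = ? ORDER BY p.part_id", (table,)).fetchall()
    return result, 1

slow, n_slow = fetch_one_plus_n(conn, "store_sales")
fast, n_fast = fetch_joined(conn, "store_sales")
print(n_slow, n_fast)
```

With 100 partitions the first approach issues 101 statements and the second issues one; against a remote database each statement also pays a network round trip.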
- 30. RCFile issues
• RCFiles do not split well
– Row groups and row group boundaries
• Small row groups vs big row groups
– Sync() vs min split
– Storage packing
• Run-length information is lost
– Unnecessary deserialization costs
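One way to read the run-length point: if the reader only ever sees fully expanded values, every value is deserialized even when long runs repeat. The sketch below is a generic run-length encoding example under that reading; it does not reproduce RCFile's on-disk encoding, and the function names are hypothetical.

```python
# Generic run-length encoding sketch. Compares summing a column after
# expanding every value with summing directly over (value, count) runs.

def rle_encode(values):
    """Collapse consecutive equal values into (value, count) runs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def sum_expanded(runs):
    """Expand each run back to single values before aggregating."""
    total = 0
    touched = 0
    for v, n in runs:
        for _ in range(n):
            total += v
            touched += 1
    return total, touched

def sum_on_runs(runs):
    """Aggregate per run without expanding it."""
    return sum(v * n for v, n in runs), len(runs)

column = [7] * 1000 + [3] * 500 + [7] * 2
runs = rle_encode(column)
expanded_total, expanded_touched = sum_expanded(runs)
run_total, run_touched = sum_on_runs(runs)
print(expanded_touched, run_touched)
```

Both paths compute the same sum, but the run-aware path does work proportional to the number of runs rather than the number of rows.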
- 31. ORC file format
• A single file as output of each task.
– Dramatically simplifies integration with Hive
– Lowers pressure on the NameNode
• Support for the Hive type model
– Complex types (struct, list, map, union)
– New types (datetime, decimal)
– Encoding specific to the column type
• Split files without scanning for markers
• Bound the amount of memory required for
reading or writing.
- 34. CPU intensive code
• Hive query engine processes one row at a time
– Very inefficient in terms of CPU usage
• Lazy deserialization: layers
• Object inspector calls
• Lots of virtual method calls
- 36. Vectorization to the rescue
• Process a row batch at a time instead of a single row
• Row batch to consist of column vectors
– The column vector will consist of array(s) of primitive types as far as
possible
• Each operator will process the whole column vector at a
time
• File formats to give out vectorized batches for processing
• Underlying research promises
– Better instruction pipelines and cache usage
– Mechanical sympathy
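The row-batch idea can be sketched as follows. This is only an illustration of the control-flow shape (column vectors of primitives, one operator call per batch); `RowBatch`, `filter_price_gt` and `multiply_qty_price` are hypothetical names rather than Hive classes, and Python itself will not show the CPU gains that a JIT-compiled tight loop over primitive arrays can.

```python
# Sketch: row-at-a-time evaluation versus batch-at-a-time evaluation
# over column vectors. Names are illustrative, not Hive's classes.

from array import array

class RowBatch:
    """A batch of rows stored as one primitive array per column."""
    def __init__(self, qty, price):
        self.qty = array("q", qty)      # 64-bit integers
        self.price = array("d", price)  # doubles
        self.size = len(self.qty)
        self.selected = list(range(self.size))  # indices still alive

def filter_price_gt(batch, threshold):
    """Filter operator: one call narrows the selection for the whole batch."""
    price = batch.price
    batch.selected = [i for i in batch.selected if price[i] > threshold]

def multiply_qty_price(batch):
    """Arithmetic operator: one call fills a whole output column vector."""
    qty, price = batch.qty, batch.price
    out = array("d", [0.0]) * batch.size
    for i in batch.selected:
        out[i] = qty[i] * price[i]
    return out

def row_at_a_time(rows, threshold):
    """Baseline: every row walks through the filter and the expression."""
    out = []
    for row in rows:
        if row["price"] > threshold:
            out.append(row["qty"] * row["price"])
    return out

rows = [
    {"qty": 2, "price": 5.0},
    {"qty": 1, "price": 20.0},
    {"qty": 4, "price": 12.5},
]
batch = RowBatch([r["qty"] for r in rows], [r["price"] for r in rows])
filter_price_gt(batch, 10.0)
totals = multiply_qty_price(batch)
vectorized = [totals[i] for i in batch.selected]
baseline = row_at_a_time(rows, 10.0)
print(vectorized)
```

The per-row version pays the filter and expression dispatch once per row; the batch version pays it once per operator per batch, leaving a simple loop over primitive arrays in the middle.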
- 37. Vectorization: Prelim results
• Functionality
– Some arithmetic operators and filters using primitive type columns
– Have a basic integration benchmark to prove that the whole setup
works
• Performance
– Micro benchmark
– More than 30x improvement in the CPU time
– Disclaimer:
– Micro benchmark!
– Does not include I/O or deserialization costs, or complex and string datatypes
- 42. Parallelism
• Can tune it (to some extent)
– Controlling splits/reducer count
• Hive doesn’t know dynamic cluster status
– Benchmarks max out clusters, real jobs may or may not
• Hive does not let you control parallelism
– particularly in case of multiple jobs in a query
- 44. Spin up times
• AM startup costs
• Task startup costs
• Multiple waves of map tasks
- 46. AM Pool Service
• Pre-launches a pool of AMs
• Jobs submitted to these pre-launched AMs
– Saves 3-5 seconds
• Pre-launched AMs can pre-allocate containers
• Tasks can be started as soon as the job is submitted
– Saves 2-3 seconds
- 47. Container reuse
• Tez MapReduce AM supports Container reuse
• Launched JVMs are re-used between tasks
– about 4-5 seconds saved in case of multiple waves
• Allows future enhancements
– re-using task data structures across splits
- 49. Speculation/bad disks
• No cluster remains at 100% forever
• Bad disks cause latency issues
– Speculation is one defense, but it is not enough
– Fault tolerance is a safety net
• Possible solutions:
– More feedback from HDFS about stale nodes, bad/slow disks
– Volume scheduling
- 51. General guidelines contd.
• Benchmarks: To repeat, YMMV.
• Benchmark *your* use-case.
• Decide your problem size
– if (smallData) {
      MySQL/Postgres/your smart phone
  } else {
      make it work
      make it scale
      make it faster
  }
• If it is (seems to be) slow, file a bug, spend a little time!
• Replacing systems without understanding them
– Is an easy way to have an illusion of progress
- 52. Related talks
• “Optimizing Hive Queries” by Owen O’Malley
• “What’s New and What’s Next in Apache Hive” by Gunther
Hagleitner
- 53. Credits
• Arun C Murthy
• Bikas Saha
• Gopal Vijayaraghavan
• Hitesh Shah
• Siddharth Seth
• Vinod Kumar Vavilapalli
• Alan Gates
• Ashutosh Chauhan
• Vikram Dixit
• Gunther Hagleitner
• Owen O’Malley
• Jitendranath Pandey
• Yahoo!, Facebook, Twitter, SAP and Microsoft all contributing.
Editor's Notes
- Since the time we started this, we’ve seen multiple people benchmark Hive, comparing its text format processors against alternatives
- Not MapReduce, not HDFS, just plain Hive
- Layers of inspectors that identify column type, de-serialize data and determine appropriate expression routines in the inner loop
- I wrote all of the code and Jitendra was just consulting :P