Big Data Benchmarking

Covers the different types of big data benchmarks and benchmark suites, the details of Terasort, and a demo with TPCx-HS.

Meetup details of the presentation:

http://www.meetup.com/lspe-in/events/203918952/

Published in: Technology

  1. Big Data Benchmarks (Srinivasa Rao Aravilli N Venkata Naga Ravi)
  2. Why benchmark?
       • Evaluating the effect of a hardware/software upgrade:
           • OS, Java VM, ...
           • Hadoop, Cloudera CDH, Pig, Hive, Impala, ...
       • Debugging:
           • Compare with other clusters or published results.
       • Performance tuning
  3. Industry-standard benchmarking organizations
       • TPC - Transaction Processing Performance Council (http://www.tpc.org/)
       • SPEC - The Standard Performance Evaluation Corporation (https://www.spec.org/)
       • CLDS - Center for Large-scale Data Systems Research (http://clds.sdsc.edu/bdbc)
       • Top outcomes
           • BigData Top100 - an end-to-end application-layer benchmark for big data applications
           • Terasort - functional benchmark focusing on the sort function (quicksort using MapReduce)
           • HiBench - sort, machine learning (K-means clustering, classification)
  4. Types of benchmark
       • Micro-benchmarks: evaluate specific lower-level system operations.
           • E.g., A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Panda et al., OSU
       • Functional / component benchmarks: target a specific high-level function.
           • E.g., sorting: Terasort
           • E.g., basic SQL: individual SQL operations such as Select, Project, Join, Order-By, ...
       • Application-level benchmarks: measure system performance (hardware and software) for a given application scenario, with given data and workload.
  5. Terasort using Hadoop. Terasort includes 3 MapReduce applications:
       • Teragen - generates the data
       • Terasort - samples the input data and uses the samples with MapReduce to sort the data
       • Teravalidate - validates that the output data is sorted
  6. MapReduce for Teragen
  7. MapReduce model: a closer look at MapReduce's implementation model. Source: http://developer.yahoo.com/hadoop/tutorial/module4.html
  8. Benchmarking suites
       • HiBench, Yan Li, Intel (https://github.com/intel-hadoop/HiBench)
       • YCSB - Yahoo Cloud Serving Benchmark, Brian Cooper, Yahoo! (https://github.com/brianfrankcooper/YCSB/)
       • Berkeley Big Data Benchmark, Pavlo et al., AMPLab (https://amplab.cs.berkeley.edu/benchmark/)
       • BigDataBench, Jianfeng Zhan, Chinese Academy of Sciences (http://prof.ict.ac.cn/BigDataBench/)
       • GridMix (http://hadoop.apache.org/docs/r1.2.1/gridmix.html)
       • BigBench (https://github.com/intel-hadoop/Big-Bench)
       • TPCx-HS (http://www.tpc.org/tpcx-hs/)
  9. TPCx-HS benchmarks (X: Express, H: Hadoop, S: Sort)
       • TPCx-HS was developed to provide an objective measure of hardware, operating system and commercial Apache Hadoop File System API compatible software distributions, and to provide the industry with verifiable performance, price-performance and availability metrics.
       • http://www.tpc.org/tpcx-hs/
  10. TPCx-HS demo
  11. TPCx-HS benchmarks: scale factor. TPCx-HS follows a stepped size model. The scale factor (SF) used for the test dataset must be chosen from a fixed set of scale factors:
       • 1TB, 3TB, 10TB, 30TB, 100TB, 300TB, 1000TB, 3000TB, 10000TB.
       • The corresponding numbers of records (B = billion) are 10B, 30B, 100B, 300B, 1000B, 3000B, 10000B, 30000B, 100000B, where each record is 100 bytes, generated by HSGen.
       • http://www.tpc.org/tpcx-hs/
  12. TPCx-HS benchmarks: metrics
  13. TPCx-HS results on Cisco UCS (Cisco published results)
  14. Comparison of various benchmark suites
  15. (no text)
  16. Spark performance
  17. Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark's in-memory cache.
  18. Sort Benchmark (http://sortbenchmark.org/)
       • GraySort
       • MinuteSort
       • CloudSort
       • JouleSort
       • PennySort
       • TeraByteSort
       • DatamationSort
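
The record counts on slide 11 follow directly from the scale factor and the 100-byte record size. The sketch below checks that arithmetic; it assumes 1 TB means 10^12 bytes (which is what the slide's figures imply), and the names `RECORD_BYTES`, `SCALE_FACTORS_TB` and `records_for` are illustrative only, not part of the TPCx-HS kit.

```python
# Assumption: 1 TB = 10**12 bytes, as implied by the slide's record counts.
TB = 10**12
RECORD_BYTES = 100  # record size stated on the slide
SCALE_FACTORS_TB = [1, 3, 10, 30, 100, 300, 1000, 3000, 10000]


def records_for(scale_factor_tb: int) -> int:
    """Return the number of 100-byte records for a permitted scale factor."""
    if scale_factor_tb not in SCALE_FACTORS_TB:
        raise ValueError(f"{scale_factor_tb} TB is not a permitted scale factor")
    return scale_factor_tb * TB // RECORD_BYTES


for sf in SCALE_FACTORS_TB:
    billions = records_for(sf) // 10**9
    print(f"{sf} TB -> {billions} billion records")
```

Rejecting values outside the fixed set mirrors the stepped size model: an intermediate size such as 5 TB is not a valid choice.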

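
Slide 5 says that Terasort samples the input and uses the samples to sort with MapReduce. A minimal single-process sketch of that idea (sample-based range partitioning) is shown below. It is not Hadoop's implementation: the function names, the default partition count and the sample size are all hypothetical, and plain Python lists stand in for the Teragen input, the reducers and the Teravalidate check.

```python
import bisect
import random


def choose_split_points(keys, num_partitions, sample_size, rng):
    """Sample the input and pick num_partitions - 1 ordered split keys."""
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(i * step)] for i in range(1, num_partitions)]


def range_partition_sort(keys, num_partitions=4, sample_size=100, seed=0):
    """Route keys to ordered ranges, sort each range, then concatenate."""
    if not keys:
        return []
    rng = random.Random(seed)
    splits = choose_split_points(keys, num_partitions, sample_size, rng)
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        # "Map" side: each key goes to the partition whose range contains it.
        partitions[bisect.bisect_right(splits, key)].append(key)
    # "Reduce" side: every partition is sorted independently; because the
    # ranges are ordered, concatenating them yields a globally sorted result.
    return [key for part in partitions for key in sorted(part)]


def is_sorted(keys):
    """Stand-in for Teravalidate: check that the output is in order."""
    return all(a <= b for a, b in zip(keys, keys[1:]))


# Stand-in for Teragen: generate random integer keys.
gen = random.Random(42)
data = [gen.randrange(10**6) for _ in range(10_000)]
result = range_partition_sort(data)
assert is_sorted(result)
```

Sampling matters because evenly spaced split points taken from a sample give partitions of roughly equal size, so no single reducer ends up with most of the data.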