This document discusses BigDataBench, an open source project for big data benchmarking. BigDataBench includes six real-world data sets and 19 workloads that cover common big data applications and preserve the four V's of big data. The workloads were chosen to represent typical application domains like search engines, social networks, and e-commerce. BigDataBench aims to provide a standardized benchmark for evaluating big data systems, architectures, and software stacks. It has been used in several case studies for workload characterization and performance evaluation of different hardware platforms for big data workloads.
Jianfeng Zhan: BigDataBench, benchmarking big data systems
1. BigDataBench: Benchmarking Big Data Systems
Jianfeng Zhan
Computer Systems Research Center, ICT, CAS
INSTITUTE OF COMPUTING TECHNOLOGY
http://prof.ict.ac.cn/jfzhan
CCF Big Data Technology Conference
2013-12-06
2. Why Big Data Benchmarking?
Measuring big data architecture and systems quantitatively
3. What is BigDataBench?
An open source project on big data benchmarking:
• http://prof.ict.ac.cn/BigDataBench/
• 6 real-world data sets and 19 workloads
– To be extended in the near future
• 4V characteristics
– Volume, Variety, Velocity, and Veracity
5. Possible Users
[Diagram: BigDataBench at the center, surrounded by its possible user communities]
Systems
• OS for big data
• File systems for big data
• Distributed systems
• Scheduling
• Programming systems
• ...
Architecture
• Processor
• Memory
• Networks
• ...
Data management
• Performance optimization
• Co-design
• ...
6. Research Publications
Characterizing data analysis workloads in data centers. Zhen Jia, Lei Wang, Jianfeng Zhan, Lixing Zhang, and Chunjie Luo. IISWC 2013. Best paper award.
BigDataBench: a Big Data Benchmark Suite from Internet Services. Lei Wang, Jianfeng Zhan, et al. HPCA 2014, Industry Session.
9. Methodology (Cont'd)
[Diagram: flow from application domains to BigDataBench]
Investigate Typical Application Domains
Representative Data Sets
• Data Types: Structured, Semi-structured, Unstructured
• Data Sources: Text data, Graph data, Table data, Extended…
Big Data Sets Preserving 4V
• Data generation tool preserving data characteristics
Diverse Workloads
• Application Types: Offline analytics, Realtime analytics, Online services
• Basic & Important Operations and Algorithms, Extended…
• Represent Software Stack, Extended…
Big Data Workloads
BigDataBench
11. Top Sites on the Web
Search Engine, Social Network and Electronic Commerce account for 80% of the page views of all Internet services.
More details in http://www.alexa.com/topsites/global;0
14. Data Generation Tools
Data Sources
• Text, Graph and Table
• Six real raw data sets
Synthetic Data
Scale
• From GB to PB
Features
• Preserve characteristics of real-world data
16. Improved Text Generator
Modeling on both the topic and word level
• Select a topic randomly: topics (topic1, topic2, topic3) follow a multinomial distribution
• Select a word randomly: words follow a multinomial distribution under the selected topic (topic2 in the figure)
• The selected words make up the generated document
[Figure: example words: machine, evaluate, big, CPU, data, mining, architecture, benchmarking, memory system, learning]
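The two-level sampling described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BigDataBench generator itself: the topic weights, the per-topic word weights, and the assignment of the figure's example words to topics are all hypothetical, and drawing a fresh topic for every word is one possible reading of the slide.

```python
import random

# Hypothetical model parameters. The words come from the slide's figure,
# but their grouping into topics and every probability are assumptions.
TOPIC_WEIGHTS = {"topic1": 0.5, "topic2": 0.3, "topic3": 0.2}
WORD_WEIGHTS = {
    "topic1": {"big": 0.4, "data": 0.4, "mining": 0.2},
    "topic2": {"CPU": 0.4, "architecture": 0.3, "memory system": 0.3},
    "topic3": {"machine": 0.3, "learning": 0.3, "benchmarking": 0.2, "evaluate": 0.2},
}


def generate_document(num_words, rng):
    """Generate num_words words: draw a topic, then a word under that topic."""
    topics = list(TOPIC_WEIGHTS)
    topic_weights = list(TOPIC_WEIGHTS.values())
    words = []
    for _ in range(num_words):
        # Topic level: one draw from the multinomial over topics.
        topic = rng.choices(topics, weights=topic_weights, k=1)[0]
        # Word level: one draw from the multinomial over that topic's words.
        vocab = WORD_WEIGHTS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()), k=1)[0]
        words.append(word)
    return words


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a run is repeatable
    print(" ".join(generate_document(12, rng)))
```

Scaling the output up only means raising `num_words` or generating more documents. In a real generator the two distributions would be learned from a real text corpus rather than written by hand.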
18. BigDataBench Case Study
[Diagram: BigDataBench at the center, surrounded by case-study topics and user institutions]
Case-study topics
• Workload Characterization
• Performance evaluation and Diagnosis
• Evaluating Big Data Hardware Systems
• Networks for big data
• Energy Efficiency of Big Data Systems
User institutions
• ICT, CAS
• SIAT, CAS
• SJTU, and XJTU
• USTC, and Florida International University
• OSU
• CNCERT
http://prof.ict.ac.cn/BigDataBench/#users
21. Floating point operation intensity
[Figure: operation intensity per workload; panels: Data Analytics, Services]
Operation intensity: the total number of (floating point or integer) instructions divided by the total number of memory access bytes in a run of the workload.
Very low floating point operation intensities (0.009), two orders of magnitude lower than the theoretical value of a state-of-practice CPU (1.8)
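The definition above is a simple ratio, and the "two orders of magnitude" claim can be checked from the two numbers on the slide. The sketch below is illustrative only: the function name and the counter readings are hypothetical, with the readings chosen so that the result matches the quoted 0.009.

```python
import math


def operation_intensity(instructions, memory_bytes):
    """Instructions executed per byte of memory access over one workload run."""
    if memory_bytes <= 0:
        raise ValueError("memory_bytes must be positive")
    return instructions / memory_bytes


# Hypothetical counter readings, picked to reproduce the slide's 0.009.
fp_instructions = 9_000_000
memory_access_bytes = 1_000_000_000
measured = operation_intensity(fp_instructions, memory_access_bytes)

# Theoretical value for a state-of-practice CPU, as quoted on the slide.
THEORETICAL_CPU_INTENSITY = 1.8

gap = THEORETICAL_CPU_INTENSITY / measured
orders_of_magnitude = math.floor(math.log10(gap))
print(f"intensity={measured:.3f} gap={gap:.0f}x orders={orders_of_magnitude}")
```

The same function applies to integer operation intensity by passing an integer instruction count instead.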
23. Ratio of Integer to Floating Point Operations
[Figure panels: Data Analytics, Services]
The average ratio for big data workloads is 100
That of PARSEC, HPCC and SPECFP is 1.4, 1.0 and 0.67
24. Integer operation intensity
[Figure panels: Data Analytics, Services]
The average integer operation intensity of big data workloads is 0.49
That of PARSEC, HPCC and SPECFP is 1.5, 0.38 and 0.23
25. Cache Behaviors
[Figure panels: Data Analytics, Services]
Big data workloads have higher L1I misses than HPC workloads
Data analysis workloads have better L2 cache behaviors than service workloads, except BFS
Big data workloads have good L3 behaviors
26. TLB Behaviors
[Figure labels: 14 data analysis, 5 service]
ITLB misses of big data workloads are higher than HPC workloads.
DTLB misses of big data workloads are higher than HPC workloads.
27. BigDataBench Case Study
[Diagram: BigDataBench at the center, surrounded by case-study topics and user institutions]
Case-study topics
• Big Data Workload Characterization
• Performance evaluation and Diagnosis
• Evaluating Big Data Hardware Systems
• Networks for big data
• Energy Efficiency of Big Data Systems
User institutions
• ICT, CAS
• SIAT, CAS
• SJTU, and XJTU
• USTC, and Florida International University
• OSU
• CNCERT
http://prof.ict.ac.cn/BigDataBench/#users
29. Experimental Platforms
Xeon (Common processor), Atom (Low power processor), Tilera (Many core processor)
Brief Comparison: Basic Information
CPU Type        Intel Xeon E5310     Intel Atom D510      Tilera TilePro36
CPU Core        4 cores @ 1.6GHz     2 cores @ 1.66GHz    36 cores @ 500MHz
L1 I/D Cache    32KB                 24KB                 16KB/8KB
L2 Cache        4096KB               512KB                64KB
30. Experimental Platforms
Hadoop Cluster Information
Comparison (the same logical core number)
• Xeon VS Atom: [1 Xeon master + 7 Xeon slaves] VS [1 Atom master + 7 Atom slaves]
• Xeon VS Tilera: [1 Xeon master + 7 Xeon slaves] VS [1 Xeon master + 1 Tilera slave]
Hadoop setting
• Following the guidance on the Hadoop official website
35. Reference
Jing Quan (University of Science and Technology of China), Yingjie Shi (Chinese Academy of Sciences), Ming Zhao (Florida International University), Wei Yang (University of Science and Technology of China). "The Implications from Benchmarking Three Different Data Center Platforms". The First Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE 2013), in conjunction with the 2013 IEEE International Conference on Big Data (IEEE Big Data 2013).
37. BigDataBench Class
For Architecture
19 of 19 workloads
For OS
19 of 19 workloads
For Runtime environment (Hadoop)
9 of 19 workloads
• Sort, Grep, WordCount, PageRank, Index, Kmeans, Connected Components, Collaborative Filtering and Naive Bayes
For Data management
6 of 19 workloads
• Read, Write, Scan, Select Query, Aggregate Query, Join Query
38. BigDataBench Class: Data Sources
Text related
6 of 19 workloads
• Sort, Grep, WordCount, Index, Collaborative Filtering and Naive Bayes
Graph related
4 of 19 workloads
• BFS, PageRank, Kmeans, and Connected Components
Table related
9 of 19 workloads
• Read, Write, Scan, Select Query, Aggregate Query, Join Query, Nutch Server, Olio Server and Rubis Server
39. BigDataBench Class: Application Types
Online Services
6 of 19 workloads
• Read, Write, Scan, Nutch server, Olio Server and Rubis server
Offline Analytics
10 of 19 workloads
• Sort, Grep, WordCount, BFS, PageRank, Index, Kmeans, Connected Components, Collaborative Filtering and Naive Bayes
Realtime Analytics
3 of 19 workloads
• Select Query, Aggregate Query and Join Query
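The data-source and application-type classes each split the 19 workloads into non-overlapping groups, and that is easy to verify mechanically. The sketch below encodes the class memberships exactly as listed on the slides. The helper names `is_partition` and `classify` are hypothetical, and the capitalization of the server workloads is normalized for lookup.

```python
# Class membership as listed on the slides.
DATA_SOURCES = {
    "Text": ["Sort", "Grep", "WordCount", "Index",
             "Collaborative Filtering", "Naive Bayes"],
    "Graph": ["BFS", "PageRank", "Kmeans", "Connected Components"],
    "Table": ["Read", "Write", "Scan", "Select Query", "Aggregate Query",
              "Join Query", "Nutch Server", "Olio Server", "Rubis Server"],
}
APPLICATION_TYPES = {
    "Online Services": ["Read", "Write", "Scan", "Nutch Server",
                        "Olio Server", "Rubis Server"],
    "Offline Analytics": ["Sort", "Grep", "WordCount", "BFS", "PageRank",
                          "Index", "Kmeans", "Connected Components",
                          "Collaborative Filtering", "Naive Bayes"],
    "Realtime Analytics": ["Select Query", "Aggregate Query", "Join Query"],
}


def is_partition(classes, expected_total):
    """True if no workload is listed twice and the classes cover expected_total workloads."""
    names = [w for members in classes.values() for w in members]
    return len(names) == len(set(names)) == expected_total


def classify(workload):
    """Return the (data source, application type) pair for one workload name."""
    source = next(k for k, v in DATA_SOURCES.items() if workload in v)
    app_type = next(k for k, v in APPLICATION_TYPES.items() if workload in v)
    return source, app_type


if __name__ == "__main__":
    print(is_partition(DATA_SOURCES, 19), is_partition(APPLICATION_TYPES, 19))
    print(classify("PageRank"))
```

A lookup table like this is convenient when selecting a benchmark subset, for example all graph-related offline analytics workloads.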
40. BigDataBench Class: Application Domains
Search engine related:
Basic Operations + Search Engine
7 of 19 workloads
• Sort, Grep, WordCount, BFS, PageRank, Index and Nutch Server
Social network related:
Basic Cloud OLTP + Basic Relational Query + Social Network
9 of 19 workloads
• Read, Write, Scan, Select Query, Aggregate Query, Join Query, Olio Server, Kmeans and Connected Components
E-commerce related:
Basic Cloud OLTP + Basic Relational Query + E-commerce
9 of 19 workloads
• Read, Write, Scan, Select Query, Aggregate Query, Join Query, Rubis Server, Collaborative Filtering and Naive Bayes
43. Related Resources
BigDataBench project
http://prof.ict.ac.cn/BigDataBench
BPOE workshop
http://prof.ict.ac.cn/bpoe
A series of workshops on Big Data Benchmarks, Performance Optimization, and Emerging Hardware
BPOE-4: interaction among OS, architecture, and data management
• Co-located with ASPLOS 2014
44. BPOE-4 SC
Christos Kozyrakis, Stanford
Xiaofang Zhou, University of Queensland
Dhabaleswar K Panda, Ohio State University
Raghunath Nambiar, Cisco
Lizy K John, University of Texas at Austin
Xiaoyong Du, Renmin University of China
H. Peter Hofstee, IBM Austin Research Laboratory
Ippokratis Pandis, IBM Almaden Research Center
Alexandros Labrinidis, University of Pittsburgh
Bill Jia, Facebook
Jianfeng Zhan, ICT, Chinese Academy of Sciences