This presentation was given at the Sixth International Workshop on Testing Database Systems (DBTest 2013), co-located with ACM SIGMOD 2013, June 24, New York, USA.
Full paper and additional information available at:
http://msrg.org/papers/dbtest13-rabl
Abstract:
Generating data sets for the performance testing of database systems on a particular hardware configuration and application domain is a very time-consuming and tedious process. It is time-consuming because of the large amount of data that needs to be generated, and tedious because new data generators might need to be developed or existing ones adjusted. The difficulty in generating this data is amplified by constant advances in hardware and software that allow the testing of ever larger and more complicated systems. In this paper, we present an approach for rapidly developing customized data generators. Our approach, which is based on the Parallel Data Generator Framework (PDGF), deploys a new concept of so-called meta generators. Meta generators extend the concept of column-based generators in PDGF. Deploying meta generators in PDGF significantly reduces the development effort of customized data generators, facilitates their debugging, and eases their maintenance.
Rapid Development of Data Generators Using Meta Generators in PDGF
1. Rapid Development of Data Generators Using Meta Generators in PDGF
Middleware Systems Research Group (MSRG.ORG)
Tilmann Rabl, Meikel Poess, Manuel Danisch, Hans-Arno Jacobsen
DBTest 2013, June 24, New York City
2. DBMS Benchmarking is Increasingly Complex
• Data volumes are skyrocketing
  Enterprise data warehouses double every three years
  Many enterprise data warehouses are petabytes in size
• Systems are becoming increasingly complex
  Large number of processor cores
    Single systems (SMP) with a high number of cores (80 on commodity hardware, 2048 on specialized hardware)
    Multi-node systems (sky is the limit)
  Large memory
    Dell released a TPC-H benchmark with 15 TB of main memory on 64 systems
• How to challenge these systems?
4. Parallel Data Generation Framework
• Generic data generation framework
• Relational model
  Schema specified in configuration file
  Post-processing stage for alternative representations
• Repeatable computation
  Based on XORSHIFT random number generators
  Hierarchical seeding strategy
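
The repeatable-computation idea on this slide can be sketched in a few lines. This is a minimal Python illustration, not PDGF's code: only the use of xorshift and of hierarchical seeding comes from the slide. The function names, the seed-mixing constant, and the project/table/column/row levels are assumptions made for the example.

```python
MASK64 = (1 << 64) - 1


def xorshift64(state):
    """One step of a 64-bit xorshift generator (shift triple 13, 7, 17)."""
    state &= MASK64
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state


def child_seed(parent_seed, index):
    """Derive the seed of the index-th child from its parent's seed.

    The mixing constant is an arbitrary odd number chosen for this sketch.
    """
    mixed = (parent_seed ^ ((index + 1) * 0x9E3779B97F4A7C15)) & MASK64
    return xorshift64(mixed or 1)  # xorshift must not start from state 0


def field_value_seed(project_seed, table_id, column_id, row):
    """Walk the (assumed) hierarchy project -> table -> column -> row."""
    table_seed = child_seed(project_seed, table_id)
    column_seed = child_seed(table_seed, column_id)
    return child_seed(column_seed, row)


def random_long(seed, lo, hi):
    """Map a seed to an integer in the closed interval [lo, hi]."""
    return lo + xorshift64(seed) % (hi - lo + 1)
```

Because every field value has its own seed, any value can be recomputed without generating the rows before it, which is what makes independent, parallel generation possible in such a design.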
6. PDGF Architecture
• Controller: Initialization
• Meta Scheduler: Inter-node scheduling
• Scheduler: Inter-thread scheduling
• Worker: Blockwise data generation
• Update Black Box: Coordination of data updates
• Seeding System: Random sequence adaptation
• Generators: Value generation
• Output system: Data formatting
• To generate data for a schema the user defines:
  Schema XML file: Defines relational schema
  Generation XML file: Defines output format (CSV, XML, merging tables)
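
The slide names the scheduling components but not their algorithms. As a rough illustration of two-level scheduling, the following Python sketch splits a table's row range first across nodes and then across threads. The even, contiguous split and the names `split_range` and `schedule` are assumptions for this example, not PDGF's actual strategy.

```python
def split_range(start, end, parts):
    """Split [start, end) into `parts` contiguous, near-equal half-open blocks."""
    base, extra = divmod(end - start, parts)
    blocks = []
    lo = start
    for i in range(parts):
        hi = lo + base + (1 if i < extra else 0)
        blocks.append((lo, hi))
        lo = hi
    return blocks


def schedule(table_size, nodes, threads_per_node):
    """Inter-node split followed by an inter-thread split on each node."""
    plan = {}
    for node, (node_lo, node_hi) in enumerate(split_range(0, table_size, nodes)):
        plan[node] = split_range(node_lo, node_hi, threads_per_node)
    return plan


# Example: 10000 rows on 2 nodes with 4 threads each.
plan = schedule(10000, 2, 4)
```

With repeatable per-value seeding, each worker can generate its block without coordinating with the others.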
8. Base Generators in PDGF
• DictList generator
  Random line from file
• Long generator
  Random long in interval
• Others
  StaticValue
  Double
  Date
  String
  Text
  …

<table name="users">
  <size>10000</size>
  <fields>
    <field name="name">
      <type>java.sql.types.VARCHAR</type>
      <size>100</size>
      <gen_DictList>
        <file>dicts/names.dict</file>
      </gen_DictList>
    </field>
    <field name="age">
      <type>java.sql.types.NUMERIC</type>
      <gen_LongGenerator>
        <min>0</min>
        <max>120</max>
      </gen_LongGenerator>
    </field>
  </fields>
</table>
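
To make the behavior of the two base generators concrete, here is a small Python sketch that mimics the `users` table above. It is an illustration under assumptions: the in-memory `NAMES` list stands in for `dicts/names.dict`, the interval is treated as closed on both ends, and Python's `random.Random` replaces PDGF's own random number generation.

```python
import random

NAMES = ["Alice", "Bob", "Carol", "Dave"]  # stand-in for dicts/names.dict


def dict_list(rng, entries):
    """DictList-style generator: pick one random entry (one 'line of the file')."""
    return entries[rng.randrange(len(entries))]


def long_generator(rng, lo, hi):
    """Long-style generator: random integer in the closed interval [lo, hi]."""
    return rng.randint(lo, hi)


def generate_users(size, seed=0):
    """Generate (name, age) rows; each row gets its own repeatable stream."""
    rows = []
    for row in range(size):
        rng = random.Random(seed * 1_000_003 + row)  # per-row seed (assumption)
        rows.append((dict_list(rng, NAMES), long_generator(rng, 0, 120)))
    return rows
```

Calling `generate_users(10000)` twice yields identical rows, mirroring the repeatability goal of the framework.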
9. Null Generator
• Add NULL logic to every generator?
  Could easily be implemented in higher class
  Adds to the configuration file
  Reduces performance (every time)
• Higher order generator NullGenerator
  Only used if added to the schema
  Can be added to any generator
<field name="age">
  <type>java.sql.types.NUMERIC</type>
  <gen_NullGenerator>
    <probability>0.05</probability>
    <gen_LongGenerator>
      <min>0</min>
      <max>120</max>
    </gen_LongGenerator>
  </gen_NullGenerator>
</field>
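
The higher-order design can be sketched as a wrapper that decides whether to call its sub-generator at all. The Python below is a hypothetical illustration of that pattern, with `None` standing in for NULL; the function names and the closure-based structure are assumptions, not PDGF's class design.

```python
import random


def long_generator(lo, hi):
    """Base generator: random integer in the closed interval [lo, hi]."""
    def generate(rng):
        return rng.randint(lo, hi)
    return generate


def null_generator(probability, sub_generator):
    """Return None with the given probability; otherwise delegate to the wrapped generator."""
    def generate(rng):
        if rng.random() < probability:
            return None  # the sub-generator is not invoked for NULL values
        return sub_generator(rng)
    return generate


# Mirrors the "age" field above: 5% NULL, otherwise a value between 0 and 120.
age = null_generator(0.05, long_generator(0, 120))
rng = random.Random(7)
values = [age(rng) for _ in range(10_000)]
null_share = values.count(None) / len(values)
```

Fields that never contain NULL simply omit the wrapper, so they pay no cost for the NULL check.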
10. Meta Generators
• Control flow and post-processing generators
  Null generator controls flow
• Post-processing
  FormattedNumberGenerator
  PaddingGenerator
  UpperLowerCaseGenerator
  PrePostfixGenerator
  FormulaGenerator
• Flow control
  ProbabilityGenerator
  SequentialGenerator
  IfGenerator
  SwitchGenerator
  ReferenceGenerator
11. Post-Processing Example
• Phone number for users
  10s of representations
  PhoneNumberGenerator was too inflexible
• Formatted long number
  Long numbers between 10010001 and 9999999999
  Number formatting (%d%d%d) %d%d%d-%d%d%d%d
<field name="phonenumber">
  <type>java.sql.types.VARCHAR</type>
  <size>30</size>
  <generator name="FormattedNumberGenerator">
    <generator name="LongGenerator">
      <min>10010001</min>
      <max>9999999999</max>
    </generator>
    <format>(%d%d%d) %d%d%d-%d%d%d%d</format>
  </generator>
</field>
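
One plausible reading of the format string is that each `%d` consumes a single digit of the generated number. The Python sketch below follows that reading. How PDGF treats numbers with fewer digits than placeholders is not stated on the slide, so the left zero-padding here is an assumption, as is the function name.

```python
def formatted_number(value, fmt):
    """Fill each %d placeholder in fmt with one digit of a non-negative integer.

    Numbers with fewer digits than placeholders are zero-padded on the left
    (an assumption made for this sketch).
    """
    slots = fmt.count("%d")
    digits = str(value).zfill(slots)
    if len(digits) > slots:
        raise ValueError("value has more digits than the format has %d slots")
    return fmt % tuple(int(d) for d in digits)


PHONE_FORMAT = "(%d%d%d) %d%d%d-%d%d%d%d"
example = formatted_number(2125550123, PHONE_FORMAT)  # "(212) 555-0123"
```

Supporting another phone number representation then only requires a different format string, not a new generator class.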
12. Flow Control Example
• More elaborate name field
  Name male or female
  50% chance
  All upper case
  Padded to 100 characters
• Sequential generator
  Probability generator
    DictList generator
  UpperLowerCase generator
  Padding generator
<field name="name">
  <type>java.sql.types.VARCHAR</type>
  <size>100</size>
  <generator name="SequentialGenerator">
    <generator name="ProbabilityGenerator">
      <probability value="0.5">
        <generator name="DictList">
          <file>dicts/female.dict</file>
        </generator>
      </probability>
      <probability value="0.5">
        <generator name="DictList">
          <file>dicts/male.dict</file>
        </generator>
      </probability>
    </generator>
    <generator name="UpperLowerCaseGenerator">
      <mode>uppercase</mode>
    </generator>
    <generator name="PaddingGenerator">
      <character> </character>
      <padToLeft>true</padToLeft>
    </generator>
  </generator>
</field>
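
The composition in this configuration can be mimicked with plain functions. The Python sketch below is an illustration under assumptions: the sequential generator is modeled as a pipeline in which each step receives the previous step's value, `padToLeft` is read as "add the padding characters on the left", and the short name lists stand in for the dictionary files. None of the function names are PDGF APIs.

```python
import random

FEMALE = ["Anna", "Maria", "Sofia"]  # stand-in for dicts/female.dict
MALE = ["Ben", "Jonas", "Peter"]     # stand-in for dicts/male.dict


def dict_list(entries):
    """Pick a random entry; ignores any previous value."""
    return lambda rng, _prev=None: rng.choice(entries)


def probability(branches):
    """branches: list of (probability, generator); probabilities should sum to 1."""
    def generate(rng, prev=None):
        draw = rng.random()
        cumulative = 0.0
        for p, gen in branches:
            cumulative += p
            if draw < cumulative:
                return gen(rng, prev)
        return branches[-1][1](rng, prev)  # guard against rounding error
    return generate


def upper_case():
    return lambda rng, prev: prev.upper()


def padding(width, character=" ", pad_to_left=True):
    def generate(rng, prev):
        if pad_to_left:
            return prev.rjust(width, character)
        return prev.ljust(width, character)
    return generate


def sequential(steps):
    """Run the steps in order, feeding each one the previous step's value."""
    def generate(rng, prev=None):
        value = prev
        for step in steps:
            value = step(rng, value)
        return value
    return generate


name = sequential([
    probability([(0.5, dict_list(FEMALE)), (0.5, dict_list(MALE))]),
    upper_case(),
    padding(100),
])
```

Calling `name(random.Random(1))` returns a 100-character string containing an upper-case name from one of the two lists.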
13. Core Performance
[Figure: stacked bar chart, y-axis from 0 to 250. Bars: Static Value (no Cache), Null Generator (100% NULL), Null Generator (0% NULL). Stack components: Base Time, Generator, Base Time Sub, Sub Generator.]
• Test environment: single core laptop, no I/O
• Base time for framework ~ 55 ns (Base Time)
  Seeding, method invocation, setting a value
• Computation time for generator 50+ ns (Gen Time)
• Cache update if referenced ~ 50 ns (Cache Update)
• Cache lookup if intra row reference ~ 50 ns (Cache Lookup)
• Sub-generator invocation ~ 50 ns
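
The figures on this slide suggest a rough back-of-the-envelope estimate for nested generators. The Python sketch below simply adds the quoted costs. The additive model itself is an assumption, cache costs are left out, and since generator time is given as "50+ ns" the results are best read as lower bounds rather than measurements.

```python
BASE_NS = 55            # framework base time per value (from the slide)
GENERATOR_NS = 50       # minimum computation time per generator (from the slide)
SUB_INVOCATION_NS = 50  # cost of invoking a sub-generator (from the slide)


def estimated_ns(generator_depth, generator_ns=GENERATOR_NS):
    """Estimate time per value for a chain of nested generators.

    Assumes the base time is paid once, each generator pays its computation
    time, and each nesting level adds one sub-generator invocation.
    """
    if generator_depth < 1:
        raise ValueError("at least one generator is required")
    return (BASE_NS
            + generator_depth * generator_ns
            + (generator_depth - 1) * SUB_INVOCATION_NS)


def values_per_second(ns_per_value):
    """Convert a per-value time in nanoseconds to single-core throughput."""
    return 1e9 / ns_per_value


single = estimated_ns(1)   # one base generator: 55 + 50 = 105 ns
wrapped = estimated_ns(2)  # one meta generator around it: 55 + 100 + 50 = 205 ns
```

Under these assumptions a single generator yields roughly 9.5 million values per second per core, and one level of wrapping roughly halves that.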
16. Performance Meta Generators
[Figure: bar chart, y-axis from 0 to 1600. Bars: Null Generator (100% Null), Null Generator (0% Null), PrePostFix, Sequential (exec 2), Sequential (concat 2), Sequential (2 formatted + long).]
• Meta generator overhead:
  Base overhead ~ 50 ns
  Generator overhead starts from 50 ns
  Sub-generator invocation ~ 50 ns
• Often negligible due to lazy formatting
17. Use Cases
• TPC-H / SSB
  8 tables, 61 columns (first non-trivial example)
  Without meta-FVGs: 26 custom FVGs
  After 2 h of editing: 10 custom FVGs
  After 1 day of reimplementation: 0 custom FVGs, i.e., no coding
  SSB variations
    Skews on dimension attributes, fact measures, references
• TPC-DI (in progress)
  20 tables, 200 columns
  19 custom FVGs (mainly for performance in corner cases)
  56x NullGenerator
  32x ProbabilityGenerator
  3000 lines of config (XML import for multiple files)
18. Conclusion & Future Work
• Meta generators
  Improve usability and expressiveness
  Speed up schema definition
  Remove necessity for coding
  Enlarged configuration files
• Used in TPC benchmark(s)
• Performance overhead is small, often negligible
• Future work
  GUI and SQL export
  SQL import and data extraction