MIDDLEWARE SYSTEMS
RESEARCH GROUP
MSRG.ORG

Rapid Development of Data Generators Using Meta Generators in PDGF
Tilmann Rabl, Meikel Poess, Manuel Danisch, Hans-Arno Jacobsen
DBTest 2013, June 24, New York City
DBMS Benchmarking is Increasingly Complex
• Data volumes are skyrocketing
  - Enterprise data warehouses double in size every three years
  - Many enterprise data warehouses are petabyte-sized
• Systems are becoming increasingly complex
  - Large number of processor cores
    - Single systems (SMP) with a high number of cores (80 on commodity hardware, 2048 on specialized hardware)
    - Multi-node systems (the sky is the limit)
  - Large memory
    - Dell released a TPC-H benchmark result with 15 TB of main memory on 64 systems
• How to challenge these systems?
Benchmarks are increasingly complex
[Bar chart: number of tables and columns per benchmark]
Benchmark   #Tables   #Columns
TPC-A       4         10
TPC-C       9         92
TPC-E       33        188
TPC-DS      24        430
• More tables, columns
• More relationships, dependencies, data types, …
• How to build these benchmarks?
• Parallel Data Generation Framework to the rescue!
Parallel Data Generation Framework
• Generic data generation framework
• Relational model
  - Schema specified in configuration file
  - Post-processing stage for alternative representations
• Repeatable computation
  - Based on XORSHIFT random number generators
  - Hierarchical seeding strategy
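A minimal sketch of how repeatable computation with a hierarchical seeding strategy could work. It is not PDGF's actual implementation: the helper names (`xorshift64`, `derive_seed`, `field_value`), the mixing constant, and the project → table → column → row hierarchy are assumptions made for illustration. The only borrowed fact is the 64-bit xorshift step with the shift triple (13, 7, 17).

```python
MASK64 = (1 << 64) - 1
GOLDEN = 0x9E3779B97F4A7C15  # odd 64-bit constant, used here only to spread child ids


def xorshift64(state: int) -> int:
    """One step of a 64-bit xorshift generator (shift triple 13, 7, 17)."""
    state &= MASK64
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state


def derive_seed(parent_seed: int, child_id: int) -> int:
    """Derive a child seed from a parent seed and a child index (hypothetical scheme)."""
    mixed = (parent_seed ^ ((child_id + 1) * GOLDEN)) & MASK64
    if mixed == 0:
        mixed = GOLDEN  # xorshift must never start from the all-zero state
    return xorshift64(xorshift64(mixed))


def field_value(project_seed: int, table_id: int, column_id: int,
                row: int, low: int, high: int) -> int:
    """Compute the value of one field directly from its coordinates."""
    table_seed = derive_seed(project_seed, table_id)
    column_seed = derive_seed(table_seed, column_id)
    row_seed = derive_seed(column_seed, row)
    return low + row_seed % (high - low + 1)
```

Because every value depends only on the project seed and its (table, column, row) coordinates, any worker can compute any field without generating the preceding rows, and repeated runs with the same seed give the same data.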
Repeatable Data Generation
PDGF Architecture
• Controller
  - Initialization
• Meta Scheduler
  - Inter node scheduling
• Scheduler
  - Inter thread scheduling
• Worker
  - Blockwise data generation
• Update Black Box
  - Co-ordination of data updates
• Seeding System
  - Random sequence adaption
• Generators
  - Value generation
• Output system
  - Data formatting
• To generate data for a schema the user defines:
  - Schema XML file
    - Defines relational schema
  - Generation XML file
    - Defines output format (CSV, XML, merging tables)
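The division of labor between meta scheduler (across nodes), scheduler (across threads), and blockwise workers can be sketched as a two-level range split. This is an illustration under assumptions, not PDGF's scheduling code: the function names, the static and near-equal partitioning, and the fixed block size are all hypothetical.

```python
def split_range(start: int, stop: int, parts: int):
    """Split [start, stop) into `parts` contiguous, near-equal ranges."""
    base, extra = divmod(stop - start, parts)
    ranges = []
    low = start
    for index in range(parts):
        high = low + base + (1 if index < extra else 0)
        ranges.append((low, high))
        low = high
    return ranges


def blocks(start: int, stop: int, block_size: int):
    """Yield the row blocks a worker would generate one after another."""
    for low in range(start, stop, block_size):
        yield low, min(low + block_size, stop)


def plan(rows: int, nodes: int, threads_per_node: int, block_size: int):
    """Map each (node, thread) pair to the list of row blocks it generates."""
    schedule = {}
    for node, (node_low, node_high) in enumerate(split_range(0, rows, nodes)):
        thread_ranges = split_range(node_low, node_high, threads_per_node)
        for thread, (low, high) in enumerate(thread_ranges):
            schedule[(node, thread)] = list(blocks(low, high, block_size))
    return schedule
```

For example, `plan(rows=1000, nodes=2, threads_per_node=3, block_size=100)` gives six workers disjoint row ranges that together cover rows 0 to 999. A static split like this needs no coordination between workers only if each row can be generated independently of the others.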
Configuring PDGF
• Schema configuration
  - Data model
• Relational model
  - Tables, fields
• Properties
  - Table size, characters, …
• Generators
  - Base generators
  - Meta generators
• Update definition
  - Insert, update, delete
  - Generated as change data capture

<table name="SUPPLIER">
  <size>${S}</size>
  <field name="S_SUPPKEY" size="" type="NUMERIC" primary="true" unique="true">
    <gen_IdGenerator />
  </field>
  <field name="S_NAME" size="25" type="VARCHAR">
    <gen_PrePostfixGenerator>
      <gen_PaddingGenerator>
        <gen_OtherFieldValueGenerator>
          <reference field="S_SUPPKEY" />
        </gen_OtherFieldValueGenerator>
        <character>0</character>
        <padToLeft>true</padToLeft>
        <size>9</size>
      </gen_PaddingGenerator>
      <prefix>Supplier </prefix>
    </gen_PrePostfixGenerator>
  </field>
  [..]
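The nested generators in the SUPPLIER fragment can be read as function composition: the innermost generator produces a value and each enclosing generator transforms it. The sketch below mirrors that nesting in Python. The function names are illustrative stand-ins, not PDGF classes, and starting the ids at 1 is an assumption.

```python
def id_generator(row: int) -> int:
    """Stand-in for gen_IdGenerator; assumes ids start at 1."""
    return row + 1


def padding(value, size: int, character: str = "0", pad_to_left: bool = True) -> str:
    """Stand-in for gen_PaddingGenerator: pad the text form of value to `size`."""
    text = str(value)
    return text.rjust(size, character) if pad_to_left else text.ljust(size, character)


def pre_postfix(value, prefix: str = "", postfix: str = "") -> str:
    """Stand-in for gen_PrePostfixGenerator."""
    return f"{prefix}{value}{postfix}"


def supplier_row(row: int) -> dict:
    s_suppkey = id_generator(row)
    # gen_OtherFieldValueGenerator: reuse the S_SUPPKEY value of the same row
    s_name = pre_postfix(padding(s_suppkey, 9, "0", True), prefix="Supplier ")
    return {"S_SUPPKEY": s_suppkey, "S_NAME": s_name}


print(supplier_row(0))  # {'S_SUPPKEY': 1, 'S_NAME': 'Supplier 000000001'}
```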
Base Generators in PDGF
• DictList generator
  - Random line from file
• Long generator
  - Random long in interval
• Others
  - StaticValue
  - Double
  - Date
  - String
  - Text
  - …

<table name="users">
  <size>10000</size>
  <fields>
    <field name="name">
      <type>java.sql.types.VARCHAR</type>
      <size>100</size>
      <gen_DictList>
        <file>dicts/names.dict</file>
      </gen_DictList>
    </field>
    <field name="age">
      <type>java.sql.types.NUMERIC</type>
      <gen_LongGenerator>
        <min>0</min>
        <max>120</max>
      </gen_LongGenerator>
    </field>
  </fields>
</table>
Null Generator
• Add NULL logic to every generator?
  - Could easily be implemented in a higher class
  - Adds to the configuration file
  - Reduces performance (every time)
• Higher order generator NullGenerator
  - Only used if added to the schema
  - Can be added to any generator

<field name="age">
  <type>java.sql.types.NUMERIC</type>
  <gen_NullGenerator>
    <probability>0.05</probability>
    <gen_LongGenerator>
      <min>0</min>
      <max>120</max>
    </gen_LongGenerator>
  </gen_NullGenerator>
</field>
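The higher-order idea can be sketched as a wrapper that takes any generator and returns a new one. This is an illustration of the pattern only. The function names and the use of Python's `random.Random` as the random source are assumptions, not PDGF's API.

```python
import random
from typing import Callable, Optional

Generator = Callable[[random.Random], Optional[int]]


def long_generator(minimum: int, maximum: int) -> Generator:
    """Base generator: uniform integer in [minimum, maximum]."""
    return lambda rng: rng.randint(minimum, maximum)


def null_generator(probability: float, inner: Generator) -> Generator:
    """Meta generator: return None with the given probability, else delegate."""
    def generate(rng: random.Random) -> Optional[int]:
        if rng.random() < probability:
            return None  # the wrapped generator is not invoked at all
        return inner(rng)
    return generate


# Mirrors the "age" field above: 5% NULL, otherwise a value in [0, 120].
age = null_generator(0.05, long_generator(0, 120))
rng = random.Random(2013)
sample = [age(rng) for _ in range(10)]
```

Fields that never contain NULL simply omit the wrapper, so they pay no cost for the NULL check.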
Meta Generators
• Control flow and post-processing generators
  - Null generator controls flow
• Post-processing
  - FormattedNumberGenerator
  - PaddingGenerator
  - UpperLowerCaseGenerator
  - PrePostfixGenerator
  - FormulaGenerator
• Flow control
  - ProbabilityGenerator
  - SequentialGenerator
  - IfGenerator
  - SwitchGenerator
  - ReferenceGenerator
Post-Processing Example
• Phone number for users
  - Tens of representations
  - PhoneNumberGenerator was too inflexible
• Formatted long number
  - Long numbers between 10010001 and 9999999999
  - Number formatting (%d%d%d) %d%d%d-%d%d%d%d

<field name="phonenumber">
  <type>java.sql.types.VARCHAR</type>
  <size>30</size>
  <generator name="FormattedNumberGenerator">
    <generator name="LongGenerator">
      <min>10010001</min>
      <max>9999999999</max>
    </generator>
    <format>(%d%d%d) %d%d%d-%d%d%d%d</format>
  </generator>
</field>
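One way to read the format string is that each `%d` consumes one digit of the generated number. The sketch below implements that reading. The slides do not specify how FormattedNumberGenerator handles numbers with fewer digits than placeholders, so the left padding with zeros and the function name `format_digits` are assumptions.

```python
def format_digits(number: int, pattern: str) -> str:
    """Fill each %d placeholder in pattern with one digit of number."""
    slots = pattern.count("%d")
    digits = str(number).rjust(slots, "0")  # assumption: pad short numbers with zeros
    if len(digits) > slots:
        raise ValueError("number has more digits than the pattern has placeholders")
    remaining = iter(digits)
    out = []
    position = 0
    while position < len(pattern):
        if pattern.startswith("%d", position):
            out.append(next(remaining))
            position += 2
        else:
            out.append(pattern[position])
            position += 1
    return "".join(out)


print(format_digits(4165551234, "(%d%d%d) %d%d%d-%d%d%d%d"))  # (416) 555-1234
```

Switching to another representation then only requires a different pattern, for example `%d%d%d.%d%d%d.%d%d%d%d`, with no new generator code.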
Flow Control Example
• More elaborate name field
  - Name male or female
    - 50% chance
  - All upper case
  - Padded to 100 characters
• Sequential generator
  - Probability generator
    - DictList generator
  - UpperLowerCase generator
  - Padding generator

<field name="name">
  <type>java.sql.types.VARCHAR</type>
  <size>100</size>
  <generator name="SequentialGenerator">
    <generator name="ProbabilityGenerator">
      <probability value="0.5">
        <generator name="DictList">
          <file>dicts/female.dict</file>
        </generator>
      </probability>
      <probability value="0.5">
        <generator name="DictList">
          <file>dicts/male.dict</file>
        </generator>
      </probability>
    </generator>
    <generator name="UpperLowerCaseGenerator">
      <mode>uppercase</mode>
    </generator>
    <generator name="PaddingGenerator">
      <character> </character>
      <padToLeft>true</padToLeft>
    </generator>
  </generator>
</field>
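The name field can be sketched as a small pipeline in which a sequential generator passes the output of each stage to the next and a probability generator picks one branch. The stage signature `(rng, value)`, the in-memory name lists standing in for the `.dict` files, and taking the pad width from the field size are assumptions made for this illustration.

```python
import random

FEMALE = ["Ada", "Grace", "Barbara"]  # hypothetical stand-in for dicts/female.dict
MALE = ["Alan", "Edgar", "Donald"]    # hypothetical stand-in for dicts/male.dict


def dict_list(entries):
    """Base generator: pick a random entry; ignores the incoming value."""
    return lambda rng, value=None: entries[rng.randrange(len(entries))]


def probability_generator(choices):
    """Flow control: run one branch, chosen by (probability, generator) pairs."""
    def generate(rng, value=None):
        pick = rng.random()
        cumulative = 0.0
        for probability, generator in choices:
            cumulative += probability
            if pick < cumulative:
                return generator(rng, value)
        return choices[-1][1](rng, value)  # guard against rounding in the sum
    return generate


def upper_case():
    return lambda rng, value: value.upper()


def padding(size, character=" ", pad_to_left=True):
    def generate(rng, value):
        return value.rjust(size, character) if pad_to_left else value.ljust(size, character)
    return generate


def sequential(*stages):
    """Flow control: feed the output of each stage into the next."""
    def generate(rng, value=None):
        for stage in stages:
            value = stage(rng, value)
        return value
    return generate


name = sequential(
    probability_generator([(0.5, dict_list(FEMALE)), (0.5, dict_list(MALE))]),
    upper_case(),
    padding(100),
)
```

Calling `name(random.Random(1))` returns an upper-case name from one of the two lists, left-padded with spaces to 100 characters.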
Core Performance
[Stacked bar chart, axis 0 to 250: Static Value (no cache), Null Generator (100% NULL), Null Generator (0% NULL); series: Base Time, Generator, Base Time Sub, Sub Generator]
• Test environment: single core laptop, no I/O
• Base time for framework ~ 55 ns (Base Time)
  - Seeding, method invocation, setting a value
• Computation time for generator 50+ ns (Gen Time)
• Cache update if referenced ~ 50 ns (Cache Update)
• Cache lookup if intra row reference ~ 50 ns (Cache Lookup)
• Sub-generator invocation ~ 50 ns
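The figures above can be combined into a rough per-value cost estimate. The additive model below is an assumption suggested by the stacked components, not a formula from the slides. Because generator time is quoted as "50+ ns", the result is best read as a lower bound.

```python
BASE_NS = 55            # framework base time per value
GENERATOR_NS = 50       # minimum computation time per generator
SUB_INVOCATION_NS = 50  # cost of invoking a sub-generator
CACHE_UPDATE_NS = 50    # cost if the field is referenced elsewhere
CACHE_LOOKUP_NS = 50    # cost of an intra-row reference


def estimate_ns(generators=1, sub_invocations=0, cache_updates=0, cache_lookups=0):
    """Lower-bound time per generated value under a simple additive model."""
    return (BASE_NS
            + generators * GENERATOR_NS
            + sub_invocations * SUB_INVOCATION_NS
            + cache_updates * CACHE_UPDATE_NS
            + cache_lookups * CACHE_LOOKUP_NS)


def values_per_second(ns_per_value):
    return 1e9 / ns_per_value


plain = estimate_ns()                                  # 55 + 50 = 105 ns
wrapped = estimate_ns(generators=2, sub_invocations=1) # 55 + 100 + 50 = 205 ns
```

Under these assumptions a plain generator costs at least 105 ns per value, and wrapping it in one meta generator that always calls its sub-generator raises that to at least 205 ns.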
Performance Basic Generators
[Bar chart, axis 0 to 600: DictList, LongGenerator, DoubleGenerator, DateGenerator, RandomString]
• Basic generators without formatting
  - 120 ns to 510 ns
Performance Formatted Values
[Bar chart, axis 0 to 2000: DictList, SimpleFormat Number Generator, DateGenerator (formatted), DoubleGenerator (4 places)]
• Basic generators with formatting
  - Usually > 1000 ns
Performance Meta Generators
[Bar chart, axis 0 to 1600: Null Generator (100% Null), Null Generator (0% Null), PrePostFix, Sequential (exec 2), Sequential (concat 2), Sequential (2 formatted + long)]
• Meta generator overhead:
  - Base overhead ~ 50 ns
  - Generator overhead starts from 50 ns
  - Sub generator invocation ~ 50 ns
• Often negligible due to lazy formatting
Use Cases
• TPC-H / SSB
  - 8 tables, 61 columns (first non-trivial example)
  - Without meta-FVGs (field value generators): 26 custom FVGs
  - 2h editing: 10 custom FVGs
  - 1 day reimplementation: 0 custom FVGs, i.e. no coding
  - SSB variations
    - Skews on dimension attributes, fact measures, references
• TPC-DI (in process)
  - 20 tables, 200 columns
  - 19 custom FVGs (mainly for performance in corner cases)
  - 56x NullGenerator
  - 32x ProbabilityGenerator
  - 3000 lines of config (XML import for multiple files)
Conclusion & Future Work
• Meta generators
  - Improve usability and expressiveness
  - Speed up schema definition
  - Remove necessity for coding
  - Enlarged configuration files
• Used in TPC benchmark(s)
• Performance overhead is small, often negligible
• Future work
  - GUI and SQL export
  - SQL import and data extraction
Thanks
• Questions?
• Contact: tilmann.rabl@utoronto.ca
• Download and try PDGF: http://www.paralleldatageneration.org
• Some big data info in our BigBench presentation
  - Tuesday, 4pm, Industry 3
Rapid Development of Data Generators Using Meta Generators in PDGF

  • 1. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG Rapid Development of Data Generators Using Meta Generators in PDGF Tilmann Rabl, Meikel Poess, Manuel Danisch, Hans-Arno Jacobsen DBTest 2013, June 24, New York City
  • 2. DBMS Benchmarking is Increasingly Complex
      • Data volumes are skyrocketing
          - Enterprise data warehouses double every three years
          - Many enterprise data warehouses are petabyte-sized
      • Systems are becoming increasingly complex
          - Large number of processor cores
              - Single systems (SMP) with a high number of cores (80 on commodity hardware, 2048 on specialized hardware)
              - Multi-node systems (sky is the limit)
          - Large memory
              - Dell released a TPC-H benchmark with 15 TB of main memory on 64 systems
      • How to challenge these systems?
  • 3. Benchmarks are increasingly complex
      [Bar chart: number of tables and columns per benchmark. TPC-A: 4 tables, 10 columns; TPC-C: 9 tables, 92 columns; TPC-E: 33 tables, 188 columns; TPC-DS: 24 tables, 430 columns]
      • More tables, columns
      • More relationships, dependencies, data types, …
      • How to build these benchmarks?
      • Parallel Data Generation Framework to the rescue!
  • 4. Parallel Data Generation Framework
      • Generic data generation framework
      • Relational model
          - Schema specified in configuration file
          - Post-processing stage for alternative representations
      • Repeatable computation
          - Based on XORSHIFT random number generators
          - Hierarchical seeding strategy
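The repeatable-computation idea on slide 4 can be sketched roughly as follows. This is a minimal illustration, not PDGF's actual code: `xorshift64` is Marsaglia's well-known 64-bit xorshift step, while `derive`, `value_seed`, `long_value`, the mixing constant, and the project/table/column/row hierarchy are assumptions made for this example.

```python
MASK64 = (1 << 64) - 1
GOLDEN = 0x9E3779B97F4A7C15  # arbitrary odd mixing constant (assumption)


def xorshift64(state: int) -> int:
    """One step of a 64-bit xorshift generator (shift triple 13, 7, 17)."""
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state & MASK64


def derive(parent_seed: int, index: int) -> int:
    """Derive a child seed from a parent seed and a child index (hypothetical scheme)."""
    state = (parent_seed + (index + 1) * GOLDEN) & MASK64
    state = state or GOLDEN  # xorshift must never start from state 0
    return xorshift64(xorshift64(state))


def value_seed(project_seed: int, table: int, column: int, row: int) -> int:
    """Walk the hierarchy project -> table -> column -> row to get a per-value seed."""
    return derive(derive(derive(project_seed, table), column), row)


def long_value(seed: int, lo: int, hi: int) -> int:
    """Map a seed to an integer in [lo, hi]; the modulo bias is ignored in this sketch."""
    return lo + xorshift64(seed) % (hi - lo + 1)


# Any worker can recompute the value of any cell without generating the rows before it.
age_row_1000 = long_value(value_seed(42, table=1, column=2, row=1000), 0, 120)
```

Because each value depends only on its position in the hierarchy, rows can be split across threads or nodes in any order and still yield identical output, which is the property that makes parallel generation repeatable.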
  • 6. PDGF Architecture
      • Controller
          - Initialization
      • Meta Scheduler
          - Inter-node scheduling
      • Scheduler
          - Inter-thread scheduling
      • Worker
          - Blockwise data generation
      • Update Black Box
          - Coordination of data updates
      • Seeding System
          - Random sequence adaptation
      • Generators
          - Value generation
      • Output system
          - Data formatting
      • To generate data for a schema the user defines:
          - Schema XML file
              - Defines relational schema
          - Generation XML file
              - Defines output format (CSV, XML, merging tables)
  • 7. Configuring PDGF
      • Schema configuration
          - Data model
      • Relational model
          - Tables, fields
      • Properties
          - Table size, characters, …
      • Generators
          - Base generators
          - Meta generators
      • Update definition
          - Insert, update, delete
          - Generated as change data capture
      <table name="SUPPLIER">
        <size>${S}</size>
        <field name="S_SUPPKEY" size="" type="NUMERIC" primary="true" unique="true">
          <gen_IdGenerator />
        </field>
        <field name="S_NAME" size="25" type="VARCHAR">
          <gen_PrePostfixGenerator>
            <gen_PaddingGenerator>
              <gen_OtherFieldValueGenerator>
                <reference field="S_SUPPKEY" />
              </gen_OtherFieldValueGenerator>
              <character>0</character>
              <padToLeft>true</padToLeft>
              <size>9</size>
            </gen_PaddingGenerator>
            <prefix>Supplier </prefix>
          </gen_PrePostfixGenerator>
        </field>
        [..]
  • 8. Base Generators in PDGF
      • DictList generator
          - Random line from file
      • Long generator
          - Random long in interval
      • Others
          - StaticValue
          - Double
          - Date
          - String
          - Text
          - …
      <table name="users">
        <size>10000</size>
        <fields>
          <field name="name">
            <type>java.sql.types.VARCHAR</type>
            <size>100</size>
            <gen_DictList>
              <file>dicts/names.dict</file>
            </gen_DictList>
          </field>
          <field name="age">
            <type>java.sql.types.NUMERIC</type>
            <gen_LongGenerator>
              <min>0</min>
              <max>120</max>
            </gen_LongGenerator>
          </field>
        </fields>
      </table>
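To make the two base generators on slide 8 concrete, here is a small sketch of what they compute. The class and method names (`DictListGenerator`, `LongGenerator`, `next_value`, `from_file`) are hypothetical, Python's `random.Random` stands in for PDGF's seeded generator, and treating `<max>` as inclusive is an assumption.

```python
import random


class DictListGenerator:
    """Returns a random entry from a dictionary, one entry per line of a file."""

    def __init__(self, entries):
        self.entries = list(entries)

    @classmethod
    def from_file(cls, path):
        # Skip blank lines so that every choice is a real dictionary entry.
        with open(path, encoding="utf-8") as handle:
            return cls(line.rstrip("\n") for line in handle if line.strip())

    def next_value(self, rng: random.Random) -> str:
        return rng.choice(self.entries)


class LongGenerator:
    """Returns a random integer in [minimum, maximum] (inclusive bound assumed)."""

    def __init__(self, minimum: int, maximum: int):
        self.minimum = minimum
        self.maximum = maximum

    def next_value(self, rng: random.Random) -> int:
        return rng.randint(self.minimum, self.maximum)


# Mirrors the "users" table: a name drawn from a dictionary and an age in 0..120.
names = DictListGenerator(["Ada", "Grace", "Edsger"])  # stand-in for dicts/names.dict
ages = LongGenerator(0, 120)
rng = random.Random(2013)
row = (names.next_value(rng), ages.next_value(rng))
```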
  • 9. Null Generator
      • Add NULL logic to every generator?
          - Could easily be implemented in a higher-level class
          - Adds to the configuration file
          - Reduces performance (every time)
      • Higher-order generator NullGenerator
          - Only used if added to the schema
          - Can be added to any generator
      <field name="age">
        <type>java.sql.types.NUMERIC</type>
        <gen_NullGenerator>
          <probability>0.05</probability>
          <gen_LongGenerator>
            <min>0</min>
            <max>120</max>
          </gen_LongGenerator>
        </gen_NullGenerator>
      </field>
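The higher-order design on slide 9 amounts to wrapping any generator in another one that decides whether to call it. A minimal sketch, assuming generators are plain callables that take a seeded `random.Random`; the names `null_generator` and `long_generator` are hypothetical and not PDGF's API:

```python
import random
from typing import Callable, Optional

Generator = Callable[[random.Random], Optional[int]]


def long_generator(minimum: int, maximum: int) -> Generator:
    """Base generator: random integer in [minimum, maximum]."""
    return lambda rng: rng.randint(minimum, maximum)


def null_generator(probability: float, sub: Generator) -> Generator:
    """Wrap `sub` so that it yields None (NULL) with the given probability."""

    def generate(rng: random.Random) -> Optional[int]:
        if rng.random() < probability:
            return None  # the sub-generator is not invoked at all
        return sub(rng)

    return generate


# Mirrors the "age" field: 5% NULLs, otherwise an age between 0 and 120.
age = null_generator(0.05, long_generator(0, 120))
rng = random.Random(9)
sample = [age(rng) for _ in range(10_000)]
null_share = sample.count(None) / len(sample)
```

Fields that never need NULLs simply omit the wrapper, so they pay neither the configuration cost nor the per-value check.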
  • 10. Meta Generators
      • Control flow and post-processing generators
          - Null generator controls flow
      • Post-processing
          - FormattedNumberGenerator
          - PaddingGenerator
          - UpperLowerCaseGenerator
          - PrePostfixGenerator
          - FormulaGenerator
      • Flow control
          - ProbabilityGenerator
          - SequentialGenerator
          - IfGenerator
          - SwitchGenerator
          - ReferenceGenerator
  • 11. Post-Processing Example
      • Phone number for users
          - Tens of representations
          - PhoneNumberGenerator was too inflexible
      • Formatted long number
          - Long numbers between 10010001 and 9999999999
          - Number formatting (%d%d%d) %d%d%d-%d%d%d%d
      <field name="phonenumber">
        <type>java.sql.types.VARCHAR</type>
        <size>30</size>
        <generator name="FormattedNumberGenerator">
          <generator name="LongGenerator">
            <min>10010001</min>
            <max>9999999999</max>
          </generator>
          <format>(%d%d%d) %d%d%d-%d%d%d%d</format>
        </generator>
      </field>
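One way to read the format string on slide 11 is that each `%d` consumes a single decimal digit of the generated number. The sketch below follows that reading; it is an assumption, as is left-padding shorter numbers with zeros, since the slide does not say how PDGF treats values with fewer digits than slots. The function name `format_digits` is hypothetical.

```python
def format_digits(value: int, pattern: str) -> str:
    """Fill every %d slot in `pattern` with one decimal digit of `value`."""
    slots = pattern.count("%d")
    digits = str(value).zfill(slots)  # assumption: pad short numbers with leading zeros
    if len(digits) > slots:
        raise ValueError(f"{value} has more than {slots} digits")

    literal_parts = pattern.split("%d")  # text before, between, and after the slots
    pieces = [literal_parts[0]]
    for digit, literal in zip(digits, literal_parts[1:]):
        pieces.append(digit)
        pieces.append(literal)
    return "".join(pieces)


PHONE_FORMAT = "(%d%d%d) %d%d%d-%d%d%d%d"
phone = format_digits(4165551234, PHONE_FORMAT)
```

Other representations then only require a different format string in the configuration, with no new generator class.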
  • 12. Flow Control Example
      • More elaborate name field
          - Name male or female
          - 50% chance
          - All upper case
          - Padded to 100 characters
      • Sequential generator
          - Probability generator
          - DictList generator
          - UpperLowerCase generator
          - Padding generator
      <field name="name">
        <type>java.sql.types.VARCHAR</type>
        <size>100</size>
        <generator name="SequentialGenerator">
          <generator name="ProbabilityGenerator">
            <probability value="0.5">
              <generator name="DictList">
                <file>dicts/female.dict</file>
              </generator>
            </probability>
            <probability value="0.5">
              <generator name="DictList">
                <file>dicts/male.dict</file>
              </generator>
            </probability>
          </generator>
          <generator name="UpperLowerCaseGenerator">
            <mode>uppercase</mode>
          </generator>
          <generator name="PaddingGenerator">
            <character> </character>
            <padToLeft>true</padToLeft>
          </generator>
        </generator>
      </field>
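The composition on slide 12 can be mimicked with a few small functions. This sketch assumes that the sequential generator hands the current value from one step to the next and that `padToLeft` means the padding characters are added on the left; neither is confirmed by the slides, and all function names are hypothetical.

```python
import random


def dict_list(entries):
    """Base step: pick a random dictionary entry, ignoring the current value."""
    return lambda rng, current: rng.choice(entries)


def probability(branches):
    """Flow control: run one sub-generator, chosen by its (probability, generator) weight."""

    def step(rng, current):
        draw, cumulative = rng.random(), 0.0
        for weight, generator in branches:
            cumulative += weight
            if draw < cumulative:
                return generator(rng, current)
        return branches[-1][1](rng, current)  # guard against rounding error

    return step


def upper_case():
    """Post-processing: convert the current value to upper case."""
    return lambda rng, current: current.upper()


def padding(character, size, pad_to_left):
    """Post-processing: pad the current value to `size` characters."""
    if pad_to_left:
        return lambda rng, current: current.rjust(size, character)
    return lambda rng, current: current.ljust(size, character)


def sequential(*steps):
    """Flow control: run the steps in order, passing the value along."""

    def generate(rng, current=None):
        for step in steps:
            current = step(rng, current)
        return current

    return generate


name_field = sequential(
    probability([
        (0.5, dict_list(["Alice", "Maria"])),  # stand-in for dicts/female.dict
        (0.5, dict_list(["Bob", "Juan"])),     # stand-in for dicts/male.dict
    ]),
    upper_case(),
    padding(" ", 100, pad_to_left=True),
)
name = name_field(random.Random(12))
```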
  • 13. Core Performance
      [Chart: time per value (axis 0 to 250) for Static Value (no cache), Null Generator (100% NULL), and Null Generator (0% NULL); series: Base Time, Generator, Base Time Sub, Sub Generator]
      • Test environment: single-core laptop, no I/O
      • Base time for framework ~ 55 ns (Base Time)
          - Seeding, method invocation, setting a value
      • Computation time for generator 50+ ns (Gen Time)
      • Cache update if referenced ~ 50 ns (Cache Update)
      • Cache lookup if intra-row reference ~ 50 ns (Cache Lookup)
      • Sub-generator invocation ~ 50 ns
  • 14. Performance Basic Generators
      [Chart: time per value (axis 0 to 600) for DictList, LongGenerator, DoubleGenerator, DateGenerator, RandomString]
      • Basic generators without formatting
          - 120 ns to 510 ns
  • 15. Performance Formatted Values
      [Chart: time per value (axis 0 to 2000) for DictList, SimpleFormat Number Generator, DateGenerator (formatted), DoubleGenerator (4 places)]
      • Basic generators with formatting
          - Usually > 1000 ns
  • 16. Performance Meta Generators
      [Chart: time per value (axis 0 to 1600) for Null Generator (100% Null), Null Generator (0% Null), PrePostFix, Sequential (exec 2), Sequential (concat 2), Sequential (2 formatted + long)]
      • Meta generator overhead:
          - Base overhead ~ 50 ns
          - Generator overhead starts from 50 ns
          - Sub-generator invocation ~ 50 ns
      • Often negligible due to lazy formatting
  • 17. Use Cases
      • TPC-H / SSB
          - 8 tables, 61 columns (first non-trivial example)
          - Without meta-FVGs: 26 custom FVGs
          - After 2 h of editing: 10 custom FVGs
          - After 1 day of reimplementation: 0 custom FVGs, i.e. no coding
          - SSB variations
              - Skews on dimension attributes, fact measures, references
      • TPC-DI (in progress)
          - 20 tables, 200 columns
          - 19 custom FVGs (mainly for performance in corner cases)
          - 56x NullGenerator
          - 32x ProbabilityGenerator
          - 3000 lines of config (XML import for multiple files)
  • 18. Conclusion & Future Work
      • Meta generators
          - Improve usability and expressiveness
          - Speed up schema definition
          - Remove necessity for coding
          - Enlarged configuration files
      • Used in TPC benchmark(s)
      • Performance overhead is small, often negligible
      • Future work
          - GUI and SQL export
          - SQL import and data extraction
  • 19. Thanks
      • Questions?
      • Contact: tilmann.rabl@utoronto.ca
      • Download and try PDGF: http://www.paralleldatageneration.org
      • Some big data info in our BigBench presentation
          - Tuesday, 4pm, Industry 3