Mutant Tests Too: The SQL
DataWorks Summit, Barcelona 2019
Elliot West, Jay Green-Stevens
21 March 2019
2
[Slide graphic: a wall of repeated "SQL" with "TESTS??" hidden in the middle]
3
4
45 years on:
Is SQL still relevant?
[Timeline chart; legend: Platform release, SQL release(s)]
5
Our vision
• Any process that alters or interprets data should be tested!
• Unit tests are the first defence
• A stitch in time…
6
Unit testing SQL: Challenges
• Unit test culture + tools not as strong
• Monolithic scripts difficult to reason about
• No visibility on test coverage or quality
case when ( ! (get_from_list(p4, ',', 0) LIKE '%<%') and
            length(get_from_list(get_from_list(p4, ',', 0), ";", 2)) > 0 )
     then get_from_list(get_from_list(p4, ',', 0), ";", 2)
end as extract1,
case when ( ! (get_from_list(p4, ',', 0) LIKE '%<%') and
            length(get_from_list(get_from_list(p4, ',', 0), ";", 6)) > 0 )
     then cast(get_from_list(get_from_list(p4, ',', 0), ";", 6) as BIGINT)
end as extract2
7
Unit testing SQL: Focus areas
• Improve the following:
• Unit testing frameworks
– HiveRunner
• SQL modularisation patterns
• Code coverage
– MutantSwarm
8
HiveRunner overview
• Open source unit test framework for Hive SQL
– https://github.com/klarna/HiveRunner
• Local execution (Laptop/CI servers)
• Test cases defined in Java + Hive SQL
• Based on JUnit 4
9
Framework: HiveRunner
[Diagram: HiveRunner with its inputs and outputs: User Test Suite, SQL Setup, SQL Under Test, Test Reports]
10
Monolithic queries
SELECT ... FROM (                      -- Query 1
  SELECT ... FROM (                    -- Query 2
    SELECT ... FROM (                  -- Query 3
      SELECT ... FROM a WHERE ...      -- Query 4
    ) A LEFT JOIN (                    -- Query 3
      SELECT ... FROM b                -- Query 5
    ) B ON (...)                       -- Query 3
  ) ab FULL OUTER JOIN (               -- Query 2
    SELECT ... FROM c WHERE ...        -- Query 6
  ) C ON (...)                         -- Query 2
) abc LEFT JOIN (                      -- Query 1
  SELECT ... FROM d WHERE ...          -- Query 7
) D ON (...)                           -- Query 1
GROUP BY ...;                          -- Query 1
11
Approaches
• Chained tables
– Regular and Temporary
• Views
• Variable substitution
• HPL/SQL
12
Chained Tables
Before (monolithic):
INSERT OVERWRITE TABLE x
SELECT …
FROM (
  SELECT upper(a), b, …
  FROM source
  …
) …

After, Q1 (inner.sql):
INSERT OVERWRITE TABLE inner
SELECT upper(a), b, …
FROM source
…

After, Q2 (query.sql):
SOURCE inner.sql;
INSERT OVERWRITE TABLE x
SELECT …
FROM (
  SELECT *
  FROM inner
) …
13
Chained Tables
• Well understood
• Can view intermediate results
• I/O between stages
• Query optimisation limited
[Diagram: tables source1, source2, temp1, temp3 and result, linked by queries Q1 to Q4]
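A minimal sketch of the chained-tables pattern, assuming Python's built-in sqlite3 module as a local stand-in for Hive (Hive itself would use INSERT OVERWRITE TABLE rather than CREATE TABLE ... AS). The table name inner_stage, the sample rows and the b * 10 outer projection are hypothetical and only serve to show that the intermediate stage can be inspected and asserted on by itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (a TEXT, b INTEGER)")
conn.executemany("INSERT INTO source VALUES (?, ?)", [("x", 1), ("y", 2)])

# Stage 1 (the role of inner.sql): materialise the inner query as a table.
conn.execute("CREATE TABLE inner_stage AS SELECT upper(a) AS a, b FROM source")

# The intermediate result is a real table, so a unit test can check it directly.
stage_rows = conn.execute("SELECT a, b FROM inner_stage ORDER BY b").fetchall()

# Stage 2 (the role of query.sql): the outer query reads the intermediate table.
conn.execute("CREATE TABLE x AS SELECT a, b * 10 AS b10 FROM inner_stage")
final_rows = conn.execute("SELECT a, b10 FROM x ORDER BY b10").fetchall()
conn.close()
```

Each stage is written out before the next one reads it, which is the I/O cost the slide mentions.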
14
Variable substitution
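Before (monolithic):
INSERT OVERWRITE TABLE x
SELECT …
FROM (
  SELECT upper(a), b, …
  FROM source
  …
) …

After, inner.sql:
SELECT upper(a), b, …
FROM source
…

After, query.sql:
SET inner = <inner contents>;
INSERT OVERWRITE TABLE x
SELECT …
FROM (
  ${inner}
) …

Scripts S1 and S2 are assembled into the single query Q1.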
15
Variable substitution
[Diagram: tables source1 and source2 feed result through a single query; S1 + S2 = Q1]
• Requires assembly
• Not 'pure SQL'
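The assembly step can be sketched as follows, assuming Python's string.Template for the ${inner} placeholder and the built-in sqlite3 module as a local stand-in for Hive. The fragment contents, the sample rows and the b * 10 projection are hypothetical.

```python
import sqlite3
from string import Template

# Two fragments kept apart so that each can be tested separately.
INNER_SQL = "SELECT upper(a) AS a, b FROM source"
OUTER_SQL = Template("SELECT a, b * 10 AS b10 FROM (${inner}) t ORDER BY b10")

# Assembly: substitute the inner fragment to obtain one executable query.
QUERY = OUTER_SQL.substitute(inner=INNER_SQL)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (a TEXT, b INTEGER)")
conn.executemany("INSERT INTO source VALUES (?, ?)", [("x", 1), ("y", 2)])

inner_rows = conn.execute(INNER_SQL + " ORDER BY b").fetchall()  # fragment alone
rows = conn.execute(QUERY).fetchall()                            # assembled query
conn.close()
```

The engine only ever sees the assembled query, so no intermediate table is written, but the scripts are no longer plain SQL until they have been assembled.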
16
Nested Views
Before (monolithic):
INSERT OVERWRITE TABLE x
SELECT …
FROM (
  SELECT upper(a), b, …
  FROM source
  …
) …

After, V1 (inner.sql):
CREATE VIEW inner AS
SELECT upper(a), b, …
FROM source
…

After, Q1 (query.sql):
SOURCE inner.sql;
INSERT OVERWRITE TABLE x
SELECT …
FROM (
  SELECT *
  FROM inner
) …
17
Nested Views
[Diagram: tables source1 and source2 feed result through query Q1 and view V1]
• Greater scope for optimisation
• Poor parameterisation handling
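A minimal sketch of the nested-views pattern, again assuming sqlite3 as a local stand-in for Hive. The view name inner_view, the sample rows and the b * 10 projection are hypothetical. The final count illustrates that a view is not materialised: it is evaluated together with whatever query reads it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (a TEXT, b INTEGER)")
conn.executemany("INSERT INTO source VALUES (?, ?)", [("x", 1), ("y", 2)])

# The role of inner.sql: name the inner query as a view instead of a table.
conn.execute("CREATE VIEW inner_view AS SELECT upper(a) AS a, b FROM source")

# The view can be queried in isolation by a unit test...
view_rows = conn.execute("SELECT a, b FROM inner_view ORDER BY b").fetchall()

# ...and by the outer query, without any intermediate table being written.
final_rows = conn.execute(
    "SELECT a, b * 10 AS b10 FROM inner_view ORDER BY b10"
).fetchall()

# Nothing was stored: new source rows are visible through the view at once.
conn.execute("INSERT INTO source VALUES ('z', 3)")
row_count = conn.execute("SELECT count(*) FROM inner_view").fetchone()[0]
conn.close()
```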
18
Testing: Job counts
[Bar chart: counts (0 to 5) of File Moves, Map-Only Jobs and MapReduce Jobs for Monolithic Query, Views, Chained Tables, and Chained Tables (Temporary)]
• Queries: 3
• Nesting depth: 2
• Aggregations: 1
• Sorts: 1
19
Testing: Execution time
[Bar chart: Average Time Taken (seconds), axis 0 to 2000, for Monolithic Query, Views, Chained Tables, and Chained Tables (Temporary)]
• EMR cluster
• 100GB data input
• Multiple iterations
20
Further details
• https://medium.com/hotels-com-technology/modularising-hive-queries-dff90f85f7d6
• http://apache.org/.../Unit+Testing+Hive+SQL
21
Mutant Swarm
22
Code coverage tools
• Java has awesome code coverage and reporting tools
• Cobertura, Jacoco, PIT
• Highlights weaknesses in tests
• Motivates developers to write more tests
• Fast feedback
• Can we have this for SQL?
23
What is Mutation testing?
• Used to design new software tests and evaluate the quality of existing software tests
• Involves modifying a program in small ways (mutants)
• Test suites are measured by the % of mutants killed
• New tests designed to kill additional mutants
24
Example SQL test
SOURCE UNDER TEST
SELECT lower(name), price
FROM hotels
WHERE price > 100
ORDER BY price DESC
LIMIT 5

TEST DATA (hotels)
ID  NAME      PRICE
1   AC Hotel  150
2   Princess  200
3   Motel X   60

RESULT
NAME      PRICE
princess  200
ac hotel  150

TEST SUITE
Assert size == 2
Assert result[0] == 'princess'
Assert result[1] == 'ac hotel'
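The example test can be reproduced locally with a short sketch. It assumes Python's built-in sqlite3 module in place of Hive and HiveRunner, which is enough for this query because it uses only standard SQL. The helper name run is hypothetical.

```python
import sqlite3

QUERY = """
SELECT lower(name), price
FROM hotels
WHERE price > 100
ORDER BY price DESC
LIMIT 5
"""

def run(query):
    """Load the slide's test data into a fresh database and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hotels (id INTEGER, name TEXT, price INTEGER)")
    conn.executemany(
        "INSERT INTO hotels VALUES (?, ?, ?)",
        [(1, "AC Hotel", 150), (2, "Princess", 200), (3, "Motel X", 60)],
    )
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows

result = run(QUERY)

# The three assertions from the slide's test suite (names only).
assert len(result) == 2
assert result[0][0] == "princess"
assert result[1][0] == "ac hotel"
```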
25
Mutating SQL
SELECT lower(name), price
FROM hotels
WHERE price > 100
ORDER BY price DESC
LIMIT 5
Sequencing: genes and their mutations
GENE        MUTATIONS
FN: lower   upper
COL: price  null, name
OP: >       <, =, <>
SORT: DESC  ASC
VAL: 5      1, 10
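Gene sequencing and mutant generation can be approximated with plain text matching. The sketch below is a deliberately naive illustration and makes no claim about how Mutant Swarm is implemented. The MUTATIONS table and the generate_mutants helper are hypothetical, and the column gene is left out because a text match cannot tell the projected price from the price in the WHERE and ORDER BY clauses.

```python
import re

QUERY = ("SELECT lower(name), price FROM hotels "
         "WHERE price > 100 ORDER BY price DESC LIMIT 5")

# Gene pattern -> candidate mutations, following the slide's table.
MUTATIONS = {
    r"\blower\b": ["upper"],
    r" > ": [" < ", " = ", " <> "],
    r"\bDESC\b": ["ASC"],
    r"\bLIMIT 5\b": ["LIMIT 1", "LIMIT 10"],
}

def generate_mutants(query):
    """Return one mutant per (gene occurrence, mutation) pair."""
    mutants = []
    for pattern, replacements in MUTATIONS.items():
        for match in re.finditer(pattern, query):
            for replacement in replacements:
                # Change exactly one gene and leave the rest of the query intact.
                mutants.append(
                    query[:match.start()] + replacement + query[match.end():]
                )
    return mutants

mutants = generate_mutants(QUERY)
```

Each mutant differs from the original in a single place, so a failing test points at one specific gene.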
26
Mutant survivors/victims
ORIGINAL
SELECT lower(name), price
FROM hotels
WHERE price > 100
ORDER BY price DESC
LIMIT 5

TEST DATA (hotels)
ID  NAME      PRICE
1   AC Hotel  150
2   Princess  200
3   Motel X   60

TEST
Assert size == 2
Assert result[0] == 'princess'
Assert result[1] == 'ac hotel'

MUTANTS (each changes one line of the original; ╳ = victim, ✓ = survivor)
MUTATED LINE               OUTCOME
SELECT upper(name), price  ╳
SELECT lower(name), null   ✓
WHERE price < 100          ╳
WHERE price = 100          ╳
ORDER BY price ASC         ╳

Column projection price not fully covered by test suite.
27
Gaps versus weaknesses
SELECT lower(name), null
FROM hotels
WHERE price > 100
ORDER BY price DESC
LIMIT 5 ✓
• Survivors can suggest a gap in tests
• However, perhaps price was expected to be null?
• Suggests that more tests and more discriminating test data are required
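One way to close this particular gap is to assert on the price column as well. The sketch below assumes sqlite3 as a local stand-in for Hive. The weak_test and strong_test names are hypothetical, and the stronger assertion presumes that price is in fact meant to be returned.

```python
import sqlite3

ORIGINAL = ("SELECT lower(name), price FROM hotels "
            "WHERE price > 100 ORDER BY price DESC LIMIT 5")
NULL_MUTANT = ("SELECT lower(name), null FROM hotels "
               "WHERE price > 100 ORDER BY price DESC LIMIT 5")

def run(query):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hotels (id INTEGER, name TEXT, price INTEGER)")
    conn.executemany(
        "INSERT INTO hotels VALUES (?, ?, ?)",
        [(1, "AC Hotel", 150), (2, "Princess", 200), (3, "Motel X", 60)],
    )
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows

def weak_test(result):
    """The original suite: size and names only."""
    return (len(result) == 2
            and result[0][0] == "princess"
            and result[1][0] == "ac hotel")

def strong_test(result):
    """Adds an assertion on the projected price values."""
    return weak_test(result) and [row[1] for row in result] == [200, 150]

mutant_survives_weak = weak_test(run(NULL_MUTANT))
mutant_survives_strong = strong_test(run(NULL_MUTANT))
original_passes_strong = strong_test(run(ORIGINAL))
```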
28
Mutant Swarm: How it works
1. Run tests + report
2. Sequence SQL genes
3. Generate mutants
4. Run tests for each mutant
5. Mutation report
[Diagram: the HiveRunner setup (User Test Suite, SQL Setup, SQL Under Test, Test Reports) extended with a Mutant Gene DB + Sequencer, multiple copies of Mutant SQL Under Test, and Mutation Reports]
29
30
Challenges
• Execution performance
– HiveRunner is not the fastest
• Solutions?
– Run tests in parallel
– Mutate only changed code sections
– Improve HiveRunner performance
t_MS ∝ t_HR × genes × mutations (t_MS: Mutant Swarm run time; t_HR: HiveRunner run time)
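A back-of-the-envelope sketch of this cost model follows. The function name, the 30-second suite time and the worker count are hypothetical; the mutation counts per gene are those of the example query (1 + 2 + 3 + 1 + 2 = 9 mutants). It assumes every mutant run costs as much as one normal suite run and that parallel workers split the mutants evenly.

```python
import math

def estimated_runtime(suite_seconds, mutations_per_gene, workers=1):
    """Estimate total time: one baseline run plus one run per mutant."""
    mutant_runs = sum(mutations_per_gene)
    # With parallel workers, mutants run in batches of size `workers`.
    batches = math.ceil(mutant_runs / workers)
    return suite_seconds * (1 + batches)

# Genes of the example query: FN, COL, OP, SORT, VAL.
MUTATIONS_PER_GENE = [1, 2, 3, 1, 2]

sequential = estimated_runtime(30, MUTATIONS_PER_GENE)
parallel = estimated_runtime(30, MUTATIONS_PER_GENE, workers=3)
```

Under these assumptions the sequential estimate is 300 seconds and three workers bring it down to 120 seconds, which is why running tests in parallel and mutating only changed code are listed as solutions.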
31
Summary
• Created a toolkit to facilitate:
– Increased test development, of a higher quality
– Data quality improvement by identifying errors
– Unit testing, modularisation, test coverage
• Future work: improve
– Execution performance
– Mutation types (greater coverage of SQL language)
32
github.com/HotelsDotCom/mutant-swarm
hive-runner-user@googlegroups.com
linkedin.com/in/teabot
linkedin.com/in/jay-greeeen
33
Thank you!
Questions?


Editor's notes

  1. Java/Cascading: modular, automated testing. SQL: manual testing, long cycle times.
  2. Accessible. Powerful. Widely taught.
  3. Innovation in data processing space. Hadoop changes everything. Higher levels of abstraction. Streaming SQL follows. As recently as Beam. Cloud services also.
  4. Unit tests empower developers in the testing process. Hours spent writing tests save days cleaning up erroneous data.
  5. Easy to write. Difficult to maintain. Hard to test. Would like to decompose into units! Modularise with Hive… How?
  6. Chained tables: common pattern in our team. Belief that views have adverse performance impact.
  7. Query plan. Number of jobs: rough estimation of work/overheads.
  8. All very similar. Takeaways: you can use any. Views don't seem to adversely impact performance.