Mutant Tests Too: The SQL

DataWorks Summit, Barcelona 2019
Elliot West, Jay Green-Stevens
21 March 2019

2
[Slide graphic: a wall of repeated "SQL" with a single "TESTS??" buried in the middle]

4
45 years on:
Is SQL still relevant?
[Timeline graphic: platform releases shown against SQL release(s)]

5
Our vision
• Any process that alters or interprets data should be tested!
• Unit tests are the first defence
• A stitch in time…

6
Unit testing SQL: Challenges
• Unit test culture + tools not as strong
• Monolithic scripts difficult to reason about
• No visibility on test coverage or quality
case when ( ! (get_from_list(p4, ',', 0) LIKE '%<%') and
            length(get_from_list(get_from_list(p4, ',', 0), ";", 2)) > 0 )
     then get_from_list(get_from_list(p4, ',', 0), ";", 2)
end as extract1,
case when ( ! (get_from_list(p4, ',', 0) LIKE '%<%') and
            length(get_from_list(get_from_list(p4, ',', 0), ";", 6)) > 0 )
     then cast(get_from_list(get_from_list(p4, ',', 0), ";", 6) as BIGINT)
end as extract2

7
Unit testing SQL: Focus areas
• Improve the following:
– Unit testing frameworks (HiveRunner)
– SQL modularisation patterns
– Code coverage (Mutant Swarm)

8
HiveRunner overview
• Open source unit test framework for Hive SQL
– https://github.com/klarna/HiveRunner
• Local execution (Laptop/CI servers)
• Test cases defined in Java + Hive SQL
• Based on JUnit 4

9
Framework: HiveRunner
[Diagram: HiveRunner, with the user test suite, setup SQL and SQL under test as inputs, and test reports as output]

10
Monolithic queries
SELECT ... FROM (                      -- Query 1
  SELECT ... FROM (                    -- Query 2
    SELECT ... FROM (                  -- Query 3
      SELECT ... FROM a WHERE ...      -- Query 4
    ) A LEFT JOIN (                    -- Query 3
      SELECT ... FROM b                -- Query 5
    ) B ON (...)                       -- Query 3
  ) ab FULL OUTER JOIN (               -- Query 2
    SELECT ... FROM c WHERE ...        -- Query 6
  ) C ON (...)                         -- Query 2
) abc LEFT JOIN (                      -- Query 1
  SELECT ... FROM d WHERE ...          -- Query 7
) D ON (...)                           -- Query 1
GROUP BY ...;                          -- Query 1

11
Approaches
• Chained tables
– Regular and Temporary
• Views
• Variable substitution
• HPL/SQL

12
Chained Tables
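Monolithic query:
INSERT OVERWRITE TABLE x
SELECT …
FROM (
  SELECT upper(a), b, …
  FROM source
  …
) …

Split into chained tables (stages Q1, Q2):

inner.sql
INSERT OVERWRITE TABLE inner
SELECT upper(a), b, …
FROM source
…

query.sql
SOURCE inner.sql;
INSERT OVERWRITE TABLE x
SELECT …
FROM (
  SELECT *
  FROM inner
) …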

13
Chained Tables
• Well understood
• Can view intermediate results
• I/O between stages
• Query optimisation limited
[Diagram: tables source1, source2, temp1, temp3 and result (columns X, Y), connected by queries Q1 to Q4]
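
As a rough illustration of the chained-table pattern, the sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, purely so that it runs locally. The table names (`source`, `inner_stage`, `x`), the aggregation and the sample rows are hypothetical. SQLite has no INSERT OVERWRITE, so CREATE TABLE … AS SELECT plays that role here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (a TEXT, b INTEGER);
    INSERT INTO source VALUES ('x', 1), ('y', 2), ('x', 3);
""")

# Stage Q1: materialise the inner query into an intermediate table.
conn.execute("CREATE TABLE inner_stage AS SELECT upper(a) AS a, b FROM source")

# The intermediate result can be inspected (and unit tested) on its own.
stage_rows = conn.execute("SELECT a, b FROM inner_stage ORDER BY b").fetchall()

# Stage Q2: the outer query reads the intermediate table instead of a subquery.
conn.execute(
    "CREATE TABLE x AS SELECT a, sum(b) AS total FROM inner_stage GROUP BY a"
)
result_rows = conn.execute("SELECT a, total FROM x ORDER BY a").fetchall()

print(stage_rows)
print(result_rows)
```

Each stage writes its output before the next one reads it, which is where the "I/O between stages" cost on the slide comes from.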

14
Variable substitution
Monolithic query:
SELECT …
FROM (
  FROM source
  …
) …

With variable substitution (fragments S1 and S2 assemble into Q1):

inner.sql
FROM source
…

query.sql
SET inner = <inner contents>;
SELECT …
FROM (
  ${inner}
) …

15
Variable substitution
[Diagram: tables source1 and source2 feed a single query Q1, which produces result (columns X, Y)]
S1 + S2 = Q1
• Requires assembly
• Not 'pure SQL'
16
Nested Views
Monolithic query:
SELECT …
FROM (
  FROM source
  …
) …

With a nested view (view V1 feeds query Q1):

inner.sql (V1)
CREATE VIEW inner AS
FROM source
…

query.sql (Q1)
SOURCE inner.sql;
SELECT …
FROM (
  SELECT *
  FROM inner
) …

17
Nested Views
[Diagram: tables source1 and source2 feed view V1 and query Q1, which produce result (columns X, Y)]
• Greater scope for optimisation
• Poor parameterisation handling
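
A minimal sketch of the view-based approach, again using Python's sqlite3 as a local stand-in for Hive. The view name `inner_view`, the aggregation and the sample rows are hypothetical. It checks that the modular form returns the same rows as the monolithic query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (a TEXT, b INTEGER);
    INSERT INTO source VALUES ('x', 1), ('y', 2), ('x', 3);
    -- V1: the inner query becomes a named, separately testable view.
    CREATE VIEW inner_view AS SELECT upper(a) AS a, b FROM source;
""")

monolithic = """
    SELECT a, sum(b) AS total
    FROM (SELECT upper(a) AS a, b FROM source) t
    GROUP BY a ORDER BY a
"""
# Q1: the same outer query, reading from the view instead of a subquery.
modular = "SELECT a, sum(b) AS total FROM inner_view GROUP BY a ORDER BY a"

monolithic_rows = conn.execute(monolithic).fetchall()
modular_rows = conn.execute(modular).fetchall()
print(monolithic_rows == modular_rows, modular_rows)
```

Nothing is materialised between V1 and Q1, so the engine is free to plan both together.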

18
Testing: Job counts
[Bar chart: number of file moves, map-only jobs and MapReduce jobs (axis 0 to 5) for Monolithic Query, Views, Chained Tables and Chained Tables (Temporary)]
• Queries: 3
• Nesting depth: 2
• Aggregations: 1
• Sorts: 1

19
Testing: Execution time
[Bar chart: average time taken in seconds (axis 0 to 2000) for Monolithic Query, Views, Chained Tables and Chained Tables (Temporary)]
• EMR cluster
• 100GB data input
• Multiple iterations

20
Further details
• https://medium.com/hotels-com-technology/modularising-hive-queries-dff90f85f7d6
• http://apache.org/.../Unit+Testing+Hive+SQL

22
Code coverage tools
• Java has awesome code coverage and reporting tools
– Cobertura, JaCoCo, PIT
• Highlights weaknesses in tests
• Motivates developers to write more tests
• Fast feedback
• Can we have this for SQL?

23
What is mutation testing?
• Used to design new software tests and evaluate the quality of existing software tests
• Involves modifying a program in small ways (mutants)
• Test suites are measured by the % of mutants killed
• New tests designed to kill additional mutants
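
The "% of mutants killed" measure is a simple ratio. The sketch below spells it out; the function name and the example counts are hypothetical.

```python
def mutation_score(killed: int, total: int) -> float:
    """Percentage of mutants killed by the test suite."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * killed / total


# Hypothetical run: 4 of 5 mutants made at least one test fail.
score = mutation_score(killed=4, total=5)
print(f"{score:.0f}% of mutants killed")
```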

24
Example SQL test
SOURCE UNDER TEST
SELECT lower(name), price
FROM hotels
WHERE price > 100
ORDER BY price DESC
LIMIT 5

TEST DATA (hotels)
ID  NAME      PRICE
1   AC Hotel  150
2   Princess  200
3   Motel X   60

RESULT
NAME      PRICE
princess  200
ac hotel  150

TEST SUITE
Assert size == 2
Assert result[0] == 'princess'
Assert result[1] == 'ac hotel'
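
The example can be reproduced locally. The sketch below uses Python's sqlite3 in place of Hive and HiveRunner (an assumption made only so that it runs without a Hive installation); the query, data and assertions are those on the slide.

```python
import sqlite3

QUERY = """
    SELECT lower(name), price
    FROM hotels
    WHERE price > 100
    ORDER BY price DESC
    LIMIT 5
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotels (id INTEGER, name TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO hotels VALUES (?, ?, ?)",
    [(1, "AC Hotel", 150), (2, "Princess", 200), (3, "Motel X", 60)],
)

result = conn.execute(QUERY).fetchall()

# The three assertions from the slide's test suite.
assert len(result) == 2
assert result[0][0] == "princess"
assert result[1][0] == "ac hotel"
print(result)
```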

25
Mutating SQL
SELECT lower(name), price
FROM hotels
WHERE price > 100
ORDER BY price DESC
LIMIT 5

Sequencing: the query is broken down into genes, each with candidate mutations.

Genes       Mutations
FN: lower   upper
COL: price  null, name
OP: >       <, =, <>
SORT: DESC  ASC
VAL: 5      1, 10
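
Mutant generation can be sketched as "change exactly one gene at a time". The slides do not show Mutant Swarm's actual sequencer, so the hand-written template below, with one named slot per gene, is an assumption for illustration, as is the pairing of mutations with genes.

```python
# One named slot per gene; a real sequencer would derive these by parsing SQL.
TEMPLATE = (
    "SELECT {fn}(name), {col} FROM hotels "
    "WHERE price {op} 100 ORDER BY price {sort} LIMIT {val}"
)
ORIGINAL = {"fn": "lower", "col": "price", "op": ">", "sort": "DESC", "val": "5"}
MUTATIONS = {
    "fn": ["upper"],
    "col": ["null", "name"],
    "op": ["<", "=", "<>"],
    "sort": ["ASC"],
    "val": ["1", "10"],
}


def generate_mutants(template, original, mutations):
    """Yield (gene, replacement, sql), changing exactly one gene per mutant."""
    for gene, replacements in mutations.items():
        for replacement in replacements:
            values = dict(original, **{gene: replacement})
            yield gene, replacement, template.format(**values)


mutants = list(generate_mutants(TEMPLATE, ORIGINAL, MUTATIONS))
for gene, replacement, sql in mutants:
    print(f"{gene}: {ORIGINAL[gene]} -> {replacement}: {sql}")
```

Changing a single gene per mutant keeps each test failure attributable to one specific edit.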

26
Mutant survivors/victims
Original query
SELECT lower(name), price
FROM hotels
WHERE price > 100
ORDER BY price DESC
LIMIT 5

Mutants (╳ = killed by the test, ✓ = survived)

╳
SELECT upper(name), price
FROM hotels
WHERE price > 100
ORDER BY price DESC
LIMIT 5

✓ Column projection price not fully covered by test suite.
SELECT lower(name), null
FROM hotels
WHERE price > 100
ORDER BY price DESC
LIMIT 5

╳
SELECT lower(name), price
FROM hotels
WHERE price < 100
ORDER BY price DESC
LIMIT 5

╳
SELECT lower(name), price
FROM hotels
WHERE price = 100
ORDER BY price DESC
LIMIT 5

╳
SELECT lower(name), price
FROM hotels
WHERE price > 100
ORDER BY price ASC
LIMIT 5

TEST
Assert size == 2
Assert result[0] == 'princess'
Assert result[1] == 'ac hotel'

TEST DATA (hotels)
ID  NAME      PRICE
1   AC Hotel  150
2   Princess  200
3   Motel X   60
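
The outcome for each mutant can be checked mechanically. This sketch runs the slide's five mutants against the slide's test suite, using Python's sqlite3 as a local stand-in for Hive; the template-based mutant construction and the helper names are assumptions for illustration.

```python
import sqlite3

ROWS = [(1, "AC Hotel", 150), (2, "Princess", 200), (3, "Motel X", 60)]
TEMPLATE = (
    "SELECT {fn}(name), {col} FROM hotels "
    "WHERE price {op} 100 ORDER BY price {sort} LIMIT 5"
)
ORIGINAL = {"fn": "lower", "col": "price", "op": ">", "sort": "DESC"}
# The five mutants shown on the slide, one changed gene each.
MUTANTS = [("fn", "upper"), ("col", "null"), ("op", "<"), ("op", "="), ("sort", "ASC")]


def run(sql):
    """Execute sql against a fresh in-memory copy of the test data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hotels (id INTEGER, name TEXT, price INTEGER)")
    conn.executemany("INSERT INTO hotels VALUES (?, ?, ?)", ROWS)
    return conn.execute(sql).fetchall()


def suite_passes(result):
    """The slide's test suite: size and the two names, nothing about price."""
    return (
        len(result) == 2
        and result[0][0] == "princess"
        and result[1][0] == "ac hotel"
    )


# The unmutated query must pass before any mutant is judged.
assert suite_passes(run(TEMPLATE.format(**ORIGINAL)))

outcomes = {}
for gene, replacement in MUTANTS:
    sql = TEMPLATE.format(**dict(ORIGINAL, **{gene: replacement}))
    outcomes[(gene, replacement)] = "survived" if suite_passes(run(sql)) else "killed"

for mutant, outcome in outcomes.items():
    print(mutant, outcome)
```

Only the `null` projection mutant passes the suite, because no assertion ever reads the price column.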

27
Gaps versus weaknesses
SELECT lower(name), null
FROM hotels
WHERE price > 100
ORDER BY price DESC
LIMIT 5
✓ (survivor)
• Survivors can suggest a gap in tests
• However, perhaps price was expected to be null?
• Suggests that more tests and more discriminating test data are required
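
Assuming price is meant to be populated, the gap closes once the test also pins down the projected price values. The sketch below compares the slide's name-only checks with a hypothetical stricter suite, using Python's sqlite3 as a local stand-in for Hive.

```python
import sqlite3

ROWS = [(1, "AC Hotel", 150), (2, "Princess", 200), (3, "Motel X", 60)]
TAIL = " FROM hotels WHERE price > 100 ORDER BY price DESC LIMIT 5"
ORIGINAL = "SELECT lower(name), price" + TAIL
MUTANT = "SELECT lower(name), null" + TAIL


def run(sql):
    """Execute sql against a fresh in-memory copy of the test data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hotels (id INTEGER, name TEXT, price INTEGER)")
    conn.executemany("INSERT INTO hotels VALUES (?, ?, ?)", ROWS)
    return conn.execute(sql).fetchall()


def weak_suite(result):
    # The slide's assertions: result size and names only.
    return [row[0] for row in result] == ["princess", "ac hotel"]


def strong_suite(result):
    # Hypothetical extra assertions: also check the projected price column.
    return result == [("princess", 200), ("ac hotel", 150)]


weak_on_mutant = weak_suite(run(MUTANT))          # mutant survives
strong_on_mutant = strong_suite(run(MUTANT))      # mutant is killed
strong_on_original = strong_suite(run(ORIGINAL))  # real query still passes
print(weak_on_mutant, strong_on_mutant, strong_on_original)
```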

28
Mutant Swarm: How it works
[Diagram: HiveRunner runs the user test suite against the setup SQL and the SQL under test and produces test reports; the mutant gene DB and sequencer produce mutant SQL under test, which is run the same way to produce mutation reports]
1. Run tests + report
2. Sequence SQL genes
3. Generate mutants
4. Run tests for each mutant
5. Mutation report

30
Challenges
• Execution performance
– HiveRunner is not the fastest
• Solutions?
– Run tests in parallel
– Mutate only changed code sections
– Improve HiveRunner performance
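t_MS ∝ t_HR × genes × mutations (Mutant Swarm run time grows with HiveRunner run time, the number of genes and the number of mutations)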
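
A back-of-the-envelope reading of this proportionality, and of what parallel test runs would buy, can be sketched as follows. All numbers are hypothetical, and treating "mutations" as a per-gene count with one full suite run per mutant is an assumption.

```python
import math


def estimated_runtime(t_hr_seconds, genes, mutations_per_gene, workers=1):
    """Rough estimate: one full suite run per mutant, spread across workers."""
    mutants = genes * mutations_per_gene
    return t_hr_seconds * math.ceil(mutants / workers)


# Hypothetical figures: a 30 second suite, 5 genes, 2 mutations per gene.
serial = estimated_runtime(t_hr_seconds=30, genes=5, mutations_per_gene=2)
parallel = estimated_runtime(
    t_hr_seconds=30, genes=5, mutations_per_gene=2, workers=4
)
print(serial, parallel)
```

Mutating only changed code sections shrinks the gene count, while a faster HiveRunner shrinks the per-run time, so the three proposed solutions each attack a different factor of the estimate.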

31
Summary
• Created a toolkit to facilitate:
– Increased test development, of a higher quality
– Data quality improvement by identifying errors
– Unit testing, modularisation, test coverage
• Future work: improve
– Execution performance
– Mutation types (greater coverage of SQL language)

32
github.com/HotelsDotCom/mutant-swarm
hive-runner-user@googlegroups.com
linkedin.com/in/teabot
linkedin.com/in/jay-greeeen
