Applications of Search-based
Software Testing to Trustworthy
Artificial Intelligence
Lionel Briand
SSBSE 2022 Keynote
http://www.lbriand.info
ML-Enabled Systems (MLS)
[Diagram: an ML-enabled ADAS. Labels: Sensors/Camera, Machine Learning, Decision, Controller, Actuators, Environment]
Example Automotive Applications
• Object detection, identification, classification,
localization and prediction of movement
• Sensor fusion and scene comprehension, e.g., lane
detection
• Driver monitoring
• Driver replacement
• Functional safety, security
• Powertrains, e.g., improve motor control and battery
management
Tian et al. 2018
Importance
• ML components are increasingly part of safety- or mission-
critical systems (ML-enabled systems - MLS)
• Many domains, including aerospace, automotive, health care,
…
• Many ML algorithms, supervised vs. unsupervised,
classification vs regression, etc.
• But increasing use of deep learning and reinforcement
learning
Testing Levels
• Testing is still the main mechanism through which to gain trust
• Levels: model (e.g., Deep Neural Networks, or DNNs), integration,
system
• Research largely focused on model testing
• Integration: Issues that arise when multiple models and
components are integrated
• System: Test the MLS in its target environment, in-field or
simulated
Challenges: Overview
• Behavior driven by training data and learning process
• Neither specifications nor code
• Huge model input space, especially for autonomous
systems
• Automated test oracles / verdicts
• Test suite adequacy, i.e., when is it good enough?
• Models are never perfect, but how do we decide
whether they are good enough?
Model-Level Testing and
Analysis
Test Inputs: Adversarial or
Natural?
• Adversarial inputs: Focus on robustness, e.g., noise or attacks
• Natural inputs: Focus on functional aspects, e.g., functional
safety
Adversarial example due to noise (Goodfellow et al., 2014)
Key-points Detection Testing with
Simulation in the Loop:
• DNNs used for key-points detection
in images
• Many applications, e.g., face
recognition
• Testing: Find a test suite that causes the
DNN to poorly predict as many key-points
as possible within the time budget
• Images generated by a simulator
Ground truth
Predicted
Example Application
• Drowsiness or gaze detection based on interior camera monitoring the driver
• In the drowsiness or gaze detection problem, each Key-Point (KP) may be highly
important for safety
• Each KP leads to one test objective
• For our subject DNN, we have 27 test objectives
• Goal: Cause the DNN to mispredict as many key-points as possible
• Solution: Many-objective search algorithms (based on genetic algorithms)
combined with simulator
Ul Haq et al., 2021
Overview
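[Diagram: the Input Generator (search) passes an input vector to the Simulator, which produces a test image and the actual key-point positions. The DNN returns the predicted key-point positions for the test image. The Fitness Calculator compares actual and predicted positions and returns a fitness score (error value) to the search, which outputs the most critical test inputs.]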
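The fitness calculation in this loop can be sketched as follows. This is a minimal illustration, assuming one objective per key-point and a Euclidean error divided by a normalisation constant; the function names, the normalisation, and the threshold are assumptions made for illustration, not the formula used by Ul Haq et al.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def keypoint_errors(actual: List[Point], predicted: List[Point],
                    norm: float) -> List[float]:
    """Return one error value per key-point (one search objective each).

    `norm` is an assumed normalisation constant, e.g. the image diagonal.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [math.dist(a, p) / norm for a, p in zip(actual, predicted)]


def covered_objectives(errors: List[float], threshold: float) -> List[int]:
    """Indices of key-points whose error reaches the (assumed) threshold."""
    return [i for i, e in enumerate(errors) if e >= threshold]


# Three hypothetical key-points: positions from the simulator vs. the DNN.
actual = [(10.0, 10.0), (50.0, 40.0), (80.0, 20.0)]
predicted = [(13.0, 14.0), (50.0, 40.0), (20.0, 100.0)]
errors = keypoint_errors(actual, predicted, norm=100.0)
```

Keeping one error value per key-point, rather than a single aggregate, is what lets a many-objective search target each key-point separately.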
Results
• Our approach is effective in generating test suites that cause the DNN to severely
mispredict more than 93% of all key-points on average
• Not all mispredictions can be considered failures … (e.g., shadows)
• We must know when the DNN cannot be expected to be accurate and have contingency
measures
• Some key-points are more severely mispredicted than others; detailed analysis revealed
two reasons:
• Under-representation of some key-points (hidden) in the training data
• Large variation in the shape and size of the mouth across different 3D models (more
training needed)
Interpretation
• Regression trees to predict accuracy based on simulation parameters
• Enable detailed analysis to find the root causes of high Normalized Error (NE) values, e.g., shadow
on the location of KP26 is the cause of high NE values
• Regression trees show excellent accuracy, reasonable size.
• Amenable to risk analysis, gaining useful safety insights, and contingency plans at run-time
Image Characteristics Condition | NE
M = 9 ∧ P < 18.41 | 0.04
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ Y < 17.06 | 0.26
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ 17.06 ≤ Y < 19 | 0.71
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ Y ≥ 19 | 0.36
Representative rules derived from the decision tree for KP26
(M: Model-ID, P: Pitch, R: Roll, Y: Yaw, NE: Normalized Error)
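The four representative rules can be read as a small lookup function. The sketch below assumes the four variables in the conditions are, in legend order, M (Model-ID), P (Pitch), R (Roll) and Y (Yaw); the function name is hypothetical, and because the slide lists only a subset of the tree, inputs outside these rules return `None`.

```python
from typing import Optional


def predicted_ne_kp26(model_id: int, pitch: float, roll: float,
                      yaw: float) -> Optional[float]:
    """Look up the NE given by the four representative rules for KP26.

    Returns None when no listed rule applies (the slide shows only a
    representative subset of the regression tree).
    """
    if model_id != 9:
        return None
    if pitch < 18.41:
        return 0.04
    if roll < -22.31:
        if yaw < 17.06:
            return 0.26
        if yaw < 19:
            return 0.71
        return 0.36
    return None
```

Rules in this form are what makes the tree usable for run-time contingency checks: the simulation parameters of a situation map directly to an expected error level.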
(A) A test image satisfying
the first condition
(B) A test image satisfying
the third condition
NE = 0.013 NE = 0.89
Real Images
• Test selection requires a different approach than with a
simulator
• Labeling costs are significant
DNN Structural Coverage Criteria
Chen et al. 2020
• Neuron coverage
• How neurons are
activated
• Many variants
Limitations
• Require access to the DNN internals and sometimes the
training set. Not realistic in many practical settings.
• There is a weak correlation between coverage and
misclassification or poor predictions for natural inputs
• Also many questions regarding the studies focused on
adversarial inputs …
Diversity-driven Test Selection
• Test inputs are real images
• Black-box approach based on measuring the diversity of
test inputs.
• Scalable selection
• Assumption: The more diverse, the more likely test inputs
are to reveal faults
Aghababaeyan et al., 2022
Extracting Image Features
• VGG16 is a convolutional neural network trained on a
subset of the ImageNet dataset, a collection of over 14
million images belonging to 22,000 categories.
Features:
- Activation values
after last conv layer
- Characterize semantic
elements such as shapes
and colors
Geometric Diversity (GD)
• Given a dataset X and its corresponding feature vectors V,
the geometric diversity of a subset S ⊆ X is defined as the
hyper-volume of the parallelepiped spanned by the rows of
Vs, i.e., feature vectors of items in S, where the larger the
volume, the more diverse is the feature space of S
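A minimal sketch of this definition, assuming the feature vectors are plain lists of floats: the volume of the parallelepiped spanned by the rows of Vs equals the square root of the Gram determinant det(Vs Vsᵀ). Some formulations use the determinant itself (the squared volume), which ranks subsets in the same order. The function names are illustrative only.

```python
import math
from typing import List


def gram_matrix(rows: List[List[float]]) -> List[List[float]]:
    """Return V V^T for a matrix V given as a list of row vectors."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def determinant(matrix: List[List[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # singular: the rows are linearly dependent
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def geometric_diversity(feature_rows: List[List[float]]) -> float:
    """Volume of the parallelepiped spanned by the rows: sqrt(det(V V^T))."""
    return math.sqrt(max(determinant(gram_matrix(feature_rows)), 0.0))
```

Two identical feature vectors span no volume, so a subset containing near-duplicates scores close to zero, which is the behaviour a diversity metric needs.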
Correlation with Faults
Correlation between geometric diversity and faults
Estimating Faults in DNNs
#Clusters ~ #Faults
Pareto Front Optimization
• Black-box test selection to detect as many diverse faults
as possible for a given test budget
• Label selected images
• Search-based Approach: Multi-Objective Genetic Search
(NSGA-II)
• Two objectives (Max):
• diversity
• uncertainty (e.g., Gini)
[Figure: Pareto front, GD score vs. Gini score]
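The two objectives and the notion of a Pareto front can be sketched as below. This is not NSGA-II itself, only the non-dominated filter that underlies its ranking; the Gini score is computed as the Gini impurity of the DNN's output probabilities, and the candidate scores are made-up values.

```python
from typing import List, Sequence, Tuple


def gini_score(probabilities: Sequence[float]) -> float:
    """Gini impurity of a probability vector: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probabilities)


def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if a is at least as good as b on both objectives and better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(scores: List[Tuple[float, float]]) -> List[int]:
    """Indices of candidates not dominated by any other (both maximised)."""
    return [i for i, s in enumerate(scores)
            if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)]


# Hypothetical (diversity, uncertainty) scores for four candidate subsets.
candidates = [(0.9, 0.1), (0.5, 0.5), (0.4, 0.4), (0.1, 0.8)]
front = pareto_front(candidates)
```

Here the third candidate is dropped because the second is better on both objectives; the remaining three are trade-offs that a test engineer would choose among.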
Explanation and Analysis with
Heatmaps
• Heatmaps capture the extent to which the pixels of an image influenced a specific
result
• Limitations:
• Heatmaps must be manually inspected to determine the reason for a misclassification
• Underrepresented (but dangerous) failure causes might go unnoticed
• Model debugging (i.e., improvement) is not automated
A heatmap can show that long hair is what
caused a female doctor to be classified as a
nurse
HUDD
• How can we support risk analysis with real images?
• Rely on heatmaps to generate clusters of images that lead to failures because of the
same root cause (common characteristics of the images)
• Enable engineers to ensure safety by introducing countermeasures
Step 1. Heatmap-based clustering of the error-inducing test set images into root cause clusters (C1, C2, C3)
Step 2. Inspection of a subset of cluster elements
Fahmy et al. 2021
• Classification
• Gaze Detection
[Figure: gaze direction classes arranged by angle, from 0 to 337.5 in steps of 22.5: Top Center, Top Right, Middle Right, Bottom Right, Bottom Center, Bottom Left, Middle Left, Top Left]
Example Application
Clusters identify different problems
• Cluster 1 (angle ~157.5): borderline cases
• Cluster 2 (eye middle center): incomplete set of classes
• Cluster 3 (near closed eyes): incomplete training set
Combining Simulation and Actual
Images in Testing
[Diagram: the DNN is first trained on a training set of simulator images and tested on a test set of simulator images. The trained DNN is then fine-tuned on a training set of real-world images, and the fine-tuned DNN is tested on a test set of real-world images, which yields the real-world error-inducing images.]
RQ1: Is it possible to automatically characterize unsafe scenarios?
RQ2: Is it possible to improve the accuracy of the DNN by leveraging the identified
unsafe scenarios?
Simulator-based Explanations for
DNN Errors (SEDE)
[Diagram: HUDD groups the real-world error-inducing images into root cause clusters (RCCs). An evolutionary algorithm searches the simulator's configuration parameters to generate simulator images: unsafe images in each RCC and safe images similar to the unsafe ones, both with their simulator parameters. A rules extraction algorithm (PART) derives IF-THEN rules from them, which form the explanation expression.]
Fahmy et al., 2022
Example PART Rule
• Example simulation
parameters: pitch,
yaw, and roll
(orientation of the
head)
• Rules expressed in
terms of simulation
parameters
Case 1
Reinforcement Learning Testing
and Safety
Cartpole Example
• Environment (CartPole from OpenAI GYM): a pole is attached to a cart, and
the goal is to move the cart right and left to keep the pole from falling.
• Actions: Left and right
• Reward: For every timestep that the pole remains upright (+1 reward)
• Termination
• Pole angle is greater than 12 degrees
• Cart distance from the center point is greater than 2.4
• Episode length is more than 200
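The three termination conditions can be written as a single predicate. The sketch below is independent of the Gym API: the state class and constant names are made up for illustration, and it assumes the angle and distance limits apply symmetrically on both sides.

```python
from dataclasses import dataclass


@dataclass
class CartPoleState:
    cart_position: float    # distance from the centre point (signed)
    pole_angle_deg: float   # pole angle from vertical, in degrees (signed)


MAX_ANGLE_DEG = 12.0
MAX_CART_DISTANCE = 2.4
MAX_EPISODE_LENGTH = 200


def is_terminal(state: CartPoleState, timestep: int) -> bool:
    """Check the three termination conditions listed above."""
    return (abs(state.pole_angle_deg) > MAX_ANGLE_DEG
            or abs(state.cart_position) > MAX_CART_DISTANCE
            or timestep > MAX_EPISODE_LENGTH)
```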
Reward and Functional Faults
• Episodes: State-action sequences
• Testing goal: Find episodes to detect and explain reward and functional faults
• Reward fault: the accumulated reward is less than a threshold, e.g., the pole
falls down in the first 70 timesteps
• Functional fault: the pole is stable but, regardless of the accumulated reward,
the cart moves beyond the 2.4 distance limit, and the episode terminates
Functional fault Reward fault
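The two fault definitions can be turned into simple checks on a recorded episode. This sketch assumes an episode is a list of (cart position, pole angle in degrees) pairs with a reward of +1 per timestep, and uses the example threshold of 70 from the slide; the names and the episode representation are illustrative, not STARLA's.

```python
from typing import List, Tuple

# Each step is (cart_position, pole_angle_deg); one reward point per step.
Episode = List[Tuple[float, float]]

REWARD_THRESHOLD = 70      # example threshold taken from the slide
MAX_CART_DISTANCE = 2.4
MAX_ANGLE_DEG = 12.0


def has_reward_fault(episode: Episode) -> bool:
    """Accumulated reward (+1 per timestep) is below the threshold."""
    return len(episode) < REWARD_THRESHOLD


def has_functional_fault(episode: Episode) -> bool:
    """Final state: pole still upright, but the cart left the allowed range."""
    if not episode:
        return False
    position, angle = episode[-1]
    return abs(position) > MAX_CART_DISTANCE and abs(angle) <= MAX_ANGLE_DEG
```

The two checks are independent: a short episode in which the cart drifts out of range exhibits both faults.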
STARLA: Objectives
• Detect diverse faulty episodes as quickly as possible
• Accurately characterize faulty episodes
• STARLA: search- and ML-based
Zolfagharian et al. 2022
System-Level Testing and
Analysis
Testing via Physics-based
Simulation
ADAS
(SUT)
Simulator (Matlab/Simulink)
Model
(Matlab/Simulink)
▪ Physical plant (vehicle / sensors / actuators)
▪ Other cars
▪ Pedestrians
▪ Environment (weather / roads / traffic signs)
Test input
Test output
time-stamped output
Example Violation
• Violation: Ego Vehicle collides with other vehicle
• Vehicle-in-front slows down suddenly and then moves to the right
• Possible reason: Agent was not trained with such episodes
Car View Top View
Reinforcement Learning for ADAS
Testing (MORLOT)
• Goal: make sequential changes to the environment, in a realistic
manner, with the objective of triggering safety violations
• Train an RL agent to do so
• Challenge: Test many requirements simultaneously
Ul Haq et al. 2022
Example
• Consider the vehicle-in-front as an RL
agent that aims to cause the ego vehicle
to violate safety requirements by
performing a sequence of actions
• State: collision
• Action: turn left/right,
increase/decrease speed of vehicle in
front
• Reward: distance between vehicles
Ego Vehicle Vehicle in-front
Possible actions
Reinforcement Learning for ADAS
Testing (MORLOT)
Action (Behavior of environment actors, e.g., vehicle in front)
Reward (e.g., based on distance from vehicle in front)
State/Next State (Locations and conditions about the environment,
e.g., collision)
RL Agent
RL-Environment
Simulator
(CARLA)
Control
(Image, location,…)
ADAS
Ul Haq et al. 2022
MORLOT: Overview
MORLOT: Many-Objective Reinforcement Learning for Online Testing
[Diagram: a tabular reinforcement learning (Q-learning) loop over the reinforcement learning environment: reset environment, observe state, choose action (randomly or using many-objective search), take action and receive rewards, update Q-tables and archive. The many-objective search selects the action and gets fitness values, using the objectives (i.e., requirements) and the Q-tables.]
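The tabular part of this loop can be sketched with one Q-table per objective and the standard Q-learning update. The action names, requirement names, and in particular the action-selection rule (take the best value over the objectives not yet covered) are simplifying assumptions for illustration; they are not MORLOT's actual many-objective search.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List

ACTIONS: List[str] = ["left", "right", "faster", "slower"]  # hypothetical


def make_q_table() -> Dict:
    """Q-table mapping state -> action -> value, defaulting to 0.0."""
    return defaultdict(lambda: {a: 0.0 for a in ACTIONS})


def update(q: Dict, state: Hashable, action: str, reward: float,
           next_state: Hashable, alpha: float = 0.1,
           gamma: float = 0.9) -> None:
    """Standard tabular Q-learning update for a single objective."""
    best_next = max(q[next_state].values())
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])


def choose_action(q_tables: Dict[str, Dict], uncovered: List[str],
                  state: Hashable, epsilon: float,
                  rng: random.Random) -> str:
    """Explore randomly, else follow the best value over uncovered objectives."""
    if not uncovered or rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS,
               key=lambda a: max(q_tables[o][state][a] for o in uncovered))


# One Q-table per (hypothetical) safety requirement.
q_tables = {"no_collision": make_q_table(), "stay_in_lane": make_q_table()}
update(q_tables["no_collision"], "s0", "slower", reward=1.0, next_state="s1")
action = choose_action(q_tables, ["no_collision", "stay_in_lane"],
                       "s0", 0.0, random.Random(0))
```

Keeping a separate table per requirement is what allows the agent to pursue several violations at once instead of collapsing them into a single reward.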
System Testing Challenges
• Large input space
• Many safety requirements
• Computationally intensive simulation
To address these challenges, we propose SAMOTA (Surrogate-Assisted
Many-Objective Testing Approach), which leverages many-objective
search and Surrogate Models (SMs)
Ul Haq et al. 2022
Surrogate Models
• Surrogate model: Model that mimics the simulator, to a
certain extent, while being much less computationally
expensive
• Research: Combine search with surrogate modeling to
decrease the computational cost of testing (Ul Haq et al. 2022)
Surrogate model types: Polynomial Regression (PR), Radial Basis Function (RBF), Kriging (KR)
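The idea of replacing most simulator runs with a cheap model can be sketched as follows. The example uses the simplest possible surrogate, a degree-1 polynomial regression fitted by least squares, and a made-up stand-in for the simulator; the function names, the single input dimension, and the pre-screening step are assumptions for illustration rather than SAMOTA's procedure.

```python
from typing import Callable, List, Tuple


def fit_linear(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept (degree-1 polynomial)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def expensive_simulation(speed: float) -> float:
    """Stand-in for a costly simulator run; returns distance to a violation."""
    return 50.0 - 2.0 * speed


def prescreen(candidates: List[float], surrogate: Callable[[float], float],
              budget: int) -> List[float]:
    """Keep the `budget` candidates the surrogate predicts to be most critical
    (smallest predicted distance to a violation)."""
    return sorted(candidates, key=surrogate)[:budget]


# Fit the surrogate on a few simulated points, then rank new candidates.
train_x = [0.0, 5.0, 10.0]
train_y = [expensive_simulation(x) for x in train_x]
slope, intercept = fit_linear(train_x, train_y)
selected = prescreen([2.0, 20.0, 8.0, 15.0],
                     lambda x: slope * x + intercept, budget=2)
```

Only the selected candidates would then be run on the real simulator, which is where the computational saving comes from.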
SAMOTA
Steps:
1. Initialization
2. Global search
3. Local search
4. Global search
5. Local search
6. …
[Diagram labels by phase]
Initialisation: Execute Simulator; Database
Global Search: Global SMs; Many-Objective Search Algorithm; Most Critical Test Cases; Most Uncertain Test Cases; Execute Simulator
Local Search: Clustering Top Points; Local SM per Cluster; Single-Objective Search Algorithm; Most Critical Test Cases; Execute Simulator
Output: Minimal Test Suite
Conclusions
Automated Testing
• Testing remains the key verification mechanism in practice to
demonstrate trustworthiness and understand limitations (e.g., safety)
• Different levels of testing: model, integration, system
• Automation is key at all levels and requires different strategies, e.g.,
sequences of actions and states versus single inputs (e.g., images)
• Key recurring strategy: combine evolutionary computing and machine
learning
• Concrete industrial problems could not have been addressed otherwise
• Scalability remains the main challenge
Why Search?
• Models are hard to analyze and therefore not amenable to
automated reasoning (though there is work)
• Models exhibit complex, non-linear behaviors
• AI-enabled systems often operate in open contexts described
by many dimensions (AI for autonomy)
• Testing AI-enabled systems usually involves interactions with
complex simulators
• Search is therefore a natural approach in such contexts
References
Selected References
• Ul Haq et al., "Automatic Test Suite Generation for Key-points Detection DNNs Using Many-Objective Search", ACM International Symposium on Software Testing and Analysis (ISSTA), 2021
• Ul Haq et al., "Efficient Online Testing for DNN-Enabled Systems using Surrogate-Assisted and Many-Objective Optimization", IEEE/ACM ICSE, 2022
• Ul Haq et al., "Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled Systems", ArXiv report, https://arxiv.org/abs/2210.15432
• Fahmy et al., "Supporting DNN Safety Analysis and Retraining through Heatmap-based Unsupervised Learning", IEEE Transactions on Reliability, Special Section on Quality Assurance of Machine Learning Systems, 2021
• Fahmy et al., "Simulator-based explanation and debugging of hazard-triggering events in DNN-based safety-critical systems", ArXiv report, https://arxiv.org/pdf/2204.00480.pdf, ACM TOSEM, 2022
• Attaoui et al., "Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction and Clustering", ArXiv report, https://arxiv.org/pdf/2201.05077v1.pdf, ACM TOSEM, 2022
• Aghababaeyan et al., "Black-Box Testing of Deep Neural Networks through Test Case Diversity", https://arxiv.org/pdf/2112.12591.pdf
• Zolfagharian et al., "Search-Based Testing Approach for Deep Reinforcement Learning Agents", https://arxiv.org/pdf/2206.07813.pdf
Selected ML Testing References
• Goodfellow et al., "Explaining and harnessing adversarial examples", arXiv preprint arXiv:1412.6572, 2014
• Zhang et al., "DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems", 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE), 2018
• Tian et al., "DeepTest: Automated testing of deep-neural-network-driven autonomous cars", Proceedings of the 40th International Conference on Software Engineering, 2018
• Li et al., "Structural Coverage Criteria for Neural Networks Could Be Misleading", IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (NIER)
• Kim et al., "Guiding deep learning system testing using surprise adequacy", IEEE/ACM 41st International Conference on Software Engineering (ICSE), 2019
• Ma et al., "DeepMutation: Mutation testing of deep learning systems", IEEE 29th International Symposium on Software Reliability Engineering (ISSRE), 2018
• Zhang et al., "Machine learning testing: Survey, landscapes and horizons", IEEE Transactions on Software Engineering, 2020
• Riccio et al., "Testing machine learning based systems: a systematic mapping", Empirical Software Engineering 25, no. 6, 2020
• Gerasimou et al., "Importance-Driven Deep Learning System Testing", IEEE/ACM 42nd International Conference on Software Engineering, 2020
What Goes Wrong with Language Definitions and How to Improve the Situation
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security Program
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 

Applications of Search-based Software Testing to Trustworthy Artificial Intelligence

  • 1. Applications of Search-based Software Testing to Trustworthy Artificial Intelligence Lionel Briand SSBSE 2022 Keynote http://www.lbriand.info
  • 2. ML-Enabled Systems (MLS) [Diagram labels: Sensors, Controller, Actuators, Decision, Sensors/Camera, Environment, ADAS, Machine Learning]
  • 3. Example Automotive Applications • Object detection, identification, classification, localization and prediction of movement • Sensor fusion and scene comprehension, e.g., lane detection • Driver monitoring • Driver replacement • Functional safety, security • Powertrains, e.g., improve motor control and battery management [Tian et al. 2018]
  • 4. Importance • ML components are increasingly part of safety- or mission-critical systems (ML-enabled systems, MLS) • Many domains, including aerospace, automotive, health care, … • Many ML algorithms: supervised vs. unsupervised, classification vs. regression, etc. • But increasing use of deep learning and reinforcement learning
  • 5. Testing Levels • Testing is still the main mechanism through which to gain trust • Levels: model (e.g., Deep Neural Networks or DNNs), integration, system • Research has largely focused on model testing • Integration: Issues that arise when multiple models and components are integrated • System: Test the MLS in its target environment, in-field or simulated
  • 6. Challenges: Overview • Behavior driven by training data and learning process • Neither specifications nor code • Huge model input space, especially for autonomous systems • Automated test oracles / verdicts • Test suite adequacy, i.e., when is it good enough? • Models are never perfect, but how do we decide whether they are good enough?
  • 8. Test Inputs: Adversarial or Natural? • Adversarial inputs: Focus on robustness, e.g., noise or attacks • Natural inputs: Focus on functional aspects, e.g., functional safety [Figure: adversarial example due to noise (Goodfellow et al., 2014)]
  • 9. Key-points Detection Testing with Simulation in the Loop • DNNs used for key-points detection in images • Many applications, e.g., face recognition • Testing: Find a test suite that causes the DNN to poorly predict as many key-points as possible within a time budget • Images generated by a simulator [Figure: ground truth vs. predicted key-points]
  • 10. Example Application • Drowsiness or gaze detection based on an interior camera monitoring the driver • In the drowsiness or gaze detection problem, each Key-Point (KP) may be highly important for safety • Each KP leads to one test objective • For our subject DNN, we have 27 test objectives • Goal: Cause the DNN to mispredict as many key-points as possible • Solution: Many-objective search algorithms (based on genetic algorithms) combined with a simulator [Ul Haq et al., 2021]
  • 11. Overview [Diagram labels: Input Generator (search), Input (vector), Simulator, Test Image, DNN, Actual Key-points Positions, Predicted Key-points Positions, Fitness Calculator, Fitness Score (Error Value), Most Critical Test Inputs]
  • 12. Results • Our approach is effective in generating test suites that cause the DNN to severely mispredict more than 93% of all key-points on average • Not all mispredictions can be considered failures … (e.g., shadows) • We must know when the DNN cannot be expected to be accurate and have contingency measures • Some key-points are more severely mispredicted than others; detailed analysis revealed two reasons: (1) under-representation of some key-points (hidden) in the training data; (2) large variation in the shape and size of the mouth across different 3D models (more training needed)
  • 13. Interpretation • Regression trees to predict accuracy based on simulation parameters • Enable detailed analysis to find the root causes of high Normalized Error (NE) values, e.g., a shadow on the location of KP26 is the cause of high NE values • Regression trees show excellent accuracy and reasonable size • Amenable to risk analysis, gaining useful safety insights, and contingency plans at run-time • Representative rules derived from the decision tree for KP26 (M: Model-ID, P: Pitch, R: Roll, Y: Yaw, NE: Normalized Error), given as image characteristics condition and resulting NE: (1) M = 9 ∧ P < 18.41: NE = 0.04; (2) M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ Y < 17.06: NE = 0.26; (3) M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ 17.06 ≤ Y < 19: NE = 0.71; (4) M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ Y ≥ 19: NE = 0.36 [Figures: (A) a test image satisfying the first condition, NE = 0.013; (B) a test image satisfying the third condition, NE = 0.89]
  • 14. Real Images • Test selection requires a different approach than with a simulator • Labeling costs are significant
  • 15. DNN Structural Coverage Criteria • Neuron coverage • How neurons are activated • Many variants [Chen et al. 2020]
  • 16. Limitations • Require access to the DNN internals and sometimes the training set, which is not realistic in many practical settings • There is a weak correlation between coverage and misclassification or poor predictions for natural inputs • There are also many questions regarding the studies focused on adversarial inputs …
  • 17. Diversity-driven Test Selection • Test inputs are real images • Black-box approach based on measuring the diversity of test inputs • Scalable selection • Assumption: The more diverse the test inputs, the more likely they are to reveal faults [Aghababaeyan et al., 2022]
  • 18. Extracting Image Features • VGG16 is a convolutional neural network trained on a subset of the ImageNet dataset, a collection of over 14 million images belonging to 22,000 categories • Features: activation values after the last convolutional layer, which characterize semantic elements such as shapes and colors
  • 19. Geometric Diversity (GD) • Given a dataset X and its corresponding feature vectors V, the geometric diversity of a subset S ⊆ X is defined as the hyper-volume of the parallelepiped spanned by the rows of V_S, i.e., the feature vectors of the items in S. The larger the volume, the more diverse the feature space of S.
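The definition on this slide can be illustrated with a minimal sketch. It relies on a standard linear-algebra fact: the squared volume of the parallelepiped spanned by the rows of V_S equals the determinant of the Gram matrix V_S V_Sᵀ. The function name `geometric_diversity`, the use of a log-determinant for numerical stability, and the toy feature vectors are assumptions for illustration, not taken from the talk.

```python
import numpy as np


def geometric_diversity(features: np.ndarray) -> float:
    """Return log det(V_S V_S^T), i.e. the log of the squared hyper-volume
    spanned by the rows of `features` (one feature vector per selected input).

    Assumes at most as many rows as feature dimensions; otherwise the
    volume is zero and the result is -inf.
    """
    gram = features @ features.T
    sign, logdet = np.linalg.slogdet(gram)
    # A non-positive sign means the rows are linearly dependent (zero volume).
    return float(logdet) if sign > 0 else float("-inf")


# Toy feature vectors (hypothetical): two orthogonal rows vs. two nearly parallel rows.
spread = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
clustered = np.array([[1.0, 0.0, 0.0], [1.0, 0.1, 0.0]])

gd_spread = geometric_diversity(spread)
gd_clustered = geometric_diversity(clustered)
print(gd_spread, gd_clustered)
```

The orthogonal pair spans a larger volume than the nearly parallel pair, so it receives the higher score, which matches the intuition that more diverse feature vectors yield a larger GD.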
  • 20. Correlation with Faults [Figure: correlation between geometric diversity and faults]
  • 21. Estimating Faults in DNNs • #Clusters ~ #Faults
  • 22. Pareto Front Optimization • Black-box test selection to detect as many diverse faults as possible for a given test budget • Label selected images • Search-based approach: multi-objective genetic search (NSGA-II) • Two objectives (both maximized): diversity (GD score) and uncertainty (e.g., Gini score)
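The two maximized objectives can be sketched as follows. This is not NSGA-II itself; it only shows the Gini uncertainty score of a prediction (one minus the sum of squared class probabilities) and the Pareto-dominance filter that underlies a Pareto front. The function names and the candidate scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def gini_score(probs: Sequence[float]) -> float:
    """Gini uncertainty of a predicted class distribution: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)


def pareto_front(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Keep the points not dominated by any other point (both objectives maximized)."""

    def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
        # a dominates b if it is at least as good on both objectives
        # and strictly better on at least one.
        return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (diversity, uncertainty) scores of five candidate test subsets.
candidates = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.1, 0.1)]
front = pareto_front(candidates)
print(front)
print(gini_score([0.5, 0.5]))
```

A confident prediction such as `[1.0, 0.0]` has a Gini score of 0, while a maximally uncertain two-class prediction scores 0.5. Candidates that are worse on both objectives than some other candidate drop out of the front.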
  • 23. Explanation and Analysis with Heatmaps • Heatmaps capture the extent to which the pixels of an image impacted a specific result • Example: a heatmap can show that long hair is what caused a female doctor to be classified as a nurse • Limitations: heatmaps must be manually inspected to determine the reason for a misclassification; underrepresented (but dangerous) failure causes might go unnoticed; model debugging (i.e., improvement) is not automated
  • 24. HUDD • How to help perform risk analysis with real images? • Rely on heatmaps to generate clusters of images that lead to failures because of the same root cause (common characteristics of the images) • Enable engineers to ensure safety by introducing countermeasures • Step 1: Heatmap-based clustering of error-inducing test set images into root cause clusters (C1, C2, C3) • Step 2: Inspection of a subset of cluster elements [Fahmy et al. 2021]
  • 25. Example Application • Classification • Gaze detection [Figure: gaze angle ticks from 0 to 337.5 in steps of 22.5, with class labels Top Left, Top Center, Top Right, Middle Left, Middle Right, Bottom Left, Bottom Center, Bottom Right]
  • 26. Clusters identify different problems • Cluster 1 (angle ~157.5): borderline cases • Cluster 2 (eye middle center): incomplete set of classes • Cluster 3 (near closed eyes): incomplete training set
  • 27. Combining Simulation and Actual Images in Testing • RQ1: Is it possible to automatically characterize unsafe scenarios? • RQ2: Is it possible to improve the accuracy of the DNN by leveraging the identified unsafe scenarios? [Diagram: DNN training on a training set of simulator images; DNN testing on a test set of simulator images; DNN training (fine-tuning) of the trained DNN on a training set of real-world images; DNN testing of the fine-tuned DNN on a test set of real-world images, yielding real-world error-inducing images]
  • 28. Simulator-based Explanations for DNN Errors (SEDE) [Diagram labels: Real-world Error Inducing Images, HUDD, RCCs, Evolutionary Algorithm, Configuration Parameters, Simulator, Simulator Images, Unsafe images in the RCC + simulator parameters, Safe images similar to unsafe + simulator parameters, Rules Extraction Algorithm (PART), IF-THEN rules, Explanation expression] [Fahmy et al., 2022]
  • 29. Example PART Rule • Example simulation parameters: pitch, yaw, and roll (orientation of the head) • Rules expressed in terms of simulation parameters [Figure: Case 1]
  • 31. CartPole Example • Environment (CartPole from OpenAI Gym): a pole is attached to a cart, and the goal is to move the cart right and left to keep the pole from falling • Actions: Left and right • Reward: +1 for every timestep that the pole remains upright • Termination: the pole angle is greater than 12 degrees, the cart distance from the center point is greater than 2.4, or the episode length is more than 200
  • 32. Reward and Functional Faults • Episodes: State-action sequences • Testing goal: Find episodes to detect and explain reward and functional faults • Reward fault: the accumulated reward is less than a threshold, e.g., the pole falls down in the first 70 timesteps • Functional fault: the pole is stable but, regardless of the accumulated reward, the cart moves beyond the 2.4 distance limit and the episode terminates [Figures: functional fault, reward fault]
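The two fault definitions can be expressed as a small check over a recorded episode. This is a minimal sketch under stated assumptions: an episode is reduced to a list of (cart position, pole angle in degrees) pairs, the accumulated reward is taken to be the number of recorded timesteps (+1 per step), the reward threshold of 70 comes from the example on the slide, and "the pole is stable" is read as the final pole angle staying within the 12-degree limit. The name `classify_episode` is hypothetical.

```python
from typing import Dict, List, Tuple

POSITION_LIMIT = 2.4      # cart distance limit from the termination conditions
ANGLE_LIMIT_DEG = 12.0    # pole angle limit from the termination conditions
REWARD_THRESHOLD = 70     # example threshold from the slide (an assumption as a default)


def classify_episode(states: List[Tuple[float, float]]) -> Dict[str, bool]:
    """Flag reward and functional faults for one episode.

    `states` holds (cart_position, pole_angle_degrees) for each timestep.
    """
    if not states:
        raise ValueError("episode must contain at least one state")
    accumulated_reward = len(states)  # assumed: +1 reward per recorded timestep
    last_position, last_angle = states[-1]
    return {
        "reward_fault": accumulated_reward < REWARD_THRESHOLD,
        # Cart leaves the track while the pole is still within its angle limit.
        "functional_fault": abs(last_position) > POSITION_LIMIT
        and abs(last_angle) <= ANGLE_LIMIT_DEG,
    }


# Hypothetical episodes: the pole falls at step 50; the cart drifts out at step 100.
short_fall = [(0.0, 1.0)] * 49 + [(0.1, 13.0)]
drift_out = [(0.0, 0.5)] * 99 + [(2.5, 0.5)]
print(classify_episode(short_fall))
print(classify_episode(drift_out))
```

The first episode is flagged as a reward fault only, the second as a functional fault only, mirroring the distinction drawn on the slide.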
  • 33. STARLA: Objectives • Detect diverse faulty episodes as quickly as possible • Accurately characterize faulty episodes • STARLA: search- and ML-based [Zolfagharian et al. 2022]
  • 35. Testing via Physics-based Simulation • ADAS (SUT) • Simulator (Matlab/Simulink) • Model (Matlab/Simulink): physical plant (vehicle / sensors / actuators), other cars, pedestrians, environment (weather / roads / traffic signs) • Test input • Test output: time-stamped output
  • 36. Example Violation • Violation: The ego vehicle collides with another vehicle • The vehicle in front slows down suddenly and then moves to the right • Possible reason: The agent was not trained with such episodes [Figures: car view, top view]
  • 37. Reinforcement Learning for ADAS Testing (MORLOT) • Goal: Make sequential changes to the environment, in a realistic manner, with the objective of triggering safety violations • Train an RL agent to do so • Challenge: Test many requirements simultaneously [Ul Haq et al. 2022]
  • 38. Example • Consider the vehicle in front as an RL agent that aims to cause the ego vehicle to violate safety requirements by performing a sequence of actions • State: collision • Action: turn left/right, increase/decrease speed of the vehicle in front • Reward: distance between vehicles [Figure: ego vehicle, vehicle in front, possible actions]
  • 39. Reinforcement Learning for ADAS Testing (MORLOT) [Diagram labels: RL Agent; Action (behavior of environment actors, e.g., vehicle in front); Reward (e.g., based on distance from vehicle in front); State/Next State (locations and conditions about the environment, e.g., collision); RL-Environment: Simulator (CARLA), ADAS, Control, (Image, location, …)] [Ul Haq et al. 2022]
  • 40. MORLOT: Overview • Many-Objective Reinforcement Learning for Online Testing • Inputs: objectives (i.e., requirements), Q-tables • Loop: observe state; choose action randomly or using many-objective search; take action and receive rewards; update Q-tables and archive; reset environment • Many-objective search: select action, get fitness value • Reinforcement learning: tabular Q-learning, interacting with the environment
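The "update Q-tables" step can be sketched with the standard tabular Q-learning update, applied once per objective. This is a minimal illustration, not the MORLOT implementation: the requirement names, state labels, rewards, learning rate, and discount factor are all hypothetical, and the action set is taken from the earlier example slide.

```python
from collections import defaultdict
from typing import Dict, Hashable

# Actions of the vehicle in front, as listed on the example slide.
ACTIONS = ["turn_left", "turn_right", "increase_speed", "decrease_speed"]

QTable = Dict[Hashable, Dict[str, float]]


def make_q_table() -> QTable:
    """Create a Q-table whose unseen states start with zero value for every action."""
    return defaultdict(lambda: {action: 0.0 for action in ACTIONS})


def q_update(q: QTable, state: Hashable, action: str, reward: float,
             next_state: Hashable, alpha: float = 0.1, gamma: float = 0.9) -> None:
    """Standard Q-learning update: Q(s,a) += alpha * (r + gamma * max Q(s',.) - Q(s,a))."""
    best_next = max(q[next_state].values())
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])


# One Q-table per objective (requirement); the names are hypothetical.
q_tables = {"no_collision": make_q_table(), "stay_in_lane": make_q_table()}
rewards = {"no_collision": 1.0, "stay_in_lane": 0.0}

# A single transition updates every objective's table with its own reward.
for requirement, table in q_tables.items():
    q_update(table, "far", "decrease_speed", rewards[requirement], "near")

print(q_tables["no_collision"]["far"]["decrease_speed"])
```

Keeping a separate table per requirement lets each objective accumulate its own action values from the same interaction with the environment.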
  • 41. System Testing Challenges • Large input space • Many safety requirements • Computationally intensive simulation • To address these challenges, we propose SAMOTA (Surrogate-Assisted Many-Objective Testing Approach), which leverages many-objective search and Surrogate Models (SMs) [Ul Haq et al. 2022]
  • 42. Surrogate Models • Surrogate model: Model that mimics the simulator, to a certain extent, while being much less computationally expensive • Research: Combine search with surrogate modeling to decrease the computational cost of testing (Ul Haq et al. 2022) • Examples: Polynomial Regression (PR), Radial Basis Function (RBF), Kriging (KR)
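The idea of a surrogate can be illustrated with polynomial regression, one of the model types named on this slide. In this minimal sketch, `expensive_simulator` is a hypothetical stand-in for a costly simulation run, the single input dimension and the quadratic degree are assumptions, and lower fitness is assumed to mean closer to a violation.

```python
import numpy as np


def expensive_simulator(x: float) -> float:
    """Hypothetical stand-in for a costly simulation returning a fitness value."""
    return (x - 0.3) ** 2


# Run the real simulator on a handful of inputs only.
train_x = np.linspace(0.0, 1.0, 6)
train_y = np.array([expensive_simulator(x) for x in train_x])

# Fit a polynomial regression surrogate to those few observations.
surrogate = np.poly1d(np.polyfit(train_x, train_y, deg=2))

# Screen many candidates cheaply with the surrogate instead of the simulator.
candidates = np.linspace(0.0, 1.0, 101)
best = float(candidates[np.argmin(surrogate(candidates))])
print(best)
```

Only six simulator runs are needed to screen 101 candidates; the most promising candidate according to the surrogate would then be confirmed with a real simulation.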
  • 43. SAMOTA • Steps: 1. Initialization 2. Global search 3. Local search 4. Global search 5. Local search 6. … [Diagram labels: Initialization: Execute Simulator, Database; Global Search: Global SMs, Many-Objective Search Algorithm, Most Critical Test Cases, Most Uncertain Test Cases, Execute Simulator; Local Search: Top Points, Clustering, Local SM per Cluster, Single-Objective Search Algorithm, Most Critical Test Cases; Minimal Test Suite]
  • 45. Automated Testing • Testing remains the key verification mechanism in practice to demonstrate trustworthiness and understand limitations (e.g., safety) • Different levels of testing: model, integration, system • Automation is key at all levels and requires different strategies, e.g., sequences of actions and states versus single inputs (e.g., images) • Key recurring strategy: combine evolutionary computing and machine learning • Concrete industrial problems could not have been addressed otherwise • Scalability remains the main challenge
  • 46. Why Search? • Models are hard to analyze and therefore not amenable to automated reasoning (though there is work) • Models exhibit complex, non-linear behaviors • AI-enabled systems often operate in open contexts described by many dimensions (AI for autonomy) • Testing AI-enabled systems usually involves interactions with complex simulators • Search is therefore a natural approach in such contexts
  • 49. Selected References • Ul Haq et al., "Automatic Test Suite Generation for Key-points Detection DNNs Using Many-Objective Search", ACM International Symposium on Software Testing and Analysis (ISSTA), 2021 • Ul Haq et al., "Efficient Online Testing for DNN-Enabled Systems using Surrogate-Assisted and Many-Objective Optimization", IEEE/ACM ICSE, 2022 • Ul Haq et al., "Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled Systems", arXiv report, https://arxiv.org/abs/2210.15432 • Fahmy et al., "Supporting DNN Safety Analysis and Retraining through Heatmap-based Unsupervised Learning", IEEE Transactions on Reliability, Special Section on Quality Assurance of Machine Learning Systems, 2021 • Fahmy et al., "Simulator-based explanation and debugging of hazard-triggering events in DNN-based safety-critical systems", arXiv report, https://arxiv.org/pdf/2204.00480.pdf, ACM TOSEM, 2022 • Attaoui et al., "Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction and Clustering", arXiv report, https://arxiv.org/pdf/2201.05077v1.pdf, ACM TOSEM, 2022 • Aghababaeyan et al., "Black-Box Testing of Deep Neural Networks through Test Case Diversity", https://arxiv.org/pdf/2112.12591.pdf • Zolfagharian et al., "Search-Based Testing Approach for Deep Reinforcement Learning Agents", https://arxiv.org/pdf/2206.07813.pdf
  • 50. Selected ML Testing References • Goodfellow et al., "Explaining and harnessing adversarial examples", arXiv preprint arXiv:1412.6572, 2014 • Zhang et al., "DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems", 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE), 2018 • Tian et al., "DeepTest: Automated testing of deep-neural-network-driven autonomous cars", Proceedings of the 40th International Conference on Software Engineering, 2018 • Li et al., "Structural Coverage Criteria for Neural Networks Could Be Misleading", IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (NIER), 2019 • Kim et al., "Guiding deep learning system testing using surprise adequacy", IEEE/ACM 41st International Conference on Software Engineering (ICSE), 2019 • Ma et al., "DeepMutation: Mutation testing of deep learning systems", IEEE 29th International Symposium on Software Reliability Engineering (ISSRE), 2018 • Zhang et al., "Machine learning testing: Survey, landscapes and horizons", IEEE Transactions on Software Engineering, 2020 • Riccio et al., "Testing machine learning based systems: a systematic mapping", Empirical Software Engineering 25, no. 6, 2020 • Gerasimou et al., "Importance-Driven Deep Learning System Testing", IEEE/ACM 42nd International Conference on Software Engineering, 2020