3. Example Automotive Applications
• Object detection, identification, classification,
localization and prediction of movement
• Sensor fusion and scene comprehension, e.g., lane
detection
• Driver monitoring
• Driver replacement
• Functional safety, security
• Powertrains, e.g., improve motor control and battery
management
Tian et al. 2018
4. Importance
• ML components are increasingly part of safety- or mission-
critical systems (ML-enabled systems - MLS)
• Many domains, including aerospace, automotive, health care,
…
• Many ML algorithms, supervised vs. unsupervised,
classification vs regression, etc.
• But increasing use of deep learning and reinforcement
learning
5. Testing Levels
• Testing is still the main mechanism through which to gain trust
• Levels: model (e.g., Deep Neural Networks or DNNs), integration,
system
• Research largely focused on model testing
• Integration: Issues that arise when multiple models and
components are integrated
• System: Test the MLS in its target environment, in-field or
simulated
6. Challenges: Overview
• Behavior driven by training data and learning process
• Neither specifications nor code
• Huge model input space, especially for autonomous
systems
• Automated test oracles / verdicts
• Test suite adequacy, i.e., when is it good enough?
• Models are never perfect, but how do we decide
whether they are good enough?
8. Test Inputs: Adversarial or
Natural?
• Adversarial inputs: Focus on robustness, e.g., noise or attacks
• Natural inputs: Focus on functional aspects, e.g., functional
safety
Adversarial example due to noise (Goodfellow et al., 2014)
9. Key-points Detection Testing with
Simulation in the Loop:
• DNNs used for key-points detection
in images
• Many applications, e.g., face
recognition
• Testing: Find test suite that causes
DNN to poorly predict as many key-
points as possible within time
budget
• Images generated by a simulator
(Figure: example image showing ground-truth vs. predicted key-points.)
10. Example Application
• Drowsiness or gaze detection based on interior camera monitoring the driver
• In the drowsiness or gaze detection problem, each Key-Point (KP) may be highly
important for safety
• Each KP leads to one test objective
• For our subject DNN, we have 27 test objectives
• Goal: Cause the DNN to mispredict as many key-points as possible
• Solution: Many-objective search algorithms (based on genetic algorithms)
combined with simulator
Ul Haq et al., 2021
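As a sketch of how each key-point (KP) yields one test objective, the per-key-point prediction error can be computed as the Euclidean distance between the predicted and ground-truth positions (illustrative only; the approach in Ul Haq et al. uses a normalized error, and the function name below is hypothetical):

```python
import numpy as np

def keypoint_errors(predicted, ground_truth):
    """Per-key-point Euclidean error; each key-point is one objective
    that the search tries to maximize.
    predicted, ground_truth: (k, 2) arrays of (x, y) positions."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(ground_truth, dtype=float)
    return np.linalg.norm(diff, axis=1)
```

For the subject DNN with 27 key-points, this yields a vector of 27 objective values per test image, one per test objective.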
11. Overview
(Figure: testing pipeline. The Input Generator (search) produces an input vector for the Simulator, which renders a test image. The DNN predicts key-point positions, and the Fitness Calculator compares them against the actual key-point positions to produce a fitness score (error value), guiding the search toward the most critical test inputs.)
12. Results
• Our approach is effective in generating test suites that cause the DNN to severely
mispredict more than 93% of all key-points on average
• Not all mispredictions can be considered failures … (e.g., shadows)
• We must know when the DNN cannot be expected to be accurate and have contingency
measures
• Some key-points are more severely mispredicted than others; detailed analysis revealed
two reasons:
• Under-representation of some key-points (hidden) in the training data
• Large variation in the shape and size of the mouth across different 3D models (more
training needed)
13. Interpretation
• Regression trees to predict accuracy based on simulation parameters
• Enable detailed analysis to find the root causes of high Normalized Error (NE) values, e.g., shadow
on the location of KP26 is the cause of high NE values
• Regression trees show excellent accuracy and reasonable size
• Amenable to risk analysis, yielding useful safety insights and contingency plans at run-time
Image Characteristics Condition                    | NE
M = 9 ∧ P < 18.41                                  | 0.04
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ Y < 17.06         | 0.26
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ 17.06 ≤ Y < 19    | 0.71
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ Y ≥ 19            | 0.36
Representative rules derived from the decision tree for KP26
(M: Model-ID, P: Pitch, R: Roll, Y: Yaw, NE: Normalized Error)
(A) A test image satisfying the first condition: NE = 0.013
(B) A test image satisfying the third condition: NE = 0.89
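The representative rules for KP26 can be read as a simple lookup over the simulation parameters; a minimal sketch (function name hypothetical, rule values taken directly from the table):

```python
def kp26_ne(model_id, pitch, roll, yaw):
    """Predicted Normalized Error (NE) for KP26 from the
    representative decision-tree rules; None if no rule applies."""
    if model_id != 9:
        return None
    if pitch < 18.41:
        return 0.04
    if roll < -22.31:
        if yaw < 17.06:
            return 0.26
        if yaw < 19:
            return 0.71
        return 0.36
    return None
```

Such rule evaluation is what makes the trees amenable to run-time risk analysis: given the current head pose, the expected error for a key-point can be estimated cheaply.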
14. Real Images
• Test selection requires a different approach than with a
simulator
• Labeling costs are significant
15. DNN Structural Coverage Criteria
Chen et al. 2020
• Neuron coverage
• How neurons are
activated
• Many variants
16. Limitations
• Require access to the DNN internals and sometimes the
training set. Not realistic in many practical settings.
• There is a weak correlation between coverage and
misclassification or poor predictions for natural inputs
• Also many questions regarding the studies focused on
adversarial inputs …
17. Diversity-driven Test Selection
• Test inputs are real images
• Black-box approach based on measuring the diversity of
test inputs.
• Scalable selection
• Assumption: The more diverse, the more likely test inputs
are to reveal faults
Aghababaeyan et al., 2022
18. Extracting Image Features
• VGG16 is a convolutional neural network trained on a
subset of the ImageNet dataset, a collection of over 14
million images belonging to 22,000 categories.
Features:
- Activation values
after last conv layer
- Characterize semantic
elements such as shapes
and colors
19. Geometric Diversity (GD)
• Given a dataset X and its corresponding feature vectors V,
the geometric diversity of a subset S ⊆ X is defined as the
hyper-volume of the parallelepiped spanned by the rows of
Vs, i.e., feature vectors of items in S, where the larger the
volume, the more diverse is the feature space of S
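A minimal sketch of this definition, assuming the hyper-volume is computed as the square root of the determinant of the Gram matrix V_S V_S^T (a standard way to compute the volume of a parallelepiped spanned by row vectors):

```python
import numpy as np

def geometric_diversity(V):
    """Geometric diversity of a selected subset S.
    V: (n, d) matrix whose rows are the feature vectors of items in S.
    Returns the hyper-volume of the parallelepiped spanned by the rows,
    i.e., sqrt(det(V V^T)); larger = more diverse."""
    gram = V @ V.T
    return float(np.sqrt(max(np.linalg.det(gram), 0.0)))
```

Intuitively, near-duplicate inputs produce nearly collinear feature vectors, collapsing the volume toward zero, while diverse inputs span a large volume.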
22. Pareto Front Optimization
• Black-box test selection to detect as many diverse faults
as possible for a given test budget
• Label selected images
• Search-based Approach: Multi-Objective Genetic Search
(NSGA-II)
• Two objectives (Max):
• diversity
• uncertainty (e.g., Gini)
(Figure: Pareto front over GD score vs. Gini score.)
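As a sketch of the uncertainty objective, a common Gini-based score is one minus the sum of squared predicted class probabilities; it is maximal for a uniform prediction (most uncertain) and zero for a one-hot prediction (most confident):

```python
def gini_uncertainty(probs):
    """Gini-based uncertainty of one prediction.
    probs: predicted class probabilities (summing to 1).
    Higher score = DNN is less certain about this input."""
    return 1.0 - sum(p * p for p in probs)
```

NSGA-II then searches for a subset of images maximizing both this score and the diversity (GD) score under the labeling budget.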
23. Explanation and Analysis with
Heatmaps
• Heatmaps capture the extent to which the pixels of an image impacted a specific
result
• Limitations:
• Heatmaps should be manually inspected to determine the reason for misclassification
• underrepresented (but dangerous) failure causes might go unnoticed
• Model debugging (i.e., improvement) not automated
A heatmap can show that long hair is what
caused a female doctor to be classified as a
nurse
24. HUDD
• How to help perform risk analysis with real images?
• Rely on heatmaps to generate clusters of images leading to failures because of the
same root cause (common characteristics of the images)
• enable engineers to ensure safety by introducing countermeasures
(Figure: HUDD workflow. Step 1: heatmap-based clustering of error-inducing test set images into root-cause clusters C1, C2, C3. Step 2: inspection of a subset of each cluster's elements.)
Fahmy et al. 2021
25. Example Application
• Classification
• Gaze Detection
(Figure: gaze angles in 22.5° steps mapped to eight classes: Top Left, Top Center, Top Right, Middle Left, Middle Right, Bottom Left, Bottom Center, Bottom Right.)
26. Clusters identify different problems
• Cluster 1 (angle ~157.5): borderline cases
• Cluster 2 (eye middle center): incomplete set of classes
• Cluster 3 (near closed eyes): incomplete training set
27. Combining Simulation and Actual
Images in Testing
(Figure: workflow. A DNN is trained on a training set of simulator images and tested on a test set of simulator images. The trained DNN is then fine-tuned on a training set of real-world images, and the fine-tuned DNN is tested on a test set of real-world images, yielding real-world error-inducing images.)
RQ1: Is it possible to automatically characterize unsafe scenarios?
RQ2: Is it possible to improve the accuracy of the DNN by leveraging the identified
unsafe scenarios?
28. Simulator-based Explanations for
DNN Errors (SEDE)
(Figure: SEDE pipeline. HUDD groups real-world error-inducing images into root-cause clusters (RCCs). An evolutionary algorithm drives the simulator, through its configuration parameters, to generate unsafe images in each RCC and safe images similar to the unsafe ones, each paired with its simulator parameters. A rules-extraction algorithm (PART) then derives IF-THEN rules that form the explanation expression.)
Fahmy et al., 2022
29. Example PART Rule
• Example simulation
parameters: pitch,
yaw, and roll
(orientation of the
head)
• Rules expressed in
terms of simulation
parameters
31. Cartpole Example
• Environment (CartPole from OpenAI GYM): a pole is attached to a cart, and
the goal is to move the cart right and left to keep the pole from falling.
• Actions: Left and right
• Reward: For every timestep that the pole remains upright (+1 reward)
• Termination
• Pole angle is greater than 12 degrees
• Cart distance from the center point is greater than 2.4
• Episode length is more than 200
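The termination conditions above can be sketched as a simple predicate (illustrative; thresholds taken from the environment description):

```python
def episode_terminated(pole_angle_deg, cart_distance, episode_length):
    """CartPole termination check:
    - pole angle greater than 12 degrees (either direction)
    - cart distance from the center point greater than 2.4
    - episode length more than 200 timesteps"""
    return (abs(pole_angle_deg) > 12.0
            or abs(cart_distance) > 2.4
            or episode_length > 200)
```

Since the reward is +1 per timestep the pole remains upright, the accumulated reward equals the episode length at termination.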
32. Reward and Functional Faults
• Episodes: State-action sequences
• Testing goal: Find episodes to detect and explain reward and functional faults
• Reward fault: the accumulated reward is less than a threshold, e.g., the pole
falls down in the first 70 timesteps
• Functional fault: the pole is stable and, regardless of the accumulated reward,
the cart moves beyond the 2.4 limited distance, and the episode terminates
(Figure: example episodes showing a functional fault and a reward fault.)
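A minimal sketch of these two fault definitions (function name hypothetical; the default reward threshold follows the slide's example of the pole falling within the first 70 timesteps):

```python
def classify_episode(accum_reward, max_abs_cart_pos, reward_threshold=70):
    """Classify a CartPole episode against the two fault types.
    - functional fault: the cart moved beyond the 2.4 distance limit,
      regardless of the accumulated reward
    - reward fault: the accumulated reward is below the threshold
    Returns the (possibly empty) list of detected fault types."""
    faults = []
    if max_abs_cart_pos > 2.4:
        faults.append("functional")
    if accum_reward < reward_threshold:
        faults.append("reward")
    return faults
```

Note that one episode can exhibit both fault types, which is why they are tracked separately.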
33. STARLA: Objectives
• Detect as quickly
as possible diverse
faulty episodes
• Accurately
characterize faulty
episodes
• STARLA: Search
and ML based
Zolfagharian et al. 2022
35. Testing via Physics-based
Simulation
(Figure: the ADAS (SUT) exchanges test inputs and time-stamped test outputs with a Matlab/Simulink simulator whose model includes the physical plant (vehicle / sensors / actuators), other cars, pedestrians, and the environment (weather / roads / traffic signs).)
36. Example Violation
• Violation: Ego Vehicle collides with other vehicle
• Vehicle-in-front slows down suddenly and then moves to the right
• Possible reason: Agent was not trained with such episodes
(Figure: car view and top view of the collision.)
37. Reinforcement Learning for ADAS
Testing (MORLOT)
• Goal: make sequential changes to the environment, in a realistic
manner, with the objective of triggering safety violations
• Train an RL agent to do so
• Challenge: Test many requirements simultaneously
Ul Haq et al. 2022
38. Example
• Consider the vehicle-in-front as an RL
agent that aims to cause the ego vehicle
to violate safety requirements by
performing a sequence of actions
• State: collision
• Action: turn left/right,
increase/decrease speed of vehicle in
front
• Reward: distance between vehicles
(Figure: ego vehicle and vehicle-in-front, with the vehicle-in-front's possible actions.)
39. Reinforcement Learning for ADAS
Testing (MORLOT)
(Figure: the RL agent interacts with an RL-environment built on the CARLA simulator and the ADAS under test. Actions set the behavior of environment actors (e.g., the vehicle in front); the state/next state captures locations and conditions of the environment (e.g., collision); the reward is based, e.g., on the distance from the vehicle in front. The ADAS receives images and locations from the simulator and sends back control commands.)
Ul Haq et al. 2022
40. MORLOT: Overview
(Figure: MORLOT (Many-Objective Reinforcement Learning for Online Testing) combines tabular reinforcement learning (Q-learning) with many-objective search, guided by the objectives (i.e., requirements) and Q-tables. The loop: observe the state; choose an action, randomly or using many-objective search; take the action and receive rewards; update the Q-tables and the archive; reset the environment. The many-objective search selects actions and obtains fitness values while interacting with the RL environment.)
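The "update Q-tables" step follows the standard tabular Q-learning rule, Q(s,a) ← Q(s,a) + α(r + γ·max_a' Q(s',a') − Q(s,a)); a minimal dictionary-based sketch (hyper-parameter values illustrative, not MORLOT's actual settings):

```python
def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step on a nested-dict Q-table.
    Q maps state -> {action: value}. Returns the updated Q(state, action)."""
    # Best estimated value of the next state (0.0 if unseen)
    next_actions = Q.get(next_state)
    best_next = max(next_actions.values()) if next_actions else 0.0
    # Ensure the (state, action) entry exists, then apply the update rule
    Q.setdefault(state, {}).setdefault(action, 0.0)
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
    return Q[state][action]
```

In MORLOT, one such update is performed per objective (requirement) using that objective's reward, which is why the figure refers to Q-tables in the plural.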
41. System Testing Challenges
• Challenges: large input space, many safety requirements, computationally-intensive simulation
• To address these challenges, we propose SAMOTA (Surrogate-Assisted Many-Objective Testing Approach), leveraging many-objective search and Surrogate Models (SMs)
Ul Haq et al. 2022
42. Surrogate Models
• Surrogate model: Model that mimics the simulator, to a
certain extent, while being much less computationally
expensive
• Research: Combine search with surrogate modeling to
decrease the computational cost of testing (Ul Haq et al. 2022)
• Example surrogate model types: Polynomial Regression (PR), Radial Basis Function (RBF), Kriging (KR)
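As a sketch of the idea, a polynomial-regression surrogate can be fitted to previously simulated (input, fitness) pairs and then queried cheaply in place of the simulator (one-dimensional input for illustration; function name hypothetical):

```python
import numpy as np

def fit_pr_surrogate(X, y, degree=2):
    """Fit a 1-D polynomial-regression surrogate to (input, fitness)
    pairs collected from earlier simulator runs. Returns a callable
    that predicts fitness without running the simulator."""
    coeffs = np.polyfit(X, y, degree)
    return np.poly1d(coeffs)
```

The search then evaluates candidate test cases on the surrogate, running the expensive simulator only on the most critical or most uncertain ones.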
43. SAMOTA
Steps:
1. Initialization
2. Global search
3. Local search
4. Global search
5. Local search
6. …
(Figure: the global search uses global SMs and a many-objective search algorithm to identify the most critical and most uncertain test cases, which are executed on the simulator and stored in a database. The local search clusters the top points, builds a local SM per cluster, and applies a single-objective search algorithm to find the most critical test cases, yielding a minimal test suite.)
45. Automated Testing
• Testing remains the key verification mechanism in practice to
demonstrate trustworthiness and understand limitations (e.g., safety)
• Different levels of testing: model, integration, system
• Automation is key at all levels and requires different strategies, e.g.,
sequences of actions and states versus single inputs (e.g., images)
• Key recurring strategy: combine evolutionary computing and machine
learning
• Concrete industrial problems could not have been addressed otherwise
• Scalability remains the main challenge
46. Why Search?
• Models are hard to analyze and therefore not amenable to
automated reasoning (though there is work)
• Models exhibit complex, non-linear behaviors
• AI-enabled systems often operate in open contexts described
by many dimensions (AI for autonomy)
• Testing AI-enabled systems usually involves interactions with
complex simulators
• Search is therefore a natural approach in such contexts
49. Selected References
• Ul Haq et al. "Automatic Test Suite Generation for Key-points Detection DNNs Using Many-Objective Search" ACM International
Symposium on Software Testing (ISSTA), 2021
• Ul Haq et al., “Efficient Online Testing for DNN-Enabled Systems using Surrogate-Assisted and Many-Objective Optimization”
IEEE/ACM ICSE 2022
• Ul Haq et al., “Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled Systems”, ArXiv report,
https://arxiv.org/abs/2210.15432
• Fahmy et al. "Supporting DNN Safety Analysis and Retraining through Heatmap-based Unsupervised Learning" IEEE Transactions
on Reliability, Special section on Quality Assurance of Machine Learning Systems, 2021
• Fahmy et al. "Simulator-based explanation and debugging of hazard-triggering events in DNN-based safety-critical systems”,
ArXiv report, https://arxiv.org/pdf/2204.00480.pdf, ACM TOSEM, 2022
• Attaoui et al., “Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction and Clustering”, ArXiv report,
https://arxiv.org/pdf/2201.05077v1.pdf, ACM TOSEM, 2022
• Aghababaeyan et al., “Black-Box Testing of Deep Neural Networks through Test Case Diversity”,
https://arxiv.org/pdf/2112.12591.pdf
• Zolfagharian et al., “Search-Based Testing Approach for Deep Reinforcement Learning Agents”,
https://arxiv.org/pdf/2206.07813.pdf
50. Selected ML Testing References
• Goodfellow et al. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).
• Zhang et al. "DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems." In 33rd
IEEE/ACM International Conference on Automated Software Engineering (ASE), 2018.
• Tian et al. "DeepTest: Automated testing of deep-neural-network-driven autonomous cars." In Proceedings of the 40th
international conference on software engineering, 2018.
• Li et al. “Structural Coverage Criteria for Neural Networks Could Be Misleading”, IEEE/ACM 41st International Conference on
Software Engineering: New Ideas and Emerging Results (NIER)
• Kim et al. "Guiding deep learning system testing using surprise adequacy." In IEEE/ACM 41st International Conference on Software
Engineering (ICSE), 2019.
• Ma et al. "DeepMutation: Mutation testing of deep learning systems." In 2018 IEEE 29th International Symposium on Software
Reliability Engineering (ISSRE), 2018.
• Zhang et al. "Machine learning testing: Survey, landscapes and horizons." IEEE Transactions on Software Engineering (2020).
• Riccio et al. "Testing machine learning based systems: a systematic mapping." Empirical Software Engineering 25, no. 6 (2020)
• Gerasimou et al., “Importance-Driven Deep Learning System Testing”, IEEE/ACM 42nd International Conference on Software
Engineering, 2020