History and Present Developments in Machine Learning, by Tom Dietterich, Emeritus Professor of Computer Science at Oregon State University and Chief Scientist at BigML.
Machine Learning School in the Netherlands, 2022.
3. BigML, Inc #DutchMLSchool
Outline
• Anomaly Detection Use Cases
• Four Basic Methods for Anomaly Detection with Engineered Features
• Benchmarking Study
• Incorporating Feedback
• Deep Versions of the Four Basic Methods
• Classifier-Based Anomaly Detection Using the Max Logit Score
• The Familiarity Hypothesis
• Challenges for the Future
5. Use Cases
• Data Cleaning
  • Remove corrupted data from the training data
  • Examples: typos in feature values, interchanged feature values, test results from two patients combined
• Fault Detection, Fraud Detection, Cyber-Attack Detection
  • At training or test time, faulty or illegal behavior creates anomalous data
• Open Category Detection
  • At test time, the classifier is given an instance of a novel category
  • Example: a self-driving car trained in Europe encounters a kangaroo in Australia
• Out-of-Distribution Detection
  • At test time, the classifier is given an instance collected in a different way
  • Example: a chest X-ray classifier trained only on front views is shown a side view
  • Example: a self-driving car trained in clear conditions must operate in rainy conditions
6. Protecting a Classifier
• Claim: every deployed ML classifier should include an anomaly detector to detect queries that lie outside the classifier's region of competence
• Also useful as a performance indicator to detect that the classifier needs to be retrained
[Diagram: a query x_q goes to the anomaly detector, which asks whether A(x_q) > τ. If yes, the query is rejected; if no, the classifier f, trained on examples (x_i, y_i), outputs ŷ = f(x_q).]
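The reject-or-classify pipeline can be sketched in a few lines. This is an illustrative toy, not the talk's implementation: the stand-in classifier, the nearest-neighbor detector, and the threshold choice are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Nominal training data: two features drawn near the origin.
X_train = rng.normal(0.0, 1.0, size=(500, 2))

def anomaly_score(x_q, X):
    """A(x_q): distance from the query to the nearest training point."""
    return np.min(np.linalg.norm(X - x_q, axis=1))

def guarded_predict(x_q, classify, X, tau):
    """Reject the query if A(x_q) > tau; otherwise run the classifier."""
    if anomaly_score(x_q, X) > tau:
        return "reject"
    return classify(x_q)

# Hypothetical stand-in classifier f(x): sign of the first feature.
classify = lambda x: int(x[0] > 0)

# Set tau so that roughly 1% of nominal queries would be falsely rejected
# (leave-one-out nearest-neighbor distances on the training data).
scores = np.array([anomaly_score(x, np.delete(X_train, i, axis=0))
                   for i, x in enumerate(X_train)])
tau = np.quantile(scores, 0.99)

assert guarded_predict(np.array([0.5, -0.2]), classify, X_train, tau) == 1
assert guarded_predict(np.array([25.0, 25.0]), classify, X_train, tau) == "reject"
```

The threshold is calibrated on nominal data only, which matches the clean-training-data setting described below.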
7. Anomaly Detection Definitions
• Definition: an "anomaly" is a data point generated by a process that is different from the process generating the "nominal" data
• Let D_0 be the probability distribution of the nominal process
• Let D_a be the probability distribution of the anomaly process
• Two formal settings:
  • Clean training data
  • Contaminated training data
8. Clean Training Data
• Given:
  • Training data x_1, x_2, ..., x_n, all drawn from D_0, the "nominal" distribution
  • Test data x_{n+1}, ..., x_{n+m}, drawn from a mixture of D_0 and D_a (the anomaly distribution)
• Find:
  • The data points in the test data that belong to D_a
• Examples:
  • Protecting a classifier
  • Detecting manufacturing defects / equipment failure
9. Contaminated Training Data
• Given:
  • Training data x_1, x_2, ..., x_n, drawn from a mixture of D_0 and D_a (the anomaly distribution)
• Find:
  • The data points in the training data that belong to D_a
• Use cases:
  • Data cleaning
  • Fraud detection, insider-threat detection
• The two settings can be combined: contaminated training data + separate contaminated test data
11. Theoretical Approaches to Anomaly Detection
• Distance-Based Methods
  • Anomaly score: A(x_q) = min_{x ∈ D} ‖x_q − x‖
• Density Estimation Methods
  • Model the joint distribution p_D(x) of the input data points x_1, ... ∼ D
  • Surprise: A(x_q) = −log p_D(x_q)
• Quantile Methods
  • Find a smooth function q such that {x : q(x) ≥ 0} contains 1 − α of the training data
  • Anomaly score: A(x) = −q(x)
• Reconstruction Methods
  • Train an auto-encoder x ≈ D(E(x)), where E is the encoder and D is the decoder
  • Anomaly score: A(x_q) = ‖x_q − D(E(x_q))‖
12. Approach 1: Distance-Based Methods
• Define a distance d(x_q, x_i)
• A(x_q) = min_{x ∈ D} d(x_q, x)
• Requires a good distance metric
[Figure: a query point x_q and its nearest neighbor x_i in the training data]
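A minimal sketch of the distance-based score, here using the mean distance to the k nearest neighbors (the kNN variant that appears later in the benchmarking study). Plain numpy, illustrative only:

```python
import numpy as np

def knn_anomaly_score(x_q, X, k=5):
    """Mean distance from x_q to its k nearest training points.

    Averaging the k smallest distances (rather than taking the single
    minimum) makes the score less sensitive to one accidentally close point.
    """
    d = np.linalg.norm(X - x_q, axis=1)
    return np.sort(d)[:k].mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                    # nominal training sample
s_in = knn_anomaly_score(np.zeros(3), X)         # central query
s_out = knn_anomaly_score(10 * np.ones(3), X)    # far-away query
assert s_out > s_in
```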
13. Isolation Forest [Liu, Ting, Zhou, 2011]
• Approximates the L1 (Manhattan) distance (Guha, et al., ICML 2016)
• Construct a fully random binary tree:
  • choose an attribute j at random
  • choose a splitting threshold θ uniformly from [min x_·j, max x_·j]
  • repeat until every data point is in its own leaf
• Let d(x_q) be the depth of point x_q in the tree
• Repeat L times; let d̄(x_q) be the average depth of x_q
• A(x_q) = 2^{−d̄(x_q)/D(x_q)}, where D(x_q) is the expected depth
[Figure: a random tree whose successive splits x_·j > θ_1, ..., x_·j > θ_5 isolate the point x_q]
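A from-scratch sketch of the isolation idea. For brevity it simulates each random tree only along the query's own path (equivalent for scoring a single point) and uses the standard average-path-length normalizer; this is not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def isolation_depth(x, X, depth=0, max_depth=20):
    """Depth at which x is isolated by a fully random tree built on X."""
    if len(X) <= 1 or depth >= max_depth:
        return depth
    j = rng.integers(X.shape[1])                 # choose attribute at random
    lo, hi = X[:, j].min(), X[:, j].max()
    if lo == hi:
        return depth
    theta = rng.uniform(lo, hi)                  # random splitting threshold
    side = X[:, j] < theta
    X_next = X[side] if x[j] < theta else X[~side]
    return isolation_depth(x, X_next, depth + 1, max_depth)

def expected_depth(n):
    """c(n): expected depth of a point in a random tree over n points."""
    if n <= 1:
        return 1.0
    h = np.log(n - 1) + 0.5772156649             # harmonic-number approximation
    return 2.0 * h - 2.0 * (n - 1) / n

def iforest_score(x, X, L=50):
    """A(x) = 2^(-dbar(x)/c(n)): close to 1 for easily isolated points."""
    dbar = np.mean([isolation_depth(x, X) for _ in range(L)])
    return 2.0 ** (-dbar / expected_depth(len(X)))

X = rng.normal(size=(256, 2))
# An outlier is isolated at a shallow depth, so it gets a larger score.
assert iforest_score(np.array([8.0, 8.0]), X) > iforest_score(np.zeros(2), X)
```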
14. Approach 2: Density Estimation
• Given a data set x_1, ..., x_N where x_i ∈ R^d
• We assume the data have been drawn iid from an unknown probability density: x_i ∼ p(x)
• Goal: estimate p
• Anomaly score: A(x_q) = −log p(x_q), the "surprisal" from information theory
• Why density estimation? It gives a more global view by combining distances to all data points
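A minimal density-estimation detector: a Gaussian kernel density estimate with an assumed bandwidth h, with surprisal −log p̂(x_q) as the anomaly score. The bandwidth and data are made up for illustration:

```python
import numpy as np

def kde_surprisal(x_q, X, h=0.5):
    """A(x_q) = -log p_hat(x_q) under a Gaussian kernel density estimate."""
    n, d = X.shape
    sq = np.sum((X - x_q) ** 2, axis=1)
    # Log of each Gaussian kernel, then a stable log-mean-exp over kernels.
    log_kernels = -sq / (2 * h ** 2) - 0.5 * d * np.log(2 * np.pi * h ** 2)
    m = log_kernels.max()
    log_p = m + np.log(np.exp(log_kernels - m).sum()) - np.log(n)
    return -log_p

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
# A point far from all training data has much higher surprisal.
assert kde_surprisal(np.array([6.0, 6.0]), X) > kde_surprisal(np.zeros(2), X)
```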
15. Example: LODA (Pevný, 2016)
• Introduce sparse random projections Π_l into 1-dimensional space
• Fit a density estimator p_l(Π_l x) in each 1-d space
• A(x_q) = (1/L) Σ_{l=1}^{L} −log p_l(Π_l x_q)
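A rough LODA-style sketch: sparse random projections with histogram density estimators, averaging the negative log densities. The number of nonzero entries per projection (√d) follows the paper's spirit, but the details here are simplified assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_loda(X, L=20, bins=20):
    """Fit L sparse random 1-d projections and a histogram density on each."""
    n, d = X.shape
    k = max(1, int(np.sqrt(d)))          # nonzero entries per sparse projection
    projections, histograms = [], []
    for _ in range(L):
        w = np.zeros(d)
        idx = rng.choice(d, size=k, replace=False)
        w[idx] = rng.normal(size=k)
        z = X @ w
        counts, edges = np.histogram(z, bins=bins)
        probs = np.maximum(counts / n, 1e-6)  # floor so the log stays finite
        projections.append(w)
        histograms.append((edges, probs))
    return projections, histograms

def loda_score(x, projections, histograms):
    """A(x): average negative log density across the 1-d projections."""
    total = 0.0
    for w, (edges, probs) in zip(projections, histograms):
        z = x @ w
        b = np.clip(np.searchsorted(edges, z) - 1, 0, len(probs) - 1)
        total += -np.log(probs[b])
    return total / len(projections)

X = rng.normal(size=(1000, 10))
P, H = fit_loda(X)
assert loda_score(10 * np.ones(10), P, H) > loda_score(np.zeros(10), P, H)
```

Averaging over many weak 1-d estimators is what makes LODA cheap yet robust.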
16. Approach 3: Quantile Methods
• Vapnik's principle: we only need to estimate the "decision boundary" between nominal and anomalous
• Surround the data with a function q that captures 1 − α of the training data
• One-Class Support Vector Machine (OCSVM)
  • q is a hyperplane in "kernel space"
• Support Vector Data Description (SVDD)
  • q is a sphere in "kernel space"
• Issue: α must be chosen at learning time rather than at run time
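OCSVM and SVDD need kernel machinery; as a minimal illustration of the quantile idea only, here is a mean-centered sphere whose radius is the (1 − α) quantile of training distances — a deliberate simplification of SVDD, not the real algorithm:

```python
import numpy as np

def fit_sphere(X, alpha=0.05):
    """Mean-centered sphere capturing 1 - alpha of the training data.

    A toy stand-in for SVDD: q(x) = R - ||x - c|| is >= 0 for roughly
    95% of the training points when alpha = 0.05.
    """
    c = X.mean(axis=0)
    R = np.quantile(np.linalg.norm(X - c, axis=1), 1 - alpha)
    return c, R

def anomaly_score(x, c, R):
    """A(x) = -q(x): positive outside the sphere, negative inside."""
    return np.linalg.norm(x - c) - R

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 2))
c, R = fit_sphere(X)
inside = np.mean([anomaly_score(x, c, R) <= 0 for x in X])
assert abs(inside - 0.95) < 0.02         # ~95% of training data captured
assert anomaly_score(np.array([9.0, 9.0]), c, R) > 0
```

Note that α is fixed when the sphere is fit, which is exactly the issue raised on the slide: the quantile cannot be changed at run time without refitting.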
17. Approach 4: Reconstruction Methods
• NavLab self-driving van (Pomerleau, NIPS 1992)
  • Primary head: predict the steering angle from the input image
  • Secondary head: predict the input image itself (an "auto-encoder")
  • A(x_q) = ‖x_q − x̂_q‖
  • If the reconstruction is poor, this suggests that the steering angle should not be trusted
• Principle: anomaly detection through failure
  • Define a task on which the learned system should fail for anomalies
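A linear autoencoder (PCA) gives the simplest runnable version of a reconstruction-based detector; the low-dimensional data-generating setup below is invented for illustration:

```python
import numpy as np

def fit_pca(X, n_components=2):
    """Fit a linear 'autoencoder': encode onto the top principal components."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:n_components]

def reconstruction_score(x, mu, V):
    """A(x) = ||x - D(E(x))||: encode, decode, and measure the error."""
    z = V @ (x - mu)            # E(x): project into the learned subspace
    x_hat = mu + V.T @ z        # D(z): map back to input space
    return np.linalg.norm(x - x_hat)

rng = np.random.default_rng(6)
# Nominal data lie near a 2-d plane embedded in 5 dimensions.
Z = rng.normal(size=(500, 2))
B = rng.normal(size=(2, 5))
X = Z @ B + 0.05 * rng.normal(size=(500, 5))
mu, V = fit_pca(X)
# A point far off the plane reconstructs poorly and scores high.
assert reconstruction_score(np.full(5, 10.0), mu, V) > reconstruction_score(X[0], mu, V)
```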
18. Application: Finding Unusual Chemical Spectra
• NASA Mars Science Laboratory ChemCam instrument
  • Collects 6144 spectral bands on rock samples from a 7 m distance using laser stimulation
  • Goal: active learning to find interesting spectra
• DEMUD (Wagstaff, et al., 2013)
  • Incremental PCA applied to the samples one at a time
  • Fit only to the samples labeled "uninteresting" by the user
  • Show the user the most un-uninteresting sample (the sample with the highest PCA reconstruction error)
  • Rapidly discovers interesting samples
19. Benchmarking Study [Andrew Emmott, 2015, 2020]
• Distance-Based Methods
  • kNN: mean distance to the k nearest neighbors
  • LOF: Local Outlier Factor (Breunig, et al., 2000)
  • ABOD: kNN Angle-Based Outlier Detector (Kriegel, et al., 2008)
  • IFOR: Isolation Forest (Liu, et al., 2008)
• Density-Based Approaches
  • RKDE: Robust Kernel Density Estimation (Kim & Scott, 2008)
  • EGMM: Ensemble Gaussian Mixture Model (our group)
  • LODA: Lightweight Online Detector of Anomalies (Pevný, 2016)
• Quantile-Based Methods
  • OCSVM: One-Class SVM (Schölkopf, et al., 1999)
  • SVDD: Support Vector Data Description (Tax & Duin, 2004)
20. Benchmarking Methodology
• Select 19 data sets from the UC Irvine repository
• Choose one or more classes to be "anomalies"; the rest are "nominals"
• Manipulate:
  • relative frequency
  • point difficulty
  • irrelevant features
  • clusteredness
• 20 replicates of each configuration
• Result: 11,888 non-trivial benchmark datasets
21. Analysis of Variance
• Linear ANOVA:
  log(AUC / (1 − AUC)) ~ rf + pd + cl + ir + pset + algo
• rf: relative frequency
• pd: point difficulty
• cl: normalized clusteredness
• ir: irrelevant features
• pset: "parent" data set
• algo: anomaly detection algorithm
• Assess the algo effect while controlling for all other factors
• AUC: area under the ROC curve for the nominal vs. anomaly binary decision
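To make the model concrete, here is a toy version of this analysis: simulated AUCs with per-benchmark and per-algorithm effects, fit on the log(AUC/(1 − AUC)) scale by least squares. All numbers are invented, and the real study controlled for more factors across 11,888 benchmarks:

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical results: AUC for 3 algorithms on 50 benchmarks, where each
# benchmark has its own difficulty offset that the model must control for.
n_bench, n_algo = 50, 3
bench_effect = rng.normal(0, 0.5, size=n_bench)
algo_effect = np.array([0.8, 0.3, -0.2])          # "true" effects (logit scale)
logit_auc = (bench_effect[:, None] + algo_effect[None, :]
             + rng.normal(0, 0.1, (n_bench, n_algo)))
auc = 1 / (1 + np.exp(-logit_auc))                # observed AUCs in (0, 1)

# Linear model: logit(AUC) ~ pset + algo, fit by ordinary least squares.
y = np.log(auc / (1 - auc)).ravel()
X_bench = np.kron(np.eye(n_bench), np.ones((n_algo, 1)))  # benchmark dummies
X_algo = np.kron(np.ones((n_bench, 1)), np.eye(n_algo))   # algorithm dummies
X = np.hstack([X_bench, X_algo[:, 1:]])   # drop algo 0 as the reference level
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
algo_contrasts = coef[n_bench:]           # algo effects relative to algo 0

# The fitted contrasts recover the true differences (-0.5 and -1.0).
assert abs(algo_contrasts[0] - (-0.5)) < 0.1
assert abs(algo_contrasts[1] - (-1.0)) < 0.1
```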
22. Benchmarking Study Results
• 19 UCI datasets
• 9 leading "feature-based" algorithms
• 11,888 non-trivial benchmark datasets
• Mean AUC effect for "nominal" vs. "anomaly" decisions
• Controlling for:
  • parent data set
  • difficulty of individual queries
  • fraction of anomalies
  • irrelevant features
  • clusteredness of anomalies
• Baseline method: distance to the nominal mean ("tmd")
• Best methods: k-nearest neighbors and Isolation Forest
• Worst methods: kernel-based OCSVM and SVDD
[Bar chart: mean AUC effect per algorithm, ranging from roughly 0.62 to 0.78, for knn, iforest, egmm, rkde, lof, abod, loda, svdd, tmd, ocsvm]
23. Incorporating User Feedback: Initial Work
• Show the top-ranked candidate to the user
• The user labels the candidate
• The label is used to update the anomaly detector
• Two methods:
  • AAD [Das, et al., ICDM 2016]
  • GLAD-OMD (a modified version of iForest) [Siddiqui, et al., KDD 2018]
[Diagram: data → anomaly detection → best candidate → user, whose yes/no label feeds back into the anomaly detector; confirmed anomalies go on to anomaly analysis]
24. User Feedback Yields Big Improvements in Anomaly Discovery
• APT Engagement 3 results
27. Distance-Based Methods
• k-nearest neighbor in the latent space
• Issue: which distance metric should be used?
• Cosine distance is the most popular:
  d(z_1, z_2) = 1 − (z_1 · z_2) / (‖z_1‖ ‖z_2‖)
  (one minus the cosine similarity)
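A sketch of kNN with cosine distance in a latent space; the embeddings below are synthetic stand-ins for network activations, not real features:

```python
import numpy as np

def cosine_knn_score(z_q, Z, k=5):
    """Mean cosine distance from a query embedding to its k nearest embeddings.

    d(z1, z2) = 1 - (z1 . z2) / (||z1|| ||z2||): zero for aligned vectors,
    growing as the angle between them grows.
    """
    sims = (Z @ z_q) / (np.linalg.norm(Z, axis=1) * np.linalg.norm(z_q))
    d = 1.0 - sims
    return np.sort(d)[:k].mean()

rng = np.random.default_rng(7)
direction = rng.normal(size=16)
# Known-class embeddings cluster around one direction in latent space.
Z = direction + 0.1 * rng.normal(size=(300, 16))
# A query pointing the opposite way is far from every known embedding.
assert cosine_knn_score(-direction, Z) > cosine_knn_score(direction, Z)
```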
28. Density-Based Methods
• Mahalanobis method
  • Fit a joint multivariate Gaussian
  • Each class k has its own mean μ_k
  • Shared covariance matrix Σ
• Given a new x:
  −log p(x) ∝ min_k (x − μ_k)^⊤ Σ^{−1} (x − μ_k)
  This is known as the squared Mahalanobis distance.
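A numpy sketch of the Mahalanobis score with per-class means and a pooled covariance; the two-class toy data stand in for a deep model's latent features:

```python
import numpy as np

def fit_gaussians(X, y):
    """Per-class means and the inverse of a shared (pooled) covariance."""
    classes = np.unique(y)
    mus = {k: X[y == k].mean(axis=0) for k in classes}
    centered = np.vstack([X[y == k] - mus[k] for k in classes])
    Sigma = centered.T @ centered / len(X)
    return mus, np.linalg.inv(Sigma)

def mahalanobis_score(x, mus, Sigma_inv):
    """A(x) = min_k (x - mu_k)^T Sigma^{-1} (x - mu_k): large means anomalous."""
    return min(float((x - mu) @ Sigma_inv @ (x - mu)) for mu in mus.values())

rng = np.random.default_rng(8)
# Two known classes with different means and a shared unit covariance.
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
mus, Sigma_inv = fit_gaussians(X, y)
# A point far from both class means gets a much larger score.
assert mahalanobis_score(np.array([20.0, -20.0]), mus, Sigma_inv) > \
       mahalanobis_score(np.array([0.0, 0.0]), mus, Sigma_inv)
```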
29. Open Hybrid: Classification + Density Estimation (Tack, Li, Guo, Guo, 2020)
• Residual Flow deep density estimator (Chen, Behrmann, Duvenaud, et al., NeurIPS 2019)
• Standard cross-entropy supervised loss
• Claim: this helps focus p(x) on the relevant aspects of the images
• Anomaly score: A(x_q) = −log p(x_q)
30. Quantile Method: Deep SVDD (Ruff, et al., ICML 2018)
• The method is somewhat tricky to work with:
  • set the center c to the mean of a small set of points passed through the untrained network
  • use no bias weights
  • these choices help prevent "hypersphere collapse"
31. Reconstruction Methods: Deep Autoencoders
• Encoder: z = E(x)
• Decoder: x̂ = D(z)
• Challenge: how can E and D be constrained so that the autoencoder fails on anomalies but succeeds on nominal images?
• Autoencoders often learn general-purpose image compression methods
[Diagram: x → E → z → D → x̂]
33. Surprise: The Max Logit Score
• Garrepalli (2020)
• Train a classifier to optimize the softmax likelihood (minimize the "cross-entropy loss")
• The maximum logit score is a better anomaly score than two distance methods:
  • Isolation Forest
  • LOF (a nearest-neighbor method)
[Bar chart: AUROC of anomaly measures on latent representations for CIFAR-100 — H(y|x): 0.68, max softmax prob.: 0.67, max BCE-prob: 0.63, max logit: 0.72, iForest: 0.51, LOF: 0.44]
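The score itself is one line; the logits below are invented to show the intended behavior (a query that activates no class strongly gets a high anomaly score):

```python
import numpy as np

def max_logit_score(logits):
    """A(x) = -max_k logit_k: a low max logit means no class is 'familiar'."""
    return -np.max(logits, axis=-1)

# Hypothetical logits from a trained classifier for three queries.
logits = np.array([
    [9.2, 0.3, -1.1],   # confidently class 0  -> low anomaly score
    [4.0, 3.8, 3.9],    # moderately activated -> middling score
    [0.2, 0.1, -0.3],   # nothing activates    -> high anomaly score
])
scores = max_logit_score(logits)
assert scores[2] > scores[1] > scores[0]
```

No separate model is trained: the detector is read directly off the classifier's logit layer.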
34. More Evidence for Max Logit
• Vaze, Han, Vedaldi, Zisserman (2021): "Open Set Recognition: A Good Classifier Is All You Need" (ICLR 2022; arXiv 2110.06207)
• Carefully train a classifier using the latest tricks: standard cross-entropy combined with
  • a cosine learning-rate schedule
  • learning-rate warmup
  • RandAugment augmentations
  • label smoothing
• Anomaly score: max logit, A(x) = −max_k ℓ_k
• Protocol from Lawrence Neal, et al. (2018)
35. Still More Evidence for Max Logit
• Novel-class difficulty based on semantic distance
  • CUB: bird species
  • Air: aircraft
  • ImageNet
37. How Are Open-Set Images Represented by Deep Learning? (Alex Guyer)
• DenseNet with a 384-dimensional latent space
• CIFAR-10: 6 known classes, 4 novel classes
• UMAP visualization
  • light green: novel classes
  • darker greens: known classes
• Note that many novel classes stay toward the center of the space; others overlap with known classes
• Training was not required to "pull them out" so that they could be discriminated
[UMAP plot: embeddings of the 6 known classes and 4 novel classes]
38. Similar Results from Other Groups
[Tack, et al., NeurIPS 2020] [Vaze, et al., arXiv 2110.06207]
39. The Familiarity Hypothesis
• A convolutional neural network learns "features" that detect image patches relevant to the classification task
• The logit layer weights these features to make the classification decision
• Novel classes activate fewer of these features, so their activation vectors are smaller
• Hypothesis: the network doesn't detect that an elephant is novel because of its trunk and tusks, but because its head doesn't activate known features
• The network doesn't detect novelty; it detects the absence of familiarity
40. Evidence: Number of Activated Features (Alex Guyer, unpublished)
• Novel images strongly activate fewer features
• CIFAR-10: 6 known classes, 4 novel classes
• DenseNet (z has 324 dimensions)
• Choose an activation threshold τ and count the number of features whose activation exceeds τ
• OOD images activate fewer features
41. Which Features Are Responsible for the Drop in Activation?
• Are the features "on" the object or on the background?
• Strategy: blur the object and see how the feature activations change
  • activations that change must be on the object
• Details:
  • PASCAL VOC segmented images
  • blur the original image (31×31 kernel; sd = 31)
  • form a composite image in which the blurred region replaces the segmented region
(Blur tool: https://www.peko-step.com/en/tool/blur.html)
42. Blurring Examples
Note: blurring does not remove all object-related information (e.g., the object boundary), so we do not detect all on-object features.
43. Blurring Effect
• Define the "blurring effect" of feature j on image i:
  BE(i, j) = z_ij − z̃_ij
  where z_ij is the activation of latent feature j on image i, and z̃_ij is the activation of feature j on the blurred version of image i
• "Presence feature": BE(i, j) > 0
  • Blurring decreases the activity of the feature; its net effect is to measure the presence of one or more image patterns
  • Its activity is high when those patterns are present
• "Absence feature": BE(i, j) < 0
  • Blurring increases the activity of the feature; its net effect is to measure the absence of one or more image patterns
  • Its activity is high when those patterns are absent
44. "On-Object" Score of Feature j for Class k
• On average, the activation of a feature changes when an object of class k is blurred:
  OO(j, k) = (1/N_k) Σ_{i : y_i = k} (z_ij − z̃_ij)
• Feature j is a net presence feature for class k if OO(j, k) > 0.02
• Feature j is a net absence feature for class k if OO(j, k) < −0.02
• Otherwise, j is net neutral for class k
45. Feature Taxonomy
• The logit score is ℓ_k = Σ_j w_jk z_ij
• Contribution of feature j in image i to class k:
  • c_ijk = w_jk z_ij (in normal images)
  • c̃_ijk = w_jk z̃_ij (in blurred images)
• Mean contributions over the images of class k:
  • c̄_jk = (1/N_k) Σ_{i : y_i = k} c_ijk
  • c̄̃_jk = (1/N_k) Σ_{i : y_i = k} c̃_ijk
• Taxonomy:

  |                   | w_jk > 0          | w_jk < 0          |
  | OO(j, k) > 0.02   | positive presence | negative presence |
  | OO(j, k) < −0.02  | positive absence  | negative absence  |

See also Sun & Li, "On the Effectiveness of Sparsification for Detecting the Deep Unknowns," arXiv 2111.09805.
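The blurring effect, the OO score, and the four-way taxonomy follow directly from the definitions above; the activations and weights below are invented toy values, not measurements from the study:

```python
import numpy as np

def blur_effect(Z, Z_blur):
    """BE[i, j] = z_ij - z~_ij for every image i and latent feature j."""
    return Z - Z_blur

def on_object_score(Z, Z_blur, y, k):
    """OO(j, k): mean activation change of each feature over images of class k."""
    mask = (y == k)
    return blur_effect(Z[mask], Z_blur[mask]).mean(axis=0)

def feature_taxonomy(W, oo, k, eps=0.02):
    """Label each feature for class k by the signs of its weight and OO score."""
    labels = []
    for j in range(len(oo)):
        if oo[j] > eps:
            kind = "presence"
        elif oo[j] < -eps:
            kind = "absence"
        else:
            labels.append("neutral")
            continue
        sign = "positive" if W[j, k] > 0 else "negative"
        labels.append(f"{sign} {kind}")
    return labels

# Toy example: 4 images of class 0, 3 latent features, 1 class.
Z = np.array([[1.0, 0.2, 0.5]] * 4)         # unblurred activations z_ij
Z_blur = np.array([[0.5, 0.6, 0.5]] * 4)    # blurring lowers f0, raises f1
y = np.zeros(4, dtype=int)
W = np.array([[0.8], [-0.4], [0.1]])        # logit weights w_jk
oo = on_object_score(Z, Z_blur, y, k=0)     # [0.5, -0.4, 0.0]
assert feature_taxonomy(W, oo, k=0) == ["positive presence", "negative absence", "neutral"]
```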
46. Mean Feature Types for Class 3
[Scatter plot: on-object index for positive and negative features; red = presence features, blue = absence features; axes run from 0.00 to 1.00]
47. Zoomed View: Blurring Reduces c̄_jk
• Blurring reduces the contribution of positive presence features (red dots)
• Blurring also reduces the contribution of negative absence features (blue dots)
[Scatter plot: mean unblurred contribution vs. mean blurred contribution, with the on-object index shown for presence and absence features]
48. Decomposing the Logit Score: Four Cases
• Positive presence: w_jk > 0 and OO(j, k) > 0
• Positive absence: w_jk > 0 and OO(j, k) < 0
• Negative presence: w_jk < 0 and OO(j, k) > 0
• Negative absence: w_jk < 0 and OO(j, k) < 0
52. Decomposing the Novelty Scores
• The positive presence features dominate the max logit score
• The negative absence and positive absence features (purple and blue lines) make a small contribution
• Negative presence features make no contribution
• Conclusion: decreases in the activations of positive presence features account for most of the max logit score
53. Decreases in Positive Presence Features Account for Novelty-Detection Accuracy
• Red line: trend of the positive presence contribution to the max logit score
• Black line: smoothed estimate of classification accuracy ("known" vs. "novel")
54. Can We Expect Computer Vision Systems to Perceive Things They Have Not Been Trained On?
• Blakemore, Colin, and Grahame F. Cooper. "Development of the brain depends on the visual environment." (1970): 477-478.
  • Kittens were raised in environments with only horizontal or only vertical lines
  • "They were virtually blind for contours perpendicular to the orientation they had experienced."
• Chomsky: "poverty of the stimulus"
(Image source: Li Yang Ku, https://computervisionblog.wordpress.com/2013/06/01/cats-and-vision-is-vision-acquired-or-innate/)
55. Implications
• Advantages of familiarity-based anomaly detection:
  • Easy to implement: the anomaly signal (max logit) can be extracted from the classifier; no separate anomaly-detection model is needed
  • Training on additional, auxiliary classes improves both classification and anomaly-detection performance
• Weaknesses of familiarity-based anomaly detection:
  • Partially occluded nominal objects will be flagged as anomalies
  • If an image contains both a novel object and a known object, the novel object will not be detected
  • Adversarial attacks can easily cause false anomalies and missed anomalies
57. Challenges for Anomaly Detection
• Representation: can we learn deep representations that can represent outliers?
• Nonstationarity: as the world changes, the anomaly detection model must also change
• Explanation: users often want to know why something was labeled anomalous, in order to provide feedback or take other actions
• Setting alarm thresholds: how can we set a threshold to control the false-alarm and missed-alarm rates?
• Incremental (continual) learning in deep networks: how can we efficiently update a trained neural network to incorporate user feedback?
• Anomaly detection in temporal, spatial, and spatio-temporal data, in video data, etc.
• Anomaly detection at multiple scales
59. Shallow and Deep Methods for Anomaly Detection (Summary)
• Four basic methods
  • Distances, densities, density quantiles, and reconstruction
  • Distances work best; Isolation Forest is very robust
• Anomaly detection in deep learning
  • The four basic methods have been extended to deep learning
  • They often do not work well when applied to learned representations
• The classifier max logit score gives very competitive performance
  • Computed as a side effect of standard deep classifiers
  • Measures familiarity rather than novelty, which makes it risky in many settings
• Advances in deep anomaly detection require learning better representations