Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy Data, by Sergül Aydöre, Assistant Professor at Stevens Institute of Technology
"Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy Data"
By Sergül Aydöre, Assistant Professor at Stevens Institute of Technology
Abstract:
The use of complex models (with many parameters) is challenging in high-dimensional, small-sample problems: indeed, they face rapid overfitting. Such situations are common when data collection is expensive, as in neuroscience, biology, or geology. Dedicated regularization can be crafted to tame overfitting, typically via structured penalties. But rich penalties require mathematical expertise and entail large computational costs. Stochastic regularizers such as dropout are easier to implement: they prevent overfitting by random perturbations. Used inside a stochastic optimizer, they come with little additional cost. We propose a structured stochastic regularization that relies on feature grouping. Using a fast clustering algorithm, we define a family of groups of features that capture feature covariations. We then randomly select among these groups inside a stochastic gradient descent loop. This procedure acts as a structured regularizer for high-dimensional correlated data without additional computational cost, and it has a denoising effect. We demonstrate the performance of our approach for logistic regression both on a sample-limited face image dataset with varying additive noise and on a typical high-dimensional learning problem, brain image classification.
1. Using Feature Grouping as a
Stochastic Regularizer for
High Dimensional Noisy Data
Sergül Aydöre
Assistant Professor
Electrical and Computer Engineering
Stevens Institute of Technology
2. 2
Landscape of Machine Learning Applications
https://research.hubspot.com/charts/simplified-ai-landscape
3. 3
• But what if data is high dimensional, noisy, and the sample size is small, as in neuroimaging?
Image captions: PET acquisition process (Wikipedia); implantation of intracranial electrodes (Cleveland Epilepsy Clinic); an elastic EEG cap with 60 electrodes [Bai2012]; a typical MEG system [BML2001]; MRI scanner and rs-fMRI time-series acquisition [NVIDIA].
4. 4
Other High Dimensional, Noisy Data and Small Sample Size Situations
Image captions: Genomics (Integrative Genomics Viewer, 2012); Seismology (https://www.mapnagroup.com); Astronomy (Astronomy Magazine, 2015).
6. 6
Challenges
1. High Dimensionality of the data due to rich temporal and
spatial structure
2. Noise in the data due to mechanical or physical artifacts.
7. 7
Challenges
1. High Dimensionality of the data due to rich temporal and
spatial structure
2. Noise in the data due to mechanical or physical artifacts.
3. Difficulty and cost of data collection
8. 8
Overfitting
• ML models with a large number of parameters require a large amount of data; otherwise, overfitting can occur!
http://scott.fortmann-roe.com/docs/MeasuringError.html
9. 9
Regularization Methods to overcome Overfitting
• Early Stopping [Yao 2007]
• Ridge Regression (ℓ2 regularization) [Tibshirani 1996]
• Least Absolute Shrinkage and Selection Operator (LASSO or ℓ1 regularization) [Tibshirani 1996]
• Dropout [Srivastava 2014]
• Group Lasso [Yuan 2006]
10. Regularization Methods to overcome Overfitting
• Early Stopping
• Ridge Regression (ℓ2 regularization)
• Least Absolute Shrinkage and Selection Operator (LASSO or ℓ1 regularization) → SPARSITY
• Dropout
• Group Lasso
11. Regularization Methods to overcome Overfitting
• Early Stopping
• Ridge Regression (ℓ2 regularization)
• Least Absolute Shrinkage and Selection Operator (LASSO or ℓ1 regularization) → SPARSITY
• Dropout → STOCHASTICITY
• Group Lasso
12. 12
Regularization Methods to overcome Overfitting
• Early Stopping
• Ridge Regression (ℓ2 regularization)
• Least Absolute Shrinkage and Selection Operator (LASSO or ℓ1 regularization) → SPARSITY
• Dropout → STOCHASTICITY
• Group Lasso → STRUCTURE & SPARSITY
13. 13
Regularization Methods to overcome Overfitting
• Early Stopping
• Ridge Regression (ℓ2 regularization)
• Least Absolute Shrinkage and Selection Operator (LASSO or ℓ1 regularization) → SPARSITY
• Dropout → STOCHASTICITY
• Group Lasso → STRUCTURE & SPARSITY
• PROPOSED: STRUCTURE & STOCHASTICITY
(A small scikit-learn sketch of the ℓ1/ℓ2 penalties from this list follows below.)
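As a point of reference, here is a minimal scikit-learn sketch of how the ℓ1 and ℓ2 penalties from the list above are applied to logistic regression. This is an illustration only, not from the slides; the synthetic dataset and the regularization strength C are placeholder choices.

# Minimal sketch: l2 (ridge-style) vs l1 (sparsity-inducing) penalties for logistic regression.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder high-dimensional, small-sample data.
X, y = make_classification(n_samples=200, n_features=2000, n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, clf in [
    ("l2", LogisticRegression(penalty="l2", C=1.0, max_iter=1000)),
    ("l1", LogisticRegression(penalty="l1", C=1.0, solver="liblinear")),
]:
    clf.fit(X_tr, y_tr)
    print(name, "test accuracy:", clf.score(X_te, y_te))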
14. 14
Problem Setting: Supervised Learning
• Training samples: {(x_i, y_i)}_{i=1}^n drawn from a joint distribution over inputs x_i ∈ R^p and targets y_i.
• Parameters of the model are estimated by minimizing the average loss per sample: ŵ = argmin_w (1/n) Σ_{i=1}^n ℓ(f(x_i; w), y_i).
16. 16
Dropout
• Randomly removes units in the network during training (a minimal masking sketch follows below).
• Idea: prevents units from co-adapting too much.
• Attractive property: can be used inside stochastic gradient descent without additional computation cost.
[Srivastava 2014]
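To make the mechanism concrete, here is a minimal NumPy sketch of inverted dropout applied to an input minibatch inside a training step. This is my own illustration, not the paper's code; the drop rate and shapes are placeholder choices.

import numpy as np

rng = np.random.default_rng(0)

def dropout(X, drop_rate=0.5):
    # Inverted dropout: zero features at random and rescale the survivors so the
    # expected value of the masked input matches the original input.
    keep_prob = 1.0 - drop_rate
    mask = rng.binomial(1, keep_prob, size=X.shape) / keep_prob
    return X * mask

# Placeholder minibatch: 32 samples, 4096 features (e.g. a 64 x 64 image, flattened).
X_batch = rng.standard_normal((32, 4096))
X_masked = dropout(X_batch, drop_rate=0.5)
# X_masked is fed to the model inside the SGD loop; at test time no mask is applied.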
30. 30
Replace Masking with Structured Matrices
• We project the training samples onto a lower-dimensional space with a structured matrix Φ, so that each sample x is replaced by its approximation ΦᵀΦx ≈ x.
• Hence, the weight matrix effectively operates on the reduced representation Φx.
31. 31
Replace Masking with Structured Matrices
• To update the weights, we project the gradients back to the original feature space (see the sketch below).
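Below is a hedged NumPy sketch of how I read these two slides: at each SGD step a feature-grouping matrix Φ is drawn from a precomputed family, the minibatch is replaced by its approximation ΦᵀΦX, and the resulting gradient is expressed in the original feature space. The binary logistic-regression gradient and names such as phi_family are my own illustrative choices, not the paper's code.

import numpy as np

rng = np.random.default_rng(0)

def sgd_step_with_feature_grouping(w, X_batch, y_batch, phi_family, lr=0.1):
    # One SGD step for binary logistic regression where the minibatch is
    # replaced by its piecewise-constant approximation Phi^T Phi X (sketch).
    Phi = phi_family[rng.integers(len(phi_family))]        # randomly pick one grouping matrix (k x p)
    X_approx = X_batch @ Phi.T @ Phi                        # project down to k groups, then back up to p features
    p_hat = 1.0 / (1.0 + np.exp(-X_approx @ w))             # predicted probabilities
    grad = X_approx.T @ (p_hat - y_batch) / len(y_batch)    # gradient, already in the original space
    return w - lr * grad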
36. 36
Recursive Nearest Agglomeration Clustering (ReNA) [Hoyos-Idrobo 2016]
• Agglomerative clustering schemes start by placing every data element in its own cluster.
• They proceed by repeatedly merging the closest pair of connected clusters until the desired number of clusters is reached (a stand-in clustering sketch follows below).
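ReNA itself is a fast recursive variant of this idea. As a stand-in (my own sketch, not the ReNA implementation), spatially constrained agglomerative clustering of features can be illustrated with scikit-learn's FeatureAgglomeration, where the connectivity graph restricts merges to neighboring pixels:

import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.feature_extraction.image import grid_to_graph

# Placeholder: group the 4096 pixels of 64 x 64 images into 256 spatially
# connected clusters, based on how pixels co-vary across the training images.
n_x, n_y, n_clusters = 64, 64, 256
connectivity = grid_to_graph(n_x, n_y)   # pixel adjacency graph: merges stay local

X_train = np.random.default_rng(0).standard_normal((300, n_x * n_y))  # placeholder data
agglo = FeatureAgglomeration(n_clusters=n_clusters, connectivity=connectivity)
agglo.fit(X_train)
labels = agglo.labels_                   # labels[j] = cluster of feature (pixel) j
# These cluster labels define one feature-grouping matrix Phi_FG.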
37. 37
Insights: Random Reductions While Fitting
• Let x = μ + ε, where μ is the deterministic term and ε is the zero-mean noise term.
• The expected objective then decomposes into the loss on the smoothed input plus a regularization cost; the latter combines the variance of the model given the smoothed input features and the variance of the estimated target due to the randomization (a hedged reconstruction is sketched below).
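The slide's equation did not survive extraction. As a rough, hedged reconstruction of the generic noise-injection argument it appears to describe, not necessarily the paper's exact expression, a second-order expansion of the loss around the smoothed input x̄ = E_Φ[ΦᵀΦ]x gives terms matching the labels above:

E_Φ[ ℓ(w; ΦᵀΦx, y) ]  ≈  ℓ(w; x̄, y)  +  ½ tr( ∇²ₓ ℓ(w; x̄, y) · Cov_Φ[ΦᵀΦx] )

The first term is the loss on the smoothed input; the trace term is the regularization cost induced by the randomization.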
38. Insights: Random Reductions While Fitting
• Regularization cost: for dropout, ΦᵀΦ is a diagonal matrix whose entries are independent (rescaled) Bernoulli variables, one per feature, and for linear regression the corresponding curvature term is constant.
• This is equivalent to ridge regression after "orthogonalizing" the features (see the formula below).
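The slide's formula is also missing from the extraction. For linear regression, the well-known form of this equivalence (Wager et al., 2013), given here as a hedged reconstruction rather than the slide's own expression, is:

E[ ‖y − X̃w‖² ]  =  ‖y − Xw‖²  +  (δ / (1 − δ)) · wᵀ diag(XᵀX) w

where X̃ is X with each entry kept with probability 1 − δ and rescaled by 1/(1 − δ); the added term is a ridge penalty after rescaling ("orthogonalizing") each feature by its norm.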
40. 40
Experimental Results: Olivetti Faces
• High-dimensional data with a small sample size
• Consists of grayscale 64 x 64 face images from 40 subjects
• For each subject, there are 10 different images with varying light
• Goal: identify the individual whose picture was taken (a minimal data-loading sketch follows below)
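A minimal scikit-learn sketch of this setup (my own illustration, not the experiment's actual code): load the Olivetti faces, add Gaussian noise, and fit a multinomial logistic regression to identify the subject. The noise level, split, and solver settings are placeholder choices.

import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

faces = fetch_olivetti_faces()                    # 400 images of 40 subjects, 64 x 64 pixels
X, y = faces.data, faces.target                   # X: (400, 4096), y: subject id in 0..39

rng = np.random.default_rng(0)
X_noisy = X + 0.3 * rng.standard_normal(X.shape)  # placeholder additive noise level

X_tr, X_te, y_tr, y_te = train_test_split(X_noisy, y, test_size=0.25, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=2000)           # multinomial logistic regression baseline
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))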
55. 55
Experimental Results: Olivetti Faces
• Visualization of the learned weights for logistic regression for a single
Olivetti face with high noise using different regularizers.
56. 56
Experimental Results: Olivetti Faces
• Performance, in terms of loss as a function of computation time, for a single-layer MLP using feature grouping versus the best parameters for the other regularizers, on Olivetti face data with high noise.
57. 57
Experimental Results: Neuroimaging Data Set
• Openly accessible fMRI data set from the Human Connectome Project
• 500 subjects, 8 cognitive tasks to classify
• Feature dimension: 33,854; training set: 3,052 samples; test set: 791 samples
60. 60
Summary – Stochastic Regularizer
• We introduced a stochastic regularizer based on feature averaging that captures the structure of the data.
• Our approach leads to higher accuracy in high-noise settings without additional computation time.
• Learned weights have more structure in high-noise settings.
61. 61
Collaborators and References
• S. Aydore, B. Thirion, O. Grisel, G. Varoquaux. "Using Feature Grouping as a Stochastic Regularizer for High-Dimensional Noisy Data", Women in Machine Learning Workshop, NeurIPS 2018, Montreal, Canada. arXiv:1807.11718.
• S. Aydore, L. Dicker, D. Foster. "A Local Regret in Nonconvex Online Learning", Continual Learning Workshop, NeurIPS 2018, Montreal, Canada. arXiv:1811.05095.
Collaborators: Bertrand Thirion (INRIA, France), Olivier Grisel (INRIA, France), Gaël Varoquaux (INRIA, France), Dean Foster (Amazon & University of Pennsylvania), Lee Dicker (Amazon & Rutgers University)
In the graphic below, the x-axis reflects the level of technical sophistication the AI tool has. The y-axis represents the mass appeal of the tool.
Here is a landscape of popular machine learning applications. It is of course very exciting to see such progress in AI. But all these applications require massive amounts of data to train machine learning models.
Some fields, such as brain imaging, often do not have such massive numbers of samples, whereas the dimension of the features is large due to the rich spatial and temporal information.
This problem is not limited to brain imaging. There are other fields which also suffer from small-sample data situations.
The performance of machine learning models is often evaluated by their prediction ability on unseen data. While each iteration of model training decreases the training risk, fitting the training data too well can lead to failure to generalize on future predictions. This phenomenon is called "overfitting" in machine learning. The risk of overfitting is more severe in high-dimensional, data-scarce situations. Such situations are common when data collection is expensive, as in neuroscience, biology, or geology.
Feature grouping defines a matrix Φ_FG that extracts piecewise-constant approximations of the data. Let Φ_FG ∈ R^{k×p} be a matrix composed of constant-amplitude groups (clusters). Formally, the set of k clusters is given by P = {C_1, C_2, ..., C_k}, where each cluster C_q ⊂ [p] contains a set of indexes that does not overlap the other clusters, C_q ∩ C_l = ∅ for all q ≠ l. Thus, (Φ_FG x)_q = α_q Σ_{j∈C_q} x_j yields a reduction of a data sample x on the q-th cluster, where α_q is a constant for each cluster. With an appropriate permutation of the indexes of the data x, the matrix Φ_FG can be written as a block matrix with one constant-amplitude block per cluster. We call Φ_FG x ∈ R^k the reduced version of x and Φ_FGᵀΦ_FG x ∈ R^p the approximation of x.
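To make the definition above concrete, here is a small NumPy sketch (my own illustration) that builds Φ_FG from cluster labels and computes the reduced and approximated versions of a sample. The choice α_q = 1/√|C_q| is one natural option: it makes Φ_FGᵀΦ_FG x the within-cluster mean, i.e. a piecewise-constant approximation.

import numpy as np

def feature_grouping_matrix(labels, n_features):
    # Build Phi_FG (k x p) from cluster labels: one row per cluster, with constant
    # amplitude alpha_q = 1/sqrt(|C_q|) on that cluster's features.
    clusters = np.unique(labels)
    Phi = np.zeros((len(clusters), n_features))
    for q, c in enumerate(clusters):
        members = np.flatnonzero(labels == c)
        Phi[q, members] = 1.0 / np.sqrt(len(members))
    return Phi

# Toy example: 6 features grouped into 3 clusters.
labels = np.array([0, 0, 1, 1, 1, 2])
x = np.array([1.0, 3.0, 2.0, 4.0, 6.0, 5.0])
Phi = feature_grouping_matrix(labels, n_features=6)
x_reduced = Phi @ x          # in R^k: one value per cluster
x_approx = Phi.T @ Phi @ x   # in R^p: piecewise-constant approximation of x (cluster means)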