This document describes improvements made to the self-organizing map (SOM) algorithm to make it more efficient for sparse, high-dimensional input data. The key contributions are a sparse SOM (Sparse-Som) and a sparse batch SOM (Sparse-BSom) algorithm that exploit the sparseness of the data to reduce the computational complexity from O(TMD) to O(TMd), where d ≪ D is the average number of non-zero components per input vector. Sparse-Som speeds up the BMU search and weight-update phases, while Sparse-BSom additionally allows for efficient parallelization. Experiments show that Sparse-Som and Sparse-BSom train significantly faster than the standard SOM on sparse datasets, with comparable or better map quality.
Presentation: The SOM Network
Self-Organizing Map (Kohonen 1982):
an artificial neural network
trained by unsupervised competitive learning
produces a low-dimensional map of the input space
Many applications
Commonly used for data projection, clustering, etc.
The SOM Training Algorithm
Within the main loop:
1. compute the distance between the input and each weight vector
   d_k(t) = \| x(t) - w_k(t) \|^2    (1)
2. find the node whose weight vector is closest to the input, i.e. the best-matching unit (BMU, see the sketch below)
   d_c(t) = \min_k d_k(t)    (2)
3. update the BMU and its neighbors to be closer to the input
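As a rough illustration, steps 1 and 2 can be sketched as follows (a minimal dense C++ sketch with illustrative names, not the actual implementation; the sparse variants come later):

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Minimal dense sketch of steps 1-2: compute d_k for every node (1) and keep
// the best-matching unit (2). W is the M x D codebook, x one input sample.
std::size_t find_bmu(const std::vector<std::vector<double>>& W,
                     const std::vector<double>& x) {
    std::size_t bmu = 0;
    double best = std::numeric_limits<double>::max();
    for (std::size_t k = 0; k < W.size(); ++k) {
        double d = 0.0;                              // squared Euclidean distance (1)
        for (std::size_t j = 0; j < x.size(); ++j) {
            const double diff = x[j] - W[k][j];
            d += diff * diff;
        }
        if (d < best) { best = d; bmu = k; }         // BMU search (2)
    }
    return bmu;
}
```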
Standard Algorithm: The Learning Rule
Update weight vectors at each time step, for a random sample x:
w_k(t+1) = w_k(t) + \alpha(t) h_{ck}(t) [x(t) - w_k(t)]    (3)
α(t) is the decreasing learning rate
hck(t) is the neighborhood function
For example, the Gaussian:
h_{ck}(t) = \exp\left( -\frac{\| r_k - r_c \|^2}{2 \sigma(t)^2} \right)
where r_k and r_c are the positions of node k and of the BMU on the map grid, and \sigma(t) is the (decreasing) neighborhood radius.
[Figure: Gaussian neighborhood]
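Putting rule (3) and the Gaussian neighborhood together, one training step could look roughly like this (a dense, illustrative C++ sketch; the 2-D grid positions r and all names are assumptions, not the actual code):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the on-line update (3) with a Gaussian neighborhood.
// r holds the 2-D grid position of each node, c is the BMU index,
// alpha and sigma are the current learning rate and neighborhood radius.
void update_online(std::vector<std::vector<double>>& W,
                   const std::vector<std::vector<double>>& r,
                   const std::vector<double>& x,
                   std::size_t c, double alpha, double sigma) {
    for (std::size_t k = 0; k < W.size(); ++k) {
        const double d0 = r[k][0] - r[c][0];
        const double d1 = r[k][1] - r[c][1];
        const double h = std::exp(-(d0 * d0 + d1 * d1) / (2.0 * sigma * sigma));
        for (std::size_t j = 0; j < x.size(); ++j)
            W[k][j] += alpha * h * (x[j] - W[k][j]);   // rule (3)
    }
}
```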
The Batch Algorithm
The BMUs and their neighbors are updated once, at the end of each epoch, with the neighborhood-weighted average of all the samples that trigger them:
w_k(t_f) = \frac{\sum_{t'=t_0}^{t_f} h_{ck}(t') x(t')}{\sum_{t'=t_0}^{t_f} h_{ck}(t')}    (4)
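As an illustration, one epoch of (4) can be sketched by accumulating the weighted sums and the weights, then dividing at the end (a dense C++ sketch with assumed names; H[n][k] stands for the neighborhood weight h_ck between sample n's BMU and node k):

```cpp
#include <cstddef>
#include <vector>

// Sketch of the batch update (4): numerator and denominator are accumulated
// over the whole epoch, then each weight vector is replaced by the quotient.
void update_batch(std::vector<std::vector<double>>& W,
                  const std::vector<std::vector<double>>& X,   // N x D samples
                  const std::vector<std::vector<double>>& H) { // N x M weights h_ck
    const std::size_t M = W.size(), D = W[0].size(), N = X.size();
    std::vector<std::vector<double>> num(M, std::vector<double>(D, 0.0));
    std::vector<double> den(M, 0.0);
    for (std::size_t n = 0; n < N; ++n)
        for (std::size_t k = 0; k < M; ++k) {
            den[k] += H[n][k];
            for (std::size_t j = 0; j < D; ++j)
                num[k][j] += H[n][k] * X[n][j];
        }
    for (std::size_t k = 0; k < M; ++k)
        if (den[k] > 0.0)                     // nodes never activated keep their weights
            for (std::size_t j = 0; j < D; ++j)
                W[k][j] = num[k][j] / den[k];
}
```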
Motivation: Reduce the Computing Time
Computing time depends on:
1. the number of training iterations T
2. the number of nodes M in the network
3. the dimension D of the vectors
The issue: large sparse datasets
Many real-world datasets are sparse and high-dimensional, but existing SOM implementations cannot exploit this sparseness to save time.
Overcoming the Dimensionality Problem
A popular option: reduce the input space
Use dimensionality-reduction techniques that are less sensitive to high dimensionality (such as SVD, random mapping, etc.).
But is this the only way?
Let f be the fraction of non-zero values, and let d = D × f.
Can we reduce the SOM complexity from O(TMD) to O(TMd)?
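For a rough sense of the stakes, take a hypothetical bag-of-words dataset with D = 100,000 features of which only 0.1% are non-zero on average (f = 0.001): then d = 100, so an O(TMd) algorithm does roughly 1,000 times less per-sample work than an O(TMD) one.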
Speedup the BMU Search
The squared Euclidean distance (1) is equivalent to:
d_k(t) = \| x(t) \|^2 - 2 (w_k(t) \cdot x(t)) + \| w_k(t) \|^2    (5)
Batch version
This change suffices to make the sparse batch SOM efficient, since \| w_k \|^2 can be computed once per epoch.
Standard version
By storing \| w_k(t) \|^2, we can compute \| w_k(t+1) \|^2 efficiently:
\| w_k(t+1) \|^2 = (1 - \beta(t))^2 \| w_k(t) \|^2 + \beta(t)^2 \| x(t) \|^2 + 2 \beta(t) (1 - \beta(t)) (w_k(t) \cdot x(t))    (6)
where \beta(t) = \alpha(t) h_{ck}(t).
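A possible sketch of the BMU search based on decomposition (5), assuming a simple coordinate-list representation of the sparse sample (the SparseVec type and all names are illustrative): only the dot product touches the non-zero components, while \| w_k \|^2 comes from a cache maintained with (6) (standard version) or recomputed once per epoch (batch version).

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Sparse sample: indices and values of the non-zero components only.
struct SparseVec {
    std::vector<std::size_t> idx;
    std::vector<double> val;
    double sqnorm;                 // ||x||^2, computed once when the sample is loaded
};

// BMU search with (5): d_k = ||x||^2 - 2 (w_k . x) + ||w_k||^2.
// w_sqnorm[k] caches ||w_k||^2, kept up to date via (6) or once per epoch.
std::size_t find_bmu_sparse(const std::vector<std::vector<double>>& W,
                            const std::vector<double>& w_sqnorm,
                            const SparseVec& x) {
    std::size_t bmu = 0;
    double best = std::numeric_limits<double>::max();
    for (std::size_t k = 0; k < W.size(); ++k) {
        double dot = 0.0;
        for (std::size_t i = 0; i < x.idx.size(); ++i)   // O(d) instead of O(D)
            dot += W[k][x.idx[i]] * x.val[i];
        const double d = x.sqnorm - 2.0 * dot + w_sqnorm[k];
        if (d < best) { best = d; bmu = k; }
    }
    return bmu;
}
```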
Sparse-Som: Speedup the Update Phase
We express the learning rule (3) as (Natarajan 1997):
w_k(t+1) = (1 - \beta(t)) \left[ w_k(t) + \frac{\beta(t)}{1 - \beta(t)} x(t) \right]    (7)
Don't update entire weight vectors
We keep the scalar coefficient separately, so only the components where x(t) is non-zero need to be touched (see the sketch below).
Numerical stability
To avoid numerical issues, we use double-precision floating point and rescale the weights when needed.
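A minimal sketch of this idea, assuming each weight vector is stored as a scalar coefficient times a plain vector (w = coef · v); the names and the rescaling threshold are illustrative, not the actual implementation:

```cpp
#include <cstddef>
#include <vector>

// Weight vector stored as w = coef * v, so the multiplicative part of (7)
// is an O(1) update of coef, and only the non-zero components of x touch v.
struct ScaledVec {
    double coef = 1.0;
    std::vector<double> v;
};

// Sparse on-line update (7), with beta = alpha(t) * h_ck(t).
void update_sparse(ScaledVec& w,
                   const std::vector<std::size_t>& x_idx,
                   const std::vector<double>& x_val,
                   double beta) {
    w.coef *= (1.0 - beta);                        // scales the whole vector at once
    const double s = beta / w.coef;                // so that coef * v still tracks w
    for (std::size_t i = 0; i < x_idx.size(); ++i) // O(d): only non-zero components
        w.v[x_idx[i]] += s * x_val[i];
    if (w.coef < 1e-100) {                         // occasional rescaling (illustrative
        for (double& vj : w.v) vj *= w.coef;       // threshold) to avoid underflow
        w.coef = 1.0;
    }
}
```

After the call, coef * v[j] equals (1 - β) w[j] + β x[j], which is exactly rule (7).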
What We Did
Sparse-Som: hard to parallelize
not adapted to data partitioning
too much latency with network partitioning
Sparse-BSom: much simpler
adapted both to data and network partitioning
less synchronization needed (see the sketch below)
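To illustrate why the batch variant parallelizes well, here is a possible data-partitioning sketch of one Sparse-BSom epoch with OpenMP (illustrative only, not the actual implementation; it reuses the SparseVec type and find_bmu_sparse() from the BMU-search sketch above, and r holds the grid position of each node). Each thread works on its share of the samples with private accumulators, so the only synchronization is the merge at the end of the epoch.

```cpp
#include <cmath>
#include <cstddef>
#include <omp.h>
#include <vector>

// Data-partitioning sketch of one batch epoch: threads process disjoint
// subsets of the samples and merge their accumulators once per epoch.
void bsom_epoch(std::vector<std::vector<double>>& W,       // M x D codebook
                std::vector<double>& w_sqnorm,             // cached ||w_k||^2
                const std::vector<SparseVec>& X,           // sparse samples
                const std::vector<std::vector<double>>& r, // node grid positions
                double sigma) {
    const std::size_t M = W.size(), D = W[0].size();
    std::vector<std::vector<double>> num(M, std::vector<double>(D, 0.0));
    std::vector<double> den(M, 0.0);

    #pragma omp parallel
    {
        // thread-private accumulators: no locking inside the sample loop
        std::vector<std::vector<double>> num_p(M, std::vector<double>(D, 0.0));
        std::vector<double> den_p(M, 0.0);

        #pragma omp for schedule(static)
        for (std::size_t n = 0; n < X.size(); ++n) {
            const std::size_t c = find_bmu_sparse(W, w_sqnorm, X[n]);
            for (std::size_t k = 0; k < M; ++k) {
                const double d0 = r[k][0] - r[c][0], d1 = r[k][1] - r[c][1];
                const double h = std::exp(-(d0 * d0 + d1 * d1)
                                          / (2.0 * sigma * sigma));
                den_p[k] += h;
                for (std::size_t i = 0; i < X[n].idx.size(); ++i) // O(d) per node
                    num_p[k][X[n].idx[i]] += h * X[n].val[i];
            }
        }

        #pragma omp critical                      // single merge per thread per epoch
        {
            for (std::size_t k = 0; k < M; ++k) {
                den[k] += den_p[k];
                for (std::size_t j = 0; j < D; ++j) num[k][j] += num_p[k][j];
            }
        }
    }

    for (std::size_t k = 0; k < M; ++k)
        if (den[k] > 0.0) {
            double sq = 0.0;
            for (std::size_t j = 0; j < D; ++j) {
                W[k][j] = num[k][j] / den[k];     // batch rule (4)
                sq += W[k][j] * W[k][j];
            }
            w_sqnorm[k] = sq;                     // refresh ||w_k||^2 once per epoch
        }
}
```

The real code may differ (for instance single-precision accumulators, as the summary mentions for Sparse-BSom, or a different partitioning), but the key point is the single synchronization per epoch.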
Another specific issue due to sparseness
Memory-access latency, because of the non-linear access pattern to the weight vectors.
Mitigation: improve processor cache locality
Access the weight vectors in the inner loop (see the sketch below).
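One possible realization of this idea (an assumption about the data layout, not necessarily the authors' choice): store the codebook feature-major, so that iterating over the non-zero features of x in the outer loop and over the nodes in the inner loop reads contiguous memory.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Compute all M dot products w_k . x at once, with the codebook stored
// feature-major (Wt[j][k] = component j of node k). The inner loop over the
// nodes then scans a contiguous row, which is cache-friendly even though the
// non-zero feature indices of x are scattered.
void sparse_dots(const std::vector<std::vector<double>>& Wt,  // D x M, feature-major
                 const std::vector<std::size_t>& x_idx,
                 const std::vector<double>& x_val,
                 std::vector<double>& dot) {                  // out: size M
    std::fill(dot.begin(), dot.end(), 0.0);
    for (std::size_t i = 0; i < x_idx.size(); ++i) {          // outer: non-zeros of x
        const std::vector<double>& row = Wt[x_idx[i]];
        const double v = x_val[i];
        for (std::size_t k = 0; k < row.size(); ++k)          // inner: weight vectors
            dot[k] += v * row[k];
    }
}
```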
The Evaluation
We trained SOM networks
on various datasets
with the same parameters
repeating each test 5 times
then measured their performance (speed and quality).
Our speed baseline
Somoclu (Wittek et al. 2017) is a massively parallel batch SOM
implementation, which uses the classical algorithm.
Speed Benchmark
4 datasets were used (2 sparse and 2 dense) to test both:
Serial mode (Sparse-Som vs. Sparse-BSom)
Parallel mode (Sparse-BSom vs. Somoclu)
Hardware and system specifications:
Intel Xeon E5-4610
4 sockets of 6 cores each
clocked at 2.4 GHz
2 threads / core
Linux Ubuntu 16.04 (64-bit)
GCC 5.4
Summary: Main Benefits
Sparse-Som and Sparse-BSom run much faster than their
classical “dense” counterparts with sparse data.
Advantages of each version:
Sparse-Som
maps seem to have a better organization
Sparse-BSom
highly parallelizable
more memory-efficient (single precision)