2. Who we are
• Founded in 2013 by two PhDs who worked at IRCAM
• Won MIREX 2011 in Music Similarity Estimation and Music Classification
• We sell our technology through our API
• A team of 9 today
3. What we want to do
• Create a high-dimensional space where every song is a vector
• Use this space to find similar tracks and classify songs
• Each query must run in under 50 ms across millions of tracks
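The vector-space idea above can be sketched as a brute-force nearest-neighbor search over L2-normalized embeddings (a minimal illustration, not the production system — at millions of tracks an approximate-nearest-neighbor index would replace the exhaustive scan):

```python
import numpy as np

def build_index(embeddings):
    """L2-normalize song embeddings so a dot product equals cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms

def query(index, vec, k=5):
    """Return indices of the k most similar songs (exhaustive scan)."""
    v = vec / np.linalg.norm(vec)
    sims = index @ v
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
index = build_index(rng.standard_normal((1000, 64)))  # 1000 songs, 64-dim
neighbors = query(index, index[42], k=5)
# a track is always its own nearest neighbor (cosine similarity 1)
```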
4. How music information retrieval worked in 2011
• Short-term descriptors: MFCCs, Fluctuation Patterns ("Block-level audio features for music genre classification", Seyerlehner et al.) and much more!
• Pooling techniques: VQ, GMM-SV ("GMM Supervector for content-based music similarity", Charbuillet et al.), VLAD ("Aggregating local descriptors into a compact image representation", Jégou et al.) ...
[Diagram: Audio → short-term descriptors (MFCCs, FP) → pooling (VLAD, GMM-SV)]
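As an illustration of the pooling step, here is a minimal numpy sketch of VLAD aggregation (sum of residuals to the nearest codeword, following Jégou et al.); the frame dimensions and codebook size are illustrative, and a real codebook would be learned with k-means:

```python
import numpy as np

def vlad(descriptors, codebook):
    """VLAD: for each codeword, sum the residuals of the local
    descriptors assigned to it, then flatten and L2-normalize."""
    # assign each descriptor to its nearest codeword
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    K, D = codebook.shape
    out = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            out[k] = (members - codebook[k]).sum(axis=0)
    v = out.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(1)
frames = rng.standard_normal((500, 13))   # e.g. 13-dim MFCC frames
codebook = rng.standard_normal((16, 13))  # 16 codewords
vec = vlad(frames, codebook)              # fixed-size 16*13 = 208-dim vector
```

Whatever the track length, the output has a fixed dimension, which is what makes the pooled vector usable for indexing.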
5. One of our evaluation datasets
• Evaluation metrics for a search engine: Precision at K (P@k) or mean Average Precision
• Evaluation set presented here: 8,500 tracks in 141 playlists of mainstream music

P@k         1      5      10     20     50
mirex2011   17.48  15.39  13.87  12.23  10.00
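Precision at K is simple to state in code (a toy sketch: tracks from the same playlist as the query count as relevant):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    topk = ranked_ids[:k]
    return sum(1 for i in topk if i in relevant_ids) / k

# toy example: 3 of the top 5 results are relevant
ranked = [3, 7, 1, 9, 4]
relevant = {3, 1, 4}
precision_at_k(ranked, relevant, 5)  # 0.6
```

The table values are this quantity averaged over all queries in the evaluation set.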
6. From 2013 to 2014 @niland
• How to make a product from research work!
• A lot of work on short-term descriptors and pooling techniques
• But still completely unsupervised: no real way to match outputs with human perception!

P@k         1      5      10     20     50
mirex2011   17.48  15.39  13.87  12.23  10.00
2014        19.70  16.81  15.37  13.57  11.01
%           +12.70 +9.23  +10.81 +10.96 +10.10
7. Matching algorithm outputs with human perception
• Learn to predict the outputs of a collaborative filtering model ("Deep content-based music recommendation", van den Oord et al.)
• Or use a network trained to classify into groups of similar tracks
8. Integrating human idea of similarity
• 150k tracks in 3,500 theme-based albums from our clients
• Each album represents a genre, a mood, or a usage
• Each gathers socially similar tracks
9. Learning with theme-based albums
• We use outputs from our previous system
• We train it with a classification cost
• And remove the classification layer!

P@k         1      5      10     20     50
2014        19.70  16.81  15.37  13.57  11.01
+deep       23.40  21.09  19.68  18.07  15.19
%           +18.78 +25.46 +28.04 +33.16 +37.97
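"Train with a classification cost, then remove the classification layer" can be sketched as follows (a toy numpy forward pass with random weights standing in for a trained network; the layer sizes are illustrative, with 3,500 outputs matching the number of albums):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical trained weights: input features -> hidden -> album classes
W1, b1 = rng.standard_normal((128, 64)), np.zeros(64)
W2, b2 = rng.standard_normal((64, 3500)), np.zeros(3500)  # 3,500 albums

def classify(x):
    """Full network: album logits, used only during training."""
    h = np.maximum(x @ W1 + b1, 0)  # hidden layer (ReLU)
    return h @ W2 + b2

def embed(x):
    """Drop the classification layer: the hidden activation
    becomes the similarity embedding."""
    return np.maximum(x @ W1 + b1, 0)

x = rng.standard_normal(128)
emb = embed(x)  # 64-dim vector used for nearest-neighbor search
```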
10. What if we want to remove the highly engineered features and pooling techniques?
Convolutional Neural Networks for Image Recognition :
Source: http://www.clarifai.com/technology
11. And for music ?
• Mel-spectrogram (time-frequency representation) as input: the axes have different meanings! Should we really use square filters?
• Labels on the whole track (>= 30 seconds): the input is 128x1200 for a 30-second song! We have to pool along the time axis!
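The two points above can be made concrete with shapes alone (a sketch with random data; the 128x1200 input matches a 30-second mel-spectrogram, and the filter sizes are illustrative):

```python
import numpy as np

# a 30 s mel-spectrogram: 128 mel bands x 1200 time frames
spec = np.random.default_rng(0).standard_normal((128, 1200))

# an image-style "square" filter sees a local time-frequency patch
square_filter = (3, 3)

# a music-style filter can instead span all frequency bands at once,
# so the convolution slides along time only
full_freq_filter = (128, 4)
out_len = spec.shape[1] - full_freq_filter[1] + 1  # valid time positions

# pooling along the remaining time axis gives a fixed-size vector,
# independent of track length
pooled = spec.max(axis=1)  # shape (128,)
```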
12. And for music ?
Source : Sander Dieleman, http://benanne.github.io/2014/08/05/spotify-cnns.html
13. And for music ?
Some ideas to slightly improve it :
• Multi-scale pooling
• Reduce max pooling
• Add batch-norm
P@k         1      5      10     20     50
2014+deep   23.40  21.09  19.68  18.07  15.19
CNN         23.85  21.31  19.81  18.06  15.18
14. Okay, so ?
• Our 2014 system is a mix of 6 different short-term descriptors + 6 different "smart" pooling functions: 10 years of research!
• Has the engineering problem become a data problem?

P@k         1      5      10     20     50
2014+deep   23.40  21.09  19.68  18.07  15.19
CNN         23.85  21.31  19.81  18.06  15.18
15. From Fisher Vectors to simple pooling functions?
• A very simple pooling function can give great results!

P@k           1      5      10     20     50
Mean          20.94  19.04  17.69  16.17  13.74
Max           22.21  19.90  18.58  17.07  14.61
Var           21.66  19.46  18.14  16.58  14.13
Mean+Max+Var  23.85  21.31  19.81  18.06  15.18
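The best row above simply concatenates three elementary statistics over time; a minimal numpy sketch (frame and feature dimensions are illustrative):

```python
import numpy as np

def pool(features):
    """Concatenate mean, max, and variance over the time axis.
    features: (n_frames, dim) short-term descriptors."""
    return np.concatenate([
        features.mean(axis=0),
        features.max(axis=0),
        features.var(axis=0),
    ])

frames = np.random.default_rng(0).standard_normal((1200, 128))
vec = pool(frames)  # fixed-size 3 * 128 = 384-dim track representation
```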
16. And with square filters?
• Square filters also seem to work!

P@k    1      5      10     20     50
CNN    23.85  21.31  19.81  18.06  15.18
CNNsq  22.94  20.84  19.79  18.15  15.52
17. A transferable model for music
• It also works for world music, library music…
• This dataset: 10k tracks of library music in 300 groups

P@k        1      5      10     20     50
2014+deep  30.66  19.99  15.57  11.81  7.93
CNN        29.76  19.82  15.55  11.85  7.80
18. The spectrogram is still an engineered feature…
Could we learn a better temporal filter bank to replace the FFT and mel filtering?
"End-to-end learning for music audio", Dieleman et al.
"Learning the speech front-end with raw waveform CLDNNs", Sainath et al.
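The idea of a learned front-end can be sketched as 1-D filters slid over the raw waveform, with rectification standing in for the magnitude of an FFT bin (a toy numpy version with random filters; in the cited work the filters are learned by backpropagation, and sample rate, filter count, and hop are illustrative):

```python
import numpy as np

def learned_frontend(waveform, filters, hop):
    """Replace FFT + mel filterbank with learned 1-D convolutions:
    each filter is correlated with a waveform segment per hop, and the
    rectified responses form one column of a spectrogram-like output."""
    n_filters, flen = filters.shape
    n_frames = (len(waveform) - flen) // hop + 1
    out = np.empty((n_filters, n_frames))
    for t in range(n_frames):
        seg = waveform[t * hop : t * hop + flen]
        out[:, t] = np.abs(filters @ seg)  # rectified filter responses
    return out

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)          # 1 s of audio at 16 kHz
filters = rng.standard_normal((40, 512))   # 40 learnable filters
frames = learned_frontend(wave, filters, hop=256)  # (40, n_frames)
```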
20. The spectrogram is still an engineered feature…
Maybe we need more data?

P@k      1      5      10     20     50
Raw      20.11  18.95  17.23  15.91  14.26
Spectro  23.85  21.31  19.81  18.06  15.18
21. We can improve !
• Add more albums!
• With 500k tracks? 1M?

P@k          1      5      10     20     50
25k tracks   19.84  17.98  15.21  14.06  13.41
150k tracks  23.85  21.31  19.81  18.06  15.18
22. And …
• Add more layers!
"Deep Residual Learning for Image Recognition", He et al.

P@k        1      5      10     20     50
PlainNet9  23.85  21.31  19.81  18.06  15.18
ResNet78   23.87  22.17  20.98  19.38  16.68
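The residual trick that makes 78 layers trainable is just an identity skip connection around each block (a toy numpy forward pass with random weights; He et al. use convolutions, while this sketch uses dense layers for brevity):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, W1, W2):
    """y = relu(x + F(x)): the identity skip connection lets gradients
    flow through very deep stacks (He et al.)."""
    h = relu(x @ W1)
    return relu(x + h @ W2)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
W1 = rng.standard_normal((64, 64)) * 0.01
W2 = rng.standard_normal((64, 64)) * 0.01
y = residual_block(x, W1, W2)  # same shape as x, so blocks can be stacked
```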
23. And ?
• Data augmentation?
"Exploring data augmentation for improved singing voice detection with neural networks", Schlüter and Grill
• Recurrent neural networks?
• Siamese networks?
"An exploration of deep learning in music informatics", Humphrey et al.
• More data! Or a semi-supervised approach?
"Semi-supervised learning with ladder networks", Rasmus et al.