End-to-end Music Classification

A look at the short history of end-to-end music classification models and the efforts made to understand how they work.
Techtalk @ Naver Green Factory - 2018. 05. 09.


  1. End-to-end Music Classification
  2. • MIR ‣ 2017 DJ music classification • … ‣ … ‣ … • ^_^ ‣ - ML for Music - Automated feature engineering using RL, ICASSP 2018 😆
  3. Music Classification, Why?
     • Classification as “representation learning”
       ‣ Music streaming services?
       ‣ Content-based recommendation
     • Music streaming service ‣ …!
     • … ‣ “…” ‣ …
  4. End-to-end Music Classification
     • History of E2E Music Classification Models
       ‣ E2E …?
     • Interpretability of E2E Music Classification Models
       ‣ …
  5. • Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms (2017)
       Jongpil Lee, Jiyoung Park, Keunhyoung Luke Kim and Juhan Nam
       Sound and Music Computing Conf. (SMC), 2017.
       ‣ An end-to-end approach to music classification
       ‣ Frequency …
     • Sample-level CNN Architectures for Music Auto-tagging Using Raw Waveforms (2018)
       Taejun Kim, Jongpil Lee and Juhan Nam
       IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2018.
       ‣ CNN architecture …
       ‣ Loudness …
  6. Spectrogram to End-to-end (three stages)
     1) Handcrafted spectrogram • 1D or 2D conv.
     2) 2014 • End-to-end • STFT replaced by a conv. layer • (frame-level)
     3) 2017 (SampleCNN) • End-to-end • conv. layers far smaller than an STFT window • (sample-level)
  7. Frame- vs. Sample-level
     • …
     • Sample-level
       ‣ 1D convolution with very small filters
       ‣ e.g. 7 samples ( 0.14 ms)*
     • Frame-level
       ‣ 1D convolution with STFT-window-sized filters
       ‣ e.g. 256 samples ( 12 ms)*
       ‣ Spectrogram models are also frame-level
         → Trade-off between time- & frequency resolution!
     • * at a sampling rate of 22,050 Hz
  8. Frame-level Mel-spectrogram Model
     1) 2D Convolution
        • Treats the spectrogram as an image (e.g. MNIST)
        • …
     2) 1D Convolution
        • Treats the spectrogram as a 1-D sequence over time
        • Frequency dim. = channel dim.
        • End-to-end …
     (a sketch contrasting the two follows below)
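To make the 2D-vs-1D distinction above concrete, here is a minimal sketch (mine, not from the talk), assuming a log-mel input of shape (batch, 128 mel bins, time frames) and PyTorch:

    import torch
    import torch.nn as nn

    mel = torch.randn(8, 128, 256)                  # (batch, mel bins, time frames)

    # 1) 2D convolution: treat the spectrogram as a one-channel image (as with MNIST).
    conv2d = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)
    out2d = conv2d(mel.unsqueeze(1))                # -> (8, 32, 128, 256)

    # 2) 1D convolution: treat it as a sequence over time,
    #    using the frequency dimension as the channel dimension.
    conv1d = nn.Conv1d(in_channels=128, out_channels=256, kernel_size=3, padding=1)
    out1d = conv1d(mel)                             # -> (8, 256, 256)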
  9. Spectrogram Model
     • Pros
       ‣ … ( 3~7 …)
       ‣ …
       ‣ Phase invariant
     • Cons
       ‣ The mid-level representation is fixed independently of the task
       ‣ Hyperparameter tuning (e.g. window length, hop size, etc.)
       ‣ Phase is discarded
       ‣ Time- & frequency-resolution trade-off
  10. Audio time & frequency resolution
  11. Convolutional Filters Decouple Time & Frequency Resolution
     • Conv. filters are free of the STFT's time & frequency resolution trade-off
     • In a convolution:
       ‣ Time resolution is controlled by the stride - stride↓ → time resolution↑
       ‣ Frequency resolution is controlled by the filters - #filters↑ → frequency resolution↑
       ‣ Stride and #filters decouple time & frequency resolution
  12. Frame-level Raw Waveform Model [2]
     • The first E2E music classification attempt (2014)
       ‣ But it did not beat spectrogram models
         - Still, it showed that E2E music classification is possible
     • Replaces the STFT with a 1D strided conv. layer (sketch below)
       ‣ The 1D conv. learns a spectrogram-like mid-level representation
     • The strided conv. output has spectrogram-like hyperparameters
       ‣ Filter size (= window size of STFT)
       ‣ Stride (= hop size of STFT)
     • [Figure: spectrogram vs. waveform with one 1D conv. net.]
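A minimal sketch of that idea, under my own assumptions (PyTorch, 22,050 Hz input, filter size and stride of 256); this is not the 2014 model itself:

    import torch
    import torch.nn as nn

    waveform = torch.randn(8, 1, 59049)             # (batch, 1, samples)

    # One strided Conv1d plays the role of the STFT:
    # filter size ~ window size, stride ~ hop size.
    frontend = nn.Sequential(
        nn.Conv1d(1, 128, kernel_size=256, stride=256),   # 256 samples ~ 12 ms at 22,050 Hz
        nn.BatchNorm1d(128),
        nn.ReLU(),
    )
    frames = frontend(waveform)                     # (8, 128, 230): a spectrogram-like output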
  13. 1D Conv. on Spectrogram vs. 1D Conv. on Waveform
     [Figure: on a spectrogram, channel = frequency; on a waveform after the strided 1D conv., channel = unsorted, frequency-like]
     • 1D conv. vs. 2D conv., spectrogram vs. waveform …
     • In E2E music classification, the channel dimension plays the role of an (unsorted) frequency axis → …!
  14. Frame-level Raw Waveform Model [2]
     • Why it lagged behind spectrogram models (hypotheses):
       ‣ The CNN was shallow (few layers)
       ‣ No log-based amplitude compression
       ‣ The first conv. layer has to absorb phase variation
     • What has changed since 2014:
       ‣ Batch Norm. and ResNet (deeper networks)
       ‣ Better GPUs ( … )
  15. Sample-level Raw Waveform Model [3]
     • …
       ‣ Log-scale amplitude compression
       ‣ Phase invariance
     • Key ideas:
       ‣ Very small filters - Filter size = one of {2, 3, 4, 5, 7}
       ‣ No STFT-window-sized conv. layer
     • Deeper net.
     [Architecture figure: 59049 × 1 raw waveform → strided conv → 1D convolutional block ×9
      (19683 × 128, 6561 × 128, 2187 × 128, 729 × 256, 243 × 256, 81 × 256, 27 × 256, 9 × 256, 3 × 256, 1 × 512)
      → FC ×2 → 50-tag prediction. SampleCNN! A minimal sketch follows below.]
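A compact sketch of the SampleCNN idea in PyTorch, with hyperparameters read off the figure (filter size 3, stride 3, nine "basic" blocks, 50 tags); this is my own reconstruction, not the authors' released code:

    import torch
    import torch.nn as nn

    def basic_block(in_ch, out_ch):
        # "Basic" 1D convolutional block: Conv1d -> BatchNorm -> relu -> MaxPool(3)
        return nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(),
            nn.MaxPool1d(3))

    class SampleCNN(nn.Module):
        def __init__(self, n_tags=50):
            super().__init__()
            self.strided = nn.Sequential(                 # 59049 -> 19683 samples
                nn.Conv1d(1, 128, kernel_size=3, stride=3),
                nn.BatchNorm1d(128),
                nn.ReLU())
            chans = [128, 128, 128, 256, 256, 256, 256, 256, 256, 512]
            self.blocks = nn.Sequential(
                *[basic_block(chans[i], chans[i + 1]) for i in range(9)])
            self.fc = nn.Sequential(                      # FC x2 -> 50-tag prediction
                nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.5),
                nn.Linear(512, n_tags), nn.Sigmoid())

        def forward(self, x):                             # x: (batch, 1, 59049)
            h = self.blocks(self.strided(x))              # -> (batch, 512, 1)
            return self.fc(h.squeeze(-1))                 # -> (batch, n_tags)

    tags = SampleCNN()(torch.randn(2, 1, 59049))          # multi-label tag probabilities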
  16. Sample-level Raw Waveform Model [3]
     • E2E model … spectrogram …
     • E2E music classification model …
     • Comparison with frame-level models
     • Comparison with the state of the art
  17. Sample-level Raw Waveform Model [3]
     [Figure: architectures/results for filter sizes 2, 3, 4 and 5]
  18. Advanced Sample-level Raw Waveform Model [4]
     • Base architecture: SampleCNN: 59049 × 1 raw waveform → strided conv → 1D convolutional block ×9
       (19683 × 128, 6561 × 128, 2187 × 128, 729 × 256, 243 × 256, 81 × 256, 27 × 256, 9 × 256, 3 × 256, 1 × 512)
       → multi-level feature aggregation (global max pooling over several blocks; 256, 512, 256 in the figure)
       → FC ×2 → 50-tag prediction
     • Two ideas borrowed from image classification:
       1) Convolutional blocks from ResNet & SENet
       2) Multi-level feature aggregation
     • 1D convolutional blocks:
       ‣ Basic block: Conv1D → BatchNorm → relu → MaxPool
       ‣ Res-n block: adds a skip-connection (and Dropout) around n conv. layers
       ‣ SE block: basic block + squeeze-and-excitation path (GlobalAvgPool → FC → relu → FC → sigmoid → Scale; T×C → 1×C → 1×αC → 1×C)
       ‣ ReSE-n block: Res-n & SE combined
  19. Res-n Block
     • From ResNet (winner of the 2015 ImageNet challenges)
     • Motivation:
       ‣ Skip-connections let us train deeper nets
     • n: number of conv. layers in the block (1 or 2)
       ‣ Dropout regularizes the conv. layers (inspired by WideResNet)
     [Block diagram: Conv1D → BatchNorm → relu → Dropout → Conv1D → BatchNorm → (+ skip-connection) → relu → MaxPool; sketch below]
     • AUC on MagnaTagATune for Basic, Res-1, Res-2 (values on the slide: 0.9061, 0.9048, 0.9055)
     • 1.8× … , but …
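A hedged sketch of a Res-n block following the diagram above; the dropout rate and the assumption that input and output channel counts match (so the skip-connection needs no projection) are mine:

    import torch
    import torch.nn as nn

    class ResBlock(nn.Module):
        def __init__(self, channels, n_convs=2, dropout=0.2):
            super().__init__()
            layers = [nn.Conv1d(channels, channels, 3, padding=1), nn.BatchNorm1d(channels)]
            if n_convs == 2:                              # Res-2 adds relu, dropout and a second conv
                layers += [nn.ReLU(), nn.Dropout(dropout),
                           nn.Conv1d(channels, channels, 3, padding=1), nn.BatchNorm1d(channels)]
            self.body = nn.Sequential(*layers)
            self.pool = nn.MaxPool1d(3)

        def forward(self, x):
            out = torch.relu(self.body(x) + x)            # skip-connection, then relu
            return self.pool(out)

    y = ResBlock(128)(torch.randn(2, 128, 729))           # -> (2, 128, 243)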
  20. SE Block
     • From SENet (winner of the 2017 ImageNet challenges)
     • Motivation:
       ‣ Model the relationships between channels and recalibrate them
         - Each channel (= frequency-like) … ( … ) , ( … )
         - Each channel is rescaled by a learned weight in (0~1) (recalibration)
     [Block diagram: Conv1D → BatchNorm → relu → MaxPool → GlobalAvgPool → FC → relu → FC → sigmoid → Scale; T×C → 1×C → 1×αC → 1×C → T×C]
     • AUC on MagnaTagATune for Basic, Res-2, SE (values on the slide: 0.9083, 0.9061, 0.9055); 1.08× … the basic block
  21. SE Block for Image (2D Conv.)
     • Squeeze operation: aggregate the spatial dimensions and produce channel-wise statistics (global spatial information for each channel)
     • Excitation operation: using the statistics, learn channel relationships and produce a weight for each channel
     • Excitations (range 0~1): weight for each channel; reweight each channel using the weights
  22. SE Block for Audio (1D Conv.)
     • Squeeze operation: aggregate the temporal dimension and produce frequency-wise statistics (global temporal statistics for each channel)
     • Excitation operation: using the statistics, learn frequency relationships and produce a weight for each frequency
     • Excitations (range 0~1): weight for each frequency; reweight each frequency using the weights
     [Figure axes: Time × Channel (or frequency-like)]
  23. SE Block for Audio (1D Conv.)
     [Block diagram: Conv1D → BatchNorm → relu → MaxPool (T×C) → GlobalAvgPool (1×C) → FC (1×αC) → relu → FC (1×C) → sigmoid → Scale (T×C); sketch below]
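A minimal sketch of the 1D squeeze-and-excitation path drawn above (the surrounding Conv1D/BatchNorm/MaxPool are left out); the default amplifying ratio used here is only an assumption:

    import torch
    import torch.nn as nn

    class SE1d(nn.Module):
        def __init__(self, channels, alpha=16):
            super().__init__()
            self.fc = nn.Sequential(                       # excitation: 1xC -> 1xαC -> 1xC
                nn.Linear(channels, alpha * channels), nn.ReLU(),
                nn.Linear(alpha * channels, channels), nn.Sigmoid())

        def forward(self, x):                              # x: (batch, C, T)
            s = x.mean(dim=2)                              # squeeze: global temporal average, (batch, C)
            e = self.fc(s)                                 # excitation: per-channel weights in [0, 1]
            return x * e.unsqueeze(-1)                     # recalibration: rescale each channel

    y = SE1d(128)(torch.randn(2, 128, 243))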
  24. Difference from the Original SE Block
     • The original SENet shrinks the channel dimension in the first FC layer
       ‣ 𝑟: reduction ratio (hidden size = C/r)
     • Here the channel dimension is expanded instead
       ‣ 𝜶: amplifying ratio (hidden size = αC)
     • The original SENet uses r = 16 ( … 16 …)
       ‣ Audio models have far fewer channels: does reduction still make sense?
     [Block diagram as on the previous slide]
  25. Amplifying Ratio (alpha) Grid Search
     [Plot: AUC vs. amplifying ratio, with underfitting and overfitting regions marked]
  26. ReSE-n Block
     • Combines the Res-n & SE blocks (sketch below)
     [Block diagram: Conv1D → BatchNorm → relu → Dropout → Conv1D → BatchNorm → GlobalAvgPool → FC → relu → FC → sigmoid → Scale → (+ skip) → relu → MaxPool; T×C → 1×C → 1×αC → 1×C → T×C]
     • AUC on MagnaTagATune for Basic, Res-2, SE, ReSE-1, ReSE-2 (values on the slide: 0.9102, 0.9066, 0.9083, 0.9061, 0.9055); 1.8× …
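A sketch of a ReSE-2-style block combining the two ideas; the dropout rate, the amplifying ratio, and the choice to apply the SE rescaling to the residual branch before adding the skip-connection follow my reading of the diagram, not released code:

    import torch
    import torch.nn as nn

    class ReSEBlock(nn.Module):
        def __init__(self, channels, alpha=16, dropout=0.2):
            super().__init__()
            self.body = nn.Sequential(                     # Res-2 residual branch
                nn.Conv1d(channels, channels, 3, padding=1), nn.BatchNorm1d(channels),
                nn.ReLU(), nn.Dropout(dropout),
                nn.Conv1d(channels, channels, 3, padding=1), nn.BatchNorm1d(channels))
            self.excite = nn.Sequential(                   # squeeze-and-excitation path
                nn.Linear(channels, alpha * channels), nn.ReLU(),
                nn.Linear(alpha * channels, channels), nn.Sigmoid())
            self.pool = nn.MaxPool1d(3)

        def forward(self, x):
            h = self.body(x)
            w = self.excite(h.mean(dim=2)).unsqueeze(-1)   # per-channel weights in [0, 1]
            return self.pool(torch.relu(h * w + x))        # recalibrate, add skip, relu, pool

    y = ReSEBlock(128)(torch.randn(2, 128, 729))           # -> (2, 128, 243)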
  27. Multi-level Feature Aggregation
     • Use the outputs of the last 3 layers, not only the last one (sketch below)
       ‣ Concatenate the 3 outputs
       ‣ Simple, but powerful
     • Motivation:
       ‣ Music tags differ in their level of abstraction
       ‣ Example:
         - “vocal”: low abstraction
         - “metal”: high abstraction
     • Global max pooling collapses the time dimension to 1
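A small sketch of the aggregation step as described: global max pooling over time on the outputs of the last three blocks, concatenated before the FC layers (the choice of exactly three blocks follows the slide; the shapes are illustrative):

    import torch

    def aggregate(block_outputs):
        # block_outputs: list of tensors shaped (batch, C_i, T_i) from different depths
        pooled = [h.max(dim=2).values for h in block_outputs]   # global max pool over time
        return torch.cat(pooled, dim=1)                         # (batch, sum of C_i)

    feats = aggregate([torch.randn(2, 256, 9), torch.randn(2, 256, 3), torch.randn(2, 512, 1)])
    # feats.shape == (2, 1024)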
  28. Comparison of Architectures
     [Bar chart: AUC on MagnaTagATune for Basic, SE, Res-1, Res-2, ReSE-1, ReSE-2, each with and without multi-level feature aggregation; values shown: 0.9102, 0.9066, 0.9061, 0.9048, 0.9083, 0.9055, 0.9113, 0.9053, 0.9098, 0.9037, 0.9111, 0.9077; SampleCNN baseline; ×1.7 / ×1.08 annotations]
  29. Comparison with SoTA
     • Ensemble of 3 models
     • Best among single models!
  30. Interpretability of Deep Learning
     • …
     • Interpretability …
       ‣ Vision …
       ‣ Audio …
     • Weapons of Math Destruction ( … ) — … , “ … ”
     • … ‣ ( … ) ‣ “ …?” “ …?” …
  31. SampleCNN Filter Visualization
     • Each channel responds to a particular frequency of the input signal
     • Channels can be sorted by that frequency ( … )
     • Across layers, the learned filters concentrate in the low frequencies, roughly on a log scale (e.g. mel-scale)
       ‣ i.e. like piano keys
     [Plots per layer: sorted channel index vs. frequency (0~11 kHz), mostly low-frequency]
  32. Filter Viz. Process Example: (a given layer and channel, e.g. channel 3 of layer 1)
     1) Initialize the input randomly (random noise)
     2) Backprop. the target channel's activation to the input and update the input
     3) Take the STFT of the optimized input
     4) Read off its frequency content
     [Architecture residue: 59049 × 1 raw waveform → strided conv → 1D convolutional block ×9 (19683 × 128, 6561 × 128, 2187 × 128, 729 × 256, 243 × 256, 81 × 256, …); a sketch of the procedure follows below]
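A sketch of this procedure under my own assumptions (gradient ascent on the input, a toy front end standing in for the model truncated at the chosen layer, and torch.stft for steps 3 and 4); not the authors' exact code:

    import torch
    import torch.nn as nn

    def visualize_filter(feature_extractor, channel, steps=200, lr=0.1, n_samples=729):
        x = torch.randn(1, 1, n_samples, requires_grad=True)    # 1) random-noise input
        for _ in range(steps):
            activation = feature_extractor(x)[0, channel].mean()
            activation.backward()                               # 2) backprop. to the input
            with torch.no_grad():
                x += lr * x.grad                                 # gradient ascent step
                x.grad.zero_()
        return x.detach().squeeze()

    # Toy stand-in for "the model up to the chosen layer"; use the real truncated model in practice.
    frontend = nn.Sequential(nn.Conv1d(1, 128, 3, stride=3), nn.ReLU())
    waveform = visualize_filter(frontend, channel=3)
    spectrum = torch.stft(waveform, n_fft=256, return_complex=True).abs()   # 3) + 4) inspect the spectrum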
  33. Excitation Visualization
     [SE block diagram (Conv1D → BatchNorm → relu → MaxPool → GlobalAvgPool → FC → relu → FC → sigmoid → Scale; T×C → 1×C → 1×αC → 1×C → T×C); the excitation is the 1×C sigmoid output. Plot: excitation vs. sorted channel index]
  34. Excitation Visualization
     • Compare the SE blocks' channel excitations across tags (hook sketch below)
     • Mid blocks look like general signal processing, while the last blocks look tag-discriminative
     • The first block shows almost the same excitation for every tag!
       ‣ Related to loudness
     [Plot: excitation vs. sorted channel index]
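A small sketch, under the assumption that each excitation path ends in a Sigmoid as in the diagrams above, of how the excitation vectors could be recorded with forward hooks for plots like these:

    import torch
    import torch.nn as nn

    def capture_excitations(model, x):
        records = []                                       # one (batch, channels) tensor per Sigmoid
        hook = lambda module, inputs, output: records.append(output.detach())
        handles = [m.register_forward_hook(hook)
                   for m in model.modules() if isinstance(m, nn.Sigmoid)]
        with torch.no_grad():
            model(x)
        for h in handles:
            h.remove()
        # Note: a final output Sigmoid, if present, is captured too and can be dropped.
        return records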
  35. Excitation Visualization
     • Excitation … loudness … tag … excitation … layer …
     [Plot: standard deviations of excitations across tags, per layer; excitation vs. sorted channel index]
  36. Analysis of the First Excitation
     [Plot: excitations of the first SE block for all 50 tags, vs. sorted channel index]
  37. Analysis of the First Excitation
     • Each point is an audio segment
     • With a linear regression line per panel
     • Suggests the first SE block normalizes loudness …
     • But … ,
     [Panels: Average of 128 Channels, Most Positive Channel, Most Negative Channel, Most Neutral Channel, Least Regression Error Channel; axes: Loudness vs. Excitation]
  38. Analysis of the First Excitation
     • A linear regression per channel (loudness vs. excitation; sketch below)
     • Most slopes are negative, consistent with loudness normalization
       ‣ #negative = 109
       ‣ #positive = 19
     • Is loudness alone what drives the excitation?
     [Axes: Loudness vs. Excitation]
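A sketch of this per-channel regression with NumPy, under assumptions about the inputs (one loudness value per segment, e.g. RMS in dB, and the first block's excitation per segment and channel); the data below are placeholders:

    import numpy as np

    def slope_signs(loudness, excitations):
        # loudness: (n_segments,); excitations: (n_segments, n_channels)
        slopes = np.array([np.polyfit(loudness, excitations[:, c], 1)[0]
                           for c in range(excitations.shape[1])])
        return int((slopes < 0).sum()), int((slopes > 0).sum())

    loudness = 20 * np.log10(np.random.rand(1000) + 1e-6)   # placeholder loudness values (dB)
    excitations = np.random.rand(1000, 128)                 # placeholder excitations
    n_negative, n_positive = slope_signs(loudness, excitations)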
  39. Variation of Excitations Increases According to Loudness
     • For each audio segment, excitations vary across channels
     • Per segment ( … )
     • The louder the segment, the larger the variation of its excitations
  40. Excitation Comparison with Speech Dataset
  41. TensorFlow Speech Commands Dataset
     • 1-second audio clips
     • Spoken keywords
       ‣ e.g.: “yes”, “no”, “right”, “go”
     • The same SE-block analysis is repeated on this dataset
  42. Average of 128 Channels: MagnaTagATune (Music dataset) vs. TensorFlow Speech Commands (Speech dataset)
  43. Most Positive Channel: MagnaTagATune (Music dataset) vs. TensorFlow Speech Commands (Speech dataset)
  44. Linear Regression Lines: MagnaTagATune (Music dataset) vs. TensorFlow Speech Commands (Speech dataset)
     • #negative = 109, #positive = 19 …!
  45. Variation of Excitations Increases According to Loudness: MagnaTagATune (Music dataset) vs. TensorFlow Speech Commands (Speech dataset)
  46. • Can audio filter visualization be improved, like image filter viz.?
      • Can the SE block's squeeze and excitation be …?
      • Can ROC-AUC be directly optimized with policy gradient?
  47. (Taejun Kim) i2r.jun@gmail.com ^_^
  48. References
      • [Cover art] http://www.sqoop.co.ug/201805/four-one-one/nation-media-group-launches-music-record-label-lit-music.html
