WEAKLY-SUPERVISED SOUND EVENT DETECTION
WITH SELF-ATTENTION
Koichi Miyazaki, Tatsuya Komatsu, Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda
This work was done during an internship at LINE Corporation
ICASSP 2020
Session WE1.L5: Acoustic Event Detection
Outline of this work
l Goal
– Improve sound event detection (SED) performance
– Utilize weak label data for training
l Contributions
– Propose self-attention based weakly-supervised SED
– Introduce a special tag token to handle weak label information
l Evaluation
– Improved SED performance compared with CRNN
• CRNN baseline: 30.61% → Proposed: 34.28%
(Figure: SED detects the onset and offset of an "Alarm" event along the time axis; the clip's weak label is "Alarm, Dog, Speech".)
Background
l Sound event detection (SED)
– Identifying environmental sounds with timestamps
l Collecting annotated dataset
– Strong label
• Easy to handle
• Expensive annotation cost
– Weak label
• Hard to handle
• Cheap annotation cost
(Figure: a strong label includes timestamps, i.e. the onset and offset of each event ("Alarm", "Dog", "Speech") on the time axis; a weak label ("Alarm, Dog, Speech") does NOT include timestamps, so only tags are available. Handling the weak label is the problem addressed here.)
Weakly-supervised training for SED
l Multi-instance learning (MIL)
– Effective approach for training with weak labels
– Predict frame by frame, then aggregate the predictions into a sequence-level prediction
(Figure: frame-level scores for class1, class2, and class3 are aggregated in the time domain into a sequence-level score, and the loss is calculated against the weak label.)
What approach is effective to aggregate?
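The MIL recipe above can be sketched numerically: frame-level predictions are aggregated over time into a clip-level score, which is compared against the weak label. All values below are made-up toy numbers, and average pooling stands in for whichever aggregator is chosen.

```python
import numpy as np

# Toy MIL example: frame-level predictions for C=3 classes over T=5
# frames (invented values, already passed through a sigmoid).
frame_pred = np.array([[0.9, 0.1, 0.2],
                       [0.8, 0.1, 0.1],
                       [0.1, 0.1, 0.7],
                       [0.1, 0.2, 0.8],
                       [0.1, 0.1, 0.1]])     # shape (T, C)
weak_label = np.array([1.0, 0.0, 1.0])       # tags only: class1 and class3 present

# Aggregate in the time domain (average pooling as the stand-in
# aggregator) to obtain the sequence-level prediction.
clip_pred = frame_pred.mean(axis=0)          # shape (C,)

# Binary cross-entropy against the weak label: no timestamps needed.
eps = 1e-7
bce = -(weak_label * np.log(clip_pred + eps)
        + (1 - weak_label) * np.log(1 - clip_pred + eps)).mean()
```

The whole clip is the "bag" of MIL: only the aggregated score, never an individual frame, is supervised.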
How to aggregate frame-level prediction
l Global max pooling
– Captures short-duration events
– Sensitive to noise
l Global average pooling
– Captures long-duration events
– Ignores short-duration events
l Attention pooling
– Flexible weighting via an attention mechanism
(Figure: frame-level predictions over time are reduced to a sequence-level prediction by max, average, or attention-weighted sum.)
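The trade-off between the two global poolings can be seen on toy score sequences (values invented for illustration): max pooling keeps a short event but also fires on a single noisy frame, while average pooling suppresses the noise but dilutes the short event.

```python
import numpy as np

# Toy frame-level scores for one class.
clean = np.array([0.05, 0.1, 0.9, 0.8, 0.1, 0.05, 0.1, 0.05])    # short real event
noisy = np.array([0.05, 0.1, 0.05, 0.05, 0.1, 0.95, 0.1, 0.05])  # one noise spike

# Max pooling catches the short event, but the single noisy frame
# produces an equally confident (false) clip-level score.
max_clean, max_noisy = clean.max(), noisy.max()    # both above 0.5

# Average pooling suppresses the spike, but it also dilutes the
# genuine short event below a 0.5 decision threshold.
avg_clean, avg_noisy = clean.mean(), noisy.mean()  # both below 0.5
```

Attention pooling sits between the two extremes by letting the model learn which frames to trust.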
Attention pooling
l Calculate a prediction and a confidence for each frame according to the input
(Figure: two dense layers map each input frame-level feature to a frame-level prediction via a sigmoid and a frame-level confidence (attention weight) via a softmax over time; the weighted sum gives the sequence-level prediction.)
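A minimal NumPy sketch of this pooling (the function and weight names are illustrative stand-ins; in the actual model both dense layers are learned): a sigmoid gives frame-level predictions, a softmax over time gives per-frame confidences, and their weighted sum is the sequence-level prediction.

```python
import numpy as np

def attention_pooling(frame_feats, w_cls, w_att):
    """Aggregate frame-level features into a sequence-level prediction.

    frame_feats: (T, D) input frame-level features.
    w_cls, w_att: (D, C) weights of two dense layers (illustrative;
    in the model both are learned).
    """
    pred = 1.0 / (1.0 + np.exp(-(frame_feats @ w_cls)))  # sigmoid: frame-level prediction
    att = frame_feats @ w_att
    att = np.exp(att - att.max(axis=0, keepdims=True))
    att = att / att.sum(axis=0, keepdims=True)           # softmax over time: confidence
    return (att * pred).sum(axis=0)                      # weighted sum: shape (C,)

rng = np.random.default_rng(0)
T, D, C = 10, 16, 3
seq_pred = attention_pooling(rng.normal(size=(T, D)),
                             rng.normal(size=(D, C)),
                             rng.normal(size=(D, C)))
```

Because the attention weights sum to one over time, the output is a convex combination of frame predictions and stays in (0, 1).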
Self-attention
l Transformer [Vaswani+17]
– Effectively uses the self-attention mechanism
– Captures both local and global context information
– Great success in NLP and various audio/speech tasks
• ASR, speaker recognition, speaker diarization, TTS, etc.
(Figure: Transformer encoder block: input plus positional encoding feeds N stacked blocks of multi-head attention, add & norm, feed-forward, and add & norm.)
In this work, we use the Transformer encoder.
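Self-attention can be illustrated with a single-head scaled dot-product sketch in NumPy (the actual model stacks multi-head versions inside Transformer encoder blocks; the weight names here are illustrative): every frame attends to every other frame, which is what lets the model mix local and global context.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a frame sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # queries, keys, values: (T, d) each
    scores = q @ k.T / np.sqrt(k.shape[-1])      # (T, T) frame-to-frame scores
    att = np.exp(scores - scores.max(axis=-1, keepdims=True))
    att = att / att.sum(axis=-1, keepdims=True)  # softmax over frames
    return att @ v, att                          # context-mixed features, weights

rng = np.random.default_rng(0)
T, d = 6, 8
x = rng.normal(size=(T, d))
out, att = self_attention(x, *(rng.normal(size=(d, d)) for _ in range(3)))
```

Each row of `att` is a distribution over all frames, so distant context contributes as directly as neighboring frames.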
Overview of self-attention
(Figure: dense layers transform the input frame-level features into event features and attention weights, whose product gives the output frame-level features.)
In weakly-supervised SED, how should weak label data be handled?
Proposed method
l Weakly-supervised training for SED with self-attention and tag token
– Introduce the Transformer encoder as self-attention for sequence modeling
– Introduce a tag token dedicated to weak label estimation
(Figure: a tag token is appended at the first frame of the input feature sequence; the stacked Transformer encoder output feeds sigmoid classifiers that predict the strong label from the frame outputs and the weak label from the tag token.)
Self-attention with tag token
(Figure, built up over three slides: a tag token with a constant value is appended to the input frame-level features; in each of the stacked encoders, encoder 1 through encoder N, self-attention relates the tag token to the input, so tag information is aggregated into the token in each encoder; the frame-level outputs yield the strong label prediction and the tag token output yields the weak label prediction.)
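The tag-token mechanism can be sketched as follows (an identity `encoder` stands in for the stacked Transformer encoder, and all function and variable names are assumptions for illustration): the constant-valued token is prepended to the frame sequence, and after encoding the first output position is read as the weak-label branch while the remaining positions feed the strong-label branch.

```python
import numpy as np

def forward_with_tag_token(frames, tag_token, encoder):
    """Append a tag token at the first frame, encode, then split outputs.

    `encoder` is a stand-in for the stacked Transformer encoder
    (assumption: any (T+1, D) -> (T+1, D) sequence model fits);
    self-attention inside it aggregates tag information into the token.
    """
    x = np.vstack([tag_token, frames])        # (T+1, D): token at the first frame
    y = encoder(x)
    weak_out = y[0]                           # tag token output -> weak label branch
    strong_out = y[1:]                        # frame outputs -> strong label branch
    return weak_out, strong_out

T, D = 10, 16
frames = np.random.default_rng(0).normal(size=(T, D))
tag_token = np.zeros((1, D))                  # constant-value token, as in the slides
# An identity encoder keeps the sketch self-contained and runnable.
weak_out, strong_out = forward_with_tag_token(frames, tag_token, encoder=lambda x: x)
```

In the real model, sigmoid classifiers on `weak_out` and `strong_out` produce the weak and strong label predictions, respectively.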
Experiments
l DCASE2019 task 4
– Sound event detection in domestic environments
– Evaluation metrics: event-based and segment-based macro F1
– Baseline model: CRNN
– Dataset: the provided DCASE2019 task 4 data (details shown as a table on the slide)
Experimental conditions
l Network training configuration
– Feature: 64-dim log mel filterbank
– Transformer setting: 128 attention dimensions, 16 heads (each head handles 8 dimensions)
Experimental results

Method | Event-based [%] | Segment-based [%] | Frame-based [%]
CRNN (baseline) | 30.61 | 62.21 | 60.94
Transformer (E=3) | 34.27 | 65.07 | 61.85
Transformer (E=4) | 33.05 | 65.14 | 62.00
Transformer (E=5) | 31.81 | 63.90 | 60.78
Transformer (E=6) | 34.28 | 64.33 | 61.26
(E: number of stacked Transformer encoder blocks)
All Transformer models outperformed the CRNN baseline.
Experimental results
(Figure: per-class detection scores, CRNN vs. Transformer.)
The Blender and Dishes classes improve especially (+10.4% and +13.5%), so the method is effective for sounds that appear repeatedly.
Experimental results: attention pooling vs. tag token

Method | Encoder stack | Event-based [%] | Segment-based [%] | Frame-based [%]
Self-attention + Attention pooling | 3 | 33.99 | 65.95 | 62.36
Self-attention + Attention pooling | 6 | 33.84 | 65.61 | 62.10
Self-attention + Tag token | 3 | 34.27 | 65.07 | 61.85
Self-attention + Tag token | 6 | 34.28 | 64.33 | 61.26
The tag token performs comparably to attention pooling.
Predicted example
Visualization of attention weights
Conclusion
l Proposed method
– Weakly-supervised training for SED with self-attention and a tag token
• Self-attention: effective sequence modeling using local and global context
• Tag token: aggregates tag information through self-attention
l Result
– Improved SED performance compared with CRNN
• CRNN baseline: 30.61% → Proposed: 34.28%
– Effective for repeatedly appearing sounds
