Computational Auditory Scene Recognition
Shriram Nandakumar, Deepa Naik
Student Numbers: 244935, 232887
nandakum@student.tut.fi, deepa.naik@student.tut.fi
ABSTRACT
Computational Auditory Scene Recognition (CASR) refers
to the processing and analysis of the audio signals of a
scene in order to recognize its context. In this paper, a
multi-class (12 class) audio environment classification is
attempted. Frame-level sub-band energy ratios are
computed as features for the input audio data. 𝑘-Nearest
Neighbor classifier is used for classification. Performance is
analysed based on overall classification accuracy and class-
wise accuracies. Despite the simplicity of the methods
used, the classifier performs well, reaching about 70%
overall accuracy.
1. INTRODUCTION / BACKGROUND
Auditory Scene Analysis (ASA) refers to the perceptual
process that the human auditory system performs on the
sound reaching the ears. This process is responsible for the
ability of human hearing to distinguish individual sound
sources in a complex mixture of sounds [1].
Computational Auditory Scene Analysis (CASA) is the
challenge of constructing a machine system capable of
matching human performance in ASA, and is hence
biologically motivated [2]. It is often referred to as the
cocktail party problem. CASA typically uses one (monaural)
or two (binaural) microphone recordings of the acoustic
scene. As a result, it is fundamentally different from other
source separation techniques such as beamforming and
independent component analysis (ICA).
The goal of CASA is to computationally extract individual
“streams” from one or two recordings of an acoustic scene.
In ASA, a stream is a perceptual structure that groups local
time-frequency segments likely to have arisen from the
same environmental source. In CASA, the term
"stream" can refer both to the perceptual representation of a
sound source, and to the representation of a sound source in
computer memory. CASA finds applications in noise-robust
automatic speech recognition, hearing prostheses, automatic
music transcription, audio information retrieval and
developments in hearing science [1].
Figure 1 shows a typical CASA architecture. A digitally
recorded acoustic mixture is subjected to peripheral analysis
by obtaining the time-frequency representation or
cochleagram. Acoustic features are then extracted from the
time-frequency representation. Examples of conventional
acoustic features are periodicity, onsets, offsets, amplitude
and frequency modulation. From the extracted features,
mid-level representations such as segments are obtained.
Scene Organization involves grouping of cues and training
models of individual sound sources. The final step includes
re-synthesis of audio waveform from a separated stream.
Fig 1. CASA Architecture [1]
Computational Auditory Scene Recognition (CASR), an
offshoot of CASA, is a term coined by Peltonen [4]. CASR
is the study of processing the audio signals of a scene in
order to recognize its context. For example, it refers to the
process of identifying the environment of a mobile device
based on the characteristics of the audio signal recorded by
the device [5].
Unlike CASA, CASR classifies audio mixtures themselves
into predefined classes, without attempting to recognize the
individual sound sources. A typical example is identifying
the location as a street from sounds such as cars passing by,
people walking and other environmental sounds. Such
context-aware mobile devices can react to changes in the
usage environment and thus better serve the needs of users
by adjusting their mode of operation. Areas closely related
to CASR include speech/music discrimination, noise
classification and content-based audio retrieval [4].
2. THEORY & METHODS
This section covers the theory and the concepts of pre-
processing, feature extraction and classification methods
used.
2.1 Pre-processing:
Any audio signal analysis and classification task requires
frame-wise features to be extracted due to the non-stationary
nature of real-world audio signals. Hence the pre-processing
step involves dividing the signal into short segments called
frames. Typically, a smooth window such as the Hanning
window is applied to the signal values in each frame.
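The framing and windowing step can be sketched as follows. This is an illustrative sketch in Python/NumPy (the report's own implementation is in MATLAB); the function and variable names are our own:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping, Hanning-windowed frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)

# 30 ms frames with 50% overlap at 8 kHz (the settings used in Sec. 3.2)
fs = 8000
frames = frame_signal(np.random.randn(fs), frame_len=240, hop=120)
```

With one second of 8 kHz audio, 240-sample frames and a 120-sample hop, this yields 65 frames of 240 windowed samples each.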
2.2 Feature Extraction:
The acoustic attributes of an audio signal can be divided
into two groups- perceptual & physical features [4]. The
perceptual features describe the sensation of a sound in
subjective terms such as loudness, pitch, brightness and
timbre. Physical features are calculated mathematically
from the sound wave. Examples are intensity, fundamental
frequency, spectrum, spectral centroid and others. This
paper uses a simple physical feature called Sub-Band
Energy Ratio (SBER) to accomplish the task.
SBER describes the energy distribution of the audio signal
among different frequency bands. For the 𝑖th band it is
computed as:

x(i) = \frac{\sum_{l=b_i}^{e_i} |S(l)|^2}{\sum_{l=0}^{L/2} |S(l)|^2},   (1)

where S is the Discrete Fourier Transform of a signal frame,
l is the bin index, L is the total number of frequency bins,
and b_i and e_i are the first and last bins of the i-th frequency
band.
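Eq. (1) can be sketched in Python/NumPy (the original work used MATLAB; the rounding of band edges to bin indices is an assumption, since the report does not specify how boundary bins are assigned):

```python
import numpy as np

def sber(frame, fs, bands):
    """Sub-band energy ratios, Eq. (1): per-band spectral energy
    divided by the total energy in 0..fs/2."""
    L = len(frame)
    S = np.fft.rfft(frame)          # one-sided spectrum, bins 0..L/2
    energy = np.abs(S) ** 2
    total = energy.sum()
    ratios = []
    for f_lo, f_hi in bands:
        b = int(round(f_lo * L / fs))   # first bin of the band
        e = int(round(f_hi * L / fs))   # last bin of the band
        ratios.append(energy[b:e + 1].sum() / total)
    return np.array(ratios)

# the four bands used later in Sec. 3.2
bands = [(0, 500), (500, 1000), (1000, 2000), (2000, 4000)]
r = sber(np.random.randn(1024), fs=8000, bands=bands)
```

Note that adjacent bands share an edge bin under this convention, so the four ratios may sum to slightly more than one; each individual ratio stays in [0, 1].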
2.3 Classification
The 𝑘-Nearest Neighbor (𝑘-NN) method is used for
classification. In the simplest case, 1-NN, an M-dimensional
test sample is classified according to its nearest neighbor, in
the Euclidean-distance sense, among the correctly labelled
training samples D:

D = \{ (\mathbf{x}_1, \theta(\mathbf{x}_1)), (\mathbf{x}_2, \theta(\mathbf{x}_2)), \dots, (\mathbf{x}_N, \theta(\mathbf{x}_N)) \},   (2)

where N is the size of the training set and \theta(\mathbf{x}_n) \in \{1, 2, \dots, C\}
is the index of the class that the n-th training sample belongs
to, C being the number of classes. The nearest neighbor to
the test vector \mathbf{x} is computed as

\mathbf{x}' = \operatorname*{argmin}_{\mathbf{x}_n \in D} d(\mathbf{x}, \mathbf{x}_n),   (3)

and the test vector is assigned the same class as \mathbf{x}'. Here
d(\mathbf{a}, \mathbf{b}) is the distance function, typically the \ell_2 distance
defined as

d(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_2 = \sqrt{\sum_{i=1}^{M} (a_i - b_i)^2}.   (4)
In the case of 𝑘-NN, instead of a single neighbor, 𝑘
neighbors are computed in the order of their distances along
with their class labels. A majority voting is performed
among the class labels of the neighbors to classify the test
sample.
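Equations (2)-(4) together with the majority vote amount to the following sketch (NumPy-based; the report itself relied on MATLAB's built-in routine, and the toy data below is invented for illustration):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=5):
    """Eq. (3)-(4): find the k nearest training vectors in l2 distance,
    then take a majority vote over their class labels."""
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each sample
    nearest = np.argsort(d)[:k]               # indices of the k closest samples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# toy 2-class example
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
y = np.array([1, 1, 2, 2])
label = knn_classify(np.array([0.05, 0.05]), X, y, k=3)  # → 1
```

With k = 3, the three nearest training points carry labels 1, 1 and 2, so the vote assigns class 1.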
2.4 Performance Measures
Performance assessment is as important a part of classifier
design as the design stage itself, even when sophisticated
methods are deployed. Reliable statistical estimates of the
performance measures should be obtained by judiciously
dividing the data into training, test and validation sets.
Apart from the neighborhood size 𝑘, a 𝑘-NN classifier has
no parameters to train; in this work 𝑘 is varied explicitly and
results are reported for each value, so no separate cross-
validation stage is used.
As performance measures, overall classification accuracy
and accuracies for each class are obtained from the
confusion matrix. Each column of the confusion matrix
denotes the instances in a predicted class, while each row
represents the instances in an actual class. Class-wise
classification accuracy is computed as:

\mathrm{Acc}_i\,(\%) = \frac{c_{i,i}}{\sum_{j=1}^{K} c_{i,j}} \times 100,   (5)

where Acc_i is the percentage accuracy for the i-th class,
c_{i,j} are the entries of the confusion matrix and K is the
number of classes.
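Eq. (5) and the overall accuracy follow directly from the confusion matrix. A small sketch with an invented 3-class matrix (rows = true classes, columns = predicted classes, as in the convention above):

```python
import numpy as np

def classwise_accuracy(C):
    """Eq. (5): diagonal entries divided by row sums, in percent."""
    return 100.0 * np.diag(C) / C.sum(axis=1)

def overall_accuracy(C):
    """Fraction of all samples on the diagonal, in percent."""
    return 100.0 * np.trace(C) / C.sum()

C = np.array([[8, 1, 1],
              [2, 6, 2],
              [0, 0, 10]])
acc = classwise_accuracy(C)   # → [80., 60., 100.]
```

Here 24 of the 30 samples lie on the diagonal, giving an overall accuracy of 80%.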
3. IMPLEMENTATION
This section gives details of the database and specifics of
the classifier implementation. The implementation is done
in MATLAB®.
3.1. Database
A subset of the Environmental Noise Data Set (series 2),
collected by University of East Anglia, is used [5]. The data
consists of 8 kHz recordings from 12 different audio
environments, with one audio file per environment. To
prepare the training and test sets, every audio file is divided
into 1 s non-overlapping segments; 80% of the segments are
used for training and the rest for testing. The segments are
given class label tags as shown in Table 1.
TABLE 1.
CLASS LABELS AND CORRESPONDING ENVIRONMENTS
Class Label Environment
1 Building site
2 Bus
3 Highway
4 Car
5 Launderette
6 Office
7 Presentation
8 Shopping Centre
9 Street / people
10 Street / Traffic
11 Supermarket
12 Train
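The 80/20 segment split described above can be sketched as follows. The report does not say whether the segments are shuffled before splitting, so the random permutation (and the fixed seed) here is an assumption:

```python
import numpy as np

def split_segments(n_segments, train_frac=0.8, seed=0):
    """Return train/test index sets for one audio file's 1 s segments,
    keeping 80% of them for training."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_segments)        # assumed: shuffle before splitting
    n_train = int(train_frac * n_segments)
    return idx[:n_train], idx[n_train:]

# e.g. a 5-minute recording yields 300 one-second segments
train_idx, test_idx = split_segments(300)
```

The split is applied per file, so every environment contributes both training and test segments.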
Fig 2. Plot of overall percentage accuracy for various
values of 𝑘 in the 𝑘-Nearest Neighbor classifier.
3.2. Specifics of pre-processing, feature extraction and
classifier implementation
The frame size is chosen to be 30 ms with a 50% overlap
between adjacent frames. Each frame is Hanning windowed.
The number of DFT bins is 1024. Four frequency bands are
considered: 0-0.5 kHz, 0.5-1 kHz, 1-2 kHz and 2-4 kHz.
Hence, for each signal frame, a 4-dimensional feature vector
of SBERs is extracted. For the 𝑘-NN classifier, 𝑘 is varied
between 1 and 10, using MATLAB's built-in
implementation.
4. RESULTS AND DISCUSSION
The plot of overall percentage accuracy for various values
of 𝑘 is shown in figure 2. It can be observed that for 𝑘 ≥ 3
the classifier yields better performance. The poorer
classification accuracy for smaller values of 𝑘 can be
attributed to the inherent tendency of low-𝑘 classifiers to
over-fit. Very large values of 𝑘 (for example, 𝑘 ≥ 10) are
also unwise choices, as they under-fit the data and hence
yield unsatisfactory results.
The class-wise percentage accuracies for 𝑘 = 1 and 𝑘 =
5 are shown in figure 3 and figure 4 respectively. To
complement the figures, an example confusion matrix is
shown in table 2. It can be observed that the building site
(class 1) and office (class 6) environments are identified
with near-perfect accuracy. The launderette (class 5) and
street/traffic (class 10) environments can also be
distinguished with a high degree of confidence. The
classifier finds it extremely difficult to distinguish bus
(class 2) and shopping centre (class 8) from other
environments: the bus environment is often confused with
highway, street/traffic, car and presentation, while the
shopping centre is confused with the supermarket.
It is also interesting to observe that there is a drastic
difference in classification accuracy for a change in 𝑘 in the
case of train environment (class 12).
Fig 3. Plot of class-wise percentage accuracies for 𝑘 = 1.
Fig 4. Plot of class-wise percentage accuracies for 𝑘 = 5.
TABLE 2.
CONFUSION MATRIX FOR 𝑘=5.
(The rows correspond to true class and the columns correspond to predicted class)
No. 1 2 3 4 5 6 7 8 9 10 11 12
1 56 1 2 1
2 18 10 9 1 9 0 12
3 13 44 3
4 9 30 4 16
5 3 54 1 1
6 61
7 1 2 6 5 1 40 3 1
8 2 24 4 1 19 9
9 1 7 3 1 33 1 13
10 5 52 1 2
11 1 3 1 4 11 7 30 2
12 3 1 55
5. CONCLUSION
In this paper, a method for recognizing the environment of
an audio signal was proposed. With sub-band energy ratios
as features and 𝑘 −Nearest Neighbor as the classifier, an
overall accuracy of 70% was achieved on a 12-class dataset.
In spite of the simple methods used in all design stages of
the classifier, the performance was observed to be
satisfactory. The simplicity of the algorithm, however,
comes at the cost of slow classification: 𝑘-NN must store
the entire training set and perform an exhaustive search
over it for every test sample, which hinders real-time
operation.
REFERENCES
[1] D. Wang and G. J. Brown, Computational Auditory
Scene Analysis: Principles, Algorithms and Applications.
Wiley-IEEE Press, 2006.
[2] E. C. Cherry, "Some experiments on the recognition of
speech, with one and with two ears," The Journal of the
Acoustical Society of America, vol. 25, pp. 975-979, 1953.
[3] A. S. Bregman, Auditory Scene Analysis: The
Perceptual Organization of Sound. MIT Press, 1990.
[4] V. Peltonen, "Computational auditory scene
recognition," M.Sc. thesis, Dept. of Signal Processing,
Tampere University of Technology, Tampere, Finland,
2001.
[5] School of Computing Sciences, University of East
Anglia (2004), Environmental noise data set: Series 2
[online]. Available:
http://guides.lib.monash.edu/content.php?pid=346637&sid
=3402748
APPENDIX
(Answers to the intermediate tasks)
Task 2.1
Sampling frequency = 8 kHz.
Length of one frame in samples = 240.
Fig 5. Signal in the 50th frame of speech.wav
Fig 6. Amplitude spectrum of the signal in the 50th frame of
speech.wav
Indices of the DFT corresponding to 1-2 kHz = 128-256.
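As a cross-check on these indices, the DFT bin for a frequency f follows l = f·L/f_s; with L = 1024 bins and f_s = 8000 Hz the band edges map as:

```python
# Map a frequency in Hz to its DFT bin index: l = f * L / fs
L, fs = 1024, 8000
bins = {f: f * L // fs for f in (500, 1000, 2000, 4000)}
print(bins)  # {500: 64, 1000: 128, 2000: 256, 4000: 512}
```

So the 1-2 kHz band spans bins 128-256, and the full one-sided spectrum ends at bin L/2 = 512 (4 kHz, the Nyquist frequency).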
Fig 7. Sub-band energy ratios for the 50th frame of
speech.wav
Comments about the lab work:
This is one of the best tasks in the entire lab course. It is
well developed, simple to understand, and gave hands-on
experience with a real pattern classification task. It is also a
good task for honing MATLAB skills, especially array
indexing. It took 2 days to implement and 4 days to write
the report.
CASR-Report

Weitere ähnliche Inhalte

Was ist angesagt?

Spectral Density Oriented Feature Coding For Pattern Recognition Application
Spectral Density Oriented Feature Coding For Pattern Recognition ApplicationSpectral Density Oriented Feature Coding For Pattern Recognition Application
Spectral Density Oriented Feature Coding For Pattern Recognition ApplicationIJERDJOURNAL
 
Designing an Efficient Multimodal Biometric System using Palmprint and Speech...
Designing an Efficient Multimodal Biometric System using Palmprint and Speech...Designing an Efficient Multimodal Biometric System using Palmprint and Speech...
Designing an Efficient Multimodal Biometric System using Palmprint and Speech...IDES Editor
 
Paper id 252014135
Paper id 252014135Paper id 252014135
Paper id 252014135IJRAT
 
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITION
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITIONSYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITION
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITIONcsandit
 
A parallel rough set based smoothing filter
A parallel rough set based smoothing filterA parallel rough set based smoothing filter
A parallel rough set based smoothing filterprjpublications
 
Performance Evaluation of Quarter Shift Dual Tree Complex Wavelet Transform B...
Performance Evaluation of Quarter Shift Dual Tree Complex Wavelet Transform B...Performance Evaluation of Quarter Shift Dual Tree Complex Wavelet Transform B...
Performance Evaluation of Quarter Shift Dual Tree Complex Wavelet Transform B...IJECEIAES
 
IEEE Final Year Projects 2011-2012 :: Elysium Technologies Pvt Ltd::Imageproc...
IEEE Final Year Projects 2011-2012 :: Elysium Technologies Pvt Ltd::Imageproc...IEEE Final Year Projects 2011-2012 :: Elysium Technologies Pvt Ltd::Imageproc...
IEEE Final Year Projects 2011-2012 :: Elysium Technologies Pvt Ltd::Imageproc...sunda2011
 
Review of Diverse Techniques Used for Effective Fractal Image Compression
Review of Diverse Techniques Used for Effective Fractal Image CompressionReview of Diverse Techniques Used for Effective Fractal Image Compression
Review of Diverse Techniques Used for Effective Fractal Image CompressionIRJET Journal
 
Dynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentationDynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentationeSAT Publishing House
 
Performance analysis of image filtering algorithms for mri images
Performance analysis of image filtering algorithms for mri imagesPerformance analysis of image filtering algorithms for mri images
Performance analysis of image filtering algorithms for mri imageseSAT Publishing House
 
IRJET - Computer-Assisted ALL, AML, CLL, CML Detection and Counting for D...
IRJET -  	  Computer-Assisted ALL, AML, CLL, CML Detection and Counting for D...IRJET -  	  Computer-Assisted ALL, AML, CLL, CML Detection and Counting for D...
IRJET - Computer-Assisted ALL, AML, CLL, CML Detection and Counting for D...IRJET Journal
 
Flower Classification Using Neural Network Based Image Processing
Flower Classification Using Neural Network Based Image ProcessingFlower Classification Using Neural Network Based Image Processing
Flower Classification Using Neural Network Based Image ProcessingIOSR Journals
 
Dynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentationDynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentationeSAT Journals
 
A survey on clustering techniques for identification of
A survey on clustering techniques for identification ofA survey on clustering techniques for identification of
A survey on clustering techniques for identification ofeSAT Publishing House
 

Was ist angesagt? (18)

Spectral Density Oriented Feature Coding For Pattern Recognition Application
Spectral Density Oriented Feature Coding For Pattern Recognition ApplicationSpectral Density Oriented Feature Coding For Pattern Recognition Application
Spectral Density Oriented Feature Coding For Pattern Recognition Application
 
Designing an Efficient Multimodal Biometric System using Palmprint and Speech...
Designing an Efficient Multimodal Biometric System using Palmprint and Speech...Designing an Efficient Multimodal Biometric System using Palmprint and Speech...
Designing an Efficient Multimodal Biometric System using Palmprint and Speech...
 
Paper id 252014135
Paper id 252014135Paper id 252014135
Paper id 252014135
 
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITION
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITIONSYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITION
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITION
 
A parallel rough set based smoothing filter
A parallel rough set based smoothing filterA parallel rough set based smoothing filter
A parallel rough set based smoothing filter
 
50120140502010
5012014050201050120140502010
50120140502010
 
Cb34474478
Cb34474478Cb34474478
Cb34474478
 
Performance Evaluation of Quarter Shift Dual Tree Complex Wavelet Transform B...
Performance Evaluation of Quarter Shift Dual Tree Complex Wavelet Transform B...Performance Evaluation of Quarter Shift Dual Tree Complex Wavelet Transform B...
Performance Evaluation of Quarter Shift Dual Tree Complex Wavelet Transform B...
 
D04812125
D04812125D04812125
D04812125
 
IEEE Final Year Projects 2011-2012 :: Elysium Technologies Pvt Ltd::Imageproc...
IEEE Final Year Projects 2011-2012 :: Elysium Technologies Pvt Ltd::Imageproc...IEEE Final Year Projects 2011-2012 :: Elysium Technologies Pvt Ltd::Imageproc...
IEEE Final Year Projects 2011-2012 :: Elysium Technologies Pvt Ltd::Imageproc...
 
Review of Diverse Techniques Used for Effective Fractal Image Compression
Review of Diverse Techniques Used for Effective Fractal Image CompressionReview of Diverse Techniques Used for Effective Fractal Image Compression
Review of Diverse Techniques Used for Effective Fractal Image Compression
 
Fw3610731076
Fw3610731076Fw3610731076
Fw3610731076
 
Dynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentationDynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentation
 
Performance analysis of image filtering algorithms for mri images
Performance analysis of image filtering algorithms for mri imagesPerformance analysis of image filtering algorithms for mri images
Performance analysis of image filtering algorithms for mri images
 
IRJET - Computer-Assisted ALL, AML, CLL, CML Detection and Counting for D...
IRJET -  	  Computer-Assisted ALL, AML, CLL, CML Detection and Counting for D...IRJET -  	  Computer-Assisted ALL, AML, CLL, CML Detection and Counting for D...
IRJET - Computer-Assisted ALL, AML, CLL, CML Detection and Counting for D...
 
Flower Classification Using Neural Network Based Image Processing
Flower Classification Using Neural Network Based Image ProcessingFlower Classification Using Neural Network Based Image Processing
Flower Classification Using Neural Network Based Image Processing
 
Dynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentationDynamic thresholding on speech segmentation
Dynamic thresholding on speech segmentation
 
A survey on clustering techniques for identification of
A survey on clustering techniques for identification ofA survey on clustering techniques for identification of
A survey on clustering techniques for identification of
 

Ähnlich wie CASR-Report

Acoustic Scene Classification by using Combination of MODWPT and Spectral Fea...
Acoustic Scene Classification by using Combination of MODWPT and Spectral Fea...Acoustic Scene Classification by using Combination of MODWPT and Spectral Fea...
Acoustic Scene Classification by using Combination of MODWPT and Spectral Fea...ijtsrd
 
Eigen Subspace based Direction of Arrival Estimation for Coherent Sources
Eigen Subspace based Direction of Arrival Estimation for Coherent SourcesEigen Subspace based Direction of Arrival Estimation for Coherent Sources
Eigen Subspace based Direction of Arrival Estimation for Coherent SourcesINFOGAIN PUBLICATION
 
Kernal based speaker specific feature extraction and its applications in iTau...
Kernal based speaker specific feature extraction and its applications in iTau...Kernal based speaker specific feature extraction and its applications in iTau...
Kernal based speaker specific feature extraction and its applications in iTau...TELKOMNIKA JOURNAL
 
A novel speech enhancement technique
A novel speech enhancement techniqueA novel speech enhancement technique
A novel speech enhancement techniqueeSAT Publishing House
 
A Gaussian Clustering Based Voice Activity Detector for Noisy Environments Us...
A Gaussian Clustering Based Voice Activity Detector for Noisy Environments Us...A Gaussian Clustering Based Voice Activity Detector for Noisy Environments Us...
A Gaussian Clustering Based Voice Activity Detector for Noisy Environments Us...CSCJournals
 
Design and implementation of different audio restoration techniques for audio...
Design and implementation of different audio restoration techniques for audio...Design and implementation of different audio restoration techniques for audio...
Design and implementation of different audio restoration techniques for audio...eSAT Journals
 
IRJET- Machine Learning and Noise Reduction Techniques for Music Genre Classi...
IRJET- Machine Learning and Noise Reduction Techniques for Music Genre Classi...IRJET- Machine Learning and Noise Reduction Techniques for Music Genre Classi...
IRJET- Machine Learning and Noise Reduction Techniques for Music Genre Classi...IRJET Journal
 
Broad phoneme classification using signal based features
Broad phoneme classification using signal based featuresBroad phoneme classification using signal based features
Broad phoneme classification using signal based featuresijsc
 
IRJET - Essential Features Extraction from Aaroh and Avroh of Indian Clas...
IRJET -  	  Essential Features Extraction from Aaroh and Avroh of Indian Clas...IRJET -  	  Essential Features Extraction from Aaroh and Avroh of Indian Clas...
IRJET - Essential Features Extraction from Aaroh and Avroh of Indian Clas...IRJET Journal
 
Path Loss Prediction by Robust Regression Methods
Path Loss Prediction by Robust Regression MethodsPath Loss Prediction by Robust Regression Methods
Path Loss Prediction by Robust Regression Methodsijceronline
 
IRJET- Musical Instrument Recognition using CNN and SVM
IRJET-  	  Musical Instrument Recognition using CNN and SVMIRJET-  	  Musical Instrument Recognition using CNN and SVM
IRJET- Musical Instrument Recognition using CNN and SVMIRJET Journal
 
Sensitivity of Support Vector Machine Classification to Various Training Feat...
Sensitivity of Support Vector Machine Classification to Various Training Feat...Sensitivity of Support Vector Machine Classification to Various Training Feat...
Sensitivity of Support Vector Machine Classification to Various Training Feat...Nooria Sukmaningtyas
 
A Novel Uncertainty Parameter SR ( Signal to Residual Spectrum Ratio ) Evalua...
A Novel Uncertainty Parameter SR ( Signal to Residual Spectrum Ratio ) Evalua...A Novel Uncertainty Parameter SR ( Signal to Residual Spectrum Ratio ) Evalua...
A Novel Uncertainty Parameter SR ( Signal to Residual Spectrum Ratio ) Evalua...sipij
 
Broad Phoneme Classification Using Signal Based Features
Broad Phoneme Classification Using Signal Based Features  Broad Phoneme Classification Using Signal Based Features
Broad Phoneme Classification Using Signal Based Features ijsc
 
IMPROVEMENT OF BM3D ALGORITHM AND EMPLOYMENT TO SATELLITE AND CFA IMAGES DENO...
IMPROVEMENT OF BM3D ALGORITHM AND EMPLOYMENT TO SATELLITE AND CFA IMAGES DENO...IMPROVEMENT OF BM3D ALGORITHM AND EMPLOYMENT TO SATELLITE AND CFA IMAGES DENO...
IMPROVEMENT OF BM3D ALGORITHM AND EMPLOYMENT TO SATELLITE AND CFA IMAGES DENO...ijistjournal
 
IMPROVEMENT OF BM3D ALGORITHM AND EMPLOYMENT TO SATELLITE AND CFA IMAGES DENO...
IMPROVEMENT OF BM3D ALGORITHM AND EMPLOYMENT TO SATELLITE AND CFA IMAGES DENO...IMPROVEMENT OF BM3D ALGORITHM AND EMPLOYMENT TO SATELLITE AND CFA IMAGES DENO...
IMPROVEMENT OF BM3D ALGORITHM AND EMPLOYMENT TO SATELLITE AND CFA IMAGES DENO...ijistjournal
 
IRJET- Emotion recognition using Speech Signal: A Review
IRJET-  	  Emotion recognition using Speech Signal: A ReviewIRJET-  	  Emotion recognition using Speech Signal: A Review
IRJET- Emotion recognition using Speech Signal: A ReviewIRJET Journal
 
Classification of Vehicles Based on Audio Signals using Quadratic Discriminan...
Classification of Vehicles Based on Audio Signals using Quadratic Discriminan...Classification of Vehicles Based on Audio Signals using Quadratic Discriminan...
Classification of Vehicles Based on Audio Signals using Quadratic Discriminan...ijsc
 

Ähnlich wie CASR-Report (20)

Acoustic Scene Classification by using Combination of MODWPT and Spectral Fea...
Acoustic Scene Classification by using Combination of MODWPT and Spectral Fea...Acoustic Scene Classification by using Combination of MODWPT and Spectral Fea...
Acoustic Scene Classification by using Combination of MODWPT and Spectral Fea...
 
Eigen Subspace based Direction of Arrival Estimation for Coherent Sources
Eigen Subspace based Direction of Arrival Estimation for Coherent SourcesEigen Subspace based Direction of Arrival Estimation for Coherent Sources
Eigen Subspace based Direction of Arrival Estimation for Coherent Sources
 
Kernal based speaker specific feature extraction and its applications in iTau...
Kernal based speaker specific feature extraction and its applications in iTau...Kernal based speaker specific feature extraction and its applications in iTau...
Kernal based speaker specific feature extraction and its applications in iTau...
 
A novel speech enhancement technique
A novel speech enhancement techniqueA novel speech enhancement technique
A novel speech enhancement technique
 
A Gaussian Clustering Based Voice Activity Detector for Noisy Environments Us...
A Gaussian Clustering Based Voice Activity Detector for Noisy Environments Us...A Gaussian Clustering Based Voice Activity Detector for Noisy Environments Us...
A Gaussian Clustering Based Voice Activity Detector for Noisy Environments Us...
 
T26123129
T26123129T26123129
T26123129
 
Design and implementation of different audio restoration techniques for audio...
Design and implementation of different audio restoration techniques for audio...Design and implementation of different audio restoration techniques for audio...
Design and implementation of different audio restoration techniques for audio...
 
IRJET- Machine Learning and Noise Reduction Techniques for Music Genre Classi...
IRJET- Machine Learning and Noise Reduction Techniques for Music Genre Classi...IRJET- Machine Learning and Noise Reduction Techniques for Music Genre Classi...
IRJET- Machine Learning and Noise Reduction Techniques for Music Genre Classi...
 
Ijcet 06 09_001
Ijcet 06 09_001Ijcet 06 09_001
Ijcet 06 09_001
 
Broad phoneme classification using signal based features
Broad phoneme classification using signal based featuresBroad phoneme classification using signal based features
Broad phoneme classification using signal based features
 
IRJET - Essential Features Extraction from Aaroh and Avroh of Indian Clas...
IRJET -  	  Essential Features Extraction from Aaroh and Avroh of Indian Clas...IRJET -  	  Essential Features Extraction from Aaroh and Avroh of Indian Clas...
IRJET - Essential Features Extraction from Aaroh and Avroh of Indian Clas...
 
Path Loss Prediction by Robust Regression Methods
Path Loss Prediction by Robust Regression MethodsPath Loss Prediction by Robust Regression Methods
Path Loss Prediction by Robust Regression Methods
 
IRJET- Musical Instrument Recognition using CNN and SVM
IRJET-  	  Musical Instrument Recognition using CNN and SVMIRJET-  	  Musical Instrument Recognition using CNN and SVM
IRJET- Musical Instrument Recognition using CNN and SVM
 
Sensitivity of Support Vector Machine Classification to Various Training Feat...
Sensitivity of Support Vector Machine Classification to Various Training Feat...Sensitivity of Support Vector Machine Classification to Various Training Feat...
Sensitivity of Support Vector Machine Classification to Various Training Feat...
 
A Novel Uncertainty Parameter SR ( Signal to Residual Spectrum Ratio ) Evalua...
A Novel Uncertainty Parameter SR ( Signal to Residual Spectrum Ratio ) Evalua...A Novel Uncertainty Parameter SR ( Signal to Residual Spectrum Ratio ) Evalua...
A Novel Uncertainty Parameter SR ( Signal to Residual Spectrum Ratio ) Evalua...
 
Broad Phoneme Classification Using Signal Based Features
Broad Phoneme Classification Using Signal Based Features  Broad Phoneme Classification Using Signal Based Features
Broad Phoneme Classification Using Signal Based Features
 
IMPROVEMENT OF BM3D ALGORITHM AND EMPLOYMENT TO SATELLITE AND CFA IMAGES DENO...
IMPROVEMENT OF BM3D ALGORITHM AND EMPLOYMENT TO SATELLITE AND CFA IMAGES DENO...IMPROVEMENT OF BM3D ALGORITHM AND EMPLOYMENT TO SATELLITE AND CFA IMAGES DENO...
IMPROVEMENT OF BM3D ALGORITHM AND EMPLOYMENT TO SATELLITE AND CFA IMAGES DENO...
 
IMPROVEMENT OF BM3D ALGORITHM AND EMPLOYMENT TO SATELLITE AND CFA IMAGES DENO...
IMPROVEMENT OF BM3D ALGORITHM AND EMPLOYMENT TO SATELLITE AND CFA IMAGES DENO...IMPROVEMENT OF BM3D ALGORITHM AND EMPLOYMENT TO SATELLITE AND CFA IMAGES DENO...
IMPROVEMENT OF BM3D ALGORITHM AND EMPLOYMENT TO SATELLITE AND CFA IMAGES DENO...
 
IRJET- Emotion recognition using Speech Signal: A Review
IRJET-  	  Emotion recognition using Speech Signal: A ReviewIRJET-  	  Emotion recognition using Speech Signal: A Review
IRJET- Emotion recognition using Speech Signal: A Review
 
Classification of Vehicles Based on Audio Signals using Quadratic Discriminan...
Classification of Vehicles Based on Audio Signals using Quadratic Discriminan...Classification of Vehicles Based on Audio Signals using Quadratic Discriminan...
Classification of Vehicles Based on Audio Signals using Quadratic Discriminan...
 

CASR-Report

  • 1. Computational Auditory Scene Recognition Shriram Nandakumar, Deepa Naik Student Numbers: 244935, 232887 nandakum@student.tut.fi, deepa.naik@student.tut.fi ABSTRACT Computational Auditory Scene Recognition (CASR) refers to the study of processing and understanding the audio signals of a scene to understand its context. In this paper, a multi-class (12 class) audio environment classification is attempted. Frame-level sub-band energy ratios are computed as features for the input audio data. 𝑘-Nearest Neighbor classifier is used for classification. Performance is analysed based on overall classification accuracy and class- wise accuracies. In spite of the simplicity of the methods used, the performance of the classifier is superior. 1. INTRODUCTION / BACKGROUND Auditory Scene Analysis (ASA) refers to the physiological process that a human ear performs on the sound reaching the ear. This process is responsible for the ability of human hearing to distinguish individual sound sources from a complex mixture of sounds [1]. Computational Auditory Scene Analysis (CASA) is the challenge of constructing a machine system that is capable of achieving the human performance in ASA and hence biologically motivated [2]. It is usually referred to as cock- tail party problem. It typically uses one (mono-aural) or two (bin-aural) microphone recordings of the acoustic scene. As a result, CASA is fundamentally different from other source separation techniques like beamforming and independent component analysis (ICA). The goal of CASA is to computationally extract individual “streams” from one or two recordings of an acoustic scene. In ASA, stream is a perceptual structure in which local time- frequency segments that are likely to have arisen from the same environmental source are grouped. In CASA, the term "stream" can refer both to the perceptual representation of a sound source, and to the representation of a sound source in computer memory. 
CASA finds applications in noise-robust automatic speech recognition, hearing prostheses, automatic music transcription, audio information retrieval and developments in hearing science [1]. Figure 1 shows a typical CASA architecture. A digitally recorded acoustic mixture is subjected to peripheral analysis to obtain a time-frequency representation, or cochleagram. Acoustic features are then extracted from this representation; conventional examples include periodicity, onsets, offsets, and amplitude and frequency modulation. From the extracted features, mid-level representations such as segments are obtained. Scene organization involves grouping of cues and training models of individual sound sources. The final step is re-synthesis of the audio waveform from a separated stream.

Fig 1. CASA Architecture [1]

Computational Auditory Scene Recognition (CASR), an offspring of CASA, is a term coined by Peltonen [4]. CASR is the study of processing and understanding the audio signals of a scene in order to determine its context. For example, it refers to the process of identifying the environment of a mobile device based on the characteristics of the audio signal recorded by the device [5]. Unlike CASA, CASR classifies the audio mixtures themselves into predefined classes, without trying to recognize the individual sources. A typical example is identifying the location as a street from sound sources such as cars passing by, people walking and other environmental sounds. Such context-aware mobile devices can react to changes in the usage environment and thus serve the needs of users better by adjusting their mode of operation. Areas closely related to CASR include speech/music discrimination, noise classification and content-based audio retrieval [4].

2. THEORY & METHODS

This section covers the theory and concepts of the pre-processing, feature extraction and classification methods used.
2.1 Pre-processing

Real-world audio signals are non-stationary, so any audio signal analysis and classification task extracts features frame-wise. The pre-processing step therefore divides the signal into short segments called frames. Typically, the signal values in each frame are multiplied by a smooth window such as the Hanning window.

2.2 Feature Extraction

The acoustic attributes of an audio signal can be divided into two groups: perceptual and physical features [4]. Perceptual features describe the sensation of a sound in subjective terms such as loudness, pitch, brightness and timbre. Physical features are calculated mathematically from the sound wave; examples are intensity, fundamental frequency, spectrum and spectral centroid. This paper uses a simple physical feature called the Sub-Band Energy Ratio (SBER), which describes how the energy of the audio signal is distributed among different frequency bands. The SBER of the i-th band is computed as

  x(i) = ( Σ_{l=b_i}^{e_i} |S(l)|² ) / ( Σ_{l=0}^{L/2} |S(l)|² ),   (1)

where S is the discrete Fourier transform of a signal frame, l is the bin index, L is the total number of frequency bins, and b_i and e_i are the first and last bins of the i-th frequency band.

2.3 Classification

The k-Nearest Neighbor (k-NN) method is used for classification. In the simplest case, 1-NN, an M-dimensional test sample is classified according to its nearest neighbor, in the Euclidean-distance sense, among the correctly classified training samples

  D = {x_1, θ(x_1), x_2, θ(x_2), …, x_N, θ(x_N)},   (2)

where N is the size of the training set and θ(x_n) is the index of the class that the n-th training sample belongs to, taking values in {1, 2, …, C}, C being the number of classes. The nearest neighbor to a test vector x is

  x′ = argmin_{x_n ∈ D} d(x, x_n),   (3)

and the test vector is assigned the same class as x′.
Here d(a, b) is the distance function, typically the l₂ (Euclidean) distance defined as

  d(a, b) = ‖a − b‖₂ = √( Σ_{i=1}^{M} (a_i − b_i)² ).   (4)

In the case of k-NN, instead of a single neighbor, the k nearest neighbors are found in order of their distances, together with their class labels, and a majority vote among those labels decides the class of the test sample.

2.4 Performance Measures

Performance assessment is an equally important part of classifier design, however sophisticated the methods deployed in the design stage. Reliable statistical estimates of the performance measures should be obtained by judiciously dividing the data into training, test and cross-validation sets. Apart from the choice of k itself, a k-NN classifier has no parameters to tune, so no separate cross-validation set is needed here. As performance measures, the overall classification accuracy and the accuracy for each class are obtained from the confusion matrix. Each column of the confusion matrix denotes the instances of a predicted class, while each row represents the instances of an actual class. Class-wise classification accuracy is computed as

  Acc_i (%) = ( c_{i,i} / Σ_{j=1}^{K} c_{i,j} ) × 100,   (5)

where Acc_i is the percentage accuracy for the i-th class, c_{i,j} are the entries of the confusion matrix C, and K is the number of classes.

3. IMPLEMENTATION

This section gives details of the database and the specifics of the classifier implementation. The implementation is done in Matlab®.

3.1. Database

A subset of the Environmental Noise Data Set (Series 2), collected by the University of East Anglia, is used [5]. The data consists of 8 kHz recordings from 12 different audio environments, with one audio file per environment. To prepare the training and test sets, every audio file is divided into 1 s non-overlapping segments; 80% of the total number of segments is used for training and the rest for testing. The segments are also given class label tags as shown in Table 1.

TABLE 1.
Class Label   Environment
1             Building site
2             Bus
3             Highway
4             Car
5             Launderette
6             Office
7             Presentation
8             Shopping centre
9             Street / people
10            Street / traffic
11            Supermarket
12            Train
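The report's own implementation is in Matlab and is not reproduced here. As an illustration, the pipeline of Sections 2 and 3 can be sketched in Python with NumPy. The frame length, window, band edges, DFT size and the 80/20 split are taken from the text; treating every frame as an independent sample and taking the first segments of each file as training data are assumptions of this sketch, not details confirmed by the report.

```python
import numpy as np
from collections import Counter

def frame_signal(x, fs=8000, frame_ms=30, overlap=0.5):
    """Cut a signal into 30 ms Hanning-windowed frames with 50% overlap
    (Sections 2.1 and 3.2); returns an (n_frames, frame_len) array."""
    flen = int(fs * frame_ms / 1000)              # 240 samples at 8 kHz
    hop = int(flen * (1 - overlap))               # 120-sample hop
    win = np.hanning(flen)
    n = 1 + (len(x) - flen) // hop
    return np.stack([x[i * hop:i * hop + flen] * win for i in range(n)])

def sber(frame, fs=8000, n_fft=1024,
         bands=((0, 500), (500, 1000), (1000, 2000), (2000, 4000))):
    """Sub-band energy ratios, Eq. (1): energy in each band divided by
    the total spectral energy of the (already windowed) frame."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2   # |S(l)|^2, l = 0..L/2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)       # bin centre frequencies
    total = power.sum()
    return np.array([power[(freqs >= lo) & (freqs < hi)].sum() / total
                     for lo, hi in bands])

def knn_classify(x, X_train, y_train, k=5):
    """Majority vote among the k nearest training samples under the
    Euclidean distance of Eq. (4) (Eqs. 2-3 generalized to k > 1)."""
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x), axis=1)
    nearest = np.argsort(dists)[:k]                  # k smallest distances
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]

def split_segments(signal, fs=8000, train_frac=0.8):
    """Non-overlapping 1 s segments per recording; first 80% for
    training, the rest for testing (Section 3.1)."""
    n_seg = len(signal) // fs                        # drop the leftover tail
    segments = np.reshape(signal[:n_seg * fs], (n_seg, fs))
    n_train = int(train_frac * n_seg)
    return segments[:n_train], segments[n_train:]
```

In use, each 1 s segment would be framed with frame_signal, every frame reduced to a 4-dimensional SBER vector, and each test frame labelled by knn_classify against the pooled labelled training frames.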
Fig 2. Overall percentage accuracy for various values of k in k-Nearest Neighbors.

3.2. Specifics of pre-processing, feature extraction and classifier implementation

The frame size is chosen to be 30 ms with a 50% overlap between adjacent frames, and the samples are Hanning-windowed. The number of DFT bins is 1024. Four frequency bands are considered, viz. 0-0.5 kHz, 0.5-1 kHz, 1-2 kHz and 2-4 kHz. Hence, for each signal frame, a 4-dimensional feature vector is extracted by computing the SBERs. For the k-NN classifier, k is varied between 1 and 10, using the built-in Matlab function.

4. RESULTS AND DISCUSSION

The overall percentage accuracy for various values of k is plotted in Figure 2. It can be observed that for k ≥ 3 the classifier yields better performance. The poor classification accuracy for the smallest values of k can be attributed to their inherent tendency to over-fit. Large values of k (for example, k ≥ 10) are not wise choices either, as they under-fit the data and hence yield unsatisfactory results. The class-wise percentage accuracies for k = 1 and k = 5 are shown in Figures 3 and 4 respectively. To complement the figures, an example confusion matrix is shown in Table 2. Building site (class 1) and office (class 6) environments are identified with near-perfect accuracy. The launderette (class 5) and street/traffic (class 10) environments can also be distinguished with a high degree of confidence. The classifier finds it extremely difficult to distinguish bus (class 2) and shopping centre (class 8) from other environments: the bus environment is often confused with highway, street/traffic, car and presentation, while the shopping centre is confused with the supermarket. It is also interesting to observe that the classification accuracy for the train environment (class 12) changes drastically with k.

Fig 3. Class-wise percentage accuracies for k = 1.

Fig 4.
Class-wise percentage accuracies for k = 5.

TABLE 2. CONFUSION MATRIX FOR k = 5
(The rows correspond to the true class and the columns to the predicted class.)

No. 1 2 3 4 5 6 7 8 9 10 11 12
1 56 1 2 1 2 18 10 9 1 9 0 12 3 13 44 3 4 9 30 4 16 5 3 54 1 1 6 61 7 1 2 6 5 1 40 3 1 8 2 24 4 1 19 9 9 1 7 3 1 33 1 13 10 5 52 1 2 11 1 3 1 4 11 7 30 2 12 3 1 55
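The class-wise accuracies of Eq. (5), and the overall accuracy, can be read directly off a confusion matrix such as Table 2. A minimal Python sketch (the report's implementation is in Matlab), assuming as in the table that rows are true classes and columns are predictions:

```python
import numpy as np

def accuracies(conf):
    """Per-class accuracy, Eq. (5): diagonal entry over row sum, in
    percent; overall accuracy: trace over the grand total."""
    conf = np.asarray(conf, dtype=float)
    per_class = 100.0 * np.diag(conf) / conf.sum(axis=1)   # Eq. (5)
    overall = 100.0 * np.trace(conf) / conf.sum()
    return per_class, overall
```

For a 2-class matrix [[8, 2], [4, 6]], this gives per-class accuracies of 80% and 60% and an overall accuracy of 70%.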
5. CONCLUSION

In this paper, a method for recognizing the environment of an audio signal was proposed. With sub-band energy ratios as features and a k-Nearest Neighbor classifier, an overall accuracy of 70% was achieved on a 12-class dataset. In spite of the simple methods used in all design stages of the classifier, the performance was observed to be satisfactory. The simplicity of the algorithm comes at the cost of slow classification, as k-NN requires an exhaustive search over the stored training samples for every test sample, which hampers real-time use.

REFERENCES

[1] D. Wang and G. J. Brown, Computational Auditory Scene Analysis: Principles, Algorithms and Applications. Wiley-IEEE Press, 2006.
[2] E. C. Cherry, "Some experiments on the recognition of speech, with one and with two ears," The Journal of the Acoustical Society of America, vol. 25, pp. 975-979, 1953.
[3] A. S. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press, 1990.
[4] V. Peltonen, "Computational auditory scene recognition," M.Sc. thesis, Dept. of Signal Processing, Tampere University of Technology, Tampere, Finland, 2001.
[5] School of Computing Sciences, University of East Anglia (2004), Environmental Noise Data Set: Series 2 [online]. Available: http://guides.lib.monash.edu/content.php?pid=346637&sid=3402748

APPENDIX (Answers to the intermediate tasks)

Task 2.1
Sampling frequency = 8 kHz.
Length of one frame in samples = 240.

Fig 5. Signal in the 50th frame of speech.wav

Fig 6. Amplitude spectrum of the signal in the 50th frame of speech.wav

Indices of the DFT corresponding to 1-2 kHz = 256-512.

Fig 7. Sub-band energy ratios for the 50th frame of speech.wav

Comments about the lab work: This is one of the best tasks in the entire lab course. It is well developed, simple to understand, and gave hands-on experience of a real pattern classification task. It is also a good task for honing Matlab skills, especially array indexing. It took 2 days to implement and 4 days to write the report.