Computational Auditory Scene Recognition
CASR-Report
Shriram Nandakumar, Deepa Naik
Student Numbers: 244935, 232887
nandakum@student.tut.fi, deepa.naik@student.tut.fi
ABSTRACT
Computational Auditory Scene Recognition (CASR) refers
to the study of processing and understanding the audio
signals of a scene in order to infer its context. In this paper, a
multi-class (12-class) audio environment classification is
attempted. Frame-level sub-band energy ratios are
computed as features for the input audio data. A k-Nearest
Neighbor classifier is used for classification. Performance is
analysed in terms of overall classification accuracy and class-
wise accuracies. In spite of the simplicity of the methods
used, the classifier performs well, reaching an overall
accuracy of about 70%.
1. INTRODUCTION / BACKGROUND
Auditory Scene Analysis (ASA) refers to the
perceptual process by which the human auditory system
organizes the sound reaching the ears. This process is
responsible for the ability of human hearing to distinguish
individual sound sources from a complex mixture of sounds [1].
Computational Auditory Scene Analysis (CASA) is the
challenge of constructing a machine system capable of
matching human performance in ASA; it is therefore
biologically motivated [2]. The underlying task is usually
referred to as the cocktail party problem. CASA typically
uses one (monaural) or two (binaural) microphone
recordings of the acoustic scene. As a result, CASA is
fundamentally different from other source separation
techniques such as beamforming and independent
component analysis (ICA), which generally rely on larger
microphone arrays.
The goal of CASA is to computationally extract individual
“streams” from one or two recordings of an acoustic scene.
In ASA, a stream is a perceptual structure in which local time-
frequency segments that are likely to have arisen from the
same environmental source are grouped. In CASA, the term
"stream" can refer both to the perceptual representation of a
sound source, and to the representation of a sound source in
computer memory. CASA finds applications in noise-robust
automatic speech recognition, hearing prostheses, automatic
music transcription, audio information retrieval and
developments in hearing science [1].
Figure 1 shows a typical CASA architecture. A digitally
recorded acoustic mixture is subjected to peripheral analysis
by obtaining the time-frequency representation or
cochleagram. Acoustic features are then extracted from the
time-frequency representation. Examples of conventional
acoustic features are periodicity, onsets, offsets, amplitude
and frequency modulation. From the extracted features,
mid-level representations such as segments are obtained.
Scene Organization involves grouping of cues and training
models of individual sound sources. The final step includes
re-synthesis of audio waveform from a separated stream.
Fig 1. CASA Architecture [1]
Computational Auditory Scene Recognition (CASR), an
offspring of CASA, is a term coined by Peltonen [4]. CASR
involves the study of processing and understanding the
audio signals of a scene to understand its context. For
example, it refers to the process of identifying the
environment of a mobile device based on the characteristics
of the audio signal recorded by the device [5].
Unlike CASA, CASR classifies the audio mixtures
themselves into predefined classes, without trying to
recognize the individual sources. A typical example is
identifying the location as a street from sound sources such
as cars passing by, people walking and other environmental
sounds. Such context-aware mobile devices can react to
changes in usage environment and thus provide better
service to the needs of users by adjusting the mode of
operation. Areas that are closely related to CASR include
speech/music discrimination, noise classification and
content-based audio retrieval [4].
2. THEORY & METHODS
This section covers the theory and the concepts of pre-
processing, feature extraction and classification methods
used.
2.1 Pre-processing
Any audio signal analysis and classification task requires
frame-wise features to be extracted due to the non-stationary
nature of real-world audio signals. Hence the pre-processing
step involves dividing the signal into short segments called
frames. Typically, a smooth window such as the Hanning
window is used to multiply the signal values in each frame.
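As an illustration, the framing and windowing step can be sketched as follows. This is a Python/NumPy sketch, not the report's own code (the report's implementation is in Matlab); the frame length and hop are taken from the parameters given in Section 3.2:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames and apply a Hanning window."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        frames[i] = x[i * hop : i * hop + frame_len] * window
    return frames

# 1 s of 8 kHz audio, 30 ms frames (240 samples), 50% overlap (hop = 120)
x = np.random.randn(8000)
frames = frame_signal(x, frame_len=240, hop=120)
print(frames.shape)  # (65, 240)
```

With these parameters, each 1 s segment yields 65 windowed frames.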
2.2 Feature Extraction
The acoustic attributes of an audio signal can be divided
into two groups: perceptual and physical features [4]. The
perceptual features describe the sensation of a sound in
subjective terms such as loudness, pitch, brightness and
timbre. Physical features are calculated mathematically
from the sound wave. Examples are intensity, fundamental
frequency, spectrum, spectral centroid and others. This
paper uses a simple physical feature called Sub-Band
Energy Ratio (SBER) to accomplish the task.
SBER describes the energy distribution of the audio signal
among different frequency bands. SBER is computed as:
x(i) = \frac{\sum_{l=b_i}^{e_i} |S(l)|^2}{\sum_{l=0}^{L/2} |S(l)|^2},   (1)
where S is the Discrete Fourier Transform of a signal frame,
l is the bin index, L is the total number of frequency bins,
and b_i and e_i are the first and last bins of the ith frequency
band.
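A minimal sketch of the SBER computation of Eq. (1), written here in Python/NumPy for illustration (the band edges are those used later in Section 3.2; the report's own implementation is in Matlab):

```python
import numpy as np

def sber(frame, fs, bands, n_fft):
    """Sub-Band Energy Ratio (Eq. 1): energy of each frequency band
    divided by the total energy over bins 0..L/2."""
    spectrum = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2  # |S(l)|^2, l = 0..n_fft/2
    total = spectrum.sum()
    ratios = []
    for lo, hi in bands:
        b = int(lo / fs * n_fft)      # first bin b_i of the band
        e = int(hi / fs * n_fft)      # last bin e_i of the band
        ratios.append(spectrum[b:e + 1].sum() / total)
    return np.array(ratios)

# The four bands from Section 3.2: 0-0.5, 0.5-1, 1-2 and 2-4 kHz
bands = [(0, 500), (500, 1000), (1000, 2000), (2000, 4000)]
frame = np.random.randn(240) * np.hanning(240)  # one windowed 30 ms frame
features = sber(frame, fs=8000, bands=bands, n_fft=1024)
print(features.shape)  # (4,)
```

Each frame thus yields one 4-dimensional feature vector.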
2.3 Classification
The k-Nearest Neighbor (k-NN) method is used for
classification. In the simplest case, 1-NN, an M-dimensional
test sample is assigned the class of its nearest neighbor, in
the Euclidean distance sense, among the correctly classified
training samples D:

D = \{ (\mathbf{x}_1, \theta(\mathbf{x}_1)), (\mathbf{x}_2, \theta(\mathbf{x}_2)), \ldots, (\mathbf{x}_N, \theta(\mathbf{x}_N)) \},   (2)
where N is the size of the training set and \theta(\mathbf{x}_n) is the
index of the class that the nth training sample belongs to,
taking values in \{1, 2, \ldots, C\}, with C the number of
classes. The nearest neighbor to the test vector \mathbf{x} is
computed as
\mathbf{x}' = \underset{\mathbf{x}_n \in D}{\arg\min}\, d(\mathbf{x}, \mathbf{x}_n),   (3)
The test vector is assigned the same class as that of \mathbf{x}'.
Here d(\mathbf{a}, \mathbf{b}) is the distance function, typically the
l_2 (Euclidean) distance defined as:

d(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_2 = \sqrt{\sum_{i=1}^{M} (a_i - b_i)^2}.   (4)
In the case of k-NN, instead of a single neighbor, the k
nearest neighbors are found in order of increasing distance,
along with their class labels. A majority vote among the
class labels of these neighbors then determines the class of
the test sample.
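The k-NN rule of Eqs. (2)-(4) with majority voting can be sketched as follows (a Python/NumPy illustration on toy data, not the report's Matlab implementation):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=5):
    """Classify x by a majority vote among its k nearest training
    samples under the Euclidean (l2) distance of Eq. (4)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # d(x, x_n) for all n
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    votes = Counter(int(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]             # majority class label

# Toy usage: two well-separated classes in a 4-D feature space
X_train = np.vstack([np.zeros((10, 4)), np.ones((10, 4))])
y_train = np.array([1] * 10 + [2] * 10)
print(knn_classify(np.full(4, 0.9), X_train, y_train, k=5))  # 2
```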
2.4 Performance Measures
Performance assessment is an equally important part of
classifier design, even if sophisticated methods are deployed
in the design stage. Reliable statistical estimates of the
performance measures should be obtained by judiciously
dividing the data into training, test and validation sets.
Apart from the choice of k (and the distance metric), a k-NN
classifier has no parameters to train, so no separate
cross-validation stage is used here; instead, k is varied
directly and the resulting accuracies are compared.
As performance measures, overall classification accuracy
and accuracies for each class are obtained from the
confusion matrix. Each column of the confusion matrix
denotes the instances in a predicted class, while each row
represents the instances in an actual class. Class-wise
classification accuracy is computed as:
Acc_i(\%) = \frac{c_{i,i}}{\sum_{j=1}^{K} c_{i,j}} \times 100,   (5)
where Acc_i is the percentage accuracy for the ith class,
c_{i,j} are the entries of the confusion matrix C, and K is the
number of classes.
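Eq. (5) and the overall accuracy can be computed from a confusion matrix as follows (a Python/NumPy sketch using a made-up 3-class matrix for illustration):

```python
import numpy as np

def classwise_accuracy(C):
    """Eq. (5): diagonal entries divided by row sums, in percent
    (rows = true class, columns = predicted class)."""
    return 100.0 * np.diag(C) / C.sum(axis=1)

# Made-up 3-class confusion matrix for illustration
C = np.array([[8, 1, 1],
              [2, 6, 2],
              [0, 0, 10]])
acc = classwise_accuracy(C)              # per-class: 80%, 60%, 100%
overall = 100.0 * np.trace(C) / C.sum()  # overall accuracy: 80%
```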
3. IMPLEMENTATION
This section gives details of the database and specifics of
the classifier implementation. The implementation is done
in Matlab®.
3.1. Database
A subset of the Environmental Noise Data Set (Series 2),
collected by the University of East Anglia, is used [5]. The
data consists of 8 kHz recordings from 12 different audio
environments, with one audio file per environment. To
prepare training and test sets, every audio file is divided
into 1 s non-overlapping segments. 80% of the total number
of segments is used for training and the rest for testing. The
segments are also given class label tags as shown in Table 1.
TABLE 1.
CLASS LABELS OF THE AUDIO ENVIRONMENTS
Class Label Environment
1 Building site
2 Bus
3 Highway
4 Car
5 Launderette
6 Office
7 Presentation
8 Shopping Centre
9 Street / people
10 Street / Traffic
11 Supermarket
12 Train
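The segmentation and 80/20 split described above might be sketched as follows. The report does not state whether the split is contiguous or randomized, so this Python/NumPy sketch simply takes the first 80% of segments for training (an assumption made for illustration):

```python
import numpy as np

def split_segments(signal, fs, train_fraction=0.8):
    """Cut a recording into 1 s non-overlapping segments and split them
    into training and test sets (first 80% / last 20% here)."""
    n_seg = len(signal) // fs                 # number of whole 1 s segments
    segments = signal[:n_seg * fs].reshape(n_seg, fs)
    n_train = int(train_fraction * n_seg)
    return segments[:n_train], segments[n_train:]

# Example: a 60 s recording at 8 kHz -> 48 training / 12 test segments
x = np.random.randn(60 * 8000)
train, test = split_segments(x, fs=8000)
print(train.shape[0], test.shape[0])  # 48 12
```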
Fig 2. Plot of overall percentage accuracy for various
values of k in k-Nearest Neighbors.
3.2. Specifics of pre-processing, feature extraction and
classifier implementation
The frame size is chosen to be 30𝑚𝑠 with a 50% overlap
between adjacent frames. The samples are Hanning
windowed. The number of DFT bins is 1024. Four
frequency bands are considered, viz., 0 - 0.5 kHz, 0.5 – 1
kHz, 1 – 2 kHz and 2 – 4 kHz. Hence, for each signal frame,
a 4-dimensional feature vector is extracted by computing
SBERs. For the k-NN classifier implementation, k is varied
between 1 and 10. Matlab's built-in k-NN function is used.
4. RESULTS AND DISCUSSION
The plot of overall percentage accuracy for various values
of k is shown in Figure 2. It can be observed that for k ≥ 3
the classifier yields better performance. The lower
classification accuracy for smaller values of k can be
attributed to the inherent tendency of small-k classifiers to
over-fit the training data. Very large values of k (for
example, k ≥ 10) are also not wise choices, as they under-fit
the data and hence yield unsatisfactory results.
The class-wise percentage accuracies for k = 1 and k = 5
are shown in Figures 3 and 4, respectively. To complement
the figures, an example confusion matrix is shown in
Table 2. It can be observed that the building site (class 1)
and office (class 6) environments are identified with
near-perfect accuracy. The launderette (class 5) and
street/traffic (class 10) environments can also be
distinguished with a high degree of confidence. The
classifier finds it extremely difficult to distinguish bus
(class 2) and shopping centre (class 8) from the other
environments. The bus environment is often confused with
highway, street/traffic, car and presentation, while the
shopping centre is confused with the supermarket.
It is also interesting to observe that there is a drastic
difference in classification accuracy for a change in 𝑘 in the
case of train environment (class 12).
Fig 3. Plot of class-wise percentage accuracies for 𝑘 = 1.
Fig 4. Plot of class-wise percentage accuracies for 𝑘 = 5.
TABLE 2.
CONFUSION MATRIX FOR 𝑘=5.
(The rows correspond to true class and the columns correspond to predicted class)
No. 1 2 3 4 5 6 7 8 9 10 11 12
1 56 1 2 1
2 18 10 9 1 9 0 12
3 13 44 3
4 9 30 4 16
5 3 54 1 1
6 61
7 1 2 6 5 1 40 3 1
8 2 24 4 1 19 9
9 1 7 3 1 33 1 13
10 5 52 1 2
11 1 3 1 4 11 7 30 2
12 3 1 55
5. CONCLUSION
In this paper, a method for recognizing the environment of
an audio signal was proposed. With sub-band energy ratios
as features and a k-Nearest Neighbor classifier, an overall
accuracy of 70% was achieved on a 12-class dataset. In
spite of the simple methods used in all design stages, the
performance of the classifier was satisfactory. The
simplicity of the algorithm comes at the cost of slow
classification, as k-NN requires an exhaustive search over
the stored training set for every test sample, which is
demanding in both computation and memory for real-time
use.
REFERENCES
[1] D. Wang and G. J. Brown, "Computational Auditory
Scene Analysis: Principles, Algorithms and Applications,"
Wiley-IEEE Press, 2006.
[2] E. C. Cherry, "Some experiments on the recognition of
speech, with one and with two ears," The Journal of the
Acoustical Society of America, vol. 25, pp. 975-979, 1953.
[3] A. S. Bregman, "Auditory Scene Analysis: The
Perceptual Organization of Sound," MIT Press, 1990.
[4] V. Peltonen, "Computational auditory scene
recognition," M.Sc. thesis, Dept. of Signal Processing,
Tampere University of Technology, Tampere, Finland,
2001.
[5] School of Computing Sciences, University of East
Anglia (2004), Environmental Noise Data Set: Series 2
[online]. Available:
http://guides.lib.monash.edu/content.php?pid=346637&sid
=3402748.
APPENDIX
(Answers to the intermediate tasks)
Task 2.1
Sampling frequency = 8 kHz.
Length of one frame in samples = 240.
Fig 5. Signal in the 50th frame of speech.wav
Fig 6. Amplitude spectrum of the signal in the 50th frame of
speech.wav
Indices of the DFT corresponding to 1-2 kHz = 128-256
(for a 1024-point DFT at 8 kHz sampling).
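The mapping from a frequency to its DFT bin index can be checked with a short calculation, assuming the 1024-point DFT and 8 kHz sampling rate stated in Section 3.2:

```python
def freq_to_bin(f_hz, fs=8000, n_fft=1024):
    """Map a frequency in Hz to its DFT bin index (0-based)."""
    return int(f_hz / fs * n_fft)

print(freq_to_bin(1000), freq_to_bin(2000))  # 128 256
```

In Matlab's 1-based indexing these would be bins 129 and 257.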
Fig 7. Sub-band energy ratios for the 50th frame of
speech.wav
Comments about the lab work:
This is one of the best tasks in the entire lab course. It is well
developed, simple to understand and gave a hands-on
experience on a real pattern classification task. It is also a
good task to hone Matlab skills, especially with array
indexing. It took 2 days to implement and 4 days to write
the report.