2. The Web, databases, and other digitized information storehouses contain a
growing volume of audio content.
Examples include newscasts, sporting events, telephone conversations,
recordings of meetings, Webcasts, and documentary archives.
Users want to make the most of this material by searching and indexing
the digitized audio content.
In the past, companies had to create and manually analyze written
transcripts of audio content because using computers to recognize, interpret,
and analyze digitized speech was difficult.
However, the development of faster microprocessors, larger storage
capacities, and better speech-recognition algorithms has made audio mining
easier.
3. Audio mining, also called audio searching, takes a text-based query
and locates the search term or phrase in an audio file.
This helps users by, for example, letting them quickly get to specific
places in a recorded conversation or determine when a company is
mentioned in a newscast.
Audio indexing uses speech recognition to analyze an entire file and
produce a searchable index of content-bearing words and their
locations.
This is critical because audio content is in a binary format that is
otherwise not readily searchable.
Indexing audio content thus enables searching.
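As a rough illustration of what such an index might look like, the sketch below assumes a speech recognizer has already produced (word, start time) pairs. The sample data, the stop-word list, and the function names are hypothetical and are not taken from any particular product.

```python
from collections import defaultdict

# Hypothetical recognizer output: (word, start time in seconds).
RECOGNIZED = [
    ("the", 0.0), ("merger", 0.4), ("was", 0.9), ("announced", 1.1),
    ("today", 1.8), ("the", 12.5), ("merger", 12.8), ("closes", 13.3),
]

# Assumed stop words that carry little content and are left out of the index.
STOP_WORDS = {"the", "was", "a", "of"}


def build_index(recognized):
    """Map each content-bearing word to the times at which it was spoken."""
    index = defaultdict(list)
    for word, start in recognized:
        word = word.lower()
        if word not in STOP_WORDS:
            index[word].append(start)
    return dict(index)


def search(index, term):
    """Return the start times for a query term, or an empty list."""
    return index.get(term.lower(), [])


index = build_index(RECOGNIZED)
print(search(index, "merger"))  # [0.4, 12.8]
```

A text query then becomes a dictionary lookup that returns positions in the audio file, which is what lets a user jump straight to the relevant moment.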
4. There are two main approaches to audio mining:
1. Text-based indexing:
o It converts speech to text and then identifies words in a dictionary
that can contain several hundred thousand entries. If a word or name is
not in the dictionary, the system will choose the most similar word it
can find.
o The system uses language understanding to create a confidence level
for its findings. For findings with less than a 100 percent confidence
level, the system offers other possible word matches.
5. 2. Phoneme-based indexing:
o It doesn’t convert speech to text but instead works only with sounds.
o The system first analyzes and identifies sounds in a piece of audio
content to create a phonetic-based index. It then uses a dictionary of
several dozen phonemes to convert a user’s search term to the correct
phoneme string.
o Phonemes are the smallest units of speech in a language; all words
are sets of phonemes.
o Finally, the system looks for the search term's phoneme string in the index.
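The phoneme-based steps above can be sketched as follows. This is a minimal illustration only: the pronunciation dictionary, the ARPAbet-like phoneme symbols, and the phonetic index are all hypothetical, and the search uses exact matching, whereas a practical system would have to tolerate recognition errors.

```python
# Hypothetical pronunciation dictionary (ARPAbet-like symbols, assumed).
PRONUNCIATIONS = {
    "cat": ["K", "AE", "T"],
    "data": ["D", "EY", "T", "AH"],
}

# Hypothetical phonetic index of one audio file: (phoneme, start time in seconds).
PHONETIC_INDEX = [
    ("DH", 0.00), ("AH", 0.08), ("K", 0.15), ("AE", 0.22), ("T", 0.31),
    ("S", 0.40), ("AE", 0.48), ("T", 0.55),
]


def to_phonemes(term):
    """Convert a search term to its phoneme string via the dictionary."""
    return PRONUNCIATIONS[term.lower()]


def find_term(term, index):
    """Return start times where the term's phoneme string occurs in the index."""
    target = to_phonemes(term)
    symbols = [phoneme for phoneme, _ in index]
    hits = []
    for i in range(len(symbols) - len(target) + 1):
        if symbols[i:i + len(target)] == target:
            hits.append(index[i][1])
    return hits


print(find_term("cat", PHONETIC_INDEX))  # [0.15]
```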
6. A phonetic system requires a more efficient search tool because it
must phoneticize the query term, then try to match it with the existing
phonetic string output. This is considerably more complex than using
one of the many existing text-based search tools.
Phoneme-based searches can result in more false matches than the
text-based approach, particularly for short search terms, because many
words sound alike or sound like parts of other words.
However, phonetic indexing can still be useful if the analyzed
material contains important words that are likely to be missing from a
text system’s dictionary, such as foreign terms and names of people and
places.
7. Text- and phoneme-based systems operate in much the same way, except
that the former uses a text-based dictionary and the latter uses a phonetic
dictionary.
A speech recognizer converts the observed acoustic signal into the
corresponding written representation of the spoken words.
Speech recognition software contains acoustic models of the way in which
all phonemes are represented.
A statistical language model also indicates how likely words are
to follow each other in a specific language.
By using these capabilities, as well as complex probability analysis, the
technology can take a speech signal of unknown content and convert it to a
series of words.
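One common way to write the combination of acoustic and language models described here is the decision rule below. It is a standard textbook formulation rather than something stated in the slides, and the symbols X (the observed acoustic signal), W (a candidate word sequence), and \hat{W} (the recognizer's output) are introduced only for illustration.

```latex
\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)
```

In this reading, P(X | W) plays the role of the acoustic model and P(W) that of the statistical language model.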
10. Since music is often described by genre, we would like to annotate
our music data with genre.
Classification by genre is useful for music search and retrieval and
also for playlist generation.
11. Linear and non-linear neural networks
Gaussian classification
Gaussian mixture models
Hidden Markov models
12. Let us consider a simple weather forecast problem and try to
build a model that can predict tomorrow's weather based on
today’s condition. In this example there are three weather conditions,
each assumed to hold for the whole day: sunny (S), cloudy (C), or rainy (R).
From the history of the weather of the town under investigation we have the
following table
13. • We refer to the weather condition by a state q sampled at
instant t; the problem is to find the probability of tomorrow's
weather condition given today's condition,
$P(q_{t+1} \mid q_t)$.
• An acceptable approximation for a history of n instants is:
• $P(q_{t+1} \mid q_t, q_{t-1}, q_{t-2}, \ldots, q_{t-n}) \approx P(q_{t+1} \mid q_t)$.
• This is a first-order Markov chain, because the history is
limited to one instant.
14. Let us now ask this question: given that today is sunny (S), what is
the probability that the next five days are S, C, C, R,
and S under the above model?
The answer follows from the first-order Markov chain:
$$P(q_1 = S, q_2 = S, q_3 = C, q_4 = C, q_5 = R, q_6 = S)$$
$$= P(S)\, P(q_2 = S \mid q_1 = S)\, P(q_3 = C \mid q_2 = S)\, P(q_4 = C \mid q_3 = C)\, P(q_5 = R \mid q_4 = C)\, P(q_6 = S \mid q_5 = R)$$
$$= 1 \times 0.7 \times 0.2 \times 0.8 \times 0.15 \times 0.15 = 0.00252$$
The initial probability P(S) = 1, because today is assumed to be
sunny.
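The same calculation can be checked with a few lines of code. The sketch below uses only the five transition probabilities quoted in the worked example; the remaining entries of the weather table are not reproduced here, and the function and variable names are illustrative.

```python
# Transition probabilities P(next | current) quoted in the worked example.
# Only these entries appear in the text; the rest of the table is not reproduced.
TRANSITIONS = {
    ("S", "S"): 0.7,
    ("S", "C"): 0.2,
    ("C", "C"): 0.8,
    ("C", "R"): 0.15,
    ("R", "S"): 0.15,
}


def sequence_probability(states, transitions, initial=1.0):
    """Probability of a state sequence under a first-order Markov chain."""
    prob = initial
    for current, nxt in zip(states, states[1:]):
        prob *= transitions[(current, nxt)]
    return prob


p = sequence_probability(["S", "S", "C", "C", "R", "S"], TRANSITIONS)
print(round(p, 5))  # 0.00252
```

Each factor depends only on the previous day's state, which is exactly the first-order assumption.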
15.
16. The model is completely defined by the three sets of parameters A,
B, and π, and a model with N states and M observation symbols can be
referred to by:
$\lambda = (A, B, \pi)$
where $A = \{a_{ij}\}$ and $B = \{b_j(w_k)\}$, with $1 \le i, j \le N$ and $1 \le k \le M$.
$a_{ij}$ is the state transition probability (the probability of moving
to state $S_j$ given the current state $S_i$):
$a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$.
$b_j(w_k)$ is the probability of observing symbol $w_k$ in state $S_j$.
$w$ is the observation alphabet, and $k$ indexes its $M$ symbols.
$\pi = \{1, 0, 0\}$ is the initial state probability distribution.
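To make the roles of the three parameter sets concrete, the sketch below evaluates the probability of a short observation sequence with the forward algorithm, one common use of such a model. The forward algorithm is not named in the slides, and every number in A and B is hypothetical; only the initial distribution (1, 0, 0) comes from the text.

```python
# Hypothetical model with N = 3 states and M = 2 observation symbols.
A = [  # A[i][j] = P(q_{t+1} = S_j | q_t = S_i); values are made up
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
]
B = [  # B[j][k] = b_j(w_k), probability of symbol w_k in state S_j; made up
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
]
PI = [1.0, 0.0, 0.0]  # initial state distribution, as given in the text


def forward(observations, A, B, pi):
    """Return P(observations | model) using the forward algorithm."""
    n = len(pi)
    # Initialisation: start in each state and emit the first symbol.
    alpha = [pi[j] * B[j][observations[0]] for j in range(n)]
    # Induction: move to each next state, then emit the next symbol.
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs]
            for j in range(n)
        ]
    return sum(alpha)


likelihood = forward([0, 1], A, B, PI)
print(likelihood)
```

Observations are given as symbol indices (0 for the first symbol, 1 for the second), so `forward([0, 1], ...)` asks how likely the model is to emit the first symbol followed by the second.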
17. Each phoneme (unit of sound) can be represented by a
state of varying duration.
Accordingly, the transitions between the phonemes that
form a word can be represented by $A = \{a_{ij}\}$.
The observations in this case are the sounds produced in
each position; because each sound varies in how it evolves,
they can also be represented by a probabilistic function
$B = \{b_j(w_k)\}$.
18. A major challenge for speech recognition tools has been recognizing the
speech of different users in different environments.
With this in mind, BBN, IBM, Fast-Talk, and ScanSoft have designed their
audio mining technology to be largely speaker independent.
For example, Fast-Talk’s acoustic models are trained to recognize numerous
speakers via exposure to audio data from speakers representing various ages,
dialects, and speaking styles.
Some audio mining technology uses acoustic models tuned to understand
speech from different environments, such as telephony, TV, or radio.
19. Designing filters to reduce the background noise that can interfere
with accurate speech recognition.
Creating efficient data structures for representing content.
Developing algorithms that quickly work through the data structures
during indexing and searching.
20. Precision is improving but it is still a key issue impeding the
technology’s widespread adoption, particularly in such accuracy-
critical applications as court reporting and medical dictation.
Processing conversational speech can be particularly difficult
because of such factors as overlapping words and background noise.
Breakthroughs in natural language understanding will eventually
lead to big improvements, but until then audio mining will get
better only incrementally.
It’s currently seen as a ‘nice to have,’ not quite yet a ‘need to have’
technology.
21. Companies could use audio mining to analyze customer-service and
helpdesk conversations or even voice mail.
Law enforcement and intelligence organizations could use the
technology to analyze intercepted phone conversations.
Broadcast companies like CNN and Radio Free Asia are already using
audio mining to quickly retrieve important background information
from previous broadcasts when new stories break.
A US prison is using ScanSoft’s audio mining product to analyze
recordings of prisoners’ phone calls to identify illegal activity.
22. Musical audio mining relates to the identification of perceptually
important characteristics of a piece of music such as melodic,
harmonic or rhythmic structure.
Searches can then be carried out to find pieces of music that are
similar in terms of their melodic, harmonic and/or rhythmic
characteristics.
This type of analysis can also be used in music to determine
characteristics like beats per minute (BPM), musical key, and musical
structure, information that is employed to classify music.
Music downloading sites that categorize music by genre use audio
mining to organize the music.
23. Research paper: “Let’s Hear It for Audio Mining” by Neal Leavitt.
Research paper: “Tendencies, Perspectives, and Opportunities of
Musical Audio-Mining” by Ghent University, Belgium.
wikipedia.org/wiki/Audio_mining