SHF Public Report Version 1.0 WP2
SemanticHIFI
IST-507913
Public Report of WP2 “Indexing”
Covering period: December 2003 – October 2005
Report version: 1.0
Report preparation date:
Writers: Geoffroy Peeters, Jean-Julien Aucouturier, Florian Plenge, Matthias Gruhne,
Christian Sailer, Etan Fisher
Control: Hugues Vinet, Francis Rousseaux, IRCAM
Classification: Public
Contract start date: December, 1st 2003
Duration: 36 months
Project co-ordinator: Hugues Vinet, IRCAM
Involved Partners: Ircam-AS (WP Co-ordinator), SonyCSL, Fraunhofer IDMT, BGU, Native Instruments
Project funded by the European Community
under the "Information Society Technology"
Program
SHF-IST-507913 page 1
Table of contents
1 WP2 Overview
1.1 Objectives
1.2 Partners’ roles
1.3 WP2 contribution to the project
1.4 Synthesis of main achievements
2 WP2 Results and Achievements
2.1 Segmentation/Audio Summary
2.2 Rhythm description
2.3 Tonality description
2.4 Extractor Discovery System
2.5 Tempo and phase detection
2.6 Browsing by Lyrics
2.7 Sound Source Separation / Score Alignment
2.8 AudioID
3 Methodologies Employed
3.1 WP Management and Co-ordination
3.2 Market following
3.3 Scientific research methodologies
3.4 Software development practices
4 Dissemination
4.1 Scientific publications
4.2 Other scientific dissemination actions
4.3 Contribution to clustering and standardisation
4.4 Professional communications
4.5 Press articles and interviews
5 Outlook
5.1 Information produced by the WP
5.2 New methods
5.3 Scientific breakthroughs
6 Conclusion
1 WP2 Overview
1.1 Objectives
The goal of the WP2 indexing work-package is to develop algorithms and techniques, and to provide modules, for the automatic extraction of signal features enabling a set of specific functionalities. The targeted functionalities are: music segmentation/summarization/visualization; browsing and searching music using high-level music descriptors (rhythm/tonality/timbre features) or using a generic scheme for extracting high-level audio descriptors from the audio signal; music remixing (by source separation using score alignment); browsing by lyrics (using lyrics-to-score-to-audio alignment); automated identification of audio signals; and tempo and phase detection.
1.2 Partners’ roles
Ircam AS: WP leader
• Ircam AS: music segmentation/ summarization/ visualization; browsing/ searching
music using high-level music descriptors (rhythm / tonality),
• Sony CSL: generic scheme for extracting high-level audio descriptors (EDS),
• BGU: music remixing (by source separation using score alignment), browsing
by lyrics (using lyrics to score to audio alignment),
• Fraunhofer IDMT: automated identification of audio signal,
• Native Instruments: tempo and phase detection.
1.3 WP2 contribution to the project
Modules developed in WP2 provide the necessary indexing information for WP3 Browsing, the necessary segment information for WP5 Performing, the audio identification module for WP7 Sharing, and indexing information for the two applications developed (WP6 Authoring Tools and WP8 HIFI system). The development priority of the WP2 modules is therefore set according to the interest of the application providers in specific functionalities, and according to the dependence of modules on other modules.
Development of modules for:
• music segmentation / summarization / visualization,
• high-level features for browsing/searching (rhythm / tonal / timbre description), generic
description inside EDS,
• browsing by lyrics (based on score alignment),
• music remixing (source separation based on score alignment),
• automated identification of audio signal,
• beat and phase detection.
1.4 Synthesis of main achievements
Research has been accomplished and final functional prototypes have been developed. Modules are available either as executables or as libraries. Integration into the applications is under way.
2 WP2 Results and Achievements
2.1 Segmentation/Audio Summary
Responsible partner : IrcamAS
2.1.1 Functional description
Automatic music structure (segmentation) discovery aims at providing insights into the
temporal organization of a music track by analyzing its acoustical content in terms of
repetitions. It then represents a music track as either a set of states or as a set of sequences.
A state is defined as a set of contiguous times containing similar acoustical information. Examples are the musical background of a verse segment or of a chorus segment, which is usually constant during the segment.
A sequence is defined as a set of successive times which is similar to another set of successive times, although the times inside a set are not necessarily similar to each other. A state can therefore be seen as a specific case of a sequence. Examples are the various melodies repeated in a music track.
Once extracted, the structure can be used for intra-document browsing (forward/backward navigation inside a music track by verse/chorus, by melody, etc.) and visualization.
Automatic audio summary generation aims at creating a short audio extract (usually 30 seconds) which is representative of the various contents of a music track. It uses a beat-synchronous concatenation of various parts of a music track, selected according to the structure estimated previously.
The module developed in SHF performs the three following tasks:
1) estimates the large-scale structure of a music track,
2) provides a visual map of a music track,
3) provides an audio summary of a music track.
2.1.2 Scientific and technological breakthroughs
An algorithm for automatic structure extraction and automatic summary generation had been
previously developed in the framework of the European IST Project CUIDADO. This
technology has been further developed in the framework of the SemanticHIFI project.
We describe below the improvements made to this technology in the SemanticHIFI project (see figure).
The feature extraction front-end has been extended to represent harmonic features and combinations of timbre and harmonic features. This is essential for classical and jazz music. For the structure extraction based on the state representation, a time-constrained hierarchical agglomerative clustering is now used, which makes it possible to discard noisy (non-repeated) frames. For the structure extraction based on the sequence representation, sequence detection and connection are now performed with a likelihood approach. Each time segment is considered as a mother sequence, and its likelihood to represent the track duration is computed by comparing its logical occurrences in the track to the observed occurrences in the track. A negative weighting is applied to prevent two segments from overlapping. The audio summary is now computed directly from the original audio signal (stereo, 44.1 kHz); estimated beat markers (see the rhythm description module) are used for a beat-synchronous overlap of the various parts. The algorithm now enforces a constraint on the total duration of the summary.
2.1.3 Hardware and software outlook
The module consists of two executables.
1) An extraction module which extracts the multi-level structure of a music track (stored as
an xml file) and creates an audio summary (stored as an audio file and as an xml file
containing the structure of the audio summary).
2) A music player (see below) which allows the user to interactively listen to the various parts of a music track based on its estimated structure. The player allows the user to skip from one chorus to the next, or from the beginning of the verse to the beginning of the chorus, etc.
[Figure: Media Player mockup for music structure browsing, with segments labelled chorus, bridge and verse]
2.2 Rhythm description
Responsible partner : IrcamAS
2.2.1 Functional description
This module aims at providing high-level features, derived directly from audio signal analysis, related to rhythm characteristics, in order to allow browsing in a music track database. The module (tempo and phase extraction) developed by Native Instruments specifically targets percussion-based music (dance and electronic music). The module developed here by IrcamAS targets the description of rhythm for the general class of music, including non-percussion-based music (jazz, classical, variety music). In this case, note onset detection is a crucial factor, as is the detection of quick tempo variations. The algorithm developed first extracts a time-variable tempo and estimates beat marker positions, then extracts a set of high-level features for browsing: global tempo, global meter, percussivity features, periodicity features, and a rhythmic pattern that can be used for search by similar rhythm in a database.
2.2.2 Scientific and technological breakthroughs
Rhythm features extraction module
The technology has been entirely developed in the framework of the Semantic HIFI project. It
has been designed to address the usual drawbacks of other techniques:
• weak onset detection,
• tempo ambiguities.
The module first estimates the onsets of the signal over time. Onsets are defined here as any meaningful start of a musical event (percussion hits, notes, note transitions, periodic variations of notes). Onset detection is based on a spectral flux measure computed on a time- and frequency-reassigned spectrogram. The use of the latter allows more precise onset detection, even for non-percussive instruments. Most other methods are based either on sub-band energy functions or on spectral flux computed on a standard spectrogram.
The periodicity of the signal is then measured by a proposed combination of the Discrete Fourier Transform (DFT) and the Auto-Correlation Function (ACF). Since the two functions have inverse octave ambiguities, their combination reduces these ambiguities. Most other methods are based on either the DFT, the ACF, or an inter-onset-interval histogram.
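The DFT/ACF combination can be sketched as follows: DFT harmonics of an onset envelope sit at integer multiples of the true tempo frequency, while ACF peaks map to integer divisors of it, so their product keeps only the shared periodicity. This is a minimal illustration of the idea, not the IrcamAS implementation; the envelope, remapping and normalization choices are assumptions:

```python
import numpy as np

def combined_periodicity(onset_env, sr):
    """Combine DFT magnitude and autocorrelation of an onset envelope."""
    n = len(onset_env)
    x = onset_env - onset_env.mean()
    spec = np.abs(np.fft.rfft(x))                  # peaks at k * f0 (harmonics)
    acf = np.correlate(x, x, mode="full")[n - 1:]  # peaks at lags k * T0
    acf = np.clip(acf / acf[0], 0.0, None)         # normalize, keep positive part
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    lags = np.arange(n) / sr
    # remap the ACF (lag domain) onto the frequency axis via lag = 1/f
    acf_on_freq = np.interp(1.0 / np.maximum(freqs, 1e-9), lags, acf,
                            left=0.0, right=0.0)
    return freqs, spec * acf_on_freq               # octave errors cancel out
```

On an impulse train with a 0.5 s period, the product peaks at 2 Hz (120 BPM), while the 1 Hz and 4 Hz octave candidates are suppressed by one factor or the other.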
The tempo is related to the periodicity through rhythm templates. So far, three templates are considered, representing the 2/2, 2/3 and 3/2 meter/beat-subdivision characteristics. The probability of observing a specific tempo under a specific rhythm template is computed from the observed periodicities. Most other methods directly take the main periodicity as the tempo, which does not allow distinguishing between 2/4, 3/4 and 6/8 meters.
Tracking of the tempo changes over time is done by formulating the problem as a Viterbi decoding problem. The tempo and meter/beat subdivision are then estimated simultaneously as the best temporal path through the observations. Most other methods are based on a short-time memory of the past detected periodicities.
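A generic Viterbi decoder over discretized tempo states might look as below. The quadratic jump penalty and the state discretization are illustrative assumptions, not the report's actual transition model:

```python
import numpy as np

def viterbi_tempo(obs_loglik, transition_penalty=1.0):
    """Decode the best tempo path through per-frame tempo log-likelihoods.

    obs_loglik : (n_frames, n_tempi) array of observation scores.
    A quadratic penalty on tempo-state jumps favours smooth tempo curves.
    """
    n_frames, n_states = obs_loglik.shape
    states = np.arange(n_states)
    # trans[i, j]: score contribution of moving from state i to state j
    trans = -transition_penalty * (states[:, None] - states[None, :]) ** 2
    delta = obs_loglik[0].copy()
    psi = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        scores = delta[:, None] + trans            # (from_state, to_state)
        psi[t] = np.argmax(scores, axis=0)         # best predecessor per state
        delta = scores[psi[t], states] + obs_loglik[t]
    path = np.zeros(n_frames, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(n_frames - 2, -1, -1):          # backtrack the best path
        path[t] = psi[t + 1][path[t + 1]]
    return path
```

A frame whose observation briefly favours a distant tempo is overridden by the path cost, which is exactly the smoothing effect that a short-time memory of past periodicities lacks.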
The rhythm characterization for search by similarity is based on rhythm templates in the spectral domain, which avoids the usual drawbacks of other techniques: difficulty of obtaining a robust estimation for non-percussion-based music, difficulty of averaging the description over the full length of the file, length of the description, and computation time of the comparison.
2.2.3 Hardware and software outlook
The module consists of a single executable which performs both tempo/phase estimation and rhythm characterization. The module outputs two XML files:
1) the time-variable tempo and beat position estimates (which can be used later for performing),
2) the global tempo/meter estimates and rhythm characteristics (which can be used later for browsing/searching).
2.3 Tonality description
Responsible partner : IrcamAS
2.3.1 Functional description
This module aims at providing high-level features, derived directly from audio signal analysis, related to tonality characteristics, in order to allow browsing in a music track database. These features are especially important for music based on tonal information (classical music).
Tonality extraction from detected notes (transcription) requires a prior multi-pitch estimation and a prior segmentation step. While this technology can be used for small polyphonies, it still requires a large computation time, which makes it difficult to use in a real task on a large music collection. The module developed is based on a less computationally demanding technology: HPCP/chroma vector estimation. While this technology does not allow the extraction of exact notes (melody), it is sufficient for the extraction of the global key (C, C#, D, …), the mode (Major, minor) and the chords of a music track. We add to this the extraction of a harmonic pattern which can be used for search by similarity.
For each music track, the module provides the following descriptions:
1) Global key (C,C#,D,…) and mode (Major, minor) of the music track,
2) Harmonic pattern, which can be used for search by similarity.
2.3.2 Scientific and technological breakthroughs
Tonality features extraction module
The technology has been entirely developed in the framework of the Semantic HIFI project.
The algorithm operates in two separate stages: an off-line stage (learning) and an on-line stage (evaluation).
In the off-line stage, templates are learnt for each possible pair of key (C, Db, D, …) and mode (Major/minor). For this, we follow an approach similar to the one proposed by Gomez. The key/mode templates are based on Krumhansl profiles; polyphony (chords) is modeled using the three main triads (Gomez), and the harmonic structure of the instrument pitches is modeled as k^[0:H-1] with k<1 (Gomez).
In the on-line stage, for an unknown music track, the audio signal is first smoothed in the time/frequency plane in order to remove noise and transient parts. Its spectrum is then computed and converted to the chroma scale (Wakefield), also called HPCP (Fujishima). For this, the energies of the spectral peaks are summed inside frequency bands corresponding to the chroma scale. Median filtering is applied to smooth the chroma over time. The resulting time/frequency representation is called a chromagram. The key/mode estimation is performed by finding the most likely key/mode template given the observed chroma over time. An approach similar to the one of Izmirli is used. The chroma vectors are progressively accumulated along time in a forward way. At each time, the most likely key/mode template is estimated from the accumulated chroma vector, and a salience is assigned to it based on its distance to the second most likely template. The global key/mode assigned to the music track is the overall most salient key/mode template.
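The template-matching step can be sketched as follows. This is an illustrative sketch, not the IrcamAS code: it uses the Krumhansl–Kessler probe-tone profiles and a plain correlation score, whereas the report's templates additionally model triads and the harmonic structure of pitches:

```python
import numpy as np

# Krumhansl-Kessler major/minor key profiles (probe-tone ratings)
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
KEYS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def estimate_key(chroma_frames):
    """Pick the key/mode template best correlated with the accumulated chroma.

    chroma_frames : (n_frames, 12) array of chroma vectors, summed over
    time before matching, as in the accumulation scheme described above.
    """
    accum = np.asarray(chroma_frames).sum(axis=0)
    best, best_score = None, -np.inf
    for shift in range(12):                        # 12 possible tonics
        for mode, profile in (("major", MAJOR), ("minor", MINOR)):
            template = np.roll(profile, shift)     # rotate profile to the tonic
            score = np.corrcoef(accum, template)[0, 1]
            if score > best_score:
                best, best_score = (KEYS[shift], mode), score
    return best
```

A chromagram dominated by the pitch classes C, E and G correlates best with the C major template, ahead of the relative and parallel minors.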
2.3.3 Hardware and software outlook
The module consists of a single executable which performs both tonality/mode estimation and harmonic pattern estimation. The module outputs the results in a single XML file.
2.4 Extractor Discovery System
Responsible partner : SONY CSL
2.4.1 Functional description
EDS (Extractor Discovery System) is a generic scheme for extracting arbitrary high-level
audio descriptors from audio signals. It is able to automatically produce a fully-fledged audio
extractor (an executable) from a database of labeled audio examples. It uses a supervised
learning approach. Its main characteristic is that it automatically finds optimal audio features adapted to the problem at hand. Descriptors are traditionally designed by combining Low-
Level Descriptors (LLDs) using machine-learning algorithms. The key idea of EDS is to
substitute the basic LLDs with arbitrary complex compositions of signal processing operators:
EDS composes automatically operators to build features as signal processing functions that
are optimal for a given descriptor extraction task. The search for specific features is based on
genetic programming, a well-known technique for exploring search spaces of function
compositions. Resulting features are then fed to a learning model such as a GMM or SVM to
produce a fully-fledged extractor program.
Screenshot of the EDS system v1
The global architecture of EDS, illustrated in Figure 2, consists of two parts: modeling of the descriptor and synthesis of the extractor. Both parts are fully automatic and eventually lead to an extractor for the descriptor.
The modeling of the descriptor is the main part of EDS. It consists in automatically searching for a set of relevant features using a genetic search algorithm, and then automatically searching for the optimal model of the descriptor that combines these features. The genetic programming engine automatically composes signal processing operators to build arbitrarily complex functions. Each built function is given a fitness value which represents how well it performs at extracting a given descriptor on a given learning database. Evaluating a function is very costly, as it involves complex signal processing over whole audio databases. Therefore, to limit the search, a set of heuristics is introduced to improve the a priori relevance of the created functions, as well as rewriting rules to simplify functions before their evaluation. Once the system has found relevant features, it feeds them into various machine learning models, and then optimizes the model parameters.
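The feature-search loop can be caricatured as below. This sketch uses a toy operator vocabulary and a Fisher-style fitness, and performs plain random search rather than true genetic programming (no crossover, mutation, heuristics or rewriting rules); all names and choices here are illustrative assumptions, not the EDS internals:

```python
import random
import numpy as np

# toy operator vocabulary standing in for EDS's signal-processing primitives
OPS = {
    "abs":  np.abs,
    "diff": lambda x: np.diff(x, append=x[-1]),
    "sq":   np.square,
    "norm": lambda x: (x - x.mean()) / (x.std() + 1e-9),
}
REDUCERS = {"mean": np.mean, "max": np.max, "std": np.std}

def random_feature(depth=2):
    """A feature = a random chain of operators ended by a scalar reducer."""
    chain = [random.choice(list(OPS)) for _ in range(depth)]
    return chain, random.choice(list(REDUCERS))

def evaluate(feature, signal):
    chain, red = feature
    x = np.asarray(signal, dtype=float)
    for name in chain:
        x = OPS[name](x)
    return REDUCERS[red](x)

def fitness(feature, signals, labels):
    """Fisher-style separation of two classes on the feature values."""
    vals = np.array([evaluate(feature, s) for s in signals])
    a, b = vals[labels == 0], vals[labels == 1]
    return abs(a.mean() - b.mean()) / (a.std() + b.std() + 1e-9)

def search(signals, labels, generations=30, pop=20, seed=0):
    """Keep the best-scoring feature found by random sampling."""
    random.seed(seed)
    best, best_fit = None, -1.0
    for _ in range(generations):
        for _ in range(pop):
            f = random_feature()
            fit = fitness(f, signals, labels)
            if fit > best_fit:
                best, best_fit = f, fit
    return best, best_fit
```

On a toy task (smooth sines vs. noise), many operator chains, e.g. a derivative followed by a standard deviation, already separate the classes well, which is the phenomenon the fitness-driven search exploits.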
The synthesis part consists in generating an executable file that computes the best model on any audio signal. This program allows computing the model on arbitrary audio signals, to predict their value for the modeled descriptor.
Architecture of the EDS system
2.4.2 Scientific and technological breakthroughs
EDS is a very novel concept. Genetic algorithms have been employed for algorithm discovery before, e.g. for the efficient compilation of signal processing transforms or for radar-image analysis; however, EDS is their first application to audio signal processing and metadata extraction. The idea of incorporating expert music signal processing knowledge is also novel.
The EDS technique originated in the CUIDADO project; however, it has been much researched and improved in the scope of the SemanticHIFI project:
• All machine learning agglomeration models can now be parameterized, and a complete search now optimizes these parameters automatically, in order to improve the descriptor performance for a given set of features.
• New models have been added: Gaussian Mixture Models and Hidden Markov Models.
• Automation of the whole process: given a description problem defined by a database of labeled signals, the system is now able to solve the problem automatically and produce a program that computes the descriptor for an external signal, by:
- building a set of relevant features,
- selecting and building a model using optimized machine learning techniques,
- generating a model file and an executable file for its computation.
• Specific study of the following new descriptors: percussivity, pitchness, happiness/sadness, danceability, density, complexity. Each of these is a difficult problem that has so far received less than satisfactory solutions in the research community.
2.4.3 Hardware and software outlook
The GUI of the EDS system has undergone a complete redesign to make it more easily extensible and more user-friendly. The system now includes visualization capabilities for monitoring the increasing fitness of the features and the precision of classification tasks. Features and extractors can be edited, copied/pasted, saved in a readable XML format, etc.
Screenshot of EDS v2.1 which features notably visualization of classification precision
We have abstracted the EDS system into a non-GUI API, which can be called from within
MCM (see WP3.1). The EDS feature is integrated in the final WP3 prototype. A “generalize”
button allows the user to:
• create a new field (e.g. the happiness of a song),
• input a few values manually,
• have EDS generalize these values into an algorithm, which can be used to describe
unlabelled examples.
2.5 Tempo and phase detection
Responsible partner : Native Instrument
2.5.1 Functional description
Native Instruments compiled a collection of non-copyrighted music representative of Native Instruments' customers, and evaluated various state-of-the-art tempo detection algorithms with the help of this collection.
2.5.2 Scientific and technological breakthroughs
• Compilation of a collection of non-copyrighted music representative of our customers
• Evaluation of various state-of-the-art tempo detection algorithms with the help of this collection
2.5.3 Hardware and software outlook
2.6 Browsing by Lyrics
Responsible partner : BGU
2.6.1 Functional description
The user loads the audio file and the attached metadata into the HIFI system. The HIFI system plays the song and at the same time shows the user the lyrics. To use this tool, the user must have the XML file containing the synchronized lyrics data; the file may be available through the sharing system. A simple use case of the browsing feature is the traditional skip button, which the user can still use in the traditional way. The user can also perform a ‘semantic skip’: while pressing the skip button, he sees the lyrics of the song at the location of the skip. A more sophisticated use case is the search option: the user enters text and the song skips to the location of those words within the song.
2.6.2 Scientific and technological breakthroughs
The SemanticHIFI browsing-by-lyrics option aims to be on a par with other commercial and state-of-the-art lyrics synchronization tools.
2.6.3 Hardware and software outlook
The algorithm has several steps:
Lyrics Alignment
1) Find a MIDI file that contains the lyrics as MIDI text events.
2) Use score alignment (currently the score alignment algorithm inherited from CUIDADO) to align the MIDI to the audio. Usually, the MIDI contains a track that corresponds to the singer; we use this track to align the sound file.
3) Output the result of the previous step to a file (XML metadata file) that contains time and text information.
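The resulting time/text metadata could be consumed as sketched below for display and semantic skipping. The XML schema shown here is hypothetical (the report does not specify the actual file format), as are the function names:

```python
import bisect
import xml.etree.ElementTree as ET

# hypothetical schema for the synchronized-lyrics metadata file:
# <lyrics><line time="12.5">first verse starts here</line>...</lyrics>
SAMPLE = """<lyrics>
  <line time="0.0">intro humming</line>
  <line time="12.5">first verse starts here</line>
  <line time="45.0">chorus oh chorus</line>
</lyrics>"""

def load_lyrics(xml_text):
    """Parse the file into a time-sorted list of (time, text) entries."""
    root = ET.fromstring(xml_text)
    entries = [(float(e.get("time")), e.text) for e in root.iter("line")]
    entries.sort()
    return entries

def line_at(entries, t):
    """Lyric line being sung at playback time t (for synchronized display)."""
    times = [time for time, _ in entries]
    i = bisect.bisect_right(times, t) - 1
    return entries[max(i, 0)][1]

def seek_to(entries, query):
    """'Semantic skip': time of the first line containing the query text."""
    for time, text in entries:
        if query.lower() in text.lower():
            return time
    return None
```

`line_at` supports the synchronized display, while `seek_to` implements the text-search skip described in the functional description.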
Lyrics Display
1) When the player plays the track, it reads the file, gets the appropriate time-to-text information and displays the appropriate text.
The Windows Media Player plug-in was built using the Windows Media Player visualization SDK. We implemented a COM interface connecting the visualization object to the algorithm. The figure below shows the plug-in interface: while the song plays, the player displays the appropriate lyrics as a visualization.
Browsing by Lyrics in Windows Media Player
2.7 Sound Source Separation / Score Alignment
Responsible partner : BGU
2.7.1 Functional description
This advanced audio processing tool performs the separation of parts or instruments from within a multi-track audio recording. The main goal of this tool is to give the listener the ability to manipulate audio in ways not previously available, and to enable an artistic liberty normally available only in the studio.
Originally, source separation was based upon alignment to the existing score of the recording, available as a MIDI file. This enabled the direct harmonic separation of the instruments based on the information appearing in the score. A more advanced approach is now available which
does not require the use of score alignment. Even so, score alignment may still be used for melody or instrument identification. This tool is developed in collaboration with the IRCAM Room Acoustics team for the purpose of spatially re-mixing multi-track audio.
2.7.2 Scientific and technological breakthroughs
The source separation tool presents a high-level challenge, requiring the use of complex statistical sound processing techniques. These techniques, to be published in the coming year, employ a model-based pitch and voicing (or harmonicity) tracking algorithm. The problem of multi-track recording separation has been approached from many directions since the turn of the century, but had not yet been applied in a working user-oriented system such as Semantic HiFi.
[Figure: harmonic log-likelihood (colour scale −2 to 3) as a function of frequency (100–500 Hz) and frames (20–160), with labelled components: Bass, Pizzicato(A), Pizzicato(B), Pizzicato(A) - Harmonic]
The above figure presents the harmonic log-likelihood function of one second of multi-track audio (the intro to Stan Getz’ Round Midnight). This function gives clear information as to the presence of harmonic instruments, in this case bass and violins. The non-harmonic information contains the sound of a cymbal.
The source separation algorithm gathers this information and performs a harmonic projection to extract the basic sound of each instrument. Source quality is then improved using harmonic sharing techniques.
2.7.3 Hardware and software outlook
The software is currently written in MATLAB and may be optimized for specific uses and compiled as a plug-in. It is fully automatic and separates up to four intelligible and audibly pleasing parts from a given recording. These tracks can be re-mixed to mono, stereo or surround.
2.8 AudioID
Responsible partner : Fraunhofer IDMT
2.8.1 Functional description
AudioID is a system that performs automated identification of audio signals. The essential property of AudioID is that it does not rely on the availability of metadata attached to the audio signal itself. Instead, it identifies incoming audio signals by means of a database of works that are known to the system. This functionality can be considered the algorithmic equivalent of a human recognizing a song from memory. Before querying for audio items, fingerprints must be extracted from all songs that are to be recognized. A fingerprint contains the “essence” of an audio item, and its extraction algorithm has been standardized within the MPEG-7 standard. To identify music, a fingerprint is extracted from the query item and compared to the database. On a positive comparison result, the stored metadata are returned and further processed.
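The fingerprint-and-match scheme can be illustrated as below. This toy sketch binarises band-energy differences and matches by bit-error rate; it is a stand-in for the general idea only, not the MPEG-7 AudioSignature algorithm or Fraunhofer's implementation:

```python
import numpy as np

def fingerprint(signal, sr, n_bands=16, frame=2048):
    """Toy fingerprint: per-frame log band energies, binarised by their
    frame-to-frame differences (a common robust scheme, assumed here)."""
    feats = []
    for i in range(0, len(signal) - frame, frame):
        spec = np.abs(np.fft.rfft(signal[i:i + frame] * np.hanning(frame)))
        bands = np.array_split(spec, n_bands)
        feats.append(np.log1p([b.sum() for b in bands]))
    # sign of the temporal derivative -> compact, level-invariant bit matrix
    return np.diff(np.array(feats), axis=0) > 0

def identify(query_fp, database):
    """Return the item whose stored fingerprint has the lowest bit-error rate."""
    best, best_ber = None, 1.0
    for item_id, fp in database.items():
        n = min(len(query_fp), len(fp))
        ber = np.mean(query_fp[:n] != fp[:n])
        if ber < best_ber:
            best, best_ber = item_id, ber
    return best, best_ber
```

The server-side database would hold one bit matrix per known work; a client sends only the compact fingerprint of the query, never the audio itself.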
2.8.2 Scientific and technological breakthroughs
During the project, Fraunhofer IDMT achieved a major improvement in the area of recognizing extremely distorted signals. It is now possible to recognize extremely distorted music signals (for example GSM-coded signals). Furthermore, there has been a significant improvement in classification speed, so that broadcast monitoring of 100 or more channels is possible in real time.
2.8.3 Hardware and software outlook
Several applications for identifying music are already available. The most important
software tool is the classification server. This program contains a database of all previously
extracted audio fingerprints. To identify music, a client extracts the fingerprint from the
actual music data and sends it to the server; the server looks it up in the database and returns
the classification result, which the client then post-processes and works with.
There are several clients for different kinds of application, among them a multi-channel
broadcast monitoring system, a tool for cleaning up the user's hard disk at home, and a client
which records music from cell phones and identifies it.
During the project, a Java plug-in for an AudioID client was furthermore developed in
order to integrate the system into the existing SemanticHIFI demonstrator.
The system runs in real time; the classification server is the most performance-critical
component. An average present-day PC can hold up to 100,000 fingerprints.
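The client/server round trip described above can be sketched in miniature. The `ClassificationServer` class, the JSON request shape and all metadata values below are hypothetical, not the actual Fraunhofer protocol:

```python
import json

class ClassificationServer:
    """Toy stand-in for the classification server: it holds all
    previously extracted fingerprints and answers lookup requests."""

    def __init__(self):
        self.db = {}

    def register(self, fp, metadata):
        # Performed once per song when the database is built.
        self.db[fp] = metadata

    def handle(self, request):
        # A client sends the fingerprint it extracted from the
        # actual music data; the server looks it up and replies.
        fp = json.loads(request)["fingerprint"]
        hit = self.db.get(fp)
        return json.dumps({"match": hit is not None, "metadata": hit})

server = ClassificationServer()
server.register("a1b2c3", {"title": "Example Title", "artist": "Example Artist"})
reply = json.loads(server.handle(json.dumps({"fingerprint": "a1b2c3"})))
```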
3 Methodologies Employed
3.1 WP Management and Co-ordination
WP management consists of:
• Detailed planning of each sub-WP, and coherence with integration constraints
• Ensuring coherence among the sub-WPs
• Ensuring coherence between the sub-WPs' targeted functionalities and the targeted applications
• Ensuring that deliverables, modules and documentation are provided for milestones
• Initiating corrective actions
3.2 Market following
3.2.1 User scenarios
Audio segmentation:
Interaction – intra-document browsing
• The user listens to a song and wants to skip to the next part of the song, or to the next
chorus.
• The user wants a visual representation of the structure of the song.
Creation – remix
• The user wants to remix a piece of music by its structure (exchanging verse and chorus
positions).
• The user wants a specific visual or automatic accompaniment during the verse, the chorus, etc.
Audio summary
Inter-document browsing
• The user wants a quick preview of a song item in a play-list, in a catalogue, or in a
database.
Rhythm description
Inter-document browsing
• The user wants to query a database according to the tempo, the meter, or the
percussivity/periodicity features of the songs.
• The user wants music titles with a tempo and rhythm pattern similar to those of a target music title.
• The user wants a play-list with tracks of increasing, or of constant, tempo.
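A minimal sketch of such tempo-based queries over a catalogue of tracks with pre-computed BPM values (all field names and values here are hypothetical):

```python
def by_tempo(tracks, target_bpm, tolerance=5.0):
    # Return tracks whose estimated tempo lies within `tolerance`
    # BPM of the target tempo.
    return [t for t in tracks if abs(t["bpm"] - target_bpm) <= tolerance]

def increasing_tempo_playlist(tracks):
    # Order a play-list by ascending tempo.
    return sorted(tracks, key=lambda t: t["bpm"])

catalogue = [
    {"title": "Track 1", "bpm": 92.0},
    {"title": "Track 2", "bpm": 120.0},
    {"title": "Track 3", "bpm": 124.0},
]
```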
Creation
• The user wants to synchronize two tracks according to their tempo, and to align them
according to their beat locations.
• The user wants to remix a track using segments defined by beats.
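The tempo-synchronization scenario boils down to two numbers: a time-stretch ratio and a start offset. A minimal sketch, assuming constant tempi and known first-beat times:

```python
def sync_params(bpm_a, bpm_b, first_beat_a, first_beat_b):
    """Return the time-stretch ratio to apply to track B so its tempo
    matches track A, and the start offset (in seconds) that aligns
    B's first beat with A's. Assumes constant tempo in both tracks."""
    ratio = bpm_b / bpm_a                       # > 1 means slowing B down
    offset = first_beat_a - first_beat_b * ratio
    return ratio, offset
```

For instance, matching a 126 BPM track to a 120 BPM track gives a stretch ratio of 1.05 (the track is made 5% longer, so its tempo drops to 120 BPM).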
Tonality description
Inter-document browsing
• The user wants to query a database according to key and mode.
• The user wants music titles with a similar key, mode or harmonic pattern.
3.3 Scientific research methodologies
For a specific functionality:
1) proof of feasibility within a given research time
2) study of the state of the art and of protected technologies
3) development of a starting technology
4) development of an evaluation database (which must be representative of the market
targeted by the application) and of the corresponding annotations (according to the targeted
functionality)
5) test of the starting technology on the evaluation corpus
6) improvement of the starting technology according to the observed failures
7) increase of the database size in order to reach realistic conditions
8) test of the technology, evaluation of computation time, reduction of algorithmic complexity
9) development of the prototype and module
3.4 Software development practices
Re-implementation of Matlab source code in C/C++:
1) Implementation of unit tests for each part of the Matlab source code.
2) Validation of the C++ implementation with the unit-testing suite.
3) Use of Valgrind (http://www.valgrind.org) in order to avoid memory leaks and
memory-access violations.
Future:
4) Integration of the C++ modules within the J2EE server application.
5) Packaging of the C++ modules for easy deployment on each partner's platform.
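The validation step (points 1-2) can be illustrated as a tolerance comparison between the Matlab reference output and the output of the C++ port; the vectors and the tolerance below are hypothetical:

```python
import math

def outputs_match(reference, ported, tol=1e-6):
    # Accept the C++ port only if every value agrees with the Matlab
    # reference within a numeric tolerance (exact equality is too
    # strict across compilers and math libraries).
    return (len(reference) == len(ported)
            and all(math.isclose(r, p, rel_tol=tol, abs_tol=tol)
                    for r, p in zip(reference, ported)))

# Hypothetical reference vector exported from the Matlab prototype.
matlab_ref = [0.0, 0.5, 1.0]
# Hypothetical output of the C++ reimplementation on the same input.
cpp_out = [0.0, 0.5000001, 0.9999999]
ok = outputs_match(matlab_ref, cpp_out)
```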
4 Dissemination
4.1 Scientific publications
Peeters, G. (2004). Deriving Musical Structures from Signal Analysis for Music Audio
Summary Generation: "Sequence" and "State" Approach. In U. K. Wiil (ed.), CMMR 2003
(LNCS 2771), Springer-Verlag, Berlin/Heidelberg, 2004: 142-165.
Peeters, G. (2005). Time Variable Tempo Detection and Beat Marking. ICMC, Barcelona,
Spain.
Peeters, G. (2005). Rhythm Classification Using Spectral Rhythm Patterns. ISMIR, London,
UK.
Peeters, G. (2005). "Indexation et accès au contenu musical." Les Nouveaux Dossiers de
l'Audiovisuel 3.
Peeters, G. (2005). MIREX 2005: Tempo Detection and Beat Marking for Perceptual Tempo
Induction. ISMIR, London, UK.
Monceaux, J., Pachet, F., Amadu, F., Roy, P. and Zils, A. Descriptor-Based Spatialization.
Proceedings of the AES Conference 2005, 2005.
Cabral, G., Pachet, F. and Briot, J.-P. Recognizing Chords with EDS: Part One. Proceedings
of the 6th Computer Music Modeling and Retrieval conference (CMMR'2005), Pisa, Italy,
September 2005.
Cabral, G., Pachet, F. and Briot, J.-P. Automatic X Traditional Descriptor Extraction: The
Case of Chord Recognition. Proceedings of the 6th International Conference on Music
Information Retrieval (ISMIR'2005), London, UK, September 2005.
Zils, A. and Pachet, F. Automatic Extraction of Music Descriptors from Acoustic Signals
Using EDS. Proceedings of the 116th AES Convention, May 2004.
Pachet, F. and Zils, A. Automatic Extraction of Music Descriptors from Acoustic Signals.
Proceedings of ISMIR 2004, 2004.
4.2 Other scientific dissemination actions
Peeters, G. (2005). Naviguer dans sa discothèque. La semaine du son, Paris, France.
4.3 Contribution to clustering and standardisation
Organization of a half-day workshop on the practical use of MPEG-7 audio in audio
applications at the AES 25th Int. Conf. on Metadata for Audio, London, UK. Invited speakers:
G. Peeters, M. Jacob, E. Gomez (UPF), J. Herre (FHG), M. Casey (King's College).
Peeters, G. (2004). Workshop on MPEG-7 Audio. AES 25th Int. Conf. on Metadata for Audio,
London, UK.
4.4 Professional communications
Peeters, G. (2005). Description automatique et classification des sons instrumentaux. Journées
de la SFA (Description automatique et perception de la musique), ENST, Paris.
4.5 Press articles and interviews
Vinet, H., V. Puig, et al. (2005). Le magazine du multimedia, RFI. Paris.
Peeters, G. (2005). "SemanticHIFI, chaîne fidèle", Télérama.fr.
Vinet, H., G. Peeters, et al. (2005). "Je t'MP3 moi non plus" /" Le futur fait ses gammes",
Télérama.fr.
Vinet, H., G. Peeters, et al. (2004). L'ircam recherche la chaine HIFI du futur. 01Net.
5 Outlook
5.1 Information produced by the WP
Reusable use cases?
Yes, for all developed modules.
R&D organisation?
Specific functionalities are targeted according to the application providers' markets;
knowledge exchange and library exchange (when possible) take place among the partners.
5.2 New methods
For music structure extraction and summary generation
For beat detection and rhythm characterization
5.3 Scientific breakthroughs
New features for structure detection: dynamic features based on tempo
New algorithm for sequence detection: maximum-likelihood approach
New algorithm for onset detection: reassigned spectral energy flux
New algorithm for periodicity estimation: combined DFT/FM-ACF
New algorithm for tempo tracking: Viterbi decoding based on rhythm templates
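The idea of combining a spectral (DFT) and a lag-domain (ACF) periodicity measure can be illustrated as follows. This is only a toy combination scheme (resampling the ACF onto the DFT frequency axis and multiplying the two), not the exact DFT/FM-ACF algorithm:

```python
import cmath

def dft_mag(x):
    # Magnitude spectrum (first half of the DFT bins).
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) for k in range(n // 2)]

def acf(x):
    # Autocorrelation for lags 1 .. n/2 - 1.
    n = len(x)
    return [sum(x[t] * x[t + lag] for t in range(n - lag))
            for lag in range(1, n // 2)]

def combined_periodicity(x):
    # Resample the ACF onto the DFT frequency axis (bin k <-> lag n/k)
    # and multiply the two measures: the true period is reinforced,
    # while octave errors, which rarely peak in both domains, are
    # attenuated.
    n = len(x)
    spec, ac = dft_mag(x), acf(x)
    comb = [0.0] * len(spec)
    for k in range(1, len(spec)):
        idx = round(n / k) - 1          # ac[0] holds lag 1
        if 0 <= idx < len(ac):
            comb[k] = spec[k] * max(ac[idx], 0.0)
    return comb

# Onset-energy-like pulse train with a period of 8 samples.
x = [1.0 if t % 8 == 0 else 0.0 for t in range(64)]
scores = combined_periodicity(x)
best_k = max(range(1, len(scores)), key=lambda k: scores[k])
```

On this pulse train the winning frequency bin corresponds to the 8-sample period (bin 64/8 = 8), while the octave bins score near zero in at least one of the two domains.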
6 Conclusion
The SemanticHIFI work package on indexing provides research and modules for the two
applications of the project: the authoring tools and the HIFI system. It therefore covers
research in all the major indexing fields of music information retrieval: from the
identification of the audio track (audio identification), which allows linking the audio to
textual metadata, to the indexing of the content (rhythm, tonality, timbre), which allows
search by content or by similarity, to the system's learning of user-defined descriptions (EDS),
and to the innovative listening/performing paradigms enabled by music structure extraction,
lyrics synchronization and re-mixing by source separation.