We describe the "Multimodal Person Discovery in Broadcast TV" task of the MediaEval 2015 benchmarking initiative. Participants were asked to return the names of people who can be both seen and heard in every shot of a collection of videos. The list of people was not known a priori, and their names had to be discovered in an unsupervised way from the media content, using text overlays or speech transcripts. The task was evaluated with information retrieval metrics, based on a posteriori collaborative annotation of the test corpus.
http://ceur-ws.org/Vol-1436/
http://www.multimediaeval.org
4. "the usual suspects"
• Huge TV archives are useless if not searchable
• People ❤ people
• Need for person-based indexes
5. REPERE
• Evaluation campaigns in 2012, 2013 and 2014
Three French consortia funded by ANR
• Multimodal people recognition in TV documents
"who speaks when?" and "who appears when?"
• Led to significant progress in both supervised and
unsupervised multimodal person recognition
6. From REPERE to Person Discovery
• Speaking faces
Focus on "person of interest"
• Unsupervised approaches
People may not be "famous" at indexing time
• Evidence
Archivist/journalist use case
9. Speaking face
Tag each shot with the names of people
both speaking and appearing at the same time
10. Person discovery
• Prior biometric models are not allowed.
• Person names must be discovered automatically
in text overlay or speech utterances.
unsupervised approaches only
12. Evidence (cont.)
an image evidence is a shot during which a person is visible
and their name is written on screen
an audio evidence is a shot during which a person is visible
and their name is pronounced at least once
during a [shot start time - 5s, shot end time + 5s] neighborhood
[Figure: example timeline, where shot #1 serves as evidence for person A and shot #3 as evidence for person B]
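The audio-evidence rule above is essentially an interval check; a minimal sketch (function name and timings are illustrative, only the ±5 s neighborhood comes from the task definition):

```python
def is_audio_evidence(shot_start, shot_end, utterance_times, margin=5.0):
    """Audio evidence: the person's name is pronounced at least once
    within [shot start - 5s, shot end + 5s] (per the task definition)."""
    return any(shot_start - margin <= t <= shot_end + margin
               for t in utterance_times)

# hypothetical shot from 12.0s to 18.5s; name spoken at 3.0s and 22.3s
print(is_audio_evidence(12.0, 18.5, [3.0, 22.3]))  # 22.3 <= 23.5 -> True
print(is_audio_evidence(12.0, 18.5, [30.0]))       # outside the neighborhood -> False
```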
14. Datasets
DEV (REPERE)
• 137 hours
• two French TV channels
• eight different types of shows
• dense audio annotations: speaker diarization, speech transcription
• sparse video annotations: face detection & recognition, optical character recognition
TEST (INA)
• 106 hours (172 videos)
• only one French TV channel
• only one type (news)
• no prior annotation: a posteriori collaborative annotation
15. Datasets
TEST (INA)
• 106 hours (172 videos)
• only one French TV channel
• only one type (news)
• no prior annotation: a posteriori collaborative annotation
http://dataset.ina.fr
17. Information retrieval task
• Queries formatted as firstname_lastname
e.g. francois_hollande
"return all shots where François Hollande is
speaking and visible at the same time"
• Approximate search among submitted names
e.g. francois_holande
• Select shots tagged with the most similar name
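The approximate search step can be sketched with a simple string-similarity ratio (difflib here is an assumption; the task only specifies selecting the most similar submitted name):

```python
from difflib import SequenceMatcher

def best_matching_name(query, submitted_names):
    """Return the submitted name most similar to the query, so shots
    tagged with e.g. 'francois_holande' still match 'francois_hollande'."""
    return max(submitted_names,
               key=lambda name: SequenceMatcher(None, query, name).ratio())

print(best_matching_name("francois_hollande",
                         ["francois_holande", "nicolas_sarkozy"]))
# -> francois_holande
```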
18. Evidence-weighted MAP

To ensure participants do provide correct evidence for every hypothesized name n ∈ N, standard MAP

    MAP = (1 / |Q|) · Σ_{q ∈ Q} AP(q)

is altered into EwMAP (Evidence-weighted Mean Average Precision), the official metric for the task:

    EwMAP = (1 / |Q|) · Σ_{q ∈ Q} C(q) · AP(q)

where C(q) measures the correctness of the provided evidence for query q:

    C(q) = 1 if ρ_q > 0.95 and the evidence provided for the matched name n_q is correct, 0 otherwise

Acknowledgment. This work was supported by the French National Agency for Research under grant ANR-12-0006-01. The open source CAMOMILE collaborative annotation platform was used extensively throughout the task: from the run submission script to the automatic leaderboard, including a posteriori collaborative annotation.
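A sketch of the metric under a simplified correctness check (the real C(q) involves the ρ_q > 0.95 name matching described above; here evidence correctness is reduced to a boolean per query, and the query names are hypothetical):

```python
def ewmap(per_query):
    """EwMAP = (1/|Q|) * sum_q C(q) * AP(q), with C(q) in {0, 1}.
    `per_query` maps each query to (average_precision, evidence_correct)."""
    if not per_query:
        return 0.0
    return sum(ap if correct else 0.0
               for ap, correct in per_query.values()) / len(per_query)

# hypothetical queries: correct evidence for the first, wrong for the second
runs = {"francois_hollande": (0.8, True), "angela_merkel": (0.6, False)}
print(ewmap(runs))  # (0.8 + 0.0) / 2 = 0.4
```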
20. the "multi" in multimedia
Task necessitates expertise
in various domains:
• multimedia
• computer vision
• speech processing
• natural language processing
The technological barrier to entry is lowered
by the provision of a baseline system.
github.com/MediaevalPersonDiscoveryTask
22. Was the baseline useful?
[Table: baseline modules (face tracking, face clustering, speaker diarization, optical character recognition, automatic speech recognition, speaking face detection, fusion) versus the 9 participating teams, marking for each module whether a team relied on the baseline module, developed their own module, or tried both]
23. [Same module/team table as slide 22, with the baseline column shown and team columns grouped into "focus on monomodal components" and "focus on multimodal fusion"]
25. Schedule
01.05 development set release
01.06 test set release
01.07 "out-domain" submission deadline
01.07 — 08.07 leaderboard updated every 6 hours
08.07 "in-domain" submission deadline
09.07 — 28.07 collaborative annotation
28.07 test set annotation release
28.07 — 28.08 adjudication
26. Leaderboard
• computed on a secret subset of the test set
• updated every 6 hours
• private leaderboard: participants know how they rank and their own score, but do not know the scores of others
30. collaborative annotation platform
github.com/camomile-project

objectives
• provide an architecture for effective creation and sharing of annotations of multimedia, multimodal and multilingual data
• enable novel collaborative and interactive annotation tools
• be generic enough to support as many use cases as possible

open source, client/server architecture: the server exchanges JSON with any number of clients ("bring your own client")

data model
• corpus: homogeneous collection of multimedia documents
• medium: a multimedia document (audio, video, image or text)
• layer: homogeneous collection of annotations
• annotation: a medium fragment (e.g. a time span) with attached metadata (categorical value, numerical value, free text or raw content)

REST API
• GET /corpus/:idcorpus (obtain information about a corpus)
• POST /corpus/:idcorpus/medium (add a medium to a corpus)
• PUT /layer/:idlayer/annotation (update an annotation)
• DEL /annotation/:idannotation (delete an annotation)
• and more: permissions, annotation history, annotation queue, user authentication

online documentation
http://camomile-project.github.io/camomile-server
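The route templates above can be filled in programmatically; a minimal sketch, where the server URL and identifiers are hypothetical and only the route templates come from the API listing:

```python
SERVER = "http://localhost:3000"  # hypothetical server address

def route(method, template, **ids):
    """Fill ':idcorpus'-style placeholders in a CAMOMILE route template."""
    path = template
    for key, value in ids.items():
        path = path.replace(":" + key, str(value))
    return method, SERVER + path

print(route("GET", "/corpus/:idcorpus", idcorpus="55ab"))
# ('GET', 'http://localhost:3000/corpus/55ab')
print(route("PUT", "/layer/:idlayer/annotation", idlayer=7))
# ('PUT', 'http://localhost:3000/layer/7/annotation')
```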
32. Stats
• 14 (+ organizers) registered participants
received dev and test data
• 8 (+ organizers) participants submitted 70 runs
• 7 (+ organizers) submitted a working note paper
• 7 (+ organizers) attended the workshop
38. # people that no other team discovered
[Figure: EwMAP (%) for PRIMARY runs per team, annotated with the number of people that no other team discovered; chart notes: "speech transcripts?", "anchors"]
39. Conclusion
• Had a great (and exhausting) time organizing this task
• The winning submission DID NOT make any use of
the face modality.
• (Almost) nobody used ASR
Most people are introduced by overlaid names in the test set
• No cross-show approaches
• Poster session later today at 16:00
Technical retreat tomorrow morning at 9:15
40. Next (oral)
• Mateusz Budnik / LIG at MediaEval 2015
Multimodal Person Discovery in Broadcast TV Task
• Meriem Bendris / PERCOLATTE: a multimodal
person discovery system in TV broadcast for the
MediaEval 2015 evaluation campaign
• Rosalia Barros / GTM-UVigo Systems for Person
Discovery Task at MediaEval 2015
41. Next (poster)
• Claude Barras / Multimodal Person Discovery in Broadcast TV at
MediaEval 2015
• Johann Poignant / LIMSI at MediaEval 2015: Person Discovery in
Broadcast TV Task
• Paula Lopez Otero / GTM-UVigo Systems for Person Discovery Task
at MediaEval 2015
• Javier Hernando / UPC System for the 2015 MediaEval Multimodal
Person Discovery in Broadcast TV task
• Guillaume Gravier / SSIG and IRISA at Multimodal Person Discovery
• Nam Le / EUMSSI team at the MediaEval Person Discovery Challenge