We describe the "Multimodal Person Discovery in Broadcast TV" task of the MediaEval 2015 benchmarking initiative. Participants were asked to return the names of people who can be both seen and heard in every shot of a collection of videos. The list of people was not known a priori, and their names had to be discovered in an unsupervised way from the media content, using text overlays or speech transcripts. The task was evaluated with information retrieval metrics, based on a posteriori collaborative annotation of the test corpus.
http://ceur-ws.org/Vol-1436/
http://www.multimediaeval.org
4. "the usual suspects"
• Huge TV archives are useless if not searchable
• People ❤ people
• Need for person-based indexes
5. REPERE
• Evaluation campaigns in 2012, 2013 and 2014
Three French consortia funded by ANR
• Multimodal people recognition in TV documents
"who speaks when?" and "who appears when?"
• Led to significant progress in both supervised and
unsupervised multimodal person recognition
6. From REPERE to Person Discovery
• Speaking faces
Focus on "person of interest"
• Unsupervised approaches
People may not be "famous" at indexing time
• Evidence
Archivist/journalist use case
9. Speaking face
Tag each shot with the names of people
both speaking and appearing at the same time
10. Person discovery
• Prior biometric models are not allowed.
• Person names must be discovered automatically
in text overlay or speech utterances.
unsupervised approaches only
12. Evidence (cont.)
an image evidence is a shot during which a person is visible
and their name is written on screen
an audio evidence is a shot during which a person is visible
and their name is pronounced at least once
during a [shot start time - 5s, shot end time + 5s] neighborhood
[Figure: example timeline, where shot #1 serves as evidence for person A and shot #3 as evidence for person B]
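The audio-evidence rule above is essentially an interval check; a minimal sketch (function name and timings are illustrative, only the ±5 s neighborhood comes from the task definition):

```python
def is_audio_evidence(shot_start, shot_end, utterance_times, margin=5.0):
    """Audio evidence: the person's name is pronounced at least once
    within [shot start - 5s, shot end + 5s] (per the task definition)."""
    return any(shot_start - margin <= t <= shot_end + margin
               for t in utterance_times)

# hypothetical shot from 12.0s to 18.5s; name spoken at 3.0s and 22.3s
print(is_audio_evidence(12.0, 18.5, [3.0, 22.3]))  # 22.3 <= 23.5 -> True
print(is_audio_evidence(12.0, 18.5, [30.0]))       # outside the neighborhood -> False
```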
14. Datasets
DEV (REPERE)
• 137 hours
• two French TV channels
• eight different types of shows
• dense audio annotations: speaker diarization, speech transcription
• sparse video annotations: face detection & recognition, optical character recognition
TEST (INA)
• 106 hours (172 videos)
• only one French TV channel
• only one type (news)
• no prior annotation: a posteriori collaborative annotation
15. Datasets
TEST (INA)
• 106 hours (172 videos)
• only one French TV channel
• only one type (news)
• no prior annotation: a posteriori collaborative annotation
http://dataset.ina.fr
17. Information retrieval task
• Queries formatted as firstname_lastname
e.g. francois_hollande
"return all shots where François Hollande is
speaking and visible at the same time"
• Approximate search among submitted names
e.g. francois_holande
• Select shots tagged with the most similar name
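The approximate search step can be sketched with a simple string-similarity ratio (difflib here is an assumption; the task only specifies selecting the most similar submitted name):

```python
from difflib import SequenceMatcher

def best_matching_name(query, submitted_names):
    """Return the submitted name most similar to the query, so shots
    tagged with e.g. 'francois_holande' still match 'francois_hollande'."""
    return max(submitted_names,
               key=lambda name: SequenceMatcher(None, query, name).ratio())

print(best_matching_name("francois_hollande",
                         ["francois_holande", "nicolas_sarkozy"]))
# -> francois_holande
```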
18. Evidence-weighted MAP

To ensure participants do provide correct evidence for every hypothesized name n ∈ N, standard MAP

    MAP = (1 / |Q|) · Σ_{q ∈ Q} AP(q)

is altered into EwMAP (Evidence-weighted Mean Average Precision), the official metric for the task:

    EwMAP = (1 / |Q|) · Σ_{q ∈ Q} C(q) · AP(q)

where C(q) measures the correctness of the provided evidence for query q:

    C(q) = 1 if ρ_q > 0.95 and the evidence provided for the matched name n_q is correct, 0 otherwise

Acknowledgment. This work was supported by the French National Agency for Research under grant ANR-12-0006-01. The open source CAMOMILE collaborative annotation platform was used extensively throughout the task: from the run submission script to the automatic leaderboard, including a posteriori collaborative annotation.
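A sketch of the metric under a simplified correctness check (the real C(q) involves the ρ_q > 0.95 name matching described above; here evidence correctness is reduced to a boolean per query, and the query names are hypothetical):

```python
def ewmap(per_query):
    """EwMAP = (1/|Q|) * sum_q C(q) * AP(q), with C(q) in {0, 1}.
    `per_query` maps each query to (average_precision, evidence_correct)."""
    if not per_query:
        return 0.0
    return sum(ap if correct else 0.0
               for ap, correct in per_query.values()) / len(per_query)

# hypothetical queries: correct evidence for the first, wrong for the second
runs = {"francois_hollande": (0.8, True), "angela_merkel": (0.6, False)}
print(ewmap(runs))  # (0.8 + 0.0) / 2 = 0.4
```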
20. the "multi" in multimedia
Task necessitates expertise
in various domains:
• multimedia
• computer vision
• speech processing
• natural language processing
The technological barrier to entry is lowered
by the provision of a baseline system.
github.com/MediaevalPersonDiscoveryTask
22. Was the baseline useful?
[Table: baseline modules (face tracking, face clustering, speaker diarization, optical character recognition, automatic speech recognition, speaking face detection, fusion) versus the 9 participating teams, marking for each module whether a team relied on the baseline module, developed their own module, or tried both]
23. [Same module/team table as slide 22, with the baseline column shown and team columns grouped into "focus on monomodal components" and "focus on multimodal fusion"]
25. Schedule
01.05 development set release
01.06 test set release
01.07 "out-domain" submission deadline
01.07 — 08.07 leaderboard updated every 6 hours
08.07 "in-domain" submission deadline
09.07 — 28.07 collaborative annotation
28.07 test set annotation release
28.07 — 28.08 adjudication
26. Leaderboard
• computed on a secret subset of the test set
• updated every 6 hours
• private leaderboard: participants know how they rank and their own score, but do not know the scores of others
30. collaborative annotation platform
github.com/camomile-project

objectives
• provide an architecture for effective creation and sharing of annotations of multimedia, multimodal and multilingual data
• enable novel collaborative and interactive annotation tools
• be generic enough to support as many use cases as possible

open source, client/server architecture: the server exchanges JSON with any number of clients ("bring your own client")

data model
• corpus: homogeneous collection of multimedia documents
• medium: a multimedia document (audio, video, image or text)
• layer: homogeneous collection of annotations
• annotation: a medium fragment (e.g. a time span) with attached metadata (categorical value, numerical value, free text or raw content)

REST API
• GET /corpus/:idcorpus (obtain information about a corpus)
• POST /corpus/:idcorpus/medium (add a medium to a corpus)
• PUT /layer/:idlayer/annotation (update an annotation)
• DEL /annotation/:idannotation (delete an annotation)
• and more: permissions, annotation history, annotation queue, user authentication

online documentation
http://camomile-project.github.io/camomile-server
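The route templates above can be filled in programmatically; a minimal sketch, where the server URL and identifiers are hypothetical and only the route templates come from the API listing:

```python
SERVER = "http://localhost:3000"  # hypothetical server address

def route(method, template, **ids):
    """Fill ':idcorpus'-style placeholders in a CAMOMILE route template."""
    path = template
    for key, value in ids.items():
        path = path.replace(":" + key, str(value))
    return method, SERVER + path

print(route("GET", "/corpus/:idcorpus", idcorpus="55ab"))
# ('GET', 'http://localhost:3000/corpus/55ab')
print(route("PUT", "/layer/:idlayer/annotation", idlayer=7))
# ('PUT', 'http://localhost:3000/layer/7/annotation')
```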
32. Stats
• 14 (+ organizers) registered participants
received dev and test data
• 8 (+ organizers) participants submitted 70 runs
• 7 (+ organizers) submitted a working note paper
• 7 (+ organizers) attended the workshop
38. # people that no other team discovered
[Figure: EwMAP (%) for PRIMARY runs per team, annotated with the number of people that no other team discovered; chart notes: "speech transcripts?", "anchors"]
39. Conclusion
• Had a great (and exhausting) time organizing this task
• The winning submission DID NOT make any use of
the face modality.
• (Almost) nobody used ASR
Most people are introduced by overlaid names in the test set
• No cross-show approaches
• Poster session later today at 16:00
Technical retreat tomorrow morning at 9:15
40. Next (oral)
• Mateusz Budnik / LIG at MediaEval 2015
Multimodal Person Discovery in Broadcast TV Task
• Meriem Bendris / PERCOLATTE: a multimodal
person discovery system in TV broadcast for the
MediaEval 2015 evaluation campaign
• Rosalia Barros / GTM-UVigo Systems for Person
Discovery Task at MediaEval 2015
41. Next (poster)
• Claude Barras / Multimodal Person Discovery in Broadcast TV at
MediaEval 2015
• Johann Poignant / LIMSI at MediaEval 2015: Person Discovery in
Broadcast TV Task
• Paula Lopez Otero / GTM-UVigo Systems for Person Discovery Task
at MediaEval 2015
• Javier Hernando / UPC System for the 2015 MediaEval Multimodal
Person Discovery in Broadcast TV task
• Guillaume Gravier / SSIG and IRISA at Multimodal Person Discovery
• Nam Le / EUMSSI team at the MediaEval Person Discovery Challenge