1. Detection of Violent Scenes using Affective Features
Esra Acar
Competence Center Information Retrieval and Machine Learning
4. October 2012
2. ⶠMotivation
ⶠBackground
ⶠThe Method
ï§ Audio Features
ï§ Visual Features
ⶠResults & Discussion
ⶠConclusions & Future Work
4. October 2012 Detection of Violent Scenes using Affective Features 2
3. ⶠThe MediaEval 2012 Affect Task aims at detecting violent
segments in movies.
ⶠA recent work on horror scene recognition detects horror
scenes by affect-related features.
ⶠWe investigate whether
ï§ affect-related features provide good representation of
violence, and
ï§ making abstractions from low-level features is better than
directly using low-level data.
4. October 2012 Detection of Violent Scenes using Affective Features 3
4. ⶠThe affective content of a video corresponds to
ï§ the intensity (i.e. arousal), and
ï§ the type (i.e. valence) of emotion
expected to arise in the user while watching that video.
ⶠRecent research efforts propose methods to map low-level
features to high-level emotions.
ⶠFilm-makers intend to elicit some particular emotions (i.e.
expected emotions) in the audience.
ⶠWhen we refer to violence as an expected emotion in
videos, affect-related features are applicable for violence
detection.
4. October 2012 Detection of Violent Scenes using Affective Features 4
5. ⶠThe method uses affect-related audio and visual features to
represent violence.
ⶠLow-level audio and visual features are extracted.
ⶠMid-level audio features are generated based on the low-
level ones.
ⶠThe audio and visual features are then fused at the feature-
level and a two-class SVM is trained.
4. October 2012 Detection of Violent Scenes using Affective Features 5
6. ⶠAffect-related audio features used in the work are:
ï§ Audio energy
ï§ related to the arousal aspect.
ï§ high/low energy corresponds to high/low emotion intensity.
ï§ used for vocal emotion detection.
ï§ Mel-Frequency Cepstral Coefficients (MFCC)
ï§ related to the arousal aspect.
ï§ works well for the detection of excitement/non-excitement.
ï§ Pitch
ï§ related to the valence aspect.
ï§ significant for emotion detection in speech and music.
4. October 2012 Detection of Violent Scenes using Affective Features 6
7. ⶠEach video shot has different numbers of audio energy, pitch and
MFCC feature vectors (due to varying shot durations).
ⶠAudio representations are obtained by computing mean and
standard deviation for these audio features.
ⶠAbstraction for MFCC:
ï§ MFCC-based Bag of Audio Words (BoAW) approach is chosen to
generate mid-level audio representations.
ï§ Two different audio vocabularies are constructed: violence and
non-violence vocabularies (by k-means clustering).
ï§ MFCC of violent/non-violent movie segments are used to
construct violence/non-violence words.
ï§ Violence and non-violence word occurrences within a video shot
are represented by a BoAW histogram.
4. October 2012 Detection of Violent Scenes using Affective Features 7
8. ⶠAverage motion
ï§ related to the arousal aspect.
ï§ Motion vectors are computed using block-based motion
estimation.
ï§ Average motion is found as the average magnitude of all
motion vectors.
ⶠWe compute average motion around the keyframe of video
shots.
4. October 2012 Detection of Violent Scenes using Affective Features 8
9. ⶠThe performance of our method was assessed on 3
Hollywood movies (evaluation criteria: MAP at 100).
ⶠWe submitted five runs:
ï§ r1-low-level: low-level audio and visual features,
ï§ Runs based on mid-level audio and low-level visual features
ï§ r2-mid-level-100k: 100k samples for dictionary construction,
ï§ r3-mid-level-300k: 300k samples for dictionary construction,
ï§ r4-mid-level-300k-default: 300k samples for dictionary
construction + SVM default parameters, and
ï§ r5-mid-level-500k: 500k samples for dictionary construction.
4. October 2012 Detection of Violent Scenes using Affective Features 9
10. Table 1 â Precision, Recall and F-measure at shot level
Run AED-P AED-R AED-F
r1-low-level 0.141 0.597 0.2287
r2-mid-level-100k 0.140 0.629 0.2285
r3-mid-level-300k 0.144 0.625 0.2337
r4-mid-level-300k-default 0.190 0.627 0.2971
r5-mid-level-500k 0.154 0.603 0.2457
Table 2 â Mean Average Precision (MAP) values at 20 and 100
Run MAP at 20 MAP at 100
r1-low-level 0.2132 0.18502
r2-mid-level-100k 0.2037 0.14492
r3-mid-level-300k 0.3593 0.18538
r4-mid-level-300k-default 0.1547 0.15083
r5-mid-level-500k 0.15 0.11527
ⶠSlightly better performance is achieved with mid-level representations compared
to the low-level one.
ⶠUsing affect-related features to describe violence needs some improvements
(especially the visual part).
4. October 2012 Detection of Violent Scenes using Affective Features 10
11. ⶠThe aim of this work was to investigate whether affect-
related features are well-suited to describe violence.
ⶠAffect-related audio and visual features are merged in a
supervised manner using SVM.
ⶠOur main finding is that more sophisticated affect-related
features are necessary to describe the content of videos
(especially the visual part).
ⶠOur next step in this work is to use
ï§ mid-level features such as human facial features, and
ï§ more sophisticated motion descriptors such as Lagrangian
measures
for video content representation.
4. October 2012 Detection of Violent Scenes using Affective Features 11
12. Thank you!
Questions?
4. October 2012 Detection of Violent Scenes using Affective Features 12