Competence Center Information Retrieval & Machine Learning
11th International Workshop on Content-Based Multimedia Indexing (CBMI), Veszprem, Hungary, 2013
Detecting Violent Content in Hollywood Movies by Mid-level Audio Representations
Esra Acar
Esra Acar, Frank Hopfgartner, Sahin Albayrak
Outline
17 June 2013, CBMI 2013
► Motivation
► The Violence Detection Method
  ▪ Audio Representation of Videos
  ▪ Learning Violence Detection Model
► Performance Evaluation
► Conclusions & Future Work
Motivation
► Goal: detect the most violent scenes in Hollywood movies.
► Use case: parents select or reject movies by previewing the parts that contain the most violent moments.
► We investigate the discriminative power of mid-level audio features:
  ▪ Bag-of-Audio Words (BoAW) representations based on Mel-Frequency Cepstral Coefficients (MFCCs)
  ▪ Two different BoAW construction methods:
      - Vector quantization-based (VQ-based) method
      - Sparse coding-based (SC-based) method
The Violence Detection Method
► Definition of violence: “physical violence or accident resulting in human injury or pain”, as defined in the MediaEval Violent Scenes Detection (VSD) task.
► Two main components of the method:
  ▪ The representation of video shots
  ▪ The learning of a violence model
Audio Representation of Videos (1)
► Mel-Frequency Cepstral Coefficients (MFCCs)
  ▪ are commonly used in speech recognition and music information retrieval (e.g., genre classification);
  ▪ relate well to human auditory perception (they are computed on the perceptually motivated mel scale);
  ▪ work well for detecting excitement versus non-excitement (i.e., as indicators of the excitement level of video segments).
► An MFCC-based audio representation is used to describe the audio content of Hollywood movies.
► Mid-level representations may help model video segments one step closer to human perception. Examples are:
  ▪ bags of features,
  ▪ the upper units of convolutional networks or deep belief networks.
Audio Representation of Videos (2)
► We use mid-level audio features based on MFCCs (i.e., the BoAW approach).
► The BoAW approach is applied with two different coding schemes:
  ▪ Vector quantization (by k-means clustering): divides feature vectors into groups, where each group is represented by its centroid.
  ▪ Sparse coding (by the LARS algorithm): represents a feature vector as a sparse linear combination of an overcomplete set of basis vectors.
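The vector-quantization variant can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 2-D vectors standing in for MFCC frames, the dictionary size and the L1 normalisation of the histogram are all assumptions made for illustration.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vector, centroids):
    """Index of the centroid (audio word) closest to `vector`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vector, centroids[i]))

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; the k centroids act as the audio-word dictionary."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        for i, group in enumerate(groups):
            if group:  # keep the previous centroid if a cluster is empty
                centroids[i] = [sum(dim) / len(group) for dim in zip(*group)]
    return centroids

def boaw_histogram(shot_vectors, centroids):
    """L1-normalised histogram of audio-word occurrences for one shot (assumed normalisation)."""
    counts = [0] * len(centroids)
    for v in shot_vectors:
        counts[nearest(v, centroids)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

# Toy 2-D vectors standing in for MFCC frames (hypothetical data).
training_frames = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
dictionary = kmeans(training_frames, k=2)
shot_frames = [[0.0, 0.4], [10.0, 10.2], [9.8, 10.6]]
histogram = boaw_histogram(shot_frames, dictionary)
```

Each shot is thus described by how often its frames fall into each cluster, regardless of the shot's length.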
Audio Representation of Videos (3)
Dictionary Generation Phase
Audio Representation of Videos (4)
Representation Construction Phase
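For the sparse-coding variant, building a shot-level representation means coding each frame's feature vector over the dictionary and then aggregating the codes. The sketch below is illustrative only. It uses ISTA (iterative soft-thresholding) as a simpler stand-in for the LARS solver named in the slides, and max-pooling of absolute coefficients is an assumed aggregation step that the slides do not specify. The dictionary `toy_atoms`, the penalty value and all function names are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink `value` towards zero by `threshold`."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def sparse_code(x, atoms, penalty=0.1, iterations=500):
    """Approximately minimise 0.5*||x - sum_j a_j*atoms[j]||^2 + penalty*||a||_1 with ISTA."""
    # Step size 1/L, where L (sum of squared entries) upper-bounds the
    # largest eigenvalue of the dictionary's Gram matrix.
    lipschitz = sum(v * v for atom in atoms for v in atom)
    codes = [0.0] * len(atoms)
    for _ in range(iterations):
        reconstruction = [sum(c * atom[d] for c, atom in zip(codes, atoms))
                          for d in range(len(x))]
        residual = [xi - ri for xi, ri in zip(x, reconstruction)]
        codes = [
            soft_threshold(
                c + sum(a * r for a, r in zip(atom, residual)) / lipschitz,
                penalty / lipschitz,
            )
            for c, atom in zip(codes, atoms)
        ]
    return codes

def max_pool(code_vectors):
    """Aggregate per-frame codes into one shot-level vector (assumed pooling)."""
    return [max(abs(c) for c in column) for column in zip(*code_vectors)]

# Overcomplete toy dictionary: three atoms for 2-D inputs (hypothetical data).
toy_atoms = [[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]]
codes = sparse_code([2.0, 0.0], toy_atoms)
pooled = max_pool([[1.0, -2.0], [-3.0, 0.5]])
```

In this toy case the input lies along the first atom, so the code concentrates on that atom and the other coefficients shrink to zero.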
Learning Violence Detection Model
Learning a Violence Model
Performance Evaluation
► Dataset:
  ▪ 32,708 video shots from 18 Hollywood movies of different genres (ranging from extremely violent movies to movies without violence).
      - Training set: 26,138 video shots from 15 movies.
      - Test set: 6,570 video shots from 3 movies.
► Ground truth:
  ▪ Generated by 7 human assessors; violent movie segments are annotated at the frame level.
  ▪ Each video shot is labeled as violent or non-violent.
The characteristics of training and test datasets
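The slides do not say how frame-level annotations are turned into shot labels. One plausible rule, assumed here purely for illustration, is to label a shot as violent when it overlaps any annotated violent segment. The function name and the frame ranges below are hypothetical.

```python
def label_shots(shot_boundaries, violent_segments):
    """Label each shot 1 (violent) or 0 (non-violent).

    Both arguments are lists of (first_frame, last_frame) pairs, inclusive.
    Assumed rule: a shot is violent if it overlaps any violent segment.
    """
    labels = []
    for shot_start, shot_end in shot_boundaries:
        overlaps = any(
            seg_start <= shot_end and shot_start <= seg_end
            for seg_start, seg_end in violent_segments
        )
        labels.append(1 if overlaps else 0)
    return labels

# Three consecutive shots and one annotated violent segment (toy values).
shots = [(0, 99), (100, 249), (250, 400)]
violent = [(230, 260)]
shot_labels = label_shots(shots, violent)
```

Here the violent segment spans the boundary between the second and third shots, so both receive the violent label.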
Evaluation Metrics
► The ranking of violent shots is what matters most for the use case.
► Metrics other than precision and recall are therefore required to compare performance.
► Average precision at 20 and at 100 is used (official metrics in the MediaEval VSD task).
► R-precision is also used; it can be seen as an alternative to precision at k.
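The two ranking metrics can be sketched as follows. This is one common formulation and not necessarily the exact MediaEval scoring script: here average precision at k is normalised by the number of violent shots found in the top k, which is an assumption, and the function names are hypothetical.

```python
def average_precision_at_k(ranked_labels, k):
    """Average precision over the top-k results.

    `ranked_labels` holds 1 (violent) or 0 (non-violent), ordered by
    decreasing classifier score. Normalisation by the number of hits in
    the top k is an assumed convention.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def r_precision(ranked_labels):
    """Precision at rank R, where R is the total number of violent shots."""
    relevant = sum(ranked_labels)
    return sum(ranked_labels[:relevant]) / relevant if relevant else 0.0

# Toy ranking of eight shots, four of which are violent.
ranked = [1, 0, 1, 1, 0, 0, 1, 0]
ap5 = average_precision_at_k(ranked, 5)  # (1/1 + 2/3 + 3/4) / 3
rp = r_precision(ranked)                 # 3 violent shots in the top 4
```

Unlike precision at a fixed k, R-precision adapts the cutoff to the number of violent shots in each movie, which matters when movies differ widely in how much violence they contain.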
Results & Discussions (1)
Average Precision at 100 for the Baseline and Our Methods
Average Precision at 20 & 100 and R-precision for the VQ- and SC-based methods
Results & Discussions (2)
Average Precision at 20 & 100 and R-precision on Independence Day
Average Precision at 20 & 100 and R-precision on Dead Poets Society
Average Precision at 20 & 100 and R-precision on Fight Club
Results & Discussions (3)
Team                Features                                                    Modality      AP at 100*
ARF                 Color, texture, audio and concepts                          audio-visual  0.651
Shanghai-Hong Kong  Trajectory-based features, SIFT, STIP, MFCCs                audio-visual  0.624
TEC                 Color, motion, acoustic features                            audio-visual  0.618
TUM                 Acoustic energy and spectral, color, texture, optical flow  audio-visual  0.484
SC-based (ours)     BoAW with sparse coding                                     audio         0.444
VQ-based (ours)     BoAW with vector quantization                               audio         0.387
LIG-MIRM            Color, texture, bag of SIFT and MFCCs                       audio-visual  0.314
NII                 Visual concepts learned from color and texture              visual        0.308
DYNI-LSIS           Multi-scale local binary pattern                            visual        0.125
* Average Precision at 100 (the official evaluation metric of the MediaEval VSD task)
Sample Video Shots (Correctly Classified)
Sample Video Shots (Wrongly Classified)
Conclusions
► An approach for detecting violent content in movies at the video-shot level is presented.
► Mid-level audio features based on the BoAW approach with two different coding schemes are employed.
► Promising results are obtained:
  ▪ the SC-based BoAW outperforms all uni-modal submissions in the MediaEval VSD task except one vision-based method.
► One significant observation is that the average precision of the proposed method varies widely across movies with different violence levels.
Future Work
► Construct more sophisticated mid-level representations for video content analysis.
► Augment the feature set with visual features (both low-level and mid-level) to further improve classification.
► Extend our approach to user-generated videos.
  ▪ Unlike Hollywood movies, these videos are not professionally edited (e.g., to enhance dramatic scenes).
THANKS!
QUESTIONS?
