- 1. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-
6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 4, July-August (2013), © IAEME
ACTION EVENT RETRIEVAL FROM CRICKET VIDEO USING AUDIO
ENERGY FEATURE FOR EVENT SUMMARIZATION
Vilas Naik1, Prasanna Patil2, Vishwanath Chikaraddi3
1, 2, 3 Department of CSE, Basaveshwar Engg College, Bagalkot, India.
ABSTRACT
Content-Based Video Retrieval (CBVR) is an active research discipline focused on
computational strategies for finding relevant videos through multimodal analysis of video
content such as visual, audio, and text, in order to browse and index video. However, finding a
desired video or event in a large video database remains a challenging and time-consuming task,
so efficient video and event retrieval becomes ever more important. We present an audio based
approach for event retrieval from sports video and show that it is effective when applied to
cricket videos. The approach retrieves action events based on the audio level of a batsman's
played shot and the loud cheering of the audience in response during a cricket match. These audio
symbols can be retrieved by measuring the audio level and pattern, which is normally higher than
the regular audio level, using audio energy features. The experiments conducted and the results
analyzed reveal that the mechanism can be used efficiently on cricket video to extract events
such as power stroke actions and crowd cheers from the stadium.
Keywords: Adaptive Thresholding, Audio Energy, Event Retrieval, MFCC, Short Time Energy,
Video Summarization, Zero Crossing Rate.
1. INTRODUCTION
Sports video distribution over various networks should contribute to quick adoption and
widespread use of multimedia services worldwide, because processing operations on sports video
such as browsing, indexing, summarization and retrieval make it possible to deliver sports video
over narrow-band networks such as the Internet and wireless links. The amount of content created
daily by TV channels around the world is hard to measure: on-site news shooting, in-studio
programs, sports event broadcasts, in-house produced films, serials, documentaries, and other
productions are broadcast every day to the homes of billions. This increased generation and distribution rate of audiovisual content
created a new problem of management. It is clear that when accessing lengthy, voluminous video
programs, the ability to access highlights and to skip the less interesting parts of the videos
will save not only the viewer's time but also data downloading/airtime costs if the viewer
receives videos wirelessly from remote servers. Moreover, it would be very attractive if users
could access and view content based on their own preferences. To realize these needs, the source
video has to be tagged with semantic labels. These labels must be broad enough to cover general
events in the video, e.g., the shot of a batsman, the sound of the wicket falling when the ball
hits it, audience shouting when the batsman misses a shot, and audience cheering. This is a very
challenging task and requires exploiting multimodal and multi-context approaches.
Video content can be accessed using either a top-down approach or a bottom-up approach.
The top-down approach, video browsing, is useful when one needs to get the essence of the
content. The bottom-up approach, video retrieval, is useful when one knows exactly what to look
for in the content. In video summarization, what "essence" the summary should capture depends on
whether the content is scripted. Considerable progress has been made in multimodal analysis,
video representation, summarization, browsing and retrieval, which are the five fundamental bases
for accessing video content. The first three focus on metadata generation and organization while
the last two focus on metadata consumption. Multimodal analysis deals with the signal processing
part of the video system, including shot boundary detection, key frame extraction, key object
detection, audio analysis, closed caption analysis, etc. Video representation is concerned with
the structure of the video; again, it is useful to have different representations for scripted
and unscripted content. Built on top of the video representation, video summarization, based
either on ToC generation or on highlights extraction, deals with how to use the representation
structure to give viewers top-down access through the summary for video browsing. Finally, video
retrieval is concerned with retrieving specific video objects. For today's video content,
techniques are urgently needed for automatically (or semi-automatically) constructing video
summaries, highlights and indices to facilitate summarization, browsing and retrieval. The
proposed work employs the energy levels of MFCC coefficients of audio samples to find the
knocking sound of the bat hitting the ball, detecting a batsman's stroke action followed by
cheers from the spectators in the ground.
The remainder of the paper is organized as follows. Section 2 presents the related work
and background of the algorithms used. Section 3 gives a detailed description of video
characterization and audio features. Section 4 describes the newly proposed algorithm. Section 5
discusses the results obtained. Section 6 concludes.
2. RELATED WORK
Research towards the automatic detection and retrieval of events in sports video data has
attracted a lot of attention in recent years. Sports video analysis and event/highlight
extraction and summarization are among its most important topics. A review of the literature was
conducted and a summary is presented here. The transform and subband audio coders [1], which are
employed in many modern audio coding standards, are extended with a new coder whose quantization
strategies incorporate run-length and arithmetic encoders. In [2], the method addresses query by
example, a necessary capability for content-based retrieval, and presents an algorithm for
matching multimodal (audio-visual) patterns for the purpose of content-based video retrieval. A
generalized sound recognition system that uses reduced-dimension log-spectral features and a
minimum-entropy hidden Markov model classifier [3] addresses the major challenges of generalized
sound recognition. In [4], the authors focus on the use of Hidden Markov Models (HMMs) for
structure analysis of videos and demonstrate how they can be efficiently applied to merge audio
and visual cues. The exploitation of features from multiple modalities, namely audio, video, and
text, is described in [5]. Concept representations are modeled
using Gaussian mixture models (GMM), hidden Markov models (HMM), and support vector machines
(SVM). In [6], NVRS, a content-based news video retrieval system, is presented; it is convenient
for users to quickly browse and retrieve news video by categories such as politics, finance, and
amusement. In [7], audio cues are shown to play an important role in inferring video semantics
using SVM classifiers. The authors of [8] address the problem of bridging the semantic gap in the
sports domain and implement an MPEG-7 compliant browsing system for semantic retrieval and
summarization of sports video. A CBIR system for use in a psychological study of the relationship
between human movement and dyslexia is described in [9]; the work presents a novel use of
interactive visual and audio cues for attaining this level of indexing performance. A
content-based movie parsing and indexing approach [10] analyzes both audio and visual sources and
accounts for their interrelations to extract high-level semantic cues. The aural and visual cues
described in [11] can be automatically extracted from video and used to index its contents. In
[12], the proposed work is an audio-visual feature based framework for event detection in
broadcast video of multiple different field sports. Methods of segmenting, visualizing, and
indexing presentation videos by separately considering audio and visual data are investigated in
[13]. In [14], on semantic retrieval of video, the authors review research on three types of
video: meetings, movies and broadcast news, and sports video. In [15], the authors present
complementary approaches for assessing semantic relevance in video retrieval, such as adaptive
video indexing and elemental concept indexing.
The related work reveals that audio patterns can be a prominent cue for identifying
significant events in sports video. Based on these audio patterns the requested events can be
detected or classified.
3. VIDEO CHARACTERIZATION AND AUDIO CUE BASED EVENT RETRIEVAL
Video characterization is the process of understanding the syntax and semantics of a video
sequence; it is an important part of most video processing tasks such as segmenting, retrieving,
summarization and indexing. In sports video the audio signal mainly comes from the speech and
sounds of the commentator, audience, whistling, and environment. Therefore, we first extract some
low-level features that have been used successfully in speech analysis and then test whether they
give good results for audio signal analysis in sports video.
Zero Crossing Rate (ZCR) and Short Time Energy (STE) - In the context of discrete-time
signals, a zero crossing is said to occur if successive samples have different algebraic signs.
The rate at which zero crossings occur is a simple measure of the frequency content of a signal;
for a sine wave, the average zero-crossing rate gives a reasonable estimate of its frequency.
Zero crossing is well suited to narrowband signals, but audio signals may include both narrowband
and broadband components.
For audio signals, short-time energy is an essential parameter for distinguishing silence
clips from non-silence clips: the short-time energy values of silence clips are remarkably lower
than those of non-silence clips. The short-time average zero-crossing rate (ZCR) is another
effective measurement for differentiating silence clips from non-silence clips, as silence clips
have much smaller ZCR values than non-silence clips. ZCR is defined formally as
ZCR = (1 / 2N) * sum_{n=1}^{N-1} | sgn(x[n]) - sgn(x[n-1]) |
where sgn(.) is the sign function, x[n] are the audio samples and N is the number of samples in
the frame.
Mel-Frequency Cepstral Coefficient (MFCC) - MFCC features are a natural choice for recognition
tasks based on audio features. To extract MFCC features, input audio is divided into overlapping
frames of duration 30 ms with 10 ms overlap between consecutive frames. Each frame is then
multiplied by a Hamming window function:
w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)), 0 <= n <= N - 1
where N is the number of samples in the window. After performing an FFT on each windowed frame,
the MFCCs are calculated using the following discrete cosine transform:
C_n = sum_{i=1}^{K} log(S_i) * cos[n * (i - 1/2) * pi / K], n = 1, 2, ..., L
where K is the number of sub-bands and L is the desired length of the cepstrum. The S_i,
1 <= i <= K, represent the filter bank energies after passing through the triangular band-pass
filters. Figure 3.1 summarizes the MFCC extraction process.
Figure 3.1: MFCC feature extraction
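The features above can be sketched in a few lines. The following snippet is illustrative Python only (the paper's implementation is in MATLAB): it computes ZCR and STE for one frame, the Hamming window, and cepstral coefficients from filter-bank energies using the DCT formula given in the text. The test signals are synthetic.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of successive sample pairs whose algebraic signs differ."""
    s = np.sign(frame)
    return np.mean(s[1:] != s[:-1])

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    f = np.asarray(frame, dtype=float)
    return np.mean(f ** 2)

def hamming_window(N):
    """w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

def cepstrum_from_fbank(S, L):
    """C_n = sum_{i=1..K} log(S_i) * cos(n*(i - 1/2)*pi/K), n = 1..L."""
    S = np.asarray(S, dtype=float)
    K = len(S)
    i = np.arange(1, K + 1)
    return np.array([np.sum(np.log(S) * np.cos(n * (i - 0.5) * np.pi / K))
                     for n in range(1, L + 1)])

# One 30 ms frame at 16 kHz: a loud 440 Hz tone vs. a near-silent clip.
sr = 16000
t = np.arange(int(0.030 * sr)) / sr
tone = np.sin(2 * np.pi * 440 * t)
silence = 0.001 * np.ones_like(t)

# STE cleanly separates the two clips, and the tone's ZCR is close to
# the theoretical 2*f/sr for a sine wave, as the text describes.
```

Note that a flat filter bank yields (near-)zero cepstral coefficients, since the constant log-energy vector is orthogonal to every DCT basis with n >= 1.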
4. PROPOSED ALGORITHM AND IMPLEMENTATION
The solution implementation contains three important modules: a module for extracting
audio and video data from the sports video stream, a Matlab code module to detect peaks in the
audio, and an action event retrieval module. The proposed algorithm proceeds in the following
steps:
Step 1: Read the input video stream and separate the audio track and video manually using
external tools. The separated audio track and video are stored for the Matlab module.
Step 2: Calculate the number of samples and the sampling frequency of the separated audio.
Step 3: Calculate the MFCC coefficients for each second of audio. MFCC extraction includes these
sub-steps: divide the signal into frames; for each frame take the Fourier transform; convert to
the Mel spectrum; take the logarithm; take the discrete cosine transform (DCT).
Step 4: The function wenergy() is used to calculate wave energy values for each second from the
MFCC values obtained previously.
Step 5: Store all energy values in an array and plot them to obtain a graphical view.
Step 6: Using an adaptive threshold, determine the high-peak action event locations where the
energy value is adaptively high. Around 25-50 frames before and after the audio peak are
considered for action event retrieval.
Steps 1-6 are repeated over fixed-size blocks of the input data for the complete sports video
stream.
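The steps above can be sketched end to end. This is an illustrative Python sketch, not the paper's MATLAB code: plain per-second signal energy stands in for the MFCC + wenergy() step (wenergy() is a MATLAB wavelet-energy routine), and the mean-plus-k-standard-deviations rule is one possible adaptive threshold, since the paper does not spell out its exact formula.

```python
import numpy as np

def per_second_energy(audio, sr):
    """Energy of each one-second chunk of the audio track
    (stand-in for the paper's MFCC + wenergy() step)."""
    n_sec = len(audio) // sr
    chunks = np.asarray(audio[:n_sec * sr], dtype=float).reshape(n_sec, sr)
    return np.sum(chunks ** 2, axis=1)

def adaptive_peak_seconds(energy, k=2.0):
    """Seconds whose energy exceeds mean + k*std of the track
    (one simple adaptive-threshold rule)."""
    thr = energy.mean() + k * energy.std()
    return np.flatnonzero(energy > thr)

def event_frame_windows(peak_secs, fps=25, half_width=25):
    """Map each peak second to a window of +/- half_width frames
    around the frame aligned with that second (step 6)."""
    return [(max(0, s * fps - half_width), s * fps + half_width)
            for s in peak_secs]

# Synthetic 10 s track: quiet noise with a loud burst in second 6.
sr = 8000
rng = np.random.default_rng(0)
audio = 0.01 * rng.standard_normal(10 * sr)
audio[6 * sr:7 * sr] += np.sin(2 * np.pi * 440 * np.arange(sr) / sr)

energy = per_second_energy(audio, sr)
peaks = adaptive_peak_seconds(energy)
windows = event_frame_windows(peaks)
```

On this synthetic track only second 6 crosses the threshold, and it maps to the frame window (125, 175) at 25 fps.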
5. EXPERIMENTATION AND RESULT ANALYSIS
The proposed algorithm is implemented in Matlab 2012a with the default optimization turned
on. The system used for the experimentation has an Intel i5 2.66 GHz processor with 4 GB of RAM
and runs the Windows 7 Ultimate operating system. The cricket videos used in the experiments were
collected via the Internet. The data set comprises 10 videos, each about 30 MB in size. The
proposed model accepts input video in ".avi" form, though the algorithm can also work on other
video formats. The results are explained below.
Figure 5.1 Input video clip
The video sample is a cricket clip in which the player Bravo hits three stunning sixes
and one boundary, and the video shows players and audience celebrating the moment. The sample has
audio variations, and manual examination reveals that the audio amplitude is at its maximum
during the hits and cheering.
The solution is implemented by separating the input cricket video stream into video and
audio streams and then analyzing the audio stream for peak values. Using the adaptive threshold
technique, the peak values and their corresponding video frame numbers are calculated for all
audio samples using wave energy, as shown below.
Figure 5.2: Wave Energy of an audio
Frames corresponding to each peak are then extracted and stored as peak frames. This frame
detection based approach shows excellent detection accuracy and also saves processing time. The
algorithm yields many action events, each containing 25 frames, from the input video.
Figure 5.3: Peak Frames Detected for given video
Meanwhile, all other 25-frame action events at which the audio peak is high are extracted.
The retrieved frames indicate an action event whose wave energy peak is at a high level,
corresponding to loud cheering of the audience, the shot of a batsman, or possibly the fall of a
wicket. With these frames, the frames one second earlier and one second later are clubbed
together to form an action video clip, as shown below.
Figure 5.4: Action Events Retrieved for given video
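The clubbing of one second of frames before and after each peak comes down to simple frame-index arithmetic, sketched below in illustrative Python (a 25 fps frame rate is assumed, consistent with the 25-frame events reported above):

```python
def action_clip_range(peak_frame, fps=25, n_frames=None):
    """Club one second of frames before and one second after the peak
    frame into a single action clip; clamp to the video's bounds."""
    start = max(0, peak_frame - fps)
    end = peak_frame + fps
    if n_frames is not None:
        end = min(end, n_frames - 1)
    return start, end
```

For a peak at frame 150 this yields the clip range (125, 175); peaks near the start or end of the video are clamped.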
For evaluation we selected a few video clips downloaded from different datasets; the
preprocessing step of the algorithm is omitted. This section presents quantitative results on the
performance of the action event detection and retrieval system. Table 5.1 gives an overview of
the tests performed. It reports the total number of frames and the number of frames retrieved;
the next columns give the size and length of the video, and finally the fidelity, which describes
the number of significant action events retrieved from the original video.
Table 5.1 Experimental results of the proposed algorithm on cricket video

Name of the File                                  | Number of Frames | Size (MB) | Length of video (min) | Number of frames Retrieved | Total Number of Action Events
sample.avi (Global thresholding)                  | 3577             | 28.1      | 2:23                  | 75                         | 28
sample.avi (Adaptive thresholding with 5 frames)  | 3577             | 28.1      | 2:23                  | 75                         | 24
sample.avi (Adaptive thresholding with 10 frames) | 3577             | 28.1      | 2:23                  | 75                         | 28
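Table 5.1 contrasts global and adaptive thresholding. The difference can be illustrated with a toy energy series in which one very loud event inflates a global threshold and masks a softer event, while a locally computed threshold still catches it. This is illustrative Python, not the paper's MATLAB code, and the window-based local rule is an assumption, since the paper does not give its exact thresholding formula:

```python
import numpy as np

def global_threshold_peaks(energy, k=2.0):
    """Peaks against one threshold computed from the whole series."""
    thr = energy.mean() + k * energy.std()
    return set(np.flatnonzero(energy > thr).tolist())

def local_adaptive_peaks(energy, win=10, k=2.0):
    """Peaks against a threshold computed from a local window: each
    point is compared with mean + k*std of its neighbours, so softer
    events far from the loudest section still register."""
    peaks = set()
    for i in range(len(energy)):
        lo, hi = max(0, i - win), min(len(energy), i + win + 1)
        neighbours = np.delete(energy[lo:hi], i - lo)
        if energy[i] > neighbours.mean() + k * neighbours.std():
            peaks.add(i)
    return peaks

# Flat background with one huge peak (index 20) and one modest one (70).
energy = np.ones(100)
energy[20] = 50.0
energy[70] = 5.0
```

On this series the global rule detects only the loud event at index 20, while the local rule detects both 20 and 70, mirroring the sensitivity difference the table hints at.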
This section gave a detailed description of the experiments conducted for testing and
evaluation of the proposed methodology for action event retrieval. The experimental results are
encouraging and show significant efficiency in event retrieval, with 86% accuracy. It also
described how the peak frames are formed into a video file in .avi format, verified by manual
investigation of the input cricket video.
6. CONCLUSION
An algorithm for the retrieval of batsman stroke action and audience cheer events in
cricket video was designed and tested on a sufficient number of cricket clips. The algorithm is
implemented in Matlab 2012a and executed on an Intel i5 2.66 GHz processor with 4 GB of RAM. It
extracts the batsman's stroke action by detecting the knocking sound of the bat hitting the ball
in broadcast cricket video, followed by the detection of cheers from the audience via the energy
level of the audio in the MFCC domain. Event retrieval using the audio cue was demonstrated
successfully, and the audio pattern captured by wave energy features shows that a particular
event in a video can be identified by its peculiar audio pattern and parameters. The presented
algorithm first separated the audio content from the video via a software tool and then extracted
the audio samples per frame; once the audio samples were obtained, the MFCC coefficients and wave
energy values were calculated for all audio segments. After applying adaptive thresholding, the
peak frames were found, and the event video is formed by considering the frames around each audio
peak. The experimental results are encouraging and show significant efficiency in event
detection, with 86% accuracy on the audio track.
7. REFERENCES
[1]. Henrique Malvar, 1998, “Enhancing the Performance of Subband Audio Coders for Speech
Signals,” in Proc. of IEEE International Symposium on Circuits and Systems – Monterey,
CA, June 1998.
[2]. Milind R. Naphade, Roy Wang and Thomas S. Huang, 2001, “Multimodal Pattern Matching
For Audio-Visual Query and Retrieval,” department of Electrical and Computer Engineering
Beckman Institute for Advanced Science and Technology University of Illinois, Urbana-
Champaign, 2001.
[3]. Michael A. Casey, 2001, “Reduced-Rank Spectra and Minimum-Entropy Priors as Consistent
and Reliable Cues for Generalized Sound Recognition,” MERL, Cambridge Research
Laboratory, 2001.
[4]. E. Kijak, G. Gravier, P. Gros, L. Oisel and F. Bimbot, 2003, “Hmm Based Structuring Of
Tennis Videos Using Visual And Audio Cues,” Thomson multimedia R&D, France, de Belle-
Fontaine, 35510 Cesson-Sevigne, France, 2003.
[5]. W. H. Adams, Giridharan Iyengar, 2003, “Semantic indexing of multimedia Content Using
Visual, Audio, And Text Cues,” EURASIP Journal on Applied Signal Processing, 2, 1–16,
2003.
[6]. Huayong LIU, Tingting HE, 2004, “A Content-Based News Video Retrieval System:
NVRS,” department of Computer Science, Central China Normal University, Wuhan 430079,
PR China, 2004.
[7]. Min Xu, Ling-Yu Duan, Liang-Tien Chia, 2004, “Audio Keyword Generation For Sports
Video Analysis,” School of Computer Engineering, Nanyang Technological University,
Singapore, October 10-16, 2004.
[8]. Baoxin Li, James H. Errico, 2004, “Bridging the Semantic Gap in Sports Video Retrieval and
Summarization,” SHARP Laboratories of America, 5750 NW Pacific Rim Blvd., Camas, WA
98607, USA- 2004.
[9]. L. Joyeux, E. Doyle, H. Denman, 2004, “Content Based Access for A Massive Database of
Human Observation Video,” in Proc. of the 6th ACM SIGMM international workshop on
Multimedia information retrieval, 46 – 52, 2004.
[10]. Ying Li, 2004, “Content-Based Movie Analysis and Indexing Based On Audiovisual Cues,”
in Proc. of IEEE Transactions On Circuits And Systems For Video Technology, Vol. 14, No.
8, August 2004.
[11]. Michael G. Christel, Chang Huang, Neema Moraveji, and Norman Papernick, 2004,
“Exploiting Multiple Modalities for Interactive Video Retrieval,” Carnegie Mellon University
Pittsburgh, 5-1-2004.
[12]. David A. Sadlier and Noel E. O’Connor, 2005 ,“Event Detection In Field Sports Video Using
Audio–Visual Features And A Support Vector Machine,” in Proc. of IEEE transactions on
circuits and systems for video technology, vol. 15, no. 10, October 2005.
[13]. Alexander Haubold and John R. Kender, 2006, “Augmented Segmentation and Visualization
for Presentation Videos,” Department of Computer Science, Columbia University, New
York, 2006.
[14]. Ziyou Xiong, Xiang Zhou, Qi Tian, Rui Yong, and Thomas S. Huang, 2006, “Semantic
Retrieval Of Video,” United Technologies Research Center, East Hartford, 2006.
[15]. Jose A. Lay, Paisarn Muneesawang, Tahir Amin and Ling Guan, 2007, “Assessing Semantic
Relevance by Using Audiovisual Cues,” International Journal of Information and Systems
Sciences, Volume 3, Number 3, Pages 420-427.
[16] Reeja S R and Dr. N. P Kavya, “Motion Detection for Video Denoising – The State of Art
and the Challenges”, International Journal of Computer Engineering & Technology (IJCET),
Volume 3, Issue 2, 2012, pp. 518 - 525, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[17]. Reshma R. Gulwani and Sudhirkumar D. Sawarkar, “Video Indexing using Shot Boundary
Detection Approach and Search Tracks in Video”, International Journal of Computer
Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 432 - 440, ISSN Print:
0976 – 6367, ISSN Online: 0976 – 6375.
[18]. Vilas Naik and Raghavendra Havin, “Entropy Features Trained Support Vector Machine
Based Logo Detection Method for Replay Detection and Extraction from Sports Videos”,
International Journal of Graphics and Multimedia (IJGM), Volume 4, Issue 1, 2013, pp. 20 -
30, ISSN Print: 0976 – 6448, ISSN Online: 0976 – 6456.