What is Video Classification?
Images and videos have become ubiquitous on the internet, which has encouraged the development of algorithms that can analyze their
semantic content for various applications, including search and summarization.
Video classification is the machine learning task of identifying what a video represents.
It is inspired by the human ability to recognize objects, classify them, process and extract information from them, and interpret the
results.
A video is an ordered sequence of images (frames), so video classification is closely related to the image classification problem.
The video classification process involves collecting a dataset that contains a set of unique classes, such as different actions or movements.
Neural Network
McCulloch and Pitts were the first to take inspiration from the human brain and propose a computational model of the neuron, the foundation of the networks now used for object detection in images and videos.
Each node, or artificial neuron, connects to others and has an associated weight and threshold.
If the output of any individual node is above the specified threshold value, that node is activated, sending data to the next layer of the
network. Otherwise, no data is passed along to the next layer of the network.
McCulloch-Pitts Neuron
McCulloch and Pitts proposed the first highly simplified computational model of a neuron (1943).
The function g aggregates the inputs, and the function f takes a decision based on this aggregation.
The inputs can be excitatory or inhibitory.
b is a thresholding parameter.
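As a minimal illustration (not part of the original slides), the following Python sketch implements a McCulloch-Pitts neuron with Boolean inputs, an aggregation g and a thresholded decision f with threshold b; the AND-gate example is an assumption for demonstration.

```python
def mcculloch_pitts_neuron(inputs, inhibitory, b):
    """McCulloch-Pitts neuron sketch.

    inputs: Boolean (0/1) input signals.
    inhibitory: flags marking which inputs are inhibitory.
    b: thresholding parameter.
    """
    # Any active inhibitory input forces the output to 0.
    if any(x and inh for x, inh in zip(inputs, inhibitory)):
        return 0
    # g: aggregate the excitatory inputs by summing them.
    g = sum(x for x, inh in zip(inputs, inhibitory) if not inh)
    # f: fire (output 1) if the aggregation reaches the threshold b.
    return 1 if g >= b else 0

# Example: a 2-input AND gate (threshold b = 2).
print(mcculloch_pitts_neuron([1, 1], [False, False], b=2))  # 1
print(mcculloch_pitts_neuron([1, 0], [False, False], b=2))  # 0
```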
Perceptron
Frank Rosenblatt proposed the first perceptron model in 1958.
It is a more general model than the McCulloch-Pitts neuron.
Input values are no longer limited to Boolean values.
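As an illustrative sketch (my own, not from the slides), a perceptron extends this to real-valued inputs with weights w and a bias b; the numeric values below are arbitrary assumptions.

```python
import numpy as np

def perceptron(x, w, b):
    """Perceptron: weighted sum of real-valued inputs followed by a step decision."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Hypothetical example with two real-valued inputs.
x = np.array([0.5, -1.2])
w = np.array([0.8, 0.3])    # weights (learned during training)
b = -0.1                    # bias term
print(perceptron(x, w, b))  # prints 0 for these values
```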
Deep Neural Network
A single-layer perceptron gives large errors on more complex and dense data, which limits the technique's usefulness for classification problems.
Multilayering was initially introduced with two hidden layers: one layer takes the Boolean decision and the second represents an arbitrary decision space.
CNNs, built on multilayered neural networks, have been demonstrated to be an effective class of models for understanding image content,
giving state-of-the-art results on image recognition, segmentation, detection and retrieval.
The networks scale up to tens of millions of parameters and are trained on massive labeled datasets that can support the learning process.
CNNs have been shown to learn powerful and interpretable image features. Encouraged by these positive results in the domain of
images, studies were performed on the performance of CNNs in object detection and large-scale video classification.
Convolutional Neural Network
A single convolution block consists of (1) convolutional filtering, (2) ReLU activation and (3) pooling.
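A minimal sketch of one such block, assuming PyTorch as the framework (the channel counts and input size are illustrative, not taken from the slides):

```python
import torch
import torch.nn as nn

# One convolution block: convolutional filtering -> ReLU activation -> pooling.
conv_block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1),  # filtering
    nn.ReLU(),                                                                       # activation
    nn.MaxPool2d(kernel_size=2, stride=2),                                           # pooling
)

x = torch.randn(1, 3, 32, 32)   # dummy RGB image
print(conv_block(x).shape)      # torch.Size([1, 16, 16, 16])
```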
Standard Convolutional Neural Network
Le-Net
• Le-Net was the first CNN, developed for handwritten character recognition.
• It takes an input image of size 32×32 with one channel.
• It uses three convolution layers with 5×5 filters and stride 1, along with
two 2×2 average poolings with stride 2 (see the sketch after these bullets).
• It feeds the resulting 400 values into a 120-unit fully connected
network and uses a soft-max layer to give the output.
• It uses the tanh activation function.
• About 60k parameters were learned during training.
• It is limited to detecting hand-written images and
cannot handle other kinds of images.
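A rough PyTorch sketch of this architecture, reconstructed from the bullet points above (layer sizes such as 6 and 16 channels follow the classic LeNet-5 design and are my assumption where the slides are silent):

```python
import torch
import torch.nn as nn

# Sketch of Le-Net: three 5x5 convolutions, two 2x2 average poolings, tanh activations.
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, stride=1), nn.Tanh(),     # 1x32x32 -> 6x28x28
    nn.AvgPool2d(kernel_size=2, stride=2),                   # -> 6x14x14
    nn.Conv2d(6, 16, kernel_size=5, stride=1), nn.Tanh(),    # -> 16x10x10
    nn.AvgPool2d(kernel_size=2, stride=2),                   # -> 16x5x5 (= 400 values)
    nn.Conv2d(16, 120, kernel_size=5, stride=1), nn.Tanh(),  # the 400 values feed 120 units
    nn.Flatten(),
    nn.Linear(120, 84), nn.Tanh(),
    nn.Linear(84, 10),                                        # class scores; soft-max gives the output
)

x = torch.randn(1, 1, 32, 32)
print(lenet(x).shape)   # torch.Size([1, 10])
```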
Alex-Net
• Image-Net is a large database of images for object classification and
detection.
• Alex-Net was trained on the Image-Net dataset.
• It has 5 convolution layers and 3 fully connected layers.
• Alex-Net used local response normalization (LRN) in the first and second
convolution layers.
• Max pooling was used in the 1st, 2nd and 5th convolution layers.
• It used the ReLU activation function instead of tanh.
• The developers used two GPUs to train on the very large dataset and even
cross-connected the two GPU halves of the network.
• About 60M parameters were learned during training, using about 0.6M
neurons.
• Roughly 95% of the computation is in the convolution layers and 5% in the fully connected layers.
Standard Convolutional Neural Network
VGG-Net
• VGG-Net was developed by the Visual Geometry Group of Oxford.
• The VGG group focused on increasing the depth of the convolution
network to achieve better performance and a lower error rate than
Alex-Net.
• They maintained a homogeneous architecture throughout the
network instead of struggling to find filter sizes.
• It uses filters of constant size 3×3 with stride 1 and padding 1.
• It uses 2×2 max pooling with stride 2.
• It also employs single 1×1 convolution filters.
• VGG-Net proved that using smaller filters reduces the number of
learning parameters while still capturing the image features.
• VGG-16 and VGG-19 differed only slightly, showing that performance
reaches a saturation point after a certain depth.
Google-Net
• It is a deeper network with better efficiency, along with a reduced parameter
count and memory usage.
• It has 22 layers and no fully connected layers, avoiding the rapid
multiplication of the number of parameters.
• It has 5M parameters, which is very small compared to Alex-Net.
• Google-Net has an inception module, which allows parallel branches of
convolutions and repetitions of these structures.
• It is an improved version of the naïve inception module, which was
computationally very expensive.
• Google-Net introduced a 1×1 bottleneck layer to reduce the channel size.
• It applies a 1×1 convolution after max pooling to save further
computation (a simplified sketch of such a module follows).
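A simplified PyTorch sketch of an inception-style module with 1×1 bottleneck layers and a 1×1 convolution after max pooling; the channel counts are illustrative assumptions, not the actual Google-Net configuration:

```python
import torch
import torch.nn as nn

class InceptionSketch(nn.Module):
    """Parallel convolution branches with 1x1 bottlenecks, concatenated along channels."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 32, kernel_size=1)                  # plain 1x1 branch
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, 16, kernel_size=1),   # 1x1 bottleneck
                                     nn.Conv2d(16, 32, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, 8, kernel_size=1),    # 1x1 bottleneck
                                     nn.Conv2d(8, 16, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                         nn.Conv2d(in_ch, 16, kernel_size=1))  # 1x1 after max pool
    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

x = torch.randn(1, 64, 28, 28)
print(InceptionSketch(64)(x).shape)   # torch.Size([1, 96, 28, 28])
```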
Pre-Deep Learning Era for Object Detection
• A video classification process involves object classification, localization and detection.
• Earlier methods such as Viola-Jones and the sliding-window algorithm were used for object detection.
• Viola-Jones was predominantly used for face detection.
• It has three main parts: (1) weak classification using Haar-like features, (2) AdaBoost to build strong classifiers and (3) a cascade of strong classifiers.
• Haar-like features are rectangular features based on Haar wavelets.
• The feature value is given by f = Ʃ(sum of pixels inside the black box) – Ʃ(sum of pixels inside the white box).
• Each feature is a weak classifier, and the number of features evaluated per sliding window is
very large.
• AdaBoost was introduced to combine them into a strong classifier (a sketch of the feature computation follows).
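A minimal NumPy sketch of the feature-value computation above, for a hypothetical two-rectangle (black-over-white) Haar-like feature; the image and coordinates are made up for illustration:

```python
import numpy as np

def haar_feature_value(img, r, c, h, w):
    """Two-rectangle Haar-like feature: black rectangle on top, white rectangle below.

    img: 2-D grayscale image; (r, c): top-left corner; h, w: size of each rectangle.
    """
    black = img[r:r + h, c:c + w].sum()           # sum of pixels inside the black box
    white = img[r + h:r + 2 * h, c:c + w].sum()   # sum of pixels inside the white box
    return black - white                          # f = sum(black) - sum(white)

img = np.random.randint(0, 256, size=(24, 24))    # dummy 24x24 detection window
print(haar_feature_value(img, r=4, c=4, h=6, w=8))
```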
Pre-Deep Learning Era for Object Detection
• Lighting effects may hide features and reduce accuracy.
• First, a CNN is trained on a set of images so that it becomes a classifier.
• The CNN processes sliding windows of the same size as its input and extracts feature maps from them.
• The CNN classifies each window as containing an object or not.
• The sliding-window classifier often produces a large number of bounding boxes
for the same object.
• Intersection over Union (IoU) is then used to suppress the extra bounding boxes.
Pre-Deep Learning Era for Object Detection
• This IoU measure drives non-maximum suppression (NMS).
• NMS selects a box from the bounding-box proposal list (typically the highest-scoring one) and compares it with the rest of the boxes by IoU score.
• If IoU > 0.5, the overlapping box is removed from the list; this repeats until the remaining boxes are mutually exclusive.
• Even so, the bounding boxes may not fit tightly around the object, and the computation is very time consuming and expensive. A minimal sketch of this NMS procedure follows.
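The sketch below is my own greedy NMS implementation following the description above (boxes processed in order of score, IoU threshold 0.5); it is illustrative rather than the exact procedure from the slides.

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop heavy overlaps, repeat."""
    order = list(np.argsort(scores)[::-1])   # highest-scoring boxes first
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Keep only boxes whose overlap with the selected box is at most the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]: the second box overlaps the first and is suppressed
```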
Standard Convolutional Neural Network
• R-CNN is a region-proposal-based object detection algorithm.
• It uses selective search to identify potential regions that might contain objects.
• It uses graph-based image segmentation to get the initial regions.
• Region proposals are warped to the same size so they can be fed into the CNN.
• Each region is evaluated by the CNN to perform classification and object detection.
• The CNN (an Alex-Net or VGG-16 style network) extracts feature maps and produces a 4096-dimensional feature vector from each region
proposal.
• The CNN classifies the object using soft-max and refines the box with bounding-box regression.
• Segmenting and passing each region through the CNN takes a lot of time and space.
• It cannot be used in real time, as it takes about 47 s per image.
Standard Convolutional Neural Network
Fast R-CNN
• Fast R-CNN maps the region proposals directly onto the CNN feature maps:
the original image is fed to the CNN once, instead of feeding region proposals one at a time.
• The network first processes the whole image with several convolution
and max-pooling layers to produce feature maps.
• Fast R-CNN thus removes the major bottleneck of generating ~2000 regions
per image and feeding them to the CNN one by one.
• It extracts the regions of interest and projects them onto the shared feature
map, instead of computing a separate feature map per region.
• It adds an RoI pooling layer.
• Given an RoI of size h × w, RoI pooling divides it into an H × W grid of
sub-windows of approximate size h/H × w/W and max-pools each sub-window,
producing a fixed-size feature map (see the sketch after this slide).
• These feature vectors are then fed to two output layers, one for
classification and one for bounding-box regression.
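A simplified NumPy sketch of RoI max pooling for a single-channel feature map, following the h/H × w/W description above (my own reconstruction, not code from the slides):

```python
import numpy as np

def roi_max_pool(feature_map, roi, H=2, W=2):
    """Divide the RoI (x1, y1, x2, y2) into an H x W grid of sub-windows of
    approximate size h/H x w/W and max-pool each one into a fixed H x W output."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            r0, r1 = (i * h) // H, ((i + 1) * h) // H   # sub-window rows (~ h/H tall)
            c0, c1 = (j * w) // W, ((j + 1) * w) // W   # sub-window cols (~ w/W wide)
            out[i, j] = region[r0:r1, c0:c1].max()
    return out

fmap = np.arange(64, dtype=float).reshape(8, 8)   # dummy 8x8 feature map
print(roi_max_pool(fmap, roi=(1, 1, 7, 5)))       # fixed 2x2 output for the RoI
```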
Faster R-CNN
• Fast R-CNN still has the bottleneck of generating ~2000 region proposals
with selective search; this was resolved in the Faster R-CNN algorithm.
• It replaces selective search with a Region Proposal Network (RPN).
• It uses anchor boxes of different scales and aspect ratios to
identify multiple objects in the image.
Single Stage vs Two Stage Object Detectors
Single Stage Detector
• A single convolutional network predicts the bounding boxes and the class probabilities for these boxes.
• Examples: YOLO, SSD.
Two Stage Detector
• First, the model proposes a set of regions of interest using selective search or a Region Proposal Network. The proposed regions are sparse, since the potential bounding-box candidates can be infinite. Then a classifier processes only the region candidates.
• Examples: Fast R-CNN, Faster R-CNN, Mask R-CNN.
SSD – Single Shot MultiBox Detector
• The SSD model detects objects in a single pass, which saves a lot of time.
• At the same time, the SSD model also achieves impressive detection
accuracy.
• By detecting in a single shot, SSD overcomes the speed limitation of region-proposal-based detectors and can be used for object detection
in real time.
• Faster R-CNN runs the whole process at about 7 frames per second, whereas SSD runs at
about 59 frames per second.
• To achieve high detection accuracy, the SSD model produces predictions at
different scales from feature maps of different scales, and explicitly separates
predictions by aspect ratio.
Terminology - Multibox
• VGG-16 is the base network that performs the feature extraction. Conv layers evaluate
boxes of different aspect ratios at each location in several feature maps with different
scales.
Terminology - Jaccard Overlap (IOU)
Fig: YOLO grid cells – centre of an object seen in the 10th cell
Fig: Image with actual and predicted bounding boxes

IoU = Area of overlap (intersection) / Area of union
Terminology – Matching Strategy
At training time, the default boxes are matched to the ground-truth boxes by aspect ratio, location and scale. The boxes with the highest overlap with the ground-truth bounding boxes are selected.
The IoU (Intersection over Union) between a matched box and the ground truth should be greater than 0.5.
SSD – Architecture
• The SSD model is made up of 2 parts, namely
– the backbone model
– the SSD head.
• The backbone model is a typical pre-trained image classification network that works as
the feature-map extractor. The final image classification layers of this model
are removed, leaving only the extracted feature maps.
• The SSD head is made up of a couple of convolutional layers stacked together and added
on top of the backbone model. It outputs the bounding boxes over the
objects; these convolutional layers detect the various objects in the image.
Architecture - Backbone
• VGG 16
• POOL5 changed:
– 3x3 kernel instead of 2x2
– Stride 1 instead of 2
• First 2 FC layers replaced by convolutional layers
• Last FC removed altogether
• No dropouts used
• Conv4_3 also used for prediction:
– 4th group of convolutions
– 3rd convolution layer within that group
Architecture – SSD Head
• The multi-scale feature maps are added to the end of the truncated backbone model.
These multi-scale feature maps reduce in size progressively, which allows the detections
at various scales of the image.
• The convolutional layers used here vary for each feature layer.
SSD – Architecture
• Each prediction is composed of
– a bounding box with shape offsets ∆cx, ∆cy, h and w, representing the offsets from the
centre of the default box and its height and width
– confidences for all object categories (all the classes); class 0 is reserved to indicate the
absence of an object.
• SSD uses default boxes of different scales, shapes and aspect ratios on different output
layers.
• It uses 8732 boxes for better coverage of location, scale and aspect ratio. Most of the
predictions will not contain any object, so SSD drops predictions whose confidence score
is lower than 0.01. Non-maximum suppression (NMS) with an overlap threshold of 0.45 per
class is then applied, and the top 200 detections per image are kept (a sketch of this filtering follows).
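A per-class sketch of this prediction filtering; the 0.01, 0.45 and 200 thresholds come from the bullets above, while the boxes, scores and the use of torchvision's nms are my own illustrative assumptions:

```python
import torch
from torchvision.ops import nms

def filter_predictions(boxes, scores, conf_threshold=0.01, nms_iou=0.45, top_k=200):
    """Filter one class's predictions: drop low-confidence boxes, run NMS, keep top-k."""
    keep = scores >= conf_threshold                     # drop scores below 0.01
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_threshold=nms_iou)    # suppress overlaps above 0.45
    kept = kept[:top_k]                                 # indices come back sorted by score
    return boxes[kept], scores[kept]

boxes = torch.tensor([[10., 10., 50., 50.], [12., 12., 52., 52.], [80., 80., 120., 120.]])
scores = torch.tensor([0.9, 0.6, 0.005])
print(filter_predictions(boxes, scores))
```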
Pre-trained Model
• Model is trained with 512x512 size images
• Trained on Pascal VOC dataset
• ResNet-50 V1 as backbone model.
Accuracy Comparisons of the State of the Art
Model                   VOC2007 test mAP
Faster R-CNN (VGG16)    73.2
YOLO                    63.4
SSD300 (VGG)            72.1
SSD500 (VGG16)          75.1
YOLO – You Only Look Once
In 2016, the paper “You Only Look Once: Unified, Real-Time
Object Detection” by Joseph Redmon, Santosh Divvala, Ross
Girshick and Ali Farhadi made a big change in this field.
Most accurate real-time object detector
– There are other more accurate ones, but they are not real-time
Fastest object detector
– The speed was achieved because the algorithm does not operate in two stages like R-CNN
does.
Bounding Box - Example
For a grid cell, the label vector is:
[P_C, x, y, w, h, C1, C2] = [1, 50, 70, 60, 70, 1, 0]
C1 = Dog Class
C2 = Person Class
(P_C = 1 indicates an object is present; since C1 = 1 and C2 = 0, the detected object is a dog.)
Anchor box
YOLO can work well for multiple objects where each object is associated with one grid cell.
But in the case of overlap, in which one grid cell actually contains the centre points of two
different objects, we use anchor boxes to allow one grid cell to detect multiple objects.
Anchor boxes can be thought of as ‘predictions’ of bounding boxes.
By defining anchor boxes, we can create a longer grid cell vector
and associate multiple classes with each grid cell.
Anchor boxes have a defined aspect ratio, and they try to
detect objects that nicely fit into a box with that ratio.
Confidence Score
It indicates how sure the system is that the anchor box contains an object.
Class confidence: P(Class_i | Object),
which means: given that there is an object, what is the probability that this object is
of the specified class i? This conditional probability indicates how confident the classifier is
about its class prediction.
If the classifier predicts that the object is a car with 70% confidence, it means
P(Car | Object) = 0.70.
Based on both these levels of confidence, we define the confidence of the system to
localize and classify an object as:
Confidence Score = Box confidence × Conditional class probability
Box confidence C = P_c × IoU
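A tiny sketch of this computation (the numeric values are illustrative assumptions):

```python
def confidence_score(p_c, iou, class_prob):
    """Confidence Score = Box confidence x Conditional class probability,
    where Box confidence C = P_c * IoU."""
    box_confidence = p_c * iou
    return box_confidence * class_prob

# Hypothetical values: objectness 0.9, IoU with ground truth 0.8, P(Car | Object) = 0.70.
print(confidence_score(p_c=0.9, iou=0.8, class_prob=0.70))   # 0.504
```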
Non Maximal Suppression (NMS)
Fig: Many anchor boxes for an object
YOLO uses Non-Maximal Suppression (NMS) to keep only the best bounding box.
The first step in NMS is to remove all predicted bounding boxes that have a detection probability less than a given NMS threshold. For example, if we fix the NMS threshold at 0.6, all predicted bounding boxes with a detection probability below 0.6 are removed.
The second step in NMS is to select the bounding box with the highest detection probability and eliminate all bounding boxes whose Intersection over Union (IoU) with it is higher than a given IoU threshold.
Fig: One anchor box remaining after Non Maximal Suppression
YOLOv3 – Object Prediction
YOLOv3 uses Darknet-53, a backbone of 53 convolutional layers, stacked with 53 more layers, producing 106 layers in total.
Detections are made at three layers: 82, 94 and 106, by applying 1×1 kernels to down-sampled feature maps of sizes (13×13), (26×26) and (52×52).
The detection kernels also have a depth, calculated by the equation
b*(5+c)
where:
b = number of bounding boxes that each cell of the produced feature map can predict (b = 3 for YOLOv3)
(5+c) = bounding-box attributes per box, with c = 80 COCO classes.
Total: (3*(5+80)) = 255 attributes.
So the 1×1 kernels produce feature maps of size (13×13×255), (26×26×255) and (52×52×255).
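A quick check of this depth arithmetic and the resulting output shapes (purely illustrative):

```python
b, c = 3, 80                  # boxes per cell, COCO classes
depth = b * (5 + c)           # 5 box attributes plus c class scores per box = 255
for s in (13, 26, 52):        # the three detection scales
    print((s, s, depth))      # (13, 13, 255), (26, 26, 255), (52, 52, 255)
```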
YOLO – Training
Fig: Sample from COCO dataset
YOLOv3 works on the ‘Darknet’ framework, an open-source framework written in C and CUDA that can be run on a GPU for high speed.
The dataset used for object detection is COCO (Microsoft Common Objects in Context), which has 382K images in 80 object categories.
The predicted bounding boxes are compared with the ground-truth boxes in the dataset, and training on this comparison yields the object detector.
Loss Function
YOLO uses the sum-squared error between the predictions and the
ground truth to calculate the loss. The loss function is composed of:
• the classification loss,
• the localization loss (errors between the predicted
boundary box and the ground truth),
• the confidence loss (the objectness of the box).
The final loss adds the localization, confidence and classification
losses together (a simplified sketch follows).
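A highly simplified sketch of adding these three sum-squared-error terms; the per-term weights and object masks used by the real YOLO loss are omitted, and all shapes and values are assumptions:

```python
import torch

def yolo_style_loss(pred_box, true_box, pred_conf, true_conf, pred_class, true_class):
    """Sum-squared-error version of the three loss terms described above."""
    localization = ((pred_box - true_box) ** 2).sum()        # box coordinate errors
    confidence = ((pred_conf - true_conf) ** 2).sum()        # objectness errors
    classification = ((pred_class - true_class) ** 2).sum()  # class probability errors
    return localization + confidence + classification        # final loss: simple sum

# Hypothetical single-cell example.
loss = yolo_style_loss(
    pred_box=torch.tensor([0.5, 0.5, 0.2, 0.3]), true_box=torch.tensor([0.45, 0.55, 0.25, 0.3]),
    pred_conf=torch.tensor([0.8]),               true_conf=torch.tensor([1.0]),
    pred_class=torch.tensor([0.7, 0.3]),         true_class=torch.tensor([1.0, 0.0]),
)
print(loss)
```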
Video Classification – Block Diagram
Input video → Extraction of frames → Object detection using YOLO/SSD → Output from object detection → Algorithm for scenario identification → Output: summary of the video
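A skeletal Python sketch of this pipeline; detect_objects and identify_scenario are hypothetical placeholders (the slides do not specify their implementation), and OpenCV is assumed only for frame extraction:

```python
import cv2  # OpenCV, assumed for reading video frames

def classify_video(video_path, detect_objects, identify_scenario, frame_step=30):
    """Video classification pipeline: frames -> object detection -> scenario -> summary."""
    capture = cv2.VideoCapture(video_path)             # input video
    detections, index = [], 0
    while True:
        ok, frame = capture.read()                     # extraction of frames
        if not ok:
            break
        if index % frame_step == 0:                    # sample one frame every frame_step frames
            detections.append(detect_objects(frame))   # e.g. a YOLO/SSD detector
        index += 1
    capture.release()
    scenario = identify_scenario(detections)           # algorithm for scenario identification
    return f"Scenario: {scenario} ({len(detections)} frames analysed)"  # summary of the video

# Usage sketch: classify_video("clip.mp4", my_yolo_detector, my_scenario_rule)
```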
Conclusion
We have explored the basics of deep neural networks and object detection.
We explored object detection techniques such as R-CNN, SSD and YOLO.
We have proposed an algorithm for video classification that describes different scenarios.
Future Scope
More accurate and faster object detection can be achieved using the latest versions of YOLO.
Deep learning training can be performed on whole videos, rather than frame-wise classification, using LSTM and ResNet models.
Complex videos involving more than one scenario, and generic scenario identification, can be addressed.