2. Outline
• Artistic style transfer for videos and spherical images
• SalNet360: Saliency Maps for omnidirectional images with CNN
• Restricted Deformable Convolution based Road Scene Semantic Segmentation Using
Surround View Cameras
• Distortion-Aware Convolutional Filters for Dense Prediction in Panoramic Images
• Appendix:
• Spatial Transformer Networks
• Active Convolution: Learning the Shape of Convolution for Image Classification
• Warped Convolutions: Efficient Invariance to Spatial Transformations
• Deformable Convolutional Networks
3. Artistic style transfer for videos and spherical images
• Manually re-drawing an image in a certain artistic style takes a professional artist a long time.
• Doing this for a video sequence single-handedly is beyond imagination.
• Two computational approaches transfer the style from one image (for example, a painting)
to a whole video sequence.
• The first approach adapts the original CNN-based image style transfer technique (CVPR’16),
which is based on energy minimization, to videos.
• It introduces new initializations and loss functions to generate consistent and stable stylized
video sequences even in cases with large motion and strong occlusion.
• The second approach formulates video stylization as a learning problem.
• It proposes a deep network architecture and training procedures that allow stylizing
arbitrary-length videos in a consistent and stable way, and nearly in real time.
4. Artistic style transfer for videos and spherical images
Basic training procedure for a style transfer network with prior image (e.g. last frame warped).
The goal of the network is to produce a new stylized image, where the combination of
perceptual loss and deviation from the prior image (in non-occluded regions) is minimized.
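A minimal Python sketch of the objective described above, under stated assumptions: images are nested lists of grayscale values, `mask` is 1 in non-occluded pixels and 0 elsewhere, the deviation is a squared error averaged over all pixels, and `temporal_weight` is a hypothetical weighting factor. The perceptual loss is passed in as a plain number because computing it is outside the scope of this sketch.

```python
def masked_temporal_loss(output, prior, mask):
    """Squared deviation from the prior image, counted only where mask == 1.

    The sum is divided by the total number of pixels (an assumption of this sketch).
    """
    total = 0.0
    count = 0
    for out_row, prior_row, mask_row in zip(output, prior, mask):
        for o, p, m in zip(out_row, prior_row, mask_row):
            total += m * (o - p) ** 2
            count += 1
    return total / count


def combined_loss(perceptual, output, prior, mask, temporal_weight=1.0):
    """Perceptual loss plus weighted deviation from the (warped) prior image."""
    return perceptual + temporal_weight * masked_temporal_loss(output, prior, mask)
```

Pixels marked as occluded (mask 0) contribute nothing, so the network is free to paint new content there.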
5. Artistic style transfer for videos and spherical images
Training procedure for the multi-frame approach, shown with three frames. Back-propagating only one frame
already improves the quality of generated videos a lot. Back-propagating more frames would have required
decreasing the size of the network due to memory restrictions.
6. Artistic style transfer for videos and spherical images
• Virtual reality (VR) applications are becoming increasingly popular, and the demand for
image processing methods applicable to spherical images and videos is rising.
• Spherical reality media is typically distributed via a 2D projection.
• The most common format is the equirectangular projection.
• However, this format does not preserve shapes: the distortion of the projection
becomes very large towards the poles of the sphere.
• Such non-uniform distortions are problematic for style transfer.
• Therefore, the method works with subdivided spherical images that consist of multiple
rectilinear projections.
• In particular, it uses the cubic projection, which represents a spherical image with six
non-distorted square images.
7. Artistic style transfer for videos and spherical images
Cubemap projection used for stylizing spherical images. The generated images must be consistent
along the boundaries of neighboring cube faces. Every cube face has four neighbors.
For style transfer in this regime, the six cube faces must be stylized such that their cut edges
are consistent, i.e., the style transfer must not introduce false discontinuities along the edges
of the cube in the final projection. Since applications in VR environments must run in real time,
only the fast, network-based approach is considered here.
8. Artistic style transfer for videos and spherical images
Training data generation process for a network to adapt to perspective transformed border regions.
The extensions for video style transfer and for spherical images can be combined to process
spherical videos.
This yields two constraints: (1) each cube face should be consistent along the motion trajectory;
(2) neighboring cube faces must have consistent boundaries.
For 1), calculate optical flow for each cube face separately, then warp stylized cube faces the
same way as for regular planar videos. For 2), blend both the warped image from the last frame
and the transformed border of already stylized neighboring cube faces.
9. Artistic style transfer for videos and spherical images
The left image shows the overlap region of a cube face from a panoramic image. The right shows
close-ups for two networks. Left: Not fine-tuned. Right: Fine-tuned. In regions with little
structure (top and middle), the fine-tuning strategy reduced unnatural artifacts along the inner
edge of the prior image. It sometimes uses stylistic features to mask the transition (middle).
In regions with more structure (bottom), both networks adapted well to the given prior.
10. SalNet360: Saliency Maps for omni-directional images with CNN
• With the current trend in the Virtual Reality (VR) field, adapting known techniques to this
new kind of media is starting to gain momentum.
• One of the applications for VR headsets is the display of Omni-directional Images (ODIs).
• These images portray an entire scene as seen from a static point of view, and when viewed
through a VR headset, allow for an immersive user experience.
• The most common method for storing ODIs is by applying equirectangular, cylindrical or
cubic projections and saving them as standard two-dimensional images.
• The prediction of Visual Attention data from any kind of media is valuable to content
creators and can be used to drive encoding algorithms efficiently.
• The work proposes an architectural extension, applicable to any CNN, that fine-tunes
traditional 2D saliency prediction to ODIs in an end-to-end manner.
• To address the distortions introduced by these projections, the method relies on:
• Subdividing the ODI into undistorted patches.
• Providing the CNN with the spherical coordinates for each pixel in the patches.
11. SalNet360: Saliency Maps for omni-directional images with CNN
ODI Saliency Detection Pipeline.
This method takes an ODI as input and splits it into six patches using the pre-processing steps. Each of
these six patches is sent through the CNN. The outputs of the CNN for all the patches are then combined
using the post-processing technique.
12. SalNet360: Saliency Maps for omni-directional images with CNN
Spherical coordinates definition and sliding frustum used to create the patches.
By specifying the field of view per patch and its resolution, it is possible to calculate the
spherical coordinates of each pixel in the patch. These are then used to find the corresponding
pixels in the ODI through the projection equations.
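The slide's own equations are not reproduced in this text, so the following Python sketch only illustrates the idea under assumed conventions: a square patch with a pinhole model, x to the right, y up, z forward, longitude in [-π, π], latitude in [-π/2, π/2], and an equirectangular ODI whose centre corresponds to longitude 0 and latitude 0. The function names and the `yaw`/`pitch` parameters are hypothetical.

```python
import math


def patch_pixel_to_sphere(u, v, size, fov_deg, yaw=0.0, pitch=0.0):
    """Return (longitude, latitude) in radians for pixel (u, v) of a square patch."""
    half = math.tan(math.radians(fov_deg) / 2.0)
    # Pixel centre on an image plane at distance 1 from the viewpoint.
    x = (2.0 * (u + 0.5) / size - 1.0) * half
    y = (1.0 - 2.0 * (v + 0.5) / size) * half
    z = 1.0
    # Rotate by pitch about the x axis, then by yaw about the y axis.
    y, z = (y * math.cos(pitch) + z * math.sin(pitch),
            -y * math.sin(pitch) + z * math.cos(pitch))
    x, z = (x * math.cos(yaw) + z * math.sin(yaw),
            -x * math.sin(yaw) + z * math.cos(yaw))
    lon = math.atan2(x, z)
    lat = math.atan2(y, math.hypot(x, z))
    return lon, lat


def sphere_to_equirect(lon, lat, width, height):
    """Map spherical coordinates to fractional (column, row) in the ODI."""
    col = (lon / (2.0 * math.pi) + 0.5) * width
    row = (0.5 - lat / math.pi) * height
    return col, row
```

The fractional ODI coordinates would then be used to look up (for example, bilinearly interpolate) the colour for each patch pixel; the same spherical coordinates are what the method feeds to the CNN alongside the patch.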
13. SalNet360: Saliency Maps for omni-directional images with CNN
Network Architecture
Patches extracted from the ODI.
14. SalNet360: Saliency Maps for omni-directional images with CNN
Comparison of the three experimental scenarios. Top row: On the left the input ODI, on the right the
ground truth saliency map blended with the image. Bottom row: From left to right, the result of the three
experimental scenarios: Base CNN, Base CNN + Patches, Base CNN + Patches + Spherical Coords.
15. Restricted Deformable Convolution based Road Scene
Semantic Segmentation Using Surround View Cameras
• Understanding the surrounding environment of the vehicle is still one of the challenges for
autonomous driving.
• This work performs 360-degree road scene semantic segmentation using surround view cameras,
which are widely equipped in existing production cars.
• First, to address the large distortion in fisheye images, Restricted Deformable
Convolution (RDC) is proposed for semantic segmentation, which can effectively model
geometric transformations by learning the shapes of convolutional filters conditioned on the
input feature map.
• Second, to obtain a large-scale training set of surround view images, a method called zoom
augmentation is proposed to transform conventional images to fisheye images.
• Finally, an RDC based semantic segmentation model is built; the model is trained for real-
world surround view images through a multi-task learning architecture by combining real-
world images with transformed images.
• It takes ERFNet as the baseline model for segmentation.
16. Restricted Deformable Convolution based Road Scene
Semantic Segmentation Using Surround View Cameras
The center of the undistorted image is clear, but the boundaries of the
image are very blurred. In addition, some information is lost when transferring the
pixels of the raw fisheye image into the undistorted image.
“ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation”, 2017
17. Restricted Deformable Convolution based Road Scene
Semantic Segmentation Using Surround View Cameras
A surround view system consists of four fisheye cameras, one mounted on each side of the vehicle.
Cameras in different directions capture images with different image composition.
18. Restricted Deformable Convolution based Road Scene
Semantic Segmentation Using Surround View Cameras
RDC is a restricted version of deformable convolution. The sampling locations of 3 × 3 convolutions: (a)
Standard convolution. (b) Dilated convolution with dilation 2. (c) Deformable convolution. (d)
Restricted deformable convolution. The dark points are the actual sampling locations, and the hollow
circles in (c) and (d) are the initial sampling locations. (a) and (b) employ a fixed grid of sampling
locations. (c) and (d) augment the sampling locations with learned 2D offsets (red arrows). The primary
difference between (c) and (d) is that restricted deformable convolution employs a fixed central
sampling location. No offset needs to be learned for the central sampling location in (d).
19. Restricted Deformable Convolution based Road Scene
Semantic Segmentation Using Surround View Cameras
A 3 × 3 restricted deformable convolution. The module is initialized with a 3 × 3 filter with dilation. Offset
fields are learned from the input feature map by a regular convolutional layer. The channel dimension 2(N − 1)
corresponds to N − 1 2D offsets (the red arrows). The actual sampling positions (dark points) are obtained
by adding the 2D offsets. The value at each new position is obtained by using bilinear interpolation to
weight the four nearest points. The yellow arrows denote the back-propagation paths of gradients.
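The sampling step in the caption above can be sketched in plain Python for a single output location. This is an illustration under assumptions, not the authors' implementation: the feature map is a single-channel nested list, `offsets` holds the N − 1 = 8 learned (dy, dx) pairs in row-major order with the centre skipped, and samples outside the map count as zero. In the real module the offsets come from a convolutional layer; here they are simply passed in.

```python
import math


def bilinear(fmap, y, x):
    """Sample a 2D list at fractional (y, x) by weighting the four nearest points."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                value += (1 - abs(y - yy)) * (1 - abs(x - xx)) * fmap[yy][xx]
    return value


def restricted_deformable_conv_at(fmap, weights, offsets, cy, cx, dilation=1):
    """3x3 RDC response at (cy, cx): eight offset samples plus a fixed centre."""
    grid = [(dy * dilation, dx * dilation) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    k = 0  # index into the N - 1 learned offsets
    for w, (gy, gx) in zip(weights, grid):
        if (gy, gx) == (0, 0):
            oy, ox = 0.0, 0.0  # the central sampling location is never shifted
        else:
            oy, ox = offsets[k]
            k += 1
        out += w * bilinear(fmap, cy + gy + oy, cx + gx + ox)
    return out
```

With all offsets set to zero the function reduces to an ordinary (dilated) 3 × 3 convolution at that location.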
20. Restricted Deformable Convolution based Road Scene
Semantic Segmentation Using Surround View Cameras
(a) 3 × 3 regular convolution. (b) Factorized convolutions. (c) Factorized restricted
deformable convolution. The nonlinearities in (b) and (c) are omitted here.
2D filters can be approximated as a combination of 1D filters, for the sake of reducing memory
and computational cost. A basic decomposed layer consists of vertical kernels followed by
horizontal ones, and a nonlinearity is inserted in between 1D convolutions.
For 2D RDC, each learned offset has two components: a vertical one and a horizontal one.
With the 2D kernel decomposed into a vertical kernel and a horizontal kernel, the offsets can also be
decomposed into two components along the same directions.
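A small Python check of the factorization idea, as a sketch: when the nonlinearity between the two 1D convolutions is left out, a vertical 3 × 1 filter followed by a horizontal 1 × 3 filter gives the same result as one 3 × 3 filter equal to their outer product. The kernel values and the helper `conv2d_valid` are illustrative choices, not taken from the slides.

```python
def conv2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


vertical = [[1.0], [2.0], [1.0]]   # 3x1 kernel, 3 weights
horizontal = [[1.0, 0.0, -1.0]]    # 1x3 kernel, 3 weights
# Equivalent 3x3 kernel (9 weights): outer product of the two 1D kernels.
full = [[v[0] * h for h in horizontal[0]] for v in vertical]

image = [[float((r * 7 + c * 3) % 5) for c in range(6)] for r in range(6)]
two_step = conv2d_valid(conv2d_valid(image, vertical), horizontal)
one_step = conv2d_valid(image, full)
```

The factorized pair stores 6 weights instead of 9. Only rank-1 kernels are reproduced exactly this way, and inserting a nonlinearity between the 1D steps, as the decomposed layer does, breaks the strict equality.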
21. Restricted Deformable Convolution based Road Scene
Semantic Segmentation Using Surround View Cameras
• Training of deep networks requires a huge number of training images, but training datasets
are always limited.
• Data augmentation methods are adopted to enlarge training data using label-preserving
transformations.
• Many forms are employed to do data augmentation for semantic segmentation, such as
horizontal flipping, scaling, rotation, cropping and color jittering.
• The operation of warping conventional images to fisheye-style images is called
zoom augmentation.
• The zoom augmentation can adopt a fixed focal length or a randomly changing focal length.
• Via the zoom augmentation method, an existing conventional image dataset for semantic
segmentation can be transformed into a fisheye-style image dataset.
• The smaller the focal length, the larger the degree of distortions.
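The slides do not give the mapping behind zoom augmentation, so the sketch below assumes an equidistant fisheye model, r_f = f · atan(r_c / f), where r_c and r_f are the distances from the principal point in the conventional and fisheye images. To fill a fisheye image, each target pixel is mapped back to its source location with the inverse, r_c = f · tan(r_f / f). The function name and arguments are hypothetical.

```python
import math


def fisheye_to_source(u, v, cx, cy, f):
    """Source location in the conventional image for fisheye pixel (u, v).

    Assumes r_f = f * atan(r_c / f); returns None when the pixel lies outside
    the 180-degree field of view of this model.
    """
    dx, dy = u - cx, v - cy
    r_f = math.hypot(dx, dy)
    if r_f == 0.0:
        return cx, cy  # the principal point maps to itself
    theta = r_f / f
    if theta >= math.pi / 2:
        return None
    r_c = f * math.tan(theta)
    scale = r_c / r_f
    return cx + dx * scale, cy + dy * scale
```

For a pixel 100 px from the centre, f = 200 pulls from about 109 px in the source while f = 800 pulls from about 100.5 px, which matches the statement that smaller focal lengths distort more. Annotation maps would be resampled with nearest-neighbour lookup so that class labels are not blended.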
22. Restricted Deformable Convolution based Road Scene
Semantic Segmentation Using Surround View Cameras
Zoom augmentation. Left: the original color image and annotation. Right: the images and annotations
transformed by zoom augmentation, with the focal length changing from 200 to 800.
23. Restricted Deformable Convolution based Road Scene
Semantic Segmentation Using Surround View Cameras
The multi-task learning architecture for road scene semantic segmentation. The data are fed into
three shared-weight sub-networks (the blue blocks). The total loss is the weighted sum of main losses and
auxiliary losses. γ is the auxiliary loss weighting that balances the contribution of auxiliary losses. α is the task
weighting of the main branch that balances the main losses of different tasks. Similarly, β is the task weighting of
the auxiliary branch that balances the auxiliary losses of different tasks.
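One plausible reading of this weighting, written as a Python sketch: the exact formula is not shown in the slides, so the form total = Σ αᵢ · mainᵢ + γ · Σ βᵢ · auxᵢ and the function name are assumptions.

```python
def total_loss(main_losses, aux_losses, alpha, beta, gamma):
    """Weighted sum of per-task main losses and auxiliary losses (assumed form)."""
    main = sum(a * loss for a, loss in zip(alpha, main_losses))
    aux = sum(b * loss for b, loss in zip(beta, aux_losses))
    return main + gamma * aux
```

With three tasks, each list has three entries, one per shared-weight sub-network.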
24. Restricted Deformable Convolution based Road Scene
Semantic Segmentation Using Surround View Cameras
ERFNet-RDC-λ. (a) Non-bt-1D block in ERFNet. (b) Reconstructed non-bt-1D block. The first two
convolutional layers are replaced with RDC layers. (c) The encoder of ERFNet-RDC-λ.
25. Restricted Deformable Convolution based Road Scene
Semantic Segmentation Using Surround View Cameras
An example of the segmentation results produced by ERFNet, ERFNet-DC-8,
ERFNet-FRDC-8, and ERFNet-RDC-8. The red pixels denote false recognitions of the
bus. ERFNet-RDC-8 detected nearly the whole bus in the image.
26. Restricted Deformable Convolution based Road Scene
Semantic Segmentation Using Surround View Cameras
(a) Results from different models
(b) List of the 18 class names and the corresponding colors used for labeling.
Examples of results on the test set of SVScape. The results of the front, rear, left and right
views are displayed in (a). The first two rows show the raw image and the ground truth, and the
following four rows show the results produced by different models. The last row shows the
improvement/error map, which marks in red the pixels misclassified by this method and in green the
pixels that are misclassified by the base model ERFNet but correctly predicted by the proposed
method. The color code is listed in (b).
27. Restricted Deformable Convolution based Road Scene
Semantic Segmentation Using Surround View Cameras
Bird’s eye view semantic segmentation, obtained by mapping the segmentation
results of the raw surround view images onto the bird’s eye view plane.
28. Distortion-Aware Convolutional Filters for Dense
Prediction in Panoramic Images
• There is a high demand for 3D data for 360° panoramic images and videos, pushed by the
growing availability on the market of specialized hardware for both capturing (e.g.,
omni-directional cameras) and visualizing in 3D (e.g., head-mounted displays) panoramic
images and videos.
• At the same time, 3D sensors able to capture 3D panoramic data are expensive and/or
hardly available.
• To fill this gap, this work proposes a learning approach for panoramic depth map estimation
from a single image.
• Thanks to a specifically developed distortion-aware deformable convolution filter, this
method can be trained on conventional perspective images and then used to regress
depth for panoramic images, thus bypassing the effort needed to create an annotated
panoramic training dataset.
• The approach is demonstrated on emerging tasks such as panoramic monocular SLAM, panoramic
semantic segmentation and panoramic style transfer.
29. Distortion-Aware Convolutional Filters for Dense
Prediction in Panoramic Images
From a single input equirectangular image (top left), this method exploits distortion-aware convolutions to
notably reduce distortions in depth prediction that affect conventional CNNs (bottom row). Top right: the same
idea used to predict semantic labels, to obtain panoramic 3D semantic segmentation from a single image.
30. Distortion-Aware Convolutional Filters for Dense
Prediction in Panoramic Images
The key concept behind the distortion-aware convolution is that the sampling grid is
deformed according to the image distortion model, so that the receptive field is rectified.
31. Distortion-Aware Convolutional Filters for Dense
Prediction in Panoramic Images
Computation of the adaptive sampling grid for an equirectangular image. Each pixel p in the equirectangular
image is transformed into unit sphere coordinates. The sampling grid is then computed on the tangent plane
in unit sphere coordinates. Finally, the sampling grid is back-projected into the equirectangular image to
determine the location of the distorted sampling grid.
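The three steps in this caption can be sketched in plain Python. The conventions are assumptions of the sketch: longitude 0 and latitude 0 sit at the image centre, rows grow downward, the tangent-plane spacing is tan(2π / width) so that neighbours at the equator land about one pixel apart, and longitude wrap-around at the image border is not handled. `distorted_grid` is a hypothetical name.

```python
import math


def distorted_grid(col, row, width, height, ksize=3):
    """Return the ksize*ksize sampling locations (col, row) for pixel (col, row)."""
    # Step 1: pixel -> unit sphere (longitude, latitude) and its 3D direction.
    lon0 = (col / width - 0.5) * 2.0 * math.pi
    lat0 = (0.5 - row / height) * math.pi
    d = (math.cos(lat0) * math.sin(lon0), math.sin(lat0), math.cos(lat0) * math.cos(lon0))
    east = (math.cos(lon0), 0.0, -math.sin(lon0))
    north = (-math.sin(lat0) * math.sin(lon0), math.cos(lat0),
             -math.sin(lat0) * math.cos(lon0))
    step = math.tan(2.0 * math.pi / width)
    half = ksize // 2
    grid = []
    for i in range(-half, half + 1):        # kernel rows, top to bottom
        for j in range(-half, half + 1):    # kernel columns, left to right
            # Step 2: regular grid on the plane tangent to the sphere at d.
            x, y = j * step, -i * step
            p = [d[k] + x * east[k] + y * north[k] for k in range(3)]
            # Step 3: back-project to the sphere, then to the equirectangular image.
            lon = math.atan2(p[0], p[2])
            lat = math.atan2(p[1], math.hypot(p[0], p[2]))
            grid.append(((lon / (2.0 * math.pi) + 0.5) * width,
                         (0.5 - lat / math.pi) * height))
    return grid
```

At the equator the grid is nearly the regular 3 × 3 grid, while near a pole the horizontal neighbours spread over several pixels, which is the rectification of the receptive field that the distortion-aware convolution relies on. The resulting fractional locations would be read with bilinear interpolation.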
32. Distortion-Aware Convolutional Filters for Dense
Prediction in Panoramic Images
A major advantage of the approach is that standard convolutional architectures can be
used with common datasets for perspective images to train the weights. At test time, the
weights are transferred to the same architecture with distortion-aware convolutional
filters so as to process equirectangular images. Although the figure reports the case of depth
prediction, the same strategy applies to the semantic segmentation task.
33. Distortion-Aware Convolutional Filters for Dense
Prediction in Panoramic Images
Compared methods in experimental evaluation: (a) Standard convolution on equirectangular
image, (b) Standard convolution on 6 rectified images via cube map projection, (c) Distortion-
aware convolution on equirectangular image.
34. Distortion-Aware Convolutional Filters for Dense
Prediction in Panoramic Images
Example of equirectangular image with/without inpainting and extracted rectified perspective images.
Since the images in this dataset lack color near the polar regions, those regions are filled in with zeros. To avoid
biasing the network during training, an inpainting algorithm is applied. To create perspective images for training,
images with a limited field of view are first extracted along different directions from the original 360° panoramic image.
Directions are sampled on a 20° interval along the vertical axis (yaw rotation) and on a 15° interval along the
horizontal axis (pitch rotation). They are then rectified into a standard perspective view. These rectified perspective
images are created by mapping pixels from the equirectangular projection to the perspective projection.
35. Distortion-Aware Convolutional Filters for Dense
Prediction in Panoramic Images
Depth prediction on Stanford 2D-3D-S dataset. Red circles highlight artifacts due to distortions induced by the
standard convolutional model (a) and by the CubeMap representation (b) that are instead solved by this approach (c).
36. Distortion-Aware Convolutional Filters for Dense
Prediction in Panoramic Images
Qualitative comparison of semantic segmentation on Stanford 2D-3D-S dataset. Red circles highlight errors
on polar regions and borders of the CubeMap model that are not present in our distortion-aware approach.
37. Distortion-Aware Convolutional Filters for Dense
Prediction in Panoramic Images
Application of our distortion-aware convolution for panoramic style transfer.
39. Spatial Transformer Networks
• CNNs are still limited by their lack of ability to be spatially invariant to the input in an efficient manner.
• A learnable Spatial Transformer module allows spatial manipulation of data within the network.
• This differentiable module can be inserted into existing convolutional architectures, giving
NNs the ability to actively spatially transform feature maps, conditional on the feature map
itself, without any extra training supervision or modification to the optimization process.
• The use of spatial transformers results in models which learn invariance to translation, scale,
rotation and more generic warping for a number of classes of transformations.
• (i) image classification: a spatial transformer that crops out and scale-normalizes the
appropriate region can simplify the subsequent classification task, and lead to superior
classification performance;
• (ii) co-localization: given a set of images containing different instances of the same (but
unknown) class, a spatial transformer can be used to localize them in each image;
• (iii) spatial attention: a spatial transformer can be used for tasks requiring an attention
mechanism, and can be trained purely with backpropagation without reinforcement learning.
40. Spatial Transformer Networks
The result of using a spatial transformer as the 1st layer of a fully-connected network trained for distorted
MNIST digit classification. (a) The input to the spatial transformer network is an image of an MNIST digit
that is distorted with random translation, scale, rotation, and clutter. (b) The localization network of the
spatial transformer predicts a transformation to apply to the input image. (c) The output of the spatial
transformer, after applying the transformation. (d) The classification prediction produced by the subsequent
fully-connected network on the output of the spatial transformer. The spatial transformer network (a CNN
including a spatial transformer module) is trained end-to-end with only class labels – no knowledge of the
ground truth transformations is given to the system.
41. Spatial Transformer Networks
The spatial transformer mechanism is split into three parts. In order of computation, first a localization
network takes the input feature map, and through a number of hidden layers outputs the parameters
of the spatial transformation that should be applied to the feature map – this gives a transformation
conditional on the input. Then, the predicted transformation parameters are used to create a sampling
grid, which is a set of points where the input map should be sampled to produce the transformed
output. This is done by the grid generator. Finally, the feature map and the sampling grid are taken as
inputs to the sampler, producing the output map sampled from the input at the grid points.
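A compact Python sketch of the last two parts, the grid generator and the sampler, for an affine transformation. The localization network is left out and its output `theta` (a 2 × 3 matrix) is passed in directly; coordinates are normalized to [-1, 1], the output is assumed to be at least 2 × 2, and points outside the input count as zero. These conventions and the function names are assumptions of the sketch.

```python
import math


def affine_grid(theta, out_h, out_w):
    """Grid generator: for each output pixel, the input point (xs, ys) to sample."""
    grid = []
    for r in range(out_h):
        y = -1.0 + 2.0 * r / (out_h - 1)
        row = []
        for c in range(out_w):
            x = -1.0 + 2.0 * c / (out_w - 1)
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            row.append((xs, ys))
        grid.append(row)
    return grid


def sample(image, grid):
    """Sampler: bilinear interpolation of the input at every grid point."""
    h, w = len(image), len(image[0])
    out = []
    for grid_row in grid:
        out_row = []
        for xs, ys in grid_row:
            px = (xs + 1.0) * (w - 1) / 2.0   # back to pixel coordinates
            py = (ys + 1.0) * (h - 1) / 2.0
            x0, y0 = math.floor(px), math.floor(py)
            value = 0.0
            for yy in (y0, y0 + 1):
                for xx in (x0, x0 + 1):
                    if 0 <= yy < h and 0 <= xx < w:
                        value += (1 - abs(px - xx)) * (1 - abs(py - yy)) * image[yy][xx]
            out_row.append(value)
        out.append(out_row)
    return out
```

Because both steps are built from sums and products, gradients can flow back to `theta`, which is what lets the localization network train from class labels alone.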
42. Spatial Transformer Networks
Two examples of applying the parameterized sampling grid to an image U,
producing the output V. (a) The sampling grid is the regular grid G = T_I(G),
where I is the identity transformation parameters. (b) The sampling grid is
the result of warping the regular grid with an affine transformation T_θ(G).
43. Active Convolution: Learning the Shape of
Convolution for Image Classification
• The active convolution unit (ACU) is a convolution unit with no fixed shape, so it can define any form of convolution.
• Its shape can be learned through backpropagation during training.
• This unit has a few advantages.
• First, the ACU is a generalization of convolution; it can define not only all conventional
convolutions, but also convolutions with fractional pixel coordinates; it can freely change
the shape of the convolution, which provides greater freedom to form CNN structures.
• Second, the shape of the convolution is learned while training and there is no need to
tune it by hand.
• Third, the ACU can learn better than a conventional unit, simply by changing the
conventional convolution to an ACU.
• Code is available at https://github.com/jyh2986/Active-Convolution.
44. Active Convolution: Learning the Shape of
Convolution for Image Classification
Concept of the ACU. Black dots represent each synapse. The
ACU's output is the summation of the values at all positions pk,
each multiplied by its weight. The position of each synapse is
parameterized by pk. The ACU can define more diverse forms of
receptive fields for convolutions with learnable position parameters.
Inspired by the nervous system, each acceptor of the ACU is called
a synapse. Position parameters can be differentiated,
and the shape can be learned through backpropagation.
45. Active Convolution: Learning the Shape of
Convolution for Image Classification
• ACU is considered a generalization of the convolution unit.
• Any conventional convolution is represented with ACU by
setting positions of synapses properly and fixing all positions.
• Dilated convolution can be also represented by multiplying
the dilation factor with the position parameters.
• Compared to a conventional convolution, the ACU can
generate fractional dilated convolutions and be used to
directly calculate the results of the interpolated convolution.
• It can also be used to define K synapses without any
restriction (e.g., cross-shaped convolution with five synapses,
or a circular convolution with many synapses).
Comparison of a conventional convolution
unit with the ACU. (a) Conventional
convolution unit with 4 input neurons and
two output neurons. (b) Unlike the
convolution unit, the synapses of the ACU
can be connected at inter-neuron
positions and are movable.
46. Active Convolution: Learning the Shape of
Convolution for Image Classification
• At the network level, ACU converts a discrete input space to a
continuous one.
• Since the ACU uses bilinear interpolation between adjacent
neurons, synapses can connect inter-neuron spaces.
• This lends greater representational power to convolution units.
• The position parameters control the synapses that connect
neuron spaces, and the synapses can move around the neuron
space to reduce error.
• A convolution unit has a number of learnable filters, and each
filter is convolved with its receptive field.
• ACU has a learnable position parameter θp, which is the set of
positions of the synapses.
Coordinate system of interpolation.
(m, n) is the base position of the convolution,
and (αk, βk) is the displacement of the kth synapse.
47. Warped Convolutions: Efficient Invariance to
Spatial Transformations
• Warped convolutions are a simple and exact construction with the same computational
complexity as standard convolutions.
• It consists of a constant image warp followed by a simple convolution, which are standard
blocks in deep learning toolboxes.
• With a carefully crafted warp, the resulting architecture can be made equivariant to a wide
range of two-parameter spatial transformations.
• Formulation steps (equations omitted): continuous convolution; group convolutions on the
image plane; warped convolutions via the exponential map.
49. Warped Convolutions: Efficient Invariance to
Spatial Transformations
First row: Sampling grids that define the warps associated with different spatial transformations.
Second row: An example image (a) after warping with each grid (b-d).
Third row: A small translation is applied to each warped image, which is then mapped back to the
original space (by an inverse warp). Translation in one axis of the appropriate warped space is
equivalent to (b) horizontal scaling; (c) planar rotation; (d) 3D rotation around the vertical axis.
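The equivalence between a translation in warped space and a spatial transformation in the original space can be checked numerically with the log-polar warp, used here as a familiar illustration rather than as the exact grids of the figure. In this Python sketch, rotating and scaling a point about the origin shifts its log-polar coordinates by (log scale, angle); the helper names are hypothetical.

```python
import math


def to_log_polar(x, y):
    """Warp a point (x, y) != (0, 0) to (log radius, angle)."""
    return math.log(math.hypot(x, y)), math.atan2(y, x)


def rotate_scale(x, y, angle, scale):
    """Rotate by `angle` (radians) and scale by `scale` about the origin."""
    c, s = math.cos(angle), math.sin(angle)
    return scale * (c * x - s * y), scale * (s * x + c * y)


x, y = 1.0, 0.5
before = to_log_polar(x, y)
after = to_log_polar(*rotate_scale(x, y, 0.3, 2.0))
shift = (after[0] - before[0], after[1] - before[1])  # about (log 2, 0.3)
```

Since an ordinary convolution is equivariant to translation, applying it after such a warp gives equivariance to the corresponding rotation and scaling. The angle wraps around at ±π, which a full implementation would need to handle.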
50. Deformable Convolutional Networks
• Two new modules to enhance the transformation modeling capability of CNNs, namely,
deformable convolution and deformable RoI pooling.
• Both are based on the idea of augmenting the spatial sampling locations in the modules
with additional offsets and learning the offsets from the target tasks, without additional
supervision.
• The modules can replace their counterparts in existing CNNs and can be easily trained end-
to-end by standard back-propagation, giving rise to deformable convolutional networks.
• Learning dense spatial transformation in deep CNNs is effective for sophisticated vision tasks
such as object detection and semantic segmentation.
• The code is released at https://github.com/msracver/Deformable-ConvNets.
51. Deformable Convolutional Networks
Illustration of the sampling locations in 3 × 3 standard and deformable
convolutions. (a) regular sampling grid (green points) of standard convolution.
(b) deformed sampling locations (dark blue points) with augmented offsets
(light blue arrows) in deformable convolution. (c)-(d) are special cases of (b),
showing that the deformable convolution generalizes various transformations
for scale, (anisotropic) aspect ratio and rotation.
52. Deformable Convolutional Networks
• Both deformable convolution and RoI pooling modules operate on the 2D spatial domain.
• The operation remains the same across the channel dimension.
• Without loss of generality, the modules are described in 2D here for notation clarity.
• The 2D convolution consists of two steps: 1) sampling using a regular grid R over the input
feature map x; 2) summation of sampled values weighted by w.
• RoI pooling converts an input rectangular region of arbitrary size into fixed size features.
• Both deformable convolution and RoI pooling modules have the same input and output as their plain versions.
• First, a deep fully convolutional network generates feature maps over the whole input image.
• Second, a shallow task specific network generates results from the feature maps.
• The DCN idea is augmenting the spatial sampling locations in convolution and RoI pooling
with additional offsets and learning the offsets from target tasks.
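As an illustration of the plain RoI pooling step mentioned in the list above, the Python sketch below turns a rectangle of arbitrary size into a fixed k × k output by averaging each bin. The bin boundaries (floor for the start, ceiling for the end) and the use of average pooling are assumptions of this sketch, and the feature map is a single-channel nested list.

```python
def roi_pool(fmap, top, left, height, width, k):
    """Average-pool the RoI [top, top+height) x [left, left+width) into k x k bins."""
    pooled = []
    for i in range(k):
        r0 = top + (i * height) // k
        r1 = top - (-(i + 1) * height // k)       # top + ceil((i + 1) * height / k)
        row = []
        for j in range(k):
            c0 = left + (j * width) // k
            c1 = left - (-(j + 1) * width // k)   # left + ceil((j + 1) * width / k)
            values = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(values) / len(values))
        pooled.append(row)
    return pooled
```

The deformable variant described in the slides would add a learned offset to each bin position before pooling; that part is not shown here.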
55. Deformable Convolutional Networks
Illustration of the fixed receptive field in standard convolution (a) and the adaptive receptive field in deformable
convolution (b), using two layers. Top: two activation units on the top feature map, on two objects of different scales
and shapes. The activation is from a 3 × 3 filter. Middle: the sampling locations of the 3 × 3 filter on the preceding
feature map. Another two activation units are highlighted. Bottom: the sampling locations of two levels of 3 × 3 filters
on the preceding feature map. Two sets of locations are highlighted, corresponding to the highlighted units above.