Dalian Maritime University
Naeem Shehzad
COMPUTER SCIENCE AND TECHNOLOGY
DALIAN MARITIME UNIVERSITY
Decomposing Image Generation into Layout
Prediction and Conditional Synthesis
Introduction
• Recently, Generative Adversarial Networks (GANs) have made it possible to synthesize sharp,
realistic images from random noise. Substantial progress has been made by methods such as
Progressive Growing of GANs (PGGAN) in generating images of ever higher resolution, driven by
innovations that stabilize the training of larger and deeper networks.
• Nevertheless, challenges remain, specifically in generating images that do not contain a
single centered object. Local textures are often convincingly modeled while the global
structure is incoherent. This can lead to a lack of clear separation in the appearance of
different generated object classes. As an example, we show interpolated PGGAN generations of
street scenes and their DeepLab segmentations. Large changes in the semantic labels from one
image to the next imply that, in some sense, this method has learned to generate patches of
“scene-like” texture without separating image regions into discrete objects. Transitions
between images in the sequence end up being transitions between texture patches rather than
natural changes in scene composition, unlike a real video sequence and its corresponding
segmentation.
• We argue that this lack of separation in object textures can be mitigated
by providing an internal representation of the scene structure. In this
work, we decompose the generation process into two stages. First, a GAN
generates a semantic segmentation mask that acts as the internal
representation of the scene structure. The segmentation mask is then
translated into an image using a conditional generation approach. Both
PGGAN and our proposed method guide the optimization of the networks during
training, although in our case we use segmentation masks rather than
sequentially growing the networks.
• We implement the proposed approach as a combination of two existing
techniques. To generate segmentation masks, we train a Wasserstein GAN
with a DCGAN architecture, extended with two upsampling layers. To
texture these segmentation masks, we use pix2pix. We validate the
proposed approach on the CityScapes dataset, which provides both
images and pixel-level segmentation masks. We compare the proposed
approach with both a basic GAN without guidance during optimization
and with PGGAN. The overall pipeline is sketched below.
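As a high-level sketch of this two-stage pipeline (a minimal PyTorch illustration; the function and argument names are ours, not from the original implementation):

```python
import torch

def generate_image(mask_generator, pix2pix_generator, latent_dim=128):
    """Two-stage sampling: layout prediction, then conditional synthesis."""
    z = torch.randn(1, latent_dim)              # random noise vector
    seg_logits = mask_generator(z)              # stage 1: predict a semantic layout
    seg_map = torch.softmax(seg_logits, dim=1)  # per-pixel class distribution
    return pix2pix_generator(seg_map)           # stage 2: translate layout to an image
```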
Progressive Growing of GANs (PGGAN). Top: generated images; bottom: DeepLab segmentations.
Real video sequence. Top: real images; bottom: DeepLab segmentations.
Our method. Top: generated images; bottom: DeepLab segmentations.
Our main findings: this method matches PGGAN in image quality when generating
multi-object scenes, as measured by a user study and by the Fréchet Inception
Distance (FID) metric. In addition, we use segmentation networks as a tool to
probe the structure of generated images, and show that our proposed method
interpolates in a more semantically meaningful way than PGGAN and generates
images with more clearly differentiated objects.
Related Work
• The main idea of this work, decomposing scene generation into semantic maps and
image-to-image translation, has been explored very nicely in concurrent work.
• Image generation: several works have proposed a hierarchical approach to image
generation. LapGAN uses a hierarchy of stacked generators and discriminators.
Recently, impressive results have been obtained by progressively growing the
generator and discriminator, an approach extended with style conditioning in StyleGAN.
• Methods that split image generation into stages include the Layered Recursive GAN,
which splits generation into background and foreground, and approaches that
generate images part by part using multiple generators.
• Conditional image generation: another relevant line of work is conditional image
generation, in particular works using segmentation maps. pix2pix has shown that
image-to-image translation can give very realistic results when going from a
segmentation mask to a real image. Recent improvements, such as CRN and
pix2pixHD, have extended this to higher resolutions. Very convincing translations
from segmentation maps to images have also been shown using a semi-parametric
approach. Improvements have also come from spatially-adaptive normalisation in SPADE.
Method
• The proposed approach is composed of two steps:
– Generation of segmentation masks
– Filling them in with more detailed patterns.
• Generating segmentation masks with GANs: We train a GAN to generate segmentation
maps, using the Wasserstein GAN objective with gradient penalty. The architecture is a
modified version of DCGAN. The original architecture goes up to 64×64 resolution; to
bring our generated segmentation maps up to 256×256, we add two upsampling layers,
each with 64 feature channels. Our generator has 5.4M trainable parameters (a sketch
follows).
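A minimal PyTorch sketch of such a generator, assuming a DCGAN-style backbone that reaches 64×64, with two extra upsampling blocks of 64 channels bringing it to 256×256; the channel widths and class count are illustrative, not the exact configuration:

```python
import torch.nn as nn

class MaskGenerator(nn.Module):
    """DCGAN-style generator extended with two upsampling blocks (illustrative sizes)."""
    def __init__(self, latent_dim=128, num_classes=34):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0), nn.BatchNorm2d(512), nn.ReLU(True),  # 4x4
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),          # 8x8
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),          # 16x16
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),            # 32x32
            nn.ConvTranspose2d(64, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),             # 64x64
        )
        # Two additional upsampling blocks, 64 feature channels each: 64x64 -> 256x256.
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(64, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),             # 128x128
            nn.ConvTranspose2d(64, num_classes, 4, 2, 1),                                       # 256x256
        )

    def forward(self, z):
        x = self.backbone(z.view(z.size(0), -1, 1, 1))
        return self.upsample(x)  # per-class logits over the segmentation map
```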
• For the discriminator, we use the same approach as PatchGAN: we keep the DCGAN
discriminator, which operates on 64×64 patches of the segmentation maps, and average
over the patches. Our discriminator has 4.4M parameters. The critic loss we train it
with is sketched below.
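The Wasserstein objective with gradient penalty can be sketched as follows (the standard WGAN-GP formulation; the penalty weight of 10 is the usual default, not taken from the slides):

```python
import torch

def critic_loss(critic, real, fake, gp_weight=10.0):
    """WGAN-GP critic loss: Wasserstein estimate plus a gradient penalty."""
    # Wasserstein term: the critic should score real samples above fakes.
    loss = critic(fake).mean() - critic(real).mean()

    # Gradient penalty on random interpolates between real and fake samples.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real.detach() + (1 - eps) * fake.detach()).requires_grad_(True)
    grad = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    grad_norm = grad.flatten(1).norm(2, dim=1)
    return loss + gp_weight * ((grad_norm - 1) ** 2).mean()
```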
• Conditional image synthesis: To render the segmentation map, we train a pix2pix
model with 11.4M parameters. We do not train the two models jointly; this work
demonstrates an effective combination of existing tools.
• Dataset: We use the CityScapes dataset because it contains varied scenes and
pixel-wise semantic annotations. The dataset provides 22,000 training images
annotated with coarse segmentations and 3,000 annotated with fine segmentations.
• We take a 1024×1024 crop of the original 1024×2048 images and downscale it to
256×256 (see the preprocessing sketch below). Although this cropping reduces the
amount of variation in the scenes, we find that the simplified task is still
sufficiently challenging to showcase problems with the generation of multi-object
scenes.
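The preprocessing described above could be implemented as follows (a sketch using torchvision; the center crop position and the file name are our assumptions, since the slides only state the crop size):

```python
from PIL import Image
from torchvision import transforms

# Crop 1024x1024 from the 1024x2048 CityScapes frame, then downscale to 256x256.
image_tf = transforms.Compose([
    transforms.CenterCrop(1024),
    transforms.Resize(256),  # bilinear interpolation is fine for RGB images
    transforms.ToTensor(),
])
mask_tf = transforms.Compose([
    transforms.CenterCrop(1024),
    transforms.Resize(256, interpolation=transforms.InterpolationMode.NEAREST),  # keep label ids intact
    transforms.PILToTensor(),
])

image = image_tf(Image.open("frame.png").convert("RGB"))  # hypothetical file name
```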
Experimental Settings
• We evaluate several configurations of our model. For the output of the
generator, we experiment with two alternatives. The first is to encode the
segmentation maps using the RGB colour scheme used to visualise the
CityScapes dataset, so that the output of the generator is simply an RGB
image. The second is to use tensors of shape num_classes×width×height,
where the class of every spatial location is encoded as a one-hot vector.
In this case, the output of the generator is num_classes-dimensional and
we use a softmax as the output non-linearity (see the sketch below).
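A short sketch of the one-hot encoding and softmax output (the class count of 34 is illustrative, not necessarily the label set used):

```python
import torch
import torch.nn.functional as F

num_classes = 34  # illustrative; pick the label set actually used

# Encode an integer label map (H, W) as a one-hot tensor (num_classes, H, W).
label_map = torch.randint(0, num_classes, (256, 256))
one_hot = F.one_hot(label_map, num_classes).permute(2, 0, 1).float()

# Generator output: per-pixel logits -> class distribution via softmax over channels.
logits = torch.randn(1, num_classes, 256, 256)  # stand-in for the generator output
probs = F.softmax(logits, dim=1)
predicted_labels = probs.argmax(dim=1)          # back to an integer label map
```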
• Moreover, we experiment with upsampling in the generator using either
transposed convolutions or nearest-neighbour upsampling followed by a
convolutional layer (both variants are sketched below).
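The two upsampling variants, sketched in PyTorch (channel counts are illustrative):

```python
import torch.nn as nn

# Variant 1: learned upsampling via a transposed convolution.
transposed = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1)

# Variant 2: fixed nearest-neighbour upsampling followed by a convolution,
# a common remedy for the checkerboard artifacts of transposed convolutions.
nearest_then_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
)
```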
• Finally, we also compare results when training with the coarse and
fine annotations provided by the CityScapes dataset.
Other Techniques
• To illustrate the difficulty of capturing the distribution of multi-object
scenes with a Generative Adversarial Network (GAN), we train a model to
generate scenes from CityScapes directly. We use the DCGAN architecture with
two upsampling layers and refer to this method as DCGAN∗. The learning rate
is 0.00002.
• We also compare our method to Progressive Growing of GANs
(PGGAN). This method constrains and guides the optimisation by
jointly growing the generator and discriminator during training. The
generator of this model has 23M parameters. We train the model
using the provided settings until it has seen 150,000 images.
• Since Progressive Growing of GANs (PGGAN) does not have the benefit of
supervision with intermediate segmentation maps, our comparison is not
between two like methods. However, the experimental results may still be
informative in showing that intermediate guidance with semantic
segmentations is useful for generating multi-object scenes with clearly
differentiated objects.
Results
Qualitative Evaluation
• Figure 4 shows sample generations from the different methods. Apart from the DCGAN∗
samples, all generated images have similar quality. Segmentation maps generated by the
different variants of our method are indeed all similar to real segmentation maps,
although the object shapes are not perfectly realistic: for example, our method predicts
an uneven sidewalk and strange arrangements of traffic signs in (f). This does not
prevent realistic translations through pix2pix.
User Study
• To evaluate the realism of generations, we asked Mechanical Turk workers to pick the
image “that looks most like a real photograph of a street scene”. In each comparison, one
image was from our method and one from PGGAN. Numbers are averaged over 100
comparisons, each judged by 25 workers.
Fréchet Inception Distance
• We report the Fréchet Inception Distance (FID). For this metric, features for both the
real and the generated images are extracted from the pool3 layer of an Inception network
trained on ImageNet; the distance is then the Wasserstein-2 distance between Gaussians
fitted to both feature sets (sketched below). Both PGGAN and our methods have lower FID
scores than our baseline of directly generating images with DCGAN∗. Images rendered from
generated coarse annotations have FID scores similar to those of PGGAN, whereas the
scores for images rendered from our generated fine annotations are higher. This could be
due to the scarcity of fine annotations for training.
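For reference, a minimal sketch of the FID computation from extracted Inception pool3 features (the standard formula, not specific to this work):

```python
import numpy as np
from scipy import linalg

def fid(real_feats, gen_feats):
    """Frechet Inception Distance between two feature sets of shape (N, 2048)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(cov_r + cov_g - 2 * covmean))
```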
Latent interpolations
• We evaluate our method by segmenting the first and last video
frames, and then finding the latent vectors that approximately
reconstruct these segmentations. We linearly interpolate between
the first and last latent vector to generate intermediate
segmentation maps, and render these with pix2pix. To recover the
latent representation of a segmentation map generated by our
method, we start with a batch of 64 latent vectors and minimize the
cross-entropy between the generated and target segmentations,
optimising the latent vectors using Adam with a learning rate of
0.01 for 100 iterations, and picking the best one. We found that
computing the softmax of our generator output before computing the
cross-entropy helped the optimisation. For this experiment we use
our method trained on fine annotations with one-hot encoding and
upsampling via transposed convolutions. A sketch of the recovery
procedure follows.
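A sketch of this latent-recovery procedure, assuming a `generator` that maps latent vectors to per-class segmentation logits (the latent dimensionality is our assumption):

```python
import torch
import torch.nn.functional as F

def recover_latent(generator, target_labels, latent_dim=128, steps=100, lr=0.01):
    """Approximately invert the mask generator for a target label map (H, W)."""
    z = torch.randn(64, latent_dim, requires_grad=True)      # batch of 64 candidates
    target = target_labels.long().unsqueeze(0).expand(64, -1, -1)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Per the slides, applying a softmax to the generator output before the
        # cross-entropy helped the optimisation.
        out = torch.softmax(generator(z), dim=1)             # (64, num_classes, H, W)
        loss = F.cross_entropy(out, target, reduction="none").flatten(1).mean(dim=1)
        loss.sum().backward()
        opt.step()
    return z[loss.argmin()].detach()                         # keep the best candidate
```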
Segmenting interpolations
• Here we discuss further the segmented interpolations in Figure 1. In other sampled
sequences, we observe the same noisy pattern in the segmentations of interpolations. In
particular, the pink regions, which carry the label fence, are often predicted, probably
due to blocky artifacts in the generated images.
• Interpolations generated by our method trained on fine segmentation maps also produce
very noisy segmentation maps. There are two reasons why this might happen. First, the
training set of fine annotations contains only 3,000 images, compared to 22,000 images
with coarse annotations; the generator might have overfit to these examples. Second, the
fine annotations are much more complex than the coarse ones and contain many small
objects. Since our generated fine segmentation maps are not perfect, the many small noisy
objects may get rendered into something unrecognizable by pix2pix.
• When we use our method trained with coarse annotations, however, we see a much
smoother and cleaner pattern of segmentations from DeepLab. The source segmentations
generated by our method and the re-segmentations of our rendered images by DeepLab
visually match well, which is not the case with fine annotations.
• Most notably, PGGAN tends to let objects appear and dissolve without much physical
consistency, whereas our approach keeps them in existence, merely moving them around.
These transitions can be seen much more clearly in video sequences generated from random
walks in the latent space, which can be viewed in the Supplementary material.
Conclusion
• We have proposed an alternative method for training
Generative Adversarial Networks (GANs) to generate
images with multiple objects. Segmentation maps are
less complex than natural images, and GANs have an
easier time learning them than learning high-resolution
images directly. Incorporating segmentation maps into
GAN training improves interpolations: interpolations
produced by our method are aware of scene layouts and
show higher structural consistency between frames than
those of PGGAN.