Diffusion Video Autoencoders:
Toward Temporally Consistent Face Video Editing
via Disentangled Video Encoding
Gyeongman Kim1 Hajin Shim1 Hyunsu Kim2 Yunjey Choi2 Junho Kim2 Eunho Yang1,3
1KAIST 2NAVER AI Lab 3AITRICS
Machine Learning & Intelligence Laboratory
CVPR 2023
1 min Summary
[Figure: cropped frames x_1 … x_N are encoded, edited, and decoded independently (1. Encode, 2. Edit, 3. Decode), yielding temporally inconsistent edited frames]
Problem: Temporal consistency
Previous methods
• Face video editing: the task of modifying certain attributes of a face in a video
• All previous methods use a GAN to edit the face in each frame independently
→ Modifying attributes such as beards causes a temporal inconsistency problem
[Figure: all frames are encoded into a single shared identity feature plus per-frame features (1. Encode), one edit of the identity feature (2. Edit) is decoded with the per-frame features (3. Decode), yielding temporally consistent edited frames]
Solution: Decompose a video into a single identity, etc.
Ours
• Diffusion Video Autoencoders
• Decompose a video into {a single identity z_id, each frame's (motion z_lnd^(n), background z_bg^(n))}
• video → decomposed features {z_id, {z_lnd^(n)}_{n=1}^N, {z_bg^(n)}_{n=1}^N} → video
→ All frames can be edited consistently with a single modification of the identity feature
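As a sketch, the encode-edit-decode flow above might look like the following; the encoder functions and the DDIM inversion/decoding here are hypothetical stand-ins, only the pipeline shape follows the slides:

```python
import numpy as np

def encode_video(frames, id_enc, lnd_enc, ddim_invert):
    """Decompose a video: one shared identity code + per-frame motion codes
    and noise maps. id_enc, lnd_enc, ddim_invert stand in for the model."""
    z_id = np.mean([id_enc(f) for f in frames], axis=0)   # single, time-invariant identity
    z_lnd = [lnd_enc(f) for f in frames]                  # per-frame motion
    x_T = [ddim_invert(f, z_id, z) for f, z in zip(frames, z_lnd)]  # per-frame noise maps
    return z_id, z_lnd, x_T

def edit_video(z_id, z_lnd, x_T, edit_fn, ddim_decode):
    """Edit the identity code once, then decode every frame with its own
    original motion code and noise map."""
    z_id_edit = edit_fn(z_id)                             # single modification
    return [ddim_decode(xT_n, z_id_edit, z_n) for xT_n, z_n in zip(x_T, z_lnd)]
```

Because every frame is decoded from the same edited z_id, the edit cannot flicker across frames by construction.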
Paper Details
Method Overview: video autoencoding & editing pipeline
[Figure: per-frame autoencoding & editing pipeline — each frame x_0^(n) is passed through identity and landmark encoders to form z_face^(n), DDIM-inverted into a noise map x_T^(n), and decoded conditioned on the (possibly edited) z_face^(n)]
• Design a diffusion video autoencoder: x_0^(n) → (z_face^(n), x_T^(n)) → x_0^(n)
• High-level semantic latent z_face^(n) (512-dim): consists of a representative identity feature z_id,rep and a motion feature z_lnd^(n)
• Noise map x_T^(n): only the information left out by z_face^(n) is encoded there (= background information)
• Since background information varies too much to be projected into a low-dimensional space, it is encoded in the noise map x_T^(n)
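The construction of the semantic latent can be sketched as follows. The averaging of per-frame identity features into z_id,rep, the linear fusion layer, and the landmark dimensionality are illustrative assumptions; in the actual model this mapping is learned:

```python
import numpy as np

rng = np.random.default_rng(0)

def representative_identity(per_frame_ids):
    # one time-invariant identity code shared by all frames
    return np.mean(per_frame_ids, axis=0)

# hypothetical linear fusion standing in for the learned mapping network
W = rng.normal(scale=0.01, size=(512, 512 + 136))

def semantic_latent(z_id_rep, z_lnd_n):
    """z_face^(n): fuse the shared identity with frame n's motion code
    into the 512-dim latent that conditions the diffusion U-Net."""
    return W @ np.concatenate([z_id_rep, z_lnd_n])

ids = rng.normal(size=(8, 512))            # e.g. 8 frames of identity features
z_id_rep = representative_identity(ids)
z_face = semantic_latent(z_id_rep, rng.normal(size=136))  # 136: hypothetical landmark dim
```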
• Frozen pre-trained encoders are used for feature extraction
• For near-perfect reconstruction, DDIM is used, which provides a deterministic forward-backward process
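The deterministic DDIM round trip behind the near-perfect reconstruction can be sketched in one dimension. `eps_fn` is a toy constant stand-in for the trained z_face-conditioned noise predictor, chosen so the round trip is exactly invertible; with a real network, inversion is only approximately exact:

```python
import numpy as np

def ddim_step(x, a_from, a_to, eps_fn):
    """One deterministic DDIM step between cumulative-alpha levels."""
    eps = eps_fn(x)
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)  # predicted clean image
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def ddim_invert(x0, alphas, eps_fn):
    """Forward process: image -> noise map x_T (alphas decrease over time)."""
    x = x0
    for a_from, a_to in zip(alphas[:-1], alphas[1:]):
        x = ddim_step(x, a_from, a_to, eps_fn)
    return x

def ddim_decode(xT, alphas, eps_fn):
    """Backward process: noise map -> image, running the same steps in reverse."""
    x = xT
    rev = alphas[::-1]
    for a_from, a_to in zip(rev[:-1], rev[1:]):
        x = ddim_step(x, a_from, a_to, eps_fn)
    return x
```

Because both directions share the same deterministic update, decoding an unedited noise map reproduces the original frame.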
Method Overview: training objective
[Figure: the input image x_0 is noised with two independent Gaussian noises ε_1, ε_2; a shared U-Net ε_θ(x_t, t, z_face) predicts each noise for ℒ_simple, and the two estimated x_0's are compared inside the face mask m for ℒ_reg]
• ℒ_simple = 𝔼_{x_0∼q(x_0), ε_t∼𝒩(0,I), t} ‖ε_θ(x_t, t, z_face) − ε_t‖_1
• A simple version of the DDPM loss
• ℒ_reg = 𝔼_{x_0∼q(x_0), ε_1,ε_2∼𝒩(0,I), t} ‖f_{θ,1} ⊙ m − f_{θ,2} ⊙ m‖_1, where f_{θ,i} is the x_0 estimated from noise ε_i and m is the face mask
• For a clear decomposition between background and face information
• ℒ_simple encourages the useful information of the image to be well contained in the semantic latent z_face
• ℒ_reg reduces the effect of the noise in x_T on the face region, so that z_face becomes responsible for the face features
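The two objectives can be sketched for a single training sample; `eps_model` stands in for the z_face-conditioned U-Net, the L1 norms follow the losses above, and batching and timestep sampling are omitted:

```python
import numpy as np

def x0_from_eps(x_t, eps_pred, abar_t):
    # estimated clean image given the predicted noise (DDPM reparameterization)
    return (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)

def dva_losses(x0, mask, abar_t, eps_model, rng):
    """Sketch of L_simple (L1 noise-prediction loss) and L_reg (consistency of
    the estimated x0's inside the face mask under two independent noises)."""
    e1 = rng.normal(size=x0.shape)
    e2 = rng.normal(size=x0.shape)
    xt1 = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * e1   # two noised copies
    xt2 = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * e2
    p1, p2 = eps_model(xt1), eps_model(xt2)                    # shared U-Net predictions
    l_simple = np.abs(p1 - e1).mean()                          # DDPM simple loss (L1)
    f1 = x0_from_eps(xt1, p1, abar_t)
    f2 = x0_from_eps(xt2, p2, abar_t)
    l_reg = np.abs(f1 * mask - f2 * mask).mean()               # agree inside the face mask
    return l_simple, l_reg
```

A perfect noise predictor drives both losses to zero, since the two estimated x_0's then coincide everywhere, including the face region.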
Method Overview: video editing framework
[Figure: CLIP-based editing — 1. conditional sampling with z_id to predict the intermediate image at each noise level; 2. conditional sampling with a trainable z_id^edit, optimized with ℒ_CLIP at each of the S steps]
• Classifier-based editing
• Train a linear classifier for each attribute of CelebA-HQ in the identity feature z_id space
• CLIP-based editing
• Minimize a CLIP loss between intermediate images, with a drastically reduced number of steps S (≪ T)
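Classifier-based editing then amounts to moving z_id along the normal of the learned linear boundary; the direction `w` and the step size below are purely illustrative:

```python
import numpy as np

def edit_identity(z_id, w, scale):
    """Shift the identity code along the unit normal of a (hypothetical)
    linear attribute classifier w — the standard way of traversing a
    linear decision boundary in a latent space."""
    return z_id + scale * w / np.linalg.norm(w)

rng = np.random.default_rng(1)
w = rng.normal(size=512)       # e.g. a "beard" direction learned on CelebA-HQ
z_id = rng.normal(size=512)
z_edit = edit_identity(z_id, w, scale=3.0)
```

Decoding every frame with z_edit in place of z_id applies the attribute change to the whole video at once.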
Experiment: Reconstruction
• Our diffusion video autoencoder with T = 100 shows the best reconstruction ability, and with only T = 20 it still outperforms e4e
[Table: reconstruction quality compared against Latent Transformer and STIT]
Experiment: Temporal Consistency
[Figure: consecutive frames from the original video and from the LT, STIT, VideoEditGAN, and our edits]
• Only our diffusion video autoencoder produces temporally consistent results
Experiment: Temporal Consistency
• We greatly improve global consistency (TG-ID)
• Values close to 1 indicate that the edited video is as consistent as the original
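TG-ID (introduced with STIT) relates the identity consistency of the edited video to that of the original. The sketch below is a simplified reading of the metric, assuming ArcFace-style per-frame identity embeddings and mean pairwise cosine similarity over all frame pairs:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def global_id_consistency(embs):
    """Mean cosine similarity over all pairs of per-frame identity embeddings."""
    sims = [cos(embs[i], embs[j])
            for i in range(len(embs)) for j in range(i + 1, len(embs))]
    return sum(sims) / len(sims)

def tg_id(edited_embs, original_embs):
    # ratio close to 1: the edit is as globally identity-consistent as the source
    return global_id_consistency(edited_embs) / global_id_consistency(original_embs)
```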
Experiment: Editing Wild Face Videos
• Owing to the reconstruction ability of diffusion models, it becomes possible to edit wild videos that are difficult to invert with GAN-based methods
[Figure: "young", "gender", and "beard" edits on wild videos, comparing the original, Latent Transformer, STIT, and ours]
Experiment: Decomposed Features Analysis
[Figure: input vs. generations with a random x_T, identity switch, motion switch, and background switch]
• Generated images with switched identity, motion, and background features confirm that the features are properly decomposed
Experiment: Ablation Study
[Figure: input, reconstruction, and sampling with random x_T, with and without ℒ_reg]
• Without the regularization loss, the identity changes significantly with the random noise
• We can conclude that the regularization loss helps the model decompose features effectively
Conclusions
• Our contribution is four-fold:
• We devise diffusion video autoencoders that decompose a video into a single time-invariant feature and per-frame time-variant features for temporally consistent editing
• Based on the decomposed representation of our model, face video editing can be conducted by editing only the single time-invariant identity feature and decoding it together with the remaining original features
• Owing to the nearly-perfect reconstruction ability of diffusion models, our framework can handle exceptional cases, such as a face partially occluded by objects, as well as usual cases
• In addition to the existing predefined-attribute editing method, we propose a text-based identity editing method based on the local directional CLIP loss on the intermediate products of diffusion video autoencoders
Thank you !
Any Questions ?