Learning a multi-center convolutional network for unconstrained face alignment
1. Learning a Multi-Center
Convolutional Network for
Unconstrained Face Alignment
Zhiwen Shao, Hengliang Zhu, Yangyang Hao,
Min Wang, and Lizhuang Ma
Shanghai Jiao Tong University
5. Motivation
• Face alignment is a nonlinear regression problem, which transforms appearance to shape
• Methods based on low-level handcrafted features have a limited capacity to represent highly complex faces
• A deep convolutional network has outstanding representation ability, so we use it to model this highly nonlinear function
6. Previous Deep Learning Methods
Multiple-network-based:
• Cascaded CNN [1], Zhou et al. [2], CFAN [3], and CDAN [4] employ cascaded deep networks to refine predicted shapes
• Time-consuming training processes
• High model complexity
[1] Y. Sun, X. Wang, and X. Tang, "Deep convolutional network cascade for facial point detection," in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2013, pp. 3476–3483.
[2] E. Zhou, H. Fan, Z. Cao, Y. Jiang, and Q. Yin, "Extensive facial landmark localization with coarse-to-fine convolutional network cascade," in IEEE International Conference on Computer Vision Workshops. IEEE, 2013, pp. 386–391.
[3] J. Zhang, S. Shan, M. Kan, and X. Chen, "Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment," in European Conference on Computer Vision. Springer, 2014, pp. 1–16.
[4] R. Weng, J. Lu, Y.-P. Tan, and J. Zhou, "Learning cascaded deep auto-encoder networks for face alignment," IEEE Transactions on Multimedia, vol. 18, no. 10, pp. 2066–2078, 2016.
7. Previous Deep Learning Methods
Single-network-based:
• TCDCN [5] needs extra labels of facial attributes for training samples, which limits the universality of this method
• In contrast, our method uses one single network without auxiliary information
[5] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, "Learning deep representation for face alignment with auxiliary attributes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 5, pp. 918–930, 2016.
9. Structural Correlations
Unconstrained faces with partial occlusion and large pose
(Figure: left, the chin is occluded; right, the right contour is invisible)
Landmarks in the same local region have similar properties, including occlusion and visibility
10. Face Partition
Partition of facial landmarks for different labeling patterns: 29 landmarks and 68 landmarks
Seven clusters: left eye, right eye, nose, mouth, left contour, chin, and right contour
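For concreteness, here is a minimal Python sketch of such a partition, assuming the standard iBUG 68-point annotation order; the exact index ranges, especially the split of the jaw line into left contour, chin, and right contour and the grouping of brows with their eyes, are illustrative assumptions, not taken from the paper.

```python
# Hypothetical partition of the standard 68-point annotation into the seven
# clusters named above. The jaw-line split and brow grouping are assumptions.
CLUSTERS_68 = {
    "left_contour":  list(range(0, 6)),                          # jaw, left side
    "chin":          list(range(6, 11)),                         # jaw, around the chin
    "right_contour": list(range(11, 17)),                        # jaw, right side
    "left_eye":      list(range(17, 22)) + list(range(36, 42)),  # brow + eye
    "right_eye":     list(range(22, 27)) + list(range(42, 48)),  # brow + eye
    "nose":          list(range(27, 36)),
    "mouth":         list(range(48, 68)),
}

# The clusters form a partition: every landmark belongs to exactly one cluster.
assert sorted(i for c in CLUSTERS_68.values() for i in c) == list(range(68))
```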
12. Network Architecture
Shared layers
• Eight convolutional layers and one fully-connected layer
• Each max-pooling layer follows a stack of two convolutional layers
13. Network Architecture
Multiple center-specific shape prediction layers
• Each cluster of facial landmarks is treated as a separate center
• Each layer estimates the x and y coordinates of all n facial landmarks, while focusing on the shape estimation of a specific face region
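The overall architecture can be sketched as below in PyTorch. This is a minimal illustration of "shared layers plus m center-specific heads"; the channel counts, the assumed 50×50 grayscale input, and other hyper-parameters are our guesses, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MCNet(nn.Module):
    """Sketch: shared layers (eight convolutional layers with a max-pooling
    layer after each stack of two, plus one fully-connected layer) followed
    by m center-specific shape prediction layers."""

    def __init__(self, n_landmarks=68, n_centers=7, feat_dim=1024):
        super().__init__()
        chans = [1, 32, 32, 64, 64, 128, 128, 256, 256]  # assumed channel counts
        layers = []
        for i in range(8):                               # eight convolutional layers
            layers += [nn.Conv2d(chans[i], chans[i + 1], 3, padding=1),
                       nn.ReLU(inplace=True)]
            if i % 2 == 1:                               # pool after every two convs
                layers.append(nn.MaxPool2d(2))
        self.shared_conv = nn.Sequential(*layers)
        self.shared_fc = nn.Linear(256 * 3 * 3, feat_dim)  # 50 -> 3 after 4 pools
        # m center-specific prediction layers; each estimates all 2n coordinates
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, 2 * n_landmarks) for _ in range(n_centers)])

    def forward(self, img):
        x = self.shared_conv(img).flatten(1)
        x = torch.relu(self.shared_fc(x))           # high-level representation x
        return [head(x) for head in self.heads]     # one predicted shape per center
```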
14. Loss Function
Weighted inter-ocular distance normalized Euclidean loss:

E = \sum_{j=1}^{n} w_j \left[ (f_{2j-1} - \hat{f}_{2j-1})^2 + (f_{2j} - \hat{f}_{2j})^2 \right] / (2d^2)

where w_j is the weight of the j-th landmark, f denotes the ground truth coordinates, \hat{f} denotes the predicted coordinates, and d is the ground truth inter-ocular distance.
For the first center-specific layer, larger weights are assigned to landmarks around the left eye.
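As a sanity check of the notation, here is a minimal numpy sketch of this loss; the function and argument names are ours, and coordinates are assumed interleaved as x1, y1, x2, y2, ...

```python
import numpy as np

def weighted_nme_loss(f_gt, f_pred, w, d):
    """Weighted inter-ocular distance normalized Euclidean loss from the
    slide above. f_gt, f_pred: shape (2n,) with interleaved (x, y)
    coordinates; w: per-landmark weights, shape (n,); d: ground truth
    inter-ocular distance."""
    diff = (f_gt - f_pred).reshape(-1, 2)   # per-landmark (dx, dy)
    sq = (diff ** 2).sum(axis=1)            # squared x-error plus squared y-error
    return float((w * sq).sum() / (2.0 * d ** 2))
```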
21. Weight Computation
Different fine-tuning steps have different center-specific and minor facial landmarks. During the i-th fine-tuning step:
Multiple relationship:

w_{P_i^c} = \eta \, w_{P_i^m}

where P_i^c is the set of center-specific landmarks, P_i^m is the set of remaining minor landmarks, and \eta is the amplification factor.
Consistency with the basic model:

w_{P_i^c} |P_i^c| + w_{P_i^m} (n - |P_i^c|) = n

where |\cdot| denotes the number of elements in a set.
22. Weight Computation
Solving these two equations gives, for the i-th fine-tuning step:

w_{P_i^c} = \eta n / [(\eta - 1) |P_i^c| + n]
w_{P_i^m} = n / [(\eta - 1) |P_i^c| + n]

• While emphasizing one center, other centers keep relatively small weights rather than zero, which utilizes implicit structural correlations among different parts
• Landmarks from the same cluster have similar properties, so they share an identical weight
• This lets the algorithm search the solution smoothly
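These closed-form weights are easy to verify numerically. Below is a small sketch with an arbitrary worked example; the cluster size and amplification factor are illustrative values, not taken from the paper.

```python
def cluster_weights(n, size_c, eta):
    """Solve the two equations above for the i-th fine-tuning step:
    w_c = eta * w_m (multiple relationship) and
    w_c * |P_c| + w_m * (n - |P_c|) = n (consistency with the basic model)."""
    w_c = eta * n / ((eta - 1) * size_c + n)
    w_m = n / ((eta - 1) * size_c + n)
    assert abs(w_c * size_c + w_m * (n - size_c) - n) < 1e-9  # weights sum to n
    return w_c, w_m

# Example: 68 landmarks, a 20-landmark mouth cluster, amplification factor 4:
# w_c = 4*68/128 = 2.125 for mouth landmarks, w_m = 68/128 = 0.53125 elsewhere.
print(cluster_weights(n=68, size_c=20, eta=4))   # -> (2.125, 0.53125)
```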
23. Combined Model
High-level representation:

\mathbf{x} = (x_0, x_1, \ldots, x_D)^T \in \mathbb{R}^{(D+1) \times 1}, \quad D = 1024

Weight matrix:

\mathbf{W} = (\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_{2n}) \in \mathbb{R}^{(D+1) \times 2n}, \quad \mathbf{w}_k = (w_{0k}, w_{1k}, \ldots, w_{Dk})^T, \; k = 1, \ldots, 2n

Predicted coordinates:

\hat{f}_{2j-1} = \mathbf{w}_{2j-1}^T \mathbf{x}, \quad \hat{f}_{2j} = \mathbf{w}_{2j}^T \mathbf{x}

Let \mathbf{W}^i denote the weight matrix of the i-th center-specific layer. The combined weights are

\mathbf{w}_{2j-1}^{combined} = \mathbf{w}_{2j-1}^i, \quad \mathbf{w}_{2j}^{combined} = \mathbf{w}_{2j}^i, \quad i = 1, \ldots, m, \; j \in P_i^c
24. Combined Model
Combined model: S = \Theta \cup \mathbf{W}^{combined}
• Its complexity is the same as that of the basic model
• It improves the localization performance by exploiting the advantage of each center-specific solution
Our multi-center learning algorithm takes full advantage of each stage and searches the optimal solution smoothly.
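To make this combination step concrete, here is a minimal numpy sketch under the notation above; the function and argument names are ours.

```python
import numpy as np

def combine_heads(W_list, clusters, D=1024, n=68):
    """Assemble the integrated shape prediction layer W_combined: for each
    center i and each landmark j in P_i^c, copy the two columns holding that
    landmark's x and y weights (2j-1 and 2j in the slide's 1-based indexing)
    from the i-th center-specific weight matrix W^i.
    W_list: m matrices of shape (D+1, 2n); clusters: m index lists (0-based)
    that partition range(n)."""
    W_combined = np.zeros((D + 1, 2 * n))
    for W_i, P_c in zip(W_list, clusters):
        for j in P_c:
            W_combined[:, 2 * j:2 * j + 2] = W_i[:, 2 * j:2 * j + 2]
    return W_combined
```

Because W_combined has exactly the shape of a single shape prediction layer, inference with the combined model costs no more than the basic model, which matches the complexity claim above.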
26. Datasets
COFW
• Occluded faces in the wild
• 1345 training images, 507 testing images
IBUG
• Large appearance variations
• 3148 training images, 135 testing images
27. Evaluation Metric
• Inter-ocular distance normalized mean error
• Cumulative errors distribution (CED) curves
• Failure rate (failure: mean error larger than 10%)
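All three metrics are simple to compute from per-image errors; a minimal numpy sketch follows, with the 10% failure threshold taken from the slide and all names ours.

```python
import numpy as np

def mean_error(pred, gt, d):
    """Inter-ocular distance normalized mean error for one face.
    pred, gt: (n, 2) landmark arrays; d: ground truth inter-ocular distance."""
    return np.linalg.norm(pred - gt, axis=1).mean() / d

def failure_rate(errors, threshold=0.10):
    """Fraction of test faces whose mean error exceeds 10%."""
    return float((np.asarray(errors) > threshold).mean())

def ced_curve(errors, grid=np.linspace(0.0, 0.10, 101)):
    """Cumulative errors distribution: for each error level e in the grid,
    the fraction of test images with mean error <= e."""
    errors = np.asarray(errors)
    return grid, np.array([(errors <= e).mean() for e in grid])
```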
28. Validation of Multi-Center Learning Algorithm
Mean Error (%) and Failure Rate (%):

Method   | COFW Mean | COFW Failure | IBUG Mean | IBUG Failure
Basic    | 6.26      | 3.16         | 9.23      | 33.33
Combined | 6.08      | 2.96         | 8.87      | 25.93

• The combined model improves the accuracy and robustness
• The basic model already performs well, demonstrating the effectiveness of our network
• Multi-center learning reinforces the learning for each local face region
34. Comparison with Other Methods

Deep model   | Speed (FPS) | CPU
Cascaded CNN | 5           | single core, i5-6200U 2.3 GHz
CFAN*        | 43          | i7-3770 3.4 GHz
CDAN*        | 50          | i5 3.2 GHz
TCDCN        | 50          | single core, i5-6200U 2.3 GHz
CFT          | 31          | single core, i5-6200U 2.3 GHz
MCNet (ours) | 67          | single core, i5-6200U 2.3 GHz

Time of face detection is excluded. (*: published speed results)
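For reference, a single-core timing measurement of this kind could look like the following sketch, reusing the hypothetical MCNet module from the architecture slides; the resulting numbers depend entirely on the machine and are not the table's values.

```python
import time
import torch

torch.set_num_threads(1)                 # single CPU core, as in the table
model = MCNet().eval()                   # hypothetical sketch network from earlier
imgs = torch.randn(1000, 1, 50, 50)      # stand-ins for 1000 cropped face images
with torch.no_grad():
    start = time.perf_counter()
    for img in imgs:
        model(img.unsqueeze(0))          # one face at a time, as in deployment
    elapsed = time.perf_counter() - start
print(f"{len(imgs) / elapsed:.1f} FPS, {elapsed / len(imgs) * 1000:.2f} ms per face")
```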
35. Conclusions
• We propose a novel multi-center convolutional network, which exploits the representation power of each center
• We propose the reinforcement of each center to improve the shape estimation precision of each facial part
• Comprehensive experiments demonstrate that our method achieves real-time performance that is competitive with other state-of-the-art techniques
Good morning, everyone. I am Zhiwen Shao, from Shanghai Jiao Tong University. In our paper, we propose a Multi-Center Convolutional Network for face alignment.
First, I will show the background of face alignment.
These images illustrate the results of face alignment.
We can observe that these face images are very challenging: they have severe occlusions and large variations in pose, expression, and illumination.
Our goal is to develop an efficient method to handle unconstrained faces
Face alignment can be regarded as a nonlinear regression problem, which transforms appearance to shape
Most conventional methods are based on low-level handcrafted features, so they have a limited capacity to represent complex faces
As we all know, a deep convolutional network has outstanding representation ability. Therefore, we use one to model this highly nonlinear function.
There are two types of deep learning methods.
The first type is based on multiple networks.
These methods employ cascaded deep networks to refine predicted shapes successively.
Their training processes are complicated and time-consuming, and they have high computational cost and model complexity due to the use of multiple networks.
The second type is based on a single network. A typical method is TCDCN.
It trains only one deep network, but it needs extra labels of facial attributes for training samples.
This limits the universality of this method.
In contrast, our method uses one single network without auxiliary information
Next, I will introduce our method in detail.
Partial occlusion and large pose are the main characteristics of unconstrained faces.
We discover that each facial landmark is not isolated but highly correlated with adjacent landmarks.
There are two examples.
In the left figure, facial landmarks along the chin are all occluded. And the right figure shows that landmarks on the right side of the face are almost invisible.
Therefore, landmarks in the same local face region have similar properties including occlusion and visibility.
We analyze the structure of a face, and partition it into seven clusters: left eye, right eye, nose, mouth, left contour, chin, and right contour.
As shown in these two figures, different labeling patterns of 29 and 68 facial landmarks are partitioned into 5 and 7 clusters respectively. Each cluster contains structurally relevant facial landmarks.
This is the structure of our multi-center convolutional network.
Our network consists of shared layers and multiple center-specific shape prediction layers.
The shared layers contain eight convolutional layers and one fully-connected layer.
Each max-pooling layer follows a stack of two convolutional layers
The stack of convolutional layers, which was proposed in VGGNet, is excellent at feature learning.
According to the evaluation metric, we use weighted inter-ocular distance normalized Euclidean loss
We first pre-train a basic model with shared layers and one shape prediction layer.
Corresponding to Step 1
We then fine-tune each center-specific layer separately.
Corresponding to Step 2 to Step 6
Based on the pre-trained model, our network keeps shared layers and initializes each center-specific layer with the shape prediction parameters. There are m branches of center-specific layers at the end of our network. The fine-tuning of center-specific layers is mutually independent.
The shared layers and the integrated shape prediction layer constitute the combined model.
Corresponding to Step 7
We obtain the integrated shape prediction layer by combining corresponding parameters from each center-specific layer.
We assume there is a multiple relationship between the two weights.
To be consistent with the basic model, we keep the weights conforming to this formula.
The sum of the weights is constrained to equal n.
By solving the two equations, we obtain the respective weights.
When emphasizing the detection of the current center, we still assign other centers relatively small weights rather than zero.
This is beneficial for utilizing implicit structural correlations among different facial parts and searching the solution smoothly
Next, I will show the experiments.
The inter-ocular distance is defined as the Euclidean distance between the two pupil centers.
We show the mean error of each cluster for the basic model and the combined model on the COFW dataset.
It can be observed that the combined model improves the detection performance of each cluster
We report the results of our method MCNet and previous works.
We can see that our method outperforms most state-of-the-art methods
It is worth noting that TCDCN obtains better performance than our method on IBUG partly owing to their larger training data.
Although occlusions are not detected explicitly, we achieve outstanding performance on par with Wu et al. on the COFW benchmark.
We plot the CED curves for our method and several state-of-the-art methods.
It can be observed that our method achieves competitive performance on both benchmarks.
Our method performs better at higher levels of normalized mean error; therefore, it is strongly robust to unconstrained environments.
Here are several example images from COFW.
We can see that our method achieves higher accuracy than RCPR and CFT in the details.
Benefiting from utilizing structural correlations among different facial parts, our method is robust to severe occlusions.
We also show example images from IBUG where our method MCNet outperforms LBF and CFSS
Our method also achieves higher accuracy in the details. Therefore our method demonstrates superior capability of handling severe occlusions and complex variations of pose, expression, illumination.
To obtain a more comprehensive comparison, we present the average running speed of different deep learning methods for face alignment
We evaluate these methods on a single core i5-6200U 2.3GHz CPU with 1000 face images. Since CFAN and CDAN do not share their code, we use their published speed results.
Both TCDCN and our method MCNet are based on only one network, so they run relatively fast. Cascaded CNN, CFAN, and CDAN employ multiple networks, so they require more running time.
Our method takes only 15 ms on average to process one face, benefiting from the low model complexity and computational cost of our network. We believe that our method can be extended to real-time facial landmark tracking in unconstrained scenarios.