SwinDPSR: Dual-Path Face Super-Resolution Network Integrating Swin Transformer
Abstract
1. Introduction
- We propose a dual-path face super-resolution network fused with Swin Transformer, called SwinDPSR, which performs face super-resolution reconstruction by fusing local detail features and global face features. The global representation path uses the Transformer’s self-attention mechanism to recover global facial information; its output is then fused with a local representation path composed of facial attention units, improving the representation ability and SR performance of the network (a minimal sketch of this dual-path fusion appears after this list).
- We jointly train the network with pixel loss, style loss, and SSIM loss to promote network convergence at the pixel, perception, and image-structure levels, respectively (see the loss sketch below).
- In addition to the traditional SR evaluation metrics PSNR and SSIM, we adopt learned perceptual image patch similarity (LPIPS), mean perceptual score (MPS), and identity similarity as performance indicators. LPIPS uses AlexNet features and measures the L2 distance between SR and HR feature vectors; identity similarity, computed with FaceNet, is the cosine similarity between SR and HR face embeddings (a metric sketch follows below).
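The following is a minimal PyTorch sketch of the dual-path fusion idea, not the paper's exact architecture: the module names, layer sizes, and the plain multi-head self-attention block (standing in for the Swin Transformer blocks of the global path) are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class DualPathFusionSR(nn.Module):
    """Illustrative dual-path SR block: local conv path + global attention path."""

    def __init__(self, channels: int = 64, upscale: int = 8):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        # Local representation path: convolutional units standing in for the
        # facial attention units that capture local detail features.
        self.local_path = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Global representation path: self-attention over all spatial tokens,
        # a simplified stand-in for the Swin Transformer blocks.
        self.global_attn = nn.MultiheadAttention(channels, num_heads=4,
                                                 batch_first=True)
        # Fuse the two paths by channel concatenation + 1x1 convolution.
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        # Sub-pixel upsampling (PixelShuffle) avoids checkerboard artifacts.
        self.upsample = nn.Sequential(
            nn.Conv2d(channels, 3 * upscale ** 2, 3, padding=1),
            nn.PixelShuffle(upscale),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.head(x)
        local = self.local_path(feat)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)           # (B, H*W, C)
        glob, _ = self.global_attn(tokens, tokens, tokens)
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        fused = self.fuse(torch.cat([local, glob], dim=1))
        return self.upsample(fused)


lr_face = torch.randn(1, 3, 16, 16)    # 16x16 LR face, x8 magnification
sr_face = DualPathFusionSR()(lr_face)
print(sr_face.shape)                   # torch.Size([1, 3, 128, 128])
```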
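A hedged sketch of the joint objective: pixel-level L1, a Gram-matrix style loss on VGG features (after Gatys et al.), and an SSIM term. The third-party pytorch-msssim package, the choice of VGG layers, and the loss weights are assumptions, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights
from pytorch_msssim import ssim  # pip install pytorch-msssim

# Frozen VGG19 feature extractor for the style term (input normalization
# omitted for brevity).
vgg_features = vgg19(weights=VGG19_Weights.DEFAULT).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)


def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)


def joint_loss(sr, hr, w_style=1e-2, w_ssim=1e-1):
    l_pix = F.l1_loss(sr, hr)                              # pixel level
    l_style = F.mse_loss(gram_matrix(vgg_features(sr)),
                         gram_matrix(vgg_features(hr)))    # perception level
    l_ssim = 1.0 - ssim(sr, hr, data_range=1.0)            # structure level
    return l_pix + w_style * l_style + w_ssim * l_ssim
```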
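And a sketch of the two learned metrics, assuming the third-party `lpips` and `facenet-pytorch` packages; the paper's exact evaluation pipeline (crops, normalization) may differ.

```python
import torch
import torch.nn.functional as F
import lpips                                    # pip install lpips
from facenet_pytorch import InceptionResnetV1   # pip install facenet-pytorch

lpips_alex = lpips.LPIPS(net='alex')            # AlexNet-based LPIPS
facenet = InceptionResnetV1(pretrained='vggface2').eval()


def evaluate_pair(sr: torch.Tensor, hr: torch.Tensor):
    """sr, hr: (B, 3, H, W) tensors in [0, 1]."""
    # LPIPS expects inputs scaled to [-1, 1].
    d_lpips = lpips_alex(sr * 2 - 1, hr * 2 - 1).mean().item()
    # Identity similarity: cosine similarity of FaceNet embeddings
    # (the model was trained on 160x160 face crops).
    emb_sr, emb_hr = facenet(sr * 2 - 1), facenet(hr * 2 - 1)
    id_sim = F.cosine_similarity(emb_sr, emb_hr).mean().item()
    return d_lpips, id_sim
```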
2. Related Works
2.1. Attention Networks
2.2. Vision Transformer
3. Proposed Method
3.1. Overall Architecture
Algorithm 1 Training of SwinDPSR
Require: Set the batch size to 16, the amplification factor to 8, the number of epochs to 20, and the network initialization to Xavier; use a linearly decaying learning rate and the Adam optimizer with β₁ = 0.9 and β₂ = 0.99.
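The Require clause maps directly onto a few lines of PyTorch. A minimal sketch follows; the stand-in model and the 2e-4 learning rate are placeholders (the initial learning-rate value is not recoverable from this text), everything else is as stated in Algorithm 1.

```python
import torch
from torch import nn, optim

model = nn.Sequential(                     # stand-in for the SwinDPSR network
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 3, 3, padding=1))


def xavier_init(m: nn.Module) -> None:
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_normal_(m.weight)   # Xavier initialization


model.apply(xavier_init)

epochs, batch_size, scale = 20, 16, 8      # per Algorithm 1
optimizer = optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.99))
# Linear learning-rate decay over the 20 epochs.
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: 1.0 - e / epochs)
```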
3.2. Details of SwinDPSR
3.2.1. Local Representation Path
3.2.2. Global Representation Path
3.2.3. ECA Module
3.3. Training and Loss Function
4. Experiments
4.1. Datasets
4.2. Implementation Details
4.3. Evaluation Metrics
4.4. Ablation Experiments and Discussion
4.5. Comparison with State-of-the-Art Methods
4.5.1. Quantitative and Qualitative Comparison
4.5.2. Face Reconstruction and Recognition on Real-World Surveillance Scenarios
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Zhou, E.; Fan, H.; Cao, Z.; Jiang, Y.; Yin, Q. Learning face hallucination in the wild. In Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015.
- Liu, H.; Han, Z.; Guo, J.; Ding, X. A noise robust face hallucination framework via cascaded model of deep convolutional networks and manifold learning. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; pp. 1–6.
- Chen, C.; Gong, D.; Wang, H.; Li, Z.; Wong, K.-Y.K. Learning spatial attention for face super-resolution. IEEE Trans. Image Process. 2020, 30, 1219–1231.
- Liu, S.; Xiong, C.; Shi, X.; Gao, Z. Progressive face super-resolution with cascaded recurrent convolutional network. Neurocomputing 2021, 449, 357–367.
- Shi, J.; Wang, Y.; Yu, Z.; Li, G.; Hong, X.; Wang, F.; Gong, Y. Exploiting multi-scale parallel self-attention and local variation via dual-branch transformer-CNN structure for face super-resolution. IEEE Trans. Multimed. 2023, 26, 2608–2620.
- Hou, H.; Xu, J.; Hou, Y.; Hu, X.; Wei, B.; Shen, D. Semi-cycled generative adversarial networks for real-world face super-resolution. IEEE Trans. Image Process. 2023, 32, 1184–1199.
- Chen, Y.; Tai, Y.; Liu, X.; Shen, C.; Yang, J. FSRNet: End-to-end learning face super-resolution with facial priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2492–2501.
- Zhang, Y.; Wu, Y.; Chen, L. MSFSR: A multi-stage face super-resolution with accurate facial representation via enhanced facial boundaries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 504–505.
- Yin, Y.; Robinson, J.; Zhang, Y.; Fu, Y. Joint super-resolution and alignment of tiny faces. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 7, pp. 12693–12700.
- Yu, X.; Fernando, B.; Ghanem, B.; Porikli, F.; Hartley, R. Face super-resolution guided by facial component heatmaps. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 217–233.
- Kim, J.; Li, G.; Yun, I.; Jung, C.; Kim, J. Edge and identity preserving network for face super-resolution. Neurocomputing 2021, 446, 11–22.
- Lu, Y.; Tai, Y.-W.; Tang, C.-K. Attribute-guided face generation using conditional CycleGAN. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 282–297.
- Lee, C.-H.; Zhang, K.; Lee, H.-C.; Cheng, C.-W.; Hsu, W. Attribute augmented convolutional neural network for face hallucination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 721–729.
- Li, M.; Sun, Y.; Zhang, Z.; Xie, H.; Yu, J. Deep learning face hallucination via attributes transfer and enhancement. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 604–609.
- Li, M.; Zhang, Z.; Yu, J.; Chen, C.W. Learning face image super-resolution through facial semantic attribute transformation and self-attentive structure enhancement. IEEE Trans. Multimed. 2020, 23, 468–483.
- Jiang, K.; Wang, Z.; Yi, P.; Lu, T.; Jiang, J.; Xiong, Z. Dual-path deep fusion network for face image hallucination. IEEE Trans. Neural Netw. Learn. Syst. 2020, 33, 378–391.
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3146–3154.
- Tao, A.; Sapra, K.; Catanzaro, B. Hierarchical multi-scale attention for semantic segmentation. arXiv 2020, arXiv:2005.10821.
- Zhong, Z.; Lin, Z.Q.; Bidart, R.; Hu, X.; Daya, I.B.; Li, Z.; Zheng, W.S.; Li, J.; Wong, A. Squeeze-and-attention networks for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 13065–13074.
- Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301.
- Qin, X.; Wang, Z.; Bai, Y.; Xie, X.; Jia, H. FFA-Net: Feature fusion attention network for single image dehazing. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11908–11915.
- Tian, C.W.; Xu, Y.; Li, Z.Y.; Zuo, W.M.; Fei, L.K.; Liu, H. Attention-guided CNN for image denoising. Neural Netw. 2020, 124, 117–129.
- Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
- Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Gool, L.V.; Timofte, R. SwinIR: Image restoration using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1833–1844.
- Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 205–218.
- Song, Y.; He, Z.; Qian, H.; Du, X. Vision transformers for single image dehazing. IEEE Trans. Image Process. 2023, 32, 1927–1941.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229.
- Yang, F.; Yang, H.; Fu, J.; Lu, H.; Guo, B. Learning texture transformer network for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5791–5800.
- Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part VIII, pp. 483–499.
- Bulat, A.; Tzimiropoulos, G. How far are we from solving the 2D & 3D face alignment problem? (And a dataset of 230,000 3D facial landmarks). In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1021–1030.
- Odena, A.; Dumoulin, V.; Olah, C. Deconvolution and checkerboard artifacts. Distill 2016, 1, e3.
- Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-attention with relative position representations. arXiv 2018, arXiv:1803.02155.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542.
- Li, X.; Li, W.; Ren, D.; Zhang, H.; Wang, M.; Zuo, W. Enhanced blind face restoration with multi-exemplar images and adaptive spatial feature fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2706–2715.
- Parkhi, O.; Vedaldi, A.; Zisserman, A. Deep face recognition. In Proceedings of the British Machine Vision Conference (BMVC 2015), Swansea, UK, 7–10 September 2015.
- Gatys, L.A.; Ecker, A.S.; Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2414–2423.
- Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3730–3738.
- Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503.
- Le, V.; Brandt, J.; Lin, Z.; Bourdev, L.; Huang, T.S. Interactive facial feature localization. In Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Part III, pp. 679–692.
- Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4401–4410.
- Grgic, M.; Delac, K.; Grgic, S. SCface–surveillance cameras face database. Multimed. Tools Appl. 2011, 51, 863–879.
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019.
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595.
- El Helou, M.; Zhou, R.; Süsstrunk, S.; Timofte, R.; Afifi, M.; Brown, M.S.; Xu, K.; Cai, H.; Liu, Y.; Wang, L.W.; et al. AIM 2020: Scene relighting and illumination estimation challenge. In Proceedings of the Computer Vision–ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; Part III, pp. 499–518.
- Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
- Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Aitken, A.P.; Tejani, A.; Totz, J.; Wang, Z.; Shi, W. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690.
| Hyperparameter | Value |
|---|---|
| Batch Size | 16 |
| Amplification Factor | 8 |
| Epochs | 20 |
| Network Initialization | Xavier |
| Learning Rate | |
| Learning Rate Decay Strategy | Linear Decay |
| β₁ (Adam Optimizer) | 0.9 |
| β₂ (Adam Optimizer) | 0.99 |
| GPU | Tesla V100 |
| Environment | PyTorch |
| Model | PSNR | SSIM | LPIPS | MPS |
|---|---|---|---|---|
| SwinDPSR w/o global | 28.3417 | 0.8345 | 0.2020 | 0.8162 |
| SwinDPSR w/o local | 26.0894 | 0.7636 | 0.3018 | 0.7308 |
| SwinDPSR | 28.5689 | 0.8395 | 0.1855 | 0.8270 |
| Loss | PSNR | SSIM | LPIPS | MPS |
|---|---|---|---|---|
| Lpix | 28.5689 | 0.8395 | 0.1855 | 0.8270 |
| Lpix + Lstyle | 28.5984 | 0.8396 | 0.1817 | 0.8289 |
| Lpix + Lstyle + Lssim | 28.6326 | 0.8415 | 0.1828 | 0.8293 |
| Attention Module | PSNR | SSIM | LPIPS | MPS |
|---|---|---|---|---|
| Baseline | 28.6326 | 0.8415 | 0.1828 | 0.8293 |
| SE Module | 28.6517 | 0.8429 | 0.1819 | 0.8304 |
| ECA Module | 28.7688 | 0.8449 | 0.1799 | 0.8325 |
| Method | Helen PSNR | Helen SSIM | Helen LPIPS | Helen MPS | FFHQ PSNR | FFHQ SSIM | FFHQ LPIPS | FFHQ MPS |
|---|---|---|---|---|---|---|---|---|
| Bicubic | 24.5312 | 0.6981 | 0.5030 | 0.5975 | 24.2786 | 0.6609 | 0.5378 | 0.5615 |
| SRGAN | 25.2783 | 0.7171 | 0.1964 | 0.7603 | 24.6129 | 0.6735 | 0.2052 | 0.7341 |
| FSRNet | 26.9341 | 0.7950 | 0.2212 | 0.7869 | 26.4785 | 0.7673 | 0.2272 | 0.7700 |
| FSRGAN | 25.8452 | 0.7556 | 0.1379 | 0.8088 | 25.1910 | 0.7191 | 0.1380 | 0.7905 |
| AACNN | 26.7893 | 0.7867 | 0.2369 | 0.7748 | 26.2496 | 0.7511 | 0.4811 | 0.6349 |
| SPARNet | 28.2816 | 0.8328 | 0.2037 | 0.8145 | 26.8418 | 0.7894 | 0.2245 | 0.7824 |
| EIPNet | 26.8985 | 0.7912 | 0.1913 | 0.7999 | 26.7129 | 0.7717 | 0.2192 | 0.7762 |
| SwinDPSR | 28.7688 | 0.8449 | 0.1799 | 0.8325 | 27.9004 | 0.8099 | 0.1886 | 0.8106 |
| Method | Average Identity Similarity |
|---|---|
| Bicubic | 0.228293 |
| SRGAN | 0.302244 |
| FSRNet | 0.385995 |
| FSRGAN | 0.359804 |
| AACNN | 0.445988 |
| SPARNet | 0.506417 |
| EIPNet | 0.480793 |
| SwinDPSR | 0.516619 |