DOI: 10.1145/3343031.3350929

Editing Text in the Wild

Published: 15 October 2019

Abstract

In this paper, we are interested in editing text in natural images: replacing or modifying a word in a source image with another one while maintaining a realistic look. This task is challenging because the styles of both the background and the text must be preserved so that the edited image is visually indistinguishable from the source image. Specifically, we propose an end-to-end trainable style retention network (SRNet) that consists of three modules: a text conversion module, a background inpainting module, and a fusion module. The text conversion module changes the text content of the source image into the target text while keeping the original text style. The background inpainting module erases the original text and fills the text region with appropriate texture. The fusion module combines the information from the former two modules and generates the edited text image. To our knowledge, this work is the first attempt to edit text in natural images at the word level. Both visual effects and quantitative results on synthetic and real-world data (ICDAR 2013) fully confirm the importance and necessity of the modular decomposition. We also conduct extensive experiments to validate the usefulness of our method in various real-world applications, such as text image synthesis, augmented reality (AR) translation, and information hiding.
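The abstract describes a three-way modular decomposition. The sketch below is our own illustration of that data flow, not the authors' code: every function body is a hypothetical placeholder (the real modules are learned networks), and only the way the fusion module consumes the outputs of the other two is depicted.

```python
import numpy as np

# Illustrative sketch of SRNet's module decomposition (assumed structure,
# not the published implementation). Each module is stubbed out; the real
# modules are trained convolutional networks.

def text_conversion_module(source_img: np.ndarray, target_text: str) -> np.ndarray:
    """Placeholder: would render `target_text` in the source word's style.
    Here it just produces a foreground canvas of the same shape."""
    return np.zeros_like(source_img)

def background_inpainting_module(source_img: np.ndarray) -> np.ndarray:
    """Placeholder: would erase the original text and fill the region
    with plausible background texture. Here it copies the input."""
    return source_img.copy()

def fusion_module(foreground: np.ndarray, background: np.ndarray) -> np.ndarray:
    """Placeholder: would learn to composite the styled text onto the
    inpainted background. Here it is a naive element-wise blend."""
    return 0.5 * foreground + 0.5 * background

def edit_text(source_img: np.ndarray, target_text: str) -> np.ndarray:
    """End-to-end pipeline: convert text, inpaint background, fuse."""
    fg = text_conversion_module(source_img, target_text)
    bg = background_inpainting_module(source_img)
    return fusion_module(fg, bg)

img = np.random.rand(64, 256, 3)  # H x W x C word image
out = edit_text(img, "HELLO")
assert out.shape == img.shape      # edited image matches the source size
```

The point of the decomposition, as the abstract argues, is that text style and background texture are preserved by separate specialists before a dedicated fusion step composites them.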



Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. gan
  2. text editing
  3. text erasure
  4. text synthesis

Qualifiers

  • Research-article

Conference

MM '19

Acceptance Rates

MM '19 Paper Acceptance Rate: 252 of 936 submissions (27%)
Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)

