Search Results (35)

Search Parameters:
Keywords = ICDAR

19 pages, 3590 KiB  
Article
A Multi-Scale Natural Scene Text Detection Method Based on Attention Feature Extraction and Cascade Feature Fusion
by Nianfeng Li, Zhenyan Wang, Yongyuan Huang, Jia Tian, Xinyuan Li and Zhiguo Xiao
Sensors 2024, 24(12), 3758; https://doi.org/10.3390/s24123758 - 9 Jun 2024
Viewed by 486
Abstract
Scene text detection is an important research field in computer vision, playing a crucial role in various application scenarios. However, existing scene text detection methods often fail to achieve satisfactory results when faced with text instances of different sizes, shapes, and complex backgrounds. To address the challenge of detecting diverse texts in natural scenes, this paper proposes a multi-scale natural scene text detection method based on attention feature extraction and cascaded feature fusion. This method combines global and local attention through an improved attention feature fusion module (DSAF) to capture text features of different scales, enhancing the network’s perception of text regions and improving its feature extraction capabilities. Simultaneously, an improved cascaded feature fusion module (PFFM) is used to fully integrate the extracted feature maps, expanding the receptive field of features and enriching the expressive ability of the feature maps. Finally, a lightweight subspace attention module (SAM) is introduced to partition the concatenated feature maps into several subspace feature maps, facilitating spatial information interaction among features of different scales. Comparative experiments are conducted on the ICDAR2015, Total-Text, and MSRA-TD500 datasets against several existing scene text detection methods. The results show that the proposed method achieves good performance in terms of accuracy, recall, and F-score, verifying its effectiveness and practicality.
(This article belongs to the Special Issue Computer Vision and Virtual Reality: Technologies and Applications)
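The paper's SAM is not specified in detail in this listing. As a rough illustration of the subspace-attention idea it names (splitting a concatenated feature map into channel groups and reweighting each group with its own spatial attention, loosely in the spirit of lightweight subspace-attention designs such as ULSAM), a minimal PyTorch sketch might look as follows; the group count, the per-group 1x1 attention branch, and all sizes are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SubspaceAttention(nn.Module):
    """Split the concatenated feature map into channel groups ("subspaces")
    and reweight each group with its own spatial attention map."""

    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        g = channels // groups
        # One tiny attention branch per subspace (illustrative choice).
        self.attn = nn.ModuleList(
            nn.Sequential(nn.Conv2d(g, 1, kernel_size=1), nn.Sigmoid())
            for _ in range(groups)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = torch.chunk(x, self.groups, dim=1)          # split into subspaces
        out = [c * a(c) for c, a in zip(chunks, self.attn)]  # per-subspace reweighting
        return torch.cat(out, dim=1)                         # re-concatenate

# Example: reweight a fused 256-channel feature map.
feat = torch.randn(1, 256, 160, 160)
print(SubspaceAttention(256, groups=4)(feat).shape)  # torch.Size([1, 256, 160, 160])
```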

30 pages, 3035 KiB  
Review
Advancements and Challenges in Handwritten Text Recognition: A Comprehensive Survey
by Wissam AlKendi, Franck Gechter, Laurent Heyberger and Christophe Guyeux
J. Imaging 2024, 10(1), 18; https://doi.org/10.3390/jimaging10010018 - 8 Jan 2024
Cited by 3 | Viewed by 4527
Abstract
Handwritten Text Recognition (HTR) is essential for digitizing historical documents held in different kinds of archives. In this study, we introduce a hybrid form archive written in French: the Belfort civil registers of births. The digitization of these historical documents is challenging due to their unique characteristics, such as writing style variations, overlapped characters and words, and marginal annotations. The objective of this survey is to summarize research on handwritten text documents and provide research directions toward effectively transcribing this French dataset. To achieve this goal, we present a brief survey of several modern and historical offline HTR systems for different international languages, together with the top state-of-the-art contributions reported for the French language specifically. The survey classifies the HTR systems based on the techniques employed, the datasets used, publication years, and the level of recognition. Furthermore, an analysis of the systems' accuracies is presented, highlighting the best-performing approach. We also showcase the performance of some commercial HTR systems. In addition, this paper summarizes the HTR datasets that are publicly available, especially those identified as benchmark datasets in the International Conference on Document Analysis and Recognition (ICDAR) and the International Conference on Frontiers in Handwriting Recognition (ICFHR) competitions. This paper therefore presents updated state-of-the-art research in HTR and highlights new directions in the research field.
(This article belongs to the Section Computer Vision and Pattern Recognition)

21 pages, 5577 KiB  
Article
AFRE-Net: Adaptive Feature Representation Enhancement for Arbitrary Oriented Object Detection
by Tianwei Zhang, Xu Sun, Lina Zhuang, Xiaoyu Dong, Jianjun Sha, Bing Zhang and Ke Zheng
Remote Sens. 2023, 15(20), 4965; https://doi.org/10.3390/rs15204965 - 14 Oct 2023
Cited by 1 | Viewed by 1221
Abstract
Arbitrary-oriented object detection (AOOD) is a crucial task in aerial image analysis, but it also faces significant challenges. In current AOOD detectors, commonly used multi-scale feature fusion modules fall short in complementing spatial and semantic information between scales. Additionally, fixed feature extraction structures are usually used after a fusion module, so detectors cannot self-adjust. At the same time, feature fusion and extraction modules are designed in isolation and the internal synergy between them is ignored. These problems result in deficient feature representations, which degrade overall detection precision. To solve them, we first create a fine-grained feature pyramid network (FG-FPN) that not only provides richer spatial and semantic features but also complements neighboring-scale features in a self-learning mode. Subsequently, we propose a novel feature enhancement module (FEM) to fit FG-FPN. FEM allows the detection unit to automatically adjust its sensing area and adaptively suppress background interference, thereby generating stronger feature representations. Our proposed solution was tested through extensive experiments on challenging datasets, including DOTA (77.44% mAP), HRSC2016 (97.82% mAP), UCAS-AOD (91.34% mAP), and ICDAR2015 (86.27% F-score), and its effectiveness and high applicability are verified on all of these datasets.
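The abstract's FG-FPN builds fine-grained, self-learned neighbor-scale completion on top of the familiar FPN top-down fusion. The paper's own modules are not reproduced here; the sketch below only shows the standard baseline fusion that such designs extend, with illustrative channel sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Standard FPN-style top-down fusion: lateral 1x1 convs align channel
    widths, upsampled coarse maps are added to finer ones, and a 3x3 conv
    smooths each fused level."""

    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                      # feats ordered fine -> coarse
        lat = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(lat) - 2, -1, -1):      # top-down pathway
            lat[i] = lat[i] + F.interpolate(lat[i + 1], size=lat[i].shape[-2:],
                                            mode="nearest")
        return [s(x) for s, x in zip(self.smooth, lat)]

feats = [torch.randn(1, 256, 64, 64), torch.randn(1, 512, 32, 32), torch.randn(1, 1024, 16, 16)]
for p in TinyFPN()(feats):
    print(p.shape)   # all levels come out with 256 channels
```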

18 pages, 3224 KiB  
Article
Text Recognition Model Based on Multi-Scale Fusion CRNN
by Le Zou, Zhihuang He, Kai Wang, Zhize Wu, Yifan Wang, Guanhong Zhang and Xiaofeng Wang
Sensors 2023, 23(16), 7034; https://doi.org/10.3390/s23167034 - 8 Aug 2023
Cited by 2 | Viewed by 1779
Abstract
Scene text recognition is a crucial area of research in computer vision. However, current mainstream scene text recognition models suffer from incomplete feature extraction because of the small downsampling scale used to retain more features. This limitation hampers their ability to extract the complete features of each character in the image, resulting in lower accuracy in the text recognition process. To address this issue, a novel text recognition model based on multi-scale fusion and the convolutional recurrent neural network (CRNN) is proposed in this paper. The proposed model has a convolutional layer, a feature fusion layer, a recurrent layer, and a transcription layer. The convolutional layer uses two scales of feature extraction, which enables it to derive two distinct outputs for the input text image. The feature fusion layer fuses the features from the different scales into a new feature. The recurrent layer learns contextual features from the input feature sequence. The transcription layer outputs the final result. The proposed model not only expands the recognition field but also learns more image features at different scales; thus, it extracts a more complete set of features and achieves better recognition of text. Experimental results demonstrate that the proposed model outperforms the CRNN model in text recognition accuracy on scene text datasets such as Street View Text, IIIT-5K, ICDAR2003, and ICDAR2013.
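The exact layer configuration is not given in this listing. A minimal sketch of the pipeline the abstract describes, with convolutional features taken at two downsampling scales, fused, passed to a BiLSTM and classified per time step (CTC-style transcription assumed downstream), might look as follows; all layer sizes are illustrative only.

```python
import torch
import torch.nn as nn

class TwoScaleCRNN(nn.Module):
    """Minimal CRNN-style recognizer with two feature-extraction scales fused
    before the recurrent layer (sizes are illustrative, not the paper's)."""

    def __init__(self, num_classes: int = 37):   # e.g. 36 symbols + CTC blank
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )
        # Scale 1: convolve first, then downsample.
        self.fine = nn.Sequential(nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
                                  nn.MaxPool2d(2))
        # Scale 2: downsample first, then convolve (coarser receptive field).
        self.coarse = nn.Sequential(nn.MaxPool2d(2),
                                    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(512, 256, 1)                  # feature fusion layer
        self.pool_h = nn.AdaptiveAvgPool2d((1, None))       # collapse height
        self.rnn = nn.LSTM(256, 256, bidirectional=True, batch_first=True)
        self.head = nn.Linear(512, num_classes)             # transcription layer

    def forward(self, x):
        f = self.stem(x)
        f = self.fuse(torch.cat([self.fine(f), self.coarse(f)], dim=1))
        f = self.pool_h(f).squeeze(2).permute(0, 2, 1)      # (B, width, channels)
        seq, _ = self.rnn(f)
        return self.head(seq).log_softmax(-1)               # per-step class scores

logits = TwoScaleCRNN()(torch.randn(2, 1, 32, 128))
print(logits.shape)   # torch.Size([2, 32, 37])
```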

23 pages, 4154 KiB  
Article
PCBSNet: A Pure Convolutional Bilateral Segmentation Network for Real-Time Natural Scene Text Detection
by Zhe Lian, Yanjun Yin, Min Zhi and Qiaozhi Xu
Electronics 2023, 12(14), 3055; https://doi.org/10.3390/electronics12143055 - 12 Jul 2023
Cited by 2 | Viewed by 1304
Abstract
Scene text detection is a fundamental research topic in the field of image processing and has extensive application value. Segmentation-based methods offer excellent post-processing algorithms but time-consuming feature processing; real-time semantic segmentation methods use lightweight backbone networks for feature extraction and aggregation but lack effective post-processing; and pure convolutional networks improve model performance by redesigning key components. Combining the advantages of these three types of methods, we propose a Pure Convolutional Bilateral Segmentation Network (PCBSNet) for real-time natural scene text detection. First, we constructed a bilateral feature-extraction backbone network that significantly improves detection speed: the detail extraction branch captures spatial information, while the efficient semantic extraction branch accurately captures semantic features through a series of micro designs. Second, we built an efficient attention aggregation module to guide the efficient and adaptive aggregation of features from the two branches; the fused feature map then undergoes feature enhancement to obtain a more accurate and reliable feature representation. Finally, we used differentiable binarization post-processing to construct text instance boundaries. To evaluate the effectiveness of the proposed model, we compared it with mainstream lightweight models on three datasets: ICDAR2015, MSRA-TD500, and CTW1500. The F-measure scores were 82.9%, 82.8%, and 78.9%, at 59.1, 94.3, and 75.5 FPS, respectively. We also conducted extensive ablation experiments on the ICDAR2015 dataset to validate the rationality of the proposed improvements. The results indicate that the proposed model significantly improves inference speed while enhancing accuracy, and it is competitive with other advanced detection methods. However, its detection performance on curved text still needs to be improved.
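The differentiable binarization post-processing the abstract refers to is the approximate-binarization step introduced by DBNet; a minimal sketch of that step is shown below. The amplification factor k = 50 is the value commonly used by DBNet and is an assumption here, not a detail taken from this paper.

```python
import torch

def differentiable_binarization(prob_map: torch.Tensor,
                                thresh_map: torch.Tensor,
                                k: float = 50.0) -> torch.Tensor:
    """Approximate binarization from DBNet: B = 1 / (1 + exp(-k (P - T))).
    The steep sigmoid keeps the step differentiable, so the threshold map T
    can be learned jointly with the probability map P."""
    return torch.sigmoid(k * (prob_map - thresh_map))

# Toy example: pixels well above the learned threshold saturate toward 1.
P = torch.tensor([[0.9, 0.4], [0.2, 0.7]])
T = torch.full_like(P, 0.3)
print(differentiable_binarization(P, T))
```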

14 pages, 2189 KiB  
Article
DenseTextPVT: Pyramid Vision Transformer with Deep Multi-Scale Feature Refinement Network for Dense Text Detection
by My-Tham Dinh, Deok-Jai Choi and Guee-Sang Lee
Sensors 2023, 23(13), 5889; https://doi.org/10.3390/s23135889 - 25 Jun 2023
Cited by 1 | Viewed by 2632
Abstract
Detecting dense text in scene images is a challenging task due to the high variability, complexity, and overlapping of text areas. To adequately distinguish text instances with high density in scenes, we propose an efficient approach called DenseTextPVT. We first generated high-resolution features at different levels to enable accurate dense text detection, which is essential for dense prediction tasks. Additionally, to enhance the feature representation, we designed the Deep Multi-scale Feature Refinement Network (DMFRN), which effectively detects texts of varying sizes, shapes, and fonts, including small-scale texts. In the post-processing step, DenseTextPVT then draws on the Pixel Aggregation (PA) similarity-vector algorithm to cluster text pixels into the correct text kernels. In this way, our proposed method enhances the precision of text detection and effectively reduces overlapping between text regions for dense adjacent text in natural images. Comprehensive experiments indicate the effectiveness of our method on the TotalText, CTW1500, and ICDAR-2015 benchmark datasets in comparison to existing methods.
(This article belongs to the Special Issue Object Detection Based on Vision Sensors and Neural Network)
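Pixel Aggregation originates from the PAN text detector: each text pixel outside a kernel is merged into the kernel whose mean similarity embedding is closest. A rough NumPy sketch of that clustering step is shown below; the distance threshold and the exact merging rule are illustrative assumptions, not this paper's configuration.

```python
import numpy as np
from scipy import ndimage

def pixel_aggregation(text_mask, kernel_mask, emb, dist_thresh=0.8):
    """Cluster text pixels onto text kernels using per-pixel similarity vectors,
    in the spirit of the Pixel Aggregation (PA) post-processing step.

    text_mask:   (H, W) bool, predicted text region
    kernel_mask: (H, W) bool, predicted (shrunken) text kernels
    emb:         (D, H, W) float, per-pixel similarity embeddings
    Returns an (H, W) int label map (0 = background).
    """
    labels, n = ndimage.label(kernel_mask)           # one id per kernel component
    result = labels.copy()
    # Mean embedding of each kernel.
    means = [emb[:, labels == i].mean(axis=1) for i in range(1, n + 1)]
    ys, xs = np.nonzero(text_mask & ~kernel_mask)    # text pixels not yet assigned
    for y, x in zip(ys, xs):
        d = [np.linalg.norm(emb[:, y, x] - m) for m in means]
        best = int(np.argmin(d)) if d else -1
        if best >= 0 and d[best] < dist_thresh:      # only merge similar pixels
            result[y, x] = best + 1
    return result
```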

14 pages, 3819 KiB  
Article
Scene Text Detection Based on Multi-Headed Self-Attention Using Shifted Windows
by Baohua Huang and Xiaoru Feng
Appl. Sci. 2023, 13(6), 3928; https://doi.org/10.3390/app13063928 - 20 Mar 2023
Cited by 1 | Viewed by 1598
Abstract
Scene text detection has become a popular topic in computer vision research. Most current research is based on deep learning, using Convolutional Neural Networks (CNNs) to extract the visual features of images. However, due to the limitations of convolution kernel size, CNNs can only extract local features of images within small receptive fields and cannot obtain more global features. In this paper, to improve the accuracy of scene text detection, a feature enhancement module is added to the text detection model. This module acquires global features of an image by computing multi-headed self-attention over the feature map. The improved model extracts local features using CNNs and global features through the feature enhancement module; the two are then fused to ensure that visual features at different levels of the image are captured. A shifted window is used in the calculation of the self-attention, which reduces the computational complexity from quadratic in the product of the input image width and height to linear. Experiments are conducted on the multi-oriented text dataset ICDAR2015 and the multi-language text dataset MSRA-TD500. Compared with the baseline method DBNet, the F1-score improves by 0.5% and 3.5% on ICDAR2015 and MSRA-TD500, respectively, indicating the effectiveness of the model improvement.
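The complexity reduction comes from restricting self-attention to fixed-size local windows, as in Swin-style designs. A minimal sketch of windowed (and optionally shifted) multi-headed self-attention is given below; note that a full shifted-window block also masks attention across wrapped borders and keeps the attention layer as a trained module rather than creating it per call, both omitted here for brevity.

```python
import torch
import torch.nn as nn

def window_self_attention(x, window=8, heads=4, shift=0):
    """Multi-headed self-attention computed inside (optionally shifted) local
    windows, so cost grows linearly with H*W instead of quadratically.
    x: (B, H, W, C); H and W are assumed divisible by `window`."""
    B, H, W, C = x.shape
    if shift:                                   # shifted-window variant (no mask here)
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    # Partition into non-overlapping windows of size (window x window).
    xw = x.view(B, H // window, window, W // window, window, C)
    xw = xw.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    attn = nn.MultiheadAttention(C, heads, batch_first=True)  # illustrative weights
    out, _ = attn(xw, xw, xw)                   # attention restricted to each window
    out = out.reshape(B, H // window, W // window, window, window, C)
    out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
    if shift:
        out = torch.roll(out, shifts=(shift, shift), dims=(1, 2))
    return out

feat = torch.randn(1, 32, 32, 96)
print(window_self_attention(feat, window=8, heads=4, shift=4).shape)  # (1, 32, 32, 96)
```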

22 pages, 4672 KiB  
Article
DCTable: A Dilated CNN with Optimizing Anchors for Accurate Table Detection
by Takwa Kazdar, Wided Souidene Mseddi, Moulay A. Akhloufi, Ala Agrebi, Marwa Jmal and Rabah Attia
J. Imaging 2023, 9(3), 62; https://doi.org/10.3390/jimaging9030062 - 7 Mar 2023
Cited by 2 | Viewed by 1781
Abstract
With its widespread use in leading systems, deep learning has become the mainstream approach in the table detection field. Some tables are difficult to detect because of their figure-like layout or small size. As a solution to this problem, we propose a novel method, called DCTable, to improve Faster R-CNN for table detection. DCTable extracts more discriminative features using a backbone with dilated convolutions in order to improve the quality of region proposals. Another main contribution of this paper is anchor optimization using the Intersection over Union (IoU)-balanced loss to train the RPN and reduce the false positive rate. This is followed by an RoI Align layer, instead of RoI pooling, which improves accuracy when mapping table proposal candidates by eliminating coarse misalignment and using bilinear interpolation. Training and testing on public datasets showed the effectiveness of the algorithm and a considerable improvement of the F1-score on the ICDAR 2017-POD, ICDAR 2019, Marmot and RVL-CDIP datasets.
(This article belongs to the Topic Computer Vision and Image Processing)
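The IoU-balanced loss referenced above reweights the RPN regression loss by the anchor's overlap with its ground-truth box raised to a power, so well-localized anchors dominate training. A hedged sketch of that idea is below; the exponent, the normalization, and the smooth-L1 base loss are assumptions drawn from the general IoU-balanced loss literature, not this paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def box_iou(a, b):
    """Element-wise IoU between box sets a and b, both (N, 4) in (x1, y1, x2, y2)."""
    lt = torch.max(a[:, :2], b[:, :2])
    rb = torch.min(a[:, 2:], b[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_a = (a[:, 2:] - a[:, :2]).prod(dim=1)
    area_b = (b[:, 2:] - b[:, :2]).prod(dim=1)
    return inter / (area_a + area_b - inter).clamp(min=1e-6)

def iou_balanced_smooth_l1(pred_deltas, target_deltas, iou, eta=1.5):
    """Smooth-L1 regression loss whose per-anchor weight grows with IoU**eta;
    weights are treated as constants and renormalized so the overall loss
    scale stays comparable to the unweighted loss."""
    per_box = F.smooth_l1_loss(pred_deltas, target_deltas, reduction="none").sum(dim=1)
    w = (iou.clamp(min=1e-6) ** eta).detach()
    w = w * per_box.detach().sum() / (w * per_box.detach()).sum().clamp(min=1e-6)
    return (w * per_box).mean()

# Toy usage: the first anchor overlaps the ground truth far better.
gt = torch.tensor([[0., 0., 10., 10.], [0., 0., 10., 10.]])
anchors = torch.tensor([[1., 1., 11., 11.], [6., 6., 20., 20.]])
print(box_iou(anchors, gt))                      # ~0.68 vs ~0.06
print(iou_balanced_smooth_l1(torch.randn(2, 4, requires_grad=True),
                             torch.randn(2, 4), box_iou(anchors, gt)))
```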

21 pages, 5456 KiB  
Article
G-Rep: Gaussian Representation for Arbitrary-Oriented Object Detection
by Liping Hou, Ke Lu, Xue Yang, Yuqiu Li and Jian Xue
Remote Sens. 2023, 15(3), 757; https://doi.org/10.3390/rs15030757 - 28 Jan 2023
Cited by 26 | Viewed by 3044
Abstract
Typical representations for arbitrary-oriented object detection tasks include the oriented bounding box (OBB), the quadrilateral bounding box (QBB), and the point set (PointSet). Each representation encounters problems that correspond to its characteristics, such as boundary discontinuity, square-like problems, representation ambiguity, and isolated points, which lead to inaccurate detection. Although many effective strategies have been proposed for various representations, there is still no unified solution. Current detection methods based on Gaussian modeling have demonstrated the possibility of resolving this dilemma; however, they remain limited to OBB. To go further, in this paper, we propose a unified Gaussian representation called G-Rep, which constructs Gaussian distributions for OBB, QBB, and PointSet and thus achieves a unified solution to the various representations and their problems. Specifically, PointSet- or QBB-based object representations are converted into Gaussian distributions, and their parameters are optimized using the maximum likelihood estimation algorithm. Then, three optional Gaussian metrics are explored to optimize the regression loss of the detector because of their excellent parameter optimization mechanisms. Furthermore, we also use Gaussian metrics for sampling to align label assignment and regression loss. Experimental results obtained on several publicly available datasets, such as DOTA, HRSC2016, UCAS-AOD, and ICDAR2015, show the excellent performance of the proposed method for arbitrary-oriented object detection.
(This article belongs to the Section Remote Sensing Image Processing)
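For context, the OBB-to-Gaussian conversion used by Gaussian-modeling detectors maps the box center to the mean and the rotated half-sizes to the covariance. The sketch below shows only that standard conversion; G-Rep's own contribution, converting PointSet or QBB representations via maximum-likelihood estimation, is not reproduced here.

```python
import numpy as np

def obb_to_gaussian(cx, cy, w, h, theta):
    """Convert an oriented bounding box (centre, size, rotation in radians)
    into a 2-D Gaussian: mean = box centre, covariance = R diag(w^2/4, h^2/4) R^T.
    This removes the angular discontinuity of the (w, h, theta) parameterisation."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S = np.diag([(w / 2.0) ** 2, (h / 2.0) ** 2])
    mu = np.array([cx, cy])
    sigma = R @ S @ R.T
    return mu, sigma

# A box rotated by 90 degrees yields the same Gaussian as swapping w and h,
# illustrating why the representation avoids boundary discontinuities.
print(obb_to_gaussian(0, 0, 4, 2, 0.0)[1])
print(obb_to_gaussian(0, 0, 2, 4, np.pi / 2)[1].round(6))
```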

17 pages, 18703 KiB  
Article
Irregular Scene Text Detection Based on a Graph Convolutional Network
by Shiyu Zhang, Caiying Zhou, Yonggang Li, Xianchao Zhang, Lihua Ye and Yuanwang Wei
Sensors 2023, 23(3), 1070; https://doi.org/10.3390/s23031070 - 17 Jan 2023
Cited by 4 | Viewed by 1830
Abstract
Detecting irregular or arbitrarily shaped text in natural scene images is a challenging task that has recently attracted considerable attention from research communities. However, limited by the CNN receptive field, existing methods cannot directly capture relations between distant component regions with local convolutional operators. In this paper, we propose a novel method that can effectively and robustly detect irregular text in natural scene images. First, we employ a fully convolutional network architecture based on VGG16_BN to generate text components via estimated character center points, which ensures a high text component detection recall rate and few non-character text components. Second, text line grouping is treated as a problem of inferring the adjacency relations of text components with a graph convolutional network (GCN). Finally, to evaluate our algorithm, we compare it with existing algorithms on three public datasets: ICDAR2013, CTW-1500 and MSRA-TD500. The results show that the proposed method handles irregular scene text well and achieves promising results on these three public datasets.
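The grouping step treats text components as graph nodes and infers adjacency with graph convolutions. A minimal sketch of one standard GCN layer over such a component graph is below; the node features, adjacency construction, and the downstream link classifier are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
    Nodes are text components; edges encode candidate adjacency links."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0), device=A.device)   # add self-loops
        d = A_hat.sum(dim=1)
        D_inv_sqrt = torch.diag(d.pow(-0.5))
        return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ self.lin(H))

# Toy graph: 5 text components with 16-d geometric/visual features.
H = torch.randn(5, 16)
A = torch.tensor([[0, 1, 0, 0, 0], [1, 0, 1, 0, 0], [0, 1, 0, 0, 0],
                  [0, 0, 0, 0, 1], [0, 0, 0, 1, 0]], dtype=torch.float)
out = GCNLayer(16, 8)(H, A)
print(out.shape)   # torch.Size([5, 8]); a classifier would then score each edge
```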

19 pages, 4627 KiB  
Article
An Improved Differentiable Binarization Network for Natural Scene Street Sign Text Detection
by Manhuai Lu, Yi Leng, Chin-Ling Chen and Qiting Tang
Appl. Sci. 2022, 12(23), 12120; https://doi.org/10.3390/app122312120 - 27 Nov 2022
Cited by 1 | Viewed by 1685
Abstract
Street sign text in natural scenes usually appears against a complex background and is affected by natural and artificial light. However, most current text detection algorithms do not effectively reduce the influence of light, do not make full use of the relationship between high-level semantic information and contextual semantic information in the feature extraction network, and are therefore ineffective at detecting text in complex backgrounds. To solve these problems, we first propose a multi-channel MSER (Maximally Stable Extremal Regions) method that fully considers color information in text detection; it separates the text area in the image from the complex background, effectively reducing the influence of the complex background and light on street sign text detection. We also propose an enhanced feature pyramid network text detection method, which includes a feature pyramid route enhancement (FPRE) module and a high-level feature enhancement (HLFE) module. The two modules make full use of the network's low-level and high-level semantic information to enhance its effectiveness in localizing text information and detecting text with different shapes, sizes, and inclinations. Experiments showed that the F-scores obtained by the proposed method on the ICDAR 2015 (International Conference on Document Analysis and Recognition 2015) dataset, the ICDAR2017-MLT (ICDAR 2017 Competition on Multi-lingual Scene Text Detection) dataset, and the Natural Scene Street Signs (NSSS) dataset constructed in this study are 89.5%, 84.5%, and 73.3%, respectively, confirming the advantage of the proposed method for street sign text detection.
(This article belongs to the Special Issue Computer Vision and Pattern Recognition Based on Deep Learning)
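A multi-channel MSER pass can be prototyped with OpenCV by running the detector on each color channel and pooling the candidate regions. The sketch below shows only that candidate-extraction step under those assumptions; the paper's background separation and fusion with the detection network are not reproduced, and the input file name is hypothetical.

```python
import cv2

def multi_channel_mser_boxes(bgr_image):
    """Run MSER on each colour channel separately and pool the candidate
    boxes, so text that only contrasts in one channel is not missed."""
    mser = cv2.MSER_create()
    boxes = []
    for ch in cv2.split(bgr_image):               # B, G and R channels
        regions, bboxes = mser.detectRegions(ch)  # maximally stable extremal regions
        boxes.extend(bboxes.tolist() if len(bboxes) else [])
    return boxes                                  # (x, y, w, h) candidates

img = cv2.imread("street_sign.jpg")               # hypothetical input image
if img is not None:
    print(len(multi_channel_mser_boxes(img)), "candidate regions")
```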

15 pages, 4324 KiB  
Article
Research on Small Acceptance Domain Text Detection Algorithm Based on Attention Mechanism and Hybrid Feature Pyramid
by Mingzhu Liu, Ben Li and Wei Zhang
Electronics 2022, 11(21), 3559; https://doi.org/10.3390/electronics11213559 - 31 Oct 2022
Cited by 2 | Viewed by 1361
Abstract
In the traditional text detection process, small-receptive-field text areas in video images are easily ignored, few features can be extracted, and the computation is large. These problems are not conducive to the recognition of text information. In this paper, a lightweight network structure based on the EAST algorithm is proposed that incorporates the Convolutional Block Attention Module (CBAM), a hybrid spatial and channel attention module suited to text feature extraction from natural scene video images. The improved structure can obtain deep network features of text while reducing the computation of text feature extraction. Additionally, a hybrid feature pyramid + BLSTM network is designed to improve the attention paid to small acceptance domain text regions and to the text sequence features of those regions. Test results on ICDAR2015 demonstrate that the improved construction effectively boosts the attention on small acceptance domain text regions and improves the sequence feature detection accuracy for small acceptance domains in long text regions without significantly increasing computation. At the same time, the proposed network construction is superior to the traditional EAST algorithm and other improved algorithms in precision (P), recall (R), and F-value.
(This article belongs to the Special Issue Deep Learning in Image Processing and Pattern Recognition)
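CBAM itself is a standard, published module: channel attention from avg- and max-pooled descriptors passed through a shared MLP, followed by spatial attention from channel-wise avg/max maps and a 7x7 convolution. A compact PyTorch sketch is below; how the paper wires it into its EAST-based backbone is not shown here, and the reduction ratio is the commonly used default rather than a value from this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention (shared MLP over
    avg- and max-pooled descriptors) followed by spatial attention (7x7 conv
    over channel-wise avg/max maps)."""

    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention.
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.max(dim=1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.spatial(s))

feat = torch.randn(1, 64, 80, 80)
print(CBAM(64)(feat).shape)   # torch.Size([1, 64, 80, 80])
```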

12 pages, 1430 KiB  
Article
Attention Guided Feature Encoding for Scene Text Recognition
by Ehtesham Hassan and Lekshmi V. L.
J. Imaging 2022, 8(10), 276; https://doi.org/10.3390/jimaging8100276 - 8 Oct 2022
Viewed by 1632
Abstract
Real-life scene images exhibit a range of variations in text appearance, including complex shapes, variations in size, and fancy font properties. Consequently, text recognition from scene images remains a challenging problem in computer vision research. We present a scene text recognition methodology based on a novel feature-enhanced convolutional recurrent neural network architecture. Our work addresses scene text recognition as a sequence-to-sequence modeling problem, for which a novel deep encoder–decoder network is proposed. The encoder in the proposed network is designed around a hierarchy of convolutional blocks enabled with spatial attention blocks, followed by bidirectional long short-term memory layers. In contrast to existing methods for scene text recognition, which incorporate temporal attention on the decoder side of the architecture, our convolutional architecture incorporates a novel spatial attention design to guide feature extraction toward textual details in scene text images. The experiments and analysis demonstrate that our approach learns robust text-specific feature sequences for input images, as the convolutional architecture designed for feature extraction is tuned to capture a broader spatial text context. With extensive experiments on the ICDAR2013, ICDAR2015, IIIT5K and SVT datasets, the paper demonstrates an improvement over many important state-of-the-art methods.
(This article belongs to the Section Computer Vision and Pattern Recognition)

15 pages, 31062 KiB  
Article
Text Line Extraction in Historical Documents Using Mask R-CNN
by Ahmad Droby, Berat Kurar Barakat, Reem Alaasam, Boraq Madi, Irina Rabaev and Jihad El-Sana
Signals 2022, 3(3), 535-549; https://doi.org/10.3390/signals3030032 - 4 Aug 2022
Cited by 10 | Viewed by 3609
Abstract
Text line extraction is an essential preprocessing step in many handwritten document image analysis tasks. It includes detecting text lines in a document image and segmenting the region of each detected line. Deep learning-based methods are frequently used for text line detection; however, only a limited number of methods tackle detection and segmentation together. This paper proposes a holistic method that applies Mask R-CNN to text line extraction. A Mask R-CNN model is trained to extract text line fractions from document patches, which are then merged to form the text lines of an entire page. The presented method was evaluated on two well-known datasets of historical documents, DIVA-HisDB and ICDAR 2015-HTR, and achieved state-of-the-art results. In addition, we introduce a new, challenging dataset of Arabic historical manuscripts, VML-AHTE, in which numerous diacritics are present. We show that the presented Mask R-CNN-based method can successfully segment text lines even in such a challenging scenario.
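A two-class (background vs. text line) Mask R-CNN can be set up directly from torchvision, which is one plausible way to prototype the pipeline the abstract describes. The sketch below covers only model construction and single-patch inference; the paper's training setup and its merging of patch-level line fragments into page-level lines are not reproduced, and the patch tensor is a placeholder.

```python
import torch
import torchvision

# Two classes: background and "text line" (an assumed label set, not the paper's config).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=2)
model.eval()

# Inference on one document patch; real use would tile a page into patches and
# merge the predicted line fragments back into full text lines.
patch = torch.rand(3, 512, 512)            # placeholder for a document patch
with torch.no_grad():
    pred = model([patch])[0]
print(pred["boxes"].shape, pred["masks"].shape, pred["scores"].shape)
```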

18 pages, 5334 KiB  
Article
Scene Text Detection Using Attention with Depthwise Separable Convolutions
by Ehtesham Hassan and Lekshmi V. L.
Appl. Sci. 2022, 12(13), 6425; https://doi.org/10.3390/app12136425 - 24 Jun 2022
Cited by 11 | Viewed by 1837
Abstract
In spite of significant research efforts, existing scene text detection methods fall short of the challenges and requirements posed by real-life applications. In natural scenes, text segments exhibit a wide range of shape complexities, scale, and font property variations, and they appear mostly incidentally. Furthermore, the computational requirement of the detector is an important factor for real-time operation. To address these issues, the paper presents a novel scene text detector using a deep convolutional network that efficiently detects arbitrarily oriented and complex-shaped text segments in natural scenes and predicts quadrilateral bounding boxes around them. The proposed network is designed in a U-shaped architecture with the careful incorporation of skip connections to capture complex text attributes at multiple scales. To address the computational requirements of input processing, the proposed scene text detector uses the MobileNet model, built on depthwise separable convolutions, as the backbone. The network design is integrated with text attention blocks, based on efficient channel attention, to enhance the learning ability of our detector. The network is trained in a multi-objective formulation supported by a novel text-aware non-maximal suppression procedure to generate the final text bounding box predictions. In extensive evaluations on the ICDAR2013, ICDAR2015, MSRA-TD500, and COCO-Text datasets, the paper reports detection F-scores of 0.910, 0.879, 0.830, and 0.617, respectively.
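The depthwise separable convolutions underlying the MobileNet backbone factor a standard convolution into a per-channel spatial convolution plus a 1x1 pointwise convolution, which is where the computational savings come from. A generic PyTorch sketch is below; the specific configuration used in the paper's detector is not shown.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution as used in MobileNet backbones: a
    per-channel (depthwise) 3x3 convolution followed by a 1x1 pointwise
    convolution, cutting the multiply-accumulate cost roughly by a factor of
    k^2 * C_out / (k^2 + C_out) relative to a standard k x k convolution."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return torch.relu(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 32, 128, 128)
print(DepthwiseSeparableConv(32, 64)(x).shape)   # torch.Size([1, 64, 128, 128])
```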
