End-to-End Scene Text Spotting Under Dual Domain Awareness Based on Multi-Party Synergetic Explicit Information

CHEN Ping-ping; LIN Hu; CHEN Hong-hui; XIE Zhao-peng

doi:10.12263/DZXB.20240919


PAPERS | Updated: 2025-12-08

- ACTA ELECTRONICA SINICA Vol. 53, Issue 3, Pages: 974-985(2025)
- Author affiliation:
  College of Physics and Information Engineering, Fuzhou University, Fuzhou, Fujian 350108, China
- Funding:
  National Natural Science Foundation of China (62171135); Distinguished Young Scholars Program of Fujian Province, China (2022J06010); Fujian Provincial Department of Education Key Research Project (2023XQ004); Fuzhou Science and Technology Planning (2023-P-001)
- DOI: 10.12263/DZXB.20240919
  CLC: TP391.1; TP183
- Received: 16 October 2024,
  Revised: 25 February 2025,
  Published: 25 March 2025

CHEN Ping-ping, LIN Hu, CHEN Hong-hui, et al. End-to-End Scene Text Spotting Under Dual Domain Awareness Based on Multi-Party Synergetic Explicit Information[J]. Acta Electronica Sinica, 2025, 53(03): 974-985. DOI: 10.12263/DZXB.20240919

Abstract

End-to-end text spotting in complex natural scenes is difficult because text is hard to distinguish from the background and the position information produced by detection does not match the semantic information produced by recognition, so the correlation between detection and recognition cannot be exploited effectively. To address this problem, this paper proposes MSIDA (Multi-party Synergetic explicit Information with Dual-domain Awareness text spotting), an end-to-end scene text spotting method that strengthens text-region features and edge textures and uses the synergy between detection and recognition features to improve end-to-end performance. First, a dual-domain awareness (DDA) module that fuses the spatial and directional information of text is designed to enhance the visual features of text instances. Second, a multi-party explicit information synergy (MEIS) module extracts explicit information from the encoded features and generates candidate text instances by matching and aligning the position, classification, and character information used for detection and recognition. Finally, the synergetic features guide learnable query sequences through the decoder to produce the detection and recognition results. Compared with the recent DeepSolo (Decoder with explicit points Solo) method, MSIDA improves accuracy by 0.8%, 0.8%, and 0.4% on the Total-Text, ICDAR 2015, and CTW1500 datasets, respectively. The code and datasets are available at https://github.com/msida2024/MSIDA.git.
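
The abstract describes the MEIS step as generating candidate text instances from position, classification, and character information. The sketch below shows one way such cues could be kept aligned: pick the highest-scoring encoder tokens and carry each token's position and character predictions along with it. The selection rule (top-k by classification score), the names `Candidate` and `select_candidates`, and the toy data are all assumptions made for illustration; the abstract does not specify how MEIS performs the matching, so this should not be read as the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Candidate:
    """Hypothetical container: one candidate with all three cues from one token."""
    token_index: int
    score: float                    # classification cue (text confidence)
    position: Tuple[float, float]   # position cue (normalized x, y)
    char_logits: Sequence[float]    # character cue (per-class logits)


def select_candidates(
    scores: Sequence[float],
    positions: Sequence[Tuple[float, float]],
    char_logits: Sequence[Sequence[float]],
    k: int,
) -> List[Candidate]:
    """Keep the k highest-scoring tokens and gather their aligned cues.

    Assumption: top-k selection by classification score stands in for the
    unspecified matching step of MEIS.
    """
    if not (len(scores) == len(positions) == len(char_logits)):
        raise ValueError("all three cue lists must describe the same tokens")
    # Rank token indices by classification score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Index every cue with the same token index so the cues stay aligned.
    return [
        Candidate(i, scores[i], positions[i], char_logits[i])
        for i in ranked[:k]
    ]


# Toy data for four encoder tokens (made up for this example).
scores = [0.10, 0.90, 0.40, 0.75]
positions = [(0.1, 0.1), (0.5, 0.2), (0.3, 0.8), (0.9, 0.6)]
chars = [[0.0, 1.0], [2.0, 0.5], [0.3, 0.3], [1.5, 1.7]]

top = select_candidates(scores, positions, chars, k=2)
print([c.token_index for c in top])  # [1, 3]
```

Because every cue is indexed by the same token, a downstream decoder query built from a candidate sees a position and a character prediction that refer to the same image location, which is the kind of detection/recognition agreement the abstract argues for.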


