Enhancing the Scene Text Recognition with Encoder-Decoder Interactive Model
Published:
Please cite:
@article{yongbin2025,
  title={Enhancing the Scene Text Recognition with Encoder-Decoder Interactive Model},
  author={Yongbing Mu and Mieradilijiang Maimaiti and Miaomiao Xu and Wenkai Li and Wushour Silamu},
  journal={Sensors},
  year={2025},
}
Scene text recognition has significant application value in autonomous driving, smart retail, and assistive devices. However, scene text exhibits multi-scale variation, distortion, and complex backgrounds, and existing methods such as CRNN, ViT, and PARSeq, despite their strong performance, still leave room for improvement in feature extraction and semantic modeling. To address these issues, this paper proposes a novel scene text recognition model, the Encoder–Decoder Interactive Model (EDIM). Built on an encoder–decoder framework, EDIM introduces a Multi-scale Dilated Fusion Attention (MSFA) module in the encoder to enhance multi-scale feature representation. In the decoder, a Sequential Encoder–Decoder Context Fusion (SeqEDCF) mechanism enables efficient semantic interaction between the encoder and decoder. The proposed method is validated on six regular and irregular benchmarks as well as several subsets of the Union14M-L dataset. Experimental results show that EDIM outperforms state-of-the-art (SOTA) methods across multiple metrics, with especially large gains on irregular and distorted text.
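The abstract only names the two modules, so the PyTorch sketch below is an illustrative reading rather than the paper's actual implementation: an MSFA-style block that fuses parallel dilated convolutions through a channel-attention gate, and a SeqEDCF-style step in which decoder states cross-attend to encoder features and gate the retrieved context back into the sequence. The class names, dilation rates (1, 2, 4), and gating design are all assumptions.

import torch
import torch.nn as nn

class MultiScaleDilatedFusionAttention(nn.Module):
    """Sketch of an MSFA-style block (assumed design): parallel dilated
    convolutions capture multi-scale context; a channel-attention gate
    reweights the branches before a residual fusion."""
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        # One 3x3 branch per dilation rate; padding=d keeps spatial size.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=d, dilation=d, bias=False)
            for d in dilations
        ])
        fused = channels * len(dilations)
        # Squeeze-and-excitation-style gate over the concatenated branches.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(fused, fused // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(fused // 4, fused, kernel_size=1),
            nn.Sigmoid(),
        )
        self.project = nn.Conv2d(fused, channels, kernel_size=1)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        feats = feats * self.gate(feats)   # reweight each scale channel-wise
        return x + self.project(feats)     # residual multi-scale fusion

class SeqEncoderDecoderContextFusion(nn.Module):
    """Sketch of a SeqEDCF-style step (assumed design): decoder states
    cross-attend to encoder features; a learned gate blends the retrieved
    context with the original decoder states."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, dec_states, enc_feats):
        # dec_states: (B, T, dim); enc_feats: (B, S, dim)
        ctx, _ = self.cross_attn(dec_states, enc_feats, enc_feats)
        gate = torch.sigmoid(self.fuse(torch.cat([dec_states, ctx], dim=-1)))
        return gate * ctx + (1 - gate) * dec_states

if __name__ == "__main__":
    msfa = MultiScaleDilatedFusionAttention(256)
    fmap = torch.randn(2, 256, 8, 32)    # hypothetical text-image feature map
    assert msfa(fmap).shape == fmap.shape

    fusion = SeqEncoderDecoderContextFusion(dim=256)
    dec = torch.randn(2, 25, 256)        # e.g., 25 decoding positions
    enc = torch.randn(2, 8 * 32, 256)    # flattened encoder feature map
    assert fusion(dec, enc).shape == dec.shape

Because both modules preserve input shape, blocks like these could be inserted between existing encoder stages or decoder layers without altering the surrounding architecture; the gating keeps a residual path so the model can fall back to the unfused features.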
