RAST: Residual-Attentive and Scale-aware Transformer for Robust Scene Text Recognition
Please cite:
@inproceedings{li2025_rast,
  title={RAST: Residual-Attentive and Scale-aware Transformer for Robust Scene Text Recognition},
  author={Li, Wenkai and Mu, Yongbin and Xu, Miaomiao and Maimaiti, Mieradilijiang and Li, Yanbing and Silamu, Wushour},
  booktitle={Chinese Conference on Pattern Recognition and Computer Vision (PRCV)},
  year={2025},
}
Abstract
Transformer-based Scene Text Recognition (STR) has made remarkable strides in recent years owing to its superior ability to capture long-range dependencies. However, current models often suffer from insufficient feature diversity in the encoder and limited spatial adaptability in the decoder, which hampers performance in complex or distorted text scenarios. In this paper, we propose a novel architecture named the Residual-Attentive and Scale-aware Transformer Text Recognizer (RAST), which enhances both the encoding and decoding stages for robust scene text recognition. Specifically, we introduce a multi-layer Residual-Attentive Enhancement (RAE) module to mitigate feature collapse in Vision Transformers and reinforce low-level semantic retention. In addition, we design a Multi-Scale Deformable Attention (MSDA) module that dynamically aggregates multi-scale spatial features to better model text with variable layouts and severe deformations. Extensive experiments conducted on six standard benchmarks (IIIT5k, SVT, IC13, IC15, SVTP, and CUTE80) and the seven challenging subsets of the recent Union14M-Benchmark (Curve, Multi-oriented, Artistic, Contextless, Salient, Multi-words, and General) demonstrate that RAST achieves superior performance and generalization compared to state-of-the-art STR models. These results confirm the effectiveness and practicality of the proposed architecture.
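The abstract does not give the exact formulation of the RAE module, but the stated idea (blending attention output with a scaled residual copy of the input so low-level features survive deep ViT stacks) can be sketched minimally. The following NumPy toy, with hypothetical names `rae_block` and mixing weight `alpha`, is an illustration of that residual-retention idea, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Single-head scaled dot-product self-attention over tokens x: (T, d).
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ v

def rae_block(x, wq, wk, wv, alpha=0.5):
    # Hypothetical residual-attentive enhancement: add a scaled copy of the
    # input to the attention output so low-level features are retained
    # rather than collapsing toward a uniform representation in deep layers.
    return self_attention(x, wq, wk, wv) + alpha * x

rng = np.random.default_rng(0)
T, d = 8, 16
x = rng.standard_normal((T, d))
wq, wk, wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
y = rae_block(x, wq, wk, wv)
print(y.shape)  # (8, 16)
```

With `alpha=0` the block reduces to plain self-attention; larger `alpha` preserves more of the input signal, which is one simple way to counteract the feature-collapse effect the abstract describes.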