MixFormer: A Cross-Modal Transformer for Arbitrary-Shaped Scene Text Detection

Please cite:
@inproceedings{weng2025_mixformer,
  title={MixFormer: A Cross-Modal Transformer for Arbitrary-Shaped Scene Text Detection},
  author={Weng, Yaolin and Liu, Chuanlong and Xu, Miaomiao and Maimaiti, Mieradilijiang and Silamu, Wushour},
  booktitle={Chinese Conference on Pattern Recognition and Computer Vision (PRCV)},
  year={2025},
}

Abstract

Scene text detection is a key technology in computer vision, crucial for applications such as autonomous driving, intelligent navigation, and image retrieval. In recent years, methods based on large-scale contrastive language-image pre-training (CLIP) have made significant progress in this field. However, while existing CLIP-based methods improve performance, they also introduce substantial computational overhead. To address this issue, we propose MixFormer, a novel transformer that incorporates textual features: it uses the CLIP text encoder as an independent auxiliary branch and designs a multi-level text-image fusion mechanism. This approach efficiently integrates CLIP's semantic capabilities while increasing parameters by only 5.2%, significantly enhancing text feature representation. Experiments demonstrate that MixFormer achieves an F1 score of 90.0% on the CTW1500 dataset, setting a new state of the art, and show that detection accuracy can be improved substantially without a significant increase in computational cost.
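The abstract does not spell out how the multi-level text-image fusion works. As an illustration only, a common way to fuse a frozen text-encoder branch into visual features is cross-attention, where image tokens attend to text tokens at each feature level; the sketch below shows one such level in NumPy. All shapes, names, and the residual-fusion choice are hypothetical assumptions, not the paper's actual mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_text_image(img_feats, txt_feats):
    """One fusion level (hypothetical): image tokens cross-attend to text tokens.

    img_feats: (HW, d) flattened visual feature map at one pyramid level
    txt_feats: (L, d) token embeddings from a text encoder (e.g. CLIP's)
    returns:   (HW, d) visual features enriched with textual semantics
    """
    d = img_feats.shape[-1]
    # Scaled dot-product attention scores, image queries vs. text keys.
    attn = softmax(img_feats @ txt_feats.T / np.sqrt(d), axis=-1)  # (HW, L)
    # Residual fusion: add the text-conditioned summary back to the image tokens.
    return img_feats + attn @ txt_feats

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 32))   # e.g. an 8x8 feature map, channel dim 32
txt = rng.standard_normal((16, 32))   # e.g. 16 projected text tokens
fused = fuse_text_image(img, txt)
```

In a multi-level design, a block like this would be applied at several backbone levels, with the same text branch shared across levels; since only the lightweight fusion layers are new, the parameter overhead stays small, consistent with the reported 5.2% increase.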

[PDF]