Data Augmentation for Low-Resource Languages NMT Guided by Constrained Sampling


Please cite:
@article{Maimaiti2022DataAF, title={Data augmentation for low-resource languages NMT guided by constrained sampling}, author={M. Maimaiti and Yang Liu and Huanbo Luan and Maosong Sun}, journal={International Journal of Intelligent Systems}, year={2022}, volume={37}, pages={30--51} }


Data augmentation is a widely used approach in many text generation tasks. In machine translation (MT), and especially in low-resource language (LRL) scenarios, many data augmentation methods have been proposed. The most common ones build a pseudo corpus by randomly sampling, omitting, or replacing some words in the text. However, these approaches can hardly guarantee the quality of the augmented data. In this work, we augment the corpus with a constrained sampling method. We also build an evaluation framework to select higher-quality data after augmentation; namely, we use a discriminator sub-model to mitigate syntactic and semantic errors to some extent. Experimental results show that our augmentation method consistently outperforms all previous SOTA methods on both small- and large-scale corpora, across 8 language pairs from 4 corpora, by $2.38 \sim 4.18$ BLEU points.
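
For readers unfamiliar with the baseline family the abstract refers to, the following is a minimal sketch of word-level random augmentation (omitting or replacing words) followed by a generic score-based filter. It is not the constrained sampling method or the discriminator from the paper; the names `augment`, `filter_by_score`, `score_fn`, `p_drop`, `p_replace`, and `threshold` are hypothetical and chosen only for illustration.

```python
import random


def augment(tokens, vocab, p_drop=0.1, p_replace=0.1, rng=None):
    """Return a noisy copy of `tokens`.

    Each token is dropped with probability `p_drop`, replaced by a random
    word from `vocab` with probability `p_replace`, and kept otherwise.
    This is the generic random baseline, not the paper's method.
    """
    if rng is None:
        rng = random.Random()
    out = []
    for tok in tokens:
        r = rng.random()  # uniform in [0, 1)
        if r < p_drop:
            continue  # omit the word
        if r < p_drop + p_replace:
            out.append(rng.choice(vocab))  # replace the word
        else:
            out.append(tok)  # keep the word
    # Fall back to the original sentence rather than return an empty one.
    return out or list(tokens)


def filter_by_score(candidates, score_fn, threshold):
    """Keep candidates whose score is at least `threshold`.

    `score_fn` is a hypothetical stand-in for any quality estimator;
    the paper's discriminator sub-model is not reproduced here.
    """
    return [c for c in candidates if score_fn(c) >= threshold]
```

Passing a seeded `random.Random` instance as `rng` makes the noise reproducible, which helps when comparing augmentation settings on the same small corpus.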