The performance of a translation system is inseparable from corpus size, and corpus quality also has a direct impact on the final translation result. Automatic tagging and stemming, as fundamental steps in building a corpus, are therefore of great significance.
Automatic stemming and POS tagging are thus an indispensable part of Uyghur NLP. To improve the translation quality of the Moses-based statistical Uyghur-Chinese bidirectional machine translation system, the Uyghur-Chinese parallel corpus has to be expanded, and both tagging and stemming information has to be added to the machine translation data set. Since research on Uyghur POS tagging and stemming began, a variety of methods have been tried, but the results have not been satisfactory. This paper adopts a stemming algorithm based on Morfessor and a POS tagging method based on conditional random fields (CRF).
The paper explains the principles of CRF and Morfessor. To collect a corpus for the experiments, a web-based online data collection tool for Uyghur, Kazak and Kirghiz was developed, and 90% of the text corpus used in this paper was obtained with it.
Because CRF and Morfessor impose strict requirements on the format of the data set, two pre-processing programs were developed. The CRF template file was modified and many experiments were run on Linux to obtain a tagging model, which serves as the automatic tagging system. With Morfessor, the best segmentation model can only be obtained by training on a large corpus; the resulting model serves as the automatic stemming system.
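The kind of format conversion such pre-processing programs perform can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's actual tools: it assumes a CRF++-style column layout (one token and its tag per line, tab-separated, with a blank line between sentences) and a "count word" list as segmentation training input. The function names, the sample tokens and the tag labels are all hypothetical.

```python
from collections import Counter


def to_crf_columns(sentences):
    """Render tagged sentences in a column layout (assumed CRF++-style):
    one 'token<TAB>tag' pair per line, sentences separated by a blank line."""
    blocks = []
    for sent in sentences:
        blocks.append("\n".join(f"{tok}\t{tag}" for tok, tag in sent))
    return "\n\n".join(blocks) + "\n"


def to_word_counts(sentences):
    """Render a 'count word' list, one word type per line, sorted by word.
    This layout is an assumption about the segmentation tool's input."""
    counts = Counter(tok for sent in sentences for tok, _ in sent)
    return "\n".join(f"{n} {w}" for w, n in sorted(counts.items())) + "\n"


# Hypothetical sample: romanized tokens with made-up tag labels.
sample = [
    [("men", "P"), ("kitab", "N"), ("oqudum", "V")],
    [("kitab", "N"), ("yaxshi", "A")],
]

if __name__ == "__main__":
    print(to_crf_columns(sample), end="")
    print(to_word_counts(sample), end="")
```

Keeping both converters over the same in-memory sentence list means the tagging and segmentation training files are derived from exactly the same corpus snapshot.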
Finally, test results show that Uyghur POS tagging precision reached 89.73% and automatic segmentation precision reached 86.80%. Building on these results, the BLEU score of the Moses-based Uyghur-Chinese bidirectional statistical machine translation system increased from 23.42 to 25.38.
[Postgraduate Thesis Chinese Paper]