# American Institute of Mathematical Sciences

August & September 2019, 12(4&5): 823-836. doi: 10.3934/dcdss.2019055

## Uyghur morphological analysis using joint conditional random fields: Based on small scaled corpus

1. Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
2. University of Chinese Academy of Sciences, Beijing 100049, China
3. Institute of Mathematics and Information of Hotan Teachers College, Hotan 848000, China

* Corresponding author: Ghalip Abdukerim

Received: June 2017. Revised: October 2017. Published: November 2018.

Morphological analysis is a fundamental task in Uyghur natural language processing. For each word in a given sentence, it determines the part of speech (POS), segments the word into morphemes (stem and affixes), and automatically annotates the grammatical function of each morpheme according to its context. Its output supplies information needed by other natural language processing tasks, including syntactic analysis, machine translation, automatic summarization, and semantic analysis. To improve the efficiency of morphological analysis, this paper proposes a hybrid approach that builds a statistical model for Uyghur morphological tagging from a small-scale corpus. Experimental results show that the approach achieves an overall accuracy of 92.58% with a limited training corpus.

Citation: Ghalip Abdukerim, Eziz Tursun, Yating Yang, Xiao Li. Uyghur morphological analysis using joint conditional random fields: Based on small scaled corpus. Discrete & Continuous Dynamical Systems - S, 2019, 12 (4&5) : 823-836. doi: 10.3934/dcdss.2019055

The morphological analysis result and hierarchical relationship of a Uyghur sentence
The Architecture of a semi-supervised morphological analysis based on the hybrid approach
Morphological Tag Decoding Process of Words in the Sentence
The Relationship between Parameter $\beta$ and Accuracy
Feature Template of POS Tagging Model

| Features | Description |
| --- | --- |
| $w_{i-2}\,{pos}_{i}$, $w_{i-1}\,{pos}_{i}$, $w_{i}\,{pos}_{i}$, $w_{i+1}\,{pos}_{i}$, $w_{i+2}\,{pos}_{i}$ | Unary context features of the word |
| $w_{i-2}w_{i-1}\,{pos}_{i}$, $w_{i-1}w_{i}\,{pos}_{i}$, $w_{i}w_{i+1}\,{pos}_{i}$, $w_{i+1}w_{i+2}\,{pos}_{i}$, $w_{i-1}w_{i+1}\,{pos}_{i}$ | Binary context features of the word |
| $h_1(w_i)\,{pos}_{i}$, $h_2(w_i)\,{pos}_{i}$, $h_3(w_i)\,{pos}_{i}$, $h_4(w_i)\,{pos}_{i}$, $h_5(w_i)\,{pos}_{i}$ | n characters selected from the beginning of the word |
| $t_1(w_i)\,{pos}_{i}$, $t_2(w_i)\,{pos}_{i}$, $t_3(w_i)\,{pos}_{i}$, $t_4(w_i)\,{pos}_{i}$, $t_5(w_i)\,{pos}_{i}$ | n characters selected from the end of the word |
| ${pos}_{i-1}\,{pos}_{i}$ | POS tag transition feature |

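As a rough illustration of how the POS feature template above could be instantiated, the following Python sketch builds the observation side of each template for one word position. It is not the authors' implementation. The function name `pos_observation_features`, the `PAD` token, the string encoding of features, and the romanized example tokens are all assumptions made for this example. It also assumes that $h_n$ and $t_n$ denote the first and last $n$ characters of the word. Pairing each observation with ${pos}_{i}$, and the transition feature ${pos}_{i-1}\,{pos}_{i}$, would normally be handled by the CRF toolkit itself.

```python
from typing import List

PAD = "<PAD>"  # hypothetical padding token for positions outside the sentence


def pos_observation_features(words: List[str], i: int, max_n: int = 5) -> List[str]:
    """Build the observation part of the POS templates for position i."""

    def w(offset: int) -> str:
        # Word at relative offset, or PAD when the index leaves the sentence.
        j = i + offset
        return words[j] if 0 <= j < len(words) else PAD

    feats: List[str] = []

    # Unary context features: w_{i-2}, w_{i-1}, w_i, w_{i+1}, w_{i+2}
    for k in (-2, -1, 0, 1, 2):
        feats.append(f"U[{k}]={w(k)}")

    # Binary context features: the five word pairs listed in the template
    for a, b in ((-2, -1), (-1, 0), (0, 1), (1, 2), (-1, 1)):
        feats.append(f"B[{a},{b}]={w(a)}|{w(b)}")

    # h_n / t_n: first and last n characters of the current word (assumed meaning).
    # For words shorter than n, slicing simply returns the whole word.
    current = words[i]
    for n in range(1, max_n + 1):
        feats.append(f"H{n}={current[:n]}")
        feats.append(f"T{n}={current[-n:]}")

    return feats


# Illustrative romanized tokens, used only to exercise the function.
sentence = ["men", "kitabni", "oqudum"]
features = pos_observation_features(sentence, 1)
```

Encoding each feature as a string keeps the sketch independent of any particular CRF library; most toolkits accept either such strings or an equivalent template file.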
Feature Template of the Morphological Tagging Model

| Features | Description |
| --- | --- |
| $m_{i-2}\,t_{i}$, $m_{i-1}\,t_{i}$, $m_{i}\,t_{i}$, $m_{i+1}\,t_{i}$, $m_{i+2}\,t_{i}$ | Unary context features of the morpheme |
| $m_{i-2}m_{i-1}\,t_{i}$, $m_{i-1}m_{i}\,t_{i}$, $m_{i}m_{i+1}\,t_{i}$, $m_{i+1}m_{i+2}\,t_{i}$, $m_{i-1}m_{i+1}\,t_{i}$ | Binary context features of the morpheme |
| $t_{i-1}\,t_{i}$ | Morphological tag transition feature |

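The morpheme-level template above pairs each context observation with the current tag $t_i$ and adds one tag transition feature. A minimal Python sketch of that instantiation for a tagged morpheme sequence follows. It is an assumption-laden illustration, not the paper's code: `morph_tag_features`, the `BOUNDARY` symbol, the tuple encoding, and the tag names `N` and `ACC` are all hypothetical.

```python
from typing import List, Tuple

BOUNDARY = "<B>"  # hypothetical symbol for positions outside the sequence


def morph_tag_features(
    morphemes: List[str], tags: List[str], i: int
) -> List[Tuple[str, ...]]:
    """Instantiate every template of the morphological tagging model at position i."""

    def m(offset: int) -> str:
        # Morpheme at relative offset, or BOUNDARY outside the sequence.
        j = i + offset
        return morphemes[j] if 0 <= j < len(morphemes) else BOUNDARY

    t_i = tags[i]
    t_prev = tags[i - 1] if i > 0 else BOUNDARY

    feats: List[Tuple[str, ...]] = []

    # Unary context features: (m_{i+k}, t_i) for k in -2..2
    for k in (-2, -1, 0, 1, 2):
        feats.append(("U", str(k), m(k), t_i))

    # Binary context features: the five morpheme pairs from the template, with t_i
    for a, b in ((-2, -1), (-1, 0), (0, 1), (1, 2), (-1, 1)):
        feats.append(("B", f"{a},{b}", m(a), m(b), t_i))

    # Morphological tag transition feature: (t_{i-1}, t_i)
    feats.append(("T", t_prev, t_i))

    return feats


# Illustrative stem + affix pair with made-up tag names.
morphemes = ["kitab", "ni"]
tags = ["N", "ACC"]
feats = morph_tag_features(morphemes, tags, 1)
```

In training, counts of such tuples over the annotated corpus would become the feature functions whose weights the CRF learns; at decoding time the same tuples are scored for each candidate tag.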
List of Morphological Tag Candidates of Words in the Sentence
Manually Tagged Corpus Format and Content Example
Details of Experimental Data

| | Number of sentences | Number of words (including punctuation marks) | Number of Uyghur words |
| --- | --- | --- | --- |
| Training set | 1000 | 12433 | 10391 |
| Development set | 200 | 2564 | 2151 |
| Test set | 200 | 2492 | 2075 |

Experimental Results

| Method | Stemming accuracy (%) | Morpheme segmentation accuracy (%) | POS accuracy (%) | Overall accuracy (%) |
| --- | --- | --- | --- | --- |
| Tag sequence Markov model | 90.18 | 83.25 | 86.17 | 75.13 |
| Joint CRF model | 91.98 | 85.79 | 92.7 | 77.95 |
| Tag sequence Markov model, $\alpha$=0.95 | 92.65 | 88.47 | 88.12 | 79.65 |
| Joint CRF model, $\alpha$=0.9 | 92.85 | 89.76 | 92.6 | 80.73 |

Analysis for the Influence of Filtering Rules on Morphological Tagging

| Method | Stemming accuracy (%) | Morpheme segmentation accuracy (%) | POS accuracy (%) | Overall accuracy (%) |
| --- | --- | --- | --- | --- |
| Joint CRF model, $\alpha$=0.9, $\beta$=0.1, filtering rules not used | 92.85 | 89.76 | 92.6 | 80.73 |
| Joint CRF model, $\alpha$=0.9, $\beta$=0.1, filtering rules used | 97.4 | 94.58 | 96.35 | 92.58 |
| Tag sequence transition model, $\alpha$=0.95, filtering rules used | 94.35 | 93.22 | 94.78 | 91.81 |

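To make the effect of the filtering rules easier to read off the table above, the short Python sketch below computes, for the joint CRF model ($\alpha$=0.9, $\beta$=0.1), the absolute gain in accuracy and the relative reduction in error rate for each metric. The numbers are copied from the table. The helper name `filtering_gains` and the choice of relative error reduction as a summary statistic are assumptions of this example, not part of the paper.

```python
from typing import Dict, Tuple

# Accuracies (%) of the joint CRF model, copied from the table above.
without_rules = {"stemming": 92.85, "segmentation": 89.76, "pos": 92.6, "overall": 80.73}
with_rules = {"stemming": 97.4, "segmentation": 94.58, "pos": 96.35, "overall": 92.58}


def filtering_gains(
    before: Dict[str, float], after: Dict[str, float]
) -> Dict[str, Tuple[float, float]]:
    """Return (absolute gain in points, relative error reduction in %) per metric."""
    result = {}
    for metric, acc_before in before.items():
        gain = after[metric] - acc_before
        error_before = 100.0 - acc_before  # error rate before filtering, in %
        result[metric] = (round(gain, 2), round(100.0 * gain / error_before, 1))
    return result


gains = filtering_gains(without_rules, with_rules)
for metric, (gain, reduction) in gains.items():
    print(f"{metric}: +{gain} points, {reduction}% fewer errors")
```

For the overall score, the gain is 11.85 points (80.73 to 92.58), which corresponds to removing roughly 61.5% of the errors made without filtering rules.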