PART OF SPEECH TAGGER FOR MALAY LANGUAGE BASED ON WORDS MORPHOLOGY.

Byline: Mohd Pouzi Hamzah and Syarifah Fatem Na'imah Binti Syed Kamaruddin

ABSTRACT: Part-of-speech (POS) tagging is an essential pre-processing task that affects the performance of text processing. A POS tagger assigns a tag to each token. In this work the tags are the word classes noun, verb, adjective and adverb, because these four word classes form the basic and most important structure of Malay sentences. Rules for tagging are developed to facilitate the extraction of new information from unstructured text. This paper presents the evaluation of a POS tagger for Malay texts. It evaluates the tagging accuracy for each word in a police report corpus, from which 50 sentences and 643 words are selected from five different police reports. The results show that tagging accuracy varies between 85.7 percent and 91.8 percent. Accuracy is the main factor in evaluating any POS tagger.

KEYWORDS: Malay language; POS-tagger; Police reports

1.0 INTRODUCTION

One of the fundamental tasks in information retrieval is part-of-speech (POS) tagging. POS tagging is the process of assigning an accurate grammatical class, or word class, to every word[1]. POS tagging plays an elementary role in information retrieval tasks such as information extraction and text parsing. The volume of Malay text in databases is rising as more reports and news are produced, so POS tagging is required to analyze every single word in these reports. There are a few methods for POS tagging. Rule-based tagging, stochastic tagging and transformation-based tagging are the three common approaches[2]. This paper describes rules for tagging Malay texts. The main interest of this research is to test the accuracy of tagging using the POS tagger system. Earlier experiments on Malay-language test collections include those of Fatimah and Mohd Pouzi, which use a collection of verses from the Quran as the document collection[3, 4].

The Malay texts are taken from the police report corpus. Most police reports consist of free text filed by the victim. A total of 643 words have been tagged from a sample of the police report corpus.

Based on our corpus, a knowledge base is used and a very coarse tagset is picked to tag Malay texts. The tagset consists of noun, verb, adjective and adverb[5, 6]. In English, tagging and parsing are inextricably linked, but they are separate processes[7]. A tagger labels each word of the text with its word class; for example, the words in "beautiful lady" will be tagged as adjective and noun respectively. In Malay, however, many words are ambiguous, in that a word can belong to several classes. For example, a word like "tanduk" can be classified as a noun and sometimes as a verb, depending on its use in a sentence. A common example from the investigation domain is "perang". "Perang" can be a noun meaning the colour brown, or it can be a verb if the meaning is war. Another example is "pukul". "Pukul" can be a noun if it refers to o'clock and a verb if it means to hit or strike. Such words are called homonyms.

In our knowledge base there are 2758 nouns, 1459 verbs, 1102 adjectives and 226 adverbs, as shown in Figure 1. In the knowledge base we have classified words according to their classes as root words, compound words and reduplicated words. A combination of two words that are linked in meaning is called a compound word. For example, "sakit" combines with "hati" to mean literally "liver disease", but its actual meaning is "annoyed". Figure 2 shows the number of articles published in Science Direct, Scopus and IEEE from 2008 to 2012 related to the tagging process for English and Malay. The graph shows the importance of increasing the number of studies on the Malay language, because far fewer studies have been done than for English. The rest of this paper discusses the corpus and the pre-processing methodology, followed by results and discussion at the end of the paper.

2.0 THE CORPUS (POLICE REPORT)

The texts are a collection of Malay texts taken from the Polis Diraja Malaysia (PDRM) corpus. The corpus is collected from daily reports and common texts. The corpus is tokenized and its words are tagged according to the knowledge base[2]. The system automatically tags proper nouns and numerical words, such as dates, times and places, when these words are not in the knowledge base.

The words are tagged with a coarse tagset consisting of four different tags: noun, verb, adverb and adjective. Examples of adverbs are yesterday, outside and always. At this early stage, 643 words are tagged from a sample of the reports. Table 1 shows the tags and their corresponding frequencies in the corpus.

3.0 METHODOLOGY

The proposed framework describes the methodology for pre-processing, which includes tokenization and tagging.

A. Tokenization

The tokenization process helps to improve the accuracy of POS tagging. For example, a reduplicated word can be recognized by its hyphen ("-"), so the whole word is kept as one token. Examples are "adik-beradik", "berwarna-warni" and "gotong-royong". All other words are tokenized by spaces and punctuation marks. Figure 3 shows the pre-processing process.
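The tokenization rule described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the regular expression and the function name `tokenize` are assumptions, and the sketch handles only ASCII letters and digits.

```python
import re

# Assumed pattern: letters joined by hyphens (reduplicated words) are kept
# as one token; any other run of letters or digits is a separate token.
# Spaces and punctuation marks act as separators and are dropped.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)+|[A-Za-z0-9]+")


def tokenize(sentence):
    """Split a sentence into tokens, keeping hyphenated words whole."""
    return TOKEN_PATTERN.findall(sentence)


print(tokenize("Adik-beradik saya berwarna-warni, gotong-royong semalam."))
# ['Adik-beradik', 'saya', 'berwarna-warni', 'gotong-royong', 'semalam']
```

Because punctuation is treated as a separator, a date or time written with punctuation would be split into several numeric tokens in this sketch.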

B. POS Tagging

To tag a token, six knowledge bases serve as inputs: the knowledge bases of root words, compound words, reduplicated words, "kata luar biasa", "kata pandu" and affixes.

Table 1: The tags distribution

Tag Name          Frequency in Corpus   Probability
Noun (N)          366                   0.5692
Verb (V)          80                    0.1244
Adjective (ADJ)   12                    0.0187
Adverb (ADV)      182                   0.2830
Others            3                     0.0047
Total             643                   1
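The Probability column of Table 1 is each tag's frequency divided by the 643-word total. A short Python check of that arithmetic, with variable names chosen only for illustration:

```python
# Frequencies copied from Table 1.
counts = {"N": 366, "V": 80, "ADJ": 12, "ADV": 182, "Others": 3}

total = sum(counts.values())

# Relative frequency of each tag, rounded to four decimals as in the table.
probabilities = {tag: round(n / total, 4) for tag, n in counts.items()}

print(total)
print(probabilities)
```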

The order of the tagging rules is important, both to ensure tagging accuracy and to complete the process in the shortest possible time. Different results are obtained if the order of the process is changed. The order given below was settled after running the program several times and reviewing the results manually.

A general algorithm for tagging using the knowledge base:

1) Read the first sentence to the end
2) Check the word in the table of compound words
3) If step (2) fails to find the word
   a. Check the word in the table of reduplicated words
4) If step (3) fails to find the word
   a. Check the word in the table of "kata luar biasa"
5) If step (4) fails to find the word
   a. Check the word in the table of "kata pandu"
6) If step (5) fails to find the word
   a. Check the word in the table of root words
7) If step (6) fails to find the word
   a. Check the word in the table of affix words
8) If step (7) fails to find the word
   a. Check the first letter of the word; if it is a capital letter, the list of candidates is a noun
9) If step (8) fails to find the word
   a. Check whether the word is numerical; if it is, the list of candidates is a noun
10) If step (9) fails to find the word
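The ordered lookup above can be sketched in Python as follows. Everything here is a hedged illustration under stated assumptions. The tables are tiny hypothetical stand-ins for the six knowledge bases, and their entries and tags are invented for the example. The "affix" step is assumed to be a prefix check. The action for step 10 is not given in the text, so the `"OTHER"` fallback is an assumption suggested by the "Others" row in Table 1. Multi-word compounds are not handled.

```python
# Hypothetical stand-ins for the knowledge bases (entries are illustrative).
COMPOUND_WORDS = {"tandatangan": "N"}
REDUPLICATED_WORDS = {"adik-beradik": "N", "berwarna-warni": "ADJ"}
KATA_LUAR_BIASA = {}  # contents not given in the text
KATA_PANDU = {}       # contents not given in the text
ROOT_WORDS = {"pukul": "V", "rumah": "N", "semalam": "ADV"}
AFFIX_PREFIXES = {"ber": "V"}  # assumed form of the affix table

# Steps 2 to 6: the tables are consulted in this fixed order.
LOOKUP_ORDER = [
    COMPOUND_WORDS,
    REDUPLICATED_WORDS,
    KATA_LUAR_BIASA,
    KATA_PANDU,
    ROOT_WORDS,
]


def tag_word(word):
    """Return a tag for one token by trying each rule in order."""
    key = word.lower()
    for table in LOOKUP_ORDER:
        if key in table:
            return table[key]
    # Step 7: affix check, assumed here to be a simple prefix match.
    for prefix, tag in AFFIX_PREFIXES.items():
        if key.startswith(prefix) and len(key) > len(prefix):
            return tag
    # Step 8: a capitalised unknown word is treated as a noun.
    if word[:1].isupper():
        return "N"
    # Step 9: a numerical word is treated as a noun.
    if word.isdigit():
        return "N"
    # Step 10: assumed fallback, not stated in the text.
    return "OTHER"


print([(w, tag_word(w)) for w in ["Ali", "berjalan", "semalam", "2014"]])
# [('Ali', 'N'), ('berjalan', 'V'), ('semalam', 'ADV'), ('2014', 'N')]
```

Keeping the tables in a list makes the rule order explicit, which matters because the text notes that a different order gives different results.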

4.0 RESULTS AND DISCUSSION

The experiment is first evaluated by comparing the results of applying the rules with a manually tagged set of data. The comparison is made using a program developed specifically for this purpose. Specific tests have been developed to assist the evaluation using the program[8]. The test set consists of five documents from police reports.

In Table 2, the category "Others" labels words whose word class cannot be determined. Words are labelled "Others" because of non-standard usage in the document, such as spelling errors or abbreviations. Based on Table 2 or Figure 4, nouns and adverbs form the majority of the words in each document. Most of the adverbs have no significance in indexing, so the main components can be concluded to be nouns, followed by verbs, adjectives and adverbs.

For evaluation purposes, the tagged test documents are compared manually with the original manually tagged documents and the differences are recorded. The performance of the tagger is evaluated from two different aspects, taking accuracy to be the percentage of correctly assigned tags. The accuracy is based on a comparison of human tagging and computer tagging by tag, and on correct and incorrect tagging by document, as shown in Table 3. In this experiment, an 'incorrect' word is one that the system tagged incorrectly and a 'missed' word is one that the system did not process.

The evaluation performed by the program is given in Table 4. The accuracy achieved lies in the interval [85.7, 91.8] percent, with an average accuracy of 88.4%. This is an improvement over the previous Malay tagger software, "Lazy Man's Way", which achieved 86.87% precision and 72.56% recall[9]. High tagging accuracy could support experiments that will be conducted in information retrieval. The analysis in Table 5 shows that the system achieved 91.3% precision and 88.4% recall. The formulas for precision and recall are shown in formulas 1 and 2. The experimental tests reach only 88.4% because several factors cause errors in tagging. One factor is semantic. For example, "bersyukur" is an adjective, although according to Malay morphology a word that begins with "ber" is a verb[10].

As this example shows, the word class of words with the prefix "ber" is difficult to determine from morphology alone. An example police report processed by the system is shown in Figure 5.

Higher tagging accuracy could be achieved if the semantic aspect were also taken into account. However, the scope of this study covers only tagging according to morphology, and the evaluation shows that the average accuracy of 88.4% is close to that of the existing Malay tagging software[4]. In addition, words that carry several meanings, such as "perang", also affect tagging accuracy. Words that do not exist in the knowledge base also cannot be tagged properly. Lastly, tagging decisions are affected by human checking errors made during manual tagging.

Table 2: Statistics of POS tagging in police reports

Report   Total Words   Nouns   Verbs   ADJ   ADV   Others
1        156           100     22      2     32    -
2        105           56      13      6     28    2
3        195           111     22      -     61    1
4        97            53      4       11    29    -
5        90            46      12      -     32    -

5.0 CONCLUSIONS

This paper experiments with applying tagging rules to the police report corpus. The proposed part-of-speech (POS) tagger is developed manually. The resulting accuracy is 88.4%. A total of 643 words from different cases in untagged police reports are used, of which 567 are tagged correctly and 76 incorrectly. Only the morphological aspect is considered in implementing the tagging rules. A larger amount of data is required to obtain higher accuracy. Finally, to improve the accuracy of the POS tagger, it is recommended to broaden the police report corpus up to 20,000 words. This could be the focus of further work.

Table 3: Comparison of human tagging and computer tagging

Type        Human Tagging   Computer Tagging   Accuracy (%)
Noun        366             350                95.6
Verb        80              75                 93.8
Adjective   12              10                 83.3
Adverb      183             151                82.5
Average     -               -                  88.8

Table 4: Accuracy of part of speech (POS) tagging in police reports

Reports   Total Words   Correct Words   Incorrect Words   Missed Words   Accuracy (%)
1         156           137             11                8              87.8

Table 5: The results analysis

Reports   Precision (%)   Recall (%)
1         92.6            87.8
2         88.2            85.7
3         90.0            87.6
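Formulas 1 and 2 are not reproduced in this text. As a hedged reconstruction, the Report 1 row of Table 4 reproduces the Report 1 values of Table 5 if precision is taken as correct / (correct + incorrect) and recall as correct / (correct + incorrect + missed). These definitions are inferred from the tables and are not the authors' stated formulas. The function names below are chosen for illustration.

```python
def precision(correct, incorrect):
    """Percentage of the words the system tagged that were tagged correctly."""
    return 100.0 * correct / (correct + incorrect)


def recall(correct, incorrect, missed):
    """Percentage of all words, including missed ones, that were tagged correctly."""
    return 100.0 * correct / (correct + incorrect + missed)


# Report 1 row of Table 4: 137 correct, 11 incorrect, 8 missed (156 words).
correct, incorrect, missed = 137, 11, 8

print(round(precision(correct, incorrect), 1))       # 92.6
print(round(recall(correct, incorrect, missed), 1))  # 87.8
```

Under these assumed definitions, recall for a report equals its accuracy in Table 4, since correct, incorrect and missed words together make up the report's total word count.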

REFERENCES

[1] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Pearson Prentice Hall, 2009.

[2] T. Baldwin, "Open source corpus analysis tools for Malay," in Proc. of the 5th International Conference on Language Resources and Evaluation (LREC), Genoa, Italy, 2006.

[3] F. Ahmad, "A Malay language document retrieval system: An experimental approach and analysis," PhD thesis, Universiti Kebangsaan Malaysia, Malaysia, 1995.

[4] M. P. Hamzah, "Frasa Dan Hubungan Semantik Dalam Perwakilan Pengetahuan: Kesan Terhadap Keberkesanan Capaian Dokumen Melayu," PhD thesis, Universiti Kebangsaan Malaysia, Bangi, 2006.

[5] K. Gerald and Z. M. Don, "Tagging a corpus of Malay texts and coping with 'syntactic drift'," presented at the Corpus Linguistics 2003 Conference, University of Lancaster, Centre for Computer Corpus Research on Language, 2003.

[6] A. H. Omar, "Word Classes in Malay," in Essays on Malaysian Linguistics, vol. 13. Kuala Lumpur, Malaysia: Dewan Bahasa dan Pustaka, Ministry of Education Malaysia, 1993, pp. 162-174.

[7] Z. M. Don, "Processing Natural Malay Text: A Data-Driven" 2010.

[8] S. Tasharofi et al., "Evaluation of statistical part of speech tagging of Persian text," in 9th International Symposium on Signal Processing and Its Applications (ISSPA 2007), 2007, pp. 1-4.

[9] N. Zamin et al., "A Lazy Man's Way to Part-of-Speech Tagging," in Knowledge Management and Acquisition for Intelligent Systems, vol. 7457, D. Richards and B. Kang, Eds. Springer Berlin Heidelberg, 2012, pp. 106-117.

[10] S.-F. Chung, "Uses of ter- in Malay: A corpus-based study," Journal of Pragmatics, vol. 43, pp. 799-813, 2011.
COPYRIGHT 2014 Asianet-Pakistan
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2014 Gale, Cengage Learning. All rights reserved.

Article Details
Publication:Science International
Article Type:Report
Geographic Code:9MALA
Date:Dec 31, 2014