Printer Friendly

AN EFFICIENT INDEXING AND SEARCHING TECHNIQUE FOR INFORMATION RETRIEVAL FOR URDU LANGUAGE.

Byline: M. M. Qureshi, and M. Shoaib

ABSTRACT: Indexing techniques (Ricardo and Berthier, 2004) are used to improve retrieval of data in response to certain search condition. Inverted files (Wikipedia, 20 Jan 2010) (Chouvalit and Veera, 2007) (Fidel and Angel, 2002) (indexing technique) is mostly used for creating indexes. This paper proposes indexing technique for Urdu language. Language processing step in Index creation (Christopher et al, 2008) is different for a particular language. We discuss index creation steps specifically for Urdu language. We explore morphological rules for Urdu language and implement these rules to create Urdu stemmer. We implement our proposed technique with different implementations and compare results. We suggest that create indexes without stop words and also index file should be an order index file.

Keywords: Index technique, Inverted files, Urdu Language, Urdu Stemmer, Morphological rules

INTRODUCTION

Language specific indexing technique can be used in many ways. One useful way is in making of information retrieval system specific to one language. We can also use language specific indexing technique in a searching system which is designed for a particular language. For searching Urdu contents efficiently we have to create indexing technique for Urdu language (Lee-Feng and Hsiao-Tieh, 1996). Inverted file creation is mostly used indexing technique and we used its methodology to create indexing technique for Urdu language.

Language processing is major step in index creation. This step has sub steps (Stop words removal, normalization and stemming). We identified stop words list for Urdu language. Normalization is application specific functionality. In stemming process, we have to create morphological rules to create a stemmer. We create Urdu Stemmer by identifying morphological rules for Urdu language. We implement our proposed indexing technique and create index file. Different implementations are used and we compare our results. We suggest that we have to remove stop words from index file so that retrieval of records is fast. Also implement stemming rules so that more relevant results retrieved. Also we suggest that an ordered indexing file is needed for fast retrieval of records.

Rest of papers is as follows. Section 2 describes related works. Section 3 discusses our proposed indexing technique. In Section 4, we give results and discussion. In section 5, we conclude our discussion also provide future directions and recommendations.

Related Work

Information Retrieval (IR): IR objective is to store, organize, represent information contents and provide access to that information contents.

"Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hyper textually-networked databases such as the World Wide Web".(Christopher et al, 2008)

Searching: There are many text search algorithms presented. Some popular search algorithms are

* Brute Force (BF)

* Knuth-Morris-Pratt (KMP)

* Boyer-Moore Family

* Shift-Or

* Suffix Automation

* Practical Comparison

* Phrases and Proximity

Indexing: Searching algorithms are used in many applications but there are some limitations of these algorithms. When database is huge or very large text document, the retrieval became slow down. The solution to this problem is to build data structure (index) for fast searching. There are many indexing techniques but most familiar techniques are following:

* Inverted files

* Suffix arrays

* Signature files

Inverted Files: An inverted file is used to store contents as mapping of different documents. Mapping is done with two key elements Vocabulary and Occurrences.

Index Creation: Index creation is an important task in information retrieval. Index creation is a process which has different steps.

Steps for Index Creation: Major steps in creating index are (Christopher et al, 2008)

1) Document collection that will be used for index

2) Text tokenization

3) Language processing for tokens

4) Index those documents in which each term lies

Document Collection: For creating document collection, we have to create digital documents. Digital documents are files which consist of bytes in a computer system. For processing first step is to convert these bytes into characters. In case of English, this is simple process. But if a document is encoded in Unicode or UTF-8, it will be a complex step. For details, please refer to (Christopher et al, 2008).

Tokenization: Tokenization means splitting of given document into pieces, which is often called tokens. For example,

Input: Ali is a brave boy.

Output: Ali is a Brave Boy

Language Processing

Stop Words: In any language, some very common words are present which have little contribution for processing of query, so we eliminate that words to get efficient result. These common words are called stop words. For further details please refer to (Christopher et al, 2008). In English, examples are (a, an, the, prepositions (by, at, for), and, why, etc).

In Urdu examples are

Normalization: We can make efficient index by using normalization process. In this process, we make equivalent classes of same term. For example, word UK. We also want to make this word searchable when querying U.K. The standard way is to create implicit equivalence classes. For example we UK and U.K are both mapped to UK.

Stemming: "Stemming usually refers to crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time and often includes the removal of derivational affixes"(Christopher et al, 2008).In simple words, in stemming our goal is to reduce inflected or derived words to their stem or base word.

There are different forms of words due to grammatical reasons. For example in English word, transfer has different forms like transferred, pre transfer, post transfer, transfers.

Some Urdu words have two or more plural forms like word duurra (period,dw ) has different plural forms like:-Duuree (periods, yrdw)

Duuroon (periods,) Adwaar (periods,)

There is almost no work present in literature for index creation in Urdu Language (Kashif, 2007).

There are different stemming algorithms for English language. Most popular stemming algorithm is Porter Stemmer (Porter, 1980). Other stemming algorithms are Lovins Stemmer (Julie, 1968), Paice/Husk stemmer (Lancaster, 20 Jan 2010), Dawson Stemmer and Krovetz algorithm (Lancaster, 22 Jan 2010).

Indexing Technique for Urdu Language: As we discussed earlier, there is hardly any work presented for Urdu language in the field of IR (Especially in Index Creation process) which we are used to create our proposed indexing technique. First step is collection of documents. These documents are source from which we search against user queries. These documents are converted in electronic format. Next step is tokenization of these documents. We tokenize documents on basis of white space means a token is word in document. Next step is removing stop words from our tokens. Next we normalize our tokens. In next step we used these normalized token to create stemmer. Then these normalized and stemmed tokens are used to create inverted index file. This file is used to make our proposed system more efficient because due to use of this file, system provides more efficient and relevant results against user query. The architecture of proposed indexing technique is shown in Figure1.

Document Collection

Step 1: First step is collection of documents. We used Unicode UTF-8 encoding scheme for making electronic documents that are used for processing of our proposed indexing technique.

Tokenization

Step 2: In this step we generate tokens from electronic documents. For instance, one record in temporary storage is

Stop Words Removal

Step 3: As we discussed earlier stop words have a little use in IR system, so we eliminate these words. We used stop words table to reduce the size of index file. For example, if user enter a query

We can eliminate "w" word from query using stop words list.

Normalization

Step 4: This step is related to application specific. Means we have different process in this step for different application.

Stemmer and index Construction

Step 5: We have to define rules in order to create stemmer. Our objective is to create stemmer for Urdu language, so we have to define rules for Urdu language. These rules should entirely different from English and other Native Languages because these grammatical rules vary from one language to other language. We use grammatical rules of Urdu present in Urdu grammar literature and also use morphological analysis of Urdu grammar to come up rules that help in construction of Stemmer for Urdu language. In this section, first we present rules for stemmer (Sara, 2004).

There are two main groups of verbs with respect to their ending alphabet.

1. Verbs that ends with consonant alphabet (Alphabet except).

We will call it consonant group.

2. Verbs that ends with alphabet. We will call it vowel group.

From second group, there are further subgroups:

I. Verbs that ends with alphabet

II. Verbs that ends with alphabet III. Verbs that ends with alphabet

Step 4.1 This step related to plurals in Urdu. This step has further five sub steps

Step 4.1.1: Tokens are sequentially scanned and check their last character. If it is consonant then append Nw at the end and search for its location and save in index list.

Step 4.1.2: Tokens are sequentially scanned and check their last character. If it is then delete and append at the end and search for its location and save in index list.

Step 4.1.3: Tokens are sequentially scanned and check their last character. If it is then append at the end and search for its location and save in index list.

Step 4.1.4: Tokens are sequentially scanned and check their last character. If it is then delete and append

at the end and search for its location and save in index list.

Step 4.2: In this step, we cover inflected forms of verb with respect to tenses in Urdu.

For example

Here is definitive verb and are inflected verbs. Here root verb is and it is obtained by removing from definitive verb. This word is called auxiliary word. We identify some auxiliary words which are This step has further four sub steps.

Step 4.2.1: In this step, we identify rules for consonant verbs (verbs except y,y,w,). Tokens are sequentially scanned and check their last two characters. If it is an auxiliary word then delete auxiliary word and search for its location and save in index list.

Step 4.2.2: In this step, we identify rules for w,. Tokens are sequentially scanned and check their last two characters. If it is an auxiliary word then delete auxiliary word and search for its location and save in index list.

Step 4.2.3: In this step, we identify rules for y. Tokens are sequentially scanned and check their last two characters. If it is an auxiliary word then delete auxiliary word and search for its location and save in index list.

Step 4.2.4: In this step, we identify rules for y. Tokens are sequentially scanned and check their last two characters. If it is an auxiliary word then delete auxiliary word and search for its location and save in index list.

RESULTS AND DISCUSSION

Below are some results which obtained from our computational model.

1) We created indexing file for two times. First time, we create file with stop words. In that case size of

file is 461 kilo bytes (KB). Second time, we create file without stop words, then it size is 453 kilo bytes (KB). 2) We observe results for three different implementations in order to compare efficiency of indexing technique in respect of time execution. First implementation is "with stop words", means stop words are present in indexing file. Second implementation is "without stop words and unordered". This means indexing file has no stop words but file is unordered. And third implementation is "without stop words and order". This means that indexing file has no stop word and indexing file is ordered. Table 1 shows execution time of some results of these three implementations. Graphically represent in Figure2.

Table 1

Word With stop Without stop Without stop

words words and order words and order

(micro unordered (micro seconds)

seconds) (micro

seconds)

0.43 0.42 0.21

We also observe results without implementing stemming rules and implementing with stemming rules. Table 2 shows records of some results of these two implementations. Graphically represent in Figure3.

We see that in above section, size of indexing file without stop words is less than indexing file without stop words. Also stop words have a little or no use in searching. Also we see that execution time of searching is lower when we use indexing file without stop words. So we recommend that index file is used without stop words.

We can observe that execution time of searching "without stop words and order" is less than "without stop words and order" implementation. So we can infer that if

we implement ordering in indexing file, than it is more efficient that if we do not implement any order in indexing file. So we recommend that, we also perform ordering in creating indexing file. One thing, we should note ordering is done on basis of how we want to use that indexing file. For example, in our computational model we want to use that indexing file for searching of Urdu contents which are only related to Urdu alphabets. So we order that indexing file purely on basis of Urdu alphabets. If someone wants to implement indexing file for searching of hotels on basis of each country, he should order that indexing file on basis of country names.

We can see that in table 5.2, we are showing records for each result. These are total records retrieved for each query word. For example there are 41 records retrieved against word "" for Column1 ("Without stemming rules") and 59 records for Column2 ("with stemming rules. In Column1, we do not implement stemming rules, and Column2, we implement stemming rules. We implement stemming rules, if we want to allow users for expanded search. This means that if user search for word "work", we can allow user to search also for related words i.e. "works, worked, worker ,etc". So, we can infer that, for expanded search, we have to implement stemming rules in our indexing technique.

Keeping in view of expanded search, we can make our indexing technique more efficient by introducing multilevel indexing for stem words. For example we have a word "". We can create multilevel index for this for other words related to it .i.e.

Conclusion and Recommendations

We describe in detail our proposed indexing technique. We suggest that our proposed indexing technique will be helpful for fast retrieval of records in searching of Urdu contents.

We explore stemming rules only for verbs. But we do not explore some complex rules such as special cases of plurals. Also we do not explore complex rules of verb which mostly relate to adverbs. One can carry from this point and can explore more rules. Also work can be done in exploring nouns of Urdu language that their behavior is how much deviate from verbs.

REFERENCES

Christopher, (2008) Christopher D. Manning, Prabhakar Raghavan, Hinrich Sch " utze, , Introduction to Information Retrieval , 1-35 ISBN:978-0-521-86571-5.(2008)

Chouvalit and Veera, (2007) Chouvalit Khancome and Veera Boonjing, String Matching using Inverted Lists, World Academy of Science, Engineering and Technology, 2007

Fidel and Angel, (2002) Fidel Cacheda, Angel Vina, Inverted files and dynamic signature files for optimisation of Web directories, Eleventh International World Wide Web (WWW) Conference, 2002

Julie, (1968) Julie Beth Lovins, Development of a Stemming Algorithm, Mechanical translation and computational linguistics, 1968

Kashif, (2007) Kashif Riaz, Challenges in Urdu Stemming (A Progress Report), FDIA 2007

Lancaster, (2010) Lancaster, The Lancaster Stemming Algorithm, other stemmers, http:// www.comp. lancs.ac.uk/computing/research/stemming/Links /others.htm (accessed 20 Jan 2010)

Lancaster, (2010) Lancaster, The Lancaster Stemming Algorithm, Paice/Husk, http:// www. comp. lancs.ac.uk/computing/research/stemming/Links /paice.htm (accessed 22 Jan 2010)

Lee-Feng and Hsiao-Tieh, (1996) Lee-Feng Chien, Hsiao-Tieh Pu, Important Issues on Chinese Information Retrieval, Computational Linguistics and Chinese Language Processing, August 1996

Porter, (1980) M.F.Porter, An algorithm for suffix stripping, Computer Laboratory Cambridge, Program vol. 14: pp 130-137, 1980

Ricardo and Berthier, (2004) Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Chapter 7, Modern Information Retrieval, 192-207 ISBN 81-297-0274-6. (2004)

Sara, (2004) Sara Hussain, Finite-State Morphological Analyzer for Urdu, National University of Computer & Emerging Sciences,2004, http:// www.crulp.org/English%20Site/..\Publication\C rulp_report\CR03_05E.pdf (accessed 20 Mar 2009)

Snowball, (2010) Snowball, http:// snowball. tartarus. org/ algorithms/ lovins/stemmer.html (accessed 22-Feb 2010)

Wen-hui (2010) Wen-hui ZHANG, Hua-lin QIAN, Wei MAO, Guo-nian, A Multilingual (Chinese, English) Indexing, Retrieval, Searching Search Engine, http://www.isoc.org/inet99/proceedings/posters/ 210/index.htm (accessed 20 Jan 2010)

Wikipedia, (2010) Wikipedia, Inverted Index, http://en.wikipedia.org/wiki/Inverted_index (accessed 20 Jan 2010)

Wikipedia, (2010) Wikipedia, Ontology, http://en. wikipedia.org/wiki/Ontology_(computer_scienc e) (accessed 22 Feb2010)

Department of CS and E, University of Engineering and Technology, Lahore, Pakistan
COPYRIGHT 2010 Asianet-Pakistan
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2010 Gale, Cengage Learning. All rights reserved.

Article Details
Printer friendly Cite/link Email Feedback
Publication:Pakistan Journal of Science
Article Type:Report
Geographic Code:9PAKI
Date:Sep 30, 2010
Words:2828
Previous Article:PHARMACY SERVICES IN EMERGENCY AT PUBLIC HOSPITALS A COMPARATIVE STUDY.
Next Article:SOFTWARE QUALITY METRICS FOR RHETORICAL STRUCTURE THEORY BASED INFORMATION RETRIEVAL SYSTEMS.
Topics:

Terms of use | Privacy policy | Copyright © 2020 Farlex, Inc. | Feedback | For webmasters