SOFTWARE QUALITY METRICS FOR RHETORICAL STRUCTURE THEORY BASED INFORMATION RETRIEVAL SYSTEMS.
ABSTRACT: Currently, several information retrieval models exist, for example, Boolean model, vector space model, probabilistic model etc. but none of them provides exact information required to the user. The search engines based on such models emphasize on the syntactic aspects of the query rather than capturing the semantics. Most of the work done in the domain of information retrieval is based on the keyword based indexing technique. Searching based on this keyword based indexing technique leads to an irrelevant result set and poor performance of search engines. The world of information retrieval is revolutionized by the rhetorical structure theory because it improvised the performance of search engines by introducing the semantic based retrieval, hence, reducing the irrelevant search to its minimum. In this paper, we present and validate design metric for rhetorical structure theory based information retrieval systems so as to measure the relevance retrieval and to improvise the retrieval performance.
Keywords: Keyword based indexing, dynamic weight, text based information retrieval, tokens, discourse structure
In Text Based Information Retrieval (TBIR) systems, when the user enters a query 'q' against a collection of documents 'c' then each document 'd' is examined and is given a weight according to the criteria of how well it satisfies the semantics of the query 'q' or in other words, how well requirement of the user is fulfilled. For every instance of triple less thanq;d;cgreater than, the weight assignment attributed to a document 'd' is done by a function evaluationless thanq;d;cgreater than. Generally speaking, an Information Retrieval (IR) system (Arnt et al., 2004 and Petratos, 2006) performs various retrieval tasks such as document indexing, ranking and classification. Document ranking is achieved by sorting all documents in the collection on the basis of assigned weights (Papka et al., 2008). The core job of an IR system is the parsing process. The tokens produced in the parsing of documents and queries are called terms.
The query provided to the IR system in natural language is parsed into a set of terms. The terms derived from a document are used to build an inverted list which is used as an index to the document collection. Normally, in IR systems it is assumed that if there is co-occurrence of a term in the query and a document then the document is relevant to the query. Also the co-occurrences of multiple terms of a query in a document contribute to the degree of relevance of that document. Current information retrieval techniques (Liu et al., 2009) are able to retrieve only 30% of the relevant information. Currently, the information retrieval systems are based on key-based indexing technique in which keywords are assigned static weights using some retrieval models such as extended Boolean Model, Vector Model or Probabilistic Model etc. The words carrying different meaning in various contexts can result in retrieval of irrelevant information (Shoaib and Shah, 2006).
Rhetorical Structure Theory (RST) (Mann and Thompson, 1986, Mann and Thompson, 1987, Mathkour, et al., 2008, Taboada and Mann, 2006) is considered as a descriptive linguistic approach to identify the textual relations with the intention of generating text. The success of RST can be seen from the wide application of this theory in various domains ranging from discourse analysis to psycholinguistics, theoretical linguistics and computational linguistics (Jun'chi and Tsujii, 1994, Mathkour, et al., 2008, Taboada and Mann, 2005). This theory is remarkably used in various applications of computational linguistics, for example, text parsing and generation, machine translation and easy scoring and most importantly in natural language processing (Mann and Thompson, 1987, Marcu, 1997).This theory organizes the text by means of relations that exist between various sections of the text and establishes a coherency in which every section of the text plays a role or function with respect to other sections of the text.
The resulting relations are termed as coherence relations, discourse relations, conjunction relations and Rhetorical Relations. There may be 30 different relations in various sections of text (Taboada, and Mann, 2005, Taboada, and Mann, 2006). RST identifies two different types of text units: The Nucleus and the Satellites. Nuclei are of prime importance whereas the satellites contribute to the nucleus and are considered secondary. In elaboration relation the nucleus is the part of the text containing basic information and satellites are parts of the text containing additional or supporting information about the nucleus. RST relations on the text are applied recursively until all text units are associated with one of the RST relations (Taboada, and Mann, 2006).
With the gigantic increase in the volume of the web contents, it has become a burning need of the day to improvise the performance of the search engines so that they may be able to retrieve the semantically relevant information in response to the user query. Currently most of the search engines are based on the keyword based searching techniques which are considered to be static due to the reliance on the syntax only. These IR systems are unable to handle the various contexts in which the same term appears in different documents. RST based IR system cope up with this problem by capturing the semantics of the terms using the dynamic weight assignment techniques.
In this paper, we present various design metrics for RST based IR systems to measure the retrieval relevancy and improve performance of the IR systems. We also present validation of the proposed metrics by taking various case studies. We have made base of our research work the architecture of RST based IR system presented in (Shoaib and Shah, 2006). Case studies to validate the proposed design metrics have also been taken from the work presented by (Shoaib and Shah, 2006).
The rest of the paper is organized as follows: In section 2, we describe the architecture of RST based IR system based upon which we present the design metrics and their validation. In section 3, we present materials and methods. In section 4, we present results and discussions and in section 5 we conclude our work and present future directions.
Architecture of RST based IR system: Since, we have selected the architecture of RST based IR system proposed by Shoaib and Shah (2006) as a base of our research work, therefore, in this section we describe this architecture in detail. They proposed an indexing technique with dynamic weight assignment technique based on RST. They used the punctuation marks and cue phrases (Marcu, 2000) (words that connect two or more text spans) to define the rhetorical relations. They constructed RST tree whose leaves represent the text spans and the internal nodes represent the rhetorical relations. Keywords extracted from the text spans are assigned dynamic weight such that the keywords in the text span closer to nucleus are assigned priority weights. With this dynamic weight assignment technique the precision rate has been clearly improvised. The system developed on the basis of this approach is able to capture the semantics of a document in an efficient way.
The architecture (given by Shoaib and Shah, 2006) of their prosed model consists of four different modules: 1) Segmentor 2) Rhetorical
The purpose and functionality of each component is defined as followed.
Segmentor: Text structure is usually organized into words, sentences and paragraphs. There is a correlation between these different text spans which is needed to be identified. To handle the structure in the text there is a need to identify the boundaries in the structure. This is the primary function of text segmentation. Text segmentor provides the structural information which is utilized for text analysis.
Rhetorical relation finder: In the next step the indexer automatically finds out the rhetorical relations. For this purpose, concept of cue phrases and punctuation is used to get the relationship between every text span. RST tree: From the list of text spans and the list of relations generated in the previous stages Rhetorical Tree is generated whose nodes represent the relations and leaves represent the text spans. Based on the height of the tree, initial weight is assigned to text spans. The indexer utilizes the keywords and their corresponding initial/dynamic weight to populate its knowledge base.
Indexer: Indexer uses the concept of strong and weak nodes to assess the initial weight. This initial weight assessment is used for dynamic weight assignment. The initial weight is taken between 0 and 1. Root node is assigned 1 whereas the nucleus and satellite to the parent node are assigned 0.9 and 0.5 respectively. These weights are variable. The weight assigned to child node is generalized by the following formula. Initial weight of child node = Weight of child node * Weight of parent node. After calculating the initial weight of all children nodes these weights are associated with the index terms existing in the text spans. On the basis of initial weight assessment and term frequency, the actual weight of index terms is calculated as follows:
Actual weight of the index term = Initial weight assessment* Term frequency. Indexer maintains the knowledge base by saving the document ID, vocabulary ID and actual weights. Document ID keeps track of which word belongs to which document. Vocabulary ID assures that there is no redundancy of words in the knowledge base thereby consuming less space. Dynamic weight indicates the semantic based occurrence of important index terms.
MATERIALS AND METHODS
In this section, we propose a number of metrics related to following basic components of RST based IR system.
2. Rhetorical relation finder
3. Rhetorical tree parser
4. RST based indexer
Number of segments metric: Number of Segments (NoS) is one of the proposed metrics to improvise performance of IR systems. It is for segmentor that is one of basic components of RST based IR systems. This metric calculates the total number of segments into which a text document can be divided. This metric is defined by taking the summation of no. of cue phrases and punctuation marks identified (since these two parameters are used to divide a text into various segments). A segment is defined as any elementary unit such as a keyword, a line of text or a paragraph itself.
There is no way defined yet to validate the correct number of segments in advance. The output of the first component of the architecture is the list of the individual segments. This list is further fed into the next component which is the rhetorical relation finder. This component works on the list of segments to identify the rhetorical relations correspondingly. The benefit of computing the no. of segments in advance gives the exact idea of how much memory will be required to store the segments. And also, the no. of segments will be helpful to judge the structure of the RST tree in advance and hence the corresponding rhetorical relations.
Mathematically, the No. of Segments metric is defined as follows:
NoC = Number of Cue Phrases identified
NoP = Number of Punctuation Marks Identified
Rhetorical relation distance metric: There can be two ways to measure the distance between two text spans. First method is to calculate the physically linear distance between two text spans. This syntactically linear distance means the no. of consecutive sentences that occur between two text spans. The other way is to compute the semantic distance between text spans. This distance can be determined by computing the rhetorical relation distance (RRD) metric. This metric determines the semantic distance between two text spans in terms of no. of internal nodes of a discourse graph. The nodes of the graph represent the text spans. The computation of rhetorical distance helps in analyzing the strength of a rhetorical relation. As greater the value of the RRD metric, weaker would be the rhetorical relation and vice versa. Hence, RRD = 0 means strongest relation (Direct relation) =1 means good relation (Indirect relation) greater than1 means weak relation (Indirect relation)
A weaker rhetorical relation indicates that there is an indirect relation between two nodes where each node is representing a text span. A Rhetorical relation will be considered stronger if it is a direct relation between two nodes. Rhetorical relation strength can help in the construction and improvisation of the rhetorical structure tree.
Suppose there are two nodes in the RST graph Ni and Nj and the level of any node is represented as L with the indexes i and j representing the specific level number. We assume that index Lj always indicate a deeper level as compared to level Li. Index k represents the particular node number at a specific level then mathematically the rhetorical relation distance between these two nodes will be Rhetorical Relation Distance (RRD) = Lj(Njk) - Li+1 (Nik).
Where Lxgreater than2 and x represents the number of the level At various depths the node having non sibling parents at same levels, the RRD will be computed as:
L2 = 2 + 1 = 3 a case
L3 = 3 + 2 = 5 a case 3[non-
L4 = 4 + 3 = 7 a As Above
And so on...
Metrics for indexer: This section defines some important measurements for the indexer component of the architecture.
Indexer size metric: There are various measures of the size of indexer. Some researchers have shown that the size of indexer is taken in terms of number of web pages crawled, while others emphasize on how much computer storage is required to support the index for measuring the indexer size(Brown, 1995, Henzinger, et al., 1999).Our proposed indexer size metric adopts a different approach. We have measured the size of the indexer in terms of the number of key terms taken from various documents that exist in a document corpse and stored in the indexer as index terms.
So the indexer size metric (S_idx ) is defined as, "The sum of index terms taken from individual documents of the document collection". Mathematically,
Where'd' represents a particular document in the document corpse 'D'. T_idxj represents the index term taken from a particular document and j is the counter for total number of index terms taken from a particular document.
Indexer term relevancy metric: The efficacy of the indexer mainly depends upon the degree of relevancy of the index terms. We define the index term relevancy (ITR) metric as "The degree of relevancy of the index term according to the various context in which the term is used in different documents"
Indexer ensures the semantically relevant results by assigning the initial weights to the nodes of the RST tree. Then it computes the dynamic weights of the index terms based on the initial weights and term frequency. Mathematically,
Where TFi indicates the term frequency of an index term and W_initi represents the initial weight of an index term.
User query relevancy metric: User Query Relevancy (UQR) metric can be defined as "Corresponding to a user's query how much relevant results are returned by indexer".
The UQR can be computed in terms of the major operation performed in the indexer searching process. This is the process of finding the index terms with maximum dynamic weight so we compute this major operation using the max() function. When the user enters a query in natural language form then the max() function is computed against each term in the user entered query to search out the most relevant index terms (terms having maximum dynamic weights).Mathematically
Quality of indexer metric: Another important metric that is useful for measuring the effectiveness of the indexer is the Quality of Indexer (QI) metric. When a user enters a query for which there can be multiple answers to choose from, then the most important issue related to indexer effectiveness is whether it indexes the information that are more useful from user perspective or not. Another reason for considering the quality of the indexer is the fact that the size of indexer is growing at a slower rate as compared to the rate of increase in amount of the web contents. Due to storage and processing power limitations of indexer it becomes an important issue for the indexer to index the most useful and relevant documents. In addition to this, the search engines should index the useful documents in some ranked order so that the user may get the most relevant information rather than getting too many results most of which are irrelevant to the user query. Hence, the indexer with high quality indexes is the growi
ng need of the users.
Q(I) in a broader perspective can be considered at two tiers:
a. The overall quality of the documents indexed
b. The average quality per indexed document Suppose every document indexed by the search engine is given a weight w (d). For convenience, we assume the weights are scaled so that the sum of all weights is 1.If we refer I to the indexer and |D| to the set of all documents indexed by a search engine, then we define the quality of the indexer Q(I) as
Note that since w is scaled, we always have 0 [?] Q(I) [greater than or equal to] 1 . Let |D| indicates the set of documents indexed by I then the average quality per indexed document is:
Qavg(I) = Q(I) / |D|
This average quality per indexed document is a very good measure of how well an indexer selects documents for indexing purpose.
RESULTS AND DISCUSSIONS
In this section we validate the various metrics defined for the architecture of RST based IR systems:
Validation of NoS metric: We validate Number of Segments (NoS) for the following documents.
Table- 1. Documents taken as a sample text used by Shoaib and Shah (2006).
We have taken two documents (Doc 0 and Doc 1) as shown in table 1.
For Document 0: NoC = 1 NoP = 3
For Document 1: NoC = 8 NoP = 4
Then the total number of segments will be computed as follows:
= [([1 + 3] + [8 + 4])] = [(4 + 12)] = 16
The same no. of segments are shown by (Shoaib and Shah, 2006) which is in accordance to our NoS metric.
Validation of RRD metric: Taking an example as a case study from the work presented by (Bosma, 2005) the discourse structure shown in figure 2 is converted to the RST graph in figure 3.
The nodes of the RST graph in figure 3 represent the text spans in terms of nuclei and satellites. The arrows represent the relation between the nucleus and the satellite with the head pointing from the nucleus to the satellite. From this RST graph the RRD between the node 3C and 3A can be calculated as the
RRD between 3C and 3A = Lj(Njk) - Li+1 (Nik)
= 20 - (0+1)0
Hence, the number of internal nodes between 3C and 3A is 1 which shows that the rhetorical relation is good between 3C and 3A.
RRD between 3C and 3D = Lj(Njk) - Li+1 (Nik)
= 10 - (0+1)0
Hence, the rhetorical relation is direct or strong between 3C and 3D
Case No. 1
If two nodes at same level have a common parent node then the RRD will always be equal to 1. So, for nodes 3D and 3B in figure 4, RRD between 3B and 3D = 1
Case No. 2
The second exceptional case states that the RRD between two nodes 3F and 3A (figure 4) having sibling nodes as parent, will always be equal to 3. Hence RRD between 3F and 3A = 3
The third exceptional case states that the RRD between two nodes 3H and 3E (figure 5) existing at level 3 and having non-sibling nodes as parent, will be RRD between 3H and 3E:
Lx = x + (x-1) L3 = 3 + 2
Hence, the number of internal nodes between 3H and 3E is 5, which indicates that the relation between node 3H and 3E is indirect and weak.
Validation of S_idx metric: We suppose that the document corpse consists of two documents as shown in table 1. The size of indexer is calculated as follows: We have |D| = 2
For Doc 0:
D0(T_idx) = 19
For doc 1: d1(T_idx) = 21
S_idx = 0 + [ 19 + 21 ] S_idx = 0 + 40
S_idx = 40
Hence, the size of indexer is 40 index terms which is in accordance to the indexer proposed by (Shoaib and Shah, 2006).
Validation of ITR metric: To validate ITR metric we take the case study as shown in table 2. Dynamic weights vary each time according to the context in which the index term is used in various documents. For example, the user provides the query to search out information about "lyrics day". In the indexer, as shown in table 2, the index term "day" is assigned the dynamic
weight 5.39 in document 0 but it is assigned dynamic weight 2.24 in document 1 according to the context in which it is used. Higher the value of the dynamic weight, the more the relevancy. Therefore, document 0 is returned in response to user query.
Validation of QI Metric: Indexer quality metric is validated by the following case study: Suppose the user queries the Google search engine to get information about the keyword "dooms day" with the intention to get the information about the Day of Judgment. The result of the Google search engine in response to the user query is shown in the figure
The search engine retrieved 5520000 results against the user query. For the case study we are taking only the top ten results shown in the ranking order.
After analyzing the contents of the documents the following scaled up weights are assigned to the top ten documents:
d1= 0.10 d2= 0.075 d3= 0.050
d4= 0.20 d5= 0.025 d6=0.075
d7= 0.30 d8= 0.025 d9=0.050
then the quality of indexer is computed as:
Q(I) = (0.10 + 0.075+ 0.050+ 0.20 + 0.025 +0.075+ 0.30 + 0.025 + 0.050 + 0.075)
Q(I) = 0.9
Which indicates that 0 [?] Q(I) [greater than or equal to] 1
The value of Q(I) closer to 1 means the indexer
has indexed valuable and relevant documents
Proceeding further we get the average quality per indexed document as
Qavg(I) = w(I) / |D|
= 0.9 / 10 = 0.09
Conclusion and Future Work: We have proposed a number of metrics for the RST based IR systems. Number of Segments(NoS) shows that the correct number of segments can be validated in advance leading to better decision on the storage capacity and better understanding of the rhetorical relations and the RST tree construction. The next metric we define is related to the semantic distance computation between any two nodes of the RST graph. Then we propose a number of metrics related to the most important component of the RST based IR architecture i.e. Indexer. We propose the size of indexer metric based on the number of key terms indexed and validate it to show that the size of indexer metric is directly related to the efficiency of the indexer. Indexer Term Relevancy metric is based on the value of maximum dynamic weight. The degree of the relevancy indicates better performance of the indexer. Higher the degree of term relevancy better will be the efficiency of the indexer.
Extending the same concept, we propose a nd validate the user query relevancy metric which computes the efficacy of the indexer to return most relevant results in response to the user query. Finally, we propose the quality of indexer metric based on the weight assigned to the various documents indexed.
A challenging issue regarding the enhancement of this research is to automate the results of the metrics so that when the user enters a query these metrics are automatically computed to show the performance of the search engine. Another issue which can be addressed in future is to conduct the comparative study of the RST based IR systems based on the comparison of the performance of the RST based systems with the other Information Retrieval systems using the metrics defined. An automated model can be designed for the comparative analysis of the RST based IR systems with other IR systems. This model will have an interface like the search engine to accept the user query and to return the comparative performance statistics of the RST based IR systems.
Arnt, A., S. Zilberstein, J. Allan and A. Mouaddib. Dynamic Composition of Information Retrieval Techniques, Journal of Intelligent Information Systems, 23:67-97 (2004).
Bosma, W. Query Based Summarization using Rhetorical Structure Theory, 15th Meeting of CLIN, 29-44, (2005).
Brown, E. W. Execution Performance Issues in Full-Text Information Retrieval, Technical Report, 95-81, (1995).
Henzinger, M. R., A. Heydon, M. Mitzenmacher and Najork, M. Measuring Index Quality Using
Random Walks on the Web, Comput. Netw, 31: 1291-1303 (1999).
Jun'chi. F and J. Tsujii. Breaking down rhetorical relations for the purpose of analyzing discourse structures, 15th International Conference on Computational Linguistics, Kyoto, 1177-1183 (1994).
Liu, P., Z. Zhu and L. Zhao. Research on Information Retrieval System Based on Ant Clustering Algorithm, Journal of software, 4:1032-1036 (2009).
Mann, W. C. and S. A. Thompson. Rhetorical Structure Theory: description and construction of text structures, Information Sciences Institute, Nijmegen, The Netherlands, ISI/RS-86-174,1-15 (1986).
Mann, W. C. and S. A Thompson. Rhetorical Structure Theory: A Framework for the Analysis of Texts, IPRA Papers in Pragmatics 1:79-105 (1987).
Mann, W. C. and S. A. Thompson. Rhetorical Structure Theory: description and construction of text structures, In Natural Language Generation: Recent Advances in Artificial Intelligence, Psychology, and Linguistics (G. Kempen, Hrsg.), Boston/Dordrecht: Kluwer Academic Publishers, 85-96 (1987).
Marcu, D. The Rhetorical Parsing of Natural Language Texts, The Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, (ACL'97/EACL'97) Madrid, Spain, 96-103 (1997).
Marcu, D. The Theory and Practice of Discourse Parsing and Summarization, 1st Edn., The MIT press, Cambridge, MA., ISBN- 10: 0262133725, 268, (2000).
Mathkour, H. I., A. A. Touir and W. A. Al-Sanea. Parsing Arabic Texts Using Rhetorical Structure Theory, Journal of Computer Science, 4:713-720 (2008).
Papka, R., J. P. Callan, and A. G. Barto. Text Based Information Retrieval Using Exponential Gradient Descent, proc. NIPS, X, 3-9 (2008).
Petratos, P. Information Retrieval Systems: A Perspective on Human Computer Interaction, Issues in Informing Science and Information Technology, 3:511-518 (2006).
Shoaib, M. and A. A. Shah. A new indexing technique for IR systems using Rhetorical Structure Theory, journal of computer science, 2006.
Taboada, M. and W. C. Mann. Applications of Rhetorical Structure Theory, Discourse Studies, 8:567-588 (2005).
Taboada, M. and W. C. Mann. Rhetorical Structure Theory: Looking Back and Moving Ahead, Discourse Studies, 8:423-459 (2006).
Department of Computer Science & Engineering, Lahore, Pakistan, *Department of CS, Virtual University of Pakistan
|Printer friendly Cite/link Email Feedback|
|Publication:||Pakistan Journal of Science|
|Date:||Sep 30, 2010|
|Previous Article:||AN EFFICIENT INDEXING AND SEARCHING TECHNIQUE FOR INFORMATION RETRIEVAL FOR URDU LANGUAGE.|
|Next Article:||RATIONAL USE OF 5-FLUOROURACIL IN CHEMOTHERAPY AND BREAST Cancer.|