
Methods of assessing the relevance of database query results.

UDC 004.827

Introduction. The volumes of stored and processed data are regularly observed to grow exponentially. This places special requirements on the methods and tools for searching and processing data [1].

The relevance of query results is one of the indicators that characterize the quality of information retrieval. Relevance means the semantic match between the search query and the result [2]; it characterizes the extent to which the content found by information retrieval satisfies the corresponding information request. Approaches to calculating relevance differ from case to case [3 ... 5]. Here we propose to treat relevance as a quantitative measure of how well a search result complies with the query. Low relevance of a query result set is a consequence of uncertainty in the request or in the parameter values of the searched object.

When searching for objects, we face two sources of uncertainty: query uncertainty and object description uncertainty [6]. Query uncertainty may include the semantic ambiguity of text data, while object description uncertainty covers measurement uncertainty, text data uncertainty, errors in processing characteristics, etc. One of the most common types is uncertainty in the description of an object's temporal characteristics, e.g. dates of events or the dating of historical exhibits. This uncertainty shows up in cases where the time range of an event is artificially expanded.

Analysis of recent research and publications. Reference [1] discusses the possibility of searching for information on the Internet directly from mobile phones. The proposed search strategy minimizes the total volume of relevant documents and ranks the documents found, with the aim of improving the system's efficiency and accuracy. Reference [2] examines the main factors that influence relevance, considering in detail one of the algorithms for determining the relevance of a document to a formulated request, as well as the impact of search engines' own resources. Reference [3] discusses current methods of calculating the relevance of text fragments on the basis of topic model analysis, for the subsequent construction of annotations in the form of extracts, i.e. annotations consisting entirely of a sequence of original text fragments. It suggests a new method of calculating text fragment relevance based on an assessment of the balance of topics within the normalized topic space obtained through non-negative matrix factorization (used as the matrix decomposition in the latent semantic analysis model). Reference [4] is devoted to an approach to finding solutions in knowledge bases using document metadata, where a document's relevance is estimated with a set of metrics that formalize the proximity of semantic networks. Reference [5] proposes a method for assessing the relevance of a text response in computer-based training systems. Reference [6] considers fuzzy database queries, query uncertainty and object description uncertainty.

The Aim of the Research is to develop a methodology for quantifying the relevance of query results. It is proposed to use fuzzy sets to describe objects and database queries in order to facilitate relevance evaluation.

Main Body.

Describing temporal characteristics to evaluate the query relevance. Very often it is known only approximately when the searched event occurred. A correct description of a historical object's temporal characteristics substantially influences the subsequent representation of historical events. Both an unclear description of the temporal characteristics and the use of different formats in object descriptions hinder further analysis, search and evaluation of the time period of historical events.

Various formats are used to describe temporal characteristics: an exact date / time, e.g., March 19, 1946; a time interval, e.g., 336 ... 323 BC; various terms with different degrees of detail, e.g., the second half of the 3rd century BC, the last third of the 2nd century BC. Such heterogeneous descriptions make it difficult or even impossible to search and group objects by temporal characteristics.

To solve this problem, it is proposed to describe the temporal characteristics of objects and queries in the form of fuzzy variables.

We take the triple (PO, T, MTo) as the object's fuzzy variable, where PO is the variable's name, T is the universal set, and MTo is a fuzzy subset of T. The query's fuzzy variable corresponds to the triple (PZ, T, MTz), where PZ is the variable's name, T is the universal set, and MTz is a fuzzy subset of T.

The fuzzy set of temporal characteristics MT is defined as a set of ordered pairs $MT = \{\mu_{MT}(t)/t\}$, where MT is the fuzzy set of temporal characteristics, $\mu_{MT}(t)$ is the membership function, and t is a time value [7].

The membership function in most cases has a trapezoidal shape (Fig. 1), defined by four temporal parameters $a \le b \le c \le d$. The smaller the differences between a and b and between c and d, the closer the fuzzy variable is to a crisp one. When the fuzzy variable becomes crisp, the membership function takes a rectangular form, with a=b and c=d. When the temporal characteristic reaches maximum fuzziness, the membership function in most cases takes a triangular shape, with b=c. That is, of a triangular and a trapezoidal function covering the same time span, the triangular one carries the larger uncertainty.
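The trapezoidal membership function described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `Trapezoid`, the method `membership`, and the convention of writing BC years as negative numbers are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trapezoid:
    """Trapezoidal fuzzy time value with a <= b <= c <= d (hypothetical helper).

    Years BC are written as negative numbers (an assumption of this sketch).
    """
    a: float
    b: float
    c: float
    d: float

    def __post_init__(self):
        if not (self.a <= self.b <= self.c <= self.d):
            raise ValueError("parameters must satisfy a <= b <= c <= d")

    def membership(self, t: float) -> float:
        """Degree, in [0, 1], to which time t belongs to the fuzzy set."""
        if self.b <= t <= self.c:            # plateau: full membership
            return 1.0
        if t <= self.a or t >= self.d:       # outside the covered time span
            return 0.0
        if t < self.b:                       # rising edge, a < t < b
            return (t - self.a) / (self.b - self.a)
        return (self.d - t) / (self.d - self.c)  # falling edge, c < t < d


# Crisp interval "336 ... 323 BC": rectangular form, a = b and c = d.
crisp = Trapezoid(-336, -336, -323, -323)
# Trapezoidal form with illustrative (made-up) edges.
fuzzy = Trapezoid(-260, -250, -200, -190)
# Triangular form: b = c.
triangle = Trapezoid(-300, -250, -250, -200)

print(crisp.membership(-330), fuzzy.membership(-255), triangle.membership(-275))
```

Checking the plateau first means the crisp case (a=b, c=d) never reaches the edge formulas, so no division by zero occurs.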

Evaluating the query and result relevance.

We distinguish the key relevance types according to how the found object relates to the request: the object does not match the query at all; the object fully matches the query; the object partially matches the query.

1. The found object is completely inconsistent with the query (Fig. 1). This occurs when the query finds no object that coincides with at least one value of the request, i.e. the membership functions of the object and the query do not intersect. In this case it is proposed to calculate the degree of remoteness between the found object and the query:

$$DR = \frac{|b_i - c_j| + |a_i - d_j|}{2}, \tag{1}$$

where DR--degree of remoteness (divergence) between the found object and the request parameters;

i--index marking the parameters of the query's fuzzy variable;

j--index marking the parameters of the object's fuzzy variable;

$a_i$, $b_i$, $c_i$, $d_i$--parameters of the query's fuzzy variable, satisfying the condition $a_i \le b_i \le c_i \le d_i$;

$a_j$, $b_j$, $c_j$, $d_j$--parameters of the object's fuzzy variable, satisfying the condition $a_j \le b_j \le c_j \le d_j$.

The greater the remoteness between the found object and the query, the less the found object matches the query.
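Formula (1) can be evaluated directly. The sketch below implements it exactly as printed; the function name `degree_of_remoteness`, the tuple layout `(a, b, c, d)`, and the sample dates are assumptions made for illustration. As printed, the formula compares the query's left parameters with the object's right parameters, so the example places the object earlier in time than the query.

```python
def degree_of_remoteness(query, obj):
    """Formula (1): DR = (|b_i - c_j| + |a_i - d_j|) / 2.

    query = (a_i, b_i, c_i, d_i), obj = (a_j, b_j, c_j, d_j);
    both are hypothetical 4-tuples of years (BC as negative numbers).
    """
    a_i, b_i, _c_i, _d_i = query
    _a_j, _b_j, c_j, d_j = obj
    return (abs(b_i - c_j) + abs(a_i - d_j)) / 2


query = (-300, -280, -220, -200)
near_object = (-500, -480, -420, -400)   # ends 100 years before the query starts
far_object = (-700, -680, -620, -600)    # ends 300 years before the query starts

print(degree_of_remoteness(query, near_object))  # 120.0
print(degree_of_remoteness(query, far_object))   # 320.0
```

The more distant object yields the larger DR, in line with the statement that greater remoteness means a poorer match.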

2. The found object is completely consistent with the query. This occurs when the query finds an object that coincides with all of the request's parameters.

3. The found object is partially consistent with the query:

--The query fully absorbs the found object, i.e. the found object matches the request in all of the object's parameters, but the request contains some parameters not represented in the found object. This can happen when the request is formulated with high uncertainty or when the object's parameters are defined more precisely than those requested.

--The found object completely absorbs the request, i.e. the found object matches the request in all of the request's parameters, but contains some parameters not represented in the request. This can occur when the object has high uncertainty or when the request is formulated more precisely than the object's features.

[FIGURE 1 OMITTED]

--The found object partially overlaps the request, i.e. the found object coincides with the query in only some of the request's values. In these cases, when the found object corresponds to the request only partially, it is proposed to calculate the relevance as

$$P = \frac{S_I}{S_{NI}},$$

where P--relevance;

$S_I$--area of the intersection of the object and the query;

$S_{NI}$--area of the region in which the object and the query do not coincide.
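The case analysis above (no match, full match, and the three kinds of partial match) can be sketched as a small classifier. The text does not state how the cases are detected from the parameters, so the comparisons below are an assumption: containment is tested parameter by parameter, and the function name `classify_match` and its labels are hypothetical.

```python
def classify_match(query, obj):
    """Classify how a found object relates to a query.

    query and obj are hypothetical 4-tuples (a, b, c, d) with a <= b <= c <= d.
    The decision rules are an illustrative assumption, not taken from the paper.
    """
    qa, qb, qc, qd = query
    oa, ob, oc, od = obj
    if od <= qa or oa >= qd:
        # Covered time spans do not overlap (touching at one point counts as no overlap).
        return "no match"
    if tuple(query) == tuple(obj):
        return "full match"
    if qa <= oa and qb <= ob and oc <= qc and od <= qd:
        return "query absorbs object"
    if oa <= qa and ob <= qb and qc <= oc and qd <= od:
        return "object absorbs query"
    return "partial overlap"


query = (-300, -300, -201, -201)  # crisp encoding of a century, chosen for illustration

print(classify_match(query, (-500, -500, -401, -401)))
print(classify_match(query, (-300, -300, -201, -201)))
print(classify_match(query, (-250, -240, -220, -210)))
print(classify_match(query, (-400, -400, -201, -201)))
print(classify_match(query, (-250, -250, -101, -101)))
```

Under these assumptions the first case would be scored with the degree of remoteness, the second needs no score, and the remaining three with the relevance P.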

We now carry out a series of transformations:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],

where $S_{PO}$--object area;

$S_{PZ}$--query area;

S--area of the region covering both the object and the query;

$a_k$, $b_k$, $c_k$, $d_k$--parameters of the intersection region, satisfying the condition $a_k \le b_k \le c_k \le d_k$. Therefore the relevance of the query result is

$$P = \frac{d_k - a_k + c_k - b_k}{d_j - a_j + c_j - b_j + d_i - a_i + c_i - b_i - 2d_k + 2a_k - 2c_k + 2b_k}. \tag{2}$$
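The intermediate transformations are not reproduced in this copy of the article. One reading that leads to formula (2) is sketched below. It rests on two assumptions that are not confirmed by the surviving text: each region is treated as a trapezoid of unit height, so its area is half the sum of its two parallel sides, and the non-coincidence area is what remains of the covering region S after the intersection is removed.

```latex
\begin{aligned}
S_{PO} &= \tfrac{1}{2}\bigl[(d_j - a_j) + (c_j - b_j)\bigr], \\
S_{PZ} &= \tfrac{1}{2}\bigl[(d_i - a_i) + (c_i - b_i)\bigr], \\
S_{I}  &= \tfrac{1}{2}\bigl[(d_k - a_k) + (c_k - b_k)\bigr], \\
S      &= S_{PO} + S_{PZ} - S_{I}, \\
S_{NI} &= S - S_{I} = S_{PO} + S_{PZ} - 2 S_{I}, \\
P &= \frac{S_{I}}{S_{NI}}
   = \frac{d_k - a_k + c_k - b_k}
          {d_j - a_j + c_j - b_j + d_i - a_i + c_i - b_i
           - 2d_k + 2a_k - 2c_k + 2b_k}.
\end{aligned}
```

The common factor of one half cancels between numerator and denominator, which is why it does not appear in formula (2).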

The smaller the relevance value, the less the found object complies with the query. Fig. 2 shows the functions where the query completely covers the object. In Fig. 2, a the area of the query-object intersection is larger than in Fig. 2, b, since the request completely covers the object and the object area in Fig. 2, a is larger than the object area in Fig. 2, b. In addition, the area where the object and the request do not intersect is smaller in Fig. 2, a than in Fig. 2, b. Thus, the larger the query-object intersection area and the smaller the area in which the object and the request do not intersect, the better the relevance.
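The behaviour described for Fig. 2 can be checked numerically with formula (2). The sketch below is illustrative only: the function names, the tuple layout `(a, b, c, d)` and the sample dates are assumptions, the intersection parameters are passed in by the caller because the text does not say how they are obtained, and returning infinity when the non-coincidence area is zero is a choice made here for the fully matching case.

```python
import math


def trapezoid_area(p):
    """Area of a unit-height trapezoid given as (a, b, c, d)."""
    a, b, c, d = p
    return ((d - a) + (c - b)) / 2


def relevance(query, obj, inter):
    """Formula (2): P = S_I / (S_PO + S_PZ - 2 * S_I).

    query = (a_i, b_i, c_i, d_i), obj = (a_j, b_j, c_j, d_j),
    inter = (a_k, b_k, c_k, d_k), supplied by the caller.
    """
    s_i = trapezoid_area(inter)
    s_ni = trapezoid_area(obj) + trapezoid_area(query) - 2 * s_i
    if s_ni == 0:
        return math.inf  # object and query coincide; assumption of this sketch
    return s_i / s_ni


query = (-300, -300, -201, -201)
wide_object = (-290, -280, -220, -210)    # fills most of the query
narrow_object = (-260, -255, -245, -240)  # fills a small part of the query

# The query covers both objects, so the intersection is the object itself.
p_wide = relevance(query, wide_object, wide_object)
p_narrow = relevance(query, narrow_object, narrow_object)
print(p_wide, p_narrow)
```

With these made-up numbers the wide object has the larger intersection and the smaller non-coincidence area, so `p_wide` exceeds `p_narrow`.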

[FIGURE 2 OMITTED]

Fig. 3 shows the functions where the object completely covers the query. In Fig. 3, a the area of the query-object intersection is larger than in Fig. 3, b, and the area where the object and the request do not intersect is smaller in Fig. 3, a than in Fig. 3, b. The area where the object and the request do not intersect is greater in Fig. 3, b than in Fig. 2, b; thus the relevance of the query shown in Fig. 3, b is worse than that in Fig. 2, b.

[FIGURE 3 OMITTED]

Fig. 4 shows the functions where the object partially covers the query. Of the examples presented, the best relevance belongs to the query shown in Fig. 4, e, as it has the largest area of object-request intersection and the smallest area in which the object and the request do not intersect. The worst relevance is that of the query shown in Fig. 4, d, as it has the smallest intersection area and the largest area in which the object and the request do not intersect. In Fig. 4, a and Fig. 4, b the intersection areas are the same, but the relevance in Fig. 4, a is better, since the area in which the object and the request do not intersect is much smaller in Fig. 4, a than in Fig. 4, b.

Results. This paper considers several particular cases of correspondence between an object and a query. For objects that do not match the required parameters at all, it is proposed to calculate the degree of remoteness between the found object and the request by formula (1). The results confirm that the larger the distance between the found object and the request, the greater the degree to which the found object fails to match the request. For objects that partially comply with the request, it is proposed to calculate the relevance using formula (2). The study showed that the lower the relevance value, the less the found object corresponds to the request. The study included a search among the exhibits of an archaeological museum's antiquity department (Ancient Greece). The request was to find artifacts dated to the 3rd century BC. The resulting sample of thirteen objects included two objects fully complying with the request: Terracotta 'Tanagra' figure of a woman wearing a sunhat (3rd century BC) and Red-figure Pelike. Attica (330-320 BC), and two objects that partially match the request: Aphrodite. Terracotta (4th-3rd century BC) and Vessel in the form of a horse's head (3rd-2nd century BC). Untrained users who performed the automated search spent about 3 minutes on familiarizing themselves with the search principle, filling in the query data and running the search itself.

[FIGURE 4 OMITTED]

Conclusions. This paper describes a methodology for quantifying the relevance of query results. The suggested information technology uses fuzzy sets to describe objects and database queries in order to facilitate searching and grouping objects by temporal characteristics, as well as evaluating the relevance of query results. The methodology describes three types of compliance of a found object with the query: the found object is fully inconsistent with the request, the found object is fully compliant, or the found object partially corresponds to the request. The presented method allows quantitative evaluation of the quality of query results: for objects that are fully inconsistent with the required specification, the degree of remoteness between the found object and the request is calculated; for objects partially matching the request, the relevance index is calculated.

DOI 10.15276/opu.1.45.2015.20

References

[1.] Lukina, A.G. (2007). Requirements to systems for searching information in the internet with the use of a mobile phone as a final device. Nauchno-Technicheskaya Informatsiya: Seriya 1, 8, 23-26.

[2.] Lyudkevich, S. and Esipov, E. (2003, November). The main factors that determine relevance. PromoTechart. Retrieved from http://www.promo-techart.ru/analysis/relevants.htm

[3.] Mashechkin, I.V., Petrovskiy, M.I. and Tsarev, D.V. (2013). Methods of text fragment relevance estimation based on the topic model analysis in the text summarization problem. Numerical Methods and Programming, 14(1), 91-102.

[4.] Karpenko, A.P. and Trusonoshin, V.A. (2013). Multi-criteria estimation of the relevancy of documents in the enterprise ontological knowledge base using thematic clusterization. Science and Education, 11. DOI: 10.7463/1113.0637857

[5.] Badorina, L.N. (2007). Method of the relevance degree estimation of the text answer in computer training systems. Proceedings of the National Aviation University, 31(1), 70-72.

[6.] Konovalov, D.P. (2010). On the question of fuzzy queries to relational databases. Perspektivy Razvitija Informacionnyh Tehnologij, 2, 87-92.

[7.] Pedrycz, W. and Chen, S.-M. (Eds.). (2013). Time Series Analysis, Modeling and Applications: A Computational Intelligence Perspective. Heidelberg: Springer.


Received October 15, 2014

V.A. Krisilov, DEng, Professor, E.A. Gorodnichaya, Odessa National Polytechnic University
COPYRIGHT 2015 Odessa National Polytechnic University
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2015 Gale, Cengage Learning. All rights reserved.

Article Details
Author:Krisilov, V.A.; Gorodnichaya, E.A.
Publication:Odes'kyi Politechnichnyi Universytet. Pratsi
Article Type:Report
Date:Mar 1, 2015