A NATURAL LANGUAGE BASED RETRIEVAL SYSTEM FOR THE DATA WAREHOUSE.
ABSTRACT: Natural Language Interfaces to Databases (NLIDB) are used to facilitate non- technical users. Currently, data warehouses are widely used by enterprises for decision-making. The users of such decision-making systems are top management who are normally non-technical having less knowledge of the data warehouse schema and about writing technical queries.
This paper proposes a Natural Language Retrieval System for Data Warehouse (NLRSDW) that facilitates executives to interact with Data Warehouse in Natural Language. This system uses the Online Analytical Processing (OLAP) engine to efficiently and accurately generate the results. The use of ME/R model, data graph, utilization of Materialized Views (MVs) and query optimization technique supports in good performance of the system. A comparative experimental analysis with existing systems shows that NLRSDW has reasonable stance.
Key words: NLIDB, Natural Language Processing, Data Warehouse, NLRSDW, Natural Language Retrieval System for Data Warehouse.
The NLIDB facilitate users to write their request in natural language rather than technical database query language. Several NLIDBs have been developed in the literature (Androutsopoulos et al. 1995), (Chaudhuri et al., 2005), (Jeffrey et al., 2010), (Navin et al., 2003), (Pinchak and Lin, 2006).Our intention in this work is to build a natural language retrieval system to support non- technical users unlike other Business Intelligence (BI) tools that mostly focus on expert users. Several NLIDBs have been proposed in the literature ((Huang et al., 2008), (Popescu et al., 2004), (El-Mouadib et al., 2009), (Kacholia et al., 2005), (Bhalotia et al., 2002), (Agrawal et al., 2002), (Hristidis, 2002), (Hristidis et al., 2003).
Ju Fan et al., 2011) present a keyword-based system SQLSUGG that generates SQL query. It explores the data dictionary, data instances and builds templates which are ranked based on relevance to keyword query. Later, several queries are constructed for each template. The SQL queries then also ranked according to keyword and attribute matching. (El-Mouadib et al., 2009) proposes a generic natural language interface to database. It provides facility to query the system in multiple languages. Keyword search in aggregations is the problem directly related to data warehousing. A keyword search on aggregations in relational databases is appeared in literature (Bin and Jian, 2009).
Authors propose a technique to find minimum group-bys containing query keywords. (Tata and Lohman, 2008) present an approach to pose aggregated keyword query to the system and get aggregated results. SQAK (SQL Aggregates using Keywords) is such a system that transforms aggregated keyword query to aggregated results. (Bhalotia et al.,2002) present a natural language interface for database termed as BANKS.
The BANKS system facilitates keyword searching in data and browsing in schema. It also takes the tuples as nodes and foreign key relationship as edges among nodes. The result is presented as rooted tree containing all keywords. Inverted index file is used in DBX plorer by (Agarwal et al., 2002) stores matching rows for each keyword similar as web server maintains an inverted index file for searching all documents contains a particular keyword. Such a file in DBX plorer is termed as Symbol table. (Hristidis and Papakonstantinou, 2002) also presents a keyword based system Discover with discriminatory characteristics. It works in two phases.
In first phase, all candidate networks (or join expressions) are generated. In the second phase, candidate networks are evaluated in which common sub- expressions are identified and a minimum required candidate networks are chosen.
The Data Warehouse is the advanced form of the database which is used for decision-making by the executives (Chaudhuri and Dayal, 1997).The users of decision-making systems are top management (executives) who are normally non-technical having less knowledge of the data warehouse schema and about writing database technical queries(Bargui et al., 2009).In fact, ad-hoc query is the information need of user that may not be fulfilled with front-end tools having predefined capabilities e.g. Reporting, OLAP or Data mining tools etc. Such information needs are not easy to express in technical query language.
Therefore, Natural Language (NL) interface can facilitate the executives in their decision-making process. Currently, work is started to generate natural language interfaces for multi-dimensional databases. Our focus of research work is on the retrieval system for data warehouse. P. (Wu et al., 2007), propose a keyword- based natural language interface for data warehouse. The Keyword Driven Analytical Processing (KDAP) system takes keyword query from the user and divides it into keywords. Based on different interpretations of a keyword, a number of aggregations termed as star nets can be generated.
This system has several drawbacks including maximum I/Os usage, system not studied from performance perspective, do not use OLAP engine, requires user interaction, and does not rank measures. According to (Kuchmann-Beauger and Aufaure, 2011), (Kuchmann-Beauger and Aufaure, 2012) a natural language question answering system for data warehouse is proposed that not only provide results from the data warehouse but also use data mining techniques to retrieve relevant results from the web.
The has following drawbacks: The architecture does not provide the taxonomy used to perform the semantic analysis for the given user's specification of OLAP queries, System works on only aggregation queries but slicing and dicing and drill-down queries are not considered, accuracy and execution time analysis is not performed, and results are not compared with existing systems. (Naeemet al., 2012) propose a tool QueGen (Query Generator) to generate OLAP queries from natural language input. This tool semantically parses the NL query(Manning, 2011), (Cer et al., 2010) and maps to OLAP query. Drawbacks of this system are mapping details for a specific query is not provided, absence of technical element in the schema, ambiguity in mapping the value, incomplete query, dimension/attribute containing value, Performance and accuracy analysis is not performed.
MATERIALS AND METHODS
NL Retrieval System Architecture: The NLRSDW is composed of several components. Each component has significant contribution in processing of the input query from keyword separation to the generation of results on the interface. The architecture of the NLRSDW is shown in Figure 1.Following sections provide functional detail of each component given in the architecture.
Keyword Query: It is the form of input that user provides in NL. It can be any combination of keywords or phrase related to domain. The keyword query can be any open query similar to the query written in search engine. When we allow user to input open query, full phrase matching has rare chances.
So by mapping keywords to schema elements and/or data instances, retrieval of data with maximum accuracy is possible.
Throughout this paper, we use examples from the Adventure Works data warehouse that comes with Microsoft SQL Server. Let us take an example keyword query (Query 1) taken from [Wu et al., 2007]. Query 1: California mountain bikes.
This query consists of 3 keywords that are separated by the system on spaces. So keywords to be searched from the data warehouse schema and data instances are California, mountain and bikes. Multi-Dimensional Entity Relationship (ME/R) Model: The data warehouse Reverse Engineering (RE) provides the ME/R model or star schema. It provides schema in XML format which can be explored for keyword matching further.
The RE process returns XML file. The ME/R model should be periodically updated automatically. For this purpose, application is scheduled to automatically update the model. Thus, the application takes metadata from the data warehouse and generate updated XML schema. The benefit of using ME/R model is that it is in-memory structure, thus support in efficient query processing. The ME/R model is represented graphically in result of data warehouse RE process (Figure 2).
Domain Thesaurus: The system maintains the domain thesaurus to translate query keywords. Thesaurus is initially built from the ME/R model containing entities and attributes. It is then extended from the data instances as well.
From database perspective, domain thesaurus adds attribute-value relationships plus adds synonyms from dictionary (e.g. Wordnet (Miller, 1995)). We add reference of ME/R element with each value to disambiguate the search. With addition of element's reference in thesaurus, occurrences of one word in multiple dimensions are recognized.
Data Graph: The idea behind the data graph is to boost up efficiency of searching process. Like initial load, data graph is required to be built first time. Later on, it can be updated by incremental load. Direct access to data warehouse takes multiple read/write operations. Data graph reduce these operations with in-memory processing. Direct searching in the Data Warehouse to retrieve results takes following I/Os:
* Parse the query
* Query plan generation
* Search the area of analysis
* Transfer each row in memory
* Transfer rows against drill-down analysis Searching in data graph to retrieve results require following I/Os:
* Search data graph for area of analysis
* Return part of the graph for selected MV.
* Transfer graph against drill-down analysis
Part of graph can be retrieved in single I/O. So searching of the data graph is more efficient.
Data Warehouse: Data Warehouse has different semantics than Online Transaction Processing (OLTP) systems. This research is specifically focusing on the semantics of the Data Warehouse for natural language processing. The differentiating characteristics of the data warehouse are aggregations, subject-oriented, non- volatile, time-variant and integrated data. Analytic OLAP operations including rollup, drill-down, slicing and dicing and pivoting have different nature than OLTP operations.
For experiments, Adventure Works is being used that is part of MS SQL Server. The reason of using this Data Warehouse is to compare with earlier system KDAP which also use Adventure Works. As Data Warehouse contains millions of records extend incrementally, query processing should be efficient to generate quick results. It is specially targeted in this work to generate efficient results against natural language input.
Multi-dimensional Query Topic (MQT): The NLRSDW match the keywords in specified (found elements using ME/R and domain thesaurus) Dimension/facts and their attributes/Measures. Based on the mapped values, ambiguities are resolved and relevant values are ranked. Finally, a graph structure is generated which we call as MQT.
An illustration of MQT is given in Figure 4 for Query 2: Query 2: Give me total revenue of july 2013 To execute MQT efficiently, a query plan is generated that executes those operations in earliest which reduce the dataset. This system ranks the similar matched values and chooses relevant contextual value. It is done automatically with zero user interaction. To provide full automation, ranking methods accuracy is a challenging task. The ranking method considers the context of value in user query, query templates, similarity of keyword query and element's values.
Identification of Aggregation Functions: The OLAP operations include rollup, drill-down, slicing and dicing and pivoting. In first three operations, results are required to be aggregated (Yang et al., 2009), (Zhou and Pei, 2009). Commonly used aggregation functions in DBMSs are SUM, AVG, MIN, MAX, SUB, MEAN, MEDIAN, MODE and STDDEV etc. Each OLAP query is composed of at least one aggregation function. Therefore it is a challenging task to identify aggregation function from user input. Let us analyse query 2:
Keyword 'Total'is mapped to SUM function that sums up revenue measure for the month of 'july' under the year '2013'. An important issue arising here, how aggregation keywords can be identified. In this work rules are defined on the basis of which machine automatically identify aggregation functions. As an implementation, following rule maps Total to SUM() aggregation function.
If( Keyword = 'total') then Return SUM(); Query Optimization: The final MQT is requiredto be executed efficiently. Thus, query optimization strategy generates a plan that executes the query with minimal resources such as I/Os and memory. An effective plan order operations containing projection, selection and join.
The Relational Algebra (RA) is generated by NL mapping that is further transformed in tree (Figure 5) to devise an efficient execution plan. The NLRSDW builds RA expression to use the OLAP engine functionality. The query tree will not include the selection operations from the dimensions because those already have selected during keyword mapping.
Aggregation Level Identifier: The Data Warehouse stores pre-aggregated data for efficient response time requirement. Instead of generating results from detailed data, it is beneficial to choose aggregation levels. The keyword query is mapped to attributes of dimensions or measures in facts. PSalesAmount((sEnglishMonthName ='July' UCalendarYear = '2013' (DimTime))DimTime.Timekey = FactResellerSales.TimekeyFactResellerSales).
So, mapped attributes generate grouping of data in specific order. Thus, correct identification of grouping attributes can be used to identify aggregation level because aggregations are built with GROUP BY attributes. Sometime required aggregation level exist that alone can be used to compute results. If required pre- aggregation level does not exist but multiple lower aggregation levels are maintained by the Data Warehouse. These can be used to calculate higher level.
For instance if 'day' level exists, that can be used to calculate 'month' level. This process is significant task of our research work. A significant issue at this stage is the selection of exact order of GROUP BY columns. So ambiguities are resolved with a ranking method.
Query Knowledgebase: The knowledgebase in the NLRSDW is maintained that saves successful queries or query fragments (Niculae et al., 2005). When user writes a query, the query is initially mapped in the knowledgebase to find the matching query or queries. If matching queries are found in the knowledgebase, all details of processing such as ME/R mapping, data graph mapping and MQT is also found in the knowledgebase.
In this way, processing details are reused that takes efficiency and accuracy in the system. Phrase matching retrieves minimum results because same phrases might not exist in the knowledgebase. Keyword-based mapping generates encouraging results. The resulting queries against keyword mapping are ranked based on maximum matches.
Domain thesaurus is helpful here to find synonymsfor keywords. If any query is not found for user input query, the query is processed in detail by the system. Finally, user input query is added in the knowledgebase. Thus knowledgebase is extended by learning process.
Ranking: Natural Language query translation generates multiple occurrences for a keyword. This is due to the fact that NL query does not impose fixed semantics similar to SQL. So, exact location (Dimension, fact or attribute) is not specified. Therefore, ranking is required to order instances for a keyword based on relevance. Ranking is required in following situations:
i. It is required to choose most relevant dimensions, attributes and instances.
ii. In fact table, ranking should be performed on most accurate measure(s).
iii. Ranking is also required to find exact aggregation level.
iv. Ranking is necessary to order GROUP BY attributes v. If multiple query topics are generated, then rank most appropriate query topic.
In each of above options, ranking criteria depends upon contextual parameters.
RESULTS AND DISCUSSION
The NLRSDW has been developed in C# .NET and MS SQL Server 2008. All experiments are performed on core i3 machine with 3GB RAM. For experiments, Adventure Works DW Data Warehouse is used which is part of MS SQL Server. The reverse engineered ME/R model is shown in Figure 6.
Table I: Characteristics of the AdventureWorksDW Schema
The Adventure Works DW contain thousands of records in each fact table. For instance, Fact Reseller Sales has 60,855 records. Table I depicts statistics of the Adventure Works DW.
In such a large volume Data Warehouse, keyword-based searching can be heavy activity. To tackle this issue, Domain Thesaurus and ME/R model reduce the area of search space to explore in the data instances. Further, data graph has been developed and used in the NLRSDW to load targeted sub-graph in main memory for efficient processing with minimum I/Os in comparison to direct searching in the data instances. The ME/R model developed by this system is shown in Figure 6.
Moreover, Query Optimization plan generates an efficient strategy to execute MQT using OLAP engine. We take a query to show processing of different modules in the NLRSDW.
Query 3: Show me all suppliers: The interface of NLRSDW is depicted in Figure 7. First, system maps Query 3 with the ME/R model (see Figure 6). 'Supplier' keyword is mapped to Dim Supplier and 'all' mapped SUM aggregation function.
After execution of query 3, results are shown in Figure 8.We took 10 queries that were executed on the Retrieval System. Based on analysis, 6 queries could not be mapped with any element of the ME/R. Two of them mapped with 35% schema elements and remaining were mapped to 25% elements. Based on this analysis, it has been concluded that most of the user queries contain values that can be mapped in data instances.
A set of queries from AdventureWorks is given in [Wuet al., 2007]. Reader can explore those queries to verify the analysis. Therefore, ME/R alone is very little helpful in mapping user keywords. As user has freedom to use open NL, it may not be possible to match a keyword with ME/R element which is not exactly the same as technical element. The keyword might be synonym of the element.
To increase the applicability of the ME/R, we use domain thesaurus which maintains possible synonyms of schema elements. Thus, a synonym keyword can also be mapped to ME/R element. For example, Product has synonym item, productline, commodity, goods etc. So any of these keywords can directly be mapped to dimension Product. For instance, only keyword sale is mapped to 6 elements in the ME/R for Query 4.
Query 4: Mountain Tire sale December November
Keywords December and November map to month(or time) whereas Mountain Tire is mapped to product with the help of Domain Thesaurus. So ME/R and Domain Thesaurus both identify the area of search in the data instances. Once area of search is confined, next step is to pick the part from the data graph. The data graph is loaded in the memory for efficient searching of data. System load Time and Product dimensions in- memory whereas sale is mapped to 4 fact tables and 2 attributes of Fact table (Fact Reseller Sales).
In such ambiguous situation, it is not feasible to load all fact tables because fact tables contain high volume of records. We can filter out fact tables firstly based on connectivity of dimensions. The fact table that is attached with all or most dimensions is chosen. Secondly, fact tables are filtered based on attribute name such as sales Amount and sales Territory Key. The system also loads indexes of required attributes before searching in that.
During keyword search in the data graph, two type of ambiguities attribute-instance and join-path are taken place [Wu et al., 2007]. In attribute-instance ambiguity, a keyword can be mapped to instances of more than one attributes. For example, Lahore can be mapped to Address in Dim Customer dimension or city Name in the Dim Sales Territory. The join-path ambiguity is created due to attribute-instance ambiguity. So, more than one paths (RA expressions) are generated.
Conclusion: In this paper, a Natural Language Interface for Data Warehouse is proposed. The key stakeholders of this system are executives who are non-technical. First, framework of the NLRSDW is discussed including its components such as Keyword Query Processor, ME/R model, Domain Thesaurus, Data Graph, MVs Identifier, Query Knowledgebase and Query Optimization method.
Later, a prototype has been developed and explained in this work. Based on comparison with existing systems, it is obvious that NLRSDW can efficiently and accurately execute the user keyword query.
As future work, the user query and result interface will be studied from Human Computer Interaction (HCI) perspective. Currently, interface show results in tabular form. It would be extended with graphical capabilities.
Agrawal, S., S. Chaudhuri, and G. Das.Dbxplorer: A system for keyword-based search over relational databases. In ICDE, 5-16(2002).
Androutsopoulos, I., G. D. Ritchie, and P. Thanisch. Natural Language Interfaces to Databases - An Introduction, Natural Language Engineering. I: 29-81 (1995).
Bargui, F., H. B. Abdallah, J. Feki.Multidimensional concept extraction and validation from OLAP requirements in NL. IEEE NLP-KE, 1-8 (2009).
Bhalotia G., A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan.Keyword searching and browsing in databases using banks.In ICDE, 431-440(2002). Cer, D., M. C. Marneffe, D. Jurafsky, C. D. Manning. Parsing to Stanford Dependencies: Trade-offs between speed and accuracy.In Proceedings of LREC (2010).
Chaudhuri, S., R. Ramakrishnan, and G. Weikum. Integrating DB and IR technologies: What is the sound of one hand clapping? In Proc. of CIDR, (2005). Chaudhuri, S., U. Dayal.An overview of data warehousing and OLAP technology. SIGMOD Record, 26(1):65-74 (1997).
El-Mouadib,F. A., Z. S. Zubi, A. A. Almagrous, and I. S. El-Feghi.Generic Interactive Natural Language Interface to Databases (GINLIDB).International Journal of Computers, 3(3)(2009).
Fan, J., G. Li and L. Zhou. Interactive SQL Query Suggestion: Making Databases User-Friendly. In proceeding of the ICDE, 351-362 (2011). Hristidis V. and Y. Papakonstantinou. Discover: Keyword search in relational databases. In VLDB, 670-681 (2002).
Hristidis, V., L. Gravano, and Y. Papakonstantinou. Efficient IR-Style keyword search over relational databases. In VLDB, 850-861(2003).
Huang, B., G. Zhang, P. Sheu. A natural language database interface based on a probabilistic context free grammar. In proceedings of IEEE International workshop on semantic computing and systems (2008).
Jeffrey X. Y., L. Qin, L. Chang, Keyword Search in Relational Databases: A Survey. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering (2010).
Kacholia,V., S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar.Bidirectional expansion for keyword search on graph databases.In VLDB, 505-516 (2005).
Kuchmann-Beauger N. and M. Aufaure.A Natural Language Interface for Data Warehouse Question Answering, In Proc. NLDB, 201-208 (2011).
Kuchmann-BeaugerN., and M. Aufaure. Natural Language Interfaces for DataWarehouses. 8emes Journeesfrancophonessur les Entrepots de Donneesetl'Analyse en ligne, France (2012). Manning, C. D. Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? In Gelbukh, A. F. (ed.) CICLing, Part I. LNCS, 6608: 171-189.Springer, Heidelberg (2011). Miller, G. WordNet: a lexical database for english, Communications of the ACM 38(1): 39-41 (1995).
Naeem, M. A., SaifUllah, and I. S. Bajwa.Interacting with Data Warehouse by Using a Natural Language Interface.In proceedings of the NLDB, 372-377 (2012).
Navin K., R. Ramakrishnan, V. Ercegovac. The QUIQ Engine: A Hybrid IR DB System. ICDE (2003).
Niculae.S., K. Leila., Bipin, C. Desai.Using semantic templates for a natural language interface to the CINDI virtual library. Data and Knowledge Engineering 55(1): 4-19 (2005).
Pinchak, C., D. Lin. A probabilistic answer type model. In Proceedings of the 11st Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy, The Association for Computer Linguistics (2006). Tata S. and G. M. Lohman.Sqak: doing more with keywords. In SIGMOD Conference, 889-902 (2008).
Wu, P., Y. Sismanis and B.Reinwald.Towards Keyword- driven Analytical Processing. In SIGMOD, 617- 628 (2007).
Yang, X., C. M. Procopiuc, and D. Srivastava. Summarizing relational databases.PVLDB, 2(1):634-645 (2009).
ZhouB. and J. Pei. Answering aggregate keyword queries on relational databases using minimal group- bys.In EDBT, 108-119(2009).
Dept. of CS and E, University of Engineering and Technology, Lahore, Pakistan
Corresponding author email: email@example.com
|Printer friendly Cite/link Email Feedback|
|Publication:||Pakistan Journal of Science|
|Date:||Sep 30, 2013|
|Previous Article:||CALCULATION OF IDEAL FLOW AROUND A JOUKOWSKI AEROFOIL USING INDIRECT PANEL METHOD WITH DOUBLET DISTRIBUTION ALONE.|
|Next Article:||IDENTIFICATION AND ANALYSIS OF A SUSTAINABLE SYSTEM OF ROAD TRAFFIC PATTERN IN LAHORE CITY.|