
Domain Terminology Collection for Semantic Interpretation of Sensor Network Data

1. Introduction

Many studies have considered the integrated management of data received from sensor networks [1, 2]. In particular, significant research has focused on ontology-based approaches for developing standardized, semantic relations between the data [3-6]. The data collected from sensors represent various tangible and intangible objects, such as temperature, acceleration, GPS, light, barometric pressure, magnetic degree, and acoustic measurements. Existing research deals with the integration of the sensor data itself, the definition of standard schemes, and management applications for understanding the sensor data. However, the same data type, and even the same value, can be interpreted differently depending on the environment and on which objects are related to the sensor. For example, two 1°C measurements, one from a refrigerator and one from an aquarium, will have very different interpretations. To make appropriate decisions in different situations, concepts in the sensor network domain should be related to concepts in other domains. To address this issue, knowledge bases incorporating ontology, taxonomy, folksonomy, or thesaurus information should first be constructed, allowing reliable connections to be formed between concepts of the sensor network and concepts of other domain knowledge bases. The fundamental step in constructing such knowledge bases is to collect domain terminologies, and our research deals with a domain-term collection method.

Domain-terms, the main components of such knowledge, are words and compound words that have specific meanings in a specific context (see the definition of "terminology": http://en.wikipedia.org/wiki/Terminology). Constructing knowledge bases manually requires considerable labor, cost, and time and can sometimes result in conflicts [7, 8]. Therefore, the automatic construction of a body of knowledge by extracting domain-terms from various sources is a popular area of research [8-15]. Nowadays, Wikipedia (WP) and similar repositories are widely employed as information sources [7, 16, 17]. WP explains concepts in diverse forms (hereafter, we use "concept," "term," and "article" with the same meaning), such as abstract information (specific, extended definitions), tabular information, the main article content, article links, and category information. The term "article" is generally used in WP, but it also denotes the title of an article; in this paper, we use "article" and "term" interchangeably. Moreover, WP provides highly reliable and widely used content, because it is based on semantic information drawn from the collective intelligence of contributors worldwide. However, WP has two weaknesses in its category structure (we detail these with examples in Section 2): it contains loops in the category hierarchy, and a significant number of its categories are unreasonably generalized. These weaknesses were similarly identified in previous work [7].

General methods of extracting domain-terms from knowledge resources, such as Princeton WordNet [18], use a vertical search (top-down or bottom-up) that chooses a representative term (e.g., science) covering a field of interest and extracts as domain-terms all of the terms (e.g., natural science, life science, biology, botany, ecology, genetic science, morphology, anatomy, biomedical science, medical science, information science, and natural language processing) contained under the representative term [19]. Because of the weaknesses identified above, such methods cannot be applied to WP. This research therefore proposes a horizontal method that resolves the difficulties of a vertical search. The method requires one domain category as input (multiple categories are possible, but we consider only the single-category case in this paper). The entry category contains many articles, which we call domain articles, and each domain article belongs to one or more categories. We consider the categories connected to the domain articles as candidate categories that may be deeply related to the entry category. The method then measures the similarity between the domain category and each candidate category. If the similarity matches or exceeds a predetermined threshold, the candidate and its articles are added to the domain category group and the domain article group, respectively. Through this iterative process, the method generates a similar-category group and a domain terminology group; we evaluate its category grouping and term collection performance.

The remainder of this paper is organized as follows. Section 2 describes our motivation for this research. Section 3 proposes the domain-term collection method through domain category grouping. Section 4 presents experimental results and evaluates the performance of the proposed approach, and finally, in Section 5, we summarize our research.

2. Motivation

Many applications employ various WP components for semantic information processing. WP is an agglomeration of knowledge cultivated by contributors from diverse fields; thus, its content has wide coverage and high reliability. In particular, the hierarchical structure of its categories and the semantic networks between its articles resemble the human knowledge system. These strengths have led to WP being widely used; unfortunately, however, additional processing is needed. We point out two weaknesses of WP in this section. Box 1 shows a case of loop relations in the hierarchical structure, which represents the first weakness.

The category "Natural language processing" has "Concepts" as a supercategory, and "Concepts" appears again as one of its own superconcepts a few steps later. The WP hierarchy contains many such loops, and these pose difficulties during a vertical search. Even if the loops were resolved programmatically, another obstacle would remain, as shown in Box 2.

Box 2 enumerates the supercategories of "Natural language processing" after its loops have been removed. The initial category belongs to computer science technology, but it soon becomes connected to "Mind," "Marxism," "Humans," "Taxonomy," "Classification systems," "Libraries," "Collective intelligence," "Internet," "World," and "People." Some of these connections are appropriate, but others reflect excessive generalization between categories. We call this inappropriate generalization, and it causes undesirable categories and terms to be collected into a domain category during a vertical search. Therefore, we propose a method that searches horizontally for related categories.

This research considers a category of interest as the entry domain category and measures the similarity, based on article intersections, between this domain category and each candidate category. If the similarity equals or exceeds a predetermined threshold, the candidate becomes an element of the domain set. Well-known measures of the degree of article intersection, such as the Jaccard similarity coefficient (JSC) and the Dice coefficient (DC), have significant limitations, as shown in Table 1.

Each case consists of two categories, and we wish to determine whether the candidate can be added to the domain set. In the first case, the categories have the same number of articles, 50 of which are shared as an intersection set. The similarity values are 0.333 and 0.5 according to the JSC and the DC, respectively. In the second case, almost all of the candidate articles are included in the domain category, but the similarities are lower than those of the first case. We believe that the second case should have higher similarity values, because the initial domain category was chosen by the user, which means that the domain has our trust. However, existing methods cannot provide a suitable measure of similarity. In this paper, we suggest simple new measurements to mitigate this limitation.
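As a concrete check, the following Python sketch reproduces the JSC and DC values of Table 1 from the set sizes alone (the helper functions are ours):

```python
def jaccard(inter, size_a, size_b):
    # Jaccard similarity coefficient: |A ∩ B| / |A ∪ B|
    return inter / (size_a + size_b - inter)

def dice(inter, size_a, size_b):
    # Dice coefficient: 2|A ∩ B| / (|A| + |B|)
    return 2 * inter / (size_a + size_b)

# Case 1: |domain| = 100, |candidate| = 100, |intersection| = 50
print(jaccard(50, 100, 100), dice(50, 100, 100))  # 0.333..., 0.5
# Case 2: |domain| = 100, |candidate| = 20, |intersection| = 18
print(jaccard(18, 100, 20), dice(18, 100, 20))    # 0.176..., 0.3
```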

3. Domain-Term Extraction Based on Category Grouping

Articles included in the same WP category describe similar content. However, WP assigns many similar categories (e.g., "Word sense disambiguation," "Ontology learning," and "Data mining") to a single category (e.g., "Natural language processing"). To collect domain-terms that have wide coverage, we suggest a bootstrapping method that remedies the weaknesses of WP and improves on the existing similarity measurements mentioned in the previous section. We now describe these processes in detail using real examples.

Box 1: Case of a loop in the WP category hierarchy (bold: the same category concept occurring repeatedly in the hierarchy).

Natural language processing → Computational linguistics → Natural language and computing → Human-computer interaction → Artificial intelligence → Futurology → Social change → Social philosophy → Human sciences → Interdisciplinary fields → Science → Knowledge → Concepts → Mental content → Consciousness → Philosophy of mind → Conceptions of self → Concepts → ...

Box 2: Case of inappropriate generalization in the WP category hierarchy (bold: the same category concept occurring repeatedly in the hierarchy).

Natural language processing → Computational linguistics → Natural language and computing → ... → Artificial intelligence → ... → Human sciences → Social sciences → ... → Science → Knowledge → Concepts → Mental content → Consciousness → Mind → Concepts in metaphysics → ... → Form → Ontology → Reality → ... → Scientific observation → Data collection → ... → Probability → Philosophical logic → ... → Neo-Marxism → Marxism → ... → Reasoning → Intelligence → ... → Humans → ... → Phyla → Taxonomic categories → Taxonomy → Classification systems → Library science → Libraries → ... → Human welfare organizations → Non-profit organizations by beneficiaries → Non-profit organizations → ... → Constitutions → Legal documents → Documents → ... → Crowd psychology → Collective intelligence → ... → Cyberspace → Internet → Wide area networks → ... → World → ... → People


3.1. System Flow. The proposed method takes one category, which the user selects as an entry (trigger), and follows the flowchart shown in Figure 1. Starting from the entry category, we determine similar categories through a horizontal search. In the search, articles included in the entry act as "clues" for measuring the similarity and "bridges" for preparing the next candidate category. Figure 2 illustrates an example of a category-article network. If the category "Natural language processing" is given as the entry, the method finds articles for similarity measurement and prepares the categories of each article for the next candidates. This means that "Information science," "Knowledge representation," "Machine learning," "Artificial intelligence applications," "Data mining," and so forth are processed individually as candidate categories. We now explain the process shown in Figure 1 using similar examples.
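The candidate preparation just described can be sketched as follows; the two lookup tables, articles_of and categories_of, are hypothetical names for the category-article graph:

```python
def candidate_categories(entry, articles_of, categories_of):
    """Collect candidate categories one hop away from the entry category.

    articles_of: dict mapping a category to the set of its articles.
    categories_of: dict mapping an article to the set of its categories.
    """
    candidates = set()
    for article in articles_of.get(entry, set()):  # articles act as "bridges"
        candidates |= categories_of.get(article, set())
    candidates.discard(entry)  # the entry itself is not a candidate
    return candidates

# e.g., candidate_categories("Natural language processing", articles_of, categories_of)
# could yield {"Information science", "Knowledge representation", "Machine learning", ...}
```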

3.2. Domain-Term Selection through Category Grouping (Bootstrapping Method). To group similar categories, we choose a horizontal category search and propose new similarity measurements that enable the group to be enriched. The bootstrapping process proceeds as follows.

(1) An initial domain category set (DC) consists of one user-selected category: DC = {user-selected category}; for example, DC = {Natural language processing}. The size of DC increases throughout the iterative process.

(2) The domain articles (DA) of one category are collected. At first, DA consists of the articles of the entry category, but it gradually becomes enriched with articles from new domain categories. Explicitly, DA = {(art, dist, count, dw)_i, 1 ≤ i ≤ n}, where art, dist, count, and dw denote an article, its distance from the entry category, an accumulated (overlap) count, and a domain-term weight, respectively. The initial elements of DA have values of 1, 1, and 1.0 for dist, count, and dw, respectively. For example, DA = {(Concept mining, 1, 1, 1.0), (Information retrieval, 1, 1, 1.0), (Language guessing, 1, 1, 1.0), (Stemming, 1, 1, 1.0), (Content determination, 1, 1, 1.0), ...}.

(3) There are two options, depending on whether an article-link network is used in the similarity measurements. We explain the options using Figure 3 and Table 2. Figure 3 shows the network between the categories "Natural language processing" and "Data mining," whereas Table 2 defines the network types.

(i) The first option considers only category-article networks, such as {Content determination, Information retrieval, Languageware, Concept mining, Document classification, Text mining, Automatic summarization, String kernel, Sentic computing, ...} for "Natural language processing." Type 1 in Table 2 is related to this option.

(ii) The second option uses more complex networks that utilize the category-article network as well as article links. Types 2 and 3 in Table 2 represent this option.

If the first option is chosen, the method goes to Step 4; otherwise, it goes to Step 11.

(4) Candidate categories (CC) are collected using one article of DA. If DC already contains a candidate, it is not included in CC. Explicitly, CC(art_i) = {cat_j, 1 ≤ j ≤ m}. For example, CC(Concept mining) = {Data mining, Artificial intelligence applications}.

(5) To prepare clues that indicate suitable categories, a set of candidate articles (CA) of one candidate category is formed. For example, CA(Data mining) = {Extension neural network, Big data, Data classification (business intelligence), Document classification, Web mining, Text mining, Concept mining, ...}. If there are intersection articles, an intersection set IS is constructed to measure the similarity: IS(set_i, set_j) = set_i ∩ set_j = {ia_k, 1 ≤ k ≤ p}. For example, IS(Natural language processing, Data mining) = {Languageware, Concept mining, Document classification, Text mining}.

(6) To eliminate the limitation described in Section 2, we propose a new similarity measurement:

\mathrm{sim}(set_i, set_j) = \alpha \cdot \frac{|IS(set_i, set_j)|}{|set_i|} + (1 - \alpha) \cdot \frac{|IS(set_i, set_j)|}{|set_j|},  (1)

where α is a weighting factor to which we assign a basic value of 0.5. Based on (1), the similarities of Cases 1 and 2 in Table 1 are 0.5 and 0.54, respectively. However, we must consider an additional constraint in the bootstrapping method. Table 3 shows another example: based on (1), both cases have the same similarity value. Even so, Case 2 is inappropriate, because the coverage of DA is too narrow, which may cause the generalization problem. Thus, before similarities are calculated in the bootstrapping method, the similarity constraint

|DA| ≥ |CA|  (2)

should be satisfied.

According to this constraint and (1), the similarity between "Natural language processing" and "Data mining" is calculated as follows: |DA| = 117, |CA| = 93, |IS(DA, CA)| = 4, and sim(Natural language processing, Data mining) = (4/117 + 4/93)/2 ≈ 0.039.
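A minimal Python sketch of this computation, combining (1) with the constraint (2), is given below. Returning 0.0 when the constraint fails is our reading; the paper only requires that the constraint hold before the similarity is calculated.

```python
ALPHA = 0.5  # the weighting factor alpha in equation (1)

def similarity(domain_articles, candidate_articles, alpha=ALPHA):
    """Equation (1) guarded by the constraint of equation (2)."""
    if len(domain_articles) < len(candidate_articles):
        return 0.0  # constraint |DA| >= |CA| violated
    inter = len(domain_articles & candidate_articles)
    return (alpha * inter / len(domain_articles)
            + (1 - alpha) * inter / len(candidate_articles))

# With |DA| = 117, |CA| = 93, and 4 shared articles, this reproduces
# sim(Natural language processing, Data mining) = (4/117 + 4/93) / 2 ≈ 0.039
```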

(7) If the similarity exceeds a predetermined threshold, go to Step 8; otherwise, we skip Step 8 and proceed to Step 9.

(8) If the candidate category has a similarity that is greater than the threshold, the system considers the candidate to be the domain and inserts the candidate and its articles into DC and DA. Here, new articles are accompanied by supplementary values of dist, count, and dw. Specifically, dist is the category distance from the entry category (in the case of "Data mining," dist is 2), count is 1, and dw is the similarity calculated in Step 6. If DA already contains an article, the original supplementary values are increased by adding the new ones. These values are used later for domain-term selection.
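The enrichment step can be sketched as follows. Representing DA as a dictionary from article to its (dist, count, dw) triple is our choice, and keeping the smaller distance for an already-known article is our reading: the text says the values are "increased by adding the new ones," which clearly fits count and dw but would be odd for a distance.

```python
def enrich_domain(DC, DA, candidate, cand_articles, sim, dist):
    """Step 8: an accepted candidate and its articles join DC and DA.

    DC is a set of category names; DA maps an article to its
    (dist, count, dw) triple (our representation, not the paper's).
    """
    DC.add(candidate)
    for art in cand_articles:
        if art in DA:
            d, c, w = DA[art]
            DA[art] = (min(d, dist), c + 1, w + sim)  # accumulate count and weight
        else:
            DA[art] = (dist, 1, sim)  # a new article with fresh supplementary values
```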

(9) If there is at least one element in CC, we return to Step 4 for the next candidate category; otherwise, proceed to Step 10.

(10) If there is at least one element in DA, we return to Step 2 for the new CC of the next domain article; otherwise, proceed to Step 19.

(11) To use the links of articles as additional clues for the similarity measurement, a domain link set (DLS) is collected: DLS(DA) = {link_dom_j, 1 ≤ j ≤ m}. For example, DLS = {Information, Metadata, Relational database, World Wide Web, Data (computing), Document retrieval, ...}.

(12) This step is the same as Step 4.

(13) This step is the same as Step 5.

(14) Construct a candidate link set (CLS) with links from CA. Explicitly, CLS(CA(cat_k)) = {link_cat_t, 1 ≤ t ≤ q}. For example, CLS(CA(Data mining)) = {Computer science, Data set, Artificial intelligence, Database management system, Business intelligence, Neural network, Cluster analysis, ...}.

(15) If the similarity constraint is satisfied, (1) is applied to determine the category-article similarity. We use additional similarity measurements for the article-link networks. A network can have different degrees of relatedness according to its type (see Table 2), and if two categories have complex, close connections, their similarity should be greater because of common features in the neighborhood. To reflect both characteristics, we propose a further similarity measure:

\mathrm{sim}(set_i, set_j) = \frac{1}{distance} \left( \alpha \cdot \frac{|IS(set_i, set_j)|}{|set_i|} + (1 - \alpha) \cdot \frac{|IS(set_i, set_j)|}{|set_j|} \right),  (3)

where distance is the number of articles that lie on the network path (this differs from dist in DA). According to (3), we can calculate sim(DLS, CA), sim(DA, CLS), and sim(DLS, CLS), which have distances of 2, 2, and 3, respectively. The final similarity (final_sim) between DA and CA is determined by summing sim(DA, CA) from (1) with sim(DLS, CA), sim(DA, CLS), and sim(DLS, CLS) from (3). The similarity between "Natural language processing" and "Data mining" is thus calculated as

final_sim(DA, CA) = sim(DA, CA) + sim(DLS, CA) + sim(DA, CLS) + sim(DLS, CLS).  (4)
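Under our reconstruction of (3), Step 15 can be sketched as follows, reusing similarity() from the earlier sketch. Whether the constraint (2) also applies to the link sets is not stated, so the sketch applies it uniformly:

```python
def link_similarity(set_a, set_b, distance, alpha=0.5):
    """Our reconstruction of equation (3): equation (1) discounted by the
    network distance, so that closer connections contribute more."""
    if len(set_a) < len(set_b):
        return 0.0  # apply the same constraint as equation (2)
    inter = len(set_a & set_b)
    return (alpha * inter / len(set_a)
            + (1 - alpha) * inter / len(set_b)) / distance

def final_similarity(DA, CA, DLS, CLS):
    """Equation (4): the category-article similarity plus the three
    article-link similarities (distances 2, 2, and 3 per Table 2)."""
    return (similarity(DA, CA)               # equation (1)
            + link_similarity(DLS, CA, 2)    # type 2 network
            + link_similarity(DA, CLS, 2)    # type 2 network
            + link_similarity(DLS, CLS, 3))  # type 3 network
```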

(16) If the similarity exceeds the predetermined threshold, go to Step 17; otherwise, Step 18 is carried out.

(17) If the similarity exceeds the threshold, the system enriches DC, DA, and DLS. This step is similar to Step 8.

(18) If there is at least one element in CC, we return to Step 12 for the next candidate category; otherwise, go to Step 10.

(19) Output the DC and DA acquired through the bootstrapping process.

(20) Terminate the bootstrapping process and evaluate DC and DA, which were enriched in Step 8 or 17. For DA, the evaluations are divided into three types according to the supplementary values (dist, count, and dw). In the case of the domain weight, we use values normalized according to

\widehat{dw}_i = \frac{dw_i - \min_j dw_j}{\max_j dw_j - \min_j dw_j}.  (5)
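Assuming the min-max normalization written above for (5), and the dictionary representation of DA from the Step 8 sketch, the weight normalization can be sketched as:

```python
def normalize_weights(DA):
    """Min-max normalization of the accumulated domain weights dw
    (our assumed reading of equation (5))."""
    weights = [dw for (_, _, dw) in DA.values()]
    lo, hi = min(weights), max(weights)
    span = (hi - lo) or 1.0  # guard against a zero range
    return {art: (dw - lo) / span for art, (_, _, dw) in DA.items()}
```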

We have described all of the bootstrapping steps for domain-term selection by grouping similar categories. In the next section, a few aspects of performance are evaluated.

4. Experimental Evaluations

This section considers the evaluation of DC and DA. One objective of our research is to select as many domain-terms as possible, for which we proposed the bootstrapping method for similar category grouping. In the process, DC and DA become enriched with categories and articles, respectively, and each article has supplementary values of dist, count, and dw.

To evaluate the quality of DC and DA, we used an article-category dataset and a Pagelinks dataset that are included among the WP components in DBpedia (DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web: http://dbpedia.org/About), version 3.7. To utilize the WP networks, we implemented two versions of our system, referred to as new similarity (NS) and new similarity with links (NSL), and varied the threshold of each system from 0.1 to 0.9. Moreover, we chose entry sets of 40 categories (each set containing one category) from fields of computer science, such as "Natural language processing," "Speech recognition," and "Semantic Web." The results are compared with those of a baseline method that employs the Dice coefficient similarity measure. Table 4 shows some of the similar categories collected by NSL for the entry category "Semantic Web" with a threshold of 0.2.
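As a reference for reproducing this setup, the category-article maps might be loaded from the DBpedia article-categories dump as sketched below; the file name and the N-Triples layout (<article> <dcterms:subject> <category> .) reflect DBpedia 3.7 conventions and should be treated as assumptions:

```python
from collections import defaultdict

def load_article_categories(path="article_categories_en.nt"):
    """Build the two category-article maps from an N-Triples dump whose
    lines have the form: <article> <dcterms:subject> <category> ."""
    articles_of = defaultdict(set)
    categories_of = defaultdict(set)
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip comment lines
            parts = line.split()
            if len(parts) < 3:
                continue  # skip malformed lines
            article, _, category = parts[0], parts[1], parts[2]
            articles_of[category].add(article)
            categories_of[article].add(category)
    return articles_of, categories_of
```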

We invited domain specialists to examine the results, and each collection was manually checked by each evaluator. The RE field in Table 4 shows the actual checked results, where values of 1 and 0 were ascribed by the evaluators for relevance and irrelevance, respectively. Tables 5 and 6 present summaries of the DC evaluations.

There was no improvement for thresholds greater than 0.6, and the bootstrapping was incomplete at a threshold of 0.1; therefore, we present results for threshold values from 0.2 to 0.6. In the experiments, the baseline attained an extension rate of only 1.18 (40 categories extended by 7 categories), even though its precision was 100%. The aim of information processing is to reduce the time required to accomplish certain objectives, which implies that an information system should provide varied results; in this respect, we do not expect the baseline results to be helpful. NSL, by contrast, provides both wide extension and high precision: the maximum extension rate was 24.45, with 799 appropriate categories, at a threshold of 0.2, and the minimum precision was around 84%, at a threshold of 0.3.
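The extension-rate and precision figures can be recomputed from the raw counts in Tables 5 and 6; the following sketch reproduces the NSL values at a threshold of 0.2:

```python
def extension_rate(entry_count, extended_count):
    # e.g., (40 + 938) / 40 = 24.45 for NSL at a threshold of 0.2
    return (entry_count + extended_count) / entry_count

def precision(appropriate, inappropriate):
    return 100.0 * appropriate / (appropriate + inappropriate)

print(round(extension_rate(40, 938), 2))  # 24.45
print(round(precision(799, 139), 2))      # 85.18
```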

Box 3: Domain articles (DA) collected for "Semantic Web" with NSL at a threshold value of 0.2.

XHTML + RDFa, Bath Profile, Sidecar file, COinS, Metadata publishing, MARC standards, WizFolio, Qiqqa, ISO-TimeML, TimeML, Metadata Authority Description Schema, Bookends (software), RIS (file format), Metadata Object Description Schema, EndNote, Refer (software), ISO 2709, BibTeX, XML, S5 (file format), Semantic HTML, Simple HTML Ontology Extensions, Opera Show Format, XOXO, XHTML Friends Network, StrixDB, Graph Style Sheets, TriX (syntax), TriG (syntax), RDF feed, Redland RDF Application Framework, RDF query language, Turtle (syntax), RDFLib, Notation3, SPARQL, D3web, Artificial architecture, NetWeaver Developer, Knowledge engineer, Frame language, ...


In addition to evaluating DC, we evaluated DA with the NSL results for a threshold of 0.2. To examine the influence of the distance, count, and domain weight, we analyzed the results according to each factor. Six DAs were selected at random, giving a total of 1,769 articles (terms). Box 3 enumerates a portion of the collected articles for the domain "Semantic Web," and Tables 7, 8, and 9 show the evaluation results with respect to distance, count, and weight.

The basic domain-term selection attained a precision of 70.9%. As expected, precision was inversely proportional to distance; moreover, articles at a distance of 4 or more were almost all unrelated. The weight and count can serve as important criteria for selecting domain-terms, and we found that the weight returned more refined results than the count (the weight returned 755 appropriate terms with 97.2% precision at a threshold of 0.3). At short distances, there are many names of people and organizations, such as "Squarespace," "Rackspace Cloud," and "Nsite Software (Platform as a Service)." These names were selected by the bootstrapping because their associated categories (e.g., "Cloud platforms," "Cloud storage," and "Cloud infrastructure") were similar to the entry (e.g., "Cloud computing"). This situation is caused not by our method but by the definition of the article-category relations in WP. We believe that it can be resolved by processing content (abstracts) or tabular information in the future.

5. Conclusions and Future Work

This paper has proposed a method of domain-term collection through a bootstrapping process to assist the semantic interpretation of data from sensor networks. To achieve this, we identified weaknesses in the WP category hierarchy (i.e., loops and inappropriate generalizations) and chose a horizontal, rather than vertical, category search. We proposed new semantic similarity measurements and a similarity constraint to overcome the limitations of existing methods, and we employed both category-article networks and article-link networks to elicit information for the category similarity measurement. In the performance evaluations, category grouping based on NSL yielded the greatest number of proper results. In terms of domain-term selection, we confirmed that the results obtained with normalized weights had the best precision and extension rate. The distance-based metric had no positive influence on our research: when the distance was greater than three, almost all of the terms were unrelated. Nevertheless, we believe that the collected domain terminologies can assist the construction of domain knowledge bases for the semantic interpretation of sensor data.

WP has additional weaknesses beyond those mentioned in this paper, especially in the category-article relation. For example, the term "Paco Nathan" is a personal name that has "Natural language processing" as one of its categories. The relation between the two, namely, that "Paco Nathan" has expertise in "Natural language processing," causes noise and negatively influences semantic information processing. We think that this problem can be solved in future work by processing additional WP components, such as abstract or tabular information. Moreover, our research employed only the out-links of WP articles; if the in-links were also considered, we expect that the results would be more significant, with wider coverage of domain-terms and higher relevance.

http://dx.doi.org/10.1155/2014/827319

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] C. Keßler and K. Janowicz, "Linking sensor data--why, to what, and how?" in Proceedings of the 3rd International Workshop on Semantic Sensor Networks, pp. 77-91, 2010.

[2] S. Tilak, N. B. Abu-Ghazaleh, and W. Heinzelman, "A taxonomy of wireless micro-sensor network models," ACM SIGMOBILE Mobile Computing and Communications Review, vol. 6, no. 2, pp. 28-36, 2002.

[3] G. Goodwin and D. J. Russomanno, "An ontology-based sensor network prototype environment," in Proceedings of the 5th International Conference on Information Processing in Sensor Networks, pp. 1-2, 2006.

[4] K. Janowicz and M. Compton, "The Stimulus-Sensor-Observation ontology design pattern and its integration into the Semantic Sensor Network ontology," in Proceedings of the 3rd International Workshop on Semantic Sensor Networks, pp. 92-106, 2010.

[5] P. Barnaghi, S. Meissner, M. Presser, and K. Moessner, "Sense and sens'ability: semantic data modelling for sensor networks," in Proceedings of the ICT Mobile Summit, pp. 1-9, 2009.

[6] A. Bröring, P. Maué, K. Janowicz, D. Nüst, and C. Malewski, "Semantically-enabled sensor plug & play for the sensor web," Sensors, vol. 11, no. 8, pp. 7568-7605, 2011.

[7] M. Hwang, D. Choi, and P. Kim, "A method for knowledge base enrichment using Wikipedia document information," Information, vol. 13, no. 5, pp. 1599-1612, 2010.

[8] P. Velardi, A. Cucchiarelli, and M. Petit, "A taxonomy learning method and its application to characterize a scientific web community," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 2, pp. 180-191, 2007.

[9] H. Avancini, A. Lavelli, B. Magnini, F. Sebastiani, and R. Zanoli, "Expanding domain-specific lexicons by term categorization," in Proceedings of the ACM Symposium on Applied Computing, pp. 793-797, March 2003.

[10] J. J. Jung, Y. H. Yu, and K. S. Jo, "Collaborative web browsing based on ontology learning from bookmarks," in Proceedings of the International Conference on Computational Science, pp. 513-520, 2004.

[11] J. J. Jung, "Computational reputation model based on selecting consensus choices: an empirical study on semantic wiki platform," Expert Systems with Applications, vol. 39, no. 10, pp. 9002-9007, 2012.

[12] J. Lehmann and L. Buhmann, "ORE--a tool for repairing and enriching knowledge bases," in Proceedings of the 9th International Semantic Web Conference, pp. 177-193, 2010.

[13] J. Liu, Y.-C. Liu, W. Jiang, and X.-L. Wang, "Research on automatic acquisition of domain terms," in Proceedings of the 7th International Conference on Machine Learning and Cybernetics (ICMLC '08), pp. 3026-3031, July 2008.

[14] S. Minocha and P. G. Thomas, "Collaborative learning in a Wiki environment: experiences from a software engineering course," New Review of Hypermedia and Multimedia, vol. 13, no. 2, pp. 187-209, 2007.

[15] R. Navigli and P. Velardi, "Ontology enrichment through automatic semantic annotation of on-line glossaries," in Proceedings of Knowledge Engineering and Knowledge Management, pp. 126-140, 2006.

[16] C. Müller and I. Gurevych, "Using Wikipedia and Wiktionary in domain-specific information retrieval," in Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access, pp. 219-226, 2009.

[17] J. Vivaldi and H. Rodríguez, "Finding domain terms using Wikipedia," in Proceedings of the 7th International Conference on Language Resources and Evaluation, pp. 386-393, 2010.

[18] G. A. Miller, "WordNet: a lexical database for English," Communications of the ACM, vol. 38, no. 11, pp. 39-41, 1995.

[19] S. Lee, S.-Y. Huh, and R. D. McNiel, "Automatic generation of concept hierarchies using WordNet," Expert Systems with Applications, vol. 35, no. 3, pp. 1132-1144, 2008.

Myunggwon Hwang, Jinhyung Kim, Jangwon Gim, Sa-kwang Song, Hanmin Jung, and Do-Heon Jeong

Department of Computer Intelligence Research, Korea Institute of Science and Technology Information (KISTI), 245 Daehak-ro, Yuseong-gu, Daejeon 305-806, Republic of Korea

Correspondence should be addressed to Do-Heon Jeong; heon@kisti.re.kr

Received 20 October 2013; Accepted 27 December 2013; Published 13 February 2014

Academic Editor: Hwa-Young Jeong

TABLE 1: Similarity issues for bootstrapping methods.

Case   Categories                 Count of articles   |C1 ∩ C2|   JSC     DC
1      Domain #1 / Candidate #1   100 / 100           50          0.333   0.5
2      Domain #2 / Candidate #2   100 / 20            18          0.176   0.3

TABLE 2: Network types and examples: C denotes a category; A, an article; and L, a link.

Type 1: C1 → (A1 = A2) ← C2
    Example: Natural language processing → Document classification ← Data mining

Type 2: C1 → A1 → (L1 = A2) ← C2
    Example: Natural language processing → Information retrieval → Text corpus ← Data mining
        and C1 → (A1 = L2) ← A2 ← C2
    Example: Natural language processing → Automatic summarization ← Text mining ← Data mining

Type 3: C1 → A1 → (L1 = L2) ← A2 ← C2
    Example: Natural language processing → Information retrieval → Information ← Data visualization ← Data mining

Bold: category concepts connected by super-/subcategory relations but exhibiting inappropriate generalization in the hierarchical structure.

TABLE 3: Examples of similarity measurement.

Case   Categories   Count of articles   |C1 ∩ C2|   sim
1      DA1 / CA1    400 / 100           100         0.625
2      DA2 / CA2    100 / 400           100         0.625

TABLE 4: Domain categories collected for "Semantic Web" with NSL at a threshold value of 0.2.

Categories                                           sim    dist   RE

Metadata publishing                                  0.47    2     1
Semantic HTML                                        0.40    2     1
RDF                                                  0.61    2     1
Knowledge engineering                                0.42    2     1
Folksonomy                                           0.27    2     1
Triplestores                                         0.68    2     1
Domain-specific knowledge representation languages   0.37    2     1
RDF data access                                      0.52    2     1
Book swapping                                        0.22    9     0
Rule engines                                         0.28    2     1
Ontology languages                                   0.35    2     1
Knowledge bases                                      0.30    2     1

RE: relevance evaluation.

TABLE 5: DC evaluations with baseline and NS.

Type                 Baseline                     NS

Threshold value    0.2     0.3    0.2     0.3     0.4     0.5    0.6
Extended count      7      --     282     133     87      82     --
Extension rate     1.18    --    8.05    4.33    3.18    3.05    --
Appropriate         7      --     241     113     80      74     --
Inappropriate       0      --     41      20       7       8     --
Precision (%)     100.00   --    85.46   84.96   91.95   90.24   --

TABLE 6: DC evaluations with NSL.

Type                              NSL

Threshold value    0.2     0.3     0.4     0.5     0.6
Extended count     938     346     207     134      47
Extension rate    24.45   9.65    6.18    4.35     2.18
Appropriate        799     293     187     127      47
Inappropriate      139     53      20       7       0
Precision (%)     85.18   84.68   90.34   94.78   100.00

TABLE 7: DA evaluations based on distance, including articles within
the indicated distance.

Distance          2       3       4       5      All

Appropriate     1,092   1,224   1,246   1,253   1,253
Inappropriate    252     336     471     508     514
Precision (%)   81.3    78.5    72.6    71.2    70.9

TABLE 8: DA evaluations based on count (overlapped).

Overlapped count     1       2       3       4

Appropriate        1,253    330     90      15
Inappropriate       514     80      14       1
Precision (%)      70.9    80.5    86.5    93.8

TABLE 9: DA evaluations based on normalized weight.

Threshold value     0      0.1     0.2     0.3

Appropriate       1,253   1,195    870     755
Inappropriate      514     292     69      22
Precision (%)     70.9    80.4    92.7    97.2
