
Constructing the semantic information model using a collective intelligence approach.

1. Introduction

Demand for semantic information processing is growing rapidly. A wide range of research has been carried out as computing technology has come to handle enormous volumes of information, and much of it focuses on information models for representing semantic information [1].

Generally, semantic representation means expressing objects, phenomena and concepts in computational models so that semantic information can be processed [1][2]. It allows a unified information model to be built from structured knowledge and enables semantic information processing. Traditionally, semantic information has been represented with knowledge representation methods such as logic, semantic networks and frames; more recently, it has been represented with ontologies, a comparatively new method [3][4].

Collecting massive semantic resources is essential for constructing a semantic information model [2]. The collected semantic information should provide simple descriptions of concepts and of their relations to other concepts, and inference should be possible over those relations. However, collecting semantic resources and constructing the semantic information model is difficult: it consumes enormous cost and time because the work is carried out by a small number of experts. It also cannot promptly handle semantic changes such as the creation, modification and deletion of semantic information [5].

To overcome these limitations, much research is being carried out on collecting massive semantic resources through "collective intelligence", the knowledge shared by many people, which can serve as an effective alternative for collecting and constructing semantic information. However, semantic information gathered through collective intelligence still requires verification by experts, because collective intelligence alone cannot guarantee its accuracy.

Luis von Ahn proposed "Games With A Purpose" (GWAP), a novel method for collecting and constructing meaningful semantic information through cross-validation of the information gathered from collective intelligence [6]. The ESP Game, a GWAP, encourages user participation and uses the results as meta tags in image search systems, which has been shown to improve search performance [6][7]. Verbosity likewise gathers semantic information and verifies the resulting semantic resources through cross-validation [5][6]. However, these games do not evaluate the significance of the collected descriptions or of the constructed relations between concepts, so it is hard to prove the effectiveness of collective intelligence.

In this paper, we propose FunWords, a collective intelligence approach based on GWAP. FunWords collects various semantic resources in the form of words and word-senses and constructs a semantic information model or ontology from the collected resources. Because expert-driven resource collection suffers from the limitations described above, FunWords relies on collective intelligence, and it applies the GWAP approach to encourage the voluntary participation of many people. Importantly, the results of FunWords reflect the importance of semantic relationships as judged by collective intelligence.

This paper begins with a brief review of knowledge bases and collective intelligence. We then introduce the design and implementation of FunWords, and finally summarize the contributions of this study.

2. Related Work

In this section, we introduce existing knowledge bases and briefly review other methods for collecting human knowledge using collective intelligence approaches.

2.1. Knowledge Base

Cyc

Cyc was the first effort to build a comprehensive common-sense ontology database, aimed at creating artificial intelligence able to reason by analogy as humans do. Cyc created a seed database of common-sense knowledge using paid experts. The database contains assertions, rules and the kind of common-sense knowledge shared by millions of people, and it can generate new rules through analogy. However, collecting a large amount of common sense through a small number of experts is a fundamental limitation [8].

WordNet

WordNet is a large lexical-semantic database of English, built for use in natural language processing (NLP) tasks such as machine translation and in various other areas of artificial intelligence. WordNet has been expanding continuously since 1985; the most recent version, WordNet 3.0, contains about 150,000 words organized into roughly 117,000 synsets and 207,000 word-sense pairs. WordNet divides words into four lexical categories (noun, verb, adjective, adverb), takes into account the different semantic relationships appropriate to each category, and expresses semantic relationships between synsets. However, WordNet required more than twenty years and a large amount of money and manpower to develop, and its maintenance is still ongoing. It also has difficulty reflecting the changes that occur over time [9][10].

Wiktionary

Wiktionary, from the Wikimedia Foundation, is a web-based, multilingual dictionary built on the participation of users. Its purpose is not only to define words but also to express semantic information that helps in understanding them, such as thesaurus entries and phrase books. Because Wiktionary is built by volunteers across the internet rather than by experts, it has few constraints on time, expense and manpower, and a huge amount of semantic information can be formed in a short period of time. Furthermore, it is not limited to a specific language but is built in many languages, which makes it possible to construct a multilingual dictionary. However, because it is made for humans rather than for computers, it has some limits when incorporated into various artificial intelligence applications [11][12].

2.2. Collective Intelligence

Human computation aims to solve problems that are hard for computers by harnessing human effort [6][13]. For example, human computation games, also called Games With A Purpose (GWAP), encourage users to play games and use the results of play to solve computational problems. The GWAP approach has been widely applied to tasks such as image tagging [13], music annotation [7] and common-sense collection [5], all of which are hard to solve with purely computational approaches.

The GWAP approach adopts three types of matching mechanism to improve the quality of the collected knowledge [6]. First, output-agreement matches two users' outputs for the same input. Second, inversion-agreement (the inversion-problem mechanism) gives one user's output to the other user as input and checks whether the second user can reconstruct the first user's original input. Third, input-agreement has two users infer, from each other's outputs, whether they received the same input. For example, the ESP Game shows the same image to two users and has both of them tag it [14].
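To make the output-agreement idea concrete, the short Python sketch below accepts a label only when two independently paired players produce the same output for the same input, as in the ESP Game. The function name and data layout are illustrative assumptions, not taken from any published GWAP implementation.

def output_agreement(outputs_a, outputs_b):
    """Keep only the labels that both players produced for the same input."""
    agreed = {}
    for item, label_a in outputs_a.items():
        label_b = outputs_b.get(item)
        # Accept the label only if the second player gave the same answer
        # (compared case-insensitively) for the same item.
        if label_b is not None and label_a.strip().lower() == label_b.strip().lower():
            agreed[item] = label_a
    return agreed

# Example: two players tag the same two images; only the matching tag survives.
player_a = {"img01": "Dog", "img02": "beach"}
player_b = {"img01": "dog", "img02": "sunset"}
print(output_agreement(player_a, player_b))   # {'img01': 'Dog'}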

ESP Game

The ESP Game, a well-known GWAP, has humans tag images. It is played online by two players paired at random; when both players produce the same output for an image within a time limit, they earn points. A key feature of GWAPs such as the ESP Game is that they exploit a side effect of play: players play just for fun and for the satisfaction of a high ranking, yet the results they generate can be used to solve computationally hard problems. In particular, the results of the ESP Game have contributed to improving the performance of image search systems such as Google's [13][14].

Verbosity

Verbosity is a GWAP that collects common-sense knowledge using the inversion-agreement matching mechanism. It is played online by two players paired at random; one player is the "Narrator" and the other the "Guesser". The Narrator describes the word shown to them, and the Guesser tries to identify the word from the Narrator's hints. Because the facts come not from experts but from the common sense of many users, Verbosity can generate a huge number of facts in just a few weeks. However, it can also collect meaningless knowledge, and it is hard to obtain detailed descriptions because players play simply to score highly [5].

PlayCoref

PlayCoref is a game for extracting coreferences: players annotate selected coreferences in a given text. The annotated data are crucial as metadata for computational linguistics, serving as learning and test data for coreference resolution. Manual annotation requires a great deal of time and expense, but PlayCoref, as a type of GWAP, offers a way to collect a large volume of coreference annotations. However, because PlayCoref was built for academic purposes and lacks strong game elements, it is difficult to collect data from users continuously. Even so, it is a good alternative model for solving such computational problems [15].

3. FunWords

FunWords is a semantic resource collection tool based on Korean lexical entries. Fig. 1 depicts the layout of FunWords. FunWords is an online game designed to be played by multiple players: one player is chosen as the "Describer" while the others are "Answerers". The Describer creates a crossword puzzle, and FunWords collects word-senses, expressed as descriptions of the correct answers, as the puzzle is solved by multiple randomly chosen players. We chose crossword puzzles because they effectively collect the various meanings of a given word and can also serve as an educational tool for vocabulary learning.

[FIGURE 1 OMITTED]

3.1 Mechanisms

Scenario

The game rules are as follows. The Describer creates a crossword puzzle in a 10 by 10 grid, with no time limit. Each word can have a maximum of 11 descriptions, and the Describer earns more points for each description given. The Describer is also awarded points based on how completely the crossword fills the grid and on the frequency of the words used. FunWords assists the Describer by providing a list of standardized words from the corpus. The Answerer plays randomly selected crossword puzzles that were created asynchronously by Describers: the Answerer selects an empty crossword cell and guesses the word from its descriptions. At the end of the game, FunWords tells the Answerer whether each answer was correct. User participation is encouraged through a point system that awards points whenever the user answers correctly and completes the game faster than the average playing time.
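As a rough illustration of the point system described above, the following Python sketch awards points to the Describer for descriptions, grid completeness and word rarity, and to the Answerer for correct answers given faster than the average playing time. All point values here are arbitrary assumptions; the paper does not publish the actual scoring constants.

def describer_score(num_descriptions, grid_fill_ratio, rarity_bonus, pts_per_description=10):
    # More descriptions, a fuller 10 x 10 grid, and rarer words all raise the score.
    completeness_bonus = int(100 * grid_fill_ratio)
    return num_descriptions * pts_per_description + completeness_bonus + rarity_bonus

def answerer_score(correct, play_time_sec, average_time_sec, base_points=50, speed_bonus=20):
    # Points only for a correct answer, with a bonus for beating the average playing time.
    if not correct:
        return 0
    return base_points + (speed_bonus if play_time_sec < average_time_sec else 0)

print(describer_score(num_descriptions=5, grid_fill_ratio=0.6, rarity_bonus=30))   # 140
print(answerer_score(correct=True, play_time_sec=180, average_time_sec=240))       # 70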

Matching Algorithm

FunWords uses a special matching algorithm during puzzle creation and play to maintain the consistency of the results. It is the inversion-agreement matching algorithm introduced in GWAP, which matches one user's input against other users' responses. That is, we analyze the collected word-senses, which one player creates and many other players attempt while solving the quiz, in order to verify the data gathered through collective intelligence. We keep only the word-senses whose correctness exceeds a certain threshold under the inversion-agreement algorithm. The threshold is determined experimentally and is initially set to 80%.
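The Python sketch below illustrates this verification step: a collected word-sense is kept only when the share of Answerers who solved it correctly reaches the threshold (80% initially). The data layout is an assumption made for illustration.

from collections import defaultdict

DEFAULT_THRESHOLD = 0.8   # determined experimentally; set to 80% initially

def verify_word_senses(plays, threshold=DEFAULT_THRESHOLD):
    """plays: iterable of (word, description, answered_correctly) tuples."""
    stats = defaultdict(lambda: [0, 0])          # (correct, total) per (word, description)
    for word, description, correct in plays:
        stats[(word, description)][1] += 1
        if correct:
            stats[(word, description)][0] += 1
    verified = []
    for (word, description), (correct, total) in stats.items():
        if total and correct / total >= threshold:
            verified.append((word, description, correct / total))
    return verified

plays = [("puppy", "It is a/an animal.", True),
         ("puppy", "It is a/an animal.", True),
         ("puppy", "It sounds bowwow.", False)]
print(verify_word_senses(plays))   # [('puppy', 'It is a/an animal.', 1.0)]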

[FIGURE 2 OMITTED]

User Roles--Describer

Anyone can choose to be the Describer. The Describer makes a puzzle in the given 10 by 10 grid, with no time limit. A crossword can contain anywhere from a single word up to as many words as the Describer can fit. Providing more descriptions makes it easier for the Answerer to give the correct answer, which results in higher scores for both the Describer and the Answerer.

[FIGURE 3 OMITTED]

User Roles--Answerer

The user can play the part of both Describer and Answerer in FunWords. When playing as the Answerer, the user plays a puzzle selected randomly and automatically by the system. Each puzzle has a 5 minute time limit, and the Answerer guesses the words from the descriptions given. The more answers that are guessed for a word, the more abundant its word-senses become, and the stronger the relationships between the word and its descriptions grow.

[FIGURE 4 OMITTED]

There is a special area for the Answerer, shown on the left side of Fig. 4. It contains several input fields in which the Answerer can rank each description: the Answerer selects the three descriptions that were most helpful in inferring the word, rating the strength of each semantic relationship on a 1-3 scale that is later used to analyze the results. In this way, FunWords can construct semantic resources consisting of weighted semantic relationships, which are not found in existing references such as dictionaries or thesauri.

3.2 Sentence Templates

As mentioned above, FunWords adopts 11 sentence templates: allowing the Describer to write free-form descriptions would create additional computational problems. We developed 10 sentence templates so that the relationships among words can be analyzed automatically, plus a tag template that collects other relationship information, allowing the results from FunWords to be interpreted as semantic resources [2].

The 11 sentence templates fall into four large categories: a Hierarchical category that expresses the hierarchical relations between words, a Thesaurus category that expresses semantic relatedness, a Feature category that describes the characteristics of the target, and a Tag category for anything else semantically related to another word or concept. We consulted many earlier studies in order to define these semantic relationships [16][17][18], in particular general studies of semantic relationships and the relationships used in WordNet and Verbosity. These studies show that, although many relationships exist, certain relationships are used frequently, depending on the target and the domain, and provide important information when defining the target. Some techniques divide words into four lexical categories and describe different semantic relationships for each category, and many studies identify hypernymy, hyponymy, meronymy, synonymy and antonymy as the most important relationships [9][10].

Drawing on these previous studies, the present work grouped five kinds of basic semantic relationships into two large categories and additionally defined semantic relationships that express the characteristics of the target. The Tag relationship was initially defined as a way to add or revise semantic relationships, but our experiments showed that it provides information similar to a tag cloud. We therefore focused on the importance of the templated semantic relationships other than Tag, and on how the semantic relationships differ across lexical categories.
Our implementation currently uses the following templates (a sketch mapping them to relation labels follows the list):

Allows for hierarchical categorization:

It is --.                                                   (Hypernym)
It has --.                                                   (Hyponym)

Provides information about the thesaurus of a word:
-- is a kind of it.                                     (Coordination)

It is the opposite of --.                                    (Antonym)
It is the synonym of  --.                                    (Synonym)

Provides information about the features of a word:

It looks like --.                                            (Mimetic)
It sounds --.                                           (Onomatopoeia)
It is typically [color].                               (Related color)
It is typically in --.                                (Related domain)
It is typically used for --.                         (Related purpose)

Collects related words: For example, "keyboard mouse monitor" was a
clue for the word "computer". The clues were collected, and archived.

--.                                                             (Tags)

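The short Python sketch below shows how filled-in descriptions like those above can be mapped back to relation labels before being turned into triples. The regular expressions follow the English renderings of the templates listed here and are illustrative assumptions; FunWords itself works with Korean patterns.

import re

# Ordered so that more specific patterns are tried before the generic "It is --." one.
TEMPLATES = [
    (re.compile(r"^It is the opposite of (.+)\.$"),     "antonym"),
    (re.compile(r"^It is the synonym of (.+)\.$"),      "synonym"),
    (re.compile(r"^(.+) is a kind of it\.$"),           "coordination"),
    (re.compile(r"^It is typically in (.+)\.$"),        "related_domain"),
    (re.compile(r"^It is typically used? for (.+)\.$"), "related_purpose"),
    (re.compile(r"^It looks like (.+)\.$"),             "mimetic"),
    (re.compile(r"^It sounds (.+)\.$"),                 "onomatopoeia"),
    (re.compile(r"^It is typically (.+)\.$"),           "related_color"),
    (re.compile(r"^It is (.+)\.$"),                     "hypernym"),
    (re.compile(r"^It has (.+)\.$"),                    "hyponym"),
]

def parse_description(word, description):
    """Return a (subject, relation, object) tuple; unmatched text falls back to Tags."""
    for pattern, relation in TEMPLATES:
        match = pattern.match(description.strip())
        if match:
            return (word, relation, match.group(1))
    return (word, "tags", description.strip())

print(parse_description("puppy", "It is the synonym of whelp."))
# ('puppy', 'synonym', 'whelp')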

3.3 Representation

FunWords is not only for collecting word-senses; it also annotates the relationships between words and builds metadata for constructing an ontology in RDF format. As mentioned above, we convert the structured word-senses collected through the 11 sentence templates into RDF triples and then represent them in XML.

As an illustrative example, Table 1 shows the results collected for the word "puppy" during the experimental period. Table 2 shows the same results converted into RDF-triple format, and Fig. 5 represents Table 2 as a semantic network.
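As a minimal sketch of this last step, the rdflib snippet below serializes the triples from Table 2 as RDF/XML. The namespace URI and predicate names are assumptions made for illustration; the paper does not publish its actual RDF vocabulary.

from rdflib import Graph, Namespace, Literal

FW = Namespace("http://example.org/funwords#")   # hypothetical namespace

g = Graph()
g.bind("fw", FW)

puppy = FW["puppy"]                              # subject URI for the word "puppy"
for predicate, obj in [("hypernym", "animal"),
                       ("hyponym", "Maltese"),
                       ("coordination", "rabbit"),
                       ("antonym", "cat"),
                       ("synonym", "whelp"),
                       ("onomatopoeia", "bowwow"),
                       ("related_domain", "animal hospital")]:
    g.add((puppy, FW[predicate], Literal(obj)))

print(g.serialize(format="xml"))                 # RDF/XML output, as described above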

[FIGURE 5 OMITTED]

4. Experiments

4.1 Experimental Data

We collected evidence that FunWords is fun to play and that players provide correct word-senses while playing, in order to verify the semantic relationships between words and word-senses. Because the game has not yet been formally released to the public, we present results from college students as a preview of FunWords. Twenty students played the game over a one-week period, generating 154 word-senses. Table 3 summarizes the statistics of the collected word-senses. As shown in Table 3, the results fall into six groups, but only four were used in the analysis: common nouns, abstract nouns, adjectives and neologisms. Verbs and adverbs were excluded because they yielded too few word-senses to be statistically meaningful.

4.2 Experimental Results and Analysis

To estimate the semantic relationships between words and the collected word-senses, we analyzed the description rankings assessed by Answerers on the 1-3 scale. We collected 154 words with an overall correct rate of 77%, where the correct rate is the percentage of Answerers who answered correctly out of all Answerers. Only 95 words would have been extracted with a threshold of 70%; because applying the threshold would have left too few words, we used all collected words in the analysis. The collected word-senses fall into four groups: common nouns, abstract nouns, adjectives and neologisms. Table 4 gives the results.

As shown in Table 4, the experimental results reveal several interesting patterns. For common nouns, structural semantic relationships such as hypernym and hyponym are emphasized. For abstract nouns, which express abstract concepts as words, descriptive semantic relationships such as synonym and antonym stand out more than the structural ones. Because adjectives are words that describe, identify or quantify things, relationships expressing description and status, for example color and sound, are emphasized. Finally, because a neologism is a newly coined term, relationships expressing description and characteristics are emphasized. These results can be used to prioritize weights when collecting and constructing semantic resources in the future. Fig. 6 is an example of a semantic network constructed from the relations with high correct rates in our experiment. The weight is the product of the inverse of the importance rank and the correct rate, and a higher weight signifies a stronger semantic relationship.
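The weighting itself is simple enough to state as a one-line computation; the Python sketch below reproduces it with the common-noun figures from Table 4 (correct rate 79%; synonym, hypernym and hyponym ranked 1, 2 and 3).

def relation_weight(importance_rank, correct_rate):
    # weight = (1 / importance rank) x correct rate; a higher value means a stronger relationship
    return (1.0 / importance_rank) * correct_rate

common_noun_correct_rate = 0.79
for relation, rank in [("synonym", 1), ("hypernym", 2), ("hyponym", 3)]:
    print(relation, round(relation_weight(rank, common_noun_correct_rate), 3))
# synonym 0.79, hypernym 0.395, hyponym 0.263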

[FIGURE 6 OMITTED]

5. Conclusion

In this paper, we introduced FunWords, a GWAP (Games With A Purpose) approach that uses human brain power, in the form of collective intelligence and human computation, to collect and annotate word-senses.

Using the 11 relationship templates in FunWords, we collected 154 semantic relationships over one week from twenty college students. The results helped us recognize how semantic relationships behave within word-senses: the experiment clearly showed that different semantic relationships are emphasized for common nouns, abstract nouns, adjectives and neologisms. For common nouns, structural relationships such as hypernym and hyponym are highlighted. For abstract nouns, descriptive and characteristic relationships such as synonym and antonym stand out. For adjectives, relationships expressing description and status, for example color and sound, are expressed more. Finally, for neologisms, relationships expressing description and characteristics are emphasized.

From these analyses, we conclude that collecting different parts of speech, or word classes, requires different techniques. Weighting the semantic relationships according to these characteristics can reduce time and cost by leaving out unnecessary or weakly related factors, and can improve expressive power, for example readability, by concentrating on the heavily weighted characteristics. Likewise, when constructing a semantic information model or ontology, semantic relationships can be prioritized according to their weights. Very few attempts to construct semantic information models or ontologies have emphasized weighted semantic relationships. Applying our conclusions to such construction can give the model greater expressive power and human readability while saving both cost and time.

DOI: 10.3837/tiis.2011.10.001

Part of this paper was presented at ICONI (International Conference on Internet) 2010, December 16-20, 2010, Philippines. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (2011-0026785).

References

[1] R. Davis, H. Shrobe, P. Szolovits, "What is a Knowledge Representation," AI Magazine, vol. 14, no. 1, pp. 17-33, 1993.

[2] G.A. Miller, "WordNet: A Lexical Database for English," Communications of the ACM, vol. 38, no. 11, pp. 39-41, 1995.

[3] N. Guarino, "Formal Ontology, Conceptual Analysis and Knowledge Representation," International Journal of Human-Computer Studies, 1995.

[4] P. Spyns, A. de Moor, J. Vandenbussche, R. Meersman, "From Folksologies to Ontologies: How the Twain Meet," Lecture Notes in Computer Science, vol. 4275, pp. 735-755, 2006.

[5] L. von Ahn, M. Kedia, M. Blum, "Verbosity: a game for collecting common-sense facts," in Proc. of the SIGCHI Conference on Human Factors in Computing Systems, pp. 75-78, 2006.

[6] L. von Ahn, "Games with a purpose," Computer, vol. 39, no. 6, pp. 92-94, 2006.

[7] E.L.M. Law, L. von Ahn, R.B. Dannenberg, M. Crawford, "Tagatune: A game for music and sound annotation," in Proc. of the International Conference on Music Information Retrieval, pp. 361-364, 2007.

[8] D.B. Lenat, "CYC: A Large-Scale Investment in Knowledge Infrastructure," Communications of the ACM, vol. 38, no. 11, pp. 33-38, 1995.

[9] G.A. Miller, "WordNet: a lexical database for English," Communications of the ACM, vol. 38, no. 11, pp. 39-41, 1995.

[10] G.A. Miller, R. Beckwith, C. Fellbaum, D. Gross, K.J. Miller, "Introduction to WordNet: An On-line Lexical Database," International Journal of Lexicography, vol. 24, no. 2, pp. 235-244, 1990.

[11] E. Navarro, F. Sajous, B. Gaume, L. Prevot, H. ShuKai, K. Tzu-Yi, P. Magistry, H. Chu-Ren, "Wiktionary and NLP: improving synonymy networks," in Proc. of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources, pp. 19-27, 2009.

[12] T. Zesch, C. Muller, I. Gurevych, "Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary," in Proc. of the Conference on Language Resources and Evaluation (LREC), pp. 1646-1652, 2008.

[13] L. von Ahn, L. Dabbish, "Designing games with a purpose," Communications of the ACM, vol. 51, no. 8, pp. 58-67, 2008.

[14] L. von Ahn, L. Dabbish, "Labeling images with a computer game," in Proc. of the SIGCHI Conference on Human Factors in Computing Systems, pp. 319-326, 2004.

[15] B. Hladka, J. Mirovsky, P. Schlesinger, "Play the Language: Play Coreference," in Proc. of the ACL-IJCNLP Conference, pp. 209-212, 2009.

[16] C.A. Bean, R. Green, "Relationships in the organization of knowledge," Kluwer Academic Publishers, 2001.

[17] R. Green, C.A. Bean, S.H. Myaeng, "The semantics of relationships: an interdisciplinary perspective," Kluwer Academic Publishers, 2001.

[18] B. Rosario, M. Hearst, "Classifying the Semantic Relations in Noun Compounds via a Domain-Specific Lexical Hierarchy," in Proc. of the Conference on Empirical Methods in Natural Language Processing, pp. 82-90, 2001.

Kigon Lyu, Jungyong Lee, Dongeon Sun, Daiyoung Kwon and Hyeoncheol Kim

Dept. of Computer Science Education, Korea University, Seoul, 136-701, South Korea [e-mail: {gon0121, popobo, sunde41, daiyoung.kwon, harrykim} @korea.ac.kr]

* Corresponding author: Hyeoncheol Kim

Received April 8, 2011; revised July 27, 2011; revised September 20, 2011; accepted September 26, 2011; published October 31, 2011

Kigon Lyu is a Ph.D. candidate in the Department of Computer Science Education at Korea University, in Seoul, Korea. He received his B.S. and M.S. degrees in Computer Engineering from Baekseok University and Hanshin University, in 2006 and 2008, respectively. His research interests include knowledge representation, human computation and affective computing.

Jungyong Lee is a research engineer at LG Electronics. He received his B.S. degree from Kyungwon University and his M.S. degree from the Department of Computer Science Education at Korea University, Seoul, in 2008 and 2011, respectively. His research interest is smart-device-based learning.

Dongeun Sun received his B.S. (2007) and M.S. (2009) degrees from the Department of Computer Science Education at Korea University, Seoul, Korea, and is currently a Ph.D. candidate in the same department. His research interests are in affective computing, such as facial emotion recognition, and ontology construction from collective intelligence.

Daiyoung Kwon is a research professor at Korea University, in Seoul, Korea. He received his B.S., M.S., and Ph.D. degrees in Computer Science Education from Korea University, Seoul, in 2000, 2006 and 2011, respectively. His research interests include algorithmic thinking learning, educational programming languages and cognitive experiments for evaluating thinking abilities.

Hyeoncheol Kim is a professor in the Department of Computer Science Education at Korea University, in Seoul, Korea. He received his B.S. and M.S. degrees in computer science from Korea University and the University of Missouri-Rolla, respectively, and his Ph.D. in computer and information sciences from the University of Florida. He worked at GTE Data Services Inc. and Samsung SDS before joining Korea University in 1999. His research interests include data mining, human learning and learning software development.
Table 1. Results for the word "puppy"

Puppy

It is a/an animal.
It has a/an Maltese.
Rabbit is a kind of animal.
It is the opposite of cat.
It is the synonym of whelp.
It sounds bowwow.
It is typically in animal hospital.
  Tags: Golden retriever, Bulldog, etc.

Table 2. RDF-Triple format of the result

Subject      Predicate                   Object

Puppy     is a / an         animal
          has a / an        Maltese
          coordination      rabbit
          the opposite of   cat
          the synonym of    whelp
          Sounds            bowwow
          is typically in   animal hospital
          tags              Golden retriever, bulldog, etc.

Table 3. Statistics of collected word-senses

POS (Part of Speech)            # of senses

Noun         Common nouns            74
             Abstract nouns          15
Verb                                  3
Adjective                            21
Adverb                                2
Neologism                            38

Total                               154

Table 4. Analysis results of the semantic relationships for each group
(importance rank / average rating on the 1-3 scale)

Semantic           Common     Abstract
Relationships       nouns      Nouns     Adjective   Neologism

Hypernym          2 / 2.08    3 / 1.30   4 / 1.63    2 / 2.04
Hyponym           3 / 2.06    4 / 2.25   5 / 1.33    6 / 1.58
Coordination      7 / 1.40    5 / 1.33   6 / 1.13    5 / 1.67
Antonym           4 / 2.12    1 / 2.91   2 / 2.44    1 / 2.30
Synonym           1 / 2.22    2 / 1.93   1 / 2.53    3 / 2.02
Mimetic           6 / 2.06    6 / 1.50   3 / 2.11    10 / 0.00
Onomatopoeia      9 / 1.89    9 / 0.00   7 / 1.00    9 / 1.00
Related color     8 / 1.69    9 / 0.00   10 / 0.04   8 / 1.33
Related domain    5 / 2.05    6 / 1.50   8 / 0.09    4 / 2.00
Related purpose   10 / 1.44   6 / 1.50   8 / 0.09    7 / 1.44
Tags              11 / 0.00   9 / 0.00   11 / 0.00   10 / 0.00
Correct Rate         79%        73%         55%         74%