
Metadata And Linked Data in Word Sense Disambiguation.

Introduction

Word Sense Disambiguation (WSD) is referred to as an "AI-complete" problem (Mallery, 1988), i.e., a task that is relatively easy for people, but considerably more difficult for machines. If someone makes a query for a polysemous word (e.g., "plant," "bass," "mercury," etc.), how is an information retrieval system to understand which sense of the word is intended? There exist tried-and-tested methods, such as simply using the most predominant sense of the word (McCarthy, Koeling, Weeds, & Carroll, 2004), or looking at the words next to the query term to determine the statistically most likely meaning (Jurafsky & Martin, 2009; Manning & Schutze, 1999), but these methods often produce less-than-satisfactory results, with accuracy often around 70% (Navigli, 2009). Furthermore, these methods have been heavily dependent on the manual creation of knowledge sources (Edmonds, 2000), which are expensive to create and subject to change, thus creating what is termed a knowledge acquisition bottleneck (Gale, Church, & Yarowsky, 1992). Linked Data technologies (Berners-Lee, 2006), however, allow us to utilize existing ontologies and lexica, which can then be exploited to improve the automatic semantic understanding of the word. This paper will examine several systems that purport to disambiguate words by using Linked Data, and some of the models these systems use to ensure interoperability.

Literature Review

The most complete treatment of the subject of WSD is arguably Agirre & Edmonds [ed.] (2007), which presents a detailed definition of the problem, along with a history thereof, and numerous algorithms which are used in practice. Kwong (2013) offers slightly more recent coverage, along with predictions as to how WSD methods will evolve in the near future. Generalists might find sufficient the survey from Navigli (2009), or the chapters covering WSD in either Jurafsky & Martin (2009) or Manning & Schutze (1999). SemEval [which was originally named Senseval (Kilgarriff, 1998)] is an ongoing evaluation project which is used as a baseline to assess various WSD methods, including many which will be examined in this paper.

Linked Linguistic Open Data (LLOD) is heavily dependent on metadata, and any consideration thereof would require an examination of its standards. A brief history of the topic of linguistic annotation can be found in Palmer & Xue (2013). Bird & Simons (2003a) and Ide, Romary, & de la Clergerie (2004) proposed sets of best practices for linguistic annotations, while Simons, Bird, & Spanne (2008) offered a more recent set of recommendations that specifically suggested language codes from ISO 639-3 (1) be used in metadata. Ide & Pustejovsky (2010) suggested a list of best practices for language technology metadata, focusing heavily on the work of the OLAC and the European Language Resources Association (ELRA). Gracia, Montiel-Ponsoda, Cimiano, Gomez-Perez, Buitelaar, & McCrae (2012) considered the issue of Linked Data being stored in different languages, and suggested that techniques such as ontology localization, ontology mapping, and cross-lingual ontology-based information access and presentation would help prevent information from being locked up in linguistic data silos. Gayo, Kontokostas, & Auer (2013) presented a set of best practices for multilingual linked open data, and pointed out that SPARQL queries can be improved if tags are identified by language. Reviews of specific linguistic annotation schemes include: the Open Language Archives Community [OLAC] metadata set (Bird & Simons, 2003b); the General Ontology for Linguistic Description [GOLD] (Farrar & Langendoen, 2003); ISOcat, a Data Category Registry (DCR) for the ISO TC 37 (terminology and other language and content resources) registry (Kemps-Snijders, Windhouwer, Wittenburg, & Wright, 2009); the ISO/TC 37/SC 4 standard (Lee & Romary, 2010); the lemon (LExicon Model for ONtologies) model (McCrae, Aguado-de-Cea, Buitelaar, Cimiano, Declerck, Gomez-Perez, ... & Wunner, 2012); and Lexical Markup Framework [LMF] (Francopoulo, 2013), which had a strong influence on the lemon model.

A number of papers detail projects that utilized the schemes listed above. Montiel-Ponsoda, Gracia del Rio, Aguado de Cea, & Gomez-Perez (2011) showed how the lemon model can be extended using a metamodel in OWL, which would allow translation to be represented on a separate layer. Buitelaar, Cimiano, Haase, & Sintek (2009) advocated using ontologies beyond those of RDFS, OWL, and SKOS, and presented a model called LexInfo, which combines aspects of older models. Chiarcos, Dipper, Gotze, Leser, Ludeling, Ritz, & Stede (2008) treated the Ontology of Linguistic Annotation, which is especially useful for corpora that have been annotated several times using different methods.

Two of the most commonly used linguistic tools on the Semantic Web are the general-purpose lexical ontologies WordNet (Fellbaum, 1998) and FrameNet (Baker, Fillmore, & Lowe, 1998). Although Ide (2014) argued that FrameNet was the "ideal resource for representation as linked data" (18), the majority of the projects covered later in this paper utilized WordNet, and thus this tool will be examined in more detail. Both FrameNet and WordNet are often used in Linked Data projects to automatically annotate texts with semantic metadata. Projects that have used these databases include Huang (2007), wherein WordNet files were converted to OWL to assist in machine comprehension of metaphor; and BabelNet (Navigli, 2012), a resource which will be reviewed later in this paper. Ehrmann, Cecconi, Vannella, McCrae, Cimiano, & Navigli (2014) converted BabelNet into Linked Data via the lemon model; and Moro, Navigli, Tucci & Passonneau (2014) used BabelNet to automatically annotate the Manually Annotated Sub-Corpus 3.0 (MASC) and were thereby able to perform automatic WSD with an accuracy of 70%, an impressive figure, but still too low to see much practical adoption.

Other examples of linguistic tools used with Semantic Web technologies include the following: Krizhanovsky & Smirnov (2013) utilized Wiktionary to automatically create a general-purpose lexical ontology; Hellmann, Brekle, & Auer (2013) described a similar project wherein Wiktionary extractors made use of DBpedia to create RDF triples; de Melo (2014a) introduced lexvo.org, a system which automatically creates URIs for each word and sense, thus guaranteeing a constant reference; Mendes, Jakob, Garcia-Silva, & Bizer (2011) introduced DBpedia Spotlight, an open-source program that automatically annotates texts with links to the Linked Open Data cloud by using the URIs in DBpedia; and Serasset (2014) described the extraction of multilingual lexical data from Wiktionary, the importation thereof into DBNary, and the final conversion of the data into MLLOD (Multilingual Lexical Linked Open Data) via the lemon model.

A number of very different methods of using Semantic Web technologies to disambiguate word senses have been attempted and analyzed. Elbedweihy, Wrigley, Ciravegna, & Zhang (2013) used a combination of WordNet, BabelNet, and Wikipedia to help generate SPARQL queries, which would subsequently resolve ambiguities in the original queries with a success rate of 76%; Fragos (2013) also used WordNet--in this case the extended glosses of WordNet--to train WSD systems; McCarthy et al. (2004) used the WordNet similarity package and raw textual corpora to solve WSD by using the predominant sense of the word, a method which achieved a success rate of 64%; Ide (2006) treated the problem of polysemy by mapping FrameNet sets to WordNet.

Some case study reviews show the strengths and weaknesses of more general models: Haase (2004) looked at tags for digital images to argue that semantic metadata can help alleviate some of the issues of precision caused by selecting overly narrow terms; de Melo & Weikum (2008) argued that "language-related knowledge" forms the backbone of the semantic web, and presented ways in which linguistic items such as languages, scripts, and terms can unambiguously be linked with URIs, from which new links can automatically be formed; and Tagarelli, Longo, & Greco (2009) showed how notions of sense relatedness can be calculated by examining overlaps between dictionary glosses and measuring distances for ontology paths.

The rest of the paper will cover in more detail several tools and models that feature prominently in the use of metadata to disambiguate word senses. As WordNet comes up so often in this paper, a rudimentary comprehension of the workings of the lexicon will be necessary to fully understand how other tools covered in this paper work. The subsequent sections will look at lexvo.org; the lemon model; BabelNet; WordNet++; and finally tag disambiguation with TAGora Sense Repository. This paper will conclude with a brief examination of some of the issues of relying on meta- and linked data for Word Sense Disambiguation.

WordNet

WordNet (Fellbaum, 1998) is a general-purpose lexical ontology that features prominently in the web of linked data. In WordNet, words are organized into Synsets (sets of synonyms), which are linked to one another by relations such as hypernymy (broader categories), hyponymy (narrower categories), meronymy (part-to-whole relationships), and more. Classifying words by hyper- and hyponyms allows entries to be nested hierarchically, which can assist in determining the correct sense of a word. WordNet is free to use and fully queryable, and below are screenshots showing the results for a query of the word play, a highly polysemous word which will again be considered in detail when we examine BabelNet.

In addition to organizing words in "Synsets," WordNet assigns a "sense key" to each sense of a word, a fact that will be exploited to great effect in BabelNet. For example, the sense of "play" indicating a dramatic performance would be listed as $\text{play}^{1}_{n}$, with the superscript "1" indicating that this is the first sense of "play" listed in WordNet, and the subscript "n" indicating that this instance of "play" is a noun. The sense $\text{play}^{1}_{n}$ would be in the same Synset as $\text{drama}^{1}_{n}$ and $\text{dramatic play}^{1}_{n}$, which are also the first senses listed for their respective terms, and nouns as well. The sense of the word "play" referring to children's games would be listed as $\text{play}^{8}_{n}$, and in the same Synset as $\text{child's play}^{2}_{n}$, as they are respectively the eighth and second senses of their terms (Navigli & Ponzetto, 2012a). Besides being used in the projects listed below, WordNet has also been connected to Wikipedia: Ponzetto & Navigli (2010) automatically associated Wikipedia pages with WordNet senses. However, one criticism of WordNet is that many of the senses listed are too similar to one another, which may limit the usefulness of WordNet in WSD tasks (Ide, 2006).
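The sense notation described above can be illustrated with a small data structure. The following is a minimal Python sketch and not WordNet's actual API or file format: the `Sense` class, the `synset_of` and `senses_of` helpers, and the two hand-built Synsets are hypothetical, and the data is limited to the examples given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    """One sense of a lemma: the lemma, its part of speech, and its sense number."""
    lemma: str
    pos: str      # "n" for noun, "v" for verb, ...
    number: int   # position of this sense in the lexicon's listing for the lemma

    def label(self) -> str:
        # Plain-text rendering of the superscript/subscript notation, e.g. play^1_n
        return f"{self.lemma}^{self.number}_{self.pos}"


# Hand-built Synsets holding only the examples from the text (illustrative data).
SYNSETS = [
    frozenset({Sense("play", "n", 1), Sense("drama", "n", 1),
               Sense("dramatic play", "n", 1)}),
    frozenset({Sense("play", "n", 8), Sense("child's play", "n", 2)}),
]


def synset_of(sense: Sense) -> frozenset:
    """Return the Synset that contains the given sense."""
    for synset in SYNSETS:
        if sense in synset:
            return synset
    raise KeyError(sense.label())


def senses_of(lemma: str) -> list:
    """List all known senses of a lemma, ordered by sense number."""
    found = [s for synset in SYNSETS for s in synset if s.lemma == lemma]
    return sorted(found, key=lambda s: s.number)


if __name__ == "__main__":
    for sense in senses_of("play"):
        mates = sorted(s.label() for s in synset_of(sense) if s != sense)
        print(sense.label(), "->", mates)
```

The point of the sketch is that a polysemous lemma such as "play" maps to several senses, each of which belongs to exactly one Synset, so choosing a sense is equivalent to choosing a Synset.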

Various metadata schemes have been used to link the information contained within the WordNet ontology with other databases and ontologies. There is a nearly complete XML version of WordNet (2), and an RDF version (3) which was structured according to the lemon model (cf. below). There exists as well a mapping of WordNet entries to schema.org terms (4). The World Wide Web Consortium also maintains extensive documentation regarding RDF/OWL representations of WordNet (van Assem, Gangemi, & Schreiber, 2006). Finally, entries in WordNet are semantically linked to a number of other LLOD resources, including lexvo.org (cf. below), and VerbNet.

Lexvo.org

Lexvo.org (de Melo, 2014a) is an easy-to-use service which provides URIs to identify any term in a given language. These URIs "serve as interchange URIs that can easily be created from any word-segmented natural language text" (de Melo, 2014a, 3). Unambiguous reference to a specific language is achieved by using ISO 639-3 codes, and lexvo.org records also give the parts of speech for the various senses, and provide links to translations in other languages. By identifying which language a term is from, lexvo.org can assist in multilingual disambiguation, which can help prevent some accidental, embarrassing results (5). In addition to linking individual senses of a word to WordNet synsets (copies of which are hosted at lexvo.org), records at lexvo.org are linked to Library of Congress Subject Headings and to records at OpenCyc (6), and are linked internally at lexvo.org to translations for the various senses of the word (e.g., there is a URI for the word jeu and an IRI (7) for piece de theatre, two French terms which capture two of the senses connoted by the English play). However, it should be pointed out that when I retrieved the record for "play," a number of URIs linked to it had expired--a problem with linked data that I will explore again at the conclusion of this paper. Lexvo.org uses sources such as Wikipedia, Wiktionary, and the Unicode CLDR (Common Locale Data Repository) (8) to supply descriptions of all the languages it covers. In 2008, Lexvo.org became the first web site to publish Linked Data based on Wiktionary (de Melo, 2014a), and it also utilizes the URIs created for words in the W3C RDF/OWL representation of WordNet (van Assem, Gangemi, & Schreiber, 2006). Lexvo.org can help free data from "data silos" by providing unambiguous URIs for individual languages, the terms within them, and even the senses of these terms. Lexvo.org links are used by the British Library in their British National Bibliography data; the Spanish National Library; Sudoc, the French academic catalog; the LOCAH Linked Archives Hub project; and DBpedia Spotlight (de Melo, 2014b).
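To make concrete the idea that term URIs can be created directly from word-segmented text, here is a minimal Python sketch. The two URI patterns (`http://lexvo.org/id/term/<code>/<term>` and `http://lexvo.org/id/iso639-3/<code>`) are assumptions based on the commonly published lexvo.org scheme and should be checked against the lexvo.org documentation; the function names are hypothetical.

```python
from urllib.parse import quote

# Assumed URI patterns; verify against the lexvo.org documentation before use.
TERM_BASE = "http://lexvo.org/id/term"
LANG_BASE = "http://lexvo.org/id/iso639-3"


def language_uri(iso639_3: str) -> str:
    """Build a URI for a language identified by its three-letter ISO 639-3 code."""
    code = iso639_3.lower()
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not an ISO 639-3 style code: {iso639_3!r}")
    return f"{LANG_BASE}/{code}"


def term_uri(term: str, iso639_3: str) -> str:
    """Build a URI for a term in a given language, percent-encoding the term."""
    language_uri(iso639_3)  # reuse the code validation above
    return f"{TERM_BASE}/{iso639_3.lower()}/{quote(term, safe='')}"


def uris_for_tokens(tokens, iso639_3: str) -> list:
    """Map every token of a word-segmented text to a term URI."""
    return [term_uri(token, iso639_3) for token in tokens]


if __name__ == "__main__":
    print(term_uri("play", "eng"))
    print(term_uri("jeu", "fra"))
    print(uris_for_tokens("the children play".split(), "eng"))
```

Because the language code is part of the URI, the English term "play" and an identically spelled term in another language would receive different identifiers, which is what allows the multilingual disambiguation described above.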

Lemon

Lemon (LExicon Model for Ontologies) "is designed to represent lexical information about words and terms relative to an ontology on the Web" (McCrae et al., 2012, 703), and is what is referred to as an ontology-lexicon, which shows how the classes of the ontology are realized linguistically (Buitelaar, 2010). Lemon follows a principle referred to as semantics by reference, in that "the (lexical) meaning of the entries in the lexicon is assumed to be expressed exclusively in the ontology and the lexicon merely points to the appropriate concepts" (McCrae et al., 2012, 703). Lemon is similar to the SKOS project (Miles & Bechhofer, 2009), but differs in that it is "an independent and external model, intended to be published with arbitrary ontology-based conceptualisations [sic] ... in order to provide a richer description of the knowledge captured in those resources in one or several natural languages" (McCrae et al., 2012, 703).

Lemon is an RDF-native form, which allows it to exploit existing Semantic Web technologies.

The lemon model is not intended to be a semantic model in itself; instead, it references existing resources (e.g., ontologies), which gives it the capability to represent semantics. The model does this through its "(lexical) sense" object, which avoids using the concept of a word sense that is commonly found in many existing models, a practice that Kilgarriff (1997) criticized in his paper about the issues created by word senses (McCrae et al., 2012). The lemon model is used by a great number of linked data initiatives, including BabelNet, which will be considered later in this paper (Ehrmann et al., 2014).
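The principle of semantics by reference can be sketched as a handful of RDF triples: the lexicon entry carries the written form, while its sense merely points to a concept defined elsewhere. The Python sketch below builds such triples by hand and serializes them in an N-Triples style. It rests on several assumptions: the namespace `http://lemon-model.net/lemon#` and the terms `LexicalEntry`, `canonicalForm`, `writtenRep`, `sense`, and `reference` are taken from the published lemon vocabulary as I understand it, while the `http://example.org/lexicon/` base, the helper names, and the choice of a DBpedia resource as the referenced concept are purely illustrative.

```python
LEMON = "http://lemon-model.net/lemon#"            # assumed lemon namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/lexicon/"                 # hypothetical lexicon base URI


def lexical_entry_triples(entry_id, written_rep, lang, concept_uri):
    """Describe one lexical entry whose meaning is given only by reference."""
    entry = EX + entry_id
    form = entry + "#form"
    sense = entry + "#sense"
    return [
        (entry, RDF_TYPE, LEMON + "LexicalEntry"),
        (entry, LEMON + "canonicalForm", form),
        # A (text, language) tuple stands for a language-tagged literal.
        (form, LEMON + "writtenRep", (written_rep, lang)),
        (entry, LEMON + "sense", sense),
        # The sense does not define a meaning; it points into an ontology.
        (sense, LEMON + "reference", concept_uri),
    ]


def to_ntriples(triples):
    """Serialize the triples, one statement per line, in N-Triples style."""
    lines = []
    for subj, pred, obj in triples:
        if isinstance(obj, tuple):
            text, lang = obj
            escaped = text.replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"@{lang}'
        else:
            rendered = f"<{obj}>"
        lines.append(f"<{subj}> <{pred}> {rendered} .")
    return "\n".join(lines)


if __name__ == "__main__":
    triples = lexical_entry_triples(
        "play_n", "play", "en", "http://dbpedia.org/resource/Play_(theatre)"
    )
    print(to_ntriples(triples))
```

A second entry for the children's-games sense of "play" would reuse the same written form but reference a different concept, so the ambiguity lives in the links rather than in the lexicon itself.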

BabelNet

Simply put, BabelNet (9) (Navigli & Ponzetto, 2012a) is a combination of the structure of Wikipedia augmented with Synsets from WordNet. The two databases were integrated by an automatic mapping, and any lexical gaps in resource-poor languages were plugged by using machine translation. The resulting "encyclopedic dictionary" provides lexical information on concepts and named entities in many languages, and the end result is a semantically-linked ontology. From version 2.0 onwards, the database was also linked with the Open Multilingual WordNet [OMWN] (Bond & Foster, 2013), a collection of wordnets in different languages, and OmegaWiki (10), a collaborative dictionary that is available in a number of different languages. The 2.0 version also had more than 9 million Babel synsets (i.e. entries linked in semantic networks) in over 50 languages (Ehrmann et al., 2014; Navigli, 2014). Currently, version 2.5 is in beta mode.

The knowledge contained within these Babel synsets can be used to perform knowledge-rich (Navigli & Velardi, 2005), graph-based (Bird & Liberman, 2001) Word Sense Disambiguation in both monolingual and multilingual settings. By utilizing the partial structure of a typical Wikipedia page, BabelNet is able to extract semantic information regarding an entry. For example, by using Wikipedia's redirect and disambiguation pages, internal and multilingual links, and category designations as well, BabelNet is able to automatically glean significant semantic information regarding an entry [cf. graph below] (Navigli & Ponzetto, 2012a; Navigli & Ponzetto, 2012b).

BabelNet uses a mapping algorithm to determine which sense of a polysemous WordNet entry should be paired with which disambiguated Wikipedia entry. In the example illustrated above, WordNet's $\text{play}^{1}_{n}$ would be paired with Wikipedia's PLAY(THEATRE) entry, since the graphs for the two entries share more words in common than the graphs for the other choices. Concerning Word Sense Disambiguation, the gains in precision over rival methods may be quite small, but the gains in recall are fairly high. Furthermore, BabelNet can also take advantage of the multilingual links in Wikipedia to assist even in monolingual WSD, as can be seen in the figure below (Navigli & Ponzetto, 2012a). There also exists an RDF version of BabelNet 2.0 which contains about 1.1 billion triples, among which there are over 9 million SKOS concepts, and nearly 100 million lemon lexical senses (Ehrmann et al., 2014). BabelNet can currently be used as a stand-alone resource with its Java API, through a SPARQL endpoint, or through a Linked Data interface as part of the Linguistic Linked Open Data (LLOD) cloud (Navigli, 2014). A wiki on converting BabelNet to Linguistic Linked Data is also maintained by the World Wide Web Consortium (11).

WordNet++

WordNet++ (Ponzetto & Navigli, 2010) is an English-only subset of BabelNet in which entries from Wikipedia are mapped to the corresponding senses in WordNet. For example, by looking at the disambiguation links for the Wikipedia entry SODA (SOFT DRINK), we can construct a disambiguation context which includes the words: soft, drink, cola, sugar. The possible matches at WordNet include $\text{soda}^{1}_{n}$, whose context includes words like salt, acetate, chlorate, and benzoate. The context for $\text{soda}^{2}_{n}$ includes soft, drink, cola, bitter, etc. Because the latter context has the largest intersection with the Wikipedia context, WordNet++ would match Wikipedia's SODA (SOFT DRINK) with WordNet's $\text{soda}^{2}_{n}$. Furthermore, WordNet++ can use the additional links in Wikipedia (e.g., SODA (SOFT DRINK) is linked to the entry SYRUP) to create even larger Synsets. This method generated consistently high results on several SemEval tasks (Ponzetto & Navigli, 2010).
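The soda example can be worked through in a few lines of Python. This is a simplified sketch of overlap-based matching, using only the context words quoted in the text; it is not the full mapping procedure of Ponzetto & Navigli (2010), and the function name `best_sense` and the plain-text sense labels are hypothetical.

```python
def best_sense(page_context, sense_contexts):
    """Pick the sense whose context shares the most words with the page context.

    Returns the winning sense label together with all overlap scores.
    On a tie, the sense listed first in sense_contexts wins.
    """
    scores = {
        sense: len(page_context & context)
        for sense, context in sense_contexts.items()
    }
    winner = max(scores, key=scores.get)
    return winner, scores


# Context words taken from the example in the text.
SODA_PAGE = {"soft", "drink", "cola", "sugar"}       # Wikipedia: SODA (SOFT DRINK)
SODA_SENSES = {
    "soda^1_n": {"salt", "acetate", "chlorate", "benzoate"},
    "soda^2_n": {"soft", "drink", "cola", "bitter"},
}

if __name__ == "__main__":
    winner, scores = best_sense(SODA_PAGE, SODA_SENSES)
    print(scores)   # overlap size per candidate sense
    print(winner)   # the sense with the largest overlap
```

Here the first sense shares no words with the Wikipedia context while the second shares three (soft, drink, cola), so the second sense is selected.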

Tag Disambiguation with DBpedia

Delicious (12) allows users to assign text descriptions to user-contributed tags, but these can't readily be used by machines (Garcia, Szomszor, Alani, & Corcho, 2009). Garcia et al. developed an approach that uses the TAGora Sense Repository (TSR) (13), a linked data resource that provides metadata about tags and their possible senses, to determine the most likely sense of a tag. The TSR is ultimately linked to DBpedia (Morsey, Lehmann, Auer, Stadler, & Hellmann, 2012), a representation in the RDF model of a portion of the information in Wikipedia. This approach faced one difficulty that many other WSD systems do not face: tags do not occur in sentences, and therefore the approach could not examine a word in its sentential context to disambiguate it. However, tags seldom occur singly, and this method was able to look at the co-occurring tags in order to determine the most likely sense. Garcia et al. (2009) created an algorithm which represented the tags and the context as vectors, and then used similarity measures to choose the most likely sense. The authors stated that this approach was designed to be used for tag disambiguation, but could be used for other functions as well.
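The general idea of comparing a tag's context with candidate senses in vector form can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of Garcia et al. (2009): it uses simple term-frequency vectors and cosine similarity, and the tag "apple", its co-occurring tags, the candidate sense labels, and their descriptive terms are all invented for the example.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(context_tags, candidate_senses):
    """Score every candidate sense against the co-occurring tags.

    context_tags: the other tags that appear alongside the ambiguous tag.
    candidate_senses: mapping from sense label to terms describing that sense.
    Returns the best-scoring sense label and all similarity scores.
    """
    context = Counter(context_tags)
    scores = {
        sense: cosine(context, Counter(terms))
        for sense, terms in candidate_senses.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Invented example: the tag "apple" used together with three other tags.
    co_tags = ["mac", "iphone", "software"]
    senses = {
        "Apple_Inc.": ["company", "mac", "iphone", "software", "computer"],
        "Apple_(fruit)": ["fruit", "tree", "orchard", "cider"],
    }
    best, scores = disambiguate(co_tags, senses)
    print(best, scores)
```

In a real system the descriptive terms for each sense would come from a resource such as DBpedia rather than from a hand-written dictionary, and the vectors would typically be weighted rather than raw counts.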

Conclusion

Like many other activities in the realm of artificial intelligence, Word Sense Disambiguation is a task that is notoriously difficult for machines, but relatively simple for humans. Successful methods for disambiguating polysemous words automatically have been developed, but these methods are heavily reliant on marked-up corpora, which require significant investments of time and money (Edmonds, 2000). Linked Open Data, however, is allowing computers to exploit semantically marked-up data in an attempt to share resources, and thus we can use already discovered knowledge to solve hitherto unseen problems. This is only possible, of course, because of a realization of the importance of interoperability and a commitment to shared standards.

But a dependence on interconnected data has created a new set of challenges. One major problem with Linked Data is the quality of links. The ever-changing nature of information means that websites with static information are at a disadvantage, for if the information contained within the page ever changes, it has to be changed manually to remain current. While this problem is obviated by using Linked Data, a new problem arises if the link disappears without notice. In my cursory explorations of several tools examined above, I came across several dead links, which naturally caused me to question the practicality of these tools. De Melo (2014a), in introducing lexvo.org, explained that he was reluctant to use dynamic information from Wiktionary, as the site changes frequently. It seems entirely possible that even a slight structural change at a site like Wiktionary could wreak havoc with the resources linked to it, and it is not certain that the administrators at Wiktionary will take these consequences into account when debating alterations. Furthermore, there presently seems to be no mechanism to prune and replace dead links, and such a deficit seems to be a liability for any service relying on linked data.

Another concern is that none of these standards or tools will gain purchase, and we could end up with a multitude of systems competing with and failing to operate with each other. While a model like lemon seems to have gained wide use, BabelNet is a relatively young project, and may drift towards obsolescence like WiSeNet (Moro & Navigli, 2012) and MENTA (de Melo & Weikum, 2010), two recent similar projects that are scarcely mentioned today.

Nonetheless, the exponentially increasing amount of data available through Linked Open Data seems to suggest that WSD systems will come to increasingly rely on it, and on the metadata they will use to locate the sought-after information. Naturally, such linking would not be possible without a commitment to shared standards, and the success of these linked data tools can be considered another victory for the virtues of interoperability. However, in our rush to connect everything, it would be advisable to regularly examine the quality of the data we are linking.

References

Agirre, E., & Edmonds, P. G. (ed.). (2007). Word Sense Disambiguation: Algorithms and Applications. Berlin: Springer.

Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley Framenet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 1 (pp. 86-90). Association for Computational Linguistics.

Bird, S., & Liberman, M. (2001). A formal framework for linguistic annotation. arXiv:cs/9903003v1

Bird, S., & Simons, G. (2003a). Extending Dublin Core metadata to support the description and discovery of language resources. Computers and the Humanities, 37(4), 375-388.

Bird, S., & Simons, G. (2003b). Seven dimensions of portability for language documentation and description. Language, 79, 557-582. Accessed October 21, 2014 at http://www.jstor.org.libaccess.sjlibrary.org/stable/4489465

Bond, F. & Foster, R. (2013). Linking and extending an Open Multilingual Wordnet. In Proc. of 51st Annual Meeting of the Association for Computational Linguistics, 1352-1362.

Buitelaar, P. (2010). Ontology-based semantic lexicons: Mapping between terms and object descriptions. In: C.R. Huang, N. Calzolari, A. Gangemi, A. Lenci, A. Oltramari & L. Prevot (Eds.), Ontology and the Lexicon, pp. 212-223. Cambridge: Cambridge University Press.

Buitelaar, P., Cimiano, P., Haase, P., & Sintek, M. (2009). Towards linguistically grounded ontologies. In The Semantic Web: Research and Applications, pp. 111-125. Berlin: Springer.

Berners-Lee, T. (2006). Linked Data. Accessed September 29, 2014 at http://www.w3.org/DesignIssues/LinkedData.html

Chiarcos, C., Dipper, S., Gotze, M., Leser, U., Ludeling, A., Ritz, J., & Stede, M. (2008). A flexible framework for integrating annotations from different tools and tagsets. Traitement Automatique des Langues, 49(2), 271-293.

Cyganiak, R., Wood, D., & Lanthaler, M. (2014). RDF 1.1 Concepts and abstract syntax. Accessed October 7, 2014 at http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/#dfn-iri

de Melo, G., & Weikum, G. (2010). MENTA: Inducing multilingual taxonomies from Wikipedia. In Proceedings of the Nineteenth ACM Conference on Information and Knowledge Management, pp. 1099-1108. Toronto, Canada.

de Melo, G. (2014a). Lexvo.org: Language-related information for the linguistic linked data cloud. Semantic Web Journal. Accessed October 15, 2014 at http://www.semantic-web-journal.net/system/files/swj521.pdf

de Melo, G. (2014b). Lexvo.org main page. Accessed November 18, 2014 at www.lexvo.org

de Melo, G., & Weikum, G. (2008, October). Language as a foundation of the semantic web. In International Semantic Web Conference (Posters & Demos). Accessed October 17, 2014 at https://www.mpi-inf.mpg.de/~gdemelo/papers/demelo-lexvo-iswc2008.pdf

Edmonds, P. (2000). Designing a task for SENSEVAL-2 [Technical report]. Brighton, UK: University of Brighton.

Ehrmann, M., Cecconi, F., Vannella, D., McCrae, J., Cimiano, P., & Navigli, R. (2014). Representing multilingual data as Linked Data: The case of BabelNet 2.0. In Proceedings of the Language Resources and Evaluation Conference. Accessed October 20, 2014 at http://www.lrec-conf.org/proceedings/lrec2014/pdf/810_Paper.pdf

Elbedweihy, K., Wrigley, S. N., Ciravegna, F., & Zhang, Z. (2013). Using Babelnet in bridging the gap between natural language queries and linked data concepts. In NLP-DBPEDIA@ ISWC. Accessed October 17, 2014 at http://ceur-ws.org/Vol1064/Elbedweihy Using BabelNet.pdf

Farrar, S., & Langendoen, D. T. (2003). A linguistic ontology for the semantic web. Glot International, 7(3), 97-100.

Fellbaum, C. (ed.). (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.

Fragos, K. (2013). Modeling WordNet glosses to perform word sense disambiguation. International Journal on Artificial Intelligence Tools, 22(02). Accessed October 24, 2014 at http://www.worldscientific.com.libaccess.sjlibrary.org/doi/abs/10.1142/S0218213013500036

Francopoulo, G. (ed.). (2013). LMF Lexical Markup Framework. Hoboken, NJ: Wiley.

Gale, W. A., Church, K., & Yarowsky, D. (1992). Estimating upper and lower bounds on the performance of word-sense disambiguation programs. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, pp. 249-256.

Garcia, A., Szomszor, M., Alani, H., & Corcho, O. (2009). Preliminary results in tag disambiguation using DBpedia. Accessed October 20, 2014 at http://oro.open.ac.uk/20006/1/ckcar09-final.pdf

Gayo, J. E. L., Kontokostas, D., & Auer, S. (2013). Multilingual linked open data patterns. Semantic Web Journal. Accessed October 15, 2014 at http://www.semantic-web-journal.net/system/files/swj495.pdf

Gracia, J., Montiel-Ponsoda, E., Cimiano, P., Gomez-Perez, A., Buitelaar, P., & McCrae, J. (2012). Challenges for the multilingual web of data. Web Semantics: Science, Services and Agents on the World Wide Web, 11, 63-71.

Haase, K. (2004). Context for semantic metadata. In Proceedings of the 12th Annual ACM International Conference on Multimedia. ACM. Accessed October 16, 2014 at http://www.yamod.ch/media/11026/haase01.pdf

Hellmann, S., Brekle, J., & Auer, S. (2013). Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Semantic Technology, pp. 191-206. Berlin: Springer.

Huang, X. X. (2007). An OWL-based WordNet lexical ontology. Journal of Zhejiang University SCIENCE A, 8(6), 864-870.

Ide, N. (2006). Making senses: Bootstrapping sense-tagged lists of semantically-related words. In Computational Linguistics and Intelligent Text Processing, pp. 13-27. Berlin: Springer.

Ide, N. (2014). FrameNet and Linked Data. In Proceedings of Frame Semantics in NLP: A Workshop in Honor of Chuck Fillmore (Vol. 1929, pp. 18-21).

Ide, N., & Pustejovsky, J. (2010, January). What does interoperability mean, anyway? Toward an operational definition of interoperability for language technology. In Proceedings of the Second International Conference on Global Interoperability for Language Resources. Hong Kong, China. Accessed October 21, 2014 at http://www.cs.vassar.edu/~ide/papers/ICGL10.pdf

Ide, N., Romary, L., & de la Clergerie, E. (2004). International standard for a linguistic annotation framework. Natural Language Engineering, 10(3-4), 211-225.

Jurafsky, D., & Martin, J. H. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall.

Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., & Wright, S. E. (2009). ISOcat: Remodelling metadata for language resources. International Journal of Metadata, Semantics and Ontologies, 4(4), 261-276.

Kilgarriff, A. (1997). I don't believe in word senses. Computers and the Humanities, 31(2), 91-113.

Kilgarriff, A. (1998, May). Senseval: An exercise in evaluating word sense disambiguation programs. In Proceedings of the First International Conference on Language Resources and Evaluation, pp. 581-588. Accessed October 24, 2014 at http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=3CDF14935273C9C8D3D5BFAD0964F349?doi=10.1.1.32.1931&rep=rep1&type=pdf

Krizhanovsky, A. A., & Smirnov, A. V. (2013). An approach to automated construction of a general-purpose lexical ontology based on Wiktionary. Journal of Computer and Systems Sciences International, 52(2), 215-225.

Kwong, O. Y. (2013). New Perspectives on Computational and Cognitive Strategies for Word Sense Disambiguation. Berlin: Springer.

Lee, K., & Romary, L. (2010). Towards interoperability of ISO standards for Language Resource Management. Proc. ICGL 2010.

Mallery, J. C. (1988). Thinking about foreign policy: Finding an appropriate role for artificial intelligence computers. Ph.D. dissertation. Cambridge, MA: MIT Political Science Department.

Manning, C.D., & Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.

McCarthy, D., Koeling, R., Weeds, J., & Carroll, J. (2004, July). Finding predominant word senses in untagged text. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics (p. 279). Association for Computational Linguistics. Accessed October 24, 2014 at http://sro.sussex.ac.uk/1210/1/senseranks.pdf

McCrae, J., Aguado-de-Cea, G., Buitelaar, P., Cimiano, P., Declerck, T., Gomez-Perez, A., ... & Wunner, T. (2012). Interchanging lexical resources on the Semantic Web. Language Resources and Evaluation, 46(4), 701-719.

Mendes, P. N., Jakob, M., Garcia-Silva, A., & Bizer, C. (2011, September). DBpedia spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, pp. 1-8. ACM. Accessed October 21, 2014 at http://oa.upm.es/11477/2/INVE_MEM_2011_105377.pdf

Miles, A., & Bechhofer, S. (2009). SKOS simple knowledge organization system reference. http://www.w3.org/TR/skos-reference/. Accessed November 19, 2014.

Montiel-Ponsoda, E., Gracia del Rio, J., Aguado de Cea, G., & Gomez-Perez, A. (2011). Representing translations on the semantic web. Accessed October 17, 2014 at http://oa.upm.es/10295/1/Representing.pdf

Moro, A., & Navigli, R. (2012). WiSeNet: Building a Wikipedia-based semantic network with ontologized relations. In Proceedings of the 21st ACM Conference on Information and Knowledge Management. Maui, Hawaii.

Moro, A., Navigli, R., Tucci, F. M., & Passonneau, R. J. (2014). Annotating the MASC Corpus with BabelNet. In Proceedings of LREC. Accessed October 16, 2014 at http://wwwusers.di.uniroma1.it/~moro/MoroEtAL_LREC2014.pdf

Morsey, M., Lehmann, J., Auer, S., Stadler, C., & Hellmann, S. (2012). DBpedia and the live extraction of structured data from Wikipedia. Program, 46(2), 157-181. DOI: 10.1108/00330331211221828

Navigli, R., & Ponzetto, S. P. (2012a). BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193, 217-250.

Navigli, R., & Ponzetto, S. P. (2012b). Joining forces pays off: Multilingual joint word sense disambiguation. In Proceedings of EMNLP, pp. 1399-1410. Jeju, Korea.

Navigli, R., & Velardi, P. (2005). Structural semantic interconnections: A knowledge-based approach to word sense disambiguation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(7), 1075-1086.

Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys (CSUR), 41(2), 10.

Navigli, R. (2012). BabelNet goes to the (Multilingual) Semantic Web. In ISWC 2012 Workshop on Multilingual Semantic Web. Accessed October 21, 2014 at http://ceur-ws.org/Vol-936/paper1.pdf

Navigli, R. (2014). About BabelNet. Accessed November 21, 2014 at http://babelnet.org/about

Palmer, M., & Xue, N. (2013). Linguistic annotation. In Clark, A., Fox, C., & Lappin, S. (eds.), The Handbook of Computational Linguistics and Natural Language Processing. West Sussex, UK: John Wiley & Sons.

Ponzetto, S. P., & Navigli, R. (2010). Knowledge-rich word sense disambiguation rivaling supervised systems. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1522-1531. Stroudsburg, PA: Association for Computational Linguistics.

Serasset, G. (2014). DBnary: Wiktionary as a Lemon-based multilingual lexical resource in RDF. Semantic Web Journal, Special Issue on Multilingual Linked Open Data. Accessed October 24, 2014 at http://www.semantic-web-journal.net/system/files/swj648.pdf

Simons, G., Bird, S., & Spanne, J. (2008). Best practice recommendations for language resource description. Accessed October 21, 2014 at http://www.language-archives.org/REC/bpr.html

Tagarelli, A., Longo, M., & Greco, S. (2009). Word sense disambiguation for XML structure feature generation. In The Semantic Web: Research and Applications (pp. 143-157). Berlin: Springer.

van Assem, M., Gangemi, A., & Schreiber, G. (2006). RDF/OWL Representation of WordNet. W3C Working Draft, World Wide Web Consortium. http://www.w3.org/TR/wordnet-rdf/

Matthew Corsmeier

San Jose State University, mjcorsme@gmail.com

(1) http://www-01.sil.org/iso639-3/

(2) http://wordnet.princeton.edu/wordnet/download/standoff/

(3) http://wordnet-rdf.princeton.edu/

(4) http://schema.rdfs.org/mappings/schemaorg_wn.owl

(5) Problems can occur when words in different languages have the same spelling but different meanings. For example, the German word Mist is considered a faux ami (false friend), as its meaning is very different from that of the orthographically identical English word; der Mist would probably be translated as "crap" in English.

(6) http://www.cyc.com/platform/opencyc

(7) Technically, URIs cannot contain non-ASCII characters (Cyganiak, Wood, & Lanthaler, 2014), and since pièce de théâtre contains diacritics, the identifier for it would be considered an IRI.

(8) http://cldr.unicode.org/

(9) http://babelnet.org

(10) http://www.omegawiki.org/Meta:Main_Page

(11) http://www.w3.org/community/bpmlod/wiki/Converting_BabelNet_as_Linguistic_Linked_Data

(12) https://delicious.com/

(13) Currently at http://www.tagora-project.eu/. The URL listed in Garcia et al. (2009) did not work at the time of writing.

Caption: A figure showing linked resources available in the Linguistic Linked Open Data Cloud. Taken from http://linguistics.okfn.org/llod on November 19, 2014.

Caption: Basic results of a query for the word "play" at WordNet. As one can easily see, "play" is highly polysemous, and most senses of the word are listed with their accompanying Synsets. Accessed November 19, 2014 at http://bit.ly/1uSKI0e

Caption: Drilling down on the first sense of the word "play," we can see a list of hyponyms (narrower forms), as well as a detailed hierarchy showing the hypernyms this sense of "play" is nested in, going all the way to "entity", which is the listing for all nouns in WordNet. Accessed November 19, 2014 at http://bit.ly/1xQxMrV

Caption: The results for the English word "play." Accessed November 21, 2014 at http://www.lexvo.org/page/term/eng/play

Caption: Drilling down on one of the results of a query for "play". Accessed November 19, 2014 at http://lemonmodel.net/lexica/uby/wn/WN_LexicalEntry_46245

Caption: A diagram of the lemon model from Ehrmann et al., 2014, 404

Caption: From Navigli & Ponzetto, 2012a, 220

The connections in WordNet reimagined as a graph. In this graph, the nodes are the Synsets, and the edges are the lexical and semantic relations between terms. Note the superscript numbers indicating the sense of a polysemous word, and the subscript letters indicating the part of speech.

The connections in Wikipedia reimagined as a graph. Here the edges are the hyperlinks between entries. For the sake of conciseness only a small portion of the graph is shown, but notice the inclusion of "tragedy" in both graphs; BabelNet was able to deduce these two entries referred to the same sense by calculating the percentage of words in common.

Caption: From Navigli & Ponzetto, 2012a, 221

BabelNet takes the linked data from Wikipedia and WordNet to create a Babel Synset, which also includes translations of the word in various languages. Such translations can even be used to assist in monolingual WSD.

Caption: Results of a query for the word "play" at BabelNet 2.5, which is currently in beta mode. Accessed November 20, 2014 at http://babelnet.org/exploreResult?word=play&lang=EN

Caption: Drilling down to the first result of the query above. Accessed November 20, 2014 at http://babelnet.org/search?word=bn:00028604n&details=1&orig=play&lang=EN

Caption: Continuation of the web page displayed on the left, showing WordNet senses and definitions, as well as redirections and definitions from Wikipedia
COPYRIGHT 2015 University of Idaho Library
No portion of this article can be reproduced without the express written permission from the copyright holder.

Author: Corsmeier, Matthew
Publication: Library Philosophy and Practice
Article Type: Report
Date: Jan 1, 2015