Mining for gold: 21st-century search arrives with text mining.
[T]he massive scale of data creation and accumulation, together with the increasing dependence on data in research and scholarship, are profoundly changing the nature of knowledge discovery, organization, and reuse. As our intellectual heritage moves more deeply into online research and teaching environments, new modes of inquiry emerge; digital data afford investigations across disciplinary boundaries in the sciences, social sciences, and humanities, further muddling traditional boundaries of inquiry. How then are we responding to what may be the most complex and urgent contemporary challenge for research and scholarship?
--"The Problem of Data," August 2012; clir.org/pubs/reports/pub154/pub154.pdf
"Any scholar born in the last century can illustrate the sea change that search has brought to literary research," points out University of Minnesota English professor Michael Hancher. "Vannevar Bush predicted as much in 1945. Bush made 'selection,' that is, discovery, of key importance. Bush proposed opening the archive to detailed inspection, annotation, and cross-referencing. Faltering, 'diligent' search had to be replaced by what we now call digital search, which can more efficiently thread the needle in the proverbial haystack."
"In the past couple of years, text mining has made inroads in most disciplines, and is being applied commercially in several areas under the guide of 'text analytics,'" explains University of Zurich linguistics professor Fabio Rinaldi. "I personally think that we are not that far away from some major breakthrough in natural language understanding that will allow machines to build a deeper representation of the knowledge embedded in the text." As Rinaldi observes, this will also allow the creation of better models of knowledge representation and sharing, which will go beyond natural language. "We will soon be able to ask a generic question and have computers answer most of them as competently
as humans," he asserts.
Researchers have benefited from stores of data and information as well--and with the evolution of text mining, we are seeing whole new applications and innovations from studying written tomes, full-text documents, and databases. In 2009, the International Association of Scientific, Technical and Medical Publishers estimated that each year, global STEM research alone produces more than 1.5 million new scholarly articles ("The STM Report," September 2009; stm-assoc.org/2009_10_13_MWC_STM_Report.pdf). Today, text mining is becoming more common in management, the biomedical sciences, and chemistry, with efforts underway to establish footholds in the social sciences and humanities.
Text mining shares many of the same objectives as data mining: to use the stores of information being produced to better understand trends, discover new information, and seek better methods or services based on research, behavior, or preferences. Today, the number of published books, reports, articles, and other key information is beyond the ability of any person, or even any organization, to adequately find, read, analyze, and use without help. Today's technologies are proving a good match for the task--not perfect, but showing considerable value in areas such as public policy, medical services, advanced research, and marketing strategy. Text mining allows people to search virtually any type of textual content--documents, websites, news information--as well as data sources. Text and data mining are often referred to by the acronym TDM.
"Google has conditioned us to search using keywords," AlchemyAPI CEO Elliot Turner argues, "and while this will continue, we notice a trend towards making Qualitative Analysis (QA) capabilities, a la IBM Watson, more commonplace." Turner believes having a system that understands what you're looking for, the context of the question, and other complex relationships--inferred both from things mentioned in the article and those not in the article--enables faster answers to detailed questions. He gives this example: "A librarian cannot read every book in a library, but a QA system that has access to the same resources can help speed up research and make new discoveries." Turner adds, "Even doctors, students or librarians could use a cognitive search engine to aid them in their daily activities."
"We all face the challenge of an overabundance of information, and we all need strategies for more effectively processing, exploring and understanding text collections of varying sizes, be they email archives, Twitter feeds, library catalogues, or gene sequences," explains McGill University digital humanities professor Stefan Sinclair. "Text mining has been there for a while as a way to scratch; what's different," according to Sinclair, "is that an increasing number and variety of people feel the itch." This, he says, results in a nice feedback loop: "When more people need better tools, more development happens, which then leads to more innovation and more demand. That bodes very well for the future of text mining, including new algorithms, new techniques, better user interface design, more compelling examples of use, and so on."
Text mining of any copyrighted information isn't easy today, and neither is keeping up with the myriad software--from high-end expensive programs to open source and free packages with few bells or whistles to help with searching, sorting, and manipulating the information. And we need to move beyond text as well. "The text mining field needs the ability to move past text," urges Turner. "There is a big opportunity to improve the state of the art by mining audio, speech, images and video included within text documents." Because more than 1 billion photos per day are taken and shared, Turner believes companies which "conquer the challenge of using unsupervised deep-learning techniques will be poised to mine images as well as they do text."
In the general process for available software, content is first categorized by topic and taxonomy to allow for filtering, summarization, and routing. Names, topics, and other descriptors or categories can be applied, as can searching and advanced predictive and visualization tools--all from the natural language of the original content. "Education and awareness are two of the biggest issues preventing text analytics from obtaining widespread adoption at a more rapid pace," Content Analyst VP for marketing Steve Toole opines. "Analysts and content curators who have become accustomed to building taxonomies and Boolean search strings should take a closer look at machine learning approaches such as conceptual search and concept-aware auto-categorization to augment existing taxonomical structures and search methods." Toole says conceptual search, dynamic clustering, and concept-aware auto-categorization technologies can overcome limitations of word libraries, term shift, abbreviations, acronyms, synonymy, and even the language used.
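The taxonomy-based categorization step described above, and the synonymy limitation Toole mentions, can be sketched roughly as follows. This is a toy illustration only, not the method of any product named in this article; the two-topic taxonomy, its terms, and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical two-topic taxonomy; the topics and terms are illustrative assumptions.
TAXONOMY = {
    "copyright": {"copyright", "license", "licensing", "publisher"},
    "medicine": {"patient", "treatment", "clinical", "drug"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def categorize(text, taxonomy=TAXONOMY):
    """Return the topic whose terms occur most often in the text, or None."""
    counts = Counter(tokenize(text))
    scores = {
        topic: sum(counts[term] for term in terms)
        for topic, terms in taxonomy.items()
    }
    best = max(scores, key=scores.get)
    # A score of zero means no taxonomy term matched at all.
    return best if scores[best] > 0 else None


print(categorize("The publisher changed its licensing terms."))
print(categorize("Physicians evaluated a new therapy."))
```

The second call finds no category because "therapy" is not literally in the term list, even though a human reader would file the sentence under medicine. Exact-match word lists miss synonyms in just this way, which is the gap that conceptual approaches aim to close.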
PROBLEMS WITH ACCESS TO CONTENT
Copyright has become a major issue with commercial publication of scholarly works, impeding the ability of even the authors of content to have access to their own published writings due to contractual constraints imposed by publishers' API (application program interface) requirements. An API is a set of routines, protocols, and tools for building software applications, and its design can make the process of using and manipulating text easy, difficult, or impossible. Changes to copyright law and the evolving open access movement for publications promise change; however, libraries still have much work ahead with vendors and publishers to create agreements that make it possible for researchers to mine stores of published articles and books for new insights and revelations.
MIT Libraries' Ellen Finnie Duranceau notes that while there is strong interest in text/data mining on the MIT campus from a wide range of disciplines, it has been very slow going working with any vendors to arrange text/data mining access. She reports, "The conversations, even when there seems to be interest on both ends, not infrequently stretch out for months, and aside from one example many years ago, and the new Elsevier service, have not yet led to a successful agreement." Duranceau adds, "From our experience, it seems information providers are moving in the right direction and are more engaged on the topic than in the past." The problem lies in the fact that the providers "have not yet widely found models that match their expectations and concerns with the realities of research needs in an academic setting."
As University of Illinois at Urbana-Champaign English professor Ted Underwood notes, "It's quite hard to get APIs and permissions from vendors in order to text-mine privately held resources." For that reason, Underwood primarily relies on HathiTrust Digital Library, largely a public-sector solution that receives some contributions from Google. Glen Worthey, Stanford University digital humanities librarian, states, "Because we've been negotiating up front (and paying for) perpetual access rights, including archival and local-use rights, we have generally not felt the need even to ask for text-mining permissions." However, getting journal and book publishers on board with APIs for text mining has been another story at Stanford. "Several vendors have indeed indicated that they'd rather get out of the 'data delivery' business (since repeated data delivery requests can be burdensome) and instead understand how to build APIs that would meet the needs of our researchers," says Worthey. But even that has proven difficult: "Not only do the researchers generally need to do a great deal of actual manipulation and transformation of the data, they also need to experiment and continually refine their methods. In spite of their willingness--and even strong desire--to engage with the library and with researchers to understand these methods, vendors cannot be expected to develop an API that would meet every text-mining need of every researcher."
Max Haeussler and Casey Bergman have created a website (text.soe.ucsc.edu) where they have attempted to log information on the willingness/ability of publishers to allow the content of their products to be mined by subscribers. They began (perhaps innocently) trying to "genocode" scientific articles for "references to chromosomal locations in scientific articles" that biomedical researchers could use as a visualization or map of human chromosomes research. However, they quickly found that the texts were locked tightly behind the publishers' walls. "We strongly believe that requests like ours to develop tools that help promote access to the biomedical information will provide enormous benefits to the biomedical research community and to the publishing industry alike." You can follow their adventures in getting permissions information from publishers at their website. MIT Libraries has an excellent set of information on access from vendors making their databases--through their APIs (libguides.mit.edu/apis)--available for researchers at MIT.
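To give a sense of what finding "references to chromosomal locations" in article text can involve at its simplest, here is a minimal sketch. It is not Haeussler and Bergman's method; the pattern is an assumption that covers only cytogenetic band notation such as "17q21.31", and the function name and sample sentence are invented for illustration.

```python
import re

# Assumed pattern for cytogenetic band notation: a chromosome (1-22, X, or Y),
# an arm (p or q), one or two band digits, and an optional sub-band after a dot.
BAND = re.compile(
    r"\b(?:[1-9]|1[0-9]|2[0-2]|X|Y)[pq][0-9]{1,2}(?:\.[0-9]{1,2})?\b"
)


def find_chromosomal_locations(text):
    """Return every band-style chromosomal location found in the text, in order."""
    return BAND.findall(text)


sample = "The deletion at 17q21.31 and a variant near Xp11.2 were reported; see page 12."
print(find_chromosomal_locations(sample))
```

A pattern like this is only useful when it can be run over the full text of many articles, which is why bulk access to publisher content matters for projects of this kind.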
Michael W. Carroll, professor of law at American University's Washington College of Law, asserts in the LIBLICENSE listserv, "[I]n the United States text mining is a user's right not a copyright owner's right. When a library signs an agreement denying users the right to bulk download for the purpose of text mining, the library is giving up a user's right in exchange for access to the publisher's database of articles ... Elsevier may have the rights to control text mining in some European countries, but this announcement still means that in the US Elsevier wants to control computational research in ways that go beyond its rights as a copyright owner."
In Carroll's opinion:
[T]he Google Books decision provides support because Google created a digital archive of publisher's works for the purpose of making them searchable (and to enable text mining). The court held that Google's creation of this archive and its continued retention of it was necessary to the beneficial purposes of providing search and text mining. Google's keeping the archive after it created its index did not affect the publishers' economic interests in exploiting the copyrighted works and therefore is a fair use. Although the purpose of the text mining researcher and Google are somewhat different, they both can articulate a socially beneficial reason for keeping a private archived copy of the publishers' works and their doing so does not interfere with the publishers' ability to economically exploit the works.
OPENING THE DOORS FOR BETTER ACCESS TO TEXTS FOR ANALYSIS
In December 2013, IFLA (the International Federation of Library Associations and Institutions) published a Statement on Libraries and Text and Data Mining. It asserts: "Legal certainty for text and data mining (TDM) can only be achieved by (statutory) exceptions. As an organization committed to the principle of freedom of access to information, and the belief that information should be utilised without restriction in ways vital to the educational and cultural well-being of communities, IFLA believes TDM to be an essential tool to the advancement of learning, and new forms of creation."
In 2011, the independent Ian Hargreaves report "Digital Opportunity: A Review of Intellectual Property and Growth" (ipo.gov.uk/ipreview-finalreport.pdf) was completed at the request of the British Prime Minister due to "the risk that the current intellectual property framework might not be sufficiently well designed to promote innovation and growth in the UK economy." The report, in sum, finds that "the UK's intellectual property framework, especially with regard to copyright, is falling behind what is needed. Copyright, once the exclusive concern of authors and their publishers, is today preventing medical researchers studying data and text in pursuit of new treatments."
This effort was followed by a March 2012 British Jisc report, "Value and Benefits of Text Mining" (jisc.ac.uk/reports/value-and-benefits-of-text-mining). The report found that "the full economic and societal potential afforded by this vast sea of information and data is not yet being realised within the UK. Realising the potential requires text and data analytical capability, access to the information and data sources, and involves a range of computerised analytical processes, not all of which are readily permitted within the current UK legislative environment for intellectual property" (p. 7).
University of Cambridge chemist Peter Murray-Rust sees resolution coming soon, at least for the U.K.: "In two months the UK parliament is expected to table and pass the Hargreaves recommendations for TDM, when we will be able legally to carry this out in UK. Since my institution subscribes to a large number of NPG journals which I have the right to read, I expect to start mining them, without further negotiations and without ... further permission, in the near future." And on April 1, the U.K. higher-education funding organization (HEFCE) announced the first national policy to meet these goals, which states that all research articles and conference proceedings accepted for publication after April 1, 2016, "should be made open-access to be eligible for submission to the post-2014 Research Excellence Framework (REF)," the system in the U.K. by which all government-funded higher-education institutions are assessed ("Policy for Open Access in the post-2014 Research Excellence Framework"; www.hefce.ac.uk/media/hefce/content/pubs/2014/201407/HEFCE2014_07.pdf). The requirement--which links specifically to national measures of institutional research productivity--is targeted to see that deposited material should be discoverable, and free to read and download, for anyone with an internet connection.
In the U.S., mandatory OA policies, from such funders as the National Institutes of Health, the National Science Foundation, and an increasing number of universities, are starting to make access to the corpus of research more available. PLOS, a leading OA publisher, has recently enacted new requirements for making data underlying published research available to future researchers. "PLOS journals require authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception ... Refusal to share data and related metadata and methods in accordance with this policy will be grounds for rejection" ("Data Access for the Open Access Literature: PLOS's Data Policy," Theo Bloom, Dec. 12, 2013; plos.org/data-access-for-the-open-access-literature-ploss-data-policy). Acceptable methods for making data available were outlined in its announcement. Older texts are also becoming more readily available from such sources as the Internet Archive (archive.org/details/texts), Million Book Digital Library Project (rr.cs.cmu.edu/mbdl.htm), HathiTrust Digital Library (hathitrust.org), and the Digital Public Library of America (dp.la).
Librarians have a major role in this, according to Worthey. "I understand that the commercial sector wants to make money, but restricting access to scholars is not the way to do it." He says the fear that some licensed or in-copyright content will "leak out," suddenly becoming available for free on some imaginary "Pirate Bay" for something such as historical newspaper content, thereby undercutting the academic market, is "irrational," and that restricting access for scholarship is counterproductive. "Even in a market that has a place for commercial data providers," he argues, "it makes more sense to license liberally, trust the libraries and the researchers they serve implicitly, and prosecute abusers (if there are any)." Worthey says not only is lengthening copyright terms a continuing and growing problem, "the atmosphere of fear surrounding potential and real copyright litigation, and a failure to aggressively defend (and sometimes even to understand) our Fair Use rights are continual barriers to good research."
Murray-Rust sees the vested interests in the academic marketplace holding back the development of TDM in this area. "We have to change minds. Young people are sick of the restrictions imposed by publishers--but publishers are implicitly abetted by universities through their adherence to the branded journals for measuring careers and success. Universities are signing their lifeblood away to publishers." Murray-Rust says the time has come to break the climate of fear. "[Our team is] going to show that universal machine access to the scientific literature is as liberating as the Renaissance or Enlightenment."
MORE CHANGE COMING
"It's useful to remind ourselves of the ubiquity of text mining," Sinclair observes. "Every time we use a search engine we're relying on systems that perform ever-more complex text mining. Search engines," he points out, "find documents with terms of interest, considering a variety of factors such as relative frequency in a document, prominence of the document on the web, regional particularities, and user browsing behaviour." Sinclair says that while users have become accustomed to incredibly fast and powerful search engines for the web in their browsers, there are surprisingly few options available for quickly and easily working with custom text collections. What is needed, he maintains, are user-friendly systems that allow users to ingest large corpora, search by keywords, filter by facets or features (such as date) where possible, and then produce and analyze reconfigurable subsets (or worksets) of documents. He adds, "It may be useful to differentiate between text mining techniques that enable us to constitute subsets of documents and other text mining techniques that assist us in analyzing those documents of interest. In both cases," Sinclair concludes, "much more work is needed to design user interfaces that are appropriate for much wider audiences."
University of Southampton archaeologist Leif Isaksen suspects that in the foreseeable future, user interfaces are likely to go in two separate directions: Fully automated methods will provide broad statistical evaluations of massive corpora that are of less use in dealing with specifics, while manual and semi-automated methods will provide more accuracy in the details but consequently won't scale as well. Isaksen notes that there is something of a two-stage process in both cases: "The first is to identify common conceptual references (named entities like people, places and events, but also references to typed, but indefinite objects). The second is to identify relationships between them." Because the second is much more difficult and relies upon the first, Isaksen expects to see more progress in the former in the immediate term.
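The two-stage process Isaksen outlines can be illustrated with a deliberately naive sketch: stage one looks names up in a small list of known entities, and stage two treats any two entities that share a sentence as a candidate relationship. The gazetteer, the sample text, and the function names are invented for illustration; real systems use far richer methods at both stages.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical gazetteer mapping known names to entity types.
GAZETTEER = {"Ada": "person", "Lyon": "place", "Oslo": "place"}


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_entities(sentence):
    """Stage 1: return the gazetteer names found in the sentence, without repeats."""
    found = []
    for word in re.findall(r"[A-Za-z]+", sentence):
        if word in GAZETTEER and word not in found:
            found.append(word)
    return found


def candidate_relations(text):
    """Stage 2: count how often each pair of entities shares a sentence."""
    pairs = Counter()
    for sentence in split_sentences(text):
        entities = sorted(find_entities(sentence))
        pairs.update(combinations(entities, 2))
    return pairs


text = "Ada wrote from Lyon. Ada later moved to Oslo. Letters from Lyon reached Ada in Oslo."
print(candidate_relations(text))
```

Co-occurrence only says that two entities are mentioned together, not how they are related, and stage two cannot even begin until stage one has found the entities. That dependency is consistent with Isaksen's point that identifying relationships is the harder of the two problems.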
"Given the rapidly increasing volume of information that is available online," Isaksen continues, "I think some level of text mining can only increase, even if it is principally intended for search. We see this most in commercial services, such as those provided by Google, but I suspect that journal publishers and online repositories will also soon start to make much greater use of these technologies as a means of adding value to scholarly work, and thereby helping to justify their publication fees." He acknowledges, however, that text mining specific to a particular research question is still not something that can be easily generalized, due to the fluidity of natural language: "It's hard to see it spreading beyond a niche community for the time being."
Information professionals have always been at the forefront of changes and challenges with digital information systems. TDM is just the next chapter--but a clearly magnificent one--in the ongoing evolution of knowledge systems.
Seven Free Software Products to Get You Started in Text Mining Today
Many big-name software companies offer high-end, expensive text mining options today, led by SAS Institute, Inc. (sas.com/en_us/software/analytics.html) and IBM (www-01.ibm.com/software/ebusiness/jstart/textanalytics). However, there are also some worthwhile OA products available. For a good listing and evaluation of today's products, go to either the National Centre for Text Mining (nactem.ac.uk) or the Digital Research Tools Wiki (digitalresearchtools.pbworks.com/w/page/17801672/FrontPage).
"The cost and accessibility to powerful hardware and software has made text mining capabilities far more practical than 20 or 30 years ago, when technologies such as Latent Semantic Indexing was first introduced," Steve Toole explains. "Today, a typical laptop or PC has enough computing power to handle complex algorithms required for sophisticated text mining capabilities." Here is a listing of some of the freeware/OA packages on the market:
"[Carrot.sup.2] organizes your search results into topics. With an instant overview of what's available, you will quickly find what you're looking for," whether from the web, images, PubMed, or other sources.
General Architecture for Text Engineering (GATE)
Open source free software, in use for more than 15 years now, is Java-based and with a strong support community. "Our user community is the largest and most diverse of any system of this type, and is spread across all but one of the continents."
A collection of Python scripts, the software purports to be "the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text."
"Free software for quantitative content analysis or text data mining. It is also utilized for computational linguistics. You can analyze Japanese, English, French, German, Italian, Portuguese and Spanish text with KH Coder."
"KNIME [naim] is a user-friendly graphical workbench for the entire analysis process: data access, data transformation, initial investigation, powerful predictive analytics, visualisation and reporting."
An Apache Software Foundation "machine learning based toolkit for the processing of natural language text [that] supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and co-reference resolution."
Unstructured Information Management Architecture (UIMA)
Another Apache-based software that is intended to "analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user [allowing] applications to be decomposed into components."
Nancy K. Herther
University of Minnesota Libraries
Nancy K. Herther (email@example.com) is the librarian for American studies, anthropology, Asian American studies, and sociology at the University of Minnesota-Twin Cities Campus.
Comments? Email the senior editor (firstname.lastname@example.org).
|Title Annotation:||SEARCHER'S VOICE|
|Comment:||Mining for gold: 21st-century search arrives with text mining.(SEARCHER'S VOICE)|
|Author:||Herther, Nancy K.|
|Date:||Jul 1, 2014|