
It's a messy endeavor: automated text processing.

Bjorn Hohrmann, a German engineer who contributes to open source projects, posed the question, "How much does it cost to archive all human audiovisual experiences?" He then proceeded to answer it himself. His estimate: about $1 trillion per year. (For more details, see www-archive/2013Jul/0047.html.)

My back-of-the-envelope calculation suggests that figure is likely conservative. It's no secret that nano devices equipped with CPUs, software and wireless connectivity will become more widely available. Hohrmann's analysis does not appear to include the data these new technologies will generate in ever-increasing volumes.
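To see how an estimate in this range can arise, here is a back-of-the-envelope sketch. The inputs below are illustrative assumptions of mine, not Hohrmann's actual figures: roughly 7 billion people, 16 waking hours per day, and a few cents per hour for capture and storage combined.

```python
# Back-of-the-envelope cost of archiving humanity's audiovisual experience.
# All figures are illustrative assumptions, not Hohrmann's actual inputs.
PEOPLE = 7_000_000_000      # approximate world population
HOURS_PER_DAY = 16          # waking hours captured per person per day
DAYS_PER_YEAR = 365
COST_PER_HOUR = 0.025       # assumed capture + storage cost, USD per hour

hours_per_year = PEOPLE * HOURS_PER_DAY * DAYS_PER_YEAR
annual_cost = hours_per_year * COST_PER_HOUR
print(f"{hours_per_year:.2e} hours/year, ~${annual_cost / 1e12:.2f} trillion/year")
```

With these assumptions the total lands at roughly $1 trillion per year; notice that the answer is dominated by the per-hour cost, so a modest rise in data volume per person pushes the bill well past Hohrmann's number, which is the sense in which his estimate looks conservative.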

Now consider the monumental problem of converting the audio to text and then indexing that text. On top of that, keywords and metadata are needed because nobody wants to watch or listen to a video to find the item of information needed. There are not enough hours in the day to keep pace with the textual information available. Toss in millions of hour-long podcasts or one day's uploads to YouTube, and time becomes an insurmountable barrier.
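The indexing step described above is what turns transcripts into something findable. The toy below is my own minimal sketch, not any production system: it builds an inverted index mapping each word to the items whose (hypothetical) transcripts contain it, so a keyword query can locate content without anyone replaying the audio.

```python
from collections import defaultdict

# Toy transcripts standing in for speech-to-text output (illustrative only).
transcripts = {
    "podcast_001": "open source search engines and text processing",
    "podcast_002": "archiving audio at planetary scale",
    "video_017":   "text processing for audio archives",
}

# Build an inverted index: word -> set of item IDs containing that word.
index = defaultdict(set)
for item_id, text in transcripts.items():
    for word in text.lower().split():
        index[word].add(item_id)

# A keyword lookup now returns matching items instantly.
print(sorted(index["audio"]))  # → ['podcast_002', 'video_017']
```

Real systems add stemming, stop-word removal and metadata fields, but the core trade remains the same: index once, so that searching costs seconds rather than listening hours.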

Global media attention directed at content processing is partly a reaction to this simple fact of digital life: These days the only way a company, a government entity or a researcher can figure out what is happening is to use next-generation tools. The tools, as most knowledge management professionals know, are somewhat crude. The limitations on the most advanced products are ones that are difficult to work around. Budgets are tight, so systems that can filter or trim the content are essential. Computing power continues to increase, but the volume of data and the mathematical recipes themselves can bring supercomputers to a halt. Most concerning is how the plumbing required to move large volumes of data from Point A to Point B has capacity limitations. To increase available bandwidth in a computing infrastructure is not quite the walk in the park some marketers picture in their HDR-colored PowerPoints.


When I was in Australia in 2009, I learned about Leximancer, a text processing system that had its roots at the University of Queensland. I spoke with Andrew Smith, a physicist and the founder of Leximancer, about what makes his system different from other systems. Leximancer, unlike other content processing systems, is designed to show users the information landscape and raise awareness of the space of available knowledge. The idea is to enable a user to generate and explore hypotheses. "My goal from the start," Smith said, "was to create a practical system for doing a kind of spectrum analysis on large collections of unstructured data, in a language-independent and emergent manner."

The company, conceived in 1999, was blessed with foresight: it immediately embraced the idea that the amount of digital information would grow exponentially and that patterns of meaning would be latent in that data. As Smith told me, "Humans have limited memory, time and cognition, so these critical patterns of meaning might be missed by the people who need to know." Leximancer's system is designed to fill those gaps in human brainpower and surface the patterns users would otherwise miss.

Smith said, "Leximancer is used most of the time for analyzing surveys, interviews, stakeholder submissions, social media extracts, research articles and patent corpora, engineering documentation, policy documents, inquiry transcripts, etc. It is not primarily a search engine, and is certainly not an enterprise search solution, though it is used as a component of such."

The Leximancer system is almost entirely data-driven, so that the "ontology" emerges from the data and is faithful to that data. Smith said, "My sense was that the gulf between the quantity of available information versus the actual human awareness, integration and understanding of this information is a serious and insidious threat. Certainly we address the problem of not knowing the best search terms to use in a given context, but we also address the problem of not even knowing what questions can or should be asked."
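Leximancer's actual algorithm is proprietary, but the general idea of an "ontology" emerging from the data, rather than being imposed on it, can be illustrated with a toy sketch of my own (a simplification, not the company's method): count which terms co-occur in the same documents, with no predefined vocabulary or language-specific resources, and let the strongest pairs stand in for emergent concepts.

```python
from collections import Counter
from itertools import combinations

# Toy corpus; each string stands in for one document (illustrative only).
docs = [
    "search data analysis",
    "data analysis tools",
    "search tools",
]

# Count how often each pair of terms appears in the same document.
# "Concepts" emerge as the most strongly co-occurring pairs; nothing
# here depends on a predefined ontology or on the documents' language.
cooccur = Counter()
for doc in docs:
    terms = sorted(set(doc.split()))
    for pair in combinations(terms, 2):
        cooccur[pair] += 1

print(cooccur.most_common(2))
```

In this tiny corpus "data" and "analysis" emerge as the tightest pairing simply because the documents say so, which is the data-driven, emergent property Smith describes, as opposed to a curator deciding in advance which terms belong together.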

Valid representation of data

In the global data explosion, users are in a difficult position. Smith explained, "What we have seen is that many users are not prepared to think hard enough to understand complex or statistical truths. Users are looking for a plausible story or anecdote from the data, even if it is not representative. I think this is a danger in some interfaces for any user who is doing serious search/research/analysis. With our new product under development, we are designing to achieve both. We take care of statistical validity and present the user with an attractive mash-up that is nevertheless a valid representation of the data."

To help organizations and professionals who must analyze information, Leximancer offers software as a service and on-premises options. Smith positions Leximancer as an enhancement to existing retrieval systems, not a replacement.

He explained, "I do believe that most if not all current search technologies are not suitable for social media, or most fire hoses of weakly structured complex data such as system or transaction logs. The points that support my reasons for this are, first, that each data record is only a fragment of some unfolding story and cannot be understood in isolation, and contains few if any of the obvious keywords for its thread. Second, multiple stories are being played out simultaneously in different time scales, and the fragments of each story are intermixed in the fire hose. Third, terms that make up the data items can mean different things in different contexts, or different terms can mean the same things in some contexts. And, lastly, new data terms can appear at any time."

Four challenges

If we ignore for the moment the problem of processing "all" content, four interesting challenges are testing organizations that want to manage their knowledge in a more effective way.

The first is the shortage of mathematicians. Earlier this year, Dr. Roslyn Prinsley told The Conversation: "The fact that the demand in Australia for math graduates, at the minute, is outstripping supply is a major issue for this country. From 1998 to 2005, the demand for mathematicians increased by 52 percent. From 2001 to 2007, the number of enrollments in a mathematics major in Australian universities declined by 15 percent. On the global scale, we are falling behind too. In 2003, the percentage of students graduating with a major in mathematics or statistics in Australia was 0.4 percent. The Organization for Economic Co-operation and Development's (OECD) average was 1 percent."

Prinsley's comments have global implications. In the United States, the problem is not just a decline in the mathematics major. There is a critical shortage of mathematics teachers. States from Alaska to Wyoming are being severely affected. See, for instance, the Commonwealth of Virginia's "Critical Shortage Teaching Endorsement Areas for 2013-2014 School Year" (doe.vir shortage_areas/2013-2014.pdf) and the U.S. Department of Education's nationwide listing of teacher shortage areas ( ope/pol/tsa.doc). Without individuals skilled in mathematics, systems that rely on numeric recipes will be unfathomable. How can an organization or an individual determine whether a system's outputs are valid without a solid grounding in mathematics?

Another challenge is the need to process rich media. To make a podcast or video searchable, software must convert the speech to text. Due to the wide variations in audio quality, speech-to-text systems often produce results that are not usable. A short time ago, we attempted to process four recordings made in live venues. We used three different systems, but none of the systems was able to produce a usable ASCII transcription of the audio on the recordings.

The solution was difficult and required sending the audio and video source files to a human, who was able to transcribe about 90 percent of the information. In a world in which forward-thinking engineers want to capture "all" rich media, the human-intermediated solution is neither affordable nor practical. At this time, an automated solution to unlock the information in audio and video content in a manner that makes search useful is not available. Progress is being made, but it moves like a snail on a warm summer evening.

Third, companies engaged in next-generation content processing are constrained by a number of factors. Resources, even at large companies, are tight, and information priorities are often fuzzy or fluid. On one hand, the enterprise solutions responsible for the day-to-day information retrieval needs of the organization are difficult to change, upgrade or replace. Ad hoc solutions to deal with hot-spot problems are often useful to a specific group, but migrating the expertise from a special project's solution across an organization can be difficult. The hurdle, according to the Harvard Business Review, is change management. "We behave based on the reality around us," write Gregory Shea and Cassie Solomon in the article "Change Management Is Bigger than Leadership." (See blogs. ment_is_bigger_th.html.) Despite the need for integrated systems, most organizations operate with fiefdoms, islands and silos of information.

Finally, managers responsible for making strategic and tactical decisions face a problem that is different from those just six or eight years ago. The sheer volume of data available within an organization requires different tools and business processes. For a person working in knowledge management, the journey now underway may be discomfiting. Buzzwords like "governance," "analytics" and "business intelligence" do little to provide reliable mileposts.

Leximancer's Smith said, "I cannot currently think of any other commercial automatic text analysis system whose output model has been cross-validated in the scientific literature."

In our world of proliferating information and hurdles that are difficult to clear, I think of Thomas Alva Edison's alleged quip, "I have not failed. I've just found 10,000 ways that won't work."

Stephen E. Arnold is a consultant providing strategic information services. His blog is located at
COPYRIGHT 2014 Information Today, Inc.
