
Searching and categorising data. (Searching Data).

Search and categorisation are absolutely vital to an ECM solution, and Butler Group feels that they should be regarded as core functionality. With hundreds of thousands, if not millions, of documents and other items of unstructured content within an organisation, it is impossible to find single items of content without a search capability. Categorisation is required to provide the search criteria to find content. Categories form part of the metadata associated with content. If content is not categorised, it becomes very difficult to find or indeed manage.

Butler Group believes that ECM products should be supplied with an integrated search and categorisation engine. Whether it is the vendor's own product or one acquired through an Original Equipment Manufacturer (OEM) agreement from a third party does not matter. Although there should be the option of turning off the ECM search engine, as some organisations already have a search capability, the search capability should not be offered as an optional extra. Locating content speedily on demand is an integral part of an ECM package. What is important is the functionality of the search capability in terms of the search methods offered, and the ability to categorise content. As collaboration becomes more of a business requirement, the ability to save search results, share them with other users, and also subscribe to searches, increases in importance.

Different Search Methods

A keyword search is only effective for searching on metadata and not content. When searching on a common word for content using a Web-based search, it is typical for millions of results to be returned. Keywords entered for a search on metadata, however, can be effective. Unfortunately, Web searches can be abused by content creators adding irrelevant keywords to metadata to get content returned for many different subjects.

Boolean searches use the terms AND, OR, and NOT to determine the results. Parentheses may also be used to create complex searches. Natural language searching enables users to enter queries using everyday language. It returns broad results. In a parametric search, a user enters parameters to search on. A text search needs to assess the relevancy of text. The more often a word such as 'the' or 'and' appears in documents, the less likely it is to be relevant. This type of search normally allows users to search on a word or phrase, on words in proximity to each other, and sometimes on words not in proximity. A concept-based search takes concepts, rather than keywords, and should deliver more relevant results than a keyword search. Because it searches on concepts it will often return content that does not contain the word or phrase actually searched on. It will not return documents containing the phrase where the meaning is different to that requested; for example, if information is required about Jaguar cars, the search will not return content about the animal. It often scores results in terms of relevancy, although a document highly relevant to one user may not be as relevant to another.
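The Boolean search described above can be illustrated with a small sketch. It assumes a toy in-memory index over three made-up documents; the function names (`term`, `and_`, `or_`, `not_`) are hypothetical and do not come from any product. Nesting the calls plays the role of parentheses in a query.

```python
from collections import defaultdict

# Hypothetical sample documents, invented for this illustration.
DOCS = {
    1: "jaguar cars are built in england",
    2: "the jaguar is a big cat found in south america",
    3: "classic cars and racing cars",
}


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


INDEX = build_index(DOCS)
ALL_IDS = set(DOCS)


def term(word):
    """Documents containing the word (empty set if the word is unknown)."""
    return INDEX.get(word.lower(), set())


def and_(a, b):
    return a & b


def or_(a, b):
    return a | b


def not_(a):
    return ALL_IDS - a


# jaguar AND (cars OR racing)
cars_query = and_(term("jaguar"), or_(term("cars"), term("racing")))

# jaguar AND NOT cars
animal_query = and_(term("jaguar"), not_(term("cars")))

print(cars_query, animal_query)
```

Note that this is pure keyword matching: the second query finds the document about the animal only because the word 'cars' is absent, not because the engine understands the concept.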

Query-based searching typically uses SQL to construct the search criteria. It often incorporates drop-down boxes for users to build queries without any coding required. However, complex queries often require some coding. Taxonomy-based search engines have a distinct advantage over keyword searches when it comes to returning accurate and specific information. Keyword-based search engines can be fooled by documents that have had keyword tags embedded in them, sometimes not related to the subject of the document. A search engine using a taxonomy will look for concepts in the text of the document, so embedded keywords will not have any effect on the search. This enables knowledge workers to find information both internally and on the Web much more quickly and efficiently. Taxonomy-based searches can also be extremely quick, returning thousands of highly relevant documents on an internet-based search in a matter of seconds. The results are normally also ranked in order of relevancy to the term or word used in the search, although the highest-ranked document is not necessarily the most relevant to the searcher. Taxonomy-based search engines also typically incorporate roles and rules-based searching, whereby access controls can be levied on individual documents or information categories. For example, a document classified as an invoice in the taxonomy could be available only to members of the Finance Department.


Searching across multiple repositories is the key requirement of a search and retrieval engine. If an organisation has the type of ECM architecture that Butler Group recommends, it should be possible to conduct a single search that spans multiple repositories as well as the Web. A user should be able to specify where he or she wants the search to be undertaken. There is a need for searches to respect the underlying permissions of repositories and content, so that only content the user is allowed at least to view is returned. This requires a tight level of integration between the ECM application and the underlying permissions of external repositories, typically held in access control lists.
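As a rough sketch of a single search that spans several repositories while respecting permissions, the following assumes each document carries a simple access control list of group names. The classes, the `federated_search` function, and the sample data are all hypothetical; a real ECM product would delegate the permission check to each repository's own security model.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    text: str
    allowed_groups: set = field(default_factory=set)  # simplified ACL


@dataclass
class Repository:
    name: str
    documents: list

    def search(self, keyword, user_groups):
        """Return matching documents that the user is allowed to view."""
        keyword = keyword.lower()
        return [
            doc for doc in self.documents
            if keyword in doc.text.lower() and doc.allowed_groups & user_groups
        ]


def federated_search(repositories, keyword, user_groups, where=None):
    """Run one search over several repositories.

    `where` optionally restricts the search to the named repositories.
    """
    hits = []
    for repo in repositories:
        if where is not None and repo.name not in where:
            continue
        for doc in repo.search(keyword, user_groups):
            hits.append((repo.name, doc.title))
    return hits


# Hypothetical repositories and content.
REPOSITORIES = [
    Repository("records", [
        Document("Invoice 17", "invoice for Smith and Son", {"finance"}),
        Document("Invoice policy", "how to raise an invoice", {"finance", "staff"}),
    ]),
    Repository("intranet", [
        Document("Expenses guide", "attach the invoice to your claim", {"staff"}),
    ]),
]

staff_hits = federated_search(REPOSITORIES, "invoice", {"staff"})
finance_hits = federated_search(REPOSITORIES, "invoice", {"finance"}, where={"records"})
print(staff_hits)
print(finance_hits)
```

A member of staff sees the policy and the expenses guide but not the invoice itself, while a Finance user restricting the search to the records repository sees both documents held there.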


In order to find a document or piece of content using its metadata, it needs to be classified. Two forms of content can be classified: content coming into the organisation, and content on the Web. There are two ways of achieving this: manually or automatically. Examples where classification can be used are e-mail and internal documents, records from enterprise applications, and documents required for Records Management. The manual classification of documents is very time-consuming and resource heavy. Organisations often also have a requirement to classify Web content, especially when they are in a regulation-heavy industry or wish to track competitors. Unfortunately, done manually, it is impossible to cover more than a small portion of the Web.

Net directories provide a hierarchical directory of documents, using a traditional tree structure. Each document is associated with a node on the tree, either a leaf or an internal node. By moving along the tree, a user is able to access content that has been manually categorised. A typical Net directory may be ten levels deep, with around 30 branches at each level, resulting in a few hundred thousand pages. Although this type of classification returns a fairly high level of relevancy, because it is created manually, it covers only a small part of the Web. It is also resource heavy as the process is manual. Rules can be established for specifying certain words and phrases for categorisation in a manual classification system, to ensure that each person undertaking this work categorises content in a consistent manner. There are different ways in which an automatic classification engine can work. Complicated algorithms are often involved, along with such factors as similar terms and distance. A common way of automatically classifying content is to analyse the frequency with which a word (often taken from a taxonomy) appears in a group of documents. Common words, such as 'the' and 'and', are disregarded. From this, a picture is built up of the number of times a particular term could be expected to appear in a document. This information is then used to 'train' the engine, before classifying content.

Another way of classifying content is to use well-categorised data to train the categorisation engine. Content patterns are examined, frequencies of words and phrases and how and where terms are used are detected, and thresholds created. This information is then used to categorise the content. Any content falling below the threshold is flagged, and can be manually categorised.

A drawback of automatically generated classification is accuracy; a computer can never be 100% accurate. Often documents are given a percentage accuracy, with users able to define a threshold below which the document can be manually checked. If all of the documents falling just below the threshold have been accurately categorised, then the threshold can be lowered. Below a certain percentage the engine will often not attempt a categorisation. Pattern recognition derived from categories using concepts found within unstructured text is another way of categorising content. Context is important to categorisation, which is where pattern recognition can be useful in ensuring that content has the correct context.
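The frequency-based training and the review threshold described above can be sketched as follows. This is a deliberately simple illustration, not the method of any particular vendor: the stop-word list, the training documents, the scoring formula, and the default threshold of 0.6 are all assumptions made for the example.

```python
from collections import Counter

# Common words that are disregarded (a small, assumed list).
STOP_WORDS = {"the", "and", "a", "of", "to", "for", "is", "in"}


def tokens(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def train(labelled_docs):
    """Build {category: relative word frequencies} from (text, category) pairs."""
    counts = {}
    for text, category in labelled_docs:
        counts.setdefault(category, Counter()).update(tokens(text))
    model = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        model[category] = {word: n / total for word, n in counter.items()}
    return model


def classify(model, text, threshold=0.6):
    """Return (category, confidence, status); low confidence goes to manual review."""
    words = tokens(text)
    scores = {
        category: sum(freqs.get(word, 0.0) for word in words)
        for category, freqs in model.items()
    }
    total = sum(scores.values())
    if total == 0:
        # No known terms at all: the engine does not attempt a categorisation.
        return None, 0.0, "manual review"
    best = max(scores, key=scores.get)
    confidence = scores[best] / total
    status = "auto" if confidence >= threshold else "manual review"
    return best, confidence, status


# Hypothetical, already-categorised training content.
TRAINING = [
    ("invoice total amount due payment", "invoice"),
    ("invoice number customer amount vat", "invoice"),
    ("contract agreement parties terms signed", "contract"),
    ("contract terms termination notice", "contract"),
]

MODEL = train(TRAINING)

print(classify(MODEL, "invoice for the amount due"))
print(classify(MODEL, "payment notice"))
print(classify(MODEL, "holiday rota"))
```

The first document is categorised automatically, the second is ambiguous and falls below the threshold, and the third contains no known terms. Lowering `threshold` corresponds to trusting the engine with more of the borderline cases.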

Taxonomy is the science of classifying information using a pre-defined system. A taxonomy should be simple, and easy to use. Organisations are increasingly building and implementing corporate taxonomies as a way of managing and organising increasing quantities of content. Hierarchical structures can be defined for taxonomies, resulting in subcategories. For example, 'fiction' could be a subcategory of 'book', or 'Smith and Son' a subcategory of 'invoices'. This provides the ability to classify individual documents in several ways, enabling different users to access the same document in different contexts.
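A hierarchical taxonomy of this kind can be sketched as a simple child-to-parent mapping. The 'book' and 'invoices' branches follow the examples in the text; the 'payment status' branch, the document id, and the function names are hypothetical additions for illustration.

```python
# Child -> parent links in the taxonomy.
PARENT = {
    "fiction": "book",
    "Smith and Son": "invoices",
    "overdue": "payment status",  # hypothetical extra branch
}


def ancestors(category):
    """Return the category followed by each of its parents, up to the root."""
    path = [category]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def file_under(index, doc_id, *categories):
    """Classify one document in several ways, including all parent categories."""
    for category in categories:
        for node in ancestors(category):
            index.setdefault(node, set()).add(doc_id)


INDEX = {}
file_under(INDEX, "INV-001", "Smith and Son", "overdue")

print(ancestors("fiction"))
print(sorted(INDEX))
```

Because the document is filed under two branches, a user browsing 'invoices' and a user browsing 'payment status' both reach the same item, each in their own context.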

There are a number of ways that taxonomies can be created. Some vendors, including Autonomy, favour the automatic creation of taxonomies. Autonomy's technology creates taxonomies using a conceptual understanding of information. A number of documents or clusters can be taken and the main themes identified.

Alternatively, a document may be used to find similar information about a subject. It can also be broken down to create sub-categories. This method of creating taxonomies eliminates all human intervention, and has the advantage of being fast, enabling whole taxonomies to be built speedily. It also cuts out human errors caused by having to read large volumes of content. However, not all vendors favour the automated approach. divine creates and maintains its taxonomy manually using a team of librarians. Acquired from the purchase of Northern Light Technology LLC, its taxonomy currently contains approximately 17,000 subject terms, has 16 top-level subjects, and hierarchies seven to nine levels deep. It uses sources such as Dewey Decimal, Library of Congress (LC) subject headings, and Standard Industrial Classification/North American Industry Classification System (SIC/NAICS), with an additional 1,000 distinct taxonomies consulted in specific subject areas. divine argues that its level of accuracy is much greater and more specific than taxonomies created automatically. Whichever way taxonomies are created, they can provide search engines with much better search and retrieval capabilities than those using a simple keyword-based search. To be effective, taxonomies must be highly focused on real-world business topics so that they can address real business issues. Butler Group believes that this is one area in which the manual creation of a taxonomy has an advantage over automation: machines, unless they are programmed to do so, do not know what important business concepts are. However, any manual intervention reduces the effectiveness of an automated process.

There are a number of vendors and software packages that provide taxonomy software, including Autonomy, Dialog, divine, Infosort, Inxight, Plumtree Corporate Portal, Semio, and Verity Knowledge Organiser. Dialog acquired NewsEdge, a global real-time news solutions vendor, and Intelligence Data, in 2001. This provided Dialog with a taxonomy which incorporates many aspects of news and information that are centred on subjects relevant to business. The company has a very large thesaurus of terms relating to business processes as well as company names, nicknames, subsidiaries, and acronyms. This is important for the accurate classification of documents as, to use a simple example, IBM, I.B.M., and 'Big Blue' all refer to the same company: International Business Machines. If any of these terms are used in the correct places in a document, it will be identified as being about IBM. Some search and categorisation products are delivered with pre-built taxonomies, particularly those targeted at specific vertical markets.
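The role of a thesaurus of company names can be shown with a minimal sketch. It assumes a one-entry thesaurus built from the IBM example in the text and uses plain pattern matching; it does not attempt the 'correct places in a document' judgement that a commercial engine would make, and the function names are hypothetical.

```python
import re

# Canonical company name -> known names, nicknames, and acronyms.
THESAURUS = {
    "International Business Machines": [
        "International Business Machines",
        "IBM",
        "I.B.M.",
        "Big Blue",
    ],
}


def build_patterns(thesaurus):
    """Compile one pattern per company that matches any of its aliases."""
    patterns = {}
    for canonical, aliases in thesaurus.items():
        # Longest aliases first, and refuse matches embedded in longer words.
        ordered = sorted(aliases, key=len, reverse=True)
        alternatives = "|".join(re.escape(alias) for alias in ordered)
        patterns[canonical] = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    return patterns


PATTERNS = build_patterns(THESAURUS)


def companies_mentioned(text):
    """Return the canonical names of all companies referred to in the text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


print(companies_mentioned("Big Blue reported its results today"))
print(companies_mentioned("Shares in I.B.M. rose sharply"))
print(companies_mentioned("The IBMX index was unchanged"))
```

All three spellings resolve to the same canonical company, so documents using any of them can be classified together.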

More general taxonomies include geographic regions or product codes. Butler Group believes that the best option is to use pre-defined taxonomies, where possible, that are customisable. The real value in taxonomy, apart from enabling much more accurate and efficient searches, is in the ability to use the taxonomy to automatically classify unstructured data, which could be e-mails, spreadsheets, and other information as well as word processed documents. The ability to quickly find the information required to perform a task, or to be able to store all information of a similar nature in the same directory are obviously important functions of an automated system, but automation can be applied to much more business-specific processes.

Searching Documents

Large multinational enterprises can receive thousands of documents a day. Many of these will be in an electronic format, but despite the fact that we are living in the so-called electronic age, we are actually using more paper than ever before, and this is reflected in the number of paper-based documents that organisations receive each day. While it is fairly easy to classify electronic documents, it is more difficult to classify and distribute paper documents across the enterprise. In the past, the information would have needed to be entered into the system manually. Now, however, using Optical Character Recognition (OCR) software and a taxonomy-based classification solution, it is possible to scan and automatically catalogue paper-based documents. This has real business value in the processing of, for example, invoices. Once classified as an invoice, specific details can be automatically extracted from the document, such as customer name, product details, invoice total, and VAT amount. Using a thesaurus to associate different terms with a particular field, the information extracted could automatically update a database.
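The extraction step can be sketched as follows, assuming the OCR software has already produced plain text with one 'label: value' pair per line. The field names, the thesaurus of labels, and the sample invoice are hypothetical; real scanned invoices are far less regular than this.

```python
# Thesaurus associating the different labels found on invoices with one field.
FIELD_THESAURUS = {
    "customer_name": ["customer", "client", "bill to"],
    "invoice_total": ["total", "invoice total", "amount due"],
    "vat_amount": ["vat", "v.a.t.", "value added tax"],
}

# Invert the thesaurus so that each label points at its field.
LABEL_TO_FIELD = {
    label: field_name
    for field_name, labels in FIELD_THESAURUS.items()
    for label in labels
}


def extract_fields(ocr_text):
    """Turn 'label: value' lines into a record keyed by database field name."""
    record = {}
    for line in ocr_text.splitlines():
        label, separator, value = line.partition(":")
        if not separator:
            continue  # not a 'label: value' line
        field_name = LABEL_TO_FIELD.get(label.strip().lower())
        if field_name:
            record[field_name] = value.strip()
    return record


# Hypothetical OCR output for a scanned invoice.
SCANNED = """INVOICE
Bill To: Smith and Son
Amount Due: 1,175.00
V.A.T.: 175.00
"""

RECORD = extract_fields(SCANNED)
print(RECORD)
```

The resulting record uses consistent field names whatever wording the supplier printed, so it could then be written to a database by whichever mechanism the organisation uses.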


No ECM solution is complete without a good search and categorisation capability, as there is little point in managing content if it cannot be found. The ability to search across federated repositories is a must, if the core ECM functionality is to become a platform in the infrastructure layer, as Butler Group believes it will. If content is well categorised, with the categories stored as part of the metadata, then there is no reason why a keyword search should not be sufficient to locate the content. However, concept-based searches, used correctly, can also provide highly relevant result lists.
COPYRIGHT 2003 A.P. Publications Ltd.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2003, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Publication:Software World
Date:Jul 1, 2003