TextWare: fast indexing and searching.
TextWare is designed to make the indexing and retrieval of information from large collections of existing, possibly non-uniform, documents as simple and rapid as possible.
TextWare provides satisfyingly rapid and flexible access to collections of text already on magnetic media or of scannable quality. As such, it would be a most useful way to organize and access a large collection of library bibliographies/guides. In a very busy reference/telephone service area, TextWare could be used not only for online versions of library documentation, but also to create an ongoing database of quick reference information, in effect, "making notes" to keep on file.
Fast Indexing Algorithm
Databases make fast retrieval possible because they are indexed. Generally, the number of indexed fields allowed must be limited to prevent the index from taking up a disproportionate amount of memory. To enter information into a database, that information must be converted and read or keyed into formats required by the database structure.
TextWare uses an algorithm that greatly reduces both the size of the index and the time needed to create it. With the TextWare algorithm, the size of the index virtually stops growing as the amount of text continues to increase. For example, one megabyte of data may result in a TextWare index of 200 to 400 kilobytes, while five megabytes of data will result in only 500 to 750 kilobytes of index entries. It is thus possible and practical to use TextWare to index every unique word in very large documents.
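The arithmetic behind that claim can be checked in a few lines. The sketch below is illustrative only: it uses the figures quoted in this review, reads the index size for five megabytes of data as 500 to 750 kilobytes (the unit the comparison implies), and assumes 1 MB = 1,024 KB. The function name is hypothetical and has nothing to do with TextWare itself.

```python
KB_PER_MB = 1024  # assumption: binary megabytes


def index_ratio(data_mb, index_kb):
    """Return the index size as a fraction of the data size."""
    return index_kb / (data_mb * KB_PER_MB)


# (data size in MB, low index estimate in KB, high index estimate in KB)
figures = [(1, 200, 400), (5, 500, 750)]

for data_mb, low_kb, high_kb in figures:
    low = index_ratio(data_mb, low_kb)
    high = index_ratio(data_mb, high_kb)
    print(f"{data_mb} MB of text: index is {low:.0%} to {high:.0%} of the data")
```

Under these assumptions the index falls from roughly 20 to 40 percent of the text at one megabyte to roughly 10 to 15 percent at five megabytes.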
TextWare uses the terms Card (field), Document (record), and CardFile (database) to describe its organization of data. The size of a Card, the basic unit of text, is determined by the user, and may be set to a page, a paragraph, or any user-defined amount of text. The user sets Card size according to the amount of information that will be most useful during text retrieval and display.
There can be only one Card size defined for each database (a database is called a CardFile). A CardFile is a set of Cards. Records (called Documents) provide an optional mid-point in the hierarchy: sets of Cards representing individual text files may be designated as separate Documents, or Cards from several text files may all be given a Document name. Having the CardFile subdivided into Documents enhances text retrieval, since a search can be limited to specific Documents. The Document name can also be part of the hit-list display for a search, and is very useful for immediately identifying what file the hit is from and thus how useful it is likely to be.
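The hierarchy described above can be modeled roughly as follows. This is a minimal sketch, not TextWare's internal format: the class names, the paragraph-sized Card rule, and the sample Document names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Card:
    number: int    # position of the Card within the CardFile
    document: str  # name of the Document the Card belongs to
    text: str


@dataclass
class CardFile:
    name: str
    cards: list = field(default_factory=list)

    def add_document(self, document, text):
        """Split text into paragraph-sized Cards under one Document name."""
        for paragraph in text.split("\n\n"):
            paragraph = paragraph.strip()
            if paragraph:
                self.cards.append(Card(len(self.cards) + 1, document, paragraph))

    def search(self, word, documents=None):
        """Return Cards containing word, optionally limited to some Documents."""
        word = word.lower()
        return [
            card
            for card in self.cards
            if (documents is None or card.document in documents)
            and word in card.text.lower().split()
        ]


guides = CardFile("GUIDES")
guides.add_document("Chemistry", "Indexes to chemistry journals.\n\nChemistry handbooks.")
guides.add_document("History", "Indexes to history journals.")

# Limit the search to one Document; the Document name labels each hit.
for card in guides.search("indexes", documents={"Chemistry"}):
    print(card.number, card.document, card.text)
```

Restricting the search to the Chemistry Document skips the matching History Card, which is the benefit of subdividing a CardFile into Documents.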
TextWare can index and retrieve from documents in a variety of formats. Certain external document formats are automatically converted to TextWare's internal format. These automatically converted formats include Microsoft Word 5.0, WordPerfect 4.2, and WordStar 5.0 and 5.5. Other external formats, such as PC Write, Volkswriter 3 and 4, WordPerfect 5.0 and 5.1, and ASCII files, can either be converted to TextWare's internal format or left in their original formats.
Since TextWare accesses files by pathname, documents can be located on any drive, including CD-ROM or optical disk drives.
Text Retrieved from Files
The TextWare main menu screen provides three choices: Text-Retrieval, Text-Indexing, and CardFile-Utilities. A help line prompts the user to use the arrow and Enter keys to select and activate the desired function.
Choosing Text-Retrieval brings up a list of the CardFiles in the current directory, along with options to change directory or drive in order to reach any other CardFiles.
When a CardFile is chosen, a query window appears, with the number of unique words and total number of Cards in the file noted at the top.
Powerful Search Capabilities
TextWare is capable of extremely powerful and flexible searching. Search features include use of explicit Boolean operators (AND, OR, ANDNOT), as well as an implied Boolean AND, by typing two or more terms with a space in between. The truncation characters * and ? may both be used anywhere in a search term.
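A rough sketch of how such queries can be evaluated against a word index follows. It is not TextWare's implementation: the sample index, the function names, and the strict left-to-right evaluation order are assumptions, and the * and ? characters are handled with Python's standard fnmatch module.

```python
from fnmatch import fnmatchcase

# Hypothetical inverted index: word -> set of Card numbers containing it.
index = {
    "library": {1, 2, 4},
    "librarian": {3},
    "catalog": {2, 3},
    "catalogue": {4},
    "online": {1, 3},
}


def cards_for(term):
    """Cards matching one term; * matches any run of characters, ? exactly one."""
    term = term.lower()
    matches = set()
    for word, cards in index.items():
        if fnmatchcase(word, term):
            matches |= cards
    return matches


def search(query):
    """Evaluate a flat query left to right; adjacent terms are an implied AND."""
    result = None
    operator = "AND"
    for token in query.split():
        if token in ("AND", "OR", "ANDNOT"):
            operator = token
            continue
        cards = cards_for(token)
        if result is None:
            result = cards
        elif operator == "AND":
            result = result & cards
        elif operator == "OR":
            result = result | cards
        else:  # ANDNOT
            result = result - cards
        operator = "AND"  # fall back to the implied AND after each term
    return result or set()


print(sorted(search("library catalog")))         # implied AND
print(sorted(search("catalog* ANDNOT online")))  # truncation plus ANDNOT
```

Each operator maps onto a set operation (intersection, union, difference), which is why Boolean searching is cheap once every word has been indexed.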
Pressing the F2 key provides access to search options. The default is a phrase search. When in phrase search mode, pressing the Ins key brings up a list of phrase search options, such as search by proximity, or field search. There are menus for defining each of these phrase search options.
The default proximity range is 10 words; this can be adjusted with the Proximity Window Definition. Field search allows the user to temporarily define the Cards in terms of fields or columns, in order to search information known to be in the same location in each Card.
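The proximity test can be sketched as below. How TextWare counts the distance between two terms is not documented in this review, so the sketch assumes that "within 10 words" means the word positions differ by at most 10; the function name and sample text are hypothetical.

```python
def within_proximity(text, first, second, window=10):
    """True if first and second occur within window words of each other."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        abs(i - j) <= window
        for i in first_positions
        for j in second_positions
    )


card = "The index is rebuilt whenever the reference staff add new notes to the file."

# "index" and "notes" are 9 word positions apart in this Card.
print(within_proximity(card, "index", "notes"))            # default window of 10
print(within_proximity(card, "index", "notes", window=5))  # narrower window
```

Narrowing the window trades recall for precision: the same Card matches at the default range but not at a range of 5.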
Various Viewing Options
The number of search hits is displayed after a search. Pressing the Enter key brings up the hit list in a short form (70 characters per item) display. Arrow keys and the Enter key are then used to display the full text of any Card in the list.
While viewing a Card, the user may move immediately to the search term by pressing F2. The user may browse either through the hit list or through a numerical sequence of Cards by toggling the F9 key and pressing the + and - keys. The user may also edit, delete, or print a Card, or view related Cards or Images.
The option to edit, using TextWare's internal editor or an external editor, is accessed simply by pressing the Enter key. Pressing the Del key while viewing a Card marks it for deletion. All additions, changes and deletions are held in a .BCH file. Thereafter, each time the CardFile is accessed the user will be notified that modifications exist but are not available until the CardFile index is updated or reindexed.
The existence of related Cards or Images is indicated by a message and Function key indicator at the top of the screen. The Related Images feature is impressive: the screen is blanked and a clear, useful image drawn very rapidly.
Printing and Downloading Options
The text of a Card, or its entire related Document, may be printed or downloaded to a file (using F4 or the Enter key) while the Card is being viewed. The F5 key is used to mark or unmark a Card being viewed for printing or downloading. Any of the Cards in the hit list may be similarly marked. Marked Cards can be printed either while viewing a Card or from the Search prompt.
Remarkably Rapid Indexing
Items on the Indexing Menu of Operations include Select-Files-to-be-Indexed, Index-Selected-Files, Update-Index, Merge-CardFiles, Edit-Command-File, Compress-CardFile, and Reveal-Control-Codes.
The Select-Files-to-be-Indexed menu item allows any file on any drive to be specified by supplying its complete pathname. This operation creates a Command File, which must be edited (before the files are indexed) to specify Document names and the desired hit-list format.
User-created (or default) synonym, stopword, and header files are invoked, along with the Command File, when the Index-Selected-Files menu item is activated. At this point the new CardFile is given a name. The actual indexing process is remarkably rapid. During this review, 30 pages were indexed to paragraph level in 2.40 minutes.
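The role of the stopword and synonym files during indexing can be illustrated with a small sketch. It is an assumption-laden model, not TextWare's algorithm: the word lists are invented, Cards are taken to be paragraphs, and synonyms are assumed to be folded into a single index term at indexing time.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "to", "in"}           # hypothetical stopword list
SYNONYMS = {"periodical": "journal", "serial": "journal"}  # hypothetical synonym map


def build_index(text):
    """Index every non-stopword in paragraph-sized Cards: word -> Card numbers."""
    index = defaultdict(set)
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    for number, paragraph in enumerate(paragraphs, start=1):
        for word in re.findall(r"[a-z0-9]+", paragraph.lower()):
            if word in STOPWORDS:
                continue  # stopwords never enter the index
            index[SYNONYMS.get(word, word)].add(number)
    return index


sample = "The periodical index.\n\nA guide to the serial and the journal collection."
index = build_index(sample)

# Both Cards are found under "journal", although only one uses that word.
print(sorted(index["journal"]))
```

Dropping stopwords keeps the index small, while the synonym map lets a search for one term retrieve Cards that use another.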
Sophisticated indexing is fairly easy, provided that the manual is read thoroughly beforehand. TextWare does not provide warnings or prompts (such as: do X before doing Y) on either the indexing or the compression functions.
Utilities for Configuration and Card-linking
The CardFile Utility Program menu provides five choices. The first is Delete-CardFile, which has four possible levels, from Delete-Update-Flags to erasure of the CardFile. The remaining choices are CardFile-Configuration, Protect-CardFile-from-Modification, Unprotect-CardFile-from-Modification, and Build-Related-Cards.
The CardFile-Configuration option is a real workhorse, with many sub- and sub-sub-menus. Almost every aspect of TextWare (pathnames, defaults, editors, reporting styles and formats, etc.) can be tailored by the user.
To Build-Related-Cards, the user supplies a file of Card numbers which are to be linked. Any number of Cards may be linked. However, should a viewer happen to retrieve a Card other than the first in the linked list, no indication of Related Cards is given. TextWare would be greatly enhanced by the ability to build, or at least mark, related cards interactively.
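The limitation can be pictured with a small model. How TextWare actually stores links is not described in the documentation reviewed here, so both functions below are hypothetical: the first mimics the observed behavior by recording a linked group only under its first Card, and the second shows a symmetric alternative in which every Card in the group knows its relatives.

```python
def links_by_first_card(groups):
    """Record each linked group only under its first Card (the observed behavior)."""
    return {group[0]: group[1:] for group in groups}


def links_for_every_card(groups):
    """Record each linked group under every member, so any Card shows its relatives."""
    related = {}
    for group in groups:
        for card in group:
            related.setdefault(card, [])
            related[card].extend(other for other in group if other != card)
    return related


groups = [[12, 40, 57]]  # hypothetical file of Card numbers to be linked

first_only = links_by_first_card(groups)
symmetric = links_for_every_card(groups)

print(first_only.get(40, []))  # nothing is reported when Card 40 is retrieved
print(symmetric.get(40, []))   # Cards 12 and 57 are reported
```

The symmetric table costs a little more space but removes the dependence on which Card in the group a search happens to retrieve first.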
Online and Printed Manuals
One of the CardFiles provided with the TextWare package is the TextWare manual. This online manual allows users to explore TextWare functions (just play around) while absorbing the manual information. In fact, the online manual was usually much faster and easier to use than the printed version.
A printed TextWare manual, which suffers from a lack of indexed subtopics, is also provided.
Preparatory to installation of the TextWare package, the user's CONFIG.SYS and AUTOEXEC.BAT files must be edited slightly. Running the install program is then a simple matter of loading three disks as prompted by the program. The full TextWare system can be installed and ready to use in under 10 minutes.
When correctly installed, TextWare can be run from any directory on the C: drive, either in menu mode or, for experienced users, from commands at the DOS prompt.
An Impressive Package
TextWare is quite an impressive package: powerful, flexible, customizable, and yet almost transparently simple to use, particularly in Retrieval mode. Ease of use is enhanced by consistent, intuitive use of keys. Many of the Function keys have equivalent one-word commands. If neither proves easy to remember, the Help screen is available where one would expect to find it, at F1. The Help Lines at the bottom of every screen usually provide all the reminders that might be necessary.
An aspect of TextWare I found refreshing was the sense that it was created for reasonably intelligent users: it makes very sparing use of the beep and the default is Yes rather than No on the checking menu that appears when exiting functions. Too many programs don't trust users to know what they are doing.
The ability to edit the text directly while viewing a Card is very useful, as is the opposite option to protect the text against random or unauthorized editing. Both would certainly be important in a library setting if the system were provided for public querying, yet needed to be updated from time to time.
Lastly, developments to note for the future: TextWare is currently working on versions that will handle Mac and UNIX files.
Suzanne Bell is the Computer Science Librarian and Online Coordinator at the Rochester Institute of Technology.
Author: Bell, Suzanne S.
Date: Apr 1, 1991