The maeftro di mufica, or quality control in the virtual library.

George Bernard Shaw famously summed up his frustrations with the irregularity of English pronunciation by explaining that the word "fish" might as well be written gh-o-ti--gh as in "enough", o as in "women", and ti as in "nation". When noting the misconstructions of Google Books' optical character recognition, one could easily believe oneself confronted by a similarly perverse logic, but, unfortunately, there is no logic to scanning errors. They cannot be explained by orthography or phonology. Optical-recognition software seeks to categorize the shapes of letters, to interpret them by their physical location, and to refine them by contextual clues. Its results slowly improve but are rarely perfect. Google takes pride in the superior quantity of its scans, but evidence of quality control or of efforts to improve performance incrementally seems to be persistently lacking. Here we document some common impediments to searches for documents discussing music.

Of the two common defenses of scanning errors, the major one is that "only a few" exist. Here one needs to understand the scale of the metric. Quoted rates of accuracy sound respectable on an academic scale of 1-100. The claims have slowly risen from 88% to 92%, 96%, and so forth. This is usually gauged against a simple text--an office computer script, a legal document, or a similarly regular writing. To the naked eye of a scanner, documents come in many levels of graphical complexity. Tables, illustrations, large blocks of white space, footnotes, and inconsistent type quality will all affect accuracy. A simple metric is illusory. Recognition Metrics, an OCR consultancy near Seattle focusing on recently created documents, explains an accuracy rate of 98% as representing a single page of 2,000 characters in which 40 will be incorrect. (3) Google Books, in contrast, attempts to render whatever is on its virtual shelf. A hypothetical error rate of 40 characters per page can mean 40 words per page without a dictionary match. Google Books has greater difficulty with early (c. 1500-1825) publications than with modern ones. Many books were set in larger type than is customary today, but local variations existed. Early typography favored bigger margins and careful centering. Calculating error rates for early books is not possible without knowledge of page formats (quarto, octavo, etc.) as well as margin allowances, fonts, etc. In manual encoding, texts are verified by sight or through double entry and comparison. (4) When neither produces an acceptable result, language-specific search-and-replace routines can spot and fix most errors. From a lexical perspective, most scanning errors are so predictable that they can systematically be located, then filtered by language and typography.
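The arithmetic behind the 98% figure can be made concrete with a minimal sketch. The page size and accuracy rate come from the example above; the average word length of six characters (letters plus one space) is an assumption introduced here only to show how a small character error rate can translate into a much larger share of damaged words.

```python
def expected_errors(chars_per_page: int, accuracy: float) -> int:
    """Characters expected to be wrong on one page at a given accuracy."""
    return round(chars_per_page * (1.0 - accuracy))


def worst_case_word_error_rate(char_errors: int, chars_per_page: int,
                               avg_word_len: float = 6.0) -> float:
    """Share of words damaged if every character error hits a different word.

    avg_word_len is an assumed figure for illustration, not one taken
    from any vendor's documentation.
    """
    words_per_page = chars_per_page / avg_word_len
    return min(1.0, char_errors / words_per_page)


errors = expected_errors(2000, 0.98)
print(errors)                                              # 40
print(round(worst_case_word_error_rate(errors, 2000), 2))  # 0.12
```

Under these assumptions a "98% accurate" page could leave roughly one word in eight without a dictionary match, which is the worst case in which no two errors fall in the same word.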

The second defense of "a few errors" in scanning is that recognition software is ostensibly "trainable". We examined this point in the context of musical notation in a controlled test of optical music recognition in Computing in Musicology. (5) One program consistently misplaced bar-lines. How does one quantify that kind of error? The object is present, but when many notes wander into the wrong measures, the cost of correction is high. In text as well, some objects deserve to be weighted more heavily than others. Among these, the initial letters of new sentences, paragraphs, and words merit higher weights to reflect their role in segmentation. Google also omits certain special characters. To judge from the number of times CCARH has had to report copyright violations to Google Books' legal department, one deduces that the copyright sign © (curiously absent in some reverse title-page scans of our own books of the 1990s) is beyond the capabilities of Google's recognition software.

Error categories in book-text recognition

Errors can be grouped into three general categories according to their impact. These are the misreading (1) of single letters within words; (2) of groups of letters that may make single words unrecognizable; and (3) of errors so numerous that the text is unintelligible. Errors of the first kind can usually be eliminated (in principle) by systematic orthographic search-and-replace functions. Errors of the second kind are often arbitrary in nature. Since they cannot be anticipated, they elude systematic correction. Errors of the third kind may altogether obscure the language of the text. Once a sentence or two is completely off-track, it is unlikely that accuracy will improve. Table 1 illustrates the first two kinds of errors. Table 2 shows words containing these letters in early printed books.

Class-1 errors

The single most common misreading in Google Books is the substitution of the letter f for s [hereafter s>f]. (12)

It is especially prevalent in works published up to about 1825 anywhere in Europe or North America. Because scanning is so dependent on letter-shape, a high degree of consistency can be found across cognates in Latin-alphabet languages. (13) This has a substantial impact on searches that involve the word "music" or its equivalents. The root is common to both Germanic and Romance languages. The readings "mufic", "mufique", "mufica", and "Mufik" seem to be ubiquitous. Google Books is not the only offender, simply the biggest. Google Translate cannot digest more than one or two instances of non-lexical results of its own scanning without launching into an endless loop. (14) The caveats for those searching for Psalm settings, hymns, and liturgical music can be summed up simply with the warning to be on the lookout for such non-words as "Bleffed", "Jefu", "Chrift", "Hofanna", "Ifrael" and other common terms with a native s.
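Because Class-1 errors of the s>f type are so regular, a lexicon-filtered search-and-replace can in principle undo many of them. The following is a minimal sketch of that idea, not a description of any existing tool: the function name and the tiny word list are hypothetical, and a real filter would load a full lexicon for each language and period.

```python
from itertools import product

# Tiny stand-in lexicon (an assumption of this sketch); a real filter
# would load a full word list for each language being processed.
LEXICON = {"music", "musique", "musica", "blessed", "israel", "fish", "of"}


def restore_long_s(token: str, lexicon: set = LEXICON) -> str:
    """Return the token with f -> s restored where the lexicon supports it.

    Every combination of keeping or replacing each 'f' is tried, so a word
    containing both genuine f's and misread long s's can still be recovered.
    """
    lower = token.lower()
    if lower in lexicon or "f" not in lower:
        return token
    positions = [i for i, ch in enumerate(lower) if ch == "f"]
    for choice in product("sf", repeat=len(positions)):
        candidate = list(lower)
        for pos, ch in zip(positions, choice):
            candidate[pos] = ch
        word = "".join(candidate)
        if word in lexicon:
            # Preserve an initial capital from the scanned token.
            return word.capitalize() if token[0].isupper() else word
    return token  # no lexical match: leave the reading untouched


print(" ".join(restore_long_s(t) for t in "Bleffed be the mufic of Ifrael".split()))
# Blessed be the music of Israel
```

Tokens already in the lexicon ("fish", "of") are left alone, which is what keeps the routine from turning every genuine f into an s. A token with no lexical match is also left untouched, so unpredictable misreadings are neither repaired nor made worse.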

Class-2 errors: Unpredictable character misreadings

When two or more adjacent characters are misread in a single word, there is often a typographical ligature involved. Letter sequences that used to be joined into one physical character included fi, ffi, li, lli in English, ae [æ] in British renderings of words derived from ancient languages, and the ß [originally sz] of formal German. (15) (See Table 2.) In parallel with ligatures, diacritical marks usually appear in a single composite character (a, e, e, et al.). (16) Google Search seems to lack any sense of how to spot misinterpreted characters even when, with a language filter, many illegal character combinations could readily be found.

This kind of error becomes especially problematic when it occurs in close proximity to an s > f conversion, since entire words and phrases may become irredeemably unrecognizable. In writings on music the s > f permutation has the greatest negative impact, it seems, on books in German. Sc-, sch-, and -st are frequently replaced by such alexical constructions as fc- and fch-, or the viable but often misintended -ft. One example referring to C. M. von Weber's Der Freischütz yielded the snippet "... [Freifchuss] was [war?] die deutfche Mufik fur die Buhne werden konnte. wenn fie ... in meiner zur Feier von Schillers hundertftem Todestag erfchienenen Feftfchrift." (17) An s can also be misread as a p, d, or t. (The number of permutations is seemingly endless.) Consider this reported title: "Gottfched: Gedanken vom Urtprung und Alter der Mufik; in deffen kritifcder Geihtehte [Geschichte?] der Dichtkunfi [-kunst] der Deutfchen. Leipzig. 1757."

Google's perennial exclusion of punctuation marks impedes the proper segmentation of words, phrases, and sentences. (18) However, punctuation marks are used liberally for unrecognized characters. (19) Punctuation specific to particular European languages--the Spanish inverted question mark (¿) or French quotes («...»), for example--may, together with currency signs and mathematical symbols, be sprinkled liberally (and inappropriately) throughout scanned texts. (20) Characters, numerals, and letters with similar shapes may be confused, as in misreadings of lower-case l, the numeral 1, and the exclamation point !. Nonsensical variants of "The" include any middle letter that is as high as the "T": "Tbe", "Tde", and so forth. At the start of titles and sentences J and I are regularly confused (they were represented by the same letter in early typesetting). In the listing of ten titles beginning with the word "Jesu" in Illustration 1, six are highlighted as matches, while four are ignored. These kinds of errors can rarely be anticipated. In Latin and Romance languages lower-case v was rendered as u, upper-case U as V. Here recognition of early printed texts carries an implicit obligation to modernize for the reader but to preserve for the scholar. Ultimately the purpose of recognition should be clarified. Serving many audiences simultaneously is not destined to produce results that will satisfy all of them.

Some letter-changes are too idiosyncratic to classify. One extract from Gio. Battista Martini's Storia della musica (1757) refers to "Joacbirn Qgantz" [ = Joachim Quantz] just before citing "Pier France/20 Tofi [ = Opin. de' Cantori." The actual author would be Pierfrancesco Tosi, the work in question his Opinioni de' cantori antichi e moderni (1723). The surname Mendelssohn is particularly prone to distortion, as in "Mendelfohn Bartholdy wurde Mufikdirector und ubernahm die Leitung der Oper. Das Haus wurde renovirt und verziert und mit einer wenigftens anftandigen Ausseufeite gefchmuckt. Jn kurzer Zeit entfiand unter diefer Leitung ein Theater". (21)

Class-3 errors: Gobbledygook

Gobbledygook can start out innocuously with a b substitution for h in "the", a J at the start of any sentence starting with an I, or the overuse of ? and other punctuation marks for any unrecognized character. Small problems are compounded by the absence of spaces. Consider a citation from what proves to be the preface to an edition of Seneca from the year 1800. Google's snippet says this:

"r K ^ L r ^ I' I 0. ViasAL ... ^nal. II, zi. 10H.) noto. ' Huo mre^ >>lii via'eriin. Vi6e O^ttio^. Zel. ^112." <<um 1800. nu. 36. r>. 36a. nN8 6 t lloetilliinis, <nu c^nnni ex ni8ce epilta- lis.

The quotation comes (ostensibly) from Ruhkopf's "Praefatio" to the Opera Omnia edition of Seneca's works published by Weidmannische Buchhandlung, Leipzig. The passage is supposed to match Note 6 of the preface found in Vol. II, p. xii, which (in contrast to the snippet) reads:

      Cf. Wernsdorf I. 1. p. 12. Addi nunc potest Iunioribus aliis,
   a W. ibi allatis Iunio poeta, cuius epigramma elegans nuper primus
   protulit Ennius Quirinus Visconti in libro docto: Lettera su due
   monimenti, ne' quali e memoria d'Antonia Augusta p. 20. et
   vindicavit M. Pompeio iuniori, iam ex Anthologia (Brunk. Anal. II,
   p. 105.) noto. Quo iure, alii viderint. Vide Gotting. gel. Anz.
   anni 1800., nu. 36. p. 360.


The passages do not exactly coincide, but some common material is faintly identifiable. What is clear is that the absence of a Google lexicon for bibliographical abbreviations contributes to the derangement of the text. (22) The more fundamental problem seems to be that Google's scanning algorithms cannot differentiate running text from footnotes, and so, in this case, have simply run from text to footnote without cognizance of the independent verbal contexts.

Systematic errors in other large digital collections

JSTOR is generally above the fray in scanning errors, but it is not free of a few persistent defects. (23) Any number of JSTOR listings, even for recent articles, have s>f substitutions combined with other bizarre misspellings, but for numerous reasons the overall rate is much lower. Once in a while JSTOR completely misfires, as in this example:

   ..., por Fr. Francifco Xi- menez, hijo del Conuento de S.Domingode
   Mexico, Natural de la Villa de Luna del Reynode Aragon. A , bie R&.
   P. Maeftro Fr. Hermando Bana,Ppior Prouincalde 14 Protincia de S,
   iidio de Mexic,de l Orden de lie F redicadoer,e yCatbedratic
   hubiladode Tbeologia eI Il l niMe,fdad.... (24)


The quotation comes from a facsimile of a 1615 title-page (De la Naturaleza, e Virtudes de las plantas, i.e., a book on botany). The title-page was shown as an illustration in a modern article that was labeled a "match" in a search for the word "arias". The original title-page text of the work carried an elaborate dedication to Francisco Ximenes and to "N.ro [Nuestro] R. P. Maestro Fr. Hernando Bazan, Prior Provincal de la Prouincoia de Sa[n]ctiago de Mexico, de la Orden de los Predicadores, y Cathedratico Iubilado de Theologia en la Vniuersidad Real" [Our Rev. Father Hernando Bazan, provincial prior of Santiago of Mexico, from the order of preachers and professors of theology in the Royal University]. Facsimiles of title-pages from early prints within modern publications present a consistent trap comparable with that of abbreviation. Spurts of garbled text occur in any number of JSTOR republications of recent articles from journals such as Early Music, which also reproduces title-pages similar to this one.

Gallica (http://gallica.bnf.fr), which offers a cross-medium search engine spanning early and recent prints, manuscripts, images, and sound files, is not directly comparable with others. The extreme care it gives to difficult projects, such as its exquisite (and easily found) scans of illuminated manuscripts of Machaut's poetry and Cavalli's operas, demonstrates a high regard for both quality and retrievability. Because it includes a large number of early printed books, Gallica offers an interesting antidote to Google Books: it contains very few errors of the kinds discussed here. It has relatively good success in avoiding the pitfalls of archaic French. (25)

Archive (http://www.archive.org) is much older in origin and still more heterogeneous in the range of materials it provides. Its lapses are far fewer than those of Google Books, but some of the categories into which the errors fall are the same. (26) One persistent glitch shared by Archive and JSTOR is an inability to suppress hyphens used in line segmentation when searching for single words. A search for an author named Gastone Vio in JSTOR encounters numerous "matches" for "vio-" in contexts in which the following word is "loncello". Case sensitivity would clearly go some distance in fixing the problem.
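The hyphenation glitch lends itself to a short illustration. The sketch below, written under the assumption that the recognized text keeps its original line breaks, rejoins words split at line ends before indexing and then searches for the name case-sensitively. The function names and the sample text are hypothetical; neither JSTOR's nor Archive's actual indexing is being described.

```python
import re


def join_line_breaks(text: str) -> str:
    """Rejoin words split by a hyphen at the end of a line."""
    # A letter, a hyphen, optional spaces, a newline, then a lower-case letter.
    return re.sub(r"(?<=[^\W\d_])-[ \t]*\n[ \t]*(?=[a-z])", "", text)


def find_name(text: str, name: str) -> list:
    """Case-sensitive, whole-word matches for a name."""
    return re.findall(rf"\b{re.escape(name)}\b", text)


page = "Gastone Vio studied the vio-\nloncello parts."

# A case-insensitive search is fooled by the line-end fragment.
print(re.findall(r"\bvio\b", page, flags=re.IGNORECASE))  # ['Vio', 'vio']

# A case-sensitive search is not.
print(find_name(page, "Vio"))                             # ['Vio']

# Rejoining the word removes the fragment altogether.
print(join_line_breaks(page))
# Gastone Vio studied the violoncello parts.
```

Rejoining is not free of risk: a genuine compound that happens to break at its hyphen (for example "title-" followed by "page") would lose the hyphen, so a lexicon check of the joined form would be a sensible refinement.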

Evaluating incidental errors found in Google Search that match writings on third-party websites rather than in Google Books is not straightforward. However, a strong resemblance to lapses in Google Books will be noted. A random search for letter transpositions turned up these two versions of the same passage from Ephraim Chambers' Cyclopaedia, or, An Universal Dictionary of Arts and Sciences (1728):

a. "The fixth Chord of BaSs-Viols, and the tenth of large Theoobos, confift of 50 Threads, or Guts : There are Some of them 100 Foot long, twisted and polish'd with....";

b. "lerrawit obferves, that of late they have invente, C changing the Chords, to render their Sound mor without altering the Tone. fixth Chord ot Bafs-Viols, and...." (27)

In these cases the content is unambiguous, and it is available to the user. Whether the user will be enticed by such misinterpretations to view it is open to question. (28) The second quotation comes not from the original four-volume work (1728) but from a 1753 supplement found in a separate PDF at the same Wisconsin web location. The Wisconsin digital search engine provides said page in response to the (local) Boolean search "fixth" and "Chord".

Remedies

The sad part about the survival of so many ragged passages is that tools to remedy most of their defects are available. ABBYY FineReader offers what it calls "Historic OCR" for now unfamiliar kinds of typography. It has an alluring "before and after" example at its "Frakturschrift" page: http://www.frakturschrift.com/en:start. (29) The example adds in its summary that "tuned and optimized recognition technologies have to be used when processing historic documents printed in old fonts." At the same time ABBYY Historic OCR offers a discussion of "challenges" that were studied in the European Libraries IMPACT [IMProve ACcess to historical Text] project. (30)

The carefully curated Deutsches Text Archiv (http://www.deutschestextarchiv.de/), in which only two matches for "Mufik" could be found, has a built-in safeguard against nonsense. It shows the original text and the modern script side-by-side, which allows the user to easily identify any lapses. On a more general note, The Signal, an online blog of the Library of Congress's digital preservation program, offers a rigorous, detailed account of optical recognition and its efficiencies--when done consistently and well. (31)

In ordinary text-search on a single server, it would normally be possible to employ operators and delimiters (the "regular expressions" of the Unix grep tool) that would compensate for most spelling idiosyncrasies in Google Books. Because most characters used in grep queries are off limits in Google Search, (32) users may prefer to explore other search engines. The grep expression "[ch]at" would find all instances of "cat" or "hat" (the square brackets identify an either/or set). Likewise a search for "mae[fs]tro" would find all instances of both "maeftro" and "maestro". Table 3 offers a short list of the operators (e.g., AND, OR, NOT) supported by some common search engines to help narrow down the results. A comprehensive introduction to the subject of operator usage in search engines is available in a 2011 PowerPoint presentation by Paul Barron. (33)
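The same bracketed either/or set is available in the regular-expression libraries of most programming languages, so a scholar with a local copy of the recognized text need not depend on a search engine's operators. The following minimal sketch uses Python's standard re module; the helper name and the sample snippets are hypothetical.

```python
import re


def long_s_pattern(word: str) -> re.Pattern:
    """Compile a pattern matching a word in which any s may be read as f."""
    parts = ["[fs]" if ch in "sS" else re.escape(ch) for ch in word]
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


snippets = [
    "Il maeftro di mufica",
    "Il maestro di musica",
    "die deutfche Mufik",
]

pattern = long_s_pattern("maestro")  # equivalent to mae[fs]tro
print([s for s in snippets if pattern.search(s)])
# ['Il maeftro di mufica', 'Il maestro di musica']
```

The pattern casts a slightly wider net than intended, since it would also accept a genuine f in the position of an s, but for words such as "maestro" or "Musik" no competing lexical form exists.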

Data repositories that emerged in the decades before Google and newer archives that consist entirely of material entered by hand have the advantage that their holdings contain exactly what their users entered--and verified. No instance of "maeftro" or other misspellings cited here will be found in most curated collections, nor in Wikipedia. Some repositories do, by intention, provide exact transcriptions that capture the wandering spellings of earlier centuries. Notational errors in music manuscripts are faithfully recorded in all the RISM databases, for example. A text equivalent would be the Early English Books Online database (http://quod.lib.umich.edu/e/eebo?key=title;page=browse;value=ar). Among its 25,000+ titles, the 1600 print of Shakespeare's Much Ado about Nothing reads: "Much adoe about nothing. As it hath been sundrie times publikely acted by the right honourable, the Lord Chamberlaine his seruants. Written by William Shakespeare." (34) Scholars can turn to such sources to appraise the state of usage at a particular time without cringing when they see the word "seruant", because what the modern eye sees as deviations were the proper forms of printed language at the time of publication.

While Google Books is a great boon to many scholarly endeavors and indisputably saves many trips to a physical library, its rough texts impose a degree of inconvenience when accuracy and precision are required. The Advanced Search form for Google Books enables search by ISBN, publisher, and year of the print (all possible assets for the eventual resale of scanned out-of-print titles (35)), but these options do not provide an adequate means of overcoming the errors described here. Dan Cohen's "Is Google Good for History?" (2010) is one of the most comprehensive and diplomatic evaluations of the strengths and weaknesses of Google Books. (36) As the executive director of the Digital Public Library of America, Cohen offers extensive praise, but he perceptively questions Google's possible privatization of aspects of its celebrated open-access model. Cohen defends the company on the ground that its aim was to work quickly. To do the job well, he supposes, might have taken a century instead of a decade. He objects, though, to the lack of availability of research data and bulk downloads. (37)

An earlier appraisal (2009) by Geoff Nunberg ("Google Books: The Metadata Mess") noted other kinds of errors, the most bizarre--a proliferation of books published in "1899" by living authors--having been fixed. (38) Nunberg lamented the hopelessness of genre classification for literature, noting that Jane Eyre surfaces under the rubrics of autobiography, governesses, love stories, architecture, antiques, and collectibles. In music this is a more complicated issue. (39)

Yoav Goldberg (Bar Ilan University) and Jon Orwant (a manager of Google Books) presented a case of their n-gram approach to "a very large corpus of English Books" in a 2013 paper entitled "A Dataset of large syntactic n-grams over Time ..." based on a linguistic analysis of 345 billion words. (40) Their aim was to produce a usage timeline for designated terms. (41) The rise and fall of word usage is a perennial matter of interest to lexicographers but not one that is widely shared by most humanities scholars. "Big data" studies such as this one intermingle gleanings from texts the scans of which lie across a spectrum of accuracy rates. Humanities scholars generally want a result free of butchered words. The fact is, though, that Google Books' own objectives would be better served by a higher degree of accuracy. (42)

The current state of fidelity of scanned early books to their physical originals suggests that we need the kinds of tools for search that we find mainly in curated repositories. In fact textual scholarship may be more efficiently served qualitatively by repositories that have existed since the days of mainframe computers. The Oxford Text Archive [http://ota.ox.ac.uk], established roughly 40 years ago, supports text search in 25 languages (ancient and modern) and includes the earliest encoded texts of Shakespeare, Milton, and the Bible plus numerous other writings studied by scholars. Project Gutenberg's book catalogue, in process of development since 1971 (http://www.gutenberg.org/catalog/), consists entirely of materials (again in numerous languages) curated by volunteers. The project design is an early harbinger of "crowd-sourcing." Even when all licensed-access database holdings are added to these repositories, the quantity war has clearly been won by Google Books. Google offers simplicity of search and universal access (the latter degraded at times by disregard for copyright restrictions).

Who will win the quality war? We may want to consider whether parts of today's "digitized" world will, in a distant future, be seen to belong to a primordial past. Google's huge investment in Google Books seems to be undermined by its indifference to improvement. All projects founded on scanning face the risk of achieving a value inversely proportional to their error rates. The need to insist on intellectual rigor in our growing digital libraries looms large on the humanities horizon.

Eleanor Selfridge-Field (2)

(1.) Il maestro di musica was a common antecedent phrase in opera buffa titles in the eighteenth century. There was a wide range of consequents, introduced by the word "or". Misrepresentation of social station or musical skill provided the foundation for the comic plot. Valuable advice has been contributed to this article by Ilias Chrissochoidis and Maureen Buja.

(2.) Eleanor Selfridge-Field is consulting professor of music at Stanford University and research director of the Center for Computer Assisted Research in the Humanities (CCARH). She teaches music informatics to students in several disciplines, maintains a historical research agenda in Italian music, and has given several recent lectures on digital humanities topics. She is the author of six books, the editor of sixteen, and a contributor to many journals in musicology (most recently Early Music, Journal of Interdisciplinary Musicology, Notes, and Musicae Scientiae).

(3.) See http://www.primerecognition.com/cost_justification.htm.

(4.) Double transcription and comparison is a process whereby two encoders independently prepare the same text. Divergences revealed through comparison lead to necessary corrections. Studies from the 1980s confirmed a near-perfect result from this method.

(5.) Eleanor Selfridge-Field, "Optical Recognition of Musical Notation: A Survey of Current Work," Computing in Musicology 9 (1993-94), 109-145; same author, "How Practical is Optical Music Recognition as an Input Method?" Computing in Musicology 9 (1993-94), 159-166.

(6.) Quoted from "La Festin du Pierre," in The Works of Moliere in French and English (London: Watts, 1748), p. 274. Why "toujours" is consistently truncated [as "toujou"] is unclear.

(7.) Quoted from Le Mercure de France (Mai 1768), p. 171.

(8.) This quotation runs together a sub-entry under TRIGYNOUS through TRILATERAL to the end of TRILETTO.

(9.) Oracion funebre panegyrica (Seville: En la Imprenta de la Universidad, 1744).

(10.) From contents listing for Carl von Winterfel[d], Der evangelischen Kirchengesang und seine Verhältnis zur Kunst des Tonsatzes (Leipzig: Breitkopf und Härtel, 1847).

(11.) Quoted from La Borde [writing as « Onfroy »], Essai sur la musique ancienne et moderne (Paris : De l'Imprimerie de Ph.-D. Pierres) t. 4, 1780.

(12.) Between the fifteenth and nineteenth centuries the letter s existed in at least three forms (often simplified to two in modern discussions). Early English and French exemplars often extend both upward and downward. The intermediate version (Old Dutch, Renaissance Italian) extended upward but not "below the line" of most characters. The s in seventeenth- and eighteenth-century English and North American colonial typography was more notable for its lack of a cross-piece than its upward extender. (Both of these forms are classified as belonging to the "long s" class.) The round s was determined less by locale than by function within a word. It was used especially for initial and terminal positions, while the other form was used in most interior positions, sometimes modified to parse the word itself. For example, on Felix Mendelssohn's tombstone the surname (rendering long s here as f) reads Mendelsfohn. This tells us that the successive s's belonged to different syllables, while in a word such as "recess" or "progress" both s's would be long and thus resemble "receff" and "progreff".

(13.) Polish is an outlier (because of its large number of diacriticals). German Fraktur is problematical both because of overlapping ascenders (b, d, f, h, et al.) and descenders (g, p, and y) and because of decorative tendrils distracting the "eye" away from a letter's essential shape. Specialized software enables optical recognition of Greek, Hebrew, and Cyrillic, which have finite numbers of characters but wide variation in their rendering. In Asian scripts Hiragana and Katakana syllables are manageable because of their finite number, but pictographs as found in Kanji and Mandarin pose big challenges. Languages based on cursive script (Arabic, Persian) present a range of different choices related to variability in letter formation and in use of interpretive marks.

(14.) If one clicks the "translate" prompt shown with a citation that is obviously garbled, the "translate" software churns away until someone turns it off.

(15.) The Romanization of Fraktur in the nineteenth century lacked an appropriate ligature. In this instance recent books can produce more errors.

(16.) Those who use Adobe® fonts will appreciate that their support for joined characters continues unblemished, while word processors offer no support for ligatures. Bembo® is a particular favorite of those trying to imitate early typography and could be a useful base font for training recognition software intended for use with early books, although the objective is to recognize ligatures in any font.

(17.) Mistaken punctuation replicates that in the screen view. The citation comes from Westermanns Monatshefte (1908), no page number shown.

(18.) Punctuation marks can interfere with the indexing of n-grams--character strings of progressively larger lengths--which facilitate the profiling of word-usage statistics along a time-line.

(19.) When, in 2011, Google introduced the cypher "+" to identify its Google+ social network, accommodations seem to have been made in its advanced search to obviate confusion.

(20.) Scanning software does not in general admit to its defeat, although some Google Books texts are full of "?"s, that may or may not indicate the software was admitting confusion.

(21.) From a 1940 study said to be by "Robert Blum and K Herlozsohn". The second name is not traceable nor, consequently, is the source. On further research, it may be that the source is the Allgemeines Theater-Lexikon oder Encyklopädie alles Wissenswerthen für Bühnenkünstler, Dilettanten und Theaterfreunde unter Mitwirkung der sachkundigsten Schriftsteller Deutschlands, edited by R. Blum, K. Herloßsohn, H. Marggraff, etc., a multi-volume work published in Altenburg and Leipzig, Germany, between 1839-1946. (The second editor, Herloßsohn, suffers from the same transcription problems illustrated in this paper because of the repetition of the letter 's' in his name: in the original, the first two 's' were actually ß, which is replaced in modern German by 'ss'. Since this would have been written in Fraktur, it appears to modern eyes like a 'long s' followed by a 'z', or even just an elaborate 'z', hence the transcription error.)

(22.) Another snippet from the same work contains the phrases "^'a lqU;don°PPO"ihlr'^'a lqU;don°PPO"ihlr'" and " 7Hbehs facjee, Omne^o ^11Γ,Γ^Γ que eft". These were not retrievable in a literal Google search, presumably because of the exclusion of non-alphabetic marks in search input. (An alternative scan of the same work is available on request from the National Library of the Czech Republic via Europe's Books2ebooks with the listing found at http://search.books2ebooks.eu/Record/nkcr_stt20110031756.)

(23.) A useful account of JSTOR's formative years is provided in Chapter 4 of Roger C. Schoenfeld's JSTOR: A History (Princeton, NJ: Princeton University Press, 2003). It divulges many details of the quandaries encountered in JSTOR's development. Scanning errors make up a small part of the picture when the contributions of intermediate technologies, storage media, graphical detail, and vendor particularities are factored into the picture. Preferences also vary by discipline. The original scientific model required accommodation for humanities journals.

(24.) This example comes from Rafael Chabran and Simon Varey, " 'An Epistle to Arias Montano': An English Translation of a Poem by Francisco Hernandez," Huntington Library Quarterly, 55/4 (1992), pp. 621-634. This match responded to a search for the English term "arias".

(25.) E.g., by correctly rendering the s in "plutost" (rather than presenting "plutoft") before the word became "plutôt", cf. Illustration 2.

(26.) Within Archive's multiplicity of formats instances of "claffical" music together with such words as "preferve", "fuch", and "inftitution" are ubiquitous in *.txt files but do not necessarily occur in corresponding passages in more finished formats.

(27.) In modern English: "The sixth string of bass viols, and the tenth of large theorbos, consist of 50 threads or guts: There are some of them 100 feet long." and so forth. The first quotation comes from Chambers' Cyclopaedia as found at the ARTFL server at the University of Chicago--http://artflsrv01.uchicago.edu/cgi-bin/philologic/getobject.pl?c.0:2364. The second quotation, at the University of Wisconsin, Madison, comes from http://digicoll.library.wisc.edu/collections/HistSciTech/Cyclopaedia.

(28.) The Wisconsin case in particular merits comparison with the Google paraphrase. See http://digicoll.library.wisc.edu/cgi-bin/HistSciTech/HistSciTech-idx?type=turn&id=HistSciTech.CycloSupple02&entity=HistSciTech.CycloSupple02.p0895&q1=fixth&q2=Chord.

(29.) Those interested in technical information will find it at http://www.frakturschrift.com/_media/en:white_paper_gothic-fraktur_ocr_e.pdf. Digital librarians will be pleased to note this addendum: "...improvements achieved in processing documents mean that today's OCR software can also be applied to image collections and historical documents that are already scanned."

(30.) See http://www.frakturschrift.com/en:projects:impact.

(31.) See http://blogs.loc.gov/digitalpreservation/2014/08/making-scanned-content-accessible-using-fulltext-search-and-ocr/. This account discusses indexing, language-tuning, procedures to preserve metadata when corrections are made to recognized text, and much else.

(32.) Unix is particularly dependent on the verticule (|), which in Google Books results seems to be a random marker for unintelligible characters. Uses of this character in various programming contexts are discussed in the "Vertical bar" article in Wikipedia (http://en.wikipedia.org/wiki/Vertical_bar, accessed on March 18, 2015).

(33.) "Advanced Web Searching for VEMAns," http://vaasl.org/pdfs/Conference_Handouts/2011/Barron%203.pdf. Barron is director of library and archives at the George C. Marshall Foundation.

(34.) http://name.umdl.umich.edu/A11991.0001.001.

(35.) Librarians who were among the first to allow Google access to their collections now privately share the perception that Google's own enthusiasm for the project has waned as its "market potential" has remained elusive.

(36.) See http://www.dancohen.org/2010/01/07/is-google-good-for-history/comment-page-1/.

(37.) In response to Cohen's post, Brandon Badger of Google Books pointed out that [Google's] epubs contain the optically recognized data that linguists would like to use, whereas PDFs contain only page images. (N.B. Recent efforts to access that data according to Badger's advice did not yield searchable results.)

(38.) Geoffrey Nunberg, "Google's Book Search: A Disaster for Scholars," Chronicle of Higher Education, 31 August 2009 (https://chronicle.com/article/Googles-Book-Search-A/48245/); rev. as "Google Books: The Metadata Mess," Presentation at the Google Books Settlement Conference, University of California, Berkeley, 28 August 2009 (http://people.ischool.berkeley.edu/~nunberg/GBook/GoogBookMetadataSh.pdf). The theme is newly expanded in Diana Kichuk, "Loose, Falling Characters and Sentences: The Persistence of the OCR Problem in Digital Repository E-Books," portal: Libraries and the Academy 15/1 (2015), pp. 59-91 (DOI: 10.1353/pla.2015.0005).

(39.) Genre in music is a more vexing problem and one less susceptible to semantic remedies, given that in the popular/country/folk sphere Billboard Magazine, which is the arbiter of popular categories, has been accused of manipulating its classifications to stimulate sales of lagging "genres". For Google Books' approach to music see the pertinent section of their sitemap: http://books.google.com/sitemap/Sitemap/Music.html.

(40.) Second Joint Conference on Lexical and Computational Semantics, Association for Computational Linguistics, Atlanta, Georgia, USA (2013), pp. 241-247.

(41.) Time-lines are also in course of implementation in JSTOR's bibliometric Data for Research project, on which see http://about.jstor.org/service/data-for-research. Since music cannot be isolated as a discrete subject area in JSTOR, these are currently of limited value. Further documentation can be found at http://about.jstor.org/sites/default/files/misc/Search_Documentation.pdf.

(42.) The mission statement of Google Books (accessed on March 18, 2015 at http://books.google.com/intl/en/googlebooks/library/) asserts that the aims are to "make it easier for people to find relevant books ... [and] to create a comprehensive, searchable, virtual card catalog of all books in all languages".

TABLE 1 Examples of f>s substitutions and miscellaneous errors in
online search, listed in declining order of frequency as of November
2014; the examples are in bold type. Numbers and quotations come from
Google Books unless otherwise specified. The emphasis is on spelling
errors in the rendering of scanned texts where specialists would see
correct renderings in now unfamiliar typographical formations.

Search term      No. of matches           Missing match: text

Mufik            776,000                  Abdruck feiner ganzen
                                          Vortreffliclgkeit, feines
                                          acht menfchlichetu acht
                                          kunfilerifchen Charakters,
                                          Form und Inhalt aber finden
                                          fiets den wahrfien,
                                          anfprechendfien,
                                          befriedigendfien Ausdruck.
                                          Wir nennen Mozarts Mufik
                                          klafffch. (Heinrich Sattler,
                                          1856)

Jefus            371,000                  See text and << Jefu meine
                                          Freude >> entry below for
                                          examples.

La meme chofe    322,000                  Mon guieu [Mon sieur ?],
                                          Piarrot [Pierrot], tu mj
                                          vient toujou dire la meme
                                          chofe. PIERROT. Je te dis
                                          toujou la meme chofe, parce
                                          c'eft toujou la meme chofe,
                                          & fi ce n'etoit pas toujou
                                          la meme chofe, je ne te
                                          dirois pas toujou la meme
                                          chofe. (6)

Maeftro di       292,000                  See main text.
Mufica

Efpagna          119,000                  <<Tout le monde fait la
                                          fortune immenfe que
                                          Farinelli a faite en
                                          Efpagne>> .

                                          "... fymphoniej dediee a Mgr
                                          le Comte de Noailles, Grand
                                          d'Efpagne...." (7).

maeftro          88,400                   See main text.

Mufic            29,500 [vs "music"       << Triju'gum (i. iff old
                 in Google search:        recordi) The junfdiftion
                 15,100,000]              [jurisdiction] of three
                                          hundreds. TRILATERAL (ad<.
                                          from tbt Lat. tres tbrety
                                          and latus a fidt) Having
                                          three fides. Trilat'eralnels
                                          [s. from trilateral) Tbt
                                          quality of having three
                                          fides. Scott. TRILETTO (I-
                                          in mufic) A fhort trill.
                                          [Consecutive entries (run
                                          together) from John Ash: The
                                          New and Complete Dictionary
                                          of the English Language
                                          (1775), unpaginated.] (8)

Meifter          23,800                   See main text.

Univerfidad      14,800                   << hizo a efte Colegio
                                          Mayor, ya la Univerfidad
                                          para las Cathedras,
                                          defpuesde agre- gar a cftas
                                          el Beneficio de Yecla; no
                                          folo por los tavo- res, que
                                          con tanta bizarria hizo
                                          aquantos individuos de eftos
                                          dos Iluftres Cuerpos a S.
                                          Erna- acudieron; fino porque
                                          de nueftra Univerfidad fue S
                                          ... >>. (9)

<<Jefu meine     5,960                    Jefu. meine Freude_ll.
Freude>>                                  306. 322. 323. Jefu komm.
                                          mein Trofl und Lachen--ll.
                                          480, 552. Jefu. Kraft der
                                          bloden Herzen--ll. 514. 515.
                                          M-V. 11. Nr. 185. Jefu
                                          Kreuz. Leiden und Pein--l.
                                          502. M-V. l. Ne. 156. Jefu
                                          Leiden. Pein und Tod--l.
                                          122.--lll. (10)

La Mufique       1,297 [gallica.bnf.fr]   Summary: Found in
                                          bibliographies, periodicals
                                          (Le Mercure galant),
                                          commentaries on art
                                          (Vasari), military
                                          endeavors, an edition of
                                          Rousseau's letters with a
                                          response by Madame de Stael,
                                          et al.

"Ma maitreffe"   63 [gallica.bnf.fr]      "Extrait 1: et Rude aux
                                          voleurs doux a l'amant 1 >>
                                          J'aboyais & tailais careile
                                          >> Ainfi j'ai fu diverfement
                                          u Servir mon maitre 8c ma
                                          maitreffe >>.' Sonnet de la
                                          belle Matineufe. (11)
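
The f-for-s pattern in Table 1 is regular enough that the misread spelling can be generated from the modern one, so both forms can be submitted in a search. The following is a minimal sketch in Python (3.9 or later), not a description of any search engine's behavior. The function names `long_s_variant` and `search_variants` are hypothetical, and the rule rests on a simplifying assumption: every long s in non-final position was read as f, while a word-final s kept its round form. Ligatures and the other miscellaneous errors in the table are not modeled.

```python
import re

# Assumption: the scanner read every long s as "f". The long s appeared at the
# start or in the middle of a word, while a word-final s kept its round form,
# so only a lower-case s that is followed by another word character is rewritten.
_NON_FINAL_S = re.compile(r"s(?=\w)")


def long_s_variant(term: str) -> str:
    """Return the spelling a scan would show if each long s became f."""
    return _NON_FINAL_S.sub("f", term)


def search_variants(term: str) -> list[str]:
    """Return the modern spelling, followed by its f-for-s variant if it differs."""
    variant = long_s_variant(term)
    return [term] if variant == term else [term, variant]


if __name__ == "__main__":
    for modern in ["Musik", "Jesus", "Maestro di Musica", "Universidad"]:
        print(search_variants(modern))
```

Applied to the modern spellings, the sketch reproduces several search terms of Table 1: "Musik" yields "Mufik", "Jesus" yields "Jefus", and "Maestro di Musica" yields "Maeftro di Mufica".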

TABLE 2 Original appearance of the letters f and s plus selected
ligatures in books printed before 1800. Exact details varied by a
letter's position in a word, by font, and by publisher. A
transliteration and brief indications of year, place of publication,
and title are given.

Year   Language       Focus of image         Image
       (place of
       publication)

1668   Latin          ae ligature            Grzcas
       (Vienna)       Lower-case s           hellefponto
                      ct ligature            cuncta
                      Lower-case f,          frigidiflimus
                      lower-case s (x3)
1606   Italian        Lower-case s (x3)      fteffa
       (Venice)       Lower-case f           fedelifs

1614   Italian        Lower-case s           Caftello
       (Rome)         Upper-case F           Febbraro
                      Upper-case S           Signore
1708   English        Lower-case s           laft
       (London)       Upper-case F,          Froft,
                      lower-case s
1762   French         ff ligature            difficulte
       (Paris)        ss ligature            expreffior

1601   French         Lower-case f           fait
       (Evreux)       st ligature            Requefte

1740   Spanish        st ligature            Mageftad
       (Madrid)       ss [in successive
                      syllables]             afsiften
                      ss [in one syllable]   Miffas
1600   English        fl and ct ligatures    afflict
       (London)       ss ligature            diffembled
                      fl and sh ligatures    flourisht
                      st ligature            fubsflance

Year   Transliteration   Source

1668   Graecas           Historiae Alexandri
       Hellesponto       Magni ...
       cuncta
       frigidissimus

1606   stessa            Preparations
       fedeliss.         dell'anima alla
       [=fedelissimo]    divina gratia
1614   Castello          Lettera annua dal
       Febbraro          Giappone del 1614
       Signore
1708   last              The British Apollo, or,
       Frost             Curious amusements
                         for the ingenious ...

1762   difficulte        Journal ecclesiastique
       expression        ou Bibliotheque
                         raisonnee, vii/3
1601   fait              Actes de la
       Requeste          conference tenue
                         entre le sieur
                         Evesque d'Evreux ...
1740   Magestad          Coleccion de los
       assisten          tratados de paz ...
       Missas            Part II
1600   afflict           Titus Andronicus
       dissembled        partly by William
       flourisht         Shakespeare: The
       substance         First Quarto

TABLE 3 Permitted operators in selected text-search environments.

                     Google:                  Bing Query
                     Advanced Search;         Language; MS
                     Developer                Fast Query
                                              Language

Logical              Yes                      Yes
(Boolean)
operators                                     (alt OR = "|")
(AND, OR,
NOT)
String               Limited                  No?
operators            (emphasis
(BETWEEN,            on titles)
IN, NOT IN)
Proximity            Shows context            Yes
(NEAR)               but without
                     controls
Grammatical          No                       Selective
operators (for
punctuation marks)
Search by date,      Yes                      Yes
date range
Search by filetype   Yes                      Yes
Search in URL        Yes                      Yes
Wild card in         Yes but                  Yes (weak
search string        "removes some results"   results)
Language filter      Yes                      Yes

                     DuckDuckGo                    Structured
                     (private web                  Query Language
                     search)                       (SQL database
                                                   search)

Logical              Yes, plus                     Yes
(Boolean)            include/exclude
operators            commands
(AND, OR,
NOT)
String               A few (e.g.                   Yes, plus
operators            CONTAINS)                     additional
(BETWEEN,                                          ones
IN, NOT IN)
Proximity            No?                           Equivalent
(NEAR)

Grammatical          Yes                           Yes
operators (for
punctuation marks)
Search by date,      Yes                           De facto
date range
Search by filetype   Indirectly                    De facto
Search in URL        Indirectly                    Not relevant
Wild card in         Yes                           Yes
search string
Language filter      By changing user's "region"   Yes

                     Yandex       Yahoo
                     Advanced     Advanced
                     Search       Search

Logical              Yes          Partial: AND,
(Boolean)                         OR [not = "-"
operators                         followed by
(AND, OR,                         term to be
NOT)                              excluded]
String               Equivalent   No?
operators
(BETWEEN,
IN, NOT IN)
Proximity            Yes          No?
(NEAR)

Grammatical          Yes          No
operators (for
punctuation marks)
Search by date,      Yes          For email
date range
Search by filetype                Yes
Search in URL        Yes          Yes
Wild card in         Yes          Yes (weak
search string                     results)
Language filter      Yes          Yes
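
Two of the capabilities that Table 3 attributes to SQL, logical operators and wild cards in the search string, can be combined to retrieve modern and misread spellings in one query. The sketch below uses Python's standard `sqlite3` module with an in-memory database. The `passages` table, its four rows, and the function name `find_spellings` are hypothetical and serve only as illustration; SQLite's `LIKE` treats `_` as any single character and `%` as any run of characters, and ignores case for ASCII letters by default.

```python
import sqlite3

# Hypothetical miniature corpus; the spellings echo examples from Table 1.
PASSAGES = [
    (1, "Wir nennen Mozarts Mufik klassisch"),
    (2, "Mozarts Musik ist klassisch"),
    (3, "in mufic a fhort trill"),
    (4, "a treatise on painting"),
]


def find_spellings(patterns):
    """Return ids of passages that match any of the LIKE patterns (joined by OR)."""
    patterns = list(patterns)
    if not patterns:
        return []
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE passages (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO passages VALUES (?, ?)", PASSAGES)
    # One "body LIKE ?" clause per pattern; the patterns themselves are bound
    # as parameters rather than pasted into the SQL text.
    where = " OR ".join("body LIKE ?" for _ in patterns)
    rows = conn.execute(
        f"SELECT id FROM passages WHERE {where} ORDER BY id", patterns
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    # "_" stands for the doubtful letter, so "Musik", "Mufik", and "mufic" all match.
    print(find_spellings(["%mu_ik%", "%mu_ic%"]))  # [1, 2, 3]
    # The modern spelling alone misses the misread passages.
    print(find_spellings(["%musik%"]))  # [2]
```

The single-character wild card is deliberately loose: it would also accept any other letter in that position, so results from a real collection would still need to be filtered by language and typography.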

COPYRIGHT 2015 International Association of Music Libraries, Archives and Documentation Centres
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2015 Gale, Cengage Learning. All rights reserved.

Article Details
Author:Selfridge-Field, Eleanor
Publication:Fontes Artis Musicae
Article Type:Report
Date:Apr 1, 2015
Words:6505