Humanity concurs more or less universally, regardless of language, that speech involves a succession of discrete articulations that fall consecutively upon one another in the course of any meaningful utterance. This intuition has a strong basis in reality, and reflects our shared instinct for categorization. The IPA does nothing to discourage this approach to our understanding of how language operates. Its dissection of the continuity of speech into segments through symbols disguises a much more complex reality, by implying that changes in articulation are instantaneous and absolute, and by isolating a few salient phonetic features to the exclusion of many others. IPA symbols imply a succession of static articulative moments, shifting suddenly from one articulation to the next, like a slide show. As Ball and Rahilly state, "Speech is a dynamic process, but phonetic symbolization treats it as if it were static." (1) This simplification of reality appears to be a necessity in any attempt at linguistic analysis, and ironically may reflect the limitations of language itself. Generative phonologies, which address the feature limits of traditional vowel and consonant charts and expand them, also resort to a binary (plus/minus) organization of their features. Attempts have been made in recent decades to bridge the phonetic silos created by segmental analysis, through the use of diagrams that demonstrate the behavior of parameters other than those implicit in IPA symbols themselves. In such charts, the behavior of the velum, palate, tongue, and lips can be traced through the time continuum. In this way, the multilayered influences, or accommodations, that an articulation exerts upon those that surround it can be more readily seen. (2) However, the more fully such charts reflect the dynamic process of speech, the more cumbersome they become.

Colloquial speech typically involves a flow of articulations at a rate of between 10 and 20 per second. The familiarity of the process is so well entrenched in our reflex speech habits that we are unaware of its complexity. Some articulations cannot be reiterated at rates faster than 4 or 5 per second, and others are intrinsically difficult at a fast rate--tongue-twisters are evidence of this. (3) It is therefore not surprising that speakers find shortcuts, or paths of least resistance, when proceeding from one articulation to the next. Frequently, a segment will not have the opportunity to arrive fully at the target position (or "citation form," as it is called) of a vowel or consonant, as implied by its IPA symbol. (4) The more fortis (requiring more energetic tensing) an articulation is, the more likely it is to accommodate itself to surrounding sounds. The degree to which an articulation may be altered is inversely proportional to the probability of its being confused with another phoneme, thereby compromising the intelligibility of the thought or changing the meaning of the word. For instance, Spanish /s/ readily varies from [s] to [ʃ] according to environment, because /ʃ/ does not exist as a separate phoneme in Spanish. Other languages that have both /s/ and /ʃ/ exhibit much less variation, because of such intelligibility limits. The marked tendency of Quebecois French to add a sibilant to onset consonants ("du" [dᶻy], "petit Yvonne" [ptˢitˢivon])--a process known as assibilation--was free to develop without loss of intelligibility, because there is no /d/-/dᶻ/ or /t/-/tˢ/ contrast in French. Similarly, the centralization of /i/ and /u/ to [ɪ] and [ʊ] in some environments in Quebecois French ("midi" [mɪ.dᶻi]) is sometimes ascribed to the influence of English, but can also be defended on purely phonological grounds: there is no /i/-/ɪ/ or /u/-/ʊ/ contrast in French to muddy the waters.

Of the various forms of accommodation in language, the most prevalent is that known as assimilation. The reader may be familiar with the term, if only through examples frequently encountered in textbooks on lyric diction. In French, for instance, we are told that assimilation may only occur under very specific conditions, as in "heureux" and "les baisers," and not elsewhere, at least in traditional formal style. In German, one occasionally finds reference to the protean quality of the schwa, depending on context. Are the [ə]s in "Mutter" and "Vater" truly identical, or do they diverge slightly because of the differing placement of the preceding vowel? Italian opera has a long-established tradition of aggiustamento, in which some vowels open more than in speech, based on traditions of effective projection and elocution, rather than on strict phonological dicta. Colorni and Castel are two authors who unabashedly advocate these performing traditions in their recommendations. (5) Such aggiustamenti are forms of assimilation, based upon the transformation from speech to singing, rather than within the spoken form itself.

Assimilation is sometimes confused with another linguistic term--coarticulation--largely because coarticulation has sometimes been employed in linguistic writing as more or less equivalent to assimilation. (6) Coarticulation is arguably more useful when limited to its more prevalent modern usage: an articulation involving simultaneous constriction of the vocal tract at more than one distinct point. Thus, the consonants [[TEXT NOT REPRODUCIBLE IN ASCII]], [[TEXT NOT REPRODUCIBLE IN ASCII]], and [[TEXT NOT REPRODUCIBLE IN ASCII]] of some West African languages are examples of coarticulation in this strict sense. The IPA indicates this with a top tie bar, as in the examples, and also uses that symbol for the familiar affricates of European languages. (7) Some authorities include allophonic adjustments of a phoneme, as in [ʃʷ] ("shoe"), in this category, as "secondary coarticulations." (8)

It is understandable if singers and teachers associate the term assimilation with the notion of a special context, encountered only on occasion, that requires particular treatment, as in French. The reality, however, is that all speakers naturally engage in a continuous flow of articulations, in any language, that are constantly involved in processes of adjustment and accommodation, each articulation influenced by, and influencing, its immediate and sometimes not so immediate neighbors. More than any other factor, assimilation accounts for the prevalence of allophonic variation found in languages, usually in complementary distribution. Far from being a process of degradation of the "proper" pronunciation of a language, these processes are indispensable to the flow of speech at the tempo of typical colloquial usage, and occur in all languages to a greater or lesser extent. The tendency to "short-cut" target articulations in everyday speech is sometimes referred to as "speech economy," and can take several forms, only one of which is assimilation proper.

Assimilation typically assumes a number of common forms. Six of them are described here, the first two involving vowels, and the remainder pertaining to consonants.

Vowel Accommodation

This is the type of assimilation familiar from French and known as vowel harmonization, in which a vowel is modified to conform to the vowel in the ensuing syllable: "heureux" [[TEXT NOT REPRODUCIBLE IN ASCII]], "les baisers" []. This is a form of regressive assimilation, in which the sound of the latter vowel influences the former.

In English, vowel accommodation can take many forms, with adjacent or nearby articulations influencing one another in complex, subtle ways. One example must suffice for our purposes. If one compares the [æ] in "back" and "bag," the two may feel like similar vowels that differ slightly in some respect. Or they may feel identical, depending on the reader's dialect and individual speech habits. The former, "back," may be slightly more open, leaning toward [a], and the latter, "bag," slightly more closed, leaning toward [ɛ] but certainly much closer to [æ]. The only way to indicate such minutiae in IPA is with diacritics, [[TEXT NOT REPRODUCIBLE IN ASCII]] for more open, or "lowered," and [[TEXT NOT REPRODUCIBLE IN ASCII]] for more close, or "raised." But these are not commonly encountered diacritics for vowels in the literature of lyric diction. Many speakers will sense that the vowel of "back" is pure, but that of "bag," being long, engages in a slight diphthong, roughly [[TEXT NOT REPRODUCIBLE IN ASCII]], that lies below the threshold of standard English transcription practice. It is difficult to account for these subtle differences and formulate phonological rules that govern them on the basis of environment alone. Each vowel is framed by similar or identical consonants, the only difference being the voicing of the final consonant. And it is difficult to see why such voicing should trigger a diphthong or other vowel variation in such ways, purely in terms of articulation and "speech economy." (9)

Other examples of English vowel accommodation include unstressed schwas replaced by syllabic consonants ([sʌd[TEXT NOT REPRODUCIBLE IN ASCII]]), and the apostrophe contractions ("can't," "wouldn't"), but these more accurately fall under the rubric of elision than assimilation.

An accommodation similar to assimilation is centralization, which affects vowels. The tendency for syllabic vowels in most unstressed syllables in English to centralize to some form of schwa is a marked characteristic of all dialects of the language--a characteristic it shares with German. A centralized front vowel is said to be "retracted," and a back vowel "advanced." The precise quality of the centralized vowel will vary according to the amount of time allotted to it, the placement of the onset and coda consonants that surround it, the direction of vowel change (in the case of diphthongs and triphthongs), and the vowels of the adjacent syllables. Unstressed schwas in English can partially assimilate to the vowel color of an ensuing stressed syllable, as in "alas" and "attack," where initial [ə] tends toward [æ] with some speakers--or at least to English [[TEXT NOT REPRODUCIBLE IN ASCII]], which shares the openness of [æ].

A further accommodation is neutralization, or the merging of two distinct articulations in a particular environment. Many languages feature neutralized vowels. Russian, for instance, has /a, e, i, o, u/, but forgoes the /i/-/e/ and the /o/-/a/ contrasts in unstressed syllables, employing only [i] and [a].

The characteristic vowel harmony of many Asian languages, in which the presence of one vowel places constraints on the kind of vowels that may appear in proximity with it, is not a feature of English, or most European languages.


An example of anticipatory assimilation in English is the tendency to nasalize vowels, and even consonants, before a nasal consonant. We shall begin with an exercise. While speaking, sustain an [l] and try to determine whether the sound is oral, nasal, or partly both. A fully oral [l] can be produced with one's nostrils fully closed. In fact, it is impossible to produce a purely nasal [l], because lip parting is an imperative for that consonant. A "purely nasal [l]" is actually an [m]. But it is very easy, and quite usual, to produce a mixed oral/nasal lateral, particularly before a nasal vowel, as in French. In a word such as English "lamb" [læm], only the final consonant is "officially" nasal, but note that the vowel typically takes on a nasal quality [[TEXT NOT REPRODUCIBLE IN ASCII]] in anticipation of the ensuing [m]. With a vowel at least partly nasal, the initial [l] can follow suit, resulting in a double anticipatory assimilation. In fact, partial nasalization of sounds that are thought to be oral is very common, particularly in American English. The "excessive" nasality of American English is one of the telltale indicators of dialect to British English speakers. Of course, this simply means that nasal assimilation is less pronounced (although definitely present) in most forms of British English. Thus, if a singer has a tendency to "sound nasal" too much of the time, the culprit again could be a linguistic reflex that transfers naturally over to singing, rather than singers' preferences for their own personal sound quality as they perceive it.


Palatalization occurs at the allophonic level in English, which means that native speakers are unaware of its existence until the precise articulations are analyzed. Both anterior and posterior palatalization can occur. Velar plosives will palatalize before front vowels, while alveolar consonants will palatalize before back vowels.

The following examples illustrate:

     coop  [kup]
     cop   [k[??]p]
     cap   [k̟æp]
     kip   [k̟ɪp]
     keep  [k̟ip]

The phrase "keep calm" readily illustrates this assimilation of the initial /k/. The IPA subscript plus sign diacritic indicates "advanced," or in this instance, palatalized.

Posterior palatalization often occurs between words, as in "nice shoes," in which [naɪs ʃuz] assimilates to [naɪʃʃuz]--and [z] may do the same, as in "his shoes" [hɪʃʃuz]. The same change occurred in historical English, in the change from [nɛ[??]n] to [nɛɪ.ʃ[??]n] in the 18th century. Linguists often refer to assimilation processes in the diachronic investigation of language change.

While the anterior/posterior changes described appear to be mirror images of one another, there is an important difference, to be seen in the IPA symbols. One can represent the pattern of English articulation as follows:

   English /k/ → [k] before a back vowel
                 [k̟] before a front vowel
   English /s/ → [ʃ] before a palatal consonant
                 [s] elsewhere

The allophones of /k/ do not encroach upon the citation form of a different phoneme of English, as does /s/ before a palatal consonant. There is no phoneme /[TEXT NOT REPRODUCIBLE IN ASCII]/ in English, as there is /ʃ/. In the example, this is not a problem for intelligibility, as [naɪʃ] and [hɪʃ] are not in the lexicon of the language, ready to confound the utterance. This is not always the case, however, and such assimilations may need to be avoided in lyric diction, especially when the lyrics unfold slowly.
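The two allophone patterns charted above can be restated as a small lookup, offered here only as a sketch. The function name, the set of front vowels, and the set of palatal consonants are assumptions chosen to fit the examples in the text; they are not a complete inventory of English.

```python
# Illustrative sketch only. The two sets below are assumptions chosen to
# match the examples in the text, not a full inventory of English.
FRONT_VOWELS = {"i", "ɪ", "e", "ɛ", "æ"}
PALATAL_CONSONANTS = {"ʃ", "ʒ", "j"}
ADVANCED = "\u031f"  # IPA subscript plus sign ("advanced")


def allophone(phoneme, next_segment):
    """Return the allophone of /k/ or /s/ conditioned by the following segment."""
    if phoneme == "k":
        # /k/ is advanced (palatalized) before a front vowel, plain otherwise.
        return "k" + ADVANCED if next_segment in FRONT_VOWELS else "k"
    if phoneme == "s":
        # /s/ becomes [ʃ] before a palatal consonant, and stays [s] elsewhere.
        return "ʃ" if next_segment in PALATAL_CONSONANTS else "s"
    # Any other phoneme is returned unchanged in this sketch.
    return phoneme


if __name__ == "__main__":
    for word, first, following in [("coop", "k", "u"), ("keep", "k", "i"),
                                   ("nice shoes", "s", "ʃ")]:
        print(word, "->", allophone(first, following))
```

Note that the second rule returns a symbol that is also a phoneme of English in its own right, which is the asymmetry the paragraph above draws attention to.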

A common accommodation in English (although not strictly an assimilation) is phrasal yod-dropping, as in "want you" or "beside you." Here [t.j] and [d.j] are regularly replaced in speech by the affricates [tʃ] and [dʒ]--a change generally felt to be too colloquial for lyric purposes, except in popular styles. Segment deletion, or syncope, is a related accommodation of colloquial speech, generally to be avoided lyrically:

   lasts           [læsts], not [læs]
   breathes        [b[??]i[??]z], not [b[??]z]
   clothed         [klou[??]d], not [kloud]
   baths / earth's [bæ[??]θs], [ɜθs], not [bæs], [ɜs]
   asked           [æskt], not [æst]

Other syncopes can occur as well, including vowels:

     picture        [pɪk.tʃ[??]], not [pɪ.tʃ[??]]
     sudden         [sʌ.d[??]n], not [sʌ.dⁿ[??]]


In English, /C/ (a consonantal phoneme) becomes [Cʷ] before a lip-rounded vowel.


     /p/  →  [p]    peel
             [pʷ]   pool
     /ʃ/  →  [ʃ]    chic
             [ʃʷ]   shock (Brit.)
     /f/  →  [f]    feed
             [fʷ]   food

Such consonants are said to be labialized before a lip-rounded vowel. English has no lip-rounded front vowels, but French does. It is important to labialize such French consonants:

   /d/  →  [d]    divin
           [dʷ]   du
   /ʒ/  →  [ʒ]    jamais
           [ʒʷ]   jupe, jour, je
   /f/  →  [f]    fiche, f[??]me
           [fʷ]   feu, fou, fumer, f[??]t, faute

While the labialized consonant is distinct acoustically from the nonlabialized, the primary goal in French is to avoid a rising diphthong on the ensuing vowel, by ensuring that the consonant is fully rounded. An English singer is most likely to fall into this trap in the case of the French mixed vowels [y], [ø], and [œ], because the tongue position for these vowels is forward, and English has no mixed vowels--for example, [djy] instead of [dʷy]. (10) In other words, the articulation reflex in English is to associate front-vowel tongue position with unrounded lips.
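The labialization rule stated above (a consonant takes lip rounding before a lip-rounded vowel) can be sketched as a single conditional. The function name and the list of rounded vowels are assumptions made for illustration; in particular, the list mixes English and French vowels and does not attempt to settle the status of French schwa.

```python
# Illustrative sketch only. The set of lip-rounded vowels is an assumption
# covering the English and French examples in the text.
ROUNDED_VOWELS = {"u", "ʊ", "o", "ɔ", "ɒ", "y", "ø", "œ"}
LABIALIZED = "\u02b7"  # IPA superscript w


def labialize(consonant, next_vowel):
    """Add the labialization diacritic when the following vowel is lip-rounded."""
    if next_vowel in ROUNDED_VOWELS:
        return consonant + LABIALIZED
    return consonant


if __name__ == "__main__":
    for word, consonant, vowel in [("peel", "p", "i"), ("pool", "p", "u"),
                                   ("divin", "d", "i"), ("du", "d", "y")]:
        print(word, "->", labialize(consonant, vowel))
```

The French case differs from the English one only in the vowel inventory: the rounded front vowels are what make the rule unfamiliar to an English-speaking singer.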


Voicing assimilation in English usually takes the form of avoiding consonant clusters of the unvoiced/voiced and voiced/unvoiced type. Thus, the plural morpheme <s> progressively assimilates to the voicing of the previous consonant: cats [kæts], pigs [pɪgz]. (11) Plurals of the form <es> are invariably voiced: ostriches [[??]st[??]ɪtʃ[??]z]. This is an automatic response for anyone with spoken fluency, and is likely to arise as an issue in the voice studio only with ESL singers.
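The voicing agreement of the plural morpheme described above lends itself to a short sketch. The function name and the two consonant sets are assumptions for illustration, and the vowel written in the <es> ending is likewise an assumption, since its exact quality varies by speaker and transcription practice.

```python
# Illustrative sketch only. The consonant sets are simplified assumptions,
# not a complete inventory of English.
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}


def plural_suffix(final_sound):
    """Return the spoken form of the plural ending after a given final sound."""
    if final_sound in SIBILANTS:
        # The <es> form: always voiced. The vowel symbol is an assumption.
        return "ɪz"
    if final_sound in VOICELESS:
        # Unvoiced final consonant: the ending stays unvoiced.
        return "s"
    # Voiced consonants and vowels: the ending is voiced.
    return "z"


if __name__ == "__main__":
    for word, final in [("cats", "t"), ("pigs", "g"), ("ostriches", "tʃ")]:
        print(word, "->", plural_suffix(final))
```

The sibilant check comes first because [s] and [ʃ] are themselves unvoiced, yet they take the voiced <es> ending rather than a plain [s].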

A few words exist, however, in which voiced and unvoiced alternate in free variation. Perhaps the most common example is "with." As early as 1953, Kenyon and Knott state that "there is no consistent general practice," and that "[wɪθ] is clearly not substandard." (12) The unvoiced final [θ] is not RP, but many areas of Britain, particularly the north of England, use only [θ] in this word. The choice is at least partly conditioned by context: [θ] is more likely when followed by an unvoiced consonant ("with care"). In lyric diction, it seems prudent to voice in most, if not all contexts, given the greater sonority of [ð]. (13) The lyric "advantage" of voicing, however, does not always hold true. The frequent voicing of medial /t/ in American speech is not considered standard for singing, although there are arguably contexts and genres in which it should at least be considered. There are many minimal pairs:

   [t]/[d]   bitter / coating / heating / latter / otter /
             putting / traitor
             bidder / coding / heeding / ladder / odder /
             pudding / trader

Note that this tendency has yet to appear in standard speech for other unvoiced plosives:

   [p]/[b]   rapid / simple / staple
             rabid / symbol / stable
   [k]/[g]   bicker / decree / hackle / preclude
             bigger / degree / haggle / preglued

In the case of the voicing of normally unvoiced /t/ in American English, the accommodation is not related to the preceding or ensuing articulation or syllable, but to the fortis ([t])--lenis ([d]) distinction. In some cases, the auditor must infer the meaning from context, rather than from the utterance per se. For instance, "bitter" and "bidder" are homophonic in SAE, but this will rarely be problematic for comprehension, because one is an adjective and the other a noun, and grammatical context prevails. Generally, colloquial speech tends to reduce a speech pattern to its most convenient form, stopping just short of the point where intelligibility will be compromised. The voicing accommodation of plosives in SAE applies only to the /t/, and only in onset position in unstressed syllables. "Attack" will never reduce to [[TEXT NOT REPRODUCIBLE IN ASCII]], nor "rapid" to [[TEXT NOT REPRODUCIBLE IN ASCII]] or "locker" to [[TEXT NOT REPRODUCIBLE IN ASCII]]. Although voicing in "rapid" and "locker" would result in different words, the rule applies even when this is not the case ("upper," "maker"), just as /t/ will voice even if ambiguity is absent ("flatter"). This implies that the convenience is related more to speech economy than to semantic clarity. The rule is strong, such that SAE allows [[TEXT NOT REPRODUCIBLE IN ASCII]] for two very different words that differ in only one articulation in their British pronunciations. English continues to evolve in terms of its socially condoned assimilations. The voicing of unvoiced plosives has been standard for many decades in North America, but the total loss of such plosives in words like "interact" [i.naOaekt] is a quite recent colloquial syncope, and found primarily in speakers under age 40. School-age children and young adults tend to be the most experimental and bold in assimilation speech habits, so that a generational gap in auditor comprehension can result.

Physiologically, the vocal folds must be close together to produce a voiced sound, and wide apart for unvoiced sounds. This process, as it unfolds in a word, phrase, or sentence, is not always in total synchrony with other articulators functioning simultaneously. Voicing assimilation thus often involves "bleed" of the voicing features across the other activities, especially in consonant clusters. Thus, close phonetic transcriptions of English may transcribe a segment as changing voicing through its duration. Usually this involves delaying the voicing until part way through the consonant. Singing, by its nature, will want to minimize this characteristic of speech by proceeding as soon as possible to voicing.

Voicing assimilation in French consonants occurs in two common contexts:

1) when a digraph includes both voiced and unvoiced consonants. This is almost always in the case of <b> + consonant, as in "obtenir," "subtil," "absent," where /b/ becomes [p].

2) when an unvoiced intervocalic <x> or <c> assimilates to [gz] and [g] respectively, as in "exemple," "exile," and "seconde." The assimilation of <x> is standard, but intervocalic <c> usually remains [k], except in the example cited [[TEXT NOT REPRODUCIBLE IN ASCII]].

Place and Manner of Articulation

In Italian, a few consonant digraphs present at morpheme boundaries have become either nonstandard or forbidden phonotactically. The latter category includes /nm/, /np/, and /nb/, which are realized as [mm], [mp], and [mb], in keeping with the bilabial second consonant:

   San Marco   /nm/   [mm]
   un poco     /np/   [mp]
   un bacio    /nb/   [mb]

Thus, [m] can be an allophone of /n/. These assimilations are obligatory in standard Italian, while the following three are optional, albeit frequently encountered:

   <bd>  /bd/  [dd]  abdicare
   <bn>  /bn/  [nn]  abnorme
   <gm>  /gm/  [mm]  pragmatic

A reasonable strategy for stage singing might be to observe the <n> assimilations, but ignore the latter three except in rapid enunciation. Of the optional ones, <bd> is the most likely for lyric diction.
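The Italian patterns above, together with the suggested strategy of reserving the optional assimilations for rapid enunciation, can be sketched as follows. The function name and the `rapid` flag are hypothetical devices for illustration; they model the recommendation in the text and are not a rule of Italian phonology in themselves.

```python
# Illustrative sketch only. The names and the "rapid" switch are assumptions
# used to model the strategy suggested in the text.
BILABIALS = {"m", "p", "b"}
OPTIONAL = {("b", "d"): "dd", ("b", "n"): "nn", ("g", "m"): "mm"}


def assimilate(first, second, rapid=False):
    """Return the realization of a two-consonant sequence at a morpheme boundary."""
    if first == "n" and second in BILABIALS:
        # Obligatory: /n/ is realized as [m] before a bilabial consonant.
        return "m" + second
    if rapid and (first, second) in OPTIONAL:
        # Optional: applied here only in rapid enunciation.
        return OPTIONAL[(first, second)]
    # Otherwise the sequence is left as written.
    return first + second


if __name__ == "__main__":
    print("un poco:", assimilate("n", "p"))
    print("abdicare (deliberate):", assimilate("b", "d"))
    print("abdicare (rapid):", assimilate("b", "d", rapid=True))
```

A finer-grained version could give each optional pair its own threshold, since the text singles out <bd> as the most likely of the three in lyric diction.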

The comparable situation in English involves the prefixes <in->, <im->, <ir->, and <il->. The nasal assimilation is usually reflected in the orthography.

    [ɪm]   imperfect
    [ɪn]   indifferent
    [ɪŋ]   ingredient

Such assimilations are termed homorganic, or involving the same place of articulation (bilabial, alveolar, and velar, respectively, in the examples). Homorganic assimilation also applies to the other negation prefixes, as in "irresponsible" and "illegal," but does not apply to <un->, as it does in Italian. Compare the above examples with the following, of which only <und-> is homorganic:

    [ʌn]  unpublishe

It will be seen from the examples in the six categories above that assimilation is a phonological, not a phonetic construct. That is, assimilation processes are language-specific, and usually cannot be assumed to exist in a given language by analogy with another, perhaps more familiar language. The varying inventories of phonemes and allophones from language to language are a direct result of the differing assimilation rules that apply to each.


When attempting to capture in symbols the subtleties of colloquial conversation, linguists will employ a narrow phonetic transcription in an attempt to capture many of the allophonic features of the utterance, and will often transcribe by phrase, rather than word-for-word--a practice that readers of lyric diction literature may be largely unfamiliar with. In these respects, the IPA transcription of English text can appear quite disarming to one habituated to the word-for-word, carefully deliberate transcription usually encountered in the texts of songs and arias.

Consider, for instance, the phrase "What are you saying?" If embedded within a musical text, we might expect its word-for-word transcription to look something like [[TEXT NOT REPRODUCIBLE IN ASCII]]. A linguist might have occasion to transcribe the same sentence in colloquial utterance as [[TEXT NOT REPRODUCIBLE IN ASCII]], [[TEXT NOT REPRODUCIBLE IN ASCII]], [[TEXT NOT REPRODUCIBLE IN ASCII]], or [[TEXT NOT REPRODUCIBLE IN ASCII]], depending on the speaker and the moment. Authors too have frequently attempted to capture such offhand informality, whether the result of dialect, colloquialism, or both, in the scripts of novels, poetry, and other literary genres. The examples above might be rendered, for instance, as "Whatcha say'n?," or "Whadd'r ya sayin'?" Examples in sung texts may immediately come to the reader's mind, particularly in stage genres and folksong settings, where the attempt is to capture either a specific dialect or "street talk." If such examples provide a challenge to both performers and coaches, it is not necessarily simply a result of limited acquaintance with the dialect in question. The attempt itself is fundamentally compromised by the limited number of orthographic symbols at the writer's disposal. Any attempt to capture the subtleties of dialect and colloquialism thus is doomed at some level from the outset, unless one is prepared to render the text (and read it) in a very detailed narrow IPA transcription. (14) The challenges of such orthographic anomalies are only compounded in unfamiliar languages, where even a strong familiarity with a standard pronunciation ill equips one to deal with the niceties of inflection characteristic of nonstandard dialects. Anyone who has tried to emulate a Scots dialect in speech, if only for fun, will appreciate how elusive the precise, authentic flow of an accent can be. Even assiduous practice, and the illusion of mastery, can easily result in embarrassment and even ridicule if seriously attempted in front of those who speak a dialect authentically.

The pronunciation of a phrase in speech naturally is also conditioned by the immediate context within which it is uttered. In a more formal context, or in the midst of a heated argument, one might indeed declaim [[TEXT NOT REPRODUCIBLE IN ASCII]], or even [[TEXT NOT REPRODUCIBLE IN ASCII]]. But such a degree of textual precision in speech, at the expense of phrasal flow, would seem ludicrously formal in all other situations. Speakers intuitively adjust their speech patterns to match the social situation (formal/informal, light-hearted/intense, etc.), and the examples given are not necessarily those of different speakers.

The level of importance that assimilation plays in the differentiation of careful from colloquial speech can be seen in the examples. The tendency toward centralization of vowels in rapid American English speech is particularly reflected in [[TEXT NOT REPRODUCIBLE IN ASCII]], as is the tendency for onset unvoiced plosives to be voiced. Both these accommodations are particularly common in unstressed syllables. Such assimilations, however, are language-specific, and many are dialect-specific. For example, the retention of pure vowel colors in unstressed syllables in French and Italian must be rehearsed and committed to reflex instinct by English singers. The singer's obligation to rid speech habits of regionally specific markers is in large part a process of identifying and removing telltale assimilations characteristic of nonstandard dialect. Discussions of "good English" lyric diction are often largely or exclusively concerned with the elimination of the assimilation practices of colloquial speech.

Not all accommodations of target articulations in everyday speech can be called assimilation, as we have seen. At times articulations are deleted or intruded to facilitate a "path of least resistance" to the intended result, and such modifications are not always obviously related to the articulations of the surrounding sounds. The centralization of vowels in unstressed syllables (the so-called "weak" form) so characteristic of all English dialects qualifies as assimilation only insofar as it facilitates the articulation of surrounding consonants in rapid speech. The inherent latitude in placement of such weak forms, dependent as it is on phonetic environment, speed of utterance, and individual speaker, creates difficulties in transcribing such words. This explains why phoneticians employ varying symbols to try to capture what is in reality a locus of possibilities, like the schwa. For example, the word "are" is variously transcribed as [[TEXT NOT REPRODUCIBLE IN ASCII]], [[TEXT NOT REPRODUCIBLE IN ASCII]], [[TEXT NOT REPRODUCIBLE IN ASCII]], [ar], and [aʳ] in its strong form, and [[TEXT NOT REPRODUCIBLE IN ASCII]], [[TEXT NOT REPRODUCIBLE IN ASCII]], [[TEXT NOT REPRODUCIBLE IN ASCII]], [[TEXT NOT REPRODUCIBLE IN ASCII]], and [[TEXT NOT REPRODUCIBLE IN ASCII]] in its weak form. (15)

The vowel and consonant charts familiar from the literature on lyric diction specify the most important or overt features of an articulation, and only those. The standard vowel quadrilateral visually specifies high/low and front/back only, and one must understand that two quite different vowels will be positioned identically on such a figure. The feature rounded/unrounded cannot be visualized on such a chart. Similarly, consonants are typically categorized in a tripartite fashion: place of articulation, manner of articulation, and voicing. The fact that other features are ignored reflects the reality that alterations therein do not often result in substantial perceptual and auditory changes. For instance, we know that [m] is a voiced bilabial nasal--which description tells us nothing about what the tongue or jaw may be doing. The tongue will typically be more forward in "me" than in "mow," in anticipation of the ensuing vowel, and the oral cavity is likely to have more internal space for "mow." This is a further example of anticipatory assimilation, since the [m] changes according to the ensuing vowel. Thus, there are two allophones, or conditioned variants, of /m/ in English, just as /t/ can be aspirated [tʰ], unaspirated [t], or unreleased [[TEXT NOT REPRODUCIBLE IN ASCII]], depending on its environment. If one tries to say "me" with the [m] in the position for "mow," one becomes aware that not all /m/s are created equal in English. It is in the nature of language acquisition and usage that these changes operate on a subconscious level, and must be pointed out before they are recognized to exist.

What relevance does this have for singing? In many cases, as in the /m/ allophones above, the spectrographic pattern of each reveals that they are acoustically, as well as articulatively, distinct sounds. A careful comparison of "me" and "mow" will reveal that the two consonants, when sustained, in fact do not sound alike, just like the word-initial /k/s discussed earlier. One might think that such matters will take care of themselves by simply relying on reflex instincts. To this, three things must be said. First, not all native speakers develop identical instincts in terms of the degree of assimilation of adjacent sounds, even within the same dialect. And differing dialects do not just depart from one another at the level of the individual articulation, but also at the suprasegmental level where assimilation takes effect. Second, inbred assimilation habits in speech will inevitably be altered by the singing process, with its manifold differences from speech, such that relying on reflex becomes a much more hit-or-miss affair. Third, for the many singers whose first language is not English, a very different set of assimilative instincts will naturally be transferred over from the first language, in both speech and singing. German singers must of course learn to voice syllable-final voiced plosives in English ("gab" [gæb]), just as English singers must learn to devoice them when singing in German ("gab" [ga[??]]). But that is simply a change at the segmental level, not a difference of assimilation habits. One cannot assume that all languages possess a rule for /m/ allophones identical to that of English. Some may naturally employ the more back, open [m] before high vowels, in the same way that many Slavic languages will employ the velarized [[??]] in syllable-initial position. Note also that no standard symbols exist for the differing tongue position of English /m/ allophones, as there are for the allophones of /t/. It is helpful for a pedagogue to appreciate that an undesirable, recurring singing habit of a student may have its origins in language, rather than in vocal technique and production per se.

The /m/ and /b/ are obviously distinct phonemes in English and most other languages, and we take that as a given primarily because they are acoustically well differentiated from each other. This disguises the fact that they are remarkably similar in articulation. If asked how they differ, one is likely to point out the truth that [m] can be sustained, while [b] cannot, except in its silent occlusion stage. An important distinction in articulation is thereby neglected: the nasopharyngeal port (that part of the velum that controls placement of resonance) is open for [m] and closed for [b]. In other words, [m] is nasal and [b] is oral. As speakers we are so conditioned to perform the required mechanics of sound that we are quite unaware of this physical movement, at least in this context. One of the more challenging exercises in lyric diction is the alternation of [a]-[[??]]-[a]-[[??]]-[a]-[[??]], which involves the same movement, because we are not called upon to exercise voluntary control of the nasopharyngeal port in speech. This is but one example of the articulative complexity of speech, and how little information is actually imparted by standard phonetic symbols.
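
The comparison of [m] and [b] can be made concrete with a small sketch. The following Python fragment is only an illustration: the parameter names and the three-parameter description are assumptions made for the example, not a standard feature system. It lists the articulatory settings of each sound and reports where they disagree.

```python
# Hypothetical articulatory descriptions; parameter names are illustrative only.
M = {"lips": "closed", "vocal_folds": "voiced", "nasopharyngeal_port": "open"}
B = {"lips": "closed", "vocal_folds": "voiced", "nasopharyngeal_port": "closed"}


def differing_parameters(first, second):
    """Return the parameters on which two articulations disagree."""
    return {
        name: (first[name], second[name])
        for name in first
        if first[name] != second[name]
    }


print(differing_parameters(M, B))
# {'nasopharyngeal_port': ('open', 'closed')}
```

Under these simplified assumptions the two sounds differ in a single parameter, and it is one that the symbols [m] and [b] do not make visible.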

Calvert attempts to describe such activity in prose by detailing the physiological complexity involved in pronouncing the words "meat" and "fence." (16) His analysis of "fence" involves the itemization of fifteen motions, followed by fifteen ways in which the pronunciation could have been derailed by failing to perform the requisite movements in each. Quoting the entire passage would be instructive, but we shall confine ourselves to a segment. He describes the articulation of the "-en" [[epsilon]n] of "fence" in points 7-11, as follows:

Voicing continues (7) as the tip of the tongue closes against
the middle (8) of the alveolar ridge, closing off the flow of
air through the oral cavity, and almost simultaneously (9)
the velopharyngeal [i.e., nasopharyngeal] port opens (10). The
tongue remains in the lingua-alveolar position but increases
muscular tension, as simultaneously (11), voicing ceases and
the velopharyngeal port closes.

A detailed analysis then follows:

(7) Resonance for the [[TEXT NOT REPRODUCIBLE IN ASCII]] involves the
    nasal cavity closed at the velopharyngeal port, and the oral cavity
    open at both ends but restricted at the mid-to-front by the
    elevated tongue. (Instrumental studies reveal that the
    velopharyngeal port actually opens during the last portion of the
    vowel [[epsilon]], anticipating the nasal resonant [n], and begins
    its gradual closure as the tongue tip begins to close on the
(8) The point of [n] closure is determined by the tongue position of
    the preceding vowel [[epsilon]], farther back than that for the [I]
    and farther forward than for the [u].
(9) Opening of the velopharyngeal port to the nasal cavity is almost
    simultaneous with closure of the oral cavity by the tongue.
(10) Resonance for the [n] involves the nasal cavity open at both ends,
     and the oral cavity closed by the tongue at the alveolar ridge.
     Another resonance cavity is formed in front of the tongue and
     between the lips, which must be open for this sound.
(11) Termination of voice and closure of the velopharyngeal port must
     be simultaneous.

A description of "error implications" follows:

(7) If the velopharyngeal port should open too soon, the vowel will be
    a nasal [??] sound.
(8) The position of the front of the tongue for [[epsilon]] places the
    tip of the tongue just below the middle of the alveolar ridge. In
    order to make closure of the oral cavity almost simultaneous with
    opening of the velopharyngeal port, the tongue tip must close at
    the closest position on the alveolar ridge. Otherwise, the ending
    of the [[epsilon]] may have excessive open nasal resonance.
(9) If the velopharyngeal port should be delayed in opening, the voiced
    stop [d], homorganic with [n], may intrude for [f[epsilon]dnts].
(10) Resonating cavities to produce [n] require that the lips be open
     in front of the lingua-velar closure. If not, the bilabial nasal
     resonant [m] will be heard instead.
(11) Should the velopharyngeal port remain open as voicing ceases,
     breath would escape from the nasal cavity, leaving insufficient
     oral breath pressure for the [s].

Analysis at this level of detail requires three pages of prose to describe the articulation of a single word. Yet the analysis remains far from complete, and further levels of detail could have been added. For instance, under (10) the author might have mentioned that the entire tongue is necessarily involved in the closure of the oral cavity for [n], including the lateral portions of the tongue contacting the upper gums. Only in this way can the oral cavity be sealed off. If only the tip touches the alveolar ridge, the lateral [l] will result.
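
Two of the parameters in Calvert's account, tongue-tip closure and the state of the velopharyngeal port, can be traced through time in a short sketch. The Python fragment below is an idealized illustration, not Calvert's method: the function names, the two-parameter model, and the labels are assumptions made for the example, and voicing is taken to be continuous throughout. It shows how the timing of the port relative to the tongue closure yields the intended [n], the intrusive [d] of error implication (9), or the nasalized vowel of error implication (7).

```python
def classify(tongue_tip_closed, port_open):
    """Label a voiced moment from two articulatory parameters (simplified)."""
    if tongue_tip_closed:
        return "n" if port_open else "d"
    return "nasalized vowel" if port_open else "oral vowel"


def trace(timeline):
    """Collapse a sequence of (tongue_tip_closed, port_open) states into labels."""
    labels = []
    for tongue_tip_closed, port_open in timeline:
        label = classify(tongue_tip_closed, port_open)
        if not labels or labels[-1] != label:
            labels.append(label)
    return labels


# Near-simultaneous closure and port opening, as in points (8) to (10).
on_time = [(False, False), (True, True)]
# Port opens after the tongue has already closed (error implication 9).
late_port = [(False, False), (True, False), (True, True)]
# Port opens well before the tongue closes (error implication 7).
early_port = [(False, False), (False, True), (True, True)]

print(trace(on_time))     # ['oral vowel', 'n']
print(trace(late_port))   # ['oral vowel', 'd', 'n']
print(trace(early_port))  # ['oral vowel', 'nasalized vowel', 'n']
```

Even this two-parameter sketch leaves out the lips, the sides of the tongue, and the gradual movement of the port, which is why a fuller account quickly becomes cumbersome.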

Of course, all of this precision takes place without conscious effort for the native speaker. This should not imply, however, that such a level of analysis is useful only to the theoretical phonetician. While many of the Indo-European languages share common articulative reflexes that occur without thought and effort, not all behave identically at the level of detail exhibited above. Furthermore, now that interest in Western classical music extends around the globe, and more singers from language traditions outside the Indo-European family are encountered, fewer of these articulative minutiae can be taken for granted. The correction of diction problems may involve dialogue between teacher and student at exactly this level of detail, in order to pinpoint the cause of a problem. It can be disconcerting, and even frustrating, to work with a singer who cannot perceive the difference between two sounds whose distinctiveness seems obvious to the instructor. (17) It is the instructor's responsibility to know with precision at least the basics of how the student's first-language sound pattern operates, in order to "get inside the brain" of the student at the reflex linguistic level. This tall order is not as insurmountable as it may seem. Fluency is not the goal, but rather an awareness of the phonological behavior of the student's first language.


Not all individuals assimilate adjacent articulations in the same manner, or to the same extent. The reader may wonder where he or she stands on the scale of personal susceptibility to assimilation. A simple test will suffice. One who readily assimilates is likely to say [t[epsilon]m.m[epsilon]n] for "ten men," [[TEXT NOT REPRODUCIBLE IN ASCII]] or [[TEXT NOT REPRODUCIBLE IN ASCII]] for "his shoes," [[TEXT NOT REPRODUCIBLE IN ASCII]] for "nice shirt," [[TEXT NOT REPRODUCIBLE IN ASCII]] for "unpleasant," [[TEXT NOT REPRODUCIBLE IN ASCII]] for "ingredient," [[TEXT NOT REPRODUCIBLE IN ASCII]] for "ungrateful," and [[TEXT NOT REPRODUCIBLE IN ASCII]] for "phone booth." One more inclined to segmental "purity," or one speaking more carefully, is more likely to employ the citation forms for each assimilated segment above: [n], [z], [s], [n], [n], [n], and [n], respectively. The voice teacher should be vigilant for variations in assimilation from EFL student to student. These are likely to manifest themselves in parallel in both speech and singing. In the case of ESL singers, an overuse of learned citation forms is understandably common. Accepted and expected forms of assimilation, even in singing, may have to be taught.
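
The pattern behind this informal test can be sketched as a small rule table. The Python fragment below is a minimal illustration rather than a model of English phonology: the function name `assimilate`, the `PLACE` and `RULES` tables, and the simplified transcriptions are assumptions made for the example. It applies anticipatory place assimilation to a final alveolar when the following segment is bilabial, velar, or postalveolar, and otherwise leaves the citation form untouched.

```python
# Hypothetical, deliberately tiny rule set; real assimilation is gradient
# and varies from speaker to speaker.
PLACE = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "k": "velar", "g": "velar",
    "ʃ": "postalveolar",
}

RULES = {
    ("n", "bilabial"): "m",
    ("n", "velar"): "ŋ",
    ("s", "postalveolar"): "ʃ",
    ("z", "postalveolar"): "ʒ",
}


def assimilate(first, second):
    """Return `first` with its final segment adapted to the onset of `second`."""
    if not first or not second:
        return list(first)
    replacement = RULES.get((first[-1], PLACE.get(second[0])))
    if replacement is None:
        return list(first)
    return list(first[:-1]) + [replacement]


print("".join(assimilate(list("tɛn"), list("mɛn"))))  # tɛm
print("".join(assimilate(list("hɪz"), list("ʃuz"))))  # hɪʒ
print("".join(assimilate(list("tɛn"), list("tiz"))))  # tɛn (no rule applies)
```

A speaker inclined to segmental "purity" corresponds to running no rules at all, while the two variants noted for "his shoes" would require a second, competing rule for [z], which this sketch omits.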

Although a high level of adherence to citation forms is often the case in singing, it is not always advisable to eliminate speech assimilations in lyric contexts. An example: in speech, retention of the alveolar [n] in "on the" is quite unlikely, the interdental [[TEXT NOT REPRODUCIBLE IN ASCII]] being favored, in anticipatory assimilation with the [[??]]. There is little, if any, acoustic difference between [n] and [[??]], so the awkward apical slide from [n] to [[??]] can effectively be dispensed with in both speech and singing.

As with most things, the closer one looks, the more complex the issues involved can become. The truism about truth is that it is rarely simple. Ultimately, artistic instinct must be the final arbiter of "correctness" and bon gout in the vocal delivery of text.


(1.) Martin J. Ball and Joan Rahilly, Phonetics: The Science of Speech (London/New York: Arnold, 1999), 133.

(2.) Ibid., 133-137; see for illustrations of such diagrams. The methodology is known as parametric phonetics.

(3.) For a list of Zungenbrechern, or tongue-twisters in German, see Carl and Peter Martens, Phonetik der deutschen Sprache: Praktische Aussprachelehre (München: Max Hueber, 1961), 249-251.

(4.) Students are sometimes nonplussed by the off-glide symbols employed in English or German diphthongs, for this reason. The failure of the English closing diphthongs [ai], [[??]I], [au], and [ou] to arrive at the [i] or [u] that speakers think they are producing is a result of such instinctive accommodation practice.

(5.) Evelina Colorni, Singers' Italian: A Manual of Diction and Phonetics (New York: G. Schirmer, 1970); Nico Castel, The Complete Verdi Libretti, 4 volumes (Geneseo, NY: Leyerle Publications [various dates]); see also Castel's several other volumes of Italian opera--Puccini, Italian bel canto, Italian verismo, etc.

(6.) For example, Donald Calvert's extensive discussion of assimilation in Descriptive Phonetics (New York: Thieme-Stratton, 1980) unfolds under the term coarticulation, which he defines as "producing two sounds in sequence so that they influence how each other is produced" (85). The present author followed suit in a recent article, "Phonetic Transcription--What it Doesn't Tell Us," Journal of Singing 70, no. 1 (September/October 2013): 60-61.

(7.) This is a level of detail not encountered in lyric diction literature, which contents itself with [t[integral]], [d[??]], [pf], etc. Strictly, such nomenclature would indicate successive double articulations. For instance, the Italian sounds [[??]] and [[??]], and Czech voiced <[??]> [??] could be described, for the benefit of English speakers, as coarticulations of [lj], [nj], and [rj], respectively. Such double IPA clusters are doubly misleading, because they imply successive articulations, and they ignore the adjustments to the standard articulations of the first consonants, in order to produce the new sound.

(8.) Linguists often employ the umbrella term "accommodation" to include assimilation, dissimilation, co-articulation, and other forms of articulation change stemming from environmental conditions.

(9.) It is interesting that the three standard modern English dictionaries of pronunciation differ in their descriptions of these words. The Longman Pronunciation Dictionary and Cambridge English Pronouncing Dictionary employ [as] without further refinement, while Oxford Dictionary of Pronunciation for Current English recommends [as] for SAE and [a] for British. None attempts to differentiate between the words in terms of vowel color. For full citations of the dictionaries, see note 15.

(10.) Try saying "boo-boo" without full lip-rounding, then with, noting the difference in the sound of both the bilabial consonant and the vowel color.

(11.) Progressive assimilation (in which a segment influences the sound of a later one) is also known as perseverative assimilation. Likewise, regressive assimilation (in which a later segment influences an earlier one) is also called anticipatory assimilation. Linguists prefer anticipatory and perseverative.

(12.) John S. Kenyon and Thomas A. Knott, A Pronouncing Dictionary of American English (Springfield, MA: Merriam-Webster, 1953), 478.

(13.) Initial <w> and <wh> are also a troublesome issue, albeit not directly related to a discussion of assimilation.

(14.) The author recalls as a teenager his consternation, trying to read such passages of dialogue in Dickens--the entire intended effect being consigned to his uninformed imagination.

(15.) Daniel Jones, English Pronouncing Dictionary, 15th ed., Peter Roach and James Hartman, eds. (Cambridge: Cambridge University Press, 1997); Bernard Silverstein, NTC's Dictionary of American English Pronunciation (Lincolnwood, IL: National Textbook Co., 1994); Clive Upton, William A. Kretzschmar, Jr., and Rafal Konopka, Oxford Dictionary of Pronunciation for Current English (Oxford: Oxford University Press, 2001); J. C. Wells, Longman Pronunciation Dictionary (Harlow, Essex: Longman Group, 1990). Most sources cite transatlantic variant forms that are standard, but this is only a partial explanation of the variety found in the symbols.

(16.) Calvert, 151-156.

(17.) Teaching [l]-[[??]]-[[??]]-[[??]] distinctions to Asian ESL students is perhaps the example that comes most readily to mind. It is important to understand that this challenge can result from three different phonological situations: the absence of one or more of these sounds in the native language; the lack of phonemic contrast between the sounds in the native language; and the presence of some or all of these sounds in the native language, but with a complementary distribution that differs from English (i.e., they are employed in different environments). These are ranked in order of increasing difficulty. The challenges faced by a Korean singer, for instance, will be distinct from those of a Japanese, Chinese, or Vietnamese singer, due to one or more of these factors.

