Automatic Transcription Of English Connected Speech Phenomena

Svetlana Androsova; Information Technology in Industry (ITII); Web of Science (Emerging Sources Citation Index)

IT in Industry, vol. 4, no. 1, 2016. Published online 30-Sep-2016.
ISSN (Print): 2204-0595. ISSN (Online): 2203-1731.
Copyright © Androsova and Androsov 2016

Svetlana V. Androsova and Evgenij Yu. Androsov
Department of Foreign Languages, Amur State University, Blagoveshchensk, Russian Federation
androsova_s@mail.ru, eugene_ap@mail.ru

Abstract—Many phonetic phenomena that occur in connected speech are classified as phonetic periphery, where anything can happen. Transcription is a well-known, convenient way to record any phonetic phenomenon with a set of symbols. This paper presents a model that predicts allophones by coordinating the factors that determine the choice of a particular allophone, and that visualizes the result by replacing certain letters with the corresponding IPA symbols. The Free Pascal compiler and the Geany editor are used for programming. The model is created for American English. It is tested for the tap and the glottal burst, the latter being one of the three glottalization patterns. We explain how the combination of factors needed for purely linguistic analysis differs from the one needed for computer programming. We demonstrate (i) the framework for integrating separate blocks, each dealing with one phenomenon, and (ii) a block for tapping, which is almost finalized, and a part of a block on glottalization, particularly the patterns for glottal burst.

Keywords—transcription; tap; glottalization; computer modeling; Free Pascal

I. INTRODUCTION

Since instrumental methods of speech analysis appeared, a great deal of research on the flow of speech has been performed, and numerous results have been obtained concerning canonical and non-canonical phenomena. A number of connected speech phenomena have been studied, and the factors determining them have been more or less described.

A well-known, convenient way to record any phonetic phenomenon with a set of symbols is transcription. When data on connected speech phenomena are obtained, a series of logical questions arises: How regular are these phenomena? Can we create a computer model predicting them using, e.g., a written text as an input file? How well will this model correspond with reality?

In this paper we try to answer these questions for taps and glottalization in American English. Traditionally, the phonetic literature views them as a part of the phonetic periphery, where anything can happen [1]. We hypothesize that, whether or not they belong to the phonetic periphery, phenomena like tapping and glottalization demonstrate regular patterns that can be formalized into a computer model.

In the current paper we demonstrate (i) the framework for integrating separate blocks, each dealing with one phenomenon, and (ii) a block for tapping, which is almost finalized, and a part of a block on glottalization, particularly the patterns for glottal burst.

II. CONDITIONS DETERMINING TAPPING AND GLOTTALIZATION: LINGUISTIC AND COMPUTER VIEW

The choice of these two particular allophones is determined, first, by their regular occurrence in the conditions given below (90.1% for taps, 88.9% for glottal burst) and, second, by the drastic difference between their acoustic characteristics and those of the canonical allophone of /t/ (Praat was used for acoustic analysis [2]). Fig. 1–6 demonstrate that difference. In the canonical allophone (see Fig. 1–2) the following well-known phases can be found: occlusion, impulse, friction (always present), and aspiration (appears only before non-front vowels). Taps (Fig. 3–4) are generally very short; they might or might not have an occlusion; their impulse-like phase is very short and often not localized properly; fundamental frequency (F0) is always present, and it is often accompanied by a considerable intensity drop. Glottal bursts (Fig. 5–6) vary in duration from quite long to super short, might or might not be preceded by an occlusion, and might or might not have F0. Their impulse phase is concentrated in the low-mid frequencies and is very weak for a good physiological reason: a strong impulse cannot be produced with the vocal folds.

Fig. 1. Canonical /t/ with fricative and aspiration phases in took.

Fig. 2. Canonical /t/ with fricative phase in tea.

Four conditions for /t/ and /d/ tapping in American English can be easily singled out based on an analysis of the phonetic literature:

1. intervocalic (V1CV2) word-internally in putting, pudding, etc., with any stress pattern of V1 and with V2 being only unstressed;
2. intervocalic word-finally in get in, had a, etc., with any stress pattern of both V1 and V2;
3. before syllabic 'l' in battle, middle, etc.;
4. between rhotic and non-rhotic vowels in words like party, sort of, harder, heard of, etc.

More information on taps (or flaps) is available in Wolfram and Johnson [3], Orion [4], Mills [5], Laver [6], Language Files [7], Herd et al. [8], Huffman [9], Broadbent [10], and Warner et al. [11].

Fig. 4. Tap in pretty.

The following conditions make for glottal burst:

• word-internal /t/ + /l/, /n/, /m/;
• word-final /t/ + any sonorant except for /ŋ/.

Glottal burst is a part of the glottalization phenomenon, which includes: 1) voiced and voiceless implosives (the canonical locus is preserved, glottal occlusion is added, no audible release is noticed); 2) glottal stop (the canonical locus is substituted by a glottal stop, no audible release); 3) glottal burst (the canonical locus is substituted by a glottal stop which is released) [12]. For more information on glottalization see Firth [13], Cruttenden [14], Laver [15], Ladefoged [16], and Broadbent [10].

These conditions are meant for a user who possesses basic skills of converting letters into sounds. A number of issues are taken for granted and need no explanation for a native speaker of American English or for a learner of American English who can at least read properly.
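As an illustration only, the tapping and glottal burst conditions listed above can be sketched in a few lines of code. The sketch is written in Python rather than in the authors' Free Pascal, and it assumes that the letter-to-sound step has already been done: each word is a list of ARPAbet-style segments with a stress digit on vowels (0 = unstressed) and EL for syllabic /l/. The names classify_t_d, VOWELS and SONORANT_ONSETS are hypothetical, and treating a vowel + R sequence as a rhotic vowel under the same stress constraint as condition 1 is an assumption of the sketch, not something stated in the paper.

```python
# Illustrative sketch only; not the authors' Free Pascal program.
# Assumption: segments are ARPAbet-style strings; vowels carry a stress
# digit (0 = unstressed), and "EL" stands for syllabic /l/.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}
SONORANT_ONSETS = {"L", "M", "N", "R", "W", "Y"}  # sonorants except NG


def is_vowel(seg):
    """True if the segment is a vowel, with or without a stress digit."""
    return seg.rstrip("012") in VOWELS


def classify_t_d(word, i, next_word=None):
    """Classify the T or D at word[i] as 'tap', 'glottal burst' or 'canonical'."""
    seg = word[i]
    if seg not in ("T", "D"):
        raise ValueError("position i must hold T or D")

    word_final = i == len(word) - 1
    if word_final:
        following = next_word[0] if next_word else None
    else:
        following = word[i + 1]

    # Left context: a vowel, or a vowel + R (assumed rhotic vowel, condition 4).
    prev = word[i - 1] if i > 0 else None
    after_vowel = prev is not None and (
        is_vowel(prev) or (prev == "R" and i >= 2 and is_vowel(word[i - 2])))

    if after_vowel and following is not None:
        # Condition 2: word-final, next word starts with any vowel.
        if word_final and is_vowel(following):
            return "tap"
        # Condition 3: before syllabic l.
        if not word_final and following == "EL":
            return "tap"
        # Condition 1 (and word-internal 4): following vowel is unstressed.
        if not word_final and is_vowel(following) and following.endswith("0"):
            return "tap"

    if seg == "T" and following is not None:
        # Word-internal /t/ + /l/, /n/, /m/.
        if not word_final and following in {"L", "M", "N"}:
            return "glottal burst"
        # Word-final /t/ + any sonorant except NG.
        if word_final and following in SONORANT_ONSETS:
            return "glottal burst"

    return "canonical"
```

Under these assumptions, the /t/ in putting, get in, battle and party comes out as a tap, the first /t/ in Atlantic as a glottal burst, and the /t/ in took as canonical. The sketch deliberately starts from sounds, not letters, which is exactly the step a human reader performs without effort and a program has to be taught.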
People easily manage a number of things that a computer has no natural skills for. The ones crucial for tap and glottal burst are:

• mute word-final “e” in words like hate, definite, etc.;
• mute word-initial “h” in words like heir, honor, etc.;
• groups of letters like “ough”, “augh”, “eigh” that contain a consonant letter but make one single vowel sound in words like bought, daughter, height, etc.;
• mute “ai”, “e” in words like certain, written, etc.

Fig. 3. Tap in not only.

Word stress and sentence stress should be added to these issues, because a person who speaks decent English knows where to put the stress and the computer does not. Issues like the ones mentioned above need to be programmed properly; otherwise the program will produce inadequate results that distort the real-speech pattern.

III. EXPERIMENT

A. Tools

The symbols of the International Phonetic Alphabet (IPA) are used for transcription. To display the program source text and the results of its execution correctly, a font with Unicode support is needed; therefore the Doulos SIL font was used. Free Pascal is used for programming. The source text was typed in the Geany editor. The standard library of Free Pascal was used to work with text variables. Version 2.6.2 (2015), which supports Unicode, was used [17]. A program written in this version can be compiled in any later version of Free Pascal. The program was compiled and executed in Linux.

B. The program

1) Framework and conflicting cases

The program operates within the following framework (see Fig. 7), which is universal and can be suitable for any phonetic modification, not only for tap and glottal burst.

Fig. 5. Glottal burst in certain.

We view a text in which we wish to transcribe certain connected speech phenomena as an input file.
Because a number of phenomena occur at word boundaries, we cannot focus only on what is going on word-internally. Therefore we have to deal with every current word by looking both inside and outside it. This includes the characteristics of the previous word, the following word, and the non-alphabetic symbols between those words, particularly punctuation, symbols for pauses, or any other symbols that might be deliberately added during text processing. Words and fillers are variables that are not subjected to any changes. They are analyzed according to the criteria that have been targeted. For instance, if we target taps, the program needs to find out whether there are certain letters in certain positions that presuppose tapping.

There are two things that we need to keep in mind. First, the letter «t» can be transcribed by more than one phonetic symbol, e.g. tap [ɾ], weak voiceless allophone [t], glottal burst [ʔ], glottal stop [ʔ̚], canonical allophone [tʰ]. Second, there are so-called conflicting cases, where formally one and the same position might require different allophones, for example word-final intervocalic «t» followed by the vowel letter «u»: tap in that up but glottal burst in that usual (see Fig. 8). While the word that is processed, an additional variable sIsChange is introduced. Initially it holds «-» for every symbol: «----». After checking the final position of this word, which contains the letter t, the program changes «-» into «+»: once the correspondence to the condition <Vowel>+t+<end-of-word>+<u><n><i> is confirmed, the variable sIsChange reads «---+» and the variable aChangeStr[4] is assigned the symbol ʔ.

Fig. 6. Glottal burst in Atlantic.
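The bookkeeping described above can be sketched as follows. This is a minimal illustration in Python, not the authors' Free Pascal code: s_is_change and a_change_str mirror the variables sIsChange and aChangeStr from the text (with 0-based instead of 1-based indexing), while the function names, the two spelling-level rules, their order, the example word unit, and the decision to apply rules only across a plain whitespace filler are assumptions made for the example. The previous word is left out for brevity.

```python
import re

TAP = "\u027e"            # IPA tap symbol
GLOTTAL_BURST = "\u0294"  # IPA glottal stop symbol, used here for glottal burst


def transcribe_word(word, filler, next_word):
    """Return (transcribed word, change mask) for one word in its context."""
    s_is_change = ["-"] * len(word)   # "-" = untouched, "+" = already assigned
    a_change_str = list(word)         # one output string per letter position

    def assign(pos, symbol):
        # A position that was already changed is locked against later rules.
        if s_is_change[pos] == "-":
            a_change_str[pos] = symbol
            s_is_change[pos] = "+"

    w, nxt = word.lower(), next_word.lower()
    # Assumption: rules apply only when nothing but whitespace separates the words.
    if filler.isspace() and re.fullmatch(r".*[aeiou]t", w):
        last = len(w) - 1
        # Specific rule first: <Vowel>+t+<end-of-word>+<u><n><i>
        if re.match(r"uni", nxt):
            assign(last, GLOTTAL_BURST)
        # General rule second: next word starts with a vowel letter.
        if re.match(r"[aeiou]", nxt):
            assign(last, TAP)
    return "".join(a_change_str), "".join(s_is_change)


def transcribe(text):
    """Split text into alternating words and fillers and transcribe each word."""
    tokens = re.findall(r"[A-Za-z]+|[^A-Za-z]+", text)
    out = []
    for i, tok in enumerate(tokens):
        if not re.match(r"[A-Za-z]", tok):
            out.append(tok)           # fillers are copied unchanged
            continue
        filler = tokens[i + 1] if i + 1 < len(tokens) else ""
        next_word = tokens[i + 2] if i + 2 < len(tokens) else ""
        out.append(transcribe_word(tok, filler, next_word)[0])
    return "".join(out)
```

For the input "that up, that unit." this sketch yields "thaɾ up, thaʔ unit.", with the mask «---+» for both occurrences of that. In the second case the general vowel rule also matches, but the position is already locked, so the glottal burst symbol is kept. A word such as usual would need an additional pattern, which is not shown here.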
After that, processing continues, but further processing does not change a symbol that has already been assigned. Another example of the same sort is postvocalic word-final «t» followed by the consonant letter «h», which in most instances makes for a glottal stop but, when «h» is mute (honor, etc.), makes for a tap.

Fig. 7. Framework for automatic transcription of connected speech phenomena.

Fig. 8. Conflicting cases: that + «u».

2) Regular expressions

We use the standard regular expressions that are provided with Free Pascal. This makes it possible to decrease the number of if-operators and thereby simplify the program structure (see Fig. 9).

Fig. 9. Regular expressions.

3) Two blocks of the program: tap and glottal burst

The program deals with word-final and word-internal taps separately (earlier we demonstrated a pilot version of this block as a separate program [18]). Fig. 10 shows the algorithm of word-final tap processing. Fig. 11 shows the algorithm of word-internal tap processing.

Fig. 10. Example of word-final tap processing.

Fig. 11. Example of word-internal tap processing.

Figures 12 and 13 demonstrate the algorithms for word-final and word-internal glottal burst processing, respectively.

Fig. 12. Example of word-final glottal burst processing.

Fig. 13. Example of word-internal glottal burst processing.

Both blocks operate within 90% accuracy. In the tap block, the unattended issue is word-initial intervocalic tapping that occurs for to as a particle, preposition, or prefix in word sequences like so to speak, go to college, be together, etc. Those cases are not very frequent but are quite stable for tapping in the American English speech flow and need to be programmed properly.

The glottalization block is to be considerably enlarged: glottal stop and implosive parts are to be added to the glottal burst part. The first one will present no problem: all that is needed is a set of chars and strings enumerating consonants with an articulator other than the forelingual one: f, v, p, b, k, g and, additionally, the strings ph, qu. The second part is a challenge because it is not easy to predict pauses accurately. Punctuation can be helpful (full stops, exclamation and question marks, dashes, colons and semicolons, and at times commas), but a certain number of pauses might be located in places other than punctuation marks. Such pauses must be tagged during text preprocessing.

IV. CONCLUSION

In the present paper we pointed out the necessity of a computer model predicting various connected speech phonetic phenomena and, using the example of taps and glottal bursts, which are regular in American English, showed what such a model might look like if the Free Pascal language is used. The number of blocks, each corresponding to a certain phonetic phenomenon, can be easily enlarged; each block itself can be enlarged as well to reflect the majority of conditions and to increase allophone predictability in the output file.

The same kind of programming can be done for Canadian and Australian English, where taps are almost as frequent and stable under the conditions described above as in American English (on taps/flaps in Australian English see, e.g., Malcolm [19], Cox [20]). Unfortunately, tapping and glottalization cannot be modeled reliably for British English because of the changing standard [21], [22] and a high degree of free variation in the conditions described above: for instance, one word-final intervocalic position presupposes several allophones with almost no predictability of the pattern in British English, and only one allophone in American English with more than 90% predictability [23].

ACKNOWLEDGMENT

We express our grateful thanks to ICITA Conference Chair Dr. David Tien and the Rector of the Amur State University Dr. Andrei D. Plutenko, who sponsored the delivery of this paper to ICITA (July 2016).

REFERENCES

[1] S. Greenberg, H. Carvey, L. Hitchcock, and S. Chang, “Beyond the Phoneme: A Juncture-Accent Model of Spoken Language,” Proc. of the Human Language Technology Conf. (HLT – 2002), pp. 36–43, San Diego, California, 2002.
[2] P. Boersma and D. Weenink, “Praat: Doing phonetics by computer,” (Version 5.4.15) [Computer Program]. Retrieved November 30, 2015, from <http://www.fon.hum.uva.nl/praat/>.
[3] W. Wolfram and R. Johnson, Phonological Analysis. Focus on American English, Washington D. C.: Prentice Hall Regents, 1982, pp. 19–20.
[4] G. F. Orion, Pronouncing American English. Sounds, Stress and Intonation, New York, 1985, p. 199.
[5] C. Mills, American Grammar. Sound, Form and Meaning, New York, 1990, p. 119.
[6] J. Laver, Principles of Phonetics, New York: Cambridge Univ. Press, 1994, p. 218.
[7] Language Files: Materials for an Introduction to Language and Linguistics, 7th ed., The Ohio State University, 1998.
[8] W. Herd, A. Jongman, and J. Sereno, “An acoustic and perceptual analysis of /t/ and /d/ flaps in American English,” J. Phonetics, vol. 38, pp. 504–516, 2010.
[9] M. K. Huffman, “Segmental and Prosodic Effects on Coda Glottalization,” J. Phonetics, vol. 33 (3), pp. 335–362, 2005.
[10] J. M. Broadbent, “t-to-r in West Yorkshire English,” English Language and Linguistics, vol. 12 (1), pp. 141–168, 2008.
[11] N. Warner, A. Fountain, and B. V. Tucker, “Cues to Perception of Reduced Flaps,” JASA, vol. 125 (5), pp. 3317–3327, 2009.
[12] S. V. Androsova, “Allofonnoe var'irovanie soglasnyh fonem v spontannoj rechi (jeksperimental'no-foneticheskoe issledovanie na materiale amerikanskogo varianta anglijskogo jazyka) [Allophonic variation of consonants in spontaneous speech (experimental study based on American English)],” PhD Dis., St.-Petersburg State University, Blagoveshhensk, 2001.
[13] J. R. Firth, Papers in Linguistics 1934–1951, London: Oxford Univ. Press, 1957, p. 124.
[14] A. Cruttenden, Gimson's Pronunciation of English, Edward Arnold Limited, 2001, p. 168.
[15] J. Laver, Principles of Phonetics, New York: Cambridge Univ. Press, 1994, p. 171.
[16] P. Ladefoged, A Course in Phonetics, 2nd ed., Singapore: Cengage Learning, 2006, p. 60.
[17] Free Pascal. URL: http://www.freepascal.org
[18] S. V. Androsova and E. Yu. Androsov, “Automatic transcription of taps in American English,” Proc. of the 2nd International Conference "Phonetics without borders" [ed. by S. Androsova], Amur State University, Blagoveshchensk, pp. 10–16, 2015.
[19] K. Malcolm, Phasal Analysis: Analyzing Discourse through Communication Linguistics, New York, 2010.
[20] F. Cox, Australian English Pronunciation and Transcription, Cambridge University Press, 2012.
[21] A. H. Fabricius, “T-glottaling. Between Stigma and Prestige: A Sociolinguistic Study of Modern RP,” Copenhagen, 2000.
[22] J. C. Wells, Accents of English, New York: Cambridge University Press, 1982.
[23] S. V. Androsova and V. G. Karavaeva, “Odnoudarnyje allofony v amerikanskom i britanskom variantah anglijskogo jazyka [Flaps / taps in American and British English],” Teoreticheskaja i prikladnaja lingvistika [Theoretical and Applied Linguistics], vol. 1(2), pp. 5–20, 2015.
