Feature-based pronunciation modeling for speech recognition

Karen Livescu; James Glass

Feature-based pronunciation modeling for speech recognition

Proceedings of HLT-NAACL 2004: Short Papers on XX - HLT-NAACL '04, 2004

retroflexion for [r] starts and ends before the lip narrow- ing gesture for [v]. Again, all of the feature streams are produced correctly, but there is an apparent exchange of two phones, which cannot be represented via single- phone confusions conditioned on phonemic context. A final example from Switchboard is everybody → [eh r uw ay]. It is difficult to imagine a set of phonetic trans- formations that would predict this pronunciation without allowing a host of other impossible pronunciations. How- ever, when viewed in terms of features, the transforma- tion from [eh v r iy bcl b ah dx iy] to [eh r uw ay] is fairly simple. The tongue and lips desynchronize, caus- ing the lips to start to close for the [bcl] during the previ- ous vowel. In addition, the lip constrictions for [bcl] and [v], and the tongue tip gesture for [dx], are reduced. We will return to this example in the sections below. 3 Approach A feature-based pronunciation model is one that explic- itly models the evolution of multiple underlying linguis- tic feature streams to predict the allowed realizations of a word and their probabilities. Our approach begins with the usual assumption that each word has one or more tar- get phonemic pronunciations, or baseforms. Each base- form is converted to a table of underlying feature values. Table 1 shows what part of this table might look like for the word everybody. The table may include “unspecified” values (‘*’ in the table). More generally, each table en- try can be a distribution over the range of feature values. For now, we assume that all of the features go through the same sequence of indices (and therefore the same number of targets) in a given word; e.g., in Table 1, LIP-OPEN goes through the same indices as TT-LOC, although it has the same target value for indices 2 and 3. In the first time frame of speech, all of the features begin in index 0; in subsequent frames, each feature can either stay in the same index or transition to the next one with some probability. The surface feature values—i.e., the ones that are ac- tually produced by the speaker—can stray from the un- derlying pronunciation in two ways, typically because of articulatory inertia: substitution, in which a feature fails to reach its target underlying value; and asynchrony, in which different features proceed through their sequences of indices at different rates. We define the degree of asyn- chrony between two sets of features as the difference be- tween the average index of one set relative to the average index of the second. The degree of asynchrony is con- strained: More “synchronous” configurations are more probable (soft constraints), and we make the further sim- plifying assumption that there is an upper bound on the degree of asynchrony (hard constraints). A natural framework for such a model is provided by dynamic Bayesian networks (DBNs), because of their index 0 1 2 3 ... phoneme eh v r iy ... LIP-OPEN wide critical wide wide ... TT-LOC alv. * ret. alv. ... ... ... ... ... ... ... Table 1: Part of a target pronunciation for everybody. In this feature set, LIP-OPEN is the lip opening degree; TT-LOC is the location along the palate to which the tongue tip is closest (alv. = alveolar; ret. = retroflex). 2 3 U U U 2 3 S 1 S S sync =1 1;2 ind ind ind lexEntry sync =1 1,2;3 1 2 3 t t t t t t t t t 1 t t t wdTr t Figure 1: One frame of a DBN for recognition with a feature-based pronunciation model. Nodes represent variables; shaded nodes are observed. Edges repre- sent dependencies between variables. Edges without par- ents/children point from/to variables in adjacent frames (see text). ability to efficiently implement factored state representa- tions. Figure 1 shows one frame of the type of DBN used in our model (simplified somewhat for clarity of presenta- tion). This example DBN assumes a feature set with three features. The variables at time frame t are as follows: lexEntry t – entry in the lexicon corresponding to the cur- rent word and baseform. Words with multiple base- forms have one entry per baseform. lexEntry t ’s par- ents are lexEntry t-1 and wdTr t-1 ind j t – index of feature j into the underlying pronun- ciation, as in Table 1. ind j 0 = 0; in subsequent frames ind j t is conditioned on lexEntry t-1 , ind j t-1 , and wdTr t-1 (defined below). U j t – underlying value of feature j . Its distribution p(U j t |lexEntry t , ind j t ) is determined by the target feature table of lexEntry t . S j t – observed surface value of feature j . p(S j t |U j t ) en- codes allowed feature substitutions. wdTr t – binary variable indicating whether this is the last frame of the current word. sync A;B t – binary variable that enforces a synchrony con- straint between subsets A and B of the feature set. It is observed with value 1; its distribution is con-

Feature-based Pronunciation Modeling for Speech Recognition Karen Livescu and James Glass MIT Computer Science and Artificial Intelligence Laboratory Cambridge, MA 02139, USA {klivescu, glass}@csail.mit.edu Abstract We present an approach to pronunciation modeling in which the evolution of multiple linguistic feature streams is explicitly represented. This differs from phone-based models in that pronunciation variation is viewed as the result of feature asynchrony and changes in feature values, rather than phone substitutions, insertions, and deletions. We have implemented a flexible feature-based pronunciation model using dynamic Bayesian networks. In this paper, we describe our approach and report on a pilot experiment using phonetic transcriptions of utterances from the Switchboard corpus. The experimental results, as well as the model’s qualitative behavior, suggest that this is a promising way of accounting for the types of pronunciation variation often seen in spontaneous speech. 1 Introduction Pronunciation variation in spontaneous speech has been cited as a serious obstacle for automatic speech recognition (McAllester et al., 1998). Typical pronunciation models approach this problem by augmenting a phonemic dictionary with additional pronunciations, often resulting from the application of phone substitution, insertion, and deletion rules. By carefully constructing a rule set (Hazen et al., 2002), or by deriving rules or variants from data (Riley and Ljolje, 1996), many phenomena can be accounted for. However, the recognition improvement over a phonemic dictionary is typically modest, and some types of variation remain awkward to represent. These observations have motivated approaches to speech recognition based on multiple streams of linguistic features rather than a single stream of phones (e.g., King et al. (1998); Metze and Waibel (2002); Livescu et al. (2003)). Most of this work, however, has focused on acoustic modeling, i.e. the mapping between the features and acoustic observations. The pronunciation model is typically still phone-based, limiting the feature values to the target configurations of phones and forcing them to behave as a synchronous “bundle”. Some approaches have begun to relax these constraints. For example, Deng et al. (1997) and Richardson et al. (2000) model asynchronous feature trajectories using hidden Markov models (HMMs), with each state corresponding to a vector of feature values. This approach is powerful, but it cannot represent independencies between features. Kirchhoff (1996), in contrast, models the feature streams as independent, except for a requirement that they synchronize at syllable boundaries. As pointed out by Ostendorf (2000), such independence assumptions may allow for too much variability. In this paper, we propose a more general featurebased pronunciation model implemented using dynamic Bayesian networks (Dean and Kanazawa, 1989), which allow us to take advantage of inter-feature independencies while avoiding overly strong independence assumptions. In the following sections, we describe the model and present proof-of-concept experiments using phonetic transcriptions of utterances from the Switchboard conversational speech corpus (Greenberg et al., 1996). 2 Serval [sic] examples To help ground the discussion, we first present several examples of pronunciation variation. One common phenomenon is the nasalization of vowels preceding nasal consonants. This is a result of asynchrony: The velum is lowered before the oral closure is made. In more extreme cases, the nasal consonant is entirely absent, leaving only a nasalized vowel, as in can’t → [ k ae n t ] 1 . All of the feature values are still correct, although phonetically, this would be described as a deletion. Another example, taken from the Switchboard corpus, is several → [s eh r v ax l]. In this case, the tongue and lips have desynchronized to the point that the tongue 1 Here and throughout, we use the ARPAbet phonetic symbol set with additional diacritics, such as “ n” for nasalization. retroflexion for [r] starts and ends before the lip narrowing gesture for [v]. Again, all of the feature streams are produced correctly, but there is an apparent exchange of two phones, which cannot be represented via singlephone confusions conditioned on phonemic context. A final example from Switchboard is everybody → [eh r uw ay]. It is difficult to imagine a set of phonetic transformations that would predict this pronunciation without allowing a host of other impossible pronunciations. However, when viewed in terms of features, the transformation from [eh v r iy bcl b ah dx iy] to [eh r uw ay] is fairly simple. The tongue and lips desynchronize, causing the lips to start to close for the [bcl] during the previous vowel. In addition, the lip constrictions for [bcl] and [v], and the tongue tip gesture for [dx], are reduced. We will return to this example in the sections below. index phoneme LIP-OPEN TT-LOC ... 1 v critical * ... 2 r wide ret. ... 3 iy wide alv. ... ... ... ... ... ... Table 1: Part of a target pronunciation for everybody. In this feature set, LIP-OPEN is the lip opening degree; TT-LOC is the location along the palate to which the tongue tip is closest (alv. = alveolar; ret. = retroflex). lexEntry t wdTr t ind 2t ind 1t ind 3t sync 1,2;3 =1 t sync 1;2 t =1 3 Approach A feature-based pronunciation model is one that explicitly models the evolution of multiple underlying linguistic feature streams to predict the allowed realizations of a word and their probabilities. Our approach begins with the usual assumption that each word has one or more target phonemic pronunciations, or baseforms. Each baseform is converted to a table of underlying feature values. Table 1 shows what part of this table might look like for the word everybody. The table may include “unspecified” values (‘*’ in the table). More generally, each table entry can be a distribution over the range of feature values. For now, we assume that all of the features go through the same sequence of indices (and therefore the same number of targets) in a given word; e.g., in Table 1, LIP-OPEN goes through the same indices as TT-LOC, although it has the same target value for indices 2 and 3. In the first time frame of speech, all of the features begin in index 0; in subsequent frames, each feature can either stay in the same index or transition to the next one with some probability. The surface feature values—i.e., the ones that are actually produced by the speaker—can stray from the underlying pronunciation in two ways, typically because of articulatory inertia: substitution, in which a feature fails to reach its target underlying value; and asynchrony, in which different features proceed through their sequences of indices at different rates. We define the degree of asynchrony between two sets of features as the difference between the average index of one set relative to the average index of the second. The degree of asynchrony is constrained: More “synchronous” configurations are more probable (soft constraints), and we make the further simplifying assumption that there is an upper bound on the degree of asynchrony (hard constraints). A natural framework for such a model is provided by dynamic Bayesian networks (DBNs), because of their 0 eh wide alv. ... 1 Ut Ut Ut S1t S 2t S 3t 2 3 Figure 1: One frame of a DBN for recognition with a feature-based pronunciation model. Nodes represent variables; shaded nodes are observed. Edges represent dependencies between variables. Edges without parents/children point from/to variables in adjacent frames (see text). ability to efficiently implement factored state representations. Figure 1 shows one frame of the type of DBN used in our model (simplified somewhat for clarity of presentation). This example DBN assumes a feature set with three features. The variables at time frame t are as follows: lexEntryt – entry in the lexicon corresponding to the current word and baseform. Words with multiple baseforms have one entry per baseform. lexEntryt ’s parents are lexEntryt−1 and wdTrt−1 indjt – index of feature j into the underlying pronunciation, as in Table 1. indj0 = 0; in subsequent frames indjt is conditioned on lexEntryt−1 , indjt−1 , and wdTrt−1 (defined below). Ujt – underlying value of feature j. Its distribution p(Utj |lexEntryt , indjt ) is determined by the target feature table of lexEntryt . Sjt – observed surface value of feature j. p(Stj |Utj ) encodes allowed feature substitutions. wdTrt – binary variable indicating whether this is the last frame of the current word. synctA;B – binary variable that enforces a synchrony constraint between subsets A and B of the feature set. It is observed with value 1; its distribution is con- structed in such a way as to force its parent ind variables to obey the desired constraint. For example, to enforce a constraint between the average index of features 1 and 2 and the index of feature 3, we would have P (sync1,2;3 = 1|ind1t , ind2t , ind3t ) = 0 whent 1 2 ever indt , indt , ind3t violate the constraint. In an end-to-end recognizer, the acoustic observations would depend on the Sjt , which would be unobserved. However, to facilitate quick experimentation and isolate the pronunciation model, we begin by testing how well we can do when given observed surface feature values. 4 Experiments We have performed a pilot experiment using the following feature set, based on the vocal tract variables of articulatory phonology (Browman and Goldstein, 1992): degree of lip opening; tongue tip location and opening degree; tongue body location and opening degree; velum state; and glottal (voicing) state. We imposed the following synchrony constraints: (1) All four tongue features are completely synchronized; (2) the lips can desynchronize from the tongue by up to one index; and (3) the glottis and velum are synchronized, and their index must be within 2 of the mean index of the tongue and lips. We used the Graphical Models Toolkit (Bilmes and Zweig, 2002) to implement the model. The distributions p(Stj |Utj ) were constructed by hand based on linguistic considerations, e.g. that features tend to go from more “constricted” values to less constricted ones, but not vice versa. p(Utj |lexEntryt , indjt ) was derived from manually-constructed phoneme-to-featureprobability mappings. For these experiments, no parameter learning has been done. The task was to recognize an isolated word, given a set of observed surface feature sequences Sjt . To create the observations, we used the detailed phonetic transcriptions created at ICSI for the Switchboard corpus (Greenberg et al., 1996). For each word, we converted its transcription to a sequence of feature vectors, one vector per 10 ms frame. For this purpose, we divided diphthongs and stops into pairs of feature configurations. Given the input feature sequences, we computed a Viterbi score for each lexical entry in a 3000+-word (5500+-lexEntry) vocabulary, by “observing” the lexEntry variable and finding the most likely settings of all remaining variables. The most likely variable settings can be thought of as a multistream alignment between the surface and underlying feature streams. Finally, we output the word corresponding to the highest-scoring lexical entry. We performed this procedure on a development set of 165 word transcriptions, which was used to tune settings such as synchronization constraints, and a test set of 236 transcriptions 2 . We compared the performance of several models, measured in terms of word error rate (WER) and failure rate (FR), the percentage of inputs that had no Viterbi alignment with the correct word. To get a sense of the effect of feature asynchrony, we compared our asynchronous model with a version in which all features are forced to be synchronized, so that only feature substitution is allowed. This uses the same DBN, but with degenerate distributions for the synchronization variables. Also, since the Sj values are derived from phonetic transcriptions, and are therefore constant over several frames at a time, we also built a variant of the DBN in which Sj is allowed to change value with non-zero probability only when indj changes (by adding parents indjt , indjt−1 , Sjt−1 to Sjt ); we refer to this DBN as “segment-based”, and to the original as “frame-based”. We compared four variants, differing along the “synchronous vs. asynchronous” and “frame-based vs. segment-based” dimensions. The variant which is both synchronous and segment-based is similar to a phone-based pronunciation model with only context-independent phone substitutions. model baseforms only phonological rules sync. seg.-based sync. fr.-based async. seg.-based async. fr.-based dev set WER FR 63.6 61.2 50.3 47.9 38.2 24.8 35.2 23.0 32.7 19.4 29.7 16.4 test set WER FR 69.5 66.9 59.7 55.5 43.2 35.2 46.2 31.4 41.1 31.4 42.7 26.3 Table 2: Results of Switchboard ranking experiment. Table 2 shows the performance of these four models, as well as of two “baseline” models: one allowing only the baseform pronunciations (on average 1.7 per word), and another including all pronunciations produced by an extensive set of context-dependent phonological rules (about 4 per word), with no feature substitutions or asynchrony in either case. The phonological rules are the “full rule set” described in Hazen et al. (2002). We note that they were not designed with Switchboard in mind. The models that allow asynchrony outperform the ones that do not, in terms of both WER and FR. Looking more closely at the performance on the development set, the inputs on which the synchronous models failed but the asynchronous models succeeded were in fact the kinds of pronunciations that we expect to arise from feature asynchrony, including: nasals replaced by nasalization on a preceding vowel; a /t r/ sequence realized as /ch/; and everybody → [eh r uw ay]. The relative merits of the frame-based and segment-based models is less clear, as 2 We required that words in the development and test sets have phonemic pronunciations with at least 4 phonemes, so as to limit context effects from adjacent words. they have opposite relative performance on the development and test sets. For 27 (16.4%) development utterances, none of the models was able to find an alignment with the correct word. Most of these were due to apparent gesture deletions and context-dependent feature changes, which are not yet included in the model. Figure 2 shows a part of the Viterbi alignment of everybody with [eh r uw ay], produced by the segmentbased, asynchronous model. Using this model, everybody was the top-ranked word. As expected, the asynchrony is manifested in the [uw] region, and the lips do not close but reach only a narrow (glide-like) configuration. verting detailed phonetic transcriptions, as we have done in our decoding experiments above. The sync distributions cannot be trained via EM, since they are always observed with value 1. They can either be treated as experimental parameters or trained discriminatively. We are currently working on a new formulation in which the synchronization constraints can be trained via EM. In addition, we are currently investigating extensions to the model, including context-dependent feature substitutions. We also plan to extend this study to a larger data set and to multi-word utterances. References Figure 2: Spectrogram, transcription, and partial Viterbi alignment, including the lip opening and tongue tip location variables. Indices are relative to the underlying pronunciation /eh v r iy bcl b ah dx iy/. Adjacent frames with equal values have been merged for easier viewing. WI = wide; NA = narrow; CR = critical; CL = closed; ALV = alveolar; P-A = palato-alveolar; RET = retroflex. 5 Discussion We have motivated our pronunciation model as part of an overall strategy of feature-based speech recognition. One way in which this model could fit into a complete recognizer is, as mentioned above, by adding a variable A representing the acoustic observations, with the S j as its parents. The modeling of p(A|S 1 , . . . , S M ) (where M is the number of features) is a significant problem in its own right. Alternatively, as this study suggests, there may be some benefit to this type of model even if the acoustic model is phone-based. One possible setup would be to use a phonetic recognizer to produce a phone lattice, then convert the phones into features and proceed as in our Switchboard experiments. Thus far we have not trained the variable distributions. With the exception of the sync variables, these can be trained from feature transcriptions (i.e. S j observations) using the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). In the absence of actual feature transcriptions, they can be approximated by con- J. Bilmes and G. Zweig, “The Graphical Models Toolkit: An open source software system for speech and time-series processing,” ICASSP, Orlando, 2002. C. P. Browman and L. Goldstein, “Articulatory phonology: An overview,” Phonetica, 49:155–180, 1992. T. Dean and K. Kanazawa, “A model for reasoning about persistence and causation,” Computational Intelligence, 5:142– 150, 1989. A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society, 39:1–38,1977. L. Deng, G. Ramsay, and D. Sun, “Production models as a structural basis for automatic speech recognition,” Speech Communication, 33:93–111, 1997. S. Greenberg, J. Hollenback, and D. Ellis, “Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus,” ICSLP, Philadelphia, 1996. T. J. Hazen, I. L. Hetherington, H. Shu, and K. Livescu, “Pronunciation modeling using a finite-state transducer representation,” ITRW PMLA, Estes Park, CO, 2002. S. King, T. Stephenson, S. Isard, P. Taylor, and A. Strachan, “Speech recognition via phonetically featured syllables,” ICSLP, Sydney, 1998. K. Kirchhoff, “Syllable-level desynchronisation of phonetic features for speech recognition,” ICSLP, Philadelphia, 1996. K. Livescu, J. Glass, and J. Bilmes, “Hidden feature models for speech recognition using dynamic Bayesian networks,” Eurospeech, Geneva, 2003. D. McAllester, L. Gillick, F. Scattone, and M. Newman, “Fabricating conversational speech data with acoustic models: A program to examine model-data mismatch,” ICSLP, Sydney, 1998. F. Metze and A. Waibel, “A flexible stream architecture for ASR using articulatory features,” ICSLP, Denver, 2002. M. Ostendorf, “Incorporating linguistic theories of pronunciation variation into speech-recognition models,” Phil. Trans. R. Soc. Lond. A, 358:1325-1338, 2000. M. Richardson, J. Bilmes, and C. Diorio, “Hidden-articulator Markov models for speech recognition,” ITRW ASR2000, Paris, 2000. M. D. Riley and A. Ljolje, “Automatic generation of detailed pronunciation lexicons,” in C.-H. Lee, F. K. Soong, and K. K. Paliwal (eds.), Automatic Speech and Speaker Recognition, Kluwer Academic Publishers, Boston, 1996.

硕士学位认证报告☀《佛蒙特大学毕业证购买》微信95270640《UVM毕业证模板办理》文凭、本科、硕士、研究生学历都可以做,《文凭UVM毕业证书原版制作UVM成绩单》《仿制UVM毕业证成绩单佛蒙特大学学位证书pdf电子图》毕业证如果您是以下情况，我们都能竭诚为您解决实际问题：【公司采用定金+余款的付款流程，以最大化保障您的利益，让您放心无忧】 1、在校期间，因各种原因未能顺利毕业，拿不到官方毕业证+微信95270640 2、面对父母的压力，希望尽快拿到佛蒙特大学佛蒙特大学硕士毕业证成绩单； 3、不清楚流程以及材料该如何准备佛蒙特大学佛蒙特大学硕士毕业证成绩单； 4、回国时间很长，忘记办理； 5、回国马上就要找工作，办给用人单位看； 6、企事业单位必须要求办理的；面向美国乔治城大学毕业留学生提供以下服务: 【★佛蒙特大学佛蒙特大学硕士毕业证成绩单毕业证、成绩单等全套材料，从防伪到印刷，从水印到钢印烫金，与学校100%相同】【★真实使馆认证（留学人员回国证明），使馆存档可通过大使馆查询确认】【★真实教育部认证，教育部存档，教育部留服网站可查】【★真实留信认证，留信网入库存档，可查佛蒙特大学佛蒙特大学硕士毕业证成绩单】我们从事工作十余年的有着丰富经验的业务顾问，熟悉海外各国大学的学制及教育体系，并且以挂科生解决毕业材料不全问题为基础，为客户量身定制1对1方案，未能毕业的回国留学生成功搭建回国顺利发展所需的桥梁。我们一直努力以高品质的教育为起点，以诚信、专业、高效、创新作为一切的行动宗旨，始终把“诚信为主、质量为本、客户第一”作为我们全部工作的出发点和归宿点。同时为海内外留学生提供大学毕业证购买、补办成绩单及各类分数修改等服务；归国认证方面，提供《留信网入库》申请、《国外学历学位认证》申请以及真实学籍办理等服务，帮助众多莘莘学子实现了一个又一个梦想。专业服务，请勿犹豫联系我如果您真实毕业回国，对于学历认证无从下手，请联系我，我们免费帮您递交诚招代理：本公司诚聘当地代理人员，如果你有业余时间，或者你有同学朋友需要，有兴趣就请联系我你赢我赢，共创双赢你做代理，可以帮助佛蒙特大学同学朋友你做代理，可以拯救佛蒙特大学失足青年你做代理，可以挽救佛蒙特大学一个个人才你做代理，你将是别人人生佛蒙特大学的转折点你做代理，可以改变自己，改变他人，给他人和自己一个机会

Log In

Feature-based pronunciation modeling for speech recognition