The University of Melbourne
School of Languages and Linguistics
Honours Thesis
November 2013

The Effect of Respeaking on Transcription Accuracy

Mat Bettinson

Supervisors: A/Prof. Lesley Stirling & A/Prof. Steven Bird

Acknowledgements

I would like to thank the teachers and mentors who guided me to this point. I would not be engaged in this research if it were not for a number of influential figures who inspired me along the way. Among them, Paul Gruba’s no-nonsense clarity, and ‘Gruba manoeuvre’ hand gesture, helpfully addressed: “What are we really doing here?”

Of my supervisors, I must thank Lesley Stirling for inspiring with engaging human stories in linguistic research and providing advice and encouragement when it was needed most. My thanks also to Steven Bird for straddling the gulf between linguistics and computational disciplines and for introducing me to the ‘perfect storm’ of possibilities in language and technology.

I must also express considerable gratitude to the participants in this study including my two speakers of Nepali but most especially the four transcription participants. Having generously agreed to volunteer many hours of their valuable time, they also stuck at it when it became clear I had tragically underestimated the workload. Majanmajanba!

I’m also grateful for the comments from the examiners of this thesis. Their detailed comments guided this revised thesis.

Above all, eternal gratitude to my ever-patient wife Jeannine who has borne no small burden of hardship in supporting my scholarship.

1. Introduction
  1.1 Overview
  1.2 Motivation: documentation and problems of scale
  1.3 BOLD – a scalable method
  1.4 Respeaking as field method
  1.5 A thought experiment: the future philologist
  1.6 Aims and research questions
2. Literature review
  2.1 Overview
  2.2 Transcription – description of speech sounds
  2.3 H&H theory
  2.4 Clear speech
  2.5 Types and levels of noise
  2.6 Ramifications for respeaking tasks
3. Method
  3.1 Overview
  3.2 Nepali
  3.3 Prediction of errors
  3.4 Participants
  3.5 Procedure
    3.5.1 The Aikuma application
    3.5.2 Data collection
    3.5.3 Processing the data
    3.5.4 Experimental conditions
    3.5.5 Data validation and volume
    3.5.6 Reference transcription
    3.5.7 Transcription activity
  3.6 Measuring accuracy
    3.6.1 Edit distance
    3.6.2 Improved phonetic edit distance
    3.6.3 Summary
4. Results
  4.1 Overview
  4.2 Clear speech in Nepali
    4.2.1 Durations and speaking rate
    4.2.2 Expansion of vowel space
  4.3 Transcription metrics
  4.4 Statistical analysis
    4.4.1 Reliability of measures
    4.4.2 T-tests of independent variable: respeaking
    4.4.3 T-tests of independent variable: noise
    4.4.4 Assessing the interaction of noise and respeaking
  4.5 Analysis of common errors
  4.6 The impact of noise
  4.7 Summary of Results
5. Discussion
  5.1 Overview
  5.2 Addressing the research questions
    5.2.1 Respeaking and transcription accuracy
    5.2.2 Respeaking effect on error types
    5.2.3 Contribution of clear speech vs. noise
  5.3 Limitations of Study
  5.4 Transcription differences: errors or choices?

1. Introduction

1.1 Overview

In linguistic fieldwork, language consultants are sometimes asked to repeat speech that was spontaneously recorded. Careful speech is often beneficial for analysis and transcription.
New methods in language documentation promise improvements in efficiency by utilising such spoken annotations to create a written transcription. Consequently, it may be possible to defer the written transcript so that the work does not need to be done in the field. This research examines the impact of respeaking on transcription accuracy when used as part of an emerging digital method in language documentation.

1.2 Motivation: documentation and problems of scale

At the 1991 LSA symposium on endangered languages, the rate of language loss was described as a ‘crisis’. Half of the world’s 6,000 languages may already be moribund and no longer being learned by children (Krauss, 1992). Arguments for the value of human languages, and the resulting tragedy of their loss, are bountiful. Hale said languages are the “priceless products of human mental industry” and that their loss represents an “irretrievable loss of diverse and interesting intellectual wealth” (1992, p. 36). Evans and Levinson (2009) described linguistic diversity as a ‘laboratory of variation’ with 6,000 natural experiments in evolving communicative systems, each offering opportunities to explore the nature of human cognition. Speakers themselves often regard the loss of their language as a “loss of identity, and as a cultural, literary, intellectual, or spiritual severance from ancestors, community and territory” (Woodbury, 2003, p. 4).

The field of linguistics has struggled to find an effective response to the urgent need to describe the intellectual wealth of endangered languages. Contributing factors include a lack of focus on languages in cultural context (Hale, 1992) and the apparent lack of will to engage in linguistic fieldwork (Newman, 1998)¹.

¹ Newman pulled no punches: “Linguists claim to be concerned about the endangered languages issue. In reality, nothing substantial is being done about it.” (1998, p. 11).
In response to this crisis, documentary linguistics emerged as a sub-field for constructive action on the challenge of language endangerment. Nikolaus Himmelmann’s (1998) founding treatise argued for the documentation of language as a separate and distinct field from descriptive linguistics. Documentary linguists would focus on recording natural language rather than the narrower descriptive output of the traditional grammar and dictionary. The rise of documentary methods has also coincided with the cross-over between the paper-based era and the digital era (Bird & Simons, 2003). Storage is now effectively unlimited and virtually lossless in quality but critical limitations remain. Chief among them is the reliance on highly trained linguists performing a wide variety of tasks in often challenging fieldwork conditions. There’s a worldwide shortage of trained linguists, much less, as Newman noted, linguists engaged in fieldwork.

Liberman (2006) highlighted the need to scale up language documentation efforts to meet the challenge of endangerment. A corpus of around ten million words would be necessary for reasonable coverage of various aspects of language. This corresponds to around 2,000 hours of audio recordings² which would need transcribing, at least two orders of magnitude beyond the volume of transcription usually undertaken in linguistic description. When it comes to endangered languages, no corpus of a tenth of that size (one million words) yet exists in a machine-readable context (Abney & Bird, 2010). This gulf between our capacity to capture primary data and our methods for transcription and analysis is arguably the most pressing challenge facing documentary linguistics today.

1.3 BOLD – a scalable method

Woodbury’s fieldwork on the central Alaskan language Cup’ik in the 1970s produced a substantial quantity of audio cassette tapes.
Conscious of the limited remaining time with elderly speakers of Cup’ik, the decision was made to skip interlinear glosses in favour of ‘running UN style translations’ (Woodbury, 2003, p. 45). More radical still, Woodbury suggested that not all material would be transcribed. Instead, speakers would be asked to ‘respeak’ recordings slowly and clearly so that anyone with ‘training in the language’ could provide a transcription if they wished. Reiman (2010) cited this example as an inspiration to develop a new audio-only based methodology called Basic Oral Language Documentation. BOLD describes a method whereby a new recording interleaves the original spontaneous recording with respoken ‘oral annotations’. In the first instance, these annotations would be the same slow and careful respeaking, with the same goal of facilitating a deferred transcription, one that might be undertaken immediately on return from fieldwork or by any interested party in the future.

² Liberman’s abstract of his 2006 talk at the Texas Linguistic Society stated 50,000 hours, which he has since described as an error, revising the figure to 2,000 hours in a Language Log post in February 2010: http://languagelog.ldc.upenn.edu/nll/?p=2099

Not only is transcription time-consuming but it can only be performed by highly trained people. Free of this bottleneck, researchers may recruit paralinguists with minimal training to perform as many documentary events in parallel as willing participants and available equipment will allow. On that basis BOLD is one of the first methods in language documentation that may be categorised as a ‘scalable’ method³. A review of BOLD in six different fieldwork projects concluded that BOLD corpora should be made a funding priority and the methodology taught in all fieldwork courses (Boerger, 2011).
The essential method of BOLD can be thought of as phrase-aligned audio segments where participants record natural speech and then make additional recordings which are time aligned with the natural speech event. These tasks have proven suitable for automation in the Aikuma smartphone application (Bird & Hanke, 2013). Fieldwork trials in Papua New Guinea (http://www.boldpng.info) and the Brazilian Amazon demonstrated that language consultants found the system intuitive to use despite their limited exposure to digital technology. Mobile technologies such as these offer the chance to ‘crowdsource’ potentially rich collections of linguistic data. It may be that the best response to the pace of language loss lies in interdisciplinary programmes with the development of software tools informed by research on the next generation of digital linguistic field methods.

³ Removing the natural language specification, collaborative compilation of dictionaries is perhaps the only other scalable method in language documentation.

1.4 Respeaking as field method

Respeaking has been used as a linguistic field method since well before the introduction of BOLD. In a three-volume guide to documenting languages of oral tradition first published in the 1970s, Bouquiaux & Thomas (1992) described a formalised method for producing transcriptions of unwritten languages. After recording spontaneous speech, the language consultant “repeats each sentence fairly slowly, so that it can be transcribed as he dictates” (p. 181). The transcription was then re-recorded in the same manner, thus creating an audio recording of the slow speech. The Bouquiaux & Thomas method was used to directly assist the production of a transcription as well as to provide an audio recording with the same properties of speech for additional analytical purposes such as consulting the field transcript and further developing phonological hypotheses on the language.
Reflective of general practice in linguistic fieldwork, the production of the transcript was considered primary and hence embedded within the method. The consultative process of respeaking and transcribing adds considerably to the time and the patience required of language consultants.

The respeaking step in BOLD is referred to as the process of producing spoken annotations (Reiman, 2010, p. 256). Three main varieties of oral annotation were proposed, the first being ‘careful speech’ of the same type as employed in the Bouquiaux & Thomas method. The second annotation is a phrase level translation into a language of wider communication and the third consists of analytical comments. These may include material such as unspoken but implied information, description of gestures and cultural knowledge. Discussing the benefits of these additional spoken annotations is beyond the scope of this work but it will suffice to note that the time saved in the laborious transcription phase provides an opportunity to record a great deal of additional information via further spoken annotations.

Another attested motivation for the respeaking method is that respeaking regenerates an older recording into a fresh recording. As Woodbury described in the Cup’ik documentation project, respeaking material was prioritised for ‘hard-to-hear’ audio cassettes. Thus in re-recording linguistic data there exists the opportunity to improve the quality of the recording. The BOLD:PNG project suggested that respeaking should be undertaken in a “quiet place away from background noise and interruptions” (http://www.boldpng.info/bold/stage2). By ensuring that the subjects are loud enough to be heard and free of unwanted noises to the greatest possible extent, audio quality improvements may be realised in the respeaking recording. Quality improvements may also come about due to differences in equipment used in the field and the equipment used in respeaking.
Given the pace of change in technology, respeaking older recordings may jump across generational change in recording methods. Fieldwork recordings made on analogue equipment can, where speakers are available, be regenerated into recordings in the digital domain. Digital recordings under appropriate storage conditions have a virtually unlimited shelf life. Even where these recordings are made in the same technological time frame, other equipment factors exist such as the difference in recording quality between smartphones and professional audio recording devices. Whether quality improvements come from reduction in ambient noise or improvements in equipment and technique, both result in an increase in the signal-to-noise ratio (SNR).

Considering the benefit of respeaking fieldwork methods for producing transcriptions, the ‘regeneration’ of these recordings provides two potential improvements:

1. Greater intelligibility of careful speech
2. A boost in signal-to-noise ratio (SNR)

What is not clear is the degree of contribution that each of these makes towards the benefits of ‘slow, careful’ respeaking, particularly when employed in the context of linguistic transcription by non-native speakers.

1.5 A thought experiment: the future philologist

“Part of my technical input as a linguist is to make guesses about what the ‘philologist in 500 years’ is going to need,” (Woodbury, 2003, p. 45)

Himmelmann suggested that language documentation ought to be a lasting, multipurpose record of a language (Himmelmann, 2006b). The multipurpose qualifier is intended to suggest a broader audience for language documentation, such as alternative research disciplines and the speech community themselves. The argument is that we should capture as many and varied instances of natural language as possible because we can’t anticipate the needs of all possible stakeholders.
As we enter the age of almost limitless digital storage, it’s increasingly difficult to argue that we shouldn’t be capturing everything we can. However, the practical reality is that documentation stakeholders have different ideas about what ought to be captured and for what purpose. Woodbury suggests that linguists should anticipate the needs of our future selves, or at least those with similar interests.

Even with a reduced scope of linguistic enquiry, there is an absolute requirement to make language documentation a lasting record. We will need orthographic and glossing conventions, linguistic and ethnographic annotations and metadata to assist in identifying primary data and so on. Acknowledging the rising role of technology in this task, Bird & Simons (2003) warned against adopting ‘moribund’ technology such as proprietary software and file formats which themselves may result in endangered data. More broadly, Abney & Bird (2010) suggest that we may have attained some measure of success for a universal corpus of human language if we can freely translate between languages. They further described documentation of individual languages as a pyramid structure where a smaller subset of annotated material sits above a larger volume of unannotated material. One of the key decisions to be made will be how much material should be annotated or transcribed and in what detail. By the same token, considering the oral annotations of BOLD, how much of each type of annotation do we need?

At the ten-year anniversary of Himmelmann’s seminal language documentation proposal, Evans (2008) warned against becoming ‘documentarist fundamentalists’ and that merely recording material without shaping an ‘evolving analysis’ would deprive future linguists of key data.
In this paper I adopt the stance that we are exploring ways to complement the essential and fundamental fieldwork of evolving analysis with larger volumes of audio material to help meet the challenges of scale that Liberman identified. Whether of a documentarist or descriptivist viewpoint, linguists of both persuasions are in agreement that we ought not deprive future generations of key data where possible (Himmelmann, 2011). Research into the effectiveness of documentary methods needs to anticipate the needs of the ‘future philologist’ to the extent we can guess. This paper also adopts the position that researching more efficient methods in language documentation is not optional. It is, in fact, at least as important in ensuring that data isn’t lost to future generations.

In summary, the ‘future philologist’ principle orientates us towards the needs of language researchers in the future. With revolutions in digital technology enabling linguists to capture and analyze more data than ever before (Evans, 2009, p. xix) we are now considering recording potentially thousands of hours of audio material. Thousands more may be required for the respeaking task as part of the BOLD method. While the essential scalability of these methods frees the field linguist from every hour of work, it would be naive to suggest that there are no resource costs in deploying these techniques. They do not remove the need for detailed description at the top of the ‘pyramid’ to make sense of the large volume of untranscribed material. The challenge is therefore the search for an appropriate balance in traditional and scalable methods in linguistic fieldwork. The goal of creating lasting, multipurpose records of a language implies the need to evaluate methods, where possible, from the end-user perspective of future generations.

1.6 Aims and research questions

To date there has been no specific investigation of the impact of respeaking in transcription.
The central aim of this research is to assess the value of respeaking in language documentation methods. Therefore, given the scenario of the future philologist working on BOLD-based documentation of a no longer spoken language:

• Does the availability of respoken audio improve transcription accuracy?
• If so, can we observe improvements in particular types of transcription errors?
• To what extent can these be attributed to ‘careful speech’ or lower noise?

This study will also consider the use of the latest digital methods with data capture carried out using a smartphone application currently being developed.

2. Literature review

2.1 Overview

There are three main areas of relevant literature for this research. The evolution of phonetic and phonemic transcription will be briefly discussed in section 2.2. Spontaneous speech and careful speech can be understood to occupy points on an articulatory continuum. Lindblom’s (1990) hyper and hypo speech (H&H) theory describes such a continuum of phonetic variation and is discussed in section 2.3. According to H&H theory, careful speech from the BOLD respeaking task may be classified as hyperspeech but the literature on this phenomenon prefers the term ‘clear speech’. Previous literature on clear speech is discussed in section 2.4. Types and levels of noise used to degrade audio recordings in these studies are discussed in section 2.5. Finally, in section 2.6, the ramifications of the literature will be discussed with relevance to respeaking in linguistic fieldwork.

2.2 Transcription – description of speech sounds

Pike (1943) argued that transcription of speech sounds can and should be undertaken in such a way as to capture the full range of articulations theoretically possible in any language. This type of transcription, i.e. one that takes no account of the language being transcribed, has also been called an impressionistic transcription (Abercrombie, 1964).
The field has long debated the exact choice of symbols and perhaps the IPA’s guidelines and symbol set (International Phonetic Association, 1949) have at least built some consensus in the most common speech sounds. Additional symbols or diacritics are often necessary to capture a fully impressionistic description of a language. As one might expect, even the relatively large symbol set of the IPA is the product of scholarship on particular families of languages. Roach (1987) pointed out that the influence of European languages resulted in the ‘lumping together’ (p. 28) of dental, alveolar and post-alveolar as the same ‘place’ on the IPA chart.

Discrete symbols are not the only way to categorise speech sounds. With the exception of a subset of diacritical marks in IPA notation, alphabets of symbols have the drawback that they do not inherently describe the state of the articulators and the manner in which they are articulated. To begin with, Jakobson, Fant and Halle (1951) classified sounds by acoustic properties rather than the state of the oral articulators. Following the advent of generative phonology (Chomsky & Halle, 1968), distinctive features have found use describing natural classes as a matrix of largely⁴ binary features which capture both the state of articulators and some acoustic features. For example, /m/ would be represented as [+voice, -continuant, +nasal, +sonorant, +labial]. While a full list of distinctive features for any given speech sound would be quite long, typically only the features that have changed are analysed. For example, the fricative / / → /s/ can be represented as an alteration in place features where [-anterior, -high] → [+anterior, -high], assuming all the other features such as [+fricative] remain unchanged.

When it comes to transcription activity, symbolic systems are necessary to reduce the amount of information to something manageable.
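As a rough illustration of the feature-matrix idea described above, the following Python sketch stores each segment as a set of binary feature values and lists only the features that differ between two segments. The /m/ entry is the example given in the text; the /b/ entry, the +1/-1 encoding and the function names are assumptions made for illustration, not the feature set or tooling used in this study.

```python
# Feature values are +1 or -1; a missing key would mean "not specified".
# "m" follows the example in the text; "b" is an illustrative assumption.
FEATURES = {
    "m": {"voice": 1, "continuant": -1, "nasal": 1, "sonorant": 1, "labial": 1},
    "b": {"voice": 1, "continuant": -1, "nasal": -1, "sonorant": -1, "labial": 1},
}

SIGN = {1: "+", -1: "-"}


def format_matrix(segment):
    """Render a segment's features in bracket notation, e.g. [+voice, ...]."""
    pairs = FEATURES[segment].items()
    return "[" + ", ".join(f"{SIGN[value]}{name}" for name, value in pairs) + "]"


def changed_features(source, target):
    """Return only the features whose values differ between two segments.

    The result maps each feature name to a (source value, target value) pair.
    """
    src, tgt = FEATURES[source], FEATURES[target]
    names = sorted(set(src) | set(tgt))
    return {n: (src.get(n), tgt.get(n)) for n in names if src.get(n) != tgt.get(n)}


print(format_matrix("m"))
print(changed_features("m", "b"))
```

Under these assumed values, /m/ and /b/ differ only in [nasal] and [sonorant], so only those two features would need to be analysed when comparing them.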
The IPA with the wide array of diacritics is generally sufficient for most impressionistic transcriptions. Fine phonetic transcription describes speech sounds with great precision, complete with allophonic variation. By definition such detail is not necessary to capture the language for speakers of the language. Bloomfield argued that such detailed phonetic level transcriptions would always be subjective and arbitrary (1933, pp. 84-5). Chomsky and Halle saw language-independent transcriptions as meaningless, arguing instead that transcription should be viewed as a continuum between the ideal of a broad phonemic transcription with the least detail at one end and a narrow phonetic transcription with the most detail at the other end.

Bloomfield is clearly right to the degree that impressionistic transcription is subjective on the part of the transcriber. How broad or narrow their transcription ends up being will depend on a range of factors such as whether they can perceive narrow phonetic detail or whether they think it’s relevant. A key challenge for this research is in comparing these subjective transcriptions. Distinctive features are particularly relevant here since it’s possible to convert the symbolic IPA representation into lists of distinctive features. Computational methods can compare these lists so that we’re able to assess how phonetically similar one symbol (in a transcription) is to another. This technique is discussed in more detail in section 3.6.

⁴ Largely because widely published lists of features such as Hayes’s (2009) are based on trinary values where a dash or other symbol means ‘not specified’.

2.3 H&H theory

The wide variability in natural speech, particularly fast speech, has been noted widely (Dalby, 1986; Greenberg, 1999). Lindblom’s H&H theory accounts for the ‘invariance problem’ where speech articulations vary to such an extent that it is difficult to provide a consistent phonetic definition.
The same speaker may even produce a continuum of variation motivated by communicative needs. H&H theory suggests we may view this continuum with listener-oriented clarity at one end (hyper speech) to talker-oriented economy of effort at the other (hypo speech). If a speaker believes that the listener will have difficulty understanding, such as in a noisy environment or a listener with comprehension issues, the speaker will ‘tune’ their performance. Slowing the rate of speech and making other changes is often described as speaking ‘clearly’, and speech produced in this way is described in the literature as ‘clear speech’. A typical scenario might involve communicating with an elderly hard-of-hearing relative. Aside from the more obvious volume modification, clear speech exhibits a reorganisation of articulatory gestures with resulting acoustic properties (Moon & Lindblom, 1989), motivated to enhance phonemic contrasts.

The discrimination between possible speech segments is guided by knowledge of the language in a signal-complementary process. Speakers estimate an appropriate trade-off between hyper speech and hypo speech by also estimating the contribution of this signal-complementary process in the listener. This process is inherently tied to knowledge of the language. Luce (1986) showed that the probability of listeners recognising words was influenced by a number of factors including the frequency of the word and the similarity of pronunciation with other words. Functionalist accounts suggest we may view these effects as evidence of usage-based patterns or schemes emerging to increase generalisation and ease of access (Bybee, 1999). The effectiveness of language-informed signal-complementary processes in speech comprehension is such that native speakers can recognise words with highly reduced consonants. Context-in-discourse also plays a role in the signal-complementary process. Ernestus et al.
(2002) showed a strong negative correlation between consonant reduction and intelligibility where words had fewer contexts to disambiguate the lexical item.

H&H theory suggests that the phonetic variation in natural speech can be explained as a ‘tug-of-war’ between opposing motivations of speaker-oriented factors (economy of effort) and listener-oriented factors (achieving comprehension). Furthermore the variation in sound systems of different languages means that the properties of speech ‘tuning’ also vary between languages. In the context of clear speech in linguistic fieldwork, language consultants are unlikely to be able to estimate the difficulties of speech sound discrimination in listeners. A speaker’s effort to improve discrimination between speech sounds will be motivated by language-aware signal-complementary issues as well as acoustic phonetic properties of the language. Some of the properties of movement along the H&H continuum would seem to be universal, such as slowing of speech rate and enhancing word segmentation but the full pattern of phonetic variation is considerably more complex.

2.4 Clear speech

Of goal-oriented speaking styles, clear speech is uniquely oriented towards enhancing intelligibility (Smiljanić & Bradlow, 2009). As a natural consequence, the literature on clear speech has focused on intelligibility gains in varying conditions. Picheny, Durlach, & Braida (1985) showed that clear speech delivered nearly 20% intelligibility improvement in hearing-impaired listeners. Perhaps unsurprisingly, non-native listener comprehension has not demonstrated the same benefit. Bradlow & Bent (2002) found that comprehension gains for non-native speakers of English were less than a third of those for native speakers.

Studies in the clear speech literature have relied on participants reading the same set of materials in some manner of clear speech.
Participants were asked to speak as if they were conversing with someone who is foreign or who has hearing difficulties (Picheny et al., 1985; Schum, 1996). These studies found wide variation in the properties of the clear speech produced, with correspondingly wide variation in comprehension gains. This may well be an artifact of participants forming their own interpretation of an appropriate level of clear speech. More recently, Wassink et al. (2006) contrasted different varieties of clear speech, including Infant-Directed Speech (IDS) and so-called Lombard speech, in Jamaican speakers of Creole and English. The Lombard effect is the observation that speakers modify their speech production in noisy environments. Wassink et al. found that not all forms of clear speech demonstrate the same acoustic modifications but again found it expedient to describe the different types of clear speech along a continuum in line with H&H theory.

Some of the acoustic metrics that have been used to examine clear speech include speaking rate, duration of speech sound segments, pauses, fundamental frequency and vowel formant frequencies, particularly the vowel ‘space’ represented by the limits of formants occurring in different modes of speech. Moon and Lindblom (1994) found that, independent of speech rate, faster formant transitions were suggestive of faster articulations in clear speech. They suggested that this reorganisation was motivated by avoiding coarticulatory effects and resulting target undershoots. Liu and Zeng (2006) explored the temporal reorganisation behind clear speech comprehensibility gains by modifying casual speech in two ways: stretching it to the same length as clear speech, and inserting gaps to attain the same length. Gaps were found to provide a superior comprehension benefit. They concluded that the beneficial acoustic cues resulting from temporal changes in clear speech were ‘multiple and distributed’.
2.5 Types and levels of noise

Assuming that respeaking aids transcription, the third research question concerns the extent to which accuracy gains can be attributed to the properties of clear speech and to what extent gains can be attributed to audio quality improvements. This suggests an experimental method where respeaking of a noisy recording is compared with respeaking of a clean recording. Practical considerations dictate the choice of a single level of signal-to-noise ratio (SNR) in this study, stemming from the size of the dataset required for statistically significant results. The choice of the noise level and the type of noise deserves explanation in the context of the relevant literature.

As we have seen, one of the motivations of respeaking is to ‘regenerate’ a recording so that it is free of unwanted noise. Noise in this context is principally of two types. The first is noise introduced by the recording equipment. This tends to be random, like the hiss of an audio cassette. In Woodbury’s Cup’ik documentation environment, audio cassettes would have provided around 40-50dB SNR at the time of recording, reducing to around 30dB for tape of this vintage played today. Quiet passages in the recording might then reduce the effective SNR to levels of perhaps 10dB or slightly worse, resulting in the ‘hard-to-hear’ tapes Woodbury described.

In practical situations with field recordings, no audio recording should ever be so degraded that native speakers are unable to comprehend the recording. However, non-native speakers, and by extension the future philologist, do not have the strength of signal-complementary processes enabled by language awareness and are therefore vulnerable to degraded signals. Bradlow & Bent found that non-native speakers were ‘disproportionately challenged’ by degraded signals in comprehension tests compared with native speakers.
Earlier work in the 1950s showed that correct identification of the place of articulation of English consonants was ‘severely’ impacted in noisy recordings (Miller & Nicely, 1955), which may be suggestive of the types of transcription errors made in noisy conditions.

The second type of noise is that resulting from unwanted acoustic events such as wind noise. Speech from other members of the speech community who are not the present target of the recording session is another common source of unwanted noise. Listeners subjected to multi-talker ‘babble’, as it is known in the literature, face the additional challenge of linguistic interference. Van Engen and Bradlow (2007) showed that six-talker babble impacted comprehension more than two-talker babble. The linguistic interference is stronger when the babble is more ‘comprehensible’ and in the same native language as the adversely-affected listener. Native English speakers were affected more by two-talker babble than non-native speakers. The scenario of the future philologist assumes a non-native listener environment but we should note that heritage speakers would likely be more adversely affected by multi-talker babble in their language. Noise that has similar spectral characteristics to voice has also been shown to result in stronger decreases in intelligibility (Brungart et al., 2001).

Studies in the clear speech literature have often preferred to use white noise to digitally degrade recordings for experimental purposes. Clear speech studies investigating multi-talker background noise have either synthesized babble out of natural speech or employed ‘speech-shaped’ noise where the spectral power contour has been shaped to match that of human speech. In effect this is the same as n-speaker multi-talker babble where n is an infinite number of speakers (Kalikow et al., 1977).
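To make the idea of digitally degrading a recording to a chosen SNR concrete, the following is a minimal sketch in Python. It is not the procedure used in the studies cited above: the function names, the synthetic tone standing in for speech, and the use of Gaussian white noise (rather than speech-shaped noise or babble) are all assumptions made for illustration. It relies only on the standard relation that SNR in dB is 20 times the base-10 logarithm of the ratio of RMS amplitudes.

```python
import math
import random


def rms(samples):
    """Root mean square amplitude of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def measure_snr_db(speech, noise):
    """SNR in dB from the RMS amplitudes of the speech and noise components."""
    return 20 * math.log10(rms(speech) / rms(noise))


def mix_at_snr(speech, noise, target_snr_db):
    """Return speech + noise with the noise rescaled to give the target SNR."""
    # Solve 20*log10(rms_speech / rms_noise) = target for the noise RMS.
    target_noise_rms = rms(speech) / (10 ** (target_snr_db / 20))
    gain = target_noise_rms / rms(noise)
    return [s + gain * n for s, n in zip(speech, noise)]


random.seed(0)
# Assumption: one second of a 200 Hz tone at 16 kHz stands in for speech.
speech = [0.5 * math.sin(2 * math.pi * 200 * t / 16000) for t in range(16000)]
# Assumption: Gaussian white noise stands in for the masking noise.
noise = [random.gauss(0.0, 1.0) for _ in range(16000)]

mixed = mix_at_snr(speech, noise, target_snr_db=10.0)
# Recover the scaled noise component to check the SNR that was achieved.
added_noise = [m - s for m, s in zip(mixed, speech)]
print(f"achieved SNR: {measure_snr_db(speech, added_noise):.1f} dB")
```

A negative target SNR simply yields a noise component with a larger RMS amplitude than the speech. Replacing the white noise with noise whose spectrum has been shaped to match speech would bring the sketch closer to the ‘speech-shaped’ noise described above, but that filtering step is not shown here.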
While the contribution of noise due to recording devices has fallen in recent years, we have no reason to believe that field recording environments have changed. Therefore speech-shaped noise would seem to be a reasonable choice of noise type where we are seeking to mimic recording environments where multi-talker babble is commonplace.

In the search for an appropriate level of noise for degraded signal conditions, it became apparent that the levels of noise used in clear speech comprehension studies are of an entirely different magnitude to the levels of noise in poor-quality audio recordings. SNRs of -4 and -8dB were employed based on trials that showed these levels resulted in mid-to-high range intelligibility for native speakers. This is extremely noisy given that negative SNR describes conditions of more noise than speech signal. These levels are patently not equivalent to those encountered in the context of linguistic fieldwork methods. This study will instead adopt an SNR level for degraded audio which aims to replicate a plausibly noisy field recording.

2.6 Ramifications for respeaking tasks

While these studies provide some theoretical backdrop for this research, some aspects limit the scope of their application to clear speech in linguistic fieldwork. In Bradlow & Bent’s study, participants had extensive experience of English (9.7 mean years of study). This is not comparable to the language-naive transcription employed in this study. Nevertheless it is interesting to note the surprisingly low benefit of clear speech among participants that, while not native speakers, nevertheless had extensive working knowledge of English. This suggests that the full benefits of clear speech are unlikely to be realised in the context of respeaking as a fieldwork method. There are further considerations which may reduce the effectiveness of clear speech.
To produce respeaking oral annotations in the BOLD methodology, language consultants are asked to repeat previously recorded material slowly and carefully for the benefit of a recording device rather than for another person. This is very different from the natural process of active negotiation described in H&H theory. In the BOLD method, speakers have no opportunity to gauge comprehension and by extension no ability to estimate signal-complementary processes in the listener. Without comprehension checks, speakers are forced to guess an appropriate level of ‘tuning’ of their speech production. This presents an additional concern in how we can adequately explain the desired ‘degree’ of hyper speech required and how we may know if it has been met. If the speaker has been working at the task for an extended period they may find the task tiring and perhaps take less care with their production over time.

This apparent shortcoming of the BOLD respeaking method was also a shortcoming of the earlier studies in clear speech. Recall that those too relied on participants forming their own judgements about what constituted clear speech. BOLD respeaking methods may also exhibit similarly wide variations in speech modification and, by extension, wide variations in intelligibility gains. This is not ideal, and if this should prove to be the case it suggests that further research is needed on ways to regulate the extent of clear speech.

Finally, we should take a moment to reflect on a form of respeaking in linguistic fieldwork that apparently addresses this concern. In the method first described by Bouquiaux & Thomas, consultants were respeaking to linguists making transcriptions. Such a situation does allow for comprehension checks and will repeatedly motivate the speaker to shift their production to the hyper speech end of the H&H continuum for the benefit of the linguist attempting a transcription.
If we were recording the entire session, it is reasonable to suggest that the recordings will include language clear enough to facilitate comprehension for the linguist. Otherwise they would not have proceeded to the next phrase. Therefore if this research yields a negative result for respeaking in the context of BOLD, these results may not be applicable to other uses of respeaking in linguistic fieldwork.

3. Method

3.1 Overview

Speakers of the Indo-Aryan language Nepali were recruited to record narratives using a BOLD-like methodology implemented by the Aikuma smartphone application (see footnote 5). The method employed focuses on the production of careful speech as a spoken annotation of the spontaneous recording. A preliminary investigation was conducted to examine the properties of Nepali clear speech and to inform the development of a means to compare accuracy between written transcriptions. The main phase of the study involved the capture of over ten minutes of spontaneous narrative from two Nepali speakers. Four participants with linguistic training were recruited in order to produce phonemic transcriptions of the natural speech using the software Praat (Boersma & Weenink, 2010). These transcriptions were then compared against the reference transcription for the purpose of assessing overall accuracy rate under varying experimental conditions. Those conditions were the availability of a respoken clear speech version of the spontaneous speech, and whether the spontaneous recording had been degraded with artificial noise.

This section begins with a brief overview of the Nepali language and specific phonological properties that may predict transcription errors. This is followed by details of the participants and experimental procedure including data collection, processing and a description of the transcription activity.
Finally, the means to compare transcription accuracy is presented along with an explanation of the accuracy metric and the underlying modified ‘edit distance’ taking into account phonetic similarity.

3.2 Nepali

Nepali is an Indo-Aryan language and the national language of Nepal with more than 11 million native speakers (Khatiwada, 2009). The choice of Nepali for this study was made given several considerations: the availability of native speakers (see section 3.4, Participants), a phonology that would not present too many difficulties for native-English-speaking transcribers and, finally, a regularised orthographic representation to help inform a reference transcription.

Footnote 5: Aikuma may be downloaded from https://play.google.com/store/apps/details?id=org.lp20.aikuma

Difficulties in listening comprehension of foreign languages are well explored in applied linguistics (Tinkler, 1980; Boyle, 1984). Phonological aspects relevant for this study include non-native speakers having difficulty perceiving sounds which are not contrastive in their own language. Native English speakers have difficulty distinguishing between contrastive stops in the closely-related language Hindi (Werker et al., 1981; Pruitt et al., 1998). Lessons may also be drawn from forensic transcription, which has long been concerned with issues of quality and reliability in the transcription of natural language. Linguists need to be aware of their own potential perceptual failings when approaching another language and pay particular attention to speech sounds that differ from their own language (Fraser, 2003). The greatest difficulties in identifying phonemes may arise when two speech sounds are contrastive phonemes in the L2 language but are allophones of a single phoneme in the native L1 language (Brown, 2007). Therefore we can hypothesize that transcription errors will coincide with participants’ difficulty in perceiving these contrasts. The phonemic inventory is given in Table 3.1.
This table is based on consonants and vowels from Khatiwada (2009) and diphthongs from Pokharel (1989).

Table 3.1: Nepali phonemic inventory

Consonants
Plosive: bilabial p pʰ b bʱ; dental t tʰ d dʱ; retroflex ʈ ʈʰ ɖ ɖʱ; velar k kʰ g gʱ
Affricate: alveolar ts tsʰ dz dzʱ
Nasal: bilabial m; alveolar n; velar ŋ
Tap or flap: alveolar r
Fricative: alveolar s; glottal ɦ
Lateral: alveolar l
Approximant: bilabial (w); palatal (j)

Vowels
High: front i; back u
Close-mid: front e; back o
Open-mid: back ʌ
Open: central a

Diphthongs
Ending in /i/: /ui/ /oi/ /ʌi/ /ai/ /ei/
Ending in /u/: /iu/ /ou/ /ʌu/ /au/ /eu/

3.3 Prediction of errors

Based on the differences between Nepali and English phonology, we can anticipate a number of perceptual difficulties that native English speakers will have with Nepali which may transfer to errors in transcription. However, not all of these are straightforward given that transcription involves not just listening for speech sounds but also evaluating visual evidence from the transcription software, chiefly spectrograms. Possible sources of errors are discussed as follows:

Figure 3.1: Acoustic evidence of retroflex consonant

1. Retroflex consonants

English has no retroflex consonants and therefore we could anticipate failure to distinguish them from the contrastive dentals where, for example, /ʈ/ → /t/. The mitigating factor is that English speakers can perceive retroflex consonants, particularly given the American English retroflex ‘r’. Additionally, three out of four participants have been exposed to the related language Hindi in university coursework completed prior to this study. The instructions provided to participants (Appendix 1) also stated:

Note: A really good way of spotting retroflex consonants is that the F3 formant comes close to F2, most visible in adjacent vowels.

Given that formant frequency bands are the cues to identification of retroflex consonants, spectrogram identification may be resilient to acoustic noise since bands of energy tend to be visible.
In connected speech, intervocalic /ɖ/ often lenites to a retroflex tap /ɽ/ as shown in Figure 3.1.

2. Aspiration and breathy voicing

Nepali aspiration is distinctive in both voiced and unvoiced stops. This contrasts with English, where /b/ is often realised as unaspirated [p] and where /p/ is often realised as aspirated [pʰ]. Therefore we would expect that failure to distinguish between aspirated contrasts may be a common source of errors. One mitigating factor is that aspiration on spectrograms can be easily observed as a burst of aperiodic noise. However, participants will need to distinguish between the lengths of the aspiration for phonemic contrasts. Aspiration also becomes difficult to see on a spectrogram under noisy conditions. The distinctive aperiodic noise of aspiration in the speech signal may be masked by introduced noise.

Figure 3.2: Aspirated consonant spectrogram

Figure 3.2 shows the same speech sound /tʰ/ under different noise conditions. Left is a clean recording with the noise of aspiration clearly visible; right is a noisy (9dB S/N) recording where the aspiration has all but disappeared into the noise.

3. Vowel contrasts

As with consonants, where the vowel system of one language differs from another, speakers can be expected to have difficulty distinguishing vowels. In a large-scale cross-corpus study, Becker-Kristal (2010) developed an acoustic typology of vowel inventories. Figure 3.3 is a representation of the structural configuration of Australian English (6L0) and Nepali (6R0) extracted from the relevant sections of Becker-Kristal’s PhD thesis, with emphasis on the left/right difference. These coincide with the Australian English /æ/ and the Nepali /ʌ/ vowel. There are also differences in the nominal F2 of the low central vowels.
Figure 3.3: Australian English & Nepali vowel spaces. Taken from Becker-Kristal (2010): comparison of the ‘structural configuration’ of Australian English (left) and Nepali (right) shows that the major difference is that Australian English has a mid-front [æ] phoneme category compared with Nepali’s mid-back [ʌ].

Therefore one might expect that native Australian English speakers have difficulty with the [ʌ] vowel in particular. While it is possible to obtain objective measures from transcription software, doing so is laborious and may be inconclusive given that the articulators, and subsequently acoustic evidence, may not attain a steady state in connected speech. This may be an area where consultation of a clear speech version could help identify the intended vowel target where that target has not been attained in spontaneous speech.

3.4 Participants

Two Nepali language consultants in their mid-20s were recruited by the investigator to take part in the study. One male and one female, both consultants are native speakers of Nepali and come from the national capital of Kathmandu. Both were international students at a major Australian university at the time of the study and are fluent in English with extensive secondary and tertiary education in English. The recording of Nepali narratives and the respeaking task were performed independently of each other. The consultants received a moderate remuneration for their time.

Four participants were additionally recruited for the transcription experiment, three male and one female. All were 4th year students in a linguistics program at a major Australian university. Three had recent training in transcription using the Praat software in an experimental phonetics subject. The sole female participant withdrew after a week citing heavy work commitments which conflicted with the time burden of participation. She later continued the transcription up to file 50 or around 45% of the data. All participants were remunerated.
3.5 Procedure

3.5.1 The Aikuma application

Bird and Hanke’s (2013) Aikuma smartphone application was used to record spontaneous narrative and to handle the clear speech spoken annotation task. The interactive implementation of the BOLD method means that language consultants are able to perform recording and subsequent spoken annotations by themselves, thereby providing the means to ‘crowdsource’ natural language. Recording narrative is as easy as speaking into the phone, much like a regular telephone conversation. Subsequent spoken annotations are produced in a separate respeaking mode in which the consultant listens to the initial recording and begins respeaking at any time. When they do, Aikuma will pause playback and begin recording the spoken annotation until they finish, after which Aikuma will resume playing the spontaneous recording. The application stores the recordings in digital format along with metadata which indicates the alignment of sections of spoken annotations with the spontaneous speech.

At the time that Aikuma was used in this study the software was in a relatively early stage of development and so measures were taken to use backup recording devices while still benefitting from the automated speak-pause-resume implementation of the BOLD method. One of the key advantages of an Android application is that it may be run on inexpensive commodity devices. However, it also introduces an element of variability in performance and audio quality between devices.

3.5.2 Data collection

The male and female Nepali speakers participated one after the other and under somewhat different conditions. For the male speaker, a demonstration of Aikuma was first provided as well as some instruction in respeaking. Informed by Labov’s urban fieldwork methods (Labov, 1972) to elicit more naturalistic speech, the consultant was asked to recall a time he thought his life was in danger. This resulted in around six minutes of enthusiastic unselfconscious speech.
With the spontaneous narrative recorded successfully, the respeaking phase was then performed. The session was also recorded with the built-in microphones of a professional Zoom H4n recorder. The audio quality from the high-end Samsung Galaxy Nexus smartphone turned out to be sufficient but a software bug stopped the respeaking process from working correctly, in so far as the software lost track of the alignment between the casual and respoken audio. To remedy this, the audio was manually segmented using an audio editing software package. The result was two files representing the same content, one for the spontaneous recording and the other for the respoken clear speech recording.

The female speaker was recorded using a slightly different process. Rather than the office location of the male speaker, a recording studio situated on the university campus was used instead. The back-up Zoom H4n was connected to studio microphones and recorded both the internal and studio microphones in four-track mode. In this case the consultant used an HTC Desire C phone. This time the software’s respeaking function operated without a hitch, resulting in metadata that could be used to prepare the audio files automatically. However, the recording quality of the HTC Desire C was substandard, with the phone’s aggressive automatic gain producing distorted/clipped audio. The metadata was instead used with the high-quality recording from the Zoom H4n professional recorder.

Given these differences between the recording procedure of the male and female speakers, the male speaker’s audio can be impressionistically described as good while the female speaker’s audio is excellent. Therefore the quasi-independent ‘speaker’ variable accounts for a number of differences including individual, sex, recording device and recording location. We may expect to see this impact accuracy results accordingly.
Both participants were also asked to provide a written transcript of their narrative in the Devanagari script. The Devanagari script was converted to the roman transliteration standard ISO 15919 (ISO/IEC 15919, 2001; see footnote 6). This differs from the commonly used ‘Hunterian’ transliteration by including a series of diacritics to represent the larger array of consonants and vowels in Devanagari. The result is a transliteration scheme that retains all of the phonemic detail of Nepali from the original Devanagari script.

Footnote 6: The conversion was performed by the iso15919 Python library by Mublin: http://dealloc.org/~mublin/iso15919.py

3.5.3 Processing the data

Speech-shaped noise was chosen as the most appropriate type of noise to introduce in the experimental conditions so as to approximate multi-talker babble of field conditions. The level of noise was decided by analysing the audio levels from a test recording made on a Zoom H4n where a group of students were talking just a few metres from the subject being recorded. The root mean square (RMS) difference in amplitude between the subject speech and the nearby unwanted speech was approximately 9dB. Subjective testing with audio files degraded with these parameters revealed these files to be considerably noisy, to a level at which one would certainly expect some detrimental effect on accurate perception. The visual clarity of the spectrogram was also hampered significantly. The individual audio files of variable phrase length were transformed into a series of experimental conditions under the following process:

1. RMS normalisation to -12dBFS (12dB below full scale or 0.25 of max; see footnote 7) for both casual and respoken audio recordings.
2. Creation of a ‘noisy’ version of the casual speech recording by mixing in ‘speech-shaped noise’ with the original recording such that the RMS level remains at -12dBFS and signal-to-noise is 9dB.
3. Cutting of casual speech (‘clean’ and ‘noisy’) and respoken speech into individual files.
4.
Individual experiment conditions created (see Table 3.2) and randomised.
5. Data structure of experiment conditions archived, experimental files created for distribution, human-readable index created with ISO 15919 transliteration and experimental conditions.

3.5.4 Experimental conditions

There are two independent variables:

1. Noisy or clean spontaneous speech file is provided.
2. Respoken ‘clear speech’ file provided or not provided.

Additionally there is the quasi-independent variable of speaker, which coincides with the two individuals being male and female. The independent variables result in a two by two matrix of experimental conditions as follows:

Table 3.2: Experimental conditions matrix

                  Noisy spontaneous                  Clean spontaneous
Respeaking        Noisy spontaneous recording;       Clean spontaneous recording;
                  respeaking file provided           respeaking file provided
No respeaking     Noisy spontaneous recording;       Clean spontaneous recording;
                  no respeaking file provided        no respeaking file provided

The dependent variable is the measure of accuracy of a transcription coinciding with the above independent variables. The same data was used for all transcription participants.

Footnote 7: Peak-normalization is far more common in software implementations such as Audacity. Unfortunately this will result in fairly dramatic differences in perceived volume for different files. The normalization here was performed by normalize.exe available from: http://normalize.nongnu.org/

3.5.5 Data validation and volume

The male speaker provided a spontaneous narrative of six minutes in duration and the female speaker four minutes. The process described in section 3.5.3 resulted in 142 files. The casual and respoken speech recordings were checked against the written transcript, segmenting the ISO 15919 transliteration in the process. Where there was a poor correlation between the transcript, the casual speech and respoken speech recordings, these files were discarded.
Usually this was due to paraphrasing in the respoken recording and, more rarely, significant differences between the written transcript and audio recordings. This most commonly occurred for the male speaker. After this process there were 60 files for the female speaker and 65 files for the male speaker. This provided enough data for 14 files in each experimental condition per speaker:

14 files * 2 (noisy/clean casual) * 2 (respeaking/no respeaking) = 56 files

For both speakers this resulted in 112 files. Four additional files were added at the beginning of the data set, two for each speaker, and all with respeaking but varying the noise condition to introduce the possible conditions during initial training. All 116 files were included in the overall analysis.

While there are an equal number of files for each speaker, differences between the speakers in how they segmented their narrative in the respeaking task result in different quantities of words spoken by the female and male speaker. The total quantity of data was 957 words in 116 files. Of those, the data contained 536 words spoken by the male speaker with an average of 9.2 words per file. The female speaker data contained 421 words with an average of 7.3 words per file.

3.5.6 Reference transcription

In order to derive accuracy metrics from participant transcriptions, it was first necessary to produce a reference transcription of the casual speech recordings. The accuracy of this transcription is critical given that any error in the reference transcription would invalidate comparisons of that section against participant transcriptions. The reference transcription was produced by the investigator with a number of critical differences between the method used to produce this transcription and that used by participants, such as:

1. Systematic consultation of Devanagari orthographic transcription
2. Access to respoken audio throughout
3. Cross-checking transcription with language consultants
4.
Longer time taken to produce the transcription (over 40 hours)

The Devanagari orthography was the most useful resource in assisting an accurate transcription. In the majority of cases it narrowed the search space of identifying speech sounds to within the possible allophones of the phonemes represented in the orthography. However, connected speech processes were very common, particularly elision of word-final syllables. The respoken audio was generally useful in identifying where such elision had taken place and particular care was taken not to transcribe speech sounds that were not articulated in casual speech.

In a small number of cases, the language consultant’s orthographic transcription did not perfectly match the audio recording. In these cases consultants were asked to produce a revised orthographic transcription. Additionally, some difficult-to-identify passages, fewer than ten in total, were cross-checked with language consultants. In these cases, after playing their own speech back, alternative transcriptions were read out and the consultant was asked to select the one that sounded most like their own utterance. While time-consuming, this yielded insights into correct identification of particular allophones.

3.5.7 Transcription activity

Study participants were asked to produce a transcription of each of the 116 spontaneous speech files. In half the cases there were two files with the same prefix where one was the spontaneous speech recording and the other the respoken version. For the first file this would mean participants saw filenames: 1_normal.wav and 1_respeaking.wav. Participants were directed to open the spontaneous recording in Praat and transcribe to the best of their ability. Resulting Praat textgrid files were saved for later analysis.

Written instructions were provided (Appendix 1) to the four participants taking part. These include a phonological inventory of Nepali with IPA and X-SAMPA symbols, the latter being more easily typed into Praat.
Some observations on morphology and common phonotactics were also included. All participants spent the first two hours of transcription in the presence of the investigator in a university computer lab where computers were equipped with headphones and the required software. Participants were shown how the respeaking file could be opened in Praat for viewing and playback and how to tile Praat windows for side-by-side comparison. They were informed that the respeaking file was assumed to be helpful and the first four files all had respeaking versions to introduce the concept. However, participants were not told how to use the respeaking file. It was made clear to the participants that the investigator was not an authority on the Nepali language or specific choices in transcription. Nevertheless, some intuition was provided verbally relating to speech sounds that differ from English and may prove to be problematic, such as retroflex consonants, breathy voicing and the vowels /a/ and /ʌ/.

After this initial two-hour period, participants were free to continue working either in the lab or somewhere else such as their own home. One participant (participant 1) chose to keep working in the lab. It should also be noted that the computer lab was a communal facility within a phonetics department at a major Australian university and at least one resident researcher had prior experience of Nepali. Participant 1 discussed the transcription with that researcher and this may have contributed to the transcription accuracy of participant 1.

3.6 Measuring accuracy

The transcription activity resulted in 116 Praat textgrid files for each of the three participants who completed the work and another 50 for the fourth participant who had not completed the task. These files contain a string of X-SAMPA labels for each speech sound, time-aligned with the appropriate audio file.
The string of phonetic symbols was extracted with a Python program utilising the Praat textgrid parser in the Natural Language Toolkit (NLTK) package (Bird, Loper & Klein, 2009). In order to provide statistical analysis, an automated similarity metric is required for each file compared against the reference transcription. Coding the transcriptions by hand would be the most reliable method but hand processing of 398 transcriptions was taken to be unduly time consuming. Therefore an automatic method to compare phonetic strings was developed to yield metrics across the entire data set. The design criteria were that the accuracy metric should return a nominal integer in the range 0-100 where 100 represents an identical transcription and 0 a transcription with no similarity.

Automatic comparison of phonetic strings was not expected to be flawless, nor would it provide insights into the types of errors. On that basis some attention was given to producing a file with parallel transcriptions and accuracy scores which would be more easily digested for noting general trends (Appendix 2). A subset of the data was hand coded to illustrate types of errors.

3.6.1 Edit distance

An established means to compare two strings is the edit distance metric. The metric can be described as the minimum number of steps required to transform one string into another. The transformation steps are insertion, deletion or substitution of individual items. Edit distance is a metric of distance: a value of 0 is returned for identical strings, and values up to the length of the longest string are returned for strings at the maximum distance, i.e. strings that have no similarity. A common implementation is the Levenshtein Distance algorithm and this forms the basis of the phonetic-edit-distance metric described here. A naive comparison of phonetic symbols using edit distance alone would fail to account for speech sounds with similar pronunciation.
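The plain edit distance described above can be sketched in a few lines of Python. This is a minimal illustration of the standard Levenshtein algorithm, not the program used in this study; the function name and the X-SAMPA label lists in the usage note are hypothetical.

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions and substitutions
    needed to turn the sequence `ref` into the sequence `hyp`."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]  # turning i items into an empty sequence costs i deletions
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete r
                cur[j - 1] + 1,          # insert h
                prev[j - 1] + (r != h),  # substitute (free when the labels match)
            ))
        prev = cur
    return prev[-1]
```

Because the function works on any sequence, a transcription can be passed as a list of labels rather than a raw string, so that a multi-character label counts as one item. For example, levenshtein(["t_h", "i", "j", "o"], ["t", "i", "o"]) counts one substitution and one deletion.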
Therefore, when used to compare perceptual similarity, edit distance is usually combined with phonetic algorithms that ‘normalise’ speech sounds into categories that group similar-sounding speech sounds. Several exist for English including Metaphone, Soundex and NYSIIS. The result of these normalisation techniques is that similar sounding words will have the same signatures. None of these systems are directly applicable to Nepali, nor is collapsing all phonetically similar speech sounds desirable given the goals of this research. However the edit distance technique is described here because the method is useful to account for the comparison of strings where there may be too few or too many items.

3.6.2 Improved phonetic edit distance

In order to properly account for phonetic similarity, the distance measure should account for the relative distance of articulations in both the place and the manner of articulation. To accomplish this we must discard the symbolic representation of speech sounds and incorporate the descriptive framework of distinctive features. Gildea and Jurafsky (1996) introduced a computational algorithm to calculate phonetic similarity based on Binary Feature Edits Per Phone (BFEPP). In this case they use the Hamming distance, a metric of difference between two strings of equal length, counting the number of positions where they differ. Applied to matrices of distinctive features, the result is a measure of how many features are different between one phone and another. A recent evaluation found BFEPP performed the best compared with other algorithms (Kempton & Moore, 2013) so a variation of this approach was developed for the analysis of data in this study.

Distinctive features describe both manner and place of articulation but place of articulation is known to be particularly significant for categorising speech sounds. On this basis a subset of phonetic distinctive features was chosen in three different categories of distinctive features.
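The Hamming distance over binary feature vectors can be illustrated with a short Python sketch. The feature order and the feature values assigned to each phone below are hypothetical and chosen only for illustration; they are not the feature matrices of Gildea and Jurafsky (1996) or of this study.

```python
# Hypothetical feature order; 1 means the feature is '+', 0 means '-'.
FEATURES = ("voice", "nasal", "continuant", "coronal", "dorsal")

# Illustrative toy inventory, not a vetted feature chart.
PHONES = {
    "t": (0, 0, 0, 1, 0),
    "d": (1, 0, 0, 1, 0),
    "n": (1, 1, 0, 1, 0),
    "s": (0, 0, 1, 1, 0),
    "k": (0, 0, 0, 0, 1),
}


def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs vectors of equal length")
    return sum(a != b for a, b in zip(u, v))


def feature_distance(x, y):
    """Count of binary feature differences between two phones."""
    return hamming(PHONES[x], PHONES[y])
```

With these toy values, /t/ and /d/ differ in one feature (voice), while /t/ and /k/ differ in two (coronal and dorsal), so a substitution of /d/ for /t/ would be treated as a smaller error than a substitution of /k/ for /t/.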
A degree of scaling was applied to obtain a more genuine correlation to perceptual difference between phones. The features and scaling values are given in Table 3.3:

Table 3.3: Phonetic edit-distance error ‘scaling’ by category

Category   Scaling   Features
Place      1.5       "round", "labiodental", "coronal", "anterior", "strident", "lateral", "dorsal"
Manner     1.0       "syllabic", "delayed release", "approximant", "tap", "trill", "sonorant", "nasal", "continuant", "strident", "voice"
Vowel      2.0       "round", "high", "low", "front", "back"

These scaling values were determined based on a number of factors. While place of articulation differences are most significant, multiple feature values change between different phones. Considering vowels, there are a relatively small number of distinctive feature differences between vowels of different categories and hence stronger scaling is applied. These feature changes were added together such that the maximum error is 10 scaled feature differences. Vowel vs. consonant comparison is always counted as a maximum error (the same cost as deleting or inserting a symbol) and vowel vs. vowel comparisons only compare vowel features.

The described phonetic edit distance is the same as Levenshtein edit distance except that substitutions do not automatically have a cost of 1; they have a fractional cost based on the sum of the number of scaled differences in phonetic features.

3.6.3 Summary

The accuracy of a given transcription is derived from the phonetic edit distance between the reference transcription and the test transcription. So that an identical transcription scores 100, as required by the design criteria in Section 3.6, the distance is normalised to length and subtracted from one. The final accuracy is given by:

Accuracy = 100 * ( 1 - PhoneticEditDistance / Length )

An additional metric of speaking rate in words-per-minute was derived by counting the number of words in the ISO 15919 transliteration and comparing against the length of the audio recordings.
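The scaled phonetic edit distance and the resulting accuracy score can be sketched as follows. This is a hedged reconstruction from the description in Section 3.6.2 and Table 3.3, not the program used in the study. Several details are assumptions: the toy phone inventory and its feature values are hypothetical, a feature listed in two categories is counted in both, the summed cost is capped at 10 and divided by 10 to give a fractional substitution cost, Length is taken as the longer of the two transcriptions, and accuracy is taken as 100 for a distance of zero.

```python
# Feature categories and scaling values as listed in Table 3.3.
PLACE = ["round", "labiodental", "coronal", "anterior", "strident",
         "lateral", "dorsal"]
MANNER = ["syllabic", "delayed release", "approximant", "tap", "trill",
          "sonorant", "nasal", "continuant", "strident", "voice"]
VOWEL = ["round", "high", "low", "front", "back"]
SCALE = {"place": 1.5, "manner": 1.0, "vowel": 2.0}
MAX_ERROR = 10.0  # maximum error of 10 scaled feature differences

# Hypothetical toy inventory: (is_vowel, set of features valued '+').
PHONES = {
    "t": (False, {"coronal", "anterior"}),
    "d": (False, {"coronal", "anterior", "voice"}),
    "n": (False, {"coronal", "anterior", "voice", "sonorant", "nasal"}),
    "k": (False, {"dorsal"}),
    "i": (True, {"high", "front"}),
    "e": (True, {"front"}),
    "u": (True, {"high", "back", "round"}),
}


def count_diffs(a, b, names):
    """Number of named features on which the two feature sets disagree."""
    return sum((name in a) != (name in b) for name in names)


def substitution_cost(x, y):
    """Fractional cost in [0, 1] of substituting phone y for phone x."""
    if x == y:
        return 0.0
    x_is_vowel, fx = PHONES[x]
    y_is_vowel, fy = PHONES[y]
    if x_is_vowel != y_is_vowel:
        return 1.0  # vowel vs. consonant always counts as a maximum error
    if x_is_vowel:
        scaled = SCALE["vowel"] * count_diffs(fx, fy, VOWEL)
    else:
        scaled = (SCALE["place"] * count_diffs(fx, fy, PLACE)
                  + SCALE["manner"] * count_diffs(fx, fy, MANNER))
    # Assumption: cap at the maximum error, then scale to a fraction.
    return min(scaled, MAX_ERROR) / MAX_ERROR


def phonetic_edit_distance(ref, hyp):
    """Levenshtein distance with fractional substitution costs."""
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        cur = [float(i)]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1.0,                          # deletion
                cur[j - 1] + 1.0,                       # insertion
                prev[j - 1] + substitution_cost(r, h),  # substitution
            ))
        prev = cur
    return prev[-1]


def accuracy(ref, hyp):
    """Score on a 0-100 scale; 100 means identical (assumed normalisation)."""
    length = max(len(ref), len(hyp))
    if length == 0:
        return 100.0
    return 100.0 * (1.0 - phonetic_edit_distance(ref, hyp) / length)
```

Under these assumptions, transcribing /tin/ as /din/ costs only 0.1 of a full error, since the two stops differ in a single manner feature, whereas transcribing /i/ as /u/ costs 0.6 because three vowel features differ and vowel differences carry the strongest scaling.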
The Python program produced a CSV file with all of the metrics which was then imported into a data frame in the R language (R Development Core Team, 2006) for statistical analysis. The Python program also produced the summary log of all transcriptions and accuracies, as provided in Appendix 2.

4. Results

4.1 Overview

This section reports the results of the two phases of the study. Firstly, the properties of Nepali clear speech are reported in Section 4.2 based on a small-scale pilot study with one speaker. Following sections report results from the main phase of transcription, beginning with transcription rates in Section 4.3 and statistical analysis of accuracy results in Section 4.4. Section 4.5 presents the common transcription errors with illustrative examples highlighting the variation between study participants. The effect of degraded noisy recordings is reported in Section 4.6 and a summary of findings concludes in Section 4.7.

4.2 Clear speech in Nepali

In this section some of the specific properties of Nepali clear speech will be explored by contrasting with the spontaneous speech produced by one of the language consultants in this study. Krause & Braida (2002) pointed out that there are several types of clear speech such as clear/slow, clear/normal, loud/normal and many more. This represents a challenge where it’s difficult to judge the degree to which the speech performance is shifted towards the hyper end of the scale or whether certain properties of clear speech are stronger than others.

A good place to start is to examine the same utterance in different recordings. Figure 4.1 shows Praat (Boersma & Weenink, 2010) visualisations of the same utterance. Counter to expectation the intensity dynamic range looks somewhat lower around the word boundary between /astelya/ and /aeko/ (ae is not a diphthong, these are separately articulated morphemes).
Cutler and Butterfield (1990) observed that speakers attempt to mark word boundaries in English in clear speech and this would appear to be replicated in Nepali with examples ranging from clear-cut cases of silence to small-but-noticeable dips in intensity as in Figure 4.1. While the words are compressed to the same timeline, another observation is that of a differing ratio of consonant and vowel duration, particularly for the /ko/ at the end. Smiljanić & Bradlow (2008) observed ‘temporal restructuring’ of English in clear speech style and suggested that differences in duration and intensity of speech sound segments are motivated by enhancements of the prosodic structure of English. The clear speech utterance in Figure 4.1 demonstrates that care has been taken to articulate the morphological affixes in vowel clusters.

Figure 4.1: Visualising Nepali ‘Clear’ vs ‘Normal’ speech. Comparison of normal and respeaking for /astelya aeko/. Equivalent segments in the normal (top) and clear speech recordings (bottom). The clear speech version has been compressed to the same timeline. In normal speech the inter-word /a/ and /a/ becomes one long /a/ but a slight inter-word dip in intensity and evidence of a glottal stop can be discerned in the clear speech version.

4.2.1 Durations and speaking rate

As expected, both speakers produced significantly slower speech in the respeaking task than their spontaneous speech. Figure 4.2 offers duration box plots of consonant and vowel duration in spontaneous and clear speech, demonstrating wide variation for both, consistent with the observed variation in speech rate.

Figure 4.2: C and V durations for male speaker of Nepali. Consonant durations in spontaneous and clear speech (left) compared with vowel durations (right). Vowel durations were considerably more variable in clear speech.
A derived metric of speaking rate was calculated by comparing the number of words in the orthographic transcription for a given segment with the duration of the audio recordings for the spontaneous and respoken audio recordings. Violin plots of speaking rates for both speakers are shown in Figure 4.3. Violin plots combine traditional box plots with kernel density plots (Hintze & Nelson, 1998), providing a superior visualisation of the distribution of speaking rates across the entire data set. Interestingly, speaking rates in clear speech approach a normal distribution compared with the wide variation in the spontaneous recording. One explanation for the greater variation in spontaneous speech is that hesitation sounds frequently occur in these productions. Respeaking lacks the cognitive burden of planning of spontaneous speech.

Figure 4.3: Speaking rates of Nepali speakers in spontaneous and clear speech

In general the male speaker spoke significantly faster and exhibited a much greater variation of speaking rate in the spontaneous narrative. Figure 4.4 shows the relative modification of speaking rate of each speaker for clear speech. Interestingly, the overall reduction was only marginally higher for the female speaker.

Figure 4.4: Reduction of speaking rate in clear speech. Male speaker top, female speaker bottom. Horizontal axis is reduction in speaking rate for clear speech in words-per-minute. The mean reduction in speaking rate for both speakers is around 50 words-per-minute.

4.2.2 Expansion of vowel space

A number of studies of English clear speech have shown that vowel targets are hyper-articulated with an expansion in the distance of vowel categories resulting in a larger overall vowel space (Moon & Lindblom, 1994; Krause & Braida, 2004). An experimental acoustic analysis of the male Nepali speaker confirms that the vowel space is somewhat reduced in the spontaneous speech narrative compared with the respoken recording (Figure 4.5).
The differences are not large and, in contrast with English, there are few examples of reduction to schwa-like central vowels in connected speech.

Figure 4.5: Composite plot of Nepali casual vs. clear speech vowel spaces. Composite plot of the ‘vowel space’ of Nepali casual and clear speech recordings for the male speaker. The vowel category targets are centroids of vowel formant distributions.

4.3 Transcription metrics

Participants reported progress and time taken rounded to half-hour periods. Two participants took approximately 14 hours to complete all 116 files while another, the least experienced at transcription, took 20 hours. A fourth participant did not complete the task, reaching file 50 out of 116. Data relating to time taken was only collected on a per-session basis. However this was sufficient to give an indication (Figure 4.6) of the rate of progress after the training phase and at three points during the entire data set. The overall picture was one of a very slow training period where participants would repeatedly play speech sounds to get familiar with Nepali. Transcription rates then sped up considerably even by the first third of the data with a long tail of slight speed improvements throughout.

Figure 4.6: Participant transcription rates in minutes-per-file. Transcriptions took much longer during training, levelling out to around 8 minutes per file. Participants slowly sped up towards the end of the data set to final transcription rates of around 4-6 minutes per file.

4.4 Statistical analysis

4.4.1 Reliability of measures

Pearson’s product-moment correlation of accuracy rates between participants demonstrated a statistically significant correlation between the accuracy scores of the three participants that had completed the data set, as shown in Table 4.1. This suggests that the accuracy measures are reliable. A composite violin plot of accuracy of all participants is given in Figure 4.7.
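For readers unfamiliar with the measure, the product-moment correlation coefficient between two participants' per-file accuracy scores can be computed as in the following Python sketch. The study itself carried out this analysis in R; the function name is hypothetical and the sketch does not compute the associated p-values.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product-moment correlation of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 items)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)
```

Each list would hold one participant's accuracy scores in file order, so that the two scores at a given position refer to the same file.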
Table 4.1: Inter-participant accuracy correlation matrix

                Participant 1        Participant 2           Participant 3
Participant 1   1                    0.29 (p=0.0018)         0.27 (p=0.0031)
Participant 2   0.29 (p=0.0018)      1                       0.35 (p=9.95 x 10^-5)
Participant 3   0.27 (p=0.0031)      0.35 (p=9.95 x 10^-5)   1

Figure 4.7: Participant accuracy comparison. Violin plot (box plot with density) comparing accuracy score distribution of all four transcription participants.

4.4.2 T-tests of independent variable: respeaking

Using accuracy scores of all completed transcriptions from all four participants, a Welch two-sample t-test of accuracy against the binary factor of respeaking results in a p-value of 1.86 x 10^-9 and therefore the null hypothesis is rejected. The mean accuracy rates differed significantly with the no respeaking condition having a mean accuracy of 73.68 compared with 79.08 in the respeaking condition, a difference of +5.39 accuracy where the respeaking file was provided. Figure 4.8 presents a scatter plot of all accuracy scores by file with linear regression and 95% confidence intervals.

T-tests per participant are shown in Table 4.2. All four participants showed a rise in transcription accuracy with the availability of a respeaking file, and the rise was statistically significant for participants 1, 2 and 3. The result for participant 4 in isolation is not statistically significant, most likely due to a smaller sample size with participant 4 completing 50 out of 116 files.

Table 4.2: Per-participant respeaking t-test results

             Participant 1   Participant 2   Participant 3   Participant 4
Estimate     5.86            6.51            4.15            4.21
Std. Error   1.17            1.21            1.14            2.62
t value      5.03            5.34            3.64            1.61
p value      1.9 x 10^-6     1.0 x 10^-5     4.1 x 10^-4     0.12

Figure 4.8: Accuracy scatterplot by file with regression. Scatter plot of accuracy for all participants for each of the 116 files. The red (lower) and blue (upper) lines represent linear regression prediction for no respeaking and respeaking respectively with the shaded regions representing 95% confidence intervals.
Note that files < 50 have four results per file (x-axis) while files >= 50 have three results per file, as there are no results from participant 4 beyond that point.

4.4.3 T-tests of independent variable: noise

A Welch two-sample t-test of all accuracy scores for all completed transcriptions from all four participants, against the binary factor of noise (9dB SNR) versus the clean spontaneous speech recording, results in a p-value of 0.028 and so again the null hypothesis is rejected. In this case the estimated effect was smaller at +2.01 accuracy when participants had the clean spontaneous speech file. Individual t-tests per participant did not yield statistically significant results, as would be expected given the p-value over the full data set is somewhat closer to 0.05. Therefore it was not judged to be useful to consider the estimated effect of noise on individual participants.

4.4.4 Assessing the interaction of noise and respeaking

Given that the independent variables of respeaking and noise are shown to be significant, a two-way analysis of variance (ANOVA) was performed to determine if there are any significant interactions:

Table 4.3: 2-way ANOVA of respeaking and noise variables

                F-value   P-value
Respeaking      38.32     1.52 x 10^-9
Noise           5.84      0.016
Respeak:Noise   0.86      0.35

No statistically significant interaction was found between the two. A multiple linear regression of respeaking and noise variables was performed to gain some insight into the effect size of both variables, summarised in Table 4.4. This is a critical finding given that a key research question of this study was to identify to what extent the ‘regenerating’ effect of respeaking can be attributed to noise.

Table 4.4: Multiple linear regression of respeaking and noise against accuracy

               Estimate   Std. Error   t-value   p-value
Respeak=True   5.43       0.87         6.23      1.19 x 10^-9
Noise=Normal   2.10       0.87         2.42      0.016

Multiple R^2 was evaluated at 0.1008.
Stated another way, 10% of the variation in accuracy is explained by the model incorporating respeaking and noise.

4.5 Analysis of common errors

As hypothesized, transcription errors clustered around aspects of Nepali that differ from English. The most common errors were related to:

1. Breathy voicing and aspiration.
2. Differentiating between vowels, particularly /a/ and /ʌ/.
3. Vowel clusters and diphthongs.
4. Retroflex consonants, particularly the voiced retroflex plosive /ɖ/.

Aspiration and breathy voicing are conflated in this comparison given that they are predictable based on unvoiced and voiced segments respectively. Aspiration/breathy errors were particularly common for velar, dental and bilabial stops in both unvoiced and voiced manners, e.g. /t/, /d/, /k/, /g/, /p/, /b/. Word-initial aspirated consonants frequently lenited to bilabial fricatives /ɸ/ and /β/ and labiodental fricative /f/. Participants tended to favour one particular type of fricative and repeat the choice in similar environments.

The challenge for transcribers is to categorise aspiration based on the duration of the aspiration. Figure 4.9 shows three cases. The first is phonemically unaspirated but it’s arguably aspirated here. The second is a characteristic Nepali contrastively aspirated voiceless dental stop while the third is unaspirated. Note that participants 1 and 2 have marked aspiration in all cases, participant 3 ignored all aspiration and participant 4 was a mixed case and the sole participant to incorrectly transcribe /p/ as / p /.

By far the most common transcription error for all participants was incorrect identification of vowels. This was due both to the high frequency of vowels and, even normalised for frequency, a high rate of misidentification of /a/ and /ʌ/. Given that vowel classification involves a continuum of formant qualities, the reference transcription may also be forced to make arbitrary decisions.
We should note this as a source of unreliability in the accuracy comparison. However the effect is minimized given that vowel differences influence phonetic edit-distance values to a small degree.

Figure 4.9: Variation in breathy voicing and aspiration errors. ISO 15919: tira thiyō tāplēju… File 9 (female, normal audio, no respeaking) transcribed by four participants compared with the reference transcription (G). Aspiration/breathy voicing errors are circled.

To get some sense of vowel error rate compared to the average accuracy figures, strings of all vowels in the reference and comparison transcription were compared using the same phonetic edit-distance procedure outlined in Section 3.6.2. The mean accuracy of vowels for the entire data set was 77.27, only marginally higher than the mean accuracy of 76.38 for all categories.

As anticipated, participants had great difficulty distinguishing /a/ and /ʌ/, given that these are not contrastive in Australian English. The errors were frequent even when participants could be expected to see formant differences on spectrograms, such as when /a/ and /ʌ/ occurred in proximity, as shown in Figure 4.10. It seems likely that participants did not consult spectrogram evidence, preferring to use their auditory perception instead. Another major source of variation in errors between participants is in the transcription of vowel clusters.

Figure 4.10: Variation in vowels and vowel cluster errors. ISO 15919: …mī jhanai āttiyau File 109 (female, normal audio, no respeaking) transcribed by three participants. Even after they had gained experience and with good conditions for spectrogram evidence, participants frequently had difficulty distinguishing the vowels /a/ and /ʌ/.

Vowel clusters frequently occur in Nepali as a result of morphological affixation. Instructions provided to participants suggested that diphthongs typically occur in the normal length of a vowel.
However participants varied greatly in segmenting vowel-like speech sounds into individual segments, diphthongs or merely long vowels without reference to vowel quality changes. It should also be noted here that the normalisation procedure in data analysis eliminates some of these differences by producing a string of vowels without regard to exactly how they have been segmented. The diphthong /eu/ would be considered the same as /e/ followed by /u/.

One of the most common lexical items in the Nepali narratives is thiyo (it was, past tense of ‘it is’), frequently occurring at the end of phrases. That is likely why thiyo has been particularly salient for transcribers such that they were generally consistent in vowel cluster transcription strategy for this lexeme. In the example in Figure 4.10, there’s barely a steady-state ‘i’ vowel before a glide but nevertheless F2 starts very high. Such is the salience of the high-frequency lexeme thiyo that it often results in misidentifying similar lexemes such as in this case. The spoken word was actually attiyau (pronounced /attijo/ with a geminate consonant). We can also see the common misidentification of a short stop burst as Nepali phonemic aspiration. In other vowel clusters it was not uncommon for confusion to arise about whether a vowel was a diphthong or a single vowel, and whether a glide should be transcribed.

Finally, retroflex consonants present an interesting counterpoint to breathy contrasts and /a/ vs. /ʌ/ vowel identification. Retroflex consonants also do not occur in English but misidentification was less common. Impressionistically, the degree of retroflexion in Nepali consonants is often large with an acoustic quality that is readily identifiable by English speakers. Occasionally there was evidence that participants had perceived some additional quality of a retroflex stop but hadn’t made the connection. The example in Figure 4.11 contains a Nepali retroflex nasal /ɳ/.
Three participants inserted a liquid prior to a standard alveolar nasal /n/. It may be that participants perceived the rapid convergence of F2 and F3 formants. In many cases voiced and unvoiced alveolar retroflex stops have the retroflex aspect transcribed correctly, but transcriptions vary in the categorisation of manner between voiced stops, unvoiced stops and retroflex flaps, as in Figure 4.11. Participants were informed of a common Nepali connected speech process of lenition of voiced retroflex stops to flaps. However three out of four participants transcribed a retroflex flap infrequently, preferring to indicate a voiced retroflex stop.

Figure 4.11: Variation in retroflex consonant errors. ISO 15919: … khima pūrṇakō bāṭō thiyō File 86 (male, normal, respeaking) transcribed by three participants compared with the reference transcription (G). An example of where participants inserted liquids before the consonant, possibly having perceived formant precursors to the retroflex nasal /ɳ/.

4.6 The impact of noise

Previous studies on noise have focused on comprehension by native speakers and typically involve far lower signal-to-noise ratios than the 9dB used in this study. 9dB is however a normalised loudness figure, and where the natural dynamic range of speech results in lower intensity, the SNR will be considerably lower. Figure 4.12 illustrates an example where participants would have found the spectrogram display of almost no use. Yet all three participants that transcribed this file produced an accurate transcription of this section. Impressionistically, the only segment that would be difficult given the noise is the /h/. As one of the more frequent lexical items and with this case taken towards the end of the data set, it’s possible that prior exposure allowed participants to correctly identify the segment.
Figure 4.12: Noise impact on spectrogram legibility. This extract from file 97 (male, noisy, no respeaking) features a quiet passage of speech which has reduced the S/N ratio to such an extent that features can barely be seen in the spectrogram. Alternative settings barely improve the situation from this default setting screenshot.

Given the intuition that noise would appear to make identification of frication more difficult, a general measure of total aspiration errors was compared with the noise variable. There was no statistically significant effect. Neither was a similar metric for retroflex errors found to correlate with noise, not even in the restricted case of a noisy spontaneous file with no respeaking file. However there was a correlation between the accuracy of vowels and noise (p = 0.0149) with an estimated effect size of -2.85 on the same phonetic edit-distance accuracy scale (out of 100). This is slightly higher than the estimated effect of noise on overall accuracy (-2.1). This seems to be a somewhat surprising finding since we might expect that the difficulty in differentiating noise from vocal tract frication would be observed more commonly than difficulties in categorising vowels. Vowel identification is dependent on formant frequencies, peaks of spectral intensity which are not present in noise and therefore ought to be more resilient against noise masking effects.

4.7 Summary of Results

Overall there was considerable variation in transcription errors between participants. Nevertheless the availability of a respeaking file resulted in a significant boost in accuracy and the noise-degraded spontaneous recording resulted in a significant decrease in accuracy. The effect of respeaking was most visible at the extreme limits of the accuracy ranges observed. Of the 14 individual accuracy results less than 60, only one represented a ‘respeaking’ condition.
At the upper end of accuracy scores, of the 26 accuracy results greater than 90, only three represented ‘no respeaking’ conditions. No other factors were found to correlate including male/female speaker and speaking rate. A linear model accounting for respeaking and degraded audio variables estimates the effect of each as +5.43 and -2.1 on the phonetic edit-distance scale.

The most common errors were found to be those resulting from factors predicted given the comparison of the sound systems of English and Nepali. However some were higher frequency than others. /a/ and /ʌ/ vowel misidentification was exceedingly common. Participants had considerable difficulty with the Nepali aspirated/breathy contrast series, even where duration contrasts ought to have been clear. Errors in identifying retroflex consonants were less common although there was considerable variation in manner such as voiced/unvoiced. None of the experimental conditions were found to correlate with the common errors observed, even where we might expect to find them such as noise and aspiration.

There was also evidence of participants engaging their own language faculties. Transcription of high-frequency lexemes became more consistent, but became another source of error when these transcriptions were chosen over similar-sounding Nepali words.

5. Discussion

5.1 Overview

The results of this study are discussed as follows. Section 5.2 addresses the three research questions and Section 5.3 discusses the limitations of the study. Finally, Section 5.4 discusses the observation that not all differences between the reference transcript and participant transcripts may be categorised as ‘errors’ and that some may be better described as interpretive choices.
5.2 Addressing the research questions

5.2.1 Respeaking and transcription accuracy

This study found a statistically significant benefit of the availability of respeaking for transcription ‘accuracy’ such that availability of the respeaking file increased phonetic similarity scores by an estimated 5.39. This should be understood in the context that participants were not engaged in producing painstakingly accurate transcriptions with high phonetic detail. Rather, they were motivated to strike a balance between doing a good job and maintaining a good working rate so the entire task would not be too time consuming. When we’re making guesses about the work flow of future philologists, one could argue that these conditions may be a reasonable approximation of the practicalities of transcribing a large audio corpus. Equally, one might argue, future researchers have as much time as they need compared to the time pressure of documentary linguistics.

The total variation in accuracy scores between participants was large. It would be tempting to conclude that this is symptomatic of the wide variation in clear speaking properties that were observed in previous studies in the clear speech literature (Picheny et al., 1985; Schum, 1996). This wide variation in clear speech has been linked to speakers being asked to provide their judgement of clear speech rather than finding an appropriate level of clear speech via a process of negotiation in communicative events. However, unlike studies on clear speech comprehension, this study showed no correlation between the degree of clear speech such as the speaking rate, or even between the male/female speaker, and the resulting accuracy scores.

All participants expressed frustration at the difficulty of transcribing the male speaker when he would paraphrase instead of repeating exactly what was said. All participants agreed that the male speaker produced the utterances that were the most difficult to transcribe.
It would have been preferable for the male speaker’s respoken passages to be further towards the hyper end of the scale, similar to the way the female speaker understood the task. It’s the investigator’s view, backed by the experience of transcription participants, that a means to regulate the production of clear speech would be useful, perhaps by including more explicit instructions or even introducing timing regulation such as Krause and Braida’s (2002) use of a metronome. So strong is this intuition that it seems prudent to explore other possible factors as to why the data does not reflect the subjective view of the investigator and participants.

Firstly, we must consider that the use of the respeaking file in the experimental method was an unregulated procedure. Participants were not instructed in the exact way to use the file, in part because the best way of consulting the file is simply unknown. Given their linguistic training, the way the four participants chose to use the files is an interesting observation in itself. For example, participant 4 reported that they started out consulting the respeaking file systematically but found that this would slow the process down as they hunted for speech sounds that had been elided in spontaneous speech. Participants generally settled on a routine where they would focus on the spontaneous recording, later consulting the respeaking file with a particular view to challenging sections of the transcript. Participant 3 reported that the respeaking file was not thought to be necessary where they felt the spontaneous speech was not presenting difficulty. In these cases they admitted to not consulting the file at all. The same participant had the lowest estimated benefit from respeaking (4.1 compared to the 5.39 mean).

Transcription performance is influenced by a wide array of competencies and other factors including perceptual capability, exposure to the language, theoretical knowledge and experience in transcription skills.
Furthermore, attributes of the software and the technique of work flow also affect results. I would further suggest that a regularised transcription method that systematically presents respoken audio, such as a side-by-side display of spontaneous and respoken speech, might yield better results overall.

If we’re hoping to draw inferences about the value of respeaking versus collecting other forms of data, a brief qualitative examination of the size of the observed effect may be helpful. On the assumption that a transcription is ‘good enough’ when a human being looking at the transcription can work out what the words are supposed to be, scores in the mid 70s and upwards appear to meet that criterion. Table 6.1 presents such a case with an example from file 39 (male, normal, respeaking). Please note that word boundaries have been inserted to assist visual comparison of these transcriptions.

Table 6.1: Qualitative view of ‘high-range’ accuracy

FILE 39: āja bhandā kamsēkama duī barṣa agāḍī ma nēpālamā h dā khērī

            IPA of transcription                                      Accuracy
Part. 1     aʌ b d komskom dui bers ʌɡaiɖi nepalma hoda kheri         87
Part. 2     azu wandab komsom dwi β s aɡ ɽi mo napana udza keri       74
Part. 3     aser bʌnda komɸekom dui bas oɡ ri mo nepalna huda kiri    78
Part. 4     ase banda komskom dui bars aɡ ri mo nepalma rouda kiri    85
Reference   asʌ d kʌmskʌm dui bʌrʂ ʌɡaɖi mʌ nepalma hoda keri         -

Subjectively I would describe the 10-point difference between the mid-70s and the mid-80s as ranging from good to excellent. Considering participants 2 and 4, the largest contributions to the 10-point accuracy difference are participant 2’s omission of the lateral in /nepalma/, and the omission of the glottal fricative /h/ and insertion of /z/ in /hoda/.

It may be useful to take a look at what a difference of five points looks like, given that this is the estimated improvement of respeaking. The following example in Table 6.2 presents a transcription comparison of file 3 (female, noisy, no respeaking) early on in the data set.
The reference transcription has 41 labels. Given the length of this phrase, it takes only three missed or over-transcribed segments (too few or too many labels) to move the accuracy score by more than the 5.39 points observed in this study. A reminder of the accuracy metric normalised to length:

Accuracy = 100 * ( PhoneticEditDistance / Length )

Table 6.2: Qualitative view of ‘low-range’ accuracy

FILE 3: ra ēkdamai malāī cai ēkdamai āphnō mṛtyukō mukhabāṭa

            IPA of transcription                                      Accuracy
Part. 1     ra ektʌmi male tsei ektʌmia midtiku mukfataɡ              71
Part. 2     eɽa ɡeɡt i male se eɡt i aɸmu ŋiltilk moksakel            64
Part. 3     erʌ eɡd ii male tsie ekdamei apnuʌ murtuko mukfata        77
Part. 4     raʔ eʈami male si iɡd i ɸnua mirdirko mufaika             69
Reference   rʌ ekdʌme malei tsi ekdami aɸnuʌ miɽtjuko mukɸaʈʌ         -

Small differences in transcription lengths (too many or too few speech sounds) were common, but an examination of trends over the log (Appendix 2) shows that sub-70 accuracy was often the result of a combination of missed speech sounds and poor accuracy of transcribed speech segments.

Considering the example in Table 6.2, participant 4’s transcription accuracy is five points higher than participant 2’s. Subjectively the key difference is the final word. Participant 4’s /mufaika/ sounds closer to the reference /mukɸaʈʌ/ than participant 2’s /moksakel/. Viewed another way, a five-point difference can describe the difference between transcriptions where one word in eight differs by as much as /mufaika/ and /moksakel/. I would argue on a subjective basis that this is significant.

More broadly, lower scoring transcriptions are likely to exhibit missed segments and generally mistranscribed speech sounds. However, my intuition is that both of these symptoms appear when participants have somehow lost track of where they were transcribing, and that these coincided with more difficult sections of faster speech where no respeaking file was available.
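The earlier point that three segment errors in a 41-label phrase outweigh the 5.39-point respeaking effect can be checked with a little arithmetic. The following is a minimal sketch, assuming that every missed or extra label costs exactly one unit of edit distance and that the score is normalised by the length of the reference transcription. The function names are illustrative only and are not part of the scoring system used in this study, which weights substitutions phonetically.

```python
def points_per_error(reference_length: int) -> float:
    """Accuracy points moved by one unit-cost error (assumed cost of 1 per label)."""
    return 100.0 / reference_length


def errors_to_exceed(effect_points: float, reference_length: int) -> int:
    """Smallest whole number of unit-cost errors whose total shift exceeds effect_points."""
    per_error = points_per_error(reference_length)
    count = 1
    while count * per_error <= effect_points:
        count += 1
    return count


REFERENCE_LABELS = 41     # labels in the file 3 reference transcription
RESPEAKING_EFFECT = 5.39  # estimated respeaking benefit, in accuracy points

print(round(points_per_error(REFERENCE_LABELS), 2))             # 2.44
print(errors_to_exceed(RESPEAKING_EFFECT, REFERENCE_LABELS))    # 3
```

Under these assumptions, two errors shift the score by about 4.9 points and three by about 7.3, which is consistent with the figure of three segments given for this phrase. Shorter phrases are correspondingly more sensitive to each individual error.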
As one might expect, the worst results coincided with challenging conditions. Of the accuracy results with scores below 60 in the dataset, 13 out of 14 (93%) were cases with no respeaking, and 10 of the same 14 (71%) were also noise-degraded cases.

5.2.2 Respeaking effect on error types

Counter to expectations, availability of respeaking had no discernible impact on errors stemming from difficulties with aspiration, identification of vowels or retroflex consonants. Given that Nepali clear speech was shown to possess an expanded vowel space, we might have expected that the respeaking file would have helped with categorising the continuum of vowel realisations. This was not found to be the case. Participants frequently mistook vowels, and not just the more difficult /a/ and /ʌ/ vowels. Given that vowel realisation exists on a two-dimensional height/front-back continuum, difficulty in classification of vowels in an impressionistic transcription is to be expected. In contrast with consonants, participants did not appear to compensate for this inherent difficulty by spending additional time on correctly identifying vowels. If in doubt, participants were free to consult the formant frequencies in Praat.

There was, however, a reduction in the number of labels missed or speech sounds not transcribed. Participant transcriptions showed a mean difference in length from the reference transcription of 2.4 (SD=2.3). An ANOVA of transcription length errors against respeaking was significant, F(1,395) = 5.05, p=0.025. Inspecting the data manually reveals that low accuracy scores (<70) were often the result of a significant number of missed speech sounds. This supports the observation that the reduction of speech sound elision in clear speech is particularly helpful in identifying speech sounds which are more difficult to spot in spontaneous speech.
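For readers less familiar with the test reported above, the following minimal sketch shows how an F statistic and its degrees of freedom arise in a one-way ANOVA with a single two-level factor such as respeaking. The numbers are made-up toy values, not data from this study, and the function is written out by hand purely for illustration. It is an assumption that the reported test was a one-way design.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [v for group in groups for v in group]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical length errors (labels missed or added); NOT data from this study.
with_respeaking = [1, 2, 0, 3, 1, 2]
without_respeaking = [3, 4, 2, 5, 3, 4]

f_stat, df_b, df_w = one_way_anova([with_respeaking, without_respeaking])
print(f"F({df_b},{df_w}) = {f_stat:.2f}")
```

Read in the same way, and if the design was indeed one-way, the reported F(1,395) would correspond to two conditions (respeaking available or not) and 397 transcriptions entering the comparison.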
The difference between the sound systems of Nepali and English turned out to be a reasonable predictor of the common errors encountered. However, the form of transcription error that was most improved by respeaking was the more generalised tendency to miss speech sounds. This also sounds a note of caution in that the reference transcription was undertaken with consultation of the orthographic transcript and access to the respoken audio in all cases (rather than in only 50% of cases, as for participants). The reference transcript may therefore err on the side of introducing speech sounds which are present in clear speech and perhaps not in spontaneous speech. There was a conscious attempt to avoid this phenomenon but this bias may not have been totally eliminated.

5.2.3 Contribution of clear speech vs. noise

One of the more surprising findings is that transcription participants had remarkably little difficulty in transcribing the degraded noisy files. A two-way ANOVA estimated an effect size of around -2.1 accuracy points for the noise-degraded files (p=0.016). Clearly a lower SNR than 9dB would show a stronger result, but we should remember that the goal was to assess the impact of noise reduction in a realistic field working environment. A field linguist who produces an audio recording with an SNR of 9dB is really not trying very hard to maximize audio quality. Furthermore, I doubt it is even possible to get an SNR as poor as this when using smartphones as recording devices in the field. The reason is that the microphone is very close to the mouth of the speaker, which allows the input gain of the microphone to be reduced, so that other nearby noise sources will have a relative difference in sound pressure level (SPL) much greater than 9dB. This is also why vocalists’ microphones generally don’t result in feedback and why documentary producers favour lapel microphones attached to speakers instead of camera-mounted microphones.
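The close-microphone argument can be illustrated with the inverse distance law, under which sound pressure level falls by about 6dB for each doubling of distance from the source. The sketch below is an idealisation, assuming that speech and noise are equally loud point sources in a free field with no reverberation. The distances are hypothetical examples rather than measurements from this study.

```python
import math


def level_difference_db(speech_distance_m: float, noise_distance_m: float) -> float:
    """Level advantage (dB) of the nearer source under the inverse distance law.

    Assumes both sources are equally loud point sources in a free field.
    """
    return 20.0 * math.log10(noise_distance_m / speech_distance_m)


def distance_ratio_for_db(difference_db: float) -> float:
    """Noise-to-speech distance ratio that yields the given level difference."""
    return 10.0 ** (difference_db / 20.0)


# Hypothetical set-up: phone held 5 cm from the mouth, noise source 1 m away.
print(round(level_difference_db(0.05, 1.0), 1))   # 26.0
# How much further away than the speaker the noise must be for a 9dB difference.
print(round(distance_ratio_for_db(9.0), 2))       # 2.82
```

On these assumptions a noise source would have to be less than three times as far from the microphone as the speaker’s mouth, or be considerably louder than the speaker, for the difference to fall to 9dB. Real rooms are reverberant and real noise sources vary in loudness, so the figures are indicative only.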
In fact, given the observations of the effect of noise here, I would go so far as to say that noise concerns for transcription accuracy would be dwarfed by other factors when using close microphone techniques.

Smartphone audio recording can be excellent but throughout the course of this study, having used multiple devices of different make and model, it’s clear that audio quality varies considerably between devices. The HTC Desire C used by the female speaker clipped much of the recording [8]. Clipping, where audio levels are overdriven and digital recording devices ‘clip’ at maximum positive and negative values, is a particularly destructive form of distortion. Clipping introduces a large amount of harmonics which make speech much less intelligible, and the resulting audio is rather unpleasant to listen to as well. Bafflingly, the HTC also produced a DC offset at the beginning of every recording sequence. While this was easy enough to remove in post processing, it is reflective of the lack of care that some manufacturers can have in the design of their products.

[8] Fortunately a Zoom H4n was used in parallel for just such an eventuality. In this way the useful metadata recorded on the mobile phone was paired with a high quality audio recording in this study.

The high-end Samsung Galaxy Nexus fared much better but still clipped the audio when the male speaker raised his voice in energetic segments of the narrative. The default behaviour of smartphones is to aggressively maximize the audio signal without much care for the danger of clipping. For these devices to contribute useful audio, the levels will need to be calibrated by the software or set manually. This doesn’t seem unreasonable since you would not expect a professional recorder such as the Zoom H4n to magically take care of the levels without control by the operator. The Aikuma smartphone application (Bird & Hanke, 2013) used for this study has since improved control over audio levels.
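The two recording faults described above, clipping and DC offset, are straightforward to check for once a recording has been read into a list of integer samples. The following is a minimal sketch assuming 16-bit signed samples (range -32768 to 32767). The function names and the toy sample values are hypothetical and do not reflect how Aikuma or any particular device handles levels.

```python
def dc_offset(samples):
    """Mean sample value; a healthy recording should sit close to zero."""
    return sum(samples) / len(samples)


def remove_dc_offset(samples):
    """Subtract the mean so that the waveform is centred on zero."""
    offset = dc_offset(samples)
    return [s - offset for s in samples]


def clipped_fraction(samples, full_scale=32767, tolerance=0):
    """Fraction of samples at (or within tolerance of) the 16-bit limits."""
    limit = full_scale - tolerance
    clipped = sum(1 for s in samples if s >= limit or s <= -limit - 1)
    return clipped / len(samples)


# Toy values only, chosen to show the behaviour of each check.
overdriven = [0, 12000, 32767, 32767, 5000, -32768, -9000, 100]
print(clipped_fraction(overdriven))   # 0.375

shifted = [110, 90, 105, 95]
print(dc_offset(shifted))             # 100.0
print(remove_dc_offset(shifted))      # [10.0, -10.0, 5.0, -5.0]
```

Counting full-scale samples is only a heuristic: a recording that was clipped and then attenuated will not sit at the limits. Subtracting a single mean is likewise a crude remedy for an offset that occurs only at the start of a recording, where a high-pass filter would be the more usual tool.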
At the smartphone operating system level, the situation is also improving. At the time of this study, the latest version 4.4 of the Android phone operating system (Android Kitkat, 2013) introduces new audio features, including the addition of a dynamic range compressor specifically for speech. While this only applies to playback, it is nevertheless helpful because it enables recording software to set the recording level much lower and not be concerned with low playback volume during the respeaking stage. As should be apparent from this account, this area is rapidly developing and has not yet reached an optimal solution for linguistic fieldwork.

Finally, on the point of noise, I would note that there is an air of audiophilia surrounding the topic of audio quality in linguistic fieldwork. This study was very much a worst-case scenario in terms of performance in transcription, given that participants knew very little of the language they were transcribing. Nevertheless results were generally good and continued to be so in situations where the acoustic signal was so degraded with noise that spectrograms were rendered almost useless. While I am not suggesting that we should not strive for high quality recordings, preoccupation with recording device audio quality does not seem justified when other factors such as microphone placement are far more influential.

5.3 Limitations of Study

The method of this study departed from the guiding ‘future philologist’ scenario in one important respect. Any person who embarks upon the considerable labour of transcribing a large volume of natural speech will not be doing so in a language-naïve fashion. Rather, they will be sufficiently engaged in the endeavour to be learning the language as they go. At the very least, transcription would be significantly enhanced with reference to a lexicon, which must exist in any basic language documentation.
This is in stark contrast to the repeated errors of the same kind made by participants in this study, who were neither working with a lexicon nor seeking to learn the language. Conversely, a transcriber with rudimentary Nepali language skills would scarcely make the same mistakes, nor would they presumably take 14-20 hours to transcribe ten minutes of speech. However, this study does reflect some important issues arising out of the transcription of speech sounds under varying conditions.

As far as assessing the impact of respeaking is concerned, I have not considered the value of respeaking in forming the evolving analysis of a phonemic inventory and morphological analysis. During an early poster presentation of this research, one senior linguist remarked that respeaking was particularly useful early on in the exploration of a language as consultants tended to ‘spell out’ morphological affixes.

Respeaking has other undeniably useful properties. Himmelmann (2006b) noted that segmenting spoken language is particularly challenging and that the primary source of information is native speaker intuition. A key property of clear speech is that word boundaries tend to reappear where there were apparently none in connected speech. This aspect of respeaking is not conveyed in the measures of accuracy on which this study is based.

The wide variation in transcriptions between study participants should also be noted. If four linguists were transcribing the same corpus in a realistic environment they would undoubtedly collaborate. As a result you would expect conventions to emerge and tactics to converge so that there would be far less variation than observed between the participants of this study. One example might be how to consistently approach breathy-voiced vowels as potential evidence of elided glottal fricatives (/h/), or the recognition of what sort of length of aspiration constitutes a phonemic contrast and what does not.
With this in mind it’s no stretch of the imagination to describe the general error rate observed in this study as artificially high. This has ramifications for how one considers the size of the effects described and how they could reasonably be expected to be much smaller under more realistic transcription conditions.

5.4 Transcription differences: errors or choices?

The transcription comparison provided earlier in Table 6.2 highlights an added dimensionality in transcription ‘errors’. The reference transcription begins with ra, a discourse marker in Nepali that often appears at the start of phrases. Two participants wrote /era/ on the basis of a vowel-like phonation prior to the /r/ and one participant indicated a glottal stop after the ra. In all cases these speech sounds were present in the recording. None of these transcriptions is an error in the sense that the participant misidentified a sound; they certainly aren’t perceptual errors.

These examples show that we shouldn’t consider these errors but rather differences which may or may not be errors. In a sense the ‘error’ lies with the reference transcription for lacking this detail, but the reference transcription was informed by linguistic knowledge and those speech sounds were judged to be not phonemically relevant. The participants in this study had no lexical access so they could not have known if ‘era’ was a word in this context or not.

Continued exposure to Nepali resulted in participants developing ideas of what was relevant and not relevant. The example discussed occurred at the beginning of the data set (file 3) but with considerable exposure to /ra/ throughout the data set, helpfully in word initial position, variation in transcribing /ra/ rapidly evaporated between participants.

The recognition of lexical items cuts both ways. In Section 5.2, the analysis of common errors mentioned the Nepali word thijo.
Cited as a high frequency lexeme in the discourse of both speakers, this word presents a strategy problem for transcribers in how to deal with the vowel and glide, and with reduced forms that may end up being realised as a diphthong or just a plain vowel. Participants often made the interpretive choice to transcribe /thijo/ even in cases where a reduced form appeared. Arguably this is a good thing, presuming that our ultimate goal is to recognise lexical items. However, they would also shoe-horn thiyo into similar-sounding words such as tyo, without the aspirated /t /, and even further afield into words like tyasto.

This phenomenon is well described in the psycholinguistics literature (Saporta, 1961). It’s long been observed that listeners tend not to hear ‘mispronunciations’ (Cole, 1973) of lexical items. Even with a vocabulary of fewer than ten words and with no semantic comprehension, participants in this experiment began to show signs of categorising similar-sounding words into the words they recognised. Gradually the hard slog of enacting their linguistic training, analysing spectrograms and so on, began to be supplanted by the more natural process of engaging their own faculties of language. In the context of longer term projects undertaken by future philologists on language documentation archives, we might expect these forms of errors to become more dominant than they were in the limited scope of this study.

6 Conclusion

This paper addressed research questions relating to the effect of the availability of respeaking on transcription accuracy. The method was guided by assumptions informed by the needs of future philologists as they transcribe the output of documentary projects in the present day. Under these conditions, respeaking was shown to have a significant benefit on transcription accuracy. The benefit of removing noise from the recording as part of the respeaking method was isolated as a contributing factor with an effect size less than half that of clear speech.
This broadly suggests that respeaking is a valuable component in linguistic fieldwork, even where natural language recordings are made under favourable recording conditions.

This research supports previous findings in the study of clear speech that artificially elicited clear speech varies widely in its properties. However, in the context of respeaking in this work, clear speech did not show benefits in transcription that scaled with the degree of clear speech. The strongest observation was that participants omitted fewer speech sounds when they consulted a respoken version.

I submit that a reasonable explanation for the lack of a rate/degree correlation with accuracy can be found in the unregulated nature of the transcription activity. Participants manually consulted the respeaking file when they found it most necessary, but exactly when and why is not clear; for instance, there is little evidence that respeaking was consulted to refine judgements on vowel categories. This highlights the need for a careful observational study with attention to participants’ behaviour during the transcription process.

The large inter-participant variation in transcriptions observed in this study could also be addressed by incorporating realistic consultation on transcription strategy, for example by allowing participants to consult on what constitutes contrastive aspiration in Nepali, or on choosing an appropriate symbol to represent a reduced retroflex obstruent. There is also a need for research into fresh methods and tools to assist linguistic transcription by providing the ability to rapidly and systematically contrast speech sounds and lexical items, so as to facilitate an evolving understanding of the language being transcribed.

I would also argue that the lack of a correlation between respeaking rate/degree and accuracy in this study does not invalidate the need to regulate the production of clear speech as a linguistic field method.
Further research is needed to refine methods and possibly tools towards this end. One could imagine, for example, that the Aikuma software application could measure approximate speaking rate and provide feedback to the language consultant if they should begin speaking too fast.

Given the rapid pace of change in this field, research should continue to engage with new methods and tools, as demonstrated in this study. Studies of this nature allow us to evaluate their impact on the results of language documentation projects and thereby to inform the development of the next generation of methods and tools. There is much more to do in the development of truly scalable methods in language documentation to meet the urgent challenge of language loss.

References

Abercrombie, D. (1964). English phonetic texts. London: Faber and Faber.
Abney, S. & Bird, S. (2010). The Human Language Project: building a universal corpus of the world's languages. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. London: SOAS.

Appendices

Appendix 1: Transcriber Instructions

Nepali Speech Transcription Experiment instructions and documents

Thank you for agreeing to take part in this experiment! This document offers some background for this experiment as well as instructions and a small amount of reference material. You can keep this with you to refer to during the experiment.

The thought experiment

This experiment should be understood in context with a wider thought experiment. In this scenario you are a linguist in the future working on a language that is no longer spoken. However, we do have recorded material for the language and previous field linguists have worked on many of the details. We have a working hypothesis of a phonological inventory and some field notes on the language including some early thoughts on phonotactics and morphology. Our goal is to have a written record of “what was said” in these recordings.
Since there are many unfamiliar words, you can only transcribe what you can hear. However, using your knowledge of phonological processes, you might be able to work out what the phonemes are supposed to be even if the surface forms in connected speech are very different.

The general theme of this scenario is that we are aiming to speed up language documentation. We are not interested in high levels of phonetic detail because this takes too long. The next person to look at your data will be working out what the words are. That means it might be useful to record phones outside the phonemic inventory.

What you will be doing

Your goal throughout is to provide the most accurate transcription you can. In practice, like most transcription, you’ll get faster after an initial training period.

Fortunately, for some of the recordings there are two versions, i.e. two files. The second file is a ‘careful speech’ version which you can refer to. This may ‘undo’ some of the pesky connected-speech processes that make it difficult for you to transcribe the normal recording. There may be other reasons that it’s useful. Remember that we want a transcription of “what was said”. This means that even if you have a full form of a word in the careful speech version, and some elision in the casual speech, you should not insert the elided segments. So while we call this a phonemic transcription, some extra detail such as breathy voice, nasalisation or fricatives, for example, could be very useful for the next person to build up a lexicon.

In this experiment the language you will be transcribing is the Indo-European language of Nepali, spoken natively by around 20 million people in Nepal and India.

Within the data folder there are a large number of wav files. They are numbered in order, e.g.:

1_normal.wav
2_normal.wav

Each wav file represents a phrase spoken by a male or a female language consultant.
They can sometimes be very short while other times they are more like a full sentence. You should open these files in Praat and create a basic Phonemic transcription tier. Transcribe the phrase as best you can and save the Praat text grid back to the same directory. We have provided a chart of the Nepali phonemic inventory with the IPA on top and X-SAMPA below. You should type the X-SAMPA symbols provided.

For some of the phrases you will see two files instead of one, like so:

3_normal.wav, 3_respeaking.wav

In these cases you have access to a re-spoken version in so-called ‘clear speech’. You can think of this as a version of speech you might use when speaking to a deaf grandmother. In fact that is how it was explained to the language consultants.

How you use this extra file is up to you entirely. You can load it into Praat as well or you might just want to play it (from the object window of Praat, say). You might find that it’s worth loading into Praat when you’re getting up to speed, and later on you might just play it unless something interesting appears.

We still only want you to transcribe what’s in the x_normal.wav file. In particular you should not insert speech sounds that are present in the respeaking version if they are not in the normal version. You might like to record something relevant, like breathy voice of a vowel cluster, as a hint to the next linguist that there’s probably an elided [h] there, for example.

Step-by-step recap

1. Open one or more files from the data folder into Praat’s object window. You might like to do ten at a time since the object window gets unwieldy with many.
2. Select annotate for the x_normal.wav file. Select a single tier called Phonemic.
3. Then click both the file and the annotation and select View and Edit.
4. Transcribe the phrase using the X-SAMPA symbols provided.
5. Optional: Open x_respeaking.wav into Praat where available, or play from the object window etc.
6. Save the Praat text grid as x.Textgrid.
7. Move on to the next file and repeat.

About Nepali

Nepali has a full range of voiced and unvoiced stops at bilabial, dental, alveolar and retroflex places, with a distinction between unaspirated and aspirated. However, aspirated voiced stops are actually breathy voice or murmured voice. For speakers of English the breathy voice and retroflex ʈ and ɖ will be distinctive. The most prominent cue observed for the aspirates, voiceless as well as voiced, is the appearance of breathy or muffled voicing and lowered F0 on the following vowel.

Nepali has geminate consonants that are distinct from single consonants.

The following are some phonotactic features that have been noted:

Intervocalic h deletion: /mʌɦina/ -> [mʌina] or [mʌ:na]
Voiced retroflex [ɖ] after vowels is often realised as a retroflex flap [ɽ]. You can of course get both, such as: /pʌɦaɖi/ -> [paɦaɽi] -> [pa:ɽi]. However, this does not happen for geminates, so /ʌɖa/ ‘stop’ -> [ʌɽa] but /ʌɖɖa/ ‘office’ -> [ʌɖɖa]. Note: A really good way of spotting retroflex consonants is that the F3 formant comes close to F2, most visible in adjacent vowels.
In general Nepali has lost vowel length distinction, so long vowels point to either a vowel cluster (it can be hard to tell what a syllable is) or the leftover result of elision as in the above example. It’s also been observed that word final vowels can run into the same vowel at the start of the following word. Careful speech will often reveal word boundaries.
Loss of contrast on b, d, g, m, ŋ after nasalised vowels.
In spontaneous speech, the voiced breathy/aspirates lose their aspiration intervocalically and word-finally.
Lenition: /sʌp a/ -> [sʌɸa]
/r/ is a tap intervocalically but a trill elsewhere.
Word initial clusters limit the second consonant to a rhotic or a glide, such as /prʌd an/ ‘chief’, /pual/ ‘whole’ -> [pwal] and /piadz/ ‘onion’ -> [pjadz].

Some example Nepali words

[ma] / [hami]     I / we
[mʌɦina]          Month
[t iyo]           Often phrase final in statements which locate something, say whether it existed etc.
[b ayo]           3rd sg. past of [hunu]; lit. it has become. Also interrogative, “enough!”.
[nau]             Nine
[patʃʰi]          Post positional “After”
[dekʰi]           Post positional “From”
[euʈa]            Adjective: One (thing)
[kaʈ ʌmaɖa ]      Kathmandu, capital of Nepal
[pʌni]            Also, [pʌni pʌni] adv. as well as.
[p arkinu]        To come back (citation form)
[t eu]            Edge.

Notes on morphology

In our scenario we’re trying to work out the morphology as we go, but we have some ideas and these are helpful to correctly identify some of the commonly recurring speech sounds.

We do know that Nepali has extensive case marking that appears as agglutinating suffixes.

There are two types of nouns, o-final and non-o-final. O-final noun stems change to indicate morphological features such as number, gender, form and diminutive.

Nepali nouns are either singular or plural. The unmarked citation form is singular; the plural changes o-final nouns to a-final instead. There’s also a postpositional indicator of plurality, -ɦʌruː. E.g. singular ‘son’ /ts oro/, plural ‘sons’ /ts ora/ or /ts ora-ɦʌru/. Singular ‘house’ /g ʌr/, plural ‘houses’ /g ʌrɦʌruː/.

Gender is limited to masculine and feminine. Human nouns see grammatical agreement on the verb. Morphological gender changes the citation form to i:-final.

Some word suffixes that have been observed:

-le      ergative/instrumental case marker
-lai     object marker
-nu      Infinitive
-eko     Perfect constructions. E.g. [garne] will do, [gareko] did.
-ma      Locative, in, at, on etc.
-haru    Pluralizing suffix

Nepali Phonology – IPA

Consonants (places of articulation: Bilabial, Dental, Alveolar, Retroflex, Palatal, Velar, Glottal)

Nasal         m n ŋ
Stop          p b t d ts dz ʈ ɖ k ɡ
              p b t d ts dz ʈ ɖ k ɡ
Fricative     s ɦ
Rhotic        r
Approximant   (w) l (j)

Note: Voiced aspirated consonants are usually realised as breathy-voiced aspirated.
Vowels Front Central High i ĩ Close-mid e ẽ Back u o ʌ ʌ Open-mid a ã Open Dipthongs /ui/ /iu/ /ei/ /eu/ Nepali Phonology – X-SAMPA /oi/ /ou/ / ʌi/ / ʌu/ /ai/ /au/ ***THIS IS WHAT YOU SHOULD TYPE! *** Consonants Bilabial Dental Alveolar Retroflex Palata Velar Glotta l m Nasal Stop l n N p b t d ts dz t` d` k g p_ b_ t_ d_ ts_ dz_ t`_ d`_ k_ ɡ_ h h h h h h h h h h 77 Fricative s Rhotic r Approximan (w) h l (j) t Vowels Front Central Back High i i~ u u~ Close-mid e e~ o V V~ Open-mid a a~ Open Dipthongs /ui/ /iu/ /ei/ /eu/ /oi/ /ou/ / Vi/ / Vu/ /ai/ /au/ Other useful X-SAMPA symbols: [ɽ ] retroflex tap = r' [ɸ] bilabial fricative = p\ [a] breathy voice = a_t 78 [β] voiced bilabial fricative = B Appendix 2: Transcription Log These pages are the output of the automated transcription accuracy system used in this study. It also serves as a record of the files and experimental conditions. The X-SAMPA symbols used in transcription have been converted to IPA to facilitate comparison. The log is in the following format: <file number> (male/female, normal/noisy, respeaking/norespeaking): ISO 15919: <Romanization of Devanagari> 1: <Transcription from participant 1> 2: <Transcription from participant 2> 3: <Transcription from participant 3> 4: <Transcription from participant 4> (Files 1-49 inclusive only) Acc. 1: <Accuʃacy of paʃticipant 1>, <Acc.2: Accuʃacy of paʃticipant 2> etc… Acc. Total Av: <Total mean accuracy of participants> 1 (female,normal,respeaking): ISO 15919: mātʃai thiyau 1: matsetatijo 2: matriatiu 3: matʃeantiu 4: matriatiu G: matreatijo Acc. 1: 75, Acc. 2: 85, Acc. 3: 76, Acc. 4: 83, Acc. Tot Av: 80. 2 (female,noisy,respeaking): ISO 15919: ʃa hāmīhaʃu pani āttiyau. 1: ɽahamirupaniatijo 2: dahamiɽubanietiu 3: lahʌmʌrupaniatiu 4: rahamirbaniatiu G: rahamirupaniattijo Acc. 1: 91, Acc. 2: 76, Acc. 3: 78, Acc. 4: 80, Acc. Tot Av: 81. 
3 (female,noisy,no respeaking): ISO 15919: thiyō ʃa ēkdamai malāī cai ēkdamai āphnō mṛtyukō mukhabāṭa 1: raektʌmimaletseiektʌmiaafnumidtikumukfataɡ 2: eɽaɡeɡtamaimaleseeɡtomiaɸmuŋiltilkomoksakel 3: erʌeɡdomiimaletsieekdameiapnuʌmuʃtukomukfata 4: raʔeʈamimalesiiɡdomiaɸnuamiʃdiʃkomufaika G: rʌekdʌmemaleitsiekdamiaɸnuʌmiɽtjukomukɸaʈʌ Acc. 1: 73, Acc. 2: 64, Acc. 3: 77, Acc. 4: 69, Acc. Tot Av: 71. 4 (female,normal,respeaking): ISO 15919: ʃa phaʃkinē kʃamamā kē bhayō bhandā ma ʃa mēʃō 2 janā bhāīhaʃu phaʃkinupaʃnē bhayō, 1: rafalknikʃamakivaibʌndemʌɽʌmeɽuduisanapaiarufʌrkinupanibajo 2: eɽaɸaɡkikambanɡiβaindemeumʌremiɽiduisanabaieruɸodbinubʌniubʌjo 3: arauɸalknikrampakiɸaibʌnaumoramiɽuduisanapaieruɸolkinupanibajo 4: raɸarkiniklʌmakiwaiwʌnramoramiriduizenebajiruɸarkinipanibajo G: erʌɸarknikrʌmakebabʌndamʌremiriduisanabaieruɸʌrkinupʌnibʌjo Acc. 1: 83, Acc. 2: 72, Acc. 3: 79, Acc. 4: 78, Acc. Tot Av: 78. 79 5 (female,normal,no respeaking): ISO 15919: hāmīlāī ḍaʃa lāgyō ki katai mailē 1: ʃaŋleidʌdlaɡɡikʌʈimaidli 2: damlaidaleɡɡiɡoddimaili 3: ʃaaŋleidodleɡɡukoddimailu 4: raleidorlaɡiɡiɡotimaili G: ʃanleiɖʌrlaɡkikʌtemaile Acc. 1: 84, Acc. 2: 73, Acc. 3: 69, Acc. 4: 85, Acc. Tot Av: 78. 6 (male,noisy,respeaking): ISO 15919: ʃa ma agāḍī paṭī basēkō thiē ʃa mēʃō sāthī pani mēʃō chēumā thiyō . 1: ramʌahaʈipoɖibʌsikoteʃoŋeʃosatipanimiʃatsumate 2: ramoaibeobombosikoʈerʌmesadipʌnmeʃotseuate 3: ramoaɡaʃibʌɽibosiɡʌtteʃameʃosadipanimeʃotseumate 4: ʃamoaɽibrdibosiɡateɖʌmʌsatipunumiʃutsumateh G: ramʌaɡaɖipʌʈibʌsekoteʃʌmeʃosatipʌnimeʃotseumate Acc. 1: 90, Acc. 2: 75, Acc. 3: 88, Acc. 4: 71, Acc. Tot Av: 81. 
7 (female,normal,no respeaking):
ISO 15919: kinabhandā bubā ra āmā cai utai nēpālagañjamai hunuhunthiyō, hāmī kāṭhamāḍa jādai thiyau ra
1: atijokinomandabuβeɽaamaseiuteiŋepaɡenahunentijohamikafmanusanetijohuɽat
2: atiukinamanɖaβaweɽeamasaiwudenepaɡaŋeuletemiɡaswenesanenteuɽak
3: ateukinomandabuwerahamaseiuteinipalunehununteuhamikaʈwenuzaniteuhirak
4: atiukinowandabuwaʃaamaseiwudeinipaʃninutiuhamikahonozamitiuʃa
G: attijaukinʌmandabubaraamaseiuteinepaɡanahununtijohamikaʈmanusanetijauhuʃa
Acc. 1: 85, Acc. 2: 59, Acc. 3: 77, Acc. 4: 70, Acc. Tot Av: 73.
8 (female,noisy,no respeaking):
ISO 15919: mērō bubā āmā ra aru mānchēharulāī dēkhna
1: mirapubaamaraʌurumantsehaledihnina
2: meɽebumamaraorumanselaidehina
3: mirabuwaanmaraʌʃumansehelaidiɡnilʌ
4: miropubaamaroodumanseladihnia
G: mirobubanmaraʌʃumantsehʌleideknʌ
Acc. 1: 75, Acc. 2: 73, Acc. 3: 78, Acc. 4: 69, Acc. Tot Av: 74.
9 (male,normal,no respeaking):
ISO 15919: tira thiyō tāplējuṅa bhannē ṭhā . ēkdamai jōkhima pūrṇa .
1: diʃatijotaplesuŋʌnethauiɡdʌmetsuhimpuɖna
2: tiɽatetaplismwnetauaedemajokimpuna
3: tiratetaplesunmannetauaeɡdʌmeidohimpuʃna
4: diratoɖaplezuŋʌnetauŋaiɡdamezohimpuna
G: tirʌtiotaplezuŋʌneʈauiɡdʌmetsuhimpuɖna
Acc. 1: 90, Acc. 2: 71, Acc. 3: 75, Acc. 4: 82, Acc. Tot Av: 79.
10 (female,normal,no respeaking):
ISO 15919: tyō dina hāmī jahājabāṭa kāṭhamāḍa pharkiēnau
1: tijodinhamikuŋipanilembatasikhuneisahasbatakafmantufʌlkjenʌu
2: teuntinhamikunipiniplembatskashassetteɡaɸond ɸalɡenu
3: deudinhamiɡunipʌnipleinbatʌsekʃesihaswataaɡaɸunduɸʌlkenu
4: diudinhanikunipʌnipleimbetesukuʃizeheswetaakaɸunduɸolkinu
G: tijodinhamikunipʌniplenbaʈasikuneidzʌhadzbaʈʌkaʈmanɖufʌrkienʌu
Acc. 1: 84, Acc. 2: 65, Acc. 3: 73, Acc. 4: 70, Acc. Tot Av: 73.
11 (male,normal,no respeaking):
ISO 15919: phōna garēra bhanē yastō yastō bhayō bhanēra .
1: fuŋɡaʃʌbanjestestʌvajoneʃʌ
2: hoŋaʃawanihstsestuwaioneɽe
3: ɸonɡaʃabanestestabaiʌneʃa
4: ɸonɡaʃaβanestestowajaniɽa
G: ponɡaʃʌbʌnestestobajonerʌ
Acc. 1: 82, Acc. 2: 63, Acc. 3: 83, Acc. 4: 79, Acc. Tot Av: 77.
12 (male,noisy,no respeaking):
ISO 15919: mērō pachāḍī basēkō arkō ēka janā bhāī thiyō u pani
1: mirpʌtsaiɖipʌsuvʌrkiɡasainabaipentsiuupeni
2: miɽabataɽipasuakiɡsanaabaibenseupipani
3: mirupatsadipasuarkidzanabaipuntaupani
4: mirbatairipasiwarɡɡzainʌ
G: merpʌtsaɖibʌseoʌrkeɡadzʌnabaibʌntijoupʌni
Acc. 1: 76, Acc. 2: 61, Acc. 3: 71, Acc. 4: 42, Acc. Tot Av: 62.
13 (male,normal,no respeaking):
ISO 15919: mailē mērō aba āmālāī samjhē
1: kamoilimirʌamalasondze
2: kamailemeʃoaβamamalasʌmdze
3: kʌmweilemeroʌwʌamalasʌmdz
4: kʌmweilimiʃaβamamalasondzi
G: kʌmailemeroamalasʌmdze
Acc. 1: 85, Acc. 2: 77, Acc. 3: 72, Acc. 4: 65, Acc. Tot Av: 75.
14 (male,normal,no respeaking):
ISO 15919: ra mānchēharu ēkdama jōkhimapūrṇakō yātrāharu
1: ʃaiktomantseʃuktaŋsoheŋbulnaɡajapahaʃuh
2: rabikdamantseɽeɡuŋsuhinbunaɡajaʈaɽu
3: rabiɡtomansʌʃʌɡtamzuhimbuʃnaɡajatʃaʃu
4: radomantsirikumzukimbunaɡajadʃauʃuh
G: ʃaekdamantseʃuktamdzokimbuʃŋaɡajatʃahaʃu
Acc. 1: 82, Acc. 2: 67, Acc. 3: 79, Acc. 4: 70, Acc. Tot Av: 75.
15 (male,normal,respeaking):
ISO 15919: tyahā hērdā khērī ta gāḍī ta mātra ēka inca mātra
1: teeɽahiridoɡaidiktoikmatʃe
2: teɽʌhedaɡaɽiktaikmataʃ
3: tsaʃakeɖoɡaɖiktʌikmatʃe
4: derakeraɡaʃitaikmatʃi
G: derʌkerʌɡaɽiʔtaekmatre
Acc. 1: 59, Acc. 2: 76, Acc. 3: 77, Acc. 4: 86, Acc. Tot Av: 75.
16 (female,normal,no respeaking):
ISO 15919: mērō parivārakā 5 janā sahita nēpālagañja jānuparnē bhayō.
1: mirʌpʌʃiveeʃkapatsenasaimipahɡentanupanibajo
2: miɽupoɽiβeɡapasenesainibpalɡansanubanjubai
3: miʃapoʃiβeʃkʌbadsenʌseinibadɡʌnsanbanimbaju
4: mirboriberɡabadzʌnʌsaimibanɡonsanbanibajo
G: meropʌʃivakapatsanasainepalɡʌndzanupaʃnebajo
Acc. 1: 80, Acc. 2: 73, Acc. 3: 71, Acc. 4: 76, Acc. Tot Av: 75.
17 (female,noisy,respeaking):
ISO 15919: malāī ēuṭā pārivārika kāmalē gardākhēri
1: maleieutapariverikambliɡoʃdahiʃi
2: malaieudapaɽiɸiɽikambiɡadai
3: mʌleieuɖapaʃiβeʃikanleɡodekiʃi
4: maleiiutapariverikambliɡodehaʃi
G: mʌleieuʈaparivarikʌmbliɡʌʃdaeʃi
Acc. 1: 89, Acc. 2: 72, Acc. 3: 79, Acc. 4: 82, Acc. Tot Av: 81.
18 (female,noisy,respeaking):
ISO 15919: , tyō bhandā arkō dina cai hāmī kāṭhamāḍa pharkyau nēpālagañjabāṭa.
1: alalkutijobandaalkodintseihamikaɖenkaʈvandoufʌʃkionipaliosveʈa
2: dalalkatuɡandalaɡotinsaiamikatupaswanduɸokinupalswata
3: olalɡudeupandaalɡudinseihamikaʃʌnkaʈmanɖuɸʌʃkiunipalɡonswetʌ
4: alalɡətiupandalɡudinseiamikatnkatmandufolkiunipaloswata
G: alalkotjobʌndaʌrkodintseihamikaʈenkaʈvamdupʌrkjonepalndzaʈa
Acc. 1: 81, Acc. 2: 64, Acc. 3: 74, Acc. 4: 75, Acc. Tot Av: 73.
19 (female,noisy,no respeaking):
ISO 15919: ukta vimānasthala
1: uktabimenistal
2: okopimanistal
3: uɡdabimanistal
4: uptabimanistal
G: uktavibmanʌstʌl
Acc. 1: 77, Acc. 2: 66, Acc. 3: 79, Acc. 4: 79, Acc. Tot Av: 75.
20 (male,noisy,no respeaking):
ISO 15919: ṭhā kō phōṭōharu pani mailē khicēkō thiē .
1: etithaɡupudʌhaʃubenumalihiseɡate
2: batakabokoteɽemaŋaikatsaka
3: petatalɡopulɖʌʃaβenimalikidziɡaten
4: ətətalɡokoldəʃəwinmalikizekaten
G: ekeʈauɡopoɖoʃubanimailikitseɡote
Acc. 1: 67, Acc. 2: 50, Acc. 3: 66, Acc. 4: 56, Acc. Tot Av: 60.
21 (male,noisy,respeaking):
ISO 15919: unāisa saya sattarī sāla tirakō cai lyānḍa rōvara mōḍēlakō gāḍīharu thiyō .
1: unaisesottorisaltirakoltsejolɑndʃovaʃmoɖelhohaɖihaɖiheʃotie
2: unaisesatsaɽisantsaɡotselanʃoɸamodzubaɡaɽiɡaɽieɽutsiʃ
3: uneisisʌtoʃisaltilakoutseulandʃoβeʃmooɖalkoɡaʃiɡaʃiʃuti
4: neisisotarisaltirakultirlanrobamorolkariɡiɡaʃiʃətje
G: unaisesʌtterisaltirakotseolanɖrovarmoɖelɡoɡaɖiɡaɖihaʃitie
Acc. 1: 86, Acc. 2: 66, Acc. 3: 77, Acc. 4: 65, Acc. Tot Av: 73.
22 (female,normal,no respeaking):
ISO 15919: ṭikēṭa kāṭēra
1: eutahatikerkaʈira
2: eudzaaltsikaɡaɖiɽa
3: eutaamdikarkaɖirʌ
4: ewtaamdikərkadira
G: euʈaamʈikeʈkaʈerʌ
Acc. 1: 71, Acc. 2: 65, Acc. 3: 82, Acc. 4: 70, Acc. Tot Av: 72.
23 (female,normal,respeaking):
ISO 15919: napāunē ta haina?
1: paunedaina
2: paunedaine
3: paunitaina
4: paunidaina
G: paunetainʌ
Acc. 1: 93, Acc. 2: 93, Acc. 3: 92, Acc. 4: 91, Acc. Tot Av: 92.
24 (female,noisy,no respeaking):
ISO 15919: tala thiyō ēuṭā bhayāvaha,
1: palatijobimanissamantʌletijoijokahaijavahʌ
2: ɡalatsiubiminesanatsalatsiuɡahayadoho
3: talatiumiwenisanantalatiuʃahajaβahou
4: dalatiubimanistanadalatiudeajavaho
G: talatijobimanissʌlʌmtʌlʌtijoɡabʌjavʌhʌ
Acc. 1: 74, Acc. 2: 65, Acc. 3: 66, Acc. 4: 71, Acc. Tot Av: 69.
25 (female,noisy,no respeaking):
ISO 15919: sāyada tyahī kāraṇalē gardā hōlā,
1: sahitteikarindeɡoʃtehoʃe
2: saidzdzaiɡaʃindzaɡodzalets
3: sajedteikarinɖeɡoɖɖehole
4: saiddeiɡaʃindiɡadahode
G: sajiʔdeikarendeɡaʃdahola
Acc. 1: 82, Acc. 2: 56, Acc. 3: 80, Acc. 4: 80, Acc. Tot Av: 75.
26 (female,noisy,respeaking):
ISO 15919: ra tyō jahāja, hāmīlāī thāhā bhayō ki tyō jahāja rātīkō 10 bajē
1: ratijosahastahamlaithahabojokhitijosahasratikotʌsvase
2: raɖesehashnlaithadaikikidzsasʃatsikodoswasi
3: ratusʌhashʌmleitaβeikitusasʃatikodoswasi
4: radiuzʌhashamletaweikikdouzasratikadoswwzi
G: ratijosʌhashamlaitahabʌjokhitjosahasratikodʌsbʌse
Acc. 1: 87, Acc. 2: 63, Acc. 3: 71, Acc. 4: 65, Acc. Tot Av: 72.
27 (male,normal,no respeaking):
ISO 15919: tyō kāma cai nēpālakō purānō bikaṭa ṭhā tira gaēra
1: tijokamtseinepalkopurahanopikhaʈautiraɡʌiʃas
2: tukamsainipalkopuɽanobiɡatsautsiʃaɡeiʃes
3: tiokamseinepalkupuranupikaɽʈauntiɽaɡoiʃʌs
4: doukanseinepalkoburanudikatauntiraɡaiʃas
G: tjokamtsainepalkopuranobikaʈautirʌɡʌeras
Acc. 1: 83, Acc. 2: 85, Acc. 3: 83, Acc. 4: 83, Acc. Tot Av: 83.
28 (male,normal,no respeaking):
ISO 15919: thiē jallē cai tyō bāṭōmā calāuna sakthē ra calāuna āuthyō
1: tiezalezettijobatematsohonʌfʌhtiratsohonahotijonetlai
2: tisalisetubakmatsaunaɸoktsiɽitsonautsuneleh
3: tiʌsalesitubakmʌtsonʌɸaktiratsonautunelei
4: tizalisetiubatmatsonətoktirasounaudiunilei
G: tiezalezetjobaʈomatsaunʌfʌkteratsʌlaunautjonerlai
Acc. 1: 75, Acc. 2: 68, Acc. 3: 72, Acc. 4: 70, Acc. Tot Av: 71.
29 (female,normal,no respeaking):
ISO 15919: ra hāmīharu cai surakṣita thiyau.
1: arahamiharusisuratsiottijo
2: arahamiɽesesoɽetsiɸim
3: arahamiorʌsinsuβʃatitteu
4: arahamiurasisurasetiu
G: arahamihʌrutsisuraksittijau
Acc. 1: 84, Acc. 2: 58, Acc. 3: 69, Acc. 4: 73, Acc. Tot Av: 71.
30 (female,noisy,respeaking):
ISO 15919: mērō āmā ra bubā cai nēpālagañjamai basnuparnē bhayō. (ra) hāmīharu pharkinē krama thiyō tyatibēlā,
1: miruamarabuvasahinepalionnivosnupadnibajorahamiharufʌrkinikramtijotjatibela
2: meɽuamaɽebubasainipaleuβiɸosnipanipajobilahamiriɸokimikʃamkekekiβileh
3: miruammarabubaseinepalɡonniβosnepadniβajoirahamiruɸolkinikʃantetetiβila
4: miruamarabubaseinepalionrevʌsnepanibaijoiʃahamiʃufaʃkinikʃmtiotiotevela
G: meruamarabuvasainepalionrivosnepanibajoerahamirufʌrkinikrʌmtiotjotibela
Acc. 1: 86, Acc. 2: 74, Acc. 3: 84, Acc. 4: 91, Acc. Tot Av: 84.
31 (female,noisy,respeaking):
ISO 15919: ra tyahā gaēra hērdai cai mērō bubā āmā
1: ratihaɡoiʃaheʈatseimiʃububaama
2: laʈiɡoiɽiŋdasaimiɽupubahamas
3: ʃatijaŋɡoiʃahiʃdaseimiʃubuβaamma
4: ratianɡoiʃaiʃtasimiʃupubaamas
G: ratjaɡaeʃahedatsaimeʃububaama
Acc. 1: 85, Acc. 2: 66, Acc. 3: 76, Acc. 4: 71, Acc. Tot Av: 75.
32 (female,normal,no respeaking):
ISO 15919: dhērai nai jahāja durghaṭanāharu hunē garthyō
1: rinidzahasturhateenaaruhuniɡaɖtijo
2: janitsahastsuɡatsanaɽuniɡotu
3: rinisahasturɡetanahaʃuhuniɡaɽtiu
4: ʃinidzahasduʃɡatenaʃuhuniɡaʃtiu
G: rainidzahasdurɡaʈanaaʃuhuneɡaʃtjo
Acc. 1: 82, Acc. 2: 65, Acc. 3: 83, Acc. 4: 86, Acc. Tot Av: 79.
33 (female,noisy,no respeaking):
ISO 15919: bicarā hāmī 3 janā
1: pitsaʃahamitinzaanaa
2: pitssaɽahamitinsana
3: itsaʃahamitinsana
4: pitsarahamitinzana
G: pitsʌrahamitinzʌna
Acc. 1: 82, Acc. 2: 83, Acc. 3: 85, Acc. 4: 87, Acc. Tot Av: 84.
34 (female,noisy,no respeaking):
ISO 15919: ra 10 minēṭakō jahāja uḍānapachi cai acānaka jahājalāī tyahānēra kē bhayō kasailāī thāhā bhaēna.
1: ʃadosmniʃkuuzahazuʃanpasseiʌtsanidzahatslaiktehanirekibajokosteɖleithahabohina
2: nadzosnikʌsahastudzenpasisaiatsanibtsahadzslaikdanirakiboɡoslatsaɸana
3: raboswʌnitkuʌsʌhassuɽanposʌseiʌtsanudzʌhaslaikramirikihoiukosʌtleitaboino
4: radosmunirkuzahasuranposesiatsanzahaslaikdianɡivoikostaleitavoina
G: radosnifkuzahasuɖanpʌsseiʌtsanedzʌhatsleiktjanerekebʌjokʌseleitabʌjnʌ
Acc. 1: 73, Acc. 2: 67, Acc. 3: 71, Acc. 4: 67, Acc. Tot Av: 70.
35 (female,noisy,respeaking):
ISO 15919: ra hāmīharu tyahā basyau ra 10 minēṭa jatikō cai uḍāna bhayō ukta jahājamā
1: pirahamiharutehapaseuradʌsmineʃdzatikotsijuranbajouktadzahazma
2: biraahamiruɡambasuʃadosmisadikusiamuɖanbajoupasahasweʃ
3: raʌhamirudihamposudadosmbirsatiɡosihiuɖanbajouktazahazma
4: raahamiarudeanbosuradosmirdzatikosiamuranpajouktazahazma
G: pirahamihʌrutjapasuraɖʌsminetsatikotseihuɖanbajouktadzahazma
Acc. 1: 86, Acc. 2: 73, Acc. 3: 73, Acc. 4: 75, Acc. Tot Av: 77.
36 (male,noisy,respeaking):
ISO 15919: tara nēpālamā tyastō bāṭōharu ēkdamai dhērai chana
1: paranepalatestobaʈʌhruikdʌmedheretsan
2: taranepalatobaʈonsekdamedeʃetsana
3: taranepalatistobaʈohuʃuiektomedeʃitsana
4: ranibaladisubaturuitamideresan
G: taranepalatjastobaʈohruekdʌmedeʃetsana
Acc. 1: 85, Acc. 2: 80, Acc. 3: 88, Acc. 4: 71, Acc. Tot Av: 81.
37 (female,normal,respeaking):
ISO 15919: hāmīharu kāṭhamāḍa pharkina lāgyau ra ma ra mērō 2 janā bhāī cai ukta plēnamā basyau, jahājabhitra basyau.
1: hamiharukatfandofʌlkiolaɡioeʃamʌʃamiʃuduizanabahitsaiuktʌplenmawaseotsahasviklavase
2: hamiruɡaɸinaɸokinlaɡueramoramiɽadzwisanabaisaiuktsaplenawasiusahasiktsarasu
3: hamiharukaʈʌmdʌɸolkinulaɡijoeʃamohaʃamiʃuduisanabaisajiuktaplimnawasiusahasβikʃawasiu
4: hamiarukatmndəforkinalaɡiuʃamoʃamiʃaduizanapaitsaiuktaplenawaseudahaswiklawaseu
G: hamiaʃukatfandofʌrkiilaɡiueʃamʌʃameʃuduizanabaitsaiiuktʌplenmabasiuzahasfitʃabase
Acc. 1: 89, Acc. 2: 73, Acc. 3: 77, Acc. 4: 79, Acc. Tot Av: 80.
38 (female,normal,respeaking):
ISO 15919: ra 10 minēṭa pachi cai, dhanna 10 minēṭa pachi ukta jahāja
1: radʌsmniɽpasizeidannadʌsmniɽpasiuktadzahaz
2: radzasnipasisaidanadasmipasiamuktasahas
3: eradosmirɖβasisedahnadasminiɽpasiauktasahas
4: radosmunirposeseidonʌdosnirpaseiamuktasahas
G: radʌsmneʈpasizeidannadʌsmneʈpasiuktʌdzahaz
Acc. 1: 92, Acc. 2: 74, Acc. 3: 73, Acc. 4: 69, Acc. Tot Av: 77.
39 (male,normal,respeaking):
ISO 15919: āja bhandā kamsēkama duī barṣa agāḍī ma nēpālamā h dā khērī
1: aʌbandakomskomduibeʃsʌɡaiɖimo nepalmahodakheʃi
2: azuwandabkomsomdwiβasaɡaɽimonapanaudzakeri
3: aserbʌndakomɸekomduibasoɡaʃimonepalnahudakiʃi
4: asebandakomskomduibarsaɡaʃimonepalmaʃoudakiʃi
G: asʌwandakʌmskʌmduibʌr ʌɡaɖimʌnepalmahodakeri
Acc. 1: 87, Acc. 2: 74, Acc. 3: 78, Acc. 4: 85, Acc. Tot Av: 81.
40 (male,normal,no respeaking):
ISO 15919: ra yastō āyō ki tyō gāḍī ēkai cōṭī ḍhalkiyō
1: raistoajoɡitoɡaiɖiekodzuʃudʌlkio
2: ʃamistsuʃauɡitsaɡaɽiikatsajadzolka
3: raistoauɡintuɡaʃiikudzʌʃɖʌʃɖolkijo
4: raestoaiwiɡindaɡaʃiekətəʃədolkə
G: raistoajoɡitoɡaɽiekodzuɽiɖʌlkio
Acc. 1: 88, Acc. 2: 60, Acc. 3: 71, Acc. 4: 61, Acc. Tot Av: 70.
41 (male,normal,respeaking):
ISO 15919: aba cai ma gaē bhanna ṭhānēkō thiē . gāḍī yastarī ḍhalkiyō ki
1: awʌdzeimʌɡojewentaneɡʌteɡaɖiesteridʌlkiokiamdʌd
2: awatsemaɡaehewanʈaneteɡaɽisteɽadzolkikandub
3: awatseimʌɡaewʌntaneteɡʌʃitsuʃidolkiukiaŋdʌb
4: awatsemaɡoewantaniɡadiɡaʃisteʃidolkekiandə
G: ʌwʌtseimʌɡaewʌntaneteɡaɖisteʃiɖʌlkiɡamdʌd
Acc. 1: 79, Acc. 2: 75, Acc. 3: 80, Acc. 4: 70, Acc. Tot Av: 76.
42 (male,noisy,no respeaking):
ISO 15919: cai tyahā sarvē garnuparnē jastō thiyō . ra ma ra mērā
1: etsiasʌrviɡʌnupaniostʌtijoʃamʌaʃamiʃa
2: titsasaɽiɡonpaniʃstsatsiʃʃamoɽamiɽa
3: utssoruiɡonuβanestutiu
4: rətiatsoriɡoʃnubaniostatiuʃamoʃamiʃa
G: etjasʌrveɡʌʃnupaʃnestotijoʃamʌʃameʃa
Acc. 1: 82, Acc. 2: 59, Acc. 3: 46, Acc. 4: 70, Acc. Tot Av: 64.
43 (male,noisy,respeaking):
ISO 15919: tyō mānchēlē cai u yastō āphu ēkdama ātma biśvāsa thiyō
1: patiomantselaitseiwestoafuiɡomafubisaftehitaʌ
2: adzamantseletseiujʌstoaɸidomabisatsstsukitio
3: ʃatemantseletseujʌstoaɸudʌmapubisvastudiɡau
4: ʌdiumantseleseuestouahikoah orkihiteu
G: atiomantseletseiujastoafuɡʌmafnubisaftehiɡau
Acc. 1: 80, Acc. 2: 67, Acc. 3: 75, Acc. 4: 62, Acc. Tot Av: 71.
44 (female,normal,respeaking):
ISO 15919: ēkadamai ḍaralāgdō jangala
1: ektʌmaidʌdlaktozzŋɡel
2: ekamedzalakdzsaŋɡel
3: ekdomiɖadlaɡdʌsoŋɡʌls
4: iɡɖamidalaɡdozaŋɡal
G: eɡdameiɖalaɡdozʌŋɡʌl
Acc. 1: 75, Acc. 2: 70, Acc. 3: 77, Acc. 4: 87, Acc. Tot Av: 77.
45 (female,noisy,respeaking):
ISO 15919: ma mērō 2 janā bhāī, bubā ra āmā tyahi thiyau
1: mameruduisanʌbhaibuwaraamatehitijo
2: mamiɽupusanapaihimuβaɽanmatitsiu
3: momiruduisanapajibuheraanmatehitijo
4: momiruduisanopajibubaraamadihintiu
G: mameruduidzʌnabaiibubaranmatijitju
Acc. 1: 78, Acc. 2: 73, Acc. 3: 77, Acc. 4: 76, Acc. Tot Av: 76.
46 (male,normal,no respeaking):
ISO 15919: tyahā cai ēuṭā gāḍī mātra aṭnē jastō ṭhā thiyō .
1: teaseeoraɡaɖimatʃʌʌʈniostoʈautijo
2: tsaseɸiulaɡaɽimatsʃoknestsoktsautsiʃ
3: deuseioraɡaɖimatʃooʃtnijostoʃtauntijo
4: diaseieuraɡaʃdimatʃooʃniustortaundiu
G: tjaseeuwlaɡaɽimatsʃʌʌʈnejastoʈautijo
Acc. 1: 82, Acc. 2: 65, Acc. 3: 71, Acc. 4: 63, Acc. Tot Av: 70.
47 (female,noisy,no respeaking):
ISO 15919: nēpālagañjamā hāmīlē bitāyau, 1 mahinā jati bitāyau sāyada, dhērai ramāilō bhayō.
1: nepaliosbahablibitajoekminazutebitajosahiterirʌmaiɖebajo
2: napaloŋtalibitsaiewinasibitsitsasaidedemamaidzaboio
3: nepalunsmaublibidajuehminʌsitipidaijusaideridemaidʌbajo
4: nepalɡonsnahalibitajoeminazutibitajosaiɖiʃiʃomailibajo
G: nepaloŋmalebitajoekminadzʌtibitajosajʌderirʌmailʌbʌjo
Acc. 1: 78, Acc. 2: 64, Acc. 3: 76, Acc. 4: 78, Acc. Tot Av: 74.
48 (male,noisy,respeaking):
ISO 15919: tyastō huncha hōlā bhanēra .
1: tistohuntsaholaponeʃa
2: tastohuntsalawanera
3: dzʌstuhunsahulaβoneʃa
4: dzstouhunsaulawunira
G: tistohuntsaholapaneʃa
Acc. 1: 96, Acc. 2: 79, Acc. 3: 81, Acc. 4: 68, Acc. Tot Av: 81.
49 (female,noisy,no respeaking):
ISO 15919: kāṭhamāḍa kō gharamā cai hāmīlāī parkhēra basēkā mērā hajuramuvā hajurabuvā haru,
1: kaifmandukukʌlmatsehmlipʌrkerʌbʌsikamirahasiluahasirwaharu
2: ɡamindzuhuɡomatssenlebodzkiɽibasiɡamenasinasuweɽu
3: kaʈmʌnɖukukohomesihʌmlibʌdkiɽoβosikaminohosʌrmahasʌruwaharu
4: katmandukukolmatsehuliporkabosikamirahausilnahasirwuharu
G: katmanɖukuɡʌʃmatseihamlʌbʌʃkeʃʌbʌseɡameʃahazimnahaziʃuwaaʃu
Acc. 1: 81, Acc. 2: 62, Acc. 3: 73, Acc. 4: 75, Acc. Tot Av: 73.
50 (female,noisy,respeaking):
ISO 15919: ra hāmī āttiyau, dhērai āttiyau,
1: rahamiatijodereiatijo
2: rahamiattiudeɽeiatiu
3: herahamiattijodireiattijo
G: ʌrahamiattijodereattio
Acc. 1: 77, Acc. 2: 78, Acc. 3: 81, Acc. Tot Av: 79.
51 (male,normal,no respeaking):
ISO 15919: ra tyō gāḍī harulāī cai hāmīlē kati
1: ratijoɡaidieʃlaitseiamliokoti
2: ʃatsiuɡadzulasaineukoti
3: heratijoɡaɽijudʌlaitsejanliuɡoti
G: ratijoɡadiʃulaitseihamlekoti
Acc. 1: 80, Acc. 2: 61, Acc. 3: 63, Acc. Tot Av: 68.
52 (male,normal,no respeaking):
ISO 15919: cintā nagarnusa bhanyō ra uslē cai gāḍīkō sṭēriṅga mōḍēra
1: tsinanʌhonifoinierahuleizaɡaiɖikesteʃiŋmoʃeʃa
2: tsinanonaɸonaaulesaɡaɽikastseɽiŋweɽo
3: tsinanonʌɸoinehowalisaɡalikatsteʃiŋmoʃeʃa
G: tsinanʌhonufoineraulezaiɡaɖikesteʃiŋmoɽeʃa
Acc. 1: 84, Acc. 2: 67, Acc. 3: 74, Acc. Tot Av: 75.
53 (male,noisy,no respeaking):
ISO 15919: hāmīlāī arulē jō mānchē tyahā gaēkā thiē
1: amahuʃuledadzubmantsitenɡoikateuhule
2: amauʃuleudzadzomatsatseɡoikatsule
3: amlauruleɡaldzumantsetantuweiɡateʃulle
G: amlauruleɡadzumantsetanɡoiɡatieule
Acc. 1: 77, Acc. 2: 69, Acc. 3: 78, Acc. Tot Av: 75.
54 (male,normal,respeaking):
ISO 15919: ali ali sima sima pānī parēra pahirō ali ali jharirahēkō thiyō
1: raʌhaɖizahakheriolilisimtsimpanipoiropohiroalilidzʌharaikotijo
2: raaweridzarakeraolilisimsʌpainipodzopoiʃaninidzaʃhaiɡotio
3: raoɡaɽisahakheʃiolilisimtimpaniupʌʃʌpohiʃuanlilizahaʃʃaiɡottijo
G: raʌɡaeɖizahakediʌlilisimsimpanipoirʌpʌhiʃoalilidzʌhraiɡotijo
Acc. 1: 90, Acc. 2: 68, Acc. 3: 78, Acc. Tot Av: 79.
55 (male,noisy,respeaking):
ISO 15919: dina kō hāmrō kāryakrama thiyō . ra ṭhā cai nēpālakō pūrva jillā
1: simsamrukarihromtijoratauseinepalkopurwadzila
2: kimbambokaʃikʃimtsiʃʃatsousaŋipalkupudzatsila
3: imsahamrukariɡʃʌmtijoʃataunseinnepalkubudwadzila
G: dinsamrokadikrʌmtijoraʈautseinepalkopurvadzilla
Acc. 1: 87, Acc. 2: 66, Acc. 3: 76, Acc. Tot Av: 77.
56 (male,noisy,no respeaking):
ISO 15919: ra uniharukō lāgi ta dina dinai kō kurā hō tara hāmrō lāgi
1: istodauunerkolaitetiodinienikakuratahamrolai
2: istsudzaunakalait setsedzininiɡakudzatsamdzalai
3: estoddauunerkulaidettedinniniɡakodatahamdʌlei
G: estodauunerkolaitetiodinienikokuratahamrolai
Acc. 1: 95, Acc. 2: 65, Acc. 3: 81, Acc. Tot Av: 81.
57 (female,normal,respeaking):
ISO 15919: tyahā jahājamā basnē paricārikāharu, unīharu ḍarāuna thālē
1: tehadzahazmabʌsnipʌritsairikaharuunihardaraunathale
2: dzedzahasnabasnipaɽiseɽikahaʃuunidzaʃaŋatali
3: dehadzahasnaubosniboɽitsaɖikaharuuniharudaraunitalije
G: tjahadzʌhazmabʌsnipʌritsarikaharuuniharɖaraunatale
Acc. 1: 90, Acc. 2: 71, Acc. 3: 78, Acc. Tot Av: 80.
58 (male,noisy,no respeaking):
ISO 15919: bhanēra bhanyō . tara ullē cai hāmīlāī tyasarī aru kēhi pani ḍara dēkhāēna
1: snerpʌnitaradulitseidamnavutisereʌrukipendoardehaina
2: tsanabʌnetaɽadulletseianeɸtsiɽiodzinkipinɖardekaina
3: pʌnitaradaulitseamlebudisʌrearukipʌndordihaimʌ
G: ʌnerʌbʌnjotʌrʌdʌuletseidʌamlʌvutseriʌrukepʌnɖʌrdekaenʌ
Acc. 1: 73, Acc. 2: 60, Acc. 3: 66, Acc. Tot Av: 66.
59 (male,noisy,respeaking):
ISO 15919: ra hāmī cai gayau ra tyasamā cai kamsēkama
1: rahamiseiɡoiɡoioʃatesmatseikomtseikomkotib
2: ramitseaɡoiɡoiaaʃaɡtsiswasaimakosekomkotib
3: rahamiseiɡojuaʃadismatseikomsikomɡoti
G: rahamiseiɡoiɡoioʃatesmatseiʌkʌmsekʌmkotie
Acc. 1: 90, Acc. 2: 69, Acc. 3: 81, Acc. Tot Av: 80.
60 (male,normal,respeaking):
ISO 15919: ra tyahābāṭa gayō ra uslē cai hāmīlāī ali rāmrō samga basnusa
1: tjapurʌɡoieʃautlidzahamlailaʌliramsatsebosnus
2: tjawadaɡoiʃawulidamlailaaolidamsasabasnos
3: deburuɡɡojeʃaulisamleilʌalidamsasubosnus
G: tjabuɽʌɡʌjorautlidzahamlalʌʌliramsatsʌbosnus
Acc. 1: 88, Acc. 2: 64, Acc. 3: 73, Acc. Tot Av: 75.
61 (female,normal,no respeaking):
ISO 15919: , pachi cai kē bhayō bhandā khēri
1: kipʌtsizeikibajobandahevi
2: kebotssasikiboibʌdzaheɡi
3: keipotsiseikipojeβʌndaheri
G: kepʌtsitseikebʌjobandahedi
Acc. 1: 88, Acc. 2: 63, Acc. 3: 75, Acc. Tot Av: 76.
62 (male,noisy,no respeaking):
ISO 15919: ra ēkdamai ḍarāi sakē pachī tyō ṭhāumā cai
1: raektomiɖʌraitseapatsitijotʌŋmatsvei
2: raiktamaibaɽaisebasitiutsaŋʌsai
3: ʃaiktumendaʃeiseupatsihiʃdiltoŋwasei
G: raekdʌmeɖʌraisepʌtsitjoʈaumatsei
Acc. 1: 79, Acc. 2: 70, Acc. 3: 58, Acc. Tot Av: 69.
63 (female,normal,no respeaking):
ISO 15919: tyatibēlā jhanai āttiyau.
1: tiktihoratsahʌnahatijo
2: tiktibeledzanatiu
3: tedibʌrʌsoniyʌtijo
G: tiktibeladzʌnaiattijo
Acc. 1: 71, Acc. 2: 73, Acc. 3: 67, Acc. Tot Av: 70.
64 (male,noisy,no respeaking):
ISO 15919: ra ēuṭā bāṭōmā cai ēkdamai sānō khōca thiyō .
1: rajoɖapaʈomasiktomesanukostiʃ
2: reiodzabaɡabasitsanasanokostsu
3: raoɽʌpakʌbɑsihitʌmesanohostiʃ
G: rajoɖabaʈomatseiekdʌmesanokotstio
Acc. 1: 86, Acc. 2: 61, Acc. 3: 68, Acc. Tot Av: 72.
65 (female,noisy,no respeaking):
ISO 15919: ra hāmī cai aghillō dina arkō buddha jahājakō cai
1: rahamitseiʌɡilotinarkukutadzahaskutsei
2: rahamisaiaɡiladzinaɡuwutsaɡadzahasɡʌtsi
3: erʌhamiseiʌɡiɽlotinarkuɡutaasʌhasɡutihi
G: rahamitseiʌɡillodinarkobuddʌdzahaskotsei
Acc. 1: 89, Acc. 2: 72, Acc. 3: 79, Acc. Tot Av: 80.
66 (male,noisy,no respeaking):
ISO 15919: uni harulāī . ra gāḍīharu ēkdamai purānō thiē
1: rokaiɖirikdompuraanote
2: raɡaɽidziɡompuɽanati
3: rʌɡʌririɡdompudʌnʌti
G: raɡaɖiʃekdʌmpuʃanote
Acc. 1: 82, Acc. 2: 75, Acc. 3: 78, Acc. Tot Av: 78.
67 (male,normal,respeaking):
ISO 15919: ki jindagīmā jē pani huna sakcha
1: kidzindaɡimaadzepanihunafaksaʃad
2: hitsndzaɡimabtsepaninaɸʌksaɽadz
3: dzidzindʌɡimantsepʌnijonʌsoksʌra
G: kidzindaɡimadzepanihunʌfaksaʃad
Acc. 1: 94, Acc. 2: 73, Acc. 3: 73, Acc. Tot Av: 80.
68 (male,normal,respeaking):
ISO 15919: ra hāmīlē jādā khēri cai chātāharu pani bōkēkō thiēna
1: rahamledzadaheʃitseiamnestsataʃuponibukikotena
2: ʃaamlidakiɽisainamnestsatsaɽubanibuɡedzena
3: rʌhuamledzʌlʌkeʃitseinamnestsataɽubunibuɡiɡutino
G: ʃahamledzadakeʃitseiamnestsadaʃupʌnibuɡeɡotena
Acc. 1: 95, Acc. 2: 75, Acc. 3: 83, Acc. Tot Av: 85.
69 (male,noisy,respeaking):
ISO 15919: ra hāmī pani tyahī anusārakō āphnō dimāga sēṭa garyau
1: raamipenteonsarkrafnudimakseʈɡoʈe
2: rahamipʌdzionsaʃkoapnodimaseɡoʃe
3: rʌamibuntejonsarkraknudemadsedɡoʃiu
G: rahamipʌnteʌnsarkafnodimaɡseʈɡʌʃe
Acc. 1: 86, Acc. 2: 81, Acc. 3: 75, Acc. Tot Av: 81.
70 (male,normal,respeaking):
ISO 15919: ma pani utrē . mērō aru sāthīharu pani gāḍībāṭa utryō
1: mʌpaniutremiroaurusatiorupaniɡaiɖiboɖautʃe
2: mopaniutrinmiɽolsatiruponiɡaɖiwʌdeutʃi
3: moɡpʌniutʃeʌmiʃuoʃusatijoʃupaniɡʌʃiwuʃdʌwutʃe
G: mopʌniutremeroaurusatiorupaniɡaʃiwuɖʌwutʃe
Acc. 1: 86, Acc. 2: 73, Acc. 3: 84, Acc. Tot Av: 81.
71 (male,normal,respeaking):
ISO 15919: ra uniharu ēkdamai khuśī bhaēkī ma tyahā bāṭa
1: unarektamekusibaikimoteabora
2: unaredzamkusiwaiɡimopʌɽiɸ
3: urariɡdomukusibaiɡimotebʌʃʌ
G: unaʃekdʌmekusiwaiɡimoteabʌɽa
Acc. 1: 90, Acc. 2: 71, Acc. 3: 81, Acc. Tot Av: 80.
72 (male,normal,respeaking):
ISO 15919: tīna janā sāthīharu cai tyō ṭhā mā gaēra cai hāmīharulē sarvē garnu parnē tyō
1: tindzanasatierutseitijotʌmaɡojeʃazaamioʃlesoʃveɡonupʌnetijo
2: tsindzanasatseuʃutsaiteutsamaɡoiɽasameulesaɽeɡonupanitsi
3: dinsʌnʌsʌtihurutseitijotomaɡojeʃasamiʃlisʌʃiɡonupanetijo
G: tindzʌnasatierutseitjotaumaɡoeʃazaamioʃlesʌʃveɡʌnupʌʃnetjo
Acc. 1: 87, Acc. 2: 70, Acc. 3: 75, Acc. Tot Av: 77.
73 (male,noisy,no respeaking):
ISO 15919: ra pānī pani ali ali parēkō thiyō .
1: rapaniɡunuelilipʌʃivekoti
2: rapaniɡʌŋololipodziɸabatsiʃ
3: rʌaniɡunalalepoʃibaɡʌtijo
G: rapanipaniʌlilipʌrekokotijo
Acc. 1: 76, Acc. 2: 61, Acc. 3: 72, Acc. Tot Av: 69.
74 (male,normal,respeaking):
ISO 15919: ma ta ēkdamai ḍarāē . mērō muṭu mērō mukha samma āipugō .
1: moataekdomiɖoʃaimiʃumiʃum tumiʃumuksomaiweka
2: motseiɡdzamadzaɽaimiɽomeɽumutsumeuɽemuksamaiwiɡa
3: motaiɡdomiʃdʌʃʌimiʃumiʃamutumiʃamuksʌmaiβuɡiu
G: mataekdomiɖʌraimeromeromuʈumeromuksʌmaiweɡa
Acc. 1: 91, Acc. 2: 74, Acc. 3: 78, Acc. Tot Av: 81.
75 (male,noisy,no respeaking):
ISO 15919: ra tyō bēlā ali sima sima pānī pani parirahēkō thiyō .
1: ratijobelalisimsimfaniponiporireɡatie
2: atsiulalisinsinpanipanibodzibeɡatsiʃ
3: tiβulʌlisimsimpʌnipaniuporirʌɡʌdijo
G: ratjobelalisimsimpanipʌnipʌrireɡotio
Acc. 1: 88, Acc. 2: 68, Acc. 3: 71, Acc. Tot Av: 76.
76 (male,normal,respeaking):
ISO 15919: cha ghaṇṭākō lāgi cai hāmīlē kati dasa hajāra rupaiyā nēpālī tirēkō thiyō .
1: tsʌkʌntaholahitseihamledeɡotidosazaʃpenepaliditiʃeɡotijo
2: tsoɡaŋdzakalaiɡisaiamledzaɡodzasadzatspenipalidzitsiɽaɡotsiʃ
3: tsoɡontakulʌkitsehʌmleseɡotidoshodzaʃdenepʌlititiʃiɡotijo
G: tsʌɡʌɳɖaɡolaɡitseihamledeɡʌtidosʌzaʃupenepʌliditireɡotio
Acc. 1: 89, Acc. 2: 67, Acc. 3: 82, Acc. Tot Av: 79.
77 (male,normal,respeaking):
ISO 15919: phērī bhanchukī rāmrō samga sābadhāna purāēra jānu .
1: feribʌntsokiamsusabodanpuʃaaʃadzanu
2: peɽibaŋsadzihamsasadzanpuɽalβanu
3: siʃibontsukiʌknusʌbadʌpuʃaʃadzannu
G: feribʌntsukihamsusawodanpuʃaʃadzanu
Acc. 1: 89, Acc. 2: 66, Acc. 3: 71, Acc. Tot Av: 75.
78 (male,noisy,respeaking):
ISO 15919: mērō pahilō kāmamā lāgēkō thiē .
1: mirupoilukamamlaekutie
2: miɽupoilokamamlaikatso
3: mirupoilukammaɽlʌiɡote
G: merupʌilokamamlaeɡutie
Acc. 1: 95, Acc. 2: 79, Acc. 3: 77, Acc. Tot Av: 84.
79 (female,noisy,respeaking):
ISO 15919: thiyō ra hāmī māthī tyahābāṭa
1: tijorahamimatitehabaʈa
2: tioʃahamimamt sitjʌɸaʈe
3: tijorʌhamimatidihanβatʌ
G: tijorahamimatitjabaʈa
Acc. 1: 90, Acc. 2: 80, Acc. 3: 78, Acc. Tot Av: 83.
80 (male,noisy,respeaking):
ISO 15919: ra pachī pharkidā khērī cai hāmīharu tyō ṭhā mā ā dā khērī
1: rapʌtsifoʃkidakeʃidzeianiaʃutijotaumaaodahiʃi
2: ʃapatsiɸarkikeɽetseanuʃutyuʈaumʌamaaudakeri
3: potsiɸoʃkidakeʃitseihʌmihʌrutijotaumʌʌnʌaudakeri
G: rapʌtsifʌrkidakeritseihaniʌrutijoʈaumaaudakeri
Acc. 1: 92, Acc. 2: 71, Acc. 3: 80, Acc. Tot Av: 81.
81 (female,normal,respeaking):
ISO 15919: ra kāṭhamāḍa mā sabaijanāsanga
1: ʃakatsfandomasabeisanasaŋɡa
2: ʃakatmandzumasawaidzanasaŋʌ
3: rakaʈmandumʌsaβidzʌnʌsʌnɡʌ
G: rakaʈʌndaumasabeidzanasʌŋɡʌ
Acc. 1: 81, Acc. 2: 75, Acc. 3: 76, Acc. Tot Av: 77.
82 (female,noisy,respeaking):
ISO 15919: ra tyō pani mērō dimāgamā ēkachina cai āuna thālyō, tyastō narāmrō kurāharu āuna thālyō.
1: ratijopanimerotimaɡmaeksintseiaunataliotistunaramlikuɡaiʃotakiu
2: ratjopanimiɽidimaɡmaeksansaiaunatalotsasunamikuɡaiɽamtaliu
3: rʌtijopʌnimirudimaɡmʌiksinseiawunʌtalijudistunʌamʃikodaʃontaɡu
G: ratjopʌnimerodimaɡmaektsintseiaunʌtaliotistonʌʃamʃekoɡaʃotaljo
Acc. 1: 87, Acc. 2: 73, Acc. 3: 81, Acc. Tot Av: 80.
83 (female,normal,no respeaking):
ISO 15919: bacēra āējastō anubhava bhayō tyatibēlā.
1: bʌtsiʃaidzestoanuvapajetikteviolaa
2: bosiɽaitsesuamanuɸapaititiɡele
3: βosiʃaitsistsuʌmʌmʌβʌpaididiβele
G: botsiraidzistoʌmʌnuvabʌjotittiβela
Acc. 1: 74, Acc. 2: 73, Acc. 3: 73, Acc. Tot Av: 73.
84 (female,normal,no respeaking):
ISO 15919: tyasapachi cai hāmī āphu surakṣita chau jastō lāgyō.
1: ratestotsitsehamiafuseratsitsaozistalaɡi
2: ʃatsistsitetehamiaɸasuɽetsetsoŋtsuselaɡi
3: erʌdestotsetsehʌmijaɸuserʌtsitseudzustulaɡi
G: ʃatestotsitseihamiapusuʃʌk itsaudzestelaɡi
Acc. 1: 84, Acc. 2: 66, Acc. 3: 71, Acc. Tot Av: 74.
85 (female,noisy,respeaking):
ISO 15919: kēhī samasyā āēkō cha
1: keisamaseaaikosa
2: kisamasaikusa
3: ɡehisʌmʌsteʌiɡusʌ
G: keisamasjaaekotsa
Acc. 1: 90, Acc. 2: 76, Acc. 3: 65, Acc. Tot Av: 77.
86 (male,normal,respeaking):
ISO 15919: ullē bhanē anusāra tyahā cai ēkdamai ēkdamai jōkhima pūrṇakō bāṭō thiyō
1: pʌniosarbʌnionusaʃtseteatseiiktamidamidzuhimpuʃnaɡobaʈotieʃe
2: paniosalbanionasadzetsetsedzameikomadzokimbulnaɸabadotieɽe
3: pʌnijasaɽpʌnijʌnʌsʌrtʌtejʌtseiɡdomeekdomedzukimbuɽneɡobaɽɖotijoʃe
G: pʌniʌsarbʌnionusaʃtsetjatseiekdameidʌmidzokimpuɳʌɡobaɖotiere
Acc. 1: 89, Acc. 2: 73, Acc. 3: 71, Acc. Tot Av: 78.
87 (male,noisy,no respeaking):
ISO 15919: jahilē pani garī rahēkō hunchana .
1: tsailipaniɡoʃiʃaikovantsal
2: tsalipaniɡaɽiʃaiɡountsana
3: tsailipuniɡuʃiɡʌʃibomonsʌli
G: dzʌlipʌniɡʌʃiʃaeɡohʌntsʌnʌ
Acc. 1: 74, Acc. 2: 79, Acc. 3: 65, Acc. Tot Av: 73.
88 (female,normal,respeaking):
ISO 15919: ra uhāharulāī bhēṭēpachi khuśī lāgyō.
1: rawatlaibitepasikusilaɡio
2: ʃawanlepetsepasikusilaɡo
3: rauwadlaibihidepʌsikusilaɡijo
G: rawarlaibeʈepʌtsikusilaɡio
Acc. 1: 91, Acc. 2: 81, Acc. 3: 77, Acc. Tot Av: 83.
89 (female,normal,respeaking):
ISO 15919: dhēraijanā parkhēra basēkā hunuhunthiyō.
1: sperezanapʌrkirabasekaununtijo
2: sɸeɽisanapokiɽabasekeununtsip
3: derisʌnʌpolkidʌβʌsikʌhununtijo
G: deridzanapʌrkerabʌsekaununtijo
Acc. 1: 89, Acc. 2: 73, Acc. 3: 80, Acc. Tot Av: 80.
90 (male,normal,no respeaking):
ISO 15919: ra pachī mailē phēsbukamā pani mailē mērō hālēkō thiē
1: ʃabotsipaileefestukapanemiʃudahaleɡote
2: rabasibaileneɸesukabanimiɽadzaliɡotse
3: rʌbosimʌilimnʌɸeicbukʌbʌnijomirurʌʌliɡotijo
G: rabʌtsimailemepesbukaβʌnemerolahaleɡotie
Acc. 1: 82, Acc. 2: 74, Acc. 3: 65, Acc. Tot Av: 74.
91 (male,normal,respeaking):
ISO 15919: ra gāḍīharu cāhi ēkdamai purānō āja bhandā kamsēkama
1: raɡaiɖieʃutsekdomepuʃanoazovandakomtsekom
2: raɡaiɽiʃutsedzameamapanoasawandzaɡomsiɡoma
3: rʌɡʌʃiʃutseikdomeʌmʌpurʌnoʌsuwʌndakomsekom
G: raɡaɖieʃutsekdʌmepuʃanoazʌvandakʌmsekʌm
Acc. 1: 93, Acc. 2: 68, Acc. 3: 71, Acc. Tot Av: 77.
92 (female,noisy,no respeaking):
ISO 15919: ṭikēṭa liēkā thiyau kāṭhamāḍa pharkinalāī, nēpālagañjabāṭa. ra 1 ghaṇṭājati ukta bimānasthalamā parkhēpachi
1: handetikatʃekatijokatsmandopaʃkimʌvainepaliontsvataraekantazatiutabimanistalmapalkiukasusei
2: tsandzetsikatsliɡatsimɡatsʌndzupakinalainipailinsβetsalaikamtsasapiutsabimanstsanmabakipasisai
3: hʌndetiketliɡʌtinkaʈmʌnduɸolkinulaimipaduwanisuwetʌʃʌekontʌsʌdiuŋtebimenistʌlma
G: hamleʈikʌtlekatiokaʈmanduɸarkinʌlainepalioŋsvaʈʌraekʌnʈazateuktabimanistalmapʌrkepʌsetsei
Acc. 1: 82, Acc. 2: 70, Acc. 3: 63, Acc. Tot Av: 72.
93 (male,noisy,respeaking):
ISO 15919: phērī pharkēra āē .
1: heripʌrkeraai
2: periɸakirai
3: ɸeriɸarkʌrʌi
G: perifʌrkeraae
Acc. 1: 90, Acc. 2: 73, Acc. 3: 73, Acc. Tot Av: 78.
94 (male,noisy,respeaking):
ISO 15919: . ra tyō ēriyāmā cai gāḍī calāunē mātra tīna mātra mānchēharu
1: ratijoerihamatseiɡaiɖisodahonematʃtindzanamatʃmantsihaʃu
2: aratjoeɽijamasiŋɡaɽisoneɡmatsiltindzanaʃatsadzmantseʃu
3: ʌrotijoirijʌmʌtseikaɽitsʌlounhemʌtrʌtindzanʌmʌntʌrʌmandzeheru
G: ratjoerijamatseiɡaɖitsʌlaunematʃtindzʌnʌmʌtʃmantseʃu
Acc. 1: 80, Acc. 2: 69, Acc. 3: 70, Acc. Tot Av: 73.
95 (female,normal,no respeaking):
ISO 15919: ra hāmalāī tyahā chōḍēra āmā ra bubā cai pharkisaknu bhaēkō thiyō, vimānasthalabāṭa.
1: ʃahamlaitehatsoʃiʃaamaʃabubatseifʌrkisʌkŋvaɡotijobimanistalvata
2: ʃaamlaitsehesuɽiraamabubabasaiɸokisakinwaɡotsimiministsalwatsa
3: rʌʌhʌmleidehansorirʌʌmʌdabuɡʌnseiɸolkisʌɸʌnuwʌɡʌtijomihanistʌlɡʌta
G: ʃahamlaitehatsoɖerʌamarabubatseifʌrkisʌknwakotijobimanistalvaʈa
Acc. 1: 93, Acc. 2: 72, Acc. 3: 74, Acc. Tot Av: 80.
96 (female,normal,respeaking):
ISO 15919: āēra basnubhaēkō
1: seiteaaeʃabʌsevakotijo
2: tsaitjaeerʌbʌsnukʌtiu
3: setihʌʌjirʌbosnwakʌtijo
G: seitjʌaerʌbʌsnvakotijo
Acc. 1: 85, Acc. 2: 76, Acc. 3: 74, Acc. Tot Av: 78.
97 (male,noisy,no respeaking):
ISO 15919: ra gāḍīlāī agāḍī jāna diyō ra hāmī cai hiḍēra gayō.
1: raɡaiɖilaoɡaiɖidanadiahamit seihimiraɡojo
2: raɡadzidzaunʌnatseamtsamiɽaɡʌu
3: rʌɡʌrilʌuwʌɖiɽaneduwahamit seihiniʃʌɡojʌ
G: raɡaɖilaʌɡaɖidzanʌdijʌhamitseihamiʃaɡʌjə
Acc. 1: 81, Acc. 2: 51, Acc. 3: 72, Acc. Tot Av: 68.
98 (female,normal,no respeaking):
ISO 15919: nēpālagañja pharkēkā thiyau.
1: nepalkontsfarkekatijo
2: nebalɡansoɡeɡekiu
3: nebalɡonsɸolkiɡettijo
G: nepalɡʌndzfʌʃkeketijo
Acc. 1: 90, Acc. 2: 71, Acc. 3: 86, Acc. Tot Av: 82.
99 (female,noisy,respeaking):
ISO 15919: nēpālagañja tarphanai pharkāēra lagyō ra hāmī
1: nepalɡonstʌfanifoʃkaededlʌɡiuʃahami
2: napaleustsaɸiniɸarkarʌlaɡjoʃahami
3: nepalɡonstolɸoneɸolkaidiloɡjuʃʌhʌmi
G: nepalɡʌnstʌʃfanifʌʃkaeʃedlʌɡioʃahami
Acc. 1: 91, Acc. 2: 70, Acc. 3: 79, Acc. Tot Av: 80.
100 (female,noisy,no respeaking):
ISO 15919: bhēṭa bhayō
1: mibajo
2: ebaiu
3: bidpʌjo
G: metbʌjo
Acc. 1: 75, Acc. 2: 44, Acc. 3: 83, Acc. Tot Av: 67.
101 (male,noisy,no respeaking):
ISO 15919: ra rāmrai bhayō tyaspachī pharkēra ā dā khērī
1: ramnivʌietetespasiparɡeɖaodahiʃi
2: dzameβoitsitsispasiɸaɡeɽaɽauɽeɡeɽe
3: rʌmliβaitetismʌtsiɸʌriɡeʃauʃeɡeɽe
G: ramrebʌjotiotjespʌtsiparɡeʃaudakeʃi
Acc. 1: 75, Acc. 2: 56, Acc. 3: 66, Acc. Tot Av: 66.
102 (male,noisy,respeaking):
ISO 15919: phērī gāḍīlāī agāḍī gēaramā hālēra agāḍī tānna thālyō
1: pirikaɖilʌhaɖiɡieʃmaaleʃaoaɖitaŋnatalio
2: peɽiɡalaaiɖiɡeamaleʃaaitannatale
3: ɸeriɡaʃilʌʌjiʃiɡiʃmʌhʌleʃʌoʃitannʌtaliu
G: periɡaɖilaiauɡaɖiveaʃmaaleʃaaiɖitannatalio
Acc. 1: 80, Acc. 2: 71, Acc. 3: 65, Acc. Tot Av: 72.
103 (male,normal,no respeaking):
ISO 15919: mana darō banāēra basnusa bhanēra uniharulē bhanyō .
1: muandoruvanarabosnusuanerovaneluboanio
2: wandzoɽoβanaɽabosusʌnʌɽonelabane
3: wʌndoɽuɡʌneɽʌbosnusʌneʃʌwuneɽlubʌne
G: mandʌrobʌnaerʌbʌsnusʌnerʌnelubʌnio
Acc. 1: 72, Acc. 2: 70, Acc. 3: 75, Acc. Tot Av: 72.

104 (female,normal,respeaking):
ISO 15919: tyahā cai ukta bimānasthalamā cai
1: tehadzeiuktabimanistalmadzei
2: tsahesaiudzaβibanistsalmatse
3: dehatseukdʌbimʌnistalma
G: tehatseiuktabimanistalmatsei
Acc. 1: 98, Acc. 2: 76, Acc. 3: 79, Acc. Tot Av: 84.

105 (female,normal,respeaking):
ISO 15919: nēpālagañjabāṭa kāṭhamāḍa pharkiyō bhanēra.
1: nepaliosvatʌkaʈuandofarkijobanera
2: nipaleŋswedzakatsundzaɸokiubʌniɽa
3: nepalɡonsβatʌkaʈmʌnɖuɸolkiʌβenida
G: nepaliosvaʈʌkaʈʌndofarkijobʌnerʌ
Acc. 1: 89, Acc. 2: 64, Acc. 3: 72, Acc. Tot Av: 75.

106 (female,normal,no respeaking):
ISO 15919: yēti bhannē ēuṭā plēna kō cai bihāna 10-11 bajē tirakō cai
1: jetipanejotpleinkodzeibihanodoseɡaʃbʌsitiʃakodzeiham
2: jetsipaniuplainkosaiammihanadzosiɡaʃabasidzidzakosaiamham
3: iʌtipanijʌplinɡuseimihanudosiɡaʃʌbʌstidiʃʌkosei
G: jetibanneeuʈplenkotseiumbihanʌdoseɡaʃbʌdzetiʃʌkotseiham
Acc. 1: 84, Acc. 2: 69, Acc. 3: 70, Acc. Tot Av: 74.

107 (female,normal,respeaking):
ISO 15919: sana 2008 tira kō kurā hō
1: sʌnduihazaʃaʈirekokurao
2: sandzujasaatsiɽaɡuɡuɽau
3: sʌnduwihʌsʌrʌrdirʌɡoɡuʃʌho
G: sʌnduihazaʃattiʃʌkokurao
Acc. 1: 91, Acc. 2: 67, Acc. 3: 77, Acc. Tot Av: 78.

108 (female,normal,respeaking):
ISO 15919: kāṭhamāḍa lānukō saṭṭā
1: katsvandulanukosaʈa
2: kaʈmandulanukosetse
3: kaʈmʌnɖulʌnɡusʌtta
G: kaʈandulanukosaʈʈa
Acc. 1: 84, Acc. 2: 78, Acc. 3: 75, Acc. Tot Av: 79.

109 (female,normal,no respeaking):
ISO 15919: ra yasapachi cai hāmī jhanai āttiyau
1: araespadzidzahamizamiatijo
2: aɽeispatsatsamisaniatsub
3: eraisɡotsusʌhʌmisʌmijaattijo
G: areespasitsahamidzʌniattijo
Acc. 1: 86, Acc. 2: 65, Acc. 3: 67, Acc. Tot Av: 72.

110 (female,noisy,no respeaking):
ISO 15919: nēpālagañjamā cai avataraṇa gariyō
1: nepaliɡozmatseiabateʃlŋɡaiɖijo
2: snipalɡomatsaiaβadzeŋoɽijo
3: nepalɡonsimʌtseiʌβatenɡaʃijo
G: nepalaɡʌzmatseiʌvʌteluɡʌʃijo
Acc. 1: 73, Acc. 2: 65, Acc. 3: 75, Acc. Tot Av: 71.

111 (female,noisy,respeaking):
ISO 15919: ra ēkadamai ḍaralāgdō avasthā thiyō tyō.
1: raektomidarlaktoawostatiotijo
2: eraekdamidzalaɡʌwastatitiu
3: rʌikdomedʌrlaɡdʌoβʌstatiutijo
G: raektʌmidarlaɡdoʌvastatitijo
Acc. 1: 86, Acc. 2: 73, Acc. 3: 85, Acc. Tot Av: 81.

112 (male,normal,no respeaking):
ISO 15919: pahirōlē alikati kṣyati purāēkō bhannē thiyō .
1: poiʃolealkotitsitibuʃaiɡowantijo
2: boidzalealkatsitsitibuɽaiɡawanitsiu
3: poiʃolealkatitsetipuɽʌiɡʌβʌnitijo
G: pʌiroleʌlkʌtitsitibuʃaeɡowannetijo
Acc. 1: 88, Acc. 2: 73, Acc. 3: 82, Acc. Tot Av: 81.

113 (male,normal,no respeaking):
ISO 15919: tyō bēlā samjhēkō thiē .
1: tijolafamzekote
2: tselaɸamaɡete
3: teulasʌmsiɡʌti
G: tjolasamdzeɡotie
Acc. 1: 79, Acc. 2: 59, Acc. 3: 76, Acc. Tot Av: 71.

114 (male,normal,no respeaking):
ISO 15919: sadarmukāmamā phērī pharkēra āyau .
1: saderukamahirifʌrkeraajum
2: soɡudzukamahiliɸokaɽaim
3: sʌɖʌɖβukʌmmʌkeriɸolkerʌju
G: sʌdʌrukamakerifʌrkeraajm
Acc. 1: 86, Acc. 2: 66, Acc. 3: 69, Acc. Tot Av: 74.

115 (female,noisy,no respeaking):
ISO 15919: ra paricārikālē kē bhanē bhandā
1: raporitsairikarlikibanubanda
2: eɽapolisalikalikiβiniβandze
3: erʌpoditsʌdikalikiβʌnipʌnda
G: rapʌritsairikarlikibʌnebʌnda
Acc. 1: 90, Acc. 2: 69, Acc. 3: 74, Acc. Tot Av: 77.

116 (male,noisy,respeaking):
ISO 15919: ra tyō bēlā pani ṭhyākka dimāga mā huncha ni
1: ratijovelavnitekadinaɡmaodzeniodioɖa
2: eʃatjoβelaβeniʈʌkkʌdimansaniɖiuɽa
3: eʃatjoβelʌpʌnitakkadimawʌunsʌnideuɖa
G: ʃatjoβelaβvniʈakkʌdinaɡmʌodziniodeuɖʌ
Acc. 1: 82, Acc. 2: 71, Acc. 3: 72, Acc. Tot Av: 75.
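
For readers who wish to re-tabulate these records, the following is a minimal Python sketch of how the accuracy line of a record might be read and the average recomputed. It assumes that "Acc. Tot Av" is the arithmetic mean of the three per-transcriber figures; the listing itself does not state this. The names `ACC_LINE` and `parse_accuracy` are hypothetical and are not part of the tooling used in this study. Where a recomputed mean differs by one point from the printed figure, a plausible explanation is that the printed average was taken before the individual scores were rounded, but that too is an assumption.

```python
import re
from statistics import mean

# Hypothetical pattern for the "Acc." line that closes each record.
ACC_LINE = re.compile(
    r"Acc\. 1: (\d+), Acc\. 2: (\d+), Acc\. 3: (\d+), Acc\. Tot Av: (\d+)\."
)


def parse_accuracy(line):
    """Return ([acc1, acc2, acc3], printed_average) for one accuracy line."""
    match = ACC_LINE.search(line)
    if match is None:
        raise ValueError("no accuracy figures found")
    # The first three groups are per-transcriber scores; the last is the
    # average as printed in the listing.
    *scores, printed_average = (int(group) for group in match.groups())
    return scores, printed_average


# Record 97, copied from the listing.
scores, printed = parse_accuracy(
    "Acc. 1: 81, Acc. 2: 51, Acc. 3: 72, Acc. Tot Av: 68."
)
recomputed = mean(scores)  # assumed definition: simple mean of the three scores
```

Keeping the printed average alongside the recomputed one makes it easy to flag records where the two disagree, rather than silently overwriting the published figure.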