The University of Melbourne
School of Languages and Linguistics
Honours Thesis November 2013
The Effect of Respeaking
on Transcription Accuracy
Mat Bettinson
Supervisors: A/Prof. Lesley Stirling &
A/Prof. Steven Bird
Acknowledgements
I would like to thank the teachers and mentors that guided me to this point. I would not
be engaged in this research if it were not for a number of influential figures that inspired
me along the way. Among them, Paul Gruba’s no-nonsense clarity, and ‘Gruba
manoeuvre’ hand gesture, helpfully addressed: “What are we really doing here?”
Of my supervisors, I must thank Lesley Stirling for inspiring with engaging human
stories in linguistic research and providing advice and encouragement when it was
needed most. My thanks also to Steven Bird for straddling the gulf between linguistics
and computational disciplines and for introducing me to the ‘perfect storm’ of
possibilities in language and technology.
I must also express considerable gratitude for the participants in this study including my
two speakers of Nepali but most especially the four transcription participants. Having
generously agreed to volunteer many hours of their valuable time, they also stuck at it
when it became clear I had tragically underestimated the workload. Majanmajanba!
I’m also grateful for the comments from the examiners of this thesis; their detailed
feedback guided this revision.
Above all, eternal gratitude to my ever patient wife Jeannine who has borne no small
burden of hardship in supporting my scholarship. I love you.
Table of Contents
Acknowledgements .......................................................................................................... 3
Table of Contents ............................................................................................................ 4
List of figures.................................................................................................................... 6
List of tables ..................................................................................................................... 7
1. Introduction .............................................................................................................. 8
1.1 Overview ............................................................................................................ 8
1.2 Motivation: documentation and problems of scale ............................................ 8
1.3 BOLD – a scalable method ................................................................................ 9
1.4 Respeaking as field method ............................................................................. 10
1.5 A thought experiment: the future philologist ................................................... 12
1.6 Aims and research questions............................................................................ 14
2. Literature review..................................................................................................... 15
2.1 Overview .......................................................................................................... 15
2.2 Transcription – description of speech sounds .................................................. 15
2.3 H&H theory ..................................................................................................... 17
2.4 Clear speech ..................................................................................................... 18
2.5 Types and levels of noise ................................................................................. 20
2.6 Ramifications for respeaking tasks .................................................................. 22
3. Method .................................................................................................................... 24
3.1 Overview .......................................................................................................... 24
3.2 Nepali ............................................................................................................... 24
3.3 Prediction of errors .......................................................................................... 26
3.4 Participants....................................................................................................... 29
3.5 Procedure ......................................................................................................... 29
3.5.1 The Aikuma application ........................................................................... 29
3.5.2 Data collection .......................................................................................... 30
3.5.3 Processing the data ................................................................................... 31
3.5.4 Experimental conditions ........................................................................... 32
3.5.5 Data validation and volume ...................................................................... 33
3.5.6 Reference transcription ............................................................................. 33
3.5.7 Transcription activity ................................................................................ 34
3.6 Measuring accuracy ......................................................................................... 35
3.6.1 Edit distance ............................................................................................. 36
3.6.2 Improved phonetic edit distance ............................................................... 37
3.6.3 Summary................................................................................................... 38
4. Results .................................................................................................................... 39
4.1 Overview .......................................................................................................... 39
4.2 Clear speech in Nepali ..................................................................................... 39
4.2.1 Durations and speaking rate ..................................................................... 41
4.2.2 Expansion of vowel space ........................................................................ 43
4.3 Transcription metrics ....................................................................................... 44
4.4 Statistical analysis ............................................................................................ 44
4.4.1 Reliability of measures ............................................................................. 44
4.4.2 T-tests of independent variable: respeaking ............................................. 45
4.4.3 T-tests of independent variable: noise ...................................................... 47
4.4.4 Assessing the interaction of noise and respeaking ................................... 47
4.5 Analysis of common errors .............................................................................. 48
4.6 The impact of noise .......................................................................................... 52
4.7 Summary of Results ......................................................................................... 54
5. Discussion ............................................................................................................... 55
5.1 Overview .......................................................................................................... 55
5.2 Addressing the research questions ................................................................... 55
5.2.1 Respeaking and transcription accuracy .................................................... 55
5.2.2 Respeaking effect on error types .............................................................. 59
5.2.3 Contribution of clear speech vs. noise ...................................................... 60
5.3 Limitations of Study ........................................................................................ 61
5.4 Transcription differences: errors or choices? ................................................... 63
6. Conclusion .............................................................................................................. 65
References ...................................................................................................................... 67
Appendices ..................................................................................................................... 73
Appendix 1: Transcriber Instructions ......................................................................... 73
Appendix 2: Transcription Log .................................................................................. 79
List of figures
Figure 3.1: Acoustic evidence of retroflex consonant .................................................... 26
Figure 3.2: Aspirated consonant spectrogram ................................................................ 27
Figure 3.3: Australian English & Nepali vowel spaces .................................................. 28
Figure 4.1: Visualising Nepali ‘Clear’ vs ‘Normal’ speech ........................................... 40
Figure 4.2: C and V durations for male speaker of Nepali ............................................. 41
Figure 4.3: Speaking rates of Nepali speakers in spontaneous and clear speech ........... 42
Figure 4.4: Reduction of speaking rate in clear speech .................................................. 42
Figure 4.5: Composite plot of Nepali casual vs. clear speech vowel spaces .................. 43
Figure 4.6: Participant transcription rates in minutes-per-file ........................................ 44
Figure 4.7: Participant accuracy comparison ................................................................. 45
Figure 4.8: Accuracy scatterplot by file with regression ................................................ 46
Figure 4.9: Variation in breathy voicing and aspiration errors....................................... 49
Figure 4.10: Variation in vowels and vowel cluster errors............................................. 50
Figure 4.11: Variation in retroflex consonant errors ...................................................... 52
List of tables
Table 3.1: Nepali phonemic inventory ........................................................................... 25
Table 3.2: Experimental conditions matrix .................................................................... 32
Table 3.3: Phonetic edit-distance error ‘scaling’ by category ........................................ 37
Table 4.1: Inter-participant accuracy correlation matrix ................................................ 45
Table 4.2: Per-participant respeaking t-test results ........................................................ 46
Table 4.3: 2-way ANOVA of respeaking and noise variables ....................................... 47
Table 4.4: Multiple linear regression of respeaking and noise against accuracy ........... 47
Table 6.1: Qualitative view of ‘high-range’ accuracy .................................................... 57
Table 6.2: Qualitative view of ‘low-range’ accuracy ..................................................... 58
1. Introduction
1.1 Overview
In linguistic fieldwork, language consultants are sometimes asked to repeat speech that
was spontaneously recorded. Careful speech is often beneficial for analysis and
transcription. New methods in language documentation promise improvements in
efficiency by utilising such spoken annotations to create a written transcription.
Consequently it may be possible to defer the written transcript so that the work does not
need to be done in the field. This research examines the impact of respeaking on
transcription accuracy when used as part of an emerging digital method in language
documentation.
1.2 Motivation: documentation and problems of scale
At the 1991 LSA symposium on endangered languages, the rate of language loss was
descʃibed as a ‘cʃisis’. Half of the woʃld’s 6,000 languages may alʃeady be moʃibund
and no longer being learned by children (Krauss, 1992). Arguments for the value of
human language, and the resulting tragedy of their loss, are bountiful. Hale said
languages are the “priceless products of human mental industry” and that their loss
represents an “irretrievable loss of diverse and interesting intellectual wealth” (1992, p.
36). Evans and Levinson (2009) described linguistic diversity as a ‘laboratory of
variation’ with 6,000 natural experiments in evolving communicative systems, each
offering opportunities to explore the nature of human cognition. Speakers themselves
often regard the loss of their language as a “loss of identity, and as a cultural, literary,
intellectual, or spiritual severance from ancestors, community and territory” (Woodbury,
2003, p. 4). The field of linguistics has struggled to find an effective response to the
urgent need to describe the intellectual wealth of endangered languages. Contributing
factors include a lack of focus on languages in cultural context (Hale, 1992) and the
apparent lack of will to engage in linguistic fieldwork (Newman, 1998)1.
1 Newman pulled no punches: “Linguists claim to be concerned about the endangered languages issue. In reality, nothing substantial is being done about it.” (1998, p. 11).
In response to this crisis, documentary linguistics emerged as a sub-field for
constructive action on the challenge of language endangerment. Nikolaus
Himmelmann’s (1998) founding treatise argued for the documentation of language as a
separate and distinct field from descriptive linguistics. Documentary linguists would
focus on recording natural language rather than the narrower descriptive output of the
traditional grammar and dictionary. The rise of documentary methods has also
coincided with the cross-over between the paper-based era and the digital era (Bird &
Simons, 2003). Storage is now effectively unlimited and virtually lossless in quality but
critical limitations remain. Chief among them is the reliance on highly trained linguists
performing a wide variety of tasks in often challenging fieldwork conditions. There’s a
worldwide shortage of trained linguists, let alone, as Newman noted, linguists engaged
in fieldwork. Liberman (2006) highlighted the need to scale up language documentation
efforts to meet the challenge of endangerment. A corpus of around ten million words
would be necessary for reasonable coverage of various aspects of language. This
corresponds to around 2,000 hours of audio recordings2, which would need transcribing:
at least two orders of magnitude beyond the volume of transcription usually undertaken
in linguistic description. When it comes to endangered languages, no corpus of a tenth
of that size (one million words) yet exists in a machine-readable context (Abney & Bird,
2010). This gulf between our capacity to capture primary data and our methods for
transcription and analysis is arguably the most pressing challenge facing documentary
linguistics today.
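Liberman’s figures above imply a back-of-envelope speaking rate, which is worth making explicit. The rate below is derived from the two quoted numbers, not stated in the source:

```python
# Back-of-envelope check of the corpus-size figures quoted above:
# ten million words over roughly 2,000 hours of audio.
target_words = 10_000_000  # words for "reasonable coverage"
audio_hours = 2_000        # Liberman's revised estimate

words_per_hour = target_words / audio_hours
words_per_minute = words_per_hour / 60

print(f"{words_per_hour:.0f} words/hour, i.e. about {words_per_minute:.0f} wpm")
# Roughly 80 wpm is plausible for spontaneous speech once pauses,
# hesitations and turn-taking are averaged in.
```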
1.3 BOLD – a scalable method
Woodbury’s fieldwork on the central Alaskan language Cup’ik in the 1970s produced a
substantial quantity of audio cassette tapes. Conscious of the limited remaining time
with elderly speakers of Cup’ik, the decision was made to skip interlinear glosses in
favour of ‘running UN style translations’ (Woodbury, 2003, p. 45). More radical still,
Woodbury suggested that not all material would be transcribed. Instead, speakers would
be asked to ‘respeak’ recordings slowly and clearly so that anyone with ‘training in the
language’ could provide a transcription if they wished. Reiman (2010) cited this
2 Liberman’s abstract of his 2006 talk at the Texas Linguistic Society stated 50,000 hours, which he has since described as an error, revising the figure to 2,000 hours in a Language Log post in February 2010: http://languagelog.ldc.upenn.edu/nll/?p=2099
example as an inspiration to develop a new audio-only based methodology called Basic
Oral Language Documentation. BOLD describes a method whereby a new recording
interleaves the original spontaneous recording with respoken ‘oral annotations’. In the
first instance, these annotations would be the same slow and careful respeaking, with
the same goal of facilitating a deferred transcription, one that might be undertaken
immediately on return from fieldwork or by any interested party in the future. Not only
is transcription time consuming but it can only be performed by highly trained people.
Free of this bottleneck, researchers may recruit paralinguists with minimal training to
perform as many documentary events in parallel as willing participants and available
equipment will allow. On that basis BOLD is one of the first methods in language
documentation that may be categorised as a ‘scalable’ method3. A review of BOLD in
six different fieldwork projects concluded that BOLD corpora should be made a funding
priority and the methodology taught in all fieldwork courses (Boerger, 2011).
The essential method of BOLD can be thought of as phrase-aligned audio segments
where participants record natural speech and then make additional recordings which are
time aligned with the natural speech event. These tasks have proven suitable for
automation in the Aikuma smartphone (Bird & Hanke, 2013). Fieldwork trials in Papua
New Guinea (http://www.boldpng.info) and the Brazilian Amazon demonstrated that
language consultants found the system intuitive to use despite their limited exposure to
digital technology. Mobile technologies such as these offer the chance to ‘crowdsource’
potentially rich collections of linguistic data. It may be that the best response to the pace
of language loss lies in interdisciplinary programmes with the development of software
tools informed by research on the next generation of digital linguistic field methods.
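The phrase alignment described above can be pictured as a simple data structure. The sketch below is purely illustrative; the class and field names are my assumptions, not the actual Aikuma data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Segment:
    """A time span (in seconds) within one audio recording."""
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start

@dataclass
class AnnotatedPhrase:
    """One phrase of the original recording, linked to its oral annotations."""
    original: Segment                     # span in the spontaneous recording
    respeaking: Optional[Segment] = None  # slow, careful repetition
    translation: Optional[Segment] = None # phrase-level translation
    comments: List[Segment] = field(default_factory=list)  # analytical notes

# Two phrases with respeakings recorded, translations still pending.
phrases = [
    AnnotatedPhrase(Segment(0.0, 2.4), respeaking=Segment(0.0, 3.9)),
    AnnotatedPhrase(Segment(2.4, 5.1), respeaking=Segment(3.9, 8.2)),
]
total = sum(p.original.duration for p in phrases)
print(f"original speech: {total:.1f} s")
```

Note that the respoken spans are longer than the originals, reflecting the slower, careful delivery the method asks for.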
1.4 Respeaking as field method
Respeaking has been used as a linguistic field method since well before the introduction
of BOLD. In a three-volume guide to documenting languages of oral tradition first
published in the 1970s, Bouquiaux & Thomas (1992) described a formalised method for
producing transcriptions of unwritten languages. After recording spontaneous speech,
the language consultant “repeats each sentence fairly slowly, so that it can be
3 Removing the natural language specification, collaborative compilation of dictionaries is perhaps the only other scalable method in language documentation.
transcribed as he dictates” (p. 181). The transcription was then re-recorded in the same
manner, thus creating an audio recording of the slow speech. The Bouquiaux &
Thomas method was used to directly assist the production of a transcription as well as to
provide an audio recording with the same properties of speech for additional analytical
purposes such as consulting the field transcript and further developing phonological
hypotheses on the language. Reflective of general practice in linguistic fieldwork, the
production of the transcript was considered to be of primacy and hence embedded
within the method. The consultative process of respeaking and transcribing adds
considerably to the time and the patience required of language consultants.
The respeaking step in BOLD is referred to as the process of producing spoken
annotations (Reiman, 2010, p. 256). Three main varieties of oral annotation were
proposed, the first being ‘careful speech’ of the same type as employed in the
Bouquiaux & Thomas method. The second annotation is a phrase level translation into a
language of wider communication and the third consists of analytical comments. These
may include material such as unspoken but implied information, description of gestures
and cultural knowledge. Discussing the benefits of these additional spoken annotations
is beyond the scope of this work but it will suffice to note that the time saved in the
laborious transcription phase provides an opportunity to record a great deal of additional
information via further spoken annotations.
Another attested motivation for the respeaking method is that respeaking regenerates an
older recording into a fresh recording. As Woodbury described in the Cup’ik
documentation project, respeaking material was prioritised for ‘hard-to-hear’ audio
cassettes. Thus in re-recording linguistic data there exists the opportunity to improve the
quality of the recording. The BOLD:PNG project suggested that respeaking should be
undertaken in a “quiet place away from background noise and interruptions”
(http://www.boldpng.info/bold/stage2). By ensuring that the subjects are loud enough to
be heard and free of unwanted noises to the greatest possible extent, audio quality
improvements may be realised in the respeaking recording.
Quality improvements may also come about due to differences in equipment used in the
field and the equipment used in respeaking. Given the pace of change in technology,
respeaking older recordings may jump across generational change in recording methods.
Fieldwork recordings made on analogue equipment can, where speakers are available,
be regenerated into recordings in the digital domain. Digital recordings under
appropriate storage conditions have a virtually unlimited shelf life. Even where these
recordings are made in the same technological time frame, other equipment factors exist
such as the difference in recording quality between smartphones and professional audio
recording devices. Whether quality improvements come from reduction in ambient
noise or improvements in equipment and technique, both result in an increase in the
signal-to-noise ratio (SNR).
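Signal-to-noise ratio is conventionally expressed in decibels. The snippet below shows the standard calculation from RMS amplitudes; the sample values are invented for illustration and are not measurements from this study.

```python
import math

def snr_db(signal_rms: float, noise_rms: float) -> float:
    """Signal-to-noise ratio in decibels from RMS amplitudes.

    Power scales with amplitude squared, so 10*log10(P_s / P_n)
    becomes 20*log10 of the RMS amplitude ratio.
    """
    return 20 * math.log10(signal_rms / noise_rms)

# Invented example: same speech level, but the respoken recording is
# made in a quiet room so the noise floor is much lower.
field_snr = snr_db(signal_rms=0.20, noise_rms=0.05)     # about 12 dB
respoken_snr = snr_db(signal_rms=0.20, noise_rms=0.01)  # about 26 dB
print(f"SNR gain from respeaking: {respoken_snr - field_snr:.1f} dB")
```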
Considering the benefit of respeaking fieldwork methods for producing transcriptions,
the ‘regeneration’ of these recordings provides two potential improvements:
1. Greater intelligibility of careful speech
2. A boost in signal-to-noise ratio (SNR)
What is not clear is the degree of contribution that each of these makes towards the
benefits of ‘slow, careful’ respeaking, particularly when employed in the context of
linguistic transcription by non-native speakers.
1.5 A thought experiment: the future philologist
“Part of my technical input as a linguist is to make guesses about what the
‘philologist in 500 years’ is going to need.” (Woodbury, 2003, p. 45)
Himmelmann suggested that language documentation ought to be a lasting,
multipurpose record of a language (Himmelmann, 2006b). The multipurpose qualifier is
intended to suggest a broader audience for language documentation, such as alternative
research disciplines and the speech community themselves. The argument is that we
should capture as many and varied instances of natural language as possible because we
can’t anticipate the needs of all possible stakeholders. As we enter the age of almost
limitless digital storage, it’s increasingly difficult to argue that we shouldn’t be
capturing everything we can.
However, the practical reality is that documentation stakeholders have different ideas
about what ought to be captured and for what purpose. Woodbury suggests that linguists
should anticipate the needs of our future selves, or at least those with similar interests.
Even with a reduced scope of linguistic enquiry, there is an absolute requirement to
make language documentation a lasting record. We will need orthographic and glossing
conventions, linguistic and ethnographic annotations and metadata to assist in
identifying primary data and so on. Acknowledging the rising role of technology in this
task, Bird & Simons (2003) warned against adopting ‘moribund’ technology such as
proprietary software and file formats which themselves may result in endangered data.
More broadly, Abney & Bird (2010) suggest that we may have attained some measure
of success for a universal corpus of human language if we can freely translate between
languages. They further described documentation of individual languages as a pyramid
structure where a smaller subset of annotated material sits above a larger volume of
unannotated material.
One of the key decisions to be made will be how much material should be annotated or
transcribed and in what detail. By the same token, considering the oral annotations of
BOLD, how much of each type of annotation do we need? At the ten-year anniversary
of Himmelmann’s seminal language documentation proposal, Evans (2008) warned
against becoming ‘documentarist fundamentalists’, cautioning that merely recording material
without shaping an ‘evolving analysis’ would deprive future linguists of key data. In
this paper I adopt the stance that we are exploring ways to complement the essential and
fundamental fieldwork of evolving analysis with larger volumes of audio material to
help meet the challenges of scale that Liberman identified.
Whether of a documentarist or descriptivist viewpoint, linguists of both persuasions
agree that we ought not to deprive future generations of key data where possible
(Himmelmann, 2011). Research into the effectiveness of documentary methods needs to
anticipate the needs of the ‘future philologist’ to the extent we can guess. This paper
also adopts the position that researching more efficient methods in language
documentation is not optional. It is, in fact, just as important in ensuring that data
isn’t lost to future generations.
In summary, the ‘future philologist’ principle orientates us towards the needs of
language researchers in the future. With revolutions in digital technology enabling
linguists to capture and analyze more data than ever before (Evans, 2009, p. xix) we are
now considering recording potentially thousands of hours of audio material. Thousands
more may be required for the respeaking task as part of the BOLD method. While the
essential scalability of these methods frees the field linguist from involvement in every hour of work, it
would be naive to suggest that there are no resource costs in deploying these techniques.
They do not remove the need for detailed description at the top of the ‘pyramid’ to make
sense of the large volume of untranscribed material. The challenge is therefore the
search for an appropriate balance in traditional and scalable methods in linguistic
fieldwork. The goal of creating lasting, multipurpose records of a language implies the
need to evaluate methods, where possible, from the end-user perspective of future
generations.
1.6 Aims and research questions
To date there has been no specific investigation of the impact of respeaking in
transcription. The central aim of this research is to assess the value of respeaking in
language documentation methods. Therefore, given the scenario of the future philologist
working on BOLD-based documentation of a no longer spoken language:
Does the availability of respoken audio improve transcription accuracy?
If so, can we observe improvements in particular types of transcription errors?
To what extent can these be attributed to ‘careful speech’ or lower noise?
This study will also consider the use of the latest digital methods with data capture
carried out using a smartphone application currently being developed.
2. Literature review
2.1 Overview
There are three main areas of relevant literature for this research. The evolution of
phonetic and phonemic transcription will be briefly discussed in section 2.2.
Spontaneous speech and careful speech can be understood to occupy points on an articulatory
continuum. Lindblom’s (1990) hyper and hypo speech (H&H) theory describes such a
continuum of phonetic variation and is discussed in section 2.3. According to H&H
theory, careful speech from the BOLD respeaking task may be classified as hyperspeech
but the literature on this phenomenon prefers the term ‘clear speech’. Previous literature
on clear speech is discussed in section 2.4. Types and levels of noise used to degrade
audio recordings in these studies are discussed in section 2.5. Finally in section 2.6, the
ramifications of the literature will be discussed with relevance to respeaking in
linguistic fieldwork.
2.2 Transcription – description of speech sounds
Pike (1943) argued that transcription of speech sounds can and should be undertaken in
such a way as to capture the full range of articulations theoretically possible in any
language. This type of transcription, e.g. one that takes no account of the language being
transcribed, has also been called an impressionistic transcription (Abercrombie, 1964).
The field has long debated the exact choice of symbols, and perhaps the IPA’s
guidelines and symbol set (International Phonetic Association, 1949) have at least built
some consensus in the most common speech sounds. Additional symbols or diacritics
are often necessary to capture a fully impressionistic description of a language. As one
might expect, even the relatively large symbol set of the IPA is the product of
scholarship on particular families of languages. Roach (1987) pointed out that the influence
of European languages resulted in the ‘lumping together’ (p. 28) of dental, alveolar
and post-alveolar as the same ‘place’ on the IPA chart.
Discrete symbols are not the only way to categorise speech sounds. With the exception
of a subset of diacritical marks in IPA notation, alphabets of symbols have the drawback
that they do not inherently describe the state of the articulators and the manner in which
they are articulated. To begin with, Jakobson, Fant and Halle (1951) classified sounds by
acoustic properties rather than the state of the oral articulators. Following the advent of
generative phonology (Chomsky & Halle, 1968), distinctive features have found use
describing natural classes as a matrix of largely4 binary features which capture both the
state of articulators and some acoustic features. For example, /m/ would be represented
as [+voice, -continuant, +nasal, +sonorant, +labial]. While a full list of distinctive
features for any given speech sound would be quite long, typically only the features that
have changed are analysed. For example, the change from the fricative /ʃ/ to /s/ can be represented as an
alteration in place features where [-anterior, +high] → [+anterior, -high], assuming all
the other features such as [+fricative] remain unchanged.
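To make the feature-matrix idea concrete, the sketch below encodes a few segments over a deliberately reduced feature set. The inventory is illustrative only; published feature charts such as Hayes (2009) are far larger and include unspecified values.

```python
# Illustrative, heavily reduced feature specifications (not a complete
# or authoritative inventory).
FEATURES = {
    "m": {"voice": "+", "continuant": "-", "nasal": "+", "sonorant": "+", "labial": "+", "anterior": "+"},
    "b": {"voice": "+", "continuant": "-", "nasal": "-", "sonorant": "-", "labial": "+", "anterior": "+"},
    "s": {"voice": "-", "continuant": "+", "nasal": "-", "sonorant": "-", "labial": "-", "anterior": "+"},
    "ʃ": {"voice": "-", "continuant": "+", "nasal": "-", "sonorant": "-", "labial": "-", "anterior": "-"},
}

def feature_difference(a: str, b: str) -> int:
    """Count the features on which two segments disagree."""
    fa, fb = FEATURES[a], FEATURES[b]
    return sum(1 for f in fa if fa[f] != fb[f])

print(feature_difference("s", "ʃ"))  # only [anterior] differs: 1
print(feature_difference("m", "b"))  # [nasal] and [sonorant] differ: 2
```

On this view, transcribing /ʃ/ as /s/ is a much smaller error than transcribing it as an unrelated segment, which is exactly the intuition a phonetic accuracy metric needs to capture.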
When it comes to transcription activity, symbolic systems are necessary to reduce the
amount of information to something manageable. The IPA with the wide array of
diacritics is generally sufficient for most impressionistic transcriptions. Fine phonetic
transcription describes speech sounds with great precision, complete with allophonic
variation. By definition such detail is not necessary to capture the language for speakers
of the language. Bloomfield argued that such detailed phonetic level transcriptions
would always be subjective and arbitrary (1933, pp. 84-5). Chomsky and Halle saw
language-independent transcriptions as meaningless, arguing instead that transcription
should be viewed as a continuum between the ideal of a broad phonemic transcription
with the least detail at the end and a narrow phonetic transcription with the most detail
at the other end.
Bloomfield is clearly right to the degree that impressionistic transcription is subjective
on the part of the transcriber. How broad or narrow their transcription ends up being
will depend on a range of factors such as whether they can perceive narrow phonetic
detail or whether they think it’s relevant. A key challenge for this research is in
comparing these subjective transcriptions. Distinctive features are particularly relevant
here since it’s possible to convert the symbolic IPA representation into lists of
distinctive features. Computational methods can compare these lists so that we’re able
to assess how phonetically similar one symbol (in a transcription) is to another. This
technique is discussed in more detail in section 3.6.
4 Largely because widely published lists of features such as Hayes’s (2009) are based on trinary values
where a dash or other symbol means ‘not specified’.
2.3 H&H theory
The wide variability in natural speech, particularly fast speech, has been noted widely
(Dalby, 1986; Greenberg, 1999). Lindblom’s H&H theory accounts for the ‘invariance
problem’, where speech articulations vary to such an extent that it is difficult to provide
a consistent phonetic definition. The same speaker may even produce a continuum of
variation motivated by communicative needs. H&H theory suggests we may view this
continuum with listener-oriented clarity at one end (hyper speech) to talker-oriented
economy of effort at the other (hypo speech). If a speaker believes that the listener will
have difficulty understanding, such as in a noisy environment or a listener with
comprehension issues, the speaker will ‘tune’ their performance, slowing the rate of
speech and making other changes often described as speaking ‘clearly’; speech
produced in this way is described in the literature as ‘clear speech’. A typical scenario
might involve communicating with an elderly hard-of-hearing relative. Aside from the
more obvious volume modification, clear speech exhibits a reorganisation of
articulatory gestures with resulting acoustic properties (Moon & Lindblom, 1989),
motivated to enhance phonemic contrasts.
The discrimination between possible speech segments is guided by knowledge of the
language in a signal-complementary process. Speakers estimate an appropriate trade-off
between hyper speech and hypo speech by also estimating the contribution of this
signal-complementary process in the listener. This process is inherently tied to
knowledge of the language. Luce (1986) showed that the probability of listeners
recognising words was influenced by a number of factors including the frequency of the
word and the similarity of pronunciation with other words. Functionalist accounts
suggest we may view these effects as evidence of usage-based patterns or schemes
emerging to increase generalisation and ease of access (Bybee, 1999). The effectiveness
of language-informed signal-complementary processes in speech comprehension is such
that native speakers can recognise words with highly reduced consonants. Context in discourse also plays a role in the signal-complementary process. Ernestus et al. (2002)
showed a strong negative correlation between consonant reduction and intelligibility
where words had fewer contexts to disambiguate the lexical item.
H&H theory suggests that the phonetic variation in natural speech can be explained as a
‘tug-of-war’ between opposing motivations of speaker-oriented factors (economy of
effort) and listener-oriented factors (achieving comprehension). Furthermore the
variation in sound systems of different languages means that the properties of speech
‘tuning’ also vary between languages. In the context of clear speech in linguistic
fieldwork, language consultants are unlikely to be able to estimate the difficulties of
speech sound discrimination in listeners. A speaker’s effort to improve discrimination
between speech sounds will be motivated by language-aware signal-complementary
issues as well as acoustic phonetic properties of the language. Some of the properties of
movement along the H&H continuum would seem to be universal, such as slowing of
speech rate and enhancing word segmentation but the full pattern of phonetic variation
is considerably more complex.
2.4 Clear speech
Of goal-oriented speaking styles, clear speech is uniquely oriented towards enhancing
intelligibility (Smiljanić & Bradlow, 2009). As a natural consequence, the literature on
clear speech has focused on intelligibility gains in varying conditions. Picheny, Durlach,
& Braida (1985) showed that clear speech delivered nearly 20% intelligibility
improvement in hearing-impaired listeners. Perhaps unsurprisingly, non-native listener
comprehension has not demonstrated the same benefit. Bradlow & Bent (2002) found
that comprehension gains for non-native listeners of English were less than a third of
those for native listeners.
Studies in the clear speech literature have relied on participants reading the same set of
materials in some manner of clear speech. Participants were asked to speak as if they
were conversing with someone who is foreign or who has hearing difficulties (Picheny
et al., 1985; Schum, 1996). These studies found wide variation in the properties of the
clear speech produced, with correspondingly wide variation in comprehension gains.
This may well be an artifact of participants forming their own interpretation of an
appropriate level of clear speech. More recently Wassink et al. (2006) contrasted speech
from different varieties of clear speech including Infant-Directed Speech (IDS) and so-called Lombard speech in Jamaican speakers of Creole and English. The Lombard
effect is the observation that speakers modify their speech production in noisy
environments. Wassink et al. found that not all forms of clear speech demonstrate the
same acoustic modifications but again found it expedient to describe the different types
of clear speech along a continuum in line with H&H theory.
Some of the acoustic metrics that have been used to examine clear speech include
speaking rate, duration of speech sound segments, pauses, fundamental frequency and
vowel formant frequencies, particularly the vowel ‘space’ represented by the limits of
formants occurring in different modes of speech. Moon and Lindblom (1994) found that,
independent of speech rate, faster formant transitions were suggestive of faster
articulations in clear speech. They suggested that this reorganisation was motivated by
avoiding coarticulatory effects and resulting target undershoots. Liu and Zeng (2006)
explored the contribution of temporal reorganisation to clear speech comprehension
gains by modifying casual speech in two ways: stretching it to the same length as clear
speech, or inserting gaps to attain the same length. Gaps were found to provide a superior
comprehension benefit. They concluded that the beneficial acoustic cues resulting from
temporal changes in clear speech were ‘multiple and distributed’.
2.5 Types and levels of noise
Assuming that respeaking aids transcription, the third research question concerns the
extent that accuracy gains can be attributed to the properties of clear speech and to what
extent gains can be attributed to audio quality improvements. This suggests an
experimental method where respeaking of a noisy recording is compared with
respeaking of a clean recording. Practical considerations dictate the choice of a single
level of signal-to-noise ratio (SNR) in this study, stemming from the size of the dataset
required for statistically significant results. The choice of the noise level and the type of
noise deserves explanation in the context of the relevant literature.
As we have seen, one of the motivations of respeaking is to ‘regenerate’ a recording so
that it is free of unwanted noise. Noise in this context is principally of two types. The
first is noise introduced by the recording equipment. This tends to be random like the
hiss of an audio cassette. In Woodbury’s Cup’ik documentation environment, audio
cassettes would have provided around 40-50dB SNR at the time of recording, reducing
to around 30dB for tape of this vintage played today. Quiet passages in the recording
might then reduce the effective SNR to levels of perhaps 10dB or slightly worse,
resulting in the ‘hard-to-hear’ tapes Woodbury described.
In practical situations with field recordings, no audio recording should ever be so
degraded that native speakers are unable to comprehend the recording. However non-native speakers, and by extension the future philologist, do not have the strength of
signal-complementary processes enabled by language awareness and therefore are
vulnerable to degraded signals. Bradlow & Bent found that non-native speakers were
‘disproportionately challenged’ by degraded signals in comprehension tests compared with native
speakers. Earlier work in the 1950s showed that correct identification of the place of
articulation of English consonants was ‘severely’ impacted in noisy recordings (Miller
& Nicely, 1955) which may be suggestive of the types of transcription errors made in
noisy conditions.
The second type of noise is that resulting from unwanted acoustic events such as wind
noise. Speech from other members of the speech community who are not the present
target of the recording session is another common source of unwanted noise. Listeners
subjected to multi-talker ‘babble’, as it’s known in the literature, face the additional
challenge of linguistic interference. Van Engen and Bradlow (2007) showed that six-talker babble impacted comprehension more than two-talker babble. The linguistic
interference is stronger when the babble is more ‘comprehensible’ and in the same
native language as the adversely-affected listener. Native English speakers were
affected more by two-talker babble than non-native speakers. The scenario of the future
philologist assumes a non-native listener environment but we should note that heritage
speakers would likely be more adversely affected by multi-talker babble in their
language. Noise that has similar spectral characteristics to voice has also been shown to
result in stronger decreases in intelligibility (Brungart et al., 2001).
Studies in the clear speech literature have often preferred to use white noise to digitally
degrade recordings for experimental purposes. Clear speech studies investigating multi-talker background noise have either synthesized babble out of natural speech or
employed ‘speech-shaped’ noise where the spectral power contour has been shaped to
match that of human speech. In effect this is multi-talker babble in the limit of an
infinite number of talkers (Kalikow et al., 1977). While the contribution
of noise due to recording devices has fallen in recent years, we have no reason to
believe that field recording environments have changed. Therefore speech-shaped noise
would seem to be a reasonable choice of noise type where we are seeking to mimic
recording environments where multi-talker babble is commonplace.
When searching for an appropriate level of noise for degraded signal conditions, the
levels of noise used in clear speech comprehension studies have been of an entirely
different magnitude to the levels of noise in poor-quality audio recordings. SNRs
of -4 and -8dB were employed based on trials that showed these levels resulted in mid-to-high range intelligibility for native speakers. This is extremely noisy given that
negative SNR describes conditions of more noise than speech signal. These levels are
patently not equivalent to those encountered in the context of linguistic fieldwork
methods. This study will instead adopt an SNR level for degraded audio which aims to
replicate a plausibly noisy field recording.
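To make these decibel figures concrete, recall that an SNR in dB corresponds to an RMS amplitude ratio of 10^(dB/20). The sketch below is an illustration of this relationship only, not part of the study’s tooling.

```python
def snr_db_to_amplitude_ratio(snr_db):
    """Convert an SNR in decibels to a signal/noise RMS amplitude ratio."""
    return 10 ** (snr_db / 20)
```

At -8dB the noise RMS is roughly 2.5 times the speech RMS, whereas at the 9dB level adopted in this study the speech RMS is roughly 2.8 times the noise RMS.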
2.6 Ramifications for respeaking tasks
While these studies provide some theoretical backdrop for this research, some aspects
limit the scope of their application to clear speech in linguistic fieldwork. In Bradlow &
Bent’s study, participants had extensive experience of English (a mean of 9.7 years of study).
This is not comparable to language-naive transcription employed in this study.
Nevertheless it’s interesting to note the surprisingly low benefit of clear speech among
participants that, while not native speakers, nevertheless had extensive working
knowledge of English. This suggests that the full benefits of clear speech are unlikely to
be realised in the context of respeaking as a fieldwork method.
There are further considerations which may reduce the effectiveness of clear speech. To
produce respeaking oral annotations in the BOLD methodology, language consultants
are asked to repeat previously recorded material slowly and carefully for the benefit of a
recording device rather than for another person. This is very different from the natural
process of active negotiation described in H&H theory. In the BOLD method, speakers
have no opportunity to gauge comprehension and by extension no ability to estimate
signal-complementary processes in the listener. Without comprehension checks,
speakers are forced to guess an appropriate level of ‘tuning’ of their speech production.
This presents an additional concern in how we can adequately explain the desired
‘degree’ of hyper speech required and how we may know if it has been met. If the
speaker has been working at the task for an extended period they may find the task
tiring and perhaps take less care with their production over time.
This apparent shortcoming of the BOLD respeaking method was also a shortcoming of
the earlier studies in clear speech. Recall that those too relied on participants forming
their own judgements about what constituted clear speech. BOLD respeaking methods
may also exhibit similarly wide variations in speech modification and, by extension,
wide variations in intelligibility gains. This is not ideal; if it should prove to be the
case, it suggests that further research is needed into ways to regulate the extent of clear
speech.
Finally, we should take a moment to reflect on respeaking in linguistic fieldwork that
apparently addresses this concern. In the method first described by Bouquiaux &
Thomas, consultants were respeaking to linguists making transcriptions. Such a
situation does allow for comprehension checks and will repeatedly motivate the speaker
to shift their production to the hyperspeech end of the H&H continuum for the benefit
of the linguist attempting a transcription. If we were recording the entire session, it’s
reasonable to suggest that the recordings will include language clear enough to facilitate
comprehension for the linguist. Otherwise they would not have proceeded to the next
phrase. Therefore if this research yields a negative result for respeaking in the context of
BOLD, these results may not be applicable to other uses of respeaking in linguistic
fieldwork.
3. Method
3.1 Overview
Speakers of the Indo-Aryan language Nepali were recruited to record narratives using a
BOLD-like methodology implemented by the Aikuma smartphone application5. The
method employed focuses on the production of careful speech as a spoken annotation of
the spontaneous recording. A preliminary investigation was conducted to examine the
properties of Nepali clear speech and to inform the development of a means to compare
accuracy between written transcriptions.
The main phase of the study involved the capture of over ten minutes of spontaneous
narrative from two Nepali speakers. Four participants with linguistic training were
recruited in order to produce phonemic transcriptions of the natural speech using the
software Praat (Boersma & Weenink, 2010). These transcriptions were then compared
against the reference transcription for the purpose of assessing overall accuracy rate
under varying experimental conditions. Those conditions were the availability of a
respoken clear speech version of the spontaneous speech, and whether the spontaneous
recording had been degraded with artificial noise.
This section begins with a brief overview of the Nepali language and specific
phonological properties that may predict transcription errors. This is followed by details
of the participants and experimental procedure including data collection, processing and
a description of the transcription activity. Finally, the means to compare transcription
accuracy is presented along with an explanation of the accuracy metric and the
underlying modified ‘edit distance’ taking into account phonetic similarity.
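As an illustration of the underlying idea, a modified edit distance can charge substitutions a cost reflecting phonetic similarity rather than a flat cost of one. The sketch below is a generic weighted Levenshtein distance with the substitution-cost function left as a parameter; it is an assumed formulation for illustration, not the implementation used in this thesis.

```python
def weighted_edit_distance(ref, hyp, sub_cost):
    """Levenshtein distance over symbol sequences where substitutions
    are charged sub_cost(a, b) in [0, 1] instead of a flat 1."""
    m, n = len(ref), len(hyp)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)                     # deletions only
    for j in range(1, n + 1):
        d[0][j] = float(j)                     # insertions only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                         # deletion
                          d[i][j - 1] + 1,                         # insertion
                          d[i - 1][j - 1] + sub_cost(ref[i - 1],
                                                     hyp[j - 1]))  # substitution
    return d[m][n]
```

With a flat cost function this reduces to the classic edit distance; with a phonetically informed cost, substituting a similar sound (say /d/ for /t/) is penalised less than substituting a dissimilar one.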
3.2 Nepali
Nepali is an Indo-Aryan language and the national language of Nepal with more than 11
million native speakers (Khatiwada, 2009). The choice of Nepali for this study was
made in light of several considerations: the availability of native speakers (see section
3.4), a phonology that would not present too many difficulties for native-English-speaking transcribers and, finally, a regularised orthographic representation to help
inform a reference transcription.
5 Aikuma may be downloaded from https://play.google.com/store/apps/details?id=org.lp20.aikuma
Difficulties in listening comprehension of foreign languages are well explored in
applied linguistics (Tinkler, 1980; Boyle, 1984). Phonological aspects relevant for this
study include non-native speakers having difficulty perceiving sounds which are not
contrastive in their own language. Native English speakers have difficulty
distinguishing between contrastive stops in the closely-related language Hindi (Werker
et al., 1981; Pruitt et al., 1998). Lessons may also be drawn from forensic transcription
which has long been concerned with issues of quality and reliability in the transcription
of natural language. Linguists need to be aware of their own potential perceptual
failings when approaching another language and pay particular attention to speech
sounds that differ from their own language (Fraser, 2003).
The greatest difficulties in identifying phonemes may arise when two speech sounds are
contrastive phonemes in the L2 language but are allophones of a single phoneme in the
native L1 language (Brown, 2007). Therefore we can hypothesize that transcription
errors will coincide with participants’ difficulty in perceiving these contrasts. The
phonemic inventory is given in Table 3.1. This table is based on the consonants and vowels
from Khatiwada (2009) and diphthongs from Pokharel (1989).
Table 3.1: Nepali phonemic inventory
Consonants:
             Bilabial    Dental      Alveolar        Retroflex   Palatal   Velar       Glottal
Plosive      p pʰ b bʱ   t tʰ d dʱ                   ʈ ʈʰ ɖ ɖʱ             k kʰ g gʱ
Affricate                            ts tsʰ dz dzʱ
Nasal        m                       n                                     ŋ
Tap or flap                          r
Fricative                            s                                                ɦ
Lateral                              l
Approximant  (w)                                                 (j)

Vowels:
             Front    Central    Back
High         i                   u
Close-mid    e                   o
Open-mid              ʌ
Open                  a

Diphthongs: /ui/ /iu/ /oi/ /ʌi/ /ai/ /ei/ /eu/ /ou/ /ʌu/ /au/
3.3 Prediction of errors
Based on the differences between Nepali and English phonology, we can anticipate a
number of perceptual difficulties that native English speakers will have with Nepali
which may transfer to errors in transcription. However not all of these are
straightforward given that transcription involves not just listening for speech sounds but
also evaluating visual evidence from the transcription software, chiefly spectrograms.
Possible sources of errors are discussed as follows:
1. Retroflex consonants

Figure 3.1: Acoustic evidence of a retroflex consonant

English has no retroflex consonants and therefore we
could anticipate a failure to distinguish contrastive dentals
where, for example, /ʈ/ → /t/. The mitigating factor is
that English speakers can perceive retroflex consonants,
particularly given the American English retroflex ‘r’.
Additionally, three out of four participants have been
exposed to the related language Hindi in university
coursework completed prior to this study. The
instructions provided to participants (Appendix 1) also
stated:
Note: A really good way of spotting retroflex
consonants is that the F3 formant comes close to F2,
most visible in adjacent vowels.
Given that formant frequency bands are the cues to identification of retroflex
consonants, spectrogram identification may be resilient to acoustic noise since bands of
energy tend to be visible. In connected speech, intervocalic /ɖ/ often lenites to a
retroflex tap /ɽ/ as shown in Figure 3.1.
2. Aspiration and breathy voicing
Nepali aspiration is distinctive in both voiced and unvoiced stops. This contrasts with
English, where /b/ is often realised as an unaspirated [p] and /p/ is often realised as an aspirated [pʰ].
Therefore we would expect that failure to distinguish between aspirated contrasts may
be a common source of errors. One mitigating factor is that aspiration on spectrograms
can be easily observed as a burst of aperiodic noise. However participants will need to
distinguish between the lengths of the aspiration for phonemic contrasts. Aspiration also
becomes difficult to see on a spectrogram under noisy conditions. The distinctive
aperiodic noise of aspiration in the speech signal may be masked by introduced noise.
Figure 3.2: Aspirated consonant spectrogram
Figure 3.2 shows the same speech sound /tʰ/ under different noise conditions. Left is a
clean recording with the noise of aspiration clearly visible; right is a noisy (9dB SNR)
recording where the aspiration has all but disappeared into the noise.
3. Vowel contrasts
As with consonants, where the vowel system of one language differs from another,
speakers can be expected to have difficulty distinguishing vowels. In a large-scale
cross-corpus study, Becker-Kristal (2010) developed an acoustic typology of vowel
inventories. Figure 3.3 is a representation of the structural configuration of Australian
English (6L0) and Nepali (6R0) extracted from the relevant sections of Becker-Kristal’s
PhD thesis, with emphasis on the left/right difference. These coincide with the
Australian English /æ/ and the Nepali /ʌ/ vowel. There are also differences in the
nominal F2 of the low central vowels.
Figure 3.3: Australian English & Nepali vowel spaces
Taken from Becker-Kristal (2010): Comparison of the ‘structural configuration’ of Australian English (left)
and Nepali (right) shows that the major difference is that Australian English has a mid-front [æ] phoneme
category compared with Nepali’s mid-back [ʌ]. Therefore one might expect that native Australian English
speakers have difficulty with the [ʌ] vowel in particular.
While it is possible to obtain objective measures from transcription software, doing so is
laborious and may be inconclusive given that the articulators, and subsequently acoustic
evidence, may not attain a steady state in connected speech. This may be an area where
consultation of a clear speech version could help identify the intended vowel target
where that target has not been attained in spontaneous speech.
3.4 Participants
Two Nepali language consultants aged in their mid 20s were recruited by the
investigator to take part in the study. One male and one female, both consultants are
native speakers of Nepali and come from the national capital of Kathmandu. Both were
international students at a major Australian university at the time of the study and are
fluent in English with extensive secondary and tertiary education in English. The
recording of Nepali narratives and the respeaking task were performed independently of
each other. The consultants received a moderate remuneration for their time.
Four participants were additionally recruited for the transcription experiment, three
male and one female. All were 4th year students in a linguistics program at a major
Australian university. Three had recent training in transcription using the Praat software
in an experimental phonetics subject. The sole female participant withdrew after a week
citing heavy work commitments which conflicted with the time burden of participation.
She later resumed, continuing the transcription up to file 50, or around 45% of the data. All
participants were remunerated.
3.5 Procedure
3.5.1 The Aikuma application
Bird and Hanke’s (2013) Aikuma smartphone application was used to record
spontaneous narrative and to handle the clear speech spoken annotation task. The
interactive implementation of the BOLD method means that language consultants are
able to perform recording and subsequent spoken annotations by themselves, thereby
providing the means to ‘crowdsource’ natural language. Recording narrative is as easy
as speaking into the phone much like a regular telephone conversation. Subsequent
spoken annotations are produced in a separate respeaking mode in which the consultant
listens to the initial recording and begins respeaking at any time. When they do, Aikuma
pauses playback and records the spoken annotation until they finish, after
which it resumes playing the spontaneous recording.
The application stores the recordings in digital format along with metadata which
indicates the alignment of sections of spoken annotations with the spontaneous speech.
At the time that Aikuma was used in this study the software was in a relatively early
stage of development and so measures were taken to use backup recording devices
while still benefitting from the automated speak-pause-resume implementation of the
BOLD method. One of the key advantages of an Android application is that it may be
run on inexpensive commodity devices. However it also introduces an element of
variability in performance and audio quality between devices.
3.5.2 Data collection
The male and female Nepali speakers participated one after the other and under
somewhat different conditions. For the male speaker, a demonstration of Aikuma was
first provided as well as some instruction in respeaking. Informed by Labov’s urban
fieldwork methods (Labov, 1972) to elicit more naturalistic speech, the consultant was
asked to recall a time he thought his life was in danger. This resulted in around six
minutes of enthusiastic unselfconscious speech. With the spontaneous narrative
recorded successfully, the respeaking phase was then performed. The session was also
recorded with the built-in microphones of a professional Zoom H4n recorder.
The audio quality from the high-end Samsung Galaxy Nexus smartphone turned out to
be sufficient but a software bug stopped the respeaking process from working correctly,
in so far as the software lost track of the alignment between the casual and respoken
audio. To remedy this, the audio was manually segmented using an audio editing
software package. The result was two files representing the same content, one for the
spontaneous recording and the other for the respoken clear speech recording.
The female speaker was recorded using a slightly different process. Rather than the
office location of the male speaker, a recording studio situated on the university campus
was used instead. The back-up Zoom H4n was connected to studio microphones and
recorded both the internal and studio microphones in four-track mode. In this case the
consultant used an HTC Desire C phone. This time the software’s respeaking function
operated without a hitch resulting in metadata that could be used to prepare the audio
files automatically. However the recording quality of the HTC Desire C was
substandard with the phone’s aggressive automatic gain producing distorted/clipped
audio. The metadata was instead used with the high-quality recording from the Zoom
H4n professional recorder.
Given these differences between the recording procedure of the male and female
speakers, the male speaker audio can be impressionistically described as good while the
female speaker’s audio is excellent. Therefore the quasi-independent ‘speaker’ variable
accounts for a number of differences including individual, sex, recording device and
recording location. We may expect to see this impact accuracy results accordingly.
Both participants were also asked to provide a written transcript of their narrative in the
Devanagari script. The Devanagari script was converted to the roman transliteration
standard ISO 15919 (ISO/IEC 15919, 2001). This differs from the commonly used
‘Hunterian’ transliteration by including a series of diacritics to represent the larger array
of consonants and vowels in Devanagari. The result is a transliteration scheme that
retains all of the phonemic detail of Nepali from the original Devanagari script.
3.5.3 Processing the data
Speech-shaped noise was chosen as the most appropriate type of noise to introduce in
the experimental conditions, so as to approximate the multi-talker babble of field recording conditions.
The level of noise was decided by analysing the audio levels from a test recording made
on a Zoom H4n where a group of students were talking just a few metres from the
subject being recorded. The root mean square (RMS) difference in amplitude between
the subject speech and the nearby unwanted speech was approximately 9dB. Subjective
testing of audio files degraded with these parameters revealed them to be
noisy enough that one would certainly expect some detrimental effect on
accurate perception. The visual clarity of the spectrogram was also significantly hampered.
The individual audio files of variable phrase length were transformed into a series of
experimental conditions under the following process:
6 The conversion was performed by the iso15919 Python library by Mublin:
http://dealloc.org/~mublin/iso15919.py
1. RMS normalisation to -12dBFS (12dB below full scale or 0.25 of max)7 for both
casual and respoken audio recordings.
2. Creation of a ‘noisy’ version of the casual speech recording by mixing ‘speech-shaped’ noise with the original recording such that the RMS level remains at -12dBFS
and the signal-to-noise ratio is 9dB.
3. Cutting of casual speech (‘clean’ and ‘noisy’) and respoken speech into
individual files.
4. Individual experiment conditions created (see Table 3.2) and randomised.
5. Data structure of experiment conditions archived, experimental files created for
distribution, human-readable index created with ISO 15919 transliteration and
experimental conditions.
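The arithmetic behind the normalisation and mixing steps above can be sketched as follows, treating audio as a list of floating-point samples in [-1, 1]. This is a minimal illustration under those assumptions (the actual processing used normalize.exe and an audio editor), and it does not re-normalise the mixed file afterwards.

```python
import math

def rms(samples):
    """Root mean square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def normalise_rms(samples, target_dbfs=-12.0):
    """Scale samples so their RMS sits at target_dbfs (1.0 = full scale)."""
    target = 10 ** (target_dbfs / 20)   # -12 dBFS is roughly 0.251
    gain = target / rms(samples)
    return [s * gain for s in samples]

def mix_at_snr(speech, noise, snr_db=9.0):
    """Scale noise so the speech/noise RMS ratio equals snr_db, then mix."""
    ratio = 10 ** (snr_db / 20)         # 9 dB is roughly 2.82
    gain = rms(speech) / (ratio * rms(noise))
    return [s + n * gain for s, n in zip(speech, noise)]
```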
3.5.4 Experimental conditions
There are two independent variables:
1. Noisy or clean spontaneous speech file is provided.
2. Respoken ‘clear speech’ file provided or not provided.
Additionally there is the quasi-independent variable of speaker which coincides with
two individuals being male and female. The independent variables result in a two by
two matrix of experimental conditions as follows:
Table 3.2: Experimental conditions matrix
                    Noisy spontaneous                 Clean spontaneous
Respeaking          Noisy spontaneous recording,      Clean spontaneous recording,
                    respeaking file provided          respeaking file provided
No respeaking       Noisy spontaneous recording,      Clean spontaneous recording,
                    no respeaking file provided       no respeaking file provided
The dependent variable is the measure of accuracy of a transcription coinciding with the
above independent variables. The same data was used for all transcription participants.
7 Peak-normalization is far more common in software implementations such as Audacity.
Unfortunately this results in fairly dramatic differences in perceived volume across files. The
normalization here was performed by normalize.exe, available from: http://normalize.nongnu.org/
3.5.5 Data validation and volume
The male speaker provided a spontaneous narrative of six minutes in duration and the
female speaker four minutes. The process described in section 3.5.3 resulted in 142 files.
The spontaneous and respoken speech recordings were checked against the written
transcript, segmenting the ISO 15919 transliteration in the process. Where there was a
poor correlation between the transcript, the casual speech and respoken speech
recordings, these files were discarded. Usually this was due to paraphrasing in the
respoken recording and more rarely, significant differences between the written
transcript and audio recordings. This most commonly occurred for the male speaker.
After this process there were 60 files for the female speaker and 65 files for the male
speaker. This provided enough data for 14 files in each experimental condition per
speaker:
14 files * 2 (noisy/clean casual) * 2 (respeaking/no respeaking) = 56 files
For both speakers this resulted in 112 files. Four additional files were added at the
beginning of the data set, two for each speaker, and all with respeaking but varying the
noise condition to introduce the possible conditions during initial training. All 116 files
were included in the overall analysis.
While there are an equal number of files for each speaker, differences between the
speakers in how they segmented their narrative in the respeaking task result in different
quantities of words spoken by the female and male speaker. The total quantity of data
was 957 words in 116 files. Of those, the data contained 536 words spoken by the male
speaker, with an average of 9.2 words per file. The female speaker data contained 421
words, with an average of 7.3 words per file.
3.5.6 Reference transcription
In order to derive accuracy metrics from participant transcriptions, it was first necessary
to produce a reference transcription of the casual speech recordings. The accuracy of
this transcription is critical, given that any error in the reference transcription would
invalidate comparisons of that section against participant transcriptions. The reference
transcription was produced by the investigator with a number of critical differences
between the method used to produce this transcription and that used by participants,
such as:
1. Systematic consultation of Devanagari orthographic transcription
2. Access to respoken audio throughout
3. Cross-checking transcription with language consultants
4. Longer time taken to produce the transcription (over 40 hours)
The Devanagari orthography was the most useful resource in assisting an accurate
transcription. In the majority of cases it narrowed the search space of identifying speech
sounds to within the possible allophones of the phonemes represented in the
orthography. However connected speech processes were very common, particularly
elision of word final syllables. The respoken audio was generally useful in identifying
where such elision had taken place and particular care was taken not to transcribe
speech sounds that were not articulated in casual speech.
In a small number of cases, the language consultant’s orthographic transcription did not
perfectly match the audio recording. In these cases consultants were asked to produce a
revised orthographic transcription. Additionally, some difficult-to-identify passages,
fewer than ten in total, were cross-checked with language consultants. In these cases,
after playing their own speech back, alternative transcriptions were read out and the
consultant was asked to select the one that sounded most like their own utterance. While time
consuming, this yielded insights into correct identification of particular allophones.
3.5.7 Transcription activity
Study participants were asked to produce a transcription of each of the 116 spontaneous
speech files. In half the cases there were two files with the same prefix where one was
the spontaneous speech recording and the other the respoken version. For the first file
this would mean participants saw filenames: 1_normal.wav and 1_respeaking.wav.
Participants were directed to open the spontaneous recording in Praat and transcribe to
the best of their ability. Resulting Praat textgrid files were saved for later analysis.
Written instructions were provided (Appendix 1) to the four participants taking part.
These include a phonological inventory of Nepali with IPA and X-SAMPA symbols,
the latter being more easily typed into Praat. Some observations on morphology and
common phonotactics were also included.
All participants spent the first two hours of transcription in the presence of the
investigator in a university computer lab where computers were equipped with
headphones and the required software. Participants were shown how the respeaking file
could be opened into Praat for viewing and playback and how to tile Praat windows for
side-by-side comparison. They were informed that the respeaking file was assumed to
be helpful and the first four files all had respeaking versions to introduce the concept.
However participants were not told how to use the respeaking file.
It was made clear to the participants that the investigator was not an authority on the
Nepali language or specific choices in transcription. Nevertheless, some intuition was
provided verbally relating to speech sounds that differ from English and may prove to
be problematic such as retroflex consonants, breathy voicing and the vowels /a/ and /ʌ/.
After this initial two-hour period, participants were free to continue working either in
the lab or somewhere else such as their own home. One participant (participant 1) chose
to keep working in the lab. It should also be noted that the computer lab was a
communal facility within a phonetics department at a major Australian university and at
least one resident researcher had prior experience of Nepali. Participant 1 discussed the
transcription with that researcher and this may have contributed to the transcription
accuracy of participant 1.
3.6 Measuring accuracy
The transcription activity resulted in 116 Praat textgrid files for each of the three
participants who completed the work, and another 50 for the fourth participant who did
not complete the task. These files contain a string of X-SAMPA labels for each speech
sound, time-aligned with the appropriate audio file. The string of phonetic symbols was
extracted with a Python program using the Praat textgrid parser in the Natural
Language Toolkit (NLTK) package (Bird, Loper & Klein, 2009). In order to support
statistical analysis, an automated similarity metric was required for each file, compared
against the reference transcription. Coding the transcriptions by hand would be the most
reliable method, but hand processing of 398 transcriptions was judged unduly time
consuming. Therefore an automatic method of comparing phonetic strings was developed
to yield metrics across the entire data set. The design criterion was that the accuracy
metric should return a score in the range 0-100, where 100 represents an
identical transcription and 0 a transcription with no similarity.
Automatic comparison of phonetic strings was not expected to be flawless, nor would it
provide insights into the types of errors. On that basis, some attention was given to
producing a file with parallel transcriptions and accuracy scores, which could be more
easily scanned for general trends (Appendix 2). A subset of the data was hand
coded to illustrate types of errors.
3.6.1 Edit distance
An established means to compare two strings is the edit distance metric. The metric can
be described as the minimum number of steps required to transform one string into
another. The transformation steps are insertion, deletion or substitution of individual
items. Edit distance is a distance metric: a value of 0 is returned for identical strings,
and values up to the length of the longer string are returned for strings with
maximum distance, i.e. no similarity. A common
implementation is the Levenshtein Distance algorithm and this forms the basis of the
phonetic-edit-distance metric described here.
A naive comparison of phonetic symbols using edit distance alone would fail to account
for speech sounds with similar pronunciation. Therefore, when used to compare
perceptual similarity, edit distance is usually combined with phonetic algorithms that
‘normalise’ speech sounds into categories that group similar-sounding speech sounds.
Several exist for English including Metaphone, Soundex and NYSIIS. The result of
these normalisation techniques is that similar sounding words will have the same
signatures. None of these systems are directly applicable to Nepali, nor is collapsing all
phonetically similar speech sounds desirable given the goals of this research. However
the edit distance technique is described here because the method is useful to account for
the comparison of strings where there may be too few or too many items.
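As a concrete illustration, the Levenshtein algorithm can be sketched in a few lines of Python (a minimal sketch for exposition, not the program used in this study):

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to transform sequence a into sequence b (Levenshtein)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

Here `edit_distance("kitten", "sitting")` returns 3: two substitutions and one insertion.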
3.6.2 Improved phonetic edit distance
In order to properly account for phonetic similarity, the distance measure should
account for the relative distance of articulations in both the place and the manner of
articulation. To accomplish this we must discard the symbolic representation of speech
sounds and incorporate the descriptive framework of distinctive features. Gildea and
Jurafsky (1996) introduced a computational algorithm to calculate phonetic similarity
based on Binary Feature Edits Per Phone (BFEPP). In this case they use the Hamming
distance, a metric of difference between two strings of equal length, counting the
number of positions where they differ. Applied to matrices of distinctive features, the
result is a measure of how many features are different between one phone and another.
A recent evaluation found BFEPP performed the best compared with other algorithms
(Kempton & Moore, 2013) so a variation of this approach was developed for the
analysis of data in this study.
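The Hamming-distance component of BFEPP can be illustrated with a toy sketch. The three-feature vectors below (voice, nasal, continuant) are hypothetical simplifications for exposition, not the full distinctive-feature matrices used by Gildea and Jurafsky:

```python
def hamming(f1, f2):
    """Number of positions at which two equal-length feature vectors differ."""
    assert len(f1) == len(f2)
    return sum(a != b for a, b in zip(f1, f2))

# Toy binary feature vectors: (voice, nasal, continuant)
t = (0, 0, 0)   # /t/: voiceless oral stop
d = (1, 0, 0)   # /d/: voiced oral stop
n = (1, 1, 0)   # /n/: voiced nasal stop
```

On this toy inventory, /t/ and /d/ differ in one feature (voice) while /t/ and /n/ differ in two, so /d/ counts as perceptually closer to /t/ than /n/ does.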
Distinctive features describe both manner and place of articulation but place of
articulation is known to be particularly significant for categorising speech sounds. On
this basis a subset of phonetic distinctive features was chosen in three different
categories of distinctive features. A degree of scaling was applied to obtain a more
genuine correlation to perceptual difference between phones. The features and scaling
values are given in Table 3.3:
Table 3.3: Phonetic edit-distance error ‘scaling’ by category

Category   Features                                                       Scaling
Place      "round", "labiodental", "coronal", "anterior", "strident",     1.5
           "lateral", "dorsal"
Manner     "syllabic", "delayed release", "approximant", "tap", "trill",  1.0
           "sonorant", "nasal", "continuant", "strident", "voice"
Vowel      "round", "high", "low", "front", "back"                        2.0
These scaling values were determined by several considerations. While place of
articulation differences are the most significant, a change in place typically alters
multiple feature values between phones. Vowels, by contrast, differ in relatively few
distinctive features across categories, and hence stronger scaling is
applied. These feature changes were added together such that the maximum error is 10
scaled feature differences. Vowel vs. consonant comparison is always counted as a
maximum error (the same cost as deleting or inserting a symbol) and vowel vs. vowel
comparisons only compare vowel features.
The described phonetic edit distance is the same as Levenshtein edit distance except that
substitutions do not automatically have a cost of 1; they have a fractional cost based on
the sum of the number of scaled differences in phonetic features.
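A sketch of this weighted scheme follows. The toy feature entries and the normalisation of the scaled sum by the maximum error of 10 are assumptions made for illustration; the program actually used in the study is not reproduced here:

```python
# Feature categories and scaling values from Table 3.3.
PLACE = ["round", "labiodental", "coronal", "anterior", "strident",
         "lateral", "dorsal"]
MANNER = ["syllabic", "delayed release", "approximant", "tap", "trill",
          "sonorant", "nasal", "continuant", "strident", "voice"]
VOWEL = ["round", "high", "low", "front", "back"]
SCALE = {"place": 1.5, "manner": 1.0, "vowel": 2.0}
MAX_ERROR = 10.0  # maximum scaled feature difference (cost of insert/delete)

def substitution_cost(p1, p2, features):
    """Fractional substitution cost in [0, 1] from scaled feature differences.
    features maps each phone to a dict with a 'vowel' flag and 0/1 features."""
    f1, f2 = features[p1], features[p2]
    if f1["vowel"] != f2["vowel"]:
        return 1.0  # vowel vs. consonant is always a maximum error
    def diffs(names):
        return sum(f1.get(n, 0) != f2.get(n, 0) for n in names)
    if f1["vowel"]:
        scaled = diffs(VOWEL) * SCALE["vowel"]   # vowel vs. vowel
    else:
        scaled = (diffs(PLACE) * SCALE["place"]
                  + diffs(MANNER) * SCALE["manner"])
    return min(scaled, MAX_ERROR) / MAX_ERROR

def phonetic_edit_distance(a, b, features):
    """Levenshtein distance with fractional substitution costs."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        curr = [float(i)]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1.0,       # deletion
                            curr[j - 1] + 1.0,   # insertion
                            prev[j - 1] + substitution_cost(x, y, features)))
        prev = curr
    return prev[-1]

# Toy feature entries (illustrative only, not a real Nepali feature matrix):
TOY = {
    "t": {"vowel": False, "coronal": 1, "voice": 0},
    "d": {"vowel": False, "coronal": 1, "voice": 1},
    "a": {"vowel": True, "low": 1},
}
```

Under these assumptions, substituting /d/ for /t/ costs only 0.1 (one manner feature, scaled by 1.0, over a maximum of 10) rather than the full cost of 1 that plain Levenshtein would charge.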
3.6.3 Summary
The accuracy of a given transcription is taken as the phonetic edit distance between the
reference transcription and the test transcription. The final accuracy, expressed so that
100 represents an identical transcription, is given by:
Accuracy = 100 * ( 1 - PhoneticEditDistance / Length )
An additional metric of speaking rate in words-per-minute was derived by counting the
number of words in the ISO 15919 transliteration and comparing against the length of
the audio recordings. The Python program produced a CSV file with all of the metrics
which was then imported into a data frame in the R language (R Development Core
Team, 2006) for statistical analysis. The Python program also produced the summary
log of all transcriptions and accuracies, as provided in Appendix 2.
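The two derived metrics can be sketched as follows (assuming Length is the length of the reference transcription; the actual Python program used in the study is not reproduced here):

```python
def accuracy(phonetic_edit_distance, length):
    """Accuracy on a 0-100 scale: 100 for an identical transcription,
    0 for a transcription with no similarity."""
    return 100.0 * (1.0 - phonetic_edit_distance / length)

def words_per_minute(word_count, duration_seconds):
    """Speaking rate from an ISO 15919 word count and recording duration."""
    return word_count * 60.0 / duration_seconds
```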
4. Results
4.1 Overview
This section reports the results of the two phases of the study. Firstly, the properties of
Nepali clear speech are reported in Section 4.2 based on a small-scale pilot study with
one speaker. Following sections report results from the main phase of transcription,
beginning with transcription rates in Section 4.3 and statistical analysis of accuracy
results in Section 4.4. Section 4.5 presents the common transcription errors with
illustrative examples highlighting the variation between study participants. The effect of
degraded noisy recordings is reported in Section 4.6 and a summary of findings
concludes in Section 4.7.
4.2 Clear speech in Nepali
In this section some of the specific properties of Nepali clear speech will be explored by
contrasting with the spontaneous speech produced by one of the language consultants in
this study. Krause & Braida (2002) pointed out that there are several types of clear
speech such as clear/slow, clear/normal, loud/normal and many more. This represents a
challenge: it is difficult to judge the degree to which the speech performance is
shifted towards the hyper end of the continuum, or whether certain properties of clear
speech are stronger than others. A good place to start is to examine the same utterance
in different recordings.
Figure 4.1 shows Praat (Boersma & Weenink, 2010) visualisations on the same
utterance.
Counter to expectation the intensity dynamic range looks somewhat lower around the
word boundary between /astelya/ and /aeko/ (ae is not a diphthong, these are separately
articulated morphemes). Cutler and Butterfield (1990) observed that speakers attempt to
mark word boundaries in English clear speech, and this would appear to be replicated
in Nepali, with examples ranging from clear-cut cases of silence to small-but-noticeable
dips in intensity as in Figure 4.1. While the words are compressed to the same timeline,
another observation is a differing ratio of consonant and vowel duration,
particularly for the /ko/ at the end. Smiljanić & Bradlow (2008) observed ‘temporal
restructuring’ of English in clear speech style and suggested that differences in duration
and intensity of speech sound segments are motivated by enhancements of the prosodic
structure of English. The clear speech utterance in Figure 4.1 demonstrates that care has
been taken to articulate the morphological affixes in vowel clusters.
Figure 4.1: Visualising Nepali ‘Clear’ vs ‘Normal’ speech
Comparison of Normal and Respeaking for /astelya aeko/
Equivalent segments in the normal (top) and clear speech recordings (bottom). The clear speech version has
been compressed to the same timeline. In normal speech the inter-word /a/ and /a/ becomes one long /a/ but a
slight inter-word dip in intensity and evidence of a glottal stop can be discerned in the clear speech version.
4.2.1 Durations and speaking rate
As expected, both speakers produced significantly slower speech in the respeaking task
than their spontaneous speech. Figure 4.2 offers duration box plots of consonant and
vowel duration in spontaneous and clear speech, demonstrating wide variation for both,
consistent with the observed variation in speech rate.
Figure 4.2: C and V durations for male speaker of Nepali
Consonant durations in spontaneous and clear speech (left) compared with vowel durations (right). Vowel
durations were considerably more variable in clear speech.
A derived metric of speaking rate was calculated by comparing the number of words in
the orthographic transcription for a given segment with the duration of the audio
recordings for the spontaneous and respoken audio recordings. Violin plots of speaking
rates for both speakers are shown in Figure 4.3. Violin plots combine traditional box
plots with kernel density plots (Hintze & Nelson, 1998), providing a superior
visualisation of the distribution of speaking rates across the entire data set. Interestingly,
speaking rates in clear speech approach a normal distribution, compared with the
wide variation in the spontaneous recording. One explanation for the greater variation
in spontaneous speech is that hesitation sounds frequently occur in these productions:
respeaking lacks the cognitive burden of planning that spontaneous speech carries.
Figure 4.3: Speaking rates of Nepali speakers in spontaneous and clear speech
In general the male speaker spoke significantly faster and exhibited a much greater
variation of speaking rate in the spontaneous narrative. Figure 4.4 shows the relative
modification of speaking rate of each speaker for clear speech. Interestingly, the overall
reduction was only marginally higher for the female speaker.
Figure 4.4: Reduction of speaking rate in clear speech
Male speaker top, female speaker bottom. Horizontal axis is reduction in speaking rate for clear speech in words-per-minute. The mean reduction in speaking rate for both speakers is around 50 words-per-minute.
4.2.2 Expansion of vowel space
A number of studies of English clear speech have shown that vowel targets are
hyperarticulated, with an expansion in the distance between vowel categories resulting
in a larger overall vowel space (Moon & Lindblom, 1994; Krause & Braida, 2004). An
experimental acoustic analysis of the male Nepali speaker confirms that the vowel space
is somewhat reduced in the spontaneous speech narrative compared with the respoken
recording (Figure 4.5). The differences are not large and, in contrast with English, there
are few examples of reduction to schwa-like central vowels in connected speech.
Figure 4.5: Composite plot of Nepali casual vs. clear speech vowel spaces
Composite plot of the ‘vowel space’ of Nepali casual and clear speech recordings for the male speaker. The
vowel category targets are centroids of vowel formant distributions.
4.3 Transcription metrics
Participants reported progress and time taken rounded to half-hour periods. Two
participants took approximately 14 hours to complete all 116 files while another, the
least experienced at transcription, took 20 hours. A fourth participant did not complete
the task, reaching file 50 out of 116. Data relating to time taken was only collected on a
per-session basis; however, this was sufficient to give an indication (Figure 4.6) of the
rate of progress after the training phase and at three points across the data set.
The overall picture was one of a very slow training period where participants would
repeatedly play speech sounds to get familiar with Nepali. Transcription rates then sped
up considerably even by the first third of the data with a long tail of slight speed
improvements throughout.
Figure 4.6: Participant transcription rates in minutes-per-file
Transcriptions took much longer during training, levelling out to around 8 minutes per file. Participants
slowly sped up towards the end of the data set, to final transcription rates of around 4-6 minutes per file.
4.4 Statistical analysis
4.4.1 Reliability of measures
Pearson’s product-moment correlation of accuracy rates between participants
demonstrated a statistically significant correlation between the accuracy scores of the
three participants that had completed the data set. This suggests that the accuracy
measures are reliable as shown in Table 4.1. A composite violin plot of accuracy of all
participants is given in Figure 4.7.
Table 4.1: Inter-participant accuracy correlation matrix

                 Participant 1        Participant 2        Participant 3
Participant 1    1                    0.29 (p=0.0018)      0.27 (p=0.0031)
Participant 2    0.29 (p=0.0018)      1                    0.35 (p=9.95 x 10-5)
Participant 3    0.27 (p=0.0031)      0.35 (p=9.95 x 10-5) 1
Figure 4.7: Participant accuracy comparison
Violin plot (box plot with density) comparing accuracy score distribution of all four transcription participants.
4.4.2 T-tests of independent variable: respeaking
Using accuracy scores of all completed transcriptions from all four participants, a
Welch two-sample t-test of accuracy against the binary factor of respeaking gives a
p-value of 1.86 x 10-9, and therefore the null hypothesis is rejected. The mean accuracy
rates differed significantly, with the no-respeaking condition having a mean accuracy of
73.68 compared with 79.08 in the respeaking condition: a difference of +5.39 accuracy
where the respeaking file was provided. Figure 4.8 presents a scatter plot of all accuracy
scores by file with linear regression and 95% confidence intervals.
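For reference, the Welch statistic can be sketched in pure Python (the study's analysis was carried out in R; a p-value additionally requires the CDF of the t distribution, which is omitted here):

```python
import math
from statistics import mean, variance

def welch_t(sample1, sample2):
    """Welch two-sample t statistic and Welch-Satterthwaite degrees of
    freedom, for samples with possibly unequal variances."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = variance(sample1), variance(sample2)  # sample variances
    se2 = v1 / n1 + v2 / n2                        # squared standard error
    t = (mean(sample1) - mean(sample2)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df
```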
T-tests per participant are shown in Table 4.2. All four participants showed a rise in
transcription accuracy with the availability of a respeaking file, although the result for
participant 4 in isolation is not statistically significant, most likely due to the smaller
sample size, participant 4 having completed only 50 of the 116 files.
Table 4.2: Per-participant respeaking t-test results

                 Estimate   Std. Error   t value   p value
Participant 1    5.86       1.17         5.03      1.9 x 10-6
Participant 2    6.51       1.21         5.34      1.0 x 10-5
Participant 3    4.15       1.14         3.64      4.1 x 10-4
Participant 4    4.21       2.62         1.61      0.12
Figure 4.8: Accuracy scatterplot by file with regression
Scatter plot of accuracy for all participants for each of the 116 files. The red (lower) and blue (upper) lines
represent linear regression predictions for no respeaking and respeaking respectively, with the shaded regions
representing 95% confidence intervals. Note that files < 50 have four results per file (x-axis) while files >= 50
have three results per file, as there are no results from participant 4 beyond that point.
4.4.3 T-tests of independent variable: noise
A Welch two-sample t-test of all accuracy scores for all completed transcriptions from
all four participants, against the binary factor of noise (9dB SNR) versus the clean
spontaneous speech recording, results in a p-value of 0.028, so again the null
hypothesis is rejected. In this case the estimated effect was smaller, at +2.01 accuracy
when participants had the clean spontaneous speech file. Individual t-tests per
participant did not yield statistically significant results, as would be expected given that
the p-value over the full data set is somewhat closer to 0.05. Therefore it was not judged
useful to consider the estimated effect of noise for individual participants.
4.4.4 Assessing the interaction of noise and respeaking
Given that the independent variables of respeaking and noise are shown to be significant,
a two-way analysis of variance (ANOVA) was performed to determine if there are any
significant interactions:
Table 4.3: 2-way ANOVA of respeaking and noise variables

                 F-value   P-value
Respeaking       38.32     1.52 x 10-9
Noise            5.84      0.016
Respeak:Noise    0.86      0.35
No statistically significant interaction was found between the two. A multiple linear
regression of respeaking and noise variables was performed to gain some insight into
the effect size of both variables, summarised in Table 4.4. This is a critical finding
given that a key research question of this study was to identify to what extent the
‘regenerating’ effect of respeaking can be attributed to noise.
Table 4.4: Multiple linear regression of respeaking and noise against accuracy

               Estimate   Std. Error   t-value   p-value
Respeak=True   5.43       0.87         6.23      1.19 x 10-9
Noise=Normal   2.10       0.87         2.42      0.016
Multiple R2 was evaluated at 0.1008. Stated another way, 10% of the variation in
accuracy is explained by the model incorporating respeaking and noise.
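The R2 statistic reported above is the familiar coefficient of determination. A minimal sketch of how such a value is computed from observed and model-predicted accuracies (illustrative only, not the R code used in the study):

```python
def r_squared(observed, predicted):
    """Coefficient of determination: the proportion of variance in the
    observed values explained by the model's predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)              # total
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))  # residual
    return 1.0 - ss_res / ss_tot
```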
4.5 Analysis of common errors
As hypothesized, transcription errors clustered around aspects of Nepali that differ from
English. The most common errors were related to:
1. Breathy voicing and aspiration.
2. Differentiating between vowels, particularly /a/ and /ʌ/.
3. Vowel clusters and diphthongs.
4. Retroflex consonants, particularly the voiced retroflex plosive /ɖ/.
Aspiration and breathy voicing are conflated in this comparison given that they are
predictable based on unvoiced and voiced segments respectively. Aspiration/breathy
errors were particularly common for velar, dental and bilabial stops, in both unvoiced
and voiced manners, e.g. /t/, /d/, /k/, /g/, /p/, /b/. Word-initial aspirated consonants
frequently lenited to the bilabial fricatives /ɸ/ and /β/ and the labiodental fricative /f/.
tended to favour one particular type of fricative and repeat the choice in similar
environments.
The challenge for transcribers is to categorise aspiration based on the duration of the
aspiration. Figure 4.9 shows three cases. The first is phonemically unaspirated but is
arguably aspirated here. The second is a characteristic Nepali contrastively aspirated
voiceless dental stop, while the third is unaspirated. Note that participants 1 and 2
marked aspiration in all cases, participant 3 ignored all aspiration, and participant 4 was
a mixed case and the sole participant to incorrectly transcribe /p/ as /pʰ/.
By far the most common transcription error for all participants was incorrect
identification of vowels. This was due both to the high frequency of vowels and, even
when normalised for frequency, a high rate of misidentification of /a/ and /ʌ/. Given that
vowel classification involves a continuum of formant qualities, the reference
transcription may also be forced to make arbitrary decisions. We should note this as a
source of unreliability in the accuracy comparison. However the effect is minimized,
given that vowel differences influence phonetic edit-distance values only to a small degree.

Figure 4.9: Variation in breathy voicing and aspiration errors
ISO 15919: tira thiyō tāplēju…
File 9 (female, normal audio, no respeaking) transcribed by four participants compared with the reference
transcription (G). Aspiration/breathy voicing errors are circled.
To get some sense of vowel error rate compared to the average accuracy figures, strings
of all vowels in the reference and comparison transcription were compared using the
same phonetic edit-distance procedure outlined in Section 3.6.2. The mean accuracy of
vowels for the entire data set was 77.27, only marginally higher than the mean accuracy
of 76.38 for all categories.
As anticipated, participants had great difficulty distinguishing /a/ and /ʌ/, given that
these are not contrastive in Australian English. The errors were frequent even when
participants could be expected to see formant differences on spectrograms, such as
when /a/ and /ʌ/ occurred in proximity, as shown in Figure 4.10. It seems likely that
participants did not consult spectrogram evidence, preferring to use their auditory
perception instead. Another major source of variation in errors between participants is
in the transcription of vowel clusters.
Figure 4.10: Variation in vowels and vowel cluster errors
ISO 15919: …mī jhanai āttiyau
File 109 (female, normal audio, no respeaking) transcribed by three participants. Even after they had gained
experience and with good conditions for spectrogram evidence, participants frequently had difficulty
distinguishing the vowels /a/ and /ʌ/.
Vowel clusters frequently occur in Nepali as a result of morphological affixation.
Instructions provided to participants suggested that diphthongs typically occur in the
normal length of a vowel. However participants varied greatly in segmenting vowel-like
speech sounds into individual segments, diphthongs or merely long vowels without
reference to vowel quality changes. It should also be noted here that the normalisation
procedure in data analysis eliminates some of these differences by producing a string of
vowels without regard to exactly how they have been segmented. The diphthong /eu/
would be considered the same as /e/ followed by /u/.
One of the most common lexical items in the Nepali narratives is thiyo (it was, past
tense of ‘it is’), frequently occurring at the end of phrases. That is likely why thiyo has
been particularly salient for transcribers such that they were generally consistent in
vowel cluster transcription strategy for this lexeme. In the example in Figure 4.10,
there is barely a steady-state ‘i’ vowel before a glide, but nevertheless F2 starts very high.
Such is the salience of the high-frequency lexeme thiyo that it often results in
misidentifying similar lexemes such as in this case. The spoken word was actually
attiyau (pronounced /attijo/ with a geminate consonant). We can also see the common
misidentification of short stop burst as Nepali phonemic aspiration. In other vowel
clusters it was not uncommon for confusion to arise about whether a vowel was a
diphthong or a single vowel, and whether a glide should be transcribed.
Finally, retroflex consonants present an interesting counterpoint to breathy contrasts and
/a/ vs. /ʌ/ vowel identification. Retroflex consonants also do not occur in English but
misidentification was less common. Impressionistically, the degree of retroflexion in
Nepali consonants is often large with an acoustic quality that is readily identifiable by
English speakers. Occasionally there was evidence that participants had perceived some
additional quality of a retroflex stop but had not made the connection. The example in
Figure 4.11 contains a Nepali retroflex nasal /ɳ/. Three participants inserted a liquid
prior to a standard alveolar nasal /n/. It may be that participants perceived the rapid
convergence of F2 and F3 formants. In many cases the retroflex aspect of voiced and
unvoiced retroflex stops was transcribed correctly, but the categorisation of manner
varied between voiced, unvoiced and retroflex flaps, as in Figure 4.11.
Participants were informed of a common Nepali connected speech process of lenition of
voiced retroflex stops to flaps. However three out of four participants transcribed a
retroflex flap infrequently, preferring to indicate a voiced retroflex stop.
Figure 4.11: Variation in retroflex consonant errors
ISO 15919: … khima pūrṇakō bāṭō thiyō
File 86 (male, normal, respeaking) transcribed by three participants compared with the reference
transcription (G). An example where participants inserted liquids before the consonant, possibly having
perceived formant precursors to the retroflex nasal /ɳ/.
4.6 The impact of noise
Previous studies on noise have focused on comprehension by native speakers and
typically involve far lower signal-to-noise ratios than the 9dB used in this study. The
9dB figure is, however, relative to normalised loudness, and where the natural dynamic
range of speech results in lower intensity, the effective SNR will be considerably lower.
Figure 4.12 illustrates an example where participants would have found the spectrogram
display of almost no use.
Yet all three participants that transcribed this file produced an accurate transcription of
this section. Impressionistically, the only segment that would be difficult given the
noise is the /h/. As one of the more frequent lexical items, and with this case occurring
towards the end of the data set, it is possible that prior exposure allowed participants to
correctly identify the segment.
Figure 4.12: Noise impact on spectrogram legibility
This extract from file 97 (male, noisy, no respeaking) features a quiet passage of speech which has reduced the
S/N ratio to such an extent that features can barely be seen in the spectrogram. Alternative settings barely
improve the situation from this default-settings screenshot.
Given the intuition that noise would appear to make identification of frication more
difficult, a general measure of total aspiration errors was compared with the noise
variable. There was no statistically significant effect. Nor was a similar metric for
retroflex errors found to correlate with noise, not even in the restrictive case of a noisy
spontaneous file with no respeaking file. However there was a correlation between the
accuracy of vowels and noise (p = 0.0149) with an estimated effect size of -2.85 on the
same phonetic edit-distance accuracy scale (out of 100). This is slightly higher than the
estimated effect of noise on overall accuracy (-2.1). This seems to be a somewhat
surprising finding since we might expect that the difficulty in differentiating noise from
vocal tract frication would be observed more commonly than difficulties in categorising
vowels. Vowel identification is dependent on formant frequencies, peaks of spectral
intensity which are not present in noise and therefore ought to be more resilient against
noise masking effects.
4.7 Summary of Results
Overall there was considerable variation in transcription errors between participants.
Nevertheless the availability of a respeaking file resulted in a significant boost in
accuracy and the noise degraded spontaneous recording resulted in a significant
decrease in accuracy. The effect of respeaking was most visible at the extreme limits of
the accuracy ranges observed. Of the 14 individual accuracy results below 60, only
one represented a ‘respeaking’ condition. At the upper end of accuracy scores, of the 26
accuracy results greater than 90, only three represented ‘no respeaking’ conditions. No
other factors, including male/female speaker and speaking rate, were found to correlate.
A linear model accounting for respeaking and degraded audio variables estimates the
effect of each as +5.43 and -2.1 on the phonetic edit-distance scale. The most common
errors were found to be those resulting from factors predicted given the comparison of
the sound systems of English and Nepali. However, some were more frequent than
others. Confusion of the /a/ and /ʌ/ vowels was exceedingly common. Participants had
considerable difficulty with the Nepali aspirated/breathy contrast series, even where
duration contrasts ought to have been clear. Errors in identifying retroflex consonants
were less common, although there was considerable variation in features such as
voicing.
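The form of the two-factor linear model described above can be illustrated with a small sketch. All numbers below (sample size, baseline, effect sizes, noise level) are invented for demonstration and are not the study data; only the modelling approach, ordinary least squares with binary predictors for respeaking and noise, mirrors the analysis.

```python
import numpy as np

# Illustrative sketch, not the thesis data: a two-factor linear model of
# accuracy with binary predictors for respeaking and noise degradation.
rng = np.random.default_rng(0)
n = 400
respeaking = rng.integers(0, 2, n)   # 1 = respeaking file available
noise = rng.integers(0, 2, n)        # 1 = noise-degraded recording
# Synthetic accuracy scores with invented effects of +5.4 and -2.1
accuracy = 78 + 5.4 * respeaking - 2.1 * noise + rng.normal(0, 8, n)

# Fit accuracy ~ intercept + respeaking + noise by least squares
X = np.column_stack([np.ones(n), respeaking, noise])
coef, *_ = np.linalg.lstsq(X, accuracy, rcond=None)
intercept, b_respeak, b_noise = coef
print(f"respeaking effect ~ {b_respeak:+.2f}, noise effect ~ {b_noise:+.2f}")
```

On data of this kind the fitted coefficients land near the effects used to generate it, which is the sense in which the study's +5.43 and -2.1 estimates should be read: per-condition shifts on the accuracy scale after accounting for the other factor.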
None of the experimental conditions were found to correlate with the common errors
observed, even where we might expect to find them such as noise and aspiration. There
was also evidence of participants engaging their own language faculties. Transcription
of high-frequency lexemes became more consistent, but also became a source of error
when these familiar transcriptions were chosen over similar-sounding Nepali words.
5. Discussion
5.1 Overview
The results of this study are discussed as follows. Section 5.2 addresses the three
research questions and Section 5.3 discusses the limitations of the study. Finally,
Section 5.4 discusses the observation that not all differences between the reference
transcript and participant transcripts may be categorised as ‘errors’ and that some may
be better described as interpretive choices.
5.2 Addressing the research questions
5.2.1 Respeaking and transcription accuracy
This study found a statistically significant benefit of respeaking for transcription
accuracy: availability of the respeaking file increased phonetic
similarity scores by an estimated 5.39. This should be understood in the context that
participants were not engaged in producing painstakingly accurate transcriptions with
high phonetic detail. Rather, they were motivated to strike a balance between doing a
good job and maintaining a good working rate so the entire task would not be too time
consuming. When speculating about the workflow of future philologists, one
could argue that these conditions are a reasonable approximation of the
practicalities of transcribing a large audio corpus. Equally, one might argue that future
researchers will have as much time as they need, in contrast to the time pressures of
documentary linguistics.
The total variation in accuracy scores between participants was large. It would be
tempting to conclude that this is symptomatic of the wide variation in clear speaking
properties that were observed in previous studies in the clear speech literature (Picheny
et al., 1985; Schum, 1996). This wide variation in clear speech has been linked to
speakers being asked to provide their own judgement of clear speech rather than finding
an appropriate level of clear speech through a process of negotiation in communicative
events.
However, unlike studies on clear speech comprehension, this study showed no
correlation between the degree of clear speech such as the speaking rate, or even
between the male/female speaker, and the resulting accuracy scores.
All participants expressed frustration at the difficulty of transcribing the male speaker
when he would paraphrase instead of repeating exactly what was said. All participants
agreed that the male speaker produced the utterances that were the most difficult to
transcribe. It would have been preferable for the male speaker’s respoken passages to be
further towards the hyper end of the scale, similar to the way the female speaker
understood the task. It is the investigator’s view, backed by the experience of the
transcription participants, that a means to regulate the production of clear speech would
be useful, perhaps by including more explicit instructions or even by introducing timing
regulation such as Krause and Braida’s (2002) use of a metronome. So strong is this
intuition that it seems prudent to explore other possible factors as to why the data does
not reflect the subjective view of the investigator and participants.
Firstly, we must consider that the use of the respeaking file in the experimental method
was an unregulated procedure. Participants were not instructed in the exact way to use
the file, in part because the best way of consulting the file is simply unknown. Given
their linguistic training, the way the four participants chose to use the files is an
interesting observation in itself. For example, participant 4 reported that they started out
consulting the respeaking file systematically but found that this would slow the process
down as they hunted for speech sounds that had been elided in spontaneous speech.
Participants generally settled on a routine where they would focus on the spontaneous
recording, later consulting the respeaking file with a particular view to challenging
sections of the transcript.
Participant 3 reported that the respeaking file was not thought to be necessary where
they felt the spontaneous speech was not presenting difficulty. In these cases they
admitted to not consulting the file at all. The same participant had the lowest estimated
benefit from respeaking (4.1 compared to the 5.39 mean). Transcription performance is
influenced by a wide array of competencies and other factors, including perceptual
capability, exposure to the language, theoretical knowledge and experience in
transcription skills. Furthermore, attributes of the software and of the workflow
technique also affect results. I would further suggest that a regularised transcription method that
systematically presents respoken audio, such as a side-by-side display of spontaneous
speech and respoken speech, might yield better results overall.
If we are hoping to draw inferences about the value of respeaking versus collecting other
forms of data, a brief qualitative examination of the size of the observed effect may be
helpful. On the assumption that a transcription is ‘good enough’ when a human being
looking at the transcription can work out what the words are supposed to be, scores in
the mid 70s and upwards appear to meet that criterion. Table 6.1 presents such a case
with an example from file 39 (male, normal, respeaking). Please note that word
boundaries have been inserted to assist visual comparison of these transcriptions.
Table 6.1: Qualitative view of ‘high-range’ accuracy

FILE 39: āja bhandā kamsēkama duī barṣa agāḍī ma nēpālamā h dā khērī

            IPA of transcription                                      Accuracy
Part. 1     aʌ b d komskom dui bers ʌɡaiɖi nepalma hoda kheri         87
Part. 2     azu wandab komsom dwi β s aɡ ɽi mo napana udza keri       74
Part. 3     aser bʌnda komɸekom dui bas oɡ ri mo nepalna huda kiri    78
Part. 4     ase banda komskom dui bars aɡ ri mo nepalma rouda kiri    85
Reference   asʌ d kʌmskʌm dui bʌrʂ ʌɡaɖi mʌ nepalma hoda keri         -
Subjectively I would describe the 10-point difference between the mid-70s and mid-80s
as ranging from good to excellent. Considering participants 2 and 4, the largest
contribution to the 10-point accuracy difference is participant 2’s omission of the lateral
in / nepalma / and the omission of a glottal fricative /h/ and insertion of /z/ in /hoda/.
It may be useful to take a look at what a difference of five points looks like, given that
this is the estimated improvement of respeaking. The following example in Table 6.2
presents a transcription comparison of file 3 (female, noisy, no respeaking) early on in
the data set. The reference transcription has 41 labels. Given the length of this phrase it
takes only three missed segments or over-transcribed segments (too few or too many
labels) to influence the accuracy score more than the 5.39 observed in this study.
A reminder of the accuracy metric, normalised to length (higher is better; an exact match scores 100):
Accuracy = 100 * ( 1 - PhoneticEditDistance / Length )
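As a minimal sketch of how such a normalised score can be computed, the following uses plain Levenshtein distance as a stand-in for the weighted phonetic edit distance used in this study; an identical transcription scores 100.

```python
def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def accuracy(transcript, reference):
    # Normalise the distance by the reference length, so that an
    # identical transcription scores 100.
    dist = edit_distance(transcript, reference)
    return 100 * (1 - dist / len(reference))
```

For example, comparing a transcription /keri/ against a reference /kheri/ gives one edit over five reference segments, i.e. a score of 80 on this simplified scale.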
Table 6.2: Qualitative view of ‘low-range’ accuracy

FILE 3: ra ēkdamai malāī cai ēkdamai āphnō mṛtyukō mukhabāṭa

            IPA of transcription                                      Accuracy
Part. 1     ra ektʌmi male tsei ektʌmia midtiku mukfataɡ              71
Part. 2     eɽa ɡeɡti male se eɡti aɸmu ŋiltilk moksakel              64
Part. 3     erʌ eɡdii male tsie ekdamei apnuʌ murtuko mukfata         77
Part. 4     raʔ eʈami male si iɡdi ɸnua mirdirko mufaika              69
Reference   rʌ ekdʌme malei tsi ekdami aɸnuʌ miɽtjuko mukɸaʈʌ         -
Small differences in transcription lengths (too many or too few speech sounds) were
common, but an examination of trends over the log (Appendix 2) shows that sub-70
accuracy was often the result of a combination of missed speech sounds and poor
accuracy of transcribed speech segments. Considering the example in Table 6.2,
participant 4’s transcription accuracy is five points higher than participant 2’s.
Subjectively the key difference is the final word. Participant 4’s /mufaika/ sounds closer
to the reference /mukɸaʈʌ/ than participant 2’s /moksakel/. Viewed another way, a five-point
difference can describe the difference between transcriptions where one word in
eight has a difference as large as /mufaika/ vs. /moksakel/. I would argue on a subjective
basis that this is significant.
More broadly, lower scoring transcriptions are likely to exhibit missed segments and
generally mistranscribed speech sounds. However, my intuition is that both of these
symptoms appear when participants have somehow lost track of where they were
transcribing, and that these coincided with more difficult sections of faster speech where
no respeaking file was available. As one might expect, the worst results coincided with
challenging conditions. Of the accuracy results with scores less than 60 in the dataset,
13 out of 14 (93%) were cases with no respeaking. 10 out of the same 14 (71%) were
also noise degraded cases.
5.2.2 Respeaking effect on error types
Counter to expectations, availability of respeaking had no discernible impact on errors
stemming from difficulties with aspiration, identification of vowels or retroflex
consonants. Given that Nepali clear speech was shown to possess an expanded vowel
space, we might have expected that the respeaking file would have helped with
categorising the continuum of vowel realisations. This was not found to be the case.
Participants frequently mistook vowels, and not just the more difficult /a/ and /ʌ/ vowels.
Given that vowel realisation exists on a two-dimensional height/front-back continuum,
difficulty in classification of vowels in an impressionistic transcription is to be expected.
In contrast with consonants, participants did not appear to compensate for this inherent
difficulty by spending additional time on correctly identifying vowels. If in doubt,
participants were free to consult the formant frequencies in Praat.
There was, however, a reduction in the number of labels missed or speech sounds not
transcribed. Participant transcriptions showed a mean difference in length with the
reference transcription of 2.4 (SD=2.3). An ANOVA of transcription length errors
against respeaking was significant, F(1,395) = 5.05, p=0.025. Inspecting the data
manually reveals that low accuracy scores (<70) were often the result of a significant
number of missed speech sounds. This supports the observation that the reduced
elision of speech sounds in clear speech is particularly helpful in identifying
speech sounds which are more difficult to spot in spontaneous speech.
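The test reported above can be made concrete with a small sketch. For a single two-level factor such as respeaking, the one-way ANOVA F statistic is the ratio of between-group to within-group variance; the function below is a generic illustration with made-up numbers, not the study data (where F(1,395) = 5.05 was observed).

```python
import numpy as np

def one_way_anova_F(group_a, group_b):
    # F statistic for a one-way ANOVA with two groups:
    # between-group mean square over within-group mean square.
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    grand = np.concatenate([a, b]).mean()
    ss_between = len(a) * (a.mean() - grand) ** 2 + len(b) * (b.mean() - grand) ** 2
    ss_within = ((a - a.mean()) ** 2).sum() + ((b - b.mean()) ** 2).sum()
    df_between = 1                    # two groups: k - 1 = 1
    df_within = len(a) + len(b) - 2   # N - k
    return (ss_between / df_between) / (ss_within / df_within)
```

A large F means the difference between condition means is large relative to the scatter within each condition; the p-value then follows from the F distribution with the stated degrees of freedom.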
The difference between the sound systems of Nepali and English turned out to be a
reasonable predictor of the common errors encountered. However the form of
transcription error that was most improved by respeaking was the more generalised
tendency to miss speech sounds. This also sounds a note of caution in that the reference
transcription was undertaken with consultation of the orthographic transcript and access
to the respoken audio in all cases (and not just 50% for participants). The reference
transcript may therefore err on the side of introducing speech sounds which are present
in clear speech and perhaps not in spontaneous speech. There was a conscious attempt
to avoid this phenomenon but this bias may not have been totally eliminated.
5.2.3 Contribution of clear speech vs. noise
One of the more surprising findings is that transcription participants had remarkably
little difficulty in transcribing the degraded noisy files. A two-way ANOVA estimated
an effect size of around -2.1 accuracy points for the noise-degraded files (p=0.016). Clearly a
lower SNR than 9dB would show a stronger result, but we should remember that the
goal was to assess the impact of noise in a realistic fieldwork
environment. A field linguist who produced an audio recording with an SNR of 9dB is
really not trying very hard to maximize audio quality. Furthermore, I doubt it is even
possible to approach this level of SNR when using smartphones as recording devices in
the field. This is because the microphone is very close to the speaker’s mouth,
which allows the input gain to be reduced, so that other nearby noise sources
have a relative difference in sound pressure level (SPL) much greater than 9dB. This is
also why vocalists’ microphones generally don’t result in feedback, and why
documentary producers favour lapel microphones attached to speakers instead of
camera mounted microphones. In fact, given the observations of the effect of noise here,
I would go so far as to say that noise concerns for transcription accuracy would be
dwarfed by other factors when using close microphone techniques.
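For reference, the SNR figure discussed above is a ratio of RMS amplitudes expressed in decibels. The helper below is an illustrative calculation only, not the measurement procedure used in this study; on this scale a 9dB SNR corresponds to the signal RMS being roughly 2.8 times the noise RMS.

```python
import numpy as np

def snr_db(signal, noise):
    # Signal-to-noise ratio in decibels, computed from RMS amplitudes.
    def rms(x):
        return np.sqrt(np.mean(np.asarray(x, dtype=float) ** 2))
    return 20 * np.log10(rms(signal) / rms(noise))
```

For example, a signal whose RMS is ten times that of the noise floor has an SNR of 20dB, comfortably above the degraded 9dB condition used here.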
Smartphone audio recording can be excellent but throughout the course of this study,
having used multiple devices of different make and model, it’s clear that audio quality
varies considerably between devices. The HTC Desire C used by the female speaker
clipped much of the recording8. Clipping, where audio levels are overdriven and digital
recording devices ‘clip’ at maximum positive and negative values, is a particularly
destructive form of distortion. Clipping introduces a large amount of harmonics which
make speech much less intelligible and the resulting audio is rather unpleasant to listen
to as well. Bafflingly, the HTC also produced a DC offset at the beginning of every
recording sequence. While this was easy enough to remove in post-processing, it is
reflective of the lack of care that some manufacturers take in the design of their
products.
8. Fortunately a Zoom H4n was used in parallel for just such an eventuality. In this way the useful
metadata recorded on the mobile phone was paired with a high-quality audio recording in this study.
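Both artifacts described above are straightforward to check for programmatically. The sketch below is illustrative only (it assumes 16-bit integer samples and was not part of this study’s processing chain): it flags likely clipping by counting samples at or near the converter’s rails, and removes a constant DC offset by re-centring the waveform.

```python
import numpy as np

def clipped_fraction(samples, full_scale=32767, threshold=0.999):
    # Fraction of samples at (or virtually at) the converter's rails,
    # a rough indicator of clipping in a 16-bit recording.
    s = np.abs(np.asarray(samples, dtype=float))
    return float(np.mean(s >= threshold * full_scale))

def remove_dc_offset(samples):
    # Subtract the mean so the waveform is centred on zero.
    s = np.asarray(samples, dtype=float)
    return s - s.mean()
```

A recording with any appreciable clipped fraction has lost information that no post-processing can restore, whereas a DC offset, as noted above, is trivially reversible.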
The high-end Samsung Galaxy Nexus fared much better but still clipped the audio when
the male speaker raised his voice in energetic segments of the narrative. The default
behaviour of smartphones is to aggressively maximize the audio signal without much
care for the danger of clipping. For these devices to contribute useful audio, the levels
will need to be calibrated by the software or set manually. This doesn’t seem
unreasonable, since one would not expect a professional recorder such as the Zoom H4n to
magically take care of the levels without control by the operator. The Aikuma
smartphone application (Bird & Hanke, 2013) used for this study has since improved
control over audio levels. At the smartphone operating system level, the situation is
improving also. As of this study, the latest version of the Android operating
system, 4.4 (Android KitKat, 2013), introduces new audio features including the
addition of a dynamic range compressor specifically for speech. While this only applies
to playback, this is nevertheless helpful because it enables recording software to set the
recording level much lower and not be concerned with low playback volume during the
respeaking stage. As should be apparent from this account, this area is rapidly
developing and has not yet reached an optimal solution for linguistic fieldwork.
Finally, on the point of noise, I would note that there is an air of audiophilia
surrounding the topic of audio quality in linguistic fieldwork. This study was very much
a worst-case scenario in terms of performance in transcription given that participants
knew very little of the language they were transcribing. Nevertheless results were
generally good and continued to be so in situations where the acoustic signal was so
degraded with noise that spectrograms were rendered almost useless. While I am not
suggesting that we should not strive for high-quality recordings, a preoccupation with
recording device audio quality does not seem justified when other factors such as
microphone placement are far more influential.
5.3 Limitations of Study
The method of this study departed from the guiding ‘future philologist’ scenario in one
important aspect. Any person that embarks upon the considerable labour of transcribing
a large volume of natural speech will not be doing so in a language naïve fashion.
Rather they will be sufficiently engaged in the endeavour to be learning the language as
they go. At the very least, transcription would be significantly enhanced with reference
to a lexicon, which must exist in any basic language documentation. This stands in stark
contrast to the repeated errors of the same kind made by participants in this study, who
were neither working with a lexicon nor seeking to learn the language.
Conversely, a transcriber with rudimentary Nepali language skills would scarcely make
the same mistakes, nor would they presumably take 14-20 hours to transcribe ten
minutes of speech. However, this study does reflect some important issues arising out of
the transcription of speech sounds under varying conditions.
As far as assessing the impact of respeaking, I have not considered the value of
respeaking in forming the evolving analysis of a phonemic inventory and morphological
analysis. During an early poster presentation of this research, one senior linguist
remarked that respeaking was particularly useful early on in the exploration of a
language as consultants tended to ‘spell out’ morphological affixes. Respeaking has
other undeniably useful properties. Himmelmann (2006b) noted that segmenting spoken
language is particularly challenging and that the primary source of information is that of
native speaker intuition. A key property of clear speech is that word boundaries tend to
reappear where there were apparently none in connected speech. This aspect of
respeaking is not conveyed in the measures of accuracy on which this study is based.
The wide variation in transcriptions between study participants should also be noted. If
four linguists were transcribing the same corpus in a realistic environment they would
undoubtedly collaborate. As a result you would expect conventions to emerge and
tactics to converge so that there would be far less variation than observed between the
participants of this study. One example might be how to consistently approach breathy-voiced
vowels as potential evidence of elided glottal fricatives (/h/), or the recognition
of what duration of aspiration constitutes a phonemic contrast and what does not.
With this in mind, it is no stretch of the imagination to describe the general error rate
observed in this study as artificially high. This has ramifications for how one considers
the size of the effects described, which could reasonably be expected to be much
smaller where transcription conditions were more realistic.
5.4 Transcription differences: errors or choices?
The transcription comparison provided earlier in Table 6.2 highlights an added
dimensionality in transcription ‘errors’. The reference transcription begins with ra, a
discourse marker in Nepali that often appears at the start of phrases. Two
participants wrote /era/ on the basis of a vowel-like phonation prior to the /r/ and one
participant indicated a glottal stop after the ra. In all cases these speech sounds were
present in the recording. Neither of these transcriptions is an error in the sense that the
participants misidentified a sound; they certainly aren’t perceptual errors. These
examples show that we should treat them not as errors but as differences which
may or may not be errors. In a sense the ‘error’ lies with the reference transcription for
lacking this detail but the reference transcription was informed by linguistic knowledge
and those speech sounds were judged to be not phonemically relevant.
The participants in this study had no lexical access, so they could not have known if ‘era’
was a word in this context or not. Continued exposure to Nepali resulted in participants
developing ideas of what was relevant and not relevant. The example discussed
occurred at the beginning of the data set (file 3) but with considerable exposure to /ra/
throughout the data set, helpfully in word initial position, variation in transcribing /ra/
rapidly evaporated between participants.
The recognition of lexical items cuts both ways. In Section 5.2, the analysis of common
errors mentioned the Nepali word thijo. Cited as a high frequency lexeme in the
discourse of both speakers, this word presents a strategy problem for transcribers in how
to deal with the vowel and glide, and reduced forms that may end up being realised as a
diphthong or just a plain vowel. Participants often made the interpretive choice to
transcribe /thijo/ even in cases where a reduced form appeared. Arguably this is a good
thing, presuming that our ultimate goal is to recognise lexical items. However, they
would also shoe-horn thiyo into similar-sounding words such as tyo, without the aspirated
/tʰ/, and even further afield into words like tyasto.
This phenomenon is well described in the psycholinguistics literature (Saporta, 1961).
It’s long been observed that listeners tend not to hear ‘mispronunciations’ (Cole, 1973)
of lexical items. Even with a vocabulary of fewer than ten words and with no semantic
comprehension, participants in this experiment began to show signs of categorising
similar-sounding words into the words they recognised. Gradually the hard slog of
enacting their linguistic training, analysing spectrograms and so on, began to be
supplanted by the more natural process of engaging their own faculties of language.
In the context of longer term projects undertaken by future philologists on language
documentation archives, we might expect these forms of errors to become more
dominant than they were in the limited scope of this study.
6. Conclusion
This thesis addressed research questions relating to the effect of the availability of
respeaking on transcription accuracy. The method was guided by assumptions informed by the needs
of future philologists as they transcribe the output of documentary projects in the
present day. Under these conditions, respeaking was shown to have a significant benefit
on transcription accuracy. The benefit of removing noise from the recording as part of
the respeaking method was isolated as a contributing factor with an effect size less than
half that of clear speech. This broadly suggests that respeaking is a valuable component
in linguistic fieldwork, even where natural language recordings are made under
favourable recording conditions.
This research backs previous findings in the clear speech literature, where the artificial
elicitation of clear speech results in wide variation in its properties.
However in the context of respeaking in this work, clear speech did not show benefits in
transcription that scaled with the degree of clear speech. The strongest observation was
that participants omitted fewer speech sounds when they consulted with a respoken
version. I submit that a reasonable explanation for the lack of a correlation between
accuracy and the rate or degree of clear speech can be found in the unregulated nature of
the transcription activity. Participants manually consulted the respeaking file when they
found it most necessary, but exactly when and why is not clear; for example, there is
little evidence that respeaking was consulted to refine judgements on vowel categories.
This highlights the need for a careful observational study with attention to participants’
behaviour during the transcription process. The large inter-participant variation in
transcriptions observed in this study could also be addressed by incorporating realistic
consultation on transcription strategy. For example, by allowing participants to consult
on what constitutes contrastive aspiration in Nepali, or choosing an appropriate symbol
to represent a reduced retroflex obstruent etc. There is also a need for research into fresh
methods and tools to assist linguistic transcription by providing the ability to rapidly
and systematically contrast speech sounds and lexical items so as to facilitate an
evolving understanding of the language being transcribed.
I would also argue that the lack of a correlation between respeaking rate/degree and
accuracy in this study does not invalidate the need to regulate the production of clear speech as a
linguistic field method. Further research is needed to refine methods and possibly tools
towards this end. One could imagine, for example, that the Aikuma software application
could measure approximate speaking rate and provide feedback to the language
consultant if they should begin speaking too fast.
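As a toy sketch of how such feedback might work (this is speculative and not a feature of Aikuma; the thresholds and frame sizes are invented): count peaks in a short-time energy envelope as a crude proxy for syllable rate, and compare the result against a target rate.

```python
import numpy as np

def approx_speech_rate(samples, sr, frame_ms=25, hop_ms=10):
    # Very crude syllable-rate proxy: count local peaks in the short-time
    # energy envelope. A real implementation would need smoothing and
    # tuning; this is for illustration only.
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    s = np.asarray(samples, dtype=float)
    energy = np.array([np.sum(s[i:i + frame] ** 2)
                       for i in range(0, max(1, len(s) - frame), hop)])
    thresh = 0.5 * energy.max()
    # A peak is a frame above threshold whose neighbours are no higher.
    peaks = sum(1 for i in range(1, len(energy) - 1)
                if energy[i] >= thresh
                and energy[i] > energy[i - 1]
                and energy[i] >= energy[i + 1])
    return peaks / (len(s) / sr)   # approximate peaks per second
```

An application could compute this over a sliding window during respeaking and prompt the consultant whenever the estimate drifts above a target rate, providing exactly the kind of regulation suggested above.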
Given the rapid pace of change in this field, research should continue to engage with
new methods and tools, as demonstrated in this study. Studies of this nature allow us to
evaluate their impact on the results of language documentation projects and thereby to
inform the development of the next generation of methods and tools. There is much
more to do in the development of truly scalable methods in language documentation to
meet the urgent challenge of language loss.
References
Abercrombie, D. (1964). English phonetic texts. London: Faber and Faber.
Abney, S. & Bird, S. (2010). The Human Language Project: building a universal corpus
of the world's languages. Proceedings of the 48th Annual Meeting of the
Association for Computational Linguistics. Association for Computational
Linguistics, 88–97.
Android KitKat. (2013). Retrieved November, 1, 2013 from Android Developer Website
(n.d.), http://developer.android.com/about/versions/kitkat.html
Becker-Kristal, R. (2010). Acoustic typology of vowel inventories and Dispersion
Theory: Insights from a large cross-linguistic corpus. (Ph.D. thesis). University
of California, Los Angeles.
Bird, S. & Hanke F. (2013) Collaborative language documentation with networked
smartphones. ALS 2013 conference, University of Melbourne.
Bird, S., Loper, E. & Klein, E. (2009). Natural Language Processing with Python.
O'Reilly Media Inc.
Bird, S., & Simons, G. (2003). Seven dimensions of portability for language
documentation and description. Language, 79(3), 557–82.
Bloomfield, L. (1933). Language. New York: Henry Holt.
Boersma, P., & Weenink, D. (2010). Praat (Version 5.3.56). Amsterdam: Authors.
Boerger, B. (2011). To boldly go where no one has gone before. Language
Documentation & Conservation, 5, 208–33.
Bouquiaux, L. & Thomas, J. M. C. (Eds.) (1992). Studying and describing unwritten
languages (J. Roberts, Trans.) Summer Institute of Linguistics. (Original work
published 1973).
Boyle, J. (1984). Factors affecting listening comprehension, ELT Journal, 38, 34–38
Bradlow, A. R., & Bent, T. (2002). The clear speech effect for non-native listeners. The
Journal of the Acoustical Society of America, 112(1), 272.
Brown, H.D. (2007). Principles of Language Learning and Teaching. White Plains, NY:
Pearson Education, Inc.
Brungart, D. S., Simpson, B. D., Ericson, M. A. & Scott, K. R. (2001). Informational
and energetic masking effects in the perception of multiple simultaneous talkers,
The Journal of the Acoustical Society of America, 110(1), 2527–38.
Bybee, J. (1999). Usage-based phonology. In Darnell, M. (Ed.) Functionalism and
formalism in linguistics (pp. 211–242), John Benjamins.
Chomsky, N., & Halle, M. (1968). The sound pattern of English. New York:
Harper & Row.
Cole, R.A. (1973). Listening for mispronunciations: A measure of what we hear during
speech. Perception and Psychophysics, 13, 153–6.
Cutler, A., & Butterfield, S. (1990). Durational cues to word boundaries in clear speech.
Speech Communication, 9(5), 485–95.
Dalby, J. (1986). Phonetic Structure of Fast Speech in American English. Bloomington,
IN: Indiana University Linguistics Club.
Ernestus, M., Baayen, H., & Schreuder, R. (2002). The recognition of reduced word
forms. Brain and Language, 81(1), 162–73.
Evans, N. (2008). Review of Gippert, J., Himmelman N. & Mosel (Eds). (2006).
Essentials of language documentation, Language Documentation &
Conservation, 2 (2), 340–50.
Evans, N. (2009). Dying words: Endangered languages and what they have to tell us.
Wiley-Blackwell.
Evans, N., & Levinson, S. C. (2009). The myth of language universals: Language
diversity and its importance for cognitive science. Behavioral and Brain
Sciences, 32(05), 429–448.
Fraser, H. (2003) Issues in transcription: factors affecting the reliability of transcripts as
evidence in legal cases. Forensic Linguistics, 10(2), 201–26.
Gildea, D., Jurafsky, D., (1996). Learning Bias and Phonological-Rule Induction.
Computational Linguistics, 22, 497–530.
Greenberg, S. (1999). Speaking in shorthand–A syllable-centric perspective for
understanding pronunciation variation. Speech Communication, 29(2), 159-176.
Hale, K. (1992) On endangered languages and the importance of linguistic
diversity. Endangered languages, Language 68 (1), 35–42.
Himmelmann, N. P. (1998). Documentary and Descriptive Linguistics. Linguistics,
36, 161–195.
Himmelmann, N. P. (2006a). Language documentation: What is it and what is it good
for? In Gippert, J., Himmelmann, N. P., & Mosel, U. (Eds.) Essentials of
language documentation (pp. 1–30), Walter de Gruyter.
Himmelmann, N. P. (2006b). The challenges of segmenting spoken language. In
Gippert, J., Himmelmann, N. P., & Mosel, U. (Eds.) Essentials of language
documentation (pp. 253–274), Walter de Gruyter.
Himmelmann, N. P. (2011). Linguistic Data Types and the Interface between
Language Documentation and Description. Language Documentation &
Conservation, 6, 187–207.
Hintze, J. L. & R. D. Nelson (1998). Violin plots: a box plot-density trace synergism.
The American Statistician, 52(2), 181–4.
International Phonetic Association. (1949). The Principles of the International Phonetic
Association.
ISO/IEC, (2001). ISO 15919:2001 Information and documentation -- Transliteration of
Devanagari and related Indic scripts into Latin characters. Geneva, Switzerland:
ISO/IEC.
Jakobson, R., Fant, G., & Halle, M. (1951). Preliminaries to speech analysis. The
distinctive features and their correlates. Technical Report No. 13, Acoustics
Laboratory, M.I.T. Cambridge, Massachusetts.
Kalikow, D. N., Stevens, K. N., & Elliott, L. L. (1977). Development of a test of speech
intelligibility in noise using sentence materials with controlled word
predictability. The Journal of the Acoustical Society of America, 61, 1337–51.
Kempton, T., & Moore, R. K. (2013). Discovering the phoneme inventory of an
unwritten language: A machine-assisted approach. Speech Communication, 56,
152–166.
Krause, J. C., & Braida, L. D. (2002). Investigating alternative forms of clear speech:
The effects of speaking rate and speaking mode on intelligibility. Journal of the
Acoustical Society of America, 112, 2165–2172.
Krause, J. C. & Braida, L. D. (2004). Acoustic properties of naturally produced clear
speech at normal speaking rates. Journal of the Acoustical Society of America
115 (1), 362–378.
Krauss, M. (1992). The world's languages in crisis. Language, 68(1), 4–10.
Labov, W. (1972). Language in the inner city: Studies in the Black English vernacular
(Vol. 3). University of Pennsylvania Press.
Liberman, M. (2006) The problems of scale in language documentation. Computational
Linguistics for Less-Studied Languages workshop, Texas Linguistics Society.
Lindblom, B. (1990). Explaining phonetic variation: a sketch of the H&H theory. In
Hardcastle, W. J. & Marchal, A. (Eds.) Speech production and speech modelling (pp.
403–439), Amsterdam, The Netherlands, Kluwer Academic.
Liu, S. & Zeng, F-G. (2006). Temporal properties in clear speech perception. Journal of
the Acoustical Society of America, 120(1), 424–32.
Luce, P.A. (1986) Neighbourhoods of Words in the Mental Lexicon, PhD thesis,
Department of Psychology, Indiana University.
Miller, G.A. and Nicely, P.E. (1955) An analysis of perceptual confusions among some
English consonants, Journal of the Acoustical Society of America, 27, 338–52.
Moon, S.J. & Lindblom, B. (1989). Formant undershoot in clear and citation-form
speech: A second progress report. STL-QPSR, 1, Dept. of Speech
Communication, RIT, Stockholm, 121–123.
Moon, S.J. & Lindblom. B. (1994). Interaction between duration, context, and speaking
style in English stressed vowels. Journal of the Acoustical Society of America,
96, 40–55.
Newman, P. (1998). We have seen the enemy and it is us: the endangered languages
issue as a hopeless cause. Studies in the Linguistic Sciences, 28(2), 11–20.
Picheny, M. A., Durlach, N. I. & Braida, L. D. (1985). Speaking clearly for the hard of
hearing I: Intelligibility differences between clear and conversational speech.
Journal of Speech and Hearing Research, 28, 96–103.
Pike, K. L. (1943). Phonetics: A critical analysis of phonetic theory and a technic for
the practical description of sounds. University of Michigan publications,
Language and literature, 21. Ann Arbor: University of Michigan Press.
70
Pokharel, M. P. (1989). Experimental analysis of Nepali sound system (Ph.D. thesis),
University of Pune, India.
Pruitt, J., Akahane-Yamada, R. & Strange, W. (1998) Perceptual assimilation of Hindi
dental and retroflex consonants by native speakers of Japanese and English. The
Journal of the Acoustical Society of America, 103 (5), 3091.
R Development Core Team (2006). R: A Language and Environment for Statistical
Computing. Vienna: R Foundation for Statistical Computing. Available from
http://www.R-project.org.
Khatiwada, R. (2009). Nepali. Journal of the International Phonetic Association, 39,
373–80.
Roach, P. (1987). Rethinking phonetic taxonomy. Transactions of the Philological
Society, 24–37.
Saporta, S. (ed) (1961). Psycholinguistics: A Book of Readings, New York: Holt,
Rinehart and Winston.
Schum, D. J. (1996). Intelligibility of clear and conversational speech of young and
elderly talkers. Journal of the American Academy of Audiology, 7(3), 212–8.
Smiljanić, R., & Bradlow, A. R. (2008). Temporal organization of English clear and
plain speech. Journal of the Acoustical Society of America, 124(5), 3171–82.
Smiljanić, R., & Bradlow, A. R. (2009). Speaking and Hearing Clearly: Talker and
Listener Factors in Speaking Style Changes. Language and linguistics compass,
3(1), 236–264.
Tinkler, T. (1980) Learning to teach listening comprehension, ELT Journal, 35 (1),
28–35.
Van Engen, K. J., & Bradlow, A. R. (2007). Sentence recognition in native- and
foreign-language multi-talker background noise. The Journal of the Acoustical
Society of America, 121(1), 519.
Wassink, A., Wright, R. & Franklin, A. (2006). Intraspeaker variability in vowel
production: an investigation of motherese, hyperspeech, and Lombard speech in
Jamaican speakers. Journal of Phonetics, 35, 363–79.
Werker, J. F., Gilbert, J. H. V., Humphrey, K. & Tees, R. C. (1981). Developmental
aspects of cross-language speech perception. Child Development, 52, 349–53.
71
Woodbury, A. C. (2003). Defining language documentation. In Peter K. Austin (Ed.),
Language Documentation and Description (Vol. 1, pp. 35–51). London: SOAS.
72
Appendices
Appendix 1: Transcriber Instructions
Nepali Speech Transcription Experiment instructions and documents
Thank you for agreeing to take part in this experiment! This document offers some
background for this experiment as well as instructions and a small amount of reference
material. You can keep this with you to refer to during the experiment.
The thought experiment
This experiment should be understood in the context of a wider thought experiment. In
this scenario you are a linguist in the future working on a language that is no longer
spoken. However, we do have recorded material for the language, and previous field
linguists have worked on many of the details. We have a working hypothesis of a
phonological inventory and some field notes on the language, including some early
thoughts on phonotactics and morphology.
Our goal is to have a written record of “what was said” in these recordings. Since there
are many unfamiliar words, you can only transcribe what you can hear. However, using
your knowledge of phonological processes, you might be able to work out what the
phonemes are supposed to be even if the surface forms in connected speech are very
different.
The general theme of this scenario is that we are aiming to speed up language
documentation. We are not interested in high levels of phonetic detail because this takes
too long. The next person to look at your data will be working out what the words are,
so it might be useful to record phones beyond the phonemic inventory.
What you will be doing
Your goal throughout is to provide the most accurate transcription you can. In practice,
as with most transcription, you will get faster after an initial training period.
Fortunately, for some of the recordings there are two versions, i.e. two files. The second
file is a ‘careful speech’ version which you can refer to. This may ‘undo’ some of the
pesky connected-speech processes that make it difficult for you to transcribe the normal
recording. There may be other reasons that it is useful. Remember that we want a
transcription of “what was said”: even if you have a full form of a word in the careful
speech version, and some elision in the casual speech, you should not insert the elided
segments. And while we call this a phonemic transcription, some extra detail such as
breathy voice, nasalisation or fricatives could be very useful for the next person to
build up a lexicon.
In this experiment the language you will be transcribing is Nepali, an Indo-European
language spoken natively by around 20 million people in Nepal and India. Within the
data folder there are a large number of wav files. They are numbered in order, e.g.:
73
1_normal.wav
2_normal.wav
Each wav file represents a phrase spoken by a male or a female language consultant.
They can sometimes be very short while other times they are more like a full sentence.
You should open these files in Praat and create a basic phonemic transcription tier.
Transcribe the phrase as best you can and save the Praat TextGrid back to the same
directory. We have provided a chart of the Nepali phonemic inventory with the IPA on
top and X-SAMPA below. You should type the X-SAMPA symbols provided.
For some of the phrases you will see two files instead of one like so: 3_normal.wav,
3_respeaking.wav
In these cases you have access to a re-spoken version in so-called ‘clear speech’. You
can think of this as the kind of speech you might use when speaking to a deaf
grandmother; in fact, that is how it was explained to the language consultants. How you
use this extra file is entirely up to you. You can load it into Praat as well, or you might
just play it (from Praat’s object window, say). You might find it is worth loading into
Praat while you are getting up to speed, and later on just play it unless something
interesting appears.
We still only want you to transcribe what is in the x_normal.wav file. In particular you
should not insert speech sounds that are present in the respeaking version if they are not
in the normal version. You might like to record something relevant, like breathy voice
on a vowel cluster, as a hint to the next linguist that there is probably an elided [h]
there.
Step-by-step recap
1. Open one or more files from the data folder into Praat’s object window. You
might like to do ten at a time, since the object window gets unwieldy with many.
2. Select Annotate for the x_normal.wav file and specify a single tier called Phonemic.
3. Then select both the file and the annotation and click View & Edit.
4. Transcribe the phrase using the X-SAMPA symbols provided.
5. Optional: open x_respeaking.wav into Praat where available, or play it from the
object window.
6. Save the Praat TextGrid as x.TextGrid.
7. Move on to the next file and repeat.
About Nepali
Nepali has a full range of voiced and voiceless stops at bilabial, dental, alveolar and
retroflex places of articulation, with a distinctive unaspirated/aspirated contrast.
However, aspirated voiced stops are actually realised with breathy or murmured voice.
For speakers of English, the breathy voice and the retroflex ʈ and ɖ will be distinctive.
The most prominent cue observed for the aspirates, voiceless as well as voiced, is the
appearance of breathy or muffled voicing and lowered F0 on the following vowel.
74
Nepali has geminate consonants that are distinctive from single consonants.
The following are some phonotactic features that have been noted:
Intervocalic h deletion: /mʌɦina/ -> [mʌina] or [mʌːna]
Voiced retroflex [ɖ] after vowels is often realised as a retroflex flap [ɽ]. You can of
course get both, as in: /pʌɦaɖi/ -> [paɦaɽi] -> [paːɽi]. However this does not happen
for geminates, so /ʌɖa/ ‘stop’ -> [ʌɽa] but /ʌɖɖa/ ‘office’ -> [ʌɖɖa].
Note: A really good way of spotting retroflex consonants is that the F3 formant comes
close to F2, most visibly in adjacent vowels.
In general Nepali has lost the vowel length distinction, so long vowels point to either a
vowel cluster (it can be hard to tell what a syllable is) or the leftover result of elision as
in the example above. It has also been observed that word-final vowels can run into the
same vowel at the start of the following word. Careful speech will often reveal word
boundaries.
There is loss of contrast on b, d, g, m, ŋ after nasalised vowels. In spontaneous speech,
the voiced breathy/aspirated stops lose their aspiration intervocalically and word-finally.
Lenition: /sʌpʰa/ -> [sʌɸa]
/r/ is a tap intervocalically but a trill elsewhere.
Word-initial clusters limit the second consonant to a rhotic or a glide, as in /prʌdʰan/
‘chief’ and /pual/ ‘hole’ -> [pwal] and /piadz/ ‘onion’ -> [pjadz].
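Illustrative only: the first two processes above can be made concrete as context-sensitive string rewrites over IPA strings. This is a sketch under the assumption that a naive regular-expression rewrite is adequate (the names h_deletion and flapping are ours, not part of the experiment materials):

```python
import re

VOWELS = "aeiouʌ"  # simplified vowel set, for illustration only

def h_deletion(form: str) -> str:
    # Intervocalic /ɦ/ deletion: /mʌɦina/ -> [mʌina]
    return re.sub(f"(?<=[{VOWELS}])ɦ(?=[{VOWELS}])", "", form)

def flapping(form: str) -> str:
    # Post-vocalic singleton /ɖ/ -> [ɽ]. The geminate /ɖɖ/ is untouched:
    # its first ɖ is followed by ɖ, and its second is not preceded by a vowel.
    return re.sub(f"(?<=[{VOWELS}])ɖ(?!ɖ)", "ɽ", form)
```

For example, flapping("ʌɖa") gives "ʌɽa" while flapping("ʌɖɖa") leaves the geminate intact, matching the ‘stop’/‘office’ pair above.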
Some example Nepali words

[ma] / [hami]    I / we
[mʌɦina]         Month
[tʰiyo]          Often phrase final in statements which locate something, say whether it existed etc.
[bʰayo]          3rd sg. past of [hunu]; lit. it has become. Also interrogative, “enough!”
[nau]            Nine
[patʃʰi]         Postpositional “After”
[dekʰi]          Postpositional “From”
[euʈa]           Adjective: one (thing)
[kaʈʰʌmaɖa ]     Kathmandu, capital of Nepal
[pʌni]           Also; [pʌni pʌni] adv. as well as
[pʰarkinu]       To come back (citation form)
[tsʰeu]          Edge
Notes on morphology
In our scenario we are trying to work out the morphology as we go, but we have some
ideas, and these are helpful to correctly identify some of the commonly recurring speech
75
sounds. We do know that Nepali has extensive case marking that appears as
agglutinating suffixes.
There are two types of nouns, o-final and non-o-final. O-final noun stems change to
indicate morphological features such as number, gender, form and diminutive. Nepali
nouns are either singular or plural. The unmarked citation form is singular, while the
plural changes o-final noun finals to a-final instead. There is also a postpositional
indicator of plurality, -ɦʌruː. E.g. singular ‘son’ /tsʰoro/, plural ‘sons’ /tsʰora/ or
/tsʰora-ɦʌru/; singular ‘house’ /ɡʰʌr/, plural ‘houses’ /ɡʰʌrɦʌruː/.
Gender is limited to masculine and feminine. Human nouns see grammatical agreement
on the verb. Morphological gender changes the citation form to iː-final. Some word
suffixes that have been observed:
-le      ergative/instrumental case marker
-lai     object marker
-nu      infinitive
-eko     perfect constructions, e.g. [garne] will do, [gareko] did
-ma      locative: in, at, on etc.
-haru    pluralising suffix
76
Nepali Phonology – IPA
Consonants

                   Bilabial   Dental    Alveolar    Retroflex    Palatal   Velar    Glottal
Nasal              m                    n                                  ŋ
Stop (plain)       p  b       t  d      ts  dz      ʈ  ɖ                   k  ɡ
Stop (aspirated)   pʰ bʰ      tʰ dʰ     tsʰ dzʰ     ʈʰ ɖʰ                  kʰ ɡʰ
Fricative                               s                                           ɦ
Rhotic                                  r
Approximant        (w)                  l           (j)
Note: Voiced aspirated consonants are usually realised as breathy-voiced aspirated.
Vowels

            Front    Central    Back
High        i ĩ                 u ũ
Close-mid   e ẽ                 o
Open-mid             ʌ ʌ̃
Open                 a ã

Diphthongs: /ui/ /iu/ /ei/ /eu/ /oi/ /ou/ /ʌi/ /ʌu/ /ai/ /au/

Nepali Phonology – X-SAMPA
*** THIS IS WHAT YOU SHOULD TYPE! ***
Consonants

                   Bilabial   Dental    Alveolar    Retroflex    Palatal   Velar    Glottal
Nasal              m                    n                                  N
Stop (plain)       p  b       t  d      ts  dz      t`  d`                 k  g
Stop (aspirated)   p_h b_h    t_h d_h   ts_h dz_h   t`_h d`_h              k_h g_h
Fricative                               s                                           h
Rhotic                                  r
Approximant        (w)                  l           (j)
77
Vowels

            Front    Central    Back
High        i i~                u u~
Close-mid   e e~                o
Open-mid             V V~
Open                 a a~

Diphthongs: /ui/ /iu/ /ei/ /eu/ /oi/ /ou/ /Vi/ /Vu/ /ai/ /au/
Other useful X-SAMPA symbols:
[ɽ] retroflex tap = r`
[ɸ] bilabial fricative = p\
[a̤] breathy voice = a_t
[β] voiced bilabial fricative = B
78
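For checking or display, X-SAMPA typed by transcribers can be mapped back to IPA by longest-match-first substitution, so that e.g. "ts_h" is not misread as "t" + "s" + "_h". This is a minimal sketch covering only the subset of symbols from the charts above (the function name and mapping table are illustrative, not part of the experiment materials):

```python
# Subset of the appendix charts; multi-character symbols must match first.
XSAMPA_TO_IPA = {
    "p_h": "pʰ", "b_h": "bʰ", "t_h": "tʰ", "d_h": "dʰ",
    "ts_h": "tsʰ", "dz_h": "dzʰ", "t`_h": "ʈʰ", "d`_h": "ɖʰ",
    "k_h": "kʰ", "g_h": "ɡʰ", "t`": "ʈ", "d`": "ɖ",
    "r`": "ɽ", "p\\": "ɸ", "B": "β", "N": "ŋ", "V": "ʌ",
    "i~": "ĩ", "u~": "ũ", "e~": "ẽ", "a~": "ã", "V~": "ʌ̃", "a_t": "a̤",
}

def xsampa_to_ipa(text: str) -> str:
    keys = sorted(XSAMPA_TO_IPA, key=len, reverse=True)  # longest first
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(XSAMPA_TO_IPA[k])
                i += len(k)
                break
        else:  # no symbol matched: copy the character through unchanged
            out.append(text[i])
            i += 1
    return "".join(out)
```

Symbols not in the table (plain t, r, s and so on) are identical in both notations and pass through unchanged.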
Appendix 2: Transcription Log
These pages are the output of the automated transcription accuracy system used in this
study. It also serves as a record of the files and experimental conditions. The X-SAMPA
symbols used in transcription have been converted to IPA to facilitate comparison. The
log is in the following format:
<file number> (male/female, normal/noisy, respeaking/norespeaking):
ISO 15919: <Romanization of Devanagari>
1: <Transcription from participant 1>
2: <Transcription from participant 2>
3: <Transcription from participant 3>
4: <Transcription from participant 4> (Files 1-49 inclusive only)
Acc. 1: <Accuracy of participant 1>, Acc. 2: <Accuracy of participant 2>, etc.
Acc. Total Av: <Total mean accuracy of participants>
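The accuracy metric itself is described in the thesis body; purely to illustrate how a percent accuracy between a transcription and a reference string can be derived, here is a standard Levenshtein-based similarity (an assumption for exposition, not necessarily the exact metric used in this study):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein distance via dynamic programming over two rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[len(b)]

def percent_accuracy(transcription: str, reference: str) -> int:
    # 100 minus the edit distance normalised by the longer string, floored at 0.
    longest = max(len(transcription), len(reference))
    if longest == 0:
        return 100
    return max(0, round(100 * (1 - edit_distance(transcription, reference) / longest)))
```

For example, percent_accuracy("ab", "abcd") gives 50, since two of four reference characters must be inserted.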
1 (female,normal,respeaking):
ISO 15919: mātrai thiyau
1: matsetatijo
2: matriatiu
3: matʃeantiu
4: matriatiu
G: matreatijo
Acc. 1: 75, Acc. 2: 85, Acc. 3: 76, Acc. 4: 83, Acc. Tot Av: 80.
2 (female,noisy,respeaking):
ISO 15919: ra hāmīharu pani āttiyau.
1: ɽahamirupaniatijo
2: dahamiɽubanietiu
3: lahʌmʌrupaniatiu
4: rahamirbaniatiu
G: rahamirupaniattijo
Acc. 1: 91, Acc. 2: 76, Acc. 3: 78, Acc. 4: 80, Acc. Tot Av: 81.
3 (female,noisy,no respeaking):
ISO 15919: thiyō ra ēkdamai malāī cai ēkdamai āphnō mṛtyukō mukhabāṭa
1: raektʌmimaletseiektʌmiaafnumidtikumukfataɡ
2: eɽaɡeɡtamaimaleseeɡtomiaɸmuŋiltilkomoksakel
3: erʌeɡdomiimaletsieekdameiapnuʌmuʃtukomukfata
4: raʔeʈamimalesiiɡdomiaɸnuamiʃdiʃkomufaika
G: rʌekdʌmemaleitsiekdamiaɸnuʌmiɽtjukomukɸaʈʌ
Acc. 1: 73, Acc. 2: 64, Acc. 3: 77, Acc. 4: 69, Acc. Tot Av: 71.
4 (female,normal,respeaking):
ISO 15919: ra pharkinē kramamā kē bhayō bhandā ma ra mērō 2 janā bhāīharu pharkinuparnē bhayō,
1: rafalknikʃamakivaibʌndemʌɽʌmeɽuduisanapaiarufʌrkinupanibajo
2: eɽaɸaɡkikambanɡiβaindemeumʌremiɽiduisanabaieruɸodbinubʌniubʌjo
3: arauɸalknikrampakiɸaibʌnaumoramiɽuduisanapaieruɸolkinupanibajo
4: raɸarkiniklʌmakiwaiwʌnramoramiriduizenebajiruɸarkinipanibajo
G: erʌɸarknikrʌmakebabʌndamʌremiriduisanabaieruɸʌrkinupʌnibʌjo
Acc. 1: 83, Acc. 2: 72, Acc. 3: 79, Acc. 4: 78, Acc. Tot Av: 78.
79
5 (female,normal,no respeaking):
ISO 15919: hāmīlāī ḍara lāgyō ki katai mailē
1: ʃaŋleidʌdlaɡɡikʌʈimaidli
2: damlaidaleɡɡiɡoddimaili
3: ʃaaŋleidodleɡɡukoddimailu
4: raleidorlaɡiɡiɡotimaili
G: ʃanleiɖʌrlaɡkikʌtemaile
Acc. 1: 84, Acc. 2: 73, Acc. 3: 69, Acc. 4: 85, Acc. Tot Av: 78.
6 (male,noisy,respeaking):
ISO 15919: ra ma agāḍī paṭī basēkō thiē ra mērō sāthī pani mērō chēumā thiyō .
1: ramʌahaʈipoɖibʌsikoteʃoŋeʃosatipanimiʃatsumate
2: ramoaibeobombosikoʈerʌmesadipʌnmeʃotseuate
3: ramoaɡaʃibʌɽibosiɡʌtteʃameʃosadipanimeʃotseumate
4: ʃamoaɽibrdibosiɡateɖʌmʌsatipunumiʃutsumateh
G: ramʌaɡaɖipʌʈibʌsekoteʃʌmeʃosatipʌnimeʃotseumate
Acc. 1: 90, Acc. 2: 75, Acc. 3: 88, Acc. 4: 71, Acc. Tot Av: 81.
7 (female,normal,no respeaking):
ISO 15919: kinabhandā bubā ra āmā cai utai nēpālagañjamai hunuhunthiyō, hāmī kāṭhamāḍa jādai
thiyau ra
1: atijokinomandabuβeɽaamaseiuteiŋepaɡenahunentijohamikafmanusanetijohuɽat
2: atiukinamanɖaβaweɽeamasaiwudenepaɡaŋeuletemiɡaswenesanenteuɽak
3: ateukinomandabuwerahamaseiuteinipalunehununteuhamikaʈwenuzaniteuhirak
4: atiukinowandabuwaʃaamaseiwudeinipaʃninutiuhamikahonozamitiuʃa
G: attijaukinʌmandabubaraamaseiuteinepaɡanahununtijohamikaʈmanusanetijauhuʃa
Acc. 1: 85, Acc. 2: 59, Acc. 3: 77, Acc. 4: 70, Acc. Tot Av: 73.
8 (female,noisy,no respeaking):
ISO 15919: mērō bubā āmā ra aru mānchēharulāī dēkhna
1: mirapubaamaraʌurumantsehaledihnina
2: meɽebumamaraorumanselaidehina
3: mirabuwaanmaraʌʃumansehelaidiɡnilʌ
4: miropubaamaroodumanseladihnia
G: mirobubanmaraʌʃumantsehʌleideknʌ
Acc. 1: 75, Acc. 2: 73, Acc. 3: 78, Acc. 4: 69, Acc. Tot Av: 74.
9 (male,normal,no respeaking):
ISO 15919: tira thiyō tāplējuṅa bhannē ṭhā . ēkdamai jōkhima pūrṇa .
1: diʃatijotaplesuŋʌnethauiɡdʌmetsuhimpuɖna
2: tiɽatetaplismwnetauaedemajokimpuna
3: tiratetaplesunmannetauaeɡdʌmeidohimpuʃna
4: diratoɖaplezuŋʌnetauŋaiɡdamezohimpuna
G: tirʌtiotaplezuŋʌneʈauiɡdʌmetsuhimpuɖna
Acc. 1: 90, Acc. 2: 71, Acc. 3: 75, Acc. 4: 82, Acc. Tot Av: 79.
10 (female,normal,no respeaking):
ISO 15919: tyō dina hāmī jahājabāṭa kāṭhamāḍa pharkiēnau
1: tijodinhamikuŋipanilembatasikhuneisahasbatakafmantufʌlkjenʌu
2: teuntinhamikunipiniplembatskashassetteɡaɸond ɸalɡenu
3: deudinhamiɡunipʌnipleinbatʌsekʃesihaswataaɡaɸunduɸʌlkenu
4: diudinhanikunipʌnipleimbetesukuʃizeheswetaakaɸunduɸolkinu
G: tijodinhamikunipʌniplenbaʈasikuneidzʌhadzbaʈʌkaʈmanɖufʌrkienʌu
Acc. 1: 84, Acc. 2: 65, Acc. 3: 73, Acc. 4: 70, Acc. Tot Av: 73.
11 (male,normal,no respeaking):
ISO 15919: phōna garēra bhanē yastō yastō bhayō bhanēra .
1: fuŋɡaʃʌbanjestestʌvajoneʃʌ
2: hoŋaʃawanihstsestuwaioneɽe
3: ɸonɡaʃabanestestabaiʌneʃa
4: ɸonɡaʃaβanestestowajaniɽa
G: ponɡaʃʌbʌnestestobajonerʌ
Acc. 1: 82, Acc. 2: 63, Acc. 3: 83, Acc. 4: 79, Acc. Tot Av: 77.
12 (male,noisy,no respeaking):
80
ISO 15919: mērō pachāḍī basēkō arkō ēka janā bhāī thiyō u pani
1: mirpʌtsaiɖipʌsuvʌrkiɡasainabaipentsiuupeni
2: miɽabataɽipasuakiɡsanaabaibenseupipani
3: mirupatsadipasuarkidzanabaipuntaupani
4: mirbatairipasiwarɡɡzainʌ
G: merpʌtsaɖibʌseoʌrkeɡadzʌnabaibʌntijoupʌni
Acc. 1: 76, Acc. 2: 61, Acc. 3: 71, Acc. 4: 42, Acc. Tot Av: 62.
13 (male,normal,no respeaking):
ISO 15919: mailē mērō aba āmālāī samjhē
1: kamoilimirʌamalasondze
2: kamailemeʃoaβamamalasʌmdze
3: kʌmweilemeroʌwʌamalasʌmdz
4: kʌmweilimiʃaβamamalasondzi
G: kʌmailemeroamalasʌmdze
Acc. 1: 85, Acc. 2: 77, Acc. 3: 72, Acc. 4: 65, Acc. Tot Av: 75.
14 (male,normal,no respeaking):
ISO 15919: ra mānchēharu ēkdama jōkhimapūrṇakō yātrāharu
1: ʃaiktomantseʃuktaŋsoheŋbulnaɡajapahaʃuh
2: rabikdamantseɽeɡuŋsuhinbunaɡajaʈaɽu
3: rabiɡtomansʌʃʌɡtamzuhimbuʃnaɡajatʃaʃu
4: radomantsirikumzukimbunaɡajadʃauʃuh
G: ʃaekdamantseʃuktamdzokimbuʃŋaɡajatʃahaʃu
Acc. 1: 82, Acc. 2: 67, Acc. 3: 79, Acc. 4: 70, Acc. Tot Av: 75.
15 (male,normal,respeaking):
ISO 15919: tyahā hērdā khērī ta gāḍī ta mātra ēka inca mātra
1: teeɽahiridoɡaidiktoikmatʃe
2: teɽʌhedaɡaɽiktaikmataʃ
3: tsaʃakeɖoɡaɖiktʌikmatʃe
4: derakeraɡaʃitaikmatʃi
G: derʌkerʌɡaɽiʔtaekmatre
Acc. 1: 59, Acc. 2: 76, Acc. 3: 77, Acc. 4: 86, Acc. Tot Av: 75.
16 (female,normal,no respeaking):
ISO 15919: mērō parivārakā 5 janā sahita nēpālagañja jānuparnē bhayō.
1: mirʌpʌʃiveeʃkapatsenasaimipahɡentanupanibajo
2: miɽupoɽiβeɡapasenesainibpalɡansanubanjubai
3: miʃapoʃiβeʃkʌbadsenʌseinibadɡʌnsanbanimbaju
4: mirboriberɡabadzʌnʌsaimibanɡonsanbanibajo
G: meropʌʃivakapatsanasainepalɡʌndzanupaʃnebajo
Acc. 1: 80, Acc. 2: 73, Acc. 3: 71, Acc. 4: 76, Acc. Tot Av: 75.
17 (female,noisy,respeaking):
ISO 15919: malāī ēuṭā pārivārika kāmalē gardākhēri
1: maleieutapariverikambliɡoʃdahiʃi
2: malaieudapaɽiɸiɽikambiɡadai
3: mʌleieuɖapaʃiβeʃikanleɡodekiʃi
4: maleiiutapariverikambliɡodehaʃi
G: mʌleieuʈaparivarikʌmbliɡʌʃdaeʃi
Acc. 1: 89, Acc. 2: 72, Acc. 3: 79, Acc. 4: 82, Acc. Tot Av: 81.
18 (female,noisy,respeaking):
ISO 15919: , tyō bhandā arkō dina cai hāmī kāṭhamāḍa pharkyau nēpālagañjabāṭa.
1: alalkutijobandaalkodintseihamikaɖenkaʈvandoufʌʃkionipaliosveʈa
2: dalalkatuɡandalaɡotinsaiamikatupaswanduɸokinupalswata
3: olalɡudeupandaalɡudinseihamikaʃʌnkaʈmanɖuɸʌʃkiunipalɡonswetʌ
4: alalɡətiupandalɡudinseiamikatnkatmandufolkiunipaloswata
G: alalkotjobʌndaʌrkodintseihamikaʈenkaʈvamdupʌrkjonepalndzaʈa
Acc. 1: 81, Acc. 2: 64, Acc. 3: 74, Acc. 4: 75, Acc. Tot Av: 73.
19 (female,noisy,no respeaking):
ISO 15919: ukta vimānasthala
1: uktabimenistal
81
2: okopimanistal
3: uɡdabimanistal
4: uptabimanistal
G: uktavibmanʌstʌl
Acc. 1: 77, Acc. 2: 66, Acc. 3: 79, Acc. 4: 79, Acc. Tot Av: 75.
20 (male,noisy,no respeaking):
ISO 15919: ṭhā kō phōṭōharu pani mailē khicēkō thiē .
1: etithaɡupudʌhaʃubenumalihiseɡate
2: batakabokoteɽemaŋaikatsaka
3: petatalɡopulɖʌʃaβenimalikidziɡaten
4: ətətalɡokoldəʃəwinmalikizekaten
G: ekeʈauɡopoɖoʃubanimailikitseɡote
Acc. 1: 67, Acc. 2: 50, Acc. 3: 66, Acc. 4: 56, Acc. Tot Av: 60.
21 (male,noisy,respeaking):
ISO 15919: unāisa saya sattarī sāla tirakō cai lyānḍa rōvara mōḍēlakō gāḍīharu thiyō .
1: unaisesottorisaltirakoltsejolɑndʃovaʃmoɖelhohaɖihaɖiheʃotie
2: unaisesatsaɽisantsaɡotselanʃoɸamodzubaɡaɽiɡaɽieɽutsiʃ
3: uneisisʌtoʃisaltilakoutseulandʃoβeʃmooɖalkoɡaʃiɡaʃiʃuti
4: neisisotarisaltirakultirlanrobamorolkariɡiɡaʃiʃətje
G: unaisesʌtterisaltirakotseolanɖrovarmoɖelɡoɡaɖiɡaɖihaʃitie
Acc. 1: 86, Acc. 2: 66, Acc. 3: 77, Acc. 4: 65, Acc. Tot Av: 73.
22 (female,normal,no respeaking):
ISO 15919: ṭikēṭa kāṭēra
1: eutahatikerkaʈira
2: eudzaaltsikaɡaɖiɽa
3: eutaamdikarkaɖirʌ
4: ewtaamdikərkadira
G: euʈaamʈikeʈkaʈerʌ
Acc. 1: 71, Acc. 2: 65, Acc. 3: 82, Acc. 4: 70, Acc. Tot Av: 72.
23 (female,normal,respeaking):
ISO 15919: napāunē ta haina?
1: paunedaina
2: paunedaine
3: paunitaina
4: paunidaina
G: paunetainʌ
Acc. 1: 93, Acc. 2: 93, Acc. 3: 92, Acc. 4: 91, Acc. Tot Av: 92.
24 (female,noisy,no respeaking):
ISO 15919: tala thiyō ēuṭā bhayāvaha,
1: palatijobimanissamantʌletijoijokahaijavahʌ
2: ɡalatsiubiminesanatsalatsiuɡahayadoho
3: talatiumiwenisanantalatiuʃahajaβahou
4: dalatiubimanistanadalatiudeajavaho
G: talatijobimanissʌlʌmtʌlʌtijoɡabʌjavʌhʌ
Acc. 1: 74, Acc. 2: 65, Acc. 3: 66, Acc. 4: 71, Acc. Tot Av: 69.
25 (female,noisy,no respeaking):
ISO 15919: sāyada tyahī kāraṇalē gardā hōlā,
1: sahitteikarindeɡoʃtehoʃe
2: saidzdzaiɡaʃindzaɡodzalets
3: sajedteikarinɖeɡoɖɖehole
4: saiddeiɡaʃindiɡadahode
G: sajiʔdeikarendeɡaʃdahola
Acc. 1: 82, Acc. 2: 56, Acc. 3: 80, Acc. 4: 80, Acc. Tot Av: 75.
26 (female,noisy,respeaking):
ISO 15919: ra tyō jahāja, hāmīlāī thāhā bhayō ki tyō jahāja rātīkō 10 bajē
1: ratijosahastahamlaithahabojokhitijosahasratikotʌsvase
2: raɖesehashnlaithadaikikidzsasʃatsikodoswasi
3: ratusʌhashʌmleitaβeikitusasʃatikodoswasi
82
4: radiuzʌhashamletaweikikdouzasratikadoswwzi
G: ratijosʌhashamlaitahabʌjokhitjosahasratikodʌsbʌse
Acc. 1: 87, Acc. 2: 63, Acc. 3: 71, Acc. 4: 65, Acc. Tot Av: 72.
27 (male,normal,no respeaking):
ISO 15919: tyō kāma cai nēpālakō purānō bikaṭa ṭhā tira gaēra
1: tijokamtseinepalkopurahanopikhaʈautiraɡʌiʃas
2: tukamsainipalkopuɽanobiɡatsautsiʃaɡeiʃes
3: tiokamseinepalkupuranupikaɽʈauntiɽaɡoiʃʌs
4: doukanseinepalkoburanudikatauntiraɡaiʃas
G: tjokamtsainepalkopuranobikaʈautirʌɡʌeras
Acc. 1: 83, Acc. 2: 85, Acc. 3: 83, Acc. 4: 83, Acc. Tot Av: 83.
28 (male,normal,no respeaking):
ISO 15919: thiē jallē cai tyō bāṭōmā calāuna sakthē ra calāuna āuthyō
1: tiezalezettijobatematsohonʌfʌhtiratsohonahotijonetlai
2: tisalisetubakmatsaunaɸoktsiɽitsonautsuneleh
3: tiʌsalesitubakmʌtsonʌɸaktiratsonautunelei
4: tizalisetiubatmatsonətoktirasounaudiunilei
G: tiezalezetjobaʈomatsaunʌfʌkteratsʌlaunautjonerlai
Acc. 1: 75, Acc. 2: 68, Acc. 3: 72, Acc. 4: 70, Acc. Tot Av: 71.
29 (female,normal,no respeaking):
ISO 15919: ra hāmīharu cai surakṣita thiyau.
1: arahamiharusisuratsiottijo
2: arahamiɽesesoɽetsiɸim
3: arahamiorʌsinsuβʃatitteu
4: arahamiurasisurasetiu
G: arahamihʌrutsisuraksittijau
Acc. 1: 84, Acc. 2: 58, Acc. 3: 69, Acc. 4: 73, Acc. Tot Av: 71.
30 (female,noisy,respeaking):
ISO 15919: mērō āmā ra bubā cai nēpālagañjamai basnuparnē bhayō. (ra) hāmīharu pharkinē krama
thiyō tyatibēlā,
1: miruamarabuvasahinepalionnivosnupadnibajorahamiharufʌrkinikramtijotjatibela
2: meɽuamaɽebubasainipaleuβiɸosnipanipajobilahamiriɸokimikʃamkekekiβileh
3: miruammarabubaseinepalɡonniβosnepadniβajoirahamiruɸolkinikʃantetetiβila
4: miruamarabubaseinepalionrevʌsnepanibaijoiʃahamiʃufaʃkinikʃmtiotiotevela
G: meruamarabuvasainepalionrivosnepanibajoerahamirufʌrkinikrʌmtiotjotibela
Acc. 1: 86, Acc. 2: 74, Acc. 3: 84, Acc. 4: 91, Acc. Tot Av: 84.
31 (female,noisy,respeaking):
ISO 15919: ra tyahā gaēra hērdai cai mērō bubā āmā
1: ratihaɡoiʃaheʈatseimiʃububaama
2: laʈiɡoiɽiŋdasaimiɽupubahamas
3: ʃatijaŋɡoiʃahiʃdaseimiʃubuβaamma
4: ratianɡoiʃaiʃtasimiʃupubaamas
G: ratjaɡaeʃahedatsaimeʃububaama
Acc. 1: 85, Acc. 2: 66, Acc. 3: 76, Acc. 4: 71, Acc. Tot Av: 75.
32 (female,normal,no respeaking):
ISO 15919: dhērai nai jahāja durghaṭanāharu hunē garthyō
1: rinidzahasturhateenaaruhuniɡaɖtijo
2: janitsahastsuɡatsanaɽuniɡotu
3: rinisahasturɡetanahaʃuhuniɡaɽtiu
4: ʃinidzahasduʃɡatenaʃuhuniɡaʃtiu
G: rainidzahasdurɡaʈanaaʃuhuneɡaʃtjo
Acc. 1: 82, Acc. 2: 65, Acc. 3: 83, Acc. 4: 86, Acc. Tot Av: 79.
33 (female,noisy,no respeaking):
ISO 15919: bicarā hāmī 3 janā
1: pitsaʃahamitinzaanaa
2: pitssaɽahamitinsana
3: itsaʃahamitinsana
4: pitsarahamitinzana
83
G: pitsʌrahamitinzʌna
Acc. 1: 82, Acc. 2: 83, Acc. 3: 85, Acc. 4: 87, Acc. Tot Av: 84.
34 (female,noisy,no respeaking):
ISO 15919: ra 10 minēṭakō jahāja uḍānapachi cai acānaka jahājalāī tyahānēra kē bhayō kasailāī thāhā
bhaēna.
1: ʃadosmniʃkuuzahazuʃanpasseiʌtsanidzahatslaiktehanirekibajokosteɖleithahabohina
2: nadzosnikʌsahastudzenpasisaiatsanibtsahadzslaikdanirakiboɡoslatsaɸana
3: raboswʌnitkuʌsʌhassuɽanposʌseiʌtsanudzʌhaslaikramirikihoiukosʌtleitaboino
4: radosmunirkuzahasuranposesiatsanzahaslaikdianɡivoikostaleitavoina
G: radosnifkuzahasuɖanpʌsseiʌtsanedzʌhatsleiktjanerekebʌjokʌseleitabʌjnʌ
Acc. 1: 73, Acc. 2: 67, Acc. 3: 71, Acc. 4: 67, Acc. Tot Av: 70.
35 (female,noisy,respeaking):
ISO 15919: ra hāmīharu tyahā basyau ra 10 minēṭa jatikō cai uḍāna bhayō ukta jahājamā
1: pirahamiharutehapaseuradʌsmineʃdzatikotsijuranbajouktadzahazma
2: biraahamiruɡambasuʃadosmisadikusiamuɖanbajoupasahasweʃ
3: raʌhamirudihamposudadosmbirsatiɡosihiuɖanbajouktazahazma
4: raahamiarudeanbosuradosmirdzatikosiamuranpajouktazahazma
G: pirahamihʌrutjapasuraɖʌsminetsatikotseihuɖanbajouktadzahazma
Acc. 1: 86, Acc. 2: 73, Acc. 3: 73, Acc. 4: 75, Acc. Tot Av: 77.
36 (male,noisy,respeaking):
ISO 15919: tara nēpālamā tyastō bāṭōharu ēkdamai dhērai chana
1: paranepalatestobaʈʌhruikdʌmedheretsan
2: taranepalatobaʈonsekdamedeʃetsana
3: taranepalatistobaʈohuʃuiektomedeʃitsana
4: ranibaladisubaturuitamideresan
G: taranepalatjastobaʈohruekdʌmedeʃetsana
Acc. 1: 85, Acc. 2: 80, Acc. 3: 88, Acc. 4: 71, Acc. Tot Av: 81.
37 (female,normal,respeaking):
ISO 15919: hāmīharu kāṭhamāḍa pharkina lāgyau ra ma ra mērō 2 janā bhāī cai ukta plēnamā basyau,
jahājabhitra basyau.
1: hamiharukatfandofʌlkiolaɡioeʃamʌʃamiʃuduizanabahitsaiuktʌplenmawaseotsahasviklavase
2: hamiruɡaɸinaɸokinlaɡueramoramiɽadzwisanabaisaiuktsaplenawasiusahasiktsarasu
3: hamiharukaʈʌmdʌɸolkinulaɡijoeʃamohaʃamiʃuduisanabaisajiuktaplimnawasiusahasβikʃawasiu
4: hamiarukatmndəforkinalaɡiuʃamoʃamiʃaduizanapaitsaiuktaplenawaseudahaswiklawaseu
G: hamiaʃukatfandofʌrkiilaɡiueʃamʌʃameʃuduizanabaitsaiiuktʌplenmabasiuzahasfitʃabase
Acc. 1: 89, Acc. 2: 73, Acc. 3: 77, Acc. 4: 79, Acc. Tot Av: 80.
38 (female,normal,respeaking):
ISO 15919: ra 10 minēṭa pachi cai, dhanna 10 minēṭa pachi ukta jahāja
1: radʌsmniɽpasizeidannadʌsmniɽpasiuktadzahaz
2: radzasnipasisaidanadasmipasiamuktasahas
3: eradosmirɖβasisedahnadasminiɽpasiauktasahas
4: radosmunirposeseidonʌdosnirpaseiamuktasahas
G: radʌsmneʈpasizeidannadʌsmneʈpasiuktʌdzahaz
Acc. 1: 92, Acc. 2: 74, Acc. 3: 73, Acc. 4: 69, Acc. Tot Av: 77.
39 (male,normal,respeaking):
ISO 15919: āja bhandā kamsēkama duī barṣa agāḍī ma nēpālamā h dā khērī
1: aʌbandakomskomduibeʃsʌɡaiɖimo nepalmahodakheʃi
2: azuwandabkomsomdwiβasaɡaɽimonapanaudzakeri
3: aserbʌndakomɸekomduibasoɡaʃimonepalnahudakiʃi
4: asebandakomskomduibarsaɡaʃimonepalmaʃoudakiʃi
G: asʌwandakʌmskʌmduibʌr ʌɡaɖimʌnepalmahodakeri
Acc. 1: 87, Acc. 2: 74, Acc. 3: 78, Acc. 4: 85, Acc. Tot Av: 81.
40 (male,normal,no respeaking):
ISO 15919: ra yastō āyō ki tyō gāḍī ēkai cōṭī ḍhalkiyō
1: raistoajoɡitoɡaiɖiekodzuʃudʌlkio
2: ʃamistsuʃauɡitsaɡaɽiikatsajadzolka
3: raistoauɡintuɡaʃiikudzʌʃɖʌʃɖolkijo
4: raestoaiwiɡindaɡaʃiekətəʃədolkə
84
G: raistoajoɡitoɡaɽiekodzuɽiɖʌlkio
Acc. 1: 88, Acc. 2: 60, Acc. 3: 71, Acc. 4: 61, Acc. Tot Av: 70.
41 (male,normal,respeaking):
ISO 15919: aba cai ma gaē bhanna ṭhānēkō thiē . gāḍī yastarī ḍhalkiyō ki
1: awʌdzeimʌɡojewentaneɡʌteɡaɖiesteridʌlkiokiamdʌd
2: awatsemaɡaehewanʈaneteɡaɽisteɽadzolkikandub
3: awatseimʌɡaewʌntaneteɡʌʃitsuʃidolkiukiaŋdʌb
4: awatsemaɡoewantaniɡadiɡaʃisteʃidolkekiandə
G: ʌwʌtseimʌɡaewʌntaneteɡaɖisteʃiɖʌlkiɡamdʌd
Acc. 1: 79, Acc. 2: 75, Acc. 3: 80, Acc. 4: 70, Acc. Tot Av: 76.
42 (male,noisy,no respeaking):
ISO 15919: cai tyahā sarvē garnuparnē jastō thiyō . ra ma ra mērā
1: etsiasʌrviɡʌnupaniostʌtijoʃamʌaʃamiʃa
2: titsasaɽiɡonpaniʃstsatsiʃʃamoɽamiɽa
3: utssoruiɡonuβanestutiu
4: rətiatsoriɡoʃnubaniostatiuʃamoʃamiʃa
G: etjasʌrveɡʌʃnupaʃnestotijoʃamʌʃameʃa
Acc. 1: 82, Acc. 2: 59, Acc. 3: 46, Acc. 4: 70, Acc. Tot Av: 64.
43 (male,noisy,respeaking):
ISO 15919: tyō mānchēlē cai u yastō āphu ēkdama ātma biśvāsa thiyō
1: patiomantselaitseiwestoafuiɡomafubisaftehitaʌ
2: adzamantseletseiujʌstoaɸidomabisatsstsukitio
3: ʃatemantseletseujʌstoaɸudʌmapubisvastudiɡau
4: ʌdiumantseleseuestouahikoah orkihiteu
G: atiomantseletseiujastoafuɡʌmafnubisaftehiɡau
Acc. 1: 80, Acc. 2: 67, Acc. 3: 75, Acc. 4: 62, Acc. Tot Av: 71.
44 (female,normal,respeaking):
ISO 15919: ēkadamai ḍaralāgdō jangala
1: ektʌmaidʌdlaktozzŋɡel
2: ekamedzalakdzsaŋɡel
3: ekdomiɖadlaɡdʌsoŋɡʌls
4: iɡɖamidalaɡdozaŋɡal
G: eɡdameiɖalaɡdozʌŋɡʌl
Acc. 1: 75, Acc. 2: 70, Acc. 3: 77, Acc. 4: 87, Acc. Tot Av: 77.
45 (female,noisy,respeaking):
ISO 15919: ma mērō 2 janā bhāī, bubā ra āmā tyahi thiyau
1: mameruduisanʌbhaibuwaraamatehitijo
2: mamiɽupusanapaihimuβaɽanmatitsiu
3: momiruduisanapajibuheraanmatehitijo
4: momiruduisanopajibubaraamadihintiu
G: mameruduidzʌnabaiibubaranmatijitju
Acc. 1: 78, Acc. 2: 73, Acc. 3: 77, Acc. 4: 76, Acc. Tot Av: 76.
46 (male,normal,no respeaking):
ISO 15919: tyahā cai ēuṭā gāḍī mātra aṭnē jastō ṭhā thiyō .
1: teaseeoraɡaɖimatʃʌʌʈniostoʈautijo
2: tsaseɸiulaɡaɽimatsʃoknestsoktsautsiʃ
3: deuseioraɡaɖimatʃooʃtnijostoʃtauntijo
4: diaseieuraɡaʃdimatʃooʃniustortaundiu
G: tjaseeuwlaɡaɽimatsʃʌʌʈnejastoʈautijo
Acc. 1: 82, Acc. 2: 65, Acc. 3: 71, Acc. 4: 63, Acc. Tot Av: 70.
47 (female,noisy,no respeaking):
ISO 15919: nēpālagañjamā hāmīlē bitāyau, 1 mahinā jati bitāyau sāyada, dhērai ramāilō bhayō.
1: nepaliosbahablibitajoekminazutebitajosahiterirʌmaiɖebajo
2: napaloŋtalibitsaiewinasibitsitsasaidedemamaidzaboio
3: nepalunsmaublibidajuehminʌsitipidaijusaideridemaidʌbajo
4: nepalɡonsnahalibitajoeminazutibitajosaiɖiʃiʃomailibajo
G: nepaloŋmalebitajoekminadzʌtibitajosajʌderirʌmailʌbʌjo
Acc. 1: 78, Acc. 2: 64, Acc. 3: 76, Acc. 4: 78, Acc. Tot Av: 74.
85
48 (male,noisy,respeaking):
ISO 15919: tyastō huncha hōlā bhanēra .
1: tistohuntsaholaponeʃa
2: tastohuntsalawanera
3: dzʌstuhunsahulaβoneʃa
4: dzstouhunsaulawunira
G: tistohuntsaholapaneʃa
Acc. 1: 96, Acc. 2: 79, Acc. 3: 81, Acc. 4: 68, Acc. Tot Av: 81.
49 (female,noisy,no respeaking):
ISO 15919: kāṭhamāḍa kō gharamā cai hāmīlāī parkhēra basēkā mērā hajuramuvā hajurabuvā haru,
1: kaifmandukukʌlmatsehmlipʌrkerʌbʌsikamirahasiluahasirwaharu
2: ɡamindzuhuɡomatssenlebodzkiɽibasiɡamenasinasuweɽu
3: kaʈmʌnɖukukohomesihʌmlibʌdkiɽoβosikaminohosʌrmahasʌruwaharu
4: katmandukukolmatsehuliporkabosikamirahausilnahasirwuharu
G: katmanɖukuɡʌʃmatseihamlʌbʌʃkeʃʌbʌseɡameʃahazimnahaziʃuwaaʃu
Acc. 1: 81, Acc. 2: 62, Acc. 3: 73, Acc. 4: 75, Acc. Tot Av: 73.
50 (female,noisy,respeaking):
ISO 15919: ra hāmī āttiyau, dhērai āttiyau,
1: rahamiatijodereiatijo
2: rahamiattiudeɽeiatiu
3: herahamiattijodireiattijo
G: ʌrahamiattijodereattio
Acc. 1: 77, Acc. 2: 78, Acc. 3: 81, Acc. Tot Av: 79.
51 (male,normal,no respeaking):
ISO 15919: ra tyō gāḍī harulāī cai hāmīlē kati
1: ratijoɡaidieʃlaitseiamliokoti
2: ʃatsiuɡadzulasaineukoti
3: heratijoɡaɽijudʌlaitsejanliuɡoti
G: ratijoɡadiʃulaitseihamlekoti
Acc. 1: 80, Acc. 2: 61, Acc. 3: 63, Acc. Tot Av: 68.
52 (male,normal,no respeaking):
ISO 15919: cintā nagarnusa bhanyō ra uslē cai gāḍīkō sṭēriṅga mōḍēra
1: tsinanʌhonifoinierahuleizaɡaiɖikesteʃiŋmoʃeʃa
2: tsinanonaɸonaaulesaɡaɽikastseɽiŋweɽo
3: tsinanonʌɸoinehowalisaɡalikatsteʃiŋmoʃeʃa
G: tsinanʌhonufoineraulezaiɡaɖikesteʃiŋmoɽeʃa
Acc. 1: 84, Acc. 2: 67, Acc. 3: 74, Acc. Tot Av: 75.
53 (male,noisy,no respeaking):
ISO 15919: hāmīlāī arulē jō mānchē tyahā gaēkā thiē
1: amahuʃuledadzubmantsitenɡoikateuhule
2: amauʃuleudzadzomatsatseɡoikatsule
3: amlauruleɡaldzumantsetantuweiɡateʃulle
G: amlauruleɡadzumantsetanɡoiɡatieule
Acc. 1: 77, Acc. 2: 69, Acc. 3: 78, Acc. Tot Av: 75.
54 (male,normal,respeaking):
ISO 15919: ali ali sima sima pānī parēra pahirō ali ali jharirahēkō thiyō
1: raʌhaɖizahakheriolilisimtsimpanipoiropohiroalilidzʌharaikotijo
2: raaweridzarakeraolilisimsʌpainipodzopoiʃaninidzaʃhaiɡotio
3: raoɡaɽisahakheʃiolilisimtimpaniupʌʃʌpohiʃuanlilizahaʃʃaiɡottijo
G: raʌɡaeɖizahakediʌlilisimsimpanipoirʌpʌhiʃoalilidzʌhraiɡotijo
Acc. 1: 90, Acc. 2: 68, Acc. 3: 78, Acc. Tot Av: 79.
55 (male,noisy,respeaking):
ISO 15919: dina kō hāmrō kāryakrama thiyō . ra ṭhā cai nēpālakō pūrva jillā
1: simsamrukarihromtijoratauseinepalkopurwadzila
2: kimbambokaʃikʃimtsiʃʃatsousaŋipalkupudzatsila
3: imsahamrukariɡʃʌmtijoʃataunseinnepalkubudwadzila
G: dinsamrokadikrʌmtijoraʈautseinepalkopurvadzilla
Acc. 1: 87, Acc. 2: 66, Acc. 3: 76, Acc. Tot Av: 77.
56 (male,noisy,no respeaking):
ISO 15919: ra uniharukō lāgi ta dina dinai kō kurā hō tara hāmrō lāgi
1: istodauunerkolaitetiodinienikakuratahamrolai
2: istsudzaunakalait setsedzininiɡakudzatsamdzalai
3: estoddauunerkulaidettedinniniɡakodatahamdʌlei
G: estodauunerkolaitetiodinienikokuratahamrolai
Acc. 1: 95, Acc. 2: 65, Acc. 3: 81, Acc. Tot Av: 81.
57 (female,normal,respeaking):
ISO 15919: tyahā jahājamā basnē paricārikāharu, unīharu ḍarāuna thālē
1: tehadzahazmabʌsnipʌritsairikaharuunihardaraunathale
2: dzedzahasnabasnipaɽiseɽikahaʃuunidzaʃaŋatali
3: dehadzahasnaubosniboɽitsaɖikaharuuniharudaraunitalije
G: tjahadzʌhazmabʌsnipʌritsarikaharuuniharɖaraunatale
Acc. 1: 90, Acc. 2: 71, Acc. 3: 78, Acc. Tot Av: 80.
58 (male,noisy,no respeaking):
ISO 15919: bhanēra bhanyō . tara ullē cai hāmīlāī tyasarī aru kēhi pani ḍara dēkhāēna
1: snerpʌnitaradulitseidamnavutisereʌrukipendoardehaina
2: tsanabʌnetaɽadulletseianeɸtsiɽiodzinkipinɖardekaina
3: pʌnitaradaulitseamlebudisʌrearukipʌndordihaimʌ
G: ʌnerʌbʌnjotʌrʌdʌuletseidʌamlʌvutseriʌrukepʌnɖʌrdekaenʌ
Acc. 1: 73, Acc. 2: 60, Acc. 3: 66, Acc. Tot Av: 66.
59 (male,noisy,respeaking):
ISO 15919: ra hāmī cai gayau ra tyasamā cai kamsēkama
1: rahamiseiɡoiɡoioʃatesmatseikomtseikomkotib
2: ramitseaɡoiɡoiaaʃaɡtsiswasaimakosekomkotib
3: rahamiseiɡojuaʃadismatseikomsikomɡoti
G: rahamiseiɡoiɡoioʃatesmatseiʌkʌmsekʌmkotie
Acc. 1: 90, Acc. 2: 69, Acc. 3: 81, Acc. Tot Av: 80.
60 (male,normal,respeaking):
ISO 15919: ra tyahābāṭa gayō ra uslē cai hāmīlāī ali rāmrō samga basnusa
1: tjapurʌɡoieʃautlidzahamlailaʌliramsatsebosnus
2: tjawadaɡoiʃawulidamlailaaolidamsasabasnos
3: deburuɡɡojeʃaulisamleilʌalidamsasubosnus
G: tjabuɽʌɡʌjorautlidzahamlalʌʌliramsatsʌbosnus
Acc. 1: 88, Acc. 2: 64, Acc. 3: 73, Acc. Tot Av: 75.
61 (female,normal,no respeaking):
ISO 15919: , pachi cai kē bhayō bhandā khēri
1: kipʌtsizeikibajobandahevi
2: kebotssasikiboibʌdzaheɡi
3: keipotsiseikipojeβʌndaheri
G: kepʌtsitseikebʌjobandahedi
Acc. 1: 88, Acc. 2: 63, Acc. 3: 75, Acc. Tot Av: 76.
62 (male,noisy,no respeaking):
ISO 15919: ra ēkdamai ḍarāi sakē pachī tyō ṭhāumā cai
1: raektomiɖʌraitseapatsitijotʌŋmatsvei
2: raiktamaibaɽaisebasitiutsaŋʌsai
3: ʃaiktumendaʃeiseupatsihiʃdiltoŋwasei
G: raekdʌmeɖʌraisepʌtsitjoʈaumatsei
Acc. 1: 79, Acc. 2: 70, Acc. 3: 58, Acc. Tot Av: 69.
63 (female,normal,no respeaking):
ISO 15919: tyatibēlā jhanai āttiyau.
1: tiktihoratsahʌnahatijo
2: tiktibeledzanatiu
3: tedibʌrʌsoniyʌtijo
G: tiktibeladzʌnaiattijo
Acc. 1: 71, Acc. 2: 73, Acc. 3: 67, Acc. Tot Av: 70.
64 (male,noisy,no respeaking):
ISO 15919: ra ēuṭā bāṭōmā cai ēkdamai sānō khōca thiyō .
1: rajoɖapaʈomasiktomesanukostiʃ
2: reiodzabaɡabasitsanasanokostsu
3: raoɽʌpakʌbɑsihitʌmesanohostiʃ
G: rajoɖabaʈomatseiekdʌmesanokotstio
Acc. 1: 86, Acc. 2: 61, Acc. 3: 68, Acc. Tot Av: 72.
65 (female,noisy,no respeaking):
ISO 15919: ra hāmī cai aghillō dina arkō buddha jahājakō cai
1: rahamitseiʌɡilotinarkukutadzahaskutsei
2: rahamisaiaɡiladzinaɡuwutsaɡadzahasɡʌtsi
3: erʌhamiseiʌɡiɽlotinarkuɡutaasʌhasɡutihi
G: rahamitseiʌɡillodinarkobuddʌdzahaskotsei
Acc. 1: 89, Acc. 2: 72, Acc. 3: 79, Acc. Tot Av: 80.
66 (male,noisy,no respeaking):
ISO 15919: uni harulāī . ra gāḍīharu ēkdamai purānō thiē
1: rokaiɖirikdompuraanote
2: raɡaɽidziɡompuɽanati
3: rʌɡʌririɡdompudʌnʌti
G: raɡaɖiʃekdʌmpuʃanote
Acc. 1: 82, Acc. 2: 75, Acc. 3: 78, Acc. Tot Av: 78.
67 (male,normal,respeaking):
ISO 15919: ki jindagīmā jē pani huna sakcha
1: kidzindaɡimaadzepanihunafaksaʃad
2: hitsndzaɡimabtsepaninaɸʌksaɽadz
3: dzidzindʌɡimantsepʌnijonʌsoksʌra
G: kidzindaɡimadzepanihunʌfaksaʃad
Acc. 1: 94, Acc. 2: 73, Acc. 3: 73, Acc. Tot Av: 80.
68 (male,normal,respeaking):
ISO 15919: ra hāmīlē jādā khēri cai chātāharu pani bōkēkō thiēna
1: rahamledzadaheʃitseiamnestsataʃuponibukikotena
2: ʃaamlidakiɽisainamnestsatsaɽubanibuɡedzena
3: rʌhuamledzʌlʌkeʃitseinamnestsataɽubunibuɡiɡutino
G: ʃahamledzadakeʃitseiamnestsadaʃupʌnibuɡeɡotena
Acc. 1: 95, Acc. 2: 75, Acc. 3: 83, Acc. Tot Av: 85.
69 (male,noisy,respeaking):
ISO 15919: ra hāmī pani tyahī anusārakō āphnō dimāga sēṭa garyau
1: raamipenteonsarkrafnudimakseʈɡoʈe
2: rahamipʌdzionsaʃkoapnodimaseɡoʃe
3: rʌamibuntejonsarkraknudemadsedɡoʃiu
G: rahamipʌnteʌnsarkafnodimaɡseʈɡʌʃe
Acc. 1: 86, Acc. 2: 81, Acc. 3: 75, Acc. Tot Av: 81.
70 (male,normal,respeaking):
ISO 15919: ma pani utrē . mērō aru sāthīharu pani gāḍībāṭa utryō
1: mʌpaniutremiroaurusatiorupaniɡaiɖiboɖautʃe
2: mopaniutrinmiɽolsatiruponiɡaɖiwʌdeutʃi
3: moɡpʌniutʃeʌmiʃuoʃusatijoʃupaniɡʌʃiwuʃdʌwutʃe
G: mopʌniutremeroaurusatiorupaniɡaʃiwuɖʌwutʃe
Acc. 1: 86, Acc. 2: 73, Acc. 3: 84, Acc. Tot Av: 81.
71 (male,normal,respeaking):
ISO 15919: ra uniharu ēkdamai khuśī bhaēkī ma tyahā bāṭa
1: unarektamekusibaikimoteabora
2: unaredzamkusiwaiɡimopʌɽiɸ
3: urariɡdomukusibaiɡimotebʌʃʌ
G: unaʃekdʌmekusiwaiɡimoteabʌɽa
Acc. 1: 90, Acc. 2: 71, Acc. 3: 81, Acc. Tot Av: 80.
72 (male,normal,respeaking):
ISO 15919: tīna janā sāthīharu cai tyō ṭhā mā gaēra cai hāmīharulē sarvē garnu parnē tyō
1: tindzanasatierutseitijotʌmaɡojeʃazaamioʃlesoʃveɡonupʌnetijo
2: tsindzanasatseuʃutsaiteutsamaɡoiɽasameulesaɽeɡonupanitsi
3: dinsʌnʌsʌtihurutseitijotomaɡojeʃasamiʃlisʌʃiɡonupanetijo
G: tindzʌnasatierutseitjotaumaɡoeʃazaamioʃlesʌʃveɡʌnupʌʃnetjo
Acc. 1: 87, Acc. 2: 70, Acc. 3: 75, Acc. Tot Av: 77.
73 (male,noisy,no respeaking):
ISO 15919: ra pānī pani ali ali parēkō thiyō .
1: rapaniɡunuelilipʌʃivekoti
2: rapaniɡʌŋololipodziɸabatsiʃ
3: rʌaniɡunalalepoʃibaɡʌtijo
G: rapanipaniʌlilipʌrekokotijo
Acc. 1: 76, Acc. 2: 61, Acc. 3: 72, Acc. Tot Av: 69.
74 (male,normal,respeaking):
ISO 15919: ma ta ēkdamai ḍarāē . mērō muṭu mērō mukha samma āipugō .
1: moataekdomiɖoʃaimiʃumiʃumtumiʃumuksomaiweka
2: motseiɡdzamadzaɽaimiɽomeɽumutsumeuɽemuksamaiwiɡa
3: motaiɡdomiʃdʌʃʌimiʃumiʃamutumiʃamuksʌmaiβuɡiu
G: mataekdomiɖʌraimeromeromuʈumeromuksʌmaiweɡa
Acc. 1: 91, Acc. 2: 74, Acc. 3: 78, Acc. Tot Av: 81.
75 (male,noisy,no respeaking):
ISO 15919: ra tyō bēlā ali sima sima pānī pani parirahēkō thiyō .
1: ratijobelalisimsimfaniponiporireɡatie
2: atsiulalisinsinpanipanibodzibeɡatsiʃ
3: tiβulʌlisimsimpʌnipaniuporirʌɡʌdijo
G: ratjobelalisimsimpanipʌnipʌrireɡotio
Acc. 1: 88, Acc. 2: 68, Acc. 3: 71, Acc. Tot Av: 76.
76 (male,normal,respeaking):
ISO 15919: cha ghaṇṭākō lāgi cai hāmīlē kati dasa hajāra rupaiyā nēpālī tirēkō thiyō .
1: tsʌkʌntaholahitseihamledeɡotidosazaʃpenepaliditiʃeɡotijo
2: tsoɡaŋdzakalaiɡisaiamledzaɡodzasadzatspenipalidzitsiɽaɡotsiʃ
3: tsoɡontakulʌkitsehʌmleseɡotidoshodzaʃdenepʌlititiʃiɡotijo
G: tsʌɡʌɳɖaɡolaɡitseihamledeɡʌtidosʌzaʃupenepʌliditireɡotio
Acc. 1: 89, Acc. 2: 67, Acc. 3: 82, Acc. Tot Av: 79.
77 (male,normal,respeaking):
ISO 15919: phērī bhanchukī rāmrō samga sābadhāna purāēra jānu .
1: feribʌntsokiamsusabodanpuʃaaʃadzanu
2: peɽibaŋsadzihamsasadzanpuɽalβanu
3: siʃibontsukiʌknusʌbadʌpuʃaʃadzannu
G: feribʌntsukihamsusawodanpuʃaʃadzanu
Acc. 1: 89, Acc. 2: 66, Acc. 3: 71, Acc. Tot Av: 75.
78 (male,noisy,respeaking):
ISO 15919: mērō pahilō kāmamā lāgēkō thiē .
1: mirupoilukamamlaekutie
2: miɽupoilokamamlaikatso
3: mirupoilukammaɽlʌiɡote
G: merupʌilokamamlaeɡutie
Acc. 1: 95, Acc. 2: 79, Acc. 3: 77, Acc. Tot Av: 84.
79 (female,noisy,respeaking):
ISO 15919: thiyō ra hāmī māthī tyahābāṭa
1: tijorahamimatitehabaʈa
2: tioʃahamimamtsitjʌɸaʈe
3: tijorʌhamimatidihanβatʌ
G: tijorahamimatitjabaʈa
Acc. 1: 90, Acc. 2: 80, Acc. 3: 78, Acc. Tot Av: 83.
80 (male,noisy,respeaking):
ISO 15919: ra pachī pharkidā khērī cai hāmīharu tyō ṭhā mā ā dā khērī
1: rapʌtsifoʃkidakeʃidzeianiaʃutijotaumaaodahiʃi
2: ʃapatsiɸarkikeɽetseanuʃutyuʈaumʌamaaudakeri
3: potsiɸoʃkidakeʃitseihʌmihʌrutijotaumʌʌnʌaudakeri
G: rapʌtsifʌrkidakeritseihaniʌrutijoʈaumaaudakeri
Acc. 1: 92, Acc. 2: 71, Acc. 3: 80, Acc. Tot Av: 81.
81 (female,normal,respeaking):
ISO 15919: ra kāṭhamāḍa mā sabaijanāsanga
1: ʃakatsfandomasabeisanasaŋɡa
2: ʃakatmandzumasawaidzanasaŋʌ
3: rakaʈmandumʌsaβidzʌnʌsʌnɡʌ
G: rakaʈʌndaumasabeidzanasʌŋɡʌ
Acc. 1: 81, Acc. 2: 75, Acc. 3: 76, Acc. Tot Av: 77.
82 (female,noisy,respeaking):
ISO 15919: ra tyō pani mērō dimāgamā ēkachina cai āuna thālyō, tyastō narāmrō kurāharu āuna thālyō.
1: ratijopanimerotimaɡmaeksintseiaunataliotistunaramlikuɡaiʃotakiu
2: ratjopanimiɽidimaɡmaeksansaiaunatalotsasunamikuɡaiɽamtaliu
3: rʌtijopʌnimirudimaɡmʌiksinseiawunʌtalijudistunʌamʃikodaʃontaɡu
G: ratjopʌnimerodimaɡmaektsintseiaunʌtaliotistonʌʃamʃekoɡaʃotaljo
Acc. 1: 87, Acc. 2: 73, Acc. 3: 81, Acc. Tot Av: 80.
83 (female,normal,no respeaking):
ISO 15919: bacēra āējastō anubhava bhayō tyatibēlā.
1: bʌtsiʃaidzestoanuvapajetikteviolaa
2: bosiɽaitsesuamanuɸapaititiɡele
3: βosiʃaitsistsuʌmʌmʌβʌpaididiβele
G: botsiraidzistoʌmʌnuvabʌjotittiβela
Acc. 1: 74, Acc. 2: 73, Acc. 3: 73, Acc. Tot Av: 73.
84 (female,normal,no respeaking):
ISO 15919: tyasapachi cai hāmī āphu surakṣita chau jastō lāgyō.
1: ratestotsitsehamiafuseratsitsaozistalaɡi
2: ʃatsistsitetehamiaɸasuɽetsetsoŋtsuselaɡi
3: erʌdestotsetsehʌmijaɸuserʌtsitseudzustulaɡi
G: ʃatestotsitseihamiapusuʃʌk itsaudzestelaɡi
Acc. 1: 84, Acc. 2: 66, Acc. 3: 71, Acc. Tot Av: 74.
85 (female,noisy,respeaking):
ISO 15919: kēhī samasyā āēkō cha
1: keisamaseaaikosa
2: kisamasaikusa
3: ɡehisʌmʌsteʌiɡusʌ
G: keisamasjaaekotsa
Acc. 1: 90, Acc. 2: 76, Acc. 3: 65, Acc. Tot Av: 77.
86 (male,normal,respeaking):
ISO 15919: ullē bhanē anusāra tyahā cai ēkdamai ēkdamai jōkhima pūrṇakō bāṭō thiyō
1: pʌniosarbʌnionusaʃtseteatseiiktamidamidzuhimpuʃnaɡobaʈotieʃe
2: paniosalbanionasadzetsetsedzameikomadzokimbulnaɸabadotieɽe
3: pʌnijasaɽpʌnijʌnʌsʌrtʌtejʌtseiɡdomeekdomedzukimbuɽneɡobaɽɖotijoʃe
G: pʌniʌsarbʌnionusaʃtsetjatseiekdameidʌmidzokimpuɳʌɡobaɖotiere
Acc. 1: 89, Acc. 2: 73, Acc. 3: 71, Acc. Tot Av: 78.
87 (male,noisy,no respeaking):
ISO 15919: jahilē pani garī rahēkō hunchana .
1: tsailipaniɡoʃiʃaikovantsal
2: tsalipaniɡaɽiʃaiɡountsana
3: tsailipuniɡuʃiɡʌʃibomonsʌli
G: dzʌlipʌniɡʌʃiʃaeɡohʌntsʌnʌ
Acc. 1: 74, Acc. 2: 79, Acc. 3: 65, Acc. Tot Av: 73.
88 (female,normal,respeaking):
ISO 15919: ra uhāharulāī bhēṭēpachi khuśī lāgyō.
1: rawatlaibitepasikusilaɡio
2: ʃawanlepetsepasikusilaɡo
3: rauwadlaibihidepʌsikusilaɡijo
G: rawarlaibeʈepʌtsikusilaɡio
Acc. 1: 91, Acc. 2: 81, Acc. 3: 77, Acc. Tot Av: 83.
89 (female,normal,respeaking):
ISO 15919: dhēraijanā parkhēra basēkā hunuhunthiyō.
1: sperezanapʌrkirabasekaununtijo
2: sɸeɽisanapokiɽabasekeununtsip
3: derisʌnʌpolkidʌβʌsikʌhununtijo
G: deridzanapʌrkerabʌsekaununtijo
Acc. 1: 89, Acc. 2: 73, Acc. 3: 80, Acc. Tot Av: 80.
90 (male,normal,no respeaking):
ISO 15919: ra pachī mailē phēsbukamā pani mailē mērō hālēkō thiē
1: ʃabotsipaileefestukapanemiʃudahaleɡote
2: rabasibaileneɸesukabanimiɽadzaliɡotse
3: rʌbosimʌilimnʌɸeicbukʌbʌnijomirurʌʌliɡotijo
G: rabʌtsimailemepesbukaβʌnemerolahaleɡotie
Acc. 1: 82, Acc. 2: 74, Acc. 3: 65, Acc. Tot Av: 74.
91 (male,normal,respeaking):
ISO 15919: ra gāḍīharu cāhi ēkdamai purānō āja bhandā kamsēkama
1: raɡaiɖieʃutsekdomepuʃanoazovandakomtsekom
2: raɡaiɽiʃutsedzameamapanoasawandzaɡomsiɡoma
3: rʌɡʌʃiʃutseikdomeʌmʌpurʌnoʌsuwʌndakomsekom
G: raɡaɖieʃutsekdʌmepuʃanoazʌvandakʌmsekʌm
Acc. 1: 93, Acc. 2: 68, Acc. 3: 71, Acc. Tot Av: 77.
92 (female,noisy,no respeaking):
ISO 15919: ṭikēṭa liēkā thiyau kāṭhamāḍa pharkinalāī, nēpālagañjabāṭa. ra 1 ghaṇṭājati ukta bimānasthalamā parkhēpachi
1: handetikatʃekatijokatsmandopaʃkimʌvainepaliontsvataraekantazatiutabimanistalmapalkiukasusei
2: tsandzetsikatsliɡatsimɡatsʌndzupakinalainipailinsβetsalaikamtsasapiutsabimanstsanmabakipasisai
3: hʌndetiketliɡʌtinkaʈmʌnduɸolkinulaimipaduwanisuwetʌʃʌekontʌsʌdiuŋtebimenistʌlma
G: hamleʈikʌtlekatiokaʈmanduɸarkinʌlainepalioŋsvaʈʌraekʌnʈazateuktabimanistalmapʌrkepʌsetsei
Acc. 1: 82, Acc. 2: 70, Acc. 3: 63, Acc. Tot Av: 72.
93 (male,noisy,respeaking):
ISO 15919: phērī pharkēra āē .
1: heripʌrkeraai
2: periɸakirai
3: ɸeriɸarkʌrʌi
G: perifʌrkeraae
Acc. 1: 90, Acc. 2: 73, Acc. 3: 73, Acc. Tot Av: 78.
94 (male,noisy,respeaking):
ISO 15919: . ra tyō ēriyāmā cai gāḍī calāunē mātra tīna mātra mānchēharu
1: ratijoerihamatseiɡaiɖisodahonematʃtindzanamatʃmantsihaʃu
2: aratjoeɽijamasiŋɡaɽisoneɡmatsiltindzanaʃatsadzmantseʃu
3: ʌrotijoirijʌmʌtseikaɽitsʌlounhemʌtrʌtindzanʌmʌntʌrʌmandzeheru
G: ratjoerijamatseiɡaɖitsʌlaunematʃtindzʌnʌmʌtʃmantseʃu
Acc. 1: 80, Acc. 2: 69, Acc. 3: 70, Acc. Tot Av: 73.
95 (female,normal,no respeaking):
ISO 15919: ra hāmalāī tyahā chōḍēra āmā ra bubā cai pharkisaknu bhaēkō thiyō, vimānasthalabāṭa.
1: ʃahamlaitehatsoʃiʃaamaʃabubatseifʌrkisʌkŋvaɡotijobimanistalvata
2: ʃaamlaitsehesuɽiraamabubabasaiɸokisakinwaɡotsimiministsalwatsa
3: rʌʌhʌmleidehansorirʌʌmʌdabuɡʌnseiɸolkisʌɸʌnuwʌɡʌtijomihanistʌlɡʌta
G: ʃahamlaitehatsoɖerʌamarabubatseifʌrkisʌknwakotijobimanistalvaʈa
Acc. 1: 93, Acc. 2: 72, Acc. 3: 74, Acc. Tot Av: 80.
96 (female,normal,respeaking):
ISO 15919: āēra basnubhaēkō
1: seiteaaeʃabʌsevakotijo
2: tsaitjaeerʌbʌsnukʌtiu
3: setihʌʌjirʌbosnwakʌtijo
G: seitjʌaerʌbʌsnvakotijo
Acc. 1: 85, Acc. 2: 76, Acc. 3: 74, Acc. Tot Av: 78.
97 (male,noisy,no respeaking):
ISO 15919: ra gāḍīlāī agāḍī jāna diyō ra hāmī cai hiḍēra gayō.
1: raɡaiɖilaoɡaiɖidanadiahamitseihimiraɡojo
2: raɡadzidzaunʌnatseamtsamiɽaɡʌu
3: rʌɡʌrilʌuwʌɖiɽaneduwahamitseihiniʃʌɡojʌ
G: raɡaɖilaʌɡaɖidzanʌdijʌhamitseihamiʃaɡʌjə
Acc. 1: 81, Acc. 2: 51, Acc. 3: 72, Acc. Tot Av: 68.
98 (female,normal,no respeaking):
ISO 15919: nēpālagañja pharkēkā thiyau.
1: nepalkontsfarkekatijo
2: nebalɡansoɡeɡekiu
3: nebalɡonsɸolkiɡettijo
G: nepalɡʌndzfʌʃkeketijo
Acc. 1: 90, Acc. 2: 71, Acc. 3: 86, Acc. Tot Av: 82.
99 (female,noisy,respeaking):
ISO 15919: nēpālagañja tarphanai pharkāēra lagyō ra hāmī
1: nepalɡonstʌfanifoʃkaededlʌɡiuʃahami
2: napaleustsaɸiniɸarkarʌlaɡjoʃahami
3: nepalɡonstolɸoneɸolkaidiloɡjuʃʌhʌmi
G: nepalɡʌnstʌʃfanifʌʃkaeʃedlʌɡioʃahami
Acc. 1: 91, Acc. 2: 70, Acc. 3: 79, Acc. Tot Av: 80.
100 (female,noisy,no respeaking):
ISO 15919: bhēṭa bhayō
1: mibajo
2: ebaiu
3: bidpʌjo
G: metbʌjo
Acc. 1: 75, Acc. 2: 44, Acc. 3: 83, Acc. Tot Av: 67.
101 (male,noisy,no respeaking):
ISO 15919: ra rāmrai bhayō tyaspachī pharkēra ā dā khērī
1: ramnivʌietetespasiparɡeɖaodahiʃi
2: dzameβoitsitsispasiɸaɡeɽaɽauɽeɡeɽe
3: rʌmliβaitetismʌtsiɸʌriɡeʃauʃeɡeɽe
G: ramrebʌjotiotjespʌtsiparɡeʃaudakeʃi
Acc. 1: 75, Acc. 2: 56, Acc. 3: 66, Acc. Tot Av: 66.
102 (male,noisy,respeaking):
ISO 15919: phērī gāḍīlāī agāḍī gēaramā hālēra agāḍī tānna thālyō
1: pirikaɖilʌhaɖiɡieʃmaaleʃaoaɖitaŋnatalio
2: peɽiɡalaaiɖiɡeamaleʃaaitannatale
3: ɸeriɡaʃilʌʌjiʃiɡiʃmʌhʌleʃʌoʃitannʌtaliu
G: periɡaɖilaiauɡaɖiveaʃmaaleʃaaiɖitannatalio
Acc. 1: 80, Acc. 2: 71, Acc. 3: 65, Acc. Tot Av: 72.
103 (male,normal,no respeaking):
ISO 15919: mana darō banāēra basnusa bhanēra uniharulē bhanyō .
1: muandoruvanarabosnusuanerovaneluboanio
2: wandzoɽoβanaɽabosusʌnʌɽonelabane
3: wʌndoɽuɡʌneɽʌbosnusʌneʃʌwuneɽlubʌne
G: mandʌrobʌnaerʌbʌsnusʌnerʌnelubʌnio
Acc. 1: 72, Acc. 2: 70, Acc. 3: 75, Acc. Tot Av: 72.
104 (female,normal,respeaking):
ISO 15919: tyahā cai ukta bimānasthalamā cai
1: tehadzeiuktabimanistalmadzei
2: tsahesaiudzaβibanistsalmatse
3: dehatseukdʌbimʌnistalma
G: tehatseiuktabimanistalmatsei
Acc. 1: 98, Acc. 2: 76, Acc. 3: 79, Acc. Tot Av: 84.
105 (female,normal,respeaking):
ISO 15919: nēpālagañjabāṭa kāṭhamāḍa pharkiyō bhanēra.
1: nepaliosvatʌkaʈuandofarkijobanera
2: nipaleŋswedzakatsundzaɸokiubʌniɽa
3: nepalɡonsβatʌkaʈmʌnɖuɸolkiʌβenida
G: nepaliosvaʈʌkaʈʌndofarkijobʌnerʌ
Acc. 1: 89, Acc. 2: 64, Acc. 3: 72, Acc. Tot Av: 75.
106 (female,normal,no respeaking):
ISO 15919: yēti bhannē ēuṭā plēna kō cai bihāna 10-11 bajē tirakō cai
1: jetipanejotpleinkodzeibihanodoseɡaʃbʌsitiʃakodzeiham
2: jetsipaniuplainkosaiammihanadzosiɡaʃabasidzidzakosaiamham
3: iʌtipanijʌplinɡuseimihanudosiɡaʃʌbʌstidiʃʌkosei
G: jetibanneeuʈplenkotseiumbihanʌdoseɡaʃbʌdzetiʃʌkotseiham
Acc. 1: 84, Acc. 2: 69, Acc. 3: 70, Acc. Tot Av: 74.
107 (female,normal,respeaking):
ISO 15919: sana 2008 tira kō kurā hō
1: sʌnduihazaʃaʈirekokurao
2: sandzujasaatsiɽaɡuɡuɽau
3: sʌnduwihʌsʌrʌrdirʌɡoɡuʃʌho
G: sʌnduihazaʃattiʃʌkokurao
Acc. 1: 91, Acc. 2: 67, Acc. 3: 77, Acc. Tot Av: 78.
108 (female,normal,respeaking):
ISO 15919: kāṭhamāḍa lānukō saṭṭā
1: katsvandulanukosaʈa
2: kaʈmandulanukosetse
3: kaʈmʌnɖulʌnɡusʌtta
G: kaʈandulanukosaʈʈa
Acc. 1: 84, Acc. 2: 78, Acc. 3: 75, Acc. Tot Av: 79.
109 (female,normal,no respeaking):
ISO 15919: ra yasapachi cai hāmī jhanai āttiyau
1: araespadzidzahamizamiatijo
2: aɽeispatsatsamisaniatsub
3: eraisɡotsusʌhʌmisʌmijaattijo
G: areespasitsahamidzʌniattijo
Acc. 1: 86, Acc. 2: 65, Acc. 3: 67, Acc. Tot Av: 72.
110 (female,noisy,no respeaking):
ISO 15919: nēpālagañjamā cai avataraṇa gariyō
1: nepaliɡozmatseiabateʃlŋɡaiɖijo
2: snipalɡomatsaiaβadzeŋoɽijo
3: nepalɡonsimʌtseiʌβatenɡaʃijo
G: nepalaɡʌzmatseiʌvʌteluɡʌʃijo
Acc. 1: 73, Acc. 2: 65, Acc. 3: 75, Acc. Tot Av: 71.
111 (female,noisy,respeaking):
ISO 15919: ra ēkadamai ḍaralāgdō avasthā thiyō tyō.
1: raektomidarlaktoawostatiotijo
2: eraekdamidzalaɡʌwastatitiu
3: rʌikdomedʌrlaɡdʌoβʌstatiutijo
G: raektʌmidarlaɡdoʌvastatitijo
Acc. 1: 86, Acc. 2: 73, Acc. 3: 85, Acc. Tot Av: 81.
112 (male,normal,no respeaking):
ISO 15919: pahirōlē alikati kṣyati purāēkō bhannē thiyō .
1: poiʃolealkotitsitibuʃaiɡowantijo
2: boidzalealkatsitsitibuɽaiɡawanitsiu
3: poiʃolealkatitsetipuɽʌiɡʌβʌnitijo
G: pʌiroleʌlkʌtitsitibuʃaeɡowannetijo
Acc. 1: 88, Acc. 2: 73, Acc. 3: 82, Acc. Tot Av: 81.
113 (male,normal,no respeaking):
ISO 15919: tyō bēlā samjhēkō thiē .
1: tijolafamzekote
2: tselaɸamaɡete
3: teulasʌmsiɡʌti
G: tjolasamdzeɡotie
Acc. 1: 79, Acc. 2: 59, Acc. 3: 76, Acc. Tot Av: 71.
114 (male,normal,no respeaking):
ISO 15919: sadarmukāmamā phērī pharkēra āyau .
1: saderukamahirifʌrkeraajum
2: soɡudzukamahiliɸokaɽaim
3: sʌɖʌɖβukʌmmʌkeriɸolkerʌju
G: sʌdʌrukamakerifʌrkeraajm
Acc. 1: 86, Acc. 2: 66, Acc. 3: 69, Acc. Tot Av: 74.
115 (female,noisy,no respeaking):
ISO 15919: ra paricārikālē kē bhanē bhandā
1: raporitsairikarlikibanubanda
2: eɽapolisalikalikiβiniβandze
3: erʌpoditsʌdikalikiβʌnipʌnda
G: rapʌritsairikarlikibʌnebʌnda
Acc. 1: 90, Acc. 2: 69, Acc. 3: 74, Acc. Tot Av: 77.
116 (male,noisy,respeaking):
ISO 15919: ra tyō bēlā pani ṭhyākka dimāga mā huncha ni
1: ratijovelavnitekadinaɡmaodzeniodioɖa
2: eʃatjoβelaβeniʈʌkkʌdimansaniɖiuɽa
3: eʃatjoβelʌpʌnitakkadimawʌunsʌnideuɖa
G: ʃatjoβelaβvniʈakkʌdinaɡmʌodziniodeuɖʌ
Acc. 1: 82, Acc. 2: 71, Acc. 3: 72, Acc. Tot Av: 75.
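The per-item accuracy figures above (Acc. 1 to Acc. 4 and the total average) compare each participant's transcription against the gold standard (G). Purely as an illustration of how such percentages could be derived, the sketch below computes a character-level accuracy from Levenshtein edit distance; the function name `char_accuracy` is hypothetical, and the metric actually used in this study may differ, so it will not necessarily reproduce the figures reported here.

```python
def char_accuracy(hyp: str, ref: str) -> int:
    """Character-level accuracy (%) of a hypothesis transcription
    against a reference line, via Levenshtein edit distance.
    Illustrative only: the study's actual scoring metric may differ."""
    m, n = len(hyp), len(ref)
    # Dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return round(100 * (1 - prev[n] / max(n, 1)))

# Item 93's data, participant 1 against the G line (for illustration;
# this simple metric does not reproduce the 90 reported above).
print(char_accuracy("heripʌrkeraai", "perifʌrkeraae"))
```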