Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
A Fully Annotated Corpus of Russian Speech Pavel Skrelin, Nina Volskaya, Daniil Kocharov, Karina Evgrafova, Olga Glotova, Vera Evdokimova Department of Phonetics, Saint-Petersburg State University Universitetskaya Emb., 11, 199034, Saint-Petersburg, Russia E-mail: skrelin@phonetics.pu.ru, volni@phonetics.pu.ru, kocharov@phonetics.pu.ru, evgrafova@phonetics.pu.ru, oglotova@phonetics.pu.ru, postmaster@phonetics.pu.ru Abstract The paper introduces CORPRES – a fully annotated Russian speech corpus developed at the Department of Phonetics, St. Petersburg State University as a result of a three-year project. The corpus includes samples of different speaking styles produced by 4 male and 4 female speakers. Six levels of annotation cover all phonetic and prosodic information about the recorded speech data, including labels for pitch marks, phonetic events, narrow and wide phonetic transcription, orthographic and prosodic transcription. Precise phonetic transcription of the data provides an especially valuable resource for both research and development purposes. Overall corpus size is 528 458 running words and contains 60 hours of speech made up of 7.5 hours from each speaker. 40% of the corpus was manually segmented and fully annotated on all six levels. 60% of the corpus was partly annotated; there are labels for pitch period and phonetic event labels. Orthographic, prosodic and ideal phonetic transcription for this part was generated and stored as text files. The fully annotated part of the corpus covers all speaking styles included in the corpus and all speakers. The paper contains information about CORPRES design and annotation principles, overall data description and some speculation about possible use of the corpus. 1. with the idea that it might be suitable for use in a wider range of phonetic research and development. Therefore, the corpus was designed along a number of principles. Firstly, the sample was to represent a number of speaking styles. As the corpus included only read speech, different styles of texts were selected for recording with specific characteristics of those styles in mind: - an action-oriented fiction narrative resembling conversational speech; - a fiction narrative of a more descriptive nature containing longer sentences and very little direct speech; - a play containing a high number of conversational remarks and emotionally expressive dialogues and monologues; - purely informational neutral texts on IT, politics and economy containing terminology, geographical and proper names, numerals, acronyms and abbreviations. The choice of diverse texts served our other goal of making the corpus phonetically and prosodically rich, i.e. to contain a large number of all Russian phonemes in all possible contexts and a wide range of diverse prosodic structures, and to provide for good lexical representation. The corpus is composed of 60 hours of speech recorded from 8 speakers (7.5 hours from each speaker). Thirdly, the corpus was intended as a sample of Standard Russian (St. Petersburg pronunciation variant); dialect variation was not accounted for. However, records were made from eight speakers, four men and four women, in order to cover a certain degree of variation within the St. Petersburg pronunciation variant. Fourthly, it was necessary to ensure consistently high quality of all data both in terms of technical characteristics and voice quality. The latter objective was achieved by recording professional speakers: some of them worked in radio broadcasting; some were professional actors or television newsmen. In addition to voice training, pleasantness of voice and clear articulation were considered. Introduction Contemporary research both in linguistic phonetics and speech technology is largely based on and can largely benefit from the use of large speech corpora. The corpus to be used for these purposes needs to meet the following requirements: it has to contain a large sample of speech data, to ensure a consistently high quality of the data, and to have annotation that enables researchers of a wide range of phonetic issues to search for and find specific data that is valid and reliable. Good examples of such a resource are the corpora developed for Dutch (Van Son et al., 2001). For the Russian language, the existing speech corpora tend to serve a narrow practical purpose (Arlazarov et. al., 2004). Therefore, the need for a fully annotated large corpus of Russian speech recorded at a consistently high quality is evident. In this paper we present CORPRES – a fully annotated COrpus of Russian Professionally REad Speech developed at the Department of Phonetics, Saint-Petersburg State University as a result of a three-year project. The corpus meets all of the requirements to databases of this kind listed above and may be used both for the purposes of development and scientific research. It is large enough for statistical machine learning (60 hours of continuous speech) and has six annotation levels including prosodic annotation, rule-based canonical phonetic transcription and manual transcription reflecting the actual sounds pronounced by the speakers. In the paper, we describe the corpus design and data and discuss the principles and issues behind its development. 2. Corpus Design The aim of the corpus was to provide a large sample of Standard Russian continuous speech. It was originally intended for use in unit-selection TTS synthesis, however, 109 Figure 1: Annotation levels. The recordings were made in the recording studio at the Department of Phonetics, University of St. Petersburg. Motu Traveler multi-channel recording system, an AKG capacitor microphone and WaveLab software were used. The recordings have a sample rate of 22050 Hz and a bitrate of 16 bits. Before the recording sessions, all texts were revised to detect and resolve ambiguities caused by nonstandard words, terminology etc. All transliterated foreign language words, terminology, acronyms and numbers were clarified in the prompts to avoid difficulties and mistakes. In case of doubt, speakers could ask for instructions from researchers present at the studio. Slips of the tongue were noted, and the speakers were asked to read the passages where they occurred once again. The final, but the most crucial objective we had in mind was to ensure that the annotation of the corpus covers a wide range of information that may be of interest to those involved in most areas of phonetic research. There are six annotation levels that will be further discussed in greater detail. 3. prominent words. Prosodic transcription on Level 6 includes labels for different types of pauses, types of tone unit, and non-speech events such as laughter or breathing. Figure 1 shows the six levels at which the annotation is done. (Levels 1-6 are not in numerical order for the purpose of clearer visual design.) 3.1 Detecting and Labeling Fundamental Frequency Periods of The fundamental frequency periods were detected automatically. A linear combination of the following methods was used for this purpose: autocorrelation, analysis-by-synthesis, spectral domain analysis, estimation of the energy of signal peaks and estimation of the ratio of lengths and correlation of neighboring periods. For a detailed description of the algorithm, see (Kocharov, 2008). The efficiency of automatic pitch detection and pitch periods labeling was about 98%. The results of the automatic procedure were checked and corrected manually. Annotation 3.2 Phonetic Transcription The annotation captures the maximum amount of phonetically and prosodically relevant data. The six annotation levels are as follows: Level 1 – pitch marks; Level 2 – phonetic events labeling; Level 3 – real phonetic transcription (this is performed manually and reflects the sounds actually pronounced by the speakers); Level 4 – ideal phonetic transcription (this level is automatically generated by a linguistic transcriber in accordance with a canonical set of rules); Level 5 - orthographic transcription; Level 6 – prosodic transcription. Levels 1 and 2 contain information on various phonetic events: epenthetic vowels, voice onset time, voiced plosure, stationary parts of voiceless consonants, laryngalization, and glottalization. The phonetic events were annotated manually by expert phoneticians. Level 5 also contains information on prosodically Phonetic transcription is of fundamental importance in speech corpora as it reflects characteristic phonetic features of speech. The transcription system should be well-grounded linguistically and also comprehensible for corpus users. In CORPRES transcription is available at two levels. Level 3 contains narrow phonetic transcription. We called this transcription level ‘real’ phonetic transcription because it reflects the sounds actually pronounced by the speakers. The ‘ideal’ transcription found at Level 4 was generated in accordance with a set of phonological rules without reference to the actual sound. As a result, Level 4 contains a canonical phonetic transcription of the speech sample. The transcription symbols used were a version of SAMPA for the Russian language. To mark positional allophones of 6 Russian vowel phonemes /a/, /o/, /i/, /u/, /e/, /y/ 18 symbols were used. Each vowel symbol contained indication of the sound’s position regarding stress. Thus 0 was used to for a stressed accented vowel, 1 - for an 110 It is impossible to estimate the number of phonemes in the part of the corpus which was not annotated on phonetic transcription levels, therefore, two cells in the table remain empty. unstressed vowel in a pretonic syllable, 4 – an unstressed one in a post-tonic syllable. The set of consonant symbols included 41 symbols to cover 36 Russian consonant phonemes and 5 voiced allophones of voiceless consonants which occur frequently at word junctions. To produce the real phonetic transcription, the speech signal was manually segmented, transcribed and peer-revised by expert phoneticians. Ideal phonetic transcription was generated automatically by an automatic transcriber. The labels were placed automatically to coincide with the label positions produced manually on the real transcription level. Procedure of automatic labeling is based on calculating the Levenshtein distance. Automatic labeling is not perfect due to the mismatch of ideal and real phonetic transcriptions. Therefore, the results of the automatic procedure were further manually corrected. 5. 3.3 Orthographic and Prosodic Transcription Prosodic information was marked by expert phoneticians on the basis of perceptual and acoustic analysis of the speech data in a text file containing orthographic transcription. Labels were later automatically transferred from the text file to the annotation files to coincide with the phonetic transcription levels. Orthographic transcription was stored on Level 5, it contains the boundaries of words and word labels. Besides the prosodically prominent words are labeled with special symbols. Prosodic information was stored on Level 6, it contains the boundaries of tone units and pauses and their labels. The set of symbols to label pauses and tone units and the principles behind the labeling process are described in detail in (Volskaya & Skrelin, 2009). 4. Total Correctly Mispronounced Elided Count 1 118 833 947 508 101 292 70 033 Percents 100 84.7 9.05 6.25 Table 2: Ideal vs. real transcription. Table 2 reveals that despite the fact that as many as 84.7% of the ideal transcription reflects the actual pronunciation, 9.05% of the expected sounds are replaced by other sounds, and 6.25% of the expected sounds are actually not pronounced at all. Table 3 shows in percentage terms the ratio between vowel realizations according to ideal transcription (down) and real transcription (across). 0 is used to mark a stressed vowel, 1 – a pretonic vowel, and 4 – a post-tonic vowel. The column Total shows the whole number of corresponding allophones. This data shows that there is a certain degree of variation even for stressed vowels that tend to be more stable than the unstressed ones, with approximately 1-3% of them pronounced as allophones of other phonemes. Some of the unstressed vowels are especially unstable, e.g. less than 50% of post-tonic /a/ vowels are pronounced as /a/, while a third of them is pronounced as /y/ allophones. The vowel variation findings support those obtained earlier on a smaller corpus of read and spontaneous speech (Bolotova 2003). A closer look at vowel variation data provides insight into the changes in Standard Russian. The general phonotactic rule for unstressed vowels is that /e/ and /o/ do not generally occur in the unstressed position, but can be found in a small number of words, mostly loan words and Corpus Data Description Overall corpus size is 528,458 running words. 40% of the corpus (24 hours of speech) was manually segmented and fully annotated on all six levels. 60% of the corpus was partly annotated; there are labels for pitch period and phonetic event labels. Orthographic and prosodic transcription, as well as the ideal phonetic transcription (see Section 3 for detail) for this part was generated and then stored as text files, but was not transferred to sound file labels. The fully annotated part of the corpus covers all speaking styles included in the corpus and all speakers. Table 1 shows general corpus statistics. Fully Annotated Data Phonemes 1 048 867 Words 211 437 Tone Units 64 055 Hours 24 Partly Annotated Data – 317 021 86 546 36 Findings Based on the Corpus Data As CORPRES contains a large sample of high quality speech data with detailed annotation, it enables researchers of a wide range of phonetic issues to search for and find specific data that is valid and reliable. The fact makes it suitable for use in a wide range of phonetic research. For the time being, the necessary information from the corpus (e.g. sound variants and their frequency distribution and etc.) is obtained by means of specially designed computer programs to suit a certain task. For instance, consulting the corpus we can obtain important information about the changes in the Russian standard pronunciation (Bondarko, 2009). In Table 2 we compare the ideal phonetic transcription reflecting the way the speech sample is supposed to be pronounced according to the canonical transcription rules of the Russian language and the real phonetic transcription reflecting the way it actually was pronounced by the speakers recorded. Total Amount – 528 458 150 601 60 Table 1: General corpus statistics. 111 foreign names, and contexts (post-tonic /e/ is mostly found in word-final open syllables) (e.g. radio /r a0 d’ i4 o4/, izvinite /i1 z v’ i1 n’ i0 t’ e4/, Hemingway /h e1 m’ i1 n g u1 e0 j/. Our data showed that unstressed /e/ is pronounced as /i/ or /y/ in 40-45% of the cases. The unstressed /o/ is pronounced in 77.4% and appears to be more stable. Therefore, we may assume that the phonotactics of Standard Russian is going through change in this respect. a0 a1 a4 e0 e1 e4 i0 i1 i4 o0 o1 o4 u0 u1 u4 y0 y1 y4 a e i o 0.1 98.3 1.5 80.7 3.9 0.1 1.6 46.3 13.2 1.6 4.6 1 0.4 97.6 0.6 61 13.2 0.6 55.6 18.9 1.1 0.5 98.9 0.1 6.2 91 0.2 0.6 19 77.4 0.3 0.1 0.2 99.1 1.3 0.3 0.1 93.4 7.1 3 71.7 0.2 0.2 0.9 0.2 1.6 0.9 2.4 0.4 0.6 1.3 6.9 7.1 0.8 1 9.2 0.3 0.8 u 0.5 1.3 0.6 2.2 0.1 0.8 0.9 0.2 2.2 5.1 99.7 98.5 92.8 1 2 2 y 0.1 13.1 33 0.9 23.9 22.2 0.5 1.8 1.9 0.3 2.8 13.1 0.1 0.4 2.1 97.9 81.9 86.7 would largely benefit both the development of speech technology applications and theoretical research in Russian phonetics. 6. Conclusion The Department of Phonetics, SPSU developed a fully-annotated large corpus of Russian speech including samples of different speaking styles produced by 4 male and 4 female speakers. The six levels of annotation cover all phonetic and prosodic information about the recorded speech data. Precise phonetic transcription of the data provides an especially valuable resource for both research and development. The corpus may be used for unit-selection TTS synthesis purposes, as well as a bootstrapping corpus for speech recognition systems, or as data for research in Russian phonetics and inter- and intra-speaker variability. Total 52 769 76 992 53 667 30 861 159 90 20 596 47 840 38 799 43 875 1 945 99 12 503 12 729 9 144 9 355 6 275 14 337 7. References Arlazarov V.L., Bogdanov D.S., Krivnova O.F., and Podrabinovitch A.Ya. (2004). Creation of Russian Speech Databases: Design, Processing, Development Tools. In Proceedings of SPECOM'2004. St. Petersburg, pp. 650--656. Bolotova O. (2003). On some acoustic features of spontaneous speech and reading in Russian (quantitative and qualitative comparison methods). In: Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona: Causal Productions Pty Ltd, pp. 913--916. Bondarko L. (2009). Short Description of Russian Sound System. In: De Silva V., Ullakonoja R. (Eds.), Phonetics of Russian and Finnish: General Description of Phonetic Systems. Experimental Studies on Spontaneous and Read-Aloud Speech. Frankfurt am Main: Peter Lager, pp. 77--87. Kocharov D. (2008). Avtomaticheskoe opredelenie chastity osnovnogo tona pri pomoschi linejnoj kombinatsii razlichnih metodov // In. Materialy XXXVII mezhdunarodnoj filologicheskoj konferentsii, St. Petersburg: SPbSU, pp. 7--11. (In Russian) Van Son R.J.J.H., Binnenpoorte D., Van Den Heuvel H. and Pols L.C.W. (2001). The IFA Corpus: a Phonemically Segmented Dutch “Open Source” Speech Database. In Proceedings of Eurospeech 2001. Aalborg, pp. 2051--2054. Volskaya N.B., Skrelin P.A. (2009). Prosodic model for Russian. In Proceedings of Nordic Prosody X. Frankfurt am Main: Peter Lager, pp. 249--260. Table 3: Ideal vs. real transcription: vowels. As the annotated part of the corpus used for this analysis includes an even distribution of all of the represented speaking styles and speakers, we can expect that similar results could be obtained from the analysis of the rest of the corpus. This clearly shows that the ideal transcription alone does not yield data that would be sufficient or valid for any type of phonetic research or practical application. Therefore, despite the large amount of human and financial resources required, precise phonetic transcription seems to be an indispensible part of corpus annotation at the present moment. There appear to be two ways of overcoming the discrepancy between rule-based transcription and manual transcription. One possible solution is to bring the automatic transcriber up-to-date by using the obtained information about the actual sound pronunciation. In this respect, the present corpus and its two levels of phonetic transcription may be used as a database for revising the traditional view of Standard Russian pronunciation and introducing new phonetic transcription rules. The other solution is to avoid automatic rule-based transcription altogether and transcribe all of the data manually. The former course of action appears to be more preferable as the emergence of a set of rules reflecting the current state of the language 112