Multimodal Russian Corpus (MURCO): First Steps

Elena Grishina

Institute of Russian Language RAS
18/2 Volkhonka st., Moscow, Russia
rudi2007@yandex.ru

Abstract

The paper introduces the Multimodal Russian Corpus (MURCO), which has been created within the framework of the Russian National Corpus (RNC). The MURCO provides users with a large amount of phonetic, orthoepic and intonational information about Russian. Moreover, the deeply annotated part of the MURCO contains data on Russian gesticulation, the speech act system, the types of vocal gestures and interjections in Russian, and so on. The Corpus is freely accessible. The paper describes the main types of annotation and the interface structure of the MURCO. The MURCO consists of two parts, the second being a subset of the first: 1) the whole Corpus, which is annotated from the lexical (lemmatization), morphological, semantic, accentological, metatextual and sociological points of view (these types of annotation are standard for the RNC), and also from the point of view of phonetics (the orthoepic annotation and the mark-up of accentological word structure); 2) the deeply annotated MURCO, which is additionally annotated from the point of view of gesticulation and speech act structure.

1. Introduction

As the programs of LREC’2008 and LREC’2010 have shown, the construction of multimodal corpora is undoubtedly in the mainstream of contemporary corpus linguistics.
The elaboration of multimodal corpora follows 4 lines: 1) speech act classification and identification of the types of dialogue moves which are specific to various real situations (Strauß et al., 2008; Möller et al., 2008; Georgila et al., 2008; Kostoulas et al., 2008; Brutti et al., 2008; Marasek & Gubrynowicz, 2008; Nallasamy et al., 2008); 2) identification and specification of human affects and emotions and their connections with speech and gesticulation (Forbes-Riley et al., 2008; Gnjatović & Rösner, 2008; Wilson, 2008; Devillers & Martin, 2008; Sainz et al., 2008; Fék et al., 2008; Cullen et al., 2008); 3) investigations in the area of the thematic development of dialogue, including the problems of anaphora and reference as a whole (van Son et al., 2008; Stoia et al., 2008; Gallo et al., 2008; Wilks et al., 2008); 4) creation of specialized gesture corpora or of the gesture components of multimodal corpora, and elaboration of gesture classifications and of the set of parameters for gesture description (van Son et al., 2008; Savino et al., 2008; Knight & Tennent, 2008; Blache et al., 2008).

The construction of multimodal corpora runs into some commercial and legal obstacles. Firstly, multimodal corpora created as parts of various business projects very often become inaccessible to an ordinary user. Secondly, multimodal corpora dealing with real persons as informants face legal difficulties concerning copyright infringement and invasion of privacy. It seems that the decisions made by the MURCO constructors (in spite of their shortcomings) let us overcome these obstacles and create a resource which can be useful for researchers in diverse fields of linguistics.

2. Basic MURCO Principles

2.1 Spoken Component of RNC

So far, the RNC contains the Spoken Subcorpus (its current volume is circa 8 million tokens), but this subcorpus does not include oral speech proper: it includes only the transcripts of the spoken texts (Grishina, 2006). The structure of the Spoken Subcorpus of the RNC is as follows:

Types of texts | Million tokens | Percentage
Public spoken Russian | 4.4 | 51%
Private spoken Russian | 0.8 | 10%
Movie speech | 3.4 | 39%

Table 1: Spoken Russian in RNC

To supplement the Spoken Subcorpus of the RNC, or, to be more precise, to transform it, we have to work out a generally accessible and relatively fair-sized multimodal corpus. To avoid the legal problems mentioned above, we have decided to use cinematographic material in the MURCO. Naturally, in the future we are also going to include samples of public and private spoken Russian in the MURCO, but cinematographic Russian is the most appropriate material to begin the project with.

It should be mentioned inter alia that using cinematographic material to elaborate and test the annotation system of a pioneering corpus is far more promising than using “natural” (public or private) spoken Russian. The main reason is that cinema includes an exceptionally manifold set of situations, and this situational variety results in linguistic variety. Therefore, to annotate movie Russian we need a greater number of definitions and a more elaborate system of concepts than to annotate “real-life” Russian. In other words, an annotation scheme worked out on movie Russian will be useful for the markup of “natural” Russian, but the reverse does not hold.
There are some features which distinguish natural spoken speech from cinematographic speech (first of all we mean the parameter of text coherence), but the differences, though noticeable, are not crucial (see (Grishina, 2007a, 2007b) on the usage of discourse markers in movie transcripts; the strategy of their usage is virtually the same in natural and cinematographic spoken Russian). That is to say, the higher coherence of movie transcripts in comparison with the transcripts of natural spoken texts does not turn the former into written texts: they remain spoken ones (Forchini, 2009).

2.2 Outputting Units in MURCO

The MURCO is a collection of clixts. A clixt is a pair consisting of a clip and the corresponding text (i.e. the corresponding part of a movie transcript). A user is supposed to have the opportunity to download not only the text component of a clixt (= the marked-up transcript), but also its sound/video component, so that after downloading a user may employ any program to analyze it. The duration of a clip is within the interval of 5-20 sec.

As we have mentioned above, the total volume of cinematographic transcripts in the Spoken Subcorpus of the RNC is currently 3.4 million tokens. In the near future we will bring it up to 5 million tokens. Therefore, if we manage to transform this subcorpus into a multimodal state, we will obtain one of the largest open multimodal corpora, so the task is ambitious enough.

3. Types of Annotation in MURCO

Since a clixt contains sound (= speech) and/or video tracks, it will be annotated from the point of view of text, sound and video. Therefore, the overall structure of the MURCO annotation ought to be as follows:

Annotation zone: Text, Sound, Video
Types of annotation: Standard RNC annotation, Speech act annotation, Orthoepic annotation, Gesture annotation

Table 2.

So, we see that some types of annotation in the MURCO are standard and quite usual for the RNC; the others are entirely new and specific to the MURCO.

3.1 Standard RNC and MURCO Annotation

The standard RNC annotation includes 5 types (RNC, 2006; RNC, 2009):
• metatextual annotation
• morphological annotation
• semantic annotation
• accentological annotation
• sociological annotation

All these types of annotation will be preserved in the MURCO. Three types of annotation which are specific to the MURCO are as follows:
• the orthoepic annotation
• the speech act annotation
• the gesture annotation.

We’ll describe these new annotation types below. It ought to be mentioned that the orthoepic annotation differs from the speech act and gesture annotations in how obligatory it is. Since the orthoepic annotation is planned to be automatic, it will be obligatory in all texts which will be included in the MURCO. On the contrary, the speech act and gesture annotations are manual; therefore, they will be used only in the so-called “deeply annotated” texts¹.

¹ Naturally, we do not plan to annotate every film which will be included in the MURCO from the point of view of its speech act and gesture structure. We suppose the volume of the deeply annotated subcorpus of the MURCO ought to be about 1 million tokens.

3.1.1. Metatextual Annotation

Every text in the RNC is supplied with extralinguistic sociological information which characterizes the text as a whole. This information forms the so-called metatextual annotation. The main items of the metatextual information concern the author’s characteristics (name, age) and the characteristics of the text as a whole (title, date of creation, genre, and so on). In the MURCO the metatextual annotation of a movie transcript as a whole will be attributed to every clixt derived from this movie.

3.1.2. Morphological Annotation

The morphological annotation in the RNC is provided by the automatic morphological parser “MyStem”, which has been developed by the team of Yandex, the biggest Russian search engine. Every token in the RNC is supplied with morphological information. The morphological string contains a lemma, a part of speech, the constant grammatical characteristics (e.g. gender for nouns, aspect and transitivity for verbs) and the variable grammatical characteristics (e.g. case for nouns, gender-case-number for adjectives, person for verbs, and so on). Search is possible by all these parameters.

3.1.3. Semantic Annotation

The texts in the RNC are semantically tagged with the program Semmarkup (developed by A. Polyakov), which is based on the semantic dictionary of the RNC. The latter, in turn, is founded on the database Lexicograph, developed under the leadership of E. Paducheva and E. Rakhilina (Russian Academy of Sciences). Every word in the RNC is supplied with semantic characteristics, which include three types of tags:
• Class (a name, a reflexive pronoun, etc.)
• Lexical and semantic features (a lexeme's thematic class, indications of causality or assessment, etc.)
• Derivational features (a diminutive, an adjectival adverb, etc.)

The set of semantic and lexical parameters is different for different parts of speech. Moreover, nouns are divided into three subclasses (concrete nouns, abstract nouns, and proper names), each with its own hierarchy of tags.

3.1.4. Accentological Annotation

The annotation of the spoken texts (including the movie transcripts) in the RNC additionally contains accentological (Grishina, 2008; Grishina, 2009a; Savchuk, 2009; Grishina et al., 2010) and sociological (Grishina & Savchuk, 2009) information. It is widely known that stress in Russian is free and mobile, so the accentological information and the possibility of finding a word with a particular stress position are very important for a user.
Moreover, the transcripts of the movies give us the possibility to reflect the real (as opposed to normative) Russian accentological system. 3.1.5. Sociological Annotation The sociological annotation includes the data relating to a speaker (his age, sex, occupation, and if a speaker is an actor, then his name). Strictly speaking, the main features of the sociological annotation coincide with the main traits of the metatextual annotation. However, as long as spoken monologues are very seldom, in spoken subcorpus the sociological information ought to be attached not only to the text as a whole, but to every cue of a text. This task being fulfilled, the special program multiplies the sociological characteristics of a cue and assigns it to every token. Therefore, a user can formulate his morphological, semantic, lexical, accentological queries taking into account this or that sociological characteristics of a speaker (e.g. there is the possibility to form the subcorpora of masculine/feminine cues, of the speakers of certain age, of a certain actor, etc.; it is also possible to search this or that token/lexeme, morphological/semantic/accentological feature in combination with the sociological characteristics of a speaker). 3.2 Automatic, Semi-automatic and Manual Annotation in MURCO 3.2.1. Automatic Annotation The annotation process in the MURCO may be automatic, semi-automatic and manual. Automatic annotation is provided with the corresponding parser. In the RNC and in the MURCO the morphological parser and semantic annotator are fully automatic. The orthoepic annotation in the MURCO will also be fully automatic. We may automatically annotate the combinations of consonants and vowels within the word limits and at the word boundaries. The morphophonemic type of the Russian orthography gives us the possibility to pass on from the orthographical combination of the letters to the orthoepic combinations of the sounds. 
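The automatic step described here, finding combinations of consonants and vowels within words and at word boundaries, can be sketched roughly as follows. This is a minimal illustration in Python, not the MURCO annotator itself: the function names, the labels "CC", "VV", "C#C", "V#V" and "C#V", and the simplified treatment of ь and ъ (they simply break a run) are assumptions made for the example.

```python
import re

VOWELS = "аеёиоуыэюя"
CONSONANTS = "бвгджзйклмнпрстфхцчшщ"


def classify(ch):
    """Return 'V' for a vowel letter, 'C' for a consonant letter, else None."""
    ch = ch.lower()
    if ch in VOWELS:
        return "V"
    if ch in CONSONANTS:
        return "C"
    return None  # ь, ъ, digits, punctuation


def within_word(word):
    """Return runs of 2+ consonant letters or 2+ vowel letters inside a word."""
    pattern = rf"[{CONSONANTS}]{{2,}}|[{VOWELS}]{{2,}}"
    return re.findall(pattern, word.lower())


def boundary_combination(left, right):
    """Return (kind, 'tail#head') for C#C, V#V or C#V boundaries, else None."""
    tail = re.search(rf"(?:[{CONSONANTS}]+|[{VOWELS}]+)$", left.lower())
    head = re.match(rf"[{CONSONANTS}]+|[{VOWELS}]+", right.lower())
    if not tail or not head:
        return None
    kind = classify(tail.group()[0]) + "#" + classify(head.group()[0])
    if kind not in {"C#C", "V#V", "C#V"}:
        return None
    return kind, f"{tail.group()}#{head.group()}"


def annotate(tokens):
    """List the letter combinations found in a sequence of word tokens."""
    found = []
    for i, tok in enumerate(tokens):
        for run in within_word(tok):
            found.append((classify(run[0]) * 2, run))  # "CC" or "VV"
        if i + 1 < len(tokens):
            hit = boundary_combination(tok, tokens[i + 1])
            if hit:
                found.append(hit)
    return found


combos = annotate(["вряд", "ли"])  # includes the boundary combination д#л
```

Each combination found this way could then be stored with its token so that the corresponding clips can be retrieved and listened to.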
Therefore, we may analyze the history and contemporary situation as for the Russian pronunciation. For example, we may firstly obtain all word combinations, which include the letter combinations [d#l] and [t#l] (# means word boundary). Then, listening to the corresponding clips we may analyze the manner of the pronunciation of this letter combination. The obtained result seems to be very interesting: in the combination ‘empty word + full word’ the difference between voiced [d] and voiceless [t] persists, i.e. this word combination functions as one word; in the combination of two full words the voiced [d] sounds as the voiceless [t], i.e. the word boundary # functions here as a voiceless consonant; as for the word combination vr’ad li ‘scarcely, hardly’, its sounding ([dl] or [tl]) depends on the place and the date of the speaker’s birth. The Spoken Subcorpus of the RNC is partially accentuated (namely, in the movie transcripts, which form the considerable part of the Spoken Subcorpus, the stressed syllables are marked). Therefore, we may in automatic mode annotate the accentological structure of a word, e.g. we may mark first, second and so on pretonic vowels, first, second and so on post-tonic vowels, quantity of syllables, quality and number of a stressed vowel. It means that in the MURCO we may receive a set of the clips which fit our accentological query. For example, we may receive the clips which may illustrate different types of the vowel reduction in the second pretonic syllable in Russian. 3.2.2. Semi-automatic Annotation The accentological and the sociological annotations in the RNC and in the MURCO are semi-automatic. To mark the stressed vowels, the spoken texts have been processed with the special program and after that they are corrected manually according to the real pronunciation. 
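Once the stressed vowel of a token is marked, the accentological structure mentioned above (pretonic and post-tonic vowels, quantity of syllables, number and quality of the stressed vowel) can be derived mechanically. The sketch below assumes, purely for illustration, that stress is encoded as a combining acute accent (U+0301) placed after the stressed vowel; the actual RNC encoding may differ, and the names `AccentStructure` and `accent_structure` are hypothetical.

```python
from dataclasses import dataclass

VOWELS = "аеёиоуыэюя"
STRESS_MARK = "\u0301"  # assumed encoding: combining acute after the stressed vowel


@dataclass
class AccentStructure:
    syllables: int          # quantity of syllables (= number of vowel letters)
    stressed_vowel: str     # quality of the stressed vowel
    stressed_syllable: int  # 1-based number of the stressed syllable
    pretonic: dict          # distance from stress -> vowel (1 = first pretonic)
    posttonic: dict         # distance from stress -> vowel (1 = first post-tonic)


def accent_structure(word):
    """Derive the accentological structure of a stress-marked word.

    Returns None if the word carries no stress mark.
    """
    vowels = []
    stressed = None
    for ch in word.lower():
        if ch in VOWELS:
            vowels.append(ch)
        elif ch == STRESS_MARK and vowels:
            stressed = len(vowels) - 1  # the mark belongs to the preceding vowel
    if stressed is None:
        return None
    pretonic = {stressed - i: v for i, v in enumerate(vowels[:stressed])}
    posttonic = {i + 1: v for i, v in enumerate(vowels[stressed + 1:])}
    return AccentStructure(len(vowels), vowels[stressed], stressed + 1,
                           pretonic, posttonic)


# Example query: tokens with the vowel "о" in the second pretonic syllable.
tokens = ["молоко\u0301", "го\u0301род", "голова\u0301"]
hits = [t for t in tokens
        if (s := accent_structure(t)) and s.pretonic.get(2) == "о"]
```

A query of this kind would return the tokens whose clips illustrate vowel reduction in the second pretonic syllable.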
To mark the sociological characteristics of the spoken texts, they have been tagged manually and then processed with a special program, so that the input markup of a cue is assigned to every token.

3.2.3. Manual Annotation

Obviously, we have no possibility of annotating the speech acts and the gestures in the movie clips automatically or semi-automatically. One of the reasons for that (to say nothing of all the technical difficulties) is the fact that to develop automatic or at least semi-automatic tagging of speech acts and gestures we need a test corpus to train a speech act or gesture tagger. So, we face a circularity: to obtain an automatic annotator we need a corpus, and to obtain a corpus we need an automatic annotator. Therefore, to annotate the speech acts and the gestures in the MURCO we can use the manual mode of annotation only. Maybe in the future the MURCO will become one of the possible sources for creating the sought-for speech act parser or gesture tagger.

It is well known that the main shortcoming of any manual annotation is the inability to ensure the uniformity and consistency of the markup. In addition, manual annotation includes a lot of chores which may be automated. These two circumstances make it necessary to create special workbenches for the annotators, to make the process of annotation as easy as possible and its result essentially normalized. The workbenches “Marker” (the workbench for annotating speech acts) and “GesturesMarker” (the workbench for annotating gestures) let an annotator move from point to point, answering questions and selecting one of the displayed variants (for a detailed description of both workbenches see (Kudinov & Grishina, 2009)).

To conclude the section, we may summarize the above.
The types of annotation in the MURCO may be characterized like this:

Assigned \ Method of annotation | Automatic (obligatory) | Semiautomatic (obligatory) | Manual (selected)
to text | – | metatextual annotation | –
to word | morphological, semantic, orthoepic annotation, annotation of accentological word structure | sociological, accentological annotation | –
to clixt (text+clip) | metatextual annotation | – | speech act and gesture annotation

Table 3.

4. MURCO Interface

4.1 Orthoepic Queries

The orthoepic annotation in the MURCO is founded on the morphophonemic principle of Russian orthography, which means that there is a quite transparent correspondence between the spelling of a word and its pronunciation. Therefore, we can annotate combinations of letters in order to retrieve the pronunciation of the corresponding sounds. The crucial types of sound combinations in Russian are as follows:
• C…C = combination of two or more consonants within the word limits
• V…V = combination of two or more vowels within the word limits
• C…C#C…C = combination of consonants at the word boundaries
• V…V#V…V = combination of vowels at the word boundaries
• C…C#V…V = combination of consonants before vowels at the word boundaries

Obviously, it is quite easy to annotate such combinations of letters in a text automatically. Consequently, every token in the MURCO will be assigned the set of letter combinations which may present a difficulty. Naturally, all these combinations are supposed to become searchable.

4.2 Queries on Accentological Word Structure

It is well known that the dynamic nature of Russian stress leads to a high degree of reduction of the unstressed syllables in a word. Consequently, it is very important to give a user an opportunity to obtain information concerning the position of the stressed syllable and the quality of the stressed vowel, the position/quality of the pre- and post-tonic vowels, and so on. Owing to the fact that the majority of the clixts in the MURCO are accentuated, it is possible to annotate the accentological structure of any token in automatic mode. The content of the possible requests is defined in line with Table 4:

 | stressed vowel | pre-tonic vowel | post-tonic vowel | syllables
quality of vowel | A | B | C |
number of syllable | 1 | 2 | 3 |
quantity | | | | 4

Table 4.

In cells A–C of the table a user may specify the letter designation of a vowel (in the stressed, pre- and post-tonic syllable), in cells 1–3 the number of the corresponding syllable, and in cell 4 the quantity of syllables in a word. All these parameters are independent, so a user can freely combine them if necessary. For example, a user may request all tokens containing 1) a second post-tonic syllable, 2) the stressed vowel o, 3) three syllables, 4) the vowel o in the second pre-tonic syllable, provided that the token has 4 syllables and the stressed vowel o.

All these parameters are very important for phoneticians, specialists in orthoepy, dialectologists, and researchers in the history of Russian. In addition, the importance of orthoepic and accentological annotation can scarcely be overestimated, bearing in mind the professional interests of teachers of Russian, above all as a foreign language.

4.3 Speech Act Queries

4.3.1 Sociolinguistic Characteristics of Clixt

1. Quantity of participants (1, 2, 3, many). We distinguish clixts with one, two, three and many participants. Since we describe a clixt from the point of view of speech, “a participant” here means “a speaker”.
Therefore, if one of the characters of a clixt is silent (even if this character is gesticulating), this character is not considered a participant of this clixt. The physiological activities (see below) are not regarded as speech specimens, so if a character in a clixt only sighs, spits, groans, and so on, this character is not considered a participant of this clixt.

2. Participants’ sex (Mas, Fem, Mixed). There are three possibilities here: Male (all the participants of a clixt are of male sex), Female (all the participants are women/girls), Mixed (there are men and women in a clixt).

3. Language (Russian, Russian with accent, Foreign (Ukrainian, English, and so on), Quasi, Secret… the list is open). Naturally, the main language used in the MURCO is Russian. But there are also a lot of inclusions of foreign languages, which ought to be marked. It should also be noted that an annotator has the possibility to mark up the occurrences of “Russian with accent” (for instance, south-Russian dialect, north-Russian dialect, uncertain Russian dialect). An annotator may also mark up the usage of a Quasi-Language (the participants of a clixt speak a non-existent language) and a Secret Language (the participants of a clixt speak a secret language, which is familiar to them but incomprehensible to outsiders; this secret language may be generated from natural Russian according to a definite set of rules or may be a kind of argot or social/professional slang).

4. Social situation (Telephone call, Dinner speech, Talk with authorities… the list is open). The main social situation which is marked up in the MURCO is the “non-specific situation”. It means that the participants of a clixt are connected by non-official or private relations. If the relationships between the clixt participants are official and public, the fact is specially marked.
Among others, we mark Telephone calls, Dinner speeches, Talks with authorities, Shop talks, Restaurant and Taxi orders, and so on. Bearing in mind that annotation of this kind may also be combined with the gesture annotation, it gives us the opportunity to analyze the special social and gesture formulas which are specific to a particular social situation.

4.3.2 Intensional Characteristics of Clixt

1. The types of the speech acts. The basic principle of the meaningful characterization of a speech act in the MURCO is founded on 2 hypotheses: A) in the process of everyday communication a native speaker easily distinguishes one speech act from another, otherwise communication between the members of a speech community would fail; B) the main types of speech acts are embodied in the speech verbs of a given language. These hypotheses, being adopted, let us build a faceted classification of the Russian speech acts, which basically appeals not to linguistic investigations of the different types of speech acts, but to the natural linguistic intuition of an annotator and the experience of previous language usage, which has been engraved in the language itself. Naturally, this decision has a lot of drawbacks (the most serious of them seems to be the unavoidable subjectivity of the annotation), but there seems to be no other choice.
The striving to stick to the pure scientific and logical methods in the field of the speech act definitions leads us to the following risks: a) the impossibility to carry out any speech act annotation of the MURCO at all for lack of generally accepted scientific classification of the Russian (English, French, German, and so on) speech acts (let alone the fact that to create the classification of the kind we need the missing corpus with the manually annotated speech acts, so we face the circularity again); B) suppose we manage to elaborate the wanting speech act classification based on the pure logical and scientific grounds; may we be sure that this classification would be taken as equally logical by an annotator? We do not think so, because it is obvious enough that in the framework of the humanities the classification, which seems to be quite logical and objective to one person, is interpreted as absolutely subjective by the others. Therefore, it is far more preferable to rely upon and give credence to one’s native language and one’s everyday speech activities. In this paper we have no possibility to describe the speech act system of the MURCO in detail (see (Grishina, 2009b), where the interface of the MURCO is outlined), but we ought to mention that the list of the Russian speech acts includes about 150 items, grouped into 13 types (Address or call, Agreement, Assertion, Citation, Complimentary, Critical utterance, Etiquette formula, Imperative, Joke, Modal utterance or performative, Negation, Question, Trade utterance). The majority of these 150 speech acts corresponds to the Russian locutionary verbs, but there are the speech acts lacking the corresponding locutionary verbs, for instance, different types of questions (open, closed, indirect, critical, feedback), some types of negations (alienation), some etiquette formulas (Not at all!, etiquette modesty), and so on. 
This lack of correlations, however, does not change the main principles of the definition of the speech acts in the MURCO. To every clixt may be attached more than one type of speech act, and moreover, every speech act in a clixt may be characterized from different points of view (e.g., an assertion may be characterized at the same time as information, declaration, statement). Thus, the classification of speech act is not tree-like, it is faceted. 2. The completeness of an utterance. This markup zone gives an annotator a possibility to define the types of utterance breakings. On default an utterance is marked up as full one. The types of breakings are as follows: A) self-interruption – a speaker breaks his utterance under the influence of his own change of speech strategy; B) interruption – a speaker breaks his utterances under the influence of some external circumstance (for instance, a listener interrupts a speaker); C) unfinished utterance – a speaker has not intended to finish his utterance, for example, if its completion is absolutely predictable; D) gesture instead of word – the variant of the previous item: an utterances is finished with a gesture, not words; E) continued utterance – the variant of the item C: a speaker invites a listener to finish a speaker’s utterance; F) question without answer – an unaccomplished question-answer complex; G) overlapping cues – the situation, when two or more cues are uttered simultaneously, so it is difficult to make them out. 3. The types of repetitions. It is widely known that the repetitions in the spoken speech are of great importance and go far beyond meaning transference. 
Within this annotation zone it is possible to mark up: A) the one-word /many-word/single/multiple repetitions; B) repetitions with intensifiers (very, never, often, always, absolutely and so on); C) repetitions of the same text with different intonation; D) repetitions with the change of addressee (a speaker repeats the same text, addressing to different persons); E) repetitions during the overinterrogations – a) repetitions in answers: I’m going to C h i t a. – Where? – C h i t a. b) repetitions in questions: I’m going t o C h i t a. – T o C h i t a? – Yes.; F) echo repetitions – a listener repeats a speaker’s cue or its part with the same intonation; G) mimicking – a listener mimics a speaker’s cue with the special mimicking intonation; H) envelope repetitions – the repetitions of a word at the beginning and at the end of a elementary discursive unit (EDU); I) relay repetitions – the repetitions of a word at the end of the previous EDU and at the beginning of the following EDU; J) simultaneous speaking – a cue or its part is uttered by two or more speakers at the same time; K) redirection of question – one person questions another, and this questioned person redirects the same question to the third person; L) imitation – a listener tries to imitate the speech behaviour of a speaker. 4. The manner of phonation. In this zone an annotator marks up different types of phonation and pronunciation of a cue. The types of phonation/pronunciation may be determined with a speaker’s mental/physical state (crying, laughing, drunken, talking to oneself; articulation disorders, slip of the tongue, inarticulate cue, exercise stress, out of breath), a situation of speaking (declamation, reading, singing, dubbing-in, dictation); at this stage of annotation the special types of phonation are also marked (shout, whisper, ventriloquism, muffled shout, chanting, scanning, humming, parcelling out). 5. The vocal gestures, interjections and physiological activities. 
In this zone an annotator marks up:
A) interjections, i.e. non-verbal words that have standard written forms (for instance, Oh (meaning agitation, admiration, pity, mockery, distrust, and so on), Ah (meaning understanding, pain, fright, reply to address, scorn, and so on), Uh huh (meaning approval, agreement, backing-yes), and so on);
B) vocal gestures, i.e. non-verbal words that lack special written forms (for instance, iconic sounds, teasing sounds, feeling cold, intensity of feeling, and so on);
C) physiological activities, i.e. a speaker’s or a listener’s physiological acts, for instance, sigh, cough, yawn, chuckle, whistle, spit, kiss, and so on.
In fact, the deeply annotated part of the MURCO lets us investigate these important linguistic phenomena on a new level.

4.4 Gesture Queries
During the last three decades the investigation of the role of gesticulation in different languages has progressed to a large degree. It is now the current opinion that the time has come to build gesture corpora, so that the investigation of gesture systems can rest on firm ground (see the materials of LREC’2008, and their review and the main bibliography in (Grishina, 2009b)). The MURCO seems to be a resource that is generally accessible and quite considerable in volume; moreover, the MURCO is planned to include a lot of video tracks. So, it is absolutely necessary to provide a user with the annotation and interface concerning Russian gesticulation. The basic principles and ideological grounds for our gesture classification we have described earlier (Grishina, 2009b). So, in this paper we list the main items of the MURCO interface concerning the gesticulation subject matter.

4.4.1 Sociolinguistic Characteristics of Gesture
1) The name of an actor (if it is known).
2) The sex of an actor (Male, Female).
3) The sex of a character (Male; Female; Unknown (for example, in animated films); Man playing a female role (for example, John Travolta in Hairspray); Woman playing a male role (this is practically impossible); Man pretending to be a woman (for example, Dustin Hoffman in Tootsie); Woman pretending to be a man (for example, Julie Andrews in Victor Victoria)). It is obvious that the last 4 items are very important for the investigation of the gender specificity of gesticulation.
4) The actor’s age and the character’s age (Child, Teenager, Adult, Aged, Unknown).
It should be mentioned that any specific social situation in which the gesticulation takes place ought to be marked up while annotating a clixt, so there is no need to mark it up once more when the gestures are being marked up.

4.4.2 Involved Objects
Gesticulation quite often involves the use of objects. This fact, naturally, ought to be mentioned while marking up this or that gesture. The objects in question may play three main roles.
1) The substitutes. These are the objects which substitute for a gesticulating human organ (for example, a pointer or a pencil instead of a speaker’s forefinger in the deictic gesture to show with a forefinger).
2) The spoilers. These are the objects which impede a gesticulating person and prevent him from pure gesticulating (for example, some clothes in the speaker’s hand, which spoil a greeting handshake).
3) The accessories. This is the whole set of the objects which are involved in the gesticulation (the substitutes, the spoilers, and the adaptors). The latter are the objects which act as necessary components of this or that gesture and at the same time are not part of the human body (for example, a watch is the adaptor for the gesture to check time, a surface is the adaptor for the gesture to bang one’s fist on smth).

4.4.3 Repetition Factor
In the MURCO single and multiple gestures are distinguished (single gestures are labelled with perfectives, multiple ones with imperfectives).

4.4.4 Active Organ
The active organs of the gestures are distributed into 6 groups according to the main organs of the human body.
1. Main organ: head (brow, brows, chin, ear, eye, eyes, face, forehead, head, lips, lower lip, mouth, nose, tongue, upper lip, upper teeth)
2. Main organ: body (body, shoulder, shoulders, back)
3. Main organ: arm (arm, fingers, forefinger, forefinger+long finger, forefinger+long finger+fourth finger, forefinger+long finger+thumb, forefinger+thumb, fourth finger, hand, little finger, long finger, thumb)
4. Main organ: arms (arms, hands, forefingers, fingers)
5. Main organ: leg (foot, shin)
6. Main organ: legs (feet, legs)

4.4.5 Passive Organ
The set of the passive organs is specific for this or that active organ. The basic passive organs are as follows: No passive organ, arm, arms, back, body, breast/stomach, chin, ear, eye, face, fingers, hair, hand, head, hip, hips, lips, lower lip, mouth, neck, nose, shoulder, throat.

4.4.6 Adaptor
An adaptor is an object which is a necessary component of this or that gesture, but is not one of the organs of the human body. The main types of adaptors are as follows: No adaptor, cloth, earth, external object, glasses, gloves, handset, headwear, heavy object, interlocutor, piece of furniture, pocket, sky, surface, tableware, tie, vessels, watch, wristlet.

4.4.7 Dimensional Characteristics of Gesture
1. Palm orientation: up, down, one opposite the other, to speaker’s body, outside, perpendicularly to speaker’s body
2. Direction of movement: backward, differently directed, does not matter, downwards, forward, forward-backward, from right to left, from the outside to the center, from within outside, horizontal circle, on its axis, outside, to oneself, to the center, upwards, vertical circle.
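
The annotation fields listed in 4.4.1-4.4.7 can be thought of as one flat record per gesture. The following Python sketch is only an illustration of that idea. The class name GestureAnnotation, its field names, the select helper and the sample records are all hypothetical; they are not the actual MURCO storage format or query interface. Only the organ lists are taken from 4.4.4.

```python
from dataclasses import dataclass
from typing import Optional

# Active organs grouped by main organ, copied from 4.4.4.
MAIN_ORGAN = {
    "head": {"brow", "brows", "chin", "ear", "eye", "eyes", "face", "forehead",
             "head", "lips", "lower lip", "mouth", "nose", "tongue",
             "upper lip", "upper teeth"},
    "body": {"body", "shoulder", "shoulders", "back"},
    "arm": {"arm", "fingers", "forefinger", "forefinger+long finger",
            "forefinger+long finger+fourth finger",
            "forefinger+long finger+thumb", "forefinger+thumb",
            "fourth finger", "hand", "little finger", "long finger", "thumb"},
    "arms": {"arms", "hands", "forefingers", "fingers"},
    "leg": {"foot", "shin"},
    "legs": {"feet", "legs"},
}


@dataclass(frozen=True)
class GestureAnnotation:
    """Hypothetical record for one annotated gesture (field names are assumptions)."""
    actor_sex: str                          # 4.4.1, item 2
    character_sex: str                      # 4.4.1, item 3
    character_age: str                      # 4.4.1, item 4
    repeated: bool                          # 4.4.3: False = single, True = multiple
    main_organ: str                         # 4.4.4: one of the 6 groups
    active_organ: str                       # 4.4.4: member of the chosen group
    passive_organ: Optional[str] = None     # 4.4.5: None means "No passive organ"
    adaptor: Optional[str] = None           # 4.4.6: None means "No adaptor"
    palm_orientation: Optional[str] = None  # 4.4.7, item 1
    direction: Optional[str] = None         # 4.4.7, item 2

    def __post_init__(self):
        # Reject records whose active organ does not belong to the main organ group.
        if self.active_organ not in MAIN_ORGAN.get(self.main_organ, set()):
            raise ValueError(
                f"{self.active_organ!r} is not an active organ of {self.main_organ!r}"
            )


def select(gestures, **criteria):
    """Return the gestures whose fields equal all the given criteria."""
    return [g for g in gestures
            if all(getattr(g, name) == value for name, value in criteria.items())]


# Invented sample records, for illustration only.
sample = [
    GestureAnnotation("Male", "Male", "Adult", False, "arm", "forefinger",
                      direction="forward"),
    GestureAnnotation("Female", "Female", "Adult", True, "head", "head"),
    GestureAnnotation("Male", "Male", "Adult", False, "arm", "hand",
                      passive_organ="breast/stomach"),
]

single_arm_gestures = select(sample, main_organ="arm", repeated=False)
print(len(single_arm_gestures))
```

Keeping the main organ as an explicit field, rather than deriving it from the active organ, reflects the fact that some active organs (for example, fingers) are listed under more than one main organ in 4.4.4.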

4.4.8 Gesture Meanings and Gesture Types
So far we have marked out about 250 gesture meanings, which are grouped into 14 gesture types. The gesture types are as follows:
• Adopted, Conventional, Corporate, Critical, Decorative, Deictic, Etiquette, Gestures – speech acts, Gestures of inner state, Iconic, Physiological, Regulating, Rhetorical, Searching
Every type includes a number of gesture meanings. For example, some of the etiquette gestures are as follows:
• gratitude (to applaud, to move one’s head forward, twice-repeated kiss, to close one’s eyes, to nod, to touch smb, to bow, to touch smb’s hand, to kiss smb, to kiss smb’s hand, to press one’s hands to one’s breast, and so on)
• apology (to beat one’s breast, to nod, to move one’s chin outside, to press smb’s hand to one’s breast, to press one’s hand to one’s breast)
• invitation (to nod, to show smth with one’s hand, to bow), and so on.
So, the meaning of a gesture is described as a combination of 3 parameters: 1) its contextual meaning in this or that consituation, represented in a clip/clixt, 2) the type of task which is fulfilled with the gesture (=the gesture type), and 3) the traditional Russian name of the gesture (=the gesture name). The latter may be lacking, and in this case we have to invent the missing name.

5. Conclusion
Thus we can see that the MURCO considerably extends the possibilities for searching the characteristics of spoken Russian. We may illustrate this with the queries that apply to the Russian greeting formulas (GF) (see Table 5).

Types of queries [Spoken Subcorpus of RNC / MURCO]
1. Lexical queries: the retrieval of the specific lexemes used in GF (e.g. zdravstvujte ‘how do you do?’, privet ‘hi!’, and so on) [+ / +]
2. Morphological queries: the retrieval of the specific morphological characteristics of the GF lexemes (e.g. zdravstvujte (Pl or courtesy) vs zdravstvuj (Sg), privet (Noun) vs privetstvuju (Verb), and so on) [+ / +]
3. Sociological queries: the forming of gender and chronological subcorpora to investigate the peculiarities of GF usage [+ / +]
4. Semantic & speech act queries: the retrieval of all Russian GF simultaneously [– / +]
5. Orthoepic/accentological queries: the retrieval of the types of vowel contractions and the shortening of consonant groups in GF; the reduction of the pre- and post-tonic vowels in GF [– / +]
6. Speech act queries: the retrieval of the types of repetitions used in GF; the types of vocal gestures and interjections accompanying the different types of GF; GF used in man/woman dialogues; see also item 4 [– / +]
7. Gesture queries: the retrieval of the gestures accompanying Russian GF [– / +]
Table 5.

6. Acknowledgements
The work of the MURCO group is supported by the program “Genesis and Interaction of Social, Cultural and Language Communities” of the Russian Academy of Sciences. The author’s investigation is supported by the RFBR (RFFI, the Russian Foundation for Basic Research) under the grant 08-06-00371а and the grant “Elaboration of Multimodal Russian Corpus (MURCO) within the framework of Russian National Corpus (www.ruscorpora.ru)”.

7. References
Blache, Ph., et al. (2008). Creating and exploiting multimodal annotated corpora. In LREC’2008.
Brutti, A., et al. (2008). WOZ Acoustic Data Collection for Interactive TV. In LREC’2008.
Cullen, Ch., et al. (2008). Emotional Speech Corpus Construction, Annotation and Distribution. In LREC’2008.
Devillers, L., Martin, J.-C. (2008). Coding Emotional Events in Audiovisual Corpora. In LREC’2008.
Fék, M., et al. (2008). Multimodal Spontaneous Expressive Speech Corpus for Hungarian. In LREC’2008.
Forbes-Riley, K., et al. (2008). Uncertainty Corpus: Resource to Study User Affect in Complex Spoken Dialogue Systems. In LREC’2008.
Forchini, P. (2009). Spontaneity reloaded: American face-to-face and movie conversation compared. In Corpus Linguistics 2009.
Abstracts. The 5th Corpus Linguistics Conference, 20-23 July 2009, Liverpool, p. 118.
Gallo, C.G., et al. (2008). Production In A Multimodal Corpus: How Speakers Communicate Complex Actions. In LREC’2008.
Georgila, K., et al. (2008). A Fully Annotated Corpus for Studying the Effect of Cognitive Ageing on Users’ Interactions with Spoken Dialogue Systems. In LREC’2008.
Gnjatovíc, M., Rösner, D. (2008). On the Role of the NIMITEK Corpus in Developing an Emotion Adaptive Spoken Dialogue System. In LREC’2008.
Grishina, E. (2006). Spoken Russian in the Russian National Corpus (RNC). In LREC’2006: 5th International Conference on Language Resources and Evaluation. ELRA, pp. 121-124. At: http://docs.google.com/View?id=df52fjjj_3wd9mcrdg
Grishina, E. (2007a). O markerah razgovornoj rechi (predvaritel’noje issledovanije podkorpusa kino v Nacional’nom korpuse russkogo jazyka). In Kompjuternaja lingvistika i intellektual’nyje tehnologii. Trudy mezhdunarodnoj konferencii “Dialog 2007”. Moscow, RSGU, pp. 147-156. At: http://www.dialog-21.ru/dialog2007/materials/html/22.htm.
Grishina, E. (2007b). Text Navigators in Spoken Russian. In Proceedings of the workshop “Representation of Semantic Structure of Spoken Speech” (CAEPIA’2007, Spain, 2007, 12-16.11.07, Salamanca). Salamanca, pp. 39-50. At: http://docs.google.com/Doc?docid=df52fjjj_11fmxsdzdh&hl=en.
Grishina, E. (2008). Nacional’nyj korpus russkogo jazyka kak istochnik svedenij ob ustnoj rechi. Rechevyje tekhnologii, 3, pp. 50-62. At: http://docs.google.com/View?id=df52fjjj_34g9d9w2dg
Grishina, E. (2009a). Korpus “Istorija russkogo udarenija”. In RNC’2009, pp. 150-174. At: http://docs.google.com/View?id=df52fjjj_37ghmg36cb
Grishina, E. (2009b). Multimedijnyj korpus russkogo jazyka (MURCO): problemy annotacii. In RNC’2009, pp. 175-214. At: http://docs.google.com/View?id=df52fjjj_363wxt76dk
Grishina, E., et al. (2010). Design and data collection for the Accentological corpus of Russian.
In LREC’2010: 7th International Conference on Language Resources and Evaluation. ELRA (forthcoming).
Grishina, E., Savchuk, S. (2009). Ustnyj korpus v Nacional’nom korpuse russkogo jazyka: sostav i struktura. In RNC’2009, pp. 129-149. At: http://docs.google.com/View?id=df52fjjj_39gh8wsffv
Knight, D., Tennent, P. (2008). Introducing DRS (The Digital Replay System): A tool for the future of Corpus Linguistic research and analysis. In LREC’2008.
Kostoulas, T., et al. (2008). A Real-World Emotional Speech Corpus for Modern Greek. In LREC’2008.
Kudinov, M., Grishina, E. (2009). Instrumenty poluavtomaticheskoj razmetki dl’a Mul’timedijnogo russkogo korpusa (MURCO). In Komp’juternaja lingvistika i intellektual’nyje tekhnologii (Mezhdunarodnaja konferencija “Dialog 2009”, 8(15)). Computational Linguistics and Intellectual Technologies (Annual International Conference “Dialogue 2009”, 8(15)). Moscow: RSGU, pp. 249-261. At: http://www.dialog-21.ru/dialog2009/materials/html/40.htm
LREC’2008. (2008). 6th International Conference on Language Resources and Evaluation. Marrakesh: ELRA. At: http://www.lrec-conf.org/proceedings/lrec2008/
Marasek, K., Gubrynowicz, R. (2008). Design and Data Collection for Spoken Polish Dialogs Database. In LREC’2008.
Möller, S., et al. (2008). Corpus Analysis of Spoken Smart-Home Interactions with Older Users. In LREC’2008.
Nallasamy, U., et al. (2008). NineOneOne: Recognizing and Classifying Speech for Handling Minority Language Emergency Calls. In LREC’2008.
RNC’2006. (2006). Nacional’nyj korpus russkogo jazyka: 2003–2005. Rezul’taty i perspektivy. Moscow: Indrik.
RNC’2009. (2009). Nacional’nyj korpus russkogo jazyka: 2006–2008. Novyje rezul’taty i perspektivy. Sankt-Peterburg: Nestor-Istorija.
Sainz, I., et al. (2008). Subjective evaluation of an emotional speech database for Basque. In LREC’2008.
Savchuk, S. (2009). Spoken Texts Representation in the Russian National Corpus: Spoken and Accentologic Sub-Corpora.
In NLP, Corpus Linguistics, Corpus Based Grammar Research. Fifth International Conference, Smolenice, Slovakia, 25-27 November 2009. Proceedings. Brno, Tribun, pp. 310-320.
Savino, M., et al. (2008). Integrating Audio and Visual Information for Modelling Communicative Behaviours Perceived as Different. In LREC’2008.
Stoia, L., et al. (2008). SCARE: A Situated Corpus with Annotated Referring Expressions. In LREC’2008.
Strauß, P.-M., et al. (2008). The PIT Corpus of German Multi-Party Dialogues. In LREC’2008.
van Son, R.J.J.H., et al. (2008). The IFADV corpus: A free dialog video corpus. In LREC’2008.
Wilks, Y., et al. (2008). Dialogue, Speech and Images: The Companions Project Data Set. In LREC’2008.
Wilson, Th. (2008). Annotating Subjective Content in Meetings. In LREC’2008.
