
Hindi Guide Summary PDF


Transcription quality
Punctuation
Format
Agreed spelling
Longform generic rules

Transcription quality
Avoid making any typographical errors. Carefully check your work before marking items as
"complete".
Correct: मैं घर जा रहा हूं।
Incorrect: मे घर जा रहा हूं।

Transcribe what is actually spoken. Use context to help with spelling and homophone
disambiguation. Look up words if you are unsure. Do not correct speaker's grammar if they
intentionally say something, even if what they say does not follow the standard grammatical
rules of the transcription language.
Example audio: " मुझे भूख लगा है। "
Correct: मुझे भूख लगा है।
Incorrect: मुझे भूख लगी है।

Do not transcribe words that are not spoken, even if they are obviously intended by the
speaker. Avoid putting words in the speaker's mouth. However, do transcribe implied times
and units of currency.
Example audio: " तीन सौ रुपए इस मिठाई के लिए बहुत ज़्यादा हैं। "
Correct: ₹300 इस मिठाई के लिए बहुत ज़्यादा हैं।
Incorrect: 300 रुपए इस मिठाई के लिए बहुत ज़्यादा हैं।

Regarding spacing:
Use only one space between words and sentences.

Correct: आपका नाम क्या है?


Incorrect: आपका नाम क्या है ?

For most types of punctuation, do not put a space between the preceding word and the
punctuation.

Correct: नमस्ते, यह डॉ. दीपक हैं।


Incorrect: नमस्ते , यह डॉ. दीपक हैं ।
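
The spacing rules above lend themselves to a mechanical check. The following is a minimal sketch and not part of the original guide: the function name `normalize_spacing` and the exact set of punctuation marks it handles are assumptions made for illustration. Quotation marks are deliberately left alone, because their spacing depends on whether they open or close a quote.

```python
import re

def normalize_spacing(text: str) -> str:
    """Collapse repeated spaces and drop spaces before common punctuation."""
    # Use only one space between words and sentences.
    text = re.sub(r" {2,}", " ", text)
    # No space between the preceding word and the punctuation mark
    # (assumed set: comma, question mark, exclamation mark, danda, semicolon).
    text = re.sub(r" +([,?!।;])", r"\1", text)
    return text.strip()

print(normalize_spacing("आपका नाम क्या है ?"))
print(normalize_spacing("नमस्ते , यह डॉ. दीपक हैं ।"))
```

A helper like this can only flag or fix the spacing itself; it does not decide whether the punctuation mark is the right one.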

For quotation marks and similar punctuation, put a space before the opening punctuation, but
not necessarily after the closing punctuation.

Correct: संजय ने बोला, "मैं तुमसे प्यार करता हूं।"
Incorrect: संजय ने बोला, " मैं तुमसे प्यार करता हूं। "

Punctuation
Follow the punctuation regulations of your locale. Additional conventions are outlined in this
section.
Full sentences should end with a punctuation mark.

In general, a complete sentence contains a subject and a verb.

Sometimes a phrase which is not obviously grammatically a sentence should nevertheless be
treated as a sentence because of its context, e.g. if it's an answer to a specific question, or if
it's an example where dropping the subject sounds completely natural as a complete sentence.
Correct: पिछली बार

Correct: फूलों के चित्र

Correct: खाने पर आ रहे हो कल?

Interjections, greetings, and farewells said in isolation should be considered complete
sentences and punctuated as such.

Correct: वाह!

Correct: नमस्ते।

Correct: पक्का। फिर मिलते हैं।

Capitalize sentence fragments that sound like the beginning of a sentence. Add end
punctuation to sentence fragments that sound like the end of a sentence. For fragments that
do not clearly sound like the beginning or end of a sentence, leave out capitalization and
punctuation. Note that sentence fragments may be a result of cut-off audio samples.

Correct: तुम्हें क्या लगता है? ऐसा नहीं है कि


Explanation: Begins as complete sentence and ends mid-stream.

Correct: मुश्किल था। इस बात का कोई मतलब ही नहीं है।

Explanation: Fragment is the end of a sentence.

Correct: कहां हो कहां हो तुम?


Explanation: Repeated beginning of the sentence.

Only use commas where required. Err on the side of minimal punctuation. Do not rely on
intonation.
Correct: अगला पेट्रोल पंप किधर है?
Incorrect: अगला, पेट्रोल पंप, किधर है??
Explanation: Even if the speaker uses long pauses in these places, do not use a comma. There are
places where commas are allowed or required, but this example contains neither.

Use a comma when a sentence starts with an interjection or yes/no word. However: If there is
a long pause between an interjection or yes/no word and a full sentence that follows it, treat
that initial word as a separate sentence.
Correct: लेकिन, वो सच हो सकता है।
Explanation: Discourse word.

Correct: अच्छा दोस्त, जो भी करो सावधानी से करना।


Explanation: Yes/no word. Other examples of this type of item include "हां", "अच्छा", and others.

Correct: शायद, पर मुझे पक्का नहीं पता।


Explanation: Use a comma when there is no pause, or when there is a pause that isn't long.

Use commas in lists.


Correct: मेरा बेटा भोला, छोटा, प्यारा और नटखट है।

Use commas for non-restrictive modifiers, but do not use commas for restrictive
modifiers. The basic test for this is whether the modifier can be dropped from the
sentence and still keep basically the same meaning.

Correct: इंडिया के प्रधान मंत्री, नरेंद्र मोदी, अमेरिका गए थे।

Explanation: Non-restrictive modifier. "नरेंद्र मोदी" does not change the core meaning of "इंडिया के प्रधान
मंत्री", it just adds additional information about the Indian prime minister.

Use commas in sign-offs, such as those at the end of a message. Do not use end
punctuation.

Correct: तुम्हारी दोस्त, सोनाली

Do not use commas in sentences that consist only of a greeting and an addressee. If a
greeting occurs at the beginning of a sentence or fragment, place a comma after the
greeting. If the greeting includes an addressee, place the comma after the addressee.

Correct: नमस्ते, मैं अंजलि बोल रही हूं।

Correct: नमस्ते नेहा, मैं पूजा बोल रही हूं।

Except in greetings, sentence-initial and sentence-final addressees should be
separated by a comma.
Correct: सचिन, मुझे कॉल कर।
Correct: तू कैसी है, मनीषा?
Correct: मनोज, नमस्ते, विवेक बोल रहा हूं।

Capitalize and punctuate the following as questions: 1) All queries syntactically built as
questions, regardless of intonation. 2) All queries which sound like they are being used
as questions, regardless of sentence structure.
Correct: तीन बजे?
Correct: और स्वाती भी आ रही है?

If a speaker uses clearly exclamatory intonation, use an exclamation point. If there is
any doubt, err on the side of using a period.
Correct: चुप कर!
Correct: बिलकुल!
Correct: जन्मदिन मुबारक!

Use a comma between reported speech verbs and direct quotations. Do not put
punctuation within quotation marks unless the punctuation belongs to the reported
speech.

Correct: मेरा दोस्त बोला, "मगरमच्छ"।

Incorrect: मेरा दोस्त बोला, "मगरमच्छ।"

Incorrect: मेरा दोस्त बोला "मगरमच्छ।"

Incorrect: मेरा दोस्त बोला "मगरमच्छ"।

If the text in quotation marks qualifies as a sentence, punctuate as if it were its own
utterance. Do not alter its end punctuation even if the quote is within a sentence. Do not
add excess punctuation after end quotation marks.
Correct: नेहा बोली, "तीन बजे मिलते हैं।"
Incorrect: नेहा बोली, "तीन बजे मिलते हैं।"।

Correct: जतिन ने पूछा, "क्या हम तीन बजे मिलेंगे?"

Incorrect: जतिन ने पूछा, "क्या हम तीन बजे मिलेंगे?"।

Use a colon but no quotation marks in quotative voice actions when the quote follows
the command. Use quotation marks when the quote is in the middle of the sentence.

Correct: फ्रेंच में "मुझे तुमसे प्यार है।" कैसे कहते हैं?

Correct: जापानी में कैसे कहते हैं: मुझे पीनी है।

Correct: example@gmail.com को ईमेल भेजो: तुम कब आओगे?

When speakers make a request for single words to be translated into another language,
don't punctuate or capitalize the words, even if you'd consider the words as sentences
in other situations.

Correct: स्पेनिश में "नमस्ते" का अनुवाद करें।

Correct: नमस्ते।

Do not use quotation marks for metalinguistic uses of words or phrases. These uses
include defining the word, talking about the spelling of the word, or any other type of
reference to the word itself as a thing.

Correct: बच्चे ने अभी पापा बोला।


Incorrect: बच्चे ने अभी "पापा" बोला।

Apart from standard letters, you should not use any other symbol than: 0-9
äâàæÆçÇéèëêïîñÑôöŒœüûùμÿÄÂÀÉÈËÊÏÎÔÖÜÛÙŸ²³,?!'"_°:.()<>{}[]√/@#$€£₹+=%*&-.;
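
The symbol whitelist above can be turned into a simple character check. This is a sketch under stated assumptions rather than the tool's actual validator: "standard letters" is interpreted here as any Unicode letter or combining mark, and spaces and the danda (।) are treated as allowed because the guide's own examples use them. The name `disallowed_characters` is hypothetical.

```python
import unicodedata

# Symbols copied from the whitelist above.
ALLOWED_SYMBOLS = set(
    "0123456789"
    "äâàæÆçÇéèëêïîñÑôöŒœüûùμÿÄÂÀÉÈËÊÏÎÔÖÜÛÙŸ²³"
    ",?!'\"_°:.()<>{}[]√/@#$€£₹+=%*&-.;"
)
# Assumption: spaces and the danda are acceptable even though the list omits them.
ASSUMED_EXTRA = {" ", "\u0964"}

def disallowed_characters(text: str) -> list:
    """Return characters that are neither letters/marks nor in the whitelist."""
    bad = []
    for ch in text:
        if ch in ALLOWED_SYMBOLS or ch in ASSUMED_EXTRA:
            continue
        # Unicode categories starting with L (letters) or M (combining marks).
        if unicodedata.category(ch)[0] in ("L", "M"):
            continue
        bad.append(ch)
    return bad

print(disallowed_characters("उन्हें ¼ की आवश्यकता है।"))
```

Under these assumptions the check flags the pre-combined fraction ¼ and Devanagari digits, both of which the Format section rules out.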

When two opposing teams are mentioned, include a hyphen between their names.
Correct: क्या तुमने कल भारत-पाक का खेल देखा?

Include a hyphen between locations in flight itineraries.


Correct: क्या मुंबई-दिल्ली की उड़ान दो घंटे की है?
Incorrect: क्या मुंबई दिल्ली की उड़ान दो घंटे की है?

Use a hyphen in phrases and compounds typically written with a hyphen. If in doubt, use
a hyphen.

Correct: मेरे माता-पिता बरेली से हैं।

Correct: वहां कभी-कभी जाते हो?

For sentence-level spoken punctuation, write out the full word or words between curly
brackets. Do not add punctuation symbols after spoken punctuation. Be careful with
homonyms.

Example audio: " तुम कैसे हो प्रश्न चिह्न "


Correct: तुम कैसे हो {प्रश्न चिह्न}
Incorrect: तुम कैसे हो?
Incorrect: तुम कैसे हो प्रश्न चिह्न
Incorrect: तुम कैसे हो प्रश्न चिह्न?

Don't spell out internal punctuation like hyphens in web pages, email addresses,
addresses, phone numbers, or other word-level punctuation.

If a word that can refer to a punctuation mark is spoken in isolation, it should be written
out between curly brackets.

Correct: {पूर्ण विराम}

Correct: {अल्पविराम}

Format
Transcribe numbers, abbreviations etc. following the formatting conventions in this
section.

Devanagari numerals should not be used, only Western Arabic numerals should be used.
Cardinals and ordinals from 0 to 9 are written with letters (except for measures and
currency - see Currency and Unit). Use digits for cardinals and ordinals 10 and above,
even if they are coordinated with numbers under 10. Transcribe all decimal numbers as
digits.

Correct: मेरी कक्षा में नौ विद्यार्थी हैं।

Correct: मेरी कक्षा में 47 विद्यार्थी हैं।

When two or more numbers refer to the same noun, and one number is 10 or greater,
transcribe both as numerals.

Correct: वो 9 या 10 कुत्ते लेकर आए।


Correct: यहां पांच घोड़े और 20 बैल रहते हैं।
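
The cardinal-number rules above can be sketched in code. This is an illustration only: the helper names are hypothetical, the word list follows the spellings used in this guide's examples, and the functions deliberately ignore measures, currency, and decimals, which have their own rules.

```python
# Hindi words for 0-9, spelled as in the guide's examples.
HINDI_DIGIT_WORDS = ["शून्य", "एक", "दो", "तीन", "चार", "पांच", "छह", "सात", "आठ", "नौ"]

# Map Devanagari digits to Western Arabic digits.
DEVANAGARI_TO_ARABIC = str.maketrans("०१२३४५६७८९", "0123456789")

def to_western_digits(text: str) -> str:
    """Replace any Devanagari numerals with Western Arabic numerals."""
    return text.translate(DEVANAGARI_TO_ARABIC)

def format_cardinal(n: int) -> str:
    """Words for 0-9, digits for 10 and above."""
    return HINDI_DIGIT_WORDS[n] if 0 <= n <= 9 else str(n)

def format_same_noun(numbers: list) -> list:
    """If any number referring to the same noun is 10 or greater, use digits for all."""
    if any(n >= 10 for n in numbers):
        return [str(n) for n in numbers]
    return [format_cardinal(n) for n in numbers]

print(format_cardinal(9), format_cardinal(47))
print(format_same_noun([9, 10]))
```

Deciding whether two numbers refer to the same noun is left to the transcriber; the helper only applies the rule once that grouping is known.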

If a large number consists of only a number followed by "हज़ार", "लाख", "करोड़", or higher,
then transcribe as a numeral plus word. Otherwise, transcribe as numerals.

Correct: 7 करोड़
Example audio: " सात करोड़ "
Correct: 1 हज़ार लोग
Example audio: " एक हज़ार लोग "

Write lists of numbers with digits and without commas.

Correct: 0 1 1 2 3 5 8 13
Explanation: शून्य एक एक दो तीन पांच आठ तेरह

For long numbers (4+ digits) indicating quantity, insert the relevant separator (comma,
decimal point, or space, depending on language).
Correct: 10,000
Example audio: " दस हज़ार "

In math expressions or units & measures, transcribe fraction words using numerals and
slashes.
Correct: उन्हें 1/4 कि.ग्र. की आवश्यकता है।
Incorrect: उन्हें चौथाई कि.ग्र. की आवश्यकता है।
Incorrect: उन्हें ¼ कि.ग्र. की आवश्यकता है। (bad because it includes the pre-combined fraction ¼)
Incorrect: उन्हें 0.25 कि.ग्र. की आवश्यकता है।
Example audio: " उन्हें चौथाई किलो ग्राम की आवश्यकता है। "

Correct: 3/4 कि.मी. में दाएं मुड़ें।
Incorrect: तीन चौथाई कि.मी. में दाएं मुड़ें।
Incorrect: 0.75 कि.मी. में दाएं मुड़ें।
Example audio: " तीन चौथाई किलोमीटर में दाएं मुड़ें। "

Correct: 5 * 6
Incorrect: पांच * छह
Incorrect: 5 गुना 6
Example audio: " पांच गुना छह "

For mixed numbers in math and units & measures, use numerals with "and".
Correct: 5 फुट 3 1/2 इंच

When referring to items (not units or measures), write fractions out in words. With mixed
numbers, write the whole number part out in words if it is under ten, otherwise write it
with numerals.
Example audio: " मुझे आधी रोटी दीजिये। "
Correct: मुझे आधी रोटी दीजिये।
Incorrect: मुझे 1/2 रोटी दीजिये।
Incorrect: मुझे 0.5 रोटी दीजिये।

For mixed numbers that represent currency amounts, always use decimals.
Correct: तुम मुझे $2.50 दे सकते हो?
Example audio: " तुम मुझे ढाई डॉलर दे सकते हो "

Correct: शीतल ने यह घर ₹7.5 करोड़ का खरीदा है।


Example audio: " शीतल ने यह घर साढ़े सात करोड़ रुपये का खरीदा है। "

Transcribe percentages using numerals and the % sign. (In the unlikely case that you
encounter a number of a million or greater used as a percentage, spell it out.)

Correct: 1 मिलियन प्रतिशत

Correct: 50% खाना गायब था

If a number appears in a context which calls for a certain formatting in your language,
use that formatting. Otherwise, default to the general rule for transcribing numbers.

Correct: King Henry VIII


Example audio: " king henry the eighth "

Transcribe seasons and episodes of television shows with numerals.


Correct: सीज़न 3 एपिसोड 2
Example audio: " सीज़न तीन एपिसोड दो "
Correct: देख भाई देख एपिसोड 2

If it is a product type or statistic, use the common written form.


Correct: 4x4
Example audio: " चार बटा चार "

Correct: उसने 4.2 का बल्लेबाजी औसत रखा।

Example audio: " उसने चार दशमलव दो का बल्लेबाजी औसत रखा। "

Transcribe phone numbers as you would write them down in their natural blocks. Do not
use hyphens between blocks. When applicable, the STD code should be surrounded by
spaces.
Correct: +91 9897 034 241
Example audio: " प्लस नौ एक नौ आठ नौ सात शून्य तीन चार दो चार एक "

Transcribe alpha-digit sequences (product codes etc.) in their most natural way
(possibly several ways accepted). Do not transcribe credit card numbers, etc.
Correct: XT 660 or XT660

If it really sounds like a math expression, then transcribe it with numbers and symbols, with spaces
in between.
Example audio: " पांच गुना छह कितना होता है "

Correct: 5 * 6 कितना होता है?

Incorrect: पांच बारी छह कितना होता है?
Incorrect: 5 गुना 6 कितना होता है?

Transcribe currencies as commonly written in the transcription language.


When a speaker uses words like "dollar" without specifying a quantity, spell them out.
Correct: बस थोड़े रुपए
Correct: एक भारतीय रुपया कितने अमेरिकी डॉलर होता है?
Correct: तुम्हें नेपाली रुपया चाहिए या अमेरिकी डॉलर?

For ranges or non-specific currency quantities, write everything out as spoken.


Correct: मुझे चार सौ या पांच सौ रुपए चाहिए।
Example audio: " एक सौ से चार सौ रुपए "
Correct: 100 से 400 रुपए
Example audio: " नौ से बारह पौंड "
Correct: 9 से 12 पौंड

Abbreviate all units that follow numeric values.


Example audio: " मेरा परिवार दस कि ग्र आलू लेकर आया है। "
Correct: मेरा परिवार 10 कि.ग्र. आलू लेकर आया है।

Transcribe all numeric values preceding units in numeral form, even if under 10.
Correct: उसका वज़न 2 कि.ग्र. है।
Incorrect: उसका वज़न दो कि.ग्र. है।

Correct: मैं वहां 6 महीने से हूं।

Incorrect: मैं वहां छह महीने से हूं।

If it is clear from context that a number or number sequence refers to currency or time,
format it as such.

Example audio: " दूध चालीस का है। "

Correct: दूध ₹40 का है।

Example audio: " पांच बजके पैंतालीस मिनट का अलार्म लगाओ। "
Correct: 5:45 का अलार्म लगाओ।

Use the natural form for transcribing dates.


Example audio: " जुलाई बारह उन्नीस सौ चौंसठ "

Correct: जुलाई 12 1964

Example audio: " बारह जुलाई उन्नीस सौ चौंसठ "

Correct: 12 जुलाई 1964

Example audio: " आज तारीख है पांच दस दो हज़ार बारह "


Correct: आज तारीख है 5.10.2012

Exception: When the date is spoken as a sequence of numbers, transcribe as such.


Example audio: " बारह स्लैश सात स्लैश दो हज़ार दस "
Correct: 12/07/2010

Write times in hh:mm format whenever possible, unless it would look unnatural to do
so.
Example audio: " छह बजके पांच मिनट "
Correct: 6:05

Example audio: " आठ बजे के आस पास "


Explanation: Writing 8:00 would look unnatural.

Correct: 8 बजे के आस पास
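
For exact clock times, the hh:mm convention above amounts to zero-padding the minutes. A minimal sketch follows, with a hypothetical helper name; approximate times such as "8 बजे के आस पास" are a judgment call and are intentionally not handled here.

```python
def format_clock_time(hour: int, minute: int) -> str:
    """Format an exact spoken time as h:mm, zero-padding the minutes."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour or minute out of range")
    return f"{hour}:{minute:02d}"

print(format_clock_time(6, 5))
print(format_clock_time(5, 45))
```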

Use commas for ENTITY, LOCATION.


Correct: Pizza Hut, जनक पुरी
Correct: धर्मशाला, राउरकेला
Correct: होटल, पीतम पुरा
Correct: PK फिल्म का समय, बिरसा रोड में

Write URLs, email addresses, and Twitter hashtags as they are spoken and don't
capitalize them.
Example audio: " amazon dot com "
Correct: amazon.com
Example audio: " h t t p colon slash slash one two three dot com "
Correct: http://123.com
Example audio: " w w w forbes dot com "
Correct: www.forbes.com

If the speaker drops a "w" or dots and it's an obvious URL, you should correct these
errors. If the speaker doesn't say the "w"s at all, do not add them.

Example audio: " w w facebook dot com "


Correct: www.facebook.com

Example audio: " google dot co u k "


Correct: google.co.uk

Do not abbreviate unless the speaker says an abbreviated form.


Correct: बनारस हिंदू विश्वविद्यालय
Incorrect: बी.एच.यू.

In acronyms, do not use periods between letters.


Correct: यूपी, एपी
Correct: भाजपा, इसरो, सपा

Agreed spelling
If a word is spelled or obvious pauses are made between letters, spell it into letters as it is said
(often done for foreign names or businesses, for example). Use lowercase letters for the
spelled-out portion. This rule does not apply to acronyms or initialisms, or to spelled-out web
or email addresses.
Correct: पारुल प आ र उ ल
Explanation: Person said "पारुल" and then spelled it
Correct: amazon.co.uk
Example audio: " amazon dot c o dot u k "

Correct: सीईओ
Example audio: " सी ई ओ "
Correct: FIFA
Explanation: Pronounced the word as "फीफा", or spelled out "एफ आई एफ ए".

Correct: एएए
Explanation: Speaker says "ट्रिपल ए", or "ए ए ए".

Correct: मैं डॉ. आनंद को जानता हूं।

Example audio: " मैं डॉक्टर आनंद को जानता हूं। "

Correct: मि. शर्मा नहीं आएंगे।

Example audio: " मिस्टर शर्मा नहीं आएंगे। "

Transcribe words representing laughter or other non-speech vocalizations with up to
three syllables, but no more.
Correct: heh, ha, haha, hahaha, hehe, hehehe, boo hoo, boo hoo hoo, lalala
Correct: हा हा हा
Example audio: " हा हा हा हा हा "
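
The three-syllable cap can be illustrated with a short helper. This sketch assumes the laughter has already been written as space-separated syllables, as in the example above; the function name is hypothetical.

```python
def cap_laughter(text: str, max_syllables: int = 3) -> str:
    """Keep at most `max_syllables` space-separated laughter syllables."""
    syllables = text.split()
    return " ".join(syllables[:max_syllables])

print(cap_laughter("हा हा हा हा हा"))
```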

Spellings of common interjections


आह, ओह, अरे, अजी, हे भगवान, हे रा, हट, हाहा, बाप रे, अहा

Use official spelling, capitalization, and punctuation for proper names. Google them and
pay attention to the correct format. Official format and spelling of a proper name may
supersede the usual written transcription conventions detailed in this document.

Format proper names as they are most commonly formatted on the entity's website
(especially official documents), if available, or the Wikipedia or IMDb page. In cases of
ambiguity, defer to their privacy policy. If no other sources, use top Google hits.

Correct: वो Amazon में काम करता है।


Correct: मैं Pizza Hut और Subway में अक्सर खाती हूं।
Correct: YouTube
Correct: Apple के सामने से बाएं मुड़ जाएं।
Correct: Burger King
Incorrect: BURGER KING
Explanation: Do not spell "Burger King" all in upper case as in the stylized form of the logo, stick to the
official form as per the privacy policy.

Correct: LEGO
Incorrect: Lego

Official spelling is “okay”, not “ok”.


Correct: Okay.
Correct: Okay, सरिता।
Correct: Okay रमन, अब चलना चाहिए।

Refer to the Google Play Store for official spellings of media titles. For film/television, IMDb is
also available. If an utterance is ambiguous between a media title and a sentence or web
search, use your judgment for which is more likely; if truly unclear, default to media title.

Write media titles as they are most commonly written. Movie titles and English book
titles should be written in Devanagari.

Correct: दाग द फायर


Incorrect: Daag: The Fire

Correct: गेम ऑफ़ थ्रोंस
Incorrect: Game of Thrones

Correct: गोदान
Incorrect: Godaan

Correct: हैरी पॉटर


Incorrect: Harry Potter

Correct: राजनीति
Incorrect: Raajneeti

Do not use quotation marks for media titles.


Correct: किक फिल्म के दृश्य दिखाएं।
Incorrect: "किक" फिल्म के दृश्य दिखाएं।

Correct: अर्जुन कपूर की फिल्म गुंडे.

Incorrect: अर्जुन कपूर की फिल्म "गुंडे".

Transcribe all media titles with original punctuation. In cases where original punctuation
falls at the end of a sentence, do not transcribe sentence-level punctuation. That is,
media title punctuation trumps sentence level punctuation when in conflict. If a popular
media title consists of an entire sentence but the official spelling is without punctuation,
then don't punctuate the title. If an utterance is ambiguous between a media title and a
sentence or web search, use your judgment for which is more likely and treat it
accordingly.
Correct: करार - द डील कब निकली थी?

Correct: पूरे परिवार की मनपसंद फिल्म है दावत-ए-इश्क़।

End words with specific characters as listed below:

Correct: गए
Incorrect: गये
Explanation: "ए" instead of "ये"

Correct: जाइए
Incorrect: जाइये
Explanation: "ए" instead of "ये"

Correct: गई
Incorrect: गयी
Explanation: "ई" instead of "यी"

Correct: छाई
Incorrect: छायी
Explanation: "ई" instead of "यी"

Correct: गएं
Incorrect: गयें
Explanation: "एं" instead of "यें"

Correct: आएं
Incorrect: आयें
Explanation: "एं" instead of "यें"

Correct: गईं
Incorrect: गयीं
Explanation: "ईं" instead of "यीं"

Correct: बाईं
Incorrect: बायीं
Explanation: "ईं" instead of "यीं"
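
The word-ending conventions above can be expressed as a small suffix table. The sketch below is illustrative only: `normalize_ending` is a hypothetical helper, and it applies the substitutions blindly to any word with a matching ending, so its suggestions should still be reviewed by a transcriber.

```python
# (non-preferred ending, preferred ending); longer endings are checked first.
ENDING_MAP = [
    ("यीं", "ईं"),
    ("यें", "एं"),
    ("यी", "ई"),
    ("ये", "ए"),
]

def normalize_ending(word: str) -> str:
    """Replace a non-preferred word ending with the form listed in the guide."""
    for old, new in ENDING_MAP:
        # Require at least one character before the ending so that a
        # standalone word such as "ये" is left untouched.
        if word.endswith(old) and len(word) > len(old):
            return word[: -len(old)] + new
    return word

for w in ["गये", "जाइये", "गयी", "गयें", "बायीं"]:
    print(w, "->", normalize_ending(w))
```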
Use anuswara, ं, instead of half म when the next character is any of the प-series consonants:
प, फ, ब, भ.

Correct: भूकंप
Incorrect: भूकम्प

Correct: चंबल
Incorrect: चम्बल

Correct: गंभीर
Incorrect: गम्भीर

Use anuswara, ं, instead of half न or half र् when the next character is श, ष, स, or any of
the क, च, ट, त series. The full set of these characters is श, ष, स, क, ख, ग, घ, च, छ, ज, झ, ट,
ठ, ड, ढ, त, थ, द, ध.

Correct: मंच
Incorrect: मन्च

Correct: हिंदी
Incorrect: हिन्दी

Correct: रघुवंश
Incorrect: रघुवन्श

Here is one exception to the above two rules. When the previous character is ॉ, do not
use anuswara, ं.

Correct: सॉन्ग
Incorrect: सॉंग

Example audio: " सॉन्ग "



Explanation: If you followed the above rules, सॉन्ग will transform into सॉंग. While the character
sequence in the latter is actually स, ॉ, ं, ग, it looks like the sequence स, ा, ँ, ग, which has a different
pronunciation.
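
The two anuswara rules and their exception can be sketched with regular expressions. This is an illustration under assumptions, not a definitive implementation: `apply_anuswara` is a hypothetical name, only half म and half न are handled, and the follower sets are taken from the consonant lists in the rules above.

```python
import re

VIRAMA = "\u094d"    # halant, which turns a consonant into its half form
ANUSWARA = "\u0902"  # ं
CANDRA_O = "\u0949"  # ॉ, the vowel sign that triggers the exception

PA_SERIES = "पफबभ"
NA_FOLLOWERS = "शषसकखगघचछजझटठडढतथदध"

def apply_anuswara(word: str) -> str:
    """Replace half म / half न with anuswara, unless preceded by ॉ."""
    # Half म before a प-series consonant.
    word = re.sub(f"(?<!{CANDRA_O})म{VIRAMA}(?=[{PA_SERIES}])", ANUSWARA, word)
    # Half न before श, ष, स or a क/च/ट/त-series consonant.
    word = re.sub(f"(?<!{CANDRA_O})न{VIRAMA}(?=[{NA_FOLLOWERS}])", ANUSWARA, word)
    return word

for w in ["भूकम्प", "गम्भीर", "मन्च", "रघुवन्श", "सॉन्ग"]:
    print(w, "->", apply_anuswara(w))
```

The negative lookbehind is what keeps सॉन्ग unchanged, matching the exception described above.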

Always use anuswara ं. Since chandrabindu ँ and anuswara ं are commonly
interchanged, only use anuswara ं.

Correct: लंगूर
Incorrect: लँगूर

Correct: हंसो
Incorrect: हँसो

Correct: आंसू
Incorrect: आँसू

Correct: हूं
Incorrect: हूँ

Correct: लड़कियां
Incorrect: लड़कियाँ

Correct: मुस्कुराएं
Incorrect: मुस्कुराएँ
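
Because this rule is a one-for-one character substitution, it reduces to a single string replacement. A minimal sketch, with a hypothetical function name:

```python
CHANDRABINDU = "\u0901"  # ँ
ANUSWARA = "\u0902"      # ं

def replace_chandrabindu(text: str) -> str:
    """Replace every chandrabindu with anuswara."""
    return text.replace(CHANDRABINDU, ANUSWARA)

print(replace_chandrabindu("हँसो"))
```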

If you hear a word that does not sound like a standard word of your language because
there is a small sound change (i.e. accent, speech error, speech impairment, etc),
transcribe the intended word.

Correct: अभिषेक
Example audio: " अभिसेक "

Correct: रसगुल्ला
Example audio: " रसोगुल्ला "

If you hear a word that does not sound like a standard word of your language because it
appears to be nonsense, first perform a Google search for the word. If there is a clear
candidate, transcribe that word.
Correct: रामगड
Explanation: User says "रामगड". This might sound like nonsense at first, but the transcriber guesses
the spelling "रामगढ़" and is corrected by Google Search to "रामगड", a place in India. Transcribe रामगड.

Correct: भड़िया
Explanation: User says "भड़िया". Transcriber searches "बढ़िया", finds correct results. Transcribe भड़िया

If you can't understand part of the audio, transcribe only the part you can understand.
For the part you cannot understand, create a separate speaker segment and add the
Unintelligible label as instructed in: Longform generic rules > Unintelligible or foreign or
singing.

For utterances that contain speech that is user-generated, pre-recorded, or synthesized,
transcribe all of it.

If a user repeats a sentence for the sake of the phone, format the repetition as a
sentence if it's restating (as a sentence) what the person has said.

Correct: रॉकस्टार के गाने डाउनलोड करो। रॉकस्टार के गाने डाउनलोड करो।


Correct: दिल्ली का मौसम दिखाओ। दिल्ली में मौसम कैसा है?
Correct: किस मशीन से बगीचे को साफ करोगे? बगीचे को साफ करोगे
Explanation: If the repeated phrase is part of the sentence that just happens to form a sentence on
its own (possibly under a different interpretation), format it as a fragment. While "weed a garden"
can be a command, it is ambiguous and is most likely a fragment in this context.
Complete words that have been truncated only if a very small portion of the word is
missing (one syllable or less in a multisyllable word) and it is obvious what the word
should be. In cases of ambiguity, do not transcribe the cut-off word. Do not put
punctuation at the end of truncated words.

Correct: भूपति महेश भूपति

Example audio: " भूपति महेश भूपत "
Explanation: Final sound "इ" was truncated.

Correct: मनीष वसंत
Example audio: " मनीष वसंत कु "
Explanation: Unclear whether they would have said "कुमार" or "कुंज".

Correct: अमिताभ बच्चन की फिल्में

Example audio: " मिताभ बच्चन की फिल्में "

For numbers, stick to what is uttered, even if you know this is not all the speaker is going
to say.
Correct: टावर 2
Example audio: " टावर दो- "

Transcribe any filler words that are applicable and used in the target language. Below
are examples of filler words in the English language. These may or may not be applicable
in the target language. Again, only transcribe filler words that exist in and are used in
the target language.

हम्म
हं

Do not skip utterances that contain words in English. Most of them should be
transcribed using Devanagari characters. Only use Latin characters for English words if
they are measurements units, URLs, company names or tech words.

Correct: हेलो
Incorrect: hello
Explanation: Use Devanagari characters for common words in English

Correct: mp3, jpeg


Incorrect: म्प्३, ज्पेग्
Explanation: Use Latin characters for technical words

Correct: km, MB, C, dB


Explanation: Use Latin characters for measurement units

Correct: YouTube. Samsung. Gmail.


Explanation: Use Latin characters for company names that are not normally written in Devanagari

Correct: हैरी पॉटर


Incorrect: Harry Potter
Explanation: Use Devanagari for media titles

Correct: www.google.co.in
Explanation: Use Latin characters for URLs
Correct non-standard pronunciations to their standard ones. Non-standard
pronunciations could be from speakers of regional dialects, language learners, or
speakers from different countries.

Correct: मेरी भाषा एकदम बढ़िया है।

Incorrect: मेरी भाषा एकदम भड़िया है।
Example audio: " मेरी भाषा एकदम भड़िया है "
Explanation: Person said "भड़िया" with a "भ" sound, but it should still be spelled as standard.

Correct: बेंगलुरु
Incorrect: बंगलूर
Explanation: Person said "बंगलूर" but it should still be spelled as standard.

Longform generic rules


Unintelligible or foreign or singing
If you hear speech that is unintelligible or in a foreign language, create a speaker segment that
covers only the audio range with that speech. Then choose either the Unintelligible or Foreign
Speech option in the Label drop-down of the Edit Annotation window and then assign to the
appropriate speaker.
If you hear audio that is singing, transcribe the lyrics, assign to the appropriate speaker, and add
the Singing label. If the singing is in a foreign language, select the Foreign Speech label.

Segmentation
If overlapping speech is occurring: the segment boundaries should be accurate with at least
100ms precision.
If overlapping speech is NOT occurring, the segment boundaries do not have to be 100% precise
but should start and end within 1 second from when the speaker begins/ends their speech. The
boundaries should not overlap with other speaker turns next to the segment.
0.5-Second Rule:
Speaker turns should not contain pauses in speech that are longer than 0.5 seconds. If a speaker
does pause their speech for longer than 0.5 seconds, end the speaker turn before the pause then
create a new turn for when the speaker resumes talking.
There should never be a speaker turn with a pause in the speech that lasts for more than 0.5
seconds. If a speaker stops (takes a breath or makes a pause in their speech for whatever reason)
for more than 0.5 seconds, and then continues their speech – the speaker turns should reflect
that. The same goes for annotations. Noise, laughter and music annotations should be ended if
there is a 0.5 second pause in the sounds that should go under these annotations. Buffers should
NOT be taken into consideration while determining 0.5 second. It should solely be based on when
the sound/speech begins and ends.
Tip: Whenever a speaker pauses, pause the audio and look at the waveform. The numbers 5 and 8
on the NumPad are shortcuts that move the red line 0.5 seconds forwards or backwards in the
audio; use them whenever you measure for 0.5-second pauses. If you bring the red line to where
the speaker continues and press 8 on the NumPad, you will see this is much more useful than
NumPad 5.
30-Second Rule:
Speaker turns should not be longer than 30 seconds. If a single speaker talks for more than 30
consecutive seconds without taking a 0.5 second pause, then end the turn at the 30 second mark
and begin a new turn.
When a speaker talks, sometimes the speaker turn can’t be split exactly at the 30-second mark
because it means that a word would get cut off. In such cases, you can end the segment as close
to the 30-second mark as possible, but without cutting off words. So, a segment can end at the
28-30-second mark, but never at the 30.5-second mark. The 30 second rule does not apply to
annotations. The annotations for noise, music, and laughter can last for more than 30 seconds.
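
The 0.5-second and 30-second rules can be combined into one pass over word-level timestamps. The following is a sketch under assumptions: it presumes a hypothetical list of `(start, end)` times in seconds for one speaker's words, sorted by start time, and it does not model buffers, overlapping speakers, or a single word longer than 30 seconds.

```python
def build_turns(words, max_pause=0.5, max_length=30.0):
    """Group word intervals into speaker turns.

    A new turn starts when the pause before a word exceeds `max_pause`,
    or when adding the word would make the turn longer than `max_length`.
    Turns are only ever split between words, never inside one.
    """
    turns = []
    turn_start = turn_end = None
    for start, end in words:
        if turn_start is None:
            turn_start, turn_end = start, end
        elif start - turn_end > max_pause or end - turn_start > max_length:
            turns.append((turn_start, turn_end))
            turn_start, turn_end = start, end
        else:
            turn_end = end
    if turn_start is not None:
        turns.append((turn_start, turn_end))
    return turns

# A 0.8-second pause between the second and third word splits the turn.
print(build_turns([(0.0, 0.4), (0.5, 1.0), (1.8, 2.2)]))
```

Because the split always happens before the word that would cross the limit, a turn can end slightly before the 30-second mark but never after it, which matches the rule above.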

Speaker labeling
All speaker labels should be consistently formatted. Speaker labels should always: be in all
lowercase, be spelled correctly, and should not contain underscores or hyphens. A validator
within the tool will prevent you from submitting incorrect formats.
Correct: speaker 1
Incorrect: Speaker 1
Correct: pre recorded speaker 1 (without a hyphen)
Incorrect: pre-recorded speaker_1
Correct: unidentifiable speaker

'speaker #' Used for different speakers in the audio. Includes a number that corresponds to each
different speaker.
'pre recorded speaker #' Used when there is speech coming from a machine. Includes a number
that corresponds to each different pre recorded speaker.
'unidentifiable speaker' Used when you cannot identify who the speaker is. Does not ever include
numbers.
'speaker Tom' Used when the name of a speaker becomes known. The names of speakers should
always be capitalized. You can use first and last names. (Note: adding speaker names will be
allowed for some projects but not others. In tool validators will indicate whether or not you can
submit a speaker name.)
We do not use usernames/nicknames in speaker labeling. Only if the actual name of the person
is known, we can use it in the speaker label. If only the internet username/YouTube
name/nickname is known, the speaker should be “speaker #”.
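
The label formats described above can be approximated with a few regular expressions. This is a hypothetical sketch, not the validator built into the tool: the patterns, the function name, and the assumption that speaker names are written in plain ASCII letters are all illustrative.

```python
import re

# Assumed patterns for the four label types described above.
LABEL_PATTERNS = [
    re.compile(r"speaker [1-9][0-9]*"),
    re.compile(r"pre recorded speaker [1-9][0-9]*"),
    re.compile(r"unidentifiable speaker"),
    # Known real name: capitalized first name, optional capitalized last name.
    re.compile(r"speaker [A-Z][a-z]+( [A-Z][a-z]+)?"),
]

def is_valid_speaker_label(label: str) -> bool:
    """Return True if the whole label matches one of the assumed formats."""
    return any(p.fullmatch(label) for p in LABEL_PATTERNS)

for label in ["speaker 1", "Speaker 1", "pre-recorded speaker_1", "speaker Tom"]:
    print(label, "->", is_valid_speaker_label(label))
```

Whether named labels are accepted at all depends on the project, so a real check would need that as an extra setting.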

Annotations
Please annotate everything that you can hear, even if the line is flat. This means that all
speech, noises, music and laughter should be annotated, whenever heard.
Noise
Any noise that is audible should be annotated properly, regardless of whether the noise is heard
during the speech or when there is no speech. This is the most frequent correction the reviewers
have to make. This includes, for example, noise such as gasping to catch breath during a pause
in speech.
Music
Any music that is audible should be annotated properly, regardless of whether the music is heard
during the speech or when there is no speech. If there are lyrics involved, please transcribe the
lyrics in accordance with the above mentioned rules.
Laughter
Any laughter that is audible should be annotated properly, regardless of whether the laughter is
heard during the speech or when there is no speech.

PII
PII stands for Personally Identifiable Information. PII is information that is not publicly available,
but can help you or Google identify an individual person.
The full list of what is considered PII can be found in the guidelines, but a quick rule of thumb is
to mark as PII everything that could identify a person:
personal name, address (street name and number), phone number, credit card information, etc.
Things that should not be marked as PII are:
Names of fictional characters – (e.g. Harry Potter) – The character Harry Potter is not a real
person, and cannot have personal information.
Names of pets – Pets cannot be considered persons, so their names are not PII.
Commonly known names (celebrities) – e.g. Brad Pitt
Internet usernames, nicknames, channel names – PewDiePie, DanTDM

Given that all of the audio files come from YouTube videos that are already public, take this
into consideration when deciding whether anything in the audio is PII. Any information that
belongs to other people who are not present in the audio file, and any other sensitive
information (like a credit card number), should be marked as PII going forward. A YouTuber's
name, the names of the people in the audio file, their social media accounts, email addresses,
or even mobile numbers should not be considered PII if they are shared willingly in the audio
file.
