
1 Prosody-Based Automatic Segmentation of Speech into Sentences and Topics
Elizabeth Shriberg, Andreas Stolcke (Speech Technology and Research Laboratory)
Dilek Hakkani-Tur, Gokhan Tur (Department of Computer Engineering, Bilkent University)
To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio
Presenter: Yi-Ting Chen

2 Outline
Introduction
Method
–Prosodic modeling
–Language modeling
–Model combination
–Data
Results and discussion
Summary and conclusion

3 Introduction (1/2)
Why process audio data? Why automatic segmentation?
–A crucial step toward robust information extraction from speech is the automatic determination of topic, sentence, and phrase boundaries
Why use prosody?
–In all languages, prosody is used to convey structural, semantic, and functional information
–Prosodic cues are, by their nature, relatively unaffected by word identity
–Unlike spectral features, some prosodic features are largely invariant to changes in channel characteristics
–Prosodic feature extraction can be achieved with minimal additional computational load and no additional training data

4 Introduction (2/2)
This paper describes the prosodic modeling in detail
Decision tree and hidden Markov modeling techniques are used to combine prosodic cues with word-based approaches; performance is evaluated on two speech corpora
Results are reported both for true words and for words as hypothesized by a speech recognizer

5 Method (1/6) – Prosodic modeling
Feature extraction regions
–For each inter-word boundary, they looked at prosodic features of the word immediately preceding and following the boundary, or alternatively within a window of 20 frames (200 ms) before and after the boundary
–They extracted prosodic features reflecting pause durations, phone durations, pitch information, and voice quality information
–They chose not to use amplitude- or energy-based features, since previous work showed these features to be both less reliable than and largely redundant with duration and pitch features

6 Method (2/6) – Prosodic modeling
Features:
–The features were designed to be independent of word identities
–They began with a set of over 100 features, which was pared down to a smaller set by eliminating features
–Pause features: important cues to boundaries between semantic units
 –The pause at the boundary was modeled as an individual phone
 –If there was no pause at the boundary, the pause duration feature was output as 0
 –The duration of the pause preceding the word before the boundary was also used
 –Both raw durations and durations normalized by the particular speaker's pause duration distribution were investigated (a sketch follows this slide)
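The transcript does not show the exact pause-feature definitions, so the following is a minimal sketch of the two quantities named above: the raw pause duration at the boundary (0 if no pause) and a version normalized against the speaker's own pause-duration statistics. The function names and the z-score style normalization are assumptions for illustration, not the paper's exact recipe.

```python
import statistics

def pause_features(pause_dur, speaker_pauses):
    """Sketch of boundary pause features.

    pause_dur      -- pause duration (seconds) at this inter-word boundary, 0.0 if none
    speaker_pauses -- this speaker's nonzero pause durations seen in the data
    """
    raw = pause_dur  # raw duration; 0.0 encodes "no pause at the boundary"
    if len(speaker_pauses) >= 2:
        mu = statistics.mean(speaker_pauses)
        sigma = statistics.stdev(speaker_pauses) or 1.0
        normalized = (pause_dur - mu) / sigma  # assumed speaker normalization
    else:
        normalized = 0.0
    return {"pause_raw": raw, "pause_norm": normalized}

# Example: a 0.45 s pause from a speaker whose other pauses were 0.2-0.6 s
print(pause_features(0.45, [0.2, 0.3, 0.6, 0.5]))
```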

7 Method (3/6) – Prosodic modeling
Features:
–Phone and rhyme duration features: capture the slowing down toward the ends of units, or preboundary lengthening
 –Preboundary lengthening typically affects the nucleus and coda of syllables
 –Duration characteristics of the last rhyme of the syllable preceding the boundary were used
 –Each phone in the rhyme was normalized for inherent duration (see the sketch below)
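The slide's normalization formula is not reproduced in this transcript. A common way to normalize a phone for its inherent duration, and the one assumed in this sketch, is a z-score against phone-specific duration statistics from training data; the rhyme feature is then, for instance, the average normalized duration over the rhyme's phones. The statistics table and function names are illustrative.

```python
# Assumed per-phone duration statistics (mean, stdev in seconds) from training data
PHONE_STATS = {"ih": (0.055, 0.020), "ng": (0.070, 0.025)}

def normalized_phone_duration(phone, dur):
    """z-score of a phone's duration against its inherent duration statistics."""
    mean, std = PHONE_STATS[phone]
    return (dur - mean) / std

def rhyme_lengthening(rhyme):
    """Average normalized duration over the phones of the pre-boundary rhyme.

    rhyme -- list of (phone, duration) pairs, e.g. the nucleus and coda of
             the last syllable before the candidate boundary.
    """
    scores = [normalized_phone_duration(p, d) for p, d in rhyme]
    return sum(scores) / len(scores)

# Example: a lengthened "-ing" rhyme before a candidate sentence boundary
print(rhyme_lengthening([("ih", 0.09), ("ng", 0.12)]))
```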

8 Method (4/6) – Prosodic modeling
Features:
–F0 features:
 –Pitch information is typically less robust and more difficult to model than other prosodic features
 –The raw F0 contours were post-processed to smooth out microintonation and tracking errors, to simplify F0 feature computation, and to identify speaking-range parameters for each speaker

9 Method (5/6) – Prosodic modeling
Features:
–F0 features:
 –Reset features
 –Range features
 –F0 slope features
 –F0 continuity features
–Estimated voice quality features
–Other features: speaker gender, turn boundaries, time elapsed from the start of the turn, and the turn count in the conversation
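As an illustration of one of the F0 feature families above, here is a minimal sketch of a reset-style feature: the log ratio of the F0 level just after the boundary to the level just before it (a drop followed by a reset upward is a classic boundary cue). The window choice, the use of medians, and the log-ratio form are assumptions for illustration, not the paper's precise definitions.

```python
import math
import statistics

def f0_reset_feature(f0_before, f0_after):
    """Log ratio of post-boundary to pre-boundary F0 level.

    f0_before, f0_after -- lists of voiced F0 values (Hz) in short windows
                           on either side of the candidate boundary.
    Returns None if either side has no voiced frames.
    """
    if not f0_before or not f0_after:
        return None
    level_before = statistics.median(f0_before)  # median is robust to tracking errors
    level_after = statistics.median(f0_after)
    return math.log(level_after / level_before)

# Example: pitch falls toward the bottom of the range, then resets upward
print(f0_reset_feature([140, 120, 105, 95], [180, 175, 170]))
```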

10 Method (6/6) – Prosodic modeling
Decision trees
–Decision trees are probabilistic classifiers
–Given a set of features and a labeled training set, the decision tree construction algorithm repeatedly selects the single feature with the highest predictive value at each split
–The leaves of the tree store the class distribution of all training samples falling into the corresponding region of the feature space
–Decision trees make no assumptions about the shape of the feature distributions
–It is not necessary to convert feature values to a standard scale
Feature selection algorithm
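The paper trained CART-style trees with its own feature-selection loop; the sketch below only illustrates the idea of a probabilistic tree classifier over boundary features, using scikit-learn as a stand-in for the authors' tree learner. The feature names and toy data are invented for the example.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training data: [pause_norm, rhyme_lengthening, f0_reset] per inter-word boundary
X_train = [
    [2.1, 1.8, 0.40],   # long pause, lengthening, pitch reset
    [1.5, 1.2, 0.30],
    [0.0, -0.2, 0.05],  # no pause, no lengthening
    [0.1, 0.1, -0.10],
    [0.0, 0.0, 0.00],
]
y_train = ["boundary", "boundary", "none", "none", "none"]

# Leaves store class counts, so predict_proba returns a class distribution,
# mirroring the slide's point that the tree is a probabilistic classifier.
tree = DecisionTreeClassifier(min_samples_leaf=1, random_state=0)
tree.fit(X_train, y_train)

candidate = [[1.9, 1.5, 0.35]]
print(dict(zip(tree.classes_, tree.predict_proba(candidate)[0])))
```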

11 Method (1/3) – Language modeling
The goal: to capture the information about segment boundaries contained in the word sequences
The joint distribution of boundary types and words is modeled with a hidden Markov model (HMM)
Denoting the boundary classification by T and the word sequence by W, the HMM models the joint distribution P(W, T)
The slightly more complex forward-backward algorithm (rather than Viterbi decoding of a single best sequence) is used to maximize the posterior probability of each individual boundary classification
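To make the forward-backward point concrete, here is a minimal sketch of per-boundary posterior computation for a tiny two-state hidden-event chain (boundary vs. no boundary after each word). The transition and emission numbers are invented; the point is only that forward-backward yields a posterior for each individual boundary position, which can then be thresholded per position rather than taking one best Viterbi path.

```python
import numpy as np

# Two hidden events after each word: "boundary" (B, column 0) or "no boundary" (N, column 1)
start = np.array([0.5, 0.5])
trans = np.array([[0.2, 0.8],   # P(next state | B)
                  [0.1, 0.9]])  # P(next state | N)
# Invented per-position observation likelihoods P(obs_t | state); rows = positions
obs_lik = np.array([[0.7, 0.3],
                    [0.2, 0.8],
                    [0.9, 0.1]])

T = obs_lik.shape[0]
alpha = np.zeros((T, 2))
beta = np.zeros((T, 2))

alpha[0] = start * obs_lik[0]
for t in range(1, T):                       # forward pass
    alpha[t] = (alpha[t - 1] @ trans) * obs_lik[t]

beta[T - 1] = 1.0
for t in range(T - 2, -1, -1):              # backward pass
    beta[t] = trans @ (obs_lik[t + 1] * beta[t + 1])

posterior = alpha * beta
posterior /= posterior.sum(axis=1, keepdims=True)
for t in range(T):
    print(f"position {t}: P(boundary) = {posterior[t, 0]:.3f}")
```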

12 Method (2/3) – Language modeling
Sentence segmentation
–A hidden-event N-gram language model
–The states of the HMM consist of the end-of-sentence status of each word, plus any preceding words and possibly boundary tags to fill up the N-gram context
–Transition probabilities are given by N-gram probabilities estimated from boundary-tagged training data using Katz backoff
–Ex.: see the sketch after this slide
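Since the slide's example is not reproduced in the transcript, here is an assumed illustration of the hidden-event idea: the boundary tag is treated like an extra token in the N-gram, so the transition score for "boundary after word i" is just the N-gram probability of the tag given the preceding context. The paper estimates these with Katz backoff; this toy sketch uses plain maximum-likelihood bigram counts.

```python
from collections import Counter

# Toy boundary-tagged training text; "<S>" marks a hidden sentence-boundary event.
training = "i think so <S> yeah <S> i think that is right <S>".split()

# Maximum-likelihood bigram estimates (the paper uses N-grams with Katz backoff).
bigrams = Counter(zip(training, training[1:]))
unigrams = Counter(training)

def p_bigram(word, prev):
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

# Probability that a sentence boundary follows the word "so" vs. the word "think"
print("P(<S> | so)    =", p_bigram("<S>", "so"))
print("P(<S> | think) =", p_bigram("<S>", "think"))
```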

13 Method (3/3) – Language modeling
Topic segmentation
–First, 100 individual unigram topic cluster language models were constructed using the multipass k-means algorithm (using TDT data)
–Then an HMM was built in which the states are topic clusters and the observations are sentences
–In addition to the basic HMM segmenter, two states were incorporated for modeling the initial and final sentences of a topic segment
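A minimal sketch of the topic-HMM idea described above: each state is a topic cluster with a unigram language model, sentences are scored under each cluster, and a Viterbi pass with a topic-switch penalty decides where the topic state changes; those change points are the hypothesized topic boundaries. The two toy clusters, the penalty value, and the scoring details are assumptions for illustration (the paper uses 100 clusters from multipass k-means plus extra initial/final-sentence states).

```python
import math

# Two toy unigram topic-cluster language models (the paper builds 100 via k-means).
clusters = {
    "weather": {"rain": 0.4, "storm": 0.3, "cold": 0.3},
    "finance": {"stocks": 0.4, "market": 0.4, "rates": 0.2},
}
SWITCH_PENALTY = 5.0   # assumed log-cost for changing topic state between sentences
FLOOR = 1e-4           # unseen-word floor so log() never sees zero

def sentence_logprob(sentence, lm):
    return sum(math.log(lm.get(w, FLOOR)) for w in sentence.split())

def topic_boundaries(sentences):
    """Viterbi over topic states; a boundary is reported where the topic changes."""
    names = list(clusters)
    score = {c: sentence_logprob(sentences[0], clusters[c]) for c in names}
    back = [{c: None for c in names}]
    for sent in sentences[1:]:
        new_score, new_back = {}, {}
        for c in names:
            best_prev = max(names, key=lambda p: score[p] - (SWITCH_PENALTY if p != c else 0.0))
            cost = SWITCH_PENALTY if best_prev != c else 0.0
            new_score[c] = score[best_prev] - cost + sentence_logprob(sent, clusters[c])
            new_back[c] = best_prev
        score, back = new_score, back + [new_back]
    state = max(score, key=score.get)          # trace back the best topic sequence
    path = [state]
    for bp in reversed(back[1:]):
        state = bp[state]
        path.append(state)
    path.reverse()
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]

sents = ["rain and storm tonight", "cold rain tomorrow",
         "stocks and rates fell", "market rally expected"]
print(topic_boundaries(sents))  # expected: a boundary before sentence index 2
```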

14 Method (1/3) – Model combination
Prosodic and lexical segmentation cues are expected to be partly complementary
–Posterior probability interpolation (sketched after this slide)
–Integrated hidden Markov modeling
 –With suitable independence assumptions, the familiar HMM techniques (Viterbi and forward-backward decoding) can be applied to compute the boundary posteriors
 –To incorporate the prosodic information into the HMM, prosodic features are modeled as emissions from the relevant HMM states, with likelihoods P(F_i | T_i)
 –A complete path through the HMM is thus associated with a total probability combining the lexical N-gram probabilities and the prosodic likelihoods
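A minimal sketch of the first combination scheme, posterior probability interpolation: the HMM's per-boundary posterior from the word model and the decision tree's posterior from the prosodic features are mixed with a tunable weight. The weight value and function names are illustrative; the likelihood estimation needed for the integrated HMM is sketched after the next slide.

```python
def interpolated_posterior(p_lm, p_prosody, lam=0.5):
    """Interpolate the language-model and prosodic-tree boundary posteriors.

    p_lm      -- P(boundary | words) from the hidden-event HMM (forward-backward)
    p_prosody -- P(boundary | prosodic features) from the decision tree
    lam       -- interpolation weight, tuned on held-out data
    """
    return lam * p_lm + (1.0 - lam) * p_prosody

# Example: the word model is unsure, the prosodic model is confident
p = interpolated_posterior(p_lm=0.45, p_prosody=0.90, lam=0.6)
print(p, "-> boundary" if p > 0.5 else "-> no boundary")
```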

15 Method (2/3) – Model combination
Prosodic and lexical segmentation cues are expected to be partly complementary
–Integrated hidden Markov modeling: how to estimate the likelihoods
 –Note that the decision tree estimates posteriors P(T_i | F_i)
 –These can be converted to likelihoods using Bayes' rule: P(F_i | T_i) = P(T_i | F_i) P(F_i) / P(T_i), where P(F_i) can be dropped since it does not depend on the boundary class (see the sketch below)
 –A beneficial side effect of this approach is that the decision tree models the lower-frequency events in greater detail than if it were presented with the raw, highly skewed class distribution
 –A tunable model combination weight (MCW) was introduced
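A minimal sketch of the conversion just described: dividing the tree's posterior by the class prior gives a scaled likelihood (Bayes' rule with the class-independent P(F_i) dropped). The model combination weight is assumed here to act as an exponent on the prosodic term when it is multiplied into the HMM path score; the prior values and the exponent form are illustrative assumptions.

```python
def prosodic_likelihoods(tree_posteriors, priors, mcw=1.0):
    """Convert decision-tree posteriors P(T|F) into scaled likelihoods P(F|T).

    Bayes' rule: P(F|T) = P(T|F) * P(F) / P(T); P(F) is dropped as a
    class-independent constant. The model combination weight (MCW) is applied
    as an exponent to tune the influence of the prosodic model (assumed form).
    """
    return {cls: (tree_posteriors[cls] / priors[cls]) ** mcw for cls in tree_posteriors}

# Example: the tree says 0.8 boundary / 0.2 no-boundary; boundaries have prior 0.1
liks = prosodic_likelihoods({"boundary": 0.8, "none": 0.2},
                            priors={"boundary": 0.1, "none": 0.9},
                            mcw=0.8)
print(liks)  # scaled likelihoods to multiply into the HMM state scores
```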

16 Method (3/3) – Model combination
Prosodic and lexical segmentation cues are expected to be partly complementary
–HMM posteriors as decision tree features
 –For practical reasons this approach was not used in this work
 –Drawback: it can overestimate the informativeness of the word-based posteriors when these are based on automatic transcriptions
–Alternative models
 –HMM: a drawback is that the independence assumptions may be inappropriate and inherently limit the performance of the model
 –Decision trees: advantages: discrimination between the target classifications is enhanced, and input features can be combined easily; drawbacks: sensitivity to skewed class distributions, and modeling multiple target variables is expensive

17 Method (1/2) – Data
Speech data and annotations
–Switchboard data: a subset of the corpus that had been hand-labeled for sentence boundaries by the LDC
–Broadcast News data for topic and sentence segmentation was extracted from the LDC's 1997 Broadcast News (BN) release
–Training of the Broadcast News language models used an additional 130 million words of text-only transcripts from the 1996 Hub-4 language model corpus (for sentence segmentation)
Training, tuning, and test sets

18 Method (2/2) – Data
Word recognition
–1-best output from SRI's DECIPHER large-vocabulary speech recognizer
–Several of the computationally expensive or cumbersome steps (such as acoustic adaptation) were skipped
–Switchboard test set: 46.7% WER
–Broadcast News: 30.5% WER
Evaluation metrics
–Sentence segmentation performance for true words was measured by the boundary classification error
–For recognized words, a string alignment of the automatically labeled recognition hypothesis to the reference is performed first
–The error rate is then calculated on the aligned boundaries
–Topic segmentation was evaluated using the metric defined by NIST for the TDT-2 evaluation
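A minimal sketch of the boundary classification error used for sentence segmentation: compare the hypothesized class at each inter-word boundary with the reference and report the fraction misclassified (for recognized words this is done after string-aligning the hypothesis to the reference, a step omitted here). The label encoding is illustrative.

```python
def boundary_error_rate(reference, hypothesis):
    """Fraction of inter-word boundaries whose class was misclassified.

    reference, hypothesis -- equal-length sequences of per-boundary labels,
    e.g. "S" for a sentence boundary and "-" for no boundary.
    """
    assert len(reference) == len(hypothesis)
    errors = sum(r != h for r, h in zip(reference, hypothesis))
    return errors / len(reference)

ref = ["-", "-", "S", "-", "S", "-"]
hyp = ["-", "S", "S", "-", "-", "-"]
print(boundary_error_rate(ref, hyp))  # 2 errors out of 6 boundaries -> ~0.333
```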

19 Results and discussion (1/10)
Task 1: Sentence segmentation of Broadcast News data
–Prosodic feature usage
 –The best-performing tree identified six features for this task, which fall into four groups
 –Feature usage: pause > turn > F0 > rhyme duration
 –The behavior of the features is precisely as expected from the descriptive literature

20 Results and discussion (2/10)
Task 1: Sentence segmentation of Broadcast News data
–Error reduction from prosody
 –The prosodic model alone performs better than the word-based language model
 –The prosodic model is somewhat more robust to recognizer output than the language model

21 Results and discussion (3/10)
Task 1: Sentence segmentation of Broadcast News data
–Performance without F0 features
 –The F0 features used are not typically extracted or computed in most ASR systems
 –The effect of removing all F0 features was therefore examined
 –The results could also indicate a higher degree of correlation between true words and the prosodic features

22 Results and discussion (4/10)
Task 2: Sentence segmentation of Switchboard data
–Prosodic feature usage
 –A different distribution of features than observed for Broadcast News
 –The primary feature type used here is preboundary duration
 –Pause duration at the boundary was also useful
 –Most interesting about this tree was the consistent behavior of the duration features, which gave higher probability to a sentence boundary as durations increased

23 Results and discussion (5/10)
Task 2: Sentence segmentation of Switchboard data
–Error reduction from prosody
 –The prosodic model alone is not a particularly good model
 –Combining prosody with the language model resulted in a statistically significant improvement
 –All differences were statistically significant

24 Results and discussion (6/10)
Task 3: Topic segmentation of Broadcast News data
–Prosodic feature usage
 –Five feature types were most helpful for this task
 –The results are similar to those seen earlier for sentence segmentation in Broadcast News
 –The importance of pause duration is underestimated

25 Results and discussion (7/10)
Task 3: Topic segmentation of Broadcast News data
–Prosodic feature usage
 –The speaker-gender feature
 –The women in a sense behave more "neatly" than the men
 –One possible explanation is that men are more likely than women to produce regions of nonmodal voicing at topic boundaries

26 Results and discussion (8/10)
Task 3: Topic segmentation of Broadcast News data
–Error reduction from prosody
 –All results reflect the word-averaged, weighted error metric used in the TDT-2 evaluation
 –Chance here corresponds to outputting the "no boundary" class at all locations, meaning that the false alarm rate is zero and the miss rate is 1
 –The metric assigns a weight of 0.7 to false alarms and 0.3 to misses (a worked example follows this slide)
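Taking the slide's weights at face value, a worked sketch of the weighted error and of the chance baseline: with no boundaries ever output, the false alarm rate is 0 and the miss rate is 1, so the chance error is 0.7·0 + 0.3·1 = 0.3. The full word-averaged TDT-2 formulation is more involved; this only illustrates the weighting.

```python
def weighted_topic_error(false_alarm_rate, miss_rate, w_fa=0.7, w_miss=0.3):
    """Weighted topic-segmentation error with the weights quoted on the slide."""
    return w_fa * false_alarm_rate + w_miss * miss_rate

print(weighted_topic_error(0.0, 1.0))    # chance baseline: 0.3
print(weighted_topic_error(0.05, 0.40))  # example system: 0.155
```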

27 Results and discussion (9/10)
Task 3: Topic segmentation of Broadcast News data
–Performance without F0 features
 –The experiments were conducted only for true words, since, as shown in Table 5, results are similar to those for recognized words

28 Results and discussion (10/10)
Comparisons of error reduction across conditions
–Performance without F0 features
 –While researchers have typically found Switchboard a difficult corpus to process, in the case of sentence segmentation on true words it is just the opposite: performance is atypically good
 –Previous work on automatic segmentation of Switchboard transcripts is therefore likely to overestimate success for other corpora

29 Summary and conclusion
The use of prosodic information for sentence and topic segmentation was studied
Results showed that on Broadcast News the prosodic model alone performed as well as purely word-based statistical language models
Interestingly, the integrated HMM worked best on transcribed words, while the posterior interpolation approach was much more robust in the case of recognized words

