SUMMARY In this paper, we propose a driver identification method based on the driving behavior signals observed while the driver is following another vehicle. Driving behavior signals, such as the use of the accelerator pedal, brake pedal, vehicle velocity, and distance from the vehicle in front, were measured using a driving simulator. We compared the ...
ABSTRACT As voice communication tools over the Internet continue to spread, people have more chances to use microphones on their private PCs in various acoustic environments. When PC-based speech input applications are used, this variety of environments causes speech recognition performance degradation. To improve speech recognition accuracy, it is crucial to collect speech data in the environment in which the system is used [1]. We collected speech interactions with PC-based applications in a wide range of user environments through a field test, obtaining 488 hours of recorded data including 29 hours of speech segments, corresponding to about sixty thousand utterances. In addition to collecting data, we assessed system usability with a questionnaire that asked about system usability and the subjective impression of the speech recognition performance. Using the system data log and the questionnaire results, we analyzed the relationship between subjective performance and objective metrics. Through this analysis, a Bayesian-network-based stochastic model was built that predicts the subjective score of system usability from personal profiles and several objective metrics. Experimental results showed that each user's satisfaction index could be predicted for 35.2% of the subjects using the trained Bayesian network.
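For a rough feel of how a stochastic model can map objective metrics to a subjective score, here is a minimal sketch that substitutes a naive Bayes classifier for the paper's Bayesian network; the three discretized metrics, the five-level satisfaction score, and all data are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Stand-in for the paper's Bayesian network: predict a discretized
# satisfaction level from discretized objective metrics. The features
# (e.g., word-error-rate band, usage-frequency band, profile band) and
# all data below are invented for illustration.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 3))   # 200 users, 3 discretized metrics
y = rng.integers(0, 5, size=200)        # 5-level subjective satisfaction

model = CategoricalNB().fit(X, y)
new_user = np.array([[1, 2, 0]])        # metric bands for an unseen user
print("predicted satisfaction level:", model.predict(new_user)[0])
```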
Our research group has recently developed a new data collection vehicle equipped with various sensors for the synchronous recording of multimodal data including speech, video, driving behavior, and physiological signals. Driver speech is recorded with 12 microphones distributed throughout the vehicle. Face images and a view of the road ahead are captured with three CCD cameras. Driving behavior signals including gas and brake pedal pressures, steering angles, vehicle velocities, and following distances are recorded. Physiological sensors are mounted to measure the drivers’ heart rate, skin conductance, and emotion-based sweating on the palm of the hand and sole of the foot. The multimodal data are collected while driving on city roads and expressways during four different tasks: reading random four-character alphanumeric strings, reading words on billboards and signs seen while driving, interacting with a spoken dialogue system to retrieve and play music, and talking on a cell phone...
In this paper, we introduce a spoken language interface for music information retrieval. In response to voice commands, the system searches for a song through an Internet music shop or a "playlist" stored on the local PC; the system then plays it. To cope with the almost unlimited size of the vocabulary, a remote server program with which users can customize their recognition grammar and dictionary is implemented. When a user selects favorite artists, the server program automatically generates a minimal set of recognition grammars and a dictionary, and sends them to the interface program. As a result, the vocabulary is less than 1,000 words per user on average. To perform a field test of the system, we implemented a speech collection capability, whereby speech utterances are compressed in Free Lossless Audio Codec (FLAC) format and sent back to the server program with dialogue logs. Currently, the system is available to the public for experimental use. ...
This paper describes emotional speech classification in anime films. An emotional speech corpus was constructed from 8 hours of collected data, consisting of a total of 984 utterances. Five emotions, namely joy, surprise, anger, sadness, and the neutral case, were labeled and divided into training and test data. In a previous study, Attack, Keep, and Decay were adopted as parameters to describe the temporal characteristics of the power transition. This paper proposes an improved A-K-D unit method and evaluates it. As a result, the acoustic features of the proposed method were more effective than those of the conventional method when used with a GMM.
A method for recognizing spoken utterances of a speaker is disclosed, the method comprising the steps of providing a database of labeled speech data; providing a prototype of a Hidden Markov Model (HMM) definition to define the characteristics of the HMM; and parameterizing speech utterances according to one of linear prediction parameters or Mel-scale filter bank parameters. The method further includes selecting a frame period for accommodating the parameters and generating HMMs and decoding to specified speech utterances by causing the user to utter predefined training speech utterances for each HMM. The method then statistically computes the generated HMMs with the prototype HMM to provide a set of fully trained HMMs for each utterance indicative of the speaker. The trained HMMs are used for recognizing a speaker by computing Laplacian distances via distance table lookup for utterances of the speaker during the selected frame period; and iteratively decoding node transitions corresponding to the spoken utterances during the selected frame period to determine which predefined utterance is present.
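To make the decoding step concrete, the following is a minimal Viterbi sketch over a left-to-right HMM; the per-frame emission scores stand in for the Laplacian distances obtained by table lookup, and all model values are invented. In practice the final score would be compared across the trained HMMs to decide which predefined utterance is present.

```python
import numpy as np

# Minimal Viterbi (max-product) pass over a 3-state left-to-right HMM.
# Emission scores are random stand-ins for the per-frame distance lookups.
log_trans = np.log(np.array([[0.7, 0.3, 1e-12],
                             [1e-12, 0.8, 0.2],
                             [1e-12, 1e-12, 1.0]]))
log_emit = np.log(np.random.default_rng(0).random((10, 3)))  # 10 frames

delta = log_emit[0] + np.log([1.0, 1e-12, 1e-12])  # must start in state 0
for t in range(1, len(log_emit)):
    delta = (delta[:, None] + log_trans).max(axis=0) + log_emit[t]
score = delta[-1]   # best-path score ending in the final state
```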
ASRU1999: IEEE Workshop on Automatic Speech Recognition and Understanding, December 1999. A sharable software repository for Japanese LVCSR (Large Vocabulary Continuous Speech Recognition) is introduced. It is designed as a baseline platform for research and is developed by researchers from different academic institutes under governmental support. The repository consists of a recognition engine, a variety of acoustic models and language models, as well as Japanese morphological analysis tools. These modules can be easily integrated and replaced under a plug-and-play framework, which makes it possible to fairly evaluate components and to develop specific application systems. In this paper, the specifications of the current version are described, and an assessment on a 20,000-word dictation task, which was also set up in our project, is reported. The software repository is freely available to the public.
... 3.4. Speaking Rate: We define the speaking rate (SPR) as the duration per mora. It is calculated using the result of forced alignment with the reference monophone labels. ... In Figure 3, the average word accuracy and SNR are plotted for all drivers. ...
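A minimal sketch of the SPR computation, assuming the forced alignment yields per-word start and end times plus mora counts (the alignment entries below are invented):

```python
# Each entry is (start_sec, end_sec, mora_count) for one aligned word.
alignment = [(0.00, 0.42, 2), (0.42, 1.10, 3), (1.10, 1.55, 2)]

total_duration = sum(end - start for start, end, _ in alignment)
total_morae = sum(morae for _, _, morae in alignment)
spr = total_duration / total_morae  # duration per mora, in seconds
print(f"SPR = {spr:.3f} s/mora")
```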
We propose the combination of physical-model-based and deep-learning (DL)-based source separation for near- and far-field source separation. The DL-based method uses spherical-harmonic-analysis-based acoustic features. Deep learning is a state-of-the-art technique for source separation; in this approach, a deep neural network (DNN) is used to predict a time-frequency (T-F) mask. To accurately predict a T-F mask, it is necessary to use acoustic features that have high mutual information with the oracle T-F mask. However, effective acoustic features for separating near- and far-field sources are unknown. In this study, low-frequency-band near- and far-field sources are estimated based on spherical harmonic analysis and used as acoustic features. Subsequently, a DNN predicts a T-F mask to separate all frequency bands. Our experimental results show that the proposed method improved the signal-to-distortion ratio by 6–8 dB compared to the harmonic-analysis-based method.
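As a rough sketch of the T-F masking step (not the paper's spherical-harmonic front end), the code below applies an oracle ideal ratio mask to a synthetic near/far mixture; in the proposed method a DNN would predict this mask from spherical-harmonic-analysis features instead.

```python
import numpy as np
from scipy.signal import stft, istft

# Synthetic stand-ins for near- and far-field sources at one microphone.
fs = 16000
t = np.arange(fs) / fs
near = np.sin(2 * np.pi * 440 * t)
far = 0.5 * np.random.default_rng(0).standard_normal(fs)
mix = near + far

_, _, NEAR = stft(near, fs=fs, nperseg=512)
_, _, FAR = stft(far, fs=fs, nperseg=512)
_, _, MIX = stft(mix, fs=fs, nperseg=512)

# Oracle ideal ratio mask; a trained DNN would estimate this from features.
irm = np.abs(NEAR) / (np.abs(NEAR) + np.abs(FAR) + 1e-8)
_, near_hat = istft(irm * MIX, fs=fs, nperseg=512)  # separated near field
```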
In our participation in QAC3, the Question Answering system we developed for QAC2 is extended in two separate ways. The first system is constructed to improve the performance of answer evaluation: automatic lexico-syntactic pattern acquisition from large corpora and a method to incorporate the patterns into the QA system are developed and evaluated. The second system is constructed to implement context processing for information access dialogue (IAD), the main target of the QAC3 evaluation. The system exploits passage retrieval to select an appropriate context from the history of the question series, in order to compose a complete question.
Musical recordings, when performed by non-proficient (amateur) performers, include two types of tempo fluctuations, intended "tempo curves" and unintended "mis-played components," due to poor control of the instruments. In this study, we propose a method for estimating intended tempo fluctuations, called "true tempo curves," from mis-played recordings. We also propose an automatic audio signal modification that adjusts the signal by time-scale modification with an estimated true tempo curve to remove the mis-played component. Onset timings are detected by an onset detection method based on the human auditory system. The true tempo curve is estimated by polynomial regression analysis using the detected onset timings and score information. The power spectrograms of the observed musical signals are adjusted using the true tempo curve. A subjective evaluation was performed to test the closeness of the rhythm, and it was observed that the mean opinion score values of the adjusted sounds were ...
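A minimal sketch of the regression step, assuming onsets are already detected and matched to score beat positions (all timings invented): fit a low-order polynomial mapping beats to time, read the local tempo from its derivative, and treat the residuals as the mis-played component to be removed by time-scale modification.

```python
import numpy as np

score_beats = np.arange(8, dtype=float)                    # from the score
onset_times = np.array([0.00, 0.52, 1.01, 1.55, 2.10,
                        2.58, 3.15, 3.62])                 # detected onsets

coeffs = np.polyfit(score_beats, onset_times, deg=3)       # beat -> time
fitted = np.polyval(coeffs, score_beats)                   # "true" timings

dt_dbeat = np.polyval(np.polyder(coeffs), score_beats)     # sec per beat
tempo_bpm = 60.0 / dt_dbeat                                # true tempo curve
mis_played = onset_times - fitted                          # residual errors
```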
Statistical language models have gained a reputation for providing good overall performance in speech recognition, and so are widely used in speech recognition systems today. The tasks to which statistical language models can be applied are, however, limited, because a large corpus is essential for building a statistical model, and the collection of a new corpus is a very costly task in terms of time and effort. Thus, if our aim is to apply speech recognition to various tasks as required, we need a way of developing a new language model for a given task at a reasonable cost.
This paper describes a new multi-channel method for noisy speech recognition, which estimates the log spectrum of speech at a close-talking microphone based on the multiple regression of the log spectra (MRLS) of noisy signals captured by distributed microphones. The advantages of the proposed method are as follows: (1) the method does not make any assumptions about the positions of the speaker and noise sources with respect to the microphones, so the system can be trained for various sitting positions of drivers; (2) the regression weights can be statistically optimized over a certain length of speech segments (e.g., sentences of speech) under particular road conditions. The performance of the proposed method is illustrated by speech recognition of real in-car dialogue data. In comparison to the nearest distant microphone and a multi-microphone adaptive beamformer, the proposed approach obtains relative word error rate (WER) reductions of 9.8% and 3.6%, respectively.
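A minimal sketch of the regression idea under simplifying assumptions (independent per-bin linear regression with a bias term, random placeholder data; not the authors' exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, M = 500, 129, 4                      # frames, frequency bins, mics
X = rng.standard_normal((T, F, M))         # distant-mic log spectra
y = rng.standard_normal((T, F))            # close-talk log spectra (target)

# Learn least-squares weights per frequency bin (plus a bias term).
W = np.empty((F, M + 1))
for f in range(F):
    A = np.hstack([X[:, f, :], np.ones((T, 1))])
    W[f] = np.linalg.lstsq(A, y[:, f], rcond=None)[0]

# Estimate the close-talk log spectrum of one new frame (shape F x M).
x_new = rng.standard_normal((F, M))
y_hat = np.array([np.append(x_new[f], 1.0) @ W[f] for f in range(F)])
```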
We are developing a large-scale real-world driving database of more than 200 drivers using a data collection vehicle equipped with various sensors for the synchronous recording of multimedia data including speech, video, driving behavior, and physiological signals. Drivers' speech and videos are captured with multi-channel microphones and cameras. Gas and brake pedal pressures, steering angles, vehicle velocities, and following distances ...
ABSTRACT In this chapter, driver characteristics under driving conditions are extracted through spectral analysis of driving signals. We assume that the characteristics of drivers while accelerating or decelerating can be represented by "cepstral features" obtained through spectral analysis of gas and brake pedal pressure readings. The cepstral features of individual drivers can be modeled with a Gaussian mixture model (GMM). Driver models are evaluated in driver identification experiments using driving signals of 276 drivers collected in a real vehicle on city roads. Experimental results show that the driver model based on cepstral features achieves a 76.8% driver identification rate, a 55% error reduction over a conventional driver model that uses raw gas and brake pedal operation signals. Keywords: driving behavior, driver identification, pedal pressure, spectral analysis, Gaussian mixture model.
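A minimal sketch of this pipeline with invented signals and illustrative frame and model sizes (not the chapter's settings): frame the pedal-pressure signal, take the real cepstrum of each frame, fit one GMM per driver, and identify by maximum average log-likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def pedal_cepstra(signal, frame_len=64, hop=32, n_ceps=8):
    """Real cepstrum (inverse FFT of the log power spectrum) per frame."""
    feats = []
    for i in range(0, len(signal) - frame_len, hop):
        frame = signal[i:i + frame_len] * np.hanning(frame_len)
        spec = np.abs(np.fft.rfft(frame)) ** 2
        feats.append(np.fft.irfft(np.log(spec + 1e-10))[:n_ceps])
    return np.array(feats)

rng = np.random.default_rng(0)
drivers = {d: rng.standard_normal(4000).cumsum() for d in ("A", "B")}  # toy
models = {d: GaussianMixture(n_components=4, random_state=0)
             .fit(pedal_cepstra(sig)) for d, sig in drivers.items()}

test = pedal_cepstra(drivers["A"][-1000:])
print("identified:", max(models, key=lambda d: models[d].score(test)))
```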
The dependency of conversational utterances on the mode of dialogue is analyzed. A speech corpus of 800 speakers, collected under three different modes, i.e., talking to a human operator, a WOZ system, and an ASR system, is used for the analysis. Some characteristics, such as sentence complexity and loudness of the voice, are found to be significantly different among the dialogue ...
... We added two TV cameras, an array of microphones, and a speech synthesizer. Inside the robot is a Linux computer capable of communicating via radio Ethernet. ... NINJA on PC: speaker-independent continuous Japanese speech recognition ...
A method for automatic detection of potentially dangerous situations in motor vehicle traffic is introduced. Unlike preceding works, which typically relied on camera arrays or road-traffic monitoring sensors to detect collision incidents, the proposed approach specifically incorporates changes in a driver's behavior, detected through driver speech and brake pedal operation. Experiments were performed using a large real-world multimedia driving database of 493 drivers, obtained from the Centre for Integrated Acoustic Information Research (CIAIR, Nagoya University). The drivers, who interacted verbally with a human operator, uttered expletive words to express negative feelings in 11 of the 25 situations that we selected as potentially hazardous; in 17 of them, sudden and intense compression of the brake pedal was observed. The proposed lexicographical speech-feature-based method produced 33 false alarms while detecting 80% of these 11 scenes. As for the other 17 scenes, our method based on two-dimensional hi...
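As a rough illustration of the brake-pedal cue alone, the sketch below flags moments where pedal pressure rises faster than a threshold; the signal, sampling rate, and threshold are all hypothetical, not values from the paper.

```python
import numpy as np

fs = 100.0                                   # samples per second (assumed)
rng = np.random.default_rng(0)
pressure = np.clip(rng.normal(0.1, 0.05, 3000), 0.0, None)
pressure[1500:1520] += np.linspace(0.0, 2.0, 20)  # injected hard braking

rate = np.gradient(pressure) * fs            # pressure change per second
threshold = 5.0                              # units per second, illustrative
events = np.flatnonzero(rate > threshold) / fs
print("candidate hazard times (s):", events)
```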
Behavioral synchronization between speech and finger tapping provides a novel approach to improving speech recognition accuracy. We combine a sequence of finger tapping timings, recorded alongside an utterance, using two distinct methods: in the first method, the HMM state transition probabilities at word boundaries are controlled by the timing of the finger tapping; in the second, the probability (relative frequency) of the finger tapping is used as a 'feature' and combined with MFCCs in an HMM recognition system. We evaluate these methods through connected digit recognition under different noise conditions (AURORA-2J). Leveraging the synchrony between speech and finger tapping yields a 46% relative improvement in connected digit recognition experiments.
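A minimal sketch of the second method's feature combination, assuming tap timestamps are converted to a per-frame value and appended to the MFCC vectors; the smoothed indicator and all data below are invented stand-ins for the paper's relative-frequency probability.

```python
import numpy as np

frame_rate = 100.0                           # frames per second (10 ms hop)
n_frames = 300
mfcc = np.random.default_rng(0).standard_normal((n_frames, 13))  # placeholder

tap_times = np.array([0.48, 1.02, 1.51, 2.05])   # tap timestamps, seconds
tap_feat = np.zeros(n_frames)
tap_feat[(tap_times * frame_rate).astype(int)] = 1.0

# Spread each tap over neighboring frames with a small Gaussian kernel.
k = np.exp(-0.5 * (np.arange(-10, 11) / 3.0) ** 2)
tap_feat = np.convolve(tap_feat, k / k.sum(), mode="same")

augmented = np.hstack([mfcc, tap_feat[:, None]])  # (n_frames, 14) features
```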
Summary: In this paper, we address issues in improving hands-free speech recognition performance in different car environments using multiple spatially distributed microphones. In previous work, we proposed the multiple linear regression of the log spectra (MRLS) for estimating ...
