In this paper, we present our experiments with BERT (Bidirectional Encoder Representations from Transformers) models in the task of sentiment analysis, which aims to predict the sentiment polarity of a given text. We trained an ensemble of BERT models on a large self-collected movie-review dataset and distilled the knowledge into a single production model. Moreover, we propose an improved pooling layer architecture for BERT, which outperforms the standard classification layer while also enabling per-token sentiment predictions. We demonstrate our improvements on a publicly available dataset of Czech movie reviews.
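The ensemble-to-single-model distillation mentioned above can be sketched as follows. This is a minimal, generic illustration of knowledge distillation (soft targets from an averaged teacher ensemble, temperature-scaled cross-entropy); the function names and the temperature value are illustrative assumptions, not the paper's actual setup.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits_list, temperature=2.0):
    """Cross-entropy between the averaged (softened) distribution of an
    ensemble of teachers and the student's softened distribution.
    Minimising this pushes the single student model towards the
    ensemble's averaged predictions."""
    n_teachers = len(teacher_logits_list)
    n_classes = len(student_logits)
    # Average the ensemble members' softened probabilities per class.
    teacher_probs = [
        sum(softmax(t, temperature)[i] for t in teacher_logits_list) / n_teachers
        for i in range(n_classes)
    ]
    student_probs = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(teacher_probs, student_probs))
```

A student whose logits match the averaged teacher distribution attains a lower loss than one that contradicts it, which is the training signal used to compress the ensemble.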
This paper describes the development of a stateless spoken language understanding (SLU) module based on artificial neural networks that is able to deal with the uncertainty of the automatic speech recognition (ASR) output. The work builds upon the concept of weighted neurons introduced by the authors previously and presents a generalized weighting term for such a neuron. The effect of different forms and parameter estimation methods of the weighting term is experimentally evaluated on a multi-task training corpus created by merging two different semantically annotated corpora. The robustness of the best-performing weighting schemes is then demonstrated by experiments involving hybrid word-semantic (WSE) lattices and a limited-data scenario.
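The weighted-neuron idea can be illustrated with a minimal sketch: a standard unit whose activation is modulated by a term derived from the ASR confidence of the underlying hypothesis. The particular parametrisation of the weighting term below (a sigmoid of an affine function of the confidence) is a hypothetical stand-in for the paper's generalized weighting term; the parameter names `alpha` and `beta` are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weighted_neuron(inputs, params, confidence, alpha=1.0, beta=0.0):
    """One 'weighted neuron': a sigmoid unit whose activation is scaled
    by a weighting term g(c) computed from the ASR confidence c of the
    word hypothesis feeding it. Low-confidence hypotheses therefore
    contribute less to the SLU decision. g(c) = sigmoid(alpha*c + beta)
    is an illustrative choice; alpha and beta would be estimated in
    training."""
    z = sum(w * x for w, x in zip(params["weights"], inputs)) + params["bias"]
    g = sigmoid(alpha * confidence + beta)  # weighting term from ASR confidence
    return g * sigmoid(z)
```

With this form, a fully confident hypothesis (c = 1) passes a larger share of the activation through than an uncertain one (c = 0), which is the mechanism that lets the module absorb ASR uncertainty.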
The paper presents the issues encountered in processing spontaneous Czech speech in the MALACH project. Specific problems connected with the frequent occurrence of colloquial words in spontaneous Czech are analyzed; a partial solution is proposed and experimentally evaluated.
The paper describes the system built by the team from the University of West Bohemia for participation in the CLEF 2006 CL-SR track. We decided to concentrate only on monolingual searching in the Czech test collection and to investigate the effect of proper language processing on the retrieval performance. We employed a Czech morphological analyser and tagger for that purpose. For the actual search system, we used the classical tf.idf approach with blind relevance feedback as implemented in the Lemur toolkit. The results indicate that suitable linguistic preprocessing is indeed crucial for Czech IR performance.
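The tf.idf ranking with blind (pseudo-)relevance feedback described above can be sketched in a few lines. This is a deliberately simplified stand-in for the Lemur toolkit's actual ranking and feedback machinery: documents are scored by a tf.idf sum, the top-ranked documents are assumed relevant, and their most frequent unseen terms are appended to the query.

```python
import math
from collections import Counter

def tfidf_scores(query_terms, docs):
    """Score each document (a list of tokens) by a simple tf.idf sum
    over the query terms. A minimal stand-in for a real IR ranker."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        for term in set(doc):
            df[term] += 1
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = sum(tf[t] * math.log((n_docs + 1) / (df[t] + 1)) for t in query_terms)
        scores.append(score)
    return scores

def blind_feedback(query_terms, docs, top_docs=1, expand=2):
    """Blind relevance feedback: assume the best-ranked documents are
    relevant and add their most frequent unseen terms to the query."""
    scores = tfidf_scores(query_terms, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    pool = Counter()
    for i in ranked[:top_docs]:
        pool.update(docs[i])
    new_terms = [t for t, _ in pool.most_common() if t not in query_terms][:expand]
    return list(query_terms) + new_terms
```

The expanded query is then used for a second retrieval pass; the linguistic preprocessing from the paper (lemmatization via the morphological analyser) would be applied to the tokens before any of this scoring.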
This paper concentrates on the design and evaluation of a method able to automatically correct the spelling of i/y in Czech words at the output of the ASR decoder. After analysing both the Czech grammar rules and the data, we decided to deal only with endings consisting of the consonants b/f/l/m/p/s/v/z followed by i/y in both short and long forms. The correction is framed as a classification task where each word can belong to the “i” class, the “y” class or the “empty” class. Using the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) architecture, we were able to substantially improve the correctness of the i/y spelling on both simulated and real ASR output. Since the misspelling of i/y in Czech texts is seen by the majority of native Czech speakers as a blatant error, the corrected output greatly improves the perceived quality of the ASR system.
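The three-class framing can be made concrete by showing how the per-word labels are applied to a transcript. The classifier itself (a fine-tuned BERT model in the paper) is out of scope here; this sketch, with assumed function names, only demonstrates the label semantics: “i” and “y” rewrite the word-final vowel (short or long form), while “empty” leaves the word untouched.

```python
# Replacement tables for the word-final vowel, covering both the
# short (i/y) and long (í/ý) forms.
TO_I = {"y": "i", "ý": "í"}
TO_Y = {"i": "y", "í": "ý"}

def apply_iy_labels(words, labels):
    """Apply per-word i/y classification decisions to an ASR transcript.
    Each label is "i", "y" or "empty"; only words whose final vowel
    contradicts the predicted class are rewritten."""
    out = []
    for word, label in zip(words, labels):
        table = {"i": TO_I, "y": TO_Y}.get(label)  # "empty" -> None, no change
        if table and word and word[-1] in table:
            word = word[:-1] + table[word[-1]]
        out.append(word)
    return out
```

For example, a word predicted as class “y” but ending in “i” gets its final vowel flipped, while a word already consistent with its predicted class passes through unchanged.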
PDTSC 1.0 is a multi-purpose corpus of spoken language. 768,888 tokens, 73,374 sentences and 7,324 minutes of spontaneous dialog speech have been recorded, transcribed and edited in several interlinked layers: audio recordings, automatic and manual transcription and manually reconstructed text. PDTSC 1.0 is a delayed release of data annotated in 2012. It is an update of Prague Dependency Treebank of Spoken Language (PDTSL) 0.5 (published in 2009). In 2017, Prague Dependency Treebank of Spoken Czech (PDTSC) 2.0 was published as an update of PDTSC 1.0.
The package contains Czech recordings of the Visual History Archive, which consists of interviews with Holocaust survivors. The archive consists of audio recordings, four types of automatic transcripts, manual annotations of selected topics and the interviews' metadata. In total, the archive contains 353 recordings and 592 hours of interviews.
In this paper we present an online system for cross-lingual lexical (full-text) searching in a large archive of Holocaust testimonies. Video interviews recorded in two languages (English and Czech) were automatically transcribed and indexed in order to provide efficient access to the lexical content of the recordings. The engine takes advantage of a state-of-the-art speech recognition system and performs fast spoken term detection (STD), providing direct access to the segments of interviews containing queried words or short phrases.
In this paper, we present our experiments with BERT models in the task of Large-scale Multi-label Text Classification (LMTC). In the LMTC task, each text document can have multiple class labels, while the total number of classes is in the order of thousands. We propose a pooling layer architecture on top of BERT models, which improves the quality of classification by using information from the standard [CLS] token in combination with pooled sequence output. We demonstrate the improvements on Wikipedia datasets in three different languages using public pre-trained BERT models.
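The general idea of combining the [CLS] embedding with pooled sequence output can be sketched as follows. This is a minimal illustration (concatenating [CLS] with a mean-pool over the token embeddings); the paper's actual pooling architecture may differ in the pooling operator and how the two parts are combined.

```python
def combined_pooling(cls_vec, token_vecs):
    """Concatenate the [CLS] embedding with a mean-pool over the
    remaining token embeddings, yielding a feature vector of length
    2 * hidden_dim for the classification head. A simplified sketch
    of pooling that uses both the [CLS] token and the full sequence
    output instead of [CLS] alone."""
    dim = len(cls_vec)
    mean = [sum(tok[d] for tok in token_vecs) / len(token_vecs) for d in range(dim)]
    return cls_vec + mean  # list concatenation -> length 2 * dim
```

The classification layer (with thousands of sigmoid outputs in the LMTC setting) then operates on this richer representation rather than on the [CLS] vector alone.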
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018
This paper explores the possibility of using grapheme-based word and sub-word models in the task of spoken term detection (STD). The usage of grapheme models eliminates the need for expert-prepared pronunciation lexicons (which are often far from complete) and/or trainable grapheme-to-phoneme (G2P) algorithms that are frequently rather inaccurate, especially for rare words (e.g., words coming from a different language). Moreover, the G2P conversion of the search terms, which needs to be performed on-line, can substantially increase the response time of the STD system. Our results show that using various grapheme-based models, we can achieve STD performance (measured in terms of ATWV) comparable with phoneme-based models but without the additional burden of G2P conversion.
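The ATWV metric used for evaluation can be computed as follows. This sketch follows the standard NIST term-weighted-value definition (one minus the term-averaged sum of miss probability and a weighted false-alarm probability); the data-structure shape of `term_stats` is an assumption of this example.

```python
def atwv(term_stats, speech_dur_s, beta=999.9):
    """Actual Term-Weighted Value for spoken term detection.

    term_stats maps each query term to (n_true, n_hit, n_fa):
    the number of true occurrences, correctly detected occurrences,
    and false alarms. speech_dur_s is the searched speech duration
    in seconds; beta is the standard NIST false-alarm weight."""
    loss = 0.0
    for n_true, n_hit, n_fa in term_stats.values():
        p_miss = 1.0 - n_hit / n_true
        p_fa = n_fa / (speech_dur_s - n_true)
        loss += p_miss + beta * p_fa
    return 1.0 - loss / len(term_stats)
```

Perfect detection yields an ATWV of 1.0, and the large `beta` makes false alarms far more costly than misses, which is why comparable ATWV between grapheme and phoneme models is a meaningful result.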
In this paper we propose a pipeline for processing scanned historical documents into electronic text form that can then be indexed and stored in a database. The nature of the documents presents a substantial challenge for standard automated techniques: not only is there a mix of typewritten and handwritten documents of varying quality, but the scanned pages often contain multiple documents at once. Moreover, the language of the texts alternates mostly between Russian and Ukrainian, but other languages also occur. The paper focuses mainly on segmentation, document type classification, and image preprocessing of the scanned documents; the output of those methods is then passed to off-the-shelf OCR software and a baseline performance is evaluated on a simplified OCR task.
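One of the simplest image-preprocessing steps that could precede OCR in such a pipeline is global-threshold binarization; the sketch below is an illustrative example only, as the paper's actual preprocessing is more involved than a single global threshold.

```python
def binarize(gray, threshold=None):
    """Binarize a grayscale image before OCR. `gray` is a 2-D list of
    0-255 intensities; pixels above the threshold become white (255),
    the rest black (0). If no threshold is given, the mean intensity
    is used -- a crude default that an Otsu-style method would refine."""
    pixels = [p for row in gray for p in row]
    if threshold is None:
        threshold = sum(pixels) / len(pixels)
    return [[255 if p > threshold else 0 for p in row] for row in gray]
```

In a full pipeline this step would sit between page segmentation (splitting multi-document scans) and the off-the-shelf OCR engine, which generally performs better on clean black-and-white input.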
Papers by Pavel Ircing