Search on speech from spoken queries: the Multi-domain International ALBAYZIN 2018 Query-by-Example Spoken Term Detection Evaluation

Published: 01 December 2019

Abstract

The huge amount of information stored in audio and video repositories makes search on speech (SoS) a priority research area nowadays. Within SoS, Query-by-Example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given a spoken query. Research in this area is continuously fostered by the organization of QbE STD evaluations. This paper presents a multi-domain, internationally open evaluation for QbE STD in Spanish. The evaluation aims at retrieving the speech files that contain the queries, providing their start and end times along with a score that reflects the confidence given to the detection. Three Spanish speech databases covering different domains have been employed in the evaluation: the MAVIR database, which comprises a set of talks from workshops; the RTVE database, which includes broadcast television (TV) shows; and the COREMAH database, which contains spontaneous conversations between two speakers about different topics. The evaluation has been designed carefully so that several analyses of the main results can be carried out. We present the evaluation itself, the three databases, the evaluation metrics, the systems submitted to the evaluation, the results, and detailed post-evaluation analyses based on several query properties (within-vocabulary/out-of-vocabulary queries, single-word/multi-word queries, and native/foreign queries). Fusion results of the primary systems submitted to the evaluation are also presented. Three different teams took part in the evaluation, and ten different systems were submitted. The results suggest that the QbE STD task remains challenging and that system performance is highly sensitive to changes in the data domain. Nevertheless, QbE STD strategies are able to outperform text-based STD in unseen data domains.
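
To make the task concrete, the sketch below shows a minimal subsequence dynamic time warping (DTW) search, a common template-matching approach in the QbE STD literature: given frame-level features for a spoken query and for an utterance, it returns the best-matching region together with its start and end times and a confidence score. This is an illustrative assumption rather than any system submitted to the evaluation; the cosine local distance, the 10 ms frame shift, and the length-normalised score are placeholder choices.

```python
# Minimal sketch of subsequence DTW for query-by-example spoken term detection.
# Illustrative only: feature extraction, distance, and scoring are placeholder
# assumptions, not the configuration of any system submitted to the evaluation.
import numpy as np

FRAME_SHIFT_S = 0.01  # assumed 10 ms frame shift


def cosine_distance(query_feats, utt_feats):
    """Pairwise cosine distance between query frames (m x d) and utterance frames (n x d)."""
    q = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + 1e-8)
    u = utt_feats / (np.linalg.norm(utt_feats, axis=1, keepdims=True) + 1e-8)
    return 1.0 - q @ u.T  # shape (m, n)


def subsequence_dtw(query_feats, utt_feats):
    """Find the best match of the whole query inside the utterance.

    The query axis must be traversed completely, while the path may start and
    end anywhere on the utterance axis (free start and end).
    Returns (score, start_seconds, end_seconds); higher score means more confident.
    """
    dist = cosine_distance(query_feats, utt_feats)
    m, n = dist.shape
    acc = np.full((m, n), np.inf)          # accumulated path cost
    start = np.zeros((m, n), dtype=int)    # utterance frame where each path began

    acc[0, :] = dist[0, :]                 # free start on the utterance axis
    start[0, :] = np.arange(n)
    for i in range(1, m):
        for j in range(n):
            steps = [(acc[i - 1, j], start[i - 1, j])]                  # vertical step
            if j > 0:
                steps.append((acc[i - 1, j - 1], start[i - 1, j - 1]))  # diagonal step
                steps.append((acc[i, j - 1], start[i, j - 1]))          # horizontal step
            best_cost, best_start = min(steps, key=lambda s: s[0])
            acc[i, j] = dist[i, j] + best_cost
            start[i, j] = best_start

    end_frame = int(np.argmin(acc[-1, :]))     # free end on the utterance axis
    score = -acc[-1, end_frame] / m            # length-normalised cost, negated
    start_s = start[-1, end_frame] * FRAME_SHIFT_S
    end_s = (end_frame + 1) * FRAME_SHIFT_S
    return score, start_s, end_s


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    utterance = rng.normal(size=(500, 39))    # e.g., 5 s of MFCC-like features
    query = utterance[200:260]                # a query that occurs inside the utterance
    print(subsequence_dtw(query, utterance))  # detection roughly between 2.0 s and 2.6 s
```

A complete QbE STD system would run such a search (or a more refined detector) over every file in the repository, keep the highest-scoring regions per query, and report, for each detection, the speech file, the start and end times, and a calibrated confidence score, which is the output the evaluation asks participants to provide.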

Published In

EURASIP Journal on Audio, Speech, and Music Processing, Volume 2019, Issue 1
December 2019, 399 pages
ISSN: 1687-4714
EISSN: 1687-4722

Publisher

Hindawi Limited, London, United Kingdom

Author Tags

  1. International evaluation
  2. Query-by-Example Spoken Term Detection
  3. Search on speech
  4. Spanish language
