
Comparison of ALBAYZIN query-by-example spoken term detection 2012 and 2014 evaluations

Published: 01 December 2016

Abstract

Query-by-example spoken term detection (QbE STD) aims to retrieve data from a speech repository given an acoustic query containing the term of interest as input. It is currently receiving much interest owing to the large volume of multimedia information available. This paper presents the systems submitted to the ALBAYZIN QbE STD 2014 evaluation, held as part of the ALBAYZIN 2014 Evaluation campaign within the context of the IberSPEECH 2014 conference. This is the second QbE STD evaluation in Spanish, which allows us to evaluate the progress of this technology for this language. The evaluation consists of retrieving the speech files that contain the input queries, indicating the start and end times where the input queries were found, along with a score that reflects the confidence given to each detection. Evaluation is conducted on a Spanish spontaneous speech database containing a set of talks from workshops, amounting to about 7 h of speech. We present the database, the evaluation metric, the systems submitted to the evaluation, and the results, and we compare this second evaluation with the first ALBAYZIN QbE STD evaluation, held in 2012. Four different research groups took part in the evaluations held in 2012 and 2014. In 2014, new multi-word and foreign queries were added to the single-word and in-language queries used in 2012. Systems submitted to the second evaluation are hybrid systems that integrate letter transcription-based and template matching-based systems. Despite the significant improvement obtained by the systems submitted to this second evaluation compared with those of the first, the results still show the difficulty of the task and indicate that there is room for improvement.

Cited By

  • (2018) ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation. EURASIP Journal on Audio, Speech, and Music Processing 2018:1 (1-25). DOI: 10.1186/s13636-018-0125-9. Online publication date: 1-Dec-2018
  • (2017) ALBAYZIN 2016 spoken term detection evaluation. EURASIP Journal on Audio, Speech, and Music Processing 2017:1 (1-23). DOI: 10.1186/s13636-017-0119-z. Online publication date: 1-Dec-2017

    Published In

    EURASIP Journal on Audio, Speech, and Music Processing, Volume 2016, Issue 1
    December 2016
    248 pages
    ISSN:1687-4714
    EISSN:1687-4722

    Publisher

    Hindawi Limited

    London, United Kingdom

    Author Tags

    1. International evaluation
    2. Query-by-example spoken term detection
    3. Search on spontaneous speech
