OAVA: the open audio-visual archives aggregator

Published: 16 December 2023

Abstract

The purpose of this article is to provide an overview of an open-access audiovisual aggregation and search service platform developed for Greek audiovisual content during the OAVA (Open Access AudioVisual Archive) project. The platform supports searching audiovisual resources through their metadata descriptions, as well as full-text search over transcripts generated by automatic speech recognition (ASR) based on deep learning models. A dataset of reliable Greek audiovisual content providers and their resources (1710 in total) is created. Both providers and resources are reviewed according to criteria already established and used for content aggregation, to ensure the quality of the content and to avoid copyright infringement. Well-known aggregation services and well-established schemas for audiovisual resources have been studied and considered with respect to both aggregated content and metadata. Most Greek audiovisual content providers do not use established metadata schemas when publishing their content, nor is technical cooperation with them guaranteed; thus, a model is developed for reconciliation and aggregation. To make audiovisual resources searchable, the OAVA platform employs state-of-the-art ASR approaches and supports Greek and English speech-to-text models. For Greek in particular, to mitigate the scarcity of available datasets, a large-scale ASR dataset is annotated to train and evaluate deep learning architectures. The result of these efforts, namely the selection of content and metadata, the development of appropriate ASR techniques, and the aggregation and enrichment of content and metadata, is the OAVA platform: a unified search mechanism for Greek audiovisual content that will serve teaching, research, and cultural activities. The OAVA platform is available at https://openvideoarchives.gr/.


Published In

International Journal on Digital Libraries  Volume 25, Issue 4
Dec 2024
294 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 16 December 2023
Accepted: 26 October 2023
Revision received: 17 October 2023
Received: 15 September 2022

Author Tags

  1. Audiovisual material
  2. Speech-to-text technologies
  3. Cultural heritage
  4. Open access
  5. Content aggregators

Qualifiers

  • Research-article

