
Reversible speaker de-identification using pre-trained transformation functions

Published: 01 November 2017

Highlights

• A speaker de-identification method based on pre-trained transformations is proposed.
• The need for a parallel corpus between the input and target speakers is overcome.
• Objective and subjective evaluations prove the validity of the proposed approach.
• The proposed method achieves universality, naturalness and reversibility.

Abstract

Speaker de-identification approaches must accomplish three main goals: universality, naturalness and reversibility. The main drawback of the traditional approach to speaker de-identification using voice conversion techniques is its lack of universality, since a parallel corpus between the input and target speakers is necessary to train the conversion parameters. A synthetic target can be used to overcome this issue, but this harms the naturalness of the resulting de-identified speech. Hence, this paper proposes a technique in which a pool of pre-trained transformations between a set of speakers is used as follows: given a new user to de-identify, the most similar speaker in this set is chosen as the source speaker, and the speaker most dissimilar to that source is chosen as the target speaker. Speaker similarity is measured using the i-vector paradigm, which is also commonly employed as an objective measure of speaker de-identification performance, leading to a system with high de-identification accuracy. The transformation method is based on frequency warping and amplitude scaling, so as to obtain natural-sounding speech while masking the identity of the speaker. In addition, compared to other voice conversion approaches, the proposed method is easily reversible. Experiments were conducted on the Albayzin database, and performance was evaluated in terms of objective and subjective measures. The results show a high de-identification success rate as well as great naturalness of the transformed voices.
In addition, if the transformation parameters are made available to a trusted holder, the de-identification procedure can be inverted, hence recovering the original speaker identity. The computational cost of the proposed approach is small, making it possible to produce de-identified speech in real time with a high level of naturalness.
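As an illustrative sketch only (not code from the paper), the source/target selection step described in the abstract could be implemented with cosine scoring of i-vectors, assuming the i-vectors have already been extracted with an external toolkit; the function names below are hypothetical:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine score between two i-vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_source_and_target(user_ivector, pool_ivectors):
    """Pick the pool speaker most similar to the new user as the source,
    and the pool speaker most dissimilar to that source as the target."""
    sims_to_user = [cosine_similarity(user_ivector, v) for v in pool_ivectors]
    source = int(np.argmax(sims_to_user))
    sims_to_source = [cosine_similarity(pool_ivectors[source], v) for v in pool_ivectors]
    target = int(np.argmin(sims_to_source))
    return source, target
```

The pre-trained transformation between the selected source-target pair is then the one applied to the user's speech, which is what removes the need for a parallel corpus involving the new user.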

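The transformation itself combines frequency warping with amplitude scaling, and reversibility follows from the warping having a closed-form inverse. A minimal sketch of this idea, assuming a bilinear (all-pass) warping function — a common choice in voice conversion literature, used here purely for illustration and not necessarily the exact function of the paper:

```python
import numpy as np

def bilinear_warp(omega, alpha):
    """Bilinear (all-pass) warping of normalized frequencies in [0, pi].
    alpha in (-1, 1) controls the warping strength; alpha = 0 is the identity."""
    return omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

def deidentify(freqs, amps_db, alpha, gain_db):
    """Warp the frequency axis and scale the log-domain amplitudes."""
    return bilinear_warp(freqs, alpha), amps_db + gain_db

def reidentify(freqs, amps_db, alpha, gain_db):
    """Invert the transformation: bilinear warping with -alpha
    exactly undoes bilinear warping with +alpha."""
    return bilinear_warp(freqs, -alpha), amps_db - gain_db
```

Round-tripping a frequency grid through `deidentify` and `reidentify` with the same parameters recovers the original values up to floating-point precision, which is the reversibility property a trusted holder of the parameters would exploit.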



      Published In

Computer Speech and Language, Volume 46, Issue C
      November 2017
      579 pages

      Publisher

      Academic Press Ltd.

      United Kingdom


      Author Tags

      1. Amplitude scaling
      2. Frequency warping
      3. Speaker de-identification
      4. Speaker re-identification
      5. Voice transformation
      6. i-vector

      Qualifiers

      • Research-article


Cited By

• Speaker anonymization using generative adversarial networks. Journal of Intelligent & Fuzzy Systems 45(2):3345-3359, Jan 2023. DOI: 10.3233/JIFS-223642
• VoiceCloak. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7(2):1-21, Jun 2023. DOI: 10.1145/3596266
• X-vector anonymization using autoencoders and adversarial training for preserving speech privacy. Computer Speech and Language 74(C), Jul 2022. DOI: 10.1016/j.csl.2022.101351
• Speaker anonymization by modifying fundamental frequency and x-vector singular value. Computer Speech and Language 73(C), May 2022. DOI: 10.1016/j.csl.2021.101326
• A Systematic Survey of Architectural Approaches and Trade-Offs in Data De-identification. Software Architecture, pp. 66-82, Sep 2022. DOI: 10.1007/978-3-031-16697-6_5
• Evaluating X-Vector-Based Speaker Anonymization Under White-Box Assessment. Speech and Computer, pp. 100-111, Sep 2021. DOI: 10.1007/978-3-030-87802-3_10
• Vocal Tract Length Normalization using a Gaussian mixture model framework for query-by-example spoken term detection. Computer Speech and Language 58(C):175-202, Nov 2019. DOI: 10.1016/j.csl.2019.03.005
