Search | arXiv e-print repository

Enhancing COVID-19 Severity Analysis through Ensemble Methods

Authors: Anand Thyagachandran, Hema A Murthy

Abstract: Computed Tomography (CT) scans provide a detailed image of the lungs, allowing clinicians to observe the extent of damage caused by COVID-19. The CT severity score (CTSS) based scoring method is used to identify the extent of lung involvement observed on a CT scan. This paper presents a domain knowledge-based pipeline for extracting regions of infection in COVID-19 patients using a combination of image-processing algorithms and a pre-trained UNET model. The severity of the infection is then classified into different categories using an ensemble of three machine-learning models: Extreme Gradient Boosting, Extremely Randomized Trees, and Support Vector Machine. The proposed system was evaluated on a validation dataset in the AI-Enabled Medical Image Analysis Workshop and COVID-19 Diagnosis Competition (AI-MIA-COV19D) and achieved a macro F1 score of 64%. These results demonstrate the potential of combining domain knowledge with machine learning techniques for accurate COVID-19 diagnosis using CT scans. The implementation of the proposed system for severity analysis is available at https://github.com/aanandt/Enhancing-COVID-19-Severity-Analysis-through-Ensemble-Methods.git

Submitted 17 March, 2023; v1 submitted 13 March, 2023; originally announced March 2023.

arXiv:2302.07480 [pdf]

Iridium-doping as a strategy to realize visible light absorption and p-type behavior in BaTiO3

Authors: Sujana Chandrappa, Simon Joyson Galbao, P S Sankara Rama Krishnan, Namitha Anna Koshi, Srewashi Das, Stephen Nagaraju Myakala, Seung Cheol Lee, Arnab Dutta, Alexey Cherevan, Satadeep Bhattacharjee, Dharmapura H K Murthy

Abstract: BaTiO3 is typically a strong n-type material with tuneable optoelectronic properties via doping and controlling the synthesis conditions. It has a wide band gap that can only harness the ultraviolet region of the solar spectrum. Despite significant progress, achieving visible-light absorbing BTO with tuneable carrier concentration has been challenging, a crucial requirement for many applications. In this work, a p-type BTO with visible-light absorption is realized via iridium doping. Detailed analysis using advanced spectroscopy tools and computational electronic structure analysis is used to rationalize the n- to p-type transition after Ir doping. Results offered mechanistic insight into the interplay between the dopant site occupancy, the dopant position within the band gap, and the defect chemistry affecting the carrier concentration. The p-type behavior is attributed to a decrease, upon Ir doping, in the concentration of Ti3+ donor levels and the mutually correlated oxygen vacancies. Due to the formation of Ir3+ or Ir4+ in-gap energy levels within the forbidden region, the optical transition can be elicited from or to such levels resulting in visible-light absorption. This newly developed Ir-doped BTO can be a promising p-type perovskite-oxide with imminent applications in solar fuel generation, spintronics and optoelectronics.

Submitted 15 February, 2023; originally announced February 2023.

Comments: 21 pages, 8 figures

arXiv:2302.06227 [pdf, other]

Fast and small footprint Hybrid HMM-HiFiGAN based system for speech synthesis in Indian languages

Authors: Sudhanshu Srivastava, Ishika Gupta, Anusha Prakash, Jom Kuriakose, Hema A. Murthy

Abstract: Hidden-Markov-model (HMM) based text-to-speech (HTS) offers flexibility in speaking styles along with fast training and synthesis while being computationally less intense. HTS performs well even in low-resource scenarios. The primary drawback is that the voice quality is poor compared to that of E2E systems. A hybrid approach combining HMM-based feature generation and neural-network-based HiFi-GAN vocoder to improve HTS synthesis quality is proposed. HTS is trained on high-resolution mel-spectrograms instead of conventional mel generalized coefficients (MGC), and the output mel-spectrogram corresponding to the input text is used in a HiFi-GAN vocoder trained on Indic languages, to produce naturalness that is equivalent to that of E2E systems, as evidenced from the DMOS and PC tests.

Submitted 13 February, 2023; originally announced February 2023.

Comments: 5 pages, 5 figures

arXiv:2212.11982 [pdf, other]

HMM-based data augmentation for E2E systems for building conversational speech synthesis systems

Authors: Ishika Gupta, Anusha Prakash, Jom Kuriakose, Hema A. Murthy

Abstract: This paper proposes an approach to build a high-quality text-to-speech (TTS) system for technical domains using data augmentation. An end-to-end (E2E) system is trained on hidden Markov model (HMM) based synthesized speech and further fine-tuned with studio-recorded TTS data to improve the timbre of the synthesized voice. The motivation behind the work is that issues of word skips and repetitions are usually absent in HMM systems due to their ability to model the duration distribution of phonemes accurately. Context-dependent pentaphone modeling, along with tree-based clustering and state-tying, takes care of unseen context and out-of-vocabulary words. A language model is also employed to reduce synthesis errors further. Subjective evaluations indicate that speech produced using the proposed system is superior to the baseline E2E synthesis approach in terms of intelligibility when combining complementing attributes from HMM and E2E frameworks. Further analysis highlights the proposed approach's efficacy in low-resource scenarios.

Submitted 22 December, 2022; originally announced December 2022.

Comments: 6 pages, 7 figures, 33 references

arXiv:2211.08790 [pdf, other]

Structural Segmentation and Labeling of Tabla Solo Performances

Authors: Gowriprasad R, R Aravind, Hema A Murthy

Abstract: Tabla is a North Indian percussion instrument used as an accompaniment and an exclusive instrument for solo performances. Tabla solo is intricate and elaborate, exhibiting rhythmic evolution through a sequence of homogeneous sections marked by shared rhythmic characteristics. Each section has a specific structure and name associated with it. Tabla learning and performance in the Indian subcontinent is based on stylistic schools called gharana-s. Several compositions by various composers from different gharana-s are played in each section. This paper addresses the task of segmenting the tabla solo concert into musically meaningful sections. We then assign suitable section labels and recognize gharana-s from the sections. We present a diverse collection of over 38 hours of solo tabla recordings for the task. We motivate the problem and present different challenges and facets of the tasks. Inspired by the distinct musical properties of tabla solo, we compute several rhythmic and timbral features for the segmentation task. This work explores the approach of automatically locating the significant changes in the rhythmic structure by analyzing local self-similarity in an unsupervised manner. We also explore supervised random forest and a convolutional neural network trained on hand-crafted features. Both supervised and unsupervised approaches are also tested on a set of held-out recordings. Segmentation of an audio piece into its structural components and labeling is crucial to many music information retrieval applications like repetitive structure finding, audio summarization, and fast music navigation. This work helps us obtain a comprehensive musical description of the tabla solo concert.

Submitted 16 November, 2022; originally announced November 2022.

Comments: 35 pages, 11 figures

arXiv:2211.01603 [pdf, other]

Using Signal Processing in Tandem With Adapted Mixture Models for Classifying Genomic Signals

Authors: Saish Jaiswal, Shreya Nema, Hema A Murthy, Manikandan Narayanan

Abstract: Genomic signal processing has been used successfully in bioinformatics to analyze biomolecular sequences and gain varied insights into DNA structure, gene organization, protein binding, sequence evolution, etc. But challenges remain in finding the appropriate spectral representation of a biomolecular sequence, especially when multiple variable-length sequences need to be handled consistently. In this study, we address this challenge in the context of the well-studied problem of classifying genomic sequences into different taxonomic units (strain, phyla, order, etc.). We propose a novel technique that employs signal processing in tandem with Gaussian mixture models to improve the spectral representation of a sequence and subsequently the taxonomic classification accuracies. The sequences are first transformed into spectra, and projected to a subspace, where sequences belonging to different taxons are better distinguishable. Our method outperforms a similar state-of-the-art method on established benchmark datasets by an absolute margin of 6.06% accuracy.

Submitted 3 November, 2022; originally announced November 2022.

arXiv:2211.01338 [pdf, other]

Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages

Authors: Anusha Prakash, Arun Kumar, Ashish Seth, Bhagyashree Mukherjee, Ishika Gupta, Jom Kuriakose, Jordan Fernandes, K V Vikram, Mano Ranjith Kumar M, Metilda Sagaya Mary, Mohammad Wajahat, Mohana N, Mudit Batra, Navina K, Nihal John George, Nithya Ravi, Pruthwik Mishra, Sudhanshu Srivastava, Vasista Sai Lodagala, Vandan Mujadia, Kada Sai Venkata Vineeth, Vrunda Sukhadia, Dipti Sharma, Hema Murthy, Pushpak Bhattacharya, et al. (2 additional authors not shown)

Abstract: Cross-lingual dubbing of lecture videos requires the transcription of the original audio, correction and removal of disfluencies, domain term discovery, text-to-text translation into the target language, chunking of text using target language rhythm, text-to-speech synthesis followed by isochronous lipsyncing to the original video. This task becomes challenging when the source and target languages belong to different language families, resulting in differences in generated audio duration. This is further compounded by the original speaker's rhythm, especially for extempore speech. This paper describes the challenges in regenerating English lecture videos in Indian languages semi-automatically. A prototype is developed for dubbing lectures into 9 Indian languages. A mean-opinion-score (MOS) is obtained for two languages, Hindi and Tamil, on two different courses. The output video is compared with the original video in terms of MOS (1-5) and lip synchronisation with scores of 4.09 and 3.74, respectively. Human effort is also reduced by 75%.

Submitted 1 November, 2022; originally announced November 2022.

arXiv:2210.17153 [pdf, other]

The Importance of Accurate Alignments in End-to-End Speech Synthesis

Authors: Anusha Prakash, Hema A Murthy

Abstract: Unit selection synthesis systems required accurate segmentation and labeling of the speech signal owing to the concatenative nature. Hidden Markov model-based speech synthesis accommodates some transcription errors, but it was later shown that accurate transcriptions yield highly intelligible speech with smaller amounts of training data. With the arrival of end-to-end (E2E) systems, it was observed that very good quality speech could be synthesised with large amounts of data. As end-to-end synthesis progressed from Tacotron to FastSpeech2, it has become evident that features that represent prosody are important for good-quality synthesis. In particular, durations of the sub-word units are important. Variants of FastSpeech use a teacher model or forced alignments to obtain good-quality synthesis. In this paper, we focus on duration prediction, using signal processing cues in tandem with forced alignment to produce accurate phone durations during training. The current work aims to highlight the importance of accurate alignments for good-quality synthesis. An attempt is made to train the E2E systems with accurately labeled data, and compare the same with approximately labeled data.

Submitted 31 October, 2022; originally announced October 2022.

Comments: Version 1 uploaded

arXiv:2106.01400 [pdf, other]

Dual Script E2E framework for Multilingual and Code-Switching ASR

Authors: Mari Ganesh Kumar, Jom Kuriakose, Anand Thyagachandran, Arun Kumar A, Ashish Seth, Lodagala Durga Prasad, Saish Jaiswal, Anusha Prakash, Hema Murthy

Abstract: India is home to multiple languages, and training automatic speech recognition (ASR) systems for languages is challenging. Over time, each language has adopted words from other languages, such as English, leading to code-mixing. Most Indian languages also have their own unique scripts, which poses a major limitation in training multilingual and code-switching ASR systems. Inspired by results in text-to-speech synthesis, in this work, we use an in-house rule-based phoneme-level common label set (CLS) representation to train multilingual and code-switching ASR for Indian languages. We propose two end-to-end (E2E) ASR systems. In the first system, the E2E model is trained on the CLS representation, and we use a novel data-driven back-end to recover the native language script. In the second system, we propose a modification to the E2E model, wherein the CLS representation and the native language characters are used simultaneously for training. We show our results on the multilingual and code-switching tasks of the Indic ASR Challenge 2021. Our best results achieve 6% and 5% improvement (approx) in word error rate over the baseline system for the multilingual and code-switching tasks, respectively, on the challenge development data.

Submitted 2 June, 2021; originally announced June 2021.

Comments: Accepted for publication at Interspeech 2021

arXiv:2105.04946 [pdf, other]

Probing Photo-excited Charge Carrier Trapping and Defect Formation in Synergistic Doping of SrTiO3

Authors: Namitha Anna Koshi, Dharmapura H K Murthy, Sudip Chakraborty, Seung-Cheol Lee, Satadeep Bhattacharjee

Abstract: Strontium titanate (SrTiO3) is widely used as a promising photocatalyst due to its unique band edge alignment with respect to the oxidation and reduction potential corresponding to oxygen evolution reaction (OER) and hydrogen evolution reaction (HER). However, further enhancement of the photocatalytic activity in this material could be envisaged through the effective control of oxygen vacancy states. This could substantially tune the photoexcited charge carrier trapping under the influence of elemental functionalization in SrTiO3, corresponding to the defect formation energy. The charge trapping states in SrTiO3 decrease through the substitutional doping in Ti sites with p-block elements like Aluminium (Al) with respect to the relative oxygen vacancies. With the help of electronic structure calculations based on density functional theory (DFT) formalism, we have explored the synergistic effect of doping with both Al and Iridium (Ir) in SrTiO3 from the perspective of defect formation energy, band edge alignment and the corresponding charge carrier recombination probability to probe the photoexcited charge carrier trapping that primarily governs the photocatalytic water splitting process. We have also systematically investigated the ratio-effect of Ir:Al functionalization on the position of acceptor levels lying between Fermi and conduction band in oxygen deficient SrTiO3, which governs the charge carrier recombination and therefore the corresponding photocatalytic efficiency.

Submitted 11 May, 2021; originally announced May 2021.

arXiv:2103.03215 [pdf, other]

Front-end Diarization for Percussion Separation in Taniavartanam of Carnatic Music Concerts

Authors: Nauman Dawalatabad, Jilt Sebastian, Jom Kuriakose, C. Chandra Sekhar, Shrikanth Narayanan, Hema A. Murthy

Abstract: Instrument separation in an ensemble is a challenging task. In this work, we address the problem of separating the percussive voices in the taniavartanam segments of Carnatic music. In taniavartanam, a number of percussive instruments play together or in tandem. Separation of instruments in regions where only one percussion is present leads to interference and artifacts at the output, as source separation algorithms assume the presence of multiple percussive voices throughout the audio segment. We prevent this by first subjecting the taniavartanam to diarization. This process results in homogeneous clusters consisting of segments of either a single voice or multiple voices. A cluster of segments with multiple voices is identified using the Gaussian mixture model (GMM), which is then subjected to source separation. A deep recurrent neural network (DRNN) based approach is used to separate the multiple instrument segments. The effectiveness of the proposed system is evaluated on a standard Carnatic music dataset. The proposed approach provides close-to-oracle performance for non-overlapping segments and a significant improvement over traditional separation schemes.

Submitted 4 March, 2021; originally announced March 2021.

arXiv:2011.07279 [pdf, other]

Towards Zero-Shot Learning with Fewer Seen Class Examples

Authors: Vinay Kumar Verma, Ashish Mishra, Anubha Pandey, Hema A. Murthy, Piyush Rai

Abstract: We present a meta-learning based generative model for zero-shot learning (ZSL) towards a challenging setting when the number of training examples from each seen class is very few. This setup contrasts with the conventional ZSL approaches, where training typically assumes the availability of a sufficiently large number of training examples from each of the seen classes. The proposed approach leverages meta-learning to train a deep generative model that integrates variational autoencoder and generative adversarial networks. We propose a novel task distribution where meta-train and meta-validation classes are disjoint to simulate the ZSL behaviour in training. Once trained, the model can generate synthetic examples from seen and unseen classes. Synthesized samples can then be used to train the ZSL framework in a supervised manner. The meta-learner enables our model to generate high-fidelity samples using only a small number of training examples from seen classes. We conduct extensive experiments and ablation studies on four benchmark datasets of ZSL and observe that the proposed model outperforms state-of-the-art approaches by a significant margin when the number of examples per seen class is very small.

Submitted 14 November, 2020; originally announced November 2020.

Comments: Accepted in WACV 2021

arXiv:2011.02195 [pdf, other]

Correlation based Multi-phasal models for improved imagined speech EEG recognition

Authors: Rini A Sharon, Hema A Murthy

Abstract: Translation of imagined speech electroencephalogram (EEG) into human understandable commands greatly facilitates the design of naturalistic brain computer interfaces. To achieve improved imagined speech unit classification, this work aims to profit from the parallel information contained in multi-phasal EEG data recorded while speaking, imagining and performing articulatory movements corresponding to specific speech units. A bi-phase common representation learning module using neural networks is designed to model the correlation and reproducibility between an analysis phase and a support phase. The trained Correlation Network is then employed to extract discriminative features of the analysis phase. These features are further classified into five binary phonological categories using machine learning models such as Gaussian mixture based hidden Markov model and deep neural networks. The proposed approach further handles the non-availability of multi-phasal data during decoding. Topographic visualizations along with result-based inferences suggest that the multi-phasal correlation modelling approach proposed in the paper enhances imagined-speech EEG recognition performance.

Submitted 4 November, 2020; originally announced November 2020.

Journal ref: Interspeech SMM 2020

arXiv:2010.06304 [pdf, other]

doi 10.1109/TASLP.2020.3036231

Novel Architectures for Unsupervised Information Bottleneck based Speaker Diarization of Meetings

Authors: Nauman Dawalatabad, Srikanth Madikeri, C. Chandra Sekhar, Hema A. Murthy

Abstract: Speaker diarization is an important problem that is topical, and is especially useful as a preprocessor for conversational speech related applications. The objective of this paper is two-fold: (i) segment initialization by uniformly distributing speaker information across the initial segments, and (ii) incorporating speaker discriminative features within the unsupervised diarization framework. In the first part of the work, a varying length segment initialization technique for Information Bottleneck (IB) based speaker diarization system using phoneme rate as the side information is proposed. This initialization distributes speaker information uniformly across the segments and provides a better starting point for IB based clustering. In the second part of the work, we present a Two-Pass Information Bottleneck (TPIB) based speaker diarization system that incorporates speaker discriminative features during the process of diarization. The TPIB based speaker diarization system has shown improvement over the baseline IB based system. During the first pass of the TPIB system, a coarse segmentation is performed using IB based clustering. The alignments obtained are used to generate speaker discriminative features using a shallow feed-forward neural network and linear discriminant analysis. The discriminative features obtained are used in the second pass to obtain the final speaker boundaries. In the final part of the paper, variable segment initialization is combined with the TPIB framework. This leverages the advantages of better segment initialization and speaker discriminative features, resulting in an additional improvement in performance. An evaluation on standard meeting datasets shows that a significant absolute improvement of 3.9% and 4.7% is obtained on the NIST and AMI datasets, respectively.

Submitted 13 October, 2020; originally announced October 2020.

Comments: Accepted in IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, 2021, pp 14-27

arXiv:2010.05497 [pdf, other]

The "Sound of Silence" in EEG -- Cognitive voice activity detection

Authors: Rini A Sharon, Hema A Murthy

Abstract: Speech cognition bears potential application as a brain computer interface that can improve the quality of life for the otherwise communication impaired people. While speech and resting state EEG are popularly studied, here we attempt to explore a "non-speech" (NS) state of brain activity corresponding to the silence regions of speech audio. Firstly, speech perception is studied to inspect the existence of such a state, followed by its identification in speech imagination. Analogous to how voice activity detection is employed to enhance the performance of speech recognition, the EEG state activity detection protocol implemented here is applied to boost the confidence of imagined speech EEG decoding. Classification of speech and NS state is done using two datasets collected from laboratory-based and commercial-based devices. The state sequential information thus obtained is further utilized to reduce the search space of imagined EEG unit recognition. Temporal signal structures and topographic maps of NS states are visualized across subjects and sessions. The recognition performance and the visual distinction observed demonstrate the existence of silence signatures in EEG.

Submitted 12 October, 2020; originally announced October 2020.

arXiv:2009.04983 [pdf, other]

Exploration of End-to-end Synthesisers for Zero Resource Speech Challenge 2020

Authors: Karthik Pandia D S, Anusha Prakash, Mano Ranjith Kumar, Hema A Murthy

Abstract: A spoken dialogue system for an unseen language is referred to as zero resource speech. It is especially beneficial for developing applications for languages that have low digital resources. Zero resource speech synthesis is the task of building text-to-speech (TTS) models in the absence of transcriptions. In this work, speech is modelled as a sequence of transient and steady-state acoustic units, and a unique set of acoustic units is discovered by iterative training. Using the acoustic unit sequence, TTS models are trained. The main goal of this work is to improve the synthesis quality of a zero resource TTS system. Four different systems are proposed. All the systems consist of three stages: unit discovery, followed by unit sequence to spectrogram mapping, and finally spectrogram to speech inversion. Modifications are proposed to the spectrogram mapping stage. These modifications include training the mapping on voice data, using x-vectors to improve the mapping, two-stage learning, and gender-specific modelling. Evaluation of the proposed systems in the Zerospeech 2020 challenge shows that quite good quality synthesis can be achieved.

Submitted 10 September, 2020; originally announced September 2020.

Comments: Accepted for publication in Interspeech 2020

arXiv:2007.13517 [pdf, other]

doi 10.1109/TIFS.2021.3067998

Evidence of Task-Independent Person-Specific Signatures in EEG using Subspace Techniques

Authors: Mari Ganesh Kumar, Shrikanth Narayanan, Mriganka Sur, Hema A Murthy

Abstract: Electroencephalography (EEG) signals are promising as alternatives to other biometrics owing to their protection against spoofing. Previous studies have focused on capturing individual variability by analyzing task/condition-specific EEG. This work attempts to model biometric signatures independent of task/condition by normalizing the associated variance. Toward this goal, the paper extends ideas from subspace-based text-independent speaker recognition and proposes novel modifications for modeling multi-channel EEG data. The proposed techniques assume that biometric information is present in the entire EEG signal and accumulate statistics across time in a high dimensional space. These high dimensional statistics are then projected to a lower dimensional space where the biometric information is preserved. The lower dimensional embeddings obtained using the proposed approach are shown to be task-independent. The best subspace system identifies individuals with accuracies of 86.4% and 35.9% on datasets with 30 and 920 subjects, respectively, using just nine EEG channels. The paper also provides insights into the subspace model's scalability to unseen tasks and individuals during training and the number of channels needed for subspace modeling.

Submitted 25 March, 2021; v1 submitted 27 July, 2020; originally announced July 2020.

Comments: ©2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Journal ref: IEEE Transactions on Information Forensics and Security, 2021

arXiv:2006.06971 [pdf, other]

doi 10.21437/Interspeech.2020-2663

Generic Indic Text-to-speech Synthesisers with Rapid Adaptation in an End-to-end Framework

Authors: Anusha Prakash, Hema A Murthy

Abstract: Building text-to-speech (TTS) synthesisers for Indian languages is a difficult task owing to a large number of active languages. Indian languages can be classified into a finite set of families, prominent among them, Indo-Aryan and Dravidian. The proposed work exploits this property to build a generic TTS system using multiple languages from the same family in an end-to-end framework. Generic systems are quite robust as they are capable of capturing a variety of phonotactics across languages. These systems are then adapted to a new language in the same family using small amounts of adaptation data. Experiments indicate that good quality TTS systems can be built using only 7 minutes of adaptation data. An average degradation mean opinion score of 3.98 is obtained for the adapted TTSes. Extensive analysis of systematic interactions between languages in the generic TTSes is carried out. x-vectors are included as speaker embedding to synthesise text in a particular speaker's voice. An interesting observation is that the prosody of the target speaker's voice is preserved. These results are quite promising as they indicate the capability of generic TTSes to handle speaker and language switching seamlessly, along with the ease of adaptation to a new language.

Submitted 12 June, 2020; originally announced June 2020.

Journal ref: INTERSPEECH (2020) 2962-2966

arXiv:2006.04372 [pdf, ps, other]

doi 10.21437/Interspeech.2019-2336

Zero resource speech synthesis using transcripts derived from perceptual acoustic units

Authors: Karthik Pandia D S, Hema A Murthy

Abstract: Zerospeech synthesis is the task of building vocabulary independent speech synthesis systems, where transcriptions are not available for training data. It is, therefore, necessary to convert training data into a sequence of fundamental acoustic units that can be used for synthesis during the test. This paper attempts to discover and model perceptual acoustic units consisting of steady-state and transient regions in speech. The transients roughly correspond to CV, VC units, while the steady-state corresponds to sonorants and fricatives. The speech signal is first preprocessed by segmenting it into CVC-like units using a short-term energy-like contour. These CVC segments are clustered using a connected components-based graph clustering technique. The clustered CVC segments are initialized such that the onset (CV) and decays (VC) correspond to transients, and the rhyme corresponds to steady-states. Following this initialization, the units are allowed to re-organise on the continuous speech into a final set of AUs in an HMM-GMM framework. AU sequences thus obtained are used to train synthesis models. The performance of the proposed approach is evaluated on the Zerospeech 2019 challenge database. Subjective and objective scores show that reasonably good quality synthesis with low bit rate encoding can be achieved using the proposed AUs.

Submitted 8 June, 2020; originally announced June 2020.

arXiv:2001.06657 [pdf, other]

Stacked Adversarial Network for Zero-Shot Sketch based Image Retrieval

Authors: Anubha Pandey, Ashish Mishra, Vinay Kumar Verma, Anurag Mittal, Hema A. Murthy

Abstract: Conventional approaches to Sketch-Based Image Retrieval (SBIR) assume that the data of all the classes are available during training. The assumption may not always be practical since the data of a few classes may be unavailable, or the classes may not appear at the time of training. Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) relaxes this constraint and allows the algorithm to handle previously unseen classes during the test. This paper proposes a generative approach for ZS-SBIR based on the Stacked Adversarial Network (SAN) that also exploits the advantages of a Siamese Network (SN). While SAN generates a high-quality sample, SN learns a better distance metric compared to that of the nearest neighbor search. The capability of the generative model to synthesize image features based on the sketch reduces the SBIR problem to that of an image-to-image retrieval problem. We evaluate the efficacy of our proposed approach on the TU-Berlin and Sketchy databases in both the standard ZSL and generalized ZSL settings. The proposed method yields a significant improvement in standard ZSL as well as in the more challenging generalized ZSL setting (GZSL) for SBIR.

Submitted 18 January, 2020; originally announced January 2020.

Comments: Accepted in WACV'2020

arXiv:1904.07453 [pdf, other]

doi 10.1109/ASRU46091.2019.9003824

Spoof detection using time-delay shallow neural network and feature switching

Authors: Mari Ganesh Kumar, Suvidha Rupesh Kumar, Saranya M, B. Bharathi, Hema A. Murthy

Abstract: Detecting spoofed utterances is a fundamental problem in voice-based biometrics. Spoofing can be performed either by logical accesses like speech synthesis, voice conversion or by physical accesses such as replaying the pre-recorded utterance. Inspired by the state-of-the-art x-vector based speaker verification approach, this paper proposes a time-delay shallow neural network (TD-SNN) for spoof detection for both logical and physical access. The novelty of the proposed TD-SNN system vis-a-vis conventional DNN systems is that it can handle variable length utterances during testing. Performance of the proposed TD-SNN systems and the baseline Gaussian mixture models (GMMs) is analyzed on the ASV-spoof-2019 dataset. The performance of the systems is measured in terms of the minimum normalized tandem detection cost function (min-t-DCF). When studied with individual features, the TD-SNN system consistently outperforms the GMM system for physical access. For logical access, GMM surpasses TD-SNN systems for certain individual features. When combined with the decision-level feature switching (DLFS) paradigm, the best TD-SNN system outperforms the best baseline GMM system on evaluation data with a relative improvement of 48.03% and 49.47% for logical and physical access, respectively.

Submitted 23 January, 2020; v1 submitted 16 April, 2019; originally announced April 2019.

Journal ref: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 1011-1017

arXiv:1902.08051 [pdf, other]

doi 10.1109/ICASSP.2019.8683114

Incremental Transfer Learning in Two-pass Information Bottleneck based Speaker Diarization System for Meetings

Authors: Nauman Dawalatabad, Srikanth Madikeri, C Chandra Sekhar, Hema A Murthy

Abstract: The two-pass information bottleneck (TPIB) based speaker diarization system operates independently on different conversational recordings. The TPIB system does not consider previously learned speaker discriminative information while diarizing new conversations. Hence, the real time factor (RTF) of the TPIB system is high owing to the training time required for the artificial neural network (ANN). This paper attempts to improve the RTF of the TPIB system using an incremental transfer learning approach where the parameters learned by the ANN from other conversations are updated using the current conversation rather than learning parameters from scratch. This reduces the RTF significantly. The effectiveness of the proposed approach compared to the baseline IB and the TPIB systems is demonstrated on standard NIST and AMI conversational meeting datasets. With a minor degradation in performance, the proposed system shows a significant improvement of 33.07% and 24.45% in RTF with respect to the TPIB system on the NIST RT-04Eval and AMI-1 datasets, respectively.

Submitted 21 February, 2019; originally announced February 2019.

Comments: 5 pages, 2 figures, To appear in Proc. ICASSP 2019, May 12-17, 2019, Brighton, UK

arXiv:1803.07144 [pdf]

Synthesis and Characterization of Copper Doped Zinc Oxide Thin Films for CO Gas Sensing

Authors: Sachin S Bharadwaj, Shivaraj B W, H N Narasimha Murthy, M Krishna, Manjush Ganiger, Mohd Idris, Pundaleek Anawal, Vitthal Sangappa Angadi

Abstract: The objective of this work was to synthesize copper-doped zinc oxide (CZO) films and to optimize the process parameters by varying the molarity of zinc acetate dihydrate from 0.5 M to 1.0 M, the concentration of copper acetate monohydrate from 1% to 5%, and the annealing temperature from 200 °C to 300 °C, in order to measure the sensitivity of CZO films to CO (carbon monoxide) gas. The concentration of CO gas was maintained at 5 ppm and an operating temperature of 250 °C was used for sensing. Sensitivity analysis showed the highest grading for the parametric combination of 0.75 M molarity, 3% copper concentration and 300 °C annealing temperature, with a surface roughness of 3.90 nm and a grain size of 256 nm. TEM images revealed a crystalline grain size of 5 nm. ANOVA showed that annealing temperature influenced the sensitivity by 69.06%.

Submitted 19 February, 2018; originally announced March 2018.

Comments: 7 Pages, 16 Figures, 3 Tables

arXiv:1711.02318 [pdf, other]

Non-uniform time-scaling of Carnatic music transients

Authors: Venkata Subramanian Viraraghavan, Arpan Pal, R Aravind, Hema Murthy

Abstract: Gamakas are an integral aspect of Carnatic Music, a form of classical music prevalent in South India. They are used in ragas, which may be seen as melodic scales and/or a set of characteristic melodic phrases. Gamakas exhibit continuous pitch variation often spanning several semitones. In this paper, we study how gamakas scale with tempo and propose a novel approach to change the tempo of Carnatic music pieces. The music signal is viewed as consisting of constant-pitch segments and transients. The transients show continuous pitch variation and we consider their analyses from a theoretical stand-point. We next observe the non-uniform ratios of time-scaling of constant-pitch segments, transients and silence in excerpts from nine concert renditions of varnams in six ragas. The results indicate that the changing tempo of Carnatic music does not change the duration of transients significantly. We report listening tests on our algorithm to slow down Carnatic music that is consistent with this observation.

Submitted 7 November, 2017; originally announced November 2017.

Comments: The non-uniform time-scaling of CP-notes and transients in Carnatic concert renditions is new; it has not been reported earlier in the literature, but a reviewer pointed out that the proposed algorithm is previously known

arXiv:1709.00663 [pdf, other]

A Generative Model For Zero Shot Learning Using Conditional Variational Autoencoders

Authors: Ashish Mishra, M Shiva Krishna Reddy, Anurag Mittal, Hema A Murthy

Abstract: Zero shot learning in Image Classification refers to the setting where images from some novel classes are absent in the training data but other information such as natural language descriptions or attribute vectors of the classes are available. This setting is important in the real world since one may not be able to obtain images of all the possible classes at training. While previous approaches have tried to model the relationship between the class attribute space and the image space via some kind of a transfer function in order to model the image space correspondingly to an unseen class, we take a different approach and try to generate the samples from the given attributes, using a conditional variational autoencoder, and use the generated samples for classification of the unseen classes. By extensive testing on four benchmark datasets, we show that our model outperforms the state of the art, particularly in the more realistic generalized setting, where the training classes can also appear at the test time along with the novel classes.

Submitted 27 January, 2018; v1 submitted 3 September, 2017; originally announced September 2017.

arXiv:1608.05892 [pdf]

A study on the growth mechanism and the process parameters controlling aluminum oxide thin films deposition by pulsed pressure MOCVD

Authors: Hari Murthy, S. S Miya, Susan Krumdieck

Abstract: Aluminum oxide thin films were deposited on silicon substrates under different deposition conditions using pulse pressure metal organic chemical vapour deposition (PP-MOCVD). The current study investigates the growth mechanism of the deposited film and the control of the film morphology by varying the processing parameters of PP-MOCVD - choice of solvent, concentration, and presence of a shield. Aluminum sec-butoxide (ASB) was used as the aluminum source while hexane and toluene were used as the solvents. The films were deposited at 475 °C at different precursor concentrations. It was observed that the choice of solvent has no effect on the surface morphology, but it influenced the deposition rate. The improved deposition rate, relatively close enthalpy of vaporisation (ΔH) values and uniformity of the film, irrespective of the growth conditions, showed that hexane was a better solvent for ASB than toluene. A hybrid mode of vapour deposition and vapour condensation model for thin film growth is proposed where five different mechanisms lead to a solid film formation. These include vapour phase deposition under low arrival rate, vapour phase deposition under high arrival rate, Leidenfrost aerosol formation, heterogeneous particle formation and liquid droplet impingement. The important parameter that needs to be controlled is the precursor flux arrival rate, which can be controlled by varying the precursor concentration, use of a solvent with a low ΔH and the presence of a shield over the substrate, which influences the surface morphology and the growth rate of the films.

Submitted 21 August, 2016; originally announced August 2016.

Comments: 27 pages, 12 figures, pre-peer review

arXiv:1603.05435 [pdf, ps, other]

Modified Group Delay Based MultiPitch Estimation in Co-Channel Speech

Authors: Rajeev Rajan, Hema A. Murthy

Abstract: Phase processing has been replaced by group delay processing for the extraction of source and system parameters from speech. Group delay functions are ill-behaved when the transfer function has zeros that are close to the unit circle in the z-domain. The modified group delay function addresses this problem and has been successfully used for formant and monopitch estimation. In this paper, modified group delay functions are used for multipitch estimation in concurrent speech. The power spectrum of the speech is first flattened in order to annihilate the system characteristics, while retaining the source characteristics. Group delay analysis on this flattened spectrum picks the predominant pitch in the first pass and a comb filter is used to filter out the estimated pitch along with its harmonics. The residual spectrum is again analyzed for the next candidate pitch estimate in the second pass. The final pitch trajectories of the constituent speech utterances are formed using pitch grouping and post processing techniques. The performance of the proposed algorithm was evaluated on standard datasets using two metrics: pitch accuracy and standard deviation of fine pitch error. Our results show that the proposed algorithm is a promising pitch detection method in multipitch environments for real speech recordings.

Submitted 17 March, 2016; originally announced March 2016.

arXiv:1107.1576 [pdf, other]

doi 10.1002/adfm.201002358

Influence of Phase Segregation on Recombination Dynamics in Organic Bulk-Heterojunction Solar Cells

Authors: Andreas Baumann, Tom J. Savenije, Dharmapura Hanumantharaya K. Murthy, Martin Heeney, Vladimir Dyakonov, Carsten Deibel

Abstract: We studied the recombination dynamics of charge carriers in organic bulk heterojunction solar cells made of the blend system poly(2,5-bis(3-dodecyl thiophen-2-yl) thieno[2,3-b]thiophene) (pBTCT-C12):[6,6]-phenyl-C61-butyric acid methyl ester (PC61BM) with a donor-acceptor ratio of 1:1 and 1:4. The techniques of charge carrier extraction by linearly increasing voltage (photo-CELIV) and, as local probe, time-resolved microwave conductivity (TRMC) were used. We observed a difference in the initially extracted charge carrier concentration in the photo-CELIV experiment by one order of magnitude, which we assigned to an enhanced geminate recombination due to a fine interpenetrating network with isolated phase regions in the 1:1 pBTCT-C12:PC61BM bulk heterojunction solar cells. In contrast, extensive phase segregation in 1:4 blend devices leads to an efficient polaron generation resulting in an increased short circuit current density of the solar cell. For both studied ratios a bimolecular recombination of polarons was found using the complementary experiments. The charge carrier decay order of above two for temperatures below 300 K can be explained by a release of trapped charges. This mechanism leads to delayed bimolecular recombination processes. The experimental findings can be generalized to all polymer:fullerene blend systems allowing for phase segregation.

Submitted 8 July, 2011; originally announced July 2011.

Comments: 14 pages, 5 figures

Journal ref: Adv. Func. Mat. 21, 1687 (2011)

arXiv:1106.5642 [pdf]

doi 10.1088/0957-4484/22/31/315710

Efficient photogeneration of charge carriers in silicon nanowires with a radial doping gradient

Authors: D. H. K. Murthy, T. Xu, W. H. Chen, A. J. Houtepen, T. J. Savenije, L. D. A. Siebbeles, J. P. Nys, Christophe Krzeminski, Bruno Grandidier, Didier Stiévenard, Philippe Pareige, F. Jomard, Gilles Patriache, O. I. Lebedev

Abstract: From electrodeless time-resolved microwave conductivity measurements, the efficiency of charge carrier generation, their mobility, and decay kinetics on photo-excitation were studied in arrays of Si nanowires grown by the vapor-liquid-solid mechanism. A large enhancement in the magnitude of the photoconductance and in the charge carrier lifetime is found, depending on the incorporation of impurities during the growth. Both are explained by the internal electric field that builds up due to higher-doped sidewalls, as revealed by detailed analysis of the nanowire morphology and chemical composition.

Submitted 28 June, 2011; originally announced June 2011.

Journal ref: Nanotechnology (2011), Vol. 22, p. 315710

Showing 1–29 of 29 results for author: Murthy, H