2018 Speech Processing Courses in Crete (SPCC2018)
"Toawrds flexible and intelligible end-to-end speech synthesis systems"
Lecture slides
Tomoki Toda: Advanced Voice Conversion, July 26, 2018
Toda Laboratory, Department of Intelligent Systems, Graduate School of Informatics, Nagoya University
2018 Speech Processing Courses in Crete (SPCC2018)
"Toawrds flexible and intelligible end-to-end speech synthesis systems"
Hands-on slides
Tomoki Toda: Hands on Voice Conversion, July 26, 2018
Toda Laboratory, Department of Intelligent Systems, Graduate School of Informatics, Nagoya University
Recent progress on voice conversion: What is next?
Invited Talk at IEEE SLT 2021
Title: "Recent progress on voice conversion: What is next?"
Speaker: Tomoki Toda
Toda Laboratory, Department of Intelligent Systems, Graduate School of Informatics, Nagoya University
Statistical voice conversion with direct waveform modeling
This document provides an outline for a tutorial on voice conversion techniques. The goal of the tutorial is to help participants grasp the basics and recent progress of VC, develop a baseline VC system, and develop a more sophisticated system using a neural vocoder. The tutorial includes an overview of VC techniques, an introduction to freely available software for building a VC system, and breaks between sessions. The first session covers the basics of VC, improvements to VC techniques, and an overview of recent progress in direct waveform modeling. The second session demonstrates how to develop a VC system using the WaveNet vocoder with freely available tools.
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to-Sequence Non-Parallel Voice Conversion
APSIPA ASC 2021
Ding Ma, Wen-Chin Huang, Tomoki Toda: Investigation of text-to-speech-based synthetic parallel data for sequence-to-sequence non-parallel voice conversion, Dec. 2021
Toda Laboratory, Department of Intelligent Systems, Graduate School of Informatics, Nagoya University
The VoiceMOS Challenge 2022 aimed to encourage research in automatic prediction of mean opinion scores (MOS) for speech quality. It featured two tracks evaluating systems' ability to predict MOS ratings from a large existing dataset or from a separate listening test. 21 teams participated in the main track and 15 in the out-of-domain track. Several teams outperformed the best baseline, which fine-tuned a self-supervised model, though the top-performing approaches generally involved ensembling or multi-task learning. While unseen systems were predictable, unseen listeners and speakers remained a difficulty, especially when generalizing to a new listening test. The challenge highlighted progress in MOS prediction but also the need for metrics reflecting both ranking and absolute accuracy.
Interactive voice conversion for augmented speech production
This document discusses recent progress in interactive voice conversion techniques for augmenting speech production. It begins by explaining the physical limitations of normal speech production and how voice conversion can augment speech by controlling more information. It then discusses how interactive voice conversion allows for quick response times, better controllability through real-time feedback, and understanding user intent from multimodal behavior signals. Recent advances discussed include low-latency voice conversion networks, controllable waveform generation respecting the source-filter model of speech, and expression control using signals like arm movements. The goal is to develop cooperatively augmented speech that can help users with lost speech abilities.
GAN-based statistical speech synthesis (in Japanese), Yuki Saito
Guest presentation at "Applied Gaussian Process and Machine Learning," Graduate School of Information Science and Technology, The University of Tokyo, Japan, 2021.
Daichi Kitamura and Kohei Yatabe, "Experimental evaluation of consistent independent low-rank matrix analysis," Proceedings of 2021 Spring Meeting of Acoustical Society of Japan, 1-1-2, pp. 121–124, Tokyo, March 2021 (in Japanese).
The document describes a real-time DNN voice conversion system with feedback to acquire character traits. It proposes a method to provide real-time feedback of the converted voice to the speaker to encourage speech modification (prosody and emphasis) towards the target speaker's character. Subjective evaluations from the first-person (user) perspective and third-person perspective found that the system improved the reproduction of the target speaker's character, especially for inexperienced users. Providing only pitch feedback was already quite effective.
Weakly-Supervised Sound Event Detection with Self-Attention
IEEE ICASSP 2020
Koichi Miyazaki, Tatsuya Komatsu, Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Weakly-supervised sound event detection with self-attention, May 2020
Toda Laboratory, Department of Intelligent Systems, Graduate School of Informatics, Nagoya University
Missing Component Restoration for Masked Speech Signals based on Time-Domain Spectrogram Factorization
IEEE International Workshop on Machine Learning for Signal Processing (MLSP2017)
Nominated For Best Student Paper Award (student: Shogo Seki)
Shogo Seki, Hirokazu Kameoka, Tomoki Toda, Kazuya Takeda: Missing Component Restoration for Masked Speech Signals based on Time-Domain Spectrogram Factorization, Sep. 2017
Toda Laboratory, Department of Intelligent Systems, Graduate School of Informatics, Nagoya University
2. Outline
• What is voice conversion (VC)?
• Why is VC needed?
• How to do VC?
• Tell us VC research history and recent progress.
• How to improve a conversion model?
• How to improve an objective function?
• How to generate a converted waveform?
• How to make training more flexible?
• How to compare different techniques?
• How to develop applications?
• Summary
Outline
3. Outline
• What is voice conversion (VC)?
• Why is VC needed?
• How to do VC?
• Tell us VC research history and recent progress!
• How to improve a conversion model?
• How to improve an objective function?
• How to generate a converted waveform?
• How to make training more flexible?
• How to compare different techniques?
• How to develop applications?
• Summary
VC is a technique to generate speech sounds conveying desired para‐/non‐linguistic information!
Outline: What’s VC?
9. Basic Framework of Statistical VC
• Described as a regression problem
• Supervised training using utterance pairs of source & target speech
1. Training with parallel data (around 50 utterance pairs)
2. Conversion of any utterance while keeping linguistic contents unchanged
[Figure: a conversion model is trained on parallel utterances ("Please say the same thing.") from the source and target speakers, then converts any source utterance ("Let's convert my voice.") from source speech into target speech]
Example: speaker conversion [Abe; ’90]
What’s VC?: 6
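The supervised-regression view above can be sketched end to end: align parallel utterances frame by frame (a naive DTW here) and fit an affine source-to-target mapping by least squares. This is a deliberately minimal stand-in for the GMM and neural models of later slides; all function names are illustrative.

```python
import numpy as np

def dtw_path(X, Y):
    """Naive O(n*m) DTW alignment between two feature sequences (frames x dims)."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = float(np.sum((X[i - 1] - Y[j - 1]) ** 2))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], n, m          # backtrack from the end of both sequences
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def train_linear_mapping(X, Y):
    """Fit an affine source-to-target map on DTW-aligned frame pairs."""
    pairs = dtw_path(X, Y)
    Xa = np.array([X[i] for i, _ in pairs])
    Ya = np.array([Y[j] for _, j in pairs])
    Xa1 = np.hstack([Xa, np.ones((len(Xa), 1))])   # append bias column
    W, *_ = np.linalg.lstsq(Xa1, Ya, rcond=None)
    return W                                        # shape: (dims + 1, dims)

def convert(X, W):
    """Apply the trained mapping frame by frame (linguistic content untouched)."""
    return np.hstack([X, np.ones((len(X), 1))]) @ W
```

In a real system the features would be spectral parameters (e.g., mel-cepstra) and the mapping a GMM or DNN rather than a single global affine transform.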
14. Difficulty in Handling Speech Waveform
[Figure: speech waveform for the Japanese sentence 「あらゆる現実を全て自分の方へ・・・」 ("every reality, all toward oneself..."), aligned with its phoneme sequence (sil a r a y u r u | g e N j i ts u) and word sequence (Silence | あらゆる | 現実)]
• Need to properly model characteristics of speech waveform
• How to model long‐term dependency over a sequence?
• How to model fluctuation components?
* Sorry for Japanese example
What’s VC?: 11
15. Outline
• What is voice conversion (VC)?
• Why is VC needed?
• How to do VC?
• Tell us VC research history and recent progress!
• How to improve a conversion model?
• How to improve an objective function?
• How to generate a converted waveform?
• How to make training more flexible?
• How to compare different techniques?
• How to develop applications?
• Summary
There are several research topics.
Let’s look at them one by one.
Outline: VC progress
21. From Discontinuous to Continuous Conversion
• Model feature correlation more accurately
VQ‐based conversion (codebook mapping):
• Discrete function w/ hard clustering
• Ignore feature correlation w/ discrete mapping
GMM‐based conversion [Stylianou; ’98]:
• Continuous function w/ soft clustering
• Directly model feature correlation w/ linear regression
[Figure: target feature y plotted against original feature x (range roughly 4.4 to 6.2) with p.d.f. contours (0.005, 0.001, 0.0001), comparing the discontinuous codebook mapping of input feature to target feature against the continuous GMM mapping]
1. VC progress on conversion model: 4
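The hard-vs-soft clustering contrast can be made concrete with a toy scalar example (codebook values and variance below are illustrative, not from the slides): the VQ mapping is a discontinuous step function, while the posterior-weighted mapping varies continuously with the input.

```python
import numpy as np

# Toy paired codebooks: source codewords and their target counterparts
src_cb = np.array([0.0, 1.0, 2.0])
tgt_cb = np.array([10.0, 11.0, 12.0])

def vq_convert(x):
    """Hard clustering: emit the target codeword of the nearest source codeword."""
    return tgt_cb[np.argmin((src_cb - x) ** 2)]

def soft_convert(x, var=0.25):
    """Soft clustering: posterior-weighted mix of per-class target values."""
    w = np.exp(-0.5 * (x - src_cb) ** 2 / var)   # unnormalized Gaussian posteriors
    w /= w.sum()
    return float(w @ tgt_cb)
```

Crossing x = 0.5 makes `vq_convert` jump from 10 to 11, while `soft_convert` glides through intermediate values, which is the continuity the GMM mapping buys.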
22. GMM‐based Conversion
Training of joint p.d.f. (modeled by a GMM) [Kain; ’98]:

p(x_t, y_t \mid \lambda) = \sum_{m=1}^{M} \alpha_m \, \mathcal{N}\!\left( [x_t^\top, y_t^\top]^\top ; \mu_m, \Sigma_m \right),
\quad \mu_m = \begin{bmatrix} \mu_m^{(x)} \\ \mu_m^{(y)} \end{bmatrix},
\quad \Sigma_m = \begin{bmatrix} \Sigma_m^{(xx)} & \Sigma_m^{(xy)} \\ \Sigma_m^{(yx)} & \Sigma_m^{(yy)} \end{bmatrix}

where the joint feature vector [x_t^\top, y_t^\top]^\top stacks the source and target features at frame t, \alpha_m is the component weight, \mu_m the mean vector, \Sigma_m the covariance matrix of the m‐th component, and \lambda the GMM parameter set.

Conversion w/ conditional p.d.f. (also modeled by a GMM) [Stylianou; ’98]:

p(y_t \mid x_t, \lambda) = \frac{p(x_t, y_t \mid \lambda)}{\int p(x_t, y_t \mid \lambda) \, dy_t} = \sum_{m=1}^{M} P(m \mid x_t, \lambda) \, p(y_t \mid x_t, m, \lambda)

MMSE estimate:

\hat{y}_t = \int y_t \, p(y_t \mid x_t, \lambda) \, dy_t = \sum_{m=1}^{M} P(m \mid x_t, \lambda) \, E_{m,t}^{(y)},
\quad E_{m,t}^{(y)} = \mu_m^{(y)} + \Sigma_m^{(yx)} \left(\Sigma_m^{(xx)}\right)^{-1} \left(x_t - \mu_m^{(x)}\right)

1. VC progress on conversion model: 5
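The MMSE mapping can be sketched numerically with a toy scalar two-component joint GMM (all parameter values below are illustrative, not a trained model): posteriors come from the marginal p(x | m), and each component contributes its conditional regression mean.

```python
import numpy as np

# Toy joint GMM over scalar (x, y): two components with hand-picked parameters
weights = np.array([0.5, 0.5])
mu_x = np.array([0.0, 4.0]); mu_y = np.array([10.0, 20.0])
s_xx = np.array([1.0, 1.0])                  # source variances
s_yx = np.array([0.8, -0.5])                 # cross-covariances
s_yy = np.array([1.0, 1.0])                  # target variances (unused by the mean)

def mmse_convert(x):
    """MMSE estimate: posterior-weighted sum of per-component regressions."""
    # Component posteriors P(m | x) from the source marginal p(x | m)
    log_px = -0.5 * (x - mu_x) ** 2 / s_xx - 0.5 * np.log(2 * np.pi * s_xx)
    post = weights * np.exp(log_px)
    post /= post.sum()
    # Per-component conditional means E[y | x, m] = mu_y + s_yx / s_xx * (x - mu_x)
    cond_mean = mu_y + s_yx / s_xx * (x - mu_x)
    return float(post @ cond_mean)
```

Far from a component mean, its posterior vanishes and the other component's linear regression takes over, which is the "soft clustering + linear regression" behavior of the slide.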
25. Conversion w/ MLPG (Maximum Likelihood Parameter Generation [Tokuda; ’00]) [Toda; ’07]
• Simultaneously convert all frames over a time sequence (e.g., utterance)

(\hat{y}_1, \ldots, \hat{y}_T) = \arg\max_{y_1, \ldots, y_T} \prod_{t=1}^{T} P(Y_t \mid X_t, \lambda)

where X_1, X_2, \ldots, X_T is the source feature sequence, the GMM gives the conditional p.d.f. for static features and the conditional p.d.f. for dynamic features (= linearly transformed from the static features), and Y_t is a function of the static features. The result is the converted static feature sequence \hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T.
1. VC progress on conversion model: 8
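Because the dynamic features are linear in the statics, the maximization has a closed form. A minimal numerical sketch for one scalar feature stream with a single delta window (diagonal covariances assumed; the window and variable names are illustrative):

```python
import numpy as np

def mlpg(mean, var, T):
    """Maximum-likelihood parameter generation for a scalar feature with deltas.

    mean/var: (2T,) stacked means and variances, interleaved per frame as
    [y_1, dy_1, y_2, dy_2, ...]. Delta window: dy_t = 0.5*(y_{t+1} - y_{t-1}).
    Solves (W' P W) y = W' P mean, where W maps statics to [statics; deltas].
    """
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                         # static row picks y_t
        if t - 1 >= 0:
            W[2 * t + 1, t - 1] = -0.5            # delta row
        if t + 1 < T:
            W[2 * t + 1, t + 1] = 0.5
    P = np.diag(1.0 / var)                        # diagonal precision matrix
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ mean)
```

When the delta variances are huge (deltas uninformative), the solution collapses to the static means; tight delta variances instead smooth the trajectory across frames, which is the point of MLPG.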
31. 2. Improve an Objective Function
Key ideas are
how to keep or reproduce natural speech fluctuation!
how to handle errors of time alignment!
[Diagram: objective functions arranged from sensitive to robust against misalignment]
• Minimization of regression error: MMSE [Stylianou; ’98]
• Maximization of likelihood: MLE [Toda; ’07]; MLE w/ DP‐GMM [Nankaku; ’07] (hidden alignment, sequence mapping)
• Divergence‐based: GV [Toda; ’07] (regularization w/ feature to capture oversmoothing effects); MS [Takamichi; ’16] (more generalized features)
• Distance‐based: GAN [Saito; ’18] (data‐driven regularization); GAN w/ Gated CNN [Kaneko; ’17]
2. VC progress on objective function: 1
33. Regularization w/ GV
• Use GV likelihood as a regularization term in conversion [Toda; ’07]
• Also possible to use it in training [Zen; ’12][Hwang; ’13]
• Simpler way: design postfilter to enhance GV [Toda; ’12b]

(\hat{y}_1, \ldots, \hat{y}_T) = \arg\max_{y_1, \ldots, y_T} \prod_{t=1}^{T} P(Y_t \mid X_t, \lambda) \cdot P(v(y) \mid \lambda_v)

where P(Y_t \mid X_t, \lambda) is the GMM conditional p.d.f. for static features and for dynamic features (= linearly transformed), v(y) is the GV of the converted static features (= nonlinearly transformed), and P(v(y) \mid \lambda_v) is the GV p.d.f.

Postfilter (simple linear transformation w/ mean & var): the converted feature \hat{y}_t w/o GV has too small GV; the enhanced feature \hat{y}_t^{(GV)} has GV close to the natural one!
2. VC progress on objective function: 3
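The postfilter variant can be sketched directly: a per-dimension linear transform around the temporal mean that rescales the sequence variance to a target GV (a simplified version of the idea, with illustrative names):

```python
import numpy as np

def gv_postfilter(y, gv_target):
    """Enhance global variance by a per-dimension linear transform around the mean.

    y: (T, D) converted feature sequence; gv_target: (D,) natural GV statistics.
    """
    mean = y.mean(axis=0)
    gv = y.var(axis=0)                 # GV of the (over-smoothed) converted sequence
    scale = np.sqrt(gv_target / gv)    # variance ratio -> amplitude ratio
    return scale * (y - mean) + mean   # mean is preserved, variance matches target
```

This is exactly the "simple linear transformation w/ mean & var" of the slide: it widens the dynamic range of the over-smoothed trajectory without changing its temporal mean.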
34. From GV to Modulation Spectrum [Takamichi; ’16]
• Decompose a parameter sequence into individual modulation frequency components (0 Hz, 0.25 Hz, 0.5 Hz, ...): the d‐th dimension v_d^{(y)} decomposes into components v_{d,0}^{(y)}, v_{d,1}^{(y)}, v_{d,2}^{(y)}, \ldots, v_{d,f}^{(y)}
• p.d.f. modeling of their power values (i.e., their GVs)
• Incorporate them into the objective function [Takamichi; ’15] or design postfilter [Takamichi; ’16]
[Figure: a parameter sequence over time (0 to 3 sec) expressed as the sum of its modulation frequency components]
2. VC progress on objective function: 4
38. 3. Improve Waveform Generation
Key ideas are
how to leverage source waveform!
how to avoid assumptions in source‐filter model!
Leverage source waveform:
• Waveform modification: PSOLA [Valbret; ’92]
• Direct waveform filtering [Kobayashi; ’18a]: time‐variant log‐spectral differential filter estimation
Directly improve vocoder:
• High‐quality vocoder: HNM [Stylianou; ’96], STRAIGHT [Kawahara; ’99], AHOCODER [Erro; ’14], WORLD [Morise; ’16]
• Excitation modeling: mixed excitation [Ohtani; ’06], phase modeling [Kain; ’01][Ye; ’06], residual selection [Suendermann; ’05a]
• Excitation pulse generation w/ GAN [Juvela; ’18]
• Neural vocoder: WaveNet vocoder [Kobayashi; ’17]
3. VC progress on waveform generation: 1
39. Direct Waveform Modification [Kobayashi; ’18a]
• Apply time‐variant filtering to the input speech waveform to convert
  its spectral envelope only:

    ŝ^(y)[n] = ĥ_t^(y/x)[n] * s^(x)[n],
    Ĥ_t^(y/x)(z) = Ĥ_t^(y)(z) / H_t^(x)(z)

  i.e., the input speech waveform s^(x)[n] is passed through the
  time‐variant log‐spectral differential filter Ĥ_t^(y/x)(z) to yield the
  converted speech waveform.
• Converted parameters = sequence of log‐spectral differentials
  d̂_1, d̂_2, …, d̂_T (e.g., mel‐cepstrum differentials d_t = y_t − x_t),
  generated from a DIFFGMM p(x_t, d_t | λ) obtained from the joint GMM
  p(x_t, y_t | λ) by variable transformation
• Keep natural phase components!
• Alleviate the over‐smoothing effects!
• But hard to convert excitation parameters (e.g., F0)
3. VC progress on waveform generation: 2
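As a rough sketch, differential filtering can be mimicked by multiplying each frame's spectrum by the exponentiated log‐spectral differential and overlap‐adding (a simplified, time‐invariant, magnitude‐only stand‐in for the actual per‐frame mel‐cepstral filtering; frame/hop values are arbitrary):

```python
import numpy as np

def differential_filter(x, log_diff, frame=256, hop=128):
    """Illustrative magnitude-only differential filtering: each windowed
    frame's spectrum is multiplied by exp(log_diff), standing in for the
    log-spectral differential log|H^(y)| - log|H^(x)|, then frames are
    overlap-added. Phase is left untouched, which is the point of the
    approach: natural phase components are kept."""
    window = np.hanning(frame)
    H = np.exp(np.asarray(log_diff, dtype=float))  # magnitude filter
    y = np.zeros(len(x))
    for start in range(0, len(x) - frame, hop):
        seg = x[start:start + frame] * window
        spec = np.fft.rfft(seg)
        y[start:start + frame] += np.fft.irfft(spec * H) * window
    return y
```

A zero differential leaves the envelope unchanged; a constant log‐gain scales the waveform linearly, since the whole operation is linear in the input spectrum.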
40. Excitation Modeling
• Hard to generate natural excitation waveforms by using the traditional
  excitation model of the source‐filter framework (excitation e[n] =
  pulse train or Gaussian noise, synthesis filter H(z), synthetic speech
  x[n] = h[n] * e[n])!
• Two important components need to be modeled…
  • Stochastic component: parameterized as frequency‐dependent
    aperiodicity and statistically converted in a mixed‐excitation
    framework [Ohtani; ’06]
  • Phase component: modeled w/ templates [Kain; ’01] or a waveform
    reshaping filter [Ye; ’06]
• Residual selection: develop a one‐pitch residual waveform dataset and
  select the best one using other speech parameters (e.g., F0 & spectral
  parameters) [Suendermann; ’05a]
3. VC progress on waveform generation: 3
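The traditional excitation model above can be sketched in a few lines (toy FIR filter and sizes are illustrative assumptions):

```python
import numpy as np

def make_excitation(n, period, voiced, rng):
    """Traditional source-filter excitation: an impulse train at the
    pitch period for voiced sounds, Gaussian noise for unvoiced ones
    (what mixed excitation [Ohtani; '06] refines w/ frequency-dependent
    aperiodicity)."""
    if voiced:
        e = np.zeros(n)
        e[::period] = 1.0
        return e
    return rng.standard_normal(n)

def synthesize(e, h):
    """Synthetic speech x[n] = h[n] * e[n]: excitation through a (toy,
    FIR) synthesis filter h[n]."""
    return np.convolve(e, h)[:len(e)]

rng = np.random.default_rng(0)
h = np.array([1.0, 0.8, 0.3])  # hypothetical vocal-tract impulse response
speech = synthesize(make_excitation(400, period=80, voiced=True, rng=rng), h)
```

The rigid pulse/noise split is exactly what makes the resulting excitation sound buzzy, motivating the refinements on this slide.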
42. VC w/ WaveNet [van den Oord; ’16b]
• Implementation of WaveNet vocoder for VC
• Target speaker‐dependent WaveNet vocoder [Tamamori; ’17] can generate
speech waveform almost indistinguishable from natural one [Hayashi; ’17]!
• Use target speaker‐dependent WaveNet vocoder to generate speech
waveform from converted speech parameters [Kobayashi; ’17]
Can significantly improve conversion accuracy on speaker identity!
Could also reduce adverse effects of some errors on converted speech
e.g., by training WaveNet vocoder w/ the converted speech parameters.
• Possible to directly use WaveNet for VC [Niwa; ’18]
Input speech → Analysis → Speech parameters (feature extraction error) →
Statistical conversion → Converted speech parameters (conversion error) →
Synthesis w/ WaveNet vocoder → Converted speech (less affected by errors)
3. VC progress on waveform generation: 5
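WaveNet [van den Oord; ’16b] models the waveform as a categorical distribution over mu‐law quantized sample values; that companding step can be sketched as follows (a standalone illustration, not the vocoder itself):

```python
import numpy as np

def mulaw_encode(x, mu=255):
    """Mu-law companding + uniform quantization of a waveform in [-1, 1]
    into mu + 1 = 256 categorical classes, so WaveNet's softmax output
    layer stays tractable."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_decode(q, mu=255):
    """Inverse companding: map class indices back to waveform samples."""
    y = 2 * q.astype(float) / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu
```

The logarithmic companding spends more quantization levels near zero, where speech samples concentrate, so the 8‐bit round trip stays perceptually close to the original.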
46. Nonparallel Training w/ CycleGAN [Zhu; ’17]
• Simultaneously train two conversion networks between two speakers
  [Fang; ’18][Kaneko; ’18]:
  • The conversion network from 𝒙 to 𝒚 maps source data 𝒙 to converted
    data 𝒙 ⇒ 𝒚; the conversion network from 𝒚 to 𝒙 maps target data 𝒚 to
    converted data 𝒚 ⇒ 𝒙
  • Discriminator networks for 𝒙 and for 𝒚 output 0 for converted and 1
    for natural data, giving adversarial losses 𝐿(𝒙) and 𝐿(𝒚)
  • The round trips 𝒙 ⇒ 𝒚 ⇒ 𝒙 and 𝒚 ⇒ 𝒙 ⇒ 𝒚 give cycle losses 𝐿(𝒙, 𝒙)
    and 𝐿(𝒚, 𝒚)
  • Trained by minimizing 𝐿(𝒙) + 𝐿(𝒚) + 𝐿(𝒙, 𝒙) + 𝐿(𝒚, 𝒚)
4. VC progress on flexible framework: 3
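The cycle‐consistency part of that objective can be sketched with toy linear "networks" (matrices standing in for the conversion networks; L1 is one common choice of cycle loss):

```python
import numpy as np

def cycle_losses(x, y, g_xy, g_yx):
    """Cycle-consistency losses for two toy linear 'conversion networks'
    (matrices g_xy: x=>y and g_yx: y=>x). The round trips x=>y=>x and
    y=>x=>y should reproduce the originals; the full CycleGAN objective
    adds the two discriminators' adversarial losses to these."""
    x_back = (x @ g_xy) @ g_yx   # converted data x => y => x
    y_back = (y @ g_yx) @ g_xy   # converted data y => x => y
    l_cyc_x = np.abs(x - x_back).mean()  # L1 cycle loss L(x, x)
    l_cyc_y = np.abs(y - y_back).mean()  # L1 cycle loss L(y, y)
    return l_cyc_x, l_cyc_y
```

When the two mappings are exact inverses of each other, both cycle losses vanish, which is the behavior the training objective encourages.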
47. One‐to‐Many (or Many‐to‐One) VC [Toda; ’06]
• Convert the reference speaker’s voice into an arbitrary speaker’s voice
• Training datasets: use multiple parallel datasets between the
  reference speaker & individuals of the prestored speakers (1st, 2nd,
  …, Sth speaker), i.e., the reference features X_t paired w/ each
  Y_{1:T_1}^(1), Y_{1:T_2}^(2), …, Y_{1:T_S}^(S)
• Model training: model the frame‐dependent contextual factor z_t
  (t = 1:T_s) and the utterance‐dependent speaker factor w^(s) (s = 1:S)
  w/ different latent variables
• Factorize speaker and context using the reference speaker as an anchor
  point!
4. VC progress on flexible framework: 4
48. Eigenvoice Conversion (EVC) [Toda; ’06]
• Factorize GMM mean vectors into context‐ and speaker‐dependent
  components using the eigenvoice technique [Kuhn; ’00]:

    μ^(s) = b^(0) + [b^(1), …, b^(J)] w^(s)

  • μ^(s) = [μ_1^(s)ᵀ, …, μ_M^(s)ᵀ]ᵀ: super vector of concatenated means
    (context & speaker dependent)
  • b^(0) = [b_1^(0)ᵀ, …, b_M^(0)ᵀ]ᵀ: bias vector = average speaker
    (context dependent)
  • b^(1), …, b^(J): basis vectors = typical speaker variations
    (context dependent)
  • w^(s) = [w_1^(s), …, w_J^(s)]ᵀ: factors (speaker dependent), used as
    speaker‐adaptive parameters
4. VC progress on flexible framework: 5
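The factorization above is a one‐line linear model; a toy sketch (sizes, the identity‐like basis, and factor values are illustrative assumptions):

```python
import numpy as np

def speaker_mean_supervector(bias, basis, w):
    """Eigenvoice-style mean super vector for speaker s:
    bias vector b(0) (the 'average speaker', context dependent)
    + basis vectors (typical speaker variations, as columns) times the
    speaker-dependent factors w(s), the speaker-adaptive parameters."""
    return bias + basis @ w

# Toy sizes: M = 2 mixture components of 3-dim means, J = 2 eigenvoices
bias = np.zeros(6)            # average-speaker super vector
basis = np.eye(6)[:, :2]      # hypothetical basis vectors (columns)
w = np.array([1.5, -0.5])     # speaker-adaptive parameters w(s)
mu = speaker_mean_supervector(bias, basis, w)
```

Adapting to a new speaker then reduces to estimating only the few weights in w(s), which is what makes the unsupervised adaptation on the next slide feasible.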
49. Demo: Many‐to‐One EVC
• Convert an arbitrary speaker’s voice x_{1:T} into a pretrained target
  speaker’s voice y_{1:T}
• Adaptation: unsupervised estimation of the speaker‐adaptive parameter
  ŵ from the given input speech X_t (t = 1:T), using the context latent
  variable z_t
• Conversion: use the model adapted w/ the estimated speaker‐adaptive
  parameter ŵ to convert X_t into Y_t
(Sorry for the old demo system, developed more than 10 years ago…)
4. VC progress on flexible framework: 6
51. Speaker‐Independent Feature Extraction [Sun; ’16]
• Extract a phoneme posteriorgram (PPG) as speaker‐independent
  contextual features and use it as input of the conversion network
  • Phone recognizer: converts the source feature sequence
    x_1, x_2, x_3, …, x_T into the PPG p_1, p_2, p_3, …, p_T
    (removes speaker dependencies!)
  • Target‐dependent conversion network: converts the PPG into the
    target feature sequence y_1, y_2, y_3, …, y_T
    (adds speaker dependencies!)
  • Trained on target speech data & its PPG data from the phone
    recognizer: no longer need to use parallel data!
4. VC progress on flexible framework: 8
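A PPG is just a per‐frame posterior distribution over phone classes; a minimal sketch with hypothetical recognizer logits (a real system would take these from a trained ASR acoustic model):

```python
import numpy as np

def posteriorgram(logits):
    """Phoneme posteriorgram (PPG): per-frame posterior probabilities
    over phone classes, obtained here by a softmax over hypothetical
    phone-recognizer logits of shape (frames, phones). Each row sums to
    one; the phonetic content is kept while speaker traits are (ideally)
    discarded by the recognizer."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```

The conversion network then consumes these frame‐wise distributions instead of speaker‐colored spectral features.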
52. Unsupervised Factorization
• Use multiple nonparallel datasets to develop a factorized conversion
  model (e.g., CVAE [Hsu; ’16] or ARBM [Nakashika; ’16]) without any
  other models
• Conditional variational autoencoder (CVAE):
  • Training step (t = 1:T, s = 1:S): the encoder network maps Y_t^(s)
    to the context latent variable z_t w/ Gaussian prior N(𝟎, 𝑰),
    removing speaker dependencies, while the decoder network
    reconstructs Y_t^(s) from z_t & the speaker code w^(s), adding
    speaker dependencies back → a speaker‐independent encoder network
    & a speaker‐adaptive decoder are trained!
  • Conversion step (t = 1:T): encode the input X_t into z_t, then
    decode w/ the target speaker code
  • A GAN can also be used, as in VAW‐GAN [Hsu; ’17]
4. VC progress on flexible framework: 9
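The encode‐then‐decode conversion step can be sketched with toy stand‐ins (the linear "encoder"/"decoder" and the 3‐dim speaker code are purely illustrative; real CVAEs use neural networks trained on the ELBO):

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x):
    """Toy 'speaker-independent encoder': returns mean and log-variance
    of q(z | x). 0.5 * x is just a stand-in so the sketch runs."""
    return 0.5 * x, np.full_like(x, -2.0)

def reparameterize(mu, logvar):
    """Sample z ~ q(z | x) w/ the reparameterization trick; the prior
    on z is Gaussian N(0, I)."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def decoder(z, speaker_code):
    """Toy 'speaker-conditioned decoder': adds speaker dependencies back
    via the speaker code w(s)."""
    return 2.0 * z + speaker_code

# Conversion step: encode the input frames, then decode w/ the TARGET
# speaker code (hypothetical 3-dim code)
x_source = rng.standard_normal((5, 3))
mu, logvar = encoder(x_source)
z = reparameterize(mu, logvar)
y_converted = decoder(z, np.array([1.0, -1.0, 0.0]))
```

Swapping only the speaker code at decode time is what turns the trained autoencoder into a many‐to‐many converter.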
58. Overall Result of VCC2016 Listening Tests
• 22 submitted systems + 1 baseline system were evaluated.
[Scatter plot: mean opinion score (MOS) on naturalness (1–5, better →)
vs. correct rate [%] on speaker similarity (0–100, better ↑) for the
target, source, baseline, and submitted systems A–Q; the best system
achieved a correct rate of 75% at MOS = 3.5]
5. VC progress on comparison: 3
60. Voice Conversion Challenge 2018 (VCC2018) [Lorenzo‐Trueba; ’18]
• Two tasks
  • HUB task (main): parallel training
  • SPOKE task (optional): nonparallel training
• Evaluation
  • Naturalness and speaker similarity by listening tests
  • Word error rate and spoofing results
• Design of the VCC2018 dataset (using DAPS [Mysore; ’15])
  • Down‐sampled to 22.05 kHz
  • Source speakers (HUB task): 2 females & 2 males, 81 sentences for
    training & 35 for evaluation
  • Target speakers (both tasks): 2 females & 2 males, 81 sentences for
    training
  • Other source speakers (SPOKE task): 2 females & 2 males, another 81
    sentences for training & 35 for evaluation
5. VC progress on comparison: 5
61. Overall Results of VCC2018 Listening Tests
• 23 submitted systems + 1 baseline system were evaluated in the HUB
  task.
• 11 submitted systems + 1 baseline system were evaluated in the SPOKE
  task.
[Two scatter plots, one per task: MOS on naturalness (1–5) vs.
similarity score [%] (0–100), highlighting the baseline system and the
N10 & N17 systems]
5. VC progress on comparison: 6
66. 6. Develop Various Applications
Key ideas are
 how to apply VC techniques to various mapping tasks!
 development of real‐time VC (RT‐VC) applications!
• Telecommunication: bandwidth extension [Jax; ’03]
• TTS: cross‐lingual VC [Abe; ’91], accent conversion [Felps; ’09],
  speech translation [Hattori; ’11]
• Speaking‐aid: intelligibility enhancement of disordered speech
  [Kain; ’07][Aihara; ’14]
• Articulatory modification: inversion & production mapping
  [Richmond; ’03][Toda; ’08], articulatory controllable waveform
  modification [Tobing; ’17]
• Entertainment: singing VC [Villavicencio; ’10][Doi; ’12], voice
  changer & vocal effector (RT‐VC) [Kobayashi; ’14]
• Augmented speech production [Toda, ’14]: alaryngeal speech enhancement
  (RT‐VC) [Nakamura; ’12][Doi; ’14], F0‐controlled electrolarynx
  [Tanaka; ’17], silent speech communication (RT‐VC) [Toda; ’12a]
* NOTE: More applications have been studied
6. VC progress on application: 1
67. Real‐Time Statistical Voice Conversion [Toda; ’12b]
• Batch‐type (sequence‐based) conversion: convert the whole source
  feature sequence at once,

    ŷ_1, ŷ_2, …, ŷ_T = f(x_1, x_2, …, x_T; λ)

• Low‐delay frame‐wise conversion: approximate sequence‐based conversion
  by propagating all past info and also looking at near‐future info,

    ŷ_t = f(x_1, …, x_t, …, x_{t+L}; λ)

6. VC progress on application: 2
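A streaming skeleton of this low‐delay scheme (structural only: `convert(buf, t)` is a placeholder for the recursive maximum‐likelihood parameter update of the actual algorithm, and the lookahead length is an arbitrary choice):

```python
def lowdelay_convert(frames, convert, lookahead=2):
    """Streaming skeleton of low-delay frame-wise conversion: the output
    for frame t is emitted once frames up to t + lookahead have arrived,
    so each output can use all past frames plus a little near-future
    context, approximating sequence-based conversion w/ a fixed delay."""
    buf, out = [], []
    for x in frames:                      # frames arrive one by one
        buf.append(x)
        t = len(out)
        if len(buf) >= t + 1 + lookahead:
            out.append(convert(buf, t))
    while len(out) < len(buf):            # flush at end of stream
        out.append(convert(buf, len(out)))
    return out
```

The delay is fixed at `lookahead` frames regardless of utterance length, which is what makes real‐time operation possible.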
68. 1. Speaking‐Aid: Alaryngeal Speech Enhancement [Doi; ’14]
• Real‐time conversion from alaryngeal speech into normal speech
• VC maps the laryngectomee’s spectral segment features to the target
  spectral envelope, F0 pattern, and aperiodicity, from which the
  enhanced waveform is synthesized
[Spectrograms (0–8 kHz, 1–3 s): esophageal speech vs. enhanced speech]
Augmented speech production beyond physical constraints!
6. VC progress on application: 3
69. 2. Silent Speech Communication [Toda; ’12a]
• Real‐time conversion from non‐audible murmur (a very soft whispered
  voice) [Nakajima; ’06], detected w/ a body‐conductive non‐audible
  murmur microphone, to natural voices
• Speaking side: speak something private (“My account number is …”) in
  non‐audible murmur or soft voice
• Listening side: VC presents more naturally sounding speech to only a
  specific listener
[Audio demos: normal & whispered voices converted from non‐audible
murmur; soft voice converted from body‐conducted soft voice]
Augmented speech production to develop telepathy‐like communication!
6. VC progress on application: 4
71. Outline
• What is voice conversion (VC)?
• Why is VC needed?
• How to do VC?
• Tell us VC research history and recent progress!
• How to improve a conversion model?
• How to improve an objective function?
• How to generate a converted waveform?
• How to make training more flexible?
• How to compare different techniques?
• How to develop applications?
• Summary
Let me tell you about one more important thing.
Outline: Summary
75. [Erro; ’14] D. Erro, I. Sainz, E. Navas, I. Hernaez. Harmonics plus noise model based vocoder for statistical
parametric speech synthesis. IEEE J. Sel. Topics in Signal Process., Vol. 8, No. 2, pp. 184–194, 2014.
[Fang; ’18] F. Fang, J. Yamagishi, I. Echizen, J. Lorenzo‐Trueba. High‐quality nonparallel voice conversion
based on cycle‐consistent adversarial network. Proc. IEEE ICASSP, pp. 5279–5283, 2018.
[Felps; ’09] D. Felps, H. Bortfeld, R. Gutierrez‐Osuna. Foreign accent conversion in computer assisted
pronunciation training. Speech Commun., Vol. 51, No. 10, pp. 920–932, 2009.
[Goodfellow; ’14] I. Goodfellow, J. Pouget‐Abadie, M. Mirza, B. Xu, D. Warde‐Farley, S. Ozair, A. Courville, Y.
Bengio. Generative adversarial nets. Proc. NIPS, pp. 2672–2680, 2014.
[Hattori; ’11] N. Hattori, T. Toda, H. Kawai, H. Saruwatari, K. Shikano. Speaker‐adaptive speech
synthesis based on eigenvoice conversion and language‐dependent prosodic conversion in speech‐to‐
speech translation. Proc. INTERSPEECH, pp. 2769–2772, 2011.
[Hayashi; ’17] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, T. Toda. An investigation of multi‐speaker
training for WaveNet vocoder. Proc. IEEE ASRU, pp. 712–718, 2017.
[Hsu; ’16] C.‐C. Hsu, H.‐T. Hwang, Y.‐C. Wu, Y. Tsao, H.‐M. Wang. Voice conversion from non‐parallel corpora
using variational auto‐encoder. Proc. APSIPA ASC, 6 pages, 2016.
[Hsu; ’17] C.‐C. Hsu, H.‐T. Hwang, Y.‐C. Wu, Y. Tsao, H.‐M. Wang. Voice conversion from unaligned corpora
using variational autoencoding Wasserstein generative adversarial networks. Proc. INTERSPEECH, pp.
3364–3368, 2017.
[Hwang; ’13] H. Hwang, Y. Tsao, H. Wang, Y. Wang, S. Chen. Incorporating global variance in the training
phase of GMM‐based voice conversion. Proc. APSIPA ASC, 6 pages, 2013.
[Jax; ’03] P. Jax, P. Vary. On artificial bandwidth extension of telephone speech. Signal Processing, Vol. 83,
pp. 1707–1719, 2003.
[Jin; ’16] Z. Jin, A. Finkelstein, S. DiVerdi, J. Lu, G.J. Mysore. CUTE: a concatenative method for voice
conversion using exemplar‐based unit selection. Proc. IEEE ICASSP, pp. 5660–5664, 2016.
References: 2
81. [Toda; ’16] T. Toda, L.‐H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, J. Yamagishi. The Voice
Conversion Challenge 2016. Proc. INTERSPEECH, pp. 1632–1636, 2016.
[Tokuda; ’00] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, T. Kitamura. Speech parameter generation
algorithms for HMM‐based speech synthesis. Proc. IEEE ICASSP, pp. 1315–1318, 2000.
[Valbret; ’92] H. Valbret, E. Moulines, J.P. Tubach. Voice transformation using PSOLA technique. Speech
Commun., Vol. 11, No. 2–3, pp. 175–187, 1992.
[van den Oord; ’16a] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, K. Kavukcuoglu.
Conditional image generation with PixelCNN decoders. arXiv preprint, arXiv:1606.05328, 13 pages, 2016.
[van den Oord; ’16b] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N.
Kalchbrenner, A. W. Senior, and K. Kavukcuoglu. WaveNet: a generative model for raw audio. arXiv preprint,
arXiv:1609.03499, 15 pages, 2016.
[van den Oord; ’17] A. van den Oord, O. Vinyals, K. Kavukcuoglu. Neural discrete representation learning.
arXiv preprint, arXiv:1711.00937, 11 pages, 2017.
[Villavicencio; ’10] F. Villavicencio, J. Bonada. Applying voice conversion to concatenative singing‐voice
synthesis. Proc. INTERSPEECH, pp. 2162–2165, 2010.
[Wu; ’14] Z. Wu, T. Virtanen, E. Chng, H. Li. Exemplar‐based sparse representation with residual
compensation for voice conversion. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 22, No. 10, pp.
1506–1521, 2014.
[Wu; ’15] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, H. Li. Spoofing and countermeasures for
speaker verification: A survey. Speech Commun. Vol. 66, pp. 130–153, 2015.
[Wu; ’17] Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilci, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, H.
Delgado. ASVspoof: the automatic speaker verification spoofing and countermeasures challenge. IEEE J.
Sel. Topics in Signal Process., Vol. 11, No. 4, pp. 588–604, 2017.
[Xu; ’14] N. Xu, Y. Tang, J. Bao, A. Jiang, X. Liu, Z. Yang. Voice conversion based on Gaussian processes by
coherent and asymmetric training with limited training data. Speech Commun., Vol. 58, pp. 124–138, 2014.
References: 8