Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to main content

Showing 1–6 of 6 results for author: Teh, T

Searching in archive cs. Search in all archives.
.
  1. arXiv:2312.11805  [pdf, other

    cs.CL cs.AI cs.CV

    Gemini: A Family of Highly Capable Multimodal Models

    Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee , et al. (1325 additional authors not shown)

    Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr… ▽ More

    Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  2. arXiv:2303.09446  [pdf, other

    eess.AS cs.AI cs.CL cs.LG

    Controllable Prosody Generation With Partial Inputs

    Authors: Dan Andrei Iliescu, Devang Savita Ram Mohan, Tian Huey Teh, Zack Hodari

    Abstract: We address the problem of human-in-the-loop control for generating prosody in the context of text-to-speech synthesis. Controlling prosody is challenging because existing generative models lack an efficient interface through which users can modify the output quickly and precisely. To solve this, we introduce a novel framework whereby the user provides partial inputs and the generative model genera… ▽ More

    Submitted 15 April, 2024; v1 submitted 14 March, 2023; originally announced March 2023.

    Comments: 5 pages

  3. arXiv:2106.08352  [pdf, other

    eess.AS cs.LG cs.SD

    Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis

    Authors: Devang S Ram Mohan, Vivian Hu, Tian Huey Teh, Alexandra Torresquintero, Christopher G. R. Wallis, Marlene Staib, Lorenzo Foglianti, Jiameng Gao, Simon King

    Abstract: Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text. One way to reduce the amount of unexplained variation in training data is to provide acoustic information as an additional learning signal. When generating speech, modifying this acoustic information enables multiple distinct rendit… ▽ More

    Submitted 15 June, 2021; originally announced June 2021.

    Comments: To be published in Interspeech 2021. 5 pages, 4 figures

  4. Phonological Features for 0-shot Multilingual Speech Synthesis

    Authors: Marlene Staib, Tian Huey Teh, Alexandra Torresquintero, Devang S Ram Mohan, Lorenzo Foglianti, Raphael Lenain, Jiameng Gao

    Abstract: Code-switching---the intra-utterance use of multiple languages---is prevalent across the world. Within text-to-speech (TTS), multilingual models have been found to enable code-switching. By modifying the linguistic input to sequence-to-sequence TTS, we show that code-switching is possible for languages unseen during training, even within monolingual models. We use a small set of phonological featu… ▽ More

    Submitted 6 August, 2020; originally announced August 2020.

    Comments: 5 pages, to be presented at INTERSPEECH 2020

  5. arXiv:2008.03096  [pdf, other

    eess.AS cs.LG cs.SD stat.ML

    Incremental Text to Speech for Neural Sequence-to-Sequence Models using Reinforcement Learning

    Authors: Devang S Ram Mohan, Raphael Lenain, Lorenzo Foglianti, Tian Huey Teh, Marlene Staib, Alexandra Torresquintero, Jiameng Gao

    Abstract: Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised. This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation. Interleaving the action of reading a character with that of synthesising audio reduces this latency. However, the order of this sequence of interleaved actions v… ▽ More

    Submitted 7 August, 2020; originally announced August 2020.

    Comments: To be published in Interspeech 2020. 5 pages, 4 figures

  6. arXiv:1803.05419  [pdf, other

    stat.ML cs.LG

    Generalised Structural CNNs (SCNNs) for time series data with arbitrary graph topology

    Authors: Thomas Teh, Chaiyawan Auepanwiriyakul, John Alexander Harston, A. Aldo Faisal

    Abstract: Deep Learning methods, specifically convolutional neural networks (CNNs), have seen a lot of success in the domain of image-based data, where the data offers a clearly structured topology in the regular lattice of pixels. This 4-neighbourhood topological simplicity makes the application of convolutional masks straightforward for time series data, such as video applications, but many high-dimension… ▽ More

    Submitted 30 May, 2018; v1 submitted 14 March, 2018; originally announced March 2018.