
Showing 1–22 of 22 results for author: Mensch, A

Searching in archive cs. Search in all archives.
  1. arXiv:2401.04088  [pdf, other]

    cs.LG cs.CL

    Mixtral of Experts

    Authors: Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, et al. (1 additional author not shown)

    Abstract: We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected e…

    Submitted 8 January, 2024; originally announced January 2024.

    Comments: See more details at https://mistral.ai/news/mixtral-of-experts/

  2. arXiv:2312.11805  [pdf, other]

    cs.CL cs.AI cs.CV

    Gemini: A Family of Highly Capable Multimodal Models

    Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee , et al. (1325 additional authors not shown)

    Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…

    Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  3. arXiv:2310.06825  [pdf, other]

    cs.CL cs.AI cs.LG

    Mistral 7B

    Authors: Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed

    Abstract: We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences o…

    Submitted 10 October, 2023; originally announced October 2023.

    Comments: Models and code are available at https://mistral.ai/news/announcing-mistral-7b/

  4. arXiv:2303.13518  [pdf, other]

    cs.CV cs.AI cs.LG

    Three ways to improve feature alignment for open vocabulary detection

    Authors: Relja Arandjelović, Alex Andonian, Arthur Mensch, Olivier J. Hénaff, Jean-Baptiste Alayrac, Andrew Zisserman

    Abstract: The core problem in zero-shot open vocabulary detection is how to align visual and text features, so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining, and struggles to prevent the language model from forgetting unseen classes. We propose t…

    Submitted 23 March, 2023; originally announced March 2023.

  5. arXiv:2211.04236  [pdf, other]

    cs.CL cs.LG

    Self-conditioned Embedding Diffusion for Text Generation

    Authors: Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, Rémi Leblond

    Abstract: Can continuous diffusion models bring the same performance breakthrough on natural language they did for image generation? To circumvent the discrete nature of text data, we can simply project tokens in a continuous space of embeddings, as is standard in language modeling. We propose Self-conditioned Embedding Diffusion, a continuous diffusion mechanism that operates on token embeddings and allows…

    Submitted 8 November, 2022; originally announced November 2022.

    Comments: 15 pages

  6. arXiv:2210.04319  [pdf, other]

    cs.LG

    Dissecting adaptive methods in GANs

    Authors: Samy Jelassi, David Dobre, Arthur Mensch, Yuanzhi Li, Gauthier Gidel

    Abstract: Adaptive methods are a crucial component widely used for training generative adversarial networks (GANs). While there has been some work to pinpoint the "marginal value of adaptive methods" in standard tasks, it remains unclear why they are still critical for GAN training. In this paper, we formally study how adaptive methods help train GANs; inspired by the grafting method proposed in arXiv:2002.…

    Submitted 9 October, 2022; originally announced October 2022.

  7. arXiv:2204.14198  [pdf, other]

    cs.CV cs.AI cs.LG

    Flamingo: a Visual Language Model for Few-Shot Learning

    Authors: Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals , et al. (2 additional authors not shown)

    Abstract: Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily i…

    Submitted 15 November, 2022; v1 submitted 29 April, 2022; originally announced April 2022.

    Comments: 54 pages. In Proceedings of Neural Information Processing Systems (NeurIPS) 2022

  8. arXiv:2203.15556  [pdf, other]

    cs.CL cs.LG

    Training Compute-Optimal Large Language Models

    Authors: Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre

    Abstract: We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion…

    Submitted 29 March, 2022; originally announced March 2022.

  9. arXiv:2202.01169  [pdf, other]

    cs.CL cs.LG

    Unified Scaling Laws for Routed Language Models

    Authors: Aidan Clark, Diego de las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, George van den Driessche, Eliza Rutherford, Tom Hennigan, Matthew Johnson, Katie Millican, Albin Cassirer, Chris Jones, Elena Buchatskaya, David Budden, Laurent Sifre, Simon Osindero, Oriol Vinyals, Jack Rae, Erich Elsen, Koray Kavukcuoglu, et al. (1 additional author not shown)

    Abstract: The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better…

    Submitted 9 February, 2022; v1 submitted 2 February, 2022; originally announced February 2022.

    Comments: Fixing typos and affiliation clarity

  10. arXiv:2112.11446  [pdf, other]

    cs.CL cs.AI

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Authors: Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor , et al. (55 additional authors not shown)

    Abstract: Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gop…

    Submitted 21 January, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

    Comments: 120 pages

  11. arXiv:2112.04426  [pdf, other]

    cs.CL cs.LG

    Improving language models by retrieving from trillions of tokens

    Authors: Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan , et al. (3 additional authors not shown)

    Abstract: We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a $2$ trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25$\times$ fewer parameters. After fine-tuning, RETRO performance translates to d…

    Submitted 7 February, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

    Comments: Fix incorrect reported numbers in Table 14

  12. arXiv:2010.08354  [pdf, other]

    cs.LG stat.ML

    Differentiable Divergences Between Time Series

    Authors: Mathieu Blondel, Arthur Mensch, Jean-Philippe Vert

    Abstract: Computing the discrepancy between time series of variable sizes is notoriously challenging. While dynamic time warping (DTW) is popularly used for this purpose, it is not differentiable everywhere and is known to lead to bad local optima when used as a "loss". Soft-DTW addresses these issues, but it is not a positive definite divergence: due to the bias introduced by entropic regularization, it ca…

    Submitted 25 February, 2021; v1 submitted 16 October, 2020; originally announced October 2020.

    Comments: V3: AISTATS 2021 camera-ready

  13. arXiv:2002.06277  [pdf, other]

    cs.LG math.OC math.PR stat.ML

    A mean-field analysis of two-player zero-sum games

    Authors: Carles Domingo-Enrich, Samy Jelassi, Arthur Mensch, Grant Rotskoff, Joan Bruna

    Abstract: Finding Nash equilibria in two-player zero-sum continuous games is a central problem in machine learning, e.g. for training both GANs and robust models. The existence of pure Nash equilibria requires strong conditions which are not typically met in practice. Mixed Nash equilibria exist in greater generality and may be found using mirror descent. Yet this approach does not scale to high dimensions.…

    Submitted 6 May, 2021; v1 submitted 14 February, 2020; originally announced February 2020.

    Journal ref: Published at NeurIPS 2020

  14. arXiv:1905.12363  [pdf, other]

    stat.ML cs.LG math.OC

    Extragradient with player sampling for faster Nash equilibrium finding

    Authors: Samy Jelassi, Carles Domingo-Enrich, Damien Scieur, Arthur Mensch, Joan Bruna

    Abstract: Data-driven modeling increasingly requires finding a Nash equilibrium in multi-player games, e.g. when training GANs. In this paper, we analyse a new extra-gradient method for Nash equilibrium finding that performs gradient extrapolations and updates on a random subset of players at each iteration. This approach provably exhibits a better rate of convergence than full extra-gradient for non-smoot…

    Submitted 21 July, 2020; v1 submitted 29 May, 2019; originally announced May 2019.

  15. arXiv:1905.06005  [pdf, other]

    stat.ML cs.LG math.OC

    Geometric Losses for Distributional Learning

    Authors: Arthur Mensch, Mathieu Blondel, Gabriel Peyré

    Abstract: Building upon recent advances in entropy-regularized optimal transport, and upon Fenchel duality between measures and continuous functions, we propose a generalization of the logistic loss that incorporates a metric or cost between classes. Unlike previous attempts to use optimal transport distances for learning, our loss results in unconstrained convex objective functions, supports infinite (or…

    Submitted 15 May, 2019; originally announced May 2019.

    Journal ref: Proceedings of the International Conference on Machine Learning, 2019, Long Beach, United States

  16. arXiv:1809.06035  [pdf, other]

    stat.ML cs.CV cs.LG q-bio.QM

    Extracting representations of cognition across neuroimaging studies improves brain decoding

    Authors: Arthur Mensch, Julien Mairal, Bertrand Thirion, Gaël Varoquaux

    Abstract: Cognitive brain imaging is accumulating datasets about the neural substrate of many different mental processes. Yet, most studies are based on few subjects and have low statistical power. Analyzing data across studies could bring more statistical power; yet the current brain-imaging analytic framework cannot be used at scale as it requires casting all cognitive tasks in a unified theoretical frame…

    Submitted 19 May, 2021; v1 submitted 17 September, 2018; originally announced September 2018.

    Journal ref: PLoS Computational Biology, Public Library of Science, 2021

  17. arXiv:1802.03676  [pdf, other]

    stat.ML cs.LG

    Differentiable Dynamic Programming for Structured Prediction and Attention

    Authors: Arthur Mensch, Mathieu Blondel

    Abstract: Dynamic programming (DP) solves a variety of structured combinatorial problems by iteratively breaking them down into smaller subproblems. In spite of their versatility, DP algorithms are usually non-differentiable, which hampers their use as a layer in neural networks trained by backpropagation. To address this issue, we propose to smooth the max operator in the dynamic programming recursion, usi…

    Submitted 20 February, 2018; v1 submitted 10 February, 2018; originally announced February 2018.

  18. arXiv:1710.11438  [pdf, other]

    stat.ML cs.LG q-bio.NC

    Learning Neural Representations of Human Cognition across Many fMRI Studies

    Authors: Arthur Mensch, Julien Mairal, Danilo Bzdok, Bertrand Thirion, Gaël Varoquaux

    Abstract: Cognitive neuroscience is enjoying a rapid increase in extensive public brain-imaging datasets. It opens the door to large-scale statistical models. Finding a unified perspective for all available data calls for scalable and automated solutions to an old challenge: how to aggregate heterogeneous information on brain function into a universal cognitive system that relates mental operations/cognitive…

    Submitted 10 November, 2017; v1 submitted 31 October, 2017; originally announced October 2017.

    Comments: Advances in Neural Information Processing Systems, Dec 2017, Long Beach, United States

    Journal ref: Advances in Neural Information Processing Systems, 2017

  19. arXiv:1701.05363  [pdf, other]

    stat.ML cs.LG math.OC q-bio.NC

    Stochastic Subsampling for Factorizing Huge Matrices

    Authors: Arthur Mensch, Julien Mairal, Bertrand Thirion, Gael Varoquaux

    Abstract: We present a matrix-factorization algorithm that scales to input matrices with huge numbers of both rows and columns. Learned factors may be sparse or dense and/or non-negative, which makes our algorithm suitable for dictionary learning, sparse component analysis, and non-negative matrix factorization. Our algorithm streams matrix columns while subsampling them to iteratively learn the matrix facto…

    Submitted 30 October, 2017; v1 submitted 19 January, 2017; originally announced January 2017.

    Comments: IEEE Transactions on Signal Processing, Institute of Electrical and Electronics Engineers, to appear

    Journal ref: IEEE Transactions on Signal Processing, 2018, 66 (1), pp 113-128

  20. arXiv:1611.10041  [pdf, other]

    math.OC cs.LG stat.ML

    Subsampled online matrix factorization with convergence guarantees

    Authors: Arthur Mensch, Julien Mairal, Gaël Varoquaux, Bertrand Thirion

    Abstract: We present a matrix factorization algorithm that scales to input matrices that are large in both dimensions (i.e., that contain more than 1TB of data). The algorithm streams the matrix columns while subsampling them, resulting in low complexity per iteration and reasonable memory footprint. In contrast to previous online matrix factorization methods, our approach relies on low-dimensional statistic…

    Submitted 30 November, 2016; originally announced November 2016.

    Journal ref: 9th NIPS Workshop on Optimization for Machine Learning, Dec 2016, Barcelona, Spain

  21. arXiv:1605.00937  [pdf, other]

    stat.ML cs.LG q-bio.QM

    Dictionary Learning for Massive Matrix Factorization

    Authors: Arthur Mensch, Julien Mairal, Bertrand Thirion, Gaël Varoquaux

    Abstract: Sparse matrix factorization is a popular tool to obtain interpretable data decompositions, which are also effective to perform data completion or denoising. Its applicability to large datasets has been addressed with online and randomized methods that reduce the complexity in one of the matrix dimensions, but not in both of them. In this paper, we tackle very large matrices in both dimensions. We…

    Submitted 26 May, 2016; v1 submitted 3 May, 2016; originally announced May 2016.

    Journal ref: Proceedings of the International Conference on Machine Learning, 2016, pp 1737-1746

  22. Compressed Online Dictionary Learning for Fast fMRI Decomposition

    Authors: Arthur Mensch, Gaël Varoquaux, Bertrand Thirion

    Abstract: We present a method for fast resting-state fMRI spatial decompositions of very large datasets, based on the reduction of the temporal dimension before applying dictionary learning on concatenated individual records from groups of subjects. Introducing a measure of correspondence between spatial decompositions of rest fMRI, we demonstrate that time-reduced dictionary learning produces result as r…

    Submitted 8 February, 2016; originally announced February 2016.

    Journal ref: IEEE International Symposium on Biomedical Imaging, 2016