
Showing 1–50 of 60 results for author: Varoquaux, G

Searching in archive cs.
  1. arXiv:2407.19804  [pdf, other]

    cs.AI cs.LG stat.ML

    Imputation for prediction: beware of diminishing returns

    Authors: Marine Le Morvan, Gaël Varoquaux

    Abstract: Missing values are prevalent across various fields, posing challenges for training and deploying predictive models. In this context, imputation is a common practice, driven by the hope that accurate imputations will enhance predictions. However, recent theoretical and empirical studies indicate that simple constant imputation can be consistent and competitive. This empirical study aims at clarifyi…

    Submitted 29 July, 2024; originally announced July 2024.

  2. arXiv:2406.14085  [pdf, other]

    cs.AI

    Teaching Models To Survive: Proper Scoring Rule and Stochastic Optimization with Competing Risks

    Authors: Julie Alberge, Vincent Maladière, Olivier Grisel, Judith Abécassis, Gaël Varoquaux

    Abstract: When data are right-censored, i.e. some outcomes are missing due to a limited period of observation, survival analysis can compute the "time to event". Multiple classes of outcomes lead to a classification variant: predicting the most likely event, known as competing risks, which has been less studied. To build a loss that estimates outcome probabilities for such settings, we introduce a strictly…

    Submitted 20 June, 2024; originally announced June 2024.

  3. arXiv:2402.16785  [pdf, other]

    cs.LG

    CARTE: Pretraining and Transfer for Tabular Learning

    Authors: Myung Jun Kim, Léo Grinsztajn, Gaël Varoquaux

    Abstract: Pretrained deep-learning models are the go-to solution for images or text. However, for tabular data the standard is still to train tree-based models. Indeed, transfer learning on tables hits the challenge of data integration: finding correspondences, correspondences in the entries (entity matching) where different words may denote the same entity, correspondences across columns (schema matching),…

    Submitted 31 May, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

  4. arXiv:2402.06282  [pdf, other]

    cs.DB cs.LG

    Retrieve, Merge, Predict: Augmenting Tables with Data Lakes

    Authors: Riccardo Cappuzzo, Aimee Coelho, Felix Lefebvre, Paolo Papotti, Gael Varoquaux

    Abstract: We present an in-depth analysis of data discovery in data lakes, focusing on table augmentation for given machine learning tasks. We analyze alternative methods used in the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table. As data lakes, the paper uses YADL (Yet Another Data Lake) -- a novel dataset we developed as a tool for benchmarking t…

    Submitted 27 May, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

    Comments: 12 pages + references, 10 figures. Under submission at VLDB2024 (EA&B track)

  5. arXiv:2402.04957  [pdf, other]

    cs.CL

    Reconfidencing LLMs from the Grouping Loss Perspective

    Authors: Lihu Chen, Alexandre Perez-Lebel, Fabian M. Suchanek, Gaël Varoquaux

    Abstract: Large Language Models (LLMs), including ChatGPT and LLaMA, are susceptible to generating hallucinated answers in a confident tone. While efforts to elicit and calibrate confidence scores have proven useful, recent findings show that controlling uncertainty must go beyond calibration: predicted scores may deviate significantly from the actual posterior probabilities due to the impact of grouping lo…

    Submitted 18 June, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

  6. arXiv:2401.10407  [pdf, other]

    cs.CL

    Learning High-Quality and General-Purpose Phrase Representations

    Authors: Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek

    Abstract: Phrase representations play an important role in data science and natural language processing, benefiting various tasks like Entity Alignment, Record Linkage, Fuzzy Joins, and Paraphrase Classification. The current state-of-the-art method involves fine-tuning pre-trained language models for phrasal embeddings using contrastive learning. However, we have identified areas for improvement. First, the…

    Submitted 22 February, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

    Comments: Findings of EACL 2024

  7. arXiv:2312.09634  [pdf, other]

    stat.ML cs.LG

    Vectorizing string entries for data processing on tables: when are larger language models better?

    Authors: Léo Grinsztajn, Edouard Oyallon, Myung Jun Kim, Gaël Varoquaux

    Abstract: There are increasingly efficient data processing pipelines that work on vectors of numbers, for instance most machine learning models, or vector databases for fast similarity search. These require converting the data to numbers. While this conversion is easy for simple numerical and categorical entries, databases are rife with text entries, such as names or descriptions. In the age of large lang…

    Submitted 15 December, 2023; originally announced December 2023.

  8. arXiv:2310.12864  [pdf, other]

    cs.CL

    The Locality and Symmetry of Positional Encodings

    Authors: Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek

    Abstract: Positional Encodings (PEs) are used to inject word-order information into transformer-based language models. While they can significantly enhance the quality of sentence representations, their specific contribution to language models is not fully understood, especially given recent findings that various positional encodings are insensitive to word order. In this work, we conduct a systematic study…

    Submitted 19 October, 2023; originally announced October 2023.

    Comments: Long Paper in Findings of EMNLP23

  9. arXiv:2307.10926  [pdf, other]

    eess.IV cs.CV cs.LG

    Confidence intervals for performance estimates in 3D medical image segmentation

    Authors: R. El Jurdi, G. Varoquaux, O. Colliot

    Abstract: Medical segmentation models are evaluated empirically. As such an evaluation is based on a limited set of example images, it is unavoidably noisy. Beyond a mean performance measure, reporting confidence intervals is thus crucial. However, this is rarely done in medical image segmentation. The width of the confidence interval depends on the test set size and on the spread of the performance measure…

    Submitted 21 July, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

    Comments: 10 pages

  10. arXiv:2302.01860  [pdf, other]

    cs.CL

    GLADIS: A General and Large Acronym Disambiguation Benchmark

    Authors: Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek

    Abstract: Acronym Disambiguation (AD) is crucial for natural language understanding on various sources, including biomedical reports, scientific papers, and search engine queries. However, existing acronym disambiguation benchmarks and tools are limited to specific domains, and the size of prior benchmarks is rather small. To accelerate the research on acronym disambiguation, we construct a new benchmark na…

    Submitted 13 March, 2023; v1 submitted 3 February, 2023; originally announced February 2023.

    Comments: Long paper at EACL 23

  11. Understanding metric-related pitfalls in image analysis validation

    Authors: Annika Reinke, Minu D. Tizabi, Michael Baumgartner, Matthias Eisenmann, Doreen Heckmann-Nötzel, A. Emre Kavur, Tim Rädsch, Carole H. Sudre, Laura Acion, Michela Antonelli, Tal Arbel, Spyridon Bakas, Arriel Benis, Matthew Blaschko, Florian Buettner, M. Jorge Cardoso, Veronika Cheplygina, Jianxu Chen, Evangelia Christodoulou, Beth A. Cimini, Gary S. Collins, Keyvan Farahani, Luciana Ferrer, Adrian Galdran, Bram van Ginneken, et al. (53 additional authors not shown)

    Abstract: Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibilit…

    Submitted 23 February, 2024; v1 submitted 3 February, 2023; originally announced February 2023.

    Comments: Shared first authors: Annika Reinke and Minu D. Tizabi; shared senior authors: Lena Maier-Hein and Paul F. Jäger. Published in Nature Methods. arXiv admin note: text overlap with arXiv:2206.01653

    Journal ref: Nature methods, 1-13 (2024)

  12. arXiv:2302.00370  [pdf, other]

    stat.ML cs.LG

    How to select predictive models for causal inference?

    Authors: Matthieu Doutreligne, Gaël Varoquaux

    Abstract: As predictive models -- e.g., from machine learning -- give likely outcomes, they may be used to reason on the effect of an intervention, a causal-inference task. The increasing complexity of health data has opened the door to a plethora of models, but also the Pandora box of model selection: which of these models yield the most valid causal estimates? Here we highlight that classic machine-learni…

    Submitted 16 May, 2023; v1 submitted 1 February, 2023; originally announced February 2023.

    Comments: 35 pages

  13. arXiv:2210.16315  [pdf, other]

    cs.LG cs.AI stat.ML

    Beyond calibration: estimating the grouping loss of modern neural networks

    Authors: Alexandre Perez-Lebel, Marine Le Morvan, Gaël Varoquaux

    Abstract: The ability to ensure that a classifier gives reliable confidence scores is essential to ensure informed decision-making. To this end, recent work has focused on miscalibration, i.e., the over or under confidence of model scores. Yet calibration is not enough: even a perfectly calibrated classifier with the best possible accuracy can have confidence scores that are far from the true posterior prob…

    Submitted 27 April, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Journal ref: ICLR 2023 -- The Eleventh International Conference on Learning Representations, May 2023, Kigali, Rwanda

  14. arXiv:2207.08815  [pdf, other]

    cs.LG cs.AI stat.ME stat.ML

    Why do tree-based models still outperform deep learning on tabular data?

    Authors: Léo Grinsztajn, Edouard Oyallon, Gaël Varoquaux

    Abstract: While deep learning has enabled tremendous progress on text and image datasets, its superiority on tabular data is not clear. We contribute extensive benchmarks of standard and novel deep learning methods as well as tree-based models such as XGBoost and Random Forests, across a large number of datasets and hyperparameter combinations. We define a standard set of 45 datasets from varied domains wit…

    Submitted 18 July, 2022; originally announced July 2022.

  15. Metrics reloaded: Recommendations for image analysis validation

    Authors: Lena Maier-Hein, Annika Reinke, Patrick Godau, Minu D. Tizabi, Florian Buettner, Evangelia Christodoulou, Ben Glocker, Fabian Isensee, Jens Kleesiek, Michal Kozubek, Mauricio Reyes, Michael A. Riegler, Manuel Wiesenfarth, A. Emre Kavur, Carole H. Sudre, Michael Baumgartner, Matthias Eisenmann, Doreen Heckmann-Nötzel, Tim Rädsch, Laura Acion, Michela Antonelli, Tal Arbel, Spyridon Bakas, Arriel Benis, Matthew Blaschko, et al. (49 additional authors not shown)

    Abstract: Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. Particularly in automatic biomedical image analysis, chosen performance metrics often do not reflect the domain interest, thus failing to adequately measure scientific progress and hindering translation of ML techniques into practice. To overcome this, our large international ex…

    Submitted 23 February, 2024; v1 submitted 3 June, 2022; originally announced June 2022.

    Comments: Shared first authors: Lena Maier-Hein, Annika Reinke. arXiv admin note: substantial text overlap with arXiv:2104.05642. Published in Nature Methods

    Journal ref: Nature methods, 1-18 (2024)

  16. arXiv:2203.07860  [pdf, other]

    cs.CL

    Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost

    Authors: Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek

    Abstract: State-of-the-art NLP systems represent inputs with word embeddings, but these are brittle when faced with Out-of-Vocabulary (OOV) words. To address this issue, we follow the principle of mimick-like models to generate vectors for unseen words, by learning the behavior of pre-trained embeddings using only the surface form of words. We present a simple contrastive learning framework, LOVE, which ext…

    Submitted 21 March, 2022; v1 submitted 15 March, 2022; originally announced March 2022.

    Comments: Long paper accepted by ACL main conference. 17 pages

  17. arXiv:2202.10580  [pdf, other]

    cs.LG cs.AI

    Benchmarking missing-values approaches for predictive models on health databases

    Authors: Alexandre Perez-Lebel, Gaël Varoquaux, Marine Le Morvan, Julie Josse, Jean-Baptiste Poline

    Abstract: BACKGROUND: As databases grow larger, it becomes harder to fully control their collection, and they frequently come with missing values: incomplete observations. These large databases are well suited to train machine-learning models, for instance for forecasting or to extract biomarkers in biomedical settings. Such predictive approaches can use discriminative -- rather than generative -- modeling,…

    Submitted 17 February, 2022; originally announced February 2022.

    Comments: GigaScience, Oxford Univ Press, In press

  18. arXiv:2110.06135  [pdf, other]

    cs.LG q-bio.NC

    Label scarcity in biomedicine: Data-rich latent factor discovery enhances phenotype prediction

    Authors: Marc-Andre Schulz, Bertrand Thirion, Alexandre Gramfort, Gaël Varoquaux, Danilo Bzdok

    Abstract: High-quality data accumulation is now becoming ubiquitous in the health domain. There is increasing opportunity to exploit rich data from normal subjects to improve supervised estimators in specific diseases with notorious data scarcity. We demonstrate that low-dimensional embedding spaces can be derived from the UK Biobank population dataset and used to enhance data-scarce prediction of health in…

    Submitted 12 October, 2021; originally announced October 2021.

    Comments: Accepted at NIPS 2017 Workshop on Machine Learning for Health

  19. arXiv:2107.09947  [pdf, other]

    cs.LG math.ST q-bio.QM

    Preventing dataset shift from breaking machine-learning biomarkers

    Authors: Jérôme Dockès, Gaël Varoquaux, Jean-Baptiste Poline

    Abstract: Machine learning brings the hope of finding new biomarkers extracted from cohorts with rich biomedical measurements. A good biomarker is one that gives reliable detection of the corresponding condition. However, biomarkers are often extracted from a cohort that differs from the target population. Such a mismatch, known as a dataset shift, can undermine the application of the biomarker to new indiv…

    Submitted 21 July, 2021; originally announced July 2021.

    Comments: GigaScience, BioMed Central, In press

  20. arXiv:2106.00311  [pdf, other]

    stat.ML cs.AI cs.LG

    What's a good imputation to predict with missing values?

    Authors: Marine Le Morvan, Julie Josse, Erwan Scornet, Gaël Varoquaux

    Abstract: How to learn a good predictor on data with missing values? Most efforts focus on first imputing as well as possible and second learning on the completed data to predict the outcome. Yet, this widespread practice has no theoretical grounding. Here we show that for almost all imputation functions, an impute-then-regress procedure with a powerful learner is Bayes optimal. This result holds for all mi…

    Submitted 30 November, 2021; v1 submitted 1 June, 2021; originally announced June 2021.

  21. arXiv:2104.05642  [pdf, other]

    eess.IV cs.CV

    Common Limitations of Image Processing Metrics: A Picture Story

    Authors: Annika Reinke, Minu D. Tizabi, Carole H. Sudre, Matthias Eisenmann, Tim Rädsch, Michael Baumgartner, Laura Acion, Michela Antonelli, Tal Arbel, Spyridon Bakas, Peter Bankhead, Arriel Benis, Matthew Blaschko, Florian Buettner, M. Jorge Cardoso, Jianxu Chen, Veronika Cheplygina, Evangelia Christodoulou, Beth Cimini, Gary S. Collins, Sandy Engelhardt, Keyvan Farahani, Luciana Ferrer, Adrian Galdran, Bram van Ginneken, et al. (68 additional authors not shown)

    Abstract: While the importance of automatic image analysis is continuously increasing, recent meta-research revealed major flaws with respect to algorithm validation. Performance metrics are particularly key for meaningful, objective, and transparent performance assessment and validation of the used automatic algorithms, but relatively little attention has been given to the practical pitfalls when using spe…

    Submitted 6 December, 2023; v1 submitted 12 April, 2021; originally announced April 2021.

    Comments: Shared first authors: Annika Reinke and Minu D. Tizabi. This is a dynamic paper on limitations of commonly used metrics. It discusses metrics for image-level classification, semantic and instance segmentation, and object detection. For missing use cases, comments or questions, please contact a.reinke@dkfz.de. Substantial contributions to this document will be acknowledged with a co-authorship

  22. arXiv:2103.10292  [pdf, ps, other]

    eess.IV cs.CV cs.LG stat.ML

    How I failed machine learning in medical imaging -- shortcomings and recommendations

    Authors: Gaël Varoquaux, Veronika Cheplygina

    Abstract: Medical imaging is an important research field with many opportunities for improving patients' health. However, there are a number of challenges that are slowing down the progress of the field as a whole, such as optimizing for publication. In this paper we reviewed several problems related to choosing datasets, methods, evaluation metrics, and publication strategies. With a review of literature and…

    Submitted 12 May, 2022; v1 submitted 18 March, 2021; originally announced March 2021.

    Journal ref: npj Digit. Med. 5, 48 (2022). https://doi.org/10.1038/s41746-022-00592-y

  23. arXiv:2103.03098  [pdf, other]

    cs.LG stat.ML

    Accounting for Variance in Machine Learning Benchmarks

    Authors: Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, Pascal Vincent

    Abstract: Strong empirical evidence that one machine-learning algorithm A outperforms another one B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameters choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, reve…

    Submitted 1 March, 2021; originally announced March 2021.

    Comments: Submitted to MLSys2021

  24. arXiv:2012.08844  [pdf, other]

    cs.CL

    A Lightweight Neural Model for Biomedical Entity Linking

    Authors: Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek

    Abstract: Biomedical entity linking aims to map biomedical mentions, such as diseases and drugs, to standard entities in a given knowledge base. The specific challenge in this context is that the same biomedical entity can have a wide range of names, including synonyms, morphological variations, and names with different word orderings. Recently, BERT-based methods have advanced the state-of-the-art by allow…

    Submitted 21 May, 2021; v1 submitted 16 December, 2020; originally announced December 2020.

  25. arXiv:2007.01627  [pdf, other]

    cs.LG cs.AI stat.ML

    NeuMiss networks: differentiable programming for supervised learning with missing values

    Authors: Marine Le Morvan, Julie Josse, Thomas Moreau, Erwan Scornet, Gaël Varoquaux

    Abstract: The presence of missing values makes supervised learning much more challenging. Indeed, previous work has shown that even when the response is a linear function of the complete data, the optimal predictor is a complex function of the observed entries and the missingness indicator. As a result, the computational or sample complexities of consistent approaches depend on the number of missing pattern…

    Submitted 4 November, 2020; v1 submitted 3 July, 2020; originally announced July 2020.

    Journal ref: Advances in Neural Information Processing Systems 33, Dec 2020, Vancouver, Canada

  26. arXiv:2002.00658  [pdf, other]

    cs.LG cs.AI stat.ML

    Linear predictor on linearly-generated data with missing values: non consistency and solutions

    Authors: Marine Le Morvan, Nicolas Prost, Julie Josse, Erwan Scornet, Gaël Varoquaux

    Abstract: We consider building predictors when the data have missing values. We study the seemingly-simple case where the target to predict is a linear function of the fully-observed data and we show that, in the presence of missing values, the optimal predictor may not be linear. In the particular Gaussian case, it can be written as a linear function of multiway interactions between the observed data and t…

    Submitted 12 May, 2020; v1 submitted 3 February, 2020; originally announced February 2020.

    Journal ref: Proceedings of Machine Learning Research, PMLR, In press

  27. arXiv:1909.09264  [pdf, other]

    stat.ML cs.LG

    Comparing distributions: $\ell_1$ geometry improves kernel two-sample testing

    Authors: M. Scetbon, G. Varoquaux

    Abstract: Are two sets of observations drawn from the same distribution? This problem is a two-sample test. Kernel methods lead to many appealing properties. Indeed state-of-the-art approaches use the $L^2$ distance between kernel-based distribution representatives to derive their test statistics. Here, we show that $L^p$ distances (with $p\geq 1$) between these distribution representatives give metrics on…

    Submitted 30 September, 2019; v1 submitted 19 September, 2019; originally announced September 2019.

  28. Encoding high-cardinality string categorical variables

    Authors: Patricio Cerda, Gaël Varoquaux

    Abstract: Statistical models usually require vector representations of categorical variables, using for instance one-hot encoding. This strategy breaks down when the number of categories grows, as it creates high-dimensional feature vectors. Additionally, for string entries, one-hot encoding does not capture information in their representation. Here, we seek low-dimensional encoding of high-cardinality strin…

    Submitted 18 May, 2020; v1 submitted 3 July, 2019; originally announced July 2019.

    Journal ref: IEEE Transactions on Knowledge and Data Engineering, Institute of Electrical and Electronics Engineers, pp.1-1

  29. arXiv:1906.02687  [pdf, other]

    eess.SP cs.LG stat.ML

    Manifold-regression to predict from MEG/EEG brain signals without source modeling

    Authors: David Sabbagh, Pierre Ablin, Gael Varoquaux, Alexandre Gramfort, Denis A. Engemann

    Abstract: Magnetoencephalography and electroencephalography (M/EEG) can reveal neuronal dynamics non-invasively in real-time and are therefore appreciated methods in medicine and neuroscience. Recent advances in modeling brain-behavior relationships have highlighted the effectiveness of Riemannian geometry for summarizing the spatially correlated time-series from M/EEG in terms of their covariance. However,…

    Submitted 22 November, 2019; v1 submitted 4 June, 2019; originally announced June 2019.

  30. arXiv:1902.06931  [pdf, other]

    stat.ML cs.LG math.ST

    On the consistency of supervised learning with missing values

    Authors: Julie Josse, Jacob M. Chen, Nicolas Prost, Erwan Scornet, Gaël Varoquaux

    Abstract: In many application settings, the data have missing entries which make analysis challenging. An abundant literature addresses missing values in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data. We show the consistency of two appr…

    Submitted 21 March, 2024; v1 submitted 19 February, 2019; originally announced February 2019.

  31. arXiv:1809.10024  [pdf]

    cs.CY q-bio.NC stat.ML

    Computational and informatics advances for reproducible data analysis in neuroimaging

    Authors: Russell A. Poldrack, Krzysztof J. Gorgolewski, Gael Varoquaux

    Abstract: The reproducibility of scientific research has become a point of critical concern. We argue that openness and transparency are critical for reproducibility, and we outline an ecosystem for open and transparent science that has emerged within the human neuroimaging community. We discuss the range of open data sharing resources that have been developed for neuroimaging data, and the role of data sta…

    Submitted 24 September, 2018; originally announced September 2018.

  32. arXiv:1809.06304  [pdf, other]

    stat.ML cs.IT cs.LG

    Approximate message-passing for convex optimization with non-separable penalties

    Authors: Andre Manoel, Florent Krzakala, Gaël Varoquaux, Bertrand Thirion, Lenka Zdeborová

    Abstract: We introduce an iterative optimization scheme for convex objectives consisting of a linear loss and a non-separable penalty, based on the expectation-consistent approximation and the vector approximate message-passing (VAMP) algorithm. Specifically, the penalties we approach are convex on a linear transformation of the variable to be determined, a notable example being total variation (TV). We des…

    Submitted 17 September, 2018; originally announced September 2018.

    Comments: 18 pages, 6 figures

  33. arXiv:1809.06035  [pdf, other]

    stat.ML cs.CV cs.LG q-bio.QM

    Extracting representations of cognition across neuroimaging studies improves brain decoding

    Authors: Arthur Mensch, Julien Mairal, Bertrand Thirion, Gaël Varoquaux

    Abstract: Cognitive brain imaging is accumulating datasets about the neural substrate of many different mental processes. Yet, most studies are based on few subjects and have low statistical power. Analyzing data across studies could bring more statistical power; yet the current brain-imaging analytic framework cannot be used at scale as it requires casting all cognitive tasks in a unified theoretical frame…

    Submitted 19 May, 2021; v1 submitted 17 September, 2018; originally announced September 2018.

    Journal ref: PLoS Computational Biology, Public Library of Science, 2021

  34. arXiv:1807.11718  [pdf, other]

    cs.LG stat.ML

    Feature Grouping as a Stochastic Regularizer for High-Dimensional Structured Data

    Authors: Sergul Aydore, Bertrand Thirion, Gael Varoquaux

    Abstract: In many applications where collecting data is expensive, for example neuroscience or medical imaging, the sample size is typically small compared to the feature dimension. It is challenging in this setting to train expressive, non-linear models without overfitting. These datasets call for intelligent regularization that exploits known structure, such as correlations between the features arising fr…

    Submitted 22 April, 2019; v1 submitted 31 July, 2018; originally announced July 2018.

    Comments: 12 pages, 14 figures

    Journal ref: ICML2019

  35. arXiv:1806.01139  [pdf, other]

    stat.ME cs.IR cs.LG

    Text to brain: predicting the spatial distribution of neuroimaging observations from text reports

    Authors: Jérôme Dockès, Demian Wassermann, Russell Poldrack, Fabian Suchanek, Bertrand Thirion, Gaël Varoquaux

    Abstract: Despite the digital nature of magnetic resonance imaging, the resulting observations are most frequently reported and stored in text documents. There is a trove of information untapped in medical health records, case reports, and medical publications. In this paper, we propose to mine brain medical publications to learn the spatial distribution associated with anatomical terms. The problem is form…

    Submitted 28 June, 2018; v1 submitted 4 June, 2018; originally announced June 2018.

    Journal ref: MICCAI 2018 - 21st International Conference on Medical Image Computing and Computer Assisted Intervention, Sep 2018, Granada, Spain. pp.1-18, 2018

  36. arXiv:1806.00979  [pdf, other]

    cs.LG cs.AI stat.ML

    Similarity encoding for learning with dirty categorical variables

    Authors: Patricio Cerda, Gaël Varoquaux, Balázs Kégl

    Abstract: For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. "Dirty" non-curated data gives rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We…

    Submitted 4 June, 2018; originally announced June 2018.

  37. arXiv:1710.11438  [pdf, other]

    stat.ML cs.LG q-bio.NC

    Learning Neural Representations of Human Cognition across Many fMRI Studies

    Authors: Arthur Mensch, Julien Mairal, Danilo Bzdok, Bertrand Thirion, Gaël Varoquaux

    Abstract: Cognitive neuroscience is enjoying rapid increase in extensive public brain-imaging datasets. It opens the door to large-scale statistical models. Finding a unified perspective for all available data calls for scalable and automated solutions to an old challenge: how to aggregate heterogeneous information on brain function into a universal cognitive system that relates mental operations/cognitive…

    Submitted 10 November, 2017; v1 submitted 31 October, 2017; originally announced October 2017.

    Comments: Advances in Neural Information Processing Systems, Dec 2017, Long Beach, United States. 2017

    Journal ref: Advances in Neural Information Processing Systems, 2017

  38. arXiv:1701.05363  [pdf, other]

    stat.ML cs.LG math.OC q-bio.NC

    Stochastic Subsampling for Factorizing Huge Matrices

    Authors: Arthur Mensch, Julien Mairal, Bertrand Thirion, Gael Varoquaux

    Abstract: We present a matrix-factorization algorithm that scales to input matrices with both huge number of rows and columns. Learned factors may be sparse or dense and/or non-negative, which makes our algorithm suitable for dictionary learning, sparse component analysis, and non-negative matrix factorization. Our algorithm streams matrix columns while subsampling them to iteratively learn the matrix facto…

    Submitted 30 October, 2017; v1 submitted 19 January, 2017; originally announced January 2017.

    Comments: IEEE Transactions on Signal Processing, Institute of Electrical and Electronics Engineers, In press

    Journal ref: IEEE Transactions on Signal Processing, 2018, 66 (1), pp 113-128

  39. arXiv:1611.10041  [pdf, other]

    math.OC cs.LG stat.ML

    Subsampled online matrix factorization with convergence guarantees

    Authors: Arthur Mensch, Julien Mairal, Gaël Varoquaux, Bertrand Thirion

    Abstract: We present a matrix factorization algorithm that scales to input matrices that are large in both dimensions (i.e., that contain more than 1TB of data). The algorithm streams the matrix columns while subsampling them, resulting in low complexity per iteration and reasonable memory footprint. In contrast to previous online matrix factorization methods, our approach relies on low-dimensional statistic…

    Submitted 30 November, 2016; originally announced November 2016.

    Journal ref: 9th NIPS Workshop on Optimization for Machine Learning, Dec 2016, Barcelona, Spain

  40. Recursive nearest agglomeration (ReNA): fast clustering for approximation of structured signals

    Authors: Andrés Hoyos-Idrobo, Gaël Varoquaux, Jonas Kahn, Bertrand Thirion

    Abstract: In this work, we revisit fast dimension reduction approaches, as with random projections and random sampling. Our goal is to summarize the data to decrease computational costs and memory footprint of subsequent analysis. Such dimension reduction can be very efficient when the signals of interest have a strong structure, such as with images. We focus on this setting and investigate feature clusteri… ▽ More

    Submitted 19 March, 2018; v1 submitted 15 September, 2016; originally announced September 2016.

    Comments: IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers, In press

  41. arXiv:1606.06439  [pdf, other

    stat.ML cs.CV q-bio.NC

    Social-sparsity brain decoders: faster spatial sparsity

    Authors: Gaël Varoquaux, Matthieu Kowalski, Bertrand Thirion

    Abstract: Spatially-sparse predictors are good models for brain decoding: they give accurate predictions and their weight maps are interpretable as they focus on a small number of regions. However, the state of the art, based on total variation or graph-net, is computationally costly. Here we introduce sparsity in the local neighborhood of each voxel with social-sparsity, a structured shrinkage operator. We… ▽ More

    Submitted 21 June, 2016; originally announced June 2016.

    Comments: in Pattern Recognition in NeuroImaging, Jun 2016, Trento, Italy. 2016

  42. arXiv:1605.00937  [pdf, other

    stat.ML cs.LG q-bio.QM

    Dictionary Learning for Massive Matrix Factorization

    Authors: Arthur Mensch, Julien Mairal, Bertrand Thirion, Gaël Varoquaux

    Abstract: Sparse matrix factorization is a popular tool to obtain interpretable data decompositions, which are also effective to perform data completion or denoising. Its applicability to large datasets has been addressed with online and randomized methods that reduce the complexity in one of the matrix dimensions, but not in both of them. In this paper, we tackle very large matrices in both dimensions. We… ▽ More

    Submitted 26 May, 2016; v1 submitted 3 May, 2016; originally announced May 2016.

    Journal ref: Proceedings of the International Conference on Machine Learning, 2016, pp 1737-1746

  43. Compressed Online Dictionary Learning for Fast fMRI Decomposition

    Authors: Arthur Mensch, Gaël Varoquaux, Bertrand Thirion

    Abstract: We present a method for fast resting-state fMRI spatial decompositions of very large datasets, based on the reduction of the temporal dimension before applying dictionary learning on concatenated individual records from groups of subjects. Introducing a measure of correspondence between spatial decompositions of rest fMRI, we demonstrate that time-reduced dictionary learning produces result as r… ▽ More

    Submitted 8 February, 2016; originally announced February 2016.

    Journal ref: IEEE International Symposium on Biomedical Imaging, 2016

  44. arXiv:1512.06999  [pdf, ps, other

    q-bio.NC cs.LG stat.CO stat.ML

    FAASTA: A fast solver for total-variation regularization of ill-conditioned problems with application to brain imaging

    Authors: Gaël Varoquaux, Michael Eickenberg, Elvis Dohmatob, Bertrand Thirion

    Abstract: The total variation (TV) penalty, like many other analysis-sparsity problems, does not lead to separable factors or a proximal operator with a closed-form expression, such as soft thresholding for the $\ell_1$ penalty. As a result, in a variational formulation of an inverse problem or statistical learning estimation, it leads to challenging non-smooth optimization problems that are often solved with e… ▽ More

    Submitted 22 December, 2015; originally announced December 2015.

    Journal ref: Colloque GRETSI, Sep 2015, Lyon, France. Gretsi, 2015, http://www.gretsi.fr/colloque2015/myGretsi/programme.php

  45. arXiv:1511.04898  [pdf, other

    stat.ML cs.CV

    Fast clustering for scalable statistical analysis on structured images

    Authors: Bertrand Thirion, Andrés Hoyos-Idrobo, Jonas Kahn, Gael Varoquaux

    Abstract: The use of brain images as markers for diseases or behavioral differences is challenged by the small effect sizes and the ensuing lack of power, an issue that has incited researchers to rely more systematically on large cohorts. Coupled with resolution increases, this leads to very large datasets. A striking example in the case of brain imaging is that of the Human Connectome Project: 20 Terabytes… ▽ More

    Submitted 16 November, 2015; originally announced November 2015.

    Comments: ICML Workshop on Statistics, Machine Learning and Neuroscience (Stamlins 2015), Jul 2015, Lille, France

  46. arXiv:1412.3925  [pdf, other

    q-bio.NC cs.CV

    Region segmentation for sparse decompositions: better brain parcellations from rest fMRI

    Authors: Alexandre Abraham, Elvis Dohmatob, Bertrand Thirion, Dimitris Samaras, Gael Varoquaux

    Abstract: Functional Magnetic Resonance Images acquired during resting-state provide information about the functional organization of the brain through measuring correlations between brain areas. Independent components analysis is the reference approach to estimate spatial components from weakly structured data such as brain signal time courses; each of these components may be referred to as a brain network… ▽ More

    Submitted 12 December, 2014; originally announced December 2014.

    Journal ref: Sparsity Techniques in Medical Imaging, Sep 2014, Boston, United States. pp.8

  47. arXiv:1412.3919  [pdf, other

    cs.LG cs.CV stat.ML

    Machine Learning for Neuroimaging with Scikit-Learn

    Authors: Alexandre Abraham, Fabian Pedregosa, Michael Eickenberg, Philippe Gervais, Andreas Muller, Jean Kossaifi, Alexandre Gramfort, Bertrand Thirion, Gaël Varoquaux

    Abstract: Statistical machine learning methods are increasingly used for neuroimaging data analysis. Their main virtue is their ability to model high-dimensional datasets, e.g. multivariate analysis of activation images or resting-state time series. Supervised learning is typically used in decoding or encoding settings to relate brain images to behavioral or clinical observations, while unsupervised learnin… ▽ More

    Submitted 12 December, 2014; originally announced December 2014.

    Comments: Frontiers in neuroscience, Frontiers Research Foundation, 2013, pp.15

  48. arXiv:1311.3859  [pdf, other

    stat.ML cs.LG q-bio.NC

    Mapping cognitive ontologies to and from the brain

    Authors: Yannick Schwartz, Bertrand Thirion, Gaël Varoquaux

    Abstract: Imaging neuroscience links brain activation maps to behavior and cognition via correlational studies. Due to the nature of the individual experiments, based on eliciting neural response from a small number of stimuli, this link is incomplete, and unidirectional from the causal point of view. To come to conclusions on the function implied by the activation of brain regions, it is necessary to combi… ▽ More

    Submitted 20 November, 2013; v1 submitted 15 November, 2013; originally announced November 2013.

    Comments: NIPS (Neural Information Processing Systems), United States (2013)

  49. arXiv:1309.0238  [pdf, ps, other

    cs.LG cs.MS

    API design for machine learning software: experiences from the scikit-learn project

    Authors: Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake Vanderplas, Arnaud Joly, Brian Holt, Gaël Varoquaux

    Abstract: Scikit-learn is an increasingly popular machine learning library. Written in Python, it is designed to be simple and efficient, accessible to non-experts, and reusable in various contexts. In this paper, we present and discuss our design choices for the application programming interface (API) of the project. In particular, we describe the simple and elegant interface shared by all learning and p… ▽ More

    Submitted 1 September, 2013; originally announced September 2013.

    Journal ref: European Conference on Machine Learning and Principles and Practices of Knowledge Discovery in Databases (2013)

  50. arXiv:1301.6952  [pdf, other

    cs.DB cs.CV q-bio.QM

    PyXNAT: XNAT in Python

    Authors: Yannick Schwartz, Alexis Barbot, Benjamin Thyreau, Vincent Frouin, Gaël Varoquaux, Aditya Siram, Daniel Marcus, Jean-Baptiste Poline

    Abstract: As neuroimaging databases grow in size and complexity, the time researchers spend investigating and managing the data increases at the expense of data analysis. As a result, investigators rely more and more heavily on scripting using high-level languages to automate data management and processing tasks. For this, a structured and programmatic access to the data store is necessary. Web services are… ▽ More

    Submitted 29 January, 2013; originally announced January 2013.

    Journal ref: Frontiers in Neuroinformatics 6, 12 (2012) 1-14