Khaled Saab

Mountain View, California, United States
625 followers · 500+ connections

About

Research Scientist at Google DeepMind. My main projects include Med-Gemini and…

Experience

  • Google

    Stanford, California, United States

  • Atlanta, Georgia, United States

Education

Publications

  • Hungry Hungry Hippos: Towards Language Modeling with State Space Models

    International Conference on Learning Representations

    State space models (SSMs) have demonstrated state-of-the-art sequence modeling performance in
    some modalities, but underperform attention in language modeling. Moreover, despite scaling nearly
    linearly in sequence length instead of quadratically, SSMs are still slower than Transformers due to poor
    hardware utilization. In this paper, we make progress on understanding the expressivity gap between
    SSMs and attention in language modeling, and on reducing the hardware barrier between SSMs and
    attention. First, we use synthetic language modeling tasks to understand the gap between SSMs and
    attention. We find that existing SSMs struggle with two capabilities: recalling earlier tokens in the
    sequence and comparing tokens across the sequence. To understand the impact on language modeling, we
    propose a new SSM layer, H3, that is explicitly designed for these abilities. H3 matches attention on
    the synthetic languages and comes within 0.4 PPL of Transformers on OpenWebText. Furthermore, a
    hybrid 125M-parameter H3-attention model that retains two attention layers surprisingly outperforms
    Transformers on OpenWebText by 1.0 PPL. Next, to improve the efficiency of training SSMs on modern
    hardware, we propose FlashConv. FlashConv uses a fused block FFT algorithm to improve efficiency on
    sequences up to 8K, and introduces a novel state passing algorithm that exploits the recurrent properties
    of SSMs to scale to longer sequences. FlashConv yields 2× speedup on the long-range arena benchmark
    and allows hybrid language models to generate text 2.4× faster than Transformers. Using FlashConv,
    we scale hybrid H3-attention language models up to 2.7B parameters on the Pile and find promising
    initial results, achieving lower perplexity than Transformers and outperforming Transformers in zero- and
    few-shot learning on a majority of tasks in the SuperGLUE benchmark.

  • Effectively Modeling Time Series with Simple Discrete State Spaces

    International Conference on Learning Representations

    Time series modeling is a well-established problem, which often requires that methods (1) expressively represent complicated dependencies, (2) forecast long horizons, and (3) efficiently train over long sequences. State-space models (SSMs) are classical models for time series, and prior works combine SSMs with deep learning layers for efficient sequence modeling. However, we find fundamental limitations with these prior approaches, proving their SSM representations cannot express autoregressive time series processes. We thus introduce SPACETIME, a new state-space time series architecture that improves all three criteria. For expressivity, we propose a new SSM parameterization based on the companion matrix—a canonical representation for discrete-time processes—which enables SPACETIME’s SSM layers to learn desirable autoregressive processes. For long horizon forecasting, we introduce a “closed-loop” variation of the companion SSM, which enables SPACETIME to predict many future time-steps by generating its own layer-wise inputs. For efficient training and inference, we introduce an
    algorithm that reduces the memory and compute of a forward pass with the companion matrix. With sequence length ℓ and state-space size d, we go from Õ(dℓ) naïvely to Õ(d + ℓ). In experiments, our contributions lead to state-of-the-art results on extensive and diverse benchmarks, with best or second-best AUROC on 6 / 7 ECG and speech time series classification, and best MSE on 14 / 16 Informer forecasting tasks. Furthermore, we find SPACETIME (1) fits AR(p) processes that prior deep SSMs fail on, (2) forecasts notably more accurately on longer horizons than prior state-of-the-art, and (3) speeds up training on real-world ETTh1 data by 73% and 80% relative wall-clock time over Transformers and LSTMs.

  • Reducing Reliance on Spurious Features in Medical Image Classification with Spatial Specificity

    Machine Learning for Healthcare

    A common failure mode of neural networks trained to classify abnormalities in medical images is their reliance on spurious features, which are features that are associated with the class label but are non-generalizable. In this work, we examine if supervising models with increased spatial specificity (i.e., information about the location of the abnormality) impacts model reliance on spurious features. We first propose a data model of spurious features and theoretically analyze the impact of increasing spatial specificity. We find that two properties of the data are impacted when we increase spatial specificity: the variance of the positively-labeled input pixels decreases and the mutual information between abnormal and spurious pixels decreases, both of which contribute to improved model robustness to spurious features. However, supervising models with greater spatial specificity incurs higher annotation costs, since training data must be labeled for the location of the abnormality, leading to a trade-off between annotation costs and model robustness to spurious features. We investigate this trade-off by varying the coarseness of the spatial specificity supplied and sweeping the quantity of training samples that have information about the abnormality location. Further, we assess if semi-supervised and contrastive learning methods improve the cost-robustness trade-off. We empirically examine the impact of supervising models with increased spatial specificity on two medical image datasets known to have spurious features: pneumothorax classification on chest x-rays and melanoma classification from dermoscopic images.

  • ViLMedic: A Framework for Research at the Intersection of Vision and Language in Medical AI

    ACL (demo track)

    There is a growing need to model interactions between data modalities (e.g., vision, language) — both to improve AI predictions on existing tasks and to enable new applications. In the recent field of multimodal medical AI, integrating multiple modalities has gained widespread popularity as multimodal models have proven to improve performance, robustness, require less training samples and add complementary information. To improve technical reproducibility and transparency for multimodal medical tasks as well as speed up progress across medical AI, we present ViLMedic, a Vision and Language medical library. As of 2022, the library contains a dozen reference implementations replicating the state-of-the-art results for problems that range from medical visual question answering and radiology report generation to multimodal representation learning on widely adopted medical datasets. In addition, ViLMedic hosts a model-zoo with more than twenty pretrained models for the above tasks designed to be extensible by researchers but also simple for practitioners. Ultimately, we hope our reproducible pipelines can enable clinical translation and create real impact. The library is available at https://github.com/jbdel/vilmedic.

  • Domino: Discovering Systematic Errors with Cross-Modal Embeddings

    International Conference on Learning Representations

    Machine learning models that achieve high overall accuracy often make systematic errors on important subsets (or slices) of data. Identifying underperforming slices is particularly challenging when working with high-dimensional inputs (e.g. images, audio), where important slices are often unlabeled. In order to address this issue, recent studies have proposed automated slice discovery methods (SDMs), which leverage learned model representations to mine input data for slices on which a model performs poorly. To be useful to a practitioner, these methods must identify slices that are both underperforming and coherent (i.e. united by a human-understandable concept). However, no quantitative evaluation framework currently exists for rigorously assessing SDMs with respect to these criteria. Additionally, prior qualitative evaluations have shown that SDMs often identify slices that are incoherent. In this work, we address these challenges by first designing a principled evaluation framework that enables a quantitative comparison of SDMs across 1,235 slice discovery settings in three input domains (natural images, medical images, and time-series data). Then, motivated by the recent development of powerful cross-modal representation learning approaches, we present Domino, an SDM that leverages cross-modal embeddings and a novel error-aware mixture model to discover and describe coherent slices. We find that Domino accurately identifies 36% of the 1,235 slices in our framework - a 12 percentage point improvement over prior methods. Further, Domino is the first SDM that can provide natural language descriptions of identified slices, correctly generating the exact name of the slice in 35% of settings.

  • Self-Supervised Graph Neural Networks for Improved Electroencephalographic Seizure Analysis

    International Conference on Learning Representations

    Automated seizure detection and classification from electroencephalography (EEG) can greatly improve seizure diagnosis and treatment. However, several modeling challenges remain unaddressed in prior automated seizure detection and classification studies: (1) representing non-Euclidean data structure in EEGs, (2) accurately classifying rare seizure types, and (3) lacking a quantitative interpretability approach to measure model ability to localize seizures. In this study, we address these challenges by (1) representing the spatiotemporal dependencies in EEGs using a graph neural network (GNN) and proposing two EEG graph structures that capture the electrode geometry or dynamic brain connectivity, (2) proposing a self-supervised pre-training method that predicts preprocessed signals for the next time period to further improve model performance, particularly on rare seizure types, and (3) proposing a quantitative model interpretability approach to assess a model's ability to localize seizures within EEGs. When evaluating our approach on seizure detection and classification on a large public dataset, we find that our GNN with self-supervised pre-training achieves 0.875 Area Under the Receiver Operating Characteristic Curve on seizure detection and 0.749 weighted F1-score on seizure classification, outperforming previous methods for both seizure detection and classification. Moreover, our self-supervised pre-training strategy significantly improves classification of rare seizure types. Furthermore, quantitative interpretability analysis shows that our GNN with self-supervised pre-training precisely localizes 25.4% focal seizures, a 21.9 point improvement over existing CNNs. Finally, by superimposing the identified seizure locations on both raw EEG signals and EEG graphs, our approach could provide clinicians with an intuitive visualization of localized seizure regions.

  • Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers

    Neural Information Processing Systems

    Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency. We introduce a simple sequence model inspired by control systems that generalizes these approaches while addressing their shortcomings. The Linear State-Space Layer (LSSL) maps a sequence u ↦ y by simply simulating a linear continuous-time state-space representation ẋ = Ax + Bu, y = Cx + Du. Theoretically, we show that LSSL models are closely related to the three aforementioned families of models and inherit their strengths. For example, they generalize convolutions to continuous-time, explain common RNN heuristics, and share features of NDEs such as time-scale adaptation. We then incorporate and generalize recent theory on continuous-time memorization to introduce a trainable subset of structured matrices A that endow LSSLs with long-range memory. Empirically, stacking LSSL layers into a simple deep neural network obtains state-of-the-art results across time series benchmarks for long dependencies in sequential image classification, real-world healthcare regression tasks, and speech. On a difficult speech classification task with length-16000 sequences, LSSL outperforms prior approaches by 24 accuracy points, and even outperforms baselines that use hand-crafted features on 100x shorter sequences.

  • Observational Supervision for Medical Image Classification Using Gaze Data

    MICCAI

    Deep learning models have demonstrated favorable performance on many medical image classification tasks. However, they rely on expensive hand-labeled datasets that are time-consuming to create. In this work, we explore a new supervision source to training deep learning models by using gaze data that is passively and cheaply collected during a clinician’s workflow. We focus on three medical imaging tasks, including classifying chest X-ray scans for pneumothorax and brain MRI slices for metastasis, two of which we curated gaze data for. The gaze data consists of a sequence of fixation locations on the image from an expert trying to identify an abnormality. Hence, the gaze data contains rich information about the image that can be used as a powerful supervision source. We first identify a set of gaze features and show that they indeed contain class-discriminative information. Then, we propose two methods for incorporating gaze features into deep learning pipelines. When no task labels are available, we combine multiple gaze features to extract weak labels and use them as the sole source of supervision (Gaze-WS). When task labels are available, we propose to use the gaze features as auxiliary task labels in a multi-task learning framework (Gaze-MTL). On three medical image classification tasks, our Gaze-WS method without task labels comes within 5 AUROC points (1.7 precision points) of models trained with task labels. With task labels, our Gaze-MTL method can improve performance by 2.4 AUROC points (4 precision points) over multiple baselines.

  • Cross-Modal Data Programming Enables Rapid Medical Machine Learning

    Cell Patterns

    A major bottleneck in developing clinically impactful machine learning models is a lack of labeled training data for model supervision. Thus, medical researchers increasingly turn to weaker, noisier sources of supervision, such as leveraging extractions from unstructured text reports to supervise image classification. A key challenge in weak supervision is combining sources of information that may differ in quality and have correlated errors. Recently, a statistical theory of weak supervision called data programming has shown promise in addressing this challenge. Data programming now underpins many deployed machine-learning systems in the technology industry, even for critical applications. We propose a new technique for applying data programming to the problem of cross-modal weak supervision in medicine, wherein weak labels derived from an auxiliary modality (e.g., text) are used to train models over a different target modality (e.g., images). We evaluate our approach on diverse clinical tasks via direct comparison to institution-scale, hand-labeled datasets. We find that our supervision technique increases model performance by up to 6 points area under the receiver operating characteristic curve (ROC-AUC) over baseline methods by improving both coverage and quality of the weak labels. Our approach yields models that on average perform within 1.75 points ROC-AUC of those supervised with physician-years of hand labeling and outperform those supervised with physician-months of hand labeling by 10.25 points ROC-AUC, while using only person-days of developer time and clinician work—a time saving of 96%. Our results suggest that modern weak supervision techniques such as data programming may enable more rapid development and deployment of clinically useful machine-learning models.

  • Weak Supervision as an Efficient Approach for Automated Seizure Detection in Electroencephalography

    npj Digital Medicine

    Automated seizure detection from electroencephalography (EEG) would improve the quality of patient care while reducing medical costs, but achieving reliably high performance across patients has proven difficult. Convolutional Neural Networks (CNNs) show promise in addressing this problem, but they are limited by a lack of large labeled training datasets. We propose using imperfect but plentiful archived annotations to train CNNs for automated, real-time EEG seizure detection across patients. While these weak annotations indicate possible seizures with precision scores as low as 0.37, they are commonly produced in large volumes within existing clinical workflows by a mixed group of technicians, fellows, students, and board-certified epileptologists. We find that CNNs trained using such weak annotations achieve Area Under the Receiver Operating Characteristic curve (AUROC) values of 0.93 and 0.94 for pediatric and adult seizure onset detection, respectively. Compared to currently deployed clinical software, our model provides a 31% increase (18 points) in F1-score for pediatric patients and a 17% increase (11 points) for adult patients. These results demonstrate that weak annotations, which are sustainably collected via existing clinical workflows, can be leveraged to produce clinically useful seizure detection models.

  • Doubly Weak Supervision of Deep Learning Models for Head CT

    MICCAI

    Recent deep learning models for intracranial hemorrhage (ICH) detection on computed tomography of the head have relied upon large datasets hand-labeled at either the full-scan level or at the individual slice-level. Though these models have demonstrated favorable empirical performance, the hand-labeled datasets upon which they rely are time-consuming and expensive to create. Further, given limited time, modelers must currently make an explicit choice between scan-level supervision, which leverages large numbers of patients, and slice-level supervision, which yields clinically insightful output in the axial and in-plane dimensions. In this work, we propose doubly weak supervision, where we (1) weakly label at the scan-level to scalably incorporate data from large populations and (2) model the problem using an attention-based multiple-instance learning approach that can provide useful signal at both axial and in-plane granularities, even with scan-level supervision. Models trained using this doubly weak supervision approach yield an average ROC-AUC score of 0.91, which is competitive with those of models trained using large, hand-labeled datasets, while requiring less than 10 h of clinician labeling time. Further, our models place large attention weights on the same slices used by the clinician to arrive at the ICH classification, and occlusion maps indicate heavy influence from clinically salient in-plane regions.

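    To make a few of the methods above concrete, the sketches that follow are rough, simplified illustrations rather than the papers' implementations. The H3/FlashConv work treats an SSM layer as a long convolution that can be evaluated with FFTs; the sketch below shows only the basic O(L log L) FFT convolution such layers build on (the fused block-FFT kernel and state-passing algorithm are not reproduced), with a made-up decaying filter standing in for an SSM kernel.

```python
import numpy as np

def fft_conv(u: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Causal convolution of input u (length L) with filter k (length L)
    via FFT in O(L log L), the primitive that FlashConv-style kernels speed up."""
    L = u.shape[-1]
    n = 2 * L  # zero-pad to avoid circular wrap-around
    U = np.fft.rfft(u, n=n)
    K = np.fft.rfft(k, n=n)
    return np.fft.irfft(U * K, n=n)[..., :L]

# toy usage with made-up data
u = np.random.randn(1024)             # input sequence
k = np.exp(-0.01 * np.arange(1024))   # a decaying, SSM-like filter
y = fft_conv(u, k)
print(y.shape)  # (1024,)
```
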
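    SPACETIME parameterizes the SSM state matrix as a companion matrix so the layer can express autoregressive dynamics. Below is a minimal sketch of that idea using one common companion-matrix convention and a naive recurrence; the coefficients and the choices of B and C are illustrative only, not the paper's parameterization.

```python
import numpy as np

def companion(coeffs: np.ndarray) -> np.ndarray:
    """Companion matrix: ones on the subdiagonal, coefficients in the last column."""
    d = len(coeffs)
    A = np.zeros((d, d))
    A[1:, :-1] = np.eye(d - 1)
    A[:, -1] = coeffs
    return A

def ssm_scan(A, B, C, u):
    """Naive O(d*L) recurrence x_{t+1} = A x_t + B u_t, y_t = C x_t."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B * u_t
        ys.append(C @ x)
    return np.array(ys)

# toy AR(3)-like coefficients (illustrative values only)
A = companion(np.array([0.1, -0.2, 0.5]))
B = np.array([0.0, 0.0, 1.0])
C = np.array([0.0, 0.0, 1.0])
y = ssm_scan(A, B, C, np.random.randn(100))
```
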
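    Domino discovers underperforming slices with an error-aware mixture model over cross-modal embeddings. The sketch below is a crude approximation of that idea, not the paper's model: it clusters embeddings jointly with a weighted correctness signal using an off-the-shelf Gaussian mixture and ranks clusters by accuracy. The function name and weighting are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def find_candidate_slices(embeddings: np.ndarray, correct: np.ndarray,
                          n_slices: int = 10, weight: float = 2.0):
    """embeddings: (N, d) cross-modal embeddings; correct: (N,) 0/1 per-example
    correctness. Cluster [embedding, weighted error] jointly, then rank
    clusters by accuracy so the worst-performing candidate slices come first."""
    feats = np.hstack([embeddings, weight * (1 - correct)[:, None]])
    gm = GaussianMixture(n_components=n_slices, covariance_type="diag", random_state=0)
    assign = gm.fit_predict(feats)
    acc = np.array([correct[assign == k].mean() if (assign == k).any() else 1.0
                    for k in range(n_slices)])
    return np.argsort(acc), assign  # worst-performing clusters first
```
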
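    The EEG seizure paper builds graphs over electrodes from either electrode geometry or dynamic connectivity and runs a graph neural network over them. A toy sketch of the geometry-based variant plus one standard normalized graph-convolution step is below; the kernel width, threshold, and dimensions are made up.

```python
import numpy as np

def geometry_adjacency(positions: np.ndarray, sigma: float = 1.0, thresh: float = 0.6):
    """positions: (n_electrodes, 3) electrode coordinates. Gaussian-kernel
    adjacency, thresholded so only nearby electrodes stay connected."""
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2 * sigma ** 2))
    A[A < thresh] = 0.0
    return A

def gcn_step(A, X, W):
    """One graph-convolution step: H = ReLU(D^{-1/2} (A + I) D^{-1/2} X W)."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)

# toy usage: 19 electrodes, 64-dim features per node
pos = np.random.randn(19, 3)
H = gcn_step(geometry_adjacency(pos), np.random.randn(19, 64), np.random.randn(64, 32))
```
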
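    The LSSL abstract defines the layer by simulating ẋ = Ax + Bu, y = Cx + Du. A minimal sketch of one way to do that numerically is below, using a bilinear (Tustin) discretization, which is a common choice assumed here rather than something the abstract specifies.

```python
import numpy as np

def discretize_bilinear(A, B, dt):
    """Bilinear (Tustin) discretization of ẋ = Ax + Bu."""
    I = np.eye(A.shape[0])
    Ab = np.linalg.solve(I - dt / 2 * A, I + dt / 2 * A)
    Bb = np.linalg.solve(I - dt / 2 * A, dt * B)
    return Ab, Bb

def lssl_step(Ab, Bb, C, D, x, u_t):
    """One step of the discretized recurrence: x' = Ab x + Bb u, y = C x' + D u."""
    x = Ab @ x + Bb * u_t
    return x, C @ x + D * u_t

# toy usage: a 4-state system driven by random input
d = 4
A = -np.eye(d) + 0.1 * np.random.randn(d, d)
B, C = np.random.randn(d), np.random.randn(d)
Ab, Bb = discretize_bilinear(A, B, dt=0.1)
x, ys = np.zeros(d), []
for u_t in np.random.randn(200):
    x, y_t = lssl_step(Ab, Bb, C, 0.0, x, u_t)
    ys.append(y_t)
```
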
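    Gaze-MTL, from the gaze-supervision paper, uses gaze-derived features as auxiliary targets in a multi-task setup alongside the main image label. A hedged PyTorch sketch of that structure is below; the encoder, feature dimensions, gaze targets, and loss weight are placeholders rather than the paper's choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeMTL(nn.Module):
    """Shared encoder with a main classification head and an auxiliary head
    that regresses gaze-derived features (simplified stand-in)."""
    def __init__(self, in_dim: int = 1024, feat_dim: int = 256, n_gaze_feats: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.cls_head = nn.Linear(feat_dim, 1)
        self.gaze_head = nn.Linear(feat_dim, n_gaze_feats)

    def forward(self, x):
        h = self.encoder(x)
        return self.cls_head(h).squeeze(-1), self.gaze_head(h)

def gaze_mtl_loss(cls_logit, gaze_pred, y, gaze_target, aux_weight: float = 0.5):
    """Main BCE loss plus a weighted auxiliary regression loss on gaze features."""
    return (F.binary_cross_entropy_with_logits(cls_logit, y)
            + aux_weight * F.mse_loss(gaze_pred, gaze_target))

# toy batch of precomputed image features with made-up gaze targets
x = torch.randn(16, 1024)
logit, gaze_pred = GazeMTL()(x)
loss = gaze_mtl_loss(logit, gaze_pred,
                     torch.randint(0, 2, (16,)).float(), torch.randn(16, 8))
```
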
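    Both weak-supervision papers above (Cell Patterns and npj Digital Medicine) build on the data-programming recipe: write noisy labeling sources over an auxiliary signal such as text reports or archived annotations, combine their votes into training labels, and train the downstream imaging or EEG model on those labels. The toy sketch below uses made-up labeling functions and a plain vote in place of the learned label model that data programming actually fits.

```python
import numpy as np

# Hypothetical labeling functions over radiology report text:
# return +1 (positive), -1 (negative), or 0 (abstain).
def lf_mentions_pneumothorax(report: str) -> int:
    return 1 if "pneumothorax" in report.lower() else 0

def lf_negated(report: str) -> int:
    return -1 if "no pneumothorax" in report.lower() else 0

def lf_chest_tube(report: str) -> int:
    return 1 if "chest tube" in report.lower() else 0

LFS = [lf_mentions_pneumothorax, lf_negated, lf_chest_tube]

def weak_label(report: str) -> float:
    """Combine LF votes; a real data-programming label model learns LF
    accuracies and correlations instead of this unweighted vote."""
    votes = np.array([lf(report) for lf in LFS])
    return float(np.sign(votes.sum()))

print(weak_label("Small apical pneumothorax. Chest tube placed."))  # 1.0 (positive)
print(weak_label("No pneumothorax or effusion."))                   # 0.0 (conflict, abstain)
```
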
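    Finally, the doubly-weak-supervision paper classifies head CT scans with attention-based multiple-instance learning: the scan-level label supervises a learned weighting over per-slice features, and the attention weights indicate which slices drove the prediction. A minimal PyTorch sketch of attention pooling over slice features is below; the slice encoder, dimensions, and pooling details are simplified stand-ins for the paper's model.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Pool per-slice embeddings into a scan-level prediction with learned
    attention weights (a simplified attention-MIL pooling head)."""
    def __init__(self, feat_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, slice_feats: torch.Tensor):
        # slice_feats: (num_slices, feat_dim) from any 2-D slice encoder
        scores = self.attn(slice_feats)              # (num_slices, 1)
        weights = torch.softmax(scores, dim=0)       # attention over slices
        scan_feat = (weights * slice_feats).sum(0)   # (feat_dim,)
        logit = self.classifier(scan_feat)           # scan-level logit
        return logit, weights.squeeze(-1)

# toy usage with random per-slice features for a 40-slice scan
logit, w = AttentionMIL()(torch.randn(40, 256))
```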

Honors & Awards

  • Stanford Interdisciplinary Graduate Fellowship

    Stanford University

    The Stanford Interdisciplinary Graduate Fellowship (SIGF) Program is a competitive, university-wide program overseen and administered by the Office of the Vice Provost for Graduate Education that awards three-year fellowships to outstanding doctoral students engaged in interdisciplinary research.

    More info here: https://vpge.stanford.edu/fellowships-funding/sigf

  • Faculty Honors

    Georgia Institute of Technology

    This designation is awarded to undergraduate students who earn a 4.0 academic average for the semester; received every semester at Georgia Tech.
