Search | arXiv e-print repository

Opportunities in deep learning methods development for computational biology

Authors: Alex Jihun Lee, Reza Abbasi-Asl

Abstract: Advances in molecular technologies underlie an enormous growth in the size of data sets pertaining to biology and biomedicine. These advances parallel those in the deep learning subfield of machine learning. Components in the differentiable programming toolbox that makes deep learning possible are allowing computer scientists to address an increasingly large array of problems with flexible and effective tools. However, many of these tools have not fully proliferated into the computational biology and bioinformatics fields. In this perspective, we survey some of these advances and highlight examples of their use in the biosciences, with the goal of increasing awareness among practitioners of emerging opportunities to blend expert knowledge with newly emerging deep learning architectural tools.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2405.16861 [pdf, other]

NCIDiff: Non-covalent Interaction-generative Diffusion Model for Improving Reliability of 3D Molecule Generation Inside Protein Pocket

Authors: Joongwon Lee, Wonho Zhung, Woo Youn Kim

Abstract: Advancements in deep generative modeling have changed the paradigm of drug discovery. Among such approaches, target-aware methods that exploit 3D structures of protein pockets have been spotlighted for generating ligand molecules together with plausible binding modes. While docking scores superficially assess the quality of generated ligands, closer inspection of the binding structures reveals inconsistency in the local interactions between a pocket and the generated ligands. Here, we address the issue by explicitly generating non-covalent interactions (NCIs), which are universal patterns throughout protein-ligand complexes. Our proposed model, NCIDiff, simultaneously denoises NCI types of protein-ligand edges along with a 3D graph of a ligand molecule during sampling. With the NCI-generating strategy, our model generates ligands with more reliable NCIs, outperforming the baseline diffusion-based models in particular. We also adopt inpainting techniques on NCIs to further improve the quality of the generated molecules. Finally, we showcase the applicability of NCIDiff to drug design tasks in real-world settings with specialized objectives by guiding the generation process with desired NCI patterns.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.03931 [pdf, ps, other]

Incorporating changeable attitudes toward vaccination into an SIR infectious disease model

Authors: Yi Jiang, Kristin M. Kurianski, Jane H. Lee, Yanping Ma, Daniel Cicala, Glenn Ledder

Abstract: We develop a mechanistic model that classifies individuals both in terms of epidemiological status (SIR) and vaccination attitude (willing or unwilling), with the goal of discovering how disease spread is influenced by changing opinions about vaccination. Analysis of the model identifies existence and stability criteria for both disease-free and endemic disease equilibria. The analytical results, supported by numerical simulations, show that attitude changes induced by disease prevalence can destabilize endemic disease equilibria, resulting in limit cycles.

Submitted 6 May, 2024; originally announced May 2024.

Comments: 30 pages, 3 tables, 10 figures

MSC Class: 37N25 (Primary) 92D30 (Secondary)

arXiv:2404.00670 [pdf, other]

Statistical Analysis by Semiparametric Additive Regression and LSTM-FCN Based Hierarchical Classification for Computer Vision Quantification of Parkinsonian Bradykinesia

Authors: Youngseo Cho, In Hee Kwak, Dohyeon Kim, Jinhee Na, Hanjoo Sung, Jeongjae Lee, Young Eun Kim, Hyeo-il Ma

Abstract: Bradykinesia, characterized by involuntary slowing or decrement of movement, is a fundamental symptom of Parkinson's Disease (PD) and is vital for its clinical diagnosis. Among the various methodologies explored to quantify bradykinesia, computer vision-based approaches have shown promising results. However, these methods often fall short in adequately addressing key bradykinesia characteristics in repetitive limb movements: "occasional arrest" and "decrement in amplitude." This research advances vision-based quantification of bradykinesia by introducing nuanced numerical analysis to capture decrement in amplitudes and employing a simple deep learning technique, LSTM-FCN, for precise classification of occasional arrests. Our approach structures the classification process hierarchically, tailoring it to the unique dynamics of bradykinesia in PD. Statistical analysis of the extracted features, including those representing arrest and fatigue, has demonstrated their statistical significance in most cases. This finding underscores the importance of considering "occasional arrest" and "decrement in amplitude" in bradykinesia quantification of limb movement. Our enhanced diagnostic tool has been rigorously tested on an extensive dataset comprising 1396 motion videos from 310 PD patients, achieving an accuracy of 80.3%. The results confirm the robustness and reliability of our method.

Submitted 31 March, 2024; originally announced April 2024.

arXiv:2403.18881 [pdf]

Transmission IR Microscopy for the Quantitation of Biomolecular Mass In Live Cells

Authors: Yow-Ren Chang, Seong-Min Kim, Young Jong Lee

Abstract: Absolute quantity imaging of biomolecules on a single cell level is critical for measurement assurance in biosciences and bioindustries. While infrared (IR) transmission microscopy is a powerful label-free imaging modality capable of chemical quantification, its applicability to hydrated biological samples remains challenging due to the strong water absorption. We overcome this challenge by applying a solvent absorption compensation (SAC) technique to a home-built quantum cascade laser IR microscope. SAC-IR microscopy improves the chemical sensitivity considerably by adjusting the incident light intensity to pre-compensate the IR absorption by water while retaining the full dynamic range. We demonstrate the label-free chemical imaging of key biomolecules of a cell, such as protein, fatty acid, and nucleic acid, with sub-cellular spatial resolution. By imaging live fibroblast cells over twelve hours, we monitor the mass change of the three molecular species of single cells at various phases, including cell division. While the current live-cell imaging demonstration involved three wavenumbers, more wavenumber images could measure more biomolecules in live cells with higher accuracy. As a label-free method to measure absolute quantities of various molecules in a cell, SAC-IR microscopy can potentially become a standard chemical characterization tool for live cells in biology, medicine, and biotechnology.

Submitted 27 March, 2024; originally announced March 2024.

Comments: Body: 19 pages, 5 figures. Supplemental: 11 pages, 6 figures

arXiv:2403.14046 [pdf]

Desiderata of evidence for representation in neuroscience

Authors: Stephan Pohl, Edgar Y. Walker, David L. Barack, Jennifer Lee, Rachel N. Denison, Ned Block, Florent Meyniel, Wei Ji Ma

Abstract: This paper develops a systematic framework for the evidence neuroscientists use to establish whether a neural response represents a feature. Researchers try to establish that the neural response is (1) sensitive and (2) specific to the feature, (3) invariant to other features, and (4) functional, which means that it is used downstream in the brain. We formalize these desiderata in information-theoretic terms. This formalism allows us to precisely state the desiderata while unifying the different analysis methods used in neuroscience under one framework. We discuss how common methods such as correlational analyses, decoding and encoding models, representational similarity analysis, and tests of statistical dependence are used to evaluate the desiderata. In doing so, we provide a common terminology to researchers that helps to clarify disagreements, to compare and integrate results across studies and research groups, and to identify when evidence might be missing and when evidence for some representational conclusion is strong. We illustrate the framework with several canonical examples, including the representation of orientation, numerosity, faces, and spatial location. We end by discussing how the framework can be extended to cover models of the neural code, multi-stage models, and other domains.

Submitted 20 March, 2024; originally announced March 2024.

Comments: 50 pages, 11 figures

arXiv:2403.06432 [pdf, other]

Joint-Embedding Masked Autoencoder for Self-supervised Learning of Dynamic Functional Connectivity from the Human Brain

Authors: Jungwon Choi, Hyungi Lee, Byung-Hoon Kim, Juho Lee

Abstract: Graph Neural Networks (GNNs) have shown promise in learning dynamic functional connectivity for distinguishing phenotypes from human brain networks. However, obtaining extensive labeled clinical data for training is often resource-intensive, making practical application difficult. Leveraging unlabeled data thus becomes crucial for representation learning in a label-scarce setting. Although generative self-supervised learning techniques, especially masked autoencoders, have shown promising results in representation learning in various domains, their application to dynamic graphs for dynamic functional connectivity remains underexplored, facing challenges in capturing high-level semantic representations. Here, we introduce the Spatio-Temporal Joint Embedding Masked Autoencoder (ST-JEMA), drawing inspiration from the Joint Embedding Predictive Architecture (JEPA) in computer vision. ST-JEMA employs a JEPA-inspired strategy for reconstructing dynamic graphs, which enables the learning of higher-level semantic representations considering temporal perspectives, addressing the challenges in fMRI data representation learning. Utilizing the large-scale UK Biobank dataset for self-supervised learning, ST-JEMA shows exceptional representation learning performance on dynamic functional connectivity, demonstrating superiority over previous methods in predicting phenotypes and psychiatric diagnoses across eight benchmark fMRI datasets even with limited samples, as well as effective temporal reconstruction in missing-data scenarios. These findings highlight the potential of our approach as a robust representation learning method for leveraging label-scarce fMRI data.

Submitted 11 March, 2024; originally announced March 2024.

Comments: Under review

arXiv:2402.18361 [pdf, other]

Why Do Animals Need Shaping? A Theory of Task Composition and Curriculum Learning

Authors: Jin Hwa Lee, Stefano Sarao Mannelli, Andrew Saxe

Abstract: Diverse studies in systems neuroscience begin with extended periods of curriculum training known as 'shaping' procedures. These involve progressively studying component parts of more complex tasks, and can make the difference between learning a task quickly, slowly or not at all. Despite the importance of shaping to the acquisition of complex tasks, there is as yet no theory that can help guide the design of shaping procedures, or more fundamentally, provide insight into its key role in learning. Modern deep reinforcement learning systems might implicitly learn compositional primitives within their multilayer policy networks. Inspired by these models, we propose and analyse a model of deep policy gradient learning of simple compositional reinforcement learning tasks. Using the tools of statistical physics, we solve for exact learning dynamics and characterise different learning strategies including primitives pre-training, in which task primitives are studied individually before learning compositional tasks. We find a complex interplay between task complexity and the efficacy of shaping strategies. Overall, our theory provides an analytical understanding of the benefits of shaping in a class of compositional tasks and a quantitative account of how training protocols can disclose useful task primitives, ultimately yielding faster and more robust learning.

Submitted 12 June, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

Comments: Accepted to ICML 2024. 5 figures, 9 pages and Appendix

arXiv:2401.17433 [pdf]

Coronary CTA and Quantitative Cardiac CT Perfusion (CCTP) in Coronary Artery Disease

Authors: Hao Wu, Yingnan Song, Ammar Hoori, Ananya Subramaniam, Juhwan Lee, Justin Kim, Tao Hu, Sadeer Al-Kindi, Wei-Ming Huang, Chun-Ho Yun, Chung-Lieh Hung, Sanjay Rajagopalan, David L. Wilson

Abstract: We assessed the benefit of combining stress cardiac CT perfusion (CCTP) myocardial blood flow (MBF) with coronary CT angiography (CCTA) using our innovative CCTP software. By combining CCTA and CCTP, one can uniquely identify a flow limiting stenosis (obstructive-lesion + low-MBF) versus MVD (no-obstructive-lesion + low-MBF). We retrospectively evaluated 104 patients with suspected CAD, including 18 with diabetes, who underwent CCTA+CCTP. Whole heart and territorial MBF were assessed using our automated pipeline for CCTP analysis that included beam hardening correction; temporal scan registration; automated segmentation; fast, accurate, robust MBF estimation; and visualization. Stenosis severity was scored using the CCTA coronary-artery-disease-reporting-and-data-system (CAD-RADS), with obstructive stenosis deemed as CAD-RADS>=3. We established a threshold MBF (MBF=199-mL/min-100g) for normal perfusion. In patients with CAD-RADS>=3, 28/37 (76%) patients showed ischemia in the corresponding territory. Two patients with obstructive disease had normal perfusion, suggesting collaterals and/or a hemodynamically insignificant stenosis. Among diabetics, 10 of 18 (56%) demonstrated diffuse ischemia consistent with MVD. Among non-diabetics, only 6% had MVD. Sex-specific prevalence of MVD was 21%/24% (M/F). On a per-vessel basis (n=256), MBF showed a significant difference between territories with and without obstructive stenosis (165 +/- 61 mL/min-100g vs. 274 +/- 62 mL/min-100g, p<0.05). A significant and negative rank correlation (rho=-0.53, p<0.05) between territory MBF and CAD-RADS was seen. CCTA in conjunction with a new automated quantitative CCTP approach can augment the interpretation of CAD, enabling the distinction of ischemia due to obstructive lesions and MVD.

Submitted 30 January, 2024; originally announced January 2024.

arXiv:2401.16190 [pdf]

AI prediction of cardiovascular events using opportunistic epicardial adipose tissue assessments from CT calcium score

Authors: Tao Hu, Joshua Freeze, Prerna Singh, Justin Kim, Yingnan Song, Hao Wu, Juhwan Lee, Sadeer Al-Kindi, Sanjay Rajagopalan, David L. Wilson, Ammar Hoori

Abstract: Background: Recent studies have used basic epicardial adipose tissue (EAT) assessments (e.g., volume and mean HU) to predict risk of atherosclerosis-related, major adverse cardiovascular events (MACE). Objectives: Create novel, hand-crafted EAT features, 'fat-omics', to capture the pathophysiology of EAT and improve MACE prediction. Methods: We segmented EAT using a previously-validated deep learning method with optional manual correction. We extracted 148 radiomic features (morphological, spatial, and intensity) and used Cox elastic-net for feature reduction and prediction of MACE. Results: Traditional fat features gave marginal prediction (EAT-volume/EAT-mean-HU/BMI gave C-index 0.53/0.55/0.57, respectively). Significant improvement was obtained with 15 fat-omics features (C-index=0.69, test set). High-risk features included volume-of-voxels-having-elevated-HU-[-50, -30-HU] and HU-negative-skewness, both of which assess high HU, which has been implicated in fat inflammation. Other high-risk features include kurtosis-of-EAT-thickness, reflecting the heterogeneity of thicknesses, and EAT-volume-in-the-top-25%-of-the-heart, emphasizing adipose near the proximal coronary arteries. Kaplan-Meier plots of Cox-identified, high- and low-risk patients were well separated with the median of the fat-omics risk, with the high-risk group having an HR 2.4 times that of the low-risk group (P<0.001). Conclusion: Preliminary findings indicate an opportunity to use more finely tuned, explainable assessments on EAT for improved cardiovascular risk prediction.

Submitted 29 January, 2024; originally announced January 2024.

Comments: 7 pages, 1 central illustration, 6 figures, 5 tables
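
The C-index values quoted in the abstract above (for example 0.69 on the test set) measure how well a risk score ranks patients by time to event. As a rough illustration only, here is a minimal sketch of Harrell's concordance index in plain Python. The function name, argument layout, and toy data are assumptions made for this example; this is not the authors' pipeline, which the abstract describes as Cox elastic-net.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data (illustrative sketch).

    times  -- follow-up time for each patient
    events -- 1 if the event was observed, 0 if censored
    risks  -- predicted risk score (higher means earlier event expected)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i has an observed event
            # strictly before patient j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # correctly ranked pair
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: three patients, the last one censored.
print(concordance_index([1, 2, 3], [1, 1, 0], [3.0, 2.0, 1.0]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the traditional features at 0.53 to 0.57 are described as marginal.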

arXiv:2401.12974 [pdf, other]

SegmentAnyBone: A Universal Model that Segments Any Bone at Any Location on MRI

Authors: Hanxue Gu, Roy Colglazier, Haoyu Dong, Jikai Zhang, Yaqian Chen, Zafer Yildiz, Yuwen Chen, Lin Li, Jichen Yang, Jay Willhite, Alex M. Meyer, Brian Guo, Yashvi Atul Shah, Emily Luo, Shipra Rajput, Sally Kuehn, Clark Bulleit, Kevin A. Wu, Jisoo Lee, Brandon Ramirez, Darui Lu, Jay M. Levin, Maciej A. Mazurowski

Abstract: Magnetic Resonance Imaging (MRI) is pivotal in radiology, offering non-invasive and high-quality insights into the human body. Precise segmentation of MRIs into different organs and tissues would be highly beneficial since it would allow for a higher level of understanding of the image content and enable important measurements, which are essential for accurate diagnosis and effective treatment planning. Specifically, segmenting bones in MRI would allow for more quantitative assessments of musculoskeletal conditions, while such assessments are largely absent in current radiological practice. The difficulty of bone MRI segmentation is illustrated by the fact that limited algorithms are publicly available for use, and those contained in the literature typically address a specific anatomic area. In our study, we propose a versatile, publicly available deep-learning model for bone segmentation in MRI across multiple standard MRI locations. The proposed model can operate in two modes: fully automated segmentation and prompt-based segmentation. Our contributions include (1) collecting and annotating a new MRI dataset across various MRI protocols, encompassing over 300 annotated volumes and 8485 annotated slices across diverse anatomic regions; (2) investigating several standard network architectures and strategies for automated segmentation; (3) introducing SegmentAnyBone, an innovative foundational model-based approach that extends Segment Anything Model (SAM); (4) comparative analysis of our algorithm and previous approaches; and (5) generalization analysis of our algorithm across different anatomical locations and MRI sequences, as well as an external dataset. We publicly release our model at https://github.com/mazurowski-lab/SegmentAnyBone.

Submitted 23 January, 2024; originally announced January 2024.

Comments: 15 pages, 15 figures

arXiv:2401.08098 [pdf]

Attention-Based CNN-BiLSTM for Sleep State Classification of Spatiotemporal Wide-Field Calcium Imaging Data

Authors: Xiaohui Zhang, Eric C. Landsness, Hanyang Miao, Wei Chen, Michelle Tang, Lindsey M. Brier, Joseph P. Culver, Jin-Moo Lee, Mark A. Anastasio

Abstract: Background: Wide-field calcium imaging (WFCI) with genetically encoded calcium indicators allows for spatiotemporal recordings of neuronal activity in mice. When applied to the study of sleep, WFCI data are manually scored into the sleep states of wakefulness, non-REM (NREM) and REM by use of adjunct EEG and EMG recordings. However, this process is time-consuming, invasive and often suffers from low inter- and intra-rater reliability. Therefore, an automated sleep state classification method that operates on spatiotemporal WFCI data is desired. New Method: A hybrid network architecture consisting of a convolutional neural network (CNN) to extract spatial features of image frames and a bidirectional long short-term memory network (BiLSTM) with an attention mechanism to identify temporal dependencies among different time points was proposed to classify WFCI data into states of wakefulness, NREM and REM sleep. Results: Sleep states were classified with an accuracy of 84% and Cohen's kappa of 0.64. Gradient-weighted class activation maps revealed that the frontal region of the cortex carries more importance when classifying WFCI data into NREM sleep while the posterior area contributes most to the identification of wakefulness. The attention scores indicated that the proposed network focuses on short- and long-range temporal dependency in a state-specific manner. Comparison with Existing Method: On a 3-hour WFCI recording, the CNN-BiLSTM achieved a kappa of 0.67, comparable to a kappa of 0.65 corresponding to the human EEG/EMG-based scoring. Conclusions: The CNN-BiLSTM effectively classifies sleep states from spatiotemporal WFCI data and will enable broader application of WFCI in sleep.

Submitted 15 January, 2024; originally announced January 2024.
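
The entry above reports agreement as Cohen's kappa (0.64 overall, 0.67 on a 3-hour recording) alongside raw accuracy. For readers unfamiliar with the statistic, the following is a minimal sketch of the standard formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The function name and the toy W/N/R label sequences are assumptions for illustration and are not data from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences (sketch)."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of positions where the raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical epochs scored as wake (W), NREM (N) or REM (R).
model = ["W", "W", "N", "N"]
human = ["W", "W", "N", "R"]
print(round(cohen_kappa(model, human), 3))
```

Because kappa discounts agreement expected by chance, it is typically lower than raw accuracy when one state dominates the recording, which is consistent with the abstract pairing 84% accuracy with a kappa of 0.64.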

arXiv:2401.03690 [pdf]

So You Want to Image Myelin Using MRI: Magnetic Susceptibility Source Separation for Myelin Imaging

Authors: Jongho Lee, Sooyeon Ji, Se-Hong Oh

Abstract: In MRI, researchers have long endeavored to effectively visualize myelin distribution in the brain, a pursuit with significant implications for both scientific research and clinical applications. Over time, various methods such as myelin water imaging, magnetization transfer imaging, and relaxometric imaging have been developed, each carrying distinct advantages and limitations. Recently, an innovative technique named magnetic susceptibility source separation has emerged, introducing a novel surrogate biomarker for myelin in the form of a diamagnetic susceptibility map. This paper comprehensively reviews this cutting-edge method, providing the fundamental concepts of magnetic susceptibility, susceptibility imaging, and the validation of the diamagnetic susceptibility map as a myelin biomarker that indirectly measures myelin content. Additionally, the paper explores essential aspects of data acquisition and processing, offering practical insights for readers. A comparison with established myelin imaging methods is also presented, and both current and prospective clinical and scientific applications are discussed to provide a holistic understanding of the technique. This work aims to serve as a foundational resource for newcomers entering this dynamic and rapidly expanding field.

Submitted 28 March, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

Comments: Accepted to Magnetic Resonance in Medical Sciences

arXiv:2401.02624 [pdf, other]

Correlation-enhanced viable core in metabolic networks

Authors: Mi Jin Lee, Sudo Yi, Deok-Sun Lee

Abstract: Cellular ingredient concentrations can be stabilized by adjusting generation and consumption rates through multiple pathways. To explore the portion of cellular metabolism equipped with multiple pathways, we categorize individual metabolic reactions and compounds as viable or inviable: A compound is viable if processed by two or more reactions, and a reaction is viable if all of its substrates and products are viable. Using this classification, we identify the maximal subnetwork of viable nodes, referred to as the viable core, in bipartite metabolic networks across thousands of species. The obtained viable cores are remarkably larger than those in degree-preserving randomized networks, while their broad degree distributions commonly enable the viable cores to shrink gradually as reaction nodes are deleted. We demonstrate that the positive degree-degree correlations of the empirical networks may underlie the enlarged viable cores compared to the randomized networks. By investigating the relation between degree and cross-species frequency of metabolic compounds and reactions, we elucidate the evolutionary origin of the correlations.

Submitted 4 January, 2024; originally announced January 2024.

Comments: 8 pages, 4 figures
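
The viability rule stated in the abstract above (a compound is viable if processed by two or more reactions; a reaction is viable if all of its substrates and products are viable) lends itself to iterative pruning. The sketch below is one possible reading, assuming that only reactions still inside the subnetwork count toward a compound's viability, which is how "maximal subnetwork of viable nodes" is interpreted here. The function name, the dict-of-sets input format, and the toy network are hypothetical and are not taken from the paper.

```python
def viable_core(reactions):
    """Prune a bipartite reaction/compound network to its viable core.

    reactions -- dict mapping a reaction name to the set of compounds it
                 touches (substrates and products together).
    Returns (viable_reactions, viable_compounds) as two sets.
    """
    alive = {r: set(cs) for r, cs in reactions.items()}
    while True:
        # Count how many surviving reactions process each compound.
        degree = {}
        for compounds in alive.values():
            for c in compounds:
                degree[c] = degree.get(c, 0) + 1
        viable_compounds = {c for c, d in degree.items() if d >= 2}
        # Keep only reactions whose compounds are all currently viable.
        kept = {r: cs for r, cs in alive.items() if cs <= viable_compounds}
        if len(kept) == len(alive):
            # Fixed point reached: nothing more can be pruned.
            return set(kept), viable_compounds
        alive = kept


# Hypothetical toy network: compound "c" is touched only by r3, so r3
# is inviable, while r1 and r2 support each other through "a" and "b".
toy = {"r1": {"a", "b"}, "r2": {"a", "b"}, "r3": {"b", "c"}}
print(viable_core(toy))
```

Removing one reaction can push a compound below the two-reaction threshold and trigger further removals, which is why the loop repeats until nothing changes.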

arXiv:2312.14939 [pdf, other]

Large-scale Graph Representation Learning of Dynamic Brain Connectome with Transformers

Authors: Byung-Hoon Kim, Jungwon Choi, EungGu Yun, Kyungsang Kim, Xiang Li, Juho Lee

Abstract: Graph Transformers have recently been successful in various graph representation learning tasks, providing a number of advantages over message-passing Graph Neural Networks. Utilizing Graph Transformers for learning the representation of the brain functional connectivity network is also gaining interest. However, studies to date have overlooked the temporal dynamics of functional connectivity, which fluctuates over time. Here, we propose a method for learning the representation of dynamic functional connectivity with Graph Transformers. Specifically, we define the connectome embedding, which holds the position, structure, and time information of the functional connectivity graph, and use Transformers to learn its representation across time. We perform experiments with over 50,000 resting-state fMRI samples obtained from three datasets, which is the largest number of fMRI data used in studies by far. The experimental results show that our proposed method outperforms other competitive baselines in gender classification and age regression tasks based on the functional connectivity extracted from the fMRI data.

Submitted 4 December, 2023; originally announced December 2023.

Comments: NeurIPS 2023 Temporal Graph Learning Workshop

arXiv:2312.01994 [pdf, other]

A Generative Self-Supervised Framework using Functional Connectivity in fMRI Data

Authors: Jungwon Choi, Seongho Keum, EungGu Yun, Byung-Hoon Kim, Juho Lee

Abstract: Deep neural networks trained on Functional Connectivity (FC) networks extracted from functional Magnetic Resonance Imaging (fMRI) data have gained popularity due to the increasing availability of data and advances in model architectures, including Graph Neural Networks (GNNs). Recent research on the application of GNNs to FC suggests that exploiting the time-varying properties of the FC could significantly improve the accuracy and interpretability of the model prediction. However, the high cost of acquiring high-quality fMRI data and corresponding phenotypic labels poses a hurdle to their application in real-world settings, such that a model naïvely trained in a supervised fashion can suffer from insufficient performance or a lack of generalization when only a small amount of data is available. In addition, most Self-Supervised Learning (SSL) approaches for GNNs to date adopt a contrastive strategy, which tends to lose appropriate semantic information when the graph structure is perturbed or does not leverage both spatial and temporal information simultaneously. In light of these challenges, we propose a generative SSL approach that is tailored to effectively harness spatio-temporal information within dynamic FC. Our empirical results, obtained on large-scale (>50,000) fMRI datasets, demonstrate that our approach learns valuable representations and enables the construction of accurate and robust models when fine-tuned for downstream tasks.

Submitted 4 December, 2023; originally announced December 2023.

Comments: NeurIPS 2023 Temporal Graph Learning Workshop

arXiv:2311.09354 [pdf]

doi 10.1063/5.0189222

Nondestructive, quantitative viability analysis of 3D tissue cultures using machine learning image segmentation

Authors: Kylie J. Trettner, Jeremy Hsieh, Weikun Xiao, Jerry S. H. Lee, Andrea M. Armani

Abstract: Ascertaining the collective viability of cells in different cell culture conditions has typically relied on averaging colorimetric indicators and is often reported out in simple binary readouts. Recent research has combined viability assessment techniques with image-based deep-learning models to automate the characterization of cellular properties. However, further development of viability measurements to assess the continuity of possible cellular states and responses to perturbation across cell culture conditions is needed. In this work, we demonstrate an image processing algorithm for quantifying cellular viability in 3D cultures without the need for assay-based indicators. We show that our algorithm performs similarly to a pair of human experts in whole-well images over a range of days and culture matrix compositions. To demonstrate potential utility, we perform a longitudinal study investigating the impact of a known therapeutic on pancreatic cancer spheroids. Using images taken with a high content imaging system, the algorithm successfully tracks viability at the individual spheroid and whole-well level. The method we propose reduces analysis time by 97% in comparison to the experts. Because the method is independent of the microscope or imaging system used, this approach lays the foundation for accelerating progress in and for improving the robustness and reproducibility of 3D culture analysis across biological and clinical research.

Submitted 11 March, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

Comments: 52 total pages, Main text and SI included, 35 figures (5 main text, 30 supplemental), 9 tables, 6 datasets (provided on linked GitHub), linked image files on Zenodo

arXiv:2311.08433

Clinical Characteristics and Laboratory Biomarkers in ICU-admitted Septic Patients with and without Bacteremia

Authors: Sangwon Baek, Seung Jun Lee

Abstract: Few studies have investigated the diagnostic utilities of biomarkers for predicting bacteremia among septic patients admitted to intensive care units (ICU). Therefore, this study evaluated the prediction power of laboratory biomarkers to utilize those markers with high performance to optimize the predictive model for bacteremia. This retrospective cross-sectional study was conducted at the ICU department of Gyeongsang National University Changwon Hospital in 2019. Adult patients qualifying SEPSIS-3 (increase in sequential organ failure score greater than or equal to 2) criteria with at least two sets of blood culture were selected. Collected data was initially analyzed independently to identify the significant predictors, which was then used to build the multivariable logistic regression (MLR) model. A total of 218 patients with 48 cases of true bacteremia were analyzed in this research. Both CRP and PCT showed a substantial area under the curve (AUC) value for discriminating bacteremia among septic patients (0.757 and 0.845, respectively). To further enhance the predictive accuracy, we combined PCT, bilirubin, neutrophil lymphocyte ratio (NLR), platelets, lactic acid, erythrocyte sedimentation rate (ESR), and Glasgow Coma Scale (GCS) score to build the predictive model with an AUC of 0.907 (95% CI, 0.843 to 0.956). In addition, a high association between bacteremia and mortality rate was discovered through the survival analysis (0.004).
While PCT is certainly a useful index for distinguishing patients with and without bacteremia by itself, our MLR model indicates that the accuracy of bacteremia prediction substantially improves by the combined use of PCT, bilirubin, NLR, platelets, lactic acid, ESR, and GCS score.

Submitted 16 November, 2023; v1 submitted 14 November, 2023; originally announced November 2023.

Comments: This article is not the right fit to be published as preprint in arXiv

arXiv:2311.04545 [pdf, other]

Molecular mechanism of anion permeation through aquaporin 6

Authors: Eiji Yamamoto, Keehyoung Joo, Jooyoung Lee, Mark S. P. Sansom, Masato Yasui

Abstract: Aquaporins (AQPs) are recognized as transmembrane water channels that facilitate selective water permeation through their monomeric pores. Among the AQP family, AQP6 has a unique characteristic as an anion channel, which is allosterically controlled by pH conditions and is eliminated by a single amino acid mutation. However, the molecular mechanism of anion permeation through AQP6 remains unclear. Using molecular dynamics simulations in the presence of a transmembrane voltage utilizing an ion concentration gradient, we show that chloride ions permeate through the pore corresponding to the central axis of the AQP6 homotetramer. Under low pH conditions, a subtle opening of the hydrophobic selective filter (SF), located near the extracellular part of the central pore, becomes wetted and enables anion permeation. Our simulations also indicate that a single mutation (N63G) in human AQP6, located at the central pore, significantly reduces anion conduction, consistent with experimental data. Moreover, we demonstrate the pH-sensing mechanism in which the protonation of H184 and H189 under low pH conditions allosterically triggers the gating of the SF region. These results suggest a unique pH-dependent allosteric anion permeation mechanism in AQP6 and could clarify the role of the central pore in some of the AQP tetramers.

Submitted 8 November, 2023; originally announced November 2023.

arXiv:2311.04468 [pdf]

A human brain atlas of chi-separation for normative iron and myelin distributions

Authors: Kyeongseon Min, Beomseok Sohn, Woo Jung Kim, Chae Jung Park, Soohwa Song, Dong Hoon Shin, Kyung Won Chang, Na-Young Shin, Minjun Kim, Hyeong-Geol Shin, Phil Hyu Lee, Jongho Lee

Abstract: Iron and myelin are primary susceptibility sources in the human brain. These substances are essential for healthy brain, and their abnormalities are often related to various neurological disorders. Recently, an advanced susceptibility mapping technique, which is referred to as chi-separation, has been proposed, successfully disentangling paramagnetic iron from diamagnetic myelin. This method opened a potential for generating high resolution iron and myelin maps in the brain. Utilizing this technique, this study constructs a normative chi-separation atlas from 106 healthy human brains. The resulting atlas provides detailed anatomical structures associated with the distributions of iron and myelin, clearly delineating subcortical nuclei, thalamic nuclei, and white matter fiber bundles. Additionally, susceptibility values in a number of regions of interest are reported along with age-dependent changes. This atlas may have direct applications such as localization of subcortical structures for deep brain stimulation or high-intensity focused ultrasound and also serve as a valuable resource for future research.

Submitted 2 April, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

Comments: 19 pages, 9 figures

arXiv:2310.08738 [pdf, other]

Splicing Up Your Predictions with RNA Contrastive Learning

Authors: Philip Fradkin, Ruian Shi, Bo Wang, Brendan Frey, Leo J. Lee

Abstract: In the face of rapidly accumulating genomic data, our understanding of the RNA regulatory code remains incomplete. Recent self-supervised methods in other domains have demonstrated the ability to learn rules underlying the data-generating process such as sentence structure in language. Inspired by this, we extend contrastive learning techniques to genomic data by utilizing functional similarities between sequences generated through alternative splicing and gene duplication. Our novel dataset and contrastive objective enable the learning of generalized RNA isoform representations. We validate their utility on downstream tasks such as RNA half-life and mean ribosome load prediction. Our pre-training strategy yields competitive results using linear probing on both tasks, along with up to a two-fold increase in Pearson correlation in low-data conditions. Importantly, our exploration of the learned latent space reveals that our contrastive objective yields semantically meaningful representations, underscoring its potential as a valuable initialization technique for RNA property prediction.

Submitted 17 October, 2023; v1 submitted 12 October, 2023; originally announced October 2023.

arXiv:2310.03964 [pdf, other]

A Learnable Counter-condition Analysis Framework for Functional Connectivity-based Neurological Disorder Diagnosis

Authors: Eunsong Kang, Da-woon Heo, Jiwon Lee, Heung-Il Suk

Abstract: To understand the biological characteristics of neurological disorders with functional connectivity (FC), recent studies have widely utilized deep learning-based models to identify the disease and conducted post-hoc analyses via explainable models to discover disease-related biomarkers. Most existing frameworks consist of three stages, namely, feature selection, feature extraction for classification, and analysis, where each stage is implemented separately. However, if the results at each stage lack reliability, it can cause misdiagnosis and incorrect analysis in afterward stages. In this study, we propose a novel unified framework that systemically integrates diagnoses (i.e., feature selection and feature extraction) and explanations. Notably, we devised an adaptive attention network as a feature selection approach to identify individual-specific disease-related connections. We also propose a functional network relational encoder that summarizes the global topological properties of FC by learning the inter-network relations without pre-defined edges between functional networks. Last but not least, our framework provides a novel explanatory power for neuroscientific interpretation, also termed counter-condition analysis. We simulated the FC that reverses the diagnostic information (i.e., counter-condition FC): converting a normal brain to be abnormal and vice versa.
We validated the effectiveness of our framework by using two large resting-state functional magnetic resonance imaging (fMRI) datasets, Autism Brain Imaging Data Exchange (ABIDE) and REST-meta-MDD, and demonstrated that our framework outperforms other competing methods for disease identification. Furthermore, we analyzed the disease-related neurological patterns based on counter-condition analysis.

Submitted 5 October, 2023; originally announced October 2023.

arXiv:2309.03227 [pdf]

Learning a Patent-Informed Biomedical Knowledge Graph Reveals Technological Potential of Drug Repositioning Candidates

Authors: Yongseung Jegal, Jaewoong Choi, Jiho Lee, Ki-Su Park, Seyoung Lee, Janghyeok Yoon

Abstract: Drug repositioning, a promising strategy for discovering new therapeutic uses for existing drugs, has been increasingly explored in the computational science literature using biomedical databases. However, the technological potential of drug repositioning candidates has often been overlooked. This study presents a novel protocol to comprehensively analyse various sources such as pharmaceutical patents and biomedical databases, and identify drug repositioning candidates with both technological potential and scientific evidence. To this end, first, we constructed a scientific biomedical knowledge graph (s-BKG) comprising relationships between drugs, diseases, and genes derived from biomedical databases. Our protocol involves identifying drugs that exhibit limited association with the target disease but are closely located in the s-BKG, as potential drug candidates. We constructed a patent-informed biomedical knowledge graph (p-BKG) by adding pharmaceutical patent information. Finally, we developed a graph embedding protocol to ascertain the structure of the p-BKG, thereby calculating the relevance scores of those candidates with target disease-related patents to evaluate their technological potential. Our case study on Alzheimer's disease demonstrates its efficacy and feasibility, while the quantitative outcomes and systematic methods are expected to bridge the gap between computational discoveries and successful market applications in drug repositioning research.

Submitted 3 September, 2023; originally announced September 2023.

arXiv:2309.01670 [pdf, other]

Blind Biological Sequence Denoising with Self-Supervised Set Learning

Authors: Nathan Ng, Ji Won Park, Jae Hyeon Lee, Ryan Lewis Kelly, Stephen Ra, Kyunghyun Cho

Abstract: Biological sequence analysis relies on the ability to denoise the imprecise output of sequencing platforms. We consider a common setting where a short sequence is read out repeatedly using a high-throughput long-read platform to generate multiple subreads, or noisy observations of the same sequence. Denoising these subreads with alignment-based approaches often fails when too few subreads are available or error rates are too high. In this paper, we propose a novel method for blindly denoising sets of sequences without directly observing clean source sequence labels. Our method, Self-Supervised Set Learning (SSSL), gathers subreads together in an embedding space and estimates a single set embedding as the midpoint of the subreads in both the latent and sequence spaces. This set embedding represents the "average" of the subreads and can be decoded into a prediction of the clean sequence. In experiments on simulated long-read DNA data, SSSL methods denoise small reads of $\leq 6$ subreads with 17% fewer errors and large reads of $>6$ subreads with 8% fewer errors compared to the best baseline. On a real dataset of antibody sequences, SSSL improves over baselines on two self-supervised metrics, with a significant improvement on difficult small reads that comprise over 60% of the test set. By accurately denoising these reads, SSSL promises to better realize the potential of high-throughput DNA sequencing data for downstream scientific applications.

Submitted 4 September, 2023; originally announced September 2023.

arXiv:2308.12224 [pdf]

Enhancing cardiovascular risk prediction through AI-enabled calcium-omics

Authors: Ammar Hoori, Sadeer Al-Kindi, Tao Hu, Yingnan Song, Hao Wu, Juhwan Lee, Nour Tashtish, Pingfu Fu, Robert Gilkeson, Sanjay Rajagopalan, David L. Wilson

Abstract: Background. Coronary artery calcium (CAC) is a powerful predictor of major adverse cardiovascular events (MACE). Traditional Agatston score simply sums the calcium, albeit in a non-linear way, leaving room for improved calcification assessments that will more fully capture the extent of disease. Objective. To determine if AI methods using detailed calcification features (i.e., calcium-omics) can improve MACE prediction. Methods. We investigated additional features of calcification including assessment of mass, volume, density, spatial distribution, territory, etc. We used a Cox model with elastic-net regularization on 2457 CT calcium score (CTCS) enriched for MACE events obtained from a large no-cost CLARIFY program (ClinicalTrials.gov Identifier: NCT04075162). We employed sampling techniques to enhance model training. We also investigated Cox models with selected features to identify explainable high-risk characteristics. Results. Our proposed calcium-omics model with modified synthetic down sampling and up sampling gave C-index (80.5%/71.6%) and two-year AUC (82.4%/74.8%) for (80:20, training/testing), respectively (sampling was applied to the training set only). Results compared favorably to Agatston which gave C-index (71.3%/70.3%) and AUC (71.8%/68.8%), respectively. Among calcium-omics features, numbers of calcifications, LAD mass, and diffusivity (a measure of spatial distribution) were important determinants of increased risk, with dense calcification (>1000HU) associated with lower risk.
The calcium-omics model reclassified 63% of MACE patients to the high risk group in a held-out test. The categorical net-reclassification index was NRI=0.153. Conclusions. AI analysis of coronary calcification can lead to improved results as compared to Agatston scoring. Our findings suggest the utility of calcium-omics in improved prediction of risk.

Submitted 23 August, 2023; originally announced August 2023.

Comments: 12 pages, 8 figures, 2 tables, 4 pages supplemental, journal paper format (under review)

arXiv:2308.08680 [pdf, other]

Permutationally Invariant Networks for Enhanced Sampling (PINES): Discovery of Multi-Molecular and Solvent-Inclusive Collective Variables

Authors: Nicholas S. M. Herringer, Siva Dasetty, Diya Gandhi, Junhee Lee, Andrew L. Ferguson

Abstract: The typically rugged nature of molecular free energy landscapes can frustrate efficient sampling of the thermodynamically relevant phase space due to the presence of high free energy barriers. Enhanced sampling techniques can improve phase space exploration by accelerating sampling along particular collective variables (CVs). A number of techniques exist for data-driven discovery of CVs parameterizing the important large scale motions of the system. A challenge to CV discovery is learning CVs invariant to symmetries of the molecular system, frequently rigid translation, rigid rotation, and permutational relabeling of identical particles. Of these, permutational invariance has proved a persistent challenge in frustrating the data-driven discovery of multi-molecular CVs in systems of self-assembling particles and solvent-inclusive CVs for solvated systems. In this work, we integrate Permutation Invariant Vector (PIV) featurizations with autoencoding neural networks to learn nonlinear CVs invariant to translation, rotation, and permutation, and perform interleaved rounds of CV discovery and enhanced sampling to iteratively expand sampling of configurational phase space and obtain converged CVs and free energy landscapes. We demonstrate the Permutationally Invariant Network for Enhanced Sampling (PINES) approach in applications to the self-assembly of a 13-atom Argon cluster, association/dissociation of a NaCl ion pair in water, and hydrophobic collapse of a C45H92 n-pentatetracontane polymer chain.
We make the approach freely available as a new module within the PLUMED2 enhanced sampling libraries.

Submitted 16 August, 2023; originally announced August 2023.

arXiv:2308.06887 [pdf, other]

Robustified ANNs Reveal Wormholes Between Human Category Percepts

Authors: Guy Gaziv, Michael J. Lee, James J. DiCarlo

Abstract: The visual object category reports of artificial neural networks (ANNs) are notoriously sensitive to tiny, adversarial image perturbations. Because human category reports (aka human percepts) are thought to be insensitive to those same small-norm perturbations -- and locally stable in general -- this argues that ANNs are incomplete scientific models of human visual perception. Consistent with this, we show that when small-norm image perturbations are generated by standard ANN models, human object category percepts are indeed highly stable. However, in this very same "human-presumed-stable" regime, we find that robustified ANNs reliably discover low-norm image perturbations that strongly disrupt human percepts. These previously undetectable human perceptual disruptions are massive in amplitude, approaching the same level of sensitivity seen in robustified ANNs. Further, we show that robustified ANNs support precise perceptual state interventions: they guide the construction of low-norm image perturbations that strongly alter human category percepts toward specific prescribed percepts. These observations suggest that for arbitrary starting points in image space, there exists a set of nearby "wormholes", each leading the subject from their current category perceptual state into a semantically very different state. Moreover, contemporary ANN models of biological visual processing are now accurate enough to consistently guide us to those portals.

Submitted 4 October, 2023; v1 submitted 13 August, 2023; originally announced August 2023.

Comments: In NeurIPS 2023. Code: https://github.com/ggaziv/Wormholes Project Webpage: https://himjl.github.io/pwormholes

Journal ref: https://neurips.cc/virtual/2023/poster/72812

arXiv:2307.04603 [pdf, other]

Solvent: A Framework for Protein Folding

Authors: Jaemyung Lee, Kyeongtak Han, Jaehoon Kim, Hasun Yu, Youhan Lee

Abstract: Consistency and reliability are crucial for conducting AI research. Many famous research fields, such as object detection, have been compared and validated with solid benchmark frameworks. After AlphaFold2, the protein folding task has entered a new phase, and many methods are proposed based on the component of AlphaFold2. The importance of a unified research framework in protein folding contains implementations and benchmarks to consistently and fairly compare various approaches. To achieve this, we present Solvent, a protein folding framework that supports significant components of state-of-the-art models in the manner of an off-the-shelf interface. Solvent contains different models implemented in a unified codebase and supports training and evaluation for defined models on the same dataset. We benchmark well-known algorithms and their components and provide experiments that give helpful insights into the protein structure modeling field. We hope that Solvent will increase the reliability and consistency of proposed models and give efficiency in both speed and costs, resulting in acceleration on protein folding modeling research. The code is available at https://github.com/kakaobrain/solvent, and the project will continue to be developed.

Submitted 31 July, 2023; v1 submitted 7 July, 2023; originally announced July 2023.

Comments: preprint, 9 pages

arXiv:2306.11681 [pdf, other]

MoleCLUEs: Molecular Conformers Maximally In-Distribution for Predictive Models

Authors: Michael Maser, Natasa Tagasovska, Jae Hyeon Lee, Andrew Watkins

Abstract: Structure-based molecular ML (SBML) models can be highly sensitive to input geometries and give predictions with large variance. We present an approach to mitigate the challenge of selecting conformations for such models by generating conformers that explicitly minimize predictive uncertainty. To achieve this, we compute estimates of aleatoric and epistemic uncertainties that are differentiable w.r.t. latent posteriors. We then iteratively sample new latents in the direction of lower uncertainty by gradient descent. As we train our predictive models jointly with a conformer decoder, the new latent embeddings can be mapped to their corresponding inputs, which we call \textit{MoleCLUEs}, or (molecular) counterfactual latent uncertainty explanations \citep{antoran2020getting}. We assess our algorithm for the task of predicting drug properties from 3D structure with maximum confidence. We additionally analyze the structure trajectories obtained from conformer optimizations, which provide insight into the sources of uncertainty in SBML.

Submitted 6 November, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

Comments: NeurIPS 2023 AI for Science Workshop

arXiv:2305.03799 [pdf]

Detecting disruption of HER2 membrane protein organization in cell membranes with nanoscale precision

Authors: Yasaman Moradi, Jerry SH Lee, Andrea M. Armani

Abstract: The spatio-temporal organization of proteins within the cell membrane can affect numerous biological functions, including cell signaling, communication, and transportation. Deviations from normal spatial arrangements have been observed in various diseases, and better understanding this process is a key stepping-stone to advancing development of clinical interventions. However, given the nanometer length scales involved, detecting these subtle changes has primarily relied on complex super resolution and single molecule imaging methods. In this work, we demonstrate an alternative fluorescent imaging strategy for detecting protein organization based on a material that exhibits a unique photophysical behavior known as aggregation induced emission (AIE). Organic AIE molecules have an increase in emission signal when they are in close proximity and the molecular motion is restricted. This property simultaneously addresses the high background noise and low detection signal that limit conventional widefield fluorescent imaging. To demonstrate the potential of this approach, the fluorescent molecule sensor is conjugated to a human epidermal growth factor receptor 2 (HER2) specific antibody and used to investigate the spatio-temporal behavior of HER2 clustering in the membrane of HER2-overexpressing breast cancer cells. Notably, the disruption of HER2 clusters in response to an FDA-approved monoclonal antibody therapeutic (Trastuzumab) is successfully detected using a simple widefield fluorescent microscope.
While the sensor demonstrated here is optimized for sensing HER2 clustering, it is an easily adaptable platform. Moreover, given the compatibility with widefield imaging, the system has the potential to be used with high-throughput imaging techniques, accelerating investigations into membrane protein spatio-temporal organization.

Submitted 23 October, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

arXiv:2305.01520 [pdf, other]

Conditional Graph Information Bottleneck for Molecular Relational Learning

Authors: Namkyeong Lee, Dongmin Hyun, Gyoung S. Na, Sungwon Kim, Junseok Lee, Chanyoung Park

Abstract: Molecular relational learning, whose goal is to learn the interaction behavior between molecular pairs, got a surge of interest in molecular sciences due to its wide range of applications. Recently, graph neural networks have shown great success in molecular relational learning by modeling a molecule as a graph structure, and considering atom-level interactions between two molecules. Despite their success, existing molecular relational learning methods tend to overlook the nature of chemistry, i.e., a chemical compound is composed of multiple substructures such as functional groups that cause distinctive chemical reactions. In this work, we propose a novel relational learning framework, called CGIB, that predicts the interaction behavior between a pair of graphs by detecting core subgraphs therein. The main idea is, given a pair of graphs, to find a subgraph from a graph that contains the minimal sufficient information regarding the task at hand conditioned on the paired graph based on the principle of conditional graph information bottleneck. We argue that our proposed method mimics the nature of chemical reactions, i.e., the core substructure of a molecule varies depending on which other molecule it interacts with. Extensive experiments on various tasks with real-world datasets demonstrate the superiority of CGIB over state-of-the-art baselines. Our code is available at https://github.com/Namkyeong/CGIB.

Submitted 9 July, 2023; v1 submitted 28 April, 2023; originally announced May 2023.

Comments: ICML 2023

arXiv:2305.00006 [pdf]

Data navigation on the ENCODE portal

Authors: Meenakshi S. Kagda, Bonita Lam, Casey Litton, Corinn Small, Cricket A. Sloan, Emma Spragins, Forrest Tanaka, Ian Whaling, Idan Gabdank, Ingrid Youngworth, J. Seth Strattan, Jason Hilton, Jennifer Jou, Jessica Au, Jin-Wook Lee, Kalina Andreeva, Keenan Graham, Khine Lin, Matt Simison, Otto Jolanki, Paul Sud, Pedro Assis, Philip Adenekan, Eric Douglas, Mingjie Li, et al. (9 additional authors not shown)

Abstract: Spanning two decades, the Encyclopaedia of DNA Elements (ENCODE) is a collaborative research project that aims to identify all the functional elements in the human and mouse genomes. To best serve the scientific community, all data generated by the consortium is shared through a web-portal (https://www.encodeproject.org/) with no access restrictions. The fourth and final phase of the project added a diverse set of new samples (including those associated with human disease), and a wide range of new assays aimed at detection, characterization and validation of functional genomic elements. The ENCODE data portal hosts results from over 23,000 functional genomics experiments, over 800 functional elements characterization experiments (including in vivo transgenic enhancer assays, reporter assays and CRISPR screens) along with over 60,000 results of computational and integrative analyses (including imputations, predictions and genome annotations). The ENCODE Data Coordination Center (DCC) is responsible for development and maintenance of the data portal, along with the implementation and utilisation of the ENCODE uniform processing pipelines to generate uniformly processed data. Here we report recent updates to the data portal. Specifically, we have completely redesigned the home page, improved search interface, added several new pages to highlight collections of biologically related data (deeply profiled cell lines, immune cells, Alzheimer's Disease, RNA-Protein interactions, degron matrix and a matrix of experiments organised by human donors), added single-cell experiments, and enhanced the cart interface for visualisation and download of user-selected datasets.

Submitted 4 May, 2023; v1 submitted 27 April, 2023; originally announced May 2023.

arXiv:2303.16725 [pdf]

Machine Learning for Uncovering Biological Insights in Spatial Transcriptomics Data

Authors: Alex J. Lee, Robert Cahill, Reza Abbasi-Asl

Abstract: Development and homeostasis in multicellular systems both require exquisite control over spatial molecular pattern formation and maintenance. Advances in spatially-resolved and high-throughput molecular imaging methods such as multiplexed immunofluorescence and spatial transcriptomics (ST) provide exciting new opportunities to augment our fundamental understanding of these processes in health and disease. The large and complex datasets resulting from these techniques, particularly ST, have led to rapid development of innovative machine learning (ML) tools primarily based on deep learning techniques. These ML tools are now increasingly featured in integrated experimental and computational workflows to disentangle signals from noise in complex biological systems. However, it can be difficult to understand and balance the different implicit assumptions and methodologies of a rapidly expanding toolbox of analytical tools in ST. To address this, we summarize major ST analysis goals that ML can help address and current analysis trends. We also describe four major data science concepts and related heuristics that can help guide practitioners in their choices of the right tools for the right biological questions.

Submitted 29 March, 2023; originally announced March 2023.

arXiv:2302.13340 [pdf]

doi 10.47912/jscdm.218

Standardizing Paediatric Clinical Data: The Development of the conect4children (c4c) Cross Cutting Paediatric Data Dictionary

Authors: Anando Sen, Victoria Hedley, John Owen, Ronald Cornet, Dipak Kalra, Corinna Engel, Avril Palmeri, Joanne Lee, Jean-Christophe Roze, Joseph F Standing, Adilia Warris, Claudia Pansieri, Rebecca Leary, Mark Turner, Volker Straub

Abstract: Standardization of data items collected in paediatric clinical trials is an important but challenging issue. The Clinical Data Interchange Standards Consortium (CDISC) data standards are well understood by the pharmaceutical industry but lack the implementation of some paediatric specific concepts. When a paediatric concept is absent within CDISC standards, companies and research institutions take multiple approaches in the collection of paediatric data, leading to different implementations of standards and potentially limited utility for reuse. To overcome these challenges, the conect4children consortium has developed a cross-cutting paediatric data dictionary (CCPDD). The dictionary was built over three phases - scoping (including a survey sent out to ten industrial and 34 academic partners to gauge interest), creation of a longlist and consensus building for the final set of terms. The dictionary was finalized during a workshop with attendees from academia, hospitals, industry and CDISC. The attendees held detailed discussions on each data item and participated in the final vote on the inclusion of the item in the CCPDD. Nine industrial and 34 academic partners responded to the survey, which showed overall interest in the development of the CCPDD. Following the final vote on 27 data items, three were rejected, six were deferred to the next version and a final opinion was sought from CDISC. The first version of the CCPDD with 25 data items was released in August 2019. The continued use of the dictionary has the potential to ensure the collection of standardized data that is interoperable and can later be pooled and reused for other applications. The dictionary is already being used for case report form creation in three clinical trials. The CCPDD will also serve as one of the inputs to the Paediatric User Guide, which is being developed by CDISC.

Submitted 26 February, 2023; originally announced February 2023.

Journal ref: Journal of the Society of Clinical Data Management, Volume 2, Issue 3, 2023

arXiv:2301.05991 [pdf]

Conceptual Framework and Documentation Standards of Cystoscopic Media Content for Artificial Intelligence

Authors: Okyaz Eminaga, Timothy Jiyong Lee, Jessie Ge, Eugene Shkolyar, Mark Laurie, Jin Long, Lukas Graham Hockman, Joseph C. Liao

Abstract: Background: The clinical documentation of cystoscopy includes visual and textual materials. However, the secondary use of visual cystoscopic data for educational and research purposes remains limited due to inefficient data management in routine clinical practice. Methods: A conceptual framework was designed to document cystoscopy in a standardized manner with three major sections: data management, annotation management, and utilization management. A Swiss-cheese model was proposed for quality control and root cause analyses. We defined the infrastructure required to implement the framework with respect to FAIR (findable, accessible, interoperable, re-usable) principles. We applied two scenarios exemplifying data sharing for research and educational projects to ensure the compliance with FAIR principles. Results: The framework was successfully implemented while following FAIR principles. The cystoscopy atlas produced from the framework could be presented in an educational web portal; a total of 68 full-length qualitative videos and corresponding annotation data were sharable for artificial intelligence projects covering frame classification and segmentation problems at case, lesion and frame levels. Conclusion: Our study shows that the proposed framework facilitates the storage of the visual documentation in a standardized manner and enables FAIR data for education and artificial intelligence research.

Submitted 18 January, 2023; v1 submitted 14 January, 2023; originally announced January 2023.

Comments: Under Review

arXiv:2210.07247 [pdf]

doi 10.1128/msystems.00928-22

The Coming of Age of Nucleic Acid Vaccines during COVID-19

Authors: Halie M. Rando, Ronan Lordan, Likhitha Kolla, Elizabeth Sell, Alexandra J. Lee, Nils Wellhausen, Amruta Naik, Jeremy P. Kamil, COVID-19 Review Consortium, Anthony Gitter, Casey S. Greene

Abstract: In the 21st century, several emergent viruses have posed a global threat. Each pathogen has emphasized the value of rapid and scalable vaccine development programs. The ongoing SARS-CoV-2 pandemic has made the importance of such efforts especially clear. New biotechnological advances in vaccinology allow for recent advances that provide only the nucleic acid building blocks of an antigen, eliminating many safety concerns. During the COVID-19 pandemic, these DNA and RNA vaccines have facilitated the development and deployment of vaccines at an unprecedented pace. This success was attributable at least in part to broader shifts in scientific research relative to prior epidemics; the genome of SARS-CoV-2 was available as early as January 2020, facilitating global efforts in the development of DNA and RNA vaccines within two weeks of the international community becoming aware of the new viral threat. Additionally, these technologies that were previously only theoretical are not only safe but also highly efficacious. Although historically a slow process, the rapid development of vaccines during the COVID-19 crisis reveals a major shift in vaccine technologies. Here, we provide historical context for the emergence of these paradigm-shifting vaccines. We describe several DNA and RNA vaccines in terms of their efficacy, safety, and approval status. We also discuss patterns in worldwide distribution. The advances made since early 2020 provide an exceptional illustration of how rapidly vaccine development technology has advanced in the last two decades in particular and suggest a new era in vaccines against emerging pathogens.

Submitted 24 January, 2023; v1 submitted 14 October, 2022; originally announced October 2022.

arXiv:2208.08907 [pdf]

doi 10.1128/msystems.00927-22

Application of Traditional Vaccine Development Strategies to SARS-CoV-2

Authors: Halie M. Rando, Ronan Lordan, Alexandra J. Lee, Amruta Naik, Nils Wellhausen, Elizabeth Sell, Likhitha Kolla, COVID-19 Review Consortium, Anthony Gitter, Casey S. Greene

Abstract: Over the past 150 years, vaccines have revolutionized the relationship between people and disease. During the COVID-19 pandemic, technologies such as mRNA vaccines have received attention due to their novelty and successes. However, more traditional vaccine development platforms have also yielded important tools in the worldwide fight against the SARS-CoV-2 virus. A variety of approaches have been used to develop COVID-19 vaccines that are now authorized for use in countries around the world. In this review, we highlight strategies that focus on the viral capsid and outwards, rather than on the nucleic acids inside. These approaches fall into two broad categories: whole-virus vaccines and subunit vaccines. Whole-virus vaccines use the virus itself, either in an inactivated or attenuated state. Subunit vaccines contain instead an isolated, immunogenic component of the virus. Here, we highlight vaccine candidates that apply these approaches against SARS-CoV-2 in different ways. In a companion manuscript, we review the more recent and novel development of nucleic-acid based vaccine technologies. We further consider the role that these COVID-19 vaccine development programs have played in prophylaxis at the global scale. Well-established vaccine technologies have proved especially important to making vaccines accessible in low- and middle-income countries. Vaccine development programs that use established platforms have been undertaken in a much wider range of countries than those using nucleic-acid-based technologies, which have been led by wealthy Western countries. Therefore, these vaccine platforms, though less novel from a biotechnological standpoint, have proven to be extremely important to the management of SARS-CoV-2.

Submitted 23 January, 2023; v1 submitted 16 August, 2022; originally announced August 2022.

arXiv:2207.09794 [pdf, other]

Effectiveness of vaccination and quarantine policies to curb the spread of COVID-19

Authors: Gyeong Hwan Jang, Sung Jin Kim, Mi Jin Lee, Seung-Woo Son

Abstract: A pandemic, the worldwide spread of a disease, can threaten human beings from the social as well as biological perspectives and paralyze existing living habits. To stave off a more devastating disaster and return to a normal life, people make tremendous efforts at multiscale levels from individual to worldwide: paying attention to hand hygiene, developing social policies such as wearing masks, social distancing, and quarantine, and inventing vaccines and remedies. Regarding the current severe pandemic, namely the coronavirus disease 2019, we explore the spreading-suppression effect when adopting the aforementioned efforts. Quarantine and vaccination are considered in particular, since they are representative primary measures for blocking spread and for prevention at the government level. We establish a compartment model consisting of susceptible (S), vaccination (V), exposed (E), infected (I), quarantined (Q), and recovered (R) compartments, called the SVEIQR model. We look into the infected cases in Seoul and consider three kinds of vaccines, Pfizer, Moderna, and AstraZeneca. The values of the relevant parameters are obtained from empirical data from Seoul and clinical data for vaccines and estimated by Bayesian inference. After confirming that our SVEIQR model is plausible, we test the various scenarios by adjusting the associated parameters with the quarantine and vaccination policies around the current values. The quantitative result obtained from our model could suggest a guideline for policy making on effective vaccination and social policies.

Submitted 20 July, 2022; originally announced July 2022.

Comments: 8 pages, 5 figures

arXiv:2206.11228 [pdf, other]

Adversarially trained neural representations may already be as robust as corresponding biological neural representations

Authors: Chong Guo, Michael J. Lee, Guillaume Leclerc, Joel Dapello, Yug Rao, Aleksander Madry, James J. DiCarlo

Abstract: Visual systems of primates are the gold standard of robust perception. There is thus a general belief that mimicking the neural representations that underlie those systems will yield artificial visual systems that are adversarially robust. In this work, we develop a method for performing adversarial visual attacks directly on primate brain activity. We then leverage this method to demonstrate that the above-mentioned belief might not be well founded. Specifically, we report that the biological neurons that make up visual systems of primates exhibit susceptibility to adversarial perturbations that is comparable in magnitude to existing (robustly trained) artificial neural networks.

Submitted 19 June, 2022; originally announced June 2022.

Comments: 10 pages, 6 figures, ICML2022

arXiv:2205.04259 [pdf, other]

Multi-segment preserving sampling for deep manifold sampler

Authors: Daniel Berenberg, Jae Hyeon Lee, Simon Kelow, Ji Won Park, Andrew Watkins, Vladimir Gligorijević, Richard Bonneau, Stephen Ra, Kyunghyun Cho

Abstract: Deep generative modeling for biological sequences presents a unique challenge in reconciling the bias-variance trade-off between explicit biological insight and model flexibility. The deep manifold sampler was recently proposed as a means to iteratively sample variable-length protein sequences by exploiting the gradients from a function predictor. We introduce an alternative approach to this guided sampling procedure, multi-segment preserving sampling, that enables the direct inclusion of domain-specific knowledge by designating preserved and non-preserved segments along the input sequence, thereby restricting variation to only select regions. We present its effectiveness in the context of antibody design by training two models: a deep manifold sampler and a GPT-2 language model on nearly six million heavy chain sequences annotated with the IGHV1-18 gene. During sampling, we restrict variation to only the complementarity-determining region 3 (CDR3) of the input. We obtain log probability scores from a GPT-2 model for each sampled CDR3 and demonstrate that multi-segment preserving sampling generates reasonable designs while maintaining the desired, preserved regions.

Submitted 9 May, 2022; originally announced May 2022.

arXiv:2204.00673 [pdf, other]

doi 10.1038/s41586-023-06031-6

Learnable latent embeddings for joint behavioral and neural analysis

Authors: Steffen Schneider, Jin Hwa Lee, Mackenzie Weygandt Mathis

Abstract: Mapping behavioral actions to neural activity is a fundamental goal of neuroscience. As our ability to record large neural and behavioral data increases, there is growing interest in modeling neural dynamics during adaptive behaviors to probe neural representations. In particular, neural latent embeddings can reveal underlying correlates of behavior, yet, we lack non-linear techniques that can explicitly and flexibly leverage joint behavior and neural data. Here, we fill this gap with a novel method, CEBRA, that jointly uses behavioral and neural data in a hypothesis- or discovery-driven manner to produce consistent, high-performance latent spaces. We validate its accuracy and demonstrate our tool's utility for both calcium and electrophysiology datasets, across sensory and motor tasks, and in simple or complex behaviors across species. It allows for single and multi-session datasets to be leveraged for hypothesis testing or can be used label-free. Lastly, we show that CEBRA can be used for the mapping of space, uncovering complex kinematic features, and rapid, high-accuracy decoding of natural movies from visual cortex.

Submitted 5 October, 2022; v1 submitted 1 April, 2022; originally announced April 2022.

Comments: Website: cebra.ai

arXiv:2203.13982 [pdf]

Implications of Mortality Displacement for Effect Modification and Selection Bias

Authors: Honghyok Kim, Jong-Tae Lee, Roger D. Peng, Kelvin C. Fong, Michelle L. Bell

Abstract: Mortality displacement is the concept that deaths are moved forward in time (e.g., a few days, several months, and years) by exposure from when they would occur without the exposure, which is common in environmental time-series studies. Using concepts of a frail population and loss of life expectancy, it is understood that mortality displacement may decrease rate ratio (RR). Such decreases are thought to be minimal or substantial depending on study populations. Environmental epidemiologists have interpreted RR considering mortality displacement. This theoretical paper reveals that mortality displacement can be formulated as a built-in selection bias of RR in Cox models due to unmeasured risk factors independent from exposure of interest, and mortality displacement can also be viewed as an effect modifier by integrating the concepts of rate and loss of life expectancy. Thus, depending on the framework through which we view bias, mortality displacement can be categorized as selection bias in the bias taxonomy of epidemiology, and simultaneously mortality displacement can be seen as an effect modifier. This dichotomy provides useful implications regarding policy, effect modification, exposure time-windows selection, and generalizability, specifically why research in epidemiology may produce unexpected and heterogeneous RR over different studies and sub-populations.

Submitted 25 March, 2022; originally announced March 2022.

Comments: This is an epidemiological theory paper

arXiv:2203.13946 [pdf]

Using genome-wide expression compendia to study microorganisms

Authors: Alexandra J. Lee, Taylor Reiter, Georgia Doing, Julia Oh, Deborah A. Hogan, Casey S. Greene

Abstract: A gene expression compendium is a heterogeneous collection of gene expression experiments assembled from data collected for diverse purposes. The widely varied experimental conditions and genetic backgrounds across samples create a tremendous opportunity for gaining a systems level understanding of the transcriptional responses that influence phenotypes. Variety in experimental design is particularly important for studying microbes, where the transcriptional responses integrate many signals and demonstrate plasticity across strains including response to what nutrients are available and what microbes are present. Advances in high-throughput measurement technology have made it feasible to construct compendia for many microbes. In this review we discuss how these compendia are constructed and analyzed to reveal transcriptional patterns.

Submitted 25 March, 2022; originally announced March 2022.

arXiv:2202.04324 [pdf]

doi 10.1038/s41593-023-01444-y

Studying the neural representations of uncertainty

Authors: Edgar Y Walker, Stephan Pohl, Rachel N Denison, David L Barack, Jennifer Lee, Ned Block, Wei Ji Ma, Florent Meyniel

Abstract: The study of the brain's representations of uncertainty is a central topic in neuroscience. Unlike most quantities of which the neural representation is studied, uncertainty is a property of an observer's beliefs about the world, which poses specific methodological challenges. We analyze how the literature on the neural representations of uncertainty addresses those challenges and distinguish between "code-driven" and "correlational" approaches. Code-driven approaches make assumptions about the neural code for representing world states and the associated uncertainty. By contrast, correlational approaches search for relationships between uncertainty and neural activity without constraints on the neural representation of the world state that this uncertainty accompanies. To compare these two approaches, we apply several criteria for neural representations: sensitivity, specificity, invariance, functionality. Our analysis reveals that the two approaches lead to different, but complementary findings, shaping new research questions and guiding future experiments.

Submitted 11 October, 2023; v1 submitted 9 February, 2022; originally announced February 2022.

Comments: 23 pages, 3 figures. Nature Neuroscience (2023)

arXiv:2201.08443 [pdf]

Diversifying the Genomic Data Science Research Community

Authors: The Genomic Data Science Community Network, Rosa Alcazar, Maria Alvarez, Rachel Arnold, Mentewab Ayalew, Lyle G. Best, Michael C. Campbell, Kamal Chowdhury, Katherine E. L. Cox, Christina Daulton, Youping Deng, Carla Easter, Karla Fuller, Shazia Tabassum Hakim, Ava M. Hoffman, Natalie Kucher, Andrew Lee, Joslynn Lee, Jeffrey T. Leek, Robert Meller, Loyda B. Méndez, Miguel P. Méndez-González, Stephen Mosher, Michele Nishiguchi, Siddharth Pratap, et al. (13 additional authors not shown)

Abstract: Over the last 20 years, there has been an explosion of genomic data collected for disease association, functional analyses, and other large-scale discoveries. At the same time, there have been revolutions in cloud computing that enable computational and data science research, while making data accessible to anyone with a web browser and an internet connection. However, students at institutions with limited resources have received relatively little exposure to curricula or professional development opportunities that lead to careers in genomic data science. To broaden participation in genomics research, the scientific community needs to support students, faculty, and administrators at Underserved Institutions (UIs) including Community Colleges, Historically Black Colleges and Universities, Hispanic-Serving Institutions, and Tribal Colleges and Universities in taking advantage of these tools in local educational and research programs. We have formed the Genomic Data Science Community Network (http://www.gdscn.org/) to identify opportunities and support broadening access to cloud-enabled genomic data science. Here, we provide a summary of the priorities for faculty members at UIs, as well as administrators, funders, and R1 researchers to consider as we create a more diverse genomic data science community.

Submitted 9 June, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

Comments: 42 pages, 3 figures

arXiv:2111.04943 [pdf, other]

doi 10.1103/PhysRevLett.132.018401

Heterogeneous popularity of metabolic reactions from evolution

Authors: Mi Jin Lee, Deok-Sun Lee

Abstract: The composition of cellular metabolism is different across species. Empirical data reveal that bacterial species contain similar numbers of metabolic reactions but that the cross-species popularity of reactions is so heterogeneous that some reactions are found in all the species while others are in just a few species, characterized by a power-law distribution with the exponent one. Introducing an evolutionary model concretizing the stochastic recruitment of chemical reactions into the metabolism of different species at different times and their inheritance to descendants, we demonstrate that the exponential growth of the number of species containing a reaction and the saturated recruitment rate of brand-new reactions lead to the empirically identified power-law popularity distribution. Furthermore, the structural characteristics of metabolic networks and the species' phylogeny in our simulations agree well with empirical observations.

Submitted 5 January, 2024; v1 submitted 8 November, 2021; originally announced November 2021.

Comments: Main: 5 pages, 4 figures, Supplemental Material: 4 pages, 6 figures

Journal ref: Physical Review Letters 132, 018401 (2024)

arXiv:2106.16154 [pdf]

Ultra-Sharp Nanowire Arrays Natively Permeate, Record, and Stimulate Intracellular Activity in Neuronal and Cardiac Networks

Authors: Ren Liu, Jihwan Lee, Youngbin Tchoe, Deborah Pre, Andrew M. Bourhis, Agnieszka D'Antonio-Chronowska, Gaelle Robin, Sang Heon Lee, Yun Goo Ro, Ritwik Vatsyayan, Karen J. Tonsfeldt, Lorraine A. Hossain, M. Lisa Phipps, Jinkyoung Yoo, John Nogan, Jennifer S. Martinez, Kelly A. Frazer, Anne G. Bang, Shadi A. Dayeh

Abstract: Intracellular access with high spatiotemporal resolution can enhance our understanding of how neurons or cardiomyocytes regulate and orchestrate network activity, and how this activity can be affected with pharmacology or other interventional modalities. Nanoscale devices often employ electroporation to transiently permeate the cell membrane and record intracellular potentials, which tend to decrease rapidly to extracellular potential amplitudes with time. Here, we report innovative scalable, vertical, ultra-sharp nanowire arrays that are individually addressable to enable long-term, native recordings of intracellular potentials. We report large action potential amplitudes that are indicative of intracellular access from 3D tissue-like networks of neurons and cardiomyocytes across recording days and that do not decrease to extracellular amplitudes for the duration of the recording of several minutes. Our findings are validated with cross-sectional microscopy, pharmacology, and electrical interventions. Our experiments and simulations demonstrate that individual electrical addressability of nanowires is necessary for high-fidelity intracellular electrophysiological recordings. This study advances our understanding of and control over high-quality multi-channel intracellular recordings, and paves the way toward predictive, high-throughput, and low-cost electrophysiological drug screening platforms.

Submitted 5 July, 2021; v1 submitted 30 June, 2021; originally announced June 2021.

Comments: Main manuscript: 33 pages, 4 figures, Supporting information: 43 pages, 27 figures, Submitted to Advanced Materials

arXiv:2106.13148 [pdf]

doi 10.1126/sciadv.abh2929

Membraneless organelles formed by liquid-liquid phase separation increase bacterial fitness

Authors: Xin Jin, Ji-Eun Lee, Charley Schaefer, Xinwei Luo, Adam J. M. Wollman, Alex L. Payne-Dwyer, Tian Tian, Xiaowei Zhang, Xiao Chen, Yingxing Li, Tom C. B. McLeish, Mark C. Leake, Fan Bai

Abstract: Liquid-liquid phase separation is emerging as a crucial phenomenon in several fundamental cell processes. A range of eukaryotic systems exhibit liquid condensates. However, their function in bacteria, which in general lack membrane-bound compartments, remains less clear. Here, we used high-resolution optical microscopy to observe single bacterial aggresomes, nanostructured intracellular assemblies of proteins, to uncover their role in cell stress. We find that proteins inside aggresomes are mobile and undergo dynamic turnover, consistent with a liquid state. Our observations are in quantitative agreement with phase-separated liquid droplet formation driven by interacting proteins under thermal equilibrium that nucleate following diffusive collisions in the cytoplasm. We have discovered aggresomes in multiple species of bacteria, and show that these emergent, metastable liquid-structured protein assemblies increase bacterial fitness by enabling cells to tolerate environmental stresses.

Submitted 24 June, 2021; originally announced June 2021.

Journal ref: Sci Adv. 2021 Oct 22;7(43):eabh2929

arXiv:2106.07206 [pdf, ps, other]

doi 10.1103/PhysRevE.105.014309

Stability and selective extinction in complex mutualistic networks

Authors: Hyun Woo Lee, Jae Woo Lee, Deok-Sun Lee

Abstract: We study species abundance in the empirical plant-pollinator mutualistic networks exhibiting broad degree distributions, with uniform intra-group competition assumed, by the Lotka-Volterra equation. The stability of a fixed point is found to be identified by the signs of its non-zero components and those of its neighboring fixed points. Taking the annealed approximation, we derive the non-zero components to be formulated in terms of degrees and the rescaled interaction strengths, which lead us to find different stable fixed points depending on parameters, and we obtain the phase diagram. The selective extinction phase finds small-degree species extinct and effective interaction reduced, maintaining stability and hindering the onset of instability. The non-zero minimum species abundances from different empirical networks show data collapse when rescaled as predicted theoretically.

Submitted 24 January, 2022; v1 submitted 14 June, 2021; originally announced June 2021.

Comments: 7 figures

Journal ref: Physical Review E 105, 014309 (2022)

arXiv:2105.14372 [pdf]

doi 10.1371/journal.pcbi.1009803

Ten Quick Tips for Deep Learning in Biology

Authors: Benjamin D. Lee, Anthony Gitter, Casey S. Greene, Sebastian Raschka, Finlay Maguire, Alexander J. Titus, Michael D. Kessler, Alexandra J. Lee, Marc G. Chevrette, Paul Allen Stewart, Thiago Britto-Borges, Evan M. Cofer, Kun-Hsing Yu, Juan Jose Carmona, Elana J. Fertig, Alexandr A. Kalinin, Beth Signal, Benjamin J. Lengerich, Timothy J. Triche Jr, Simina M. Boca

Abstract: Machine learning is a modern approach to problem-solving and task automation. In particular, machine learning is concerned with the development and applications of algorithms that can recognize patterns in data and use them for predictive modeling. Artificial neural networks are a particular class of machine learning algorithms and models that evolved into what is now described as deep learning. Given the computational advances made in the last decade, deep learning can now be applied to massive data sets and in innumerable contexts. Therefore, deep learning has become its own subfield of machine learning. In the context of biological research, it has been increasingly used to derive novel insights from high-dimensional biological data. To make the biological applications of deep learning more accessible to scientists who have some experience with machine learning, we solicited input from a community of researchers with varied biological and deep learning interests. These individuals collaboratively contributed to this manuscript's writing using the GitHub version control platform and the Manubot manuscript generation toolset. The goal was to articulate a practical, accessible, and concise set of guidelines and suggestions to follow when using deep learning.
In the course of our discussions, several themes became clear: the importance of understanding and applying machine learning fundamentals as a baseline for utilizing deep learning, the necessity for extensive model comparisons with careful evaluation, and the need for critical thought in interpreting results generated by deep learning, among others.

Submitted 29 May, 2021; originally announced May 2021.

Comments: 23 pages, 2 figures

Showing 1–50 of 107 results for author: Lee, J