Search | arXiv e-print repository

Uni-Mol2: Exploring Molecular Pretraining Model at Scale

Authors: Xiaohong Ji, Wang Zhen, Zhifeng Gao, Hang Zheng, Linfeng Zhang, Guolin Ke, Weinan E

Abstract: In recent years, pretraining models have made significant advancements in the fields of natural language processing (NLP), computer vision (CV), and life sciences. The significant advancements in NLP and CV are predominantly driven by the expansion of model parameters and data size, a phenomenon now recognized as the scaling laws. However, research exploring scaling law in molecular pretraining mo… ▽ More In recent years, pretraining models have made significant advancements in the fields of natural language processing (NLP), computer vision (CV), and life sciences. The significant advancements in NLP and CV are predominantly driven by the expansion of model parameters and data size, a phenomenon now recognized as the scaling laws. However, research exploring scaling law in molecular pretraining models remains unexplored. In this work, we present Uni-Mol2 , an innovative molecular pretraining model that leverages a two-track transformer to effectively integrate features at the atomic level, graph level, and geometry structure level. Along with this, we systematically investigate the scaling law within molecular pretraining models, characterizing the power-law correlations between validation loss and model size, dataset size, and computational resources. Consequently, we successfully scale Uni-Mol2 to 1.1 billion parameters through pretraining on 800 million conformations, making it the largest molecular pretraining model to date. Extensive experiments show consistent improvement in the downstream tasks as the model size grows. The Uni-Mol2 with 1.1B parameters also outperforms existing methods, achieving an average 27% improvement on the QM9 and 14% on COMPAS-1D dataset. △ Less

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2405.20763 [pdf, other]

Improving Generalization and Convergence by Enhancing Implicit Regularization

Authors: Mingze Wang, Haotian He, Jinbo Wang, Zilin Wang, Guanhua Huang, Feiyu Xiong, Zhiyu Li, Weinan E, Lei Wu

Abstract: In this work, we propose an Implicit Regularization Enhancement (IRE) framework to accelerate the discovery of flat solutions in deep learning, thereby improving generalization and convergence. Specifically, IRE decouples the dynamics of flat and sharp directions, which boosts the sharpness reduction along flat directions while maintaining the training stability in sharp directions. We show that I… ▽ More In this work, we propose an Implicit Regularization Enhancement (IRE) framework to accelerate the discovery of flat solutions in deep learning, thereby improving generalization and convergence. Specifically, IRE decouples the dynamics of flat and sharp directions, which boosts the sharpness reduction along flat directions while maintaining the training stability in sharp directions. We show that IRE can be practically incorporated with {\em generic base optimizers} without introducing significant computational overload. Experiments show that IRE consistently improves the generalization performance for image classification tasks across a variety of benchmark datasets (CIFAR-10/100, ImageNet) and models (ResNets and ViTs). Surprisingly, IRE also achieves a $2\times$ {\em speed-up} compared to AdamW in the pre-training of Llama models (of sizes ranging from 60M to 229M) on datasets including Wikitext-103, Minipile, and Openwebtext. Moreover, we provide theoretical guarantees, showing that IRE can substantially accelerate the convergence towards flat minima in Sharpness-aware Minimization (SAM). △ Less

Submitted 31 May, 2024; originally announced May 2024.

Comments: 35 pages

arXiv:2405.12356 [pdf, other]

Coarse-graining conformational dynamics with multi-dimensional generalized Langevin equation: how, when, and why

Authors: Pinchen Xie, Yunrui Qiu, Weinan E

Abstract: A data-driven ab initio generalized Langevin equation (AIGLE) approach is developed to learn and simulate high-dimensional, heterogeneous, coarse-grained conformational dynamics. Constrained by the fluctuation-dissipation theorem, the approach can build coarse-grained models in dynamical consistency with all-atom molecular dynamics. We also propose practical criteria for AIGLE to enforce long-term… ▽ More A data-driven ab initio generalized Langevin equation (AIGLE) approach is developed to learn and simulate high-dimensional, heterogeneous, coarse-grained conformational dynamics. Constrained by the fluctuation-dissipation theorem, the approach can build coarse-grained models in dynamical consistency with all-atom molecular dynamics. We also propose practical criteria for AIGLE to enforce long-term dynamical consistency. Case studies of a toy polymer, with 20 coarse-grained sites, and the alanine dipeptide, with two dihedral angles, elucidate why one should adopt AIGLE or its Markovian limit for modeling coarse-grained conformational dynamics in practice. △ Less

Submitted 20 May, 2024; originally announced May 2024.

arXiv:2402.00522 [pdf, ps, other]

Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling

Authors: Mingze Wang, Weinan E

Abstract: We conduct a systematic study of the approximation properties of Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of Transformer, such as the dot-product self-attention, positional encoding and feed-forward layer, affect its expressive power, and we study their combined effects through establishing explicit… ▽ More We conduct a systematic study of the approximation properties of Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of Transformer, such as the dot-product self-attention, positional encoding and feed-forward layer, affect its expressive power, and we study their combined effects through establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads. These theoretical insights are validated experimentally and offer natural suggestions for alternative architectures. △ Less

Submitted 24 May, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

Comments: 70 pages

arXiv:2401.08309 [pdf, other]

Anchor function: a type of benchmark functions for studying language models

Authors: Zhongwang Zhang, Zhiwei Wang, Junjie Yao, Zhangchen Zhou, Xiaolong Li, Weinan E, Zhi-Qin John Xu

Abstract: Understanding transformer-based language models is becoming increasingly crucial, particularly as they play pivotal roles in advancing towards artificial general intelligence. However, language model research faces significant challenges, especially for academic research groups with constrained resources. These challenges include complex data structures, unknown target functions, high computationa… ▽ More Understanding transformer-based language models is becoming increasingly crucial, particularly as they play pivotal roles in advancing towards artificial general intelligence. However, language model research faces significant challenges, especially for academic research groups with constrained resources. These challenges include complex data structures, unknown target functions, high computational costs and memory requirements, and a lack of interpretability in the inference process, etc. Drawing a parallel to the use of simple models in scientific research, we propose the concept of an anchor function. This is a type of benchmark function designed for studying language models in learning tasks that follow an "anchor-key" pattern. By utilizing the concept of an anchor function, we can construct a series of functions to simulate various language tasks. The anchor function plays a role analogous to that of mice in diabetes research, particularly suitable for academic research. We demonstrate the utility of the anchor function with an example, revealing two basic operations by attention structures in language models: shifting tokens and broadcasting one token from one position to many positions. These operations are also commonly observed in large language models. The anchor function framework, therefore, opens up a series of valuable and accessible research questions for further exploration, especially for theoretical study. △ Less

Submitted 16 January, 2024; originally announced January 2024.

arXiv:2311.17749 [pdf, other]

Learning Free Terminal Time Optimal Closed-loop Control of Manipulators

Authors: Wei Hu, Yue Zhao, Weinan E, Jiequn Han, Jihao Long

Abstract: This paper presents a novel approach to learning free terminal time closed-loop control for robotic manipulation tasks, enabling dynamic adjustment of task duration and control inputs to enhance performance. We extend the supervised learning approach, namely solving selected optimal open-loop problems and utilizing them as training data for a policy network, to the free terminal time scenario. Thr… ▽ More This paper presents a novel approach to learning free terminal time closed-loop control for robotic manipulation tasks, enabling dynamic adjustment of task duration and control inputs to enhance performance. We extend the supervised learning approach, namely solving selected optimal open-loop problems and utilizing them as training data for a policy network, to the free terminal time scenario. Three main challenges are addressed in this extension. First, we introduce a marching scheme that enhances the solution quality and increases the success rate of the open-loop solver by gradually refining time discretization. Second, we extend the QRnet in Nakamura-Zimmerer et al. (2021b) to the free terminal time setting to address discontinuity and improve stability at the terminal state. Third, we present a more automated version of the initial value problem (IVP) enhanced sampling method from previous work (Zhang et al., 2022) to adaptively update the training dataset, significantly improving its quality. By integrating these techniques, we develop a closed-loop policy that operates effectively over a broad domain with varying optimal time durations, achieving near globally optimal total costs. △ Less

Submitted 29 November, 2023; originally announced November 2023.

arXiv:2305.01243 [pdf]

Machine-Learned Invertible Coarse Graining for Multiscale Molecular Modeling

Authors: Jun Zhang, Xiaohan Lin, Weinan E, Yi Qin Gao

Abstract: Multiscale molecular modeling is widely applied in scientific research of molecular properties over large time and length scales. Two specific challenges are commonly present in multiscale modeling, provided that information between the coarse and fine representations of molecules needs to be properly exchanged: One is to construct coarse grained (CG) models by passing information from the fine to… ▽ More Multiscale molecular modeling is widely applied in scientific research of molecular properties over large time and length scales. Two specific challenges are commonly present in multiscale modeling, provided that information between the coarse and fine representations of molecules needs to be properly exchanged: One is to construct coarse grained (CG) models by passing information from the fine to coarse levels; the other is to restore finer molecular details given CG configurations. Although these two problems are commonly addressed independently, in this work, we present a theory connecting them, and develop a methodology called Cycle Coarse Graining (CCG) to solve both problems in a unified manner. In CCG, reconstruction can be achieved via a tractable optimization process, leading to a general method to retrieve fine details from CG simulations, which in turn, delivers a new solution to the CG problem, yielding an efficient way to calculate free energies in a rare-event-free manner. CCG thus provides a systematic way for multiscale molecular modeling, where the finer details of CG simulations can be efficiently retrieved, and the CG models can be improved consistently. △ Less

Submitted 2 May, 2023; originally announced May 2023.

Comments: 10 pages, 5 figures, plus SI

arXiv:2302.03498 [pdf, other]

MAC: A unified framework boosting low resource automatic speech recognition

Authors: Zeping Min, Qian Ge, Zhong Li, Weinan E

Abstract: We propose a unified framework for low resource automatic speech recognition tasks named meta audio concatenation (MAC). It is easy to implement and can be carried out in extremely low resource environments. Mathematically, we give a clear description of MAC framework from the perspective of bayesian sampling. In this framework, we leverage a novel concatenative synthesis text-to-speech system to… ▽ More We propose a unified framework for low resource automatic speech recognition tasks named meta audio concatenation (MAC). It is easy to implement and can be carried out in extremely low resource environments. Mathematically, we give a clear description of MAC framework from the perspective of bayesian sampling. In this framework, we leverage a novel concatenative synthesis text-to-speech system to boost the low resource ASR task. By the concatenative synthesis text-to-speech system, we can integrate language pronunciation rules and adjust the TTS process. Furthermore, we propose a broad notion of meta audio set to meet the modeling needs of different languages and different scenes when using the system. Extensive experiments have demonstrated the great effectiveness of MAC on low resource ASR tasks. For CTC greedy search, CTC prefix, attention, and attention rescoring decode mode in Cantonese ASR task, Taiwanese ASR task, and Japanese ASR task the MAC method can reduce the CER by more than 15\%. Furthermore, in the ASR task, MAC beats wav2vec2 (with fine-tuning) on common voice datasets of Cantonese and gets really competitive results on common voice datasets of Taiwanese and Japanese. Among them, it is worth mentioning that we achieve a \textbf{10.9\%} character error rate (CER) on the common voice Cantonese ASR task, bringing about \textbf{30\%} relative improvement compared to the wav2vec2 (with fine-tuning). △ Less

Submitted 15 February, 2023; v1 submitted 5 February, 2023; originally announced February 2023.

arXiv:2201.03549 [pdf, other]

A multi-scale sampling method for accurate and robust deep neural network to predict combustion chemical kinetics

Authors: Tianhan Zhang, Yuxiao Yi, Yifan Xu, Zhi X. Chen, Yaoyu Zhang, Weinan E, Zhi-Qin John Xu

Abstract: Machine learning has long been considered as a black box for predicting combustion chemical kinetics due to the extremely large number of parameters and the lack of evaluation standards and reproducibility. The current work aims to understand two basic questions regarding the deep neural network (DNN) method: what data the DNN needs and how general the DNN method can be. Sampling and preprocessing… ▽ More Machine learning has long been considered as a black box for predicting combustion chemical kinetics due to the extremely large number of parameters and the lack of evaluation standards and reproducibility. The current work aims to understand two basic questions regarding the deep neural network (DNN) method: what data the DNN needs and how general the DNN method can be. Sampling and preprocessing determine the DNN training dataset, further affect DNN prediction ability. The current work proposes using Box-Cox transformation (BCT) to preprocess the combustion data. In addition, this work compares different sampling methods with or without preprocessing, including the Monte Carlo method, manifold sampling, generative neural network method (cycle-GAN), and newly-proposed multi-scale sampling. Our results reveal that the DNN trained by the manifold data can capture the chemical kinetics in limited configurations but cannot remain robust toward perturbation, which is inevitable for the DNN coupled with the flow field. The Monte Carlo and cycle-GAN samplings can cover a wider phase space but fail to capture small-scale intermediate species, producing poor prediction results. A three-hidden-layer DNN, based on the multi-scale method without specific flame simulation data, allows predicting chemical kinetics in various scenarios and being stable during the temporal evolutions. This single DNN is readily implemented with several CFD codes and validated in various combustors, including (1). zero-dimensional autoignition, (2). one-dimensional freely propagating flame, (3). two-dimensional jet flame with triple-flame structure, and (4). three-dimensional turbulent lifted flames. The results demonstrate the satisfying accuracy and generalization ability of the pre-trained DNN. The Fortran and Python versions of DNN and example code are attached in the supplementary for reproducibility. △ Less

Submitted 12 August, 2022; v1 submitted 9 January, 2022; originally announced January 2022.

arXiv:2201.02025 [pdf, other]

A deep learning-based model reduction (DeePMR) method for simplifying chemical kinetics

Authors: Zhiwei Wang, Yaoyu Zhang, Enhan Zhao, Yiguang Ju, Weinan E, Zhi-Qin John Xu, Tianhan Zhang

Abstract: A deep learning-based model reduction (DeePMR) method for simplifying chemical kinetics is proposed and validated using high-temperature auto-ignitions, perfectly stirred reactors (PSR), and one-dimensional freely propagating flames of n-heptane/air mixtures. The mechanism reduction is modeled as an optimization problem on Boolean space, where a Boolean vector, each entry corresponding to a specie… ▽ More A deep learning-based model reduction (DeePMR) method for simplifying chemical kinetics is proposed and validated using high-temperature auto-ignitions, perfectly stirred reactors (PSR), and one-dimensional freely propagating flames of n-heptane/air mixtures. The mechanism reduction is modeled as an optimization problem on Boolean space, where a Boolean vector, each entry corresponding to a species, represents a reduced mechanism. The optimization goal is to minimize the reduced mechanism size given the error tolerance of a group of pre-selected benchmark quantities. The key idea of the DeePMR is to employ a deep neural network (DNN) to formulate the objective function in the optimization problem. In order to explore high dimensional Boolean space efficiently, an iterative DNN-assisted data sampling and DNN training procedure are implemented. The results show that DNN-assistance improves sampling efficiency significantly, selecting only $10^5$ samples out of $10^{34}$ possible samples for DNN to achieve sufficient accuracy. The results demonstrate the capability of the DNN to recognize key species and reasonably predict reduced mechanism performance. The well-trained DNN guarantees the optimal reduced mechanism by solving an inverse optimization problem. By comparing ignition delay times, laminar flame speeds, temperatures in PSRs, the resulting skeletal mechanism has fewer species (45 species) but the same level of accuracy as the skeletal mechanism (56 species) obtained by the Path Flux Analysis (PFA) method. In addition, the skeletal mechanism can be further reduced to 28 species if only considering atmospheric, near-stoichiometric conditions (equivalence ratio between 0.6 and 1.2). The DeePMR provides an innovative way to perform model reduction and demonstrates the great potential of data-driven methods in the combustion area. △ Less

Submitted 8 September, 2022; v1 submitted 6 January, 2022; originally announced January 2022.

arXiv:2112.14798 [pdf, other]

doi 10.4208/jml.220115

DeePN$^2$: A deep learning-based non-Newtonian hydrodynamic model

Authors: Lidong Fang, Pei Ge, Lei Zhang, Weinan E, Huan Lei

Abstract: A long standing problem in the modeling of non-Newtonian hydrodynamics of polymeric flows is the availability of reliable and interpretable hydrodynamic models that faithfully encode the underlying micro-scale polymer dynamics. The main complication arises from the long polymer relaxation time, the complex molecular structure and heterogeneous interaction. DeePN$^2$, a deep learning-based non-Newt… ▽ More A long standing problem in the modeling of non-Newtonian hydrodynamics of polymeric flows is the availability of reliable and interpretable hydrodynamic models that faithfully encode the underlying micro-scale polymer dynamics. The main complication arises from the long polymer relaxation time, the complex molecular structure and heterogeneous interaction. DeePN$^2$, a deep learning-based non-Newtonian hydrodynamic model, has been proposed and has shown some success in systematically passing the micro-scale structural mechanics information to the macro-scale hydrodynamics for suspensions with simple polymer conformation and bond potential. The model retains a multi-scaled nature by mapping the polymer configurations into a set of symmetry-preserving macro-scale features. The extended constitutive laws for these macro-scale features can be directly learned from the kinetics of their micro-scale counterparts. In this paper, we develop DeePN$^2$ using more complex micro-structural models. We show that DeePN$^2$ can faithfully capture the broadly overlooked viscoelastic differences arising from the specific molecular structural mechanics without human intervention. △ Less

Submitted 13 April, 2022; v1 submitted 29 December, 2021; originally announced December 2021.

arXiv:2112.14377 [pdf, other]

DeepHAM: A Global Solution Method for Heterogeneous Agent Models with Aggregate Shocks

Authors: Jiequn Han, Yucheng Yang, Weinan E

Abstract: An efficient, reliable, and interpretable global solution method, the Deep learning-based algorithm for Heterogeneous Agent Models (DeepHAM), is proposed for solving high dimensional heterogeneous agent models with aggregate shocks. The state distribution is approximately represented by a set of optimal generalized moments. Deep neural networks are used to approximate the value and policy function… ▽ More An efficient, reliable, and interpretable global solution method, the Deep learning-based algorithm for Heterogeneous Agent Models (DeepHAM), is proposed for solving high dimensional heterogeneous agent models with aggregate shocks. The state distribution is approximately represented by a set of optimal generalized moments. Deep neural networks are used to approximate the value and policy functions, and the objective is optimized over directly simulated paths. In addition to being an accurate global solver, this method has three additional features. First, it is computationally efficient in solving complex heterogeneous agent models, and it does not suffer from the curse of dimensionality. Second, it provides a general and interpretable representation of the distribution over individual states, which is crucial in addressing the classical question of whether and how heterogeneity matters in macroeconomics. Third, it solves the constrained efficiency problem as easily as it solves the competitive equilibrium, which opens up new possibilities for studying optimal monetary and fiscal policies in heterogeneous agent models with aggregate shocks. △ Less

Submitted 21 February, 2022; v1 submitted 28 December, 2021; originally announced December 2021.

Comments: Slides available at https://users.flatironinstitute.org/~jhan/files/DeepHAM_slides.pdf

arXiv:2107.03673 [pdf, other]

doi 10.4208/cicp.OA-2021-0257

MOD-Net: A Machine Learning Approach via Model-Operator-Data Network for Solving PDEs

Authors: Lulu Zhang, Tao Luo, Yaoyu Zhang, Weinan E, Zhi-Qin John Xu, Zheng Ma

Abstract: In this paper, we propose a a machine learning approach via model-operator-data network (MOD-Net) for solving PDEs. A MOD-Net is driven by a model to solve PDEs based on operator representation with regularization from data. For linear PDEs, we use a DNN to parameterize the Green's function and obtain the neural operator to approximate the solution according to the Green's method. To train the DNN… ▽ More In this paper, we propose a a machine learning approach via model-operator-data network (MOD-Net) for solving PDEs. A MOD-Net is driven by a model to solve PDEs based on operator representation with regularization from data. For linear PDEs, we use a DNN to parameterize the Green's function and obtain the neural operator to approximate the solution according to the Green's method. To train the DNN, the empirical risk consists of the mean squared loss with the least square formulation or the variational formulation of the governing equation and boundary conditions. For complicated problems, the empirical risk also includes a few labels, which are computed on coarse grid points with cheap computation cost and significantly improves the model accuracy. Intuitively, the labeled dataset works as a regularization in addition to the model constraints. The MOD-Net solves a family of PDEs rather than a specific one and is much more efficient than original neural operator because few expensive labels are required. We numerically show MOD-Net is very efficient in solving Poisson equation and one-dimensional radiative transfer equation. For nonlinear PDEs, the nonlinear MOD-Net can be similarly used as an ansatz for solving nonlinear PDEs, exemplified by solving several nonlinear PDE problems, such as the Burgers equation. △ Less

Submitted 28 December, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

arXiv:2107.03633 [pdf, other]

Generalization Error of GAN from the Discriminator's Perspective

Authors: Hongkang Yang, Weinan E

Abstract: The generative adversarial network (GAN) is a well-known model for learning high-dimensional distributions, but the mechanism for its generalization ability is not understood. In particular, GAN is vulnerable to the memorization phenomenon, the eventual convergence to the empirical distribution. We consider a simplified GAN model with the generator replaced by a density, and analyze how the discri… ▽ More The generative adversarial network (GAN) is a well-known model for learning high-dimensional distributions, but the mechanism for its generalization ability is not understood. In particular, GAN is vulnerable to the memorization phenomenon, the eventual convergence to the empirical distribution. We consider a simplified GAN model with the generator replaced by a density, and analyze how the discriminator contributes to generalization. We show that with early stopping, the generalization error measured by Wasserstein metric escapes from the curse of dimensionality, despite that in the long term, memorization is inevitable. In addition, we present a hardness of learning result for WGAN. △ Less

Submitted 5 November, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

MSC Class: 68T07; 62G07; 60-08

arXiv:2104.07794 [pdf, ps, other]

An $L^2$ Analysis of Reinforcement Learning in High Dimensions with Kernel and Neural Network Approximation

Authors: Jihao Long, Jiequn Han, Weinan E

Abstract: Reinforcement learning (RL) algorithms based on high-dimensional function approximation have achieved tremendous empirical success in large-scale problems with an enormous number of states. However, most analysis of such algorithms gives rise to error bounds that involve either the number of states or the number of features. This paper considers the situation where the function approximation is ma… ▽ More Reinforcement learning (RL) algorithms based on high-dimensional function approximation have achieved tremendous empirical success in large-scale problems with an enormous number of states. However, most analysis of such algorithms gives rise to error bounds that involve either the number of states or the number of features. This paper considers the situation where the function approximation is made either using the kernel method or the two-layer neural network model, in the context of a fitted Q-iteration algorithm with explicit regularization. We establish an $\tilde{O}(H^3|\mathcal {A}|^{\frac14}n^{-\frac14})$ bound for the optimal policy with $Hn$ samples, where $H$ is the length of each episode and $|\mathcal {A}|$ is the size of action space. Our analysis hinges on analyzing the $L^2$ error of the approximated Q-function using $n$ data points. Even though this result still requires a finite-sized action space, the error bound is independent of the dimensionality of the state space. △ Less

Submitted 15 February, 2022; v1 submitted 15 April, 2021; originally announced April 2021.

arXiv:2012.12654 [pdf]

A deep learning-based ODE solver for chemical kinetics

Authors: Tianhan Zhang, Yaoyu Zhang, Weinan E, Yiguang Ju

Abstract: Developing efficient and accurate algorithms for chemistry integration is a challenging task due to its strong stiffness and high dimensionality. The current work presents a deep learning-based numerical method called DeepCombustion0.0 to solve stiff ordinary differential equation systems. The homogeneous autoignition of DME/air mixture, including 54 species, is adopted as an example to illustrate… ▽ More Developing efficient and accurate algorithms for chemistry integration is a challenging task due to its strong stiffness and high dimensionality. The current work presents a deep learning-based numerical method called DeepCombustion0.0 to solve stiff ordinary differential equation systems. The homogeneous autoignition of DME/air mixture, including 54 species, is adopted as an example to illustrate the validity and accuracy of the algorithm. The training and testing datasets cover a wide range of temperature, pressure, and mixture conditions between 750-1200 K, 30-50 atm, and equivalence ratio = 0.7-1.5. Both the first-stage low-temperature ignition (LTI) and the second-stage high-temperature ignition (HTI) are considered. The methodology highlights the importance of the adaptive data sampling techniques, power transform preprocessing, and binary deep neural network (DNN) design. By using the adaptive random samplings and appropriate power transforms, smooth submanifolds in the state vector phase space are observed, on which two three-layer DNNs can be appropriately trained. The neural networks are end-to-end, which predict temporal gradients of the state vectors directly. The results show that temporal evolutions predicted by DNN agree well with traditional numerical methods in all state vector dimensions, including temperature, pressure, and species concentrations. Besides, the ignition delay time differences are within 1%. At the same time, the CPU time is reduced by more than 20 times and 200 times compared with the HMTS and VODE method, respectively. The current work demonstrates the enormous potential of applying the deep learning algorithm in chemical kinetics and combustion modeling. △ Less

Submitted 23 November, 2020; originally announced December 2020.

arXiv:2012.05420 [pdf, ps, other]

On the emergence of simplex symmetry in the final and penultimate layers of neural network classifiers

Authors: Weinan E, Stephan Wojtowytsch

Abstract: A recent numerical study observed that neural network classifiers enjoy a large degree of symmetry in the penultimate layer. Namely, if $h(x) = Af(x) +b$ where $A$ is a linear map and $f$ is the output of the penultimate layer of the network (after activation), then all data points $x_{i, 1}, \dots, x_{i, N_i}$ in a class $C_i$ are mapped to a single point $y_i$ by $f$ and the points $y_i$ are loc… ▽ More A recent numerical study observed that neural network classifiers enjoy a large degree of symmetry in the penultimate layer. Namely, if $h(x) = Af(x) +b$ where $A$ is a linear map and $f$ is the output of the penultimate layer of the network (after activation), then all data points $x_{i, 1}, \dots, x_{i, N_i}$ in a class $C_i$ are mapped to a single point $y_i$ by $f$ and the points $y_i$ are located at the vertices of a regular $k-1$-dimensional standard simplex in a high-dimensional Euclidean space. We explain this observation analytically in toy models for highly expressive deep neural networks. In complementary examples, we demonstrate rigorously that even the final output of the classifier $h$ is not uniform over data samples from a class $C_i$ if $h$ is a shallow network (or if the deeper layers do not bring the data samples into a convenient geometric configuration). △ Less

Submitted 4 June, 2021; v1 submitted 9 December, 2020; originally announced December 2020.

MSC Class: 68T07; 62H30

arXiv:2012.01484 [pdf, ps, other]

Some observations on high-dimensional partial differential equations with Barron data

Authors: Weinan E, Stephan Wojtowytsch

Abstract: We use explicit representation formulas to show that solutions to certain partial differential equations lie in Barron spaces or multilayer spaces if the PDE data lie in such function spaces. Consequently, these solutions can be represented efficiently using artificial neural networks, even in high dimension. Conversely, we present examples in which the solution fails to lie in the function space… ▽ More We use explicit representation formulas to show that solutions to certain partial differential equations lie in Barron spaces or multilayer spaces if the PDE data lie in such function spaces. Consequently, these solutions can be represented efficiently using artificial neural networks, even in high dimension. Conversely, we present examples in which the solution fails to lie in the function space associated to a neural network under consideration. △ Less

Submitted 4 June, 2021; v1 submitted 2 December, 2020; originally announced December 2020.

MSC Class: 68T07; 35C15; 65M80

arXiv:2011.14269 [pdf, other]

Generalization and Memorization: The Bias Potential Model

Authors: Hongkang Yang, Weinan E

Abstract: Models for learning probability distributions such as generative models and density estimators behave quite differently from models for learning functions. One example is found in the memorization phenomenon, namely the ultimate convergence to the empirical distribution, that occurs in generative adversarial networks (GANs). For this reason, the issue of generalization is more subtle than that for… ▽ More Models for learning probability distributions such as generative models and density estimators behave quite differently from models for learning functions. One example is found in the memorization phenomenon, namely the ultimate convergence to the empirical distribution, that occurs in generative adversarial networks (GANs). For this reason, the issue of generalization is more subtle than that for supervised learning. For the bias potential model, we show that dimension-independent generalization accuracy is achievable if early stopping is adopted, despite that in the long term, the model either memorizes the samples or diverges. △ Less

Submitted 1 March, 2021; v1 submitted 28 November, 2020; originally announced November 2020.

Comments: Added new section on regularized model

MSC Class: 68T07; 60-08

arXiv:2010.05627 [pdf, other]

Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning

Authors: Pan Zhou, Jiashi Feng, Chao Ma, Caiming Xiong, Steven Hoi, Weinan E

Abstract: It is not clear yet why ADAM-alike adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to provide understandings on this generalization gap by analyzing their local convergence behaviors. Specifically, we observe the heavy tails of gradient noise in these algorithms. This motivates us to analyze these algorithms thr… ▽ More It is not clear yet why ADAM-alike adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to provide understandings on this generalization gap by analyzing their local convergence behaviors. Specifically, we observe the heavy tails of gradient noise in these algorithms. This motivates us to analyze these algorithms through their Levy-driven stochastic differential equations (SDEs) because of the similar convergence behaviors of an algorithm and its SDE. Then we establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM~depends on the Radon measure of the basin positively and the heaviness of gradient noise negatively; (2) for the same basin, SGD enjoys smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM~via adaptively scaling each gradient coordinate well diminishes the anisotropic structure in gradient noise and results in larger Radon measure of a basin; (b) the exponential gradient average in ADAM~smooths its gradient and leads to lighter gradient noise tails than SGD. So SGD is more locally unstable than ADAM~at sharp minima defined as the minima whose local basins have small Radon measure, and can better escape from them to flatter ones with larger Radon measure. As flat minima here which often refer to the minima at flat or asymmetric basins/valleys often generalize better than sharp ones , our result explains the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient noise assumption and theoretical affirmation. △ Less

Submitted 28 November, 2021; v1 submitted 12 October, 2020; originally announced October 2020.

Comments: NeurIPS 2020

arXiv:2010.05311 [pdf, other]

Interpretable Neural Networks for Panel Data Analysis in Economics

Authors: Yucheng Yang, Zhong Zheng, Weinan E

Abstract: The lack of interpretability and transparency are preventing economists from using advanced tools like neural networks in their empirical research. In this paper, we propose a class of interpretable neural network models that can achieve both high prediction accuracy and interpretability. The model can be written as a simple function of a regularized number of interpretable features, which are out… ▽ More The lack of interpretability and transparency are preventing economists from using advanced tools like neural networks in their empirical research. In this paper, we propose a class of interpretable neural network models that can achieve both high prediction accuracy and interpretability. The model can be written as a simple function of a regularized number of interpretable features, which are outcomes of interpretable functions encoded in the neural network. Researchers can design different forms of interpretable functions based on the nature of their tasks. In particular, we encode a class of interpretable functions named persistent change filters in the neural network to study time series cross-sectional data. We apply the model to predicting individual's monthly employment status using high-dimensional administrative data. We achieve an accuracy of 94.5% in the test set, which is comparable to the best performed conventional machine learning methods. Furthermore, the interpretability of the model allows us to understand the mechanism that underlies the prediction: an individual's employment status is closely related to whether she pays different types of insurances. Our work is a useful step towards overcoming the black-box problem of neural networks, and provide a new tool for economists to study administrative and proprietary big data. △ Less

Submitted 29 November, 2020; v1 submitted 11 October, 2020; originally announced October 2020.

arXiv:2010.05172 [pdf, other]

The Knowledge Graph for Macroeconomic Analysis with Alternative Big Data

Authors: Yucheng Yang, Yue Pang, Guanhua Huang, Weinan E

Abstract: The current knowledge system of macroeconomics is built on interactions among a small number of variables, since traditional macroeconomic models can mostly handle a handful of inputs. Recent work using big data suggests that a much larger number of variables are active in driving the dynamics of the aggregate economy. In this paper, we introduce a knowledge graph (KG) that consists of not only li… ▽ More The current knowledge system of macroeconomics is built on interactions among a small number of variables, since traditional macroeconomic models can mostly handle a handful of inputs. Recent work using big data suggests that a much larger number of variables are active in driving the dynamics of the aggregate economy. In this paper, we introduce a knowledge graph (KG) that consists of not only linkages between traditional economic variables but also new alternative big data variables. We extract these new variables and the linkages by applying advanced natural language processing (NLP) tools on the massive textual data of academic literature and research reports. As one example of the potential applications, we use it as the prior knowledge to select variables for economic forecasting models in macroeconomics. Compared to statistical variable selection methods, KG-based methods achieve significantly higher forecasting accuracy, especially for long run forecasts. △ Less

Submitted 11 October, 2020; originally announced October 2020.

arXiv:2009.14596 [pdf, other]

doi 10.4208/cicp.OA-2020-0185

Machine Learning and Computational Mathematics

Authors: Weinan E

Abstract: Neural network-based machine learning is capable of approximating functions in very high dimension with unprecedented efficiency and accuracy. This has opened up many exciting new possibilities, not just in traditional areas of artificial intelligence, but also in scientific computing and computational science. At the same time, machine learning has also acquired the reputation of being a set of "… ▽ More Neural network-based machine learning is capable of approximating functions in very high dimension with unprecedented efficiency and accuracy. This has opened up many exciting new possibilities, not just in traditional areas of artificial intelligence, but also in scientific computing and computational science. At the same time, machine learning has also acquired the reputation of being a set of "black box" type of tricks, without fundamental principles. This has been a real obstacle for making further progress in machine learning. In this article, we try to address the following two very important questions: (1) How machine learning has already impacted and will further impact computational mathematics, scientific computing and computational science? (2) How computational mathematics, particularly numerical analysis, {can} impact machine learning? We describe some of the most important progress that has been made on these issues. Our hope is to put things into a perspective that will help to integrate machine learning with computational mathematics. △ Less

Submitted 23 September, 2020; originally announced September 2020.

MSC Class: 68T07; 46E15; 26B35; 26B40

arXiv:2009.13500 [pdf, ps, other]

A priori estimates for classification problems using neural networks

Authors: Weinan E, Stephan Wojtowytsch

Abstract: We consider binary and multi-class classification problems using hypothesis classes of neural networks. For a given hypothesis class, we use Rademacher complexity estimates and direct approximation theorems to obtain a priori error estimates for regularized loss functionals. We consider binary and multi-class classification problems using hypothesis classes of neural networks. For a given hypothesis class, we use Rademacher complexity estimates and direct approximation theorems to obtain a priori error estimates for regularized loss functionals. △ Less

Submitted 28 September, 2020; originally announced September 2020.

MSC Class: 68T07; 60-08

arXiv:2009.10713 [pdf, other]

Towards a Mathematical Understanding of Neural Network-Based Machine Learning: what we know and what we don't

Authors: Weinan E, Chao Ma, Stephan Wojtowytsch, Lei Wu

Abstract: The purpose of this article is to review the achievements made in the last few years towards the understanding of the reasons behind the success and subtleties of neural network-based machine learning. In the tradition of good old applied mathematics, we will not only give attention to rigorous mathematical results, but also the insight we have gained from careful numerical experiments as well as… ▽ More The purpose of this article is to review the achievements made in the last few years towards the understanding of the reasons behind the success and subtleties of neural network-based machine learning. In the tradition of good old applied mathematics, we will not only give attention to rigorous mathematical results, but also the insight we have gained from careful numerical experiments as well as the analysis of simplified models. Along the way, we also list the open problems which we believe to be the most important topics for further study. This is not a complete overview over this quickly moving field, but we hope to provide a perspective which may be helpful especially to new researchers in the area. △ Less

Submitted 7 December, 2020; v1 submitted 22 September, 2020; originally announced September 2020.

Comments: Review article. Feedback welcome

MSC Class: 68T07 (primary); 26B40; 41A30; 35Q68

arXiv:2009.07799 [pdf, other]

On the Curse of Memory in Recurrent Neural Networks: Approximation and Optimization Analysis

Authors: Zhong Li, Jiequn Han, Weinan E, Qianxiao Li

Abstract: We study the approximation properties and optimization dynamics of recurrent neural networks (RNNs) when applied to learn input-output relationships in temporal data. We consider the simple but representative setting of using continuous-time linear RNNs to learn from data generated by linear relationships. Mathematically, the latter can be understood as a sequence of linear functionals. We prove a… ▽ More We study the approximation properties and optimization dynamics of recurrent neural networks (RNNs) when applied to learn input-output relationships in temporal data. We consider the simple but representative setting of using continuous-time linear RNNs to learn from data generated by linear relationships. Mathematically, the latter can be understood as a sequence of linear functionals. We prove a universal approximation theorem of such linear functionals, and characterize the approximation rate and its relation with memory. Moreover, we perform a fine-grained dynamical analysis of training linear RNNs, which further reveal the intricate interactions between memory and learning. A unifying theme uncovered is the non-trivial effect of memory, a notion that can be made precise in our framework, on approximation and optimization: when there is long term memory in the target, it takes a large number of neurons to approximate it. Moreover, the training process will suffer from slow downs. In particular, both of these effects become exponentially more pronounced with memory - a phenomenon we call the "curse of memory". These analyses represent a basic step towards a concrete mathematical understanding of new phenomenon that may arise in learning temporal relationships using recurrent architectures. △ Less

Submitted 15 May, 2021; v1 submitted 16 September, 2020; originally announced September 2020.

Comments: Published version

MSC Class: 68W25; 68T07; 37M10 ACM Class: I.2.6

arXiv:2009.06125 [pdf, other]

A Qualitative Study of the Dynamic Behavior for Adaptive Gradient Algorithms

Authors: Chao Ma, Lei Wu, Weinan E

Abstract: The dynamic behavior of RMSprop and Adam algorithms is studied through a combination of careful numerical experiments and theoretical explanations. Three types of qualitative features are observed in the training loss curve: fast initial convergence, oscillations, and large spikes in the late phase. The sign gradient descent (signGD) flow, which is the limit of Adam when taking the learning rate t… ▽ More The dynamic behavior of RMSprop and Adam algorithms is studied through a combination of careful numerical experiments and theoretical explanations. Three types of qualitative features are observed in the training loss curve: fast initial convergence, oscillations, and large spikes in the late phase. The sign gradient descent (signGD) flow, which is the limit of Adam when taking the learning rate to 0 while keeping the momentum parameters fixed, is used to explain the fast initial convergence. For the late phase of Adam, three different types of qualitative patterns are observed depending on the choice of the hyper-parameters: oscillations, spikes, and divergence. In particular, Adam converges much smoother and even faster when the values of the two momentum factors are close to each other. This observation is particularly important for scientific computing tasks, for which the training process usually proceeds into the high precision regime. △ Less

Submitted 29 September, 2021; v1 submitted 13 September, 2020; originally announced September 2020.

arXiv:2009.02327 [pdf, other]

doi 10.1103/PhysRevFluids.6.114402

OnsagerNet: Learning Stable and Interpretable Dynamics using a Generalized Onsager Principle

Authors: Haijun Yu, Xinyuan Tian, Weinan E, Qianxiao Li

Abstract: We propose a systematic method for learning stable and physically interpretable dynamical models using sampled trajectory data from physical processes based on a generalized Onsager principle. The learned dynamics are autonomous ordinary differential equations parameterized by neural networks that retain clear physical structure information, such as free energy, diffusion, conservative motion and… ▽ More We propose a systematic method for learning stable and physically interpretable dynamical models using sampled trajectory data from physical processes based on a generalized Onsager principle. The learned dynamics are autonomous ordinary differential equations parameterized by neural networks that retain clear physical structure information, such as free energy, diffusion, conservative motion and external forces. For high dimensional problems with a low dimensional slow manifold, an autoencoder with metric preserving regularization is introduced to find the low dimensional generalized coordinates on which we learn the generalized Onsager dynamics. Our method exhibits clear advantages over existing methods on benchmark problems for learning ordinary differential equations. We further apply this method to study Rayleigh-Benard convection and learn Lorenz-like low dimensional autonomous reduced order models that capture both qualitative and quantitative properties of the underlying dynamics. This forms a general approach to building reduced order models for forced dissipative systems. △ Less

Submitted 17 October, 2021; v1 submitted 6 September, 2020; originally announced September 2020.

Comments: 29 pages, 19 figures

MSC Class: 76E30; 34D20; 68T05/07; 82C35

Journal ref: Phy. Rev. Fluids 6(11):114402, 2021

arXiv:2008.05621 [pdf, other]

The Slow Deterioration of the Generalization Error of the Random Feature Model

Authors: Chao Ma, Lei Wu, Weinan E

Abstract: The random feature model exhibits a kind of resonance behavior when the number of parameters is close to the training sample size. This behavior is characterized by the appearance of large generalization gap, and is due to the occurrence of very small eigenvalues for the associated Gram matrix. In this paper, we examine the dynamic behavior of the gradient descent algorithm in this regime. We show… ▽ More The random feature model exhibits a kind of resonance behavior when the number of parameters is close to the training sample size. This behavior is characterized by the appearance of large generalization gap, and is due to the occurrence of very small eigenvalues for the associated Gram matrix. In this paper, we examine the dynamic behavior of the gradient descent algorithm in this regime. We show, both theoretically and experimentally, that there is a dynamic self-correction mechanism at work: The larger the eventual generalization gap, the slower it develops, both because of the small eigenvalues. This gives us ample time to stop the training process and obtain solutions with good generalization property. △ Less

Submitted 12 August, 2020; originally announced August 2020.

arXiv:2007.15623 [pdf, ps, other]

On the Banach spaces associated with multi-layer ReLU networks: Function representation, approximation theory and gradient descent dynamics

Authors: Weinan E, Stephan Wojtowytsch

Abstract: We develop Banach spaces for ReLU neural networks of finite depth $L$ and infinite width. The spaces contain all finite fully connected $L$-layer networks and their $L^2$-limiting objects under bounds on the natural path-norm. Under this norm, the unit ball in the space for $L$-layer networks has low Rademacher complexity and thus favorable generalization properties. Functions in these spaces can… ▽ More We develop Banach spaces for ReLU neural networks of finite depth $L$ and infinite width. The spaces contain all finite fully connected $L$-layer networks and their $L^2$-limiting objects under bounds on the natural path-norm. Under this norm, the unit ball in the space for $L$-layer networks has low Rademacher complexity and thus favorable generalization properties. Functions in these spaces can be approximated by multi-layer neural networks with dimension-independent convergence rates. The key to this work is a new way of representing functions in some form of expectations, motivated by multi-layer neural networks. This representation allows us to define a new class of continuous models for machine learning. We show that the gradient flow defined this way is the natural continuous analog of the gradient descent dynamics for the associated multi-layer neural networks. We show that the path-norm increases at most polynomially under this continuous gradient flow dynamics. △ Less

Submitted 30 July, 2020; originally announced July 2020.

MSC Class: 68T07; 46E15; 26B35; 26B40

arXiv:2007.09788 [pdf, other]

doi 10.1103/PhysRevB.103.024304

Coarse-grained spectral projection (CGSP): a deep learning-assisted approach to quantum unitary dynamics

Authors: Pinchen Xie, Weinan E

Abstract: We propose the coarse-grained spectral projection method (CGSP), a deep learning-assisted approach for tackling quantum unitary dynamic problems with an emphasis on quench dynamics. We show CGSP can extract spectral components of many-body quantum states systematically with sophisticated neural network quantum ansatz. CGSP exploits fully the linear unitary nature of the quantum dynamics, and is po… ▽ More We propose the coarse-grained spectral projection method (CGSP), a deep learning-assisted approach for tackling quantum unitary dynamic problems with an emphasis on quench dynamics. We show CGSP can extract spectral components of many-body quantum states systematically with sophisticated neural network quantum ansatz. CGSP exploits fully the linear unitary nature of the quantum dynamics, and is potentially superior to other quantum Monte Carlo methods for ergodic dynamics. Preliminary numerical results on 1D XXZ models with periodic boundary condition are carried out to demonstrate the practicality of CGSP. △ Less

Submitted 4 November, 2020; v1 submitted 19 July, 2020; originally announced July 2020.

Journal ref: Phys. Rev. B 103, 024304 (2021)

arXiv:2006.14450 [pdf, other]

The Quenching-Activation Behavior of the Gradient Descent Dynamics for Two-layer Neural Network Models

Authors: Chao Ma, Lei Wu, Weinan E

Abstract: A numerical and phenomenological study of the gradient descent (GD) algorithm for training two-layer neural network models is carried out for different parameter regimes when the target function can be accurately approximated by a relatively small number of neurons. It is found that for Xavier-like initialization, there are two distinctive phases in the dynamic behavior of GD in the under-parametr… ▽ More A numerical and phenomenological study of the gradient descent (GD) algorithm for training two-layer neural network models is carried out for different parameter regimes when the target function can be accurately approximated by a relatively small number of neurons. It is found that for Xavier-like initialization, there are two distinctive phases in the dynamic behavior of GD in the under-parametrized regime: An early phase in which the GD dynamics follows closely that of the corresponding random feature model and the neurons are effectively quenched, followed by a late phase in which the neurons are divided into two groups: a group of a few "activated" neurons that dominate the dynamics and a group of background (or "quenched") neurons that support the continued activation and deactivation process. This neural network-like behavior is continued into the mildly over-parametrized regime, where it undergoes a transition to a random feature-like behavior. The quenching-activation process seems to provide a clear mechanism for "implicit regularization". This is qualitatively different from the dynamics associated with the "mean-field" scaling where all neurons participate equally and there does not appear to be qualitative changes when the network parameters are changed. △ Less

Submitted 25 June, 2020; originally announced June 2020.

Comments: 23 pages

arXiv:2006.05982 [pdf, ps, other]

Representation formulas and pointwise properties for Barron functions

Authors: Weinan E, Stephan Wojtowytsch

Abstract: We study the natural function space for infinitely wide two-layer neural networks with ReLU activation (Barron space) and establish different representation formulae. In two cases, we describe the space explicitly up to isomorphism. Using a convenient representation, we study the pointwise properties of two-layer networks and show that functions whose singular set is fractal or curved (for examp… ▽ More We study the natural function space for infinitely wide two-layer neural networks with ReLU activation (Barron space) and establish different representation formulae. In two cases, we describe the space explicitly up to isomorphism. Using a convenient representation, we study the pointwise properties of two-layer networks and show that functions whose singular set is fractal or curved (for example distance functions from smooth submanifolds) cannot be represented by infinitely wide two-layer networks with finite path-norm. We use this structure theorem to show that the only $C^1$-diffeomorphisms which Barron space are affine. Furthermore, we show that every Barron function can be decomposed as the sum of a bounded and a positively one-homogeneous function and that there exist Barron functions which decay rapidly at infinity and are globally Lebesgue-integrable. This result suggests that two-layer neural networks may be able to approximate a greater variety of functions than commonly believed. △ Less

Submitted 4 June, 2021; v1 submitted 10 June, 2020; originally announced June 2020.

MSC Class: 68T07; 46E15; 26B35; 26B40

arXiv:2006.02619 [pdf, other]

Integrating Machine Learning with Physics-Based Modeling

Authors: Weinan E, Jiequn Han, Linfeng Zhang

Abstract: Machine learning is poised as a very powerful tool that can drastically improve our ability to carry out scientific research. However, many issues need to be addressed before this becomes a reality. This article focuses on one particular issue of broad interest: How can we integrate machine learning with physics-based modeling to develop new interpretable and truly reliable physical models? After… ▽ More Machine learning is poised as a very powerful tool that can drastically improve our ability to carry out scientific research. However, many issues need to be addressed before this becomes a reality. This article focuses on one particular issue of broad interest: How can we integrate machine learning with physics-based modeling to develop new interpretable and truly reliable physical models? After introducing the general guidelines, we discuss the two most important issues for developing machine learning-based physical models: Imposing physical constraints and obtaining optimal datasets. We also provide a simple and intuitive explanation for the fundamental reasons behind the success of modern machine learning, as well as an introduction to the concurrent machine learning framework needed for integrating machine learning with physics-based modeling. Molecular dynamics and moment closure of kinetic equations are used as examples to illustrate the main issues discussed. We end with a general discussion on where this integration will lead us to, and where the new frontier will be after machine learning is successfully integrated into scientific modeling. △ Less

Submitted 3 June, 2020; originally announced June 2020.

arXiv:2005.10815 [pdf, other]

Can Shallow Neural Networks Beat the Curse of Dimensionality? A mean field training perspective

Authors: Stephan Wojtowytsch, Weinan E

Abstract: We prove that the gradient descent training of a two-layer neural network on empirical or population risk may not decrease population risk at an order faster than $t^{-4/(d-2)}$ under mean field scaling. Thus gradient descent training for fitting reasonably smooth, but truly high-dimensional data may be subject to the curse of dimensionality. We present numerical evidence that gradient descent tra… ▽ More We prove that the gradient descent training of a two-layer neural network on empirical or population risk may not decrease population risk at an order faster than $t^{-4/(d-2)}$ under mean field scaling. Thus gradient descent training for fitting reasonably smooth, but truly high-dimensional data may be subject to the curse of dimensionality. We present numerical evidence that gradient descent training with general Lipschitz target functions becomes slower and slower as the dimension increases, but converges at approximately the same rate in all dimensions when the target function lies in the natural function space for two-layer ReLU networks. △ Less

Submitted 21 May, 2020; originally announced May 2020.

Comments: 5 figures

MSC Class: 68T07; 49Q22; 68W25

arXiv:2005.10807 [pdf, ps, other]

Kolmogorov Width Decay and Poor Approximators in Machine Learning: Shallow Neural Networks, Random Feature Models and Neural Tangent Kernels

Authors: Weinan E, Stephan Wojtowytsch

Abstract: We establish a scale separation of Kolmogorov width type between subspaces of a given Banach space under the condition that a sequence of linear maps converges much faster on one of the subspaces. The general technique is then applied to show that reproducing kernel Hilbert spaces are poor $L^2$-approximators for the class of two-layer neural networks in high dimension, and that multi-layer networ… ▽ More We establish a scale separation of Kolmogorov width type between subspaces of a given Banach space under the condition that a sequence of linear maps converges much faster on one of the subspaces. The general technique is then applied to show that reproducing kernel Hilbert spaces are poor $L^2$-approximators for the class of two-layer neural networks in high dimension, and that multi-layer networks with small path norm are poor approximators for certain Lipschitz functions, also in the $L^2$-topology. △ Less

Submitted 2 October, 2020; v1 submitted 21 May, 2020; originally announced May 2020.

MSC Class: 68T07; 41A30; 41A65; 46E15; 46E22

arXiv:2004.11658 [pdf, other]

doi 10.1016/j.cpc.2020.107624

86 PFLOPS Deep Potential Molecular Dynamics simulation of 100 million atoms with ab initio accuracy

Authors: Denghui Lu, Han Wang, Mohan Chen, Jiduan Liu, Lin Lin, Roberto Car, Weinan E, Weile Jia, Linfeng Zhang

Abstract: We present the GPU version of DeePMD-kit, which, upon training a deep neural network model using ab initio data, can drive extremely large-scale molecular dynamics (MD) simulation with ab initio accuracy. Our tests show that the GPU version is 7 times faster than the CPU version with the same power consumption. The code can scale up to the entire Summit supercomputer. For a copper system of 113, 2… ▽ More We present the GPU version of DeePMD-kit, which, upon training a deep neural network model using ab initio data, can drive extremely large-scale molecular dynamics (MD) simulation with ab initio accuracy. Our tests show that the GPU version is 7 times faster than the CPU version with the same power consumption. The code can scale up to the entire Summit supercomputer. For a copper system of 113, 246, 208 atoms, the code can perform one nanosecond MD simulation per day, reaching a peak performance of 86 PFLOPS (43% of the peak). Such unprecedented ability to perform MD simulation with ab initio accuracy opens up the possibility of studying many important issues in materials and molecules, such as heterogeneous catalysis, electrochemical cells, irradiation damage, crack propagation, and biochemical reactions. △ Less

Submitted 7 September, 2020; v1 submitted 24 April, 2020; originally announced April 2020.

Comments: 29 pages, 11 figures

arXiv:1912.06987 [pdf, ps, other]

The Generalization Error of the Minimum-norm Solutions for Over-parameterized Neural Networks

Authors: Weinan E, Chao Ma, Lei Wu

Abstract: We study the generalization properties of minimum-norm solutions for three over-parametrized machine learning models including the random feature model, the two-layer neural network model and the residual network model. We proved that for all three models, the generalization error for the minimum-norm solution is comparable to the Monte Carlo rate, up to some logarithmic terms, as long as the mode… ▽ More We study the generalization properties of minimum-norm solutions for three over-parametrized machine learning models including the random feature model, the two-layer neural network model and the residual network model. We proved that for all three models, the generalization error for the minimum-norm solution is comparable to the Monte Carlo rate, up to some logarithmic terms, as long as the models are sufficiently over-parametrized. △ Less

Submitted 28 January, 2021; v1 submitted 15 December, 2019; originally announced December 2019.

Comments: Published version

Journal ref: Pure and Applied Functional Analysis, Volume 5, Number 6, 1145-1460, 2020

arXiv:1907.12293 [pdf, ps, other]

doi 10.1109/TPAMI.2020.3022533

A mathematical model for universal semantics

Authors: Weinan E, Yajun Zhou

Abstract: We characterize the meaning of words with language-independent numerical fingerprints, through a mathematical analysis of recurring patterns in texts. Approximating texts by Markov processes on a long-range time scale, we are able to extract topics, discover synonyms, and sketch semantic fields from a particular document of moderate length, without consulting external knowledge-base or thesaurus.… ▽ More We characterize the meaning of words with language-independent numerical fingerprints, through a mathematical analysis of recurring patterns in texts. Approximating texts by Markov processes on a long-range time scale, we are able to extract topics, discover synonyms, and sketch semantic fields from a particular document of moderate length, without consulting external knowledge-base or thesaurus. Our Markov semantic model allows us to represent each topical concept by a low-dimensional vector, interpretable as algebraic invariants in succinct statistical operations on the document, targeting local environments of individual words. These language-independent semantic representations enable a robot reader to both understand short texts in a given language (automated question-answering) and match medium-length texts across different languages (automated word translation). Our semantic fingerprints quantify local meaning of words in 14 representative languages across 5 major language families, suggesting a universal and cost-effective mechanism by which human languages are processed at the semantic level. Our protocols and source codes are publicly available on https://github.com/yajun-zhou/linguae-naturalis-principia-mathematica △ Less

Submitted 12 July, 2020; v1 submitted 29 July, 2019; originally announced July 2019.

Comments: Main text (12 pages, 7 figures); Software Manual (ii+262 pages, 16 figures, 12 tables, available as two ancillary files). Revised according to reviewers' comments

Journal ref: IEEE Trans. Pattern Anal. Mach. Intell. 44(3):1124-1132 (2022)

arXiv:1906.08039 [pdf, ps, other]

The Barron Space and the Flow-induced Function Spaces for Neural Network Models

Authors: Weinan E, Chao Ma, Lei Wu

Abstract: One of the key issues in the analysis of machine learning models is to identify the appropriate function space and norm for the model. This is the set of functions endowed with a quantity which can control the approximation and estimation errors by a particular machine learning model. In this paper, we address this issue for two representative neural network models: the two-layer networks and the… ▽ More One of the key issues in the analysis of machine learning models is to identify the appropriate function space and norm for the model. This is the set of functions endowed with a quantity which can control the approximation and estimation errors by a particular machine learning model. In this paper, we address this issue for two representative neural network models: the two-layer networks and the residual neural networks. We define the Barron space and show that it is the right space for two-layer neural network models in the sense that optimal direct and inverse approximation theorems hold for functions in the Barron space. For residual neural network models, we construct the so-called flow-induced function space, and prove direct and inverse approximation theorems for this space. In addition, we show that the Rademacher complexity for bounded sets under these norms has the optimal upper bounds. △ Less

Submitted 27 March, 2021; v1 submitted 18 June, 2019; originally announced June 2019.

arXiv:1904.05263 [pdf, other]

Analysis of the Gradient Descent Algorithm for a Deep Neural Network Model with Skip-connections

Authors: Weinan E, Chao Ma, Qingcan Wang, Lei Wu

Abstract: The behavior of the gradient descent (GD) algorithm is analyzed for a deep neural network model with skip-connections. It is proved that in the over-parametrized regime, for a suitable initialization, with high probability GD can find a global minimum exponentially fast. Generalization error estimates along the GD path are also established. As a consequence, it is shown that when the target functi… ▽ More The behavior of the gradient descent (GD) algorithm is analyzed for a deep neural network model with skip-connections. It is proved that in the over-parametrized regime, for a suitable initialization, with high probability GD can find a global minimum exponentially fast. Generalization error estimates along the GD path are also established. As a consequence, it is shown that when the target function is in the reproducing kernel Hilbert space (RKHS) with a kernel defined by the initialization, there exist generalizable early-stopping solutions along the GD path. In addition, it is also shown that the GD path is uniformly close to the functions given by the related random feature model. Consequently, in this "implicit regularization" setting, the deep neural network model deteriorates to a random feature model. Our results hold for neural networks of any width larger than the input dimension. △ Less

Submitted 14 April, 2019; v1 submitted 10 April, 2019; originally announced April 2019.

Comments: 29 pages, 4 figures

arXiv:1904.04326 [pdf, other]

doi 10.1007/s11425-019-1628-5

A Comparative Analysis of the Optimization and Generalization Property of Two-layer Neural Network and Random Feature Models Under Gradient Descent Dynamics

Authors: Weinan E, Chao Ma, Lei Wu

Abstract: A fairly comprehensive analysis is presented for the gradient descent dynamics for training two-layer neural network models in the situation when the parameters in both layers are updated. General initialization schemes as well as general regimes for the network width and training data size are considered. In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero… ▽ More A fairly comprehensive analysis is presented for the gradient descent dynamics for training two-layer neural network models in the situation when the parameters in both layers are updated. General initialization schemes as well as general regimes for the network width and training data size are considered. In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero training loss exponentially fast regardless of the quality of the labels. In addition, it is proved that throughout the training process the functions represented by the neural network model are uniformly close to that of a kernel method. For general values of the network width and training data size, sharp estimates of the generalization error is established for target functions in the appropriate reproducing kernel Hilbert space. △ Less

Submitted 20 February, 2020; v1 submitted 8 April, 2019; originally announced April 2019.

Comments: Published version

MSC Class: 41A99; 49M99

Journal ref: Science China Mathematics (2020)

arXiv:1903.02154 [pdf, ps, other]

A Priori Estimates of the Population Risk for Residual Networks

Authors: Weinan E, Chao Ma, Qingcan Wang

Abstract: Optimal a priori estimates are derived for the population risk, also known as the generalization error, of a regularized residual network model. An important part of the regularized model is the usage of a new path norm, called the weighted path norm, as the regularization term. The weighted path norm treats the skip connections and the nonlinearities differently so that paths with more nonlineari… ▽ More Optimal a priori estimates are derived for the population risk, also known as the generalization error, of a regularized residual network model. An important part of the regularized model is the usage of a new path norm, called the weighted path norm, as the regularization term. The weighted path norm treats the skip connections and the nonlinearities differently so that paths with more nonlinearities are regularized by larger weights. The error estimates are a priori in the sense that the estimates depend only on the target function, not on the parameters obtained in the training process. The estimates are optimal, in a high dimensional setting, in the sense that both the bound for the approximation and estimation errors are comparable to the Monte Carlo error rates. A crucial step in the proof is to establish an optimal bound for the Rademacher complexity of the residual networks. Comparisons are made with existing norm-based generalization error bounds. △ Less

Submitted 30 May, 2019; v1 submitted 5 March, 2019; originally announced March 2019.

arXiv:1811.01558 [pdf, other]

Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations

Authors: Qianxiao Li, Cheng Tai, Weinan E

Abstract: We develop the mathematical foundations of the stochastic modified equations (SME) framework for analyzing the dynamics of stochastic gradient algorithms, where the latter is approximated by a class of stochastic differential equations with small noise parameters. We prove that this approximation can be understood mathematically as an weak approximation, which leads to a number of precise and usef… ▽ More We develop the mathematical foundations of the stochastic modified equations (SME) framework for analyzing the dynamics of stochastic gradient algorithms, where the latter is approximated by a class of stochastic differential equations with small noise parameters. We prove that this approximation can be understood mathematically as an weak approximation, which leads to a number of precise and useful results on the approximations of stochastic gradient descent (SGD), momentum SGD and stochastic Nesterov's accelerated gradient method in the general setting of stochastic objectives. We also demonstrate through explicit calculations that this continuous-time approach can uncover important analytical insights into the stochastic gradient algorithms under consideration that may not be easy to obtain in a purely discrete-time setting. △ Less

Submitted 5 November, 2018; originally announced November 2018.

MSC Class: 65K10; 60H10

arXiv:1810.11890 [pdf, other]

doi 10.1103/PhysRevMaterials.3.023804

Active Learning of Uniformly Accurate Inter-atomic Potentials for Materials Simulation

Authors: Linfeng Zhang, De-Ye Lin, Han Wang, Roberto Car, Weinan E

Abstract: An active learning procedure called Deep Potential Generator (DP-GEN) is proposed for the construction of accurate and transferable machine learning-based models of the potential energy surface (PES) for the molecular modeling of materials. This procedure consists of three main components: exploration, generation of accurate reference data, and training. Application to the sample systems of Al, Mg… ▽ More An active learning procedure called Deep Potential Generator (DP-GEN) is proposed for the construction of accurate and transferable machine learning-based models of the potential energy surface (PES) for the molecular modeling of materials. This procedure consists of three main components: exploration, generation of accurate reference data, and training. Application to the sample systems of Al, Mg and Al-Mg alloys demonstrates that DP-GEN can produce uniformly accurate PES models with a minimal number of reference data. △ Less

Submitted 6 February, 2019; v1 submitted 28 October, 2018; originally announced October 2018.

Journal ref: Phys. Rev. Materials 3, 023804 (2019)

arXiv:1810.06397 [pdf, other]

doi 10.4310/CMS.2019.v17.n5.a11

A Priori Estimates of the Population Risk for Two-layer Neural Networks

Authors: Weinan E, Chao Ma, Lei Wu

Abstract: New estimates for the population risk are established for two-layer neural networks. These estimates are nearly optimal in the sense that the error rates scale in the same way as the Monte Carlo error rates. They are equally effective in the over-parametrized regime when the network size is much larger than the size of the dataset. These new estimates are a priori in nature in the sense that the b… ▽ More New estimates for the population risk are established for two-layer neural networks. These estimates are nearly optimal in the sense that the error rates scale in the same way as the Monte Carlo error rates. They are equally effective in the over-parametrized regime when the network size is much larger than the size of the dataset. These new estimates are a priori in nature in the sense that the bounds depend only on some norms of the underlying functions to be fitted, not the parameters in the model, in contrast with most existing results which are a posteriori in nature. Using these a priori estimates, we provide a perspective for understanding why two-layer neural networks perform better than the related kernel methods. △ Less

Submitted 20 February, 2020; v1 submitted 15 October, 2018; originally announced October 2018.

Comments: Published version

MSC Class: 41A46; 41A63; 62J02; 65D05

Journal ref: Communications in Mathematical Sciences, Volume 17(2019)

arXiv:1809.10188 [pdf, other]

Monge-Ampère Flow for Generative Modeling

Authors: Linfeng Zhang, Weinan E, Lei Wang

Abstract: We present a deep generative model, named Monge-Ampère flow, which builds on continuous-time gradient flow arising from the Monge-Ampère equation in optimal transport theory. The generative map from the latent space to the data space follows a dynamical system, where a learnable potential function guides a compressible fluid to flow towards the target density distribution. Training of the model am… ▽ More We present a deep generative model, named Monge-Ampère flow, which builds on continuous-time gradient flow arising from the Monge-Ampère equation in optimal transport theory. The generative map from the latent space to the data space follows a dynamical system, where a learnable potential function guides a compressible fluid to flow towards the target density distribution. Training of the model amounts to solving an optimal control problem. The Monge-Ampère flow has tractable likelihoods and supports efficient sampling and inference. One can easily impose symmetry constraints in the generative model by designing suitable scalar potential functions. We apply the approach to unsupervised density estimation of the MNIST dataset and variational calculation of the two-dimensional Ising model at the critical point. This approach brings insights and techniques from Monge-Ampère equation, optimal transport, and fluid dynamics into reversible flow-based generative models. △ Less

Submitted 26 September, 2018; originally announced September 2018.

arXiv:1808.04258 [pdf, other]

Model Reduction with Memory and the Machine Learning of Dynamical Systems

Authors: Chao Ma, Jianchun Wang, Weinan E

Abstract: The well-known Mori-Zwanzig theory tells us that model reduction leads to memory effect. For a long time, modeling the memory effect accurately and efficiently has been an important but nearly impossible task in developing a good reduced model. In this work, we explore a natural analogy between recurrent neural networks and the Mori-Zwanzig formalism to establish a systematic approach for developi… ▽ More The well-known Mori-Zwanzig theory tells us that model reduction leads to memory effect. For a long time, modeling the memory effect accurately and efficiently has been an important but nearly impossible task in developing a good reduced model. In this work, we explore a natural analogy between recurrent neural networks and the Mori-Zwanzig formalism to establish a systematic approach for developing reduced models with memory. Two training models-a direct training model and a dynamically coupled training model-are proposed and compared. We apply these methods to the Kuramoto-Sivashinsky equation and the Navier-Stokes equation. Numerical experiments show that the proposed method can produce reduced model with good performance on both short-term prediction and long-term statistical properties. △ Less

Submitted 10 August, 2018; originally announced August 2018.

arXiv:1807.01083 [pdf, ps, other]

doi 10.1007/s40687-018-0172-y

A Mean-Field Optimal Control Formulation of Deep Learning

Authors: Weinan E, Jiequn Han, Qianxiao Li

Abstract: Recent work linking deep neural networks and dynamical systems opened up new avenues to analyze deep learning. In particular, it is observed that new insights can be obtained by recasting deep learning as an optimal control problem on difference or differential equations. However, the mathematical aspects of such a formulation have not been systematically explored. This paper introduces the mathem… ▽ More Recent work linking deep neural networks and dynamical systems opened up new avenues to analyze deep learning. In particular, it is observed that new insights can be obtained by recasting deep learning as an optimal control problem on difference or differential equations. However, the mathematical aspects of such a formulation have not been systematically explored. This paper introduces the mathematical formulation of the population risk minimization problem in deep learning as a mean-field optimal control problem. Mirroring the development of classical optimal control, we state and prove optimality conditions of both the Hamilton-Jacobi-Bellman type and the Pontryagin type. These mean-field results reflect the probabilistic nature of the learning problem. In addition, by appealing to the mean-field Pontryagin's maximum principle, we establish some quantitative relationships between population and empirical learning problems. This serves to establish a mathematical foundation for investigating the algorithmic and theoretical connections between optimal control and deep learning. △ Less

Submitted 3 July, 2018; originally announced July 2018.

Comments: 44 pages

Journal ref: Research in the Mathematical Sciences, 6:10 (2019)

arXiv:1807.00297 [pdf, ps, other]

Exponential Convergence of the Deep Neural Network Approximation for Analytic Functions

Authors: Weinan E, Qingcan Wang

Abstract: We prove that for analytic functions in low dimension, the convergence rate of the deep neural network approximation is exponential. We prove that for analytic functions in low dimension, the convergence rate of the deep neural network approximation is exponential. △ Less

Submitted 1 July, 2018; originally announced July 2018.

Showing 1–50 of 65 results for author: E, W