Physics-Informed Neural Networks: Theory and Applications
Cosmin Anitescu∗, Burak İsmail Ateş†, and Timon Rabczuk‡
Institut für Strukturmechanik, Bauhaus-Universität Weimar
∗ cosmin.anitescu@uni-weimar.de
† burak.ismail.ates@uni-weimar.de
‡ timon.rabczuk@uni-weimar.de

Abstract
Methods that seek to employ machine learning algorithms for solving
engineering problems have gained increased interest. Physics-informed
neural networks (PINNs) are among the earliest approaches, which attempt
to employ the universal approximation property of artificial neural
networks to represent the solution field. In this framework, solving the
original differential equation can be seen as an optimization problem,
where we seek to minimize the residual or some energy functional. We
present the main concepts and implementation steps for PINNs, including
an overview of the basics for defining and training an artificial neural
network model. These methods are applied in several numerical examples of
forward and inverse problems, including the Poisson equation, Helmholtz
equation, linear elasticity and hyperelasticity.

1 Introduction
Machine learning (ML) methods based on artificial neural networks (ANNs)
have become increasingly used, particularly in data-rich fields such as
text, image and audio processing, where they have achieved remarkable
results, greatly surpassing the previous state-of-the-art algorithms.
Typically, ML methods are most efficient in applications where the
patterns are difficult to describe by clear-cut rules, such as
handwriting recognition. In these cases, it may be more efficient to
generate the rules by a kind of high-dimensional regression between a
sufficiently large number of input-output pairs. However, other
techniques based on ANNs have also been successful in domains where the
rules are relatively easy to describe, such as AlphaZero [59] for game
playing and AlphaFold [28] for protein folding. Many of these
advancements have been driven by an increase in computational
capabilities, in particular with regard to Graphics Processing Units
(GPUs) and Tensor Processing Units (TPUs) [27], but also by theoretical
advances related to the initialization and architecture of the ANNs. In
the scientific community, there has also been increased interest in
applying the new developments in ANNs and ML to solve partial
differential equations (PDEs) and other engineering problems of interest.
One can distinguish between supervised, unsupervised and reinforcement
learning. In supervised learning, the aim is to find the mapping between
a set of inputs and outputs, such as images of hand-written digits and
the actual digits they represent, so that when a new input is presented,
the correct output can be predicted by the ML algorithm. A prerequisite
for the application of these methods is the availability of labeled data.
In engineering applications, such approaches can be used e.g. for
predicting the solution from the boundary conditions for a given PDE
based on a large set of input/solution pairs of similar problems, see
also operator-approximation methods [36, 41, 38]. However, a drawback is
the requirement for possibly large amounts of labeled data (i.e. solved
examples) drawn from the same distribution as the problems that we would
like to solve in the first place. In contrast, in unsupervised learning
the algorithm aims to find patterns in the input data to produce useful
output based on some hard-coded rules or objectives. In classical ML,
such tasks include image segmentation, dimensionality reduction (such as
principal component analysis or PCA), or different types of clustering
(grouping unlabeled data based on similarities or differences).
Furthermore, there is a middle-ground category of semi-supervised
learning, where a mixture of labeled and unlabeled data is used in an
attempt to overcome some of the shortcomings of the first two categories.
Related to this is the concept of reinforcement learning, where an
agent-based system seeks to learn the actions that maximize a reward
function.
Physics-informed neural networks (PINNs) are more closely related to
unsupervised or semi-supervised learning, whereby satisfying the governing
equations, including the boundary conditions, at a given set of
collocation points defines the objective function. This idea was
originally proposed during the 1990s in [33, 32] and further extended to
domains with irregular boundaries in [34]. As training neural networks
became cheaper, further developments were reported in [52, 60] among
others, including the extension to time-dependent problems and model
parameter inference (i.e. inverse problems). In [52], the term PINN is
first used, along with the concept of combining (possibly noisy)
experimental data with the governing equation in a small-data or
semi-supervised setting. Since then, several improvements have been
suggested, such as adaptively choosing the collocation points [3, 68],
variational formulations [69, 55, 29], and domain decomposition
approaches [58]. Moreover, PINNs have been applied to a wide variety of
problems, such as hyperelasticity [46], multiphase poroelasticity [19],
Kirchhoff plates [71], the eikonal equation [64], biophysics [31], quantum
chemistry [50], materials science [57] and others.

In this chapter, we give a concise overview of the main ideas of PINNs,
focusing on the implementation and potential applications to forward and
inverse problems for PDEs. In Section 2, we introduce the building blocks
required to create and train a neural network model, while in Section 3 we
present the collocation and energy minimization approaches, along with a
discussion of enforcing the boundary conditions. In Section 4, we present
pedagogical numerical examples of standard PINNs for problems that are
feasible to compute on regular desktops or even mobile computers,
followed by some concluding remarks in Section 5.

2 Basics of Artificial Neural Networks


An artificial neural network (ANN) is loosely modeled after the structure
of the brain, which is made up of a large number of cells (neurons) that
communicate with their neighbors through electrical signals.
Mathematically, an ANN can be seen as a function
$u_{NN} : \mathbb{R}^n \to \mathbb{R}^m$, which maps n inputs into m
outputs. An ANN is a universal function approximator [23]. Therefore,
$u_{NN}$ can be used to interpolate some unknown function from the data
given at certain points, or to approximate the solution of a partial
differential equation. The function $u_{NN}$ depends on a collection of
parameters (called trainable parameters) which are obtained by an
optimization procedure with the goal of minimizing some user-defined
objective or loss function.
In an ANN, the neurons, or computational units, are organized in lay-
ers which are connected by composition with an activation function as de-
tailed below. Different types of layers (and activation functions) can be as-
sembled together, according to the application and the information known
about the function to be approximated. There are several types of ANNs,
which include fully-connected feed-forward networks, convolutional neural
networks (CNNs), recurrent neural networks (RNNs), residual neural net-
works (ResNets), transformers and others. In the following, we will focus
mostly on the feed-forward neural networks which are among the simplest
and can also be used as building blocks of more complicated architectures.

2.1 Feed-forward neural networks


In this type of network, also called a multi-layer perceptron (MLP), the
output is obtained by successive compositions of a linear transformation
and a nonlinear activation function. The network consists of an input
layer, an output layer and any number of intermediate hidden layers. The
function $u_{NN}$ for a network with an n-dimensional input, an
m-dimensional output and k hidden layers can be written as:

$$u_{NN} = L_k \circ L_{k-1} \circ \ldots \circ L_0 \qquad (1)$$

with

$$L_i(x_i) = \sigma_i(W_i x_i + b_i) = x_{i+1} \quad \text{for } i = 0, \ldots, k. \qquad (2)$$

Here $W_i$ are matrices of size $m_i \times n_i$, with $n_0 = n$,
$n_{i+1} = m_i$, and $m_k = m$, $x_{i+1}$ and $b_i$ are column vectors of
size $m_i$, and the activation functions $\sigma_i$ are applied
element-wise to the vectors $W_i x_i + b_i$. The entries of the matrices
$W_i$ are called weights and those of the vectors $b_i$ are called biases,
and together they represent the trainable parameters of the neural
network. For k > 0, the values $m_0, \ldots, m_{k-1}$ can be chosen freely
and represent the number of neurons in each hidden layer. If the number of
hidden layers k > 1, we say that $u_{NN}$ is a deep neural network. A
schematic of a feed-forward network with 3 neurons in the input layer, two
hidden layers with 4 and 5 neurons, respectively, and an output layer
consisting of 2 neurons is shown in Figure 1. In a typical application,
many inputs are collected in a batch and evaluated together. Evaluating
the output of the neural network involves mainly linear algebra operations
(such as matrix and vector products) which can be easily parallelized. In
machine learning frameworks, such as TensorFlow [1], PyTorch [48] or JAX
[5], a computational graph is built to record the different operations.
This allows for efficient evaluation and also for computing the gradients
by automatic differentiation methods as will be detailed in Section 2.3.

Figure 1: A fully-connected feed-forward neural network with the input,
hidden and output layers.
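As a minimal sketch of equations (1) and (2), the NumPy snippet below evaluates a network with the layer sizes of Figure 1 on a batch of inputs. The helper names `init_params` and `forward`, the random normal weights, and the choice of tanh in the hidden layers with a linear last layer are assumptions made for illustration; they are not taken from the chapter's own implementation.

```python
import numpy as np

def init_params(layer_sizes, rng):
    """Random weights W_i of shape (m_i, n_i) and zero biases b_i of size m_i."""
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.standard_normal((n_out, n_in))
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def forward(params, x, hidden_activation=np.tanh):
    """Evaluate u_NN = L_k o ... o L_0 for a batch x of shape (batch, n)."""
    *hidden, (W_out, b_out) = params
    for W, b in hidden:
        x = hidden_activation(x @ W.T + b)   # sigma_i(W_i x_i + b_i), row by row
    return x @ W_out.T + b_out               # linear activation in the last layer

rng = np.random.default_rng(0)
params = init_params([3, 4, 5, 2], rng)      # layer sizes of Figure 1
batch = rng.standard_normal((10, 3))         # 10 inputs evaluated together
outputs = forward(params, batch)
print(outputs.shape)                         # (10, 2)
```

Storing each input as a row lets the whole batch pass through a layer as one matrix product, which is the operation that parallelizes well.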

2.2 Activation Functions


Several types of activation functions can be considered depending on the
task at hand. We will briefly describe a few popular ones in the following
subsections.

2.2.1 Linear activation
The simplest activation function is the linear activation, which means
that σ is simply the identity function:

$$\sigma(x) = x. \qquad (3)$$

In a network with no hidden layers, a linear activation function between
the input and output layers can be used to perform a linear regression
between the input and output data. For networks with one or more hidden
layers, stacking only linear layers is not useful, since a composition of
linear (affine) maps is still linear. However, linear layers can be
combined with other non-linear activation functions. For example, a
linear layer can be used as the last layer to scale the output to
arbitrary values. A non-trainable linear transformation is often used to
normalize the input of a network to speed up the training (optimization)
process, as will be detailed in Section 2.3.3.
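The remark that stacked linear layers remain linear can be checked numerically. The sketch below uses arbitrary random matrices (an assumption for illustration only) and compares two stacked linear layers with the single layer obtained by multiplying them out.

```python
import numpy as np

rng = np.random.default_rng(1)
W0, b0 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W1, b1 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)

two_layers = W1 @ (W0 @ x + b0) + b1      # two stacked linear layers
W, b = W1 @ W0, W1 @ b0 + b1              # the equivalent single linear layer
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```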

2.2.2 Rectified Linear Units


One of the simplest non-linear activation functions is the piecewise
linear rectified linear unit (ReLU) function, defined as:

$$\sigma(x) = \max(0, x). \qquad (4)$$

It can easily be seen that a single hidden layer with ReLU activation,
followed by a linear activation layer, can represent piecewise linear
functions in one dimension exactly [55]. Indeed, on a grid with nodes
$x_0 < x_1 < \ldots < x_n$, the finite element linear hat function
$N_i(x)$ can be written as:

$$N_i(x) = \frac{1}{h_i}\,\mathrm{ReLU}(x - x_{i-1}) - \left(\frac{1}{h_i} + \frac{1}{h_{i+1}}\right)\mathrm{ReLU}(x - x_i) + \frac{1}{h_{i+1}}\,\mathrm{ReLU}(x - x_{i+1}) \qquad (5)$$

where $h_i = x_i - x_{i-1}$. This observation can be extended to higher
dimensions, where two hidden layers are enough to approximate piecewise
linear simplex elements in two and more dimensions [20]. Further error
bounds for the approximation of ReLU networks in Sobolev norms are given
e.g. in [49, 18].
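The hat function in (5) can be verified with a few lines of plain Python. The node positions below are arbitrary values chosen for illustration, and the function name `hat` is not from the chapter.

```python
def relu(x):
    return max(0.0, x)

def hat(x, x_prev, x_i, x_next):
    """Linear hat function N_i built from three ReLU units, as in (5)."""
    h_i = x_i - x_prev
    h_next = x_next - x_i
    return (relu(x - x_prev) / h_i
            - (1.0 / h_i + 1.0 / h_next) * relu(x - x_i)
            + relu(x - x_next) / h_next)

# Non-uniform grid segment with nodes x_{i-1} = 0, x_i = 0.5, x_{i+1} = 2.
for x in (-1.0, 0.0, 0.25, 0.5, 1.25, 2.0, 3.0):
    print(x, round(hat(x, 0.0, 0.5, 2.0), 12))
```

The function equals one at the middle node, zero at and beyond the two neighboring nodes, and is linear in between, which is the defining property of the finite element hat function.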

2.2.3 Sigmoid
The sigmoid activation, also known as the logistic function, is defined as:

$$\sigma(x) = \frac{1}{1 + \exp(-x)}. \qquad (6)$$

This function has an S-shaped form, as shown in Figure 2b. The range of
this function is the interval (0, 1), therefore it is often used in the
output layer of neural networks for binary classification tasks, where
the output is the probability that the input belongs to a given class. The
function is also infinitely differentiable, resulting in a smooth
approximation, which is desirable for many applications.

Figure 2: Commonly used non-linear activation functions: (a) ReLU,
(b) Sigmoid, (c) Tanh, (d) Swish.

2.2.4 Hyperbolic Tangent


The hyperbolic tangent activation function is defined as

$$\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}. \qquad (7)$$

This function looks similar to the sigmoid activation, maintaining the overall
S-shape and smoothness. An important difference is that the range of the
outputs is (−1, 1) which is centered at 0. This makes the tanh activation more
suitable for deep networks without creating a bias towards positive outputs.

2.2.5 Swish
The swish activation function is defined as:

$$\mathrm{swish}(x) = \frac{x}{1 + \exp(-x)} = x \cdot \sigma(x), \qquad (8)$$

where σ(x) is the sigmoid activation. The plot of this function is shown in
Figure 2d. The swish function looks similar to the ReLU activation. However,
like sigmoid and tanh, it is infinitely differentiable.

We note that there are several other activations that have been proposed
which are similar to ReLU and swish, such as Leaky ReLU [42], exponential
linear units (ELUs) [8], Gaussian error linear units (GELUs) [22], Mish [45]
and others. These have been shown to remedy some of the drawbacks of the
previously considered activation functions and provide a modest improvement
on some machine learning tasks, particularly related to image-based classi-
fication and segmentation tasks [37]. However, from the point of view of
function approximation where partial derivatives are involved, tanh or swish
are also well suited due to their smoothness properties.
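For reference, the activation functions (4), (6), (7) and (8) can be written directly in Python. This is only a sketch using the standard `math` module and is intended for moderate arguments, since `math.exp` overflows for very large inputs. The identity tanh(x) = 2σ(2x) − 1 used in the last line is a standard relation that is not stated in the text above.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def swish(x):
    return x * sigmoid(x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, relu(x), sigmoid(x), tanh(x), swish(x))

# tanh is a shifted and rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
print(abs(tanh(0.7) - (2.0 * sigmoid(1.4) - 1.0)) < 1e-12)   # True
```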

2.2.6 Adaptive activation functions


In addition to the standard activation functions, which are fixed at each
layer, so-called adaptive activations have been proposed, which depend on
some model-dependent or trainable parameters. In particular, for a given
activation function σ(x), we can define the adaptive version by:

$$\sigma_a(x) = \sigma(ax). \qquad (9)$$

The idea of using trainable parameters in the activation function was
proposed in [2], and further developed in the context of function and PDE
solution approximation in [25, 24, 26, 57] among others. Some adaptive or
trainable activation functions have a different form, for example the
original Swish activation proposed in [53] is of the form:

$$\sigma_\beta(x) = \frac{x}{1 + \exp(-\beta x)}, \qquad (10)$$

where β is either a trainable or user-defined parameter. In some cases,
using an adaptive activation can improve the results on classification
tasks by a modest amount, usually an increase of 0.5-2% in the accuracy
[4]. A similar improvement can be seen for function approximation,
although the overall complexity of the architecture is increased.
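A small sketch of (9) and (10) follows. The helper names `adaptive` and `swish_beta` and the parameter values are chosen for illustration; in an actual model, a and β would be stored alongside the weights and updated by the optimizer.

```python
import math

def adaptive(sigma, a):
    """Return the adaptive version x -> sigma(a * x) of an activation, as in (9)."""
    return lambda x: sigma(a * x)

def swish_beta(x, beta=1.0):
    """Swish with parameter beta, as in (10)."""
    return x / (1.0 + math.exp(-beta * x))

steep_tanh = adaptive(math.tanh, 5.0)
print(math.tanh(0.2), steep_tanh(0.2))   # a > 1 sharpens the transition near 0
print(swish_beta(2.0, beta=1.0))         # beta = 1 recovers the swish in (8)
print(swish_beta(2.0, beta=20.0))        # large beta approaches ReLU(2.0) = 2.0
```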

2.3 Training
As mentioned earlier, the training process involves optimizing the network
parameters (weights and biases) such that an objective function is
minimized. Suppose the loss function is denoted by
$\mathcal{L}(u_{NN}(x; \theta))$, where $u_{NN}$ is the neural network
and θ represents the trainable parameters, e.g. the matrices $W_i$ and
vectors $b_i$ in (2). In the case of regression, a commonly used loss
function is the mean square error, defined as:

$$\mathcal{L}_{MSE}(u_{NN}(x_j; \theta)) = \frac{1}{N} \sum_{j=1}^{N} |u_{NN}(x_j; \theta) - y_j|^2, \qquad (11)$$

where $x_j$, j = 1, ..., N are input points at which the ground truth
output values $y_j$ are known. For the case of PDE approximations, more
complicated loss functions which contain the partial derivatives of
$u_{NN}$ with respect to the inputs can be devised. Additional terms can
be used to incorporate the governing equations and boundary conditions,
as will be detailed in Section 3. Then the process of training a neural
network can be described as:

$$\text{Find } \theta^* = \arg\min_{\theta} \mathcal{L}(u_{NN}(x; \theta)). \qquad (12)$$

We note that $\mathcal{L}(u_{NN}(x; \theta))$ is usually based on the
evaluation of $u_{NN}$ or its derivatives at a finite number of points
(called training points); therefore, $\theta^*$ will in general depend on
the number and location of these points. A careful choice of
$\mathcal{L}$ and a proper weighting between its terms is therefore key to
ensuring that the training is successful and the output generalizes well
to new inputs.
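The mean square error (11) is short enough to write out. In the sketch below, a straight line stands in for $u_{NN}$ purely for illustration, and the names `mse_loss` and `line` are hypothetical.

```python
import numpy as np

def mse_loss(predict, theta, x, y):
    """Mean square error (11) of the model predict(x, theta) against targets y."""
    residual = predict(x, theta) - y
    return np.mean(np.abs(residual) ** 2)

# Hypothetical stand-in for u_NN: a straight line with theta = (slope, offset).
def line(x, theta):
    return theta[0] * x + theta[1]

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0                          # ground truth values y_j
print(mse_loss(line, (2.0, 1.0), x, y))    # 0.0 at the exact parameters
print(mse_loss(line, (2.0, 0.0), x, y))    # 1.0, since every residual is -1
```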

2.3.1 Forward and back propagation


Finding the optimal weights is usually done by gradient-based meth-
ods, such as gradient descent. Parallelization and automatic differentiation
methods are key ingredients in efficient implementations. The optimization
method requires the gradients of a possibly large number of trainable param-
eters, with many networks containing tens of millions of parameters. Some
are even larger, for example the GPT-3 language model uses 175 billion pa-
rameters [13]. Therefore, reverse-mode differentiation, also known as back-
propagation [54], is commonly used to compute the gradients with respect to
the trainable parameters.
The differentiation process involves a forward pass, during which the
neural network output and the loss function are evaluated from a given
input and the operations involved are recorded in a graph. Then the
derivatives are computed in reverse order of the evaluation, with the
intermediate results obtained from the chain rule stored at the graph
nodes (see e.g. Chapter 6.5 in [16] for details). The remarkable outcome
of this procedure is that the partial derivatives of the loss function
with respect to all the parameters can be evaluated at a cost that is
proportional to the number of floating point operations involved in the
forward evaluation.
Using forward-mode differentiation, where the partial derivatives are
computed in the order of evaluation, would result in a much higher cost
that is also proportional to the number of parameters, although the
memory requirements may be lower [40]. In general, evaluating the partial
derivatives (Jacobian) of a function $f : \mathbb{R}^n \to \mathbb{R}^m$
requires O(n) passes in forward mode, and O(m) passes in reverse mode,
each pass costing a small multiple of one evaluation of f. In the context
of PDEs, forward-mode differentiation may be more efficient when computing
the partial derivatives of the outputs with respect to the input
coordinates, particularly for multiphysics models or other coupled
problems where several solution fields are considered.
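To make the forward and reverse passes concrete, the following sketch writes out back-propagation by hand for a network with one tanh hidden layer and a mean square error loss, and checks one gradient entry against a central finite difference. The network size, the random data and the function name `loss_and_grads` are assumptions for illustration; frameworks such as TensorFlow, PyTorch or JAX generate the reverse pass automatically.

```python
import numpy as np

def loss_and_grads(params, x, y):
    """Forward pass, then reverse-mode pass, for a 1-hidden-layer tanh network."""
    W1, b1, W2, b2 = params
    # Forward pass: keep the intermediate results needed by the reverse pass.
    z = x @ W1.T + b1            # pre-activations, shape (N, h)
    a = np.tanh(z)               # hidden outputs, shape (N, h)
    u = a @ W2.T + b2            # network output, shape (N, 1)
    r = u - y
    loss = np.mean(r ** 2)
    # Reverse pass: chain rule applied from the loss back to the parameters.
    du = 2.0 * r / r.size        # dL/du
    gW2 = du.T @ a
    gb2 = du.sum(axis=0)
    da = du @ W2
    dz = da * (1.0 - a ** 2)     # derivative of tanh is 1 - tanh^2
    gW1 = dz.T @ x
    gb1 = dz.sum(axis=0)
    return loss, (gW1, gb1, gW2, gb2)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(20, 2))
y = np.sin(np.pi * x[:, :1])
params = [rng.standard_normal((8, 2)), np.zeros(8),
          rng.standard_normal((1, 8)), np.zeros(1)]
loss, grads = loss_and_grads(params, x, y)

# Central finite difference for a single weight, for comparison.
eps = 1e-6
plus = [p.copy() for p in params]
minus = [p.copy() for p in params]
plus[0][0, 0] += eps
minus[0][0, 0] -= eps
fd = (loss_and_grads(plus, x, y)[0] - loss_and_grads(minus, x, y)[0]) / (2 * eps)
print(grads[0][0, 0], fd)
```

One reverse pass returns the gradient with respect to all 33 parameters, whereas the finite difference check needs two extra loss evaluations per parameter.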

2.3.2 Network initialization
When initializing the training process, particular care is needed in the
selection of the initial values. For example, if all the weights and
biases are set to zero, then the gradients with respect to the weights
within a layer will have the same value. In a gradient descent update
with a fixed step size, all the parameters will be updated by the same
amount, resulting in the equivalent of a network with a single neuron per
layer. Part of the recent success of deep neural networks in applications
is owed to better techniques for initializing the values of the network
parameters, such as Glorot (Xavier) [14] and He [21] initialization.
While the initialization method can be seen as a hyper-parameter which
can be tuned according to the problem at hand, a commonly used one is
Glorot uniform, where the weights are chosen from a uniform distribution
U[−l, l], where

$$l = \sqrt{\frac{6}{n_{in} + n_{out}}}, \qquad (13)$$

with $n_{in}$ and $n_{out}$ being the number of input and output neurons
for a given layer. The biases are initialized to zero. This is also the
default initialization used in the TensorFlow deep learning framework.
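A sketch of Glorot uniform initialization following (13) is given below. The function name `glorot_uniform` and the layer size are illustrative; the variance check relies on the fact that a uniform distribution on [−l, l] has variance l²/3, which here equals 2/(n_in + n_out).

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    """Weights drawn from U[-l, l] with l from (13); biases set to zero."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    W = rng.uniform(-limit, limit, size=(n_out, n_in))
    b = np.zeros(n_out)
    return W, b

rng = np.random.default_rng(0)
W, b = glorot_uniform(64, 64, rng)
print(np.abs(W).max() <= np.sqrt(6.0 / 128))   # True: all weights within [-l, l]
print(W.var(), 2.0 / (64 + 64))                # sample variance vs. l^2 / 3
```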

2.3.3 Data normalization


It can be observed that the nonlinear region of most activation functions
σ(x), such as the ones in Figure 2, is concentrated in a small interval
around x = 0. Therefore, if the input data is in a region far away from
the origin, then the activation will be mostly constant or linear, which
will hinder the performance of gradient descent methods (see also
Subsection 2.4.2). To remedy this issue, it is essential to normalize the
input data, which is just a linear transformation into the interval
[−1, 1]. In particular, for each input neuron, the transformation is given
by the formula:

$$T_{norm}(x) = \frac{2 \cdot (x - x_{min})}{x_{max} - x_{min}} - 1, \qquad (14)$$

where $x_{max}$ and $x_{min}$ are the maximum and minimum input values,
respectively. When the input values are points in the computational
domain, $x_{min}$ and $x_{max}$ represent the bounding box of the domain.
These values must be kept the same for training and testing, otherwise
incorrect results will be obtained.
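The normalization (14) can be sketched as follows, with the bounds computed once from the training inputs and then reused for any later evaluation. The sample points and the helper names `fit_bounds` and `normalize` are made up for illustration.

```python
import numpy as np

def fit_bounds(x_train):
    """Per-input minimum and maximum, i.e. the bounding box of the training points."""
    return x_train.min(axis=0), x_train.max(axis=0)

def normalize(x, x_min, x_max):
    """Linear map (14) of each input coordinate onto [-1, 1]."""
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0

x_train = np.array([[0.0, 10.0], [2.0, 30.0], [4.0, 20.0]])
x_min, x_max = fit_bounds(x_train)
print(normalize(x_train, x_min, x_max))

# Test points reuse the bounds fixed during training.
x_test = np.array([[1.0, 25.0]])
print(normalize(x_test, x_min, x_max))   # [[-0.5  0.5]]
```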

2.4 Testing and validation


After a neural network is trained, it is expected to output useful
results. However, in most cases it is not feasible to train the network
indefinitely or until the loss function stops decreasing (up to machine
precision). Moreover, the number of training points and the number of
layers and neurons must be correlated in the sense that, for optimal
results, a larger number of parameters requires a larger number of
training points to avoid overfitting. The performance of the network is
then measured by testing and validating the output.
In standard machine learning tasks, it is common to partition the avail-
able data into training/testing/validation subsets. The training data is used
in the optimization procedure (12) for finding the optimal trainable param-
eters (weights and biases). The validation data is used to monitor the per-
formance of the network by just evaluating the loss function. Tuning the
network hyperparameters, such as the type of activation function, network
size and optimization algorithm may require some trial and error. Although
the validation data is not used directly in the optimization process, it may
indirectly create a bias in the process of hyperparameter tuning. Therefore,
when the performance on the validation data is satisfactory, the network may
be further validated using the test set. A typical split is to use 80% of the
data for training and ca. 20% for testing and validation, although these ra-
tios may vary depending on the problem at hand. For example, in the case
of physics informed neural networks, where training and testing data are just
points in the domain, it may be useful to test the network by generating
many more points from a higher resolution sample. We note that in most
machine learning models, optimization is the most computationally intensive
part. Therefore the amount of training data is most closely related to the
amount of random access memory (RAM) and numerical (floating point) op-
erations required, while testing (evaluating) the model is comparatively much
cheaper.
In this section, we describe some of the pitfalls involved in training and
testing a network, and the countermeasures that can be implemented.

2.4.1 Underfitting and overfitting


Several types of approximation pathologies can be encountered in the pro-
cess of training a neural network, among which underfitting and overfitting
are some of the most common. Underfitting can occur when the neural net-
work does not have enough approximation capability to satisfactorily fit the
data or solve the problem at hand. It can also occur when the optimization
has not converged, for example because too few iterations have been per-
formed, or because the learning rate is too low or too high. Underfitting can
be typically identified when both the training and validation losses are higher
than acceptable values.
Overfitting, on the other hand, can appear when the network capacity is
larger than required. In this scenario, the training data is well
approximated but other data points may be far off from the actual values,
or in machine learning parlance, the model “does not generalize” well. A
similar case, where fitting a small data set exactly does not guarantee
that the target function is well approximated, occurs in interpolation by
high-order polynomials, where the interpolant can oscillate wildly between
the interpolation points. When overfitting occurs, the training loss
decreases to a low value (even zero), while the validation loss can be
much higher.
A good strategy to avoid overfitting or underfitting is to monitor both
the training and validation losses and to stop the training when the
validation loss begins to increase. To illustrate, the results for
regression of the function u(x) = sin(πx) for x ∈ [−1, 1] are shown in
Figure 3. A random uniform noise with magnitude in the interval (0, 0.1)
was added to the training and validation data, which consist of 201 and
50 points, respectively. A neural network with two hidden layers
consisting of 64 neurons has been used, together with the tanh activation
function for the first two layers and linear activation in the last layer.
The ADAM optimizer with the default parameters and a learning rate of
0.001 is used to minimize the mean square error of the difference between
the predicted and training values.
We observe from Figure 3a that after 300 iterations, the neural network
starts to approximate the sinusoidal function, but it is still quite far
from the actual shape (underfitting). The training loss value at this
stage is 0.1129, while the validation loss is 0.1393. After 10000
iterations, the approximation is already quite good, with only a small
error between the prediction and the actual function (without noise) as
shown in Figure 3b. Here, the training loss is 0.0033 and the validation
loss is quite close at 0.0035. Next, if we continue training, we observe
that after many more iterations, the training and validation losses start
to diverge (see Figure 3d). After 100000 iterations, we notice that the
predicted function has some oscillations and spikes as it tries to
capture the noise in the data, as shown in Figure 3c. At this stage, the
training loss is 0.0026 and the validation loss is 0.0037.
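The stopping strategy described above is often implemented with a "patience" counter. The sketch below is one possible version: the function name `early_stopping`, the patience value and the list of validation losses are all made up for illustration and are not the values behind Figure 3.

```python
def early_stopping(val_losses, patience=3):
    """Return (stop_index, best_index) for a sequence of validation losses.

    Training stops once the validation loss has not improved for `patience`
    consecutive checks; best_index marks the parameters worth keeping.
    """
    best_index, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_index, best_loss = i, loss
        elif i - best_index >= patience:
            return i, best_index
    return len(val_losses) - 1, best_index

# Hypothetical validation losses recorded at regular intervals during training.
history = [0.14, 0.021, 0.0035, 0.0036, 0.0036, 0.0037, 0.0039]
print(early_stopping(history, patience=3))   # (5, 2)
```

In practice the parameters at `best_index` would be saved as a checkpoint and restored once the stopping criterion triggers.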

2.4.2 Vanishing and exploding gradients


Two other types of problems encountered in training of artificial neural
networks are those related to the magnitude of the gradients. The vanishing
gradients phenomenon occurs when the derivative of the loss function with
respect to the training variables is very small. This can mean that the objec-
tive function is very close to a stationary point, which can also be a saddle
point or some other point far from the global minimum. The end result is
very slow or no convergence of the loss function. A common remedy for this
problem is to perform a normalization of the input data (see also Subsection
2.3.3). Otherwise, changing the network architecture or the activation func-
tion (for example using rectified activations like ReLU or Swish) may also
be helpful, since S-shaped activations like sigmoid or hyperbolic tangent are
particularly susceptible to vanishing gradients.
Figure 3: Fitting a noisy function and the loss convergence history:
(a) Underfit, 300 iterations; (b) Proper fit, 10000 iterations;
(c) Overfit, 100000 iterations; (d) Loss convergence (train and
validation loss versus iteration number).

Exploding gradients, in contrast, refer to the occurrence of too large
derivatives of the loss function with respect to the trainable
parameters. In extreme cases, the gradients can overflow, resulting in
not-a-number (NaN) values for the loss. Another possible effect is
unstable training, where the loss value oscillates without converging to
the optimal value. Possible remedies for this problem include using a
smaller learning rate, and adding residual (or skip) connections to the
neural network [51].
The ReLU activation function may suffer from a related problem known
as “dying ReLU”, which occurs when some neurons become inactivated, in
the sense that they always output zero for all the inputs. This can happen
when a large negative bias value is learned for a particular neuron. Because
the derivative of the constant zero function is also zero, it is not possible
to recover a “dead” ReLU neuron, resulting in a diminished approximation
capability.
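The following toy calculation illustrates both effects with scalar "networks". The chain of sigmoid units with unit weights and zero biases, and the ReLU neuron with weight 1 and bias −5, are assumptions chosen only to make the numbers easy to follow.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def chain_gradient(x, depth):
    """d(output)/d(input) for `depth` sigmoid units in series with unit weights."""
    grad = 1.0
    for _ in range(depth):
        grad *= sigmoid_grad(x)   # chain rule: multiply the local derivatives
        x = sigmoid(x)
    return grad

print(sigmoid_grad(0.0))          # 0.25, the largest value the derivative can take
print(sigmoid_grad(10.0))         # tiny: the unit is saturated
print(chain_gradient(0.0, 10))    # below 0.25**10, i.e. less than 1e-6

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

# A "dead" ReLU neuron: with bias -5 the pre-activation is negative for every
# input in [-1, 1], so both its output and its gradient are always zero.
w, b = 1.0, -5.0
print([relu_grad(w * x + b) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)])
```

Since each sigmoid factor is at most 0.25, the product shrinks at least geometrically with depth, which is the vanishing gradient effect described in this subsection.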

2.5 Optimizers
We will now briefly describe the optimization algorithms commonly used to
train a neural network (i.e. to minimize the loss function). First, we
mention that two types of optimization strategies can be employed:
full-batch training and mini-batch training. In the former, the entire
data set is used during a forward pass through the network and the
gradients with respect to all the data points are computed in one step.
In mini-batch training, on the other hand, the training data is split
into several subsets of (approximately) the same size called
mini-batches. Then an optimization sub-step is taken with respect to each
mini-batch. When the entire data set has been seen by the optimization
algorithm once, a training epoch is completed. In general, first-order
optimization methods, like gradient descent, are commonly used with
mini-batch training, while algorithms that make use of (approximations
of) second derivative information use full-batch training. A detailed
survey of optimization methods used in machine learning has been
presented in [62].

2.5.1 Stochastic Gradient Descent


The gradient descent method is the simplest gradient-based optimizer. The
idea is to decrease the function by moving from the current guess in the
direction of the negative gradient, scaled by a fixed step size (also
called the learning rate). If the objective function is
$\mathcal{L}(w)$, then an optimization step can be written as:

$$w^{(t+1)} := w^{(t)} - \eta \nabla_w \mathcal{L}(w^{(t)}), \qquad (15)$$

where η is the learning rate. In the case of mini-batch training, since
the mini-batches are typically randomly selected, the method is called
stochastic gradient descent (SGD). Using mini-batches has been shown to
improve the robustness, which can help the optimizer find the global
optima (or better local optima) even for non-convex problems [9, 44].
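A minimal mini-batch SGD loop implementing the update (15) might look as follows. The test problem (fitting a straight line to noise-free data), the hyperparameter values and the function names `sgd` and `mse_grad` are assumptions for illustration.

```python
import numpy as np

def sgd(grad_fn, w, data, lr=0.1, batch_size=8, epochs=50, seed=0):
    """Mini-batch stochastic gradient descent using the update (15)."""
    rng = np.random.default_rng(seed)
    x, y = data
    n = len(x)
    for _ in range(epochs):                    # one epoch = one pass over the data
        order = rng.permutation(n)             # random mini-batches
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w = w - lr * grad_fn(w, x[idx], y[idx])
    return w

def mse_grad(w, x, y):
    """Gradient of mean((w[0] * x + w[1] - y)**2) with respect to w."""
    r = w[0] * x + w[1] - y
    return np.array([2.0 * np.mean(r * x), 2.0 * np.mean(r)])

x = np.linspace(-1.0, 1.0, 64)
y = 3.0 * x - 0.5
w = sgd(mse_grad, np.zeros(2), (x, y))
print(w)   # close to [3.0, -0.5]
```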

2.5.2 Adaptive Moment Estimation (ADAM)


This optimization method, proposed in [30], replaces the fixed learning
rate of the conventional SGD with a variable step size based on the
momentum, which can be seen as a linear combination of the gradients of
the current and previous time steps.
An update of the ADAM optimizer from step t to step t + 1 has the form:

$$m^{(t+1)} := \beta_1 m^{(t)} + (1 - \beta_1) \nabla_w \mathcal{L}(w^{(t)}) \quad \text{(biased first moment)} \qquad (16)$$

$$v^{(t+1)} := \beta_2 v^{(t)} + (1 - \beta_2) (\nabla_w \mathcal{L}(w^{(t)}))^2 \quad \text{(biased second moment)} \qquad (17)$$

$$\hat{m} := \frac{m^{(t+1)}}{1 - \beta_1^{t+1}} \quad \text{(unbiased first moment)} \qquad (18)$$

$$\hat{v} := \frac{v^{(t+1)}}{1 - \beta_2^{t+1}} \quad \text{(unbiased second moment)} \qquad (19)$$

$$w^{(t+1)} := w^{(t)} - \eta \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon} \quad \text{(weights update)}. \qquad (20)$$

Here m and v are the moment vectors, which are initialized to zeros,
$\beta_1$, $\beta_2$ and $\epsilon$ are constants which are usually set to
$\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$, and η is
the learning rate. $\beta_1^{t+1}$ and $\beta_2^{t+1}$ denote $\beta_1$
and $\beta_2$ raised to the power t + 1, and
$(\nabla_w \mathcal{L}(w^{(t)}))^2$ denotes the element-wise squaring of
the gradient vector; the square root and the division in (20) are
likewise applied element-wise. Because the moment vectors are initialized
to zeros, a bias correction is introduced in (18) and (19). This technique
can smooth out the oscillations in the gradients and usually improves the
convergence compared to the standard SGD optimizer.
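The update equations (16)-(20) translate almost line by line into NumPy. In the sketch below, the quadratic test function, the step count and the learning rate of 0.01 are illustrative assumptions; the default values of β1, β2 and ε are the ones quoted above.

```python
import numpy as np

def adam(grad_fn, w, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """ADAM iterations following equations (16)-(20)."""
    m = np.zeros_like(w)                              # first moment
    v = np.zeros_like(w)                              # second moment
    for t in range(steps):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g             # (16)
        v = beta2 * v + (1.0 - beta2) * g ** 2        # (17)
        m_hat = m / (1.0 - beta1 ** (t + 1))          # (18)
        v_hat = v / (1.0 - beta2 ** (t + 1))          # (19)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # (20)
    return w

# Hypothetical test problem: L(w) = sum((w - target)**2), gradient 2 * (w - target).
target = np.array([1.0, -2.0])
grad = lambda w: 2.0 * (w - target)

print(adam(grad, np.zeros(2), steps=1, lr=0.01))      # first step has size ~lr per entry
w = adam(grad, np.zeros(2), steps=2000, lr=0.01)
print(w)                                              # close to target
```

After the bias correction, the very first step has magnitude close to the learning rate in every component, regardless of the size of the gradient.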

2.5.3 Quasi-Newton methods


The gradient descent-based methods approximate the loss at each step
by a linear function without taking into account the curvature information.
Faster convergence can be obtained by using Newton algorithms, which in-
volve computing the second derivatives. Nevertheless, for a large number of
parameters, the cost of Newton’s method in terms of memory storage and
floating point operations can be prohibitive, since the Hessian matrix has
size n × n, where n is the number of parameters. A more feasible alterna-
tive is the family of quasi-Newton methods, like the Broyden–Fletcher–Gold-
farb–Shanno (BFGS) algorithm [6, 12, 15, 56] or the limited memory version
L-BFGS [39], which are already implemented in machine learning frameworks
like PyTorch or TensorFlow Probability [11]. Another algorithm that can
be used for problems with a small number of parameters is the Levenberg-
Marquardt algorithm [35, 43], which can be seen as a combination of the
Gauss-Newton method and gradient descent.
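To make the quasi-Newton idea concrete, the following sketch implements a basic (dense, not limited-memory) BFGS iteration in NumPy. It builds an approximation H of the inverse Hessian from gradient differences only. The backtracking line search, the tolerances and the small quadratic test problem are assumptions chosen for illustration; production implementations such as those in the frameworks mentioned above are considerably more elaborate.

```python
import numpy as np

def bfgs(f, grad, w0, tol=1e-6, max_iter=100):
    """Basic BFGS with Armijo backtracking; H approximates the inverse Hessian."""
    w = np.array(w0, dtype=float)
    n = w.size
    H = np.eye(n)                        # initial inverse-Hessian approximation
    g = grad(w)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                       # quasi-Newton search direction
        alpha = 1.0
        for _ in range(30):              # backtracking line search
            if f(w + alpha * p) <= f(w) + 1e-4 * alpha * (g @ p):
                break
            alpha *= 0.5
        s = alpha * p
        w_new = w + s
        g_new = grad(w_new)
        y = g_new - g
        ys = y @ s
        if ys > 1e-12:                   # curvature condition keeps H positive definite
            rho = 1.0 / ys
            I = np.eye(n)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        w, g = w_new, g_new
    return w

# Hypothetical convex quadratic: f(w) = 0.5 w^T A w - b^T w, minimized where A w = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda w: 0.5 * w @ A @ w - b @ w
grad = lambda w: A @ w - b
w_star = bfgs(f, grad, [0.0, 0.0])
print(w_star)
```

The matrix H has size n × n, which is exactly the storage cost that L-BFGS avoids by keeping only a few recent pairs (s, y).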

3 Physics Informed Neural Networks


In the following, we focus on physics informed neural networks, by
which we mean artificial neural networks that incorporate the residuals
of the PDE to be solved into the loss function. In most cases, a simple, fully-
connected feed-forward network is used; however, some important differences
can be noted in the form of the objective function, in particular regarding
whether the strong or weak form of the PDE is used.

3.1 Collocation method


The classical PINNs are collocation-based, meaning that the neural net-
work aims to approximate the strong form of the governing equation at a
set of collocation points. Because the collocation points can be randomly
distributed inside the domain and no mesh is needed, this method belongs to
the category of mesh-free methods. Moreover, once the “building blocks” for
constructing the neural network and evaluating the partial derivatives with
respect to the inputs are obtained, the implementation is relatively simple.
In particular, suppose that the governing PDE is of the form:
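$$\mathcal{F}\left(u(\mathbf{x}), \frac{\partial u(\mathbf{x})}{\partial x_1}, \ldots\right) = 0 \quad \text{for } \mathbf{x} \in \Omega \qquad (21)$$

$$\mathcal{G}\left(u(\mathbf{x}), \frac{\partial u(\mathbf{x})}{\partial n}, \ldots\right) = 0 \quad \text{for } \mathbf{x} \in \Gamma, \qquad (22)$$
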
where F represents a differential operator for the domain interior, G is a dif-
ferential operator for the boundary conditions, u is the unknown function,
Ω and Γ are the computational domain and its boundary, and n is the outer
normal vector to the boundary. The interior differential operator may con-
tain any order of derivatives with respect to the inputs, while the boundary
operator may contain any order of derivative with respect to the outer normal
vector for Neumann-type boundary conditions.
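The loss function for a neural network $u_{NN}(\mathbf{x}; \theta)$ with trainable parameters
θ (which include the weights and biases for each layer) can be constructed
based on the “mean square error” (MSE) evaluated at a set of $N_{int}$ interior
collocation points $\{\mathbf{x}_i^{int}\}$, $i = 1, \ldots, N_{int}$ and a set of $N_{bnd}$ boundary
collocation points $\{\mathbf{x}_j^{bnd}\}$, $j = 1, \ldots, N_{bnd}$ as:

$$L_{coll}(\theta) = \frac{\lambda_1}{N_{int}} \sum_{i=1}^{N_{int}} \mathcal{F}\left(u_{NN}(\mathbf{x}_i^{int}; \theta), \frac{\partial u_{NN}(\mathbf{x}_i^{int}; \theta)}{\partial x_1}, \ldots\right)^2 + \frac{\lambda_2}{N_{bnd}} \sum_{j=1}^{N_{bnd}} \mathcal{G}\left(u_{NN}(\mathbf{x}_j^{bnd}; \theta), \frac{\partial u_{NN}(\mathbf{x}_j^{bnd}; \theta)}{\partial n}, \ldots\right)^2. \qquad (23)$$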

Here λ1 and λ2 are weight terms; usually choosing λ2 >> λ1 helps to speed up
convergence by ensuring that the boundary conditions are satisfied. Adaptive
methods for choosing the weights have also been proposed in [67]. In case of
time-dependent problems, the classical PINNs use a space-time discretization,
where the time is considered as an additional dimension.
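The structure of (23) can be illustrated without a neural network by evaluating the loss for two closed-form candidate functions. The sketch below assumes a hypothetical one-dimensional model problem, u'' + π² sin(πx) = 0 on (0, 1) with u(0) = u(1) = 0, and hand-computed derivatives; in an actual PINN the derivatives would come from automatic differentiation.

```python
import numpy as np

def collocation_loss(res_int, res_bnd, lam1=1.0, lam2=1.0):
    """Weighted mean square of interior and boundary residuals, following (23)."""
    return lam1 * np.mean(res_int**2) + lam2 * np.mean(res_bnd**2)

rng = np.random.default_rng(0)
x_int = rng.uniform(0.0, 1.0, 200)     # randomly distributed points, no mesh needed
x_bnd = np.array([0.0, 1.0])
source = lambda x: np.pi**2 * np.sin(np.pi * x)

# Candidate 1: the exact solution u = sin(pi x), with u'' = -pi^2 sin(pi x).
loss_exact = collocation_loss(
    -np.pi**2 * np.sin(np.pi * x_int) + source(x_int),
    np.sin(np.pi * x_bnd),
)

# Candidate 2: u = x (1 - x), with u'' = -2. It satisfies the boundary
# conditions but not the differential equation.
loss_wrong = collocation_loss(
    -2.0 + source(x_int),
    x_bnd * (1.0 - x_bnd),
    lam2=100.0,
)
print(loss_exact, loss_wrong)
```

The first loss is zero up to rounding, while the second stays large regardless of the boundary weight, because only its interior residual is non-zero.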

3.2 Energy minimization method


In the energy minimization method, we seek to minimize an energy func-
tional, which is usually based on the weak (variational) form of the PDE. In
many scientific modeling tasks, an energy functional appears naturally from
the physical laws involved (for example the principle of minimum potential
energy in structural mechanics). Suppose the functional to minimize is denoted
by $\mathcal{J}(u)$, which can be decomposed into an interior term and a boundary
term:
$$\mathcal{J}(u) = \int_{\Omega} H_{int}(u)\, d\Omega + \int_{\Gamma} H_{bnd}(u)\, d\Gamma, \qquad (24)$$

with Γ denoting the portion of the boundary over which the boundary term
is evaluated. Then we can define the loss function of the form:
$$L_{energy}(\theta) = \int_{\Omega} H_{int}(u_{NN})\, d\Omega + \int_{\Gamma} H_{bnd}(u_{NN})\, d\Gamma. \qquad (25)$$

The integrals in (25) are usually approximated by numerical integration,
using a finite set of quadrature points $\{\mathbf{q}_i^{int}\}$ and weights $\{w_i^{int}\}$ with
$i = 1, \ldots, Q_{int}$ for the interior integral and quadrature points $\{\mathbf{q}_j^{bnd}\}$ and weights
$\{w_j^{bnd}\}$ with $j = 1, \ldots, Q_{bnd}$ for the boundary integral, i.e.

$$L_{energy}(\theta) \approx \sum_{i=1}^{Q_{int}} H_{int}(u_{NN}(\mathbf{q}_i^{int}))\, w_i^{int} + \sum_{j=1}^{Q_{bnd}} H_{bnd}(u_{NN}(\mathbf{q}_j^{bnd}))\, w_j^{bnd}. \qquad (26)$$

When additional constraints are needed, such as Dirichlet boundary conditions,
additional terms can be added to the loss function, similarly to (23).
Alternatively, one can impose the Dirichlet boundary conditions strongly (i.e.
exactly) by modifying the output of the neural network to match the prescribed
boundary data. In particular, we require the computed solution ũN N
to satisfy ũN N (x) = uD (x) for x ∈ ΓD , where uD is the Dirichlet
data prescribed on the boundary ΓD , and take it to be of the form:

ũN N (x) = g(x) + d(x)uN N (x), (27)

where g is a smooth extension of uD such that g(x) = uD (x) for x ∈ ΓD and
d is a distance function such that d(x) = 0 for x ∈ ΓD and d(x) > 0 otherwise.
When uN N is vector-valued, then we multiply each of its components by d(x).
This ensures that the output ũN N satisfies exactly the boundary conditions
(i.e. ũN N (x) = uD (x) for x ∈ ΓD ), although the choice of g(x) and d(x)
requires some care [61].
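A one-dimensional illustration of (27) may help. In the sketch below, the interval [0, 1], the boundary values, the linear extension g and the choice d(x) = x(1 − x) are all assumptions made for the example, and `u_nn` is a randomly initialized stand-in for a network rather than a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(1, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 1)), rng.normal(size=1)

def u_nn(x):
    """Stand-in for an untrained network with one tanh hidden layer."""
    return np.tanh(x[:, None] @ W1 + b1) @ W2 + b2

def u_tilde(x, u_left=1.0, u_right=2.0):
    """Transformed output g(x) + d(x) * u_NN(x), as in (27), on [0, 1]."""
    g = u_left * (1.0 - x) + u_right * x    # smooth extension of the boundary data
    d = x * (1.0 - x)                       # zero at x = 0 and x = 1, positive inside
    return g + d * u_nn(x)[:, 0]

x = np.linspace(0.0, 1.0, 11)
u = u_tilde(x)
print(u[0], u[-1])
```

Whatever the network parameters are, the transformed output reproduces the prescribed values at both ends, so no boundary term is needed in the loss for these conditions.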
In general, the energy minimization method tends to be less computation-
ally demanding than the collocation based PINNs, due to the fact that, e.g.
only the first order derivatives need to be computed for solving a second order
problem. However, it requires an integration mesh and it is more difficult to
verify that the solution is correct within a certain tolerance, since the objec-
tive function should converge to some non-zero minimum which is not known
in advance. A possible approach to overcome this problem is to compute the
residual loss for validation, which can then also be used to adaptively adjust
the number of integration points, as proposed in [17].
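The quadrature rule (26) and the remark about the non-zero minimum can both be seen in a small example. The sketch below assumes the model problem −u'' = f on (0, 1) with homogeneous Dirichlet conditions, for which the energy is J(u) = ∫ (½ u'² − f u) dx and there is no boundary term; the source f = π² sin(πx) and the Gauss-Legendre rule with 20 points are illustrative choices.

```python
import numpy as np

def energy(u, du, f, n_quad=20):
    """Approximate J(u) = int_0^1 (0.5 u'^2 - f u) dx by Gauss-Legendre quadrature."""
    xi, w = np.polynomial.legendre.leggauss(n_quad)   # nodes and weights on [-1, 1]
    q = 0.5 * (xi + 1.0)                              # map the nodes to [0, 1]
    wq = 0.5 * w                                      # rescale the weights
    return np.sum((0.5 * du(q)**2 - f(q) * u(q)) * wq)

f = lambda x: np.pi**2 * np.sin(np.pi * x)

# Exact solution u = sin(pi x).
J_exact = energy(lambda x: np.sin(np.pi * x),
                 lambda x: np.pi * np.cos(np.pi * x), f)

# Perturbed candidate that still satisfies the boundary conditions.
J_pert = energy(
    lambda x: np.sin(np.pi * x) + 0.1 * np.sin(2 * np.pi * x),
    lambda x: np.pi * np.cos(np.pi * x) + 0.2 * np.pi * np.cos(2 * np.pi * x),
    f,
)
print(J_exact, J_pert)
```

For this problem the minimum energy equals −π²/4, which is not zero, and any admissible perturbation raises it. Only first derivatives of the candidate are required, in contrast to the second derivatives needed by the collocation loss for the same equation.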

4 Numerical Applications
By using a small set of training or input data (e.g., initial and boundary
conditions and/or measured data) as well as governing physical laws, PINNs
attempt to approximate the solution of the problem. Complex nonlinear sys-
tems and phenomena in physics and engineering are described by differential
equations.
PINNs have shown their capabilities to solve both forward and inverse
problems in science and engineering. A forward problem can be defined as
a problem of finding a particular effect of a given cause utilizing a physical
or mathematical model, whereas an inverse problem refers to finding causes
from the given effects [63]. We can investigate the one-dimensional steady-state
heat equation with the source term to give more concrete examples of
forward and inverse problems.
Let us consider a rod with unit length along the x-axis and the heat
flowing through this rod with a heat source as our model. We can represent
the temperature at location x on the rod as T (x). Under certain assumptions,
such as the rod being perfectly insulated, with the source term q(x) being
known, then the governing equation can be written as:
$$\kappa \frac{d^2 T}{dx^2} + q(x) = 0 \qquad (28)$$
where κ > 0 is the thermal diffusivity constant. Finding temperature at any
location on the rod is a forward problem. On the other hand, finding the
constant κ, which is a rod feature, from observed temperature data is a good
example of an inverse problem. These examples will be detailed in Section
4.1 and 4.2.
To summarize, the aforementioned procedures explained in the previous
sections to solve differential equations with PINNs will become tangible with
numerical applications in this section. The solution estimation of PINNs for
both forward and inverse problems will be discussed by providing simple and
complex examples.

4.1 Forward problems


In the introductory part of this section, the definition of a forward problem
is given as finding the particular effect of a given cause using a physical or
mathematical model.

4.1.1 One dimensional steady-state heat equation


Let us remember the one dimensional steady state heat conduction prob-
lem with a heat source. As we discussed before, the governing equation for
this example is given in (28). Let the thermal diffusivity constant be κ = 0.5
and x denote the location on the rod. Here the source term is given as
q(x) = 15x − 2. We assume that the temperatures at both ends are 0. Then
we can re-write (28) as:
$$\begin{aligned}
\frac{d^2 T}{dx^2} + \frac{q(x)}{\kappa} &= 0, \quad x \in [0, 1] \\
q(x) &= 15x - 2 \\
\kappa &= 0.5 \\
T(0) = T(1) &= 0
\end{aligned} \qquad (29)$$
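For reference, this boundary value problem can be integrated in closed form, which gives the analytical solution that the network prediction is later compared against (it is the expression quoted in the caption of Figure 5). A short derivation, using only the data stated above:

```latex
\begin{aligned}
\frac{d^2 T}{dx^2} &= -\frac{q(x)}{\kappa} = -\frac{15x - 2}{0.5} = -30x + 4, \\
\frac{dT}{dx} &= -15x^2 + 4x + C_1, \\
T(x) &= -5x^3 + 2x^2 + C_1 x + C_2, \\
T(0) = 0 &\;\Rightarrow\; C_2 = 0, \qquad
T(1) = 0 \;\Rightarrow\; -5 + 2 + C_1 = 0 \;\Rightarrow\; C_1 = 3, \\
T(x) &= -5x^3 + 2x^2 + 3x .
\end{aligned}
```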
The first step to solve this problem is to discretize the domain with uniform
or randomly sampled collocation points. Then the neural network will process
these collocation points through its linearly connected layers consisting
of neurons with non-linear activation functions. Of course, the outcome of
the first forward propagation will not be compatible with the true solution.
Therefore, at this point, the physics and boundary knowledge will guide the
neural network to approximate the ground truth by updating the weights
and biases of the neural network. Let us elaborate on this step by step and
reinforce these steps with code snippets. Note that these codes are written
with TensorFlow version 2.x with the Keras API.
We first generate 100 equidistant points in our domain. Here the choice
of the number of points is up to the user. However, it should be noted that
the number of points also has some influence on the number of iterations or
network size required to have results with similar accuracy. The ADAM op-
timizer with a learning rate of 0.005 is used for this example. An input layer,
three hidden layers with 32 neurons equipped with tanh activation function,
and an output layer form the neural network (see Figure 4). The input and
output layers have one neuron each since the input for the network is only
one spatial dimension, and the output is the temperature at these points.
By setting the number of iterations to 1000 and introducing the boundary
condition data in TensorFlow tensors, we complete the initial settings of our
model (see Listing 1).
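Before turning to the TensorFlow code, it may help to see what this architecture amounts to in plain array operations. The sketch below is a NumPy illustration only: the helper names `init_mlp` and `forward` are hypothetical, and the weights are drawn from an untruncated normal distribution with Glorot-style scaling, which only approximates the initializer used in the listings.

```python
import numpy as np

def init_mlp(layer_sizes, seed=123):
    rng = np.random.default_rng(seed)
    params = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        std = np.sqrt(2.0 / (fan_in + fan_out))      # Glorot-style scaling
        W = rng.normal(0.0, std, (fan_in, fan_out))
        params.append((W, np.zeros(fan_out)))
    return params

def forward(params, x):
    a = x
    for W, b in params[:-1]:
        a = np.tanh(a @ W + b)       # hidden layers with tanh activation
    W, b = params[-1]
    return a @ W + b                 # linear output layer

# One input, three hidden layers with 32 neurons each, one output.
params = init_mlp([1, 32, 32, 32, 1])
x = np.linspace(0.0, 1.0, 100)[:, None]    # 100 equidistant points, shape (100, 1)
T_pred = forward(params, x)
n_params = sum(W.size + b.size for W, b in params)
print(T_pred.shape, n_params)
```

Counting the weights and biases layer by layer gives 64 + 1056 + 1056 + 33 = 2209 trainable parameters for this network.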

Listing 1: Initial settings for the heat equation

import tensorflow as tf

# We set seeds initially. This feature controls the randomization of
# variables (e.g. initial weights of the network).
# By doing so, we can reproduce the same results.
tf.random.set_seed(123)
# 100 equidistant points in the domain are created
x = tf.linspace(0.0, 1.0, 100)
# boundary conditions T(0)=T(1)=0 and \kappa are introduced.
bcs_x = [0.0, 1.0]
bcs_T = [0.0, 0.0]
bcs_x_tensor = tf.convert_to_tensor(bcs_x)[:, None]
bcs_T_tensor = tf.convert_to_tensor(bcs_T)[:, None]
kappa = 0.5
# Number of iterations
N = 1000
# ADAM optimizer with learning rate of 0.005
optim = tf.keras.optimizers.Adam(learning_rate=0.005)

# Function for creating the model
def buildModel(num_hidden_layers, num_neurons_per_layer):
    tf.keras.backend.set_floatx("float32")
    # Initialize a feedforward neural network
    model = tf.keras.Sequential()

    # Input is one dimensional (one spatial dimension)
    model.add(tf.keras.Input(shape=(1,)))

    # Append hidden layers
    for _ in range(num_hidden_layers):
        model.add(
            tf.keras.layers.Dense(
                num_neurons_per_layer,
                activation=tf.keras.activations.get("tanh"),
                kernel_initializer="glorot_normal",
            )
        )

    # Output is one-dimensional
    model.add(tf.keras.layers.Dense(1))

    return model

# determine the model size (3 hidden layers with 32 neurons each)
model = buildModel(3, 32)

Figure 4: The architecture of the feed-forward neural network for the one-
dimensional steady state heat conduction problem. The network consists
of one input layer, one output layer, and three hidden layers with 32 neu-
rons each. a is the activation function. Superscripted numbers denote the
layer number, and subscripted ones denote the neuron number in the relevant
layer.

Then we define our loss function, which is composed of two parts, the
boundary loss, and physics loss, as formulated in (30). Here, the loss term
tells us how far away our model is from ’reality’. For the measure of these
loss terms, we will use mean square error formulation, which is mentioned in
Section 2.3 and (11).
$$L_{Loss} = L_{BCs} + L_{Physics} \qquad (30)$$
Constructing the boundary loss is easier compared to the physics loss. Our
model’s assumptions should be compatible with the prescribed boundary con-
ditions, which are T (0) = 0 and T (1) = 0 for our case. Thus, our goal should
be to minimize the mean square error between our model’s temperature pre-
diction at both ends of the rod and the real temperature values at these
points, which must be 0. The boundary condition loss is given by (31).
$$L_{BCs} = \frac{\lambda_1}{N_B} \sum_{j=1}^{N_B = 2} |T_{NN}(x_j) - y_j|^2, \qquad (31)$$

where NB = 2 since we have boundary condition data for two points which
are T (0) = T (1) = 0. The regularization term λ1 is taken as 1.
We also need to provide information about the interior points to get rea-
sonable results. Although we do not know the temperature data for interme-
diate points on the rod, we know those points have to satisfy some physical
laws that we derived in (28). Or in other words, our temperature prediction
needs to satisfy (28). When we take the derivative of the temperature pre-
diction of the network with respect to x two times and sum this result with
the source term q(x) divided by κ, this summation must yield 0. Thus, the
physics loss for our example becomes:
$$L_{Physics} = \frac{\lambda_2}{N_P} \sum_{j=1}^{N_P = 100} \left| \frac{d^2 T_{NN}}{dx^2}\bigg|_{x = x_j} + \frac{q(x_j)}{\kappa} \right|^2. \qquad (32)$$

Again, the regularization term λ2 is taken as 1. Now we can combine the
boundary conditions loss and physics loss functions to form our model’s loss
function (see (30)), which will guide the model to make better predictions in
each iteration.
Listing 2: Loss function definition for the heat equation

# Boundary loss function
def boundary_loss(bcs_x_tensor, bcs_T_tensor):
    predicted_bcs = model(bcs_x_tensor)
    mse_bcs = tf.reduce_mean(tf.square(predicted_bcs - bcs_T_tensor))
    return mse_bcs

# the first derivative of the prediction
def get_first_deriv(x):
    with tf.GradientTape() as tape:
        tape.watch(x)
        T = model(x)
    T_x = tape.gradient(T, x)
    return T_x

# the second derivative of the prediction
def second_deriv(x):
    with tf.GradientTape() as tape:
        tape.watch(x)
        # the first derivative is computed inside the outer tape
        # so that it can be differentiated once more
        T_x = get_first_deriv(x)
    T_xx = tape.gradient(T_x, x)
    return T_xx

# Source term divided by \kappa
source_func = lambda x: (15 * x - 2) / kappa

# Function for physics loss
def physics_loss(x):
    predicted_Txx = second_deriv(x)
    mse_phys = tf.reduce_mean(tf.square(predicted_Txx + source_func(x)))
    return mse_phys

# Overall loss function
def loss_func(x, bcs_x_tensor, bcs_T_tensor):
    bcs_loss = boundary_loss(bcs_x_tensor, bcs_T_tensor)
    phys_loss = physics_loss(x)
    loss = bcs_loss + phys_loss
    return loss
TensorFlow records the operations on the trainable variables when com-
puting the loss function and calculates the gradients by backpropagation.
Then, the ADAM optimizer with a fixed learning rate of 0.005 minimizes
the loss function. By performing one forward and one back propagation over
the whole data set, one epoch is completed. This procedure is repeated the
specified number of times, which is 1000 for our example.

Listing 3: Training of the heat equation model


# taking gradients of the loss function
def get_grad():
    with tf.GradientTape() as tape:
        # This tape is for derivatives with
        # respect to trainable variables
        tape.watch(model.trainable_variables)
        Loss = loss_func(x, bcs_x_tensor, bcs_T_tensor)
    g = tape.gradient(Loss, model.trainable_variables)
    return Loss, g

# optimizing and updating the weights of the model by using gradients
def train_step():
    # Compute current loss and gradient w.r.t. parameters
    loss, grad_theta = get_grad()
    # Perform gradient descent step
    optim.apply_gradients(zip(grad_theta, model.trainable_variables))
    return loss

# Training loop
for i in range(N + 1):
    loss = train_step()
    # printing the loss value every 100 epochs
    if i % 100 == 0:
        print("Epoch {:05d}: loss = {:10.8e}".format(i, float(loss)))
Once the training process is completed with the desired loss value, we
can validate the output by performing one forward pass with a test data-set
which is typically formed in the same domain as the training data-set. In
our example, the training data was 100 equidistant points between 0 and 1.
We can determine our test data set as 200 equidistant points in the same
domain. Figure 5 shows that the model’s prediction closely matches the analytical
solution.
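The visual comparison can be complemented by a single number, for instance the relative L2 error over the test points. The sketch below assumes the closed-form solution quoted in the caption of Figure 5; since the trained model is not available here, `T_pred` is a stand-in built by adding a small artificial error to the exact values, and in practice it would be replaced by the network output at `x_test`.

```python
import numpy as np

def relative_l2_error(pred, exact):
    """Discrete relative L2 error between a prediction and the reference."""
    return np.linalg.norm(pred - exact) / np.linalg.norm(exact)

exact_solution = lambda x: -5 * x**3 + 2 * x**2 + 3 * x

x_test = np.linspace(0.0, 1.0, 200)      # 200 equidistant test points
T_exact = exact_solution(x_test)

# Stand-in for the model prediction: exact values plus a small artificial error.
T_pred = T_exact + 1e-3 * np.sin(3 * np.pi * x_test)

err = relative_l2_error(T_pred, T_exact)
print("relative L2 error: {:.3e}".format(err))
```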

4.1.2 Two-dimensional linear elasticity example


So far, we have seen the most straightforward application of PINNs to
solve a one-dimensional forward problem. The problem has been described
with a linear second order non-homogeneous ordinary differential equation
with Dirichlet boundary conditions. Then the equation was solved with a
neural network with three hidden layers. This pedagogical example is sup-
posed to help readers to understand the concept of using neural networks in
scientific problems. Now, let us proceed with a more complex application.
Figure 5: Exact solution and the prediction of the model. The predicted
solution coincides with the ground truth, which is $T(x) = -5x^3 + 2x^2 + 3x$.

To this end, consider the cantilever beam model [65, 47, 66], a classical
example in linear elasticity theory. This problem is governed by the well
known equilibrium equation

$$-\nabla \cdot \sigma(\mathbf{x}) = f(\mathbf{x}) \quad \text{for } \mathbf{x} \in \Omega \qquad (33)$$

with the strain-displacement relation given by:

$$\epsilon = \frac{1}{2}\left(\nabla u + \nabla u^T\right) \qquad (34)$$

and Hooke’s law for a linear isotropic elastic solid:

$$\sigma = 2\mu\epsilon + \lambda(\nabla \cdot u) I, \qquad (35)$$

where µ and λ are the Lamé constants, and I is the identity tensor. The
Dirichlet boundary conditions are u(x) = û for x ∈ ΓD and the Neumann
boundary conditions are σn = t̂ for x ∈ ΓN , where n is the normal vector.
For this example (see Figure 6), Ω is a rectangle with corners at (0,0) and
(8,2). Letting x = (x, y), and u = (u, v) the Dirichlet boundary conditions
for x = 0 are:
$$\begin{aligned}
u(x, y) &= \frac{P y}{6EI}\left[(2 + \nu)\left(y^2 - \frac{W^2}{4}\right)\right] \\
v(x, y) &= -\frac{P}{6EI}\left(3\nu y^2 L\right)
\end{aligned} \qquad (36)$$

Commonly, a parabolic traction at x = 8

$$p(x, y) = P\, \frac{y^2 - yW}{2I} \qquad (37)$$

is applied where P = 2 is the maximum traction, $E = 10^3$ is Young’s modulus,
ν = 0.25 is the Poisson ratio and $I = \frac{bW^3}{12}$ is the second moment of area of the
cross-section. The dimensions of the beam in x, y and z directions are L = 8,
W = 2 and b = 1, respectively (Figure 6).


Figure 6: The illustration of 2-D elasticity problem.

The objective of this problem is to find the displacements of the beam in the
x and y directions. In order to solve the problem, uniformly spaced collocation
points are first created in the domain. The numbers of collocation points are
80 and 40 in the x and y directions, respectively, as shown in Figure 7. The
prediction of the neural network shall satisfy the equilibrium equation (33)
and the constitutive law (35) as well as the boundary conditions (36).
The strong imposition of the Dirichlet boundary conditions is explained in
detail in Section 3.2. In this example, we strongly imposed the Dirichlet bound-
ary conditions at x = 0. Therefore, there is no need to insert collocation
points along the y-axis where x = 0, as illustrated in Figure 7.
In this example, a fully connected feed-forward neural network with three
hidden layers and with 20 neurons per hidden layer is used with the swish
activation function. The neural network constructed for this problem is
depicted in Figure 8. The ADAM and L-BFGS optimizers are used together: the ADAM
optimizer was used for the first 15000 iterations, and then the optimization of
the parameters continued with the L-BFGS optimizer for a further 500
iterations.
After completing the training procedure, the model is tested with a new
set of data. For the test data, the number of uniformly spaced collocation
points was doubled in the same domain. One forward pass is performed to
see the results for the test data. The solution obtained with the model and
the exact solution for the displacements of the beam in x and y directions
are plotted in Figure 9.
The approximation obtained by the neural network is very close to the
analytical solution. The relative L2 error in the approximation is $5.899 \times 10^{-5}$.
The error distribution can be seen in Figure 10. The errors for the displace-
ments in the x and y directions are of the order of $10^{-6}$ and $10^{-5}$, respectively.
The model makes less accurate predictions for the displacements around the
beam’s free end than around the fixed end.


Figure 7: Collocation points on the Timoshenko cantilever beam. 80 points
in x direction and 40 points in y direction. Red points stand for boundary
points whereas the blue points represent interior collocation points. Since
Dirichlet boundary conditions are strongly imposed, the collocation points
along y-axis where x=0 are not needed.

4.1.3 Three-dimensional hyperelasticity


As the last example of this section, a hyper-elasticity problem presented
in [55] will be discussed. We will solve this particular problem with the deep
energy method shown in Section 3.2. The objective of the problem is to obtain
the displacements for a 3D hyper-elastic cuboid made of an isotropic, homo-
geneous material subjected to prescribed twisting, body forces, and traction
forces. In order to obtain optimal parameters of the neural network, the po-
tential energy formulation for the body will be used as the loss function. The
governing equations and boundary conditions for this problem are written
as:
$$\begin{aligned}
&\nabla \cdot P + f_b = 0, \\
&\text{Dirichlet boundary}: \; u = \bar{u} \;\text{ on } \partial\Omega_D, \\
&\text{Neumann boundary}: \; P \cdot n = \bar{t} \;\text{ on } \partial\Omega_N,
\end{aligned} \qquad (38)$$
where ū is the prescribed displacement given on the Dirichlet boundary and t̄
is the prescribed traction at the Neumann boundary; n denotes the outward
unit normal vector, P is the first Piola-Kirchhoff stress tensor and fb is the body
force. The potential energy functional of this problem is given by [55]
$$\varepsilon(\varphi) = \int_{\Omega} \Psi\, dV - \int_{\Omega} f_b \cdot \varphi\, dV - \int_{\partial\Omega_N} \bar{t} \cdot \varphi\, dA, \qquad (39)$$

where Ψ is the strain energy density and ϕ indicates the mapping of points
on the body from the initial/undeformed to the deformed state.


Figure 8: The architecture of the feed-forward neural network for the Timo-
shenko beam problem. The network consists of one input layer, one output
layer, and 3 hidden layers. There are 20 neurons per hidden layer. 2 neu-
rons in the input layer take x and y coordinates, and the output neurons
give displacements in u and v directions. a is the activation function that is
swish in this example. Superscripted numbers denote the layer number, and
subscripted ones denote the neuron number in the relevant layer.

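In order to obtain optimal parameters of the neural network, the poten-
tial energy (39) is parameterized by the neural network’s prediction for the
displacements. Thus, the loss function reads:

$$L(p) = \int_{\Omega} \Psi(\varphi(X; p))\, dV - \int_{\Omega} f_b \cdot \varphi(X; p)\, dV - \int_{\partial\Omega_N} \bar{t} \cdot \varphi(X; p)\, dA. \qquad (40)$$

If we rewrite (40) in a discrete form, it becomes

$$L(p) \approx \frac{V_{\Omega}}{N_{\Omega}} \sum_{i=1}^{N_{\Omega}} \Psi((\varphi_p)_i) - \frac{V_{\Omega}}{N_{\Omega}} \sum_{i=1}^{N_{\Omega}} (f_b)_i \cdot (\varphi_p)_i - \frac{A_{\partial\Omega_N}}{N_{\partial\Omega_N}} \sum_{i=1}^{N_{\partial\Omega_N}} \bar{t}_i \cdot (\varphi_p)_i, \qquad (41)$$

in which $V_{\Omega}$ is the volume and $N_{\Omega}$ is the number of data points within the
solid; $N_{\partial\Omega_N}$ and $A_{\partial\Omega_N}$ denote the number of points on the surface subjected
to the force and the surface area, respectively.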
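The discrete form (41) is simply a set of sample means scaled by the volume and by the loaded surface area. The following NumPy sketch shows this structure; all numerical values are placeholders chosen so that the result can be checked by hand, and the function name `discrete_potential_energy` is an assumption of this example. In the actual method, the strain energy densities and the deformed positions would be computed from the network output.

```python
import numpy as np

def discrete_potential_energy(psi, f_b, phi_int, t_bar, phi_bnd, volume, area):
    """Evaluate (41): sample means scaled by the volume and the loaded area."""
    strain_term = volume * np.mean(psi)
    body_term = volume * np.mean(np.sum(f_b * phi_int, axis=1))
    traction_term = area * np.mean(np.sum(t_bar * phi_bnd, axis=1))
    return strain_term - body_term - traction_term

n_int, n_bnd = 1000, 100
psi = np.full(n_int, 2.0)                         # placeholder strain energy densities
f_b = np.tile([0.0, -0.5, 0.0], (n_int, 1))       # body force at interior points
phi_int = np.tile([1.0, 2.0, 3.0], (n_int, 1))    # placeholder deformed positions
t_bar = np.tile([1.0, 0.0, 0.0], (n_bnd, 1))      # traction at surface points
phi_bnd = np.tile([1.0, 2.0, 3.0], (n_bnd, 1))

L = discrete_potential_energy(psi, f_b, phi_int, t_bar, phi_bnd,
                              volume=1.25, area=1.0)
print(L)
```

With these placeholder values the three terms are 2.5, −1.25 and 1.0, so the result is 2.5 + 1.25 − 1.0 = 2.75.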
Figure 9: Predicted and exact values for displacements on a Timoshenko
cantilever beam in x and y directions: (a) predicted displacements in x-axis,
(b) exact displacements in x-axis, (c) predicted displacements in y-axis,
(d) exact displacements in y-axis.

Let us now consider a 3D cuboid of length L = 1.25, width W = 1.0 and
depth H = 1.0. It is fixed at the left surface and twisted 60° counter-clockwise
by the boundary conditions $u|_{\Gamma_1}$ at the right-end surface. Also, at the lateral
surfaces, a body force $f_b = [0, -0.5, 0]^T$ and traction forces $\bar{t} = [1, 0, 0]^T$ are
applied (see Figure 11). The Dirichlet boundary conditions for this particular
problem are:
$$\begin{aligned}
u|_{\Gamma_0} &= [0, 0, 0]^T, \\
u|_{\Gamma_1} &= \begin{bmatrix}
0 \\
0.5\,[0.5 + (X_2 - 0.5)\cos(\pi/3) - (X_3 - 0.5)\sin(\pi/3) - X_2] \\
0.5\,[0.5 + (X_2 - 0.5)\sin(\pi/3) + (X_3 - 0.5)\cos(\pi/3) - X_3]
\end{bmatrix}
\end{aligned} \qquad (42)$$
The Neo-Hookean model is assumed in this problem. The material prop-
erties are shown in Table 1.
Description                 Value
E - Young’s modulus         10^6
ν - Poisson ratio           0.3
µ - Lamé parameter          E / (2(1 + ν))
λ - Lamé parameter          Eν / ((1 + ν)(1 − 2ν))

Table 1: Material properties and parameters for hyperelastic cuboid
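The quantities in Table 1 and the twist boundary condition (42) are straightforward to evaluate. The sketch below uses the standard conversion from Young's modulus and Poisson ratio to the Lamé parameters and transcribes (42) directly; the function names are hypothetical and the code is intended only as a check of the setup, not as part of the original implementation.

```python
import numpy as np

def lame_parameters(E, nu):
    """Lamé parameters from Young's modulus E and Poisson ratio nu."""
    mu = E / (2.0 * (1.0 + nu))
    lam = E * nu / ((1.0 + nu) * (1.0 - 2.0 * nu))
    return mu, lam

def twist_displacement(X2, X3, angle=np.pi / 3):
    """Prescribed displacement on the twisted face, transcribed from (42)."""
    u1 = np.zeros_like(X2)
    u2 = 0.5 * (0.5 + (X2 - 0.5) * np.cos(angle)
                - (X3 - 0.5) * np.sin(angle) - X2)
    u3 = 0.5 * (0.5 + (X2 - 0.5) * np.sin(angle)
                + (X3 - 0.5) * np.cos(angle) - X3)
    return np.stack([u1, u2, u3], axis=-1)

mu, lam = lame_parameters(E=1e6, nu=0.3)
u_center = twist_displacement(np.array([0.5]), np.array([0.5]))
print(mu, lam, u_center)
```

For E = 10^6 and ν = 0.3 this gives µ ≈ 3.846 × 10^5 and λ ≈ 5.769 × 10^5, and the point (X2, X3) = (0.5, 0.5) on the twisted face is not displaced, which is consistent with a rotation about the axis through the centre of the cross-section.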

We now proceed with determining the network parameters. In each
direction, 40 equally spaced points, 64000 points in total, are placed over the
whole domain (see Figure 12a). The neural network consists of 3 hidden
layers, and each hidden layer has 30 neurons with a tanh activation function.
The input and output layers have three neurons corresponding to coordinates
of the initial configuration of the designated points and their deformed coor-
dinates after loading, respectively. The network is trained with 50 iterations
and the parameters are optimized by the L-BFGS optimizer.
The predicted deformed shape of the cuboid is given in Figure 12b. A line
passing through two points on the cuboid, A(0.625, 1, 0.5) and B(0.625, 0, 0.5),
is drawn to compare the displacement predictions and the real displacements
on the line. We showed in [55] that a neural network with the same setup
but trained with 25 steps has an error in the L2 norm of 0.13210, whereas
the finite element model has an error of 0.13275 for estimating the displacements
on the line AB.

Figure 10: The difference between exact solution and predicted solution for
displacements on the beam in x and y directions: (a) estimation error for
displacements in x-axis, (b) estimation error for displacements in y-axis.

Figure 11: The 3-D hyperelastic cuboid is fixed at the left hand side and
twisted 60° counter-clockwise.

Figure 12: Training points on the cuboid and its predicted deformed shape
after training: (a) 64000 equidistant points over the domain, (b) deformed
shape of 3D hyperelastic cuboid.

4.2 Inverse problems


An inverse problem can be considered as inferring features of a model
from the observed data. Finding the elasticity modulus of a beam from its
displacement measurements under certain constraints or inferring the space-
dependent reaction rate of a diffusion-reaction system [70] can be given as
examples for the inverse problems. PINNs have already been used to tackle
several inverse problems existing in unsaturated groundwater flow [10], nano-
optics and meta-materials [7]. We will discuss two inverse problems in this
section to demonstrate the application of PINNs to solve these problems.

4.2.1 Inverse heat equation


In Section 4.1, we examined a one-dimensional steady state heat equation
with a source term. The goal was to find the temperature values along the
rod with boundary data and governing physical laws. Now, let us reconsider
that problem with a slightly different setup that converts the problem into
an inverse one.
Assume we measured temperature at 100 equidistant points on the rod.
Additionally, we have the same source term and the same boundary condi-
tions and aim to find the thermal diffusivity constant κ in (28). Again, the
neural network seeks to predict the temperature values at 100 equidistant
points on the rod. However, the thermal diffusivity constant κ is unknown
this time. Therefore, the model will be trained to obtain the optimum values
of the diffusivity constant as well as the network parameters. Mean square
error between measured temperatures and model prediction will be used as
the loss function. Additionally, a physics loss term that guides the prediction
of the network according to the governing equations will be added to the loss
function.
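As a sanity check on the inverse problem itself, κ can also be estimated without a neural network. The sketch below is not the PINN approach described here: it assumes noise-free measurements synthesized from the closed-form solution, approximates the second derivative by central differences, and fits κ in (28) by least squares.

```python
import numpy as np

exact_T = lambda x: -5 * x**3 + 2 * x**2 + 3 * x   # used only to synthesize data
q = lambda x: 15 * x - 2                           # source term

x = np.linspace(0.0, 1.0, 100)     # 100 equidistant measurement points
T_meas = exact_T(x)
h = x[1] - x[0]

# Central second differences at the 98 interior points.
T_xx = (T_meas[2:] - 2.0 * T_meas[1:-1] + T_meas[:-2]) / h**2
q_int = q(x[1:-1])

# Least-squares estimate of kappa from the residual kappa * T'' + q = 0.
kappa_est = -np.sum(q_int * T_xx) / np.sum(T_xx**2)
print("estimated kappa: {:.6f}".format(kappa_est))
```

Because the exact temperature is a cubic polynomial, the central difference reproduces its second derivative up to rounding, and the estimate recovers κ = 0.5. With noisy measurements the differencing step would amplify the noise, which is one motivation for fitting a smooth network to the data and the governing equation simultaneously.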
At first, initial settings are applied (see Listing 4) similar to the forward
heat conduction problem defined in Section 4.1. However, the constant κ is
not known in advance for this problem. We have an initial guess of κ = 0.1
for the thermal diffusivity constant. The neural network has three hidden
layers with 32 neurons each, and the tanh function is used as the activation
function. The ADAM optimizer optimizes the network parameters with a
fixed learning rate of 0.001. The number of epochs is designated as 6000.

Listing 4: Initial settings for the inverse heat equation

# importing necessary libraries
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# We set seeds initially. This feature starts the model with same random
# variables (e.g. initial weights of the network).
# By doing so, we have the same results whenever the code is run
tf.random.set_seed(123)

# 100 equidistant points in the domain are created
x = tf.linspace(0.0, 1.0, 100)

# boundary conditions which are T(0)=T(1)=0 are introduced.
bcs_x = [0.0, 1.0]
bcs_T = [0.0, 0.0]
bcs_x_tensor = tf.convert_to_tensor(bcs_x)[:, None]
bcs_T_tensor = tf.convert_to_tensor(bcs_T)[:, None]
# initial guess for the unknown constant kappa (trainable)
kappa = tf.Variable([0.1], trainable=True)

# Number of iterations
N = 6000

# ADAM optimizer with learning rate of 0.001
optim = tf.keras.optimizers.Adam(learning_rate=1e-3)

# The exact solution of the problem. It will be used to produce measured data
# and test data
solution = lambda x: -5 * x**3 + 2 * x**2 + 3 * x

def buildModel(num_hidden_layers, num_neurons_per_layer):
    tf.keras.backend.set_floatx("float32")
    # Initialize a feedforward neural network
    model = tf.keras.Sequential()

    # Input is one dimensional (one spatial dimension)
    model.add(tf.keras.Input(shape=(1,)))

    # Append hidden layers
    for _ in range(num_hidden_layers):
        model.add(
            tf.keras.layers.Dense(
                num_neurons_per_layer,
                activation=tf.keras.activations.get("tanh"),
                kernel_initializer="glorot_normal",
            )
        )

    # Output is one-dimensional
    model.add(tf.keras.layers.Dense(1))

    return model

# determine the model size (3 hidden layers with 32 neurons each)
model = buildModel(3, 32)

After defining the model settings, we can proceed with constructing the
loss function. The loss function (43) consists of three parts, namely, boundary
loss, physics loss and data loss.

$$L_{Loss} = L_{BCs} + L_{Physics} + L_{Data} \qquad (43)$$

with
$$\begin{aligned}
L_{BCs} &= \frac{\lambda_1}{N_B} \sum_{i=1}^{N_B} |T_{NN}(x_i) - y_i|^2, \\
L_{Physics} &= \frac{\lambda_2}{N_P} \sum_{j=1}^{N_P} \left| \frac{d^2 T_{NN}}{dx^2}\bigg|_{x = x_j} + \frac{q(x_j)}{\kappa} \right|^2, \\
L_{Data} &= \frac{\lambda_3}{N_D} \sum_{j=1}^{N_D} |T_{NN}(x_j) - y_j|^2
\end{aligned} \qquad (44)$$

Here $N_B$, $N_P$, and $N_D$ correspond to the number of data points for bound-
ary loss, physics loss, and measured data loss, respectively. Regularization
terms λ1 , λ2 , λ3 are taken as 1 in this example; (43) and (44) are defined in
the code as follows:
Listing 5: Loss function for the inverse heat equation

31
@tf.function
def boundary_loss(bcs_x_tensor, bcs_T_tensor):
    # mean squared error between the prediction and the boundary values
    predicted_bcs = model(bcs_x_tensor)
    mse_bcs = tf.reduce_mean(tf.square(predicted_bcs - bcs_T_tensor))
    return mse_bcs


# the first derivative of the prediction
@tf.function
def get_first_deriv(x):
    with tf.GradientTape() as tape:
        tape.watch(x)
        T = model(x)
    T_x = tape.gradient(T, x)
    return T_x


# the second derivative of the prediction
@tf.function
def second_deriv(x):
    with tf.GradientTape() as tape:
        tape.watch(x)
        T_x = get_first_deriv(x)
    T_xx = tape.gradient(T_x, x)
    return T_xx


# Source term
def source_func(x):
    return 15 * x - 2


@tf.function
def physics_loss(x):
    # the residual is evaluated at the interior points only
    x = x[1:-1]
    predicted_Txx = second_deriv(x)
    # residual in the form kappa * T'' + q, i.e. the residual of (44) times kappa
    mse_phys = tf.reduce_mean(
        tf.square(predicted_Txx * kappa + source_func(x)))
    return mse_phys


@tf.function
def data_loss(x):
    x = x[1:-1]
    ob_T = solution(x)[:, None]
    mse_data = tf.reduce_mean(tf.square(ob_T - model(x)))
    return mse_data


@tf.function
def loss_func(x):
    bcs_loss = boundary_loss(bcs_x_tensor, bcs_T_tensor)
    phys_loss = physics_loss(x)
    ob_loss = data_loss(x)
    loss = phys_loss + ob_loss + bcs_loss
    return loss
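Listing 5 evaluates the physics residual as κ T'' + q, whereas (44) writes it as T'' + q/κ. The two forms differ only by the factor κ, so both vanish for the same temperature field, although the resulting loss values differ by κ². A minimal plain-Python sketch of this relation is given below. The field `T_manufactured`, the trial value `kappa_trial`, and the finite-difference helper are assumptions made for illustration; they are not part of the example above, where the derivatives come from automatic differentiation.

```python
# Sketch: compare the two equivalent forms of the physics residual.
# T_manufactured and kappa_trial are illustrative assumptions.

def source_func(x):
    """Source term q(x), as in Listing 5."""
    return 15.0 * x - 2.0


def T_manufactured(x):
    """Hypothetical field with T'' = -30x + 4, so that 0.5 * T'' + q = 0."""
    return -5.0 * x**3 + 2.0 * x**2


def second_deriv_fd(f, x, h=1e-3):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2


def residual_code_form(x, kappa):
    """kappa * T'' + q, the form used in physics_loss."""
    return kappa * second_deriv_fd(T_manufactured, x) + source_func(x)


def residual_eq_form(x, kappa):
    """T'' + q / kappa, the form written in (44)."""
    return second_deriv_fd(T_manufactured, x) + source_func(x) / kappa


def mean_square(residual, points, kappa):
    """Mean squared residual over the collocation points."""
    return sum(residual(x, kappa) ** 2 for x in points) / len(points)


points = [0.1 * j for j in range(1, 10)]  # interior collocation points
kappa_trial = 0.3                         # deliberately wrong value of kappa

mse_code = mean_square(residual_code_form, points, kappa_trial)
mse_eq = mean_square(residual_eq_form, points, kappa_trial)
mse_true = mean_square(residual_code_form, points, 0.5)

# The first two numbers agree; the third is close to zero.
print(mse_code, mse_eq * kappa_trial**2, mse_true)
```

Because the scaling depends on κ, the two forms weight the physics loss differently relative to the boundary and data losses during training, even though they share the same zero-residual solution.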
The training and testing procedures are the same as for the forward problem.
Again, the gradients of the loss function with respect to κ and the trainable
variables, which are the weights and biases of the network, are determined
by backpropagation. The trainable variables and the value of κ are then
updated by the ADAM optimizer using these gradients. This procedure is
repeated for a prescribed number of epochs, after which the loss is expected
to have approached a minimum.

Listing 6: Training
# taking gradients of the loss function w.r.t. trainable variables
# and kappa
@tf.function
def get_grad():
    with tf.GradientTape(persistent=True) as tape:
        # This tape is for derivatives with
        # respect to trainable variables
        tape.watch(model.trainable_variables)
        tape.watch(kappa)
        Loss = loss_func(x)
    g = tape.gradient(Loss, model.trainable_variables)
    g_kappa = tape.gradient(Loss, kappa)
    # a persistent tape holds its resources until it is deleted
    del tape
    return Loss, g, g_kappa


# optimizing and updating the weights and biases of the model and
# kappa by using the gradients
@tf.function
def train_step():
    # Compute current loss and gradient w.r.t. parameters
    loss, grad_theta, grad_kappa = get_grad()

    # Perform gradient descent step
    optim.apply_gradients(zip(grad_theta, model.trainable_variables))
    optim.apply_gradients([(grad_kappa, kappa)])
    return loss
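To illustrate how the optimizer drives κ towards its true value, the following plain-Python sketch applies the Adam update rule [30] to a single scalar parameter. It is a simplified stand-in and not the training loop of Listing 6: the network is replaced by an assumed, exactly known second derivative T''(x) = −q(x)/κ_true, and the learning rate, initial guess, and number of steps are illustrative choices.

```python
import math

# Sketch: Adam applied to the scalar kappa only (no neural network).
# kappa_true, the sample points and all hyperparameters are assumptions.


def source_func(x):
    """Source term q(x), as in Listing 5."""
    return 15.0 * x - 2.0


kappa_true = 0.5
xs = [j / 20.0 for j in range(1, 20)]               # interior points
q = [source_func(x) for x in xs]
T_xx = [-source_func(x) / kappa_true for x in xs]   # assumed known T''


def loss_and_grad(kappa):
    """Mean squared residual of kappa * T'' + q and its derivative in kappa."""
    res = [kappa * t + s for t, s in zip(T_xx, q)]
    loss = sum(r * r for r in res) / len(res)
    grad = sum(2.0 * r * t for r, t in zip(res, T_xx)) / len(res)
    return loss, grad


def adam_fit(kappa0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-7, steps=2000):
    """Run the Adam update rule on kappa; return the estimate and its history."""
    kappa, m, v = kappa0, 0.0, 0.0
    history = [kappa]
    for t in range(1, steps + 1):
        _, g = loss_and_grad(kappa)
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1**t)             # bias correction
        v_hat = v / (1.0 - beta2**t)
        kappa -= lr * m_hat / (math.sqrt(v_hat) + eps)
        history.append(kappa)
    return kappa, history


kappa_est, kappa_history = adam_fit(kappa0=0.1)
print(f"estimated kappa = {kappa_est:.4f}")
```

In the actual example the gradient with respect to κ also depends on the current network prediction, so κ and the network parameters have to be updated together, as in Listing 6.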
The network parameters obtained at the last epoch form our model. We can
test the model with a new data set in the same domain and plot the results to
compare them with the ground truth (see Figure 13a). The value of κ estimated
by the neural network is 0.5000, and the true value of κ is 0.5. Figure
13b illustrates how the estimate of κ converges to the true value as the
network is trained. The relative $L_2$ error norm is $7.575 \times 10^{-6}$.
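The relative $L_2$ error norm can be evaluated from the predicted and exact values at the test points. The sketch below uses the common discrete definition, the norm of the difference divided by the norm of the exact values; this definition is assumed here, and the function name `relative_l2_error` and the sample values are hypothetical.

```python
import math


def relative_l2_error(predicted, exact):
    """Discrete relative L2 error: ||predicted - exact|| / ||exact||."""
    num = math.sqrt(sum((p - e) ** 2 for p, e in zip(predicted, exact)))
    den = math.sqrt(sum(e**2 for e in exact))
    return num / den


# Illustrative values: every prediction is 10% too large.
exact = [1.0, 2.0, 3.0]
predicted = [1.1, 2.2, 3.3]
print(relative_l2_error(predicted, exact))
```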

[Figure 13a: T(x) versus x on [0, 1], exact and predicted solutions; predicted κ = 0.5000, real κ = 0.50]

(a) The exact temperature and the predicted temperature.

[Figure 13b: thermal diffusivity κ versus training iteration (0 to 6000), real value of κ and predicted κ]

(b) Change in the prediction of κ during the training.

Figure 13: The temperature and thermal diffusivity constant prediction of
the neural network and true values.

4.2.2 Inverse Helmholtz equation


The second and last example is the Helmholtz equation, which is
a time-independent form of the wave equation. It is used to describe
problems in electromagnetic radiation, acoustics, and seismology. The
homogeneous form of the Helmholtz equation is written as:

\[
\nabla^2 u + k^2 u = 0 \tag{45}
\]

where $\nabla^2$ is the Laplace operator and k is the wave number. The solution
of the problem is u(x, y) for (x, y) ∈ Ω. We investigate an inverse acoustic
duct problem, adopted from [3], which is governed by a complex-valued
Helmholtz equation: the wave number k is unknown, and u(x, y) is known at
some points in the domain.

We can write (45) with domain information and boundary conditions as:

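\[
\nabla^2 u(x, y) + k^2 u(x, y) = 0, \quad (x, y) \in \Omega, \quad \Omega := (0, 2) \times (0, 1), \tag{46}
\]

with the Neumann and Robin boundary conditions

\[
\begin{aligned}
\frac{\partial u}{\partial n} &= \cos(m \pi y) && \text{at } x = 0, \\
\frac{\partial u}{\partial n} &= -i k u && \text{at } x = 2, \\
\frac{\partial u}{\partial n} &= 0 && \text{for } y = 0 \text{ and } y = 1,
\end{aligned}
\tag{47}
\]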
where m is the mode number, taken as 1. The wave number k is unknown: the
initial guess is k = 1, and the true value is chosen as k = 6.
The exact solution for u(x, y) can be written as:

\[
u(x, y) = \cos(m \pi y) \left( A_1 e^{-i k_x x} + A_2 e^{i k_x x} \right), \quad \text{where } k_x = \sqrt{k^2 - (\pi m)^2}, \tag{48}
\]

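where $A_1$ and $A_2$ are obtained by solving a 2 × 2 linear system:

\[
\begin{bmatrix}
i k_x & -i k_x \\
(k - k_x) e^{-2 i k_x} & (k + k_x) e^{2 i k_x}
\end{bmatrix}
\begin{bmatrix} A_1 \\ A_2 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 0 \end{bmatrix}
\tag{49}
\]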
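The exact solution can be evaluated directly from (48) and (49), for instance to generate the observed data for the inverse problem. The following sketch solves the 2 × 2 system with Cramer's rule and checks the boundary conditions at x = 0 and x = 2 by finite differences. The function names `duct_coefficients`, `exact_u` and `du_dx` are hypothetical, and the check assumes that the outward normal is −x at x = 0 and +x at x = 2.

```python
import cmath
import math


def duct_coefficients(k, m=1):
    """Return k_x, A_1, A_2 from (48) and (49) for wave number k and mode m."""
    kx = cmath.sqrt(k**2 - (math.pi * m) ** 2)
    a11, a12 = 1j * kx, -1j * kx
    a21 = (k - kx) * cmath.exp(-2j * kx)
    a22 = (k + kx) * cmath.exp(2j * kx)
    det = a11 * a22 - a12 * a21
    # Cramer's rule for the right-hand side (1, 0)
    A1 = a22 / det
    A2 = -a21 / det
    return kx, A1, A2


def exact_u(x, y, k, m=1):
    """Complex-valued exact solution (48)."""
    kx, A1, A2 = duct_coefficients(k, m)
    wave = A1 * cmath.exp(-1j * kx * x) + A2 * cmath.exp(1j * kx * x)
    return math.cos(m * math.pi * y) * wave


def du_dx(x, y, k, m=1, h=1e-5):
    """Central finite-difference approximation of du/dx."""
    return (exact_u(x + h, y, k, m) - exact_u(x - h, y, k, m)) / (2.0 * h)


k_true, y_test = 6.0, 0.3
kx, A1, A2 = duct_coefficients(k_true)

# Neumann condition at x = 0 (outward normal is -x): du/dn = cos(m*pi*y)
neumann_residual = -du_dx(0.0, y_test, k_true) - math.cos(math.pi * y_test)
# Robin condition at x = 2 (outward normal is +x): du/dn = -i*k*u
robin_residual = du_dx(2.0, y_test, k_true) + 1j * k_true * exact_u(2.0, y_test, k_true)

print(abs(kx), abs(neumann_residual), abs(robin_residual))
```

Both residuals should vanish up to the finite-difference error, which confirms that the coefficients from (49) are consistent with the boundary conditions (47).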

As in the previous inverse problem, in which we obtained the thermal
diffusivity constant of a rod, the overall loss function is composed of a
boundary loss, a physics loss and a data loss. The boundary loss is constructed
from the Neumann and Robin boundary conditions specified in (47), and the
physics loss is the mean squared residual of the governing equation, i.e. of
the left-hand side of (46). The data loss, that is, the mean squared error
between the observed u(x, y) values and the prediction of the neural network,
is the last term in the overall loss function. These loss functions can be
described as follows:

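\[
\mathcal{L}_{loss} = \mathcal{L}_{BCs} + \mathcal{L}_{Physics} + \mathcal{L}_{Data} \tag{50}
\]

where $\mathcal{L}_{BCs}$, $\mathcal{L}_{Physics}$, $\mathcal{L}_{Data}$ are:

\[
\begin{aligned}
\mathcal{L}_{BCs} &= \frac{\lambda_1}{N_B} \sum_{i=1}^{N_B} \left| \frac{\partial u_{NN}}{\partial n}(x_i^b, y_i^b) - \frac{\partial u}{\partial n}(x_i^b, y_i^b) \right|^2, \\
\mathcal{L}_{Physics} &= \frac{\lambda_2}{N_P} \sum_{j=1}^{N_P} \left| \nabla^2 u_{NN}(x_j^*, y_j^*) + k^2 u_{NN}(x_j^*, y_j^*) \right|^2, \\
\mathcal{L}_{Data} &= \frac{\lambda_3}{N_D} \sum_{j=1}^{N_D} \left| u_{NN}(x_j^*, y_j^*) - u(x_j^*, y_j^*) \right|^2
\end{aligned}
\tag{51}
\]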

Here $\mathcal{L}_{BCs}$, $\mathcal{L}_{Physics}$, $\mathcal{L}_{Data}$ refer to
the losses obtained from the boundary conditions, the governing equation, and
the measured data, respectively. The weighting (regularization) term
$\lambda_1$ is 100, whereas $\lambda_2$ and $\lambda_3$ are 1. $N_B$ is the
number of boundary points, $N_P$ is the number of interior collocation points
where the physics loss is computed, and $N_D$ is the number of points where
observed data is available. In this problem, a grid of 784 equidistant points
(28 × 28) is created, of which $N_P = N_D = 676$ are interior points and
$N_B = 108$ are boundary points (see Figure 14).
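The point counts follow from a uniform 28 × 28 grid on the closed domain [0, 2] × [0, 1]: the 26 × 26 inner points are the collocation points, and the remaining points lie on the boundary. A minimal sketch of such a grid is given below. The function name `make_grid` is hypothetical, and assigning the two corner points at x = 2 to the Robin boundary is an assumption, since the text does not say how corners are treated.

```python
def make_grid(nx=28, ny=28, x_max=2.0, y_max=1.0):
    """Split a uniform nx-by-ny grid into interior, Neumann and Robin points."""
    xs = [x_max * i / (nx - 1) for i in range(nx)]
    ys = [y_max * j / (ny - 1) for j in range(ny)]
    interior, neumann, robin = [], [], []
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            if i == nx - 1:
                robin.append((x, y))      # x = 2 (corners included, by assumption)
            elif i == 0 or j == 0 or j == ny - 1:
                neumann.append((x, y))    # x = 0, y = 0 or y = 1
            else:
                interior.append((x, y))   # physics and data loss points
    return interior, neumann, robin


interior, neumann, robin = make_grid()
print(len(interior), len(neumann) + len(robin))
```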
The neural network consists of 5 hidden layers with 30 neurons each and the
tanh activation function. The data is normalized to the interval [−1, 1]
before being processed. The loss function is minimized first with the ADAM
optimizer (5000 iterations) and subsequently with the quasi-Newton L-BFGS
method (6200 iterations). The estimated solution for u(x, y) and the exact
solution are shown in Figure 15.
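The text does not state which quantities are normalized or which formula is used. A common choice, assumed in the sketch below, is a linear min-max map of each input coordinate to [−1, 1]; the function names `to_unit_interval` and `from_unit_interval` are hypothetical.

```python
def to_unit_interval(values, lower, upper):
    """Map values linearly from [lower, upper] to [-1, 1]."""
    scale = 2.0 / (upper - lower)
    return [scale * (v - lower) - 1.0 for v in values]


def from_unit_interval(values, lower, upper):
    """Inverse map from [-1, 1] back to [lower, upper]."""
    return [lower + 0.5 * (v + 1.0) * (upper - lower) for v in values]


x_phys = [0.0, 1.0, 2.0]                      # x-coordinates in (0, 2)
x_norm = to_unit_interval(x_phys, 0.0, 2.0)
x_back = from_unit_interval(x_norm, 0.0, 2.0)
print(x_norm, x_back)
```

If the network takes normalized coordinates as input, derivatives with respect to the physical coordinates pick up the factor 2/(upper − lower) per differentiation by the chain rule, which has to be accounted for in the physics and boundary losses.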
The initial guess for k was 1, and the neural network's estimate of
k after training is 5.999, compared with the true value k = 6. The relative
$L_2$ error norm for the real part of the solution is 0.0015. The error
between the predicted solution and the exact solution is shown in Figure 16.


Figure 14: Collocation points for 2D Helmholtz equation. Black points depict
the boundary points where Neumann boundary conditions are valid whereas
the red points show the Robin boundary points. In addition, blue points
represent the inner collocation points where physics loss and data loss are
computed.

(a) Predicted solution for the real part of the Helmholtz equation
(b) Exact solution for the real part of the Helmholtz equation
(c) Predicted solution for the imaginary part of the Helmholtz equation
(d) Exact solution for the imaginary part of the Helmholtz equation

Figure 15: Predicted and exact values for real and imaginary parts of the
Helmholtz equation.

(a) Error distribution between predicted and exact solution for the real part
(b) Error distribution between predicted and exact solution for the imaginary part

Figure 16: Error distribution for real and imaginary parts of the Helmholtz
equation.

5 Conclusions
In this chapter, we have introduced some of the main building blocks
for PINNs. The main idea is to cast the process of solving a PDE as an
optimization problem, where either the residual or some energy functional
related to the governing equations is minimized. We showed the
implementation of PINNs for both simple and more advanced inverse problems.
First, a one-dimensional steady-state heat conduction problem with a source
term was solved for the unknown thermal diffusivity constant κ. Then, a
complex-valued Helmholtz equation for an inverse acoustic duct problem was
investigated, in which the initially unknown wave number k is
approximated by the PINN model. Unlike in forward problems, the loss function
has an additional term, formed as the mean squared error between the
measured data and the model’s prediction.
By taking advantage of modern machine learning libraries, it is possible
to write fairly succinct programs that approximate the solution or some
quantity of interest, while also benefiting from the built-in
parallelization offered by multi-processor and GPU architectures.
Nevertheless, solving PDEs by optimizing the parameters of a “standard”
fully-connected neural network is less efficient than established methods
such as finite elements. Further advances seem possible by combining machine
learning algorithms with classical methods for solving PDEs, which make use
of the available knowledge for approximating the solutions or quantities of
interest.

References
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, et al. TensorFlow: Large-
Scale Machine Learning on Heterogeneous Systems. Software available
from tensorflow.org. 2015. url: https://www.tensorflow.org/.
[2] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi. “Learning acti-
vation functions to improve deep neural networks”. In: arXiv preprint
arXiv:1412.6830 (2014).
[3] C. Anitescu, E. Atroshchenko, N. Alajlan, and T. Rabczuk. “Artifi-
cial neural network methods for the solution of second order boundary
value problems”. In: Computers, Materials and Continua 59.1 (2019),
pp. 345–359.
[4] A. Apicella, F. Donnarumma, F. Isgrò, and R. Prevete. “A survey on
modern trainable activation functions”. In: Neural Networks 138 (2021),
pp. 14–32.
[5] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, et al. JAX: compos-
able transformations of Python+NumPy programs. Version 0.2.5. 2018.
url: http://github.com/google/jax.
[6] C. G. Broyden. “The convergence of a class of double-rank minimiza-
tion algorithms: 2. The new algorithm”. In: IMA journal of applied
mathematics 6.3 (1970), pp. 222–231.

[7] Y. Chen, L. Lu, G. E. Karniadakis, and L. Dal Negro. “Physics-informed
neural networks for inverse problems in nano-optics and metamateri-
als”. In: Optics express 28.8 (2020), pp. 11618–11633.
[8] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. “Fast and accurate
deep network learning by exponential linear units (ELUs)”. In: arXiv
preprint arXiv:1511.07289 (2015).
[9] C. De Sa, C. Re, and K. Olukotun. “Global convergence of stochastic
gradient descent for some non-convex matrix problems”. In: Interna-
tional conference on machine learning. PMLR. 2015, pp. 2332–2341.
[10] I. Depina, S. Jain, S. Mar Valsson, and H. Gotovac. “Application of
physics-informed neural networks to inverse problems in unsaturated
groundwater flow”. In: Georisk: Assessment and Management of Risk
for Engineered Systems and Geohazards 16.1 (2022), pp. 21–36.
[11] J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, et al. “Tensorflow dis-
tributions”. In: arXiv preprint arXiv:1711.10604 (2017).
[12] R. Fletcher. “A new approach to variable metric algorithms”. In: The
computer journal 13.3 (1970), pp. 317–322.
[13] L. Floridi and M. Chiriatti. “GPT-3: Its nature, scope, limits, and con-
sequences”. In: Minds and Machines 30.4 (2020), pp. 681–694.
[14] X. Glorot and Y. Bengio. “Understanding the difficulty of training deep
feedforward neural networks”. In: Proceedings of the thirteenth interna-
tional conference on artificial intelligence and statistics. JMLR Work-
shop and Conference Proceedings. 2010, pp. 249–256.
[15] D. Goldfarb. “A family of variable-metric methods derived by varia-
tional means”. In: Mathematics of computation 24.109 (1970), pp. 23–
26.
[16] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press,
2016.
[17] S. Goswami, C. Anitescu, and T. Rabczuk. “Adaptive fourth-order
phase field analysis for brittle fracture”. In: Computer Methods in Ap-
plied Mechanics and Engineering 361 (2020), p. 112808.
[18] I. Gühring, G. Kutyniok, and P. Petersen. “Error bounds for
approximations with deep ReLU neural networks in $W^{s,p}$ norms”. In:
Analysis and Applications 18.05 (2020), pp. 803–859.
[19] E. Haghighat, D. Amini, and R. Juanes. “Physics-informed neural net-
work simulation of multiphase poroelasticity using stress-split sequen-
tial training”. In: Computer Methods in Applied Mechanics and Engi-
neering 397 (2022), p. 115141.
[20] J. He, L. Li, J. Xu, and C. Zheng. “Relu deep neural networks and
linear finite elements”. In: Journal of Computational Mathematics 38.3
(2020), pp. 502–527.

[21] K. He, X. Zhang, S. Ren, and J. Sun. “Delving deep into rectifiers: Sur-
passing human-level performance on imagenet classification”. In: Pro-
ceedings of the IEEE international conference on computer vision. 2015,
pp. 1026–1034.
[22] D. Hendrycks and K. Gimpel. “Gaussian error linear units (GELUs)”.
In: arXiv preprint arXiv:1606.08415 (2016).
[23] K. Hornik, M. Stinchcombe, and H. White. “Multilayer feedforward
networks are universal approximators”. In: Neural networks 2.5 (1989),
pp. 359–366.
[24] A. D. Jagtap, K. Kawaguchi, and G. Em Karniadakis. “Locally adaptive
activation functions with slope recovery for deep and physics-informed
neural networks”. In: Proceedings of the Royal Society A 476.2239 (2020),
p. 20200334.
[25] A. D. Jagtap, K. Kawaguchi, and G. E. Karniadakis. “Adaptive acti-
vation functions accelerate convergence in deep and physics-informed
neural networks”. In: Journal of Computational Physics 404 (2020),
p. 109136.
[26] A. D. Jagtap, Y. Shin, K. Kawaguchi, and G. E. Karniadakis. “Deep
Kronecker neural networks: A general framework for neural networks
with adaptive activation functions”. In: Neurocomputing 468 (2022),
pp. 165–180.
[27] N. P. Jouppi, C. Young, N. Patil, D. Patterson, et al. “In-datacenter
performance analysis of a tensor processing unit”. In: Proceedings of the
44th annual international symposium on computer architecture. 2017,
pp. 1–12.
[28] J. Jumper, R. Evans, A. Pritzel, T. Green, et al. “Highly accurate pro-
tein structure prediction with AlphaFold”. In: Nature 596.7873 (2021),
pp. 583–589.
[29] E. Kharazmi, Z. Zhang, and G. E. Karniadakis. “Variational physics-
informed neural networks for solving partial differential equations”. In:
arXiv preprint arXiv:1912.00873 (2019).
[30] D. P. Kingma and J. Ba. “Adam: A method for stochastic optimization”.
In: arXiv preprint arXiv:1412.6980 (2014).
[31] G. Kissas, Y. Yang, E. Hwuang, W. R. Witschey, et al. “Machine learn-
ing in cardiovascular flows modeling: Predicting arterial blood pressure
from non-invasive 4D flow MRI data using physics-informed neural net-
works”. In: Computer Methods in Applied Mechanics and Engineering
358 (2020), p. 112623.
[32] I. E. Lagaris, A. Likas, and D. I. Fotiadis. “Artificial neural network
methods in quantum mechanics”. In: Computer Physics Communica-
tions 104.1-3 (1997), pp. 1–14.

[33] I. E. Lagaris, A. Likas, and D. I. Fotiadis. “Artificial neural networks
for solving ordinary and partial differential equations”. In: IEEE trans-
actions on neural networks 9.5 (1998), pp. 987–1000.
[34] I. E. Lagaris, A. C. Likas, and D. G. Papageorgiou. “Neural-network
methods for boundary value problems with irregular boundaries”. In:
IEEE Transactions on Neural Networks 11.5 (2000), pp. 1041–1049.
[35] K. Levenberg. “A method for the solution of certain non-linear prob-
lems in least squares”. In: Quarterly of applied mathematics 2.2 (1944),
pp. 164–168.
[36] A. Li, R. Chen, A. B. Farimani, and Y. J. Zhang. “Reaction diffusion
system prediction based on convolutional neural network”. In: Scientific
reports 10.1 (2020), pp. 1–9.
[37] Z. Li, F. Liu, W. Yang, S. Peng, et al. “A survey of convolutional neural
networks: analysis, applications, and prospects”. In: IEEE Transactions
on Neural Networks and Learning Systems (2021).
[38] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, et al. “Fourier neural op-
erator for parametric partial differential equations”. In: arXiv preprint
arXiv:2010.08895 (2020).
[39] D. C. Liu and J. Nocedal. “On the limited memory BFGS method for
large scale optimization”. In: Mathematical programming 45.1 (1989),
pp. 503–528.
[40] J. López, C. Anitescu, and T. Rabczuk. “Isogeometric structural shape
optimization using automatic sensitivity analysis”. In: Applied Mathe-
matical Modelling 89 (2021), pp. 1004–1024.
[41] L. Lu, P. Jin, G. Pang, Z. Zhang, et al. “Learning nonlinear opera-
tors via DeepONet based on the universal approximation theorem of
operators”. In: Nature Machine Intelligence 3.3 (2021), pp. 218–229.
[42] A. L. Maas, A. Y. Hannun, A. Y. Ng, et al. “Rectifier nonlinearities
improve neural network acoustic models”. In: Proc. icml. Vol. 30. 1.
Citeseer. 2013, p. 3.
[43] D. W. Marquardt. “An algorithm for least-squares estimation of non-
linear parameters”. In: Journal of the society for Industrial and Applied
Mathematics 11.2 (1963), pp. 431–441.
[44] P. Mertikopoulos, N. Hallak, A. Kavis, and V. Cevher. “On the al-
most sure convergence of stochastic gradient descent in non-convex
problems”. In: Advances in Neural Information Processing Systems 33
(2020), pp. 1117–1128.
[45] D. Misra. “Mish: A self regularized non-monotonic activation function”.
In: arXiv preprint arXiv:1908.08681 (2019).

[46] V. M. Nguyen-Thanh, X. Zhuang, and T. Rabczuk. “A deep energy
method for finite deformation hyperelasticity”. In: European Journal of
Mechanics-A/Solids 80 (2020), p. 103874.
[47] A. D. Otero and F. L. Ponta. “Structural analysis of wind-turbine
blades by a generalized Timoshenko beam model”. In: (2010).
[48] A. Paszke, S. Gross, F. Massa, A. Lerer, et al. “PyTorch: An
Imperative Style, High-Performance Deep Learning Library”. In: Advances in
Neural Information Processing Systems 32. Curran Associates, Inc.,
2019, pp. 8024–8035. url:
http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[49] P. Petersen and F. Voigtlaender. “Optimal approximation of piecewise
smooth functions using deep ReLU neural networks”. In: Neural Net-
works 108 (2018), pp. 296–330.
[50] D. Pfau, J. S. Spencer, A. G. D. G. Matthews, and W. M. C. Foulkes.
“Ab initio solution of the many-electron Schrödinger equation with
deep neural networks”. In: Phys. Rev. Research 2 (3 2020), p. 033429.
[51] G. Philipp, D. Song, and J. G. Carbonell. “Gradients explode - Deep
Networks are shallow - ResNet explained”. In: (2018).
[52] M. Raissi, P. Perdikaris, and G. E. Karniadakis. “Physics-informed neu-
ral networks: A deep learning framework for solving forward and inverse
problems involving nonlinear partial differential equations”. In: Journal
of Computational physics 378 (2019), pp. 686–707.
[53] P. Ramachandran, B. Zoph, and Q. V. Le. “Searching for activation
functions”. In: arXiv preprint arXiv:1710.05941 (2017).
[54] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. “Learning rep-
resentations by back-propagating errors”. In: nature 323.6088 (1986),
pp. 533–536.
[55] E. Samaniego, C. Anitescu, S. Goswami, V. M. Nguyen-Thanh, et al.
“An energy approach to the solution of partial differential equations in
computational mechanics via machine learning: Concepts, implemen-
tation and applications”. In: Computer Methods in Applied Mechanics
and Engineering 362 (2020), p. 112790.
[56] D. F. Shanno. “Conditioning of quasi-Newton methods for function
minimization”. In: Mathematics of computation 24.111 (1970), pp. 647–
656.
[57] K. Shukla, P. C. Di Leoni, J. Blackshire, D. Sparkman, et al. “Physics-
informed neural network for ultrasound nondestructive quantification
of surface breaking cracks”. In: Journal of Nondestructive Evaluation
39.3 (2020), pp. 1–20.

[58] K. Shukla, A. D. Jagtap, and G. E. Karniadakis. “Parallel physics-
informed neural networks via domain decomposition”. In: Journal of
Computational Physics 447 (2021), p. 110683.
[59] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, et al. “Master-
ing chess and shogi by self-play with a general reinforcement learning
algorithm”. In: arXiv preprint arXiv:1712.01815 (2017).
[60] J. Sirignano and K. Spiliopoulos. “DGM: A deep learning algorithm
for solving partial differential equations”. In: Journal of computational
physics 375 (2018), pp. 1339–1364.
[61] N. Sukumar and A. Srivastava. “Exact imposition of boundary
conditions with distance functions in physics-informed deep neural
networks”. In: Computer Methods in Applied Mechanics and Engineering
389 (2022), p. 114333.
[62] S. Sun, Z. Cao, H. Zhu, and J. Zhao. “A survey of optimization meth-
ods from a machine learning perspective”. In: IEEE transactions on
cybernetics 50.8 (2019), pp. 3668–3681.
[63] M. Vauhkonen, T. Tarvainen, and T. Lähivaara. “Inverse Problems”. In:
Mathematical Modelling. Ed. by S. Pohjolainen. Springer International
Publishing, 2016.
[64] U. bin Waheed, E. Haghighat, T. Alkhalifah, C. Song, et al. “PINNeik:
Eikonal solution using physics-informed neural networks”. In: Comput-
ers & Geosciences 155 (2021), p. 104833.
[65] C. Wang, V. Tan, and Y. Zhang. “Timoshenko beam model for vibra-
tion analysis of multi-walled carbon nanotubes”. In: Journal of Sound
and Vibration 294.4-5 (2006), pp. 1060–1072.
[66] G.-F. Wang and X.-Q. Feng. “Timoshenko beam model for buckling
and vibration of nanowires with surface effects”. In: Journal of physics
D: applied physics 42.15 (2009), p. 155411.
[67] S. Wang, X. Yu, and P. Perdikaris. “When and why PINNs fail to
train: A neural tangent kernel perspective”. In: Journal of Computa-
tional Physics 449 (2022), p. 110768.
[68] C. L. Wight and J. Zhao. “Solving allen-cahn and cahn-hilliard equa-
tions using the adaptive physics informed neural networks”. In: arXiv
preprint arXiv:2007.04542 (2020).
[69] W. E and B. Yu. “The deep Ritz method: a deep learning-based
numerical algorithm for solving variational problems”. In: Communications in
Mathematics and Statistics 6.1 (2018), pp. 1–12.
[70] J. Yu, L. Lu, X. Meng, and G. E. Karniadakis. “Gradient-enhanced
physics-informed neural networks for forward and inverse PDE prob-
lems”. In: Computer Methods in Applied Mechanics and Engineering
393 (2022), p. 114823.

[71] X. Zhuang, H. Guo, N. Alajlan, H. Zhu, et al. “Deep autoencoder
based energy method for the bending, vibration, and buckling anal-
ysis of Kirchhoff plates with transfer learning”. In: European Journal of
Mechanics-A/Solids 87 (2021), p. 104225.
