
Neural Networks: Basic Theory and Architecture Types

Greg Postalian-Yrausquin
Published in Towards AI · 17 min read · Jun 13, 2024

In this story, I will attempt to review and explain, at a high level and in simple language, the theoretical fundamentals behind neural networks, the technologies that derive from them, and the most critical aspects of their implementation using PyTorch. I will also point to examples of use cases that I have documented in other articles.

Neural Networks, as their name implies, are complex systems composed of neurons. The power of the network comes from the interconnections between these artificial neurons. These algorithms are loosely inspired by biological nervous systems.

The Neuron:

The core of the neural network is the neuron, which is simply a mathematical unit that takes a set of inputs (variables), applies a linear and then a nonlinear transformation, and generates an output, in the same way a multivariate mathematical function does.

The input for the neurons of the first layer is the data from the original set. Each sample will be processed in the same way, but independently by the network.

The basic structure of the network is as follows: several layers, each consisting of a number of neurons that is configured independently for each layer. The first layer is fed from the data source and the last layer produces the output; the layers in between are fed from the previous layer and feed into the next.

Some architectures might introduce variations to this model, which might consist of adding transformations between the layers, or the introduction of feedback loops. I will discuss those later.

Each neuron then consists of a linear and a non-linear part. The linear section is a standard linear equation; the non-linear transformation varies between networks and layers.

In detail, a layer is formed by an input vector (of size n), which is multiplied by a matrix of weights (of size n×m, where n is the size of the input and m is the number of neurons in the layer). The result is another vector of size m, which is added to a vector of free parameters (the bias). The whole result is passed through a non-linear function, called the activation function. The output of this process feeds the next layer.

Or, in explicit form:
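
$$y_j = f\left( \sum_{i=1}^{n} w_{ij}\, x_i + b_j \right), \qquad j = 1, \dots, m$$

where $x_i$ are the inputs, $w_{ij}$ the weights, $b_j$ the free parameters (bias), and $f$ the activation function.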

To illustrate how this system gets complicated really fast, this is how the formula looks at the exit of the second layer:
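
$$\mathbf{y}^{(2)} = f_2\left( W^{(2)}\, f_1\left( W^{(1)} \mathbf{x} + \mathbf{b}^{(1)} \right) + \mathbf{b}^{(2)} \right)$$

where $W^{(1)}, \mathbf{b}^{(1)}, f_1$ and $W^{(2)}, \mathbf{b}^{(2)}, f_2$ are the weights, free parameters, and activation functions of the first and second layers, respectively.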

The whole point of the training process that follows is to find the optimal values of the weights and free parameters, the ones that best reproduce the original outputs (target variables). In order to measure the difference between the original and predicted values, we introduce a loss function. With this, all neural networks become, at their core, supervised regression problems.

The activation is selected from a set of functions that fulfill different use cases. At a high level, the most used are the ReLU (which keeps values larger than 0 and sets the negatives to 0), the sigmoid, and the tanh.
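
In formulas:

$$\mathrm{ReLU}(x) = \max(0, x), \qquad \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}, \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$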

The Learning Process:

As mentioned, the goal is to find the optimal values of a large set of parameters for each layer. In order to do that, a function that relates the predicted Y and the real Y is selected (the loss function). As with any optimization problem, the objective is to locate the point where this function is at a minimum. Some examples of loss functions are the mean squared error, the root mean squared error, the mean absolute error, and the binary cross-entropy.
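
For example, for $N$ samples with targets $y_i$ and predictions $\hat{y}_i$, the mean squared error and the binary cross-entropy are:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2, \qquad \mathrm{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]$$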

The algorithm used to solve the optimization problem is a variant of gradient descent, which is a numerical implementation of the classic calculus problem of locating the minimum of a function, applied to the multivariate case. The algorithm estimates the value of the gradient of the loss in each iteration and updates the parameters in the following way:
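
$$\theta \leftarrow \theta - \eta\, \nabla_{\theta} L(\theta)$$

where $\theta$ stands for the parameters (weights and free parameters), $L$ is the loss function, and $\eta$ is the learning rate.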

This process is repeated for a given number of iterations (epochs), verifying on each run that the value of the loss is decreasing.

The learning rate is a hyperparameter, something that needs to be fixed a priori. The important note about the learning rate is that it cannot be set too high, or the loss values will start to oscillate without ever locating the minimum of the function.

The parameters are generally initialized using a random variable that follows a Gaussian distribution. In the idealized case, where all inputs are independent and there is an infinite number of neurons in the layer, the outputs and the trained parameters form Gaussian distributions as well.

These distributions can be seen as wave packets over the multivariate space formed by the parameters, variables, or outputs, mathematical structures similar to those used in quantum field theory to model free particles over quantum fields. What in QFT are interactions, small deviations from the ideal scenario, in neural networks are produced by the introduction of dependencies between the variables and by the finite number of neurons per layer. These deviations, specifically those produced by an internal or unseen structure among the members of the network, are the source of the predictive power of the system. When these deviations grow too large, the system diverges and becomes chaotic.

As in QFT, problems with small deviations from the free (Gaussian) case can be solved mathematically using perturbation theory. The same mathematics used to understand the interaction between physical particles is used to understand the inner workings of neural networks.

If you want to read more about this approach to understanding neural networks, you can consult the original paper: [2307.03223] Neural Network Field Theories: Non-Gaussianity, Actions, and Locality (arxiv.org)

Back to the point, there is a number of decisions, or hyperparameters to fix, that need to be made when designing a neural network. These are:

  1. The number of inputs and outputs of each layer, with the caveat that the input of the first layer is the number of variables, the number of outputs of the last layer is given by the nature of the problem to solve, and the output of each internal or hidden layer is the input of the next layer.
  2. The number of layers in the network.
  3. The activation functions of each layer.
  4. The loss function.
  5. The gradient algorithm.
  6. The learning rate.
  7. The architecture (which we will explore in the next segment).

Next, I will review some of the most common architectures of neural networks, a high-level description of each, and sample use cases.

If you want to understand more of the mathematics inside neural networks and machine learning, you can consult the book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, available at Deep Learning (deeplearningbook.org). By the way, Goodfellow is also credited with inventing Generative Adversarial Networks, which are described later in this document.

The Multilayer Perceptron (MLP)

This is the most basic architecture for neural networks. It is formed by a purely sequential structure (each layer depends only on the previous layer), and it does not consider any particular relationship between the variables. The MLP is a good fit for problems where the predictors do not seem to depend on each other, so they do not form a continuous structure; for example, a dataset formed from factors like age, salary, education, or gender.

The following code block is a sample of the definition of a multilayer perceptron in Python using the PyTorch package:

import torch.nn as nn

class SimpleClassifier(nn.Module):
    def __init__(self, input_size):
        super(SimpleClassifier, self).__init__()
        # the dropout layer is introduced to reduce overfitting:
        # dropout tells the neural network to randomly drop activations between layers to introduce variability
        self.dropout = nn.Dropout(0.1)
        # for the layer sizes I recommend starting a little over twice the number of columns,
        # increasing from one layer to the next, and then decreasing again down to the output size;
        # in this case the response is binary
        self.layers = nn.Sequential(
            nn.Linear(input_size, 250),
            nn.Linear(250, 500),
            nn.Linear(500, 1000),
            nn.Linear(1000, 1500),
            nn.ReLU(),
            self.dropout,
            nn.Linear(1500, 1500),
            nn.Sigmoid(),
            self.dropout,
            nn.Linear(1500, 1500),
            nn.ReLU(),
            self.dropout,
            nn.Linear(1500, 1500),
            nn.Sigmoid(),
            self.dropout,
            nn.Linear(1500, 1500),
            nn.ReLU(),
            self.dropout,
            nn.Linear(1500, 1500),
            nn.Sigmoid(),
            self.dropout,
            nn.Linear(1500, 1500),
            nn.ReLU(),
            self.dropout,
            nn.Linear(1500, 500),
            nn.Sigmoid(),
            self.dropout,
            nn.Linear(500, 500),
            nn.ReLU(),
            self.dropout,
            nn.Linear(500, 500),
            nn.Sigmoid(),
            self.dropout,
            # the last layer outputs 2 since the response variable is binary (0, 1);
            # the output of a multiclass classification should be of the size of the number of classes
            nn.Linear(500, 2),
        )

    def forward(self, x):
        return self.layers(x)

# define the model; input_size is the number of input variables (columns) in the dataset
model = SimpleClassifier(input_size)

A typical learning loop using this model is represented in the next block:

import copy
import torch

# loading the model (input_size is the number of input variables)
model = SimpleClassifier(input_size)
model.train()

# these are the training parameters (number of cycles and learning rate)
num_epochs = 100
learning_rate = 0.00001
# to reduce overfitting
regularization = 0.0000001

# loss function
criterion = nn.CrossEntropyLoss()

# algorithm to find the gradients
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=regularization)

# this code keeps the best model saved while doing the training loop
best_model_wts = copy.deepcopy(model.state_dict())
best_acc = 0.0
best_f1 = 0.0
best_epoch = 0
phases = ['train', 'val']
training_curves = {}
epoch_loss = 1
epoch_f1 = 0
epoch_acc = 0

# the dataset is split into training and validation phases (a test set is held out separately)
for phase in phases:
    training_curves[phase+'_loss'] = []
    training_curves[phase+'_acc'] = []
    training_curves[phase+'_f1'] = []

# this is the training loop; it assumes dataloaders and dataset_sizes are dictionaries
# keyed by phase, built beforehand from the training and validation sets
for epoch in range(num_epochs):
    print(f'\nEpoch {epoch+1}/{num_epochs}')
    print('-' * 10)
    for phase in phases:
        if phase == 'train':
            model.train()
        else:
            model.eval()
        running_loss = 0.0
        running_corrects = 0
        running_fp = 0
        running_tp = 0
        running_tn = 0
        running_fn = 0
        # iterate over the data
        for inputs, labels in dataloaders[phase]:
            inputs = inputs.view(inputs.shape[0], -1)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward pass (gradients are only tracked during training)
            with torch.set_grad_enabled(phase == 'train'):
                outputs = model(inputs)
                _, predictions = torch.max(outputs, 1)
                loss = criterion(outputs, labels)

                # backward pass (only in training)
                if phase == 'train':
                    loss.backward()
                    optimizer.step()

            # statistics, used for the F1 metric
            running_loss += loss.item() * inputs.size(0)
            running_corrects += torch.sum(predictions == labels.data)
            running_fp += torch.sum((predictions != labels.data) & (predictions >= 0.5))
            running_tp += torch.sum((predictions == labels.data) & (predictions >= 0.5))
            running_fn += torch.sum((predictions != labels.data) & (predictions < 0.5))
            running_tn += torch.sum((predictions == labels.data) & (predictions < 0.5))
            print(f'Epoch {epoch+1}, {phase:5} Loss: {epoch_loss:.7f} F1: {epoch_f1:.7f} Acc: {epoch_acc:.7f} Partial loss: {loss.item():.7f} Best f1: {best_f1:.7f} ')

        # calculation of loss, accuracy and the F1 metric for the epoch
        epoch_loss = running_loss / dataset_sizes[phase]
        epoch_acc = running_corrects.double() / dataset_sizes[phase]
        epoch_f1 = (2*running_tp.double()) / (2*running_tp.double() + running_fp.double() + running_fn.double() + 1e-22)
        training_curves[phase+'_loss'].append(epoch_loss)
        training_curves[phase+'_acc'].append(epoch_acc)
        training_curves[phase+'_f1'].append(epoch_f1)

        print(f'Epoch {epoch+1}, {phase:5} Loss: {epoch_loss:.7f} F1: {epoch_f1:.7f} Acc: {epoch_acc:.7f} Best f1: {best_f1:.7f} ')

        # keep the weights of the best model seen so far (based on the validation F1)
        if phase == 'val' and epoch_f1 >= best_f1:
            best_epoch = epoch
            best_acc = epoch_acc
            best_f1 = epoch_f1
            best_model_wts = copy.deepcopy(model.state_dict())

print(f'Best val F1: {best_f1:5f}, Best val Acc: {best_acc:5f} at epoch {best_epoch}')

# load best model weights
model.load_state_dict(best_model_wts)

A sample of the implementation of an MLP with actual data can be seen in the following article: Using Neural Networks with Pytorch to Predict Fail of Automatic Recovery | by Greg Postalian-Yrausquin | Jun, 2024 | Towards AI (medium.com)

More information can be found on the Wikipedia page: Multilayer perceptron — Wikipedia

Convolutional Neural Networks (CNN):

The typical multilayer perceptron does not perform well in cases where the inputs are fields. A field, in this case, is some kind of structure where there is a relationship (via functions or continuity) between points. For example, the temperature of a metal plate that is connected to a heat source at one point will follow a gradient from the source to the more distant locations on the plate. In this two-dimensional example, the temperature on the surface can be represented by a grid (matrix):
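
As a purely illustrative example, a 4×4 patch of such a temperature grid, hotter near a source at the top-left corner, could look like:

$$\begin{pmatrix} 95 & 80 & 60 & 45 \\ 80 & 70 & 55 & 40 \\ 60 & 55 & 45 & 35 \\ 45 & 40 & 35 & 30 \end{pmatrix}$$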

Note that these structures could be 3D (imagine several of these grids stacked on top of each other), or as complicated as we like. MLPs are not good at modeling them, since they lose the internal structure of the data during training. CNNs, on the contrary, preserve the original relationships.

In Python, images are stored as matrices, where the rows and columns represent the position and each number is a measure of intensity. For color images, 3 matrices are used per image, storing in their cells the values of the RGB color encoding. In mathematics, these multidimensional matrices are called tensors, and they can also be seen as vector-valued functions (the output takes the shape of a vector); in this case, the coordinates of the output vector are the RGB color values.

For this reason, CNNs are widely used to model image and video data. The poster child for CNNs is image classification. In this article, you can see an example of using CNNs for that purpose: Convolutional Neural Networks in PyTorch: Image Classification | by Greg Postalian-Yrausquin | Jun, 2024 | Towards AI (medium.com).

Convolutional Neural Networks are defined by the presence of one or more convolution layers in the network. A convolution is a mathematical operation in which we slide a window or matrix (the kernel) over the data, perform an element-wise multiplication, and then sum the values inside the kernel. Padding can be introduced to compensate for the reduction in size of the original data matrix.
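
As a minimal sketch of this operation (the sizes here are chosen only for illustration), a single PyTorch convolution layer shows how the kernel and the padding affect the output shape:

import torch
import torch.nn as nn

# one sample with 1 channel and an 8x8 grid of values
x = torch.randn(1, 1, 8, 8)

# a 3x3 kernel with no padding shrinks the map: 8x8 -> 6x6
conv_no_pad = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1, padding=0)
print(conv_no_pad(x).shape)   # torch.Size([1, 1, 6, 6])

# padding=1 compensates for the reduction and keeps the map at 8x8
conv_pad = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1, padding=1)
print(conv_pad(x).shape)      # torch.Size([1, 1, 8, 8])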

Other types of layers are introduced in CNNs, for example: max pooling (reduces the size of the data by taking the maximum value of a portion of the map), flatten (converts the matrix of data into a vector; it is used at the end of the network or to continue training as a standard network), and unflatten (which reverses the previous process).
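
As a small sketch of how these layers change the shape of the data (the sizes are, again, only illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 5, 8, 8)                          # 5 feature maps of size 8x8

pooled = nn.MaxPool2d(kernel_size=2, stride=2)(x)    # -> [1, 5, 4, 4]
flat = nn.Flatten()(pooled)                          # -> [1, 80], ready for linear layers
restored = nn.Unflatten(1, (5, 4, 4))(flat)          # -> [1, 5, 4, 4] again
print(pooled.shape, flat.shape, restored.shape)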

The following sample code is the definition of a CNN class in PyTorch:

import torch.nn as nn

class CNNClassifier(nn.Module):
    def __init__(self):
        super(CNNClassifier, self).__init__()
        self.dropout = nn.Dropout(0.05)
        self.pipeline = nn.Sequential(
            # in_channels is 1, because the input is grayscale
            nn.Conv2d(in_channels = 1, out_channels = 10, kernel_size = 5, stride = 1, padding=1),
            nn.ReLU(),
            nn.Conv2d(in_channels = 10, out_channels = 10, kernel_size = 5, stride = 1, padding=1),
            nn.ReLU(),
            nn.Conv2d(in_channels = 10, out_channels = 10, kernel_size = 5, stride = 1, padding=1),
            nn.ReLU(),
            nn.Conv2d(in_channels = 10, out_channels = 5, kernel_size = 5, stride = 1, padding=1),
            nn.ReLU(),
            # dropout to introduce randomness and reduce overfitting
            self.dropout,
            # reduce and flatten the tensor before applying the linear layers
            nn.MaxPool2d(kernel_size = 2, stride = 2),
            nn.Flatten(),
            nn.Linear(500, 50),
            nn.ReLU(),
            self.dropout,
            nn.Linear(50, 50),
            nn.ReLU(),
            self.dropout,
            nn.Linear(50, 10),
            nn.ReLU(),
            self.dropout,
            nn.Linear(10, 10),
            nn.ReLU(),
            self.dropout,
            nn.Linear(10, 5),
        )

    def forward(self, x):
        return self.pipeline(x)

model = CNNClassifier()

I find the Wikipedia entry for CNNs a good starting point to look for more details about their workings, history, and use cases: Convolutional neural network — Wikipedia

The Autoencoder

A subclass of architecture is the autoencoder. It can be imagined as a specific configuration where the number of outputs is the same as the number of inputs. The model is tasked with learning to reproduce the input data after passing it through one or more hidden layers.

The model is designed to have two portions: the Encoder, which transforms the input into a different representation, and the Decoder, which, starting from that representation, rebuilds its own version of the input. The idea is that this reconstruction must be as similar to the initial data as possible.

In these networks, the target is the input data itself, so they are effectively unsupervised learning methods. Autoencoder architectures are used, for example, as part of generative AI tasks.
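
As a minimal sketch of this idea (assuming an autoencoder model like the one defined further below, and a dataloader of images prepared beforehand), the training step simply compares the reconstruction against the input itself:

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(netEncDec.parameters(), lr=0.0001)

for images, _ in dataloader:                      # any labels are ignored: the input is also the target
    optimizer.zero_grad()
    reconstruction = netEncDec(images)
    loss = criterion(reconstruction, images)      # compare the output against the input
    loss.backward()
    optimizer.step()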

In NLP, autoencoders are used to produce word embeddings (numerical representations of words or sentences). These representations of the text are then used in downstream tasks, like classification, distance calculation, and others.

One clever way of using autoencoders in NLP is to train the network to reconstruct its input text and then keep only the encoder: the intermediate representation it produces serves as an embedding for downstream tasks.

An example of the definition of an autoencoder can be found in this article: Neural networks: encoder-decoder example (autoencoder) | by Greg Postalian-Yrausquin | Jun, 2024 | Medium. In it, a model is used to reconstruct images.

# Number of channels in the training images. For color images this is 3
# (the network below assumes input images of size nc x 64 x 64)
nc = 3

# size of the representation
nr = 1000

# number of channels at the starting point of the decoder
nz = 50

class Encdec(nn.Module):
    def __init__(self, nc, nz, nr):
        super(Encdec, self).__init__()
        # this is the encoder
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels = nc, out_channels = 10, kernel_size = 5, stride = 1, padding=1),
            nn.ReLU(),
            nn.Conv2d(in_channels = 10, out_channels = 10, kernel_size = 5, stride = 1, padding=1),
            nn.ReLU(),
            nn.Conv2d(in_channels = 10, out_channels = 10, kernel_size = 5, stride = 1, padding=1),
            nn.ReLU(),
            nn.Conv2d(in_channels = 10, out_channels = 10, kernel_size = 5, stride = 1, padding=1),
            nn.ReLU(),
            nn.Conv2d(in_channels = 10, out_channels = 1, kernel_size = 5, stride = 1, padding=1),
            nn.Flatten(),
            nn.Linear(2916, 3000),
            nn.ReLU(),
            nn.Linear(3000, 1000),
            nn.ReLU(),
            nn.Linear(1000, 1000),
            nn.ReLU(),
            nn.Linear(1000, nr),
        )
        # this is the decoder
        self.decoder = nn.Sequential(
            nn.Linear(nr, 1000),
            nn.ReLU(),
            nn.Linear(1000, 500),
            nn.ReLU(),
            nn.Linear(500, nz*64*64),
            nn.Unflatten(1, torch.Size([nz, 64, 64])),
            nn.Conv2d(in_channels = nz, out_channels = 10, kernel_size = 5, stride = 1, padding=1),
            nn.ReLU(),
            nn.Conv2d(in_channels = 10, out_channels = 10, kernel_size = 5, stride = 1, padding=1),
            nn.ReLU(),
            nn.Conv2d(in_channels = 10, out_channels = nc, kernel_size = 5, stride = 1, padding=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(10092, 2000),
            nn.ReLU(),
            nn.Linear(2000, 1000),
            nn.ReLU(),
            nn.Linear(1000, 1000),
            nn.ReLU(),
            nn.Linear(1000, 500),
            nn.ReLU(),
            nn.Linear(500, nc*64*64),
            nn.Unflatten(1, torch.Size([nc, 64, 64])),
            nn.Tanh()
        )

    def encode(self, x):
        return self.encoder(x)

    def decode(self, x):
        return self.decoder(x)

    def forward(self, input):
        return self.decoder(self.encoder(input))

netEncDec = Encdec(nc, nz, nr)

Again, I will refer to Wikipedia as the starting point to learn more about the autoencoder architecture specifically: Autoencoder — Wikipedia

A common and well-known implementation of the encoder-decoder mechanism is the Transformer. This architecture was introduced in the paper “Attention Is All You Need” ([1706.03762] Attention Is All You Need (arxiv.org)), published by data scientists at Google in 2017.

The transformer is widely used in NLP. It consists of several steps that start with the production of embeddings for the input and output sets. Both sets are then passed through a process that retains positional information, a step that is not included in the original autoencoders. This provides the same advantage as the RNN (which will be explained next) without the processing overhead caused by recurrence. The data then passes through the encoding process (an attention stack) and is compared to the outputs at the decoding step (a second attention stack). The final step consists of applying a softmax transformation.
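
For reference, the positional information in the original paper is injected with sinusoidal functions of the token position $pos$ and the embedding coordinate $i$ (with $d_{model}$ the embedding size):

$$PE_{(pos,\, 2i)} = \sin\left( \frac{pos}{10000^{2i/d_{model}}} \right), \qquad PE_{(pos,\, 2i+1)} = \cos\left( \frac{pos}{10000^{2i/d_{model}}} \right)$$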

The transformer architecture: diagram taken from the original paper “Attention is all you need”.

One very interesting and useful implementation of the transformer paradigm is BERT (Bidirectional Encoder Representations from Transformers), created by Google. Training transformers for language processing from scratch can be a titanic task, but fortunately, there are pretrained models that can be downloaded and fine-tuned for different use cases (see the page from Huggingface for examples and instructions to download these models): Downloading models (huggingface.co)
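
As a minimal sketch of that workflow (the model name and sentence are only examples), a pretrained BERT can be downloaded with the Hugging Face transformers package and used to produce a sentence representation:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# encode a sentence and take the representation of the [CLS] token as the embedding
inputs = tokenizer("Neural networks are fascinating", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]   # tensor of shape [1, 768]
print(embedding.shape)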

Recurrent Neural Network:

RNNs relax the purely feedforward structure of the networks described so far: a layer is allowed to affect itself (its output is fed back as an input). This makes them ideal for modeling data that comes in the shape of a sequence. The best example of such data is a stream of text, and for that reason RNNs were mostly used in NLP until the introduction of the more efficient transformer. RNNs are also implemented in speech and handwriting recognition.
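
In its simplest form, the recurrence can be written as:

$$h_t = \tanh\left( W_x x_t + W_h h_{t-1} + b \right)$$

where $x_t$ is the input at step $t$ and $h_{t-1}$ is the hidden state produced at the previous step, which is what gives the network its memory of the sequence.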

Besides the fact that they can be computationally intensive, RNNs have other issues, like the propagation of errors, which is amplified by the nonlinearity, and vanishing gradients, which means that the network actually keeps only a very short memory of previous steps. To address these, more sophisticated derivations of the RNN architecture were introduced, like the LSTM (Long Short-Term Memory) and the GRU (Gated Recurrent Unit).

An example of the application of RNN is presented in the following article: RNN: Basic Recursive Neural Network for Sentiment Analysis in PyTorch | by Greg Postalian-Yrausquin | Jun, 2024 | Towards AI (medium.com)

The definition of this network in PyTorch is:

# definition of the neural network. As you can see there is only an RNN block
# (which itself contains 2 layers) and one linear layer.
# dropout and regularization are introduced to reduce overfitting

class RNNClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNNClassifier, self).__init__()
        self.hidden_size = hidden_size
        self.RNN = nn.RNN(input_size, hidden_size, num_layers = 2, dropout = 0.2)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, input):
        output, hn = self.RNN(input)
        output = self.fc(output)
        return output, hn

# insize is the number of input features per step, defined earlier from the data
model = RNNClassifier(insize, 8, 2)

Generative Adversarial Networks (GAN):

This is another composite architecture that can be formed from combinations of MLPs, RNNs, or CNNs. Created by Ian Goodfellow and his colleagues in 2014 (see the original paper at Generative Adversarial Nets (nips.cc), and a tutorial at [1701.00160] NIPS 2016 Tutorial: Generative Adversarial Networks (arxiv.org)), GANs are a very clever application of neural networks in which two different models are trained: one is tasked with producing samples based on the original dataset, and the other is confronted with the first model's output to guess whether the samples are real or fake.

GANs are used for generative AI tasks (producing realistic data based on models); these could be text, image, or video generation. In more detail, a Generator network has to create new objects based on the original samples, and a Discriminator is trained to differentiate between the artificially generated samples and the real ones. The idea is that, after many iterations of learning, the Discriminator will start to have difficulties separating the real data from the fake, ensuring the generation of a credible product.
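
Formally, the original paper expresses this competition as a minimax game between the Generator $G$ and the Discriminator $D$:

$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{data}}\left[ \log D(x) \right] + \mathbb{E}_{z \sim p_{z}}\left[ \log\left( 1 - D(G(z)) \right) \right]$$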

This is a sample of the class definition for both adversarial networks in PyTorch:

# Number of channels in the training images. For color images this is 3
nc = 3

# Size of z latent vector (i.e. size of generator input)
nz = 100

# Size of feature maps in generator
ngf = 64

# Size of feature maps in discriminator
ndf = 64

# this is the discriminator network, tasked with trying to separate the real images from the fake ones
class Discriminator(nn.Module):
    def __init__(self, nc, ndf):
        super(Discriminator, self).__init__()
        self.pipeline = nn.Sequential(
            # nc is 3, and the input consists of tensors of size 3x64x64 (64x64 color images, 3 channels each)
            nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2),
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2),
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2),
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 8),
            nn.LeakyReLU(0.2),
            # the output is of size 1: a single score per image
            nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
        )

    def forward(self, input):
        return self.pipeline(input)


class Generator(nn.Module):
    def __init__(self, nc, nz, ngf):
        super(Generator, self).__init__()
        self.pipeline = nn.Sequential(
            # the input to the generator is random noise with nz channels (nz is 100 in this case)
            nn.ConvTranspose2d(nz, ngf * 16, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 16),
            nn.ReLU(),
            nn.ConvTranspose2d(ngf * 16, ngf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(),
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(),
            nn.ConvTranspose2d(ngf * 4, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(),
            # the output are images with the same dimensions as the training images, so it must have nc channels
            nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
            nn.Tanh()
        )

    def forward(self, input):
        return self.pipeline(input)

I have implemented a GAN to create images in the following article: GAN: training a Generative Adversarial Network for image generation | by Greg Postalian-Yrausquin | Jun, 2024 | Medium

These are the very basics of neural networks for machine learning, but they are by no means a complete description of all the architectures that are out there. The topic is incredibly vast and fascinating, and it is currently going through an explosion, with new technologies and algorithms appearing constantly. Many of these are modifications or combinations of the architectures described in this document.

The mathematics behind neural networks is focused on understanding the flow of information, how the data and the errors propagate inside the network, and on theorizing the most efficient configuration of layers (depth of the network), neurons (width of the network), activation functions, and training algorithms.

