Pure Optimization
In machine learning, the loss function and the optimizer are two components that work together to improve a model's performance.
A loss function evaluates how well the model performs by measuring the difference between the model's predicted outputs and the expected outputs.
Examples of loss functions include log loss, hinge loss, and mean squared error.
The optimizer improves the model by adjusting its parameters so as to reduce the value of the loss function.
SGD, RMSProp, and Adam are a few examples of optimizers.
The optimizer's job is to determine which combination of the neural network's weights and biases gives it the best chance of generating accurate predictions.
Pure Optimization
1. Objective function
2. Optimization algorithms
3. Learning rate scheduling
4. Regularization techniques
5. Hyperparameter tuning
6. Advanced techniques
7. Distributed and parallel optimization
8. Neural architecture search
1. Objective Function
The objective function, often a loss function, measures how well the
model performs.
Examples include Mean Squared Error (MSE) for regression and
Cross-Entropy Loss for classification.
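A minimal NumPy sketch of the two losses mentioned above; the target and prediction values are made-up examples:

import numpy as np

# Mean Squared Error for a regression model (example values)
y_true = np.array([2.0, 1.5, 3.0])
y_pred = np.array([2.5, 1.0, 2.5])
mse = np.mean((y_true - y_pred) ** 2)

# Binary cross-entropy for a classifier (predictions clipped to avoid log(0))
t = np.array([1.0, 0.0, 1.0])
p = np.clip(np.array([0.9, 0.2, 0.7]), 1e-12, 1 - 1e-12)
cross_entropy = -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

print(mse, cross_entropy)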
2. Optimization Algorithms
Gradient Descent: The most basic optimization method, adjusting
model parameters based on the gradient of the loss function.
Variants of Gradient Descent:
• Momentum: Accumulates gradients over time to smooth out updates.
• Adam (Adaptive Moment Estimation): Combines momentum and
adaptive learning rates, making it very popular for training deep
networks.
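As a sketch of how the Adam update combines momentum and adaptive learning rates; the hyperparameter defaults are the commonly used ones, and the quadratic toy objective is an assumption made only for this example:

import numpy as np

def adam_step(params, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update: first moment (momentum) and second moment (adaptive scaling)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# Toy example: minimize f(x) = x^2 starting from x = 5
params = np.array([5.0])
m = v = np.zeros_like(params)
for t in range(1, 1001):
    grad = 2 * params                      # gradient of x^2
    params, m, v = adam_step(params, grad, m, v, t, lr=0.1)
print(params)                              # ends close to the minimum at 0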
3. Learning Rate Scheduling
Adjusting the learning rate during training can lead to better convergence. Techniques include:
• Step Decay: Reducing the learning rate at fixed intervals.
• Exponential Decay: Gradually decreasing the learning rate according to an exponential function.
• Cyclic Learning Rates: Varying the learning rate cyclically within a range.
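A small sketch of the three schedules; the decay factors, interval, and cycle length are illustrative choices, not prescribed values:

import numpy as np

def step_decay(lr0, epoch, drop=0.5, every=10):
    # Step decay: halve the learning rate every 10 epochs
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, epoch, k=0.05):
    # Exponential decay: lr = lr0 * exp(-k * epoch)
    return lr0 * np.exp(-k * epoch)

def cyclic_lr(lr_min, lr_max, epoch, cycle=10):
    # Triangular schedule oscillating between lr_min and lr_max
    x = abs(epoch % cycle - cycle / 2) / (cycle / 2)
    return lr_min + (lr_max - lr_min) * (1 - x)

for epoch in range(0, 30, 5):
    print(epoch, step_decay(0.1, epoch), exponential_decay(0.1, epoch), cyclic_lr(0.001, 0.1, epoch))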
4. Regularization Techniques
To prevent overfitting, regularization methods like L1 and L2 regularization, dropout, and early stopping are employed.
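A minimal sketch of two of these methods, L2 regularization (an added penalty on the weights) and early stopping; the penalty strength, patience, and validation losses are made-up values:

import numpy as np

def l2_regularized_loss(data_loss, weights, lam=1e-4):
    # Add an L2 penalty (weight decay) to the data loss
    return data_loss + lam * np.sum(weights ** 2)

print(l2_regularized_loss(0.35, np.array([0.5, -1.2, 2.0])))

# Early stopping: stop when the validation loss has not improved for `patience` epochs
best_val, patience, wait = np.inf, 3, 0
for epoch, val_loss in enumerate([0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]):
    if val_loss < best_val:
        best_val, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:
            print("stopping at epoch", epoch)
            break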
5. Hyperparameter Tuning
Fine-tuning hyperparameters (e.g., learning rate, batch size, network architecture) can significantly enhance model performance. Techniques like grid search, random search, and Bayesian optimization are commonly used.
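A sketch of random search over two hyperparameters; train_and_score is a placeholder standing in for a real training run, and the search ranges are assumptions for the example:

import numpy as np

rng = np.random.default_rng(0)

def train_and_score(lr, batch_size):
    # Placeholder for training a model and returning a validation score
    return -abs(np.log10(lr) + 3) - abs(batch_size - 64) / 64

best = None
for _ in range(20):
    lr = 10 ** rng.uniform(-5, -1)                  # sample the learning rate on a log scale
    batch_size = int(rng.choice([16, 32, 64, 128]))
    score = train_and_score(lr, batch_size)
    if best is None or score > best[0]:
        best = (score, lr, batch_size)
print(best)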
6. Advanced Techniques
• Batch Normalization: Normalizes the inputs of each layer, improving convergence.
• Gradient Clipping: Prevents exploding gradients by capping them at a certain threshold.
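Minimal sketches of both techniques; the clipping threshold and the toy inputs are assumptions for the example:

import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    # Rescale the gradient if its L2 norm exceeds max_norm
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize a mini-batch of activations to zero mean and unit variance, then scale and shift
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

print(clip_by_norm(np.array([3.0, 4.0])))                   # norm 5 is rescaled to norm 1
print(batch_norm(np.array([[1.0, 2.0], [3.0, 4.0]])))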
7. Distributed and Parallel Optimization
Techniques to scale training across multiple GPUs or machines can lead to faster training times and the ability to work with larger datasets.
8. Neural Architecture Search (NAS)
An automated process for optimizing the architecture of the neural network itself, using techniques like reinforcement learning or evolutionary algorithms.
Challenges in Neural Network Optimization
Some common challenges:
1. Vanishing and Exploding Gradients
• Vanishing Gradients: In deep networks, gradients can become very
small, making it difficult for the model to learn. This is particularly
problematic in long sequences or deep architectures.
• Exploding Gradients: Conversely, gradients can grow exponentially,
causing numerical instability and leading to divergent training.
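A tiny numerical illustration of both effects, using the sigmoid's maximum derivative (0.25) and an assumed per-layer weight factor of 1.5 over a 50-layer chain:

# Backpropagation multiplies one derivative factor per layer.
sigmoid_grad_max = 0.25          # largest possible derivative of the sigmoid
print(sigmoid_grad_max ** 50)    # ~7.9e-31: the gradient has effectively vanished

weight_factor = 1.5              # assumed per-layer factor greater than 1
print(weight_factor ** 50)       # ~6.4e8: the gradient explodes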
2. Overfitting
• Neural networks have a high capacity to memorize training data, which can lead to overfitting, where the model performs well on training data but poorly on unseen data.
3. Choosing the Right Architecture
• Selecting the optimal network architecture (number of layers,
types of layers, etc.) is often trial-and-error and can significantly
impact performance.
4. Hyperparameter Tuning
• Finding the best hyperparameters (learning rate, batch size, regularization strength) can be time-consuming and requires extensive experimentation.
5. Local Minima and Saddle Points
• Optimization landscapes can be complex, with many local minima and saddle points. Finding a global minimum can be challenging.
6. Computational Resources
• Training deep networks can require significant computational resources and time, especially for large datasets or complex models.
7. Data Quality and Quantity
• Insufficient or poor-quality data can hinder training. Imbalanced datasets can lead to biased models.
8. Non-convexity
• The optimization problem in neural networks is non-convex, making it difficult to guarantee convergence to the global minimum.
9. Sensitivity to Initialization
• Poor weight initialization can lead to suboptimal training, causing slow convergence or failure to converge.
10. Batch Size Effects
• The choice of batch size can influence training dynamics, generalization, and convergence speed. Small batches can lead to noisy gradient estimates, while large batches may require careful tuning of the learning rate.
11. Generalization Across Domains
• Models trained in one domain may not generalize well to others, raising issues with transfer learning and domain adaptation.
12. Interpretability
• Understanding why a neural network makes specific predictions is challenging, making it difficult to debug or improve models.
Parameter Initialization
• Initializing the parameters of a deep neural network is an important step
in the training process, as it can have a significant impact on the
convergence and performance of the model. Here are some common
parameter initialization techniques used in deep learning:
• Zero Initialization: Initialize all the weights and biases to zero. This is not generally used in deep learning, as it leads to symmetry in the gradients, resulting in all the neurons learning the same feature.
• Random Initialization: Initialize the weights and biases randomly from a uniform or normal distribution. This is the most common technique used in deep learning.
• Xavier Initialization: Initialize the weights with a normal distribution with mean 0 and standard deviation sqrt(1/n), where n is the number of neurons in the previous layer. This is used for the sigmoid activation function.
• He Initialization: Initialize the weights with a normal distribution with mean 0 and standard deviation sqrt(2/n), where n is the number of neurons in the previous layer. This is used for the ReLU activation function.
• Orthogonal Initialization: Initialize the weights with an orthogonal matrix, which preserves the gradient norm during backpropagation.
• Uniform Initialization: Initialize the weights with a uniform distribution. This is less commonly used than random initialization.
• Constant Initialization: Initialize the weights and biases with a constant value. This is rarely used in deep learning.
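• A sketch of a few of these schemes in NumPy, assuming fully connected layers with the layer sizes shown:

import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Xavier/Glorot: zero-mean normal with standard deviation sqrt(1 / n_in)
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

def he_init(n_in, n_out):
    # He: zero-mean normal with standard deviation sqrt(2 / n_in), suited to ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def orthogonal_init(n_in, n_out):
    # Orthogonal: QR decomposition of a random matrix gives orthonormal columns
    a = rng.normal(size=(n_in, n_out))
    q, _ = np.linalg.qr(a)
    return q

W1 = xavier_init(256, 128)
W2 = he_init(128, 64)
print(W1.std(), W2.std())    # roughly sqrt(1/256) and sqrt(2/128)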
• Adaptive Learning Rate Method
• Adaptive learning rate methods are optimizations of gradient descent methods that aim to minimize the objective function of a network by using the gradient of the function and the parameters of the network, while adapting the learning rate during training.
• Gradient descent
• Before adaptive learning rate methods were introduced, gradient descent algorithms, including Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD), and mini-Batch Gradient Descent (mini-BGD, a mixture of BGD and SGD), were state of the art. In essence, these methods try to update the weights θ of the network with the help of a learning rate η, the objective function J(θ), and its gradient ∇_θ J(θ). What all gradient descent algorithms and their improvements have in common is the goal of minimizing J(θ) in order to find the optimal weights θ.
• θ = θ − η · ∇_θ J(θ)
• BGD tries to reach the minimum of J(θ) by subtracting the gradient of J(θ) from θ (refer to Figure 1 for a visualization). The algorithm always computes over the whole set of data for each update. This makes BGD the slowest variant and leaves it unable to update online. Additionally, it performs redundant operations for big data sets, computing similar examples at each update, and it converges to the closest minimum depending on the given data, resulting in potentially suboptimal results.
• θ = θ − η · ∇_θ J(θ; x^{(i)}, y^{(i)})
• Figure 1: (5) Local minima may occur in J(θ) (here J(w)), which may result in suboptimal solutions for some gradient descent methods.
• Contrary to BGD, SGD updates for each training example (x^{(i)}, y^{(i)}), performing one update step per example. This fluctuation enables SGD to jump to minima farther away, potentially reaching a better minimum, but it also makes SGD prone to overshooting. This can be counteracted by slowly decreasing the learning rate. In the exemplary code shown in Figure 2, a shuffle function is additionally used in the SGD and mini-BGD algorithms, compared to BGD. This is done because it is often preferable to avoid a meaningful order of the data and thereby avoid biasing the optimization algorithm, although better results can sometimes be achieved with the data in order; in that case the shuffle operation should be removed.
• Lastly, there is mini-BGD.
• θ = θ − η · ∇_θ J(θ; x^{(i:i+n)}, y^{(i:i+n)})
• Mini-BGD updates for every mini-batch of n training examples. This leads to a more stable convergence by reducing the variance of the parameter updates. When people talk about an SGD algorithm, they often refer to this version.
# Batch Gradient Descent (BGD): one update per epoch over the full data set
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad

# Stochastic Gradient Descent (SGD): one update per training example
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad

# Mini-Batch Gradient Descent (mini-BGD): one update per mini-batch of 50 examples
for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad

• Figure 2: (1) Pseudocode of the three gradient descent algorithms
• Adaptive Learning Rate Method
• As an improvement to traditional gradient descent algorithms, the
adaptive gradient descent optimization algorithms or adaptive
learning rate methods can be utilized. Several versions of these
algorithms are described below.
• Momentum can be seen as an evolution of SGD.
• v_t = γ · v_{t−1} + η · ∇_θ J(θ)
• θ = θ − v_t
• While SGD struggles when the loss surface curves much more steeply in one direction than in another, Momentum circumvents this by adding the update vector of the previous time step, scaled by a factor γ, usually around 0.9 (1). As an analogy, one can think of a ball rolling down the gradient, gathering momentum (hence the name), while still being slowed by air resistance (0 < γ < 1).
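• A sketch of the momentum update on a toy quadratic objective J(θ) = θ², with assumed values η = 0.01 and γ = 0.9:

import numpy as np

def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
    # v_t = gamma * v_{t-1} + lr * grad;  theta = theta - v_t
    v = gamma * v + lr * grad
    theta = theta - v
    return theta, v

theta, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    grad = 2 * theta             # gradient of theta^2
    theta, v = momentum_step(theta, v, grad)
print(theta)                      # converges toward 0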
• Nesterov accelerated gradient can be seen as a further enhancement to
momentum.
• v_t = γ · v_{t−1} + η · ∇_θ J(θ − γ · v_{t−1})
• θ = θ − v_t
• This algorithm adds a guess of the next step in the form of the look-ahead term γ · v_{t−1}. A comparison of the first two steps of Momentum and Nesterov accelerated gradient can be found in Figure 3. The additional term takes the error of the previous step into account, accelerating progress in comparison to plain momentum.
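• The same toy setup with the Nesterov look-ahead gradient (the objective and step sizes are again assumptions made only for the example):

import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    # Evaluate the gradient at the look-ahead point theta - gamma * v
    grad = grad_fn(theta - gamma * v)
    v = gamma * v + lr * grad
    theta = theta - v
    return theta, v

theta, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    theta, v = nesterov_step(theta, v, lambda t: 2 * t)   # gradient of theta^2
print(theta)                                               # converges toward 0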
• Contrary to Nesterov accelerated gradient, Adagrad adapts its learning rate η during its run-time and updates each parameter θ_i separately at each time step t. It has to do so because η is adapted for every θ_i individually.
• θ_{t+1,i} = θ_{t,i} − (η / √(G_{t,ii} + ε)) · ∇_θ J(θ_i)
• G_t is a diagonal matrix whose entries G_{t,ii} contain the sum of the squared past gradients with respect to each θ_i.
• ε is a smoothing term used to avoid division by zero and is generally insignificantly small (≈ 10⁻⁸).
• Due to the accumulation of the squared gradients in G_t, the effective learning rate η / √(G_{t,ii} + ε) becomes smaller over time, eventually shrinking so much that the algorithm acquires no new knowledge.
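• A sketch of the Adagrad update on the same toy quadratic objective; the learning rate η = 0.5 and the starting point are assumptions for the example:

import numpy as np

def adagrad_step(theta, grad, G, lr=0.5, eps=1e-8):
    # Per-parameter rate lr / sqrt(G + eps), where G accumulates squared gradients
    G = G + grad ** 2
    theta = theta - lr * grad / np.sqrt(G + eps)
    return theta, G

theta, G = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(500):
    grad = 2 * theta             # gradient of sum(theta^2)
    theta, G = adagrad_step(theta, grad, G)
print(theta)                      # each coordinate shrinks toward 0 at its own adapted rate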