
Neural Networks Course 2


1. Setting up your Machine Learning Application


1.1 Train/Dev/Test

Application areas: NLP, computer vision, speech, structured data.

What I have seen is that intuitions from one domain or application area often do not
transfer to other application areas.

And the best choices may depend on the amount of data you have, the number of input features,
your computer configuration, and whether you are training on GPUs or CPUs. So it is
impossible to guess the best choice of hyperparameters the very first time.

The question is how efficiently you can go around this cycle of iterations.

If you have data, you split it into three parts: a portion becomes your training set, a portion
becomes your hold-out cross validation set (sometimes called the development set, or dev set for
brevity), and a final portion becomes your test set.

Model selection is done with the dev set, so that the best option found is then evaluated on the
test set.

This is done to get an unbiased view of how well the model is performing.

A common way to do the split is:

Train/Test 70/30 %

Train/dev/test 60/20/20 %

The goal of the dev set is to try different algorithms on it and see which one works best.

These suggested percentages are rules of thumb in deep learning; however, in problems such as
big data, where you have millions of examples, 20% may be far too much. So there are clearly
exceptions to the rule.
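A minimal sketch of such a split with NumPy (X and Y here are hypothetical toy arrays with one
example per column; for very large datasets you would only change the ratios):

import numpy as np

m = 10000                                   # number of examples
X = np.random.randn(5, m)                   # toy features (5 per example)
Y = (np.random.rand(1, m) > 0.5).astype(int)

perm = np.random.permutation(m)             # shuffle before splitting
n_train, n_dev = int(0.6 * m), int(0.2 * m)

train_idx = perm[:n_train]
dev_idx = perm[n_train:n_train + n_dev]
test_idx = perm[n_train + n_dev:]

X_train, Y_train = X[:, train_idx], Y[:, train_idx]
X_dev, Y_dev = X[:, dev_idx], Y[:, dev_idx]
X_test, Y_test = X[:, test_idx], Y[:, test_idx]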

Mismatched train/test distribution

For example, for the training data you use high-resolution cat photos found on the internet,
while for the dev/test sets you use images taken by users with low-resolution cameras.

Important

- Make sure that the dev and test sets come from the same distribution.

Not having a test set might be okay (the dev set alone can be enough).

1.2 Bias/Variance

The ideal is to find a middle ground between bias and variance, so that our network neither
underfits nor overfits.

Some examples:

Train set error: 1%

Dev set error: 11%

This algorithm seems to have overfit the training set, since it gets good results on the training
set but not on the dev (or test) set. (Remember that the test set is not mandatory; its role can
be played by the dev set.)

From the above we conclude that there is high variance.


On the other hand,

Train set error: 15%

Dev set error: 16%

Compared with an error rate of approximately zero percent for humans identifying a cat image,
we can say that the model is not fitting the data properly. That is, it has very high bias.

Another example,

Train set error: 15%

Dev set error: 30%

In this case we have both high bias and high variance, which is the worst of both worlds.

One last example:

Train set error: 0.5%

Dev set error: 1%

Here we would have low variance and low bias.

Important: all of these analyses are made relative to an optimal error rate, better known as the
Bayes error.

The train set error (relative to Bayes error) relates to bias, while the gap between train and
dev set error relates to variance.

An example of high variance together with high bias could be:

It is hard to visualize in two dimensions, but in higher-dimensional problems it is possible to
get regions where bias is high and other regions where variance is high.
Important: if you have high variance or high bias, there are different paths to improve the
network, which will be covered as the course develops.

1.3 Basic Recipe for Machine Learning


1. High bias? (Training data performance.) Then you can try a bigger network: more layers or
more units per layer, train it for longer, or use more advanced optimization algorithms.
Another possibility, which may or may not work, is to find another neural network
architecture that fits the data better.
Using a bigger network almost always works; training longer is less effective, but it does
no harm.
2. High variance? (Dev set performance.) Get more data (when possible). Try regularization.
Try another neural network architecture.

As long as you are regularizing, training a bigger network is almost always useful.

2. Regularizing your neural network


Note: a clarification for the upcoming formulas.

The Frobenius norm formula:

||W^[l]||_F^2 = sum over i = 1..n^[l] and j = 1..n^[l-1] of (w_ij^[l])^2

The rows "i" of the matrix run over the number of neurons in the current layer, n^[l],
whereas the columns "j" of the weight matrix run over the number of neurons in the previous
layer, n^[l-1].

2.1 Regularization
Logistic Regression

L2 regularization is when you use the (squared) Euclidean norm: the cost becomes
J(w, b) + (lambda/2m)·||w||_2^2.

Why is only the parameter w regularized and not b? In practice you could also regularize b,
but it is usually omitted, because w generally has much higher dimensionality, while b is just
a single number.

L2 regularization is the most common type of regularization.

It is also possible to use L1 regularization. Some people say it allows you to compress the
model, and less memory is required to store it.

Lambda is known as the regularization parameter.

The above applies to logistic regression; now:

Neural Network

For a neural network the penalty uses the Frobenius norm of each weight matrix:
J = (1/m)·(sum of losses) + (lambda/2m)·(sum over l of ||W^[l]||_F^2).
This regularization is also known as "weight decay", because the update effectively multiplies
each W^[l] by a factor slightly less than 1.
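A minimal sketch of the regularized cost and the weight-decay effect for a single layer (the
names W, dW, lambd and the placeholder values are illustrative, following the notation used
later in the notebooks):

import numpy as np

m = 100                          # number of examples
lambd = 0.7                      # regularization parameter (lambda)
W = np.random.randn(4, 3)        # weights of one layer
cross_entropy_cost = 0.35        # assume computed by forward propagation

# Frobenius-norm penalty for this layer
L2_cost = (lambd / (2 * m)) * np.sum(np.square(W))
cost = cross_entropy_cost + L2_cost

# The gradient picks up the extra term (lambda/m) * W ...
dW_from_backprop = np.random.randn(*W.shape)   # placeholder for the backprop gradient
dW = dW_from_backprop + (lambd / m) * W

# ... so the update shrinks W a little on each step: "weight decay"
learning_rate = 0.1
W = W - learning_rate * dW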


2.2 Why regularization reduces overfitting

Intuitively, if you give lambda a high value and minimize the cost, the minimization will push
the parameters w toward smaller values, which moves you toward the high-bias case, i.e.,
effectively a smaller network.

But in reality you are not getting a network with fewer hidden units or fewer nodes. What
happens is that the impact of each of them on the network is reduced.
Another attempt at intuition: as seen in the previous course, a network with linear activation
functions, even a deep one, is just a linear network, which cannot fit very complex models.
With a regularized (small) W, z stays small, so activation functions like tanh operate in their
nearly linear region, as shown in the following image, which prevents overfitting.

There is another type of regularization: dropout.

2.3 Dropout Regularization

Basically, what it does is, according to some probability, randomly eliminate nodes, and then
carry out forward and backward propagation on the thinned network.

There are several ways to implement this technique; the most common is Inverted Dropout:

Representation at layer l = 3.

keep_prob is a number that determines the probability that a node is kept.

A keep_prob of 0.8 means there is a 0.2 probability of eliminating a node.

A mask vector is implemented as shown:

Then the activations of the layer are taken; since the vector d3 contains False or True values,
the multiplication interprets them as 0 or 1.

Finally, the activations are divided by the parameter keep_prob.

This last line of code guarantees that the expected value of a3 seen by the next layer stays
the same.
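Since the code screenshots are missing here, a minimal sketch of those steps for layer 3 (a3 is
assumed to hold the activations of layer 3):

import numpy as np

keep_prob = 0.8
a3 = np.random.randn(5, 10)                                  # toy activations for layer 3

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob    # boolean mask, True ~80% of the time
a3 = a3 * d3                                                 # shut down ~20% of the nodes
a3 = a3 / keep_prob                                          # inverted dropout: keep E[a3] unchanged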

At test time, dropout is not used, since it would insert noise into the predictions.

Reminder: In general, the number of neurons in the previous layer gives us the number of
columns of the weight matrix, and the number of neurons in the current layer gives us the number
of rows in the weight matrix.

2.4 Why does drop-out work?

What it does is similar to L2 regularization, shrinking the effective value of W. It is possible
to set the keep_prob parameter per layer, using 1 for layers where you do not want to eliminate
any node. One disadvantage is that the cost function J is no longer well defined; so, to check
that it is decreasing on every iteration, it is advisable to first set the parameter to 1,
verify that the cost decreases, and only then turn the dropout technique back on.

2.5 Other regularization methods

Data augmentation: for image recognition, when the model is overfitted, one technique is to
flip the image horizontally or to zoom in on it. These "new" images become additional data
that are still cats.

Early Stopping

Stop the training before w takes values that are too large. As seen in the image, the training
set error and the cost should decrease with each iteration. However, the dev set error usually
starts increasing at a certain point, which is where overfitting begins to appear.

Orthogonalization principle

Separate the work on the network into two steps:

1. Optimize the cost function, where we do not care which values of w and b are found, as long
as they minimize the function.
2. Avoid overfitting: apply regularization, more data, etc., to remove overfitting if it exists.

The problem with early stopping is that the two tasks above are no longer handled
independently: stopping the iterations early prevents a proper optimization of the cost
function.
On the other hand, L2 regularization involves a larger computational effort to find a suitable
value of the parameter lambda.

When the computational cost is not an obstacle, L2 regularization is preferred, since it allows
you to apply the orthogonalization principle.

3. Setting up your optimization problem

When normalizing, the mean-centered vector x should be divided by σ, so the formula for
normalization is: subtract the mean and divide by the standard deviation, x := (x − μ)/σ.

3.1 Normalizing inputs

One way to speed up training of the network is to normalize the (training) input data, which is
done in two steps:

1. Subtract the mean μ (mu): this step shifts the mean to zero.
2. Normalize the variance (σ): the variance of one feature can be much larger than another's;
the goal is a more uniform variance across features.

The steps are shown graphically; a code sketch follows below.
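A minimal sketch with NumPy, normalizing per feature over a training matrix X of shape
(features, examples); the toy X is an assumption for illustration:

import numpy as np

X = np.random.randn(2, 500) * np.array([[1000.0], [1.0]])  # features on very different scales

mu = np.mean(X, axis=1, keepdims=True)     # per-feature mean
X = X - mu                                 # step 1: zero-center
sigma = np.std(X, axis=1, keepdims=True)   # per-feature standard deviation
X = X / sigma                              # step 2: uniform (unit) variance

# Use the SAME mu and sigma to normalize the dev/test sets.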

3.1.1 Why normalize inputs?

Reminder: the cost function is J(w, b) = (1/m)·Σ L(ŷ, y).

If you insert the values without normalizing, you will most likely get something like an
elongated, squished cost surface:

Feature x1 may take values between 1 and 1000 while feature x2 takes values between 0 and 1,
which leads to very different scales for the values of w1 and w2.

By normalizing, on the other hand, it is possible to obtain a cost function that is on average
more symmetric. That is, normalization makes optimizing the cost easier, because gradient
descent can find a more direct path toward the minimum.

This matters most when the input features have very different scales; however, applying it
always does no harm. So it is advisable to do it even when you suspect the scales may already
be similar.

3.2 Vanishing/Exploding gradients

One of the problems when training neural networks, especially very deep ones, is vanishing and
exploding gradients: it means that while you are training, the gradients can become extremely
small or extremely large.

To simplify the following explanation, assume the parameter b is zero and the activation
function is linear, i.e., g(z) = z.
Suppose every weight matrix is like the one shown, say 1.5 times the identity; that factor
persists through the matrix multiplications of every layer. This means an exponential growth of
the value of y_hat, by a factor of, in this example, 1.5^L.

If instead the value is, for example, 0.5,

the value of y_hat will be scaled by the factor 0.5^L.

For very deep networks this is a big problem, and until recently it was a limiting factor for
implementing deep neural networks. Currently there is a partial solution that does not fix it
completely but helps a lot: careful initialization of the parameters.
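A small numeric illustration of the effect (a sketch under the assumptions above: linear
activations, b = 0, and every W^[l] equal to 1.5 or 0.5 times the identity):

import numpy as np

L = 50                            # number of layers
x = np.ones((2, 1))               # toy input

for factor in (1.5, 0.5):
    W = factor * np.eye(2)        # the same weight matrix at every layer
    a = x
    for _ in range(L):
        a = W @ a                 # linear activation: a = g(z) = z
    print(factor, a[0, 0])        # ~1.5^50 (huge) vs ~0.5^50 (vanishingly small)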

3.3 Weight Initialization for deep networks


3.3.1 Simple neuron example
Ignoring the value of b again: the larger the number n of input features, the smaller you want
the individual weights wi to be, since z will be the result of summing a large number of such
terms.

It turns out that what you want is for the variance of the wi to be 1/n,

and to achieve that, you can multiply the randomly initialized weights by sqrt(1/n).

The adjustment uses the number of nodes of the layer previous to the one being evaluated, since
those are the inputs to that layer.

If you are using an activation function like ReLU, instead of 1/n the variance should be 2/n.

This modification of W sets the variance to the desired value, by multiplying the randomly
initialized matrix by the square root above.

This keeps the values of the weight matrices W from being much larger than 1 or much smaller
than 1.

The factor multiplying the weight matrix depends on the type of activation function being used;
the values can be found in the literature.
The most common value for ReLU is the one shown above, and for tanh it is sqrt(1/n^[l-1])
(Xavier initialization).

This factor can be treated as one more hyperparameter to tune; however, its effect is modest,
so it is not the first parameter one would try to modify. Still, on some occasions it can solve
problems reasonably well.
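A minimal sketch of this initialization for each layer l (layers_dims is a hypothetical list
with the number of units per layer):

import numpy as np

layers_dims = [5, 4, 3, 1]        # n^[0] ... n^[3]

parameters = {}
for l in range(1, len(layers_dims)):
    # He initialization: variance 2/n^[l-1], suited to ReLU.
    # For tanh, replace 2. with 1. (Xavier initialization).
    parameters["W" + str(l)] = (np.random.randn(layers_dims[l], layers_dims[l - 1])
                                * np.sqrt(2. / layers_dims[l - 1]))
    parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))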

3.4 Numerical approximation of gradients

When you implement backpropagation, it is convenient to have a method for checking that the
internal gradients are computed correctly.

Using a two-sided difference, (f(θ+ε) − f(θ−ε)) / (2ε), you get a smaller error when
approximating the derivative than with a one-sided difference.
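A quick numeric check of the two-sided difference against the one-sided one (a sketch, using
f(θ) = θ^3, whose exact derivative is 3θ^2):

f = lambda theta: theta ** 3
theta, eps = 1.0, 0.01

two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)   # error O(eps^2)
one_sided = (f(theta + eps) - f(theta)) / eps               # error O(eps)
exact = 3 * theta ** 2

print(abs(two_sided - exact))   # ~1e-4
print(abs(one_sided - exact))   # ~3e-2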


3.5 Gradient checking
Take W1, b1, …, WL, bL and reshape them into a big vector theta.

And take dW1, db1, …, dWL, dbL and reshape them into a big vector dtheta.

Having concatenated every value into one big vector, the question is: is dtheta the gradient of
the cost function J(theta)?

You then have two vectors, the backprop derivative and the numeric approximation, so you must
perform the check:

||dtheta_approx − dtheta||_2 / (||dtheta_approx||_2 + ||dtheta||_2)

In practice, you can use 10^-7 for the value of epsilon.

If the result of the check is around 10^-7, it is fine. But if it is of the order of 10^-5 or
larger, it is best to look closely for a bug.

3.6 Gradient checking implementation notes

1. Don't use it in training – only to debug.
2. If the algorithm fails grad check, look at the components to try to identify the bug. This
helps you focus the search for the origin of the errors: the components that differ most
from the numeric approximation may be some values of db or dW in a particular layer, which
lets you find the bug faster.
3. Remember regularization: if you are using it, the gradients must include the regularization
term.
4. It doesn't work with dropout (turn dropout off while grad checking, e.g. keep_prob = 1).
5. Run at random initialization; perhaps run it again after some training.

4. Practice questions: Practical aspects of deep learning


1. If you have 10,000,000 examples, how would you split the train/dev/test set?

2. The dev and test set should:

3. If your Neural Network model seems to have high bias, what of the following would be
promising things to try? (Check all that apply.)
4. You are working on an automated check-out kiosk for a supermarket, and are building a
classifier for apples, bananas, and oranges. Suppose your classifier obtains a training set
error of 0.5%, and a dev set error of 7%. Which of the following are promising things to try
to improve your classifier?

5. What is weight decay?


6. What happens when you increase the regularization hyperparameter lambda?

7. With the inverted dropout technique, at test time:

8. Increasing the parameter keep_prob from (say) 0.5 to 0.6 will likely cause the following:
9. Which of these techniques are useful for reducing variance (reducing overfitting)?
10. Why do we normalize the inputs x?

5. Notebook: Initialization
By completing this assignment you will:
- Understand different initialization methods and their impact on your model performance
- Implement zero initialization and see that it fails to "break symmetry"
- Recognize that random initialization "breaks symmetry" and yields more efficient models
- Understand that you can use both random initialization and scaling to get even better
training performance on your model

Initialization

Training your neural network requires specifying an initial value of the weights. A well chosen
initialization method will help learning. If you completed the previous course of this
specialization, you probably followed our instructions for weight initialization, and it has
worked out so far. But how do you choose the initialization for a new neural network? In this
notebook, you will see how different initializations lead to different results.

A well chosen initialization can:

- Speed up the convergence of gradient descent


- Increase the odds of gradient descent converging to a lower training (and generalization)
error

To get started, run the following cell to load the packages and the planar dataset you will try
to classify.

You would like a classifier to separate the blue dots from the red dots.

5.1 Neural Network model


You will use a 3-layer neural network (already implemented for you). Here are the initialization
methods you will experiment with:

- Zeros initialization -- setting initialization = "zeros" in the input argument


- Random initialization -- setting initialization = "random" in the input argument. This
initializes the weights to large random values.
- He initialization -- setting initialization = "he" in the input argument. This initializes the
weights to random values scaled according to a paper by He et al., 2015.

Model

Arguments:

X -- input data, of shape (2, number of examples)

Y -- true "label" vector (containing 0 for red dots; 1 for blue dots), of shape (1, number of
examples)

learning_rate -- learning rate for gradient descent

num_iterations -- number of iterations to run gradient descent

print_cost -- if True, print the cost every 1000 iterations

initialization -- flag to choose which initialization to use ("zeros","random" or "he")

Returns:

parameters -- parameters learnt by the model


5.2 Zero Initialization
There are two types of parameters to initialize in a neural network:

- the weight matrices


- the bias vectors

Implement the following function to initialize all parameters to zeros. You'll see later that
this does not work well, since it fails to "break symmetry", but let's try it anyway and see
what happens.

The layer-dimensions vector contains the number of nodes in each layer, i.e., each n[l].

So, by taking the length of that vector we get the number of layers in the network.
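A minimal sketch of that function (reconstructed, since the notebook cell is not shown here):

import numpy as np

def initialize_parameters_zeros(layers_dims):
    parameters = {}
    L = len(layers_dims)            # number of layers, from the dimensions vector
    for l in range(1, L):
        parameters["W" + str(l)] = np.zeros((layers_dims[l], layers_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters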


Now the code is run to train the model for 15,000 iterations, which produces the following:

The performance is really bad: the cost does not really decrease, and the algorithm performs no
better than random guessing. Why? Let's look at the details of the predictions and the decision
boundary:
The model is predicting 0 for every example.

In general, initializing all the weights to zero results in the network failing to break symmetry. This
means that every neuron in each layer will learn the same thing, and you might as well be training
a neural network with n[l]=1 for every layer, and the network is no more powerful than a linear
classifier such as logistic regression.

5.2.1 What you should remember:


1. The weights W[l] should be initialized randomly to break symmetry.
2. It is however okay to initialize the biases b[l] to zeros. Symmetry is still broken so long as
W[l] is initialized randomly.

5.3 Random initialization


To break symmetry, let's initialize the weights randomly. Following random initialization, each
neuron can then proceed to learn a different function of its inputs. In this exercise, you will
see what happens if the weights are initialized randomly, but to very large values.

Exercise:

Implement the following function to initialize your weights to large random values (scaled by *10)
and your biases to zeros. We are using a fixed np.random.seed(..) to make sure your "random"
weights match ours, so don't worry if running several times your code gives you always the same
initial values for the parameters.

The original code generated an error because of the double parentheses; the error was of type:
non-integer arguments. (np.random.randn takes the dimensions as separate arguments, unlike
np.zeros, which takes a shape tuple.)

The correct code is sketched below.
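A sketch of that cell (reconstructed; the *10 scaling is the point of this exercise):

import numpy as np

def initialize_parameters_random(layers_dims):
    np.random.seed(3)               # fixed seed so the "random" weights are reproducible
    parameters = {}
    L = len(layers_dims)
    for l in range(1, L):
        # note: np.random.randn(a, b), NOT np.random.randn((a, b))
        parameters["W" + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * 10
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters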

Run the following code to train your model on 15,000 iterations using random initialization.
If you see "inf" as the cost after the iteration 0, this is because of numerical roundoff; a more
numerically sophisticated implementation would fix this. But this isn't worth worrying about for
our purposes.

Anyway, it looks like you have broken symmetry, and this gives better results than before. The
model is no longer outputting all 0s.
Observations

- The cost starts very high. This is because with large random-valued weights, the last
activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and
when it gets that example wrong it incurs a very high loss for that example. Indeed, when
log(a^[3]) = log(0), the loss goes to infinity.
- Poor initialization can lead to vanishing/exploding gradients, which also slows down the
optimization algorithm.
- If you train this network longer you will see better results, but initializing with overly large
random numbers slows down the optimization.

In summary

- Initializing weights to very large random values does not work well.
- Hopefully initializing with small random values does better. The important question is: how
small should these random values be? Let's find out in the next part!
5.4 He initialization
Finally, try "He initialization"; this is named for the first author of He et al., 2015. (If you
have heard of "Xavier initialization", this is similar except Xavier initialization uses a
scaling factor for the weights W[l] of sqrt(1./layers_dims[l-1]) where He initialization would
use sqrt(2./layers_dims[l-1]).)
Observations:

The model with He initialization separates the blue and the red dots very well in a small number of
iterations.

5.5 Conclusions
- You have seen three different types of initializations. For the same number of iterations
and same hyperparameters the comparison is:
6. Notebook: Regularization

By completing this assignment you will:

- Understand the different regularization methods that could help your model.
- Implement dropout and see it work on data.
- Recognize that a model without regularization gives you a better accuracy on the training
set but not necessarily on the test set.
- Understand that you could use both dropout and regularization on your model.

6.1 Regularization
Deep Learning models have so much flexibility and capacity that overfitting can be a serious
problem, if the training dataset is not big enough. Sure it does well on the training set, but the
learned network doesn't generalize to new examples that it has never seen!

6.2 Problem Statement


You have just been hired as an AI expert by the French Football Corporation. They would like you
to recommend positions where France's goal keeper should kick the ball so that the French team's
players can then hit it with their head.
They give you the following 2D dataset from France's past 10 games.

Each dot corresponds to a position on the football field where a football player has hit the ball
with his/her head after the French goal keeper has shot the ball from the left side of the football
field.

- If the dot is blue, it means the French player managed to hit the ball with his/her head
- If the dot is red, it means the other team's player hit the ball with their head

Your goal: Use a deep learning model to find the positions on the field where the goalkeeper
should kick the ball.
6.3 Analysis of the dataset
This dataset is a little noisy, but it looks like a diagonal line separating the upper left half (blue)
from the lower right half (red) would work well.

You will first try a non-regularized model. Then you'll learn how to regularize it and decide which
model you will choose to solve the French Football Corporation's problem.

6.4 Non-regularized model


You will use the following neural network (already implemented for you below). This model can be
used:

- in regularization mode -- by setting the lambd input to a non-zero value. We use "lambd"
instead of "lambda" because "lambda" is a reserved keyword in Python.
- in dropout mode -- by setting the keep_prob to a value less than one

You will first try the model without any regularization. Then, you will implement:

- L2 regularization
- Dropout

In each part, you will run this model with the correct inputs so that it calls the functions you've
implemented. Take a look at the code below to familiarize yourself with the model.

The code for the neural network is provided beforehand, including dedicated functions for each
regularization case.

Arguments:

X -- input data, of shape (input size, number of examples)

Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (output size, number of
examples)

learning_rate -- learning rate of the optimization

num_iterations -- number of iterations of the optimization loop

print_cost -- If True, print the cost every 10000 iterations

lambd -- regularization hyperparameter, scalar

keep_prob - probability of keeping a neuron active during drop-out, scalar.

Returns:

parameters -- parameters learned by the model. They can then be used to predict.
Let's train the model without any regularization, and observe the accuracy on the train/test sets.
The train accuracy is 94.8% while the test accuracy is 91.5%. This is the baseline model (you will
observe the impact of regularization on this model). Run the following code to plot the decision
boundary of your model.
The non-regularized model is obviously overfitting the training set: it is fitting the noisy
points! Let's now look at two techniques to reduce overfitting.

6.5 L2 Regularization
The standard way to avoid overfitting is called L2 regularization. It consists of appropriately
modifying your cost function, from:

To:

Let's modify your cost and observe the consequences.

Use np.sum(np.square(Wl)) to calculate the sum of squares of each weight matrix. Note that you
must do this for W1, W2 and W3, then sum the three terms and multiply by (1/m)(lambd/2).

That is, the cost will consist of the already familiar cost formula, cross_entropy_cost =
compute_cost(A3, Y), plus the regularization part, which is computed as:

L2_regularization_cost = (1/m)*(lambd/2)*(np.sum(np.square(W1))+np.sum(np.square(W2))
+np.sum(np.square(W3)))

If there is a larger number of layers, you may need a for loop to sum the values over all
layers. I still don't know whether it can be vectorized.
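A sketch of that generalization with a loop (it cannot easily be fully vectorized, since the W
matrices have different shapes, but a comprehension keeps it compact; parameters and num_layers
are assumed names for the usual parameter dictionary and layer count):

L2_regularization_cost = (lambd / (2 * m)) * sum(
    np.sum(np.square(parameters["W" + str(l)]))
    for l in range(1, num_layers + 1))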

Of course, because you changed the cost, you have to change backward propagation as well! All
the gradients have to be computed with respect to this new cost.

Implement the changes needed in backward propagation to take into account regularization. The
changes only concern dW1, dW2 and dW3. For each, you have to add the regularization term's
gradient, (lambd/m)·Wl.

For the following step it is important to have a cache with some values from the forward
propagation pass, as has been done so far.
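A sketch of one of those gradients (following the 1./m convention the notebook uses; dZ3, A2
and W3 are assumed to come from the cache):

dW3 = 1. / m * np.dot(dZ3, A2.T) + (lambd / m) * W3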
Let's now run the model with L2 regularization (λ = 0.7).

Congrats, the test set accuracy increased to 93%. You have saved the French football team!

You are not overfitting the training data anymore. Let's plot the decision boundary.
Observations:

- The value of λ is a hyperparameter that you can tune using a dev set.
- L2 regularization makes your decision boundary smoother. If λ is too large, it is also
possible to "oversmooth", resulting in a model with high bias.

That is, if you regularize too much, you may end up with a model whose bias is too high, which
should be avoided. It is also noted that the value of λ can be tuned using the dev set.

What is L2-regularization doing?

L2-regularization relies on the assumption that a model with small weights is simpler than a model
with large weights. Thus, by penalizing the square values of the weights in the cost function you
drive all the weights to smaller values. It becomes too costly for the cost to have large weights!
This leads to a smoother model in which the output changes more slowly as the input changes.

What you should remember -- the implications of L2-regularization on:

1. The cost computation:


- A regularization term is added to the cost
2. The backpropagation function:
- There are extra terms in the gradients with respect to weight matrices
3. Weights end up smaller ("weight decay")
- Weights are pushed to smaller values

6.6 Dropout
Finally, dropout is a widely used regularization technique that is specific to deep learning. It
randomly shuts down some neurons in each iteration.

At each iteration, you shut down (= set to zero) each neuron of a layer with probability
1−keep_prob or keep it with probability keep_prob. The dropped neurons don't contribute to the
training in both the forward and backward propagations of the iteration.

When you shut some neurons down, you actually modify your model. The idea behind drop-out is
that at each iteration, you train a different model that uses only a subset of your neurons. With
dropout, your neurons thus become less sensitive to the activation of one other specific neuron,
because that other neuron might be shut down at any time.

6.6.1 Forward propagation with dropout


Exercise: Implement the forward propagation with dropout. You are using a 3 layer neural
network, and will add dropout to the first and second hidden layers. We will not apply dropout to
the input layer or output layer.

To do that, you are going to carry out 4 Steps:

1. In lecture, we discussed creating a variable d[1] with the same shape as a[1] using
np.random.rand() to randomly get numbers between 0 and 1. Here, you will use a
vectorized implementation, so create a random matrix D[1] of the same dimension as A[1].

2. Set each entry of D[1] to be 1 with probability (keep_prob), and 0 otherwise.

Hint: Let's say that keep_prob = 0.8, which means that we want to keep about 80% of the neurons
and drop out about 20% of them. We want to generate a vector that has 1's and 0's, where about
80% of them are 1 and about 20% are 0. This python statement:

X = (X < keep_prob).astype(int)

is conceptually the same as an if-else statement that, for each entry, keeps a 1 when the random
value is below keep_prob and a 0 otherwise (for the simple case of a one-dimensional array).

Note that the X = (X < keep_prob).astype(int) works with multi-dimensional arrays, and the
resulting output preserves the dimensions of the input array.

Also note that without using .astype(int), the result is an array of booleans True and False, which
Python automatically converts to 1 and 0 if we multiply it with numbers. (However, it's better
practice to convert data into the data type that we intend, so try using .astype(int).)

3. Set A[1] to A[1]∗D[1]. (You are shutting down some neurons). You can think of D[1] as a
mask, so that when it is multiplied with another matrix, it shuts down some of the values.
4. Divide A[1] by keep_prob. By doing this you are assuring that the result of the cost will still
have the same expected value as without drop-out. (This technique is also called inverted
dropout.)
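A sketch of the four steps for layer 1 in the notebook's notation (A1 is assumed to come from
the usual forward propagation):

D1 = np.random.rand(A1.shape[0], A1.shape[1])   # step 1: random matrix, same shape as A1
D1 = (D1 < keep_prob).astype(int)               # step 2: threshold to 0/1
A1 = A1 * D1                                    # step 3: shut down some neurons
A1 = A1 / keep_prob                             # step 4: inverted dropout scaling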

6.6.2 Backward propagation with dropout

Backpropagation with dropout is actually quite easy. You will have to carry out 2 steps:
1. You had previously shut down some neurons during forward propagation, by applying a
mask D[1] to A1. In backpropagation, you will have to shut down the same neurons, by
reapplying the same mask D[1] to dA1.
(Important: the same neurons must be eliminated in the backward pass to stay consistent
with the model.)
2. During forward propagation, you had divided A1 by keep_prob. In backpropagation, you'll
therefore have to divide dA1 by keep_prob again (the calculus interpretation is that if A[1]
is scaled by keep_prob, then its derivative dA[1] is also scaled by the same keep_prob).
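A sketch of those two steps (D1 is the same mask saved from the forward pass):

dA1 = dA1 * D1          # step 1: reapply the mask used in forward propagation
dA1 = dA1 / keep_prob   # step 2: rescale, matching the forward division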

Personal note: the professor always uses 1./m. I don't know why; it must be so that the value
is treated as a float, perhaps.

dW3 = 1./m * np.dot(dZ3, A2.T)


Let's now run the model with dropout (keep_prob = 0.86). It means that at every iteration you
shut down each neuron of layers 1 and 2 with 14% probability.

The error shown there is not explained in the notebook. However, it apparently comes from a
division by zero during the evaluation of the model.
Note:

- A common mistake when using dropout is to use it both in training and testing. You should
use dropout (randomly eliminate nodes) only in training.
- Deep learning frameworks like tensorflow, PaddlePaddle, keras or caffe come with a
dropout layer implementation.

6.6.3 What you should remember about dropout:


- Dropout is a regularization technique.
- You only use dropout during training. Don't use dropout (randomly eliminate nodes)
during test time.
- Apply dropout both during forward and backward propagation.
- During training time, divide each dropout layer by keep_prob to keep the same expected
value for the activations. For example, if keep_prob is 0.5, then we will on average shut
down half the nodes, so the output will be scaled by 0.5 since only the remaining half are
contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the
output now has the same expected value. You can check that this works even when
keep_prob is other values than 0.5.

6.7 Conclusions
Here are the results of our three models:

Note that regularization hurts training set performance! This is because it limits the ability of the
network to overfit to the training set. But since it ultimately gives better test accuracy, it is helping
your system.
7. Notebook: Gradient Checking
By completing this assignment, you will:

- Implement gradient checking from scratch


- Understand how to use the difference formula to check your backpropagation
implementation.
- Recognize that your backpropagation algorithm should give you similar results as the ones
you got by computing the difference formula.
- Learn how to identify which parameter's gradient was computed incorrectly.

Goal

You are part of a team working to make mobile payments available globally, and are asked to build
a deep learning model to detect fraud--whenever someone makes a payment, you want to see if
the payment might be fraudulent, such as if the user's account has been taken over by a hacker.

But backpropagation is quite challenging to implement, and sometimes has bugs. Because this is a
mission-critical application, your company's CEO wants to be really certain that your
implementation of backpropagation is correct. Your CEO says, "Give me a proof that your
backpropagation is actually working!" To give this reassurance, you are going to use "gradient
checking".

7.1 How does gradient checking work?

Backpropagation computes the gradients ∂J/∂θ, where θ denotes the parameters of the model. J
is computed using forward propagation and your loss function.

Because forward propagation is relatively easy to implement, you're confident you got that
right, and so you're almost 100% sure that you're computing the cost J correctly. Thus, you can
use your code for computing J to verify the code for computing ∂J/∂θ.

Let's look back at the definition of a derivative (or gradient):

∂J/∂θ = lim (ε→0) [J(θ+ε) − J(θ−ε)] / (2ε)

We know the following:

- ∂J/∂θ is what you want to make sure you are computing correctly.
- You can compute J(θ+ε) and J(θ−ε) (in the case that θ is a real number), since you are
confident your implementation for J is correct.

7.2 1-dimensional gradient checking

Consider a 1D linear function J(θ) = θx. The model contains only a single real-valued
parameter θ, and takes x as input.

You will implement code to compute J(.) and its derivative. You will then use gradient checking
to make sure your derivative computation for J is correct.

The diagram above shows the key computation steps: first start with x, then evaluate the
function J(x) ("forward propagation"). Then compute the derivative ∂J/∂θ ("backward
propagation").

J -- the value of function J, computed using the formula J(theta) = theta * x

Now, implement the backward propagation step (derivative computation) of Figure 1. That is,
compute the derivative of J(θ) = θx with respect to θ. To save you from doing the calculus, you
should get dtheta = ∂J/∂θ = x.

Instructions:

1. First compute "gradapprox" using the formula above (1) and a small value of ε. Here are the
steps to follow: θ⁺ = θ + ε, θ⁻ = θ − ε, J⁺ = J(θ⁺), J⁻ = J(θ⁻),
gradapprox = (J⁺ − J⁻) / (2ε).
2. Then compute the gradient using backward propagation, and store the result in a variable
"grad".
3. Finally, compute the relative difference between "gradapprox" and "grad" using the
following formula:

difference = ||grad − gradapprox||_2 / (||grad||_2 + ||gradapprox||_2)

You will need 3 steps to compute this formula:

1. compute the numerator using np.linalg.norm(...)
2. compute the denominator. You will need to call np.linalg.norm(...) twice.
3. divide them.

If this difference is small (say less than 10^-7), you can be quite confident that you have
computed your gradient correctly. Otherwise, there may be a mistake in the gradient computation.
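A minimal sketch putting the three pieces together for J(θ) = θx (reconstructed; not the graded
notebook code, and the values of x and theta are illustrative):

import numpy as np

x, theta, epsilon = 2.0, 4.0, 1e-7

# forward and backward propagation for J(theta) = theta * x
J = theta * x
grad = x                                   # dJ/dtheta = x

# two-sided approximation
gradapprox = ((theta + epsilon) * x - (theta - epsilon) * x) / (2 * epsilon)

numerator = np.linalg.norm(grad - gradapprox)
denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
difference = numerator / denominator
print(difference)                          # tiny (~1e-10): the gradient is correct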

Now, in the more general case, your cost function J has more than a single 1D input. When you
are training a neural network, θ actually consists of multiple matrices W[l] and biases b[l]! It
is important to know how to do a gradient check with higher-dimensional inputs. Let's do it!

7.3 N-dimensional gradient checking


The following figure describes the forward and backward propagation of your fraud detection
model.
1. Forward propagation
2. run backward propagation.
You obtained some results on the fraud detection test set but you are not 100% sure of your
model. Nobody's perfect! Let's implement gradient checking to verify if your gradients are correct.

7.3.1 How does gradient checking work?


As in 1) and 2), you want to compare "gradapprox" to the gradient computed by backpropagation.
The formula is still:

However, θ is not a scalar anymore. It is a dictionary called "parameters". We implemented a


function "dictionary_to_vector()" for you. It converts the "parameters" dictionary into a vector
called "values", obtained by reshaping all parameters (W1, b1, W2, b2, W3, b3) into vectors and
concatenating them.

The inverse function is "vector_to_dictionary" which outputs back the "parameters" dictionary.

We have also converted the "gradients" dictionary into a vector "grad" using
gradients_to_vector(). You don't need to worry about that.

Instructions: Here is pseudo-code that will help you implement the gradient check.

For each i in num_parameters:
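The loop body (a sketch reconstructing the missing pseudo-code; parameter_values is the vector
produced by dictionary_to_vector):

# To compute J_plus[i]:
thetaplus = np.copy(parameter_values)
thetaplus[i][0] = thetaplus[i][0] + epsilon
J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaplus))

# To compute J_minus[i]: same idea with thetaminus
thetaminus = np.copy(parameter_values)
thetaminus[i][0] = thetaminus[i][0] - epsilon
J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaminus))

# Approximate the partial derivative for parameter i
gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)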


Thus, you get a vector gradapprox, where gradapprox[i] is an approximation of the gradient with
respect to parameter_values[i]. You can now compare this gradapprox vector to the gradients
vector from backpropagation.

Note: when you do x + y, numpy uses a "lowest common denominator" data type for the result.
Since x is int and y is float, this means it returns a float array.

But when you do x += y, you are forcing the result to conform to the data type of x, which is
int.

This part of the code gave me several problems, and the answer, which has to do with the note
above, was hard to find.

Additionally, I don't understand why J uses the parameter thetaplus and not thetaplus[i][0];
I don't have the answer yet. On the other hand, that vector is two-dimensional, which is why
the [i][0] indexing is used.
Using thetaplus[i][0] += epsilon solves the problem, together with J_plus[i], _ =
forward_propagation_n(X, Y, vector_to_dictionary(thetaplus)).
It seems that there were errors in the backward_propagation_n code we gave you! Good that
you've implemented the gradient check. Go back to backward_propagation and try to find/correct
the errors (Hint: check dW2 and db1). Rerun the gradient check when you think you've fixed it.
Remember you'll need to re-execute the cell defining backward_propagation_n() if you modify the
code.

Can you get gradient check to declare your derivative computation correct? Even though this part
of the assignment isn't graded, we strongly urge you to try to find the bug and re-run gradient
check until you're convinced backprop is now correctly implemented.

I corrected the errors in the code, which produces the following output.

Note:

1. Gradient checking is slow! Approximating the gradient with
∂J/∂θ ≈ (J(θ+ε) − J(θ−ε)) / (2ε) is computationally costly. For this reason, we don't run
gradient checking at every iteration during training. Just a few times to check if the
gradient is correct.
2. Gradient checking, at least as we've presented it, doesn't work with dropout. You would
usually run the gradient check algorithm without dropout to make sure your backprop is
correct, then add dropout.

8. Optimization algorithms
8.1 Mini-batch gradient descent
Batch vs. mini-batch gradient descent

So, we already saw that vectorization allows you to efficiently compute on m examples without
an explicit for loop.
What if m = 5,000,000 or bigger?

Then what you have to do is process all the training data before taking one small gradient
descent step, then process all the data again to take a second step, and so on.

It turns out you can have a faster algorithm if you let gradient descent make progress even
before it finishes processing the entire training set.

Imagine splitting the training set into smaller subsets, called mini-batches.

A new notation {t} is added to designate the corresponding mini-batch number t.

If you take subsets of 1,000 elements, you then get 5,000 mini-batches X^{1}, …, X^{5000}
(and likewise for Y).

One full pass over all the mini-batches inside that for loop is known as 1 epoch.

When you have a very large dataset, mini-batch gradient descent works much faster than batch
gradient descent, which is why everyone in deep learning tends to use it; a sketch of one epoch
follows below.
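A sketch of one epoch of mini-batch gradient descent (forward_prop, compute_cost, backward_prop
and update_parameters are hypothetical placeholders for the usual helper functions):

for t in range(num_mini_batches):              # one pass over all t: 1 epoch
    X_t, Y_t = mini_batches[t]                 # the t-th mini-batch (X^{t}, Y^{t})
    AL, caches = forward_prop(X_t, parameters)
    cost = compute_cost(AL, Y_t)
    grads = backward_prop(AL, Y_t, caches)
    parameters = update_parameters(parameters, grads, learning_rate)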

8.2 Understanding mini-batch gradient descent

8.2.1 Choosing your mini-batch size

If the mini-batch size = m, you get batch gradient descent.

The other extreme would be a mini-batch size of 1. This gives you an algorithm called
stochastic gradient descent, where every example is its own mini-batch.

Stochastic gradient descent does not always head toward the minimum, and it always does so with
a lot of noise. It can oscillate around the minimum, or around the final value, without ever
settling on a fixed value.

When you train with a single example at a time (stochastic gradient descent), you can reduce
the noise by using a smaller learning rate. But the big disadvantage is that you lose the speed
gained from vectorization, which makes it very inefficient.
The ideal is to take a value between the two extremes that lets you optimize the model. It may
happen that with mini-batches the model does not head straight to the minimum, or oscillates
near a final value, which can be addressed by decreasing the learning rate.

When should you use this algorithm?

When the training set is smaller than about 2,000 examples, you can use batch gradient descent
without problems.

1. Typical values for mini-batch sizes are powers of 2, from 64 up to 512 (64, 128, 256, 512).
2. Make sure the mini-batch size fits in CPU/GPU memory.

This becomes another hyperparameter to find iteratively: try different powers of two and choose
the one that makes gradient descent most efficient.

8.3 Exponentially weighted averages

There are some algorithms faster than plain gradient descent; to understand them you must know
exponentially weighted averages, also known in statistics as exponentially weighted moving
averages.
The (temperature) data look a bit noisy; to compute a local or moving average of the
temperature, you can do the following:

v_t = β·v_{t−1} + (1 − β)·θ_t

This produces the red line in the example. You can think of v_t as (approximately) the average
over the last 1/(1 − β) days of temperature.

The green line describes the behavior for β = 0.98.

The green curve is smoother, but also shifted to the right, since it adapts more slowly to
changes; that is, there is more latency.

What β = 0.98 means is that you give a heavy weight to what happened previously, while giving a
weight of only 0.02 to what you are seeing right now (see the formula).

If we go to the other extreme, with β = 0.5, we get an average over about 2 days, and the
yellow curve.

This makes it more susceptible to outliers, but it adapts more quickly to temperature changes.

8.4 Understanding exponentially weighted averages

Remember that the key equation to implement exponentially weighted averages is
v := β·v + (1 − β)·θ_t.
It turns out that storing each value v_t under a different name wastes memory unnecessarily,
which can be avoided by overwriting the single variable v.

8.5 Bias correction in exponentially weighted averages

Because v_0 is initialized to 0, the first value v_1 comes out very small.

To correct this bias, use:

v_t^corrected = v_t / (1 − β^t)
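A sketch computing the moving average over a toy temperature series, with and without bias
correction (the series itself is an assumption for illustration):

import numpy as np

theta = 20 + 5 * np.random.rand(100)   # toy daily temperatures
beta = 0.9                             # averages over roughly 1/(1-0.9) = 10 days

v = 0.0
for t, theta_t in enumerate(theta, start=1):
    v = beta * v + (1 - beta) * theta_t        # overwrite v: no extra memory
    v_corrected = v / (1 - beta ** t)          # bias correction for early t
    if t <= 3:
        print(t, round(v, 2), round(v_corrected, 2))  # v starts far too low; corrected does not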

8.6 Gradient descent with momentum

It almost always works faster than the standard gradient descent algorithm. The basic idea is
to compute an exponentially weighted average of your gradients, and then use that average to
update your weights instead.

Suppose we are looking for the minimum of a cost function shaped like an elongated bowl.
This can take many steps that oscillate toward the minimum. You could increase the learning
rate; however, that could make each step shoot off to the sides, as shown in the purple plot.

To solve this, gradient descent with momentum is used, which works with both batch and
mini-batch.

What it does is smooth out the descent steps.

Applying this gradient, each step has less variation in the vertical direction, and the descent
approaches the minimum more efficiently.

In the case of a cost function with that shape, you can picture the v terms as follows:

the derivative terms act as an acceleration, so a small ball can roll downhill gaining
momentum. It also has the β term, which represents friction, preventing it from rolling out of
control.

8.6.1 Most common value of beta

β = 0.9

In practice, bias correction is usually not applied to v_dW and v_db, since after only a few
iterations the average is already past the near-zero start.

v_dW and v_db are initialized as vectors of zeros, with the dimensions of dW and db,
respectively.
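A sketch of the momentum update for one layer (W, b, dW, db, beta and learning_rate are assumed
names, with dW and db coming from backprop on the current mini-batch):

v_dW = np.zeros_like(W)                 # initialize once, with the dimensions of dW
v_db = np.zeros_like(b)

# on each iteration:
v_dW = beta * v_dW + (1 - beta) * dW    # exponentially weighted average of the gradients
v_db = beta * v_db + (1 - beta) * db
W = W - learning_rate * v_dW            # update with the smoothed gradient
b = b - learning_rate * v_db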

8.7 RMSprop
This is another algorithm, root mean square prop, which can also speed up gradient descent.
With standard gradient descent you oscillate in the vertical direction even when what you need
is to advance horizontally. Like the previous method, RMSprop lets you move faster in the
specific direction that matters:

S_dW = β2·S_dW + (1 − β2)·dW^2,   W := W − α·dW / (sqrt(S_dW) + ε)

This allows you to use a larger learning rate without diverging in the vertical direction (in
this case).

That is, in practice you get an algorithm that damps the variations in whichever direction has
the largest oscillations.

Since the two optimization criteria, RMSprop and momentum, will later be combined, a β2 is used
for RMSprop; also, for numerical reasons, to avoid dividing by zero when S_db or S_dW are very
small, an ε of the order of 10^-8 is added, to keep the computed value from blowing up.
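A sketch of the RMSprop update for one layer (squares and square roots are element-wise; the
names follow the momentum sketch above):

S_dW = np.zeros_like(W)
S_db = np.zeros_like(b)
epsilon = 1e-8

# on each iteration:
S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2              # average of squared gradients
S_db = beta2 * S_db + (1 - beta2) * db ** 2
W = W - learning_rate * dW / (np.sqrt(S_dW) + epsilon)   # damps the large-oscillation directions
b = b - learning_rate * db / (np.sqrt(S_db) + epsilon)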

8.8 Adam optimization algorithm

Throughout the history of deep learning, several optimizations of the algorithms have been
proposed, but many turned out not to apply across a wide range of architectures. A notable
exception is the Adam algorithm, which has been shown to hold up for a wide variety of
practical cases.

First, you must initialize the following to zero: v_dW, v_db, S_dW, S_db.

Note that there is an error in the S_db formula on the slide.

8.8.1 Hyperparameters choice

Recommended defaults: α needs to be tuned; β1 = 0.9; β2 = 0.999; ε = 10^-8. The β2 value is the
one recommended by the authors of the algorithm.

Adam = adaptive moment estimation
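A sketch of the Adam update for one layer, combining the two previous averages with bias
correction (t is the iteration count, starting at 1; names follow the sketches above):

v_dW = beta1 * v_dW + (1 - beta1) * dW             # momentum-style average
S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2        # RMSprop-style average

v_corrected = v_dW / (1 - beta1 ** t)              # bias correction
S_corrected = S_dW / (1 - beta2 ** t)

W = W - learning_rate * v_corrected / (np.sqrt(S_corrected) + epsilon)
# (the same three steps are applied to b, using db)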

8.9 Learning rate decay

One way to speed up learning of the network is to slowly reduce the learning rate. A smaller
alpha allows smaller steps as we approach the minimum:

α = α0 / (1 + decay_rate × epoch_num)

If α0 = 0.2, the learning rate shrinks with each epoch.
8.9.1 Other learning rate decay methods
1. Exponential decay: α = decay_rate^epoch_num × α0

2. α = (k / sqrt(epoch_num)) × α0

3. Discrete staircase: drop α by a fixed factor every few epochs.

This is a parameter that helps; however, it is not the most important one to set through an
iterative process.
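A sketch of these schedules side by side (alpha0, decay_rate and k are hypothetical values):

import numpy as np

alpha0, decay_rate, k = 0.2, 1.0, 1.0

for epoch_num in range(1, 6):
    alpha_hyperbolic = alpha0 / (1 + decay_rate * epoch_num)
    alpha_exponential = 0.95 ** epoch_num * alpha0
    alpha_sqrt = k / np.sqrt(epoch_num) * alpha0
    alpha_staircase = alpha0 * 0.5 ** (epoch_num // 2)   # halve every 2 epochs
    print(epoch_num, alpha_hyperbolic, alpha_exponential, alpha_sqrt, alpha_staircase)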

8.10 The problem of local optima


9. Practice questions: Optimization algorithms
10. Programming assignment: Optimization
There are many different optimization algorithms you could be using to get you to the minimal
cost. Similarly, there are many different paths down this hill to the lowest point.

By completing this assignment you will:

1. Understand the intuition behind Adam and RMSprop


2. Recognize the importance of mini-batch gradient descent
3. Learn the effects of momentum on the overall performance of your model

Until now, you've always used Gradient Descent to update the parameters and minimize the cost.
In this notebook, you will learn more advanced optimization methods that can speed up learning
and perhaps even get you to a better final value for the cost function. Having a good optimization
algorithm can be the difference between waiting days vs. just a few hours to get a good result.

Gradient descent goes "downhill" on a cost function J.

At each step of the training, you update your parameters following a certain direction to try to get
to the lowest possible point.

10.1 Gradient descent


A simple optimization method in machine learning is gradient descent (GD). When you take
gradient steps with respect to all m examples on each step, it is also called Batch Gradient
Descent.

Warm-up exercise: Implement the gradient descent update rule.

A variant of this is Stochastic Gradient Descent (SGD), which is equivalent to mini-batch gradient
descent where each mini-batch has just 1 example. The update rule that you have just
implemented does not change. What changes is that you would be computing gradients on just
one training example at a time, rather than on the whole training set. The code examples below
illustrate the difference between stochastic gradient descent and (batch) gradient descent.

10.1.1 Difference between stochastic gradient descent and (batch) gradient descent
In Stochastic Gradient Descent, you use only 1 training example before updating the gradients.
When the training set is large, SGD can be faster. But the parameters will "oscillate" toward the
minimum rather than converge smoothly. Here is an illustration of this:

Note also that implementing SGD requires 3 for-loops in total:

1. Over the number of iterations


2. Over the m training examples
3. Over the layers (to update all parameters, from (W[1], b[1]) to (W[L], b[L]))

In practice, you'll often get faster results if you use neither the whole training set nor only
one training example to perform each update. Mini-batch gradient descent uses an intermediate
number of examples for each step. With mini-batch gradient descent, you loop over the
mini-batches instead of looping over individual training examples.

What you should remember:

1. The difference between gradient descent, mini-batch gradient descent and stochastic
gradient descent is the number of examples you use to perform one update step.
2. You have to tune a learning rate hyperparameter α
3. With a well-tuned mini-batch size, it usually outperforms either gradient descent or
stochastic gradient descent (particularly when the training set is large).

10.2 Mini-batch gradient descent


Let's learn how to build mini-batches from the training set (X, Y).

There are two steps:

1. Shuffle: create a shuffled version of the training set (X, Y) as shown below. Each column
of X and Y represents a training example. Note that the random shuffling is done
synchronously between X and Y, so that after the shuffle the i-th column of X is the
example corresponding to the i-th label in Y. The shuffling step guarantees that the
examples will be split randomly into the different mini-batches.
2. Partition: partition the shuffled (X, Y) into mini-batches of size mini_batch_size (here 64).
Note that the number of training examples is not always divisible by mini_batch_size. The
last mini-batch might be smaller, but you don't need to worry about this: when the final
mini-batch is smaller than the full mini_batch_size, it simply contains the remaining
examples.

Exercise: Implement random_mini_batches. We coded the shuffling part for you. To help you
with the partitioning step, we give you the following code that selects the indexes for the 1st
and 2nd mini-batches:
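(Reconstructed sketch of that selection, since the notebook cell is not shown here:)

first_mini_batch_X = shuffled_X[:, 0 : mini_batch_size]
second_mini_batch_X = shuffled_X[:, mini_batch_size : 2 * mini_batch_size]
# and in general, for mini-batch k:
#   shuffled_X[:, k * mini_batch_size : (k + 1) * mini_batch_size]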
Note that the last mini-batch might end up smaller than mini_batch_size = 64. Let ⌊s⌋ represent
s rounded down to the nearest integer (this is math.floor(s) in Python). If the total number of
examples is not a multiple of mini_batch_size = 64, then there will be ⌊m / mini_batch_size⌋
mini-batches with a full 64 examples, and the number of examples in the final mini-batch will
be m − mini_batch_size × ⌊m / mini_batch_size⌋.
10.2.1 Random mini batches


What you should remember:

1. Shuffling and Partitioning are the two steps required to build mini-batches
2. Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128.
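A compact sketch of the whole function following those two steps (reconstructed; X is assumed
to have shape (features, m) and Y shape (1, m)):

import math
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64):
    m = X.shape[1]
    permutation = list(np.random.permutation(m))     # step 1: shuffle X and Y synchronously
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation]

    mini_batches = []
    num_complete = math.floor(m / mini_batch_size)   # step 2: partition
    for k in range(num_complete):
        mini_batches.append((shuffled_X[:, k * mini_batch_size:(k + 1) * mini_batch_size],
                             shuffled_Y[:, k * mini_batch_size:(k + 1) * mini_batch_size]))
    if m % mini_batch_size != 0:                     # the last, smaller mini-batch
        mini_batches.append((shuffled_X[:, num_complete * mini_batch_size:],
                             shuffled_Y[:, num_complete * mini_batch_size:]))
    return mini_batches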

10.3 Momentum
Because mini-batch gradient descent makes a parameter update after seeing just a subset of
examples, the direction of the update has some variance, and so the path taken by mini-batch
gradient descent will "oscillate" toward convergence. Using momentum can reduce these
oscillations.

Momentum takes into account the past gradients to smooth out the update. We will store the
'direction' of the previous gradients in the variable v. Formally, this will be the exponentially
weighted average of the gradient on previous steps. You can also think of v as the "velocity" of a
ball rolling downhill, building up speed (and momentum) according to the direction of the
gradient/slope of the hill.
10.3.1 Initialize the velocity (v) with zeros
Exercise: Initialize the velocity. The velocity, v, is a python dictionary that needs to be
initialized with arrays of zeros. Its keys are the same as those in the grads dictionary, that
is: for l = 1, …, L, v["dW" + str(l)] and v["db" + str(l)] are zero arrays with the same shapes
as parameters["W" + str(l)] and parameters["b" + str(l)].

Note that the iterator l starts at 0 in the for loop while the first parameters are v["dW1"] and
v["db1"] (that's a "one" on the superscript). This is why we are shifting l to l+1 in the for
loop.

10.3.2 Parameters update with momentum

The momentum update rule is, for l = 1, …, L:

v["dW" + str(l)] = β·v["dW" + str(l)] + (1 − β)·dW[l]
W[l] = W[l] − α·v["dW" + str(l)]
(and likewise for b[l]),

where L is the number of layers, β is the momentum and α is the learning rate. All parameters
should be stored in the parameters dictionary. Note that the iterator l starts at 0 in the for
loop while the first parameters are W[1] and b[1] (that's a "one" on the superscript). So, you
will need to shift l to l+1 when coding.

Note that:

1. The velocity is initialized with zeros. So the algorithm will take a few iterations to "build
up" velocity and start to take bigger steps.
2. If β=0, then this just becomes standard gradient descent without momentum.

10.3.3 How do you choose beta?


1. The larger the momentum β is, the smoother the update because the more we take the
past gradients into account. But if β is too big, it could also smooth out the updates too
much.
2. Common values for β range from 0.8 to 0.999. If you don't feel inclined to tune this, β=0.9
is often a reasonable default.
3. Tuning the optimal β for your model might need trying several values to see what works
best in terms of reducing the value of the cost function J.

10.3.4 What you should remember


1. Momentum takes past gradients into account to smooth out the steps of gradient descent.
It can be applied with batch gradient descent, mini-batch gradient descent or stochastic
gradient descent.
2. You have to tune a momentum hyperparameter β and a learning rate α.

10.4 Adam
Adam is one of the most effective optimization algorithms for training neural networks. It
combines ideas from RMSProp (described in lecture) and Momentum.

10.4.1 How does Adam work?

1. It calculates an exponentially weighted average of past gradients, and stores it in
variables v (before bias correction) and v_corrected (with bias correction).

2. It calculates an exponentially weighted average of the squares of the past gradients, and
stores it in variables s (before bias correction) and s_corrected (with bias correction).

3. It updates parameters in a direction based on combining information from "1" and "2".

The update rule is, for l = 1, ..., L:

v = β1·v + (1 − β1)·dW[l],   v_corrected = v / (1 − β1^t)
s = β2·s + (1 − β2)·(dW[l])^2,   s_corrected = s / (1 − β2^t)
W[l] = W[l] − α·v_corrected / (sqrt(s_corrected) + ε)

Where

- t counts the number of steps taken of Adam
- L is the number of layers
- β1 and β2 are hyperparameters that control the two exponentially weighted averages
- α is the learning rate
- ε is a very small number to avoid dividing by zero (the note here observes that in this
notebook's version it is included inside the square root)

10.4.2 Initialize the Adam variables v, s which keep track of the past information

10.4.3 Parameters update with Adam

Note: observe that the bias correction is done by raising the beta factor to the power t, not to the power 2!
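A sketch consistent with the update rule above (the default hyperparameter values are assumptions):

def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    L = len(parameters) // 2
    v_corrected, s_corrected = {}, {}
    for l in range(L):
        for p, g in (("W", "dW"), ("b", "db")):
            k = g + str(l + 1)
            # moving averages of the gradients and of their squares
            v[k] = beta1 * v[k] + (1 - beta1) * grads[k]
            s[k] = beta2 * s[k] + (1 - beta2) * np.square(grads[k])
            # bias correction: beta raised to the power t, not squared
            v_corrected[k] = v[k] / (1 - beta1 ** t)
            s_corrected[k] = s[k] / (1 - beta2 ** t)
            # epsilon inside the square root, as noted above
            parameters[p + str(l + 1)] = parameters[p + str(l + 1)] \
                - learning_rate * v_corrected[k] / np.sqrt(s_corrected[k] + epsilon)
    return parameters, v, s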


10.5 Model with different optimization algorithms
Let's use the following "moons" dataset to test the different optimization methods. (The dataset is
named "moons" because the data from each of the two classes looks a bit like a crescent-shaped
moon.)

We have already implemented a 3-layer neural network. You will train it with:

- Mini-batch Gradient Descent: it will call your function update_parameters_with_gd()
- Mini-batch Momentum: it will call your functions initialize_velocity() and update_parameters_with_momentum()
- Mini-batch Adam: it will call your functions initialize_adam() and update_parameters_with_adam()
10.5.1 Mini-batch gradient descent

10.5.2 Mini-batch gradient descent with momentum


Run the following code to see how the model does with momentum. Because this example is relatively simple, the gains from using momentum are small; but for more complex problems you might see bigger gains.
10.5.3 Mini-batch with Adam mode

10.5.4 Summary
1. Momentum usually helps, but given the small learning rate and the simplistic dataset, its impact is almost negligible. Also, the huge oscillations you see in the cost come from the fact that some minibatches are more difficult than others for the optimization algorithm.
2. Adam on the other hand, clearly outperforms mini-batch gradient descent and
Momentum. If you run the model for more epochs on this simple dataset, all three
methods will lead to very good results. However, you've seen that Adam converges a lot
faster.

Some advantages of Adam include:

1. Relatively low memory requirements (though higher than gradient descent and gradient
descent with momentum)
2. Usually works well even with little tuning of hyperparameters (except α)

10.5.5 Personal note


the -= is a known problem and is on the list to be fixed. The issue seems to be that with -= you can
have different types on each side of the operator which confuses the grader; this does not happen
with plain = which is able to constrain each side to conform so that the grader is happy.

I mention the above because I got 0 points when using -= to update a value.

11. Hyperparameters tuning


11.1 Tuning process
According to Andrew Ng, the hyperparameters to set, in order of importance (in the course slides, red marks the most important, yellow the second degree, and purple the third degree of importance), are:
Instead of building a systematic grid of hyperparameter values, you can sample them at random; that way it is possible to find a value that works really well.

This practice of going from coarse to fine lets you focus on the regions that give the best results, and pick a value that optimizes the model among the hyperparameters being analyzed.

11.2 Using an appropriate scale to pick hyperparameters


Suppose we are searching for the number of hidden units in a layer, and for the number of layers.
In those examples the scale seems fairly obvious. But that is not the case for every hyperparameter.

If the value of alpha can range between 0.0001 and 1, then choosing values at random in a "direct" (uniform) way makes most of the samples come from the same region. Instead, take a value r that varies between -4 and 0; in Python:

r = -4 * np.random.rand()

and then set the parameter:

learning_rate = 10 ** r

That is, a logarithmic scale is used.

Another hyperparameter that is hard to put on a scale is beta, used for exponentially weighted averages.
Sampling linearly among the possible values of beta makes no sense, so instead the sampling is done logarithmically over the values of 1 − beta.

The results are very sensitive to changes in beta when beta is very close to 1.

That is, when you change beta from 0.9 to 0.9005, the change is negligible. However, when the change is from beta = 0.999 to 0.9995, the change is appreciable, because the effective averaging window behaves like 1/(1 − beta).
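A small sketch of this log-scale sampling in Python (the interval [-3, -1] is an assumption matching betas between 0.9 and 0.999):

import numpy as np
r = np.random.uniform(-3, -1)   # r = log10(1 - beta)
beta = 1 - 10 ** r              # beta sampled log-uniformly in [0.9, 0.999]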

11.3 Hyperparameters tuning in practice: Pandas vs. Caviar


Depending on the computational capacity you have, you can choose to train a single model and babysit it like a panda: adjusting the hyperparameters at every step as it learns, according to the curve you use as the evaluation metric (the cost function, the error, etc.). Alternatively, with enough compute, you can train many models in parallel with different hyperparameter settings and simply keep the one that works best (the "caviar" strategy).
12. Batch Normalization
In the context of hyperparameter search, there is a technique that does not work in every case, but when it does, it works really well and makes the job much easier.

12.1 Normalizing activations in a network


As seen before, normalizing the inputs can help optimize the model, by making the cost-function contours "rounder".

Now, the question is: is it possible to normalize the values of the activations, so that the parameters W, b of the next layer train faster?
Well, that is what batch normalization does.

Technically, it is not the values of a that get normalized, but those of z. There is some debate about which of the two should be normalized; however, in practice it is more common to do it for z.

12.1.1 Implementing batch norm


Suppose we have some intermediate values in the NN, z(1), ..., z(m). These live in layer l, so strictly they should be written z[l](1), ..., z[l](m), but for simplicity the layer notation is omitted.

The mean is computed as:

μ = (1/m) · Σ_i z(i)

Then the variance:

σ² = (1/m) · Σ_i (z(i) − μ)²

And finally, z is normalized:

z_norm(i) = (z(i) − μ) / sqrt(σ² + ε)

What this achieves is a mean of zero and a variance of one. However, we do not want the hidden units to always have mean zero and variance one; it makes sense to expect the hidden units to have a different distribution.

So what is done is to compute a new z, as follows:

z̃(i) = γ · z_norm(i) + β

where gamma and beta are learnable parameters of the model, which must be updated during training.

It can be shown from the equations that, if

γ = sqrt(σ² + ε) and β = μ,

then z̃(i) = z(i); that is, we would be computing the identity function.

What this gives us is control over the variance and the mean, through the parameters beta and gamma.

So z̃ is used instead of z.
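A minimal numpy sketch of the transform, where Z, gamma, beta, and eps are hypothetical: Z holds the pre-activations of one mini-batch with shape (hidden units, m), and gamma and beta have shape (hidden units, 1):

import numpy as np
m = 64
Z = np.random.randn(5, m)                # hypothetical pre-activations, shape (units, m)
gamma = np.ones((5, 1))                  # learnable scale
beta = np.zeros((5, 1))                  # learnable shift
eps = 1e-8
mu = np.mean(Z, axis=1, keepdims=True)
var = np.var(Z, axis=1, keepdims=True)
Z_norm = (Z - mu) / np.sqrt(var + eps)   # mean 0, variance 1 per unit
Z_tilde = gamma * Z_norm + beta          # used in place of Z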

12.2 Fitting batch Norm into a neural network


In the previous section, we saw how to implement normalization for a single layer. Now, we will see how to implement it for a deep network, so that it works across the different layers.

You then end up with parameters like W[1], b[1], γ[1], β[1], ..., W[L], b[L], γ[L], β[L]:

12.2.1 Working with mini-batches


If you are working with batch normalization, adding any constant to the parameter z will have no effect, since the mean is being subtracted. This means that the parameter b is no longer necessary.

Its value ends up being replaced by the parameter β.


12.2.2 Implementing gradient descent

This can also be used with other algorithms, such as gradient descent with momentum, RMSprop, and Adam.

12.3 Why does Batch Norm work?

If you train the model only with the data on the left (only black cats), it will not do well when tested on the data on the right.
Batch norm guarantees that, even if the values of z vary, their mean and variance stay constant (determined by beta and gamma). That is, it makes the inputs to each layer more stable.

- Each mini batch is scaled by the mean/variance computed on just that mini batch
- This adds some noise to the values z[l] within that mini batch. So like dropout, it adds
some noise to each hidden layer’s activations.
- This has a slight regularization effect.

12.4 Batch Norm at test time

During training, several examples are usually processed at a time. At test time, however, they may be processed one by one, so the equations need to be adapted to make sense. The equations used during training are the ones from section 12.1.1, where m is the number of examples in the mini-batch.

What is done, then, is to keep the values of gamma and beta learned during training, and also to estimate μ and σ² with an exponentially weighted average across mini-batches during training, using those estimates for the calculations at test time.

13. Multi-class classification


13.1 SoftMax regression
So far, we have built models that recognize two possible outputs: 1 or 0 (cat or not cat). But often the model needs to recognize more than two classes.

Imagine you want to recognize cats, dogs, and baby chicks. That is, there are several inputs and the output can be one among several classes.
The number of output nodes equals the number of classes.
The network computes the probability that the output belongs to each of the possible classes, and the probabilities sum to 100%. Note that with no hidden layers the decision boundaries between classes are linear; however, deeper networks can learn more complex boundaries.

13.2 Training a softmax classifier


Remember that a temporary vector t is used, which is e (Euler's number) raised elementwise to the values of z.

This contrasts with "hard max", in which the largest output becomes one and the rest become zero.
Softmax regression generalizes logistic regression to C classes (more than two classes).
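A minimal numpy sketch of the softmax computation just described:

import numpy as np

def softmax(z):
    t = np.exp(z)           # e (Euler's number) raised elementwise to z
    return t / np.sum(t)    # probabilities that sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))   # roughly [0.659 0.242 0.099]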

14. Introduction to programming frameworks


14.1 Deep learning frameworks
We have seen how to implement a neural network from scratch, but as you progress it becomes convenient to use the deep learning packages available today.

Some of the most widely used ones:

Criteria for choosing a package (since most of them improve continuously and not all of them fit your requirements):

- Ease of programming (development and deployment)


- Running speed
- Truly open (open source with good governance)

14.2 TensorFlow
TensorFlow already implements the functions necessary to perform backprop.
15. Practice questions: Hyperparameters tuning, Batch Normalization,
programming framework
1. If searching among a large number of hyperparameters, you should try values in a grid
rather than random values, so that you can carry out the search more systematically
and not rely on chance. False
2. Every hyperparameter, if set poorly, can have a huge negative impact on training, and
so all hyperparameters are about equally important to tune well. False

3. During hyperparameter search, whether you try to babysit one model (“Panda”
strategy) or train a lot of models in parallel (“Caviar”) is largely determined by

4. If you think β (hyperparameter for momentum) is between 0.9 and 0.99, which of the following is the recommended way to sample a value for beta?
5. Finding good hyperparameter values is very time-consuming. So typically you should
do it once at the start of the project, and try to find very good hyperparameters so that
you don’t ever have to revisit tuning them again. True or false?

6. In batch normalization as presented in the videos, if you apply it on the l-th layer of your neural network, what are you normalizing?

7. In the normalization formula, why do we use epsilon?


8. Which of the following statements about γ and β in Batch Norm are true?

9. After training a neural network with Batch Norm, at test time, to evaluate the neural
network on a new example you should:

10. Which of these statements about deep learning programming frameworks are true?
16. Programming Assignment
16.1 TensorFlow
In this notebook you will learn all the basics of Tensorflow. You will implement useful functions
and draw the parallel with what you did using Numpy. You will understand what Tensors and
operations are, as well as how to execute them in a computation graph.

After completing this assignment you will also be able to implement your own deep learning
models using Tensorflow. In fact, using our brand new SIGNS dataset, you will build a deep neural
network model to recognize numbers from 0 to 5 in sign language with a pretty impressive
accuracy.

TensorFlow Tutorial

Until now, you've always used numpy to build neural networks. Now we will step you through a
deep learning framework that will allow you to build neural networks more easily. Machine
learning frameworks like TensorFlow, PaddlePaddle, Torch, Caffe, Keras, and many others can
speed up your machine learning development significantly. All these frameworks also have a lot of
documentation, which you should feel free to read. In this assignment, you will learn to do the
following in TensorFlow:

- Initialize variables
- Start your own session
- Train algorithms
- Implement a Neural Network
Programming frameworks can not only shorten your coding time, but sometimes also perform
optimizations that speed up your code.

16.2 Exploring the TensorFlow Library

Now that you have imported the library, we will walk you through its different applications. You
will start with an example, where we compute for you the loss of one training example.

Writing and running programs in TensorFlow has the following steps:

1. Create Tensors (variables) that are not yet executed/evaluated.


2. Write operations between those Tensors.
3. Initialize your Tensors.
4. Create a Session.
5. Run the Session. This will run the operations you'd written above.

Therefore, when we created a variable for the loss, we simply defined the loss as a function of
other quantities but did not evaluate its value. To evaluate it, we had to run
init=tf.global_variables_initializer(). That initialized the loss variable, and in the last line we were
finally able to evaluate the value of loss and print its value.

Now let us look at an easy example. Run the cell below:
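The cell is along these lines (a sketch using the TF1-style API this notebook assumes):

import tensorflow as tf

a = tf.constant(2)
b = tf.constant(10)
c = tf.multiply(a, b)
print(c)   # prints a Tensor object of dtype int32, not 20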


As expected, you will not see 20! You got a tensor saying that the result is a tensor that does not
have the shape attribute and is of type "int32". All you did was put in the 'computation graph', but
you have not run this computation yet. In order to actually multiply the two numbers, you will
have to create a session and run it.
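For example, a sketch of creating and running that session:

sess = tf.Session()
print(sess.run(c))   # now prints 20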

Great! To summarize, remember to initialize your variables, create a session and run the
operations inside the session.

Next, you'll also have to know about placeholders. A placeholder is an object whose value you can
specify only later. To specify values for a placeholder, you can pass in values by using a "feed
dictionary" (feed_dict variable). Below, we created a placeholder for x. This allows us to pass in a
number later when we run the session.
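A sketch of what that looks like (the name "x" and the fed value 3 are illustrative):

x = tf.placeholder(tf.int64, name="x")
print(sess.run(2 * x, feed_dict={x: 3}))   # prints 6
sess.close()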

When you first defined x you did not have to specify a value for it. A placeholder is simply a
variable that you will assign data to only later, when running the session. We say that you feed
data to these placeholders when running the session.

Here's what's happening: When you specify the operations needed for a computation, you are
telling TensorFlow how to construct a computation graph. The computation graph can have some
placeholders whose values you will specify only later. Finally, when you run the session, you are
telling TensorFlow to execute the computation graph.

16.2.1 Linear function


Let's start this programming exercise by computing the following equation:

Y = WX + b

where W and X are random matrices and b is a random vector.

Compute WX + b where W, X, and b are drawn from a random normal distribution. W is of shape (4, 3), X is (3, 1) and b is (4, 1). As an example, here is how you would define a constant X that has shape (3, 1):
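For instance, something like:

X = tf.constant(np.random.randn(3, 1), name="X")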

You might find the following functions helpful:


- tf.matmul(..., ...) to do a matrix multiplication
- tf.add(..., ...) to do an addition
- np.random.randn(...) to initialize randomly
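A sketch of the whole exercise under those hints (the seed is only for reproducibility):

import numpy as np
import tensorflow as tf

def linear_function():
    np.random.seed(1)
    W = tf.constant(np.random.randn(4, 3), name="W")
    X = tf.constant(np.random.randn(3, 1), name="X")
    b = tf.constant(np.random.randn(4, 1), name="b")
    Y = tf.add(tf.matmul(W, X), b)    # Y = WX + b
    with tf.Session() as sess:        # create and run the session
        result = sess.run(Y)
    return result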

16.2.2 Computing the sigmoid


Great! You just implemented a linear function. Tensorflow offers a variety of commonly used neural network functions like tf.sigmoid and tf.softmax. For this exercise let's compute the sigmoid function of an input.

You will do this exercise using a placeholder variable x. When running the session, you should use
the feed dictionary to pass in the input z. In this exercise, you will have to:

- create a placeholder x
- define the operations needed to compute the sigmoid using tf.sigmoid, and then
- run the session

Implement the sigmoid function below. You should use the following:

- tf.placeholder(tf.float32, name = "...")


- tf.sigmoid(...)
- sess.run(..., feed_dict = {x: z})
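A sketch of the function under those hints:

def sigmoid(z):
    x = tf.placeholder(tf.float32, name="x")    # placeholder for the input
    s = tf.sigmoid(x)                           # the operation to compute
    with tf.Session() as sess:
        result = sess.run(s, feed_dict={x: z})  # feed z in when running
    return result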

Note that there are two typical ways to create and use sessions in tensorflow:
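Roughly, the two patterns are:

# Method 1: create, run, close explicitly
sess = tf.Session()
result = sess.run(..., feed_dict={...})
sess.close()

# Method 2: a context manager closes the session for you
with tf.Session() as sess:
    result = sess.run(..., feed_dict={...})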
To summarize, you now know how to:

1. Create placeholders
2. Specify the computation graph corresponding to operations you want to compute
3. Create the session
4. Run the session, using a feed dictionary if necessary to specify placeholder variables'
values

16.2.3 Computing the Cost


You can also use a built-in function to compute the cost of your neural network. So instead of needing to write code to compute this as a function of a[2](i) and y(i) for i = 1...m:

J = -(1/m) · Σ_i [ y(i) · log a[2](i) + (1 − y(i)) · log(1 − a[2](i)) ]

you can do it in one line of code in tensorflow!

Implement the cross-entropy loss. The function you will use is:

tf.nn.sigmoid_cross_entropy_with_logits(logits = ..., labels = ...)

Your code should input z, compute the sigmoid (to get a) and then compute the cross-entropy cost J. All this can be done using one call to tf.nn.sigmoid_cross_entropy_with_logits, which computes the same cost J with a(i) = σ(z(i)).
Important: see how feed_dict works.
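A sketch of the exercise, assuming logits and labels arrive as numpy arrays of the same shape:

def cost(logits, labels):
    z = tf.placeholder(tf.float32, name="z")
    y = tf.placeholder(tf.float32, name="y")
    loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=z, labels=y)
    with tf.Session() as sess:
        c = sess.run(loss, feed_dict={z: logits, y: labels})
    return c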

16.2.4 Using one hot encodings


Many times, in deep learning you will have a y vector with numbers ranging from 0 to C-1, where C
is the number of classes. If C is for example 4, then you might have the following y vector which
you will need to convert as follows:

This is called a "one hot" encoding, because in the converted representation exactly one element
of each column is "hot" (meaning set to 1). To do this conversion in numpy, you might have to
write a few lines of code. In tensorflow, you can use one line of code:

Implement the function below to take one vector of labels and the total number of classes C, and return the one hot encoding. Use tf.one_hot() to do this.
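A sketch using tf.one_hot, with axis=0 so that each column (rather than each row) is one example:

def one_hot_matrix(labels, C):
    C = tf.constant(C, name="C")
    one_hot = tf.one_hot(labels, C, axis=0)
    with tf.Session() as sess:
        result = sess.run(one_hot)
    return result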
16.2.5 Initialize with zeros and ones
Now you will learn how to initialize a vector of zeros and ones. The function you will be calling is
tf.ones(). To initialize with zeros you could use tf.zeros() instead. These functions take in a shape
and return an array of dimension shape full of zeros and ones respectively

Implement the function below to take in a shape and return an array of that shape filled with ones.
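A minimal sketch of that function:

def ones(shape):
    x = tf.ones(shape)
    with tf.Session() as sess:
        result = sess.run(x)
    return result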

16.3 Building your first neural network in tensorflow


In this part of the assignment you will build a neural network using tensorflow. Remember that
there are two parts to implement a tensorflow model:

- Create the computation graph


- Run the graph

Let's delve into the problem you'd like to solve!


Problem statement: SIGNS Dataset

One afternoon, with some friends we decided to teach our computers to decipher sign language.
We spent a few hours taking pictures in front of a white wall and came up with the following
dataset. It's now your job to build an algorithm that would facilitate communications from a
speech-impaired person to someone who doesn't understand sign language.

- Training set: 1080 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5
(180 pictures per number).
- Test set: 120 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (20
pictures per number).

Note that this is a subset of the SIGNS dataset. The complete dataset contains many more signs.
These are the original pictures, before we lowered the image resolution to 64 by 64 pixels.

Run the following code to load the dataset.


As usual you flatten the image dataset, then normalize it by dividing by 255. On top of that, you
will convert each label to a one-hot vector as shown in Figure 1. Run the cell below to do so.
Note that 12288 comes from 64×64×3. Each image is square, 64 by 64 pixels, and 3 is for the RGB
colors. Please make sure all these shapes make sense to you before continuing.

Your goal is to build an algorithm capable of recognizing a sign with high accuracy. To do so, you
are going to build a tensorflow model that is almost the same as one you have previously built in
numpy for cat recognition (but now using a softmax output). It is a great occasion to compare your
numpy implementation to the tensorflow one.

The model is LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX. The SIGMOID output layer
has been converted to a SOFTMAX. A SOFTMAX layer generalizes SIGMOID to when there are
more than two classes.

16.3.1 Create placeholders


Your first task is to create placeholders for X and Y. This will allow you to later pass your training
data in when you run your session.
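A sketch, assuming this notebook's convention that examples are stacked as columns (hence the None in the second dimension):

def create_placeholders(n_x, n_y):
    X = tf.placeholder(tf.float32, shape=[n_x, None], name="X")
    Y = tf.placeholder(tf.float32, shape=[n_y, None], name="Y")
    return X, Y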

16.3.2 Initializing the parameters


Your second task is to initialize the parameters in tensorflow.

Implement the function below to initialize the parameters in tensorflow. You are going use Xavier
Initialization for weights and Zero Initialization for biases. The shapes are given below. As an
example, to help you, for W1 and b1 you could use:
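The example was along these lines (TF1 API; the [25, 12288] shape matches the 12288-input architecture used later in this notebook):

W1 = tf.get_variable("W1", [25, 12288], initializer=tf.contrib.layers.xavier_initializer(seed=1))
b1 = tf.get_variable("b1", [25, 1], initializer=tf.zeros_initializer())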
As expected, the parameters haven't been evaluated yet.

16.3.3 Forward propagation in tensorflow


You will now implement the forward propagation module in tensorflow. The function will take in a
dictionary of parameters and it will complete the forward pass. The functions you will be using are:

- tf.nn.relu(...) to apply the ReLU activation

Question: Implement the forward pass of the neural network. We commented for you the numpy
equivalents so that you can compare the tensorflow implementation to numpy. It is important to
note that the forward propagation stops at z3. The reason is that in tensorflow the last linear layer
output is given as input to the function computing the loss. Therefore, you don't need a3!
You may have noticed that the forward propagation doesn't output any cache. You will understand
why below when we get to backpropagation.
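A sketch of the three LINEAR -> RELU blocks (the W1..W3/b1..b3 names assume the parameters dictionary from the previous step):

def forward_propagation(X, parameters):
    W1, b1 = parameters["W1"], parameters["b1"]
    W2, b2 = parameters["W2"], parameters["b2"]
    W3, b3 = parameters["W3"], parameters["b3"]
    Z1 = tf.add(tf.matmul(W1, X), b1)    # numpy equivalent: np.dot(W1, X) + b1
    A1 = tf.nn.relu(Z1)
    Z2 = tf.add(tf.matmul(W2, A1), b2)
    A2 = tf.nn.relu(Z2)
    Z3 = tf.add(tf.matmul(W3, A2), b3)   # stop at Z3; the loss takes it directly
    return Z3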

16.3.4 Compute cost


As seen before, it is very easy to compute the cost using:
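Roughly:

logits = tf.transpose(Z3)   # shape (number of examples, num_classes)
labels = tf.transpose(Y)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))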

Implement the cost function below.

- It is important to know that the "logits" and "labels" inputs of


tf.nn.softmax_cross_entropy_with_logits are expected to be of shape (number of
examples, num_classes). We have thus transposed Z3 and Y for you.
- Besides, tf.reduce_mean basically does the summation over the examples.
16.3.5 Backpropagation & parameters updates
This is where you become grateful to programming frameworks. All the backpropagation and the
parameters update are taken care of in 1 line of code. It is very easy to incorporate this line in the
model.

For instance, for gradient descent the optimizer would be:
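That line is, roughly:

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)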

To make the optimization you would do:
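Roughly (minibatch_X and minibatch_Y being the current mini-batch):

_, c = sess.run([optimizer, cost], feed_dict={X: minibatch_X, Y: minibatch_Y})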

This computes the backpropagation by passing through the tensorflow graph in the reverse order.
From cost to inputs.

When coding, we often use _ as a "throwaway" variable to store values that we won't need to use
later. Here, _ takes on the evaluated value of optimizer, which we don't need (and c takes the
value of the cost variable).

16.3.6 Building the model


Now, you will bring it all together!
Amazing, your algorithm can recognize a sign representing a figure between 0 and 5 with 71.7%
accuracy.

Insights:

1. Your model seems big enough to fit the training set well. However, given the difference
between train and test accuracy, you could try to add L2 or dropout regularization to
reduce overfitting.
2. Think about the session as a block of code to train the model. Each time you run the
session on a minibatch, it trains the parameters. In total you have run the session many
times (1500 epochs) until you obtained well trained parameters.

What you should remember:

- Tensorflow is a programming framework used in deep learning


- The two main object classes in tensorflow are Tensors and Operators.
- When you code in tensorflow you have to take the following steps
1. Create a graph containing Tensors (Variables, Placeholders ...) and Operations
(tf.matmul, tf.add, ...)
2. Create a session
3. Initialize the session
4. Run the session to execute the graph
- You can execute the graph multiple times as you've seen in model()
- The backpropagation and optimization are automatically done when running the session on the "optimizer" object.
