Neural Networks Course 2
What I have seen is that intuitions from one domain or application area often do not transfer to other application areas.
And the best choices may depend on the amount of data you have, the number of input features, your computer configuration, and whether you are training on GPUs or CPUs. So it is almost impossible to correctly guess the best choice of hyperparameters the very first time.
The question is how efficiently you can go around this cycle of iterations.
You set apart a portion of the data to be your training set, a portion to be your hold-out cross-validation set (sometimes called the development set, or dev set for brevity), and a portion to be your test set.
Model selection is performed on the dev set, so that the best option found is then evaluated on the test set.
This is done to get an unbiased estimate of how well the model is performing.
Train/Test 70/30 %
Train/dev/test 60/20/20 %
The goal of the dev set is to try different algorithms on it and see which one works best.
These percentage suggestions are rules of thumb in deep learning. However, in big-data problems with millions of examples, 20% may be far more than needed, so there are clearly exceptions to the rule.
For example, for the training data you might use high-resolution cat photos found on the internet, while for the dev/test sets you use images taken by users with low-resolution cameras.
Important
- Make sure that dev and test come from the same distribution
Not having a test set might be okay (the dev set alone can be enough).
1.2 Bias/Variance
Ideally we find a middle ground between bias and variance, so that the network neither underfits nor overfits.
Some examples:
This algorithm seems to have overfit the training set, since it gets good results on the training set but not on the dev or test set. (Remember that the test set is not mandatory; its role can be played by the dev set.)
Compared to an error rate of roughly zero percent for humans identifying a cat image, we can say the model is not fitting the data correctly; that is, it has very high bias.
Another example:
In this case we see both high bias and high variance, which is the worst of both worlds.
One last example:
Important: all of these analyses are made with respect to an optimal error rate, better known as the Bayes error.
The train set error relates to bias, while the dev set error relates to variance.
This is hard to visualize in two dimensions, but in higher-dimensional problems it is possible to get regions where bias is high and others where variance is high.
Important: if you have high variance or high bias, there are different paths to improve the network, which will be covered throughout the course.
As long as you regularize, training a bigger network is usually helpful.
The number of rows i of the weight matrix should equal the number of neurons in the current layer n[l], whereas the number of columns j should equal the number of neurons in the previous layer n[l−1].
2.1 Regularization
Logistic Regression
Neural Network
Intuitively, if you give lambda a high value and minimize the cost, the minimization will drive the parameter w toward smaller values, which pushes you toward the high-bias case, i.e., effectively a smaller network.
But in reality you do not get a network with fewer hidden units or fewer nodes; what happens is that the impact of each of them on the network is reduced.
Another attempt at intuition: as seen in the previous course, a network with linear activation functions, no matter how deep, collapses into a linear network that cannot fit very complex models. What regularization does is push the activation functions to operate in their nearly linear region, as shown in the following image, which prevents overfitting.
Dropout
Basically, what dropout does is eliminate nodes at random according to a probability and then perform the backward propagation step.
There are several ways to implement this technique; the most common is inverted dropout:
A keep_prob of 0.8 means there is a 0.2 probability of eliminating a node.
Since the vector d3 contains True or False values, the multiplication interprets them as 1 or 0.
This last line of code ensures that the expected value of a3 feeding the next layer stays the same.
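As a sketch, the whole inverted-dropout step for layer 3 looks like this in NumPy (a3 here is a stand-in activation matrix; the names d3 and keep_prob follow the lecture):

import numpy as np
np.random.seed(1)
a3 = np.random.randn(5, 10)                  # stand-in activations for layer 3
keep_prob = 0.8                              # probability of keeping a unit
d3 = np.random.rand(*a3.shape) < keep_prob   # boolean mask, True with probability keep_prob
a3 = a3 * d3                                 # shut down ~20% of the units
a3 = a3 / keep_prob                          # rescale so the expected value of a3 is unchanged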
Reminder: In general, the number of neurons in the previous layer gives us the number of
columns of the weight matrix, and the number of neurons in the current layer gives us the number
of rows in the weight matrix.
Early Stopping
Stop training before w takes on values that are too large. As shown in the image, the training set error and the cost should decrease with each iteration. However, the dev set error usually starts to increase at some point, which is where overfitting begins.
Orthogonalization principle
The problem with early stopping is that the two previous tasks are no longer handled independently: stopping the iterations early prevents the cost function from being fully optimized.
On the other hand, L2 regularization is computationally more demanding, since a suitable value of the lambda parameter must be found.
When normalizing, the vector x should be divided by σ, so normalization consists of two steps:
1. Subtract the mean μ (mu): this step shifts the mean to zero.
2. Normalize the variance (sigma): the variance of one feature may be larger than another's; the goal is a more uniform variance across features.
Here feature x1 might take values between 1 and 1000 while feature x2 takes values between 0 and 1, which leads to very different values of w1 and w2.
By normalizing, in contrast, you can obtain a cost function that is on average more symmetric.
That is, normalization makes optimizing the cost easier, because gradient descent can find the path to the minimum more directly.
This matters most when the input features have very different scales; however, applying it never hurts, so it is advisable to do it even when you suspect the scales are similar.
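A minimal NumPy sketch of the two normalization steps, assuming X holds one training example per column:

import numpy as np
np.random.seed(0)
X = np.random.randn(2, 100) * np.array([[1000.0], [1.0]])  # two features on very different scales
mu = np.mean(X, axis=1, keepdims=True)           # per-feature mean
X = X - mu                                       # step 1: shift the mean to zero
sigma2 = np.mean(X ** 2, axis=1, keepdims=True)  # per-feature variance (after centering)
X = X / np.sqrt(sigma2)                          # step 2: normalize the variance

The same mu and sigma computed on the training set should also be used to normalize the dev/test sets.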
To simplify the following explanation, assume the parameter b is zero and the activation function is linear, i.e., g(z) = z.
Suppose you have a matrix like the one shown. That value is carried through the matrix multiplications of each layer, which means the value of y_hat grows exponentially, by a factor of 1.5^L in this example.
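A quick numeric check of that exponential behavior (1.5 and 0.5 stand for weight values slightly above and below 1):

for L in (10, 50, 100):
    print(L, 1.5 ** L, 0.5 ** L)
# 1.5**100 is about 4e17 while 0.5**100 is about 8e-31: values slightly above
# or below 1 explode or vanish once multiplied across many layers.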
For very deep networks this is a big problem, and until recently it was a limiting factor for implementing deep neural networks. There is now a partial solution that does not solve it completely but helps a lot: a careful initialization of the parameters.
If you are using an activation function like ReLU, instead of a variance of 1/n the desired variance is 2/n.
This modification of W sets the variance to the desired value, by multiplying the randomly initialized matrix by the square root of that term.
This keeps the values of the weight matrices W from being much larger than 1 or much smaller than 1.
The factor that multiplies the weight matrix depends on the type of activation function being used; the values can be found in the literature.
The most common value for ReLU is the one shown above, and the value for tanh is the one indicated in the previous image.
This factor can be considered one of the hyperparameters to tune; however, its effect is modest, so it is not the first parameter you would try to modify. In some cases it can solve problems reasonably well.
Take dW[1], db[1], ..., dW[L], db[L] and reshape them into one big vector dtheta.
After concatenating every value into one big vector, the question is: is dtheta the gradient of the cost function J(theta)?
So you have two vectors, the analytic derivative and the numerical approximation, and you must check one against the other.
If the result of the check is around 10^-7, everything is fine. But if it is on the order of 10^-5, it is best to take a closer look.
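A minimal sketch of the check for a scalar theta, using J(theta) = theta**2 purely as a toy cost:

J = lambda theta: theta ** 2       # toy cost, assumed only for illustration
grad = lambda theta: 2 * theta     # its analytic derivative
theta, eps = 3.0, 1e-7
gradapprox = (J(theta + eps) - J(theta - eps)) / (2 * eps)   # two-sided difference
diff = abs(grad(theta) - gradapprox) / (abs(grad(theta)) + abs(gradapprox))
print(diff)   # should be well below 1e-7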
3. If your Neural Network model seems to have high bias, what of the following would be
promising things to try? (Check all that apply.)
4. You are working on an automated check-out kiosk for a supermarket, and are building a
classifier for apples, bananas, and oranges. Suppose your classifier obtains a training set
error of 0.5%, and a dev set error of 7%. Which of the following are promising things to try
to improve your classifier?
8. Increasing the parameter keep_prob from (say) 0.5 to 0.6 will likely cause the following:
9. Which of these techniques are useful for reducing variance (reducing overfitting)?
10. Why do we normalize the inputs x?
5. Notebook Initialization
By completing this assignment you will:
- Understand how different initialization methods impact your model's performance
- Implement zero initialization and see that it fails to "break symmetry"
- Recognize that random initialization "breaks symmetry" and yields more efficient models
- Understand that you can use both random initialization and scaling to get even better training performance on your model
Initialization
Training your neural network requires specifying an initial value of the weights. A well chosen
initialization method will help learning. If you completed the previous course of this
specialization, you probably followed our instructions for weight initialization, and it has
worked out so far. But how do you choose the initialization for a new neural network? In this
notebook, you will see how different initializations lead to different results.
To get started, run the following cell to load the packages and the planar dataset you will try
to classify.
You would like a classifier to separate the blue dots from the red dots.
Model
Arguments:
Y -- true "label" vector (containing 0 for red dots; 1 for blue dots), of shape (1, number of
examples)
Returns:
Implement the following function to initialize all parameters to zeros. You'll see later that this does not work well since it fails to "break symmetry", but let's try it anyway and see what happens.
The layer-dimensions vector contains the number of nodes in each layer, i.e., each n[l].
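A sketch of the requested zero initialization, assuming layers_dims is that vector of layer sizes:

import numpy as np
def initialize_parameters_zeros(layers_dims):
    parameters = {}
    L = len(layers_dims)            # number of layers, including the input layer
    for l in range(1, L):
        parameters["W" + str(l)] = np.zeros((layers_dims[l], layers_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters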
In general, initializing all the weights to zero results in the network failing to break symmetry. This
means that every neuron in each layer will learn the same thing, and you might as well be training
a neural network with n[l]=1 for every layer, and the network is no more powerful than a linear
classifier such as logistic regression.
Exercise:
Implement the following function to initialize your weights to large random values (scaled by *10) and your biases to zeros. We are using a fixed np.random.seed(..) to make sure your "random" weights match ours, so don't worry if running your code several times always gives you the same initial values for the parameters.
The previous code raises an error because of the double parentheses; the error is of type: non-integer arguments. (np.random.randn takes the dimensions as separate integer arguments, not as a tuple.)
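For reference, the distinction is that np.random.randn takes the dimensions as separate arguments, while np.zeros takes a shape tuple:

import numpy as np
W = np.random.randn(3, 2) * 10   # correct: dimensions as separate arguments
b = np.zeros((3, 1))             # correct: np.zeros expects a shape tuple
# np.random.randn((3, 2))        # raises the non-integer argument TypeError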
Run the following code to train your model on 15,000 iterations using random initialization.
If you see "inf" as the cost after the iteration 0, this is because of numerical roundoff; a more
numerically sophisticated implementation would fix this. But this isn't worth worrying about for
our purposes.
Anyway, it looks like you have broken symmetry, and this gives better results than before. The model is no longer outputting all 0s.
Observations
- The cost starts very high. This is because with large random-valued weights, the last
activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and
when it gets that example wrong it incurs a very high loss for that example. Indeed, when
log(a^[3]) = log(0), the loss goes to infinity.
- Poor initialization can lead to vanishing/exploding gradients, which also slows down the
optimization algorithm.
- If you train this network longer you will see better results, but initializing with overly large
random numbers slows down the optimization.
In summary
- Initializing weights to very large random values does not work well.
- Hopefully initializing with small random values does better. The important question is: how small should these random values be? Let's find out in the next part!
5.4 He initialization
Finally, try "He Initialization"; this is named for the first author of He et al., 2015. (If you have heard of "Xavier initialization", this is similar, except Xavier initialization uses a scaling factor of sqrt(1./layers_dims[l-1]) for the weights W[l], where He initialization would use sqrt(2./layers_dims[l-1]).)
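A sketch of the He initialization the exercise asks for, following the same layers_dims convention:

import numpy as np
def initialize_parameters_he(layers_dims):
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims)
    for l in range(1, L):
        parameters["W" + str(l)] = (np.random.randn(layers_dims[l], layers_dims[l - 1])
                                    * np.sqrt(2.0 / layers_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters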
Observations:
The model with He initialization separates the blue and the red dots very well in a small number of
iterations.
5.5 Conclusions
- You have seen three different types of initializations. For the same number of iterations
and same hyperparameters the comparison is:
6. Notebook Regularization
- Understand which regularization methods could help your model.
- Implement dropout and see it work on data.
- Recognize that a model without regularization gives you a better accuracy on the training set but not necessarily on the test set.
- Understand that you can use both dropout and regularization on your model
6.1 Regularization
Deep Learning models have so much flexibility and capacity that overfitting can be a serious
problem, if the training dataset is not big enough. Sure it does well on the training set, but the
learned network doesn't generalize to new examples that it has never seen!
Each dot corresponds to a position on the football field where a football player has hit the ball
with his/her head after the French goal keeper has shot the ball from the left side of the football
field.
- If the dot is blue, it means the French player managed to hit the ball with his/her head
- If the dot is red, it means the other team's player hit the ball with their head
Your goal: Use a deep learning model to find the positions on the field where the goalkeeper
should kick the ball.
6.3 Analysis of the dataset
This dataset is a little noisy, but it looks like a diagonal line separating the upper left half (blue)
from the lower right half (red) would work well.
You will first try a non-regularized model. Then you'll learn how to regularize it and decide which
model you will choose to solve the French Football Corporation's problem.
- in regularization mode -- by setting the lambd input to a non-zero value. We use "lambd"
instead of "lambda" because "lambda" is a reserved keyword in Python.
- in dropout mode -- by setting the keep_prob to a value less than one
You will first try the model without any regularization. Then, you will implement:
- L2 regularization
- Dropout
In each part, you will run this model with the correct inputs so that it calls the functions you've
implemented. Take a look at the code below to familiarize yourself with the model.
The code for the neural network is provided beforehand, including dedicated functions for each regularization case.
Arguments:
Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (output size, number of
examples)
Returns:
parameters -- parameters learned by the model. They can then be used to predict.
Let's train the model without any regularization, and observe the accuracy on the train/test sets.
The train accuracy is 94.8% while the test accuracy is 91.5%. This is the baseline model (you will
observe the impact of regularization on this model). Run the following code to plot the decision
boundary of your model.
The non-regularized model is obviously overfitting the training set. It is fitting the noisy points! Let's now look at two techniques to reduce overfitting.
6.5 L2 Regularization
The standard way to avoid overfitting is called L2 regularization. It consists of appropriately modifying your cost function, from the cross-entropy cost to a regularized version:
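Reconstructed from the code below, the change is from the cross-entropy cost
J = -(1/m) * sum_i( y(i) * log(a[3](i)) + (1 - y(i)) * log(1 - a[3](i)) )
to the regularized cost
J_regularized = J + (1/m) * (lambd/2) * sum_l ||W[l]||^2,
where ||W[l]||^2 is the sum of the squares of all the entries of the weight matrix W[l].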
Note that you must do this for W1, W2 and W3, then sum the three terms and multiply by (1/m)(lambd/2).
That is, the cost consists of the familiar cost formula, cross_entropy_cost = compute_cost(A3, Y), plus the regularization part, which is computed as:
L2_regularization_cost = (1/m) * (lambd/2) * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))
If there are more layers, you may need a for loop to accumulate the sum over all the layers. I still don't know whether this can be vectorized.
Of course, because you changed the cost, you have to change backward propagation as well! All
the gradients have to be computed with respect to this new cost.
Implement the changes needed in backward propagation to take into account regularization. The changes only concern dW1, dW2 and dW3. For each, you have to add the regularization term's gradient, d/dW of (lambd/(2m)) * W^2, which is (lambd/m) * W.
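For instance, a sketch of the modified dW3 (shapes and stand-in values assumed; dZ3 and A2 would come from the cached forward pass):

import numpy as np
np.random.seed(0)
m, lambd = 5, 0.7
dZ3, A2, W3 = np.random.randn(1, m), np.random.randn(3, m), np.random.randn(1, 3)
# Only change vs. the unregularized version: the added (lambd / m) * W3 term.
dW3 = (1.0 / m) * np.dot(dZ3, A2.T) + (lambd / m) * W3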
For the next step it is important to keep a cache with some values from the forward propagation pass, as has been done so far.
Let's now run the model with L2 regularization (λ = 0.7).
Congrats, the test set accuracy increased to 93%. You have saved the French football team!
You are not overfitting the training data anymore. Let's plot the decision boundary.
Observations:
- The value of λ is a hyperparameter that you can tune using a dev set.
- L2 regularization makes your decision boundary smoother. If λ is too large, it is also
possible to "oversmooth", resulting in a model with high bias.
L2-regularization relies on the assumption that a model with small weights is simpler than a model
with large weights. Thus, by penalizing the square values of the weights in the cost function you
drive all the weights to smaller values. It becomes too costly for the cost to have large weights!
This leads to a smoother model in which the output changes more slowly as the input changes.
6.6 Dropout
Finally, dropout is a widely used regularization technique that is specific to deep learning. It
randomly shuts down some neurons in each iteration.
At each iteration, you shut down (= set to zero) each neuron of a layer with probability
1−keep_prob or keep it with probability keep_prob. The dropped neurons don't contribute to the
training in both the forward and backward propagations of the iteration.
When you shut some neurons down, you actually modify your model. The idea behind drop-out is
that at each iteration, you train a different model that uses only a subset of your neurons. With
dropout, your neurons thus become less sensitive to the activation of one other specific neuron,
because that other neuron might be shut down at any time.
1. In lecture, we discussed creating a variable d[1] with the same shape as a[1] using np.random.rand() to randomly get numbers between 0 and 1. Here, you will use a vectorized implementation, so create a random matrix D[1] of the same dimension as A[1].
Hint: Let's say that keep_prob = 0.8, which means that we want to keep about 80% of the neurons
and drop out about 20% of them. We want to generate a vector that has 1's and 0's, where about
80% of them are 1 and about 20% are 0. This python statement:
X = (X < keep_prob).astype(int)
is conceptually the same as this if-else statement (for the simple case of a one-dimensional array) :
Note that the X = (X < keep_prob).astype(int) works with multi-dimensional arrays, and the
resulting output preserves the dimensions of the input array.
Also note that without using .astype(int), the result is an array of booleans True and False, which
Python automatically converts to 1 and 0 if we multiply it with numbers. (However, it's better
practice to convert data into the data type that we intend, so try using .astype(int).)
3. Set A[1] to A[1]∗D[1]. (You are shutting down some neurons). You can think of D[1] as a
mask, so that when it is multiplied with another matrix, it shuts down some of the values.
4. Divide A[1] by keep_prob. By doing this you are assuring that the result of the cost will still
have the same expected value as without drop-out. (This technique is also called inverted
dropout.)
Backpropagation with dropout is actually quite easy. You will have to carry out 2 Steps:
1. You had previously shut down some neurons during forward propagation, by applying a mask D[1] to A1. In backpropagation, you will have to shut down the same neurons, by reapplying the same mask D[1] to dA1.
Important: the same neurons must be eliminated in the backward pass to stay consistent with the model.
2. During forward propagation, you had divided A1 by keep_prob. In backpropagation, you'll
therefore have to divide dA1 by keep_prob again (the calculus interpretation is that if A[1]
is scaled by keep_prob, then its derivative dA[1] is also scaled by the same keep_prob).
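A sketch of those two steps, with stand-in values for dA1 and the cached mask D1:

import numpy as np
np.random.seed(1)
keep_prob = 0.8
dA1 = np.random.randn(4, 5)                          # stand-in upstream gradient
D1 = (np.random.rand(4, 5) < keep_prob).astype(int)  # the mask saved during the forward pass
dA1 = dA1 * D1          # step 1: shut down the same neurons as in the forward pass
dA1 = dA1 / keep_prob   # step 2: rescale, since A1 was scaled by keep_prob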
Personal note: the instructor always uses 1./m. I don't know why; perhaps it is so the value is treated as a float.
The error shown is not explained in the notebook. However, it apparently comes from a division by zero in the evaluation of the model.
Note:
- A common mistake when using dropout is to use it both in training and testing. You should
use dropout (randomly eliminate nodes) only in training.
- Deep learning frameworks like tensorflow, PaddlePaddle, keras or caffe come with a
dropout layer implementation.
6.7 Conclusions
Here are the results of our three models:
Note that regularization hurts training set performance! This is because it limits the ability of the
network to overfit to the training set. But since it ultimately gives better test accuracy, it is helping
your system.
7. Notebook Gradient Checking
By completing this assignment, you will:
Goal
You are part of a team working to make mobile payments available globally, and are asked to build
a deep learning model to detect fraud--whenever someone makes a payment, you want to see if
the payment might be fraudulent, such as if the user's account has been taken over by a hacker.
But backpropagation is quite challenging to implement, and sometimes has bugs. Because this is a
mission-critical application, your company's CEO wants to be really certain that your
implementation of backpropagation is correct. Your CEO says, "Give me a proof that your
backpropagation is actually working!" To give this reassurance, you are going to use "gradient
checking".
Backpropagation computes the gradients dJ/dtheta, where theta denotes the parameters of the model. J is computed using forward propagation and your loss function.
Because forward propagation is relatively easy to implement, you're confident you got that right, and so you're almost 100% sure that you're computing the cost J correctly. Thus, you can use your code for computing J to verify the code for computing dJ/dtheta.
- dJ/dtheta is what you want to make sure you are computing correctly
- You can compute gradapprox = (J(theta + eps) - J(theta - eps)) / (2 * eps) (in the case that theta is a real number), since you are confident your implementation for J is correct.
You will implement code to compute J(.) and its derivative. You will then use gradient checking to make sure your derivative computation for J is correct.
The diagram above shows the key computation steps: First start with x, then evaluate the function
J(x) ("forward propagation"). Then compute the derivative ∂J/∂θ ("backward propagation").
Now, implement the backward propagation step (derivative computation) of Figure 1. That is, compute the derivative of J(θ) = θx with respect to θ. To save you from doing the calculus, you should get dtheta = x.
Instructions:
1. First compute "gradapprox" using the formula above (1) and a small value of ε. Here are the steps to follow:
2. Then compute the gradient using backward propagation, and store the result in a variable
"grad"
3. Finally, compute the relative difference between "gradapprox" and the "grad" using the
following formula
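The formula in question (standard for gradient checking) is:
difference = ||grad − gradapprox||_2 / ( ||grad||_2 + ||gradapprox||_2 )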
You will need 3 Steps to compute this formula:
If this difference is small (say less than 10^-7), you can be quite confident that you have computed
your gradient correctly. Otherwise, there may be a mistake in the gradient computation.
Now, in the more general case, your cost function J has more than a single 1D input. When you are training a neural network, θ actually consists of multiple matrices W[l] and biases b[l]! It is important to know how to do a gradient check with higher-dimensional inputs. Let's do it!
The inverse function is "vector_to_dictionary" which outputs back the "parameters" dictionary.
We have also converted the "gradients" dictionary into a vector "grad" using
gradients_to_vector(). You don't need to worry about that.
Instructions: Here is pseudo-code that will help you implement the gradient check.
Note: when you do x + y, numpy uses a "lowest common denominator" data type for the result. Since x is int and y is float, this means it returns a float array.
But when you do x += y, you are forcing the result to conform to the data type of x, which is int.
This part of the code gave me several problems, and it was hard to find the answer, which has to do with what is shown in the note above.
Can you get gradient check to declare your derivative computation correct? Even though this part
of the assignment isn't graded, we strongly urge you to try to find the bug and re-run gradient
check until you're convinced backprop is now correctly implemented.
Note:
8. Optimization algorithms
8.1 Mini batch gradient descent
Batch vs. mini-batch gradient descent
So, we already see that vectorization allows you to efficiently compute on m examples without a
specific for loop.
What if m = 5,000,000 or bigger?
Then what you must do is pass over the entire training set before taking one small gradient descent step, then pass over the data again to take a second step, and so on.
It turns out you can have a faster algorithm if you let gradient descent make progress even before it finishes processing the whole training set.
Imagine splitting the training set into smaller ones, called mini-batches.
A new notation, {t}, is introduced to denote the corresponding mini-batch number t.
When the dataset is very large, mini-batch gradient descent runs much faster than batch gradient descent, which is why everyone in deep learning tends to use mini-batches.
The other extreme would be if your mini-batch size were 1. This gives you an algorithm called stochastic gradient descent.
And here every example is its own mini-batch.
Stochastic gradient descent does not always head toward the minimum, and it always moves with a lot of noise. It can oscillate around the minimum, or around the final value, without settling on a fixed value.
When you train with a single example at a time, stochastic gradient descent, you can reduce the noise by using a smaller learning rate. But the big disadvantage is that you lose the speed gained from vectorization, which makes it very inefficient.
The ideal is to take a value between the two extremes that lets the model be optimized. With mini-batches the model may not head straight to the minimum, or may fluctuate near a final value, which can be fixed by decreasing the learning rate.
When the training set has fewer than 2000 examples, batch gradient descent can be used without problems.
1. Typical mini-batch sizes are powers of 2, from 64 up to 512 (64, 128, 256, 512).
2. Make sure the mini-batches fit in CPU/GPU memory.
This parameter becomes a hyperparameter that must be found iteratively, i.e., try different powers of two and settle on one that lets gradient descent run efficiently.
This gives the red line in the previous example. You can think of v_t as an average over roughly 1/(1−beta) days of temperature.
What beta = 0.98 means is that you give a high weight to what happened previously, while giving a weight of only 0.02 to what you are seeing right now (see the formula).
If we go to the other extreme, with beta = 0.5, we get an average over about 2 days and the yellow curve.
This makes it more susceptible to outliers, but it adapts more quickly to temperature changes.
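A minimal sketch of the exponentially weighted average being described, with a made-up list of daily temperatures:

def ewa(thetas, beta):
    # v_t = beta * v_{t-1} + (1 - beta) * theta_t, roughly an average
    # over the last 1 / (1 - beta) values.
    v, out = 0.0, []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return out
print(ewa([10, 12, 11, 30, 13], beta=0.9))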
Suppose we want to find the minimum of a cost function with the following shape.
This can take many steps that oscillate toward the minimum. You could increase the learning rate, but that could make each step shoot off to the sides, as shown by the purple curve.
To fix this, gradient descent with momentum is used, which works for both batch and mini-batch.
Applying this gradient, each step shows less variation in the vertical direction and approaches the minimum more efficiently.
Now, a small ball can roll downhill gaining momentum thanks to the acceleration. In addition, there is the beta term, which represents friction and keeps the ball from rolling out of control.
Vdw and Vdb are initialized as zero vectors with the dimensions of dW and db, respectively.
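A minimal sketch of the momentum update for one weight matrix (the gradients here are random stand-ins):

import numpy as np
np.random.seed(0)
W = np.random.randn(3, 2)
v_dW = np.zeros_like(W)            # initialized with zeros, as noted above
beta, alpha = 0.9, 0.01
for step in range(100):
    dW = np.random.randn(*W.shape)         # stand-in mini-batch gradient
    v_dW = beta * v_dW + (1 - beta) * dW   # exponentially weighted average of gradients
    W = W - alpha * v_dW                   # the update uses the velocity, not the raw gradient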
8.7 RMSprop
This is another algorithm, root mean square prop, that can also speed up gradient descent.
With standard gradient descent you oscillate in the vertical direction even when what you need is to advance horizontally. Like the previous method, RMSprop lets you move faster in a specific, optimal direction.
This allows a larger learning rate without diverging in the vertical direction (in this case).
That is, in practice you get an algorithm that damps the variations in whichever direction has the most oscillation.
Since the two optimization ideas, RMSprop and momentum, will be combined, RMSprop uses a beta2. In addition, for numerical reasons, to avoid dividing by zero when Sdb or SdW are very small, an epsilon on the order of 10^-8 is added so the computed value does not blow up.
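A minimal sketch of the RMSprop update under those conventions (beta2 and epsilon as described; the gradients are stand-ins):

import numpy as np
np.random.seed(0)
W = np.random.randn(3, 2)
s_dW = np.zeros_like(W)
beta2, alpha, eps = 0.999, 0.01, 1e-8
for step in range(100):
    dW = np.random.randn(*W.shape)                  # stand-in mini-batch gradient
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2     # weighted average of squared gradients
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)      # damps the directions that oscillate most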
If alpha_0 = 0.2:
8.9.1 Other learning rate decay methods
1. Exponential decay: alpha = 0.95^epoch * alpha_0
2. alpha = (k / sqrt(epoch)) * alpha_0
3. Discrete staircase: reduce alpha by half after a fixed number of steps
This is a parameter that helps; however, it is not the most important one to set through an iterative process.
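As a sketch, the basic decay rule from the lecture, alpha = alpha_0 / (1 + decay_rate * epoch), reproduces the alpha_0 = 0.2 example above:

alpha0, decay_rate = 0.2, 1.0
for epoch in range(1, 5):
    alpha = alpha0 / (1 + decay_rate * epoch)
    print(epoch, round(alpha, 3))   # 0.1, 0.067, 0.05, 0.04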
Until now, you've always used Gradient Descent to update the parameters and minimize the cost.
In this notebook, you will learn more advanced optimization methods that can speed up learning
and perhaps even get you to a better final value for the cost function. Having a good optimization
algorithm can be the difference between waiting days vs. just a few hours to get a good result.
At each step of the training, you update your parameters following a certain direction to try to get
to the lowest possible point.
A variant of this is Stochastic Gradient Descent (SGD), which is equivalent to mini-batch gradient
descent where each mini-batch has just 1 example. The update rule that you have just
implemented does not change. What changes is that you would be computing gradients on just
one training example at a time, rather than on the whole training set. The code examples below
illustrate the difference between stochastic gradient descent and (batch) gradient descent.
In practice, you'll often get faster results if you use neither the whole training set nor only one training example to perform each update. Mini-batch gradient descent uses an intermediate number of examples for each step. With mini-batch gradient descent, you loop over the mini-batches instead of looping over individual training examples.
1. The difference between gradient descent, mini-batch gradient descent and stochastic
gradient descent is the number of examples you use to perform one update step.
2. You have to tune a learning rate hyperparameter α
3. With a well-tuned mini-batch size, it usually outperforms either gradient descent or stochastic gradient descent (particularly when the training set is large).
1. Shuffle: Create a shuffled version of the training set (X, Y) as shown below. Each column of X and Y represents a training example. Note that the random shuffling is done synchronously between X and Y, so that after the shuffle the i-th column of X is the example corresponding to the i-th label in Y. The shuffling step guarantees that the examples will be split randomly into different mini-batches.
2. Partition: Partition the shuffled (X, Y) into mini-batches of size mini_batch_size (here 64).
Note that the number of training examples is not always divisible by mini_batch_size. The
last mini batch might be smaller, but you don't need to worry about this. When the final
mini-batch is smaller than the full mini_batch_size, it will look like this
Exercise: Implement random_mini_batches. We coded the shuffling part for you. To help you
with the partitioning step, we give you the following code that selects the indexes for the 1st
and 2nd mini-batches:
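That index-selection code (a sketch of the notebook's slicing pattern) takes consecutive column ranges out of the shuffled data:

first_mini_batch_X = shuffled_X[:, 0 : mini_batch_size]
second_mini_batch_X = shuffled_X[:, mini_batch_size : 2 * mini_batch_size]
# In general, mini-batch k is shuffled_X[:, k * mini_batch_size : (k + 1) * mini_batch_size]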
Note that the last mini-batch might end up smaller than mini_batch_size=64. Let ⌊s⌋ represent s rounded down to the nearest integer (this is math.floor(s) in Python). If the total number of examples is not a multiple of mini_batch_size=64, there will be ⌊m/mini_batch_size⌋ mini-batches with a full 64 examples, and the final mini-batch will contain the remaining (m − mini_batch_size × ⌊m/mini_batch_size⌋) examples.
1. Shuffling and Partitioning are the two steps required to build mini-batches
2. Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128.
10.3 Momentum
Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will "oscillate" toward convergence. Using momentum can reduce these oscillations.
Momentum takes into account the past gradients to smooth out the update. We will store the
'direction' of the previous gradients in the variable v. Formally, this will be the exponentially
weighted average of the gradient on previous steps. You can also think of v as the "velocity" of a
ball rolling downhill, building up speed (and momentum) according to the direction of the
gradient/slope of the hill.
10.3.1 Initialize the velocity (v) with zeros
Exercise: Initialize the velocity. The velocity, v, is a python dictionary that needs to be initialized
with arrays of zeros. Its keys are the same as those in the grads dictionary, that is: for l=1,…,L:
Note that the iterator l starts at 0 in the for loop while the first parameters are v["dW1"] and
v["db1"] (that's a "one" on the superscript). This is why we are shifting l to l+1 in the for loop.
Note that:
1. The velocity is initialized with zeros. So the algorithm will take a few iterations to "build
up" velocity and start to take bigger steps.
2. If β=0, then this just becomes standard gradient descent without momentum.
10.4 Adam
Adam is one of the most effective optimization algorithms for training neural networks. It
combines ideas from RMSProp (described in lecture) and Momentum.
where v and s are the exponentially weighted averages of the gradients and of their squares, respectively.
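The standard Adam update (reconstructed here for reference) is:
v_dW = beta1 * v_dW + (1 - beta1) * dW,    v_corrected = v_dW / (1 - beta1^t)
s_dW = beta2 * s_dW + (1 - beta2) * dW^2,  s_corrected = s_dW / (1 - beta2^t)
W = W - alpha * v_corrected / (sqrt(s_corrected) + epsilon)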
10.4.2 Initialize the Adam variables v, s which keep track of the past information
We have already implemented a 3-layer neural network. You will train it with:
10.5.4 Summary
1. Momentum usually helps, but given the small learning rate and the simplistic dataset, its impact is almost negligible. Also, the huge oscillations you see in the cost come from the fact that some minibatches are more difficult than others for the optimization algorithm.
2. Adam on the other hand, clearly outperforms mini-batch gradient descent and
Momentum. If you run the model for more epochs on this simple dataset, all three
methods will lead to very good results. However, you've seen that Adam converges a lot
faster.
1. Relatively low memory requirements (though higher than gradient descent and gradient
descent with momentum)
2. Usually works well even with little tuning of hyperparameters (except α)
This practice of going from coarse to fine lets you focus on the regions that give better results and pick a value that optimizes the model among the hyperparameters being analyzed.
If the value of alpha can vary between 0.0001 and 1, then picking random values "directly" may concentrate the samples in one region. Instead, take a value r that varies between -4 and 0; in Python:
r = -4 * np.random.rand()
learning_rate = 10 ** r
(Note that np.random.rand(), a uniform sample, is what is needed here, and that exponentiation in Python is **, not ^.)
Another parameter that is hard to set is beta, used for the weighted averages.
Sampling linearly among the possible values of beta makes no sense, so it is done logarithmically over the values of 1−beta instead.
The results are very sensitive to changes in beta when beta is very close to 1.
That is, when beta changes from 0.9 to 0.9005 the effect is negligible; however, a change from beta = 0.999 to 0.9995 is noticeable, since the formula behaves like 1/(1−beta).
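A sketch of this log-scale sampling for beta between 0.9 and 0.999 (so 1−beta runs from 10^-1 to 10^-3):

import numpy as np
r = -1 - 2 * np.random.rand()   # uniform in [-3, -1]
beta = 1 - 10 ** r              # log-uniform between 0.9 and 0.999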
Now the question is: is it possible to normalize the values of the activations so that the next layer's parameters W, b train faster?
Well, that is what Batch Normalization does.
Technically, it is not the values of a that get normalized but those of z. There is debate about which of the two should be normalized; in practice, however, it is more common to do it for z.
Then the variance:
What this achieves is a mean of zero and a standard variance of one. However, we do not want the hidden units to always have mean zero and variance one; it makes sense to expect the hidden units to have a different distribution.
Here gamma and beta are learnable parameters of the model, which must be updated.
This makes it possible to control the variance and the mean through the parameters beta and gamma.
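The full set of equations being described (standard batch norm, written out for reference):
mu = (1/m) * sum_i z(i)
sigma^2 = (1/m) * sum_i (z(i) - mu)^2
z_norm(i) = (z(i) - mu) / sqrt(sigma^2 + epsilon)
z_tilde(i) = gamma * z_norm(i) + beta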
These parameters can also be updated with the other algorithms, such as gradient descent with momentum, RMSprop, or Adam.
When the model is trained only on the data on the left (only black cats), it will not get good results when tested on the data on the right.
It guarantees that, even if the values of z vary, the mean and the variance stay constant (determined by beta and gamma). That is, it makes the input values to each layer more stable.
- Each mini batch is scaled by the mean/variance computed on just that mini batch
- This adds some noise to the values z[l] within that mini batch. So like dropout, it adds
some noise to each hidden layer’s activations.
- This has a slight regularization effect.
During training, several examples are usually computed at a time. At test time, however, examples may be processed one by one, so the equations must be adapted to make sense. The following are the equations used during training:
Here m is the number of examples in the mini-batch.
Imagine you want to recognize cats, dogs, and baby chicks. That is, there are several inputs and the output can be one among several classes.
The number of output nodes equals the number of classes.
The probability that the output is each of the possible classes is computed, and the probabilities add up to 100%. Note that the decision boundaries come out fairly linear; however, deeper networks can learn more complex boundaries.
In contrast, with "hard max" the largest output becomes one and the rest become zero.
Softmax regression generalizes logistic regression to C classes (more than two classes).
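A minimal NumPy sketch of the softmax activation, with example logits for C = 4 classes:

import numpy as np
z = np.array([5.0, 2.0, -1.0, 3.0])   # example logits z[L]
t = np.exp(z)                          # elementwise exponentiation
a = t / np.sum(t)                      # probabilities that sum to 1
print(a)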
Criteria for choosing among the frameworks (since most of them improve continuously and not all of them fit every requirement):
14.2 TensorFlow
TensorFlow has the functions needed to perform backprop already implemented.
15. Practice questions: Hyperparameters tuning, Batch Normalization, programming frameworks
1. If searching among a large number of hyperparameters, you should try values in a grid
rather than random values, so that you can carry out the search more systematically
and not rely on chance. False
2. Every hyperparameter, if set poorly, can have a huge negative impact on training, and
so all hyperparameters are about equally important to tune well. False
3. During hyperparameter search, whether you try to babysit one model (“Panda”
strategy) or train a lot of models in parallel (“Caviar”) is largely determined by
4. If you think β (the hyperparameter for momentum) is between 0.9 and 0.99, which of the following is the recommended way to sample a value for beta?
5. Finding good hyperparameter values is very time-consuming. So typically you should
do it once at the start of the project, and try to find very good hyperparameters so that
you don’t ever have to revisit tuning them again. True or false?
6. In batch normalization as presented in the videos, if you apply it on the l-th layer of your neural network, what are you normalizing?
9. After training a neural network with Batch Norm, at test time, to evaluate the neural
network on a new example you should:
10. Which of these statements about deep learning programming frameworks are true?
16. Programming Assignment
16.1 TensorFlow
In this notebook you will learn all the basics of Tensorflow. You will implement useful functions
and draw the parallel with what you did using Numpy. You will understand what Tensors and
operations are, as well as how to execute them in a computation graph.
After completing this assignment you will also be able to implement your own deep learning
models using Tensorflow. In fact, using our brand new SIGNS dataset, you will build a deep neural
network model to recognize numbers from 0 to 5 in sign language with a pretty impressive
accuracy.
TensorFlow Tutorial
Until now, you've always used numpy to build neural networks. Now we will step you through a
deep learning framework that will allow you to build neural networks more easily. Machine
learning frameworks like TensorFlow, PaddlePaddle, Torch, Caffe, Keras, and many others can
speed up your machine learning development significantly. All these frameworks also have a lot of
documentation, which you should feel free to read. In this assignment, you will learn to do the
following in TensorFlow:
- Initialize variables
- Start your own session
- Train algorithms
- Implement a Neural Network
Programing frameworks can not only shorten your coding time, but sometimes also perform
optimizations that speed up your code.
Now that you have imported the library, we will walk you through its different applications. You
will start with an example, where we compute for you the loss of one training example.
Therefore, when we created a variable for the loss, we simply defined the loss as a function of
other quantities but did not evaluate its value. To evaluate it, we had to run
init=tf.global_variables_initializer(). That initialized the loss variable, and in the last line we were
finally able to evaluate the value of loss and print its value.
Great! To summarize, remember to initialize your variables, create a session and run the
operations inside the session.
Next, you'll also have to know about placeholders. A placeholder is an object whose value you can
specify only later. To specify values for a placeholder, you can pass in values by using a "feed
dictionary" (feed_dict variable). Below, we created a placeholder for x. This allows us to pass in a
number later when we run the session.
When you first defined x you did not have to specify a value for it. A placeholder is simply a
variable that you will assign data to only later, when running the session. We say that you feed
data to these placeholders when running the session.
Here's what's happening: When you specify the operations needed for a computation, you are
telling TensorFlow how to construct a computation graph. The computation graph can have some
placeholders whose values you will specify only later. Finally, when you run the session, you are
telling TensorFlow to execute the computation graph.
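A minimal sketch of that flow in the TensorFlow 1.x style the notebook uses:

import tensorflow as tf   # TensorFlow 1.x API
x = tf.placeholder(tf.int64, name="x")           # value will be supplied later
with tf.Session() as sess:
    print(sess.run(2 * x, feed_dict={x: 3}))     # feeds 3 into x and prints 6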
Compute WX + b where W, X, and b are drawn from a random normal distribution. W is of shape (4, 3), X is (3, 1) and b is (4, 1). As an example, here is how you would define a constant X that has shape (3, 1):
You will do this exercise using a placeholder variable x. When running the session, you should use
the feed dictionary to pass in the input z. In this exercise, you will have to:
- create a placeholder x
- define the operations needed to compute the sigmoid using tf.sigmoid, and then
- run the session
Implement the sigmoid function below. You should use the following:
Note that there are two typical ways to create and use sessions in tensorflow:
To summarize, you now know how to:
1. Create placeholders
2. Specify the computation graph corresponding to operations you want to compute
3. Create the session
4. Run the session, using a feed dictionary if necessary to specify placeholder variables'
values
Implement the cross-entropy loss. The function you will use is:
Your code should input z, compute the sigmoid (to get a) and then compute the cross-entropy cost
J. All this can be done using one call to tf.nn.sigmoid_cross_entropy_with_logits, which computes:
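Per example, that quantity is the cross-entropy loss:
-( y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z)) )
and averaging it over the m examples gives the cost J.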
Important: see how feed_dict works.
This is called a "one hot" encoding, because in the converted representation exactly one element
of each column is "hot" (meaning set to 1). To do this conversion in numpy, you might have to
write a few lines of code. In tensorflow, you can use one line of code:
Implement the function below to take one vector of labels and the total number of classes C, and return the one hot encoding. Use tf.one_hot() to do this.
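A sketch of that function in the notebook's TF1 style (axis=0 makes each column the one-hot encoding of one label):

import tensorflow as tf
def one_hot_matrix(labels, C):
    one_hot = tf.one_hot(labels, depth=C, axis=0)   # columns are the one-hot vectors
    with tf.Session() as sess:
        return sess.run(one_hot)
print(one_hot_matrix([1, 2, 3, 0, 2, 1], C=4))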
16.2.5 Initialize with zeros and ones
Now you will learn how to initialize a vector of zeros and ones. The function you will be calling is
tf.ones(). To initialize with zeros you could use tf.zeros() instead. These functions take in a shape
and return an array of dimension shape full of zeros and ones respectively
Implement the function below to take in a shape and return an array of ones with that shape.
One afternoon, with some friends we decided to teach our computers to decipher sign language.
We spent a few hours taking pictures in front of a white wall and came up with the following
dataset. It's now your job to build an algorithm that would facilitate communications from a
speech-impaired person to someone who doesn't understand sign language.
- Training set: 1080 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5
(180 pictures per number).
- Test set: 120 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (20
pictures per number).
Note that this is a subset of the SIGNS dataset. The complete dataset contains many more signs.
These are the original pictures, before we lowered the image resolution to 64 by 64 pixels.
Your goal is to build an algorithm capable of recognizing a sign with high accuracy. To do so, you
are going to build a tensorflow model that is almost the same as one you have previously built in
numpy for cat recognition (but now using a softmax output). It is a great occasion to compare your
numpy implementation to the tensorflow one.
The model is LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX. The SIGMOID output layer
has been converted to a SOFTMAX. A SOFTMAX layer generalizes SIGMOID to when there are
more than two classes.
Implement the function below to initialize the parameters in tensorflow. You are going use Xavier
Initialization for weights and Zero Initialization for biases. The shapes are given below. As an
example, to help you, for W1 and b1 you could use:
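The lines being referred to (reconstructed from the notebook; tf.contrib exists in TensorFlow 1.x) would be:

W1 = tf.get_variable("W1", [25, 12288],
                     initializer=tf.contrib.layers.xavier_initializer(seed=1))
b1 = tf.get_variable("b1", [25, 1], initializer=tf.zeros_initializer())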
As expected, the parameters haven't been evaluated yet.
Question: Implement the forward pass of the neural network. We commented for you the numpy
equivalents so that you can compare the tensorflow implementation to numpy. It is important to
note that the forward propagation stops at z3. The reason is that in tensorflow the last linear layer
output is given as input to the function computing the loss. Therefore, you don't need a3!
You may have noticed that the forward propagation doesn't output any cache. You will understand
why below when we get to backpropagation.
For instance, for gradient descent the optimizer would be:
This computes the backpropagation by passing through the tensorflow graph in the reverse order.
From cost to inputs.
When coding, we often use _ as a "throwaway" variable to store values that we won't need to use
later. Here, _ takes on the evaluated value of optimizer, which we don't need (and c takes the
value of the cost variable).
Insights:
1. Your model seems big enough to fit the training set well. However, given the difference
between train and test accuracy, you could try to add L2 or dropout regularization to
reduce overfitting.
2. Think about the session as a block of code to train the model. Each time you run the
session on a minibatch, it trains the parameters. In total you have run the session many
times (1500 epochs) until you obtained well trained parameters.