Calc
Caroline Sun
December 2020
1 Introduction
Calculus is going to be an integral part of our next few lectures regarding neural
networks. As a result, this is going to be a crash course into derivatives and
partials—if you’d like to get into more depth, check out the resources at the
end. While it’s going to be difficult to work with this material if you haven’t
been in a calculus class before, try to stick with it: it’ll be worth it.
2 Derivatives
Simply put, a derivative is just the rate of change of a function at a given
point. Even without calculus, we’ve dealt with rate of change and average rate
of change before. Let’s take a look at a basic example: lines.
slope = ∆y/∆x = (f(x₁) − f(x₀))/(x₁ − x₀)
In the example y = 3x + 8, let’s find the rate of change given our equation
above.
slope = ∆y/∆x = (f(1) − f(0))/(1 − 0) = (11 − 8)/(1 − 0) = 3
In a line, the rate of change at any point is constant: it's already defined by
the slope! Let's take a look at more complex curves.
Figure 1: y = x^2 with secant y = 2x
A = (f(b) − f(a))/(b − a)    (1)
Let's start by finding the average rate of change between x = 1 and x = 2,
then. We would get (f(2) − f(1))/(2 − 1) = (4 − 1)/(2 − 1) = 3. Now, let's take a closer look, and
zoom into the average rate of change from x = 1 to x = 1.25: (f(1.25) − f(1))/(1.25 − 1) =
(1.5625 − 1)/(1.25 − 1) = 2.25. Even closer, let's try x = 1 to x = 1.001, which would give us
a slope of 2.001. Does it look like it's approaching a number?
Essentially, we are trying to use our equation for the average rate of change,
and find a b-value as close as we possibly can to our a value. Mathematically,
that’s a limit that we can express by the following expression.
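f'(a) = lim_{b→a} (f(b) − f(a))/(b − a)

As a quick numerical sanity check, here is a minimal Python sketch of the same
zoom-in for f(x) = x^2 at a = 1 (the function and variable names are just
illustrative, not part of the lecture); the difference quotients approach 2.

    # Difference quotients for f(x) = x^2 at a = 1, with b approaching a
    def f(x):
        return x ** 2

    a = 1.0
    for gap in [1.0, 0.25, 0.001, 0.000001]:
        b = a + gap
        print(gap, (f(b) - f(a)) / (b - a))
    # Prints slopes 3.0, 2.25, 2.001, ... approaching 2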
2.3 Run-through of Taking the Derivative of a Function
Fair warning, this section is going to be the most dense of this lecture. Granted,
it's not a full replacement for the first month of calculus, but we're going to
try to distill the essential information needed to take the derivative of the sigmoid
function later.
Luckily for us, we don’t have to use the limit definition of a derivative to
find the derivative of a function. We have several essential derivative rules that,
once established, we can use to find the derivative of complex functions.
First, let's just remember that for any equation f(x) = c, f'(x) = 0. Why?
There is no change. We can circle back to our explanation of the derivative
of a line: a constant function is a horizontal line, so its slope is also 0.
The first derivative rule is known as the Power Rule. This is because it
applies to any function of the form f(x) = x^n. For any function of that form,
the derivative will be f'(x) = nx^(n−1). We can use the power rule on our parabola
y = x^2 example in 2.2, and confirm that the derivative at x = 1 is 2.
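Concretely, applying the power rule to y = x^2 gives dy/dx = 2x^(2−1) = 2x, and
plugging in x = 1 gives 2x = 2, matching the number our average rates of change
were approaching.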
A special function is f(x) = e^x, where the derivative will also equal f'(x) =
e^x. The derivative of its inverse, ln(x), is 1/x.
The next rule is the Product Rule: for a function f(x) = g(x)h(x), its derivative is

f'(x) = h(x)g'(x) + h'(x)g(x)    (5)

An example is f(x) = 3x^2 · e^x, where f'(x) = 6x · e^x + 3x^2 · e^x = 3xe^x(x + 2).
The last relevant derivative rule is the Quotient Rule. For a function
f(x) = g(x)/h(x), its derivative is
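f'(x) = (g'(x)h(x) − g(x)h'(x))/h(x)^2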
3 Sigmoid Derivative
The sigmoid function is

S(x) = 1/(1 + e^(−x))
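One way to get its derivative (this is just a sketch, using the quotient rule above
with g(x) = 1 and h(x) = 1 + e^(−x)) is

S'(x) = (0 · (1 + e^(−x)) − 1 · (−e^(−x)))/(1 + e^(−x))^2 = e^(−x)/(1 + e^(−x))^2,

and since e^(−x)/(1 + e^(−x)) = 1 − S(x), this simplifies to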
S'(x) = S(x)(1 − S(x))    (7)

This equation gives us S'(x) entirely in terms of S(x), allowing us to use
only the S(x) function to find S'(x).
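This is handy in code, too. Below is a minimal Python sketch (the names are
illustrative, not from the lecture) that evaluates S(x) once and reuses it for
the derivative, then checks the result against a finite-difference estimate.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def sigmoid_derivative(x):
        s = sigmoid(x)          # evaluate S(x) once...
        return s * (1.0 - s)    # ...and reuse it: S'(x) = S(x)(1 - S(x))

    x = 0.5
    h = 1e-6
    approx = (sigmoid(x + h) - sigmoid(x)) / h   # difference quotient
    print(sigmoid_derivative(x), approx)         # the two values agree closely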
4 More Dimensions
Another use of derivatives in machine learning is in training neural networks,
through a process known as gradient descent. At its simplest, a neural network is a chain of
weighted sums, and we use the error of our network, or the cost, to change
our weighted sums to maximize accuracy. Every layer of sums affects the next,
so many variables end up influencing the cost. We can't just use an ordinary derivative
now, as our graph will be n-dimensional, and the derivatives above apply to the 2-D
case of one input and one output.
Instead, we use a gradient, ∇f . This gradient is analogous to a derivative in
a multi-variable context. While we won’t have to manually find gradients in
creating neural networks, we’ll try to grasp what a gradient means within a
simple example.
The graph of z = x^2 + y^2 is a paraboloid.

Figure 3: Paraboloid z = x^2 + y^2
Imagine laying a small ball onto the paraboloid: it'll roll downwards, stopping
at (0, 0, 0). The negative of the gradient gives us the specific direction in which
it rolls at any point.
Finding the gradient uses the same derivatives we found earlier, except the gradient is
a vector, with each component being the derivative of the function with respect
to one variable (treating the other variables, temporarily, as constants).
These component derivatives are called partials. The partials of z with respect
to x and y are the following:
∂z/∂x, ∂z/∂y
and our gradient is just those put into vector form, so

∇z = ⟨∂z/∂x, ∂z/∂y⟩

In our paraboloid example, ∂z/∂x = 2x (treating y as a constant) and ∂z/∂y = 2y
(treating x as a constant), so the gradient would be ⟨2x, 2y⟩.
If we plug a point into our gradient, we'll get the direction of steepest
incline. As a result, if we take the negation of that vector, it points in the
direction of steepest descent. In this example, the gradient at (3, 5, 34) is
⟨6, 10⟩, so the direction of steepest descent there is ⟨−6, −10⟩. This can be
generalized to other functions and points.
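This is the idea behind gradient descent: repeatedly take a small step in the
direction of steepest descent. Here is a minimal Python sketch on our paraboloid
(the step size, starting point, and names are illustrative assumptions, not from
the lecture).

    # Gradient descent on z = x^2 + y^2, whose gradient is <2x, 2y>
    def gradient(x, y):
        return 2 * x, 2 * y

    x, y = 3.0, 5.0          # start at the point used above
    step_size = 0.1          # a small, fixed "learning rate"
    for _ in range(100):
        gx, gy = gradient(x, y)
        x -= step_size * gx  # move opposite the gradient...
        y -= step_size * gy  # ...i.e. in the direction of steepest descent
    print(x, y)              # both values end up very close to 0, the minimum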
5 Closing
Hopefully, conceptually, derivatives and gradients make sense. However, further
resources for both will be in the next section.
6 Resources
• https://www.khanacademy.org/math/differential-calculus/dc-diff-intro
• https://www.mathsisfun.com/calculus/derivatives-rules.html
• https://www.youtube.com/watch?v=WUvTyaaNkzM&list=PL0-GT3co4r2wlh6UHTUeQsrf3mlS2lk6x
• https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivative-and-gradient-articles/a/the-gradient
• https://betterexplained.com/articles/vector-calculus-understanding-the-gradient/
7 References
Desmos and GeoGebra were both used to generate the graphs.