Happymonk Test Paper For Data Scientist Intern
Monk AI
Abstract—The choice of Activation Functions (AF) has proven to be an important factor that affects the performance of an Artificial Neural Network (ANN). Use a 1-hidden layer neural network model that adapts to the most suitable activation function according to the data-set. The ANN model can learn for itself the best AF to use by exploiting a flexible functional form, k0 + k1 * x, with parameters k0, k1 being learned from multiple runs. You can use this code-base for implementation guidelines and help: https://github.com/sahamath/MultiLayerPerceptron
I. BACKGROUND

Selection of the best-performing AF for a classification task is essentially a naive (or brute-force) procedure: a popularly used AF is picked and used in the network for approximating the optimal function. If this function fails, the process is repeated with a different AF until the network learns to approximate the ideal function. It is interesting to inquire and inspect whether there exists a possibility of building a framework which uses the inherent clues and insights from the data to bring about the most suitable AF. Such an approach could not only save significant time and effort in tuning the model, but would also open up new ways of discovering essential features of not-so-popular AFs.
II. MATHEMATICAL FRAMEWORK

A. Compact Representation

Let the proposed Ada-Act activation function be mathematically defined as:

g(x) = k0 + k1 x + k2 x^2    (1)

where the coefficients k0, k1 and k2 are learned during training via back-propagation of error gradients.
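As a point of reference, here is a minimal NumPy sketch of g and its derivative, assuming the three coefficients are stored in a length-3 vector K; the names g and g_prime and the identity-map initialisation are illustrative choices, not prescribed by this paper. The 1-hidden-layer test task from the abstract corresponds to the special case k2 = 0.

import numpy as np

def g(x, K):
    """Ada-Act activation g(x) = k0 + k1*x + k2*x^2, applied element-wise."""
    k0, k1, k2 = K
    return k0 + k1 * x + k2 * x ** 2

def g_prime(x, K):
    """Element-wise derivative dg/dx = k1 + 2*k2*x, needed during back-propagation."""
    _, k1, k2 = K
    return k1 + 2 * k2 * x

# Illustrative initialisation: K = [0, 1, 0] makes g the identity map at the start.
K = np.array([0.0, 1.0, 0.0])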
For the purpose of demonstration, consider a feed-forward neural network consisting of an input layer L0 with m nodes for m features, two hidden layers L1 and L2 with n and p nodes respectively, and an output layer L3 with k nodes for k classes. Let zi and ai denote the inputs to and the activations of the nodes in layer Li respectively. Let wi and bi denote the weights and biases applied to the nodes of layer Li-1, and let the activations of layer L0 be the input features of the training examples. Finally, let K denote the 3 x 1 column matrix containing the equation coefficients [k0, k1, k2]^T, and let t denote the number of training examples taken in one batch. Then the forward-propagation equations will be:

z1 = a0 × w1 + b1
a1 = g(z1)
z2 = a1 × w2 + b2
a2 = g(z2)
z3 = a2 × w3 + b3
a3 = Softmax(z3)

where × denotes the matrix multiplication operation and Softmax() denotes the Softmax activation function.
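The forward-propagation equations above can be sketched in NumPy as follows, reusing g and K from the previous sketch. The dictionary layout of params, the row-per-example convention (a0 of shape t x m, matching the a0 × w1 ordering), and the max-subtraction in softmax are assumptions made for this illustration.

import numpy as np

def softmax(z):
    """Row-wise softmax; each row of z holds the logits for one example."""
    e = np.exp(z - z.max(axis=1, keepdims=True))  # subtract the row max for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def forward(a0, params, K):
    """One forward pass through L0 -> L1 -> L2 -> L3.

    a0     : (t, m) batch of input features.
    params : dict with weights w1 (m, n), w2 (n, p), w3 (p, k)
             and biases b1 (1, n), b2 (1, p), b3 (1, k).
    """
    z1 = a0 @ params["w1"] + params["b1"]
    a1 = g(z1, K)
    z2 = a1 @ params["w2"] + params["b2"]
    a2 = g(z2, K)
    z3 = a2 @ params["w3"] + params["b3"]
    a3 = softmax(z3)
    # Keep every intermediate matrix; back-propagation needs all of them.
    return {"a0": a0, "z1": z1, "a1": a1, "z2": z2, "a2": a2, "z3": z3, "a3": a3}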
For back-propagation, let the loss function used in this model be the Categorical Cross-Entropy Loss, let dfi denote the gradient matrix of the loss with respect to the matrix fi (where f can be substituted with z, a, b, or w), and let dK2 and dK1 be matrices of dimension 3 x 1. Then the back-propagation equations will be:

dz3 = a3 - y
dw3 = (1/t) a2^T × dz3
db3 = avg_col(dz3)
da2 = dz3 × w3^T
dz2 = g'(z2) * da2
dw2 = (1/t) a1^T × dz2
db2 = avg_col(dz2)

dK2 = [ avg_e(da2)
        avg_e(da2 * z2)
        avg_e(da2 * z2^2) ]    (2)

da1 = dz2 × w2^T
dz1 = g'(z1) * da1
dw1 = (1/t) a0^T × dz1
db1 = avg_col(dz1)

dK1 = [ avg_e(da1)
        avg_e(da1 * z1)
        avg_e(da1 * z1^2) ]    (3)
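A hedged NumPy sketch of these back-propagation equations is given below, reusing g_prime and the cache returned by the forward sketch. Reading avg_col as the mean over the batch axis and avg_e as the mean over all elements, as well as the helper and key names, are assumptions of this illustration rather than definitions taken from the paper.

import numpy as np

def backward(cache, y, params, K):
    """Gradients for one batch of t examples, mirroring equations (2) and (3).

    y : (t, k) one-hot label matrix for the Categorical Cross-Entropy loss.
    """
    t = y.shape[0]
    grads = {}

    dz3 = cache["a3"] - y
    grads["w3"] = cache["a2"].T @ dz3 / t
    grads["b3"] = dz3.mean(axis=0, keepdims=True)            # avg_col(dz3)

    da2 = dz3 @ params["w3"].T
    dz2 = g_prime(cache["z2"], K) * da2
    grads["w2"] = cache["a1"].T @ dz2 / t
    grads["b2"] = dz2.mean(axis=0, keepdims=True)            # avg_col(dz2)
    grads["dK2"] = np.array([da2.mean(),                     # avg_e(da2)
                             (da2 * cache["z2"]).mean(),     # avg_e(da2 * z2)
                             (da2 * cache["z2"] ** 2).mean()])

    da1 = dz2 @ params["w2"].T
    dz1 = g_prime(cache["z1"], K) * da1
    grads["w1"] = cache["a0"].T @ dz1 / t
    grads["b1"] = dz1.mean(axis=0, keepdims=True)            # avg_col(dz1)
    grads["dK1"] = np.array([da1.mean(),                     # avg_e(da1)
                             (da1 * cache["z1"]).mean(),     # avg_e(da1 * z1)
                             (da1 * cache["z1"] ** 2).mean()])
    return grads

With these gradients, a plain gradient-descent step would update each weight and bias as, for example, w1 -= lr * grads["w1"]; since K is shared by both hidden layers, one natural (but here assumed) choice is K -= lr * (grads["dK1"] + grads["dK2"]).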