https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
Presentation given at the Vietnam Japan AI Community on 2019-05-26.
The presentation summarizes what I've learned about Regularization in Deep Learning.
Disclaimer: The presentation was given at a community event, so it wasn't thoroughly reviewed or revised.
Basics of GAN neural networks
GANs are an advanced technique in the area of neural networks that helps generate new data. This new data is generated based on patterns learned from past experience and raw data.
This document discusses gradient descent algorithms, feedforward neural networks, and backpropagation. It defines machine learning, artificial intelligence, and deep learning. It then explains gradient descent as an optimization technique used to minimize cost functions in deep learning models. It describes feedforward neural networks as having connections that move in one direction from input to output nodes. Backpropagation is mentioned as an algorithm for training neural networks.
Overview on Optimization algorithms in Deep Learning, by Khang Pham
Overview of function optimization in general and in deep learning. The slides cover everything from basic algorithms like batch gradient descent and stochastic gradient descent to state-of-the-art algorithms like Momentum, Adagrad, RMSprop, and Adam.
The document discusses hyperparameters and hyperparameter tuning in deep learning models. It defines hyperparameters as parameters that govern how the model parameters (weights and biases) are determined during training, in contrast to model parameters which are learned from the training data. Important hyperparameters include the learning rate, number of layers and units, and activation functions. The goal of training is for the model to perform optimally on unseen test data. Model selection, such as through cross-validation, is used to select the optimal hyperparameters. Training, validation, and test sets are also discussed, with the validation set used for model selection and the test set providing an unbiased evaluation of the fully trained model.
An overview of gradient descent optimization algorithms, by Hakky St
This document provides an overview of various gradient descent optimization algorithms that are commonly used for training deep learning models. It begins with an introduction to gradient descent and its variants, including batch gradient descent, stochastic gradient descent (SGD), and mini-batch gradient descent. It then discusses challenges with these algorithms, such as choosing the learning rate. The document proceeds to explain popular optimization algorithms used to address these challenges, including momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, and Adam. It provides visualizations and intuitive explanations of how these algorithms work. Finally, it discusses strategies for parallelizing and optimizing SGD and concludes with a comparison of optimization algorithms.
This document summarizes various optimization techniques for deep learning models, including gradient descent, stochastic gradient descent, and variants like momentum, Nesterov's accelerated gradient, AdaGrad, RMSProp, and Adam. It provides an overview of how each technique works and comparisons of their performance on image classification tasks using MNIST and CIFAR-10 datasets. The document concludes by encouraging attendees to try out the different optimization methods in Keras and provides resources for further deep learning topics.
I made these slides for beginners in object detection.
Anchor boxes were really hard for me to understand, so I wrote about them as clearly as I could.
Let's overwhelmingly prosper!!
https://telecombcn-dl.github.io/2017-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
ResNet (short for Residual Network) is a deep neural network architecture that has achieved significant advancements in image recognition tasks. It was introduced by Kaiming He et al. in 2015.
The key innovation of ResNet is the use of residual connections, or skip connections, that enable the network to learn residual mappings instead of directly learning the desired underlying mappings. This addresses the problem of vanishing gradients that commonly occurs in very deep neural networks.
In a ResNet, the input data flows through a series of residual blocks. Each residual block consists of several convolutional layers followed by batch normalization and rectified linear unit (ReLU) activations. The original input to a residual block is passed through the block and added to the output of the block, creating a shortcut connection. This addition operation allows the network to learn residual mappings by computing the difference between the input and the output.
By using residual connections, the gradients can propagate more effectively through the network, enabling the training of deeper models. This enables the construction of extremely deep ResNet architectures with hundreds of layers, such as ResNet-101 or ResNet-152, while still maintaining good performance.
ResNet has become a widely adopted architecture in various computer vision tasks, including image classification, object detection, and image segmentation. Its ability to train very deep networks effectively has made it a fundamental building block in the field of deep learning.
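To make the shortcut concrete, here is a minimal residual block as a hedged PyTorch sketch (our own illustration, not code from the ResNet paper; the single `channels` parameter and fixed 3x3 kernels are simplifying assumptions):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Two 3x3 convolutions with batch normalization, plus an identity shortcut.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # shortcut: add the block input back before the final ReLU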
Vanishing gradients occur when error gradients become very small during backpropagation, hindering convergence. This can happen when activation functions like sigmoid and tanh are used, as their derivatives are between 0 and 0.25. It affects earlier layers more due to more multiplicative terms. Using ReLU activations helps as their derivative is 1 for positive values. Initializing weights properly also helps prevent vanishing gradients. Exploding gradients occur when error gradients become very large, disrupting learning. It can be addressed through lower learning rates, gradient clipping, and gradient scaling.
Given two integer arrays val[0...n-1] and wt[0...n-1] that represent the values and weights associated with n items respectively, find the maximum-value subset of val[] such that the sum of the weights of this subset is smaller than or equal to the knapsack capacity W. Here the branch and bound algorithm is discussed.
GANs are the hottest new topic in the ML arena; however, they present a challenge for researchers and engineers alike. Their design and, most importantly, their code implementation have been causing headaches to ML practitioners, especially when moving to production.
Starting from the very basics of what a GAN is, passing through the Tensorflow implementation using the most cutting-edge APIs available in the framework, and finally production-ready serving at scale using Google Cloud ML Engine.
Slides for the talk: https://www.pycon.it/conference/talks/deep-diving-into-gans-form-theory-to-production
Github repo: https://github.com/zurutech/gans-from-theory-to-production
This document discusses support vector machines (SVMs) for pattern classification. It begins with an introduction to SVMs, noting that they construct a hyperplane to maximize the margin of separation between positive and negative examples. It then covers finding the optimal hyperplane for linearly separable and nonseparable patterns, including allowing some errors in classification. The document discusses solving the optimization problem using quadratic programming and Lagrange multipliers. It also introduces the kernel trick for applying SVMs to non-linear decision boundaries using a kernel function to map data to a higher-dimensional feature space. Examples are provided of applying SVMs to the XOR problem and computer experiments classifying a double moon dataset.
The document summarizes the Batch Normalization technique presented in the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". Batch Normalization aims to address the issue of internal covariate shift in deep neural networks by normalizing layer inputs to have zero mean and unit variance. It works by computing normalization statistics for each mini-batch and applying them to the inputs. This helps in faster and more stable training of deep networks by reducing the distribution shift across layers. The paper presented ablation studies on MNIST and ImageNet datasets showing Batch Normalization improves training speed and accuracy compared to prior techniques.
Artificial neural networks mimic the human brain by using interconnected layers of neurons that fire electrical signals between each other. Activation functions are important for neural networks to learn complex patterns by introducing non-linearity. Without activation functions, neural networks would be limited to linear regression. Common activation functions include sigmoid, tanh, ReLU, and LeakyReLU, with ReLU and LeakyReLU helping to address issues like vanishing gradients that can occur with sigmoid and tanh functions.
Residual neural networks (ResNets) solve the vanishing gradient problem through shortcut connections that allow gradients to flow directly through the network. The ResNet architecture consists of repeating blocks with convolutional layers and shortcut connections. These connections perform identity mappings and add the outputs of the convolutional layers to the shortcut connection. This helps networks converge earlier and increases accuracy. Variants include basic blocks with two convolutional layers and bottleneck blocks with three layers. Parameters like number of layers affect ResNet performance, with deeper networks showing improved accuracy. YOLO is a variant that replaces the softmax layer with a 1x1 convolutional layer and logistic function for multi-label classification.
Fuzzy relations, fuzzy graphs, and the extension principle are three important concepts in fuzzy logic. Fuzzy relations generalize classical relations to allow partial membership and describe relationships between objects to varying degrees. Fuzzy graphs describe functional mappings between input and output linguistic variables. The extension principle provides a procedure to extend functions defined on crisp domains to fuzzy domains by mapping fuzzy sets through functions. These concepts form the foundation of fuzzy rules and fuzzy arithmetic.
https://telecombcn-dl.github.io/2017-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
The document provides an introduction to deep learning and how to compute gradients in deep learning models. It discusses machine learning concepts like training models on data to learn patterns, supervised learning tasks like image classification, and optimization techniques like stochastic gradient descent. It then explains how to compute gradients using backpropagation in deep multi-layer neural networks, allowing models to be trained on large datasets. Key steps like the chain rule and backpropagation of errors from the final layer back through the network are outlined.
This document summarizes that some slides were adapted from various sources including machine learning lectures and professors from Stanford University, Cornell University, IIT Kharagpur, and University of Illinois at Chicago. Students are requested to use this material for study purposes only and not distribute it.
1629 Stochastic subgradient approach for solving linear support vector machines, by Dr Fereidoun Dejahang
This document describes a stochastic subgradient descent approach called Pegasos for efficiently solving linear support vector machines (SVMs) on large datasets. Pegasos improves upon traditional gradient descent methods by using a more aggressive learning rate that allows for faster convergence to suboptimal solutions, which often generalize well to new examples. The key aspects of Pegasos are that it uses mini-batches of training examples to estimate subgradients, projects parameter updates into a bounded space, and converges to solutions much more quickly than traditional SVM solvers while achieving comparable test error rates. Experiments on a large text dataset demonstrate Pegasos' ability to reach accurate solutions orders of magnitude faster than conventional solvers like SVM Light.
Here are the steps to solve this ODE problem in MATLAB:
1. Define the ODE function:
% Right-hand side of dy/dt = -t*y/10
function dydt = odefun(t,y)
    dydt = -t.*y/10;
end
2. Solve the ODE with ode45 on the interval [0,10] with initial condition y(0) = 10:
[t,y] = ode45(@odefun,[0 10],10);
3. Plot the result:
plot(t,y)
xlabel('t')
ylabel('y(t)')
This uses ode45 to solve the ODE dy/dt = -t*y/10 on the interval [0,10] with initial condition y(0) = 10.
Machine Learning workshop by GDSC Amity University ChhattisgarhPoorabpatel
The document discusses various machine learning techniques for image classification, including clustering strategies, feature extraction, and classifiers. It provides examples of k-means clustering, agglomerative clustering, mean-shift clustering, spectral clustering, bag-of-features representations, nearest neighbor classification, linear and nonlinear support vector machines (SVMs). SVMs are discussed in more detail, covering how they can learn nonlinear decision boundaries using the kernel trick, common kernel functions for images, and pros and cons of SVMs for classification.
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessIgor Sfiligoi
Presented at PEARC21.
Most experimental sciences now rely on computing, and biological sciences are no exception. As datasets get bigger, so do the computing costs, making proper optimization of the codes used by scientists increasingly important. Many of the codes developed in recent years are based on the Python-based NumPy, due to its ease of use and good performance characteristics. The composable nature of NumPy, however, does not generally play well with the multi-tier nature of modern CPUs, making any non-trivial multi-step algorithm limited by the external memory access speeds, which are hundreds of times slower than the CPU's compute capabilities. In order to fully utilize the CPU compute capabilities, one must keep the working memory footprint small enough to fit in the CPU caches, which requires splitting the problem into smaller portions and fusing together as many steps as possible. In this paper, we present changes based on these principles to two important functions in the scikit-bio library, principal coordinates analysis and the Mantel test, that resulted in over 100x speed improvement in these widely used, general-purpose tools.
Regression takes a group of random variables, thought to be predicting Y, and tries to find a mathematical relationship between them. This relationship is typically in the form of a straight line (linear regression) that best approximates all the individual data points.
Support Vector Machines aim to find an optimal decision boundary that maximizes the margin between different classes of data points. This is achieved by formulating the problem as a constrained optimization problem that seeks to minimize training error while maximizing the margin. The dual formulation results in a quadratic programming problem that can be solved using algorithms like sequential minimal optimization. Kernels allow the data to be implicitly mapped to a higher dimensional feature space, enabling non-linear decision boundaries to be learned. This "kernel trick" avoids explicitly computing coordinates in the higher dimensional space.
https://telecombcn-dl.github.io/2017-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
This document discusses object detection using Adaboost and various techniques. It begins with an overview of the Adaboost algorithm and provides a toy example to illustrate how it works. Next, it describes how Viola and Jones used Adaboost with Haar-like features and an integral image representation for rapid face detection in images. It achieved high detection rates with very low false positives. The document also discusses how Schneiderman and Kanade used a parts-based representation with localized wavelet coefficients as features for object detection and used statistical independence of parts to obtain likelihoods for classification.
FINBOURNE engineer and machine learning specialist Jack Wright's presentation on an 'introduction to machine learning'.
Topics covered:
What is a learning process and how can machines do it?
Do you understand the difference between empirical and true loss?
How and why do machine learning algorithms go awry?
This presentation uses visual examples to demonstrate how machine learning algorithms work and the principles they’re based on and brings it all together with a worked demo on a real dataset. It goes from “what is learning” through to regularisation and model selection.
This document provides an overview of regression analysis and linear regression. It explains that regression analysis estimates relationships among variables to predict continuous outcomes. Linear regression finds the best fitting line through minimizing error. It describes modeling with multiple features, representing data in vector and matrix form, and using gradient descent optimization to learn the weights through iterative updates. The goal is to minimize a cost function measuring error between predictions and true values.
The document provides information about artificial neural networks (ANNs). It discusses:
- ANNs are computing systems designed to simulate the human brain in processing information. They have self-learning capabilities that enable better results as more data becomes available.
- ANNs are inspired by biological neural systems and are made up of interconnected processing units similar to neurons. The network learns by adjusting the strengths of connections between units.
- Backpropagation is commonly used to train multilayer ANNs. It is a gradient descent algorithm that minimizes error by adjusting weights to better match network outputs to training targets. Weights are adjusted based on error terms propagated back through the network.
Gradient Boosted Regression Trees in scikit-learnDataRobot
Slides of the talk "Gradient Boosted Regression Trees in scikit-learn" by Peter Prettenhofer and Gilles Louppe held at PyData London 2014.
Abstract:
This talk describes Gradient Boosted Regression Trees (GBRT), a powerful statistical learning technique with applications in a variety of areas, ranging from web page ranking to environmental niche modeling. GBRT is a key ingredient of many winning solutions in data-mining competitions such as the Netflix Prize, the GE Flight Quest, or the Heritage Health Prize.
I will give a brief introduction to the GBRT model and regression trees, focusing on intuition rather than mathematical formulas. The majority of the talk will be dedicated to an in-depth discussion of how to apply GBRT in practice using scikit-learn. We will cover important topics such as regularization, model tuning and model interpretation that should significantly improve your score on Kaggle.
This document provides an overview of deep learning including why it is used, common applications, strengths and challenges, common algorithms, and techniques for developing deep learning models. In 3 sentences: Deep learning methods like neural networks can learn complex patterns in large, unlabeled datasets and are better than traditional machine learning for tasks like image recognition. Popular deep learning algorithms include convolutional neural networks for image data and recurrent neural networks for sequential data. Effective deep learning requires techniques like regularization, dropout, data augmentation, and hyperparameter optimization to prevent overfitting on training data.
This document provides an overview of deep learning including:
1. Why deep learning performs better than traditional machine learning for tasks like image and speech recognition.
2. Common deep learning applications such as image recognition, speech recognition, and healthcare.
3. Challenges of deep learning like the need for large datasets and lack of interpretability.
https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
"Number Crunching in Python": slides presented at EuroPython 2012, Florence, Italy
Slides have been authored by me and by Dr. Enrico Franchi.
Scientific and Engineering Computing, Numpy NDArray implementation and some working case studies are reported.
Similar to Optimization for Neural Network Training - Veronica Vilaplana - UPC Barcelona 2018 (20)
This document provides an overview of deep generative learning and summarizes several key generative models including GANs, VAEs, diffusion models, and autoregressive models. It discusses the motivation for generative models and their applications such as image generation, text-to-image synthesis, and enhancing other media like video and speech. Example state-of-the-art models are provided for each application. The document also covers important concepts like the difference between discriminative and generative modeling, sampling techniques, and the training procedures for GANs and VAEs.
The document discusses the Vision Transformer (ViT) model for computer vision tasks. It covers:
1. How ViT tokenizes images into patches and uses position embeddings to encode spatial relationships.
2. ViT uses a class embedding to trigger class predictions, unlike CNNs which have decoders.
3. The receptive field of ViT grows as the attention mechanism allows elements to attend to other distant elements in later layers.
4. Initial results showed ViT performance was comparable to CNNs when trained on large datasets, but lagged behind CNNs when trained on smaller datasets like ImageNet.
Machine translation and computer vision have greatly benefited from the advances in deep learning. A large and diverse amount of textual and visual data have been used to train neural networks whether in a supervised or self-supervised manner. Nevertheless, the convergence of the two fields in sign language translation and production still poses multiple open challenges, like the low video resources, limitations in hand pose estimation, or 3D spatial grounding from poses.
The transformer is the neural architecture that has received the most attention in the early 2020s. It removed the recurrence of RNNs, replacing it with an attention mechanism between the input and output tokens of a sequence (cross-attention) and between the tokens composing the input (and output) sequences, named self-attention.
These slides review the research of our lab since 2016 on applied deep learning, starting from our participation in the TRECVID Instance Search 2014, moving into video analysis with CNN+RNN architectures, and our current efforts in sign language translation and production.
Machine translation and computer vision have greatly benefited from the advances in deep learning. The large and diverse amount of textual and visual data have been used to train neural networks, whether in a supervised or self-supervised manner. Nevertheless, the convergence of the two fields in sign language translation and production still poses multiple open challenges, like the low video resources, limitations in hand pose estimation, or 3D spatial grounding from poses. This talk will present these challenges and the How2✌️Sign dataset (https://how2sign.github.io) recorded at CMU in collaboration with UPC, BSC, Gallaudet University and Facebook.
https://imatge.upc.edu/web/publications/sign-language-translation-and-production-multimedia-and-multimodal-challenges-all
https://imatge-upc.github.io/synthref/
Integrating computer vision with natural language processing has achieved significant progress over the last years owing to the continuous evolution of deep learning. A novel vision-and-language task, which is tackled in the present Master thesis, is referring video object segmentation, in which a language query defines which instance to segment from a video sequence. One of the biggest challenges for this task is the lack of relatively large annotated datasets, since a tremendous amount of time and human effort is required for annotation. Moreover, existing datasets suffer from poor-quality annotations in the sense that approximately one out of ten language expressions fails to uniquely describe the target object.
The purpose of the present Master thesis is to address these challenges by proposing a novel method for generating synthetic referring expressions for an image (video frame). This method produces synthetic referring expressions by using only the ground-truth annotations of the objects as well as their attributes, which are detected by a state-of-the-art object detection deep neural network. One of the advantages of the proposed method is that its formulation allows its application to any object detection or segmentation dataset.
By using the proposed method, the first large-scale dataset with synthetic referring expressions for video object segmentation is created, based on an existing large benchmark dataset for video instance segmentation. A statistical analysis and comparison of the created synthetic dataset with existing ones is also provided in the present Master thesis.
The conducted experiments on three different datasets used for referring video object segmentation prove the efficiency of the generated synthetic data. More specifically, the obtained results demonstrate that pre-training a deep neural network with the proposed synthetic dataset improves its ability to generalize across different datasets, without any additional annotation cost.
Master MATT thesis defense by Juan José Nieto
Advised by Víctor Campos and Xavier Giro-i-Nieto.
27th May 2021.
Pre-training Reinforcement Learning (RL) agents in a task-agnostic manner has shown promising results. However, previous works still struggle to learn and discover meaningful skills in high-dimensional state-spaces. We approach the problem by leveraging unsupervised skill discovery and self-supervised learning of state representations. In our work, we learn a compact latent representation by making use of variational or contrastive techniques. We demonstrate that both allow learning a set of basic navigation skills by maximizing an information theoretic objective. We assess our method in Minecraft 3D maps with different complexities. Our results show that representations and conditioned policies learned from pixels are enough for toy examples, but do not scale to realistic and complex maps. We also explore alternative rewards and input observations to overcome these limitations.
https://imatge.upc.edu/web/publications/discovery-and-learning-navigation-goals-pixels-minecraft
Peter Muschick MSc thesis
Universitat Politècnica de Catalunya, 2020
Sign language recognition and translation has been an active research field in recent years, with most approaches using deep neural networks to extract information from sign language data. This work investigates the mostly disregarded approach of using human keypoint estimation from image and video data with OpenPose, in combination with a transformer network architecture. Firstly, it was shown that it is possible to recognize individual signs (4.5% word error rate (WER)). Continuous sign language recognition, though, was more error prone (77.3% WER), and sign language translation was not possible using the proposed methods, which might be due to the low accuracy scores of human keypoint estimation by OpenPose and the accompanying loss of information, or to insufficient capacity of the used transformer model. Results may improve with datasets containing higher repetition rates of individual signs, or by focusing more precisely on keypoint extraction of the hands.
This document discusses interpretability and explainable AI (XAI) in neural networks. It begins by providing motivation for why explanations of neural network predictions are often required. It then provides an overview of different interpretability techniques, including visualizing learned weights and feature maps, attribution methods like class activation maps and guided backpropagation, and feature visualization. Specific examples and applications of each technique are described. The document serves as a guide to interpretability and explainability in deep learning models.
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
https://telecombcn-dl.github.io/dlai-2020/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
https://telecombcn-dl.github.io/drl-2020/
This course presents the principles of reinforcement learning as an artificial intelligence tool based on the interaction of the machine with its environment, with applications to control tasks (e.g. robotics, autonomous driving) or decision making (e.g. resource optimization in wireless communication networks). It also advances the development of deep neural networks trained with little or no supervision, both for discriminative and generative tasks, with special attention to multimedia applications (vision, language and speech).
Giro-i-Nieto, X. One Perceptron to Rule Them All: Language, Vision, Audio and Speech. In Proceedings of the 2020 International Conference on Multimedia Retrieval (pp. 7-8).
Tutorial page:
https://imatge.upc.edu/web/publications/one-perceptron-rule-them-all-language-vision-audio-and-speech-tutorial
Deep neural networks have boosted the convergence of multimedia data analytics in a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading or video sonorization are some of the first applications of a new and exciting field of research exploiting the generalization properties of deep neural representations. This tutorial will first review the basic neural architectures to encode and decode vision, text and audio, and then review the models that have successfully translated information across modalities.
This document summarizes image segmentation techniques using deep learning. It begins with an overview of semantic segmentation and instance segmentation. It then discusses several techniques for semantic segmentation, including deconvolution/transposed convolution for learnable upsampling, skip connections to combine predictions from different CNN depths, and dilated convolutions to increase the receptive field without losing resolution. For instance segmentation, it covers proposal-based methods like Mask R-CNN, and single-shot and recurrent approaches as alternatives to proposal-based models.
https://imatge-upc.github.io/rvos-mots/
Video object segmentation can be understood as a sequence-to-sequence task that can benefit from curriculum learning strategies for better and faster training of deep neural networks. This work explores different schedule sampling and frame skipping variations to significantly improve the performance of a recurrent architecture. Our results on the car class of the KITTI-MOTS challenge indicate that, surprisingly, an inverse schedule sampling is a better option than a classic forward one. They also show that a progressive skipping of frames during training is beneficial, but only when training with the ground truth masks instead of the predicted ones.
Deep neural networks have achieved outstanding results in various applications such as vision, language, audio, speech, or reinforcement learning. These powerful function approximators typically require large amounts of data to be trained, which poses a challenge in the usual case where little labeled data is available. During the last year, multiple solutions have been proposed to alleviate this problem, based on the concept of self-supervised learning, which can be understood as a specific case of unsupervised learning. This talk will cover its basic principles and provide examples in the field of multimedia.
More from Universitat Politècnica de Catalunya (20)
NYCMeetup07-25-2024-Unstructured Data Processing From Cloud to Edge, by Timothy Spann
NYCMeetup07-25-2024-Unstructured Data Processing From Cloud to Edge
https://www.meetup.com/unstructured-data-meetup-new-york/
https://www.meetup.com/unstructured-data-meetup-new-york/events/301720478/
Details
This is an in-person event! Registration is required to get in.
Topic: Connecting your unstructured data with Generative LLMs
What we’ll do:
Have some food and refreshments. Hear three exciting talks about unstructured data and generative AI.
5:30 - 6:00 - Welcome/Networking/Registration
6:05 - 6:30 - Tim Spann, Principal DevRel, Zilliz
6:35 - 7:00 - Chris Joynt, Senior PMM, Cloudera
7:05 - 7:30 - Lisa N Cao, Product Manager, Datastrato
7:30 - 8:30 - Networking
Tech talk 1: Unstructured Data Processing From Cloud to Edge
Speaker: Tim Spann, Principal Dev Advocate, Zilliz
In this talk I will present why you should add a cloud-native vector database to your data and AI platform. I will also cover a quick introduction to Milvus, vector databases and unstructured data processing. By adding Milvus to your architecture you can scale out and improve your AI use cases through RAG, real-time search, multimodal search, recommendation engines, fraud detection and many more emerging use cases.
As I will show, Edge devices even as small and inexpensive as a Raspberry Pi 5 can work in machine learning, deep learning and AI use cases and be enhanced with a vector database.
Tech talk 2: RAG Pipelines with Apache NiFi
Speaker: Chris Joynt, Senior PMM, Cloudera
Executing on a RAG architecture is not a set-it-and-forget-it endeavor. Unstructured or multimodal data must be cleansed, parsed, processed, chunked and vectorized before being loaded into knowledge stores and vector DBs. That needs to happen efficiently to keep our GenAI always up to date with fresh contextual data. But not only that: changes will have to be made on an ongoing basis. For example, new data sources must be added, and experimentation will be necessary to find the ideal chunking strategy. Apache NiFi is the perfect tool for building RAG pipelines to stream proprietary and external data into your RAG architectures. Come learn how to use this scalable and incredibly versatile tool to quickly build pipelines to activate your GenAI use case.
Tech Talk 3: Metadata Lakes for Next-Gen AI/ML
Speaker: Lisa N Cao, Datastrato
Abstract: As data catalogs evolve to meet the growing and new demands of high-velocity, unstructured data, we see them taking a new shape as an emergent and flexible way to activate metadata for multiple uses. This talk discusses modern uses of metadata at the infrastructure level for AI-enablement in RAG pipelines in response to the new demands of the ecosystem. We will also discuss Apache (incubating) Gravitino and its open source-first approach to data cataloging across multi-cloud and geo-distributed architectures.
Who Should attend:
Anyone interested in talking and learning about Unstructured Data and Generative AI Apps.
When:
July 25, 2024
5:30PM
Graph Machine Learning - Past, Present, and Future, by kashipong
Graph machine learning, despite its many commonalities with graph signal processing, has developed as a relatively independent field.
This presentation will trace the historical progression from graph data mining in the 1990s, through graph kernel methods in the 2000s, to graph neural networks in the 2010s, highlighting the key ideas and advancements of each era. Additionally, recent significant developments, such as the integration with causal inference, will be discussed.
Introduction to Data Science
1.1 What is Data Science, importance of data science
1.2 Big Data and Data Science, the current scenario
1.3 Industry perspective. Types of data: structured vs. unstructured data
1.4 Quantitative vs. categorical data
1.5 Big Data vs. little data, the data science process
1.6 Role of the data scientist
Databricks vs Snowflake off-page PDF submission.pptx, by dewsharon760
Discover the key differences between Databricks and Snowflake. Learn about their features, use cases, and how to choose the right data platform for your business needs.
Why You Need Real-Time Data to Compete in E-CommercePromptCloud
In the fast-paced world of e-commerce, real-time data is crucial for staying competitive. By accessing up-to-date information on market trends, competitor pricing, and customer preferences, businesses can make informed decisions quickly. Real-time data enables dynamic pricing strategies, effective inventory management, and personalized marketing efforts, all of which are essential for meeting customer demands and outperforming competitors. Embrace real-time data to stay agile, optimize your operations, and drive growth in the ever-evolving e-commerce landscape. Get in touch for custom web scraping services: https://bit.ly/3WkqYVm
Optimization for Neural Network Training - Veronica Vilaplana - UPC Barcelona 2018
1. [course site]
Verónica Vilaplana
veronica.vilaplana@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Technical University of Catalonia
Optimization for neural network training
Day 3 Lecture 2
#DLUPC
2. Previously in DLAI…
• Multilayer perceptron
• Training: (stochastic / mini-batch) gradient descent
• Backpropagation
• Loss function
but…
• What type of optimization problem?
• Do local minima and saddle points cause problems?
• Does gradient descent perform well?
• How to set the learning rate?
• How to initialize weights?
• How does batch size affect training?
3. Index
• Optimization for a machine learning task; the difference between learning and pure optimization
  • Expected and empirical risk
  • Surrogate loss functions and early stopping
  • Batch and mini-batch algorithms
• Challenges for deep models
  • Local minima
  • Saddle points and other flat regions
  • Cliffs and exploding gradients
• Practical algorithms
  • Stochastic Gradient Descent
  • Momentum
  • Nesterov Momentum
  • Learning rate
  • Adaptive learning rates: AdaGrad, RMSProp, Adam
• Parameter initialization
• Batch Normalization
5. Optimization for NN training
• Goal: find the parameters that minimize the expected risk (generalization error)
  $J(\theta) = \mathbb{E}_{(x,y)\sim p_{data}}\, L(f_\theta(x), y)$
  • $x$ input, $f_\theta(x)$ predicted output, $y$ target output, $\mathbb{E}$ expectation
  • $p_{data}$ the true (unknown) data distribution, $L$ the loss function (how wrong predictions are)
• But we only have a training set of samples: we minimize the empirical risk, the average loss on a finite dataset D
  $J(\theta) = \mathbb{E}_{(x,y)\sim \hat{p}_{data}}\, L(f_\theta(x), y) = \frac{1}{|D|} \sum_{(x^{(i)},y^{(i)})\in D} L(f_\theta(x^{(i)}), y^{(i)})$
  where $\hat{p}_{data}$ is the empirical distribution and $|D|$ is the number of examples in D
6. Surrogate loss
• Often minimizing the real loss is intractable (it can't be used with gradient descent)
  • e.g. the 0-1 loss $L(f(x), y) = I(f(x) \ne y)$ (0 if correctly classified, 1 if not) is intractable even for linear classifiers (Marcotte 1992)
• Minimize a surrogate loss instead; e.g. for the 0-1 loss:
  • hinge: $L(f(x), y) = \max(0, 1 - y f(x))$
  • square: $L(f(x), y) = (1 - y f(x))^2$
  • logistic: $L(f(x), y) = \log(1 + e^{-y f(x)})$
(Figure: the 0-1 loss in blue and surrogate losses; green: square, purple: hinge, yellow: logistic)
7. Surrogate loss functions
Binary classifier
• Probabilistic classifier: outputs the probability of class 1, $f(x) \approx P(y=1 \mid x)$; the probability for class 0 is $1 - f(x)$
  • Binary cross-entropy loss: $L(f(x), y) = -(y \log f(x) + (1-y)\log(1 - f(x)))$
  • Decision function: $F(x) = I_{f(x) > 0.5}$
• Non-probabilistic classifier: outputs a "score" $f(x)$ for class 1; the score for the other class is $-f(x)$
  • Hinge loss: $L(f(x), t) = \max(0, 1 - t f(x))$ where $t = 2y - 1$
  • Decision function: $F(x) = I_{f(x) > 0}$
Multiclass classifier
• Probabilistic classifier: outputs a vector of probabilities $f(x) \approx (P(y=0 \mid x), \ldots, P(y=m-1 \mid x))$
  • Negative conditional log-likelihood loss: $L(f(x), y) = -\log f(x)_y$
  • Decision function: $F(x) = \arg\max(f(x))$
• Non-probabilistic classifier: outputs a vector $f(x)$ of real-valued scores for the m classes
  • Multiclass margin loss: $L(f(x), y) = \max(0, 1 + \max_{k \ne y} f(x)_k - f(x)_y)$
  • Decision function: $F(x) = \arg\max(f(x))$
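As a quick illustration, the two binary losses above translate directly into NumPy (a minimal sketch; the function names are ours):

import numpy as np

def binary_cross_entropy(p, y):
    # p = f(x), the predicted probability of class 1; y in {0, 1}
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def hinge_loss(score, y):
    # score = f(x); t = 2y - 1 maps labels {0, 1} to {-1, +1}
    t = 2 * y - 1
    return np.maximum(0.0, 1.0 - t * score)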
8. Early stopping
• Training algorithms usually do not halt at a local minimum
• Convergence criterion based on early stopping:
  • based on the surrogate loss or the true underlying loss (e.g. the 0-1 loss) measured on a validation set
  • the number of training steps is a hyperparameter controlling the effective capacity of the model
  • simple and effective; one must keep a copy of the best parameters
  • acts as a regularizer (Bishop 1995, …)
(Figure: the training error decreases steadily while the validation error begins to increase; return the parameters at the point with the lowest validation error)
9. Batch and mini-batch algorithms
• Gradient descent at each iteration computes gradients over the entire dataset for one update:
  $\nabla_\theta J(\theta) = \frac{1}{m} \sum_i \nabla_\theta L(f_\theta(x^{(i)}), y^{(i)})$
  • ↑ Gradients are stable
  • ↓ Using the complete training set can be very expensive: the gain of using more samples is less than linear; the standard error of the mean estimated from m samples is $SE = \sigma / \sqrt{m}$ ($\sigma$ is the true std)
  • ↓ The training set may be redundant
• Minibatch gradient descent: use a subset of the training set. Loop:
  1. sample a subset of data
  2. forward prop through the network
  3. backprop to calculate gradients
  4. update parameters using gradients
10. Batch and mini-batch algorithms
• How many samples in each update step?
  • Deterministic or batch gradient methods: process all training samples in one large batch
  • Mini-batch stochastic methods: use several (not all) samples
  • Stochastic methods: use a single example at a time
    • online methods: samples are drawn from a stream of continually created samples
(Figure: batch vs minibatch gradient descent)
11. Batch and mini-batch algorithms
Mini-batch size?
• Larger batches: more accurate estimate of the gradient, but with a less-than-linear return
• Very small batches: multicore architectures are under-utilized
• Smaller batches provide noisier gradient estimates
  • small batches may offer a regularizing effect (they add noise)
  • but they may require a small learning rate and may increase the number of steps needed for convergence
• If the training set is small, use batch gradient descent; if it is large, use mini-batches
• Mini-batches should be selected randomly (shuffle the samples) for an unbiased estimate of the gradients
• Typical mini-batch sizes: 32, 64, 128, 256 ($2^p$; make sure the mini-batch fits in CPU/GPU memory)
13. Convex / Non-convex optimization
A function $f : X \to \mathbb{R}$ defined on an n-dimensional interval is convex if for any $x, x' \in X$ and $\lambda \in [0,1]$
  $f(\lambda x + (1-\lambda)x') \le \lambda f(x) + (1-\lambda) f(x')$
14. Convex / Non-convex optimization
• Convex optimization
  • any local minimum is a global minimum
  • there are several (polynomial-time) optimization algorithms
• Non-convex optimization
  • the objective function in deep networks is non-convex
  • deep models may have several local minima
  • but this is not necessarily a major problem!
15. Local minima and saddle points
• Critical points of $f : \mathbb{R}^n \to \mathbb{R}$: points where $\nabla_x f(x) = 0$
• For high-dimensional loss functions, local minima are rare compared to saddle points
• The Hessian matrix $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$ is real and symmetric: it admits an eigenvector/eigenvalue decomposition
• Intuition from the eigenvalues of the Hessian matrix:
  • local minimum/maximum: all positive / all negative eigenvalues, exponentially unlikely as n grows
  • saddle points: both positive and negative eigenvalues
Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014
16. Local minima and saddle points
• It is believed that for many problems, including learning deep nets, almost all local minima have a function value very similar to that of the global optimum: finding a local minimum is good enough
(Figure: value of the local minima found by running SGD for 200 iterations on a simplified version of MNIST from different initial starting points; as the number of parameters increases, local minima tend to cluster more tightly)
• For many random functions, local minima are more likely to have low cost than high cost
Choromanska et al. The loss surfaces of multilayer networks. AISTATS 2015
17. Saddle points
How to escape from saddle points?
• First-order methods
  • are initially attracted to saddle points but, unless they hit one exactly, are repelled when close
  • hitting a critical point exactly is unlikely (the estimated gradient is noisy)
  • saddle points are very unstable: the noise of stochastic gradient descent helps convergence, and the trajectory escapes quickly
  • SGD tends to oscillate between slowly approaching a saddle point and quickly escaping from it
• Second-order methods
  • Newton's method can jump to saddle points (where the gradient is 0)
Slide credit: K. McGuinness
18. Other difficulties
• Cliffs and exploding gradients
  • nets with many layers / recurrent nets can contain very steep regions (cliffs): gradient descent can move the parameters too far, jumping off the cliff (solution: gradient clipping)
• Long-term dependencies
  • the computational graph becomes very deep (deep nets / recurrent nets): vanishing and exploding gradients
(Figure: cost function of a highly non-linear deep net or recurrent net, Pascanu 2013)
20. Mini-batch Gradient Descent
• The most used algorithm for deep learning
Algorithm
• Require: initial parameter $\theta$, learning rate $\alpha$
• while stopping criterion not met do
  • sample a minibatch of m examples $\{x^{(i)}\}_{i=1...m}$ from the training set, with corresponding targets $\{y^{(i)}\}_{i=1...m}$
  • compute the gradient estimate $g \leftarrow \frac{1}{m} \sum_i \nabla_\theta L(f_\theta(x^{(i)}), y^{(i)})$
  • apply the update $\theta \leftarrow \theta - \alpha g$
• end while
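In code, the loop is only a few lines. A minimal NumPy sketch, assuming a hypothetical helper grad_loss(theta, X, y) that returns the average gradient of the loss over a batch:

import numpy as np

def minibatch_sgd(theta, X, y, grad_loss, alpha=0.01, batch_size=64, epochs=10):
    n = X.shape[0]
    for _ in range(epochs):
        idx = np.random.permutation(n)                # shuffle for unbiased gradient estimates
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]     # sample a minibatch of m examples
            g = grad_loss(theta, X[batch], y[batch])  # gradient estimate on the minibatch
            theta = theta - alpha * g                 # apply the update
    return theta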
21. Problems with GD
• GD can be very slow
• It can get stuck in local minima or saddle points
• If the loss changes quickly in one direction and slowly in another, GD makes slow progress along the shallow dimension and jitters along the steep direction
(Figure: a loss function with a high condition number (5): the ratio of the largest to smallest singular value of its Hessian matrix is large)
22. Momentum
• Momentum is designed to accelerate learning, especially for high curvature, small but consistent gradients, or noisy gradients
• New variable: the velocity v (direction and speed at which the parameters move), an exponentially decaying average of the negative gradient
Algorithm
• Require: initial parameter $\theta$, learning rate $\alpha$, momentum parameter $\lambda$, initial velocity v
• Update rule (g is the gradient estimate):
  • compute the velocity update $v \leftarrow \lambda v - \alpha g$
  • apply the update $\theta \leftarrow \theta + v$
• Typical values: $v_0 = 0$ and $\lambda$ = 0.5, 0.9, 0.99 ($\lambda \in [0,1)$)
• Read the physical analogy in the Deep Learning book (Goodfellow et al.): velocity = momentum of a unit-mass particle
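One step of the momentum update, as a small NumPy-style sketch (theta, v and g are arrays; the names follow the slide):

def momentum_step(theta, v, g, alpha=0.01, lam=0.9):
    v = lam * v - alpha * g   # velocity: exponentially decaying average of negative gradients
    theta = theta + v         # move the parameters along the velocity
    return theta, v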
23. Nesterov accelerated gradient (NAG)
• A variant of momentum, where the gradient is evaluated after the current velocity is applied:
  • approximate where the parameters will be on the next time step using the current velocity
  • update the velocity using the gradient at the point where we predict the parameters will be
Algorithm
• Require: initial parameter $\theta$, learning rate $\alpha$, momentum parameter $\lambda$, initial velocity v
• Update:
  • apply the interim update $\tilde\theta \leftarrow \theta + \lambda v$
  • compute the gradient at the interim point $g \leftarrow \frac{1}{m} \sum_i \nabla_{\tilde\theta} L(f_{\tilde\theta}(x^{(i)}), y^{(i)})$
  • compute the velocity update $v \leftarrow \lambda v - \alpha g$
  • apply the update $\theta \leftarrow \theta + v$
• Interpretation: adds a correction factor to momentum
24. Nesterov accelerated gradient (NAG)
(Figure: standard momentum combines the velocity $v_t$ with the gradient $\nabla L(w_t)$ at the current location $w_t$ to obtain $v_{t+1}$; NAG instead evaluates the gradient $\nabla L(w_t + \gamma v_t)$ at the location predicted by velocity alone. Slide credit: K. McGuinness)
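The look-ahead structure is easiest to see in code. A hedged sketch, again assuming a hypothetical grad_loss(theta) that returns the gradient at a given parameter vector:

def nag_step(theta, v, grad_loss, alpha=0.01, lam=0.9):
    theta_interim = theta + lam * v   # interim update: where velocity alone would take us
    g = grad_loss(theta_interim)      # gradient evaluated at the predicted location
    v = lam * v - alpha * g           # velocity update
    return theta + v, v               # apply the update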
25. GD: learning rate
• The learning rate is a crucial parameter for GD
  • Too large: overshoots the local minimum, the loss increases
  • Too small: makes very slow progress, can get stuck
  • Good learning rate: makes steady progress toward the local minimum
(Figure: trajectories for too small and too large learning rates)
26. GD: learning rate decay
• In practice it is necessary to gradually decrease the learning rate to speed up training:
  • step decay (e.g. reduce by half every few epochs)
  • exponential decay: $\alpha = \alpha_0 e^{-kt}$
  • 1/t decay: $\alpha = \frac{\alpha_0}{1 + kt}$
  • manual decay
  (k decay rate, t iteration number, $\alpha_0$ initial learning rate)
• Sufficient conditions for convergence: $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$
• Usually: adapt the learning rate by monitoring learning curves that plot the objective function as a function of time (more of an art than a science!)
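The decay rules translate directly into code; a small sketch (the step period of 1000 iterations is an arbitrary example, not from the slide):

import math

def learning_rate(t, alpha0=0.1, k=0.01, schedule="exponential"):
    if schedule == "exponential":   # alpha = alpha0 * e^(-k t)
        return alpha0 * math.exp(-k * t)
    if schedule == "inverse":       # alpha = alpha0 / (1 + k t)
        return alpha0 / (1 + k * t)
    if schedule == "step":          # halve every 1000 iterations
        return alpha0 * 0.5 ** (t // 1000)
    return alpha0                   # constant fallback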
27. Adaptive learning rates
• The cost is often sensitive to some directions in parameter space and insensitive to others
  • Momentum/Nesterov mitigate this issue but introduce another hyperparameter
• Solution: use a separate learning rate for each parameter and automatically adapt it through the course of learning
• Algorithms (mini-batch based): AdaGrad, RMSProp, Adam
28. AdaGrad
• Adapts the learning rate of each parameter based on the sizes of its previous updates:
  • scales updates to be larger for parameters that are updated less
  • scales updates to be smaller for parameters that are updated more
• The net effect is greater progress in the more gently sloped directions of parameter space
• Require: initial parameter $\theta$, learning rate $\alpha$, small constant $\delta$ (e.g. $10^{-7}$) for numerical stability
• Update:
  • accumulate the squared gradient: $r \leftarrow r + g \odot g$ (sum of all previous squared gradients; $\odot$ is elementwise multiplication)
  • compute the update: $\Delta\theta \leftarrow -\frac{\alpha}{\delta + \sqrt{r}} \odot g$ (updates inversely proportional to the square root of the sum)
  • apply the update: $\theta \leftarrow \theta + \Delta\theta$
Duchi et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 2011
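One AdaGrad step in NumPy form (a sketch following the slide's notation; r starts as an array of zeros shaped like theta):

import numpy as np

def adagrad_step(theta, r, g, alpha=0.01, delta=1e-7):
    r = r + g * g                                     # accumulate squared gradients
    theta = theta - alpha / (delta + np.sqrt(r)) * g  # per-parameter scaled update
    return theta, r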
29. Root Mean Square Propagation (RMSProp)
• AdaGrad can result in a premature and excessive decrease of the effective learning rate
• RMSProp modifies AdaGrad to perform better on non-convex surfaces
• It changes the gradient accumulation into an exponentially decaying average of the sum of squares of the gradients
• Requires: initial parameter $\theta$, learning rate $\alpha$, decay rate $\rho$, small constant $\delta$ (e.g. $10^{-7}$)
• Update:
  • accumulate the squared gradient: $r \leftarrow \rho r + (1-\rho)\, g \odot g$
  • compute the update: $\Delta\theta \leftarrow -\frac{\alpha}{\sqrt{\delta + r}} \odot g$
  • apply the update: $\theta \leftarrow \theta + \Delta\theta$
Geoff Hinton, unpublished
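The only change from AdaGrad is the decaying accumulator; a matching NumPy sketch:

import numpy as np

def rmsprop_step(theta, r, g, alpha=0.001, rho=0.9, delta=1e-7):
    r = rho * r + (1 - rho) * g * g                 # decaying average of squared gradients
    theta = theta - alpha / np.sqrt(delta + r) * g  # per-parameter scaled update
    return theta, r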
30. ADAptive Moments (Adam)
• A combination of RMSProp and momentum, but:
  • it keeps a decaying average of both the first-order moment of the gradient (momentum) and the second-order moment (RMSProp)
  • it includes bias corrections (for the first and second moments) to account for their initialization at the origin
• Update:
  • update the biased first moment estimate: $s \leftarrow \rho_1 s + (1-\rho_1) g$
  • update the biased second moment: $r \leftarrow \rho_2 r + (1-\rho_2)\, g \odot g$
  • correct the biases: $\hat{s} \leftarrow \frac{s}{1-\rho_1^t}$, $\hat{r} \leftarrow \frac{r}{1-\rho_2^t}$
  • compute the update (operations applied elementwise): $\Delta\theta \leftarrow -\alpha \frac{\hat{s}}{\delta + \sqrt{\hat{r}}}$
  • apply the update: $\theta \leftarrow \theta + \Delta\theta$
• Typical values: $\delta = 10^{-8}$, $\rho_1 = 0.9$, $\rho_2 = 0.999$
Kingma et al. Adam: a Method for Stochastic Optimization. ICLR 2015
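Putting both moments together, one Adam step as a NumPy sketch (s and r start at zeros; t is the step count, starting at 1, used for the bias corrections):

import numpy as np

def adam_step(theta, s, r, g, t, alpha=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    s = rho1 * s + (1 - rho1) * g      # biased first-moment estimate (momentum)
    r = rho2 * r + (1 - rho2) * g * g  # biased second-moment estimate (RMSProp)
    s_hat = s / (1 - rho1 ** t)        # bias corrections for initialization at the origin
    r_hat = r / (1 - rho2 ** t)
    theta = theta - alpha * s_hat / (delta + np.sqrt(r_hat))
    return theta, s, r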
34. Parameter initialization
• Weights
  • Can't initialize the weights to 0 (the gradients would be 0)
  • Can't initialize all weights to the same value (all hidden units in a layer would always behave the same; need to break symmetry)
  • Small random numbers, e.g. from a uniform or Gaussian distribution
    • if the weights start too small, the signal shrinks as it passes through each layer until it is too tiny to be useful
  • Xavier initialization (calibrating variances, for tanh activations): sqrt(1/n)
    • each neuron: w = randn(n) / sqrt(n), n inputs
  • He initialization (for ReLU activations): sqrt(2/n)
    • each neuron: w = randn(n) * sqrt(2.0/n), n inputs
• Biases
  • initialize all to 0 (except for output units with skewed distributions; sometimes 0.01 to avoid saturating ReLUs)
• Alternative: initialize using machine learning; parameters learned by an unsupervised model trained on the same inputs / trained on an unrelated task
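The two weight rules in NumPy form (the layer sizes 512 and 256 are arbitrary examples):

import numpy as np

n_in, n_out = 512, 256
w_xavier = np.random.randn(n_in, n_out) / np.sqrt(n_in)        # Xavier, for tanh activations
w_he     = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)  # He, for ReLU activations
b        = np.zeros(n_out)                                     # biases start at 0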
35. Normalizing inputs
• Normalize the inputs to speed up learning
  • For input layers: data preprocessing (mean = 0, std = 1)
  • For hidden layers: batch normalization
(Figure: original data, after mean subtraction (mean = 0), and after standardization (mean = 0, std = 1); the loss surface for unnormalized vs normalized data)
36. Batch normalization
• As learning progresses, the distribution of the layer inputs changes due to parameter updates (internal covariate shift)
• This can result in most inputs being in the non-linear regime of the activation function, slowing down learning
• Batch normalization is a technique to reduce this effect
  • Explicitly force the layer activations to have zero mean and unit variance w.r.t. running batch estimates
  • Adds a learnable scale and bias term to allow the network to still use the nonlinearity
(Layout: FC/Conv → Batch norm → ReLU → FC/Conv → Batch norm → ReLU)
Ioffe and Szegedy, 2015. "Batch normalization: accelerating deep network training by reducing internal covariate shift"
37. Batch normalization
• Can be applied to any input or hidden layer
• For a mini-batch $B = \{x_i\}_{i=1...m}$ of m activations of the layer:
  1. Compute the empirical mean and variance for each dimension: $\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$, $\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2$
  2. Normalize: $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$
  3. Scale and shift (two learnable parameters): $y_i = \gamma \hat{x}_i + \beta$
Note: normalization can reduce the expressive power of the network (e.g. normalizing the inputs of a sigmoid would constrain them to its linear regime). To recover the identity mapping, the network can learn $\gamma = \sqrt{\sigma_B^2 + \varepsilon}$ and $\beta = \mu_B$; then $y_i = x_i$.
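The three steps map onto a short NumPy forward pass (training mode only; a real layer would also maintain running mean/std estimates for test time, as discussed on the next slide):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (m, D) mini-batch of activations; gamma, beta: learnable (D,) parameters
    mu = x.mean(axis=0)                    # empirical mean per dimension
    var = x.var(axis=0)                    # empirical variance per dimension
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift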
38. Batch normalization
Each mini-batch is scaled by the mean/variance computed on just that mini-batch. This adds some noise to the hidden layer's activations within that minibatch, having a slight regularization effect:
• Improves gradient flow through the network
• Allows higher learning rates
• Reduces the strong dependency on initialization
• Reduces the need for regularization
At test time BN layers function differently:
• The mean and std are not computed on the batch.
• Instead, a single fixed empirical mean and std of the activations computed during training is used (can be estimated with exponentially decaying weighted averages).
39. Summary
• Optimization for NNs is different from pure optimization:
  • GD with mini-batches
  • early stopping
  • non-convex surface, saddle points
• The learning rate has a significant impact on model performance
• Several extensions to GD can improve convergence
• Adaptive learning-rate methods are likely to achieve the best results
  • RMSProp, Adam
• Weight initialization: He, w = randn(n) * sqrt(2.0/n)
• Batch normalization to reduce the internal covariate shift
40. Bibliography
• Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
• Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The loss surfaces of multilayer networks. In AISTATS.
• Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.
• Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
• Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2015). Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations.
• Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures.
• Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295–307.
• Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
• Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations.