Object detection and Instance Segmentation - Hichem Felouat
The document discusses object detection and instance segmentation models like YOLOv5, Faster R-CNN, EfficientDet, Mask R-CNN, and TensorFlow's object detection API. It provides information on labeling images with bounding boxes for training these models, including open-source and commercial annotation tools. The document also covers evaluating object detection models using metrics like mean average precision (mAP) and intersection over union (IoU). It includes an example of training YOLOv5 on a custom dataset.
Deep Learning: Recurrent Neural Network (Chapter 10) Larry Guo
This material is an in-depth study report on the Recurrent Neural Network (RNN).
Material mainly from the Deep Learning book ("the bible"), http://www.deeplearningbook.org/
Topics: Briefing, Theory Proof, Variation, Gated RNN Intuition, Real World Application
Application (CNN+RNN on SVHN)
Also a video (In Chinese)
https://www.youtube.com/watch?v=p6xzPqRd46w
Deep learning and neural networks (using simple mathematics) - Amine Bendahmane
The document provides an overview of machine learning and deep learning concepts through a series of diagrams and explanations. It begins by introducing concepts like regression, classification, and clustering. It then discusses supervised vs unsupervised learning before explaining neural networks and components like the perceptron, multi-layer perceptrons, and convolutional neural networks. It notes how neural networks learn representations and separate data through hidden layers.
Convolutional neural network from VGG to DenseNet - SungminYou
This document summarizes recent developments in convolutional neural networks (CNNs) for image recognition, including residual networks (ResNets) and densely connected convolutional networks (DenseNets). It reviews CNN structure and components like convolution, pooling, and ReLU. ResNets address degradation problems in deep networks by introducing identity-based skip connections. DenseNets connect each layer to every other layer to encourage feature reuse, addressing vanishing gradients. The document outlines the structures of ResNets and DenseNets and their advantages over traditional CNNs.
In machine learning, a convolutional neural network is a class of deep, feed-forward artificial neural networks that have successfully been applied for analyzing visual imagery.
Part 2 of the Deep Learning Fundamentals Series, this session discusses Tuning Training (including hyperparameters, overfitting/underfitting), Training Algorithms (including different learning rates, backpropagation), Optimization (including stochastic gradient descent, momentum, Nesterov Accelerated Gradient, RMSprop, Adaptive algorithms - Adam, Adadelta, etc.), and a primer on Convolutional Neural Networks. The demos included in these slides are running on Keras with TensorFlow backend on Databricks.
Summary:
There are three parts in this presentation.
A. Why do we need Convolutional Neural Network
- Problems we face today
- Solutions for problems
B. LeNet Overview
- The origin of LeNet
- The result after using LeNet model
C. LeNet Techniques
- LeNet structure
- Function of every layer
The following GitHub link points to a repository where I rebuilt LeNet without any deep learning package. I hope it helps you better understand the basics of Convolutional Neural Networks.
Github Link : https://github.com/HiCraigChen/LeNet
LinkedIn : https://www.linkedin.com/in/YungKueiChen
Generative Adversarial Networks (GANs) are clearly the next big thing in deep learning. Yann LeCun called them "the most interesting idea in the last 10 years in ML" and even "the coolest thing since sliced bread". What problem do GANs solve? In machine learning, regression and classification are by now familiar tasks, but getting a machine to go further and create structured, complex objects (e.g. images, sentences) remains a major challenge. With GANs, machines can already draw convincingly realistic human faces, draw a picture matching a piece of descriptive text, and even generate anime character portraits (the anime portraits on the left were generated by the machine itself). This course introduces generative adversarial networks, one of the most cutting-edge techniques in deep learning.
http://imatge-upc.github.io/telecombcn-2016-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
GoogLeNet introduced several key insights for designing efficient deep learning networks:
1. Exploit local correlations in images by concatenating 1x1, 3x3, and 5x5 convolutions along with pooling.
2. Decrease dimensions before expensive convolutions using 1x1 convolutions for dimension reduction.
3. Stack inception modules upon each other, occasionally inserting max pooling layers, to allow tweaking each module.
4. Counter vanishing gradients with intermediate losses added to the total loss for training deep networks.
5. End with a global average pooling layer instead of fully connected layers to avoid overfitting.
The document discusses convolutional neural networks (CNNs). It begins with an introduction and overview of CNN components like convolution, ReLU, and pooling layers. Convolution layers apply filters to input images to extract features, ReLU introduces non-linearity, and pooling layers reduce dimensionality. CNNs are well-suited for image data since they can incorporate spatial relationships. The document provides an example of building a CNN using TensorFlow to classify handwritten digits from the MNIST dataset.
1. The document discusses various machine learning classification algorithms including neural networks, support vector machines, logistic regression, and radial basis function networks.
2. It provides examples of using straight lines and complex boundaries to classify data with neural networks. Maximum margin hyperplanes are used for support vector machine classification.
3. Logistic regression is described as useful for binary classification problems by using a sigmoid function and cross entropy loss. Radial basis function networks can perform nonlinear classification with a kernel trick.
An introduction to Keras, a high-level neural networks library written in Python. Keras makes deep learning more accessible, is fantastic for rapid prototyping, and can run on top of TensorFlow, Theano, or CNTK. These slides focus on examples, starting with logistic regression and building towards a convolutional neural network.
The presentation was given at the Austin Deep Learning meetup: https://www.meetup.com/Austin-Deep-Learning/events/237661902/
A comprehensive tutorial on Convolutional Neural Networks (CNN) which talks about the motivation behind CNNs and Deep Learning in general, followed by a description of the various components involved in a typical CNN layer. It explains the theory involved with the different variants used in practice and also, gives a big picture of the whole network by putting everything together.
Next, there's a discussion of the various state-of-the-art frameworks being used to implement CNNs to tackle real-world classification and regression problems.
Finally, the implementation of CNNs is demonstrated by implementing the paper 'Age and Gender Classification Using Convolutional Neural Networks' by Levi and Hassner (2015).
The document discusses artificial neural networks. It describes their basic structure and components, including dendrites that receive input signals, a soma that processes the inputs, and an axon that transmits output signals. It also explains how neurons are connected at synapses to transfer signals between neurons. Finally, it mentions different types of activation functions that can be used in neural networks.
Recurrent Neural Networks. Part 1: Theory - Andrii Gakhov
The document provides an overview of recurrent neural networks (RNNs) and their advantages over feedforward neural networks. It describes the basic structure and training of RNNs using backpropagation through time. RNNs can process sequential data of variable lengths, unlike feedforward networks. However, RNNs are difficult to train due to vanishing and exploding gradients. More advanced RNN architectures like LSTMs and GRUs address this by introducing gating mechanisms that allow the network to better control the flow of information.
The document describes the structure and functioning of a feedforward neural network. It notes that the network contains an input layer with n-dimensional vectors, L-1 hidden layers with n neurons each, and an output layer with k neurons. Each neuron has a pre-activation and activation value. The pre-activation at layer i is the weighted sum of outputs from layer i-1 plus a bias. The activation is this pre-activation passed through an activation function. Backpropagation is used to minimize a loss function through gradient descent to learn the network's weights and biases parameters.
The document discusses using the SPEA2 algorithm for multi-objective optimization to find non-dominated classification rules from transaction data. It describes classification rule mining, objectives of accuracy, comprehensibility and interestingness, and the SPEA2 approach which uses selection, crossover and mutation operators over generations to find a non-dominated solution set. A case study applies SPEA2 on insurance broker transaction data to extract non-dominated rules relating customer attributes to insurance products.
In an era when data science is all the rage, an engineer curious about new technology naturally wants to expand the arsenal and learn new data analysis tools. R, a scripting language developed by statisticians specifically for data exploration and analysis, backed by a huge open-source community and tens of thousands of packages of every kind, is the first choice among today's data science tools.
However, R's design logic differs from that of ordinary programming languages, and engineers' past experience with other languages often becomes an obstacle when learning it. This course starts from the basics of R so that, through lectures and interactive hands-on sessions, students can thoroughly grasp R's core concepts, learn to ask questions of data with R, and write efficient yet highly readable R code from a data analysis perspective.
(1) This document provides a quick tour of machine learning concepts including the components, types, and step-by-step process of machine learning.
(2) It discusses machine learning applications in areas like credit approval, education, recommender systems, and reinforcement learning.
(3) The tour outlines the key components of a machine learning problem including the target function, training data, learning algorithm, hypothesis set, and learned hypothesis. It also distinguishes between supervised, unsupervised, and semi-supervised learning problems.
MixTaiwan 20170222 - Min Sun (NTHU EE): AI, The Next Big Thing - Mix Taiwan
Speaker bio:
Min Sun, Assistant Professor, Department of Electrical Engineering, National Tsing Hua University
Dr. Min Sun teaches in the Department of Electrical Engineering at National Tsing Hua University. After graduating from the Department of Electronics Engineering at National Chiao Tung University, he earned an M.S. in Electrical Engineering from Stanford and a Ph.D. in Electrical Engineering: Systems from the University of Michigan, Ann Arbor, followed by postdoctoral work in computer engineering at the University of Washington, Seattle. His research interests are computer vision, machine learning, and human-computer interaction. Building on recent deep learning breakthroughs in computer vision, he develops systems that span different subfields of AI, such as automatic video captioning (vision x natural language) and intelligent machines that interact with human behavior (vision x control).
(1) The document discusses using autoencoders for image classification. Autoencoders are neural networks trained to encode inputs so they can be reconstructed, learning useful features in the process. (2) Stacked autoencoders and convolutional autoencoders are evaluated on the MNIST handwritten digit dataset. Greedy layerwise training is used to construct deep pretrained networks. (3) Visualization of hidden unit activations shows the features learned by the autoencoders. The main difference between autoencoders and convolutional networks is that convolutional networks have more hardwired topological constraints due to the convolutional and pooling operations.
Jeff Dean at AI Frontiers: Trends and Developments in Deep Learning Research - AI Frontiers
In this talk at the AI Frontiers conference, Jeff Dean discusses recent trends and developments in deep learning research. Jeff touches on the significant progress that this research has produced in a number of areas, including computer vision, language understanding, translation, healthcare, and robotics. These advances are driven both by new algorithmic approaches to some of these problems and by the ability to scale computation for training ever larger models on larger datasets. Finally, one of the reasons for the rapid spread of the ideas and techniques of deep learning has been the availability of open source libraries such as TensorFlow. He gives an overview of why these software libraries have an important role in making the benefits of machine learning available throughout the world.
https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both an algorithmic and computational perspectives.
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori... - Simplilearn
This Deep Learning presentation will help you understand what Deep Learning is, why we need Deep Learning, and the applications of Deep Learning, along with a detailed explanation of Neural Networks and how they work. Deep learning is inspired by the structure and function of the human brain, realized through artificial neural networks. These networks, which represent the decision-making process of the brain, use complex algorithms that process data in a non-linear way, learning in an unsupervised manner to make choices based on the input. This Deep Learning tutorial is ideal for professionals with beginner to intermediate levels of experience. Now, let us dive deep into this topic and understand what Deep Learning actually is.
Below topics are explained in this Deep Learning Presentation:
1. What is Deep Learning?
2. Why do we need Deep Learning?
3. Applications of Deep Learning
4. What is Neural Network?
5. Activation Functions
6. Working of Neural Network
Simplilearn’s Deep Learning course will transform you into an expert in deep learning techniques using TensorFlow, the open-source software library designed to conduct machine learning & deep neural network research. With our deep learning course, you’ll master deep learning and TensorFlow concepts, learn to implement algorithms, build artificial neural networks and traverse layers of data abstraction to understand the power of data and prepare you for your new role as deep learning scientist.
Why Deep Learning?
TensorFlow is one of the most popular software platforms used for deep learning and contains powerful tools to help you build and implement artificial neural networks.
Advancements in deep learning are being seen in smartphone applications, creating efficiencies in the power grid, driving advancements in healthcare, improving agricultural yields, and helping us find solutions to climate change. With this Tensorflow course, you’ll build expertise in deep learning models, learn to operate TensorFlow to manage neural networks and interpret the results.
You can gain in-depth knowledge of Deep Learning by taking our Deep Learning certification training course. With Simplilearn’s Deep Learning course, you will prepare for a career as a Deep Learning engineer as you master concepts and techniques including supervised and unsupervised learning, mathematical and heuristic aspects, and hands-on modeling to develop algorithms.
There is booming demand for skilled deep learning engineers across a wide range of industries, making this deep learning course with TensorFlow training well-suited for professionals at the intermediate to advanced level of experience. We recommend this deep learning online course particularly for the following professionals:
1. Software engineers
2. Data scientists
3. Data analysts
4. Statisticians with an interest in deep learning
Deep Learning Tutorial | Deep Learning Tutorial For Beginners | What Is Deep ... - Simplilearn
The document discusses deep learning and neural networks. It begins by defining deep learning as a subfield of machine learning that is inspired by the structure and function of the brain. It then discusses how neural networks work, including how data is fed as input and passed through layers with weighted connections between neurons. The neurons perform operations like multiplying the weights and inputs, adding biases, and applying activation functions. The network is trained by comparing the predicted and actual outputs to calculate error and adjust the weights through backpropagation to reduce error. Deep learning platforms like TensorFlow, PyTorch, and Keras are also mentioned.
Neural network basics and an introduction to Deep learning - Tapas Majumdar
Deep learning tools and techniques can be used to build convolutional neural networks (CNNs). Neural networks learn from observational training data by automatically inferring rules to solve problems. Neural networks use multiple hidden layers of artificial neurons to process input data and produce output. Techniques like backpropagation, cross-entropy cost functions, softmax activations, and regularization help neural networks learn more effectively and avoid issues like overfitting.
This document discusses a deep learning course at Carnegie Mellon University for fall 2016. It covers topics like the popularization of backpropagation for training neural networks, unsupervised pre-training of deep networks, and how convolutional neural networks' win in the 2012 ImageNet competition led to increased interest in deep learning research. It also shows the architecture of a convolutional neural network and how it is split across two GPUs during training.
Hardware Acceleration for Machine Learning - CastLabKAIST
This document provides an overview of a lecture on hardware acceleration for machine learning. The lecture will cover deep neural network models like convolutional neural networks and recurrent neural networks. It will also discuss various hardware accelerators developed for machine learning, including those designed for mobile/edge and cloud computing environments. The instructor's background and the agenda topics are also outlined.
Deep learning is a subset of machine learning and artificial intelligence that uses multilayer neural networks to enable computers to learn from large amounts of data. Convolutional neural networks are commonly used for deep learning tasks involving images. Recurrent neural networks are used for sequential data like text or time series. Deep learning models can learn high-level features from data without relying on human-defined features. This allows them to achieve high performance in application areas such as computer vision, speech recognition, and natural language processing.
Separating Hype from Reality in Deep Learning with Sameer Farooqui - Databricks
Deep Learning is all the rage these days, but where does the reality of what Deep Learning can do end and the media hype begin? In this talk, I will dispel common myths about Deep Learning that are not necessarily true and help you decide whether you should practically use Deep Learning in your software stack.
I’ll begin with a technical overview of common neural network architectures like CNNs, RNNs, GANs and their common use cases like computer vision, language understanding or unsupervised machine learning. Then I’ll separate the hype from reality around questions like:
• When should you prefer traditional ML systems like scikit learn or Spark.ML instead of Deep Learning?
• Do you no longer need to do careful feature extraction and standardization if using Deep Learning?
• Do you really need terabytes of data when training neural networks or can you ‘steal’ pre-trained lower layers from public models by using transfer learning?
• How do you decide which activation function (like ReLU, leaky ReLU, ELU, etc) or optimizer (like Momentum, AdaGrad, RMSProp, Adam, etc) to use in your neural network?
• Should you randomly initialize the weights in your network or use more advanced strategies like Xavier or He initialization?
• How easy is it to overfit/overtrain a neural network, and what are the common techniques to avoid overfitting (like l1/l2 regularization, dropout and early stopping)?
Deep learning techniques like convolutional neural networks (CNNs) and deep neural networks have achieved human-level performance on certain tasks. Pioneers in the field include Geoffrey Hinton, who co-invented backpropagation, Yann LeCun who developed CNNs for image recognition, and Andrew Ng who helped apply these techniques at companies like Baidu and Coursera. Deep learning is now widely used for applications such as image recognition, speech recognition, and distinguishing objects like dogs from cats, often outperforming previous machine learning methods.
This is a single-day course that allows the learner to get experience with the basic details of deep learning. The first half is building a network using python/numpy only, and in the second half we build a more advanced network using TensorFlow/Keras.
At the end you will find a list of useful pointers to continue.
course git: https://gitlab.com/eshlomo/EazyDnn
A fast-paced introduction to Deep Learning concepts, such as activation functions, cost functions, backpropagation, and then a quick dive into CNNs. Basic knowledge of vectors, matrices, and elementary calculus (derivatives), are helpful in order to derive the maximum benefit from this session.
Next we'll see a simple neural network using Keras, followed by an introduction to TensorFlow and TensorBoard. (Bonus points if you know Zorn's Lemma, the Well-Ordering Theorem, and the Axiom of Choice.)
We present basic concepts of machine learning, such as supervised and unsupervised learning, types of tasks, how some algorithms work, neural networks, and deep learning concepts, and how to apply them in your work.
An introduction to Deep Learning (DL) concepts, such as neural networks, back propagation, activation functions, CNNs, RNNs (if time permits), and the CLT/AUT/fixed-point theorems, along with code samples in Java and TensorFlow.
Part 1 of the Deep Learning Fundamentals Series, this session discusses the use cases and scenarios surrounding Deep Learning and AI; reviews the fundamentals of artificial neural networks (ANNs) and perceptrons; discuss the basics around optimization beginning with the cost function, gradient descent, and backpropagation; and activation functions (including Sigmoid, TanH, and ReLU). The demos included in these slides are running on Keras with TensorFlow backend on Databricks.
An introduction to Deep Learning (DL) concepts, such as neural networks, back propagation, activation functions, CNNs, and GANs, along with a simple yet complete neural network.
Digit recognizer by convolutional neural network - Ding Li
A convolutional neural network is used to recognize handwritten digits from images. The CNN uses convolutional and max pooling layers to extract local features from the images. These local features are then fed into fully connected layers to combine them into global features used to predict the digit (0-9) in each image with a softmax output layer. The model is trained on 60,000 images and achieves 99.67% accuracy on the test set after 30 training epochs. While powerful, it is unclear if humans can fully understand the "mind" and logic of artificial neural networks.
I conducted a workshop on TensorFlow 2.0 at Facebook Dev Circle. It mostly covers the importance of TensorFlow for implementing deep neural networks.
You can check the related demo at:
https://github.com/rayyan17/Introduction-To-Tensor-Flow.git
This document is a presentation by Ted Chang about creating new opportunities for Taiwan's intelligent transformation. It discusses paradigm shifts in technology such as mobile phones and cloud computing. It introduces concepts like the Internet of Things, artificial intelligence, and how they can be combined. It argues that key driving forces for the future will be machine learning, big data, cloud computing and AI. The presentation envisions applications of these technologies in areas like future medicine and smart manufacturing. It ends by emphasizing the importance of wisdom and intelligence in shaping the future.
- The document discusses how artificial intelligence can enable earlier and safer medicine.
- It provides background on the author and their expertise in biomedical informatics and roles as editor-in-chief of several academic journals.
- Key applications of AI in healthcare discussed include using machine learning on large medical datasets to detect suspicious moles earlier, reduce medication errors, and more accurately predict cancer occurrence up to 12 months in advance.
- The author argues that AI has the potential to transform medicine by enabling more preventive and earlier detection approaches compared to traditional reactive healthcare models.
User-Agent Interaction (V): Meera checks with Jane-ML
PA_Meera: Mina, do you have trouble in debugging?
Mina: Yes, is there anyone who has done this?
Personal Agent [Meera]: Jane may be able to help. Let me check with her personal assistant Jane-ML.
Jane-ML: Jane has done a similar debugging problem before. She is available now and willing to help.
1) Kaggle is the largest platform for AI and data science competitions, acquired by Google in 2017. It has been used by companies like Bosch, Mercedes, and Asus for challenges like improving production lines, accelerating testing processes, and component failure prediction.
2) The document discusses the author's experiences winning silver medals in Kaggle competitions involving camera model identification, passenger screening algorithms, and pneumonia detection. For camera model identification, the author used transfer learning with InceptionResNetV2 and high-pass filters to identify camera models from images.
3) For passenger screening, the author modified a 2D CNN to 3D and used 3D data augmentation to rank in the top 7% of the $1
[Taiwan AI Academy] Bridging AI to Precision Agriculture through IoT - Taiwan Data Science Conference
The document describes a system for precision agriculture using IoT. It involves sensors collecting environmental data from fields and feeding it to a control board connected to actuators like irrigation systems. The data is also sent to an IoTtalk engine and AgriTalk server in the cloud for analysis and remote access/control through an AgriGUI interface. Equations were developed to estimate nutrient levels like nitrogen from sensor readings to help optimize crop growth.
The document discusses Open Robot Club and includes several links to its website and YouTube videos. It provides information on the club's computing resources like NVIDIA V100 GPUs. Tables with metrics like underkill and overkill percentages are included for different types of tasks like AI AOI and PCB inspection. The club's website and demos are referenced throughout.
Graph Machine Learning - Past, Present, and Future - kashipong
Graph machine learning, despite its many commonalities with graph signal processing, has developed as a relatively independent field.
This presentation will trace the historical progression from graph data mining in the 1990s, through graph kernel methods in the 2000s, to graph neural networks in the 2010s, highlighting the key ideas and advancements of each era. Additionally, recent significant developments, such as the integration with causal inference, will be discussed.
2. Deep learning attracts lots of attention.
• I believe you have seen lots of exciting results before.
This talk focuses on the basic techniques.
[Figure: deep learning trends at Google. Source: SIGMOD / Jeff Dean]
3. Outline
Lecture I: Introduction of Deep Learning
Lecture II: Tips for Training Deep Neural Network
Lecture III: Variants of Neural Network
Lecture IV: Next Wave
5. Outline of Lecture I
Introduction of Deep Learning
Why Deep?
“Hello World” for Deep Learning
Let’s start with general machine learning.
6. Machine Learning ≈ Looking for a Function
• Speech Recognition: f(audio) = “How are you”
• Image Recognition: f(image) = “Cat”
• Playing Go: f(board position) = “5-5” (next move)
• Dialogue System: f(“Hi”) = “Hello” (what the user said → system response)
7. Framework
Model: a set of functions f1, f2, ⋯
Image Recognition: f(image) = “cat”
e.g. f1 maps one image to “cat” and another to “dog”; a bad f2 maps them to “money” and “snake”.
8. Framework
Model: a set of functions f1, f2, ⋯
Image Recognition: f(image) = “cat”
Training Data: function inputs (images) and function outputs (labels “monkey”, “cat”, “dog”) → Supervised Learning
Goodness of function f: measured on the training data (the better f fits it, the better).
9. Framework
Training: Step 1: a set of functions f1, f2, ⋯ (Model); Step 2: goodness of function f on the training data (“monkey”, “cat”, “dog”); Step 3: pick the “best” function f*.
Testing: using f*: f*(image) = “cat”
10. Three Steps for Deep Learning
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
11. Three Steps for Deep Learning
Step 1: define a set of functions → Neural Network
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
13. Neural Network: a neuron is a simple function
z = a1 w1 + ⋯ + ak wk + ⋯ + aK wK + b
The inputs a1, …, aK are multiplied by the weights w1, …, wK and summed together with a bias b; an activation function σ then turns z into the neuron’s output a = σ(z).
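As a minimal sketch of this neuron in numpy (assuming the sigmoid activation introduced on the next slide):

import numpy as np

def neuron(a, w, b):
    # weighted sum of the inputs plus the bias: z = a1*w1 + ... + aK*wK + b
    z = np.dot(w, a) + b
    # sigmoid activation squashes z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Example from the next slide: inputs (1, -1), weights (1, -2), bias 1 -> z = 4
print(neuron(np.array([1, -1]), np.array([1, -2]), 1))  # ~0.98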
14. Neural Network: example neuron
With inputs (1, −1), weights (1, −2), and bias 1:
z = 1·1 + (−1)·(−2) + 1 = 4
Sigmoid activation function: σ(z) = 1 / (1 + e^(−z))
Output: σ(4) ≈ 0.98
15. Neural Network
Different connections lead to different network structures.
Weights and biases are the network parameters θ.
Each neuron can have different values of weights and biases.
20. Output Layer (Option)
• Softmax layer as the output layer
Ordinary layer: y1 = σ(z1), y2 = σ(z2), y3 = σ(z3)
In general, the output of the network can be any value; it may not be easy to interpret.
21. Output Layer (Option)
• Softmax layer as the output layer:
yi = e^(zi) / Σ_{j=1..3} e^(zj)
Example: (z1, z2, z3) = (3, 1, −3) → (e^(z1), e^(z2), e^(z3)) ≈ (20, 2.7, 0.05) → (y1, y2, y3) ≈ (0.88, 0.12, ≈0)
The outputs behave like probabilities: 1 > yi > 0 and Σi yi = 1.
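A quick numeric check of the softmax formula above, as a numpy sketch (the shift by max(z) is a standard stability precaution, not part of the slide):

import numpy as np

def softmax(z):
    # subtracting max(z) does not change the result but avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))  # ~[0.88, 0.12, 0.002]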
22. Example Application
Input: a 16 × 16 = 256-pixel image, x1, x2, …, x256 (ink → 1, no ink → 0)
Output: y1, y2, …, y10; each dimension represents the confidence of a digit (y1: is 1, y2: is 2, …, y10: is 0)
e.g. output (0.1, 0.7, 0.2, …) → the image is “2”
23. Example Application
• Handwriting Digit Recognition: Machine(image) = “2”
What is needed is a function with a 256-dim vector as input and a 10-dim vector as output (y1: is 1, y2: is 2, …, y10: is 0) → a Neural Network.
24. Example Application
Input Layer (x1, x2, …, xN) → Layer 1 → Layer 2 → ⋯ → Layer L (Hidden Layers) → Output Layer (y1: is 1, y2: is 2, …, y10: is 0) → “2”
A network structure defines a function set containing the candidates for Handwriting Digit Recognition. You need to decide the network structure so that a good function is in your function set.
25. FAQ
• Q: How many layers? How many neurons for each layer? → Trial and Error + Intuition
• Q: Can the structure be automatically determined?
26. Three Steps for Deep Learning
Step 1: define a set of functions → Neural Network
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
27. Training Data
• Preparing training data: images and their labels, e.g. “5”, “0”, “4”, “1”, “9”, “2”, “1”, “3”
The learning target is defined on the training data.
28. Learning Target
Input: 16 × 16 = 256 pixels x1, x2, …, x256 (ink → 1, no ink → 0); softmax outputs y1 (is 1), y2 (is 2), …, y10 (is 0).
The learning target is:
Input an image of “1” → y1 has the maximum value
Input an image of “2” → y2 has the maximum value
30. Total Loss
For R training examples x1, …, xR, the network outputs y1, …, yR are compared with the targets ŷ1, …, ŷR, each contributing a loss l_r.
Total loss: L = Σ_{r=1..R} l_r → as small as possible
Find the network parameters θ* that minimize the total loss L, i.e. find a function in the function set that minimizes total loss L.
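A small sketch of this total loss in numpy, assuming cross-entropy between the softmax output and the one-hot target as the per-example loss l_r (the slide itself leaves the choice of loss to a later section):

import numpy as np

def total_loss(Y, Y_hat):
    # Y: (R, 10) network outputs; Y_hat: (R, 10) one-hot targets
    per_example = -np.sum(Y_hat * np.log(Y + 1e-12), axis=1)  # l_r for each r
    return per_example.sum()                                  # L = sum of l_r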
31. Three Steps for Deep Learning
Step 1: define a set of functions → Neural Network
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
32. How to pick the best function
Find the network parameters θ* that minimize total loss L.
Network parameters θ = {w1, w2, w3, ⋯, b1, b2, b3, ⋯}
Enumerate all possible values? Infeasible: e.g. a speech recognition network with 8 layers of 1000 neurons each has 1000 × 1000 = 10^6 weights between consecutive layers, i.e. millions of parameters in total.
33. Gradient Descent
Find the network parameters θ* = {w1, w2, ⋯, b1, b2, ⋯} that minimize total loss L.
Pick an initial value for w (random, or RBM pre-train; random is usually good enough).
34. Gradient Descent
Pick an initial value for w, then compute ∂L/∂w:
Positive ∂L/∂w → decrease w
Negative ∂L/∂w → increase w
(http://chico386.pixnet.net/album/photo/171572850)
35. Gradient Descent
Pick an initial value for w, compute ∂L/∂w, then move by −η ∂L/∂w:
w ← w − η ∂L/∂w
η is called the “learning rate”. Repeat.
36. Gradient Descent
Pick an initial value for w, compute ∂L/∂w, update w ← w − η ∂L/∂w.
Repeat until ∂L/∂w is approximately zero (i.e. when the update becomes tiny).
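A minimal sketch of this update loop on a toy loss (L(w) = (w - 3)^2 is my own stand-in; the real L is the total loss over the training data):

# gradient descent on L(w) = (w - 3)^2, whose gradient is dL/dw = 2*(w - 3)
w, eta = 0.0, 0.1          # initial value and learning rate
for _ in range(100):
    grad = 2 * (w - 3)
    w = w - eta * grad     # w <- w - eta * dL/dw
    if abs(grad) < 1e-6:   # stop when dL/dw is approximately zero
        break
print(w)                   # ~3.0, the minimizer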
40. Gradient Descent (two parameters w1, w2)
Compute ∂L/∂w1 and ∂L/∂w2, then move by (−η ∂L/∂w1, −η ∂L/∂w2).
Hopefully, we would reach a minimum …
[Contour figure: color = value of total loss L]
41. Gradient Descent: Difficulty
• Gradient descent never guarantees global minima.
Different initial points reach different minima, so different results.
There are some tips to help you avoid local minima, but no guarantee.
42. Gradient Descent
You are playing Age of Empires: you compute ∂L/∂w1, ∂L/∂w2 and move by (−η ∂L/∂w1, −η ∂L/∂w2), but you cannot see the whole map.
43. Gradient Descent
This is the “learning” of machines in deep learning ……
Even AlphaGo uses this approach.
I hope you are not too disappointed :p
[Figure: what people imagine …… vs. what actually happens …..]
44. Backpropagation
• Backpropagation: an efficient way to compute ∂L/∂w
• Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
Don’t worry about ∂L/∂w: the toolkits will handle it. (A related toolkit was developed by NTU student 周伯威.)
45. Concluding Remarks
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
46. Outline of Lecture I
Introduction of Deep Learning
Why Deep?
“Hello World” for Deep Learning
47. Deeper is Better?
Layers × Size   Word Error Rate (%)        Layers × Size   Word Error Rate (%)
1 × 2k          24.2
2 × 2k          20.4
3 × 2k          18.4
4 × 2k          17.8
5 × 2k          17.2                       1 × 3772        22.5
7 × 2k          17.1                       1 × 4634        22.6
                                           1 × 16k         22.1
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Not surprising: more parameters, better performance.
48. Universality Theorem
Any continuous function f : R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons).
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html
So why a “deep” neural network, not a “fat” neural network?
49. Fat + Short vs. Thin + Tall
Shallow: x1, x2, …, xN feed one wide hidden layer.
Deep: x1, x2, …, xN feed many stacked layers.
Which one is better, given the same number of parameters?
50. Fat + Short vs. Thin + Tall
Layers × Size   Word Error Rate (%)        Layers × Size   Word Error Rate (%)
1 × 2k          24.2
2 × 2k          20.4
3 × 2k          18.4
4 × 2k          17.8
5 × 2k          17.2                       1 × 3772        22.5
7 × 2k          17.1                       1 × 4634        22.6
                                           1 × 16k         22.1
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Why?
51. Analogy (this page is for EE background)
Logic circuits consist of gates. Two layers of logic gates can represent any Boolean function, but using multiple layers of logic gates to build some functions is much simpler → fewer gates needed.
Neural networks consist of neurons. A network with one hidden layer can represent any continuous function, but using multiple layers of neurons to represent some functions is much simpler → fewer parameters → less data?
52. Modularization
• Deep → Modularization
Training one classifier per class directly from the image:
Classifier 1: girls with long hair (長髮女)
Classifier 2: boys with long hair (長髮男): few examples → weak
Classifier 3: girls with short hair (短髮女)
Classifier 4: boys with short hair (短髮男)
Classes with few examples (e.g. boys with long hair) yield weak classifiers.
53. Modularization
• Deep → Modularization
Instead, first train basic classifiers for the attributes: long or short hair? Boy or girl? Comparing long-hair vs. short-hair examples and boy vs. girl examples, each basic classifier can have sufficient training examples.
54. Modularization
• Deep → Modularization
The basic attribute classifiers (long or short? boy or girl?) are shared by the following classifiers as modules. Built on these modules, Classifiers 1-4 (girls/boys with long/short hair) become fine classifiers that can each be trained with little data.
55. Modularization
• Deep → Modularization
In a deep network the first layer learns the most basic classifiers; the second layer uses the first layer as modules to build classifiers; the third uses the second, and so on.
The modularization is automatically learned from data. → Less training data?
56. Modularization
• Deep → Modularization
The first layer learns the most basic classifiers; later layers use earlier layers as modules.
Reference: Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision, ECCV 2014 (pp. 818-833).
57. Outline of Lecture I
Introduction of Deep Learning
Why Deep?
“Hello World” for Deep Learning
59. Keras
• François Chollet is the author of Keras. He currently works for Google as a deep learning engineer and researcher.
• Keras means horn in Greek.
• Documentation: http://keras.io/
• Example: https://github.com/fchollet/keras/tree/master/examples
61. Example Application
• Handwriting Digit Recognition: Machine(28 × 28 image) = “1”
“Hello world” for deep learning.
MNIST Data: http://yann.lecun.com/exdb/mnist/
Keras provides a data set loading function: http://keras.io/datasets/
64. Keras
Step 3.1: Configuration: set the loss, the optimizer, and the learning rate η (e.g. 0.1) for the update w ← w − η ∂L/∂w. (How the optimizer works: next lecture.)
Step 3.2: Find the optimal network parameters by fitting the model on the training data (images) and their labels (digits).
65. Keras
Step 3.2: Find the optimal network parameters.
https://www.tensorflow.org/versions/r0.8/tutorials/mnist/beginners/index.html
The training data are numpy arrays: images of shape (number of training examples, 28 × 28 = 784) and labels of shape (number of training examples, 10).
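Putting slides 59-65 together, a minimal sketch of this “hello world” in Keras (the layer sizes are my own choice, the batch size 100 / 20 epochs follow the mini-batch slides below, and the API spelling follows the current tf.keras):

import numpy as np
from tensorflow import keras

# load MNIST with Keras's data set loading function
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255  # (60000, 784)
y_train = keras.utils.to_categorical(y_train, 10)           # (60000, 10) one-hot

# Step 1: define a set of functions (the network structure)
model = keras.Sequential([
    keras.layers.Dense(500, activation="sigmoid", input_shape=(784,)),
    keras.layers.Dense(500, activation="sigmoid"),
    keras.layers.Dense(10, activation="softmax"),
])

# Step 2 and Step 3.1: goodness of function (loss) and configuration (optimizer)
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# Step 3.2: find the optimal network parameters
model.fit(x_train, y_train, batch_size=100, epochs=20)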
67. Keras
• Using the GPU to speed up training
• Way 1: THEANO_FLAGS=device=gpu0 python YourCode.py
• Way 2 (in your code):
  import os
  os.environ["THEANO_FLAGS"] = "device=gpu0"
70. Recipe of Deep Learning
Neural Network: Step 1 (define a set of functions) → Step 2 (goodness of function) → Step 3 (pick the best function).
Good results on training data? NO → go back and modify the three steps. YES:
Good results on testing data? NO → Overfitting! YES → done.
71. Do not always blame overfitting
Bad results on the testing data are overfitting only if the results on the training data are good; if they are bad on the training data too, the network is simply not well trained.
72. Recipe of Deep Learning
Good results on training data? YES → good results on testing data?
There are different approaches for different problems, e.g. dropout for good results on testing data.
73. Recipe of Deep Learning (for good results on training data)
Choosing proper loss / Mini-batch / New activation function / Adaptive Learning Rate / Momentum
78. Recipe of Deep Learning (for good results on training data)
Choosing proper loss / Mini-batch / New activation function / Adaptive Learning Rate / Momentum
79. Mini-batch
Randomly initialize the network parameters.
Pick the 1st mini-batch (e.g. x1, x31, …): L′ = l1 + l31 + ⋯ → update the parameters once.
Pick the 2nd mini-batch (e.g. x2, x16, …): L″ = l2 + l16 + ⋯ → update the parameters once.
… until all mini-batches have been picked: one epoch. Repeat the above process.
We do not really minimize total loss!
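The same loop as a numpy-style sketch (grad_loss, a stand-in for backpropagation over one mini-batch, is hypothetical):

import numpy as np

def train(params, X, Y, grad_loss, eta=0.1, batch_size=100, epochs=20):
    R = len(X)
    for _ in range(epochs):                # one pass below = one epoch
        order = np.random.permutation(R)   # shuffle each epoch (Keras default)
        for start in range(0, R, batch_size):
            idx = order[start:start + batch_size]
            g = grad_loss(params, X[idx], Y[idx])  # gradient of L' on this batch
            params = params - eta * g              # update parameters once
    return params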
80. Mini-batch
Pick the 1st batch, L′ = l1 + l31 + ⋯, update parameters once; pick the 2nd batch, L″ = l2 + l16 + ⋯, update parameters once; … until all mini-batches have been picked (one epoch).
E.g. 100 examples in a mini-batch, repeated for 20 epochs.
81. Mini-batch
Randomly initialize the network parameters, then update them once per mini-batch (L′ = l1 + l31 + ⋯, L″ = l2 + l16 + ⋯, …).
L is different each time we update the parameters: we do not really minimize total loss!
83. Mini-batch is Faster
Original gradient descent: update after seeing all examples (one update per epoch).
With mini-batch: if there are 20 batches, update 20 times in one epoch, seeing only one batch per update.
The speed advantage is not always true with parallel computing: one pass over all examples can take the same time either way (for data sets that are not super large). Still, mini-batch has better performance!
84. Mini-batch is Better!
Testing accuracy: mini-batch 0.84, no batch 0.12.
[Figure: training accuracy over epochs for mini-batch vs. no batch]
85. Mini-batch
Shuffle the training examples for each epoch, so the mini-batches are composed differently every epoch: the batch holding x1 groups l1 + l31 in epoch 1 but l1 + l17 in epoch 2, and so on.
Don’t worry: this is the default of Keras.
86. Recipe of Deep Learning (for good results on training data)
Choosing proper loss / Mini-batch / New activation function / Adaptive Learning Rate / Momentum
87. Hard to get the power of Deep …
Deeper usually does not imply better (results on training data).
89. Vanishing Gradient Problem
In a deep sigmoid network (x1, x2, …, xN in; y1, y2, …, yM out):
Earlier layers: smaller gradients → learn very slowly → still almost random.
Later layers: larger gradients → learn very fast → already converge, based on the almost-random earlier layers!?
98. Maxout
• Learnable activation function [Ian J. Goodfellow, ICML’13]
With inputs x1, x2, group the linear units and take the max within each group, e.g.
First layer: max(5, 7) = 7 and max(−1, 1) = 1; second layer: max(1, 2) = 2 and max(4, 3) = 4.
ReLU is a special case of Maxout. You can have more than 2 elements in a group.
99. Maxout
• Learnable activation function [Ian J. Goodfellow, ICML’13]
• The activation function of a maxout network can be any piecewise linear convex function.
• The number of pieces depends on how many elements are in a group (2 elements in a group vs. 3 elements in a group).
ReLU is a special case of Maxout.
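A small numpy sketch of one maxout layer (the shapes are my own illustration: each output unit owns a group of k linear pieces):

import numpy as np

def maxout(x, W, b):
    # W: (units, k, in_dim), b: (units, k) -- k linear functions per unit
    z = np.einsum("ukd,d->uk", W, x) + b
    return z.max(axis=1)  # each unit outputs the max over its group

# With k = 2 and the pieces fixed to (w.x + b, 0), the output is max(w.x + b, 0):
# exactly ReLU, which is why ReLU is a special case of Maxout.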
100. Recipe of Deep Learning (for good results on training data)
Choosing proper loss / Mini-batch / New activation function / Adaptive Learning Rate / Momentum
101. Learning Rates
Set the learning rate η carefully: if the learning rate is too large, the total loss may not decrease after each update.
102. Learning Rates
Set the learning rate η carefully: if the learning rate is too large, the total loss may not decrease after each update; if the learning rate is too small, training would be too slow.
103. Learning Rates
• Popular & simple idea: reduce the learning rate by some factor every few epochs.
• At the beginning we are far from the destination, so we use a larger learning rate; after several epochs we are close to the destination, so we reduce the learning rate.
• E.g. 1/t decay: η_t = η / √(t + 1)
• The learning rate cannot be one-size-fits-all: give different parameters different learning rates.
104. Adagrad
Original: w ← w − η ∂L/∂w
Adagrad: w ← w − η_w ∂L/∂w, a parameter-dependent learning rate
η_w = η / √(Σ_{i=0..t} (g_i)²)
where η is a constant and g_i is the ∂L/∂w obtained at the i-th update: the learning rate is divided by the root of the summed squares of the previous derivatives.
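The Adagrad update as a numpy sketch (the toy quadratic loss is my own illustration; the small epsilon guards the first division):

import numpy as np

eta = 1.0
w = np.array([5.0, 3.0])
g_sq_sum = np.zeros_like(w)   # running sum of squared gradients g_i^2

def grad(w):                  # gradient of the toy loss w1^2 + 10*w2^2
    return np.array([2 * w[0], 20 * w[1]])

for t in range(100):
    g = grad(w)
    g_sq_sum += g ** 2
    w -= eta / (np.sqrt(g_sq_sum) + 1e-8) * g   # per-parameter learning rate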
107. Not the whole story ……
• Adagrad [John Duchi, JMLR’11]
• RMSprop: https://www.youtube.com/watch?v=O3sxAc4hxZU
• Adadelta [Matthew D. Zeiler, arXiv’12]
• “No more pesky learning rates” [Tom Schaul, arXiv’12]
• AdaSecant [Caglar Gulcehre, arXiv’14]
• Adam [Diederik P. Kingma, ICLR’15]
• Nadam: http://cs229.stanford.edu/proj2015/054_report.pdf
108. Recipe of Deep Learning (for good results on training data)
Choosing proper loss / Mini-batch / New activation function / Adaptive Learning Rate / Momentum
109. Hard to find optimal network parameters
Plotting total loss against the value of a network parameter w:
Very slow at a plateau (∂L/∂w ≈ 0)
Stuck at a saddle point (∂L/∂w = 0)
Stuck at a local minimum (∂L/∂w = 0)
110. In the physical world ……
• Momentum: a rolling ball keeps moving past flat spots. How about putting this phenomenon into gradient descent?
111. Momentum
Movement = negative of ∂L/∂w + momentum (carried over from the previous movement).
Even where ∂L/∂w = 0, the momentum keeps the parameters moving: still no guarantee of reaching global minima, but it gives some hope ……
[Figure: negative of ∂L/∂w, momentum, and the real movement along the cost curve]
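The momentum update in numpy form (the decay factor lam for the previous movement is my own notation; the slide leaves it implicit):

import numpy as np

eta, lam = 0.01, 0.9
w = np.array([5.0, 3.0])
movement = np.zeros_like(w)

def grad(w):                  # same toy gradient as in the Adagrad sketch
    return np.array([2 * w[0], 20 * w[1]])

for t in range(200):
    movement = lam * movement - eta * grad(w)  # previous movement + new gradient
    w += movement             # keeps moving even where dL/dw is ~0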
113. Let’s try it
• ReLU, 3 layers
Testing accuracy: original 0.96, Adam 0.97.
[Figure: training accuracy for Adam vs. original]
114. Recipe of Deep Learning (for good results on testing data)
Early Stopping / Regularization / Dropout / Network Structure
115. Why Overfitting?
• Training data and testing data can be different.
The learning target is defined by the training data, and the parameters achieving that target do not necessarily give good results on the testing data.
116. Panacea for Overfitting
• Have more training data
• Create more training data (?)
E.g. handwriting recognition: shift each original training image by 15° to create new training data.
118. Why Overfitting?
• For experiments, we added some noise to the testing data; training is not influenced.
Testing accuracy: clean 0.97, noisy 0.50.
119. Recipe of Deep Learning (for good results on testing data)
Early Stopping / Weight Decay / Dropout / Network Structure
121. Recipe of Deep Learning (for good results on testing data)
Early Stopping / Weight Decay / Dropout / Network Structure
122. Weight Decay
• Our brain prunes out the useless links between neurons.
Doing the same thing to a machine’s brain improves the performance.
124. Weight Decay
• Implementation
Original: w ← w − η ∂L/∂w
Weight decay: w ← 0.99 w − η ∂L/∂w, i.e. multiply by (1 − λ) with λ = 0.01, so the weights get smaller and smaller.
Keras: http://keras.io/regularizers/
125. Recipe of Deep Learning (for good results on testing data)
Early Stopping / Weight Decay / Dropout / Network Structure
127. Dropout
Training: each time before updating the parameters, each neuron has a p% chance to drop out. The resulting thinner network is used for training: the structure of the network is changed.
For each mini-batch, we resample the dropout neurons.
128. Dropout
Testing: no dropout.
If the dropout rate at training is p%, all the weights are multiplied by (1 − p)%. Assume the dropout rate is 50%: if a weight w = 1 after training, set w = 0.5 for testing.
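A numpy sketch of the two phases (the mask is resampled for every mini-batch during training; at test time the weights are simply scaled):

import numpy as np

p = 0.5  # dropout rate

def forward_train(a, W):
    mask = np.random.rand(a.shape[0]) >= p  # each neuron dropped with prob. p
    return W @ (a * mask)                   # the thinner network for this batch

def forward_test(a, W):
    return (W * (1 - p)) @ a                # no dropout; weights times (1 - p)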
129. Dropout: Intuitive Reason
When people team up, if everyone expects the partners to do the work, nothing gets done in the end. However, if you know your partner will drop out, you will do better. (“My partner will slack off, so I have to do a good job.”)
When testing, no one actually drops out, so we obtain good results eventually.
130. Dropout: Intuitive Reason
• Why should the weights be multiplied by (1 − p)% (p = dropout rate) when testing?
Training with dropout (rate 50%): z = w1 x1 + w2 x2 + w3 x3 + w4 x4 with half the inputs dropped on average.
Testing with no dropout and the weights from training: z′ ≈ 2z.
Multiplying the weights by (1 − p)% = 0.5 gives z′ ≈ z.
131. Dropout is a kind of ensemble.
Ensemble: train a bunch of networks with different structures (Networks 1-4), each on its own subset (Sets 1-4) of the training set.
132. Dropout is a kind of ensemble.
Ensemble: feed the testing data x to Networks 1-4, obtain y1, y2, y3, y4, and average them.
133. Dropout is a kind of ensemble.
Training with dropout: each mini-batch (mini-batch 1, 2, 3, 4, …) trains one sampled network; some parameters in the networks are shared. With M neurons there are 2^M possible networks.
134. Dropout is a kind of ensemble.
Testing with dropout: ideally, average the outputs y1, y2, y3, … of all sampled networks on the testing data x. In practice, multiplying all the weights by (1 − p)% gives approximately the same y. Surprising, but it works.
135. More about dropout
• More references for dropout: [Nitish Srivastava, JMLR’14] [Pierre Baldi, NIPS’13] [Geoffrey E. Hinton, arXiv’12]
• Dropout works better with Maxout [Ian J. Goodfellow, ICML’13]
• Dropconnect [Li Wan, ICML’13]: dropout deletes neurons, dropconnect deletes the connections between neurons
• Annealed dropout [S.J. Rennie, SLT’14]: the dropout rate decreases by epochs
• Standout [J. Ba, NIPS’13]: each neuron has a different dropout rate
138. Recipe of Deep Learning (for good results on testing data)
Early Stopping / Regularization / Dropout / Network Structure
CNN is a very good example! (next lecture)
140. Recipe of Deep Learning
Neural Network: Step 1 (define a set of functions) → Step 2 (goodness of function) → Step 3 (pick the best function).
Good results on training data? NO → revisit the three steps. YES:
Good results on testing data? NO → revisit, this time against overfitting. YES → done.
149. Variants of Neural Networks
Convolutional Neural Network (CNN): widely used in image processing
Recurrent Neural Network (RNN)
150. Why CNN for Image?
• When processing an image, the first layer of a fully connected network would be very large: a 100 × 100 × 3 image flattened into the input of a 1000-neuron layer already needs 3 × 10^7 weights.
Can the fully connected network be simplified by considering the properties of image recognition?
151. Why CNN for Image
• Some patterns are much smaller than the whole image. A neuron does not have to see the whole image to discover the pattern (e.g. a “beak” detector): connecting to a small region needs fewer parameters.
152. Why CNN for Image
• The same patterns appear in different regions. An “upper-left beak” detector and a “middle beak” detector do almost the same thing, so they can use the same set of parameters.
153. Why CNN for Image
• Subsampling the pixels will not change the object: a subsampled bird is still a bird. We can subsample the pixels to make the image smaller, so the network needs fewer parameters to process it.
154. Three Steps for Deep Learning
Step 1: define a set of functions → Convolutional Neural Network
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
155. The whole CNN
Image → [Convolution → Max Pooling] (can repeat many times) → Flatten → Fully Connected Feedforward Network → “cat”, “dog”, ……
156. The whole CNN
Convolution handles Property 1 (some patterns are much smaller than the whole image) and Property 2 (the same patterns appear in different regions); Max Pooling handles Property 3 (subsampling the pixels will not change the object). The [Convolution → Max Pooling] stage can repeat many times before Flatten.
157. The whole CNN
Image → [Convolution → Max Pooling] (can repeat many times) → Flatten → Fully Connected Feedforward Network → “cat”, “dog”, ……
158. CNN: Convolution
A 6 × 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
Filter 1 (3 × 3 matrix):
 1 -1 -1
-1  1 -1
-1 -1  1
Filter 2 (3 × 3 matrix):
-1  1 -1
-1  1 -1
-1  1 -1
……
The filters are the network parameters to be learned. Each filter detects a small pattern (3 × 3): Property 1.
167. CNN: Max Pooling
The 6 × 6 image goes through Conv (a 4 × 4 feature map per filter) and Max Pooling, giving a new but smaller 2 × 2 image:
Filter 1 channel: 3 0 / 3 1
Filter 2 channel: -1 1 / 0 3
Each filter is a channel.
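A compact numpy sketch of these two stages (stride 1, no padding, non-overlapping 2 x 2 pooling):

import numpy as np

img = np.array([[1,0,0,0,0,1],
                [0,1,0,0,1,0],
                [0,0,1,1,0,0],
                [1,0,0,0,1,0],
                [0,1,0,0,1,0],
                [0,0,1,0,1,0]])
filt1 = np.array([[ 1,-1,-1],
                  [-1, 1,-1],
                  [-1,-1, 1]])  # Filter 1 from the convolution slide

def conv2d(x, k):
    # valid cross-correlation with stride 1: 6x6 -> 4x4 here
    kh, kw = k.shape
    return np.array([[np.sum(x[i:i+kh, j:j+kw] * k)
                      for j in range(x.shape[1] - kw + 1)]
                     for i in range(x.shape[0] - kh + 1)])

def maxpool2(x):
    # non-overlapping 2x2 max pooling: 4x4 -> 2x2
    return x.reshape(x.shape[0]//2, 2, x.shape[1]//2, 2).max(axis=(1, 3))

print(maxpool2(conv2d(img, filt1)))  # [[3, 0], [3, 1]]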
168. The whole CNN
[Convolution → Max Pooling] (can repeat many times) produces a new image, smaller than the original; the number of channels is the number of filters.
169. The whole CNN
Image → Convolution → Max Pooling (a new image) → Convolution → Max Pooling (a new image) → Flatten → Fully Connected Feedforward Network → “cat”, “dog”, ……
177. Convolution vs. fully connected
The 6 × 6 image (dim = 36) passes through convolution with the two 3 × 3 filters, with max pooling implemented like the maxout units shown earlier (take the max of each group). The convolution output has dim = 4 × 4 × 2 = 32, yet the convolution stage needs only 9 × 2 = 18 parameters; a fully connected layer mapping the 36 inputs to those 32 outputs would need 36 × 32 = 1152 parameters.
178. Convolutional Neural Network
Step 1: define a set of functions → Convolutional Neural Network (convolution, max pooling, fully connected)
Step 2: goodness of function, e.g. CNN(image) vs. the target (1, 0, 0, …) over “monkey”, “cat”, “dog”
Step 3: pick the best function
Learning: nothing special, just gradient descent ……
179. Playing Go
Network input: the 19 × 19 board position as a 19 × 19 vector (black: 1, white: −1, none: 0); output: the next move, also a 19 × 19 vector.
A fully connected feedforward network can be used, but a CNN performs much better: treat the 19 × 19 position as a 19 × 19 image.
180. Playing Go
Training: records of previous plays, e.g. 進藤光 v.s. 社清春 (black: 5之五, white: 天元, black: 五之5, …).
The network learns to predict each next move: after black’s 5之五, target “天元” = 1 (else 0); after white’s 天元, target “五之5” = 1 (else 0).
181. Why CNN for playing Go?
• Some patterns are much smaller than the whole image: AlphaGo uses 5 × 5 filters for the first layer.
• The same patterns appear in different regions.
182. Why CNN for playing Go?
• Subsampling the pixels will not change the object? AlphaGo does not use max pooling …… How to explain this???
183. Variants of Neural Networks
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN): a neural network with memory
184. Example Application
• Slot Filling: “I would like to arrive Taipei on November 2nd.” → a ticket booking system fills the slots Destination: Taipei; time of arrival: November 2nd.
186. 1-of-N Encoding
How to represent each word as a vector? Each dimension corresponds to a word in the lexicon; the dimension for the word is 1, and the others are 0. The vector is lexicon-size, e.g. lexicon = {apple, bag, cat, dog, elephant}:
apple = [1 0 0 0 0]
bag = [0 1 0 0 0]
cat = [0 0 1 0 0]
dog = [0 0 0 1 0]
elephant = [0 0 0 0 1]
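A tiny sketch of this encoding in Python:

lexicon = ["apple", "bag", "cat", "dog", "elephant"]

def one_of_n(word):
    # 1 at the word's dimension, 0 everywhere else
    return [1 if w == word else 0 for w in lexicon]

print(one_of_n("cat"))  # [0, 0, 1, 0, 0]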
187. Beyond 1-of-N encoding
• Dimension for “Other”: append an “other” dimension, so words outside the lexicon, e.g. w = “Gandalf” or w = “Sauron”, get (apple, bag, cat, dog, elephant) = 0 and “other” = 1.
• Word hashing: represent w = “apple” by its letter trigrams over 26 × 26 × 26 dimensions (a-p-p = 1, p-p-l = 1, p-l-e = 1, the rest 0).
188. Example Application
Solving slot filling with a feedforward network?
Input: a word, each word represented as a vector (e.g. “Taipei”).
Output: a probability distribution y1, y2 over the slots (dest, time of departure) that the input word belongs to.
189. Example Application
Problem? “arrive Taipei on November 2nd” → other, dest, other, time, time; but in “leave Taipei on November 2nd”, Taipei is the place of departure.
The same input “Taipei” must yield different outputs (dest vs. place of departure): the neural network needs memory!
190. Three Steps for Deep Learning
Step 1: define a set of functions → Recurrent Neural Network
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
191. Recurrent Neural Network (RNN)
The outputs of the hidden layer (a1, a2) are stored in the memory; the memory can be considered as another input (alongside x1, x2) when computing y1, y2.
192. RNN
The same network is used again and again: for “arrive Taipei on November 2nd”, x1 = arrive, x2 = Taipei, x3 = on, …; at each step the stored hidden state is fed back (a1 → a2, a2 → a3), and the output yt gives the probability of word t in each slot.
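A minimal numpy sketch of this recurrence (the weight names W_x, W_a, W_y and the tanh hidden activation are my own choices; the slide leaves them implicit):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, W_x, W_a, W_y, b):
    ys, a = [], np.zeros(W_a.shape[0])      # the memory starts at 0
    for x in xs:                            # the same network at every step
        a = np.tanh(W_x @ x + W_a @ a + b)  # new hidden state, stored in memory
        ys.append(softmax(W_y @ a))         # slot probabilities for this word
    return ys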
193. RNN
“arrive Taipei” vs. “leave Taipei”: because “arrive” and “leave” store different values in the memory, the network gives different slot probabilities for the same word “Taipei”.
194. Of course it can be deep …
Multiple hidden layers can be stacked: at each time step t, t+1, t+2 the input xt flows up through several recurrent layers to the output yt.
196. Long Short-term Memory (LSTM)
A special neuron with 4 inputs and 1 output, built around a memory cell:
Input gate: a signal from another part of the network controls whether the input enters the cell.
Output gate: a signal controls whether the cell’s value is read out.
Forget gate: a signal controls whether the cell’s value is kept or erased.
197. LSTM cell computation
Inputs: cell input z, input-gate signal z_i, forget-gate signal z_f, output-gate signal z_o. The gate activation f is usually a sigmoid, between 0 and 1, mimicking an open or closed gate.
New cell value: c′ = g(z) f(z_i) + c f(z_f)
Output: a = h(c′) f(z_o)
Here c is the old cell value, g and h are activation functions, and the products are multiplications.
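Those two equations as a numpy sketch (one cell; the four pre-activations z, z_i, z_f, z_o come from the rest of the network):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(c, z, z_i, z_f, z_o, g=np.tanh, h=np.tanh):
    c_new = g(z) * sigmoid(z_i) + c * sigmoid(z_f)  # c' = g(z)f(z_i) + c f(z_f)
    a = h(c_new) * sigmoid(z_o)                     # a  = h(c') f(z_o)
    return a, c_new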
203. Multiple-layer LSTM
This is quite standard now. Don’t worry if you cannot understand the unrolled diagram: Keras can handle it. Keras supports the “LSTM”, “GRU”, and “SimpleRNN” layers.
(https://img.komicolle.org/2015-09-20/src/14426967627131.gif)
204. Three Steps for Deep Learning
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
205. Training
[Figure: the unrolled RNN, with the same weights W copied at every step,
over the sentence "arrive Taipei on November 2nd".]
Training sentences are labeled word by word:
arrive → other, Taipei → dest, on → other, November → time, 2nd → time
The learning target for each word is the 1-of-N encoding of its slot label.
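Continuing the Keras sketch above, steps 2 and 3 (goodness of function, picking the best function) could look like this; x_train and y_train are hypothetical placeholders, not data from the slides:

```python
# Cross-entropy between the per-word slot distribution and the 1-of-N target.
model.compile(optimizer="adam", loss="categorical_crossentropy")
# x_train: (sentences, max_len) word indices
# y_train: (sentences, max_len, 3) one-hot slot labels
# model.fit(x_train, y_train, epochs=10)
```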
206. Step 1:
define a set
of function
Step 2:
goodness of
function
Step 3: pick
the best
function
Three Steps for Deep Learning
Deep Learning is so simple ……
207. Learning
RNN learning is very difficult in practice.
Backpropagation through time (BPTT): the unrolled network shares one
set of weights, updated as w ← w − η ∂L/∂w.
208. Unfortunately ……
• RNN-based networks are not always easy to learn
Real experiments on language modeling
(Thanks to 曾柏翔 for providing the experimental results.)
[Plot: total loss vs. epoch; training succeeds only sometimes, if lucky.]
209. The error surface is rough.
[Figure: the total loss plotted over w1 and w2; the error surface is
either very flat or very steep.]
Solution: clipping the gradients [Razvan Pascanu, ICML'13]
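A minimal sketch of the clipping idea (the threshold 1.0 is an arbitrary illustration):

```python
# Gradient clipping: cap the gradient before the update, so a step taken
# on a steep "cliff" of the error surface stays bounded.
import numpy as np

def clipped_update(w, grad, lr=0.01, clip=1.0):
    grad = np.clip(grad, -clip, clip)  # element-wise cap on the gradient
    return w - lr * grad

# Keras optimizers expose the same idea directly, e.g.
# tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)
```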
210. Why? Toy Example
A toy network that multiplies by the same recurrent weight w at each of
1000 time steps, with input sequence 1, 0, 0, …, 0, so y_1000 = w^999:
w = 1 → y_1000 = 1
w = 1.01 → y_1000 ≈ 20000 (large ∂L/∂w → small learning rate?)
w = 0.99 → y_1000 ≈ 0 (small ∂L/∂w → large learning rate?)
w = 0.01 → y_1000 ≈ 0
Because the same w is reused at every time step, a small change in w is
amplified or suppressed exponentially.
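The toy numbers are easy to reproduce with a two-line check:

```python
# The slide's toy computation: after 1000 time steps, y_1000 = w^999.
for w in [1.0, 1.01, 0.99, 0.01]:
    print(w, w ** 999)
# 1.0 -> 1.0, 1.01 -> ~2e4 (explodes), 0.99 -> ~4e-5, 0.01 -> 0.0 (vanishes)
```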
211. Helpful Techniques
• Long Short-term Memory (LSTM)
• Can deal with gradient vanishing (not gradient explode)
Memory and input are added, so the influence never disappears unless
the forget gate is closed → no gradient vanishing
(if the forget gate is opened).
Gated Recurrent Unit (GRU): simpler than LSTM [Cho, EMNLP'14]
212. Helpful Techniques
Vanilla RNN initialized with the identity matrix + ReLU activation
function [Quoc V. Le, arXiv'15]:
outperforms or is comparable with LSTM on 4 different tasks.
Clockwork RNN [Jan Koutnik, JMLR'14]
Structurally Constrained Recurrent Network (SCRN) [Tomas Mikolov, ICLR'15]
213. More Applications ……
In slot filling, input and output are both sequences with the same
length: one slot probability distribution per input word
("arrive Taipei on November 2nd" → one output per word).
RNN can do more than that!
214. Many to one
• Input is a vector sequence, but output is only one vector
Sentiment Analysis, e.g. classifying movie reviews
(超好雷 / 好雷 / 普雷 / 負雷 / 超負雷, from very positive to very negative):
"看了這部電影覺得很高興……" ("Watching this movie made me happy …") → Positive (正雷)
"這部電影太糟了……" ("This movie is terrible …") → Negative (負雷)
"這部電影很棒……" ("This movie is great …") → Positive (正雷)
The RNN reads the review word by word (我 覺 得 太 糟 了) and outputs
a single label at the end.
Keras Example:
https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py
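A hedged, stripped-down sketch of the linked example's structure (the sizes are illustrative; see imdb_lstm.py for the real thing):

```python
# Many-to-one: only the final LSTM output feeds the classifier.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=20000, output_dim=128),  # word indices -> vectors
    LSTM(128),                        # return_sequences=False by default
    Dense(1, activation="sigmoid"),   # P(positive review)
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```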
215. Many to Many (Output is shorter)
• Input and output are both sequences, but the output is shorter.
• E.g. Speech Recognition
Input: acoustic vector sequence (one vector per frame)
Frame-level output: 好 好 好 棒 棒 棒 棒 棒
Trimming the repeated characters gives the character sequence "好棒".
Problem? After trimming it can never output "好棒棒", which contains a
genuinely repeated character.
216. Many to Many (Output is shorter)
• Input and output are both sequences, but the output is shorter.
• Connectionist Temporal Classification (CTC) [Alex Graves,
ICML'06][Alex Graves, ICML'14][Haşim Sak, Interspeech'15][Jie Li,
Interspeech'15][Andrew Senior, ASRU'15]
Add an extra symbol "φ" representing "null":
好 φ φ 棒 φ φ φ φ → "好棒"
好 φ φ 棒 φ 棒 φ φ → "好棒棒"
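A sketch of the CTC output rule that makes this work (the cited papers' decoding is more involved; this shows only the collapse step):

```python
# CTC collapse: merge repeated symbols, then drop the null "φ".
# A null between two identical characters keeps them distinct,
# which is how "好棒棒" stays recoverable.
def ctc_collapse(frames, null="φ"):
    out, prev = [], None
    for s in frames:
        if s != prev and s != null:
            out.append(s)
        prev = s
    return "".join(out)

print(ctc_collapse(["好", "φ", "φ", "棒", "φ", "φ", "φ", "φ"]))   # 好棒
print(ctc_collapse(["好", "φ", "φ", "棒", "φ", "棒", "φ", "φ"]))  # 好棒棒
```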
217. Many to Many (No Limitation)
• Input and output are both sequences with different lengths.
→ Sequence-to-sequence learning
• E.g. Machine Translation (machine learning → 機器學習)
The encoder reads "machine learning"; its final state contains all the
information about the input sequence.
218. Many to Many (No Limitation)
• Input and output are both sequences with different lengths.
→ Sequence-to-sequence learning
• E.g. Machine Translation (machine learning → 機器學習)
The decoder keeps generating: 機 器 學 習 慣 性 ……
It doesn't know when to stop ("慣性" means inertia: it just keeps going).
219. Many to Many (No Limitation)
推 tlkagk: =========斷==========
Ref: http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87
(Netizen Encyclopedia, the PTT wiki, on the word-chain posting game)
220. Many to Many (No Limitation)
• Input and output are both sequences with different lengths.
→ Sequence-to-sequence learning
• E.g. Machine Translation (machine learning → 機器學習)
Add a stop symbol "===" (斷): the decoder outputs 機 器 學 習 === and stops.
[Ilya Sutskever, NIPS'14][Dzmitry Bahdanau, arXiv'15]
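A sketch of how the stop symbol is used at generation time; decode_step is a hypothetical wrapper around the trained decoder RNN, not an API from the slides:

```python
# Greedy decoding sketch: keep generating until the stop symbol appears.
# decode_step(prev_token, state) -> (next_token, new_state) is assumed.
def generate(decode_step, max_len=50):
    out, state, prev = [], None, "<start>"
    while len(out) < max_len:
        token, state = decode_step(prev, state)
        if token == "===":          # the learned "斷" symbol: stop here
            break
        out.append(token)
        prev = token
    return out
```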
221. One to Many
• Input an image, but output a sequence of words
Caption Generation: a CNN turns the input image into a single vector
for the whole image, and the RNN decodes it into
"a woman is …… ===".
[Kelvin Xu, arXiv'15][Li Yao, ICCV'15]
226. Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
235. Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
238. Attention-based Model v2: Neural Turing Machine
A DNN/RNN maps the input to the output; a reading head controller
places a reading head over the machine's memory, and a writing head
controller places a writing head that can also modify the memory.
240. Reading Comprehension
• End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston,
R. Fergus. NIPS, 2015.
[Figure: the position of the reading head in the story.]
Keras has an example:
https://github.com/fchollet/keras/blob/master/examples/babi_memnn.py
243. Visual Question Answering
• Huijuan Xu, Kate Saenko. Ask, Attend and Answer: Exploring
Question-Guided Spatial Attention for Visual Question
Answering. arXiv Pre-Print, 2015
244. Speech Question Answering
• TOEFL Listening Comprehension Test by Machine
• Example:
Question: “ What is a possible origin of Venus’ clouds? ”
Audio Story:
Choices:
(A) gases released as a result of volcanic activity
(B) chemical reactions caused by high surface temperatures
(C) bursts of radio energy from the planet's surface
(D) strong winds that blow dust into the atmosphere
(The original story is 5 min long.)
245. Simple Baselines
[Bar chart: accuracy (%) of seven naive approaches (1)-(7), including
random guessing, (2) selecting the shortest choice as the answer, and
(4) selecting the choice semantically most similar to the others.]
Experimental setup: 717 questions for training,
124 for validation, 122 for testing
246. Model Architecture
Question: "what is a possible origin of Venus' clouds?" → semantic
analysis → question semantics.
Audio Story → speech recognition → semantic analysis, with attention
over the recognized story. ASR output excerpt:
"…… It be quite possible that this be due to volcanic eruption because
volcanic eruption often emit gas. If that be the case volcanism could
very well be the root cause of Venus 's thick cloud cover. And also we
have observe burst of radio energy from the planet 's surface. These
burst be similar to what we see when volcano erupt on earth ……"
Answer: select the choice most similar to the attended answer.
Everything is learned from training examples.
252. Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
257. Supervised vs. Reinforcement
• Supervised: for each board position the teacher gives the next move
("5-5", "3-3").
• Reinforcement Learning: first move …… many moves …… Win!
Only the final outcome tells the machine how well it played.
Alpha Go is supervised learning + reinforcement learning.
258. Difficulties of Reinforcement
Learning
• It may be better to sacrifice immediate reward to
gain more long-term reward
• E.g. Playing Go
• Agent’s actions affect the subsequent data it
receives
• E.g. Exploration
260. Application: Interactive Retrieval
• Interactive retrieval is helpful. [Wu & Lee, INTERSPEECH 16]
User: "Deep Learning"
Machine: Is "Deep Learning" related to Machine Learning?
Machine: Is "Deep Learning" related to Education?
261. Deep Reinforcement Learning
• Different network depth
[Plot: retrieval performance vs. amount of user interaction; deeper
networks give better retrieval performance with less user labor.]
The task cannot be addressed by a linear model; some depth is needed.
262. More applications
• Alpha Go, Playing Video Games, Dialogue
• Flying Helicopter
• https://www.youtube.com/watch?v=0JL04JJjocc
• Driving
• https://www.youtube.com/watch?v=0xo1Ldx3L5Q
• Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI
• http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai
263. To learn deep reinforcement learning ……
• Lectures of David Silver
• http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
• 10 lectures (1.5 hours each)
• Deep Reinforcement Learning
• http://videolectures.net/rldm2015_silver_reinforcement_learning/
264. Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
265. Does the machine know what the world looks like?
Draw something!
Ref: https://openai.com/blog/generative-models/
266. Deep Dream
• Given a photo, machine adds what it sees ……
http://deepdreamgenerator.com/
268. Deep Style
• Given a photo, make its style like famous paintings
https://dreamscopeapp.com/
278. Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
280. Machine Reading
• Machines learn the meaning of words from reading a lot of documents
without supervision
Word Vector / Embedding
[Figure: a 2-D embedding space where related words cluster: dog, cat,
rabbit; jump, run; flower, tree.]
281. Machine Reading
• Generating a word vector/embedding is unsupervised:
a neural network maps a word (e.g. "Apple") to its vector.
The training data is just a lot of text.
282. Machine Reading
• Machines learn the meaning of words from reading a lot of documents
without supervision
• A word can be understood by its context:
"蔡英文 was sworn into office on 5/20" /
"馬英九 was sworn into office on 5/20"
→ 蔡英文 and 馬英九 are something very similar.
"You shall know a word by the company it keeps."
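As a hedged sketch of learning word vectors from context, using gensim (an outside library, not mentioned in the slides):

```python
# Word vectors learned purely from co-occurring context, no labels needed.
from gensim.models import Word2Vec

sentences = [["蔡英文", "520", "宣誓", "就職"],
             ["馬英九", "520", "宣誓", "就職"]]
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, seed=0)
# Same context -> similar vectors (toy data, so the number is illustrative):
print(model.wv.similarity("蔡英文", "馬英九"))
```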
285. Machine Reading
• Machines learn the meaning of words from reading a lot of documents
without supervision
286. Demo
• The model used in the demo is provided by 陳仰德
• Part of the project was done by 陳仰德 and 林資偉
• TA: 劉元銘
• The training data is from PTT (collected by 葉青峰)
287. Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
288. Learning from Audio Books
The machine listens to lots of audio books [Chung, Interspeech'16].
It does not have any prior knowledge, like an infant.
289. Audio Word to Vector
• An audio segment corresponding to an unknown word is mapped to a
fixed-length vector.
290. Audio Word to Vector
• The audio segments corresponding to words with similar
pronunciations are close to each other.
[Figure: embedding space where the segments for "ever" / "never" and
for "dog" / "dogs" form nearby clusters.]
296. Concluding Remarks
Lecture I: Introduction of Deep Learning
Lecture II: Tips for Training Deep Neural Network
Lecture III: Variants of Neural Network
Lecture IV: Next Wave
297. Will AI soon take over most jobs?
• New Job in the AI Age:
AI trainer (machine learning expert, data scientist)
http://www.express.co.uk/news/science/651202/First-step-towards-The-Terminator-becoming-reality-AI-beats-champ-of-world-s-oldest-game