Module 2
Introduction
Computer vision is one of the core areas of artificial intelligence (AI), and focuses on creating
solutions that enable AI applications to "see" the world and make sense of it.
Of course, computers don't have biological eyes that work the way ours do, but they're
capable of processing images, either from a live camera feed or from digital photographs or
videos. This ability to process images is the key to creating software that can emulate human
visual perception.
In this module, we'll examine some of the fundamental principles and techniques that underlie
computer vision. We'll also introduce Microsoft Azure AI Vision, a cloud service that
developers can use to create a wide range of computer vision solutions.
Before we can explore image processing and other computer vision capabilities, it's useful to consider what an image actually is in the context of data for a computer program. To a computer, an image is an array of numeric pixel values. For example, consider the following array:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
The array consists of seven rows and seven columns, representing the pixel values for a 7x7
pixel image (which is known as the image's resolution). Each pixel has a value between 0
(black) and 255 (white); with values between these bounds representing shades of gray. The
image represented by this array looks similar to the following (magnified) image:
The array of pixel values for this image is two-dimensional (representing rows and columns, or
x and y coordinates) and defines a single rectangle of pixel values. A single layer of pixel values
like this represents a grayscale image. In reality, most digital images are multidimensional and
consist of three layers (known as channels) that represent red, green, and blue (RGB) color
hues. For example, we could represent a color image by defining three channels of pixel values
that create the same square shape as the previous grayscale example:
Red:
150 150 150 150 150 150 150
150 150 150 150 150 150 150
150 150 255 255 255 150 150
150 150 255 255 255 150 150
150 150 255 255 255 150 150
150 150 150 150 150 150 150
150 150 150 150 150 150 150
Green:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
Blue:
255 255 255 255 255 255 255
255 255 255 255 255 255 255
255 255 0 0 0 255 255
255 255 0 0 0 255 255
255 255 0 0 0 255 255
255 255 255 255 255 255 255
255 255 255 255 255 255 255
Combining the three channels produces a color image in which the background pixels have the values Red: 150, Green: 0, Blue: 255 (purple) and the central square has the values Red: 255, Green: 255, Blue: 0 (yellow).
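To make the pixel-array idea concrete, here's a minimal sketch (using Python and NumPy; the array shapes and values mirror the examples above, everything else is illustrative) that builds the grayscale image and the three-channel color image as arrays:

```python
import numpy as np

# 7x7 grayscale image: a white (255) square on a black (0) background.
gray = np.zeros((7, 7), dtype=np.uint8)
gray[2:5, 2:5] = 255

# The three color channels from the example above.
red = np.full((7, 7), 150, dtype=np.uint8)
green = np.zeros((7, 7), dtype=np.uint8)
blue = np.full((7, 7), 255, dtype=np.uint8)
red[2:5, 2:5] = 255    # the central square is red 255 ...
green[2:5, 2:5] = 255  # ... green 255 ...
blue[2:5, 2:5] = 0     # ... and blue 0 (yellow)

# Stack the channels into a single 7x7x3 color image.
rgb = np.dstack([red, green, blue])
print(gray.shape, rgb.shape)  # (7, 7) (7, 7, 3)
```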
A common way to perform image processing tasks is to apply filters that modify the pixel values of an image to create a visual effect. A filter is defined by one or more arrays of values, known as filter kernels. For example, the following array defines a 3x3 filter kernel:

-1 -1 -1
-1 8 -1
-1 -1 -1
The kernel is then convolved across the image, calculating a weighted sum for each 3x3 patch
of pixels and assigning the result to a new image. It's easier to understand how the filtering
works by exploring a step-by-step example.
Let's start with the grayscale image we explored previously:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
First, we apply the filter kernel to the top left patch of the image, multiplying each pixel value
by the corresponding weight value in the kernel and adding the results:
(0 x -1) + (0 x -1) + (0 x -1) +
(0 x -1) + (0 x 8) + (0 x -1) +
(0 x -1) + (0 x -1) + (255 x -1) = -255
The result (-255) becomes the first value in a new array. Then we move the filter kernel along one pixel to the right and repeat the operation:

(0 x -1) + (0 x -1) + (0 x -1) +
(0 x -1) + (0 x 8) + (0 x -1) +
(0 x -1) + (255 x -1) + (255 x -1) = -510

Again, the result is added to the new array, which now contains two values:

-255 -510
The process is repeated until the filter has been convolved across the entire image, as shown
in this animation:
The filter is convolved across the image, calculating a new array of values. Some of the values
might be outside of the 0 to 255 pixel value range, so the values are adjusted to fit into that
range. Because of the shape of the filter, the outside edge of pixels isn't calculated, so a
padding value (usually 0) is applied. The resulting array represents a new image in which the
filter has transformed the original image. In this case, the filter has had the effect of
highlighting the edges of shapes in the image.
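To illustrate the whole process, here's a rough sketch (not part of the module's walkthrough) that applies the kernel to the 7x7 grayscale image with plain NumPy, leaving the uncalculated edge pixels at the padding value 0 and clipping the results into the 0 to 255 range:

```python
import numpy as np

# The 7x7 grayscale image and the kernel from the walkthrough above.
image = np.zeros((7, 7), dtype=int)
image[2:5, 2:5] = 255

kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

# Convolve: for each 3x3 patch, multiply by the kernel weights and sum.
output = np.zeros_like(image)  # edge pixels keep the padding value 0
for row in range(1, 6):
    for col in range(1, 6):
        patch = image[row - 1:row + 2, col - 1:col + 2]
        output[row, col] = np.sum(patch * kernel)

output = np.clip(output, 0, 255)  # adjust values back into the 0-255 range
print(output)
```

The first calculated value, at position (1, 1), is -255, just as in the step-by-step example, and is clipped to 0 in the final image.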
To see the effect of the filter more clearly, here's an example of the same filter applied to a real
image:
Because the filter is convolved across the image, this kind of image manipulation is often referred to as convolutional filtering. The filter used in this example is a particular type of filter (called a Laplace filter) that highlights the edges of objects in an image. There are many other kinds of filters that you can use to create blurring, sharpening, color inversion, and other effects.
The ability to use filters to apply effects to images is useful in image processing tasks, such as
you might perform with image editing software. However, the goal of computer vision is often
to extract meaning, or at least actionable insights, from images; which requires the creation of
machine learning models that are trained to recognize features based on large volumes of
existing images.
Tip
This unit assumes you are familiar with the fundamental principles of machine learning,
and that you have conceptual knowledge of deep learning with neural networks. If you
are new to machine learning, consider completing the Fundamentals of machine
learning module on Microsoft Learn.
One of the most common machine learning model architectures for computer vision is a convolutional neural network (CNN), a type of deep learning architecture that uses the filtering technique described previously to extract features from images. For example, a CNN might be trained to classify images based on the type of fruit they contain. During the training process for the CNN, filter kernels are initially defined using randomly generated weight values. Then, as the training process progresses, the model's predictions are evaluated against known label values, and the filter weights are adjusted to improve accuracy. Eventually, the trained fruit image classification model uses the filter weights that best extract features that help identify different kinds of fruit.
The following diagram illustrates how a CNN for an image classification model works:
1. Images with known labels (for example, 0: apple, 1: banana, or 2: orange) are fed into the
network to train the model.
2. One or more layers of filters are used to extract features from each image as it is fed through the network. The filter kernels start with randomly assigned weights and generate arrays of numeric values called feature maps.
3. The feature maps are flattened into a single dimensional array of feature values.
4. The feature values are fed into a fully connected neural network.
5. The output layer of the neural network uses a softmax or similar function to produce a
result that contains a probability value for each possible class, for example [0.2, 0.5, 0.3].
During training the output probabilities are compared to the actual class label - for example,
an image of a banana (class 1) should have the value [0.0, 1.0, 0.0]. The difference between the
predicted and actual class scores is used to calculate the loss in the model, and the weights in
the fully connected neural network and the filter kernels in the feature extraction layers are
modified to reduce the loss.
The training process repeats over multiple epochs until an optimal set of weights has been
learned. Then, the weights are saved and the model can be used to predict labels for new
images for which the label is unknown.
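As a rough sketch of these steps in code (not the module's own example), the following PyTorch snippet defines a small CNN for the fruit example and performs a single training step; the layer sizes, input image size, and optimizer settings are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FruitCNN(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        # Steps 1-2: convolutional filter layers whose kernel weights start out
        # random and are learned during training, producing feature maps.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Steps 3-4: flatten the feature maps and feed them into a fully
        # connected network.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, num_classes),  # assumes 64x64 RGB input images
        )

    def forward(self, x):
        return self.classifier(self.features(x))  # raw class scores (logits)

model = FruitCNN()
loss_fn = nn.CrossEntropyLoss()  # applies softmax and compares to the true labels
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One training step on a dummy batch of four 64x64 images with known labels.
images = torch.rand(4, 3, 64, 64)
labels = torch.tensor([0, 1, 2, 1])    # e.g. apple, banana, orange, banana
loss = loss_fn(model(images), labels)  # step 5: compare predictions to labels
loss.backward()                        # work out how to adjust the weights
optimizer.step()                       # update filter kernels and dense layers

# After training, softmax turns the scores into class probabilities per image,
# for example [0.2, 0.5, 0.3].
probs = torch.softmax(model(images), dim=1)
```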
Note
CNN architectures usually include multiple convolutional filter layers and additional layers
to reduce the size of feature maps, constrain the extracted values, and otherwise
manipulate the feature values. These layers have been omitted in this simplified example
to focus on the key concept, which is that filters are used to extract numeric features from
images, which are then used in a neural network to predict image labels.
Transformers and multi-modal models
CNNs have been at the core of computer vision solutions for many years. While they're
commonly used to solve image classification problems as described previously, they're also
the basis for more complex computer vision models. For example, object detection models
combine CNN feature extraction layers with the identification of regions of interest in images
to locate multiple classes of object in the same image.
Transformers
Most advances in computer vision over the decades have been driven by improvements in
CNN-based models. However, in another AI discipline, natural language processing (NLP), another type of neural network architecture, called a transformer, has enabled the
development of sophisticated models for language. Transformers work by processing huge
volumes of data, and encoding language tokens (representing individual words or phrases) as
vector-based embeddings (arrays of numeric values). You can think of an embedding as
representing a set of dimensions that each represent some semantic attribute of the token.
The embeddings are created such that tokens that are commonly used in the same context
define vectors that are more closely aligned than unrelated words.
As a simple example, the following diagram shows some words encoded as three-dimensional
vectors, and plotted in a 3D space:
Tokens that are semantically similar are encoded in similar directions, creating a semantic
language model that makes it possible to build sophisticated NLP solutions for text analysis,
translation, language generation, and other tasks.
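As a toy illustration with invented vectors, the following sketch measures how closely aligned two embeddings are by using cosine similarity:

```python
import numpy as np

# Invented three-dimensional embeddings for a few tokens.
embeddings = {
    "dog": np.array([10.3, 4.5, 1.2]),
    "puppy": np.array([9.7, 5.1, 0.9]),
    "skateboard": np.array([-2.1, 8.4, 7.7]),
}

def cosine_similarity(a, b):
    """1.0 means the vectors point in exactly the same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["dog"], embeddings["puppy"]))       # ~0.99: related tokens
print(cosine_similarity(embeddings["dog"], embeddings["skateboard"]))  # ~0.19: unrelated tokens
```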
Note
We've used only three dimensions, because that's easy to visualize. In reality, encoders in
transformer networks create vectors with many more dimensions, defining complex
semantic relationships between tokens based on linear algebraic calculations. The math
involved is complex, as is the architecture of a transformer model. Our goal here is just to
provide a conceptual understanding of how encoding creates a model that encapsulates
relationships between entities.
Multi-modal models
The success of transformers as a way to build language models has led AI researchers to
consider whether the same approach would be effective for image data. The result is the
development of multi-modal models, in which the model is trained using a large volume of
captioned images, with no fixed labels. An image encoder extracts features from images based
on pixel values and combines them with text embeddings created by a language encoder. The
overall model encapsulates relationships between natural language token embeddings and
image features, as shown here:
The Microsoft Florence model is just such a model. Trained with huge volumes of captioned
images from the Internet, it includes both a language encoder and an image encoder. Florence
is an example of a foundation model: a pre-trained general model on which you can build multiple adaptive models for specialist tasks. For example, you can use Florence as a foundation model for adaptive models that perform specialized computer vision tasks such as image classification and object detection.
Azure AI Vision
While you can train your own machine learning models for computer vision, the architecture for computer vision models can be complex, and you need significant volumes of training images and compute power to perform the training process.
Microsoft's Azure AI Vision service provides prebuilt and customizable computer vision models
that are based on the Florence foundation model and provide various powerful capabilities.
With Azure AI Vision, you can create sophisticated computer vision solutions quickly and easily, taking advantage of "off-the-shelf" functionality for many common computer vision scenarios, while retaining the ability to create custom models using your own images.

To use Azure AI Vision, you need to create a resource for it in your Azure subscription. You can use either of the following resource types:

Azure AI Vision: A specific resource for the Azure AI Vision service. Use this resource type if you don't intend to use any other Azure AI services, or if you want to track utilization and costs for your Azure AI Vision resource separately.
Azure AI services: A general resource that includes Azure AI Vision along with many
other Azure AI services; such as Azure AI Language, Azure AI Custom Vision, Azure AI
Translator, and others. Use this resource type if you plan to use multiple AI services and
want to simplify administration and development.
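As a sketch of what calling the service can look like, the following Python example uses the azure-ai-vision-imageanalysis client library; the endpoint, key, and image URL are placeholders, and the analyzed features are chosen just for illustration (check the current SDK documentation for the exact methods available):

```python
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

# Placeholders: use the endpoint and key from your Azure AI Vision or
# Azure AI services resource.
client = ImageAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

result = client.analyze_from_url(
    image_url="https://example.com/skateboarder.jpg",
    visual_features=[VisualFeatures.CAPTION, VisualFeatures.TAGS, VisualFeatures.OBJECTS],
)

if result.caption is not None:
    print(f"Caption: {result.caption.text} ({result.caption.confidence:.2%})")

if result.tags is not None:
    for tag in result.tags.list:
        print(f"Tag: {tag.name} ({tag.confidence:.2%})")

if result.objects is not None:
    for detected in result.objects.list:
        box = detected.bounding_box  # x, y, width, and height in pixels
        print(f"Object: {detected.tags[0].name} at {box}")
```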
Azure AI Vision also provides optical character recognition (OCR) capabilities to extract text from images. For example, the service can analyze an image of a nutrition label and extract the following text:
Nutrition Facts Amount Per Serving
Serving size:1 bar (40g)
Serving Per Package: 4
Total Fat 13g
Saturated Fat 1.5g
Amount Per Serving
Trans Fat 0g
calories 190
Cholesterol 0mg
ories from Fat 110
Sodium 20mg
ntDaily Values are based on
Vitamin A 50
calorie diet
Tip
You can explore Azure AI Vision's OCR capabilities further in the Read text with Azure AI
Vision module on Microsoft Learn.
Azure AI Vision can also detect and locate common objects in an image, returning a label and a confidence score for each one. For example, analyzing an image of a person on a skateboard might return the following objects:

Skateboard (90.40%)
Person (95.5%)
The predictions include a confidence score that indicates the probability the model has
calculated for the predicted objects.
In addition to the detected object labels and their probabilities, Azure AI Vision returns
bounding box coordinates that indicate the top, left, width, and height of the object detected.
You can use these coordinates to determine where in the image each object was detected, like
this:
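For example, this sketch uses the Pillow imaging library to draw a bounding box on the image; the coordinate values here are invented for illustration:

```python
from PIL import Image, ImageDraw

image = Image.open("skateboarder.jpg")  # a local copy of the analyzed image
draw = ImageDraw.Draw(image)

# Azure AI Vision reports each box as top, left, width, and height in pixels;
# these particular numbers are made up for the example.
box = {"x": 90, "y": 60, "w": 420, "h": 380}
draw.rectangle(
    [(box["x"], box["y"]), (box["x"] + box["w"], box["y"] + box["h"])],
    outline="magenta",
    width=3,
)
image.save("skateboarder_annotated.jpg")
```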
Azure AI Vision can also suggest tags for an image based on its contents, each with a confidence score. For example, the same skateboarding image might return tags like these:

sport (99.60%)
person (99.56%)
footwear (98.05%)
skating (96.27%)
boardsport (95.58%)
skateboarding equipment (94.43%)
clothing (94.02%)
wall (93.81%)
skateboarding (93.78%)
skateboarder (93.25%)
individual sports (92.80%)
street stunts (90.81%)
balance (90.81%)
jumping (89.87%)
sports equipment (88.61%)
extreme sport (88.35%)
kickflip (88.18%)
stunt (87.27%)
skateboard (86.87%)
stunt performer (85.83%)
knee (85.30%)
sports (85.24%)
longboard (84.61%)
longboarding (84.45%)
riding (73.37%)
skate (67.27%)
air (64.83%)
young (63.29%)
outdoor (61.39%)
Image classification
An image classification model is used to predict the category, or class, of an image. For
example, you could train a model to determine which type of fruit is shown in an image, like
this:
Apple Banana Orange
Object detection
Object detection models detect and classify objects in an image, returning bounding box
coordinates to locate each object. In addition to the built-in object detection capabilities in
Azure AI Vision, you can train a custom object detection model with your own images. For
example, you could use photographs of fruit to train a model that detects multiple fruits in an
image, like this:
Note

Details of how to use Azure AI Vision to train a custom model are beyond the scope of this module. You can find information about custom model training in the Azure AI Vision documentation.
Knowledge check
Pixels
✓ Correct. Pixels are numeric values that represent shade intensity for points in the image.
Image file names
2. You want to use the Azure AI Vision service to analyze images. You also want to use the
Azure AI Language service to analyze text. You want developers to require only one key and
endpoint to access all of your services. What kind of resource should you create in your
Azure subscription? *
Azure AI Vision
Azure AI services
✓ Correct. An Azure AI services resource supports both Azure AI Vision and Azure AI Language.
Azure OpenAI service
3. You want to use the Azure AI Vision service to identify the location of individual items in
an image. Which of the following features should you retrieve? *
Objects
✓ Correct. Azure AI Vision returns objects with a bounding box to indicate their location in the image.
Visual Tags
Dense Captions
Introduction
Face detection and analysis is an area of artificial intelligence (AI) which uses algorithms to
locate and analyze human faces in images or video content.
There are many applications for face detection, analysis, and recognition.
Face detection involves identifying regions of an image that contain a human face, typically by
returning bounding box coordinates that form a rectangle around the face, like this:
With face analysis, machine learning models can be trained to return additional information, such as the locations of facial features (the nose, eyes, eyebrows, lips, and others).
Facial recognition
A further application of facial analysis is to train a machine learning model to identify known individuals from their facial features. This is known as facial recognition, and it uses multiple images of an individual to train the model so that it can identify those individuals in new images on which it wasn't trained.
When used responsibly, facial recognition is an important and useful technology that can
improve efficiency, security, and customer experiences. Next we'll explore Azure AI Face
service, which provides pre-trained models to detect, recognize, and analyze faces.
Microsoft Azure provides multiple Azure AI services that you can use to detect and analyze
faces, including:
Azure AI Vision, which offers face detection and some basic face analysis, such as returning the bounding box coordinates around each detected face.
Azure AI Video Indexer, which you can use to detect and identify faces in a video.
Azure AI Face, which offers pre-built algorithms that can detect, recognize, and analyze
faces.
Of these, Face offers the widest range of facial analysis capabilities.
The Azure AI Face service can return the rectangle coordinates of any human faces found in an image, as well as a range of related attributes, such as:

Accessories: indicates whether the given face has accessories. This attribute returns possible accessories including headwear, glasses, and mask, with a confidence score between zero and one for each accessory.
Blur: how blurred the face is, which can be an indication of how likely the face is to be
the main focus of the image.
Exposure: whether the face is underexposed or overexposed. This applies to the face in the image and not the overall image exposure.
Glasses: whether or not the person is wearing glasses.
Head pose: the face's orientation in a 3D space.
Mask: indicates whether the face is wearing a mask.
Noise: refers to visual noise in the image. If you have taken a photo with a high ISO setting in darker conditions, you would notice this noise in the image; it looks grainy or full of tiny dots that make the image less clear.
Occlusion: determines if there might be objects blocking the face in the image.
Quality For Recognition: a rating of high, medium, or low that reflects whether the image is of sufficient quality to attempt face recognition.
Responsible AI use
Important

The Limited Access policy requires customers to submit an intake form to access additional Azure AI Face service capabilities.

To use Azure AI Face, you must create one of the following types of resource in your Azure subscription:
Face: Use this specific resource type if you don't intend to use any other Azure AI
services, or if you want to track utilization and costs for Face separately.
Azure AI services: A general resource that includes Azure AI Face along with many other
Azure AI services such as Azure AI Content Safety, Azure AI Language, and others. Use
this resource type if you plan to use multiple Azure AI services and want to simplify
administration and development.
To get the best results, the images you submit should meet the following requirements and limitations:

Image format - supported images are JPEG, PNG, GIF, and BMP.
File size - 6 MB or smaller.
Face size range - from 36 x 36 pixels up to 4096 x 4096 pixels. Smaller or larger faces will
not be detected.
Other issues - face detection can be impaired by extreme face angles, extreme lighting,
and occlusion (objects blocking the face such as a hand).
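As a practical illustration (a sketch, not part of the service), the following Python helper uses the Pillow library to check an image file against some of these requirements before submitting it:

```python
import os
from PIL import Image

SUPPORTED_FORMATS = {"JPEG", "PNG", "GIF", "BMP"}
MAX_FILE_SIZE = 6 * 1024 * 1024  # 6 MB

def check_face_input(path: str) -> list[str]:
    """Return a list of problems found; an empty list means no obvious issues."""
    problems = []
    if os.path.getsize(path) > MAX_FILE_SIZE:
        problems.append("File is larger than 6 MB.")
    with Image.open(path) as img:
        if img.format not in SUPPORTED_FORMATS:
            problems.append(f"Unsupported format: {img.format}")
        if min(img.size) < 36:
            problems.append("Image is too small to contain a 36 x 36 pixel face.")
    return problems

print(check_face_input("photo.jpg"))
```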
" 200 XP
Knowledge check
1. How does the Face service indicate the location of faces in images? *
A pair of coordinates for each face, indicating the center of the face
Two pairs of coordinates for each face, indicating the location of the eyes
A set of coordinates for each face, defining a rectangular bounding box around
the face
✓ Correct: The locations of detected faces are indicated by coordinates for a rectangular bounding box.
Glasses
Extreme angles
✓ Correct: Best results are obtained when the faces are full-frontal or as near as possible to full-frontal.
Fast shutter speed
3. What two actions are required to try out the capabilities of the Face service? *
Introduction
Suppose you have image files of road signs, advertisements, or writing on a chalk board.
Machines can read the text in the images using optical character recognition (OCR), the
capability for artificial intelligence (AI) to process words in images into machine-readable text.
In this module, we'll focus on the use of OCR technologies to detect text in images and
convert it into a text-based data format, which can then be stored, printed, or used as the
input for further processing or analysis.
Uses of OCR
Automating text processing can improve the speed and efficiency of work by removing the
need for manual data entry. The ability to recognize printed and handwritten text in images is
beneficial in scenarios such as note taking, digitizing medical records or historical documents,
scanning checks for bank deposits, and more.
The ability for computer systems to process written and printed text is an area of AI where
computer vision intersects with natural language processing. Vision capabilities are needed to
"read" the text, and then natural language processing capabilities make sense of it.
OCR is the foundation of processing text in images and uses machine learning models that are
trained to recognize individual shapes as letters, numerals, punctuation, or other elements of
text. Much of the early work on implementing this kind of capability was performed by postal
services to support automatic sorting of mail based on postal codes. Since then, the state-of-
the-art for reading text has moved on, and we have models that detect printed or handwritten
text in an image and read it line-by-line and word-by-word.
The Read API, also known as the Read OCR engine, uses the latest recognition models and is optimized for images that have a significant amount of text or considerable visual noise. It can automatically determine the proper recognition model to use, taking into consideration the number of lines of text, images that include text, and handwriting.
The OCR engine takes in an image file and identifies bounding boxes, or coordinates, where
items are located within an image. In OCR, the model identifies bounding boxes around
anything that appears to be text in the image.
Calling the Read API returns results arranged into the following hierarchy:
Pages - One for each page of text, including information about the page size and
orientation.
Lines - The lines of text on a page.
Words - The words in a line of text, including the bounding box coordinates and text
itself.
Each line and word includes bounding box coordinates indicating its position on the page.
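To picture that hierarchy, here's a small sketch that walks a stand-in result; the field names follow the pages, lines, and words structure described above but are illustrative rather than the service's exact JSON:

```python
# A stand-in for the JSON returned by the Read API; values are invented.
result = {
    "pages": [
        {
            "width": 8.5, "height": 11.0, "angle": 0,
            "lines": [
                {
                    "text": "Nutrition Facts",
                    "boundingBox": [0.5, 0.4, 3.2, 0.4, 3.2, 0.9, 0.5, 0.9],
                    "words": [
                        {"text": "Nutrition", "boundingBox": [0.5, 0.4, 2.0, 0.4, 2.0, 0.9, 0.5, 0.9]},
                        {"text": "Facts", "boundingBox": [2.1, 0.4, 3.2, 0.4, 3.2, 0.9, 2.1, 0.9]},
                    ],
                }
            ],
        }
    ]
}

for page in result["pages"]:
    for line in page["lines"]:
        print(line["text"])                    # each line of text on the page
        for word in line["words"]:
            print(" ", word["text"], word["boundingBox"])
```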
" 100 XP
To use the Azure AI Vision service you must first create a resource for it in your Azure
subscription. You can use either of the following resource types:
Azure AI Vision: A specific resource for vision services. Use this resource type if you don't
intend to use any other AI services, or if you want to track utilization and costs for your AI
Vision resource separately.
Azure AI services: A general resource that includes Azure AI Vision along with many
other Azure AI services such as Azure AI Language, Azure AI Speech, and others. Use this
resource type if you plan to use multiple Azure AI services and want to simplify
administration and development.
Once you've created a resource, there are several ways to use Azure AI Vision's Read API:
Vision Studio
REST API
Software Development Kits (SDKs): Python, C#, JavaScript
From the Vision Studio home page, you can select Optical Character Recognition and the
Extract text from images tile to try out the Read OCR engine. Your resource begins to incur
usage costs when it is used to return results. Using one of your own files or a sample file, you
can see how the Read OCR engine returns detected attributes. These attributes correspond
with what the machine detects in the bounding boxes.
Behind the scenes, the image is analyzed for features including people, text, and objects, and
marked by bounding boxes. The detected information is processed and the results are
returned to the user. The raw results are returned in JSON and include information about the
bounding box locations on the page, and the detected text. Keep in mind that Vision Studio
can return examples of OCR, but to build your own OCR application, you need to work with an
SDK or REST API.
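For example, here's a sketch of calling the Read OCR capability from the Python SDK (the azure-ai-vision-imageanalysis package); the endpoint, key, and image URL are placeholders, and this client library groups the results as blocks of lines and words:

```python
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

client = ImageAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

# Request only the READ (OCR) feature for an image containing text.
result = client.analyze_from_url(
    image_url="https://example.com/nutrition-label.jpg",
    visual_features=[VisualFeatures.READ],
)

if result.read is not None:
    for block in result.read.blocks:
        for line in block.lines:
            print(line.text)  # a full line of detected text
            for word in line.words:
                print(" ", word.text, f"({word.confidence:.2%})")
```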
Knowledge check
1. You want to extract text from images and then use Azure AI Language to analyze the text.
You want developers to require only one key and endpoint to access all of your services.
What kind of resource should you create in your Azure subscription? *
Azure AI Vision
Azure AI services
✓ Correct. An Azure AI services resource supports both Azure AI Vision for text extraction and Azure AI Language for text analytics.
Azure AI Language
2. You plan to use Azure AI Vision's Read API. What results can the Read API provide? *
Results arranged by pages that have photographs first, then pages that
exclusively have text