Module 2
Introduction
Computer vision is one of the core areas of artificial intelligence (AI), and focuses on creating
solutions that enable AI applications to "see" the world and make sense of it.
Of course, computers don't have biological eyes that work the way ours do, but they're
capable of processing images, either from a live camera feed or from digital photographs or
videos. This ability to process images is the key to creating software that can emulate human
visual perception.
In this module, we'll examine some of the fundamental principles and techniques that underlie
computer vision. We'll also introduce Microsoft Azure AI Vision, a cloud service that
developers can use to create a wide range of computer vision solutions.
Before we can explore image processing and other computer vision capabilities, it's useful to consider what an image actually is in the context of data for a computer program. To a computer, an image is an array of numeric pixel values. For example, consider the following array:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
The array consists of seven rows and seven columns, representing the pixel values for a 7x7
pixel image (which is known as the image's resolution). Each pixel has a value between 0
(black) and 255 (white); with values between these bounds representing shades of gray. The
image represented by this array looks similar to the following (magnified) image:
The array of pixel values for this image is two-dimensional (representing rows and columns, or
x and y coordinates) and defines a single rectangle of pixel values. A single layer of pixel values
like this represents a grayscale image. In reality, most digital images are multidimensional and
consist of three layers (known as channels) that represent red, green, and blue (RGB) color
hues. For example, we could represent a color image by defining three channels of pixel values
that create the same square shape as the previous grayscale example:
Red:
150 150 150 150 150 150 150
150 150 150 150 150 150 150
150 150 255 255 255 150 150
150 150 255 255 255 150 150
150 150 255 255 255 150 150
150 150 150 150 150 150 150
150 150 150 150 150 150 150
Green:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
Blue:
255 255 255 255 255 255 255
255 255 255 255 255 255 255
255 255 0 0 0 255 255
255 255 0 0 0 255 255
255 255 0 0 0 255 255
255 255 255 255 255 255 255
255 255 255 255 255 255 255
Combining the three channels produces a color image in which the background pixels have the values Red: 150, Green: 0, Blue: 255 (purple) and the central square has the values Red: 255, Green: 255, Blue: 0 (yellow).
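To make the pixel-array idea concrete, here's a minimal sketch (using Python and NumPy; the array shapes and values mirror the examples above, everything else is illustrative) that builds the grayscale image and the three-channel color image as arrays:

```python
import numpy as np

# 7x7 grayscale image: a white (255) square on a black (0) background.
gray = np.zeros((7, 7), dtype=np.uint8)
gray[2:5, 2:5] = 255

# The three color channels from the example above.
red = np.full((7, 7), 150, dtype=np.uint8)
green = np.zeros((7, 7), dtype=np.uint8)
blue = np.full((7, 7), 255, dtype=np.uint8)
red[2:5, 2:5] = 255    # the central square is red 255 ...
green[2:5, 2:5] = 255  # ... green 255 ...
blue[2:5, 2:5] = 0     # ... and blue 0 (yellow)

# Stack the channels into a single 7x7x3 color image.
rgb = np.dstack([red, green, blue])
print(gray.shape, rgb.shape)  # (7, 7) (7, 7, 3)
```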
A common way to perform image processing tasks is to apply filters that modify the pixel values of an image to create a visual effect. A filter is defined by one or more arrays of values, known as filter kernels. For example, the following array defines a 3x3 filter kernel:

-1 -1 -1
-1 8 -1
-1 -1 -1
The kernel is then convolved across the image, calculating a weighted sum for each 3x3 patch
of pixels and assigning the result to a new image. It's easier to understand how the filtering
works by exploring a step-by-step example.
Let's start with the grayscale image we explored previously:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
First, we apply the filter kernel to the top left patch of the image, multiplying each pixel value
by the corresponding weight value in the kernel and adding the results:
(0 x -1) + (0 x -1) + (0 x -1) +
(0 x -1) + (0 x 8) + (0 x -1) +
(0 x -1) + (0 x -1) + (255 x -1) = -255
The result (-255) becomes the first value in a new array. Then we move the filter kernel along one pixel to the right and repeat the operation:

(0 x -1) + (0 x -1) + (0 x -1) +
(0 x -1) + (0 x 8) + (0 x -1) +
(0 x -1) + (255 x -1) + (255 x -1) = -510

Again, the result is added to the new array, which now contains two values:

-255 -510
The process is repeated until the filter has been convolved across the entire image, as shown
in this animation:
The filter is convolved across the image, calculating a new array of values. Some of the values
might be outside of the 0 to 255 pixel value range, so the values are adjusted to fit into that
range. Because of the shape of the filter, the outside edge of pixels isn't calculated, so a
padding value (usually 0) is applied. The resulting array represents a new image in which the
filter has transformed the original image. In this case, the filter has had the effect of
highlighting the edges of shapes in the image.
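To illustrate the whole process, here's a rough sketch (not part of the module's walkthrough) that applies the kernel to the 7x7 grayscale image with plain NumPy, leaving the uncalculated edge pixels at the padding value 0 and clipping the results into the 0 to 255 range:

```python
import numpy as np

# The 7x7 grayscale image and the kernel from the walkthrough above.
image = np.zeros((7, 7), dtype=int)
image[2:5, 2:5] = 255

kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

# Convolve: for each 3x3 patch, multiply by the kernel weights and sum.
output = np.zeros_like(image)  # edge pixels keep the padding value 0
for row in range(1, 6):
    for col in range(1, 6):
        patch = image[row - 1:row + 2, col - 1:col + 2]
        output[row, col] = np.sum(patch * kernel)

output = np.clip(output, 0, 255)  # adjust values back into the 0-255 range
print(output)
```

The first calculated value, at position (1, 1), is -255, just as in the step-by-step example, and is clipped to 0 in the final image.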
To see the effect of the filter more clearly, here's an example of the same filter applied to a real
image:
Because the filter is convolved across the image, this kind of image manipulation is often referred to as convolutional filtering. The filter used in this example is a particular type of filter (called a Laplace filter) that highlights the edges of objects in an image. There are many other kinds of filters that you can use to create blurring, sharpening, color inversion, and other effects.
The ability to use filters to apply effects to images is useful in image processing tasks, such as
you might perform with image editing software. However, the goal of computer vision is often
to extract meaning, or at least actionable insights, from images; which requires the creation of
machine learning models that are trained to recognize features based on large volumes of
existing images.
Tip
This unit assumes you are familiar with the fundamental principles of machine learning,
and that you have conceptual knowledge of deep learning with neural networks. If you
are new to machine learning, consider completing the Fundamentals of machine
learning module on Microsoft Learn.
One of the most common machine learning model architectures for computer vision is a convolutional neural network (CNN), a type of deep learning architecture that uses the filtering technique described previously to extract features from images. For example, a CNN might be trained to classify images based on the type of fruit they contain. During the training process for the CNN, filter kernels are initially defined using randomly generated weight values. Then, as the training process progresses, the model's predictions are evaluated against known label values, and the filter weights are adjusted to improve accuracy. Eventually, the trained fruit image classification model uses the filter weights that best extract features that help identify different kinds of fruit.
The following diagram illustrates how a CNN for an image classification model works:
1. Images with known labels (for example, 0: apple, 1: banana, or 2: orange) are fed into the
network to train the model.
2. One or more layers of filters are used to extract features from each image as it is fed through the network. The filter kernels start with randomly assigned weights and generate arrays of numeric values called feature maps.
3. The feature maps are flattened into a single dimensional array of feature values.
4. The feature values are fed into a fully connected neural network.
5. The output layer of the neural network uses a softmax or similar function to produce a
result that contains a probability value for each possible class, for example [0.2, 0.5, 0.3].
During training the output probabilities are compared to the actual class label - for example,
an image of a banana (class 1) should have the value [0.0, 1.0, 0.0]. The difference between the
predicted and actual class scores is used to calculate the loss in the model, and the weights in
the fully connected neural network and the filter kernels in the feature extraction layers are
modified to reduce the loss.
The training process repeats over multiple epochs until an optimal set of weights has been
learned. Then, the weights are saved and the model can be used to predict labels for new
images for which the label is unknown.
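As a rough sketch of these steps in code (not the module's own example), the following PyTorch snippet defines a small CNN for the fruit example and performs a single training step; the layer sizes, input image size, and optimizer settings are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FruitCNN(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        # Steps 1-2: convolutional filter layers whose kernel weights start out
        # random and are learned during training, producing feature maps.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Steps 3-4: flatten the feature maps and feed them into a fully
        # connected network.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, num_classes),  # assumes 64x64 RGB input images
        )

    def forward(self, x):
        return self.classifier(self.features(x))  # raw class scores (logits)

model = FruitCNN()
loss_fn = nn.CrossEntropyLoss()  # applies softmax and compares to the true labels
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One training step on a dummy batch of four 64x64 images with known labels.
images = torch.rand(4, 3, 64, 64)
labels = torch.tensor([0, 1, 2, 1])    # e.g. apple, banana, orange, banana
loss = loss_fn(model(images), labels)  # step 5: compare predictions to labels
loss.backward()                        # work out how to adjust the weights
optimizer.step()                       # update filter kernels and dense layers

# After training, softmax turns the scores into class probabilities per image,
# for example [0.2, 0.5, 0.3].
probs = torch.softmax(model(images), dim=1)
```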
Note
CNN architectures usually include multiple convolutional filter layers and additional layers
to reduce the size of feature maps, constrain the extracted values, and otherwise
manipulate the feature values. These layers have been omitted in this simplified example
to focus on the key concept, which is that filters are used to extract numeric features from
images, which are then used in a neural network to predict image labels.
Transformers and multi-modal models
CNNs have been at the core of computer vision solutions for many years. While they're
commonly used to solve image classification problems as described previously, they're also
the basis for more complex computer vision models. For example, object detection models
combine CNN feature extraction layers with the identification of regions of interest in images
to locate multiple classes of object in the same image.
Transformers
Most advances in computer vision over the decades have been driven by improvements in
CNN-based models. However, in another AI discipline, natural language processing (NLP), another type of neural network architecture, called a transformer, has enabled the
development of sophisticated models for language. Transformers work by processing huge
volumes of data, and encoding language tokens (representing individual words or phrases) as
vector-based embeddings (arrays of numeric values). You can think of an embedding as
representing a set of dimensions that each represent some semantic attribute of the token.
The embeddings are created such that tokens that are commonly used in the same context
define vectors that are more closely aligned than unrelated words.
As a simple example, the following diagram shows some words encoded as three-dimensional
vectors, and plotted in a 3D space:
Tokens that are semantically similar are encoded in similar directions, creating a semantic
language model that makes it possible to build sophisticated NLP solutions for text analysis,
translation, language generation, and other tasks.
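As a toy illustration with invented vectors, the following sketch measures how closely aligned two embeddings are by using cosine similarity:

```python
import numpy as np

# Invented three-dimensional embeddings for a few tokens.
embeddings = {
    "dog": np.array([10.3, 4.5, 1.2]),
    "puppy": np.array([9.7, 5.1, 0.9]),
    "skateboard": np.array([-2.1, 8.4, 7.7]),
}

def cosine_similarity(a, b):
    """1.0 means the vectors point in exactly the same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["dog"], embeddings["puppy"]))       # ~0.99: related tokens
print(cosine_similarity(embeddings["dog"], embeddings["skateboard"]))  # ~0.19: unrelated tokens
```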
Note
We've used only three dimensions, because that's easy to visualize. In reality, encoders in
transformer networks create vectors with many more dimensions, defining complex
semantic relationships between tokens based on linear algebraic calculations. The math
involved is complex, as is the architecture of a transformer model. Our goal here is just to
provide a conceptual understanding of how encoding creates a model that encapsulates
relationships between entities.
Multi-modal models
The success of transformers as a way to build language models has led AI researchers to
consider whether the same approach would be effective for image data. The result is the
development of multi-modal models, in which the model is trained using a large volume of
captioned images, with no fixed labels. An image encoder extracts features from images based
on pixel values and combines them with text embeddings created by a language encoder. The
overall model encapsulates relationships between natural language token embeddings and
image features, as shown here:
The Microsoft Florence model is just such a model. Trained with huge volumes of captioned
images from the Internet, it includes both a language encoder and an image encoder. Florence
is an example of a foundation model: a pre-trained general model on which you can build multiple adaptive models for specialist tasks. For example, you can use Florence as a foundation model for adaptive models that perform specialized computer vision tasks such as image classification and object detection.
Azure AI Vision
While you can train your own machine learning models for computer vision, the architecture for computer vision models can be complex, and you need significant volumes of training images and compute power to perform the training process.
Microsoft's Azure AI Vision service provides prebuilt and customizable computer vision models
that are based on the Florence foundation model and provide various powerful capabilities.
With Azure AI Vision, you can create sophisticated computer vision solutions quickly and easily, taking advantage of "off-the-shelf" functionality for many common computer vision scenarios, while retaining the ability to create custom models using your own images.

To use Azure AI Vision, you need to create a resource for it in your Azure subscription. You can use either of the following resource types:

Azure AI Vision: A specific resource for the Azure AI Vision service. Use this resource type if you don't intend to use any other Azure AI services, or if you want to track utilization and costs for your Azure AI Vision resource separately.
Azure AI services: A general resource that includes Azure AI Vision along with many
other Azure AI services; such as Azure AI Language, Azure AI Custom Vision, Azure AI
Translator, and others. Use this resource type if you plan to use multiple AI services and
want to simplify administration and development.
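As a sketch of what calling the service can look like, the following Python example uses the azure-ai-vision-imageanalysis client library; the endpoint, key, and image URL are placeholders, and the analyzed features are chosen just for illustration (check the current SDK documentation for the exact methods available):

```python
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

# Placeholders: use the endpoint and key from your Azure AI Vision or
# Azure AI services resource.
client = ImageAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

result = client.analyze_from_url(
    image_url="https://example.com/skateboarder.jpg",
    visual_features=[VisualFeatures.CAPTION, VisualFeatures.TAGS, VisualFeatures.OBJECTS],
)

if result.caption is not None:
    print(f"Caption: {result.caption.text} ({result.caption.confidence:.2%})")

if result.tags is not None:
    for tag in result.tags.list:
        print(f"Tag: {tag.name} ({tag.confidence:.2%})")

if result.objects is not None:
    for detected in result.objects.list:
        box = detected.bounding_box  # x, y, width, and height in pixels
        print(f"Object: {detected.tags[0].name} at {box}")
```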
Azure AI Vision also provides optical character recognition (OCR) capabilities to extract text from images. For example, the service can analyze an image of a nutrition label and extract the following text:
Nutrition Facts Amount Per Serving
Serving size:1 bar (40g)
Serving Per Package: 4
Total Fat 13g
Saturated Fat 1.5g
Amount Per Serving
Trans Fat 0g
calories 190
Cholesterol 0mg
ories from Fat 110
Sodium 20mg
ntDaily Values are based on
Vitamin A 50
calorie diet
Tip
You can explore Azure AI Vision's OCR capabilities further in the Read text with Azure AI
Vision module on Microsoft Learn.
Azure AI Vision can also detect and locate common objects in an image, returning a label and a confidence score for each one. For example, analyzing an image of a person on a skateboard might return the following objects:

Skateboard (90.40%)
Person (95.5%)
The predictions include a confidence score that indicates the probability the model has
calculated for the predicted objects.
In addition to the detected object labels and their probabilities, Azure AI Vision returns
bounding box coordinates that indicate the top, left, width, and height of the object detected.
You can use these coordinates to determine where in the image each object was detected, like
this:
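For example, this sketch uses the Pillow imaging library to draw a bounding box on the image; the coordinate values here are invented for illustration:

```python
from PIL import Image, ImageDraw

image = Image.open("skateboarder.jpg")  # a local copy of the analyzed image
draw = ImageDraw.Draw(image)

# Azure AI Vision reports each box as top, left, width, and height in pixels;
# these particular numbers are made up for the example.
box = {"x": 90, "y": 60, "w": 420, "h": 380}
draw.rectangle(
    [(box["x"], box["y"]), (box["x"] + box["w"], box["y"] + box["h"])],
    outline="magenta",
    width=3,
)
image.save("skateboarder_annotated.jpg")
```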
Azure AI Vision can also suggest tags for an image based on its contents, each with a confidence score. For example, the same skateboarding image might return tags like these:

sport (99.60%)
person (99.56%)
footwear (98.05%)
skating (96.27%)
boardsport (95.58%)
skateboarding equipment (94.43%)
clothing (94.02%)
wall (93.81%)
skateboarding (93.78%)
skateboarder (93.25%)
individual sports (92.80%)
street stunts (90.81%)
balance (90.81%)
jumping (89.87%)
sports equipment (88.61%)
extreme sport (88.35%)
kickflip (88.18%)
stunt (87.27%)
skateboard (86.87%)
stunt performer (85.83%)
knee (85.30%)
sports (85.24%)
longboard (84.61%)
longboarding (84.45%)
riding (73.37%)
skate (67.27%)
air (64.83%)
young (63.29%)
outdoor (61.39%)
Image classification
An image classification model is used to predict the category, or class, of an image. For
example, you could train a model to determine which type of fruit is shown in an image, like
this:
Apple Banana Orange
Object detection
Object detection models detect and classify objects in an image, returning bounding box
coordinates to locate each object. In addition to the built-in object detection capabilities in
Azure AI Vision, you can train a custom object detection model with your own images. For
example, you could use photographs of fruit to train a model that detects multiple fruits in an
image, like this:
Note

Details of how to use Azure AI Vision to train a custom model are beyond the scope of this module. You can find information about custom model training in the Azure AI Vision documentation.
Knowledge check
Pixels
✓ Correct. Pixels are numeric values that represent shade intensity for points in the image.
Image file names
2. You want to use the Azure AI Vision service to analyze images. You also want to use the
Azure AI Language service to analyze text. You want developers to require only one key and
endpoint to access all of your services. What kind of resource should you create in your
Azure subscription? *
Azure AI Vision
Azure AI services
✓ Correct. An Azure AI services resource supports both Azure AI Vision and Azure AI Language.
Azure OpenAI service
3. You want to use the Azure AI Vision service to identify the location of individual items in
an image. Which of the following features should you retrieve? *
Objects
✓ Correct. Azure AI Vision returns objects with a bounding box to indicate their location in the image.
Visual Tags
Dense Captions
Introduction
Face detection and analysis is an area of artificial intelligence (AI) which uses algorithms to
locate and analyze human faces in images or video content.
There are many applications for face detection, analysis, and recognition.
Face detection involves identifying regions of an image that contain a human face, typically by
returning bounding box coordinates that form a rectangle around the face, like this:
With face analysis, machine learning models can be trained to return additional information, such as the locations of facial features (the nose, eyes, eyebrows, lips, and others).
Facial recognition
A further application of facial analysis is to train a machine learning model to identify known individuals from their facial features. This is known as facial recognition, and it uses multiple images of an individual to train the model so that it can identify those individuals in new images on which it wasn't trained.
When used responsibly, facial recognition is an important and useful technology that can
improve efficiency, security, and customer experiences. Next we'll explore Azure AI Face
service, which provides pre-trained models to detect, recognize, and analyze faces.
Microsoft Azure provides multiple Azure AI services that you can use to detect and analyze
faces, including:
Azure AI Vision, which offers face detection and some basic face analysis, such as returning the bounding box coordinates around each detected face.
Azure AI Video Indexer, which you can use to detect and identify faces in a video.
Azure AI Face, which offers pre-built algorithms that can detect, recognize, and analyze
faces.
Of these, Face offers the widest range of facial analysis capabilities.
The Azure AI Face service can return the rectangle coordinates of any human faces found in an image, as well as a range of related attributes, such as:

Accessories: indicates whether the given face has accessories. This attribute returns possible accessories including headwear, glasses, and mask, with a confidence score between zero and one for each accessory.
Blur: how blurred the face is, which can be an indication of how likely the face is to be
the main focus of the image.
Exposure: whether the face is underexposed or overexposed. This applies to the face in the image and not the overall image exposure.
Glasses: whether or not the person is wearing glasses.
Head pose: the face's orientation in a 3D space.
Mask: indicates whether the face is wearing a mask.
Noise: refers to visual noise in the image. If you have taken a photo with a high ISO setting in darker conditions, you would notice this noise in the image; it looks grainy or full of tiny dots that make the image less clear.
Occlusion: determines if there might be objects blocking the face in the image.
Quality For Recognition: a rating of high, medium, or low that reflects whether the image is of sufficient quality to attempt face recognition.
Responsible AI use
Important

The Limited Access policy requires customers to submit an intake form to access additional Azure AI Face service capabilities.

To use Azure AI Face, you must create one of the following types of resource in your Azure subscription:
Face: Use this specific resource type if you don't intend to use any other Azure AI
services, or if you want to track utilization and costs for Face separately.
Azure AI services: A general resource that includes Azure AI Face along with many other
Azure AI services such as Azure AI Content Safety, Azure AI Language, and others. Use
this resource type if you plan to use multiple Azure AI services and want to simplify
administration and development.
To get the best results, the images you submit should meet the following requirements and limitations:

Image format - supported images are JPEG, PNG, GIF, and BMP.
File size - 6 MB or smaller.
Face size range - from 36 x 36 pixels up to 4096 x 4096 pixels. Smaller or larger faces will
not be detected.
Other issues - face detection can be impaired by extreme face angles, extreme lighting,
and occlusion (objects blocking the face such as a hand).
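As a practical illustration (a sketch, not part of the service), the following Python helper uses the Pillow library to check an image file against some of these requirements before submitting it:

```python
import os
from PIL import Image

SUPPORTED_FORMATS = {"JPEG", "PNG", "GIF", "BMP"}
MAX_FILE_SIZE = 6 * 1024 * 1024  # 6 MB

def check_face_input(path: str) -> list[str]:
    """Return a list of problems found; an empty list means no obvious issues."""
    problems = []
    if os.path.getsize(path) > MAX_FILE_SIZE:
        problems.append("File is larger than 6 MB.")
    with Image.open(path) as img:
        if img.format not in SUPPORTED_FORMATS:
            problems.append(f"Unsupported format: {img.format}")
        if min(img.size) < 36:
            problems.append("Image is too small to contain a 36 x 36 pixel face.")
    return problems

print(check_face_input("photo.jpg"))
```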
" 200 XP
Knowledge check
1. How does the Face service indicate the location of faces in images? *
A pair of coordinates for each face, indicating the center of the face
Two pairs of coordinates for each face, indicating the location of the eyes
A set of coordinates for each face, defining a rectangular bounding box around
the face
✓ Correct: The locations of detected faces are indicated by coordinates for a rectangular bounding box.
Glasses
Extreme angles
✓ Correct: Best results are obtained when the faces are full-frontal or as near as possible to full-frontal.
Fast shutter speed
3. What two actions are required to try out the capabilities of the Face service? *
Introduction
Suppose you have image files of road signs, advertisements, or writing on a chalk board.
Machines can read the text in the images using optical character recognition (OCR), the
capability for artificial intelligence (AI) to process words in images into machine-readable text.
In this module, we'll focus on the use of OCR technologies to detect text in images and
convert it into a text-based data format, which can then be stored, printed, or used as the
input for further processing or analysis.
Uses of OCR
Automating text processing can improve the speed and efficiency of work by removing the
need for manual data entry. The ability to recognize printed and handwritten text in images is
beneficial in scenarios such as note taking, digitizing medical records or historical documents,
scanning checks for bank deposits, and more.
The ability for computer systems to process written and printed text is an area of AI where
computer vision intersects with natural language processing. Vision capabilities are needed to
"read" the text, and then natural language processing capabilities make sense of it.
OCR is the foundation of processing text in images and uses machine learning models that are
trained to recognize individual shapes as letters, numerals, punctuation, or other elements of
text. Much of the early work on implementing this kind of capability was performed by postal
services to support automatic sorting of mail based on postal codes. Since then, the state-of-
the-art for reading text has moved on, and we have models that detect printed or handwritten
text in an image and read it line-by-line and word-by-word.
The Read API, also known as the Read OCR engine, uses the latest recognition models and is optimized for images that have a significant amount of text or considerable visual noise. It can automatically determine the proper recognition model to use, taking into consideration the number of lines of text, images that include text, and handwriting.
The OCR engine takes in an image file and identifies bounding boxes, or coordinates, where
items are located within an image. In OCR, the model identifies bounding boxes around
anything that appears to be text in the image.
Calling the Read API returns results arranged into the following hierarchy:
Pages - One for each page of text, including information about the page size and
orientation.
Lines - The lines of text on a page.
Words - The words in a line of text, including the bounding box coordinates and text
itself.
Each line and word includes bounding box coordinates indicating its position on the page.
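To picture that hierarchy, here's a small sketch that walks a stand-in result; the field names follow the pages, lines, and words structure described above but are illustrative rather than the service's exact JSON:

```python
# A stand-in for the JSON returned by the Read API; values are invented.
result = {
    "pages": [
        {
            "width": 8.5, "height": 11.0, "angle": 0,
            "lines": [
                {
                    "text": "Nutrition Facts",
                    "boundingBox": [0.5, 0.4, 3.2, 0.4, 3.2, 0.9, 0.5, 0.9],
                    "words": [
                        {"text": "Nutrition", "boundingBox": [0.5, 0.4, 2.0, 0.4, 2.0, 0.9, 0.5, 0.9]},
                        {"text": "Facts", "boundingBox": [2.1, 0.4, 3.2, 0.4, 3.2, 0.9, 2.1, 0.9]},
                    ],
                }
            ],
        }
    ]
}

for page in result["pages"]:
    for line in page["lines"]:
        print(line["text"])                    # each line of text on the page
        for word in line["words"]:
            print(" ", word["text"], word["boundingBox"])
```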
" 100 XP
To use the Azure AI Vision service you must first create a resource for it in your Azure
subscription. You can use either of the following resource types:
Azure AI Vision: A specific resource for vision services. Use this resource type if you don't
intend to use any other AI services, or if you want to track utilization and costs for your AI
Vision resource separately.
Azure AI services: A general resource that includes Azure AI Vision along with many
other Azure AI services such as Azure AI Language, Azure AI Speech, and others. Use this
resource type if you plan to use multiple Azure AI services and want to simplify
administration and development.
Once you've created a resource, there are several ways to use Azure AI Vision's Read API:
Vision Studio
REST API
Software Development Kits (SDKs): Python, C#, JavaScript
From the Vision Studio home page, you can select Optical Character Recognition and the
Extract text from images tile to try out the Read OCR engine. Your resource begins to incur
usage costs when it is used to return results. Using one of your own files or a sample file, you
can see how the Read OCR engine returns detected attributes. These attributes correspond
with what the machine detects in the bounding boxes.
Behind the scenes, the image is analyzed for features including people, text, and objects, and
marked by bounding boxes. The detected information is processed and the results are
returned to the user. The raw results are returned in JSON and include information about the
bounding box locations on the page, and the detected text. Keep in mind that Vision Studio
can return examples of OCR, but to build your own OCR application, you need to work with an
SDK or REST API.
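For example, here's a sketch of calling the Read OCR capability from the Python SDK (the azure-ai-vision-imageanalysis package); the endpoint, key, and image URL are placeholders, and this client library groups the results as blocks of lines and words:

```python
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

client = ImageAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

# Request only the READ (OCR) feature for an image containing text.
result = client.analyze_from_url(
    image_url="https://example.com/nutrition-label.jpg",
    visual_features=[VisualFeatures.READ],
)

if result.read is not None:
    for block in result.read.blocks:
        for line in block.lines:
            print(line.text)  # a full line of detected text
            for word in line.words:
                print(" ", word.text, f"({word.confidence:.2%})")
```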
Knowledge check
1. You want to extract text from images and then use Azure AI Language to analyze the text.
You want developers to require only one key and endpoint to access all of your services.
What kind of resource should you create in your Azure subscription? *
Azure AI Vision
Azure AI services
✓ Correct. An Azure AI services resource supports both Azure AI Vision for text extraction and Azure AI Language for text analytics.
Azure AI Language
2. You plan to use Azure AI Vision's Read API. What results can the Read API provide? *
Results arranged by pages that have photographs first, then pages that
exclusively have text