Azure Application Architecture Guide
Helping our customers design and architect new solutions is core to the Azure Architecture Center’s mission.
Architecture diagrams like those included in our guidance can help communicate design decisions and the
relationships between components of a given workload. On this page you will find an official collection of Azure
architecture icons including Azure product icons to help you build a custom architecture diagram for your next
solution.
Do’s
Use the icon to illustrate how products can work together
In diagrams, we recommend including the product name somewhere close to the icon
Use the icons as they would appear within Azure
Don’ts
Don’t crop, flip or rotate icons
Don’t distort or change icon shape in any way
Don’t use Microsoft product icons to represent your product or service
Icon updates
As of November 2020, the folder structure of our collection of Azure architecture icons has changed. The FAQs and
Terms of Use PDF files appear in the first level when you download the SVG icons below. The files in the icons folder
are the same except there is no longer a CXP folder. If you encounter any issues, let us know.
Terms
Microsoft permits the use of these icons in architectural diagrams, training materials, or documentation. You may
copy, distribute, and display the icons only for the permitted use unless granted explicit permission by Microsoft.
Microsoft reserves all other rights.
What's new in the Azure Architecture Center
December 2020
New Articles
Performance testing
Testing tools
Refactor IBM z/OS mainframe Coupling Facility (CF) to Azure
Design scalable Azure applications
Plan for capacity
Design Azure applications for efficiency
Design for scaling
Updated Articles
Baseline architecture for an Azure Kubernetes Service (AKS) cluster (#32842d421)
November 2020
New Articles
Use Azure Stack HCI stretched clusters for disaster recovery
Use Azure Stack HCI switchless interconnect and lightweight quorum for Remote Office/Branch Office
Release Engineering Application Development
Release Engineering Continuous integration
Release Engineering Rollback
Project 15 from Microsoft Open Platform for Conservation and Ecological Sustainability Solutions
Unisys mainframe migration
Security logs and audits
Check for identity, network, data risks
Security operations in Azure
Security health modeling in Azure
Azure enterprise cloud file share
Modernize mainframe & midrange data
IoT event routing
Create a Gridwich environment
Gridwich cloud media system
Gridwich CI/CD pipeline
Gridwich clean monolith architecture
Gridwich content protection and DRM
Logging in Gridwich
Gridwich request-response messages
Gridwich project naming and namespaces
Gridwich saga orchestration
Gridwich Storage Service
Gridwich keys and secrets management
Gridwich Media Services setup and scaling
Gridwich pipeline-generated admin scripts
Gridwich Azure DevOps setup
Gridwich local development environment setup
Test Media Services V3 encoding
Gridwich variable flow
Updated Articles
Choosing a data storage technology (#4128cc2d9)
Process real-time vehicle data using IoT (#beeba69f6)
Security monitoring tools in Azure (#4f3a35043)
Building a CI/CD pipeline for microservices on Kubernetes (#c0135f775)
October 2020
New Articles
SQL Server 2008 R2 failover cluster in Azure
Kafka on Azure
Secure application's configuration and dependencies
Application classification for security
Application threat analysis
AKS triage - cluster health
AKS triage - container registry connectivity
AKS triage - admission controllers
AKS triage - workload deployments
AKS triage - node health
Azure Kubernetes Service (AKS) operations triage
Alerts in IoT Edge Vision
Camera selection for IoT Edge Vision
Hardware for IoT Edge Vision
Image storage in IoT Edge Vision
Azure IoT Edge Vision
Machine learning in IoT Edge Vision
User interface in IoT Edge Vision
Configure infrastructure
Repeatable Infrastructure
Automation overview of goals, best practices, and types in Azure
Automated Tasks
Partitioning in Event Hubs and Kafka
Retail - Buy online, pickup in store (BOPIS)
Magento e-commerce platform in Azure Kubernetes Service (AKS)
Web app private connectivity to Azure SQL database
Updated Articles
Security with identity and access management (IAM) in Azure (#2a1154709)
Regulatory compliance (#2a1154709)
Computer forensics Chain of Custody in Azure (#909b776f4)
Overview of the performance efficiency pillar (#ed89cf6ab)
Azure Kubernetes Service (AKS) solution journey (#b63ab6a9f)
GCP to Azure Services Comparison (#a45091c00)
Resiliency patterns (#b5201626c)
Choosing an Azure compute service (#a64329288)
Security patterns (#13add8a06)
September 2020
New Articles
Administrative account security
Enforce governance to reduce risks
Security management groups
Regulatory compliance
Team roles and responsibilities
Segmentation strategies
Tenancy model for SaaS applications
Migrate IBM mainframe applications to Azure with TmaxSoft OpenFrame
Security monitoring tools in Azure
Applications and services
Security storage in Azure
Azure Active Directory IDaaS in Security Operations
Capture cost requirements for an Azure
Tradeoffs for costs
Microsoft Azure Well-Architected Framework
Overview of the security pillar
Storage, data, and encryption in Azure
Move Azure resources across regions
FSLogix for the enterprise
Stromasys Charon-SSP Solaris emulator on Azure VMs
Azure Kubernetes Service (AKS) solution journey
IoT analyze and optimize loops
IoT measure and control loops
IoT monitor and manage loops
Hybrid and Multicloud Architectures
Azure Arc hybrid management and deployment for Kubernetes clusters
Manage configurations for Azure Arc enabled servers
Azure Automation in a hybrid environment
Using Azure file shares in a hybrid environment
Azure Functions in a hybrid environment
Connect standalone servers by using Azure Network Adapter
Back up files and applications on Azure Stack Hub
Disaster Recovery for Azure Stack Hub virtual machines
Azure Automation Update Management
Deploy AI and machine learning at the edge by using Azure Stack Edge
On-premises data gateway for Azure Logic Apps
Run containers in a hybrid environment
Design a hybrid Domain Name System solution with Azure
Hybrid file services
Hybrid availability and performance monitoring
Hybrid Security Monitoring using Azure Security Center and Azure Sentinel
Manage hybrid Azure workloads using Windows Admin Center
DevSecOps in GitHub
Network security review
Network security strategies
Security with identity and access management (IAM) in Azure
Azure Messaging cost estimates
Web application cost estimates
Updated Articles
DevTest and DevOps for IaaS solutions (#a2a167058)
DevTest and DevOps for microservice solutions (#a2a167058)
DevTest and DevOps for PaaS solutions (#a2a167058)
Baseline architecture for an Azure Kubernetes Service (AKS) cluster (#9b20a025d)
DevSecOps in Azure (#511e6ee92)
August 2020
New Articles
Multiple forests with AD DS, Azure AD, and Azure AD DS
Multiple forests with AD DS and Azure AD
Virtual network integrated serverless microservices
Big data analytics with Azure Data Explorer
Content Delivery Network analytics
Azure Data Explorer interactive analytics
IoT analytics with Azure Data Explorer
Azure Data Explorer monitoring
Custom Business Processes
Web and Mobile Front Ends
Line of Business Extension
Attestation, authentication, and provisioning
Field and cloud edge gateways
Condition Monitoring for Industrial IoT
Predictive Maintenance for Industrial IoT
Banking system cloud transformation on Azure
JMeter implementation reference for load testing pipeline solution
Patterns and implementations
IoT connected light, power, and internet for emerging markets
Compute
Scale IoT solutions with application stamps
Builders, developers, and operators
IoT application-to-device commands
IoT solution architecture
IoT solutions conceptual overview
Use cached data
Criteria for choosing a data store
Data store decision tree
Updated Articles
Network security and containment in Azure (#75626e3a7)
Data store cost estimates (#2b0e692f9)
Retry guidance for Azure services (#6c8a169c9)
Chaos engineering (#f742721a9)
Understand data store models (#4fbdd828a)
Cost governance for an Azure workload (#a3452805a)
Event-based cloud automation (#d2cca2011)
Serverless event processing (#d2cca2011)
Azure Application Architecture Guide
This guide presents a structured approach for designing applications on Azure that are scalable, secure, resilient,
and highly available. It is based on proven practices that we have learned from customer engagements.
Introduction
The cloud is changing how applications are designed and secured. Instead of monoliths, applications are
decomposed into smaller, decentralized services. These services communicate through APIs or by using
asynchronous messaging or eventing. Applications scale horizontally, adding new instances as demand requires.
These trends bring new challenges. Application state is distributed. Operations are done in parallel and
asynchronously. Applications must be resilient when failures occur. Malicious actors continuously target
applications. Deployments must be automated and predictable. Monitoring and telemetry are critical for gaining
insight into the system. This guide is designed to help you navigate these changes.
Monolithic → Decomposed
Designed for predictable scalability → Designed for elastic scale
Relational database → Polyglot persistence (mix of storage technologies)
Synchronized processing → Asynchronous processing
Design to avoid failures (MTBF) → Design for failure (MTTR)
Occasional large updates → Frequent small updates
Manual management → Automated self-management
Snowflake servers → Immutable infrastructure
[Diagram: the guide covers technology choices (compute, data stores, messaging) and application architecture (reference architectures, design principles, design patterns, best practices).]
Technology choices
Once you know the type of architecture you are building, you can start to choose the main technology pieces for the
architecture. The following technology choices are critical:
Compute refers to the hosting model for the computing resources that your applications run on. For more
information, see Choose a compute service.
Data stores include databases but also storage for message queues, caches, logs, and anything else that an
application might persist to storage. For more information, see Choose a data store.
Messaging technologies enable asynchronous messages between components of the system. For more
information, see Choose a messaging service.
You will probably have to make additional technology choices along the way, but these three elements (compute,
data, and messaging) are central to most cloud applications and will determine many aspects of your design.
Quality pillars
A successful cloud application will focus on five pillars of software quality: Cost optimization, Operational
excellence, Performance efficiency, Reliability, and Security.
Leverage the Microsoft Azure Well-Architected Framework to assess your architecture across these five pillars.
Next steps
Architecture styles
Architecture styles
An architecture style is a family of architectures that share certain characteristics. For example, N-tier is a common
architecture style. More recently, microservice architectures have started to gain favor. Architecture styles don't
require the use of particular technologies, but some technologies are well-suited for certain architectures. For
example, containers are a natural fit for microservices.
We have identified a set of architecture styles that are commonly found in cloud applications. The article for each
style includes:
A description and logical diagram of the style.
Recommendations for when to choose this style.
Benefits, challenges, and best practices.
A recommended deployment using relevant Azure services.
Web-Queue-Worker: Front and backend jobs, decoupled by async messaging. Suited to a relatively simple domain with some resource-intensive tasks.
Big data: Divide a huge dataset into small chunks, with parallel processing on local datasets. Suited to batch and real-time data analysis, and predictive analysis using ML.
Big compute: Data allocation to thousands of cores. Suited to compute-intensive domains such as simulation.
Big compute architecture style
The term big compute describes large-scale workloads that require a large number of cores, often numbering in
the hundreds or thousands. Scenarios include image rendering, fluid dynamics, financial risk modeling, oil
exploration, drug design, and engineering stress analysis, among others.
Benefits
High performance with "embarrassingly parallel" processing (see the sketch after this list).
Can harness hundreds or thousands of computer cores to solve large problems faster.
Access to specialized high-performance hardware, with dedicated high-speed InfiniBand networks.
You can provision VMs as needed to do work, and then tear them down.
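The sketch below is a minimal, single-node illustration of the embarrassingly parallel idea using Python's multiprocessing module: each scenario is scored independently, so the work can spread across however many cores are available (a real big compute solution distributes the same pattern across many nodes, for example with Azure Batch). The simulate function and its inputs are hypothetical placeholders.

```python
# Minimal single-node sketch of embarrassingly parallel work.
# Each scenario is independent, so scenarios can be scored on any number of cores.
from multiprocessing import Pool

def simulate(scenario: int) -> float:
    # Stand-in for one CPU-heavy, independent task (e.g. a single risk scenario).
    return sum((scenario * i) % 7 for i in range(1_000_000)) / 1_000_000

if __name__ == "__main__":
    scenarios = range(100)
    with Pool() as pool:                       # one worker process per available core
        results = pool.map(simulate, scenarios)
    print(f"completed {len(results)} independent simulations")
```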
Challenges
Managing the VM infrastructure.
Managing the volume of number crunching.
Provisioning thousands of cores in a timely manner.
For tightly coupled tasks, adding more cores can have diminishing returns. You may need to experiment to find
the optimum number of cores.
Big data architecture style
A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or
complex for traditional database systems.
Big data solutions typically involve one or more of the following types of workload:
Batch processing of big data sources at rest.
Real-time processing of big data in motion.
Interactive exploration of big data.
Predictive analytics and machine learning.
Most big data architectures include some or all of the following components:
Data sources: All big data solutions start with one or more data sources. Examples include:
Application data stores, such as relational databases.
Static files produced by applications, such as web server log files.
Real-time data sources, such as IoT devices.
Data storage: Data for batch processing operations is typically stored in a distributed file store that can
hold high volumes of large files in various formats. This kind of store is often called a data lake. Options for
implementing this storage include Azure Data Lake Store or blob containers in Azure Storage.
Batch processing: Because the data sets are so large, often a big data solution must process data files
using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these
jobs involve reading source files, processing them, and writing the output to new files. Options include
running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce jobs in an
HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster.
Real-time message ingestion: If the solution includes real-time sources, the architecture must include a
way to capture and store real-time messages for stream processing. This might be a simple data store,
where incoming messages are dropped into a folder for processing. However, many solutions need a
message ingestion store to act as a buffer for messages, and to support scale-out processing, reliable
delivery, and other message queuing semantics. Options include Azure Event Hubs, Azure IoT Hub, and
Kafka.
Stream processing: After capturing real-time messages, the solution must process them by filtering,
aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an
output sink. Azure Stream Analytics provides a managed stream processing service based on perpetually
running SQL queries that operate on unbounded streams. You can also use open source Apache streaming
technologies like Storm and Spark Streaming in an HDInsight cluster.
Analytical data store: Many big data solutions prepare data for analysis and then serve the processed
data in a structured format that can be queried using analytical tools. The analytical data store used to serve
these queries can be a Kimball-style relational data warehouse, as seen in most traditional business
intelligence (BI) solutions. Alternatively, the data could be presented through a low-latency NoSQL
technology such as HBase, or an interactive Hive database that provides a metadata abstraction over data
files in the distributed data store. Azure Synapse Analytics provides a managed service for large-scale,
cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which can also
be used to serve data for analysis.
Analysis and reporting: The goal of most big data solutions is to provide insights into the data through
analysis and reporting. To empower users to analyze the data, the architecture may include a data modeling
layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis Services. It might also
support self-service BI, using the modeling and visualization technologies in Microsoft Power BI or Microsoft
Excel. Analysis and reporting can also take the form of interactive data exploration by data scientists or data
analysts. For these scenarios, many Azure services support analytical notebooks, such as Jupyter, enabling
these users to leverage their existing skills with Python or R. For large-scale data exploration, you can use
Microsoft R Server, either standalone or with Spark.
Orchestration: Most big data solutions consist of repeated data processing operations, encapsulated in
workflows, that transform source data, move data between multiple sources and sinks, load the processed
data into an analytical data store, or push the results straight to a report or dashboard. To automate these
workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.
Azure includes many services that can be used in a big data architecture. They fall roughly into two categories:
Managed services, including Azure Data Lake Store, Azure Data Lake Analytics, Azure Synapse Analytics, Azure
Stream Analytics, Azure Event Hub, Azure IoT Hub, and Azure Data Factory.
Open source technologies based on the Apache Hadoop platform, including HDFS, HBase, Hive, Pig, Spark,
Storm, Oozie, Sqoop, and Kafka. These technologies are available on Azure in the Azure HDInsight service.
These options are not mutually exclusive, and many solutions combine open source technologies with Azure
services.
Benefits
Technology choices. You can mix and match Azure managed services and Apache technologies in HDInsight
clusters, to capitalize on existing skills or technology investments.
Performance through parallelism. Big data solutions take advantage of parallelism, enabling high-
performance solutions that scale to large volumes of data.
Elastic scale. All of the components in the big data architecture support scale-out provisioning, so that you can
adjust your solution to small or large workloads, and pay only for the resources that you use.
Interoperability with existing solutions. The components of the big data architecture are also used for IoT
processing and enterprise BI solutions, enabling you to create an integrated solution across data workloads.
Challenges
Complexity. Big data solutions can be extremely complex, with numerous components to handle data
ingestion from multiple data sources. It can be challenging to build, test, and troubleshoot big data processes.
Moreover, there may be a large number of configuration settings across multiple systems that must be used in
order to optimize performance.
Skillset. Many big data technologies are highly specialized, and use frameworks and languages that are not
typical of more general application architectures. On the other hand, big data technologies are evolving new
APIs that build on more established languages. For example, the U-SQL language in Azure Data Lake Analytics is
based on a combination of Transact-SQL and C#. Similarly, SQL-based APIs are available for Hive, HBase, and
Spark.
Technology maturity. Many of the technologies used in big data are evolving. While core Hadoop
technologies such as Hive and Pig have stabilized, emerging technologies such as Spark introduce extensive
changes and enhancements with each new release. Managed services such as Azure Data Lake Analytics and
Azure Data Factory are relatively young, compared with other Azure services, and will likely evolve over time.
Security. Big data solutions usually rely on storing all static data in a centralized data lake. Securing access to
this data can be challenging, especially when the data must be ingested and consumed by multiple applications
and platforms.
Best practices
Leverage parallelism. Most big data processing technologies distribute the workload across multiple
processing units. This requires that static data files are created and stored in a splittable format. Distributed
file systems such as HDFS can optimize read and write performance, and the actual processing is performed
by multiple cluster nodes in parallel, which reduces overall job times.
Partition data. Batch processing usually happens on a recurring schedule — for example, weekly or
monthly. Partition data files, and data structures such as tables, based on temporal periods that match the
processing schedule. That simplifies data ingestion and job scheduling, and makes it easier to troubleshoot
failures. Also, partitioning tables that are used in Hive, U-SQL, or SQL queries can significantly improve
query performance.
Apply schema-on-read semantics. Using a data lake lets you combine storage for files in multiple
formats, whether structured, semi-structured, or unstructured. Use schema-on-read semantics, which
project a schema onto the data when the data is processed, not when the data is stored. This builds
flexibility into the solution, and prevents bottlenecks during data ingestion caused by data validation and
type checking. (A minimal sketch of schema-on-read with temporal partitioning appears after this list.)
Process data in-place. Traditional BI solutions often use an extract, transform, and load (ETL) process to
move data into a data warehouse. With larger volumes of data, and a greater variety of formats, big data
solutions generally use variations of ETL, such as transform, extract, and load (TEL). With this approach, the
data is processed within the distributed data store, transforming it to the required structure, before moving
the transformed data into an analytical data store.
Balance utilization and time costs. For batch processing jobs, it's important to consider two factors: the
per-unit cost of the compute nodes, and the per-minute cost of using those nodes to complete the job. For
example, a batch job may take eight hours with four cluster nodes. However, it might turn out that the job
uses all four nodes only during the first two hours, and after that, only two nodes are required. In that case,
running the entire job on two nodes would increase the total job time, but would not double it, so the total
cost would be less. In some business scenarios, a longer processing time may be preferable to the higher
cost of using underutilized cluster resources.
Separate cluster resources. When deploying HDInsight clusters, you will normally achieve better
performance by provisioning separate cluster resources for each type of workload. For example, although
Spark clusters include Hive, if you need to perform extensive processing with both Hive and Spark, you
should consider deploying separate dedicated Spark and Hadoop clusters. Similarly, if you are using HBase
and Storm for low latency stream processing and Hive for batch processing, consider separate clusters for
Storm, HBase, and Hadoop.
Orchestrate data ingestion. In some cases, existing business applications may write data files for batch
processing directly into Azure storage blob containers, where they can be consumed by HDInsight or Azure
Data Lake Analytics. However, you will often need to orchestrate the ingestion of data from on-premises or
external data sources into the data lake. Use an orchestration workflow or pipeline, such as those supported
by Azure Data Factory or Oozie, to achieve this in a predictable and centrally manageable fashion.
Scrub sensitive data early. The data ingestion workflow should scrub sensitive data early in the process,
to avoid storing it in the data lake.
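A minimal PySpark sketch of two of the practices above, schema-on-read and temporal partitioning, assuming a Spark environment such as an HDInsight or Databricks cluster; the storage paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-prepare").getOrCreate()

# Schema-on-read: the schema is applied when the raw files are read,
# not when they were written to the data lake.
raw = (spark.read
       .option("header", True)
       .option("inferSchema", True)
       .csv("abfss://raw@mydatalake.dfs.core.windows.net/orders/"))

# Partition the prepared output by a temporal column that matches the
# batch schedule, so each scheduled run touches only its own partition.
prepared = raw.withColumn("order_date", F.to_date("order_timestamp"))
(prepared.write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet("abfss://curated@mydatalake.dfs.core.windows.net/orders/"))
```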
IoT architecture
Internet of Things (IoT) is a specialized subset of big data solutions. The following diagram shows a possible logical
architecture for IoT. The diagram emphasizes the event-streaming components of the architecture.
The cloud gateway ingests device events at the cloud boundary, using a reliable, low latency messaging system.
Devices might send events directly to the cloud gateway, or through a field gateway. A field gateway is a
specialized device or software, usually colocated with the devices, that receives events and forwards them to the
cloud gateway. The field gateway might also preprocess the raw device events, performing functions such as
filtering, aggregation, or protocol transformation.
After ingestion, events go through one or more stream processors that can route the data (for example, to
storage) or perform analytics and other processing.
The following are some common types of processing. (This list is certainly not exhaustive.)
Writing event data to cold storage, for archiving or batch analytics.
Hot path analytics, analyzing the event stream in (near) real time, to detect anomalies, recognize patterns
over rolling time windows, or trigger alerts when a specific condition occurs in the stream.
Handling special types of non-telemetry messages from devices, such as notifications and alarms.
Machine learning.
The boxes that are shaded gray show components of an IoT system that are not directly related to event streaming,
but are included here for completeness.
The device registry is a database of the provisioned devices, including the device IDs and usually device
metadata, such as location.
The provisioning API is a common external interface for provisioning and registering new devices.
Some IoT solutions allow command and control messages to be sent to devices.
This section has presented a very high-level view of IoT, and there are many subtleties and challenges to
consider. For a more detailed reference architecture and discussion, see the Microsoft Azure IoT Reference
Architecture (PDF download).
Next steps
Learn more about big data architectures.
Event-driven architecture style
An event-driven architecture consists of event producers that generate a stream of events, and event
consumers that listen for the events.
Events are delivered in near real time, so consumers can respond immediately to events as they occur. Producers
are decoupled from consumers — a producer doesn't know which consumers are listening. Consumers are also
decoupled from each other, and every consumer sees all of the events. This differs from a Competing Consumers
pattern, where consumers pull messages from a queue and a message is processed just once (assuming no
errors). In some systems, such as IoT, events must be ingested at very high volumes.
An event-driven architecture can use a pub/sub model or an event stream model.
Pub/sub: The messaging infrastructure keeps track of subscriptions. When an event is published, it sends
the event to each subscriber. After an event is received, it cannot be replayed, and new subscribers do not
see the event.
Event streaming: Events are written to a log. Events are strictly ordered (within a partition) and durable.
Clients don't subscribe to the stream; instead, a client can read from any part of the stream. The client is
responsible for advancing its position in the stream. That means a client can join at any time, and can
replay events.
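A minimal sketch of an event stream consumer, using the azure-eventhub Python SDK and assuming an existing Event Hubs namespace; the connection string and hub name are placeholders. Because each client picks its own starting position, it can join at any time and replay the stream from the beginning.

```python
from azure.eventhub import EventHubConsumerClient

def on_event(partition_context, event):
    # The consumer advances (and may checkpoint) its own position in the stream.
    print(partition_context.partition_id, event.body_as_str())

client = EventHubConsumerClient.from_connection_string(
    "<connection-string>",          # placeholder
    consumer_group="$Default",
    eventhub_name="telemetry")      # placeholder

with client:
    # starting_position="-1" reads from the beginning of the log,
    # illustrating that events in a stream are durable and replayable.
    client.receive(on_event=on_event, starting_position="-1")
```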
On the consumer side, there are some common variations:
Simple event processing. An event immediately triggers an action in the consumer. For example, you
could use Azure Functions with a Service Bus trigger, so that a function executes whenever a message is
published to a Service Bus topic (see the sketch after this list).
Complex event processing. A consumer processes a series of events, looking for patterns in the event
data, using a technology such as Azure Stream Analytics or Apache Storm. For example, you could
aggregate readings from an embedded device over a time window, and generate a notification if the
moving average crosses a certain threshold.
Event stream processing. Use a data streaming platform, such as Azure IoT Hub or Apache Kafka, as a
pipeline to ingest events and feed them to stream processors. The stream processors act to process or
transform the stream. There may be multiple stream processors for different subsystems of the application.
This approach is a good fit for IoT workloads.
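Here is a minimal sketch of the simple event processing example mentioned above, using the Azure Functions Python (v2) programming model with a Service Bus topic trigger; the topic, subscription, and connection setting names are placeholders.

```python
import logging
import azure.functions as func

app = func.FunctionApp()

@app.service_bus_topic_trigger(
    arg_name="msg",
    topic_name="orders",                 # placeholder topic
    subscription_name="fulfillment",     # placeholder subscription
    connection="ServiceBusConnection")   # app setting with the namespace connection string
def handle_order_event(msg: func.ServiceBusMessage):
    # The function runs once for each message published to the topic.
    logging.info("Received event: %s", msg.get_body().decode("utf-8"))
```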
The source of the events may be external to the system, such as physical devices in an IoT solution. In that case,
the system must be able to ingest the data at the volume and throughput that is required by the data source.
In the logical diagram above, each type of consumer is shown as a single box. In practice, it's common to have
multiple instances of a consumer, to avoid having the consumer become a single point of failure in the system.
Multiple instances might also be necessary to handle the volume and frequency of events. Also, a single consumer
might process events on multiple threads. This can create challenges if events must be processed in order or
require exactly-once semantics. See Minimize Coordination.
Benefits
Producers and consumers are decoupled.
No point-to-point integrations. It's easy to add new consumers to the system.
Consumers can respond to events immediately as they arrive.
Highly scalable and distributed.
Subsystems have independent views of the event stream.
Challenges
Guaranteed delivery. In some systems, especially in IoT scenarios, it's crucial to guarantee that events are
delivered.
Processing events in order or exactly once. Each consumer type typically runs in multiple instances, for
resiliency and scalability. This can create a challenge if the events must be processed in order (within a
consumer type), or if the processing logic is not idempotent.
Additional considerations
The amount of data to include in an event can be a significant consideration that affects both performance and
cost. Putting all the relevant information needed for processing in the event itself can simplify the processing
code and save additional lookups. Putting the minimal amount of information in an event, like just a couple of
identifiers, will reduce transport time and cost, but requires the processing code to look up any additional
information it needs.
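To make the trade-off concrete, here is a hypothetical pair of payloads for the same order event: a self-contained event that saves consumers a lookup, and a minimal event that keeps transport cheap but pushes the lookup onto the consumer.

```python
# Self-contained ("fat") event: consumers need no further lookups.
fat_event = {
    "type": "OrderPlaced",
    "orderId": "12345",
    "customer": {"id": "c-789", "email": "user@example.com"},
    "items": [{"sku": "widget-1", "qty": 2, "price": 9.99}],
}

# Minimal ("thin") event: cheaper to transport, but consumers must
# call back to the order service for the details they need.
thin_event = {
    "type": "OrderPlaced",
    "orderId": "12345",
}
```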
Microservices architecture style
A microservices architecture consists of a collection of small, autonomous services. Each service is self-contained
and should implement a single business capability.
Benefits
Agility. Because microservices are deployed independently, it's easier to manage bug fixes and feature
releases. You can update a service without redeploying the entire application, and roll back an update if
something goes wrong. In many traditional applications, if a bug is found in one part of the application, it
can block the entire release process. New features may be held up waiting for a bug fix to be integrated,
tested, and published.
Small, focused teams. A microservice should be small enough that a single feature team can build, test,
and deploy it. Small team sizes promote greater agility. Large teams tend to be less productive, because
communication is slower, management overhead goes up, and agility diminishes.
Small code base. In a monolithic application, there is a tendency over time for code dependencies to
become tangled. Adding a new feature requires touching code in a lot of places. By not sharing code or data
stores, a microservices architecture minimizes dependencies, and that makes it easier to add new features.
Mix of technologies. Teams can pick the technology that best fits their service, using a mix of technology
stacks as appropriate.
Fault isolation. If an individual microservice becomes unavailable, it won't disrupt the entire application,
as long as any upstream microservices are designed to handle faults correctly (for example, by
implementing circuit breaking).
Scalability. Services can be scaled independently, letting you scale out subsystems that require more
resources, without scaling out the entire application. Using an orchestrator such as Kubernetes or Service
Fabric, you can pack a higher density of services onto a single host, which allows for more efficient
utilization of resources.
Data isolation. It is much easier to perform schema updates, because only a single microservice is
affected. In a monolithic application, schema updates can become very challenging, because different parts
of the application may all touch the same data, making any alterations to the schema risky.
Challenges
The benefits of microservices don't come for free. Here are some of the challenges to consider before embarking
on a microservices architecture.
Complexity. A microservices application has more moving parts than the equivalent monolithic
application. Each service is simpler, but the entire system as a whole is more complex.
Development and testing. Writing a small service that relies on other dependent services requires a
different approach than writing a traditional monolithic or layered application. Existing tools are not
always designed to work with service dependencies. Refactoring across service boundaries can be difficult.
It is also challenging to test service dependencies, especially when the application is evolving quickly.
Lack of governance. The decentralized approach to building microservices has advantages, but it can
also lead to problems. You may end up with so many different languages and frameworks that the
application becomes hard to maintain. It may be useful to put some project-wide standards in place,
without overly restricting teams' flexibility. This especially applies to cross-cutting functionality such as
logging.
Network congestion and latency. The use of many small, granular services can result in more
interservice communication. Also, if the chain of service dependencies gets too long (service A calls B,
which calls C...), the additional latency can become a problem. You will need to design APIs carefully. Avoid
overly chatty APIs, think about serialization formats, and look for places to use asynchronous
communication patterns.
Data integrity. Each microservice is responsible for its own data persistence. As a result, data
consistency can be a challenge. Embrace eventual consistency where possible.
Management. Succeeding with microservices requires a mature DevOps culture. Correlated logging
across services can be challenging. Typically, logging must correlate multiple service calls for a single user
operation.
Versioning. Updates to a service must not break services that depend on it. Multiple services could be
updated at any given time, so without careful design, you might have problems with backward or forward
compatibility.
Skillset. Microservices are highly distributed systems. Carefully evaluate whether the team has the skills
and experience to be successful.
Best practices
Model services around the business domain.
Decentralize everything. Individual teams are responsible for designing and building services. Avoid
sharing code or data schemas.
Data storage should be private to the service that owns the data. Use the best storage for each service and
data type.
Services communicate through well-designed APIs. Avoid leaking implementation details. APIs should
model the domain, not the internal implementation of the service.
Avoid coupling between services. Causes of coupling include shared database schemas and rigid
communication protocols.
Offload cross-cutting concerns, such as authentication and SSL termination, to the gateway.
Keep domain knowledge out of the gateway. The gateway should handle and route client requests without
any knowledge of the business rules or domain logic. Otherwise, the gateway becomes a dependency and
can cause coupling between services.
Services should have loose coupling and high functional cohesion. Functions that are likely to change
together should be packaged and deployed together. If they reside in separate services, those services end
up being tightly coupled, because a change in one service will require updating the other service. Overly
chatty communication between two services may be a symptom of tight coupling and low cohesion.
Isolate failures. Use resiliency strategies to prevent failures within a service from cascading. See Resiliency
patterns and Designing reliable applications.
Next steps
For detailed guidance about building a microservices architecture on Azure, see Designing, building, and
operating microservices on Azure.
N-tier architecture style
An N-tier architecture divides an application into logical layers and physical tiers.
Layers are a way to separate responsibilities and manage dependencies. Each layer has a specific responsibility. A
higher layer can use services in a lower layer, but not the other way around.
Tiers are physically separated, running on separate machines. A tier can call another tier directly, or use
asynchronous messaging (message queue). Although each layer might be hosted in its own tier, that's not
required. Several layers might be hosted on the same tier. Physically separating the tiers improves scalability and
resiliency, but also adds latency from the additional network communication.
A traditional three-tier application has a presentation tier, a middle tier, and a database tier. The middle tier is
optional. More complex applications can have more than three tiers. The diagram above shows an application with
two middle tiers, encapsulating different areas of functionality.
An N-tier application can have a closed layer architecture or an open layer architecture:
In a closed layer architecture, a layer can only call the next layer immediately down.
In an open layer architecture, a layer can call any of the layers below it.
A closed layer architecture limits the dependencies between layers. However, it might create unnecessary network
traffic, if one layer simply passes requests along to the next layer.
Benefits
Portability between cloud and on-premises, and between cloud platforms.
Lower learning curve for most developers.
Natural evolution from the traditional application model.
Open to heterogeneous environment (Windows/Linux)
Challenges
It's easy to end up with a middle tier that just does CRUD operations on the database, adding extra latency
without doing any useful work.
Monolithic design prevents independent deployment of features.
Managing an IaaS application is more work than an application that uses only managed services.
It can be difficult to manage network security in a large system.
Best practices
Use autoscaling to handle changes in load. See Autoscaling best practices.
Use asynchronous messaging to decouple tiers.
Cache semistatic data. See Caching best practices.
Configure the database tier for high availability, using a solution such as SQL Server Always On availability
groups.
Place a web application firewall (WAF) between the front end and the Internet.
Place each tier in its own subnet, and use subnets as a security boundary.
Restrict access to the data tier, by allowing requests only from the middle tier(s).
Each tier consists of two or more VMs, placed in an availability set or virtual machine scale set. Multiple VMs
provide resiliency in case one VM fails. Load balancers are used to distribute requests across the VMs in a tier. A
tier can be scaled horizontally by adding more VMs to the pool.
Each tier is also placed inside its own subnet, meaning their internal IP addresses fall within the same address
range. That makes it easy to apply network security group rules and route tables to individual tiers.
The web and business tiers are stateless. Any VM can handle any request for that tier. The data tier should consist
of a replicated database. For Windows, we recommend SQL Server, using Always On availability groups for high
availability. For Linux, choose a database that supports replication, such as Apache Cassandra.
Network security groups restrict access to each tier. For example, the database tier only allows access from the
business tier.
NOTE
The layer labeled "Business Tier" in our reference diagram is a moniker for the business logic tier. Likewise, we also call the
presentation tier the "Web Tier." In our example, this is a web application, though multi-tier architectures can be used for
other topologies as well (like desktop apps). Name your tiers what works best for your team to communicate the intent of
that logical and/or physical tier in your application - you could even express that naming in resources you choose to
represent that tier (e.g. vmss-appName-business-layer).
Web-Queue-Worker architecture style
The core components of this architecture are a web front end that serves client requests, and a worker that
performs resource-intensive tasks, long-running workflows, or batch jobs. The web front end communicates with
the worker through a message queue.
Other components that are commonly incorporated into this architecture include:
One or more databases.
A cache to store values from the database for quick reads.
CDN to serve static content.
Remote services, such as email or SMS service. Often these are provided by third parties.
Identity provider for authentication.
The web and worker are both stateless. Session state can be stored in a distributed cache. Any long-running work
is done asynchronously by the worker. The worker can be triggered by messages on the queue, or run on a
schedule for batch processing. The worker is an optional component. If there are no long-running operations, the
worker can be omitted.
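A minimal sketch of the worker loop, using the azure-storage-queue Python SDK and assuming a Storage queue named tasks; the connection string and the job itself are placeholders.

```python
from azure.storage.queue import QueueClient

def long_running_job(payload: str) -> None:
    # Stand-in for a resource-intensive task, long-running workflow, or batch job.
    print(f"processing: {payload}")

queue = QueueClient.from_connection_string("<connection-string>", "tasks")

# Messages stay invisible while being processed; delete only after success,
# so a crashed worker's message becomes visible again and is retried.
for message in queue.receive_messages(visibility_timeout=300):
    long_running_job(message.content)
    queue.delete_message(message)
```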
The front end might consist of a web API. On the client side, the web API can be consumed by a single-page
application that makes AJAX calls, or by a native client application.
Benefits
Relatively simple architecture that is easy to understand.
Easy to deploy and manage.
Clear separation of concerns.
The front end is decoupled from the worker using asynchronous messaging.
The front end and the worker can be scaled independently.
Challenges
Without careful design, the front end and the worker can become large, monolithic components that are
difficult to maintain and update.
There may be hidden dependencies, if the front end and worker share data schemas or code modules.
Best practices
Expose a well-designed API to the client. See API design best practices.
Autoscale to handle changes in load. See Autoscaling best practices.
Cache semi-static data. See Caching best practices.
Use a CDN to host static content. See CDN best practices.
Use polyglot persistence when appropriate. See Use the best data store for the job.
Partition data to improve scalability, reduce contention, and optimize performance. See Data partitioning best
practices.
The front end is implemented as an Azure App Service web app, and the worker is implemented as an Azure
Functions app. The web app and the function app are both associated with an App Service plan that
provides the VM instances.
You can use either Azure Service Bus or Azure Storage queues for the message queue. (The diagram shows
an Azure Storage queue.)
Azure Cache for Redis stores session state and other data that needs low latency access.
Azure CDN is used to cache static content such as images, CSS, or HTML.
For storage, choose the storage technologies that best fit the needs of the application. You might use
multiple storage technologies (polyglot persistence). To illustrate this idea, the diagram shows Azure SQL
Database and Azure Cosmos DB.
For more details, see App Service web application reference architecture.
Additional considerations
Not every transaction has to go through the queue and worker to storage. The web front end can perform
simple read/write operations directly. Workers are designed for resource-intensive tasks or long-running
workflows. In some cases, you might not need a worker at all.
Use the built-in autoscale feature of App Service to scale out the number of VM instances. If the load on the
application follows predictable patterns, use schedule-based autoscale. If the load is unpredictable, use
metrics-based autoscaling rules.
Consider putting the web app and the function app into separate App Service plans. That way, they can be
scaled independently.
Use separate App Service plans for production and testing. Otherwise, if you use the same plan for
production and testing, it means your tests are running on your production VMs.
Use deployment slots to manage deployments. This lets you deploy an updated version to a staging slot,
then swap over to the new version. It also lets you swap back to the previous version, if there was a problem
with the update.
Ten design principles for Azure applications
Follow these design principles to make your application more scalable, resilient, and manageable.
Design for self healing. In a distributed system, failures happen. Design your application to be self healing when failures occur.
Make all things redundant. Build redundancy into your application, to avoid having single points of failure.
Minimize coordination. Minimize coordination between application services to achieve scalability.
Design to scale out. Design your application so that it can scale horizontally, adding or removing new instances as demand requires.
Partition around limits. Use partitioning to work around database, network, and compute limits.
Design for operations. Design your application so that the operations team has the tools they need.
Use managed services. When possible, use platform as a service (PaaS) rather than infrastructure as a service (IaaS).
Use the best data store for the job. Pick the storage technology that is the best fit for your data and how it will be used.
Design for evolution. All successful applications change over time. An evolutionary design is key for continuous innovation.
Build for the needs of business. Every design decision must be justified by a business requirement.
Design for self healing
Recommendations
Retry failed operations. Transient failures may occur due to momentary loss of network connectivity, a dropped
database connection, or a timeout when a service is busy. Build retry logic into your application to handle transient
failures. For many Azure services, the client SDK implements automatic retries. For more information, see Transient
fault handling and the Retry pattern.
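A generic sketch of retry with exponential backoff and jitter; many Azure client SDKs implement this for you, so the transient error type and remote call below are only stand-ins.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a timeout or dropped-connection error."""

def retry(operation, attempts=5, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise                                   # give up after the last attempt
            # Exponential backoff with jitter avoids synchronized retry storms.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

def call_remote_service():
    if random.random() < 0.7:
        raise TransientError("service busy")
    return "ok"

print(retry(call_remote_service))
```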
Protect failing remote services (Circuit Breaker). It's good to retry after a transient failure, but if the failure
persists, you can end up with too many callers hammering a failing service. This can lead to cascading failures, as
requests back up. Use the Circuit Breaker pattern to fail fast (without making the remote call) when an operation is
likely to fail.
Isolate critical resources (Bulkhead). Failures in one subsystem can sometimes cascade. This can happen if a
failure causes some resources, such as threads or sockets, not to get freed in a timely manner, leading to resource
exhaustion. To avoid this, partition a system into isolated groups, so that a failure in one partition does not bring
down the entire system.
Perform load leveling. Applications may experience sudden spikes in traffic that can overwhelm services on the
backend. To avoid this, use the Queue-Based Load Leveling pattern to queue work items to run asynchronously. The
queue acts as a buffer that smooths out peaks in the load.
Fail over. If an instance can't be reached, fail over to another instance. For things that are stateless, like a web
server, put several instances behind a load balancer or traffic manager. For things that store state, like a database,
use replicas and fail over. Depending on the data store and how it replicates, this may require the application to
deal with eventual consistency.
Compensate failed transactions. In general, avoid distributed transactions, as they require coordination across
services and resources. Instead, compose an operation from smaller individual transactions. If the operation fails
midway through, use Compensating Transactions to undo any step that already completed.
Checkpoint long-running transactions. Checkpoints can provide resiliency if a long-running operation fails.
When the operation restarts (for example, it is picked up by another VM), it can be resumed from the last
checkpoint.
Degrade gracefully. Sometimes you can't work around a problem, but you can provide reduced functionality that
is still useful. Consider an application that shows a catalog of books. If the application can't retrieve the thumbnail
image for the cover, it might show a placeholder image. Entire subsystems might be noncritical for the application.
For example, in an e-commerce site, showing product recommendations is probably less critical than processing
orders.
Throttle clients. Sometimes a small number of users create excessive load, which can reduce your application's
availability for other users. In this situation, throttle the client for a certain period of time. See the Throttling pattern.
Block bad actors. Just because you throttle a client, it doesn't mean the client was acting maliciously. It just means
the client exceeded its service quota. But if a client consistently exceeds its quota or otherwise behaves badly, you
might block it. Define an out-of-band process for users to request getting unblocked.
Use leader election. When you need to coordinate a task, use Leader Election to select a coordinator. That way,
the coordinator is not a single point of failure. If the coordinator fails, a new one is selected. Rather than implement
a leader election algorithm from scratch, consider an off-the-shelf solution such as Zookeeper.
Test with fault injection. All too often, the success path is well tested but not the failure path. A system could run
in production for a long time before a failure path is exercised. Use fault injection to test the resiliency of the
system to failures, either by triggering actual failures or by simulating them.
Embrace chaos engineering. Chaos engineering extends the notion of fault injection, by randomly injecting
failures or abnormal conditions into production instances.
For a structured approach to making your applications self healing, see Design reliable applications for Azure.
Make all things redundant
Recommendations
Consider business requirements. The amount of redundancy built into a system can affect both cost and
complexity. Your architecture should be informed by your business requirements, such as recovery time objective
(RTO). For example, a multi-region deployment is more expensive than a single-region deployment, and is more
complicated to manage. You will need operational procedures to handle failover and failback. The additional cost
and complexity might be justified for some business scenarios and not others.
Place VMs behind a load balancer. Don't use a single VM for mission-critical workloads. Instead, place multiple
VMs behind a load balancer. If any VM becomes unavailable, the load balancer distributes traffic to the remaining
healthy VMs. To learn how to deploy this configuration, see Multiple VMs for scalability and availability.
Replicate databases. Azure SQL Database and Cosmos DB automatically replicate the data within a region, and
you can enable geo-replication across regions. If you are using an IaaS database solution, choose one that supports
replication and failover, such as SQL Server Always On availability groups.
Enable geo-replication. Geo-replication for Azure SQL Database and Cosmos DB creates secondary readable
replicas of your data in one or more secondary regions. In the event of an outage, the database can fail over to the
secondary region for writes.
Partition for availability. Database partitioning is often used to improve scalability, but it can also improve
availability. If one shard goes down, the other shards can still be reached. A failure in one shard will only disrupt a
subset of the total transactions.
Deploy to more than one region. For the highest availability, deploy the application to more than one region.
That way, in the rare case when a problem affects an entire region, the application can fail over to another region.
The following diagram shows a multi-region application that uses Azure Traffic Manager to handle failover.
Synchronize front and backend failover. Use Azure Traffic Manager to fail over the front end. If the front end
becomes unreachable in one region, Traffic Manager will route new requests to the secondary region. Depending
on your database solution, you may need to coordinate failing over the database.
Use automatic failover but manual failback. Use Traffic Manager for automatic failover, but not for automatic
failback. Automatic failback carries a risk that you might switch to the primary region before the region is
completely healthy. Instead, verify that all application subsystems are healthy before manually failing back. Also,
depending on the database, you might need to check data consistency before failing back.
Include redundancy for Traffic Manager. Traffic Manager is a possible failure point. Review the Traffic Manager
SLA, and determine whether using Traffic Manager alone meets your business requirements for high availability. If
not, consider adding another traffic management solution as a failback. If the Azure Traffic Manager service fails,
change your CNAME records in DNS to point to the other traffic management service.
Minimize coordination
[Diagram: two application nodes updating the Orders and OrderItems tables in the same database; the second node's update is blocked waiting on a lock held by the first.]
Coordination limits the benefits of horizontal scale and creates bottlenecks. In this example, as you scale out the
application and add more instances, you'll see increased lock contention. In the worst case, the front-end instances
will spend most of their time waiting on locks.
"Exactly once" semantics are another frequent source of coordination. For example, an order must be processed
exactly once. Two workers are listening for new orders. Worker1 picks up an order for processing. The application
must ensure that Worker2 doesn't duplicate the work, but also if Worker1 crashes, the order isn't dropped.
You can use a pattern such as Scheduler Agent Supervisor to coordinate between the workers, but in this case a
better approach might be to partition the work. Each worker is assigned a certain range of orders (say, by billing
region). If a worker crashes, a new instance picks up where the previous instance left off, but multiple instances
aren't contending.
Recommendations
Embrace eventual consistency. When data is distributed, it takes coordination to enforce strong consistency
guarantees. For example, suppose an operation updates two databases. Instead of putting it into a single
transaction scope, it's better if the system can accommodate eventual consistency, perhaps by using the
Compensating Transaction pattern to logically roll back after a failure.
Use domain events to synchronize state. A domain event is an event that records when something happens
that has significance within the domain. Interested services can listen for the event, rather than using a global
transaction to coordinate across multiple services. If this approach is used, the system must tolerate eventual
consistency (see previous item).
Consider patterns such as CQRS and event sourcing. These two patterns can help to reduce contention
between read workloads and write workloads.
The CQRS pattern separates read operations from write operations. In some implementations, the read data
is physically separated from the write data.
In the Event Sourcing pattern, state changes are recorded as a series of events to an append-only data store.
Appending an event to the stream is an atomic operation, requiring minimal locking.
These two patterns complement each other. If the write-only store in CQRS uses event sourcing, the read-only
store can listen for the same events to create a readable snapshot of the current state, optimized for queries. Before
adopting CQRS or event sourcing, however, be aware of the challenges of this approach.
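A minimal sketch of how the two patterns can fit together: writes are appended to an event stream, and a read model is projected from the same events. The in-memory stream, the event shapes, and the synchronous projection are simplifications for illustration; a real system would use a durable store and asynchronous read-side consumers.

from collections import defaultdict

event_stream = []              # append-only write store (in memory for illustration)
read_model = defaultdict(int)  # query-optimized snapshot: stock level per product

def project(event):
    """Update the read model from a single event (the read side)."""
    if event["type"] == "ItemAdded":
        read_model[event["product"]] += event["quantity"]
    elif event["type"] == "ItemRemoved":
        read_model[event["product"]] -= event["quantity"]

def append_event(event):
    """Appending is the only write operation; there are no in-place updates."""
    event_stream.append(event)
    project(event)   # a real system would process events asynchronously

if __name__ == "__main__":
    append_event({"type": "ItemAdded", "product": "widget", "quantity": 10})
    append_event({"type": "ItemRemoved", "product": "widget", "quantity": 3})
    print(read_model["widget"])  # 7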
Partition data. Avoid putting all of your data into one data schema that is shared across many application
services. A microservices architecture enforces this principle by making each service responsible for its own data
store. Within a single database, partitioning the data into shards can improve concurrency, because a service
writing to one shard does not affect a service writing to a different shard.
Design idempotent operations. When possible, design operations to be idempotent. That way, they can be
handled using at-least-once semantics. For example, you can put work items on a queue. If a worker crashes in the
middle of an operation, another worker simply picks up the work item.
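A minimal sketch of an idempotent handler under at-least-once delivery: a deduplication record makes redelivered work items harmless. The in-memory processed_ids set stands in for a durable deduplication store.

processed_ids = set()   # stands in for a durable deduplication store

def handle_work_item(item_id, apply_change):
    """Process a work item at most once, even if it is delivered more than once."""
    if item_id in processed_ids:
        return           # duplicate delivery after a crash or retry: safely ignored
    apply_change()
    processed_ids.add(item_id)

if __name__ == "__main__":
    counter = {"value": 0}

    def bump():
        counter["value"] += 1

    handle_work_item("order-42", bump)
    handle_work_item("order-42", bump)   # redelivered; has no additional effect
    print(counter["value"])  # 1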
Use asynchronous parallel processing. If an operation requires multiple steps that are performed
asynchronously (such as remote service calls), you might be able to call them in parallel, and then aggregate the
results. This approach assumes that each step does not depend on the results of the previous step.
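A minimal sketch using asyncio to issue independent steps in parallel and aggregate the results. The fetch_* functions are hypothetical stand-ins for remote service calls.

import asyncio

# Hypothetical stand-ins for independent remote service calls.
async def fetch_price(product_id):
    return {"price": 9.99}

async def fetch_inventory(product_id):
    return {"in_stock": 42}

async def fetch_reviews(product_id):
    return {"rating": 4.5}

async def product_details(product_id):
    # The three calls do not depend on each other, so issue them in parallel
    # and aggregate the results once all of them complete.
    price, inventory, reviews = await asyncio.gather(
        fetch_price(product_id),
        fetch_inventory(product_id),
        fetch_reviews(product_id),
    )
    return {**price, **inventory, **reviews}

if __name__ == "__main__":
    print(asyncio.run(product_details("p1")))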
Use optimistic concurrency when possible. Pessimistic concurrency control uses database locks to prevent
conflicts. This can cause poor performance and reduce availability. With optimistic concurrency control, each
transaction modifies a copy or snapshot of the data. When the transaction is committed, the database engine
validates the transaction and rejects any transactions that would affect database consistency.
Azure SQL Database and SQL Server support optimistic concurrency through snapshot isolation. Some Azure
storage services support optimistic concurrency through the use of ETags, including Azure Cosmos DB and Azure
Storage.
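A minimal sketch of the optimistic pattern: read a value together with a version token (analogous to an ETag), then write only if the version is unchanged, retrying otherwise. The in-memory store stands in for the database or storage service.

store = {"counter": {"value": 0, "version": 0}}   # stand-in for a database record

def read(key):
    record = store[key]
    return record["value"], record["version"]

def try_write(key, new_value, expected_version):
    """Commit only if nobody else has written since we read (compare-and-swap)."""
    record = store[key]
    if record["version"] != expected_version:
        return False      # conflicting write detected; caller re-reads and retries
    record["value"] = new_value
    record["version"] += 1
    return True

def increment(key):
    while True:
        value, version = read(key)
        if try_write(key, value + 1, version):
            return

if __name__ == "__main__":
    increment("counter")
    increment("counter")
    print(store["counter"])  # {'value': 2, 'version': 2}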
Consider MapReduce or other parallel, distributed algorithms. Depending on the data and type of work to
be performed, you may be able to split the work into independent tasks that can be performed by multiple nodes
working in parallel. See Big compute architecture style.
Use leader election for coordination. In cases where you need to coordinate operations, make sure the
coordinator does not become a single point of failure in the application. Using the Leader Election pattern, one
instance is the leader at any time, and acts as the coordinator. If the leader fails, a new instance is elected to be the
leader.
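A minimal sketch of lease-based leader election, which is one common way to implement the pattern (for example, with an Azure Blob lease): an instance is the leader only while its lease is unexpired, so a crashed leader is replaced once the lease lapses. The in-memory lease dictionary is a stand-in for the shared lease resource.

import time

LEASE_SECONDS = 15
lease = {"owner": None, "expires": 0.0}   # stand-in for a shared lease resource

def try_acquire(instance_id, now=None):
    """Become (or remain) leader if the lease is free, ours, or expired."""
    now = time.monotonic() if now is None else now
    if lease["owner"] in (None, instance_id) or now > lease["expires"]:
        lease["owner"] = instance_id
        lease["expires"] = now + LEASE_SECONDS
        return True
    return False

if __name__ == "__main__":
    print(try_acquire("instance-a", now=0.0))    # True: instance-a becomes leader
    print(try_acquire("instance-b", now=5.0))    # False: lease still held by instance-a
    print(try_acquire("instance-b", now=25.0))   # True: lease expired, leadership moves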
Design to scale out
12/18/2020 • 2 minutes to read • Edit Online
Recommendations
Avoid instance stickiness. Stickiness, or session affinity, is when requests from the same client are always
routed to the same server. Stickiness limits the application's ability to scale out. For example, traffic from a high-
volume user will not be distributed across instances. Causes of stickiness include storing session state in memory,
and using machine-specific keys for encryption. Make sure that any instance can handle any request.
Identify bottlenecks. Scaling out isn't a magic fix for every performance issue. For example, if your backend
database is the bottleneck, it won't help to add more web servers. Identify and resolve the bottlenecks in the
system first, before throwing more instances at the problem. Stateful parts of the system are the most likely cause
of bottlenecks.
Decompose workloads by scalability requirements. Applications often consist of multiple workloads, with
different requirements for scaling. For example, an application might have a public-facing site and a separate
administration site. The public site may experience sudden surges in traffic, while the administration site has a
smaller, more predictable load.
Offload resource-intensive tasks. Tasks that require a lot of CPU or I/O resources should be moved to
background jobs when possible, to minimize the load on the front end that is handling user requests.
Use built-in autoscaling features. Many Azure compute services have built-in support for autoscaling. If the
application has a predictable, regular workload, scale out on a schedule. For example, scale out during business
hours. Otherwise, if the workload is not predictable, use performance metrics such as CPU or request queue length
to trigger autoscaling. For autoscaling best practices, see Autoscaling.
Consider aggressive autoscaling for critical workloads. For critical workloads, you want to keep ahead of
demand. It's better to add new instances quickly under heavy load to handle the additional traffic, and then
gradually scale back.
Design for scale in. Remember that with elastic scale, the application will have periods of scale in, when
instances get removed. The application must gracefully handle instances being removed. Here are some ways to
handle scale-in (a minimal sketch follows this list):
Listen for shutdown events (when available) and shut down cleanly.
Clients/consumers of a service should support transient fault handling and retry.
For long-running tasks, consider breaking up the work, using checkpoints or the Pipes and Filters pattern.
Put work items on a queue so that another instance can pick up the work, if an instance is removed in the
middle of processing.
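A minimal sketch of graceful scale-in handling: the worker stops taking new work when it receives a shutdown signal and checkpoints the item it is processing. The checkpoint callable is a hypothetical stand-in for persisting progress, and signal availability varies by platform.

import signal

shutting_down = False

def on_shutdown(signum, frame):
    """Triggered by the platform's shutdown notice; stop taking new work."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, on_shutdown)   # handler support varies by platform

def process(queue, checkpoint):
    while queue and not shutting_down:
        item = queue.pop(0)
        # ... do a small, resumable unit of work for this item ...
        checkpoint(item)   # hypothetical: persist progress so another instance can resume

if __name__ == "__main__":
    process(["a", "b", "c"], checkpoint=lambda item: print("checkpointed", item))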
Partition around limits
12/18/2020 • 2 minutes to read • Edit Online
Recommendations
Partition different parts of the application. Databases are one obvious candidate for partitioning, but also
consider storage, cache, queues, and compute instances.
Design the partition key to avoid hotspots. If you partition a database, but one shard still gets the majority of
the requests, then you haven't solved your problem. Ideally, load gets distributed evenly across all the partitions.
For example, hash by customer ID and not the first letter of the customer name, because some letters are more
frequent. The same principle applies when partitioning a message queue. Pick a partition key that leads to an even
distribution of messages across the set of queues. For more information, see Sharding.
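A minimal sketch of hash-based partition-key selection: hashing the customer ID spreads keys evenly across a fixed number of partitions, unlike partitioning by the first letter of the name. The partition count is an arbitrary example value.

import hashlib

NUM_PARTITIONS = 16   # arbitrary example value

def partition_for(customer_id):
    """Map a customer ID to a partition using a hash, not the first letter of a name."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

if __name__ == "__main__":
    for cid in ("cust-1001", "cust-1002", "cust-1003"):
        print(cid, "->", partition_for(cid))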
Partition around Azure subscription and service limits. Individual components and services have limits, but
there are also limits for subscriptions and resource groups. For very large applications, you might need to partition
around those limits.
Partition at different levels. Consider a database server deployed on a VM. The VM has a VHD that is backed by
Azure Storage. The storage account belongs to an Azure subscription. Notice that each step in the hierarchy has
limits. The database server may have a connection pool limit. VMs have CPU and network limits. Storage has IOPS
limits. The subscription has limits on the number of VM cores. Generally, it's easier to partition lower in the
hierarchy. Only large applications should need to partition at the subscription level.
Design for operations
12/18/2020 • 2 minutes to read • Edit Online
Design an application so that the operations team has the tools they
need
The cloud has dramatically changed the role of the operations team. They are no longer responsible for managing
the hardware and infrastructure that hosts the application. That said, operations is still a critical part of running a
successful cloud application. Some of the important functions of the operations team include:
Deployment
Monitoring
Escalation
Incident response
Security auditing
Robust logging and tracing are particularly important in cloud applications. Involve the operations team in design
and planning, to ensure the application gives them the data and insight they need to be successful.
Recommendations
Make all things observable. Once a solution is deployed and running, logs and traces are your primary insight
into the system. Tracing records a path through the system, and is useful to pinpoint bottlenecks, performance
issues, and failure points. Logging captures individual events such as application state changes, errors, and
exceptions. Log in production, or else you lose insight at the very times when you need it the most.
Instrument for monitoring. Monitoring gives insight into how well (or poorly) an application is performing, in
terms of availability, performance, and system health. For example, monitoring tells you whether you are meeting
your SLA. Monitoring happens during the normal operation of the system. It should be as close to real-time as
possible, so that the operations staff can react to issues quickly. Ideally, monitoring can help avert problems before
they lead to a critical failure. For more information, see Monitoring and diagnostics.
Instrument for root cause analysis. Root cause analysis is the process of finding the underlying cause of
failures. It occurs after a failure has already happened.
Use distributed tracing. Use a distributed tracing system that is designed for concurrency, asynchrony, and cloud
scale. Traces should include a correlation ID that flows across service boundaries. A single operation may involve
calls to multiple application services. If an operation fails, the correlation ID helps to pinpoint the cause of the
failure.
Standardize logs and metrics. The operations team will need to aggregate logs from across the various services
in your solution. If every service uses its own logging format, it becomes difficult or impossible to get useful
information from them. Define a common schema that includes fields such as correlation ID, event name, IP
address of the sender, and so forth. Individual services can derive custom schemas that inherit the base schema,
and contain additional fields.
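A minimal sketch of such a common log schema: every service emits the same base fields (timestamp, correlation ID, event name, sender IP) and layers its own fields on top. The field names and the print-based sink are illustrative only.

import json
import uuid
from datetime import datetime, timezone

def log_event(event_name, correlation_id, sender_ip, **extra_fields):
    """Emit a log record that follows the shared base schema."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "correlationId": correlation_id,
        "eventName": event_name,
        "senderIp": sender_ip,
        **extra_fields,         # service-specific fields layered on the base schema
    }
    print(json.dumps(record))   # illustrative sink; in practice, the logging pipeline

if __name__ == "__main__":
    cid = str(uuid.uuid4())     # the correlation ID flows across service boundaries
    log_event("OrderReceived", cid, "10.0.0.4", orderId=42)
    log_event("PaymentAuthorized", cid, "10.0.0.7", amount=99.90)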
Automate management tasks, including provisioning, deployment, and monitoring. Automating a task makes it
repeatable and less prone to human errors.
Treat configuration as code. Check configuration files into a version control system, so that you can track and
version your changes, and roll back if needed.
Use platform as a service (PaaS) options
12/18/2020 • 2 minutes to read • Edit Online
Examples of on-premises technologies and their Azure PaaS equivalents:
Hadoop: HDInsight
MongoDB: Cosmos DB
Please note that this is not meant to be an exhaustive list, but a subset of equivalent options.
Use the best data store for the job
12/18/2020 • 2 minutes to read • Edit Online
Pick the storage technology that is the best fit for your data and how it
will be used
Gone are the days when you would just stick all of your data into a big relational SQL database. Relational
databases are very good at what they do — providing ACID guarantees for transactions over relational data. But
they come with some costs:
Queries may require expensive joins.
Data must be normalized and conform to a predefined schema (schema on write).
Lock contention may impact performance.
In any large solution, it's likely that a single data store technology won't fill all your needs. Alternatives to relational
databases include key/value stores, document databases, search engine databases, time series databases, column
family databases, and graph databases. Each has pros and cons, and different types of data fit more naturally into
one or another.
For example, you might store a product catalog in a document database, such as Cosmos DB, which allows for a
flexible schema. In that case, each product description is a self-contained document. For queries over the entire
catalog, you might index the catalog and store the index in Azure Search. Product inventory might go into a SQL
database, because that data requires ACID guarantees.
Remember that data includes more than just the persisted application data. It also includes application logs, events,
messages, and caches.
Recommendations
Don't use a relational database for everything. Consider other data stores when appropriate. See Choose the
right data store.
Embrace polyglot persistence. In any large solution, it's likely that a single data store technology won't fill all
your needs.
Consider the type of data. For example, put transactional data into SQL, put JSON documents into a document
database, put telemetry data into a time series database, put application logs in Elasticsearch, and put blobs in
Azure Blob Storage.
Prefer availability over (strong) consistency. The CAP theorem implies that a distributed system must make
trade-offs between availability and consistency. (Network partitions, the other leg of the CAP theorem, can never
be completely avoided.) Often, you can achieve higher availability by adopting an eventual consistency model.
Consider the skillset of the development team. There are advantages to using polyglot persistence, but it's
possible to go overboard. Adopting a new data storage technology requires a new set of skills. The development
team must understand how to get the most out of the technology. They must understand appropriate usage
patterns, how to optimize queries, tune for performance, and so on. Factor this in when considering storage
technologies.
Use compensating transactions. A side effect of polyglot persistence is that a single transaction might write data
to multiple stores. If something fails, use compensating transactions to undo any steps that already completed.
Look at bounded contexts. Bounded context is a term from domain-driven design. A bounded context is an
explicit boundary around a domain model, and defines which parts of the domain the model applies to. Ideally, a
bounded context maps to a subdomain of the business domain. The bounded contexts in your system are a natural
place to consider polyglot persistence. For example, "products" may appear in both the Product Catalog
subdomain and the Product Inventory subdomain, but it's very likely that these two subdomains have different
requirements for storing, updating, and querying products.
Design for evolution
12/18/2020 • 3 minutes to read • Edit Online
Recommendations
Enforce high cohesion and loose coupling. A service is cohesive if it provides functionality that logically
belongs together. Services are loosely coupled if you can change one service without changing the other. High
cohesion generally means that changes in one function will require changes in other related functions. If you find
that updating a service requires coordinated updates to other services, it may be a sign that your services are not
cohesive. One of the goals of domain-driven design (DDD) is to identify those boundaries.
Encapsulate domain knowledge. When a client consumes a service, the responsibility for enforcing the
business rules of the domain should not fall on the client. Instead, the service should encapsulate all of the domain
knowledge that falls under its responsibility. Otherwise, every client has to enforce the business rules, and you end
up with domain knowledge spread across different parts of the application.
Use asynchronous messaging. Asynchronous messaging is a way to decouple the message producer from the
consumer. The producer does not depend on the consumer responding to the message or taking any particular
action. With a pub/sub architecture, the producer may not even know who is consuming the message. New
services can easily consume the messages without any modifications to the producer.
Don't build domain knowledge into a gateway. Gateways can be useful in a microservices architecture, for
things like request routing, protocol translation, load balancing, or authentication. However, the gateway should be
restricted to this sort of infrastructure functionality. It should not implement any domain knowledge, to avoid
becoming a heavy dependency.
Expose open interfaces. Avoid creating custom translation layers that sit between services. Instead, a service
should expose an API with a well-defined API contract. The API should be versioned, so that you can evolve the API
while maintaining backward compatibility. That way, you can update a service without coordinating updates to all
of the upstream services that depend on it. Public facing services should expose a RESTful API over HTTP. Backend
services might use an RPC-style messaging protocol for performance reasons.
Design and test against service contracts. When services expose well-defined APIs, you can develop and test
against those APIs. That way, you can develop and test an individual service without spinning up all of its
dependent services. (Of course, you would still perform integration and load testing against the real services.)
Abstract infrastructure away from domain logic. Don't let domain logic get mixed up with infrastructure-
related functionality, such as messaging or persistence. Otherwise, changes in the domain logic will require
updates to the infrastructure layers and vice versa.
Offload cross-cutting concerns to a separate service. For example, if several services need to authenticate
requests, you could move this functionality into its own service. Then you could evolve the authentication service
— for example, by adding a new authentication flow — without touching any of the services that use it.
Deploy services independently. When the DevOps team can deploy a single service independently of other
services in the application, updates can happen more quickly and safely. Bug fixes and new features can be rolled
out at a more regular cadence. Design both the application and the release process to support independent
updates.
Build for the needs of the business
12/18/2020 • 2 minutes to read • Edit Online
Recommendations
Define business objectives, including the recovery time objective (RTO), recovery point objective (RPO), and
maximum tolerable outage (MTO). These numbers should inform decisions about the architecture. For example, to
achieve a low RTO, you might implement automated failover to a secondary region. But if your solution can tolerate
a higher RTO, that degree of redundancy might be unnecessary.
Document service level agreements (SLA) and service level objectives (SLO), including availability and
performance metrics. You might build a solution that delivers 99.95% availability. Is that enough? The answer is a
business decision.
Model the application around the business domain. Start by analyzing the business requirements. Use these
requirements to model the application. Consider using a domain-driven design (DDD) approach to create domain
models that reflect the business processes and use cases.
Capture both functional and nonfunctional requirements. Functional requirements let you judge whether
the application does the right thing. Nonfunctional requirements let you judge whether the application does those
things well. In particular, make sure that you understand your requirements for scalability, availability, and latency.
These requirements will influence design decisions and choice of technology.
Decompose by workload. The term "workload" in this context means a discrete capability or computing task,
which can be logically separated from other tasks. Different workloads may have different requirements for
availability, scalability, data consistency, and disaster recovery.
Plan for growth. A solution might meet your current needs, in terms of number of users, volume of transactions,
data storage, and so forth. However, a robust application can handle growth without major architectural changes.
See Design to scale out and Partition around limits. Also consider that your business model and business
requirements will likely change over time. If an application's service model and data models are too rigid, it
becomes hard to evolve the application for new use cases and scenarios. See Design for evolution.
Manage costs. In a traditional on-premises application, you pay upfront for hardware as a capital expenditure. In a
cloud application, you pay for the resources that you consume. Make sure that you understand the pricing model
for the services that you consume. The total cost will include network bandwidth usage, storage, IP addresses,
service consumption, and other factors. For more information, see Azure pricing. Also consider your operations
costs. In the cloud, you don't have to manage the hardware or other infrastructure, but you still need to manage
your applications, including DevOps, incident response, disaster recovery, and so forth.
Choose an Azure compute service for your
application
12/18/2020 • 6 minutes to read • Edit Online
Azure offers a number of ways to host your application code. The term compute refers to the hosting model for
the computing resources that your application runs on. The following flowchart will help you to choose a compute
service for your application.
If your application consists of multiple workloads, evaluate each workload separately. A complete solution may
incorporate two or more compute services.
Definitions:
"Lift and shift" is a strategy for migrating a workload to the cloud without redesigning the application or
making code changes. Also called rehosting. For more information, see Azure migration center.
Cloud optimized is a strategy for migrating to the cloud by refactoring an application to take advantage of
cloud-native features and capabilities.
The output from this flowchart is a starting point for consideration. Next, perform a more detailed evaluation of
the service to see if it meets your needs.
This article includes several tables which may help you to make these tradeoff decisions. Based on this analysis,
you may find that the initial candidate isn't suitable for your particular application or workload. In that case, expand
your analysis to include other compute services.
NOTE
Azure Functions is an Azure serverless compute offering. To learn how this service compares with other Azure
serverless offerings, such as Logic Apps, which provides serverless workflows, see Choose the right integration and
automation services in Azure.
There is a spectrum from IaaS to pure PaaS. For example, Azure VMs can autoscale by using virtual machine scale
sets. This automatic scaling capability isn't strictly PaaS, but it's the type of management feature found in PaaS
services.
In general, there is a tradeoff between control and ease of management. IaaS gives the most control, flexibility, and
portability, but you have to provision, configure and manage the VMs and network components you create. FaaS
services automatically manage nearly all aspects of running an application. PaaS services fall somewhere in
between.
Minimum number of nodes:
Virtual Machines: 1 (see note 2)
App Service: 1
Service Fabric: 5 (see note 3)
Azure Functions: Serverless (see note 1)
Azure Kubernetes Service: 3 (see note 3)
Container Instances: No dedicated nodes
Azure Batch: 1 (see note 4)
Notes
1. If using Consumption plan. If using App Service plan, functions run on the VMs allocated for your App Service
plan. See Choose the correct service plan for Azure Functions.
2. Higher SLA with two or more instances.
3. Recommended for production environments.
4. Can scale down to zero after job completes.
5. Requires App Service Environment (ASE).
6. Use Azure App Service Hybrid Connections.
7. Requires App Service plan or Azure Functions Premium plan.
DevOps
Local debugging:
Virtual Machines: Agnostic
App Service: IIS Express, others (see note 1)
Service Fabric: Local node cluster
Azure Functions: Visual Studio or Azure Functions CLI
Azure Kubernetes Service: Minikube, others
Container Instances: Local container runtime
Azure Batch: Not supported
Notes
1. Options include IIS Express for ASP.NET or node.js (iisnode); PHP web server; Azure Toolkit for IntelliJ, Azure
Toolkit for Eclipse. App Service also supports remote debugging of deployed web app.
2. See Resource Manager providers, regions, API versions and schemas.
Scalability
Load balancer:
Virtual Machines: Azure Load Balancer
App Service: Integrated
Service Fabric: Azure Load Balancer
Azure Functions: Integrated
Azure Kubernetes Service: Azure Load Balancer or Application Gateway
Container Instances: No built-in support
Azure Batch: Azure Load Balancer
Notes
1. See Autoscale pods.
2. See Automatically scale a cluster to meet application demands on Azure Kubernetes Service (AKS).
3. See Azure subscription and service limits, quotas, and constraints.
Availability
SLA:
Virtual Machines: SLA for Virtual Machines
App Service: SLA for App Service
Service Fabric: SLA for Service Fabric
Azure Functions: SLA for Functions
Azure Kubernetes Service: SLA for AKS
Container Instances: SLA for Container Instances
Azure Batch: SLA for Azure Batch
For guided learning on Service Guarantees, review Core Cloud Services - Azure architecture and service
guarantees.
Security
Review and understand the available security controls and visibility for each service
App Service
Azure Kubernetes Service
Batch
Container Instances
Functions
Service Fabric
Virtual machine - Windows
Virtual machine - Linux
Other criteria
Cost:
Virtual Machines: Windows, Linux pricing
App Service: App Service pricing
Service Fabric: Service Fabric pricing
Azure Functions: Azure Functions pricing
Azure Kubernetes Service: AKS pricing
Container Instances: Container Instances pricing
Azure Batch: Azure Batch pricing
The output from this flowchart is a starting point for consideration. Next, perform a more detailed evaluation of
the service to see if it meets your needs.
Next steps
Core Cloud Services - Azure compute options. This Microsoft Learn module explores how compute services can
solve common business needs.
Understand data store models
12/18/2020 • 12 minutes to read • Edit Online
Modern business systems manage increasingly large volumes of heterogeneous data. This heterogeneity means
that a single data store is usually not the best approach. Instead, it's often better to store different types of data in
different data stores, each focused toward a specific workload or usage pattern. The term polyglot persistence is
used to describe solutions that use a mix of data store technologies. Therefore, it's important to understand the
main storage models and their tradeoffs.
Selecting the right data store for your requirements is a key design decision. There are literally hundreds of
implementations to choose from among SQL and NoSQL databases. Data stores are often categorized by how they
structure data and the types of operations they support. This article describes several of the most common storage
models. Note that a particular data store technology may support multiple storage models. For example, a
relational database management system (RDBMS) may also support key/value or graph storage. In fact, there is a
general trend for so-called multi-model support, where a single database system supports several models. But it's
still useful to understand the different models at a high level.
Not all data stores in a given category provide the same feature-set. Most data stores provide server-side
functionality to query and process data. Sometimes this functionality is built into the data storage engine. In other
cases, the data storage and processing capabilities are separated, and there may be several options for processing
and analysis. Data stores also support different programmatic and management interfaces.
Generally, you should start by considering which storage model is best suited for your requirements. Then consider
a particular data store within that category, based on factors such as feature set, cost, and ease of management.
Key/value stores
A key/value store associates each data value with a unique key. Most key/value stores only support simple query,
insert, and delete operations. To modify a value (either partially or completely), an application must overwrite the
existing data for the entire value. In most implementations, reading or writing a single value is an atomic operation.
An application can store arbitrary data as a set of values. Any schema information must be provided by the
application. The key/value store simply retrieves or stores the value by key.
Key/value stores are highly optimized for applications performing simple lookups, but are less suitable if you need
to query data across different key/value stores. Key/value stores are also not optimized for querying by value.
A single key/value store can be extremely scalable, as the data store can easily distribute data across multiple nodes
on separate machines.
Azure services
Azure Cosmos DB Table API, etcd API (preview), and SQL API | (Cosmos DB Security Baseline)
Azure Cache for Redis | (Security Baseline)
Azure Table Storage | (Security Baseline)
Workload
Data is accessed using a single key, like a dictionary.
No joins, locks, or unions are required.
No aggregation mechanisms are used.
Secondary indexes are generally not used.
Data type
Each key is associated with a single value.
There is no schema enforcement.
No relationships between entities.
Examples
Data caching
Session management
User preference and profile management
Product recommendation and ad serving
Document databases
A document database stores a collection of documents, where each document consists of named fields and data.
The data can be simple values or complex elements such as lists and child collections. Documents are retrieved by
unique keys.
Typically, a document contains the data for a single entity, such as a customer or an order. A document may contain
information that would be spread across several relational tables in an RDBMS. Documents don't need to have the
same structure. Applications can store different data in documents as business requirements change.
Azure service
Azure Cosmos DB SQL API | (Cosmos DB Security Baseline)
Workload
Insert and update operations are common.
No object-relational impedance mismatch. Documents can better match the object structures used in application
code.
Individual documents are retrieved and written as a single block.
Data requires index on multiple fields.
Data type
Data can be managed in a de-normalized way.
Size of individual document data is relatively small.
Each document type can use its own schema.
Documents can include optional fields.
Document data is semi-structured, meaning that data types of each field are not strictly defined.
Examples
Product catalog
Content management
Inventory management
Graph databases
A graph database stores two types of information, nodes and edges. Edges specify relationships between nodes.
Nodes and edges can have properties that provide information about that node or edge, similar to columns in a
table. Edges can also have a direction indicating the nature of the relationship.
Graph databases can efficiently perform queries across the network of nodes and edges and analyze the
relationships between entities. The following diagram shows an organization's personnel database structured as a
graph. The entities are employees and departments, and the edges indicate reporting relationships and the
departments in which employees work.
This structure makes it straightforward to perform queries such as "Find all employees who report directly or
indirectly to Sarah" or "Who works in the same department as John?" For large graphs with lots of entities and
relationships, you can perform very complex analyses very quickly. Many graph databases provide a query
language that you can use to traverse a network of relationships efficiently.
Azure services
Azure Cosmos DB Gremlin API | (Security Baseline)
SQL Server | (Security Baseline)
Workload
Complex relationships between data items involving many hops between related data items.
The relationships between data items are dynamic and change over time.
Relationships between objects are first-class citizens, without requiring foreign-keys and joins to traverse.
Data type
Nodes and relationships.
Nodes are similar to table rows or JSON documents.
Relationships are just as important as nodes, and are exposed directly in the query language.
Composite objects, such as a person with multiple phone numbers, tend to be broken into separate, smaller
nodes, combined with traversable relationships.
Examples
Organization charts
Social graphs
Fraud detection
Recommendation engines
Data analytics
Data analytics stores provide massively parallel solutions for ingesting, storing, and analyzing data. The data is
distributed across multiple servers to maximize scalability. Large data file formats such as delimited files (CSV),
parquet, and ORC are widely used in data analytics. Historical data is typically stored in data stores such as blob
storage or Azure Data Lake Storage Gen2, which are then accessed by Azure Synapse, Databricks, or HDInsight as
external tables. A typical scenario that uses data stored as parquet files for performance is described in the article Use
external tables with Synapse SQL.
Azure services
Azure Synapse Analytics | (Security Baseline)
Azure Data Lake | (Security Baseline)
Azure Data Explorer | (Security Baseline)
Azure Analysis Services
HDInsight | (Security Baseline)
Azure Databricks | (Security Baseline)
Workload
Data analytics
Enterprise BI
Data type
Historical data from multiple sources.
Usually denormalized in a "star" or "snowflake" schema, consisting of fact and dimension tables.
Usually loaded with new data on a scheduled basis.
Dimension tables often include multiple historic versions of an entity, referred to as a slowly changing
dimension.
Examples
Enterprise data warehouse
Column-family databases
A column-family database organizes data into rows and columns. In its simplest form, a column-family database
can appear very similar to a relational database, at least conceptually. The real power of a column-family database
lies in its denormalized approach to structuring sparse data.
You can think of a column-family database as holding tabular data with rows and columns, but the columns are
divided into groups known as column families. Each column family holds a set of columns that are logically related
together and are typically retrieved or manipulated as a unit. Other data that is accessed separately can be stored in
separate column families. Within a column family, new columns can be added dynamically, and rows can be sparse
(that is, a row doesn't need to have a value for every column).
The following diagram shows an example with two column families, Identity and Contact Info . The data for a
single entity has the same row key in each column-family. This structure, where the rows for any given object in a
column family can vary dynamically, is an important benefit of the column-family approach, making this form of
data store highly suited for storing structured, volatile data.
Unlike a key/value store or a document database, most column-family databases store data in key order, rather than
by computing a hash. Many implementations allow you to create indexes over specific columns in a column-family.
Indexes let you retrieve data by column value, rather than by row key.
Read and write operations for a row are usually atomic within a single column-family, although some
implementations provide atomicity across the entire row, spanning multiple column-families.
Azure services
Azure Cosmos DB Cassandra API | (Security Baseline)
HBase in HDInsight | (Security Baseline)
Workload
Most column-family databases perform write operations extremely quickly.
Update and delete operations are rare.
Designed to provide high throughput and low-latency access.
Supports easy query access to a particular set of fields within a much larger record.
Massively scalable.
Data type
Data is stored in tables consisting of a key column and one or more column families.
Specific columns can vary by individual rows.
Individual cells are accessed via get and put commands.
Multiple rows are returned using a scan command.
Examples
Recommendations
Personalization
Sensor data
Telemetry
Messaging
Social media analytics
Web analytics
Activity monitoring
Weather and other time-series data
Object storage
Object storage is optimized for storing and retrieving large binary objects (images, files, video and audio streams,
large application data objects and documents, virtual machine disk images). Large data files are also popularly used
in this model, for example, delimited files (CSV), parquet, and ORC. Object stores can manage extremely large
amounts of unstructured data.
Azure service
Azure Blob Storage | (Security Baseline)
Azure Data Lake Storage Gen2 | (Security Baseline)
Workload
Identified by key.
Content is typically an asset such as a delimited file, image, or video file.
Content must be durable and external to any application tier.
Data type
Data size is large.
Value is opaque.
Examples
Images, videos, office documents, PDFs
Static HTML, JSON, CSS
Log and audit files
Database backups
Shared files
Sometimes, using simple flat files can be the most effective means of storing and retrieving information. Using file
shares enables files to be accessed across a network. Given appropriate security and concurrent access control
mechanisms, sharing data in this way can enable distributed services to provide highly scalable data access for
performing basic, low-level operations such as simple read and write requests.
Azure service
Azure Files | (Security Baseline)
Workload
Migration from existing apps that interact with the file system.
Requires SMB interface.
Data type
Files in a hierarchical set of folders.
Accessible with standard I/O libraries.
Examples
Legacy files
Shared content accessible among a number of VMs or app instances
Aided with this understanding of different data storage models, the next step is to evaluate your workload and
application, and decide which data store will meet your specific needs. Use the data storage decision tree to help
with this process.
Select an Azure data store for your application
12/18/2020 • 2 minutes to read • Edit Online
Azure offers a number of managed data storage solutions, each providing different features and capabilities. This
article will help you to choose a data store for your application.
If your application consists of multiple workloads, evaluate each workload separately. A complete solution may
incorporate multiple data stores.
Use the following flowchart to select a candidate data store.
[Flowchart summary (candidate data store by scenario):]
Cassandra workload: Cosmos DB Cassandra API
Need SMB interface: Azure Files
MongoDB workload: Cosmos DB MongoDB API
Archive data: Blob Storage (cool or archive access tier)
Search index data: Azure Search
Time series data: Time Series Insights
Graph data: Cosmos DB Graph API
Transient data: Azure Cache for Redis
Otherwise: Cosmos DB SQL API
The output from this flowchart is a starting point for consideration. Next, perform a more detailed evaluation of
the data store to see if it meets your needs. Refer to Criteria for choosing a data store to aid in this evaluation.
Criteria for choosing a data store
12/18/2020 • 3 minutes to read • Edit Online
This article describes the comparison criteria you should use when evaluating a data store. The goal is to help you
determine which data storage types can meet your solution's requirements.
General considerations
Keep the following considerations in mind when making your selection.
Functional requirements
Data format . What type of data are you intending to store? Common types include transactional data,
JSON objects, telemetry, search indexes, or flat files.
Data size . How large are the entities you need to store? Will these entities need to be maintained as a
single document, or can they be split across multiple documents, tables, collections, and so forth?
Scale and structure . What is the overall amount of storage capacity you need? Do you anticipate
partitioning your data?
Data relationships . Will your data need to support one-to-many or many-to-many relationships? Are
relationships themselves an important part of the data? Will you need to join or otherwise combine data
from within the same dataset, or from external datasets?
Consistency model . How important is it for updates made in one node to appear in other nodes, before
further changes can be made? Can you accept eventual consistency? Do you need ACID guarantees for
transactions?
Schema flexibility . What kind of schemas will you apply to your data? Will you use a fixed schema, a
schema-on-write approach, or a schema-on-read approach?
Concurrency . What kind of concurrency mechanism do you want to use when updating and synchronizing
data? Will the application perform many updates that could potentially conflict? If so, you may require
record locking and pessimistic concurrency control. Alternatively, can you support optimistic concurrency
controls? If so, is simple timestamp-based concurrency control enough, or do you need the added
functionality of multi-version concurrency control?
Data movement . Will your solution need to perform ETL tasks to move data to other stores or data
warehouses?
Data lifecycle . Is the data write-once, read-many? Can it be moved into cool or cold storage?
Other suppor ted features . Do you need any other specific features, such as schema validation,
aggregation, indexing, full-text search, MapReduce, or other query capabilities?
Non-functional requirements
Performance and scalability . What are your data performance requirements? Do you have specific
requirements for data ingestion rates and data processing rates? What are the acceptable response times
for querying and aggregation of data once ingested? How large will you need the data store to scale up? Is
your workload more read-heavy or write-heavy?
Reliability . What overall SLA do you need to support? What level of fault-tolerance do you need to provide
for data consumers? What kind of backup and restore capabilities do you need?
Replication . Will your data need to be distributed among multiple replicas or regions? What kind of data
replication capabilities do you require?
Limits . Will the limits of a particular data store support your requirements for scale, number of
connections, and throughput?
Management and cost
Managed service. When possible, use a managed data service, unless you require specific capabilities that
can only be found in an IaaS-hosted data store.
Region availability. For managed services, is the service available in all Azure regions? Does your solution
need to be hosted in certain Azure regions?
Portability. Will your data need to be migrated to on-premises, external datacenters, or other cloud hosting
environments?
Licensing. Do you have a preference for a proprietary versus OSS license type? Are there any other external
restrictions on what type of license you can use?
Overall cost. What is the overall cost of using the service within your solution? How many instances will
need to run, to support your uptime and throughput requirements? Consider operations costs in this
calculation. One reason to prefer managed services is the reduced operational cost.
Cost effectiveness. Can you partition your data, to store it more cost effectively? For example, can you
move large objects out of an expensive relational database into an object store?
Security
Security. What type of encryption do you require? Do you need encryption at rest? What authentication
mechanism do you want to use to connect to your data?
Auditing. What kind of audit log do you need to generate?
Networking requirements. Do you need to restrict or otherwise manage access to your data from other
network resources? Does data need to be accessible only from inside the Azure environment? Does the data
need to be accessible from specific IP addresses or subnets? Does it need to be accessible from applications
or services hosted on-premises or in other external datacenters?
DevOps
Skill set. Are there particular programming languages, operating systems, or other technology that your
team is particularly adept at using? Are there others that would be difficult for your team to work with?
Clients. Is there good client support for your development languages?
Overview of load-balancing options in Azure
12/18/2020 • 4 minutes to read • Edit Online
The term load balancing refers to the distribution of workloads across multiple computing resources. Load
balancing aims to optimize resource use, maximize throughput, minimize response time, and avoid overloading
any single resource. It can also improve availability by sharing a workload across redundant computing resources.
Overview
Azure load balancing services can be categorized along two dimensions: global versus regional, and HTTP(S) versus
non-HTTP(S).
Global versus regional
Global load-balancing services distribute traffic across regional backends, clouds, or hybrid on-premises
services. These services route end-user traffic to the closest available backend. They also react to changes in
service reliability or performance, in order to maximize availability and performance. You can think of them
as systems that load balance between application stamps, endpoints, or scale-units hosted across different
regions/geographies.
Regional load-balancing services distribute traffic within virtual networks across virtual machines (VMs) or
zonal and zone-redundant service endpoints within a region. You can think of them as systems that load
balance between VMs, containers, or clusters within a region in a virtual network.
HTTP(S) versus non-HTTP(S)
HTTP(S) load-balancing services are Layer 7 load balancers that only accept HTTP(S) traffic. They are
intended for web applications or other HTTP(S) endpoints. They include features such as SSL offload, web
application firewall, path-based load balancing, and session affinity.
Non-HTTP/S load-balancing services can handle non-HTTP(S) traffic and are recommended for non-web
workloads.
The following table summarizes the Azure load balancing services by these categories:
Azure Front Door: global, HTTP(S)
Traffic Manager: global, non-HTTP(S)
Application Gateway: regional, HTTP(S)
Azure Load Balancer: regional, non-HTTP(S)
Traffic Manager is a DNS-based traffic load balancer that enables you to distribute traffic optimally to services
across global Azure regions, while providing high availability and responsiveness. Because Traffic Manager is a
DNS-based load-balancing service, it load balances only at the domain level. For that reason, it can't fail over as
quickly as Front Door, because of common challenges around DNS caching and systems not honoring DNS TTLs.
Application Gateway provides application delivery controller (ADC) as a service, offering various Layer 7 load-
balancing capabilities. Use it to optimize web farm productivity by offloading CPU-intensive SSL termination to the
gateway.
Azure Load Balancer is a high-performance, ultra low-latency Layer 4 load-balancing service (inbound and
outbound) for all UDP and TCP protocols. It is built to handle millions of requests per second while ensuring your
solution is highly available. Azure Load Balancer is zone-redundant, ensuring high availability across Availability
Zones.
This article describes the different types of messages and the entities that participate in a messaging infrastructure.
Based on the requirements of each message type, the article recommends Azure messaging services. The options
include Azure Service Bus, Event Grid, and Event Hubs.
At an architectural level, a message is a datagram created by an entity (producer), to distribute information so that
other entities (consumers) can be aware and act accordingly. The producer and the consumer can communicate
directly or optionally through an intermediary entity (message broker). This article focuses on asynchronous
messaging using a message broker.
Messages can be classified into two main categories. If the producer expects an action from the consumer, that
message is a command. If the message informs the consumer that an action has taken place, then the message is
an event.
Commands
The producer sends a command with the intent that the consumer(s) will perform an operation within the scope of
a business transaction.
A command is a high-value message and must be delivered at least once. If a command is lost, the entire business
transaction might fail. Also, a command shouldn't be processed more than once. Doing so might cause an
erroneous transaction. A customer might get duplicate orders or billed twice.
Commands are often used to manage the workflow of a multistep business transaction. Depending on the
business logic, the producer may expect the consumer to acknowledge the message and report the results of the
operation. Based on that result, the producer may choose an appropriate course of action.
Events
An event is a type of message that a producer raises to announce facts.
The producer (known as the publisher in this context) has no expectations that the events will result in any action.
Interested consumer(s) can subscribe, listen for events, and take actions depending on their consumption scenario.
Events can have multiple subscribers or no subscribers at all. Two different subscribers can react to an event with
different actions and not be aware of one another.
The producer and consumer are loosely coupled and managed independently. The consumer isn't expected to
acknowledge the event back to the producer. A consumer that is no longer interested in the events can
unsubscribe. The consumer is removed from the pipeline without affecting the producer or the overall functionality
of the system.
There are two categories of events:
The producer raises events to announce discrete facts. A common use case is event notification. For
example, Azure Resource Manager raises events when it creates, modifies, or deletes resources. A subscriber
of those events could be a Logic App that sends alert emails.
The producer raises related events in a sequence, or a stream of events, over a period of time. Typically, a
stream is consumed for statistical evaluation. The evaluation can be done within a temporal window or as
events arrive. Telemetry is a common use case, for example, health and load monitoring of a system.
Another case is event streaming from IoT devices.
A common pattern for implementing event messaging is the Publisher-Subscriber pattern.
The Competing Consumers Pattern explains how to process multiple messages concurrently to optimize
throughput, to improve scalability and availability, and to balance the workload.
Load leveling
The volume of messages generated by the producer or a group of producers can be variable. At times there might
be a large volume causing spikes in messages. Instead of adding consumers to handle this work, a message broker
can act as a buffer, and consumers gradually drain messages at their own pace without stressing the system.
Storage services can also offer additional features for analyzing events. For example, by taking advantage of
the access tiers of a blob storage account, you can store events in a hot tier for data that needs frequent access.
You might use that data for visualization. Alternately, you can store data in the archive tier and retrieve it
occasionally for auditing purposes.
Capture stores all events ingested by Event Hubs and is useful for batch processing. You can generate reports on
the data by using a MapReduce function. Captured data can also serve as the source of truth. If certain facts were
missed while aggregating the data, you can refer to the captured data.
For details about this feature, see Capture events through Azure Event Hubs in Azure Blob Storage or Azure Data
Lake Storage.
Support for Apache Kafka clients
Event Hubs provides an endpoint for Apache Kafka clients. Existing clients can update their configuration to point
to the endpoint and start sending events to Event Hubs. No code changes are required.
For more information, see Event Hubs for Apache Kafka.
Crossover scenarios
In some cases, it's advantageous to combine two messaging services.
Combining services can increase the efficiency of your messaging system. For instance, in your business
transaction, you use Azure Service Bus queues to handle messages. Queues that are mostly idle and receive
messages occasionally are inefficient because the consumer is constantly polling the queue for new messages. You
can set up an Event Grid subscription with an Azure Function as the event handler. Each time the queue receives a
message and there are no consumers listening, Event Grid sends a notification, which invokes the Azure Function
that drains the queue.
For details about connecting Service Bus to Event Grid, see Azure Service Bus to Event Grid integration overview.
The Enterprise integration on Azure using message queues and events reference architecture shows an
implementation of Service Bus to Event Grid integration.
Here's another example. Event Grid receives a set of events in which some events require a workflow while others
are for notification. The message metadata indicates the type of event. One way is to check the metadata by using
the filtering feature in the event subscription. If it requires a workflow, Event Grid sends it to Azure Service Bus
queue. The receivers of that queue can take necessary actions. The notification events are sent to Logic Apps to
send alert emails.
Related patterns
Consider these patterns when implementing asynchronous messaging:
Competing Consumers Pattern. Multiple consumers may need to compete to read messages from a queue. This
pattern explains how to process multiple messages concurrently to optimize throughput, to improve scalability
and availability, and to balance the workload.
Priority Queue Pattern. For cases where the business logic requires that some messages are processed before
others, this pattern describes how messages posted by a producer that have a higher priority can be received
and processed more quickly by a consumer than messages of a lower priority.
Queue-based Load Leveling Pattern. This pattern uses a message broker to act as a buffer between a producer
and a consumer to help to minimize the impact on availability and responsiveness of intermittent heavy loads
for both those entities.
Retry Pattern. A producer or consumer might be unable to connect to a queue, but the reasons for this failure may
be temporary and quickly pass. This pattern describes how to handle this situation to add resiliency to an
application.
Scheduler Agent Supervisor Pattern. Messaging is often used as part of a workflow implementation. This
pattern demonstrates how messaging can coordinate a set of actions across a distributed set of services and
other remote resources, and enable a system to recover and retry actions that fail.
Choreography pattern. This pattern shows how services can use messaging to control the workflow of a
business transaction.
Claim-Check Pattern. This pattern shows how to split a large message into a claim check and a payload.
Web API design
12/18/2020 • 28 minutes to read • Edit Online
Most modern web applications expose APIs that clients can use to interact with the application. A well-designed
web API should aim to support:
Platform independence . Any client should be able to call the API, regardless of how the API is
implemented internally. This requires using standard protocols, and having a mechanism whereby the client
and the web service can agree on the format of the data to exchange.
Ser vice evolution . The web API should be able to evolve and add functionality independently from client
applications. As the API evolves, existing client applications should continue to function without
modification. All functionality should be discoverable so that client applications can fully use it.
This guidance describes issues that you should consider when designing a web API.
Introduction to REST
In 2000, Roy Fielding proposed Representational State Transfer (REST) as an architectural approach to designing
web services. REST is an architectural style for building distributed systems based on hypermedia. REST is
independent of any underlying protocol and is not necessarily tied to HTTP. However, most common REST
implementations use HTTP as the application protocol, and this guide focuses on designing REST APIs for HTTP.
A primary advantage of REST over HTTP is that it uses open standards, and does not bind the implementation of
the API or the client applications to any specific implementation. For example, a REST web service could be written
in ASP.NET, and client applications can use any language or toolset that can generate HTTP requests and parse
HTTP responses.
Here are some of the main design principles of RESTful APIs using HTTP:
REST APIs are designed around resources, which are any kind of object, data, or service that can be
accessed by the client.
A resource has an identifier, which is a URI that uniquely identifies that resource. For example, the URI for a
particular customer order might be:
https://adventure-works.com/orders/1
Clients interact with a service by exchanging representations of resources. Many web APIs use JSON as the
exchange format. For example, a GET request to the URI listed above might return this response body:
{"orderId":1,"orderValue":99.90,"productId":1,"quantity":1}
REST APIs use a uniform interface, which helps to decouple the client and service implementations. For
REST APIs built on HTTP, the uniform interface includes using standard HTTP verbs to perform operations
on resources. The most common operations are GET, POST, PUT, PATCH, and DELETE.
REST APIs use a stateless request model. HTTP requests should be independent and may occur in any order,
so keeping transient state information between requests is not feasible. The only place where information is
stored is in the resources themselves, and each request should be an atomic operation. This constraint
enables web services to be highly scalable, because there is no need to retain any affinity between clients
and specific servers. Any server can handle any request from any client. That said, other factors can limit
scalability. For example, many web services write to a backend data store, which may be hard to scale out.
For more information about strategies to scale out a data store, see Horizontal, vertical, and functional data
partitioning.
REST APIs are driven by hypermedia links that are contained in the representation. For example, the
following shows a JSON representation of an order. It contains links to get or update the customer
associated with the order.
{
"orderID":3,
"productID":2,
"quantity":4,
"orderValue":16.60,
"links": [
{"rel":"product","href":"https://adventure-works.com/customers/3", "action":"GET" },
{"rel":"product","href":"https://adventure-works.com/customers/3", "action":"PUT" }
]
}
In 2008, Leonard Richardson proposed the following maturity model for web APIs:
Level 0: Define one URI, and all operations are POST requests to this URI.
Level 1: Create separate URIs for individual resources.
Level 2: Use HTTP methods to define operations on resources.
Level 3: Use hypermedia (HATEOAS, described below).
Level 3 corresponds to a truly RESTful API according to Fielding's definition. In practice, many published web APIs
fall somewhere around level 2.
Organize the API design around resources
Focus on the business entities that the web API exposes, and base resource URIs on nouns (the resource) rather than verbs (the operations on the resource):
https://adventure-works.com/orders // Good
https://adventure-works.com/create-order // Avoid
A resource doesn't have to be based on a single physical data item. For example, an order resource might be
implemented internally as several tables in a relational database, but presented to the client as a single entity.
Avoid creating APIs that simply mirror the internal structure of a database. The purpose of REST is to model
entities and the operations that an application can perform on those entities. A client should not be exposed to the
internal implementation.
Entities are often grouped together into collections (orders, customers). A collection is a separate resource from
the item within the collection, and should have its own URI. For example, the following URI might represent the
collection of orders:
https://adventure-works.com/orders
Sending an HTTP GET request to the collection URI retrieves a list of items in the collection. Each item in the
collection also has its own unique URI. An HTTP GET request to the item's URI returns the details of that item.
Adopt a consistent naming convention in URIs. In general, it helps to use plural nouns for URIs that reference
collections. It's a good practice to organize URIs for collections and items into a hierarchy. For example,
/customers is the path to the customers collection, and /customers/5 is the path to the customer with ID equal to
5. This approach helps to keep the web API intuitive. Also, many web API frameworks can route requests based on
parameterized URI paths, so you could define a route for the path /customers/{id} .
Also consider the relationships between different types of resources and how you might expose these
associations. For example, the /customers/5/orders might represent all of the orders for customer 5. You could
also go in the other direction, and represent the association from an order back to a customer with a URI such as
/orders/99/customer . However, extending this model too far can become cumbersome to implement. A better
solution is to provide navigable links to associated resources in the body of the HTTP response message. This
mechanism is described in more detail in the section Use HATEOAS to enable navigation to related resources.
In more complex systems, it can be tempting to provide URIs that enable a client to navigate through several levels
of relationships, such as /customers/1/orders/99/products . However, this level of complexity can be difficult to
maintain and is inflexible if the relationships between resources change in the future. Instead, try to keep URIs
relatively simple. Once an application has a reference to a resource, it should be possible to use this reference to
find items related to that resource. The preceding query can be replaced with the URI /customers/1/orders to find
all the orders for customer 1, and then /orders/99/products to find the products in this order.
TIP
Avoid requiring resource URIs more complex than collection/item/collection.
Another factor is that all web requests impose a load on the web server. The more requests, the bigger the load.
Therefore, try to avoid "chatty" web APIs that expose a large number of small resources. Such an API may require
a client application to send multiple requests to find all of the data that it requires. Instead, you might want to
denormalize the data and combine related information into bigger resources that can be retrieved with a single
request. However, you need to balance this approach against the overhead of fetching data that the client doesn't
need. Retrieving large objects can increase the latency of a request and incur additional bandwidth costs. For more
information about these performance antipatterns, see Chatty I/O and Extraneous Fetching.
Avoid introducing dependencies between the web API and the underlying data sources. For example, if your data
is stored in a relational database, the web API doesn't need to expose each table as a collection of resources. In fact,
that's probably a poor design. Instead, think of the web API as an abstraction of the database. If necessary,
introduce a mapping layer between the database and the web API. That way, client applications are isolated from
changes to the underlying database scheme.
Finally, it might not be possible to map every operation implemented by a web API to a specific resource. You can
handle such non-resource scenarios through HTTP requests that invoke a function and return the results as an
HTTP response message. For example, a web API that implements simple calculator operations such as add and
subtract could provide URIs that expose these operations as pseudo resources and use the query string to specify
the parameters required. For example, a GET request to the URI /add?operand1=99&operand2=1 would return a
response message with the body containing the value 100. However, only use these forms of URIs sparingly.
The following table summarizes the common conventions adopted by most RESTful implementations, mapping HTTP methods onto resource URIs:
Resource | POST | GET | PUT | DELETE
/customers | Create a new customer | Retrieve all customers | Bulk update of customers | Remove all customers
/customers/1 | Error | Retrieve the details for customer 1 | Update the details of customer 1 if it exists | Remove customer 1
/customers/1/orders | Create a new order for customer 1 | Retrieve all orders for customer 1 | Bulk update of orders for customer 1 | Remove all orders for customer 1
{"Id":1,"Name":"Gizmo","Category":"Widgets","Price":1.99}
If the server doesn't support the media type, it should return HTTP status code 415 (Unsupported Media Type).
A client request can include an Accept header that contains a list of media types the client will accept from the
server in the response message. For example:
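(The following request is illustrative; the URI and the exact list of media types are placeholders.)
GET https://adventure-works.com/orders/2 HTTP/1.1
Accept: application/json, application/xml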
If the server cannot match any of the media type(s) listed, it should return HTTP status code 406 (Not Acceptable).
GET methods
A successful GET method typically returns HTTP status code 200 (OK). If the resource cannot be found, the method
should return 404 (Not Found).
POST methods
If a POST method creates a new resource, it returns HTTP status code 201 (Created). The URI of the new resource
is included in the Location header of the response. The response body contains a representation of the resource.
If the method does some processing but does not create a new resource, the method can return HTTP status code
200 and include the result of the operation in the response body. Alternatively, if there is no result to return, the
method can return HTTP status code 204 (No Content) with no response body.
If the client puts invalid data into the request, the server should return HTTP status code 400 (Bad Request). The
response body can contain additional information about the error or a link to a URI that provides more details.
PUT methods
If a PUT method creates a new resource, it returns HTTP status code 201 (Created), as with a POST method. If the
method updates an existing resource, it returns either 200 (OK) or 204 (No Content). In some cases, it might not
be possible to update an existing resource. In that case, consider returning HTTP status code 409 (Conflict).
Consider implementing bulk HTTP PUT operations that can batch updates to multiple resources in a collection. The
PUT request should specify the URI of the collection, and the request body should specify the details of the
resources to be modified. This approach can help to reduce chattiness and improve performance.
PATCH methods
With a PATCH request, the client sends a set of updates to an existing resource, in the form of a patch document.
The server processes the patch document to perform the update. The patch document doesn't describe the whole
resource, only a set of changes to apply. The specification for the PATCH method (RFC 5789) doesn't define a
particular format for patch documents. The format must be inferred from the media type in the request.
JSON is probably the most common data format for web APIs. There are two main JSON-based patch formats,
called JSON patch and JSON merge patch.
JSON merge patch is somewhat simpler. The patch document has the same structure as the original JSON
resource, but includes just the subset of fields that should be changed or added. In addition, a field can be deleted
by specifying null for the field value in the patch document. (That means merge patch is not suitable if the
original resource can have explicit null values.)
For example, suppose the original resource has the following JSON representation:
{
"name":"gizmo",
"category":"widgets",
"color":"blue",
"price":10
}
Here is a possible JSON merge patch document for this resource:
{
"price":12,
"color":null,
"size":"small"
}
This tells the server to update price , delete color , and add size , while name and category are not modified.
For the exact details of JSON merge patch, see RFC 7396. The media type for JSON merge patch is
application/merge-patch+json .
Merge patch is not suitable if the original resource can contain explicit null values, due to the special meaning of
null in the patch document. Also, the patch document doesn't specify the order that the server should apply the
updates. That may or may not matter, depending on the data and the domain. JSON patch, defined in RFC 6902, is
more flexible. It specifies the changes as a sequence of operations to apply. Operations include add, remove,
replace, copy, and test (to validate values). The media type for JSON patch is application/json-patch+json .
Here are some typical error conditions that might be encountered when processing a PATCH request, along with
the appropriate HTTP status code.
Error condition | HTTP status code
The patch document format isn't supported. | 415 (Unsupported Media Type)
The patch document is valid, but the changes can't be applied to the resource in its current state. | 409 (Conflict)
DELETE methods
If the delete operation is successful, the web server should respond with HTTP status code 204, indicating that the
process has been successfully handled, but that the response body contains no further information. If the resource
doesn't exist, the web server can return HTTP 404 (Not Found).
Asynchronous operations
Sometimes a POST, PUT, PATCH, or DELETE operation might require processing that takes a while to complete. If
you wait for completion before sending a response to the client, it may cause unacceptable latency. If so, consider
making the operation asynchronous. Return HTTP status code 202 (Accepted) to indicate the request was accepted
for processing but is not completed.
You should expose an endpoint that returns the status of an asynchronous request, so the client can monitor the
status by polling the status endpoint. Include the URI of the status endpoint in the Location header of the 202
response. For example:
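(Illustrative response; the status URI matches the one used in the status example below.)
HTTP/1.1 202 Accepted
Location: /api/status/12345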
If the client sends a GET request to this endpoint, the response should contain the current status of the request.
Optionally, it could also include an estimated time to completion or a link to cancel the operation.
HTTP/1.1 200 OK
Content-Type: application/json
{
"status":"In progress",
"link": { "rel":"cancel", "method":"delete", "href":"/api/status/12345" }
}
If the asynchronous operation creates a new resource, the status endpoint should return status code 303 (See
Other) after the operation completes. In the 303 response, include a Location header that gives the URI of the new
resource:
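(Illustrative response; the resource URI is a placeholder.)
HTTP/1.1 303 See Other
Location: /api/orders/12345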
/orders?limit=25&offset=50
Also consider imposing an upper limit on the number of items returned, to help prevent Denial of Service attacks.
To assist client applications, GET requests that return paginated data should also include some form of metadata
that indicates the total number of resources available in the collection.
You can use a similar strategy to sort data as it is fetched, by providing a sort parameter that takes a field name as
the value, such as /orders?sort=ProductID. However, this approach can have a negative effect on caching, because
query string parameters form part of the resource identifier used by many cache implementations as the key to
cached data.
You can extend this approach to limit the fields returned for each item, if each item contains a large amount of
data. For example, you could use a query string parameter that accepts a comma-delimited list of fields, such as
/orders?fields=ProductID,Quantity.
Give all optional parameters in query strings meaningful defaults. For example, set the limit parameter to 10
and the offset parameter to 0 if you implement pagination, set the sort parameter to the key of the resource if
you implement ordering, and set the fields parameter to all fields in the resource if you support projections.
Support partial responses for large binary resources
A resource may contain large binary fields, such as images or files. To allow such a resource to be retrieved in chunks, a client application can first submit a HEAD request to obtain the size of the resource and determine whether the server supports partial requests. A HEAD request returns only the headers that describe the resource, with an empty message body. For example:
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Type: image/jpeg
Content-Length: 4580
The Content-Length header gives the total size of the resource, and the Accept-Ranges header indicates that the
corresponding GET operation supports partial results. The client application can use this information to retrieve
the image in smaller chunks. The first request fetches the first 2500 bytes by using the Range header:
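(Illustrative request; the URI is a placeholder, and the byte range matches the response shown below.)
GET https://adventure-works.com/images/10 HTTP/1.1
Range: bytes=0-2499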
The response message indicates that this is a partial response by returning HTTP status code 206. The Content-
Length header specifies the actual number of bytes returned in the message body (not the size of the resource),
and the Content-Range header indicates which part of the resource this is (bytes 0-2499 out of 4580):
HTTP/1.1 206 Partial Content
Accept-Ranges: bytes
Content-Type: image/jpeg
Content-Length: 2500
Content-Range: bytes 0-2499/4580
[...]
A subsequent request from the client application can retrieve the remainder of the resource.
Use HATEOAS to enable navigation to related resources
One of the primary motivations behind REST is that it should be possible to navigate the entire set of resources
without requiring prior knowledge of the URI scheme. Each HTTP GET request should return the information
necessary to find the resources related directly to the requested object through hyperlinks included in the
response, and it should also be provided with information that describes the operations available on each of these
resources. This principle is known as HATEOAS, or Hypertext as the Engine of Application State. The system is
effectively a finite state machine, and the response to each request contains the information necessary to move
from one state to another; no other information should be necessary.
NOTE
Currently there are no general-purpose standards that define how to model the HATEOAS principle. The examples shown in
this section illustrate one possible, proprietary solution.
For example, to handle the relationship between an order and a customer, the representation of an order could
include links that identify the available operations for the customer of the order. Here is a possible representation:
{
"orderID":3,
"productID":2,
"quantity":4,
"orderValue":16.60,
"links":[
{
"rel":"customer",
"href":"https://adventure-works.com/customers/3",
"action":"GET",
"types":["text/xml","application/json"]
},
{
"rel":"customer",
"href":"https://adventure-works.com/customers/3",
"action":"PUT",
"types":["application/x-www-form-urlencoded"]
},
{
"rel":"customer",
"href":"https://adventure-works.com/customers/3",
"action":"DELETE",
"types":[]
},
{
"rel":"self",
"href":"https://adventure-works.com/orders/3",
"action":"GET",
"types":["text/xml","application/json"]
},
{
"rel":"self",
"href":"https://adventure-works.com/orders/3",
"action":"PUT",
"types":["application/x-www-form-urlencoded"]
},
{
"rel":"self",
"href":"https://adventure-works.com/orders/3",
"action":"DELETE",
"types":[]
}]
}
In this example, the links array has a set of links. Each link represents an operation on a related entity. The data
for each link includes the relationship ("customer"), the URI ( https://adventure-works.com/customers/3 ), the HTTP
method, and the supported MIME types. This is all the information that a client application needs to be able to
invoke the operation.
The links array also includes self-referencing information about the resource itself that has been retrieved. These
have the relationship self.
The set of links that are returned may change, depending on the state of the resource. This is what is meant by
hypertext being the "engine of application state."
Versioning a RESTful web API
A web API rarely stays static. As business requirements change, new resources may be added and the structure of existing resources may be amended. Versioning enables a web API to evolve while allowing existing client applications to continue working. In the simplest case, an API is not versioned at all; a request for an existing customer resource simply returns the current representation:
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
NOTE
For simplicity, the example responses shown in this section do not include HATEOAS links.
If the DateCreated field is added to the schema of the customer resource, then the response would look like this:
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
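For illustration, the body of such a response might look like the following (the field values follow the version-2 example later in this section; the unstructured address format is assumed):
{"id":3,"name":"Contoso LLC","dateCreated":"2014-09-04T12:11:38.0376089Z","address":"1 Microsoft Way Redmond WA 98053"}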
Existing client applications might continue functioning correctly if they are capable of ignoring unrecognized fields,
while new client applications can be designed to handle this new field. However, if more radical changes to the
schema of resources occur (such as removing or renaming fields), or if the relationships between resources change,
then these may constitute breaking changes that prevent existing client applications from functioning correctly. In
these situations, you should consider one of the following approaches.
URI versioning
Each time you modify the web API or change the schema of resources, you add a version number to the URI for
each resource. The previously existing URIs should continue to operate as before, returning resources that
conform to their original schema.
Extending the previous example, if the address field is restructured into subfields containing each constituent part
of the address (such as streetAddress , city , state , and zipCode ), this version of the resource could be exposed
through a URI containing a version number, such as https://adventure-works.com/v2/customers/3 :
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
{"id":3,"name":"Contoso LLC","dateCreated":"2014-09-04T12:11:38.0376089Z","address":{"streetAddress":"1
Microsoft Way","city":"Redmond","state":"WA","zipCode":98053}}
This versioning mechanism is very simple but depends on the server routing the request to the appropriate
endpoint. However, it can become unwieldy as the web API matures through several iterations and the server has
to support a number of different versions. Also, from a purist's point of view, in all cases the client applications are
fetching the same data (customer 3), so the URI should not really be different depending on the version. This
scheme also complicates implementation of HATEOAS as all links will need to include the version number in their
URIs.
Query string versioning
Rather than providing multiple URIs, you can specify the version of the resource by using a parameter within the
query string appended to the HTTP request, such as https://adventure-works.com/customers/3?version=2 . The
version parameter should default to a meaningful value such as 1 if it is omitted by older client applications.
This approach has the semantic advantage that the same resource is always retrieved from the same URI, but it
depends on the code that handles the request to parse the query string and send back the appropriate HTTP
response. This approach also suffers from the same complications for implementing HATEOAS as the URI
versioning mechanism.
NOTE
Some older web browsers and web proxies will not cache responses for requests that include a query string in the URI. This
can degrade performance for web applications that use a web API and that run from within such a web browser.
Header versioning
Rather than appending the version number as a query string parameter, you could implement a custom header
that indicates the version of the resource. This approach requires that the client application adds the appropriate
header to any requests, although the code handling the client request could use a default value (version 1) if the
version header is omitted. The following examples use a custom header named Custom-Header. The value of this
header indicates the version of web API.
Version 1:
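(A request might look like the following; the header value format shown is illustrative.)
GET https://adventure-works.com/customers/3 HTTP/1.1
Custom-Header: api-version=1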
Version 2:
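(Again illustrative; only the header value changes.)
GET https://adventure-works.com/customers/3 HTTP/1.1
Custom-Header: api-version=2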
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
{"id":3,"name":"Contoso LLC","dateCreated":"2014-09-04T12:11:38.0376089Z","address":{"streetAddress":"1
Microsoft Way","city":"Redmond","state":"WA","zipCode":98053}}
As with the previous two approaches, implementing HATEOAS requires including the appropriate custom header
in any links.
Media type versioning
When a client application sends an HTTP GET request to a web server it should stipulate the format of the content
that it can handle by using an Accept header, as described earlier in this guidance. Frequently the purpose of the
Accept header is to allow the client application to specify whether the body of the response should be XML, JSON,
or some other common format that the client can parse. However, it is possible to define custom media types that
include information enabling the client application to indicate which version of a resource it is expecting. The
following example shows a request that specifies an Accept header with the value application/vnd.adventure-
works.v1+json. The vnd.adventure-works.v1 element indicates to the web server that it should return version 1 of
the resource, while the json element specifies that the format of the response body should be JSON:
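(Illustrative request using the custom media type described above.)
GET https://adventure-works.com/customers/3 HTTP/1.1
Accept: application/vnd.adventure-works.v1+json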
The code handling the request is responsible for processing the Accept header and honoring it as far as possible
(the client application may specify multiple formats in the Accept header, in which case the web server can choose
the most appropriate format for the response body). The web server confirms the format of the data in the
response body by using the Content-Type header:
HTTP/1.1 200 OK
Content-Type: application/vnd.adventure-works.v1+json; charset=utf-8
If the Accept header does not specify any known media types, the web server could generate an HTTP 406 (Not
Acceptable) response message or return a message with a default media type.
This approach is arguably the purest of the versioning mechanisms and lends itself naturally to HATEOAS, which
can include the MIME type of related data in resource links.
NOTE
When you select a versioning strategy, you should also consider the implications on performance, especially caching on the
web server. The URI versioning and Query String versioning schemes are cache-friendly inasmuch as the same URI/query
string combination refers to the same data each time.
The Header versioning and Media Type versioning mechanisms typically require additional logic to examine the values in the
custom header or the Accept header. In a large-scale environment, many clients using different versions of a web API can
result in a significant amount of duplicated data in a server-side cache. This issue can become acute if a client application
communicates with a web server through a proxy that implements caching, and that only forwards a request to the web
server if it does not currently hold a copy of the requested data in its cache.
More information
Microsoft REST API guidelines. Detailed recommendations for designing public REST APIs.
Web API checklist. A useful list of items to consider when designing and implementing a web API.
Open API Initiative. Documentation and implementation details on Open API.
Web API implementation
12/18/2020 • 46 minutes to read • Edit Online
A carefully designed RESTful web API defines the resources, relationships, and navigation schemes that are
accessible to client applications. When you implement and deploy a web API, you should consider the physical
requirements of the environment hosting the web API and the way in which the web API is constructed rather than
the logical structure of the data. This guidance focuses on best practices for implementing a web API and
publishing it to make it available to client applications. For detailed information about web API design, see Web API
design.
Processing requests
Consider the following points when you implement the code to handle requests.
GET, PUT, DELETE, HEAD, and PATCH actions should be idempotent
The code that implements these requests should not impose any side-effects. The same request repeated over the
same resource should result in the same state. For example, sending multiple DELETE requests to the same URI
should have the same effect, although the HTTP status code in the response messages may be different. The first
DELETE request might return status code 204 (No Content), while a subsequent DELETE request might return status
code 404 (Not Found).
NOTE
The article Idempotency Patterns on Jonathan Oliver's blog provides an overview of idempotency and how it relates to data
management operations.
POST actions that create new resources should not have unrelated side -effects
If a POST request is intended to create a new resource, the effects of the request should be limited to the new
resource (and possibly any directly related resources if there is some sort of linkage involved). For example, in an e-
commerce system, a POST request that creates a new order for a customer might also amend inventory levels and
generate billing information, but it should not modify information not directly related to the order or have any
other side-effects on the overall state of the system.
Avoid implementing chatty POST, PUT, and DELETE operations
Support POST, PUT and DELETE requests over resource collections. A POST request can contain the details for
multiple new resources and add them all to the same collection, a PUT request can replace the entire set of
resources in a collection, and a DELETE request can remove an entire collection.
The OData support included in ASP.NET Web API 2 provides the ability to batch requests. A client application can
package up several web API requests and send them to the server in a single HTTP request, and receive a single
HTTP response that contains the replies to each request. For more information, see Introducing batch support in Web
API and Web API OData.
Follow the HTTP specification when sending a response
A web API must return messages that contain the correct HTTP status code to enable the client to determine how to
handle the result, the appropriate HTTP headers so that the client understands the nature of the result, and a
suitably formatted body to enable the client to parse the result.
For example, a POST operation should return status code 201 (Created) and the response message should include
the URI of the newly created resource in the Location header of the response message.
Support content negotiation
The body of a response message may contain data in a variety of formats. For example, an HTTP GET request could
return data in JSON, or XML format. When the client submits a request, it can include an Accept header that
specifies the data formats that it can handle. These formats are specified as media types. For example, a client that
issues a GET request that retrieves an image can specify an Accept header that lists the media types that the client
can handle, such as image/jpeg, image/gif, image/png . When the web API returns the result, it should format the
data by using one of these media types and specify the format in the Content-Type header of the response.
If the client does not specify an Accept header, then use a sensible default format for the response body. As an
example, the ASP.NET Web API framework defaults to JSON for text-based data.
Provide links to support HATEOAS -style navigation and discovery of resources
The HATEOAS approach enables a client to navigate and discover resources from an initial starting point. This is
achieved by using links containing URIs; when a client issues an HTTP GET request to obtain a resource, the
response should contain URIs that enable a client application to quickly locate any directly related resources. For
example, in a web API that supports an e-commerce solution, a customer may have placed many orders. When a
client application retrieves the details for a customer, the response should include links that enable the client
application to send HTTP GET requests that can retrieve these orders. Additionally, HATEOAS-style links should
describe the other operations (POST, PUT, DELETE, and so on) that each linked resource supports together with the
corresponding URI to perform each request. This approach is described in more detail in API design.
Currently there are no standards that govern the implementation of HATEOAS, but the following example illustrates
one possible approach. In this example, an HTTP GET request that finds the details for a customer returns a
response that includes HATEOAS links that reference the orders for that customer:
HTTP/1.1 200 OK
...
Content-Type: application/json; charset=utf-8
...
Content-Length: ...
{"CustomerID":2,"CustomerName":"Bert","Links":[
{"rel":"self",
"href":"https://adventure-works.com/customers/2",
"action":"GET",
"types":["text/xml","application/json"]},
{"rel":"self",
"href":"https://adventure-works.com/customers/2",
"action":"PUT",
"types":["application/x-www-form-urlencoded"]},
{"rel":"self",
"href":"https://adventure-works.com/customers/2",
"action":"DELETE",
"types":[]},
{"rel":"orders",
"href":"https://adventure-works.com/customers/2/orders",
"action":"GET",
"types":["text/xml","application/json"]},
{"rel":"orders",
"href":"https://adventure-works.com/customers/2/orders",
"action":"POST",
"types":["application/x-www-form-urlencoded"]}
]}
In this example, the customer data is represented by the Customer class shown in the following code snippet. The
HATEOAS links are held in the Links collection property:
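The class definition itself is not reproduced here; a minimal sketch consistent with the fields used in the example response and the link properties described below might look like this (the shape of the Link class is an assumption based on those descriptions):
public class Customer
{
    public int CustomerID { get; set; }
    public string CustomerName { get; set; }
    // HATEOAS links describing the operations available on this resource
    public List<Link> Links { get; set; }
}

public class Link
{
    public string Rel { get; set; }
    public string Href { get; set; }
    public string Action { get; set; }
    public string[] Types { get; set; }
}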
The HTTP GET operation retrieves the customer data from storage and constructs a Customer object, and then
populates the Links collection. The result is formatted as a JSON response message. Each link comprises the
following fields:
The relationship between the object being returned and the object described by the link. In this case self
indicates that the link is a reference back to the object itself (similar to a this pointer in many object-oriented
languages), and orders is the name of a collection containing the related order information.
The hyperlink ( Href ) for the object being described by the link in the form of a URI.
The type of HTTP request ( Action ) that can be sent to this URI.
The format of any data ( Types ) that should be provided in the HTTP request or that can be returned in the
response, depending on the type of the request.
The HATEOAS links shown in the example HTTP response indicate that a client application can perform the
following operations:
An HTTP GET request to the URI https://adventure-works.com/customers/2 to fetch the details of the customer
(again). The data can be returned as XML or JSON.
An HTTP PUT request to the URI https://adventure-works.com/customers/2 to modify the details of the customer.
The new data must be provided in the request message in x-www-form-urlencoded format.
An HTTP DELETE request to the URI https://adventure-works.com/customers/2 to delete the customer. The
request does not expect any additional information or return data in the response message body.
An HTTP GET request to the URI https://adventure-works.com/customers/2/orders to find all the orders for the
customer. The data can be returned as XML or JSON.
An HTTP POST request to the URI https://adventure-works.com/customers/2/orders to create a new order for this
customer. The data must be provided in the request message in x-www-form-urlencoded format.
Handling exceptions
Consider the following points if an operation throws an uncaught exception.
Capture exceptions and return a meaningful response to clients
The code that implements an HTTP operation should provide comprehensive exception handling rather than letting
uncaught exceptions propagate to the framework. If an exception makes it impossible to complete the operation
successfully, the exception can be passed back in the response message, but it should include a meaningful
description of the error that caused the exception. The exception should also include the appropriate HTTP status
code rather than simply returning status code 500 for every situation. For example, if a user request causes a
database update that violates a constraint (such as attempting to delete a customer that has outstanding orders),
you should return status code 409 (Conflict) and a message body indicating the reason for the conflict. If some
other condition renders the request unachievable, you can return status code 400 (Bad Request). You can find a full
list of HTTP status codes on the Status code definitions page on the W3C website.
The code example traps different conditions and returns an appropriate response.
[HttpDelete]
[Route("customers/{id:int}")]
public IHttpActionResult DeleteCustomer(int id)
{
    try
    {
        // Find the customer to be deleted in the repository
        var customerToDelete = repository.GetCustomer(id);
        if (customerToDelete == null) return NotFound();            // 404 if the customer doesn't exist
        if (customerToDelete.Orders.Count > 0) return Conflict();   // 409 if there are outstanding orders
        repository.DeleteCustomer(id);
        return StatusCode(HttpStatusCode.NoContent);                // 204 on success
    }
    catch { return InternalServerError(); }                         // 500 for anything unexpected
}
TIP
Do not include information that could be useful to an attacker attempting to penetrate your API.
Many web servers trap error conditions themselves before they reach the web API. For example, if you configure
authentication for a web site and the user fails to provide the correct authentication information, the web server
should respond with status code 401 (Unauthorized). Once a client has been authenticated, your code can perform
its own checks to verify that the client should be able to access the requested resource. If this authorization fails, you
should return status code 403 (Forbidden).
Handle exceptions consistently and log information about errors
To handle exceptions in a consistent manner, consider implementing a global error handling strategy across the
entire web API. You should also incorporate error logging which captures the full details of each exception; this
error log can contain detailed information as long as it is not made accessible over the web to clients.
Distinguish between client-side errors and server-side errors
The HTTP protocol distinguishes between errors that occur due to the client application (the HTTP 4xx status codes),
and errors that are caused by a mishap on the server (the HTTP 5xx status codes). Make sure that you respect this
convention in any error response messages.
Support client-side caching
The HTTP protocol supports caching in client applications and in intermediate servers through which a request is routed. A response can include a Cache-Control header that tells the client whether it can safely cache the data and for how long. For example, a GET request for an order might return the following response:
HTTP/1.1 200 OK
...
Cache-Control: max-age=600, private
Content-Type: text/json; charset=utf-8
Content-Length: ...
{"orderID":2,"productID":4,"quantity":2,"orderValue":10.00}
In this example, the Cache-Control header specifies that the data returned should be expired after 600 seconds, and
is only suitable for a single client and must not be stored in a shared cache used by other clients (it is private). The
Cache-Control header could specify public rather than private in which case the data can be stored in a shared
cache, or it could specify no-store in which case the data must not be cached by the client. The following code
example shows how to construct a Cache-Control header in a response message:
public class OrdersController : ApiController
{
...
[Route("api/orders/{id:int:min(0)}")]
[HttpGet]
public IHttpActionResult FindOrderByID(int id)
{
// Find the matching order
Order order = ...;
...
// Create a Cache-Control header for the response
var cacheControlHeader = new CacheControlHeaderValue();
cacheControlHeader.Private = true;
cacheControlHeader.MaxAge = new TimeSpan(0, 10, 0);
...
// Return a response message containing the order and the cache control header
OkResultWithCaching<Order> response = new OkResultWithCaching<Order>(order, this)
{
CacheControlHeader = cacheControlHeader
};
return response;
}
...
}
This code uses a custom IHttpActionResult class named OkResultWithCaching . This class enables the controller to
set the cache header contents:
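The full class is not shown in this guidance. The following is a minimal sketch of what such a class could look like, assuming it builds on the OkNegotiatedContentResult<T> class from ASP.NET Web API; it is an assumed implementation, not the definitive one:
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading;
using System.Threading.Tasks;
using System.Web.Http;
using System.Web.Http.Results;

public class OkResultWithCaching<T> : OkNegotiatedContentResult<T>
{
    public OkResultWithCaching(T content, ApiController controller) : base(content, controller) { }

    public CacheControlHeaderValue CacheControlHeader { get; set; }
    public EntityTagHeaderValue ETag { get; set; }

    public override async Task<HttpResponseMessage> ExecuteAsync(CancellationToken cancellationToken)
    {
        // Let the base class build the negotiated 200 (OK) response, then apply the caching headers
        HttpResponseMessage response = await base.ExecuteAsync(cancellationToken);
        response.Headers.CacheControl = CacheControlHeader;
        response.Headers.ETag = ETag;
        return response;
    }
}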
Cache management is the responsibility of the client application or intermediate server, but if properly
implemented it can save bandwidth and improve performance by removing the need to fetch data that has already
been recently retrieved.
The max-age value in the Cache-Control header is only a guide and not a guarantee that the corresponding data
won't change during the specified time. The web API should set the max-age to a suitable value depending on the
expected volatility of the data. When this period expires, the client should discard the object from the cache.
NOTE
Most modern web browsers support client-side caching by adding the appropriate cache-control headers to requests and
examining the headers of the results, as described. However, some older browsers will not cache the values returned from a
URL that includes a query string. This is not usually an issue for custom client applications which implement their own cache
management strategy based on the protocol discussed here.
Some older proxies exhibit the same behavior and might not cache requests based on URLs with query strings. This could be
an issue for custom client applications that connect to a web server through such a proxy.
Provide ETags to optimize query processing
In addition to a Cache-Control header, a response can include an ETag (entity tag), an opaque string that identifies a particular version of a resource and changes whenever the resource changes. The FindOrderByID method shown earlier can be extended to generate an ETag (in this example, by hashing the order data, as described in the tip later in this section) and attach it to the response:
// Compute an ETag from a hash of the order data
var hashedOrder = order.GetHashCode();
var hashedOrderEtag = $"\"{hashedOrder}\"";
var eTag = new EntityTagHeaderValue(hashedOrderEtag);
// Return a response message containing the order, the cache control header, and the ETag
OkResultWithCaching<Order> response = new OkResultWithCaching<Order>(order, this)
{
    ...,
    ETag = eTag
};
return response;
}
...
}
The response message posted by the web API looks like this:
HTTP/1.1 200 OK
...
Cache-Control: max-age=600, private
Content-Type: text/json; charset=utf-8
ETag: "2147483648"
Content-Length: ...
{"orderID":2,"productID":4,"quantity":2,"orderValue":10.00}
TIP
For security reasons, do not allow sensitive data or data returned over an authenticated (HTTPS) connection to be cached.
A client application can issue a subsequent GET request to retrieve the same resource at any time, and if the
resource has changed (it has a different ETag) the cached version should be discarded and the new version added
to the cache. If a resource is large and requires a significant amount of bandwidth to transmit back to the client,
repeated requests to fetch the same data can become inefficient. To combat this, the HTTP protocol defines the
following process for optimizing GET requests that you should support in a web API:
The client constructs a GET request containing the ETag for the currently cached version of the resource
referenced in an If-None-Match HTTP header:
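(Illustrative request; the ETag value matches the response example shown earlier.)
GET https://adventure-works.com/orders/2 HTTP/1.1
If-None-Match: "2147483648"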
The GET operation in the web API obtains the current ETag for the requested data (order 2 in the above
example), and compares it to the value in the If-None-Match header.
If the current ETag for the requested data matches the ETag provided by the request, the resource has not
changed and the web API should return an HTTP response with an empty message body and a status code
of 304 (Not Modified).
If the current ETag for the requested data does not match the ETag provided by the request, then the data has
changed and the web API should return an HTTP response with the new data in the message body and a
status code of 200 (OK).
If the requested data no longer exists then the web API should return an HTTP response with the status code
of 404 (Not Found).
The client uses the status code to maintain the cache. If the data has not changed (status code 304) then the
object can remain cached and the client application should continue to use this version of the object. If the
data has changed (status code 200) then the cached object should be discarded and the new one inserted. If
the data is no longer available (status code 404) then the object should be removed from the cache.
NOTE
If the response header contains the Cache-Control header no-store then the object should always be removed from the
cache regardless of the HTTP status code.
The code below shows the FindOrderByID method extended to support the If-None-Match header. Notice that if the
If-None-Match header is omitted, the specified order is always retrieved:
public class OrdersController : ApiController
{
    [Route("api/orders/{id:int:min(0)}")]
    [HttpGet]
    public IHttpActionResult FindOrderByID(int id)
    {
        try
        {
            // Find the matching order
            Order order = ...;

            // Compute the current ETag for the order (here, by hashing the order data)
            var hashedOrder = order.GetHashCode();
            var hashedOrderEtag = $"\"{hashedOrder}\"";
            var eTag = new EntityTagHeaderValue(hashedOrderEtag);

            // If the If-None-Match header matches the current ETag, the cached copy is still
            // valid, so return 304 (Not Modified) with an empty body
            if (Request.Headers.IfNoneMatch.Any(t => t.Tag == hashedOrderEtag))
            {
                // EmptyResultWithCaching is the helper class shown below
                return new EmptyResultWithCaching { StatusCode = HttpStatusCode.NotModified, ETag = eTag };
            }

            // Otherwise return the order together with the ETag
            var response = new OkResultWithCaching<Order>(order, this) { ETag = eTag };
            return response;
        }
        catch
        {
            return InternalServerError();
        }
    }
    ...
}
This example incorporates an additional custom IHttpActionResult class named EmptyResultWithCaching . This class
simply acts as a wrapper around an HttpResponseMessage object that does not contain a response body:
public class EmptyResultWithCaching : IHttpActionResult
{
    public CacheControlHeaderValue CacheControlHeader { get; set; }
    public EntityTagHeaderValue ETag { get; set; }
    public HttpStatusCode StatusCode { get; set; }
    public Uri Location { get; set; }

    // ExecuteAsync (not shown here) builds an HttpResponseMessage with the specified
    // status code and headers and an empty body
    ...
}
TIP
In this example, the ETag for the data is generated by hashing the data retrieved from the underlying data source. If the ETag
can be computed in some other way, then the process can be optimized further and the data only needs to be fetched from
the data source if it has changed. This approach is especially useful if the data is large or accessing the data source can result
in significant latency (for example, if the data source is a remote database).
Use ETags to support optimistic concurrency
ETags can also be used to implement optimistic concurrency for updates. The client constructs a PUT request containing the new details of the resource, together with the ETag for the version of the resource that it currently holds, referenced in an If-Match HTTP header. The PUT operation in the web API obtains the current ETag for the requested data and compares it to the value in the If-Match header.
If the current ETag for the requested data matches the ETag provided by the request, the resource has not
changed and the web API should perform the update, returning a message with HTTP status code 204 (No
Content) if it is successful. The response can include Cache-Control and ETag headers for the updated
version of the resource. The response should always include the Location header that references the URI of
the newly updated resource.
If the current ETag for the requested data does not match the ETag provided by the request, then the data has
been changed by another user since it was fetched and the web API should return an HTTP response with an
empty message body and a status code of 412 (Precondition Failed).
If the resource to be updated no longer exists then the web API should return an HTTP response with the
status code of 404 (Not Found).
The client uses the status code and response headers to maintain the cache. If the data has been updated
(status code 204) then the object can remain cached (as long as the Cache-Control header does not specify
no-store) but the ETag should be updated. If the data was changed by another user (status code 412) or not found
(status code 404), then the cached object should be discarded.
The next code example shows an implementation of the PUT operation for the Orders controller:
// Create the No Content response with Cache-Control, ETag, and Location headers
var cacheControlHeader = new CacheControlHeaderValue();
cacheControlHeader.Private = true;
cacheControlHeader.MaxAge = new TimeSpan(0, 10, 0);
var hashedOrder = order.GetHashCode();
var hashedOrderEtag = $"\"{hashedOrder}\"";
var eTag = new EntityTagHeaderValue(hashedOrderEtag);
var location = new Uri($"{Request.RequestUri.Scheme}://{Request.RequestUri.Authority}/api/orders/{id}");
var response = new EmptyResultWithCaching
{
    StatusCode = HttpStatusCode.NoContent,
    CacheControlHeader = cacheControlHeader,
    ETag = eTag,
    Location = location
};
return response;
}
TIP
Use of the If-Match header is entirely optional, and if it is omitted the web API will always attempt to update the specified
order, possibly blindly overwriting an update made by another user. To avoid problems due to lost updates, always provide an
If-Match header.
When a client sends a request with an Expect: 100-Continue header, it waits for the server to signal that it is willing to receive the request body before transmitting it; in the .NET Framework, this behavior is controlled through the Expect100Continue property of a ServicePoint object. You can also set the static Expect100Continue property of the ServicePointManager class to specify the default value of this property for all subsequently created ServicePoint objects.
Support pagination for requests that may return large numbers of objects
If a collection contains a large number of resources, issuing a GET request to the corresponding URI could result in
significant processing on the server hosting the web API, affecting performance, and could generate a significant
amount of network traffic, resulting in increased latency.
To handle these cases, the web API should support query strings that enable the client application to refine requests
or fetch data in more manageable, discrete blocks (or pages). The code below shows the GetAllOrders method in
the Orders controller. This method retrieves the details of orders. If this method was unconstrained, it could
conceivably return a large amount of data. The limit and offset parameters are intended to reduce the volume
of data to a smaller subset, in this case only the first 10 orders by default:
public class OrdersController : ApiController
{
...
[Route("api/orders")]
[HttpGet]
public IEnumerable<Order> GetAllOrders(int limit=10, int offset=0)
{
// Find the number of orders specified by the limit parameter
// starting with the order specified by the offset parameter
var orders = ...
return orders;
}
...
}
A client application can issue a request to retrieve 30 orders starting at offset 50 by using the URI
https://www.adventure-works.com/api/orders?limit=30&offset=50 .
TIP
Avoid enabling client applications to specify query strings that result in a URI that is more than 2000 characters long. Many
web clients and servers cannot handle URIs that are this long.
Testing a web API
A web API should be tested as thoroughly as any other piece of software. Consider creating unit tests that exercise each operation, including checks such as the following:
Test the exception handling performed by each operation and verify that an appropriate and meaningful
HTTP response is passed back to the client application.
Verify that request and response messages are well-formed. For example, if an HTTP POST request contains
the data for a new resource in x-www-form-urlencoded format, confirm that the corresponding operation
correctly parses the data, creates the resources, and returns a response containing the details of the new
resource, including the correct Location header.
Verify all links and URIs in response messages. For example, an HTTP POST message should return the URI
of the newly created resource. All HATEOAS links should be valid.
Ensure that each operation returns the correct status codes for different combinations of input. For example:
If a query is successful, it should return status code 200 (OK)
If a resource is not found, the operation should return HTTP status code 404 (Not Found).
If the client sends a request that successfully deletes a resource, the status code should be 204 (No
Content).
If the client sends a request that creates a new resource, the status code should be 201 (Created).
Watch out for unexpected response status codes in the 5xx range. These messages are usually reported by the host
server to indicate that it was unable to fulfill a valid request.
Test the different request header combinations that a client application can specify and ensure that the web
API returns the expected information in response messages.
Test query strings. If an operation can take optional parameters (such as pagination requests), test the
different combinations and order of parameters.
Verify that asynchronous operations complete successfully. If the web API supports streaming for requests
that return large binary objects (such as video or audio), ensure that client requests are not blocked while
the data is streamed. If the web API implements polling for long-running data modification operations, verify
that the operations report their status correctly as they proceed.
You should also create and run performance tests to check that the web API operates satisfactorily under duress.
You can build a web performance and load test project by using Visual Studio Ultimate. For more information, see
Run performance tests on an application before a release.
Publishing a web API by using the Azure API Management service
You can use the Azure API Management service to publish a web API and provide controlled access to it. After you create an API Management instance and connect it to the web server hosting the web API, the remaining steps include the following:
3. For each web API, specify the HTTP operations that the web API exposes together with any optional
parameters that an operation can take as input. You can also configure whether the API management service
should cache the response received from the web API to optimize repeated requests for the same data.
Record the details of the HTTP responses that each operation can generate. This information is used to
generate documentation for developers, so it is important that it is accurate and complete.
You can either define operations manually using the wizards provided by the Azure portal, or you can
import them from a file containing the definitions in WADL or Swagger format.
4. Configure the security settings for communications between the API management service and the web
server hosting the web API. The API management service currently supports Basic authentication and
mutual authentication using certificates, and OAuth 2.0 user authorization.
5. Create a product. A product is the unit of publication; you add the web APIs that you previously connected to
the management service to the product. When the product is published, the web APIs become available to
developers.
NOTE
Prior to publishing a product, you can also define user-groups that can access the product and add users to these
groups. This gives you control over the developers and applications that can use the web API. If a web API is subject
to approval, prior to being able to access it a developer must send a request to the product administrator. The
administrator can grant or deny access to the developer. Existing developers can also be blocked if circumstances
change.
6. Configure policies for each web API. Policies govern aspects such as whether cross-domain calls should be
allowed, how to authenticate clients, whether to convert between XML and JSON data formats transparently,
whether to restrict calls from a given IP range, usage quotas, and whether to limit the call rate. Policies can
be applied globally across the entire product, for a single web API in a product, or for individual operations
in a web API.
For more information, see the API Management documentation.
TIP
Azure provides the Azure Traffic Manager which enables you to implement failover and load-balancing, and reduce latency
across multiple instances of a web site hosted in different geographic locations. You can use Azure Traffic Manager in
conjunction with the API Management Service; the API Management Service can route requests to instances of a web site
through Azure Traffic Manager. For more information, see Traffic Manager routing methods.
In this structure, if you are using custom DNS names for your web sites, you should configure the appropriate CNAME record
for each web site to point to the DNS name of the Azure Traffic Manager web site.
NOTE
You can change the details for a published product, and the changes are applied immediately. For example, you can add or
remove an operation from a web API without requiring that you republish the product that contains the web API.
More information
ASP.NET Web API OData contains examples and further information on implementing an OData web API by
using ASP.NET.
Introducing batch support in Web API and Web API OData describes how to implement batch operations in a
web API by using OData.
Idempotency patterns on Jonathan Oliver's blog provides an overview of idempotency and how it relates to
data management operations.
Status code definitions on the W3C website contains a full list of HTTP status codes and their descriptions.
Run background tasks with WebJobs provides information and examples on using WebJobs to perform
background operations.
Azure Notification Hubs notify users shows how to use an Azure Notification Hub to push asynchronous
responses to client applications.
API Management describes how to publish a product that provides controlled and secure access to a web API.
Azure API Management REST API reference describes how to use the API Management REST API to build custom
management applications.
Traffic Manager routing methods summarizes how Azure Traffic Manager can be used to load-balance requests
across multiple instances of a website hosting a web API.
Application Insights - Get started with ASP.NET provides detailed information on installing and configuring
Application Insights in an ASP.NET Web API project.
Autoscaling
12/18/2020 • 15 minutes to read • Edit Online
Autoscaling is the process of dynamically allocating resources to match performance requirements. As the
volume of work grows, an application may need additional resources to maintain the desired performance levels
and satisfy service-level agreements (SLAs). As demand slackens and the additional resources are no longer
needed, they can be de-allocated to minimize costs.
Autoscaling takes advantage of the elasticity of cloud-hosted environments while easing management overhead.
It reduces the need for an operator to continually monitor the performance of a system and make decisions about
adding or removing resources.
There are two main ways that an application can scale:
Vertical scaling, also called scaling up and down, means changing the capacity of a resource. For
example, you could move an application to a larger VM size. Vertical scaling often requires making the
system temporarily unavailable while it is being redeployed. Therefore, it's less common to automate
vertical scaling.
Horizontal scaling, also called scaling out and in, means adding or removing instances of a resource. The
application continues running without interruption as new resources are provisioned. When the
provisioning process is complete, the solution is deployed on these additional resources. If demand drops,
the additional resources can be shut down cleanly and deallocated.
Many cloud-based systems, including Microsoft Azure, support automatic horizontal scaling. The rest of this
article focuses on horizontal scaling.
NOTE
Autoscaling mostly applies to compute resources. While it's possible to horizontally scale a database or message queue, this
usually involves data partitioning, which is generally not automated.
Overview
An autoscaling strategy typically involves the following pieces:
Instrumentation and monitoring systems at the application, service, and infrastructure levels. These systems
capture key metrics, such as response times, queue lengths, CPU utilization, and memory usage.
Decision-making logic that evaluates these metrics against predefined thresholds or schedules, and decides
whether to scale.
Components that scale the system.
Testing, monitoring, and tuning of the autoscaling strategy to ensure that it functions as expected.
Azure provides built-in autoscaling mechanisms that address common scenarios. If a particular service or
technology does not have built-in autoscaling functionality, or if you have specific autoscaling requirements
beyond its capabilities, you might consider a custom implementation. A custom implementation would collect
operational and system metrics, analyze the metrics, and then scale resources accordingly.
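To illustrate the decision-making piece, the following sketch shows hypothetical threshold-based scaling logic in C#; the metric, thresholds, instance limits, and the scaler interface are all assumptions for the example, not part of any Azure SDK:
// Hypothetical threshold-based autoscaling decision logic (illustrative only).
public interface IScaler
{
    int CurrentInstanceCount { get; }
    void ScaleTo(int instanceCount);
}

public class CpuThresholdAutoscaler
{
    private readonly IScaler scaler;
    public CpuThresholdAutoscaler(IScaler scaler) => this.scaler = scaler;

    // Evaluate the latest averaged CPU metric against scale-out and scale-in thresholds.
    public void Evaluate(double averageCpuPercent)
    {
        int current = scaler.CurrentInstanceCount;
        if (averageCpuPercent > 75 && current < 10)
        {
            scaler.ScaleTo(current + 1);   // scale out one instance at a time
        }
        else if (averageCpuPercent < 25 && current > 2)
        {
            scaler.ScaleTo(current - 1);   // scale in, keeping a minimum of two instances
        }
        // Otherwise, stay at the current instance count.
    }
}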
Background jobs
Many types of applications require background tasks that run independently of the user interface (UI). Examples
include batch jobs, intensive processing tasks, and long-running processes such as workflows. Background jobs
can be executed without requiring user interaction--the application can start the job and then continue to process
interactive requests from users. This can help to minimize the load on the application UI, which can improve
availability and reduce interactive response times.
For example, if an application is required to generate thumbnails of images that are uploaded by users, it can do
this as a background job and save the thumbnail to storage when it is complete--without the user needing to wait
for the process to be completed. In the same way, a user placing an order can initiate a background workflow that
processes the order, while the UI allows the user to continue browsing the web app. When the background job is
complete, it can update the stored orders data and send an email to the user that confirms the order.
When you consider whether to implement a task as a background job, the main criterion is whether the task can run
without user interaction and without the UI needing to wait for the job to be completed. Tasks that require the user
or the UI to wait while they are completed might not be appropriate as background jobs.
Triggers
Background jobs can be initiated in several different ways. They fall into one of the following categories:
Event-driven triggers. The task is started in response to an event, typically an action taken by a user or a step
in a workflow.
Schedule-driven triggers. The task is invoked on a schedule based on a timer. This might be a recurring
schedule or a one-off invocation that is specified for a later time.
Event-driven triggers
Event-driven invocation uses a trigger to start the background task. Examples of using event-driven triggers
include:
The UI or another job places a message in a queue. The message contains data about an action that has taken
place, such as the user placing an order. The background task listens on this queue and detects the arrival of a
new message. It reads the message and uses the data in it as the input to the background job.
The UI or another job saves or updates a value in storage. The background task monitors the storage and
detects changes. It reads the data and uses it as the input to the background job.
The UI or another job makes a request to an endpoint, such as an HTTPS URI, or an API that is exposed as a web
service. It passes the data that is required to complete the background task as part of the request. The endpoint
or web service invokes the background task, which uses the data as its input.
Typical examples of tasks that are suited to event-driven invocation include image processing, workflows, sending
information to remote services, sending email messages, and provisioning new users in multitenant applications.
Schedule-driven triggers
Schedule-driven invocation uses a timer to start the background task. Examples of using schedule-driven triggers
include:
A timer that is running locally within the application or as part of the application's operating system invokes a
background task on a regular basis.
A timer that is running in a different application, such as Azure Logic Apps, sends a request to an API or web
service on a regular basis. The API or web service invokes the background task.
A separate process or application starts a timer that causes the background task to be invoked once after a
specified time delay, or at a specific time.
Typical examples of tasks that are suited to schedule-driven invocation include batch-processing routines (such as
updating related-products lists for users based on their recent behavior), routine data processing tasks (such as
updating indexes or generating accumulated results), data analysis for daily reports, data retention cleanup, and
data consistency checks.
If you use a schedule-driven task that must run as a single instance, be aware of the following:
If the compute instance that is running the scheduler (such as a virtual machine using Windows scheduled
tasks) is scaled, you will have multiple instances of the scheduler running. These could start multiple instances
of the task.
If tasks run for longer than the period between scheduler events, the scheduler may start another instance of
the task while the previous one is still running.
Returning results
Background jobs execute asynchronously in a separate process, or even in a separate location, from the UI or the
process that invoked the background task. Ideally, background tasks are "fire and forget" operations, and their
execution progress has no impact on the UI or the calling process. This means that the calling process does not
wait for completion of the tasks. Therefore, it cannot automatically detect when the task ends.
If you require a background task to communicate with the calling task to indicate progress or completion, you
must implement a mechanism for this. Some examples are:
Write a status indicator value to storage that is accessible to the UI or caller task, which can monitor or check
this value when required. Other data that the background task must return to the caller can be placed into the
same storage.
Establish a reply queue that the UI or caller listens on. The background task can send messages to the queue
that indicate status and completion. Data that the background task must return to the caller can be placed into
the messages. If you are using Azure Service Bus, you can use the ReplyTo and CorrelationId properties to
implement this capability, as shown in the sketch after this list.
Expose an API or endpoint from the background task that the UI or caller can access to obtain status
information. Data that the background task must return to the caller can be included in the response.
Have the background task call back to the UI or caller through an API to indicate status at predefined points or
on completion. This might be through events raised locally or through a publish-and-subscribe mechanism.
Data that the background task must return to the caller can be included in the request or event payload.
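The following is a minimal sketch of the reply-queue approach using the Azure.Messaging.ServiceBus library. It
assumes that request is the received message that triggered the background task, that its ReplyTo property names an
existing queue, and that serviceBusConnectionString is defined elsewhere.

ServiceBusClient client = new ServiceBusClient(serviceBusConnectionString);

// Send a completion message to the queue named in the request's ReplyTo property,
// copying the CorrelationId so that the caller can match the reply to its request.
ServiceBusSender sender = client.CreateSender(request.ReplyTo);
ServiceBusMessage reply = new ServiceBusMessage("Order processed")
{
    CorrelationId = request.CorrelationId
};
await sender.SendMessageAsync(reply);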
Hosting environment
You can host background tasks by using a range of different Azure platform services:
Azure Web Apps and WebJobs. You can use WebJobs to execute custom jobs based on a range of different
types of scripts or executable programs within the context of a web app.
Azure Virtual Machines. If you have a Windows service or want to use the Windows Task Scheduler, it is
common to host your background tasks within a dedicated virtual machine.
Azure Batch. Batch is a platform service that schedules compute-intensive work to run on a managed
collection of virtual machines. It can automatically scale compute resources.
Azure Kubernetes Service (AKS). Azure Kubernetes Service provides a managed hosting environment for
Kubernetes on Azure.
The following sections describe each of these options in more detail, and include considerations to help you
choose the appropriate option.
Azure Web Apps and WebJobs
You can use Azure WebJobs to execute custom jobs as background tasks within an Azure Web App. WebJobs run
within the context of your web app as a continuous process. WebJobs also run in response to a trigger event from
Azure Logic Apps or external factors, such as changes to storage blobs and message queues. Jobs can be started
and stopped on demand, and shut down gracefully. If a continuously running WebJob fails, it is automatically
restarted. Retry and error actions are configurable.
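As an illustration, a queue-triggered function written with the WebJobs SDK might look like the following sketch.
The queue name and the processing logic are placeholders.

public static class Functions
{
    // Runs whenever a message appears on the "thumbnail-requests" storage queue.
    // The WebJobs SDK binds the message content to the imageBlobName parameter.
    public static void GenerateThumbnail(
        [QueueTrigger("thumbnail-requests")] string imageBlobName,
        TextWriter log)
    {
        log.WriteLine($"Generating thumbnail for {imageBlobName}");
        // ... image processing logic goes here ...
    }
}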
When you configure a WebJob:
If you want the job to respond to an event-driven trigger, you should configure it as Run continuously. The
script or program is stored in the folder named site/wwwroot/app_data/jobs/continuous.
If you want the job to respond to a schedule-driven trigger, you should configure it as Run on a schedule. The
script or program is stored in the folder named site/wwwroot/app_data/jobs/triggered.
If you choose the Run on demand option when you configure a job, it will execute the same code as the Run
on a schedule option when you start it.
Azure WebJobs run within the sandbox of the web app. This means that they can access environment variables and
share information, such as connection strings, with the web app. The job has access to the unique identifier of the
machine that is running the job. The connection string named AzureWebJobsStorage provides access to Azure
storage queues, blobs, and tables for application data, and access to Service Bus for messaging and
communication. The connection string named AzureWebJobsDashboard provides access to the job action log
files.
Azure WebJobs have the following characteristics:
Security: WebJobs are protected by the deployment credentials of the web app.
Supported file types: You can define WebJobs by using command scripts (.cmd), batch files (.bat), PowerShell
scripts (.ps1), bash shell scripts (.sh), PHP scripts (.php), Python scripts (.py), JavaScript code (.js), and executable
programs (.exe, .jar, and more).
Deployment: You can deploy scripts and executables by using the Azure portal, by using Visual Studio, by
using the Azure WebJobs SDK, or by copying them directly to the following locations:
For triggered execution: site/wwwroot/app_data/jobs/triggered/{job name}
For continuous execution: site/wwwroot/app_data/jobs/continuous/{job name}
Logging: Console.Out is treated (marked) as INFO. Console.Error is treated as ERROR. You can access
monitoring and diagnostics information by using the Azure portal. You can download log files directly from the
site. They are saved in the following locations:
For triggered execution: Vfs/data/jobs/triggered/jobName
For continuous execution: Vfs/data/jobs/continuous/jobName
Configuration: You can configure WebJobs by using the portal, the REST API, and PowerShell. You can use a
configuration file named settings.job in the same root directory as the job script to provide configuration
information for a job. For example:
{ "stopping_wait_time": 60 }
{ "is_singleton": true }
Considerations
By default, WebJobs scale with the web app. However, you can configure a job to run on a single instance by
setting the is_singleton configuration property to true. Single-instance WebJobs are useful for tasks that you
do not want to scale out or run as multiple simultaneous instances, such as reindexing, data analysis, and similar
tasks.
To minimize the impact of jobs on the performance of the web app, consider creating an empty Azure Web App
instance in a new App Service plan to host long-running or resource-intensive WebJobs.
Azure Virtual Machines
Background tasks might be implemented in a way that prevents them from being deployed to Azure Web Apps, or
these options might not be convenient. Typical examples are Windows services, and third-party utilities and
executable programs. Another example might be programs written for an execution environment that is different
than that hosting the application. For example, it might be a Unix or Linux program that you want to execute from
a Windows or .NET application. You can choose from a range of operating systems for an Azure virtual machine,
and run your service or executable on that virtual machine.
To help you choose when to use Virtual Machines, see Azure App Services, Cloud Services and Virtual Machines
comparison. For information about the options for Virtual Machines, see Sizes for Windows virtual machines in
Azure. For more information about the operating systems and prebuilt images that are available for Virtual
Machines, see Azure Virtual Machines Marketplace.
To initiate the background task in a separate virtual machine, you have a range of options:
You can execute the task on demand directly from your application by sending a request to an endpoint that the
task exposes. This passes in any data that the task requires. This endpoint invokes the task.
You can configure the task to run on a schedule by using a scheduler or timer that is available in your chosen
operating system. For example, on Windows you can use Windows Task Scheduler to execute scripts and tasks.
Or, if you have SQL Server installed on the virtual machine, you can use the SQL Server Agent to execute scripts
and tasks.
You can use Azure Logic Apps to initiate the task by adding a message to a queue that the task listens on, or by
sending a request to an API that the task exposes.
See the earlier section Triggers for more information about how you can initiate background tasks.
Considerations
Consider the following points when you are deciding whether to deploy background tasks in an Azure virtual
machine:
Hosting background tasks in a separate Azure virtual machine provides flexibility and allows precise control
over initiation, execution, scheduling, and resource allocation. However, it will increase runtime cost if a virtual
machine must be deployed just to run background tasks.
There is no facility to monitor the tasks in the Azure portal and no automated restart capability for failed tasks--
although you can monitor the basic status of the virtual machine and manage it by using the Azure Resource
Manager Cmdlets. However, there are no facilities to control processes and threads in compute nodes. Typically,
using a virtual machine will require additional effort to implement a mechanism that collects data from
instrumentation in the task, and from the operating system in the virtual machine. One solution that might be
appropriate is to use the System Center Management Pack for Azure.
You might consider creating monitoring probes that are exposed through HTTP endpoints. The code for these
probes could perform health checks, collect operational information and statistics--or collate error information
and return it to a management application. For more information, see the Health Endpoint Monitoring pattern.
For more information, see:
Virtual Machines
Azure Virtual Machines FAQ
Azure Batch
Consider Azure Batch if you need to run large, parallel high-performance computing (HPC) workloads across tens,
hundreds, or thousands of VMs.
The Batch service provisions the VMs, assigns tasks to the VMs, runs the tasks, and monitors the progress. Batch
can automatically scale out the VMs in response to the workload. Batch also provides job scheduling. Azure Batch
supports both Linux and Windows VMs.
Considerations
Batch works well with intrinsically parallel workloads. It can also perform parallel calculations with a reduce step at
the end, or run Message Passing Interface (MPI) applications for parallel tasks that require message passing
between nodes.
An Azure Batch job runs on a pool of nodes (VMs). One approach is to allocate a pool only when needed and then
delete it after the job completes. This maximizes utilization, because nodes are not idle, but the job must wait for
nodes to be allocated. Alternatively, you can create a pool ahead of time. That approach minimizes the time that it
takes for a job to start, but can result in having nodes that sit idle. For more information, see Pool and compute
node lifetime.
For more information, see:
What is Azure Batch?
Develop large-scale parallel compute solutions with Batch
Batch and HPC solutions for large-scale computing workloads
Azure Kubernetes Service
Azure Kubernetes Service (AKS) manages your hosted Kubernetes environment, which makes it easy to deploy and
manage containerized applications.
Containers can be useful for running background jobs. Some of the benefits include:
Containers support high-density hosting. You can isolate a background task in a container, while placing
multiple containers in each VM.
The container orchestrator handles internal load balancing, configuring the internal network, and other
configuration tasks.
Containers can be started and stopped as needed.
Azure Container Registry allows you to register your containers inside Azure boundaries. This comes with
security, privacy, and proximity benefits.
Considerations
Requires an understanding of how to use a container orchestrator. Depending on the skill set of your DevOps
team, this may or may not be an issue.
For more information, see:
Overview of containers in Azure
Introduction to private Docker container registries
Partitioning
If you decide to include background tasks within an existing compute instance, you must consider how this will
affect the quality attributes of the compute instance and the background task itself. These factors will help you to
decide whether to colocate the tasks with the existing compute instance or separate them out into a separate
compute instance:
Availability: Background tasks might not need to have the same level of availability as other parts of the
application, in particular the UI and other parts that are directly involved in user interaction. Background
tasks might be more tolerant of latency, retried connection failures, and other factors that affect availability
because the operations can be queued. However, there must be sufficient capacity to prevent the backup of
requests that could block queues and affect the application as a whole.
Scalability: Background tasks are likely to have a different scalability requirement than the UI and the
interactive parts of the application. Scaling the UI might be necessary to meet peaks in demand, while
outstanding background tasks might be completed during less busy times by fewer compute instances.
Resiliency: Failure of a compute instance that just hosts background tasks might not fatally affect the
application as a whole if the requests for these tasks can be queued or postponed until the task is available
again. If the compute instance and/or tasks can be restarted within an appropriate interval, users of the
application might not be affected.
Security: Background tasks might have different security requirements or restrictions than the UI or other
parts of the application. By using a separate compute instance, you can specify a different security
environment for the tasks. You can also use patterns such as Gatekeeper to isolate the background compute
instances from the UI in order to maximize security and separation.
Performance: You can choose the type of compute instance for background tasks to specifically match the
performance requirements of the tasks. This might mean using a less expensive compute option if the tasks
do not require the same processing capabilities as the UI, or a larger instance if they require additional
capacity and resources.
Manageability: Background tasks might have a different development and deployment rhythm from the
main application code or the UI. Deploying them to a separate compute instance can simplify updates and
versioning.
Cost: Adding compute instances to execute background tasks increases hosting costs. You should carefully
consider the trade-off between additional capacity and these extra costs.
For more information, see the Leader Election pattern and the Competing Consumers pattern.
Conflicts
If you have multiple instances of a background job, it is possible that they will compete for access to resources and
services, such as databases and storage. This concurrent access can result in resource contention, which might
cause conflicts in availability of the services and in the integrity of data in storage. You can resolve resource
contention by using a pessimistic locking approach. This prevents competing instances of a task from concurrently
accessing a service or corrupting data.
Another approach to resolve conflicts is to define background tasks as a singleton, so that there is only ever one
instance running. However, this eliminates the reliability and performance benefits that a multiple-instance
configuration can provide. This is especially true if the UI can supply sufficient work to keep more than one
background task busy.
It is vital to ensure that the background task can automatically restart and that it has sufficient capacity to cope
with peaks in demand. You can achieve this by allocating a compute instance with sufficient resources, by
implementing a queueing mechanism that can store requests for later execution when demand decreases, or by
using a combination of these techniques.
Coordination
The background tasks might be complex and might require multiple individual tasks to execute to produce a result
or to fulfill all the requirements. It is common in these scenarios to divide the task into smaller discrete steps or
subtasks that can be executed by multiple consumers. Multistep jobs can be more efficient and more flexible
because individual steps might be reusable in multiple jobs. It is also easy to add, remove, or modify the order of
the steps.
Coordinating multiple tasks and steps can be challenging, but there are three common patterns that you can use to
guide your implementation of a solution:
Decomposing a task into multiple reusable steps. An application might be required to perform a
variety of tasks of varying complexity on the information that it processes. A straightforward but inflexible
approach to implementing this application might be to perform this processing as a monolithic module.
However, this approach is likely to reduce the opportunities for refactoring the code, optimizing it, or
reusing it if parts of the same processing are required elsewhere within the application. For more
information, see the Pipes and Filters pattern.
Managing execution of the steps for a task. An application might perform tasks that comprise a
number of steps (some of which might invoke remote services or access remote resources). The individual
steps might be independent of each other, but they are orchestrated by the application logic that
implements the task. For more information, see Scheduler Agent Supervisor pattern.
Managing recovery for task steps that fail. An application might need to undo the work that is
performed by a series of steps (which together define an eventually consistent operation) if one or more of
the steps fail. For more information, see the Compensating Transaction pattern.
Resiliency considerations
Background tasks must be resilient in order to provide reliable services to the application. When you are planning
and designing background tasks, consider the following points:
Background tasks must be able to gracefully handle restarts without corrupting data or introducing
inconsistency into the application. For long-running or multistep tasks, consider using checkpointing by
saving the state of jobs in persistent storage, or as messages in a queue if this is appropriate. For example,
you can persist state information in a message in a queue and incrementally update this state information
with the task progress so that the task can be processed from the last known good checkpoint--instead of
restarting from the beginning. When using Azure Service Bus queues, you can use message sessions to
enable the same scenario. Sessions allow you to save and retrieve the application processing state by using
the SetState and GetState methods. For more information about designing reliable multistep processes and
workflows, see the Scheduler Agent Supervisor pattern.
When you use queues to communicate with background tasks, the queues can act as a buffer to store
requests that are sent to the tasks while the application is under higher than usual load. This allows the
tasks to catch up with the UI during less busy periods. It also means that restarts will not block the UI. For
more information, see the Queue-Based Load Leveling pattern. If some tasks are more important than
others, consider implementing the Priority Queue pattern to ensure that these tasks run before less
important ones.
Background tasks that are initiated by messages or process messages must be designed to handle
inconsistencies, such as messages arriving out of order, messages that repeatedly cause an error (often
referred to as poison messages), and messages that are delivered more than once. Consider the following:
Messages that must be processed in a specific order, such as those that change data based on the
existing data value (for example, adding a value to an existing value), might not arrive in the original
order in which they were sent. Alternatively, they might be handled by different instances of a
background task in a different order due to varying loads on each instance. Messages that must be
processed in a specific order should include a sequence number, key, or some other indicator that
background tasks can use to ensure that they are processed in the correct order. If you are using
Azure Service Bus, you can use message sessions to guarantee the order of delivery. However, it is
usually more efficient, where possible, to design the process so that the message order is not
important.
Typically, a background task will peek at messages in the queue, which temporarily hides them from
other message consumers. Then it deletes the messages after they have been successfully processed.
If a background task fails when processing a message, that message will reappear on the queue after
the peek time-out expires. It will be processed by another instance of the task or during the next
processing cycle of this instance. If the message consistently causes an error in the consumer, it will
block the task, the queue, and eventually the application itself when the queue becomes full.
Therefore, it is vital to detect and remove poison messages from the queue. If you are using Azure
Service Bus, messages that cause an error can be moved automatically or manually to an associated
dead letter queue.
Queues guarantee at-least-once delivery, but they might deliver the same message
more than once. In addition, if a background task fails after processing a message but before deleting
it from the queue, the message will become available for processing again. Background tasks should
be idempotent, which means that processing the same message more than once does not cause an
error or inconsistency in the application's data. Some operations are naturally idempotent, such as
setting a stored value to a specific new value. However, operations such as adding a value to an
existing stored value without checking that the stored value is still the same as when the message
was originally sent will cause inconsistencies. Azure Service Bus queues can be configured to
automatically remove duplicated messages.
Some messaging systems, such as Azure storage queues and Azure Service Bus queues, support a
de-queue count property that indicates the number of times a message has been read from the
queue. This can be useful in handling repeated and poison messages. For more information, see
Asynchronous Messaging Primer and Idempotency Patterns.
Related patterns
Compute Partitioning Guidance
Caching
12/18/2020 • 55 minutes to read • Edit Online
Caching is a common technique that aims to improve the performance and scalability of a system. It does this by
temporarily copying frequently accessed data to fast storage that's located close to the application. If this fast data
storage is located closer to the application than the original source, then caching can significantly improve
response times for client applications by serving data more quickly.
Caching is most effective when a client instance repeatedly reads the same data, especially if all the following
conditions apply to the original data store:
It remains relatively static.
It's slow compared to the speed of the cache.
It's subject to a high level of contention.
It's far away when network latency can cause access to be slow.
NOTE
Consider the expiration period for the cache and the objects that it contains carefully. If you make it too short, objects will
expire too quickly and you will reduce the benefits of using the cache. If you make the period too long, you risk the data
becoming stale.
It's also possible that the cache might fill up if data is allowed to remain resident for a long time. In this case, any
requests to add new items to the cache might cause some items to be forcibly removed in a process known as
eviction. Cache services typically evict data on a least-recently-used (LRU) basis, but you can usually override this
policy and prevent items from being evicted. However, if you adopt this approach, you risk exceeding the memory
that's available in the cache. An application that attempts to add an item to the cache will fail with an exception.
Some caching implementations might provide additional eviction policies. There are several types of eviction
policies. These include:
A most-recently-used policy (in the expectation that the data will not be required again).
A first-in-first-out policy (oldest data is evicted first).
An explicit removal policy based on a triggered event (such as the data being modified).
Invalidate data in a client-side cache
Data that's held in a client-side cache is generally considered to be outside the auspices of the service that
provides the data to the client. A service cannot directly force a client to add or remove information from a client-
side cache.
This means that it's possible for a client that uses a poorly configured cache to continue using outdated
information. For example, if the expiration policies of the cache aren't properly implemented, a client might use
outdated information that's cached locally when the information in the original data source has changed.
If you are building a web application that serves data over an HTTP connection, you can implicitly force a web
client (such as a browser or web proxy) to fetch the most recent information. One way to do this is to change the
URI of a resource whenever the resource is updated. Web clients typically use the URI of a resource as the key in the
client-side cache, so if the URI changes, the web client ignores any previously cached versions of a resource and
fetches the new version instead.
NOTE
Redis does not guarantee that all writes will be saved in the event of a catastrophic failure, but at worst you might lose only
a few seconds worth of data. Remember that a cache is not intended to act as an authoritative data source, and it is the
responsibility of the applications using the cache to ensure that critical data is saved successfully to an appropriate data
store. For more information, see the Cache-aside pattern.
NOTE
Azure Cache for Redis provides its own security layer through which clients connect. The underlying Redis servers are not
exposed to the public network.
NOTE
Do not use the session state provider for Azure Cache for Redis with ASP.NET applications that run outside of the Azure
environment. The latency of accessing the cache from outside of Azure can eliminate the performance benefits of caching
data.
Similarly, the output cache provider for Azure Cache for Redis enables you to save the HTTP responses generated
by an ASP.NET web application. Using the output cache provider with Azure Cache for Redis can improve the
response times of applications that render complex HTML output. Application instances that generate similar
responses can use the shared output fragments in the cache rather than generating this HTML output afresh. For
more information, see ASP.NET output cache provider for Azure Cache for Redis.
NOTE
If you implement your own Redis cache in this way, you are responsible for monitoring, managing, and securing the service.
// Attempt to retrieve the item from the cache
string itemValue = await cache.StringGetAsync(itemKey);

// If the value returned is null, the item was not found in the cache
// So retrieve the item from the data source and add it to the cache
if (itemValue == null)
{
    itemValue = await GetItemFromDataSourceAsync(itemKey);
    await cache.StringSetAsync(itemKey, itemValue);
}
The StringGet and StringSet methods are not restricted to retrieving or storing string values. They can take any
item that is serialized as an array of bytes. If you need to save a .NET object, you can serialize it as a byte stream
and use the StringSet method to write it to the cache.
Similarly, you can read an object from the cache by using the StringGet method and deserializing it as a .NET
object. The following code shows a set of extension methods for the IDatabase interface (the GetDatabase method
of a Redis connection returns an IDatabase object), and some sample code that uses these methods to read and
write a BlogPost object to the cache:
public static class RedisCacheExtensions
{
    public static async Task<T> GetAsync<T>(this IDatabase cache, string key)
    {
        return Deserialize<T>(await cache.StringGetAsync(key));
    }

    public static async Task SetAsync(this IDatabase cache, string key, object value)
    {
        await cache.StringSetAsync(key, Serialize(value));
    }

    static byte[] Serialize(object o)
    {
        byte[] objectDataAsStream = null;

        if (o != null)
        {
            BinaryFormatter binaryFormatter = new BinaryFormatter();
            using (MemoryStream memoryStream = new MemoryStream())
            {
                binaryFormatter.Serialize(memoryStream, o);
                objectDataAsStream = memoryStream.ToArray();
            }
        }

        return objectDataAsStream;
    }

    static T Deserialize<T>(byte[] stream)
    {
        T result = default(T);

        if (stream != null)
        {
            BinaryFormatter binaryFormatter = new BinaryFormatter();
            using (MemoryStream memoryStream = new MemoryStream(stream))
            {
                result = (T)binaryFormatter.Deserialize(memoryStream);
            }
        }

        return result;
    }
}
The following code illustrates a method named RetrieveBlogPost that uses these extension methods to read and
write a serializable BlogPost object to the cache following the cache-aside pattern:
(In the sketch below, GetBlogPostFromDataSourceAsync stands in for whatever data-access call your application uses.)

// The BlogPost type (constructor and members elided)
[Serializable]
public class BlogPost
{
    private HashSet<string> tags;
}

// Cache-aside read of a blog post
private async Task<BlogPost> RetrieveBlogPost(IDatabase cache, int id)
{
    BlogPost blogPost = await cache.GetAsync<BlogPost>("BlogPost:" + id);
    if (blogPost == null)
    {
        blogPost = await GetBlogPostFromDataSourceAsync(id); // illustrative lookup
        await cache.SetAsync("BlogPost:" + id, blogPost);
    }
    return blogPost;
}
Redis supports command pipelining if a client application sends multiple asynchronous requests. Redis can
multiplex the requests using the same connection rather than receiving and responding to commands in a strict
sequence.
This approach helps to reduce latency by making more efficient use of the network. The following code snippet
shows an example that retrieves the details of two customers concurrently. The code submits two requests and
then performs some other processing (not shown) before waiting to receive the results. The Wait method of the
cache object is similar to the .NET Framework Task.Wait method:
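The following is a minimal sketch of that pattern; it assumes customer details are stored under keys such as
customer:1 and uses the same connection setup as the other snippets in this article.

ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
// Issue both requests without waiting; the library pipelines them over one connection
Task<RedisValue> getCustomer1 = cache.StringGetAsync("customer:1");
Task<RedisValue> getCustomer2 = cache.StringGetAsync("customer:2");
...
// Perform other processing here, then wait for both results
var customer1 = cache.Wait(getCustomer1);
var customer2 = cache.Wait(getCustomer2);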
For additional information on writing client applications that can use Azure Cache for Redis, see the Azure Cache
for Redis documentation. More information is also available at StackExchange.Redis.
The page Pipelines and multiplexers on the same website provides more information about asynchronous
operations and pipelining with Redis and the StackExchange library.
GETSET, which retrieves the value that's associated with a key and changes it to a new value. The
StackExchange library makes this operation available through the IDatabase.StringGetSetAsync method.
The code snippet below shows an example of this method. This code returns the current value that's
associated with the key "data:counter" from the previous example. Then it resets the value for this key back
to zero, all as part of the same operation:
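A minimal sketch of that call (the data:counter key is assumed to hold an integer value, as in the earlier example):

ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
// Atomically read the current value of data:counter and reset it to zero
long oldValue = (long)await cache.StringGetSetAsync("data:counter", 0);
Console.WriteLine($"Value of data:counter before the reset: {oldValue}");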
MGET and MSET, which can return or change a set of string values as a single operation. The
IDatabase.StringGetAsync and IDatabase.StringSetAsync methods are overloaded to support this
functionality, as shown in the following example:
ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
// Create a list of key-value pairs
var keysAndValues =
new List<KeyValuePair<RedisKey, RedisValue>>()
{
new KeyValuePair<RedisKey, RedisValue>("data:key1", "value1"),
new KeyValuePair<RedisKey, RedisValue>("data:key99", "value2"),
new KeyValuePair<RedisKey, RedisValue>("data:key322", "value3")
};
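Continuing that snippet, the following sketch stores the list in a single operation and then reads several keys
back in a single operation:

// Store all three key-value pairs in one operation (MSET)
await cache.StringSetAsync(keysAndValues.ToArray());

// Retrieve several values in one operation (MGET)
RedisKey[] keys = { "data:key1", "data:key99", "data:key322" };
RedisValue[] values = await cache.StringGetAsync(keys);
foreach (var value in values)
{
    Console.WriteLine(value);
}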
You can also combine multiple operations into a single Redis transaction as described in the Redis transactions
and batches section earlier in this article. The StackExchange library provides support for transactions through the
ITransaction interface.
You create an ITransaction object by using the IDatabase.CreateTransaction method. You invoke commands to
the transaction by using the methods provided by the ITransaction object.
The ITransaction interface provides access to a set of methods that's similar to those accessed by the IDatabase
interface, except that all the methods are asynchronous. This means that they are only performed when the
ITransaction.Execute method is invoked. The value that's returned by the ITransaction.Execute method
indicates whether the transaction was created successfully (true) or if it failed (false).
The following code snippet shows an example that increments and decrements two counters as part of the same
transaction:
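A minimal sketch of such a transaction, assuming counters named data:counter1 and data:counter2:

ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
ITransaction transaction = cache.CreateTransaction();

// Queue the commands; they are not sent to the server until Execute is called
Task<long> incrementTask = transaction.StringIncrementAsync("data:counter1");
Task<long> decrementTask = transaction.StringDecrementAsync("data:counter2");

bool committed = transaction.Execute();
Console.WriteLine($"Transaction {(committed ? "committed" : "failed")}");

// Reading Result blocks until the corresponding command has completed
Console.WriteLine($"data:counter1 is now {incrementTask.Result}");
Console.WriteLine($"data:counter2 is now {decrementTask.Result}");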
Remember that Redis transactions are unlike transactions in relational databases. The Execute method simply
queues all the commands that comprise the transaction to be run, and if any of them is malformed then the
transaction is stopped. If all the commands have been queued successfully, each command runs asynchronously.
If any command fails, the others still continue processing. If you need to verify that a command has completed
successfully, you must fetch the results of the command by using the Result property of the corresponding task,
as shown in the example above. Reading the Result property will block the calling thread until the task has
completed.
For more information, see Transactions in Redis.
When performing batch operations, you can use the IBatch interface of the StackExchange library. This interface
provides access to a set of methods similar to those accessed by the IDatabase interface, except that all the
methods are asynchronous.
You create an IBatch object by using the IDatabase.CreateBatch method, and then run the batch by using the
IBatch.Execute method, as shown in the following example. This code simply sets a string value, increments and
decrements the same counters used in the previous example, and displays the results:
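A minimal sketch of such a batch:

ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
IBatch batch = cache.CreateBatch();

// Queue the commands; they are sent to the server together when Execute is called
Task setTask = batch.StringSetAsync("data:key1", 11);
Task<long> incrementTask = batch.StringIncrementAsync("data:counter1");
Task<long> decrementTask = batch.StringDecrementAsync("data:counter2");

batch.Execute();
Console.WriteLine($"data:counter1 is now {incrementTask.Result}");
Console.WriteLine($"data:counter2 is now {decrementTask.Result}");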
It is important to understand that unlike a transaction, if a command in a batch fails because it is malformed, the
other commands might still run. The IBatch.Execute method does not return any indication of success or failure.
Perform fire and forget cache operations
Redis supports fire and forget operations by using command flags. In this situation, the client simply initiates an
operation but has no interest in the result and does not wait for the command to be completed. The example
below shows how to perform the INCR command as a fire and forget operation:
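A minimal sketch, using the FireAndForget command flag that the StackExchange library provides:

ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
// Increment the counter without waiting for (or receiving) the result
await cache.StringIncrementAsync("data:counter", flags: CommandFlags.FireAndForget);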
You can also set the expiration time to a specific date and time by using the EXPIRE command, which is available in
the StackExchange library as the KeyExpireAsync method:
ConnectionMultiplexer redisHostConnection = ...;
IDatabase cache = redisHostConnection.GetDatabase();
...
// Add a key with an expiration date of midnight on 1st January 2015
await cache.StringSetAsync("data:key1", 99);
await cache.KeyExpireAsync("data:key1",
new DateTime(2015, 1, 1, 0, 0, 0, DateTimeKind.Utc));
...
TIP
You can manually remove an item from the cache by using the DEL command, which is available through the StackExchange
library as the IDatabase.KeyDeleteAsync method.
You can also combine existing sets to create new sets by using the SDIFF (set difference), SINTER (set intersection),
and SUNION (set union) commands. The StackExchange library unifies these operations in the
IDatabase.SetCombineAsync method. The first parameter to this method specifies the set operation to perform.
The following code snippets show how sets can be useful for quickly storing and retrieving collections of related
items. This code uses the BlogPost type that was described in the section Implement Redis Cache Client
Applications earlier in this article.
A BlogPost object contains four fields—an ID, a title, a ranking score, and a collection of tags. The first code
snippet below shows the sample data that's used for populating a C# list of BlogPost objects:
List<string[]> tags = new List<string[]>
{
new[] { "iot","csharp" },
new[] { "iot","azure","csharp" },
new[] { "csharp","git","big data" },
new[] { "iot","git","database" },
new[] { "database","git" },
new[] { "csharp","database" },
new[] { "iot" },
new[] { "iot","database","git" },
new[] { "azure","database","big data","git","csharp" },
new[] { "azure" }
};
You can store the tags for each BlogPost object as a set in a Redis cache and associate each set with the ID of the
BlogPost . This enables an application to quickly find all the tags that belong to a specific blog post. To enable
searching in the opposite direction and find all blog posts that share a specific tag, you can create another set that
holds the blog posts referencing the tag ID in the key:
// posts is the List<BlogPost> built from the sample data shown above
foreach (BlogPost post in posts)
{
    // Store the tags for this blog post as a set keyed by the post ID
    await cache.SetAddAsync(string.Format(CultureInfo.InvariantCulture,
        "blog:posts:{0}:tags", post.Id), post.Tags.Select(s => (RedisValue)s).ToArray());

    // Now do the inverse so we can figure out which blog posts have a given tag
    foreach (var tag in post.Tags)
    {
        await cache.SetAddAsync(string.Format(CultureInfo.InvariantCulture,
            "tag:{0}:blog:posts", tag), post.Id);
    }
}
These structures enable you to perform many common queries very efficiently. For example, you can find and
display all of the tags for blog post 1 like this:
// Show the tags for blog post #1
foreach (var value in await cache.SetMembersAsync("blog:posts:1:tags"))
{
Console.WriteLine(value);
}
You can find all tags that are common to blog post 1 and blog post 2 by performing a set intersection operation,
as follows:
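A sketch of that query, using SetCombineAsync with a set intersection:

// Show the tags that blog posts #1 and #2 have in common
foreach (var value in await cache.SetCombineAsync(SetOperation.Intersect,
    "blog:posts:1:tags", "blog:posts:2:tags"))
{
    Console.WriteLine(value);
}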
And you can find all blog posts that contain a specific tag:
// Show the ids of the blog posts that have the tag "iot".
foreach (var value in await cache.SetMembersAsync("tag:iot:blog:posts"))
{
Console.WriteLine(value);
}
As more blog posts are read, their titles are pushed onto the same list. The list is ordered by the sequence in which
the titles have been added. The most recently read blog posts are toward the left end of the list. (If the same blog
post is read more than once, it will have multiple entries in the list.)
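A sketch of the push operation that adds each title as it is read (redisKey is assumed to identify the list of
recently read titles, and blogPost is the post that was just read):

// Push the title of the post that was just read onto the left end of the list
await cache.ListLeftPushAsync(redisKey, blogPost.Title);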
You can display the titles of the most recently read posts by using the IDatabase.ListRange method. This method
takes the key that contains the list, a starting point, and an ending point. The following code retrieves the titles of
the 10 blog posts (items from 0 to 9) at the left-most end of the list:
// Show latest ten posts
foreach (string postTitle in await cache.ListRangeAsync(redisKey, 0, 9))
{
Console.WriteLine(postTitle);
}
Note that the ListRangeAsync method does not remove items from the list. To do this, you can use the
IDatabase.ListLeftPopAsync and IDatabase.ListRightPopAsync methods.
To prevent the list from growing indefinitely, you can periodically cull items by trimming the list. The code snippet
below shows you how to remove all but the five left-most items from the list:
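A sketch of that trim operation:

// Keep only the five most recently read titles (items 0 through 4)
await cache.ListTrimAsync(redisKey, 0, 4);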
You can retrieve the blog post titles and scores in ascending score order by using the
IDatabase.SortedSetRangeByRankWithScores method:
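A sketch of that call, assuming redisKey identifies the sorted set of blog post titles and scores:

// Show all post titles and their scores, lowest score first
foreach (var entry in await cache.SortedSetRangeByRankWithScoresAsync(redisKey))
{
    Console.WriteLine($"{entry.Element} (score: {entry.Score})");
}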
NOTE
The StackExchange library also provides the IDatabase.SortedSetRangeByRankAsync method, which returns the data in
score order, but does not return the scores.
You can also retrieve items in descending order of scores, and limit the number of items that are returned by
providing additional parameters to the IDatabase.SortedSetRangeByRankWithScoresAsync method. The next example
displays the titles and scores of the top 10 ranked blog posts:
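A sketch of that call:

// Show the titles and scores of the 10 highest-ranked posts
foreach (var entry in await cache.SortedSetRangeByRankWithScoresAsync(
    redisKey, 0, 9, Order.Descending))
{
    Console.WriteLine($"{entry.Element} (score: {entry.Score})");
}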
The next example uses the IDatabase.SortedSetRangeByScoreWithScoresAsync method, which you can use to limit
the items that are returned to those that fall within a given score range:
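A sketch of that call, limiting the results to scores between 200 and 400:

// Show posts whose score falls between 200 and 400 (inclusive)
foreach (var entry in await cache.SortedSetRangeByScoreWithScoresAsync(
    redisKey, 200, 400))
{
    Console.WriteLine($"{entry.Element} (score: {entry.Score})");
}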
The first parameter to the Subscribe method is the name of the channel. This name follows the same conventions
that are used by keys in the cache. The name can contain any binary data, although it is advisable to use relatively
short, meaningful strings to help ensure good performance and maintainability.
Note also that the namespace used by channels is separate from that used by keys. This means you can have
channels and keys that have the same name, although this may make your application code more difficult to
maintain.
The second parameter is an Action delegate. This delegate runs asynchronously whenever a new message
appears on the channel. This example simply displays the message on the console (the message will contain the
title of a blog post).
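A minimal sketch of such a subscription:

ConnectionMultiplexer redisHostConnection = ...;
ISubscriber subscriber = redisHostConnection.GetSubscriber();
...
// Print the title of each blog post that is announced on the channel
await subscriber.SubscribeAsync("messages:blogPosts",
    (channel, message) => Console.WriteLine($"Blog post title: {message}"));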
To publish to a channel, an application can use the Redis PUBLISH command. The StackExchange library provides
the ISubscriber.PublishAsync method to perform this operation. The next code snippet shows how to publish a
message to the "messages:blogPosts" channel:
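A minimal sketch of publishing a message:

ConnectionMultiplexer redisHostConnection = ...;
ISubscriber subscriber = redisHostConnection.GetSubscriber();
...
// Notify all current subscribers that a new post is available
await subscriber.PublishAsync("messages:blogPosts", "A new blog post has been added");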
There are several points you should understand about the publish/subscribe mechanism:
Multiple subscribers can subscribe to the same channel, and they will all receive the messages that are
published to that channel.
Subscribers only receive messages that have been published after they have subscribed. Channels are not
buffered, and once a message is published, the Redis infrastructure pushes the message to each subscriber and
then removes it.
By default, messages are received by subscribers in the order in which they are sent. In a highly active system
with a large number of messages and many subscribers and publishers, guaranteed sequential delivery of
messages can slow performance of the system. If each message is independent and the order is unimportant,
you can enable concurrent processing by the Redis system, which can help to improve responsiveness. You can
achieve this in a StackExchange client by setting the PreserveAsyncOrder property of the connection used by the
subscriber to false:
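A minimal sketch of that setting (the PreserveAsyncOrder property is exposed on the ConnectionMultiplexer in 1.x
versions of the StackExchange.Redis library; later versions handle asynchronous ordering differently):

ConnectionMultiplexer redisHostConnection = ...;
redisHostConnection.PreserveAsyncOrder = false;
...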
Serialization considerations
When you choose a serialization format, consider tradeoffs between performance, interoperability, versioning,
compatibility with existing systems, data compression, and memory overhead. When you are evaluating
performance, remember that benchmarks are highly dependent on context. They may not reflect your actual
workload, and may not consider newer libraries or versions. There is no single "fastest" serializer for all scenarios.
Some options to consider include:
Protocol Buffers (also called protobuf) is a serialization format developed by Google for serializing
structured data efficiently. It uses strongly typed definition files to define message structures. These
definition files are then compiled to language-specific code for serializing and deserializing messages.
Protobuf can be used over existing RPC mechanisms, or it can generate an RPC service.
Apache Thrift uses a similar approach, with strongly typed definition files and a compilation step to
generate the serialization code and RPC services.
Apache Avro provides similar functionality to Protocol Buffers and Thrift, but there is no compilation step.
Instead, serialized data always includes a schema that describes the structure.
JSON is an open standard that uses human-readable text fields. It has broad cross-platform support. JSON
does not use message schemas. Being a text-based format, it is not very efficient over the wire. In some
cases, however, you may be returning cached items directly to a client via HTTP, in which case storing JSON
could save the cost of deserializing from another format and then serializing to JSON.
BSON is a binary serialization format that uses a structure similar to JSON. BSON was designed to be
lightweight, easy to scan, and fast to serialize and deserialize, relative to JSON. Payloads are comparable in
size to JSON. Depending on the data, a BSON payload may be smaller or larger than a JSON payload.
BSON has some additional data types that are not available in JSON, notably BinData (for byte arrays) and
Date.
MessagePack is a binary serialization format that is designed to be compact for transmission over the wire.
There are no message schemas or message type checking.
Bond is a cross-platform framework for working with schematized data. It supports cross-language
serialization and deserialization. Notable differences from other systems listed here are support for
inheritance, type aliases, and generics.
gRPC is an open-source RPC system developed by Google. By default, it uses Protocol Buffers as its
definition language and underlying message interchange format.
More information
Azure Cache for Redis documentation
Azure Cache for Redis FAQ
Task-based Asynchronous pattern
Redis documentation
StackExchange.Redis
Data partitioning guide
Best practices for using content delivery networks
(CDNs)
12/18/2020 • 8 minutes to read • Edit Online
A content delivery network (CDN) is a distributed network of servers that can efficiently deliver web content to
users. CDNs store cached content on edge servers that are close to end users to minimize latency.
CDNs are typically used to deliver static content such as images, style sheets, documents, client-side scripts, and
HTML pages. The major advantages of using a CDN are lower latency and faster delivery of content to users,
regardless of their geographical location in relation to the datacenter where the application is hosted. CDNs can
also help to reduce load on a web application, because the application does not have to service requests for the
content that is hosted in the CDN.
In Azure, the Azure Content Delivery Network is a global CDN solution for delivering high-bandwidth content that
is hosted in Azure or any other location. Using Azure CDN, you can cache publicly available objects loaded from
Azure blob storage, a web application, a virtual machine, or any publicly accessible web server.
This topic describes some general best practices and considerations when using a CDN. For more information, see
Azure CDN.
Challenges
There are several challenges to take into account when planning to use a CDN.
Deployment. Decide the origin from which the CDN fetches the content, and whether you need to deploy
the content in more than one storage system. Take into account the process for deploying static content and
resources. For example, you may need to implement a separate step to load content into Azure blob
storage.
Versioning and cache-control. Consider how you will update static content and deploy new versions.
Understand how the CDN performs caching and time-to-live (TTL). For Azure CDN, see How caching works.
Testing. It can be difficult to perform local testing of your CDN settings when developing and testing an
application locally or in a staging environment.
Search engine optimization (SEO). Content such as images and documents are served from a different
domain when you use the CDN. This can have an effect on SEO for this content.
Content security. Not all CDNs offer access control for content. Some CDN services,
including Azure CDN, support token-based authentication to protect CDN content. For more information,
see Securing Azure Content Delivery Network assets with token authentication.
Client security. Clients might connect from an environment that does not allow access to resources on the
CDN. This could be a security-constrained environment that limits access to only a set of known sources, or
one that prevents loading of resources from anything other than the page origin. A fallback implementation
is required to handle these cases.
Resilience. The CDN is a potential single point of failure for an application.
Scenarios where a CDN may be less useful include:
If the content has a low hit rate, it might be accessed only a few times while it is valid (determined by its time-
to-live setting).
If the data is private, such as for large enterprises or supply chain ecosystems.
Data partitioning
In many large-scale solutions, data is divided into partitions that can be managed and accessed separately.
Partitioning can improve scalability, reduce contention, and optimize performance. It can also provide a
mechanism for dividing data by usage pattern. For example, you can archive older data in cheaper data storage.
However, the partitioning strategy must be chosen carefully to maximize the benefits while minimizing adverse
effects.
NOTE
In this article, the term partitioning means the process of physically dividing data into separate data stores. It is not the
same as SQL Server table partitioning.
Designing partitions
There are three typical strategies for partitioning data:
Horizontal partitioning (often called sharding). In this strategy, each partition is a separate data store,
but all partitions have the same schema. Each partition is known as a shard and holds a specific subset of
the data, such as all the orders for a specific set of customers.
Vertical partitioning. In this strategy, each partition holds a subset of the fields for items in the data
store. The fields are divided according to their pattern of use. For example, frequently accessed fields
might be placed in one vertical partition and less frequently accessed fields in another.
Functional partitioning. In this strategy, data is aggregated according to how it is used by each
bounded context in the system. For example, an e-commerce system might store invoice data in one
partition and product inventory data in another.
These strategies can be combined, and we recommend that you consider them all when you design a
partitioning scheme. For example, you might divide data into shards and then use vertical partitioning to further
subdivide the data in each shard.
Horizontal partitioning (sharding)
Figure 1 shows horizontal partitioning or sharding. In this example, product inventory data is divided into shards
based on the product key. Each shard holds the data for a contiguous range of shard keys (A-G and H-Z),
organized alphabetically. Sharding spreads the load over more computers, which reduces contention and
improves performance.
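As an illustration only, application code that routes a request to the shard holding a given product key (matching
the A-G and H-Z ranges in Figure 1) might look like the following sketch; the connection-string variables are
placeholders.

// Pick the shard that holds a given product key, based on its first letter
static string GetShardConnectionString(string productKey)
{
    char firstLetter = char.ToUpperInvariant(productKey[0]);
    // Shard 1 holds keys A-G, shard 2 holds keys H-Z (placeholders defined elsewhere)
    return firstLetter <= 'G' ? shardAtoGConnectionString : shardHtoZConnectionString;
}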
Rebalancing partitions
As a system matures, you might have to adjust the partitioning scheme. For example, individual partitions might
start getting a disproportionate volume of traffic and become hot, leading to excessive contention. Or you might
have underestimated the volume of data in some partitions, causing some partitions to approach capacity limits.
Some data stores, such as Cosmos DB, can automatically rebalance partitions. In other cases, rebalancing is an
administrative task that consists of two stages:
1. Determine a new partitioning strategy.
Which partitions need to be split (or possibly combined)?
What is the new partition key?
2. Migrate data from the old partitioning scheme to the new set of partitions.
Depending on the data store, you might be able to migrate data between partitions while they are in use. This is
called online migration. If that's not possible, you might need to make partitions unavailable while the data is
relocated (offline migration).
Offline migration
Offline migration is typically simpler because it reduces the chances of contention occurring. Conceptually, offline
migration works as follows:
1. Mark the partition offline.
2. Split-merge and move the data to the new partitions.
3. Verify the data.
4. Bring the new partitions online.
5. Remove the old partition.
Optionally, you can mark a partition as read-only in step 1, so that applications can still read the data while it is
being moved.
Online migration
Online migration is more complex to perform but less disruptive. The process is similar to offline migration,
except the original partition is not marked offline. Depending on the granularity of the migration process (for
example, item by item versus shard by shard), the data access code in the client applications might have to
handle reading and writing data that's held in two locations, the original partition and the new partition.
Related patterns
The following design patterns might be relevant to your scenario:
The sharding pattern describes some common strategies for sharding data.
The index table pattern shows how to create secondary indexes over data. An application can quickly
retrieve data with this approach, by using queries that do not reference the primary key of a collection.
The materialized view pattern describes how to generate prepopulated views that summarize data to
support fast query operations. This approach can be useful in a partitioned data store if the partitions that
contain the data being summarized are distributed across multiple sites.
Next steps
Learn about partitioning strategies for specific Azure services. See Data partitioning strategies
Data partitioning strategies
12/18/2020 • 31 minutes to read • Edit Online
This article describes some strategies for partitioning data in various Azure data stores. For general guidance about
when to partition data and best practices, see Data partitioning.
A single shard can contain the data for several shardlets. For example, you can use list shardlets to store data for
different non-contiguous tenants in the same shard. You can also mix range shardlets and list shardlets in the same
shard, although they will be addressed through different maps. The following diagram shows this approach:
Elastic pools make it possible to add and remove shards as the volume of data shrinks and grows. Client
applications can create and delete shards dynamically, and transparently update the shard map manager. However,
removing a shard is a destructive operation that also requires deleting all the data in that shard.
If an application needs to split a shard into two separate shards or combine shards, use the split-merge tool. This
tool runs as an Azure web service, and migrates data safely between shards.
The partitioning scheme can significantly affect the performance of your system. It can also affect the rate at which
shards have to be added or removed, or that data must be repartitioned across shards. Consider the following
points:
Group data that is used together in the same shard, and avoid operations that access data from multiple
shards. A shard is a SQL database in its own right, and cross-database joins must be performed on the client
side.
Although SQL Database does not support cross-database joins, you can use the Elastic Database tools to
perform multi-shard queries. A multi-shard query sends individual queries to each database and merges the
results.
Don't design a system that has dependencies between shards. Referential integrity constraints, triggers, and
stored procedures in one database cannot reference objects in another.
If you have reference data that is frequently used by queries, consider replicating this data across shards.
This approach can remove the need to join data across databases. Ideally, such data should be static or slow-
moving, to minimize the replication effort and reduce the chances of it becoming stale.
Shardlets that belong to the same shard map should have the same schema. This rule is not enforced by
SQL Database, but data management and querying become very complex if each shardlet has a different
schema. Instead, create separate shard maps for each schema. Remember that data belonging to different
shardlets can be stored in the same shard.
Transactional operations are only supported for data within a shard, and not across shards. Transactions can
span shardlets as long as they are part of the same shard. Therefore, if your business logic needs to perform
transactions, either store the data in the same shard or implement eventual consistency.
Place shards close to the users that access the data in those shards. This strategy helps reduce latency.
Avoid having a mixture of highly active and relatively inactive shards. Try to spread the load evenly across
shards. This might require hashing the sharding keys. If you are geo-locating shards, make sure that the
hashed keys map to shardlets held in shards stored close to the users that access that data.
NOTE
If the SessionId and PartitionKey properties are both specified, then they must be set to the same value or the
message will be rejected.
If the SessionId and PartitionKey properties for a message are not specified, but duplicate detection is
enabled, the MessageId property will be used. All messages with the same MessageId will be directed to the
same fragment.
If messages do not include a SessionId, PartitionKey, or MessageId property, then Service Bus assigns
messages to fragments sequentially. If a fragment is unavailable, Service Bus will move on to the next. This
means that a temporary fault in the messaging infrastructure does not cause the message-send operation to
fail.
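The following sketch, assuming the Python azure-servicebus SDK with placeholder connection, queue, and identifier values (none of which come from this guidance), shows how a sender might set these properties on a message:

from azure.servicebus import ServiceBusClient, ServiceBusMessage

conn_str = "<service-bus-connection-string>"   # placeholder
queue_name = "orders"                          # placeholder

with ServiceBusClient.from_connection_string(conn_str) as client:
    with client.get_queue_sender(queue_name=queue_name) as sender:
        # Messages that share the same SessionId (or PartitionKey) are routed
        # to the same fragment of a partitioned queue or topic.
        message = ServiceBusMessage(
            "order created",
            session_id="customer-42",   # doubles as the partition key when set
            message_id="order-1001",    # used when duplicate detection is enabled
        )
        sender.send_messages(message)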
Consider the following points when deciding if or how to partition a Service Bus message queue or topic:
Service Bus queues and topics are created within the scope of a Service Bus namespace. Service Bus
currently allows up to 100 partitioned queues or topics per namespace.
Each Service Bus namespace imposes quotas on the available resources, such as the number of
subscriptions per topic, the number of concurrent send and receive requests per second, and the maximum
number of concurrent connections that can be established. These quotas are documented in Service Bus
quotas. If you expect to exceed these values, then create additional namespaces with their own queues and
topics, and spread the work across these namespaces. For example, in a global application, create separate
namespaces in each region and configure application instances to use the queues and topics in the nearest
namespace.
Messages that are sent as part of a transaction must specify a partition key. This can be a SessionId,
PartitionKey, or MessageId property. All messages that are sent as part of the same transaction must specify
the same partition key because they must be handled by the same message broker process. You cannot
send messages to different queues or topics within the same transaction.
Partitioned queues and topics can't be configured to be automatically deleted when they become idle.
Partitioned queues and topics can't currently be used with the Advanced Message Queuing Protocol (AMQP)
if you are building cross-platform or hybrid solutions.
Partitioning Cosmos DB
Azure Cosmos DB is a NoSQL database that can store JSON documents using the Azure Cosmos DB SQL API. A
document in a Cosmos DB database is a JSON-serialized representation of an object or other piece of data. No
fixed schemas are enforced except that every document must contain a unique ID.
Documents are organized into collections. You can group related documents together in a collection. For example,
in a system that maintains blog postings, you can store the contents of each blog post as a document in a
collection. You can also create collections for each subject type. Alternatively, in a multitenant application, such as a
system where different authors control and manage their own blog posts, you can partition blogs by author and
create separate collections for each author. The storage space that's allocated to collections is elastic and can shrink
or grow as needed.
Cosmos DB supports automatic partitioning of data based on an application-defined partition key. A logical
partition is a partition that stores all the data for a single partition key value. All documents that share the same
value for the partition key are placed within the same logical partition. Cosmos DB distributes data according to the hash of the partition key. A logical partition has a maximum size of 10 GB. Therefore, the choice of the partition key
is an important decision at design time. Choose a property with a wide range of values and even access patterns.
For more information, see Partition and scale in Azure Cosmos DB.
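As a minimal sketch, assuming the Python azure-cosmos SDK and illustrative account, database, container, and partition key names, a container with an application-defined partition key might be created like this:

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")  # placeholders
database = client.create_database_if_not_exists("blogging")

# Pick a partition key with many distinct values and even access patterns,
# such as the author ID in a multitenant blogging system.
container = database.create_container_if_not_exists(
    id="posts",
    partition_key=PartitionKey(path="/authorId"),
)

container.upsert_item({
    "id": "post-1",
    "authorId": "author-42",  # documents sharing this value share a logical partition
    "title": "Hello, world",
})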
NOTE
Each Cosmos DB database has a performance level that determines the amount of resources it gets. A performance level is
associated with a request unit (RU) rate limit. The RU rate limit specifies the volume of resources that's reserved and available
for exclusive use by that collection. The cost of a collection depends on the performance level that's selected for that
collection. The higher the performance level (and RU rate limit), the higher the charge. You can adjust the performance level of
a collection by using the Azure portal. For more information, see Request Units in Azure Cosmos DB.
If the partitioning mechanism that Cosmos DB provides is not sufficient, you may need to shard the data at the
application level. Document collections provide a natural mechanism for partitioning data within a single database.
The simplest way to implement sharding is to create a collection for each shard. Containers are logical resources
and can span one or more servers. Fixed-size containers have a maximum limit of 10 GB and 10,000 RU/s
throughput. Unlimited containers do not have a maximum storage size, but must specify a partition key. With
application sharding, the client application must direct requests to the appropriate shard, usually by implementing
its own mapping mechanism based on some attributes of the data that define the shard key.
All databases are created in the context of a Cosmos DB database account. A single account can contain several
databases, and it specifies in which regions the databases are created. Each account also enforces its own access
control. You can use Cosmos DB accounts to geo-locate shards (collections within databases) close to the users
who need to access them, and enforce restrictions so that only those users can connect to them.
Consider the following points when deciding how to partition data with the Cosmos DB SQL API:
The resources available to a Cosmos DB database are subject to the quota limitations of the
account. Each database can hold a number of collections, and each collection is associated with a
performance level that governs the RU rate limit (reserved throughput) for that collection. For more
information, see Azure subscription and service limits, quotas, and constraints.
Each document must have an attribute that can be used to uniquely identify that document
within the collection in which it is held. This attribute is different from the shard key, which defines
which collection holds the document. A collection can contain a large number of documents. In theory, it's
limited only by the maximum length of the document ID. The document ID can be up to 255 characters.
All operations against a document are performed within the context of a transaction.
Transactions are scoped to the collection in which the document is contained. If an operation fails,
the work that it has performed is rolled back. While a document is subject to an operation, any changes that
are made are subject to snapshot-level isolation. This mechanism guarantees that if, for example, a request
to create a new document fails, another user who's querying the database simultaneously will not see a
partial document that is then removed.
Database queries are also scoped to the collection level. A single query can retrieve data from only one collection. If you need to retrieve data from multiple collections, you must query each collection individually and merge the results in your application code, as shown in the sketch after this list.
Cosmos DB supports programmable items that can all be stored in a collection alongside documents. These include stored procedures, user-defined functions, and triggers (written in JavaScript).
These items can access any document within the same collection. Furthermore, these items run either inside
the scope of the ambient transaction (in the case of a trigger that fires as the result of a create, delete, or
replace operation performed against a document), or by starting a new transaction (in the case of a stored
procedure that is run as the result of an explicit client request). If the code in a programmable item throws
an exception, the transaction is rolled back. You can use stored procedures and triggers to maintain integrity
and consistency between documents, but these documents must all be part of the same collection.
The collections that you intend to hold in the databases should be unlikely to exceed the
throughput limits defined by the performance levels of the collections. For more information, see
Request Units in Azure Cosmos DB. If you anticipate reaching these limits, consider splitting collections
across databases in different accounts to reduce the load per collection.
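Because a single query is scoped to one collection, application code that spans collections must fan out and merge. The following sketch assumes the Python azure-cosmos SDK and hypothetical container names; it illustrates the merge step rather than a recommended sharding layout:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")  # placeholders
database = client.get_database_client("blogging")

query = "SELECT * FROM c WHERE c.year = @year"
parameters = [{"name": "@year", "value": 2020}]

results = []
for container_name in ["posts-authors-a-m", "posts-authors-n-z"]:  # hypothetical shards
    container = database.get_container_client(container_name)
    results.extend(container.query_items(
        query=query,
        parameters=parameters,
        enable_cross_partition_query=True,
    ))

# Merge in application code, for example by sorting the combined results.
results.sort(key=lambda doc: doc.get("title", ""))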
NOTE
You can store a limited set of data types in searchable documents, including strings, Booleans, numeric data, datetime data,
and some geographical data. For more information, see the page Supported data types (Azure Search) on the Microsoft
website.
You have limited control over how Azure Search partitions data for each instance of the service. However, in a
global environment you might be able to improve performance and reduce latency and contention further by
partitioning the service itself using either of the following strategies:
Create an instance of Azure Search in each geographic region, and ensure that client applications are
directed toward the nearest available instance. This strategy requires that any updates to searchable content
are replicated in a timely manner across all instances of the service.
Create two tiers of Azure Search:
A local service in each region that contains the data that's most frequently accessed by users in that
region. Users can direct requests here for fast but limited results.
A global service that encompasses all the data. Users can direct requests here for slower but more
complete results.
This approach is most suitable when there is a significant regional variation in the data that's being searched.
IMPORTANT
Azure Cache for Redis currently supports Redis clustering in the premium tier only.
The page Partitioning: how to split data among multiple Redis instances on the Redis website provides more
information about implementing partitioning with Redis. The remainder of this section assumes that you are
implementing client-side or proxy-assisted partitioning.
Consider the following points when deciding how to partition data with Azure Cache for Redis:
Azure Cache for Redis is not intended to act as a permanent data store, so whatever partitioning scheme you
implement, your application code must be able to retrieve data from a location that's not the cache.
Data that is frequently accessed together should be kept in the same partition. Redis is a powerful key-value
store that provides several highly optimized mechanisms for structuring data. These mechanisms include the following:
Simple strings (binary data up to 512 MB in length)
Aggregate types such as lists (which can act as queues and stacks)
Sets (ordered and unordered)
Hashes (which can group related fields together, such as the items that represent the fields in an object)
The aggregate types enable you to associate many related values with the same key. A Redis key identifies a
list, set, or hash rather than the data items that it contains. These types are all available with Azure Cache for
Redis and are described by the Data types page on the Redis website. For example, in part of an e-commerce
system that tracks the orders that are placed by customers, the details of each customer can be stored in a
Redis hash that is keyed by using the customer ID. Each hash can hold a collection of order IDs for the
customer. A separate Redis set can hold the orders, again structured as hashes, and keyed by using the order
ID. Figure 8 shows this structure, and a short sketch of this layout appears at the end of this list. Note that Redis does not implement any form of referential integrity, so it is the developer's responsibility to maintain the relationships between customers and orders.
Figure 8. Suggested structure in Redis storage for recording customer orders and their details.
NOTE
In Redis, all keys are binary data values (like Redis strings) and can contain up to 512 MB of data. In theory, a key can contain
almost any information. However, we recommend adopting a consistent naming convention for keys that is descriptive of the
type of data and that identifies the entity, but is not excessively long. A common approach is to use keys of the form
"entity_type:ID". For example, you can use "customer:99" to indicate the key for a customer with the ID 99.
You can implement vertical partitioning by storing related information in different aggregations in the same
database. For example, in an e-commerce application, you can store commonly accessed information about
products in one Redis hash and less frequently used detailed information in another. Both hashes can use
the same product ID as part of the key. For example, you can use "product:nn" (where nn is the product ID) for the product information and "product_details:nn" for the detailed data. This strategy can help reduce the
volume of data that most queries are likely to retrieve.
You can repartition a Redis data store, but keep in mind that it's a complex and time-consuming task. Redis
clustering can repartition data automatically, but this capability is not available with Azure Cache for Redis.
Therefore, when you design your partitioning scheme, try to leave sufficient free space in each partition to
allow for expected data growth over time. However, remember that Azure Cache for Redis is intended to
cache data temporarily, and that data held in the cache can have a limited lifetime specified as a time-to-live
(TTL) value. For relatively volatile data, the TTL can be short, but for static data the TTL can be a lot longer.
Avoid storing large amounts of long-lived data in the cache if the volume of this data is likely to fill the
cache. You can specify an eviction policy that causes Azure Cache for Redis to remove data if space is at a
premium.
NOTE
When you use Azure Cache for Redis, you specify the maximum size of the cache (from 250 MB to 53 GB) by
selecting the appropriate pricing tier. However, after an Azure Cache for Redis has been created, you cannot increase
(or decrease) its size.
Redis batches and transactions cannot span multiple connections, so all data that is affected by a batch or
transaction should be held in the same database (shard).
NOTE
A sequence of operations in a Redis transaction is not necessarily atomic. The commands that compose a transaction
are verified and queued before they run. If an error occurs during this phase, the entire queue is discarded. However,
after the transaction has been successfully submitted, the queued commands run in sequence. If any command fails,
only that command stops running. All previous and subsequent commands in the queue are performed. For more
information, go to the Transactions page on the Redis website.
Redis supports a limited number of atomic operations. The only operations of this type that support
multiple keys and values are MGET and MSET operations. MGET operations return a collection of values for
a specified list of keys, and MSET operations store a collection of values for a specified list of keys. If you
need to use these operations, the key-value pairs that are referenced by the MSET and MGET commands
must be stored within the same database.
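The following sketch, assuming the Python redis-py client and placeholder cache, key, and field names, shows the customer and order layout from Figure 8 together with the multi-key MSET and MGET operations described above:

import redis

# Host and access key are placeholders for an Azure Cache for Redis instance.
r = redis.Redis(host="<cache-name>.redis.cache.windows.net", port=6380,
                password="<access-key>", ssl=True)

# Customer details in a hash keyed with the entity_type:ID convention.
r.hset("customer:99", mapping={"name": "Contoso", "city": "Seattle"})

# The customer's order IDs in a set, and each order stored as its own hash.
r.sadd("customer:99:orders", "order:1001", "order:1002")
r.hset("order:1001", mapping={"total": "125.00", "status": "shipped"})
r.hset("order:1002", mapping={"total": "42.50", "status": "pending"})

# MSET and MGET are the only multi-key atomic operations; all of the keys
# involved must be stored in the same database (shard).
r.mset({"product:10:name": "widget", "product:20:name": "gadget"})
print(r.mget(["product:10:name", "product:20:name"]))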
Many cloud applications use asynchronous messages to exchange information between components of the system.
An important aspect of messaging is the format used to encode the payload data. After you choose a messaging
technology, the next step is to define how the messages will be encoded. There are many options available, but the
right choice depends on your use case.
This article describes some of the considerations.
Overview
A message exchange between a producer and a consumer needs:
A shape or structure that defines the payload of the message.
An encoding format to represent the payload.
Serialization libraries to read and write the encoded payload.
The producer of the message defines the message shape based on the business logic and the information it wants
to send to the consumer(s). To structure the shape, divide the information into discrete or related subjects (fields).
Decide the characteristics of the values for those fields. Consider: What is the most efficient datatype? Will the
payload always have certain fields? Will the payload have a single record or a repeated set of values?
Then, choose an encoding format depending on your need. Certain factors include the ability to create highly
structured data if you need it, time taken to encode and transfer the message, and the ability to parse the payload.
Depending on the encoding format, choose a serialization library that is well supported.
A consumer of the message must be aware of those decisions so that it knows how to read incoming messages.
To transfer messages, the producer serializes the message to an encoding format. At the receiving end, the
consumer deserializes the payload to use the data. This way, both entities share the model, and as long as the shape
doesn't change, messaging continues without issues. When the contract changes, the encoding format should be
capable of handling the change without breaking the consumer.
Some encoding formats such as JSON are self-describing, meaning they can be parsed without referencing a
schema. However, such formats tend to yield larger messages. With other formats, the data may not be parsed as
easily but the messages are compact. This article highlights some factors that can help you choose a format.
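As a minimal sketch, assuming JSON as the encoding format and an illustrative order payload, the producer and consumer sides might look like this:

import json

# Producer: define the message shape and serialize it to the agreed encoding.
order_message = {
    "orderId": "1001",
    "customerId": "99",
    "items": [{"sku": "widget", "quantity": 2}],
}
payload = json.dumps(order_message).encode("utf-8")  # bytes placed on the queue or topic

# Consumer: deserialize the payload and read the agreed-upon fields.
received = json.loads(payload.decode("utf-8"))
print(received["orderId"], len(received["items"]))

Because JSON is self-describing, the consumer can parse the payload without a separate schema, at the cost of a larger message than a compact binary format would produce.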
Next steps
Understand messaging design patterns for cloud applications.
Best practices for monitoring cloud applications
12/18/2020 • 68 minutes to read • Edit Online
Distributed applications and services running in the cloud are, by their nature, complex pieces of software that
comprise many moving parts. In a production environment, it's important to be able to track the way in which
users use your system, trace resource utilization, and generally monitor the health and performance of your
system. You can use this information as a diagnostic aid to detect and correct issues, and also to help spot
potential problems and prevent them from occurring.
NOTE
This list is not intended to be comprehensive. This document focuses on these scenarios as the most common situations for
performing monitoring. There might be others that are less common or are specific to your environment.
The following sections describe these scenarios in more detail. The information for each scenario is discussed in
the following format:
1. A brief overview of the scenario.
2. The typical requirements of this scenario.
3. The raw instrumentation data that's required to support the scenario, and possible sources of this information.
4. How this raw data can be analyzed and combined to generate meaningful diagnostic information.
Health monitoring
A system is healthy if it is running and capable of processing requests. The purpose of health monitoring is to
generate a snapshot of the current health of the system so that you can verify that all components of the system
are functioning as expected.
Requirements for health monitoring
An operator should be alerted quickly (within a matter of seconds) if any part of the system is deemed to be
unhealthy. The operator should be able to ascertain which parts of the system are functioning normally, and which
parts are experiencing problems. System health can be highlighted through a traffic-light system:
Red for unhealthy (the system has stopped)
Yellow for partially healthy (the system is running with reduced functionality)
Green for completely healthy
A comprehensive health-monitoring system enables an operator to drill down through the system to view the
health status of subsystems and components. For example, if the overall system is depicted as partially healthy,
the operator should be able to zoom in and determine which functionality is currently unavailable.
Data sources, instrumentation, and data-collection requirements
The raw data that's required to support health monitoring can be generated as a result of:
Tracing execution of user requests. This information can be used to determine which requests have succeeded,
which have failed, and how long each request takes.
Synthetic user monitoring. This process simulates the steps performed by a user and follows a predefined
series of steps. The results of each step should be captured.
Logging exceptions, faults, and warnings. This information can be captured as a result of trace statements
embedded into the application code, as well as retrieving information from the event logs of any services that
the system references.
Monitoring the health of any third-party services that the system uses. This monitoring might require
retrieving and parsing health data that these services supply. This information might take a variety of formats.
Endpoint monitoring. This mechanism is described in more detail in the "Availability monitoring" section.
Collecting ambient performance information, such as background CPU utilization or I/O (including network)
activity.
Analyzing health data
The primary focus of health monitoring is to quickly indicate whether the system is running. Hot analysis of the
immediate data can trigger an alert if a critical component is detected as unhealthy. (It fails to respond to a
consecutive series of pings, for example.) The operator can then take the appropriate corrective action.
A more advanced system might include a predictive element that performs a cold analysis over recent and current
workloads. A cold analysis can spot trends and determine whether the system is likely to remain healthy or
whether the system will need additional resources. This predictive element should be based on critical
performance metrics, such as:
The rate of requests directed at each service or subsystem.
The response times of these requests.
The volume of data flowing into and out of each service.
If the value of any metric exceeds a defined threshold, the system can raise an alert to enable an operator or
autoscaling (if available) to take the preventative actions necessary to maintain system health. These actions might
involve adding resources, restarting one or more services that are failing, or applying throttling to lower-priority
requests.
Availability monitoring
A truly healthy system requires that the components and subsystems that compose the system are available.
Availability monitoring is closely related to health monitoring. But whereas health monitoring provides an
immediate view of the current health of the system, availability monitoring is concerned with tracking the
availability of the system and its components to generate statistics about the uptime of the system.
In many systems, some components (such as a database) are configured with built-in redundancy to permit rapid
failover in the event of a serious fault or loss of connectivity. Ideally, users should not be aware that such a failure
has occurred. But from an availability monitoring perspective, it's necessary to gather as much information as
possible about such failures to determine the cause and take corrective actions to prevent them from recurring.
The data that's required to track availability might depend on a number of lower-level factors. Many of these
factors might be specific to the application, system, and environment. An effective monitoring system captures the
availability data that corresponds to these low-level factors and then aggregates them to give an overall picture of
the system. For example, in an e-commerce system, the business functionality that enables a customer to place
orders might depend on the repository where order details are stored and the payment system that handles the
monetary transactions for paying for these orders. The availability of the order-placement part of the system is
therefore a function of the availability of the repository and the payment subsystem.
Requirements for availability monitoring
An operator should also be able to view the historical availability of each system and subsystem, and use this
information to spot any trends that might cause one or more subsystems to periodically fail. (Do services start to
fail at a particular time of day that corresponds to peak processing hours?)
A monitoring solution should provide an immediate and historical view of the availability or unavailability of each
subsystem. It should also be capable of quickly alerting an operator when one or more services fail or when users
can't connect to services. This is a matter of not only monitoring each service, but also examining the actions that
each user performs if these actions fail when they attempt to communicate with a service. To some extent, a
degree of connectivity failure is normal and might be due to transient errors. But it might be useful to allow the
system to raise an alert for the number of connectivity failures to a specified subsystem that occur during a
specific period.
Data sources, instrumentation, and data-collection requirements
As with health monitoring, the raw data that's required to support availability monitoring can be generated as a
result of synthetic user monitoring and logging any exceptions, faults, and warnings that might occur. In addition,
availability data can be obtained from performing endpoint monitoring. The application can expose one or more
health endpoints, each testing access to a functional area within the system. The monitoring system can ping each
endpoint by following a defined schedule and collect the results (success or fail).
All timeouts, network connectivity failures, and connection retry attempts must be recorded. All data should be
timestamped.
Analyzing availability data
The instrumentation data must be aggregated and correlated to support the following types of analysis:
The immediate availability of the system and subsystems.
The availability failure rates of the system and subsystems. Ideally, an operator should be able to correlate
failures with specific activities: what was happening when the system failed?
A historical view of failure rates of the system or any subsystems across any specified period, and the load on
the system (number of user requests, for example) when a failure occurred.
The reasons for unavailability of the system or any subsystems. For example, the reasons might be service not
running, connectivity lost, connected but timing out, and connected but returning errors.
You can calculate the percentage availability of a service over a period of time by using the following formula:
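One common form of this calculation is:
percentage availability = ((total time - total downtime) / total time) x 100
For example (using illustrative figures), a service that is down for 43 minutes in a 30-day month of 43,200 minutes has an availability of ((43,200 - 43) / 43,200) x 100, or roughly 99.9 percent.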
This is useful for SLA purposes. (SLA monitoring is described in more detail later in this guidance.) The definition
of downtime depends on the service. For example, Visual Studio Team Services Build Service defines downtime as
the period (total accumulated minutes) during which Build Service is unavailable. A minute is considered
unavailable if all continuous HTTP requests to Build Service to perform customer-initiated operations throughout
the minute either result in an error code or do not return a response.
Performance monitoring
As the system is placed under more and more stress (by increasing the volume of users), the size of the datasets
that these users access grows and the possibility of failure of one or more components becomes more likely.
Frequently, component failure is preceded by a decrease in performance. If you're able to detect such a decrease, you
can take proactive steps to remedy the situation.
System performance depends on a number of factors. Each factor is typically measured through key performance
indicators (KPIs), such as the number of database transactions per second or the volume of network requests that
are successfully serviced in a specified time frame. Some of these KPIs might be available as specific performance
measures, whereas others might be derived from a combination of metrics.
NOTE
Determining poor or good performance requires that you understand the level of performance at which the system should
be capable of running. This requires observing the system while it's functioning under a typical load and capturing the data
for each KPI over a period of time. This might involve running the system under a simulated load in a test environment and
gathering the appropriate data before deploying the system to a production environment.
You should also ensure that monitoring for performance purposes does not become a burden on the system. You might be
able to dynamically adjust the level of detail for the data that the performance monitoring process gathers.
Security monitoring
All commercial systems that include sensitive data must implement a security structure. The complexity of the
security mechanism is usually a function of the sensitivity of the data. In a system that requires users to be
authenticated, you should record:
All sign-in attempts, whether they fail or succeed.
All operations performed by—and the details of all resources accessed by—an authenticated user.
When a user ends a session and signs out.
Monitoring might be able to help detect attacks on the system. For example, a large number of failed sign-in
attempts might indicate a brute-force attack. An unexpected surge in requests might be the result of a distributed
denial-of-service (DDoS) attack. You must be prepared to monitor all requests to all resources regardless of the
source of these requests. A system that has a sign-in vulnerability might accidentally expose resources to the
outside world without requiring a user to actually sign in.
Requirements for security monitoring
The most critical aspects of security monitoring should enable an operator to quickly:
Detect attempted intrusions by an unauthenticated entity.
Identify attempts by entities to perform operations on data for which they have not been granted access.
Determine whether the system, or some part of the system, is under attack from outside or inside. (For
example, a malicious authenticated user might be attempting to bring the system down.)
To support these requirements, an operator should be notified if:
One account makes repeated failed sign-in attempts within a specified period.
One authenticated account repeatedly tries to access a prohibited resource during a specified period.
A large number of unauthenticated or unauthorized requests occur during a specified period.
The information that's provided to an operator should include the host address of the source for each request. If
security violations regularly arise from a particular range of addresses, these hosts might be blocked.
A key part in maintaining the security of a system is being able to quickly detect actions that deviate from the
usual pattern. Information such as the number of failed and/or successful sign-in requests can be displayed
visually to help detect whether there is a spike in activity at an unusual time. (An example of this activity is users
signing in at 3:00 AM and performing a large number of operations when their working day starts at 9:00 AM).
This information can also be used to help configure time-based autoscaling. For example, if an operator observes
that a large number of users regularly sign in at a particular time of day, the operator can arrange to start
additional authentication services to handle the volume of work, and then shut down these additional services
when the peak has passed.
Data sources, instrumentation, and data-collection requirements
Security is an all-encompassing aspect of most distributed systems. The pertinent data is likely to be generated at
multiple points throughout a system. You should consider adopting a Security Information and Event
Management (SIEM) approach to gather the security-related information that results from events raised by the
application, network equipment, servers, firewalls, antivirus software, and other intrusion-prevention elements.
Security monitoring can incorporate data from tools that are not part of your application. These tools can include
utilities that identify port-scanning activities by external agencies, or network filters that detect attempts to gain
unauthenticated access to your application and data.
In all cases, the gathered data must enable an administrator to determine the nature of any attack and take the
appropriate countermeasures.
Analyzing security data
A feature of security monitoring is the variety of sources from which the data arises. The different formats and
level of detail often require complex analysis of the captured data to tie it together into a coherent thread of
information. Apart from the simplest of cases (such as detecting a large number of failed sign-ins, or repeated
attempts to gain unauthorized access to critical resources), it might not be possible to perform any complex
automated processing of security data. Instead, it might be preferable to write this data, timestamped but
otherwise in its original form, to a secure repository to allow for expert manual analysis.
SLA monitoring
Many commercial systems that support paying customers make guarantees about the performance of the system
in the form of SLAs. Essentially, SLAs state that the system can handle a defined volume of work within an agreed
time frame and without losing critical information. SLA monitoring is concerned with ensuring that the system can
meet measurable SLAs.
NOTE
SLA monitoring is closely related to performance monitoring. But whereas performance monitoring is concerned with
ensuring that the system functions optimally, SLA monitoring is governed by a contractual obligation that defines what
optimally actually means.
NOTE
Some contracts for commercial systems might also include SLAs for customer support. An example is that all help-desk
requests will elicit a response within five minutes, and that 99 percent of all problems will be fully addressed within 1
working day. Effective issue tracking (described later in this section) is key to meeting SLAs such as these.
NOTE
System uptime needs to be defined carefully. In a system that uses redundancy to ensure maximum availability, individual
instances of elements might fail, but the system can remain functional. System uptime as presented by health monitoring
should indicate the aggregate uptime of each element and not necessarily whether the system has actually halted.
Additionally, failures might be isolated. So even if a specific system is unavailable, the remainder of the system might remain
available, although with decreased functionality. (In an e-commerce system, a failure in the system might prevent a
customer from placing orders, but the customer might still be able to browse the product catalog.)
For alerting purposes, the system should be able to raise an event if any of the high-level indicators exceed a
specified threshold. The lower-level details of the various factors that compose the high-level indicator should be
available as contextual data to the alerting system.
Data sources, instrumentation, and data-collection requirements
The raw data that's required to support SLA monitoring is similar to the raw data that's required for performance
monitoring, together with some aspects of health and availability monitoring. (See those sections for more
details.) You can capture this data by:
Performing endpoint monitoring.
Logging exceptions, faults, and warnings.
Tracing the execution of user requests.
Monitoring the availability of any third-party services that the system uses.
Using performance metrics and counters.
All data must be timed and timestamped.
Analyzing SLA data
The instrumentation data must be aggregated to generate a picture of the overall performance of the system.
Aggregated data must also support drill-down to enable examination of the performance of the underlying
subsystems. For example, you should be able to:
Calculate the total number of user requests during a specified period and determine the success and failure
rate of these requests.
Combine the response times of user requests to generate an overall view of system response times.
Analyze the progress of user requests to break down the overall response time of a request into the response
times of the individual work items in that request.
Determine the overall availability of the system as a percentage of uptime for any specific period.
Analyze the percentage time availability of the individual components and services in the system. This might
involve parsing logs that third-party services have generated.
Many commercial systems are required to report real performance figures against agreed SLAs for a specified
period, typically a month. This information can be used to calculate credits or other forms of repayments for
customers if the SLAs are not met during that period. You can calculate availability for a service by using the
technique described in the section Analyzing availability data.
For internal purposes, an organization might also track the number and nature of incidents that caused services to
fail. Learning how to resolve these issues quickly, or eliminate them completely, will help to reduce downtime and
meet SLAs.
Auditing
Depending on the nature of the application, there might be statutory or other legal regulations that specify
requirements for auditing users' operations and recording all data access. Auditing can provide evidence that links
customers to specific requests. Nonrepudiation is an important factor in many e-business systems to help
maintain trust between a customer and the organization that's responsible for the application or service.
Requirements for auditing
An analyst must be able to trace the sequence of business operations that users are performing so that you can
reconstruct users' actions. This might be necessary simply as a matter of record, or as part of a forensic
investigation.
Audit information is highly sensitive. It will likely include data that identifies the users of the system, together with
the tasks that they're performing. For this reason, audit information will most likely take the form of reports that
are available only to trusted analysts rather than as an interactive system that supports drill-down of graphical
operations. An analyst should be able to generate a range of reports. For example, reports might list all users'
activities occurring during a specified time frame, detail the chronology of activity for a single user, or list the
sequence of operations performed against one or more resources.
Data sources, instrumentation, and data-collection requirements
The primary sources of information for auditing can include:
The security system that manages user authentication.
Trace logs that record user activity.
Security logs that track all identifiable and unidentifiable network requests.
The format of the audit data and the way in which it's stored might be driven by regulatory requirements. For
example, it might not be possible to clean the data in any way. (It must be recorded in its original format.) Access
to the repository where it's held must be protected to prevent tampering.
Analyzing audit data
An analyst must be able to access the raw data in its entirety, in its original form. Aside from the requirement to
generate common audit reports, the tools for analyzing this data are likely to be specialized and kept external to
the system.
Usage monitoring
Usage monitoring tracks how the features and components of an application are used. An operator can use the
gathered data to:
Determine which features are heavily used and determine any potential hotspots in the system. High-traffic
elements might benefit from functional partitioning or even replication to spread the load more evenly. An
operator can also use this information to ascertain which features are infrequently used and are possible
candidates for retirement or replacement in a future version of the system.
Obtain information about the operational events of the system under normal use. For example, in an e-
commerce site, you can record the statistical information about the number of transactions and the volume
of customers that are responsible for them. This information can be used for capacity planning as the
number of customers grows.
Detect (possibly indirectly) user satisfaction with the performance or functionality of the system. For
example, if a large number of customers in an e-commerce system regularly abandon their shopping carts,
this might be due to a problem with the checkout functionality.
Generate billing information. A commercial application or multitenant service might charge customers for
the resources that they use.
Enforce quotas. If a user in a multitenant system exceeds their paid quota of processing time or resource
usage during a specified period, their access can be limited or processing can be throttled.
Requirements for usage monitoring
To examine system usage, an operator typically needs to see information that includes:
The number of requests that are processed by each subsystem and directed to each resource.
The work that each user is performing.
The volume of data storage that each user occupies.
The resources that each user is accessing.
An operator should also be able to generate graphs. For example, a graph might display the most resource-
hungry users, or the most frequently accessed resources or system features.
Data sources, instrumentation, and data-collection requirements
Usage tracking can be performed at a relatively high level. It can note the start and end times of each request and
the nature of the request (read, write, and so on, depending on the resource in question). You can obtain this
information by:
Tracing user activity.
Capturing performance counters that measure the utilization for each resource.
Monitoring the resource consumption by each user.
For metering purposes, you also need to be able to identify which users are responsible for performing which
operations, and the resources that these operations use. The gathered information should be detailed enough to
enable accurate billing.
Issue tracking
Customers and other users might report issues if unexpected events or behavior occurs in the system. Issue
tracking is concerned with managing these issues, associating them with efforts to resolve any underlying
problems in the system, and informing customers of possible resolutions.
Requirements for issue tracking
Operators often perform issue tracking by using a separate system that enables them to record and report the
details of problems that users report. These details can include the tasks that the user was trying to perform,
symptoms of the problem, the sequence of events, and any error or warning messages that were issued.
Data sources, instrumentation, and data-collection requirements
The initial data source for issue-tracking data is the user who reported the issue in the first place. The user might
be able to provide additional data such as:
A crash dump (if the application includes a component that runs on the user's desktop).
A screen snapshot.
The date and time when the error occurred, together with any other environmental information such as the
user's location.
This information can be used to help the debugging effort and help construct a backlog for future releases of the
software.
Analyzing issue-tracking data
Different users might report the same problem. The issue-tracking system should associate common reports.
The progress of the debugging effort should be recorded against each issue report. When the problem is resolved,
the customer can be informed of the solution.
If a user reports an issue that has a known solution in the issue-tracking system, the operator should be able to
inform the user of the solution immediately.
NOTE
Many modern frameworks automatically publish performance and trace events. Capturing this information is simply a
matter of providing a means to retrieve and store it where it can be processed and analyzed.
The operating system where the application is running can be a source of low-level system-wide information, such
as performance counters that indicate I/O rates, memory utilization, and CPU usage. Operating system errors
(such as the failure to open a file correctly) might also be reported.
You should also consider the underlying infrastructure and components on which your system runs. Virtual
machines, virtual networks, and storage services can all be sources of important infrastructure-level performance
counters and other diagnostic data.
If your application uses other external services, such as a web server or database management system, these
services might publish their own trace information, logs, and performance counters. Examples include SQL Server
Dynamic Management Views for tracking operations performed against a SQL Server database, and IIS trace logs
for recording requests made to a web server.
As the components of a system are modified and new versions are deployed, it's important to be able to attribute
issues, events, and metrics to each version. This information should be tied back to the release pipeline so that
problems with a specific version of a component can be tracked quickly and rectified.
Security issues might occur at any point in the system. For example, a user might attempt to sign in with an invalid
user ID or password. An authenticated user might try to obtain unauthorized access to a resource. Or a user might
provide an invalid or outdated key to access encrypted information. Security-related information for successful
and failing requests should always be logged.
The section Instrumenting an application contains more guidance on the information that you should capture. But
you can use a variety of strategies to gather this information:
Application/system monitoring. This strategy uses internal sources within the application, application
frameworks, operating system, and infrastructure. The application code can generate its own monitoring
data at notable points during the lifecycle of a client request. The application can include tracing statements
that might be selectively enabled or disabled as circumstances dictate. It might also be possible to inject
diagnostics dynamically by using a diagnostics framework. These frameworks typically provide plug-ins
that can attach to various instrumentation points in your code and capture trace data at these points.
Additionally, your code and/or the underlying infrastructure might raise events at critical points. Monitoring
agents that are configured to listen for these events can record the event information.
Real user monitoring. This approach records the interactions between a user and the application and
observes the flow of each request and response. This information can have a two-fold purpose: it can be
used for metering usage by each user, and it can be used to determine whether users are receiving a
suitable quality of service (for example, fast response times, low latency, and minimal errors). You can use
the captured data to identify areas of concern where failures occur most often. You can also use the data to
identify elements where the system slows down, possibly due to hotspots in the application or some other
form of bottleneck. If you implement this approach carefully, it might be possible to reconstruct users'
flows through the application for debugging and testing purposes.
IMPORTANT
You should consider the data that's captured by monitoring real users to be highly sensitive because it might include
confidential material. If you save captured data, store it securely. If you want to use the data for performance
monitoring or debugging purposes, strip out all personally identifiable information first.
Synthetic user monitoring. In this approach, you write your own test client that simulates a user and
performs a configurable but typical series of operations. You can track the performance of the test client to
help determine the state of the system. You can also use multiple instances of the test client as part of a
load-testing operation to establish how the system responds under stress, and what sort of monitoring
output is generated under these conditions.
NOTE
You can implement real and synthetic user monitoring by including code that traces and times the execution of
method calls and other critical parts of an application.
Profiling. This approach is primarily targeted at monitoring and improving application performance.
Rather than operating at the functional level of real and synthetic user monitoring, it captures lower-level
information as the application runs. You can implement profiling by using periodic sampling of the
execution state of an application (determining which piece of code that the application is running at a given
point in time). You can also use instrumentation that inserts probes into the code at important junctures
(such as the start and end of a method call) and records which methods were invoked, at what time, and
how long each call took. You can then analyze this data to determine which parts of the application might
cause performance problems.
Endpoint monitoring. This technique uses one or more diagnostic endpoints that the application exposes
specifically to enable monitoring. An endpoint provides a pathway into the application code and can return
information about the health of the system. Different endpoints can focus on various aspects of the
functionality. You can write your own diagnostics client that sends periodic requests to these endpoints and
assimilate the responses (a minimal endpoint sketch appears after this list). For more information, see the Health Endpoint Monitoring pattern.
For maximum coverage, you should use a combination of these techniques.
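As a minimal sketch of the endpoint monitoring technique, assuming a Flask-based service with an illustrative endpoint path and dependency check, a diagnostic health endpoint might look like this:

from flask import Flask, jsonify

app = Flask(__name__)

def dependencies_are_healthy() -> bool:
    # Placeholder for lightweight checks against critical dependencies,
    # such as a database ping or a storage probe.
    return True

@app.route("/health")
def health():
    healthy = dependencies_are_healthy()
    body = {"status": "healthy" if healthy else "unhealthy"}
    return jsonify(body), (200 if healthy else 503)

if __name__ == "__main__":
    app.run(port=8080)

A diagnostics client can then poll this endpoint on a schedule and record the status code and response time of each probe.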
Instrumenting an application
Instrumentation is a critical part of the monitoring process. You can make meaningful decisions about the
performance and health of a system only if you first capture the data that enables you to make these decisions.
The information that you gather by using instrumentation should be sufficient to enable you to assess
performance, diagnose problems, and make decisions without requiring you to sign in to a remote production
server to perform tracing (and debugging) manually. Instrumentation data typically comprises metrics and
information that's written to trace logs.
The contents of a trace log can be the result of textual data that's written by the application or binary data that's
created as the result of a trace event (if the application is using Event Tracing for Windows, or ETW). They can also be
generated from system logs that record events arising from parts of the infrastructure, such as a web server.
Textual log messages are often designed to be human-readable, but they should also be written in a format that
enables an automated system to parse them easily.
You should also categorize logs. Don't write all trace data to a single log, but use separate logs to record the trace
output from different operational aspects of the system. You can then quickly filter log messages by reading from
the appropriate log rather than having to process a single lengthy file. Never write information that has different
security requirements (such as audit information and debugging data) to the same log.
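As a minimal sketch using Python's standard logging module, with illustrative logger names and file paths, trace output for different operational aspects can be routed to separate logs:

import logging

def make_logger(name: str, path: str) -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.FileHandler(path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

# Audit data and debugging data go to separate logs with separate handlers,
# so they can be filtered and secured independently.
audit_log = make_logger("audit", "audit.log")
trace_log = make_logger("trace", "trace.log")

audit_log.info("user 42 exported report 7")
trace_log.info("cache miss for key customer:42")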
NOTE
A log might be implemented as a file on the file system, or it might be held in some other format, such as a blob in blob
storage. Log information might also be held in more structured storage, such as rows in a table.
Metrics will generally be a measure or count of some aspect or resource in the system at a specific time, with one
or more associated tags or dimensions (sometimes called a sample). A single instance of a metric is usually not
useful in isolation. Instead, metrics have to be captured over time. The key issue to consider is which metrics you
should record and how frequently. Generating data for metrics too often can impose a significant additional load
on the system, whereas capturing metrics infrequently might cause you to miss the circumstances that lead to a
significant event. The considerations will vary from metric to metric. For example, CPU utilization on a server
might vary significantly from second to second, but high utilization becomes an issue only if it's long-lived over a
number of minutes.
Information for correlating data
You can easily monitor individual system-level performance counters, capture metrics for resources, and obtain
application trace information from various log files. But some forms of monitoring require the analysis and
diagnostics stage in the monitoring pipeline to correlate the data that's retrieved from several sources. This data
might take several forms in the raw data, and the analysis process must be provided with sufficient
instrumentation data to be able to map these different forms. For example, at the application framework level, a
task might be identified by a thread ID. Within an application, the same work might be associated with the user ID
for the user who is performing that task.
Also, there's unlikely to be a 1:1 mapping between threads and user requests, because asynchronous operations
might reuse the same threads to perform operations on behalf of more than one user. To complicate matters
further, a single request might be handled by more than one thread as execution flows through the system. If
possible, associate each request with a unique activity ID that's propagated through the system as part of the
request context. (The technique for generating and including activity IDs in trace information depends on the
technology that's used to capture the trace data.)
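As a minimal sketch, assuming a Python service and an illustrative log format, an activity ID can be generated per request and attached to every trace record so that data from different components can be correlated later:

import logging
import uuid

logging.basicConfig(format="%(asctime)s activity=%(activity_id)s %(message)s", level=logging.INFO)
logger = logging.getLogger("requests")

def handle_request(payload):
    # One activity ID per request, propagated (for example, as a header or
    # message property) to every downstream call made on its behalf.
    activity_id = str(uuid.uuid4())
    extra = {"activity_id": activity_id}
    logger.info("request received", extra=extra)
    # ... perform the work, passing activity_id to downstream components ...
    logger.info("request completed", extra=extra)
    return activity_id

handle_request({"orderId": "1001"})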
All monitoring data should be timestamped in the same way. For consistency, record all dates and times by using
Coordinated Universal Time. This will help you more easily trace sequences of events.
NOTE
Computers operating in different time zones and networks might not be synchronized. Don't depend on using timestamps
alone for correlating instrumentation data that spans multiple machines.
Figure 3 - Using a monitoring agent to pull information and write to shared storage.
NOTE
Using a monitoring agent is ideally suited to capturing instrumentation data that's naturally pulled from a data source. An
example is information from SQL Server Dynamic Management Views or the length of an Azure Service Bus queue.
It's feasible to use the approach just described to store telemetry data for a small-scale application running on a
limited number of nodes in a single location. However, a complex, highly scalable, global cloud application might
generate huge volumes of data from hundreds of web and worker roles, database shards, and other services. This
flood of data can easily overwhelm the I/O bandwidth available with a single, central location. Therefore, your
telemetry solution must be scalable to prevent it from acting as a bottleneck as the system expands. Ideally, your
solution should incorporate a degree of redundancy to reduce the risks of losing important monitoring
information (such as auditing or billing data) if part of the system fails.
To address these issues, you can implement queuing, as shown in Figure 4. In this architecture, the local
monitoring agent (if it can be configured appropriately) or custom data-collection service (if not) posts data to a
queue. A separate process running asynchronously (the storage writing service in Figure 4) takes the data in this
queue and writes it to shared storage. A message queue is suitable for this scenario because it provides "at least
once" semantics that help ensure that queued data will not be lost after it's posted. You can implement the storage
writing service by using a separate worker role.
NOTE
You should restrict access to dashboards to authorized personnel, because this information might be commercially sensitive.
You should also protect the underlying data for dashboards to prevent users from changing it.
Raising alerts
Alerting is the process of analyzing the monitoring and instrumentation data and generating a notification if a
significant event is detected.
Alerting helps ensure that the system remains healthy, responsive, and secure. It's an important part of any
system that makes performance, availability, and privacy guarantees to the users where the data might need to be
acted on immediately. An operator might need to be notified of the event that triggered the alert. Alerting can also
be used to invoke system functions such as autoscaling.
Alerting usually depends on the following instrumentation data:
Security events. If the event logs indicate that repeated authentication and/or authorization failures are
occurring, the system might be under attack and an operator should be informed.
Performance metrics. The system must quickly respond if a particular performance metric exceeds a
specified threshold.
Availability information. If a fault is detected, it might be necessary to quickly restart one or more
subsystems, or fail over to a backup resource. Repeated faults in a subsystem might indicate more serious
concerns.
Operators might receive alert information by using many delivery channels such as email, a pager device, or an
SMS text message. An alert might also include an indication of how critical a situation is. Many alerting systems
support subscriber groups, and all operators who are members of the same group can receive the same set of
alerts.
An alerting system should be customizable, and the appropriate values from the underlying instrumentation data
can be provided as parameters. This approach enables an operator to filter data and focus on those thresholds or
combinations of values that are of interest. Note that in some cases, the raw instrumentation data can be provided
to the alerting system. In other situations, it might be more appropriate to supply aggregated data. (For example,
an alert can be triggered if the CPU utilization for a node has exceeded 90 percent over the last 10 minutes). The
details provided to the alerting system should also include any appropriate summary and context information.
This data can help reduce the possibility that false-positive events will trip an alert.
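To make the aggregation point concrete, the following hypothetical sketch (the class name and threshold values are illustrative) raises an alert only when the average CPU utilization over the last 10 minutes exceeds 90 percent, rather than reacting to any single raw sample:
using System;
using System.Collections.Generic;
using System.Linq;

public class CpuAlertRule
{
    private readonly List<(DateTime Timestamp, double CpuPercent)> samples =
        new List<(DateTime Timestamp, double CpuPercent)>();

    public void AddSample(double cpuPercent) =>
        samples.Add((DateTime.UtcNow, cpuPercent));

    // Alert on the aggregated value (the 10-minute average), not on individual samples,
    // to avoid false positives from short-lived spikes.
    public bool ShouldAlert()
    {
        DateTime windowStart = DateTime.UtcNow.AddMinutes(-10);
        var recent = samples.Where(s => s.Timestamp >= windowStart).ToList();
        return recent.Count > 0 && recent.Average(s => s.CpuPercent) > 90.0;
    }
}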
Reporting
Reporting is used to generate an overall view of the system. It might incorporate historical data in addition to
current information. Reporting requirements themselves fall into two broad categories: operational reporting and
security reporting.
Operational reporting typically includes the following aspects:
Aggregating statistics that you can use to understand resource utilization of the overall system or specified
subsystems during a specified time window.
Identifying trends in resource usage for the overall system or specified subsystems during a specified period.
Monitoring the exceptions that have occurred throughout the system or in specified subsystems during a
specified period.
Determining the efficiency of the application in terms of the deployed resources, and understanding whether
the volume of resources (and their associated cost) can be reduced without affecting performance
unnecessarily.
Security reporting is concerned with tracking customers' use of the system. It can include:
Auditing user operations. This requires recording the individual requests that each user performs, together
with dates and times. The data should be structured to enable an administrator to quickly reconstruct the
sequence of operations that a user performs over a specified period.
Tracking resource use by user. This requires recording how each request for a user accesses the various
resources that compose the system, and for how long. An administrator must be able to use this data to
generate a utilization report by user over a specified period, possibly for billing purposes.
In many cases, batch processes can generate reports according to a defined schedule. (Latency is not normally an
issue.) But they should also be available for generation on an ad hoc basis if needed. As an example, if you are
storing data in a relational database such as Azure SQL Database, you can use a tool such as SQL Server
Reporting Services to extract and format data and present it as a set of reports.
Next steps
Azure Monitor overview
Monitor, diagnose, and troubleshoot Microsoft Azure Storage
Overview of alerts in Microsoft Azure
View service health notifications by using the Azure portal
What is Application Insights?
Performance diagnostics for Azure virtual machines
Download and install SQL Server Data Tools (SSDT) for Visual Studio
Retry guidance for Azure services
12/18/2020 • 41 minutes to read • Edit Online
Most Azure services and client SDKs include a retry mechanism. However, these differ because each service has
different characteristics and requirements, and so each retry mechanism is tuned to a specific service. This guide
summarizes the retry mechanism features for the majority of Azure services, and includes information to help you
use, adapt, or extend the retry mechanism for that service.
For general guidance on handling transient faults, and retrying connections and operations against services and
resources, see Retry guidance.
The following table summarizes the retry features for the Azure services described in this guidance.
(Table columns: Service, Retry capabilities, Policy configuration, Scope, Telemetry features)
Azure Active Directory
NOTE
For retry guidance on Managed Service Identity endpoints, see How to use an Azure VM Managed Service Identity (MSI) for
token acquisition.
Retry mechanism
There is a built-in retry mechanism for Azure Active Directory in the Active Directory Authentication Library
(ADAL). To avoid unexpected lockouts, we recommend that third-party libraries and application code do not retry
failed connections, but allow ADAL to handle retries.
Retry usage guidance
Consider the following guidelines when using Azure Active Directory:
When possible, use the ADAL library and the built-in support for retries.
If you are using the REST API for Azure Active Directory, retry the operation if the result code is 429 (Too Many
Requests) or an error in the 5xx range. Do not retry for any other errors.
For 429 errors, only retry after the time indicated in the Retry-After header.
For 5xx errors, use exponential back-off, with the first retry at least 5 seconds after the response (a sketch illustrating both rules follows this list).
Do not retry on errors other than 429 and 5xx.
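The following hypothetical helper (the method and parameter names are illustrative and not part of any Azure SDK) sketches how these rules could be applied when calling the REST API directly with HttpClient: honor Retry-After for 429 responses, use exponential back-off starting at 5 seconds for 5xx responses, and return immediately for anything else.
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class AadRestRetryHelper
{
    public static async Task<HttpResponseMessage> SendWithRetryAsync(
        HttpClient client, Func<HttpRequestMessage> createRequest, int maxAttempts = 4)
    {
        for (int attempt = 1; ; attempt++)
        {
            // A request message cannot be reused, so build a fresh one for each attempt.
            HttpResponseMessage response = await client.SendAsync(createRequest());

            bool tooManyRequests = (int)response.StatusCode == 429;
            bool serverError = (int)response.StatusCode >= 500 && (int)response.StatusCode <= 599;

            // Do not retry other errors, and stop once the attempt budget is used up.
            if ((!tooManyRequests && !serverError) || attempt >= maxAttempts)
            {
                return response;
            }

            TimeSpan delay;
            if (tooManyRequests && response.Headers.RetryAfter?.Delta != null)
            {
                // 429: wait for the period indicated by the Retry-After header.
                delay = response.Headers.RetryAfter.Delta.Value;
            }
            else
            {
                // 5xx: exponential back-off starting at 5 seconds (5 s, 10 s, 20 s, ...).
                delay = TimeSpan.FromSeconds(5 * Math.Pow(2, attempt - 1));
            }

            await Task.Delay(delay);
        }
    }
}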
More information
Azure Active Directory Authentication Libraries
Cosmos DB
Cosmos DB is a fully managed multi-model database that supports schemaless JSON data. It offers configurable
and reliable performance, native JavaScript transactional processing, and is built for the cloud with elastic scale.
Retry mechanism
The DocumentClient class automatically retries failed attempts. To set the number of retries and the maximum wait
time, configure ConnectionPolicy.RetryOptions. Exceptions that the client raises are either beyond the retry policy
or are not transient errors.
If Cosmos DB throttles the client, it returns an HTTP 429 error. Check the status code in the
DocumentClientException.
Policy configuration
The following table shows the default settings for the RetryOptions class.
(Table columns: Setting, Default value, Description)
Example
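A minimal sketch of such an example, assuming the Microsoft.Azure.DocumentDB (v2) SDK and placeholder endpoint and key values:
using System;
using Microsoft.Azure.Documents.Client;

// Configure how the DocumentClient retries throttled (HTTP 429) requests.
ConnectionPolicy connectionPolicy = new ConnectionPolicy
{
    RetryOptions = new RetryOptions
    {
        MaxRetryAttemptsOnThrottledRequests = 3,  // number of retries on 429 responses
        MaxRetryWaitTimeInSeconds = 60            // total time allowed across all retries
    }
};

DocumentClient client = new DocumentClient(
    new Uri("https://<your-account>.documents.azure.com:443/"),
    "<your-account-key>",
    connectionPolicy);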
Telemetry
Retry attempts are logged as unstructured trace messages through a .NET TraceSource. You must configure a
TraceListener to capture the events and write them to a suitable destination log.
For example, if you add the following to your App.config file, traces will be generated in a text file in the same
location as the executable:
<configuration>
<system.diagnostics>
<switches>
<add name="SourceSwitch" value="Verbose"/>
</switches>
<sources>
<source name="DocDBTrace" switchName="SourceSwitch" switchType="System.Diagnostics.SourceSwitch" >
<listeners>
<add name="MyTextListener" type="System.Diagnostics.TextWriterTraceListener"
traceOutputOptions="DateTime,ProcessId,ThreadId" initializeData="CosmosDBTrace.txt"></add>
</listeners>
</source>
</sources>
</system.diagnostics>
</configuration>
Event Hubs
Azure Event Hubs is a hyperscale telemetry ingestion service that collects, transforms, and stores millions of
events.
Retry mechanism
Retry behavior in the Azure Event Hubs Client Library is controlled by the RetryPolicy property on the
EventHubClient class. The default policy retries with exponential backoff when Azure Event Hubs returns a transient
EventHubsException or an OperationCanceledException. The default retry policy for Event Hubs is to retry up to 9 times
with an exponential back-off time of up to 30 seconds.
Example
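A minimal sketch (the connection string is a placeholder) showing how the policy is assigned on the client, either keeping the default or supplying a custom RetryExponential policy:
using System;
using Microsoft.Azure.EventHubs;

var client = EventHubClient.CreateFromConnectionString("<event-hubs-connection-string>");

// Keep the default exponential back-off policy...
client.RetryPolicy = RetryPolicy.Default;

// ...or replace it with a custom exponential policy:
// minimum back-off, maximum back-off, maximum number of retries.
client.RetryPolicy = new RetryExponential(
    TimeSpan.FromSeconds(1),
    TimeSpan.FromSeconds(30),
    5);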
More information
.NET Standard client library for Azure Event Hubs
IoT Hub
Azure IoT Hub is a service for connecting, monitoring, and managing devices to develop Internet of Things (IoT)
applications.
Retry mechanism
The Azure IoT device SDK can detect errors in the network, protocol, or application. Based on the error type, the
SDK checks whether a retry needs to be performed. If the error is recoverable, the SDK begins to retry using the
configured retry policy.
The default retry policy is exponential back-off with random jitter, but it can be configured.
Policy configuration
Policy configuration differs by language. For more details, see IoT Hub retry policy configuration.
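For example, with the C# device SDK (Microsoft.Azure.Devices.Client) the policy can be replaced through SetRetryPolicy. The argument order shown for ExponentialBackoff (retry count, minimum back-off, maximum back-off, delta back-off) reflects that SDK, and the connection string is a placeholder:
using System;
using Microsoft.Azure.Devices.Client;

DeviceClient deviceClient = DeviceClient.CreateFromConnectionString(
    "<device-connection-string>", TransportType.Mqtt);

// Retry up to 10 times, backing off exponentially between 1 second and 1 minute,
// with a small delta used to randomize the computed delays.
deviceClient.SetRetryPolicy(new ExponentialBackoff(
    10,
    TimeSpan.FromSeconds(1),
    TimeSpan.FromMinutes(1),
    TimeSpan.FromMilliseconds(100)));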
More information
IoT Hub retry policy
Troubleshoot IoT Hub device disconnection
Azure Cache for Redis
Alternatively, you can specify the options as a string, and pass this to the Connect method. The
ReconnectRetryPolicy property cannot be set this way, only through code.
You can also specify options directly when you connect to the cache.
For more information, see Stack Exchange Redis Configuration in the StackExchange.Redis documentation.
The following table shows the default settings for the built-in retry policy.
(Table columns: Context, Setting, Default value (v 1.2.2), Meaning)
NOTE
For synchronous operations, SyncTimeout can add to the end-to-end latency, but setting the value too low can cause
excessive timeouts. See How to troubleshoot Azure Cache for Redis. In general, avoid using synchronous operations, and use
asynchronous operations instead. For more information, see Pipelines and Multiplexers.
localhost:6379,connectTimeout=2000,connectRetry=3
1 unique nodes specified
Requesting tie-break from localhost:6379 > __Booksleeve_TieBreak...
Allowing endpoints 00:00:02 to respond...
localhost:6379 faulted: SocketFailure on PING
localhost:6379 failed to nominate (Faulted)
> UnableToResolvePhysicalConnection on GET
No masters detected
localhost:6379: Standalone v2.0.0, master; keep-alive: 00:01:00; int: Connecting; sub: Connecting; not in use:
DidNotRespond
localhost:6379: int ops=0, qu=0, qs=0, qc=1, wr=0, sync=1, socks=2; sub ops=0, qu=0, qs=0, qc=0, wr=0, socks=2
Circular op-count snapshot; int: 0 (0.00 ops/s; spans 10s); sub: 0 (0.00 ops/s; spans 10s)
Sync timeouts: 0; fire and forget: 0; last heartbeat: -1s ago
resetting failing connections to retry...
retrying; attempts left: 2...
...
Examples
The following code example configures a constant (linear) delay between retries when initializing the
StackExchange.Redis client. This example shows how to set the configuration using a ConfigurationOptions
instance.
using System;
using System.IO;
using System.Threading.Tasks;
using StackExchange.Redis;
namespace RetryCodeSamples
{
    class CacheRedisCodeSamples
    {
        public async static Task Samples()
        {
            var writer = new StringWriter();
            try
            {
                // Delay between reconnect attempts. Note the use of TotalMilliseconds;
                // the Milliseconds property would return zero for a 4-second TimeSpan.
                var retryTimeInMilliseconds = (int)TimeSpan.FromSeconds(4).TotalMilliseconds;

                // Object-based configuration with a constant (linear) reconnect retry policy.
                var options = new ConfigurationOptions
                {
                    EndPoints = { "localhost" },
                    ConnectRetry = 3,
                    ReconnectRetryPolicy = new LinearRetry(retryTimeInMilliseconds)
                };

                ConnectionMultiplexer redis = await ConnectionMultiplexer.ConnectAsync(options, writer);
            }
            finally
            {
                // The writer captures the connection log, as shown earlier in this section.
                Console.WriteLine(writer.ToString());
            }
        }
    }
}
The next example sets the configuration by specifying the options as a string. The connection timeout is the
maximum period of time to wait for a connection to the cache, not the delay between retry attempts. Note that the
ReconnectRetryPolicy property can only be set by code.
using System;
using System.IO;
using System.Threading.Tasks;
using StackExchange.Redis;
namespace RetryCodeSamples
{
    class CacheRedisCodeSamples
    {
        public async static Task Samples()
        {
            var writer = new StringWriter();
            try
            {
                // Using string-based configuration: three connect retries and a 2000 ms connection timeout.
                var options = "localhost,connectRetry=3,connectTimeout=2000";
                ConnectionMultiplexer redis = await ConnectionMultiplexer.ConnectAsync(options, writer);
            }
            finally
            {
                Console.WriteLine(writer.ToString());
            }
        }
    }
}
Azure Search
Azure Search can be used to add powerful and sophisticated search capabilities to a website or application, quickly
and easily tune search results, and construct rich and fine-tuned ranking models.
Retry mechanism
Retry behavior in the Azure Search SDK is controlled by the SetRetryPolicy method on the SearchServiceClient
and SearchIndexClient classes. The default policy retries with exponential backoff when Azure Search returns a 5xx
or 408 (Request Timeout) response.
Telemetry
Trace with ETW or by registering a custom trace provider. For more information, see the AutoRest documentation.
Service Bus
Service Bus is a cloud messaging platform that provides loosely coupled message exchange with improved scale
and resiliency for components of an application, whether hosted in the cloud or on-premises.
Retry mechanism
Service Bus implements retries using implementations of the abstract RetryPolicy class. The namespace and
some of the configuration details depend on which Service Bus client SDK package is used:
(Table columns: Package, Description, Namespace)
Both versions of the client library provide the following built-in implementations of RetryPolicy:
RetryExponential. Implements exponential backoff.
NoRetry. Does not perform retries. Use this class when you don't need retries at the Service Bus API level,
for example when another process manages retries as part of a batch or multistep operation.
The RetryPolicy.Default property returns a default policy of type RetryExponential. This policy object has the
following settings:
Service Bus actions can return a range of exceptions, listed in Service Bus messaging exceptions. Exceptions
returned from Service Bus expose the IsTransient property that indicates whether the client should retry the
operation. The built-in RetryExponential policy checks this property before retrying.
If the last exception encountered was ServerBusyException, the RetryExponential policy adds 10 seconds to
the computed retry interval. This value cannot be changed.
Custom implementations could use a combination of the exception type and the IsTransient property to provide
more fine-grained control over retry actions. For example, you could detect a QuotaExceededException and
take action to drain the queue before retrying sending a message to it.
The following code sets the retry policy on a Service Bus client using the Microsoft.Azure.ServiceBus library:
const string QueueName = "queue1";
const string ServiceBusConnectionString = "<your_connection_string>";
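A sketch completing that snippet (the connection string and queue name are placeholders), passing a custom RetryExponential policy when the QueueClient is created:
using System;
using Microsoft.Azure.ServiceBus;

const string serviceBusConnectionString = "<your-connection-string>";
const string queueName = "queue1";

// Create the client with a custom exponential retry policy:
// minimum back-off, maximum back-off, maximum number of retries.
IQueueClient queueClient = new QueueClient(
    serviceBusConnectionString,
    queueName,
    ReceiveMode.PeekLock,
    new RetryExponential(
        TimeSpan.FromMilliseconds(100),
        TimeSpan.FromSeconds(30),
        5));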
The retry policy cannot be set at the individual operation level. It applies to all operations for the client.
Retry usage guidance
Consider the following guidelines when using Service Bus:
When using the built-in RetryExponential implementation, do not implement a fallback operation as the
policy reacts to Server Busy exceptions and automatically switches to an appropriate retry mode.
Service Bus supports a feature called Paired Namespaces that implements automatic failover to a backup
queue in a separate namespace if the queue in the primary namespace fails. Messages from the secondary
queue can be sent back to the primary queue when it recovers. This feature helps to address transient failures.
For more information, see Asynchronous Messaging Patterns and High Availability.
Consider starting with the following settings for retrying operations. These settings are general purpose, and you
should monitor the operations and fine-tune the values to suit your own scenario.
(Table columns: Context, Example maximum latency, Retry policy, Settings, How it works)
* Not including additional delay that is added if a Server Busy response is received.
Telemetry
Service Bus logs retries as ETW events using an EventSource. You must attach an EventListener to the event
source to capture the events and view them in Performance Viewer, or write them to a suitable destination log. The
retry events are of the following form:
Microsoft-ServiceBus-Client/RetryPolicyIteration
ThreadID="14,500"
FormattedMessage="[TrackingId:] RetryExponential: Operation Get:https://retry-
tests.servicebus.windows.net/TestQueue/?api-version=2014-05 at iteration 0 is retrying after 00:00:00.1000000
sleep because of Microsoft.ServiceBus.Messaging.MessagingCommunicationException: The remote name could not be
resolved: 'retry-tests.servicebus.windows.net'.TrackingId:6a26f99c-dc6d-422e-8565-f89fdd0d4fe3,
TimeStamp:9/5/2014 10:00:13 PM."
trackingId=""
policyType="RetryExponential"
operation="Get:https://retry-tests.servicebus.windows.net/TestQueue/?api-version=2014-05"
iteration="0"
iterationSleep="00:00:00.1000000"
lastExceptionType="Microsoft.ServiceBus.Messaging.MessagingCommunicationException"
exceptionMessage="The remote name could not be resolved: 'retry-
tests.servicebus.windows.net'.TrackingId:6a26f99c-dc6d-422e-8565-f89fdd0d4fe3,TimeStamp:9/5/2014 10:00:13 PM"
Examples
The following code example shows how to set the retry policy for:
A namespace manager. The policy applies to all operations on that manager, and cannot be overridden for
individual operations.
A messaging factory. The policy applies to all clients created from that factory, and cannot be overridden when
creating individual clients.
An individual messaging client. After a client has been created, you can set the retry policy for that client. The
policy applies to all operations on that client.
using System;
using System.Threading.Tasks;
using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;
namespace RetryCodeSamples
{
    class ServiceBusCodeSamples
    {
        private const string connectionString =
            @"Endpoint=sb://[my-namespace].servicebus.windows.net/;
              SharedAccessKeyName=RootManageSharedAccessKey;
              SharedAccessKey=C99..........Mk=";

        public static void Samples()
        {
            const string QueueName = "queue1";

            ServiceBusEnvironment.SystemConnectivity.Mode = ConnectivityMode.Http;

            // Create the namespace manager and messaging factory used below.
            var namespaceManager = NamespaceManager.CreateFromConnectionString(connectionString);
            var messagingFactory = MessagingFactory.CreateFromConnectionString(connectionString);

            // The namespace manager will have a default exponential policy with 10 retry attempts
            // and a 3 second delay delta.
            // Retry delays will be approximately 0 sec, 3 sec, 9 sec, 25 sec and the fixed 30 sec,
            // with an extra 10 sec added when receiving a ServerBusyException.
            {
                // Set different values for the retry policy, used for all operations on the
                // namespace manager.
                namespaceManager.Settings.RetryPolicy =
                    new RetryExponential(
                        minBackoff: TimeSpan.FromSeconds(0),
                        maxBackoff: TimeSpan.FromSeconds(30),
                        maxRetryCount: 3);
            }

            {
                // Set different values for the retry policy, used for clients created from it.
                messagingFactory.RetryPolicy =
                    new RetryExponential(
                        minBackoff: TimeSpan.FromSeconds(1),
                        maxBackoff: TimeSpan.FromSeconds(30),
                        maxRetryCount: 3);
            }

            {
                var client = messagingFactory.CreateQueueClient(QueueName);
                // The client inherits the policy from the factory that created it,
                // but the policy can also be set directly on the client after creation.
                client.RetryPolicy =
                    new RetryExponential(
                        minBackoff: TimeSpan.FromSeconds(0.1),
                        maxBackoff: TimeSpan.FromSeconds(30),
                        maxRetryCount: 3);
            }
        }
    }
}
More information
Asynchronous Messaging Patterns and High Availability
Service Fabric
Distributing reliable services in a Service Fabric cluster guards against most of the potential transient faults
discussed in this article. Some transient faults are still possible, however. For example, the naming service might be
in the middle of a routing change when it gets a request, causing it to throw an exception. If the same request
comes 100 milliseconds later, it will probably succeed.
Internally, Service Fabric manages this kind of transient fault. You can configure some settings by using the
OperationRetrySettings class while setting up your services. The following code shows an example. In most cases,
this should not be necessary, and the default settings will be fine.
FabricTransportRemotingSettings transportSettings = new FabricTransportRemotingSettings
{
OperationTimeout = TimeSpan.FromSeconds(30)
};
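A minimal sketch of configuring OperationRetrySettings as mentioned above. The constructor argument order shown (maximum back-off for transient errors, maximum back-off for non-transient errors, default retry count) is an assumption about the Microsoft.ServiceFabric.Services.Communication.Client API and may need adjusting for your SDK version:
using System;
using Microsoft.ServiceFabric.Services.Communication.Client;

// Assumed argument order: max back-off on transient errors, max back-off on
// non-transient errors, default maximum retry count.
var retrySettings = new OperationRetrySettings(
    TimeSpan.FromSeconds(10),
    TimeSpan.FromSeconds(1),
    5);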
More information
Remote exception handling
(Table columns: Context, Sample target E2E max latency, Retry strategy, Settings, Values, How it works)
NOTE
The end-to-end latency targets assume the default timeout for connections to the service. If you specify longer connection
timeouts, the end-to-end latency will be extended by this additional time for every retry attempt.
Examples
This section shows how you can use Polly to access Azure SQL Database using a set of retry policies configured in
the Policy class.
The following code shows an extension method on the SqlCommand class that calls ExecuteAsync with exponential
backoff.
public async static Task<SqlDataReader> ExecuteReaderWithRetryAsync(this SqlCommand command)
{
GuardConnectionIsNotNull(command);
}, cancellationToken);
}
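The policy definition and the ExecuteAsync call are elided in the extract above. The following is a hedged reconstruction of the idea using Polly (the helper name and back-off values are illustrative): handle SqlException, wait with exponential back-off, log each retry, then run the original command under the policy.
using System;
using System.Data.SqlClient;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using Polly;

public static class SqlCommandRetryExtensions
{
    public static async Task<SqlDataReader> ExecuteReaderWithRetrySketchAsync(
        this SqlCommand command, CancellationToken cancellationToken = default)
    {
        // Retry up to 3 times on SqlException, waiting 200 ms, 400 ms, then 800 ms,
        // and trace each retry attempt.
        var policy = Policy
            .Handle<SqlException>()
            .WaitAndRetryAsync(
                retryCount: 3,
                sleepDurationProvider: attempt =>
                    TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt - 1)),
                onRetry: (exception, delay) =>
                    Trace.WriteLine($"Retrying in {delay.TotalMilliseconds} ms: {exception.Message}"));

        // Execute the original operation under the policy.
        return await policy.ExecuteAsync(
            token => command.ExecuteReaderAsync(token),
            cancellationToken);
    }
}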
More information
Cloud Service Fundamentals Data Access Layer – Transient Fault Handling
For general guidance on getting the most from SQL Database, see Azure SQL Database performance and elasticity
guide.
You can then specify this as the default retry strategy for all operations using the SetConfiguration method of
the DbConfiguration instance when the application starts. By default, EF will automatically discover and use the
configuration class.
DbConfiguration.SetConfiguration(new BloggingContextConfiguration());
You can specify the retry configuration class for a context by annotating the context class with a
DbConfigurationType attribute. However, if you have only one configuration class, EF will use it without the need
to annotate the context.
[DbConfigurationType(typeof(BloggingContextConfiguration))]
public class BloggingContext : DbContext
If you need to use different retry strategies for specific operations, or disable retries for specific operations, you
can create a configuration class that allows you to suspend or swap strategies by setting a flag in the CallContext.
The configuration class can use this flag to switch strategies, or disable the strategy you provide and use a default
strategy. For more information, see Suspend Execution Strategy (EF6 onwards).
Another technique for using specific retry strategies for individual operations is to create an instance of the
required strategy class and supply the desired settings through parameters. You then invoke its ExecuteAsync
method.
var executionStrategy = new SqlAzureExecutionStrategy(5, TimeSpan.FromSeconds(4));
var blogs = await executionStrategy.ExecuteAsync(
async () =>
{
using (var db = new BloggingContext("Blogs"))
{
// Acquire some values asynchronously and return them
}
},
new CancellationToken()
);
The simplest way to use a DbConfiguration class is to locate it in the same assembly as the DbContext class.
However, this is not appropriate when the same context is required in different scenarios, such as different
interactive and background retry strategies. If the different contexts execute in separate AppDomains, you can use
the built-in support for specifying configuration classes in the configuration file or set it explicitly using code. If the
different contexts must execute in the same AppDomain, a custom solution will be required.
For more information, see Code-Based Configuration (EF6 onwards).
The following table shows the default settings for the built-in retry policy when using EF6.
(Table columns: Context, Sample target E2E max latency, Retry policy, Settings, Values, How it works)
NOTE
The end-to-end latency targets assume the default timeout for connections to the service. If you specify longer connection
timeouts, the end-to-end latency will be extended by this additional time for every retry attempt.
Examples
The following code example defines a simple data access solution that uses Entity Framework. It sets a specific
retry strategy by defining an instance of a class named BlogConfiguration that extends DbConfiguration.
using System;
using System.Collections.Generic;
using System.Data.Entity;
using System.Data.Entity.SqlServer;
using System.Threading.Tasks;
namespace RetryCodeSamples
{
public class BlogConfiguration : DbConfiguration
{
public BlogConfiguration()
{
// Set up the execution strategy for SQL Database (exponential) with 5 retries and 12 sec delay.
// These values could be loaded from configuration rather than being hard-coded.
this.SetExecutionStrategy(
"System.Data.SqlClient", () => new SqlAzureExecutionStrategy(5,
TimeSpan.FromSeconds(12)));
}
}
// Specify the configuration type if more than one has been defined.
// [DbConfigurationType(typeof(BlogConfiguration))]
public class BloggingContext : DbContext
{
// Definition of content goes here.
}
class EF6CodeSamples
{
public async static Task Samples()
{
// Execution strategy configured by DbConfiguration subclass, discovered automatically or
// explicitly indicated through configuration or with an attribute. Default is no retries.
using (var db = new BloggingContext("Blogs"))
{
// Add, edit, delete blog items here, then:
await db.SaveChangesAsync();
}
}
}
}
More examples of using the Entity Framework retry mechanism can be found in Connection resiliency / retry logic.
More information
Azure SQL Database performance and elasticity guide
The following code shows how to execute a transaction with automatic retries, by using an execution strategy. The
transaction is defined in a delegate. If a transient failure occurs, the execution strategy will invoke the delegate
again.
using (var db = new BloggingContext())
{
    // Obtain the execution strategy configured for the context.
    var strategy = db.Database.CreateExecutionStrategy();

    strategy.Execute(() =>
    {
        using (var transaction = db.Database.BeginTransaction())
        {
            db.Blogs.Add(new Blog { Url = "https://blogs.msdn.com/dotnet" });
            db.SaveChanges();
            transaction.Commit();
        }
    });
}
Azure Storage
Azure Storage services include blob storage, files, and storage queues.
Blobs, Queues and Files
The ClientOptions class is the base type for all client option types and exposes common client options such as
Diagnostics, Retry, and Transport. To provide the client configuration options for connecting to Azure Queue, Blob, and
File Storage, you must use the corresponding derived type. The next example uses the QueueClientOptions
class (derived from ClientOptions) to configure a client to connect to the Azure Queue service. The Retry property is
the set of options that control how retry attempts are made and which failures are eligible to be retried.
using System;
using Azure.Core;
using Azure.Storage.Queues;
namespace RetryCodeSamples
{
    class AzureStorageCodeSamples
    {
        // Provide the client configuration options for connecting to Azure Queue Storage.
        QueueClientOptions queueClientOptions = new QueueClientOptions()
        {
            Retry = {
                // The delay between retry attempts for a fixed approach, or the delay
                // on which to base calculations for a back-off-based approach.
                Delay = TimeSpan.FromSeconds(2),

                // The maximum number of retry attempts before giving up.
                MaxRetries = 5,

                // The approach to use for calculating retry delays.
                Mode = RetryMode.Exponential,

                // The maximum permissible delay between retry attempts.
                MaxDelay = TimeSpan.FromSeconds(10)
            }
        };
    }
}
Table Support
NOTE
The WindowsAzure.Storage NuGet package has been deprecated. For Azure Table support, see the Microsoft.Azure.Cosmos.Table
NuGet package.
Retry mechanism
Retries occur at the individual REST operation level and are an integral part of the client API implementation. The
client storage SDK uses classes that implement the IExtendedRetryPolicy Interface.
The built-in classes provide support for linear (constant delay) and exponential with randomization retry intervals.
There is also a no retry policy for use when another process is handling retries at a higher level. However, you can
implement your own retry classes if you have specific requirements not provided by the built-in classes.
Alternate retries switch between the primary and secondary storage service locations if you are using read-access geo-
redundant storage (RA-GRS) and the result of the request is a retryable error. See Azure Storage Redundancy
Options for more information.
Policy configuration
Retry policies are configured programmatically. A typical procedure is to create and populate a
TableRequestOptions, BlobRequestOptions, FileRequestOptions, or QueueRequestOptions instance.
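For example, two hypothetical option instances (the values are illustrative) matching the interactive and background scenarios used in the examples later in this section:
using System;
using Microsoft.Azure.Cosmos.Table;

// Linear retry tuned for interactive calls: short, predictable worst-case latency.
var interactiveRequestOption = new TableRequestOptions()
{
    RetryPolicy = new LinearRetry(TimeSpan.FromMilliseconds(500), 3),
    MaximumExecutionTime = TimeSpan.FromSeconds(2),
    ServerTimeout = TimeSpan.FromSeconds(1)
};

// Exponential retry tuned for background work, where longer delays are acceptable.
var backgroundRequestOption = new TableRequestOptions()
{
    RetryPolicy = new ExponentialRetry(TimeSpan.FromSeconds(4), 5),
    MaximumExecutionTime = TimeSpan.FromSeconds(120),
    ServerTimeout = TimeSpan.FromSeconds(30)
};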
The request options instance can then be set on the client, and all operations with the client will use the specified
request options.
client.DefaultRequestOptions = interactiveRequestOption;
var stats = await client.GetServiceStatsAsync();
You can override the client request options by passing a populated instance of the request options class as a
parameter to operation methods.
You use an OperationContext instance to specify the code to execute when a retry occurs and when an operation
has completed. This code can collect information about the operation for use in logs and telemetry.
In addition to indicating whether a failure is suitable for retry, the extended retry policies return a RetryContext
object that indicates the number of retries, the results of the last request, and whether the next retry will happen in
the primary or secondary location (see the table below for details). The properties of the RetryContext object can be
used to decide if and when to attempt a retry. For more information, see the IExtendedRetryPolicy.Evaluate method.
The following tables show the default settings for the built-in retry policies.
Request options:
(Table columns: Setting, Default value, Meaning)
Exponential policy:
Linear policy:
(Table columns: Context, Sample target E2E max latency, Retry policy, Settings, Values, How it works)
Telemetry
Retry attempts are logged to a TraceSource. You must configure a TraceListener to capture the events and write
them to a suitable destination log. You can use the TextWriterTraceListener or XmlWriterTraceListener to
write the data to a log file, the EventLogTraceListener to write to the Windows Event Log, or the
EventProviderTraceListener to write trace data to the ETW subsystem. You can also configure autoflushing of
the buffer, and the verbosity of events that will be logged (for example, Error, Warning, Informational, and Verbose).
For more information, see Client-side Logging with the .NET Storage Client Library.
Operations can receive an OperationContext instance, which exposes a Retrying event that can be used to
attach custom telemetry logic. For more information, see the OperationContext.Retrying event.
Examples
The following code example shows how to create two TableRequestOptions instances with different retry
settings; one for interactive requests and one for background requests. The example then sets these two retry
policies on the client so that they apply for all requests, and also sets the interactive strategy on a specific request
so that it overrides the default settings applied to the client.
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos.Table;
namespace RetryCodeSamples
{
    class AzureStorageCodeSamples
    {
        private const string connectionString = "UseDevelopmentStorage=true";

        public async static Task Samples()
        {
            var storageAccount = CloudStorageAccount.Parse(connectionString);
            var client = storageAccount.CreateCloudTableClient();

            // Request options with different retry policies (values are illustrative):
            // a linear policy for interactive scenarios and an exponential policy for
            // background scenarios.
            var interactiveRequestOption = new TableRequestOptions()
            {
                RetryPolicy = new LinearRetry(TimeSpan.FromMilliseconds(500), 3),
                MaximumExecutionTime = TimeSpan.FromSeconds(2)
            };
            var backgroundRequestOption = new TableRequestOptions()
            {
                RetryPolicy = new ExponentialRetry(TimeSpan.FromSeconds(4), 5),
                MaximumExecutionTime = TimeSpan.FromSeconds(120)
            };

            {
                // Set properties for the client (used on all requests unless overridden)
                // Different exponential policy parameters for background scenarios
                client.DefaultRequestOptions = backgroundRequestOption;
                // Linear policy for interactive scenarios
                client.DefaultRequestOptions = interactiveRequestOption;
            }

            {
                // Set properties for a specific request
                var stats = await client.GetServiceStatsAsync(interactiveRequestOption, operationContext: null);
            }

            {
                // Set up notifications for an operation
                var context = new OperationContext();
                context.ClientRequestID = "some request id";
                context.Retrying += (sender, args) =>
                {
                    // Collect retry information
                };
                context.RequestCompleted += (sender, args) =>
                {
                    // Collect operation completion information
                };
                var stats = await client.GetServiceStatsAsync(null, context);
            }
        }
    }
}
More information
Azure Storage client Library retry policy recommendations
Storage Client Library 2.0 – Implementing retry policies
Incremental. A retry strategy with a specified number of retry attempts and an incremental time interval
between retries. For example:
retryInterval = TimeSpan.FromMilliseconds(this.initialInterval.TotalMilliseconds +
(this.increment.TotalMilliseconds * currentRetryCount));
LinearRetry. A retry policy that performs a specified number of retries, using a specified fixed time interval
between retries. For example:
retryInterval = this.deltaBackoff;
Transient fault handling
All applications that communicate with remote services and resources must be sensitive to transient faults. This is
especially the case for applications that run in the cloud, where the nature of the environment and connectivity
over the Internet means these types of faults are likely to be encountered more often. Transient faults include the
momentary loss of network connectivity to components and services, the temporary unavailability of a service, or
timeouts that arise when a service is busy. These faults are often self-correcting, and if the action is repeated after a
suitable delay it is likely to succeed.
This document covers general guidance for transient fault handling. For information about handling transient
faults when using Microsoft Azure services, see Azure service-specific retry guidelines.
Challenges
Transient faults can have a huge effect on the perceived availability of an application, even if it has been thoroughly
tested under all foreseeable circumstances. To ensure that cloud-hosted applications operate reliably, they must be
able to respond to the following challenges:
The application must be able to detect faults when they occur, and determine if these faults are likely to be
transient, more long-lasting, or are terminal failures. Different resources are likely to return different
responses when a fault occurs, and these responses may also vary depending on the context of the
operation; for example, the response for an error when reading from storage may be different from the
response for an error when writing to storage. Many resources and services have well-documented
transient failure contracts. However, where such information is not available, it may be difficult to discover
the nature of the fault and whether it is likely to be transient.
The application must be able to retry the operation if it determines that the fault is likely to be transient and
keep track of the number of times the operation was retried.
The application must use an appropriate strategy for the retries. This strategy specifies the number of times
it should retry, the delay between each attempt, and the actions to take after a failed attempt. The
appropriate number of attempts and the delay between each one are often difficult to determine, and vary
based on the type of resource as well as the current operating conditions of the resource and the
application itself.
General guidelines
The following guidelines will help you to design a suitable transient fault handling mechanism for your
applications:
Determine if there is a built-in retry mechanism:
Many services provide an SDK or client library that contains a transient fault handling mechanism.
The retry policy it uses is typically tailored to the nature and requirements of the target service.
Alternatively, REST interfaces for services may return information that is useful in determining
whether a retry is appropriate, and how long to wait before the next retry attempt.
Use the built-in retry mechanism where available, unless you have specific and well-understood
requirements that make a different retry behavior more appropriate.
Determine if the operation is suitable for retrying:
You should only retry operations where the faults are transient (typically indicated by the nature of
the error), and if there is at least some likelihood that the operation will succeed when reattempted.
There is no point in reattempting operations that indicate an invalid operation such as a database
update to an item that does not exist, or requests to a service or resource that has suffered a fatal
error.
In general, you should implement retries only where the full impact of this can be determined, and
the conditions are well understood and can be validated. If not, leave it to the calling code to
implement retries. Remember that the errors returned from resources and services outside your
control may evolve over time, and you may need to revisit your transient fault detection logic.
When you create services or components, consider implementing error codes and messages that will
help clients determine whether they should retry failed operations. In particular, indicate if the client
should retry the operation (perhaps by returning an isTransient value) and suggest a suitable delay
before the next retry attempt. If you build a web service, consider returning custom errors defined
within your service contracts. Even though generic clients may not be able to read these, they will be
useful when building custom clients.
Determine an appropriate retry count and interval:
It is vital to optimize the retry count and the interval to the type of use case. If you do not retry a
sufficient number of times, the application will be unable to complete the operation and is likely to
experience a failure. If you retry too many times, or with too short an interval between tries, the
application can potentially hold resources such as threads, connections, and memory for long
periods, which will adversely affect the health of the application.
The appropriate values for the time interval and the number of retry attempts depend on the type of
operation being attempted. For example, if the operation is part of a user interaction, the interval
should be short and only a few retries attempted to avoid making users wait for a response (which
holds open connections and can reduce availability for other users). If the operation is part of a long
running or critical workflow, where canceling and restarting the process is expensive or time-
consuming, it is appropriate to wait longer between attempts and retry more times.
Determining the appropriate intervals between retries is the most difficult part of designing a
successful strategy. Typical strategies use the following types of retry interval:
Exponential back-off. The application waits a short time before the first retry, and then
exponentially increasing times between each subsequent retry. For example, it may retry the
operation after 3 seconds, 12 seconds, 30 seconds, and so on.
Incremental intervals. The application waits a short time before the first retry, and then
incrementally increasing times between each subsequent retry. For example, it may retry the
operation after 3 seconds, 7 seconds, 13 seconds, and so on.
Regular intervals. The application waits for the same period of time between each attempt.
For example, it may retry the operation every 3 seconds.
Immediate retry. Sometimes a transient fault is brief, perhaps due to an event such as a
network packet collision or a spike in a hardware component. In this case, retrying the
operation immediately is appropriate because it may succeed if the fault has cleared in the
time it takes the application to assemble and send the next request. However, there should
never be more than one immediate retry attempt, and you should switch to alternative
strategies, such as exponential back-off or fallback actions, if the immediate retry fails.
Randomization. Any of the retry strategies listed above may include a randomization to
prevent multiple instances of the client sending subsequent retry attempts at the same time.
For example, one instance may retry the operation after 3 seconds, 11 seconds, 28 seconds,
and so on, while another instance may retry the operation after 4 seconds, 12 seconds, 26
seconds, and so on. Randomization is a useful technique that may be combined with other
strategies (a sketch combining exponential back-off with jitter appears after these guidelines).
As a general guideline, use an exponential back-off strategy for background operations, and
immediate or regular interval retry strategies for interactive operations. In both cases, you should
choose the delay and the retry count so that the maximum latency for all retry attempts is within the
end-to-end latency requirement.
Take into account the combination of all the factors that contribute to the overall maximum timeout
for a retried operation. These factors include the time taken for a failed connection to produce a
response (typically set by a timeout value in the client) as well as the delay between retry attempts
and the maximum number of retries. The total of all these times can result in long overall operation
times, especially when using an exponential delay strategy where the interval between retries grows
rapidly after each failure. If a process must meet a specific service level agreement (SLA), the overall
operation time, including all timeouts and delays, must be within the limits defined in the SLA.
Overly aggressive retry strategies, which have intervals that are too short or retries that are too
frequent, can have an adverse effect on the target resource or service. This may prevent the resource
or service from recovering from its overloaded state, and it will continue to block or refuse requests.
This results in a vicious circle where more and more requests are sent to the resource or service, and
consequently its ability to recover is further reduced.
Take into account the timeout of the operations when choosing the retry intervals to avoid launching
a subsequent attempt immediately (for example, if the timeout period is similar to the retry interval).
Also consider if you need to keep the total possible period (the timeout plus the retry intervals) to
below a specific total time. Operations that have unusually short or very long timeouts may influence
how long to wait, and how often to retry the operation.
Use the type of the exception and any data it contains, or the error codes and messages returned
from the service, to optimize the interval and the number of retries. For example, some exceptions or
error codes (such as the HTTP code 503 Service Unavailable with a Retry-After header in the
response) may indicate how long the error might last, or that the service has failed and will not
respond to any subsequent attempt.
Avoid anti-patterns:
In the vast majority of cases, you should avoid implementations that include duplicated layers of
retry code. Avoid designs that include cascading retry mechanisms, or that implement retry at every
stage of an operation that involves a hierarchy of requests, unless you have specific requirements
that demand this. In these exceptional circumstances, use policies that prevent excessive numbers of
retries and delay periods, and make sure you understand the consequences. For example, if one
component makes a request to another, which then accesses the target service, and you implement
retry with a count of three on both calls there will be nine retry attempts in total against the service.
Many services and resources implement a built-in retry mechanism and you should investigate how
you can disable or modify this if you need to implement retries at a higher level.
Never implement an endless retry mechanism. This is likely to prevent the resource or service
recovering from overload situations, and cause throttling and refused connections to continue for a
longer period. Use a finite number of retries, or implement a pattern such as Circuit Breaker to allow
the service to recover.
Never perform an immediate retry more than once.
Avoid using a regular retry interval, especially when you have a large number of retry attempts,
when accessing services and resources in Azure. The optimum approach in this scenario is an
exponential back-off strategy with a circuit-breaking capability.
Prevent multiple instances of the same client, or multiple instances of different clients, from sending
retries at the same time. If this is likely to occur, introduce randomization into the retry intervals.
Test your retry strategy and implementation:
Ensure you fully test your retry strategy implementation under as wide a set of circumstances as
possible, especially when both the application and the target resources or services it uses are under
extreme load. To check behavior during testing, you can:
Inject transient and nontransient faults into the service. For example, send invalid requests or
add code that detects test requests and responds with different types of errors. For an
example using TestApi, see Fault Injection Testing with TestApi and Introduction to TestApi –
Part 5: Managed Code Fault Injection APIs.
Create a mock of the resource or service that returns a range of errors that the real service
may return. Ensure you cover all the types of error that your retry strategy is designed to
detect.
Force transient errors to occur by temporarily disabling or overloading the service if it is a
custom service that you created and deployed (of course, you should not attempt to overload
any shared resources or shared services within Azure).
For HTTP-based APIs, consider using the FiddlerCore library in your automated tests to
change the outcome of HTTP requests, either by adding extra roundtrip times or by changing
the response (such as the HTTP status code, headers, body, or other factors). This enables
deterministic testing of a subset of the failure conditions, whether transient faults or other
types of failure. For more information, see FiddlerCore. For examples of how to use the library,
particularly the HttpMangler class, examine the source code for the Azure Storage SDK.
Perform high load factor and concurrent tests to ensure that the retry mechanism and
strategy works correctly under these conditions, and does not have an adverse effect on the
operation of the client or cause cross-contamination between requests.
Manage retry policy configurations:
A retry policy is a combination of all of the elements of your retry strategy. It defines the detection
mechanism that determines whether a fault is likely to be transient, the type of interval to use (such
as regular, exponential back-off, and randomization), the actual interval value(s), and the number of
times to retry.
Retries must be implemented in many places within even the simplest application, and in every layer
of more complex applications. Rather than hard-coding the elements of each policy at multiple
locations, consider using a central point for storing all the policies. For example, store the values such
as the interval and retry count in application configuration files, read them at runtime, and
programmatically build the retry policies. This makes it easier to manage the settings, and to modify
and fine-tune the values in order to respond to changing requirements and scenarios. However,
design the system to store the values rather than rereading a configuration file every time, and
ensure suitable defaults are used if the values cannot be obtained from configuration.
In an Azure Cloud Services application, consider storing the values that are used to build the retry
policies at runtime in the service configuration file so that they can be changed without needing to
restart the application.
Take advantage of built-in or default retry strategies available in the client APIs you use, but only
where they are appropriate for your scenario. These strategies are typically general purpose. In some
scenarios they may be all that is required, but in other scenarios they may not offer the full range of
options to suit your specific requirements. You must understand how the settings will affect your
application through testing to determine the most appropriate values.
Log and track transient and nontransient faults:
As part of your retry strategy, include exception handling and other instrumentation that logs when
retry attempts are made. While an occasional transient failure and retry are to be expected, and do
not indicate a problem, regular and increasing numbers of retries are often an indicator of an issue
that may cause a failure, or is currently degrading application performance and availability.
Log transient faults as Warning entries rather than Error entries so that monitoring systems do not
detect them as application errors that may trigger false alerts.
Consider storing a value in your log entries that indicates if the retries were caused by throttling in
the service, or by other types of faults such as connection failures, so that you can differentiate them
during analysis of the data. An increase in the number of throttling errors is often an indicator of a
design flaw in the application or the need to switch to a premium service that offers dedicated
hardware.
Consider measuring and logging the overall time taken for operations that include a retry
mechanism. This is a good indicator of the overall effect of transient faults on user response times,
process latency, and the efficiency of the application use cases. Also log the number of retries that
occurred in order to understand the factors that contributed to the response time.
Consider implementing a telemetry and monitoring system that can raise alerts when the number
and rate of failures, the average number of retries, or the overall times taken for operations to
succeed, is increasing.
Manage operations that continually fail:
There will be circumstances where the operation continues to fail at every attempt, and it is vital to
consider how you will handle this situation:
Although a retry strategy will define the maximum number of times that an operation should
be retried, it does not prevent the application repeating the operation again, with the same
number of retries. For example, if an order processing service fails with a fatal error that puts
it out of action permanently, the retry strategy may detect a connection timeout and consider
it to be a transient fault. The code will retry the operation a specified number of times and
then give up. However, when another customer places an order, the operation will be
attempted again - even though it is sure to fail every time.
To prevent continual retries for operations that continually fail, consider implementing the
Circuit Breaker pattern. In this pattern, if the number of failures within a specified time window
exceeds the threshold, requests are returned to the caller immediately as errors, without
attempting to access the failed resource or service.
The application can periodically test the service, with long intervals between requests, to detect
when it becomes available. An appropriate interval will
depend on the scenario, such as the criticality of the operation and the nature of the service,
and might be anything between a few minutes and several hours. At the point where the test
succeeds, the application can resume normal operations and pass requests to the newly
recovered service.
In the meantime, it may be possible to fall back to another instance of the service (perhaps in
a different datacenter or application), use a similar service that offers compatible (perhaps
simpler) functionality, or perform some alternative operations in the hope that the service will
become available soon. For example, it may be appropriate to store requests for the service in
a queue or data store and replay them later. Otherwise you might be able to redirect the user
to an alternative instance of the application, degrade the performance of the application but
still offer acceptable functionality, or just return a message to the user indicating that the
application is not available at present.
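As referenced in the Randomization guideline above, the following sketch (values are illustrative) combines exponential back-off with random jitter so that multiple clients retrying at the same moment spread their attempts out:
using System;

public static class RetryDelaySketch
{
    private static readonly Random Jitter = new Random();

    // Base delay doubles on each attempt (3 s, 6 s, 12 s, ...), plus up to one extra
    // base interval of random jitter, capped at a maximum delay.
    public static TimeSpan GetDelay(int attempt, double baseSeconds = 3.0, double maxSeconds = 120.0)
    {
        double exponential = baseSeconds * Math.Pow(2, attempt - 1);
        double jitter = Jitter.NextDouble() * baseSeconds;
        return TimeSpan.FromSeconds(Math.Min(exponential + jitter, maxSeconds));
    }
}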
Other considerations
When deciding on the values for the number of retries and the retry intervals for a policy, consider if
the operation on the service or resource is part of a long-running or multistep operation. It may be
difficult or expensive to compensate all the other operational steps that have already succeeded
when one fails. In this case, a very long interval and a large number of retries may be acceptable as
long as it does not block other operations by holding or locking scarce resources.
Consider if retrying the same operation may cause inconsistencies in data. If some parts of a
multistep process are repeated, and the operations are not idempotent, it may result in an
inconsistency. For example, an operation that increments a value, if repeated, will produce an invalid
result. Repeating an operation that sends a message to a queue may cause an inconsistency in the
message consumer if it cannot detect duplicate messages. To prevent this, ensure that you design
each step as an idempotent operation. For more information about idempotency, see Idempotency
patterns.
Consider the scope of the operations that will be retried. For example, it may be easier to implement
retry code at a level that encompasses several operations, and retry them all if one fails. However,
doing this may result in idempotency issues or unnecessary rollback operations.
If you choose a retry scope that encompasses several operations, take into account the total latency
of all of them when determining the retry intervals, when monitoring the time taken, and before
raising alerts for failures.
Consider how your retry strategy may affect neighbors and other tenants in a shared application, or
when using shared resources and services. Aggressive retry policies can cause an increasing number
of transient faults to occur for these other users and for applications that share the resources and
services. Likewise, your application may be affected by the retry policies implemented by other users
of the resources and services. For mission-critical applications, you may decide to use premium
services that are not shared. This provides you with much more control over the load and
consequent throttling of these resources and services, which can help to justify the additional cost.
More information
Azure service-specific retry guidelines
Circuit Breaker pattern
Compensating Transaction pattern
Idempotency patterns
Performance tuning a distributed application
11/2/2020 • 2 minutes to read • Edit Online
In this series, we walk through several cloud application scenarios, showing how a development team used load
tests and metrics to diagnose performance issues. These articles are based on actual load testing that we
performed when developing example applications. The code for each scenario is available on GitHub.
Scenarios:
Distributed business transaction
Calling multiple backend services
Event stream processing
What is performance?
Performance is frequently measured in terms of throughput, response time, and availability. Performance targets
should be based on business operations. Customer-facing tasks may have more stringent requirements than
operational tasks such as generating reports.
Define a service level objective (SLO) that defines performance targets for each workload. You typically achieve this
by breaking a performance target into a set of Key Performance Indicators (KPIs), such as:
Latency or response time of specific requests
The number of requests performed per second
The rate at which the system generates exceptions.
Performance targets should explicitly include a target load. Also, not all users will receive exactly the same level of
performance, even when accessing the system simultaneously and performing the same work. So an SLO should
be framed in terms of percentiles.
An example SLO might be: "Client requests will have a response within 500 ms @ P90, at loads up to 25 K
requests/second."
Next steps
Read the performance tuning scenarios
Distributed business transaction
Calling multiple backend services
Event stream processing
Performance tuning scenario: Distributed business
transactions
11/2/2020 • 9 minutes to read • Edit Online
This article describes how a development team used metrics to find bottlenecks and improve the performance of a
distributed system. The article is based on actual load testing that we did for a sample application. The application
code is available on GitHub.
This article is part of a series. Read the first part here.
Scenario: A client application initiates a business transaction that involves multiple steps.
This scenario involves a drone delivery application that runs on Azure Kubernetes Service (AKS). Customers use a
web app to schedule deliveries by drone. Each transaction requires multiple steps that are performed by separate
microservices on the back end:
The Delivery service manages deliveries.
The Drone Scheduler service schedules drones for pickup.
The Package service manages packages.
There are two other services: An Ingestion service that accepts client requests and puts them on a queue for
processing, and a Workflow service that coordinates the steps in the workflow.
For more information about this scenario, see Designing a microservices architecture.
Test 1: Baseline
For the first load test, the team created a 6-node AKS cluster and deployed three replicas of each microservice. The
load test was a step-load test, starting at two simulated users and ramping up to 40 simulated users.
SETTING | VALUE
Cluster nodes | 6
The following graph shows the results of the load test, as shown in Visual Studio. The purple line plots user load,
and the orange line plots total requests.
The first thing to realize about this scenario is that client requests per second is not a useful metric of performance.
That's because the application processes requests asynchronously, so the client gets a response right away. The
response code is always HTTP 202 (Accepted), meaning the request was accepted but processing is not complete.
What we really want to know is whether the backend is keeping up with the request rate. The Service Bus queue
can absorb spikes, but if the backend cannot handle a sustained load, processing will fall further and further
behind.
Here's a more informative graph. It plots the number of incoming and outgoing messages on the Service Bus queue.
Incoming messages are shown in light blue, and outgoing messages are shown in dark blue:
This chart shows that the rate of incoming messages increases, reaching a peak and then dropping back to
zero at the end of the load test. But the number of outgoing messages peaks early in the test and then actually
drops. That means the Workflow service, which handles the requests, isn't keeping up. Even after the load test ends
(around 9:22 on the graph), messages are still being processed as the Workflow service continues to drain the
queue.
What's slowing down the processing? The first thing to look for is errors or exceptions that might indicate a
systematic issue. The Application Map in Azure Monitor shows the graph of calls between components, and is a
quick way to spot issues and then click through to get more details.
Sure enough, the Application Map shows that the Workflow service is getting errors from the Delivery service:
To see more details, you can select a node in the graph and click into an end-to-end transaction view. In this case, it
shows that the Delivery service is returning HTTP 500 errors. The error messages indicate that an exception is
being thrown due to memory limits in Azure Cache for Redis.
You may notice that these calls to Redis don't appear in the Application Map. That's because the .NET library for
Application Insights doesn't have built-in support for tracking Redis as a dependency. (For a list of what's
supported out of the box, see Dependency auto-collection.) As a fallback, you can use the TrackDependency API to
track any dependency. Load testing often reveals these kinds of gaps in the telemetry, which can be remediated.
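For example, here is a minimal sketch of wrapping a cache call with TrackDependency so that it appears as a dependency in Application Insights; the Redis operation itself is a placeholder delegate:

using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.ApplicationInsights;

public class RedisTelemetryWrapper
{
    private readonly TelemetryClient telemetryClient = new TelemetryClient();

    public async Task<string> GetCachedValueAsync(Func<Task<string>> redisCall)
    {
        var startTime = DateTimeOffset.UtcNow;
        var timer = Stopwatch.StartNew();
        bool success = false;
        try
        {
            string value = await redisCall();   // Placeholder for the actual Redis operation.
            success = true;
            return value;
        }
        finally
        {
            timer.Stop();
            // Record the call so it shows up alongside auto-collected dependencies.
            telemetryClient.TrackDependency("Redis", "GetCachedValue", "GET", startTime, timer.Elapsed, success);
        }
    }
}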
However, there is still a dramatic lag in processing messages. At the peak of the load test, the incoming message
rate is more than 5× the outgoing rate:
The following graph measures throughput in terms of message completion — that is, the rate at which the
Workflow service marks the Service Bus messages as completed. Each point on the graph represents 5 seconds of
data, showing ~16/sec maximum throughput.
This graph was generated by running a query in the Log Analytics workspace, using the Kusto query language:
let start=datetime("2019-07-31T22:30:00.000Z");
let end=datetime("2019-07-31T22:45:00.000Z");
dependencies
| where cloud_RoleName == 'fabrikam-workflow'
| where timestamp > start and timestamp < end
| where type == 'Azure Service Bus'
| where target has 'https://dev-i-iuosnlbwkzkau.servicebus.windows.net'
| where client_Type == "PC"
| where name == "Complete"
| summarize succeeded=sumif(itemCount, success == true), failed=sumif(itemCount, success == false) by
bin(timestamp, 5s)
| render timechart
SETTING | VALUE
Cluster nodes | 6
Unfortunately this load test shows only modest improvement. Outgoing messages are still not keeping up with
incoming messages:
Throughput is more consistent, but the maximum achieved is about the same as the previous test:
Moreover, looking at Azure Monitor for containers, it appears the problem is not caused by resource exhaustion
within the cluster. First, the node-level metrics show that CPU utilization remains under 40% even at the 95th
percentile, and memory utilization is about 20%.
In a Kubernetes environment, it's possible for individual pods to be resource-constrained even when the nodes
aren't. But the pod-level view shows that all pods are healthy.
From this test, it seems that just adding more pods to the back end won't help. The next step is to look more closely
at the Workflow service to understand what's happening when it processes messages. Application Insights shows
that the average duration of the Workflow service's Process operation is 246 ms.
We can also run a query to get metrics on the individual operations within each transaction:
TARGET | PERCENTILE_DURATION_50 | PERCENTILE_DURATION_95
delivery | 37 | 57
package | 12 | 17
dronescheduler | 21 | 41
The first row in this table represents the Service Bus queue. The other rows are the calls to the backend services.
For reference, here's the Log Analytics query for this table:
let start=datetime("2019-07-31T22:30:00.000Z");
let end=datetime("2019-07-31T22:45:00.000Z");
let dataset=dependencies
| where timestamp > start and timestamp < end
| where (cloud_RoleName == 'fabrikam-workflow')
| where name == 'Complete' or target in ('package', 'delivery', 'dronescheduler');
dataset
| summarize percentiles(duration, 50, 95) by target
These latencies look reasonable. But here is the key insight: If the total operation time is ~250 ms, that puts a strict
upper bound on how fast messages can be processed in serial (roughly four messages per second for each consumer
that processes messages one at a time). The key to improving throughput, therefore, is greater parallelism.
That should be possible in this scenario, for two reasons:
These are network calls, so most of the time is spent waiting for I/O completion
The messages are independent, and don't need to be processed in order.
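A minimal sketch of the kind of Service Bus configuration involved, using the Microsoft.Azure.ServiceBus SDK; the prefetch and concurrency values are illustrative rather than the exact values used in the test:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.ServiceBus;

public class WorkflowQueueListener
{
    public void Start(string connectionString, string queueName, Func<Message, Task> processMessageAsync)
    {
        var queueClient = new QueueClient(connectionString, queueName);

        // Fetch messages ahead of time so the handlers are not waiting on the network.
        queueClient.PrefetchCount = 3000;

        var options = new MessageHandlerOptions(args => Task.CompletedTask)   // Exception handler omitted for brevity.
        {
            MaxConcurrentCalls = 32,   // Process multiple messages concurrently (illustrative value).
            AutoComplete = false
        };

        queueClient.RegisterMessageHandler(async (message, cancellationToken) =>
        {
            await processMessageAsync(message);
            await queueClient.CompleteAsync(message.SystemProperties.LockToken);
        }, options);
    }
}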
For more information about these settings, see Best Practices for performance improvements using Service Bus
Messaging. Running the test with these settings produced the following graph:
Recall that incoming messages are shown in light blue, and outgoing messages are shown in dark blue.
At first glance, this is a very weird graph. For a while, the outgoing message rate exactly tracks the incoming rate.
But then, at about the 2:03 mark, the rate of incoming messages levels off, while the number of outgoing messages
continues to rise, actually exceeding the total number of incoming messages. That seems impossible.
The clue to this mystery can be found in the Dependencies view in Application Insights. This chart summarizes all
of the calls that the Workflow service made to Service Bus:
Notice the entry for DeadLetter. That call indicates messages are going into the Service Bus dead-letter queue.
To understand what's happening, you need to understand Peek-Lock semantics in Service Bus. When a client uses
Peek-Lock, Service Bus atomically retrieves and locks a message. While the lock is held, the message is guaranteed
not to be delivered to other receivers. If the lock expires, the message becomes available to other receivers. After a
maximum number of delivery attempts (which is configurable), Service Bus puts the message onto a dead-letter
queue, where it can be examined later.
Remember that the Workflow service is prefetching large batches of messages (3000 messages at a time). That
means the total time to process each message is longer, which results in messages timing out, going back onto the
queue, and eventually going into the dead-letter queue.
You can also see this behavior in the exceptions, where numerous MessageLockLostException exceptions are
recorded:
Over the total duration of the 8-minute load test, the application completed 25 K operations, with a peak
throughput of 72 operations/sec, representing a 400% increase in maximum throughput.
However, running the same test with a longer duration showed that the application could not sustain this rate:
The container metrics show that maximum CPU utilization was close to 100%. At this point, the application appears
to be CPU-bound. Scaling the cluster might improve performance now, unlike the previous attempt at scaling out.
SETTING | VALUE
Cluster nodes | 12
This test resulted in a higher sustained throughput, with no significant lags in processing messages. Moreover,
node CPU utilization stayed below 80%.
Summary
For this scenario, the following bottlenecks were identified:
Out-of-memory exceptions in Azure Cache for Redis.
Lack of parallelism in message processing.
Insufficient message lock duration, leading to lock timeouts and messages being placed in the dead letter
queue.
CPU exhaustion.
To diagnose these issues, the development team relied on the following metrics:
The rate of incoming and outgoing Service Bus messages.
Application Map in Application Insights.
Errors and exceptions.
Custom Log Analytics queries.
CPU and memory utilization in Azure Monitor for containers.
Next steps
For more information about the design of this scenario, see Designing a microservices architecture.
Performance tuning scenario: Multiple backend
services
11/2/2020 • 10 minutes to read • Edit Online
This article describes how a development team used metrics to find bottlenecks and improve the performance of a
distributed system. The article is based on actual load testing that we did for a sample application. The application
code is available on GitHub, along with the Visual Studio load test project used to generate the results.
This article is part of a series. Read the first part here.
Scenario: Call multiple backend services to retrieve information and then aggregate the results.
This scenario involves a drone delivery application. Clients can query a REST API to get their latest invoice
information. The invoice includes a summary of the customer's deliveries, packages, and total drone utilization.
This application uses a microservices architecture running on Azure Kubernetes Service (AKS), and the information
needed for the invoice is spread across several microservices.
Rather than the client calling each service directly, the application implements the Gateway Aggregation pattern.
Using this pattern, the client makes a single request to a gateway service. The gateway in turn calls the backend
services in parallel, and then aggregates the results into a single response payload.
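A minimal sketch of that fan-out, with hypothetical backend endpoints; the real sample's URLs and response types differ:

using System.Net.Http;
using System.Threading.Tasks;

public class InvoiceGateway
{
    private readonly HttpClient httpClient = new HttpClient();

    public async Task<(string Deliveries, string Packages, string DroneUtilization)> GetInvoiceDataAsync(string customerId)
    {
        // Start all backend calls at once; the slowest call determines the total latency.
        Task<string> deliveries = httpClient.GetStringAsync($"https://delivery.internal/api/deliveries/{customerId}");
        Task<string> packages = httpClient.GetStringAsync($"https://package.internal/api/packages/{customerId}");
        Task<string> utilization = httpClient.GetStringAsync($"https://dronescheduler.internal/api/utilization/{customerId}");

        await Task.WhenAll(deliveries, packages, utilization);

        // Aggregate the three responses into a single payload for the client.
        return (deliveries.Result, packages.Result, utilization.Result);
    }
}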
This chart shows that one operation in particular, GetDroneUtilization, takes much longer on average, by an
order of magnitude. The gateway makes these calls in parallel, so the slowest operation determines how long it
takes for the entire request to complete.
Clearly the next step is to dig into the GetDroneUtilization operation and look for any bottlenecks. One possibility is
resource exhaustion. Perhaps this particular backend service is running out of CPU or memory. For an AKS cluster,
this information is available in the Azure portal through the Azure Monitor for containers feature. The following
graphs show resource utilization at the cluster level:
In this screenshot, both the average and maximum values are shown. It's important to look at more than just the
average, because the average can hide spikes in the data. Here, the average CPU utilization stays below 50%, but
there are a couple of spikes to 80%. That's close to capacity but still within tolerances. Something else is causing
the bottleneck.
The next chart reveals the true culprit. This chart shows HTTP response codes from the Delivery service's backend
database, which in this case is Cosmos DB. The blue line represents success codes (HTTP 2xx), while the green line
represents HTTP 429 errors. An HTTP 429 return code means that Cosmos DB is temporarily throttling requests,
because the caller is consuming more resource units (RU) than provisioned.
To get further insight, the development team used Application Insights to view the end-to-end telemetry for a
representative sample of requests. Here is one instance:
This view shows the calls related to a single client request, along with timing information and response codes. The
top-level calls are from the gateway to the backend services. The call to GetDroneUtilization is expanded to show
calls to external dependencies — in this case, to Cosmos DB. The call in red returned an HTTP 429 error.
Note the large gap between the HTTP 429 error and the next call. When the Cosmos DB client library receives an
HTTP 429 error, it automatically backs off and waits to retry the operation. What this view shows is that during the
672 ms this operation took, most of that time was spent waiting to retry Cosmos DB.
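This back-off behavior is configurable on the client. For example, with the Cosmos DB .NET SDK v2, the retry settings live on the connection policy; the values below are illustrative:

using System;
using Microsoft.Azure.Documents.Client;

public static class CosmosClientFactory
{
    public static DocumentClient Create(Uri endpoint, string key)
    {
        var connectionPolicy = new ConnectionPolicy
        {
            RetryOptions = new RetryOptions
            {
                // How many times to retry a request that was throttled with HTTP 429.
                MaxRetryAttemptsOnThrottledRequests = 9,
                // Upper bound on the cumulative time spent waiting between retries.
                MaxRetryWaitTimeInSeconds = 30
            }
        };

        return new DocumentClient(endpoint, key, connectionPolicy);
    }
}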
Here's another interesting graph for this analysis. It shows RU consumption per physical partition versus
provisioned RUs per physical partition:
To make sense of this graph, you need to understand how Cosmos DB manages partitions. Collections in Cosmos
DB can have a partition key. Each possible key value defines a logical partition of the data within the collection.
Cosmos DB distributes these logical partitions across one or more physical partitions. The management of physical
partitions is handled automatically by Cosmos DB. As you store more data, Cosmos DB might move logical
partitions into new physical partitions, in order to spread load across the physical partitions.
For this load test, the Cosmos DB collection was provisioned with 900 RUs. The chart shows 100 RU per physical
partition, which implies a total of nine physical partitions. Although Cosmos DB automatically handles the sharding
of physical partitions, knowing the partition count can give insight into performance. The development team will
use this information later, as they continue to optimize. Where the blue line crosses the purple horizontal line, RU
consumption has exceeded the provisioned RUs. That's the point where Cosmos DB will begin to throttle calls.
METRIC | TEST 1 | TEST 2
Throughput (req/sec) | 19 | 23
These aren't huge gains, but looking at the graph over time shows a more complete picture:
Whereas the previous test showed an initial spike followed by a sharp drop, this test shows more consistent
throughput. However, the maximum throughput is not significantly higher.
All requests to Cosmos DB returned a 2xx status, and the HTTP 429 errors went away:
The graph of RU consumption versus provisioned RUs shows there is plenty of headroom. There are about 275
RUs per physical partition, and the load test peaked at about 100 RUs consumed per second.
Another interesting metric is the number of calls to Cosmos DB per successful operation:
Assuming no errors, the number of calls should match the actual query plan. In this case, the operation involves a
cross-partition query that hits all nine physical partitions. The higher value in the first load test reflects the number
of calls that returned a 429 error.
This metric was calculated by running a custom Log Analytics query:
let start=datetime("2019-06-18T20:59:00.000Z");
let end=datetime("2019-07-24T21:10:00.000Z");
let operationNameToEval="GET DroneDeliveries/GetDroneUtilization";
let dependencyType="Azure DocumentDB";
let dataset=requests
| where timestamp > start and timestamp < end
| where success == true
| where name == operationNameToEval;
dataset
| project reqOk=itemCount
| summarize
SuccessRequests=sum(reqOk),
TotalNumberOfDepCalls=(toscalar(dependencies
| where timestamp > start and timestamp < end
| where type == dependencyType
| summarize sum(itemCount)))
| project
OperationName=operationNameToEval,
DependencyName=dependencyType,
SuccessRequests,
AverageNumberOfDepCallsPerOperation=(TotalNumberOfDepCalls/SuccessRequests)
To summarize, the second load test shows improvement. However, the GetDroneUtilization operation still takes
about an order of magnitude longer than the next-slowest operation. Looking at the end-to-end transactions helps
to explain why:
As mentioned earlier, the GetDroneUtilization operation involves a cross-partition query to Cosmos DB. This
means the Cosmos DB client has to fan out the query to each physical partition and collect the results. As the end-
to-end transaction view shows, these queries are being performed in serial. The operation takes as long as the sum
of all the queries — and this problem will only get worse as the size of the data grows and more physical partitions
are added.
Test 3: Parallel queries
Based on the previous results, an obvious way to reduce latency is to issue the queries in parallel. The Cosmos DB
client SDK has a setting that controls the maximum degree of parallelism.
VALUE | DESCRIPTION
0 | No parallelism (default)
-1 | The client SDK manages the degree of parallelism across partitions
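With the .NET SDK v2, for example, the setting is exposed as MaxDegreeOfParallelism on FeedOptions. A minimal sketch, using a placeholder document type rather than the sample's actual model:

using System;
using System.Linq;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;

// Placeholder document type; the property names follow the query shown later in this article.
public class DroneUtilizationDocument
{
    public string ownerId { get; set; }
    public int year { get; set; }
    public int month { get; set; }
}

public static class UtilizationQueries
{
    public static IDocumentQuery<DroneUtilizationDocument> CreateQuery(
        DocumentClient client, Uri collectionUri, string owner, int year, int month)
    {
        var feedOptions = new FeedOptions
        {
            EnableCrossPartitionQuery = true,
            MaxDegreeOfParallelism = -1   // Let the SDK decide how many partitions to query in parallel.
        };

        return client.CreateDocumentQuery<DroneUtilizationDocument>(collectionUri, feedOptions)
            .Where(d => d.ownerId == owner && d.year == year && d.month == month)
            .AsDocumentQuery();
    }
}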
For the third load test, this setting was changed from 0 to -1. The following table summarizes the results:
METRIC | TEST 1 | TEST 2 | TEST 3
Throughput (req/sec) | 19 | 23 | 42
From the load test graph, not only is the overall throughput much higher (the orange line), but throughput also
keeps pace with the load (the purple line).
We can verify that the Cosmos DB client is making queries in parallel by looking at the end-to-end transaction
view:
Interestingly, a side effect of increasing the throughput is that the number of RUs consumed per second also
increases. Although Cosmos DB did not throttle any requests during this test, the consumption was close to the
provisioned RU limit:
This graph might be a signal to further scale out the database. However, it turns out that we can optimize the query
instead.
SELECT * FROM c
WHERE c.ownerId = <ownerIdValue> and
c.year = <yearValue> and
c.month = <monthValue>
This query selects records that match a particular owner ID and month/year. In the original design, none of these
properties is the partition key. That requires the client to fan out the query to each physical partition and gather
the results. To improve query performance, the development team changed the design so that owner ID is the
partition key for the collection. That way, the query can target a specific physical partition. (Cosmos DB handles this
automatically; you don't have to manage the mapping between partition key values and physical partitions.)
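A minimal sketch of the single-partition query after the change, reusing the placeholder types from the earlier sketch; the sample's actual code differs:

using System;
using System.Linq;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;

public static class UtilizationQueriesByOwner
{
    public static IDocumentQuery<DroneUtilizationDocument> CreateQuery(
        DocumentClient client, Uri collectionUri, string owner, int year, int month)
    {
        var feedOptions = new FeedOptions
        {
            // ownerId is now the partition key, so the query targets a single partition.
            PartitionKey = new PartitionKey(owner)
        };

        return client.CreateDocumentQuery<DroneUtilizationDocument>(collectionUri, feedOptions)
            .Where(d => d.ownerId == owner && d.year == year && d.month == month)
            .AsDocumentQuery();
    }
}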
After switching the collection to the new partition key, there was a dramatic improvement in RU consumption,
which translates directly into lower costs.
The end-to-end transaction view shows that as predicted, the query reads only one physical partition:
METRIC | TEST 1 | TEST 2 | TEST 3 | TEST 4
Throughput (req/sec) | 19 | 23 | 42 | 59
A consequence of the improved performance is that node CPU utilization becomes very high:
Toward the end of the load test, average CPU reached about 90%, and maximum CPU reached 100%. This metric
indicates that CPU is the next bottleneck in the system. If higher throughput is needed, the next step might be
scaling out the Delivery service to more instances.
Summary
For this scenario, the following bottlenecks were identified:
Cosmos DB throttling requests due to insufficient RUs provisioned.
High latency caused by querying multiple database partitions in serial.
Inefficient cross-partition query, because the query did not include the partition key.
In addition, CPU utilization was identified as a potential bottleneck at higher scale. To diagnose these issues, the
development team looked at:
Latency and throughput from the load test.
Cosmos DB errors and RU consumption.
The end-to-end transaction view in Application Insights.
CPU and memory utilization in Azure Monitor for containers.
Next steps
Review performance antipatterns
Performance tuning scenario: Event streaming with
Azure Functions
11/2/2020 • 7 minutes to read • Edit Online
This article describes how a development team used metrics to find bottlenecks and improve the performance of a
distributed system. The article is based on actual load testing that we did for a sample application. The application
code is available on GitHub.
This article is part of a series. Read the first part here.
Scenario: Process a stream of events using Azure Functions.
In this scenario, a fleet of drones sends position data in real time to Azure IoT Hub. A Functions app receives the
events, transforms the data into GeoJSON format, and writes the transformed data to Cosmos DB. Cosmos DB has
native support for geospatial data, and Cosmos DB collections can be indexed for efficient spatial queries. For
example, a client application could query for all drones within 1 km of a given location, or find all drones within a
certain area.
These processing requirements are simple enough that they don't require a full-fledged stream processing engine.
In particular, the processing doesn't join streams, aggregate data, or process across time windows. Based on these
requirements, Azure Functions is a good fit for processing the messages. Cosmos DB can also scale to support
very high write throughput.
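As a rough sketch of the shape of such a function, using the IoT Hub (Event Hubs-compatible) trigger and the Cosmos DB output binding; the names, connection settings, and document shape here are placeholders, not the sample's actual code:

using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.EventHubs;
using Microsoft.Azure.WebJobs;
using Newtonsoft.Json;

public static class DronePositionFunction
{
    [FunctionName("ProcessDronePositions")]
    public static async Task Run(
        [EventHubTrigger("messages/events", Connection = "IoTHubConnection")] EventData[] messages,
        [CosmosDB("drone-db", "positions", ConnectionStringSetting = "CosmosDBConnection")] IAsyncCollector<object> documents)
    {
        foreach (var message in messages)
        {
            string raw = Encoding.UTF8.GetString(message.Body.Array, message.Body.Offset, message.Body.Count);
            var position = JsonConvert.DeserializeObject<Dictionary<string, object>>(raw);

            // Convert to a GeoJSON point so Cosmos DB can index it for spatial queries.
            var document = new
            {
                id = position["deviceId"],
                Location = new
                {
                    type = "Point",
                    coordinates = new[] { position["longitude"], position["latitude"] }
                }
            };

            await documents.AddAsync(document);
        }
    }
}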
Monitoring throughput
This scenario presents an interesting performance challenge. The data rate per device is known, but the number of
devices may fluctuate. For this business scenario, the latency requirements are not particularly stringent. The
reported position of a drone only needs to be accurate within a minute. That said, the function app must keep up
with the average ingestion rate over time.
IoT Hub stores messages in a log stream. Incoming messages are appended to the tail of the stream. A reader of
the stream — in this case, the function app — controls its own rate of traversing the stream. This decoupling of the
read and write paths makes IoT Hub very efficient, but also means that a slow reader can fall behind. To detect this
condition, the development team added a custom metric to measure message lateness. This metric records the
delta between when a message arrives at IoT Hub, and when the function receives the message for processing.
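A minimal sketch of recording such a lateness metric with Application Insights; the metric name and SDK usage are assumptions, not the sample's exact implementation:

using System;
using Microsoft.ApplicationInsights;
using Microsoft.Azure.EventHubs;

public static class LatenessTracking
{
    private static readonly TelemetryClient telemetryClient = new TelemetryClient();

    public static void TrackMessageLateness(EventData message)
    {
        // Delta between when IoT Hub enqueued the message and when the function picked it up.
        TimeSpan lateness = DateTime.UtcNow - message.SystemProperties.EnqueuedTimeUtc;
        telemetryClient.GetMetric("MessageLatenessSeconds").TrackValue(lateness.TotalSeconds);
    }
}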
Test 1: Baseline
The first load test showed an immediate problem: The Function app consistently received HTTP 429 errors from
Cosmos DB, indicating that Cosmos DB was throttling the write requests.
In response, the team scaled Cosmos DB by increasing the number of RUs allocated for the collection, but the errors
continued. This seemed strange, because their back-of-envelope calculation showed that Cosmos DB should have
no problem keeping up with the volume of write requests.
Later that day, one of the developers sent the following email to the team:
I looked at Cosmos DB for the warm path. There's one thing I don't understand. The partition key is deliveryId,
however we don't send deliveryId to Cosmos DB. Am I missing something?
That was the clue. Looking at the partition heat map, it turned out that all of the documents were landing on the
same partition.
What you want to see in the heat map is an even distribution across all of the partitions. In this case, because every
document was getting written to the same partition, adding RUs didn't help. The problem turned out to be a bug in
the code. Although the Cosmos DB collection had a partition key, the Azure Function didn't actually include the
partition key in the document. For more information about the partition heat map, see Determine the throughput
distribution across partitions.
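A minimal sketch of the kind of fix, assuming a collection partitioned on /deliveryId; the class and property names are illustrative:

using Newtonsoft.Json;

// Every document written by the function must carry the collection's partition key value.
public class DronePositionDocument
{
    [JsonProperty("id")]
    public string Id { get; set; }

    [JsonProperty("deliveryId")]
    public string DeliveryId { get; set; }   // Partition key value; previously omitted.

    [JsonProperty("location")]
    public object Location { get; set; }
}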
The lateness metric peaks at 5 minutes and then drops to zero because the function app discards messages
that are more than 5 minutes late:
You can see this in the graph when the lateness metric drops back to zero. In the meantime, data has been lost,
because the function was throwing away messages.
What was happening? For this particular load test, the Cosmos DB collection had RUs to spare, so the bottleneck
was not at the database. Rather, the problem was in the message processing loop. Simply put, the function was not
writing documents quickly enough to keep up with the incoming volume of messages. Over time, it fell further and
further behind.
Test 3: Parallel writes
If the time to process a message is the bottleneck, one solution is to process more messages in parallel. In this
scenario:
Increase the number of IoT Hub partitions. Each IoT Hub partition gets assigned one function instance at a time,
so we would expect throughput to scale linearly with the number of partitions.
Parallelize the document writes within the function.
To explore the second option, the team modified the function to support parallel writes. The original version of the
function used the Cosmos DB output binding. The optimized version calls the Cosmos DB client directly and
performs the writes in parallel using Task.WhenAll:
// Wait for all of the parallel document writes to finish before reporting results.
await Task.WhenAll(tasks);

return (this.UpsertedDocuments,
        this.DroppedMessages,
        this.CosmosDbTotalMilliseconds);
}
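For context, here is a minimal sketch of how the list of parallel write tasks might be built before that await; the type and member names are placeholders rather than the sample's exact code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;

public class ParallelWriter
{
    private readonly DocumentClient client;
    private readonly Uri collectionUri;

    public ParallelWriter(DocumentClient client, Uri collectionUri)
    {
        this.client = client;
        this.collectionUri = collectionUri;
    }

    public Task WriteBatchAsync(IEnumerable<object> documents)
    {
        // Start one upsert per document and let the writes run concurrently.
        IEnumerable<Task> tasks = documents.Select(d => client.UpsertDocumentAsync(collectionUri, d));
        return Task.WhenAll(tasks);
    }
}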
Note that race conditions are possible with this approach. Suppose that two messages from the same drone happen to
arrive in the same batch of messages. By writing them in parallel, the earlier message could overwrite the later
message. For this particular scenario, the application can tolerate losing an occasional message. Drones send new
position data every 5 seconds, so the data in Cosmos DB is updated continually. In other scenarios, however, it may
be important to process messages strictly in order.
After deploying this code change, the application was able to ingest more than 2500 requests/sec, using an IoT
Hub with 32 partitions.
Client-side considerations
Aggressive parallelization on the server side can diminish the overall client experience. Consider using the Azure
Cosmos DB bulk executor library (not shown in this implementation), which significantly reduces the client-side
compute resources needed to saturate the throughput allocated to a Cosmos DB container. A single-threaded
application that writes data using the bulk import API achieves nearly ten times greater write throughput than a
multi-threaded application that writes data in parallel while saturating the client machine's CPU.
Summary
For this scenario, the following bottlenecks were identified:
Hot write partition, due to a missing partition key value in the documents being written.
Writing documents in serial per IoT Hub partition.
To diagnose these issues, the development team relied on the following metrics:
Throttled requests in Cosmos DB.
Partition heat map — Maximum consumed RUs per partition.
Messages received versus documents created.
Message lateness.
Next steps
Review performance antipatterns
Performance antipatterns for cloud applications
12/18/2020 • 2 minutes to read • Edit Online
A performance antipattern is a common practice that is likely to cause scalability problems when an application is
under pressure.
Here is a common scenario: An application behaves well during performance testing. It's released to production,
and begins to handle real workloads. At that point, it starts to perform poorly—rejecting user requests, stalling, or
throwing exceptions. The development team is then faced with two questions:
Why didn't this behavior show up during testing?
How do we fix it?
The answer to the first question is straightforward. It's difficult to simulate real users in a test environment, along
with their behavior patterns and the volumes of work they might perform. The only completely sure way to
understand how a system behaves under load is to observe it in production. To be clear, we aren't suggesting that
you should skip performance testing. Performance tests are crucial for getting baseline performance metrics. But
you must be prepared to observe and correct performance issues when they arise in the live system.
The answer to the second question, how to fix the problem, is less straightforward. Any number of factors might
contribute, and sometimes the problem only manifests under certain circumstances. Instrumentation and logging
are key to finding the root cause, but you also have to know what to look for.
Based on our engagements with Microsoft Azure customers, we've identified some of the most common
performance issues that customers see in production. For each antipattern, we describe why the antipattern
typically occurs, symptoms of the antipattern, and techniques for resolving the problem. We also provide sample
code that illustrates both the antipattern and a suggested solution.
Some of these antipatterns may seem obvious when you read the descriptions, but they occur more often than
you might think. Sometimes an application inherits a design that worked on-premises, but doesn't scale in the
cloud. Or an application might start with a very clean design, but as new features are added, one or more of these
antipatterns creeps in. Regardless, this guide will help you to identify and fix these antipatterns.
Catalog of antipatterns
Here is the list of the antipatterns that we've identified:
ANTIPATTERN | DESCRIPTION
Improper Instantiation | Repeatedly creating and destroying objects that are designed to be shared and reused.
Monolithic Persistence | Using the same data store for data with very different usage patterns.
Next steps
For more about performance tuning, see Performance tuning a distributed application
Busy Database antipattern
12/18/2020 • 7 minutes to read • Edit Online
Offloading processing to a database server can cause it to spend a significant proportion of time running code,
rather than responding to requests to store and retrieve data.
Problem description
Many database systems can run code. Examples include stored procedures and triggers. Often, it's more efficient to
perform this processing close to the data, rather than transmitting the data to a client application for processing.
However, overusing these features can hurt performance, for several reasons:
The database server may spend too much time processing, rather than accepting new client requests and
fetching data.
A database is usually a shared resource, so it can become a bottleneck during periods of high use.
Runtime costs may be excessive if the data store is metered. That's particularly true of managed database
services. For example, Azure SQL Database charges for Database Transaction Units (DTUs).
Databases have finite capacity to scale up, and it's not trivial to scale a database horizontally. Therefore, it may
be better to move processing into a compute resource, such as a VM or App Service app, that can easily scale
out.
This antipattern typically occurs because:
The database is viewed as a service rather than a repository. An application might use the database server to
format data (for example, converting to XML), manipulate string data, or perform complex calculations.
Developers try to write queries whose results can be displayed directly to users. For example, a query might
combine fields or format dates, times, and currency according to locale.
Developers are trying to correct the Extraneous Fetching antipattern by pushing computations to the database.
Stored procedures are used to encapsulate business logic, perhaps because they are considered easier to
maintain and update.
The following example retrieves the 20 most valuable orders for a specified sales territory and formats the results
as XML. It uses Transact-SQL functions to parse the data and convert the results to XML. You can find the complete
sample here.
SELECT TOP 20
soh.[SalesOrderNumber] AS '@OrderNumber',
soh.[Status] AS '@Status',
soh.[ShipDate] AS '@ShipDate',
YEAR(soh.[OrderDate]) AS '@OrderDateYear',
MONTH(soh.[OrderDate]) AS '@OrderDateMonth',
soh.[DueDate] AS '@DueDate',
FORMAT(ROUND(soh.[SubTotal],2),'C')
AS '@SubTotal',
FORMAT(ROUND(soh.[TaxAmt],2),'C')
AS '@TaxAmt',
FORMAT(ROUND(soh.[TotalDue],2),'C')
AS '@TotalDue',
CASE WHEN soh.[TotalDue] > 5000 THEN 'Y' ELSE 'N' END
AS '@ReviewRequired',
(
SELECT
c.[AccountNumber] AS '@AccountNumber',
UPPER(LTRIM(RTRIM(REPLACE(
CONCAT( p.[Title], ' ', p.[FirstName], ' ', p.[MiddleName], ' ', p.[LastName], ' ', p.[Suffix]),
' ', ' ')))) AS '@FullName'
FROM [Sales].[Customer] c
INNER JOIN [Person].[Person] p
ON c.[PersonID] = p.[BusinessEntityID]
WHERE c.[CustomerID] = soh.[CustomerID]
FOR XML PATH ('Customer'), TYPE
),
(
SELECT
sod.[OrderQty] AS '@Quantity',
FORMAT(sod.[UnitPrice],'C')
AS '@UnitPrice',
FORMAT(ROUND(sod.[LineTotal],2),'C')
AS '@LineTotal',
sod.[ProductID] AS '@ProductId',
CASE WHEN (sod.[ProductID] >= 710) AND (sod.[ProductID] <= 720) AND (sod.[OrderQty] >= 5) THEN 'Y' ELSE
'N' END
AS '@InventoryCheckRequired'
Clearly, this is a complex query. As we'll see later, it turns out to use significant processing resources on the database
server.
The application then uses the .NET Framework System.Xml.Linq APIs to format the results as XML.
order.Add(
new XAttribute("OrderNumber", orderNumber),
new XAttribute("Status", reader["Status"]),
new XAttribute("ShipDate", reader["ShipDate"]),
... // More attributes, not shown.
customer.Add(
new XAttribute("AccountNumber", reader["AccountNumber"]),
new XAttribute("FullName", fullName));
}
lineItems.Add(
new XElement("LineItem",
new XAttribute("Quantity", quantity),
new XAttribute("UnitPrice", ((Decimal)reader["UnitPrice"]).ToString("C")),
new XAttribute("LineTotal", RoundAndFormat(reader["LineTotal"])),
new XAttribute("ProductId", productId),
new XAttribute("InventoryCheckRequired", inventoryCheckRequired)
));
}
// Match the exact formatting of the XML returned from SQL
var xml = doc
.ToString(SaveOptions.DisableFormatting)
.Replace(" />", "/>");
}
}
NOTE
This code is somewhat complex. For a new application, you might prefer to use a serialization library. However, the
assumption here is that the development team is refactoring an existing application, so the method needs to return the
exact same format as the original code.
Considerations
Many database systems are highly optimized to perform certain types of data processing, such as
calculating aggregate values over large datasets. Don't move those types of processing out of the database.
Do not relocate processing if doing so causes the database to transfer far more data over the network. See
the Extraneous Fetching antipattern.
If you move processing to an application tier, that tier may need to scale out to handle the additional work.
Example diagnosis
The following sections apply these steps to the sample application described earlier.
Monitor the volume of database activity
The following graph shows the results of running a load test against the sample application, using a step load of up
to 50 concurrent users. The volume of requests quickly reaches a limit and stays at that level, while the average
response time steadily increases. A logarithmic scale is used for those two metrics.
This line graph shows user load, requests per second, and average response time. The graph shows that response
time increases as load increases.
The next graph shows CPU utilization and DTUs as a percentage of service quota. DTUs provide a measure of how
much processing the database performs. The graph shows that CPU and DTU utilization both quickly reached
100%.
This line graph shows CPU percentage and DTU percentage over time. The graph shows that both quickly reach
100%.
Examine the work performed by the database
It could be that the tasks performed by the database are genuine data access operations, rather than processing, so
it is important to understand the SQL statements being run while the database is busy. Monitor the system to
capture the SQL traffic and correlate the SQL operations with application requests.
If the database operations are purely data access operations, without a lot of processing, then the problem might
be Extraneous Fetching.
Implement the solution and verify the result
The following graph shows a load test using the updated code. Throughput is significantly higher, over 400
requests per second versus 12 earlier. The average response time is also much lower, just above 0.1 seconds
compared to over 4 seconds.
This line graph shows user load, requests per second, and average response time. The graph shows that response
time remains roughly constant throughout the load test.
CPU and DTU utilization show that the system took longer to reach saturation, despite the increased throughput.
This line graph shows CPU percentage and DTU percentage over time. The graph shows that CPU and DTU take
longer to reach 100% than previously.
Related resources
Extraneous Fetching antipattern
Busy Front End antipattern
12/18/2020 • 8 minutes to read • Edit Online
Performing asynchronous work on a large number of background threads can starve other concurrent foreground
tasks of resources, decreasing response times to unacceptable levels.
Problem description
Resource-intensive tasks can increase the response times for user requests and cause high latency. One way to
improve response times is to offload a resource-intensive task to a separate thread. This approach lets the
application stay responsive while processing happens in the background. However, tasks that run on a background
thread still consume resources. If there are too many of them, they can starve the threads that are handling
requests.
NOTE
The term resource can encompass many things, such as CPU utilization, memory occupancy, and network or disk I/O.
This problem typically occurs when an application is developed as a monolithic piece of code, with all of the business
logic combined into a single tier shared with the presentation layer.
Here's an example using ASP.NET that demonstrates the problem. You can find the complete sample here.
return Request.CreateResponse(HttpStatusCode.Accepted);
}
}
The Post method in the WorkInFrontEnd controller implements an HTTP POST operation. This operation
simulates a long-running, CPU-intensive task. The work is performed on a separate thread, in an attempt to
enable the POST operation to complete quickly.
The Get method in the UserProfile controller implements an HTTP GET operation. This method is much
less CPU intensive.
The primary concern is the resource requirements of the Post method. Although it puts the work onto a
background thread, the work can still consume considerable CPU resources. These resources are shared with other
operations being performed by other concurrent users. If a moderate number of users send this request at the
same time, overall performance is likely to suffer, slowing down all operations. Users might experience significant
latency in the Get method, for example.
public WorkInBackgroundController()
{
var serviceBusConnectionString = ...;
QueueName = ...;
ServiceBusQueueHandler = new ServiceBusQueueHandler(serviceBusConnectionString);
QueueClient = ServiceBusQueueHandler.GetQueueClientAsync(QueueName).Result;
}
[HttpPost]
[Route("api/workinbackground")]
public async Task<long> Post()
{
return await ServiceBusQueueHandler.AddWorkLoadToQueueAsync(QueueClient, QueueName, 0);
}
}
The back end pulls messages from the Service Bus queue and does the processing.
public async Task RunAsync(CancellationToken cancellationToken)
{
this._queueClient.OnMessageAsync(
// This lambda is invoked for each message received.
async (receivedMessage) =>
{
try
{
// Simulate processing of message
Thread.SpinWait(Int32.MaxValue / 1000);
await receivedMessage.CompleteAsync();
}
catch
{
receivedMessage.Abandon();
}
});
}
Considerations
This approach adds some additional complexity to the application. You must handle queuing and dequeuing
safely to avoid losing requests in the event of a failure.
The application takes a dependency on an additional service for the message queue.
The processing environment must be sufficiently scalable to handle the expected workload and meet the
required throughput targets.
While this approach should improve overall responsiveness, the tasks that are moved to the back end may take
longer to complete.
Example diagnosis
The following sections apply these steps to the sample application described earlier.
Identify points of slowdown
Instrument each method to track the duration and resources consumed by each request. Then monitor the
application in production. This can provide an overall view of how requests compete with each other. During
periods of stress, slow-running resource-hungry requests will likely affect other operations, and this behavior can
be observed by monitoring the system and noting the drop off in performance.
The following image shows a monitoring dashboard. (We used AppDynamics for our tests.) Initially, the system has
light load. Then users start requesting the UserProfile GET method. The performance is reasonably good until
other users start issuing requests to the WorkInFrontEnd POST method. At that point, response times increase
dramatically (first arrow). Response times only improve after the volume of requests to the WorkInFrontEnd
controller diminishes (second arrow).
At this point, it appears the Post method in the WorkInFrontEnd controller is a prime candidate for closer
examination. Further work in a controlled environment is needed to confirm the hypothesis.
Perform load testing
The next step is to perform tests in a controlled environment. For example, run a series of load tests that include
and then omit each request in turn to see the effects.
The graph below shows the results of a load test performed against an identical deployment of the cloud service
used in the previous tests. The test used a constant load of 500 users performing the Get operation in the
UserProfile controller, along with a step load of users performing the Post operation in the WorkInFrontEnd
controller.
Initially, the step load is 0, so the only active users are performing the UserProfile requests. The system is able to
respond to approximately 500 requests per second. After 60 seconds, a load of 100 additional users starts sending
POST requests to the WorkInFrontEnd controller. Almost immediately, the workload sent to the UserProfile
controller drops to about 150 requests per second. This is due to the way the load-test runner functions. It waits for
a response before sending the next request, so the longer it takes to receive a response, the lower the request rate.
As more users send POST requests to the WorkInFrontEnd controller, the response rate of the UserProfile
controller continues to drop. But note that the volume of requests handled by the WorkInFrontEnd controller
remains relatively constant. The saturation of the system becomes apparent as the overall rate of both requests
tends toward a steady but low limit.
Review the source code
The final step is to look at the source code. The development team was aware that the Post method could take a
considerable amount of time, which is why the original implementation used a separate thread. That solved the
immediate problem, because the Post method did not block waiting for a long-running task to complete.
However, the work performed by this method still consumes CPU, memory, and other resources. Enabling this
process to run asynchronously might actually damage performance, as users can trigger a large number of these
operations simultaneously, in an uncontrolled manner. There is a limit to the number of threads that a server can
run. Past this limit, the application is likely to get an exception when it tries to start a new thread.
NOTE
This doesn't mean you should avoid asynchronous operations. Performing an asynchronous await on a network call is a
recommended practice. (See the Synchronous I/O antipattern.) The problem here is that CPU-intensive work was spawned on
another thread.
Note that the WorkInBackground controller also handled a much larger volume of requests. However, you can't
make a direct comparison in this case, because the work being performed in this controller is very different from
the original code. The new version simply queues a request, rather than performing a time-consuming calculation.
The main point is that this method no longer drags down the entire system under load.
CPU and network utilization also show the improved performance. The CPU utilization never reached 100%, and
the volume of handled network requests was far greater than earlier, and did not tail off until the workload
dropped.
The following graph shows the results of a load test. The overall volume of requests serviced is greatly improved
compared to the earlier tests.
Related guidance
Autoscaling best practices
Background jobs best practices
Queue-Based Load Leveling pattern
Web Queue Worker architecture style
Chatty I/O antipattern
12/18/2020 • 9 minutes to read • Edit Online
The cumulative effect of a large number of I/O requests can have a significant impact on performance and
responsiveness.
Problem description
Network calls and other I/O operations are inherently slow compared to compute tasks. Each I/O request typically
has significant overhead, and the cumulative effect of numerous I/O operations can slow down the system. Here
are some common causes of chatty I/O.
Reading and writing individual records to a database as distinct requests
The following example reads from a database of products. There are three tables, Product , ProductSubcategory ,
and ProductPriceListHistory . The code retrieves all of the products in a subcategory, along with the pricing
information, by executing a series of queries:
1. Query the subcategory from the ProductSubcategory table.
2. Find all products in that subcategory by querying the Product table.
3. For each product, query the pricing data from the ProductPriceListHistory table.
The application uses Entity Framework to query the database. You can find the complete sample here.
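A minimal sketch of what this chatty pattern looks like in code; the context and entity names are assumptions based on the tables described above, not the sample's exact model:

using System.Collections.Generic;
using System.Data.Entity;   // Entity Framework 6
using System.Linq;
using System.Threading.Tasks;

// Minimal placeholder entities and context.
public class ProductSubcategory
{
    public int ProductSubcategoryId { get; set; }
    public List<Product> Products { get; set; }
}
public class Product
{
    public int ProductId { get; set; }
    public int ProductSubcategoryId { get; set; }
    public List<ProductPriceListHistory> PriceHistory { get; set; }
}
public class ProductPriceListHistory { public int ProductId { get; set; } }

public class AdventureWorksContext : DbContext
{
    public DbSet<ProductSubcategory> ProductSubcategories { get; set; }
    public DbSet<Product> Products { get; set; }
    public DbSet<ProductPriceListHistory> PriceListHistory { get; set; }
}

public static class ChattyProductQueries
{
    // One query for the subcategory, one for its products, and then one more query per product.
    public static async Task LoadSubcategoryAsync(AdventureWorksContext context, int subcategoryId)
    {
        var subcategory = await context.ProductSubcategories
            .SingleOrDefaultAsync(s => s.ProductSubcategoryId == subcategoryId);

        var products = await context.Products
            .Where(p => p.ProductSubcategoryId == subcategoryId)
            .ToListAsync();

        foreach (var product in products)
        {
            // This round trip runs once for every product.
            product.PriceHistory = await context.PriceListHistory
                .Where(h => h.ProductId == product.ProductId)
                .ToListAsync();
        }
    }
}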
This example shows the problem explicitly, but sometimes an O/RM can mask the problem, if it implicitly fetches
child records one at a time. This is known as the "N+1 problem".
Implementing a single logical operation as a series of HTTP requests
This often happens when developers try to follow an object-oriented paradigm, and treat remote objects as if they
were local objects in memory. This can result in too many network round trips. For example, the following web API
exposes the individual properties of User objects through individual HTTP GET methods.
[HttpGet]
[Route("users/{id:int}/gender")]
public HttpResponseMessage GetGender(int id)
{
...
}
[HttpGet]
[Route("users/{id:int}/dateofbirth")]
public HttpResponseMessage GetDateOfBirth(int id)
{
...
}
}
While there's nothing technically wrong with this approach, most clients will probably need to get several
properties for each User, so a typical client ends up making multiple HTTP calls to assemble a single object.
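The same consolidation applies to the earlier database example: fetching the subcategory, its products, and the pricing history in a single query removes the per-product round trips. A rough sketch, assuming EF6's Include extension and the entity names used earlier, which leads into the null check shown below:

// Retrieve the subcategory, its products, and each product's price history in one query.
var subCategory = await context.ProductSubcategories
    .Where(psc => psc.ProductSubcategoryId == subcategoryId)
    .Include(psc => psc.Products.Select(p => p.PriceHistory))
    .SingleOrDefaultAsync();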
if (subCategory == null)
return NotFound();
return Ok(subCategory);
}
}
Follow REST design principles for web APIs. Here's a revised version of the web API from the earlier example.
Instead of separate GET methods for each property, there is a single GET method that returns the User . This
results in a larger response body per request, but each client is likely to make fewer API calls.
// Client code
HttpResponseMessage response = await client.GetAsync("users/1");
response.EnsureSuccessStatusCode();
var user = await response.Content.ReadAsStringAsync();
For file I/O, consider buffering data in memory and then writing the buffered data to a file as a single operation.
This approach reduces the overhead from repeatedly opening and closing the file, and helps to reduce
fragmentation of the file on disk.
// Save a list of customer objects to a file
private async Task SaveCustomerListToFileAsync(List<Customer> customers)
{
using (Stream fileStream = new FileStream(CustomersFileName, FileMode.Append))
{
BinaryFormatter formatter = new BinaryFormatter();
foreach (var customer in customers)
{
byte[] data = null;
using (MemoryStream memStream = new MemoryStream())
{
formatter.Serialize(memStream, customer);
data = memStream.ToArray();
}
await fileStream.WriteAsync(data, 0, data.Length);
}
}
}
// Save the contents of the list, writing all customers in a single operation
await SaveCustomerListToFileAsync(customers);
Considerations
The first two examples make fewer I/O calls, but each one retrieves more information. You must consider
the tradeoff between these two factors. The right answer will depend on the actual usage patterns. For
example, in the web API example, it might turn out that clients often need just the user name. In that case, it
might make sense to expose it as a separate API call. For more information, see the Extraneous Fetching
antipattern.
When reading data, do not make your I/O requests too large. An application should only retrieve the
information that it is likely to use.
Sometimes it helps to partition the information for an object into two chunks, frequently accessed data that
accounts for most requests, and less frequently accessed data that is used rarely. Often the most frequently
accessed data is a relatively small portion of the total data for an object, so returning just that portion can
save significant I/O overhead.
When writing data, avoid locking resources for longer than necessary, to reduce the chances of contention
during a lengthy operation. If a write operation spans multiple data stores, files, or services, then adopt an
eventually consistent approach. See Data Consistency guidance.
If you buffer data in memory before writing it, the data is vulnerable if the process crashes. If the data rate
typically has bursts or is relatively sparse, it may be safer to buffer the data in an external durable queue
such as Event Hubs.
Consider caching data that you retrieve from a service or a database. This can help to reduce the volume of
I/O by avoiding repeated requests for the same data. For more information, see Caching best practices.
Example diagnosis
The following sections apply these steps to the example shown earlier that queries a database.
Load test the application
This graph shows the results of load testing. Median response time is measured in tens of seconds per request.
The graph shows very high latency. With a load of 1000 users, a user might have to wait for nearly a minute to see
the results of a query.
NOTE
The application was deployed as an Azure App Service web app, using Azure SQL Database. The load test used a simulated
step workload of up to 1000 concurrent users. The database was configured with a connection pool supporting up to 1000
concurrent connections, to reduce the chance that contention for connections would affect the results.
The next image shows the actual SQL statements that were issued. The query that fetches price information is run
for each individual product in the product subcategory. Using a join would considerably reduce the number of
database calls.
If you are using an O/RM, such as Entity Framework, tracing the SQL queries can provide insight into how the
O/RM translates programmatic calls into SQL statements, and indicate areas where data access might be
optimized.
Implement the solution and verify the result
Rewriting the call to Entity Framework produced the following results.
This load test was performed on the same deployment, using the same load profile. This time the graph shows
much lower latency. The average request time at 1000 users is between 5 and 6 seconds, down from nearly a
minute.
This time the system supported an average of 3,970 requests per minute, compared to 410 for the earlier test.
Tracing the SQL statement shows that all the data is fetched in a single SELECT statement. Although this query is
considerably more complex, it is performed only once per operation. And while complex joins can become
expensive, relational database systems are optimized for this type of query.
Related resources
API Design best practices
Caching best practices
Data Consistency Primer
Extraneous Fetching antipattern
No Caching antipattern
Extraneous Fetching antipattern
12/18/2020 • 10 minutes to read • Edit Online
Retrieving more data than needed for a business operation can result in unnecessary I/O overhead and reduce
responsiveness.
Problem description
This antipattern can occur if the application tries to minimize I/O requests by retrieving all of the data that it
might need. This is often a result of overcompensating for the Chatty I/O antipattern. For example, an application
might fetch the details for every product in a database. But the user may need just a subset of the details (some
may not be relevant to customers), and probably doesn't need to see all of the products at once. Even if the user
is browsing the entire catalog, it would make sense to paginate the results—showing 20 at a time, for example.
Another source of this problem is following poor programming or design practices. For example, the following
code uses Entity Framework to fetch the complete details for every product. Then it filters the results to return
only a subset of the fields, discarding the rest. You can find the complete sample here.
// Project fields from the query results. This happens in application memory.
var result = products.Select(p => new ProductInfo { Id = p.ProductId, Name = p.Name });
return Ok(result);
}
}
In the next example, the application retrieves data to perform an aggregation that could be done by the database
instead. The application calculates total sales by getting every record for all orders sold, and then computing the
sum over those records. You can find the complete sample here.
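A minimal sketch of the contrast, with assumed entity and context names; the first method mirrors the client-side aggregation described above, while the second pushes the SUM to the database:

using System.Data.Entity;   // Entity Framework 6
using System.Linq;
using System.Threading.Tasks;

public class SalesOrderHeader { public decimal TotalDue { get; set; } }

public class SalesContext : DbContext
{
    public DbSet<SalesOrderHeader> SalesOrderHeaders { get; set; }
}

public static class SalesTotals
{
    // Antipattern: pull every order into memory and sum on the client.
    public static async Task<decimal> AggregateOnClientAsync(SalesContext context)
    {
        var orders = await context.SalesOrderHeaders.ToListAsync();
        return orders.Sum(soh => soh.TotalDue);
    }

    // Better: let the database compute the aggregate and return a single value.
    public static Task<decimal> AggregateOnDatabaseAsync(SalesContext context)
    {
        return context.SalesOrderHeaders.SumAsync(soh => soh.TotalDue);
    }
}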
The next example shows a subtle problem caused by the way Entity Framework uses LINQ to Entities.
var query = from p in context.Products.AsEnumerable()
where p.SellStartDate < DateTime.Now.AddDays(-7) // AddDays cannot be mapped by LINQ to Entities
select ...;
The application is trying to find products with a SellStartDate more than a week old. In most cases, LINQ to
Entities would translate a where clause to a SQL statement that is executed by the database. In this case, however,
LINQ to Entities cannot map the AddDays method to SQL. Instead, every row from the Product table is returned,
and the results are filtered in memory.
The call to AsEnumerable is a hint that there is a problem. This method converts the results to an IEnumerable
interface. Although IEnumerable supports filtering, the filtering is done on the client side, not the database. By
default, LINQ to Entities uses IQueryable , which passes the responsibility for filtering to the data source.
When using Entity Framework, ensure that LINQ queries are resolved using the IQueryable interface and not
IEnumerable . You may need to adjust the query to use only functions that can be mapped to the data source. The
earlier example can be refactored to remove the AddDays method from the query, allowing filtering to be done
by the database.
DateTime dateSince = DateTime.Now.AddDays(-7); // AddDays has been factored out.
var query = from p in context.Products
where p.SellStartDate < dateSince // This criterion can be passed to the database by LINQ to
Entities
select ...;
Considerations
In some cases, you can improve performance by partitioning data horizontally. If different operations
access different attributes of the data, horizontal partitioning may reduce contention. Often, most
operations are run against a small subset of the data, so spreading this load may improve performance.
See Data partitioning.
For operations that have to support unbounded queries, implement pagination and only fetch a limited
number of entities at a time. For example, if a customer is browsing a product catalog, you can show one
page of results at a time.
When possible, take advantage of features built into the data store. For example, SQL databases typically
provide aggregate functions.
If you're using a data store that doesn't support a particular function, such as aggregation, you could store
the calculated result elsewhere, updating the value as records are added or updated, so the application
doesn't have to recalculate the value each time it's needed.
If you see that requests are retrieving a large number of fields, examine the source code to determine
whether all of these fields are necessary. Sometimes these requests are the result of a poorly designed
SELECT * query.
Similarly, requests that retrieve a large number of entities may be a sign that the application is not filtering
data correctly. Verify that all of these entities are needed. Use database-side filtering if possible, for
example, by using WHERE clauses in SQL.
Offloading processing to the database is not always the best option. Only use this strategy when the
database is designed or optimized to do so. Most database systems are highly optimized for certain
functions, but are not designed to act as general-purpose application engines. For more information, see
the Busy Database antipattern.
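As a sketch of the pagination guidance above (the Products DbSet, ProductId key, and page size are illustrative assumptions), a paged query might look like this:
// Fetch one page of products at a time instead of the whole catalog.
// An explicit ordering is required so that Skip/Take produce stable, repeatable pages.
public async Task<List<Product>> GetProductPageAsync(int pageIndex, int pageSize = 20)
{
    using (var context = new AdventureWorksContext())
    {
        return await context.Products
            .OrderBy(p => p.ProductId)
            .Skip(pageIndex * pageSize) // translated to OFFSET by the provider
            .Take(pageSize)             // translated to FETCH NEXT by the provider
            .ToListAsync();
    }
}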
Example diagnosis
The following sections apply these steps to the previous examples.
Identify slow workloads
This graph shows performance results from a load test that simulated up to 400 concurrent users running the
GetAllFieldsAsync method shown earlier. Throughput diminishes slowly as the load increases. Average response
time goes up as the workload increases.
A load test for the AggregateOnClientAsync operation shows a similar pattern. The volume of requests is
reasonably stable. The average response time increases with the workload, although more slowly than the
previous graph.
Correlate slow workloads with behavioral patterns
Any correlation between regular periods of high usage and slowing performance can indicate areas of concern.
Closely examine the performance profile of functionality that is suspected to be slow running, to determine
whether it matches the load testing performed earlier.
Load test the same functionality using step-based user loads, to find the point where performance drops
significantly or fails completely. If that point falls within the bounds of your expected real-world usage, examine
how the functionality is implemented.
A slow operation is not necessarily a problem, if it is not being performed when the system is under stress, is not
time critical, and does not negatively affect the performance of other important operations. For example,
generating monthly operational statistics might be a long-running operation, but it can probably be performed
as a batch process and run as a low-priority job. On the other hand, customers querying the product catalog is a
critical business operation. Focus on the telemetry generated by these critical operations to see how the
performance varies during periods of high usage.
Identify data sources in slow workloads
If you suspect that a service is performing poorly because of the way it retrieves data, investigate how the
application interacts with the repositories it uses. Monitor the live system to see which sources are accessed
during periods of poor performance.
For each data source, instrument the system to capture the following:
The frequency that each data store is accessed.
The volume of data entering and exiting the data store.
The timing of these operations, especially the latency of requests.
The nature and rate of any errors that occur while accessing each data store under typical load.
Compare this information against the volume of data being returned by the application to the client. Track the
ratio of the volume of data returned by the data store against the volume of data returned to the client. If there is
any large disparity, investigate to determine whether the application is fetching data that it doesn't need.
You may be able to capture this data by observing the live system and tracing the lifecycle of each user request,
or you can model a series of synthetic workloads and run them against a test system.
The following graphs show telemetry captured using New Relic APM during a load test of the GetAllFieldsAsync
method. Note the difference between the volumes of data received from the database and the corresponding
HTTP responses.
For each request, the database returned 80,503 bytes, but the response to the client only contained 19,855 bytes,
about 25% of the size of the database response. The size of the data returned to the client can vary depending on
the format. For this load test, the client requested JSON data. Separate testing using XML (not shown) had a
response size of 35,655 bytes, or 44% of the size of the database response.
The load test for the AggregateOnClientAsync method shows more extreme results. In this case, each test
performed a query that retrieved over 280 Kb of data from the database, but the JSON response was a mere 14
bytes. The wide disparity is because the method calculates an aggregated result from a large volume of data.
Identify and analyze slow queries
Look for database queries that consume the most resources and take the most time to execute. You can add
instrumentation to find the start and completion times for many database operations. Many data stores also
provide in-depth information on how queries are performed and optimized. For example, the Query Performance
pane in the Azure SQL Database management portal lets you select a query and view detailed runtime
performance information. Here is the query generated by the GetAllFieldsAsync operation:
Implement the solution and verify the result
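The fix is to let the database do the projection instead of retrieving every column. A minimal sketch of the idea, assuming a small ProductInfo DTO that carries only the fields the client needs (the DTO and property names are assumptions), looks like this:
// DTO containing only the fields the client actually uses (illustrative).
public class ProductInfo
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public async Task<IEnumerable<ProductInfo>> GetRequiredFieldsAsync()
{
    using (var context = new AdventureWorksContext())
    {
        // The Select projection is translated into a SQL SELECT of just these
        // columns, instead of retrieving every column of every product.
        return await context.Products
            .Select(p => new ProductInfo { Id = p.ProductId, Name = p.Name })
            .ToListAsync();
    }
}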
After changing the GetRequiredFieldsAsync method to use a SELECT statement on the database side, load testing
showed the following results.
This load test used the same deployment and the same simulated workload of 400 concurrent users as before.
The graph shows much lower latency. Response time rises with load to approximately 1.3 seconds, compared to 4
seconds in the previous case. The throughput is also higher at 350 requests per second compared to 100 earlier.
The volume of data retrieved from the database now closely matches the size of the HTTP response messages.
Load testing using the AggregateOnDatabaseAsync method generates the following results:
The average response time is now minimal. This is an order of magnitude improvement in performance, caused
primarily by the large reduction in I/O from the database.
Here is the corresponding telemetry for the AggregateOnDatabaseAsync method. The amount of data retrieved
from the database was vastly reduced, from over 280 Kb per transaction to 53 bytes. As a result, the maximum
sustained number of requests per minute was raised from around 2,000 to over 25,000.
Related resources
Busy Database antipattern
Chatty I/O antipattern
Data partitioning best practices
Improper Instantiation antipattern
12/18/2020 • 6 minutes to read • Edit Online
It can hurt performance to continually create new instances of an object that is meant to be created once and then
shared.
Problem description
Many libraries provide abstractions of external resources. Internally, these classes typically manage their own
connections to the resource, acting as brokers that clients can use to access the resource. Here are some examples
of broker classes that are relevant to Azure applications:
System.Net.Http.HttpClient . Communicates with a web service using HTTP.
Microsoft.ServiceBus.Messaging.QueueClient . Sends messages to and receives messages from a Service Bus queue.
Microsoft.Azure.Documents.Client.DocumentClient . Connects to a Cosmos DB instance.
StackExchange.Redis.ConnectionMultiplexer . Connects to Redis, including Azure Cache for Redis.
These classes are intended to be instantiated once and reused throughout the lifetime of an application. However,
it's a common misunderstanding that these classes should be acquired only as necessary and released quickly. (The
ones listed here happen to be .NET libraries, but the pattern is not unique to .NET.) The following ASP.NET example
creates an instance of HttpClient to communicate with a remote service. You can find the complete sample here.
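The pattern is roughly the following sketch; the controller name matches the one referenced in the diagnosis below, and the URL format mirrors the shared-instance version shown later, but the details are assumptions.
public class NewHttpClientInstancePerRequestController : ApiController
{
    // A new HttpClient is created, used once, and disposed for every request.
    public async Task<Product> GetProductAsync(string id)
    {
        using (var httpClient = new HttpClient())
        {
            var hostName = HttpContext.Current.Request.Url.Host;
            var result = await httpClient.GetStringAsync(string.Format("http://{0}:8080/api/...", hostName));
            return new Product { Name = result };
        }
    }
}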
In a web application, this technique is not scalable. A new HttpClient object is created for each user request. Under
heavy load, the web server may exhaust the number of available sockets, resulting in SocketException errors.
This problem is not restricted to the HttpClient class. Other classes that wrap resources or are expensive to create
might cause similar issues. The following example creates an instance of the ExpensiveToCreateService class. Here
the issue is not necessarily socket exhaustion, but simply how long it takes to create each instance. Continually
creating and destroying instances of this class might adversely affect the scalability of the system.
public class NewServiceInstancePerRequestController : ApiController
{
public async Task<Product> GetProductAsync(string id)
{
var expensiveToCreateService = new ExpensiveToCreateService();
return await expensiveToCreateService.GetProductByIdAsync(id);
}
}
To fix the problem, share a single HttpClient instance across requests. In the following version, the instance is a static field that is created once, in a static constructor, and reused for every call:
public class SingleHttpClientInstanceController : ApiController
{
    private static readonly HttpClient httpClient;

    static SingleHttpClientInstanceController()
    {
        httpClient = new HttpClient();
    }

    // This method uses the shared instance of HttpClient for every call to GetProductAsync.
    public async Task<Product> GetProductAsync(string id)
    {
        var hostName = HttpContext.Current.Request.Url.Host;
        var result = await httpClient.GetStringAsync(string.Format("http://{0}:8080/api/...", hostName));
        return new Product { Name = result };
    }
}
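The same fix applies to the ExpensiveToCreateService example. A sketch of a controller that shares one static instance (the controller name is illustrative) might be:
public class SingleServiceInstanceController : ApiController
{
    // Created once, when the controller type is first used, then shared by all requests.
    private static readonly ExpensiveToCreateService expensiveToCreateService =
        new ExpensiveToCreateService();

    public async Task<Product> GetProductAsync(string id)
    {
        return await expensiveToCreateService.GetProductByIdAsync(id);
    }
}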
Considerations
The key element of this antipattern is repeatedly creating and destroying instances of a shareable object. If a
class is not shareable (not thread-safe), then this antipattern does not apply.
The type of shared resource might dictate whether you should use a singleton or create a pool. The
HttpClient class is designed to be shared rather than pooled. Other objects might support pooling,
enabling the system to spread the workload across multiple instances.
Objects that you share across multiple requests must be thread-safe. The HttpClient class is designed to be
used in this manner, but other classes might not support concurrent requests, so check the available
documentation.
Be careful about setting properties on shared objects, as this can lead to race conditions. For example,
setting DefaultRequestHeaders on the HttpClient class before each request can create a race condition. Set
such properties once (for example, during startup), and create separate instances if you need to configure
different settings.
Some resource types are scarce and should not be held onto. Database connections are an example. Holding
an open database connection that is not required may prevent other concurrent users from gaining access
to the database.
In the .NET Framework, many objects that establish connections to external resources are created by using
static factory methods of other classes that manage these connections. These objects are intended to be
saved and reused, rather than disposed and re-created. For example, in Azure Service Bus, the QueueClient
object is created through a MessagingFactory object. Internally, the MessagingFactory manages connections.
For more information, see Best Practices for performance improvements using Service Bus Messaging.
Example diagnosis
The following sections apply these steps to the sample application described earlier.
Identify points of slowdown or failure
The following image shows results generated using New Relic APM, highlighting operations that have a poor
response time. In this case, the GetProductAsync method in the NewHttpClientInstancePerRequest controller is
worth investigating further. Notice that the error rate also increases when these operations are running.
Examine telemetry data and find correlations
The next image shows data captured using thread profiling, over the same period as the previous image. The
system spends a significant time opening socket connections, and even more time closing them and handling
socket exceptions.
For comparison, the following image shows the stack trace telemetry when a single, shared HttpClient instance is
used. This time, the system spends most of its time performing real work, rather than opening and closing sockets.
The next graph shows a similar load test using a shared instance of the ExpensiveToCreateService object. Again, the
volume of handled requests increases in line with the user load, while the average response time remains low.
Monolithic Persistence antipattern
12/18/2020 • 6 minutes to read • Edit Online
Putting all of an application's data into a single data store can hurt performance, either because it leads to resource
contention, or because the data store is not a good fit for some of the data.
Problem description
Historically, applications have often used a single data store, regardless of the different types of data that the
application might need to store. Usually this was done to simplify the application design, or else to match the
existing skill set of the development team.
Modern cloud-based systems often have additional functional and nonfunctional requirements, and need to store
many heterogeneous types of data, such as documents, images, cached data, queued messages, application logs,
and telemetry. Following the traditional approach and putting all of this information into the same data store can
hurt performance, for two main reasons:
Storing and retrieving large amounts of unrelated data in the same data store can cause contention, which in
turn leads to slow response times and connection failures.
Whichever data store is chosen, it might not be the best fit for all of the different types of data, or it might not
be optimized for the operations that the application performs.
The following example shows an ASP.NET Web API controller that adds a new record to a database and also
records the result to a log. The log is held in the same database as the business data. You can find the complete
sample here.
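The essence of the antipattern is a sketch like the following, in which the business insert and the log write go to the same database. The DataAccess helper and controller name are hypothetical; the PurchaseOrderHeader and MonoLog table names come from the diagnosis later in this section.
public class MonoController : ApiController
{
    // One connection string: business data and log records share a single database.
    private static readonly string connectionString =
        ConfigurationManager.ConnectionStrings["SalesDb"].ConnectionString;

    public async Task<IHttpActionResult> PostAsync([FromBody] string value)
    {
        // Business operation: insert a PurchaseOrderHeader row.
        await DataAccess.InsertPurchaseOrderHeaderAsync(connectionString);

        // Logging: written to the MonoLog table in the same database, so log
        // traffic competes with business traffic for the same resources.
        await DataAccess.LogAsync(connectionString, "MonoLog");

        return Ok();
    }
}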
The rate at which log records are generated will probably affect the performance of the business operations. And if
another component, such as an application process monitor, regularly reads and processes the log data, that can
also affect the business operations.
Considerations
Separate data by the way it is used and how it is accessed. For example, don't store log information and
business data in the same data store. These types of data have significantly different requirements and
patterns of access. Log records are inherently sequential, while business data is more likely to require
random access, and is often relational.
Consider the data access pattern for each type of data. For example, store formatted reports and documents
in a document database such as Cosmos DB, but use Azure Cache for Redis to cache temporary data.
If you follow this guidance but still reach the limits of the database, you may need to scale up the database.
Also consider scaling horizontally and partitioning the load across database servers. However, partitioning
may require redesigning the application. For more information, see Data partitioning.
Example diagnosis
The following sections apply these steps to the sample application described earlier.
Instrument and monitor the system
The following graph shows the results of load testing the sample application described earlier. The test used a step
load of up to 1000 concurrent users.
As the load increases to 700 users, so does the throughput. But at that point, throughput levels off, and the system
appears to be running at its maximum capacity. The average response time gradually increases with user load,
showing that the system can't keep up with demand.
Identify periods of poor performance
If you are monitoring the production system, you might notice patterns. For example, response times might drop
off significantly at the same time each day. This could be caused by a regular workload or scheduled batch job, or
just because the system has more users at certain times. You should focus on the telemetry data for these events.
Look for correlations between increased response times and increased database activity or I/O to shared resources.
If there are correlations, it means the database might be a bottleneck.
Identify which data stores are accessed during those periods
The next graph shows the utilization of database throughput units (DTU) during the load test. (A DTU is a measure
of available capacity, and is a combination of CPU utilization, memory allocation, and I/O rate.) Utilization of DTUs
quickly reached 100%. This is roughly the point where throughput peaked in the previous graph. Database
utilization remained very high until the test finished. There is a slight drop toward the end, which could be caused
by throttling, competition for database connections, or other factors.
Examine the telemetry for the data stores
Instrument the data stores to capture the low-level details of the activity. In the sample application, the data access
statistics showed a high volume of insert operations performed against both the PurchaseOrderHeader table and
the MonoLog table.
Implement the solution and verify the result
The application was modified to send log records to a separate data store. Here are the results of repeating the load test after that change.
The pattern of throughput is similar to the earlier graph, but the point at which performance peaks is
approximately 500 requests per second higher. The average response time is marginally lower. However, these
statistics don't tell the full story. Telemetry for the business database shows that DTU utilization peaks at around
75%, rather than 100%.
Similarly, the maximum DTU utilization of the log database only reaches about 70%. The databases are no longer
the limiting factor in the performance of the system.
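A sketch of the separated approach, assuming a second connection string that points at a dedicated log database (the controller, helper, and names are illustrative), could look like this:
public class PolyController : ApiController
{
    private static readonly string salesConnectionString =
        ConfigurationManager.ConnectionStrings["SalesDb"].ConnectionString;
    private static readonly string logConnectionString =
        ConfigurationManager.ConnectionStrings["LogDb"].ConnectionString; // separate data store for logs

    public async Task<IHttpActionResult> PostAsync([FromBody] string value)
    {
        await DataAccess.InsertPurchaseOrderHeaderAsync(salesConnectionString);
        // Log writes no longer contend with business data for the same database.
        await DataAccess.LogAsync(logConnectionString, "PolyLog");
        return Ok();
    }
}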
Related resources
Choose the right data store
Criteria for choosing a data store
Data Access for Highly Scalable Solutions: Using SQL, NoSQL, and Polyglot Persistence
Data partitioning
No Caching antipattern
12/18/2020 • 8 minutes to read • Edit Online
In a cloud application that handles many concurrent requests, repeatedly fetching the same data can reduce
performance and scalability.
Problem description
When data is not cached, it can cause a number of undesirable behaviors, including:
Repeatedly fetching the same information from a resource that is expensive to access, in terms of I/O overhead
or latency.
Repeatedly constructing the same objects or data structures for multiple requests.
Making excessive calls to a remote service that has a service quota and throttles clients past a certain limit.
In turn, these problems can lead to poor response times, increased contention in the data store, and poor
scalability.
The following example uses Entity Framework to connect to a database. Every client request results in a call to the
database, even if multiple requests are fetching exactly the same data. The cost of repeated requests, in terms of
I/O overhead and data access charges, can accumulate quickly.
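The uncached version is essentially the following sketch; the repository, context, and entity names are assumptions, while the GetByIdAsync name matches the method referenced in the diagnosis below.
public class PersonRepository : IPersonRepository
{
    // Every call goes straight to the database, even when many clients
    // request exactly the same person over and over.
    public async Task<Person> GetByIdAsync(int id)
    {
        using (var context = new AdventureWorksContext())
        {
            return await context.People
                .Where(p => p.Id == id)
                .FirstOrDefaultAsync();
        }
    }
}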
To fix the problem, cache the data by using the Cache-Aside pattern. Notice that in the revised code, the GetAsync method calls the CacheService class, rather than calling the database directly. The
CacheService class first tries to get the item from Azure Cache for Redis. If the value isn't found in the cache, the
CacheService invokes a lambda function that was passed to it by the caller. The lambda function is responsible for
fetching the data from the database. This implementation decouples the repository from the particular caching
solution, and decouples the CacheService from the database.
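A minimal cache-aside sketch of such a CacheService, assuming Azure Cache for Redis via StackExchange.Redis and JSON serialization with Newtonsoft.Json (the key naming, five-minute expiration, and configuration setting are illustrative choices), might look like this:
public static class CacheService
{
    // One multiplexer for the whole application; ConnectionMultiplexer is designed to be shared.
    private static readonly Lazy<ConnectionMultiplexer> connection =
        new Lazy<ConnectionMultiplexer>(() =>
            ConnectionMultiplexer.Connect(ConfigurationManager.AppSettings["RedisConnection"]));

    // Cache-aside: return the cached value if present; otherwise invoke the
    // caller-supplied delegate to load it, then store it with an expiration.
    public static async Task<T> GetCachedItemAsync<T>(string key, Func<Task<T>> loadItem)
    {
        IDatabase cache = connection.Value.GetDatabase();

        string cached = await cache.StringGetAsync(key);
        if (cached != null)
        {
            return JsonConvert.DeserializeObject<T>(cached);
        }

        T item = await loadItem(); // cache miss: fetch from the original data source
        await cache.StringSetAsync(key, JsonConvert.SerializeObject(item), TimeSpan.FromMinutes(5));
        return item;
    }
}
Under these assumptions, the repository's GetAsync method would call something like CacheService.GetCachedItemAsync("person:" + id, () => GetByIdAsync(id)), so the database is queried only on a cache miss.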
Considerations
If the cache is unavailable, perhaps because of a transient failure, don't return an error to the client. Instead,
fetch the data from the original data source. However, be aware that while the cache is being recovered, the
original data store could be swamped with requests, resulting in timeouts and failed connections. (After all,
this is one of the motivations for using a cache in the first place.) Use a technique such as the Circuit Breaker
pattern to avoid overwhelming the data source.
Applications that cache dynamic data should be designed to support eventual consistency.
For web APIs, you can support client-side caching by including a Cache-Control header in request and
response messages, and using ETags to identify versions of objects. For more information, see API
implementation.
You don't have to cache entire entities. If most of an entity is static but only a small piece changes frequently,
cache the static elements and retrieve the dynamic elements from the data source. This approach can help
to reduce the volume of I/O being performed against the data source.
In some cases, if volatile data is short-lived, it can be useful to cache it. For example, consider a device that
continually sends status updates. It might make sense to cache this information as it arrives, and not write it
to a persistent store at all.
To prevent data from becoming stale, many caching solutions support configurable expiration periods, so
that data is automatically removed from the cache after a specified interval. You may need to tune the
expiration time for your scenario. Data that is highly static can stay in the cache for longer periods than
volatile data that may become stale quickly.
If the caching solution doesn't provide built-in expiration, you may need to implement a background
process that occasionally sweeps the cache, to prevent it from growing without limits.
Besides caching data from an external data source, you can use caching to save the results of complex
computations. Before you do that, however, instrument the application to determine whether the application
is really CPU bound.
It might be useful to prime the cache when the application starts. Populate the cache with the data that is
most likely to be used.
Always include instrumentation that detects cache hits and cache misses. Use this information to tune
caching policies, such as what data to cache, and how long to hold data in the cache before it expires.
If the lack of caching is a bottleneck, then adding caching may increase the volume of requests so much that
the web front end becomes overloaded. Clients may start to receive HTTP 503 (Service Unavailable) errors.
These are an indication that you should scale out the front end.
If you need a deeper analysis, you can use a profiler to capture low-level performance data in a test environment
(not the production system). Look at metrics such as I/O request rates, memory usage, and CPU utilization. These
metrics may show a large number of requests to a data store or service, or repeated processing that performs the
same calculation.
Load test the application
The following graph shows the results of load testing the sample application. The load test simulates a step load of
up to 800 users performing a typical series of operations.
The number of successful tests performed each second reaches a plateau, and additional requests are slowed as a
result. The average test time steadily increases with the workload. The response time levels off once the user load
peaks.
Examine data access statistics
Data access statistics and other information provided by a data store can give useful information, such as which
queries are repeated most frequently. For example, in Microsoft SQL Server, the sys.dm_exec_query_stats
management view has statistical information for recently executed queries, and the text of each query is available
from the sys.dm_exec_sql_text function. You can use a tool such as SQL Server Management Studio to query these
dynamic management views and determine how frequently particular queries are performed.
The use count reported for each query indicates how frequently it is run. The following image shows that
the third query was run more than 250,000 times, significantly more than any other query.
Here is the SQL query that is causing so many database requests; it is the query that Entity Framework generates
for the GetByIdAsync method shown earlier.
Implement the solution and verify the result
After you incorporate a cache, repeat the load tests and compare the results to the earlier load tests without a
cache. Here are the load test results after adding a cache to the sample application.
The volume of successful tests still reaches a plateau, but at a higher user load. The request rate at this load is
significantly higher than earlier. Average test time still increases with load, but the maximum response time is 0.05
ms, compared with 1 ms earlier—a 20× improvement.
Related resources
API implementation best practices
Cache-Aside pattern
Caching best practices
Circuit Breaker pattern
Synchronous I/O antipattern
12/18/2020 • 6 minutes to read • Edit Online
Blocking the calling thread while I/O completes can reduce performance and affect vertical scalability.
Problem description
A synchronous I/O operation blocks the calling thread while the I/O completes. The calling thread enters a wait
state and is unable to perform useful work during this interval, wasting processing resources.
Common examples of I/O include:
Retrieving or persisting data to a database or any type of persistent storage.
Sending a request to a web service.
Posting a message or retrieving a message from a queue.
Writing to or reading from a local file.
This antipattern typically occurs because:
It appears to be the most intuitive way to perform an operation.
The application requires a response from a request.
The application uses a library that only provides synchronous methods for I/O.
An external library performs synchronous I/O operations internally. A single synchronous I/O call can block an
entire call chain.
The following code uploads a file to Azure blob storage. There are two places where the code blocks waiting for
synchronous I/O, the CreateIfNotExists method and the UploadFromStream method.
container.CreateIfNotExists();
var blockBlob = container.GetBlockBlobReference("myblob");
// Create or overwrite the "myblob" blob with contents from a local file.
using (var fileStream = File.OpenRead(HostingEnvironment.MapPath("~/FileToUpload.txt")))
{
blockBlob.UploadFromStream(fileStream);
}
Here's an example of waiting for a response from an external service. The GetUserProfile method calls a remote
service that returns a UserProfile .
public interface IUserProfileService
{
    UserProfile GetUserProfile();
}
public SyncController()
{
    _userProfileService = new FakeUserProfileService();
}
// This synchronous method blocks the calling thread while it waits for the remote service to respond.
public UserProfile GetUserProfile()
{
    return _userProfileService.GetUserProfile();
}
You can find the complete code for both of these examples here.
To fix the problem, replace the synchronous I/O calls with asynchronous operations, so that the calling thread is released while the I/O is in progress. For example, here is the blob upload code rewritten to use the asynchronous methods:
await container.CreateIfNotExistsAsync();
var blockBlob = container.GetBlockBlobReference("myblob");
// Create or overwrite the "myblob" blob with contents from a local file.
using (var fileStream = File.OpenRead(HostingEnvironment.MapPath("~/FileToUpload.txt")))
{
    await blockBlob.UploadFromStreamAsync(fileStream);
}
The await operator returns control to the calling environment while the asynchronous operation is performed.
The code after this statement acts as a continuation that runs when the asynchronous operation has completed.
A well designed service should also provide asynchronous operations. Here is an asynchronous version of the web
service that returns user profiles. The GetUserProfileAsync method depends on having an asynchronous version
of the User Profile service.
public interface IUserProfileService
{
Task<UserProfile> GetUserProfileAsync();
}
public AsyncController()
{
_userProfileService = new FakeUserProfileService();
}
// This is a synchronous method that calls the Task-based GetUserProfileAsync method.
public Task<UserProfile> GetUserProfileAsync()
{
return _userProfileService.GetUserProfileAsync();
}
}
For libraries that don't provide asynchronous versions of operations, it may be possible to create asynchronous
wrappers around selected synchronous methods. Follow this approach with caution. While it may improve
responsiveness on the thread that invokes the asynchronous wrapper, it actually consumes more resources. An
extra thread may be created, and there is overhead associated with synchronizing the work done by this thread.
Some tradeoffs are discussed in this blog post: Should I expose asynchronous wrappers for synchronous methods?
Here is an example of an asynchronous wrapper around a synchronous method.
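This is a minimal sketch, assuming a synchronous LibraryIOOperation method exposed by such a library:
// Wraps a blocking library call in a Task. The caller's thread is released,
// but the blocking work still occupies a thread-pool thread while it runs.
public Task<int> LibraryIOOperationAsync()
{
    return Task.Run(() => LibraryIOOperation());
}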
Considerations
I/O operations that are expected to be very short lived and are unlikely to cause contention might be more
performant as synchronous operations. An example might be reading small files on an SSD drive. The
overhead of dispatching a task to another thread, and synchronizing with that thread when the task
completes, might outweigh the benefits of asynchronous I/O. However, these cases are relatively rare, and
most I/O operations should be done asynchronously.
Improving I/O performance may cause other parts of the system to become bottlenecks. For example,
unblocking threads might result in a higher volume of concurrent requests to shared resources, leading in
turn to resource starvation or throttling. If that becomes a problem, you might need to scale out the number
of web servers or partition data stores to reduce contention.
Example diagnosis
The following sections apply these steps to the sample application described earlier.
Monitor web server performance
For Azure web applications and web roles, it's worth monitoring the performance of the IIS web server. In
particular, pay attention to the request queue length to establish whether requests are being blocked waiting for
available threads during periods of high activity. You can gather this information by enabling Azure diagnostics. For
more information, see:
Monitor Apps in Azure App Service
Create and use performance counters in an Azure application
Instrument the application to see how requests are handled once they have been accepted. Tracing the flow of a
request can help to identify whether it is performing slow-running calls and blocking the current thread. Thread
profiling can also highlight requests that are being blocked.
Load test the application
The following graph shows the performance of the synchronous GetUserProfile method shown earlier, under
varying loads of up to 4000 concurrent users. The application is an ASP.NET application running in an Azure Cloud
Service web role.
The synchronous operation is hard-coded to sleep for 2 seconds, to simulate synchronous I/O, so the minimum
response time is slightly over 2 seconds. When the load reaches approximately 2500 concurrent users, the average
response time reaches a plateau, although the volume of requests per second continues to increase. Note that the
scale for these two measures is logarithmic. The number of requests per second doubles between this point and
the end of the test.
In isolation, it's not necessarily clear from this test whether the synchronous I/O is a problem. Under heavier load,
the application may reach a tipping point where the web server can no longer process requests in a timely manner,
causing client applications to receive time-out exceptions.
Incoming requests are queued by the IIS web server and handed to a thread running in the ASP.NET thread pool.
Because each operation performs I/O synchronously, the thread is blocked until the operation completes. As the
workload increases, eventually all of the ASP.NET threads in the thread pool are allocated and blocked. At that
point, any further incoming requests must wait in the queue for an available thread. As the queue length grows,
requests start to time out.
Implement the solution and verify the result
The next graph shows the results from load testing the asynchronous version of the code.
Throughput is far higher. Over the same duration as the previous test, the system successfully handles a nearly
tenfold increase in throughput, as measured in requests per second. Moreover, the average response time is
relatively constant and remains approximately 25 times lower than in the previous test.
Responsible Innovation: A Best Practices Toolkit
12/18/2020 • 2 minutes to read • Edit Online
This toolkit provides developers with a set of practices to use during development for anticipating and addressing
the potential negative impacts of technology on people. We are sharing this as an early-stage practice for feedback
and learning.
Judgment Call
Judgment Call is an award-winning game and team-based activity that puts Microsoft’s AI principles of fairness,
privacy and security, reliability and safety, transparency, inclusion, and accountability into action. The game provides
an easy-to-use method for cultivating stakeholder empathy through scenario-imagining. Game participants write
product reviews from the perspective of a particular stakeholder, describing what kind of impact and harms the
technology could produce from their point of view.
Harms Modeling
Harms Modeling is a framework for product teams, grounded in four core pillars that examine how people's lives
can be negatively impacted by technology: injuries, denial of consequential services, infringement on human rights,
and erosion of democratic & societal structures. Similar to Security Threat Modeling, Harms Modeling enables
product teams to anticipate potential real-world impacts of technology, which is a cornerstone of responsible
development.
Community Jury
Community Jury is a technique that brings together diverse stakeholders impacted by a technology. It is an
adaptation of the citizen jury. The stakeholders are provided an opportunity to learn from experts about a project,
deliberate together, and give feedback on use cases and product design. This technique allows project teams to
collaborate with researchers to identify stakeholder values, and understand the perceptions and concerns of
impacted stakeholders.
Judgment Call
12/18/2020 • 5 minutes to read • Edit Online
Judgment Call is an award-winning game and team-based activity that puts Microsoft’s AI principles of fairness,
privacy and security, reliability and safety, transparency, inclusion, and accountability into action. The game
provides an easy-to-use method for cultivating stakeholder empathy by imagining their scenarios. Game
participants write product reviews from the perspective of a particular stakeholder, describing what kind of impact
and harms the technology could produce from their point of view.
Benefits
Technology builders need practical methods to incorporate ethics in product development, by considering the
values of diverse stakeholders and how technology may uphold or not uphold those values. The goal of the game
is to imagine potential outcomes of a product or platform by gaining a better understanding of stakeholders, and
what they need and expect.
The game helps people discuss challenging topics in a safe space within the context of gameplay, and gives
technology creators a vocabulary to facilitate ethics discussions during product development. It gives managers and
designers an interactive tool to lead ethical dialogue with their teams to incorporate ethical design in their
products.
The theoretical basis and initial outcomes of the Judgment Call game were presented at the 2019 ACM Design
Interactive Systems conference and the game was a finalist in the Fast Company 2019 Innovation by Design
Awards (Social Goods category). The game has been presented to thousands of people in the US and
internationally. While the largest group facilitated in game play has been over 100, each card deck can involve 1-10
players. This game is not intended to be a substitute for performing robust user research. Rather, it's an exercise to
help tech builders learn how to exercise their moral imagination.
Preparation
Judgment Call uses role play to surface ethical concerns in product development so players will anticipate societal
impact, write product reviews, and explore mitigations. Players think of what kind of harms and impacts the
technology might produce by writing product reviews from the point of view of a stakeholder.
To prepare for this game, download the printable Judgment Call game kit.
Discussion
The moderator has each player read their reviews. Everyone is invited to discuss the different perspectives and
other considerations that may not have come up.
Potential moderator questions include:
Who is the most impacted?
What features are problematic?
What are the potential harms?
Harms mitigation
Finally, the group picks a thread from the discussion to begin identifying design or process changes that can
mitigate a particular harm. At the end of each round, the decks are shuffled, and another round can begin with the
same or a different scenario.
Next steps
Once you have enough understanding of potential harms or negative impact your product or scenario may cause,
proceed to learn how to model these harms so you can devise effective mitigations.
Foundations of assessing harm
12/18/2020 • 3 minutes to read • Edit Online
Harms Modeling is a practice designed to help you anticipate the potential for harm, identify gaps in a product that
could put people at risk, and ultimately create approaches that proactively address harm.
Human understanding
In addition to appreciating the importance of human rights, building trustworthy systems requires us to consider
many people's perspectives. Asking who the stakeholders are, what they value, how they could benefit, and how
they could be hurt by our technology, is a powerful step that allows us to design and build better products.
Who does the technology impact?
Who are the customers?
What do they value?
How should they benefit?
How could tech harm them?
Who are the non-customer stakeholders?
What do they value?
How should they benefit?
How could tech harm them?
Asking these questions is a practice in Value Sensitive Design and is the beginning of a better understanding of what
is important to stakeholders, and how it plays into their relationship with the product.
Types of stakeholders
Project sponsors
Backers, decision makers, and owners make up this category. Their values are articulated in the project strategy
and goals.
Tech builders
Designers, developers, project managers, and those working directly on designing systems make up this group.
They bring their own ethical standards and profession-specific values to the system.
Direct & indirect stakeholders
These stakeholders are significantly impacted by the system. This includes end users, software staff, clients,
bystanders, interfacing institutions, and even past or future generations. Non-human factors such as places, for
example, historic buildings or sacred spaces, may also be included.
Marginalized populations
This category is made up of the population frequently considered a minority, vulnerable, or stigmatized. This
includes children, the elderly, members of the LGBTQ+ community, ethnic minorities, and other populations who
often experience unique and disproportionate consequences.
Assessing Harm
Once you have defined the technology purpose, use cases, and stakeholders, conduct a Harms Modeling exercise
to evaluate potential ways the use of a technology you are building could result in negative outcomes for people
and society.
Next Steps
Read Types of harm for further harms analysis.
Types of harm
12/18/2020 • 13 minutes to read • Edit Online
This article creates awareness for the different types of harms, so that appropriate mitigation steps can be
implemented.
Risk of injury
Physical injury
Consider how technology could hurt people or create dangerous environments.
Overreliance on safety features
Description: This points out the dependence on technology to make decisions without adequate human oversight.
Consideration(s): How might people rely on this technology to keep them safe? How could this technology reduce appropriate human oversight?
Example: A healthcare agent could misdiagnose illness, leading to unnecessary treatment.

Inadequate fail-safes
Description: Real-world testing may insufficiently consider a diverse set of users and scenarios.
Consideration(s): If this technology fails or is misused, how would people be impacted? At what point could a human intervene? Are there alternative uses that have not been tested for? How would users be impacted by a system failure?
Example: If an automatic door failed to detect a wheelchair during an emergency evacuation, a person could be trapped if there isn't an accessible override button.

Overreliance on automation
Description: Misguided beliefs can lead users to trust the reliability of a digital agent over that of a human.
Consideration(s): How could this technology reduce direct interpersonal feedback? How might this technology interface with trusted sources of information? How could sole dependence on an artificial agent impact a person?
Example: A chat bot could be relied upon for relationship advice or mental health counseling instead of a trained professional.
Distortion of reality or gaslighting
Description: When intentionally misused, technology can undermine trust and distort someone's sense of reality.
Consideration(s): Could this be used to modify digital media or physical environments?
Example: An IoT device could enable monitoring and controlling of an ex-intimate partner from afar.

Reduced self-esteem/reputation damage
Description: Some shared content can be harmful, false, misleading, or denigrating.
Consideration(s): How could this technology be used to inappropriately share personal information? How could it be manipulated to misuse information and misrepresent people?
Example: Synthetic media "revenge porn" can swap faces, creating the illusion of a person participating in a video, who did not.

Addiction/attention hijacking
Description: Technology could be designed for prolonged interaction, without regard for well-being.
Consideration(s): In what ways might this technology reward or encourage continued interaction beyond delivering user value?
Example: Variable drop rates in video game loot boxes could cause players to keep playing and neglect self-care.

Identity theft
Description: Identity theft may lead to loss of control over personal credentials, reputation, and/or representation.
Consideration(s): How might an individual be impersonated with this technology? How might this technology mistakenly recognize the wrong individual as an authentic user?
Example: Synthetic voice font could mimic the sound of a person's voice and be used to access a bank account.

Misattribution
Description: This includes crediting a person with an action or content that they are not responsible for.
Consideration(s): In what ways might this technology attribute an action to an individual or group? How could someone be affected if an action was incorrectly attributed to them?
Example: Facial recognition can misidentify an individual during a police investigation.

Employment discrimination
Description: Some people may be denied access to apply for or secure a job based on characteristics unrelated to merit.
Consideration(s): Are there ways in which this technology could impact recommendations or decisions related to employment?
Example: Hiring AI could recommend fewer candidates with female-sounding names for interviews.

Housing discrimination
Description: This includes denying people access to housing or the ability to apply for housing.
Consideration(s): How could this technology impact recommendations or decisions related to housing?
Example: Public housing queuing algorithm could cause people with international-sounding names to wait longer for vouchers.
Insurance and benefit discrimination
Description: This includes denying people insurance, social assistance, or access to a medical trial due to biased standards.
Consideration(s): Could this technology be used to determine access, cost, or allocation of insurance or social benefits?
Example: Insurance company might charge higher rates for drivers working night shifts due to algorithmic predictions suggesting increased drunk driving risk.

Educational discrimination
Description: Access to education may be denied because of an unchangeable characteristic.
Consideration(s): How might this technology be used to determine access, cost, accommodations, or other outcomes related to education?
Example: Emotion classifier could incorrectly report that students of color are less engaged than their white counterparts, leading to lower grades.

Loss of choice/network and filter bubble
Description: Presenting people with only information that conforms to and reinforces their own beliefs.
Consideration(s): How might this technology affect which choices and information are made available to people? What past behaviors or preferences might this technology rely on to predict future behaviors or preferences?
Example: News feed could only present information that confirms existing beliefs.
Economic loss
Automating decisions related to financial instruments, economic opportunity, and resources can amplify existing
societal inequities and obstruct well-being.
Credit discrimination
Description: People may be denied access to financial instruments based on characteristics unrelated to economic merit.
Consideration(s): How might this technology rely on existing credit structures to make decisions? How might this technology affect the ability of an individual or group to obtain or maintain a credit score?
Example: Higher introductory rate offers could be sent only to homes in lower socioeconomic postal codes.

Differential pricing of goods and services
Description: Goods or services may be offered at different prices for reasons unrelated to the cost of production or delivery.
Consideration(s): How could this technology be used to determine pricing of goods or services? What are the criteria for determining the cost to people for use of this tech?
Example: More could be charged for products based on designation for men or women.

Economic exploitation
Description: People might be compelled or misled to work on something that impacts their dignity or wellbeing.
Consideration(s): What role did human labor play in producing training data for this technology? How was this workforce acquired? What role does human labor play in supporting this technology? Where is this workforce expected to come from?
Example: Paying financially destitute people for their biometric data to train AI systems.

Devaluation of individual expertise
Description: Technology may supplant the use of paid human expertise or labor.
Consideration(s): How might this technology impact the need to employ an existing workforce?
Example: AI agents replace doctors/radiographers for evaluation of medical imaging.

Dehumanization
Description: Removing, reducing, or obscuring visibility of a person's humanity.
Consideration(s): How might this technology be used to simplify or abstract the way a person is represented? How might this technology reduce the distinction between humans and the digital world?
Example: Entity recognition and virtual overlays in drone surveillance could reduce the perceived accountability of human actions.

Public shaming
Description: This may mean exposing people's private, sensitive, or socially inappropriate material.
Consideration(s): How might movements or actions be revealed through data aggregation?
Example: A fitness app could reveal a user's GPS location on social media, indicating attendance at an Alcoholics Anonymous meeting.
Liberty loss
Automation of legal, judicial, and social systems can reinforce biases and lead to detrimental consequences.
Predictive policing
Description: Inferring suspicious behavior and/or criminal intent based on historical records.
Consideration(s): How could this support or replace human policing or criminal justice decision-making?
Example: An algorithm can predict a number of area arrests, so police make sure they match, or exceed, that number.

Loss of effective remedy
Description: This means an inability to explain the rationale or lack of opportunity to contest a decision.
Consideration(s): How might people understand the reasoning for decisions made by this technology? How might an individual that relies on this technology explain the decisions it makes? How could people contest or question a decision this technology makes?
Example: Automated prison sentence or pre-trial release decision is not explained to the accused.
Privacy loss
Information generated by our use of technology can be used to determine facts or make assumptions about
someone without their knowledge.
Interference with private life
Description: Revealing information a person has not chosen to share.
Consideration(s): How could this technology use information to infer portions of a private life? How could decisions based upon these inferences expose things that a person does not want made public?
Example: Task-tracking AI could monitor personal patterns from which it infers an extramarital affair.

Forced association
Description: Requiring participation in the use of technology or surveillance to take part in society.
Consideration(s): How might use of this technology be required for participation in society or organization membership?
Example: Biometric enrollment in a company's meeting room transcription AI is a stipulated requirement in a job offer letter.

Inability to freely and fully develop personality
Description: This may mean restriction of one's ability to truthfully express themselves or explore external avenues for self-development.
Consideration(s): In what way does the system or product ascribe positive vs negative connotations toward particular personality traits? In what way can using the product or system reveal information to entities such as the government or employer that inhibits free expression?
Example: Intelligent meeting system could record all discussions between colleagues, including personal coaching and mentorship sessions.

Never forgiven
Description: Digital files or records may never be deleted.
Consideration(s): What and where is data being stored from this product, and who can access it? How long is user data stored after technology interaction? How is user data updated or deleted?
Example: A teenager's social media history could remain searchable long after they have outgrown the platform.

Loss of freedom of movement or assembly
Description: This means an inability to navigate the physical or virtual world with desired anonymity.
Consideration(s): In what ways might this technology monitor people across physical and virtual space?
Example: A real name could be required in order to sign up for a video game, enabling real-world stalking.
Environmental impact
The environment can be impacted by every decision in a system or product life cycle, from the amount of cloud
computing needed to retail packaging. Environmental changes can impact entire communities.
Exploitation or depletion of resources
Description: Obtaining the raw materials for a technology, including how it's powered, leads to negative consequences to the environment and its inhabitants.
Consideration(s): What materials are needed to build or run this technology? What energy requirements are needed to build or run this technology?
Example: A local community could be displaced due to the harvesting of rare earth minerals and metals required for some electronic manufacturing.

Electronic waste
Description: Reduced quality of collective well-being because of the inability to repair, recycle, or otherwise responsibly dispose of electronics.
Consideration(s): How might this technology reduce electronic waste by recycling materials or allowing users to self-repair? How might this technology contribute to electronic waste when new versions are released or when current/past versions stop working?
Example: Toxic materials inside discarded electronic devices could leach into the water supply, making local populations ill.

Misinformation
Description: Disguising fake information as legitimate or credible information.
Consideration(s): How might this technology be used to generate misinformation? How could it be used to spread misinformation that appears credible?
Example: Generation of synthetic speech of a political leader sways an election.

Behavioral exploitation
Description: This means exploiting personal preferences or patterns of behavior to induce a desired reaction.
Consideration(s): How might this technology be used to observe patterns of behavior? How could this technology be used to encourage dysfunctional or maladaptive behaviors?
Example: Monitoring shopping habits in a connected retail environment leads to personalized incentives for impulse shoppers and hoarders.
Social detriment
At scale, the way technology impacts people shapes social and economic structures within communities. It can
further ingrain elements that include or benefit some, at the exclusion of others.
Amplification of power inequality
Description: This may perpetuate existing class or privilege disparities.
Consideration(s): How might this technology be used in contexts where there are existing social, economic, or class disparities? How might people with more power or privilege disproportionately influence the technology?
Example: Requiring a residential address & phone number to register on a job website could prevent a homeless person from applying for jobs.

Stereotype reinforcement
Description: This may perpetuate uninformed "conventional wisdom" about historically or statistically underrepresented people.
Consideration(s): How might this technology be used to reinforce or amplify existing social norms or cultural stereotypes? How might the data used by this technology cause it to reflect biases or stereotypes?
Example: Results of an image search for "CEO" could primarily show photos of Caucasian men.

Loss of individuality
Description: This may be an inability to express a unique perspective.
Consideration(s): How might this technology amplify majority opinions or "group-think"? Conversely, how might unique forms of expression be suppressed? In what ways might the data gathered by this technology be used in feedback to people?
Example: Limited customization options in designing a video game avatar inhibits self-expression of a player's diversity.

Loss of representation
Description: Broad categories of generalization obscure, diminish, or erase real identities.
Consideration(s): How could this technology constrain identity options? Could it be used to automatically label or categorize people?
Example: Automatic photo caption assigns incorrect gender identity and age to the subject.
Evaluate harms
Once you have generated a broad list of potential harms, you should complete your Harms Model by evaluating
the potential magnitude for each category of harm. This will allow you to prioritize your areas of focus. See the
following example harms model for reference:
Next Steps
Use the Harms Model you developed to guide your product development work:
Seek more information from stakeholders that you identified as potentially experiencing harm.
Develop and validate hypotheses for addressing the areas you identified as having the highest potential for
harm.
Integrate the insights into your decisions throughout the technology development process: data collection and
model training, system architecture, user experience design, product documentation, feedback loops, and
communication of the capabilities and limitations of the technology.
Explore Community Jury.
Assess and mitigate unfairness using Azure Machine Learning and the open-source FairLearn package.
Other Responsible AI tools:
Responsible AI resource center
Guidelines for Human AI interaction
Conversational AI guidelines
Inclusive Design guidelines
AI Fairness checklist
Additional references:
Downloadable booklet for assessing harms
Value Sensitive Design
Community Jury
12/18/2020 • 7 minutes to read • Edit Online
Community jury, an adaptation of the citizen jury, is a technique where diverse stakeholders impacted by a
technology are provided an opportunity to learn about a project, deliberate together, and give feedback on use
cases and product design. This technique allows project teams to understand the perceptions and concerns of
impacted stakeholders.
A community jury is different from a focus group or market research; it allows the impacted stakeholders to hear
directly from the subject matter experts in the product team, and cocreate solutions to challenging problems with
them.
Benefits
This section discusses some important benefits of community jury.
Expert Testimony
Members of a community jury hear details about the technology under consideration, directly from experts on the
product team. These experts help highlight the capabilities in particular applications.
Proximity
A community jury allows decision-makers to hear directly from the community, and to learn about their values,
concerns, and ideas regarding a particular issue or problem. It also provides a valuable opportunity to better
understand the reasons for their conclusions.
Consensus
By bringing impacted individuals together and providing an opportunity for them to learn and discuss key aspects
of the technology, a community jury can identify areas of agreement and build common ground solutions to
challenging problems.
Session structure
Sessions typically last 2-3 hours. Add more or longer deep dive sessions, as needed, if aspects of the project
require more learning, deliberation, and cocreation.
1. Over view, introduction, and Q&A - The moderator provides a session overview, then introduces the
project team and explains the product's purpose, along with potential use cases, benefits, and harms.
Questions are then accepted from community members. This session should be between 30 to 60 minutes
long.
2. Discussion of key themes - Jury members ask in-depth questions about aspects of the project, fielded by
the moderator. This session should also be between 30 to 60 minutes in length.
3. This step can be any one of the following options:
Deliberation and cocreation - This option is preferable. Jury members deliberate and co-create
solutions with the project team. This is typically 30 to 60 minutes long.
Individual anonymous sur vey - Alternatively, conduct an anonymous survey of jury members.
Anonymity may allow issues to be raised that would not otherwise be expressed. Use this 30-minute
session if there are time constraints.
4. Following the session - The moderator produces a study report that describes key insights, concerns, and
potential solutions to the concerns.
If the values of different stakeholders were in conflict with each other during the session and the value tensions
were left unresolved, the product team would need to brainstorm solutions, and conduct a follow-up session with
the jury to determine if the solutions adequately resolve their concerns.
Additional information
Privacy Index
The Privacy Index is an approximate measure for an individual's concern about personal data privacy, and is
gauged using the following:
A. Consumers have lost all control over how personal information is collected and used by companies.
B. Most businesses handle the personal information they collect about consumers in a proper and confidential
way.
C. Existing laws and organizational practices provide a reasonable level of protection for consumer privacy today.
Participants are asked to respond to each of the above statements on the scale 1 - Strongly Agree, 2 - Somewhat
Agree, 3 - Somewhat Disagree, 4 - Strongly Disagree, and are then classified into the categories below.
High/Fundamentalist => IF A = 1 or 2 AND B & C = 3 or 4
Low/unconcerned => IF A = 3 or 4 AND B & C = 1 or 2
Medium/pragmatist => All other responses
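As an illustration of this classification, the following Python sketch (a hypothetical helper, not part of the source guidance) maps a participant's three responses to a Privacy Index category:

```python
def privacy_index_category(a: int, b: int, c: int) -> str:
    """Classify a participant from responses to statements A, B, and C.

    Responses use the scale 1 - Strongly Agree, 2 - Somewhat Agree,
    3 - Somewhat Disagree, 4 - Strongly Disagree.
    """
    agrees = {1, 2}
    disagrees = {3, 4}
    if a in agrees and b in disagrees and c in disagrees:
        return "High/Fundamentalist"
    if a in disagrees and b in agrees and c in agrees:
        return "Low/Unconcerned"
    return "Medium/Pragmatist"

# Example: strongly agrees with A, somewhat disagrees with B and C.
print(privacy_index_category(1, 3, 3))  # High/Fundamentalist
```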
Next steps
Explore the following references for detailed information on this method:
Jefferson Center: creator of the Citizen's Jury method https://jefferson-center.org/about-us/how-we-work/
Citizen's jury handbook http://www.rachel.org/files/document/Citizens_Jury_Handbook.pdf
Case study: Connected Health Cities (UK)
Project page
Final report
Jury specification
Juror's report
Case study: Community Jury at Microsoft
Azure for AWS Professionals
11/2/2020 • 2 minutes to read • Edit Online
This article helps Amazon Web Services (AWS) experts understand the basics of Microsoft Azure accounts,
platform, and services. It also covers key similarities and differences between the AWS and Azure platforms.
You'll learn:
How accounts and resources are organized in Azure.
How available solutions are structured in Azure.
How the major Azure services differ from AWS services.
Azure and AWS built their capabilities independently over time, so each has important implementation and
design differences.
Overview
Like AWS, Microsoft Azure is built around a core set of compute, storage, database, and networking services. In
many cases, the products and services offered on one platform have a basic equivalent on the other. Both AWS and
Azure allow you to build highly available solutions based on Windows or Linux hosts. So, if you're used to
development using Linux and OSS technology, both platforms can do the job.
While the capabilities of both platforms are similar, the resources that provide those capabilities are often
organized differently. Exact one-to-one relationships between the services required to build a solution are not
always clear. In other cases, a particular service might be offered on one platform, but not the other. See charts of
comparable Azure and AWS services.
Services
For a listing of how services map between platforms, see AWS to Azure services comparison.
Not all Azure products and services are available in all regions. Consult the Products by Region page for details. You
can find the uptime guarantees and downtime credit policies for each Azure product or service on the Service Level
Agreements page.
Components
A number of core components on Azure and AWS have similar functionality. To review the differences, visit the
component page for the topic you're interested in:
Accounts
Compute
Databases
Messaging
Networking
Regions and Zones
Resources
Security & Identity
Storage
Azure and AWS accounts and subscriptions
11/2/2020 • 2 minutes to read • Edit Online
Azure services can be purchased using several pricing options, depending on your organization's size and needs.
See the pricing overview page for details.
Azure subscriptions are a grouping of resources with an assigned owner responsible for billing and permissions
management. Unlike AWS, where any resources created under the AWS account are tied to that account,
subscriptions exist independently of their owner accounts, and can be reassigned to new owners as needed.
Container Service
The Azure Kubernetes Service supports Docker containers managed through Kubernetes.
Batch Processing
Azure Batch allows you to manage compute-intensive work across a scalable collection of virtual machines.
Service Comparison
Virtual servers
AWS SERVICE | AZURE SERVICE | DESCRIPTION
Elastic Compute Cloud (EC2) Instances | Virtual Machines | Virtual servers allow users to deploy, manage, and maintain OS and server software. Instance types provide combinations of CPU/RAM. Users pay for what they use with the flexibility to change sizes.
Auto Scaling | Virtual Machine Scale Sets | Allows you to automatically change the number of VM instances. You set defined metrics and thresholds that determine whether the platform adds or removes instances.
Elastic Container Service (ECS), Fargate | Container Instances | Azure Container Instances is the fastest and simplest way to run a container in Azure, without having to provision any virtual machines or adopt a higher-level orchestration service.
Elastic Kubernetes Service (EKS) | Kubernetes Service (AKS) | Deploy orchestrated containerized applications with Kubernetes. Simplify monitoring and cluster management through auto upgrades and a built-in operations console. See AKS solution journey.
App Mesh | Service Fabric Mesh | Fully managed service that enables developers to deploy microservices applications without managing virtual machines, storage, or networking.
Container architectures
Serverless architectures
Social App for Mobile and Web with Authentication
12/16/2019
3 min read
View a detailed, step-by-step diagram depicting the build process and implementation of the mobile client
app architecture that offers social image sharing with a companion web app and authentication abilities,
even while offline.
See also
Create a Linux VM on Azure using the portal
Azure Reference Architecture: Running a Linux VM on Azure
Get started with Node.js web apps in Azure App Service
Azure Reference Architecture: Basic web application
Create your first Azure Function
Database technologies on Azure and AWS
11/2/2020 • 2 minutes to read • Edit Online
Service comparison
TYPE | AWS SERVICE | AZURE SERVICE | DESCRIPTION
Database migration | Database Migration Service | Database Migration Service | Migration of database schema and data from one database format to a specific database technology in the cloud.
Database architectures
Messaging components
AWS SERVICE | AZURE SERVICE | DESCRIPTION
Simple Queue Service (SQS) | Queue Storage | Provides a managed message queueing service for communicating between decoupled application components.
Messaging architectures
AREA | AWS SERVICE | AZURE SERVICE | DESCRIPTION
Cloud virtual networking | Virtual Private Cloud (VPC) | Virtual Network | Provides an isolated, private environment in the cloud. Users have control over their virtual networking environment, including selection of their own IP address range, creation of subnets, and configuration of route tables and network gateways.
Load balancing | Network Load Balancer | Load Balancer | Azure Load Balancer load balances traffic at layer 4 (TCP or UDP). Standard Load Balancer also supports cross-region or global load balancing.
Networking architectures
See also
Create a virtual network using the Azure portal
Plan and design Azure Virtual Networks
Azure Network Security Best Practices
Regions and zones on Azure and AWS
12/18/2020 • 3 minutes to read • Edit Online
Failures can vary in the scope of their impact. Some hardware failures, such as a failed disk, may affect a single host
machine. A failed network switch could affect a whole server rack. Less common are failures that disrupt a whole
datacenter, such as loss of power in a datacenter. Rarely, an entire region could become unavailable.
One of the main ways to make an application resilient is through redundancy. But you need to plan for this
redundancy when you design the application. Also, the level of redundancy that you need depends on your
business requirements—not every application needs redundancy across regions to guard against a regional
outage. In general, a tradeoff exists between greater redundancy and reliability versus higher cost and complexity.
In Azure, a region is divided into two or more Availability Zones. An Availability Zone corresponds with a physically
isolated datacenter in the geographic region. Azure has numerous features for providing application redundancy at
every level of potential failure, including availability sets, availability zones, and paired regions.
[Diagram: VMs in an availability set within a virtual network; an availability zone with two availability sets in a virtual network; and regional pairs with resources in each region.]
The following sections summarize each option.
Availability sets
To protect against localized hardware failures, such as a disk or network switch failing, deploy two or more VMs in
an availability set. An availability set consists of two or more fault domains that share a common power source and
network switch. VMs in an availability set are distributed across the fault domains, so if a hardware failure affects
one fault domain, network traffic can still be routed to the VMs in the other fault domains. For more information
about Availability Sets, see Manage the availability of Windows virtual machines in Azure.
When VM instances are added to availability sets, they are also assigned an update domain. An update domain is a
group of VMs that are set for planned maintenance events at the same time. Distributing VMs across multiple
update domains ensures that planned update and patching events affect only a subset of these VMs at any given
time.
Availability sets should be organized by the instance's role in your application to ensure one instance in each role is
operational. For example, in a three-tier web application, create separate availability sets for the front-end,
application, and data tiers.
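As a minimal sketch of this guidance, the following Python example (using the azure-mgmt-compute SDK; the subscription ID, resource group, region, and naming convention are assumptions for illustration) creates one availability set per application tier:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

# Assumed subscription and resource group names for illustration only.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "my-three-tier-app-rg"

compute_client = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# One availability set per role, so a hardware failure in one fault domain
# leaves at least one instance of each tier running.
for tier in ["frontend", "application", "data"]:
    compute_client.availability_sets.create_or_update(
        RESOURCE_GROUP,
        f"avset-{tier}",
        {
            "location": "eastus",
            "platform_fault_domain_count": 2,    # spread VMs across fault domains
            "platform_update_domain_count": 5,   # spread planned maintenance
            "sku": {"name": "Aligned"},          # required for VMs with managed disks
        },
    )
```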
Availability zones
An Availability Zone is a physically separate zone within an Azure region. Each Availability Zone has a distinct
power source, network, and cooling. Deploying VMs across availability zones helps to protect an application
against datacenter-wide failures.
Paired regions
To protect an application against a regional outage, you can deploy the application across multiple regions, using
Azure Traffic Manager to distribute internet traffic to the different regions. Each Azure region is paired with another
region. Together, these form a regional pair. With the exception of Brazil South, regional pairs are located within the
same geography in order to meet data residency requirements for tax and law enforcement jurisdiction purposes.
Unlike Availability Zones, which are physically separate datacenters but may be in relatively nearby geographic
areas, paired regions are typically separated by at least 300 miles. This design ensures that large-scale disasters
only affect one of the regions in the pair. Neighboring pairs can be set to sync database and storage service data,
and are configured so that platform updates are rolled out to only one region in the pair at a time.
Azure geo-redundant storage is automatically backed up to the appropriate paired region. For all other resources,
creating a fully redundant solution using paired regions means creating a full copy of your solution in both regions.
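For example, a minimal sketch of enabling geo-redundant storage with the azure-mgmt-storage SDK (the subscription ID, resource group, and account names are assumptions for illustration) might look like this:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # assumed
RESOURCE_GROUP = "my-app-rg"                               # assumed

storage_client = StorageManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Standard_GRS replicates data to the paired region automatically.
poller = storage_client.storage_accounts.begin_create(
    RESOURCE_GROUP,
    "myappgrsstorage001",  # assumed name, must be globally unique
    {
        "location": "eastus",
        "kind": "StorageV2",
        "sku": {"name": "Standard_GRS"},
    },
)
account = poller.result()
print(account.secondary_location)  # the paired region holding the replica
```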
See also
Regions for virtual machines in Azure
Availability options for virtual machines in Azure
High availability for Azure applications
Failure and disaster recovery for Azure applications
Planned maintenance for Linux virtual machines in Azure
Resource management on Azure and AWS
11/2/2020 • 2 minutes to read • Edit Online
The term "resource" in Azure is used in the same way as in AWS, meaning any compute instance, storage object,
networking device, or other entity you can create or configure within the platform.
Azure resources are deployed and managed using one of two models: Azure Resource Manager, or the older Azure
classic deployment model. Any new resources are created using the Resource Manager model.
Resource groups
Both Azure and AWS have entities called "resource groups" that organize resources such as VMs, storage, and
virtual networking devices. However, Azure resource groups are not directly comparable to AWS resource groups.
While AWS allows a resource to be tagged into multiple resource groups, an Azure resource is always associated
with one resource group. A resource created in one resource group can be moved to another group, but can only
be in one resource group at a time. Resource groups are the fundamental grouping used by Azure Resource
Manager.
Resources can also be organized using tags. Tags are key-value pairs that allow you to group resources across your
subscription irrespective of resource group membership.
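To illustrate, here is a minimal Python sketch (azure-mgmt-resource SDK; the subscription ID, group name, and tag values are assumptions) that creates a resource group and applies tags used for reporting across groups:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # assumed

resource_client = ResourceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# The resource group is the unit of deployment and management;
# tags let you slice cost and inventory reports across groups.
resource_client.resource_groups.create_or_update(
    "my-app-rg",  # assumed name
    {
        "location": "eastus",
        "tags": {"environment": "production", "cost-center": "marketing"},
    },
)

# List every resource group carrying a given tag, regardless of owner.
for rg in resource_client.resource_groups.list(filter="tagName eq 'environment'"):
    print(rg.name, rg.tags)
```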
Management interfaces
Azure offers several ways to manage your resources:
Web interface. Like the AWS Dashboard, the Azure portal provides a full web-based management interface
for Azure resources.
REST API. The Azure Resource Manager REST API provides programmatic access to most of the features
available in the Azure portal.
Command Line. The Azure CLI provides a command-line interface capable of creating and managing Azure
resources. The Azure CLI is available for Windows, Linux, and Mac OS.
PowerShell. The Azure modules for PowerShell allow you to execute automated management tasks using a
script. PowerShell is available for Windows, Linux, and Mac OS.
Templates. Azure Resource Manager templates provide similar JSON template-based resource management
capabilities to the AWS CloudFormation service.
In each of these interfaces, the resource group is central to how Azure resources get created, deployed, or modified.
This is similar to the role a "stack" plays in grouping AWS resources during CloudFormation deployments.
The syntax and structure of these interfaces are different from their AWS equivalents, but they provide comparable
capabilities. In addition, many third-party management tools used on AWS, like Hashicorp's Terraform and Netflix
Spinnaker, are also available on Azure.
See also
Azure resource group guidelines
Security and identity on Azure and AWS
11/2/2020 • 2 minutes to read • Edit Online
See also
Getting started with Microsoft Azure security
Azure Identity Management and access control security best practices
Comparing storage on Azure and AWS
11/2/2020 • 3 minutes to read • Edit Online
Storage comparison
Object storage
AWS SERVICE | AZURE SERVICE | DESCRIPTION
Simple Storage Services (S3) | Blob storage | Object storage service, for use cases including cloud applications, content distribution, backup, archiving, disaster recovery, and big data analytics.
Elastic Block Store (EBS) | Managed disks | SSD storage optimized for I/O intensive read/write operations. For use as high-performance Azure virtual machine storage.
Shared files
AWS SERVICE | AZURE SERVICE | DESCRIPTION
S3 Infrequent Access (IA) | Storage cool tier | Cool storage is a lower-cost tier for storing data that is infrequently accessed and long-lived.
S3 Glacier, Deep Archive | Storage archive access tier | Archive storage has the lowest storage cost and higher data retrieval costs compared to hot and cool storage.
Hybrid storage
Storage architectures
See also
Microsoft Azure Storage Performance and Scalability Checklist
Azure Storage security guide
Best practices for using content delivery networks (CDNs)
AWS to Azure services comparison
12/18/2020 • 22 minutes to read • Edit Online
This article helps you understand how Microsoft Azure services compare to Amazon Web Services (AWS). Whether
you are planning a multicloud solution with Azure and AWS, or migrating to Azure, you can compare the IT
capabilities of Azure and AWS services in all categories.
This article compares services that are roughly comparable. Not every AWS service or Azure service is listed, and
not every matched service has exact feature-for-feature parity.
Marketplace
AWS SERVICE | AZURE SERVICE | DESCRIPTION
Alexa Skills Kit | Bot Framework | Build and connect intelligent bots that interact with your users using text/SMS, Skype, Teams, Slack, Microsoft 365 mail, Twitter, and other popular services.
Polly, Transcribe | Speech Services | Enables both Speech to Text and Text to Speech capabilities.
Lake Formation | Data Share | A simple and safe service for sharing big data.
Automated enterprise BI
6/03/2020
13 min read
Automate an extract, load, and transform (ELT) workflow in Azure using Azure Data Factory with Azure
Synapse Analytics.
Big data processing
AWS SERVICE | AZURE SERVICE | DESCRIPTION
Data Pipeline, Glue | Data Factory | Processes and moves data between different compute and storage services, as well as on-premises data sources, at specified intervals. Create, schedule, orchestrate, and manage data pipelines.
Analytics architectures
Automated enterprise BI
6/03/2020
13 min read
Automate an extract, load, and transform (ELT) workflow in Azure using Azure Data Factory with Azure
Synapse Analytics.
Compute
Virtual servers
AWS SERVICE | AZURE SERVICE | DESCRIPTION
Elastic Compute Cloud (EC2) Instances | Virtual Machines | Virtual servers allow users to deploy, manage, and maintain OS and server software. Instance types provide combinations of CPU/RAM. Users pay for what they use with the flexibility to change sizes.
Auto Scaling | Virtual Machine Scale Sets | Allows you to automatically change the number of VM instances. You set defined metrics and thresholds that determine whether the platform adds or removes instances.
Elastic Container Service (ECS), Fargate | Container Instances | Azure Container Instances is the fastest and simplest way to run a container in Azure, without having to provision any virtual machines or adopt a higher-level orchestration service.
Elastic Kubernetes Service (EKS) | Kubernetes Service (AKS) | Deploy orchestrated containerized applications with Kubernetes. Simplify monitoring and cluster management through auto upgrades and a built-in operations console. See AKS solution journey.
App Mesh | Service Fabric Mesh | Fully managed service that enables developers to deploy microservices applications without managing virtual machines, storage, or networking.
Container architectures
Serverless architectures
Database
TYPE | AWS SERVICE | AZURE SERVICE | DESCRIPTION
Database migration | Database Migration Service | Database Migration Service | Migration of database schema and data from one database format to a specific database technology in the cloud.
Database architectures
CodePipeline
Command Line Interface | CLI, PowerShell | Built on top of the native REST API across all cloud services, various programming language-specific wrappers provide easier ways to create solutions.
Azure Automation
DevOps architectures
Container CI/CD using Jenkins and Kubernetes on Azure Kubernetes Service (AKS)
12/16/2019
2 min read
Containers make it easy for you to continuously build and deploy applications. By orchestrating the
deployment of those containers using Azure Kubernetes Service (AKS), you can achieve replicable,
manageable clusters of containers.
AWS SERVICE | AZURE SERVICE | DESCRIPTION
Kinesis Firehose, Kinesis Streams | Event Hubs | Services that allow the mass ingestion of small data inputs, typically from devices and sensors, to process and route the data.
IoT Things Graph | Digital Twins | Azure Digital Twins is an IoT service that helps you create comprehensive models of physical environments. Create spatial intelligence graphs to model the relationships and interactions between people, places, and devices. Query data from a physical space rather than disparate sensors.
IoT architectures
Management
AWS SERVICE | AZURE SERVICE | DESCRIPTION
Usage and Billing Report | Billing API | Services to help generate, monitor, forecast, and share billing data for resource usage by time, organization, or product resources.
Personal Health Dashboard | Resource Health | Provides detailed information about the health of resources, as well as recommended actions for maintaining resource health.
Simple Queue Service (SQS) | Queue Storage | Provides a managed message queueing service for communicating between decoupled application components.
Mobile services
Device Farm
The AWS Device Farm provides cross-device testing services. In Azure, Visual Studio App Center provides similar
cross-device front-end testing for mobile devices.
In addition to front-end testing, the Azure DevTest Labs provides back-end testing resources for Linux and
Windows environments.
Mobile architectures
Scalable web and mobile applications using Azure Database for PostgreSQL
12/16/2019
1 min read
Use Azure Database for PostgreSQL to rapidly build engaging, performant, and scalable cross-platform and
native apps for iOS, Android, Windows, or Mac.
Social App for Mobile and Web with Authentication
12/16/2019
3 min read
View a detailed, step-by-step diagram depicting the build process and implementation of the mobile client
app architecture that offers social image sharing with a companion web app and authentication abilities,
even while offline.
Networking
AREA | AWS SERVICE | AZURE SERVICE | DESCRIPTION
Cloud virtual networking | Virtual Private Cloud (VPC) | Virtual Network | Provides an isolated, private environment in the cloud. Users have control over their virtual networking environment, including selection of their own IP address range, creation of subnets, and configuration of route tables and network gateways.
Load balancing | Network Load Balancer | Load Balancer | Azure Load Balancer load balances traffic at layer 4 (TCP or UDP). Standard Load Balancer also supports cross-region or global load balancing.
Networking architectures
AWS SERVICE | AZURE SERVICE | DESCRIPTION
Identity and Access Management (IAM) | Azure Active Directory | Allows users to securely control access to services and resources while offering data security and protection. Create and manage users and groups, and use permissions to allow and deny access to resources.
Identity and Access Management (IAM) | Role Based Access Control | Role-based access control (RBAC) helps you manage who has access to Azure resources, what they can do with those resources, and what areas they have access to.
Directory Service | Azure Active Directory Domain Services | Provides managed domain services such as domain join, group policy, LDAP, and Kerberos/NTLM authentication that are fully compatible with Windows Server Active Directory.
Encryption
AWS SERVICE | AZURE SERVICE | DESCRIPTION
Server-side encryption with Amazon S3 Key Management Service | Azure Storage Service Encryption | Helps you protect and safeguard your data and meet your organizational security and compliance commitments.
Key Management Service (KMS), CloudHSM | Key Vault | Provides security solution and works with other services by providing a way to manage, create, and control encryption keys stored in hardware security modules (HSM).
Firewall
AWS SERVICE | AZURE SERVICE | DESCRIPTION
Web Application Firewall | Web Application Firewall | A firewall that protects web applications from common web exploits.
Security
AWS SERVICE | AZURE SERVICE | DESCRIPTION
Certificate Manager | App Service Certificates available on the Portal | Service that allows customers to create, manage, and consume certificates seamlessly in the cloud.
Security architectures
Storage
Object storage
AWS SERVICE | AZURE SERVICE | DESCRIPTION
Simple Storage Services (S3) | Blob storage | Object storage service, for use cases including cloud applications, content distribution, backup, archiving, disaster recovery, and big data analytics.
Virtual server disks
AWS SERVICE | AZURE SERVICE | DESCRIPTION
Elastic Block Store (EBS) | Managed disks | SSD storage optimized for I/O intensive read/write operations. For use as high-performance Azure virtual machine storage.
Shared files
AWS SERVICE | AZURE SERVICE | DESCRIPTION
S3 Infrequent Access (IA) | Storage cool tier | Cool storage is a lower-cost tier for storing data that is infrequently accessed and long-lived.
S3 Glacier, Deep Archive | Storage archive access tier | Archive storage has the lowest storage cost and higher data retrieval costs compared to hot and cool storage.
Hybrid storage
Storage architectures
Web applications
AWS SERVICE | AZURE SERVICE | DESCRIPTION
Global Accelerator | Cross-regional load balancer | Distribute and load balance traffic across multiple Azure regions via a single, static, global anycast public IP address.
Web architectures
Miscellaneous
AREA | AWS SERVICE | AZURE SERVICE | DESCRIPTION
Backend process logic | Step Functions | Logic Apps | Cloud technology to build distributed applications using out-of-the-box connectors to reduce integration challenges. Connect apps, data, and devices on-premises or in the cloud.
More learning
If you are new to Azure, review the interactive Core Cloud Services - Introduction to Azure module on Microsoft
Learn.
Azure for GCP Professionals
12/18/2020 • 8 minutes to read • Edit Online
This article helps Google Cloud Platform (GCP) experts understand the basics of Microsoft Azure accounts,
platform, and services. It also covers key similarities and differences between the GCP and Azure platforms.
You'll learn:
How accounts and resources are organized in Azure.
How available solutions are structured in Azure.
How the major Azure services differ from GCP services.
Azure and GCP built their capabilities independently over time, so each has important implementation and
design differences.
Overview
Like GCP, Microsoft Azure is built around a core set of compute, storage, database, and networking services. In
many cases, the products and services offered on one platform have a basic equivalent on the other. Both GCP and
Azure allow you to build highly available solutions based on Linux or Windows hosts. So, if you're used to
development using Linux and OSS technology, both platforms can do the job.
While the capabilities of both platforms are similar, the resources that provide those capabilities are often
organized differently. Exact one-to-one relationships between the services required to build a solution are not
always clear. In other cases, a particular service might be offered on one platform, but not the other.
Resource management
The term "resource" in Azure means any compute instance, storage object, networking device, or other entity you
can create or configure within the platform.
Azure resources are deployed and managed using one of two models: Azure Resource Manager, or the older Azure
classic deployment model. Any new resources are created using the Resource Manager model.
Resource groups
Azure additionally has an entity called a "resource group" that organizes resources such as VMs, storage, and virtual
networking devices. An Azure resource is always associated with one resource group. A resource created in one
resource group can be moved to another group but can only be in one resource group at a time. Resource groups
are the fundamental grouping used by Azure Resource Manager.
Resources can also be organized using tags. Tags are key-value pairs that allow you to group resources across your
subscription irrespective of resource group membership.
Management interfaces
Azure offers several ways to manage your resources:
Web interface. The Azure portal provides a full web-based management interface for Azure resources.
REST API. The Azure Resource Manager REST API provides programmatic access to most of the features
available in the Azure portal.
Command Line. The Azure CLI provides a command-line interface capable of creating and managing Azure
resources. The Azure CLI is available for Windows, Linux, and Mac OS.
PowerShell. The Azure modules for PowerShell allow you to execute automated management tasks using a
script. PowerShell is available for Windows, Linux, and Mac OS.
Templates. Azure Resource Manager templates provide JSON template-based resource management
capabilities.
In each of these interfaces, the resource group is central to how Azure resources get created, deployed, or modified.
In addition, many third-party management tools like Hashicorp's Terraform and Netflix Spinnaker, are also
available on Azure.
See also
Azure resource group guidelines
Availability sets
To protect against localized hardware failures, such as a disk or network switch failing, deploy two or more VMs in
an availability set. An availability set consists of two or more fault domains that share a common power source and
network switch. VMs in an availability set are distributed across the fault domains, so if a hardware failure affects
one fault domain, network traffic can still be routed to the VMs in the other fault domains. For more information
about Availability Sets, see Manage the availability of Windows virtual machines in Azure.
When VM instances are added to availability sets, they are also assigned an update domain. An update domain is a
group of VMs that are set for planned maintenance events at the same time. Distributing VMs across multiple
update domains ensures that planned update and patching events affect only a subset of these VMs at any given
time.
Availability sets should be organized by the instance's role in your application to ensure one instance in each role is
operational. For example, in a three-tier web application, create separate availability sets for the front-end,
application, and data tiers.
Availability zones
Like GCP, Azure regions can have Availability zones. An Availability Zone is a physically separate zone within an
Azure region. Each Availability Zone has a distinct power source, network, and cooling. Deploying VMs across
availability zones helps to protect an application against datacenter-wide failures.
Services
For a listing of how services map between platforms, see GCP to Azure services comparison.
Not all Azure products and services are available in all regions. Consult the Products by Region page for details. You
can find the uptime guarantees and downtime credit policies for each Azure product or service on the Service Level
Agreements page.
Next steps
Get started with Azure
Azure solution architectures
Azure Reference Architectures
GCP to Azure services comparison
12/18/2020 • 20 minutes to read • Edit Online
This article helps you understand how Microsoft Azure services compare to Google Cloud Platform (GCP). Whether
you are planning a multi-cloud solution with Azure and GCP, or migrating to Azure, you can compare the IT
capabilities of Azure and GCP services in all categories.
This article compares services that are roughly comparable. Not every GCP service or Azure service is listed, and
not every matched service has exact feature-for-feature parity.
For an overview of Azure for GCP users, see Introduction to Azure for GCP Professionals.
Marketplace
GCP SERVICE | AZURE SERVICE | DESCRIPTION
AI Building blocks - Sight | Azure Cognitive Services Computer Vision | Use visual data processing to label content, from objects to concepts, extract printed and handwritten text, recognize familiar subjects like brands and landmarks, and moderate content. No machine learning expertise is required.
AI Building blocks - Language | Azure Cognitive Services Text Analytics | Cloud-based services that provide advanced natural language processing over raw text, and include four main functions: sentiment analysis, key phrase extraction, language detection, and named entity recognition.
AI Building blocks - Language | Azure Cognitive Services Language Understanding (LUIS) | A machine learning-based service to build natural language understanding into apps, bots, and IoT devices. Quickly create enterprise-ready, custom models that continuously improve.
AI Building blocks - Conversation | Azure Cognitive Services Speech To Text | Swiftly convert audio into text from a variety of sources. Customize models to overcome common speech recognition barriers, such as unique vocabularies, speaking styles, or background noise.
AI Building blocks - Structured Data | Azure ML - Automated Machine Learning | Empower professional and non-professional data scientists to build machine learning models rapidly. Automate time-consuming and iterative tasks of model development using breakthrough research, and accelerate time to market. Available in Azure Machine Learning, Power BI, ML.NET, and Visual Studio.
AI Building blocks - Structured Data | ML.NET Model Builder | ML.NET Model Builder provides an easy to understand visual interface to build, train, and deploy custom machine learning models. Prior machine learning expertise is not required. Model Builder supports AutoML, which automatically explores different machine learning algorithms and settings to help you find the one that best suits your scenario.
AI Building blocks - Cloud AutoML | Azure Cognitive Services Custom Vision | Customize and embed state-of-the-art computer vision for specific domains. Build frictionless customer experiences, optimize manufacturing processes, accelerate digital marketing campaigns, and more. No machine learning expertise is required.
AI Building blocks - Cloud AutoML | Video Indexer | Easily extract insights from your videos and quickly enrich your applications to enhance discovery and engagement.
AI Building blocks - Cloud AutoML | Azure Cognitive Services QnA Maker | Build, train, and publish a sophisticated bot using FAQ pages, support websites, product manuals, SharePoint documents, or editorial content through an easy-to-use UI or via REST APIs.
AI Platform Notebooks | Azure Notebooks | Develop and run code from anywhere with Jupyter notebooks on Azure.
Deep Learning VM Image | Data Science Virtual Machines | Pre-configured environments in the cloud for data science and AI development.
Deep Learning Containers | GPU support on Azure Kubernetes Service (AKS) | Graphical processing units (GPUs) are often used for compute-intensive workloads such as graphics and visualization workloads. AKS supports the creation of GPU-enabled node pools to run these compute-intensive workloads in Kubernetes.
Data Labeling Service | Azure ML - Data Labeling | A central place to create, manage, and monitor labeling projects (public preview). Use it to coordinate data, labels, and team members to efficiently manage labeling tasks. Machine Learning supports image classification, either multi-label or multi-class, and object identification with bounded boxes.
Continuous Evaluation | Azure ML - Data Drift | Monitor for data drift between the training dataset and inference data of a deployed model. In the context of machine learning, trained machine learning models may experience degraded prediction performance because of drift. With Azure Machine Learning, you can monitor data drift and the service can send an email alert to you when drift is detected.
Dialogflow | Microsoft Bot Framework | Build and connect intelligent bots that interact with your users using text/SMS, Skype, Teams, Slack, Microsoft 365 mail, Twitter, and other popular services.
BigQuery | SQL Server Big Data Clusters | Allow you to deploy scalable clusters of SQL Server, Spark, and HDFS containers running on Kubernetes. These components run side by side to enable you to read, write, and process big data from Transact-SQL or Spark, allowing you to easily combine and analyze your high-value relational data with high-volume big data.
Cloud Data Fusion | Azure Data Factory | Processes and moves data between different compute and storage services, as well as on-premises data sources, at specified intervals. Create, schedule, orchestrate, and manage data pipelines.
Cloud Data Catalog | Azure Data Catalog | A fully managed service that serves as a system of registration and system of discovery for enterprise data sources.
Analytics and visualization
Compute
Virtual servers
GCP SERVICE | AZURE SERVICE | DESCRIPTION
Compute Engine | Azure Virtual Machines | Virtual servers allow users to deploy, manage, and maintain OS and server software. Instance types provide combinations of CPU/RAM. Users pay for what they use with the flexibility to change sizes.
Compute Engine Managed Instance Groups | Azure Virtual Machine Scale Sets | Allows you to automatically change the number of VM instances. You set defined metrics and thresholds that determine whether the platform adds or removes instances.
VMware as a service | Azure VMware by CloudSimple | Redeploy and extend your VMware-based enterprise workloads to Azure with Azure VMware Solution by CloudSimple. Keep using the VMware tools you already know to manage workloads on Azure without disrupting network, security, or data protection policies.
Cloud Run | Azure Container Instances | Azure Container Instances is the fastest and simplest way to run a container in Azure, without having to provision any virtual machines or adopt a higher-level orchestration service.
Artifact Registry (beta), Container Registry | Azure Container Registry | Allows customers to store Docker formatted images. Used to create all types of container deployments on Azure.
Kubernetes Engine (GKE) | Azure Kubernetes Service (AKS) | Deploy orchestrated containerized applications with Kubernetes. Simplify cluster management and monitoring through automatic upgrades and a built-in operations console. See AKS solution journey.
Kubernetes Engine Monitoring | Azure Monitor for containers | Azure Monitor for containers is a feature designed to monitor the performance of container workloads deployed to: managed Kubernetes clusters hosted on Azure Kubernetes Service (AKS); self-managed Kubernetes clusters hosted on Azure using AKS Engine; Azure Container Instances; self-managed Kubernetes clusters hosted on Azure Stack or on-premises; or Azure Red Hat OpenShift.
Anthos Service Mesh | Service Fabric Mesh | Fully managed service that enables developers to deploy microservices applications without managing virtual machines, storage, or networking.
Container architectures
Here are some architectures that use AKS as the orchestrator.
Database
TYPE | GCP SERVICE | AZURE SERVICE | DESCRIPTION
Relational database (PaaS) | Cloud SQL (SQL Server, MySQL, PostgreSQL) | SQL Server Options, Azure Database for MySQL, Azure Database for PostgreSQL | Managed relational database service where resiliency, security, scale, and maintenance are primarily handled by the platform.
GCP SERVICE | AZURE SERVICE | DESCRIPTION
Cloud Source Repositories | Azure DevOps Repos, GitHub Repos | A cloud service for collaborating on code development.
Cloud Build | Azure DevOps Pipelines, GitHub Actions | Fully managed build service that supports continuous integration and deployment.
Artifact Registry | Azure DevOps Artifacts, GitHub Packages | Add fully integrated package management to your continuous integration/continuous delivery (CI/CD) pipelines with a single click. Create and share Maven, npm, NuGet, and Python package feeds from public and private sources with teams of any size.
GCP SERVICE | AZURE SERVICE | DESCRIPTION
Cloud Developer Tools (including Cloud Code) | Azure Developer Tools | Collection of tools for building, debugging, deploying, diagnosing, and managing multiplatform scalable apps and services.
Cloud Deployment Manager | Azure Resource Manager | Provides a way for users to automate the manual, long-running, error-prone, and frequently repeated IT tasks.
Cloud IoT Core | Azure IoT Hub, Azure Event Hubs | A cloud gateway for managing bidirectional communication with billions of IoT devices, securely and at scale.
Cloud Pub/Sub | Azure Stream Analytics, HDInsight Kafka | Process and route streaming data to a subsequent processing engine or to a storage or database platform.
Edge TPU | Azure IoT Edge | Deploy cloud intelligence directly on IoT devices to run in on-premises scenarios.
Management
GCP SERVICE | AZURE SERVICE | DESCRIPTION
Cloud Pub/Sub | Azure Event Grid | A fully managed event routing service that allows for uniform event consumption using a publish/subscribe model.
Networking
AREA | GCP SERVICE | AZURE SERVICE | DESCRIPTION
Cloud virtual networking | Virtual Private Cloud (VPC) | Azure Virtual Network (VNet) | Provides an isolated, private environment in the cloud. Users have control over their virtual networking environment, including selection of their own IP address range, adding/updating address ranges, creation of subnets, and configuration of route tables and network gateways.
DNS management | Cloud DNS | Azure DNS | Manage your DNS records using the same credentials and billing and support contract as your other Azure services.
| Cloud VPN Gateway | Azure Virtual WAN | Azure Virtual WAN simplifies large-scale branch connectivity with VPN and ExpressRoute.
Load balancing | Network Load Balancing | Azure Load Balancer | Azure Load Balancer load-balances traffic at layer 4 (all TCP or UDP).
Global load balancing | | Azure Front Door | Azure Front Door enables global load balancing across regions using a single anycast IP.
Content delivery network | Cloud CDN | Azure CDN | A content delivery network (CDN) is a distributed network of servers that can efficiently deliver web content to users.
Web Application Firewall | Cloud Armor | Application Gateway - Web Application Firewall | Azure Web Application Firewall (WAF) provides centralized protection of your web applications from common exploits and vulnerabilities.
NAT Gateway | Cloud NAT | Azure Virtual Network NAT | Virtual Network NAT (network address translation) provides outbound NAT translations for internet connectivity for virtual networks.
Private Connectivity to PaaS | VPC Service controls | Azure Private Link | Azure Private Link enables you to access Azure PaaS services and Azure-hosted customer-owned/partner services over a private endpoint in your virtual network.
Telemetry | VPC Flow logs | NSG Flow logs | Network security group (NSG) flow logs are a feature of Network Watcher that allows you to view information about ingress and egress IP traffic through an NSG.
Other Connectivity Options | Direct Interconnect, Partner Interconnect, Carrier Peering | S2S, P2S | Point to Site lets you create a secure connection to your virtual network from an individual client computer. Site to Site is a connection between two or more networks, such as a corporate network and a branch office network.
Security
AREA | GCP SERVICE | AZURE SERVICE | DESCRIPTION
Authentication and Authorization | Cloud IAM | Azure Active Directory | Allows users to securely control access to services and resources while offering data security and protection. Create and manage users and groups and use permissions to allow and deny access to resources.
Encryption | Cloud KMS, Secret Manager | Azure Key Vault | Provides security solution and works with other services by providing a way to manage, create, and control encryption keys stored in hardware security modules (HSM).
Data at rest encryption | Encryption by default | Azure Storage Service Encryption - encryption by default | Azure Storage Service Encryption helps you protect and safeguard your data and meet your organizational security and compliance commitments.
Storage
Object storage
GCP SERVICE | AZURE SERVICE | DESCRIPTION
Cloud Storage, Cloud Storage for Firebase | Azure Blob storage | Object storage service, for use cases including cloud applications, content distribution, backup, archiving, disaster recovery, and big data analytics.
Persistent Disk, Local SSD | Azure managed disks | SSD storage optimized for I/O intensive read/write operations. For use as high-performance Azure virtual machine storage.
File Storage
GCP SERVICE | AZURE SERVICE | DESCRIPTION
Filestore | Azure Files, Azure NetApp Files | File-based storage and hosted NetApp Appliance storage.
Drive Enterprise | OneDrive for Business | Shared files from personal devices to the cloud.
Application services
Miscellaneous
AREA | GCP SERVICE | AZURE SERVICE | DESCRIPTION
Generic | Database migration guide (SQL or Open Source Database) | Database migration guide | Migration of database schema and data from one database format to a specific database technology in the cloud.
SQL Server | Database Migration Assessment Tool | Data Migration Assistant | Assessment of SQL Server database before migration.
SQL Server | Database Migration Tool | Data Migration Service | Actual migration of database to Azure.
Open source | Database migration tools like mysqldump, pg_dump | Database native tool | Actual migration of open source database to Azure.
Assessment and migration tool | | Azure Migrate | Assesses on-premises workloads for migration to Azure, performs performance-based sizing, and provides cost estimations.
More learning
If you are new to Azure, review the interactive Core Cloud Services - Introduction to Azure module on Microsoft
Learn.
Microsoft Azure Well-Architected Framework
12/18/2020 • 9 minutes to read • Edit Online
The Azure Well-Architected Framework is a set of guiding tenets that can be used to improve the quality of a
workload. The framework consists of five pillars of architecture excellence: Cost Optimization, Operational
Excellence, Performance Efficiency, Reliability, and Security.
To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, see the
Microsoft Azure Well-Architected Review.
Cost Optimization
When you are designing a cloud solution, focus on generating incremental value early. Apply the principles of
Build-Measure-Learn to accelerate your time to market while avoiding capital-intensive solutions. Use the pay-
as-you-go strategy for your architecture, and invest in scaling out rather than delivering a large-investment first
version. Consider opportunity costs in your architecture, and the balance between first-mover advantage versus
"fast follow". Use the cost calculators to estimate the initial cost and operational costs. Finally, establish policies,
budgets, and controls that set cost limits for your solution.
Cost guidance
Review cost principles
Develop a cost model
Create budgets and alerts
Review the cost optimization checklist
Operational Excellence
This pillar covers the operations processes that keep an application running in production. Deployments must be
reliable and predictable. They should be automated to reduce the chance of human error. They should be a fast and
routine process, so they don't slow down the release of new features or bug fixes. Equally important, you must be
able to quickly roll back or roll forward if an update has problems.
Monitoring and diagnostics are crucial. Cloud applications run in a remote data-center where you do not have full
control of the infrastructure or, in some cases, the operating system. In a large application, it's not practical to log
into VMs to troubleshoot an issue or sift through log files. With PaaS services, there may not even be a dedicated
VM to log into. Monitoring and diagnostics give insight into the system, so that you know when and where failures
occur. All systems must be observable. Use a common and consistent logging schema that lets you correlate
events across systems.
The monitoring and diagnostics process has several distinct phases:
Instrumentation. Generating the raw data, from application logs, web server logs, diagnostics built into the
Azure platform, and other sources.
Collection and storage. Consolidating the data into one place.
Analysis and diagnosis. To troubleshoot issues and see the overall health.
Visualization and alerts. Using telemetry data to spot trends or alert the operations team.
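As a small illustration of a common and consistent logging schema, the following Python sketch (the field names and correlation-id convention are assumptions, not a prescribed schema) emits structured JSON events that can be correlated across systems during the collection and analysis phases:

```python
import json
import logging
import uuid
from datetime import datetime, timezone

logger = logging.getLogger("orders-api")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event: str, correlation_id: str, **fields) -> None:
    """Emit one structured log record; every service uses the same field names."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": "orders-api",
        "event": event,
        "correlationId": correlation_id,  # ties events together across systems
        **fields,
    }
    logger.info(json.dumps(record))

# A request is tagged once; downstream calls reuse the same correlation id.
correlation_id = str(uuid.uuid4())
log_event("order.received", correlation_id, orderId="12345")
log_event("payment.authorized", correlation_id, amount=42.50)
```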
Use the DevOps checklist to review your design from a management and DevOps standpoint.
Operational excellence guidance
Design patterns for management and monitoring
Best practices: Monitoring and diagnostics
Performance efficiency
Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an
efficient manner. The main ways to achieve this are to scale appropriately and to implement PaaS offerings that
have scaling built in.
There are two main ways that an application can scale. Vertical scaling (scaling up) means increasing the capacity
of a resource, for example by using a larger VM size. Horizontal scaling (scaling out) is adding new instances of a
resource, such as VMs or database replicas.
Horizontal scaling has significant advantages over vertical scaling:
True cloud scale. Applications can be designed to run on hundreds or even thousands of nodes, reaching scales
that are not possible on a single node.
Horizontal scale is elastic. You can add more instances if load increases, or remove them during quieter periods.
Scaling out can be triggered automatically, either on a schedule or in response to changes in load.
Scaling out may be cheaper than scaling up. Running several small VMs can cost less than a single large VM.
Horizontal scaling can also improve resiliency, by adding redundancy. If an instance goes down, the application
keeps running.
An advantage of vertical scaling is that you can do it without making any changes to the application. But at some
point you'll hit a limit, where you can't scale up any more. At that point, any further scaling must be horizontal.
Horizontal scale must be designed into the system. For example, you can scale out VMs by placing them behind a
load balancer. But each VM in the pool must be able to handle any client request, so the application must be
stateless or store state externally (say, in a distributed cache). Managed PaaS services often have horizontal scaling
and autoscaling built in. The ease of scaling these services is a major advantage of using PaaS services.
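To illustrate externalizing state so that any instance behind the load balancer can serve any request, here is a minimal Python sketch using a Redis cache (for example, Azure Cache for Redis; the host name, key names, and TTL are assumptions):

```python
import json
import redis

# Connect to a shared distributed cache instead of keeping session state in memory.
cache = redis.Redis(
    host="my-app-cache.redis.cache.windows.net",  # assumed cache host
    port=6380,
    password="<access-key>",  # assumed; load from configuration, not source code
    ssl=True,
)

def save_session(session_id: str, data: dict) -> None:
    # Any VM in the pool can read this state, so instances stay stateless.
    cache.setex(f"session:{session_id}", 1800, json.dumps(data))  # 30-minute TTL

def load_session(session_id: str) -> dict | None:
    raw = cache.get(f"session:{session_id}")
    return json.loads(raw) if raw else None

save_session("abc123", {"cart": ["sku-42"], "user": "alice"})
print(load_session("abc123"))
```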
Just adding more instances doesn't mean an application will scale, however. It might simply push the bottleneck
somewhere else. For example, if you scale a web front end to handle more client requests, that might trigger lock
contentions in the database. You would then need to consider additional measures, such as optimistic concurrency
or data partitioning, to enable more throughput to the database.
Always conduct performance and load testing to find these potential bottlenecks. The stateful parts of a system,
such as databases, are the most common cause of bottlenecks, and require careful design to scale horizontally.
Resolving one bottleneck may reveal other bottlenecks elsewhere.
Use the Performance efficiency checklist to review your design from a scalability standpoint.
Performance efficiency guidance
Design patterns for scalability and performance
Best practices: Autoscaling, Background jobs, Caching, CDN, Data partitioning
Reliability
A reliable workload is one that is both resilient and available. Resiliency is the ability of the system to recover from
failures and continue to function. The goal of resiliency is to return the application to a fully functioning state after
a failure occurs. Availability is whether your users can access your workload when they need to.
In traditional application development, there has been a focus on increasing the mean time between failures
(MTBF). Effort was spent trying to prevent the system from failing. In cloud computing, a different mindset is
required, due to several factors:
Distributed systems are complex, and a failure at one point can potentially cascade throughout the system.
Costs for cloud environments are kept low through the use of commodity hardware, so occasional hardware
failures must be expected.
Applications often depend on external services, which may become temporarily unavailable or throttle high-
volume users.
Today's users expect an application to be available 24/7 without ever going offline.
All of these factors mean that cloud applications must be designed to expect occasional failures and recover from
them. Azure has many resiliency features already built into the platform. For example:
Azure Storage, SQL Database, and Cosmos DB all provide built-in data replication, both within a region and
across regions.
Azure managed disks are automatically placed in different storage scale units to limit the effects of hardware
failures.
VMs in an availability set are spread across several fault domains. A fault domain is a group of VMs that share a
common power source and network switch. Spreading VMs across fault domains limits the impact of physical
hardware failures, network outages, or power interruptions.
That said, you still need to build resiliency into your application. Resiliency strategies can be applied at all levels of
the architecture. Some mitigations are more tactical in nature — for example, retrying a remote call after a
transient network failure. Other mitigations are more strategic, such as failing over the entire application to a
secondary region. Tactical mitigations can make a big difference. While it's rare for an entire region to experience a
disruption, transient problems such as network congestion are more common — so target these first. Having the
right monitoring and diagnostics is also important, both to detect failures when they happen, and to find the root
causes.
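As a tactical illustration of retrying a remote call after a transient failure, here is a minimal Python sketch of retry with exponential backoff (the attempt count, delays, and URL are assumptions; production code would also add jitter and retry only transient errors):

```python
import time
import urllib.error
import urllib.request

def call_with_retries(url: str, max_attempts: int = 4) -> bytes:
    """Retry a remote call with exponential backoff on transient failures."""
    delay = 1.0  # seconds before the first retry
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt == max_attempts:
                raise  # give up and let the caller fail over or report the error
            time.sleep(delay)
            delay *= 2  # back off: 1s, 2s, 4s, ...

data = call_with_retries("https://example.com/health")
print(len(data))
```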
When designing an application to be resilient, you must understand your availability requirements. How much
downtime is acceptable? This is partly a function of cost. How much will potential downtime cost your business?
How much should you invest in making the application highly available?
Reliability guidance
Designing reliable Azure applications
Design patterns for resiliency
Best practices: Transient fault handling, Retry guidance for specific services
Security
Think about security throughout the entire lifecycle of an application, from design and implementation to
deployment and operations. The Azure platform provides protections against a variety of threats, such as network
intrusion and DDoS attacks. But you still need to build security into your application and into your DevOps
processes.
Here are some broad security areas to consider.
Identity management
Consider using Azure Active Directory (Azure AD) to authenticate and authorize users. Azure AD is a fully managed
identity and access management service. You can use it to create domains that exist purely on Azure, or integrate
with your on-premises Active Directory identities. Azure AD also integrates with Office365, Dynamics CRM Online,
and many third-party SaaS applications. For consumer-facing applications, Azure Active Directory B2C lets users
authenticate with their existing social accounts (such as Facebook, Google, or LinkedIn), or create a new user
account that is managed by Azure AD.
If you want to integrate an on-premises Active Directory environment with an Azure network, several approaches
are possible, depending on your requirements. For more information, see our Identity Management reference
architectures.
Protecting your infrastructure
Control access to the Azure resources that you deploy. Every Azure subscription has a trust relationship with an
Azure AD tenant. Use role-based access control (RBAC) to grant users within your organization the correct
permissions to Azure resources. Grant access by assigning RBAC role to users or groups at a certain scope. The
scope can be a subscription, a resource group, or a single resource. Audit all changes to infrastructure.
Application security
In general, the security best practices for application development still apply in the cloud. These include things like
using SSL everywhere, protecting against CSRF and XSS attacks, preventing SQL injection attacks, and so on.
Cloud applications often use managed services that have access keys. Never check these into source control.
Consider storing application secrets in Azure Key Vault.
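For example, a minimal sketch of reading an application secret from Azure Key Vault with the azure-identity and azure-keyvault-secrets libraries (the vault name and secret name are assumptions):

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# The application authenticates with its managed identity (or developer credentials)
# instead of embedding an access key in source control.
credential = DefaultAzureCredential()
client = SecretClient(
    vault_url="https://my-app-vault.vault.azure.net",  # assumed vault name
    credential=credential,
)

# Retrieve the database connection string at startup rather than hardcoding it.
secret = client.get_secret("database-connection-string")  # assumed secret name
print(secret.name)  # never log secret.value in production
```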
Data sovereignty and encryption
Make sure that your data remains in the correct geopolitical zone when using Azure data services. Azure's geo-
replicated storage uses the concept of a paired region in the same geopolitical region.
Use Key Vault to safeguard cryptographic keys and secrets. By using Key Vault, you can encrypt keys and secrets
by using keys that are protected by hardware security modules (HSMs). Many Azure storage and DB services
support data encryption at rest, including Azure Storage, Azure SQL Database, Azure Synapse Analytics, and
Cosmos DB.
Security resources
Azure Security Center provides integrated security monitoring and policy management across your Azure
subscriptions.
Azure Security Documentation
Microsoft Trust Center
Principles of cost optimization
11/2/2020 • 2 minutes to read • Edit Online
A cost-effective workload is driven by business goals and the return on investment (ROI) while staying within a
given budget. The principles of cost optimization are a series of important considerations that can help achieve
both business objectives and cost justification.
To assess your workload using the tenets found in the Azure Well-Architected Framework, see the Microsoft Azure
Well-Architected Review.
Keep within the cost constraints
Every design choice has cost implications. Before choosing an architectural pattern, Azure service, or a price model
for the service, consider the budget constraints set by the company. As part of design, identify acceptable
boundaries on scale, redundancy, and performance against cost. After estimating the initial cost, set budgets and
alerts at different scopes to measure the cost. One cost driver can be unrestricted resources. These resources
typically need to scale and incur more cost to meet demand.
Aim for scalable costs
A key benefit of the cloud is the ability to scale dynamically. The workload cost should scale linearly with demand.
You can save cost through automatic scaling. Consider the usage metrics and performance to determine the
number of instances. Choose smaller instances for a highly variable workload and scale out to get the required
level of performance, rather than scaling up. This choice will enable you to make your cost calculations and estimates
more granular.
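To illustrate why smaller, scaled-out instances keep estimates granular, here is a rough back-of-the-envelope comparison in Python. The hourly rates are illustrative placeholders, not published Azure prices:

```python
# Rough sketch: compare scaling out small instances with scaling up to a large one.
# Hourly rates below are illustrative placeholders, not published Azure prices.
HOURS_PER_MONTH = 730
small_rate = 0.10   # assumed $/hour for a small instance
large_rate = 0.40   # assumed $/hour for one instance with 4x the capacity

def monthly_cost(instances: int, rate: float) -> float:
    return instances * rate * HOURS_PER_MONTH

# Demand varies during the month: the autoscaler averages 2 small instances,
# while the scale-up option pays for the large instance all month.
print(monthly_cost(2, small_rate))   # scaled-out average: 146.0
print(monthly_cost(1, large_rate))   # always-on large instance: 292.0
```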
Pay for consumption
Adopt a leasing model instead of owning infrastructure. Azure offers many SaaS and PaaS resources that simplify
overall architecture. The cost of hardware, software, development, operations, security, and data center space is
included in the pricing model.
Also, choose pay-as-you-go over fixed pricing. That way, as a consumer, you're charged for only what you use.
Right resources, right size
Choose the right resources that are aligned with business goals and can handle the performance of the workload.
An inappropriate or misconfigured service can impact cost. For example, building a multi-region service when the
service levels don't require high-availability or geo-redundancy will increase cost without any reasonable business
justification.
Certain infrastructure resources are delivered as fixed-size building blocks. Ensure that these blocks are adequately
sized to meet capacity demand and deliver the expected performance without wasting resources.
Monitor and optimize
Treat cost monitoring and optimization as a process, rather than a point-in-time activity. Conduct regular cost
reviews and measure and forecast the capacity needs so that you can provision resources dynamically and scale
with demand. Review the cost management recommendations and take action.
If you're just starting this process, review Enable success during a cloud adoption journey.
Checklist - Design for cost
11/2/2020 • 2 minutes to read • Edit Online
Cost model
Capture clear requirements. Gather detailed information about the business workflow and the regulatory,
security, and availability requirements.
Capture requirements
Estimate the initial cost. Use tools such as the Azure pricing calculator to assess the cost of the services you plan
to use in the workload. Use Azure Migrate and the Microsoft Azure Total Cost of Ownership (TCO) Calculator for
migration projects. Accurately reflect the cost associated with the right storage type. Add hidden costs, such as
networking costs for large data downloads.
Estimate the initial cost
Define policies for the cost constraints defined by the organization. Understand the constraints and
define acceptable boundaries for the quality pillars of scale, availability, and security.
Consider the cost constraints
Identify shared assets. Evaluate the business areas where you can use shared resources. Review the
billing meters and build chargeback reports per consumer to identify metered costs for shared cloud services.
Create a structured view of the organization in the cloud
Plan a governance strategy. Plan for cost controls through Azure Policy. Use resource tags so that custom
cost reports can be created. Define budgets and alerts to send notifications when certain thresholds are
reached.
Governance
Architecture
Check the cost of resources in various Azure geographic regions . Check your egress and ingress
cost, within regions and across regions. Only deploy to multiple regions if your service levels require it for
either availability or geo-distribution.
Azure regions
Choose a subscription that is appropriate for the workload . Azure Dev/Test subscription types are
suitable for experimental or non-production workloads and have lower prices on some Azure services such
as specific VM sizes. If you can commit to one or three years, consider subscriptions and offer types that
support Azure Reservations.
Subscription and offer type
Choose the right resources to handle the performance . Understand the usage meters and the
number of meters for each resource in the workload. Consider tradeoffs over time. For example, cheaper
virtual machines may initially indicate a lower cost but can be more expensive over time to maintain a
certain performance level. Be clear about the billing model of third-party services.
Azure resources
Use cost alerts to monitor usage and spending
Compare consumption-based pricing with pre-provisioned cost . Establish baseline cost by
considering the peaks and the frequency of peaks when analyzing performance.
Consumption and fixed cost models
Use proof-of-concept deployments . The Azure Architecture Center has many reference architectures and
implementations that can serve as a starting point. The Azure Tech Community has architecture and services
forums.
Choose managed services when possible. With PaaS and SaaS options, the cost of running and
maintaining the infrastructure is included in the service price.
Managed services
Develop a cost model
12/18/2020 • 6 minutes to read • Edit Online
Cost modeling is an exercise where you create logical groups of cloud resources that are mapped to the
organization's hierarchy and then estimate costs for those groups. The goal of cost modeling is to estimate the
overall cost of the organization in the cloud.
1. Understand how your responsibilities align with your organization
Map the organization's needs to logical groupings offered by cloud services. This way the business leaders
of the company get a clear view of the cloud services and how they're controlled.
2. Capture clear requirements
Start your planning with a careful enumeration of requirements. From the high-level requirements, narrow
down each requirement before starting on the design of the solution.
3. Consider the cost constraints
Evaluate the budget constraints on each business unit and determine the governance policies in Azure to
lower cost by reducing wastage, overprovisioning, or expensive provisioning of resources.
4. Consider tradeoffs
Optimal design doesn't equate to a lowest-cost design.
As requirements are prioritized, cost can be adjusted. Expect a series of tradeoffs in the areas that you want
to optimize, such as security, scalability, resilience, and operability. If the cost to address the challenges in
those areas is high, stakeholders will look for alternate options to reduce cost. There might be risky choices
made in favor of a cheaper solution.
5. Derive functional requirements from high-level goals
Break down the high-level goals into functional requirements for the components of the solution. Each
requirement must be based on realistic metrics to estimate the actual cost of the workload.
6. Consider the billing model for Azure resources
Azure services are offered with consumption-based pricing, where you're charged only for what you use.
There are also fixed-price options, where you're charged for provisioned resources.
Most services are priced based on units of size, amount of data, or operations. Understand the meters that
are used to track usage. For more information, see Azure resources.
At the end of this exercise, you should have identified the lower and upper limits on cost and set budgets for the
workload. Azure allows you to create and manage budgets in Azure Cost Management. For information, see
Quickstart: Create a budget with an Azure Resource Manager template.
HAVE FREQUENT AND CLEAR COMMUNICATION WITH THE STAKEHOLDERS
In the initial stages, communication between stakeholders is vital. The overall team must align on the requirements so that
overall business goals are met. If not, the entire solution might be at risk.
For instance, the development team might indicate that the resilience requirement of a monthly batch-processing job is low
and request that the job run as a single node without scaling capabilities. This request opposes the architect's
recommendation to automatically scale out and route requests to worker nodes.
This type of disagreement can introduce a point of failure into the system, risking the Service Level Agreement, and cause an
increase in operational cost.
Organization structure
Map the organization's needs to logical groupings offered by cloud services. This way the business leaders of the
company get a clear view of the cloud services and how they're controlled.
1. Understand how your workload fits into cost optimization across the portfolio of cloud workloads.
If you are working on a workload that fits into a broader portfolio of workloads, see CAF to get started guide to
document foundational decisions. That guide will help your team capture the broader portfolio view of business
units, resources organizations, responsibilities, and a view of the long term portfolio.
If cost optimization is being executed by a central team across the portfolio, see CAF to get started managing
enterprise costs.
2. Encourage a culture of democratized cost optimization decisions
As a workload owner, you can have a measurable impact on cost optimization. There are other roles in the
organization that can help improve cost management activities. To help embed the pillar of cost optimization into
your organization beyond your workload team, see the CAF article: Build a cost-conscious organization.
3. Reduce costs through shared cloud services and landing zones
If your workload has dependencies on shared assets like Active Directory, network connectivity, security devices, or
other services that are also used by other workloads, encourage your central IT organization to provide those
services through a centrally managed landing zone to reduce duplicate costs. See the CAF article: get started with
centralized design & configuration to get started with the development of landing zones.
4. Calculate the ROI by understanding what is included in each grouping and what isn't.
Which aspects of the hierarchy are covered by cloud services?
The Azure pricing model is based on expenses incurred for the service. Expenses include hardware, software,
development, operations, security, and data center space, to name a few. Evaluate the cost benefit of shifting
away from owned technology infrastructure to leased technology solutions.
5. Identify scenarios where you can use shared cloud services to lower cost.
Can some services be shared by other consumers?
Identify areas where a service or an application environment can be shared with other business units.
Identify resources that can be used as shared services and review their billing meters. Examples include a
virtual network and its hybrid connectivity or a shared App Service Environment (ASE). If the meter data can't
be split across consumers, decide on custom solutions to allocate proportional costs, or move shared
services to dedicated resources per consumer for cost reporting.
Build chargeback reports per consumer to identify metered costs for shared cloud services. Aim for
granular reports to understand which workload is consuming what amount of the shared cloud service.
Next step
Capture cost requirements
Cost constraints
Here are some considerations for determining the governance policies that can assist with cost management.
What are the budget constraints set by the company for each business unit?
What are policies for the budget alert levels and associated actions?
Identify acceptable boundaries for scale, redundancy, and performance against cost.
Assess the limits for security. Don't compromise on security. Premium cloud security features can drive the cost
up. It's not necessary to overinvest. Instead, use the cost profile to drive a realistic threat profile.
Identify unrestricted resources. These resources typically need to scale and incur more cost to meet
demand.
Next step
Consider tradeoffs
Functional requirements
Break down high-level goals into functional requirements. For each of those requirements, define metrics to
calculate cost estimates accurately. Cloud services are priced based on performance, features, and locations. When
defining these metrics, identify acceptable boundaries of performance, scale, resilience, and security. Start by
expressing your goals in the number of business transactions over time, breaking them down into fine-grained
requirements.
What resources are needed for a single transaction, and how many transactions are done per second,
day, year?
Start with a fixed cost of operations and a rough estimate of transaction volume to work out a cost per
transaction and establish a baseline. Consider the difference between cost models based on fixed, static
provisioning of services and more variable costs based on autoscaling, such as serverless technologies.
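As a minimal sketch, the baseline described above might be computed like this. The monthly cost and transaction volume are illustrative placeholders, not figures from the article:

```python
# Rough baseline: cost per transaction from a fixed monthly operations cost
# and an estimated transaction volume. Both inputs are placeholder figures.
monthly_operations_cost = 5_000.00   # assumed fixed cost of operations ($/month)
transactions_per_second = 20         # rough estimate from the business workflow
seconds_per_month = 730 * 3600

transactions_per_month = transactions_per_second * seconds_per_month
cost_per_transaction = monthly_operations_cost / transactions_per_month
print(f"{cost_per_transaction:.6f} $/transaction")  # ~0.000095 $/transaction
```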
Start your planning with a careful enumeration of requirements. Make sure the needs of the stakeholders are
addressed. For strong alignment with business goals, those areas must be defined by the stakeholders and
shouldn’t be collected from a vendor.
Capture requirements at these levels:
Business workflow
Compliance and regulatory
Security
Availability
What do you aim to achieve by building your architecture in the cloud?
Landing zone
Consider the cost implications of the geographic region to which the landing zone is deployed.
The landing zone consists of the subscription and resource group, in which your cloud infrastructure components
exist. This zone impacts the overall cost. Consider the tradeoffs. For example, there are additional costs for network
ingress and egress for cross-zonal traffic. For more information, see Azure regions and Azure resources.
For information about landing zone for the entire organization, see CAF: Implement landing zone best practices.
Security
Security is one of the most important aspects of any architecture. Security measures protect the valuable data of
the organization. They provide confidentiality, integrity, and availability assurances against attacks and misuse of the
systems.
Factor in the cost of security controls, such as authentication, MFA, conditional access, information protection,
JIT/PIM, and premium Azure AD features. Those options will drive up the cost.
For security considerations, see the Security Pillar.
Business continuity
Does the application have a Service Level Agreement that it must meet?
Factor in the cost when you create high availability and disaster recovery strategies.
Overall Service Level Agreement (SLA), Recovery Time Objective (RTO), and Recovery Point Objective (RPO) may
drive towards expensive design choices in order to support higher availability requirements. For example, a choice
might be to host the application across regions, which is costlier than a single region but supports high availability.
If your service SLAs, RTOs and RPOs allow, then consider cheaper options. For instance, pre-build automation
scripts and packages that would redeploy the disaster recovery components of the solution from the ground-up in
case a disaster occurs. Alternatively, use Azure platform-managed replication. Both options can lower cost because
fewer cloud services are pre-deployed and managed, reducing wastage.
In general, if the cost of high availability exceeds the cost of application downtime, then you could be over-
engineering the high availability strategy. Conversely, if the cost of high availability is less than the cost of a
reasonable period of downtime, you may need to invest more.
If the downtime costs are relatively low, you can save by relying on recovery from your backup and disaster
recovery processes. If the downtime is likely to cost a significant amount per hour, then invest more in the high
availability and disaster recovery of the service. It's a three-way tradeoff between cost of service provision, the
availability requirements, and the organization's response to risk.
Application lifespan
Does your service run seasonally or follow long-term patterns?
For long-running applications, consider using Azure Reservations if you can commit to a one-year or three-year
term. VM reservations can reduce cost by 60% or more when compared to pay-as-you-go prices.
Reservation is still an operational expense with all the corresponding benefits. Monitor the cost on workloads that
have been running in the cloud for an extended period to forecast the reserved instance sizes that are needed. For
information about optimization, see Reserved instances.
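As a rough sketch of that comparison, assuming placeholder hourly rates rather than published Azure prices:

```python
# Rough comparison of pay-as-you-go versus a reserved-instance commitment
# for a VM that runs all month. Rates are illustrative placeholders.
HOURS_PER_MONTH = 730
payg_rate = 0.20                 # assumed pay-as-you-go $/hour
reservation_discount = 0.60      # assumed discount from a multi-year reservation

payg_monthly = payg_rate * HOURS_PER_MONTH
reserved_monthly = payg_monthly * (1 - reservation_discount)
print(payg_monthly)      # 146.0 $/month pay-as-you-go
print(reserved_monthly)  # 58.4  $/month with the assumed 60% reservation discount
```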
If your application runs intermittently, consider using Azure Functions in a consumption plan so you only pay for
compute resources you use.
Automation opportunities
Is it a business requirement to have the service be available 24x7?
You may not have a business goal to keep the service running all the time, and doing so will incur a consistent cost.
Can you save by shutting down the service or scaling it down outside normal business hours? If so,
Azure has a rich set of APIs, SDKs, and automation technology that utilizes DevOps and traditional automation
principles. Those technologies ensure that the workload is available at an appropriate level of scale as needed.
Repurpose some compute and data resources for other tasks that run outside regular business hours. See the
Compute Resource Consolidation pattern and consider containers or elastic pools for more compute and data
cost flexibility.
Standardization
Ensure that your cloud environments are integrated into any IT operations processes. Those operations include
user or application access provisioning, incident response, and disaster recovery. That mapping may uncover areas
where additional cloud cost is needed.
Next step
Determine the cost constraints
Azure regions
11/2/2020 • 2 minutes to read • Edit Online
The cost of an Azure service can vary between locations based on demand and local infrastructure costs. Consider
these geographical areas when choosing the location of your resources to estimate costs.
Landing zone: the ultimate location of your cloud solution, typically consisting of logical containers such as a
subscription and resource group, in which your cloud infrastructure components exist.
The complete list of Azure geographies, regions, and locations, is shown in Azure global infrastructure.
To see availability of a product by region, see Products available by region.
Tradeoff
The savings from locating resources in a cheaper region should not be negated by the cost of network ingress and
egress or by degraded application performance because of increased latency.
An application hosted in a single region may cost less than an application hosted across regions because of
replication costs or the need for extra nodes.
If your solution needs to follow certain government regulations, the cost will be higher. Otherwise, you can meet
less rigid compliance requirements through Azure Policy, which is free.
Certain Azure regions are built specifically for high compliance and security needs. For example, with Azure
Government (USA) you're given an isolated instance of Azure. Azure Germany has datacenters that meet privacy
certifications. These specialized regions have higher cost.
Regulatory requirements can dictate restrictions on data residency. These requirements may impact your data
replication options for resiliency and redundancy.
Traffic across billing zones and regions
Cross-regional traffic and cross-zonal traffic incur additional costs.
Is the application critical enough to justify having the footprint of the resources cross zones and/or cross
regions?
Bandwidth refers to data moving in and out of Azure datacenters. Inbound data transfers (data going into Azure
datacenters) are free for most services. For outbound data transfers, (data going out of Azure datacenters) the data
transfer pricing is based on billing zones. For more information, see Bandwidth Pricing Details.
Suppose you want to build a cost-effective solution by provisioning resources in locations that offer the lowest
prices. The dependent resources and their users are located in different parts of the world. In this case, data
transfer between locations will add cost if there are meters tracking the volume of data moving across locations.
Any savings from choosing the cheapest location could be offset by the additional cost of transferring data.
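A minimal sketch of that tradeoff, using placeholder prices for compute and data transfer rather than published Azure rates:

```python
# Sketch: savings from a cheaper region can be offset by cross-region data transfer.
# All prices below are illustrative placeholders, not published Azure rates.
compute_cost_region_a = 1_000.00   # $/month in the region close to the users
compute_cost_region_b = 900.00     # $/month in a cheaper, distant region
egress_rate = 0.087                # assumed $/GB for cross-region data transfer
monthly_transfer_gb = 2_000        # data moving between the workload and its users

transfer_cost = monthly_transfer_gb * egress_rate        # 174.0
total_region_b = compute_cost_region_b + transfer_cost   # 1074.0
print(total_region_b > compute_cost_region_a)            # True: the cheaper region costs more overall
```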
The cross-regional and cross-zone additional costs do not apply to global services, such as Azure Active
Directory.
Not all Azure services support zones and not all regions in Azure support zones.
Before choosing a location, consider whether the application is important enough to justify the cost of having
resources cross zones and/or cross regions. For non-mission-critical applications, such as developer or test
workloads, consider keeping the solution and its dependencies in a single region or single zone to take
advantage of the lower-cost region.
Azure resources
12/18/2020 • 3 minutes to read • Edit Online
Just like on-premises equipment, there are several elements that affect monthly costs when using Azure services.
For each Azure resource, have a clear understanding of the meters that track usage and the number of meters
associated with the resource tier. The meters correlate to several billable units. Those units are charged to the
account for each billing period. The rate per billable unit depends on the resource tier.
A resource tier impacts pricing because each tier offers levels of features such as performance or availability. For
example, a Standard HDD hard disk is cheaper than a Premium SSD hard disk.
Start with a lower resource tier, then scale the resource up as needed. Growing a service with little to no
downtime is easier when compared to downscaling a service. Downscaling usually requires deprovisioning or
downtime. In general, choose scaling out instead of scaling up.
As part of the requirements, consider the metrics for each resource and build your alerts on baseline thresholds
for each metric. The alerts can be used to fine-tune the resources. For more information, see Respond to cost alert.
Azure usage rates and billing periods can vary depending on the subscription and offer type. Some subscription
types also include usage allowances or lower prices. For example, Azure Dev/Test subscription offers lower prices
on Azure services such as specific VM sizes, PaaS web apps, and VM images with pre-installed software. Visual
Studio subscribers get access to Azure subscriptions with monthly allowances as part of their benefits.
For information about the subscription offers, see Microsoft Azure Offer Details.
As you decide the offer type, consider the types that support Azure Reservations. With reservations, you prepay
for reserved capacity at a discount. Reservations are suitable for workloads that have a long-term usage pattern.
Combining the offer type with reservations can significantly lower the cost. For information about subscription
and offer types that are eligible for reservations, see Discounted subscription and offer types.
Azure Marketplace offers both Azure products and services from third-party vendors. Different billing
structures apply to each of those categories. The billing structures can range from free, pay-as-you-go, or a one-time
purchase fee, to a managed offering with monthly support and licensing costs.
Governance
12/18/2020 • 2 minutes to read • Edit Online
Governance can assist with cost management. This work will benefit your ongoing cost review process and will
offer a level of protection for new resources.
It's difficult to attribute costs before deploying a workload to the cloud. If you use methods for on-premises
estimation or directly map on-premises assets to cloud resources, the estimate will be inaccurate. For example, if you
build your own datacenter, your costs may appear comparable to the cloud. Most on-premises estimates don't account
for costs like cooling, electricity, IT and facilities labor, security, and disaster recovery.
Here are some best practices:
Use proof-of-concept deployments to help refine cost estimates.
Choose the right resources that can handle the performance of the workload. For example, cheaper virtual
machines may initially indicate a lower cost but can be more expensive eventually to maintain a certain
performance level.
Accurately reflect the cost associated with the right storage type.
Add hidden costs, such as networking costs for large data downloads.
Migration workloads
Quantify the cost of running your business in Azure by calculating the Total Cost of Ownership (TCO) and the Return on
Investment (ROI). Compare those metrics to existing on-premises equivalents.
It’s difficult to attribute costs before migrating to the cloud.
Using on-premises calculation may not accurately reflect the cost of cloud resources. Here are some challenges:
On-premises TCO may not accurately account for hidden expenses. These expenses include under-utilization of
purchased hardware or network maintenance costs including labor and equipment failure.
Cloud TCO may not accurately account for a drop in the organization's operational labor hours. The cloud
provider's infrastructure, platform management services, and additional operational efficiencies are included in
the cloud service pricing. This is especially true at a smaller scale, where the cloud provider's services don't result
in a reduction of IT labor head count.
ROI may not accurately account for new organizational benefits because of cloud capabilities. It's hard to
quantify improved collaboration, reduced time to service customers, and fast scaling with minimal or no
downtime.
ROI may not accurately account for business process re-engineering needed to fully adopt cloud benefits. In
some cases, this re-engineering may not occur at all, leaving an organization in a state where new technology is
used inefficiently.
Azure provides these tools to determine cost.
Microsoft Azure Total Cost of Ownership (TCO) Calculator to reflect all costs.
For migration projects, the TCO Calculator may assist, as it pre-populates some common costs but allows
you to modify the cost assumptions.
Azure pricing calculator to assess cost of the services you plan to use in your solution.
Azure Migrate to evaluate your organization's current workloads in on-premises datacenters. It suggests Azure
replacement solutions, such as virtual machine sizes, based on your workload. It also provides a cost estimate.
Example estimate for a microservices workload
Let's consider this scenario as an example. We'll use the Azure Pricing calculator to estimate the initial cost before
the workload is deployed. The cost is calculated per month or for 730 hours.
In this example, we've chosen the microservices pattern. As the container orchestrator, one of the options could be
Azure Kubernetes Service (AKS), which manages a cluster of pods. We choose the NGINX ingress controller because
it's a well-known controller for such workloads.
The example is based on the current price and is subject to change. The calculation shown is for information
purposes only.
Compute
For AKS, there's no charge for cluster management.
For AKS agent nodes, there are many instance sizes and SKU options. Our example workload is expected to follow
a long-running pattern and we can commit to three years. So, an instance that is eligible for reserved instances
would be a good choice. We can lower the cost by choosing the 3-year reserved plan.
The workload needs two virtual machines: one for the backend services and the other for the utility services.
The B12MS instance with 2 virtual machines is sufficient for this initial estimation. We can lower cost by choosing
reserved instances.
Estimated Total: $327.17 per month with an upfront payment of $11,778.17.
Application gateway
For this scenario, we consider the Standard_v2 Tier of Azure Application Gateway because of the autoscaling
capabilities and performance benefits. We also choose consumption-based pricing, which is calculated by capacity
units (CU). Each capacity unit is calculated based on compute, persistent connections, or throughput. For the
Standard_v2 SKU, each compute unit can handle approximately 50 connections per second with an RSA 2048-bit key
TLS certificate. For this workload, we estimate 10 capacity units.
Estimated Total: $248.64 per month.
Load balancer
NGINX ingress controller deploys a load balancer that routes internet traffic to the ingress. Approximately 15 load
balancer rules are needed. NAT rules are free. The main cost driver is the amount of data processed inbound and
outbound independent of rules. We estimate traffic of 1 TB (inbound and outbound).
Estimated Total: $96.37 per month.
Bandwidth
We estimate 2 TB of outbound traffic per month. The first 5 GB/month are free in Zone 1 (Zone 1 includes North America,
Europe, and Australia). Between 5 GB and 10 TB per month is charged at $0.087 per GB.
Estimated Total: $177.74 per month
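The estimate above works out roughly as follows, a quick check using the Zone 1 rate quoted in this example:

```python
# Worked bandwidth estimate for the example above, using the quoted Zone 1 rate.
outbound_gb = 2 * 1024          # 2 TB of outbound traffic per month
free_gb = 5                     # first 5 GB/month are free in Zone 1
rate_per_gb = 0.087             # $/GB for the 5 GB - 10 TB tier

monthly_cost = (outbound_gb - free_gb) * rate_per_gb
print(round(monthly_cost, 2))   # 177.74
```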
External data source
Because of the schema-on-read nature of the data handled by the workload, we choose Azure Cosmos DB as the
external data store. By using the Cosmos DB capacity calculator, we can calculate the throughput to reserve.
Cost variables
For lower latency, in this scenario we enable geo-replication by using the multi-region writes feature. By
default, Cosmos DB uses one region for writes and the rest for reads.
Default choices are the Session consistency level and the Automatic indexing policy. Automatic indexing
makes Cosmos DB index all properties in all items for flexible and efficient queries. A custom indexing
policy allows you to include or exclude properties from the index for lower write RUs and storage size, so
uploading a custom indexing policy can reduce costs.
Total data stored isn't a significant cost driver; here it's set to 500 GB.
The throughput is variable, indicating peaks. The percentage of time at peak is set to 10%.
The item size is an average of 90k. By using the capacity calculator you can upload sample JSON files with
your documents' data structure, the average document size, and the number of reads/writes per second. These
variables have the largest impact on cost because they're used to calculate the throughput. The throughput
values are shown in the image.
Now, we use those values in the Azure Pricing calculator.
The average throughput based on these settings is 20,000 RUs/sec, which is the minimum throughput required for
a 3-year reserved capacity plan.
Here is the total cost for three years using the reserved plan:
$1,635.20 Average per month ($58,867.20 charged upfront)
You save $700.00 by choosing the 3-year reserved capacity over the pay-as-you-go price.
CI/CD pipelines
With Azure DevOps, the Basic Plan is included for Visual Studio Enterprise, Professional, Test Professional, and MSDN
Platforms subscribers. There is no charge for stakeholders to add or edit work items and bugs, or to view dashboards,
backlogs, and Kanban boards.
Basic plan license for five users is free.
Additional services
For Microsoft Hosted Pipelines, the Free tier includes one parallel CI/CD job with 1,800 minutes (30 hours) per
month. However, you can select the Paid tier and have one parallel CI/CD job ($40.00); in this tier, each parallel
CI/CD job includes unlimited minutes.
For this stage of cost estimation, self-hosted pipelines are not required because the workload doesn't have
custom software that runs in the build process and isn't included in the Microsoft-hosted option.
Azure Artifacts is a service where you can create package feeds to publish and consume Maven, npm, NuGet,
Python, and universal packages. Azure Artifacts is billed on a consumption basis, and is free up to 2 GB of
storage. For this scenario, we estimate 56 GB in artifacts ($56.00).
Azure DevOps offers a cloud-based solution for load testing your apps. Load tests are measured and billed in
virtual user minutes (VUMs). For this scenario, we estimate 200,000 VUMs ($72.00).
Estimated Total: $168.00 per month
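The additional services above add up as follows, a simple check of the quoted figures:

```python
# Sum of the additional Azure DevOps services quoted in this example.
paid_parallel_job = 40.00   # one paid Microsoft-hosted parallel CI/CD job
artifacts_storage = 56.00   # estimated 56 GB of Azure Artifacts storage
load_testing_vums = 72.00   # estimated 200,000 virtual user minutes

print(paid_parallel_job + artifacts_storage + load_testing_vums)  # 168.0
```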
Managed services
12/18/2020 • 2 minutes to read • Edit Online
Look for areas in the architecture where it may be natural to incorporate platform-as-a-service (PaaS) options.
These include caching, queues, and data storage. PaaS reduces time and cost of managing servers, storage,
networking, and other application infrastructure.
With PaaS, the infrastructure cost is included in the pricing model of the service. For example, you can provision a
lower SKU virtual machine as a jumpbox. There are additional costs for storage and managing a separate server.
You also need to configure a public IP on the virtual machine, which is not recommended. A managed service such
as Azure Bastion takes into consideration all those costs and offers better security.
Azure provides a wide range of PaaS resources. Here are some examples of when you might consider PaaS options:
TASK | USE
Host a web server | Azure App Service instead of setting up IIS servers.
Indexing and querying heterogeneous data | Azure Cognitive Search instead of Elasticsearch.
Host a database server | Azure offers many SQL and NoSQL options, such as Azure SQL Database and Azure Cosmos DB.
Secure access to virtual machines | Azure Bastion instead of virtual machines as jump boxes.
Start with a fixed minimum level of performance and then use architectural patterns (such as Queue-Based Load
Leveling) and autoscaling of services. With this approach, the peaks can be smoothed out into a more consistent
flow of compute and data. This approach can temporarily extend your burst performance when the service is
under sustained load. If cost is an important factor but you need to maintain service availability under burst
workloads, use the Throttling pattern to maintain quality of service under load.
Compare and contrast the options and understand how to provision workloads that can potentially switch between
the two models. The model will be a tradeoff between scalability and predictability. Ideally in the architecture, blend
the two aspects.
Provisioning cloud resources to optimize cost
11/2/2020 • 2 minutes to read • Edit Online
The main cost driver for machine learning workloads is the compute cost. Those resources are needed to run the
training model and host the deployment. For information about choosing a compute target, see What are compute
targets in Azure Machine Learning?.
The compute cost depends on the cluster size, node type, and number of nodes. Billing starts while the cluster
nodes are starting, running, or shutting down.
With services such as Azure Machine Learning, you have the option of creating fixed-size clusters or using autoscaling.
If the amount of compute is not known, start with a zero-node cluster. The cluster will scale up when it
detects jobs in the queue. A zero-node cluster is not charged (see the sketch after this list).
Fixed-size clusters are appropriate for jobs that run at a constant rate, where the amount of compute is known and
measured beforehand. The time taken to spin a cluster up or down incurs additional cost.
If you don't need retraining frequently, turn off the cluster when not in use.
To lower the cost for experimental or development workloads, choose Spot VMs. They aren't recommended for
production workloads because they might be evicted by Azure at any time. For more information, see Use Spot
VMs in Azure.
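Here is a minimal sketch of provisioning such a cluster with the Azure Machine Learning Python SDK (v1). It assumes an existing Workspace object named ws; the cluster name and VM size are placeholders, and the idle timeout is an illustrative choice:

```python
# Minimal sketch (Azure ML Python SDK v1): an autoscaling compute cluster that
# scales down to zero nodes when idle, so no compute cost accrues between jobs.
# Assumes `ws` is an existing azureml.core.Workspace; names and sizes are placeholders.
from azureml.core.compute import AmlCompute, ComputeTarget

config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_DS3_V2",           # pick a size that matches the training jobs
    min_nodes=0,                         # zero-node cluster: no charge while idle
    max_nodes=4,                         # upper bound keeps the cost capped
    idle_seconds_before_scaledown=1200,  # release nodes shortly after jobs finish
)

cluster = ComputeTarget.create(ws, name="training-cluster", provisioning_configuration=config)
cluster.wait_for_completion(show_output=True)
```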
For more information about the services that make up a machine learning workload, see What are the machine
learning products at Microsoft?.
This article provides cost considerations for some technology choices. This is not meant to be an exhaustive list, but
a subset of options.
Start with smaller DWUs and measure performance for resource-intensive operations, such as heavy data
loading or transformation. This will help you determine the number of units you need to increase or decrease.
Measure usage during the peak business hours so you can assess the number of concurrent queries and
accordingly add units to increase the parallelism. Conversely, measure off-peak usage so that you can pause
compute when needed.
In Azure Synapse Analytics, you can import or export data from an external data store, such as Azure Blob Storage
and Azure Data Lake Store. Storage and analytics resources aren't included in the price. There is additional
bandwidth cost for moving data in and out of the data warehouse.
For more information, see these articles:
Azure Synapse Pricing
Manage compute in Azure Synapse Analytics data warehouse
Reference architecture
Automated enterprise BI with Azure Synapse Analytics and Azure Data Factory
Enterprise BI in Azure with Azure Synapse Analytics
Azure Databricks
Azure Databricks offers two SKUs, Standard and Premium, each with these options, listed in order from least to
most expensive.
Data Engineering Light is for data engineers to build and execute jobs in automated Spark clusters.
Data Engineering includes autoscaling and has features for machine learning flows.
Data Analytics includes the preceding set of features and is intended for data scientists to explore, visualize,
manipulate, and share data and insights interactively.
Choose a SKU depending on your workload. If you need features like audit logging, which is available in Premium, the
overall cost can increase. If you need autoscaling of clusters to handle larger workloads or interactive Databricks
dashboards, choose an option higher than Data Engineering Light.
Here are factors that impact Databricks billing:
Azure location where the resource is provisioned.
The virtual machine instance tier and number of hours the instances were running.
Databricks units (DBUs), a unit of processing capability per hour, billed on per-second usage.
The example is based on the current price and is subject to change. The calculation shown is for information
purposes only.
Suppose you run a Premium cluster for 100 hours in East US 2 with 10 DS13v2 instances.
ITEM | EXAMPLE ESTIMATE
DBU cost for Data Analytics workload | 100 hours x 10 instances x 2 DBU per node x $0.55/DBU = $1,100
Total | $1,841
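A sketch of how that total is composed: the DBU figures come from the table above, and the VM hourly rate is an assumed placeholder for a DS13 v2 instance that makes the quoted total add up:

```python
# Sketch of the Databricks example: DBU charges plus the underlying VM charges.
# The DBU rate comes from the example above; the VM hourly rate is an assumption.
hours = 100
instances = 10
dbu_per_node = 2
dbu_rate = 0.55    # $/DBU for a Premium Data Analytics workload
vm_rate = 0.741    # assumed $/hour for a DS13 v2 instance in East US 2

dbu_cost = hours * instances * dbu_per_node * dbu_rate   # 1100.0
vm_cost = hours * instances * vm_rate                     # 741.0
print(dbu_cost + vm_cost)                                 # 1841.0
```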
Turn off the Spark cluster when not in use to prevent unnecessary charges.
Reference architecture
Stream processing with Azure Databricks
Build a Real-time Recommendation API on Azure
Batch scoring of Spark models on Azure Databricks
Azure Stream Analytics
Stream analytics uses streaming units (SUs) to measure the amount of compute, memory, and throughput required
to process data. When provisioning a stream processing job, you're expected to specify an initial number of SUs.
Higher streaming units mean higher cost because more resources are used.
Stream processing with low latency requires a significant amount of memory. This resource is tracked by the SU%
utilization metric. Higher utilization indicates that the workload requires more compute resources. You can set an
alert on the 80% SU% utilization metric to prevent resource exhaustion.
To evaluate the number of units you need, process an amount of data that is realistic for your production
level workload, observe the SU% Utilization metric, and accordingly adjust the SU value.
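As a minimal sketch of that iteration (pure logic, not tied to any specific SDK or to valid SU increments), you might adjust the SU count based on the observed SU% utilization and the 80% alert threshold; the thresholds and step size are illustrative choices:

```python
# Minimal sketch: adjust the Stream Analytics SU count based on observed
# SU% utilization. Thresholds and step size are illustrative choices.
def recommend_streaming_units(current_sus: int, su_utilization_percent: float) -> int:
    """Return a new SU count given the observed SU% utilization metric."""
    if su_utilization_percent >= 80:      # alert threshold: nearing resource exhaustion
        return current_sus * 2            # scale up before the job starts lagging
    if su_utilization_percent < 30:       # sustained low utilization: over-provisioned
        return max(1, current_sus // 2)   # scale down to cut cost
    return current_sus                    # utilization is in a healthy range

print(recommend_streaming_units(6, 85))   # 12
print(recommend_streaming_units(6, 20))   # 3
```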
You can create stream processing jobs in Azure Stream Analytics and deploy them to devices running Azure IoT
Edge through Azure IoT Hub. The number of devices impacts the overall cost. Billing starts when a job is deployed to
devices, regardless of the job status (running, failed, stopped).
SUs are based on the partition configuration for the inputs and the query that's defined within the job. For more
information, see Calculate the max streaming units for a job and Understand and adjust Streaming Units.
For pricing details, see Azure Stream Analytics pricing.
Reference architecture
Batch scoring of Python machine learning models on Azure
Azure IoT reference architecture
Compute refers to the hosting model for the computing resources that your application runs on. Whether your
hosting model is Infrastructure as a Service (IaaS), Platform as a Service (PaaS), or Functions as a Service (FaaS),
each resource requires your evaluation to understand the tradeoffs that can be made that impact your cost. To
learn more about hosting models, read Understand the hosting models.
Infrastructure-as-a-Service (IaaS) lets you provision individual virtual machines (VMs) along with the
associated networking and storage components. Then you deploy whatever software and applications you
want onto those VMs. This model is the closest to a traditional on-premises environment, except that
Microsoft manages the infrastructure. You still manage the individual VMs.
Platform-as-a-Service (PaaS) provides a managed hosting environment, where you can deploy your
application without needing to manage VMs or networking resources. Azure App Service is a PaaS service.
Functions-as-a-Service (FaaS) goes even further in removing the need to worry about the hosting
environment. In a FaaS model, you simply deploy your code and the service automatically runs it. Azure
Functions is a FaaS service.
What are the cost implications to consider for choosing a hosting model?
If your application can be broken down into short pieces of code, a FaaS hosting model might be the best choice.
You’re charged only for the time it takes to execute your code. For example, Azure Functions is a FaaS service that
processes events with serverless code. Azure Functions allows you to run short pieces of code (called functions)
without worrying about application infrastructure. Use one of the three Azure Functions pricing plans to fit your
need. To learn more about the pricing plans, see How much does Functions cost?
If you want to deploy a larger or more complex application, PaaS may be the better choice. With PaaS, your
application is always running, as opposed to FaaS, where your code is executed only when needed. Since more
resources are used with PaaS, the price increases.
If you are migrating your infrastructure from on-premises to Azure, IaaS will greatly reduce and optimize
infrastructure costs and salaries for IT staff who are no longer needed to manage the infrastructure. Since IaaS uses
more resources than PaaS and FaaS, your cost could be highest.
What are the main cost drivers for Azure services?
You will be charged differently for each service depending on your region, licensing plan (e.g., Azure Hybrid Benefit
for Windows Server), number and type of instances you need, operating system, lifespan, and other parameters
required by the service. Assess the need for each compute service by using the flowchart in Choose a candidate
service. Consider the tradeoffs that will impact your cost by creating different estimates using the Pricing
Calculator. If your application consists of multiple workloads, we recommend that you evaluate each workload
separately. See Consider limits and costs to perform a more detailed evaluation on service limits, cost, SLAs, and
regional availability.
Are there payment options for Virtual Machines (VMs) to help meet my budget?
The best choice is driven by the business requirements of your workload. If you have higher SLA requirements, it
will increase overall costs. You will likely need more VMs to ensure uptime and connectivity. Other factors that will
impact cost are region, operating system, and the number of instances you choose. Your cost also depends on the
workload life span. See Virtual machines and Use Spot VMs in Azure for more details.
Pay as you go lets you pay for compute capacity by the second, with no long-term commitment or upfront
payments. This option allows you to increase or decrease compute capacity on demand. It is appropriate for
applications with short-term, spiky, or unpredictable workloads that cannot be interrupted. You can also start
or stop usage at any time, resulting in paying only for what you use.
Reserved Virtual Machine Instances let you purchase a VM for one or three years in a specified region
in advance. This option is appropriate for applications with steady-state usage. You may save money compared
to pay-as-you-go pricing.
Spot pricing lets you purchase unused compute capacity at major discounts. If your workload can tolerate
interruptions, and its execution time is flexible, then using spot pricing for VMs can significantly reduce the
cost of running your workload in Azure.
Dev-Test pricing offers discounted rates on Azure to support your ongoing development and testing. Dev-
Test allows you to quickly create consistent development and test environments through a scalable, on-
demand infrastructure. This will allow you to spin up what you need, when you need it, and explore
scenarios before going into production. To learn more about Azure Dev-Test reduced rates, see Azure
Dev/Test Pricing.
For details on available sizes and options for the Azure VMs you can use to run your apps and workloads, see Sizes
for virtual machines in Azure. For details on specific Azure VM types, see Virtual Machine series.
Do I pay extra to run large-scale parallel and high-performance computing (HPC) batch jobs?
Use Azure Batch to run large-scale parallel and HPC batch jobs in Azure. You can save on VM cost because multiple
apps can run on one VM. Configure your workload with either the Low-priority tier (the cheapest option) or the
Standard tier (provides better CPU performance). There is no cost for the Azure Batch service. Charges accrue for
the underlying resources that run your batch workloads.
If you host your web apps in PaaS, you'll need to choose an App Service plan to run your apps. The plan will define
the set of compute resources on which your app will run. If you have more than one app, they will run using the
same resources. This is where you will see the most significant cost savings, as you don’t incur cost for VMs.
If your apps are event-driven with a short-lived process using a microservices architecture style, we recommend
using Azure Functions. Your cost is determined by execution time and memory for a single function execution. For
pricing details, see Azure Functions pricing.
Is it more cost-effective to deploy development testing (dev-test) on a PaaS or IaaS hosting model?
If your dev-test is built on Azure managed services such as Azure DevOps, Azure SQL Database, Azure Cache for
Redis, and Application Insights, the cheapest solution might be using the PaaS hosting model. You won’t incur the
cost and maintenance of hardware. If your dev-test is built on Azure managed services such as Azure DevOps,
Azure DevTest Labs, VMs, and Application Insights, you need to add the cost of the VMs, which can greatly increase
your cost. For details on evaluating, see Azure Dev/Test Pricing.
A special case of PaaS - Containers
How can I get the best cost savings for a containerized workload that requires full orchestration?
Your business requirements may necessitate that you store container images so that you have fast, scalable
retrieval, and network-close deployment of container workloads. Although there are choices as to how you will run
them, we recommend that you use AKS to set up instances with a minimum of three (3) nodes. AKS reduces the
complexity and operational overhead of managing Kubernetes by offloading much of that responsibility to Azure.
There is no charge for AKS cluster management, and any additional costs are minimal. The containers themselves have
no impact on cost; you pay only for the underlying virtual machines, with per-second billing and custom machine sizes.
Can I save money if my containerized workload does not need full orchestration?
Your business requirements may not necessitate full orchestration. If this is the case and you are using App Service
containers, we recommend that you use one of the App Service plans. Choose the appropriate plan based on your
environment and workload.
There is no charge to use SNI-based SSL. Standard and Premium service plans include the right to use one IP SSL
at no additional charge. Free and shared service plans do not support SSL. You can purchase the right to use
additional SSL connections for a fee. In all cases the SSL certificate itself must be purchased separately. To learn
more about each plan, see App services plans.
Where are the savings if my workload is event-driven with a short-lived process?
In this example, Service Fabric may be a better choice than AKS. The biggest difference between the two is that AKS
works only with Docker applications using Kubernetes. Service Fabric works with microservices and supports
many different runtime strategies. Service Fabric can deploy Docker and Windows Server containers. Like AKS,
Service Fabric is built for the microservice and event-driven architectures. AKS is strictly a service orchestrator and
handles deployments, whereas Service Fabric also offers a development framework that allows building
stateful/stateless applications.
NOTE
The costs in this example are based on the current price and are subject to change. The calculation is for information
purposes only. It shows the Collapsed view of the cost in this estimate.
TIP
You can start building your cost estimate at any time and re-visit it later. The changes will be saved until you modify or delete
your estimate.
Next steps
Provisioning cloud resources to optimize cost
Virtual Machines documentation
VM payment options
Pricing Calculator
Data store cost estimates
11/2/2020 • 9 minutes to read • Edit Online
Most cloud workloads adopt the polyglot persistence approach. Instead of using one data store service, a mix of
technologies is used. You can achieve optimal cost benefit from using this approach.
Each Azure data store has a different billing model. To establish a total cost estimate, first, identify the business
transactions and their requirements. Then, break each transaction into operations. Lastly, identify a data store
appropriate for the type of data. Do this for each workload separately.
Let's take an example of an e-commerce application. It needs to store data for transactions such as orders,
payments, and billing. The data structure is predetermined and not expected to change frequently. Data integrity
and consistency are crucial. There's a need to store product catalogs, social media posts, and product reviews. In
some cases, the data is unstructured, and is likely to change over time. Media files must be stored and data must be
stored for auditing purposes.
Learn about data stores in Understand data store models.
Do you need to restrict or otherwise manage access to your data from other network resources?
Does data need to be accessible only from inside the Azure environment?
Does the data need to be accessible from specific IP addresses or subnets?
Does the data need to be accessible from applications or services hosted on-premises or in other external data
centers?
Consider Azure Blob Storage Block Blobs instead of storing binary image data in Azure SQL Database. Blob
storage is cheaper than Azure SQL Database.
If your design requires SQL, store a lookup table in SQL Database and retrieve the document when needed to
serve it to the user in your application middle tier. SQL Database is highly targeted for high-speed data lookups
and set-based operations.
The hot access tier of Azure block blob storage is cheaper than the equivalent size of the Premium SSD
volume that holds the database.
Read this decision chart to make your choices.
If the on-premises data is already on a SQL server, it might be a natural choice. The on-premises license with
Software Assurance can be used to bring down the cost if the workload is eligible for Azure Hybrid Benefit. This
option applies for Azure SQL Database (PaaS) and SQL Server on Azure Virtual Machines (IaaS).
For open-source databases such as MySQL, MariaDB, or PostgreSQL, Azure provides managed services that are
easy to provision.
What are some design considerations that will affect cost?
If the SLAs don't allow for downtime, can a read-only replica in a different region enable business continuity?
If the database in the other region must be read/write, how will the data be replicated?
Does the data need to be synchronous, or could consistency allow for asynchronous replication?
How important is it for updates made in one node to appear in other nodes, before further changes can be
made?
Azure storage has several options to make sure data is copied and available when needed. Locally redundant
storage (LRS) synchronously replicates data within the primary region. If the entire primary datacenter is unavailable,
the replicated data is lost. At the expensive end, geo-zone-redundant storage (GZRS, in preview) replicates data across
availability zones within the primary region and asynchronously copies it to a paired region. Databases that
offer geo-redundant storage, such as SQL Database, are more expensive. Most other OSS RDBMS databases use
LRS storage, which contributes to the lower price range.
For more information, see automated backups.
Cosmos DB offers five consistency levels: strong, bounded staleness, session, consistent prefix, and eventual
consistency. Each level provides availability and performance tradeoffs and is backed by comprehensive SLAs. The
consistency level itself does not affect cost.
How can I minimize compute cost?
Higher throughput and IOPS require higher compute, memory, I/O, and storage limits. These limits are expressed in
a vCore model. With a higher vCore count, you buy more resources and consequently the cost is higher. Azure SQL
Database has more vCores and allows you to scale in smaller increments. Azure Database for MySQL, PostgreSQL,
and MariaDB have fewer vCores, and scaling up to a higher vCore can cost more. MySQL provides in-memory
tables in the Memory Optimized tier, which can also increase the cost.
All options offer consumption and provisioned pricing models. With pre-provisioned instances, you save more if
you can commit to one or three years.
How is primary and backup storage cost calculated?
With Azure SQL Database, the initial 32 GB of storage is included in the price. For the other listed options, you need
to buy storage separately, which might increase the cost depending on your storage needs.
For most databases, there is no charge for the price of backup storage that is equal in size to primary storage. If
you need more backup storage, you will incur an additional cost.
The messaging services in this article have no up-front cost or termination fees, and you pay only for what you use.
In some cases, it's advantageous to combine two messaging services to increase the efficiency of your messaging
system. See Crossover scenarios for examples.
Cost is based on the number of operations or throughput units used depending on the message service. Using the
wrong messaging service for the intent can lead to higher costs. Before choosing a service, first, determine the
intent and requirements of the messages. Then, consider the tradeoffs between cost and operations/throughput
units. For tradeoff examples, see Technology choices for a message broker.
Use the Azure Pricing calculator for help creating various cost scenarios.
Event Hubs
Stream millions of events per second from any source to build dynamic data pipelines and immediately respond to
business challenges using Event Hubs. Cost is based on throughput units. A key difference between Event Grid and
Event Hubs is in the way event data is made available to subscribers. For more information, see Pull model.
For questions and answers on pricing, see Pricing.
For pricing, see Event Hubs pricing.
Networking cost estimates
12/18/2020 • 5 minutes to read • Edit Online
Azure Front Door and Traffic Manager can distribute traffic to backends, clouds, or hybrid on-premises services
that reside in multiple regions. Otherwise, for regional traffic that moves within virtual networks or zonal and
zone-redundant service endpoints within a region, choose Application Gateway or Azure Load Balancer.
What is the type of traffic?
Load balancers such as Azure Application Gateway and Azure Front Door are designed to route and distribute
traffic to web applications or HTTP(S) endpoints. Those services support Layer 7 features such as SSL offload, web
application firewall, path-based load balancing, and session affinity. For non-web traffic, choose Traffic Manager or
Azure Load Balancer. Traffic Manager uses DNS to route traffic. Load Balancer supports Layer 4 features for all UDP
and TCP protocols.
Here’s the matrix for choosing load balancers by considering both dimensions.
The example is based on the current price and is subject to change. The calculation shown is for information
purposes only.
ITEM | EXAMPLE ESTIMATE
Total | $486.60
ITEM | EXAMPLE ESTIMATE
Total | $105.95
Peering
Do you need to connect vir tual networks?
Peering technology is a service used for Azure virtual networks to connect with other virtual networks in the same
or different Azure region. Peering technology is used often in hub and spoke architectures.
An important consideration is the additional cost incurred on both egress and ingress traffic that traverses the
peering connections.
Keep the top-talking services of a workload within the same virtual network, zone, and/or region
unless otherwise required. Use virtual networks as shared resources for multiple workloads rather than a single
virtual network per workload. This approach localizes traffic to a single virtual network and avoids
additional peering charges.
Peering within the same region is cheaper than peering between regions or global regions. For instance, suppose you
consume 50 TB per month by connecting two VNETs in Central US. At the current price, here is the incurred cost.
ITEM    EXAMPLE ESTIMATE
Total    $1,025.00
Let’s compare that cost for cross-region peering between Central US and East US.
ITEM    EXAMPLE ESTIMATE
Total    $3,584.00
Hybrid connectivity
There are two main options for connecting an on-premises datacenter to Azure datacenters:
Azure VPN Gateway can be used to connect a virtual network to an on-premises network through a VPN
appliance or to Azure Stack through a site-to-site VPN tunnel.
Azure ExpressRoute creates private connections between Azure datacenters and infrastructure that's on-premises
or in a colocation environment.
What is the required throughput for cross-premises connectivity?
VPN gateway is recommended for development/test cases or small-scale production workloads where throughput
is less than 100 Mbps. Use ExpressRoute for enterprise and mission-critical workloads that access most Azure
services. You can choose bandwidth from 50 Mbps to 10 Gbps.
Another consideration is security. Unlike VPN Gateway traffic, ExpressRoute connections don't go over the public
internet. VPN Gateway traffic is secured by industry standard IPsec.
For both services, inbound transfers are free and outbound transfers are billed per the billing zone.
For more information, see Choose a solution for connecting an on-premises network to Azure.
This blog post provides a comparison of the two services.
Example cost analysis
This example compares the pricing details for VPN Gateway and ExpressRoute.
The example is based on the current price and is subject to change. The calculation shown is for information
purposes only.
ITEM    EXAMPLE ESTIMATE
Site-to-Site (S2S) tunnels    The first 10 tunnels are free. For additional tunnels, the cost is
5 tunnels x 720 hours x $0.015 per hour per tunnel = $54.00
ExpressRoute
Let's choose the Metered Data plan in billing Zone 1.
ITEM    EXAMPLE ESTIMATE
Circuit bandwidth    1 Gbps has a fixed rate of $436.00 at the standard price.
The main cost driver is outbound data transfer. ExpressRoute is more cost-effective than VPN Gateway when
consuming large amounts of outbound data. If you consume more than 200 TB per month, consider ExpressRoute
with the Unlimited Data plan where you're charged a flat rate.
For pricing details, see:
Azure VPN Gateway pricing
Azure ExpressRoute pricing
Networking resources provisioning
12/18/2020 • 7 minutes to read • Edit Online
Azure ExpressRoute
There are two main pricing models:
Metered Data plan
There are two pricing tiers: Standard and Premium, which is priced higher. The tier pricing is based on the
circuit bandwidth.
If you don't need to access the services globally, choose Standard. With this tier, you can connect to regions
within the same zone at no additional cost. Outbound cross-zonal traffic incurs more cost.
Unlimited Data plan
All inbound and outbound data transfer is included in the flat rate. There are two pricing tiers: Standard and
Premium, which is priced higher.
Calculate your utilization and choose a billing plan. The Unlimited Data plan is recommended if you exceed
about 68% of utilization.
For more information, see Azure ExpressRoute pricing.
Reference architecture
Connect an on-premises network to Azure using ExpressRoute connects an Azure virtual network and an on-
premises network using ExpressRoute with VPN gateway failover.
Azure Firewall
Azure Firewall usage is charged at a fixed rate per deployment hour. There's an additional cost for the amount of
data transferred.
There's no additional cost for a firewall deployed in an availability zone. There are additional costs for inbound
and outbound data transfers associated with availability zones.
Compared to network virtual appliances (NVAs), Azure Firewall can save you up to 30-50%. For more
information, see Azure Firewall vs NVA.
Reference architecture
Hub-spoke network topology in Azure
Deploy highly available NVAs
Traffic Manager
Traffic Manager uses DNS to route and load balance traffic to service endpoints in different Azure regions. So, an
important use case is disaster recovery. In a workload, you can use Traffic Manager to route incoming requests to
the primary region. If that region becomes unavailable, Traffic Manager fails over to the secondary region. There
are other features that can make the application highly responsive and available. Those features cost money.
Determine the best web app to handle requests based on geographic location.
Configure caching to reduce the response time.
Traffic Manager isn't charged for bandwidth consumption. Billing is based on the number of DNS queries received,
with a discount for services receiving more than 1 billion monthly queries. You're also charged for each monitored
endpoint.
Reference architecture
Multi-region N-tier application uses Traffic Manager to route incoming requests to the primary region. If that
region becomes unavailable, Traffic Manager fails over to the secondary region. For more information, see the
section Traffic Manager configuration.
DNS query charges
Traffic Manager uses DNS to direct clients to a specific service endpoint.
Only DNS queries that reach Traffic Manager are charged, in million-query units. For 100 million DNS queries a
month, the charges will be $54.00 a month based on the current Traffic Manager pricing.
Not all DNS queries reach Traffic Manager. Recursive DNS servers run by enterprises and ISPs first attempt to
resolve the query by using cached DNS responses. Those servers query Traffic Manager at a regular interval to get
updated DNS entries. That interval value or TTL is configurable in seconds. TTL can impact cost. Longer TTL
increases the amount of caching and reduces DNS query charges. Conversely, shorter TTL results in more queries.
However, there is a tradeoff. Increased caching also impacts how often the endpoint status is refreshed. For
example, failover times after an endpoint failure will become longer.
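If you decide to trade faster failover for lower query charges (or the reverse), the TTL is set on the Traffic Manager profile. A hedged Azure CLI sketch, with placeholder names:
# Raise the DNS TTL to 300 seconds to increase caching and reduce billable queries.
# Shorter values (for example, 60) speed up failover but increase query charges.
az network traffic-manager profile update \
  --resource-group myResourceGroup \
  --name myTrafficManagerProfile \
  --ttl 300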
Health monitoring charges
When Traffic Manager receives a DNS request, it chooses an available endpoint based on configured state and
health of the endpoint. To do this, Traffic Manager continually monitors the health of each service endpoint.
You're charged for the number of monitored endpoints. You can add endpoints for services hosted in Azure and also
add endpoints for services hosted on-premises or with a different hosting provider. The external endpoints are more
expensive, but health checks help you build highly available applications that are resilient to endpoint failure,
including Azure region failures.
Real User Measurement charges
Real User Measurements evaluates network latency from the client applications to Azure regions. That influences
Traffic Manager to select the best Azure region in which the application is hosted. The number of measurements
sent to Traffic Manager is billed.
Traffic view charges
By using Traffic View, you can get insight into the traffic patterns where you have endpoints. The charges are based
on the number of data points used to create the insights presented.
Virtual Network
Azure Virtual Network is free. You can create up to 50 virtual networks across all regions within a subscription.
Here are a few considerations:
Inbound and outbound data transfers are charged per the billing zone. Traffic that moves across regions and
billing zones is more expensive. For more information, see:
Traffic across zones
Bandwidth Pricing Details.
VNET Peering has additional cost. Peering within the same region is cheaper than peering between regions
or Global regions. Inbound and outbound traffic is charged at both ends of the peered networks. For more
information, see Peering
Managed services (PaaS) don't always need a virtual network. The cost of networking is included in the
service cost.
Web application cost estimates
11/2/2020 • 4 minutes to read • Edit Online
All web applications (apps) have no up-front cost or termination fees. Some charge only for what you use and
others charge per-second or per-hour. In addition, all web apps run in App Service plans. Together these costs can
help determine the total cost of a web app.
Use the Azure Pricing calculator to help create various cost scenarios.
You don't get charged for using the App Service features that are available to you (e.g., configuring custom
domains, TLS/SSL certificates, deployment slots, backups, etc.). For a list of exceptions, see How much does my App
Service plan cost?
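Because the App Service plan SKU is the main cost driver, it helps to be explicit about it when provisioning. The following Azure CLI sketch (names and SKU are illustrative) creates a plan and a web app in it; apps added to the same plan share its compute, so they don't add plan cost.
# Create an App Service plan; the SKU (here S1) determines the per-hour charge.
az appservice plan create \
  --resource-group myResourceGroup \
  --name myAppServicePlan \
  --sku S1

# Create a web app in that plan.
az webapp create \
  --resource-group myResourceGroup \
  --plan myAppServicePlan \
  --name my-unique-webapp-name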
Azure Cost Management has an alert feature. Alerts are generated when consumption reaches a threshold.
Consider the metrics that each resource in the workload emits. For each metric, build alerts on baseline thresholds. This
way, the admins can be alerted when the workload is using the services at capacity. The admins can then tune the
resources to target SKUs based on current load.
You can also set alerts on allowed budgets at the resource group or management groups scopes. Both cloud
services performance and budget requirements can be balanced through alerts on metrics and budgets.
Over time, the workload can be optimized to autoheal itself when alerts are triggered. For information about using
alerts, see Use cost alerts to monitor usage and spending.
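As a minimal sketch, a budget can also be created from the Azure CLI; the name, amount, and dates below are placeholders, and the exact parameters can vary by CLI version.
# Create a monthly cost budget of $1,000 scoped to the subscription.
# Alert thresholds and notification recipients are typically configured in the portal or via the Budgets API.
az consumption budget create \
  --budget-name monthly-workload-budget \
  --amount 1000 \
  --category cost \
  --time-grain monthly \
  --start-date 2021-01-01 \
  --end-date 2021-12-31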
Respond to alerts
When you receive an alert, check the current consumption data. Budget alerts aren't triggered in real time. There
may be a delay between the alert and the current actual cost. Look for a significant difference between the cost value
when the alert happened and the current cost. Next, conduct a cost review to discuss the cost trend, possible
causes, and any required action. For information about stakeholders in a cost review, see Cost reviews.
Determine short and long-term actions justified by business value. Can a temporary increase in the alert threshold
be a feasible fix? Does the budget need to be increased longer-term? Any increase in budget must be approved.
If the alert was caused because of unnecessary or expensive resources, you can implement additional Azure Policy
controls. You can also add budget automation to trigger resource scaling or shutdowns.
Revise budgets
After you identify and analyze your spending patterns, you can set budget limits for applications or business units.
You'll want to assign access to view or manage each budget to the appropriate groups. Setting several alert
thresholds for each budget can help track your burn down rate.
Generate cost reports
11/2/2020 • 3 minutes to read • Edit Online
To monitor the cost of the workload, use Azure cost tools or custom reports. The reports can be scoped to business
units, applications, IT infrastructure shared services, and so on. Make sure that the information is consistently
shared with the stakeholders.
NOTE
There are many ways of purchasing Azure services. Not all are supported by Azure Cost Management. For example, detailed
billing information for services purchased through a Cloud Solution Provider (CSP) must be obtained directly from the CSP.
For more information about the supported cost data, see Understand cost management data.
Advisor recommendations
Azure Advisor recommendations for cost can highlight the over-provisioned services and ways to lower cost. For
example, the virtual machines that should be resized to a lower SKU, unprovisioned ExpressRoute circuits, and idle
virtual network gateways.
For more information, see Advisor cost management recommendations.
Consumption APIs
Granular and custom reports can help track cost over time. Azure provides a set of Consumption APIs to generate
such reports. These APIs allow you to query various cost data and create custom views. Data includes usage data for Azure
services and third-party services through Marketplace, balances, budgets, and recommendations on reserved
instances, among others. You can configure Azure role-based access control policies to allow only a certain set of
users or applications to access the data.
For example, suppose you want to determine the cost of all resources used in your workload for a given period. One way of
getting this data is by querying usage meters and the rate of those meters. You also need to know the billing period
of the usage. By combining these APIs, you can estimate the consumption cost.
Billing account API: To get your billing account to manage your invoices, payments, and track costs.
Billing Periods API: To get billing periods that have consumption data.
Usage Detail API: To get the breakdown of consumed quantities and estimated charges.
Marketplace Store Charge API: To get usage-based marketplace charges for third-party services.
Price Sheet API: To get the applicable rate for each meter.
The result of the APIs can be imported into analysis tools.
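The same usage details are also exposed through the Azure CLI, which can be a quick way to prototype a report before wiring up the REST APIs. A sketch with placeholder dates; the field names follow the Consumption usage detail schema and may vary by API version.
# List usage detail records for a billing window and show cost per resource.
az consumption usage list \
  --start-date 2020-11-01 \
  --end-date 2020-11-30 \
  --query "[].{resource:instanceName, cost:pretaxCost, currency:currency}" \
  --output table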
NOTE
Consumption APIs are supported for Enterprise Enrollments and Web Direct Subscriptions (with exceptions). Check
Consumption APIs for updates to determine support for other types of Azure subscriptions.
For more information about common cost scenarios, see Billing automation scenarios.
Custom scripts
Use Azure APIs to schedule custom scripts that identify orphaned or empty resources. For example, unattached
managed disks, load balancers, application gateways, or Azure SQL Servers with no databases. These resources
incur a flat monthly charge while unused. Other resources may be stale, for example VM diagnostics data in blob or
table storage. To determine if the item should be deleted, check its last use and modification timestamps.
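For example, a hedged sketch of such a script using the Azure CLI: unattached managed disks have no managedBy value, so a JMESPath filter can surface them for review (review before deleting anything).
# List managed disks that are not attached to any VM (candidates for cleanup).
az disk list \
  --query '[?managedBy==`null`].{Name:name, ResourceGroup:resourceGroup, SizeGB:diskSizeGb}' \
  --output table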
NOTE
Sharing requires Power BI Premium licenses.
For more information, see Create visuals and reports with the Azure Cost Management connector in Power BI
Desktop.
You can also use the Cost Management app, which is based on the Azure Cost Management Template app for Power BI. You
can import and analyze usage data and incurred cost within Power BI.
Conduct cost reviews
12/18/2020 • 2 minutes to read • Edit Online
The goal of cost monitoring is to review cloud spend with the intent of establishing cost controls and preventing
any misuse. The organization should adopt both proactive and reactive review approaches for monitoring cost.
Effective cost reviews must be conducted by accountable stakeholders at a regular cadence. The reviews must be
complemented with reactive cost reviews, for example when a budget limit causes an alert.
Who should be included in a cost review?
The financial stakeholders must understand cloud billing so they can use financial metrics to derive business benefits
and make effective decisions.
Also include members of the technical team. Application owners, systems administrators who monitor and back up
cloud systems, and business unit representatives must be aware of the decisions. They can provide insights
because they understand cloud cost metering and cloud architecture. One way of identifying owners of systems or
applications is through resource tags.
What should be the cadence of cost reviews?
Cost reviews can be conducted as part of the regular business reviews. It’s recommended that such reviews are
scheduled:
During the billing period. This review is to create an awareness of the estimated pending billing. These reports
can be based on Azure Advisor and Azure Cost Management – Cost analysis.
After the billing period. This review shows the actual cost with activity for that month. Use Balance APIs to
generate monthly reports. The APIs can query data that gets information on balances, new purchases, Azure
Marketplace service charges, adjustments, and overage charges.
Because of a budget alert or Azure Advisor recommendations.
Web Direct (pay-as-you-go) and Cloud Solution Provider (CSP) billing occurs monthly. While Enterprise Agreement
(EA) billing occurs annually, costs should still be reviewed monthly.
Checklist - Optimize cost
11/2/2020 • 2 minutes to read • Edit Online
Continue to monitor and optimize the workload by using the right resources and sizes. Use this checklist to
optimize a workload.
Review the underutilized resources. Evaluate CPU utilization and network throughput over time to
check if the resources are used adequately. Azure Advisor identifies underutilized virtual machines. You can
choose to decommission, resize, or shut down the machine to meet the cost requirements.
Resize virtual machines
Shut down the underutilized instances
Continuously take action on the cost reviews. Treat cost optimization as a process, rather than a point-
in-time activity. Use tooling in Azure that provides recommendations on usage or cost optimization. Review
the cost management recommendations and take action. Make sure that all stakeholders are in agreement
about the implementation and timing of the change.
Recommended tab in the Azure portal
Recommendations in the Cost Management Power BI app
Recommendations in Azure Advisor
Recommendations using Reservation REST APIs
Use reserved instances on long-running workloads. Reserve prepaid capacity for a period, generally
one or three years. With reserved instances, there’s a significant discount when compared with pay-as-you-
go pricing.
Reserved instances
Use discount prices. These methods of buying Azure resources can lower costs.
Azure Hybrid Use Benefit
Azure Reservations
There are also payment plans offered at a lower cost:
Microsoft Azure Enterprise Agreement
Enterprise Dev Test Subscription
Cloud Service Provider (Partner Program)
Have a scale-in and scale-out policy. In a cost-optimized architecture, costs scale linearly with demand.
An increasing customer base shouldn't require more investment in infrastructure. Conversely, if demand drops,
scale down unused resources. Autoscale Azure resources when possible.
Autoscale instances
Reevaluate design choices. Analyze the cost reports and forecast the capacity needs. You might need to
change some design choices.
Choose the right storage tier. Consider using the hot, cool, and archive tiers for storage account data.
Storage accounts can provide automated tiering and lifecycle management. For more information,
see Review your storage options.
Choose the right data store. Instead of using one data store service, use a mix of data stores
depending on the type of data you need to store for each workload. For more information, see
Choose the right data store.
Choose Spot VMs for low-priority workloads. Spot VMs are ideal for workloads that can be
interrupted, such as highly parallel batch processing jobs.
Spot VMs
Optimize data transfer. Only deploy to multiple regions if your service levels require it for either
availability or geo-distribution. Data going out of Azure datacenters can add cost because pricing is
based on Billing Zones.
Traffic across billing zones and regions
Reduce load on servers. Use Azure Content Delivery Network (CDN) and caching services to
reduce load on front-end servers. Caching is suitable for servers that continually render
dynamic content that doesn't change frequently.
Use managed services. Measure the cost of maintaining infrastructure and replace it with Azure
PaaS or SaaS services.
Managed services
Autoscale instances
11/2/2020 • 2 minutes to read • Edit Online
In Azure, it’s easier to grow a service with little to no downtime than to downscale a service, which usually
requires deprovisioning or downtime. In general, opt for scaling out instead of scaling up.
For certain applications, capacity requirements may swing over time. Autoscaling policies allow for less error-prone
operations and cost savings through robust automation.
Choose smaller instances where the workload is highly variable, and scale out to get the desired level of
performance rather than up. This choice enables you to make your cost calculations and estimates granular.
Stateless applications
Many Azure services can be used to improve the application's ability to scale dynamically, even where it may not
have been originally designed to do so.
For example, many stateful ASP.NET web applications can be made stateless and then autoscaled, which
results in a cost benefit. You can store state using Azure Cache for Redis, or Cosmos DB as a back-end session state store
through a Session State Provider.
Reserved instances
11/2/2020 • 2 minutes to read • Edit Online
Azure Reservations are offered on many services as a way to lower cost. You reserve a prepaid capacity for a
period, generally one or three years. With reserved instances, there’s a significant discount when compared with
pay-as-you-go pricing. You can pay up front or monthly; the total cost is the same either way.
Usage pattern
Before opting for reserved instances, analyze the usage data with pay-as-you-go prices over a time duration.
Do the services in the workload run intermittently or follow long-term patterns?
Azure provides several options that can help analyze usage and make recommendations by comparing
reservations prices with pay-as-you-go price. An easy way is to view the Recommended tab in the Azure portal.
Azure Advisor also provides recommendations that are applicable to an entire subscription. If you need a
programmatic way, use the Reservation Recommendations REST APIs.
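As one programmatic option, Azure Advisor's cost recommendations can also be pulled from the Azure CLI, which can be handy for a quick review before committing to a reservation:
# List Advisor cost recommendations (includes reserved-instance purchase suggestions).
az advisor recommendation list \
  --category Cost \
  --output table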
Reserved instances can lower cost for long running workloads. For intermittent workloads, prepaid capacity might
go unused and it doesn't carry over to the next billing period. Usage exceeding the reserved quantity is charged
using more expensive pay-as-you-go rates. But there are some flexible options. You can exchange or refund a
reservation within limits. For more information, see Self-service exchanges and refunds for Azure Reservations.
Scope
Reservations can be applied to a specific scope: subscription, resource group, or a single resource. If you
set the scope to a resource group, the reservation savings apply to the applicable resources provisioned in
that group. For more information, see Scope reservations.
Virtual machines are deployed in fixed-size blocks. These VMs must be adequately sized to meet capacity
demand and reduce waste.
For example, consider a VM running SAP on Azure that was initially sized to match the existing hardware server,
at a cost of around €1K per month, yet the real utilization of the VM was no more than 25%. Simply choosing the
right VM size in the cloud can achieve around 75% savings (resize savings). By also applying snoozing (shutting the
VM down when it isn't needed), you can save an additional 14%.
Cost comparison is easier when you're well equipped, and Microsoft provides a set of services and tools to help you
understand and plan costs. These include the TCO Calculator, Azure Pricing
Calculator, Azure Cost Management (Cloudyn), Azure Migrate, Cosmos DB Sizing Calculator, and the Azure Site
Recovery Deployment Planner.
Here are some strategies that you can use to lower cost for virtual machines.
Determine the load by analyzing the CPU utilization to make sure that the instance is adequately utilized.
Ideally, with the right size, the current load should fit in a lower SKU of the same tier. Another way is to lower the
number of instances and still keep the load below a reasonable utilization. Azure Advisor recommends keeping utilization
below 80% for non-user-facing workloads and below 40% for user-facing workloads. It also provides current and
target SKU information.
You can identify underutilized machines by adjusting the CPU utilization rule on each subscription.
Resizing a virtual machine does require the machine to be shut down and restarted. There might be a period of
time when it will be unavailable. Make sure you carefully time this action for minimal business impact.
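As an illustrative sketch (VM and SKU names are placeholders), resizing and snoozing can both be done from the Azure CLI; a deallocated VM stops accruing compute charges.
# Resize an underutilized VM to a smaller SKU in the same tier.
az vm resize \
  --resource-group myResourceGroup \
  --name myVM \
  --size Standard_B2ms

# "Snooze" the VM outside business hours; deallocation stops compute billing.
az vm deallocate --resource-group myResourceGroup --name myVM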
Spot VMs
Some workloads don't have a business requirement to complete a job within a period.
Can the workload be interrupted?
Spot VMs are ideal for workloads that can be interrupted, such as highly parallel batch processing jobs. These VMs
take advantage of the surplus capacity in Azure at a lower cost. They're also well suited for experimental,
development, and testing of large-scale solutions.
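A hedged Azure CLI sketch for creating a Spot VM (names are placeholders). Capping the max price at -1 means the VM won't be evicted for price reasons, only when Azure needs the capacity back.
# Create a Spot VM that is deallocated (not deleted) on eviction.
az vm create \
  --resource-group myResourceGroup \
  --name mySpotVM \
  --image UbuntuLTS \
  --priority Spot \
  --max-price -1 \
  --eviction-policy Deallocate \
  --admin-username azureuser \
  --generate-ssh-keys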
For more information, see Use Spot VMs in Azure.
Reserved VMs
Virtual machines are eligible for Azure Reservations. You can prepay for VM instances if you can commit to one or
three years. Reserved instances are appropriate for workloads that have a long-term usage pattern.
The discount only applies to compute and not the other meters used to measure usage for VMs. The discount can
be extended to other services that emit VM usage, such as Virtual machine scale sets and Container services, to
name a few. For more information, see Software costs not included with Azure Reserved VM Instances and Services
that get VM reservation discounts.
With reserved instances, you need to determine the VM size to buy. Analyze usage data using Reservations
Consumption APIs and follow the recommendations of the Azure portal and Azure Advisor to determine the size.
Reservations also apply to dedicated hosts. The discount is applied to all running hosts that match the reservation
scope and attributes. An important consideration is the SKU for the host. When selecting a SKU, choose the VM
series and type eligible to be a dedicated host. For more information, see Azure Dedicated Hosts pricing.
For information about discounts on virtual machines, see How the Azure reservation discount is applied to virtual
machines.
Caching data
11/2/2020 • 2 minutes to read • Edit Online
Caching is a strategy where you store a copy of the data in front of the main data store. The cache store is typically
located closer to the consuming client than the main store. Advantages of caching include faster response times
and the ability to serve data quickly. In doing so, you can save on the overall cost. Be sure to assess the built-in
caching features of Azure services used in your architecture. Azure also offers caching services, such as Azure
Cache for Redis or Azure CDN.
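For example, a minimal sketch of provisioning an Azure Cache for Redis instance with the Azure CLI (names, location, and size are illustrative); the SKU and size drive the cost, so start small and scale only as cache hit rates justify it.
# Create a small Basic-tier cache; move to Standard or Premium only if you need an SLA or clustering.
az redis create \
  --resource-group myResourceGroup \
  --name my-workload-cache \
  --location eastus \
  --sku Basic \
  --vm-size c0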
For information about what type of data is suitable for caching, see Caching.
As you design the workload, consider tradeoffs between cost optimization and other aspects of the design, such as
security, scalability, resilience, and operability.
What is most important for the business: lowest cost, no downtime, high throughput?
An optimal design doesn’t equate to a low-cost design. There might be risky choices made in favor of a cheaper
solution.
Overall Service Level Agreement (SLA), Recovery Time Objective (RTO), and Recovery Point Objective (RPO) requirements
may lead to expensive design choices. If your service SLA, RTO, and RPO times are short, then higher investment is
inevitable for high availability and disaster recovery options.
For example, to support high availability, you choose to host the application across regions. This choice is costlier
than a single region because of replication costs or the need to provision extra nodes. Data transfer between
regions will also add cost.
If the cost of high availability exceeds the cost of downtime, you can save by using Azure platform-managed
replication and recover data from the backup storage.
For resiliency, availability, and reliability considerations, see the Reliability pillar.
This pillar covers the operations processes that keep an application running in production. Deployments must be
reliable and predictable. They should be automated to reduce the chance of human error. They should be a fast and
routine process, so they don't slow down the release of new features or bug fixes. Equally important, you must be
able to quickly roll back or roll forward if an update has problems.
To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, see the
Microsoft Azure Well-Architected Review.
These are the disciplines we group in the operational excellence pillar:
Code deployment: How you deploy your application code is going to be one of the key factors that will determine
your application stability.
Considering and improving how software is developed, deployed, operated, and maintained is one part of
achieving a higher competency in operations. Equally important is providing a team culture of experimentation and
growth, solutions for rationalizing the current state of operations, and incident response plans. The principles of
operational excellence are a series of considerations that can help achieve excellent operational practices.
To assess your workload using the tenets found in the Azure Well-Architected Framework, see the Microsoft Azure
Well-Architected Review.
Optimize build and release processes
From provisioning with Infrastructure as Code, building and releasing with CI/CD pipelines, automated testing, and
embracing software engineering disciplines across your entire environment. This approach ensures the creation
and management of environments throughout the software development lifecycle is consistent, repeatable, and
enables early detection of issues.
Monitor the entire system and understand operational health
Implement systems and processes to monitor build and release processes, infrastructure health, and application
health. Telemetry is critical to understanding the health of a workload and whether the service is meeting the
business goals.
Rehearse recovery and practice failure
Run DR drills on a regular cadence and use engineering practices to identify and remediate weak points in
application reliability. Regular rehearsal of failure will validate the effectiveness of recovery processes and ensure
teams are familiar with their responsibilities.
Embrace operational improvement
Continuously evaluate and refine operational procedures and tasks while striving to reduce complexity and
ambiguity. This approach enables an organization to evolve processes over time, optimizing inefficiencies, and
learning from failures.
Use loosely coupled architecture
Enable teams to independently test, deploy, and update their systems on demand without depending on other
teams for support, services, resources, or approvals.
Incident management
When incidents occur, have well thought out plans and solutions for incident management, incident
communication, and feedback loops. Take the lessons learned from each incident and build telemetry and
monitoring elements to prevent future occurrences.
Automation overview: Goals, best practices, and
types
12/18/2020 • 4 minutes to read • Edit Online
Automation has revolutionized how businesses operate and this trend continues to evolve. Businesses have moved
to automating manual processes so that engineers can focus attention on tasks that add business value.
Automating business solutions allows you to:
Activate resources on demand.
Deploy solutions rapidly.
Minimize human error in configuring repetitive tasks.
Produce consistent and repeatable results.
To learn more, see Deployment considerations for automation.
Goals of automation
When automating technical processes and configurations, a common approach for some organizations is to
automate what they can and leave the more difficult processes for humans to perform manually.
A goal of automation is to make tools that do what humans can do, only better. For example, a human can perform
any given task once. But when the task requires repetitive runs, especially over long time periods, an automated
system is better equipped to do this with more predictable, error-free results. Increasing speed is another
goal of automation. When you practice these automation goals, you can build systems that are faster, repeatable, and
can run on a regular basis.
Types of automation
Three types of automation are described in this article: infrastructure deployment, infrastructure configuration, and
operational tasks. These categories share the same goals and best practices mentioned previously. They differ in
areas where Azure provides solutions that help achieve optimal automation. Other types of automation such as
continuous deployment and continuous delivery are described further in the Operational Excellence pillar.
Infrastructure deployment
As businesses move to the cloud, they need to repeatedly deploy their solutions and know that their infrastructure
is in a reliable state. To meet these challenges, you can automate deployments using a practice referred to as
infrastructure as code. In code, you define the infrastructure that needs to be deployed.
Although there are many deployment technologies you can use with Azure, this article describes two of the more
popular technologies:
Azure Resource Manager (ARM) templates
Terraform
These technologies use a declarative approach. This lets you state what you intend to deploy without having to
write the sequence of programming commands to create it. You can deploy not only virtual machines, but also the
network infrastructure, storage systems, and any other resources you may need.
Infrastructure configuration
If you don't manage configuration carefully, your business could encounter disruptions such as systems outages
and security issues. Optimal configuration can enable you to quickly detect and correct configurations that could
interrupt or slow performance.
Configurations such as installing software on a virtual machine, adding data to a database, or starting pods in an
Azure Kubernetes Service cluster can't be performed through the exposed REST APIs; they need access to an endpoint
that is specific to your instance. Configuration tools therefore use agents, networking, or other access methods
to provide resource-specific configuration options.
For example, when deploying to Azure, you may need to run post-deployment virtual machine configuration or run
other arbitrary code to bootstrap the deployed Azure resources. Another example is configuration tools that can be
used to configure and manage the ongoing configuration and state of Azure virtual machines.
Operational tasks
As the demand for speed in performing operational tasks increases over time, you are expected to deliver things
faster and faster. The old way of manually performing operational tasks won't scale to that type of demand now.
This is where automation can help. To meet on-demand delivery using an automation platform, you need to
develop automation components (such as runbooks and configurations), efficiently create integrations to systems
that are already in place, and operate and troubleshoot the automation.
Advantages of automating operational tasks include:
Optimize and extend existing processes (for example, using a PowerShell module or REST API).
Deliver flexible and reliable services.
Lower costs.
Improve predictability.
Two popular options for automating operational tasks are:
Azure Functions - Run code without managing the underlying infrastructure where the code runs.
Azure Automation - Automate operational tasks in code, using a programming language, and execute them on
demand.
For more information, see Automation. To see a Microsoft Ignite video, see Automating Operational and
Management Tasks.
Next steps
Automate Repeatable Infrastructure
Repeatable Infrastructure
12/18/2020 • 4 minutes to read • Edit Online
Historically, deploying a new service or application involves manual work such as procuring and preparing
hardware, configuring operating environments, and enabling monitoring solutions. Ideally, an organization would
have multiple environments in which to test deployments. These test environments should be similar enough to
production that deployment and run time issues are detected before deployment to production. This manual work
takes time, is error-prone, and can produce inconsistencies between the environments if not done well.
Cloud computing changes the way we procure infrastructure. No longer are we unboxing, racking, and cabling
physical infrastructure. We have internet accessible management portals and REST interfaces to help us. We can
now provision virtual machines, databases, and other cloud services on demand and globally. When we no longer
need cloud services, they can be easily deleted. However, cloud computing alone does not remove the effort and
risk in provisioning infrastructure. When using a cloud portal to build systems, many of the same manual
configuration tasks remain. Application servers require configuration, databases need networking, and firewalls
need firewalling.
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "storageName": {
      "type": "string",
      "defaultValue": "newstorageaccount"
    }
  },
  "resources": [
    {
      "name": "[parameters('storageName')]",
      "type": "Microsoft.Storage/storageAccounts",
      "apiVersion": "2019-06-01",
      "location": "[resourceGroup().location]",
      "kind": "StorageV2",
      "sku": {
        "name": "Premium_LRS",
        "tier": "Premium"
      }
    }
  ]
}
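Assuming the template above is saved as storage.json, it could be deployed with the Azure CLI; the resource group and storage account name below are placeholders (storage account names must be globally unique and lowercase).
az deployment group create \
  --resource-group myResourceGroup \
  --template-file storage.json \
  --parameters storageName=mystorageacct001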
Learn more
Documentation: What are ARM templates
Learn module: Deploy consistent infrastructure with ARM Templates
Code Samples: ARM templates
Take note: the Terraform provider for Azure is an abstraction on top of the Azure APIs. This abstraction is beneficial
because the API complexities are hidden. It also comes at a cost; the Terraform provider for Azure
does not always provide parity with the Azure APIs' capabilities.
Learn more
Documentation: Using Terraform on Azure
Next steps
Automate infrastructure configuration
Configure infrastructure
12/18/2020 • 6 minutes to read • Edit Online
When working with Azure, many services can be created and configured programmatically using automation or
infrastructure as code tooling. These tools access Azure through the exposed REST APIs or what we refer to as the
Azure control plane. For example, an Azure Network Security Group can be deployed, and security group rules
created using an Azure Resource Manager template. The Network Security Group and its configuration are exposed
through the Azure control plane and natively accessible.
Other configurations, such as installing software on a virtual machine, adding data to a database, or starting pods
in an Azure Kubernetes Service cluster cannot be accessed through the Azure control plane. These actions require a
different set of configuration tools. We consider these configurations to be on the Azure data plane side, that is, not
exposed through Azure REST APIs. The data plane enables tools to use agents, networking, or other access
methods to provide resource-specific configuration options.
For example, when deploying a set of virtual machines to Azure, you may also want to install and configure a web
server, stage content, and then make the content available on the internet. Furthermore, if the virtual machine
configuration changes and no longer aligns with the configuration definition, you may want a configuration
management system to remediate the configuration. Many options are available for these data plane
configurations. This document details several and provides links for in-depth information.
Bootstrap automation
When deploying to Azure, you may need to run post-deployment virtual machine configuration or run other
arbitrary code to bootstrap the deployed Azure resources. Several options are available for these bootstrapping
tasks and are detailed in the following sections of this document.
Azure VM extensions
Azure virtual machine extensions are small packages that run post-deployment configuration and automation on
Azure virtual machines. Several extensions are available for many different configuration tasks, such as running
scripts, configuring antimalware solutions, and configuring logging solutions. These extensions can be installed and
run on virtual machines using an ARM template, the Azure CLI, Azure PowerShell module, or the Azure portal. Each
Azure VM has a VM Agent installed, and this agent manages the lifecycle of the extension.
A typical VM extension use case would be to use a custom script extension to install software, run commands, and
perform configurations on a virtual machine or virtual machine scale set. The custom script extension uses the
Azure virtual machine agent to download and execute a script. The custom script extensions can be configured to
run as part of infrastructure as code deployments such that the VM is created, and then the script extension is run
on the VM. Extensions can also be run outside of an Azure deployment using the Azure CLI, PowerShell module, or
the Azure portal.
In the following example, the Azure CLI is used to deploy a custom script extension to an existing virtual machine,
which installs an NGINX web server.
az vm extension set \
--resource-group myResourceGroup \
--vm-name myVM --name customScript \
--publisher Microsoft.Azure.Extensions \
--settings '{"commandToExecute": "apt-get install -y nginx"}'
Learn more
Use the included code sample to deploy a virtual machine and configure a web server on that machine with the
custom script extension.
Documentation: Azure virtual machine extensions
Code Samples: Configure VM with script extension during Azure deployment
cloud-init
cloud-init is a known industry tool for configuring Linux virtual machines on first boot. Much like the Azure custom
script extension, cloud-init allows you to install packages and run commands on Linux virtual machines. cloud-init
can be used for things like software installation, system configuration, and content staging. Azure includes many
cloud-init-enabled Marketplace virtual machine images across many of the most well-known Linux distributions. For
a full list, see cloud-init support for virtual machines in Azure.
To use cloud-init, create a text file named cloud-init.txt and enter your cloud-init configuration. In this example, the
Nginx package is added to the cloud-init configuration.
#cloud-config
package_upgrade: true
packages:
- nginx
Create the virtual machine, specifying the --custom-data property with the cloud-init configuration file name.
az vm create \
--resource-group myResourceGroupAutomate \
--name myAutomatedVM \
--image UbuntuLTS \
--admin-username azureuser \
--generate-ssh-keys \
--custom-data cloud-init.txt
On boot, cloud-init will use the system's native package management tool to install NGINX.
Learn more
Documentation: cloud-init support for virtual machines in Azure
Azure deployment script resource
When performing Azure deployments, you may need to run arbitrary code for bootstrapping things like managing
user accounts, Kubernetes pods, or querying data from a non-Azure system. Because none of these operations are
accessible through the Azure control plane, some other mechanism is required for performing this automation. To
run arbitrary code with an Azure deployment, check out the Microsoft.Resources/deploymentScripts Azure
resource.
The deployment script resource behaves similarly to any other Azure resource. It:
Can be used in an ARM template.
Can contain ARM template dependencies on other resources.
Can consume input and produce output.
Uses a user-assigned managed identity for authentication.
When deployed, the deployment script runs PowerShell or Azure CLI commands and scripts. Script execution and
logging can be observed in the Azure portal or with the Azure CLI and PowerShell module. Many options can be
configured like environment variables for the execution environment, timeout options, and what to do with the
resource after a script failure.
The following example shows an ARM template snippet with the deployment script resource configured to run a
PowerShell script.
{
  "type": "Microsoft.Resources/deploymentScripts",
  "apiVersion": "2019-10-01-preview",
  "name": "runPowerShellScript",
  "location": "[resourceGroup().location]",
  "kind": "AzurePowerShell",
  "identity": {
    "type": "UserAssigned",
    "userAssignedIdentities": {"[parameters('identity')]": {}}
  },
  "properties": {
    "forceUpdateTag": "1",
    "azPowerShellVersion": "3.0",
    "arguments": "[concat('-sqlServer ', parameters('sqlServer'))]",
    "primaryScriptUri": "[variables('script')]",
    "timeout": "PT30M",
    "cleanupPreference": "OnSuccess",
    "retentionInterval": "P1D"
  }
}
Learn more
Documentation: Use deployment scripts in templates
Configuration Management
Configuration management tools can be used to configure and manage the ongoing configuration and state of
Azure virtual machines. Three popular options are Azure Automation State Configuration, Chef, and Puppet.
Azure Automation State Configuration
Azure Automation State Configuration is a configuration management solution built on top of PowerShell Desired
State Configuration (DSC). State configuration works with Azure virtual machines, on-premises machines, and
machines in a cloud other than Azure. Using state configuration, you can import PowerShell DSC resources and
assign them to many virtual machines from a central location. Once each endpoint has evaluated and/or applied the desired
state, state compliance is reported to Azure and can be seen on a built-in dashboard.
The following example uses PowerShell DSC to ensure that NGINX is installed on Linux systems.
configuration linuxpackage {
    Import-DSCResource -Module nx

    Node "localhost" {
        nxPackage nginx {
            Name   = "nginx"
            Ensure = "Present"
        }
    }
}
Once imported into Azure State Configuration and assigned to nodes, the state configuration dashboard provides
compliance results.
Learn more
Use the included code sample to deploy Azure Automation State Configuration and several Azure virtual machines.
The virtual machines are also onboarded to state configuration, and a configuration applied.
Documentation: Get started with Azure Automation State Configuration
Code Sample: Deploy DSC and VMs with an ARM template
Chef
Chef is an automation platform that helps define how your infrastructure is configured, deployed, and managed.
Additional components include Chef Habitat, for application lifecycle automation rather than infrastructure, and
Chef InSpec, which helps automate compliance with security and policy requirements. Chef Clients are installed
on target machines, with one or more central Chef Servers that store and manage the configurations.
Learn more
Documentation: An Overview of Chef
Puppet
Puppet is an enterprise-ready automation platform that handles the application delivery and deployment process.
Agents are installed on target machines to allow Puppet Master to run manifests that define the desired
configuration of the Azure infrastructure and virtual machines. Puppet can integrate with other solutions such as
Jenkins and GitHub for an improved DevOps workflow.
Learn more
Documentation: How Puppet works
Next steps
Automate operational tasks
Automate operational tasks
12/18/2020 • 7 minutes to read • Edit Online
Operational tasks can include any action or activity you may perform while managing systems, system access, and
processes. Some examples include rebooting servers, creating accounts, and shipping logs to a data store. These
tasks may occur on a schedule, as a response to an event or monitoring alert, or ad hoc based on external factors.
Like many other activities related to managing computer systems, these activities are often performed manually,
which takes time and is error-prone.
Many of these operational tasks can and should be automated. Using scripting technologies and related solutions,
you can shift effort from manually performing operational tasks towards building automation for these tasks. In
doing so, you gain several benefits, including:
Reduced time to perform an action.
Reduced risk in performing the action.
Automated response to events and alerts.
Increased human capacity for further innovation.
When working in Azure, you have many options for automating operational tasks. This document details some of
the more popular.
Azure Functions
Azure Functions allows you to run code without managing the underlying infrastructure on where the code is run.
Functions provide a cost-effective, scalable, and event-driven platform for building applications and running
operational tasks. Functions support running code written in C#, Java, JavaScript, Python, and PowerShell.
When creating a Function, you select a hosting plan. The hosting plan controls how a function app is scaled, resource
availability, and availability of advanced features such as virtual network connectivity and startup time. The hosting
plan also influences the cost.
Functions hosting plans:
Consumption - Default hosting plan, pay only for Function execution time, configurable timeout period,
automatic scale.
Premium - Faster start, VNet connectivity, unlimited execution duration, premium instance sizes, more
predictable pricing.
App Service Plan - Functions run on dedicated virtual machines and can use custom images.
For full details on consumption plans, see Azure Functions scale and hosting.
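As a hedged sketch (names are placeholders), creating a PowerShell function app on the Consumption plan with the Azure CLI looks like the following; the linked storage account is required by the Functions runtime.
# Create a Consumption-plan function app that runs PowerShell.
az functionapp create \
  --resource-group myResourceGroup \
  --consumption-plan-location eastus \
  --name my-ops-functions \
  --storage-account myopsstorageacct \
  --runtime powershell \
  --functions-version 3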
Functions provide event-driven automation; each function has a trigger associated with it. These triggers are what
run the functions. Common triggers include:
HTTP / Webhook - Function is run when an HTTP request is received.
Queue - Function is run when a new message arrives in a message queue.
Blob storage - Function is run when a new blob is created in a storage container.
Timer - Function is run on a schedule.
Below are example triggers seen in the Azure portal when creating a new function
Once an event has occurred that initiates a Function, you may also want to consume data from this event or from
another source. Once a Function has been completed, you may want to publish or push data to an Azure service
such as Blob Storage. Input and output are achieved using input and output bindings. For more information
about triggers and bindings, see Azure Functions triggers and binding concepts.
Both PowerShell and Python are common languages for automating everyday operational tasks. Because Azure
Functions supports both of these languages, it is an excellent platform for hosting, running, and auditing
automated operational tasks. For example, let's assume that you would like to build a solution to facilitate self-
service account creation. An Azure PowerShell Function could be used to create the account in Azure Active
Directory. An HTTP trigger can be used to initiate the Function, and an input binding configured to pull the account
details from the HTTP request body. The only remaining piece would be a solution that consumes the account
details and creates the HTTP requests against the Function.
Learn more
Documentation: Azure Functions PowerShell developer guide
Documentation: Azure Functions Python developer guide
Azure Automation
PowerShell and Python are popular programming languages for automating operational tasks. Using these
languages, performing operations like restarting services, moving logs between data stores, and scaling
infrastructure to meet demand can be expressed in code and executed on demand. Alone, these languages do not
offer a platform for centralized management, version control, or execution history. The languages also lack a native
mechanism for responding to events like monitoring driven alerts. To provide these capabilities, an automation
platform is needed.
Azure Automation provides an Azure-hosted platform for hosting and running PowerShell and Python code across
Azure, non-Azure cloud, and on-premises environments. PowerShell and Python code is stored in an Azure
Automation Runbook, which has the following attributes:
Execution on demand, on a schedule, or through a webhook.
Execution history and logging.
Integrated secrets store.
Source Control integration.
As seen in the following image, Azure Automation provides a portal experience for managing Azure Automation
Runbooks. Use the included code sample (ARM template) to deploy an Azure automation account, automation
runbook, and explore Azure Automation for yourself.
Learn more
Documentation: What is Azure Automation?
Scale operations
So far, this document has detailed options for scripting operational tasks; however, many Azure services come with
built-in automation, particularly in scale operations. As application demand increases (transactions, memory
consumption, and compute availability), you may need to scale the application hosting platform so that requests
continue to be served. As demand decreases, scaling back not only appropriately resizes your application but also
reduces operational cost.
In cloud computing, scale activities are classified into two buckets:
Scale-up - Adding additional resources to an existing system to meet demand.
Scale-out - Adding additional infrastructure to meet demand.
Many Azure services can be scaled up by changing the pricing tier of that service. Generally, this operation would
need to be performed manually or using detection logic and custom automation.
Some Azure services support automatic scale-out, which is the focus of this section.
Azure Monitor autoscale
Azure Monitor autoscale can be used to autoscale Virtual Machine Scale Sets, Cloud Services, App Service Web
Apps, and API Management services. To configure scale-out operations for these services in the Azure portal,
select the service, and then select scale-out under the resource settings. Select Custom to configure autoscaling rules.
Automatic scale operations can also be configured using an Azure Resource Manager Template, the Azure
PowerShell module, and the Azure CLI.
When creating the autoscale rules, configure minimum and maximum instance counts. These settings prevent
inadvertent costly scale operations. Next, configure autoscale rules, at minimum one to add more instances, and
one to remove instances when no longer needed. Azure Monitor autoscale rules give you fine-grain control over
when a scale operation is initiated. See the Learn more section below for more information on configuring these
rules.
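The following Azure CLI sketch (resource names and thresholds are illustrative) configures autoscale for a virtual machine scale set with a floor of 2 instances, a ceiling of 10, and a pair of CPU-based rules.
# Create the autoscale setting with safe minimum and maximum instance counts.
az monitor autoscale create \
  --resource-group myResourceGroup \
  --resource myScaleSet \
  --resource-type Microsoft.Compute/virtualMachineScaleSets \
  --name myAutoscaleSetting \
  --min-count 2 --max-count 10 --count 2

# Scale out by one instance when average CPU exceeds 70% over 5 minutes.
az monitor autoscale rule create \
  --resource-group myResourceGroup \
  --autoscale-name myAutoscaleSetting \
  --condition "Percentage CPU > 70 avg 5m" \
  --scale out 1

# Scale back in when average CPU drops below 30% over 5 minutes.
az monitor autoscale rule create \
  --resource-group myResourceGroup \
  --autoscale-name myAutoscaleSetting \
  --condition "Percentage CPU < 30 avg 5m" \
  --scale in 1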
Learn more
Documentation: Azure Monitor autoscale overview
Azure Kubernetes Service
Azure Kubernetes Service (AKS) offers an Azure managed Kubernetes cluster. When considering scale operations in
Kubernetes, there are two components:
Pod scaling - Increase or decrease the number of load balanced pods to meet application demand.
Node scaling - Increase or decrease the number of cluster nodes to meet cluster demand.
Azure Kubernetes Service includes automation to facilitate both of these scale operation types.
Horizontal pod autoscaler
Horizontal pod autoscaler (HPA) monitors resource demand and automatically scales pod replicas. When
configuring horizontal pod autoscaling, you provide the minimum and maximum pod replicas that a cluster can
run and the metrics and thresholds that initiate the scale operation. To use horizontal pod autoscaling, each pod
must be configured with resource requests and limits, and a HorizontalPodAutoscaler Kubernetes object must
be created.
The following Kubernetes manifest demonstrates resource requests on a Kubernetes pod and also the definition of
a horizontal pod autoscaler object.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: azure-vote-back
spec:
  replicas: 1
  selector:
    matchLabels:
      app: azure-vote-back
  template:
    metadata:
      labels:
        app: azure-vote-back
    spec:
      nodeSelector:
        "beta.kubernetes.io/os": linux
      containers:
      - name: azure-vote-back
        image: redis
        # Resource requests and limits
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 250m
            memory: 256Mi
        ports:
        - containerPort: 6379
          name: redis
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: azure-vote-back-hpa
spec:
  maxReplicas: 10 # define max replica count
  minReplicas: 3 # define min replica count
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: azure-vote-back
  targetCPUUtilizationPercentage: 50 # target CPU utilization
Cluster autoscaler
Where horizontal pod autoscaling is a response to demand on a specific application or service running in a
Kubernetes cluster, cluster autoscaling responds to demand on the entire Kubernetes cluster itself. If a Kubernetes
cluster does not have enough compute resources or nodes to facilitate all requested pods' resource requests, some
of these pods will enter a non-scheduled or pending state. In response to this situation, additional nodes can be
automatically added to the cluster. Conversely, once compute resources have been freed up, the cluster nodes can
automatically be removed to match steady-state demand.
Cluster autoscaler can be configured when creating an AKS cluster. The following example demonstrates this
operation with the Azure CLI. This operation can also be completed with an Azure Resource Manager template.
az aks create \
--resource-group myResourceGroup \
--name myAKSCluster \
--node-count 1 \
--vm-set-type VirtualMachineScaleSets \
--load-balancer-sku standard \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 3
Cluster autoscaler can also be configured on an existing cluster using the following Azure CLI command.
az aks update \
--resource-group myResourceGroup \
--name myAKSCluster \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 3
See the included documentation for information on fine-grained control over cluster autoscale operations.
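As one example, on a cluster that already has the cluster autoscaler enabled, the node count range can be adjusted later with the Azure CLI. The resource names below reuse the earlier examples and are illustrative.

# Adjust the node count range on a cluster with the autoscaler already enabled
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --update-cluster-autoscaler \
  --min-count 1 \
  --max-count 5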
Learn more
Documentation: Horizontal pod autoscaling
Documentation: AKS cluster autoscaler
Release Engineering: Application Development
12/18/2020 • 5 minutes to read • Edit Online
One of the primary goals of adopting modern release management strategies is to build systems that allow your
teams to turn ideas into production delivered software with as little friction as possible. Throughout this section of
the Well-Architected Framework, methods and tools for quickly and reliably delivering software are examined. You
will learn about things like continuous deployment software, integration strategies, and deployment environments.
Samples are provided to help you quickly get hands-on with this technology.
However, release engineering does not start with fancy deployment software, multiple deployment environments,
or Kubernetes clusters. Before examining how we can quickly and reliably release software, we need to first look at
how software is developed. Not only has the introduction of cloud computing had a significant impact on how
software is delivered and run, but it's also had a huge downstream impact on how software is developed. For
example, the introduction of container technology has changed how we can host, scale, and deprecate software.
That said, containers have also impacted things like dependency management, host environment, and tooling as we
develop software.
This article details many practices that you may want to consider when building strategies for developing for the
cloud. Topics include:
Development environments, or where you write your code.
Source control and branching strategies, how you manage, collaborate on, and eventually deploy your code.
Development environments
When developing software for the cloud, or any environment, care needs to be taken to ensure that the
development environment is set up for success. When setting up a development environment, you may consider
questions like the following:
How do I ensure that all dependencies are in place?
How can I best configure my development environment to emulate a production environment?
How do I develop code where service dependencies may exist with code already in production?
The following sections briefly detail technologies that aid during the local, or what is often referred to as the
"inner-loop," development process.
Docker Desktop
Docker Desktop is an application that provides a Docker environment on your development system. Docker
Desktop includes not only the Docker runtime but also application development tools and a local Kubernetes
environment. Using Docker Desktop, you can develop in any language, create and test container images, and, when
ready, push application-ready container images to a container registry for production use.
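A minimal sketch of this flow with the Docker CLI follows; the registry and image names are assumptions for this example.

# Build a container image from the Dockerfile in the current directory
docker build -t myregistry.azurecr.io/myapp:v1 .

# Run the image locally to verify it behaves as expected
docker run --rm -p 8080:80 myregistry.azurecr.io/myapp:v1

# When ready, push the image to a container registry for production use
docker push myregistry.azurecr.io/myapp:v1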
Learn more
Docker Desktop
Windows Subsystem for Linux
Many applications and solutions are built on Linux. Windows Subsystem for Linux provides a Linux environment on
your Windows machines, including many command-line tools, utilities, and Linux applications. Multiple GNU/Linux
distributions are available and can be found in the Microsoft Store.
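On recent Windows builds, WSL can be installed from an elevated command prompt; the commands below are a sketch, and the distribution name is one example.

# List the Linux distributions available for installation
wsl --list --online

# Install a specific distribution and use WSL 2 by default
wsl --install -d Ubuntu
wsl --set-default-version 2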
Learn more
Documentation: Windows Subsystem for Linux Documentation
Bridge to Kubernetes
Bridge to Kubernetes allows you to run and debug code on your development system while connected to a
Kubernetes cluster. This configuration can be helpful when working on microservice type architectures. Using
Bridge, you can work locally on one service, which has a dependency on other services. Rather than needing to run
all dependent services on your development system or deploy your in-development code to the cluster, Bridge
manages the communication between your development system and running services in your Kubernetes cluster.
Essentially, the code running on your development system behaves as if it's running in the Kubernetes cluster.
Some features of Bridge:
Works with Azure Kubernetes Service (AKS) and non-AKS clusters (in preview).
Redirects traffic between your Kubernetes cluster and code running on your development system.
Support for isolated traffic routing, which is important in shared clusters.
Replicates environment variables and mounted volumes from your cluster to your development system.
Cluster diagnostic logs are made available on your development system.
Learn more
Documentation: Use Bridge to Kubernetes with Visual Studio Code
Documentation: Use Bridge to Kubernetes with Visual Studio
Source control
Source control management (SCM) systems provide a way to control, collaborate, and peer review software
changes. As software is merged into source control, the system helps manage code conflicts. Ultimately, source
control provides a running history of the software, modifications, and contributors. Whether a piece of software is
open-sourced or private, using source control software has become a standardized method of managing software
development. As detailed in later sections of the Well-Architected Framework, source control systems can also be
enlightened with integrated testing, security, and release practices. As cloud practices are adopted and because so
much of the cloud infrastructure is managed through code, version control systems are also becoming an integral
part of infrastructure management.
Many source control systems are powered by Git. Git is a distributed version control system with related tools that
allow you and your team to track source code changes during the software development lifecycle. Using Git, you
can create a copy of the software, make changes, propose the changes, and receive peer review on your proposal.
During peer review, Git makes it easy to see precisely the changes being proposed. Once the proposed changes
have been approved, Git helps merge the changes into the source, including conflict resolution. If, at any point, the
changes need to be reverted, Git can also manage rollback.
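A minimal sketch of that workflow with the Git command line follows; the repository URL, branch name, and commit reference are assumptions for this example.

# Create a local copy of the repository and start a branch for the change
git clone https://github.com/contoso/sample-app.git
cd sample-app
git checkout -b fix/readme-typo

# Record the change and publish the branch so a pull request can be opened for peer review
git add README.md
git commit -m "Fix typo in README"
git push --set-upstream origin fix/readme-typo

# If a merged change later needs to be reverted, create a commit that undoes it
git revert <commit-sha>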
Let's examine a few aspects of version controlling software and infrastructure configurations.
Version Control and code changes
Beyond providing us with a place to store code, source control systems allow us to understand what version of the
software is current and identify changes between the present and past versions. Version control solutions should
also provide a method for reverting to the previous version when needed.
The following image demonstrates how Git and GitHub are used to see the proposed code changes.
Forking and Pull Requests
Using source control systems, you can create your own copies of the software source called forks. With your fork in
place, changes can be made to the software without the risk that the changes will impact the software's current
version. Once you are happy with the changes made in your fork, you can suggest that your changes be merged
into the main source control through what is known as a pull request.
Peer Review
As updates are made to software and infrastructure configurations, version control software allows us to propose
these changes before merging them into the source. During the proposal, peers can review the changes,
recommend updates, and approve the changes. Source control solutions provide an excellent platform for
collaboration on changes to the software.
To learn more about using Git, visit the DevOps Resource Center.
GitHub
GitHub is a popular source control system that uses Git. In addition to core Git functionality, GitHub includes several
other features, such as access control, collaboration tools such as code issues, project boards, and wikis, and an
automation platform called GitHub Actions.
Learn more
GitHub
Azure Repos
Azure DevOps is a collection of services for building, collaborating on, testing, and delivering software to any
environment. Azure DevOps services include Azure Repos, which is a source control system. Azure Repos
includes unlimited free private Git repositories. Standard Git powers Azure Repos, and you can use the clients and
tools of your choice to work with them.
Learn more
Documentation: Azure Repos
Next steps
Release Engineering: Continuous integration
Release Engineering: Continuous integration
12/18/2020 • 4 minutes to read • Edit Online
As code is developed, updated, or even removed, having a friction-free and safe method to integrate these changes
into the main code branch is paramount to enabling developers to provide value fast. As a developer, making small
code changes, pushing these to a code repository, and getting almost instantaneous feedback on the quality, test
coverage, and introduced bugs allows you to work faster, with more confidence, and less risk. Continuous
integration (CI) is a practice where source control systems and software deployment pipelines are integrated to
provide automated build, test, and feedback mechanisms for software development teams.
Continuous integration is about ensuring that software is ready for deployment but does not include the
deployment itself. This article covers the basics of continuous integration and offers links and examples for more
in-depth content.
Continuous integration
Continuous integration is a software development practice under which developers integrate software updates into
a source control system on a regular cadence. The continuous integration process starts when an engineer creates
a pull request signaling to the CI system that code changes are ready to be integrated. Ideally, integration validates
the code against several baselines and tests and provides quick feedback to the requesting engineer on the status
of these tests. Assuming baseline checks and testing have gone well, the integration process produces and stages
assets such as compiled code and container images that will eventually deploy the updated software.
As a software engineer, continuous integration can help you deliver quality software more quickly by performing
the following:
Run automated tests against the code, providing early detection of breaking changes.
Run code analysis to ensure code standards, quality, and configuration.
Run compliance and security checks to ensure there are no known vulnerabilities.
Run acceptance or functional tests to ensure that the software operates as expected.
Provide quick feedback on detected issues.
Where applicable, produce deployable assets or packages that include the updated code.
To achieve continuous integration, use software solutions to manage, integrate, and automate the process. A
common practice is to use a continuous integration pipeline, detailed in this article's next section.
Test Integration
A key element of continuous integration is the continual building and testing of code as code contributions are
created. Testing pull requests as they are created gives quick feedback that the commit has not introduced breaking
changes. The advantage is that the tests that are run by the continuous integration pipeline can be the same tests
run during test-driven development.
The following code snippet shows a test step from an Azure DevOps pipeline. There are two actions occurring:
The first task is using a popular Python testing framework to run CI tests. These tests reside in source control
alongside the Python code. The test results are output to a file named test-results.xml.
The second task consumes the test results and publishes them to the Azure DevOps pipeline as an integrated
report.
- script: |
    pip3 install pytest
    pytest azure-vote/azure-vote/tests/ --junitxml=junit/test-results.xml
  continueOnError: true
- task: PublishTestResults@2
  displayName: 'Publish Test Results'
  inputs:
    testResultsFormat: 'JUnit'
    testResultsFiles: '**/test-results.xml'
    failTaskOnFailedTests: true
    testRunTitle: 'Python $(python.version)'
The following image shows the test results as seen in the Azure DevOps portal.
CI Result Badges
Many developers like to show that they're keeping their code quality high by displaying a status badge in their
repo. The following image shows an Azure Pipelines badge as displayed on the Readme file for an open-source
project on GitHub.
Learn more
To learn how to display badges in your repositories, see these articles:
Add an Azure Pipeline status badge to your repository.
Add a GitHub workflow status badge to your repository.
Next steps
Release Engineering: Release testing
Testing your application and Azure environment
12/18/2020 • 6 minutes to read • Edit Online
Testing is one of the fundamental components of DevOps and agile development in general. If automation gives
DevOps the required speed and agility to deploy software quickly, only through extensive testing will those
deployments achieve the required reliability that customers demand.
A main tenet of a DevOps practice to achieve system reliability is the "Shift Left" principle. If developing and
deploying an application is a process depicted as a series of steps going from left to right, testing should not be
limited to being performed at the very end of the process (at the right). It should be shifted as much to the
beginning (to the left) as possible. Errors are cheaper to repair when caught early. They can be expensive or
impossible to fix later in the application life cycle.
Another aspect to consider is that testing should occur on both application code and infrastructure code, and both
should be subject to the same quality controls. As described in Infrastructure as Code, the environment
where applications are running should be version-controlled and deployed through the same mechanisms as
application code, and hence can be tested and validated using DevOps testing paradigms too.
You can use your favorite testing tool to run your tests, including Azure Pipelines for automated testing and Azure
Test Plans for manual testing.
There are multiple stages at which tests can be performed in the life cycle of code, and each of them has some
particularities that are important to understand. In this guide, you can find a summary of the different tests that you
should consider while developing and deploying applications.
Automated Testing
Automating tests is the best way to make sure that they are executed. Depending on how frequently tests are
performed, they are typically limited in duration and scope, as the different types of automated tests will show:
Unit Testing
Unit tests are typically run with each new version of code committed into version control. Unit tests should be
extensive (they should ideally cover 100% of the code) and quick (typically under 30 seconds, although this number is
not a rule set in stone). Unit testing can verify things like the syntax correctness of application code, Resource
Manager templates, or Terraform configurations, that the code follows best practices, or that it produces the
expected results when provided certain inputs.
Unit tests should be applied both to application code and infrastructure code.
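As a hedged sketch of what that can look like in practice, the commands below run application unit tests with pytest and validate an Azure Resource Manager template before any deployment; the file and resource names are assumptions for this example.

# Run the application's unit tests
pytest tests/unit --maxfail=1

# Validate an ARM template as a unit-style check for infrastructure code
az deployment group validate \
  --resource-group myResourceGroup \
  --template-file azuredeploy.json \
  --parameters azuredeploy.parameters.json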
Smoke Testing
Smoke tests are more exhaustive than unit tests, but still not as exhaustive as integration tests. They normally run in
less than 15 minutes. While still not verifying the interoperability of the different components with each other, smoke
tests verify that each of them can be correctly built and offers the expected functionality and performance.
Smoke tests usually involve building the application code and, for infrastructure code, possibly testing the deployment
in a test environment.
Integration Testing
After making sure that the different application components operate correctly individually, integration testing
determines whether they can interact with each other as they should. Integration tests usually take longer
than smoke tests, and as a consequence they are sometimes executed less frequently. For example, running
integration tests every night still offers a good compromise, detecting interoperability issues between application
components no later than one day after they were introduced.
Summary
In order to deploy software quickly and reliably, testing is a fundamental component of the development and
deployment life cycle. Not only should application code be tested, but infrastructure automation and resiliency
should equally be put to the test, to make sure that the application is going to perform as expected in every
situation.
Next steps
Release Engineering: Performance
Performance considerations for your deployment
infrastructure
12/18/2020 • 3 minutes to read • Edit Online
Build status shows if your product is in a deployable state, so builds are the heartbeat of your continuous delivery
system. It's important to have a build process up and running from the first day of your product development.
Since builds provide such crucial information about the status of your product, you should always strive for fast
builds.
It's hard to fix a build problem if the build takes a long time, and the team will suffer from broken-window
syndrome. Eventually, nobody cares if the build is broken, since it's always broken and it takes a lot of effort to fix.
Build times
Here are a few ways you can achieve faster builds.
Selecting the right size VMs: Speeding up your builds starts with selecting the right size VMs. Fast machines
can make the difference between hours and minutes. If your pipelines are in Azure Pipelines, then you've got
a convenient option to run your jobs using a Microsoft-hosted agent. With Microsoft-hosted agents,
maintenance and upgrades are taken care of for you. For more info, see Microsoft-hosted agents.
Build server location: When you're building your code, a lot of bits are moved across the wire. Inputs to
the builds, such as source code and NuGet packages, are fetched from a source control repository and the
artifact repository. At the end, the output from the build process needs to be copied, not only the
compiled artifacts, but also test reports, code coverage results, and debug symbols. It is therefore important
that these copy actions are fast. If you are using your own build server, ensure that it is located near the
sources and the target location; this can reduce the duration of your build considerably.
Scaling out build servers: A single build server may be sufficient for a small product, but as the size and
the scope of the product and the number of teams working on the product increase, the single server may
not be enough. Scale your infrastructure horizontally over multiple machines when you reach the limit. For
more info, see how you can leverage Azure DevOps Agent Pools.
Optimizing the build:
Add parallel build execution to speed up the build process. For more info, see Azure DevOps
parallel jobs.
Enable parallel execution of test suites, which is often a huge time saver, especially when executing
integration and UI tests. For more info, see Run tests in parallel using Azure Pipeline.
Use the notion of a multiplier, where you can scale out your builds over multiple build agents. For
more info, see Organizing Azure Pipeline into Jobs.
Move a part of the test feedback loop to the release pipeline. This improves the build speed, and
hence the speed of the build feedback loop.
Publish the build artifacts to a package management solution, such as a NuGet feed, at the end of
the build.
Human intervention
It's important to select different builds for different purposes.
CI builds: The purpose of this build is to ensure the code compiles and the unit tests run. This build gets triggered at each
commit or set of commits over a period of time. It serves as the heartbeat of the project and provides quality
feedback to the team immediately. For more info, see CI triggers or Batching CI builds.
Nightly build: The purpose of this build is not only to compile the code but also to ensure that the necessary
integration/regression tests are run. This build can take more time, because we need to do some
extra steps to get additional information about the product, for example, metrics about the state of the
software using SonarQube. It may also contain a set of regression tests and integration tests, and it may also
deploy the solution to a temporary machine to verify that the solution continues to work. For more info, see
scheduling builds using cron syntax.
Release build: Besides compiling and running tests, this build additionally generates the API documentation
and compliance reports, performs code signing, and runs other steps that are not required every time the code
is built. Finally, this build provides the golden copy that will be pushed to the release pipeline to deploy in the
production environment. Generally, the release build is triggered manually instead of by a CI trigger.
Next steps
Release Engineering: Deployment
Deployment considerations for DevOps
12/18/2020 • 4 minutes to read • Edit Online
As you provision and update Azure resources, application code, and configuration settings, a repeatable and
predictable process will help you avoid errors and downtime. We recommend automated processes for
deployment that you can run on demand and rerun if something fails. After your deployment processes are
running smoothly, process documentation can keep them that way.
Automation
To activate resources on demand, deploy solutions rapidly, minimize human error, and produce consistent and
repeatable results, be sure to automate deployments and updates.
Automate as many processes as possible
The most reliable deployment processes are automated and idempotent — that is, repeatable to produce the same
results.
To automate provisioning of Azure resources, you can use Terraform, Ansible, Chef, Puppet, Azure PowerShell,
Azure CLI, or Azure Resource Manager templates.
To configure VMs, you can use cloud-init (for Linux VMs) or Azure Automation State Configuration (DSC).
To automate application deployment, you can use Azure DevOps Services, Jenkins, or other CI/CD solutions.
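For example, an Azure Resource Manager template deployment can be scripted with the Azure CLI; because identical inputs produce the same result, the operation can be rerun safely. The resource group, location, and file names below are assumptions for this sketch.

# Create (or confirm) the resource group, then deploy the template
az group create --name myResourceGroup --location eastus

az deployment group create \
  --resource-group myResourceGroup \
  --template-file azuredeploy.json \
  --parameters azuredeploy.parameters.json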
As a best practice, create a repository of categorized automation scripts for quick access, documented with
explanations of parameters and examples of script use. Keep this documentation in sync with your Azure
deployments, and designate a primary person to manage the repository.
Automation scripts can also activate resources on demand for disaster recovery.
Automate and test deployment and maintenance tasks
Distributed applications consist of multiple parts that must work together. Deployment should take advantage of
proven mechanisms, such as scripts, that can update and validate configuration and automate the deployment
process. Test all processes fully to ensure that errors don't cause additional downtime.
Implement deployment security measures
All deployment tools must incorporate security restrictions to protect the deployed application. Define and enforce
deployment policies carefully, and minimize the need for human intervention.
Release process
One of the challenges with automating deployment is the cut-over itself, taking software from the final stage of
testing to live production. You usually need to do this quickly in order to minimize downtime. The blue-green
deployment approach does this by ensuring you have two production environments, as identical as possible.
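With Azure App Service, a blue-green style cut-over can be approximated with deployment slots: deploy and warm up the new version in a staging slot, then swap it into production. The sketch below assumes an app with a slot named staging; the resource names are illustrative.

# Swap the staging slot (new version) with the production slot
az webapp deployment slot swap \
  --resource-group myResourceGroup \
  --name myWebApp \
  --slot staging \
  --target-slot production

If a problem is discovered after the swap, running the same command again swaps the previous version back, which also provides a simple rollback path.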
Next steps
Release Engineering: Rollback
Release Engineering: Rollback
12/18/2020 • 2 minutes to read • Edit Online
In some cases, a new software deployment can harm or degrade the functionality of a software system. When
building your solutions, it is essential to anticipate deployment issues and to architect solutions that provide
mechanisms for fixing problematic deployments. Rolling back a deployment involves reverting the deployment to
a known good state. Rollback can be accomplished in many different ways. Several Azure services support native
mechanisms for rolling back to a previous state. Some of these services are detailed in this article.
Learn more
For more information on using Azure App Service deployment slots, see Set up staging environments in Azure App
Service
The following output, as produced by the kubectl rollout history command, shows the revision history of a
Kubernetes deployment named demorollback:
deployment.apps/demorollback
REVISION CHANGE-CAUSE
1 <none>
2 <none>
If an updated deployment has introduced issues, the kubectl rollout undo command can be used to revert to a
previous deployment revision.
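A short sketch of these commands against the deployment shown above:

# View the revision history for the deployment
kubectl rollout history deployment/demorollback

# Roll back to the previous revision, or to a specific revision number
kubectl rollout undo deployment/demorollback
kubectl rollout undo deployment/demorollback --to-revision=1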
Learn more
For more information, see Kubernetes Deployments
Learn more
For more information, see Rollback on an error to successful deployment
Logic apps
When you make changes to an Azure logic app, a new version of the application is created. Azure maintains a
history of versions and can revert or promote any previous version. To do so, in the Azure portal, select your logic
app > Versions. Previous versions can be selected on the versions pane, and the application can be inspected both
in the code view and the visual designer view. Select the version you would like to revert to, click the Promote
option, and then select Save.
Learn more
For more information, see Manage logic apps in the Azure portal
Monitoring for DevOps
11/2/2020 • 7 minutes to read • Edit Online
What you cannot see, you cannot measure. What you cannot measure, you cannot improve. This classic
management axiom is true in the cloud as well. Traditional application and infrastructure monitoring is based on
whether the application is running or not, or what response time it is giving. However, cloud-based monitoring
offers many more opportunities that you should be leveraging in order to give your users the best experience.
Application Monitoring
Application Insights is the Azure service that not only allows you to verify that your application is running correctly,
but also makes application troubleshooting easier, and it can be used for custom business telemetry that will tell you
whether your application is being used as intended.
Make sure you leverage all the rich information that Application Insights can provide about your application. This
list is not exhaustive, but here you can find some of the visibility that Application Insights can give you:
Application Insights offers you a default dashboard with an educated guess of the most important metrics you
will be interested in. You can then modify it and customize it to your own needs.
By instrumenting your application correctly, Application Insights will give you performance statistics both from
a client and a server perspective
The Application Map will show you application dependencies in other services such as backend APIs or
databases, allowing you to determine visually where performance problems lie
Smart Detection will warn you when anomalies in performance or utilization patterns happen
Usage Analysis can give you telemetry on which features of your application are most frequently used, or
whether all your application functionality is being used. This feature is especially useful after changes to the
application functionality, to verify whether those changes were successful
Release annotations are visual indicators in your Application Insights charts of new builds and other events, so
that you can visually correlate changes in application performance to code releases, being able to quickly
pinpoint performance problems.
Cross-component transaction diagnostics allow you to follow failed transactions to find the point in the
architecture where the fault originated.
Snapshot Debugger automatically collects a snapshot of a live application when an exception occurs, so that you
can analyze it at a later stage.
To use Application Insights you have two options: you can use codeless monitoring, where onboarding your app
to Application Insights does not require any code change, or code-based monitoring, where you instrument
your code to send telemetry to Application Insights using the Software Development Kit for your programming
language of choice.
You can certainly use other Application Performance Management tools to monitor your application on Azure, such
as NewRelic or AppDynamics, but Application Insights will give you the most seamless and integrated experience.
Platform Monitoring
Application Insights is actually one of the components of Azure Monitor, which gives you rich metrics and logs to
verify the state of your complete Azure landscape. No matter whether your application is running on Virtual
Machines, App Services, or Kubernetes, Azure Monitor will help you to follow the state of your infrastructure, and
to react promptly if there are any issues.
Make sure not only to monitor your compute elements supporting your application code, but your data platform
as well: databases, storage accounts, or data lakes should be closely monitored, since low performance in the
data tier of an application could have serious consequences.
Container Insights
Should your application run on Azure Kubernetes Service, Azure Monitor allows you to easily monitor the state of
your cluster, nodes, and pods. Easy to configure for AKS clusters, Container Insights delivers quick, visual, and
actionable information: from the CPU and memory pressure of your nodes to the logs of individual Kubernetes
pods.
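As a sketch, Container Insights can be enabled on an existing AKS cluster by turning on the monitoring add-on with the Azure CLI; the cluster and resource group names are assumptions for this example.

# Enable the monitoring add-on (Container Insights) on an existing AKS cluster
az aks enable-addons \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --addons monitoring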
Additionally, for operators that prefer using the open-source Kubernetes monitoring tool Prometheus but still like
the ease of use of Azure Monitor Container Insights, both solutions can integrate with each other.
The Sidecar Pattern adds a separate container with responsibilities that are required by the main container. A
common use case is for running logging utilities and monitoring agents.
Network monitoring
Regardless of the form factor or programming language your application is based on, the network connecting your
code to your users can make or break the experience that your application provides. As a consequence, monitoring
and troubleshooting the network can be decisive for an operations team. The component of Azure Monitor that
manages the network components is called Network Watcher, a collection of network monitoring and
troubleshooting tools. Some of these tools are:
Traffic Analytics will give you an overview of the traffic in your Virtual Networks, as well as the percentage
coming from malicious IP addresses, leveraging Microsoft Threat Intelligence databases. This tool will also show you
the systems in your virtual networks that generate the most traffic, so that you can visually identify
bottlenecks before they degenerate into problems.
Network Performance Manager can generate synthetic traffic to measure the performance of network
connections over multiple links, giving you a perspective on the evolution of WAN and Internet connections
over time, as well as offering valuable monitoring information about Microsoft ExpressRoute circuits.
VPN diagnostics can help troubleshoot site-to-site VPN connections that connect your applications to on-premises
users.
Connection Monitor allows you to measure the network availability between sets of endpoints.
Other information sources
Your application components are not the only sources of data; there are many other signals that you need to track
to effectively operate a cloud environment:
Activity Log: This is an audit trail that lets you see every change that has gone through the Azure APIs. It can be
critical for understanding sudden performance changes or problems that might have been caused by a
misconfiguration of the Azure platform.
Azure Service Health: Sometimes outages are provoked not by configuration changes, but by glitches in the
Azure platform itself. You can find information about any Azure-related problem impacting your application in
the Azure Service Health logs.
Azure Advisor: Find recommendations here about how to optimize your Azure platform to reduce costs,
improve your security posture, or increase the availability of your environment.
Azure Security Center: Not a focus of this pillar, but included for completeness: Azure Security Center
can help you to understand whether your Azure resources are configured according to security best practices.
Summary
You can use any monitoring platform to manage your Azure resources. Microsoft's first-party offering is Azure
Monitor, a comprehensive solution for metrics and logs from the infrastructure to the application code, including
the ability to trigger alerts and automated actions as well as data visualization.
DevOps Checklist
11/2/2020 • 14 minutes to read • Edit Online
DevOps is the integration of development, quality assurance, and IT operations into a unified culture and set of
processes for delivering software. Use this checklist as a starting point to assess your DevOps culture and process.
Culture
Ensure business alignment across organizations and teams. Conflicts over resources, purpose, goals, and
priorities within an organization can be a risk to successful operations. Ensure that the business, development, and
operations teams are all aligned.
Ensure the entire team understands the software lifecycle. Your team needs to understand the overall
lifecycle of the application, and which part of the lifecycle the application is currently in. This helps all team
members know what they should be doing now, and what they should be planning and preparing for in the future.
Reduce cycle time. Aim to minimize the time it takes to move from ideas to usable developed software. Limit the
size and scope of individual releases to keep the test burden low. Automate the build, test, configuration, and
deployment processes whenever possible. Clear any obstacles to communication among developers, and between
developers and operations.
Review and improve processes. Your processes and procedures, both automated and manual, are never final.
Set up regular reviews of current workflows, procedures, and documentation, with a goal of continual
improvement.
Do proactive planning. Proactively plan for failure. Have processes in place to quickly identify issues when they
occur, escalate to the correct team members to fix, and confirm resolution.
Learn from failures. Failures are inevitable, but it's important to learn from failures to avoid repeating them. If an
operational failure occurs, triage the issue, document the cause and solution, and share any lessons that were
learned. Whenever possible, update your build processes to automatically detect that kind of failure in the future.
Optimize for speed and collect data. Every planned improvement is a hypothesis. Work in the smallest
increments possible. Treat new ideas as experiments. Instrument the experiments so that you can collect production
data to assess their effectiveness. Be prepared to fail fast if the hypothesis is wrong.
Allow time for learning. Both failures and successes provide good opportunities for learning. Before moving on
to new projects, allow enough time to gather the important lessons, and make sure those lessons are absorbed by
your team. Also give the team the time to build skills, experiment, and learn about new tools and techniques.
Document operations. Document all tools, processes, and automated tasks with the same level of quality as your
product code. Document the current design and architecture of any systems you support, along with recovery
processes and other maintenance procedures. Focus on the steps you actually perform, not theoretically optimal
processes. Regularly review and update the documentation. For code, make sure that meaningful comments are
included, especially in public APIs, and use tools to automatically generate code documentation whenever possible.
Share knowledge. Documentation is only useful if people know that it exists and can find it. Ensure the
documentation is organized and easily discoverable. Be creative: Use brown bags (informal presentations), videos,
or newsletters to share knowledge.
Development
Provide developers with production-like environments. If development and test environments don't match
the production environment, it is hard to test and diagnose problems. Therefore, keep development and test
environments as close to the production environment as possible. Make sure that test data is consistent with the
data used in production, even if it's sample data and not real production data (for privacy or compliance reasons).
Plan to generate and anonymize sample test data.
Ensure that all authorized team members can provision infrastructure and deploy the application.
Setting up production-like resources and deploying the application should not involve complicated manual tasks or
detailed technical knowledge of the system. Anyone with the right permissions should be able to create or deploy
production-like resources without going to the operations team.
This recommendation doesn't imply that anyone can push live updates to the production deployment. It's about
reducing friction for the development and QA teams to create production-like environments.
Instrument the application for insight. To understand the health of your application, you need to know how it's
performing and whether it's experiencing any errors or problems. Always include instrumentation as a design
requirement, and build the instrumentation into the application from the start. Instrumentation must include event
logging for root cause analysis, but also telemetry and metrics to monitor the overall health and usage of the
application.
Track your technical debt. In many projects, release schedules can get prioritized over code quality to one degree
or another. Always keep track when this occurs. Document any shortcuts or other suboptimal implementations, and
schedule time in the future to revisit these issues.
Consider pushing updates directly to production. To reduce the overall release cycle time, consider pushing
properly tested code commits directly to production. Use feature toggles to control which features are enabled. This
allows you to move from development to release quickly, using the toggles to enable or disable features. Toggles
are also useful when performing tests such as canary releases, where a particular feature is deployed to a subset of
the production environment.
Testing
Automate testing. Manually testing software is tedious and susceptible to error. Automate common testing tasks
and integrate the tests into your build processes. Automated testing ensures consistent test coverage and
reproducibility. Integrated UI tests should also be performed by an automated tool. Azure offers development and
test resources that can help you configure and execute testing. For more information, see Development and test.
Test for failures. If a system can't connect to a service, how does it respond? Can it recover once the service is
available again? Make fault injection testing a standard part of review on test and staging environments. When your
test process and practices are mature, consider running these tests in production.
Test in production. The release process doesn't end with deployment to production. Have tests in place to ensure
that deployed code works as expected. For deployments that are infrequently updated, schedule production testing
as a regular part of maintenance.
Automate performance testing to identify performance issues early. The impact of a serious performance
issue can be as severe as a bug in the code. While automated functional tests can prevent application bugs, they
might not detect performance problems. Define acceptable performance goals for metrics like latency, load times,
and resource usage. Include automated performance tests in your release pipeline, to make sure the application
meets those goals.
Perform capacity testing. An application might work fine under test conditions, and then have problems in
production due to scale or resource limitations. Always define the maximum expected capacity and usage limits.
Test to make sure the application can handle those limits, but also test what happens when those limits are
exceeded. Capacity testing should be performed at regular intervals.
After the initial release, you should run performance and capacity tests whenever updates are made to production
code. Use historical data to fine-tune tests and to determine what types of tests need to be performed.
Perform automated security penetration testing. Ensuring your application is secure is as important as
testing any other functionality. Make automated penetration testing a standard part of the build and deployment
process. Schedule regular security tests and vulnerability scanning on deployed applications, monitoring for open
ports, endpoints, and attacks. Automated testing does not remove the need for in-depth security reviews at regular
intervals.
Perform automated business continuity testing. Develop tests for large-scale business continuity, including
backup recovery and failover. Set up automated processes to perform these tests regularly.
Release
Automate deployments. Automate deploying the application to test, staging, and production environments.
Automation enables faster and more reliable deployments, and ensures consistent deployments to any supported
environment. It removes the risk of human error caused by manual deployments. It also makes it easy to schedule
releases for convenient times, to minimize any effects of potential downtime. Have systems in place to detect any
problems during rollout, and have an automated way to roll forward fixes or roll back changes.
Use continuous integration. Continuous integration (CI) is the practice of merging all developer code into a
central codebase on a regular schedule, and then automatically performing standard build and test processes. CI
ensures that an entire team can work on a codebase at the same time without having conflicts. It also ensures that
code defects are found as early as possible. Preferably, the CI process should run every time that code is committed
or checked in. At the very least, it should run once per day.
Consider adopting a trunk based development model. In this model, developers commit to a single branch (the
trunk). There is a requirement that commits never break the build. This model facilitates CI, because all feature
work is done in the trunk, and any merge conflicts are resolved when the commit happens.
Consider using continuous delivery. Continuous delivery (CD) is the practice of ensuring that code is always
ready to deploy, by automatically building, testing, and deploying code to production-like environments. Adding
continuous delivery to create a full CI/CD pipeline will help you detect code defects as soon as possible, and ensures
that properly tested updates can be released in a very short time.
Continuous deployment is an additional process that automatically takes any updates that have passed through
the CI/CD pipeline and deploys them into production. Continuous deployment requires robust automatic
testing and advanced process planning, and may not be appropriate for all teams.
Make small incremental changes. Large code changes have a greater potential to introduce bugs. Whenever
possible, keep changes small. This limits the potential effects of each change, and makes it easier to understand and
debug any issues.
Control exposure to changes. Make sure you're in control of when updates are visible to your end users.
Consider using feature toggles to control when features are enabled for end users.
Implement release management strategies to reduce deployment risk. Deploying an application update to
production always entails some risk. To minimize this risk, use strategies such as canary releases or blue-green
deployments to deploy updates to a subset of users. Confirm the update works as expected, and then roll the
update out to the rest of the system.
Document all changes. Minor updates and configuration changes can be a source of confusion and versioning
conflict. Always keep a clear record of any changes, no matter how small. Log everything that changes, including
patches applied, policy changes, and configuration changes. (Don't include sensitive data in these logs. For example,
log that a credential was updated, and who made the change, but don't record the updated credentials.) The record
of the changes should be visible to the entire team.
Consider making infrastructure immutable. Immutable infrastructure is the principle that you shouldn't
modify infrastructure after it's deployed to production. Otherwise, you can get into a state where ad hoc changes
have been applied, making it hard to know exactly what changed. Immutable infrastructure works by replacing
entire servers as part of any new deployment. This allows the code and the hosting environment to be tested and
deployed as a block. Once deployed, infrastructure components aren't modified until the next build and deploy
cycle.
Monitoring
Make systems observable. The operations team should always have clear visibility into the health and status of a
system or service. Set up external health endpoints to monitor status, and ensure that applications are coded to
instrument the operations metrics. Use a common and consistent schema that helps you correlate events across
systems. Azure Diagnostics and Application Insights are the standard methods of tracking the health and status of
Azure resources. Microsoft Operations Management Suite also provides centralized monitoring and management
for cloud or hybrid solutions.
Aggregate and correlate logs and metrics. A properly instrumented telemetry system will provide a large
amount of raw performance data and event logs. Make sure that telemetry and log data is processed and correlated
in a short period of time, so that operations staff always have an up-to-date picture of system health. Organize and
display data in ways that give a cohesive view of any issues, so that whenever possible it's clear when events are
related to one another.
Consult your corporate retention policy for requirements on how data is processed and how long it should be
stored.
Implement automated alerts and notifications. Set up monitoring tools like Azure Monitor to detect patterns
or conditions that indicate potential or current issues, and send alerts to the team members who can address the
issues. Tune the alerts to avoid false positives.
Monitor assets and resources for expirations. Some resources and assets, such as certificates, expire after a
given amount of time. Make sure to track which assets expire, when they expire, and what services or features
depend on them. Use automated processes to monitor these assets. Notify the operations team before an asset
expires, and escalate if expiration threatens to disrupt the application.
Management
Automate operations tasks. Manually handling repetitive operations processes is error-prone. Automate these
tasks whenever possible to ensure consistent execution and quality. Code that implements the automation should
be versioned in source control. As with any other code, automation tools must be tested.
Take an infrastructure-as-code approach to provisioning. Minimize the amount of manual configuration
needed to provision resources. Instead, use scripts and Azure Resource Manager templates. Keep the scripts and
templates in source control, like any other code you maintain.
Consider using containers. Containers provide a standard package-based interface for deploying applications.
Using containers, an application is deployed using self-contained packages that include any software, dependencies,
and files needed to run the application, which greatly simplifies the deployment process.
Containers also create an abstraction layer between the application and the underlying operating system, which
provides consistency across environments. This abstraction can also isolate a container from other processes or
applications running on a host.
Implement resiliency and self-healing. Resiliency is the ability of an application to recover from failures.
Strategies for resiliency include retrying transient failures, and failing over to a secondary instance or even another
region. For more information, see Designing reliable Azure applications. Instrument your applications so that issues
are reported immediately and you can manage outages or other system failures.
Have an operations manual. An operations manual or runbook documents the procedures and management
information needed for operations staff to maintain a system. Also document any operations scenarios and
mitigation plans that might come into play during a failure or other disruption to your service. Create this
documentation during the development process, and keep it up to date afterwards. This is a living document, and
should be reviewed, tested, and improved regularly.
Shared documentation is critical. Encourage team members to contribute and share knowledge. The entire team
should have access to documents. Make it easy for anyone on the team to help keep documents updated.
Document on-call procedures. Make sure on-call duties, schedules, and procedures are documented and shared
to all team members. Keep this information up-to-date at all times.
Document escalation procedures for third-party dependencies. If your application depends on external
third-party services that you don't directly control, you must have a plan to deal with outages. Create
documentation for your planned mitigation processes. Include support contacts and escalation paths.
Use configuration management. Configuration changes should be planned, visible to operations, and recorded.
This could take the form of a configuration management database, or a configuration-as-code approach.
Configuration should be audited regularly to ensure that what's expected is actually in place.
Get an Azure support plan and understand the process. Azure offers a number of support plans. Determine
the right plan for your needs, and make sure the entire team knows how to use it. Team members should
understand the details of the plan, how the support process works, and how to open a support ticket with Azure. If
you are anticipating a high-scale event, Azure support can assist you with increasing your service limits. For more
information, see the Azure Support FAQs.
Follow least-privilege principles when granting access to resources. Carefully manage access to resources.
Access should be denied by default, unless a user is explicitly given access to a resource. Only grant a user access to
what they need to complete their tasks. Track user permissions and perform regular security audits.
Use role-based access control. Assigning user accounts and access to resources should not be a manual
process. Use role-based access control (RBAC) to grant access based on Azure Active Directory identities and groups.
Use a bug tracking system to track issues. Without a good way to track issues, it's easy to miss items,
duplicate work, or introduce additional problems. Don't rely on informal person-to-person communication to track
the status of bugs. Use a bug tracking tool to record details about problems, assign resources to address them, and
provide an audit trail of progress and status.
Manage all resources in a change management system. All aspects of your DevOps process should be
included in a management and versioning system, so that changes can be easily tracked and audited. This includes
code, infrastructure, configuration, documentation, and scripts. Treat all these types of resources as code throughout
the test/build/review process.
Use checklists. Create operations checklists to ensure processes are followed. It's common to miss something in a
large manual, and following a checklist can force attention to details that might otherwise be overlooked. Maintain
the checklists, and continually look for ways to automate tasks and streamline processes.
For more about DevOps, see What is DevOps? on the Visual Studio site.
Overview of the performance efficiency pillar
12/18/2020 • 3 minutes to read • Edit Online
Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an
efficient manner. Before the cloud became popular, when it came to planning how a system would handle increases
in load, many organizations intentionally provisioned workloads to be oversized to meet business requirements.
This might make sense in on-premises environments because it ensured capacity during peak usage. Capacity
reflects resource availability (CPU and memory). This was a major consideration for processes that would be in
place for a number of years.
Just as you needed to anticipate increases in load in on-premises environments, you need to anticipate increases in
cloud environments to meet business requirements. One difference is that you may no longer need to make long-
term predictions for anticipated changes to ensure that you will have enough capacity in the future. Another
difference is in the approach used to manage performance.
Principles
Follow these principles to guide you through improving performance efficiency:
Become Data-driven - Embrace a data-driven culture to deliver timely insights to everyone in your
organization across all your data. To harness this culture, get the best performance from your analytics
solution across all your data, ensure data has the security and privacy needed for your business
environment, and make sure you have tools that enable everyone in your organization to gain insights from
your data.
Avoid antipatterns - A performance antipattern is a common practice that is likely to cause scalability
problems when an application is under pressure. For example, you can have an application that behaves as
expected during performance testing. However, when it is released to production and starts to handle live
workloads, performance decreases. Scalability problems such as rejecting user requests, stalling, or
throwing exceptions may arise. To learn how to identify and fix these antipatterns, see Performance
antipatterns for cloud applications.
Perform load testing to set limits - Load testing helps ensure that your applications can scale and do
not go down during peak traffic. Load test each application to understand how it performs at various scales.
To learn about Azure service limits, see Managing limits.
Understand billing for metered resources - Your business requirements will determine the tradeoffs
between cost and level of performance efficiency. Azure doesn't directly bill based on the resource cost.
Charges for a resource, such as a virtual machine, are calculated by using one or more meters. Meters are
used to track a resource’s usage over time. These meters are then used to calculate the bill.
Monitor and optimize - A lack of monitoring for new services and the health of current workloads is a major
inhibitor of workload quality. The overall monitoring strategy should take into consideration not only
scalability, but resiliency (infrastructure, application, and dependent services) and application performance
as well. For purposes of scalability, looking at the metrics would allow you to provision resources
dynamically and scale with demand.
Next steps
Performance efficiency impacts the entire architecture spectrum. Bridge gaps in your knowledge of Azure by
reviewing the five pillars in the Microsoft Azure Well-Architected Framework.
To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, see the
Microsoft Azure Well-Architected Review.
Design scalable Azure applications
12/18/2020 • 8 minutes to read • Edit Online
Application design is critical to handling scale as load increases. This article will give you insight on the most
important topics. For more topics related to handling scale, see the Design Azure applications for efficiency article
in the Performance efficiency pillar.
Database considerations
The choice of database can affect an application's performance and scalability. Database reads and writes involve a
network call and storage I/O, both of which are expensive operations. Choosing the right database service to store
and retrieve data is therefore a critical decision and must be considered to ensure application scalability. Azure has
many database services that will fit most needs. In addition, there are third-party options that can be considered
from Azure Marketplace.
To help you choose a database type, determine if application storage requirements fit a relational design (SQL)
versus a key-value/document/graph design (NoSQL). Some applications may have both a SQL and a NoSQL
database for different storage needs. Use the Azure data store decision tree to help you find the appropriate
managed data storage solution.
Why use a relational database?
Use a relational database when strong consistency guarantees are important — where all changes are atomic, and
transactions always leave the data in a consistent state. However, a relational database generally can't scale out
horizontally without sharding the data in some way. Implementing manual sharding can be a time-consuming task.
Also, the data in a relational database must be normalized, which isn't appropriate for every data set.
If a relational database is considered optimal, Azure offers several PaaS options that fully manage hosting and
operations of the database. Azure SQL Database can host single databases or multiple databases (Azure SQL
Database Managed Instance). The suite of offerings spans requirements that cross performance, scale, size,
resiliency, disaster recovery, and migration compatibility. Azure offers the following PaaS relational database
services:
Azure SQL Database
Azure Database for MySQL
Azure Database for PostgreSQL
Azure Database for MariaDB
Why use a NoSQL database?
Use a NoSQL database when application performance and availability are more important than strong
consistency. NoSQL is ideal for handling large, unrelated, indeterminate, or rapidly changing data. NoSQL
databases have trade-offs. For specifics, see Some challenges with NoSQL databases.
Azure provides two managed services that optimize for NoSQL solutions: Azure Cosmos DB and Azure Cache for
Redis. For document and graph databases, Cosmos DB offers extreme scale and performance.
For a detailed description of NoSQL and relational databases, see Understanding the differences.
TIP
Where appropriate, decomposing an application into microservices provides a level of decoupling that is an architectural best
practice. A microservices architecture can also bring some challenges. The design patterns in Design patterns for
microservices can help mitigate these challenges.
Integrated security
Integrated security is a singular unified solution to protect every service that a business runs through a set of
common policies and configuration settings. In addition to reducing issues associated with scaling, provisioning,
and managing (including higher costs and complexity), integrated security also increases control and overall
security. However, there may be times when you do not want to use connection pooling for security reasons. For
example, although connection pooling improves the performance of subsequent database requests for a single
user, that user cannot take advantage of connections made by other users. It also results in at least one connection
per user to the database server.
Measure your business's security requirements against the advantages and disadvantages of connection pooling. To
learn more, see Pool fragmentation.
Next steps
Application efficiency
Design Azure applications for efficiency
12/18/2020 • 4 minutes to read • Edit Online
Making choices that affect performance efficiency is critical to application design. For additional related topics, see
the Design scalable Azure applications article in the Performance efficiency pillar.
TIP
Migrate an Azure Cloud Services application to Azure Service Fabric describes best practices about stateless services for an
application that is migrated from old Azure Cloud Services to Azure Service Fabric.
Next steps
Design for scaling
Design for scaling
12/18/2020 • 4 minutes to read • Edit Online
Scaling is the ability of a system to handle increased load. Services covered by Azure Autoscale can scale
automatically to match demand: they scale out to ensure capacity during workload peaks and return to normal
automatically when the peak drops.
NOTE
Provided that your application is explicitly designed to handle some of its instances being terminated, use autoscaling to scale
down or scale in resources that are no longer needed for the given load, which reduces operational costs.
The Application Service autoscaling sample shows how to create an Azure App Service plan which includes an
Azure App Service.
Azure Kubernetes Service (AKS) offers two levels of autoscale:
Horizontal autoscale - Can be enabled on service containers to add more or fewer pod instances within the
cluster.
Cluster autoscale - Can be enabled on the agent VM instances running an agent node-pool to add more or
remove VM instances dynamically.
Other Azure services include the following:
Azure Service Fabric - Virtual machine scale sets offer autoscale capabilities for true IaaS scenarios.
Azure App Gateway and Azure API Management - PaaS offerings for ingress services that enable autoscale.
Azure Functions, Azure Logic Apps, and App Services - Serverless pay-per-use consumption modeling
that inherently provides autoscaling capabilities.
Azure SQL Database - PaaS platform to change performance characteristics of a database on the fly and
assign more resources when needed or release the resources when they are not needed. Allows scaling
up/down, read scale-out, and global scale-out/sharding capabilities.
Each service documents its autoscale capabilities. Review Autoscale overview for a general discussion on Azure
platform autoscale.
NOTE
If your application does not have built-in ability to autoscale, or isn't configured to scale out automatically as load increases,
it's possible that your application's services will fail if they become saturated with user requests. See Azure Automation for
possible solutions.
Next steps
Plan for capacity
Plan for capacity
12/18/2020 • 4 minutes to read • Edit Online
Azure offers many options to meet capacity requirements as your business grows. These options can also minimize
cost.
NOTE
Don't plan for capacity to meet the highest level of expected demand. An inappropriate or misconfigured service can impact
cost. For example, building a multiregion service when the service levels don't require high-availability or geo-redundancy will
increase cost without reasonable business justification.
Next steps
Performance testing
Performance testing
12/18/2020 • 4 minutes to read • Edit Online
Performance testing helps to maintain systems properly and fix defects before problems reach system users. It
helps maintain the efficiency, responsiveness, scalability, and speed of applications when compared with business
requirements. When done effectively, performance testing should give you the diagnostic information necessary to
eliminate bottlenecks, which lead to poor performance. A bottleneck occurs when data flow is either interrupted or
stops due to insufficient capacity to handle the workload.
To avoid experiencing poor performance, commit time and resources to testing system performance. Two subsets
of performance testing, load testing and stress testing, can determine the upper (close to capacity limit) and
maximum (point of failure) limit, respectively, of the application's capacity. By performing these tests, you can
determine the necessary infrastructure to support the anticipated workloads.
A best practice is to plan for a load buffer to accommodate random spikes without overloading the infrastructure.
For example, if a normal system load is 100,000 requests per second, the infrastructure should be sized so that this
load is no more than 80% of total capacity (that is, a capacity of at least 125,000 requests per second). If you anticipate that the application will
continue to sustain 100,000 requests per second, and the current SKU (Stock Keeping Unit) introduces latency at
65,000 requests per second, you'll most likely need to upgrade your product to the next higher SKU. If there is a
secondary region, you'll need to ensure that it also supports the higher SKU.
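As a rough illustration of this headroom calculation, the following C# sketch computes the capacity needed to keep the normal load at 80% of total capacity; the numbers mirror the example above and are illustrative only.

```csharp
using System;

// Sketch: size infrastructure so that normal load stays at ~80% of total capacity.
double normalLoadRps = 100_000;     // typical sustained requests per second
double targetUtilization = 0.80;    // keep normal load at 80% of capacity

double requiredCapacityRps = normalLoadRps / targetUtilization;   // 125,000 rps

double currentSkuLimitRps = 65_000; // hypothetical point where the current SKU starts adding latency
bool upgradeNeeded = currentSkuLimitRps < requiredCapacityRps;    // true here: move to a higher SKU

Console.WriteLine($"Required capacity: {requiredCapacityRps:N0} rps, upgrade needed: {upgradeNeeded}");
```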
Load testing
Load testing measures system performance as the workload increases. It identifies where and when your
application breaks, so you can fix the issue before shipping to production. It does this by testing system behavior
under typical and heavy loads. The following are key points to consider for load testing:
Know the Azure service limits - Different Azure services have soft and hard limits associated with them.
The terms soft limit and hard limit describe the current, adjustable service limit (soft limit) and the maximum
limit (hard limit). Understand the limits for the services you consume so that you are not blocked if you need
to exceed them. For a list of the most common Azure limits, see Azure subscription and service limits,
quotas, and constraints.
The ResourceLimits sample shows how to query the limits and quotas for commonly used resources.
Measure typical loads - Knowing the typical and maximum loads on your system helps you understand
when something is operating outside of its designed limits. Monitor traffic to understand application
behavior.
Understand application behavior under various scales - Load test your application to understand
how it performs at various scales. First, test to see how the application performs under a typical load. Then,
test to see how it performs under load using different scaling operations. To get additional insight into how
to evaluate your application as the amount of traffic sent to it increases, see Autoscale best practices.
Stress testing
Unlike load testing, which ensures that a system can handle what it’s designed to handle, stress testing focuses on
overloading the system until it breaks. A stress test determines how stable a system is and its ability to withstand
extreme increases in load. It does this by testing the maximum number of requests (from another service, for
example) that a system can handle at a given time before performance is compromised and failure occurs. Find this
maximum to understand what kind of load the current environment can adequately support.
Determine the maximum demand you want to place on memory, CPU, and disk IOPS. Once a stress test has been
performed, you will know the maximum supported load and an operational margin. It is best to choose an
operational threshold so that scaling can be performed before the threshold has been reached.
Once you determine an acceptable operational margin and response time under typical loads, verify that the
environment is configured adequately. To do this, make sure the SKUs that you selected are based on the desired
margins. Be careful to stay as close as possible to your margins. Allocating too much can increase costs and
maintenance unnecessarily; allocating too few can result in poor user experience.
In addition to stress testing through increased load, you can stress test by reducing resources to identify what
happens when the machine runs out of memory. You can also stress test by increasing latency (for example, the
database takes 10x as long to reply, writes to storage take 10x longer, and so on).
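One way to introduce this kind of artificial latency in a test environment, assuming the application's outbound calls go through HttpClient, is a test-only delegating handler that delays each request. This is a hypothetical sketch, not part of any Azure SDK:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Test-only handler that injects artificial latency into outbound HTTP calls,
// simulating a slow dependency (for example, storage that suddenly takes much longer to respond).
public class LatencyInjectionHandler : DelegatingHandler
{
    private readonly TimeSpan _delay;

    public LatencyInjectionHandler(TimeSpan delay, HttpMessageHandler inner) : base(inner)
        => _delay = delay;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        await Task.Delay(_delay, cancellationToken);          // simulated extra latency
        return await base.SendAsync(request, cancellationToken);
    }
}

// Usage in a stress test (illustrative): wrap the real handler with a 2-second delay.
// var client = new HttpClient(new LatencyInjectionHandler(TimeSpan.FromSeconds(2), new HttpClientHandler()));
```

A similar wrapper can be placed around data-access calls to simulate a slow database rather than a slow HTTP dependency.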
Multiregion testing
A multiregion architecture can provide higher availability than deploying to a single region. If a regional outage
affects the primary region, you can use Front Door to use the secondary region. This architecture can also help if an
individual subsystem of the application fails.
Test the amount of time it would take for users to be rerouted to the paired region in case the primary region fails. To
learn more about routing, see Front Door routing methods. Typically, a planned test failover can help determine
how much time would be required to fully scale to support the redirected load.
NOTE
Ensure that a rule has been configured to scale the environment back down once load reaches below the set thresholds. This
will save you money.
Next steps
Testing tools
Testing tools
12/18/2020 • 4 minutes to read • Edit Online
There are multiple stages in the development and deployment life cycle in which tests can be performed.
Application code, infrastructure automation, and fault tolerance should all be tested. This can ensure that the
application will perform as expected in every situation. You'll want to test early enough in the application life cycle
to catch and fix errors. Errors are cheaper to repair when caught early and can be expensive or impossible to fix
later.
Testing can be automated or manual. Automating tests is the best way to make sure that they are executed.
Depending on how frequently tests are performed, they are typically limited in duration and scope. Manual testing
is run much less frequently. For a list of tests that you should consider while developing and deploying applications,
see Testing your application and Azure environment.
Caching data
Caching can dramatically improve performance, scalability, and availability. The more data that you have, the
greater the benefits of caching become. Caching typically works well with data that is immutable or that changes
infrequently. Examples include reference information such as product and pricing information in an e-commerce
application, or shared static resources that are costly to construct. Some or all of this data can be loaded into the
cache at application startup to minimize demand on resources and to improve performance.
Use performance testing and usage analysis to determine whether pre-populating or on-demand loading of the
cache, or a combination of both, is appropriate. The decision should be based on the volatility and usage pattern of
the data. Cache utilization and performance analysis are particularly important in applications that encounter heavy
loads and must be highly scalable.
To learn more about how to use caching as a solution in testing, see Caching.
Use Azure Redis to cache data
Azure Cache for Redis is a caching service that can be accessed from any Azure application, whether the application
is implemented as a cloud service, a website, or inside an Azure virtual machine. Caches can be shared by client
applications that have the appropriate access key. It is a high-performance caching solution that provides
availability, scalability, and security.
To learn more about using Azure Cache for Redis, see Considerations for implementing caching in Azure.
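The following is a minimal cache-aside sketch using the StackExchange.Redis client commonly used with Azure Cache for Redis; the ProductCache class, key format, expiration, and LoadProductFromDatabaseAsync helper are illustrative assumptions rather than a prescribed implementation:

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public class ProductCache
{
    private readonly IDatabase _cache;

    public ProductCache(ConnectionMultiplexer redis) => _cache = redis.GetDatabase();

    // Cache-aside: return the cached value if present, otherwise load it and cache it with a TTL.
    public async Task<string> GetProductJsonAsync(string productId)
    {
        string key = $"product:{productId}";

        RedisValue cached = await _cache.StringGetAsync(key);
        if (cached.HasValue)
        {
            return cached;          // cache hit
        }

        string productJson = await LoadProductFromDatabaseAsync(productId);   // placeholder data access call
        await _cache.StringSetAsync(key, productJson, TimeSpan.FromMinutes(10));
        return productJson;
    }

    private Task<string> LoadProductFromDatabaseAsync(string productId) =>
        Task.FromResult("{ \"id\": \"" + productId + "\" }");   // stand-in for the real query
}
```

The ConnectionMultiplexer would typically be created once at startup from the cache's connection string and shared across the application.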
Benchmark testing
Benchmarking is the process of simulating different workloads on your application and measuring application
performance for each workload. It is the best way to figure out what resources you will need to host your
application. Use performance indicators to assess whether your application is performing as expected or not. Take
into consideration VM sizes and disk sizes.
See the Optimize IOPS, throughput, and latency table for guidance.
Metrics
Metrics measure trends over time. They are available for interactive analysis in the Azure portal with Azure Metrics
Explorer. Metrics also can be added to an Azure dashboard for visualization in combination with other data and
used for near-real time alerting.
Performance testing gives you the ability to see specific details on the processing capabilities of applications. You'll
most likely want a monitoring tool that allows you to discover proactively if the issues you find through testing are
appearing in both your infrastructure and applications. Azure Monitor Metrics is a feature of Azure Monitor that
collects metrics from monitored resources into a time series database.
With Azure Monitor, you can collect, analyze, and act on telemetry from your cloud and on-premises environments.
It helps you understand how applications are performing and identifies issues affecting them and the resources
they depend on.
For a list of Azure metrics, see Supported metrics with Azure Monitor.
Next steps
Performance monitoring
Monitoring for Scalability
11/2/2020 • 2 minutes to read • Edit Online
Monitoring for scalability should be part of your overall monitoring strategy that utilizes Azure Monitor. The overall
monitoring strategy should take into consideration not only scalability, but resiliency (infrastructure, application,
and dependent services) and application performance as well. Most services in Azure offer the ability to turn on
both data and management plane logs as well as metrics. For purposes of scalability, looking at the metrics allows
you to scale up, scale out, scale in, and scale down based on demand. The ability to scale dynamically is one of the
biggest values of moving to the cloud.
Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an
efficient manner, and is one of the pillars of the Microsoft Azure Well-Architected Framework. Use this checklist to
review your application architecture from a performance efficiency standpoint.
Application design
Partition the workload. Design parts of the process to be discrete and decomposable. Minimize the size of each
part, while following the usual rules for separation of concerns and the single responsibility principle. This allows
the component parts to be distributed in a way that maximizes use of each compute unit (such as a role or
database server). It also makes it easier to scale the application by adding instances of specific resources. For
complex domains, consider adopting a microservices architecture.
Design for scaling. Scaling allows applications to react to variable load by increasing and decreasing the number
of instances of roles, queues, and other services they use. However, the application must be designed with this in
mind. For example, the application and the services it uses must be stateless, to allow requests to be routed to any
instance. This also prevents the addition or removal of specific instances from adversely affecting current users.
You should also implement configuration or autodetection of instances as they are added and removed, so that
code in the application can perform the necessary routing. For example, a web application might use a set of
queues in a round-robin approach to route requests to background services running in worker roles. The web
application must be able to detect changes in the number of queues, to successfully route requests and balance the
load on the application.
Scale as a unit. Plan for additional resources to accommodate growth. For each resource, know the upper scaling
limits, and use sharding or decomposition to go beyond these limits. Determine the scale units for the system in
terms of well-defined sets of resources. This makes applying scale-out operations easier, and less prone to negative
impact on the application through limitations imposed by lack of resources in some part of the overall system. For
example, adding x number of web and worker roles might require y number of additional queues and z number of
storage accounts to handle the additional workload generated by the roles. So a scale unit could consist of x web
and worker roles, y queues, and z storage accounts. Design the application so that it's easily scaled by adding one
or more scale units.
Avoid client affinity. Where possible, ensure that the application does not require affinity. Requests can thus be
routed to any instance, and the number of instances is irrelevant. This also avoids the overhead of storing,
retrieving, and maintaining state information for each user.
Take advantage of platform autoscaling features. Where the hosting platform supports an autoscaling
capability, such as Azure Autoscale, prefer it to custom or third-party mechanisms unless the built-in mechanism
can't fulfill your requirements. Use scheduled scaling rules where possible to ensure resources are available
without a start-up delay, but add reactive autoscaling to the rules where appropriate to cope with unexpected
changes in demand. You can use the autoscaling operations in the classic deployment model to adjust autoscaling,
and to add custom counters to rules. For more information, see Auto-scaling guidance.
Offload CPU-intensive and I/O-intensive tasks as background tasks. If a request to a service is expected to
take a long time to run or absorb considerable resources, offload the processing for this request to a separate task.
Use worker roles or background jobs (depending on the hosting platform) to execute these tasks. This strategy
enables the service to continue receiving further requests and remain responsive. For more information, see
Background jobs guidance.
Distribute the workload for background tasks. Where there are many background tasks, or the tasks require
considerable time or resources, spread the work across multiple compute units (such as worker roles or
background jobs). For one possible solution, see the Competing Consumers pattern.
Consider moving toward a shared-nothing architecture. A shared-nothing architecture uses independent,
self-sufficient nodes that have no single point of contention (such as shared services or storage). In theory, such a
system can scale almost indefinitely. While a fully shared-nothing approach is generally not practical for most
applications, it may provide opportunities to design for better scalability. For example, avoiding server-side session
state, avoiding client affinity, and partitioning data are all moves toward a shared-nothing architecture.
Data management
Use data partitioning. Divide the data across multiple databases and database servers, or design the application
to use data storage services that can provide this partitioning transparently (examples include Azure SQL Database
Elastic Database, and Azure Table storage). This approach can help to maximize performance and allow easier
scaling. There are different partitioning techniques, such as horizontal, vertical, and functional. You can use a
combination of these to achieve maximum benefit from increased query performance, simpler scalability, more
flexible management, better availability, and to match the type of store to the data it will hold. Also, consider using
different types of data store for different types of data, choosing the types based on how well they are optimized
for the specific type of data. This may include using table storage, a document database, or a column-family data
store, instead of, or as well as, a relational database. For more information, see Data partitioning guidance.
Design for eventual consistency . Eventual consistency improves scalability by reducing or removing the time
needed to synchronize related data partitioned across multiple stores. The cost is that data is not always consistent
when it is read, and some write operations may cause conflicts. Eventual consistency is ideal for situations where
the same data is read frequently but written infrequently. For more information, see the Data Consistency Primer.
Reduce chatty interactions between components and services. Avoid designing interactions in which an
application is required to make multiple calls to a service (each of which returns a small amount of data), rather
than a single call that can return all of the data. Where possible, combine several related operations into a single
request when the call is to a service or component that has noticeable latency. This makes it easier to monitor
performance and optimize complex operations. For example, use stored procedures in databases to encapsulate
complex logic, and reduce the number of round trips and resource locking.
Use queues to level the load for high velocity data writes . Surges in demand for a service can overwhelm
that service and cause escalating failures. To prevent this, consider implementing the Queue-Based Load Leveling
pattern. Use a queue that acts as a buffer between a task and a service that it invokes. This can smooth intermittent
heavy loads that may otherwise cause the service to fail or the task to time out.
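A minimal sketch of the producer side of this pattern, assuming the Azure.Storage.Queues client library and a queue that already exists; the type and usage shown are illustrative:

```csharp
using System.Threading.Tasks;
using Azure.Storage.Queues;

// Load-leveling sketch: the request path enqueues the work item instead of calling the busy
// downstream service directly. A separate worker (not shown) drains the queue at a rate the
// service can sustain.
public static class OrderIngestion
{
    // The QueueClient (for example, new QueueClient(connectionString, "orders")) is created once
    // at startup and reused across requests.
    public static Task EnqueueOrderAsync(QueueClient ordersQueue, string orderJson) =>
        ordersQueue.SendMessageAsync(orderJson);
}
```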
Minimize the load on the data store . The data store is commonly a processing bottleneck, a costly resource,
and often not easy to scale out. Where possible, remove logic (such as processing XML documents or JSON
objects) from the data store, and perform processing within the application. For example, instead of passing XML to
the database (other than as an opaque string for storage), serialize or deserialize the XML within the application
layer and pass it in a form that is native to the data store. It's typically much easier to scale out the application than
the data store, so you should attempt to do as much of the compute-intensive processing as possible within the
application.
Minimize the volume of data retrieved . Retrieve only the data you require by specifying columns and using
criteria to select rows. Make use of table value parameters and the appropriate isolation level. Use mechanisms like
entity tags to avoid retrieving data unnecessarily.
Aggressively use caching . Use caching wherever possible to reduce the load on resources and services that
generate or deliver data. Caching is typically suited to data that is relatively static, or that requires considerable
processing to obtain. Caching should occur at all levels where appropriate in each layer of the application,
including data access and user interface generation. For more information, see the Caching Guidance.
Handle data growth and retention . The amount of data stored by an application grows over time. This growth
increases storage costs as well as latency when accessing the data, affecting application throughput and
performance. It may be possible to periodically archive some of the old data that is no longer accessed, or move
data that is rarely accessed into long-term storage that is more cost efficient, even if the access latency is higher.
Optimize Data Transfer Objects (DTOs) using an efficient binary format. DTOs are passed between the
layers of an application many times. Minimizing the size reduces the load on resources and the network. However,
balance the savings with the overhead of converting the data to the required format in each location where it is
used. Adopt a format that has the maximum interoperability to enable easy reuse of a component.
Set cache control . Design and configure the application to use output caching or fragment caching where
possible, to minimize processing load.
Enable client-side caching. Web applications should enable cache settings on the content that can be cached.
This is commonly disabled by default. Configure the server to deliver the appropriate cache control headers to
enable caching of content on proxy servers and clients.
Use Azure blob storage and the Azure Content Delivery Network to reduce the load on the
application. Consider storing static or relatively static public content, such as images, resources, scripts, and style
sheets, in blob storage. This approach relieves the application of the load caused by dynamically generating this
content for each request. Additionally, consider using the Content Delivery Network to cache this content and
deliver it to clients. Using the Content Delivery Network can improve performance at the client because the content
is delivered from the geographically closest datacenter that contains a Content Delivery Network cache. For more
information, see the Content Delivery Network Guidance.
Optimize and tune SQL queries and indexes . Some T-SQL statements or constructs may have an adverse
effect on performance that can be reduced by optimizing the code in a stored procedure. For example, avoid
converting datetime types to a varchar before comparing with a datetime literal value. Use date/time
comparison functions instead. Lack of appropriate indexes can also slow query execution. If you use an
object/relational mapping framework, understand how it works and how it may affect performance of the data
access layer. For more information, see Query Tuning.
Consider denormalizing data . Data normalization helps to avoid duplication and inconsistency. However,
maintaining multiple indexes, checking for referential integrity, performing multiple accesses to small chunks of
data, and joining tables to reassemble the data imposes an overhead that can affect performance. Consider if some
additional storage volume and duplication is acceptable in order to reduce the load on the data store. Also consider
if the application itself (which is typically easier to scale) can be relied on to take over tasks such as managing
referential integrity in order to reduce the load on the data store. For more information, see Data partitioning
guidance.
Implementation
Review the performance antipatterns . See Performance antipatterns for cloud applications for common
practices that are likely to cause scalability problems when an application is under pressure.
Use asynchronous calls . Use asynchronous code wherever possible when accessing resources or services that
may be limited by I/O or network bandwidth, or that have a noticeable latency, in order to avoid locking the calling
thread.
Avoid locking resources, and use an optimistic approach instead . Never lock access to resources such as
storage or other services that have noticeable latency, because this is a primary cause of poor performance. Always
use optimistic approaches to managing concurrent operations, such as writing to storage. Use features of the
storage layer to manage conflicts. In distributed applications, data may be only eventually consistent.
Compress highly compressible data over high latency, low bandwidth networks . In the majority of cases
in a web application, the largest volume of data generated by the application and passed over the network is HTTP
responses to client requests. HTTP compression can reduce this considerably, especially for static content. This can
reduce cost as well as reducing the load on the network, though compressing dynamic content does apply a
fractionally higher load on the server. In other, more generalized environments, data compression can reduce the
volume of data transmitted and minimize transfer time and costs, but the compression and decompression
processes incur overhead. As such, compression should only be used when there is a demonstrable gain in
performance. Other serialization methods, such as JSON or binary encodings, may reduce the payload size while
having less impact on performance, whereas XML is likely to increase it.
Minimize the time that connections and resources are in use . Maintain connections and resources only for
as long as you need to use them. For example, open connections as late as possible, and allow them to be returned
to the connection pool as soon as possible. Acquire resources as late as possible, and dispose of them as soon as
possible.
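A minimal sketch of this guidance, assuming Microsoft.Data.SqlClient; the connection string and query are placeholders. Disposing the connection promptly returns it to the connection pool:

```csharp
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

// Open the connection as late as possible and dispose it as soon as possible so that it
// returns to the connection pool for reuse by other requests.
public static class ShortLivedConnections
{
    public static async Task<int> CountOrdersAsync(string connectionString)
    {
        await using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();                        // drawn from the pool here

        await using var command = new SqlCommand("SELECT COUNT(*) FROM Orders", connection);
        object result = await command.ExecuteScalarAsync();
        return (int)result;
        // Disposal at the end of the method returns the connection to the pool.
    }
}
```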
Minimize the number of connections required . Service connections absorb resources. Limit the number that
are required and ensure that existing connections are reused whenever possible. For example, after performing
authentication, use impersonation where appropriate to run code as a specific identity. This can help to make best
use of the connection pool by reusing connections.
NOTE
APIs for some services automatically reuse connections, provided service-specific guidelines are followed. It's important that
you understand the conditions that enable connection reuse for each service that your application uses.
Send requests in batches to optimize network use . For example, send and read messages in batches when
accessing a queue, and perform multiple reads or writes as a batch when accessing storage or a cache. This can
help to maximize efficiency of the services and data stores by reducing the number of calls across the network.
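For example, a hedged sketch using the Azure.Storage.Queues client library reads up to 32 messages in a single call rather than one round trip per message; the processing logic is a placeholder:

```csharp
using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

// Batched queue access: receive up to 32 messages per call to reduce network round trips.
public static class QueueBatchReader
{
    public static async Task ProcessBatchAsync(QueueClient queue)
    {
        QueueMessage[] batch = await queue.ReceiveMessagesAsync(maxMessages: 32);
        foreach (QueueMessage message in batch)
        {
            // Handle the message (placeholder), then delete it so it is not redelivered.
            Console.WriteLine(message.MessageText);
            await queue.DeleteMessageAsync(message.MessageId, message.PopReceipt);
        }
    }
}
```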
Avoid a requirement to store server-side session state where possible. Server-side session state
management typically requires client affinity (that is, routing each request to the same server instance), which
affects the ability of the system to scale. Ideally, you should design clients to be stateless with respect to the servers
that they use. However, if the application must maintain session state, store sensitive data or large volumes of per-
client data in a distributed server-side cache that all instances of the application can access.
Optimize table storage schemas . When using table stores that require the table and column names to be
passed and processed with every query, such as Azure table storage, consider using shorter names to reduce this
overhead. However, do not sacrifice readability or manageability by using overly compact names.
Create resource dependencies during deployment or at application startup. Avoid repeated calls to
methods that test the existence of a resource and then create the resource if it does not exist. Methods such as
CloudTable.CreateIfNotExists and CloudQueue.CreateIfNotExists in the Azure Storage Client Library follow this
pattern. These methods can impose considerable overhead if they are invoked before each access to a storage
table or storage queue. Instead (a brief sketch follows this list):
Create the required resources when the application is deployed, or when it first starts (a single call to
CreateIfNotExists for each resource in the startup code for a web or worker role is acceptable). However, be sure
to handle exceptions that may arise if your code attempts to access a resource that doesn't exist. In these
situations, you should log the exception, and possibly alert an operator that a resource is missing.
Under some circumstances, it may be appropriate to create the missing resource as part of the exception
handling code. But you should adopt this approach with caution as the non-existence of the resource might be
indicative of a programming error (a misspelled resource name for example), or some other infrastructure-level
issue.
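A minimal sketch of the startup-creation approach above, assuming the Azure.Storage.Queues client library; the queue names and connection string are placeholders:

```csharp
using System.Threading.Tasks;
using Azure.Storage.Queues;

// Create queue dependencies once at application startup rather than before every access.
public static class StartupResources
{
    public static async Task EnsureQueuesExistAsync(string storageConnectionString)
    {
        foreach (string name in new[] { "orders", "poison-orders" })
        {
            var queue = new QueueClient(storageConnectionString, name);
            await queue.CreateIfNotExistsAsync();   // a single call per resource at startup is acceptable
        }
    }
}
```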
Use lightweight frameworks . Carefully choose the APIs and frameworks you use to minimize resource usage,
execution time, and overall load on the application. For example, using Web API to handle service requests can
reduce the application footprint and increase execution speed, but it may not be suitable for advanced scenarios
where the additional capabilities of Windows Communication Foundation are required.
Consider minimizing the number of service accounts. For example, use a specific account to access
resources or services that impose a limit on connections, or perform better where fewer connections are
maintained. This approach is common for services such as databases, but it can affect the ability to accurately audit
operations due to the impersonation of the original user.
Carry out performance profiling and load testing during development, as part of test routines, and before
final release to ensure the application performs and scales as required. This testing should occur on the same type
of hardware as the production platform, and with the same types and quantities of data and user load as it will
encounter in production. For more information, see Testing the performance of a cloud service.
Overview of the reliability pillar
11/2/2020 • 13 minutes to read • Edit Online
Building a reliable application in the cloud is different from traditional application development. While historically
you may have purchased levels of redundant higher-end hardware to minimize the chance of an entire application
platform failing, in the cloud, we acknowledge up front that failures will happen. Instead of trying to prevent
failures altogether, the goal is to minimize the effects of a single failing component.
To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, see the
Microsoft Azure Well-Architected Review.
Reliable applications are:
Resilient and recover gracefully from failures, and they continue to function with minimal downtime and data
loss before full recovery.
Highly available (HA) and run as designed in a healthy state with no significant downtime.
Understanding how these elements work together — and how they affect cost — is essential to building a reliable
application. It can help you determine how much downtime is acceptable, the potential cost to your business, and
which functions are necessary during a recovery.
This article provides a brief overview of building reliability into each step of the Azure application design process.
Each section includes a link to an in-depth article on how to integrate reliability into that specific step in the
process. If you're looking for reliability considerations for individual Azure services, review the Resiliency checklist
for specific Azure services.
Define requirements
Identify your business needs, and build your reliability plan to address them. Consider the following:
Identify workloads and usage. A workload is a distinct capability or task that is logically separated from
other tasks, in terms of business logic and data storage requirements. Each workload has different
requirements for availability, scalability, data consistency, and disaster recovery.
Plan for usage patterns. Usage patterns also play a role in requirements. Identify differences in
requirements during critical and non-critical periods. For example, a tax-filing application can't fail during a
filing deadline. To ensure uptime, plan redundancy across several regions in case one fails. Conversely, to
minimize costs during non-critical periods, you can run your application in a single region.
Identify critical components and paths. Not all components of your system might be as important as
others. For example, you might have an optional component that adds incremental value, but that the
workload can run without if necessary. Understand where these components are, and conversely, where the
critical parts of your workload are. This will help to scope your availability and reliability metrics and to plan
your recovery strategies to prioritize the highest-importance components.
Establish availability metrics — mean time to recovery (MTTR) and mean time between
failures (MTBF). MTTR is the average time it takes to restore a component after a failure. MTBF is how
long a component can reasonably expect to last between outages. Use these measures to determine where
to add redundancy and to determine service-level agreements (SLAs) for customers.
Establish recovery metrics — recovery time objective (RTO) and recovery point objective
(RPO). RTO is the maximum acceptable time an application can be unavailable after an incident. RPO is the
maximum duration of data loss that is acceptable during a disaster. To derive these values, conduct a risk
assessment and make sure you understand the cost and risk of downtime or data loss in your organization.
NOTE
If the MTTR of any critical component in a highly available setup exceeds the system RTO, a failure in the system
might cause an unacceptable business disruption. That is, you can't restore the system within the defined RTO.
Determine workload availability targets. To ensure that application architecture meets your business
requirements, define target SLAs for each workload. Account for the cost and complexity of meeting
availability requirements, in addition to application dependencies.
Understand service-level agreements. In Azure, the SLA describes the Microsoft commitments for
uptime and connectivity. If the SLA for a particular service is 99.9 percent, you should expect the service to
be available 99.9 percent of the time.
Define your own target SLAs for each workload in your solution, so you can determine whether the
architecture meets the business requirements. For example, if a workload requires 99.99 percent uptime
but depends on a service with a 99.9 percent SLA, that service can't be a single point of failure in the
system.
For more information about developing requirements for reliable applications, see Application design for
resiliency.
Building resiliency (recovering from failures) and availability (running in a healthy state without significant
downtime) into your apps begins with gathering requirements. For example, how much downtime is acceptable?
How much does potential downtime cost your business? What are your customer's availability requirements? How
much do you invest in making your application highly available? What is the risk versus the cost?
Resiliency strategies
This section describes some common resiliency strategies. Most of these strategies are not limited to a particular
technology. The descriptions summarize the general idea behind each technique and include links to further
reading.
Implement resiliency patterns for remote operations, where appropriate. If your application depends on
communication between remote services, follow design patterns for dealing with transient failures.
Retry transient failures. These can be caused by momentary loss of network connectivity, a dropped
database connection, or a timeout when a service is busy. Often, a transient failure can be resolved by
retrying the request (a minimal sketch follows this list).
For many Azure services, the client software development kit (SDK) implements automatic retries in a
way that is transparent to the caller. See Retry guidance for specific services.
Or implement the Retry pattern to help the application handle anticipated, temporary failures
transparently when it tries to connect to a service or network resource.
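A minimal sketch of the Retry pattern using the Polly library; the endpoint URL, retry count, and back-off values are illustrative assumptions rather than recommended settings:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.Retry;

// Retry transient HTTP failures with exponential back-off (2s, 4s, 8s between attempts).
public static class TransientRetry
{
    private static readonly AsyncRetryPolicy RetryPolicy = Policy
        .Handle<HttpRequestException>()
        .WaitAndRetryAsync(
            retryCount: 3,
            sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    public static Task<string> GetWithRetryAsync(HttpClient client) =>
        RetryPolicy.ExecuteAsync(() => client.GetStringAsync("https://example.com/api/values"));  // placeholder URL
}
```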
Apply compensating transactions . A compensating transaction is a transaction that undoes the effects
of another completed transaction. In a distributed system, it can be difficult to achieve strong transactional
consistency. Compensating transactions help to achieve consistency by using a series of smaller, individual
transactions that can be undone at each step. For example, to book a trip, a customer might reserve a car, a
hotel room, and a flight. If one of these steps fails, the entire operation fails. Instead of trying to use a single
distributed transaction for the entire operation, you can define a compensating transaction for each step.
Implement asynchronous operations, whenever possible. Synchronous operations can monopolize
resources and block other operations while the caller waits for the process to complete. Design each part of
your application to allow for asynchronous operations, whenever possible. For more information on how to
implement asynchronous programming in C#, see Asynchronous Programming.
Load balancing
Proper load-balancing allows you to meet availability requirements and to minimize costs associated with
availability.
Use load-balancing to distribute requests. Load-balancing distributes your application's requests to
healthy service instances by removing unhealthy instances from rotation. If your service uses Azure App
Service or Azure Cloud Services, it's already load-balanced for you. However, if your application uses Azure
VMs, you need to provision a load-balancer. For more information, see What is Azure Load Balancer?
You can use Azure Load Balancer to:
Load-balance incoming Internet traffic to your VMs. This configuration is known as a public Load
Balancer.
Load-balance traffic across VMs inside a virtual network. You can also reach a Load Balancer front end
from an on-premises network in a hybrid scenario. Both scenarios use a configuration that is known as
an internal Load Balancer.
Port forward traffic to an itemized port on specific VMs with inbound network address translation (NAT)
rules.
Provide outbound connectivity for VMs inside your virtual network by using a public Load Balancer.
Balance loads across regions with a traffic manager, such as Azure Traffic Manager. To load-
balance traffic across regions requires a traffic management solution, and Azure provides Traffic Manager.
You can also take advantage of third-party services that provide similar traffic-management capabilities.
For more information about the FMA process, with specific recommendations for Azure, see Failure mode analysis.
NOTE
The selection of the Resource Group location is important. In the event of a regional outage, you will be unable to control
resources inside that Resource Group, regardless of what region those resources are actually in (i.e., the resources in the
other region(s) will continue to function, but management plane operations will be unavailable).
Your response to a region-wide service disruption depends on your deployment and your disaster recovery plan.
As a cost-control strategy, for non-critical applications that don't require a guaranteed recovery time, it might
make sense to redeploy to a different region.
For applications that are hosted in another region with deployed roles but don't distribute traffic across regions
(active/passive deployment), switch to the secondary hosted service in the alternate region.
For applications that have a full-scale secondary deployment in another region (active/active deployment),
route traffic to that region.
To learn more about recovering from a region-wide service disruption, see Recover from a region-wide service
disruption.
VM recovery
For critical apps, plan for recovering VMs in the event of a region-wide service disruption.
Use Azure Backup or another backup method to create cross-region backups that are application consistent.
(Replication of the Backup vault must be configured at the time of creation.)
Use Site Recovery to replicate across regions for one-click application failover and failover testing.
Use Traffic Manager to automate user traffic failover to another region.
To learn more, see Recover from a region-wide service disruption, Virtual machines.
Storage recovery
To protect your storage in the event of a region-wide service disruption:
Use geo-redundant storage.
Know where your storage is geo-replicated. This affects where you deploy other instances of your data that
require regional affinity with your storage.
Check data for consistency after failover and, if necessary, restore from a backup.
To learn more, see Designing highly available applications using RA-GRS.
SQL Database and SQL Server
Azure SQL Database provides two types of recovery:
Use geo-restore to restore a database from a backup copy in another region. For more information, see
Recover an Azure SQL database using automated database backups.
Use active geo-replication to fail over to a secondary database. For more information, see Creating and using
active geo-replication.
For SQL Server running on VMs, see High availability and disaster recovery for SQL Server in Azure Virtual
Machines.
Error handling for resilient applications in Azure
11/2/2020 • 2 minutes to read • Edit Online
Ensuring that your application can recover from errors is critical when working in a distributed system.
A reference implementation is available here. It uses Polly and IHttpClientBuilder to implement the Circuit
Breaker pattern.
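A hedged sketch of wiring a circuit breaker into an HttpClient registration with IHttpClientBuilder and Polly, assuming the Microsoft.Extensions.Http.Polly package. The client name, endpoint, and thresholds are placeholders, and this is not the reference implementation itself:

```csharp
using System;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;
using Polly;
using Polly.Extensions.Http;

public static class ResilienceRegistration
{
    public static void AddDownstreamClient(IServiceCollection services)
    {
        services.AddHttpClient("downstream", client =>
            {
                client.BaseAddress = new Uri("https://example.com/");   // placeholder endpoint
            })
            // Open the circuit after 5 consecutive transient failures; stay open for 30 seconds.
            .AddPolicyHandler(HttpPolicyExtensions
                .HandleTransientHttpError()
                .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30)));
    }
}
```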
Request timeouts
When making a service call or a database call, ensure that appropriate request timeouts are set. Database
connection timeouts are typically set to 30 seconds. See the guidance on troubleshooting, diagnosing, and
preventing SQL connection errors and transient errors for SQL Database.
Leverage design patterns that encapsulate robust timeout strategies like Choreography pattern or Compensating
Transaction pattern.
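A minimal sketch of setting explicit timeouts, assuming HttpClient for service calls and Microsoft.Data.SqlClient for database calls; the values and connection string are placeholders to adjust to your own requirements:

```csharp
using System;
using System.Net.Http;
using Microsoft.Data.SqlClient;

public static class TimeoutDefaults
{
    // HTTP calls: fail fast instead of waiting for the default 100-second timeout.
    public static HttpClient CreateHttpClient() =>
        new HttpClient { Timeout = TimeSpan.FromSeconds(10) };

    // Database calls: a 30-second connect timeout plus a per-command timeout.
    public static SqlCommand CreateCommand(string query)
    {
        var connection = new SqlConnection(
            "Server=tcp:myserver.database.windows.net;Database=mydb;Connect Timeout=30;");  // placeholder
        return new SqlCommand(query, connection) { CommandTimeout = 15 };
    }
}
```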
Cascading Failures
The Circuit Breaker pattern provides stability while the system recovers from a failure and minimizes the impact on
performance. It can help to maintain the response time of the system by quickly rejecting a request for an operation
that's likely to fail, rather than waiting for the operation to time out, or never return.
Retry pattern. Describes how an application can handle anticipated temporary failures when it tries to connect to a
service or network resource by transparently retrying an operation that has previously failed.
Health Endpoint Monitoring pattern. A circuit breaker might be able to test the health of a service by sending a
request to an endpoint exposed by the service. The service should return information indicating its status.
Failure mode analysis (FMA) is a process for building resiliency into a system, by identifying possible failure points
in the system. The FMA should be part of the architecture and design phases, so that you can build failure recovery
into the system from the beginning.
Here is the general process to conduct an FMA:
1. Identify all of the components in the system. Include external dependencies, such as identity providers, third-
party services, and so on.
2. For each component, identify potential failures that could occur. A single component may have more than
one failure mode. For example, you should consider read failures and write failures separately, because the
impact and possible mitigation steps will be different.
3. Rate each failure mode according to its overall risk. Consider these factors:
What is the likelihood of the failure? Is it relatively common? Extremely rare? You don't need exact
numbers; the purpose is to help rank the priority.
What is the impact on the application, in terms of availability, data loss, monetary cost, and business
disruption?
4. For each failure mode, determine how the application will respond and recover. Consider tradeoffs in cost
and application complexity.
As a starting point for your FMA process, this article contains a catalog of potential failure modes and their
mitigation steps. The catalog is organized by technology or Azure service, plus a general category for application-
level design. The catalog is not exhaustive, but covers many of the core Azure services.
App Service
App Service app shuts down.
Detection. Possible causes:
Expected shutdown
An operator shuts down the application; for example, using the Azure portal.
The app was unloaded because it was idle. (Only if the Always On setting is disabled.)
Unexpected shutdown
The app crashes.
An App Service VM instance becomes unavailable.
Logging in Application_End will catch the app domain shutdown (soft process crash), and is the only way to catch
application domain shutdowns.
Recovery:
If the shutdown was expected, use the application's shutdown event to shut down gracefully. For example, in
ASP.NET, use the Application_End method.
If the application was unloaded while idle, it is automatically restarted on the next request. However, you will
incur the "cold start" cost.
To prevent the application from being unloaded while idle, enable the Always On setting in the web app. See
Configure web apps in Azure App Service.
To prevent an operator from shutting down the app, set a resource lock with ReadOnly level. See Lock resources
with Azure Resource Manager.
If the app crashes or an App Service VM becomes unavailable, App Service automatically restarts the app.
Diagnostics. Application logs and web server logs. See Enable diagnostics logging for web apps in Azure App
Service.
A particular user repeatedly makes bad requests or overloads the system.
Detection. Authenticate users and include user ID in application logs.
Recovery:
Use Azure API Management to throttle requests from the user. See Advanced request throttling with Azure API
Management
Block the user.
Diagnostics. Log all authentication requests.
A bad update was deployed.
Detection. Monitor the application health through the Azure portal (see Monitor Azure web app performance) or
implement the health endpoint monitoring pattern.
Recovery. Use multiple deployment slots and roll back to the last-known-good deployment. For more
information, see Basic web application.
Azure Search
Writing data to Azure Search fails.
Detection. Catch Microsoft.Rest.Azure.CloudException errors.
Recovery:
The Search .NET SDK automatically retries after transient failures. Any exceptions thrown by the client SDK should
be treated as nontransient errors.
The default retry policy uses exponential back-off. To use a different retry policy, call SetRetryPolicy on the
SearchIndexClient or SearchServiceClient class. For more information, see Automatic Retries.
Diagnostics. Use Search Traffic Analytics.
Reading data from Azure Search fails.
Detection. Catch Microsoft.Rest.Azure.CloudException errors.
Recovery:
The Search .NET SDK automatically retries after transient failures. Any exceptions thrown by the client SDK should
be treated as nontransient errors.
The default retry policy uses exponential back-off. To use a different retry policy, call SetRetryPolicy on the
SearchIndexClient or SearchServiceClient class. For more information, see Automatic Retries.
Cassandra
Reading or writing to a node fails.
Detection. Catch the exception. For .NET clients, this will typically be System.Web.HttpException. Other clients may
have other exception types. For more information, see Cassandra error handling done right.
Recovery:
Each Cassandra client has its own retry policies and capabilities. For more information, see Cassandra error
handling done right.
Use a rack-aware deployment, with data nodes distributed across the fault domains.
Deploy to multiple regions with local quorum consistency. If a nontransient failure occurs, fail over to another
region.
Diagnostics. Application logs.
Cloud Service
Web or worker roles are unexpectedly being shut down.
Detection. The RoleEnvironment.Stopping event is fired.
Recovery. Override the RoleEntryPoint.OnStop method to gracefully clean up. For more information, see The Right
Way to Handle Azure OnStop Events (blog).
Cosmos DB
Reading data fails.
Detection. Catch System.Net.Http.HttpRequestException or Microsoft.Azure.Documents.DocumentClientException.
Recovery:
The SDK automatically retries failed attempts. To set the number of retries and the maximum wait time,
configure ConnectionPolicy.RetryOptions . Exceptions that the client raises are either beyond the retry policy or
are not transient errors.
If Cosmos DB throttles the client, it returns an HTTP 429 error. Check the status code in the
DocumentClientException . If you are getting error 429 consistently, consider increasing the throughput value of
the collection.
If you are using the MongoDB API, the service returns error code 16500 when throttling.
Replicate the Cosmos DB database across two or more regions. All replicas are readable. Using the client SDKs,
specify the PreferredLocations parameter. This is an ordered list of Azure regions. All reads will be sent to the
first available region in the list. If the request fails, the client will try the other regions in the list, in order. For
more information, see How to set up Azure Cosmos DB global distribution using the SQL API.
Diagnostics. Log all errors on the client side.
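A hedged sketch of configuring the retry and preferred-location options mentioned above with the v2 DocumentDB SDK referenced in this catalog; the endpoint, key, retry values, and region names are placeholders:

```csharp
using System;
using Microsoft.Azure.Documents.Client;

public static class CosmosClientFactory
{
    public static DocumentClient Create(string endpoint, string authKey)
    {
        var connectionPolicy = new ConnectionPolicy
        {
            RetryOptions = new RetryOptions
            {
                MaxRetryAttemptsOnThrottledRequests = 9,   // retries on HTTP 429 (throttling)
                MaxRetryWaitTimeInSeconds = 30
            }
        };

        // Reads go to the first available region in this ordered list.
        connectionPolicy.PreferredLocations.Add("East US");
        connectionPolicy.PreferredLocations.Add("West US");

        return new DocumentClient(new Uri(endpoint), authKey, connectionPolicy);
    }
}
```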
Writing data fails.
Detection. Catch System.Net.Http.HttpRequestException or Microsoft.Azure.Documents.DocumentClientException.
Recovery:
The SDK automatically retries failed attempts. To set the number of retries and the maximum wait time,
configure ConnectionPolicy.RetryOptions . Exceptions that the client raises are either beyond the retry policy or
are not transient errors.
If Cosmos DB throttles the client, it returns an HTTP 429 error. Check the status code in the
DocumentClientException . If you are getting error 429 consistently, consider increasing the throughput value of
the collection.
Replicate the Cosmos DB database across two or more regions. If the primary region fails, another region will
be promoted to write. You can also trigger a failover manually. The SDK does automatic discovery and routing,
so application code continues to work after a failover. During the failover period (typically minutes), write
operations will have higher latency, as the SDK finds the new write region. For more information, see How to set
up Azure Cosmos DB global distribution using the SQL API.
As a fallback, persist the document to a backup queue, and process the queue later.
Diagnostics. Log all errors on the client side.
Queue storage
Writing a message to Azure Queue storage fails consistently.
Detection. After N retry attempts, the write operation still fails.
Recovery:
Store the data in a local cache, and forward the writes to storage later, when the service becomes available.
Create a secondary queue, and write to that queue if the primary queue is unavailable.
Diagnostics. Use storage metrics.
The application cannot process a particular message from the queue.
Detection. Application specific. For example, the message contains invalid data, or the business logic fails for some
reason.
Recovery:
Move the message to a separate queue. Run a separate process to examine the messages in that queue.
Consider using Azure Service Bus Messaging queues, which provide dead-letter queue functionality for this
purpose.
NOTE
If you are using Storage queues with WebJobs, the WebJobs SDK provides built-in poison message handling. See How to use
Azure queue storage with the WebJobs SDK.
SQL Database
Cannot connect to the database in the primary region.
Detection. Connection fails.
Recovery:
Prerequisite: The database must be configured for active geo-replication. See SQL Database Active Geo-Replication.
For queries, read from a secondary replica.
For inserts and updates, manually fail over to a secondary replica. See Initiate a planned or unplanned failover
for Azure SQL Database.
The replica uses a different connection string, so you will need to update the connection string in your application.
Client runs out of connections in the connection pool.
Detection. Catch System.InvalidOperationException errors.
Recovery:
Retry the operation.
As a mitigation plan, isolate the connection pools for each use case, so that one use case can't dominate all the
connections.
Increase the maximum connection pools.
Diagnostics . Application logs.
Database connection limit is reached.
Detection. Azure SQL Database limits the number of concurrent workers, logins, and sessions. The limits depend
on the service tier. For more information, see Azure SQL Database resource limits.
To detect these errors, catch System.Data.SqlClient.SqlException and check the value of SqlException.Number for
the SQL error code. For a list of relevant error codes, see SQL error codes for SQL Database client applications:
Database connection error and other issues.
Recovery. These errors are considered transient, so retrying may resolve the issue. If you consistently hit these
errors, consider scaling the database.
Diagnostics. The sys.event_log query returns successful database connections, connection failures, and
deadlocks.
Create an alert rule for failed connections.
Enable SQL Database auditing and check for failed logins.
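Referring back to the detection guidance above, here is a sketch of checking SqlException.Number for transient error codes; the numbers shown are a commonly cited subset, not the authoritative list in the linked article.

using System.Data.SqlClient;

public static class SqlErrorExample
{
    public static bool IsTransientSqlError(SqlException ex)
    {
        switch (ex.Number)
        {
            case 4060:   // Cannot open database
            case 10928:  // Resource limit reached
            case 10929:  // Resource limit exceeded
            case 40197:  // Error processing the request
            case 40501:  // The service is currently busy
            case 40613:  // Database is not currently available
                return true;
            default:
                return false;
        }
    }
}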
Service Fabric
A request to a service fails.
Detection. The service returns an error.
Recovery:
Locate a proxy again ( ServiceProxy or ActorProxy ) and call the service/actor method again.
Stateful service. Wrap operations on reliable collections in a transaction. If there is an error, the transaction
will be rolled back. The request, if pulled from a queue, will be processed again.
Stateless service. If the service persists data to an external store, all operations need to be idempotent.
Diagnostics. Application log.
Service Fabric node is shut down.
Detection. A cancellation token is passed to the service's RunAsync method. Service Fabric cancels the task before
shutting down the node.
Recovery. Use the cancellation token to detect shutdown. When Service Fabric requests cancellation, finish any
work and exit RunAsync as quickly as possible.
Diagnostics. Application logs.
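A minimal sketch of honoring the cancellation token in RunAsync, assuming a service class that derives from StatelessService in Microsoft.ServiceFabric.Services.Runtime.

using System;
using System.Threading;
using System.Threading.Tasks;

// Member of a class deriving from StatelessService (assumed, not shown).
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    while (true)
    {
        // Throws OperationCanceledException when Service Fabric requests shutdown,
        // which ends RunAsync promptly.
        cancellationToken.ThrowIfCancellationRequested();

        // Do one unit of work here, then pause briefly.
        await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
    }
}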
Storage
Writing data to Azure Storage fails
Detection. The client receives errors when writing.
Recovery:
1. Retry the operation, to recover from transient failures. The retry policy in the client SDK handles this
automatically.
2. Implement the Circuit Breaker pattern to avoid overwhelming storage.
3. If N retry attempts fail, perform a graceful fallback. For example:
Store the data in a local cache, and forward the writes to storage later, when the service becomes
available.
If the write action was in a transactional scope, compensate the transaction.
Diagnostics. Use storage metrics.
Reading data from Azure Storage fails.
Detection. The client receives errors when reading.
Recovery:
1. Retry the operation, to recover from transient failures. The retry policy in the client SDK handles this
automatically.
2. For RA-GRS storage, if reading from the primary endpoint fails, try reading from the secondary endpoint. The
client SDK can handle this automatically. See Azure Storage replication.
3. If N retry attempts fail, take a fallback action to degrade gracefully. For example, if a product image can't be
retrieved from storage, show a generic placeholder image.
Diagnostics. Use storage metrics.
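For the RA-GRS fallback described in step 2, a sketch using the classic Microsoft.Azure.Storage client might look like the following; the connection string is a placeholder.

using System;
using Microsoft.Azure.Storage;
using Microsoft.Azure.Storage.Blob;
using Microsoft.Azure.Storage.RetryPolicies;

// Assumes an RA-GRS account: configure reads to fall back to the secondary endpoint
// when the primary is unavailable.
CloudStorageAccount account = CloudStorageAccount.Parse("<storage-connection-string>");
CloudBlobClient blobClient = account.CreateCloudBlobClient();

blobClient.DefaultRequestOptions.LocationMode = LocationMode.PrimaryThenSecondary;
blobClient.DefaultRequestOptions.RetryPolicy = new ExponentialRetry(TimeSpan.FromSeconds(2), 3);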
Virtual machine
Connection to a backend VM fails.
Detection. Network connection errors.
Recovery:
Deploy at least two backend VMs in an availability set, behind a load balancer.
If the connection error is transient, sometimes TCP will successfully retry sending the message.
Implement a retry policy in the application.
For persistent or nontransient errors, implement the Circuit Breaker pattern.
If the calling VM exceeds its network egress limit, the outbound queue will fill up. If the outbound queue is
consistently full, consider scaling out.
Diagnostics. Log events at service boundaries.
VM instance becomes unavailable or unhealthy.
Detection. Configure a Load Balancer health probe that signals whether the VM instance is healthy. The probe
should check whether critical functions are responding correctly.
Recovery. For each application tier, put multiple VM instances into the same availability set, and place a load
balancer in front of the VMs. If the health probe fails, the Load Balancer stops sending new connections to the
unhealthy instance.
Diagnostics. Use Load Balancer log analytics.
Configure your monitoring system to monitor all of the health monitoring endpoints.
Operator accidentally shuts down a VM.
Detection. N/A
Recovery. Set a resource lock with ReadOnly level. See Lock resources with Azure Resource Manager.
Diagnostics. Use Azure Activity Logs.
WebJobs
Continuous job stops running when the SCM host is idle.
Detection. Pass a cancellation token to the WebJob function. For more information, see Graceful shutdown.
Recovery. Enable the Always On setting in the web app. For more information, see Run Background tasks with
WebJobs.
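A sketch of a WebJob function that accepts a cancellation token for graceful shutdown, assuming the WebJobs SDK; the queue name and logging calls are illustrative.

using System.Threading;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public class Functions
{
    public static void ProcessQueueMessage(
        [QueueTrigger("orders")] string message,
        CancellationToken cancellationToken,
        ILogger logger)
    {
        if (cancellationToken.IsCancellationRequested)
        {
            logger.LogWarning("Shutdown requested; stopping before processing the message.");
            return;
        }

        logger.LogInformation("Processing message: {Message}", message);
    }
}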
Application design
Application can't handle a spike in incoming requests.
Detection. Depends on the application. Typical symptoms:
The website starts returning HTTP 5xx error codes.
Dependent services, such as database or storage, start to throttle requests. Look for HTTP errors such as HTTP
429 (Too Many Requests), depending on the service.
HTTP queue length grows.
Recovery:
Scale out to handle increased load.
Mitigate failures so that cascading failures don't disrupt the entire application. Mitigation strategies
include:
Implement the Throttling pattern to avoid overwhelming backend systems.
Use queue-based load leveling to buffer requests and process them at an appropriate pace.
Prioritize certain clients. For example, if the application has free and paid tiers, throttle customers on the
free tier, but not paid customers. See Priority queue pattern.
Diagnostics. Use App Service diagnostic logging. Use a service such as Azure Log Analytics, Application Insights,
or New Relic to help understand the diagnostic logs.
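As an illustration of the queue-based load leveling strategy listed above, a minimal sketch (assuming Azure Storage queues; the queue name and payload are illustrative) might enqueue work instead of calling the backend directly, so a worker can drain the queue at a pace the backend can sustain.

using System.Threading.Tasks;
using Microsoft.Azure.Storage.Queue;

public static class OrderIntake
{
    public static async Task EnqueueOrderAsync(CloudQueue workQueue, string orderJson)
    {
        // Enqueueing is fast and cheap, so the front end stays responsive even when
        // the backend is saturated; a separate worker processes the queue.
        await workQueue.AddMessageAsync(new CloudQueueMessage(orderJson));
    }
}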
Next steps
For more information about the FMA process, see Resilience by design for cloud services (PDF download).
Backup and disaster recovery for Azure applications
12/18/2020 • 14 minutes to read • Edit Online
Disaster recovery is the process of restoring application functionality in the wake of a catastrophic loss.
Your tolerance for reduced functionality during a disaster is a business decision that varies from one application to
the next. It might be acceptable for some applications to be unavailable or to be partially available with reduced
functionality or delayed processing for a period of time. For other applications, any reduced functionality is
unacceptable.
Network outage
When parts of the Azure network are inaccessible, you might not be able to access your application or data. In this
situation, we recommend designing the disaster recovery strategy to run most applications with reduced
functionality.
If reducing functionality isn't an option, the remaining options are application downtime or failover to an alternate
region.
In a reduced functionality scenario:
If your application can't access its data because of an Azure network outage, you might be able to run locally
with reduced application functionality by using cached data.
You might be able to store data in an alternate location until connectivity is restored.
Manual responses
Although automation is ideal, some strategies for disaster recovery require manual responses.
Alerts
Monitor your application for warning signs that may require proactive intervention. For example, if Azure SQL
Database or Azure Cosmos DB consistently throttles your application, you might need to increase your database
capacity or optimize your queries. Even though the application might handle the throttling errors transparently,
your telemetry should still raise an alert so that you can follow up.
For service limits and quota thresholds, we recommend configuring alerts on Azure resources metrics and
diagnostics logs. When possible, set up alerts on metrics, which have lower latency than diagnostics logs.
Through Resource Health, Azure provides some built-in health status checks that can help you diagnose Azure
service throttling issues.
Failover
Configure a disaster recovery strategy for each Azure application and its Azure services. Acceptable deployment
strategies to support disaster recovery may vary based on the SLAs required for all components of each
application.
Azure provides different features within many Azure services to allow for manual failover, such as redis cache geo-
replicas, or for automated failover, such as SQL auto-failover groups. For example:
For an application that mainly uses virtual machines, you can use Azure Site Recovery for the web and logic
tiers. For more information, see Azure to Azure disaster recovery architecture. For SQL Server on VMs, use SQL
Server Always On availability groups.
For an application that uses App Service and Azure SQL Database, you can use a smaller tier App Service plan
configured in the secondary region, which autoscales when a failover occurs. Use failover groups for the
database tier.
In either scenario, an Azure Traffic Manager profile provides for the automated failover across regions. Load
balancers or application gateways should be set up in the secondary region to support faster availability on
failover.
Operational readiness testing
Perform an operational readiness test for failover to the secondary region and for failback to the primary region.
Many Azure services support manual failover or test failover for disaster recovery drills. Alternatively, you can
simulate an outage by shutting down or removing Azure services.
NOTE
In some scenarios, such as that of VMs backed up using Azure Backup, you can restore only from a backup in the same
region. Other Azure services, such as Azure Cache for Redis, provide geo-replicated backups, which you can use to restore
services across regions.
Backup strategy
Many alternative strategies are available for implementing distributed compute across regions. These must be
tailored to the specific business requirements and circumstances of the application. At a high level, the approaches
can be divided into the following categories:
Redeploy on disaster: In this approach, the application is redeployed from scratch at the time of disaster.
This is appropriate for non-critical applications that don’t require a guaranteed recovery time. Redeploy to a
new region
Warm Spare (Active/Passive): A secondary hosted service is created in an alternate region, and roles are
deployed to guarantee minimal capacity; however, the roles don’t receive production traffic. This approach
is useful for applications that have not been designed to distribute traffic across regions. Basic Web
Application example, Replicate VM to another region
Hot Spare (Active/Active): The application is designed to receive production load in multiple regions. The
cloud services in each region might be configured for higher capacity than required for disaster recovery
purposes. Alternatively, the cloud services might scale out as necessary at the time of a disaster and failover.
This approach requires substantial investment in application design, but it has significant benefits. These
include low and guaranteed recovery time, continuous testing of all recovery locations, and efficient usage
of capacity. Multi tier DR example
Resource management
You can distribute compute instances across regions by creating a separate cloud service in each target region,
and then publishing the deployment package to each cloud service. However, distributing traffic across cloud
services in different regions must be implemented by the application developer or with a traffic management
service.
Determining the number of spare role instances to deploy in advance for disaster recovery is an important aspect
of capacity planning. Having a full-scale secondary deployment ensures that capacity is already available when
needed; however, this effectively doubles the cost. A common pattern is to have a small, secondary deployment,
just large enough to run critical services. This small secondary deployment is a good idea, both to reserve capacity,
and for testing the configuration of the secondary environment.
NOTE
The subscription quota is not a capacity guarantee. The quota is simply a credit limit. To guarantee capacity, the required
number of roles must be defined in the service model, and the roles must be deployed.
Validating backups
Regularly verify that your backup data is what you expect by running a script to validate data integrity, schema,
and queries. There's no point having a backup if it's not useful to restore your data sources. Log and report any
inconsistencies so the backup service can be repaired.
Backup Storage
Backups are about protecting data, applications, and systems that are important to the organization. In operations
environments, it's easy to provide backups: pick the workload that needs hyper-availability and back it up.
Operations environments are relatively static, in that the systems and applications used remain relatively
consistent, with only the data changing daily.
Application archives
It's important to remember that a DR plan is more than just an ordered restoration from backup and validation
process. Applications may require post-restoration configuration due to site changes, or reinstallation may be
necessary, with restored data imported afterward.
Outage retrospectives
No amount of safeguards or preparation can prevent every possible incident, and sometimes simple human error
can have significant consequences for a development project. You can't avoid it, but you can learn from it and take
steps to minimize the chances of a similar incident in the future. The question is how software organizations can
best go about learning from mistakes with agile postmortems.
Service-specific guidance
The following articles describe disaster recovery for specific Azure services:
SERVICE                           ARTICLE
Azure Database for MySQL          Overview of business continuity with Azure Database for MySQL
Azure Database for PostgreSQL     Overview of business continuity with Azure Database for PostgreSQL
Azure Cloud Services              What to do in the event of an Azure service disruption that impacts Azure Cloud Services
Next steps
Recover from data corruption or accidental deletion
Recover from a region-wide service disruption
Using business metrics to design resilient Azure
applications
11/2/2020 • 5 minutes to read • Edit Online
The following table shows the cumulative downtime for various SLA levels.
SLA          DOWNTIME PER WEEK     DOWNTIME PER MONTH     DOWNTIME PER YEAR
99%          1.68 hours            7.2 hours              3.65 days
99.9%        10.1 minutes          43.2 minutes           8.76 hours
99.95%       5 minutes             21.6 minutes           4.38 hours
99.99%       1.01 minutes          4.32 minutes           52.56 minutes
99.999%      6 seconds             25.9 seconds           5.26 minutes
Availability metrics
Use these measures to plan for redundancy and determine customer SLAs.
Mean time to recover (MTTR) is the average time it takes to restore a component after a failure.
Mean time between failures (MTBF) is how long a component can reasonably be expected to last between
outages.
The Service Level Agreement Estimator sample shows how to calculate the SLA of your architecture.
Composite SLAs
Composite SLAs involve multiple services supporting an application, each with differing levels of availability. For
example, consider an App Service web app that writes to Azure SQL Database. At the time of this writing, these
Azure services have the following SLAs:
App Service web apps = 99.95%
SQL Database = 99.99%
What is the maximum downtime you would expect for this application? If either service fails, the whole application
fails. The probability of each service failing is independent, so the composite SLA for this application is 99.95% ×
99.99% = 99.94%. That's lower than the individual SLAs, which isn't surprising because an application that relies on
multiple services has more potential failure points.
You can improve the composite SLA by creating independent fallback paths. For example, if SQL Database is
unavailable, put transactions into a queue to be processed later.
With this design, the application is still available even if it can't connect to the database. However, it fails if the
database and the queue both fail at the same time. The expected percentage of time for a simultaneous failure is
0.0001 × 0.001 (the database's failure probability times the queue's, assuming a 99.9% SLA for the queue), so the
composite SLA for this combined path is:
Database or queue = 1.0 − (0.0001 × 0.001) = 99.99999%
The total composite SLA is:
Web app and (database or queue) = 99.95% × 99.99999% = ~99.95%
There are tradeoffs to this approach. The application logic is more complex, you are paying for the queue, and you
need to consider data consistency issues.
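A small sketch of the composite SLA arithmetic above; the 99.9% queue SLA is an assumption used to match the 0.001 figure in the example.

// Composite SLA arithmetic for the web app + database + fallback queue example.
double webApp = 0.9995;       // App Service web app SLA
double database = 0.9999;     // SQL Database SLA
double queue = 0.999;         // assumed queue SLA

double serialPath = webApp * database;                              // ≈ 0.9994
double databaseOrQueue = 1.0 - ((1.0 - database) * (1.0 - queue));  // ≈ 0.9999999
double withFallback = webApp * databaseOrQueue;                     // ≈ 0.9995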
SLAs for multiregion deployments
A common high-availability technique for multiregion deployments is to deploy the application in more than one
region and use Azure Traffic Manager to fail over if the application fails in one region.
The composite SLA for a multiregion deployment is calculated as follows:
N is the composite SLA for the application deployed in one region.
R is the number of regions where the application is deployed.
The expected chance that the application fails in all regions at the same time is ((1 − N) ^ R). For example, if the
single-region SLA is 99.95%:
The combined SLA for two regions = (1 − (1 − 0.9995) ^ 2) = 99.999975%
The combined SLA for four regions = (1 − (1 − 0.9995) ^ 4) ≈ 99.999999999994%
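The same formula expressed as a small sketch; N and R are the single-region SLA and region count described above.

using System;

// Combined SLA = 1 − (1 − N)^R
double n = 0.9995;  // single-region SLA
int r = 2;          // number of regions

double combinedSla = 1.0 - Math.Pow(1.0 - n, r);  // ≈ 0.99999975 for two regions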
The SLA for Traffic Manager is also a factor. Failing over is not instantaneous in active-passive configurations, which
can result in downtime during a failover. See Traffic Manager endpoint monitoring and failover.
Chaos engineering
12/18/2020 • 5 minutes to read • Edit Online
Chaos engineering is a methodology that helps developers attain consistent reliability by hardening services
against failures in production. A common way to introduce chaos is to deliberately inject faults that cause system
components to fail. The goal is to observe, monitor, respond to, and improve your system's reliability under adverse
circumstances.
Context
It's difficult to simulate the characteristics of a service's behavior at scale outside a production environment. The
transient nature of cloud platforms can exacerbate this difficulty. Architecting your service to expect failure is a core
approach to creating a modern service. Chaos engineering embraces the uncertainty of the production
environment and strives to anticipate rare, unpredictable, and disruptive outcomes, so that you can minimize any
potential impact on your customers.
Principles
Chaos engineering is aimed at increasing your service’s resiliency and its ability to react to failures. By conducting
experiments in a controlled environment, you can identify issues that are likely to arise during development and
deployment. During this process, be vigilant in adopting the following guidelines:
Be proactive.
Embrace failure.
Break the system.
Identify and address single points of failure early.
Install guardrails and graceful mitigations.
Minimize the blast radius.
Build immunity.
Chaos engineering should be an integral part of development team culture and an ongoing practice, not a short-
term tactical effort in response to a single outage.
Development team members are partners in the process. They must be equipped with the resources to triage
issues, implement the testability that's required for fault injection, and drive the necessary product changes.
Goals
Familiarize team members with monitoring tools.
Recognize outage patterns.
Learn how to assess the impact.
Determine the root cause and mitigate accordingly.
Practice log analysis.
Overall method
1. Start with a hypothesis.
2. Measure baseline behavior.
3. Inject a fault or faults.
4. Monitor the resulting behavior.
5. Document the process and observations.
6. Identify and act on the result.
Periodically validate your process, architecture choices, and code. By conducting fault-injection experiments, you
can confirm that monitoring is in place and alerts are set up, the directly responsible individual (DRI) process is
effective, and your documentation and investigation processes are up to date. Keep in mind a few key
considerations:
Challenge system assumptions.
Validate change (topology, platform, resources).
Use service-level agreement (SLA) buffers.
Use live-site outages as opportunities.
Best practices
Shift left
Shift-left testing means experiment early, experiment often. Incorporate fault-injection configurations and create
resiliency-validation gates during the development stages and in the deployment pipeline.
Shift right
Shift-right testing means that you verify that the service is resilient where it counts in a pre-production or
production environment with actual customer load. Adopt a proactive approach as opposed to reacting to failures.
Be a part of determining and controlling requirements for the blast radius.
Blast radius
Stop the experiment when it goes beyond scope. Unknown results are an expected outcome of chaos experiments.
Strive to achieve balance between collecting substantial result data and affecting as few production users as
possible. For an example of this principle in practice, see the Bulkhead pattern article.
Error budget testing
Establish an error budget as an investment in chaos and fault injection. Your error budget is the difference between
achieving 100% of the service-level objective (SLO) and achieving the agreed-upon SLO.
Considerations
The following sections discuss additional considerations about chaos engineering, based on its application inside
Azure.
Identify faults that are relevant to the development team
Work closely with the development teams to ensure the relevance of the injected failures. Use past incidents or
issues as a guide. Examine dependencies and evaluate the results when those dependencies are removed.
An external team can't hypothesize faults for your team. A study of failures from an artificial source might be
relevant to your team's purposes, but the effort must be justified.
Inject faults in a way that accurately reflects production failures
Simulate production failures. Treat injected faults in the same way that you would treat production-level faults.
Enforcing a tighter limit on the blast radius will enable you to simulate a production environment. Each fault-
injection effort must be accompanied by tooling that's designed to inject the types of faults that are relevant to
your team's scenarios. Here are two basic ways:
Inject faults in a non-production environment, such as Canary or Test In Production (TIP).
Partition the production service or environment.
Halt all faults and roll back the state to its last-known good configuration if the state seems severe.
Build confidence incrementally
Start by hardening the core, and then expand out in layers. At each point, lock in progress with automated
regression tests. Each team should have a long-term strategy based on a progression that makes sense for the
team's circumstances.
By applying the shift left strategy, you can help ensure that any obstacles to developer usage are removed early and
the testing results are actionable.
The process must be very low tax. That is, the process must make it easy for developers to understand what
happened and to fix the issues. The effort must fit easily into their normal workflow, not burden them with one-off
special activities.
Faults
The following table lists faults that you can apply to inject chaos. The list represents commonly injected faults and
isn't intended to be exhaustive.
Resource pressure
CPU
Memory
Physical
Virtual
Bad checksum
Hard disk
Capacity
Read
Write
Availability
Data corruption
Read / Write Latency
Network
Layers
Transport (TCP/UDP)
Application layer (HTTP)
Types
Disconnect
Latency
Alter response codes (HTTP)
Packet reorder / loss (TCP/UDP)
# of connections (active / passive)
DOS attack
Filters
Domain / IP / Subnet
URL path
Port / Protocol
DNS Host Name resolution
Process
Stop / Kill
Restart
Stop service
Start
Crash
Hang
Virtual Machine
Stop
Restart
BSOD
Change date
Re-image
Live Migration
Platform
Quorum loss
Data loss
Move primary node
Remove replica
Functions
Latency
Exceptions
Status codes
Intercept / Denylist calls
Disk capacity
Application specific
Intercept / Re-route calls
No access to service code
Hardware
Machine
Storage
Network devices
Rack
UPS
Datacenter
Data Management for Reliability
12/18/2020 • 4 minutes to read • Edit Online
Database resiliency
Azure SQL Database
SQL Database automatically performs a combination of full database backups weekly, differential database
backups hourly, and transaction log backups every 5 to 10 minutes to protect your business from data loss. Use
point-in-time restore to restore a database to an earlier time. For more information, see:
Recover an Azure SQL database using automated database backups
Overview of business continuity with Azure SQL Database
SQL Server on VMs
For SQL Server running on VMs, there are two options: traditional backups and log shipping. Traditional backups
enable you to restore to a specific point in time, but the recovery process is slow. Restoring traditional backups
requires starting with an initial full backup, and then applying any backups taken after that. The second option is to
configure a log shipping session to delay the restore of log backups (for example, by two hours). This provides a
window to recover from errors made on the primary.
Azure Cosmos DB
Azure Cosmos DB automatically takes backups at regular intervals. Backups are stored separately in another
storage service, and those backups are globally replicated for resiliency against regional disasters. If you
accidentally delete your database or collection, you can file a support ticket or call Azure support to restore the
data from the last automatic backup. For more information, see Automatic online backup and restore with Azure
Cosmos DB.
Azure Database for MySQL, Azure Database for PostgreSQL
When using Azure Database for MySQL or Azure Database for PostgreSQL, the database service automatically
makes a backup of the service every five minutes. Using this automatic backup feature, you can restore the server
and all its databases to a new server at an earlier point in time. For more information, see:
How to back up and restore a server in Azure Database for MySQL by using the Azure portal
How to backup and restore a server in Azure Database for PostgreSQL using the Azure portal
Storage resiliency
Azure Storage provides data resiliency through automated replicas. However, this does not prevent application
code or users from corrupting data, whether accidentally or maliciously. Maintaining data fidelity in the face of
application or user error requires more advanced techniques, such as copying the data to a secondary storage
location with an audit log.
Block blobs. Create a point-in-time snapshot of each block blob. For more information, see Creating a
Snapshot of a Blob. For each snapshot, you are only charged for the storage required to store the differences
within the blob since the last snapshot state. The snapshots are dependent on the existence of the original
blob they are based on, so a copy operation to another blob or even another storage account is advisable.
This ensures that backup data is properly protected against accidental deletion. You can use AzCopy or Azure
PowerShell to copy the blobs to another storage account.
Files. Use share snapshots, or use AzCopy or PowerShell to copy your files to another storage account.
Tables. Use AzCopy to export the table data into another storage account in another region.
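For the block blob case above, a sketch using the classic Microsoft.Azure.Storage.Blob client might snapshot the blob and then copy it to another container or storage account; both blob references are assumed to exist already.

using System.Threading.Tasks;
using Microsoft.Azure.Storage.Blob;

public static class BlobBackupExample
{
    public static async Task SnapshotAndCopyAsync(CloudBlockBlob sourceBlob, CloudBlockBlob backupBlob)
    {
        // Point-in-time snapshot of the source blob.
        await sourceBlob.CreateSnapshotAsync();

        // Server-side copy to the backup location, so the backup no longer depends
        // on the original blob.
        await backupBlob.StartCopyAsync(sourceBlob);
    }
}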
Samples related to storage resiliency are here. The scripts perform these tasks:
Deploy a storage account and blob container by using an ARM template.
Copy a file into the blob container.
Create a blob snapshot.
Create a share snapshot.
Call AzCopy for Table storage.
Monitoring application health for reliability in Azure
12/18/2020 • 8 minutes to read • Edit Online
Monitoring and diagnostics are crucial for resiliency. If something fails, you need to know that it failed, when it
failed — and why.
Monitoring is not the same as failure detection. For example, your application might detect a transient error and
retry, avoiding downtime. But it should also log the retry operation so that you can monitor the error rate to get an
overall picture of application health.
Think of the monitoring and diagnostics process as a pipeline with four distinct stages: Instrumentation, collection
and storage, analysis and diagnosis, and visualization and alerts.
TIP
Monitor and manage the progress of long-running workflows by implementing a pattern such as Scheduler Agent
Supervisor.
Test Monitoring
Include monitoring systems in your test plan. Automated failover and failback systems depend on the correct
functioning of monitoring and instrumentation. Dashboards to visualize system health and operator alerts also
depend on having accurate monitoring and instrumentation. If these elements fail, miss critical information, or
report inaccurate data, an operator might not realize that the system is unhealthy or failing.
Instrumentation
It's not practical to monitor your application directly, so instrumentation is key. A large-scale distributed system
might run on dozens of virtual machines (VMs), which are added and removed over time. Likewise, a cloud
application might use a number of data stores and a single user action might span multiple subsystems.
Provide rich instrumentation:
For failures that are likely but have not yet occurred, provide enough data to determine the cause, mitigate the
situation, and ensure that the system remains available.
For failures that have already occurred, the application should return an appropriate error message to the user
but should attempt to continue running, albeit with reduced functionality.
Monitoring systems should capture comprehensive details so that applications can be restored efficiently and, if
necessary, designers and developers can modify the system to prevent the situation from recurring.
The raw data for monitoring can come from a variety of sources, including:
Application logs, such as those produced by the Azure Application Insights service.
Operating system performance metrics collected by Azure monitoring agents.
Azure resources, including metrics collected by Azure Monitor.
Azure Service Health, which offers a dashboard to help you track active events.
Azure AD logs built into the Azure platform.
Most Azure services have metrics and diagnostics that you can configure to analyze and determine the cause of
problems. To learn more, see Monitoring data collected by Azure Monitor.
The HealthProbesSample sample shows how to set up health probes. It provides an ARM template to set up
the infrastructure. A load balancer accepts public requests and load balances them across a set of virtual
machines. The health probe is set up to check the service's /Health path.
Application logs
Application logs are an important source of diagnostics data. To gain insight when you need it most, follow best
practices for application logging.
Log data in the production environment
Capture robust telemetry data while the application is running in the production environment, so you have
sufficient information to diagnose the cause of issues in the production state.
Log events at service boundaries
Include a correlation ID that flows across service boundaries. If a transaction flows through multiple services and
one of them fails, the correlation ID helps you track requests across your application and pinpoints why the
transaction failed.
Use semantic (structured) logging
With structured logs, it's easier to automate the consumption and analysis of the log data, which is especially
important at cloud scale. Generally, we recommend storing Azure resources metrics and diagnostics data in a Log
Analytics workspace rather than in a storage account. This way, you can use Kusto queries to obtain the data you
want quickly and in a structured format. You can also use Azure Monitor APIs and Azure Log Analytics APIs.
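A minimal sketch of structured logging with a correlation ID, assuming Microsoft.Extensions.Logging; the property names and parameters are illustrative.

using System;
using Microsoft.Extensions.Logging;

public static class StructuredLoggingExample
{
    public static void LogOrderProcessed(ILogger logger, Guid correlationId, string orderId, long elapsedMs)
    {
        // Named placeholders become structured properties, so the values can be queried
        // later (for example, with Kusto in a Log Analytics workspace).
        logger.LogInformation(
            "Processed order {OrderId} in {ElapsedMs} ms (correlation {CorrelationId})",
            orderId, elapsedMs, correlationId);
    }
}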
Use asynchronous logging
Synchronous logging operations sometimes block your application code, causing requests to back up as logs are
written. Use asynchronous logging to preserve availability during application logging.
Separate application logging from auditing
Audit records are commonly maintained for compliance or regulatory requirements and must be complete. To
avoid dropped transactions, maintain audit logs separately from diagnostic logs.
Azure is divided physically and logically into units called regions. A region consists of one or more datacenters in
close proximity.
Under rare circumstances, it is possible that facilities in an entire region can become inaccessible, for example due
to network failures. Or facilities can be lost entirely, for example due to a natural disaster. This section explains the
capabilities of Azure for creating applications that are distributed across regions. Such distribution helps to
minimize the possibility that a failure in one region could affect other regions.
Cloud services
Resource management
You can distribute compute instances across regions by creating a separate cloud service in each target region, and
then publishing the deployment package to each cloud service. However, distributing traffic across cloud services
in different regions must be implemented by the application developer or with a traffic management service.
Determining the number of spare role instances to deploy in advance for disaster recovery is an important aspect
of capacity planning. Having a full-scale secondary deployment ensures that capacity is already available when
needed; however, this effectively doubles the cost. A common pattern is to have a small, secondary deployment,
just large enough to run critical services. This small secondary deployment is a good idea, both to reserve capacity,
and for testing the configuration of the secondary environment.
NOTE
The subscription quota is not a capacity guarantee. The quota is simply a credit limit. To guarantee capacity, the required
number of roles must be defined in the service model, and the roles must be deployed.
Load Balancing
To load balance traffic across regions requires a traffic management solution. Azure provides Azure Traffic
Manager. You can also take advantage of third-party services that provide similar traffic management capabilities.
Strategies
Many alternative strategies are available for implementing distributed compute across regions. These must be
tailored to the specific business requirements and circumstances of the application. At a high level, the approaches
can be divided into the following categories:
Redeploy on disaster: In this approach, the application is redeployed from scratch at the time of disaster.
This is appropriate for non-critical applications that don’t require a guaranteed recovery time.
Warm Spare (Active/Passive): A secondary hosted service is created in an alternate region, and roles are
deployed to guarantee minimal capacity; however, the roles don’t receive production traffic. This approach
is useful for applications that have not been designed to distribute traffic across regions.
Hot Spare (Active/Active): The application is designed to receive production load in multiple regions. The
cloud services in each region might be configured for higher capacity than required for disaster recovery
purposes. Alternatively, the cloud services might scale out as necessary at the time of a disaster and fail
over. This approach requires substantial investment in application design, but it has significant benefits.
These include low and guaranteed recovery time, continuous testing of all recovery locations, and efficient
usage of capacity.
A complete discussion of distributed design is outside the scope of this document. For more information, see
Disaster Recovery and High Availability for Azure Applications.
Virtual machines
Recovery of infrastructure as a service (IaaS) virtual machines (VMs) is similar to platform as a service (PaaS)
compute recovery in many respects. There are important differences, however, due to the fact that an IaaS VM
consists of both the VM and the VM disk.
Use Azure Backup to create cross-region backups that are application consistent. Azure Backup
enables customers to create application-consistent backups across multiple VM disks, and supports
replication of backups across regions. You can do this by choosing to geo-replicate the backup vault at the
time of creation. Replication of the backup vault must be configured at the time of creation. It can't be set
later. If a region is lost, Microsoft will make the backups available to customers. Customers will be able to
restore to any of their configured restore points.
Separate the data disk from the operating system disk. An important consideration for IaaS VMs is
that you cannot change the operating system disk without re-creating the VM. This is not a problem if your
recovery strategy is to redeploy after disaster. However, it might be a problem if you are using the Warm
Spare approach to reserve capacity. To implement this properly, you must have the correct operating
system disk deployed to both the primary and secondary locations, and the application data must be stored
on a separate drive. If possible, use a standard operating system configuration that can be provided on both
locations. After a failover, you must then attach the data drive to your existing IaaS VMs in the secondary
DC. Use AzCopy to copy snapshots of the data disk(s) to a remote site.
Be aware of potential consistency issues after a geo-failover of multiple VM disks. VM disks are
implemented as Azure Storage blobs, and have the same geo-replication characteristic. Unless Azure
Backup is used, there are no guarantees of consistency across disks, because geo-replication is
asynchronous and replicates independently. Individual VM disks are guaranteed to be in a crash consistent
state after a geo-failover, but not consistent across disks. This could cause problems in some cases (for
example, in the case of disk striping).
Use Azure Site Recovery to replicate applications across regions. With Azure Site Recovery,
customers don't need to worry about separating data disks from operating system disks or about potential
consistency issues. Azure Site Recovery replicates workloads running on physical and virtual machines
from a primary site (either on-premises or in Azure) to a secondary location (in Azure). When an outage
occurs at the customer's primary site, a failover can be triggered to quickly return the customer to an
operational state. After the primary location is restored, customers can then fail back. They can easily
replicate using recovery points with application-consistent snapshots. These snapshots capture disk data, all
in-memory data, and all in-process transactions. Azure Site Recovery allows customers to keep recovery
time objectives (RTO) and recovery point objectives (RPO) within organizational limits. Customers can also
easily run disaster recovery drills without affecting applications in production. Using recovery plans,
customers can sequence the failover and recovery of multitier applications running on multiple VMs. They
can group machines together in a recovery plan, and optionally add scripts and manual actions. Recovery
plans can be integrated with Azure Automation runbooks.
Storage
Recovery by using geo-redundant storage of blob, table, queue, and VM disk storage
In Azure, blobs, tables, queues, and VM disks are all geo-replicated by default. This is referred to as geo-redundant
storage (GRS). GRS replicates storage data to a paired datacenter located hundreds of miles apart within a specific
geographic region. GRS is designed to provide additional durability in case there is a major datacenter disaster.
Microsoft controls when failover occurs, and failover is limited to major disasters in which the original primary
location is deemed unrecoverable in a reasonable amount of time. Under some scenarios, this can be several days.
Data is typically replicated within a few minutes, although synchronization interval is not yet covered by a service
level agreement.
If a geo-failover occurs, there will be no change to how the account is accessed (the URL and account key will not
change). The storage account will, however, be in a different region after failover. This could impact applications
that require regional affinity with their storage account. Even for services and applications that do not require a
storage account in the same datacenter, the cross-datacenter latency and bandwidth charges might be compelling
reasons to move traffic to the failover region temporarily. This could factor into an overall disaster recovery
strategy.
In addition to automatic failover provided by GRS, Azure has introduced a service that gives you read access to the
copy of your data in the secondary storage location. This is called read-access geo-redundant storage (RA-GRS).
For more information about both GRS and RA-GRS storage, see Azure Storage replication.
Geo-replication region mappings
It is important to know where your data is geo-replicated, in order to know where to deploy the other instances of
your data that require regional affinity with your storage. For more information, see Azure Paired Regions.
Geo-replication pricing
Geo-replication is included in current pricing for Azure Storage. This is called geo-redundant storage (GRS). If you
do not want your data geo-replicated, you can disable geo-replication for your account. This is called locally
redundant storage (LRS), and it is charged at a discounted price compared to GRS.
Determining if a geo-failover has occurred
If a geo-failover occurs, it will be posted to the Azure Service Health Dashboard. Applications can also implement an
automated means of detecting a failover by monitoring the geo-region for their storage account. Detection can be
used to trigger other recovery operations, such as activating compute resources in the geo-region that the
storage moved to. You can query for this from the service management API, by using Get Storage
Account Properties. The relevant properties are:
<GeoPrimaryRegion>primary-region</GeoPrimaryRegion>
<StatusOfPrimary>[Available|Unavailable]</StatusOfPrimary>
<LastGeoFailoverTime>DateTime</LastGeoFailoverTime>
<GeoSecondaryRegion>secondary-region</GeoSecondaryRegion>
<StatusOfSecondary>[Available|Unavailable]</StatusOfSecondary>
Database
SQL Database
Azure SQL Database provides two types of recovery: geo-restore and active geo-replication.
Geo-restore
Geo-restore is available with Basic, Standard, and Premium databases. It provides the default recovery option
when the database is unavailable because of an incident in the region where your database is hosted. Similar to
point-in-time restore, geo-restore relies on database backups in geo-redundant Azure storage. It restores from the
geo-replicated backup copy, and therefore is resilient to the storage outages in the primary region. For more
information, see Restore an Azure SQL Database or failover to a secondary.
Active geo-replication
Active geo-replication is available for all database tiers. It’s designed for applications that have more aggressive
recovery requirements than geo-restore can offer. Using active geo-replication, you can create up to four readable
secondaries on servers in different regions. You can initiate failover to any of the secondaries. In addition, active
geo-replication can be used to support the application upgrade or relocation scenarios, as well as load balancing
for read-only workloads. For details, see Configure active geo-replication for Azure SQL Database and initiate
failover. Refer to Designing globally available services using Azure SQL Database and Managing rolling upgrades
of cloud applications by using SQL Database active geo-replication for details on how to design and implement
applications and application upgrades without downtime.
SQL Server on Azure Virtual Machines
A variety of options are available for recovery and high availability for SQL Server 2012 (and later) running in
Azure Virtual Machines. For more information, see High availability and disaster recovery for SQL Server in Azure
Virtual Machines.
NOTE
In some cases, these steps can help to mitigate a service-specific outage rather than an entire datacenter event. From the
application perspective, a service-specific outage might be just as limiting and would require temporarily migrating the
service to an alternate Azure region.
Service Bus
Azure Service Bus uses a unique namespace that does not span Azure regions. So the first requirement is to set up
the necessary service bus namespaces in the alternate region. However, there are also considerations for the
durability of the queued messages. There are several strategies for replicating messages across Azure regions. For
the details on these replication strategies and other disaster recovery strategies, see Best practices for insulating
applications against Service Bus outages and disasters.
App Service
To migrate an Azure App Service application, such as Web Apps or Mobile Apps, to a secondary Azure region, you
must have a backup of the website available for publishing. If the outage does not involve the entire Azure
datacenter, it might be possible to use FTP to download a recent backup of the site content. Then create a new app
in the alternate region, unless you have previously done this to reserve capacity. Publish the site to the new region,
and make any necessary configuration changes. These changes could include database connection strings or other
region-specific settings. If necessary, add the site’s SSL certificate and change the DNS CNAME record so that the
custom domain name points to the redeployed Azure Web App URL.
HDInsight
The data associated with HDInsight is stored by default in Azure Blob Storage. HDInsight requires that a Hadoop
cluster processing MapReduce jobs must be colocated in the same region as the storage account that contains the
data being analyzed. Provided you use the geo-replication feature available to Azure Storage, you can access your
data in the secondary region where the data was replicated if for some reason the primary region is no longer
available. You can create a new Hadoop cluster in the region where the data has been replicated and continue
processing it.
SQL Reporting
At this time, recovering from the loss of an Azure region requires multiple SQL Reporting instances in different
Azure regions. These SQL Reporting instances should access the same data, and that data should have its own
recovery plan in the event of a disaster. You can also maintain external backup copies of the RDL file for each
report.
Media Services
Azure Media Services has a different recovery approach for encoding and streaming. Typically, streaming is more
critical during a regional outage. To prepare for this, you should have a Media Services account in two different
Azure regions. The encoded content should be located in both regions. During a failure, you can redirect the
streaming traffic to the alternate region. Encoding can be performed in any Azure region. If encoding is time-
sensitive, for example during live event processing, you must be prepared to submit jobs to an alternate
datacenter during failures.
Virtual network
Configuration files provide the quickest way to set up a virtual network in an alternate Azure region. After
configuring the virtual network in the primary Azure region, export the virtual network settings for the current
network to a network configuration file. If an outage occurs in the primary region, restore the virtual network from
the stored configuration file. Then configure other cloud services, virtual machines, or cross-premises settings to
work with the new virtual network.
To test resiliency, you should verify how the end-to-end workload performs under intermittent failure conditions.
Run tests in production using both synthetic and real user data. Test and production are rarely identical, so it's
important to validate your application in production using a blue-green or canary deployment. This way, you're
testing the application under real conditions, so you can be sure that it will function as expected when fully
deployed.
As part of your test plan, include:
Chaos engineering
Automated predeployment testing
Fault injection testing
Peak load testing
Disaster recovery testing
Third-party service testing
Chaos engineering
Harden services against failures in production to attain consistent reliability by adopting chaos engineering.
Simulation testing
Simulation testing involves creating small, real-life situations. Simulations demonstrate the effectiveness of the
solutions in the recovery plan and highlight any issues that weren't adequately addressed.
As you perform simulation testing, follow best practices:
Conduct simulations in a manner that doesn't disrupt actual business but feels like a real situation.
Make sure that simulated scenarios are completely controllable. If the recovery plan seems to be failing, you can
restore the situation back to normal without causing damage.
Inform management about when and how the simulation exercises will be conducted. Your plan should detail
the time frame and the resources affected during the simulation.
Resiliency is the ability of a system to recover from failures and continue to function. Every technology has its own
particular failure modes, which you must consider when designing and implementing your application. Use this
checklist to review the resiliency considerations for specific Azure services. For more information about designing
resilient applications, see Design reliable Azure applications.
App Service
Use Standard or Premium tier. These tiers support staging slots and automated backups. For more information,
see Azure App Service plans in-depth overview
Avoid scaling up or down. Instead, select a tier and instance size that meet your performance requirements
under typical load, and then scale out the instances to handle changes in traffic volume. Scaling up and down may
trigger an application restart.
Store configuration as app settings. Use app settings to hold configuration values. Define the
settings in your Resource Manager templates, or using PowerShell, so that you can apply them as part of an
automated deployment / update process, which is more reliable. For more information, see Configure web apps in
Azure App Service.
Create separate App Service plans for production and test. Don't use slots on your production deployment
for testing. All apps within the same App Service plan share the same VM instances. If you put production and test
deployments in the same plan, it can negatively affect the production deployment. For example, load tests might
degrade the live production site. By putting test deployments into a separate plan, you isolate them from the
production version.
Separate web apps from web APIs. If your solution has both a web front end and a web API, consider
decomposing them into separate App Service apps. This design makes it easier to decompose the solution by
workload. You can run the web app and the API in separate App Service plans, so they can be scaled independently.
If you don't need that level of scalability at first, you can deploy the apps into the same plan, and move them into
separate plans later, if needed.
Avoid using the App Service backup feature to back up Azure SQL databases. Instead, use SQL Database
automated backups. App Service backup exports the database to a SQL BACPAC file, which costs DTUs.
Deploy to a staging slot. Create a deployment slot for staging. Deploy application updates to the staging slot,
and verify the deployment before swapping it into production. This reduces the chance of a bad update in
production. It also ensures that all instances are warmed up before being swapped into production. Many
applications have a significant warmup and cold-start time. For more information, see Set up staging environments
for web apps in Azure App Service.
Create a deployment slot to hold the last-known-good (LKG) deployment. When you deploy an update to
production, move the previous production deployment into the LKG slot. This makes it easier to roll back a bad
deployment. If you discover a problem later, you can quickly revert to the LKG version. For more information, see
Basic web application.
Enable diagnostics logging, including application logging and web server logging. Logging is important for
monitoring and diagnostics. See Enable diagnostics logging for web apps in Azure App Service.
Log to blob storage. This makes it easier to collect and analyze the data.
Create a separate storage account for logs. Don't use the same storage account for logs and application data.
This helps to prevent logging from reducing application performance.
Monitor performance. Use a performance monitoring service such as New Relic or Application Insights to
monitor application performance and behavior under load. Performance monitoring gives you real-time insight
into the application. It enables you to diagnose issues and perform root-cause analysis of failures.
Application Gateway
Provision at least two instances. Deploy Application Gateway with at least two instances. A single instance is a
single point of failure. Use two or more instances for redundancy and scalability. In order to qualify for the SLA, you
must provision two or more medium or larger instances.
Cosmos DB
Replicate the database across regions. Cosmos DB allows you to associate any number of Azure regions with
a Cosmos DB database account. A Cosmos DB database can have one write region and multiple read regions. If
there is a failure in the write region, you can read from another replica. The Client SDK handles this automatically.
You can also fail over the write region to another region. For more information, see How to distribute data globally
with Azure Cosmos DB.
Event Hubs
Use checkpoints. An event consumer should write its current position to persistent storage at some predefined
interval. That way, if the consumer experiences a fault (for example, the consumer crashes, or the host fails), then a
new instance can resume reading the stream from the last recorded position. For more information, see Event
consumers.
Handle duplicate messages. If an event consumer fails, message processing is resumed from the last recorded
checkpoint. Any messages that were already processed after the last checkpoint will be processed again. Therefore,
your message processing logic must be idempotent, or the application must be able to deduplicate messages.
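A rough sketch of deduplicating events replayed after a checkpoint; a durable store keyed by a unique event ID would be used in practice, and the in-memory dictionary and ProcessAsync method here are only for illustration.

using System.Collections.Concurrent;
using System.Threading.Tasks;

public class DeduplicatingConsumer
{
    private readonly ConcurrentDictionary<string, bool> _processed = new ConcurrentDictionary<string, bool>();

    public Task HandleAsync(string eventId, string payload)
    {
        if (!_processed.TryAdd(eventId, true))
        {
            // Already handled since the last checkpoint; skip the duplicate.
            return Task.CompletedTask;
        }
        return ProcessAsync(payload);
    }

    private Task ProcessAsync(string payload) => Task.CompletedTask; // placeholder for real work
}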
Handle exceptions. An event consumer typically processes a batch of messages in a loop. You should handle
exceptions within this processing loop to avoid losing an entire batch of messages if a single message causes an
exception.
Use a dead-letter queue. If processing a message results in a nontransient failure, put the message onto a dead-
letter queue, so that you can track the status. Depending on the scenario, you might retry the message later, apply a
compensating transaction, or take some other action. Note that Event Hubs does not have any built-in dead-letter
queue functionality. You can use Azure Queue Storage or Service Bus to implement a dead-letter queue, or use
Azure Functions or some other eventing mechanism.
Implement disaster recovery by failing over to a secondary Event Hubs namespace. For more
information, see Azure Event Hubs Geo-disaster recovery.
Search
Provision more than one replica. Use at least two replicas for read high-availability, or three for read-write
high-availability.
Configure indexers for multi-region deployments. If you have a multi-region deployment, consider your
options for continuity in indexing.
If the data source is geo-replicated, you should generally point each indexer of each regional Azure Search
service to its local data source replica. However, that approach is not recommended for large datasets stored
in Azure SQL Database. The reason is that Azure Search cannot perform incremental indexing from
secondary SQL Database replicas, only from primary replicas. Instead, point all indexers to the primary
replica. After a failover, point the Azure Search indexers at the new primary replica.
If the data source is not geo-replicated, point multiple indexers at the same data source, so that Azure Search
services in multiple regions continuously and independently index from the data source. For more
information, see Azure Search performance and optimization considerations.
Service Bus
Use Premium tier for production workloads. Service Bus Premium Messaging provides dedicated and
reserved processing resources, and memory capacity to support predictable performance and throughput.
Premium Messaging tier also gives you access to new features that are available only to premium customers at
first. You can decide the number of messaging units based on expected workloads.
Handle duplicate messages. If a publisher fails immediately after sending a message, or experiences network or
system issues, it may erroneously fail to record that the message was delivered, and may send the same message
to the system twice. Service Bus can handle this issue by enabling duplicate detection. For more information, see
Duplicate detection.
Handle exceptions. Messaging APIs generate exceptions when a user error, configuration error, or other error
occurs. The client code (senders and receivers) should handle these exceptions in their code. This is especially
important in batch processing, where exception handling can be used to avoid losing an entire batch of messages.
For more information, see Service Bus messaging exceptions.
Retry policy. Service Bus allows you to pick the best retry policy for your applications. The default policy
allows a maximum of 9 retry attempts with a 30-second wait, but this can be adjusted further. For more information,
see Retry policy – Service Bus.
Use a dead-letter queue. If a message cannot be processed or delivered to any receiver after multiple retries, it
is moved to a dead letter queue. Implement a process to read messages from the dead letter queue, inspect them,
and remediate the problem. Depending on the scenario, you might retry the message as-is, make changes and
retry, or discard the message. For more information, see Overview of Service Bus dead-letter queues.
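As a sketch of the dead-letter pattern described above, the following assumes the azure-servicebus Python package (v7); the connection string, queue name, and the handle() function are placeholders for your own processing logic.
```python
# Sketch: receive from a queue, dead-letter messages that fail non-transiently,
# and inspect the dead-letter queue, using azure-servicebus (v7).
from azure.servicebus import ServiceBusClient, ServiceBusSubQueue

CONN_STR = "<service-bus-connection-string>"
QUEUE = "orders"

def handle(message):
    # Placeholder for idempotent processing; raise ValueError for a non-transient failure.
    print(str(message))

with ServiceBusClient.from_connection_string(CONN_STR) as client:
    with client.get_queue_receiver(QUEUE, max_wait_time=5) as receiver:
        for message in receiver:
            try:
                handle(message)
                receiver.complete_message(message)        # remove from the queue
            except ValueError as err:                     # non-transient failure
                receiver.dead_letter_message(
                    message, reason="ProcessingError", error_description=str(err)
                )

    # Separately, read dead-lettered messages, inspect them, and remediate.
    with client.get_queue_receiver(QUEUE, sub_queue=ServiceBusSubQueue.DEAD_LETTER) as dlq:
        for message in dlq.receive_messages(max_message_count=10, max_wait_time=5):
            print(message.dead_letter_reason, str(message))
            dlq.complete_message(message)
```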
Use Geo-Disaster Recovery. Geo-disaster recovery ensures that data processing continues to operate in a
different region or datacenter if an entire Azure region or datacenter becomes unavailable due to a disaster. For
more information, see Azure Service Bus Geo-disaster recovery.
Storage
For application data, use read-access geo-redundant storage (RA-GRS). RA-GRS storage replicates the
data to a secondary region, and provides read-only access from the secondary region. If there is a storage outage
in the primary region, the application can read the data from the secondary region. For more information, see
Azure Storage replication.
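A minimal sketch of falling back to the RA-GRS secondary endpoint for reads, assuming the azure-storage-blob and azure-identity Python packages; the storage account, container, and blob names are placeholders, and the secondary endpoint follows the standard account-name-secondary host-name convention.
```python
# Sketch: read from the RA-GRS secondary endpoint when the primary region is
# unavailable. Account and blob names are placeholders.
from azure.core.exceptions import AzureError
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobClient

credential = DefaultAzureCredential()
ACCOUNT = "mystorageaccount"

def read_blob(container: str, blob: str) -> bytes:
    primary = f"https://{ACCOUNT}.blob.core.windows.net"
    secondary = f"https://{ACCOUNT}-secondary.blob.core.windows.net"
    for endpoint in (primary, secondary):
        try:
            client = BlobClient(account_url=endpoint, container_name=container,
                                blob_name=blob, credential=credential)
            return client.download_blob().readall()
        except AzureError:
            continue  # primary outage: retry the read against the secondary endpoint
    raise RuntimeError("blob unavailable from both primary and secondary endpoints")
```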
For VM disks, use managed disks. Managed disks provide better reliability for VMs in an availability set,
because the disks are sufficiently isolated from each other to avoid single points of failure. Also, managed disks
aren't subject to the IOPS limits of VHDs created in a storage account. For more information, see Manage the
availability of Windows virtual machines in Azure.
For Queue storage, create a backup queue in another region. For Queue storage, a read-only replica has
limited use, because you can't queue or dequeue items. Instead, create a backup queue in a storage account in
another region. If there is a storage outage, the application can use the backup queue, until the primary region
becomes available again. That way, the application can still process new requests.
SQL Database
Use Standard or Premium tier. These tiers provide a longer point-in-time restore period (35 days). For more
information, see SQL Database options and performance.
Enable SQL Database auditing. Auditing can be used to diagnose malicious attacks or human error. For more
information, see Get started with SQL database auditing.
Use Active Geo-Replication. Use Active Geo-Replication to create a readable secondary in a different region. If
your primary database fails, or simply needs to be taken offline, perform a manual failover to the secondary
database. Until you fail over, the secondary database remains read-only. For more information, see SQL Database
Active Geo-Replication.
Use sharding. Consider using sharding to partition the database horizontally. Sharding can provide fault isolation.
For more information, see Scaling out with Azure SQL Database.
Use point-in-time restore to recover from human error. Point-in-time restore returns your database to an
earlier point in time. For more information, see Recover an Azure SQL database using automated database
backups.
Use geo-restore to recover from a service outage. Geo-restore restores a database from a geo-redundant
backup. For more information, see Recover an Azure SQL database using automated database backups.
Traffic Manager
Perform manual failback. After a Traffic Manager failover, perform manual failback, rather than automatically
failing back. Before failing back, verify that all application subsystems are healthy. Otherwise, you can create a
situation where the application flips back and forth between datacenters. For more information, see Run VMs in
multiple regions for high availability.
Create a health probe endpoint. Create a custom endpoint that reports on the overall health of the application.
This enables Traffic Manager to fail over if any critical path fails, not just the front end. The endpoint should return
an HTTP error code if any critical dependency is unhealthy or unreachable. Don't report errors for non-critical
services, however. Otherwise, the health probe might trigger failover when it's not needed, creating false positives.
For more information, see Traffic Manager endpoint monitoring and failover.
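The following is a minimal sketch of such a health probe endpoint using Flask; the dependency checks are hypothetical stand-ins for your application's critical and non-critical dependencies.
```python
# Sketch: a custom health endpoint that Traffic Manager (or a load balancer)
# can probe. It fails only when a critical dependency is down, so that
# non-critical issues don't trigger an unnecessary failover.
from flask import Flask

app = Flask(__name__)

def database_is_healthy() -> bool:
    # Hypothetical check; replace with a real ping against your database.
    return True

def cache_is_healthy() -> bool:
    # Non-critical dependency; failures are reported but don't fail the probe.
    return True

@app.route("/health")
def health():
    if not database_is_healthy():
        return "critical dependency unavailable", 503
    body = "ok" if cache_is_healthy() else "degraded (non-critical)"
    return body, 200

if __name__ == "__main__":
    app.run(port=8080)
```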
Virtual Machines
Avoid running a production workload on a single VM. A single VM deployment is not resilient to planned or
unplanned maintenance. Instead, put multiple VMs in an availability set or virtual machine scale set, with a load
balancer in front.
Specify an availability set when you provision the VM. Currently, there is no way to add a VM to an
availability set after the VM is provisioned. When you add a new VM to an existing availability set, make sure to
create a NIC for the VM, and add the NIC to the back-end address pool on the load balancer. Otherwise, the load
balancer won't route network traffic to that VM.
Put each application tier into a separate Availability Set. In an N-tier application, don't put VMs from
different tiers into the same availability set. VMs in an availability set are placed across fault domains (FDs) and
update domains (UDs). However, to get the redundancy benefit of FDs and UDs, every VM in the availability set must
be able to handle the same client requests.
Replicate VMs using Azure Site Recovery. When you replicate Azure VMs using Site Recovery, all the VM
disks are continuously replicated to the target region asynchronously. The recovery points are created every few
minutes. This gives you a Recovery Point Objective (RPO) in the order of minutes. You can conduct disaster
recovery drills as many times as you want, without affecting the production application or the ongoing replication.
For more information, see Run a disaster recovery drill to Azure.
Choose the right VM size based on performance requirements. When moving an existing workload to
Azure, start with the VM size that's the closest match to your on-premises servers. Then measure the performance
of your actual workload with respect to CPU, memory, and disk IOPS, and adjust the size if needed. This helps to
ensure the application behaves as expected in a cloud environment. Also, if you need multiple NICs, be aware of the
NIC limit for each size.
Use managed disks for VHDs. Managed disks provide better reliability for VMs in an availability set, because
the disks are sufficiently isolated from each other to avoid single points of failure. Also, managed disks aren't
subject to the IOPS limits of VHDs created in a storage account. For more information, see Manage the availability
of Windows virtual machines in Azure.
Install applications on a data disk, not the OS disk. Otherwise, you may reach the disk size limit.
Use Azure Backup to back up VMs. Backups protect against accidental data loss. For more information, see
Protect Azure VMs with a Recovery Services vault.
Enable diagnostic logs. Include basic health metrics, infrastructure logs, and boot diagnostics. Boot diagnostics
can help you diagnose a boot failure if your VM gets into a nonbootable state. For more information, see Overview
of Azure Diagnostic Logs.
Configure Azure Monitor. Collect and analyze monitoring data from Azure virtual machines, including the guest
operating system and the workloads that run in it. For more information, see Azure Monitor and Quickstart: Azure Monitor.
Virtual Network
To allow or block public IP addresses, add a network security group to the subnet. Block access from
malicious users, or allow access only from users who have privilege to access the application.
Create a custom health probe. Load Balancer Health Probes can test either HTTP or TCP. If a VM runs an HTTP
server, the HTTP probe is a better indicator of health status than a TCP probe. For an HTTP probe, use a custom
endpoint that reports the overall health of the application, including all critical dependencies. For more information,
see Azure Load Balancer overview.
Don't block the health probe. The Load Balancer Health probe is sent from a known IP address, 168.63.129.16.
Don't block traffic to or from this IP in any firewall policies or network security group rules. Blocking the health
probe would cause the load balancer to remove the VM from rotation.
Enable Load Balancer logging. The logs show how many VMs on the back-end are not receiving network traffic
due to failed probe responses. For more information, see Log analytics for Azure Load Balancer.
Overview of the security pillar
11/2/2020 • 3 minutes to read • Edit Online
Information Security has always been a complex subject, and it evolves quickly with the creative ideas and
implementations of attackers and security researchers. The origin of security vulnerabilities started with
identifying and exploiting common programming errors and unexpected edge cases. However, over time, the
attack surface that an attacker may explore and exploit has expanded well beyond that. Attackers now freely exploit
vulnerabilities in system configurations, operational practices, and the social habits of the systems’ users. As
system complexity, connectedness, and the variety of users increase, attackers have more opportunities to identify
unprotected edge cases and to “hack” systems into doing things they were not designed to do.
Security is one of the most important aspects of any architecture. It provides confidentiality, integrity, and
availability assurances against deliberate attacks and abuse of your valuable data and systems. Losing these
assurances can negatively impact your business operations and revenue, as well as your organization’s reputation
in the marketplace. In the following series of articles, we’ll discuss key architectural considerations and principles
for security and how they apply to Azure.
To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, see the
Microsoft Azure Well-Architected Review.
These are the topics we cover in the security pillar of the Microsoft Azure Well-Architected Framework:
Security design principles – These principles support these three key strategies and describe a securely architected system hosted on cloud or on-premises datacenters (or a combination of both).
Reduce organizational risk – Much like physical safety, success in information security is defined more as an ongoing task of applying good security practices, principles, and hygiene rather than a static absolute state.
Applications and services – Applications and the data associated with them ultimately act as the primary store of business value on a cloud platform.
Governance, risk, and compliance – How is the organization's security going to be monitored, audited, and reported? What types of risks does the organization face while trying to protect identifiable information, Intellectual Property (IP), and financial information? Are there specific industry, government, or regulatory requirements that dictate or provide recommendations on criteria that your organization's security controls must meet?
Identity and access management – Identity provides the basis of a large percentage of security assurances.
Info protection and storage – Protecting data at rest is required to maintain confidentiality, integrity, and availability assurances across all workloads.
Network security and containment – Network security has been the traditional linchpin of enterprise security efforts. However, cloud computing has increased the requirement for network perimeters to be more porous, and many attackers have mastered the art of attacks on identity system elements (which nearly always bypass network controls).
These principles support these three key strategies and describe a securely architected system hosted on cloud or
on-premises datacenters (or a combination of both). Application of these principles will dramatically increase the
likelihood your security architecture will maintain assurances of confidentiality, integrity, and availability.
Each recommendation in this document includes a description of why it is recommended, which maps to one or
more of these principles:
Align Security Priorities to Mission – Security resources are almost always limited, so prioritize efforts
and assurances by aligning security strategy and technical controls to the business using classification of
data and systems. Security resources should be focused first on people and assets (systems, data, accounts,
etc.) with intrinsic business value and those with administrative privileges over business critical assets.
Build a Comprehensive Strategy – A security strategy should consider investments in culture, processes,
and security controls across all system components. The strategy should also consider security for the full
lifecycle of system components including the supply chain of software, hardware, and services.
Drive Simplicity – Complexity in systems leads to increased human confusion, errors, automation failures,
and difficulty of recovering from an issue. Favor simple and consistent architectures and implementations.
Design for Attackers – Your security design and prioritization should be focused on the way attackers see
your environment, which is often not the way IT and application teams see it. Inform your security design
and test it with penetration testing to simulate one time attacks and red teams to simulate long-term
persistent attack groups. Design your enterprise segmentation strategy and other security controls to
contain attacker lateral movement within your environment. Actively measure and reduce the potential
Attack Surface that attackers target for exploitation for resources within the environment.
Leverage Native Controls – Favor native security controls built into cloud services over external controls
from third parties. Native security controls are maintained and supported by the service provider,
eliminating or reducing effort required to integrate external security tooling and update those integrations
over time.
Use Identity as Primary Access Control – Access to resources in cloud architectures is primarily
governed by identity-based authentication and authorization for access controls. Your account control
strategy should rely on identity systems for controlling access rather than relying on network controls or
direct use of cryptographic keys.
Accountability – Designate clear ownership of assets and security responsibilities and ensure actions are
traceable for nonrepudiation. You should also ensure entities have been granted the least privilege required
(to a manageable level of granularity).
Embrace Automation - Automation of tasks decreases the chance of human error that can create risk, so
both IT operations and security best practices should be automated as much as possible to reduce human
errors (while ensuring skilled humans govern and audit the automation).
Focus on Information Protection – Intellectual property is frequently one of the biggest repositories of
organizational value and this data should be protected anywhere it goes including cloud services, mobile
devices, workstations, or collaboration platforms (without impeding collaboration that allows for business
value creation). Your security strategy should be built around classifying information and assets to enable
security prioritization, leveraging strong access control and encryption technology, and meeting business
needs like productivity, usability, and flexibility.
Design for Resilience – Your security strategy should assume that controls will fail and design
accordingly. Making your security posture more resilient requires several approaches working together:
Balanced Investment – across core functions spanning the full NIST Cybersecurity Framework
lifecycle (identify, protect, detect, respond, and recover) to ensure that attackers who successfully
evade preventive controls lose access from detection, response, and recovery capabilities.
Ongoing maintenance – of security controls and assurances to ensure that they don’t decay over time
with changes to the environment or neglect
Ongoing vigilance – to ensure that anomalies and potential threats that could pose risks to the
organizations are addressed in a timely manner.
Defense in depth – approach includes additional controls in the design to mitigate risk to the
organization in the event a primary security control fails. This design should consider how likely the
primary control is to fail, the potential organizational risk if it does, and the effectiveness of the
additional control (especially in the likely cases that would cause the primary control to fail).
Least Privilege – This is a form of defense in depth to limit the damage that can be done by any one
account. Accounts should be granted the least amount of privilege required to accomplish their
assigned tasks by access permissions and by time. This helps mitigate the damage of an external
attacker who gains access to the account and/or an internal employee that inadvertently or
deliberately (for example, insider attack) compromises security assurances.
Baseline and Benchmark – To ensure your organization considers current thinking from outside sources,
evaluate your strategy and configuration against external references (including compliance requirements).
This helps to validate your approaches, minimize risk of inadvertent oversight, and the risk of punitive fines
from noncompliance.
Drive Continuous Improvement – Systems and existing practices should be regularly evaluated and
improved to ensure they are and remain effective against attackers who continuously improve and the
continuous digital transformation of the enterprise. This should include processes that proactively integrate
learnings from real world attacks, realistic penetration testing and red team activities, and other sources as
available.
Assume Zero Trust – When evaluating access requests, all requesting users, devices, and applications
should be considered untrusted until their integrity can be sufficiently validated. Access requests should be
granted conditionally based on the requestor’s trust level and the target resource’s sensitivity. Reasonable
attempts should be made to offer means to increase trust validation (for example, request multi-factor
authentication) and remediate known risks (change known-leaked password, remediate malware infection)
to support productivity goals.
Educate and incentivize security – The humans that are designing and operating the cloud workloads
are part of the whole system. It is critical to ensure that these people are educated, informed, and
incentivized to support the security assurance goals of the system. This is particularly important for people
with accounts granted broad administrative privileges.
Enforce governance to reduce risks
12/18/2020 • 2 minutes to read • Edit Online
As part of overall design, prioritize where to invest the available resources: financial, people, and time. Constraints
on those resources also affect the security implementation across the organization. Set organizational policies for
operations, technologies, and configurations based on internal factors (business requirements, risks, asset
evaluation) and external factors (benchmarks, regulatory standards, threat environment).
Before defining the policies, consider:
How is the organization’s security monitored, audited, and reported? Is there mandatory reporting?
Are the existing security practices working?
Are there new requirements?
Are there any industry-, government-, or regulator-specific requirements?
Designate group(s) (or individual roles) for central functions that affect shared services and applications.
After the policies are set, continuously improve those standards incrementally. Make sure that the security posture
doesn’t degrade over time by auditing and monitoring compliance. For information about managing
security standards of an organization, see governance, risk, and compliance (GRC).
In this section
Reference model: Segmentation – Reference model and strategies of how the functions and teams can be segmented.
Management groups and permissions – Strategies using management groups to manage resources across multiple subscriptions consistently and efficiently.
Segmentation refers to the isolation of resources from other parts of the organization. It's an effective way of
detecting and containing adversary movements.
An approach to segmentation is network isolation. On its own, this approach is not recommended, because the
different technical teams are often not aligned with the overall business use cases and application workloads. One
outcome is complexity, especially with on-premises networks, which leads to broad network firewall exceptions. While network control should
be considered as one of the strategies, it must be part of a unified segmentation strategy.
An effective segmentation strategy will guide all technical teams (IT, security, applications) to consistently isolate
access using networking, applications, identity, and any other access controls. The strategy should aim to:
Minimize operational friction by aligning to business practices and applications.
Contain risk by adding cost to attackers. This is done by:
Isolating sensitive workloads from compromise of other assets
Isolating high-exposure systems from being used as a pivot to other systems
Monitor operations that can lead to potential violations of the integrity of the segments (account usage,
unexpected traffic).
Here are some recommendations for creating a unified strategy.
Ensure alignment of technical teams to a single strategy that is based on assessing business risks.
Establish a modern perimeter based on zero-trust principles focused on identity, device, applications, and other
signals. This will overcome the limitation of network controls to protect new resources and attack types.
Reinforce network controls for legacy applications by exploring microsegmentation strategies.
Reference model
Start with this reference model and adapt it to your organization’s needs. This model shows how functions,
resources, and teams can be segmented.
Example segments
Consider isolating shared and individual resources as shown in the reference model above.
Core services segment
This segment hosts shared services utilized across the organization. These shared services typically include Active
Directory Domain Services, DNS/DHCP, and System Management Tools hosted on Azure Infrastructure as a Service
(IaaS) virtual machines.
Individual segments
There are other segments that can contain resources based on some criteria. For instance, resources used by a
workload application can be contained in a separate segment. Another way is to segment by lifecycle stages:
development, test, and production. Some resources might intersect; for example, applications can use the virtual
networks used for lifecycle stages.
Functions and teams
These are the main functions for this reference model. Mapping between central functions, responsibilities, and
teams is described in Team roles and responsibilities.
Policy management (Core and individual segments) – Scope: some or all resources. Responsibility: monitor and enforce compliance with external (or internal) regulations, standards, and security policy, and assign appropriate permissions to those roles.
Central IT operations (Core) – Scope: across all resources. Responsibility: grant permissions to the central IT department (often the infrastructure team) to create, modify, and delete resources like virtual machines and storage.
Central networking group (Core and individual segments) – Scope: all network resources. Responsibility: to ensure consistency and avoid technical conflicts, assign network resource responsibilities to a single central networking organization. These resources should include virtual networks, subnets, Network Security Groups (NSG), and the virtual machines hosting virtual network appliances.
Security operations (Core and individual segments) – Scope: all resources. Responsibility: assess risk factors, identify potential mitigations, and advise organizational stakeholders who accept the risk.
IT operations (individual segments) – Scope: all resources. Responsibility: grant permission to create, modify, and delete resources. The purpose of the segment (and resulting permissions) will depend on your organization structure.
Segments with resources managed by a centralized IT organization can grant the central IT department (often the infrastructure team) permission to modify these resources.
Segments managed by independent business units or functions (such as a Human Resources IT Team) can grant those teams permission to all resources in the segment.
Segments with autonomous DevOps teams don’t need to grant permissions across all resources because the resource role (below) grants permissions to application teams. For emergencies, use the service admin account (break-glass account).
Service admin (Core and individual segments) – Responsibility: use the service admin role only for emergencies (and initial setup if required). Do not use this role for daily tasks.
See also
Management groups
Next steps
Start with this reference model and manage resources across multiple subscriptions consistently and efficiently
with management groups.
Management groups
Security management groups
11/2/2020 • 5 minutes to read • Edit Online
Management groups can manage resources across multiple subscriptions consistently and efficiently. However,
due to their flexibility, your design can become complex and compromise security and operations.
The recommendations in this article are based on an example reference model. Start with this model and adapt it
to your organization’s needs. For more information, see Enterprise segmentation strategy.
IMPORTANT
Treat security teams as critical impact accounts and apply the same protections as administrators.
IMPORTANT
Test all enterprise-wide changes (policy, tags, RBAC model, and so on) on the root management group before applying them. You
can use a test lab, which can be a representative lab tenant or a lab segment in a production tenant. Another option is to use a
production pilot, which can be a segment management group or a designated subset of subscriptions in a management group.
Validate changes to make sure the requirements have the desired effect.
Administrative account security
12/18/2020 • 9 minutes to read • Edit Online
Administration is the practice of monitoring, maintaining, and operating Information Technology (IT) systems to
meet service levels that the business requires. Administration introduces some of the highest impact security risks
because performing these tasks requires privileged access to a very broad set of these systems and applications.
Attackers know that gaining access to an account with administrative privileges can get them access to most or all
of the data they would target, making the security of administration one of the most critical security areas.
As an example, Microsoft makes significant investments in protection and training of administrators for our cloud
systems and IT systems.
Microsoft's recommended core strategy for administrative privileges is to use the available controls to reduce risk.
Reduce risk exposure (scope and time) – The principle of least privilege is best accomplished with modern
controls that provide privileges on demand. This helps to limit risk by limiting administrative privilege exposure by:
Scope – Just Enough Access (JEA) provides only the required privileges for the administrative operation
required (vs. having direct and immediate privileges to many or all systems at a time, which is almost never
required).
Time – Just in Time (JIT) approaches provide the required privileges as they are needed.
Mitigate the remaining risks – Use a combination of preventive and detective controls to reduce risks
such as isolating administrator accounts from the most common risks (phishing and general web browsing),
simplifying and optimizing their workflow, increasing assurance of authentication decisions, and identifying
anomalies from normal baseline behavior that can be blocked or investigated.
Microsoft has captured and documented best practices for protecting administrative accounts and published
prioritized roadmaps for protecting privileged access that can be used as references for prioritizing mitigations for
accounts with privileged access.
Securing Privileged Access (SPA) roadmap for administrators of on-premises Active Directory
Guidance for securing administrators of Azure Active Directory
Most architectures have shared services that are hosted and accessed across networks. Those services share
common infrastructure and users need to access resources and data from anywhere. For such architectures, a
common way to secure resources is to use network controls. However, that isn't enough.
Provide security assurance through identity management: the process of authenticating and authorizing security
principals. Use identity management services to authenticate and grant permission to users, partners, customers,
applications, services, and other entities.
Checklist
How are you managing the identity for your workload?
Define clear lines of responsibility and separation of duties for each function. Restrict access based on a need-
to-know basis and least privilege security principles.
Assign permissions to users, groups, and applications at a certain scope through Azure RBAC. Use built-in roles
when possible.
Prevent deletion or modification of a resource, resource group, or subscription through management locks.
Use Managed Identities to access resources in Azure.
Support a single enterprise directory. Keep the cloud and on-premises directories synchronized, except for
critical-impact accounts.
Set up Azure AD Conditional Access. Enforce and measure key security attributes when authenticating all users,
especially for critical-impact accounts.
Have a separate identity source for non-employees.
Preferably use passwordless methods or opt for modern password methods.
Block legacy protocols and authentication methods.
In this section
Follow these questions to assess the workload at a deeper level. The recommendations in this section are based on
using Azure AD.
Does the application team have a clear view on responsibilities and individual/group access levels? Designate groups (or individual roles) that will be responsible for central functions, such as network, policy management, and so on.
Is the workload infrastructure protected with RBAC (role-based access control)? Azure Resource Manager handles all control plane requests and applies restrictions that you specify through Azure role-based access control (Azure RBAC), Azure Policy, and locks.
Has role-based and/or resource-based authorization been configured within Azure AD? Use a mix of role-based and resource-based authorization. Start with the principle of least privilege and add more actions based on your needs.
How is the workload authenticated when communicating with Azure platform services? Authenticate using Managed Identities, use passwordless protections, and keep all identities (except critical accounts) at a central location.
The questions in this section are aligned to the Azure Security Benchmarks Identity and Access Control.
Reference architecture
Here are some reference architectures related to identity and access management:
Integrate on-premises AD domains with Azure AD
Integrate on-premises AD with Azure
Next steps
We recommend applying as many of the best practices as early as possible, and then working to retrofit any
gaps over time as you mature your security program.
Monitor identity, network, data risks
Related links
Five steps to securing your identity infrastructure
In an organization, several teams work together to make sure that the workload and the supporting infrastructure
are secure. To avoid confusion that can create security risks, define clear lines of responsibility and separation of
duties.
Do the teams have a clear view on responsibilities and individual/group access levels?
Designate groups (or individual roles) that will be responsible for central functions, such as network and policy
management, as shown in this table. Document the responsibilities of each group and share contacts to facilitate
communication.
Identity Security and Standards – Set direction for Azure AD directories, PIM/PAM usage, MFA, password/synchronization configuration, Application Identity Standards.
Application roles and responsibilities should cover the different access levels of each operational function. For
example, publishing a production release, accessing customer data, manipulating database records, and so on.
Application teams should include central functions listed in the preceding table.
Next
Restrict access to Azure resources based on a need-to-know basis, starting with the principle of least privilege,
and add more based on your operational needs.
Azure control plane security
Related links
For considerations about using management groups to reflect the organization's structure within an Azure Active
Directory (Azure AD) tenant, see CAF: Management group and subscription organization.
Back to the main article: Azure identity and access management considerations
Azure control plane security
12/18/2020 • 3 minutes to read • Edit Online
The term control plane refers to the management of resources in your subscription. These activities include
creating, updating, and deleting Azure resources as required by the technical team.
Azure Resource Manager handles all control plane requests and applies restrictions that you specify through Azure
role-based access control (Azure RBAC), Azure Policy, and locks. Apply those restrictions based on the requirements
of the organization.
Key points
Restrict access based on a need-to-know basis and least privilege security principles.
Assign permissions to users, groups, and applications at a certain scope through Azure RBAC.
Use built-in roles when possible.
Prevent deletion or modification of a resource, resource group, or subscription through management locks.
Use less restrictive controls in your CI/CD pipeline for development and test environments.
Role-based access control (RBAC) provides the separation when accessing the resources that an application uses.
Decide who has access to resources at the granular level and what they can do with those resources. For example:
Developers can't access production infrastructure.
Only the SecOps team can read and manage Key Vault secrets.
If there are multiple teams, Project A team can access and manage Resource Group A and all resources within.
Grant roles the appropriate permissions, starting with least privilege, and add more based on your
operational needs. Provide clear guidance to your technical teams that implement permissions. This clarity
makes it easier to detect and correct human errors such as overpermissioning.
Azure RBAC helps you manage that separation. You can assign permissions to users, groups, and applications at a
certain scope. The scope of a role assignment can be a subscription, a resource group, or a single resource. For
details, see Azure role-based access control (Azure RBAC).
Assign permissions at the management group level instead of at individual subscriptions to drive consistency and ensure
application to future subscriptions.
Consider the built-in roles before creating custom roles to grant the appropriate permissions to resources and
other objects.
For example, assign security teams the Security Readers permission, which provides the access needed to assess
risk factors and identify potential mitigations, without providing access to the data.
IMPORTANT
Treat security teams as critical accounts and apply the same protections as administrators.
Management locks
Are there resource locks applied on critical parts of the infrastructure?
Unlike role-based access control, management locks are used to apply a restriction across all users and roles.
For critical infrastructure, use management locks to prevent deletion or modification of a resource, resource group,
or subscription. Use locks in cases where only specific roles and users with permissions should be able to
delete or modify resources.
Set locks in the DevOps process carefully because modification locks can sometimes block automation. For
examples of those blocks and considerations, see Considerations before applying locks.
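As an illustration, here is a hedged sketch of applying a CanNotDelete lock with the azure-mgmt-resource Python SDK; the subscription ID, resource group, and lock names are placeholders, and the exact model names may differ slightly across SDK versions.
```python
# Sketch: apply a CanNotDelete management lock to a critical resource group.
# Subscription ID, resource group, and lock name are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource.locks import ManagementLockClient
from azure.mgmt.resource.locks.models import ManagementLockObject

client = ManagementLockClient(DefaultAzureCredential(), "<subscription-id>")

client.management_locks.create_or_update_at_resource_group_level(
    resource_group_name="prod-core-rg",
    lock_name="do-not-delete",
    parameters=ManagementLockObject(
        level="CanNotDelete",
        notes="Critical infrastructure; removal requires change approval.",
    ),
)
```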
Application deployment
Is there direct access to the application infrastructure through the Azure portal, Command-Line
Interface (CLI), or REST API?
It's recommended that you deploy application infrastructure through automation and CI/CD. To maximize
application autonomy and agility, balance restrictive access control on less critical development and test
environments.
Are CI/CD pipeline roles clearly defined and permissions set?
Azure DevOps offers built-in roles that can be assigned to individual users or groups. For example, using them
properly can make sure that only users responsible for production releases are able to initiate the process and that
only developers can access the source code. Variable groups often contain sensitive configuration information and
can be protected as well.
Next
Grant or deny access to a system by verifying whether the accessor has the permissions to perform the requested
action.
Authentication
Related links
Back to the main article: Azure identity and access management considerations
Authentication with Azure AD
12/18/2020 • 6 minutes to read • Edit Online
Authentication is a process that grants or denies access to a system by verifying the accessor's identity. Use a
managed identity service for all resources to simplify overall management (such as password policies) and
minimize the risk of oversights or human errors. Azure Active Directory (Azure AD) is the one-stop shop for
identity and access management in Azure.
Key points
Use Managed Identities to access resources in Azure.
Keep the cloud and on-premises directories synchronized, except for high-privilege accounts.
Preferably use passwordless methods or opt for modern password methods.
Enable Azure AD conditional access based on key security attributes when authenticating all users, especially for
high-privilege accounts.
Authenticate with identity services instead of cryptographic keys. On Azure, Managed Identities eliminate the need
to store credentials that might be leaked inadvertently. When Managed Identity is enabled for an Azure resource,
it's assigned an identity that you can use to obtain Azure AD tokens. For more information, see Azure AD-managed
identities for Azure resources.
For example, an Azure Kubernetes Service (AKS) cluster needs to pull images from Azure Container Registry (ACR).
To access the image, the cluster needs to know the ACR credentials. The recommended way is to enable Managed
Identities during cluster configuration. That configuration assigns an identity to the cluster and allows it to obtain
Azure AD tokens.
This approach is secure because Azure handles the management of the underlying credentials for you.
The identity is tied to the lifecycle of the resource, in this example the AKS cluster. When the resource is deleted,
Azure automatically deletes the identity.
Azure AD manages the timely rotation of secrets for you.
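For example, here is a minimal sketch using the azure-identity Python package: DefaultAzureCredential picks up the managed identity when the code runs on Azure and falls back to developer credentials locally. The vault and storage account names are placeholders.
```python
# Sketch: authenticate to Azure services with a managed identity instead of
# stored credentials. Vault and storage account names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()

# No connection strings or keys appear in code or configuration.
secrets = SecretClient(vault_url="https://<vault-name>.vault.azure.net", credential=credential)
database_password = secrets.get_secret("db-password").value

blobs = BlobServiceClient(account_url="https://<account>.blob.core.windows.net", credential=credential)
```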
Don't assume that API URLs used by a workload are hidden and can't get exposed to attackers. For example,
JavaScript code on a website can be viewed. A mobile application can be decompiled and inspected. Even for
internal APIs used only on the backend, a requirement of authentication can increase the difficulty of lateral
movement if an attacker gets network access. Typical mechanisms include API keys, authorization tokens, IP
restrictions.
Managed Identity can help an API be more secure because it replaces the use of human-managed service
principals and can request authorization tokens.
How is user authentication handled in the application?
Don't use custom implementations to manage user credentials. Instead, use Azure AD or other managed identity
providers such as Microsoft account or Azure AD B2C. Managed identity providers provide additional security features
such as modern password protections, MFA (Multi-factor authentication), and resets. In general, passwordless
protections are preferred. Also, modern protocols like OAuth 2.0 use token-based authentication with limited
timespan.
Are authentication tokens cached securely and encrypted when sharing across web servers?
Application code should first try to get OAuth access tokens silently from a cache before attempting to acquire a
token from the identity provider, to optimize performance and maximize availability. Tokens should be stored
securely and handled as any other credentials. When there's a need to share tokens across application servers
(instead of each server acquiring and caching their own) encryption should be used.
For information, see Acquire and cache tokens.
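A minimal sketch of the silent-then-fallback acquisition pattern using MSAL for Python; the client ID, secret, tenant, and scope are placeholders, and the serialized cache shown here would need to be encrypted before sharing across web servers.
```python
# Sketch: try the token cache before calling the identity provider (MSAL for Python).
# Client ID, secret, tenant, and scope are placeholders.
import msal

cache = msal.SerializableTokenCache()  # persist and encrypt this when sharing across servers

app = msal.ConfidentialClientApplication(
    client_id="<app-client-id>",
    client_credential="<client-secret-or-certificate>",
    authority="https://login.microsoftonline.com/<tenant-id>",
    token_cache=cache,
)

scopes = ["https://graph.microsoft.com/.default"]

# Silent acquisition hits the cache first and avoids a round trip to Azure AD.
result = app.acquire_token_silent(scopes, account=None)
if not result:
    result = app.acquire_token_for_client(scopes=scopes)

access_token = result["access_token"]
```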
IMPORTANT
Don’t synchronize high-privilege accounts to an on-premises directory. If an attacker gets full control of on-premises assets,
they can compromise a cloud account. This strategy will limit the scope of an incident. For more information, see Critical
impact account dependencies.
Synchronization is blocked by default in the default Azure AD Connect configuration. Make sure that you haven’t customized
this configuration. For information about filtering in Azure AD, see Azure AD Connect sync: Configure filtering.
GitHub: Integrate on-premises Active Directory domains with Azure Active Directory.
The design considerations are described in Integrate on-premises Active Directory domains with Azure AD.
Next
Grant or deny access to a system by verifying the accessor's identity.
Authorization
Related links
Back to the main article: Azure identity and access management considerations
Authorization with Azure AD
12/18/2020 • 4 minutes to read • Edit Online
Authorization is a process that grants or denies access to a system by verifying whether the accessor has the
permissions to perform the requested action. The accessor in this context is the workload (cloud application) or the
user of the workload. The action might be operational or related to resource management. There are two main
approaches to authorization: role-based and resource-based. Both can be configured with Azure AD.
Key points
Use a mix of role-based and resource-based authorization. Start with the principle of least privilege and add
more actions based on your needs.
Define clear lines of responsibility and separation of duties for application roles and the resources it can
manage. Consider the access levels of each operational function, such as permissions needed to publish
production release, access customer data, manipulate database records.
Do not provide permanent access for any critical accounts. Elevate access permissions based on
approval, and make them time bound, using Azure AD Privileged Identity Management (Azure AD PIM).
Role-based authorization
This approach authorizes an action based on the role assigned to a user. For example, some actions require an
administrator role.
A role is a set of permissions. For example, the administrator role has permissions to perform all read, write, and
delete operations. Also, the role has a scope. The scope specifies the management groups, subscriptions, or
resource groups within which the role is allowed to operate.
When assigning a role to a user, consider what actions the role can perform and the scope of those
operations. Here are some considerations for role assignment:
Use built-in roles before creating custom roles to grant the appropriate permissions to VMs and other
objects. You can assign built-in roles to users, groups, service principals, and managed identities. For more
information, see Azure built-in roles.
If you need to create custom roles, grant roles with the appropriate action. Actions are categorized into
operational and data actions. To avoid overpermissioning, start with actions that have least privilege and
add more based on your operational or data access needs. Provide clear guidance to your technical teams that
implement permissions. For more information, see Azure custom roles.
If you have a segmentation strategy, assign permissions with a scope. For example, if you use management
groups to support your strategy, set the scope to the group rather than the individual subscriptions. This will
drive consistency and ensure application to future subscriptions. When assigning permissions for a
segment, consider consistency while allowing flexibility to accommodate several organizational models.
These models can range from a single centralized IT group to mostly independent IT and DevOps teams. For
information about assigning scope, see AssignableScopes.
You can use security groups to assign permissions. However, there are disadvantages. It can get complex
because the workload needs to keep track of which security groups correspond to which application roles,
for each tenant. Also, access tokens can grow significantly and Azure AD includes an "overage" claim to limit
the token size. See Microsoft identity platform access tokens.
For more information, see Azure role-based access control (RBAC).
For information about implementing role-based authorization in an ASP.NET application, see Role-based
authorization.
Resource-based authorization
With role-based authorization, a user gets the same level of control on a resource based on the user's role.
However, there might be situations where you need to define access rights per resource. For example, in a resource
group, you might want to allow some users to delete a resource while other users cannot. In such situations, use resource-
based authorization, which authorizes an action based on a particular resource. Every resource has an Owner; the Owner
can delete the resource. Contributors can read and update the resource but can't delete it.
NOTE
The "owner" and "contributor" roles for a resource are not the same as application roles.
You'll need to implement custom logic for resource-based authorization. That logic might be a mapping of
resources, Azure AD object (like role, group, user), and permissions.
For information and code sample about implementing resource-based authorization in an ASP.NET application, see
Resource-based authorization.
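As a sketch of that custom mapping, the following is an illustrative (not prescriptive) Python example; the resource names and Azure AD object IDs are hypothetical.
```python
# Sketch: custom resource-based authorization logic that maps a resource to the
# Azure AD object IDs allowed to perform each action. IDs are illustrative only.
from dataclasses import dataclass, field
from typing import Set

@dataclass
class ResourcePermissions:
    owner: str                               # Azure AD object ID of the resource owner
    contributors: Set[str] = field(default_factory=set)

    def can(self, principal_id: str, action: str) -> bool:
        if principal_id == self.owner:
            return True                                   # owners can read, update, and delete
        if principal_id in self.contributors:
            return action in {"read", "update"}           # contributors can't delete
        return False

# Hypothetical resource and principal identifiers.
permissions = {
    "reports-db": ResourcePermissions(owner="owner-object-id",
                                      contributors={"contributor-object-id"}),
}

assert permissions["reports-db"].can("contributor-object-id", "update")
assert not permissions["reports-db"].can("contributor-object-id", "delete")
```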
Do not provide permanent access for any critical accounts, and lower permissions when access is no longer
required. Some strategies include:
Just-in-time privileged access to Azure AD and Azure resources.
Time-bound access.
Approval-based access.
A break-glass process for emergency access.
Limit write access to production systems to service principals. No user accounts should have regular write
access.
You can use native and third-party options to elevate access permissions for at least highly privileged activities, if
not all activities. Azure AD Privileged Identity Management (Azure AD PIM) is the recommended native solution on Azure.
For more information about PIM, see What is Azure AD Privileged Identity Management?.
Related links
Back to the main article: Azure identity and access management considerations
Network security review
12/18/2020 • 4 minutes to read • Edit Online
Assess your workload in areas such as network boundary security, network security, database security, data storage
security, identity management, and operational security.
How do you implement DDoS protection?
Risk : Potential of smaller-scale attack that doesn't trip the platform-level protection.
Azure Virtual Network resources offer Basic and Standard tiers for DDoS protection. Enable DDoS Protection
Standard for all business-critical web applications and services.
The Windows N-tier application on Azure with SQL Server reference architecture uses DDoS Protection Standard
because this option:
Uses adaptive tuning, based on the application's network traffic patterns, to detect threats.
Guarantees 100% SLA.
Can be cost effective. For example, during a DDoS attack, the first set of attacks causes the provisioned resources
to scale out. For a resource such as a virtual machine scale set, 10 machines can grow to 100, increasing overall
costs. With Standard protection, you don't have to worry about the cost of the scaled resources because Azure
will provide the cost credit.
For information about Standard DDoS Protection, see Azure DDoS Protection Service.
How do you configure public IPs for which traffic is passed in, and how and where it's translated?
Risk : Inability to ensure VMs and communication between them remains private within a network boundary.
Use Azure Virtual Network to allow VMs to securely communicate with each other, the internet, and on-premises
networks. Otherwise, communication between VMs may not be fully private within a network boundary.
How do you configure traffic flow between multiple application tiers?
Risk : Inability to define different access policies based on the workload types, and to control traffic flows between
them.
You may want to define different access policies based on the workload types and control flow between them. Use
Azure Virtual Network Subnet to allocate separate address spaces for different elements or tiers within the
workload. Then, define different access policies, and control traffic flows between those tiers.
How do you route network traffic through security appliances for security boundary policy
enforcement, auditing, and inspection?
Risk : Inability to define communication paths between different tiers within a network boundary.
Use Azure Virtual Network User Defined Routes (UDR) to control the next hop for traffic between Azure, on-premises,
and Internet resources through a virtual appliance, virtual network gateway, virtual network, or the internet.
Do you use firewalls, load balancers, and Network Intrusion Detection/Network Intrusion Prevention
(NIDS/NIPS)?
Risk : Possibility of not being able to select comprehensive solutions for secure network boundaries.
Use Network Appliances from Azure Marketplace to deploy a variety of preconfigured network virtual appliances.
Utilize Application Gateway WAF to detect and protect against common web attacks.
How do you segment the larger address space into subnets?
Risk : Inability to allow or deny inbound network traffic to, or outbound network traffic from, resources within a larger
network address space.
Use network security groups (NSGs) to allow or deny traffic to and from a single IP address, to and from multiple IP
addresses, or even to and from entire subnets.
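As an illustration, here is a hedged sketch of creating an NSG rule with the azure-mgmt-network Python SDK; the subscription ID, resource group, NSG name, and address range are placeholders rather than values from this guide.
```python
# Sketch: allow inbound HTTPS from a trusted range and rely on default rules to
# deny other traffic. Names, IDs, and the address range are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import SecurityRule

network = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

rule = SecurityRule(
    priority=100,
    direction="Inbound",
    access="Allow",
    protocol="Tcp",
    source_address_prefix="203.0.113.0/24",   # trusted clients only
    source_port_range="*",
    destination_address_prefix="*",
    destination_port_range="443",
)

network.security_rules.begin_create_or_update(
    "app-rg", "web-subnet-nsg", "allow-https-from-trusted", rule
).result()
```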
How do you control routing behavior for VM connectivity?
Risk : Potential of outbound connections from any VM increasing attack surface area leveraged by attackers.
Utilize forced tunneling to ensure that connections to the internet go through corporate network security devices.
How do you implement enhanced levels of security - firewall, IPS/IDS, antivirus, vulnerability
management, botnet protection on top of network-level controls?
Risk : Inability to enable security for OSI model layers other than the network and transport layers.
Use Azure Marketplace to provision devices for higher levels of network security than with network-level access
controls.
How do you establish cross premises connectivity?
Risk : Inability to make services available even when datacenters might become unavailable.
Utilize Azure Traffic Manager to load balance connections to services based on the location of the user and/or other
criteria.
How do you disable RDP/SSH access to virtual machines?
Risk : Potential for attackers to use brute force techniques to gain access and launch other attacks.
Disable RDP/SSH access to Azure Virtual Machines and use VPN/ExpressRoute to access these virtual machines for
remote management.
How do you prevent, detect, and respond to threats?
Risk : Inability to have a single pane of visibility to prevent, detect, and respond to threats.
Employ Azure Security Center for increased visibility into, and control over, the security of Azure resources,
integrated security monitoring, and policy management across Azure subscriptions.
How do you monitor and diagnose conditions of the network?
Risk : Inability to understand, diagnose, and gain insights into the network in Azure.
Use Network Watcher to understand, diagnose, and gain insights into the network in Azure.
How do you gain access to real time performance information at the packet level?
Protect assets by placing controls on network traffic originating in Azure, between on-premises and Azure hosted
resources, and traffic to and from Azure. If security measures aren't in place, attackers can gain access, for instance,
by scanning across public IP ranges. Proper network security controls can provide defense-in-depth elements that
help detect, contain, and stop attackers who gain entry into your cloud deployments.
Here are some recommendations for network security and containment:
Align network segmentation with overall strategy
Centralize network management and security
Build a network containment strategy
Define an internet edge strategy
Protecting data at rest is required to maintain confidentiality, integrity, and availability assurances across all
workloads. Storage in a cloud service like Azure is architected and implemented quite differently than on-premises
solutions to enable massive scaling, modern access through REST APIs, and isolation between tenants.
Granting access to Azure storage is possible through Azure Active Directory (Azure AD) as well as key-based
authentication mechanisms (Symmetric Shared Key Authentication, or Shared Access Signature (SAS)).
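For example, here is a minimal sketch of issuing a short-lived, read-only shared access signature with the azure-storage-blob Python package; the account name, key, container, and blob names are placeholders.
```python
# Sketch: grant time-limited, read-only access to a single blob with a SAS.
# Account name, key, and blob names are placeholders.
from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobSasPermissions, generate_blob_sas

sas = generate_blob_sas(
    account_name="<account>",
    container_name="reports",
    blob_name="q4.pdf",
    account_key="<account-key>",
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),  # short lifetime limits exposure
)

url = f"https://<account>.blob.core.windows.net/reports/q4.pdf?{sas}"
```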
Storage in Azure includes a number of native security design attributes:
All data is encrypted by the service.
Data in the storage system cannot be read by a tenant if it has not been written by that tenant (to mitigate
the risk of cross tenant data leakage).
Data will remain only in the region you choose.
The system maintains three synchronous copies of data in the region you choose.
Detailed activity logging is available on an opt-in basis.
Additional security features can be configured such as a storage firewall to provide an additional layer of access
control as well as storage threat protection to detect anomalous access and activities.
Applications and services
12/18/2020 • 2 minutes to read • Edit Online
Applications and the data associated with them act as the primary store of business value on a cloud platform.
Applications can play a role in risks to the business because:
Business processes are encapsulated and executed by applications; services need to be available and
provided with high integrity.
Business data is stored and processed by application workloads and requires high assurances of
confidentiality, integrity, and availability.
In this section
What aspects of the application do you need to protect? Understanding the hosting models and the security responsibility.
Does the organization identify the highest severity threats to this workload through threat modeling? Identify risks to the application and risks it may pose to your enterprise through threat modeling.
Do you have any regulatory or governance requirements? Guidance on standards published by law, authorities, and regulators.
Are you exposing information through exception handling or HTTP headers? Consider the way you store secrets and handle exceptions. Here are some considerations.
Are the frameworks and libraries used by the application secure? Evaluate frameworks and libraries used by the application and the resulting vulnerabilities.
Next steps
See these best practices related to PaaS applications.
Securing PaaS deployments
Secure communication paths between applications and the services. Make sure that there's a distinction between
the endpoints exposed to the public internet and private ones. Also, make sure the public endpoints are protected
with a web application firewall.
Network security
Application classification for security
12/18/2020 • 4 minutes to read • Edit Online
Azure can host both legacy and modern applications through Infrastructure as a Service (IaaS) virtual machines
and Platform as a Service (PaaS). With legacy applications, you have the responsibility of securing all dependencies
including OS, middleware, and other components. For PaaS applications, you don't need to manage and secure the
underlying server OS. You are responsible for the application configuration.
This article describes the considerations for understanding the hosting models and the security responsibility of
each, and for identifying critical applications.
Next steps
Applications and services
Application classification
Application threat analysis
Regulatory compliance
Application threat analysis
11/2/2020 • 3 minutes to read • Edit Online
Do a comprehensive analysis to identify threats, attacks, vulnerabilities, and countermeasures. Having this
information helps you protect the application and understand the threats it might pose to the system. Start with
simple questions to gain insight into potential risks. Then, progress to advanced techniques using threat modeling.
Are connections authenticated using Azure AD, TLS (with mutual authentication), or another modern security protocol approved by the security team? (Between users and the application; between different application components and services.) | Prevent unauthorized access to the application component and data.
Are you limiting access to only those accounts that have the need to write or modify data in the application? | Prevent unauthorized data tampering or alteration.
Is the application activity logged and fed into a Security Information and Event Management (SIEM) through Azure Monitor or a similar solution? | Detect and investigate attacks quickly.
Is critical data protected with encryption that has been approved by the security team? | Prevent unauthorized copying of data at rest.
Is inbound and outbound network traffic encrypted using TLS? | Prevent unauthorized copying of data in transit.
Is the application protected against Distributed Denial of Service (DDoS) attacks using services such as Azure DDoS Protection? | Detect attacks designed to overload the application so it can't be used.
Does the application store any logon credentials or keys to access other applications, databases, or services? | Identify whether an attack can use your application to attack other systems.
Do the application controls allow you to fulfill regulatory requirements? | Protect users' private data and avoid compliance fines.
Integrate threat modeling through automation using secure operations. Here are some resources:
Toolkit for Secure DevOps on Azure.
Guidance on DevOps pipeline security by OWASP.
If a security vulnerability is discovered, update the software with the fix as soon as possible. Have processes, tools,
and approvals in place to roll out the fix quickly.
Next steps
Applications and services
Application classification
Regulatory compliance
Regulatory compliance
12/18/2020 • 2 minutes to read • Edit Online
Cloud architectures can present challenges when you strictly apply standards to workloads. This article provides
some security recommendations for adhering to standards.
Governments and regulatory organizations frequently publish standards to help define good security practices so
that organizations can avoid negligence. The purpose and scope of these standards and regulations vary, but the
security requirements can influence the design for data protection and retention, data privacy, and system security.
Noncompliance can lead to fines or other business impact. Work with your regulators and carefully review the
standards to understand both the intent and the literal wording. Start by answering these questions:
What are the regulatory requirements for the solution?
How is compliance measured?
Who approves that the solution meets the requirements?
Are there processes for obtaining attestations?
Security of an application that is hosted in Azure is a shared responsibility between you as the application owner
and Azure. For IaaS, you are responsible for configurations related to VM, operating system, and components
installed on it. For PaaS, you are responsible for the security of the application service configurations and making
sure that the dependencies used by the application are also secure.
Key points
Do not store secrets in source code or configuration files. Instead keep them in a secure store, such as Azure
Key Vault.
Do not expose detailed error information when handling application exceptions.
Do not expose platform-specific information.
Restrict access to Azure resources that do not meet the security requirements.
Validate the security of any open-source code added to your application.
Update frameworks and libraries as part of the application lifecycle.
Configuration security
During the design phase, consider the way you store secrets and handle exceptions. Here are some considerations.
Are application keys and secrets stored securely?
Configuration within the application can include secrets like database connection strings, certificate keys, and so
on. Do not store secrets in source code or configuration files. Instead keep them in a secure store, such as Azure
Key Vault. To catch secrets that do end up in code, identify them with static code scanning tools. Add the scanning
process to your continuous integration (CI) pipeline.
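As a minimal sketch of the Key Vault approach, the vault URI and secret name below are placeholders, and the Azure.Identity and Azure.Security.KeyVault.Secrets packages are assumed:

using System;
using System.Threading.Tasks;
using Azure.Identity;
using Azure.Security.KeyVault.Secrets;

class SecretLoader
{
    static async Task<string> GetDatabaseConnectionStringAsync()
    {
        // DefaultAzureCredential covers managed identity in Azure and developer credentials locally.
        var client = new SecretClient(
            new Uri("https://contoso-vault.vault.azure.net/"),   // hypothetical vault
            new DefaultAzureCredential());

        // The secret name is illustrative; the value lives only in Key Vault, never in config files.
        KeyVaultSecret secret = await client.GetSecretAsync("Sql-ConnectionString");
        return secret.Value;
    }
}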
Are errors and exceptions handled properly without exposing that information to users?
When handling application exceptions, make the application fail gracefully and log the error. Do not provide
detailed information related to the failure, such as call stack, SQL queries, or out of range errors. This information
can provide attackers with valuable information about the internals of the application.
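One possible way to apply this guidance in an ASP.NET Core application is a small exception-handling middleware that logs full details server-side but returns only a generic message and correlation ID to the caller. This is only a sketch; the middleware name and response shape are illustrative.

using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Logging;

public class SafeExceptionMiddleware
{
    private readonly RequestDelegate _next;
    private readonly ILogger<SafeExceptionMiddleware> _logger;

    public SafeExceptionMiddleware(RequestDelegate next, ILogger<SafeExceptionMiddleware> logger)
    {
        _next = next;
        _logger = logger;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        try
        {
            await _next(context);
        }
        catch (Exception ex)
        {
            var correlationId = Guid.NewGuid().ToString();

            // Log the full exception (call stack, inner exceptions) for operators only.
            _logger.LogError(ex, "Unhandled exception. CorrelationId: {CorrelationId}", correlationId);

            // Return a generic error to the client; no stack traces, queries, or platform details.
            context.Response.StatusCode = StatusCodes.Status500InternalServerError;
            await context.Response.WriteAsync($"An unexpected error occurred. Reference: {correlationId}");
        }
    }
}

// Registered in Startup.Configure with: app.UseMiddleware<SafeExceptionMiddleware>();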
Is platform-specific information removed from server-client communication?
Do not reveal information about the application platform. Such information (for example, "X-Powered-By", "X-
ASPNET-VERSION") can get exposed through HTTP banners, HTTP headers, error messages, and website footers.
Malicious actors can use this information when mapping attack vectors of the application.
Consider using Azure CDN to separate the hosting platform from end users. Azure API Management offers
transformation policies that allow you to modify HTTP headers and remove sensitive information.
Are Azure policies used to control the configuration of the solution resources?
Use Azure Policy to deploy desired settings where applicable. Block resources that do not meet the proper security
requirements defined during service enablement.
It's important to be aware of the implications of using third-party frameworks and libraries in your application
code. These components can result in vulnerabilities. Here are some best practices:
Validate the security of any open-source code added to your application. Tools that can help this assessment:
OWASP Dependency-Check
NPM audit
Maintain a list of frameworks and libraries as part of the application inventory. Also, keep track of versions
in use.
Update frameworks and libraries as part of the application lifecycle. Prioritize critical security patches.
Next steps
Applications and services
Application classification
Application threat analysis
Regulatory compliance
Security health modeling
12/18/2020 • 2 minutes to read • Edit Online
Health modeling refers to the activities that maintain the security posture of a workload through monitoring. These
activities can highlight whether the current security practices are effective or whether there are new requirements.
Health modeling can be categorized as follows:
Key points
Monitor the workload and the infrastructure in which it runs.
Conduct audits.
Enable, acquire, and store audit logs.
Update and patch security fixes.
Respond to incidents.
In this section
ASSESSMENT | DESCRIPTION
How is security monitored in the application context? | Use Azure tools and services to monitor your security posture and also remediate incidents.
Is access to the control plane and data plane of the application periodically reviewed? | Monitor network conditions and identity-related risk events regularly.
Do you implement security practices and tools during the development lifecycle? | Activities related to enabling, acquiring, and storing audit logs for Azure services.
Are operational processes for incident response defined and tested? | Guidance for the central SecOps team for monitoring security-related telemetry data and investigating security breaches.
Next steps
We recommend applying as many of the best practices as early as possible, and then working to retrofit any
gaps over time as you mature your security program.
Optimize security investments
Assign stakeholders to use Secure Score in Azure Security Center to monitor risk profile and continuously improve
security posture.
Operationalize Azure Secure Score
Security monitoring tools in Azure
12/18/2020 • 3 minutes to read • Edit Online
Use security monitoring tools to monitor for anomalous behavior and enable investigation of incidents.
Use this article to learn how to strengthen your security posture using the following monitoring tools:
Azure Security Center - Prevent, detect, and respond to threats with increased visibility and control over the
security of your resources.
Azure Sentinel - Receive intelligent security analytics and threat intelligence across the enterprise.
Azure DDoS Protection - Defend against DDoS attacks.
Azure Rights Management (RMS) - Protect files and emails across multiple devices.
Azure Information Protection - Secure email, documents, and sensitive data that you share outside your
company.
Next steps
Security health modeling
Security logs and audits
Security operations in Azure
Check for identity, network, data risks
Security operations in Azure
12/18/2020 • 3 minutes to read • Edit Online
The responsibility of the security operations team (also known as the Security Operations Center (SOC), or SecOps) is
to rapidly detect, prioritize, and triage potential attacks. These operations help eliminate false positives and focus
on real attacks, reducing the mean time to remediate real incidents. The central SecOps team monitors security-
related telemetry data and investigates security breaches. It's important that any communication, investigation,
and hunting activities are aligned with the application team.
Tools
Here are some Azure tools that a SOC team can use to investigate and remediate incidents.
TOOL | PURPOSE
Azure Information Protection | Secure email, documents, and sensitive data that you share outside your company.
Investigation practices should use native tools with deep knowledge of the asset type such as an Endpoint
detection and response (EDR) solution, Identity tools, and Azure Sentinel.
For more information about monitoring tools, see Security monitoring tools in Azure.
Incident response
Make sure that a security contact receives Azure incident notifications from Microsoft.
In most cases, such notifications indicate that your resource is compromised and/or attacking another customer.
This enables your security operations team to rapidly respond to potential security risks and remediate them.
Ensure that the administrator contact information in the Azure enrollment portal will notify security operations
directly, or rapidly through an internal process.
Are operational processes for incident response defined and tested?
Actions executed during an incident and response investigation could impact application availability or
performance. Define these processes and align them with the responsible (and in most cases central) SecOps
team. The impact of such an investigation on the application has to be analyzed.
Are there tools to help incident responders quickly understand the application and components to
do an investigation?
Incident responders are part of a central SecOps team and need to understand security insights of an application.
Security playbook in Azure Sentinel can help to understand the security concepts and cover the typical
investigation activities.
Next steps
Security health modeling
Security tools
Security logs and audits
Check for identity, network, data risks
Security audits
12/18/2020 • 3 minutes to read • Edit Online
To make sure that the security posture doesn’t degrade over time, have regular auditing that checks compliance
with organizational standards. Enable, acquire, and store audit logs for Azure services.
Key points
Use approved time synchronization sources.
Configure central security log management.
Enable audit logging for Azure resources.
Collect security logs from operating systems.
Configure security log storage retention.
Monitor and review logs.
Enable alerts for anomalous activities.
Centralize anti-malware logging.
Enable DNS query logging.
Enable command-line audit logging.
Regularly review roles that have high privileges. Set up a recurring review pattern to ensure that accounts are
removed from permissions as roles change. Consider auditing at least twice a year.
As people in the organization and on the project change, make sure that only the right people have access to the
application infrastructure. Auditing and reviewing the access control reduces the attack vector to the application.
The Azure control plane depends on Azure AD. You can conduct the review manually or through an automated process
by using tools such as Azure AD access reviews. These reviews are often performed centrally as part of
internal or external audit activities.
Next steps
Security health modeling
Security operations in Azure
Check for identity, network, data risks
Security tools
Check for identity, network, data risks
12/18/2020 • 3 minutes to read • Edit Online
Along with identity-based access control, network-based access control is a top priority for protecting assets.
Monitor identity-related risk events using adaptive machine learning algorithms and heuristics, and act quickly
before an attacker can gain deeper access into the system. Also, monitor all communications between segments to
detect potential security threats flowing over the wire.
This article describes some considerations that can help monitor the workload for those risks.
Key points
View identity related risks in Azure Active Directory (Azure AD) reporting and Azure AD Identity Protection.
Obfuscate Personally Identifiable Information (PII) by using PII Detection cognitive skill.
Enable logs (including raw traffic) from your network devices.
Set alerts and gain access to real-time performance information at the packet level.
Monitor identity-related risk events on potentially compromised identities and remediate those risks. Review the
reported risk events in these ways:
Azure AD reporting. For information, see the users at risk security report and the risky sign-ins security report.
Use the reporting capabilities of Azure Active Directory Identity Protection.
Use the Identity Protection risk events API to get programmatic access to security detections by using
Microsoft Graph. See the riskDetection and riskyUser APIs. A minimal call is sketched after this list.
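As a rough sketch of the programmatic option above, the following lists risk detections from Microsoft Graph; the credential setup and the Graph permissions granted to the caller (for example, IdentityRiskEvent.Read.All) are assumptions.

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;
using Azure.Core;
using Azure.Identity;

class RiskDetectionReader
{
    static async Task<string> ListRiskDetectionsAsync()
    {
        // Acquire a token for Microsoft Graph; how you authenticate is an assumption.
        var credential = new DefaultAzureCredential();
        AccessToken token = await credential.GetTokenAsync(
            new TokenRequestContext(new[] { "https://graph.microsoft.com/.default" }));

        using var http = new HttpClient();
        http.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", token.Token);

        // riskDetections is part of the Identity Protection APIs in Microsoft Graph.
        var response = await http.GetAsync("https://graph.microsoft.com/v1.0/identityProtection/riskDetections");
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();   // JSON collection of detections
    }
}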
Azure AD uses adaptive machine learning algorithms, heuristics, and known compromised credentials
(username/password pairs) to detect suspicious actions that are related to your user accounts. These
username/password pairs come from monitoring the public and dark web and from working with security researchers,
law enforcement, security teams at Microsoft, and others.
Remediate risks by manually addressing each reported account or by setting up a user risk policy to require a
password change for high risk events.
Be cautious when logging sensitive application information. Don't store PII (contact information, payment
information, etc.) in any application logs. Apply protective measures, such as obfuscation. Machine learning tools like
Cognitive Search PII detection can help with this. For more information, see PII Detection cognitive skill.
Enable network visibility
One way to enable network visibility is by integrating network logs and analyzing the data to identify anomalies.
Based on those insights, you can choose to set alerts or block traffic crossing segmentation boundaries.
How do you monitor and diagnose conditions of the network?
Proactive monitoring
How do you gain access to real-time performance information at the packet level?
Take advantage of packet capture to set alerts and gain access to real-time performance information at the packet
level.
Packet capture tracks traffic in and out of virtual machines. It gives you the capability to run proactive captures
based on defined network anomalies including information about network intrusions.
For an example, see Scenario: Get alerts when VM is sending you more TCP segments than usual.
Next steps
Security health modeling
Security operations in Azure
Security tools
Security logs and audits
Governance, risk, and compliance
12/18/2020 • 22 minutes to read • Edit Online
Organizations of all sizes are constrained by their available resources: financial, people, and time. To achieve an
effective return on investment (ROI), organizations must prioritize where they will invest. Implementation of security
across the organization is also constrained by this, so to achieve an appropriate ROI on security the organization
needs to first understand and define its security priorities.
Governance: How is the organization’s security going to be monitored, audited, and reported? Design and
implementation of security controls within an organization is only the beginning of the story. How does the
organization know that things are actually working? Are they improving? Are there new requirements? Is there
mandatory reporting? Similar to compliance there may be external industry, government or regulatory standards
that need to be considered.
Risk: What types of risks does the organization face while trying to protect identifiable information, Intellectual
Property (IP), and financial information? Who may be interested in or could use this information if stolen, including
external and internal threats, whether unintentional or malicious? A commonly forgotten but extremely important
consideration within risk is addressing Disaster Recovery and Business Continuity.
Compliance: Are there specific industry, government, or regulatory requirements that dictate or provide
recommendations on criteria that your organization's security controls must meet? Examples of such standards,
organizations, controls, and legislation are ISO 27001, NIST, and PCI-DSS.
The collective role of organization(s) is to manage the security standards of the organization through their lifecycle:
Define - Set organizational standards and policies for practices, technologies, and configurations based on
internal factors (organizational culture, risk appetite, asset valuation, business initiatives, etc.) and external
factors (benchmarks, regulatory standards, threat environment, and more)
Improve – Continually push these standards incrementally forward towards the ideal state to ensure
continual risk reduction.
Sustain – Ensure the security posture doesn’t degrade naturally over time by instituting auditing and
monitoring compliance with organizational standards.
Monitor Secure Score and follow up with technical teams | Security Posture Management Team (or Vulnerability Management or Governance/Risk/Compliance team); Architecture Team; Responsible Technical Team (listed below)
NOTE: In the DevOps model, some application teams may be responsible for their own application resources.
The Azure Secure Score sample shows how to get your Azure Secure Score for a subscription by calling the
Azure Security Center REST API. The API methods provide the flexibility to query the data and build your own
reporting mechanism of your secure scores over time.
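As a rough sketch of such a call, the snippet below queries the Microsoft.Security/secureScores endpoint; the subscription ID, credential type, and api-version are assumptions, and the caller needs at least Reader access to the subscription.

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;
using Azure.Core;
using Azure.Identity;

class SecureScoreClient
{
    static async Task<string> GetSecureScoresAsync(string subscriptionId)
    {
        // Token for Azure Resource Manager; DefaultAzureCredential is one way to authenticate.
        var credential = new DefaultAzureCredential();
        AccessToken token = await credential.GetTokenAsync(
            new TokenRequestContext(new[] { "https://management.azure.com/.default" }));

        using var http = new HttpClient();
        http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", token.Token);

        var url = $"https://management.azure.com/subscriptions/{subscriptionId}" +
                  "/providers/Microsoft.Security/secureScores?api-version=2020-01-01";

        var response = await http.GetAsync(url);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();   // JSON with current and max scores
    }
}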
GROUP OR INDIVIDUAL ROLE | RESPONSIBILITY
Server Endpoint Security | Typically IT operations, security, or jointly. Monitor and remediate server security (patching, configuration, endpoint security, etc.).
Incident Monitoring and Response | Typically security operations team. Investigate and remediate security incidents in Security Information and Event Management (SIEM) or source console.
Policy Management | Typically GRC team + Architecture. Set direction for use of Role-Based Access Control (RBAC), Azure Security Center, administrator protection strategy, and Azure Policy to govern Azure resources.
Identity Security and Standards | Typically Security Team + Identity Team jointly. Set direction for Azure AD directories, PIM/PAM usage, MFA, password/synchronization configuration, and Application Identity Standards.
Penetration testing
Use Penetration Testing to validate security defenses.
Real-world validation of security defenses is critical to validate your defense strategy and implementation. This can
be accomplished by a penetration test (which simulates a one-time attack) or a red team program (which simulates a
persistent threat actor targeting your environment).
Follow the guidance published by Microsoft for planning and executing simulated attacks.
These design patterns are useful for building reliable, scalable, secure applications in the cloud.
Each pattern describes the problem that the pattern addresses, considerations for applying the pattern, and an
example based on Microsoft Azure. Most of the patterns include code samples or snippets that show how to
implement the pattern on Azure. However, most of the patterns are relevant to any distributed system, whether
hosted on Azure or on other cloud platforms.
Data Management
Data management is the key element of cloud
applications, and influences most of the quality attributes.
Data is typically hosted in different locations and across
multiple servers for reasons such as performance,
scalability or availability, and this can present a range of
challenges. For example, data consistency must be
maintained, and data will typically need to be
synchronized across different locations.
Resiliency
Resiliency is the ability of a system to gracefully handle
and recover from failures. The nature of cloud hosting,
where applications are often multitenant, use shared
platform services, compete for resources and bandwidth,
communicate over the Internet, and run on commodity
hardware means there is an increased likelihood that both
transient and more permanent faults will arise. Detecting
failures, and recovering quickly and efficiently, is necessary
to maintain resiliency.
Security
Security provides confidentiality, integrity, and availability
assurances against malicious attacks on information
systems (and safety assurances for attacks on operational
technology systems). Losing these assurances can
negatively impact your business operations and revenue,
as well as your organization’s reputation in the
marketplace. Maintaining security requires following well-
established practices (security hygiene) and being vigilant
to detect and rapidly remediate vulnerabilities and active
attacks.
Catalog of patterns
PATTERN | SUMMARY | CATEGORY
Backends for Frontends | Create separate backend services to be consumed by specific frontend applications or interfaces. | Design and Implementation
External Configuration Store | Move configuration information out of the application deployment package to a centralized location. | Design and Implementation; Management and Monitoring
Index Table | Create indexes over the fields in data stores that are frequently referenced by queries. | Data Management; Performance and Scalability
Pipes and Filters | Break down a task that performs complex processing into a series of separate elements that can be reused. | Design and Implementation; Messaging
Static Content Hosting | Deploy static content to a cloud-based storage service that can deliver them directly to the client. | Design and Implementation; Data Management; Performance and Scalability
Valet Key | Use a token or key that provides clients with restricted direct access to a specific resource or service. | Data Management; Security
Availability patterns
12/18/2020 • 2 minutes to read • Edit Online
Availability is measured as a percentage of uptime, and defines the proportion of time that a system is functional
and working. Availability is affected by system errors, infrastructure problems, malicious attacks, and system load.
Cloud applications typically provide users with a service level agreement (SLA), which means that applications
must be designed and implemented to maximize availability.
Queue-Based Load Leveling Use a queue that acts as a buffer between a task and a
service that it invokes, to smooth intermittent heavy loads.
To mitigate against availability risks from malicious Distributed Denial of Service (DDoS) attacks, implement the
native Azure DDoS protection standard service or a 3rd party capability.
Data Management patterns
12/18/2020 • 2 minutes to read • Edit Online
Data management is the key element of cloud applications, and influences most of the quality attributes. Data is
typically hosted in different locations and across multiple servers for reasons such as performance, scalability or
availability, and this can present a range of challenges. For example, data consistency must be maintained, and
data will typically need to be synchronized across different locations.
Additionally data should be protected at rest, in transit, and via authorized access mechanisms to maintain
security assurances of confidentiality, integrity, and availability. Refer to the Azure Security Benchmark Data
Protection Control for more information.
Event Sourcing Use an append-only store to record the full series of events
that describe actions taken on data in a domain.
Index Table Create indexes over the fields in data stores that are
frequently referenced by queries.
Materialized View Generate prepopulated views over the data in one or more
data stores when the data isn't ideally formatted for required
query operations.
Static Content Hosting Deploy static content to a cloud-based storage service that
can deliver them directly to the client.
Valet Key Use a token or key that provides clients with restricted direct
access to a specific resource or service.
Design and Implementation patterns
12/18/2020 • 2 minutes to read • Edit Online
Good design encompasses factors such as consistency and coherence in component design and deployment,
maintainability to simplify administration and development, and reusability to allow components and
subsystems to be used in other applications and in other scenarios. Decisions made during the design and
implementation phase have a huge impact on the quality and the total cost of ownership of cloud hosted
applications and services.
Pipes and Filters Break down a task that performs complex processing into a
series of separate elements that can be reused.
Static Content Hosting Deploy static content to a cloud-based storage service that
can deliver them directly to the client.
Cloud applications run in a remote datacenter where you do not have full control of the infrastructure or, in
some cases, the operating system. This can make management and monitoring more difficult than an on-
premises deployment. Applications must expose runtime information that administrators and operators can use
to manage and monitor the system, as well as supporting changing business requirements and customization
without requiring the application to be stopped or redeployed.
The distributed nature of cloud applications requires a messaging infrastructure that connects the components
and services, ideally in a loosely coupled manner in order to maximize scalability. Asynchronous messaging is
widely used, and provides many benefits, but also brings challenges such as the ordering of messages, poison
message management, idempotency, and more.
Claim Check Split a large message into a claim check and a payload to
avoid overwhelming a message bus.
Pipes and Filters Break down a task that performs complex processing into a
series of separate elements that can be reused.
Queue-Based Load Leveling Use a queue that acts as a buffer between a task and a
service that it invokes in order to smooth intermittent heavy
loads.
Performance is an indication of the responsiveness of a system to execute any action within a given time
interval, while scalability is the ability of a system either to handle increases in load without impact on performance
or for the available resources to be readily increased. Cloud applications typically encounter variable workloads
and peaks in activity. Predicting these, especially in a multi-tenant scenario, is almost impossible. Instead,
applications should be able to scale out within limits to meet peaks in demand, and scale in when demand
decreases. Scalability concerns not just compute instances, but other elements such as data storage, messaging
infrastructure, and more.
Event Sourcing Use an append-only store to record the full series of events
that describe actions taken on data in a domain.
Index Table Create indexes over the fields in data stores that are
frequently referenced by queries.
Materialized View Generate prepopulated views over the data in one or more
data stores when the data isn't ideally formatted for
required query operations.
Queue-Based Load Leveling Use a queue that acts as a buffer between a task and a
service that it invokes in order to smooth intermittent
heavy loads.
Static Content Hosting Deploy static content to a cloud-based storage service that
can deliver them directly to the client.
Resiliency is the ability of a system to gracefully handle and recover from failures, both inadvertent and
malicious.
The nature of cloud hosting, where applications are often multi-tenant, use shared platform services, compete
for resources and bandwidth, communicate over the Internet, and run on commodity hardware means there is
an increased likelihood that both transient and more permanent faults will arise. The connected nature of the
internet and the rise in sophistication and volume of attacks increase the likelihood of a security disruption.
Detecting failures and recovering quickly and efficiently is necessary to maintain resiliency.
Circuit Breaker Handle faults that might take a variable amount of time to
fix when connecting to a remote service or resource.
Queue-Based Load Leveling Use a queue that acts as a buffer between a task and a
service that it invokes in order to smooth intermittent heavy
loads.
Security resiliency
Achieving security resilience requires a combination of preventive measures to block attacks, responsive
measures to detect and quickly remediate active attacks, and governance to ensure consistent application of best
practices.
Security strategy should include the lessons learned described in the security strategy guidance.
Azure security configurations should align to the best practices and controls in the Azure Security
Benchmark (ASB). Security configurations for Azure services should align to the security baselines for Azure
in the ASB.
Azure architectures should integrate native security capabilities to protect and monitor workloads,
including Azure Defender, Azure DDoS Protection, Azure Firewall, and Azure Web Application Firewall (WAF).
For a more detailed discussion, see the Cybersecurity Resilience module in the CISO workshop.
Security patterns
12/18/2020 • 2 minutes to read • Edit Online
Security provides confidentiality, integrity, and availability assurances against malicious attacks on information
systems (and safety assurances for attacks on operational technology systems). Losing these assurances can
negatively impact your business operations and revenue, as well as your organization’s reputation in the
marketplace. Maintaining security requires following well-established practices (security hygiene) and being
vigilant to detect and rapidly remediate vulnerabilities and active attacks.
Patterns
PATTERN | SUMMARY
Valet Key Use a token or key that provides clients with restricted direct
access to a specific resource or service.
Security Strategy Guidance Building and updating a security strategy for cloud adoption
and modern threat environment
Security Roles and Responsibilities Guidance on security roles and responsibilities including
definitions of mission/outcome for each organizational
function and how each should evolve with the adoption of
cloud.
Getting Started Guide for Security Guidance for planning and implementing security throughout
cloud adoption
Ambassador pattern
12/18/2020 • 3 minutes to read • Edit Online
Create helper services that send network requests on behalf of a consumer service or application. An ambassador
service can be thought of as an out-of-process proxy that is co-located with the client.
This pattern can be useful for offloading common client connectivity tasks such as monitoring, logging, routing,
security (such as TLS), and resiliency patterns in a language-agnostic way. It is often used with legacy applications,
or other applications that are difficult to modify, in order to extend their networking capabilities. It can also enable
a specialized team to implement those features.
Solution
Put client frameworks and libraries into an external process that acts as a proxy between your application and
external services. Deploy the proxy on the same host environment as your application to allow control over
routing, resiliency, security features, and to avoid any host-related access restrictions. You can also use the
ambassador pattern to standardize and extend instrumentation. The proxy can monitor performance metrics such
as latency or resource usage, and this monitoring happens in the same host environment as the application.
Features that are offloaded to the ambassador can be managed independently of the application. You can update
and modify the ambassador without disturbing the application's legacy functionality. It also allows for separate,
specialized teams to implement and maintain security, networking, or authentication features that have been
moved to the ambassador.
Ambassador services can be deployed as a sidecar to accompany the lifecycle of a consuming application or
service. Alternatively, if an ambassador is shared by multiple separate processes on a common host, it can be
deployed as a daemon or Windows service. If the consuming service is containerized, the ambassador should be
created as a separate container on the same host, with the appropriate links configured for communication.
Example
The following diagram shows an application making a request to a remote service via an ambassador proxy. The
ambassador provides routing, circuit breaking, and logging. It calls the remote service and then returns the
response to the client application:
Related guidance
Sidecar pattern
Anti-Corruption Layer pattern
12/18/2020 • 2 minutes to read • Edit Online
Implement a façade or adapter layer between different subsystems that don't share the same semantics. This layer
translates requests that one subsystem makes to the other subsystem. Use this pattern to ensure that an
application's design is not limited by dependencies on outside subsystems. This pattern was first described by Eric
Evans in Domain-Driven Design.
Solution
Isolate the different subsystems by placing an anti-corruption layer between them. This layer translates
communications between the two systems, allowing one system to remain unchanged while the other can avoid
compromising its design and technological approach.
The diagram above shows an application with two subsystems. Subsystem A calls to subsystem B through an anti-
corruption layer. Communication between subsystem A and the anti-corruption layer always uses the data model
and architecture of subsystem A. Calls from the anti-corruption layer to subsystem B conform to that subsystem's
data model or methods. The anti-corruption layer contains all of the logic necessary to translate between the two
systems. The layer can be implemented as a component within the application or as an independent service.
Related guidance
Strangler Fig pattern
Asynchronous Request-Reply pattern
12/18/2020 • 9 minutes to read • Edit Online
Decouple backend processing from a frontend host, where backend processing needs to be asynchronous, but the
frontend still needs a clear response.
Solution
One solution to this problem is to use HTTP polling. Polling is useful to client-side code, as it can be hard to provide
callback endpoints or use long-running connections. Even when callbacks are possible, the extra libraries and
services that are required can sometimes add too much extra complexity.
The client application makes a synchronous call to the API, triggering a long-running operation on the
backend.
The API responds synchronously as quickly as possible. It returns an HTTP 202 (Accepted) status code,
acknowledging that the request has been received for processing.
NOTE
The API should validate both the request and the action to be performed before starting the long running process. If
the request is invalid, reply immediately with an error code such as HTTP 400 (Bad Request).
The response holds a location reference pointing to an endpoint that the client can poll to check for the
result of the long running operation.
The API offloads processing to another component, such as a message queue.
While the work is still pending, the status endpoint returns HTTP 202. Once the work is complete, the status
endpoint can either return a resource that indicates completion, or redirect to another resource URL. For
example, if the asynchronous operation creates a new resource, the status endpoint would redirect to the
URL for that resource.
The following diagram shows a typical flow:
1. The client sends a request and receives an HTTP 202 (Accepted) response.
2. The client sends an HTTP GET request to the status endpoint. The work is still pending, so this call also returns
HTTP 202.
3. At some point, the work is complete and the status endpoint returns 302 (Found) redirecting to the resource.
4. The client fetches the resource at the specified URL.
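A minimal client-side sketch of this flow follows; the work-submission URL is hypothetical, the two-second delay stands in for honoring a Retry-After header, and HttpClient's default redirect handling takes care of the final 302.

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class AsyncRequestReplyClient
{
    static async Task<string> SubmitAndWaitAsync()
    {
        using var http = new HttpClient();

        // 1. Submit the work; the API answers immediately with 202 Accepted and a Location header.
        var submit = await http.PostAsync("https://contoso.example/api/work", new StringContent("{}"));
        if (submit.StatusCode != HttpStatusCode.Accepted)
            throw new InvalidOperationException($"Unexpected status: {submit.StatusCode}");

        var statusUrl = submit.Headers.Location;

        // 2. Poll the status endpoint until the work is no longer pending (202).
        while (true)
        {
            var status = await http.GetAsync(statusUrl);
            if (status.StatusCode == HttpStatusCode.Accepted)
            {
                await Task.Delay(TimeSpan.FromSeconds(2));   // a real client should honor Retry-After
                continue;
            }

            // 3./4. Completed: the response is either the resource itself or a redirect to it,
            // which HttpClient follows automatically by default.
            status.EnsureSuccessStatusCode();
            return await status.Content.ReadAsStringAsync();
        }
    }
}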
Location | A URL the client should poll for a response status. | This URL could be a SAS token, with the Valet Key pattern being appropriate if this location needs access control. The valet key pattern is also valid when response polling needs offloading to another backend.
You may need to use a processing proxy or facade to manipulate the response headers or payload
depending on the underlying services used.
If the status endpoint redirects on completion, either HTTP 302 or HTTP 303 are appropriate return codes,
depending on the exact semantics you support.
Upon successful processing, the resource specified by the Location header should return an appropriate
HTTP response code such as 200 (OK), 201 (Created), or 204 (No Content).
If an error occurs during processing, persist the error at the resource URL described in the Location header
and ideally return an appropriate response code to the client from that resource (4xx code).
Not all solutions will implement this pattern in the same way and some services will include additional or
alternate headers. For example, Azure Resource Manager uses a modified variant of this pattern. For more
information, see Azure Resource Manager Async Operations.
Legacy clients might not support this pattern. In that case, you might need to place a facade over the
asynchronous API to hide the asynchronous processing from the original client. For example, Azure Logic
Apps supports this pattern natively and can be used as an integration layer between an asynchronous API and a
client that makes synchronous calls. See Perform long-running tasks with the webhook action pattern.
In some scenarios, you might want to provide a way for clients to cancel a long-running request. In that
case, the backend service must support some form of cancellation instruction.
Example
The following code shows excerpts from an application that uses Azure Functions to implement this pattern. There
are three functions in the solution:
The asynchronous API endpoint.
The status endpoint.
A backend function that takes queued work items and executes them.
// Build the status-endpoint URL that is returned to the client in the Location header.
string rqs = $"http://{Environment.GetEnvironmentVariable("WEBSITE_HOSTNAME")}/api/RequestStatus/{reqid}";

// Queue the work item for the background worker.
await OutMessage.AddAsync(m);
AsyncProcessingBackgroundWorker function
The AsyncProcessingBackgroundWorker function picks up the operation from the queue, does some work based on
the message payload, and writes the result to the SAS signature location.
AsyncOperationStatusChecker function
The AsyncOperationStatusChecker function implements the status endpoint. This function first checks whether the
request was completed.
If the request was completed, the function either returns a valet key in the response, or redirects the call
immediately to the valet-key URL.
If the request is still pending, it returns an HTTP 202 (Accepted) response with a self-referencing Location header,
putting an ETA for a completed response in the HTTP Retry-After header.
switch (OnPending)
{
    case OnPendingEnum.Accepted:
    {
        // Return an HTTP 202 status code.
        return (ActionResult)new AcceptedResult() { Location = rqs };
    }
    case OnPendingEnum.Synchronous:
    {
        // Back off and retry. Time out if the backoff period hits one minute.
        int backoff = 250;

        if (await inputBlob.ExistsAsync())
        {
            log.LogInformation($"Synchronous Redirect mode {thisGUID}.blob - completed after {backoff} ms");
            return await OnCompleted(OnComplete, inputBlob, thisGUID);
        }
        else
        {
            log.LogInformation($"Synchronous mode {thisGUID}.blob - NOT FOUND after timeout {backoff} ms");
            return (ActionResult)new NotFoundResult();
        }
    }
    default:
    {
        throw new InvalidOperationException($"Unexpected value: {OnPending}");
    }
}
}
}
case OnCompleteEnum.Stream:
{
// Download the file and return it directly to the caller.
// For larger files, use a stream to minimize RAM usage.
return (ActionResult)new OkObjectResult(await inputBlob.DownloadTextAsync());
}
default:
{
throw new InvalidOperationException($"Unexpected value: {OnComplete}");
}
}
}
}
public enum OnCompleteEnum
{
    Redirect,
    Stream
}

public enum OnPendingEnum
{
    Accepted,
    Synchronous
}
Backends for Frontends pattern
Create separate backend services to be consumed by specific frontend applications or interfaces. This pattern is
useful when you want to avoid customizing a single backend for multiple interfaces. This pattern was first
described by Sam Newman.
As the development activity focuses on the backend service, a separate team may be created to manage and
maintain the backend. Ultimately, this results in a disconnect between the interface and backend development
teams, placing a burden on the backend team to balance the competing requirements of the different UI teams.
When one interface team requires changes to the backend, those changes must be validated with other interface
teams before they can be integrated into the backend.
Solution
Create one backend per user interface. Fine-tune the behavior and performance of each backend to best match the
needs of the frontend environment, without worrying about affecting other frontend experiences.
Because each backend is specific to one interface, it can be optimized for that interface. As a result, it will be
smaller, less complex, and likely faster than a generic backend that tries to satisfy the requirements for all
interfaces. Each interface team has autonomy to control their own backend and doesn't rely on a centralized
backend development team. This gives the interface team flexibility in language selection, release cadence,
prioritization of workload, and feature integration in their backend.
For more information, see Pattern: Backends For Frontends.
Related guidance
Pattern: Backends For Frontends
Gateway Aggregation pattern
Gateway Offloading pattern
Gateway Routing pattern
Bulkhead pattern
12/18/2020 • 4 minutes to read • Edit Online
The Bulkhead pattern is a type of application design that is tolerant of failure. In a bulkhead architecture, elements
of an application are isolated into pools so that if one fails, the others will continue to function. It's named after the
sectioned partitions (bulkheads) of a ship's hull. If the hull of a ship is compromised, only the damaged section
fills with water, which prevents the ship from sinking.
Solution
Partition service instances into different groups, based on consumer load and availability requirements. This
design helps to isolate failures, and allows you to sustain service functionality for some consumers, even during a
failure.
A consumer can also partition resources, to ensure that resources used to call one service don't affect the
resources used to call another service. For example, a consumer that calls multiple services may be assigned a
connection pool for each service. If a service begins to fail, it only affects the connection pool assigned for that
service, allowing the consumer to continue using the other services.
The benefits of this pattern include:
Isolates consumers and services from cascading failures. An issue affecting a consumer or service can be
isolated within its own bulkhead, preventing the entire solution from failing.
Allows you to preserve some functionality in the event of a service failure. Other services and features of the
application will continue to work.
Allows you to deploy services that offer a different quality of service for consuming applications. A high-
priority consumer pool can be configured to use high-priority services.
The following diagram shows bulkheads structured around connection pools that call individual services. If
Service A fails or causes some other issue, the connection pool is isolated, so only workloads using the thread
pool assigned to Service A are affected. Workloads that use Service B and C are not affected and can continue
working without interruption.
The next diagram shows multiple clients calling a single service. Each client is assigned a separate service
instance. Client 1 has made too many requests and overwhelmed its instance. Because each service instance is
isolated from the others, the other clients can continue making calls.
Example
The following Kubernetes configuration file creates an isolated container to run a single service, with its own CPU
and memory resources and limits.
apiVersion: v1
kind: Pod
metadata:
name: drone-management
spec:
containers:
- name: drone-management-container
image: drone-service
resources:
requests:
memory: "64Mi"
cpu: "250m"
limits:
memory: "128Mi"
cpu: "1"
Related guidance
Designing reliable Azure applications
Circuit Breaker pattern
Retry pattern
Throttling pattern
Cache-Aside pattern
12/18/2020 • 7 minutes to read • Edit Online
Load data on demand into a cache from a data store. This can improve performance and also helps to maintain
consistency between data held in the cache and data in the underlying data store.
Solution
Many commercial caching systems provide read-through and write-through/write-behind operations. In these
systems, an application retrieves data by referencing the cache. If the data isn't in the cache, it's retrieved from the
data store and added to the cache. Any modifications to data held in the cache are automatically written back to
the data store as well.
For caches that don't provide this functionality, it's the responsibility of the applications that use the cache to
maintain the data.
An application can emulate the functionality of read-through caching by implementing the cache-aside strategy.
This strategy loads data into the cache on demand. The figure illustrates using the Cache-Aside pattern to store
data in the cache.
If an application updates information, it can follow the write-through strategy by making the modification to the
data store, and by invalidating the corresponding item in the cache.
When the item is next required, using the cache-aside strategy will cause the updated data to be retrieved from
the data store and added back into the cache.
Issues and considerations
Consider the following points when deciding how to implement this pattern:
Lifetime of cached data . Many caches implement an expiration policy that invalidates data and removes it
from the cache if it's not accessed for a specified period. For cache-aside to be effective, ensure that the expiration
policy matches the pattern of access for applications that use the data. Don't make the expiration period too short
because this can cause applications to continually retrieve data from the data store and add it to the cache.
Similarly, don't make the expiration period so long that the cached data is likely to become stale. Remember that
caching is most effective for relatively static data, or data that is read frequently.
Evicting data. Most caches have a limited size compared to the data store where the data originates, and they'll
evict data if necessary. Most caches adopt a least-recently-used policy for selecting items to evict, but this might
be customizable. Configure the global expiration property and other properties of the cache, and the expiration
property of each cached item, to ensure that the cache is cost effective. It isn't always appropriate to apply a
global eviction policy to every item in the cache. For example, if a cached item is very expensive to retrieve from
the data store, it can be beneficial to keep this item in the cache at the expense of more frequently accessed but
less costly items.
Priming the cache. Many solutions prepopulate the cache with the data that an application is likely to need as
part of the startup processing. The Cache-Aside pattern can still be useful if some of this data expires or is
evicted.
Consistency. Implementing the Cache-Aside pattern doesn't guarantee consistency between the data store and
the cache. An item in the data store can be changed at any time by an external process, and this change might not
be reflected in the cache until the next time the item is loaded. In a system that replicates data across data stores,
this problem can become serious if synchronization occurs frequently.
Local (in-memory) caching. A cache could be local to an application instance and stored in-memory. Cache-
aside can be useful in this environment if an application repeatedly accesses the same data. However, a local
cache is private and so different application instances could each have a copy of the same cached data. This data
could quickly become inconsistent between caches, so it might be necessary to expire data held in a private cache
and refresh it more frequently. In these scenarios, consider investigating the use of a shared or a distributed
caching mechanism.
Example
In Microsoft Azure you can use Azure Cache for Redis to create a distributed cache that can be shared by multiple
instances of an application.
The following code examples use the StackExchange.Redis client, which is a Redis client library written for .NET.
To connect to an Azure Cache for Redis instance, call the static ConnectionMultiplexer.Connect method and pass
in the connection string. The method returns a ConnectionMultiplexer that represents the connection. One
approach to sharing a ConnectionMultiplexer instance in your application is to have a static property that returns
a connected instance, similar to the following example. This approach provides a thread-safe way to initialize only
a single connected instance.
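A minimal sketch of such a property is shown below; it assumes the connection string is held in an application setting named CacheConnection (the setting name is illustrative) and relies on Lazy<T> for thread-safe, one-time initialization.

// Requires the StackExchange.Redis and System.Configuration assemblies.
// The "CacheConnection" setting name is an assumption for illustration.
private static readonly Lazy<ConnectionMultiplexer> lazyConnection =
    new Lazy<ConnectionMultiplexer>(() =>
    {
        string cacheConnection = ConfigurationManager.AppSettings["CacheConnection"];
        return ConnectionMultiplexer.Connect(cacheConnection);
    });

public static ConnectionMultiplexer Connection => lazyConnection.Value;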
The GetMyEntityAsync method in the following code example shows an implementation of the Cache-Aside
pattern. This method retrieves an object from the cache using the read-through approach.
An object is identified by using an integer ID as the key. The GetMyEntityAsync method tries to retrieve an item
with this key from the cache. If a matching item is found, it's returned. If there's no match in the cache, the
GetMyEntityAsync method retrieves the object from a data store, adds it to the cache, and then returns it. The
code that actually reads the data from the data store is not shown here, because it depends on the data store.
Note that the cached item is configured to expire to prevent it from becoming stale if it's updated elsewhere.
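A minimal sketch of that read path follows; MyEntity, the key format, the five-minute expiration, and the GetMyEntityFromDataStoreAsync helper are illustrative assumptions rather than the article's own listing.

// Requires StackExchange.Redis and Newtonsoft.Json.
private static async Task<MyEntity> GetMyEntityAsync(int id)
{
    // Define a unique key for this object, based on its ID.
    var key = $"MyEntity:{id}";
    var cache = Connection.GetDatabase();

    // Try to get the entity from the cache.
    var json = await cache.StringGetAsync(key);
    var value = json.HasValue ? JsonConvert.DeserializeObject<MyEntity>(json) : null;

    if (value == null) // Cache miss
    {
        // Retrieve the entity from the original data store. That code isn't shown
        // because it depends on the data store in use.
        value = await GetMyEntityFromDataStoreAsync(id);

        // Add the entity to the cache with an expiration so it doesn't become stale.
        if (value != null)
        {
            await cache.StringSetAsync(key, JsonConvert.SerializeObject(value), TimeSpan.FromMinutes(5));
        }
    }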
return value;
}
The examples use Azure Cache for Redis to access the store and retrieve information from the cache. For
more information, see Using Azure Cache for Redis and How to create a Web App with Azure Cache for Redis.
The UpdateEntityAsync method shown below demonstrates how to invalidate an object in the cache when the
value is changed by the application. The code updates the original data store and then removes the cached item
from the cache.
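A minimal sketch of that update path follows, assuming an UpdateEntityInDataStoreAsync helper and the same illustrative key format as the read path.

public async Task UpdateEntityAsync(MyEntity entity)
{
    // Update the object in the original data store first.
    await UpdateEntityInDataStoreAsync(entity);

    // Then invalidate the cached item so the next read repopulates it from the store.
    var cache = Connection.GetDatabase();
    await cache.KeyDeleteAsync($"MyEntity:{entity.Id}");
}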
NOTE
The order of the steps is important. Update the data store before removing the item from the cache. If you remove the
cached item first, there is a small window of time when a client might fetch the item before the data store is updated. That
will result in a cache miss (because the item was removed from the cache), causing the earlier version of the item to be
fetched from the data store and added back into the cache. The result will be stale cache data.
Related guidance
The following information may be relevant when implementing this pattern:
Caching Guidance. Provides additional information on how you can cache data in a cloud solution, and the
issues that you should consider when you implement a cache.
Data Consistency Primer. Cloud applications typically use data that's spread across data stores. Managing
and maintaining data consistency in this environment is a critical aspect of the system, particularly the
concurrency and availability issues that can arise. This primer describes issues about consistency across
distributed data, and summarizes how an application can implement eventual consistency to maintain the
availability of data.
Choreography
12/18/2020 • 7 minutes to read • Edit Online
Have each component of the system participate in the decision-making process about the workflow of a business
transaction, instead of relying on a central point of control.
Solution
Let each service decide when and how a business operation is processed, instead of depending on a central
orchestrator.
One way to implement choreography is to use the asynchronous messaging pattern to coordinate the business
operations.
A client request publishes messages to a message queue. As messages arrive, they are pushed to the subscribers, or
services, interested in that message. Each subscribed service performs its operation as indicated by the message
and responds to the message queue with the success or failure of the operation. In case of success, the service can
push a message back to the same queue or a different message queue so that another service can continue the
workflow if needed. If an operation fails, the message bus can retry that operation.
This way, the services choreograph the workflow among themselves without depending on an orchestrator or
having direct communication between them.
Because there isn't point-to-point communication, this pattern helps reduce coupling between services. Also, it
can remove the performance bottleneck caused by the orchestrator when it has to deal with all transactions.
Example
This example shows the choreography pattern with the Drone Delivery app. When a client requests a pickup, the
app assigns a drone and notifies the client.
A single client business transaction requires three distinct business operations: creating or updating a package,
assigning a drone to deliver the package, and checking the delivery status. Those operations are performed by
three microservices: Package, Drone Scheduler, and Delivery services. Instead of a central orchestrator, the
services use messaging to collaborate and coordinate the request among themselves.
Design
The business transaction is processed in a sequence through multiple hops. Each hop has a message bus and the
respective business service.
When a client sends a delivery request through an HTTP endpoint, the Ingestion service receives it, raises an
operation event, and sends it to a message bus. The bus invokes the subscribed business service and sends the
event in a POST request. On receiving the event, the business service can complete the operation with success,
failure, or the request can time out. If successful, the service responds to the bus with the Ok status code, raises a
new operation event, and sends it to the message bus of the next hop. In case of a failure or time-out, the service
reports failure by sending the BadRequest code to the message bus that sent the original POST request. The
message bus retries the operation based on a retry policy. After that retry period expires, the message bus flags the
failed operation, and further processing of the entire request stops.
This workflow continues until the entire request has been processed.
The design uses multiple message buses to process the entire business transaction. Microsoft Azure Event Grid
provides the messaging service. The app is deployed in an Azure Kubernetes Service (AKS) cluster with two
containers in the same pod. One container runs the ambassador that interacts with Event Grid while the other
runs a business service. The approach with two containers in the same pod improves performance and scalability.
The ambassador and the business service share the same network allowing for low latency and high throughput.
To avoid cascading retry operations that might lead to duplicate work, only Event Grid retries an operation,
not the business service. Event Grid flags a failed request by sending a message to a dead letter queue (DLQ).
The business services are idempotent to make sure retry operations don’t result in duplicate resources. For
example, the Package service uses upsert operations to add data to the data store.
The example implements a custom solution to correlate calls across all services and Event Grid hops.
Here’s a code example that shows the choreography pattern between all business services. It shows the workflow
of the Drone Delivery app transactions. Code for exception handling and logging has been removed for brevity.
[HttpPost]
[Route("/api/[controller]/operation")]
[ProducesResponseType(typeof(void), 200)]
[ProducesResponseType(typeof(void), 400)]
[ProducesResponseType(typeof(void), 500)]
// Controller action signature assumed for illustration.
public async Task<IActionResult> Post([FromBody] EventGridEvent[] events)
{
    if (events == null)
    {
        return BadRequest("No Event for Choreography");
    }

    foreach (var e in events)
    {
        // Dispatch each Event Grid event to the business operation it represents,
        // then report success or failure back to the message bus.
        ...
    }
}
Circuit Breaker pattern
Handle faults that might take a variable amount of time to recover from, when connecting to a remote service
or resource. This can improve the stability and resiliency of an application.
Solution
The Circuit Breaker pattern, popularized by Michael Nygard in his book Release It!, can prevent an application
from repeatedly trying to execute an operation that's likely to fail, allowing it to continue without waiting for
the fault to be fixed or wasting CPU cycles while it determines that the fault is long lasting. The Circuit Breaker
pattern also enables an application to detect whether the fault has been resolved. If the problem appears to
have been fixed, the application can try to invoke the operation.
The purpose of the Circuit Breaker pattern is different from that of the Retry pattern. The Retry pattern enables an
application to retry an operation in the expectation that it'll succeed. The Circuit Breaker pattern prevents
an application from performing an operation that is likely to fail. An application can combine these two
patterns by using the Retry pattern to invoke an operation through a circuit breaker. However, the retry
logic should be sensitive to any exceptions returned by the circuit breaker and abandon retry attempts if
the circuit breaker indicates that a fault is not transient.
A circuit breaker acts as a proxy for operations that might fail. The proxy should monitor the number of recent
failures that have occurred, and use this information to decide whether to allow the operation to proceed, or
simply return an exception immediately.
The proxy can be implemented as a state machine with the following states that mimic the functionality of an
electrical circuit breaker:
Closed: The request from the application is routed to the operation. The proxy maintains a count of the
number of recent failures, and if the call to the operation is unsuccessful the proxy increments this
count. If the number of recent failures exceeds a specified threshold within a given time period, the
proxy is placed into the Open state. At this point the proxy starts a timeout timer, and when this timer
expires the proxy is placed into the Half-Open state.
The purpose of the timeout timer is to give the system time to fix the problem that caused the
failure before allowing the application to try to perform the operation again.
Open: The request from the application fails immediately and an exception is returned to the
application.
Half-Open: A limited number of requests from the application are allowed to pass through and invoke
the operation. If these requests are successful, it's assumed that the fault that was previously causing
the failure has been fixed and the circuit breaker switches to the Closed state (the failure counter is
reset). If any request fails, the circuit breaker assumes that the fault is still present so it reverts back to
the Open state and restarts the timeout timer to give the system a further period of time to recover
from the failure.
The Half-Open state is useful to prevent a recovering service from suddenly being flooded with
requests. As a service recovers, it might be able to support a limited volume of requests until the
recovery is complete, but while recovery is in progress a flood of work can cause the service to
time out or fail again.
In the figure, the failure counter used by the Closed state is time based. It's automatically reset at periodic
intervals. This helps to prevent the circuit breaker from entering the Open state if it experiences occasional
failures. The failure threshold that trips the circuit breaker into the Open state is only reached when a
specified number of failures have occurred during a specified interval. The counter used by the Half-Open
state records the number of successful attempts to invoke the operation. The circuit breaker reverts to the
Closed state after a specified number of consecutive operation invocations have been successful. If any
invocation fails, the circuit breaker enters the Open state immediately and the success counter will be reset
the next time it enters the Half-Open state.
How the system recovers is handled externally, possibly by restoring or restarting a failed component or
repairing a network connection.
The Circuit Breaker pattern provides stability while the system recovers from a failure and minimizes the
impact on performance. It can help to maintain the response time of the system by quickly rejecting a request
for an operation that's likely to fail, rather than waiting for the operation to time out, or never return. If the
circuit breaker raises an event each time it changes state, this information can be used to monitor the health of
the part of the system protected by the circuit breaker, or to alert an administrator when a circuit breaker trips
to the Open state.
The pattern is customizable and can be adapted according to the type of the possible failure. For example, you
can apply an increasing timeout timer to a circuit breaker. You could place the circuit breaker in the Open state
for a few seconds initially, and then if the failure hasn't been resolved increase the timeout to a few minutes,
and so on. In some cases, rather than the Open state returning failure and raising an exception, it could be
useful to return a default value that is meaningful to the application.
NOTE
A service can return HTTP 429 (Too Many Requests) if it is throttling the client, or HTTP 503 (Service Unavailable) if the
service is not currently available. The response can include additional information, such as the anticipated duration of
the delay.
Replaying Failed Requests. In the Open state, rather than simply failing quickly, a circuit breaker could also
record the details of each request to a journal and arrange for these requests to be replayed when the remote
resource or service becomes available.
Inappropriate Timeouts on External Services. A circuit breaker might not be able to fully protect
applications from operations that fail in external services that are configured with a lengthy timeout period. If
the timeout is too long, a thread running a circuit breaker might be blocked for an extended period before the
circuit breaker indicates that the operation has failed. In this time, many other application instances might also
try to invoke the service through the circuit breaker and tie up a significant number of threads before they all
fail.
Example
In a web application, several of the pages are populated with data retrieved from an external service. If the
system implements minimal caching, most hits to these pages will cause a round trip to the service.
Connections from the web application to the service could be configured with a timeout period (typically 60
seconds), and if the service doesn't respond in this time the logic in each web page will assume that the
service is unavailable and throw an exception.
However, if the service fails and the system is very busy, users could be forced to wait for up to 60 seconds
before an exception occurs. Eventually resources such as memory, connections, and threads could be
exhausted, preventing other users from connecting to the system, even if they aren't accessing pages that
retrieve data from the service.
Scaling the system by adding further web servers and implementing load balancing might delay when
resources become exhausted, but it won't resolve the issue because user requests will still be unresponsive
and all web servers could still eventually run out of resources.
Wrapping the logic that connects to the service and retrieves the data in a circuit breaker could help to solve
this problem and handle the service failure more elegantly. User requests will still fail, but they'll fail more
quickly and the resources won't be blocked.
The CircuitBreaker class maintains state information about a circuit breaker in an object that implements the
ICircuitBreakerStateStore interface shown in the following code.
interface ICircuitBreakerStateStore
{
    CircuitBreakerStateEnum State { get; }
    Exception LastException { get; }
    DateTime LastStateChangedDateUtc { get; }
    bool IsClosed { get; }
    void Trip(Exception ex);
    void Reset();
    void HalfOpen();
}
The State property indicates the current state of the circuit breaker, and will be either Open, HalfOpen, or
Closed as defined by the CircuitBreakerStateEnum enumeration. The IsClosed property should be true if the
circuit breaker is closed, but false if it's open or half open. The Trip method switches the state of the circuit
breaker to the open state and records the exception that caused the change in state, together with the date and
time that the exception occurred. The LastException and the LastStateChangedDateUtc properties return this
information. The Reset method closes the circuit breaker, and the HalfOpen method sets the circuit breaker
to half open.
The InMemoryCircuitBreakerStateStore class in the example contains an implementation of the
ICircuitBreakerStateStore interface. The CircuitBreaker class creates an instance of this class to hold the
state of the circuit breaker.
The ExecuteAction method in the CircuitBreaker class wraps an operation, specified as an Action delegate.
If the circuit breaker is closed, ExecuteAction invokes the Action delegate. If the operation fails, an exception
handler calls TrackException , which sets the circuit breaker state to open. The following code example
highlights this flow.
public class CircuitBreaker
{
private readonly ICircuitBreakerStateStore stateStore =
CircuitBreakerStateStoreFactory.GetCircuitBreakerStateStore();
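A minimal sketch of the closed-state path of ExecuteAction follows; the exact signature is an assumption, and TrackException is the handler described above that records the failure and sets the state to open.

public void ExecuteAction(Action action)
{
    if (IsOpen)
    {
        // The circuit breaker isn't closed; that path is shown in the next example.
        ...
    }

    // The circuit breaker is Closed: invoke the Action delegate.
    try
    {
        action();
    }
    catch (Exception ex)
    {
        // Record the failure, which sets the circuit breaker state to Open,
        // then rethrow so the caller can handle the exception.
        this.TrackException(ex);
        throw;
    }
}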
The following example shows the code (omitted from the previous example) that is executed if the circuit
breaker isn't closed. It first checks if the circuit breaker has been open for a period longer than the time
specified by the local OpenToHalfOpenWaitTime field in the CircuitBreaker class. If this is the case, the
ExecuteAction method sets the circuit breaker to half open, then tries to perform the operation specified by
the Action delegate.
If the operation is successful, the circuit breaker is reset to the closed state. If the operation fails, it is tripped
back to the open state and the time the exception occurred is updated so that the circuit breaker will wait for a
further period before trying to perform the operation again.
If the circuit breaker has only been open for a short time, less than the OpenToHalfOpenWaitTime value, the
ExecuteAction method simply throws a CircuitBreakerOpenException exception and returns the error that
caused the circuit breaker to transition to the open state.
Additionally, it uses a lock to prevent the circuit breaker from trying to perform concurrent calls to the
operation while it's half open. A concurrent attempt to invoke the operation will be handled as if the circuit
breaker was open, and it'll fail with an exception as described later.
...
if (IsOpen)
{
// The circuit breaker is Open. Check if the Open timeout has expired.
// If it has, set the state to HalfOpen. Another approach might be to
// check for the HalfOpen state that had been set by some other operation.
if (stateStore.LastStateChangedDateUtc + OpenToHalfOpenWaitTime < DateTime.UtcNow)
{
// The Open timeout has expired. Allow one operation to execute. Note that, in
// this example, the circuit breaker is set to HalfOpen after being
// in the Open state for some period of time. An alternative would be to set
// this using some other approach such as a timer, test method, manually, and
// so on, and check the state here to determine how to handle execution
// of the action.
// Limit the number of threads to be executed when the breaker is HalfOpen.
// An alternative would be to use a more complex approach to determine which
// threads or how many are allowed to execute, or to execute a simple test
// method instead.
bool lockTaken = false;
try
{
Monitor.TryEnter(halfOpenSyncObject, ref lockTaken);
if (lockTaken)
{
// Set the circuit breaker state to HalfOpen.
stateStore.HalfOpen();
// Try to perform the operation specified by the Action delegate.
action();
// If this action succeeds, reset the state and allow other operations.
// In reality, instead of immediately returning to the Closed state, a counter
// here would record the number of successful operations and return the
// circuit breaker to the Closed state only after a specified number succeed.
this.stateStore.Reset();
return;
}
}
catch (Exception ex)
{
// If there's still an exception, trip the breaker again immediately.
this.stateStore.Trip(ex);
// Throw the exception so that the caller knows which exception occurred.
throw;
}
finally
{
if (lockTaken)
{
Monitor.Exit(halfOpenSyncObject);
}
}
}
// The Open timeout hasn't yet expired. Throw a CircuitBreakerOpen exception to
// inform the caller that the call was not actually attempted,
// and return the most recent exception received.
throw new CircuitBreakerOpenException(stateStore.LastException);
}
...
try
{
breaker.ExecuteAction(() =>
{
// Operation protected by the circuit breaker.
...
});
}
catch (CircuitBreakerOpenException ex)
{
// Perform some different action when the breaker is open.
// Last exception details are in the inner exception.
...
}
catch (Exception ex)
{
...
}
Claim-Check pattern
Split a large message into a claim check and a payload. Send the claim check to the messaging platform and store
the payload in an external service. This pattern allows large messages to be processed, while protecting the
message bus and the client from being overwhelmed or slowed down. This pattern also helps to reduce costs, as
storage is usually cheaper than the resource units used by the messaging platform.
This pattern is also known as Reference-Based Messaging, and was originally described in the book Enterprise
Integration Patterns, by Gregor Hohpe and Bobby Woolf.
Solution
Store the entire message payload into an external service, such as a database. Get the reference to the stored
payload, and send just that reference to the message bus. The reference acts like a claim check used to retrieve a
piece of luggage, hence the name of the pattern. Clients interested in processing that specific message can use the
obtained reference to retrieve the payload, if needed.
Examples
On Azure, this pattern can be implemented in several ways and with different technologies, but there are two main
categories. In both cases, the receiver has the responsibility to read the claim check and use it to retrieve the
payload.
Automatic claim-check generation . This approach uses Azure Event Grid to automatically generate the
claim check and push it into the message bus.
Manual claim-check generation . In this approach, the sender is responsible for managing the payload.
The sender stores the payload using the appropriate service, gets or generates the claim check, and sends
the claim check to the message bus.
Event Grid is an event routing service and tries to deliver events within a configurable amount of time up to 24
hours. After that, events are either discarded or dead lettered. If you need to archive the event payloads or replay
the event stream, you can add an Event Grid subscription to Event Hubs or Queue Storage, where messages can be
retained for longer periods and archiving messages is supported. For information about fine tuning Event Grid
message delivery and retry, and dead letter configuration, see Dead letter and retry policies.
Automatic claim-check generation with Blob Storage and Event Grid
In this approach, the sender drops the message payload into a designated Azure Blob Storage container. Event Grid
automatically generates a tag/reference and sends it to a supported message bus, such as Azure Storage Queues.
The receiver can poll the queue, get the message, and then use the stored reference data to download the payload
directly from Blob Storage.
The same Event Grid message can be directly consumed by Azure Functions, without needing to go through a
message bus. This approach takes full advantage of the serverless nature of both Event Grid and Functions.
You can find example code for this approach here.
Event Grid with Event Hubs
Similar to the previous example, Event Grid automatically generates a message when a payload is written to an
Azure Blob container. But in this example, the message bus is implemented using Event Hubs. A client can register
itself to receive the stream of messages as they are written to the event hub. The event hub can also be configured
to archive messages, making them available as an Avro file that can be queried using tools like Apache Spark,
Apache Drill, or any of the available Avro libraries.
You can find example code for this approach here.
Claim check generation with Service Bus
This solution takes advantage of a specific Service Bus plugin, ServiceBus.AttachmentPlugin, which makes the
claim-check workflow easy to implement. The plugin converts any message body into an attachment that gets
stored in Azure Blob Storage when the message is sent.
using ServiceBus.AttachmentPlugin;
...
// Create the plugin configuration from a Storage connection string.
// serviceBusConnectionString, queueName, and storageConnectionString are assumed to be defined.
var config = new AzureStorageAttachmentConfiguration(storageConnectionString);

// Creating and registering the sender using Service Bus Connection String and Queue Name
var sender = new MessageSender(serviceBusConnectionString, queueName);
sender.RegisterAzureStorageAttachmentPlugin(config);

// Create payload
var payload = new { data = "random data string for testing" };
var serialized = JsonConvert.SerializeObject(payload);
var payloadAsBytes = Encoding.UTF8.GetBytes(serialized);
var message = new Message(payloadAsBytes);

// Send the message; the plugin moves the body to Blob Storage and sends the claim check.
await sender.SendAsync(message);
The Service Bus message acts as a notification queue, which a client can subscribe to. When the consumer receives
the message, the plugin makes it possible to directly read the message data from Blob Storage. You can then
choose how to process the message further. An advantage of this approach is that it abstracts the claim-check
workflow from the sender and receiver.
You can find example code for this approach here.
Manual claim-check generation with Kafka
In this example, a Kafka client writes the payload to Azure Blob Storage. Then it sends a notification message using
Kafka-enabled Event Hubs. The consumer receives the message and can access the payload from Blob Storage.
This example shows how a different messaging protocol can be used to implement the claim-check pattern in
Azure. For example, you might need to support existing Kafka clients.
You can find example code for this approach here.
CQRS pattern
The Command and Query Responsibility Segregation (CQRS) pattern separates read and update operations for a
data store. Implementing CQRS in your application can maximize its performance, scalability, and security. The
flexibility created by migrating to CQRS allows a system to better evolve over time and prevents update
commands from causing merge conflicts at the domain level.
The problem
In traditional architectures, the same data model is used to query and update a database. That's simple and works
well for basic CRUD operations. In more complex applications, however, this approach can become unwieldy. For
example, on the read side, the application may perform many different queries, returning data transfer objects
(DTOs) with different shapes. Object mapping can become complicated. On the write side, the model may
implement complex validation and business logic. As a result, you can end up with an overly complex model that
does too much.
Read and write workloads are often asymmetrical, with very different performance and scale requirements.
There is often a mismatch between the read and write representations of the data, such as additional
columns or properties that must be updated correctly even though they aren't required as part of an
operation.
Data contention can occur when operations are performed in parallel on the same set of data.
The traditional approach can have a negative effect on performance due to load on the data store and data
access layer, and the complexity of queries required to retrieve information.
Managing security and permissions can become complex, because each entity is subject to both read and
write operations, which might expose data in the wrong context.
Solution
CQRS separates reads and writes into different models, using commands to update data, and queries to read
data.
Commands should be task based, rather than data centric. ("Book hotel room", not "set ReservationStatus to
Reserved").
Commands may be placed on a queue for asynchronous processing, rather than being processed
synchronously.
Queries never modify the database. A query returns a DTO that does not encapsulate any domain knowledge.
The models can then be isolated, as shown in the following diagram, although that's not an absolute requirement.
Having separate query and update models simplifies the design and implementation. However, one disadvantage
is that CQRS code can't automatically be generated from a database schema using scaffolding mechanisms such
as O/RM tools.
For greater isolation, you can physically separate the read data from the write data. In that case, the read database
can use its own data schema that is optimized for queries. For example, it can store a materialized view of the
data, in order to avoid complex joins or complex O/RM mappings. It might even use a different type of data store.
For example, the write database might be relational, while the read database is a document database.
If separate read and write databases are used, they must be kept in sync. Typically this is accomplished by having
the write model publish an event whenever it updates the database. Updating the database and publishing the
event must occur in a single transaction.
The read store can be a read-only replica of the write store, or the read and write stores can have a different
structure altogether. Using multiple read-only replicas can increase query performance, especially in distributed
scenarios where read-only replicas are located close to the application instances.
Separation of the read and write stores also allows each to be scaled appropriately to match the load. For
example, read stores typically encounter a much higher load than write stores.
Some implementations of CQRS use the Event Sourcing pattern. With this pattern, application state is stored as a
sequence of events. Each event represents a set of changes to the data. The current state is constructed by
replaying the events. In a CQRS context, one benefit of Event Sourcing is that the same events can be used to
notify other components — in particular, to notify the read model. The read model uses the events to create a
snapshot of the current state, which is more efficient for queries. However, Event Sourcing adds complexity to the
design.
Benefits of CQRS include:
Independent scaling . CQRS allows the read and write workloads to scale independently, and may result in
fewer lock contentions.
Optimized data schemas . The read side can use a schema that is optimized for queries, while the write side
uses a schema that is optimized for updates.
Security . It's easier to ensure that only the right domain entities are performing writes on the data.
Separation of concerns . Segregating the read and write sides can result in models that are more
maintainable and flexible. Most of the complex business logic goes into the write model. The read model can
be relatively simple.
Simpler queries . By storing a materialized view in the read database, the application can avoid complex joins
when querying.
Example
The following code shows some extracts from an example of a CQRS implementation that uses different
definitions for the read and the write models. The model interfaces don't dictate any features of the underlying
data stores, and they can evolve and be fine-tuned independently because these interfaces are separated.
The following code shows the read model definition.
// Query interface
namespace ReadModel
{
public interface ProductsDao
{
ProductDisplay FindById(int productId);
ICollection<ProductDisplay> FindByName(string name);
ICollection<ProductInventory> FindOutOfStockProducts();
ICollection<ProductDisplay> FindRelatedProducts(int productId);
}
The system allows users to rate products. The application code does this using the RateProduct command shown
in the following code.
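A minimal sketch of such a command, with assumed property names and an assumed ICommand interface, might look like the following.

public interface ICommand
{
    Guid Id { get; }
}

public class RateProduct : ICommand
{
    public RateProduct() => Id = Guid.NewGuid();

    public Guid Id { get; set; }
    public int ProductId { get; set; }
    public int Rating { get; set; }
    public int UserId { get; set; }
}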
The system uses the ProductsCommandHandler class to handle commands sent by the application. Clients typically
send commands to the domain through a messaging system such as a queue. The command handler accepts
these commands and invokes methods of the domain interface. The granularity of each command is designed to
reduce the chance of conflicting requests. The following code shows an outline of the ProductsCommandHandler
class.
public class ProductsCommandHandler :
ICommandHandler<AddNewProduct>,
ICommandHandler<RateProduct>,
ICommandHandler<AddToInventory>,
ICommandHandler<ConfirmItemShipped>,
ICommandHandler<UpdateStockFromInventoryRecount>
{
private readonly IRepository<Product> repository;
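
    // Sketch of one handler method. The remaining handlers are omitted, and the
    // repository members (Find, Save) and the product's RateProduct method are
    // assumptions used only for illustration.
    void ICommandHandler<RateProduct>.Handle(RateProduct command)
    {
        var product = repository.Find(command.ProductId);
        if (product != null)
        {
            product.RateProduct(command.UserId, command.Rating);
            repository.Save(product);
        }
    }
}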
Compensating Transaction pattern
Undo the work performed by a series of steps, which together define an eventually consistent operation, if one
or more of the steps fail. Operations that follow the eventual consistency model are commonly found in cloud-
hosted applications that implement complex business processes and workflows.
The Data Consistency Primer provides information about why distributed transactions don't scale well, and
the principles of the eventual consistency model.
A challenge in the eventual consistency model is how to handle a step that has failed. In this case it might be
necessary to undo all of the work completed by the previous steps in the operation. However, the data can't
simply be rolled back because other concurrent instances of the application might have changed it. Even in
cases where the data hasn't been changed by a concurrent instance, undoing a step might not simply be a
matter of restoring the original state. It might be necessary to apply various business-specific rules (see the
travel website described in the Example section).
If an operation that implements eventual consistency spans several heterogeneous data stores, undoing the
steps in the operation will require visiting each data store in turn. The work performed in every data store must
be undone reliably to prevent the system from remaining inconsistent.
Not all data affected by an operation that implements eventual consistency might be held in a database. In a
service oriented architecture (SOA) environment an operation could invoke an action in a service, and cause a
change in the state held by that service. To undo the operation, this state change must also be undone. This can
involve invoking the service again and performing another action that reverses the effects of the first.
Solution
The solution is to implement a compensating transaction. The steps in a compensating transaction must undo
the effects of the steps in the original operation. A compensating transaction might not be able to simply
replace the current state with the state the system was in at the start of the operation because this approach
could overwrite changes made by other concurrent instances of an application. Instead, it must be an intelligent
process that takes into account any work done by concurrent instances. This process will usually be application
specific, driven by the nature of the work performed by the original operation.
A common approach is to use a workflow to implement an eventually consistent operation that requires
compensation. As the original operation proceeds, the system records information about each step and how the
work performed by that step can be undone. If the operation fails at any point, the workflow rewinds back
through the steps it's completed and performs the work that reverses each step. Note that a compensating
transaction might not have to undo the work in the exact reverse order of the original operation, and it might be
possible to perform some of the undo steps in parallel.
This approach is similar to the Sagas strategy discussed in Clemens Vasters’ blog.
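As a purely illustrative sketch (none of these types come from the guidance itself), a workflow that records an undo action alongside each completed step might look like this:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public sealed class CompensatingWorkflow
{
    // Undo actions for completed steps, most recent first.
    private readonly Stack<Func<Task>> compensations = new Stack<Func<Task>>();

    public async Task RunStepAsync(Func<Task> doStep, Func<Task> undoStep)
    {
        await doStep();               // perform the step
        compensations.Push(undoStep); // record how to undo it
    }

    public async Task CompensateAsync()
    {
        // Rewind through the completed steps. Each undo action should be an
        // idempotent command so it can be retried if compensation itself fails.
        while (compensations.Count > 0)
        {
            await compensations.Pop()();
        }
    }
}

In the travel example later in this article, the undo action recorded for booking a seat on flight F1 would be the counter operation that cancels that booking, applying any business-specific refund rules.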
A compensating transaction is also an eventually consistent operation and it could also fail. The system should
be able to resume the compensating transaction at the point of failure and continue. It might be necessary to
repeat a step that's failed, so the steps in a compensating transaction should be defined as idempotent
commands. For more information, see Idempotency Patterns on Jonathan Oliver’s blog.
In some cases it might not be possible to recover from a step that has failed except through manual
intervention. In these situations the system should raise an alert and provide as much information as possible
about the reason for the failure.
Many of the challenges of implementing a compensating transaction are the same as those with
implementing eventual consistency. See the section Considerations for Implementing Eventual Consistency
in the Data Consistency Primer for more information.
Example
A travel website lets customers book itineraries. A single itinerary might comprise a series of flights and hotels.
A customer traveling from Seattle to London and then on to Paris could perform the following steps when
creating an itinerary:
1. Book a seat on flight F1 from Seattle to London.
2. Book a seat on flight F2 from London to Paris.
3. Book a seat on flight F3 from Paris to Seattle.
4. Reserve a room at hotel H1 in London.
5. Reserve a room at hotel H2 in Paris.
These steps constitute an eventually consistent operation, although each step is a separate action. Therefore, as
well as performing these steps, the system must also record the counter operations necessary to undo each
step in case the customer decides to cancel the itinerary. The steps necessary to perform the counter operations
can then run as a compensating transaction.
Notice that the steps in the compensating transaction might not be the exact opposite of the original steps, and
the logic in each step in the compensating transaction must take into account any business-specific rules. For
example, unbooking a seat on a flight might not entitle the customer to a complete refund of any money paid.
The figure illustrates generating a compensating transaction to undo a long-running transaction to book a
travel itinerary.
NOTE
It might be possible for the steps in the compensating transaction to be performed in parallel, depending on how you've
designed the compensating logic for each step.
In many business solutions, failure of a single step doesn't always necessitate rolling the system back by using a
compensating transaction. For example, if—after having booked flights F1, F2, and F3 in the travel website
scenario—the customer is unable to reserve a room at hotel H1, it's preferable to offer the customer a room at a
different hotel in the same city rather than canceling the flights. The customer can still decide to cancel (in which
case the compensating transaction runs and undoes the bookings made on flights F1, F2, and F3), but this
decision should be made by the customer rather than by the system.
Competing Consumers pattern
Enable multiple concurrent consumers to process messages received on the same messaging channel. This
enables a system to process multiple messages concurrently to optimize throughput, to improve scalability and
availability, and to balance the workload.
Solution
Use a message queue to implement the communication channel between the application and the instances of
the consumer service. The application posts requests in the form of messages to the queue, and the consumer
service instances receive messages from the queue and process them. This approach enables the same pool of
consumer service instances to handle messages from any instance of the application. The figure illustrates using
a message queue to distribute work to instances of a service.
Microsoft Azure Service Bus Queues can implement guaranteed first-in-first-out ordering of
messages by using message sessions. For more information, see Messaging Patterns Using Sessions.
Designing services for resiliency. If the system is designed to detect and restart failed service
instances, it might be necessary to implement the processing performed by the service instances as
idempotent operations to minimize the effects of a single message being retrieved and processed more
than once.
Detecting poison messages. A malformed message, or a task that requires access to resources that
aren't available, can cause a service instance to fail. The system should prevent such messages being
returned to the queue, and instead capture and store the details of these messages elsewhere so that they
can be analyzed if necessary.
Handling results. The service instance handling a message is fully decoupled from the application logic
that generates the message, and they might not be able to communicate directly. If the service instance
generates results that must be passed back to the application logic, this information must be stored in a
location that's accessible to both. In order to prevent the application logic from retrieving incomplete data,
the system must indicate when processing is complete.
If you're using Azure, a worker process can pass results back to the application logic by using a
dedicated message reply queue. The application logic must be able to correlate these results with the
original message. This scenario is described in more detail in the Asynchronous Messaging Primer.
Scaling the messaging system. In a large-scale solution, a single message queue could be
overwhelmed by the number of messages and become a bottleneck in the system. In this situation,
consider partitioning the messaging system to send messages from specific producers to a particular
queue, or use load balancing to distribute messages across multiple message queues.
Ensuring reliability of the messaging system. A reliable messaging system is needed to guarantee
that after the application enqueues a message it won't be lost. This is essential for ensuring that all
messages are delivered at least once.
Some messaging systems support sessions that enable a producer to group messages together and ensure
that they're all handled by the same consumer. This mechanism can be used with prioritized messages (if
they are supported) to implement a form of message ordering that delivers messages in sequence from a
producer to a single consumer.
Example
Azure provides Service Bus Queues and Azure Function queue triggers that, when combined, are a direct
implementation of this cloud design pattern. Azure Functions integrate with Azure Service Bus via triggers and
bindings. Integrating with Service Bus allows you to build functions that consume queue messages sent by
publishers. The publishing application(s) will post messages to a queue, and consumers, implemented as Azure
Functions, can retrieve messages from this queue and handle them.
For resiliency, a Service Bus queue enables a consumer to use PeekLock mode when it retrieves a message from
the queue; this mode doesn't actually remove the message, but simply hides it from other consumers. The Azure
Functions runtime receives a message in PeekLock mode. If the function finishes successfully, it calls Complete
on the message; if the function fails, it calls Abandon, and the message becomes visible again, allowing
another consumer to retrieve it. If the function runs for a period longer than the PeekLock timeout, the lock is
automatically renewed as long as the function is running.
Azure Functions can scale out/in based on the depth of the queue, all acting as competing consumers of the
queue. If multiple instances of the functions are created they all compete by independently pulling and
processing the messages.
For detailed information on using Azure Service Bus queues, see Service Bus queues, topics, and subscriptions.
For information on Queue triggered Azure Functions, see Azure Service Bus trigger for Azure Functions.
The following code shows how you can create a new message and send it to a Service Bus Queue by using a
QueueClient instance.
private string serviceBusConnectionString = ...;
...
// A QueueClient created from the connection string and queue name is assumed.
var queueClient = new QueueClient(serviceBusConnectionString, "myqueue");
var msgNumber = 0;
while (!ct.IsCancellationRequested)
{
    // Create a new message to send to the queue
    string messageBody = $"Message {msgNumber++}";
    var message = new Message(Encoding.UTF8.GetBytes(messageBody));
    // Send the message to the Service Bus queue
    await queueClient.SendAsync(message);
}
The following code example shows a consumer, written as a C# Azure Function, that reads message metadata
and logs a Service Bus Queue message. Note how the ServiceBusTrigger attribute is used to bind it to a Service
Bus Queue.
[FunctionName("ProcessQueueMessage")]
public static void Run(
[ServiceBusTrigger("myqueue", Connection = "ServiceBusConnectionString")]
string myQueueItem,
Int32 deliveryCount,
DateTime enqueuedTimeUtc,
string messageId,
ILogger log)
{
log.LogInformation($"C# ServiceBus queue trigger function consumed message: {myQueueItem}");
log.LogInformation($"EnqueuedTimeUtc={enqueuedTimeUtc}");
log.LogInformation($"DeliveryCount={deliveryCount}");
log.LogInformation($"MessageId={messageId}");
}
Compute Resource Consolidation pattern
Consolidate multiple tasks or operations into a single computational unit. This can increase compute resource
utilization, and reduce the costs and management overhead associated with performing compute processing in
cloud-hosted applications.
Each computational unit consumes chargeable resources, even when it's idle or lightly used. Therefore, this isn't
always the most cost-effective solution.
In Azure, this concern applies to roles in a Cloud Service, App Services, and Virtual Machines. These items run in
their own virtual environment. Running a collection of separate roles, websites, or virtual machines that are
designed to perform a set of well-defined operations, but that need to communicate and cooperate as part of a
single solution, can be an inefficient use of resources.
Solution
To help reduce costs, increase utilization, improve communication speed, and reduce management it's possible to
consolidate multiple tasks or operations into a single computational unit.
Tasks can be grouped according to criteria based on the features provided by the environment and the costs
associated with these features. A common approach is to look for tasks that have a similar profile concerning their
scalability, lifetime, and processing requirements. Grouping these together allows them to scale as a unit. The
elasticity provided by many cloud environments enables additional instances of a computational unit to be started
and stopped according to the workload. For example, Azure provides autoscaling that you can apply to roles in a
Cloud Service, App Services, and Virtual Machines. For more information, see Autoscaling Guidance.
As a counter example to show how scalability can be used to determine which operations shouldn't be grouped
together, consider the following two tasks:
Task 1 polls for infrequent, time-insensitive messages sent to a queue.
Task 2 handles high-volume bursts of network traffic.
The second task requires elasticity that can involve starting and stopping a large number of instances of the
computational unit. Applying the same scaling to the first task would simply result in more tasks listening for
infrequent messages on the same queue, and is a waste of resources.
In many cloud environments it's possible to specify the resources available to a computational unit in terms of the
number of CPU cores, memory, disk space, and so on. Generally, the more resources specified, the greater the
cost. To save money, it's important to maximize the work an expensive computational unit performs, and not let it
become inactive for an extended period.
If there are tasks that require a great deal of CPU power in short bursts, consider consolidating these into a single
computational unit that provides the necessary power. However, it's important to balance this need to keep
expensive resources busy against the contention that could occur if they are over stressed. Long-running,
compute-intensive tasks shouldn't share the same computational unit, for example.
NOTE
Consider consolidating compute resources only for a system that's been in production for a period of time so that operators
and developers can monitor the system and create a heat map that identifies how each task uses differing resources. This
map can be used to determine which tasks are good candidates for sharing compute resources.
Complexity. Combining multiple tasks into a single computational unit adds complexity to the code in the unit,
possibly making it more difficult to test, debug, and maintain.
Stable logical architecture. Design and implement the code in each task so that it shouldn't need to change,
even if the physical environment the task runs in does change.
Other strategies. Consolidating compute resources is only one way to help reduce costs associated with running
multiple tasks concurrently. It requires careful planning and monitoring to ensure that it remains an effective
approach. Other strategies might be more appropriate, depending on the nature of the work and where the users
of these tasks are located. For example, functional decomposition of the workload (as described by the
Compute Partitioning Guidance) might be a better option.
Example
When building a cloud service on Azure, it’s possible to consolidate the processing performed by multiple tasks
into a single role. Typically this is a worker role that performs background or asynchronous processing tasks.
In some cases it's possible to include background or asynchronous processing tasks in the web role. This
technique helps to reduce costs and simplify deployment, although it can impact the scalability and
responsiveness of the public-facing interface provided by the web role.
The role is responsible for starting and stopping the tasks. When the Azure fabric controller loads a role, it raises
the Start event for the role. You can override the OnStart method of the WebRole or WorkerRole class to handle
this event, perhaps to initialize the data and other resources the tasks in this method depend on.
When the OnStart method completes, the role can start responding to requests. You can find more information
and guidance about using the OnStart and Run methods in a role in the Application Startup Processes section in
the patterns & practices guide Moving Applications to the Cloud.
Keep the code in the OnStart method as concise as possible. Azure doesn't impose any limit on the time
taken for this method to complete, but the role won't be able to start responding to network requests sent to it
until this method completes.
When the OnStart method has finished, the role executes the Run method. At this point, the fabric controller can
start sending requests to the role.
Place the code that actually creates the tasks in the Run method. Note that the Run method defines the lifetime
of the role instance. When this method completes, the fabric controller will arrange for the role to be shut down.
When a role shuts down or is recycled, the fabric controller prevents any more incoming requests being received
from the load balancer and raises the Stop event. You can capture this event by overriding the OnStop method of
the role and perform any tidying up required before the role terminates.
Any actions performed in the OnStop method must be completed within five minutes (or 30 seconds if you
are using the Azure emulator on a local computer). Otherwise the Azure fabric controller assumes that the
role has stalled and will force it to stop.
The tasks are started by the Run method that waits for the tasks to complete. The tasks implement the business
logic of the cloud service, and can respond to messages posted to the role through the Azure load balancer. The
figure shows the lifecycle of tasks and resources in a role in an Azure cloud service.
The WorkerRole.cs file in the ComputeResourceConsolidation.Worker project shows an example of how you might
implement this pattern in an Azure cloud service.
The MyWorkerTask1 and the MyWorkerTask2 methods illustrate how to perform different tasks within the same
worker role. The following code shows MyWorkerTask1 . This is a simple task that sleeps for 30 seconds and then
outputs a trace message. It repeats this process until the task is canceled. The code in MyWorkerTask2 is similar.
// A sample worker role task.
private static async Task MyWorkerTask1(CancellationToken ct)
{
    // Fixed interval to wake up and check for work and/or do work.
    var interval = TimeSpan.FromSeconds(30);
    try
    {
        while (!ct.IsCancellationRequested)
        {
            // Wake up and do some background processing if not canceled.
            // TASK PROCESSING CODE HERE
            Trace.TraceInformation("Doing Worker Task 1 Work");

            // Go back to sleep for a period of time unless the task has been canceled.
            await Task.Delay(interval, ct);
        }
    }
    catch (OperationCanceledException)
    {
        // Expected when the task is canceled; allow the method to exit cleanly.
    }
}
The sample code shows a common implementation of a background process. In a real world application you
can follow this same structure, except that you should place your own processing logic in the body of the loop
that waits for the cancellation request.
After the worker role has initialized the resources it uses, the Run method starts the two tasks concurrently, as
shown here.
/// <summary>
/// The cancellation token source use to cooperatively cancel running tasks
/// </summary>
private readonly CancellationTokenSource cts = new CancellationTokenSource();
/// <summary>
/// List of running tasks on the role instance
/// </summary>
private readonly List<Task> tasks = new List<Task>();
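
// Sketch of the start of the Run method: the role starts both worker tasks
// concurrently and then waits for any of them to complete.
public override void Run()
{
    tasks.Add(MyWorkerTask1(cts.Token));
    tasks.Add(MyWorkerTask2(cts.Token));

    // Wait for any of the tasks to complete (or fail).
    Task.WaitAny(tasks.ToArray());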
// If there wasn't a cancellation request, stop all tasks and return from Run()
// An alternative to canceling and returning when a task exits would be to
// restart the task.
if (!cts.IsCancellationRequested)
{
Trace.TraceInformation("Task returned without cancellation request");
Stop(TimeSpan.FromMinutes(5));
}
}
...
In this example, the Run method waits for tasks to be completed. If a task is canceled, the Run method assumes
that the role is being shut down and waits for the remaining tasks to be canceled before finishing (it waits for a
maximum of five minutes before terminating). If a task fails due to an expected exception, the Run method
cancels the task.
You could implement more comprehensive monitoring and exception handling strategies in the Run method
such as restarting tasks that have failed, or including code that enables the role to stop and start individual
tasks.
The Stop method shown in the following code is called when the fabric controller shuts down the role instance
(it's invoked from the OnStop method). The code stops each task gracefully by canceling it. If any task takes more
than five minutes to complete, the cancellation processing in the Stop method ceases waiting and the role is
terminated.
// Stop running tasks and wait for tasks to complete before returning
// unless the timeout expires.
private void Stop(TimeSpan timeout)
{
Trace.TraceInformation("Stop called. Canceling tasks.");
// Cancel running tasks.
cts.Cancel();
// Wait for all the tasks to complete before returning. Note that the
// emulator currently allows 30 seconds and Azure allows five
// minutes for processing to complete.
try
{
Task.WaitAll(tasks.ToArray(), timeout);
}
catch (AggregateException ex)
{
Trace.TraceError(ex.Message);
}
}
Deployment Stamps pattern
The deployment stamp pattern involves deploying multiple independent copies of application components,
including data stores. Each individual copy is called a stamp, or sometimes a service unit or scale unit. This
approach can improve the scalability of your solution, allow you to deploy instances across multiple regions, and
separate your customer data.
When hosting an application in the cloud there are certain considerations to be made. One key thing to keep in
mind is the performance and reliability of your application. If you host a single instance of your solution, you
might be subject to the following limitations:
Scale limits. Deploying a single instance of your application may result in natural scaling limits. For example,
you may use services that have limits on the number of inbound connections, host names, TCP sockets, or
other resources.
Non-linear scaling or cost. Some of your solution's components may not scale linearly with the number of
requests or the amount of data. Instead, there can be a sudden decrease in performance or increase in cost
once a threshold has been met. For example, you may use a database and discover that the marginal cost of
adding more capacity (scaling up) becomes prohibitive, and that scaling out is a more cost-effective strategy.
Similarly, Azure Front Door has higher per-domain pricing when a high number of custom domains are
deployed, and it may be better to spread the custom domains across multiple Front Door instances.
Separation of customers. You may need to keep certain customers' data isolated from other customers'
data. Similarly, you may have some customers that require more system resources to service than others, and you
might consider grouping them on different sets of infrastructure.
Handling single- and multi-tenant instances. You may have some large customers who need their own
independent instances of your solution. You may also have a pool of smaller customers who can share a multi-
tenant deployment.
Complex deployment requirements. You may need to deploy updates to your service in a controlled
manner, and to deploy to different subsets of your customer base at different times.
Update frequency. You may have some customers who are tolerant of having frequent updates to your
system, while others may be risk-averse and want infrequent updates to the system that services their
requests. It may make sense to have these customers deployed to isolated environments.
Geographical or geopolitical restrictions. To architect for low latency, or to comply with data sovereignty
requirements, you may deploy some of your customers into specific regions.
Implementation
To avoid these issues, consider deploying copies of your solution's components multiple times. Stamps operate
independently of each other and can be deployed and updated independently. A single geographical region may
contain a single stamp, or may contain multiple stamps to allow for horizontal scale-out within the region. Stamps
contain a subset of your customers.
Deployment stamps can apply whether your solution uses infrastructure as a service (IaaS) or platform as a
service (PaaS) components, or a mixture of both. Typically IaaS workloads require more intervention to scale, so
the pattern may be useful for IaaS-heavy workloads to allow for scaling out.
Stamps can be used to implement deployment rings. If different customers want to receive service updates at
different frequencies, they can be grouped onto different stamps, and each stamp could have updates deployed at
different cadences.
Because stamps run independently from each other, data is implicitly sharded. Furthermore, a single stamp can
make use of further sharding to internally allow for scalability and elasticity within the stamp.
The deployment stamp pattern is used internally by many Azure services, including App Service, Azure Stack, and
Azure Storage.
Deployment stamps are related to, but distinct from, geodes. In a deployment stamp architecture, multiple
independent instances of your system are deployed and contain a subset of your customers and users. In geodes,
all instances can serve requests from any users, but this architecture is often more complex to design and build.
You may also consider mixing the two patterns within one solution; the traffic routing approach described below
is an example of such a hybrid scenario.
Deployment
Because of the complexity that is involved in deploying identical copies of the same components, good DevOps
practices are critical to ensure success when implementing this pattern. Consider describing your infrastructure as
code, such as using Azure Resource Manager templates and scripts. With this approach, you can ensure that the
deployment of each stamp is predictable. It also reduces the likelihood of human errors such as accidental
mismatches in configuration between stamps.
You can deploy updates automatically to all stamps in parallel, in which case you might consider technologies like
Resource Manager templates or Azure Deployment Manager to coordinate the deployment of your infrastructure
and applications. Alternatively, you may decide to gradually roll out updates to some stamps first, and then
progressively to others. Azure Deployment Manager can also manage this type of staged rollout, or you could
consider using a release management tool like Azure Pipelines to orchestrate deployments to each stamp.
Carefully consider the topology of the Azure subscriptions and resource groups for your deployments:
Typically a subscription will contain all resources for a single solution, so in general consider using a single
subscription for all stamps. However, some Azure services impose subscription-wide quotas, so if you are
using this pattern to allow for a high degree of scale-out, you may need to consider deploying stamps across
different subscriptions.
Resource groups are generally used to deploy components with the same lifecycle. If you plan to deploy
updates to all of your stamps at once, consider using a single resource group to contain all of the components
for all of your stamps, and use resource naming conventions and tags to identify the components that belong
to each stamp. Alternatively, if you plan to deploy updates to each stamp independently, consider deploying
each stamp into its own resource group.
Capacity planning
Use load and performance testing to determine the approximate load that a given stamp can accommodate. Load
metrics may be based on the number of customers/tenants that a single stamp can accommodate, or metrics
from the services that the components within the stamp emit. Ensure that you have sufficient instrumentation to
measure when a given stamp is approaching its capacity, and the ability to deploy new stamps quickly to respond
to demand.
Traffic routing
The Deployment Stamp pattern works well if each stamp is addressed independently. For example, if Contoso
deploys the same API application across multiple stamps, they might consider using DNS to route traffic to the
relevant stamp:
unit1.aus.myapi.contoso.com routes traffic to stamp unit1 within an Australian region.
unit2.aus.myapi.contoso.com routes traffic to stamp unit2 within an Australian region.
unit1.eu.myapi.contoso.com routes traffic to stamp unit1 within a European region.
If a traffic routing service is included in your solution, consider whether it acts as a gateway and could therefore
perform gateway offloading for services such as token validation, throttling, and authorization.
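To make the mapping between customers and stamps concrete, the following minimal C# sketch shows one way a
routing layer could resolve a tenant to the base address of its stamp. The catalog contents and class names are
purely illustrative and are not part of the pattern's guidance.
using System;
using System.Collections.Generic;

// Illustrative only: a tiny in-memory catalog that maps each tenant to the base address
// of the stamp that hosts it, following the DNS naming scheme described above.
public class StampCatalog
{
    private readonly Dictionary<string, Uri> tenantToStamp = new Dictionary<string, Uri>
    {
        ["tenant-a"] = new Uri("https://unit1.aus.myapi.contoso.com"),
        ["tenant-b"] = new Uri("https://unit2.aus.myapi.contoso.com"),
        ["tenant-c"] = new Uri("https://unit1.eu.myapi.contoso.com")
    };

    public Uri ResolveStamp(string tenantId)
    {
        // In a real system this lookup would come from a durable store so that tenants
        // can be moved between stamps without redeploying the routing layer.
        if (tenantToStamp.TryGetValue(tenantId, out var stampBaseAddress))
            return stampBaseAddress;
        throw new KeyNotFoundException($"Tenant '{tenantId}' is not assigned to a stamp.");
    }
}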
Supporting technologies
Infrastructure as code. For example, Resource Manager templates, Azure CLI, Terraform, PowerShell, Bash.
Deployment Manager, which can orchestrate deployments of a solution across multiple stamps.
Azure Front Door, which can route traffic to a specific stamp or to a traffic routing service.
Example
The following example deploys multiple stamps of a simple PaaS solution, with an app service and a SQL
Database in each stamp. While stamps can be configured in any region that supports the services deployed in the
template, for illustration purposes this template deploys two stamps within the West US 2 region and a further
stamp in the West Europe region. Within a stamp, the app service receives its own public DNS hostname and it
can receive connections independently of all other stamps.
WARNING
The example below uses a SQL Server administrator account. It's generally not a good practice to use an administrative
account from your application. For a real application, consider using a managed identity to connect from your application to
a SQL database, or use a least-privilege account.
NOTE
There are alternative approaches to deploying stamps with a Resource Manager template, including using nested templates
or linked templates to decouple the definition of each stamp from the iteration required to deploy multiple copies.
Next steps
Sharding can be used as another, simpler, approach to scale out your data tier. Stamps implicitly shard their
data, but sharding does not require a Deployment Stamp. For more information, see Sharding Pattern.
If a traffic routing service is deployed, the Gateway Routing and Gateway Offloading patterns can be used
together to make the best use of this component.
Event Sourcing pattern
12/18/2020 • 14 minutes to read • Edit Online
Instead of storing just the current state of the data in a domain, use an append-only store to record the full series
of actions taken on that data. The store acts as the system of record and can be used to materialize the domain
objects. This can simplify tasks in complex domains, by avoiding the need to synchronize the data model and the
business domain, while improving performance, scalability, and responsiveness. It can also provide consistency
for transactional data, and maintain full audit trails and history that can enable compensating actions.
For a deeper understanding of the limits of the CRUD approach, see CRUD, Only When You Can Afford It.
Solution
The Event Sourcing pattern defines an approach to handling operations on data that's driven by a sequence of
events, each of which is recorded in an append-only store. Application code sends a series of events that
imperatively describe each action that has occurred on the data to the event store, where they're persisted. Each
event represents a set of changes to the data (such as AddedItemToOrder ).
The events are persisted in an event store that acts as the system of record (the authoritative data source) about
the current state of the data. The event store typically publishes these events so that consumers can be notified
and can handle them if needed. Consumers could, for example, initiate tasks that apply the operations in the
events to other systems, or perform any other associated action that's required to complete the operation. Notice
that the application code that generates the events is decoupled from the systems that subscribe to the events.
Typical uses of the events published by the event store are to maintain materialized views of entities as actions in
the application change them, and for integration with external systems. For example, a system can maintain a
materialized view of all customer orders that's used to populate parts of the UI. As the application adds new
orders, adds or removes items on the order, and adds shipping information, the events that describe these
changes can be handled and used to update the materialized view.
In addition, at any point it's possible for applications to read the history of events, and use it to materialize the
current state of an entity by playing back and consuming all the events related to that entity. This can occur on
demand to materialize a domain object when handling a request, or through a scheduled task so that the state of
the entity can be stored as a materialized view to support the presentation layer.
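As a rough illustration of replaying events to materialize current state, the following sketch uses hypothetical
order events (modern C# records are used for brevity; the type and event names are not taken from the pattern's
sample code).
using System.Collections.Generic;
using System.Linq;

// Hypothetical events describing actions taken on an order.
public abstract record OrderEvent(string OrderId);
public record AddedItemToOrder(string OrderId, string Sku, int Quantity) : OrderEvent(OrderId);
public record RemovedItemFromOrder(string OrderId, string Sku, int Quantity) : OrderEvent(OrderId);

public static class OrderProjection
{
    // Materialize the current contents of an order by replaying its events in order.
    public static Dictionary<string, int> CurrentItems(IEnumerable<OrderEvent> history)
    {
        var items = new Dictionary<string, int>();
        foreach (var e in history)
        {
            if (e is AddedItemToOrder added)
                items[added.Sku] = items.GetValueOrDefault(added.Sku) + added.Quantity;
            else if (e is RemovedItemFromOrder removed)
                items[removed.Sku] = items.GetValueOrDefault(removed.Sku) - removed.Quantity;
        }
        // Drop items whose quantity has fallen to zero or below.
        return items.Where(kv => kv.Value > 0).ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}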
The figure shows an overview of the pattern, including some of the options for using the event stream such as
creating a materialized view, integrating events with external applications and systems, and replaying events to
create projections of the current state of specific entities.
Event sourcing is commonly combined with the CQRS pattern by performing the data management tasks in
response to the events, and by materializing views from the stored events.
NOTE
See the Data Consistency Primer for information about eventual consistency.
The event store is the permanent source of information, and so the event data should never be updated. The only
way to update an entity to undo a change is to add a compensating event to the event store. If the format (rather
than the data) of the persisted events needs to change, perhaps during a migration, it can be difficult to combine
existing events in the store with the new version. It might be necessary to iterate through all the events making
changes so they're compliant with the new format, or add new events that use the new format. Consider using a
version stamp on each version of the event schema to maintain both the old and the new event formats.
Multi-threaded applications and multiple instances of applications might be storing events in the event store. The
consistency of events in the event store is vital, as is the order of events that affect a specific entity (the order that
changes occur to an entity affects its current state). Adding a timestamp to every event can help to avoid issues.
Another common practice is to annotate each event resulting from a request with an incremental identifier. If two
actions attempt to add events for the same entity at the same time, the event store can reject an event that
matches an existing entity identifier and event identifier.
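A minimal sketch of that duplicate-detection idea follows; it is entirely illustrative (an in-memory store), since
real event stores expose their own concurrency primitives.
using System;
using System.Collections.Generic;

// Illustrative in-memory store that rejects an append when the same
// (entity identifier, event identifier) pair has already been recorded.
public class InMemoryEventStore
{
    private readonly Dictionary<(Guid EntityId, long EventId), object> events =
        new Dictionary<(Guid, long), object>();

    public bool TryAppend(Guid entityId, long eventId, object eventData)
    {
        if (events.ContainsKey((entityId, eventId)))
        {
            // A concurrent writer has already used this event identifier for this entity.
            return false;
        }
        events.Add((entityId, eventId), eventData);
        return true;
    }
}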
There's no standard approach, or existing mechanisms such as SQL queries, for reading the events to obtain
information. The only data that can be extracted is a stream of events using an event identifier as the criteria. The
event ID typically maps to individual entities. The current state of an entity can be determined only by replaying
all of the events that relate to it against the original state of that entity.
The length of each event stream affects managing and updating the system. If the streams are large, consider
creating snapshots at specific intervals such as a specified number of events. The current state of the entity can be
obtained from the snapshot and by replaying any events that occurred after that point in time. For more
information about creating snapshots of data, see Snapshot on Martin Fowler’s Enterprise Application
Architecture website and Primary-Subordinate Snapshot Replication.
Even though event sourcing minimizes the chance of conflicting updates to the data, the application must still be
able to deal with inconsistencies that result from eventual consistency and the lack of transactions. For example,
an event that indicates a reduction in stock inventory might arrive in the data store while an order for that item is
being placed, resulting in a requirement to reconcile the two operations either by advising the customer or
creating a back order.
Event publication might be "at least once," and so consumers of the events must be idempotent. They must not
reapply the update described in an event if the event is handled more than once. For example, if multiple
instances of a consumer maintain an aggregate of an entity's property, such as the total number of orders placed,
only one must succeed in incrementing the aggregate when an order placed event occurs. While this isn't a key
characteristic of event sourcing, it's the usual implementation decision.
When to use this pattern
Use this pattern in the following scenarios:
When you want to capture intent, purpose, or reason in the data. For example, changes to a customer
entity can be captured as a series of specific event types such as Moved home, Closed account, or
Deceased.
When it's vital to minimize or completely avoid the occurrence of conflicting updates to data.
When you want to record events that occur, and be able to replay them to restore the state of a system, roll
back changes, or keep a history and audit log. For example, when a task involves multiple steps you might
need to execute actions to revert updates and then replay some steps to bring the data back into a
consistent state.
When using events is a natural feature of the operation of the application, and requires little additional
development or implementation effort.
When you need to decouple the process of inputting or updating data from the tasks required to apply
these actions. This might be to improve UI performance, or to distribute events to other listeners that take
action when the events occur. For example, integrating a payroll system with an expense submission
website so that events raised by the event store in response to data updates made in the website are
consumed by both the website and the payroll system.
When you want flexibility to be able to change the format of materialized models and entity data if
requirements change, or—when used in conjunction with CQRS—you need to adapt a read model or the
views that expose the data.
When used in conjunction with CQRS, and eventual consistency is acceptable while a read model is
updated, or the performance impact of rehydrating entities and data from an event stream is acceptable.
This pattern might not be useful in the following situations:
Small or simple domains, systems that have little or no business logic, or nondomain systems that
naturally work well with traditional CRUD data management mechanisms.
Systems where consistency and real-time updates to the views of the data are required.
Systems where audit trails, history, and capabilities to roll back and replay actions are not required.
Systems where there's only a very low occurrence of conflicting updates to the underlying data. For
example, systems that predominantly add data rather than updating it.
Example
A conference management system needs to track the number of completed bookings for a conference so that it
can check whether there are seats still available when a potential attendee tries to make a booking. The system
could store the total number of bookings for a conference in at least two ways:
The system could store the information about the total number of bookings as a separate entity in a
database that holds booking information. As bookings are made or canceled, the system could increment
or decrement this number as appropriate. This approach is simple in theory, but can cause scalability
issues if a large number of attendees are attempting to book seats during a short period of time. For
example, in the last day or so prior to the booking period closing.
The system could store information about bookings and cancellations as events held in an event store. It
could then calculate the number of seats available by replaying these events. This approach can be more
scalable due to the immutability of events. The system only needs to be able to read data from the event
store, or append data to the event store. Event information about bookings and cancellations is never
modified.
The following diagram illustrates how the seat reservation subsystem of the conference management system
might be implemented using event sourcing.
1. The user interface issues a command to reserve seats for an attendee. The command is handled by a separate
command handler.
2. An aggregate containing information about all reservations for the conference is constructed by querying the
events that describe bookings and cancellations. This aggregate is called SeatAvailability.
Some optimizations to consider are using snapshots (so that you don't need to query and replay the
full list of events to obtain the current state of the aggregate), and maintaining a cached copy of the
aggregate in memory.
3. The command handler invokes a method exposed by the domain model to make the reservations.
4. The SeatAvailability aggregate records an event containing the number of seats that were reserved. The
next time the aggregate applies events, all the reservations will be used to compute how many seats
remain.
5. The system appends the new event to the list of events in the event store.
If a user cancels a seat, the system follows a similar process except the command handler issues a command that
generates a seat cancellation event and appends it to the event store.
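A highly simplified sketch of computing the remaining seats by replaying booking and cancellation events is shown
below; the event and class names are illustrative only and are not taken from the sample.
using System.Collections.Generic;

// Illustrative events for the seat reservation example.
public record SeatsReserved(int Count);
public record SeatsCancelled(int Count);

public static class SeatAvailabilityProjection
{
    // Replay the append-only history to determine how many seats remain.
    public static int SeatsRemaining(int totalSeats, IEnumerable<object> history)
    {
        var reserved = 0;
        foreach (var e in history)
        {
            if (e is SeatsReserved r) reserved += r.Count;
            else if (e is SeatsCancelled c) reserved -= c.Count;
        }
        return totalSeats - reserved;
    }
}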
As well as providing more scope for scalability, using an event store also provides a complete history, or audit
trail, of the bookings and cancellations for a conference. The events in the event store are the accurate record.
There is no need to persist aggregates in any other way because the system can easily replay the events and
restore the state to any point in time.
You can find more information about this example in Introducing Event Sourcing.
External Configuration Store pattern
Move configuration information out of the application deployment package to a centralized location. This can
provide opportunities for easier management and control of configuration data, and for sharing configuration
data across applications and application instances.
Solution
Store the configuration information in external storage, and provide an interface that can be used to quickly and
efficiently read and update configuration settings. The type of external store depends on the hosting and runtime
environment of the application. In a cloud-hosted scenario it's typically a cloud-based storage service, but could be
a hosted database or other system.
The backing store you choose for configuration information should have an interface that provides consistent and
easy-to-use access. It should expose the information in a correctly typed and structured format. The
implementation might also need to authorize users’ access in order to protect configuration data, and be flexible
enough to allow storage of multiple versions of the configuration (such as development, staging, or production,
including multiple release versions of each one).
Many built-in configuration systems read the data when the application starts up, and cache the data in
memory to provide fast access and minimize the impact on application performance. Depending on the type
of backing store used, and the latency of this store, it might be helpful to implement a caching mechanism
within the external configuration store. For more information, see the Caching Guidance. The figure illustrates
an overview of the External Configuration Store pattern with optional local cache.
Issues and considerations
Consider the following points when deciding how to implement this pattern:
Choose a backing store that offers acceptable performance, high availability, robustness, and can be backed up as
part of the application maintenance and administration process. In a cloud-hosted application, using a cloud
storage mechanism is usually a good choice to meet these requirements.
Design the schema of the backing store to allow flexibility in the types of information it can hold. Ensure that it
provides for all configuration requirements such as typed data, collections of settings, multiple versions of settings,
and any other features that the applications using it require. The schema should be easy to extend to support
additional settings as requirements change.
Consider the physical capabilities of the backing store, how it relates to the way configuration information is
stored, and the effects on performance. For example, storing an XML document containing configuration
information will require either the configuration interface or the application to parse the document in order to
read individual settings. It'll make updating a setting more complicated, though caching the settings can help to
offset slower read performance.
Consider how the configuration interface will permit control of the scope and inheritance of configuration settings.
For example, it might be a requirement to scope configuration settings at the organization, application, and the
machine level. It might need to support delegation of control over access to different scopes, and to prevent or
allow individual applications to override settings.
Ensure that the configuration interface can expose the configuration data in the required formats such as typed
values, collections, key/value pairs, or property bags.
Consider how the configuration store interface will behave when settings contain errors, or don't exist in the
backing store. It might be appropriate to return default settings and log errors. Also consider aspects such as the
case sensitivity of configuration setting keys or names, the storage and handling of binary data, and the ways that
null or empty values are handled.
Consider how to protect the configuration data to allow access to only the appropriate users and applications. This
is likely a feature of the configuration store interface, but it's also necessary to ensure that the data in the backing
store can't be accessed directly without the appropriate permission. Ensure strict separation between the
permissions required to read and to write configuration data. Also consider whether you need to encrypt some or
all of the configuration settings, and how this'll be implemented in the configuration store interface.
Centrally stored configurations, which change application behavior during runtime, are critically important and
should be deployed, updated, and managed using the same mechanisms as deploying application code. For
example, changes that can affect more than one application must be carried out using a full test and staged
deployment approach to ensure that the change is appropriate for all applications that use this configuration. If an
administrator edits a setting to update one application, it could adversely impact other applications that use the
same setting.
If an application caches configuration information, the application needs to be alerted if the configuration changes.
It might be possible to implement an expiration policy over cached configuration data so that this information is
automatically refreshed periodically and any changes picked up (and acted on).
Example
In a Microsoft Azure hosted application, a typical choice for storing configuration information externally is to use
Azure Storage. This is resilient, offers high performance, and is replicated three times with automatic failover to
offer high availability. Azure Table storage provides a key/value store with the ability to use a flexible schema for
the values. Azure Blob storage provides a hierarchical, container-based store that can hold any type of data in
individually named blobs.
The following example shows how a configuration store can be implemented over Blob storage to store and
expose configuration information. The BlobSettingsStore class abstracts Blob storage for holding configuration
information, and implements the ISettingsStore interface shown in the following code.
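The interface definition itself does not appear in this extract. Based on the description that follows, a minimal
version of ISettingsStore might look something like this; treat it as a sketch rather than the sample's exact code.
using System.Collections.Generic;
using System.Threading.Tasks;

public interface ISettingsStore
{
    // A version token (for example, the blob's ETag) that changes whenever any setting is modified.
    Task<string> GetVersionAsync();

    // Retrieve all settings held in the store as key/value pairs.
    Task<Dictionary<string, string>> FindAllAsync();

    // Update (or add) a single configuration setting.
    Task UpdateAsync(string key, string value);
}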
This interface defines methods for retrieving and updating configuration settings held in the configuration store,
and includes a version number that can be used to detect whether any configuration settings have been modified
recently. The BlobSettingsStore class uses the ETag property of the blob to implement versioning. The ETag
property is updated automatically each time the blob is written.
By design, this simple solution exposes all configuration settings as string values rather than typed values.
...
public IObservable<KeyValuePair<string, string>> Changed => this.changed.AsObservable();
...
public string GetAppSetting(string key)
{
    string value;
    try
    {
        this.settingsCacheLock.EnterReadLock();
        this.settingsCache.TryGetValue(key, out value);
    }
    finally
    {
        this.settingsCacheLock.ExitReadLock();
    }
    return value;
}
...
private async Task CheckForConfigurationChangesAsync()
{
    try
    {
        // It is assumed that updates are infrequent.
        // To avoid race conditions in refreshing the cache, synchronize access to the in-memory cache.
        await this.syncCacheSemaphore.WaitAsync();

        // Get the latest version token from the settings store.
        var latestVersion = await this.settings.GetVersionAsync();

        // If the versions are the same, nothing has changed in the configuration.
        if (this.currentVersion == latestVersion) return;

        // Get the latest settings from the settings store and publish changes.
        var latestSettings = await this.settings.FindAllAsync();

        if (this.settingsCache != null)
        {
            // Notify settings changed
            latestSettings.Except(this.settingsCache).ToList().ForEach(kv => this.changed.OnNext(kv));
        }

        this.settingsCache = latestSettings;
        this.currentVersion = latestVersion;
    }
    finally
    {
        this.syncCacheSemaphore.Release();
    }
}
The ExternalConfigurationManager class also provides a property named Environment . This property supports
varying configurations for an application running in different environments, such as staging and production.
An ExternalConfigurationManager object can also query the BlobSettingsStore object periodically for any changes.
In the following code, the StartMonitor method starts a background loop that calls
CheckForConfigurationChangesAsync at an interval to detect any changes and raise the Changed event, as described
earlier.
/// <summary>
/// Start the background monitoring for configuration changes in the central store
/// </summary>
public void StartMonitor()
{
    if (this.IsMonitoring)
        return;
    try
    {
        this.timerSemaphore.Wait();
        // Start the loop that monitors for configuration changes.
        // (monitoringTask is a Task field on this class that is not shown in this extract.)
        this.monitoringTask = ConfigChangeMonitor();
    }
    finally
    {
        this.timerSemaphore.Release();
    }
}
/// <summary>
/// Loop that monitors for configuration changes
/// </summary>
/// <returns></returns>
public async Task ConfigChangeMonitor()
{
while (!cts.Token.IsCancellationRequested)
{
await this.CheckForConfigurationChangesAsync();
await Task.Delay(this.interval, cts.Token);
}
}
/// <summary>
/// Stop monitoring for configuration changes
/// </summary>
public void StopMonitor()
{
    try
    {
        this.timerSemaphore.Wait();
        // Signal the monitoring loop to stop.
        this.cts.Cancel();
    }
    finally
    {
        this.timerSemaphore.Release();
    }
}
The following code is taken from the WorkerRole class in the ExternalConfigurationStore.Cloud project. It shows
how the application uses the ExternalConfiguration class to read a setting.
// Get a setting.
var setting = ExternalConfiguration.Instance.GetAppSetting("setting1");
Trace.TraceInformation("Worker Role: Get setting1, value: " + setting);
this.completeEvent.WaitOne();
}
The following code, also from the WorkerRole class, shows how the application subscribes to configuration events.
Federated Identity pattern
Delegate authentication to an external identity provider. This can simplify development, minimize the requirement
for user administration, and improve the user experience of the application.
Solution
Implement an authentication mechanism that can use federated identity. Separate user authentication from the
application code, and delegate authentication to a trusted identity provider. This can simplify development and
allow users to authenticate using a wider range of identity providers (IdP) while minimizing the administrative
overhead. It also allows you to clearly decouple authentication from authorization.
The trusted identity providers include corporate directories, on-premises federation services, other security token
services (STS) provided by business partners, or social identity providers that can authenticate users who have, for
example, a Microsoft, Google, Yahoo!, or Facebook account.
The figure illustrates the Federated Identity pattern when a client application needs to access a service that requires
authentication. The authentication is performed by an IdP that works in concert with an STS. The IdP issues security
tokens that provide information about the authenticated user. This information, referred to as claims, includes the
user’s identity, and might also include other information such as role membership and more granular access
rights.
This model is often called claims-based access control. Applications and services authorize access to features and
functionality based on the claims contained in the token. The service that requires authentication must trust the IdP.
The client application contacts the IdP that performs the authentication. If the authentication is successful, the IdP
returns a token containing the claims that identify the user to the STS (note that the IdP and STS can be the same
service). The STS can transform and augment the claims in the token based on predefined rules, before returning it
to the client. The client application can then pass this token to the service as proof of its identity.
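As a rough, modern illustration of the service side of this flow (accepting tokens from a trusted identity provider
and authorizing on the claims they carry), a minimal ASP.NET Core sketch might look like the following. The
package (Microsoft.AspNetCore.Authentication.JwtBearer), URLs, and claim names are illustrative assumptions and
are not part of this pattern's guidance.
using Microsoft.AspNetCore.Authentication.JwtBearer;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Trust tokens issued by an external identity provider or STS (placeholder values).
builder.Services
    .AddAuthentication(JwtBearerDefaults.AuthenticationScheme)
    .AddJwtBearer(options =>
    {
        options.Authority = "https://idp.example.com/";  // trusted token issuer
        options.Audience = "api://sample-service";       // identifier of this service
    });

// Authorize access to features based on the claims carried in the token.
builder.Services.AddAuthorization(options =>
    options.AddPolicy("ManagersOnly", policy => policy.RequireClaim("role", "manager")));

var app = builder.Build();
app.UseAuthentication();
app.UseAuthorization();

// Only callers whose token contains the required claim can reach this endpoint.
app.MapGet("/reports", () => "report data").RequireAuthorization("ManagersOnly");

app.Run();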
There might be additional security token services in the chain of trust. For example, in the scenario described
later, an on-premises STS trusts another STS that is responsible for accessing an identity provider to
authenticate the user. This approach is common in enterprise scenarios where there's an on-premises STS and
directory.
Federated authentication provides a standards-based solution to the issue of trusting identities across diverse
domains, and can support single sign-on. It's becoming more common across all types of applications, especially
cloud-hosted applications, because it supports single sign-on without requiring a direct network connection to
identity providers. The user doesn't have to enter credentials for every application. This increases security because
it prevents the creation of credentials required to access many different applications, and it also hides the user’s
credentials from all but the original identity provider. Applications see just the authenticated identity information
contained within the token.
Federated identity also has the major advantage that management of the identity and credentials is the
responsibility of the identity provider. The application or service doesn't need to provide identity management
features. In addition, in corporate scenarios, the corporate directory doesn't need to know about the user if it trusts
the identity provider. This removes all the administrative overhead of managing the user identity within the
directory.
Example
An organization hosts a multi-tenant software as a service (SaaS) application in Microsoft Azure. The application
includes a website that tenants can use to manage the application for their own users. The application allows
tenants to access the website by using a federated identity that is generated by Active Directory Federation
Services (AD FS) when a user is authenticated by that organization’s own Active Directory.
The figure shows how tenants authenticate with their own identity provider (step 1), in this case AD FS. After
successfully authenticating a tenant, AD FS issues a token. The client browser forwards this token to the SaaS
application’s federation provider, which trusts tokens issued by the tenant’s AD FS, in order to get back a token that
is valid for the SaaS federation provider (step 2). If necessary, the SaaS federation provider performs a
transformation on the claims in the token into claims that the application recognizes (step 3) before returning the
new token to the client browser. The application trusts tokens issued by the SaaS federation provider and uses the
claims in the token to apply authorization rules (step 4).
Tenants won't need to remember separate credentials to access the application, and an administrator at the
tenant’s company can configure in its own AD FS the list of users that can access the application.
Related guidance
Microsoft Azure Active Directory
Active Directory Domain Services
Active Directory Federation Services
Identity management for multitenant applications in Microsoft Azure
Multitenant Applications in Azure
Gatekeeper pattern
12/18/2020 • 4 minutes to read • Edit Online
Protect applications and services by using a dedicated host instance that acts as a broker between clients and the
application or service, validates and sanitizes requests, and passes requests and data between them. This can
provide an additional layer of security, and limit the attack surface of the system.
Solution
To minimize the risk of clients gaining access to sensitive information and services, decouple hosts or tasks that
expose public endpoints from the code that processes requests and accesses storage. You can achieve this by using
a façade or a dedicated task that interacts with clients and then hands off the request—perhaps through a
decoupled interface—to the hosts or tasks that'll handle the request. The figure provides a high-level overview of
this pattern.
The gatekeeper pattern can be used to simply protect storage, or it can be used as a more comprehensive façade
to protect all of the functions of the application. The important factors are:
Controlled validation. The gatekeeper validates all requests, and rejects those that don't meet validation
requirements.
Limited risk and exposure. The gatekeeper doesn't have access to the credentials or keys used by the trusted
host to access storage and services. If the gatekeeper is compromised, the attacker doesn't get access to these
credentials or keys.
Appropriate security. The gatekeeper runs in a limited privilege mode, while the rest of the application runs
in the full trust mode required to access storage and services. If the gatekeeper is compromised, it can't directly
access the application services or data.
This pattern acts like a firewall in a typical network topography. It allows the gatekeeper to examine requests and
make a decision about whether to pass the request on to the trusted host that performs the required tasks. This
decision typically requires the gatekeeper to validate and sanitize the request content before passing it on to the
trusted host.
Example
In a cloud-hosted scenario, this pattern can be implemented by decoupling the gatekeeper role or virtual machine
from the trusted roles and services in an application. Do this by using an internal endpoint, a queue, or storage as
an intermediate communication mechanism. The figure illustrates using an internal endpoint.
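The following minimal C# sketch illustrates the hand-off described above: the public-facing gatekeeper validates
and sanitizes the request, then forwards it through a decoupled interface (for example, a queue) rather than
calling the trusted host directly. The interface and type names are illustrative assumptions.
using System.Threading.Tasks;

// Decoupled hand-off to the trusted hosts; the concrete implementation (for example,
// a storage queue or message broker) is deliberately not shown here.
public interface IWorkQueue
{
    Task EnqueueAsync(string message);
}

public class GatekeeperEndpoint
{
    private readonly IWorkQueue queue;

    public GatekeeperEndpoint(IWorkQueue queue) => this.queue = queue;

    public async Task<bool> SubmitAsync(string payload)
    {
        // Validate and sanitize before anything reaches the trusted host.
        if (string.IsNullOrWhiteSpace(payload) || payload.Length > 4096)
            return false;

        // The gatekeeper holds no storage credentials; it only posts to the queue.
        await queue.EnqueueAsync(payload.Trim());
        return true;
    }
}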
Related patterns
The Valet Key pattern might also be relevant when implementing the Gatekeeper pattern. When communicating
between the gatekeeper and trusted roles, it's good practice to enhance security by using keys or tokens that limit
permissions for accessing resources. The Valet Key pattern describes how to use a token or key that provides
clients with restricted direct access to a specific resource or service.
Gateway Aggregation pattern
12/18/2020 • 3 minutes to read • Edit Online
Use a gateway to aggregate multiple individual requests into a single request. This pattern is useful when a client
must make multiple calls to different backend systems to perform an operation.
Solution
Use a gateway to reduce chattiness between the client and the services. The gateway receives client requests,
dispatches requests to the various backend systems, and then aggregates the results and sends them back to the
requesting client.
This pattern can reduce the number of requests that the application makes to backend services, and improve
application performance over high-latency networks.
In the following diagram, the application sends a request to the gateway (1). The request contains a package of
additional requests. The gateway decomposes these and processes each request by sending it to the relevant
service (2). Each service returns a response to the gateway (3). The gateway combines the responses from each
service and sends the response to the application (4). The application makes a single request and receives only a
single response from the gateway.
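As a language-agnostic illustration of this flow (separate from the NGINX example later in this article), a minimal
C# sketch of a gateway handler that fans out to two backend services and aggregates the responses might look like
this; the service URLs and type names are placeholders.
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public class AggregationGateway
{
    private static readonly HttpClient client = new HttpClient();

    // Receives one client request, dispatches the backend calls in parallel,
    // and returns a single combined response.
    public async Task<string> GetDashboardAsync(string customerId)
    {
        var ordersTask = client.GetStringAsync($"https://orders.internal/api/orders?customer={customerId}");
        var profileTask = client.GetStringAsync($"https://profiles.internal/api/profiles/{customerId}");

        await Task.WhenAll(ordersTask, profileTask);

        // Combine both responses into the single payload returned to the client.
        return JsonSerializer.Serialize(new
        {
            orders = JsonSerializer.Deserialize<JsonElement>(ordersTask.Result),
            profile = JsonSerializer.Deserialize<JsonElement>(profileTask.Result)
        });
    }
}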
Issues and considerations
The gateway should not introduce service coupling across the backend services.
The gateway should be located near the backend services to reduce latency as much as possible.
The gateway service may introduce a single point of failure. Ensure the gateway is properly designed to meet
your application's availability requirements.
The gateway may introduce a bottleneck. Ensure the gateway has adequate performance to handle load and
can be scaled to meet your anticipated growth.
Perform load testing against the gateway to ensure you don't introduce cascading failures for services.
Implement a resilient design, using techniques such as bulkheads, circuit breaking, retry, and timeouts.
If one or more service calls take too long, it may be acceptable to time out and return a partial set of data.
Consider how your application will handle this scenario.
Use asynchronous I/O to ensure that a delay at the backend doesn't cause performance issues in the
application.
Implement distributed tracing using correlation IDs to track each individual call.
Monitor request metrics and response sizes.
Consider returning cached data as a failover strategy to handle failures.
Instead of building aggregation into the gateway, consider placing an aggregation service behind the gateway.
Request aggregation will likely have different resource requirements than other services in the gateway and
may impact the gateway's routing and offloading functionality.
Example
The following example illustrates how to create a simple gateway aggregation NGINX service using Lua.
worker_processes 4;
events {
worker_connections 1024;
}
http {
server {
listen 80;
location = /batch {
content_by_lua '
    ngx.req.read_body()

    -- Read the JSON body that lists the requests to aggregate.
    local cjson = require "cjson"
    local batch = cjson.decode(ngx.req.get_body_data())["batch"]

    -- Build the table of subrequests to issue.
    local requests = {}
    for i, item in ipairs(batch) do
        table.insert(requests, {item.relative_url, { method = ngx.HTTP_GET }})
    end

    -- Issue the subrequests in parallel and collect the responses.
    local results = {}
    local resps = { ngx.location.capture_multi(requests) }
    for i, res in ipairs(resps) do
        table.insert(results, {status = res.status, body = cjson.decode(res.body), header = res.header})
    end

    ngx.say(cjson.encode({results = results}))
';
}
location = /service1 {
default_type application/json;
echo '{"attr1":"val1"}';
}
location = /service2 {
default_type application/json;
echo '{"attr2":"val2"}';
}
}
}
Related guidance
Backends for Frontends pattern
Gateway Offloading pattern
Gateway Routing pattern
Gateway Offloading pattern
12/18/2020 • 3 minutes to read • Edit Online
Offload shared or specialized service functionality to a gateway proxy. This pattern can simplify application
development by moving shared service functionality, such as the use of SSL certificates, from other parts of the
application into the gateway.
Solution
Offload some features into a gateway, particularly cross-cutting concerns such as certificate management,
authentication, SSL termination, monitoring, protocol translation, or throttling.
The following diagram shows a gateway that terminates inbound SSL connections. It requests data on behalf of
the original requestor from any HTTP server upstream of the gateway.
Example
Using Nginx as the SSL offload appliance, the following configuration terminates an inbound SSL connection and
distributes the connection to one of three upstream HTTP servers.
upstream iis {
server 10.3.0.10 max_fails=3 fail_timeout=15s;
server 10.3.0.20 max_fails=3 fail_timeout=15s;
server 10.3.0.30 max_fails=3 fail_timeout=15s;
}
server {
listen 443;
ssl on;
ssl_certificate /etc/nginx/ssl/domain.cer;
ssl_certificate_key /etc/nginx/ssl/domain.key;
location / {
set $targ iis;
proxy_pass http://$targ;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto https;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header Host $host;
}
}
Related guidance
Backends for Frontends pattern
Gateway Aggregation pattern
Gateway Routing pattern
Gateway Routing pattern
12/18/2020 • 3 minutes to read • Edit Online
Route requests to multiple services using a single endpoint. This pattern is useful when you wish to expose
multiple services on a single endpoint and route to the appropriate service based on the request.
Solution
Place a gateway in front of a set of applications, services, or deployments. Use application Layer 7 routing to
route the request to the appropriate instances.
With this pattern, the client application only needs to know about and communicate with a single endpoint. If a
service is consolidated or decomposed, the client does not necessarily require updating. It can continue making
requests to the gateway, and only the routing changes.
A gateway also lets you abstract backend services from the clients, allowing you to keep client calls simple while
enabling changes in the backend services behind the gateway. Client calls can be routed to whatever service or
services need to handle the expected client behavior, allowing you to add, split, and reorganize services behind
the gateway without changing the client.
This pattern can also help with deployment, by allowing you to manage how updates are rolled out to users.
When a new version of your service is deployed, it can be deployed in parallel with the existing version. Routing
lets you control what version of the service is presented to the clients, giving you the flexibility to use various
release strategies, whether incremental, parallel, or complete rollouts of updates. Any issues discovered after the
new service is deployed can be quickly reverted by making a configuration change at the gateway, without
affecting clients.
Example
Using Nginx as the router, the following is a simple example configuration file for a server that routes requests
for applications residing on different virtual directories to different machines at the back end.
server {
listen 80;
server_name domain.com;
location /app1 {
proxy_pass http://10.0.3.10:80;
}
location /app2 {
proxy_pass http://10.0.3.20:80;
}
location /app3 {
proxy_pass http://10.0.3.30:80;
}
}
On Azure, multiple services can be set up behind an Application Gateway instance, which provides layer-7
routing.
Related guidance
Backends for Frontends pattern
Gateway Aggregation pattern
Gateway Offloading pattern
Geodes
12/18/2020 • 5 minutes to read • Edit Online
The geode pattern involves deploying a collection of backend services into a set of geographical nodes, each of
which can service any request for any client in any region. This pattern allows serving requests in an active-active
style, improving latency and increasing availability by distributing request processing around the globe.
Solution
Deploy the service into a number of satellite deployments spread around the globe, each of which is called a
geode. The geode pattern harnesses key features of Azure to route traffic via the shortest path to a nearby geode,
which improves latency and performance. Each geode is behind a global load balancer, and uses a geo-replicated
read-write service like Azure Cosmos DB to host the data plane, ensuring cross-geode data consistency. Data
replication services ensure that data stores are identical across geodes, so all requests can be served from all
geodes.
The key difference between a deployment stamp and a geode is that geodes never exist in isolation. There should
always be more than one geode in a production platform.
Geodes have the following characteristics:
Consist of a collection of disparate types of resources, often defined in a template.
Have no dependencies outside of the geode footprint and are self-contained. No geode is dependent on
another to operate, and if one dies, the others continue to operate.
Are loosely coupled via an edge network and replication backplane. For example, you can use Azure Traffic
Manager or Azure Front Door for fronting the geodes, while Azure Cosmos DB can act as the replication
backplane. Geodes are not the same as clusters because they share a replication backplane, so the platform
takes care of quorum issues.
The geode pattern occurs in big data architectures that use commodity hardware to process data colocated on the
same machine, and MapReduce to consolidate results across machines. Another usage is near-edge compute,
which brings compute closer to the intelligent edge of the network to reduce response time.
Services can use this pattern over dozens or hundreds of geodes. Furthermore, the resiliency of the whole
solution increases with each added geode, since any geodes can take over if a regional outage takes one or more
geodes offline.
It's also possible to augment local availability techniques, such as availability zones or paired regions, with the
geode pattern for global availability. This increases complexity, but is useful if your architecture is underpinned by
a storage engine such as blob storage that can only replicate to a paired region. You can deploy geodes into an
intra-zone, zonal, or regional footprint, with a mind to regulatory or latency constraints on location.
Examples
Windows Active Directory implements an early variant of this pattern. Multi-primary replication means all
updates and requests can in theory be served from all serviceable nodes, but FSMO roles mean that all geodes
aren't equal.
A QnA sample application on GitHub showcases this design pattern in practice.
Health Endpoint Monitoring pattern
12/18/2020 • 13 minutes to read • Edit Online
Implement functional checks in an application that external tools can access through exposed endpoints at
regular intervals. This can help to verify that applications and services are performing correctly.
Solution
Implement health monitoring by sending requests to an endpoint on the application. The application should
perform the necessary checks, and return an indication of its status.
A health monitoring check typically combines two factors:
The checks (if any) performed by the application or service in response to the request to the health
verification endpoint.
Analysis of the results by the tool or framework that performs the health verification check.
The response code indicates the status of the application and, optionally, any components or services it uses.
The latency or response time check is performed by the monitoring tool or framework. The figure provides an
overview of the pattern.
Other checks that might be carried out by the health monitoring code in the application include:
Checking cloud storage or a database for availability and response time.
Checking other resources or services located in the application, or located elsewhere but used by the
application.
Services and tools are available that monitor web applications by submitting a request to a configurable set of
endpoints, and evaluating the results against a set of configurable rules. It's relatively easy to create a service
endpoint whose sole purpose is to perform some functional tests on the system.
Typical checks that can be performed by the monitoring tools include:
Validating the response code. For example, an HTTP response of 200 (OK) indicates that the application
responded without error. The monitoring system might also check for other response codes to give more
comprehensive results.
Checking the content of the response to detect errors, even when a 200 (OK) status code is returned. This
can detect errors that affect only a section of the returned web page or service response. For example,
checking the title of a page or looking for a specific phrase that indicates the correct page was returned.
Measuring the response time, which indicates a combination of the network latency and the time that the
application took to execute the request. An increasing value can indicate an emerging problem with the
application or network.
Checking resources or services located outside the application, such as a content delivery network used by
the application to deliver content from global caches.
Checking for expiration of SSL certificates.
Measuring the response time of a DNS lookup for the URL of the application to measure DNS latency and
DNS failures.
Validating the URL returned by the DNS lookup to ensure correct entries. This can help to avoid malicious
request redirection through a successful attack on the DNS server.
It's also useful, where possible, to run these checks from different on-premises or hosted locations to measure
and compare response times. Ideally you should monitor applications from locations that are close to
customers to get an accurate view of the performance from each location. In addition to providing a more
robust checking mechanism, the results can help you decide on the deployment location for the application—
and whether to deploy it in more than one datacenter.
Tests should also be run against all the service instances that customers use to ensure the application is working
correctly for all customers. For example, if customer storage is spread across more than one storage account,
the monitoring process should check all of these.
Make sure that the application correctly returns a 200 (OK) only when the target resource is found and
processed. In some scenarios, such as when using a master page to host the target web page, the server
sends back a 200 (OK) status code instead of a 404 (Not Found) code, even when the target content page
was not found.
The number of endpoints to expose for an application. One approach is to expose at least one endpoint for the
core services that the application uses and another for lower priority services, allowing different levels of
importance to be assigned to each monitoring result. Also consider exposing more endpoints, such as one for
each core service, for additional monitoring granularity. For example, a health verification check might check the
database, storage, and an external geocoding service that an application uses, with each requiring a different
level of uptime and response time. The application could still be healthy if the geocoding service, or some other
background task, is unavailable for a few minutes.
Whether to use the same endpoint for monitoring as is used for general access, but to a specific path designed
for health verification checks, for example, /HealthCheck/{GUID}/ on the general access endpoint. This allows
some functional tests in the application to be run by the monitoring tools, such as adding a new user
registration, signing in, and placing a test order, while also verifying that the general access endpoint is
available.
The type of information to collect in the service in response to monitoring requests, and how to return this
information. Most existing tools and frameworks look only at the HTTP status code that the endpoint returns. To
return and validate additional information, you might have to create a custom monitoring utility or service.
How much information to collect. Performing excessive processing during the check can overload the
application and impact other users. The time it takes might exceed the timeout of the monitoring system so it
marks the application as unavailable. Most applications include instrumentation such as error handlers and
performance counters that log performance and detailed error information; this might be sufficient instead of
returning additional information from a health verification check.
Caching the endpoint status. It could be expensive to run the health check too frequently. If the health status is
reported through a dashboard, for example, you don't want every request from the dashboard to trigger a
health check. Instead, periodically check the system health and cache the status. Expose an endpoint that returns
the cached status. A brief sketch of this approach appears after this list.
How to configure security for the monitoring endpoints to protect them from public access, which might expose
the application to malicious attacks, risk the exposure of sensitive information, or attract denial of service (DoS)
attacks. Typically this should be done in the application configuration so that it can be updated easily without
restarting the application. Consider using one or more of the following techniques:
Secure the endpoint by requiring authentication. You can do this by using an authentication security key
in the request header or by passing credentials with the request, provided that the monitoring service or
tool supports authentication.
Use an obscure or hidden endpoint. For example, expose the endpoint on a different IP address to
that used by the default application URL, configure the endpoint on a nonstandard HTTP port,
and/or use a complex path to the test page. You can usually specify additional endpoint addresses
and ports in the application configuration, and add entries for these endpoints to the DNS server
if required to avoid having to specify the IP address directly.
Expose a method on an endpoint that accepts a parameter such as a key value or an operation
mode value. Depending on the value supplied for this parameter, when a request is received the
code can perform a specific test or set of tests, or return a 404 (Not Found) error if the parameter
value isn't recognized. The recognized parameter values could be set in the application
configuration.
DoS attacks are likely to have less impact on a separate endpoint that performs basic
functional tests without compromising the operation of the application. Ideally, avoid using a
test that might expose sensitive information. If you must return information that might be
useful to an attacker, consider how you'll protect the endpoint and the data from unauthorized
access. In this case just relying on obscurity isn't enough. You should also consider using an
HTTPS connection and encrypting any sensitive data, although this will increase the load on
the server.
How to access an endpoint that's secured using authentication. Not all tools and frameworks can be
configured to include credentials with the health verification request. For example, Microsoft Azure built-
in health verification features can't provide authentication credentials. Some third-party alternatives are
Pingdom, Panopta, NewRelic, and Statuscake.
How to ensure that the monitoring agent is performing correctly. One approach is to expose an endpoint
that simply returns a value from the application configuration or a random value that can be used to test
the agent.
Also ensure that the monitoring system performs checks on itself, such as a self-test and built-in test,
to avoid it issuing false positive results.
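As an illustration of the status-caching consideration noted earlier in this list, the following minimal sketch caches
the result of the (possibly expensive) checks and serves it from memory. The class name, interval, and placeholder
check are illustrative only and follow the style of the sample shown below.
using System;
using System.Net;
using System.Web.Mvc;

public class CachedHealthCheckController : Controller
{
    private static readonly object syncLock = new object();
    private static readonly TimeSpan refreshInterval = TimeSpan.FromSeconds(30);
    private static DateTime lastChecked = DateTime.MinValue;
    private static bool lastResultHealthy = true;

    public ActionResult Status()
    {
        lock (syncLock)
        {
            // Rerun the real checks only when the cached result has expired.
            if (DateTime.UtcNow - lastChecked > refreshInterval)
            {
                lastResultHealthy = RunCoreChecks();
                lastChecked = DateTime.UtcNow;
            }
        }
        return new HttpStatusCodeResult(
            (int)(lastResultHealthy ? HttpStatusCode.OK : HttpStatusCode.InternalServerError));
    }

    private static bool RunCoreChecks()
    {
        // Placeholder for the real checks (database, storage, dependent services).
        return true;
    }
}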
Example
The following code examples, taken from the HealthCheckController class (a sample that demonstrates this
pattern is available on GitHub), demonstrates exposing an endpoint for performing a range of health checks.
The CoreServices method, shown below in C#, performs a series of checks on services used in the application.
If all of the tests run without error, the method returns a 200 (OK) status code. If any of the tests raises an
exception, the method returns a 500 (Internal Error) status code. The method could optionally return additional
information when an error occurs, if the monitoring tool or framework is able to use it.
public ActionResult CoreServices()
{
  try
  {
    // Run a simple check to ensure the database is available.
    DataStore.Instance.CoreHealthCheck();
  }
  catch (Exception ex)
  {
    Trace.TraceError("Exception in basic health check: {0}", ex.Message);
    // This can optionally return different status codes based on the exception.
    // Optionally it could return more details about the exception.
    // The additional information could be used by administrators who access the
    // endpoint with a browser, or using a ping utility that can display the
    // additional information.
    return new HttpStatusCodeResult((int)HttpStatusCode.InternalServerError);
  }
  return new HttpStatusCodeResult((int)HttpStatusCode.OK);
}
The ObscurePath method shows how you can read a path from the application configuration and use it as the
endpoint for tests. This example, in C#, also shows how you can accept an ID as a parameter and use it to check
for valid requests.
public ActionResult ObscurePath(string id)
{
  // The obscure path can be set through configuration to hide the endpoint.
  var hiddenPathKey = CloudConfigurationManager.GetSetting("Test.ObscurePath");
  // If the value passed does not match that in configuration, return 404 (Not Found).
  if (!string.Equals(id, hiddenPathKey))
  {
    return new HttpStatusCodeResult((int)HttpStatusCode.NotFound);
  }
  // Otherwise continue and run the core health checks.
  return this.CoreServices();
}
The TestResponseFromConfig method shows how you can expose an endpoint that performs a check for a
specified configuration setting value.
public ActionResult TestResponseFromConfig()
{
  // Health check that returns a response code set in configuration for testing.
  var returnStatusCodeSetting = CloudConfigurationManager.GetSetting("Test.ReturnStatusCode");

  int returnStatusCode;
  if (!int.TryParse(returnStatusCodeSetting, out returnStatusCode))
  {
    returnStatusCode = (int)HttpStatusCode.OK;
  }
  return new HttpStatusCodeResult(returnStatusCode);
}
Even though Azure provides a reasonably comprehensive set of monitoring options, you can use
additional services and tools to provide extra information. Azure Management Services provides a
built-in monitoring mechanism for alert rules. The alerts section of the management services page in
the Azure portal allows you to configure up to ten alert rules per subscription for your services.
These rules specify a condition and a threshold value for a service such as CPU load, or the number
of requests or errors per second, and the service can automatically send email notifications to
addresses you define in each rule.
The conditions you can monitor vary depending on the hosting mechanism you choose for your application
(such as Web Sites, Cloud Services, Virtual Machines, or Mobile Services), but all of these include the ability to
create an alert rule that uses a web endpoint you specify in the settings for your service. This endpoint should
respond in a timely way so that the alert system can detect that the application is operating correctly.
If you host your application in Azure Cloud Services web and worker roles or Virtual Machines, you can take
advantage of one of the built-in services in Azure called Traffic Manager. Traffic Manager is a routing and load-
balancing service that can distribute requests to specific instances of your Cloud Services hosted application
based on a range of rules and settings.
In addition to routing requests, Traffic Manager pings a URL, port, and relative path that you specify on a regular
basis to determine which instances of the application defined in its rules are active and are responding to
requests. If it detects a status code 200 (OK), it marks the application as available. Any other status code causes
Traffic Manager to mark the application as offline. You can view the status in the Traffic Manager console, and
configure the rule to reroute requests to other instances of the application that are responding.
However, Traffic Manager will only wait ten seconds to receive a response from the monitoring URL. Therefore,
you should ensure that your health verification code executes in this time, allowing for network latency for the
round trip from Traffic Manager to your application and back again.
Read more information about using Traffic Manager to monitor your applications. Traffic Manager is also
discussed in Multiple Datacenter Deployment Guidance.
Related guidance
The following guidance can be useful when implementing this pattern:
Instrumentation and Telemetry Guidance. Checking the health of services and components is typically done
by probing, but it's also useful to have information in place to monitor application performance and detect
events that occur at runtime. This data can be transmitted back to monitoring tools as additional information
for health monitoring. Instrumentation and Telemetry Guidance explores gathering remote diagnostics
information that's collected by instrumentation in applications.
Receiving alert notifications.
This pattern includes a downloadable sample application.
Index Table pattern
12/18/2020 • 9 minutes to read • Edit Online
Create indexes over the fields in data stores that are frequently referenced by queries. This pattern can improve
query performance by allowing applications to more quickly locate the data to retrieve from a data store.
While the primary key is valuable for queries that fetch data based on the value of this key, an application might
not be able to use the primary key if it needs to retrieve data based on some other field. In the customers
example, an application can't use the Customer ID primary key to retrieve customers if it queries data solely by
referencing the value of some other attribute, such as the town in which the customer is located. To perform a
query such as this, the application might have to fetch and examine every customer record, which could be a slow
process.
Many relational database management systems support secondary indexes. A secondary index is a separate data
structure that's organized by one or more nonprimary (secondary) key fields, and it indicates where the data for
each indexed value is stored. The items in a secondary index are typically sorted by the value of the secondary
keys to enable fast lookup of data. These indexes are usually maintained automatically by the database
management system.
You can create as many secondary indexes as you need to support the different queries that your application
performs. For example, in a Customers table in a relational database where the Customer ID is the primary key,
it's beneficial to add a secondary index over the town field if the application frequently looks up customers by the
town where they reside.
However, although secondary indexes are common in relational systems, most NoSQL data stores used by cloud
applications don't provide an equivalent feature.
Solution
If the data store doesn't support secondary indexes, you can emulate them manually by creating your own index
tables. An index table organizes the data by a specified key. Three strategies are commonly used for structuring
an index table, depending on the number of secondary indexes that are required and the nature of the queries
that an application performs.
The first strategy is to duplicate the data in each index table but organize it by different keys (complete
denormalization). The next figure shows index tables that organize the same customer information by Town and
LastName.
This strategy is appropriate if the data is relatively static compared to the number of times it's queried using each
key. If the data is more dynamic, the processing overhead of maintaining each index table becomes too large for
this approach to be useful. Also, if the volume of data is very large, the amount of space required to store the
duplicate data is significant.
The second strategy is to create normalized index tables organized by different keys and reference the original
data by using the primary key rather than duplicating it, as shown in the following figure. The original data is
called a fact table.
This technique saves space and reduces the overhead of maintaining duplicate data. The disadvantage is that an
application has to perform two lookup operations to find data using a secondary key. It has to find the primary
key for the data in the index table, and then use the primary key to look up the data in the fact table.
The third strategy is to create partially normalized index tables organized by different keys that duplicate
frequently retrieved fields. Reference the fact table to access less frequently accessed fields. The next figure shows
how commonly accessed data is duplicated in each index table.
With this strategy, you can strike a balance between the first two approaches. The data for common queries can
be retrieved quickly by using a single lookup, while the space and maintenance overhead isn't as significant as
duplicating the entire data set.
If an application frequently queries data by specifying a combination of values (for example, “Find all customers
that live in Redmond and that have a last name of Smith”), you could implement the keys to the items in the index
table as a concatenation of the Town attribute and the LastName attribute. The next figure shows an index table
based on composite keys. The keys are sorted by Town, and then by LastName for records that have the same
value for Town.
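For illustration, a composite key like this can be built by concatenating the attribute values with a separator. The following sketch is an assumption about one way to format such a key; the sample data store might use a different scheme.
public static class IndexKeys
{
    // Composite secondary key ordered by Town, then LastName.
    // Items with the same Town sort together, then by LastName within the town.
    public static string ForTownAndLastName(string town, string lastName)
    {
        return $"{town}:{lastName}";   // e.g. "Redmond:Smith"
    }
}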
Index tables can speed up query operations over sharded data, and are especially useful where the shard key is
hashed. The next figure shows an example where the shard key is a hash of the Customer ID. The index table can
organize data by the nonhashed value (Town and LastName), and provide the hashed shard key as the lookup
data. This can save the application from repeatedly calculating hash keys (an expensive operation) if it needs to
retrieve data that falls within a range, or it needs to fetch data in order of the nonhashed key. For example, a
query such as “Find all customers that live in Redmond” can be quickly resolved by locating the matching items in
the index table, where they're all stored in a contiguous block. Then, follow the references to the customer data
using the shard keys stored in the index table.
Issues and considerations
Consider the following points when deciding how to implement this pattern:
The overhead of maintaining secondary indexes can be significant. You must analyze and understand the
queries that your application uses. Only create index tables when they're likely to be used regularly. Don't
create speculative index tables to support queries that an application doesn't perform, or performs only
occasionally.
Duplicating data in an index table can add significant overhead in storage costs and the effort required to
maintain multiple copies of data.
Implementing an index table as a normalized structure that references the original data requires an
application to perform two lookup operations to find data. The first operation searches the index table to
retrieve the primary key, and the second uses the primary key to fetch the data.
If a system incorporates a number of index tables over very large data sets, it can be difficult to maintain
consistency between index tables and the original data. It might be possible to design the application
around the eventual consistency model. For example, to insert, update, or delete data, an application could
post a message to a queue and let a separate task perform the operation and maintain the index tables
that reference this data asynchronously. For more information about implementing eventual consistency,
see the Data Consistency Primer.
Microsoft Azure storage tables support transactional updates for changes made to data held in the
same partition (referred to as entity group transactions). If you can store the data for a fact table and
one or more index tables in the same partition, you can use this feature to help ensure consistency.
Example
Azure storage tables provide a highly scalable key/value data store for applications running in the cloud.
Applications store and retrieve data values by specifying a key. The data values can contain multiple fields, but the
structure of a data item is opaque to table storage, which simply handles a data item as an array of bytes.
Azure storage tables also support sharding. The sharding key includes two elements, a partition key and a row
key. Items that have the same partition key are stored in the same partition (shard), and the items are stored in
row key order within a shard. Table storage is optimized for performing queries that fetch data falling within a
contiguous range of row key values within a partition. If you're building cloud applications that store information
in Azure tables, you should structure your data with this feature in mind.
For example, consider an application that stores information about movies. The application frequently queries
movies by genre (action, documentary, historical, comedy, drama, and so on). You could create an Azure table
with partitions for each genre by using the genre as the partition key, and specifying the movie name as the row
key, as shown in the next figure.
This approach is less effective if the application also needs to query movies by starring actor. In this case, you can
create a separate Azure table that acts as an index table. The partition key is the actor and the row key is the
movie name. The data for each actor will be stored in separate partitions. If a movie stars more than one actor, the
same movie will occur in multiple partitions.
You can duplicate the movie data in the values held by each partition by adopting the first approach described in
the Solution section above. However, it's likely that each movie will be replicated several times (once for each
actor), so it might be more efficient to partially denormalize the data to support the most common queries (such
as the names of the other actors) and enable an application to retrieve any remaining details by including the
partition key necessary to find the complete information in the genre partitions. This approach is described by
the third option in the Solution section. The next figure shows this approach.
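The following sketch shows how the genre table and the actor index table described above might be modeled with the Azure Table storage SDK. The entity shapes and the choice of the Microsoft.Azure.Cosmos.Table library are assumptions for illustration; the pattern itself doesn't depend on a particular client library.
using Microsoft.Azure.Cosmos.Table;

// Fact table: partitioned by genre, row key is the movie name.
public class MovieEntity : TableEntity
{
    public MovieEntity() { }                 // Required by the SDK.
    public MovieEntity(string genre, string movieName)
        : base(partitionKey: genre, rowKey: movieName) { }

    public string Starring { get; set; }     // Full cast and other details.
}

// Index table: partitioned by actor, row key is the movie name.
// Only frequently read fields are duplicated; Genre points back to the fact table.
public class ActorIndexEntity : TableEntity
{
    public ActorIndexEntity() { }
    public ActorIndexEntity(string actor, string movieName)
        : base(partitionKey: actor, rowKey: movieName) { }

    public string Genre { get; set; }        // Partition key of the full record.
    public string OtherActors { get; set; }  // Commonly queried duplicated data.
}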
Related patterns and guidance
The following patterns and guidance might also be relevant when implementing this pattern:
Data Consistency Primer. An index table must be maintained as the data that it indexes changes. In the cloud, it
might not be possible or appropriate to perform operations that update an index as part of the same
transaction that modifies the data. In that case, an eventually consistent approach is more suitable. Provides
information on the issues surrounding eventual consistency.
Sharding pattern. The Index Table pattern is frequently used in conjunction with data partitioned by using
shards. The Sharding pattern provides more information on how to divide a data store into a set of shards.
Materialized View pattern. Instead of indexing data to support queries that summarize data, it might be more
appropriate to create a materialized view of the data. Describes how to support efficient summary queries by
generating prepopulated views over data.
Leader Election pattern
12/18/2020 • 11 minutes to read • Edit Online
Solution
A single task instance should be elected to act as the leader, and this instance should coordinate the actions of the
other subordinate task instances. If all of the task instances are running the same code, they are each capable of
acting as the leader. Therefore, the election process must be managed carefully to prevent two or more instances
taking over the leader role at the same time.
The system must provide a robust mechanism for selecting the leader. This method has to cope with events such
as network outages or process failures. In many solutions, the subordinate task instances monitor the leader
through some type of heartbeat method, or by polling. If the designated leader terminates unexpectedly, or a
network failure makes the leader unavailable to the subordinate task instances, it's necessary for them to elect a
new leader.
There are several strategies for electing a leader among a set of tasks in a distributed environment, including:
Selecting the task instance with the lowest-ranked instance or process ID.
Racing to acquire a shared, distributed mutex. The first task instance that acquires the mutex is the leader.
However, the system must ensure that, if the leader terminates or becomes disconnected from the rest of the
system, the mutex is released to allow another task instance to become the leader.
Implementing one of the common leader election algorithms such as the Bully Algorithm or the Ring
Algorithm. These algorithms assume that each candidate in the election has a unique ID, and that it can
communicate with the other candidates reliably.
Issues and considerations
Consider the following points when deciding how to implement this pattern:
The process of electing a leader should be resilient to transient and persistent failures.
It must be possible to detect when the leader has failed or has become otherwise unavailable (such as due to a
communications failure). How quickly detection is needed is system dependent. Some systems might be able
to function for a short time without a leader, during which a transient fault might be fixed. In other cases, it
might be necessary to detect leader failure immediately and trigger a new election.
In a system that implements horizontal autoscaling, the leader could be terminated if the system scales back
and shuts down some of the computing resources.
Using a shared, distributed mutex introduces a dependency on the external service that provides the mutex.
The service constitutes a single point of failure. If it becomes unavailable for any reason, the system won't be
able to elect a leader.
Using a single dedicated process as the leader is a straightforward approach. However, if the process fails
there could be a significant delay while it's restarted. The resulting latency can affect the performance and
response times of other processes if they're waiting for the leader to coordinate an operation.
Implementing one of the leader election algorithms manually provides the greatest flexibility for tuning and
optimizing the code.
Avoid making the leader a bottleneck in the system. The purpose of the leader is to coordinate the work of the
subordinate tasks, and it doesn't necessarily have to participate in this work itself—although it should be able
to do so if the task isn't elected as the leader.
Example
The DistributedMutex project in the LeaderElection solution (a sample that demonstrates this pattern is available
on GitHub) shows how to use a lease on an Azure Storage blob to provide a mechanism for implementing a
shared, distributed mutex. This mutex can be used to elect a leader among a group of role instances in an Azure
cloud service. The first role instance to acquire the lease is elected the leader, and remains the leader until it
releases the lease or isn't able to renew the lease. Other role instances can continue to monitor the blob lease in
case the leader is no longer available.
A blob lease is an exclusive write lock over a blob. A single blob can be the subject of only one lease at any
point in time. A role instance can request a lease over a specified blob, and it'll be granted the lease if no other
role instance holds a lease over the same blob. Otherwise the request will throw an exception.
To avoid a faulted role instance retaining the lease indefinitely, specify a lifetime for the lease. When this
expires, the lease becomes available. However, while a role instance holds the lease it can request that the
lease is renewed, and it'll be granted the lease for a further period of time. The role instance can continually
repeat this process if it wants to retain the lease. For more information on how to lease a blob, see Lease Blob
(REST API).
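As an illustration of the leasing mechanism described above, the following sketch acquires, renews, and releases a blob lease using the current Azure.Storage.Blobs SDK, which differs from the older storage client library used in the GitHub sample; the container name, blob name, and 60-second lease duration are assumptions.
using System;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized;

public static class BlobLeaseExample
{
    public static async Task DemonstrateLeaseAsync(string connectionString)
    {
        // Only one caller at a time can hold the lease on this blob.
        var blobClient = new BlobContainerClient(connectionString, "leases")
            .GetBlobClient("MyLeaderCoordinatorTask");
        var leaseClient = blobClient.GetBlobLeaseClient();

        // Acquire throws if another instance already holds the lease.
        await leaseClient.AcquireAsync(TimeSpan.FromSeconds(60));

        // While this instance is the leader, renew the lease before it expires.
        await leaseClient.RenewAsync();

        // Release the lease when stepping down so another instance can take over.
        await leaseClient.ReleaseAsync();
    }
}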
The BlobDistributedMutex class in the C# example below contains the RunTaskWhenMutexAcquired method that
enables a role instance to attempt to acquire a lease over a specified blob. The details of the blob (the name,
container, and storage account) are passed to the constructor in a BlobSettings object when the
BlobDistributedMutex object is created (this object is a simple struct that is included in the sample code). The
constructor also accepts a Task that references the code that the role instance should run if it successfully
acquires the lease over the blob and is elected the leader. Note that the code that handles the low-level details of
acquiring the lease is implemented in a separate helper class named BlobLeaseManager .
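A skeleton of what the BlobDistributedMutex class described above might look like is sketched below. It's reconstructed from the description rather than copied from the sample, so any member names not mentioned in the text are assumptions.
public class BlobDistributedMutex
{
    private readonly BlobSettings blobSettings;
    private readonly Func<CancellationToken, Task> taskToRunWhenLeaseAcquired;

    // The blob details and the leader task are supplied when the mutex is created.
    public BlobDistributedMutex(BlobSettings blobSettings,
        Func<CancellationToken, Task> taskToRunWhenLeaseAcquired)
    {
        this.blobSettings = blobSettings;
        this.taskToRunWhenLeaseAcquired = taskToRunWhenLeaseAcquired;
    }

    // Attempt to acquire the lease; if successful, run the leader task.
    public async Task RunTaskWhenMutexAcquired(CancellationToken token)
    {
        var leaseManager = new BlobLeaseManager(blobSettings);
        await this.RunTaskWhenBlobLeaseAcquired(leaseManager, token);
    }

    // Shown in the next code sample.
    private async Task RunTaskWhenBlobLeaseAcquired(
        BlobLeaseManager leaseManager, CancellationToken token) { ... }
}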
The RunTaskWhenMutexAcquired method in the code sample above invokes the RunTaskWhenBlobLeaseAcquired
method shown in the following code sample to actually acquire the lease. The RunTaskWhenBlobLeaseAcquired
method runs asynchronously. If the lease is successfully acquired, the role instance has been elected the leader.
The purpose of the taskToRunWhenLeaseAcquired delegate is to perform the work that coordinates the other role
instances. If the lease isn't acquired, another role instance has been elected as the leader and the current role
instance remains a subordinate. Note that the TryAcquireLeaseOrWait method is a helper method that uses the
BlobLeaseManager object to acquire the lease.
private async Task RunTaskWhenBlobLeaseAcquired(
BlobLeaseManager leaseManager, CancellationToken token)
{
while (!token.IsCancellationRequested)
{
// Try to acquire the blob lease.
// Otherwise wait for a short time before trying again.
string leaseId = await this.TryAcquireLeaseOrWait(leaseManager, token);
if (!string.IsNullOrEmpty(leaseId))
{
// Create a new linked cancellation token source so that if either the
// original token is canceled or the lease can't be renewed, the
// leader task can be canceled.
using (var leaseCts =
CancellationTokenSource.CreateLinkedTokenSource(new[] { token }))
{
// Run the leader task.
var leaderTask = this.taskToRunWhenLeaseAcquired.Invoke(leaseCts.Token);
...
}
}
}
...
}
The task started by the leader also runs asynchronously. While this task is running, the
RunTaskWhenBlobLeaseAcquired method shown in the following code sample periodically attempts to renew the
lease. This helps to ensure that the role instance remains the leader. In the sample solution, the delay between
renewal requests is less than the time specified for the duration of the lease in order to prevent another role
instance from being elected the leader. If the renewal fails for any reason, the task is canceled.
If the lease fails to be renewed or the task is canceled (possibly as a result of the role instance shutting down), the
lease is released. At this point, this or another role instance might be elected as the leader. The code extract below
shows this part of the process.
        // Keep renewing the lease in regular intervals.
        // If the lease can't be renewed, the renewal task completes.
        var renewLeaseTask =
          this.KeepRenewingLease(leaseManager, leaseId, leaseCts.Token);

        // When any task completes (either the leader task itself or when it
        // couldn't renew the lease) then cancel the other task.
        await CancelAllWhenAnyCompletes(leaderTask, renewLeaseTask, leaseCts);
      }
    }
  }
  ...
}
The KeepRenewingLease method is another helper method that uses the BlobLeaseManager object to renew the
lease. The CancelAllWhenAnyCompletes method cancels the tasks specified as the first two parameters. The
following diagram illustrates using the BlobDistributedMutex class to elect a leader and run a task that
coordinates operations.
The following code example shows how to use the BlobDistributedMutex class in a worker role. This code
acquires a lease over a blob named MyLeaderCoordinatorTask in the leases container in development storage, and
specifies that the code defined in the MyLeaderCoordinatorTask method should run if the role instance is elected
the leader.
var settings = new BlobSettings(CloudStorageAccount.DevelopmentStorageAccount,
"leases", "MyLeaderCoordinatorTask");
var cts = new CancellationTokenSource();
var mutex = new BlobDistributedMutex(settings, MyLeaderCoordinatorTask);
mutex.RunTaskWhenMutexAcquired(cts.Token);
...
Materialized View pattern
Generate prepopulated views over the data in one or more data stores when the data isn't ideally formatted for
required query operations. This can help support efficient querying and data extraction, and improve application
performance.
Solution
To support efficient querying, a common solution is to generate, in advance, a view that materializes the data in a
format suited to the required results set. The Materialized View pattern describes generating prepopulated views
of data in environments where the source data isn't in a suitable format for querying, where generating a
suitable query is difficult, or where query performance is poor due to the nature of the data or the data store.
These materialized views, which only contain data required by a query, allow applications to quickly obtain the
information they need. In addition to joining tables or combining data entities, materialized views can include the
current values of calculated columns or data items, the results of combining values or executing transformations
on the data items, and values specified as part of the query. A materialized view can even be optimized for just a
single query.
A key point is that a materialized view and the data it contains is completely disposable because it can be entirely
rebuilt from the source data stores. A materialized view is never updated directly by an application, and so it's a
specialized cache.
When the source data for the view changes, the view must be updated to include the new information. You can
schedule this to happen automatically, or when the system detects a change to the original data. In some cases it
might be necessary to regenerate the view manually. The figure shows an example of how the Materialized View
pattern might be used.
Issues and considerations
Consider the following points when deciding how to implement this pattern:
How and when the view will be updated. Ideally it'll regenerate in response to an event indicating a change to the
source data, although this can lead to excessive overhead if the source data changes rapidly. Alternatively,
consider using a scheduled task, an external trigger, or a manual action to regenerate the view.
In some systems, such as when using the Event Sourcing pattern to maintain a store of only the events that
modified the data, materialized views are necessary. Prepopulating views by examining all events to determine
the current state might be the only way to obtain information from the event store. If you're not using Event
Sourcing, you need to consider whether a materialized view is helpful or not. Materialized views tend to be
specifically tailored to one, or a small number of queries. If many queries are used, materialized views can result
in unacceptable storage capacity requirements and storage cost.
Consider the impact on data consistency when generating the view, and when updating the view if this occurs on
a schedule. If the source data is changing at the point when the view is generated, the copy of the data in the
view won't be fully consistent with the original data.
Consider where you'll store the view. The view doesn't have to be located in the same store or partition as the
original data. It can be a subset from a few different partitions combined.
A view can be rebuilt if lost. Because of that, if the view is transient and is only used to improve query
performance by reflecting the current state of the data, or to improve scalability, it can be stored in a cache or in a
less reliable location.
When defining a materialized view, maximize its value by adding data items or columns to it based on
computation or transformation of existing data items, on values passed in the query, or on combinations of these
values when appropriate.
Where the storage mechanism supports it, consider indexing the materialized view to further increase
performance. Most relational databases support indexing for views, as do big data solutions based on Apache
Hadoop.
Example
The following figure shows an example of using the Materialized View pattern to generate a summary of sales.
Data in the Order, OrderItem, and Customer tables in separate partitions in an Azure storage account are
combined to generate a view containing the total sales value for each product in the Electronics category, along
with a count of the number of customers who made purchases of each item.
Creating this materialized view requires complex queries. However, by exposing the query result as a
materialized view, users can easily obtain the results and use them directly or incorporate them in another query.
The view is likely to be used in a reporting system or dashboard, and can be updated on a scheduled basis such
as weekly.
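A rough sketch of the kind of aggregation that would populate such a view is shown below, using LINQ over in-memory collections purely for illustration; the Order and OrderItem shapes mirror the tables in the example, but their fields and the helper class are assumptions.
using System.Collections.Generic;
using System.Linq;

public record Order(int OrderId, int CustomerId);
public record OrderItem(int OrderId, string Product, string Category, decimal Value);

public static class SalesSummaryBuilder
{
    // Builds the materialized view: total sales value and distinct customer
    // count per product in the Electronics category.
    public static object BuildElectronicsSummary(
        IEnumerable<Order> orders, IEnumerable<OrderItem> orderItems)
    {
        return (from item in orderItems
                where item.Category == "Electronics"
                join order in orders on item.OrderId equals order.OrderId
                group new { item, order } by item.Product into g
                select new
                {
                    Product = g.Key,
                    TotalSalesValue = g.Sum(x => x.item.Value),
                    CustomerCount = g.Select(x => x.order.CustomerId).Distinct().Count()
                }).ToList();
        // The result would be written to a summary table or cache that
        // reporting queries read directly, instead of recomputing the join.
    }
}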
Although this example uses Azure table storage, many relational database management systems also
provide native support for materialized views.
Pipes and Filters pattern
Decompose a task that performs complex processing into a series of separate elements that can be reused. This
can improve performance, scalability, and reusability by allowing task elements that perform the processing to be
deployed and scaled independently.
Some of the tasks that the monolithic modules perform are functionally very similar, but the modules have been
designed separately. The code that implements the tasks is closely coupled in a module, and has been developed
with little or no thought given to reuse or scalability.
However, the processing tasks performed by each module, or the deployment requirements for each task, could
change as business requirements are updated. Some tasks might be compute intensive and could benefit from
running on powerful hardware, while others might not require such expensive resources. Also, additional
processing might be required in the future, or the order in which the tasks performed by the processing could
change. A solution is required that addresses these issues, and increases the possibilities for code reuse.
Solution
Break down the processing required for each stream into a set of separate components (or filters), each
performing a single task. By standardizing the format of the data that each component receives and sends, these
filters can be combined together into a pipeline. This helps to avoid duplicating code, and makes it easy to
remove, replace, or integrate additional components if the processing requirements change. The next figure
shows a solution implemented using pipes and filters.
The time it takes to process a single request depends on the speed of the slowest filter in the pipeline. One or
more filters could be a bottleneck, especially if a large number of requests appear in a stream from a particular
data source. A key advantage of the pipeline structure is that it provides opportunities for running parallel
instances of slow filters, enabling the system to spread the load and improve throughput.
The filters that make up a pipeline can run on different machines, enabling them to be scaled independently and
take advantage of the elasticity that many cloud environments provide. A filter that is computationally intensive
can run on high-performance hardware, while other less demanding filters can be hosted on less expensive
commodity hardware. The filters don't even have to be in the same datacenter or geographic location, which
allows each element in a pipeline to run in an environment close to the resources it requires. The next figure
shows an example applied to the pipeline for the data from Source 1.
If the input and output of a filter are structured as a stream, it's possible to perform the processing for each filter
in parallel. The first filter in the pipeline can start its work and output its results, which are passed directly on to
the next filter in the sequence before the first filter has completed its work.
Another benefit is the resiliency that this model can provide. If a filter fails or the machine it's running on is no
longer available, the pipeline can reschedule the work that the filter was performing and direct this work to
another instance of the component. Failure of a single filter doesn't necessarily result in failure of the entire
pipeline.
Using the Pipes and Filters pattern in conjunction with the Compensating Transaction pattern is an alternative
approach to implementing distributed transactions. A distributed transaction can be broken down into separate,
compensable tasks, each of which can be implemented by using a filter that also implements the Compensating
Transaction pattern. The filters in a pipeline can be implemented as separate hosted tasks running close to the
data that they maintain.
If you're implementing the pipeline by using message queues (such as Microsoft Azure Service Bus
queues), the message queuing infrastructure might provide automatic duplicate message detection
and removal.
Context and state. In a pipeline, each filter essentially runs in isolation and shouldn't make any
assumptions about how it was invoked. This means that each filter should be provided with sufficient
context to perform its work. This context could include a large amount of state information.
It's possible to group filters that should scale together in the same process. For more information, see
the Compute Resource Consolidation pattern.
Use this pattern when:
Flexibility is required to allow reordering of the processing steps performed by an application, or the
capability to add and remove steps.
The system can benefit from distributing the processing for steps across different servers.
A reliable solution is required that minimizes the effects of failure in a step while data is being processed.
This pattern might not be useful when:
The processing steps performed by an application aren't independent, or they have to be performed
together as part of the same transaction.
The amount of context or state information required by a step makes this approach inefficient. It might be
possible to persist state information to a database instead, but don't use this strategy if the additional load
on the database causes excessive contention.
Example
You can use a sequence of message queues to provide the infrastructure required to implement a pipeline. An
initial message queue receives unprocessed messages. A component implemented as a filter task listens for a
message on this queue, performs its work, and then posts the transformed message to the next queue in the
sequence. Another filter task can listen for messages on this queue, process them, post the results to another
queue, and so on until the fully transformed data appears in the final message in the queue. The next figure
illustrates implementing a pipeline using message queues.
If you're building a solution on Azure you can use Service Bus queues to provide a reliable and scalable queuing
mechanism. The ServiceBusPipeFilter class shown below in C# demonstrates how you can implement a filter
that receives input messages from a queue, processes these messages, and posts the results to another queue.
The ServiceBusPipeFilter class is defined in the PipesAndFilters.Shared project available from GitHub.
public class ServiceBusPipeFilter
{
  ...
  private QueueClient inQueue;
  private QueueClient outQueue;
  ...
  public void Start()
  {
    ...
    // Create the inbound and outbound queue clients.
    this.inQueue = QueueClient.CreateFromConnectionString(...);
    this.outQueue = QueueClient.CreateFromConnectionString(...);
  }
  public void OnPipeFilterMessageAsync(
    Func<BrokeredMessage, Task<BrokeredMessage>> asyncFilterTask, ...)
  {
    ...
    this.inQueue.OnMessageAsync(
      async (msg) =>
      {
        ...
        // Process the filter and send the output to the
        // next queue in the pipeline.
        var outMessage = await asyncFilterTask(msg);
        if (this.outQueue != null)
        {
          await this.outQueue.SendAsync(outMessage);
        }
        // Note: There's a chance that the same message could be sent twice
        // or that a message gets processed by an upstream or downstream
        // filter at the same time.
        // This would happen in a situation where processing of a message was
        // completed, it was sent to the next pipe/queue, and then failed
        // to complete when using the PeekLock method.
        // Idempotent message processing and concurrency should be considered
        // in a real-world implementation.
      },
      options);
  }
  public void Close()
  {
    ...
    this.inQueue.Close();
    ...
  }
  ...
}
The Start method in the ServiceBusPipeFilter class connects to a pair of input and output queues, and the
Close method disconnects from the input queue. The OnPipeFilterMessageAsync method performs the actual
processing of messages; the asyncFilterTask parameter to this method specifies the processing to be performed.
The OnPipeFilterMessageAsync method waits for incoming messages on the input queue, runs the code specified
by the asyncFilterTask parameter over each message as it arrives, and posts the results to the output queue. The
queues themselves are specified by the constructor.
The sample solution implements filters in a set of worker roles. Each worker role can be scaled independently,
depending on the complexity of the business processing that it performs or the resources required for processing.
Additionally, multiple instances of each worker role can be run in parallel to improve throughput.
The following code shows an Azure worker role named PipeFilterARoleEntry , defined in the PipeFilterA project in
the sample solution.
public class PipeFilterARoleEntry : RoleEntryPoint
{
  ...
  private ServiceBusPipeFilter pipeFilterA;
  public override bool OnStart()
  {
    ...
    // The queue names are defined in the Constants class.
    this.pipeFilterA = new ServiceBusPipeFilter(...);
    this.pipeFilterA.Start();
    ...
  }
  public override void Run()
  {
    this.pipeFilterA.OnPipeFilterMessageAsync(async (msg) =>
    {
      // Clone the incoming message, mark it as processed by this filter,
      // and return it so that it's posted to the output queue.
      var newMsg = msg.Clone();
      ...
      newMsg.Properties.Add(Constants.FilterAMessageKey, "Complete");
      return newMsg;
    });
    ...
  }
  ...
}
This role contains a ServiceBusPipeFilter object. The OnStart method in the role connects to the queues for
receiving input messages and posting output messages (the names of the queues are defined in the Constants
class). The Run method invokes the OnPipeFilterMessageAsync method to perform some processing on each
message that's received (in this example, the processing is simulated by waiting for a short period of time). When
processing is complete, a new message is constructed containing the results (in this case, the input message has a
custom property added), and this message is posted to the output queue.
The sample code contains another worker role named PipeFilterBRoleEntry in the PipeFilterB project. This role is
similar to PipeFilterARoleEntry except that it performs different processing in the Run method. In the example
solution, these two roles are combined to construct a pipeline, the output queue for the PipeFilterARoleEntry
role is the input queue for the PipeFilterBRoleEntry role.
The sample solution also provides two additional roles named InitialSenderRoleEntry (in the InitialSender
project) and FinalReceiverRoleEntry (in the FinalReceiver project). The InitialSenderRoleEntry role provides the
initial message in the pipeline. The OnStart method connects to a single queue and the Run method posts a
message to this queue. This queue is the input queue used by the PipeFilterARoleEntry role, so posting a message
to it causes the message to be received and processed by the PipeFilterARoleEntry role. The processed message
then passes through the PipeFilterBRoleEntry role.
The input queue for the FinalReceiverRoleEntry role is the output queue for the PipeFilterBRoleEntry role. The
Run method in the FinalReceiverRoleEntry role, shown below, receives the message and performs some final
processing. Then it writes the values of the custom properties added by the filters in the pipeline to the trace
output.
// Excerpt: the end of the callback in the Run method that processes the final message.
return null;
});
...
}
...
}
Priority Queue pattern
Prioritize requests sent to services so that requests with a higher priority are received and processed more
quickly than those with a lower priority. This pattern is useful in applications that offer different service level
guarantees to individual clients.
Solution
A queue is usually a first-in, first-out (FIFO) structure, and consumers typically receive messages in the same
order that they were posted to the queue. However, some message queues support priority messaging. The
application posting a message can assign a priority and the messages in the queue are automatically reordered
so that those with a higher priority will be received before those with a lower priority. The figure illustrates a
queue with priority messaging.
Most message queue implementations support multiple consumers (following the Competing Consumers
pattern), and the number of consumer processes can be scaled up or down depending on demand.
In systems that don't support priority-based message queues, an alternative solution is to maintain a separate
queue for each priority. The application is responsible for posting messages to the appropriate queue. Each
queue can have a separate pool of consumers. Higher priority queues can have a larger pool of consumers
running on faster hardware than lower priority queues. The next figure illustrates using separate message
queues for each priority.
A variation on this strategy is to have a single pool of consumers that check for messages on high priority
queues first, and only then start to fetch messages from lower priority queues. There are some semantic
differences between a solution that uses a single pool of consumer processes (either with a single queue that
supports messages with different priorities or with multiple queues that each handle messages of a single
priority), and a solution that uses multiple queues with a separate pool for each queue.
In the single pool approach, higher priority messages are always received and processed before lower priority
messages. In theory, messages that have a very low priority could be continually superseded and might never be
processed. In the multiple pool approach, lower priority messages will always be processed, just not as quickly as
those of a higher priority (depending on the relative size of the pools and the resources that they have available).
Using a priority queuing mechanism can provide the following advantages:
It allows applications to meet business requirements that require prioritization of availability or
performance, such as offering different levels of service to specific groups of customers.
It can help to minimize operational costs. In the single queue approach, you can scale back the number of
consumers if necessary. High priority messages will still be processed first (although possibly more
slowly), and lower priority messages might be delayed for longer. If you've implemented the multiple
message queue approach with separate pools of consumers for each queue, you can reduce the pool of
consumers for lower priority queues, or even suspend processing for some very low priority queues by
stopping all the consumers that listen for messages on those queues.
The multiple message queue approach can help maximize application performance and scalability by
partitioning messages based on processing requirements. For example, vital tasks can be prioritized to be
handled by receivers that run immediately while less important background tasks can be handled by
receivers that are scheduled to run at less busy periods.
Example
Microsoft Azure doesn't provide a queuing mechanism that natively supports automatic prioritization of
messages through sorting. However, it does provide Azure Service Bus topics and subscriptions that support a
queuing mechanism that provides message filtering, together with a wide range of flexible capabilities that make
it ideal for use in most priority queue implementations.
An Azure solution can implement a Service Bus topic that an application can post messages to, in the same way as a
queue. Messages can contain metadata in the form of application-defined custom properties. Service Bus
subscriptions can be associated with the topic, and these subscriptions can filter messages based on their
properties. When an application sends a message to a topic, the message is directed to the appropriate
subscription where it can be read by a consumer. Consumer processes can retrieve messages from a subscription
using the same semantics as a message queue (a subscription is a logical queue). The following figure illustrates
implementing a priority queue with Azure Service Bus topics and subscriptions.
In the figure above, the application creates several messages and assigns a custom property called Priority in
each message with a value, either High or Low . The application posts these messages to a topic. The topic has
two associated subscriptions that both filter messages by examining the Priority property. One subscription
accepts messages where the Priority property is set to High , and the other accepts messages where the
Priority property is set to Low . A pool of consumers reads messages from each subscription. The high priority
subscription has a larger pool, and these consumers might be running on more powerful computers with more
resources available than the consumers in the low priority pool.
Note that there's nothing special about the designation of high and low priority messages in this example.
They're simply labels specified as properties in each message, and are used to direct messages to a specific
subscription. If additional priorities are required, it's relatively easy to create further subscriptions and pools of
consumer processes to handle these priorities.
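As a sketch of the mechanism described above, the following code creates the two filtered subscriptions using the Azure.Messaging.ServiceBus administration client. This SDK choice and the topic and subscription names are assumptions for illustration; the GitHub sample uses an older Service Bus library.
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus.Administration;

public static class PrioritySubscriptionSetup
{
    public static async Task CreateSubscriptionsAsync(string connectionString)
    {
        var admin = new ServiceBusAdministrationClient(connectionString);

        // Subscription that receives only messages whose Priority property is High.
        await admin.CreateSubscriptionAsync(
            new CreateSubscriptionOptions("messages", "highPriority"),
            new CreateRuleOptions("priorityFilter", new SqlRuleFilter("Priority = 'High'")));

        // Subscription that receives only messages whose Priority property is Low.
        await admin.CreateSubscriptionAsync(
            new CreateSubscriptionOptions("messages", "lowPriority"),
            new CreateRuleOptions("priorityFilter", new SqlRuleFilter("Priority = 'Low'")));
    }
}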
The PriorityQueue solution available on GitHub contains an implementation of this approach. This solution
contains two worker role projects named PriorityQueue.High and PriorityQueue.Low . These worker roles inherit
from the PriorityWorkerRole class that contains the functionality for connecting to a specified subscription in the
OnStart method.
The PriorityQueue.High and PriorityQueue.Low worker roles connect to different subscriptions, defined by their
configuration settings. An administrator can configure different numbers of each role to be run. Typically there'll
be more instances of the PriorityQueue.High worker role than the PriorityQueue.Low worker role.
The Run method in the PriorityWorkerRole class arranges for the virtual ProcessMessage method (also defined
in the PriorityWorkerRole class) to be run for each message received on the queue. The following code shows
the Run and ProcessMessage methods. The QueueManager class, defined in the PriorityQueue.Shared project,
provides helper methods for using Azure Service Bus queues.
public class PriorityWorkerRole : RoleEntryPoint
{
private QueueManager queueManager;
...
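The extract above is abbreviated; a sketch of what the Run and virtual ProcessMessage members described in the text might look like follows. It extends the class fragment above, and the subscription-name setting and the QueueManager method name are assumptions based on the description rather than a copy of the GitHub sample.
  public override void Run()
  {
    // Start listening for messages on the subscription configured for this role.
    var subscriptionName = CloudConfigurationManager.GetSetting("SubscriptionName");
    this.queueManager.ReceiveMessages(subscriptionName, this.ProcessMessage);
    ...
  }

  // Derived roles override this to supply priority-specific processing.
  protected virtual async Task ProcessMessage(BrokeredMessage message)
  {
    // Simulate message processing.
    await Task.Delay(TimeSpan.FromSeconds(2));
  }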
The PriorityQueue.High and PriorityQueue.Low worker roles both override the default functionality of the
ProcessMessage method. The code below shows the ProcessMessage method for the PriorityQueue.High worker
role.
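A sketch of how the PriorityQueue.High role might override ProcessMessage is shown below; this is illustrative rather than the exact code from the GitHub sample.
protected override async Task ProcessMessage(BrokeredMessage message)
{
  // Simulate the work performed for a high priority message.
  Trace.TraceInformation("High priority message processed: {0}", message.MessageId);
  await base.ProcessMessage(message);
}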
When an application posts messages to the topic associated with the subscriptions used by the
PriorityQueue.High and PriorityQueue.Low worker roles, it specifies the priority by using the Priority custom
property, as shown in the following code example. This code (implemented in the WorkerRole class in the
PriorityQueue.Sender project), uses the SendBatchAsync helper method of the QueueManager class to post
messages to a topic in batches.
// Send a low priority batch.
var lowMessages = new List<BrokeredMessage>();
this.queueManager.SendBatchAsync(lowMessages).Wait();
...
this.queueManager.SendBatchAsync(highMessages).Wait();
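The extract above omits how the message batches are built; the following sketch shows how each message might be assigned the Priority custom property before sending, assuming the BrokeredMessage API used elsewhere in the sample and an illustrative batch size.
// Build a batch of low priority messages, tagging each one with the
// custom Priority property used by the subscription filters.
var lowMessages = new List<BrokeredMessage>();
for (int i = 0; i < 10; i++)
{
  var message = new BrokeredMessage() { MessageId = Guid.NewGuid().ToString() };
  message.Properties["Priority"] = "Low";
  lowMessages.Add(message);
}
this.queueManager.SendBatchAsync(lowMessages).Wait();
// High priority messages are built the same way with Priority set to "High".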
Publisher-Subscriber pattern
Enable an application to announce events to multiple interested consumers asynchronously, without coupling the
senders to the receivers.
Also called: Pub/sub messaging
Solution
Introduce an asynchronous messaging subsystem that includes the following:
An input messaging channel used by the sender. The sender packages events into messages, using a known
message format, and sends these messages via the input channel. The sender in this pattern is also called
the publisher.
NOTE
A message is a packet of data. An event is a message that notifies other components about a change or an action
that has taken place.
One output messaging channel per consumer. The consumers are known as subscribers.
A mechanism for copying each message from the input channel to the output channels for all subscribers
interested in that message. This operation is typically handled by an intermediary such as a message
broker or event bus.
The following diagram shows the logical components of this pattern:
Pub/sub messaging has the following benefits:
It decouples subsystems that still need to communicate. Subsystems can be managed independently, and
messages can be properly managed even if one or more receivers are offline.
It increases scalability and improves responsiveness of the sender. The sender can quickly send a single
message to the input channel, then return to its core processing responsibilities. The messaging
infrastructure is responsible for ensuring messages are delivered to interested subscribers.
It improves reliability. Asynchronous messaging helps applications continue to run smoothly under
increased loads and handle intermittent failures more effectively.
It allows for deferred or scheduled processing. Subscribers can wait to pick up messages until off-peak
hours, or messages can be routed or processed according to a specific schedule.
It enables simpler integration between systems using different platforms, programming languages, or
communication protocols, as well as between on-premises systems and applications running in the cloud.
It facilitates asynchronous workflows across an enterprise.
It improves testability. Channels can be monitored and messages can be inspected or logged as part of an
overall integration test strategy.
It provides separation of concerns for your applications. Each application can focus on its core capabilities,
while the messaging infrastructure handles everything required to reliably route messages to multiple
consumers.
Example
The following diagram shows an enterprise integration architecture that uses Service Bus to coordinate
workflows, and Event Grid to notify subsystems of events that occur. For more information, see Enterprise
integration on Azure using message queues and events.
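As a minimal illustration of the publisher side of this pattern, the following sketch sends an event message to an Azure Service Bus topic using the Azure.Messaging.ServiceBus SDK; the topic name, event payload, and connection string are assumptions, and subscribers would each read from their own subscription as described above.
using System;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

public static class OrderEventPublisher
{
    public static async Task PublishOrderPlacedAsync(string connectionString)
    {
        // The publisher only knows about the input channel (the topic),
        // not about any of the subscribers.
        await using var client = new ServiceBusClient(connectionString);
        ServiceBusSender sender = client.CreateSender("events");

        var message = new ServiceBusMessage(BinaryData.FromString("{\"orderId\": 42}"))
        {
            Subject = "OrderPlaced"   // Lets subscriptions filter on the event type.
        };

        await sender.SendMessageAsync(message);
    }
}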
Related patterns and guidance
The following patterns and guidance might be relevant when implementing this pattern:
Choose between Azure services that deliver messages.
The Event-driven architecture style is an architecture style that uses pub/sub messaging.
Asynchronous Messaging Primer. Message queues are an asynchronous communications mechanism. If a
consumer service needs to send a reply to an application, it might be necessary to implement some form
of response messaging. The Asynchronous Messaging Primer provides information on how to implement
request/reply messaging using message queues.
Observer Pattern. The Publish-Subscribe pattern builds on the Observer pattern by decoupling subjects
from observers via asynchronous messaging.
Message Broker Pattern. Many messaging subsystems that support a publish-subscribe model are
implemented via a message broker.
Queue-Based Load Leveling pattern
12/18/2020 • 5 minutes to read • Edit Online
Use a queue that acts as a buffer between a task and a service it invokes in order to smooth intermittent heavy
loads that can cause the service to fail or the task to time out. This can help to minimize the impact of peaks in
demand on availability and responsiveness for both the task and the service.
Solution
Refactor the solution and introduce a queue between the task and the service. The task and the service run
asynchronously. The task posts a message containing the data required by the service to a queue. The queue
acts as a buffer, storing the message until it's retrieved by the service. The service retrieves the messages from
the queue and processes them. Requests from a number of tasks, which can be generated at a highly variable
rate, can be passed to the service through the same message queue. This figure shows using a queue to level
the load on a service.
The queue decouples the tasks from the service, and the service can handle the messages at its own pace
regardless of the volume of requests from concurrent tasks. Additionally, there's no delay to a task if the
service isn't available at the time it posts a message to the queue.
This pattern provides the following benefits:
It can help to maximize availability because delays arising in services won't have an immediate and
direct impact on the application, which can continue to post messages to the queue even when the
service isn't available or isn't currently processing messages.
It can help to maximize scalability because both the number of queues and the number of services can
be varied to meet demand.
It can help to control costs because the number of service instances deployed only have to be adequate
to meet average load rather than the peak load.
Some services implement throttling when demand reaches a threshold beyond which the system
could fail. Throttling can reduce the functionality available. You can implement load leveling with
these services to ensure that this threshold isn't reached.
Example
A web app writes data to an external data store. If a large number of instances of the web app run concurrently,
the data store might be unable to respond to requests quickly enough, causing requests to time out, be
throttled, or otherwise fail. The following diagram shows a data store being overwhelmed by a large number
of concurrent requests from instances of an application.
To resolve this, you can use a queue to level the load between the application instances and the data store. An
Azure Functions app reads the messages from the queue and performs the read/write requests to the data
store. The application logic in the function app can control the rate at which it passes requests to the data store,
to prevent the store from being overwhelmed. (Otherwise the function app will just re-introduce the same
problem at the back end.)
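A sketch of the consuming side of this example is shown below: a queue-triggered Azure Function that reads each message and applies the request to the data store at a controlled rate. The queue name, bindings, and the placeholder for the data store call are assumptions for illustration.
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class WriteToStoreFunction
{
    // The queue acts as the buffer between the web app instances and the data store.
    // The Functions host controls how many messages are processed concurrently,
    // so the data store sees a smoothed, controlled request rate.
    [FunctionName("WriteToStore")]
    public static void Run(
        [QueueTrigger("datastore-requests")] string queueMessage,
        ILogger log)
    {
        // Apply the read/write request described by the message to the data store here.
        log.LogInformation("Processing data store request: {Message}", queueMessage);
    }
}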
Retry pattern
Enable an application to handle transient failures when it tries to connect to a service or network resource, by
transparently retrying a failed operation. This can improve the stability of the application.
Solution
In the cloud, transient faults aren't uncommon and an application should be designed to handle them elegantly
and transparently. This minimizes the effects faults can have on the business tasks the application is performing.
If an application detects a failure when it tries to send a request to a remote service, it can handle the failure
using the following strategies:
Cancel. If the fault indicates that the failure isn't transient or is unlikely to be successful if repeated, the
application should cancel the operation and report an exception. For example, an authentication failure
caused by providing invalid credentials is not likely to succeed no matter how many times it's attempted.
Retry. If the specific fault reported is unusual or rare, it might have been caused by unusual
circumstances such as a network packet becoming corrupted while it was being transmitted. In this case,
the application could retry the failing request again immediately because the same failure is unlikely to
be repeated and the request will probably be successful.
Retry after delay. If the fault is caused by one of the more commonplace connectivity or busy failures,
the network or service might need a short period while the connectivity issues are corrected or the
backlog of work is cleared. The application should wait for a suitable time before retrying the request.
For the more common transient failures, the period between retries should be chosen to spread requests from
multiple instances of the application as evenly as possible. This reduces the chance of a busy service continuing
to be overloaded. If many instances of an application are continually overwhelming a service with retry
requests, it'll take the service longer to recover.
If the request still fails, the application can wait and make another attempt. If necessary, this process can be
repeated with increasing delays between retry attempts, until some maximum number of requests have been
attempted. The delay can be increased incrementally or exponentially, depending on the type of failure and the
probability that it'll be corrected during this time.
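For illustration, the kind of incrementally or exponentially increasing delay described above can be computed as in the sketch below; the base delay, the cap, and the use of random jitter are assumptions rather than values taken from the sample.
using System;

public static class RetryDelayCalculator
{
    private static readonly Random Jitter = new Random();

    // Delay before retry attempt 'attempt' (1, 2, 3, ...): exponential growth
    // with a cap, plus random jitter so that many instances retrying at once
    // don't hit the service at exactly the same moment.
    public static TimeSpan GetRetryDelay(int attempt)
    {
        double baseSeconds = Math.Min(Math.Pow(2, attempt), 30);   // 2, 4, 8, ... capped at 30s.
        double jitterSeconds = Jitter.NextDouble();                // Up to 1s of jitter.
        return TimeSpan.FromSeconds(baseSeconds + jitterSeconds);
    }
}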
The following diagram illustrates invoking an operation in a hosted service using this pattern. If the request is
unsuccessful after a predefined number of attempts, the application should treat the fault as an exception and
handle it accordingly.
The application should wrap all attempts to access a remote service in code that implements a retry policy
matching one of the strategies listed above. Requests sent to different services can be subject to different
policies. Some vendors provide libraries that implement retry policies, where the application can specify the
maximum number of retries, the time between retry attempts, and other parameters.
An application should log the details of faults and failing operations. This information is useful to operators. If a
service is frequently unavailable or busy, it's often because the service has exhausted its resources. You can
reduce the frequency of these faults by scaling out the service. For example, if a database service is continually
overloaded, it might be beneficial to partition the database and spread the load across multiple servers.
Microsoft Entity Framework provides facilities for retrying database operations. Also, most Azure services
and client SDKs include a retry mechanism. For more information, see Retry guidance for specific services.
Example
This example in C# illustrates an implementation of the Retry pattern. The OperationWithBasicRetryAsync
method, shown below, invokes an external service asynchronously through the TransientOperationAsync
method. The details of the TransientOperationAsync method will be specific to the service and are omitted from
the sample code.
private int retryCount = 3;
private readonly TimeSpan delay = TimeSpan.FromSeconds(5);

public async Task OperationWithBasicRetryAsync()
{
    int currentRetry = 0;

    for (;;)
    {
        try
        {
            // Call external service.
            await TransientOperationAsync();
            // Return or break.
            break;
        }
        catch (Exception ex)
        {
            Trace.TraceError("Operation Exception");
            currentRetry++;

            // If this isn't a transient error, or the maximum number of
            // attempts has been reached, rethrow the exception.
            if (currentRetry > this.retryCount || !IsTransient(ex))
            {
                throw;
            }
        }

        // Wait before retrying the operation.
        await Task.Delay(delay);
    }
}

// Async method that wraps a call to a remote service (details not shown).
private async Task TransientOperationAsync()
{
    ...
}
The statement that invokes this method is contained in a try/catch block wrapped in a for loop. The for loop
exits if the call to the TransientOperationAsync method succeeds without throwing an exception. If the
TransientOperationAsync method fails, the catch block examines the reason for the failure. If it's believed to be a
transient error the code waits for a short delay before retrying the operation.
The for loop also tracks the number of times that the operation has been attempted, and if the code fails three
times the exception is assumed to be more long lasting. If the exception isn't transient or it's long lasting, the
catch handler throws an exception. This exception exits the for loop and should be caught by the code that
invokes the OperationWithBasicRetryAsync method.
The IsTransient method, shown below, checks for a specific set of exceptions that are relevant to the
environment the code is run in. The definition of a transient exception will vary according to the resources being
accessed and the environment the operation is being performed in.
private bool IsTransient(Exception ex)
{
    // Determine if the exception is transient.
    // In some cases this is as simple as checking the exception type, in other
    // cases it might be necessary to inspect other properties of the exception.
    if (ex is OperationTransientException)
        return true;

    // Additional exception checking logic goes here.
    return false;
}
Scheduler Agent Supervisor pattern
Coordinate a set of distributed actions as a single operation. If any of the actions fail, try to handle the failures
transparently, or else undo the work that was performed, so the entire operation succeeds or fails as a whole.
This can add resiliency to a distributed system, by enabling it to recover and retry actions that fail due to
transient exceptions, long-lasting faults, and process failures.
Solution
The Scheduler Agent Supervisor pattern defines the following actors. These actors orchestrate the steps to be
performed as part of the overall task.
The Scheduler arranges for the steps that make up the task to be executed and orchestrates their
operation. These steps can be combined into a pipeline or workflow. The Scheduler is responsible for
ensuring that the steps in this workflow are performed in the right order. As each step is performed, the
Scheduler records the state of the workflow, such as "step not yet started," "step running," or "step
completed." The state information should also include an upper limit of the time allowed for the step to
finish, called the complete-by time. If a step requires access to a remote service or resource, the Scheduler
invokes the appropriate Agent, passing it the details of the work to be performed. The Scheduler typically
communicates with an Agent using asynchronous request/response messaging. This can be implemented
using queues, although other distributed messaging technologies could be used instead.
The Scheduler performs a similar function to the Process Manager in the Process Manager pattern.
The actual workflow is typically defined and implemented by a workflow engine that's controlled by
the Scheduler. This approach decouples the business logic in the workflow from the Scheduler.
The Agent contains logic that encapsulates a call to a remote service, or access to a remote resource
referenced by a step in a task. Each Agent typically wraps calls to a single service or resource,
implementing the appropriate error handling and retry logic (subject to a timeout constraint, described
later). If the steps in the workflow being run by the Scheduler use several services and resources across
different steps, each step might reference a different Agent (this is an implementation detail of the
pattern).
The Supervisor monitors the status of the steps in the task being performed by the Scheduler. It runs
periodically (the frequency will be system-specific), and examines the status of steps maintained by the
Scheduler. If it detects any that have timed out or failed, it arranges for the appropriate Agent to recover
the step or execute the appropriate remedial action (this might involve modifying the status of a step).
Note that the recovery or remedial actions are implemented by the Scheduler and Agents. The Supervisor
should simply request that these actions be performed.
The Scheduler, Agent, and Supervisor are logical components and their physical implementation depends on the
technology being used. For example, several logical agents might be implemented as part of a single web
service.
The Scheduler maintains information about the progress of the task and the state of each step in a durable data
store, called the state store. The Supervisor can use this information to help determine whether a step has failed.
The figure illustrates the relationship between the Scheduler, the Agents, the Supervisor, and the state store.
NOTE
This diagram shows a simplified version of the pattern. In a real implementation, there might be many instances of the
Scheduler running concurrently, each handling a subset of tasks. Similarly, the system could run multiple instances of each Agent, or
even multiple Supervisors. In this case, Supervisors must coordinate their work with each other carefully to ensure that
they don’t compete to recover the same failed steps and tasks. The Leader Election pattern provides one possible solution
to this problem.
When the application is ready to run a task, it submits a request to the Scheduler. The Scheduler records initial
state information about the task and its steps (for example, step not yet started) in the state store and then starts
performing the operations defined by the workflow. As the Scheduler starts each step, it updates the information
about the state of that step in the state store (for example, step running).
If a step references a remote service or resource, the Scheduler sends a message to the appropriate Agent. The
message contains the information that the Agent needs to pass to the service or access the resource, in addition
to the complete-by time for the operation. If the Agent completes its operation successfully, it returns a response
to the Scheduler. The Scheduler can then update the state information in the state store (for example, step
completed) and perform the next step. This process continues until the entire task is complete.
An Agent can implement any retry logic that's necessary to perform its work. However, if the Agent doesn't
complete its work before the complete-by period expires, the Scheduler will assume that the operation has
failed. In this case, the Agent should stop its work and not try to return anything to the Scheduler (not even an
error message), or try any form of recovery. The reason for this restriction is that, after a step has timed out or
failed, another instance of the Agent might be scheduled to run the failing step (this process is described later).
If the Agent fails, the Scheduler won't receive a response. The pattern doesn't make a distinction between a step
that has timed out and one that has genuinely failed.
If a step times out or fails, the state store will contain a record that indicates that the step is running, but the
complete-by time will have passed. The Supervisor looks for steps like this and tries to recover them. One
possible strategy is for the Supervisor to update the complete-by value to extend the time available to complete
the step, and then send a message to the Scheduler identifying the step that has timed out. The Scheduler can
then try to repeat this step. However, this design requires the tasks to be idempotent.
The Supervisor might need to prevent the same step from being retried if it continually fails or times out. To do
this, the Supervisor could maintain a retry count for each step, along with the state information, in the state
store. If this count exceeds a predefined threshold the Supervisor can adopt a strategy of waiting for an extended
period before notifying the Scheduler that it should retry the step, in the expectation that the fault will be
resolved during this period. Alternatively, the Supervisor can send a message to the Scheduler to request the
entire task be undone by implementing a Compensating Transaction pattern. This approach will depend on the
Scheduler and Agents providing the information necessary to implement the compensating operations for each
step that completed successfully.
It isn't the purpose of the Supervisor to monitor the Scheduler and Agents, and restart them if they fail. This
aspect of the system should be handled by the infrastructure these components are running in. Similarly, the
Supervisor shouldn't have knowledge of the actual business operations that the tasks being performed by
the Scheduler are running (including how to compensate should these tasks fail). This is the purpose of the
workflow logic implemented by the Scheduler. The sole responsibility of the Supervisor is to determine
whether a step has failed and arrange either for it to be repeated or for the entire task containing the failed
step to be undone.
If the Scheduler is restarted after a failure, or the workflow being performed by the Scheduler terminates
unexpectedly, the Scheduler should be able to determine the status of any inflight task that it was handling when
it failed, and be prepared to resume this task from that point. The implementation details of this process are
likely to be system-specific. If the task can't be recovered, it might be necessary to undo the work already
performed by the task. This might also require implementing a compensating transaction.
The key advantage of this pattern is that the system is resilient in the event of unexpected temporary or
unrecoverable failures. The system can be constructed to be self-healing. For example, if an Agent or the
Scheduler fails, a new one can be started and the Supervisor can arrange for a task to be resumed. If the
Supervisor fails, another instance can be started and can take over from where the failure occurred. If the
Supervisor is scheduled to run periodically, a new instance can be automatically started after a predefined
interval. The state store can be replicated to reach an even greater degree of resiliency.
Example
A web application that implements an ecommerce system has been deployed on Microsoft Azure. Users can run
this application to browse the available products and to place orders. The user interface runs as a web role, and
the order processing elements of the application are implemented as a set of worker roles. Part of the order
processing logic involves accessing a remote service, and this aspect of the system could be prone to transient
or more long-lasting faults. For this reason, the designers used the Scheduler Agent Supervisor pattern to
implement the order processing elements of the system.
When a customer places an order, the application constructs a message that describes the order and posts this
message to a queue. A separate submission process, running in a worker role, retrieves the message, inserts the
order details into the orders database, and creates a record for the order process in the state store. Note that the
inserts into the orders database and the state store are performed as part of the same operation. The submission
process is designed to ensure that both inserts complete together.
The state information that the submission process creates for the order includes:
OrderID. The ID of the order in the orders database.
LockedBy. The instance ID of the worker role handling the order. There might be multiple current
instances of the worker role running the Scheduler, but each order should only be handled by a single
instance.
CompleteBy. The time the order should be processed by.
ProcessState. The current state of the task handling the order. The possible states are:
Pending. The order has been created but processing hasn't yet been started.
Processing. The order is currently being processed.
Processed. The order has been processed successfully.
Error. The order processing has failed.
FailureCount. The number of times that processing has been tried for the order.
In this state information, the OrderID field is copied from the order ID of the new order. The LockedBy and
CompleteBy fields are set to null, the ProcessState field is set to Pending, and the FailureCount field is set to
0.
NOTE
In this example, the order handling logic is relatively simple and only has a single step that invokes a remote service. In a
more complex multistep scenario, the submission process would likely involve several steps, and so several records would
be created in the state store — each one describing the state of an individual step.
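As a rough illustration, the state record described above might be modeled in C# as follows. This is only a sketch; the sample's actual storage types aren't shown in this excerpt.
using System;

// Hypothetical shape of the per-order record held in the state store.
public enum ProcessState { Pending, Processing, Processed, Error }

public class OrderProcessingRecord
{
    public string OrderId { get; set; }             // ID of the order in the orders database.
    public string LockedBy { get; set; }            // Instance ID of the worker role handling the order (null when unclaimed).
    public DateTimeOffset? CompleteBy { get; set; } // Time by which the order should be processed (null when unclaimed).
    public ProcessState ProcessState { get; set; } = ProcessState.Pending;
    public int FailureCount { get; set; }           // Number of processing attempts so far.
}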
The Scheduler also runs as part of a worker role and implements the business logic that handles the order. An
instance of the Scheduler polling for new orders examines the state store for records where the LockedBy field is
null and the ProcessState field is pending. When the Scheduler finds a new order, it immediately populates the
LockedBy field with its own instance ID, sets the CompleteBy field to an appropriate time, and sets the
ProcessState field to processing. The code is designed to be exclusive and atomic to ensure that two concurrent
instances of the Scheduler can't try to handle the same order simultaneously.
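One way to sketch that exclusive, atomic claim, assuming the state store is a SQL table (the table and column names are hypothetical), is a conditional UPDATE that only succeeds while the record is still unclaimed:
using System;
using System.Data.SqlClient;
using System.Threading.Tasks;

public static class SchedulerClaimSketch
{
    // Claim one pending order atomically: the conditional UPDATE succeeds only if the
    // record is still unclaimed, so two Scheduler instances can't take the same order.
    public static async Task<string> TryClaimPendingOrderAsync(
        SqlConnection connection, string instanceId, TimeSpan processingWindow)
    {
        const string claimSql = @"
            UPDATE TOP (1) dbo.OrderProcessingState
            SET LockedBy = @InstanceId,
                CompleteBy = @CompleteBy,
                ProcessState = 'Processing'
            OUTPUT inserted.OrderId
            WHERE LockedBy IS NULL AND ProcessState = 'Pending';";

        using var command = new SqlCommand(claimSql, connection);
        command.Parameters.AddWithValue("@InstanceId", instanceId);
        command.Parameters.AddWithValue("@CompleteBy", DateTimeOffset.UtcNow.Add(processingWindow));

        // Returns the claimed order ID, or null if no pending order was available.
        return (string)await command.ExecuteScalarAsync();
    }
}
If the statement affects no rows, no pending order was available and the Scheduler simply polls again later.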
The Scheduler then runs the business workflow to process the order asynchronously, passing it the value in the
OrderID field from the state store. The workflow handling the order retrieves the details of the order from the
orders database and performs its work. When a step in the order processing workflow needs to invoke the
remote service, it uses an Agent. The workflow step communicates with the Agent using a pair of Azure Service
Bus message queues acting as a request/response channel. The figure shows a high-level view of the solution.
The message sent to the Agent from a workflow step describes the order and includes the complete-by time. If
the Agent receives a response from the remote service before the complete-by time expires, it posts a reply
message on the Service Bus queue on which the workflow is listening. When the workflow step receives the
valid reply message, it completes its processing and the Scheduler sets the ProcessState field of the order state
to processed. At this point, the order processing has completed successfully.
If the complete-by time expires before the Agent receives a response from the remote service, the Agent simply
halts its processing and terminates handling the order. Similarly, if the workflow handling the order exceeds the
complete-by time, it also terminates. In both cases, the state of the order in the state store remains set to
processing, but the complete-by time indicates that the time for processing the order has passed and the process
is deemed to have failed. Note that if the Agent that's accessing the remote service, or the workflow that's
handling the order (or both) terminate unexpectedly, the information in the state store will again remain set to
processing and eventually will have an expired complete-by value.
If the Agent detects an unrecoverable, nontransient fault while it's trying to contact the remote service, it can
send an error response back to the workflow. The Scheduler can set the status of the order to error and raise an
event that alerts an operator. The operator can then try to resolve the reason for the failure manually and
resubmit the failed processing step.
The Supervisor periodically examines the state store looking for orders with an expired complete-by value. If the
Supervisor finds a record, it increments the FailureCount field. If the failure count value is below a specified
threshold value, the Supervisor resets the LockedBy field to null, updates the CompleteBy field with a new
expiration time, and sets the ProcessState field to pending. An instance of the Scheduler can pick up this order
and perform its processing as before. If the failure count value exceeds a specified threshold, the reason for the
failure is assumed to be nontransient. The Supervisor sets the status of the order to error and raises an event
that alerts an operator.
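A sketch of that Supervisor pass follows, reusing the OrderProcessingRecord type sketched earlier and a hypothetical IStateStore abstraction over the state store and operator alerting:
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical abstraction over the state store and operator notification.
public interface IStateStore
{
    Task<IEnumerable<OrderProcessingRecord>> GetExpiredProcessingOrdersAsync(DateTimeOffset now);
    Task UpdateAsync(OrderProcessingRecord order);
    Task RaiseOperatorAlertAsync(string orderId);
}

public static class SupervisorSketch
{
    public static async Task RecoverExpiredOrdersAsync(IStateStore stateStore, int failureThreshold, TimeSpan extension)
    {
        foreach (var order in await stateStore.GetExpiredProcessingOrdersAsync(DateTimeOffset.UtcNow))
        {
            order.FailureCount++;

            if (order.FailureCount <= failureThreshold)
            {
                // Release the order so any Scheduler instance can claim and retry it.
                order.LockedBy = null;
                order.CompleteBy = DateTimeOffset.UtcNow.Add(extension);
                order.ProcessState = ProcessState.Pending;
            }
            else
            {
                // Treat the fault as nontransient: mark the order as failed and alert an operator.
                order.ProcessState = ProcessState.Error;
                await stateStore.RaiseOperatorAlertAsync(order.OrderId);
            }

            await stateStore.UpdateAsync(order);
        }
    }
}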
In this example, the Supervisor is implemented in a separate worker role. You can use a variety of strategies
to arrange for the Supervisor task to be run, including using the Azure Scheduler service (not to be confused
with the Scheduler component in this pattern). For more information about the Azure Scheduler service, visit
the Scheduler page.
Although it isn't shown in this example, the Scheduler might need to keep the application that submitted the
order informed about the progress and status of the order. The application and the Scheduler are isolated from
each other to eliminate any dependencies between them. The application has no knowledge of which instance of
the Scheduler is handling the order, and the Scheduler is unaware of which specific application instance posted
the order.
To allow the order status to be reported, the application could use its own private response queue. The details of
this response queue would be included as part of the request sent to the submission process, which would
include this information in the state store. The Scheduler would then post messages to this queue indicating the
status of the order (request received, order completed, order failed, and so on). It should include the order ID in
these messages so they can be correlated with the original request by the application.
Sequential Convoy pattern
Process a set of related messages in a defined order, without blocking processing of other groups of messages.
Solution
Push related messages into categories within the queuing system, and have the queue listeners lock and pull only
from one category, one message at a time.
Here's what the general Sequential Convoy pattern looks like:
In the queue, messages for different categories may be interleaved, as shown in the following diagram:
Example
On Azure, this pattern can be implemented using Azure Service Bus message sessions. For the consumers, you can
use either Logic Apps with the Service Bus peek-lock connector or Azure Functions with the Service Bus trigger.
For the previous order-tracking example, process each ledger message in the order it's received, and send each
transaction to another queue where the category is set to the order ID. A transaction will never span multiple
orders in this scenario, so consumers process each category in parallel but FIFO within the category.
The ledger processor fans out the messages by de-batching the content of each message in the first queue.
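As a sketch of the consumer side, an Azure Function with the Service Bus trigger and sessions enabled processes the messages for one category (here, an order ID used as the session ID) one at a time, in order. The queue name, connection setting, and payload shape are assumptions:
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class LedgerTransactionProcessor
{
    [FunctionName("ProcessOrderTransactions")]
    public static void Run(
        [ServiceBusTrigger("order-transactions", Connection = "ServiceBusConnection", IsSessionsEnabled = true)]
        string transactionMessage,
        ILogger log)
    {
        // Messages that share a session ID (the order ID) arrive here one at a time, in FIFO order.
        log.LogInformation("Processing transaction: {message}", transactionMessage);
    }
}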
Next steps
The following information may be relevant when implementing this pattern:
Message sessions: first in, first out (FIFO)
Peek-Lock Message (Non-Destructive Read)
In order delivery of correlated messages in Logic Apps by using Service Bus sessions (MSDN blog)
Sharding pattern
12/18/2020 • 19 minutes to read • Edit Online
Divide a data store into a set of horizontal partitions or shards. This can improve scalability when storing and
accessing large volumes of data.
Solution
Divide the data store into horizontal partitions or shards. Each shard has the same schema, but holds its own
distinct subset of the data. A shard is a data store in its own right (it can contain the data for many entities of
different types), running on a server acting as a storage node.
This pattern has the following benefits:
You can scale the system out by adding further shards running on additional storage nodes.
A system can use off-the-shelf hardware rather than specialized and expensive computers for each
storage node.
You can reduce contention and improve performance by balancing the workload across shards.
In the cloud, shards can be located physically close to the users that'll access the data.
When dividing a data store up into shards, decide which data should be placed in each shard. A shard typically
contains items that fall within a specified range determined by one or more attributes of the data. These
attributes form the shard key (sometimes referred to as the partition key). The shard key should be static. It
shouldn't be based on data that might change.
Sharding physically organizes the data. When an application stores and retrieves data, the sharding logic directs
the application to the appropriate shard. This sharding logic can be implemented as part of the data access code
in the application, or it could be implemented by the data storage system if it transparently supports sharding.
Abstracting the physical location of the data in the sharding logic provides a high level of control over which
shards contain which data. It also enables data to migrate between shards without reworking the business logic
of an application if the data in the shards need to be redistributed later (for example, if the shards become
unbalanced). The tradeoff is the additional data access overhead required in determining the location of each
data item as it's retrieved.
To ensure optimal performance and scalability, it's important to split the data in a way that's appropriate for the
types of queries that the application performs. In many cases, it's unlikely that the sharding scheme will exactly
match the requirements of every query. For example, in a multi-tenant system an application might need to
retrieve tenant data using the tenant ID, but it might also need to look up this data based on some other
attribute such as the tenant’s name or location. To handle these situations, implement a sharding strategy with a
shard key that supports the most commonly performed queries.
If queries regularly retrieve data using a combination of attribute values, you can likely define a composite shard
key by linking attributes together. Alternatively, use a pattern such as Index Table to provide fast lookup to data
based on attributes that aren't covered by the shard key.
Sharding strategies
Three strategies are commonly used when selecting the shard key and deciding how to distribute data across
shards. Note that there doesn't have to be a one-to-one correspondence between shards and the servers that
host them—a single server can host multiple shards. The strategies are:
The Lookup strategy. In this strategy the sharding logic implements a map that routes a request for data to
the shard that contains that data using the shard key. In a multi-tenant application all the data for a tenant might
be stored together in a shard using the tenant ID as the shard key. Multiple tenants might share the same shard,
but the data for a single tenant won't be spread across multiple shards. The figure illustrates sharding tenant
data based on tenant IDs.
The mapping between the shard key and the physical storage can be based on physical shards where each shard
key maps to a physical partition. Alternatively, a more flexible technique for rebalancing shards is virtual
partitioning, where shard keys map to the same number of virtual shards, which in turn map to fewer physical
partitions. In this approach, an application locates data using a shard key that refers to a virtual shard, and the
system transparently maps virtual shards to physical partitions. The mapping between a virtual shard and a
physical partition can change without requiring the application code be modified to use a different set of shard
keys.
The Range strategy. This strategy groups related items together in the same shard, and orders them by shard
key—the shard keys are sequential. It's useful for applications that frequently retrieve sets of items using range
queries (queries that return a set of data items for a shard key that falls within a given range). For example, if an
application regularly needs to find all orders placed in a given month, this data can be retrieved more quickly if
all orders for a month are stored in date and time order in the same shard. If each order was stored in a different
shard, they'd have to be fetched individually by performing a large number of point queries (queries that return
a single data item). The next figure illustrates storing sequential sets (ranges) of data in shards.
In this example, the shard key is a composite key containing the order month as the most significant element,
followed by the order day and the time. The data for orders is naturally sorted when new orders are created and
added to a shard. Some data stores support two-part shard keys containing a partition key element that
identifies the shard and a row key that uniquely identifies an item in the shard. Data is usually held in row key
order in the shard. Items that are subject to range queries and need to be grouped together can use a shard key
that has the same value for the partition key but a unique value for the row key.
The Hash strategy. The purpose of this strategy is to reduce the chance of hotspots (shards that receive a
disproportionate amount of load). It distributes the data across the shards in a way that achieves a balance
between the size of each shard and the average load that each shard will encounter. The sharding logic
computes the shard to store an item in based on a hash of one or more attributes of the data. The chosen
hashing function should distribute data evenly across the shards, possibly by introducing some random element
into the computation. The next figure illustrates sharding tenant data based on a hash of tenant IDs.
To understand the advantage of the Hash strategy over other sharding strategies, consider how a multi-tenant
application that enrolls new tenants sequentially might assign the tenants to shards in the data store. When
using the Range strategy, the data for tenants 1 to n will all be stored in shard A, the data for tenants n+1 to m
will all be stored in shard B, and so on. If the most recently registered tenants are also the most active, most data
activity will occur in a small number of shards, which could cause hotspots. In contrast, the Hash strategy
allocates tenants to shards based on a hash of their tenant ID. This means that sequential tenants are most likely
to be allocated to different shards, which will distribute the load across them. The previous figure shows this for
tenants 55 and 56.
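A minimal sketch of hash-based shard selection follows. It assumes a fixed number of shards and hashes the tenant ID with a stable hash function; a production system would more likely use consistent hashing or a virtual-shard map so that adding shards doesn't remap most keys.
using System;
using System.Security.Cryptography;
using System.Text;

public static class HashShardingSketch
{
    // Map a tenant ID to one of the available shards using a stable hash.
    // A cryptographic hash is used (rather than string.GetHashCode) so the mapping
    // is the same across processes and runtime versions.
    public static int GetShardIndex(string tenantId, int shardCount)
    {
        using var md5 = MD5.Create();
        byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(tenantId));
        int value = BitConverter.ToInt32(hash, 0);
        return Math.Abs(value % shardCount);
    }
}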
The three sharding strategies have the following advantages and considerations:
Lookup. This offers more control over the way that shards are configured and used. Using virtual shards
reduces the impact when rebalancing data because new physical partitions can be added to even out the
workload. The mapping between a virtual shard and the physical partitions that implement the shard can
be modified without affecting application code that uses a shard key to store and retrieve data. Looking
up shard locations can impose an additional overhead.
Range. This is easy to implement and works well with range queries because they can often fetch
multiple data items from a single shard in a single operation. This strategy offers easier data
management. For example, if users in the same region are in the same shard, updates can be scheduled
in each time zone based on the local load and demand pattern. However, this strategy doesn't provide
optimal balancing between shards. Rebalancing shards is difficult and might not resolve the problem of
uneven load if the majority of activity is for adjacent shard keys.
Hash. This strategy offers a better chance of more even data and load distribution. Request routing can
be accomplished directly by using the hash function. There's no need to maintain a map. Note that
computing the hash might impose an additional overhead. Also, rebalancing shards is difficult.
Most common sharding systems implement one of the approaches described above, but you should also
consider the business requirements of your applications and their patterns of data usage. For example, in a
multi-tenant application:
You can shard data based on workload. You could segregate the data for highly volatile tenants in
separate shards. The speed of data access for other tenants might be improved as a result.
You can shard data based on the location of tenants. You can take the data for tenants in a specific
geographic region offline for backup and maintenance during off-peak hours in that region, while the
data for tenants in other regions remains online and accessible during their business hours.
High-value tenants could be assigned their own private, high performing, lightly loaded shards, whereas
lower-value tenants might be expected to share more densely-packed, busy shards.
The data for tenants that need a high degree of data isolation and privacy can be stored on a completely
separate server.
Autoincremented values in other fields that are not shard keys can also cause problems. For example,
if you use autoincremented fields to generate unique IDs, then two different items located in different
shards might be assigned the same ID.
It might not be possible to design a shard key that matches the requirements of every possible query
against the data. Shard the data to support the most frequently performed queries, and if necessary
create secondary index tables to support queries that retrieve data using criteria based on attributes that
aren't part of the shard key. For more information, see the Index Table pattern.
Queries that access only a single shard are more efficient than those that retrieve data from multiple
shards, so avoid implementing a sharding system that results in applications performing large numbers
of queries that join data held in different shards. Remember that a single shard can contain the data for
multiple types of entities. Consider denormalizing your data to keep related entities that are commonly
queried together (such as the details of customers and the orders that they have placed) in the same
shard to reduce the number of separate reads that an application performs.
If an entity in one shard references an entity stored in another shard, include the shard key for the
second entity as part of the schema for the first entity. This can help to improve the performance of
queries that reference related data across shards.
If an application must perform queries that retrieve data from multiple shards, it might be possible to
fetch this data by using parallel tasks. Examples include fan-out queries, where data from multiple shards
is retrieved in parallel and then aggregated into a single result. However, this approach inevitably adds
some complexity to the data access logic of a solution.
For many applications, creating a larger number of small shards can be more efficient than having a
small number of large shards because they can offer increased opportunities for load balancing. This can
also be useful if you anticipate the need to migrate shards from one physical location to another. Moving
a small shard is quicker than moving a large one.
Make sure the resources available to each shard storage node are sufficient to handle the scalability
requirements in terms of data size and throughput. For more information, see the section “Designing
Partitions for Scalability” in the Data Partitioning Guidance.
Consider replicating reference data to all shards. If an operation that retrieves data from a shard also
references static or slow-moving data as part of the same query, add this data to the shard. The
application can then fetch all of the data for the query easily, without having to make an additional round
trip to a separate data store.
If reference data held in multiple shards changes, the system must synchronize these changes across
all shards. The system can experience a degree of inconsistency while this synchronization occurs. If
you do this, you should design your applications to be able to handle it.
It can be difficult to maintain referential integrity and consistency between shards, so you should
minimize operations that affect data in multiple shards. If an application must modify data across shards,
evaluate whether complete data consistency is actually required. Instead, a common approach in the
cloud is to implement eventual consistency. The data in each partition is updated separately, and the
application logic must take responsibility for ensuring that the updates all complete successfully, as well
as handling the inconsistencies that can arise from querying data while an eventually consistent
operation is running. For more information about implementing eventual consistency, see the Data
Consistency Primer.
Configuring and managing a large number of shards can be a challenge. Tasks such as monitoring,
backing up, checking for consistency, and logging or auditing must be accomplished on multiple shards
and servers, possibly held in multiple locations. These tasks are likely to be implemented using scripts or
other automation solutions, but that might not completely eliminate the additional administrative
requirements.
Shards can be geolocated so that the data that they contain is close to the instances of an application that
use it. This approach can considerably improve performance, but requires additional consideration for
tasks that must access multiple shards in different locations.
NOTE
The primary focus of sharding is to improve the performance and scalability of a system, but as a by-product it can also
improve availability due to how the data is divided into separate partitions. A failure in one partition doesn't necessarily
prevent an application from accessing data held in other partitions, and an operator can perform maintenance or
recovery of one or more partitions without making the entire data for an application inaccessible. For more information,
see the Data Partitioning Guidance.
Example
The following example in C# uses a set of SQL Server databases acting as shards. Each database holds a subset
of the data used by an application. The application retrieves data that's distributed across the shards using its
own sharding logic (this is an example of a fan-out query). The details of the data that's located in each shard are
returned by a method called GetShards. This method returns an enumerable list of ShardInformation objects,
where the ShardInformation type contains an identifier for each shard and the SQL Server connection string
that an application should use to connect to the shard (the connection strings aren't shown in the code example).
The code below shows how the application uses the list of ShardInformation objects to perform a query that
fetches data from each shard in parallel. The details of the query aren't shown, but in this example the data that's
retrieved contains a string that could hold information such as the name of a customer if the shards contain the
details of customers. The results are aggregated into a ConcurrentBag collection for processing by the
application.
// Retrieve the shards as a ShardInformation[] instance.
var shards = GetShards();
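The rest of the fan-out query isn't shown in this excerpt. A sketch of how it might continue is given below; the query text, the result column, and the ConnectionString property name on ShardInformation are assumptions.
// Continuing from the snippet above. Requires System.Collections.Concurrent,
// System.Data.SqlClient, and System.Threading.Tasks.
var results = new ConcurrentBag<string>();

// Query every shard in parallel and gather the results into a thread-safe collection.
Parallel.ForEach(shards, shard =>
{
    using (var connection = new SqlConnection(shard.ConnectionString))
    {
        connection.Open();

        using (var command = new SqlCommand("SELECT CustomerName FROM dbo.Customers", connection))
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                results.Add(reader.GetString(0));
            }
        }
    }
});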
Sidecar pattern
Deploy components of an application into a separate process or container to provide isolation and encapsulation.
This pattern can also enable applications to be composed of heterogeneous components and technologies.
This pattern is named Sidecar because it resembles a sidecar attached to a motorcycle. In the pattern, the sidecar
is attached to a parent application and provides supporting features for the application. The sidecar also shares
the same lifecycle as the parent application, being created and retired alongside the parent. The sidecar pattern is
sometimes referred to as the sidekick pattern and is a decomposition pattern.
Solution
Co-locate a cohesive set of tasks with the primary application, but place them inside their own process or
container, providing a homogeneous interface for platform services across languages.
A sidecar service is not necessarily part of the application, but is connected to it. It goes wherever the parent
application goes. Sidecars are supporting processes or services that are deployed with the primary application.
On a motorcycle, the sidecar is attached to one motorcycle, and each motorcycle can have its own sidecar. In the
same way, a sidecar service shares the fate of its parent application. For each instance of the application, an
instance of the sidecar is deployed and hosted alongside it.
Advantages of using a sidecar pattern include:
A sidecar is independent from its primary application in terms of runtime environment and programming
language, so you don't need to develop one sidecar per language.
The sidecar can access the same resources as the primary application. For example, a sidecar can monitor
system resources used by both the sidecar and the primary application.
Because of its proximity to the primary application, there’s no significant latency when communicating
between them.
Even for applications that don’t provide an extensibility mechanism, you can use a sidecar to extend
functionality by attaching it as its own process in the same host or sub-container as the primary
application.
The sidecar pattern is often used with containers and referred to as a sidecar container or sidekick container.
Related guidance
Ambassador pattern
Static Content Hosting pattern
12/18/2020 • 5 minutes to read • Edit Online
Deploy static content to a cloud-based storage service that can deliver it directly to the client. This can reduce
the need for potentially expensive compute instances.
Solution
In most cloud hosting environments, you can put some of an application's resources and static pages in a storage
service. The storage service can serve requests for these resources, reducing load on the compute resources that
handle other web requests. The cost for cloud-hosted storage is typically much less than for compute instances.
When hosting some parts of an application in a storage service, the main considerations are related to
deployment of the application and to securing resources that aren't intended to be available to anonymous users.
Example
Azure Storage supports serving static content directly from a storage container. Files are served through
anonymous access requests. By default, files have a URL in a subdomain of core.windows.net, such as
https://contoso.z4.web.core.windows.net/image.png . You can configure a custom domain name, and use Azure
CDN to access the files over HTTPS. For more information, see Static website hosting in Azure Storage.
Static website hosting makes the files available for anonymous access. If you need to control who can access the
files, you can store files in Azure blob storage and then generate shared access signatures to limit access.
The links in the pages delivered to the client must specify the full URL of the resource. If the resource is protected
with a valet key, such as a shared access signature, this signature must be included in the URL.
A sample application that demonstrates using external storage for static resources is available on GitHub. This
sample uses configuration files to specify the storage account and container that holds the static content.
<Setting name="StaticContent.StorageConnectionString"
value="UseDevelopmentStorage=true" />
<Setting name="StaticContent.Container" value="static-content" />
The Settings class in the file Settings.cs of the StaticContentHosting.Web project contains methods to extract
these values and build a string value containing the cloud storage account container URL.
A URL helper method named StaticContentUrl rewrites application-relative content paths to point at that
container; only the closing lines of that helper appear in this excerpt:
return url.Content(contentPath);
}
}
The file Index.cshtml in the Views\Home folder contains an image element that uses the StaticContentUrl method
to create the URL for its src attribute.
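A sketch of what such a helper might look like is shown below. The actual sample may differ in detail; Settings.StaticContentBaseUrl is assumed to expose the container URL built from the configuration settings shown earlier.
using System.Web.Mvc;

public static class StaticContentUrlHtmlHelper
{
    // Rewrite an application-relative path ("~/Content/site.css") to point at the
    // static content container; other paths are passed through unchanged.
    public static string StaticContentUrl(this HtmlHelper helper, string contentPath)
    {
        if (contentPath.StartsWith("~"))
        {
            contentPath = Settings.StaticContentBaseUrl.TrimEnd('/') + "/" +
                          contentPath.TrimStart('~', '/');
        }

        var url = new UrlHelper(helper.ViewContext.RequestContext);
        return url.Content(contentPath);
    }
}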
Strangler Fig pattern
Incrementally migrate a legacy system by gradually replacing specific pieces of functionality with new applications
and services. As features from the legacy system are replaced, the new system eventually replaces all of the old
system's features, strangling the old system and allowing you to decommission it.
Solution
Incrementally replace specific pieces of functionality with new applications and services. Create a façade that
intercepts requests going to the backend legacy system. The façade routes these requests either to the legacy
application or the new services. Existing features can be migrated to the new system gradually, and consumers
can continue using the same interface, unaware that any migration has taken place.
This pattern helps to minimize risk from the migration, and spread the development effort over time. With the
façade safely routing users to the correct application, you can add functionality to the new system at whatever
pace you like, while ensuring the legacy application continues to function. Over time, as features are migrated to
the new system, the legacy system is eventually "strangled" and is no longer necessary. Once this process is
complete, the legacy system can safely be retired.
Related guidance
Martin Fowler's blog post on StranglerApplication
Throttling pattern
12/18/2020 • 7 minutes to read • Edit Online
Control the consumption of resources used by an instance of an application, an individual tenant, or an entire
service. This can allow the system to continue to function and meet service level agreements, even when an
increase in demand places an extreme load on resources.
Solution
An alternative strategy to autoscaling is to allow applications to use resources only up to a limit, and then
throttle them when this limit is reached. The system should monitor how it's using resources so that, when
usage exceeds the threshold, it can throttle requests from one or more users. This will enable the system to
continue functioning and meet any service level agreements (SLAs) that are in place. For more information on
monitoring resource usage, see the Instrumentation and Telemetry Guidance.
The system could implement several throttling strategies, including:
Rejecting requests from an individual user who's already accessed system APIs more than n times per
second over a given period of time. This requires the system to meter the use of resources for each tenant
or user running an application. For more information, see the Service Metering Guidance.
Disabling or degrading the functionality of selected nonessential services so that essential services can
run unimpeded with sufficient resources. For example, if the application is streaming video output, it
could switch to a lower resolution.
Using load leveling to smooth the volume of activity (this approach is covered in more detail by the
Queue-based Load Leveling pattern). In a multi-tenant environment, this approach will reduce the
performance for every tenant. If the system must support a mix of tenants with different SLAs, the work
for high-value tenants might be performed immediately. Requests for other tenants can be held back, and
handled when the backlog has eased. The Priority Queue pattern could be used to help implement this
approach.
Deferring operations being performed on behalf of lower priority applications or tenants. These
operations can be suspended or limited, with an exception generated to inform the tenant that the system
is busy and that the operation should be retried later.
The figure shows an area graph for resource use (a combination of memory, CPU, bandwidth, and other factors)
against time for applications that are making use of three features. A feature is an area of functionality, such as a
component that performs a specific set of tasks, a piece of code that performs a complex calculation, or an
element that provides a service such as an in-memory cache. These features are labeled A, B, and C.
The area immediately below the line for a feature indicates the resources that are used by applications when
they invoke this feature. For example, the area below the line for Feature A shows the resources used by
applications that are making use of Feature A, and the area between the lines for Feature A and Feature B
indicates the resources used by applications invoking Feature B. Aggregating the areas for each feature
shows the total resource use of the system.
The previous figure illustrates the effects of deferring operations. Just prior to time T1, the total resources
allocated to all applications using these features reach a threshold (the limit of resource use). At this point, the
applications are in danger of exhausting the resources available. In this system, Feature B is less critical than
Feature A or Feature C, so it's temporarily disabled and the resources that it was using are released. Between
times T1 and T2, the applications using Feature A and Feature C continue running as normal. Eventually, the
resource use of these two features diminishes to the point when, at time T2, there is sufficient capacity to enable
Feature B again.
The autoscaling and throttling approaches can also be combined to help keep the applications responsive and
within SLAs. If the demand is expected to remain high, throttling provides a temporary solution while the system
scales out. At this point, the full functionality of the system can be restored.
The next figure shows an area graph of the overall resource use by all applications running in a system against
time, and illustrates how throttling can be combined with autoscaling.
At time T1, the threshold specifying the soft limit of resource use is reached. At this point, the system can start to
scale out. However, if the new resources don't become available quickly enough, then the existing resources
might be exhausted and the system could fail. To prevent this from occurring, the system is temporarily throttled,
as described earlier. When autoscaling has completed and the additional resources are available, throttling can
be relaxed.
Example
The final figure illustrates how throttling can be implemented in a multi-tenant system. Users from each of the
tenant organizations access a cloud-hosted application where they fill out and submit surveys. The application
contains instrumentation that monitors the rate at which these users are submitting requests to the application.
In order to prevent the users from one tenant affecting the responsiveness and availability of the application for
all other users, a limit is applied to the number of requests per second the users from any one tenant can submit.
The application blocks requests that exceed this limit.
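A minimal sketch of that kind of per-tenant request limit is shown below, assuming a single application instance and a fixed one-second window; a real deployment would typically keep the counters in a shared store so that all instances enforce the same limit.
using System;
using System.Collections.Concurrent;

public class TenantRequestThrottle
{
    private readonly int maxRequestsPerSecond;
    private readonly ConcurrentDictionary<string, (long Window, int Count)> counters =
        new ConcurrentDictionary<string, (long Window, int Count)>();

    public TenantRequestThrottle(int maxRequestsPerSecond)
    {
        this.maxRequestsPerSecond = maxRequestsPerSecond;
    }

    // Returns false when the tenant has exceeded its limit for the current one-second window.
    public bool TryAcceptRequest(string tenantId)
    {
        long window = DateTimeOffset.UtcNow.ToUnixTimeSeconds();

        var entry = counters.AddOrUpdate(
            tenantId,
            _ => (window, 1),
            (_, current) => current.Window == window ? (window, current.Count + 1) : (window, 1));

        return entry.Count <= maxRequestsPerSecond;
    }
}
Requests for which TryAcceptRequest returns false would be rejected, for example with an HTTP 429 (Too Many Requests) response.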
Valet Key pattern
Use a token that provides clients with restricted direct access to a specific resource, in order to offload data
transfer from the application. This is particularly useful in applications that use cloud-hosted storage systems or
queues, and can minimize cost and maximize scalability and performance.
Solution
You need to resolve the problem of controlling access to a data store where the store can't manage authentication
and authorization of clients. One typical solution is to restrict access to the data store’s public connection and
provide the client with a key or token that the data store can validate.
This key or token is usually referred to as a valet key. It provides time-limited access to specific resources and
allows only predefined operations such as reading and writing to storage or queues, or uploading and
downloading in a web browser. Applications can create and issue valet keys to client devices and web browsers
quickly and easily, allowing clients to perform the required operations without requiring the application to
directly handle the data transfer. This removes the processing overhead, and the impact on performance and
scalability, from the application and the server.
The client uses this token to access a specific resource in the data store for only a specific period, and with specific
restrictions on access permissions, as shown in the figure. After the specified period, the key becomes invalid and
won't allow access to the resource.
It's also possible to configure a key that has other dependencies, such as the scope of the data. For example,
depending on the data store capabilities, the key can specify a complete table in a data store, or only specific rows
in a table. In cloud storage systems the key can specify a container, or just a specific item within a container.
The key can also be invalidated by the application. This is a useful approach if the client notifies the server that the
data transfer operation is complete. The server can then invalidate that key to prevent further access.
Using this pattern can simplify managing access to resources because there's no requirement to create and
authenticate a user, grant permissions, and then remove the user again. It also makes it easy to limit the location,
the permission, and the validity period—all by simply generating a key at runtime. The important factors are to
limit the validity period, and especially the location of the resource, as tightly as possible so that the recipient can
only use it for the intended purpose.
Example
Azure supports shared access signatures on Azure Storage for granular access control to data in blobs, tables, and
queues, and for Service Bus queues and topics. A shared access signature token can be configured to provide
specific access rights such as read, write, update, and delete to a specific table; a key range within a table; a queue;
a blob; or a blob container. The validity can be for a specified time period, or it can have no time limit.
Azure shared access signatures also support server-stored access policies that can be associated with a specific
resource such as a table or blob. This feature provides additional control and flexibility compared to application-
generated shared access signature tokens, and should be used whenever possible. Settings defined in a server-
stored policy can be changed and are reflected in the token without requiring a new token to be issued, but
settings defined in the token can't be changed without issuing a new token. This approach also makes it possible
to revoke a valid shared access signature token before it's expired.
For more information, see Introducing Table SAS (Shared Access Signature), Queue SAS and update to Blob
SAS and Using Shared Access Signatures on MSDN.
The following code shows how to create a shared access signature token that's valid for five minutes. The
GetSharedAccessReferenceForUpload method returns a shared access signature token that can be used to upload a
file to Azure Blob Storage.
public class ValuesController : ApiController
{
private readonly BlobServiceClient blobServiceClient;
private readonly string blobContainer;
...
/// <summary>
/// Return a limited access key that allows the caller to upload a file
/// to this specific destination for a defined period of time.
/// </summary>
private StorageEntitySas GetSharedAccessReferenceForUpload(string blobName)
{
var blob = blobServiceClient.GetBlobContainerClient(this.blobContainer).GetBlobClient(blobName);
var policy = new BlobSasBuilder
{
    BlobContainerName = this.blobContainer,
    BlobName = blobName,
    Resource = "b",
    StartsOn = DateTimeOffset.UtcNow.AddMinutes(-5),
    ExpiresOn = DateTimeOffset.UtcNow.AddMinutes(5)
};
policy.SetPermissions(BlobSasPermissions.Write);
var sas = policy.ToSasQueryParameters(storageSharedKeyCredential).ToString();
// StorageEntitySas is a simple result type defined in the sample.
return new StorageEntitySas
{
    BlobUri = blob.Uri,
    Credentials = sas
};
}
}
The complete sample is available in the ValetKey solution available for download from GitHub. The
ValetKey.Web project in this solution contains a web application that includes the ValuesController class
shown above. A sample client application that uses this web application to retrieve a shared access signature
key and upload a file to blob storage is available in the ValetKey.Client project.
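As an illustration of the client side, a caller might use the returned values to upload a blob as in the following sketch; the BlobUri and Credentials names follow the StorageEntitySas result type assumed above.
using System;
using System.IO;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

public static class ValetKeyClientSketch
{
    public static async Task UploadAsync(Uri blobUri, string sasToken, Stream content)
    {
        // Append the SAS token to the blob URI; the client needs no other credentials.
        var builder = new UriBuilder(blobUri) { Query = sasToken.TrimStart('?') };
        var blobClient = new BlobClient(builder.Uri);

        await blobClient.UploadAsync(content, overwrite: true);
    }
}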
Next steps
The following patterns and guidance might also be relevant when implementing this pattern:
A sample that demonstrates this pattern is available on GitHub.
Gatekeeper pattern. This pattern can be used in conjunction with the Valet Key pattern to protect applications
and services by using a dedicated host instance that acts as a broker between clients and the application or
service. The gatekeeper validates and sanitizes requests, and passes requests and data between the client and
the application. It can provide an additional layer of security and reduce the attack surface of the system.
Static Content Hosting pattern. Describes how to deploy static resources to a cloud-based storage service that
can deliver these resources directly to the client to reduce the requirement for expensive compute instances.
Where the resources aren't intended to be publicly available, the Valet Key pattern can be used to secure them.
Introducing Table SAS (Shared Access Signature), Queue SAS and update to Blob SAS
Using Shared Access Signatures
Shared Access Signature Authentication with Service Bus
Artificial intelligence (AI)
12/18/2020 • 15 minutes to read • Edit Online
Artificial intelligence (AI) is the capability of a computer to imitate intelligent human behavior. Through AI, machines
can analyze images, comprehend speech, interact in natural ways, and make predictions using data.
AI concepts
Algorithm
An algorithm is a sequence of calculations and rules used to solve a problem or analyze a set of data. It is like a flow
chart, with step-by-step instructions for questions to ask, but written in math and programming code. An algorithm
may describe how to determine whether a pet is a cat, dog, fish, bird, or lizard. Another far more complicated
algorithm may describe how to identify a written or spoken language, analyze its words, translate them into a
different language, and then check the translation for accuracy.
Machine learning
Machine learning (ML) is an AI technique that uses mathematical algorithms to create predictive models. An
algorithm is used to parse data fields and to "learn" from that data by using patterns found within it to generate
models. Those models are then used to make informed predictions or decisions about new data.
The predictive models are validated against known data, measured by performance metrics selected for specific
business scenarios, and then adjusted as needed. This process of learning and validation is called training. Through
periodic retraining, ML models are improved over time.
Machine learning at scale
What are the machine learning products at Microsoft?
Deep learning
Deep learning is a type of ML that can determine for itself whether its predictions are accurate. It also uses
algorithms to analyze data, but it does so on a larger scale than ML.
Deep learning uses artificial neural networks, which consist of multiple layers of algorithms. Each layer looks at the
incoming data, performs its own specialized analysis, and produces an output that other layers can understand. This
output is then passed to the next layer, where a different algorithm does its own analysis, and so on.
With many layers in each neural network, and sometimes using multiple neural networks, a machine can learn
through its own data processing. This requires much more data and much more computing power than ML.
Deep learning versus machine learning
Distributed training of deep learning models on Azure
Batch scoring of deep learning models on Azure
Training of Python scikit-learn and deep learning models on Azure
Real-time scoring of Python scikit-learn and deep learning models on Azure
Bots
A bot is an automated software program designed to perform a particular task. Think of it as a robot without a
body. Early bots were comparatively simple, handling repetitive and voluminous tasks with relatively
straightforward algorithmic logic. An example would be web crawlers used by search engines to automatically
explore and catalog web content.
Bots have become much more sophisticated, using AI and other technologies to mimic human activity and
decision-making, often while interacting directly with humans through text or even speech. Examples include bots
that can take a dinner reservation, chatbots (or conversational AI) that help with customer service interactions, and
social bots that post breaking news or scientific data to social media sites.
Microsoft offers the Azure Bot Service, a managed service purpose-built for enterprise-grade bot development.
About Azure Bot Service
Ten guidelines for responsible bots
Azure reference architecture: Enterprise-grade conversational bot
Example workload: Conversational chatbot for hotel reservations on Azure
Autonomous systems
Autonomous systems are part of an evolving new class that goes beyond basic automation. Instead of performing
a specific task repeatedly with little or no variation (like bots do), autonomous systems bring intelligence to
machines so they can adapt to changing environments to accomplish a desired goal.
Smart buildings use autonomous systems to automatically control operations like lighting, ventilation, air
conditioning, and security. A more sophisticated example would be a self-directed robot exploring a collapsed mine
shaft to thoroughly map its interior, determine which portions are structurally sound, analyze the air for
breathability, and detect signs of trapped miners in need of rescue, all without a human monitoring in real time on
the remote end.
Autonomous systems and solutions from Microsoft AI
General info on Microsoft AI
Learn more about Microsoft AI, and keep up-to-date with related news:
Microsoft AI School
Azure AI platform page
Microsoft AI platform page
Microsoft AI Blog
Microsoft AI on GitHub: Samples, reference architectures, and best practices
Azure Architecture Center
Hyperparameters are data variables that govern the training process itself. They are configuration variables that
control how the algorithm operates. Hyperparameters are thus typically set before model training begins and are
not modified within the training process in the way that parameters are. Hyperparameter tuning involves running
trials within the training task, assessing how well they are getting the job done, and then adjusting as needed. This
process generates multiple models, each trained using different families of hyperparameters.
Tune hyperparameters for your model with Azure Machine Learning
Azure Machine Learning Studio (classic): Tune Model Hyperparameters
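As a small, framework-agnostic illustration of hyperparameter tuning, the sketch below uses scikit-learn's GridSearchCV; Azure Machine Learning's HyperDrive applies the same idea as a managed, scale-out service.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hyperparameters are fixed before each training run; the search tries every
# combination and keeps the configuration that validates best.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("best cross-validated score:", search.best_score_)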
Model selection
The process of training and hyperparameter tuning produces numerous candidate models. These candidate models
can differ in many ways, including the effort needed to prepare the data, the flexibility of the model, the amount of
processing time, and of course the degree of accuracy of their results. Choosing the best trained model for your needs
and constraints is called model selection, but this is as much about preplanning before training as it is about
choosing the one that works best.
Automated machine learning (AutoML)
Automated machine learning, also known as AutoML, is the process of automating the time-consuming, iterative
tasks of machine learning model development. It can significantly reduce the time it takes to get production-ready
ML models. Automated ML can assist with model selection, hyperparameter tuning, model training, and other tasks,
without requiring extensive programming or domain knowledge.
What is automated machine learning?
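A hedged sketch of submitting an automated ML experiment with the Azure Machine Learning Python SDK (v1, azureml-train-automl) follows; the workspace, dataset, and column names are placeholders, and exact parameters may differ by SDK version.

from azureml.core import Dataset, Experiment, Workspace
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()                        # assumes a config.json is present
training_data = Dataset.get_by_name(ws, "churn")    # hypothetical registered dataset

automl_config = AutoMLConfig(
    task="classification",
    training_data=training_data,
    label_column_name="churned",          # hypothetical label column
    primary_metric="AUC_weighted",
    experiment_timeout_minutes=30,
)

run = Experiment(ws, "automl-demo").submit(automl_config)
run.wait_for_completion(show_output=True)
best_run, fitted_model = run.get_output()   # the selected, tuned model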
Scoring
Scoring is also called prediction and is the process of generating values based on a trained machine learning
model, given some new input data. The values, or scores, that are created can represent predictions of future values,
but they might also represent a likely category or outcome. The scoring process can generate many different types
of values:
A list of recommended items and a similarity score
Numeric values, for time series models and regression models
A probability value, indicating the likelihood that a new input belongs to some existing category
The name of a category or cluster to which a new item is most similar
A predicted class or outcome, for classification models
Batch scoring is when data is collected during some fixed period of time and then processed in a batch. This
might include generating business reports or analyzing customer loyalty.
Real-time scoring is exactly that: scoring that is ongoing and performed as quickly as possible. The classic
example is credit card fraud detection, but real-time scoring can also be used in speech recognition, medical
diagnoses, market analyses, and many other applications.
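The following sketch contrasts the two modes with a scikit-learn model persisted via joblib; file names and the payload shape are placeholders.

import joblib
import pandas as pd

model = joblib.load("model.pkl")             # previously trained model

# Batch scoring: score a whole file of records collected over a period.
batch = pd.read_csv("daily_transactions.csv")
batch["score"] = model.predict_proba(batch)[:, 1]
batch.to_csv("scored_transactions.csv", index=False)

# Real-time scoring: score one record as soon as it arrives.
def score_one(record: dict) -> float:
    frame = pd.DataFrame([record])
    return float(model.predict_proba(frame)[:, 1][0])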
General info on custom AI on Azure
Microsoft AI on GitHub: Samples, reference architectures, and best practices
Custom AI on Azure GitHub repo. A collection of scripts and tutorials to help developers effectively use Azure
for their AI workloads
Azure Machine Learning SDK for Python
Azure Machine Learning service example notebooks (Python). A GitHub repo of example notebooks
demonstrating the Azure Machine Learning Python SDK
Azure Machine Learning SDK for R
Customer stories
Different industries are applying AI in innovative and inspiring ways. Following are a number of customer case
studies and success stories:
ASOS: Online retailer solves challenges with Azure Machine Learning service
KPMG helps financial institutions save millions in compliance costs with Azure Cognitive Services
Volkswagen: Machine translation speaks Volkswagen – in 40 languages
Buncee: NYC school empowers readers of all ages and abilities with Azure AI
InterSystems: Data platform company boosts healthcare IT by generating critical information at
unprecedented speed
Zencity: Data-driven startup uses funding to help local governments support better quality of life for
residents
Bosch uses IoT innovation to drive traffic safety improvements by helping drivers avoid serious accidents
Automation Anywhere: Robotic process automation platform developer enriches its software with Azure
Cognitive Services
Wix deploys smart, scalable search across 150 million websites with Azure Cognitive Search
Asklepios Klinik Altona: Precision surgeries with Microsoft HoloLens 2 and 3D visualization
AXA Global P&C: Global insurance firm models complex natural disasters with cloud-based HPC
Browse more AI customer stories
Next steps
To learn about the artificial intelligence development products available from Microsoft, refer to
the Microsoft AI platform page.
For training in how to develop AI solutions, refer to Microsoft AI School.
Microsoft AI on GitHub: Samples, reference architectures, and best practices organizes the Microsoft open
source AI-based repositories, providing tutorials and learning materials.
Choosing a Microsoft cognitive services technology
12/18/2020 • 3 minutes to read • Edit Online
Microsoft cognitive services are cloud-based APIs that you can use in artificial intelligence (AI) applications and
data flows. They provide you with pretrained models that are ready to use in your application, requiring no data
and no model training on your part. The cognitive services are developed by Microsoft's AI and Research team and
leverage the latest deep learning algorithms. They are consumed over HTTP REST interfaces. In addition, SDKs are
available for many common application development frameworks.
The cognitive services include:
Text analysis
Computer vision
Video analytics
Speech recognition and generation
Natural language understanding
Intelligent search
Key benefits:
Minimal development effort for state-of-the-art AI services.
Easy integration into apps via HTTP REST interfaces.
Built-in support for consuming cognitive services in Azure Data Lake Analytics.
Considerations:
Only available over the web. Internet connectivity is generally required. An exception is the Custom Vision
Service, whose trained model you can export for prediction on devices and at the IoT edge.
Although considerable customization is supported, the available services may not suit all predictive
analytics requirements.
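As a hedged example of consuming a cognitive service over its HTTP REST interface, the sketch below calls the Text Analytics v3 sentiment endpoint with the requests library; the resource endpoint and key are placeholders, and you should confirm the current API version for your service.

import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com"
key = "<your-key>"

documents = {"documents": [{"id": "1", "language": "en",
                            "text": "The new rollout went remarkably smoothly."}]}

response = requests.post(
    f"{endpoint}/text/analytics/v3.0/sentiment",
    headers={"Ocp-Apim-Subscription-Key": key},   # subscription key authentication
    json=documents,
)
response.raise_for_status()
print(response.json()["documents"][0]["sentiment"])   # e.g. "positive"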
What are your options when choosing amongst the cognitive services?
In Azure, there are dozens of Cognitive Services available. The current listing of these is available in a directory
categorized by the functional area they support:
Vision
Speech
Decision
Search
Language
CAPABILITY | INPUT TYPE | KEY BENEFIT
Entity Linking API | Text | Power your app's data links with named entity recognition and disambiguation.
Web Language Model API | Text | Use predictive language models trained on web-scale data.
Academic Knowledge API | Text | Tap into the wealth of academic content in the Microsoft Academic Graph populated by Bing.
Bing Spell Check API | Text | Detect and correct spelling mistakes in your app.
Bing Entity Search API | Text (web search query) | Identify and augment entity information from the web.
Bing Image Search API | Text (web search query) | Search for images.
Bing News Search API | Text (web search query) | Search for news.
Bing Video Search API | Text (web search query) | Search for videos.
Bing Web Search API | Text (web search query) | Get enhanced search details from billions of web documents.
Bing Speech API | Text or speech | Convert speech to text and back again.
Computer Vision API | Images (or frames from video) | Distill actionable information from images, automatically create descriptions of photos, derive tags, recognize celebrities, extract text, and create accurate thumbnails.
Content Moderator | Text, images, or video | Automated image, text, and video moderation.
Emotion API | Images (photos with human subjects) | Identify the range of emotions of human subjects.
Face API | Images (photos with human subjects) | Detect, identify, analyze, organize, and tag faces in photos.
Custom Vision Service | Images (or frames from video) | Customize your own computer vision models.
Custom Decision Service | Web content (for example, an RSS feed) | Use machine learning to automatically select the appropriate content for your home page.
Bing Custom Search API | Text (web search query) | Commercial-grade search tool.
Compare the machine learning products and
technologies from Microsoft
12/18/2020 • 8 minutes to read • Edit Online
Learn about the machine learning products and technologies from Microsoft. Compare options to help you choose
how to most effectively build, deploy, and manage your machine learning solutions.
CLOUD OPTIONS | WHAT IT IS | WHAT YOU CAN DO WITH IT
Azure Machine Learning | Managed platform for machine learning | Use a pretrained model, or train, deploy, and manage models on Azure using Python and the CLI
Azure Cognitive Services | Pre-built AI capabilities implemented through REST APIs and SDKs | Build intelligent applications quickly using standard programming languages; doesn't require machine learning and data science expertise
Azure SQL Managed Instance Machine Learning Services | In-database machine learning for SQL | Train and deploy models inside Azure SQL Managed Instance
Machine learning in Azure Synapse Analytics | Analytics service with machine learning | Train and deploy models inside Azure Synapse Analytics
Machine learning and AI with ONNX in Azure SQL Edge | Machine learning in SQL on IoT | Train and deploy models inside Azure SQL Edge
Azure Databricks | Apache Spark-based analytics platform | Build and deploy models and data workflows using integrations with open-source machine learning libraries and the MLflow platform
ON-PREMISES OPTIONS | WHAT IT IS | WHAT YOU CAN DO WITH IT
SQL Server Machine Learning Services | In-database machine learning for SQL | Train and deploy models inside SQL Server
Machine Learning Services on SQL Server Big Data Clusters | Machine learning in Big Data Clusters | Train and deploy models on SQL Server Big Data Clusters

PLATFORMS/TOOLS | WHAT IT IS | WHAT YOU CAN DO WITH IT
Azure Data Science Virtual Machine | Virtual machine with pre-installed data science tools | Develop machine learning solutions in a pre-configured environment
Machine Learning extension for Azure Data Studio | Open-source and cross-platform machine learning extension for Azure Data Studio | Manage packages, import machine learning models, make predictions, and create notebooks to run experiments for your SQL databases
Key benefits: Code-first (SDK) and studio (drag-and-drop designer) web interface authoring options.
Supported languages: Varies by service; common options are C#, Java, JavaScript, and Python.
Considerations: Assumes a SQL database as the data tier for your application.
Azure Databricks
Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services
platform. Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive
workspace that enables collaboration between data scientists, data engineers, and business analysts. Use Python,
R, Scala, and SQL code in web-based notebooks to query, visualize, and model data.
Use Databricks when you want to collaborate on building machine learning solutions on Apache Spark.
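As a minimal sketch of the MLflow integration mentioned above, the following logs a scikit-learn model and its metrics; inside a Databricks notebook the tracking URI is preconfigured, and elsewhere MLflow defaults to local file storage.

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100).fit(X, y)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")   # versioned artifact for later deployment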
ML.NET
ML.NET is an open-source, cross-platform machine learning framework. With ML.NET, you can build custom
machine learning solutions and integrate them into your .NET applications. ML.NET offers varying levels of
interoperability with popular frameworks like TensorFlow and ONNX for training and scoring machine learning
and deep learning models. For resource-intensive tasks like training image classification models, you can take
advantage of Azure to train your models in the cloud.
Use ML.NET when you want to integrate machine learning solutions into your .NET applications. Choose between
the API for a code-first experience and Model Builder or the CLI for a low-code experience.
Windows ML
Windows ML inference engine allows you to use trained machine learning models in your applications, evaluating
trained models locally on Windows 10 devices.
Use Windows ML when you want to use trained machine learning models within your Windows applications.
MMLSpark
Microsoft ML for Apache Spark (MMLSpark) is an open-source library that expands the distributed computing
framework Apache Spark. MMLSpark adds many deep learning and data science tools to the Spark ecosystem,
including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK),
LightGBM, LIME (Model Interpretability), and OpenCV. You can use these tools to create powerful predictive
models on any Spark cluster, such as Azure Databricks or Cosmic Spark.
MMLSpark also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users
can embed any web service into their SparkML models. Additionally, MMLSpark provides easy-to-use tools for
orchestrating Azure Cognitive Services at scale. For production-grade deployment, the Spark Serving project
enables high throughput, submillisecond latency web services, backed by your Spark cluster.
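The following hedged PySpark sketch trains a LightGBM classifier through MMLSpark on a Spark cluster such as Azure Databricks; the import path matches the MMLSpark 1.0 release candidates and may differ in other versions, and the storage path and column names are placeholders.

from mmlspark.lightgbm import LightGBMClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path to training data in a data lake.
df = spark.read.parquet("abfss://data@account.dfs.core.windows.net/training/")

# Assemble feature columns into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(df)

model = LightGBMClassifier(labelCol="label", featuresCol="features").fit(train)
scored = model.transform(train)
scored.select("prediction").show(5)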
Next steps
To learn about all the Artificial Intelligence (AI) development products available from Microsoft, see Microsoft AI
platform
For training in developing AI and Machine Learning solutions with Microsoft, see Microsoft Learn
Machine learning at scale
12/18/2020 • 3 minutes to read • Edit Online
Machine learning (ML) is a technique used to train predictive models based on mathematical algorithms. Machine
learning analyzes the relationships between data fields to predict unknown values.
Creating and deploying a machine learning model is an iterative process:
Data scientists explore the source data to determine relationships between features and predicted labels.
The data scientists train and validate models based on appropriate algorithms to find the optimal model for
prediction.
The optimal model is deployed into production, as a web service or some other encapsulated function.
As new data is collected, the model is periodically retrained to improve its effectiveness.
Machine learning at scale addresses two different scalability concerns. The first is training a model against large
data sets that require the scale-out capabilities of a cluster to train. The second centers on operationalizing the
learned model so it can scale to meet the demands of the applications that consume it. Typically this is
accomplished by deploying the predictive capabilities as a web service that can then be scaled out.
Machine learning at scale has the benefit that it can produce powerful, predictive capabilities because better
models typically result from more data. Once a model is trained, it can be deployed as a stateless, highly
performant scale-out web service.
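As a minimal sketch of operationalizing a trained model as a stateless scoring web service (here with Flask; a managed option such as Azure Machine Learning or AKS would typically host it behind a load balancer for scale-out), consider:

import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.pkl")   # placeholder for the trained model artifact

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()                 # {"rows": [{...}, {...}]}
    frame = pd.DataFrame(payload["rows"])
    predictions = model.predict(frame).tolist()
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)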
Challenges
Machine learning at scale produces a few challenges:
You typically need a lot of data to train a model, especially for deep learning models.
You need to prepare these big data sets before you can even begin training your model.
The model training phase must access the big data stores. It's common to perform the model training using the
same big data cluster, such as Spark, that is used for data preparation.
For scenarios such as deep learning, not only will you need a cluster that can provide you scale-out on CPUs,
but your cluster will need to consist of GPU-enabled nodes.
Next steps
The following reference architectures show machine learning scenarios in Azure:
Batch scoring on Azure for deep learning models
Real-time scoring of Python Scikit-Learn and Deep Learning Models on Azure
Choosing a natural language processing technology
in Azure
12/18/2020 • 3 minutes to read • Edit Online
Natural language processing (NLP) is used for tasks such as sentiment analysis, topic detection, language
detection, key phrase extraction, and document categorization.
NLP can be used to classify documents, such as labeling documents as sensitive or spam. The output of NLP can be
used for subsequent processing or search. Another use for NLP is to summarize text by identifying the entities
present in the document. These entities can also be used to tag documents with keywords, which enables search
and retrieval based on content. Entities might be combined into topics, with summaries that describe the
important topics present in each document. The detected topics may be used to categorize the documents for
navigation, or to enumerate related documents given a selected topic. Another use for NLP is to score text for
sentiment, to assess the positive or negative tone of a document. These approaches use many techniques from
natural language processing, such as:
Tokenizer. Splitting the text into words or phrases.
Stemming and lemmatization. Normalizing words so that different forms map to the canonical word with
the same meaning. For example, "running" and "ran" map to "run."
Entity extraction. Identifying subjects in the text.
Part of speech detection. Identifying text as a verb, noun, participle, verb phrase, and so on.
Sentence boundary detection. Detecting complete sentences within paragraphs of text.
When using NLP to extract information and insight from free-form text, the starting point is typically the raw
documents stored in object storage such as Azure Storage or Azure Data Lake Store.
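The techniques listed above can be sketched with spaCy (an assumption; NLTK or Spark NLP work similarly), provided its small English model is installed (python -m spacy download en_core_web_sm).

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Contoso Ltd. shipped 120 units to Seattle. The invoices were running late.")

tokens = [token.text for token in doc]                    # tokenization
lemmas = [token.lemma_ for token in doc]                  # lemmatization ("running" -> "run")
pos_tags = [(token.text, token.pos_) for token in doc]    # part of speech detection
entities = [(ent.text, ent.label_) for ent in doc.ents]   # entity extraction
sentences = [sent.text for sent in doc.sents]             # sentence boundary detection

print(entities)   # e.g. [('Contoso Ltd.', 'ORG'), ('120', 'CARDINAL'), ('Seattle', 'GPE')]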
Challenges
Processing a collection of free-form text documents is typically computationally resource intensive, as well as
being time intensive.
Without a standardized document format, it can be difficult to achieve consistently accurate results using free-
form text processing to extract specific facts from a document. For example, think of a text representation of an
invoice—it can be difficult to build a process that correctly extracts the invoice number and invoice date for
invoices across any number of vendors.
What are your options when choosing an NLP service?
In Azure, the following services provide natural language processing (NLP) capabilities:
Azure HDInsight with Spark and Spark MLlib
Azure Databricks
Microsoft Cognitive Services
Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
CAPABILITY | AZURE HDINSIGHT | MICROSOFT COGNITIVE SERVICES
Programmability | Python, Scala, Java | C#, Java, Node.js, Python, PHP, Ruby
Part of speech tagging | Yes (Spark NLP) | Yes (Linguistic Analysis API)
Spell checking | Yes (Spark NLP) | Yes (Bing Spell Check API)
See also
Natural language processing
R developer's guide to Azure
12/18/2020 • 8 minutes to read • Edit Online
Many data scientists dealing with ever-increasing volumes of data are looking for ways to
harness the power of cloud computing for their analyses. This article provides an
overview of the various ways that data scientists can use their existing skills with the R
programming language in Azure.
Microsoft has fully embraced the R programming language as a first-class tool for data
scientists. By providing many different options for R developers to run their code in
Azure, the company is enabling data scientists to extend their data science workloads into the cloud when tackling
large-scale projects.
Let's examine the various options and the most compelling scenarios for each one.
Azure Machine Learning Server: enterprise software for data science, providing R and Python interpreters
Azure Machine Learning: cloud service that you use to train, deploy, automate, and manage machine learning models
Azure Machine Learning Studio (classic): run custom R scripts in Azure's machine learning experiments
Azure Batch: offers a variety of options for economically running R code across many nodes in a cluster
Azure SQL Database: run R scripts inside of the SQL Server database engine
ML Services on HDInsight
Microsoft ML Services provide data scientists, statisticians, and R programmers with on-demand access to scalable,
distributed methods of analytics on HDInsight. This solution provides the latest capabilities for R-based analytics on
datasets of virtually any size, loaded to either Azure Blob or Data Lake storage.
This is an enterprise-grade solution that allows you to scale your R code across a cluster. By using functions in
Microsoft's RevoScaleR package, your R scripts on HDInsight can run data processing functions in parallel across
many nodes in a cluster. This allows R to crunch data on a much larger scale than is possible with single-threaded R
running on a workstation.
This ability to scale makes ML Services on HDInsight a great option for R developers with massive data sets. It
provides a flexible and scalable platform for running your R scripts in the cloud.
For a walk-through on creating an ML Services cluster, see Get started with ML Services on Azure HDInsight.
Azure Databricks
Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services
platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click
setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data
engineers, and business analysts.
The collaboration in Databricks is enabled by the platform's notebook system. Users can create, share, and edit
notebooks with other users of the system. These notebooks allow users to write code that executes against Spark
clusters managed in the Databricks environment. These notebooks fully support R and give users access to Spark
through both the SparkR and sparklyr packages.
Since Databricks is built on Spark and has a strong focus on collaboration, the platform is often used by teams of
data scientists that work together on complex analyses of large data sets. Because the notebooks in Databricks
support other languages in addition to R, it is especially useful for teams where analysts use different languages
for their primary work.
The article What is Azure Databricks? can provide more details about the platform and help you get started.
Azure Batch
For large-scale R jobs, you can use Azure Batch. This service provides cloud-scale job scheduling and compute
management so you can scale your R workload across tens, hundreds, or thousands of virtual machines. Since it is
a generalized computing platform, there are a few options for running R jobs on Azure Batch.
One option is to use Microsoft's doAzureParallel package. This R package is a parallel backend for the foreach
package. It allows each iteration of the foreach loop to run in parallel on a node within the Azure Batch cluster. For
an introduction to the package, see the blog post doAzureParallel: Take advantage of Azure’s flexible compute
directly from your R session.
Another option for running an R script in Azure Batch is to bundle your code with "RScript.exe" as a Batch App in
the Azure portal. For a detailed walkthrough, see R Workloads on Azure Batch.
A third option is to use the Azure Distributed Data Engineering Toolkit (AZTK), which allows you to provision on-
demand Spark clusters using Docker containers in Azure Batch. This provides an economical way to run Spark jobs
in Azure. By using SparklyR with AZTK, your R scripts can be scaled out in the cloud easily and economically.
Azure Notebooks
Azure Notebooks is a low-cost, low-friction method for R developers who prefer working with notebooks to bring
their code to Azure. It is a free service for anyone to develop and run code in their browser using Jupyter, which is
an open-source project that enables combining markdown prose, executable code, and graphics onto a single canvas.
The free service tier of Azure Notebooks is a viable option for small-scale projects, as it limits each notebook's
process to 4 GB of memory and 1 GB data sets. If you need compute and data power beyond these limitations,
however, you can run notebooks in a Data Science Virtual Machine instance. For more information, see Manage
and configure Azure Notebooks projects - Compute tier.
The R logo is © 2016 The R Foundation and is used under the terms of the Creative Commons Attribution-ShareAlike 4.0 International license.
Baseball decision analysis with ML.NET and Blazor
12/18/2020 • 5 minutes to read • Edit Online
This scenario describes a web application, called the Baseball Machine Learning Workbench, which provides an
interface for non-technical users to use artificial intelligence (AI) and machine learning (ML) to perform decision
analysis techniques in order to rapidly gain insights and make informed predictions.
This solution uses historical baseball data to generate National Baseball Hall of Fame insights. Machine intelligence
powers the what-if analysis, decision thresholding, and improvements over traditional rule-based systems. User-
friendly interface controls set adjustable parameters and surface the results in real time, with clear visual cues to
highlight positive or negative outcomes.
The architecture provides rapid results by using in-memory models and rapid two-way communication between
the user and server. Delta rendering pushes down only modified content to the web browser, updating the display
without having to reload the entire page. This solution can scale to tens of in-memory models serving hundreds of
concurrent sessions in real time.
The following article explains the architecture of the Baseball Machine Learning Workbench, where to get the
source code for it, and how to deploy it. You can also view a live demo of this solution.
Architecture
The processing sequence in this solution flows as follows:
1. The user accesses the workbench application with any browser. They choose which analysis method to
employ and then which player to analyze.
2. SignalR brokers two-way communication with the server in real time.
3. Azure App Service hosts the application, including AI logic with the machine learning models.
4. One of three different decision analysis mechanisms is utilized, depending on which mode the user has
selected.
5. Historical data is analyzed using the designated set of rules or ML models in ML.NET, operating in-memory
for very quick inference.
6. Blazor Server surfaces the results to the end user’s browser, updating only the portions of the interface that
have changed, and transmits back to the user via SignalR.
7. Azure Application Insights is optionally used to monitor performance and instrumentation resources as
needed.
Components
The following assets and technologies were used to craft the Baseball Machine Learning Workbench:
Azure App Service enables you to build and host web applications in the programming language of your
choice without managing infrastructure.
.NET Core 3.1 is an open-source, cross-platform, general-purpose development framework that runs on
Windows, Linux, and macOS platforms.
ML.NET is a cross-platform, open-source framework for creating machine learning and artificial intelligence
models using .NET. It is used in the inference sections of this application.
Blazor Server lets you build interactive web UIs using C# instead of JavaScript, as in this solution. In this
architecture, Blazor renders the UI on the server and pushes HTML and other changes out to the browser.
SignalR provides asynchronous communication between the browser and the server (including Blazor). It
handles event updates, UI updates, and any processing done on the server (such as model inference).
Visual Studio 2019 is the software programming environment used for this project. This architecture uses
cross-platform components, so either the Windows or Mac version can be used.
Azure Application Insights (a feature of Azure Monitor) can be used for performance monitoring and
analytics and to drive autoscaling.
Considerations
Scalability
This solution uses the prediction engine functionality in ML.NET to scale the model response times. Object pooling
allows the ML.NET models to be accessed by multiple requests in a thread-safe manner. Learn more about ML.NET
object pooling in Deploy a model in an ASP.NET Core Web API.
Azure App Service is used for hosting the workbench in the cloud. With App Service you can automatically scale the
number of instances that run your app, letting you keep up with customer demand. For more information on
autoscale, refer to Autoscaling best practices in the Azure Architecture Center.
In Blazor Server, the state of many components might be maintained concurrently by the server. Because of this,
memory exhaustion is a concern that must be addressed. For guidance on how to author a Blazor Server app to
help ensure the best use of server memory, consult Threat mitigation guidance for ASP.NET Core Blazor Server.
Applying these best practices allows a server-side Blazor application to scale to thousands of concurrent users—
even on relatively small server hosts.
General guidance on designing scalable solutions is provided in the Azure Architecture Center’s Performance
efficiency checklist.
Resiliency and support
Use .NET Core 3.1.x because it is a Long Term Support (LTS) release. Although Blazor Server is also available in .NET
Core 3.0, that is not an LTS release and thus continuing compatibility with future component updates is not assured.
Learn more about the .NET Core Support Policy.
Related resources
View a live demo of this solution
Baseball HOF prediction using R mlr and DALEX packages is a GitHub repo using R and cutting edge “black
box” model techniques to explain ML.NET models related to this workload
Blazor documentation
ML.NET documentation
ASP.NET Core Blazor hosting models
MLOps (DevOps for Machine Learning) helps data science teams deliver innovation faster, increasing the
pace of ML model development
Learn about the National Baseball Hall of Fame voting process and rules
XAI Stories: Case Studies for Explainable Artificial Intelligence (Warsaw University of Technology and
University of Warsaw, 2020)
High Performance Computing (HPC) on Azure
12/18/2020 • 7 minutes to read • Edit Online
Introduction to HPC
High Performance Computing (HPC), also called "Big Compute", uses a large number of CPU or GPU-based
computers to solve complex mathematical tasks.
Many industries use HPC to solve some of their most difficult problems. These include workloads such as:
Genomics
Oil and gas simulations
Finance
Semiconductor design
Engineering
Weather modeling
How is HPC different on the cloud?
One of the primary differences between an on-premises HPC system and one in the cloud is the ability for
resources to be added and removed dynamically as they're needed. Dynamic scaling removes compute capacity as
a bottleneck and instead allows customers to right-size their infrastructure for the requirements of their jobs.
The following articles provide more detail about this dynamic scaling capability.
Big Compute Architecture Style
Autoscaling best practices
Implementation checklist
As you're looking to implement your own HPC solution on Azure, ensure you've reviewed the following topics:
Choose the appropriate architecture based on your requirements
Know which compute option is right for your workload
Identify the right storage solution that meets your needs
Decide how you're going to manage all your resources
Optimize your application for the cloud
Secure your Infrastructure
Infrastructure
There are a number of infrastructure components necessary to build an HPC system. Compute, Storage, and
Networking provide the underlying components, no matter how you choose to manage your HPC workloads.
Example HPC architectures
There are a number of different ways to design and implement your HPC architecture on Azure. HPC applications
can scale to thousands of compute cores, extend on-premises clusters, or run as a 100% cloud-native solution.
The following scenarios outline a few of the common ways HPC solutions are built.
Computer-aided engineering services on Azure
Provide a software-as-a-service (SaaS) platform for computer-aided engineering (CAE) on Azure.
Management
Do-it-yourself
Building an HPC system from scratch on Azure offers a significant amount of flexibility, but is often very
maintenance intensive.
1. Set up your own cluster environment in Azure virtual machines or virtual machine scale sets.
2. Use Azure Resource Manager templates to deploy leading workload managers, infrastructure, and applications.
3. Choose HPC and GPU VM sizes that include specialized hardware and network connections for MPI or GPU
workloads.
4. Add high performance storage for I/O-intensive workloads.
Hybrid and cloud bursting
If you have an existing on-premises HPC system that you'd like to connect to Azure, there are a number of
resources to help get you started.
First, review the Options for connecting an on-premises network to Azure article in the documentation. From there,
you may want information on these connectivity options:
Cost management
Managing your HPC cost on Azure can be done through a few different ways. Ensure you've reviewed the Azure
purchasing options to find the method that works best for your organization.
Security
For an overview of security best practices on Azure, review the Azure Security Documentation.
In addition to the network configurations available in the Cloud Bursting section, you may want to implement a
hub/spoke configuration to isolate your compute resources:
HPC applications
Run custom or commercial HPC applications in Azure. Several examples in this section are benchmarked to scale
efficiently with additional VMs or compute cores. Visit the Azure Marketplace for ready-to-deploy solutions.
NOTE
Check with the vendor of any commercial application for licensing or other restrictions for running in the cloud. Not all
vendors offer pay-as-you-go licensing. You might need a licensing server in the cloud for your solution, or connect to an on-
premises license server.
Engineering applications
Altair RADIOSS
ANSYS CFD
MATLAB Distributed Computing Server
StarCCM+
OpenFOAM
Graphics and rendering
Autodesk Maya, 3ds Max, and Arnold on Azure Batch
AI and deep learning
Microsoft Cognitive Toolkit
Batch Shipyard recipes for deep learning
MPI Providers
Microsoft MPI
Remote visualization
Linux virtual desktops with Citrix
Build a VDI environment for Linux Desktops using Citrix on Azure.
Performance Benchmarks
Compute benchmarks
Customer stories
There are a number of customers who have seen great success by using Azure for their HPC workloads. You can
find a few of these customer case studies below:
AXA Global P&C
Axioma
d3View
EFS
Hymans Robertson
MetLife
Microsoft Research
Milliman
Mitsubishi UFJ Securities International
NeuroInitiative
Schlumberger
Towers Watson
Next steps
For the latest announcements, see:
Microsoft HPC and Batch team blog
Visit the Azure blog.
Microsoft Batch Examples
These tutorials will provide you with details on running applications on Microsoft Batch
Get started developing with Batch
Use Azure Batch code samples
Use low-priority VMs with Batch
Run containerized HPC workloads with Batch Shipyard
Run parallel R workloads on Batch
Run on-demand Spark jobs on Batch
Use compute-intensive VMs in Batch pools
Azure Kubernetes Service solution journey
12/18/2020 • 3 minutes to read • Edit Online
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized
applications. Azure Kubernetes Service (AKS) makes it simple to deploy a managed Kubernetes cluster in Azure.
Organizations are at various points in their understanding, rationalizing, and adoption of Kubernetes on Azure.
Your organization's journey will likely follow a similar path to many other technologies you've adopted; learning,
aligning your organization around roles & responsibilities, and deploying production-ready workloads. From there,
you'll iterate; growing your product as your customer and business demands change.
(Diagram: the AKS solution journey - Learn, Align, AKS Baseline Cluster, Your Workload, Operate, and Iterate - supported by best practices.)
Organizational readiness
As organizations such as yours have adopted Azure, the Cloud Adoption Framework provides them prescriptive
guidance as they move between the phases of the cloud adoption lifecycle. The Cloud Adoption Framework
includes tools, programs, and content to simplify adoption of Kubernetes and related cloud-native practices at
scale.
Kubernetes in the Cloud Adoption Framework
Path to production
You understand the benefits and trade-offs of Kubernetes, and have decided that AKS is the best Azure compute
platform for your workload. Your organizational controls have been put into place; you're ready to learn how to
deploy production-ready clusters for your workload.
Microsoft's AKS Baseline Cluster is the starting point to help you build production-ready AKS clusters. We
recommend you start from this baseline implementation and modify it to align to your workload's specific needs
and Well-Architected Framework priorities.
Microsoft's AKS Baseline Cluster
Best practices
As part of ongoing operations, you may wish to spot-check your cluster against current recommended best
practices. The best place to start is to ensure your cluster is aligned with Microsoft's AKS Baseline Cluster.
See Best Practices for Cluster Operations and Best Practices for AKS Workloads.
You may also consider evaluating a community-driven utility like The AKS Checklist as a way of organizing and
tracking your alignment to these best practices.
Operations Guide
Getting your workload deployed on AKS is a great milestone and this is when day-2 operations are going to be
top-of-mind. Microsoft's AKS Day 2 Operations Guide was built for your ease of reference. This will help
ensure you are ready to meet the demands of your customers and ensure you are prepared for break-fix situations
via optimized triage processes.
Microsoft's AKS Day 2 Operations Guide
Additional resources
The typical AKS solution journey depicted above ranges from learning about AKS to growing your existing clusters
to meet new product and customer demands. However, you might also just be looking for additional reference and
supporting material to help along the way for your specific situation.
Example solutions
If you're seeking additional reference material that use AKS as their foundation, here are a few to consider.
Microservices architecture on AKS
Secure DevOps for AKS
Building a telehealth system
CI/CD pipeline for container-based workloads
Azure Arc
Azure Kubernetes Service offers you a managed Kubernetes experience on Azure, however there are workloads or
situations that might be best suited for placing your own Kubernetes clusters under Azure Arc management. This
includes clusters such as Red Hat OpenShift, Rancher RKE, and Canonical Charmed Kubernetes. Azure Arc
management should also be used for AKS Engine clusters running in your datacenter, in another cloud, or on Azure
Stack Hub.
Azure Arc enabled Kubernetes
Managed service provider
If you're a managed service provider, you already use Azure Lighthouse to manage resources for multiple
customers. Azure Kubernetes Service supports Azure Lighthouse so that you can manage hosted Kubernetes
environments and deploy containerized applications within your customers' tenants.
AKS with Azure Lighthouse
AWS or GCP professionals
These articles provide service mapping and comparison between Azure and other cloud services. This reference
can help you ramp up quickly on Azure.
Containers and container orchestrators for AWS Professionals
Containers and container orchestrators for GCP Professionals
Azure Kubernetes Services (AKS) Day 2 Operations
Guide
12/18/2020 • 2 minutes to read • Edit Online
After releasing your Azure Kubernetes Service-hosted application, prepare for day-2 operations. The term refers to
the ongoing maintenance of deployed assets and the rollout of upgrades. These operations can help you:
Keep up to date with your service-level agreement (SLA) or Service-level objective (SLO) requirements.
Troubleshoot customer support requests.
Stay current with the latest platform features and security updates.
Plan for future growth.
Prerequisites
The best practices for day-2 operations assume that you've deployed the Azure Kubernetes Service (AKS) Baseline
architecture as an example of a production cluster.
Triage practices
It's often challenging to do root-cause analysis given the different aspects of an AKS cluster. When triaging issues,
consider a top-down approach on the cluster hierarchy. Start at the cluster level and drill down if necessary.
Cluster > Node pools > Nodes > Pods > Containers
In this series, we'll walk you through the thought process of this approach. The articles show examples using a set
of tools and dashboards, and how they can highlight some symptoms.
Common causes addressed in this series include:
Network and connectivity problems caused by improper configuration.
Control plane to node communication is broken.
Kubelet pressures caused by insufficient compute, memory, or storage resources.
DNS resolution issues.
Nodes running out of disk IOPS.
Admission control pipeline is blocking a large number of requests to the API server.
The cluster doesn't have permissions to pull from the appropriate container registry.
This series isn't intended to resolve specific issues. For information about troubleshooting specific issues, see AKS
Common Issues.
In this series
STEP | DESCRIPTION
1 - Check the AKS cluster health | Start by checking the health of the overall cluster and networking.
2 - Examine the node and pod health | Check the health of the AKS worker nodes.
3 - Check the workload deployments | Check that all deployments and daemonSets are running.
4 - Validate the admission controllers | Check whether the admission controllers are working as expected.
5 - Verify the connection to the container registry | Verify the connection to the container registry.
Related links
Day-2 operations
AKS periscope
AKS roadmap
AKS documentation
Check the AKS cluster health
12/18/2020 • 2 minutes to read • Edit Online
In the Azure portal, AKS Diagnostics shows a list of results from various test runs. If any issues are found, More info
shows details about the underlying issue. For example, Diagnostics might indicate that network and connectivity
issues are caused by the Azure CNI subnet configuration.
To learn more about this feature, see Azure Kubernetes Service Diagnostics overview.
Next steps
Examine the node and pod health
Examine the node and pod health
11/2/2020 • 5 minutes to read • Edit Online
If the cluster checks are clear, check the health of the AKS worker nodes. Determine the reason for the unhealthy
node and resolve the issue.
This article is part of a series. Read the introduction here.
AKS - Nodes view. In the Azure portal, navigate to the cluster, select Insights under Monitoring, and view
Nodes on the right pane.
Prometheus and Grafana Dashboard. Open the Node Conditions dashboard.
If tunnelfront or aks-link connectivity is not working, establish connectivity after checking that the appropriate
AKS egress traffic rules have been allowed. Here are the steps:
1. Restart tunnelfront or aks-link .
If restarting the pods doesn't fix the connection, continue to the next step.
2. Check the logs and look for abnormalities. This output shows logs for a working connection.
You can also retrieve those logs by searching the container logs in the logging and monitoring service. This
example searches Azure Monitor for Containers to check for aks-link connectivity errors.
If you can't get the logs through kubectl or queries, SSH into the node. This example finds the tunnelfront
pod after connecting to the node through SSH.
The pressure increases when kubelet restarts and causes some sporadic, unpredictable behavior. Make sure that
the error count isn't continuously growing. An occasional error is acceptable but a constant growth indicates an
underlying issue that needs to be investigated and resolved.
Physical storage devices have limits on bandwidth and on the total number of file operations. Azure Disks are used to
store the OS running on the AKS nodes, and they are subject to the same physical storage limitations.
Another way is to look at throughput: throughput in MB/s equals the average IO size multiplied by IOPS. Because a
disk's throughput doesn't change, large IO sizes lead to lower IOPS. For example, a disk delivering 128 MB/s sustains
about 500 IOPS at an average IO size of 256 KB.
When a workload exceeds Azure Disks' service limits on Max IOPS, the cluster becomes unresponsive and blocked
in IO Wait. Everything on Linux is a file. This includes network sockets, CNI, Docker, and other services that use
network I/O. All of those files will fail if they're unable to read the disk.
The events that can trigger IOPS throttle include:
High volume of containers running on the nodes. Docker IO is shared on the OS disk.
Custom or third-party tools used for security, monitoring, logging that are running on the OS disk.
Node failover events, and periodic jobs. As the load increases or the pods are scaled, this throttling occurs more
frequently until all nodes go to NotReady state while the IO completes.
Related links
Virtual machine disk limits
Relationship between Virtual Machine and Disk Performance
Next steps
Check the workload deployments
Check the workload deployments
11/2/2020 • 2 minutes to read • Edit Online
Check that all deployments and daemonSets are running, and that the Ready and Available counts match the
expected count.
This article is part of a series. Read the introduction here.
Tools
AKS - Workloads . In Azure portal, navigate to the AKS cluster resource. Select Workloads .
Prometheus and Grafana Dashboard. Open the Deployment Status dashboard in Grafana.
Next steps
Validate the admission controllers
Validate the admission controllers are working as
expected
11/2/2020 • 2 minutes to read • Edit Online
The following commands check whether AKS Policy is running in your cluster and validate that all of the admission
controllers are functioning as expected.
# Sample Output
...
NAME READY STATUS RESTARTS AGE
gatekeeper-audit-65844778cb-rkflg 1/1 Running 0 163m
gatekeeper-controller-78797d4687-4pf6w 1/1 Running 0 163m
gatekeeper-controller-78797d4687-splzh 1/1 Running 0 163m
...
If this command doesn't run as expected, it could indicate that an admission controller, API service, or CRD isn't
functioning correctly.
# Sample Output
...
NAME                SHORTNAMES   APIGROUP   NAMESPACED   KIND
bindings                                    true         Binding
componentstatuses   cs                      false        ComponentStatus
configmaps          cm                      true         ConfigMap
...
Next steps
Verify the connection to the container registry
Verify the connection to the container registry
11/2/2020 • 2 minutes to read • Edit Online
Make sure that the worker nodes have the correct permission to pull the necessary container images from the
container registry.
This article is part of a series. Read the introduction here.
A common symptom of this issue is receiving ImagePullBackoff errors when getting or describing a pod. Be sure
that the registry and image name are correct, and that the cluster has permissions to pull from the appropriate
container registry.
If you are using Azure Container Registry (ACR), the cluster service principal or managed identity should be
granted AcrPull permissions against the registry.
One way is to run this command using the managed identity of the AKS cluster node pool. This command gets a
list of its permissions.
# Expected Output
...
e5615a90-1767-4a4f-83b6-cecfa0675970 AcrPull
/subscriptions/.../providers/Microsoft.ContainerRegistry/registries/akskhacr
...
If you're using another container registry, check the appropriate ImagePullSecret credentials for the registry.
Related links
Import container images to a container registry
AKS Roadmap
Azure Data Architecture Guide
11/2/2020 • 2 minutes to read • Edit Online
This guide presents a structured approach for designing data-centric solutions on Microsoft Azure. It is based on
proven practices derived from customer engagements.
Introduction
The cloud is changing the way applications are designed, including how data is processed and stored. Instead of a
single general-purpose database that handles all of a solution's data, polyglot persistence solutions use multiple,
specialized data stores, each optimized to provide specific capabilities. The perspective on data in the solution
changes as a result. There are no longer multiple layers of business logic that read and write to a single data layer.
Instead, solutions are designed around a data pipeline that describes how data flows through a solution, where it is
processed, where it is stored, and how it is consumed by the next component in the pipeline.
Big data solutions . A big data architecture is designed to handle the ingestion, processing, and analysis of data
that is too large or complex for traditional database systems. The data may be processed in batch or in real time.
Big data solutions typically involve a large amount of non-relational data, such as key-value data, JSON documents,
or time series data. Often traditional RDBMS systems are not well-suited to store this type of data. The term NoSQL
refers to a family of databases designed to hold non-relational data. The term isn't quite accurate, because many
non-relational data stores support SQL compatible queries. The term NoSQL stands for "Not only SQL".
(Diagram: big data pipeline - Ingestion, Processing, and Storage, feeding ML and Reporting.)
These two categories are not mutually exclusive, and there is overlap between them, but we feel that it's a useful
way to frame the discussion. Within each category, the guide discusses common scenarios, including relevant
Azure services and the appropriate architecture for the scenario. In addition, the guide compares technology
choices for data solutions in Azure, including open source options. Within each category, we describe the key
selection criteria and a capability matrix, to help you choose the right technology for your scenario.
This guide is not intended to teach you data science or database theory — you can find entire books on those
subjects. Instead, the goal is to help you select the right data architecture or data pipeline for your scenario, and
then select the Azure services and technologies that best fit your requirements. If you already have an architecture
in mind, you can skip directly to the technology choices.
Extract, transform, and load (ETL)
12/18/2020 • 5 minutes to read • Edit Online
A common problem that organizations face is how to gather data from multiple sources, in multiple formats, and
move it to one or more data stores. The destination may not be the same type of data store as the source, and often
the format is different, or the data needs to be shaped or cleaned before loading it into its final destination.
Various tools, services, and processes have been developed over the years to help address these challenges. No
matter the process used, there is a common need to coordinate the work and apply some level of data
transformation within the data pipeline. The following sections highlight the common methods used to perform
these tasks.
Often, the three ETL phases are run in parallel to save time. For example, while data is being extracted, a
transformation process could be working on data already received and prepare it for loading, and a loading process
can begin working on the prepared data, rather than waiting for the entire extraction process to complete.
Relevant Azure service:
Azure Data Factory v2
Other tools:
SQL Server Integration Services (SSIS)
Typical use cases for ELT fall within the big data realm. For example, you might start by extracting all of the source
data to flat files in scalable storage such as Hadoop distributed file system (HDFS) or Azure Data Lake Store.
Technologies such as Spark, Hive, or PolyBase can then be used to query the source data. The key point with ELT is
that the data store used to perform the transformation is the same data store where the data is ultimately
consumed. This data store reads directly from the scalable storage, instead of loading the data into its own
proprietary storage. This approach skips the data copy step present in ETL, which can be a time consuming
operation for large data sets.
In practice, the target data store is a data warehouse using either a Hadoop cluster (using Hive or Spark) or Azure
Synapse Analytics. In general, a schema is overlaid on the flat file data at query time and stored as a table, enabling
the data to be queried like any other table in the data store. These are referred to as external tables because the
data does not reside in storage managed by the data store itself, but on some external scalable storage.
The data store only manages the schema of the data and applies the schema on read. For example, a Hadoop
cluster using Hive would describe a Hive table where the data source is effectively a path to a set of files in HDFS. In
Azure Synapse, PolyBase can achieve the same result — creating a table against data stored externally to the
database itself. Once the source data is loaded, the data present in the external tables can be processed using the
capabilities of the data store. In big data scenarios, this means the data store must be capable of massively parallel
processing (MPP), which breaks the data into smaller chunks and distributes processing of the chunks across
multiple machines in parallel.
The final phase of the ELT pipeline is typically to transform the source data into a final format that is more efficient
for the types of queries that need to be supported. For example, the data may be partitioned. Also, ELT might use
optimized storage formats like Parquet, which stores row-oriented data in a columnar fashion and provides
optimized indexing.
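A minimal PySpark sketch of this ELT pattern (schema applied at read time over files in scalable storage, transformed in place, and written back as partitioned Parquet; the paths and columns are placeholders) might look like:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Schema on read: the CSV files stay in the data lake; no proprietary load step.
raw = spark.read.option("header", True).csv("abfss://raw@account.dfs.core.windows.net/sales/")

# Transform inside the target store's engine.
curated = (raw.withColumn("amount", F.col("amount").cast("double"))
              .groupBy("region", "order_year")
              .agg(F.sum("amount").alias("total_amount")))

# Persist in a query-efficient, partitioned columnar format.
curated.write.mode("overwrite").partitionBy("order_year").parquet(
    "abfss://curated@account.dfs.core.windows.net/sales_by_region/")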
Relevant Azure service:
Azure Synapse
HDInsight with Hive
Azure Data Factory v2
Oozie on HDInsight
Other tools:
SQL Server Integration Services (SSIS)
In the diagram above, there are several tasks within the control flow, one of which is a data flow task. One of the
tasks is nested within a container. Containers can be used to provide structure to tasks, providing a unit of work.
One such example is for repeating elements within a collection, such as files in a folder or database statements.
Relevant Azure service:
Azure Data Factory v2
Other tools:
SQL Server Integration Services (SSIS)
Technology choices
Online Transaction Processing (OLTP) data stores
Online Analytical Processing (OLAP) data stores
Data warehouses
Pipeline orchestration
Next steps
The following reference architectures show end-to-end ELT pipelines on Azure:
Enterprise BI in Azure with Azure Synapse
Automated enterprise BI with Azure Synapse and Azure Data Factory
Online analytical processing (OLAP)
12/18/2020 • 8 minutes to read • Edit Online
Online analytical processing (OLAP) is a technology that organizes large business databases and supports complex
analysis. It can be used to perform complex analytical queries without negatively affecting transactional systems.
The databases that a business uses to store all its transactions and records are called online transaction processing
(OLTP) databases. These databases usually have records that are entered one at a time. Often they contain a great
deal of information that is valuable to the organization. The databases that are used for OLTP, however, were not
designed for analysis. Therefore, retrieving answers from these databases is costly in terms of time and effort.
OLAP systems were designed to help extract this business intelligence information from the data in a highly
performant way. This is because OLAP databases are optimized for heavy read, low write workloads.
Semantic modeling
A semantic data model is a conceptual model that describes the meaning of the data elements it contains.
Organizations often have their own terms for things, sometimes with synonyms, or even different meanings for the
same term. For example, an inventory database might track a piece of equipment with an asset ID and a serial
number, but a sales database might refer to the serial number as the asset ID. There is no simple way to relate these
values without a model that describes the relationship.
Semantic modeling provides a level of abstraction over the database schema, so that users don't need to know the
underlying data structures. This makes it easier for end users to query data without performing aggregates and
joins over the underlying schema. Also, usually columns are renamed to more user-friendly names, so that the
context and meaning of the data are more obvious.
Semantic modeling is predominately used for read-heavy scenarios, such as analytics and business intelligence
(OLAP), as opposed to more write-heavy transactional data processing (OLTP). This is mostly due to the nature of a
typical semantic layer:
Aggregation behaviors are set so that reporting tools display them properly.
Business logic and calculations are defined.
Time-oriented calculations are included.
Data is often integrated from multiple sources.
Traditionally, the semantic layer is placed over a data warehouse for these reasons.
There are two primary types of semantic models:
Tabular . Uses relational modeling constructs (model, tables, columns). Internally, metadata is inherited from
OLAP modeling constructs (cubes, dimensions, measures). Code and script use OLAP metadata.
Multidimensional . Uses traditional OLAP modeling constructs (cubes, dimensions, measures).
Relevant Azure service:
Azure Analysis Services
Uses transactions: No
Model: Multidimensional
Challenges
For all the benefits OLAP systems provide, they do produce a few challenges:
Whereas data in OLTP systems is constantly updated through transactions flowing in from various sources,
OLAP data stores are typically refreshed at much slower intervals, depending on business needs. This means
OLAP systems are better suited for strategic business decisions, rather than immediate responses to changes.
Also, some level of data cleansing and orchestration needs to be planned to keep the OLAP data stores up-to-
date.
Unlike traditional, normalized, relational tables found in OLTP systems, OLAP data models tend to be
multidimensional. This makes it difficult or impossible to directly map to entity-relationship or object-oriented
models, where each attribute is mapped to one column. Instead, OLAP systems typically use a star or snowflake
schema in place of traditional normalization.
OLAP in Azure
In Azure, data held in OLTP systems such as Azure SQL Database is copied into the OLAP system, such as Azure
Analysis Services. Data exploration and visualization tools like Power BI, Excel, and third-party options connect to
Analysis Services servers and provide users with highly interactive and visually rich insights into the modeled data.
The flow of data from OLTP data to OLAP is typically orchestrated using SQL Server Integration Services, which can
be executed using Azure Data Factory.
In Azure, all of the following data stores will meet the core requirements for OLAP:
SQL Server with Columnstore indexes
Azure Analysis Services
SQL Server Analysis Services (SSAS)
SQL Server Analysis Services (SSAS) offers OLAP and data mining functionality for business intelligence
applications. You can either install SSAS on local servers, or host within a virtual machine in Azure. Azure Analysis
Services is a fully managed service that provides the same major features as SSAS. Azure Analysis Services
supports connecting to various data sources in the cloud and on-premises in your organization.
Clustered Columnstore indexes are available in SQL Server 2014 and above, as well as Azure SQL Database, and
are ideal for OLAP workloads. However, beginning with SQL Server 2016 (including Azure SQL Database), you can
take advantage of hybrid transactional/analytics processing (HTAP) through the use of updateable nonclustered
columnstore indexes. HTAP enables you to perform OLTP and OLAP processing on the same platform, which
removes the need to store multiple copies of your data, and eliminates the need for distinct OLTP and OLAP
systems. For more information, see Get started with Columnstore for real-time operational analytics.
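As a rough sketch of the HTAP approach described above, the snippet below adds an updateable nonclustered columnstore index to an existing OLTP table so that analytical scans can run against the same data. The connection string, table, and column names are hypothetical; it assumes SQL Server 2016+ or Azure SQL Database and the pyodbc driver.

```python
# Hedged HTAP sketch: create an updateable nonclustered columnstore index on
# an OLTP table so analytical queries scan columnar data in place.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:myserver.database.windows.net;Database=mydb;"
    "Uid=myuser;Pwd=mypassword;Encrypt=yes;"
)
conn.execute("""
    CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Orders
    ON dbo.Orders (OrderDate, ProductId, Quantity, UnitPrice)
""")
conn.commit()
```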
Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
CAPABILITY | AZURE ANALYSIS SERVICES | SQL SERVER ANALYSIS SERVICES | SQL SERVER WITH COLUMNSTORE INDEXES | AZURE SQL DATABASE WITH COLUMNSTORE INDEXES
Supports multidimensional cubes | No | Yes | No | No
[1] Although SQL Server and Azure SQL Database cannot be used to query from and integrate multiple external
data sources, you can still build a pipeline that does this for you using SSIS or Azure Data Factory. SQL Server
hosted in an Azure VM has additional options, such as linked servers and PolyBase. For more information, see
Pipeline orchestration, control flow, and data movement.
[2] Connecting to SQL Server running on an Azure Virtual Machine is not supported using an Azure AD account.
Use a domain Active Directory account instead.
Scalability capabilities
CAPABILITY | AZURE ANALYSIS SERVICES | SQL SERVER ANALYSIS SERVICES | SQL SERVER WITH COLUMNSTORE INDEXES | AZURE SQL DATABASE WITH COLUMNSTORE INDEXES
Online transaction processing (OLTP)
The management of transactional data using computer systems is referred to as online transaction processing
(OLTP). OLTP systems record business interactions as they occur in the day-to-day operation of the organization,
and support querying of this data to make inferences.
Transactional data
Transactional data is information that tracks the interactions related to an organization's activities. These
interactions are typically business transactions, such as payments received from customers, payments made to
suppliers, products moving through inventory, orders taken, or services delivered. Transactional events, which
represent the transactions themselves, typically contain a time dimension, some numerical values, and references
to other data.
Transactions typically need to be atomic and consistent. Atomicity means that an entire transaction always
succeeds or fails as one unit of work, and is never left in a half-completed state. If a transaction cannot be
completed, the database system must roll back any steps that were already done as part of that transaction. In a
traditional RDBMS, this rollback happens automatically if a transaction cannot be completed. Consistency means
that transactions always leave the data in a valid state. (These are very informal descriptions of atomicity and
consistency. There are more formal definitions of these properties, such as ACID.)
Transactional databases can support strong consistency for transactions using various locking strategies, such as
pessimistic locking, to ensure that all data is strongly consistent within the context of the enterprise, for all users
and processes.
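The following minimal example illustrates atomicity and rollback. It uses Python's built-in sqlite3 module purely for brevity; the same pattern applies to any RDBMS driver, and the table and business rule are invented for illustration.

```python
# Either both statements of the transfer commit, or the database rolls back
# to its prior state -- the transaction is never left half-completed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

try:
    with conn:  # the connection acts as a transaction scope
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 999")
        # No row matched id 999; raise so the whole transfer rolls back
        # instead of being applied halfway.
        if conn.execute("SELECT changes()").fetchone()[0] == 0:
            raise ValueError("target account not found")
except ValueError:
    pass

print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
# [(1, 100), (2, 50)] -- the debit was rolled back along with the failed credit
```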
The most common deployment architecture that uses transactional data is the data store tier in a 3-tier
architecture. A 3-tier architecture typically consists of a presentation tier, business logic tier, and data store tier. A
related deployment architecture is the N-tier architecture, which may have multiple middle-tiers handling business
logic.
REQUIREMENT | DESCRIPTION
Updateable | Yes
Appendable | Yes
Model | Relational
Challenges
Implementing and using an OLTP system can create a few challenges:
OLTP systems are not always good for handling aggregates over large amounts of data, although there are
exceptions, such as a well-planned SQL Server-based solution. Analytics against the data that rely on aggregate
calculations over millions of individual transactions are very resource intensive for an OLTP system. They can
be slow to execute and can cause a slow-down by blocking other transactions in the database.
When conducting analytics and reporting on data that is highly normalized, the queries tend to be complex,
because most queries need to de-normalize the data by using joins. Also, naming conventions for database
objects in OLTP systems tend to be terse and succinct. The increased normalization coupled with terse naming
conventions makes OLTP systems difficult for business users to query, without the help of a DBA or data
developer.
Storing the history of transactions indefinitely and storing too much data in any one table can lead to slow
query performance, depending on the number of transactions stored. The common solution is to maintain a
relevant window of time (such as the current fiscal year) in the OLTP system and offload historical data to other
systems, such as a data mart or data warehouse.
OLTP in Azure
Applications such as websites hosted in App Service Web Apps, REST APIs running in App Service, or mobile or
desktop applications communicate with the OLTP system, typically via a REST API intermediary.
In practice, most workloads are not purely OLTP. There tends to be an analytical component as well. In addition,
there is an increasing demand for real-time reporting, such as running reports against the operational system. This
is also referred to as HTAP (Hybrid Transactional and Analytical Processing). For more information, see Online
Analytical Processing (OLAP).
In Azure, all of the following data stores will meet the core requirements for OLTP and the management of
transaction data:
Azure SQL Database
SQL Server in an Azure virtual machine
Azure Database for MySQL
Azure Database for PostgreSQL
Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
CAPABILITY | AZURE SQL DATABASE | SQL SERVER IN AN AZURE VIRTUAL MACHINE | AZURE DATABASE FOR MYSQL | AZURE DATABASE FOR POSTGRESQL
Availability capabilities
CAPABILITY | AZURE SQL DATABASE | SQL SERVER IN AN AZURE VIRTUAL MACHINE | AZURE DATABASE FOR MYSQL | AZURE DATABASE FOR POSTGRESQL
Security capabilities
CAPABILITY | AZURE SQL DATABASE | SQL SERVER IN AN AZURE VIRTUAL MACHINE | AZURE DATABASE FOR MYSQL | AZURE DATABASE FOR POSTGRESQL
Private IP | No | Yes | No | No
Data warehousing
12/18/2020 • 11 minutes to read • Edit Online
A data warehouse is a centralized repository of integrated data from one or more disparate sources. Data
warehouses store current and historical data and are used for reporting and analysis of the data.
To move data into a data warehouse, data is periodically extracted from various sources that contain important
business information. As the data is moved, it can be formatted, cleaned, validated, summarized, and reorganized.
Alternatively, the data can be stored in the lowest level of detail, with aggregated views provided in the warehouse
for reporting. In either case, the data warehouse becomes a permanent data store for reporting, analysis, and
business intelligence (BI).
Challenges
Properly configuring a data warehouse to fit the needs of your business can bring some of the following
challenges:
Committing the time required to properly model your business concepts. Data warehouses are information
driven. You must standardize business-related terms and common formats, such as currency and dates. You
also need to restructure the schema in a way that makes sense to business users but still ensures accuracy
of data aggregates and relationships.
Planning and setting up your data orchestration. Consider how to copy data from the source transactional
system to the data warehouse, and when to move historical data from operational data stores into the
warehouse.
Maintaining or improving data quality by cleaning the data as it is imported into the warehouse.
Capability Matrix
The following tables summarize the key differences in capabilities.
General capabilities
CAPABILITY | AZURE SQL DATABASE | SQL SERVER (VM) | AZURE SYNAPSE | APACHE HIVE ON HDINSIGHT | HIVE LLAP ON HDINSIGHT
[1] Azure Synapse allows you to scale up or down by adjusting the number of data warehouse units (DWUs). See
Manage compute power in Azure Synapse.
Security capabilities
CAPABILITY | AZURE SQL DATABASE | SQL SERVER IN A VIRTUAL MACHINE | AZURE SYNAPSE | APACHE HIVE ON HDINSIGHT | HIVE LLAP ON HDINSIGHT
Authentication | SQL / Azure Active Directory (Azure AD) | SQL / Azure AD / Active Directory | SQL / Azure AD | local / Azure AD [1] | local / Azure AD [1]
Non-relational data and NoSQL
A non-relational database is a database that does not use the tabular schema of rows and columns found in most
traditional database systems. Instead, non-relational databases use a storage model that is optimized for the
specific requirements of the type of data being stored. For example, data may be stored as simple key/value pairs,
as JSON documents, or as a graph consisting of edges and vertices.
What all of these data stores have in common is that they don't use a relational model. Also, they tend to be more
specific in the type of data they support and how data can be queried. For example, time series data stores are
optimized for queries over time-based sequences of data, while graph data stores are optimized for exploring
weighted relationships between entities. Neither format would generalize well to the task of managing
transactional data.
The term NoSQL refers to data stores that do not use SQL for queries, and instead use other programming
languages and constructs to query the data. In practice, "NoSQL" means "non-relational database," even though
many of these databases do support SQL-compatible queries. However, the underlying query execution strategy is
usually very different from the way a traditional RDBMS would execute the same SQL query.
The following sections describe the major categories of non-relational or NoSQL databases.
Columnar data stores
Unlike a key/value store or a document database, most column-family databases physically store data in key order,
rather than by computing a hash. The row key is considered the primary index and enables key-based access via a
specific key or a range of keys. Some implementations allow you to create secondary indexes over specific columns
in a column family. Secondary indexes let you retrieve data by column value, rather than by row key.
On disk, all of the columns within a column family are stored together in the same file, with a certain number of
rows in each file. With large data sets, this approach creates a performance benefit by reducing the amount of data
that needs to be read from disk when only a few columns are queried together at a time.
Read and write operations for a row are typically atomic within a single column family, although some
implementations provide atomicity across the entire row, spanning multiple column families.
Relevant Azure service:
Cosmos DB Cassandra API
HBase in HDInsight
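A hedged sketch of the column-family model, using the happybase client (an HBase Thrift client) against an HBase cluster: rows are keyed and stored in key order, and all columns in a family are written and read together. The host, table, and column names are placeholders.

```python
# Column-family sketch: a composite row key keeps related rows adjacent on
# disk, so a prefix scan is an efficient range read.
import happybase

connection = happybase.Connection("my-hbase-thrift-host")
table = connection.table("sensor_readings")

# Row keys sort lexicographically; device + timestamp enables range scans.
table.put(b"sensor-1|2020-12-18T10:00:00", {b"reading:temperature": b"21.7"})

for key, data in table.scan(row_prefix=b"sensor-1|2020-12-18"):
    print(key, data)
```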
Key/value data stores
Key/value stores are highly optimized for applications performing simple lookups using the value of the key, or by
a range of keys, but are less suitable for systems that need to query data across different tables of keys/values, such
as joining data across multiple tables.
Key/value stores are also not optimized for scenarios where querying or filtering by non-key values is important,
rather than performing lookups based only on keys. For example, with a relational database, you can find a record
by using a WHERE clause to filter the non-key columns, but key/value stores usually do not have this type of
lookup capability for values, or if they do, it requires a slow scan of all values.
A single key/value store can be extremely scalable, as the data store can easily distribute data across multiple nodes
on separate machines.
Relevant Azure services:
Azure Cosmos DB Table API
Azure Cache for Redis
Azure Table Storage
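The sketch below shows the key/value access pattern with the redis-py client against Azure Cache for Redis; the host name, key, and value are placeholders. Note that lookups are by key only, with no server-side filtering on the stored value.

```python
# Key/value sketch: store and retrieve an opaque value by its key.
import redis

r = redis.Redis(
    host="mycache.redis.cache.windows.net", port=6380,
    password="myaccesskey", ssl=True,
)
r.set("session:12345", '{"user": "alice", "cart_items": 3}')
print(r.get("session:12345"))
```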
Graph data stores
A graph data store manages two types of information: nodes, which represent entities, and edges, which specify the
relationships between them. This structure makes it straightforward to perform queries such as "Find all employees who report directly or
indirectly to Sarah" or "Who works in the same department as John?" For large graphs with lots of entities and
relationships, you can perform complex analyses quickly. Many graph databases provide a query language that you
can use to traverse a network of relationships efficiently.
Relevant Azure service:
Azure Cosmos DB Graph API
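As a hedged sketch, the "who reports to Sarah, directly or indirectly?" question above could be expressed as a Gremlin traversal submitted with the gremlinpython client. The endpoint, database, graph, vertex labels, and key are placeholders, and the reportsTo edge direction is an assumption.

```python
# Graph traversal sketch: walk incoming "reportsTo" edges from Sarah to find
# everyone in her reporting chain.
from gremlin_python.driver import client, serializer

g = client.Client(
    "wss://myaccount.gremlin.cosmos.azure.com:443/", "g",
    username="/dbs/hr/colls/employees",
    password="myprimarykey",
    message_serializer=serializer.GraphSONSerializersV2d0(),
)
query = "g.V().has('employee','name','Sarah').repeat(__.in('reportsTo')).emit().values('name')"
print(g.submit(query).all().result())
```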
Time series data stores
Although the records written to a time series database are generally small, there are often a large number of
records, and total data size can grow rapidly. Time series data stores also handle out-of-order and late-arriving data,
automatic indexing of data points, and optimizations for queries described in terms of windows of time. This last
feature enables queries to run across millions of data points and multiple data streams quickly, in order to support
time series visualizations, which is a common way that time series data is consumed.
For more information, see Time series solutions.
Relevant Azure services:
Azure Time Series Insights
OpenTSDB with HBase on HDInsight
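To illustrate the time-window query pattern only (not any particular service), the pandas sketch below rolls raw sensor readings up into 5-minute averages; a time series store performs this kind of windowing server side, at much larger scale.

```python
# Illustrative window aggregation over time-stamped readings.
import pandas as pd

readings = pd.DataFrame({
    "ts": pd.to_datetime(["2020-12-18 10:00:05", "2020-12-18 10:02:40",
                          "2020-12-18 10:06:10", "2020-12-18 10:07:55"]),
    "temperature": [21.3, 21.6, 22.1, 22.4],
})
windowed = readings.set_index("ts").resample("5min").mean()
print(windowed)
```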
Object data stores
Some object data stores replicate a given blob across multiple server nodes, which enables fast parallel reads. This
in turn enables the scale-out querying of data contained in large files, because multiple processes, typically running
on different servers, can each query the large data file simultaneously.
One special case of object data stores is the network file share. Using file shares enables files to be accessed across
a network using standard networking protocols like server message block (SMB). Given appropriate security and
concurrent access control mechanisms, sharing data in this way can enable distributed services to provide highly
scalable data access for basic, low-level operations such as simple read and write requests.
Relevant Azure services:
Azure Blob Storage
Azure Data Lake Store
Azure File Storage
Typical requirements
Non-relational data stores often use a different storage architecture from that used by relational databases.
Specifically, they tend toward having no fixed schema. Also, they tend not to support transactions, or else restrict
the scope of transactions, and they generally don't include secondary indexes for scalability reasons.
The following compares the requirements for each of the non-relational data stores:
REQUIREMENT | DOCUMENT DATA | COLUMN-FAMILY DATA | KEY/VALUE DATA | GRAPH DATA
Locking strategy | Optimistic (lock free) | Pessimistic (row locks) | Optimistic (ETag) |
Indexing | Primary and secondary indexes | Primary and secondary indexes | Primary index only | Primary and secondary indexes
Data shape | Document | Tabular with column families containing columns | Key and value | Graph containing edges and vertices
Datum size | Small (KBs) to medium (low MBs) | Medium (MBs) to Large (low GBs) | Small (KBs) | Small (KBs)
Overall maximum scale | Very Large (PBs) | Very Large (PBs) | Very Large (PBs) | Large (TBs)
REQUIREMENT | TIME SERIES DATA | OBJECT DATA | EXTERNAL INDEX DATA
Sparse | No | N/A | No
Datum size | Small (KBs) | Large (GBs) to Very Large (TBs) | Small (KBs)
Overall maximum scale | Large (low TBs) | Very Large (PBs) | Large (low TBs)
Processing free-form text for search
12/18/2020 • 2 minutes to read • Edit Online
To support search, free-form text processing can be performed against documents containing paragraphs of text.
Text search works by constructing a specialized index that is precomputed against a collection of documents. A
client application submits a query that contains the search terms. The query returns a result set, consisting of a list
of documents sorted by how well each document matches the search criteria. The result set may also include the
context in which the document matches the criteria, which enables the application to highlight the matching phrase
in the document.
Free-form text processing can produce useful, actionable data from large amounts of noisy text data. The results
can give unstructured documents a well-defined and queryable structure.
Challenges
Processing a collection of free-form text documents is typically computationally intensive, as well as time
intensive.
In order to search free-form text effectively, the search index should support fuzzy search based on terms that
have a similar construction. For example, search indexes are built with lemmatization and linguistic stemming,
so that queries for "run" will match documents that contain "ran" and "running."
Architecture
In most scenarios, the source text documents are loaded into object storage such as Azure Storage or Azure Data
Lake Store, and then indexed using an external search service. In this scenario, source text documents are physically
distinct from a resulting search index that's hosted on a search service. An exception is using full text search within
SQL Server or Azure SQL Database. In this case, the document data exists internally in tables managed by the
database. Once stored, the documents are processed in a batch to create the index.
Technology choices
Options for creating a search index include Azure Cognitive Search, Elasticsearch, and HDInsight with Solr. Each of
these technologies can populate a search index from a collection of documents. Cognitive Search provides indexers
that can automatically populate the index for documents ranging from plain text to Excel and PDF formats. You can
also attach machine learning models to an indexer to analyze images and unstructured text for searchable content.
On HDInsight, Apache Solr can index binary files of many types, including plain text, Word, and PDF. Once the index
is constructed, clients can access the search interface by means of a REST API.
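As a hedged sketch of querying such an index, the snippet below uses the azure-search-documents SDK against a Cognitive Search index. The endpoint, index name, field names, and key are placeholders, and the index is assumed to have been populated by an indexer beforehand.

```python
# Query a search index and show hit highlights for the matching phrase.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://mysearch.search.windows.net",
    index_name="documents",
    credential=AzureKeyCredential("myquerykey"),
)
# Linguistic analysis in the index lets "run" also match "ran" and "running".
for result in search_client.search(search_text="run", top=5, highlight_fields="content"):
    print(result["id"], result.get("@search.highlights"))
```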
If your text data is stored in SQL Server or Azure SQL Database, you can use the full-text search that is built into the
database. The database populates the index from text, binary, or XML data stored within the same database. Clients
search by using T-SQL queries.
For more information, see Search data stores.
Time series solutions
12/18/2020 • 4 minutes to read • Edit Online
Time series data is a set of values organized by time. Examples of time series data include sensor data, stock prices,
click stream data, and application telemetry. Time series data can be analyzed for historical trends, real-time alerts,
or predictive modeling.
Time series data represents how an asset or process changes over time. The data has a timestamp, but more
importantly, time is the most meaningful axis for viewing or analyzing the data. Time series data typically arrives in
order of time and is usually treated as an insert rather than an update to your database. Because of this, change is
measured over time, enabling you to look backward and to predict future change. As such, time series data is best
visualized with scatter or line charts.
Challenges
Time series data is often very high volume, especially in IoT scenarios. Storing, indexing, querying, analyzing,
and visualizing time series data can be challenging.
It can be challenging to find the right combination of high-speed storage and powerful compute operations
for handling real-time analytics, while minimizing time to market and overall cost investment.
Architecture
In many scenarios that involve time series data, such as IoT, the data is captured in real time. As such, a real-time
processing architecture is appropriate.
Data from one or more data sources is ingested into the stream buffering layer by IoT Hub, Event Hubs, or Kafka on
HDInsight. Next, the data is processed in the stream processing layer that can optionally hand off the processed
data to a machine learning service for predictive analytics. The processed data is stored in an analytical data store,
such as HBase, Azure Cosmos DB, Azure Data Lake, or Blob Storage. An analytics and reporting application or
service, like Power BI, Azure Data Explorer, or OpenTSDB (if stored in HBase) can be used to display the time series
data for analysis.
Another option is to use Azure Time Series Insights. Time Series Insights is a fully managed service for time series
data. In this architecture, Time Series Insights performs the roles of stream processing, data store, and analytics and
reporting. It accepts streaming data from either IoT Hub or Event Hubs and stores, processes, analyzes, and
displays the data in near real time. It does not pre-aggregate the data, but stores the raw events.
Time Series Insights is schema adaptive, which means that you do not have to do any data preparation to start
deriving insights. This enables you to explore, compare, and correlate a variety of data sources seamlessly. It also
provides SQL-like filters and aggregates, the ability to construct, visualize, compare, and overlay various time series
patterns and heat maps, and the ability to save and share queries.
Technology choices
Data Storage
Analysis, visualizations, and reporting
Analytical Data Stores
Stream processing
Working with CSV and JSON files for data solutions
12/18/2020 • 4 minutes to read • Edit Online
CSV and JSON are likely the most common formats used for ingesting, exchanging, and storing unstructured or
semi-structured data.
Challenges
There are some challenges to consider when working with these formats:
Without any constraints on the data model, CSV and JSON files are prone to data corruption ("garbage in,
garbage out"). For instance, there's no notion of a date/time object in either format, so the file format does
not prevent you from inserting "ABC123" in a date field.
Using CSV and JSON files as your cold storage solution does not scale well when working with big data. In
most cases, they cannot be split into partitions for parallel processing, and cannot be compressed as well as
binary formats. This often leads to processing and storing this data into read-optimized formats such as
Parquet and ORC (optimized row columnar), which also provide indexes and inline statistics about the data
contained.
You may need to apply a schema on the semi-structured data to make it easier to query and analyze.
Typically, this requires storing the data in another form that complies with your environment's data storage
needs, such as within a database.
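A minimal sketch of that last point, assuming pandas and pyarrow are available: apply a schema while reading the CSV, then persist it in a compressed, read-optimized format such as Parquet. File names and columns are illustrative.

```python
# Apply a schema on read, then store in a columnar, splittable format.
import pandas as pd

df = pd.read_csv(
    "raw/orders.csv",
    dtype={"order_id": "string", "amount": "float64"},
    parse_dates=["order_date"],
)
df.to_parquet("curated/orders.parquet", index=False)
```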
Big data architectures
12/18/2020 • 10 minutes to read • Edit Online
A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or
complex for traditional database systems. The threshold at which organizations enter into the big data realm
differs, depending on the capabilities of the users and their tools. For some, it can mean hundreds of gigabytes of
data, while for others it means hundreds of terabytes. As tools for working with big data sets advance, so does the
meaning of big data. More and more, this term relates to the value you can extract from your data sets through
advanced analytics, rather than strictly the size of the data, although in these cases they tend to be quite large.
Over the years, the data landscape has changed. What you can do, or are expected to do, with data has changed.
The cost of storage has fallen dramatically, while the means by which data is collected keeps growing. Some data
arrives at a rapid pace, constantly demanding to be collected and observed. Other data arrives more slowly, but in
very large chunks, often in the form of decades of historical data. You might be facing an advanced analytics
problem, or one that requires machine learning. These are challenges that big data architectures seek to solve.
Big data solutions typically involve one or more of the following types of workload:
Batch processing of big data sources at rest.
Real-time processing of big data in motion.
Interactive exploration of big data.
Predictive analytics and machine learning.
Consider big data architectures when you need to:
Store and process data in volumes too large for a traditional database.
Transform unstructured data for analysis and reporting.
Capture, process, and analyze unbounded streams of data in real time, or with low latency.
Most big data architectures include some or all of the following components:
Data sources . All big data solutions start with one or more data sources. Examples include:
Application data stores, such as relational databases.
Static files produced by applications, such as web server log files.
Real-time data sources, such as IoT devices.
Data storage . Data for batch processing operations is typically stored in a distributed file store that can
hold high volumes of large files in various formats. This kind of store is often called a data lake. Options for
implementing this storage include Azure Data Lake Store or blob containers in Azure Storage.
Batch processing . Because the data sets are so large, often a big data solution must process data files
using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these
jobs involve reading source files, processing them, and writing the output to new files. Options include
running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce jobs in an
HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster.
Real-time message ingestion . If the solution includes real-time sources, the architecture must include a
way to capture and store real-time messages for stream processing. This might be a simple data store,
where incoming messages are dropped into a folder for processing. However, many solutions need a
message ingestion store to act as a buffer for messages, and to support scale-out processing, reliable
delivery, and other message queuing semantics. This portion of a streaming architecture is often referred to
as stream buffering. Options include Azure Event Hubs, Azure IoT Hub, and Kafka.
Stream processing . After capturing real-time messages, the solution must process them by filtering,
aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an
output sink. Azure Stream Analytics provides a managed stream processing service based on perpetually
running SQL queries that operate on unbounded streams. You can also use open source Apache streaming
technologies like Storm and Spark Streaming in an HDInsight cluster.
Analytical data store . Many big data solutions prepare data for analysis and then serve the processed
data in a structured format that can be queried using analytical tools. The analytical data store used to serve
these queries can be a Kimball-style relational data warehouse, as seen in most traditional business
intelligence (BI) solutions. Alternatively, the data could be presented through a low-latency NoSQL
technology such as HBase, or an interactive Hive database that provides a metadata abstraction over data
files in the distributed data store. Azure Synapse Analytics provides a managed service for large-scale,
cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which can also
be used to serve data for analysis.
Analysis and reporting. The goal of most big data solutions is to provide insights into the data through
analysis and reporting. To empower users to analyze the data, the architecture may include a data modeling
layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis Services. It might also
support self-service BI, using the modeling and visualization technologies in Microsoft Power BI or Microsoft
Excel. Analysis and reporting can also take the form of interactive data exploration by data scientists or data
analysts. For these scenarios, many Azure services support analytical notebooks, such as Jupyter, enabling
these users to leverage their existing skills with Python or R. For large-scale data exploration, you can use
Microsoft R Server, either standalone or with Spark.
Orchestration . Most big data solutions consist of repeated data processing operations, encapsulated in
workflows, that transform source data, move data between multiple sources and sinks, load the processed
data into an analytical data store, or push the results straight to a report or dashboard. To automate these
workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.
Lambda architecture
When working with very large data sets, it can take a long time to run the sort of queries that clients need. These
queries can't be performed in real time, and often require algorithms such as MapReduce that operate in parallel
across the entire data set. The results are then stored separately from the raw data and used for querying.
One drawback to this approach is that it introduces latency — if processing takes a few hours, a query may return
results that are several hours old. Ideally, you would like to get some results in real time (perhaps with some loss of
accuracy), and combine these results with the results from the batch analytics.
The lambda architecture, first proposed by Nathan Marz, addresses this problem by creating two paths for data
flow. All data coming into the system goes through these two paths:
A batch layer (cold path) stores all of the incoming data in its raw form and performs batch processing on
the data. The result of this processing is stored as a batch view .
A speed layer (hot path) analyzes data in real time. This layer is designed for low latency, at the expense of
accuracy.
The batch layer feeds into a serving layer that indexes the batch view for efficient querying. The speed layer
updates the serving layer with incremental updates based on the most recent data.
Data that flows into the hot path is constrained by latency requirements imposed by the speed layer, so that it can
be processed as quickly as possible. Often, this requires a tradeoff of some level of accuracy in favor of data that is
ready as quickly as possible. For example, consider an IoT scenario where a large number of temperature sensors
are sending telemetry data. The speed layer may be used to process a sliding time window of the incoming data.
Data flowing into the cold path, on the other hand, is not subject to the same low latency requirements. This allows
for high accuracy computation across large data sets, which can be very time intensive.
Eventually, the hot and cold paths converge at the analytics client application. If the client needs to display timely,
yet potentially less accurate data in real time, it will acquire its result from the hot path. Otherwise, it will select
results from the cold path to display less timely but more accurate data. In other words, the hot path has data for a
relatively small window of time, after which the results can be updated with more accurate data from the cold path.
The raw data stored at the batch layer is immutable. Incoming data is always appended to the existing data, and the
previous data is never overwritten. Any changes to the value of a particular datum are stored as a new
timestamped event record. This allows for recomputation at any point in time across the history of the data
collected. The ability to recompute the batch view from the original raw data is important, because it allows for
new views to be created as the system evolves.
Kappa architecture
A drawback to the lambda architecture is its complexity. Processing logic appears in two different places — the cold
and hot paths — using different frameworks. This leads to duplicate computation logic and the complexity of
managing the architecture for both paths.
The kappa architecture was proposed by Jay Kreps as an alternative to the lambda architecture. It has the same
basic goals as the lambda architecture, but with an important distinction: All data flows through a single path,
using a stream processing system.
There are some similarities to the lambda architecture's batch layer, in that the event data is immutable and all of it
is collected, instead of a subset. The data is ingested as a stream of events into a distributed and fault tolerant
unified log. These events are ordered, and the current state of an event is changed only by a new event being
appended. Similar to a lambda architecture's speed layer, all event processing is performed on the input stream and
persisted as a real-time view.
If you need to recompute the entire data set (equivalent to what the batch layer does in lambda), you simply replay
the stream, typically using parallelism to complete the computation in a timely fashion.
Internet of Things (IoT)
The cloud gateway ingests device events at the cloud boundary, using a reliable, low latency messaging system.
Devices might send events directly to the cloud gateway, or through a field gateway . A field gateway is a
specialized device or software, usually collocated with the devices, that receives events and forwards them to the
cloud gateway. The field gateway might also preprocess the raw device events, performing functions such as
filtering, aggregation, or protocol transformation.
After ingestion, events go through one or more stream processors that can route the data (for example, to
storage) or perform analytics and other processing.
The following are some common types of processing. (This list is certainly not exhaustive.)
Writing event data to cold storage, for archiving or batch analytics.
Hot path analytics, analyzing the event stream in (near) real time, to detect anomalies, recognize patterns
over rolling time windows, or trigger alerts when a specific condition occurs in the stream.
Handling special types of nontelemetry messages from devices, such as notifications and alarms.
Machine learning.
The boxes that are shaded gray show components of an IoT system that are not directly related to event streaming,
but are included here for completeness.
The device registry is a database of the provisioned devices, including the device IDs and usually device
metadata, such as location.
The provisioning API is a common external interface for provisioning and registering new devices.
Some IoT solutions allow command and control messages to be sent to devices.
Relevant Azure services:
Azure IoT Hub
Azure Event Hubs
Azure Stream Analytics
Azure Data Explorer
Learn more about IoT on Azure by reading the Azure IoT reference architecture.
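For context on the device side of this flow, the hedged sketch below sends one telemetry event to the cloud gateway (IoT Hub) with the azure-iot-device SDK. The connection string, device ID, and payload are placeholders.

```python
# Device-side sketch: publish a telemetry message to IoT Hub.
import json
from azure.iot.device import IoTHubDeviceClient, Message

client = IoTHubDeviceClient.create_from_connection_string(
    "HostName=myhub.azure-devices.net;DeviceId=sensor-1;SharedAccessKey=..."
)
client.connect()
client.send_message(Message(json.dumps({"deviceId": "sensor-1", "temperature": 21.7})))
client.disconnect()
```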
Batch processing
12/18/2020 • 6 minutes to read • Edit Online
A common big data scenario is batch processing of data at rest. In this scenario, the source data is loaded into data
storage, either by the source application itself or by an orchestration workflow. The data is then processed in-place
by a parallelized job, which can also be initiated by the orchestration workflow. The processing may include
multiple iterative steps before the transformed results are loaded into an analytical data store, which can be
queried by analytics and reporting components.
For example, the logs from a web server might be copied to a folder and then processed overnight to generate
daily reports of web activity.
Challenges
Data format and encoding . Some of the most difficult issues to debug happen when files use an
unexpected format or encoding. For example, source files might use a mix of UTF-16 and UTF-8 encoding,
or contain unexpected delimiters (space versus tab), or include unexpected characters. Another common
example is text fields that contain tabs, spaces, or commas that are interpreted as delimiters. Data loading
and parsing logic must be flexible enough to detect and handle these issues.
Orchestrating time slices . Often source data is placed in a folder hierarchy that reflects processing
windows, organized by year, month, day, hour, and so on. In some cases, data may arrive late. For example,
suppose that a web server fails, and the logs for March 7th don't end up in the folder for processing until
March 9th. Are they just ignored because they're too late? Can the downstream processing logic handle out-
of-order records?
Architecture
A batch processing architecture has the following logical components, shown in the diagram above.
Data storage . Typically a distributed file store that can serve as a repository for high volumes of large files
in various formats. Generically, this kind of store is often referred to as a data lake.
Batch processing . The high-volume nature of big data often means that solutions must process data files
using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these
jobs involve reading source files, processing them, and writing the output to new files.
Analytical data store . Many big data solutions are designed to prepare data for analysis and then serve
the processed data in a structured format that can be queried using analytical tools.
Analysis and reporting. The goal of most big data solutions is to provide insights into the data through
analysis and reporting.
Orchestration . With batch processing, typically some orchestration is required to migrate or copy the data
into your data storage, batch processing, analytical data store, and reporting layers.
Technology choices
The following technologies are recommended choices for batch processing solutions in Azure.
Data storage
Azure Storage Blob Containers . Many existing Azure business processes already use Azure blob storage,
making this a good choice for a big data store.
Azure Data Lake Store . Azure Data Lake Store offers virtually unlimited storage for any size of file, and
extensive security options, making it a good choice for extremely large-scale big data solutions that require a
centralized store for data in heterogeneous formats.
For more information, see Data storage.
Batch processing
U-SQL . U-SQL is the query processing language used by Azure Data Lake Analytics. It combines the declarative
nature of SQL with the procedural extensibility of C#, and takes advantage of parallelism to enable efficient
processing of data at massive scale.
Hive . Hive is a SQL-like language that is supported in most Hadoop distributions, including HDInsight. It can be
used to process data from any HDFS-compatible store, including Azure blob storage and Azure Data Lake Store.
Pig . Pig is a declarative big data processing language used in many Hadoop distributions, including HDInsight.
It is particularly useful for processing data that is unstructured or semi-structured.
Spark . The Spark engine supports batch processing programs written in a range of languages, including Java,
Scala, and Python. Spark uses a distributed architecture to process data in parallel across multiple worker
nodes.
For more information, see Batch processing.
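To make the batch processing options above concrete, here is a hedged PySpark sketch of the web-log example: read raw logs from blob storage, compute daily page counts, and write Parquet for the analytical store. The storage paths and log fields ("timestamp", "page") are assumptions.

```python
# Batch job sketch: long-running read -> aggregate -> write over files at rest.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, count

spark = SparkSession.builder.appName("daily-web-report").getOrCreate()

logs = spark.read.json("wasbs://logs@mystorageaccount.blob.core.windows.net/raw/2020/12/*")
daily = (logs.withColumn("day", to_date("timestamp"))
             .groupBy("day", "page")
             .agg(count("*").alias("hits")))
daily.write.mode("overwrite").partitionBy("day").parquet(
    "wasbs://curated@mystorageaccount.blob.core.windows.net/daily_page_hits/")
```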
Analytical data store
Azure Synapse Analytics . Azure Synapse is a managed service based on SQL Server database technologies
and optimized to support large-scale data warehousing workloads.
Spark SQL . Spark SQL is an API built on Spark that supports the creation of dataframes and tables that can be
queried using SQL syntax.
HBase . HBase is a low-latency NoSQL store that offers a high-performance, flexible option for querying
structured and semi-structured data.
Hive . In addition to being useful for batch processing, Hive offers a database architecture that is conceptually
similar to that of a typical relational database management system. Improvements in Hive query performance
through innovations like the Tez engine and Stinger initiative mean that Hive tables can be used effectively as
sources for analytical queries in some scenarios.
For more information, see Analytical data stores.
Analytics and reporting
Azure Analysis Services. Many big data solutions emulate traditional enterprise business intelligence
architectures by including a centralized online analytical processing (OLAP) data model (often referred to as a
cube) on which reports, dashboards, and interactive "slice and dice" analysis can be based. Azure Analysis
Services supports the creation of tabular models to meet this need.
Power BI . Power BI enables data analysts to create interactive data visualizations based on data models in an
OLAP model or directly from an analytical data store.
Microsoft Excel . Microsoft Excel is one of the most widely used software applications in the world, and offers
a wealth of data analysis and visualization capabilities. Data analysts can use Excel to build document data
models from analytical data stores, or to retrieve data from OLAP data models into interactive PivotTables and
charts.
For more information, see Analytics and reporting.
Orchestration
Azure Data Factory. Azure Data Factory pipelines can be used to define a sequence of activities, scheduled for
recurring temporal windows. These activities can initiate data copy operations as well as Hive, Pig, MapReduce,
or Spark jobs in on-demand HDInsight clusters; U-SQL jobs in Azure Data Lake Analytics; and stored
procedures in Azure Synapse or Azure SQL Database.
Oozie and Sqoop . Oozie is a job automation engine for the Apache Hadoop ecosystem and can be used to
initiate data copy operations as well as Hive, Pig, and MapReduce jobs to process data and Sqoop jobs to copy
data between HDFS and SQL databases.
For more information, see Pipeline orchestration
Real time processing
12/18/2020 • 5 minutes to read • Edit Online
Real time processing deals with streams of data that are captured in real-time and processed with minimal latency
to generate real-time (or near-real-time) reports or automated responses. For example, a real-time traffic
monitoring solution might use sensor data to detect high traffic volumes. This data could be used to dynamically
update a map to show congestion, or automatically initiate high-occupancy lanes or other traffic management
systems.
Real-time processing is defined as the processing of unbounded streams of input data, with very short latency
requirements for processing — measured in milliseconds or seconds. This incoming data typically arrives in an
unstructured or semi-structured format, such as JSON, and has the same processing requirements as batch
processing, but with shorter turnaround times to support real-time consumption.
Processed data is often written to an analytical data store, which is optimized for analytics and visualization. The
processed data can also be ingested directly into the analytics and reporting layer for analysis, business
intelligence, and real-time dashboard visualization.
Challenges
One of the big challenges of real-time processing solutions is to ingest, process, and store messages in real time,
especially at high volumes. Processing must be done in such a way that it does not block the ingestion pipeline.
The data store must support high-volume writes. Another challenge is being able to act on the data quickly, such
as generating alerts in real time or presenting the data in a real-time (or near-real-time) dashboard.
Architecture
A real-time processing architecture has the following logical components.
Real-time message ingestion. The architecture must include a way to capture and store real-time
messages to be consumed by a stream processing consumer. In simple cases, this service could be
implemented as a simple data store in which new messages are deposited in a folder. But often the solution
requires a message broker, such as Azure Event Hubs, that acts as a buffer for the messages. The message
broker should support scale-out processing and reliable delivery.
Stream processing. After capturing real-time messages, the solution must process them by filtering,
aggregating, and otherwise preparing the data for analysis.
Analytical data store. Many big data solutions are designed to prepare data for analysis and then serve
the processed data in a structured format that can be queried using analytical tools.
Analysis and reporting. The goal of most big data solutions is to provide insights into the data through
analysis and reporting.
Technology choices
The following technologies are recommended choices for real-time processing solutions in Azure.
Real-time message ingestion
Azure Event Hubs . Azure Event Hubs is a messaging solution for ingesting millions of event messages per
second. The captured event data can be processed by multiple consumers in parallel. While Event Hubs natively
supports AMQP (Advanced Message Queuing Protocol 1.0), it also provides a binary compatibility layer that
allows applications using the Kafka protocol (Kafka 1.0 and above) to process events using Event Hubs with no
application changes.
Azure IoT Hub . Azure IoT Hub provides bi-directional communication between Internet-connected devices,
and a scalable message queue that can handle millions of simultaneously connected devices.
Apache Kafka . Kafka is an open source message queuing and stream processing application that can scale to
handle millions of messages per second from multiple message producers, and route them to multiple
consumers. Kafka is available in Azure as an HDInsight cluster type.
For more information, see Real-time message ingestion.
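As a hedged producer-side sketch, the snippet below publishes a small batch of telemetry events to an event hub with the azure-eventhub SDK; the connection string and hub name are placeholders.

```python
# Ingestion sketch: send a batch of events into the stream buffering layer.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://mynamespace.servicebus.windows.net/;SharedAccessKeyName=send;SharedAccessKey=...",
    eventhub_name="telemetry",
)
with producer:
    batch = producer.create_batch()
    for reading in [{"deviceId": "sensor-1", "temperature": 21.7},
                    {"deviceId": "sensor-2", "temperature": 23.1}]:
        batch.add(EventData(json.dumps(reading)))
    producer.send_batch(batch)
```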
Data storage
Azure Storage Blob Containers or Azure Data Lake Store . Incoming real-time data is usually captured in
a message broker (see above), but in some scenarios, it can make sense to monitor a folder for new files and
process them as they are created or updated. Additionally, many real-time processing solutions combine
streaming data with static reference data, which can be stored in a file store. Finally, file storage may be used as
an output destination for captured real-time data for archiving, or for further batch processing in a lambda
architecture.
For more information, see Data storage.
Stream processing
Azure Stream Analytics . Azure Stream Analytics can run perpetual queries against an unbounded stream of
data. These queries consume streams of data from storage or message brokers, filter and aggregate the data
based on temporal windows, and write the results to sinks such as storage, databases, or directly to reports in
Power BI. Stream Analytics uses a SQL-based query language that supports temporal and geospatial constructs,
and can be extended using JavaScript.
Storm . Apache Storm is an open source framework for stream processing that uses a topology of spouts and
bolts to consume, process, and output the results from real-time streaming data sources. You can provision
Storm in an Azure HDInsight cluster, and implement a topology in Java or C#.
Spark Streaming . Apache Spark is an open source distributed platform for general data processing. Spark
provides the Spark Streaming API, in which you can write code in any supported Spark language, including
Java, Scala, and Python. Spark 2.0 introduced the Spark Structured Streaming API, which provides a simpler and
more consistent programming model. Spark 2.0 is available in an Azure HDInsight cluster.
For more information, see Stream processing.
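The following Structured Streaming sketch suggests the shape of such a job: read events from a Kafka-compatible broker, aggregate temperatures over one-minute windows, and write the results to the console sink. The broker, topic, and message schema are assumptions, and the Spark Kafka connector package must be available on the cluster.

```python
# Stream processing sketch: windowed aggregation over an unbounded stream.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window, avg
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("stream-agg").getOrCreate()
schema = (StructType()
          .add("deviceId", StringType())
          .add("temperature", DoubleType())
          .add("ts", TimestampType()))

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "mybroker:9093")
          .option("subscribe", "telemetry")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

windowed = (events.withWatermark("ts", "5 minutes")
            .groupBy(window("ts", "1 minute"), "deviceId")
            .agg(avg("temperature").alias("avg_temp")))

windowed.writeStream.outputMode("append").format("console").start().awaitTermination()
```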
Analytical data store
Azure Synapse Analytics, Azure Data Explorer, HBase, Spark, or Hive. Processed real-time data can be
stored in a relational database such as Synapse Analytics, in Azure Data Explorer, in a NoSQL store such as HBase, or as
files in distributed storage over which Spark or Hive tables can be defined and queried.
For more information, see Analytical data stores.
Analytics and reporting
Azure Analysis Services, Power BI, and Microsoft Excel. Processed real-time data that is stored in an
analytical data store can be used for historical reporting and analysis in the same way as batch processed data.
Additionally, Power BI can be used to publish real-time (or near-real-time) reports and visualizations from
analytical data sources where latency is sufficiently low, or in some cases directly from the stream processing
output.
For more information, see Analytics and reporting.
In a purely real-time solution, most of the processing orchestration is managed by the message ingestion and
stream processing components. However, in a lambda architecture that combines batch processing and real-time
processing, you may need to use an orchestration framework such as Azure Data Factory or Apache Oozie and
Sqoop to manage batch workflows for captured real-time data.
Next steps
The following reference architecture shows an end-to-end stream processing pipeline:
Stream processing with Azure Stream Analytics
Choosing an analytical data store in Azure
12/18/2020 • 5 minutes to read • Edit Online
In a big data architecture, there is often a need for an analytical data store that serves processed data in a
structured format that can be queried using analytical tools. Analytical data stores that support querying of both
hot-path and cold-path data are collectively referred to as the serving layer, or data serving storage.
The serving layer deals with processed data from both the hot path and cold path. In the lambda architecture, the
serving layer is subdivided into a speed serving layer, which stores data that has been processed incrementally,
and a batch serving layer, which contains the batch-processed output. The serving layer requires strong support
for random reads with low latency. Data storage for the speed layer should also support random writes, because
batch loading data into this store would introduce undesired delays. On the other hand, data storage for the batch
layer does not need to support random writes, but batch writes instead.
There is no single best data management choice for all data storage tasks. Different data management solutions
are optimized for different tasks. Most real-world cloud apps and big data processes have a variety of data storage
requirements and often use a combination of data storage solutions.
Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
CAPABILITY | AZURE SQL DATABASE | AZURE SYNAPSE | AZURE DATA EXPLORER | HBASE/PHOENIX ON HDINSIGHT | HIVE LLAP ON HDINSIGHT | AZURE ANALYSIS SERVICES | COSMOS DB
Security capabilities
CAPABILITY | AZURE SQL DATABASE | AZURE SYNAPSE | AZURE DATA EXPLORER | HBASE/PHOENIX ON HDINSIGHT | HIVE LLAP ON HDINSIGHT | AZURE ANALYSIS SERVICES | COSMOS DB
Authentication | SQL / Azure Active Directory (Azure AD) | SQL / Azure AD | Azure AD | local / Azure AD [1] | local / Azure AD [1] | Azure AD | database users / Azure AD via access control (IAM)
The goal of most big data solutions is to provide insights into the data through analysis and reporting. This can
include preconfigured reports and visualizations, or interactive data exploration.
Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
CAPABILITY | POWER BI | JUPYTER NOTEBOOKS | ZEPPELIN NOTEBOOKS | MICROSOFT AZURE NOTEBOOKS
Embedding capabilities | Yes | No | No | No
Big data solutions often use long-running batch jobs to filter, aggregate, and otherwise prepare the data for
analysis. Usually these jobs involve reading source files from scalable storage (like HDFS, Azure Data Lake Store,
and Azure Storage), processing them, and writing the output to new files in scalable storage.
The key requirement of such batch processing engines is the ability to scale out computations, in order to handle a
large volume of data. Unlike real-time processing, however, batch processing is expected to have latencies (the time
between data ingestion and computing a result) that measure in minutes to hours.
Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
CAPABILITY | AZURE DATA LAKE ANALYTICS | AZURE SYNAPSE | HDINSIGHT WITH SPARK | HDINSIGHT WITH HIVE | HDINSIGHT WITH HIVE LLAP | AZURE DATABRICKS
Pricing model | Per batch job | By cluster hour | By cluster hour | By cluster hour | By cluster hour | Databricks Unit [2] + cluster hour
Autoscaling | No | No | No | No | No | Yes
Scale-out granularity | Per job | Per cluster | Per cluster | Per cluster | Per cluster | Per cluster
A data lake is a storage repository that holds a large amount of data in its native, raw format. Data lake stores are
optimized for scaling to terabytes and petabytes of data. The data typically comes from multiple heterogeneous
sources, and may be structured, semi-structured, or unstructured. The idea with a data lake is to store everything in
its original, untransformed state. This approach differs from a traditional data warehouse, which transforms and
processes the data at the time of ingestion.
Advantages of a data lake:
Data is never thrown away, because the data is stored in its raw format. This is especially useful in a big data
environment, when you may not know in advance what insights are available from the data.
Users can explore the data and create their own queries.
May be faster than traditional ETL tools.
More flexible than a data warehouse, because it can store unstructured and semi-structured data.
A complete data lake solution consists of both storage and processing. Data lake storage is designed for fault-
tolerance, infinite scalability, and high-throughput ingestion of data with varying shapes and sizes. Data lake
processing involves one or more processing engines built with these goals in mind, and can operate on data stored
in a data lake at scale.
Challenges
Lack of a schema or descriptive metadata can make the data hard to consume or query.
Lack of semantic consistency across the data can make it challenging to perform analysis on the data, unless
users are highly skilled at data analytics.
It can be hard to guarantee the quality of the data going into the data lake.
Without proper governance, access control and privacy issues can be problems. What information is going into
the data lake, who can access that data, and for what uses?
A data lake may not be the best way to integrate data that is already relational.
By itself, a data lake does not provide integrated or holistic views across the organization.
A data lake may become a dumping ground for data that is never actually analyzed or mined for insights.
This topic compares options for data storage for big data solutions — specifically, data storage for bulk data
ingestion and batch processing, as opposed to analytical data stores or real-time streaming ingestion.
Azure Cosmos DB
Azure Cosmos DB is Microsoft's globally distributed multi-model database. Cosmos DB guarantees single-digit-
millisecond latencies at the 99th percentile anywhere in the world, offers multiple well-defined consistency models
to fine-tune performance, and guarantees high availability with multi-homing capabilities.
Azure Cosmos DB is schema-agnostic. It automatically indexes all the data without requiring you to deal with
schema and index management. It's also multi-model, natively supporting document, key-value, graph, and
column-family data models.
Azure Cosmos DB features:
Geo-replication
Elastic scaling of throughput and storage worldwide
Five well-defined consistency levels
HBase on HDInsight
Apache HBase is an open-source, NoSQL database that is built on Hadoop and modeled after Google BigTable.
HBase provides random access and strong consistency for large amounts of unstructured and semi-structured
data in a schemaless database organized by column families.
Data is stored in the rows of a table, and data within a row is grouped by column family. HBase is schemaless in
the sense that neither the columns nor the type of data stored in them need to be defined before using them. The
open-source code scales linearly to handle petabytes of data on thousands of nodes. It can rely on data
redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop
ecosystem.
The HDInsight implementation leverages the scale-out architecture of HBase to provide automatic sharding of
tables, strong consistency for reads and writes, and automatic failover. Performance is enhanced by in-memory
caching for reads and high-throughput streaming for writes. In most cases, you'll want to create the HBase cluster
inside a virtual network so other HDInsight clusters and applications can directly access the tables.
Azure Data Explorer
Azure Data Explorer is a fast and highly scalable data exploration service for log and telemetry data. It helps you
handle the many data streams emitted by modern software so you can collect, store, and analyze data. Azure Data
Explorer is ideal for analyzing large volumes of diverse data from any data source, such as websites, applications,
IoT devices, and more. This data is used for diagnostics, monitoring, reporting, machine learning, and additional
analytics capabilities. Azure Data Explorer makes it simple to ingest this data and enables you to do complex ad
hoc queries on the data in seconds.
Azure Data Explorer can be linearly scaled out for increasing ingestion and query processing throughput. An Azure
Data Explorer cluster can be deployed to a Virtual Network for enabling private networks.
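As a sketch of ad hoc querying, the following example uses the azure-kusto-data Python client; the cluster URI, database, table, and column names are placeholders rather than anything defined in this guide.

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Placeholder cluster URI and database name.
cluster = "https://mycluster.westus.kusto.windows.net"
kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(cluster)
client = KustoClient(kcsb)

# An ad hoc aggregation over recent telemetry, expressed in Kusto Query Language (KQL).
query = "Telemetry | where Timestamp > ago(1h) | summarize count() by DeviceId"
response = client.execute("TelemetryDb", query)

for row in response.primary_results[0]:
    print(row["DeviceId"], row["count_"])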
Capability matrix
The following tables summarize the key differences in capabilities.
File storage capabilities
| Capability | Azure Data Lake Store | Azure Blob Storage containers |
| Purpose | Optimized storage for big data analytics workloads | General purpose object store for a wide variety of storage scenarios |
| Use cases | Batch, streaming analytics, and machine learning data such as log files, IoT data, click streams, large datasets | Any type of text or binary data, such as application back end, backup data, media storage for streaming, and general purpose data |
| Authentication protocol | OAuth 2.0. Calls must contain a valid JWT (JSON Web Token) issued by Azure Active Directory | Hash-based message authentication code (HMAC). Calls must contain a Base64-encoded SHA-256 hash over a part of the HTTP request. |
| Authorization | POSIX access control lists (ACLs). ACLs based on Azure Active Directory identities can be set at the file and folder level. | For account-level authorization, use Account Access Keys. For account, container, or blob authorization, use Shared Access Signature keys. |
| Developer SDKs | .NET, Java, Python, Node.js | .NET, Java, Python, Node.js, C++, Ruby |
| Analytics workload performance | Optimized performance for parallel analytics workloads, high throughput and IOPS | Not optimized for analytics workloads |
| Size limits | No limits on account sizes, file sizes, or number of files | Specific limits documented here |
NoSQL database capabilities
| Capability | Azure Cosmos DB | HBase on HDInsight |
| Primary database model | Document store, graph, key-value store, wide column store | Wide column store |
| SQL language support | Yes | Yes (using the Phoenix JDBC driver) |
| Pricing model | Elastically scalable request units (RUs) charged per-second as needed, elastically scalable storage | Per-minute pricing for HDInsight cluster (horizontal scaling of nodes), storage |
Modern business systems manage increasingly large volumes of heterogeneous data. This heterogeneity means
that a single data store is usually not the best approach. Instead, it's often better to store different types of data in
different data stores, each focused toward a specific workload or usage pattern. The term polyglot persistence is
used to describe solutions that use a mix of data store technologies. Therefore, it's important to understand the
main storage models and their tradeoffs.
Selecting the right data store for your requirements is a key design decision. There are literally hundreds of
implementations to choose from among SQL and NoSQL databases. Data stores are often categorized by how
they structure data and the types of operations they support. This article describes several of the most common
storage models. Note that a particular data store technology may support multiple storage models. For example,
a relational database management system (RDBMS) may also support key/value or graph storage. In fact, there
is a general trend for so-called multi-model support, where a single database system supports several models.
But it's still useful to understand the different models at a high level.
Not all data stores in a given category provide the same feature-set. Most data stores provide server-side
functionality to query and process data. Sometimes this functionality is built into the data storage engine. In
other cases, the data storage and processing capabilities are separated, and there may be several options for
processing and analysis. Data stores also support different programmatic and management interfaces.
Generally, you should start by considering which storage model is best suited for your requirements. Then
consider a particular data store within that category, based on factors such as feature set, cost, and ease of
management.
Key/value stores
A key/value store associates each data value with a unique key. Most key/value stores only support simple query,
insert, and delete operations. To modify a value (either partially or completely), an application must overwrite the
existing data for the entire value. In most implementations, reading or writing a single value is an atomic
operation.
An application can store arbitrary data as a set of values. Any schema information must be provided by the
application. The key/value store simply retrieves or stores the value by key.
Key/value stores are highly optimized for applications performing simple lookups, but are less suitable if you
need to query data across different key/value stores. Key/value stores are also not optimized for querying by
value.
A single key/value store can be extremely scalable, as the data store can easily distribute data across multiple
nodes on separate machines.
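For example, a minimal sketch of the key/value pattern against Azure Cache for Redis using the redis-py client; the host name, access key, and key names are placeholders.

import json
import redis

# Placeholder host name and access key for an Azure Cache for Redis instance.
cache = redis.Redis(
    host="mycache.redis.cache.windows.net",
    port=6380,
    ssl=True,
    password="<access-key>")

# Store a value under a unique key; any schema is the application's concern.
cache.set("session:42", json.dumps({"user": "alice", "cart": ["sku-1", "sku-2"]}))

# Retrieve (or overwrite) the value by key; reading or writing a single value is atomic.
session = json.loads(cache.get("session:42"))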
Azure services
Azure Cosmos DB Table API, etcd API (preview), and SQL API | (Cosmos DB Security Baseline)
Azure Cache for Redis | (Security Baseline)
Azure Table Storage | (Security Baseline)
Workload
Data is accessed using a single key, like a dictionary.
No joins, lock, or unions are required.
No aggregation mechanisms are used.
Secondary indexes are generally not used.
Data type
Each key is associated with a single value.
There is no schema enforcement.
No relationships between entities.
Examples
Data caching
Session management
User preference and profile management
Product recommendation and ad serving
Document databases
A document database stores a collection of documents, where each document consists of named fields and data.
The data can be simple values or complex elements such as lists and child collections. Documents are retrieved
by unique keys.
Typically, a document contains the data for single entity, such as a customer or an order. A document may contain
information that would be spread across several relational tables in an RDBMS. Documents don't need to have
the same structure. Applications can store different data in documents as business requirements change.
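A brief sketch using the azure-cosmos Python SDK shows how a whole document is written and then queried back by a field value; the endpoint, key, database, container, and field names are placeholders.

from azure.cosmos import CosmosClient

# Placeholder endpoint, key, database, and container names.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("retail").get_container_client("orders")

# Each document holds the data for a single entity and can have its own structure.
container.upsert_item({
    "id": "order-1001",
    "customerId": "cust-42",
    "items": [{"sku": "sku-1", "qty": 2}],
    "status": "shipped"
})

# Query documents by field values.
orders = container.query_items(
    query="SELECT * FROM c WHERE c.customerId = @cid",
    parameters=[{"name": "@cid", "value": "cust-42"}],
    enable_cross_partition_query=True)
for order in orders:
    print(order["id"], order["status"])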
Azure service
Azure Cosmos DB SQL API | (Cosmos DB Security Baseline)
Workload
Insert and update operations are common.
No object-relational impedance mismatch. Documents can better match the object structures used in
application code.
Individual documents are retrieved and written as a single block.
Data requires index on multiple fields.
Data type
Data can be managed in de-normalized way.
Size of individual document data is relatively small.
Each document type can use its own schema.
Documents can include optional fields.
Document data is semi-structured, meaning that data types of each field are not strictly defined.
Examples
Product catalog
Content management
Inventory management
Graph databases
A graph database stores two types of information, nodes and edges. Edges specify relationships between nodes.
Nodes and edges can have properties that provide information about that node or edge, similar to columns in a
table. Edges can also have a direction indicating the nature of the relationship.
Graph databases can efficiently perform queries across the network of nodes and edges and analyze the
relationships between entities. The following diagram shows an organization's personnel database structured as
a graph. The entities are employees and departments, and the edges indicate reporting relationships and the
departments in which employees work.
This structure makes it straightforward to perform queries such as "Find all employees who report directly or
indirectly to Sarah" or "Who works in the same department as John?" For large graphs with lots of entities and
relationships, you can perform very complex analyses very quickly. Many graph databases provide a query
language that you can use to traverse a network of relationships efficiently.
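As a minimal sketch of such a traversal, the following example uses the gremlinpython driver against the Azure Cosmos DB Gremlin API; the account, database, graph, and vertex names are placeholders.

from gremlin_python.driver import client, serializer

# Placeholder account, database, and graph names.
gremlin = client.Client(
    "wss://myaccount.gremlin.cosmos.azure.com:443/", "g",
    username="/dbs/hr/colls/people",
    password="<key>",
    message_serializer=serializer.GraphSONSerializersV2d0())

# Add nodes (vertices) and a directed edge describing the reporting relationship.
gremlin.submit("g.addV('employee').property('id', 'sarah').property('name', 'Sarah')").all().result()
gremlin.submit("g.addV('employee').property('id', 'john').property('name', 'John')").all().result()
gremlin.submit("g.V('john').addE('reportsTo').to(g.V('sarah'))").all().result()

# Traverse the relationships: who reports directly to Sarah?
reports = gremlin.submit("g.V('sarah').in('reportsTo').values('name')").all().result()
print(reports)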
Azure services
Azure Cosmos DB Gremlin API | (Security Baseline)
SQL Server | (Security Baseline)
Workload
Complex relationships between data items involving many hops between related data items.
The relationships between data items are dynamic and change over time.
Relationships between objects are first-class citizens, without requiring foreign-keys and joins to traverse.
Data type
Nodes and relationships.
Nodes are similar to table rows or JSON documents.
Relationships are just as important as nodes, and are exposed directly in the query language.
Composite objects, such as a person with multiple phone numbers, tend to be broken into separate, smaller
nodes, combined with traversable relationships
Examples
Organization charts
Social graphs
Fraud detection
Recommendation engines
Data analytics
Data analytics stores provide massively parallel solutions for ingesting, storing, and analyzing data. The data is
distributed across multiple servers to maximize scalability. Large data file formats such as delimited files (CSV),
Parquet, and ORC are widely used in data analytics. Historical data is typically stored in data stores such as blob
storage or Azure Data Lake Storage Gen2, which are then accessed by Azure Synapse, Databricks, or HDInsight as
external tables. A typical scenario that uses data stored as Parquet files for performance is described in the article
Use external tables with Synapse SQL.
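For instance, a short sketch of reading historical Parquet files from Azure Data Lake Storage Gen2 into Spark on Azure Databricks; the storage account, path, and column names are placeholders, and the spark session is assumed to be provided by the Databricks runtime.

from pyspark.sql import functions as F

# Placeholder ADLS Gen2 path holding historical fact data as Parquet files.
sales = spark.read.parquet(
    "abfss://warehouse@mystorageaccount.dfs.core.windows.net/facts/sales")

# A typical analytics query over denormalized fact data.
revenue_by_month = (sales
                    .groupBy(F.date_trunc("month", "order_date").alias("month"))
                    .agg(F.sum("amount").alias("revenue")))
revenue_by_month.show()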
Azure services
Azure Synapse Analytics | (Security Baseline)
Azure Data Lake | (Security Baseline)
Azure Data Explorer | (Security Baseline)
Azure Analysis Services
HDInsight | (Security Baseline)
Azure Databricks | (Security Baseline)
Workload
Data analytics
Enterprise BI
Data type
Historical data from multiple sources.
Usually denormalized in a "star" or "snowflake" schema, consisting of fact and dimension tables.
Usually loaded with new data on a scheduled basis.
Dimension tables often include multiple historic versions of an entity, referred to as a slowly changing
dimension.
Examples
Enterprise data warehouse
Column-family databases
A column-family database organizes data into rows and columns. In its simplest form, a column-family database
can appear very similar to a relational database, at least conceptually. The real power of a column-family
database lies in its denormalized approach to structuring sparse data.
You can think of a column-family database as holding tabular data with rows and columns, but the columns are
divided into groups known as column families. Each column family holds a set of columns that are logically
related together and are typically retrieved or manipulated as a unit. Other data that is accessed separately can be
stored in separate column families. Within a column family, new columns can be added dynamically, and rows
can be sparse (that is, a row doesn't need to have a value for every column).
The following diagram shows an example with two column families, Identity and Contact Info. The data for a
single entity has the same row key in each column-family. This structure, where the rows for any given object in a
column family can vary dynamically, is an important benefit of the column-family approach, making this form of
data store highly suited for storing structured, volatile data.
Unlike a key/value store or a document database, most column-family databases store data in key order, rather
than by computing a hash. Many implementations allow you to create indexes over specific columns in a column-
family. Indexes let you retrieve data by column value, rather than by row key.
Read and write operations for a row are usually atomic within a single column-family, although some
implementations provide atomicity across the entire row, spanning multiple column-families.
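A small sketch using the happybase Python client against HBase illustrates the column-family put/get/scan pattern described here; the Thrift host, table, row keys, and column names are placeholders.

import happybase

# Placeholder HBase Thrift endpoint (for HDInsight, typically reached from inside the virtual network).
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("customers")

# Put: write cells into the Identity and ContactInfo column families for one row key.
table.put(b"cust-42", {
    b"Identity:first_name": b"Sarah",
    b"Identity:last_name": b"Lee",
    b"ContactInfo:email": b"sarah@example.com"})

# Get: read a single row, optionally restricted to one column family.
row = table.row(b"cust-42", columns=[b"Identity"])

# Scan: iterate over a range of row keys in key order.
for key, data in table.scan(row_prefix=b"cust-"):
    print(key, data)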
Azure services
Azure Cosmos DB Cassandra API | (Security Baseline)
HBase in HDInsight | (Security Baseline)
Workload
Most column-family databases perform write operations extremely quickly.
Update and delete operations are rare.
Designed to provide high throughput and low-latency access.
Supports easy query access to a particular set of fields within a much larger record.
Massively scalable.
Data type
Data is stored in tables consisting of a key column and one or more column families.
Specific columns can vary by individual rows.
Individual cells are accessed via get and put commands
Multiple rows are returned using a scan command.
Examples
Recommendations
Personalization
Sensor data
Telemetry
Messaging
Social media analytics
Web analytics
Activity monitoring
Weather and other time-series data
Search Engine Databases
A search engine database allows applications to search for information held in external data stores. A search
engine database can index massive volumes of data and provide near real-time access to these indexes.
Indexes can be multi-dimensional and may support free-text searches across large volumes of text data. Indexing
can be performed using a pull model, triggered by the search engine database, or using a push model, initiated
by external application code.
Searching can be exact or fuzzy. A fuzzy search finds documents that match a set of terms and calculates how
closely they match. Some search engines also support linguistic analysis that can return matches based on
synonyms, genre expansions (for example, matching dogs to pets), and stemming (matching words with the
same root).
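As an illustration, a minimal full-text query against an existing index using the azure-search-documents Python SDK; the service endpoint, index name, query key, and field names are placeholders.

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Placeholder search service endpoint, index name, and query key.
search = SearchClient(
    endpoint="https://mysearchservice.search.windows.net",
    index_name="products",
    credential=AzureKeyCredential("<query-key>"))

# Full-text search over the indexed documents; results include a relevance score.
results = search.search(search_text="waterproof hiking boots", top=10)
for doc in results:
    print(doc["id"], doc.get("@search.score"))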
Azure service
Azure Search | (Security Baseline)
Workload
Data indexes from multiple sources and services.
Queries are ad-hoc and can be complex.
Full text search is required.
Ad hoc self-service query is required.
Data type
Semi-structured or unstructured text
Text with reference to structured data
Examples
Product catalogs
Site search
Logging
Object storage
Object storage is optimized for storing and retrieving large binary objects (images, files, video and audio streams,
large application data objects and documents, virtual machine disk images). Large data files are also commonly
used in this model, for example delimited files (CSV), Parquet, and ORC. Object stores can manage extremely large
amounts of unstructured data.
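For example, a short sketch that stores and retrieves a large binary object with the azure-storage-blob Python SDK; the connection string, container, blob, and file names are placeholders.

from azure.storage.blob import BlobServiceClient

# Placeholder connection string, container, and blob names.
service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="media", blob="videos/intro.mp4")

# Upload the object; the value is opaque to the store and identified only by its name (key).
with open("intro.mp4", "rb") as data:
    blob.upload_blob(data, overwrite=True)

# Download it again later, from any application tier.
content = blob.download_blob().readall()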
Azure service
Azure Blob Storage | (Security Baseline)
Azure Data Lake Storage Gen2 | (Security Baseline)
Workload
Identified by key.
Content is typically an asset such as a delimited file, image, or video file.
Content must be durable and external to any application tier.
Data type
Data size is large.
Value is opaque.
Examples
Images, videos, office documents, PDFs
Static HTML, JSON, CSS
Log and audit files
Database backups
Shared files
Sometimes, using simple flat files can be the most effective means of storing and retrieving information. Using
file shares enables files to be accessed across a network. Given appropriate security and concurrent access
control mechanisms, sharing data in this way can enable distributed services to provide highly scalable data
access for performing basic, low-level operations such as simple read and write requests.
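A brief sketch with the azure-storage-file-share Python SDK shows programmatic access to a file share when an SMB mount isn't used; the connection string, share name, and file paths are placeholders.

from azure.storage.fileshare import ShareFileClient

# Placeholder connection string, share name, and file path within the folder hierarchy.
file_client = ShareFileClient.from_connection_string(
    conn_str="<connection-string>",
    share_name="legacy-config",
    file_path="app1/settings.ini")

# Upload a local file to the share; other VMs or app instances can then read it over SMB or REST.
with open("settings.ini", "rb") as source:
    file_client.upload_file(source)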
Azure service
Azure Files | (Security Baseline)
Workload
Migration from existing apps that interact with the file system.
Requires SMB interface.
Data type
Files in a hierarchical set of folders.
Accessible with standard I/O libraries.
Examples
Legacy files
Shared content accessible among a number of VMs or app instances
Aided with this understanding of different data storage models, the next step is to evaluate your workload and
application, and decide which data store will meet your specific needs. Use the data storage decision tree to help
with this process.
Choosing a data pipeline orchestration technology in
Azure
12/18/2020 • 2 minutes to read • Edit Online
Most big data solutions consist of repeated data processing operations, encapsulated in workflows. A pipeline
orchestrator is a tool that helps to automate these workflows. An orchestrator can schedule jobs, execute
workflows, and coordinate dependencies among tasks.
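For example, the following sketch starts an Azure Data Factory pipeline run programmatically with the azure-mgmt-datafactory Python SDK and polls its status; the subscription, resource group, factory, pipeline, and parameter names are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholder subscription, resource group, factory, and pipeline names.
adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = adf.pipelines.create_run(
    resource_group_name="rg-data",
    factory_name="my-data-factory",
    pipeline_name="nightly-ingest",
    parameters={"runDate": "2020-12-18"})

# Poll the run status; the orchestrator handles scheduling and task dependencies.
status = adf.pipeline_runs.get("rg-data", "my-data-factory", run.run_id)
print(status.status)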
Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
| Capability | Azure Data Factory | SQL Server Integration Services (SSIS) | Oozie on HDInsight |
| Management tools | Azure Portal, PowerShell, CLI, .NET SDK | SSMS, PowerShell | Bash shell, Oozie REST API, Oozie web UI |
| Pricing | Pay per usage | Licensing / pay for features | No additional charge on top of running the HDInsight cluster |
Pipeline capabilities
| Capability | Azure Data Factory | SQL Server Integration Services (SSIS) | Oozie on HDInsight |
| Spark | Yes | No | No |
Scalability capabilities
| Capability | Azure Data Factory | SQL Server Integration Services (SSIS) | Oozie on HDInsight |
| Scale up | Yes | No | No |
Real time processing deals with streams of data that are captured in real-time and processed with minimal latency.
Many real-time processing solutions need a message ingestion store to act as a buffer for messages, and to
support scale-out processing, reliable delivery, and other message queuing semantics.
Kafka on HDInsight
Apache Kafka is an open-source distributed streaming platform that can be used to build real-time data pipelines
and streaming applications. Kafka also provides message broker functionality similar to a message queue, where
you can publish and subscribe to named data streams. It is horizontally scalable, fault-tolerant, and extremely fast.
Kafka on HDInsight provides Kafka as a managed, highly scalable, and highly available service in Azure.
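A minimal publish/subscribe sketch using the open-source kafka-python client; the broker host names and topic are placeholders (on HDInsight, the brokers are reachable from inside the cluster's virtual network).

from kafka import KafkaProducer, KafkaConsumer

# Placeholder Kafka broker endpoints.
brokers = ["wn0-kafka:9092", "wn1-kafka:9092"]

# Publish messages to a named data stream (topic).
producer = KafkaProducer(bootstrap_servers=brokers)
producer.send("clickstream", b'{"user": "alice", "page": "/home"}')
producer.flush()

# Subscribe to the same topic; records are delivered in order within a partition.
consumer = KafkaConsumer("clickstream", bootstrap_servers=brokers,
                         auto_offset_reset="earliest")
for record in consumer:
    print(record.value)
    break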
Some common use cases for Kafka are:
Messaging. Because it supports the publish-subscribe message pattern, Kafka is often used as a message
broker.
Activity tracking. Because Kafka provides in-order logging of records, it can be used to track and re-create
activities, such as user actions on a web site.
Aggregation. Using stream processing, you can aggregate information from different streams to combine and
centralize the information into operational data.
Transformation. Using stream processing, you can combine and enrich data from multiple input topics into
one or more output topics.
Capability matrix
The following tables summarize the key differences in capabilities.
| Capability | Azure IoT Hub | Azure Event Hubs | Kafka on HDInsight |
| Cloud-to-device communications | Yes | No | No |
| Protocol support | MQTT, AMQP, HTTPS 1 | AMQP, HTTPS, Kafka Protocol | Kafka Protocol |
[1] You can also use Azure IoT protocol gateway as a custom gateway to enable protocol adaptation for IoT Hub.
For more information, see Comparison of Azure IoT Hub and Azure Event Hubs.
Choosing a search data store in Azure
12/18/2020 • 2 minutes to read • Edit Online
This article compares technology choices for search data stores in Azure. A search data store is used to create and
store specialized indexes for performing searches on free-form text. The text that is indexed may reside in a
separate data store, such as blob storage. An application submits a query to the search data store, and the result is
a list of matching documents. For more information about this scenario, see Processing free-form text for search.
Capability matrix
The following tables summarize the key differences in capabilities.
The General, Manageability, and Security capability tables compare Azure Cognitive Search, Elasticsearch, HDInsight with Solr, and SQL Database.
See also
Processing free-form text for search
Choosing a stream processing technology in Azure
12/18/2020 • 2 minutes to read • Edit Online
This article compares technology choices for real-time stream processing in Azure.
Real-time stream processing consumes messages from either queue or file-based storage, processes the messages,
and forwards the result to another message queue, file store, or database. Processing may include querying,
filtering, and aggregating messages. Stream processing engines must be able to consume an endless stream of
data and produce results with minimal latency. For more information, see Real time processing.
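As a rough sketch of this pattern using one of the options compared below (Spark Structured Streaming, for example on HDInsight or Azure Databricks), the following PySpark job consumes messages from Kafka, aggregates them, and forwards the results to a file sink; the broker, topic, and output paths are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-aggregation").getOrCreate()

# Consume an endless stream of messages from a Kafka topic (placeholder broker and topic).
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "wn0-kafka:9092")
          .option("subscribe", "clickstream")
          .load())

# Filter and aggregate: count events per one-minute window, keyed by the message key.
counts = (events
          .withWatermark("timestamp", "2 minutes")
          .groupBy(F.window("timestamp", "1 minute"), F.col("key"))
          .count())

# Forward the results to another store (placeholder output and checkpoint paths).
query = (counts.writeStream.outputMode("append")
         .format("parquet")
         .option("path", "abfss://results@mystorageaccount.dfs.core.windows.net/counts")
         .option("checkpointLocation", "abfss://results@mystorageaccount.dfs.core.windows.net/_chk")
         .start())
query.awaitTermination()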
Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
| Capability | Azure Stream Analytics | HDInsight with Spark Streaming | Apache Spark in Azure Databricks | HDInsight with Storm | Azure Functions | Azure App Service WebJobs |
| Programmability | Stream Analytics query language, JavaScript | C#/F#, Java, Python, Scala | C#/F#, Java, Python, R, Scala | C#, Java | C#, F#, Java, Node.js, Python | C#, Java, Node.js, PHP, Python |
| Pricing model | Streaming units | Per cluster hour | Databricks units | Per cluster hour | Per function execution and resource consumption | Per app service plan hour |
Integration capabilities
| Capability | Azure Stream Analytics | HDInsight with Spark Streaming | Apache Spark in Azure Databricks | HDInsight with Storm | Azure Functions | Azure App Service WebJobs |
| Inputs | Azure Event Hubs, Azure IoT Hub, Azure Blob storage | Event Hubs, IoT Hub, Kafka, HDFS, Storage Blobs, Azure Data Lake Store | Event Hubs, IoT Hub, Kafka, HDFS, Storage Blobs, Azure Data Lake Store | Event Hubs, IoT Hub, Storage Blobs, Azure Data Lake Store | Supported bindings | Service Bus, Storage Queues, Storage Blobs, Event Hubs, WebHooks, Cosmos DB, Files |
| Sinks | Azure Data Lake Store, Azure SQL Database, Storage Blobs, Event Hubs, Power BI, Table Storage, Service Bus Queues, Service Bus Topics, Cosmos DB, Azure Functions | HDFS, Kafka, Storage Blobs, Azure Data Lake Store, Cosmos DB | HDFS, Kafka, Storage Blobs, Azure Data Lake Store, Cosmos DB | Event Hubs, Service Bus, Kafka | Supported bindings | Service Bus, Storage Queues, Storage Blobs, Event Hubs, WebHooks, Cosmos DB, Files |
Processing capabilities
| Capability | Azure Stream Analytics | HDInsight with Spark Streaming | Apache Spark in Azure Databricks | HDInsight with Storm | Azure Functions | Azure App Service WebJobs |
| Input data formats | Avro, JSON or CSV, UTF-8 encoded | Any format using custom code | Any format using custom code | Any format using custom code | Any format using custom code | Any format using custom code |
See also:
Choosing a real-time message ingestion technology
Real time processing
Tenancy models for SaaS applications
11/2/2020 • 3 minutes to read • Edit Online
In a Software as a Service (SaaS) model, each of your customers is a tenant of your application. Each tenant pays
for access to the SaaS application by paying a subscription fee. This article describes the application tenancy
models available to SaaS application builders.
When designing a SaaS application, you must choose the application tenancy model that best fits the needs of your
customers and your business. In general, the application tenancy model doesn't impact the functionality of an
application. But it likely impacts other aspects of the overall solution including scale, tenant isolation, cost per
tenant and operation complexity.
Single tenant
In the single tenant model, a single dedicated instance of an application is deployed for each customer. For example,
with an N-tier architecture style application, all customers get a new dedicated instance of the web, middle, and data
tiers. These tiers are not shared between these customers.
Mixed tenant
In this model, one or more parts of an application are deployed as dedicated for each customer, and the rest is
shared between all customers. For example, with an N-tier architecture style application, the web and middle tiers are
shared between all customers. However, a dedicated data tier and database is provisioned for each customer.
Multi-tenant
In this model, a single instance of the application is deployed for all customers and shared amongst them. For
example, with an N-tier architecture style application, the web, middle, and data tiers are shared between all
customers.
A combination of these models can be provided for customers with different needs. For example, your basic tier of
service would run on a shared multi-tenant instance of your application. As a baseline, your customers can access
your app with lower performance or limited functionality for a lower cost. On top of this baseline, a dedicated service
tier could run on a single tenant model. For customers that need higher performance or additional functionality,
you can provide an isolated instance of your application for a higher cost.
Additional considerations
Once you know the customer's needs and business goals, start asking these questions:
Do I operate in a highly regulated industry that requires customer data to be isolated from other customers' data?
Am I looking to rapidly scale my application to many thousands of clients?
Am I concerned about how much it costs to run each tenant/customer instance?
The answers to these questions will help you refine your tenancy requirement.
Next steps
Read more about application tenancy patterns in Multi-tenant SaaS database tenancy patterns.
Monitoring Azure Databricks
12/18/2020 • 2 minutes to read • Edit Online
Azure Databricks is a fast, powerful Apache Spark–based analytics service that makes it easy to rapidly develop and
deploy big data analytics and artificial intelligence (AI) solutions. Many users take advantage of the simplicity of
notebooks in their Azure Databricks solutions. For users that require more robust computing options, Azure
Databricks supports the distributed execution of custom application code.
Monitoring is a critical part of any production-level solution, and Azure Databricks offers robust functionality for
monitoring custom application metrics, streaming query events, and application log messages. Azure Databricks
can send this monitoring data to different logging services.
The following articles show how to send monitoring data from Azure Databricks to Azure Monitor, the monitoring
data platform for Azure.
Send Azure Databricks application logs to Azure Monitor
Use dashboards to visualize Azure Databricks metrics
Troubleshoot performance bottlenecks
The code library that accompanies these articles extends the core monitoring functionality of Azure Databricks to
send Spark metrics, events, and logging information to Azure Monitor.
The audience for these articles and the accompanying code library are Apache Spark and Azure Databricks solution
developers. The code must be built into Java Archive (JAR) files and then deployed to an Azure Databricks cluster.
The code is a combination of Scala and Java, with a corresponding set of Maven project object model (POM) files to
build the output JAR files. Understanding of Java, Scala, and Maven is recommended as a prerequisite.
Next steps
Start by building the code library and deploying it to your Azure Databricks cluster.
Send Azure Databricks application logs to Azure Monitor
Send Azure Databricks application logs to Azure
Monitor
12/18/2020 • 2 minutes to read • Edit Online
This article shows how to send application logs and metrics from Azure Databricks to a Log Analytics workspace. It
uses the Azure Databricks Monitoring Library, which is available on GitHub.
Prerequisites
Configure your Azure Databricks cluster to use the monitoring library, as described in the GitHub readme.
NOTE
The monitoring library streams Apache Spark level events and Spark Structured Streaming metrics from your jobs to Azure
Monitor. You don't need to make any changes to your application code for these events and metrics.
The abbreviated sample below, adapted from the monitoring library's sample job (the namespace and counter names come from that sample), registers a custom counter and increments it from the driver:
import org.apache.spark.metrics.UserMetricsSystems
import org.apache.spark.sql.SparkSession

object StreamingQueryListenerSampleJob {
  private final val METRICS_NAMESPACE = "samplejob"
  private final val COUNTER_NAME = "rowcounter"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.getOrCreate()
    // Create a metrics system for the namespace and register the counter.
    val driverMetricsSystem = UserMetricsSystems
      .getMetricSystem(METRICS_NAMESPACE, builder => { builder.registerCounter(COUNTER_NAME) })
    driverMetricsSystem.counter(COUNTER_NAME).inc(5)
  }
}
The monitoring library includes a sample application that demonstrates how to use the UserMetricsSystem
class.
Send application logs using Log4j
To send your Azure Databricks application logs to Azure Log Analytics using the Log4j appender in the library,
follow these steps:
1. Build the spark-listeners-1.0-SNAPSHOT.jar and the spark-listeners-loganalytics-1.0-
SNAPSHOT.jar JAR files as described in the GitHub readme.
2. Create a log4j.properties configuration file for your application. Include the following configuration
properties. Substitute your application package name and log level where indicated:
log4j.appender.A1=com.microsoft.pnp.logging.loganalytics.LogAnalyticsAppender
log4j.appender.A1.layout=com.microsoft.pnp.logging.JSONLayout
log4j.appender.A1.layout.LocationInfo=false
log4j.additivity.<your application package name>=false
log4j.logger.<your application package name>=<log level>, A1
3. In your application code, import the configuration helper from the monitoring library:
import com.microsoft.pnp.logging.Log4jConfiguration
4. Configure Log4j using the log4j.properties file you created in step 2:
5. Add Apache Spark log messages at the appropriate level in your code as required. For example, use the
logDebug method to send a debug log message. For more information, see Logging in the Spark
documentation.
logTrace("Trace message")
logDebug("Debug message")
logInfo("Info message")
logWarning("Warning message")
logError("Error message")
IMPORTANT
After you verify the metrics appear, stop the sample application job.
Next steps
Deploy the performance monitoring dashboard that accompanies this code library to troubleshoot performance
issues in your production Azure Databricks workloads.
Use dashboards to visualize Azure Databricks metrics
Use dashboards to visualize Azure Databricks metrics
12/18/2020 • 7 minutes to read • Edit Online
This article shows how to set up a Grafana dashboard to monitor Azure Databricks jobs for performance issues.
Azure Databricks is a fast, powerful, and collaborative Apache Spark–based analytics service that makes it easy to
rapidly develop and deploy big data analytics and artificial intelligence (AI) solutions. Monitoring is a critical
component of operating Azure Databricks workloads in production. The first step is to gather metrics into a
workspace for analysis. In Azure, the best solution for managing log data is Azure Monitor. Azure Databricks does
not natively support sending log data to Azure monitor, but a library for this functionality is available in GitHub.
This library enables logging of Azure Databricks service metrics as well as Apache Spark Structured Streaming
query event metrics. Once you've successfully deployed this library to an Azure Databricks cluster, you can further
deploy a set of Grafana dashboards that you can deploy as part of your production environment.
Prerequisites
Configure your Azure Databricks cluster to use the monitoring library, as described in the GitHub readme.
This template creates the workspace and also creates a set of predefined queries that are used by the dashboard.
export DATA_SOURCE="https://raw.githubusercontent.com/mspnp/spark-
monitoring/master/perftools/deployment/grafana/AzureDataSource.sh"
az group deployment create \
--resource-group <resource-group-name> \
--template-file grafanaDeploy.json \
--parameters adminPass='<vm password>' dataSource=$DATA_SOURCE
Once the deployment is complete, the Bitnami image of Grafana is installed on the virtual machine.
2. Note the values for appId, password, and tenant in the output of the az ad sp create-for-rbac command used to create the service principal:
{
"appId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
"displayName": "azure-cli-2019-03-27-00-33-39",
"name": "http://<service principal name>",
"password": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
"tenant": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
}
3. Log into Grafana as described earlier. Select Configuration (the gear icon) and then Data Sources.
4. In the Data Sources tab, click Add data source.
5. Select Azure Monitor as the data source type.
6. In the Settings section, enter a name for the data source in the Name textbox.
7. In the Azure Monitor API Details section, enter the following information:
Subscription Id: Your Azure subscription ID.
Tenant Id: The tenant ID from earlier.
Client Id: The value of "appId" from earlier.
Client Secret: The value of "password" from earlier.
8. In the Azure Log Analytics API Details section, check the Same Details as Azure Monitor API
checkbox.
9. Click Save & Test. If the Log Analytics data source is correctly configured, a success message is displayed.
To generate the dashboards, run the dashboard generation script included with the monitoring library:
sh DashGen.sh
Next steps
Troubleshoot performance bottlenecks
Troubleshoot performance bottlenecks in Azure
Databricks
12/18/2020 • 6 minutes to read • Edit Online
This article describes how to use monitoring dashboards to find performance bottlenecks in Spark jobs on Azure
Databricks.
Azure Databricks is an Apache Spark–based analytics service that makes it easy to rapidly develop and deploy big
data analytics. Monitoring and troubleshooting performance issues is critical when operating production Azure
Databricks workloads. To identify common performance issues, it's helpful to use monitoring visualizations based
on telemetry data.
Prerequisites
To set up the Grafana dashboards shown in this article:
Configure your Databricks cluster to send telemetry to a Log Analytics workspace, using the Azure
Databricks Monitoring Library. For details, see the GitHub readme.
Deploy Grafana in a virtual machine. See Use dashboards to visualize Azure Databricks metrics.
The Grafana dashboard that is deployed includes a set of time-series visualizations. Each graph is a time-series plot
of metrics related to an Apache Spark job, the stages of the job, and tasks that make up each stage.
The cluster throughput graph shows the number of jobs, stages, and tasks completed per minute. This helps you to
understand the workload in terms of the relative number of stages and tasks per job. Here you can see that the
number of jobs per minute ranges between 2 and 6, while the number of stages is about 12 – 24 per minute.
Sum of task execution latency
This visualization shows the sum of task execution latency per host running on a cluster. Use this graph to detect
tasks that run slowly due to the host slowing down on a cluster, or a misallocation of tasks per executor. In the
following graph, most of the hosts have a sum of about 30 seconds. However, two of the hosts have sums that
hover around 10 minutes. Either the hosts are running slow or the number of tasks per executor is misallocated.
The number of tasks per executor shows that two executors are assigned a disproportionate number of tasks,
causing a bottleneck.
Task metrics per stage
The task metrics visualization gives the cost breakdown for a task execution. You can use it to see the relative time
spent on tasks such as serialization and deserialization. This data might show opportunities to optimize — for
example, by using broadcast variables to avoid shipping data. The task metrics also show the shuffle data size for a
task, and the shuffle read and write times. If these values are high, it means that a lot of data is moving across the
network.
Another task metric is the scheduler delay, which measures how long it takes to schedule a task. Ideally, this value
should be low compared to the executor compute time, which is the time spent actually executing the task.
The following graph shows a scheduler delay time (3.7 s) that exceeds the executor compute time (1.1 s). That
means more time is spent waiting for tasks to be scheduled than doing the actual work.
In this case, the problem was caused by having too many partitions, which caused a lot of overhead. Reducing the
number of partitions lowered the scheduler delay time. The next graph shows that most of the time is spent
executing the task.
Streaming throughput and latency
Streaming throughput is directly related to structured streaming. There are two important metrics associated with
streaming throughput: Input rows per second and processed rows per second. If input rows per second outpaces
processed rows per second, it means the stream processing system is falling behind. Also, if the input data comes
from Event Hubs or Kafka, then input rows per second should keep up with the data ingestion rate at the front end.
Two jobs can have similar cluster throughput but very different streaming metrics. The following screenshot shows
two different workloads. They are similar in terms of cluster throughput (jobs, stages, and tasks per minute). But
the second run processes 12,000 rows/sec versus 4,000 rows/sec.
Streaming throughput is often a better business metric than cluster throughput, because it measures the number
of data records that are processed.
Shuffle metrics are metrics related to data shuffling across the executors.
Shuffle I/O
Shuffle memory
File system usage
Disk usage
There are several options for transferring data to and from Azure, depending on your needs.
Physical transfer
Using physical hardware to transfer data to Azure is a good option when:
Your network is slow or unreliable.
Getting additional network bandwidth is cost-prohibitive.
Security or organizational policies do not allow outbound connections when dealing with sensitive data.
If your primary concern is how long it will take to transfer your data, you may want to run a test to verify whether
network transfer is actually slower than physical transport.
There are two main options for physically transporting data to Azure:
Azure Import/Export. The Azure Import/Export service lets you securely transfer large amounts of data to
Azure Blob Storage or Azure Files by shipping internal SATA HDDs or SSDs to an Azure datacenter. You can
also use this service to transfer data from Azure Storage to hard disk drives and have these shipped to you
for loading on-premises.
Azure Data Box. Azure Data Box is a Microsoft-provided appliance that works much like the Azure
Import/Export service. Microsoft ships you a proprietary, secure, and tamper-resistant transfer appliance and
handles the end-to-end logistics, which you can track through the portal. One benefit of the Azure Data Box
service is ease of use. You don't need to purchase several hard drives, prepare them, and transfer files to
each one. Azure Data Box is supported by a number of industry-leading Azure partners to make it easier to
seamlessly use offline transport to the cloud from their products.
Graphical interface
Consider the following options if you are only transferring a few files or data objects and don't need to automate
the process.
Azure Storage Explorer. Azure Storage Explorer is a cross-platform tool that lets you manage the contents
of your Azure storage accounts. It allows you to upload, download, and manage blobs, files, queues, tables,
and Azure Cosmos DB entities. Use it with Blob storage to manage blobs and folders, as well as upload and
download blobs between your local file system and Blob storage, or between storage accounts.
Azure portal. Both Blob storage and Data Lake Store provide a web-based interface for exploring files and
uploading new files one at a time. This is a good option if you do not want to install any tools or issue
commands to quickly explore your files, or to simply upload a handful of new ones.
Data pipeline
Azure Data Factory. Azure Data Factory is a managed service best suited for regularly transferring files between
a number of Azure services, on-premises, or a combination of the two. Using Azure Data Factory, you can create
and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores. It can process
and transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake
Analytics, and Azure Machine Learning. Create data-driven workflows for orchestrating and automating data
movement and data transformation.
| Capability | Azure Import/Export | Azure Data Box |
| Form factor | Internal SATA HDDs or SSDs | Secure, tamper-proof, single hardware appliance |
| Capability | DistCp | Sqoop | Hadoop CLI |
Other:
[1] AdlCopy is optimized for transferring big data when used with a Data Lake Analytics account.
[2] PolyBase performance can be increased by pushing computation to Hadoop and using PolyBase scale-out
groups to enable parallel data transfer between SQL Server instances and Hadoop nodes.
Graphical interface and Azure Data Factory
| Capability | Azure Storage Explorer | Azure portal * | Azure Data Factory |
* Azure portal in this case means using the web-based exploration tools for Blob storage and Data Lake Store.
Extending on-premises data solutions to the cloud
12/18/2020 • 7 minutes to read • Edit Online
When organizations move workloads and data to the cloud, their on-premises datacenters often continue to play
an important role. The term hybrid cloud refers to a combination of public cloud and on-premises datacenters, to
create an integrated IT environment that spans both. Some organizations use hybrid cloud as a path to migrate
their entire datacenter to the cloud over time. Other organizations use cloud services to extend their existing on-
premises infrastructure.
This article describes some considerations and best practices for managing data in a hybrid cloud solution.
Challenges
Creating a consistent environment in terms of security, management, and development, and avoiding
duplication of work.
Creating a reliable, low latency and secure data connection between your on-premises and cloud
environments.
Replicating your data and modifying applications and tools to use the correct data stores within each
environment.
Securing and encrypting data that is hosted in the cloud but accessed from on-premises, or vice versa.
Azure Stack
For a complete hybrid cloud solution, consider using Microsoft Azure Stack. Azure Stack is a hybrid cloud platform
that lets you provide Azure services from your datacenter. This helps maintain consistency between on-premises
and Azure, by using identical tools and requiring no code changes.
The following are some use cases for Azure and Azure Stack:
Edge and disconnected solutions. Address latency and connectivity requirements by processing data
locally in Azure Stack and then aggregating in Azure for further analytics, with common application logic
across both.
Cloud applications that meet varied regulations. Develop and deploy applications in Azure, with the
flexibility to deploy the same applications on-premises on Azure Stack to meet regulatory or policy
requirements.
Cloud application model on-premises. Use Azure to update and extend existing applications or build
new ones. Use consistent DevOps processes across Azure in the cloud and Azure Stack on-premises.
Hybrid networking
This article focused on hybrid data solutions, but another consideration is how to extend your on-premises
network to Azure. For more information about this aspect of hybrid solutions, see:
Choose a solution for connecting an on-premises network to Azure
Hybrid network reference architectures
Securing data solutions
12/18/2020 • 5 minutes to read • Edit Online
For many, making data accessible in the cloud, particularly when transitioning from working exclusively in on-
premises data stores, can cause some concern around increased accessibility to that data and new ways in which to
secure it.
Challenges
Centralizing the monitoring and analysis of security events stored in numerous logs.
Implementing encryption and authorization management across your applications and services.
Ensuring that centralized identity management works across all of your solution components, whether on-
premises or in the cloud.
Data Protection
The first step to protecting information is identifying what to protect. Develop clear, simple, and well-communicated
guidelines to identify, protect, and monitor the most important data assets anywhere they reside. Establish the
strongest protection for assets that have a disproportionate impact on the organization's mission or profitability.
These are known as high value assets, or HVAs. Perform stringent analysis of HVA lifecycle and security
dependencies, and establish appropriate security controls and conditions. Similarly, identify and classify sensitive
assets, and define the technologies and processes to automatically apply security controls.
Once the data you need to protect has been identified, consider how you will protect the data at rest and data in
transit.
Data at rest : Data that exists statically on physical media, whether magnetic or optical disk, on premises or in
the cloud.
Data in transit : Data while it is being transferred between components, locations or programs, such as over
the network, across a service bus (from on-premises to cloud and vice-versa), or during an input/output
process.
To learn more about protecting your data at rest or in transit, see Azure Data Security and Encryption Best Practices.
Access Control
Central to protecting your data in the cloud is a combination of identity management and access control. Given the
variety and type of cloud services, as well as the rising popularity of hybrid cloud, there are several key practices
you should follow when it comes to identity and access control:
Centralize your identity management.
Enable Single Sign-On (SSO).
Deploy password management.
Enforce multi-factor authentication for users.
Use role based access control (RBAC).
Configure Conditional Access Policies, which enhance the classic concept of user identity with
additional properties related to user location, device type, patch level, and so on.
Control locations where resources are created using resource manager.
Actively monitor for suspicious activities.
For more information, see Azure Identity Management and access control security best practices.
Auditing
Beyond the identity and access monitoring previously mentioned, the services and applications that you use in the
cloud should be generating security-related events that you can monitor. The primary challenge to monitoring
these events is handling the quantities of logs, in order to avoid potential problems or troubleshoot past ones.
Cloud-based applications tend to contain many moving parts, most of which generate some level of logging and
telemetry. Use centralized monitoring and analysis to help you manage and make sense of the large amount of
information.
For more information, see Azure Logging and Auditing.
This article describes performance considerations for running Apache Cassandra on Azure virtual machines.
These recommendations are based on the results of performance tests, which you can find on GitHub. You should
use these recommendations as a baseline and then test against your own workload.
Accelerated Networking
Cassandra nodes make heavy use of the network to send and receive data from the client VM and to communicate
between nodes for replication. For optimal performance, Cassandra VMs benefit from a high-throughput, low-
latency network.
We recommend enabling Accelerated Networking on the NIC of the Cassandra node and on VMs running client
applications accessing Cassandra.
Accelerated networking requires a modern Linux distribution with the latest drivers, such as CentOS 7.5+ or
Ubuntu 16.x/18.x. For more information, see Create a Linux virtual machine with Accelerated Networking.
Linux read-ahead
In most Linux distributions in the Azure Marketplace, the default block device read-ahead setting is 4096 KB.
Cassandra's read IOs are usually random and relatively small. Therefore, having a large read-ahead wastes
throughput by reading parts of files that aren't needed.
To minimize unnecessary lookahead, set the Linux block device read-ahead setting to 8 KB. (See Recommended
production settings in the DataStax documentation.)
Configure 8 KB read-ahead for all block devices in the stripe set and on the array device itself (for example,
/dev/md0 ).
For more information, see Comparing impact of disk read-ahead settings (GitHub).
Multi-datacenter replication
Cassandra natively supports the concept of multiple data centers, making it easy to configure one Cassandra ring
across multiple Azure regions or across availability zones within one region.
For a multiregion deployment, use Azure Global VNet-peering to connect the virtual networks in the different
regions. When VMs are deployed in the same region but in separate availability zones, the VMs can be in the same
virtual network.
It's important to measure the baseline roundtrip latency between regions. Network latency between regions can be
10-100 times higher than latency within a region. When using LOCAL_QUORUM write consistency, expect a lag before
data appears in the second region; when using EACH_QUORUM, expect significantly decreased write performance.
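To illustrate, the following sketch uses the DataStax Python driver to pin the consistency level on a statement; the contact points, keyspace, table, and column names are placeholders.

from datetime import datetime

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Placeholder contact points in the local data center (region).
cluster = Cluster(["10.0.0.4", "10.0.0.5"])
session = cluster.connect("telemetry")

# LOCAL_QUORUM: the write is acknowledged once a quorum of replicas in the local region has it;
# replication to the second region happens asynchronously, so expect a lag there.
insert = SimpleStatement(
    "INSERT INTO readings (sensor_id, ts, value) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)
session.execute(insert, ("sensor-1", datetime.utcnow(), 21.5))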
When running Apache Cassandra at scale, and specifically in a multi-DC environment, node repair becomes
challenging. Tools such as Reaper can help to coordinate repairs at scale (for example, across all the nodes in a data
center, one data center at a time, to limit the load on the whole cluster). However, node repair for large clusters is
not yet a fully solved problem and applies in all environments, whether on-premises or in the cloud.
When nodes are added to a secondary region, performance will not scale linearly, because some bandwidth and
CPU/disk resources are spent on receiving and sending replication traffic across regions.
For more information, see Measuring impact of multi-dc cross-region replication (GitHub).
Hinted-handoff configuration
In a multiregion Cassandra ring, write workloads with consistency level of LOCAL_QUORUM may lose data in the
secondary region. By default, Cassandra hinted handoff is throttled to a relatively low maximum throughput and
three-hour hint lifetime. For workloads with heavy writes, we recommend increasing the hinted handoff throttle
and hint window time to ensure hints are not dropped before they are replicated.
For more information, see Observations on hinted handoff in cross-region replication (GitHub).
Next steps
For more information about these performance results, see Cassandra on Azure VMs Performance Experiments.
For information on general Cassandra settings, not specific to Azure, see:
DataStax Recommended Production Settings
Apache Cassandra Hardware Choices
Apache Cassandra Configuration File
The following reference architecture deploys Cassandra as part of an n-tier configuration:
Linux N-tier application in Azure with Apache Cassandra
Building microservices on Azure
12/18/2020 • 5 minutes to read • Edit Online
Microservices are a popular architectural style for building applications that are resilient, highly scalable,
independently deployable, and able to evolve quickly. But a successful microservices architecture requires a
different approach to designing and building applications.
A microservices architecture consists of a collection of small, autonomous services. Each service is self-contained
and should implement a single business capability.
Benefits
Agility. Because microservices are deployed independently, it's easier to manage bug fixes and feature
releases. You can update a service without redeploying the entire application, and roll back an update if
something goes wrong. In many traditional applications, if a bug is found in one part of the application, it
can block the entire release process. New features may be held up waiting for a bug fix to be integrated,
tested, and published.
Small, focused teams. A microservice should be small enough that a single feature team can build, test,
and deploy it. Small team sizes promote greater agility. Large teams tend to be less productive, because
communication is slower, management overhead goes up, and agility diminishes.
Small code base. In a monolithic application, there is a tendency over time for code dependencies to
become tangled. Adding a new feature requires touching code in a lot of places. By not sharing code or data
stores, a microservices architecture minimizes dependencies, and that makes it easier to add new features.
Mix of technologies. Teams can pick the technology that best fits their service, using a mix of technology
stacks as appropriate.
Fault isolation. If an individual microservice becomes unavailable, it won't disrupt the entire application, as
long as any upstream microservices are designed to handle faults correctly (for example, by implementing
circuit breaking).
Scalability. Services can be scaled independently, letting you scale out subsystems that require more
resources, without scaling out the entire application. Using an orchestrator such as Kubernetes or Service
Fabric, you can pack a higher density of services onto a single host, which allows for more efficient
utilization of resources.
Data isolation. It is much easier to perform schema updates, because only a single microservice is affected.
In a monolithic application, schema updates can become very challenging, because different parts of the
application may all touch the same data, making any alterations to the schema risky.
Challenges
The benefits of microservices don't come for free. Here are some of the challenges to consider before embarking
on a microservices architecture.
Complexity. A microservices application has more moving parts than the equivalent monolithic
application. Each service is simpler, but the entire system as a whole is more complex.
Development and testing. Writing a small service that relies on other dependent services requires a
different approach than writing a traditional monolithic or layered application. Existing tools are not
always designed to work with service dependencies. Refactoring across service boundaries can be difficult. It
is also challenging to test service dependencies, especially when the application is evolving quickly.
Lack of governance. The decentralized approach to building microservices has advantages, but it can also
lead to problems. You may end up with so many different languages and frameworks that the application
becomes hard to maintain. It may be useful to put some project-wide standards in place, without overly
restricting teams' flexibility. This especially applies to cross-cutting functionality such as logging.
Network congestion and latency. The use of many small, granular services can result in more
interservice communication. Also, if the chain of service dependencies gets too long (service A calls B, which
calls C...), the additional latency can become a problem. You will need to design APIs carefully. Avoid overly
chatty APIs, think about serialization formats, and look for places to use asynchronous communication
patterns.
Data integrity. Each microservice is responsible for its own data persistence. As a result, data
consistency can be a challenge. Embrace eventual consistency where possible.
Management. Being successful with microservices requires a mature DevOps culture. Correlated logging
across services can be challenging. Typically, logging must correlate multiple service calls for a single user
operation.
Versioning. Updates to a service must not break services that depend on it. Multiple services could be
updated at any given time, so without careful design, you might have problems with backward or forward
compatibility.
Skillset. Microservices are highly distributed systems. Carefully evaluate whether the team has the skills
and experience to be successful.
One of the biggest challenges of microservices is to define the boundaries of individual services. The general rule
is that a service should do "one thing" — but putting that rule into practice requires careful thought. There is no
mechanical process that will produce the "right" design. You have to think deeply about your business domain,
requirements, and goals. Otherwise, you can end up with a haphazard design that exhibits some undesirable
characteristics, such as hidden dependencies between services, tight coupling, or poorly designed interfaces. This
article shows a domain-driven approach to designing microservices.
This article uses a drone delivery service as a running example. You can read more about the scenario and the
corresponding reference implementation here.
Introduction
Microservices should be designed around business capabilities, not horizontal layers such as data access or
messaging. In addition, they should have loose coupling and high functional cohesion. Microservices are loosely
coupled if you can change one service without requiring other services to be updated at the same time. A
microservice is cohesive if it has a single, well-defined purpose, such as managing user accounts or tracking
delivery history. A service should encapsulate domain knowledge and abstract that knowledge from clients. For
example, a client should be able to schedule a drone without knowing the details of the scheduling algorithm or
how the drone fleet is managed.
Domain-driven design (DDD) provides a framework that can get you most of the way to a set of well-designed
microservices. DDD has two distinct phases, strategic and tactical. In strategic DDD, you are defining the large-scale
structure of the system. Strategic DDD helps to ensure that your architecture remains focused on business
capabilities. Tactical DDD provides a set of design patterns that you can use to create the domain model. These
patterns include entities, aggregates, and domain services. These tactical patterns will help you to design
microservices that are both loosely coupled and cohesive.
In this article and the next, we'll walk through the following steps, applying them to the Drone Delivery application:
1. Start by analyzing the business domain to understand the application's functional requirements. The output
of this step is an informal description of the domain, which can be refined into a more formal set of domain
models.
2. Next, define the bounded contexts of the domain. Each bounded context contains a domain model that
represents a particular subdomain of the larger application.
3. Within a bounded context, apply tactical DDD patterns to define entities, aggregates, and domain services.
4. Use the results from the previous step to identify the microservices in your application.
In this article, we cover the first three steps, which are primarily concerned with DDD. In the next article, we'll
identify the microservices. However, it's important to remember that DDD is an iterative, ongoing process. Service
boundaries aren't set in stone. As an application evolves, you may decide to break apart a service into several
smaller services.
NOTE
This article doesn't show a complete and comprehensive domain analysis. We deliberately kept the example brief, to illustrate
the main points. For more background on DDD, we recommend Eric Evans' Domain-Driven Design, the book that first
introduced the term. Another good reference is Implementing Domain-Driven Design by Vaughn Vernon.
Shipping is placed in the center of the diagram, because it's core to the business. Everything else in the
diagram exists to enable this functionality.
Drone management is also core to the business. Functionality that is closely related to drone management
includes drone repair and using predictive analysis to predict when drones need servicing and
maintenance.
ETA analysis provides time estimates for pickup and delivery.
Third-party transportation will enable the application to schedule alternative transportation methods if a
package cannot be shipped entirely by drone.
Drone sharing is a possible extension of the core business. The company may have excess drone capacity
during certain hours, and could rent out drones that would otherwise be idle. This feature will not be in the
initial release.
Video surveillance is another area that the company might expand into later.
User accounts, Invoicing, and Call center are subdomains that support the core business.
Notice that at this point in the process, we haven't made any decisions about implementation or technologies.
Some of the subsystems may involve external software systems or third-party services. Even so, the application
needs to interact with these systems and services, so it's important to include them in the domain model.
NOTE
When an application depends on an external system, there is a risk that the external system's data schema or API will leak
into your application, ultimately compromising the architectural design. This is particularly true with legacy systems that may
not follow modern best practices, and may use convoluted data schemas or obsolete APIs. In that case, it's important to
have a well-defined boundary between these external systems and the application. Consider using the Strangler Fig pattern
or the Anti-Corruption Layer pattern for this purpose.
Bounded contexts are not necessarily isolated from one another. In this diagram, the solid lines connecting the
bounded contexts represent places where two bounded contexts interact. For example, Shipping depends on User
Accounts to get information about customers, and on Drone Management to schedule drones from the fleet.
In the book Domain Driven Design, Eric Evans describes several patterns for maintaining the integrity of a domain
model when it interacts with another bounded context. One of the main principles of microservices is that services
communicate through well-defined APIs. This approach corresponds to two patterns that Evans calls Open Host
Service and Published Language. The idea of Open Host Service is that a subsystem defines a formal protocol (API)
for other subsystems to communicate with it. Published Language extends this idea by publishing the API in a
form that other teams can use to write clients. In the article Designing APIs for microservices, we discuss using
OpenAPI Specification (formerly known as Swagger) to define language-agnostic interface descriptions for REST
APIs, expressed in JSON or YAML format.
For the rest of this journey, we will focus on the Shipping bounded context.
Next steps
After completing a domain analysis, the next step is to apply tactical DDD, to define your domain models with
more precision.
Tactical DDD
Using tactical DDD to design microservices
12/18/2020 • 5 minutes to read • Edit Online
During the strategic phase of DDD, you are mapping out the business domain and defining bounded contexts for
your domain models. Tactical DDD is when you define your domain models with more precision. The tactical
patterns are applied within a single bounded context. In a microservices architecture, we are particularly interested
in the entity and aggregate patterns. Applying these patterns will help us to identify natural boundaries for the
services in our application (see the next article in this series). As a general principle, a microservice should be no
smaller than an aggregate, and no larger than a bounded context. First, we'll review the tactical patterns. Then we'll
apply them to the Shipping bounded context in the Drone Delivery application.
Entities. An entity is an object with a unique identity that persists over time. For example, in a banking application,
customers and accounts would be entities.
An entity has a unique identifier in the system, which can be used to look up or retrieve the entity. That doesn't
mean the identifier is always exposed directly to users. It could be a GUID or a primary key in a database.
An identity may span multiple bounded contexts, and may endure beyond the lifetime of the application. For
example, bank account numbers or government-issued IDs are not tied to the lifetime of a particular
application.
The attributes of an entity may change over time. For example, a person's name or address might change, but
they are still the same person.
An entity can hold references to other entities.
Value objects. A value object has no identity. It is defined only by the values of its attributes. Value objects are
also immutable. To update a value object, you always create a new instance to replace the old one. Value objects
can have methods that encapsulate domain logic, but those methods should have no side-effects on the object's
state. Typical examples of value objects include colors, dates and times, and currency values.
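To make the distinction concrete, the following C# sketch contrasts an entity with a value object; the type names are illustrative and are not taken from the reference implementation.

using System;

// Entity: identified by Id, which persists even as other attributes change.
public class Account
{
    public Guid Id { get; }
    public string OwnerName { get; set; }   // attributes can change over time

    public Account(Guid id, string ownerName)
    {
        Id = id;
        OwnerName = ownerName;
    }
}

// Value object: no identity; immutable and compared by its attribute values.
public record Money(decimal Amount, string Currency);

Two Money values with the same amount and currency are interchangeable, whereas two Account instances with the same owner name are still distinct entities.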
Aggregates. An aggregate defines a consistency boundary around one or more entities. Exactly one entity in an
aggregate is the root. Lookup is done using the root entity's identifier. Any other entities in the aggregate are
children of the root, and are referenced by following pointers from the root.
The purpose of an aggregate is to model transactional invariants. Things in the real world have complex webs of
relationships. Customers create orders, orders contain products, products have suppliers, and so on. If the
application modifies several related objects, how does it guarantee consistency? How do we keep track of
invariants and enforce them?
Traditional applications have often used database transactions to enforce consistency. In a distributed application,
however, that's often not feasible. A single business transaction may span multiple data stores, or may be long
running, or may involve third-party services. Ultimately it's up to the application, not the data layer, to enforce the
invariants required for the domain. That's what aggregates are meant to model.
NOTE
An aggregate might consist of a single entity, without child entities. What makes it an aggregate is the transactional
boundary.
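For example, a Delivery aggregate for the Drone Delivery domain might be sketched in C# as follows; the names and the invariant are illustrative only.

using System;
using System.Collections.Generic;

// Aggregate root: external code looks up the aggregate by the root's identifier and goes through the root.
public class Delivery
{
    private readonly List<PackageItem> packages = new List<PackageItem>();

    public string Id { get; }
    public IReadOnlyList<PackageItem> Packages => packages;

    public Delivery(string id) => Id = id;

    // The root enforces the aggregate's transactional invariants (a hypothetical payload limit here).
    public void AddPackage(PackageItem package)
    {
        if (packages.Count >= 5)
            throw new InvalidOperationException("A delivery cannot carry more than five packages.");
        packages.Add(package);
    }
}

// Child entity, reachable only by navigating from the root.
public class PackageItem
{
    public string Id { get; set; }
    public double WeightKg { get; set; }
}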
Domain and application services. In DDD terminology, a service is an object that implements some logic
without holding any state. Evans distinguishes between domain services, which encapsulate domain logic, and
application services, which provide technical functionality, such as user authentication or sending an SMS
message. Domain services are often used to model behavior that spans multiple entities.
NOTE
The term service is overloaded in software development. The definition here is not directly related to microservices.
Domain events. Domain events can be used to notify other parts of the system when something happens. As the
name suggests, domain events should mean something within the domain. For example, "a record was inserted
into a table" is not a domain event. "A delivery was cancelled" is a domain event. Domain events are especially
relevant in a microservices architecture. Because microservices are distributed and don't share data stores, domain
events provide a way for microservices to coordinate with each other. The article Interservice communication
discusses asynchronous messaging in more detail.
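As a minimal illustration, a domain event can be modeled as a small immutable message; the C# type below is a sketch with illustrative names.

using System;

// "A delivery was cancelled" expressed as a domain event that other services can react to.
public record DeliveryCancelled(string DeliveryId, string Reason, DateTimeOffset OccurredUtc);

The event carries just enough data for subscribers, such as a notification or invoicing service, to act on the cancellation.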
There are a few other DDD patterns not listed here, including factories, repositories, and modules. These can be
useful patterns when you are implementing a microservice, but they are less relevant when designing the
boundaries between microservices.
Next steps
The next step is to define the boundaries for each microservice.
Identify microservice boundaries
Identifying microservice boundaries
12/18/2020 • 5 minutes to read • Edit Online
What is the right size for a microservice? You often hear something to the effect of, "not too big and not too small"
— and while that's certainly correct, it's not very helpful in practice. But if you start from a carefully designed
domain model, it's much easier to reason about microservices.
This article uses a drone delivery service as a running example. You can read more about the scenario and the
corresponding reference implementation here.
Next steps
At this point, you should have a clear understanding of the purpose and functionality of each microservice in your
design. Now you can architect the system.
Design a microservices architecture
Designing a microservices architecture
12/18/2020 • 2 minutes to read • Edit Online
Microservices have become a popular architectural style for building cloud applications that are resilient, highly
scalable, independently deployable, and able to evolve quickly. To be more than just a buzzword, however,
microservices require a different approach to designing and building applications.
In this set of articles, we explore how to build a microservices architecture on Azure. Topics include:
Compute options for microservices
Interservice communication
API design
API gateways
Data considerations
Design patterns
Prerequisites
Before reading these articles, you might start with the following:
Introduction to microservices architectures. Understand the benefits and challenges of microservices, and
when to use this style of architecture.
Using domain analysis to model microservices. Learn a domain-driven approach to modeling microservices.
Reference implementation
To illustrate best practices for a microservices architecture, we created a reference implementation that we call the
Drone Delivery application. This implementation runs on Kubernetes using Azure Kubernetes Service (AKS). You
can find the reference implementation on GitHub.
Scenario
Fabrikam, Inc. is starting a drone delivery service. The company manages a fleet of drone aircraft. Businesses
register with the service, and users can request a drone to pick up goods for delivery. When a customer schedules
a pickup, a backend system assigns a drone and notifies the user with an estimated delivery time. While the
delivery is in progress, the customer can track the location of the drone, with a continuously updated ETA.
This scenario involves a fairly complicated domain. Some of the business concerns include scheduling drones,
tracking packages, managing user accounts, and storing and analyzing historical data. Moreover, Fabrikam wants
to get to market quickly and then iterate quickly, adding new functionality and capabilities. The application needs
to operate at cloud scale, with a high service level objective (SLO). Fabrikam also expects that different parts of the
system will have very different requirements for data storage and querying. All of these considerations lead
Fabrikam to choose a microservices architecture for the Drone Delivery application.
NOTE
For help in choosing between a microservices architecture and other architectural styles, see the Azure Application
Architecture Guide.
Our reference implementation uses Kubernetes with Azure Kubernetes Service (AKS). However, many of the high-
level architectural decisions and challenges will apply to any container orchestrator, including Azure Service
Fabric.
Next steps
Choose a compute option
Choosing an Azure compute option for microservices
12/18/2020 • 4 minutes to read • Edit Online
The term compute refers to the hosting model for the computing resources that your application runs on. For a
microservices architecture, two approaches are especially popular:
A service orchestrator that manages services running on dedicated nodes (VMs).
A serverless architecture using functions as a service (FaaS).
While these aren't the only options, they are both proven approaches to building microservices. An application
might include both approaches.
Service orchestrators
An orchestrator handles tasks related to deploying and managing a set of services. These tasks include placing
services on nodes, monitoring the health of services, restarting unhealthy services, load balancing network traffic
across service instances, service discovery, scaling the number of instances of a service, and applying
configuration updates. Popular orchestrators include Kubernetes, Service Fabric, DC/OS, and Docker Swarm.
On the Azure platform, consider the following options:
Azure Kubernetes Service (AKS) is a managed Kubernetes service. AKS provisions Kubernetes and exposes
the Kubernetes API endpoints, but hosts and manages the Kubernetes control plane, performing automated
upgrades, automated patching, autoscaling, and other management tasks. You can think of AKS as being
"Kubernetes APIs as a service."
Service Fabric is a distributed systems platform for packaging, deploying, and managing microservices.
Microservices can be deployed to Service Fabric as containers, as binary executables, or as Reliable Services.
Using the Reliable Services programming model, services can directly use Service Fabric programming APIs
to query the system, report health, receive notifications about configuration and code changes, and discover
other services. A key differentiation with Service Fabric is its strong focus on building stateful services using
Reliable Collections.
Other options such as Docker Enterprise Edition and Mesosphere DC/OS can run in an IaaS environment on
Azure. You can find deployment templates on Azure Marketplace.
Containers
Sometimes people talk about containers and microservices as if they were the same thing. While that's not true —
you don't need containers to build microservices — containers do have some benefits that are particularly relevant
to microservices, such as:
Portability. A container image is a standalone package that runs without needing to install libraries or
other dependencies. That makes them easy to deploy. Containers can be started and stopped quickly, so you
can spin up new instances to handle more load or to recover from node failures.
Density. Containers are lightweight compared with running a virtual machine, because they share OS
resources. That makes it possible to pack multiple containers onto a single node, which is especially useful
when the application consists of many small services.
Resource isolation. You can limit the amount of memory and CPU that is available to a container, which
can help to ensure that a runaway process doesn't exhaust the host resources. See the Bulkhead pattern for
more information.
Serverless (Functions as a Service)
With a serverless architecture, you don't manage the VMs or the virtual network infrastructure. Instead, you deploy
code and the hosting service handles putting that code onto a VM and executing it. This approach tends to favor
small granular functions that are coordinated using event-based triggers. For example, a message being placed
onto a queue might trigger a function that reads from the queue and processes the message.
Azure Functions is a serverless compute service that supports various function triggers, including HTTP requests,
Service Bus queues, and Event Hubs events. For a complete list, see Azure Functions triggers and bindings
concepts. Also consider Azure Event Grid, which is a managed event routing service in Azure.
Orchestrator or serverless?
Here are some factors to consider when choosing between an orchestrator approach and a serverless approach.
Manageability. A serverless application is easy to manage, because the platform manages all of the compute
resources for you. While an orchestrator abstracts some aspects of managing and configuring a cluster, it does not
completely hide the underlying VMs. With an orchestrator, you will need to think about issues such as load
balancing, CPU and memory usage, and networking.
Flexibility and control. An orchestrator gives you a great deal of control over configuring and managing your
services and the cluster. The tradeoff is additional complexity. With a serverless architecture, you give up some
degree of control because these details are abstracted.
Portability. All of the orchestrators listed here (Kubernetes, DC/OS, Docker Swarm, and Service Fabric) can run
on-premises or in multiple public clouds.
Application integration. It can be challenging to build a complex application using a serverless architecture, due
to the need to coordinate, deploy, and manage many small independent functions. One option in Azure is to use
Azure Logic Apps to coordinate a set of Azure Functions. For an example of this approach, see Create a function
that integrates with Azure Logic Apps.
Cost. With an orchestrator, you pay for the VMs that are running in the cluster. With a serverless application, you
pay only for the actual compute resources consumed. In both cases, you need to factor in the cost of any additional
services, such as storage, databases, and messaging services.
Scalability. Azure Functions scales automatically to meet demand, based on the number of incoming events. With
an orchestrator, you can scale out by increasing the number of service instances running in the cluster. You can also
scale by adding additional VMs to the cluster.
Our reference implementation primarily uses Kubernetes, but we did use Azure Functions for one service, namely
the Delivery History service. Azure Functions was a good fit for this particular service, because it's an event-
driven workload. By using an Event Hubs trigger to invoke the function, the service needed a minimal amount of
code. Also, the Delivery History service is not part of the main workflow, so running it outside of the Kubernetes
cluster doesn't affect the end-to-end latency of user-initiated operations.
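To illustrate the shape of such a service, here is a minimal sketch of an Event Hubs-triggered Azure Function using the in-process WebJobs model; the function, hub, and connection names are illustrative and this is not the reference implementation's code.

using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class DeliveryHistoryFunction
{
    // Invoked for each batch of delivery events read from the event hub.
    [FunctionName("RecordDeliveryHistory")]
    public static void Run(
        [EventHubTrigger("delivery-events", Connection = "EventHubConnection")] string[] events,
        ILogger log)
    {
        foreach (var deliveryEvent in events)
        {
            log.LogInformation("Recording delivery event: {Event}", deliveryEvent);
            // Persist the event to the history data store here.
        }
    }
}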
Next steps
Interservice communication
Designing interservice communication for
microservices
12/18/2020 • 12 minutes to read • Edit Online
Communication between microservices must be efficient and robust. With lots of small services interacting to
complete a single transaction, this can be a challenge. In this article, we look at the tradeoffs between
asynchronous messaging versus synchronous APIs. Then we look at some of the challenges in designing resilient
interservice communication.
Challenges
Here are some of the main challenges arising from service-to-service communication. Service meshes, described
later in this article, are designed to handle many of these challenges.
Resiliency. There may be dozens or even hundreds of instances of any given microservice. An instance can fail for
any number of reasons. There can be a node-level failure, such as a hardware failure or a VM reboot. An instance
might crash, or be overwhelmed with requests and unable to process any new requests. Any of these events can
cause a network call to fail. There are two design patterns that can help make service-to-service network calls
more resilient:
Retry. A network call may fail because of a transient fault that goes away by itself. Rather than fail outright,
the caller should typically retry the operation a certain number of times, or until a configured time-out
period elapses. However, if an operation is not idempotent, retries can cause unintended side effects. The
original call might succeed, but the caller never gets a response. If the caller retries, the operation may be
invoked twice. Generally, it's not safe to retry POST or PATCH methods, because these are not guaranteed to
be idempotent.
Circuit Breaker. Too many failed requests can cause a bottleneck, as pending requests accumulate in the
queue. These blocked requests might hold critical system resources such as memory, threads, database
connections, and so on, which can cause cascading failures. The Circuit Breaker pattern can prevent a
service from repeatedly trying an operation that is likely to fail.
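Both patterns can also be applied in application code. For example, the following C# sketch uses the Polly library to wrap a call to a downstream service in a retry policy and a circuit breaker; the policy values and the service URL are placeholders.

using System;
using System.Net.Http;
using Polly;

var httpClient = new HttpClient();

// Retry transient failures with exponential backoff.
var retry = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

// Stop calling the service for 30 seconds after 5 consecutive failures.
var circuitBreaker = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));

var resilient = Policy.WrapAsync(retry, circuitBreaker);

// Placeholder URL for the downstream microservice.
string response = await resilient.ExecuteAsync(
    () => httpClient.GetStringAsync("http://drone-service/api/drones/availability"));

A service mesh, described later in this article, can provide the same behaviors at the infrastructure level, outside the application code.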
Load balancing. When service "A" calls service "B", the request must reach a running instance of service "B". In
Kubernetes, the Service resource type provides a stable IP address for a group of pods. Network traffic to the
service's IP address gets forwarded to a pod by means of iptables rules. By default, a random pod is chosen. A
service mesh (see below) can provide more intelligent load balancing algorithms based on observed latency or
other metrics.
Distributed tracing. A single transaction may span multiple services. That can make it hard to monitor the
overall performance and health of the system. Even if every service generates logs and metrics, without some way
to tie them together, they are of limited use. The article Logging and monitoring talks more about distributed
tracing, but we mention it here as a challenge.
Service versioning. When a team deploys a new version of a service, they must avoid breaking any other
services or external clients that depend on it. In addition, you might want to run multiple versions of a service
side-by-side, and route requests to a particular version. See API Versioning for more discussion of this issue.
TLS encryption and mutual TLS authentication. For security reasons, you may want to encrypt traffic
between services with TLS, and use mutual TLS authentication to authenticate callers.
Synchronous versus asynchronous messaging
There are two basic messaging patterns that microservices can use to communicate with other microservices.
1. Synchronous communication. In this pattern, a service calls an API that another service exposes, using a
protocol such as HTTP or gRPC. This option is a synchronous messaging pattern because the caller waits for
a response from the receiver.
2. Asynchronous message passing. In this pattern, a service sends a message without waiting for a response,
and one or more services process the message asynchronously.
It's important to distinguish between asynchronous I/O and an asynchronous protocol. Asynchronous I/O means
the calling thread is not blocked while the I/O completes. That's important for performance, but is an
implementation detail in terms of the architecture. An asynchronous protocol means the sender doesn't wait for a
response. HTTP is a synchronous protocol, even though an HTTP client may use asynchronous I/O when it sends a
request.
There are tradeoffs to each pattern. Request/response is a well-understood paradigm, so designing an API may
feel more natural than designing a messaging system. However, asynchronous messaging has some advantages
that can be useful in a microservices architecture:
Reduced coupling. The message sender does not need to know about the consumer.
Multiple subscribers. Using a pub/sub model, multiple consumers can subscribe to receive events. See
Event-driven architecture style.
Failure isolation. If the consumer fails, the sender can still send messages. The messages will be picked up
when the consumer recovers. This ability is especially useful in a microservices architecture, because each
service has its own lifecycle. A service could become unavailable or be replaced with a newer version at any
given time. Asynchronous messaging can handle intermittent downtime. Synchronous APIs, on the other
hand, require the downstream service to be available or the operation fails.
Responsiveness. An upstream service can reply faster if it does not wait on downstream services. This is
especially useful in a microservices architecture. If there is a chain of service dependencies (service A calls
B, which calls C, and so on), waiting on synchronous calls can add unacceptable amounts of latency.
Load leveling. A queue can act as a buffer to level the workload, so that receivers can process messages at
their own rate.
Workflows. Queues can be used to manage a workflow, by check-pointing the message after each step in
the workflow.
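For example, a minimal sketch of asynchronous message passing with the Azure.Messaging.ServiceBus client looks like the following; the connection string and queue name are placeholders.

using Azure.Messaging.ServiceBus;

// Placeholders: supply your own namespace connection string and queue name.
await using var client = new ServiceBusClient("<service-bus-connection-string>");
ServiceBusSender sender = client.CreateSender("delivery-requests");

// The sender does not wait for a consumer; the message is processed asynchronously by whoever subscribes.
await sender.SendMessageAsync(new ServiceBusMessage("{ \"deliveryId\": \"39660\", \"status\": \"Requested\" }"));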
However, there are also some challenges to using asynchronous messaging effectively.
Coupling with the messaging infrastructure. Using a particular messaging infrastructure may cause
tight coupling with that infrastructure. It will be difficult to switch to another messaging infrastructure later.
Latency. End-to-end latency for an operation may become high if the message queues fill up.
Cost. At high throughputs, the monetary cost of the messaging infrastructure could be significant.
Complexity. Handling asynchronous messaging is not a trivial task. For example, you must handle
duplicated messages, either by de-duplicating or by making operations idempotent. It's also hard to
implement request-response semantics using asynchronous messaging. To send a response, you need
another queue, plus a way to correlate request and response messages.
Throughput. If messages require queue semantics, the queue can become a bottleneck in the system. Each
message requires at least one enqueue operation and one dequeue operation. Moreover, queue semantics
generally require some kind of locking inside the messaging infrastructure. If the queue is a managed
service, there may be additional latency, because the queue is external to the cluster's virtual network. You
can mitigate these issues by batching messages, but that complicates the code. If the messages don't
require queue semantics, you might be able to use an event stream instead of a queue. For more
information, see Event-driven architectural style.
NOTE
Service mesh is an example of the Ambassador pattern — a helper service that sends network requests on behalf of the
application.
Right now, the main options for a service mesh in Kubernetes are linkerd and Istio. Both of these technologies are
evolving rapidly. However, some features that both linkerd and Istio have in common include:
Load balancing at the session level, based on observed latencies or number of outstanding requests. This
can improve performance over the layer-4 load balancing that is provided by Kubernetes.
Layer-7 routing based on URL path, Host header, API version, or other application-level rules.
Retry of failed requests. A service mesh understands HTTP error codes, and can automatically retry failed
requests. You can configure the maximum number of retries, along with a timeout period in order to
bound the maximum latency.
Circuit breaking. If an instance consistently fails requests, the service mesh will temporarily mark it as
unavailable. After a backoff period, it will try the instance again. You can configure the circuit breaker based
on various criteria, such as the number of consecutive failures.
Service mesh captures metrics about interservice calls, such as the request volume, latency, error and
success rates, and response sizes. The service mesh also enables distributed tracing by adding correlation
information for each hop in a request.
Mutual TLS Authentication for service-to-service calls.
Do you need a service mesh? It depends. Without a service mesh, you'll need to consider each of the challenges
mentioned at the beginning of this article. You can solve problems like retry, circuit breaker, and distributed tracing
without a service mesh, but a service mesh moves these concerns out of the individual services and into a
dedicated layer. On the other hand, a service mesh adds complexity to the setup and configuration of the cluster.
There may be performance implications, because requests now get routed through the service mesh proxy, and
because extra services are now running on every node in the cluster. You should do thorough performance and
load testing before deploying a service mesh in production.
Distributed transactions
A common challenge in microservices is correctly handling transactions that span multiple services. Often in this
scenario, the success of a transaction is all or nothing — if one of the participating services fails, the entire
transaction must fail.
There are two cases to consider:
A service may experience a transient failure such as a network timeout. These errors can often be resolved
simply by retrying the call. If the operation still fails after a certain number of attempts, it's considered a
nontransient failure.
A nontransient failure is any failure that's unlikely to go away by itself. Nontransient failures include normal
error conditions, such as invalid input. They also include unhandled exceptions in application code or a
process crashing. If this type of error occurs, the entire business transaction must be marked as a failure. It
may be necessary to undo other steps in the same transaction that already succeeded.
After a nontransient failure, the current transaction might be in a partially failed state, where one or more steps
already completed successfully. For example, if the Drone service already scheduled a drone, the drone must be
canceled. In that case, the application needs to undo the steps that succeeded, by using a Compensating
Transaction. In some cases, this must be done by an external system or even by a manual process.
If the logic for compensating transactions is complex, consider creating a separate service that is responsible for
this process. In the Drone Delivery application, the Scheduler service puts failed operations onto a dedicated
queue. A separate microservice, called the Supervisor, reads from this queue and calls a cancellation API on the
services that need to compensate. This is a variation of the Scheduler Agent Supervisor pattern. The Supervisor
service might take other actions as well, such as notify the user by text or email, or send an alert to an operations
dashboard.
The Scheduler service itself might fault (for example, because a node crashes). In that case, a new instance can spin
up and take over. However, any transactions that were already in progress must be resumed.
One approach is to save a checkpoint to a durable store after each step in the workflow is completed. If an instance
of the Scheduler service crashes in the middle of a transaction, a new instance can use the checkpoint to resume
where the previous instance left off. However, writing checkpoints can create a performance overhead.
Another option is to design all operations to be idempotent. An operation is idempotent if it can be called multiple
times without producing additional side-effects after the first call. Essentially, the downstream service should
ignore duplicate calls, which means the service must be able to detect duplicate calls. It's not always
straightforward to implement idempotent methods. For more information, see Idempotent operations.
Next steps
For microservices that talk directly to each other, it's important to create well-designed APIs.
API design
Designing APIs for microservices
12/18/2020 • 12 minutes to read • Edit Online
Good API design is important in a microservices architecture, because all data exchange between services happens
either through messages or API calls. APIs must be efficient to avoid creating chatty I/O. Because services are
designed by teams working independently, APIs must have well-defined semantics and versioning schemes, so
that updates don't break other services.
Considerations
Here are some things to think about when choosing how to implement an API.
REST versus RPC. Consider the tradeoffs between using a REST-style interface versus an RPC-style interface.
REST models resources, which can be a natural way to express your domain model. It defines a uniform
interface based on HTTP verbs, which encourages evolvability. It has well-defined semantics in terms of
idempotency, side effects, and response codes. And it enforces stateless communication, which improves
scalability.
RPC is more oriented around operations or commands. Because RPC interfaces look like local method calls,
they may lead you to design overly chatty APIs. However, that doesn't mean RPC must be chatty. It just means
you need to use care when designing the interface.
For a RESTful interface, the most common choice is REST over HTTP using JSON. For an RPC-style interface, there
are several popular frameworks, including gRPC, Apache Avro, and Apache Thrift.
Efficiency. Consider efficiency in terms of speed, memory, and payload size. Typically a gRPC-based interface is
faster than REST over HTTP.
Interface definition language (IDL). An IDL is used to define the methods, parameters, and return values of an
API. An IDL can be used to generate client code, serialization code, and API documentation. IDLs can also be
consumed by API testing tools such as Postman. Frameworks such as gRPC, Avro, and Thrift define their own IDL
specifications. REST over HTTP does not have a standard IDL format, but a common choice is OpenAPI (formerly
Swagger). You can also create an HTTP REST API without using a formal definition language, but then you lose the
benefits of code generation and testing.
Serialization. How are objects serialized over the wire? Options include text-based formats (primarily JSON) and
binary formats such as Protocol Buffers. Binary formats are generally faster than text-based formats. However,
JSON has advantages in terms of interoperability, because most languages and frameworks support JSON
serialization. Some serialization formats require a fixed schema, and some require compiling a schema definition
file. In that case, you'll need to incorporate this step into your build process.
Framework and language support. HTTP is supported in nearly every framework and language. gRPC, Avro,
and Thrift all have libraries for C++, C#, Java, and Python. Thrift and gRPC also support Go.
Compatibility and interoperability. If you choose a protocol like gRPC, you may need a protocol translation
layer between the public API and the back end. A gateway can perform that function. If you are using a service
mesh, consider which protocols are compatible with the service mesh. For example, linkerd has built-in support for
HTTP, Thrift, and gRPC.
Our baseline recommendation is to choose REST over HTTP unless you need the performance benefits of a binary
protocol. REST over HTTP requires no special libraries. It creates minimal coupling, because callers don't need a
client stub to communicate with the service. There is a rich ecosystem of tools to support schema definitions,
testing, and monitoring of RESTful HTTP endpoints. Finally, HTTP is compatible with browser clients, so you don't
need a protocol translation layer between the client and the backend.
However, if you choose REST over HTTP, you should do performance and load testing early in the development
process, to validate whether it performs well enough for your scenario.
These sorts of coding practices are particularly important when building a traditional monolithic application. With
a large code base, many subsystems might use the Location object, so it's important for the object to enforce
correct behavior.
Another example is the Repository pattern, which ensures that other parts of the application do not make direct
reads or writes to the data store.
In a microservices architecture, however, services don't share the same code base and don't share data stores.
Instead, they communicate through APIs. Consider the case where the Scheduler service requests information
about a drone from the Drone service. The Drone service has its internal model of a drone, expressed through
code. But the Scheduler doesn't see that. Instead, it gets back a representation of the drone entity — perhaps a
JSON object in an HTTP response.
The Scheduler service can't modify the Drone service's internal models, or write to the Drone service's data store.
That means the code that implements the Drone service has a smaller exposed surface area, compared with code
in a traditional monolith. If the Drone service defines a Location class, the scope of that class is limited — no other
service will directly consume the class.
For these reasons, this guidance doesn't focus much on coding practices as they relate to the tactical DDD patterns.
But it turns out that you can also model many of the DDD patterns through REST APIs.
For example:
Aggregates map naturally to resources in REST. For example, the Delivery aggregate would be exposed as a
resource by the Delivery API.
Aggregates are consistency boundaries. Operations on aggregates should never leave an aggregate in an
inconsistent state. Therefore, you should avoid creating APIs that allow a client to manipulate the internal
state of an aggregate. Instead, favor coarse-grained APIs that expose aggregates as resources.
Entities have unique identities. In REST, resources have unique identifiers in the form of URLs. Create
resource URLs that correspond to an entity's domain identity. The mapping from URL to domain identity
may be opaque to the client.
Child entities of an aggregate can be reached by navigating from the root entity. If you follow HATEOAS
principles, child entities can be reached via links in the representation of the parent entity.
Because value objects are immutable, updates are performed by replacing the entire value object. In REST,
implement updates through PUT or PATCH requests.
A repository lets clients query, add, or remove objects in a collection, abstracting the details of the
underlying data store. In REST, a collection can be a distinct resource, with methods for querying the
collection or adding new entities to the collection.
When you design your APIs, think about how they express the domain model, not just the data inside the model,
but also the business operations and the constraints on the data.
API versioning
An API is a contract between a service and clients or consumers of that service. If an API changes, there is a risk of
breaking clients that depend on the API, whether those are external clients or other microservices. Therefore, it's a
good idea to minimize the number of API changes that you make. Often, changes in the underlying
implementation don't require any changes to the API. Realistically, however, at some point you will want to add
new features or new capabilities that require changing an existing API.
Whenever possible, make API changes backward compatible. For example, avoid removing a field from a model,
because that can break clients that expect the field to be there. Adding a field does not break compatibility, because
clients should ignore any fields they don't understand in a response. However, the service must handle the case
where an older client omits the new field in a request.
Support versioning in your API contract. If you introduce a breaking API change, introduce a new API version.
Continue to support the previous version, and let clients select which version to call. There are a couple of ways to
do this. One is simply to expose both versions in the same service. Another option is to run two versions of the
service side-by-side, and route requests to one or the other version, based on HTTP routing rules.
The diagram has two parts. "Service supports two versions" shows the v1 Client and the v2 Client both pointing to
one Service. "Side-by-side deployment" shows the v1 Client pointing to a v1 Service, and the v2 Client pointing to
a v2 Service.
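One simple way to expose both versions in the same service is URL-path versioning with attribute routing; the ASP.NET Core controllers below are an illustrative sketch, not the reference implementation.

using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/v1/deliveries")]
public class DeliveriesV1Controller : ControllerBase
{
    // Original contract, kept available for existing clients.
    [HttpGet("{id}")]
    public IActionResult Get(string id) => Ok(new { id, status = "Pending" });
}

[ApiController]
[Route("api/v2/deliveries")]
public class DeliveriesV2Controller : ControllerBase
{
    // v2 adds a field; clients opt in by calling the new route.
    [HttpGet("{id}")]
    public IActionResult Get(string id) => Ok(new { id, status = "Pending", pickupWindow = "08:00-10:00" });
}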
There's a cost to supporting multiple versions, in terms of developer time, testing, and operational overhead.
Therefore, it's good to deprecate old versions as quickly as possible. For internal APIs, the team that owns the API
can work with other teams to help them migrate to the new version. This is when having a cross-team governance
process is useful. For external (public) APIs, it can be harder to deprecate an API version, especially if the API is
consumed by third parties or by native client applications.
When a service implementation changes, it's useful to tag the change with a version. The version provides
important information when troubleshooting errors. It can be very helpful for root cause analysis to know exactly
which version of the service was called. Consider using semantic versioning for service versions. Semantic
versioning uses a MAJOR.MINOR.PATCH format. However, clients should only select an API by the major version
number, or possibly the minor version if there are significant (but non-breaking) changes between minor versions.
In other words, it's reasonable for clients to select between version 1 and version 2 of an API, but not to select
version 2.1.3. If you allow that level of granularity, you risk having to support a proliferation of versions.
For further discussion of API versioning, see Versioning a RESTful web API.
Idempotent operations
An operation is idempotent if it can be called multiple times without producing additional side-effects after the
first call. Idempotency can be a useful resiliency strategy, because it allows an upstream service to safely invoke an
operation multiple times. For a discussion of this point, see Distributed transactions.
The HTTP specification states that GET, PUT, and DELETE methods must be idempotent. POST methods are not
guaranteed to be idempotent. If a POST method creates a new resource, there is generally no guarantee that this
operation is idempotent. The specification defines idempotent this way:
A request method is considered "idempotent" if the intended effect on the server of multiple identical requests
with that method is the same as the effect for a single such request. (RFC 7231)
It's important to understand the difference between PUT and POST semantics when creating a new entity. In both
cases, the client sends a representation of an entity in the request body. But the meaning of the URI is different.
For a POST method, the URI represents a parent resource of the new entity, such as a collection. For
example, to create a new delivery, the URI might be /api/deliveries. The server creates the entity and
assigns it a new URI, such as /api/deliveries/39660. This URI is returned in the Location header of the
response. Each time the client sends a request, the server will create a new entity with a new URI.
For a PUT method, the URI identifies the entity. If there already exists an entity with that URI, the server
replaces the existing entity with the version in the request. If no entity exists with that URI, the server creates
one. For example, suppose the client sends a PUT request to /api/deliveries/39660. Assuming there is no
delivery with that URI, the server creates a new one. Now if the client sends the same request again, the
server will replace the existing entity.
Here is the Delivery service's implementation of the PUT method (the end of the method is sketched here, with illustrative repository and exception names).
[HttpPut("{id}")]
[ProducesResponseType(typeof(Delivery), 201)]
[ProducesResponseType(typeof(void), 204)]
public async Task<IActionResult> Put([FromBody]Delivery delivery, string id)
{
    logger.LogInformation("In Put action with delivery {Id}: {@DeliveryInfo}", id, delivery.ToLogInfo());
    try
    {
        var internalDelivery = delivery.ToInternal();
        await deliveryRepository.CreateAsync(internalDelivery);   // optimistically create the new entity
        return CreatedAtRoute("GetDelivery", new { id }, delivery);
    }
    catch (DuplicateResourceException)   // illustrative exception type
    {
        await deliveryRepository.UpdateAsync(id, delivery.ToInternal());   // the resource exists, so update it
        return NoContent();
    }
}
It's expected that most requests will create a new entity, so the method optimistically calls CreateAsync on the
repository object, and then handles any duplicate-resource exceptions by updating the resource instead.
Next steps
Learn about using an API gateway at the boundary between client applications and microservices.
API gateways
Using API gateways in microservices
12/18/2020 • 5 minutes to read • Edit Online
In a microservices architecture, a client might interact with more than one front-end service. Given this fact, how
does a client know what endpoints to call? What happens when new services are introduced, or existing services
are refactored? How do services handle SSL termination, authentication, and other concerns? An API gateway can
help to address these challenges.
Next steps
The previous articles have looked at the interfaces between microservices or between microservices and client
applications. By design, these interfaces treat each service as an opaque box. In particular, microservices should
never expose implementation details about how they manage data. That has implications for data integrity and
data consistency, explored in the next article.
Data considerations for microservices
Data considerations for microservices
12/18/2020 • 7 minutes to read • Edit Online
This article describes considerations for managing data in a microservices architecture. Because every
microservice manages its own data, data integrity and data consistency are critical challenges.
A basic principle of microservices is that each service manages its own data. Two services should not share a data
store. Instead, each service is responsible for its own private data store, which other services cannot access directly.
The reason for this rule is to avoid unintentional coupling between services, which can result if services share the
same underlying data schemas. If there is a change to the data schema, the change must be coordinated across
every service that relies on that database. By isolating each service's data store, we can limit the scope of change,
and preserve the agility of truly independent deployments. Another reason is that each microservice may have its
own data models, queries, or read/write patterns. Using a shared data store limits each team's ability to optimize
data storage for their particular service.
This approach naturally leads to polyglot persistence — the use of multiple data storage technologies within a
single application. One service might require the schema-on-read capabilities of a document database. Another
might need the referential integrity provided by an RDBMS. Each team is free to make the best choice for their
service. For more about the general principle of polyglot persistence, see Use the best data store for the job.
NOTE
It's fine for services to share the same physical database server. The problem occurs when services share the same schema,
or read and write to the same set of database tables.
Challenges
Some challenges arise from this distributed approach to managing data. First, there may be redundancy across the
data stores, with the same item of data appearing in multiple places. For example, data might be stored as part of a
transaction, then stored elsewhere for analytics, reporting, or archiving. Duplicated or partitioned data can lead to
issues of data integrity and consistency. When data relationships span multiple services, you can't use traditional
data management techniques to enforce the relationships.
Traditional data modeling uses the rule of "one fact in one place." Every entity appears exactly once in the schema.
Other entities may hold references to it but not duplicate it. The obvious advantage to the traditional approach is
that updates are made in a single place, which avoids problems with data consistency. In a microservices
architecture, you have to consider how updates are propagated across services, and how to manage eventual
consistency when data appears in multiple places without strong consistency.
Next steps
Learn about design patterns that can help mitigate some common challenges in a microservices architecture.
Design patterns for microservices
Design patterns for microservices
12/18/2020 • 2 minutes to read • Edit Online
The goal of microservices is to increase the velocity of application releases, by decomposing the application into
small autonomous services that can be deployed independently. A microservices architecture also brings some
challenges. The design patterns shown here can help mitigate these challenges.
Ambassador can be used to offload common client connectivity tasks such as monitoring, logging, routing, and
security (such as TLS) in a language agnostic way. Ambassador services are often deployed as a sidecar (see
below).
Anti-corruption layer implements a façade between new and legacy applications, to ensure that the design of a
new application is not limited by dependencies on legacy systems.
Backends for Frontends creates separate backend services for different types of clients, such as desktop and
mobile. That way, a single backend service doesn't need to handle the conflicting requirements of various client
types. This pattern can help keep each microservice simple, by separating client-specific concerns.
Bulkhead isolates critical resources, such as connection pools, memory, and CPU, for each workload or service. By
using bulkheads, a single workload (or service) can't consume all of the resources, starving others. This pattern
increases the resiliency of the system by preventing cascading failures caused by one service.
Gateway Aggregation aggregates requests to multiple individual microservices into a single request, reducing
chattiness between consumers and services.
Gateway Offloading enables each microservice to offload shared service functionality, such as the use of SSL
certificates, to an API gateway.
Gateway Routing routes requests to multiple microservices using a single endpoint, so that consumers don't
need to manage many separate endpoints.
Sidecar deploys helper components of an application as a separate container or process to provide isolation and
encapsulation.
Strangler Fig supports incremental refactoring of an application, by gradually replacing specific pieces of
functionality with new services.
For the complete catalog of cloud design patterns on the Azure Architecture Center, see Cloud Design Patterns.
Monitoring a microservices architecture in Azure
Kubernetes Service (AKS)
12/18/2020 • 14 minutes to read • Edit Online
This article describes best practices for monitoring a microservices application that runs on Azure Kubernetes
Service (AKS).
In any complex application, at some point something will go wrong. In a microservices application, you need to
track what's happening across dozens or even hundreds of services. To make sense of what's happening, you must
collect telemetry from the application. Telemetry can be divided into logs and metrics.
Logs are text-based records of events that occur while the application is running. They include things like
application logs (trace statements) or web server logs. Logs are primarily useful for forensics and root cause
analysis.
Metrics are numerical values that can be analyzed. You can use them to observe the system in real time (or close
to real time), or to analyze performance trends over time. To understand the system holistically, you must collect
metrics at various levels of the architecture, from the physical infrastructure to the application, including:
Node-level metrics, including CPU, memory, network, disk, and file system usage. System metrics help you
to understand resource allocation for each node in the cluster, and troubleshoot outliers.
Container metrics. For containerized applications, you need to collect metrics at the container level, not just
at the VM level.
Application metrics. This includes any metrics that are relevant to understanding the behavior of a service.
Examples include the number of queued inbound HTTP requests, request latency, or message queue length.
Applications can also create custom metrics that are specific to the domain, such as the number of business
transactions processed per minute; a minimal code sketch follows this list.
Dependent service metrics. Services may call external services or endpoints, such as managed PaaS
services or SaaS services. Third-party services may or may not provide any metrics. If not, you'll have to rely
on your own application metrics to track statistics for latency and error rate.
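For example, here is a minimal sketch of recording a domain-specific metric with the Application Insights SDK for
.NET. The class, method, and metric name are hypothetical; only the TelemetryClient.GetMetric API is taken from
the SDK.
using Microsoft.ApplicationInsights;

public class DeliveryMetrics
{
    private readonly TelemetryClient _telemetry;

    public DeliveryMetrics(TelemetryClient telemetry) => _telemetry = telemetry;

    // Record one completed business transaction. GetMetric pre-aggregates values locally
    // and sends aggregated samples, which keeps the telemetry volume low.
    // "BusinessTransactions" is an illustrative metric name, not a built-in metric.
    public void RecordCompletedDelivery() =>
        _telemetry.GetMetric("BusinessTransactions").TrackValue(1);
}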
From here, you can drill in further to find the issue. For example, if the pod status is ImagePullBackoff, it means
that Kubernetes could not pull the container image from the registry. This could be caused by an invalid container
tag or an authentication error trying to pull from the registry.
Note that a container crashing will put the container state into State = Waiting, with Reason = CrashLoopBackOff.
For a typical scenario where a pod is part of a replica set and the retry policy is Always, this won't show as an
error in the cluster status. However, you can run queries or set up alerts for this condition. For more information,
see Understand AKS cluster performance with Azure Monitor for containers.
Metrics
We recommend using Azure Monitor to collect and view metrics for your AKS clusters and any other dependent
Azure services.
For cluster and container metrics, enable Azure Monitor for containers. When this feature is enabled, Azure
Monitor collects memory and processor metrics from controllers, nodes, and containers via the Kubernetes
metrics API. For more information about the metrics that are available through Azure Monitor for
containers, see Understand AKS cluster performance with Azure Monitor for containers.
Use Application Insights to collect application metrics. Application Insights is an extensible Application
Performance Management (APM) service. To use it, you install an instrumentation package in your
application. This package monitors the app and sends telemetry data to the Application Insights service. It
can also pull telemetry data from the host environment. The data is then sent to Azure Monitor. Application
Insights also provides built-in correlation and dependency tracking (see Distributed tracing, below).
Application Insights has a maximum throughput measured in events/second, and it throttles if the data rate
exceeds the limit. For details, see Application Insights limits. Create different Application Insights instances per
environment, so that dev/test environments don't compete against the production telemetry for quota.
A single operation may generate several telemetry events, so if the application experiences a high volume of traffic,
it is likely to get throttled. To mitigate this problem, you can perform sampling to reduce the telemetry traffic. The
tradeoff is that your metrics will be less precise. For more information, see Sampling in Application Insights. You
can also reduce the data volume by pre-aggregating metrics — that is, calculating statistical values such as average
and standard deviation, and sending those values instead of the raw telemetry. The following blog post describes
an approach to using Application Insights at scale: Azure Monitoring and Analytics at Scale.
If your data rate is high enough to trigger throttling, and sampling or aggregation are not acceptable, consider
exporting metrics to a time-series database such as Prometheus or InfluxDB running in the cluster.
InfluxDB is a push-based system; an agent needs to push the metrics. You can use the TICK stack (Telegraf,
InfluxDB, Chronograf, and Kapacitor) to set up Kubernetes monitoring, with Telegraf acting as the agent that
collects metrics and pushes them to InfluxDB. InfluxDB can also store irregular events and string data types.
Prometheus is a pull-based system. It periodically scrapes metrics from configured locations. Prometheus
can scrape metrics generated by cAdvisor or kube-state-metrics. kube-state-metrics is a service that collects
metrics from the Kubernetes API server and makes them available to Prometheus (or a scraper that is
compatible with a Prometheus client endpoint). For system metrics, use Node exporter, which is a
Prometheus exporter for system metrics. Prometheus supports floating point data, but not string data, so it
is appropriate for system metrics but not logs. Kubernetes Metrics Server is a cluster-wide aggregator of
resource usage data.
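As a rough illustration, here is a minimal Prometheus scrape configuration that collects metrics from
kube-state-metrics and a node exporter. The service names, namespaces, and ports are assumptions based on the
defaults of those projects, not values prescribed by AKS.
scrape_configs:
  # kube-state-metrics exposes cluster object metrics on port 8080 by default.
  - job_name: kube-state-metrics
    static_configs:
      - targets: ['kube-state-metrics.kube-system.svc:8080']
  # Node exporter exposes host-level metrics on port 9100 by default.
  - job_name: node-exporter
    static_configs:
      - targets: ['node-exporter.monitoring.svc:9100']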
Logging
Here are some of the general challenges of logging in a microservices application:
Understanding the end-to-end processing of a client request, where multiple services might be invoked to
handle a single request.
Consolidating logs from multiple services into a single aggregated view.
Parsing logs that come from multiple sources, which use their own logging schemas or have no particular
schema. Logs may be generated by third-party components that you don't control.
Microservices architectures often generate a larger volume of logs than traditional monoliths, because there are
more services, network calls, and steps in a transaction. That means logging itself can be a performance or
resource bottleneck for the application.
There are some additional challenges for a Kubernetes-based architecture:
Containers can move around and be rescheduled.
Kubernetes has a networking abstraction that uses virtual IP addresses and port mappings.
In Kubernetes, the standard approach to logging is for a container to write logs to stdout and stderr. The container
engine redirects these streams to a logging driver. For ease of querying, and to prevent possible loss of log data if
a node crashes, the usual approach is to collect the logs from each node and send them to a central storage
location.
Azure Monitor integrates with AKS to support this approach. Azure Monitor collects container logs and sends them
to a Log Analytics workspace. From there, you can use the Kusto query language to write queries across the
aggregated logs. For example, here is a Kusto query to show the container logs for a specified pod:
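(The query below is an illustrative sketch, assuming the ContainerLog and KubePodInventory tables that Azure
Monitor for containers populates, and a hypothetical pod name.)
let ContainerIds = KubePodInventory
    | where Name startswith "fabrikam-delivery"   // illustrative pod name
    | distinct ContainerID;
ContainerLog
| where ContainerID in (ContainerIds)
| project TimeGenerated, Computer, LogEntry
| order by TimeGenerated desc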
Azure Monitor is a managed service, and configuring an AKS cluster to use Azure Monitor is a simple configuration
switch in the CLI or Resource Manager template. (For more information, see How to enable Azure Monitor for
containers.) Another advantage of using Azure Monitoring is that it consolidates your AKS logs with other Azure
platform logs, providing a unified monitoring experience.
Azure Monitor is billed per gigabyte (GB) of data ingested into the service (see Azure Monitor pricing). At very high
volumes, cost may become a consideration. There are many open-source alternatives available for the Kubernetes
ecosystem. For example, many organizations use Fluentd with Elasticsearch. Fluentd is an open-source data
collector, and Elasticsearch is a document database optimized for search. A challenge with these options is that they
require additional configuration and management of the cluster. For a production workload, you may need to
experiment with configuration settings. You'll also need to monitor the performance of the logging infrastructure.
Application Insights
For richer log data, we recommend instrumenting your code with Application Insights. This requires adding an
Application Insights package to your code and configuring your code to send logging statements to Application
Insights. The details depend on the platform, such as .NET, Java, or Node.js. The Application Insights package sends
telemetry data to Azure Monitor.
If you are using .NET Core, we recommend also using the Application Insights for Kubernetes library. This library
enriches Application Insights traces with additional information such as the container, node, pod, labels, and replica
set.
Advantages of this approach include:
Application Insights logs HTTP requests, including latency and result code.
Distributed tracing is enabled by default.
Traces include an operation ID, so you can match all traces for a particular operation.
Traces generated by Application Insights often have additional contextual information. For example, ASP.NET
traces are decorated with the action name and a category such as ControllerActionInvoker , which give you
insights into the ASP.NET request pipeline.
Application Insights collects performance metrics for performance troubleshooting and optimization.
Considerations:
Application Insights throttles the telemetry if the data rate exceeds a maximum limit; for details, see Application
Insights limits. A single operation may generate several telemetry events, so if the application experiences a
high volume of traffic, it is likely to get throttled.
Because Application Insights batches data, it's possible to lose a batch if a process crashes with an unhandled
exception.
Application Insights is billed based on data volume. For more information, see Manage pricing and data volume
in Application Insights.
Structured logging
To make logs easier to parse, use structured logging where possible. Structured logging is an approach in which the
application writes logs in a structured format, such as JSON, rather than outputting unstructured text strings. There
are many structured logging libraries available. For example, here is a logging statement that uses the Serilog
library for .NET Core:
// Illustrative reconstruction: the surrounding controller action and the exact type of the
// second argument are assumptions; the message template matches the query shown below.
logger.LogInformation("In Put action with delivery {Id}: {@DeliveryInfo}", id, delivery);
...
}
Here, the call to LogInformation includes an Id parameter and DeliveryInfo parameter. With structured logging,
these values are not interpolated into the message string. Instead, the log output will look something like this:
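(The following line is an illustrative reconstruction: the timestamp and nested values are invented, while the
message template and Id match the query and record shown later in this section.)
{"@t":"2020-12-18T18:42:30.919Z","@mt":"In Put action with delivery {Id}: {@DeliveryInfo}","Id":"36585f2d-c1fa-4a3d-9e06-a7f40b7d04ef","DeliveryInfo":{"Id":"36585f2d-c1fa-4a3d-9e06-a7f40b7d04ef","Owner":{"UserId":"user id for logging"},"Expedited":true}}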
This is a JSON string, where the "@t" field is a timestamp, "@mt" is the message string, and the remaining
key/value pairs are the parameters. Outputting JSON format makes it easier to query the data in a structured way.
For example, the following Log Analytics query, written in the Kusto query language, searches for instances of this
particular message from all containers named fabrikam-delivery :
traces
| where customDimensions.["Kubernetes.Container.Name"] == "fabrikam-delivery"
| where customDimensions.["{OriginalFormat}"] == "In Put action with delivery {Id}: {@DeliveryInfo}"
| project message, customDimensions["Id"], customDimensions["@DeliveryInfo"]
Viewing the result in the Azure portal shows that DeliveryInfo is a structured record that contains the serialized
representation of the DeliveryInfo model:
Here is the JSON from this example:
{
"Id": "36585f2d-c1fa-4a3d-9e06-a7f40b7d04ef",
"Owner": {
"UserId": "user id for logging",
"AccountId": "52dadf0c-0067-43e7-af76-86e32b48bc5e"
},
"Pickup": {
"Altitude": 0.29295161612934972,
"Latitude": 0.26815900219052985,
"Longitude": 0.79841844309047727
},
"Dropoff": {
"Altitude": 0.31507750848078986,
"Latitude": 0.753494655598651,
"Longitude": 0.89352830773849423
},
"Deadline": "string",
"Expedited": true,
"ConfirmationRequired": 0,
"DroneId": "AssignedDroneId01ba4d0b-c01a-4369-ba75-51bde0e76cc9"
}
The previous code snippet used the Serilog library, but structured logging libraries are available for other
languages as well. For example, here's an example using the SLF4J library for Java:
MDC.put("DeliveryId", deliveryId);
Distributed tracing
A significant challenge of microservices is to understand the flow of events across services. A single transaction
may involve calls to multiple services. To reconstruct the entire sequence of steps, each service should propagate a
correlation ID that acts as a unique identifier for that operation. The correlation ID enables distributed tracing
across services.
The first service that receives a client request should generate the correlation ID. If the service makes an HTTP call
to another service, it puts the correlation ID in a request header. If the service sends an asynchronous message, it
puts the correlation ID into the message. Downstream services continue to propagate the correlation ID, so that it
flows through the entire system. In addition, all code that writes application metrics or log events should include
the correlation ID.
When service calls are correlated, you can calculate operational metrics such as the end-to-end latency for a
complete transaction, the number of successful transactions per second, and the percentage of failed transactions.
Including correlation IDs in application logs makes it possible to perform root cause analysis. If an operation fails,
you can find the log statements for all of the service calls that were part of the same operation.
We recommend using Application Insights for distributed tracing. The Application Insights SDK automatically
injects correlation context into HTTP headers, and includes the correlation ID in Application Insights logs. Some
services may still need to explicitly propagate the correlation headers, depending on the frameworks and libraries
being used. For more information, see Telemetry correlation in Application Insights.
Some additional considerations when implementing distributed tracing:
There is now a standard HTTP header for correlation IDs: the W3C Trace Context proposal was recently accepted
as an official recommendation. If your tooling does not yet support it, your team should standardize on a custom
header value. Either way, the choice may be decided by your logging framework, such as Application Insights, or
choice of service mesh.
For asynchronous messages, if your messaging infrastructure supports adding metadata to messages, you
should include the correlation ID as metadata. Otherwise, include it as part of the message schema. For
example, see Distributed tracing and correlation through Service Bus messaging.
Rather than a single opaque identifier, you might send a correlation context that includes richer information,
such as caller-callee relationships.
If you are using Istio or Linkerd as a service mesh, these technologies automatically generate correlation
headers when HTTP calls are routed through the service mesh proxies. Services should forward the relevant
headers.
Example of distributed tracing
This example follows a distributed transaction through a set of microservices. The example is taken from a
reference implementation described here.
Because every call includes an operation ID, you can also view the end-to-end steps in a single transaction,
including timing information and the HTTP calls at each step. Here is the visualization of one such transaction:
This visualization shows the steps from the ingestion service to the queue, from the queue to the workflow service,
and from the workflow service to the other backend services. The last step is the Workflow service marking the
Service Bus message as completed.
Now here is an example when calls to a backend service were failing:
This shows that a large fraction (36%) of calls to the Drone Scheduler service failed during the period being
queried. In the end-to-end transaction view, it shows that an exception occurs when sending an HTTP PUT request
to the service.
Further drilling in, the exception turns out to be a socket exception, "No such device or address."
This is a hint that the backend service is not reachable. At this point, you might use kubectl to view the deployment
configuration. In this example, it turned out the service hostname was not resolving, due to an error in the
Kubernetes configuration files. The article Debug Services in the Kubernetes documentation has tips for
diagnosing this sort of error.
Here are some common causes of errors:
Code bugs. These might manifest as:
Exceptions. Look in the Application Insights logs to view the exception details.
Process crashing. Look at container and pod status, and view container logs or Application Insights
traces.
HTTP 5xx errors
Resource exhaustion:
Look for throttling (HTTP 429) or request timeouts.
Examine container metrics for CPU, memory, and disk
Look at the configurations for container and pod resource limits.
Service discovery. Examine the Kubernetes service configuration and port mappings.
API mismatch. Look for HTTP 400 errors. If APIs are versioned, look at which version is being called.
Error pulling a container image. Look at the pod specification. Also make sure the cluster is authorized to pull
from the container registry.
RBAC issues.
Next steps
Learn more about features in Azure Monitor that support monitoring of applications on AKS:
Azure Monitor for containers overview
Understand AKS cluster performance with Azure Monitor for containers
For more information about using metrics for performance tuning, see Performance tuning a distributed
application.
CI/CD for microservices architectures
12/18/2020 • 8 minutes to read • Edit Online
Faster release cycles are one of the major advantages of microservices architectures. But without a good CI/CD
process, you won't achieve the agility that microservices promise. This article describes the challenges and
recommends some approaches to the problem.
What is CI/CD?
When we talk about CI/CD, we're really talking about several related processes: Continuous integration, continuous
delivery, and continuous deployment.
Continuous integration . Code changes are frequently merged into the main branch. Automated build and
test processes ensure that code in the main branch is always production-quality.
Continuous delivery. Any code changes that pass the CI process are automatically published to a
production-like environment. Deployment into the live production environment may require manual
approval, but is otherwise automated. The goal is that your code should always be ready to deploy into
production.
Continuous deployment . Code changes that pass the previous two steps are automatically deployed into
production.
Here are some goals of a robust CI/CD process for a microservices architecture:
Each team can build and deploy the services that it owns independently, without affecting or disrupting
other teams.
Before a new version of a service is deployed to production, it gets deployed to dev/test/QA environments
for validation. Quality gates are enforced at each stage.
A new version of a service can be deployed side by side with the previous version.
Sufficient access control policies are in place.
For containerized workloads, you can trust the container images that are deployed to production.
Challenges
Many small independent code bases . Each team is responsible for building its own service, with its own
build pipeline. In some organizations, teams may use separate code repositories. Separate repositories can
lead to a situation where the knowledge of how to build the system is spread across teams, and nobody in
the organization knows how to deploy the entire application. For example, what happens in a disaster
recovery scenario, if you need to quickly deploy to a new cluster?
Mitigation : Have a unified and automated pipeline to build and deploy services, so that this knowledge is
not "hidden" within each team.
Multiple languages and frameworks . With each team using its own mix of technologies, it can be
difficult to create a single build process that works across the organization. The build process must be
flexible enough that every team can adapt it for their choice of language or framework.
Mitigation : Containerize the build process for each service. That way, the build system just needs to be able
to run the containers.
Integration and load testing . With teams releasing updates at their own pace, it can be challenging to
design robust end-to-end testing, especially when services have dependencies on other services. Moreover,
running a full production cluster can be expensive, so it's unlikely that every team will run its own full
cluster at production scales, just for testing.
Release management . Every team should be able to deploy an update to production. That doesn't mean
that every team member has permissions to do so. But having a centralized Release Manager role can
reduce the velocity of deployments.
Mitigation : The more that your CI/CD process is automated and reliable, the less there should be a need
for a central authority. That said, you might have different policies for releasing major feature updates
versus minor bug fixes. Being decentralized doesn't mean zero governance.
Service updates. When you update a service to a new version, it shouldn't break other services that
depend on it.
Mitigation : Use deployment techniques such as blue-green or canary release for non-breaking changes.
For breaking API changes, deploy the new version side by side with the previous version. That way, services
that consume the previous API can be updated and tested for the new API. See Updating services, below.
Updating services
There are various strategies for updating a service that's already in production. Here we discuss three common
options: Rolling update, blue-green deployment, and canary release.
Rolling updates
In a rolling update, you deploy new instances of a service, and the new instances start receiving requests right
away. As the new instances come up, the previous instances are removed.
Example. In Kubernetes, rolling updates are the default behavior when you update the pod spec for a Deployment.
The Deployment controller creates a new ReplicaSet for the updated pods. Then it scales up the new ReplicaSet
while scaling down the old one, to maintain the desired replica count. It doesn't delete old pods until the new ones
are ready. Kubernetes keeps a history of the update, so you can roll back an update if needed.
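As a rough sketch, the rolling update behavior can be tuned in the Deployment spec; the names and image below
are illustrative:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: delivery
spec:
  replicas: 3
  selector:
    matchLabels:
      app: delivery
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # create at most one extra pod during the update
      maxUnavailable: 0    # keep the full replica count available throughout
  template:
    metadata:
      labels:
        app: delivery
    spec:
      containers:
      - name: delivery
        image: myregistry.azurecr.io/delivery:0.1.1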
Example. Azure Service Fabric uses the rolling update strategy by default. This strategy is best suited for
deploying a version of a service with new features without changing existing APIs. Service Fabric starts an upgrade
deployment by updating the application type to a subset of the nodes or an update domain. It then rolls forward to
the next update domain until all domains are upgraded. If an upgrade domain fails to update, the application type
rolls back to the previous version across all domains. Be aware that when an application type contains multiple
services and all of them are updated as part of one upgrade deployment, the upgrade is more prone to failure: if
one service fails to update, the entire application is rolled back to the previous version and the other services are
not updated.
One challenge of rolling updates is that during the update process, a mix of old and new versions are running and
receiving traffic. During this period, any request could get routed to either of the two versions.
For breaking API changes, a good practice is to support both versions side by side, until all clients of the previous
version are updated. See API versioning.
Blue -green deployment
In a blue-green deployment, you deploy the new version alongside the previous version. After you validate the
new version, you switch all traffic at once from the previous version to the new version. After the switch, you
monitor the application for any problems. If something goes wrong, you can swap back to the old version.
Assuming there are no problems, you can delete the old version.
With a more traditional monolithic or N-tier application, blue-green deployment generally meant provisioning two
identical environments. You would deploy the new version to a staging environment, then redirect client traffic to
the staging environment — for example, by swapping VIP addresses. In a microservices architecture, updates
happen at the microservice level, so you would typically deploy the update into the same environment and use a
service discovery mechanism to swap.
Example . In Kubernetes, you don't need to provision a separate cluster to do blue-green deployments. Instead, you
can take advantage of selectors. Create a new Deployment resource with a new pod spec and a different set of
labels. Create this deployment, without deleting the previous deployment or modifying the service that points to it.
Once the new pods are running, you can update the service's selector to match the new deployment.
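For illustration, assuming two Deployments whose pods are labeled version: v1 and version: v2, the cutover can be
as simple as repointing the Service selector. All names below are hypothetical.
# Service initially routing to the "blue" (v1) pods
apiVersion: v1
kind: Service
metadata:
  name: delivery
spec:
  selector:
    app: delivery
    version: v1
  ports:
  - port: 80
    targetPort: 8080
Switching to the "green" deployment (and back again, if needed) is then a single selector update:
kubectl patch service delivery -p '{"spec":{"selector":{"app":"delivery","version":"v2"}}}'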
One drawback of blue-green deployment is that during the update, you are running twice as many pods for the
service (current and next). If the pods require a lot of CPU or memory resources, you may need to scale out the
cluster temporarily to handle the resource consumption.
Canary release
In a canary release, you roll out an updated version to a small number of clients. Then you monitor the behavior of
the new service before rolling it out to all clients. This lets you do a slow rollout in a controlled fashion, observe
real data, and spot problems before all customers are affected.
A canary release is more complex to manage than either blue-green or rolling update, because you must
dynamically route requests to different versions of the service.
Example . In Kubernetes, you can configure a Service to span two replica sets (one for each version) and adjust the
replica counts manually. However, this approach is rather coarse-grained, because of the way Kubernetes load
balances across pods. For example, if you have a total of 10 replicas, you can only shift traffic in 10% increments. If
you are using a service mesh, you can use the service mesh routing rules to implement a more sophisticated
canary release strategy.
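For example, with Istio, traffic weights can be declared per version. This sketch assumes a DestinationRule that
defines v1 and v2 subsets for a service named delivery; all names and weights are illustrative.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: delivery
spec:
  hosts:
  - delivery
  http:
  - route:
    - destination:
        host: delivery
        subset: v1
      weight: 90     # 90% of traffic stays on the current version
    - destination:
        host: delivery
        subset: v2
      weight: 10     # 10% canary traffic goes to the new version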
Next steps
Learn specific CI/CD practices for microservices running on Kubernetes.
CI/CD for microservices on Kubernetes
Building a CI/CD pipeline for microservices on
Kubernetes
12/18/2020 • 12 minutes to read • Edit Online
It can be challenging to create a reliable CI/CD process for a microservices architecture. Individual teams must be
able to release services quickly and reliably, without disrupting other teams or destabilizing the application as a
whole.
This article describes an example CI/CD pipeline for deploying microservices to Azure Kubernetes Service (AKS).
Every team and project is different, so don't take this article as a set of hard-and-fast rules. Instead, it's meant to be
a starting point for designing your own CI/CD process.
The example pipeline described here was created for a microservices reference implementation called the Drone
Delivery application, which you can find on GitHub. The application scenario is described here.
The goals of the pipeline can be summarized as follows:
Teams can build and deploy their services independently.
Code changes that pass the CI process are automatically deployed to a production-like environment.
Quality gates are enforced at each stage of the pipeline.
A new version of a service can be deployed side by side with the previous version.
For more background, see CI/CD for microservices architectures.
Assumptions
For purposes of this example, here are some assumptions about the development team and the code base:
The code repository is a monorepo, with folders organized by microservice.
The team's branching strategy is based on trunk-based development.
The team uses release branches to manage releases. Separate releases are created for each microservice.
The CI/CD process uses Azure Pipelines to build, test, and deploy the microservices to AKS.
The container images for each microservice are stored in Azure Container Registry.
The team uses Helm charts to package each microservice.
These assumptions drive many of the specific details of the CI/CD pipeline. However, the basic approach described
here can be adapted for other processes, tools, and services, such as Jenkins or Docker Hub.
Validation builds
Suppose that a developer is working on a microservice called the Delivery Service. While developing a new
feature, the developer checks code into a feature branch. By convention, feature branches are named feature/* .
The build definition file includes a trigger that filters by the branch name and the source path:
trigger:
batch: true
branches:
include:
# for new release to production: release flow strategy
- release/delivery/v*
- refs/release/delivery/v*
- master
- feature/delivery/*
- topic/delivery/*
paths:
include:
- /src/shipping/delivery/
NOTE
In Azure DevOps Repos, you can define policies to protect branches. For example, the policy could require a successful CI
build plus a sign-off from an approver in order to merge into master.
Isolation of environments
You will have multiple environments where you deploy services, including environments for development, smoke
testing, integration testing, load testing, and finally production. These environments need some level of isolation. In
Kubernetes, you have a choice between physical isolation and logical isolation. Physical isolation means deploying
to separate clusters. Logical isolation uses namespaces and policies, as described earlier.
Our recommendation is to create a dedicated production cluster along with a separate cluster for your dev/test
environments. Use logical isolation to separate environments within the dev/test cluster. Services deployed to the
dev/test cluster should never have access to data stores that hold business data.
Build process
When possible, package your build process into a Docker container. That allows you to build your code artifacts
using Docker, without needing to configure the build environment on each build machine. A containerized build
process makes it easy to scale out the CI pipeline by adding new build agents. Also, any developer on the team can
build the code simply by running the build container.
By using multi-stage builds in Docker, you can define the build environment and the runtime image in a single
Dockerfile. For example, here's a Dockerfile that builds a .NET application:
FROM mcr.microsoft.com/dotnet/core/runtime:3.1 AS base
WORKDIR /app

# Build stage: compiling requires the SDK image, not the runtime image.
FROM mcr.microsoft.com/dotnet/core/sdk:3.1 AS build
WORKDIR /app
COPY Fabrikam.Workflow.Service/Fabrikam.Workflow.Service.csproj .
RUN dotnet restore Fabrikam.Workflow.Service.csproj
COPY Fabrikam.Workflow.Service/. .
RUN dotnet build Fabrikam.Workflow.Service.csproj -c release -o /app --no-restore

# Test runner stage: built only when explicitly targeted (see below).
FROM build AS testrunner
WORKDIR /app/tests
COPY Fabrikam.Workflow.Service.Tests/*.csproj .
RUN dotnet restore Fabrikam.Workflow.Service.Tests.csproj
COPY Fabrikam.Workflow.Service.Tests/. .
ENTRYPOINT ["dotnet", "test", "--logger:trx"]

# Runtime stage: the image that ships to production (entry point DLL name follows the project name).
FROM base AS runtime
COPY --from=build /app .
ENTRYPOINT ["dotnet", "Fabrikam.Workflow.Service.dll"]
A developer can use this Dockerfile to run the tests locally:
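(An illustrative pair of commands; the image tag is a placeholder.)
# Build only up to the testrunner stage and run the tests, mounting a host folder
# so the .trx results are available after the container exits.
docker build --target testrunner -t fabrikam-workflow-tests .
docker run --rm -v "$(pwd)/TestResults:/app/tests/TestResults" fabrikam-workflow-tests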
The CI pipeline should also run the tests as part of the build verification step.
Note that this file uses the Docker ENTRYPOINT command to run the tests, not the Docker RUN command.
If you use the RUN command, the tests run every time you build the image. By using ENTRYPOINT, the tests are
opt-in. They run only when you explicitly target the testrunner stage.
A failing test doesn't cause the Docker build command to fail. That way you can distinguish container build
failures from test failures.
Test results can be saved to a mounted volume.
Container best practices
Here are some other best practices to consider for containers:
Define organization-wide conventions for container tags, versioning, and naming conventions for resources
deployed to the cluster (pods, services, and so on). That can make it easier to diagnose deployment issues.
During the development and test cycle, the CI/CD process will build many container images. Only some of
those images are candidates for release, and then only some of those release candidates will get promoted
to production. Have a clear versioning strategy, so that you know which images are currently deployed to
production, and can roll back to a previous version if necessary.
Always deploy specific container version tags, not latest .
Use namespaces in Azure Container Registry to isolate images that are approved for production from
images that are still being tested. Don't move an image into the production namespace until you're ready to
deploy it into production. If you combine this practice with semantic versioning of container images, it can
reduce the chance of accidentally deploying a version that wasn't approved for release.
Follow the principle of least privilege by running containers as a nonprivileged user. In Kubernetes, you can
create a pod security policy that prevents containers from running as root. See Prevent Pods From Running
With Root Privileges.
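A minimal sketch of such a policy follows; the policy name is illustrative, and the remaining rules are left
permissive for brevity.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: nonroot-policy
spec:
  privileged: false
  runAsUser:
    rule: MustRunAsNonRoot   # reject pods whose containers try to run as root
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
  - '*'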
Helm charts
Consider using Helm to manage building and deploying services. Here are some of the features of Helm that help
with CI/CD:
Often a single microservice is defined by multiple Kubernetes objects. Helm allows these objects to be
packaged into a single Helm chart.
A chart can be deployed with a single Helm command, rather than a series of kubectl commands.
Charts are explicitly versioned, using semantic versioning. Use Helm to release a version, view releases, track
updates and revisions, and roll back to a previous version.
Helm charts use templates to avoid duplicating information, such as labels and selectors, across many files.
Helm can manage dependencies between charts.
Charts can be stored in a Helm repository, such as Azure Container Registry, and integrated into the build
pipeline.
For more information about using Container Registry as a Helm repository, see Use Azure Container Registry as a
Helm repository for your application charts.
IMPORTANT
This feature is currently in preview. Previews are made available to you on the condition that you agree to the supplemental
terms of use. Some aspects of this feature may change prior to general availability (GA).
A single microservice may involve multiple Kubernetes configuration files. Updating a service can mean touching
all of these files to update selectors, labels, and image tags. Helm treats these as a single package called a chart and
allows you to easily update the YAML files by using variables. Helm uses a template language (based on Go
templates) to let you write parameterized YAML configuration files.
For example, here's part of a YAML file that defines a deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "package.fullname" . | replace "." "" }}
labels:
app.kubernetes.io/name: {{ include "package.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
annotations:
kubernetes.io/change-cause: {{ .Values.reason }}
...
spec:
containers:
- name: &package-container_name fabrikam-package
image: {{ .Values.dockerregistry }}/{{ .Values.image.repository }}:{{ .Values.image.tag }}
imagePullPolicy: {{ .Values.image.pullPolicy }}
env:
- name: LOG_LEVEL
value: {{ .Values.log.level }}
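The parameter values come from the chart's values file or from --set arguments at deployment time. A hypothetical
values.yaml that would satisfy the template above might look like this (every value is illustrative):
reason: Initial deployment
dockerregistry: myregistry.azurecr.io
image:
  repository: fabrikam-package
  tag: 0.1.0
  pullPolicy: IfNotPresent
log:
  level: Information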
Although your CI/CD pipeline could install a chart directly to Kubernetes, we recommend creating a chart archive
(.tgz file) and pushing the chart to a Helm repository such as Azure Container Registry. For more information, see
Package Docker-based apps in Helm charts in Azure Pipelines.
Consider deploying Helm to its own namespace and using role-based access control (RBAC) to restrict which
namespaces it can deploy to. For more information, see Role-based Access Control in the Helm documentation.
Revisions
Helm charts always have a version number, which must use semantic versioning. A chart can also have an
appVersion. This field is optional, and doesn't have to be related to the chart version. Some teams might want to
version the application separately from updates to the charts. But a simpler approach is to use one version number,
so there's a 1:1 relation between chart version and application version. That way, you can store one chart per
release and easily deploy the desired release:
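(An illustrative command, assuming Helm 3 and a chart repository named fabrikam; the release name, chart name,
and version are placeholders.)
helm install delivery-v0.1.0 fabrikam/delivery \
    --version 0.1.0 \
    --set image.tag=0.1.0 \
    --set reason="Initial deployment"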
This lets you view the change-cause field for each revision, using the kubectl rollout history command. In the
previous example, the change-cause is provided as a Helm chart parameter.
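An illustrative form of that command, assuming the chart created a deployment named delivery-v010 as in the
output shown next:
kubectl rollout history deployment delivery-v010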
deployment.extensions/delivery-v010
REVISION CHANGE-CAUSE
1 Initial deployment
You can also use the helm list command to view the revision history:
helm list
In the CI pipeline, the main steps for building and packaging the microservice are:
1. Build the test runner image by targeting the testrunner stage of the Dockerfile.
- task: Docker@1
inputs:
azureSubscriptionEndpoint: $(AzureSubscription)
azureContainerRegistry: $(AzureContainerRegistry)
arguments: '--pull --target testrunner'
dockerFile: $(System.DefaultWorkingDirectory)/$(dockerFileName)
imageName: '$(imageName)-test'
2. Run the tests, by invoking docker run against the test runner container.
- task: Docker@1
inputs:
azureSubscriptionEndpoint: $(AzureSubscription)
azureContainerRegistry: $(AzureContainerRegistry)
command: 'run'
containerName: testrunner
volumes: '$(System.DefaultWorkingDirectory)/TestResults:/app/tests/TestResults'
imageName: '$(imageName)-test'
runInBackground: false
3. Publish the test results.
- task: PublishTestResults@2
inputs:
testResultsFormat: 'VSTest'
testResultsFiles: 'TestResults/*.trx'
searchFolder: '$(System.DefaultWorkingDirectory)'
publishRunAttachments: true
4. Build the runtime container image.
- task: Docker@1
inputs:
azureSubscriptionEndpoint: $(AzureSubscription)
azureContainerRegistry: $(AzureContainerRegistry)
dockerFile: $(System.DefaultWorkingDirectory)/$(dockerFileName)
includeLatestTag: false
imageName: '$(imageName)'
5. Push the container image to Azure Container Registry (or another container registry).
- task: Docker@1
inputs:
azureSubscriptionEndpoint: $(AzureSubscription)
azureContainerRegistry: $(AzureContainerRegistry)
command: 'Push an image'
imageName: '$(imageName)'
includeSourceTags: false
6. Package the Helm chart.
- task: HelmDeploy@0
inputs:
command: package
chartPath: $(chartPath)
chartVersion: $(Build.SourceBranchName)
arguments: '--app-version $(Build.SourceBranchName)'
7. Push the Helm package to Azure Container Registry (or other Helm repository).
- task: AzureCLI@1
inputs:
azureSubscription: $(AzureSubscription)
scriptLocation: inlineScript
inlineScript: |
az acr helm push $(System.ArtifactsDirectory)/$(repositoryName)-$(Build.SourceBranchName).tgz --name $(AzureContainerRegistry);
See the source file.
The output from the CI pipeline is a production-ready container image and an updated Helm chart for the
microservice. At this point, the release pipeline can take over. It performs the following steps:
Deploy to dev/QA/staging environments.
Wait for an approver to approve or reject the deployment.
Retag the container image for release.
Push the release tag to the container registry.
Upgrade the Helm chart in the production cluster.
For more information about creating a release pipeline, see Release pipelines, draft releases, and release options.
The following diagram shows the end-to-end CI/CD process described in this article:
Next steps
This article was based on a reference implementation that you can find on GitHub.
Monoliths to microservices using domain-driven
design
12/18/2020 • 6 minutes to read • Edit Online
This article describes how to use domain-driven design (DDD) to migrate a monolithic application to microservices.
A monolithic application is typically an application system in which all of the relevant modules are packaged
together as a single deployable unit of execution. For example, it might be a Java Web Application (WAR) running
on Tomcat or an ASP.NET application running on IIS. A typical monolithic application uses a layered design, with
separate layers for UI, application logic, and data access.
These systems start small but tend to grow over time to meet business needs. At some point, as new features are
added, a monolithic application can begin to suffer from the following problems:
The individual parts of the system cannot be scaled independently, because they are tightly coupled.
It is hard to maintain the code, because of tight coupling and hidden dependencies.
Testing becomes harder, increasing the probability of introducing vulnerabilities.
These problems can become an obstacle to future growth and stability. Teams become wary of making changes,
especially if the original developers are no longer working on the project and design documents are sparse or
outdated.
Despite these limitations, a monolithic design can make sense as a starting point for an application. Monoliths are
often the quickest path to building a proof-of-concept or minimal viable product. In the early phases of
development, monoliths tend to be:
Easier to build, because there is a single shared code base.
Easier to debug, because the code runs within a single process and memory space.
Easier to reason about, because there are fewer moving parts.
As the application grows in complexity, however, these advantages can disappear. Large monoliths often become
progressively harder to build, debug, and reason about. At some point, the problems outweigh the benefits. This is
the point when it can make sense to migrate the application to a microservices architecture. Unlike monoliths,
microservices are typically decentralized, loosely coupled units of execution. The following diagram shows a typical
microservices architecture:
Migrating a monolith to a microservices architecture requires significant time and investment to avoid failures or
overruns. To ensure that the migration is successful, it's important to understand both the benefits and the
challenges that microservices bring. The benefits include:
Services can evolve independently based on user needs.
Services can scale independently to meet user demand.
Over time, development cycles become faster as features can be released to market quicker.
Services are isolated and are more tolerant of failure.
A single service that fails will not bring down the entire application.
Testing becomes more coherent and consistent, using behavior-driven development.
For more information about the benefits and challenges of microservices, see Microservices architecture style.
For more information about using a DDD approach for microservices architectures, see Using domain analysis to
model microservices.
The glue code (adapter pattern) effectively acts as an anti-corruption layer, ensuring that the new service is not
polluted by data models required by the monolithic application. The glue code helps to mediate interactions
between the two and ensures that only data required by the new service is passed to enable compatibility.
Through the process of refactoring, teams can inventory the monolithic application and identify candidates for
microservices refactoring while also establishing new functionality with new services.
For more information about anti-corruption layers, see Anti-Corruption Layer pattern.
This diagram also introduces another layer, the API gateway, that sits between the presentation layer and the
application logic. The API gateway is a façade layer that provides a consistent and uniform interface for the
presentation layer to interact with, while allowing downstream services to evolve independently, without affecting
the application. The API Gateway may use a technology such as Azure API Management, and allows the application
to interact in a RESTful manner.
The presentation tier can be developed in any language or framework that the team has expertise in, such as a
single page application or an MVC application. These applications interact with the microservices via the gateway,
using standard HTTP calls. For more information about API Gateways, see Using API gateways in microservices.
Next steps
When the application has been decomposed into constituent microservices, it becomes possible to use modern
orchestration tools such as Azure DevOps to manage the lifecycle of each service. For more information, see CI/CD
for microservices architectures.
Modernize enterprise applications with Azure Service
Fabric
12/18/2020 • 28 minutes to read • Edit Online
This article provides guidelines for moving Windows applications to an Azure compute platform without rewriting.
This migration uses container support in Azure Service Fabric.
A typical approach for migrating existing workloads to the cloud is the lift-and-shift strategy. In IaaS virtual machine
(VM) migrations, you provision VMs with network and storage components and deploy the existing applications
onto those VMs. Unfortunately, lift-and-shift often results in overprovisioning and overpaying for compute
resources. Another approach is to move to PaaS platforms or refactor code into microservices and run in newer
serverless platforms. But those options typically involve changing existing code.
Containers and container orchestration offer improvements. Containerizing an existing application enables it to run
on a cluster with other applications. It provides tight control over resources, scaling, and shared monitoring and
DevOps.
Optimizing and provisioning the right amount of compute resources for containerization isn't trivial. Service
Fabric's orchestration allows an organization to migrate Windows and Linux applications to a runtime platform
without changing code and to scale to the needs of the application without overprovisioning VMs. The result is better
density, better hardware use, simplified operations, and overall lower cloud-compute costs.
An enterprise can use Service Fabric as a platform to run a large set of existing Windows-based web applications
with improved density, monitoring, consistency, and DevOps, all within a secure extended private network in the
cloud. The principle is to use Docker and Service Fabric's containerization support that packages and hosts existing
web applications on a shared cluster with shared monitoring and operations, in order to maximize cloud compute
resources for the ideal performance-to-cost ratio.
This article describes the processes, capabilities, and Service Fabric features that enable containerizing in an optimal
environment for a large enterprise. The guidance is scoped to web applications and Windows containers. Before
reading this article, get familiar with core Windows container and Service Fabric concepts. For more information,
see:
Create your first Service Fabric container application on Windows
Service Fabric terminology overview
Service Fabric best practices overview
Resources
Sample: Modernization templates and scripts.
The repo has these resources:
An example Azure Resource Manager template to bring up an Azure Service Fabric cluster.
A reverse-proxy solution for brokering web requests into the Service Fabric cluster to the destination containers.
Sample Service Fabric application configuration and scripts that show the use of placement, resource
constraints, and autoscaling.
Sample scripts and Docker files that build and package an existing web application.
Customize the templates in this repo for your cluster. The templates implement the best practices described in this
article.
Evaluate requirements
Before containerizing existing applications, evaluate requirements. Select applications that are right for migration,
choose the right developer workstation, and determine network requirements.
Application selection
First, determine which types of applications are best suited for a containerized platform, as opposed to full virtual
machines or a pure PaaS environment. The application could be a shared application that is built with Service Fabric to share
Windows Server hosts across various containerized applications. Each Service Fabric host can run multiple different
applications running in isolated Windows containers.
Consider creating a set of criteria to determine such applications. Here are some example criteria of containerized
Windows applications in Service Fabric.
HTTP/HTTPS web and application tiers without database dependency.
Stateless web applications.
Built with .NET Framework versions 3.5 and later.
Do not have hardware dependency or access device drivers.
Applications can run on Windows Server 2016 and later versions.
All dependencies can be containerized, such as most .NET assemblies, WCF, and COM+.
NOTE
Dependencies that cannot be containerized include MSMQ (currently supported only in preview releases of
Windows Server Core later than version 1709).
NOTE
You can use multiple containers; one per tier.
FROM mcr.microsoft.com/dotnet/framework/aspnet:4.8
ADD PublishOutput/ /inetpub/wwwroot
4. Test locally by using Docker For Windows. The application must successfully run in a Docker container by
using the Visual Studio debug experience. For more information, see Deploy a .NET app using Docker
Compose.
5. Build (if needed), tag, and push the tested image to a container registry, like the Azure Container Registry
service. This example uses an existing Azure Container Registry named MyAcr and Docker build/tag/push to
build/deploy appA to the registry.
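A sketch of those commands, assuming the registry login server is myacr.azurecr.io and the image is tagged 1.0
(both values are illustrative):
docker build -t appa:1.0 .
docker tag appa:1.0 myacr.azurecr.io/appa:1.0
az acr login --name MyAcr
docker push myacr.azurecr.io/appa:1.0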
The image is tagged with a version number that Service Fabric references when it deploys and versions the
container. Azure DevOps encapsulates and executes the manual Docker build/tag/push process. DevOps details are
described in the DevOps and CI/CD section.
NOTE
In the preceding example, the base image is "mcr.microsoft.com/dotnet/framework/aspnet:4.8" from the Microsoft Container
Registry.
In the preceding image, the application node type has three nodes that run three containerized applications A, B,
and C. The containers can scale on demand based on CPU and memory usage. The hosts can scale out as more
applications are deployed to the cluster. Azure Load Balancer has a public IP. The primary nodes are in an internal
subnet, behind a load balancer that isn't publicly exposed.
Placement constraints
Use placement constraints that target the non-primary node type to reserve the primary node type for system
services.
This approach enables you to configure each scale set with a separate virtual network subnet. A unique subnet for
each node type uses network security group rules for inbound/outbound access to and from the subnet and
controls network flow. You can configure the primary node type with a private load balancer so that external traffic
can't access Service Fabric management and application deployment. If the application node type wants to expose
application endpoints publicly, configure it with a separate load balancer with security configuration. For an
example template that uses Windows Service with NSG rule and multiple node types, see 7 Node, 3 node type
secure Windows Service Fabric Cluster with NSG.
A scale set associated with a node type can reliably scale out to 100 VM instances by using a single placement
group as applications are added to the cluster. The primary node type often doesn't need as many instances and
can run with 5-7 nodes, depending on durability and reliability requirements.
Service Fabric networking
Service Fabric supports two networking modes for containerized applications: nat and Open. For large
enterprise clusters that host multiple applications, use the Open mode. For more information, see Container
Networking and Constraints.
nat
By default, the cluster brings up containers by using a NAT-bridge mode to the host VM. The NAT bridge
routes requests over a defined port to the container. With this mode, only one IP address is needed per host
VM for the host's primary NIC.
To route traffic to each application container, a unique port is exposed through the load balancer. However,
that port is exposed to end users. If you don't want the port exposed, provide a URL rewrite mechanism.
Rewrite the application domain name with a unique application port. Traffic is routed to the load balancer
that front ends the cluster. One option for the rewrite mechanism is Azure Application Gateway.
Another benefit of this approach is straightforward load balancing with Azure Load Balancer. The load balancer's
probe mechanism balances traffic across the VM instances that are running the application's containers.
Open
The Open mode assigns an IP address to each running container on the host VM from the cluster's virtual
network subnet. Each host is pre-allocated with a set of IP addresses. Each container on the host is assigned
an IP from the virtual network range. You can configure this mode in the cluster Azure Resource Manager
template during cluster creation. The example infrastructure demonstrates the Open mode.
Benefits of the Open mode:
Makes connecting to application containers simple.
Provides application traceability; that is, the assigned enterprise-friendly IP address is constant for the life of the
container.
Is efficient with Windows containers.
There are downsides:
The number of IP addresses must be set aside during cluster creation and that number is fixed. For example, if
10 IP addresses are assigned to each host, the host supports up to 9 application containers; one IP is reserved
for the host VM's primary NIC, and the remaining addresses are used for containers. The number of IP addresses to
configure for each host is determined based on the hardware size for the node type, and the maximum number
of containers on each host. Both factors depend on application size and needs. If you need more containers in
the cluster, add more application node type VM instances. That is because you can't change the IP number per
instance without rebuilding the cluster.
You need a reverse proxy to route traffic to the correct destination container. The Service Fabric DNS Service can
be used by the reverse proxy to look up the application container and rewrite the HTTP request to this container.
That solution is more complex to implement than the nat mode.
For more information, see Service Fabric container networking modes.
Service Fabric runtime and Windows versions
Choose the version of Windows Server you want to run as the infrastructure host OS. The choice depends on the
testing and readiness at the time of first deployment. Supported versions are Windows Server with Containers
1607, 1709, 1803.
Also choose a version of the Service Fabric runtime to start the cluster. For a new containerization initiative, use the
latest version (Service Fabric 6.4 or later).
Other considerations:
The host OS running Service Fabric can be built by using locked down enterprise images with antivirus and
malware enforcement.
The host OS for all the Service Fabric cluster VMs can be domain joined. Domain joining isn't a requirement and
can add complexity to any Azure virtual machine. However, there are benefits. For example, if an application
requires Windows-integrated security to connect to domain-joined resources, then the domain service account
is typically used at the process level. The account is used to execute the application instead of connection string
credentials and secrets. Windows containers do not currently support direct domain joining but can be
configured to use group Managed Service Accounts (gMSA). The gMSA capability is supported by Service
Fabric for Windows-integrated security requirements. Each containerized application in the cluster can have its
own gMSA. gMSA requires the Service Fabric host VMs that run containers to be Active Directory domain
joined.
DNS service
Service Fabric has an internal DNS service that maps a containerized application name to its location in the cluster.
For the Open mode, the application's IP address is used instead. This service is enabled on the cluster. If you name
each service with a DNS name, traffic is routed to the application by using a reverse proxy. For information about
reverse proxy, see the Reverse proxy for inbound traffic section.
Monitoring and diagnostics
The Service Fabric Log Analytics workspace and Service Fabric solution provide detailed information about cluster
events. For information about setting it up, see Configure Azure Monitor logs to collect cluster events.
Install the monitoring agent for Azure Log Analytics in the Service Fabric cluster. You can then use the container
monitoring solution and view the running containers in the cluster.
Use Docker performance statistics to monitor memory and CPU use for each container.
Install the Azure Diagnostics extension that collects diagnostic data on a deployed application. For more
information, see What is Azure Diagnostics extension.
Unused container images
Container images are downloaded to each Service Fabric host and can consume space on the host disk. To free up
disk space, consider image pruning to remove images that are no longer referenced and used by running
containers. Configure this option in the Host section of the cluster manifest.
Secrets and certificates management with Key Vault
Avoid hardcoded secrets in the template. Modify the Azure Resource Manager template to add certificates (cluster
certificate) and secrets (host OS admin password) from Azure Key Vault during cluster deployment. To safely store
and retrieve secrets:
1. Create a Key Vault in a separate resource group in the same region as the cluster.
2. Protect the Key Vault with RBAC.
3. Load the cluster certificate and cluster password in Key Vault.
4. Reference this external Key Vault from the Azure Resource Manager template.
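For example, in the parameter file passed to the deployment, a secret such as the admin password can be pulled
from Key Vault at deployment time. The resource IDs and secret name below are placeholders.
"adminPassword": {
  "reference": {
    "keyVault": {
      "id": "/subscriptions/<subscription-id>/resourceGroups/<keyvault-rg>/providers/Microsoft.KeyVault/vaults/<vault-name>"
    },
    "secretName": "clusterAdminPassword"
  }
}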
Cluster capacity planning
For an enterprise production cluster, start with 5-7 primary nodes, depending on the intended size of the overall
cluster. For more information, see Service Fabric cluster capacity planning considerations.
For a large cluster that hosts stateless web applications, the typical hardware size for the primary node type can be
Standard_D2s_v3. For the primary node type, a 5-node cluster with the Silver durability tier should be able to scale
containers across 50 application VM instances. Consider doing simulation tests and add more node types as
partitioning needs become apparent. Also perform tests to determine the VM size for application node type and the
number of running containers on each VM instance. Typically, 10 or more containers run on each VM, but that
number is dependent on these factors:
Resource needs, such as CPU and memory use, of the containerized applications. Web applications tend to be
more CPU intensive than memory intensive.
The compute size and available resources of the chosen node VM for the application node type.
The number of IP addresses configured per node. Each container requires an IP with Open mode. Given 15
containers on a VM, 16 IP addresses must be allocated for each VM in the application node type.
Resource needs of the underlying OS. The recommendation is to leave 25% of the VM's resources for OS
system processes.
Here are some recommendations:
Choose recent Windows Server builds, because older builds of Windows Server (version 1607) have larger container image sizes than newer versions (1709 and later).
Select the most appropriate VM compute size and initial application node type VM count. Standard_D8s_v3,
Standard_D16s_v3, and Standard_D32s_v3 are example compute sizes that are known to work well for running
containerized web applications in this tier.
Use premium storage for the production cluster because container images must be read quickly from disk. Attach
one data disk to each VM and move the Docker EE installation to that disk. That way there is enough log space for
each running container. A standard VM with the default 127-GB OS disk fills up because of container logs. A 500 GB
to 1 TB data disk is recommended per VM for the application node type scale set.
Availability and resiliency planning
Service Fabric spreads VM instances across fault and update domains to ensure maximum availability and
resilience. For an enterprise-scale production cluster, as a minimum target the Silver tier (5-node cluster) for
reliability and durability when hosting containerized stateless web applications.
Scale planning
There are two aspects to consider:
Scale in or scale out the scale set associated with the application node type by adding or removing VM
instances. VMs can be added and removed automatically through autoscaling or manually through scripts. For
an enterprise cluster, do not add or remove nodes frequently. It is important that you disable nodes in Service
Fabric before deleting them. Do not delete seed nodes from your cluster. Monitor the cluster VMs with alerts to
trigger when VM resources exceed threshold values.
Scale the application's container count across the scale set instances in the application node type.
For the application node type, start with the minimal required nodes to support the containerized applications and
ensure high availability. Have extra nodes in the cluster. A node can be removed from the cluster for maintenance
and its running containers can be temporarily moved to other nodes. The cluster can grow statically by using the
Add-AzureRmServiceFabricNode cmdlet, or dynamically by using the scale set autoscale capability.
Starting with Service Fabric 6.2, application containers can autoscale individually through configuration. The
configuration is based on Docker statistics for CPU and memory use on the host. Together, those capabilities can
optimize the compute cost.
Rather than overprovisioning a VM to run an application, deploy a container to the cluster, monitor it, and then
scale out with additional containers, as necessary. Choosing the right size is easier because Docker statistics are
used to determine the number of containers.
In the example infrastructure, application A has two containers deployed across the cluster that divide load. This
approach allows the application and container combination to be adjusted later for optimization.
NOTE
Docker statistics showing individual container resource utilization are sent to Log Analytics and can be analyzed in Azure
Monitor.
Service Fabric offers constant monitoring and health checks across the cluster. If a node is unhealthy, applications
on that node automatically move to a healthy node and the bad node stops receiving requests. Regardless of the
number of containers hosting an application, Service Fabric ensures that the application is healthy and running.
For an application that is infrequently used and can be offline, run it in the cluster with just one container instance
(such as application B and C). Service Fabric makes sure that the application is up and healthy during upgrades or
when the container needs to move to a new VM. Health checking can reduce cost compared to hosting that
application on two redundant and overprovisioned VMs in the traditional IaaS model.
The application is configured to listen on ports 80 and 443. By using the Open mode and reverse proxy, all
applications can share the same ports.
The application is containerized to an image named: myacr.azurecr.io/appa:1.0. Service Fabric invokes the Docker
daemon to pull down the image when the application is deployed. Service Fabric handles all interactions with
Docker.
The reverse proxy container uses similar manifests but isn't configured to use the Open mode. You can update
containers by versioning the Docker image, then redeploying the versioned Service Fabric package with the
Start-ServiceFabricApplicationUpgrade cmdlet.
For information about manifests, see Service Fabric application and service manifests.
Environmental configuration
Do not hardcode configuration values in the container image; instead, use environment variables to pass values to a
container. A DevOps pipeline can build a container image, test in a test environment, promote to staging (or pre-
production), and promote to production. Do not rebuild an image for each environment.
Docker can pass environment variables directly to a container when starting one. In this example, Docker passes
the eShopTitle variable to the eshopweb container:
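A sketch of that command line follows; the image name and value are illustrative, and the sample's exact command isn't reproduced here:
docker run -d -e eShopTitle=<title-value> myacr.azurecr.io/eshopweb:1.0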
In a Service Fabric cluster, Service Fabric controls Docker execution and lists environment variables in the
ServiceManifest. Those variables are passed automatically when Service Fabric runs the container. You can override
the variables in ApplicationManifest.xml by using the EnvironmentOverrides element, which can be parameterized
and built from Visual Studio publish profiles for each environment.
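For illustration, here is a hedged sketch of the relevant manifest fragments; the package, image, service, and parameter names are assumptions rather than excerpts from the sample. In ServiceManifest.xml, the variable is declared on the code package:
<CodePackage Name="Code" Version="1.0">
  <EntryPoint>
    <ContainerHost>
      <ImageName>myacr.azurecr.io/eshopweb:1.0</ImageName>
    </ContainerHost>
  </EntryPoint>
  <EnvironmentVariables>
    <EnvironmentVariable Name="eShopTitle" Value="" />
  </EnvironmentVariables>
</CodePackage>
In ApplicationManifest.xml, the value is overridden per environment, optionally through a parameter:
<ServiceManifestImport>
  <ServiceManifestRef ServiceManifestName="eShopWebPkg" ServiceManifestVersion="1.0" />
  <EnvironmentOverrides CodePackageRef="Code">
    <EnvironmentVariable Name="eShopTitle" Value="[eShopTitle]" />
  </EnvironmentOverrides>
</ServiceManifestImport>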
For more information, see How to specify environment variables for services in Service Fabric.
Security considerations
Here are some articles about container security:
Azure Service Fabric security
Service Fabric application and service security
Set up an encryption certificate and encrypt secrets on Windows clusters
The following VM extensions on the cluster nodes provide diagnostics data:
ServiceFabricNode. Links the node to a storage account (support log) for tracing support. This log is used when
a ticket is opened.
IaaSDiagnostics. Collects platform events, such as ServiceFabricSystemEventTable, and stores that data in a blob
storage account (app log). The account is consumed in Log Analytics.
MicrosoftMonitoringAgent. Contains all the performance data such as Docker statistics. The data (such as
ContainerInventory and ContainerLog) is sent to Log Analytics.
Application log
If your containerized application runs in a shared cluster, you can get logs such as IIS and custom logs from the
container into Log Analytics. This option is recommended because of speed, scalability, and the ability to handle
large amounts of unstructured data.
Set up log rotation through Docker to keep the logs size manageable. For more information, see Rotating Docker
Logs - Keeping your overlay folder small.
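For example, log rotation can be configured in the Docker daemon.json file on each host; the driver and limits shown here are illustrative defaults, not values prescribed by this architecture:
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}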
Here are two approaches for getting application logs into Log Analytics.
Use the existing Container Monitoring Solution installed in Log Analytics. The solution automatically sends
data from the container log directories on the VM (C:\ProgramData\docker\windowsfilter*) to the
configured Log Analytics workspace. Each container creates a directory underneath the \WindowsFilter path,
and the contents are streamed to Log Analytics from the MicrosoftMonitoringAgent on the VM. This way, you
can send application logs to one or more shared directories in the container and relay the logs, by using Docker
logs, to the monitored container log folders.
1. Write a process script that runs in the container periodically and analyzes log files.
2. Monitor the log file changes in the shared folder and write the log changes to the command window
where Docker Logs can capture the information outside the container.
Each container records any output sent to its command line. Access that output outside the
container; the Container Monitoring Solution collects it automatically.
Move Docker to an attached VM data disk with enough storage to make sure the OS drive doesn't fill up with
container logs.
The automatic Container Monitoring Solution sends all logs to a single Log Analytics workspace. Different
containerized applications running on the same host send application logs to that shared workspace. If you
need to isolate logs such that each containerized application sends the log to a separate workspace, supply a
custom solution. That content is outside the scope of this article.
Mount external storage to each running container by using a file management service such as Azure Files.
The container logs are sent to the external storage location and don't take up disk space on the host VM.
You don't need a data disk attached to each VM to hold Docker logs, or to move Docker Enterprise to the data
disk.
Create a job to monitor the Azure Files location and send logs to the appropriate Log Analytics workspace
for each installed application. The job doesn't need to run in the container. It just observes the Azure Files
location.
The template sets up the build process and tasks for CI/CD by building and containerizing the application, pushing
the container image to a container registry (Azure Container Registry is the default), and deploying the Service
Fabric application with the containerized services to the cluster. Each application code change creates a version of
the code and an updated containerized image. Service Fabric's rolling upgrade feature deploys service upgrades
gracefully.
Here is an example of a build starting the full DevOps process on an Azure-provided hosted build agent. Some
enterprises may require the build agents to run internally, within their corporate network or a private Azure virtual
network. Set up a Windows build agent VM and instruct Azure DevOps to use the private VM for building and
deploying code. For information about using custom build agents, see Self-hosted Windows agents.
Conclusion
Here is the summary of best practices:
Before containerizing existing applications, select applications that are suitable for this migration, choose the
right developer workstation, and determine network requirements.
Do not run your application container on the primary node type. Instead, configure the cluster with two or more
node types and run the application tier containers on a non-primary node type. Use placement constraints that
target the non-primary node type to reserve the primary node type for system services.
Use the Open networking mode.
Use an ingress reverse proxy, such as the IIS Application Request Routing. The reverse proxy inspects inbound
traffic and rewrites the traffic to the destination container.
Do not hardcode configuration values in the container image; use environment variables to pass values to a
container.
Monitor the application, platform events, and infrastructure metrics by using IaasDiagnostics and
MicrosoftMonitoringAgent extensions. View the logs in Application Insights and Log Analytics.
Use the latest approved corporate image and provide an automatable image updating process that is consistent
through DevOps.
Next steps
Get the latest version of the tools you need for containerizing, such as Visual Studio and Docker for Windows.
Customize these templates to meet your requirements. Sample: Modernization templates and scripts.
Migrate an Azure Cloud Services application to
Azure Service Fabric
12/18/2020 • 18 minutes to read • Edit Online
Sample code
This article describes migrating an application from Azure Cloud Services to Azure Service Fabric. It focuses on
architectural decisions and recommended practices.
For this project, we started with a Cloud Services application called Surveys and ported it to Service Fabric. The
goal was to migrate the application with as few changes as possible. Later in the article, we show how to optimize
the application for Service Fabric.
Before reading this article, it will be useful to understand the basics of Service Fabric. See Overview of Azure
Service Fabric.
Packaging: Cloud Services uses cloud service package files (.cspkg); Service Fabric uses application and service packages.
Autoscaling: Cloud Services has a built-in autoscaling service; Service Fabric scales out by using virtual machine scale
sets.
* Stateful services use reliable collections to store state across replicas, so that all reads are local to the nodes in the
cluster. Writes are replicated across nodes for reliability. Stateless services can have external state, using a database
or other external storage.
** Worker roles can also self-host ASP.NET Web API using OWIN.
NOTE
When using ASP.NET Core with Kestrel, you should place a reverse proxy in front of Kestrel to handle traffic from the Internet,
for security reasons. For more information, see Kestrel web server implementation in ASP.NET Core. The section Deploying the
application describes a recommended Azure deployment.
HTTP listeners
In Cloud Services, a web or worker role exposes an HTTP endpoint by declaring it in the service definition file. A
web role must have at least one endpoint.
<!-- Cloud service endpoint -->
<Endpoints>
<InputEndpoint name="HttpIn" protocol="http" port="80" />
</Endpoints>
Unlike a cloud service role, Service Fabric services can be colocated within the same node. Therefore, every service
must listen on a distinct port. Later in this article, we'll discuss how client requests on port 80 or port 443 get
routed to the correct port for the service.
A service must explicitly create listeners for each endpoint. The reason is that Service Fabric is agnostic about
communication stacks. For more information, see Build a web service front end for your application using ASP.NET
Core.
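As an illustration of creating such a listener, here is a minimal sketch for a stateless ASP.NET Core service that uses Kestrel; the endpoint name "ServiceEndpoint" and the Startup class are assumptions, not part of the Surveys sample:
// Inside a class that derives from StatelessService
protected override IEnumerable<ServiceInstanceListener> CreateServiceInstanceListeners()
{
    return new[]
    {
        new ServiceInstanceListener(serviceContext =>
            new KestrelCommunicationListener(serviceContext, "ServiceEndpoint", (url, listener) =>
                new WebHostBuilder()
                    .UseKestrel()
                    .UseStartup<Startup>()
                    .UseUrls(url)
                    .Build()))
    };
}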
FILE | DESCRIPTION
Service definition (.csdef) | Settings used by Azure to configure the cloud service. Defines the roles, endpoints, startup
tasks, and the names of configuration settings.
Service package (.cspkg) | Contains the application code and configurations, and the service definition file.
There is one .csdef file for the entire application. You can have multiple .cscfg files for different environments, such
as local, test, or production. When the service is running, you can update the .cscfg but not the .csdef. For more
information, see What is the Cloud Service model and how do I package it?
Service Fabric has a similar division between a service definition and service settings, but the structure is more
granular. To understand Service Fabric's configuration model, it helps to understand how a Service Fabric
application is packaged. Here is the structure:
Application package
- Service packages
- Code package
- Configuration package
- Data package (optional)
The application package is what you deploy. It contains one or more service packages. A service package contains
code, configuration, and data packages. The code package contains the binaries for the services, and the
configuration package contains configuration settings. This model allows you to upgrade individual services
without redeploying the entire application. It also lets you update just the configuration settings, without
redeploying the code or restarting the service.
A Service Fabric application contains the following configuration files:
FILE | LOCATION | DESCRIPTION
ApplicationManifest.xml | Application package | Describes the application and the services that compose it, and defines application-level parameters.
ServiceManifest.xml | Service package | Describes one or more services and their code, configuration, and data packages.
Settings.xml | Configuration package | Contains configuration settings for the services defined in the service package.
The Service Fabric cluster is deployed to a virtual machine scale set. Scale sets are an Azure Compute resource that
can be used to deploy and manage a set of identical VMs.
As mentioned, it's recommended to place the Kestrel web server behind a reverse proxy for security reasons. This
diagram shows Azure Application Gateway, which is an Azure service that offers various layer 7 load-balancing
capabilities. It acts as a reverse-proxy service, terminating the client connection and forwarding requests to back-
end endpoints. You might use a different reverse proxy solution, such as nginx.
Layer 7 routing
In the original Surveys application, one web role listened on port 80, and the other web role listened on port 443.
PUBLIC SITE | SURVEY MANAGEMENT SITE
http://tailspin.cloudapp.net | https://tailspin.cloudapp.net
Another option is to use layer 7 routing. In this approach, different URL paths get routed to different port numbers
on the back end. For example, the public site might use URL paths starting with /public/ .
Options for layer 7 routing include:
Use Application Gateway.
Use a network virtual appliance (NVA), such as nginx.
Write a custom gateway as a stateless service.
Consider this approach if you have two or more services with public HTTP endpoints, but want them to appear as
one site with a single domain name.
One approach that we don't recommend is allowing external clients to send requests through the Service Fabric
reverse proxy. Although this is possible, the reverse proxy is intended for service-to-service communication.
Opening it to external clients exposes any service running in the cluster that has an HTTP endpoint.
If a cluster has multiple node types, one node type is designated as the primary node type. Service Fabric
runtime services, such as the Cluster Management Service, run on the primary node type. Provision at least 5
nodes for the primary node type in a production environment. The other node type should have at least 2
nodes.
Design considerations
The following diagram shows the architecture of the Surveys application refactored to a more granular
architecture:
Tailspin.Web is a stateless service self-hosting an ASP.NET MVC application that Tailspin customers visit to create
surveys and view survey results. This service shares most of its code with the Tailspin.Web service from the ported
Service Fabric application. As mentioned earlier, this service uses ASP.NET Core and switches from using Kestrel as
the web frontend to using WebListener.
Tailspin.Web.Survey.Public is a stateless service also self-hosting an ASP.NET MVC site. Users visit this site to
select surveys from a list and then fill them out. This service shares most of its code with the
Tailspin.Web.Survey.Public service from the ported Service Fabric application. This service also uses ASP.NET Core
and also switches from using Kestrel as the web frontend to using WebListener.
Tailspin.SurveyResponseService is a stateful service that stores survey answers in Azure Blob Storage. It also
merges answers into the survey analysis data. The service is implemented as a stateful service because it uses a
ReliableConcurrentQueue to process survey answers in batches. This functionality was originally implemented in
the Tailspin.AnswerAnalysisService service in the ported Service Fabric application.
Tailspin.SurveyManagementService is a stateless service that stores and retrieves surveys and survey
questions. The service uses Azure Blob storage. This functionality was also originally implemented in the data
access components of the Tailspin.Web and Tailspin.Web.Survey.Public services in the ported Service Fabric
application. Tailspin refactored the original functionality into this service to allow it to scale independently.
Tailspin.SurveyAnswerService is a stateless service that retrieves survey answers and survey analysis. The
service also uses Azure Blob storage. This functionality was also originally implemented in the data access
components of the Tailspin.Web service in the ported Service Fabric application. Tailspin refactored the original
functionality into this service because it expects less load and wants to use fewer instances to conserve resources.
Tailspin.SurveyAnalysisService is a stateless service that persists survey answer summary data in a Redis cache
for quick retrieval. This service is called by the Tailspin.SurveyResponseService each time a survey is answered and
the new survey answer data is merged in the summary data. This service includes the functionality originally
implemented in the Tailspin.AnswerAnalysisService service from the ported Service Fabric application.
Communication framework
Each service in the Surveys application communicates using a RESTful web API. RESTful APIs offer the following
benefits:
Ease of use: each service is built using ASP.NET Core MVC, which natively supports the creation of Web APIs.
Security: While each service does not require SSL, Tailspin could require each service to do so.
Versioning: clients can be written and tested against a specific version of a web API.
Services in the Survey application use the reverse proxy implemented by Service Fabric. Reverse proxy is a service
that runs on each node in the Service Fabric cluster and provides endpoint resolution, automatic retry, and handles
other types of connection failures. To use the reverse proxy, each RESTful API call to a specific service is made using
a predefined reverse proxy port. For example, if the reverse proxy port has been set to 19081, a call to the
Tailspin.SurveyAnswerService can be made as follows:
static SurveyAnswerService()
{
httpClient = new HttpClient
{
BaseAddress = new Uri("http://localhost:19081/Tailspin/SurveyAnswerService/")
};
}
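With that base address, a relative call such as the following resolves through the reverse proxy to whichever node currently hosts the service (the route and variable are illustrative):
HttpResponseMessage response = await httpClient.GetAsync($"surveys/answers/{answerId}");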
To enable reverse proxy, specify a reverse proxy port during creation of the Service Fabric cluster. For more
information, see reverse proxy in Azure Service Fabric.
Performance considerations
Tailspin created the ASP.NET Core services for Tailspin.Web and Tailspin.Web.Surveys.Public using Visual Studio
templates. By default, these templates include logging to the console. Console logging is useful during development
and debugging, but it should be removed when the application is deployed to production.
NOTE
For more information about setting up monitoring and diagnostics for Service Fabric applications running in production, see
monitoring and diagnostics for Azure Service Fabric.
For example, the following lines in startup.cs for each of the web front end services should be commented out:
// This method gets called by the runtime. Use this method to configure the HTTP request pipeline.
public void Configure(IApplicationBuilder app, IHostingEnvironment env, ILoggerFactory loggerFactory)
{
//loggerFactory.AddConsole(Configuration.GetSection("Logging"));
//loggerFactory.AddDebug();
app.UseMvc();
}
NOTE
These lines may be conditionally excluded when Visual Studio is set to release when publishing.
Finally, when Tailspin deploys the Tailspin application to production, they switch Visual Studio to release mode.
Deployment considerations
The refactored Surveys application is composed of five stateless services and one stateful service, so cluster
planning is limited to determining the correct VM size and number of nodes. In the ApplicationManifest.xml file that
describes the application, Tailspin sets the InstanceCount attribute of the StatelessService tag to -1 for each of the
services. A value of -1 directs Service Fabric to create an instance of the service on each node in the cluster.
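For illustration, a sketch of that setting in ApplicationManifest.xml follows; the service and type names are illustrative:
<DefaultServices>
  <Service Name="TailspinWebService">
    <StatelessService ServiceTypeName="TailspinWebType" InstanceCount="-1">
      <SingletonPartition />
    </StatelessService>
  </Service>
</DefaultServices>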
NOTE
Stateful services require the additional step of planning the correct number of partitions and replicas for their data.
Tailspin deploys the cluster using the Azure portal. The Service Fabric Cluster resource type deploys all of the
necessary infrastructure, including virtual machine scale sets and a load balancer. The recommended VM sizes are
displayed in the Azure portal during the provisioning process for the Service Fabric cluster. Because the VMs are
deployed in a virtual machine scale set, they can be both scaled up and out as user load increases.
Next steps
The Surveys application code is available on GitHub.
If you are just getting started with Azure Service Fabric, first set up your development environment then download
the latest Azure SDK and the Azure Service Fabric SDK. The SDK includes the OneBox cluster manager so you can
deploy and test the Surveys application locally with full F5 debugging.
Serverless Functions overview
12/18/2020 • 2 minutes to read • Edit Online
Serverless architecture evolves cloud platforms toward pure cloud-native code by abstracting code from the
infrastructure that it needs to run. Azure Functions is a serverless compute option that supports functions, small
pieces of code that do single things.
Benefits of using serverless architectures with Functions applications include:
The Azure infrastructure automatically provides all the updated servers that applications need to keep running
at scale.
Compute resources are allocated dynamically and instantly autoscale to meet elastic demands. Serverless doesn't
mean "no server," but "less server," because servers run only as needed.
Micro-billing saves costs by charging only for the compute resources and duration the code uses to execute.
Function bindings streamline integration by providing declarative access to a wide variety of Azure and third-
party services.
Functions are event-driven. An external event like an HTTP web request, message, schedule, or change in data
triggers the function code. A Functions application doesn't code the trigger, only the response to the trigger. With a
lower barrier to entry, developers can focus on business logic, rather than writing code to handle infrastructure
concerns like messaging.
Azure Functions is a managed service in Azure and Azure Stack. The open source Functions runtime works in many
environments, including Kubernetes, Azure IoT Edge, on-premises, and other clouds.
Serverless and Functions require new ways of thinking and new approaches to building applications. They aren't
the right solutions for every problem. For example serverless Functions scenarios, see Reference architectures.
Implementation steps
Successful implementation of serverless technologies with Azure Functions requires the following actions:
Decide and plan
Architects and technical decision makers (TDMs) perform application assessment, conduct or attend
technical workshops and trainings, run proof of concept (PoC) or pilot projects, and conduct architectural
designs sessions as necessary.
Develop and deploy apps
Developers implement serverless Functions app development patterns and practices, configure DevOps
pipelines, and employ site reliability engineering (SRE) best practices.
Manage operations
IT professionals identify hosting configurations, future-proof scalability by automating infrastructure
provisioning, and maintain availability by planning for business continuity and disaster recovery.
Secure apps
Security professionals handle Azure Functions security essentials, secure the hosting setup, and provide
application security guidance.
Related resources
To learn more about serverless technology, see the Azure serverless documentation.
To learn more about Azure Functions, see the Azure Functions documentation.
For help with choosing a compute technology, see Choose an Azure compute service for your application.
Serverless Functions reference architectures
12/18/2020 • 4 minutes to read • Edit Online
A reference architecture is a template of required components and the technical requirements to implement them.
A reference architecture isn't custom-built for a customer solution, but is a high-level scenario based on extensive
experience. Before designing a serverless solution, use a reference architecture to visualize an ideal technical
architecture, then blend and integrate it into your environment.
IoT-connected backends
Manufacturing scenario: A manufacturing company uses
IoT to monitor its machines. Functions detects anomalous
data and triggers a message to the service department
when repair is required.
To plan for moving an application to a serverless Azure Functions architecture, a technical decision maker (TDM) or
architect:
Verifies the application's characteristics and business requirements.
Determines the application's suitability for serverless Azure Functions.
Transforms business requirements into functional and other requirements.
Planning activities also include assessing technical team readiness, providing or attending workshops and training,
and conducting architectural design reviews, proofs of concept, pilots, and technical implementations.
TDMs and architects may perform one or more of the following activities:
Execute an application assessment. Evaluate the main aspects of the application to determine how
complex and risky it is to rearchitect through application modernization, or rebuild a new cloud-native
application. See Application assessment.
Attend or promote technical workshops and training. Host a Serverless workshop or CloudHack, or
enjoy many other training and learning opportunities for serverless technologies, Azure Functions, app
modernization, and cloud-native apps. See Technical workshops and training.
Identify and execute a Proof of Concept (PoC) or pilot, or technical implementation. Deliver a
PoC, pilot, or technical implementation to provide evidence that serverless Azure Functions can solve a
team's business problems. Showing teams how to modernize or build new cloud-native applications to their
specifications can accelerate deployment to production. See PoC or pilot.
Conduct architectural design sessions. An architectural design session (ADS) is an in-depth discussion
on how a new solution will blend into the environment. ADSs validate business requirements and transform
them to functional and other requirements.
Next steps
For example scenarios that use serverless architectures with Azure Functions, see Serverless reference
architectures.
To move forward with serverless Azure Functions implementation, see the following resources:
Application development and deployment
Azure Functions app operations
Azure Functions app security
Application assessment
12/18/2020 • 6 minutes to read • Edit Online
Cloud rationalization is the process of evaluating applications to determine the best way to migrate or modernize
them for the cloud.
Rationalization methods include:
Rehost. Also known as a lift and shift migration, rehost moves a current application to the cloud with minimal
change.
Refactor. Slightly refactoring an application to fit platform-as-a-service (PaaS)-based options can reduce
operational costs.
Rearchitect. Rearchitect aging applications that aren't compatible with cloud components, or cloud-compatible
applications that would realize cost and operational efficiencies by rearchitecting into a cloud-native solution.
Rebuild. If the changes or costs to carry an application forward are too great, consider creating a new cloud-
native code base. Rebuild is especially appropriate for applications that previously met business needs, but are
now unsupported or misaligned with current business processes.
Before you decide on an appropriate strategy, analyze the current application to determine the risk and complexity
of each method. Consider application lifecycle, technology, infrastructure, performance, and operations and
monitoring. For multitier architectures, evaluate the presentation tier, service tier, integrations tier, and data tier.
The following checklists evaluate an application to determine the complexity and risk of rearchitecting or
rebuilding.
Business drivers
Older applications might require extensive changes to get to the cloud.
Technology
Deployment
When assessing deployment requirements, consider:
Number of daily users
Average number of concurrent users
Expected traffic
Bandwidth in Gbps
Requests per second
Amount of memory needed
You can reduce deployment risk by storing code under source control in a version control system such as Git,
Azure DevOps Server, or SVN.
Operations
Security
Results
Count your application's Complexity and Risk checkmarks.
The expected level of complexity to migrate or modernize the application to Azure is: Total Complexity/25.
The expected risk involved is: Total Risk/19.
For both complexity and risk, a score of <0.3 is low, <0.7 is medium, and >0.7 is high. For example, 10 complexity
checkmarks give a score of 10/25 = 0.4, which is medium complexity.
The workshops, classes, and learning materials in this article provide technical training for serverless architectures
with Azure Functions. These resources help you and your team or customers understand and implement
application modernization and cloud-native apps.
Technical workshops
The Microsoft Cloud Workshop (MCW) program provides workshops you can host to foster cloud learning and
adoption. Each workshop includes presentation decks, trainer and student guides, and hands-on lab guides.
Contribute your own content and feedback to add to a robust database of training guides for deploying advanced
Azure workloads on the Microsoft Cloud Platform.
Workshops related to application development workloads include:
Serverless APIs in Azure. Set of entry-level exercises, which cover the basics of building and managing
serverless APIs in Microsoft Azure - with Azure Functions, Azure API Management, and Azure Application
Insights.
Serverless architecture. Implement a series of Azure Functions that independently scale and break down
business logic to discrete components, allowing customers to pay only for the services they use.
App modernization. Design a modernization plan to move services from on-premises to the cloud by
leveraging cloud, web, and mobile services, secured by Azure Active Directory.
Modern cloud apps. Deploy, configure, and implement an end-to-end secure and Payment Card Industry (PCI)
compliant solution for e-commerce, based on Azure App Services, Azure Active Directory, and Azure DevOps.
Cloud-native applications. Using DevOps best practices, build a proof of concept (PoC) to transform a platform-
as-a-service (PaaS) application to a container-based application with multi-tenant web app hosting.
Continuous delivery in Azure DevOps. Set up and configure continuous delivery (CD) in Azure to reduce
manual errors, using Azure Resource Manager templates, Azure DevOps, and Git repositories for source
control.
Instructor-led training
Course AZ-204: Developing solutions for Microsoft Azure teaches developers how to create end-to-end solutions
in Microsoft Azure. Students learn how to implement Azure compute solutions, create Azure Functions, implement
and manage web apps, develop solutions utilizing Azure Storage, implement authentication and authorization, and
secure their solutions by using Azure Key Vault and managed identities. Students also learn to connect to and
consume Azure and third-party services, and include event- and message-based models in their solutions. The
course also covers monitoring, troubleshooting, and optimizing Azure solutions.
Serverless OpenHack
The Serverless OpenHack simulates a real-world scenario where a company wants to utilize serverless services to
build and release an API to integrate into their distributor's application. This OpenHack lets attendees quickly build
and deploy Azure serverless solutions with cutting-edge compute services like Azure Functions, Logic Apps, Event
Grid, Service Bus, Event Hubs, and Cosmos DB. The OpenHack also covers related technologies like API
Management, Azure DevOps or GitHub, Application Insights, Dynamics 365/Microsoft 365, and Cognitive APIs.
OpenHack attendees build serverless functions, web APIs, and a CI/CD pipeline to support them. They implement
further serverless technologies to integrate line of business (LOB) app workflows, process user and data telemetry,
and create key progress indicator (KPI)-aligned reports. By the end of the OpenHack, attendees have built out a full
serverless technical solution that can create workflows between systems and handle events, files, and data
ingestion.
Microsoft customer projects inspired these OpenHack challenges:
Configure the developer environment.
Create your first serverless function and workflow.
Build APIs to support business needs.
Deploy a management layer for APIs and monitoring usage.
Build a LOB workflow process.
Process large amounts of unstructured file data.
Process large amounts of incoming event data.
Implement a publisher/subscriber messaging pattern and virtual network integration.
Conduct sentiment analysis.
Perform data aggregation, analysis, and reporting.
To attend an OpenHack, register at https://openhack.microsoft.com. Enterprises with many engineers can request
that Microsoft organize a dedicated Serverless OpenHack.
Microsoft Learn
Microsoft Learn is a free, online training platform that provides interactive learning for Microsoft products. The
goal is to improve proficiency with fun, guided, hands-on content that's specific to your role and goals. Learning
paths are collections of modules that are organized around specific roles like developer, architect, or system admin,
or technologies like Azure Web Apps, Azure Functions, or Azure SQL DB. Learning paths provide understanding of
different aspects of the technology or role.
Learning paths about serverless apps and Azure Functions include:
Create serverless applications. Learn how to leverage functions to execute server-side logic and build serverless
architectures.
Architect message brokering and serverless applications in Azure. Learn how to create reliable messaging for
your applications, and how to take advantage of serverless application services in Azure.
Search all Functions-related learning paths.
Next steps
Execute an application assessment
Identify and execute a PoC or Pilot project
Proof of Concept or pilot
12/18/2020 • 4 minutes to read • Edit Online
When driving a technical and security decision for your company or customer, a Proof of Concept (PoC) or pilot is
an opportunity to deliver evidence that the proposed solution solves the business problems. The PoC or pilot
increases the likelihood of a successful adoption.
A PoC:
Demonstrates that a business model or idea is feasible and will work to solve the business problem
Usually involves one to three features or capabilities
Can be in one or multiple technologies
Is usually geared toward a particular scenario, and proves what the customer needs to know to make the
technical or security decision
Is used only as a demonstration and won't go into production
Is IT-driven and enablement-driven
A pilot:
Is a test run or trial of a proposed action or product
Lasts longer than a PoC, often weeks or months
Has a higher return on investment (ROI) than a PoC
Builds in a pre-production or trial environment, with the intent that it will then go into production
Is adoption-driven and consumption-driven
Change management
Change management uses tested methods and techniques to avoid errors and minimize impact when
administering change.
Ideally, a pilot includes a cross-section of users, to address any potential issues or problems that arise. Users may
be comfortable and familiar with their old technology, and have difficulty moving into new technical solutions.
Change management keeps this in mind, and helps the user understand the reasons behind the change and the
impact the change will make.
This understanding is part of a pilot, and addresses everyone who has a stake in the project. A pilot is better than a
PoC, because the customer is more involved, so they're more likely to implement the change.
The pilot includes a detailed follow up through surveys or focus groups. The feedback can prove and improve the
change.
Next steps
Execute an application assessment
Promote a technical workshop or training
Code a technical implementation with the team or customer
Related resources
Prosci® change management training
Application development and deployment
12/18/2020 • 4 minutes to read • Edit Online
To develop and deploy serverless applications with Azure Functions, examine patterns and practices, configure
DevOps pipelines, and implement site reliability engineering (SRE) best practices.
For detailed information about serverless architectures and Azure Functions, see:
Serverless apps: Architecture, patterns, and Azure implementation
Azure Serverless Computing Cookbook
Example serverless reference architectures
Planning
To plan app development and deployment:
1. Prepare development environment and set up workflow.
2. Structure projects to support Azure Functions app development.
3. Identify app triggers, bindings, and configuration requirements.
Understand event-driven architecture
A different event triggers every function in a serverless Functions project. For more information about event-
driven architectures, see:
Event-driven architecture style.
Event-driven design patterns to enhance existing applications using Azure Functions
Prepare development environment
Set up your development workflow and environment with the tools to create Functions. For details about
development tools and Functions code project structure, see:
Code and test Azure Functions locally
Develop Azure Functions by using Visual Studio Code
Develop Azure Functions using Visual Studio
Work with Azure Functions Core Tools
Folder structure
Development
Decide on the development language to use. Azure Functions supports C#, F#, PowerShell, JavaScript, TypeScript,
Java, and Python. All of a project's Functions must be in the same language. For more information, see Supported
languages in Azure Functions.
Define triggers and bindings
A trigger invokes a Function, and every Function must have exactly one trigger. Binding to a Function declaratively
connects another resource to the Function; a minimal sketch of a trigger and an output binding appears after the
following list. For more information about Functions triggers and bindings, see:
Azure Functions triggers and bindings concepts
Execute an Azure Function with triggers
Chain Azure Functions together using input and output bindings
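Here is that sketch; the function name, queue name, and authorization level are made up for illustration and aren't taken from a specific sample:
using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;

public static class EnqueueOrder
{
    // The HTTP trigger invokes the function; the Queue attribute is a declarative output binding.
    [FunctionName("EnqueueOrder")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req,
        [Queue("orders")] IAsyncCollector<string> orderQueue,
        ILogger log)
    {
        string body = await new StreamReader(req.Body).ReadToEndAsync();
        await orderQueue.AddAsync(body);
        log.LogInformation("Queued an order of {Length} bytes.", body.Length);
        return new OkResult();
    }
}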
Create the Functions application
Functions follow the single responsibility principle: do only one thing. For more information about Functions
development, see:
Azure Functions developers guide
Create serverless applications
Strategies for testing your code in Azure Functions
Functions best practices
Use Durable Functions for stateful workflows
Durable Functions in Azure Functions let you define stateful workflows in a serverless environment by writing
orchestrator functions, and stateful entities by writing entity functions. Durable Functions manage state,
checkpoints, and restarts, allowing you to focus on business logic. For more information, see What are Durable
Functions.
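For illustration, a minimal orchestrator sketch follows; the function and activity names are hypothetical, and the activities are stubs:
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class OrderOrchestration
{
    // The orchestrator defines the stateful workflow; Durable Functions checkpoints and replays it.
    [FunctionName("ProcessOrderOrchestrator")]
    public static async Task<string> RunOrchestrator(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        string orderId = context.GetInput<string>();
        await context.CallActivityAsync("ReserveInventory", orderId);
        await context.CallActivityAsync("ChargePayment", orderId);
        return $"Order {orderId} processed";
    }

    // Activity functions do the actual work and can be retried independently.
    [FunctionName("ReserveInventory")]
    public static Task ReserveInventory([ActivityTrigger] string orderId) => Task.CompletedTask;

    [FunctionName("ChargePayment")]
    public static Task ChargePayment([ActivityTrigger] string orderId) => Task.CompletedTask;
}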
Understand and address cold starts
If the number of serverless host instances scales down to zero, the next request has the added latency of restarting
the Function app, called a cold start. To minimize the performance impact of cold starts, reduce dependencies that
the Functions app needs to load on startup, and use as few large, synchronous calls and operations as possible.
For more information about autoscaling and cold starts, see Serverless Functions operations.
Manage application secrets
For security, don't store credentials in application code. To use Azure Key Vault with Azure Functions to store and
retrieve keys and credentials, see Use Key Vault references for App Service and Azure Functions.
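For example, an application setting can hold a Key Vault reference instead of the secret itself; the setting, vault, and secret names below are placeholders, and the Functions app's managed identity needs read access to the vault:
"ServiceBusConnection": "@Microsoft.KeyVault(SecretUri=https://<vault-name>.vault.azure.net/secrets/<secret-name>/)"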
For more information about serverless Functions application security, see Serverless Functions security.
Deployment
To prepare serverless Functions application for production, make sure you can:
Fulfill application resource requirements.
Monitor all aspects of the application.
Diagnose and troubleshoot application issues.
Deploy new application versions without affecting production systems.
Define deployment technology
Decide on deployment technology, and organize scheduled releases. For more information about how Functions
app deployment enables reliable, zero-downtime upgrades, see Deployment technologies in Azure Functions.
Avoid using too many resource connections
Functions in a Functions app share resources, including connections to HTTPS, databases, and services such as
Azure Storage. When many Functions are running concurrently, it's possible to run out of available connections.
For more information, see Manage connections in Azure Functions.
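For example, a common pattern from that guidance is to share a single static client across invocations instead of creating one per call; this sketch uses an illustrative endpoint:
using System.Net.Http;
using System.Threading.Tasks;

public static class OrdersApiClient
{
    // One client instance is reused across function invocations on the same host,
    // which avoids exhausting outbound connections under load.
    private static readonly HttpClient httpClient = new HttpClient();

    public static Task<HttpResponseMessage> GetOrdersAsync() =>
        httpClient.GetAsync("https://example.com/api/orders");
}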
Configure logging, alerting, and application monitoring
Application Insights in Azure Monitor collects log, performance, and error data. Application Insights automatically
detects performance anomalies, and includes powerful analytics tools to help diagnose issues and understand
function usage.
For more information about application monitoring and logging, see:
Monitor Azure Functions
Monitoring Azure Functions with Azure Monitor Logs
Application Insights for Azure Functions supported features
Diagnose and troubleshoot issues
Learn how to effectively use diagnostics for troubleshooting in proactive and problem-first scenarios. For more
information, see:
Keep your Azure App Service and Azure Functions apps healthy and happy
Troubleshoot error: "Azure Functions Runtime is unreachable"
Deploy applications using an automated pipeline and DevOps
Full automation of all steps from code commit to production deployment lets teams focus on building code, and
removes the overhead and potential human error of manual steps. Deploying new code is quicker and less risky,
helping teams become more agile, more productive, and more confident about their code.
For more information about DevOps and continuous deployment (CD), see:
Continuous deployment for Azure Functions
Continuous delivery by using Azure DevOps
Continuous delivery by using GitHub Action
Optimization
Once the application is in production, prepare for scaling and implement site reliability engineering (SRE).
Ensure optimal scalability
For information about factors that impact Functions app scalability, see:
Scalability best practices
Performance and scale in Durable Functions
Implement SRE practices
Site Reliability Engineering (SRE) is a proven approach to maintaining crucial system and application reliability,
while iterating at the speed the marketplace demands. For more information, see:
Introduction to Site Reliability Engineering (SRE)
DevOps at Microsoft: Game streaming SRE
Next steps
For hands-on serverless Functions app development and deployment walkthroughs, see:
Serverless Functions code walkthrough
CI/CD for a serverless frontend
For an engineering playbook to help teams and customers successfully implement serverless Functions projects,
see the Code-With Customer/Partner Engineering Playbook.
Code walkthrough: Serverless application with
Functions
12/18/2020 • 18 minutes to read • Edit Online
Serverless models abstract code from the underlying compute infrastructure, allowing developers to focus on
business logic without extensive setup. Serverless code reduces costs, because you pay only for the code execution
resources and duration.
The serverless event-driven model fits situations where a certain event triggers a defined action. For example,
receiving an incoming device message triggers storage for later use, or a database update triggers some further
processing.
To help you explore Azure serverless technologies in Azure, Microsoft developed and tested a serverless
application that uses Azure Functions. This article walks through the code for the serverless Functions solution, and
describes design decisions, implementation details, and some of the "gotchas" you might encounter.
Fabrikam manages a fleet of drones for a drone delivery service. The application consists of two main functional
areas:
Event ingestion. During flight, drones send status messages to a cloud endpoint. The application ingests
and processes these messages, and writes the results to a back-end database (Cosmos DB). The devices
send messages in protocol buffer (protobuf) format. Protobuf is an efficient, self-describing serialization
format.
These messages contain partial updates. At a fixed interval, each drone sends a "key frame" message that
contains all of the status fields. Between key frames, the status messages only include fields that changed
since the last message. This behavior is typical of many IoT devices that need to conserve bandwidth and
power.
Web app. A web application allows users to look up a device and query the device's last-known status.
Users must sign into the application and authenticate with Azure Active Directory (Azure AD). The
application only allows requests from users who are authorized to access the app.
Here's a screenshot of the web app, showing the result of a query:
In one data flow, arrows show messages going from Devices to Event Hubs and triggering the function app. From
the app, one arrow shows dead-letter messages going to a storage queue, and another arrow shows writing to
Azure Cosmos DB. In another data flow, arrows show the client web app getting static files from Blob storage static
web hosting, through a CDN. Another arrow shows the client HTTP request going through API Management. From
API Management, one arrow shows the function app triggering and reading data from Azure Cosmos DB. Another
arrow shows authentication through Azure AD. A User also signs in to Azure AD.
Event ingestion:
1. Drone messages are ingested by Azure Event Hubs.
2. Event Hubs produces a stream of events that contain the message data.
3. These events trigger an Azure Functions app to process them.
4. The results are stored in Cosmos DB.
Web app:
1. Static files are served by CDN from Blob storage.
2. A user signs into the web app using Azure AD.
3. Azure API Management acts as a gateway that exposes a REST API endpoint.
4. HTTP requests from the client trigger an Azure Functions app that reads from Cosmos DB and returns the
result.
This application is based on two reference architectures, corresponding to the two functional blocks described
above:
Serverless event processing using Azure Functions
Serverless web application on Azure
You can read those articles to learn more about the high-level architecture, the Azure services that are used in the
solution, and considerations for scalability, security, and reliability.
The RawTelemetryFunction class has several dependencies, which are injected into its constructor using dependency injection:
The ITelemetryProcessor and IStateChangeProcessor interfaces define two helper objects. As we'll see, these
objects do most of the work.
The TelemetryClient is part of the Application Insights SDK. It is used to send custom application metrics to
Application Insights.
Later, we'll look at how to configure the dependency injection. For now, just assume these dependencies exist.
[FunctionName("RawTelemetryFunction")]
[StorageAccount("DeadLetterStorage")]
public async Task RunAsync(
[EventHubTrigger("%EventHubName%", Connection = "EventHubConnection", ConsumerGroup
="%EventHubConsumerGroup%")]EventData[] messages,
[Queue("deadletterqueue")] IAsyncCollector<DeadLetterMessage> deadLetterMessages,
ILogger logger)
{
// implementation goes here
}
The EventHubTrigger attribute on the messages parameter configures the trigger. The properties of the attribute
specify an event hub name, a connection string, and a consumer group. (A consumer group is an isolated view of
the Event Hubs event stream. This abstraction allows for multiple consumers of the same event hub.)
Notice the percent signs (%) in some of the attribute properties. These indicate that the property specifies the
name of an app setting, and the actual value is taken from that app setting at run time. Otherwise, without percent
signs, the property gives the literal value.
The Connection property is an exception. This property always specifies an app setting name, never a literal value,
so the percent sign is not needed. The reason for this distinction is that a connection string is secret and should
never be checked into source code.
While the other two properties (event hub name and consumer group) are not sensitive data like a connection
string, it's still better to put them into app settings, rather than hard coding. That way, they can be updated without
recompiling the app.
For more information about configuring this trigger, see Azure Event Hubs bindings for Azure Functions.
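During local development these values typically come from local.settings.json, and in Azure they become application settings. The following sketch uses placeholder values whose names mirror the attribute properties above:
{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "<storage-connection-string>",
    "EventHubName": "<event-hub-name>",
    "EventHubConsumerGroup": "<consumer-group-name>",
    "EventHubConnection": "<event-hubs-connection-string>"
  }
}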
[FunctionName("RawTelemetryFunction")]
[StorageAccount("DeadLetterStorage")]
public async Task RunAsync(
[EventHubTrigger("%EventHubName%", Connection = "EventHubConnection", ConsumerGroup
="%EventHubConsumerGroup%")]EventData[] messages,
[Queue("deadletterqueue")] IAsyncCollector<DeadLetterMessage> deadLetterMessages,
ILogger logger)
{
telemetryClient.GetMetric("EventHubMessageBatchSize").TrackValue(messages.Length);
foreach (var message in messages)
{
DeviceState deviceState = null;
try
{
deviceState = telemetryProcessor.Deserialize(message.Body.Array, logger);
try
{
await stateChangeProcessor.UpdateState(deviceState, logger);
}
catch (Exception ex)
{
logger.LogError(ex, "Error updating status document", deviceState);
await deadLetterMessages.AddAsync(new DeadLetterMessage { Exception = ex, EventData = message,
DeviceState = deviceState });
}
}
catch (Exception ex)
{
logger.LogError(ex, "Error deserializing message", message.SystemProperties.PartitionKey,
message.SystemProperties.SequenceNumber);
await deadLetterMessages.AddAsync(new DeadLetterMessage { Exception = ex, EventData = message });
}
}
}
When the function is invoked, the messages parameter contains an array of messages from the event hub.
Processing messages in batches will generally yield better performance than reading one message at a time.
However, you have to make sure the function is resilient and handles failures and exceptions gracefully. Otherwise,
if the function throws an unhandled exception in the middle of a batch, you might lose the remaining messages.
This consideration is discussed in more detail in the section Error handling.
But if you ignore the exception handling, the processing logic for each message is simple:
1. Call ITelemetryProcessor.Deserialize to deserialize the message that contains a device state change.
2. Call IStateChangeProcessor.UpdateState to process the state change.
Let's look at these two methods in more detail, starting with the Deserialize method.
Deserialize method
The TelemetryProcessor.Deserialize method takes a byte array that contains the message payload. It deserializes this
payload and returns a DeviceState object, which represents the state of a drone. The state may represent a partial
update, containing just the delta from the last-known state. Therefore, the method needs to handle null fields in
the deserialized payload.
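// Earlier lines of the method, which deserialize the payload into 'restored' and create the 'deviceState' result, are omitted from this excerpt.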
if (restored.Battery != null)
{
deviceState.Battery = restored.Battery;
}
if (restored.FlightMode != null)
{
deviceState.FlightMode = (int)restored.FlightMode;
}
if (restored.Position != null)
{
deviceState.Latitude = restored.Position.Value.Latitude;
deviceState.Longitude = restored.Position.Value.Longitude;
deviceState.Altitude = restored.Position.Value.Altitude;
}
if (restored.Health != null)
{
deviceState.AccelerometerOK = restored.Health.Value.AccelerometerOK;
deviceState.GyrometerOK = restored.Health.Value.GyrometerOK;
deviceState.MagnetometerOK = restored.Health.Value.MagnetometerOK;
}
return deviceState;
}
}
This method uses another helper interface, ITelemetrySerializer<T> , to deserialize the raw message. The results
are then transformed into a POCO model that is easier to work with. This design helps to isolate the processing
logic from the serialization implementation details. The ITelemetrySerializer<T> interface is defined in a shared
library, which is also used by the device simulator to generate simulated device events and send them to Event
Hubs.
using System;
namespace Serverless.Serialization
{
public interface ITelemetrySerializer<T>
{
T Deserialize(byte[] message);
}
}
UpdateState method
The StateChangeProcessor.UpdateState method applies the state changes. The last-known state for each drone is
stored as a JSON document in Cosmos DB. Because the drones send partial updates, the application can't simply
overwrite the document when it gets an update. Instead, it needs to fetch the previous state, merge the fields, and
then perform an upsert operation.
public class StateChangeProcessor : IStateChangeProcessor
{
private IDocumentClient client;
private readonly string cosmosDBDatabase;
private readonly string cosmosDBCollection;
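// The constructor and the beginning of the UpdateState method are omitted from this excerpt.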
try
{
var response = await client.ReadDocumentAsync(UriFactory.CreateDocumentUri(cosmosDBDatabase,
cosmosDBCollection, source.DeviceId),
new RequestOptions { PartitionKey = new
PartitionKey(source.DeviceId) });
target = (DeviceState)(dynamic)response.Resource;
// Merge properties
target.Battery = source.Battery ?? target.Battery;
target.FlightMode = source.FlightMode ?? target.FlightMode;
target.Latitude = source.Latitude ?? target.Latitude;
target.Longitude = source.Longitude ?? target.Longitude;
target.Altitude = source.Altitude ?? target.Altitude;
target.AccelerometerOK = source.AccelerometerOK ?? target.AccelerometerOK;
target.GyrometerOK = source.GyrometerOK ?? target.GyrometerOK;
target.MagnetometerOK = source.MagnetometerOK ?? target.MagnetometerOK;
}
catch (DocumentClientException ex)
{
if (ex.StatusCode == System.Net.HttpStatusCode.NotFound)
{
target = source;
}
}
This code uses the IDocumentClient interface to fetch a document from Cosmos DB. If the document exists, the
new state values are merged into the existing document. Otherwise, a new document is created. Both cases are
handled by the UpsertDocumentAsync method.
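As a sketch of what that looks like at the end of UpdateState (the excerpt above omits this call):
var collectionUri = UriFactory.CreateDocumentCollectionUri(cosmosDBDatabase, cosmosDBCollection);
// Creates the document if it doesn't exist, or replaces it if it does.
await client.UpsertDocumentAsync(collectionUri, target,
    new RequestOptions { PartitionKey = new PartitionKey(source.DeviceId) });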
This code is optimized for the case where the document already exists and can be merged. On the first telemetry
message from a given drone, the ReadDocumentAsync method will throw an exception, because there is no
document for that drone. After the first message, the document will be available.
Notice that this class uses dependency injection to inject the IDocumentClient for Cosmos DB and an IOptions<T>
with configuration settings. We'll see how to set up the dependency injection later.
NOTE
Azure Functions supports an output binding for Cosmos DB. This binding lets the function app write documents in Cosmos
DB without any code. However, the output binding won't work for this particular scenario, because of the custom upsert
logic that's needed.
Error handling
As mentioned earlier, the RawTelemetryFunction function app processes a batch of messages in a loop. That means
the function needs to handle any exceptions gracefully and continue processing the rest of the batch. Otherwise,
messages might get dropped.
If an exception is encountered when processing a message, the function puts the message onto a dead-letter
queue:
[FunctionName("RawTelemetryFunction")]
[StorageAccount("DeadLetterStorage")] // App setting that holds the connection string
public async Task RunAsync(
[EventHubTrigger("%EventHubName%", Connection = "EventHubConnection", ConsumerGroup = "%EventHubConsumerGroup%")]EventData[] messages,
[Queue("deadletterqueue")] IAsyncCollector<DeadLetterMessage> deadLetterMessages, // output binding
ILogger logger)
Here the Queue attribute specifies the output binding, and the StorageAccount attribute specifies the name of an
app setting that holds the connection string for the storage account.
Deployment tip: In the Resource Manager template that creates the storage account, you can automatically
populate an app setting with the connection string. The trick is to use the listKeys function.
Here is the section of the template that creates the storage account for the queue:
{
"name": "[variables('droneTelemetryDeadLetterStorageQueueAccountName')]",
"type": "Microsoft.Storage/storageAccounts",
"location": "[resourceGroup().location]",
"apiVersion": "2017-10-01",
"sku": {
"name": "[parameters('storageAccountType')]"
},
Here is the section of the template that creates the function app.
{
"apiVersion": "2015-08-01",
"type": "Microsoft.Web/sites",
"name": "[variables('droneTelemetryFunctionAppName')]",
"location": "[resourceGroup().location]",
"tags": {
"displayName": "Drone Telemetry Function App"
},
"kind": "functionapp",
"dependsOn": [
"[resourceId('Microsoft.Web/serverfarms', variables('hostingPlanName'))]",
...
],
"properties": {
"serverFarmId": "[resourceId('Microsoft.Web/serverfarms', variables('hostingPlanName'))]",
"siteConfig": {
"appSettings": [
{
"name": "DeadLetterStorage",
"value": "[concat('DefaultEndpointsProtocol=https;AccountName=',
variables('droneTelemetryDeadLetterStorageQueueAccountName'), ';AccountKey=',
listKeys(variables('droneTelemetryDeadLetterStorageQueueAccountId'),'2015-05-01-preview').key1)]"
},
...
This defines an app setting named DeadLetterStorage whose value is populated using the listKeys function. It's
important to make the function app resource depend on the storage account resource (see the dependsOn
element). This guarantees that the storage account is created first and the connection string is available.
Dependency injection
The function app uses dependency injection to wire up its services. Here is the startup class that registers the dependencies:
[assembly: FunctionsStartup(typeof(DroneTelemetryFunctionApp.Startup))]
namespace DroneTelemetryFunctionApp
{
public class Startup : FunctionsStartup
{
public override void Configure(IFunctionsHostBuilder builder)
{
builder.Services.AddOptions<StateChangeProcessorOptions>()
.Configure<IConfiguration>((configSection, configuration) =>
{
configuration.Bind(configSection);
});
builder.Services.AddTransient<ITelemetrySerializer<DroneState>, TelemetrySerializer<DroneState>>();
builder.Services.AddTransient<ITelemetryProcessor, TelemetryProcessor>();
builder.Services.AddTransient<IStateChangeProcessor, StateChangeProcessor>();
builder.Services.AddSingleton<IDocumentClient>(ctx => {
var config = ctx.GetService<IConfiguration>();
var cosmosDBEndpoint = config.GetValue<string>("CosmosDBEndpoint");
var cosmosDBKey = config.GetValue<string>("CosmosDBKey");
return new DocumentClient(new Uri(cosmosDBEndpoint), cosmosDBKey);
});
}
}
}
Azure Functions written for .NET can use the ASP.NET Core dependency injection framework. The basic idea is that
you declare a startup method for your assembly. The method takes an IFunctionsHostBuilder interface, which is
used to declare the dependencies for DI. You do this by calling the Add* methods on the Services object. When you
add a dependency, you specify its lifetime:
Transient objects are created each time they're requested.
Scoped objects are created once per function execution.
Singleton objects are reused across function executions, within the lifetime of the function host.
In this example, the TelemetryProcessor and StateChangeProcessor objects are declared as transient. This is
appropriate for lightweight, stateless services. The DocumentClient class, on the other hand, should be a singleton
for best performance. For more information, see Performance tips for Azure Cosmos DB and .NET.
If you refer back to the code for the RawTelemetryFunction, you'll see another dependency that doesn't appear in
the DI setup code: the TelemetryClient class, which is used to log application metrics. The Functions
runtime automatically registers this class into the DI container, so you don't need to register it explicitly.
For more information about DI in Azure Functions, see the following articles:
Use dependency injection in .NET Azure Functions
Dependency injection in ASP.NET Core
Passing configuration settings in DI
Sometimes an object must be initialized with some configuration values. Generally, these settings should come
from app settings or (in the case of secrets) from Azure Key Vault.
There are two examples in this application. First, the DocumentClient class takes a Cosmos DB service endpoint and
key. For this object, the application registers a lambda that will be invoked by the DI container. This lambda uses the
IConfiguration interface to read the configuration values:
builder.Services.AddSingleton<IDocumentClient>(ctx => {
var config = ctx.GetService<IConfiguration>();
var cosmosDBEndpoint = config.GetValue<string>("CosmosDBEndpoint");
var cosmosDBKey = config.GetValue<string>("CosmosDBKey");
return new DocumentClient(new Uri(cosmosDBEndpoint), cosmosDBKey);
});
The second example is the StateChangeProcessor class. For this object, we use an approach called the options
pattern. Here's how it works:
1. Define a class T that contains your configuration settings. In this case, the Cosmos DB database name and
collection name.
2. Register that class with the DI container by calling AddOptions and binding it to configuration:
builder.Services.AddOptions<StateChangeProcessorOptions>()
    .Configure<IConfiguration>((configSection, configuration) =>
    {
        configuration.Bind(configSection);
    });
3. In the constructor of the class that is being configured, include an IOptions<T> parameter.
The DI system will automatically populate the options class with configuration values and pass it to the
constructor, as the following sketch shows.
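For example, a minimal sketch of the StateChangeProcessor constructor; the option property names are assumptions:
public StateChangeProcessor(IDocumentClient client, IOptions<StateChangeProcessorOptions> options)
{
    this.client = client;
    // The option values are bound from app settings by the options pattern shown above.
    this.cosmosDBDatabase = options.Value.COSMOSDB_DATABASE_NAME;
    this.cosmosDBCollection = options.Value.COSMOSDB_DATABASE_COL;
}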
There are several advantages of this approach:
Decouple the class from the source of the configuration values.
Easily set up different configuration sources, such as environment variables or JSON configuration files.
Simplify unit testing.
Use a strongly typed options class, which is less error prone than just passing in scalar values.
GetStatus function
The other Functions app in this solution implements a simple REST API to get the last-known status of a drone. This
function is defined in a class named GetStatusFunction . Here is the complete code for the function:
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;
using System.Security.Claims;
using System.Threading.Tasks;
namespace DroneStatusFunctionApp
{
public static class GetStatusFunction
{
public const string GetDeviceStatusRoleName = "GetStatus";
[FunctionName("GetStatusFunction")]
public static IActionResult Run(
[HttpTrigger(AuthorizationLevel.Function, "get", Route = null)]HttpRequest req,
[CosmosDB(
databaseName: "%COSMOSDB_DATABASE_NAME%",
collectionName: "%COSMOSDB_DATABASE_COL%",
ConnectionStringSetting = "COSMOSDB_CONNECTION_STRING",
Id = "{Query.deviceId}",
PartitionKey = "{Query.deviceId}")] dynamic deviceStatus,
ClaimsPrincipal principal,
ILogger log)
{
log.LogInformation("Processing GetStatus request.");
if (deviceStatus == null)
{
return new NotFoundResult();
}
else
{
return new OkObjectResult(deviceStatus);
}
}
}
}
This function uses an HTTP trigger to process an HTTP GET request. The function uses a Cosmos DB input binding
to fetch the requested document. One consideration is that this binding will run before the authorization logic is
performed inside the function. If an unauthorized user requests a document, the function binding will still fetch the
document. Then the authorization code will return a 401, so the user won't see the document. Whether this
behavior is acceptable may depend on your requirements. For example, this approach might make it harder to
audit data access for sensitive data.
A Function application can be configured to authenticate users with zero code. For more information, see
Authentication and authorization in Azure App Service.
Authorization, on the other hand, generally requires some business logic. Azure AD supports claims-based
authentication. In this model, a user's identity is represented as a set of claims that come from the identity provider.
A claim can be any piece of information about the user, such as their name or email address.
The access token contains a subset of user claims. Among these are any application roles that the user is assigned
to.
The principal parameter of the function is a ClaimsPrincipal object that contains the claims from the access
token. Each claim is a key/value pair of claim type and claim value. The application uses these to authorize the
request.
The following extension method tests whether a ClaimsPrincipal object contains a set of roles. It returns false if
any of the specified roles is missing. If this method returns false, the function returns HTTP 401 (Unauthorized).
using System.Collections.Generic;
using System.Linq;
using System.Security.Claims;
using Microsoft.Extensions.Logging;
namespace DroneStatusFunctionApp
{
public static class ClaimsPrincipalAuthorizationExtensions
{
public static bool IsAuthorizedByRoles(
this ClaimsPrincipal principal,
string[] roles,
ILogger log)
{
var principalRoles = new HashSet<string>(principal.Claims.Where(kvp => kvp.Type == "roles").Select(kvp => kvp.Value));
var missingRoles = roles.Where(r => !principalRoles.Contains(r)).ToArray();
if (missingRoles.Length > 0)
{
log.LogWarning("The principal does not have the required {roles}", string.Join(", ", missingRoles));
return false;
}
return true;
}
}
}
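In the GetStatusFunction, this check would typically run before the document is returned; a minimal sketch:
if (!principal.IsAuthorizedByRoles(new[] { GetDeviceStatusRoleName }, log))
{
    // UnauthorizedResult produces the HTTP 401 response described above.
    return new UnauthorizedResult();
}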
For more information about authentication and authorization in this application, see the Security considerations
section of the reference architecture.
Next steps
Once you get a feel for how this reference solution works, learn best practices and recommendations for similar
solutions.
For a serverless event ingestion solution, see Serverless event processing using Azure Functions.
For a serverless web app, see Serverless web application on Azure.
Azure Functions is just one Azure compute option. For help with choosing a compute technology, see Choose an
Azure compute service for your application.
Related resources
For in-depth discussion on developing serverless solutions on premises as well as in the cloud, read Serverless
apps: Architecture, patterns, and Azure implementation.
Read more about the Event-driven architecture style.
CI/CD for serverless application frontend on Azure
12/18/2020 • 11 minutes to read • Edit Online
Serverless computing abstracts the servers, infrastructure, and operating systems, allowing developers to focus on
application development. A robust CI/CD (Continuous Integration/Continuous Delivery) process for such applications
lets companies ship fully tested and integrated software versions within minutes of development, and provides
the backbone of a modern DevOps environment.
What does CI/CD actually stand for?
Continuous Integration allows development teams to integrate code changes in a shared repository almost
instantaneously. This ability, coupled with automated build and testing before the changes are integrated,
ensures that only fully functional application code is available for deployment.
Continuous Delivery allows changes in the source code, configuration, content, and other artifacts to be
delivered to production, and ready to be deployed to end-users, as quickly and safely as possible. The process
keeps the code in a deployable state at all times. A special case of this is Continuous Deployment, which
includes actual deployment to end users.
This article discusses a CI/CD pipeline for the web frontend of a serverless reference implementation. This pipeline
is developed using Azure services. The web frontend demonstrates a modern web application, with client-side
JavaScript, reusable server-side APIs, and pre-built Markup, alternatively called Jamstack. You can find the code in
this GitHub repository. The readme describes the steps to download, build, and deploy the application.
The following diagram describes the CI/CD pipeline used in this sample frontend:
Prerequisites
To work with this sample application, make sure you have the following:
A GitHub account.
An Azure account. If you don't have one, you can try out a free Azure account.
An Azure DevOps organization. If you don't have one, you can try out a basic plan, which includes DevOps
services such as Azure Pipelines.
Build stage
The pipeline is triggered by new commits to the main branch that touch the src/ClientApp path:
trigger:
  batch: true
  branches:
    include:
    - main
  paths:
    include:
    - src/ClientApp
The following snippet illustrates the start of the build stage, which starts an Ubuntu container to run this stage.
stages:
- stage: Build
  jobs:
  - job: WebsiteBuild
    displayName: Build Fabrikam Drone Status app
    pool:
      vmImage: 'Ubuntu-16.04'
    continueOnError: false
    steps:
This is followed by tasks and scripts required to successfully build the project. These include the following:
Installing Node.js and setting up environment variables,
Installing and running Gatsby.js that builds the static website:
- script: |
    cd src/ClientApp
    npm install
    npx gatsby build
  displayName: 'gatsby build'
Installing and running a compression tool named brotli, to compress the built files before deployment:
- script: |
    cd src/ClientApp/public
    sudo apt-get install brotli --install-suggests --no-install-recommends -q --assume-yes
    for f in $(find . -type f \( -iname '*.html' -o -iname '*.map' -o -iname '*.js' -o -iname '*.json' \)); do brotli $f -Z -j -f -v && mv ${f}.br $f; done
  displayName: 'enable compression at origin level'
Publishing the compiled website files as a pipeline artifact named drop:
- task: PublishPipelineArtifact@1
  inputs:
    targetPath: 'src/ClientApp/public'
    artifactName: 'drop'
A successful completion of the build stage tears down the Ubuntu environment, and triggers the deploy stage in
the pipeline.
Deploy stage
The deploy stage runs in a new Ubuntu container:
- stage: Deploy
  jobs:
  - deployment: WebsiteDeploy
    displayName: Deploy Fabrikam Drone Status app
    pool:
      vmImage: 'Ubuntu-16.04'
    environment: 'fabrikamdronestatus-prod'
    strategy:
      runOnce:
        deploy:
          steps:
This stage includes various deployment tasks and scripts to:
Download the build artifacts to the container (which happens automatically as a consequence of using
PublishPipelineArtifact in the build stage),
Record the build release version, and update it in the GitHub repository,
Upload the website files to Blob Storage, in a new folder corresponding to the new version, and
Change the CDN to point to this new folder.
The last two steps together replicate a cache purge, since older folders are no longer accessible by the CDN edge
servers. The following snippet shows how this is achieved:
- script: |
    az login --service-principal -u $(azureArmClientId) -p $(azureArmClientSecret) --tenant $(azureArmTenantId)
    # upload content to container versioned folder
    az storage blob upload-batch -s "$(Pipeline.Workspace)/drop" --destination "\$web\$(releaseSemVer)" --account-name $(azureStorageAccountName) --content-encoding br --pattern "*.html" --content-type "text/html"
    az storage blob upload-batch -s "$(Pipeline.Workspace)/drop" --destination "\$web\$(releaseSemVer)" --account-name $(azureStorageAccountName) --content-encoding br --pattern "*.js" --content-type "application/javascript"
    az storage blob upload-batch -s "$(Pipeline.Workspace)/drop" --destination "\$web\$(releaseSemVer)" --account-name $(azureStorageAccountName) --content-encoding br --pattern "*.js.map" --content-type "application/octet-stream"
    az storage blob upload-batch -s "$(Pipeline.Workspace)/drop" --destination "\$web\$(releaseSemVer)" --account-name $(azureStorageAccountName) --content-encoding br --pattern "*.json" --content-type "application/json"
    az storage blob upload-batch -s "$(Pipeline.Workspace)/drop" --destination "\$web\$(releaseSemVer)" --account-name $(azureStorageAccountName) --pattern "*.txt" --content-type "text/plain"
    # target new version
    az cdn endpoint update --resource-group $(azureResourceGroup) --profile-name $(azureCdnName) --name $(azureCdnName) --origin-path '/$(releaseSemVer)'
    AZURE_CDN_ENDPOINT_HOSTNAME=$(az cdn endpoint show --resource-group $(azureResourceGroup) --name $(azureCdnName) --profile-name $(azureCdnName) --query hostName -o tsv)
    echo "Azure CDN endpoint host ${AZURE_CDN_ENDPOINT_HOSTNAME}"
    echo '##vso[task.setvariable variable=azureCndEndpointHost]'$AZURE_CDN_ENDPOINT_HOSTNAME
  displayName: 'upload to Azure Storage static website hosting and purge Azure CDN endpoint'
Atomic deploys
Atomic deployment ensures that the users of your website or application always get the content corresponding to
the same version.
In the sample CI/CD pipeline, the website contents are deployed to the Blob storage, which acts as the origin server
for the Azure CDN. If the files are updated in the same root folder in the blob, the website will be served
inconsistently. Uploading to a new versioned folder, as shown in the preceding section, solves this problem. Users
get either all of the new build or none of it, because the CDN points to the new folder as its origin only after all
files have been uploaded successfully.
The advantages of this approach are as follows:
Since new content is not available to users until the CDN points to the new origin folder, it results in an atomic
deployment.
You can easily roll back to an older version of the website if necessary.
Since the origin can host multiple versions of the website side by side, you can fine-tune the deployment by
using techniques such as letting certain users preview the site before it becomes widely available.
Next steps
Now that you understand the basics, follow this readme to set up and execute the CI/CD pipeline.
Learn best practices for using content delivery networks (CDNs)
Serverless Functions app operations
12/18/2020 • 3 minutes to read • Edit Online
This article describes Azure operations considerations for serverless Functions applications. To support Functions
apps, operations personnel need to:
Understand and implement hosting configurations.
Future-proof scalability by automating infrastructure provisioning.
Maintain business continuity by meeting availability and disaster recovery requirements.
Planning
To plan operations, understand your workloads and their requirements, then design and configure the best options
for the requirements.
Choose a hosting option
The Azure Functions Runtime provides flexibility in hosting. Use the hosting plan comparison table to determine
the best choice for your requirements.
Azure Functions hosting plans
Each Azure Functions project deploys and runs in its own Functions app, which is the unit of scale and cost.
The three hosting plans available for Azure Functions are the Consumption plan, Premium plan, and
Dedicated (App Service) plan. The hosting plan determines scaling behavior, available resources, and
support for advanced features like virtual network connectivity.
Azure Kubernetes Service (AKS)
Kubernetes-based Functions provides the Functions Runtime in a Docker container with event-driven
scaling through Kubernetes-based Event Driven Autoscaling (KEDA).
For more information about hosting plans, see:
Azure Functions scale and hosting
Consumption plan
Premium plan
Dedicated (App Service) plan
Azure Functions on Kubernetes with KEDA
Azure subscription and service limits, quotas, and constraints
Understand scaling
The serverless Consumption and Premium hosting plans scale automatically, adding and removing Azure
Functions host instances based on the number of incoming events. Scaling can vary along several dimensions and
behaves differently depending on the plan, trigger, and code language.
For more information about scaling, see:
Understand scaling behaviors
Scalability best practices
Understand and address cold starts
If the number of host instances scales down to zero, the next request has the added latency of restarting the
Function app, called a cold start. Cold start is a large discussion point for serverless architectures, and a point of
ambiguity for Azure Functions.
The Premium hosting plan prevents cold starts by keeping some instances warm. Reducing dependencies and
using asynchronous operations in the Functions app also minimizes the impact of cold starts. However, availability
requirements may require running the app in a Dedicated hosting plan with Always On enabled. The Dedicated
plan uses dedicated virtual machines (VMs), so it is not serverless.
For more information about cold start, see Understanding serverless cold start.
Identify storage considerations
Every Azure Functions app relies on Azure Storage for operations such as managing triggers and logging function
executions. When creating a Functions app, you must create or link to a general-purpose Azure Storage account
that supports Blob, Queue, and Table storage. For more information, see Storage considerations for Azure
Functions.
Identify network design considerations
Networking options let the Functions app restrict access, or access resources without using internet-routable
addresses. The hosting plans offer different levels of network isolation. Choose the option that best meets your
network isolation requirements. For more information, see Azure Functions networking options.
Production
To prepare the application for production, make sure you can easily redeploy the hosting plan, and apply scale-out
rules.
Automate hosting plan provisioning
With infrastructure as code, you can automate infrastructure provisioning. Automatic provisioning provides more
resiliency during disasters, and more agility to quickly redeploy the infrastructure as needed.
For more information on automated provisioning, see:
Automate resource deployment for your function app in Azure Functions
Terraform - Manages a Function App
Configure scale out options
Autoscale provides the right amount of running resources to handle application load. Autoscale adds resources to
handle increases in load, and saves money by removing resources that are idle.
For more information about autoscale options, see:
Premium Plan settings
App Service Plan settings
Optimization
When the application is in production, make sure that:
The hosting plan can scale to meet application demands.
There's a plan for business continuity, availability, and disaster recovery.
You can monitor hosting and application health and receive alerts.
Implement availability requirements
Azure Functions run in a specific region. To get higher availability, you can deploy the same Functions app to
multiple regions. In multiple regions, Functions can run in the active-active or active-passive availability pattern.
For more information about Azure Functions availability and disaster recovery, see:
Azure Functions geo-disaster recovery
Disaster recovery and geo-distribution in Azure Durable Functions
Monitoring: logging, application monitoring, and alerting
Application Insights and logs in Azure Monitor automatically collect log, performance, and error data and detect
performance anomalies. Azure Monitor includes powerful analytics tools to help diagnose issues and understand
function use. Application Insights helps you continuously improve performance and usability.
For more information about monitoring and analyzing Azure Functions performance, see:
Monitor Azure Functions
Monitor Azure Functions with Azure Monitor logs
Application Insights for Azure Functions supported features
Next steps
Serverless application development and deployment
Azure Functions app security
Serverless Functions security
12/18/2020 • 5 minutes to read • Edit Online
This article describes Azure services and activities security personnel can implement for serverless Functions.
These guidelines and resources help develop secure code and deploy secure applications to the cloud.
Planning
The primary goals of a secure serverless Azure Functions application environment are to protect running
applications, quickly identify and address security issues, and prevent future similar issues.
The OWASP Serverless Top 10 describes the most common serverless application security vulnerabilities, and
provides basic techniques to identify and protect against them.
In many ways, planning for secure development, deployment, and operation of serverless functions is much the
same as for any web-based or cloud-hosted application. Azure App Service provides the hosting infrastructure for
your function apps. The Securing Azure Functions article provides security strategies for running your function code,
and explains how App Service can help you secure your functions.
For more information about Azure security, best practices, and shared responsibilities, see:
Security in Azure App Service
Built-in security controls
Secure development best practices on Azure.
Security best practices for Azure solutions (PDF report)
Shared responsibilities for cloud computing (PDF report)
Deployment
To prepare serverless Functions applications for production, security personnel should:
Conduct regular code reviews to identify code and library vulnerabilities.
Define resource permissions that Functions needs to execute.
Configure network security rules for inbound and outbound communication.
Identify and classify sensitive data access.
The Azure Security Baseline for Azure Functions article contains more recommendations that will help you
improve the security posture of your deployment.
Keep code secure
Find security vulnerabilities and errors in code and manage security vulnerabilities in projects and dependencies.
For more information, see:
GitHub - Finding security vulnerabilities and errors in your code
GitHub - Managing security vulnerabilities in your project
GitHub - Managing vulnerabilities in your project's dependencies
Perform input validation
Different event sources like Blob storage, Cosmos DB NoSQL databases, event hubs, queues, or Graph events can
trigger serverless Functions. Injection attacks aren't limited to inputs that arrive directly through API calls;
functions may also consume untrusted input from any of these event sources.
In general, don't trust input or make any assumptions about its validity. Always use safe APIs that sanitize or
validate the input. If possible, use APIs that bind or parameterize variables, like using prepared statements for SQL
queries.
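For example, an HTTP-triggered function might validate a device ID before using it. This is a minimal sketch; the parameter name and allowed pattern are assumptions:
// Requires System.Text.RegularExpressions.
string deviceId = req.Query["deviceId"];
if (string.IsNullOrWhiteSpace(deviceId) || !Regex.IsMatch(deviceId, "^[A-Za-z0-9-]{1,64}$"))
{
    return new BadRequestObjectResult("A valid deviceId query parameter is required.");
}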
For more information, see:
Azure Functions Input Validation with FluentValidation
Security Frame: Input Validation Mitigations
HTTP Trigger Function Request Validation
How to validate request for Azure Functions
Secure HTTP endpoints for development, testing, and production
Azure Functions lets you use keys to make it harder to access your HTTP function endpoints. To fully secure your
function endpoints in production, consider implementing one of the following Function app-level security options:
Turn on App Service authentication and authorization for your Functions app. See Authorization keys.
Use Azure API Management (APIM) to authenticate requests. See Import an Azure Function App as an API in
Azure API Management.
Deploy your Functions app to an Azure App Service Environment (ASE).
Use an App Service Plan that restricts access, and implement Azure Front Door + WAF to handle your incoming
requests. See Create a Front Door for a highly available global web application.
For more information, see Secure an HTTP endpoint in production.
Set up role-based access control (RBAC)
Azure role-based access control (RBAC) has several built-in Azure roles that you can assign to users, groups,
service principals, and managed identities to control access to Azure resources. If the built-in roles don't meet your
organization's needs, you can create your own Azure custom roles.
Review each Functions app before deployment to identify excessive permissions. Carefully examine functions to
apply "least privilege" permissions, giving each function only what it needs to successfully execute.
Use RBAC to assign permissions to users, groups, and applications at a certain scope. The scope of a role
assignment can be a subscription, a resource group, or a single resource. Avoid using wildcards whenever
possible.
For more information about RBAC, see:
What is role-based access control (RBAC) for Azure resources?
Azure built-in roles
Azure role-based access control (RBAC)
Azure custom roles
Use managed identities and key vaults
A common challenge when building cloud applications is how to manage credentials for authenticating to cloud
services in your code. Credentials should never appear in application code, developer workstations, or source
control. Instead, use a key vault to store and retrieve keys and credentials. Azure Key Vault provides a way to
securely store credentials, secrets, and other keys. The code authenticates to Key Vault to retrieve the credentials.
For more information, see Use Key Vault references for App Service and Azure Functions.
Managed identities let Functions apps access resources like key vaults and storage accounts without requiring
specific access keys or connection strings. A full audit trail in the logs displays which identities execute requests to
resources. Use RBAC and managed identities to granularly control exactly what resources Azure Functions
applications can access.
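As a sketch, a function app with a managed identity can read a secret from Key Vault without storing any credentials; the vault URI and secret name are placeholders, using the Azure.Identity and Azure.Security.KeyVault.Secrets packages:
// DefaultAzureCredential picks up the function app's managed identity at runtime.
var secretClient = new SecretClient(new Uri("https://<your-key-vault>.vault.azure.net/"), new DefaultAzureCredential());
KeyVaultSecret secret = await secretClient.GetSecretAsync("CosmosDBKey");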
For more information, see:
What are managed identities for Azure resources?
How to use managed identities for App Service and Azure Functions
Use shared access signature (SAS) tokens to limit access to resources
A shared access signature (SAS) provides secure delegated access to resources in your storage account, without
compromising the security of your data. With a SAS, you have granular control over how a client can access your
data. You can control what resources the client may access, what permissions they have on those resources, and
how long the SAS is valid, among other parameters.
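As a sketch using the Azure.Storage.Blobs SDK, the following issues a read-only SAS for a single blob that expires after one hour; creation of blobClient is assumed:
var sasBuilder = new BlobSasBuilder
{
    BlobContainerName = blobClient.BlobContainerName,
    BlobName = blobClient.Name,
    Resource = "b",                                  // "b" = a SAS scoped to a single blob
    ExpiresOn = DateTimeOffset.UtcNow.AddHours(1)
};
sasBuilder.SetPermissions(BlobSasPermissions.Read);
Uri sasUri = blobClient.GenerateSasUri(sasBuilder);  // requires a client constructed with a shared key credential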
For more information, see Grant limited access to Azure Storage resources using shared access signatures (SAS).
Secure Blob storage
Identify and classify sensitive data, and minimize sensitive data storage to only what is necessary. For sensitive
data storage, add multi-factor authentication and data encryption in transit and at rest. Grant limited access to
Azure Storage resources using SAS tokens.
For more information, see Security recommendations for Blob storage.
Optimization
Once an application is in production, security personnel can help optimize workflow and prepare for scaling.
Use Azure Security Center and apply security recommendations
Azure Security Center is a security scanning solution for your application that identifies potential security
vulnerabilities and creates recommendations. The recommendations guide you to configure needed controls to
harden and protect your resources.
For more information, see:
Protect your applications with Azure Security Center
Security Center app recommendations
Enforce application governance policies
Apply centralized, consistent enforcements and safeguards to your application at scale. For more information, see
Azure Policy built-in policy definitions.
Next steps
Serverless application development and deployment
Azure Functions app operations
DevOps Checklist
11/2/2020 • 14 minutes to read • Edit Online
DevOps is the integration of development, quality assurance, and IT operations into a unified culture and set of
processes for delivering software. Use this checklist as a starting point to assess your DevOps culture and process.
Culture
Ensure business alignment across organizations and teams. Conflicts over resources, purpose, goals, and
priorities within an organization can be a risk to successful operations. Ensure that the business, development, and
operations teams are all aligned.
Ensure the entire team understands the software lifecycle. Your team needs to understand the overall
lifecycle of the application, and which part of the lifecycle the application is currently in. This helps all team
members know what they should be doing now, and what they should be planning and preparing for in the future.
Reduce cycle time. Aim to minimize the time it takes to move from ideas to usable developed software. Limit the
size and scope of individual releases to keep the test burden low. Automate the build, test, configuration, and
deployment processes whenever possible. Clear any obstacles to communication among developers, and between
developers and operations.
Review and improve processes. Your processes and procedures, both automated and manual, are never final.
Set up regular reviews of current workflows, procedures, and documentation, with a goal of continual
improvement.
Do proactive planning. Proactively plan for failure. Have processes in place to quickly identify issues when they
occur, escalate to the correct team members to fix, and confirm resolution.
Learn from failures. Failures are inevitable, but it's important to learn from failures to avoid repeating them. If an
operational failure occurs, triage the issue, document the cause and solution, and share any lessons that were
learned. Whenever possible, update your build processes to automatically detect that kind of failure in the future.
Optimize for speed and collect data. Every planned improvement is a hypothesis. Work in the smallest
increments possible. Treat new ideas as experiments. Instrument the experiments so that you can collect
production data to assess their effectiveness. Be prepared to fail fast if the hypothesis is wrong.
Allow time for learning. Both failures and successes provide good opportunities for learning. Before moving on
to new projects, allow enough time to gather the important lessons, and make sure those lessons are absorbed by
your team. Also give the team the time to build skills, experiment, and learn about new tools and techniques.
Document operations. Document all tools, processes, and automated tasks with the same level of quality as
your product code. Document the current design and architecture of any systems you support, along with recovery
processes and other maintenance procedures. Focus on the steps you actually perform, not theoretically optimal
processes. Regularly review and update the documentation. For code, make sure that meaningful comments are
included, especially in public APIs, and use tools to automatically generate code documentation whenever possible.
Share knowledge. Documentation is only useful if people know that it exists and can find it. Ensure the
documentation is organized and easily discoverable. Be creative: Use brown bags (informal presentations), videos,
or newsletters to share knowledge.
Development
Provide developers with production-like environments. If development and test environments don't match
the production environment, it is hard to test and diagnose problems. Therefore, keep development and test
environments as close to the production environment as possible. Make sure that test data is consistent with the
data used in production, even if it's sample data and not real production data (for privacy or compliance reasons).
Plan to generate and anonymize sample test data.
Ensure that all authorized team members can provision infrastructure and deploy the application.
Setting up production-like resources and deploying the application should not involve complicated manual tasks
or detailed technical knowledge of the system. Anyone with the right permissions should be able to create or
deploy production-like resources without going to the operations team.
This recommendation doesn't imply that anyone can push live updates to the production deployment. It's
about reducing friction for the development and QA teams to create production-like environments.
Instrument the application for insight. To understand the health of your application, you need to know how
it's performing and whether it's experiencing any errors or problems. Always include instrumentation as a design
requirement, and build the instrumentation into the application from the start. Instrumentation must include event
logging for root cause analysis, but also telemetry and metrics to monitor the overall health and usage of the
application.
Track your technical debt. In many projects, release schedules can get prioritized over code quality to one
degree or another. Always keep track when this occurs. Document any shortcuts or other suboptimal
implementations, and schedule time in the future to revisit these issues.
Consider pushing updates directly to production. To reduce the overall release cycle time, consider pushing
properly tested code commits directly to production. Use feature toggles to control which features are enabled.
This allows you to move from development to release quickly, using the toggles to enable or disable features.
Toggles are also useful when performing tests such as canary releases, where a particular feature is deployed to a
subset of the production environment.
Testing
Automate testing. Manually testing software is tedious and susceptible to error. Automate common testing tasks
and integrate the tests into your build processes. Automated testing ensures consistent test coverage and
reproducibility. Integrated UI tests should also be performed by an automated tool. Azure offers development and
test resources that can help you configure and execute testing. For more information, see Development and test.
Test for failures. If a system can't connect to a service, how does it respond? Can it recover once the service is
available again? Make fault injection testing a standard part of review on test and staging environments. When
your test process and practices are mature, consider running these tests in production.
Test in production. The release process doesn't end with deployment to production. Have tests in place to ensure
that deployed code works as expected. For deployments that are infrequently updated, schedule production testing
as a regular part of maintenance.
Automate performance testing to identify performance issues early. The impact of a serious performance
issue can be as severe as a bug in the code. While automated functional tests can prevent application bugs, they
might not detect performance problems. Define acceptable performance goals for metrics like latency, load times,
and resource usage. Include automated performance tests in your release pipeline, to make sure the application
meets those goals.
Perform capacity testing. An application might work fine under test conditions, and then have problems in
production due to scale or resource limitations. Always define the maximum expected capacity and usage limits.
Test to make sure the application can handle those limits, but also test what happens when those limits are
exceeded. Capacity testing should be performed at regular intervals.
After the initial release, you should run performance and capacity tests whenever updates are made to production
code. Use historical data to fine-tune tests and to determine what types of tests need to be performed.
Perform automated security penetration testing. Ensuring your application is secure is as important as
testing any other functionality. Make automated penetration testing a standard part of the build and deployment
process. Schedule regular security tests and vulnerability scanning on deployed applications, monitoring for open
ports, endpoints, and attacks. Automated testing does not remove the need for in-depth security reviews at regular
intervals.
Perform automated business continuity testing. Develop tests for large-scale business continuity, including
backup recovery and failover. Set up automated processes to perform these tests regularly.
Release
Automate deployments. Automate deploying the application to test, staging, and production environments.
Automation enables faster and more reliable deployments, and ensures consistent deployments to any supported
environment. It removes the risk of human error caused by manual deployments. It also makes it easy to schedule
releases for convenient times, to minimize any effects of potential downtime. Have systems in place to detect any
problems during rollout, and have an automated way to roll forward fixes or roll back changes.
Use continuous integration. Continuous integration (CI) is the practice of merging all developer code into a
central codebase on a regular schedule, and then automatically performing standard build and test processes. CI
ensures that an entire team can work on a codebase at the same time without having conflicts. It also ensures that
code defects are found as early as possible. Preferably, the CI process should run every time that code is
committed or checked in. At the very least, it should run once per day.
Consider adopting a trunk-based development model. In this model, developers commit to a single branch (the
trunk). There is a requirement that commits never break the build. This model facilitates CI, because all feature
work is done in the trunk, and any merge conflicts are resolved when the commit happens.
Consider using continuous delivery. Continuous delivery (CD) is the practice of ensuring that code is always
ready to deploy, by automatically building, testing, and deploying code to production-like environments. Adding
continuous delivery to create a full CI/CD pipeline will help you detect code defects as soon as possible, and
ensures that properly tested updates can be released in a very short time.
Continuous deployment is an additional process that automatically takes any updates that have passed
through the CI/CD pipeline and deploys them into production. Continuous deployment requires robust
automatic testing and advanced process planning, and may not be appropriate for all teams.
Make small incremental changes. Large code changes have a greater potential to introduce bugs. Whenever
possible, keep changes small. This limits the potential effects of each change, and makes it easier to understand
and debug any issues.
Control exposure to changes. Make sure you're in control of when updates are visible to your end users.
Consider using feature toggles to control when features are enabled for end users.
Implement release management strategies to reduce deployment risk. Deploying an application update
to production always entails some risk. To minimize this risk, use strategies such as canary releases or blue-green
deployments to deploy updates to a subset of users. Confirm the update works as expected, and then roll the
update out to the rest of the system.
Document all changes. Minor updates and configuration changes can be a source of confusion and versioning
conflict. Always keep a clear record of any changes, no matter how small. Log everything that changes, including
patches applied, policy changes, and configuration changes. (Don't include sensitive data in these logs. For
example, log that a credential was updated, and who made the change, but don't record the updated credentials.)
The record of the changes should be visible to the entire team.
Consider making infrastructure immutable. Immutable infrastructure is the principle that you shouldn't
modify infrastructure after it's deployed to production. Otherwise, you can get into a state where ad hoc changes
have been applied, making it hard to know exactly what changed. Immutable infrastructure works by replacing
entire servers as part of any new deployment. This allows the code and the hosting environment to be tested and
deployed as a block. Once deployed, infrastructure components aren't modified until the next build and deploy
cycle.
Monitoring
Make systems observable. The operations team should always have clear visibility into the health and status of
a system or service. Set up external health endpoints to monitor status, and ensure that applications are coded to
instrument the operations metrics. Use a common and consistent schema that helps you correlate events across
systems. Azure Diagnostics and Application Insights are the standard methods for tracking the health and status of
Azure resources. Microsoft Operations Management Suite also provides centralized monitoring and management
for cloud or hybrid solutions.
Aggregate and correlate logs and metrics. A properly instrumented telemetry system will provide a large
amount of raw performance data and event logs. Make sure that telemetry and log data is processed and
correlated in a short period of time, so that operations staff always have an up-to-date picture of system health.
Organize and display data in ways that give a cohesive view of any issues, so that whenever possible it's clear
when events are related to one another.
Consult your corporate retention policy for requirements on how data is processed and how long it should be
stored.
Implement automated alerts and notifications. Set up monitoring tools like Azure Monitor to detect patterns
or conditions that indicate potential or current issues, and send alerts to the team members who can address the
issues. Tune the alerts to avoid false positives.
Monitor assets and resources for expirations. Some resources and assets, such as certificates, expire after a
given amount of time. Make sure to track which assets expire, when they expire, and what services or features
depend on them. Use automated processes to monitor these assets. Notify the operations team before an asset
expires, and escalate if expiration threatens to disrupt the application.
Management
Automate operations tasks. Manually handling repetitive operations processes is error-prone. Automate these
tasks whenever possible to ensure consistent execution and quality. Code that implements the automation should
be versioned in source control. As with any other code, automation tools must be tested.
Take an infrastructure-as-code approach to provisioning. Minimize the amount of manual configuration
needed to provision resources. Instead, use scripts and Azure Resource Manager templates. Keep the scripts and
templates in source control, like any other code you maintain.
Consider using containers. Containers provide a standard package-based interface for deploying applications.
Using containers, an application is deployed using self-contained packages that include any software,
dependencies, and files needed to run the application, which greatly simplifies the deployment process.
Containers also create an abstraction layer between the application and the underlying operating system, which
provides consistency across environments. This abstraction can also isolate a container from other processes or
applications running on a host.
Implement resiliency and self-healing. Resiliency is the ability of an application to recover from failures.
Strategies for resiliency include retrying transient failures, and failing over to a secondary instance or even another
region. For more information, see Designing reliable Azure applications. Instrument your applications so that
issues are reported immediately and you can manage outages or other system failures.
Have an operations manual. An operations manual or runbook documents the procedures and management
information needed for operations staff to maintain a system. Also document any operations scenarios and
mitigation plans that might come into play during a failure or other disruption to your service. Create this
documentation during the development process, and keep it up to date afterwards. This is a living document, and
should be reviewed, tested, and improved regularly.
Shared documentation is critical. Encourage team members to contribute and share knowledge. The entire team
should have access to documents. Make it easy for anyone on the team to help keep documents updated.
Document on-call procedures. Make sure on-call duties, schedules, and procedures are documented and
shared to all team members. Keep this information up-to-date at all times.
Document escalation procedures for third-party dependencies. If your application depends on external
third-party services that you don't directly control, you must have a plan to deal with outages. Create
documentation for your planned mitigation processes. Include support contacts and escalation paths.
Use configuration management. Configuration changes should be planned, visible to operations, and recorded.
This could take the form of a configuration management database, or a configuration-as-code approach.
Configuration should be audited regularly to ensure that what's expected is actually in place.
Get an Azure support plan and understand the process. Azure offers a number of support plans. Determine
the right plan for your needs, and make sure the entire team knows how to use it. Team members should
understand the details of the plan, how the support process works, and how to open a support ticket with Azure. If
you are anticipating a high-scale event, Azure support can assist you with increasing your service limits. For more
information, see the Azure Support FAQs.
Follow least-privilege principles when granting access to resources. Carefully manage access to
resources. Access should be denied by default, unless a user is explicitly given access to a resource. Only grant a
user access to what they need to complete their tasks. Track user permissions and perform regular security audits.
Use role-based access control. Assigning user accounts and access to resources should not be a manual
process. Use role-based access control (RBAC) to grant access based on Azure Active Directory identities and groups.
Use a bug tracking system to track issues. Without a good way to track issues, it's easy to miss items,
duplicate work, or introduce additional problems. Don't rely on informal person-to-person communication to track
the status of bugs. Use a bug tracking tool to record details about problems, assign resources to address them, and
provide an audit trail of progress and status.
Manage all resources in a change management system. All aspects of your DevOps process should be
included in a management and versioning system, so that changes can be easily tracked and audited. This includes
code, infrastructure, configuration, documentation, and scripts. Treat all these types of resources as code
throughout the test/build/review process.
Use checklists. Create operations checklists to ensure processes are followed. It's common to miss something in
a large manual, and following a checklist can force attention to details that might otherwise be overlooked.
Maintain the checklists, and continually look for ways to automate tasks and streamline processes.
For more about DevOps, see What is DevOps? on the Visual Studio site.
Extend Azure Resource Manager template
functionality
12/18/2020 • 2 minutes to read • Edit Online
In 2016, the Microsoft patterns & practices team created a set of Azure Resource Manager template building blocks
with the goal of simplifying resource deployment. Each building block contains a set of prebuilt templates that
deploy sets of resources specified by separate parameter files.
The building block templates are designed to be combined together to create larger and more complex
deployments. For example, deploying a virtual machine in Azure requires a virtual network, storage accounts, and
other resources. The virtual network building block template deploys a virtual network and subnets. The virtual
machine building block template deploys storage accounts, network interfaces, and the actual VMs. You can then
create a script or template to call both building block templates with their corresponding parameter files to deploy
a complete architecture with one operation.
While developing the building block templates, p&p designed several concepts to extend Azure Resource Manager
template functionality. In this series, we will describe several of these concepts so you can use them in your own
templates.
NOTE
These articles assume you have an advanced understanding of Azure Resource Manager templates.
Update a resource in an Azure Resource Manager
template
12/18/2020 • 3 minutes to read • Edit Online
There are some scenarios in which you need to update a resource during a deployment. You might encounter this
scenario when you cannot specify all the properties for a resource until other, dependent resources are created. For
example, if you create a backend pool for a load balancer, you might update the network interfaces (NICs) on your
virtual machines (VMs) to include them in the backend pool. And while Resource Manager supports updating
resources during deployment, you must design your template correctly to avoid errors and to ensure the
deployment is handled as an update.
First, you must reference the resource once in the template to create it and then reference the resource by the same
name to update it later. However, if two resources have the same name in a template, Resource Manager throws an
exception. To avoid this error, specify the updated resource in a second template that's either linked or included as a
subtemplate using the Microsoft.Resources/deployments resource type.
Second, you must either specify the name of the existing property to change or a new name for a property to add
in the nested template. You must also specify the original properties and their original values. If you fail to provide
the original properties and values, Resource Manager assumes you want to create a new resource and deletes the
original resource.
Example template
Let's look at an example template that demonstrates this. Our template deploys a virtual network named
firstVNet that has one subnet named firstSubnet . It then deploys a virtual network interface (NIC) named nic1
and associates it with our subnet. Then, a deployment resource named updateVNet includes a nested template that
updates our firstVNet resource by adding a second subnet named secondSubnet .
{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {},
"resources": [
{
"apiVersion": "2016-03-30",
"name": "firstVNet",
"location":"[resourceGroup().location]",
"type": "Microsoft.Network/virtualNetworks",
"properties": {
"addressSpace":{"addressPrefixes": [
"10.0.0.0/22"
]},
"subnets":[
{
"name":"firstSubnet",
"properties":{
"addressPrefix":"10.0.0.0/24"
}
}
]
}
},
{
"apiVersion": "2015-06-15",
"type":"Microsoft.Network/networkInterfaces",
"name":"nic1",
"location":"[resourceGroup().location]",
"dependsOn": [
"firstVNet"
],
"properties": {
"ipConfigurations":[
{
"name":"ipconfig1",
"properties": {
"privateIPAllocationMethod":"Dynamic",
"subnet": {
"id": "
[concat(resourceId('Microsoft.Network/virtualNetworks','firstVNet'),'/subnets/firstSubnet')]"
}
}
}
]
}
},
{
"apiVersion": "2015-01-01",
"type": "Microsoft.Resources/deployments",
"name": "updateVNet",
"dependsOn": [
"nic1"
],
"properties": {
"mode": "Incremental",
"parameters": {},
"template": {
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {},
"variables": {},
"resources": [
{
"apiVersion": "2016-03-30",
"name": "firstVNet",
"location":"[resourceGroup().location]",
"type": "Microsoft.Network/virtualNetworks",
"properties": {
"addressSpace": "[reference('firstVNet').addressSpace]",
"subnets":[
{
"name":"[reference('firstVNet').subnets[0].name]",
"properties":{
"addressPrefix":"
[reference('firstVNet').subnets[0].properties.addressPrefix]"
}
},
{
"name":"secondSubnet",
"properties":{
"addressPrefix":"10.0.1.0/24"
}
}
]
}
}
],
"outputs": {}
}
}
}
],
"outputs": {}
}
Let's look at the resource object for our firstVNet resource first. Notice that we specify the settings for firstVNet again in a nested template. This is because Resource Manager doesn't allow two resources with the same name in the same template, and a nested template is treated as a separate template. By specifying our values for the firstSubnet resource again, we tell Resource Manager to update the existing resource instead of deleting it and redeploying it. Finally, our new settings for secondSubnet are picked up during this update.
Once deployment has finished, open the resource group you specified in the portal. You see a virtual network
named firstVNet and a NIC named nic1 . Click firstVNet , then click subnets . You see the firstSubnet that was
originally created, and you see the secondSubnet that was added in the updateVNet resource.
Then, go back to the resource group and click nic1 then click IP configurations . In the IP configurations
section, the subnet is set to firstSubnet (10.0.0.0/24) .
The original firstVNet has been updated instead of re-created. If firstVNet had been re-created, nic1 would not
be associated with firstVNet .
Next steps
Learn how to deploy a resource based on a condition, such as whether a parameter value is present. See
Conditionally deploy a resource in an Azure Resource Manager template.
Conditionally deploy a resource in an Azure Resource
Manager template
12/18/2020 • 3 minutes to read • Edit Online
There are some scenarios in which you need to design your template to deploy a resource based on a condition,
such as whether or not a parameter value is present. For example, your template may deploy a virtual network and
include parameters to specify other virtual networks for peering. If you've not specified any parameter values for
peering, you don't want Resource Manager to deploy the peering resource.
To accomplish this, use the condition element in the resource to test the length of your parameter array. If the
length is zero, return false to prevent deployment, but for all values greater than zero return true to allow
deployment.
Example template
Let's look at an example template that demonstrates this. Our template uses the condition element to control
deployment of the Microsoft.Network/virtualNetworks/virtualNetworkPeerings resource. This resource creates a
peering between two Azure Virtual Networks in the same region.
Let's take a look at each section of the template.
The parameters element defines a single parameter named virtualNetworkPeerings :
{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"virtualNetworkPeerings": {
"type": "array",
"defaultValue": []
}
},
"virtualNetworkPeerings": [
{
"name": "firstVNet/peering1",
"properties": {
"remoteVirtualNetwork": {
"id": "[resourceId('Microsoft.Network/virtualNetworks','secondVNet')]"
},
"allowForwardedTraffic": true,
"allowGatewayTransit": true,
"useRemoteGateways": false
}
}
]
The properties in our parameter specify the settings related to peering virtual networks. We'll provide the values
for these properties when we specify the Microsoft.Network/virtualNetworks/virtualNetworkPeerings resource in
the resources section:
"resources": [
{
"type": "Microsoft.Resources/deployments",
"apiVersion": "2017-05-10",
"name": "[concat('vnp-', copyIndex())]",
"condition": "[greater(length(parameters('virtualNetworkPeerings')), 0)]",
"dependsOn": [
"firstVNet", "secondVNet"
],
"copy": {
"name": "iterator",
"count": "[length(variables('peerings'))]",
"mode": "serial"
},
"properties": {
"mode": "Incremental",
"template": {
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
},
"variables": {
},
"resources": [
{
"type": "Microsoft.Network/virtualNetworks/virtualNetworkPeerings",
"apiVersion": "2016-06-01",
"location": "[resourceGroup().location]",
"name": "[variables('peerings')[copyIndex()].name]",
"properties": "[variables('peerings')[copyIndex()].properties]"
}
],
"outputs": {
}
}
}
}
]
There are a couple of things going on in this part of our template. First, the resource being deployed is an inline deployment of type Microsoft.Resources/deployments, which includes its own template that deploys the Microsoft.Network/virtualNetworks/virtualNetworkPeerings resource.
Our name for the inline template is made unique by concatenating the current iteration of the copyIndex() with
the prefix vnp- .
The condition element specifies that our resource is deployed only when the greater() function evaluates to
true . Here, we're testing whether the length of the virtualNetworkPeerings parameter array is greater than zero. If it is, the expression evaluates
to true and the condition is satisfied. Otherwise, it evaluates to false and the resource is not deployed.
Next, we specify our copy loop. It's a serial loop, which means the iterations run in sequence, with each resource
waiting until the previous resource has been deployed. The count property specifies the number of times the loop
iterates. Normally we'd set it to the length of the virtualNetworkPeerings array, because that array contains the
parameter objects specifying the resources we want to deploy. However, if we do that and the array is empty,
validation fails because Resource Manager notices that we're attempting to access properties that do not exist. We can
work around this. Let's take a look at the variables we'll need:
"variables": {
"workaround": {
"true": "[parameters('virtualNetworkPeerings')]",
"false": [{
"name": "workaround",
"properties": {}
}]
},
"peerings": "[variables('workaround')[string(greater(length(parameters('virtualNetworkPeerings')), 0))]]"
},
Our workaround variable includes two properties, one named true and one named false . The true property
evaluates to the value of the virtualNetworkPeerings parameter array. The false property evaluates to a placeholder
array that includes a single object with the property names that Resource Manager expects to see. Note that false is
still an array, just as our virtualNetworkPeerings parameter is, which satisfies validation.
Our peerings variable uses our workaround variable by once again testing if the length of the
virtualNetworkPeerings parameter array is greater than zero. If it is, the string evaluates to true and the
workaround variable evaluates to the virtualNetworkPeerings parameter array. Otherwise, it evaluates to false
and the workaround variable evaluates to our empty object in the first element of the array.
Now that we've worked around the validation issue, we can simply specify the deployment of the
Microsoft.Network/virtualNetworks/virtualNetworkPeerings resource in the nested template, passing the name and
properties from our virtualNetworkPeerings parameter array. You can see this in the template element nested in
the properties element of our resource.
Next steps
Use objects instead of scalar values as template parameters. See Use an object as a parameter in an Azure
Resource Manager template
Use an object as a parameter in an Azure Resource
Manager template
12/18/2020 • 5 minutes to read • Edit Online
When you author Azure Resource Manager templates, you can either specify resource property values directly in
the template or define a parameter and provide values during deployment. It's fine to use a parameter for each
property value for small deployments, but there is a limit of 255 parameters per deployment. Once you get to
larger and more complex deployments you may run out of parameters.
One way to solve this problem is to use an object as a parameter instead of a value. To do this, define the
parameter in your template and specify a JSON object instead of a single value during deployment. Then, reference
the subproperties of the parameter using the parameters() function and the dot operator in your template.
Let's take a look at an example that deploys a virtual network resource. First, let's specify a VNetSettings
parameter in our template and set the type to object :
...
"parameters": {
"VNetSettings":{"type":"object"}
},
NOTE
To learn how to provide parameter values during deployment, see the parameters section of understand the structure and
syntax of Azure Resource Manager templates.
"parameters":{
"VNetSettings":{
"value":{
"name":"VNet1",
"addressPrefixes": [
{
"name": "firstPrefix",
"addressPrefix": "10.0.0.0/22"
}
],
"subnets":[
{
"name": "firstSubnet",
"addressPrefix": "10.0.0.0/24"
},
{
"name":"secondSubnet",
"addressPrefix":"10.0.1.0/24"
}
]
}
}
}
As you can see, our single parameter actually specifies three subproperties: name , addressPrefixes , and subnets .
Each of these subproperties either specifies a value or other subproperties. The result is that our single parameter
specifies all the values necessary to deploy our virtual network.
Now let's have a look at the rest of our template to see how the VNetSettings object is used:
...
"resources": [
{
"apiVersion": "2015-06-15",
"type": "Microsoft.Network/virtualNetworks",
"name": "[parameters('VNetSettings').name]",
"location":"[resourceGroup().location]",
"properties": {
"addressSpace":{
"addressPrefixes": [
"[parameters('VNetSettings').addressPrefixes[0].addressPrefix]"
]
},
"subnets":[
{
"name":"[parameters('VNetSettings').subnets[0].name]",
"properties": {
"addressPrefix": "[parameters('VNetSettings').subnets[0].addressPrefix]"
}
},
{
"name":"[parameters('VNetSettings').subnets[1].name]",
"properties": {
"addressPrefix": "[parameters('VNetSettings').subnets[1].addressPrefix]"
}
}
]
}
}
]
The values of our VNetSettings object are applied to the properties required by our virtual network resource
using the parameters() function with both the [] array indexer and the dot operator. This approach works if you
just want to statically apply the values of the parameter object to the resource. However, if you want to dynamically
assign an array of property values during deployment you can use a copy loop. To use a copy loop, you provide a
JSON array of resource property values and the copy loop dynamically applies the values to the resource's
properties.
There is one issue to be aware of if you use the dynamic approach. To demonstrate the issue, let's take a look at a
typical array of property values. In this example the values for our properties are stored in a variable. Notice we
have two arrays here—one named firstProperty and one named secondProperty .
"variables": {
"firstProperty": [
{
"name": "A",
"type": "typeA"
},
{
"name": "B",
"type": "typeB"
},
{
"name": "C",
"type": "typeC"
}
],
"secondProperty": [
"one","two", "three"
]
}
Now let's take a look at the way we access the properties in the variable using a copy loop.
{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
...
"copy": {
"name": "copyLoop1",
"count": "[length(variables('firstProperty'))]"
},
...
"properties": {
"name": { "value": "[variables('firstProperty')[copyIndex()].name]" },
"type": { "value": "[variables('firstProperty')[copyIndex()].type]" },
"number": { "value": "[variables('secondProperty')[copyIndex()]]" }
}
}
The copyIndex() function returns the current iteration of the copy loop, and we use that as an index into each of
the two arrays simultaneously.
This works fine when the two arrays are the same length. The issue arises if you've made a mistake and the two
arrays are different lengths—in this case your template will fail validation during deployment. You can avoid this
issue by including all your properties in a single object, because it is much easier to see when a value is missing.
For example, let's take a look at another parameter object in which each element of the propertyObject array is the
union of the firstProperty and secondProperty arrays from earlier.
"variables": {
"propertyObject": [
{
"name": "A",
"type": "typeA",
"number": "one"
},
{
"name": "B",
"type": "typeB",
"number": "two"
},
{
"name": "C",
"type": "typeC"
}
]
}
Notice the third element in the array? It's missing the number property, but it's much easier to notice that you've
missed it when you're authoring the parameter values this way.
Now let's take a look at our template, which deploys a network security group (NSG) and its security rules. Our first
resource, named NSG1 , deploys the NSG. Our second resource, named loop-0 , performs two functions: it depends on
the NSG so its deployment doesn't begin until NSG1 is complete, and it serves as the first iteration of the sequential
loop. Our third resource is a nested template that deploys our security rules, using an object for its parameter values as in the previous example.
{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"networkSecurityGroupsSettings": {"type":"object"}
},
"variables": {},
"resources": [
{
"apiVersion": "2015-06-15",
"type": "Microsoft.Network/networkSecurityGroups",
"name": "NSG1",
"location":"[resourceGroup().location]",
"properties": {
"securityRules":[]
}
},
{
"apiVersion": "2015-01-01",
"type": "Microsoft.Resources/deployments",
"name": "loop-0",
"dependsOn": [
"NSG1"
],
"properties": {
"mode":"Incremental",
"parameters":{},
"template": {
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {},
"variables": {},
"resources": [],
"outputs": {}
}
}
},
{
"apiVersion": "2015-01-01",
"type": "Microsoft.Resources/deployments",
"name": "[concat('loop-', copyIndex(1))]",
"dependsOn": [
"[concat('loop-', copyIndex())]"
],
"copy": {
"name": "iterator",
"count": "[length(parameters('networkSecurityGroupsSettings').securityRules)]"
},
"properties": {
"mode": "Incremental",
"template": {
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {},
"variables": {},
"resources": [
{
"name": "[concat('NSG1/' ,
parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].name)]",
"type": "Microsoft.Network/networkSecurityGroups/securityRules",
"apiVersion": "2016-09-01",
"location":"[resourceGroup().location]",
"properties":{
"description": "
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].description]",
"priority":"
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].priority]",
"protocol":"
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].protocol]",
"sourcePortRange": "
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].sourcePortRange]",
"destinationPortRange": "
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].destinationPortRange]",
"sourceAddressPrefix": "
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].sourceAddressPrefix]",
"destinationAddressPrefix": "
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].destinationAddressPrefix]",
"access":"
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].access]",
"direction":"
[parameters('networkSecurityGroupsSettings').securityRules[copyIndex()].direction]"
}
}
],
"outputs": {}
}
}
}
],
"outputs": {}
}
Let's take a closer look at how we specify our property values in the securityRules child resource. All of our
properties are referenced using the parameters() function, and then we use the dot operator to reference our
securityRules array, indexed by the current value of the iteration. Finally, we use another dot operator to reference
each property of the rule object.
Next steps
Learn how to create a template that iterates through an object array and transforms it into a JSON schema. See
Implement a property transformer and collector in an Azure Resource Manager template
Implement a property transformer and collector in an
Azure Resource Manager template
12/18/2020 • 6 minutes to read • Edit Online
In use an object as a parameter in an Azure Resource Manager template, you learned how to store resource
property values in an object and apply them to a resource during deployment. While this is a very useful way to
manage your parameters, it still requires you to map the object's properties to resource properties each time you
use it in your template.
To work around this, you can implement a property transformer and collector template that iterates over your object array
and transforms it into the JSON schema expected by the resource.
IMPORTANT
This approach requires that you have a deep understanding of Resource Manager templates and functions.
Let's take a look at how we can implement a property collector and transformer with an example that deploys a
network security group. The diagram below shows the relationship between our templates and our resources
within those templates:
Parameter object
We'll be using our securityRules parameter object from the previous article. Our transform template will
transform each object in the securityRules array into the JSON schema expected by the network security group
resource in our calling template.
{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"networkSecurityGroupsSettings": {
"value": {
"securityRules": [
{
"name": "RDPAllow",
"description": "allow RDP connections",
"direction": "Inbound",
"priority": 100,
"sourceAddressPrefix": "*",
"destinationAddressPrefix": "10.0.0.0/24",
"sourcePortRange": "*",
"destinationPortRange": "3389",
"access": "Allow",
"protocol": "Tcp"
},
{
"name": "HTTPAllow",
"description": "allow HTTP connections",
"direction": "Inbound",
"priority": 200,
"sourceAddressPrefix": "*",
"destinationAddressPrefix": "10.0.1.0/24",
"sourcePortRange": "*",
"destinationPortRange": "80",
"access": "Allow",
"protocol": "Tcp"
}
]
}
}
}
}
Transform template
Our transform template includes two parameters that are passed from the collector template :
source is an object that receives one of the property value objects from the property array. In our example,
each object from the "securityRules" array will be passed in one at a time.
state is an array that receives the concatenated results of all the previous transforms. This is the collection of
transformed JSON.
Our parameters look like this:
{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"source": { "type": "object" },
"state": {
"type": "array",
"defaultValue": [ ]
}
},
Our template also defines a variable named instance . It performs the actual transform of our source object into
the required JSON schema:
"variables": {
"instance": [
{
"name": "[parameters('source').name]",
"properties":{
"description": "[parameters('source').description]",
"protocol": "[parameters('source').protocol]",
"sourcePortRange": "[parameters('source').sourcePortRange]",
"destinationPortRange": "[parameters('source').destinationPortRange]",
"sourceAddressPrefix": "[parameters('source').sourceAddressPrefix]",
"destinationAddressPrefix": "[parameters('source').destinationAddressPrefix]",
"access": "[parameters('source').access]",
"priority": "[parameters('source').priority]",
"direction": "[parameters('source').direction]"
}
}
]
},
Finally, the output of our template concatenates the collected transforms of our state parameter with the current
transform performed by our instance variable:
"resources": [],
"outputs": {
"collection": {
"type": "array",
"value": "[concat(parameters('state'), variables('instance'))]"
}
}
}
Next, let's take a look at our collector template to see how it passes in our parameter values.
Collector template
Our collector template includes three parameters:
source is our complete parameter object array. It's passed in by the calling template . This has the same name
as the source parameter in our transform template but there is one key difference that you may have
already noticed: this is the complete array, but we only pass one element of this array to the transform
template at a time.
transformTemplateUri is the URI of our transform template . We're defining it as a parameter here for
template reusability.
state is an initially empty array that we pass to our transform template . It stores the collection of
transformed parameter objects when the copy loop is complete.
Our parameters look like this:
"parameters": {
"source": { "type": "array" },
"transformTemplateUri": { "type": "string" },
"state": {
"type": "array",
"defaultValue": [ ]
}
},
Next, we define a variable named count . Its value is the length of the source parameter object array:
"variables": {
"count": "[length(parameters('source'))]"
},
As you might suspect, we use it for the number of iterations in our copy loop.
Now let's take a look at our resources. We define two resources:
loop-0 is the zero-based resource for our copy loop.
loop- is concatenated with the result of the copyIndex(1) function to generate a unique iteration-based name
for our resource, starting with 1 .
"resources": [
{
"type": "Microsoft.Resources/deployments",
"apiVersion": "2015-01-01",
"name": "loop-0",
"properties": {
"mode": "Incremental",
"parameters": { },
"template": {
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": { },
"variables": { },
"resources": [ ],
"outputs": {
"collection": {
"type": "array",
"value": "[parameters('state')]"
}
}
}
}
},
{
"type": "Microsoft.Resources/deployments",
"apiVersion": "2015-01-01",
"name": "[concat('loop-', copyindex(1))]",
"copy": {
"name": "iterator",
"count": "[variables('count')]",
"mode": "serial"
},
"dependsOn": [
"loop-0"
],
"properties": {
"mode": "Incremental",
"templateLink": { "uri": "[parameters('transformTemplateUri')]" },
"parameters": {
"source": { "value": "[parameters('source')[copyindex()]]" },
"state": { "value": "[reference(concat('loop-', copyindex())).outputs.collection.value]" }
}
}
}
],
Let's take a closer look at the parameters we're passing to our transform template in the nested template. Recall
from earlier that our source parameter passes the current object in the source parameter object array. The
state parameter is where the collection happens, because it takes the output of the previous iteration of our copy
loop—notice that the reference() function uses the copyIndex() function with no parameter to reference the
name of our previous linked template object—and passes it to the current iteration.
Finally, the output of our template returns the output of the last iteration of our transform template :
"outputs": {
"result": {
"type": "array",
"value": "[reference(concat('loop-', variables('count'))).outputs.collection.value]"
}
}
It may seem counterintuitive to return the output of the last iteration of our transform template to our calling
template because it appeared we were storing it in our source parameter. However, remember that it's the last
iteration of our transform template that holds the complete array of transformed property objects, and that's
what we want to return.
Finally, let's take a look at how to call the collector template from our calling template .
Calling template
Our calling template defines a single parameter named networkSecurityGroupsSettings :
...
"parameters": {
"networkSecurityGroupsSettings": {
"type": "object"
}
},
"variables": {
"collectorTemplateUri": "[uri(deployment().properties.templateLink.uri, 'collector.template.json')]"
}
As you would expect, this is the URI for the collector template that will be used by our linked template resource:
{
"apiVersion": "2015-01-01",
"name": "collector",
"type": "Microsoft.Resources/deployments",
"properties": {
"mode": "Incremental",
"templateLink": {
"uri": "[variables('collectorTemplateUri')]",
"contentVersion": "1.0.0.0"
},
"parameters": {
"source" : {"value": "[parameters('networkSecurityGroupsSettings').securityRules]"},
"transformTemplateUri": { "value": "[uri(deployment().properties.templateLink.uri,
'transform.json')]"}
}
}
}
Finally, the networkSecurityGroup1 resource assigns the output of the collector deployment to its securityRules property:
{
"apiVersion": "2015-06-15",
"type": "Microsoft.Network/networkSecurityGroups",
"name": "networkSecurityGroup1",
"location": "[resourceGroup().location]",
"properties": {
"securityRules": "[reference('collector').outputs.result.value]"
}
}
],
"outputs": {
"instance":{
"type": "array",
"value": "[reference('collector').outputs.result.value]"
}
}
}
Microsoft Azure global infrastructure is designed and constructed at every layer to deliver the highest levels of
redundancy and resiliency to its customers. Azure infrastructure is composed of geographies, regions, and
Availability Zones, which limit the blast radius of a failure and therefore limit potential impact to customer
applications and data. The Azure Availability Zones construct was developed to provide a software and networking
solution to protect against datacenter failures and to provide increased high availability (HA) to our customers.
Availability Zones are unique physical locations within an Azure region. Each zone is made up of one or more
datacenters with independent power, cooling, and networking. The physical separation of Availability Zones within
a region limits the impact to applications and data from zone failures, such as large-scale flooding, major storms
and superstorms, and other events that could disrupt site access, safe passage, extended utilities uptime, and the
availability of resources. Availability Zones and their associated datacenters are designed such that if one zone is
compromised, the services, capacity, and availability are supported by the other Availability Zones in the region.
Availability Zones can be used to spread a solution across multiple zones within a region, allowing for an
application to continue functioning when one zone fails. With Availability Zones, Azure offers industry best 99.99%
Virtual Machine (VM) uptime service-level agreement (SLA). Zone-redundant services replicate your services and
data across Availability Zones to protect from single points of failure.
For additional information on Availability Zones, including service support by region and pricing, refer to What are
Availability Zones in Azure? in Microsoft Azure Documentation.
Resilience can be thought of as three layers: your applications (your app or workload architecture), resilient services (Azure capabilities you enable as needed), and a resilient foundation (Azure capabilities built into the platform).
When architecting for resilience, consider all three layers (foundation, services, and applications) to achieve the
highest level of reliability. Since a solution can be made up of many components, each component should be
designed for reliability.
For example, services that can be deployed as zone-redundant include the load balancer, Azure Application Gateway,
Azure Service Bus, virtual private network (VPN) gateway, zone-redundant storage, Azure ExpressRoute, Azure Event Hubs, and Azure Cosmos DB.
A few resources, like the load balancer and subnets, support both zonal and zone-redundant deployments. An
important consideration in HA is distributing the traffic effectively across resources in the different Availability
Zones. For information on how Availability Zones apply to the load balancer resources for both zonal and zone-
redundant resources, refer to Standard Load Balancer and Availability Zones.
The following is a summary of the zonal (Z) and zone-redundant (ZR) Azure services.
Analytics: Azure Data Explorer (ZR), Azure Event Hubs (ZR)
Compute: Linux virtual machines (Z), Windows virtual machines (Z), Virtual machine scale sets (Z, ZR), Azure App Service Environments (Z)
Containers: Azure Kubernetes Service (AKS) (Z), Azure Service Fabric (Z)
Databases: Azure SQL Database (ZR), Azure Cache for Redis (Z, ZR), Azure Cosmos DB (ZR)
DevOps: Azure DevOps (ZR)
Identity: Azure Active Directory Domain Services (ZR)
Integration: Azure Event Grid (ZR), Azure Service Bus (ZR)
Management and governance: Azure Traffic Manager (ZR)
Networking: Azure Load Balancer (Z, ZR), VPN gateway (ZR), Azure ExpressRoute (ZR), Azure Application Gateway (ZR), Azure Firewall (ZR), Azure Virtual WAN (ZR)
Security: Azure Active Directory Domain Services (ZR)
Storage: Azure Data Lake Storage (ZRS), Azure Blob storage (ZRS, ZR), Azure Managed Disks (Z)
Additional capabilities: Azure Premium Files (ZRS), Zone-redundant storage (ZRS), Standard IP address (ZR), Azure Traffic Analytics (ZR)
VMs supporting AZs: AV2-series, B-series, DSv2-series, DSv3-series, Dv2-series, Dv3-series, ESv3-series, Ev3-series,
F-series, FS-series, FSv2-series, M-series.
For a list of Azure services that support Availability Zones, per Azure region, refer to the Availability Zones
documentation.
Availability Zones are unique physical locations within an Azure region. Each zone is made up of one or more
datacenters with independent power, cooling, and networking. The physical separation of availability zones within a
region limits the impact to applications and data from zone failures, such as large-scale flooding, major storms and
superstorms, and other events that would disrupt site access, safe passage, extended utilities uptime, and the availability
of resources. This reference architecture shows best practices for applying Availability Zones to a web application
and a Microsoft SQL Server database hosted on virtual machines (VMs), known as a zonal deployment.
This approach is used in high availability (HA) scenarios with resiliency as the most significant concern. With HA
architecture there is a balance between high resilience, low latency, and cost. This architecture uses redundant
resources spread across zones to provide high resiliency. Traffic can be routed between zones to minimize the
impact of a zonal failure. If a zone does fail, resources in the zone absorb the traffic until the zone recovers. This
provides a high level of resilience.
This architecture makes efficient use of resources because most of them are actively used. All resources handle
requests except the passive SQL Server, which becomes active only if the active SQL Server fails.
The zone-redundant application gateway and zone-redundant load balancer distribute the traffic to the available
resources. Cross-zone routing will have increased latency relative to traffic routed within the zone.
Architecture
The architecture uses resources spread across multiple zones to provide a high availability (HA) architecture for
hosting an Infrastructure as a Service (IaaS) web application and SQL Server database. A zone-redundant
application gateway routes traffic to virtual machines within the web tier. A zone-redundant load balancer routes
traffic from the virtual machines in the web tier to the active SQL Server. In the case of a zonal failure, the
application gateway will route to virtual machines in the other available zones.
If the active SQL Server becomes unavailable, either due to a zone failure or local failure, a passive SQL Server will
become the active SQL Server. The zone-redundant load balancer will detect the failover of SQL Server and route
traffic to the newly active SQL Server.
The following illustrates a failure of Zone 1.
The application gateway is zone-redundant and is not impacted by the failure of Zone 1. The application gateway
uses health probes to determine the available virtual machines. With Zone 1 unavailable, the zone-redundant
application gateway only routes traffic to the remaining two zones. The zone-redundant load balancer is also not
impacted by the failure of Zone 1 and uses health probes to determine the location of the active SQL Server. In this
example, the load balancer will detect that the active SQL Server is in Zone 3 and will then route traffic to the SQL
Server instance in Zone 3.
Spreading resources across Availability Zones also protects an application from planned maintenance. When VMs
are distributed across three Availability Zones, effectively they are spread across three update domains. The Azure
platform recognizes this distribution across update domains to ensure that VMs in different zones are not updated
at the same time.
By replicating VMs across Availability Zones, you can protect your applications and data from a zonal failure. This
allows Azure to support the industry best 99.99% VM uptime service-level agreement (SLA). See Building solutions
for high availability using Availability Zones.
The architecture has the following components.
General
Resource group . Resource groups are used to group Azure resources so they can be managed by lifetime,
owner, or other criteria.
Availability Zones . Availability Zones are physical locations within an Azure region. Each zone consists of
one or more datacenters with independent power, cooling, and networking. By placing VMs across zones, the
application becomes resilient to failures within a zone.
Networking and load balancing
Vir tual network and subnets . Every Azure VM is deployed into a virtual network (VNet) that can be
segmented into subnets. Create a separate subnet for each tier.
Azure Application Gateway . Azure Application Gateway is a layer 7 load balancer. In this architecture, a
zone-redundant application gateway routes HTTP requests to the web front end. Application Gateway also
provides a Web Application Firewall (WAF) that protects the application from common exploits and
vulnerabilities. The v2 SKU of Application Gateway supports cross-zone redundancy. A single Application
Gateway deployment can run multiple instances of the gateway. For production workloads, run at least two
instances. For more information, see Autoscaling and Zone-redundant Application Gateway v2 and How
does Application Gateway support high availability and scalability?.
Azure Load Balancer . Azure Load Balancer is a layer 4 load balancer. In this architecture, a zone-redundant
Azure Standard Load Balancer directs network traffic from the web tier to SQL Server. Because a zone-
redundant load balancer is not pinned to a specific zone, the application will continue to distribute the
network traffic in the case of a zonal failure. A zone-redundant load balancer is used to provide availability in
the case the active SQL Server becomes unavailable. The Standard SKU of Azure Load Balancer supports
cross-zone redundancy. For more information, see Standard Load Balancer and Availability Zones.
Network security groups (NSGs) . A network security group is used to restrict network traffic within the
virtual network. In this architecture, the web tier only accepts traffic from the public IP endpoint. Additionally,
the database tier does not accept traffic from any subnet other than the web-tier subnet.
DDoS protection . The Azure platform provides protection against distributed denial of service (DDoS)
attacks. For additional protection, we recommend using DDoS Protection Standard, which has enhanced
DDoS mitigation features. See Security considerations.
Bastion . Azure Bastion provides secure and seamless Remote Desktop Protocol (RDP) and Secure Shell
(SSH) access to the VMs within the VNet. This provides access while limiting the exposed public IP addresses
of the VMs in the VNet. Azure Bastion is a cost-effective alternative to provisioning a VM for access to all VMs within the same virtual network.
Microsoft SQL Server
SQL Ser ver Always On availability group . SQL Server Always On availability group provides high
availability at the data tier by enabling replication and failover. It uses Windows Server Failover Cluster
(WSFC) technology for failover.
Cloud Witness . A failover cluster requires more than half of its nodes to be running, which is known as
having quorum. If the cluster has just two nodes, a network partition could cause each node to think it's the
primary node. In that case, you need a witness to break ties and establish quorum. A witness is a resource
such as a shared disk that can act as a tie breaker to establish quorum. Cloud Witness is a type of witness
that uses Azure Blob Storage. The Azure Blob Storage must use Zone Redundant Storage (ZRS) to not be
impacted by zonal failure.
To learn more about the concept of quorum, see Understanding cluster and pool quorum. For more
information about Cloud Witness, see Deploy a Cloud Witness for a Failover Cluster.
Recommendations
Your requirements might differ from the architecture described here. Use these recommendations as a starting
point.
For recommendations on configuring the VMs, see Run a Windows VM on Azure.
For more information about designing virtual networks and subnets, see Plan and design Azure Virtual Networks.
Network security groups
Use NSG rules to restrict traffic between tiers. In the architecture shown earlier, only the web tier can communicate
directly with the database tier. To enforce this rule, the database tier should block all incoming traffic except for the
web-tier subnet.
Deny all inbound traffic from the virtual network. (Use the VIRTUAL_NETWORK tag in the rule.)
Allow inbound traffic from the web-tier subnet.
Allow inbound traffic from the database-tier subnet itself. This rule allows communication between the
database VMs, which is needed for database replication and failover.
Create rules 2 and 3 with higher priority than the first rule, so they override it.
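As a minimal sketch of these rules (rule names, priorities, and subnet address prefixes are illustrative assumptions, not values taken from this architecture; the VirtualNetwork service tag corresponds to the VIRTUAL_NETWORK tag mentioned above), the database-tier NSG might define:
"securityRules": [
{
"name": "deny-vnet-inbound",
"properties": {
"priority": 300,
"direction": "Inbound",
"access": "Deny",
"protocol": "*",
"sourceAddressPrefix": "VirtualNetwork",
"sourcePortRange": "*",
"destinationAddressPrefix": "*",
"destinationPortRange": "*"
}
},
{
"name": "allow-web-tier-inbound",
"properties": {
"priority": 100,
"direction": "Inbound",
"access": "Allow",
"protocol": "*",
"sourceAddressPrefix": "10.0.1.0/24",
"sourcePortRange": "*",
"destinationAddressPrefix": "*",
"destinationPortRange": "*"
}
},
{
"name": "allow-db-subnet-inbound",
"properties": {
"priority": 200,
"direction": "Inbound",
"access": "Allow",
"protocol": "*",
"sourceAddressPrefix": "10.0.2.0/24",
"sourcePortRange": "*",
"destinationAddressPrefix": "*",
"destinationPortRange": "*"
}
}
]
Because lower numbers indicate higher priority in an NSG, the two allow rules (100 and 200) override the deny rule (300).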
SQL Server Always On availability groups
We recommend Always On availability groups for Microsoft SQL Server high availability. Other tiers connect to the
database through an availability group listener. The listener enables a SQL client to connect without knowing the
name of the physical instance of SQL Server. VMs that access the database must be joined to the domain. The client
(in this case, another tier) uses DNS to resolve the listener's virtual network name into IP addresses.
Configure the SQL Server Always On availability group as follows:
Create a Windows Server Failover Clustering (WSFC) cluster, a SQL Server Always On availability group, and
a primary replica. For more information, see Getting Started with Always On availability groups.
Create an internal load balancer with a static private IP address.
Create an availability group listener and map the listener's DNS name to the IP address of an internal load
balancer.
Create a load balancer rule for the SQL Server listening port (TCP port 1433 by default). The load balancer
rule must enable floating IP, also called Direct Server Return. This causes the VM to reply directly to the client,
which enables a direct connection to the primary replica.
NOTE
When floating IP is enabled, the front-end port number must be the same as the back-end port number in the load balancer
rule.
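As a minimal sketch (the rule name and the elided id values are illustrative assumptions, not taken from the reference deployment), the load-balancing rule for the availability group listener might look like this in an ARM template, with matching front-end and back-end ports and floating IP enabled:
"loadBalancingRules": [
{
"name": "sqlAlwaysOnListenerRule",
"properties": {
"protocol": "Tcp",
"frontendPort": 1433,
"backendPort": 1433,
"enableFloatingIP": true,
"frontendIPConfiguration": { "id": "..." },
"backendAddressPool": { "id": "..." },
"probe": { "id": "..." }
}
}
]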
When a SQL client tries to connect, the load balancer routes the connection request to the primary replica. If there
is a failover to another replica, the load balancer automatically routes new requests to a new primary replica. For
more information, see Configure a load balancer for an availability group on Azure SQL Server VMs.
During a failover, existing client connections are closed. After the failover completes, new connections will be routed
to the new primary replica.
If your application makes significantly more reads than writes, you can offload some of the read-only queries to a
secondary replica. See Connect to a read-only replica.
Test your deployment by forcing a manual failover of the availability group.
Availability considerations
Availability Zones provide high resilience within a single region. If you need even higher availability, consider
replicating the application across two regions, using Azure Traffic Manager for failover. For more information, see
Run an N-tier application in multiple Azure regions for high availability.
Not all regions support Availability Zones, and not all VM sizes are supported in all zones. Run the following Azure
CLI command to find the supported zones for each VM size within a region:
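The command itself isn't reproduced in this copy. One way to get this information, assuming a current version of the Azure CLI (the region name here is only an example), is:
az vm list-skus --location eastus2 --zone --output table
The --zone flag limits the output to VM sizes that support Availability Zones in the specified region.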
Cost considerations
Use the Azure Pricing Calculator to estimate costs. Here are some other considerations.
Virtual machine scale sets
Virtual machine scale sets are available on all Windows VM sizes. You are only charged for the Azure VMs you
deploy, and any additional underlying infrastructure resources consumed, such as storage and networking. There
are no incremental charges for the virtual machine scale sets service.
For single VMs pricing options, see Windows VMs pricing.
SQL Server
If you choose Azure SQL DBaaS, you can save on cost because you don't need to configure an Always On
availability group and domain controller machines. There are several deployment options starting from single
database up to managed instance, or elastic pools. For more information, see Azure SQL pricing.
For SQL server VMs pricing options, see SQL VMs pricing.
Azure Load Balancer
You are charged only for the number of configured load-balancing and outbound rules. Inbound NAT rules are free.
There is no hourly charge for the Standard Load Balancer when no rules are configured.
For more information, see the cost section in Azure Architecture Framework.
Azure Application Gateway
The Application Gateway should be provisioned with the v2 SKU and can span multiple Availability Zones. With the
v2 SKU, the pricing model is driven by consumption and has two components: Hourly fixed price and a
consumption-based cost.
For more information, see the pricing section in Autoscaling and Zone-redundant Application Gateway v2.
Security considerations
Virtual networks are a traffic isolation boundary in Azure. By default, VMs in one virtual network can't
communicate directly with VMs in a different virtual network. However, you can explicitly connect virtual networks
by using virtual network peering.
Network security groups
Use network security groups (NSGs) to restrict traffic to and from the internet. For more information, see Microsoft
cloud services and network security.
DMZ
Consider adding a network virtual appliance (NVA) to create a DMZ between the internet and the Azure virtual
network. NVA is a generic term for a virtual appliance that can perform network-related tasks, such as firewall,
packet inspection, auditing, and custom routing. For more information, see Network DMZ between Azure and an
on-premises datacenter.
Encryption
Encrypt sensitive data at rest and use Azure Key Vault to manage the database encryption keys. Key Vault can store
encryption keys in hardware security modules (HSMs). For more information, see Configure Azure Key Vault
Integration for SQL Server on Azure VMs. It's also recommended to store application secrets, such as database
connection strings, in Key Vault.
DDoS protection
The Azure platform provides basic DDoS protection by default. This basic protection is targeted at protecting the
Azure infrastructure. Although basic DDoS protection is automatically enabled, we recommend using DDoS
Protection Standard. Standard protection uses adaptive tuning, based on your application's network traffic patterns,
to detect threats. This allows it to apply mitigations against DDoS attacks that might go unnoticed by the
infrastructure-wide DDoS policies. Standard protection also provides alerting, telemetry, and analytics through
Azure Monitor. For more information, see Azure DDoS Protection: Best practices and reference architectures.
Next steps
Microsoft Learn module: Tour the N-tier architecture style
Manage identity in multitenant applications
12/18/2020 • 3 minutes to read • Edit Online
This series of articles describes best practices for multitenancy when using Azure AD for authentication and identity management.
Sample code
When you're building a multitenant application, one of the first challenges is managing user identities, because
now every user belongs to a tenant. For example:
Users sign in with their organizational credentials.
Users should have access to their organization's data, but not data that belongs to other tenants.
An organization can sign up for the application, and then assign application roles to its members.
Azure Active Directory (Azure AD) has some great features that support all of these scenarios.
To accompany this series of articles, we created a complete end-to-end implementation of a multitenant
application. The articles reflect what we learned in the process of building the application. To get started with the
application, see the GitHub readme.
Introduction
Let's say you're writing an enterprise SaaS application to be hosted in the cloud. Of course, the application will have
users:
Example: Tailspin sells subscriptions to its SaaS application. Contoso and Fabrikam sign up for the app. When Alice (
alice@contoso ) signs in, the application should know that Alice is part of Contoso.
What is multitenancy?
A tenant is a group of users. In a SaaS application, the tenant is a subscriber or customer of the application.
Multitenancy is an architecture where multiple tenants share the same physical instance of the app. Although
tenants share physical resources (such as VMs or storage), each tenant gets its own logical instance of the app.
Typically, application data is shared among the users within a tenant, but not with other tenants.
Compare this architecture with a single-tenant architecture, where each tenant has a dedicated physical instance. In
a single-tenant architecture, you add tenants by spinning up new instances of the app.
Any request can be routed to any instance. Together, the system functions as a single logical instance. You can tear
down a VM or spin up a new VM, without affecting users. In this architecture, each physical instance is multitenant,
and you scale by adding more instances. If one instance goes down, it should not affect any tenant.
Sample code
Tailspin is a fictional company that is developing a SaaS application named Surveys. This application enables
organizations to create and publish online surveys.
An organization can sign up for the application.
After the organization is signed up, users can sign into the application with their organizational credentials.
Users can create, edit, and publish surveys.
NOTE
To get started with the application, see the GitHub readme.
Note that Alice signs into her own tenant, not as a guest of the Contoso tenant. Alice has contributor permissions
only for that survey — she cannot view other surveys from the Contoso tenant.
Architecture
The Surveys application consists of a web front end and a web API backend. Both are implemented using ASP.NET
Core.
The web application uses Azure Active Directory (Azure AD) to authenticate users. The web application also calls
Azure AD to get OAuth 2 access tokens for the Web API. Access tokens are cached in Azure Cache for Redis. The
cache enables multiple instances to share the same token cache (for example, in a server farm).
The diagram shows components in boxes, interacting with other components via two-way arrows. The Surveys
web application authenticates with Azure AD to get access tokens for the web API, and caches the tokens in the
Azure Cache for Redis access token cache.
Next
Authenticate using Azure AD and OpenID Connect
12/18/2020 • 7 minutes to read • Edit Online
Sample code
The Surveys application uses the OpenID Connect (OIDC) protocol to authenticate users with Azure Active
Directory (Azure AD). The Surveys application uses ASP.NET Core, which has built-in middleware for OIDC. The
following diagram shows what happens when the user signs in, at a high level.
1. The user clicks the "sign in" button in the app. This action is handled by an MVC controller.
2. The MVC controller returns a ChallengeResult action.
3. The middleware intercepts the ChallengeResult and creates a 302 response, which redirects the user to the
Azure AD sign-in page.
4. The user authenticates with Azure AD.
5. Azure AD sends an ID token to the application.
6. The middleware validates the ID token. At this point, the user is now authenticated inside the application.
7. The middleware redirects the user back to the application.
The application configures the OpenID Connect middleware with options like the following:
app.AddAuthentication().AddOpenIdConnect(options => {
options.ClientId = configOptions.AzureAd.ClientId;
options.ClientSecret = configOptions.AzureAd.ClientSecret; // for code flow
options.Authority = Constants.AuthEndpointPrefix;
options.ResponseType = OpenIdConnectResponseType.CodeIdToken;
options.PostLogoutRedirectUri = configOptions.AzureAd.PostLogoutRedirectUri;
options.SignInScheme = CookieAuthenticationDefaults.AuthenticationScheme;
options.TokenValidationParameters = new TokenValidationParameters { ValidateIssuer = false };
options.Events = new SurveyAuthenticationEvents(configOptions.AzureAd, loggerFactory);
});
Notice that some of the settings are taken from runtime configuration options. Here's what the middleware options
mean:
ClientId . The application's client ID, which you got when you registered the application in Azure AD.
Authority . For a multitenant application, set this to https://login.microsoftonline.com/common/ . This is the URL
for the Azure AD common endpoint, which enables users from any Azure AD tenant to sign in. For more
information about the common endpoint, see this blog post.
In TokenValidationParameters , set ValidateIssuer to false. That means the app will be responsible for
validating the issuer value in the ID token. (The middleware still validates the token itself.) For more information
about validating the issuer, see Issuer validation.
PostLogoutRedirectUri . Specify a URL to redirect users after the sign out. This should be a page that allows
anonymous requests — typically the home page.
SignInScheme . Set this to CookieAuthenticationDefaults.AuthenticationScheme . This setting means that after
the user is authenticated, the user claims are stored locally in a cookie. This cookie is how the user stays logged
in during the browser session.
Events. Event callbacks; see Authentication events.
Also add the Cookie Authentication middleware to the pipeline. This middleware is responsible for writing the user
claims to a cookie, and then reading the cookie during subsequent page loads.
app.AddAuthentication().AddCookie(options => {
options.AutomaticAuthenticate = true;
options.AutomaticChallenge = true;
options.AccessDeniedPath = "/Home/Forbidden";
options.CookieSecure = CookieSecurePolicy.Always;
// The default setting for cookie expiration is 14 days. SlidingExpiration is set to true by default
options.ExpireTimeSpan = TimeSpan.FromHours(1);
options.SlidingExpiration = true;
});
Returning a ChallengeResult causes the middleware to return a 302 (Found) response that redirects to the authentication endpoint.
In this diagram, there are two MVC controllers. The Account controller handles sign-in requests, and the Home
controller serves up the home page.
Here is the authentication process:
1. The user clicks the "Sign in" button, and the browser sends a GET request. For example: GET /Account/SignIn/ .
2. The account controller returns a ChallengeResult .
3. The OIDC middleware returns an HTTP 302 response, redirecting to Azure AD.
4. The browser sends the authentication request to Azure AD
5. The user signs in to Azure AD, and Azure AD sends back an authentication response.
6. The OIDC middleware creates a claims principal and passes it to the Cookie Authentication middleware.
7. The cookie middleware serializes the claims principal and sets a cookie.
8. The OIDC middleware redirects to the application's callback URL.
9. The browser follows the redirect, sending the cookie in the request.
10. The cookie middleware deserializes the cookie to a claims principal and sets HttpContext.User equal to the
claims principal. The request is routed to an MVC controller.
Authentication ticket
If authentication succeeds, the OIDC middleware creates an authentication ticket, which contains a claims principal
that holds the user's claims.
NOTE
Until the entire authentication flow is completed, HttpContext.User still holds an anonymous principal, not the
authenticated user. The anonymous principal has an empty claims collection. After authentication completes and the app
redirects, the cookie middleware deserializes the authentication cookie and sets HttpContext.User to a claims principal that
represents the authenticated user.
Authentication events
During the authentication process, the OpenID Connect middleware raises a series of events:
RedirectToIdentityProvider . Called right before the middleware redirects to the authentication endpoint. You
can use this event to modify the redirect URL; for example, to add request parameters. See Adding the admin
consent prompt for an example.
AuthorizationCodeReceived . Called with the authorization code.
TokenResponseReceived . Called after the middleware gets an access token from the IDP, but before it is
validated. Applies only to authorization code flow.
TokenValidated . Called after the middleware validates the ID token. At this point, the application has a set of
validated claims about the user. You can use this event to perform additional validation on the claims, or to
transform claims. See Working with claims.
UserInformationReceived . Called if the middleware gets the user profile from the user info endpoint. Applies
only to authorization code flow, and only when GetClaimsFromUserInfoEndpoint = true in the middleware
options.
TicketReceived . Called when authentication is completed. This is the last event, assuming that authentication
succeeds. After this event is handled, the user is signed into the app.
AuthenticationFailed . Called if authentication fails. Use this event to handle authentication failures — for
example, by redirecting to an error page.
To provide callbacks for these events, set the Events option on the middleware. There are two different ways to
declare the event handlers: Inline with lambdas, or in a class that derives from OpenIdConnectEvents . The
second approach is recommended if your event callbacks have any substantial logic, so they don't clutter your
startup class. Our reference implementation uses this approach.
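As a minimal sketch of the second approach (the class name is an illustrative assumption, not the reference implementation's class; the Microsoft.AspNetCore.Authentication.OpenIdConnect and System.Threading.Tasks namespaces are assumed):
public class SampleAuthenticationEvents : OpenIdConnectEvents
{
    // Called after the middleware validates the ID token; a convenient place for
    // additional validation or claims transformation.
    public override Task TokenValidated(TokenValidatedContext context)
    {
        // Inspect or modify context.Principal here before the user is signed in.
        return base.TokenValidated(context);
    }
}
The derived class is then assigned to the Events option on the middleware, as the reference implementation does with its own events class.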
OpenID Connect endpoints
Azure AD supports OpenID Connect Discovery, wherein the identity provider (IDP) returns a JSON metadata
document from a well-known endpoint. The metadata document contains information such as:
The URL of the authorization endpoint. This is where the app redirects to authenticate the user.
The URL of the "end session" endpoint, where the app goes to log out the user.
The URL to get the signing keys, which the client uses to validate the OIDC tokens that it gets from the IDP.
By default, the OIDC middleware knows how to fetch this metadata. Set the Authority option in the middleware,
and the middleware constructs the URL for the metadata. (You can override the metadata URL by setting the
MetadataAddress option.)
OpenID Connect flows
By default, the OIDC middleware uses hybrid flow with form post response mode.
Hybrid flow means the client can get an ID token and an authorization code in the same round-trip to the
authorization server.
Form post response mode means the authorization server uses an HTTP POST request to send the ID token and
authorization code to the app. The values are form-urlencoded (content type = "application/x-www-form-
urlencoded").
When the OIDC middleware redirects to the authorization endpoint, the redirect URL includes all of the query
string parameters needed by OIDC. For hybrid flow:
client_id. This value is set in the ClientId option.
scope = "openid profile", which means it's an OIDC request and we want the user's profile.
response_type = "code id_token". This specifies hybrid flow.
response_mode = "form_post". This specifies form post response.
To specify a different flow, set the ResponseType property on the options. For example:
app.AddAuthentication().AddOpenIdConnect(options =>
{
options.ResponseType = "code"; // Authorization code flow
// Other options
}
Next
Work with claims-based identities
12/18/2020 • 5 minutes to read • Edit Online
Sample code
Claims in Azure AD
When a user signs in, Azure AD sends an ID token that contains a set of claims about the user. A claim is simply a
piece of information, expressed as a key/value pair. For example, email = bob@contoso.com . Claims have an issuer
— in this case, Azure AD — which is the entity that authenticates the user and creates the claims. You trust the
claims because you trust the issuer. (Conversely, if you don't trust the issuer, don't trust the claims!)
At a high level:
1. The user authenticates.
2. The Identity Provider (IDP) sends a set of claims.
3. The app normalizes or augments the claims (optional).
4. The app uses the claims to make authorization decisions.
In OpenID Connect, the set of claims that you get is controlled by the scope parameter of the authentication
request. However, Azure AD issues a limited set of claims through OpenID Connect; see Supported Token and
Claim Types. If you want more information about the user, you'll need to use the Azure AD Graph API.
Here are some of the claims from Azure AD that an app might typically care about:
Claim type in ID token | Description
aud | Who the token was issued for. This will be the application's client ID. Generally, you shouldn't need to worry about this claim, because the middleware automatically validates it. Example: "91464657-d17a-4327-91f3-2ed99386406f"
oid | The object identifier for the user in Azure AD. This value is the immutable and non-reusable identifier of the user. Use this value, not email, as a unique identifier for users; email addresses can change. If you use the Azure AD Graph API in your app, the object ID is the value used to query profile information. Example: "59f9d2dc-995a-4ddf-915e-b3bb314a7fa4"
tid | Tenant ID. This value is a unique identifier for the tenant in Azure AD. Example: "b9bd2162-77ac-4fb2-8254-5c36e9c0a9c4"
This table lists the claim types as they appear in the ID token. In ASP.NET Core, the OpenID Connect middleware
converts some of the claim types when it populates the Claims collection for the user principal:
oid > http://schemas.microsoft.com/identity/claims/objectidentifier
tid > http://schemas.microsoft.com/identity/claims/tenantid
unique_name > http://schemas.xmlsoap.org/ws/2005/05/identity/claims/name
upn > http://schemas.xmlsoap.org/ws/2005/05/identity/claims/upn
Claims transformations
During the authentication flow, you might want to modify the claims that you get from the IDP. In ASP.NET Core,
you can perform claims transformation inside of the AuthenticationValidated event from the OpenID Connect
middleware. (See Authentication events.)
Any claims that you add during AuthenticationValidated are stored in the session authentication cookie. They
don't get pushed back to Azure AD.
Here are some examples of claims transformation:
Claims normalization , or making claims consistent across users. This is particularly relevant if you are
getting claims from multiple IDPs, which might use different claim types for similar information. For
example, Azure AD sends a "upn" claim that contains the user's email. Other IDPs might send an "email"
claim. The sketch after this list shows code that converts the "upn" claim into an "email" claim.
Add default claim values for claims that aren't present — for example, assigning a user to a default role.
In some cases this can simplify authorization logic.
Add custom claim types with application-specific information about the user. For example, you might
store some information about the user in a database. You could add a custom claim with this information to
the authentication ticket. The claim is stored in a cookie, so you only need to get it from the database once
per login session. On the other hand, you also want to avoid creating excessively large cookies, so you need
to consider the trade-off between cookie size versus database lookups.
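Here is a minimal sketch of the normalization example mentioned above. It assumes the newer TokenValidated event name; the sample's AuthenticationValidated event plays the same role in older middleware versions.
public override Task TokenValidated(TokenValidatedContext context)
{
    var identity = (ClaimsIdentity)context.Principal.Identity;

    // Normalize: copy the "upn" claim into an "email" claim if no email claim is present.
    var upn = identity.FindFirst(ClaimTypes.Upn);
    if (upn != null && identity.FindFirst(ClaimTypes.Email) == null)
    {
        identity.AddClaim(new Claim(ClaimTypes.Email, upn.Value));
    }
    return Task.CompletedTask;
}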
After the authentication flow is complete, the claims are available in HttpContext.User . At that point, you should
treat them as a read-only collection—for example, using them to make authorization decisions.
Issuer validation
In OpenID Connect, the issuer claim ("iss") identifies the IDP that issued the ID token. Part of the OIDC
authentication flow is to verify that the issuer claim matches the actual issuer. The OIDC middleware handles this
for you.
In Azure AD, the issuer value is unique per AD tenant ( https://sts.windows.net/<tenantID> ). Therefore, an
application should do an additional check, to make sure the issuer represents a tenant that is allowed to sign in to
the app.
For a single-tenant application, you can just check that the issuer is your own tenant. In fact, the OIDC middleware
does this automatically by default. In a multitenant app, you need to allow for multiple issuers, corresponding to
the different tenants. Here is a general approach to use:
In the OIDC middleware options, set ValidateIssuer to false. This turns off the automatic check.
When a tenant signs up, store the tenant and the issuer in your user DB.
Whenever a user signs in, look up the issuer in the database. If the issuer isn't found, it means that tenant hasn't
signed up. You can redirect them to a sign up page.
You could also block certain tenants; for example, for customers that didn't pay their subscription.
For a more detailed discussion, see Sign-up and tenant onboarding in a multitenant application.
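For example, here is a minimal sketch of a role-claim check inside an MVC controller, where User is the signed-in ClaimsPrincipal:
if (User.HasClaim(ClaimTypes.Role, "Admin"))
{
    // The user has at least one Role claim whose value is "Admin".
}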
This code checks whether the user has a Role claim with the value "Admin". It correctly handles the case
where the user has no Role claim or multiple Role claims.
The ClaimTypes class defines constants for commonly used claim types. However, you can use any string
value for the claim type.
To get a single value for a claim type, when you expect there to be at most one value:
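A minimal sketch, using FindFirst so that a missing claim simply yields null:
// Returns the claim value, or null if the user has no claim of that type.
string email = User.FindFirst(ClaimTypes.Email)?.Value;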
For more information, see Role-based and resource-based authorization in multitenant applications.
Next
Tenant sign-up and onboarding
12/18/2020 • 6 minutes to read • Edit Online
Sample code
This article describes how to implement a sign-up process in a multitenant application, which allows a customer to
sign up their organization for your application.
There are several reasons to implement a sign-up process:
Allow an AD admin to consent for the customer's entire organization to use the application.
Collect credit card payment or other customer information.
Perform any one-time per-tenant setup needed by your application.
When the first user from a new tenant signs up, Azure AD prompts an administrator to consent on behalf of the
entire organization. After the admin clicks Accept, other users within the same tenant can sign in, and Azure AD
will skip the consent screen.
Only an AD administrator can give admin consent, because it grants permission on behalf of the entire
organization. If a non-administrator tries to authenticate with the admin consent flow, Azure AD displays an error:
If the application requires additional permissions at a later point, the customer will need to sign up again and
consent to the updated permissions.
Here is the SignUp action in the Surveys web application:
[AllowAnonymous]
public IActionResult SignUp()
{
var state = new Dictionary<string, string> { { "signup", "true" }};
return new ChallengeResult(
OpenIdConnectDefaults.AuthenticationScheme,
new AuthenticationProperties(state)
{
RedirectUri = Url.Action(nameof(SignUpCallback), "Account")
});
}
Like SignIn , the SignUp action also returns a ChallengeResult . But this time, we add a piece of state information
to the AuthenticationProperties in the ChallengeResult :
signup: A Boolean flag, indicating that the user has started the sign-up process.
The state information in AuthenticationProperties gets added to the OpenID Connect state parameter, which
round trips during the authentication flow.
After the user authenticates in Azure AD and gets redirected back to the application, the authentication ticket
contains the state. We are using this fact to make sure the "signup" value persists across the entire authentication
flow.
The Surveys application adds the prompt during the RedirectToAuthenticationEndpoint event. This event is called
right before the middleware redirects to the authentication endpoint.
    if (context.IsSigningUp())
        context.ProtocolMessage.Prompt = "admin_consent"; // request admin consent during sign-up
    _logger.RedirectToIdentityProvider();
    return Task.FromResult(0);
}
Setting ProtocolMessage.Prompt tells the middleware to add the "prompt" parameter to the authentication request.
Note that the prompt is only needed during sign-up. Regular sign-in should not include it. To distinguish between
them, we check for the signup value in the authentication state. The following extension method checks for this
condition:
internal static bool IsSigningUp(this BaseControlContext context)
{
    string signupValue;
    // Check the authentication ticket's properties for the "signup" state value.
    if ((context.Ticket == null) ||
        (!context.Ticket.Properties.Items.TryGetValue("signup", out signupValue)))
    {
        return false;
    }
    // Convert the stored string into a Boolean.
    bool isSigningUp;
    if (!bool.TryParse(signupValue, out isSigningUp))
    {
        return false;
    }
    return isSigningUp;
}
Registering a tenant
The Surveys application stores some information about each tenant and user in the application database.
In the Tenant table, IssuerValue is the value of the issuer claim for the tenant. For Azure AD, this is
https://sts.windows.net/<tenantID>, which gives a unique value per tenant.
When a new tenant signs up, the Surveys application writes a tenant record to the database. This happens inside
the AuthenticationValidated event. (Don't do it before this event, because the ID token won't be validated yet, so
you can't trust the claim values. See Authentication.)
Here is the relevant code from the Surveys application:
if (context.IsSigningUp())
{
if (tenant == null)
{
tenant = await SignUpTenantAsync(context, tenantManager)
.ConfigureAwait(false);
}
// In this case, we need to go ahead and set up the user signing us up.
await CreateOrUpdateUserAsync(context.Ticket, userManager, tenant)
.ConfigureAwait(false);
}
else
{
if (tenant == null)
{
_logger.UnregisteredUserSignInAttempted(userId, issuerValue);
throw new SecurityTokenValidationException($"Tenant {issuerValue} is not registered");
}
// ...
}
Here is the SignUpTenantAsync method that adds the tenant to the database.
private async Task<Tenant> SignUpTenantAsync(BaseControlContext context, TenantManager tenantManager)
{
    Guard.ArgumentNotNull(context, nameof(context));
    Guard.ArgumentNotNull(tenantManager, nameof(tenantManager));

    // Build the new tenant record from the issuer claim of the signed-in principal.
    var principal = context.Ticket.Principal;
    var issuerValue = principal.GetIssuerValue();
    var tenant = new Tenant { IssuerValue = issuerValue, Created = DateTimeOffset.UtcNow }; // Created shown for illustration

    try
    {
        await tenantManager.CreateAsync(tenant)
            .ConfigureAwait(false);
    }
    catch (Exception ex)
    {
        _logger.SignUpTenantFailed(principal.GetObjectIdentifierValue(), issuerValue, ex);
        throw;
    }
    return tenant;
}
Sample code
Application roles are used to assign permissions to users. For example, the Tailspin Surveys application defines the
following roles:
Administrator. Can perform all CRUD operations on any survey that belongs to that tenant.
Creator. Can create new surveys.
Reader. Can read any surveys that belong to that tenant.
You can see that roles ultimately get translated into permissions, during authorization. But the first question is how
to assign and manage roles. We identified three main options:
Azure AD App Roles
Azure AD security groups
Application role manager.
NOTE
If the customer has Azure AD Premium, the admin can assign a security group to a role, and user members of the group will
inherit the app role. This is a convenient way to manage roles, because the group owner doesn't need to be an admin or app
owner.
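Define the roles in the app registration's manifest. As a hedged sketch, an appRoles entry might look like the following; the GUID shown is a placeholder, and the names follow the Surveys sample's conventions:
"appRoles": [
  {
    "allowedMemberTypes": [ "User" ],
    "description": "Creators can create new surveys.",
    "displayName": "SurveyCreator",
    "id": "00000000-0000-0000-0000-000000000000",
    "isEnabled": true,
    "value": "SurveyCreator"
  }
]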
The value property appears in the role claim. The id property is the unique identifier for the defined role. Always
generate a new GUID value for id .
Assign users . When a new customer signs up, the application is registered in the customer's Azure AD tenant. At
this point, an Azure AD admin for that tenant or an app owner (under Enterprise apps) can assign app roles to
users.
NOTE
As noted earlier, customers with Azure AD Premium can also assign app roles to security groups.
The following screenshot from the Azure portal shows users and groups for the Survey application. Admin and
Creator are groups, assigned the SurveyAdmin and SurveyCreator app roles, respectively. Alice is a user who was
assigned the SurveyAdmin app role directly. Bob and Charles are users that have not been directly assigned an app
role.
As shown in the following screenshot, Charles is part of the Admin group, so he inherits the SurveyAdmin role. Bob
has not yet been assigned an app role.
NOTE
An alternative approach is for the application to assign app roles programmatically, using the Azure AD Graph API. However,
this requires the application to obtain write permissions for the customer's Azure AD directory, which is a high privilege that
is usually unnecessary.
Get role claims . When a user signs in, the application receives the user's assigned role(s) in a claim with type
http://schemas.microsoft.com/ws/2008/06/identity/claims/role (the roles claim in a JWT token).
A user can be assigned multiple roles, or no role. In your authorization code, don't assume the user has exactly one
role claim. Instead, write code that checks whether a particular claim value is present:
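A minimal sketch of such a check, inside an authorization handler; Roles.SurveyAdmin is an application-defined constant that matches the role's value property:
if (context.User.HasClaim(ClaimTypes.Role, Roles.SurveyAdmin))
{
    // The user holds the SurveyAdmin app role (possibly among others).
}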
If you use Azure AD security groups to manage roles instead, set groupMembershipClaims in the application
manifest so that Azure AD includes the user's group memberships as claims:
{
// ...
"groupMembershipClaims": "SecurityGroup",
}
When a new customer signs up, the application instructs the customer to create security groups for the roles
needed by the application. The customer then needs to enter the group object IDs into the application. The
application stores these in a table that maps group IDs to application roles, per tenant.
NOTE
Alternatively, the application could create the groups programmatically, using the Microsoft Graph API. This could be less
error prone, but requires the application to obtain privileged read/write permissions for the customer's directory. Many
customers might be unwilling to grant this level of access.
Authorization policies should use the custom role claim, not the group claim.
Sample code
Our reference implementation is an ASP.NET Core application. In this article we'll look at two general approaches
to authorization, using the authorization APIs provided in ASP.NET Core.
Role-based authorization . Authorizing an action based on the roles assigned to a user. For example, some
actions require an administrator role.
Resource-based authorization . Authorizing an action based on a particular resource. For example, every
resource has an owner. The owner can delete the resource; other users cannot.
A typical app will employ a mix of both. For example, to delete a resource, the user must be the resource owner or
an admin.
Role-based authorization
The Tailspin Surveys application defines the following roles:
Administrator. Can perform all CRUD operations on any survey that belongs to that tenant.
Creator. Can create new surveys.
Reader. Can read any surveys that belong to that tenant.
Roles apply to users of the application. In the Surveys application, a user is either an administrator, creator, or
reader.
For a discussion of how to define and manage roles, see Application roles.
Regardless of how you manage the roles, your authorization code will look similar. ASP.NET Core has an
abstraction called authorization policies. With this feature, you define authorization policies in code, and then apply
those policies to controller actions. The policy is decoupled from the controller.
Create policies
To define a policy, first create a class that implements IAuthorizationRequirement . It's easiest to derive from
AuthorizationHandler . In the Handle method, examine the relevant claim(s).
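Here is a hedged sketch of a requirement and handler pair. The requirement and role names follow the Surveys sample's conventions; the exact base class and method names depend on the ASP.NET Core version (newer versions use AuthorizationHandler<T> with HandleRequirementAsync):
public class SurveyCreatorRequirement : AuthorizationHandler<SurveyCreatorRequirement>, IAuthorizationRequirement
{
    protected override Task HandleRequirementAsync(AuthorizationHandlerContext context,
        SurveyCreatorRequirement requirement)
    {
        // Succeed if the user has the SurveyAdmin or SurveyCreator role claim.
        if (context.User.HasClaim(ClaimTypes.Role, Roles.SurveyAdmin) ||
            context.User.HasClaim(ClaimTypes.Role, Roles.SurveyCreator))
        {
            context.Succeed(requirement);
        }
        return Task.CompletedTask;
    }
}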
services.AddAuthorization(options =>
{
options.AddPolicy(PolicyNames.RequireSurveyCreator,
policy =>
{
policy.AddRequirements(new SurveyCreatorRequirement());
policy.RequireAuthenticatedUser(); // Adds DenyAnonymousAuthorizationRequirement
// By adding the CookieAuthenticationDefaults.AuthenticationScheme, if an authenticated
// user is not in the appropriate role, they will be redirected to a "forbidden" page.
policy.AddAuthenticationSchemes(CookieAuthenticationDefaults.AuthenticationScheme);
});
options.AddPolicy(PolicyNames.RequireSurveyAdmin,
policy =>
{
policy.AddRequirements(new SurveyAdminRequirement());
policy.RequireAuthenticatedUser();
policy.AddAuthenticationSchemes(CookieAuthenticationDefaults.AuthenticationScheme);
});
});
This code also sets the authentication scheme, which tells ASP.NET which authentication middleware should run if
authorization fails. In this case, we specify the cookie authentication middleware, because the cookie authentication
middleware can redirect the user to a "Forbidden" page. The location of the Forbidden page is set in the
AccessDeniedPath option for the cookie middleware; see Configuring the authentication middleware.
To apply a policy to a controller action, use the Authorize attribute:
[Authorize(Policy = PolicyNames.RequireSurveyCreator)]
public IActionResult Create()
{
var survey = new SurveyDTO();
return View(survey);
}
In earlier versions of ASP.NET, you would set the Roles property on the attribute:
// old way
[Authorize(Roles = "SurveyCreator")]
This is still supported in ASP.NET Core, but it has some drawbacks compared with authorization policies:
It assumes a particular claim type. Policies can check for any claim type. Roles are just a type of claim.
The role name is hard-coded into the attribute. With policies, the authorization logic is all in one place, making
it easier to update or even load from configuration settings.
Policies enable more complex authorization decisions (for example, age >= 21) that can't be expressed by
simple role membership.
Resource-based authorization
Resource-based authorization occurs whenever authorization depends on a specific resource that will be
affected by an operation. In the Tailspin Surveys application, every survey has an owner and zero-to-many
contributors.
The owner can read, update, delete, publish, and unpublish the survey.
The owner can assign contributors to the survey.
Contributors can read and update the survey.
Note that "owner" and "contributor" are not application roles; they are stored per survey, in the application
database. To check whether a user can delete a survey, for example, the app checks whether the user is the owner
for that survey.
In ASP.NET Core, implement resource-based authorization by deriving from AuthorizationHandler and
overriding the Handle method.
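A hedged sketch of the handler's shape, strongly typed for Survey resources and for the operation being requested (OperationAuthorizationRequirement is part of ASP.NET Core; the permission-building logic appears later in this article):
public class SurveyAuthorizationHandler
    : AuthorizationHandler<OperationAuthorizationRequirement, Survey>
{
    protected override Task HandleRequirementAsync(AuthorizationHandlerContext context,
        OperationAuthorizationRequirement requirement, Survey resource)
    {
        // Build the user's permission set for this survey, then compare it with the
        // requested operation (see the permission aggregation code below).
        return Task.CompletedTask;
    }
}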
Notice that this class is strongly typed for Survey objects. Register the class for DI on startup:
services.AddSingleton<IAuthorizationHandler>(factory =>
{
return new SurveyAuthorizationHandler();
});
To perform authorization checks, use the IAuthorizationService interface, which you can inject into your
controllers. The following code checks whether a user can read a survey:
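A minimal sketch of such a check, assuming an injected IAuthorizationService field and the sample's Operations.Read requirement (in ASP.NET Core 2.x and later, AuthorizeAsync returns an AuthorizationResult):
var result = await _authorizationService.AuthorizeAsync(User, survey, Operations.Read);
if (!result.Succeeded)
{
    return Forbid();
}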
Because we pass in a Survey object, this call will invoke the SurveyAuthorizationHandler .
In your authorization code, a good approach is to aggregate all of the user's role-based and resource-based
permissions, then check the aggregate set against the desired operation. Here is an example from the Surveys app.
The application defines several permission types:
Admin
Contributor
Creator
Owner
Reader
The application also defines a set of possible operations on surveys:
Create
Read
Update
Delete
Publish
Unpublish
The following code, an excerpt from the authorization handler, creates a list of permissions for a particular user and
survey. Notice that this code looks at both the user's app roles, and the owner/contributor fields in the survey.
if (resource.TenantId == surveyTenantId)
{
// Admin can do anything, as long as the resource belongs to the admin's tenant.
if (context.User.HasClaim(ClaimTypes.Role, Roles.SurveyAdmin))
{
context.Succeed(requirement);
return Task.FromResult(0);
}
if (context.User.HasClaim(ClaimTypes.Role, Roles.SurveyCreator))
{
permissions.Add(UserPermissionType.Creator);
}
else
{
permissions.Add(UserPermissionType.Reader);
}
if (resource.OwnerId == userId)
{
permissions.Add(UserPermissionType.Owner);
}
}
if (resource.Contributors != null && resource.Contributors.Any(x => x.UserId == userId))
{
permissions.Add(UserPermissionType.Contributor);
}
if (ValidateUserPermissions[requirement](permissions))
{
context.Succeed(requirement);
}
return Task.FromResult(0);
}
}
In a multitenant application, you must ensure that permissions don't "leak" to another tenant's data. In the Surveys
app, the Contributor permission is allowed across tenants—you can assign someone from another tenant as a
contributor. The other permission types are restricted to resources that belong to that user's tenant. To enforce this
requirement, the code checks the tenant ID before granting the permission. (The TenantId field is assigned when
the survey is created.)
The next step is to check the operation (such as read, update, or delete) against the permissions. The Surveys app
implements this step by using a lookup table of functions:
static readonly Dictionary<OperationAuthorizationRequirement, Func<List<UserPermissionType>, bool>>
ValidateUserPermissions
= new Dictionary<OperationAuthorizationRequirement, Func<List<UserPermissionType>, bool>>
{
{ Operations.Create, x => x.Contains(UserPermissionType.Creator) },
// ... entries for Read, Update, Delete, Publish, and Unpublish follow the same pattern.
};
Next
Secure a backend web API for multitenant
applications
12/18/2020 • 8 minutes to read • Edit Online
Sample code
The Tailspin Surveys application uses a backend web API to manage CRUD operations on surveys. For example,
when a user clicks "My Surveys", the web application sends an HTTP request to the web API:
GET /users/{userId}/surveys
{
"Published":[],
"Own":[
{"Id":1,"Title":"Survey 1"},
{"Id":3,"Title":"Survey 3"},
],
"Contribute": [{"Id":8,"Title":"My survey"}]
}
The web API does not allow anonymous requests, so the web app must authenticate itself using OAuth 2 bearer
tokens.
NOTE
This is a server-to-server scenario. The application does not make any AJAX calls to the API from the browser client.
A diagram that shows the web application requesting an access token from Azure AD and sending the token to the
web API.
A screenshot of the Azure portal that shows the application permissions and delegated permissions.
Getting an access token
Before calling the web API, the web application gets an access token from Azure AD. In a .NET application, use the
Azure AD Authentication Library (ADAL) for .NET.
In the OAuth 2 authorization code flow, the application exchanges an authorization code for an access token. The
following code uses ADAL to get the access token. This code is called during the AuthorizationCodeReceived event.
// The OpenID Connect middleware sends this event when it gets the authorization code.
public override async Task AuthorizationCodeReceived(AuthorizationCodeReceivedContext context)
{
    string authorizationCode = context.ProtocolMessage.Code;
    string authority = "https://login.microsoftonline.com/" + tenantID;
    string resourceID = "https://tailspin.onmicrosoft.com/surveys.webapi"; // App ID URI
    ClientCredential credential = new ClientCredential(clientId, clientSecret);

    // Exchange the authorization code for an access token. (redirectUri is the app's reply URL;
    // the Surveys sample also passes its custom token cache to the AuthenticationContext.)
    var authContext = new AuthenticationContext(authority);
    var authResult = await authContext.AcquireTokenByAuthorizationCodeAsync(
        authorizationCode, new Uri(redirectUri), credential, resourceID);
}
If AcquireTokenByAuthorizationCodeAsync succeeds, ADAL caches the token. Later, you can get the token from the
cache by calling AcquireTokenSilentAsync:
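A hedged sketch of the silent call; userObjectId is the signed-in user's object ID, taken from the claims, and the overload shown is from ADAL for .NET:
AuthenticationResult result = await authContext.AcquireTokenSilentAsync(
    resourceID,
    credential,
    new UserIdentifier(userObjectId, UserIdentifierType.UniqueId));
string accessToken = result.AccessToken;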
The following extension method from the Surveys application sets the Authorization header on an HTTP request,
using the HttpClient class.
public static async Task<HttpResponseMessage> SendRequestWithBearerTokenAsync(this HttpClient httpClient,
    HttpMethod method, string path, object requestBody, string accessToken, CancellationToken ct)
{
    var request = new HttpRequestMessage(method, path);
    if (requestBody != null)
    {
        var json = JsonConvert.SerializeObject(requestBody, Formatting.None);
        var content = new StringContent(json, Encoding.UTF8, "application/json");
        request.Content = content;
    }

    // Attach the access token as an OAuth 2 bearer token, then send the request.
    request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", accessToken);
    var response = await httpClient.SendAsync(request, ct);
    return response;
}
In the web API, configure the JWT bearer authentication middleware to validate incoming bearer tokens:
app.UseJwtBearerAuthentication(new JwtBearerOptions {
Audience = configOptions.AzureAd.WebApiResourceId,
Authority = Constants.AuthEndpointPrefix,
TokenValidationParameters = new TokenValidationParameters {
ValidateIssuer = false
},
Events= new SurveysJwtBearerEvents(loggerFactory.CreateLogger<SurveysJwtBearerEvents>())
});
// ...
}
Audience . Set this to the App ID URL for the web API, which you created when you registered the web API with
Azure AD.
Authority . For a multitenant application, set this to https://login.microsoftonline.com/common/ .
TokenValidationParameters . For a multitenant application, set ValidateIssuer to false. That means the
application itself validates the issuer, in the TokenValidated event, instead of relying on the middleware.
Events is a class that derives from JwtBearerEvents .
Issuer validation
Validate the token issuer in the JwtBearerEvents.TokenValidated event. The issuer is sent in the "iss" claim.
In the Surveys application, the web API doesn't handle tenant sign-up. Therefore, it just checks if the issuer is
already in the application database. If not, it throws an exception, which causes authentication to fail.
public override async Task TokenValidated(TokenValidatedContext context)
{
var principal = context.Ticket.Principal;
var tenantManager = context.HttpContext.RequestServices.GetService<TenantManager>();
var userManager = context.HttpContext.RequestServices.GetService<UserManager>();
var issuerValue = principal.GetIssuerValue();
var tenant = await tenantManager.FindByIssuerValueAsync(issuerValue);
if (tenant == null)
{
// The caller was not from a trusted issuer. Throw to block the authentication flow.
throw new SecurityTokenValidationException();
}
// ...
}
As this example shows, you can also use the TokenValidated event to modify the claims. Remember that the
claims come directly from Azure AD. If the web application modifies the claims that it gets, those changes won't
show up in the bearer token that the web API receives. For more information, see Claims transformations.
Authorization
For a general discussion of authorization, see Role-based and resource-based authorization.
The JwtBearer middleware handles the authorization responses. For example, to restrict a controller action to
authenticated users, use the [Authorize] attribute and specify JwtBearerDefaults.AuthenticationScheme as
the authentication scheme:
[Authorize(ActiveAuthenticationSchemes = JwtBearerDefaults.AuthenticationScheme)]
[Authorize(Policy = PolicyNames.RequireSurveyCreator)]
This returns a 401 status code if the user is not authenticated, and 403 if the user is authenticated but not
authorized. Register the policy on startup:
public void ConfigureServices(IServiceCollection services)
{
services.AddAuthorization(options =>
{
options.AddPolicy(PolicyNames.RequireSurveyCreator,
policy =>
{
policy.AddRequirements(new SurveyCreatorRequirement());
policy.RequireAuthenticatedUser(); // Adds DenyAnonymousAuthorizationRequirement
policy.AddAuthenticationSchemes(JwtBearerDefaults.AuthenticationScheme);
});
options.AddPolicy(PolicyNames.RequireSurveyAdmin,
policy =>
{
policy.AddRequirements(new SurveyAdminRequirement());
policy.RequireAuthenticatedUser(); // Adds DenyAnonymousAuthorizationRequirement
policy.AddAuthenticationSchemes(JwtBearerDefaults.AuthenticationScheme);
});
});
// ...
}
At startup, the application reads settings from every registered configuration provider, and uses them to populate
a strongly typed options object. For more information, see Using Options and configuration objects.
Next
Cache access tokens
12/18/2020 • 4 minutes to read • Edit Online
Sample code
It's relatively expensive to get an OAuth access token, because it requires an HTTP request to the token endpoint.
Therefore, it's good to cache tokens whenever possible. The Azure AD Authentication Library (ADAL) automatically
caches tokens obtained from Azure AD, including refresh tokens.
ADAL provides a default token cache implementation. However, this token cache is intended for native client apps,
and is not suitable for web apps:
It is a static instance, and not thread safe.
It doesn't scale to large numbers of users, because tokens from all users go into the same dictionary.
It can't be shared across web servers in a farm.
Instead, you should implement a custom token cache that derives from the ADAL TokenCache class but is suitable
for a server environment and provides the desirable level of isolation between tokens for different users.
The TokenCache class stores a dictionary of tokens, indexed by issuer, resource, client ID, and user. A custom token
cache should write this dictionary to a backing store, such as a Redis cache.
In the Tailspin Surveys application, the DistributedTokenCache class implements the token cache. This
implementation uses the IDistributedCache abstraction from ASP.NET Core. That way, any IDistributedCache
implementation can be used as a backing store.
By default, the Surveys app uses a Redis cache.
For a single-instance web server, you could use the ASP.NET Core in-memory cache. (This is also a good option
for running the app locally during development.)
DistributedTokenCache stores the cache data as key/value pairs in the backing store. The key is the user ID plus
client ID, so the backing store holds separate cache data for each unique combination of user/client.
The backing store is partitioned by user. For each HTTP request, the tokens for that user are read from the backing
store and loaded into the TokenCache dictionary. If Redis is used as the backing store, every server instance in a
server farm reads/writes to the same cache, and this approach scales to many users.
NOTE
If you deploy to Azure Web Sites, the encryption keys are backed up to network storage and synchronized across all
machines (see Key management and lifetime). By default, keys are not encrypted when running in Azure Web Sites, but you
can enable encryption using an X.509 certificate.
DistributedTokenCache implementation
The DistributedTokenCache class derives from the ADAL TokenCache class.
In the constructor, the DistributedTokenCache class creates a key for the current user and loads the cache from the
backing store:
public DistributedTokenCache(
ClaimsPrincipal claimsPrincipal,
IDistributedCache distributedCache,
ILoggerFactory loggerFactory,
IDataProtectionProvider dataProtectionProvider)
: base()
{
_claimsPrincipal = claimsPrincipal;
_cacheKey = BuildCacheKey(_claimsPrincipal);
_distributedCache = distributedCache;
_logger = loggerFactory.CreateLogger<DistributedTokenCache>();
_protector = dataProtectionProvider.CreateProtector(typeof(DistributedTokenCache).FullName);
AfterAccess = AfterAccessNotification;
LoadFromCache();
}
The key is created by concatenating the user ID and client ID. Both of these are taken from claims found in the
user's ClaimsPrincipal :
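A minimal sketch of BuildCacheKey; the claim type shown is the one the OIDC middleware uses for the object ID, and the key format itself is an illustrative choice:
private static string BuildCacheKey(ClaimsPrincipal claimsPrincipal)
{
    string userObjectId = claimsPrincipal
        .FindFirst("http://schemas.microsoft.com/identity/claims/objectidentifier")?.Value;
    string clientId = claimsPrincipal.FindFirst("aud")?.Value;
    return $"UserId:{userObjectId}::ClientId:{clientId}";
}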
To load the cache data, read the serialized blob from the backing store, and call TokenCache.Deserialize to convert
the blob into cache data.
private void LoadFromCache()
{
byte[] cacheData = _distributedCache.Get(_cacheKey);
if (cacheData != null)
{
this.Deserialize(_protector.Unprotect(cacheData));
}
}
Whenever ADAL accesses the cache, it fires an AfterAccess event. If the cache data has changed, the
HasStateChanged property is true. In that case, update the backing store to reflect the change, and then set
HasStateChanged to false.
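A minimal sketch of the AfterAccess handler, assuming the field names from the constructor shown earlier:
private void AfterAccessNotification(TokenCacheNotificationArgs args)
{
    if (this.HasStateChanged)
    {
        // Serialize the cache, protect it, and write it back to the distributed store.
        byte[] cacheData = this.Serialize();
        _distributedCache.Set(_cacheKey, _protector.Protect(cacheData));
        this.HasStateChanged = false;
    }
}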
Sample code
This article describes how to add client assertion to the Tailspin Surveys sample application.
One way to authenticate the client is by using a client secret. That's how the Tailspin Surveys application is
configured by default.
Here is an example request from the client to the IDP, requesting an access token. Note the client_secret
parameter.
resource=https://tailspin.onmicrosoft.com/surveys.webapi
&client_id=87df91dc-63de-4765-8701-b59cc8bd9e11
&client_secret=i3Bf12Dn...
&grant_type=authorization_code
&code=PG8wJG6Y...
The secret is just a string, so you have to make sure not to leak the value. The best practice is to keep the client
secret out of source control. When you deploy to Azure, store the secret in an app setting.
However, anyone with access to the Azure subscription can view the app settings. Furthermore, there is always a
temptation to check secrets into source control (for example, in deployment scripts), share them by email, and so
on.
For additional security, you can use client assertion instead of a client secret. With client assertion, the client uses
an X.509 certificate to prove the token request came from the client. The client certificate is installed on the web
server. Generally, it is easier to restrict access to the certificate than to ensure that nobody inadvertently
reveals a client secret. For more information about configuring certificates in a web app, see Using Certificates in
Azure Websites Applications.
Here is a token request using client assertion:
resource=https://tailspin.onmicrosoft.com/surveys.webapi
&client_id=87df91dc-63de-4765-8701-b59cc8bd9e11
&client_assertion_type=urn:ietf:params:oauth:client-assertion-type:jwt-bearer
&client_assertion=eyJhbGci...
&grant_type=authorization_code
&code=PG8wJG6Y...
Notice that the client_secret parameter is no longer used. Instead, the client_assertion parameter contains a
JWT token that was signed using the client certificate. The client_assertion_type parameter specifies the type of
assertion — in this case, JWT token. The server validates the JWT token. If the JWT token is invalid, the token
request returns an error.
NOTE
X.509 certificates are not the only form of client assertion; we focus on this form here because it is supported by Azure AD.
At run time, the web application reads the certificate from the certificate store. The certificate must be installed on
the same machine as the web app.
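As a hedged sketch, the application might locate the certificate by thumbprint and pass it to ADAL as a ClientAssertionCertificate instead of a ClientCredential; thumbprint and clientId are configuration values:
using (var store = new X509Store(StoreName.My, StoreLocation.CurrentUser))
{
    store.Open(OpenFlags.ReadOnly);
    var certs = store.Certificates.Find(X509FindType.FindByThumbprint, thumbprint, validOnly: false);
    var assertionCredential = new ClientAssertionCertificate(clientId, certs[0]);
    // Pass assertionCredential to AcquireTokenByAuthorizationCodeAsync in place of the client secret.
}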
Next
Federate with a customer's AD FS
12/18/2020 • 6 minutes to read • Edit Online
This article describes how a multitenant SaaS application can support authentication via Active Directory
Federation Services (AD FS), in order to federate with a customer's AD FS.
Overview
Azure Active Directory (Azure AD) makes it easy to sign in users from Azure AD tenants, including Office365 and
Dynamics CRM Online customers. But what about customers who use on-premises Active Directory on a corporate
intranet?
One option is for these customers to sync their on-premises AD with Azure AD, using Azure AD Connect. However,
some customers may be unable to use this approach, due to corporate IT policy or other reasons. In that case,
another option is to federate through Active Directory Federation Services (AD FS).
To enable this scenario:
The customer must have an Internet-facing AD FS farm.
The SaaS provider deploys their own AD FS farm.
The customer and the SaaS provider must set up federation trust. This is a manual process.
There are three main roles in the trust relation:
The customer's AD FS is the account partner, responsible for authenticating users from the customer's AD,
and creating security tokens with user claims.
The SaaS provider's AD FS is the resource partner, which trusts the account partner and receives the user
claims.
The application is configured as a relying party (RP) in the SaaS provider's AD FS.
NOTE
In this article, we assume the application uses OpenID Connect as the authentication protocol. Another option is to use WS-
Federation.
For OpenID Connect, the SaaS provider must use AD FS 2016, running in Windows Server 2016. AD FS 3.0 does not
support OpenID Connect.
For an example of using WS-Federation with ASP.NET 4, see the active-directory-dotnet-webapp-wsfederation
sample.
Authentication flow
1. When the user clicks "sign in", the application redirects to an OpenID Connect endpoint on the SaaS provider's
AD FS.
2. The user enters his or her organizational user name (" alice@corp.contoso.com "). AD FS uses home realm
discovery to redirect to the customer's AD FS, where the user enters their credentials.
3. The customer's AD FS sends user claims to the SaaS provider's AD FS, using WS-Federation (or SAML).
4. Claims flow from AD FS to the app, using OpenID Connect. This requires a protocol transition from WS-
Federation.
Limitations
By default, the relying party application receives only a fixed set of claims available in the id_token, shown in the
following table. With AD FS 2016, you can customize the id_token in OpenID Connect scenarios. For more
information, see Custom ID Tokens in AD FS.
Claim | Description
aud | Audience. The application for which the claims were issued.
exp | Expiration time. The time after which the token will no longer be accepted.
iat | Issued at. The time when the token was issued.
iss | Issuer. The value of this claim is always the resource partner's AD FS.
nameidentifier | Name identifier. The identifier for the name of the entity for which the token was issued.
The rest of this article describes how to set up the trust relationship between the RP (the app) and the account
partner (the customer).
AD FS deployment
The SaaS provider can deploy AD FS either on-premises or on Azure VMs. For security and availability, the
following guidelines are important:
Deploy at least two AD FS servers and two AD FS proxy servers to achieve the best availability of the AD FS
service.
Domain controllers and AD FS servers should never be exposed directly to the Internet; place them in a
virtual network that does not allow direct Internet access to them.
Web application proxies (previously AD FS proxies) must be used to publish AD FS servers to the Internet.
To set up a similar topology in Azure requires the use of virtual networks, network security groups, virtual
machines, and availability sets. For more details, see Guidelines for deploying Windows Server Active Directory on
Azure Virtual Machines.
Typically you might combine this with other OpenID Connect endpoints (such as Azure AD). You'll need two
different sign-in buttons or some other way to distinguish them, so that the user is sent to the correct
authentication endpoint.
where "name" is the friendly name of the claims provider trust, and "suffix" is the UPN suffix for the customer's AD
(example, "corp.fabrikam.com").
With this configuration, end users can type in their organizational account, and AD FS automatically selects the
corresponding claims provider. See Customizing the AD FS Sign-in Pages, under the section "Configure Identity
Provider to use certain email suffixes".
EXISTS([Type == "http://schemas.microsoft.com/ws/2014/01/identity/claims/anchorclaimtype"])=>
issue (Type = "http://schemas.microsoft.com/ws/2014/01/identity/claims/anchorclaimtype",
Value = "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/upn");
This rule issues a claim of type anchorclaimtype . The claim tells the relying party to use UPN as the user's
immutable ID.
12. Click Finish .
13. Click OK to complete the wizard.
Azure AD identity and access management for AWS
12/18/2020 • 16 minutes to read • Edit Online
Amazon Web Services (AWS) accounts that support critical workloads and highly sensitive information need
strong identity protection and access control. Azure Active Directory (Azure AD) is a cloud-based, comprehensive,
centralized identity and access management solution that can help secure and protect AWS accounts and
environments. Azure AD provides centralized single sign-on (SSO) and strong authentication through multi-factor
authentication (MFA) and Conditional Access policies. Azure AD supports AWS role-based identities and access
control.
Many organizations that use AWS already rely on Azure AD for Microsoft 365 or hybrid cloud identity and access
protection. These organizations can quickly and easily deploy Azure AD for their AWS accounts, often without
additional cost. Other, advanced Azure AD features like Privileged Identity Management (PIM) and Advanced
Identity Protection can help protect the most sensitive AWS accounts.
Azure AD easily integrates with other Microsoft security solutions like Microsoft Cloud App Security (MCAS) and
Azure Sentinel. For more information, see MCAS and Azure Sentinel for AWS. Microsoft security solutions are
extensible and have multiple levels of protection. Organizations can implement one or more of these solutions
along with other types of protection for a full security architecture that protects current and future AWS
deployments.
This article provides AWS identity architects, administrators, and security analysts with immediate insights and
detailed guidance for deploying Azure AD identity and access solutions for AWS. You can configure and test these
solutions without impacting your existing identity providers and AWS account users until you're ready to switch
over.
Architecture
AWS creates a separate Identity and Access Management (IAM) store for each account it creates. The following
diagram shows the standard setup for an AWS environment with a single AWS account:
The root user fully controls the AWS account, and delegates access to other identities. The AWS IAM principal
provides a unique identity for each role and user that needs to access the AWS account. AWS IAM can protect each
root, principal, and user account with a complex password and basic MFA.
Many organizations need more than one AWS account, resulting in identity silos that are complex to manage:
To allow centralized identity management and avoid having to manage multiple identities and passwords, most
organizations want to use single sign-on for platform resources. Some AWS customers rely on server-based
Microsoft Active Directory for SSO integration. Other customers invest in third-party solutions to synchronize or
federate their identities and provide SSO.
Azure AD provides centralized identity management with strong SSO authentication. Almost any app or platform
that follows common web authentication standards, including AWS, can use Azure AD for identity and access
management.
Many organizations already use Azure AD to assign and protect Microsoft 365 or hybrid cloud identities.
Employees use their Azure AD identities to access email, files, instant messaging, cloud applications, and on-
premises resources. You can quickly and easily integrate Azure AD with your AWS accounts to let administrators
and developers sign in to your AWS environments with their existing identities.
The following diagram shows how Azure AD can integrate with multiple AWS accounts to provide centralized
identity and access management:
Azure AD offers several capabilities for direct integration with AWS:
SSO across legacy, traditional, and modern authentication solutions.
MFA, including integration with several third-party solutions from Microsoft Intelligent Security Association
(MISA) partners.
Powerful Conditional Access features for strong authentication and strict governance. When users attempt to
gain access to the AWS Management Console or AWS resources, Azure AD uses both Conditional Access policy-
defined controls and risk-based assessments to manage authentication and authorization in real time.
Large-scale threat detection and automated response. Azure AD processes over 30 billion authentication
requests per day, along with trillions of signals about threats worldwide.
Privileged Access Management (PAM) to enable Just-In-Time (JIT) provisioning to specific resources.
Advanced Azure AD identity management
Other advanced Azure AD features can provide additional layers of control for the most sensitive AWS accounts.
Azure AD Premium P2 licenses include these advanced features:
Privileged Identity Management (PIM) to provide advanced controls for all delegated roles within
Azure and Microsoft 365. For example, instead of an administrator always using the Global Admin role, they
have permission to activate the role on demand. This permission deactivates after a set time limit like one
hour. PIM logs all activations, and has additional controls that can further restrict the activation capabilities.
PIM further protects your identity architecture by ensuring additional layers of governance and protection
before administrators can make changes.
You can expand PIM to any delegated permission by controlling access to custom groups, such as the ones
you created for access to AWS roles. For more information about deploying PIM, see Deploy Azure AD
Privileged Identity Management.
Advanced Identity Protection increases Azure AD sign-in security by monitoring user or session risk.
User risk defines the potential of the credentials being compromised, such as the user ID and password
appearing in a publicly released breach list. Session risk determines whether the sign-in activity comes from
a risky location, IP address, or other indicator of compromise. Both detection types draw on Microsoft's
comprehensive threat intelligence capabilities.
For more information about Advanced Identity Protection, see the Azure AD Identity Protection security
overview.
Microsoft Defender for Identity protects identities and services running on Active Directory domain
controllers by monitoring all activity and threat signals. Defender for Identity identifies threats based on
real-life experience from investigations of customer breaches. Defender for Identity monitors user behavior
and recommends attack surface reductions to prevent advanced attacks like reconnaissance, lateral
movement, and domain dominance.
For more information about Defender for Identity, see What is Microsoft Defender for Identity.
Security recommendations
The following principles and guidelines are important for any cloud security solution:
Ensure that the organization can monitor, detect, and automatically protect user and programmatic access
into cloud environments.
Continually review current accounts to ensure identity and permission governance and control.
Follow least privilege and zero trust principles. Make sure that each user can access only the specific
resources they require, from trusted devices and known locations. Reduce the permissions of every
administrator and developer to provide only the rights they need for the role they're performing. Review
regularly.
Continuously monitor platform configuration changes, especially if they provide opportunities for privilege
escalation or attack persistence.
Prevent unauthorized data exfiltration by actively inspecting and controlling content.
Take advantage of solutions you might already own like Azure AD Premium P2 that can increase security
without additional expense.
Basic AWS account security
To ensure basic security hygiene for AWS accounts and resources:
Review the AWS security guidance at Best practices for securing AWS accounts and resources.
Reduce the risk of uploading and downloading malware and other malicious content by actively inspecting
all data transfers through the AWS Management Console. Content that uploads or downloads directly to
resources within the AWS platform, such as web servers or databases, might need additional protection.
Consider protecting access to other resources, including:
Resources created within the AWS account.
Specific workload platforms, like Windows Server, Linux Server, or containers.
Devices that administrators and developers use to access the AWS Management Console.
AWS IAM security
A key aspect of securing the AWS Management Console is controlling who can make sensitive configuration
changes. The AWS account root user has unrestricted access. The security team should fully control the root user
account to prevent it from signing in to the AWS Management Console or working with AWS resources.
To control the root user account:
Consider changing the root user sign-in credentials from an individual's email address to a service account that
the security team controls.
Make sure the root user account password is extremely complex, and enforce MFA for the root user.
Monitor logs for instances of the root user account being used to sign in.
Use the root user account only in emergencies.
Use Azure AD to implement delegated administrative access rather than using the root user for administrative
tasks.
Clearly understand and review other AWS IAM account components for appropriate mapping and assignments.
By default, an AWS account has no IAM users until the root user creates one or more identities to delegate
access. A solution that synchronizes existing users from another identity system, such as Microsoft Active
Directory, can also automatically provision IAM users.
IAM policies provide delegated access rights to AWS account resources. AWS provides over 750 unique IAM
policies, and customers can also define custom policies.
IAM roles attach specific policies to identities. Roles are the way to administer role-based access control
(RBAC). The current solution uses External Identities to implement Azure AD identities by assuming IAM
roles.
IAM groups are also a way to administer RBAC. Instead of assigning IAM policies directly to individual IAM
users, create an IAM group, assign permissions by attaching one or more IAM policies, and add IAM users to
the group to inherit the appropriate access rights to resources.
Some IAM service accounts must continue to run in AWS IAM to provide programmatic access. Be sure to review
these accounts, securely store and restrict access to their security credentials, and rotate the credentials regularly.
If you can't see or select a role, go back to the Provisioning page to confirm successful provisioning in the Azure
AD provisioning agent, and make sure the IAM User account has the correct permissions. You can also restart the
provisioning engine to attempt the import again:
4. Once you're signed into the AWS Console, navigate the features to confirm that this account has the
appropriate delegated access.
5. Notice the naming format for the user sign-in session:
ROLE / UPN / AWS Account Number
You can use this user sign-in session information for tracking user sign-in activity in MCAS or Azure
Sentinel.
6. Sign out, and repeat the process for the other test user account to confirm the differences in role mapping
and permissions.
Enable Conditional Access
To create a new Conditional Access policy that requires MFA:
1. In the Azure portal, navigate to Azure AD > Security , and then select Conditional Access .
2. In the left navigation, select Policies .
You might need to create several Conditional Access policies to meet business needs for strong authentication.
Consider the naming convention you use when creating the policies to ensure ease of identification and ongoing
maintenance. Also, unless MFA is already widely deployed, make sure the policy is scoped to impact only the
intended users. Other policies should cover other user groups' needs.
Once you enable Conditional Access, you can impose additional controls such as PAM and just-in-time (JIT)
provisioning. For more information, see What is automated SaaS app user provisioning in Azure AD.
If you have MCAS, you can use Conditional Access to configure MCAS session policies. For more information, see
Configure Azure AD session policies for AWS activities.
See also
For in-depth coverage and comparison of Azure and AWS features, see the Azure for AWS professionals content
set.
For security guidance from AWS, see Best practices for securing AWS accounts and resources.
For the latest Microsoft security information, see www.microsoft.com/security.
For full details of how to implement and manage Azure AD, see Securing Azure environments with Azure Active
Directory.
Security and identity on Azure and AWS.
AWS tutorial: Azure AD with IDP SSO.
Microsoft tutorial: SSO for AWS.
PIM deployment plan.
Identity protection security overview.
What is Microsoft Defender for Identity?
Connect AWS to Microsoft Cloud App Security.
How Cloud App Security helps protect your Amazon Web Services (AWS) environment.
Vision with Azure IoT Edge
12/18/2020 • 4 minutes to read • Edit Online
Visual inspection of products, resources, and environments has been a core practice for most enterprises, and was,
until recently, a very manual process. An individual, or a group of individuals, would be responsible for manually
inspecting the assets or the environment. Depending on the circumstances, this could become inefficient,
inaccurate, or both, due to human error and limitations.
To improve the efficacy of visual inspection, enterprises began turning to deep learning artificial neural networks
known as convolutional neural networks (or CNNs), to emulate human vision for analysis of images and video.
Today this is commonly called computer vision, or simply Vision AI. Artificial intelligence for image analytics spans
a wide variety of industries, including manufacturing, retail, healthcare, and the public sector, and an equally wide
area of use cases.
Vision for quality assurance - In manufacturing environments, Vision AI can be very helpful with quality
inspection of parts and processes with a high degree of accuracy and velocity. An enterprise pursuing this
path automates the inspection of a product for defects to answer questions such as:
Is the manufacturing process producing consistent results?
Is the product assembled properly?
Can there be an earlier notification of a defect to reduce waste?
How can drift in the computer vision model be leveraged to prescribe predictive maintenance?
Vision for safety - In any environment, safety is a fundamental concern for every enterprise, and the
reduction of risk is a driving force for adopting Vision AI. Automated monitoring of video feeds to scan for
potential safety issues provides critical time to respond to incidents, and opportunities to reduce exposure to
risk. Enterprises looking at Vision AI for this use case are commonly trying to answer questions such as:
How compliant is the workforce with using personal protective equipment?
How often are people entering unauthorized work zones?
Are products being stored in a safe manner?
Are there unreported close calls in a facility, or pedestrian/equipment “near misses”?
Camera considerations
The camera is understandably a very important component of an Azure IoT Edge Vision solution. To learn what to
consider for this component, proceed to Camera selection in Azure IoT Edge Vision.
Hardware acceleration
To bring AI to the edge, the edge hardware must be able to run powerful AI algorithms. To learn about the hardware
capabilities required for IoT Edge Vision, proceed to Hardware acceleration in Azure IoT Edge Vision.
Machine learning
Machine learning at the edge can be challenging, due to the resource restrictions of edge devices, limited
energy budgets, and low compute capabilities. See Machine learning and data science in Azure IoT Edge Vision to
understand the key considerations in designing the machine learning capabilities of your IoT Edge Vision solution.
Image storage
Your IoT Edge Vision solution cannot be complete without careful consideration of how and where the images
generated will be stored. Read Image storage and management in Azure IoT Edge Vision for a thorough discussion.
Alerts
Your IoT Edge device may need to respond to various alerts in its environment. See Alert persistence in Azure IoT
Edge Vision to understand the best practices in managing these alerts.
User interface
The user interface or UI of your IoT Edge Vision solution will vary based on the target user. The article User interface
in Azure IoT Edge Vision discusses the main UI considerations.
Next steps
This series of articles demonstrates how to build a complete vision workload using Azure IoT Edge devices. For
further information, refer to the following product documentation:
Azure IoT Edge documentation
Tutorial: Perform image classification at the edge with Custom Vision Service
Azure Machine Learning documentation
Azure Kinect DK documentation
MMdnn tool
ONNX
Camera selection in Azure IoT Edge Vision
12/18/2020 • 8 minutes to read • Edit Online
One of the most critical components in any AI Vision workload is selecting the right camera. The items being
identified by this camera must be presented in such a way that the artificial intelligence or machine learning
models can evaluate them correctly. An in-depth understanding of the different camera types is required to
understand this concept.
NOTE
There are different manufacturers for area, line, and smart cameras. Instead of recommending any one vendor over
another, Microsoft recommends that you select a vendor that fits your specific needs.
Types of cameras
Area scan cameras
This camera type generates the traditional camera image, where a 2D image is captured and then sent over to the
Edge hardware to be evaluated. This camera typically has a matrix of pixel sensors.
As the name suggests, area scan cameras look at a large area and are great at detecting change in an area.
Examples of workloads that could use an area scan camera would be workplace safety, or detecting or counting
objects (people, animals, cars, and so on) in an environment.
Examples of manufacturers of area scan cameras are Basler, Axis, Sony, Bosch, FLIR, Allied Vision.
Line scan cameras
Unlike the area scan cameras, the line scan camera has a single row of linear pixel sensors. This allows the camera
to take one-pixel-wide images in quick succession and stitch them together into a video stream. This
video stream is then sent over to an Edge device for processing.
Line scan cameras are great for vision workloads where the items to be identified are either moving past the
camera, or need to be rotated to detect defects. The line scan camera would then be able to produce a continuous
image stream for evaluation. Examples of workloads that would work best with a line scan camera are:
item defect detection on parts that are moved on a conveyor belt,
workloads that require spinning to see a cylindrical object, or
any workload that requires rotation.
Examples of manufacturers of line scan cameras are Basler, Teledyne Dalsa, Hamamatsu Corporation, DataLogic,
Vieworks, and Xenics.
Embedded smart cameras
This type of camera can use either an area scan or a line scan camera for capturing the images, although a line scan
smart camera is rare. An embedded smart camera can not only acquire an image, but can also process that image,
because it is a self-contained, stand-alone system. It typically has either an RS232 or an Ethernet port output, which
allows it to be integrated directly into a PLC or other IIoT interfaces.
Examples of manufacturers of embedded smart cameras are Basler and Leuze Electronics.
Camera features
Sensor size
This is one of the most important factors to evaluate in any vision workload. A sensor is the hardware within a camera that captures light and converts it into signals, which then produce an image. The sensor contains millions of semiconducting photodetectors called photosites. A higher megapixel count does not always result in a better image. For example, consider two different sensor sizes for a 12-megapixel camera: camera A has a ½-inch sensor with 12 million photosites, and camera B has a 1-inch sensor with 12 million photosites. Under the same lighting conditions, the image from the camera with the 1-inch sensor will be cleaner and sharper. Many cameras typically used in vision workloads have a sensor between ¼ inch and 1 inch in size. In some cases, much larger sensors might be required.
If a camera offers a choice between a larger and a smaller sensor, some reasons to choose the larger sensor are:
the need for precision measurements,
lower light conditions, and
shorter exposure times or fast-moving items.
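Returning to the two 12-megapixel cameras above, a rough back-of-the-envelope calculation shows why the larger sensor collects more light per photosite. This is only a sketch; the sensor widths used are typical figures for those formats, not the specification of any particular camera.

```python
# Rough comparison of photosite (pixel) pitch for two 12 MP sensors.
# Sensor widths below are typical values for the formats, not vendor specifications.
def photosite_pitch_um(sensor_width_mm: float, horizontal_pixels: int) -> float:
    """Approximate center-to-center photosite spacing in micrometers."""
    return sensor_width_mm * 1000 / horizontal_pixels

pixels_x = 4000               # ~12 MP sensor at 4000 x 3000
half_inch_width_mm = 6.4      # typical active width of a 1/2-inch sensor
one_inch_width_mm = 12.8      # typical active width of a 1-inch sensor

print(photosite_pitch_um(half_inch_width_mm, pixels_x))   # ~1.6 µm per photosite
print(photosite_pitch_um(one_inch_width_mm, pixels_x))    # ~3.2 µm per photosite
# The 1-inch sensor's photosites are roughly twice as wide, so each collects about
# four times as much light, which is why its image is cleaner and sharper.
```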
Resolution
This is another important factor for both line scan and area scan camera workloads. If your workload must identify fine features, such as the writing on an IC chip, detect a face, or identify a vehicle from a distance, a higher-resolution camera is required.
Speed
Sensors come in two types: CCD and CMOS. If the vision workload requires a high number of images to be captured per second, then two factors come into play. The first is the speed of the camera's interface connection. The second is the sensor type. CMOS sensors have a direct readout from the photosites, so they typically offer a higher frame rate.
NOTE
There are several other camera features to consider when selecting the correct camera for your vision workload. These
include lens selection, focal length, monochrome, color depth, stereo depth, triggers, physical size, and support. Sensor
manufacturers can help you understand the specific features that your application may require.
Camera placement
The items that you are capturing in your vision workload determine where and at what angles the camera should be placed. The camera location can also affect the sensor type, lens type, and camera body type.
There are several different factors that can weigh into the overall decision for camera placement. Two of the most
critical ones are the lighting and the field of view.
Camera lighting
In a computer vision workload, lighting is a critical component of camera placement. Several different lighting conditions are available; a condition that is useful for one vision workload might produce an undesirable effect in another. Types of lighting that are commonly used in computer vision workloads are:
Direct lighting: This is the most commonly used lighting condition. This light source is projected at the
object to be captured for evaluation.
Line lighting: This is a single array of lights that are most used with line scan camera applications. This
creates a single line of light at the focus of the camera.
Diffused lighting: This type of lighting is used to illuminate an object but prevent harsh shadows and is
mostly used around specular objects.
Back lighting: This type of light source is used behind the object, producing a silhouette of the object. It is most useful for taking measurements, edge detection, or determining object orientation.
Axial diffused lighting: This type of light source is often used with highly reflective objects, or to prevent
shadows on the part that will be captured for evaluation.
Custom Grid lighting: This is a structured lighting condition that lays out a grid of light on the object. The intent is to have a known grid projection that provides more accurate measurements of components, parts, placement of items, and so on.
Strobe lighting: Strobe lighting is used for high-speed moving parts. The strobe must be in sync with the camera to freeze the object for evaluation; this lighting helps prevent motion blur.
Dark Field lighting: This type of light source uses several lights placed at different angles to the part being captured. For example, if the part is lying flat on a conveyor belt, the lights would be placed at a 45-degree angle to it. This type of lighting is most useful when looking at highly reflective clear objects and is most commonly used for lens scratch detection.
The figure below shows the angular placement of light:
Field of view
In a vision workload, you need to know the distance to the object that you are trying to evaluate. This also plays a part in camera selection, sensor selection, and lens configuration. Some of the components that make up the field of view are:
Distance to object(s): For example, is the object being monitored on a conveyor belt with the camera two feet above it, or is the object across a parking lot? As the distance changes, so do the camera's sensor and lens configuration requirements.
Area of coverage: Is the area that the computer vision is trying to monitor small or large? This has a direct correlation with the camera's resolution, lens, and sensor type.
Direction of the sun: If the computer vision workload is outdoors, such as monitoring a construction site for worker safety, will the camera be pointed into the sun at any time? Keep in mind that if the sun casts a shadow over the object that the vision workload is monitoring, items might be somewhat obscured. Also, if the camera gets direct sunlight in the lens, it might be blinded until the angle of the sun changes.
Camera angle to the object(s): The angle of the camera to the object that the vision workload is monitoring is also critical. If the camera is too high, it might miss the details that the vision workload is trying to capture; the same can be true if it is too low.
Communication interface
In building a computer vision workload, it is also important to understand how the system will interact with the
output of the camera. Below are a few of the standard ways that a camera will communicate to IoT Edge:
Real Time Streaming Protocol (RTSP): RTSP is a protocol that transfers real-time video data from a device (in our case, the camera) to an endpoint device (the Edge compute) directly over a TCP/IP connection. It follows a client-server model at the application layer of the network. (A minimal capture sketch appears after this list.)
Open Network Video Interface Forum (ONVIF): A global and open industry forum that develops open standards for IP-based cameras. This standard is aimed at standardizing communication between the IP camera and downstream systems, interoperability, and open source.
USB: Unlike RTSP and ONVIF, USB-connected cameras connect over the Universal Serial Bus directly to the Edge compute device. This is less complex; however, it limits the distance at which the camera can be placed from the Edge compute.
Camera Serial Interface: CSI specification is from Mobile Industry Processor Interface (MIPI). This
interface describes how to communicate between a camera and a host processor.
There are several standards defined for CSI:
CSI-1: This was the original standard that MIPI started with.
CSI-2: This standard was released in 2005, and uses either D-PHY or C-PHY as physical layer options.
This is further divided into several layers:
Physical Layer (C-PHY, D-PHY)
Lane Merger layer
Low-Level Protocol Layer
Pixel to Byte Conversion Layer
Application layer
The specification was updated in 2017 to v2, which added support for RAW-24 color depth, Unified Serial
Link, and Smart Region of Interest.
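As a minimal illustration of the RTSP option above, an edge module could pull frames from a camera's stream with OpenCV. The stream URL, credentials, and frame handling shown here are placeholders, not values from any specific product.

```python
# Minimal RTSP ingestion sketch using OpenCV (pip install opencv-python).
# The URL below is a placeholder; substitute your camera's RTSP endpoint and credentials.
import cv2

RTSP_URL = "rtsp://user:password@192.168.1.64:554/stream1"  # hypothetical camera endpoint

capture = cv2.VideoCapture(RTSP_URL)
if not capture.isOpened():
    raise RuntimeError("Could not open RTSP stream")

try:
    while True:
        ok, frame = capture.read()   # one BGR frame from the stream
        if not ok:
            break                    # stream ended or the connection dropped
        # Hand the frame to the vision model or write it to local storage here.
finally:
    capture.release()
```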
Next steps
Now that you know the camera considerations for your IoT Edge Vision workload, proceed to setting up the right
hardware for your workload. Read Hardware acceleration in Azure IoT Edge Vision for more information.
Hardware acceleration in Azure IoT Edge Vision
12/18/2020 • 2 minutes to read • Edit Online
Along with the camera selection, one of the other critical decisions in Vision on the Edge projects is hardware
acceleration.
The following sections describe the key components of the underlying hardware.
CPU
The Central Processing Unit (CPU) is your default compute for most processes running on a computer. It is
designed for general purpose compute. For some vision workloads where timing is not as critical, this might be a
good option. However, most workloads that involve critical timing, multiple camera streams, and/or high frame
rates, will require more specific hardware acceleration.
GPU
Many people are familiar with the Graphics Processing Unit (GPU) as the de facto processor for any high-end PC graphics card. In recent years, the GPU has been leveraged in high performance computing (HPC) scenarios, in data mining, and in AI/ML workloads. The GPU's massive parallel-computing capability can be used in a vision workload to accelerate the processing of pixel data. The downside of a GPU is its higher power consumption, which is a critical factor to consider for your vision workload.
FPGA
Field Programmable Gate Arrays (FPGAs) are reconfigurable hardware accelerators. These powerful accelerators support the growth of Deep Learning neural networks, which are still evolving. FPGAs have millions of programmable gates, hundreds of I/O pins, and exceptional compute power measured in tera-MACs. There are also many different libraries available for FPGAs that are optimized for vision workloads, and some of these libraries include preconfigured interfaces to connect to downstream cameras and devices. One area where FPGAs tend to fall short is floating-point operations. However, manufacturers are currently working on this issue and have made many improvements in this area.
ASIC
An Application-Specific Integrated Circuit (ASIC) is by far the fastest accelerator on the market today. While ASICs are the fastest, they are the hardest to change because they are manufactured for a specific task. These custom chips are gaining popularity because of their size, performance per watt, and IP protection: proprietary algorithms burned into an ASIC are much harder to reverse engineer.
Next steps
Proceed to learn what considerations go into place for Machine learning and data science in Azure IoT Edge Vision.
Machine learning and data science in Azure IoT Edge
Vision
12/18/2020 • 12 minutes to read • Edit Online
The process of designing the machine learning (ML) approach for a vision on the edge scenario is one of the
biggest challenges in the entire planning process. It is important to understand how to consider and think about
ML in the context of edge devices.
To begin using machine learning to address business problems and pain points, consider the following points:
Always consider first how to solve the problem without ML or with a simple ML algorithm.
Have a plan to test several ML architectures as they will have different capacities to "learn".
Have a system in place to collect new data from the device to retrain an ML model.
For poorly performing ML models, often a simple fix is to add more representative data to the training process
and ensure it has variability with all classes represented equally.
Remember, this is often an iterative process with both the choice of data and choice of architecture being
updated in the exploratory phase.
It is not an easy space and, for some, a very new way of thinking. It is a data-driven process, and careful planning is critical to successful results, especially on very constrained devices.
It is always critical to clearly define the problem to be solved as the data science and machine learning approach
will depend upon this. It is also very important to consider what type of data will be encountered in the edge
scenario as this will determine the kind of ML algorithm that should be used.
Even at the start, before training any models, real world data collection and examination will help this process
greatly and new ideas could even arise. This article will discuss data considerations in detail. Of course, the
equipment itself will help determine the ML approach with regard to device attributes like limited memory,
compute, and/or power consumption limits.
Fortunately, data science and machine learning are iterative processes, so if the ML model has poor performance,
there are many ways to address issues through experimentation. This article will also discuss considerations
around ML architecture choices. Often, there will be some trial and error involved as well.
NOTE
Be wary of optimizing for the test dataset, in addition to the training dataset, once one test has been run. It might be good
to have a few different test datasets available.
Some good news is that with deep learning, costly and onerous feature engineering, featurization, and preprocessing can often be avoided, because deep learning finds signal in noise better than traditional ML. However, in deep learning, transformations may still be used to clean or reformat data for model input during training as well as inference. The same transformations need to be applied in training and when the model is scoring new data.
When advanced preprocessing is used, such as denoising, adjusting brightness or contrast, or transformations like RGB to HSV, note that this can dramatically change the model's performance for the better or, sometimes, for the worse. In general, it is part of the data science exploration process, and sometimes it is something that must be observed once the device and other components are placed in a real-world location.
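The sketch below illustrates keeping training-time and inference-time preprocessing identical by sharing a single function. The resize target and the RGB-to-HSV conversion are illustrative choices, not transformations prescribed by this guidance.

```python
# Shared preprocessing applied identically during training and on-device scoring.
# Assumes OpenCV and NumPy are installed (pip install opencv-python numpy).
import cv2
import numpy as np

TARGET_SIZE = (224, 224)  # illustrative model input size

def preprocess(frame_bgr: np.ndarray) -> np.ndarray:
    """Resize, convert BGR to HSV, and scale pixel values to [0, 1]."""
    resized = cv2.resize(frame_bgr, TARGET_SIZE)
    hsv = cv2.cvtColor(resized, cv2.COLOR_BGR2HSV)
    return hsv.astype(np.float32) / 255.0

# Use the same function when building the training set...
# train_batch = np.stack([preprocess(img) for img in training_images])
# ...and again on the device before handing frames to the model for scoring.
```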
After the hardware is installed into its permanent location, the incoming data stream should be monitored for data
drift.
Data drift is the deviation of current data from the original data. Data drift often results in a degradation of model performance (such as accuracy), although it is not the only cause of decreased performance (hardware or camera failure, for example, are others).
There should be an allowance for data drift testing in the system. This new data should also be collected for
another round of training. The more representative data collected for training, the better the model will perform in
almost all cases. So, preparing for this kind of collection is always a good idea.
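A very simple drift check, sketched below, compares a summary statistic of recent images against a baseline captured at deployment time. The statistic and threshold are illustrative and would need tuning for a real workload.

```python
# Naive data drift check: compare mean brightness of recent frames to a baseline.
# A production system would use richer statistics (histograms, embeddings, and so on).
import numpy as np

def mean_brightness(images: list[np.ndarray]) -> float:
    """Average pixel intensity across a batch of images."""
    return float(np.mean([img.mean() for img in images]))

def drift_detected(baseline: float, recent: list[np.ndarray], tolerance: float = 0.15) -> bool:
    """Flag drift when recent brightness deviates from the baseline by more than 15%."""
    current = mean_brightness(recent)
    return abs(current - baseline) / baseline > tolerance

# baseline = mean_brightness(images_collected_at_install_time)
# if drift_detected(baseline, last_hour_of_frames):
#     collect the new data for labeling and schedule another round of training
```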
In addition to using data for training and inference, new data coming from the device could be used to monitor the
device, camera or other components for hardware degradation.
In summary, here are the key considerations:
Always use a balanced dataset with all classes represented equally.
The more representative data used to train a model, the better.
Have a system in place to collect new data from device to retrain.
Have a system in place to test for data drift.
Only run a test set through a new ML model once. If you iterate and retest on the same test set, this could cause
overfitting to the test set in addition to the training set.
Data collection - Data could be acquired through an online image search, from a currently deployed device, or from another representative data source. Generally, the more data the better. In addition, the more variability in the data, the better the generalization.
Data labeling - If only hundreds of images need to be labeled, such as when using transfer learning, labeling can be done in-house. If tens of thousands of images need to be labeled, a vendor could be enlisted for both data collection and labeling.
Train a model with ML framework - An ML framework such as TensorFlow or PyTorch (both with Python
and C++ APIs) will need to be chosen. Usually this depends upon what code samples are available in open-
source or in-house, as well as the experience of the ML practitioner. Azure ML may be used to train a model
using any ML framework and approach, as it is agnostic of framework and has Python and R bindings, and
many wrappers around popular frameworks.
Convert the model for inferencing on device - Almost always, a model will need to be converted to work with a particular runtime. Model conversion usually involves advantageous optimizations like faster inference and smaller model footprints. This step differs for each ML framework and runtime. There are open-source interoperability frameworks available, such as ONNX and MMdnn (see the sketch after this list).
Build the solution for device - The solution is usually built on the same type of device as will be used in the final deployment, because the binary files created are system-specific.
Using runtime, deploy solution to device - Once a runtime is chosen, usually in conjunction with the ML framework choice, the compiled solution may be deployed. The Azure IoT Edge runtime is a Docker-based system in which the ML runtimes may be deployed as containers.
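As an example of the conversion step, a PyTorch model can be exported to ONNX with the framework's built-in exporter. The model, input shape, and file name below are placeholders rather than values from this guidance.

```python
# Export a trained PyTorch model to ONNX for use with an edge inference runtime.
# torch and torchvision are assumed to be installed; the model and shape are illustrative.
import torch
import torchvision

model = torchvision.models.mobilenet_v2(weights=None)  # stand-in for your trained model
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)  # batch of one 224x224 RGB image
torch.onnx.export(
    model,
    dummy_input,
    "vision_model.onnx",      # output file consumed by the device runtime
    input_names=["image"],
    output_names=["scores"],
    opset_version=13,
)
```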
The diagram below shows a sample data science process where open-source tools may be leveraged for the data
science workflow. Data availability and type will drive most of the choices, including the devices/hardware chosen.
If a workflow already exists for the data scientists and app developers, a few other considerations may apply. First,
it is advised to have a code, model, and data versioning system in place. Second, an automation plan for code and integration testing, along with other aspects of the data science process such as triggers and the build/release process, will help speed up the time to production and cultivate collaboration within the team.
The language of choice can help dictate what API or SDK is used for inferencing and training ML models. This in
turn will then dictate what type of ML model, what type(s) of device, what type of IoT Edge module, and so on, need
to be used. For example, PyTorch has a C++ API for inferencing and training, that works well in conjunction with
the OpenCV C++ API. If the app developer working on the deployment strategy is building a C++ application, or
has this experience, one might consider PyTorch or others (such as TensorFlow or CNTK) that have C++ inferencing
APIs.
Next steps
Proceed to the Image storage and management in Azure IoT Edge Vision article to learn how to properly store the images created by your IoT Edge Vision solution.
Image storage and management in Azure IoT Edge
Vision
12/18/2020 • 2 minutes to read • Edit Online
Storage and management of the images involved in a computer vision application is a critical function.
Some of the key considerations for managing these images are:
Ability to store all raw images during training with ease of retrieval for labeling.
A fast storage medium to avoid pipeline bottlenecks and data loss.
Storage on the edge as well as in the cloud, as labeling activity can be performed at both places.
Categorization of images for easy retrieval.
Naming and tagging images to link them with inferred metadata.
The combination of Azure Blob Storage, Azure IoT Hub, and Azure IoT Edge allows several potential options for storing image data, such as:
Use of the Azure Blob Storage on IoT Edge module, which will automatically sync images to Azure Blob Storage based on policy.
Storing images to the local host file system and uploading them to the Azure Blob service using a custom module (a minimal upload sketch follows this list).
Use of a local database to store images, which then are synced to the cloud database.
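For the custom-module option above, a minimal upload using the Azure Blob Storage SDK for Python might look like the following sketch. The connection string, container name, and file paths are placeholders for your environment.

```python
# Upload a locally stored image to Azure Blob Storage (pip install azure-storage-blob).
# Connection string, container, and paths are placeholders for your environment.
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "<storage-account-connection-string>"
CONTAINER = "captured-images"

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container_client = service.get_container_client(CONTAINER)

local_path = "/data/images/frame_000123.jpg"        # image written by the camera module
blob_name = "line-1/2020-12-18/frame_000123.jpg"     # naming links the image to its metadata

with open(local_path, "rb") as image_file:
    container_client.upload_blob(name=blob_name, data=image_file, overwrite=True)
```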
Next steps
How you respond to alerts generated by the AI model is crucial. Learn more about this in Alert persistence in Azure
IoT Edge Vision.
Alert persistence in Azure IoT Edge Vision
12/18/2020 • 2 minutes to read • Edit Online
In the context of vision on the edge, an alert is a response to an event triggered by the AI model; in other words, it is an inferencing result. The type of event is determined by the training imparted to the model. These events are separate from operational events raised by the processing pipeline and from events related to the health of the runtime.
Types of alerts
Some of the common alerts types are:
Image classification
Movement detection
Direction of movement
Object detection
Count of objects
Total count of objects over period of time
Average count of objects over period of time
Alerts need to be monitored because they drive certain actions. They are critical to operations, are time sensitive in terms of processing, and need to be logged for audit and further analysis.
Persistence of alerts
The alerts need to persist locally on the edge where they are raised and then be passed on to the cloud for further processing and storage. This ensures a quick local response and avoids losing critical alerts due to transient failures.
Some options to achieve this persistence and cloud syncing are:
Utilize the built-in store and forward capability of the IoT Edge runtime, which automatically syncs with Azure IoT Hub after connectivity is restored.
Persist alerts on the host file system as log files, which can be synced periodically to blob storage in the cloud.
Utilize the Azure Blob Storage on IoT Edge module, which will sync this data to Azure Blob Storage in the cloud based on configurable policies.
Use a local database on IoT Edge, such as Azure SQL Edge, for storing data, and sync it with Azure SQL Database using SQL Data Sync. Another lightweight database option is SQLite.
The preferred option is to use the built-in store and forward capability of the IoT Edge runtime. This option is more suitable for alerts due to their time sensitivity, typically small message sizes, and ease of use.
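As a minimal sketch of the preferred option, a custom edge module can emit an alert message to one of its outputs and let the IoT Edge runtime buffer and forward it to IoT Hub. The output name and payload fields below are illustrative.

```python
# Send an alert from a custom IoT Edge module (pip install azure-iot-device).
# edgeHub buffers the message and forwards it to IoT Hub when connectivity allows.
import json
from azure.iot.device import IoTHubModuleClient, Message

client = IoTHubModuleClient.create_from_edge_environment()
client.connect()

alert = {
    "type": "objectDetected",        # illustrative alert fields
    "confidence": 0.93,
    "cameraId": "camera-01",
    "timestamp": "2020-12-18T10:15:00Z",
}

message = Message(json.dumps(alert), content_type="application/json", content_encoding="utf-8")
client.send_message_to_output(message, "alerts")   # "alerts" is a module output used in routing
client.shutdown()
```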
Next steps
Now you can proceed to work on the User interface in Azure IoT Edge Vision for your vision workload.
User interface in Azure IoT Edge Vision
12/18/2020 • 9 minutes to read • Edit Online
The user interface requirements of an IoT solution will vary depending on the overall objectives. Four types of user
interfaces are commonly found in IoT solutions:
Administrator: Allows full access to device provisioning, device and solution configuration, user management, and so on. These features could be provided as part of one solution or as separate solutions.
Consumer: Applies only to consumer solutions. It provides access similar to the operator's interface, but limited to the devices owned by the user.
Operator: Provides centralized access to the operational components of the solution. It typically includes device management, alert monitoring, and configuration.
Analytics: An interactive dashboard that provides visualization of telemetry and other data analyses.
This article focuses on a simple operator’s user interface and visualization dashboard.
Technology options
Power BI - Power BI is a compelling option for analytics and visualization needs. It provides powerful features to create customizable, interactive dashboards, and it allows connectivity to many popular database systems and services. It is available as a managed service and as a self-hosted package; the former is the most popular and recommended option. With Power BI embedded in your solution, you can add customer-facing reports, dashboards, and analytics in your own applications by using and branding Power BI as your own. You can also reduce developer resources by automating the monitoring, management, and deployment of analytics, while getting full control of Power BI features and intelligent analytics.
Azure Maps - Another suitable technology for IoT visualizations is Azure Maps which allows you to create
location-aware web and mobile applications using simple and secure geospatial services, APIs, and SDKs in
Azure. You can deliver seamless experiences based on geospatial data with built-in location intelligence from
world-class mobility technology partners.
Azure App Service - Azure App Service is a managed platform with powerful capabilities for building web
and mobile apps for many platforms and mobile devices. It allows developers to quickly build, deploy, and
scale web apps created with popular frameworks, such as .NET, .NET Core, Node.js, Java, PHP, Ruby, or
Python, in containers or running on any supported operating system. You can also meet rigorous,
enterprise-grade performance, security, and compliance requirements by using the fully managed platform
for your operational and monitoring tasks.
Azure SignalR Service - For real-time data reporting, Azure SignalR Service makes adding real-time communications to your web application as simple as provisioning a service. In-depth real-time communications expertise is not required. It easily integrates with services such as Azure Functions, Azure Active Directory, Azure Storage, Azure App Service, Azure Analytics, Power BI, Azure IoT, Azure Cognitive Services, Azure Machine Learning, and more.
To secure your user interface solutions, Azure Active Directory (Azure AD) enterprise identity service provides single sign-on and multi-factor authentication.
Now let's learn how to build the user interface for some common scenarios.
Scenario 1
Contoso Boards produces high-quality circuit boards used in computers. Their number one product is a
motherboard. Lately, they have been seeing an increase in issues with chip placement on the board. Through their
investigation, they have noticed that the circuit boards are getting placed incorrectly on the assembly line. They
need a way to identify if the circuit board is placed on the assembly line correctly. The data scientists at Contoso
Boards are most familiar with TensorFlow and would like to continue using it as their primary ML model structure.
Contoso Boards has several assembly lines that produce these motherboards. They would also like to centralize the management of the entire solution.
Considerations in this scenario
Contoso Boards can ask themselves questions such as the following:
What are we analyzing?
Motherboard
Where are we going to view the motherboard from?
Assembly Line Conveyor belt
What camera do we need?
Area or line scan
Color or monochrome
CCD or CMOS sensor
Global or rolling shutter
Frame rate
Resolution
What type of lighting is needed?
Backlighting
Shade
Darkfield
How should the camera be mounted?
Top down
Side view
Angular
What hardware should be used?
CPU
FPGA
GPU
ASIC
Solution
To find the solution that will be useful for Contoso Boards, let's focus on edge detection. We need to position a camera directly above the part, at 90 degrees and about 16 inches above it. Because the conveyor system moves relatively slowly, we can use an area scan camera with a global shutter. For this use case, the camera should capture about 30 frames per second. The resolution can be found using the formula Resolution = (object size) / (smallest detail to be captured). Based on this formula, Res = 16"/8" gives 2 MP in x and 4 in y, so we need a camera capable of 4 MP. As for the sensor type, the part is not fast moving and we are really looking for edge detection, so a CMOS sensor is better suited. One of the more critical aspects of any vision workload is lighting. In this application, Contoso Boards should choose a white diffused backlight. This will make the part look almost black and provide a high amount of contrast for edge detection. When it comes to color options for this application, it is better to capture in black and white, as this yields the sharpest edges for the AI detection model. The data scientists are most familiar with TensorFlow, so learning ONNX or another framework would slow down model development. Because several assembly lines will use this solution and Contoso Boards would like a centrally managed edge solution, Azure Stack Edge (with the GPU option) would work well here. Based on the workload, the fact that Contoso Boards already knows TensorFlow, and that the solution will be used on multiple assembly lines, GPU-based hardware is the choice for hardware acceleration.
The following figure shows a sample of what the camera would see in this scenario:
Scenario 2
Contoso Shipping recently has had several pedestrian accidents at their loading docks. Most of the accidents are
happening when a truck leaves the loading dock, and the driver does not see a dock worker walking in front of the
truck. Contoso Shipping would like a solution that watches for people, predicts their direction of travel, and warns drivers of the potential danger of hitting workers. The distance from the cameras to Contoso Shipping's server room is too far for GigE connectivity; however, they do have a large Wi-Fi mesh that could be used. Most of the data scientists at Contoso Shipping are familiar with OpenVINO, and they would like to be able to reuse the models on additional hardware in the future. The solution also needs to ensure that devices operate as power-efficiently as possible. Finally, Contoso Shipping needs a way to manage the solution remotely for updates.
Considerations in this scenario
Contoso Shipping can introspect by asking the following questions:
What are we analyzing?
People and patterns of movement
Where are we going to view the people from?
The loading docks are 165 feet long.
Cameras will be placed 17 feet high to keep with city ordinances.
Cameras will need to be positioned 100 feet away from the front of the trucks.
Camera focus will need to be 10 feet behind the front of the truck and 10 additional feet in front of the truck, giving a 20-foot depth of focus.
What camera do we need?
Area or line scan
Color or monochrome
CCD or CMOS sensor
Global or rolling shutter
Frame rate
Resolution
What type of lighting is needed?
Backlighting
Shade
Darkfield
What hardware should be used?
CPU
FPGA
GPU
ASIC
How should the camera be mounted?
Top down
Side view
Angular
Solution
Based on the size of the loading dock, Contoso Shipping will require several cameras to cover the entire dock. The zoning laws that Contoso Shipping must adhere to require that surveillance cameras cannot be mounted higher than 20 feet. In this use case, the average height of a worker is 5 feet 8 inches. The solution must use the fewest cameras possible.
Formula for field of view (FOV):
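A commonly used approximation relates the horizontal field of view to the camera's sensor width, lens focal length, and working distance (the camera-to-subject distance). It is offered here as a general optics relationship rather than a vendor-specific formula:

$$ \mathrm{FOV_{horizontal}} \approx \frac{\text{sensor width} \times \text{working distance}}{\text{focal length}} $$

Dividing the 165-foot dock length by the horizontal FOV that a single camera achieves at the required working distance gives a rough estimate of the minimum number of cameras.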
Next steps
This series of articles has demonstrated how to build a complete vision workload using Azure IoT Edge devices. For further information, refer to the following product documentation:
Azure IoT Edge documentation
Tutorial: Perform image classification at the edge with Custom Vision Service
Azure Machine Learning documentation
Azure Kinect DK documentation
MMdnn tool
ONNX
Azure Industrial IoT Analytics Guidance
12/18/2020 • 11 minutes to read • Edit Online
This article series shows a recommended architecture for an Industrial IoT (IIoT) analytics solution on Azure using
PaaS (Platform as a service) components. Industrial IoT (IIoT) is the application of the Internet of Things to the manufacturing industry.
An IIoT analytics solution can be used to build a variety of applications that provide:
Asset monitoring
Process dashboards
Overall equipment effectiveness (OEE)
Predictive maintenance
Forecasting
Such an IIoT analytics solution relies on real-time and historical data from industrial devices and control systems located in discrete and process manufacturing facilities. These include PLCs (Programmable Logic Controllers), industrial equipment, SCADA (Supervisory Control and Data Acquisition) systems, MES (Manufacturing Execution Systems), and process historians. The architecture covered by this series includes guidance for connecting to all these systems.
A modern IIoT analytics solution goes beyond moving existing industrial processes and tools to the cloud. It
involves transforming your operations and processes, embracing PaaS services, and leveraging the power of
machine learning and the intelligent edge to optimize industrial processes.
The following list shows some typical personas who would use the solution and how they would use this solution:
Plant Manager - responsible for the entire operations, production, and administrative tasks of the
manufacturing plant.
Production Manager - responsible for production of a certain number of components.
Process Engineer - responsible for designing, implementing, controlling, and optimizing industrial processes.
Operations Manager - responsible for overall efficiency of operation in terms of cost reduction, process time,
process improvement, and so on.
Data Scientist – responsible for building and training predictive Machine Learning models using historical
industrial telemetry.
The following architecture diagram shows the core subsystems that form an IIoT analytics solution.
NOTE
This architecture represents an ingestion-only pattern. No control commands are sent back to the industrial systems or
devices.
The architecture consists of a number of subsystems and services, and makes use of the Azure Industrial IoT
components. Your own solution may not use all these services or may have additional services. This architecture
also lists alternative service options, where applicable.
IMPORTANT
This architecture includes some services marked as "Preview" or "Public Preview". Preview services are governed by
Supplemental Terms of Use for Microsoft Azure Previews.
Intelligent Edge
Intelligent edge devices perform some data processing on the device itself or on a field gateway. In most industrial
scenarios, the industrial equipment cannot have additional software installed on it. This is why a field gateway is
required to connect the industrial equipment to the cloud.
Azure IoT Edge
To connect industrial equipment and systems to the cloud, we recommend using Azure IoT Edge as the field
gateway for:
Protocol and identity translation;
Edge processing and analytics; and
Adhering to network security policies (ISA 95, ISA 99).
Azure IoT Edge is a free, open source field gateway software that runs on a variety of supported hardware devices
or a virtual machine.
IoT Edge allows you to run edge workloads as Docker container modules. The modules can be developed in several
languages, with SDKs provided for Python, Node.js, C#, Java and C. Prebuilt Azure IoT Edge modules from
Microsoft and third-party partners are available from the Azure IoT Edge Marketplace.
Real-time industrial data is encrypted and streamed through Azure IoT Edge to Azure IoT Hub using AMQP 1.0 or MQTT 3.1.1 protocols. IoT Edge can operate in offline or intermittent network conditions, providing store and forward capabilities.
There are two system modules provided as part of IoT Edge runtime.
The EdgeAgent module is responsible for pulling down the container orchestration specification (manifest)
from the cloud, so that it knows which modules to run. Module configuration is provided as part of the module
twin.
The EdgeHub module manages the communication from the device to Azure IoT Hub, as well as the inter-
module communication. Messages are routed from one module to the next using JSON configuration.
Azure IoT Edge automatic deployments can be used to specify a standing configuration for new or existing devices.
This provides a single location for deployment configuration across thousands of Azure IoT Edge devices.
A number of third-party IoT Edge gateway devices are available from the Azure Certified for IoT Device Catalog.
IMPORTANT
Proper hardware sizing of an IoT Edge gateway is important to ensure edge module performance. See the performance
considerations for this architecture.
Gateway Patterns
There are three patterns for connecting your devices to Azure via an IoT Edge field gateway (or virtual machine):
1. Transparent - Devices already have the capability to send messages to IoT Hub using AMQP or MQTT.
Instead of sending the messages directly to the hub, they instead send the messages to IoT Edge, which in
turn passes them on to IoT Hub. Each device has an identity and device twin in Azure IoT Hub.
2. Protocol Translation - Also known as an opaque gateway pattern. This pattern is often used to connect
older brownfield equipment (for example, Modbus) to Azure. Modules are deployed to Azure IoT Edge to
perform the protocol conversion. Devices must provide a unique identifier to the gateway.
3. Identity Translation - In this pattern, devices cannot communicate directly to IoT Hub (for example, OPC
UA Pub/Sub, BLE devices). The gateway is smart enough to understand the protocol used by the
downstream devices, provide them identity, and translate IoT Hub primitives. Each device has an identity and
device twin in Azure IoT Hub.
Although you can use any of these patterns in your IIoT analytics solution, your choice will be driven by which protocol is installed on your industrial systems. For example, if your SCADA system supports EtherNet/IP, you will need to use protocol translation software to convert EtherNet/IP to MQTT or AMQP. See the Connecting to Historians section for additional guidance.
IoT Edge gateways can be provisioned at scale using the Azure IoT Hub Device Provisioning Service (DPS). DPS is a
helper service for IoT Hub that enables zero-touch, just-in-time provisioning to the right IoT hub without requiring
human intervention, enabling customers to provision millions of devices in a secure and scalable manner.
OPC UA
OPC UA is the successor to OPC Classic (OPC DA, AE, HDA). The OPC UA standard is maintained by the OPC
Foundation. Microsoft has been a member of the OPC Foundation since 1996 and has supported OPC UA on Azure
since 2016.
Industry and domain-specific Information Models can be created based on the OPC UA Data Model. The
specifications of such Information Models (also called industry standard models since they typically address a
dedicated industry problem) are called Companion Specifications. The synergy of the OPC UA infrastructure to
exchange such industry information models enables interoperability at the semantic level. OPC UA can use a
number of transport protocols including MQTT, AMQP, and UADP.
Microsoft has developed open source Azure Industrial IoT components, based on OPC UA, which implement
identity translation pattern:
OPC Twin consists of microservices and an Azure IoT Edge module to connect the cloud and the factory
network. OPC Twin provides discovery, registration, and synchronous remote control of industrial devices
through REST APIs.
OPC Publisher is an Azure IoT Edge module that connects to existing OPC UA servers and publishes telemetry
data from OPC UA servers in OPC UA PubSub format, in both JSON and binary.
OPC Vault is a microservice that can configure, register, and manage certificate lifecycle for OPC UA server and
client applications in the cloud.
Discovery Services is an Azure IoT Edge module that supports network scanning and OPC UA discovery.
The Microsoft Azure IIoT solution also contains a number of services, REST APIs, deployment scripts, and
configuration tools that you can integrate into your IIoT analytics solution. These are open source and available on
GitHub.
Edge Workloads
The ability to run custom or third-party modules at the edge is important.
If you want to respond to emergencies as quickly as possible, you can run anomaly detection or Machine Learning modules in tight control loops at the edge.
If you want to reduce bandwidth costs and avoid transferring terabytes of raw data, you can clean and
aggregate the data locally then send only the insights to the cloud for analysis.
If you want to convert legacy industrial protocols, you can develop a custom module or purchase a third-party
module for protocol translation.
If you want to quickly respond to an event on the factory floor, you can use an edge module to detect that event
and another module to respond to it.
Microsoft and our partners have made available on the Azure Marketplace a number of edge modules, which can
be used in your IIoT analytics solution. Protocol and identity translation are the most common edge workloads
used within an IIoT analytics solution. In the future, expect to see other workloads such as closed loop control using
edge ML models.
Connecting to Historians
A common pattern when developing an IIoT analytics solution is to connect to a process historian and stream real-time data from the historian to Azure IoT Hub. How this is done will depend on which protocols are installed and accessible (that is, not blocked by firewalls) on the historian.
Depending on the protocol available on the historian, the options are as follows.
OPC UA:
- Use Azure IoT Edge, along with OPC Publisher, OPC Twin, and OPC Vault, to send OPC UA data over MQTT to IoT Hub. OPC Twin also has support for the OPC UA HDA profile, useful for obtaining historical data.
- Use a third-party Azure IoT Edge OPC UA module to send OPC UA data over MQTT to IoT Hub.
Web Service:
- Use a custom Azure IoT Edge HTTP module to poll the web service.
- Use third-party software that supports HTTP to MQTT 3.1.1 or AMQP 1.0.
MQTT 3.1.1 (can publish MQTT messages):
- Connect the historian directly to Azure IoT Hub using MQTT.
- Connect the historian to Azure IoT Edge as a leaf device. See the Transparent gateway pattern.
A number of Microsoft partners have developed protocol and identity translation modules or solutions that are
available on the Azure Marketplace.
Some historian vendors also provide first-class capabilities to send data to Azure.
Once real time data streaming has been established between your historian and Azure IoT Hub, it is important to
export your historian's historical data and import it into your IIoT analytics solution. For guidance on how to
accomplish this, see Historical Data Ingestion.
Cloud Gateway
A cloud gateway provides a cloud hub for devices and field gateways to connect securely to the cloud and send
data. It also provides device management capabilities. For the cloud gateway, we recommend Azure IoT Hub. IoT
Hub is a hosted cloud service that ingests events from devices and IoT Edge gateways. IoT Hub provides secure
connectivity, event ingestion, bidirectional communication, and device management. When IoT Hub is combined
with the Azure Industrial IoT components, you can control your industrial devices using cloud-based REST APIs.
IoT Hub supports the following protocols:
MQTT 3.1.1,
MQTT over WebSockets,
AMQP 1.0,
AMQP over WebSockets, and
HTTPS.
If the industrial device or system supports any of these protocols, it can send data directly to IoT Hub. In most
industrial environments, this is not permissible because of PCN firewalls and network security policies (ISA 95, ISA
99). In such cases, an Azure IoT Edge field gateway can be installed in a DMZ between the PCN and the Internet.
Next steps
To learn the services recommended for this architecture, continue reading the series with Services in an IIoT
analytics solution.
Services in an IIoT analytics solution
12/18/2020 • 19 minutes to read • Edit Online
Building on the architectural components in the recommended Azure Industrial IoT analytics solution, this article
discusses the subsystems and Azure services that can be used in such a solution. Your solution may not use all
these services or may have additional services.
The data in Time Series Insights is stored in your Azure Blob Storage account (bring your own storage account) in
Parquet file format. It is your data after all!
You can query your data in Time Series Insights using:
Time Series Insights Explorer
Query API
REST API
Power BI
Any of your favorite BI and analytics tools (for example, Spark, Databricks, Azure Notebooks) by accessing the
Parquet files in your Azure blob storage account
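As a small illustration of the last option, the Parquet files can be read directly with pandas. The storage account, container, and file path below are placeholders, and the adlfs package is assumed to be installed so that abfs:// URLs resolve.

```python
# Read Time Series Insights Parquet output directly from Blob Storage.
# Requires: pip install pandas pyarrow adlfs. All names below are placeholders.
import pandas as pd

STORAGE_ACCOUNT = "<your-storage-account>"
ACCOUNT_KEY = "<your-account-key>"

# The container and folder layout are illustrative; point this at the path TSI writes to.
df = pd.read_parquet(
    "abfs://tsi-output/V=1/PT=Time/Y=2020/M=12/part-000.parquet",
    storage_options={"account_name": STORAGE_ACCOUNT, "account_key": ACCOUNT_KEY},
)

print(df.head())  # inspect the telemetry columns before further analysis
```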
Microservices
Your IIoT analytics solution will require a number of microservices to perform functions such as:
Providing HTTP REST APIs to support your web application.
We recommend creating HTTP-triggered Azure Functions to implement your APIs.
Alternatively, you can develop and host your REST APIs using Azure Service Fabric or Azure Kubernetes
Service (AKS).
Providing an HTTP REST API interface to your factory floor OPC UA servers (for example, using Azure
Industrial IoT components consisting of OPC Publisher, OPC Twin and OPC Vault) to provide discovery,
registration, and remote control of industrial devices.
For hosting the Azure Industrial IoT microservices, we recommend using Azure Kubernetes Service (AKS).
See Deploying Azure Industrial IoT Platform to understand the various deployment options.
Performing data transformation, such as converting binary payloads to JSON or converting differing JSON payloads to a common, canonical format.
We recommend creating Azure Functions connected to IoT Hub to perform payload transformations.
Different industrial equipment vendors will send telemetry in different payload formats (JSON,
binary, and so on) and schemas. When possible, we recommend converting the different equipment
schemas to a common, canonical schema, ideally based on an industry standard.
If the message body is binary, use an Azure Function to convert the incoming messages to JSON and send the converted messages back to IoT Hub or to Event Hub (a conversion sketch follows this list).
When the message body is binary, IoT Hub message routing cannot be used against the message
body, but can be used against message properties.
The Azure Industrial IoT components include the capability to decode OPC UA binary messages to
JSON.
A Data Ingest Administration service for updating the list of tags monitored by your IIoT analytics solution.
A Historical Data Ingestion service for importing historical data from your SCADA, MES, or historian into
your solution.
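The binary-to-JSON conversion mentioned above might look like the following sketch. The record layout (device ID, temperature, pressure, status) is an invented example of a vendor payload, not a real device format.

```python
# Convert a hypothetical fixed-layout binary telemetry payload to canonical JSON.
# The record layout below is invented for illustration; real equipment payloads differ.
import json
import struct

RECORD_FORMAT = "<IffB"  # uint32 device id, float temperature, float pressure, uint8 status

def binary_to_json(payload: bytes) -> str:
    device_id, temperature, pressure, status = struct.unpack(RECORD_FORMAT, payload)
    canonical = {
        "deviceId": device_id,
        "temperatureC": round(temperature, 2),
        "pressureKpa": round(pressure, 2),
        "status": status,
    }
    return json.dumps(canonical)

# Example: pack a sample record, then convert it.
sample = struct.pack(RECORD_FORMAT, 14103, 72.5, 101.3, 1)
print(binary_to_json(sample))
```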
Your solution will likely involve additional microservices to satisfy the specific requirements of your IIoT analytics
solution. If your organization is new to building microservices, we recommend implementing custom
microservices using Azure Functions. Azure Functions is an event-driven serverless compute platform that can be
used to develop microservices and solve complex orchestration problems. It allows you to build and debug locally
(in several software languages) without additional setup, deploy and operate at scale in the cloud, and integrate
Azure services using triggers and bindings.
Both stateless and stateful microservices can be developed using Azure Functions. Azure Functions can use
Cosmos DB, Table Storage, Azure SQL, and other databases to store stateful information.
Alternatively, if your organization has previous experience building container-based microservices, we recommend that you also consider Azure Service Fabric or Azure Kubernetes Service (AKS). Refer to Microservices in Azure for more information.
Regardless of your microservices platform choice, we recommend using Azure API Management to create
consistent and modern API gateways for your microservices. API Management helps abstract, publish, secure, and
version your APIs.
Data Ingest Administration
We recommend developing a Data Ingest Administration service to add/update the list of tags monitored by your
IIoT analytics solution.
SCADA tags are variables mapped to I/O addresses on a PLC or RTU. Tag names vary from organization to organization but often follow a naming pattern. As an example, tag names for a pump with tag number 14P103 located in STN001 (Station 001) have these statuses:
STN001_14P103_RUN
STN001_14P103_STOP
STN001_14P103_TRIP
As new tags are created in your SCADA system, the IIoT analytics solution must become aware of these tags and
subscribe to them in order to begin collecting data from them. In some cases, the IIoT analytics solution may not
subscribe to certain tags as the data they contain may be irrelevant.
If your SCADA system supports OPC UA, new tags should appear as new NodeIDs in the OPC UA hierarchy. For
example, the above tag names may appear as:
ns=2;s= STN001_14P103_RUN
ns=2;s= STN001_14P103_STOP
ns=2;s= STN001_14P103_TRIP
We recommend developing a workflow that informs the administrators of the IIoT analytics solution when new
tags are created, or existing tags are edited in the SCADA system. At the end of the workflow, the OPC Publisher is
updated with the new/updated tags.
To accomplish this, we recommend developing a workflow that involves Power Apps, Logic Apps, and Azure
Functions, as follows:
The SCADA system operator can trigger the Logic Apps workflow using a Power Apps form whenever tags are
created or edited in the SCADA system.
Alternatively, Logic Apps connectors can monitor a table in the SCADA system database for tag changes.
The OPC UA Discovery service can be used to both find OPC UA servers and the tags and methods they
implement.
The Logic Apps workflow includes an approval step where the IIoT analytics solution owners can approve the
new/updated tags.
Once the new/updated tags are approved and a frequency is assigned, the Logic App calls an Azure Function (a sketch of such a function follows this list).
The Azure Function calls the OPC Twin microservice, which directs the OPC Publisher module to subscribe to the new tags.
A sample can be found here.
If your solution involves third-party software, instead of OPC Publisher, configure the Azure Function to
call an API running on the third-party software either directly or using an IoT Hub Direct Method.
Alternatively, Microsoft Forms and Microsoft Flow can be used in place of Power Apps and Logic Apps.
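A sketch of the Azure Function that the Logic App calls is shown below. The request payload shape and the publisher endpoint are hypothetical placeholders; the exact API depends on whether OPC Publisher, OPC Twin, or third-party software receives the subscription request.

```python
# HTTP-triggered Azure Function (Python) that receives approved tags from the Logic App
# and forwards them to a publisher endpoint. The payload shape and PUBLISHER_API URL are
# hypothetical; adapt them to your OPC Twin/OPC Publisher or third-party setup.
import json
import os

import azure.functions as func
import requests

PUBLISHER_API = os.environ.get("PUBLISHER_API", "https://example.internal/publisher/subscribe")

def main(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_json()
    tags = body.get("tags", [])              # e.g. ["ns=2;s=STN001_14P103_RUN", ...]
    interval = body.get("publishingIntervalMs", 1000)

    response = requests.post(
        PUBLISHER_API,
        json={"nodeIds": tags, "publishingIntervalMs": interval},
        timeout=30,
    )
    response.raise_for_status()

    return func.HttpResponse(
        json.dumps({"subscribed": len(tags)}),
        mimetype="application/json",
        status_code=200,
    )
```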
Historical Data Ingestion
Years of historical data likely exists in your current SCADA, MES, or historian system. In most cases, you will want to
import your historical data into your IIoT analytics solution.
Loading historical data into your IIoT analytics solution consists of three steps:
1. Export your historical data.
a. Most SCADA, MES, or historian systems have some mechanism that allows you to export your historical
data, often as CSV files. Consult your system's documentation on how best to do this.
b. If there is no export option in your system, consult the system's documentation to determine if an API
exists. Some systems support HTTP REST APIs or OPC Historical Data Access (HDA). If so, build an
application or use a Microsoft partner solution that connects to the API, queries for the historical data,
and saves it to a file in formats such as CSV, Parquet, TSV, and so on.
2. Upload to Azure.
a. If the aggregate size of the exported data is small, you can upload the files to Azure Blob Storage over the internet using AzCopy.
b. If the aggregate size of the exported data is large (tens or hundreds of TBs), consider using Azure
Import/Export Service or Azure Data Box to ship the files to the Azure region where your IIoT analytics
solution is deployed. Once received, the files will be imported into your Azure Storage account.
3. Import your data.
a. This step involves reading the files in your Azure Storage account, serializing the data as JSON, and sending it as streaming events into Time Series Insights. We recommend using an Azure Function to perform this step.
b. Time Series Insights only supports IoT Hub and Event Hub as data sources. We recommend using an
Azure Function to send the events to a temporary Event Hub, which is connected to Time Series Insights.
c. Refer to How to shape JSON events and Supported JSON shapes for best practices on shaping your
JSON payload.
d. Make sure to use the same Time Series ID as you do for your streaming data.
e. Once this process is completed, the Event Hub and Azure Function may be deleted. This is an optional
step.
NOTE
Exporting large volumes of data from your industrial system (for example, SCADA or historian) may place a significant
performance load on that system, which can negatively impact operations. Consider exporting smaller batches of historical
data to minimize performance impacts.
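For the import step (step 3 above), a minimal sketch that reads an exported CSV and forwards rows to the Event Hub connected to Time Series Insights might look like this. The connection string, file path, and column names are placeholders.

```python
# Replay exported historian CSV rows into an Event Hub connected to Time Series Insights.
# Requires: pip install azure-eventhub pandas. All names and columns below are placeholders.
import json

import pandas as pd
from azure.eventhub import EventHubProducerClient, EventData

EVENT_HUB_CONN_STR = "<event-hub-namespace-connection-string>"
EVENT_HUB_NAME = "tsi-import"

producer = EventHubProducerClient.from_connection_string(
    EVENT_HUB_CONN_STR, eventhub_name=EVENT_HUB_NAME
)

history = pd.read_csv("exported_history.csv")  # assumed columns: timestamp, tag, value

with producer:
    batch = producer.create_batch()
    pending = 0
    for row in history.itertuples(index=False):
        event = json.dumps({"timestamp": row.timestamp, "tag": row.tag, "value": float(row.value)})
        try:
            batch.add(EventData(event))
        except ValueError:                      # batch is full: send it and start a new one
            producer.send_batch(batch)
            batch = producer.create_batch()
            batch.add(EventData(event))
            pending = 0
        pending += 1
    if pending:
        producer.send_batch(batch)
```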
Notifications
Since the IIoT analytics solution is not a control system, it does not require a complete Alarm Management system.
However, there will be cases where you will want the ability to detect conditions in the streaming data and generate
notifications or trigger workflows. Examples include:
temperature of a heat exchanger exceeding a configured limit, which changes the color of an icon in your web
application,
an error code sent from a pump, which triggers a work order in your ERP system, or
the vibration of a motor exceeding limits, which triggers an email notification to an Operations Manager.
We recommend using Azure Stream Analytics to define and detect conditions in the streaming data (refer to the
rules engine mentioned earlier). For example, a Plant Manager implements an automated workflow that runs
whenever an error code is received from any equipment. In this example, your custom web application can use the
Stream Analytics REST API to provide a user interface for the Plant Manager to create and run a job that monitors
for specific error codes.
For defining an alert (email or SMS) or triggering a workflow, we recommend using Azure Logic Apps. Logic Apps
can be used to build automated, scalable workflows, business processes, and enterprise orchestrations to integrate
your equipment and data across cloud services and on-premises systems.
We recommend connecting Azure Stream Analytics with Azure Logic Apps using Azure Service Bus. In the previous
example, when an error code is detected by Stream Analytics, the job will send the error code to an Azure Service
Bus queue output. A Logic App will be triggered to run whenever a message is received on the queue. This Logic
App will then perform the workflow defined by the Plant Manager, which may involve creating a work order in Dynamics 365 or SAP, or sending an email to a maintenance technician. Your web application can use the Logic Apps REST API to provide a user interface for the Plant Manager to author workflows, or these workflows can be built using the Azure portal authoring experience.
To display visual alerts in your web application, we recommend creating an Azure Stream Analytics job to detect
specific events and send those to either:
An Event Hub output - Then connect the Event Hub to Time Series Insights. Use the Azure Time Series
Insights JavaScript SDK (tsiclient) to display the event in your web application.
or,
An Azure Functions output - Then develop an Azure Function that sends the events to your web application
using SignalR.
Operational alarms and events triggered on premise can also be ingested into Azure for reporting and to trigger
work orders, SMS messages, and emails.
Microsoft 365
The IIoT analytics solution can also include Microsoft 365 services to automate tasks and send notifications. The
following are a few examples:
Receive email alerts in Microsoft Outlook or post a message to a Microsoft Teams channel when a condition is
met in Azure Stream Analytics.
Receive notifications as part of an approval workflow triggered by a Power App or Microsoft Forms submission.
Create an item in a SharePoint list when an alert is triggered by a Logic App.
Notify a user or execute a workflow when a new tag is created in a SCADA system.
Machine Learning
Machine learning models can be trained using your historical industrial data, enabling you to add predictive
capabilities to your IIoT application. For example, your Data Scientists may be interested in using the IIoT analytics
solution to build and train models that can predict events on the factory floor or indicate when maintenance should
be conducted on an asset.
For building and training machine learning models, we recommend Azure Machine Learning. Azure Machine
Learning can connect to Time Series Insights data stored in your Azure Storage account. Using the data, you can
create and train forecasting models in Azure Machine Learning. Once a model has been trained, it can be deployed
as a web service on Azure (hosted on Azure Kubernetes Services or Azure Functions, for example) or to an Azure
IoT Edge field gateway.
For those new to machine learning or organizations without Data Scientists, we recommend starting with Azure
Cognitive Services. Azure Cognitive Services are APIs, SDKs, and services available to help you build intelligent
applications without having formal AI or data science skills or knowledge. Azure Cognitive Services enable you to
easily add cognitive features into your IIoT analytics solution. The goal of Azure Cognitive Services is to help you
create applications that can see, hear, speak, understand, and even begin to reason. The catalog of services within
Azure Cognitive Services can be categorized into five main pillars - Vision, Speech, Language, Web Search, and
Decision.
Asset Hierarchy
An asset hierarchy allows you to define hierarchies for classifying your assets, for example, Country > Location > Facility > Room. Hierarchies may also contain the relationships between your assets. Many organizations maintain asset
hierarchies within their industrial systems or within an Enterprise Asset Management (EAM) system.
The Time Series Model in Azure Time Series Insights provides asset hierarchy capabilities. Through the use of
Instances, Types and Hierarchies, you can store metadata about your industrial devices, as shown in the image
below.
If possible, we recommend exporting your existing asset hierarchy and importing it into Time Series Insights using
the Time Series Model APIs. We recommend periodically refreshing it as updates are made in your Enterprise Asset
Management system.
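As an illustration, the following hedged sketch uses Python and the Time Series Insights Gen2 batch API to push one hierarchy definition into the Time Series Model. The environment FQDN, API version, and hierarchy shape are assumptions; confirm them against the Time Series Model API reference before adapting this for a periodic refresh job.

import requests
from azure.identity import DefaultAzureCredential

TSI_FQDN = "<your-environment>.env.timeseries.azure.com"  # placeholder

# Acquire a token for the Time Series Insights data plane (assumed resource URI).
token = DefaultAzureCredential().get_token(
    "https://api.timeseries.azure.com/.default").token

# A hierarchy exported from an EAM system, mapped to instance fields (assumed shape).
hierarchy = {
    "id": "00000000-0000-0000-0000-000000000001",  # placeholder GUID
    "name": "Plant hierarchy",
    "source": {
        "instanceFieldNames": ["Country", "Location", "Facility", "Room"]
    },
}

resp = requests.post(
    f"https://{TSI_FQDN}/timeseries/hierarchies/$batch",
    params={"api-version": "2020-07-31"},  # assumed API version
    headers={"Authorization": f"Bearer {token}"},
    json={"put": [hierarchy]},
)
resp.raise_for_status()
print(resp.json())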
In the future, asset models will evolve to become digital twins, combining dynamic asset data (real-time telemetry),
static data (3D models, metadata from Asset Management Systems), and graph-based relationships, allowing the
digital twin to change in real-time along with the physical asset.
Azure Digital Twins is an Azure IoT service that provides the ability to:
Create comprehensive models of physical environments,
Create spatial intelligence graphs to model the relationships and interactions between people, places, and
devices,
Query data from a physical space rather than disparate sensors, and
Build reusable, highly scalable, spatially aware experiences that link streaming data across the physical and
digital world.
User Management
User management involves managing user profiles and controlling what actions a user can perform in your IIoT
analytics solution. For example, what asset data can a user view, or whether the user can create conditions and
alerts. This is frequently referred to as role-based access control (RBAC).
We recommend implementing role-based access control using the Microsoft identity platform along with Azure
Active Directory. In addition, the Azure PaaS services mentioned in this IIoT analytics solution can integrate directly
with Azure Active Directory, thereby ensuring security across your solution.
Your web application and custom microservices can also integrate with the Microsoft identity platform using
libraries such as Microsoft Authentication Library (or MSAL) and protocols such as OAuth 2.0 and OpenID Connect.
User management also involves operations such as:
creating a new user,
updating a user's profile, such as their location and phone number,
changing a user's password, and
disabling a user's account.
For these operations, we recommend using the Microsoft Graph.
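The following is a minimal sketch of a user management call from a custom microservice, assuming MSAL for Python, an app registration with the required Microsoft Graph application permissions, and placeholder tenant and client values.

import msal
import requests

TENANT_ID = "<tenant-id>"          # placeholders
CLIENT_ID = "<client-id>"
CLIENT_SECRET = "<client-secret>"

app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=CLIENT_SECRET,
)
token = app.acquire_token_for_client(
    scopes=["https://graph.microsoft.com/.default"])["access_token"]

# Disable a user's account, one of the operations listed above.
user_id = "operator@contoso.com"  # hypothetical user
resp = requests.patch(
    f"https://graph.microsoft.com/v1.0/users/{user_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={"accountEnabled": False},
)
resp.raise_for_status()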
Next steps
Data visualization is the backbone of a well-defined analytics system. Learn about the data visualization techniques
that you can use with the IIoT analytics solution recommended in this series.
Data analysis in Azure Industrial IoT analytics solution
12/18/2020 • 4 minutes to read • Edit Online
This article shows you how to visualize the data collected by the Azure Industrial IoT analytics solution. You can
easily look for data trends using visual elements and dashboards, and use these trends to analyze the effectiveness
of your solution.
Visualization
There are many options for visualizing your industrial data. Your IIoT analytics solution may use some or all of
these options, depending on the personas using your solution.
For Process Engineers and other personas looking to perform ad-hoc analytics and trend visualizations, we
recommend using Azure Time Series Insights explorer.
For Plant Managers and other personas wanting to develop dashboards, we recommend using Power BI, and
connecting Power BI with your data in Time Series Insights using the Power BI connector. Using Power BI, these
users can also combine external data from your ERP, EAM, or other systems with the data in Time Series
Insights.
For advanced visualizations, such as schematic views and process graphics, we recommend a custom web
application.
For Data Scientists interested in using open source data analysis and visualization tools such as Python, Jupyter
Notebooks, and Matplotlib, we recommend Azure Notebooks.
Data Trends
The Azure Time Series Insights explorer is a web application that provides powerful data trending and visualization
capabilities that make it simple to explore and analyze billions of IIoT events simultaneously.
Time Series Insights Explorer is ideally suited to personas, such as a Process Engineer or Operations Manager, who
want to explore, analyze and visualize the raw data coming from your industrial systems. The insights gained from
exploring the raw data can help build Azure Stream Analytics jobs, which look for conditions in the data or perform
calculations over the data.
The Azure Time Series Insights explorer allows you to seamlessly explore both your warm data and your cold, or
historical, data, as demonstrated in the following figure.
Azure Time Series Insights explorer has a powerful yet intuitive user interface, as shown below.
Dashboards
For some personas, such as a Plant Manager, dashboards containing factory or plant KPIs and visualizations are
more important than viewing the raw data. For such users, we recommend Power BI as the visualization solution.
You can connect Power BI with your data stored in Time Series Insights, providing you with powerful reporting and
dashboard capabilities over your industrial data, and allowing you to share insights and results across your
organization.
Schematic Views
For advanced visualizations, such as schematic views or process graphics, you may require a custom web
application. A custom web application also allows you to provide a single-pane-of-glass user experience and other
advanced capabilities, including:
a simplified and integrated authoring experience for Azure Stream Analytics jobs and Logic Apps,
displaying real-time data using process or custom visuals,
displaying KPIs and external data with embedded Power BI dashboards,
displaying visual alerts using SignalR, and
allowing administrators to add/remove users from the solution.
We recommend building a Single Page Application (SPA) using:
JavaScript, HTML5, and CSS3
Time Series Insights JavaScript SDK for displaying process or custom visuals with data from Time Series
Insights
MSAL.js to sign in users and acquire tokens to use with the Microsoft Graph
Azure App Service Web Apps to host the web application
Power BI to embed Power BI dashboards directly in the web app
Azure Maps to render map visualizations
Microsoft Graph SDK for JavaScript to integrate with Microsoft 365
Notebooks
One of the advantages of moving operational data to the cloud is the ability to take advantage of modern big-data
tool sets. One of the most common tools used by Data Scientists for ad-hoc analysis of big data is the Jupyter
Notebook. Jupyter (formerly IPython) is an open-source project that lets you easily combine Markdown text,
executable code, persistent data, graphics, and visualizations onto a single, sharable canvas - the notebook.
Production Engineers should also consider learning Jupyter Notebooks to assist in analyzing plant events, finding
correlations, and so on. Jupyter Notebooks support the Python 2/3, R, and F# programming languages and
can connect to your Time Series Insights data stored in Azure Storage.
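The following notebook-style sketch illustrates this kind of ad-hoc analysis with pandas and Matplotlib, assuming Time Series Insights cold data has been copied locally as Parquet files from the connected Azure Storage account. The file path and column names are placeholders; inspect the Parquet schema in your environment before adapting it.

import pandas as pd
import matplotlib.pyplot as plt

# Load a day's worth of events and narrow the frame to a single sensor tag.
df = pd.read_parquet("tsi-export/2020-12-01.parquet")        # hypothetical path
pump = df[df["timeSeriesId"] == "Pump01.DischargePressure"]  # assumed columns

# Index by timestamp, resample to 5-minute averages, and plot the trend.
pump = pump.set_index(pd.to_datetime(pump["timestamp"]))
pump["value"].resample("5min").mean().plot(figsize=(10, 4))
plt.title("Pump01 discharge pressure (5-minute average)")
plt.ylabel("pressure")
plt.show()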
Next steps
Now that you have learned the architecture of an Azure IIoT analytics solution, read the architectural considerations
that improve the resiliency and efficiency of this architecture.
Architectural Considerations in an IIoT Analytics
Solution
12/18/2020 • 3 minutes to read • Edit Online
The Microsoft Azure Well-Architected Framework describes some key tenets of a good architectural design.
Keeping in line with these tenets, this article describes the considerations in the reference Azure Industrial IoT
analytics solution that improve its performance and resiliency.
Performance Considerations
Azure PaaS Services
All Azure PaaS services have the ability to scale up and/or scale out. Some services do this automatically (for
example, IoT Hub, and Azure Functions in a Consumption Plan), while others can be scaled manually.
As you test your IIoT analytics solution, we recommend that you:
understand how each service scales (that is, the units of scale),
collect performance metrics and establish baselines, and
set up alerts for when performance metrics exceed baselines.
All Azure PaaS services have a metrics blade that allows you to view service metrics and to configure conditions
and alerts; the metrics are collected and displayed in Azure Monitor. We recommend enabling these features so you
can verify that your solution performs as expected.
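As an illustration of collecting baseline metrics programmatically, the following hedged sketch queries the Azure Monitor metrics REST API for an IoT Hub metric using Python. The resource ID, metric name, and API version are assumptions; confirm the supported metrics for each service in the Azure Monitor documentation.

import requests
from azure.identity import DefaultAzureCredential

RESOURCE_ID = ("/subscriptions/<sub-id>/resourceGroups/<rg>"
               "/providers/Microsoft.Devices/IotHubs/<hub-name>")  # placeholder

token = DefaultAzureCredential().get_token(
    "https://management.azure.com/.default").token

resp = requests.get(
    f"https://management.azure.com{RESOURCE_ID}/providers/microsoft.insights/metrics",
    params={
        "api-version": "2018-01-01",
        "metricnames": "d2c.telemetry.ingress.success",  # assumed IoT Hub metric name
        "interval": "PT1H",
    },
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()

# Print each returned metric name and its time series for baselining.
for metric in resp.json().get("value", []):
    print(metric["name"]["value"], metric["timeseries"])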
IoT Edge
Azure IoT Edge gateway performance is impacted by:
the number of edge modules running and their performance requirements,
the number of messages processed by modules and EdgeHub,
Edge modules requiring GPU processing,
offline buffering of messages,
the gateway hardware, and
the gateway operating system.
We recommend real-world testing and/or testing with simulated telemetry to understand the field gateway
hardware requirements for Azure IoT Edge. Conduct your initial testing using a virtual machine, where CPU, RAM,
and disk can be easily adjusted. Once approximate hardware requirements are known, get your field gateway
hardware and repeat your testing on the actual hardware.
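The following is a minimal sketch for generating simulated telemetry with the Azure IoT device SDK for Python, which can help size the field gateway before real hardware is available. The connection string, tag name, and message rate are placeholders; adjust the payload shape and rate to approximate your historian or SCADA tags.

import json
import random
import time
from azure.iot.device import IoTHubDeviceClient, Message

CONNECTION_STRING = "<device-connection-string>"  # placeholder test device

client = IoTHubDeviceClient.create_from_connection_string(CONNECTION_STRING)

try:
    while True:
        # Simulate one historian tag reading per second.
        payload = {
            "tag": "Pump01.DischargePressure",          # hypothetical tag name
            "value": round(random.uniform(95, 105), 2),
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        }
        msg = Message(json.dumps(payload),
                      content_encoding="utf-8",
                      content_type="application/json")
        client.send_message(msg)
        time.sleep(1)
finally:
    client.shutdown()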
You should also test to ensure:
no messages are being lost between source (for example, historian) and destination (for example, Time Series
Insights),
acceptable message latency exists between the source and the destination,
that source timestamps are preserved, and
data accuracy is maintained, especially when performing data transformations.
Availability Considerations
IoT Edge
A single Azure IoT Edge field gateway can be a single point of failure between your SCADA, MES, or historian and
Azure IoT Hub. A failure can cause gaps in data in your IIoT analytics solution. To prevent this, IoT Edge can
integrate with your on-premises Kubernetes environment, using it as a resilient, highly available infrastructure layer.
For more information, see How to install IoT Edge on Kubernetes (Preview).
Network Considerations
IoT Edge and Firewalls
To maintain compliance with standards such as ISA 95 and ISA 99, industrial equipment is often installed in a
closed Process Control Network (PCN), behind firewalls, with no direct access to the Internet (see Purdue
networking model).
There are three options to connect to equipment installed in a PCN:
1. Connect to a higher-level system, such as a historian, located outside of the PCN.
2. Deploy an Azure IoT Edge device or virtual machine in a DMZ between the PCN and the internet.
a. The firewall between the DMZ and the PCN will need to allow inbound connections from the DMZ to the
appropriate system or device in the PCN.
b. There may be no internal DNS setup to resolve PCN names to IP addresses.
3. Deploy an Azure IoT Edge device or virtual machine in the PCN and configure IoT Edge to communicate with
the Internet through a Proxy server.
a. Additional IoT Edge setup and configuration are required. See Configure an IoT Edge device to
communicate through a proxy server.
b. The Proxy server may introduce a single point of failure and/or a performance bottleneck.
c. There may be no DNS setup in the PCN to resolve external names to IP addresses.
Azure IoT Edge will also require:
access to container registries, such as Docker Hub or Azure Container Registry, to download modules over
HTTPS,
access to DNS to resolve external FQDNs, and
ability to communicate with Azure IoT Hub using MQTT, MQTT over WebSockets, AMQP, or AMQP over
WebSockets.
For additional security, industrial firewalls can be configured to only allow traffic between IoT Edge and IoT Hub
using Service Tags. IP address prefixes of IoT Hub public endpoints are published periodically under the
AzureIoTHub service tag. Firewall administrators can programmatically retrieve the current list of service tags,
together with IP address range detail, and update their firewall configuration.
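For example, the following hedged sketch retrieves the AzureIoTHub address prefixes with the azure-mgmt-network SDK for Python so a firewall configuration can be kept current. The subscription ID and region are placeholders, and the result attribute names should be verified against your SDK version.

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# The service tag discovery API returns the tags applicable to a given region.
tags = client.service_tags.list("westeurope")

iot_hub_prefixes = [
    value.properties.address_prefixes
    for value in tags.values
    if value.name.startswith("AzureIoTHub")
]
print(iot_hub_prefixes)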
Next Steps
For a more detailed discussion of the recommended architecture and implementation choices, download
and read the Microsoft Azure IoT Reference Architecture pdf.
Azure Industrial IoT components, tutorials, and source code.
For detailed documentation of the various Azure IoT services, see Azure IoT Fundamentals.
Add IP address spaces to peered virtual networks
12/18/2020 • 3 minutes to read • Edit Online
Many organizations deploy a virtual networking architecture that follows the Hub and Spoke model. At some point,
the hub virtual network might require additional IP address spaces. However, address ranges can't be added or
deleted from a virtual network's address space once it's peered with another virtual network. To add or remove
address ranges, delete the peering, add or remove the address ranges, then re-create the peering manually. The
scripts described in this article can make that process easier.
Single subscription
In the single subscription use case, both the hub and all spoke virtual networks are in the same subscription.
Multiple subscriptions
In the multiple subscription use case, the hub virtual network is in one subscription and the spoke virtual networks
are in different subscriptions. All subscriptions belong to a single Azure Active Directory tenant.
Considerations
Running the script will result in an outage or disconnection between the hub and spoke virtual networks. Execute
it during an approved maintenance window.
Run Get-Module -ListAvailable Az to find the installed version. The script requires the Azure PowerShell module
version 1.0.0 or later. If you need to upgrade, see Install Azure PowerShell module.
If not already connected, run Connect-AzAccount to create a connection with Azure.
Consider assigning the accounts used for virtual network peering to the Network Contributor role, or to a custom
role containing the necessary actions found under virtual network peering permissions.
Assign the accounts used to add IP address spaces to the Network Contributor role, or to a custom role containing
the necessary actions found under virtual network permissions.
The IP address space that you want to add to the hub virtual network must not overlap with any of the IP
address spaces of the spoke virtual networks that you intend to peer with the hub virtual network.
param (
    # Address Prefix range (CIDR Notation, e.g., 10.0.0.0/24 or 2607:f000:0000:00::/64)
    [Parameter(Mandatory = $true)]
    [String[]]
    $IPAddressRange
)
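For illustration, the following hedged Python sketch performs the same delete, add address range, and re-create sequence with the azure-mgmt-network SDK, under the assumption of a single subscription and placeholder resource names. The PowerShell script described in this article automates the process more completely, including the spoke side of each peering.

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

SUB_ID, RG, HUB = "<subscription-id>", "<resource-group>", "<hub-vnet>"  # placeholders
NEW_PREFIX = "10.1.0.0/24"  # must not overlap any spoke address space

client = NetworkManagementClient(DefaultAzureCredential(), SUB_ID)

# 1. Capture and delete the existing peerings on the hub virtual network.
peerings = list(client.virtual_network_peerings.list(RG, HUB))
for peering in peerings:
    client.virtual_network_peerings.begin_delete(RG, HUB, peering.name).result()

# 2. Add the new address range to the hub virtual network's address space.
hub = client.virtual_networks.get(RG, HUB)
hub.address_space.address_prefixes.append(NEW_PREFIX)
client.virtual_networks.begin_create_or_update(RG, HUB, hub).result()

# 3. Re-create each hub-side peering with its original settings.
# Note: the matching peering on each spoke virtual network may also need to be
# deleted and re-created for the link to return to the Connected state.
for peering in peerings:
    client.virtual_network_peerings.begin_create_or_update(
        RG, HUB, peering.name, peering).result()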
Pricing
There is a nominal charge for ingress and egress traffic that utilizes a virtual network peering. There is no change to
existing pricing when adding an additional IP address space to an Azure virtual network. For more information, see
the pricing page.
Next steps
Learn more about managing Virtual Network peerings.
Learn more about managing IP Address ranges on Virtual Networks.
Azure security solutions for AWS
12/18/2020 • 11 minutes to read • Edit Online
Microsoft offers several security solutions that can help secure and protect Amazon Web Services (AWS) accounts
and environments.
AWS organizations that use Azure Active Directory (Azure AD) for Microsoft 365 or hybrid cloud identity and
access protection can quickly and easily deploy Azure AD for AWS accounts, often without additional cost.
Other Microsoft security components can integrate with Azure AD to provide additional security for AWS accounts.
Microsoft Cloud App Security (MCAS) backs up Azure AD with session protection and user behavior monitoring.
Azure Sentinel integrates with Azure AD and MCAS to detect and automatically respond to threats against AWS
environments.
These Microsoft security solutions are extensible and have multiple levels of protection. Organizations can
implement one or more of these solutions along with other types of protection for a full security architecture that
protects current and future AWS deployments.
This article provides AWS identity architects, administrators, and security analysts with immediate insights and
detailed guidance for deploying several Microsoft security solutions.
Architecture
This diagram summarizes how AWS installations can benefit from key Microsoft security components:
Azure AD provides centralized single sign-on (SSO) and strong authentication through multi-factor
authentication (MFA) and Conditional Access. Azure AD supports AWS role-based identities and
authorization for access to AWS resources. For more information and detailed instructions, see Azure AD
identity and access management for AWS.
MCAS integrates with Azure AD Conditional Access to enforce additional restrictions, and monitors and
protects sessions after sign-in. MCAS uses user behavior analytics (UBA) and other AWS APIs to monitor
sessions and users and to support information protection.
Azure Sentinel integrates with MCAS and AWS to detect and automatically respond to threats. Azure
Sentinel monitors the AWS environment for misconfiguration, potential malware, and advanced threats to
AWS identities, devices, applications, and data.
MCAS for visibility and control
Several users or roles making administrative changes can result in configuration drift away from intended security
architecture and standards. Security standards can also change over time. Security personnel must constantly and
consistently detect new risks, evaluate mitigation options, and update security architecture to prevent potential
breaches. Security management across multiple public cloud and private infrastructure environments can become
burdensome.
Microsoft Cloud App Security is a Cloud Access Security Broker (CASB) platform with capabilities for Cloud Security
Posture Management (CSPM). MCAS can connect to multiple cloud services and applications to collect security
logs, monitor user behavior, and impose restrictions that the platforms themselves may not offer.
MCAS provides several capabilities that can integrate with AWS for immediate benefits:
The MCAS app connector uses several AWS APIs, including UBA, to search for configuration issues and threats
on the AWS platform.
AWS Access Controls can enforce sign-in restrictions based on application, device, IP address, location,
registered ISP, and specific user attributes.
Session Controls for AWS block potential malware uploads or downloads based on Microsoft Threat Intelligence
or real-time content inspection.
Session controls can also use real-time content inspection and sensitive data detection to impose data loss
prevention (DLP) rules that prevent cut, copy, paste, or print operations.
MCAS is available standalone, or as part of Microsoft Enterprise Mobility + Security E5, which includes Azure AD
Premium P2. For more pricing and licensing information, see Enterprise Mobility + Security pricing options.
Azure Sentinel for advanced threat detection
Threats can come from a wide range of devices, applications, locations, and user types. Data loss prevention
requires inspecting content during upload or download, because post-mortem review may be too late. AWS
doesn't have native capabilities for device and application management, risk-based conditional access, session-
based controls, or inline UBA.
Azure Sentinel is a Security Information and Event Management (SIEM) and Security Orchestration, Automation,
and Response (SOAR) solution that centralizes and coordinates threat detection and response automation for
modern security operations. Azure Sentinel can monitor AWS accounts to compare events across multiple firewalls,
network devices, and servers. Azure Sentinel combines monitoring data with threat intelligence, analytics rules, and
machine learning to discover and respond to advanced attack techniques.
Connect both AWS and MCAS into Azure Sentinel to get MCAS alerts and run additional threat checks using
multiple Threat Intelligence feeds. Azure Sentinel can initiate a coordinated response outside of MCAS, integrate
with IT Service Management (ITSM) solutions, and retain data long term for compliance purposes.
Security recommendations
The following principles and guidelines are important for any cloud security solution:
Ensure that the organization can monitor, detect, and automatically protect user and programmatic access
into cloud environments.
Continually review current accounts to ensure identity and permission governance and control.
Follow least privilege and zero trust principles. Make sure that each user can access only the specific
resources they require, from trusted devices and known locations. Reduce the permissions of every
administrator and developer to provide only the rights they need for the role they're performing. Review
regularly.
Continuously monitor platform configuration changes, especially if they provide opportunities for privilege
escalation or attack persistence.
Prevent unauthorized data exfiltration by actively inspecting and controlling content.
Take advantage of solutions you might already own like Azure AD Premium P2 that can increase security
without additional expense.
Basic AWS account security
To ensure basic security hygiene for AWS accounts and resources:
Review the AWS security guidance at Best practices for securing AWS accounts and resources.
Reduce the risk of uploading and downloading malware and other malicious content by actively inspecting
all data transfers through the AWS Management Console. Content that uploads or downloads directly to
resources within the AWS platform, such as web servers or databases, might need additional protection.
Consider protecting access to other resources, including:
Resources created within the AWS account.
Specific workload platforms, like Windows Server, Linux Server, or containers.
Devices that administrators and developers use to access the AWS Management Console.
Deploy MCAS
Once you deploy the central management and strong authentication that modern identity and access management
require, you can implement MCAS to:
Collect security data and carry out threat detections for AWS accounts.
Implement advanced controls to mitigate risk and prevent data loss.
To deploy MCAS, you:
1. Add an MCAS app connector for AWS.
2. Configure MCAS monitoring policies for AWS activities.
3. Configure Azure AD session policies for AWS activities.
4. Test MCAS policies for AWS.
Add an AWS app connector
1. In the MCAS portal, expand Investigate and then select Connected apps.
2. On the App Connectors page, select the + and then select Amazon Web Services from the list.
3. Use a unique name for the connector that includes an identifier for the company and specific AWS account,
for example Contoso-AWS-Account1.
4. Follow the instructions at Connect AWS to Microsoft Cloud App Security to create an appropriate AWS IAM
user.
a. Define a policy for restricted permissions.
b. Create a service account to use those permissions on behalf of the MCAS service.
c. Provide the credentials to the app connector.
The initial connection may take some time, depending on the AWS account log sizes. Upon completion, you see a
successful connection confirmation:
As MCAS detects alerts, it displays them on the Alerts page in the MCAS portal:
Configure Azure AD session policies for AWS activities
Session policies are a powerful combination of Azure AD Conditional Access policies and MCAS reverse proxy
capability that provide real-time suspicious behavior monitoring and control.
1. In Azure AD, create a new Conditional Access policy with the following settings:
Name: Enter AWS Console – Session Controls
Users and Groups: Select the two role groups you created earlier:
AWS-Account1-Administrators
AWS-Account1-Developers
Cloud apps or actions: Select the enterprise application you created earlier, Contoso-AWS-Account1
Session: Select Use Conditional Access App Control
2. Set Enable policy to On.
3. Select Create.
After you create the Azure AD Conditional Access policy, set up an MCAS Session Policy to control user behavior
during AWS sessions.
1. In the MCAS portal, expand Control and then select Policies.
2. On the Policies page, select Create policy and then select Session policy from the list.
3. On the Create session policy page, under Policy template, select Block upload of potential malware (based on
Microsoft Threat Intelligence).
4. In the ACTIVITIES section, modify the activity filter to include App equal to Amazon Web Services, and remove
the default device selection.
Azure Sentinel provides built-in analytics rule templates that correlate AWS data with other data sources, including:
Analytic template name | Data sources
Known IRIDIUM IP | DNS, Azure Monitor, Cisco ASA, Palo Alto Networks, Azure AD, Azure Activity, AWS
Full Admin policy created and then attached to Roles, Users, or Groups | AWS
Failed AzureAD logons but success logon to AWS Console | Azure AD, AWS
Failed AWS Console logons but success logon to AzureAD | Azure AD, AWS
New UserAgent observed in last 24 hours | Microsoft 365, Azure Monitor, AWS
Threat Intelligence map IP entity to AWS CloudTrail | Threat Intelligence Platforms, AWS
If the data stops ingesting and the graph drops, check the credentials used to connect to the AWS account, and
check that AWS CloudTrail can still collect the events.
See also
For in-depth coverage and comparison of Azure and AWS features, see the Azure for AWS professionals content
set.
For security guidance from AWS, see Best practices for securing AWS accounts and resources.
For the latest Microsoft security information, see www.microsoft.com/security.
For full details of how to implement and manage Azure AD, see Securing Azure environments with Azure Active
Directory.
Azure AD identity and access management for AWS.
Connect AWS to Microsoft Cloud App Security.
How Cloud App Security helps protect your Amazon Web Services (AWS) environment.
Connect Azure Sentinel to AWS CloudTrail.
Confidential computing on a healthcare platform
12/18/2020 • 6 minutes to read • Edit Online
When organizations collaborate, they share information. But most parties don't want to give other parties access to
all parts of the data. Mechanisms exist for safeguarding data at rest and in transit. However, encrypting data in use
poses different challenges. This article presents a solution that Azure confidential computing (ACC) offers for
encrypting in-use data.
By using confidential computing and containers, the solution provides a way for a provider-hosted application to
securely collaborate with a hospital and a third-party diagnostic provider. Azure Kubernetes Service (AKS) hosts
confidential computing nodes. Azure Attestation establishes trust with the diagnostic provider. By using these Azure
components, the architecture isolates the sensitive data of the hospital patients while the specific shared data is
being processed in the cloud. The hospital data is then inaccessible to the diagnostic provider. Through this
architecture, the provider-hosted application can also take advantage of advanced analytics. The diagnostic provider
makes these analytics available as confidential computing services of machine learning (ML) applications.
Architecture
Architecture diagram: data flows among three parties in a healthcare setting (a hospital, a medical platform, and a
diagnostic provider), each containing components such as a website, a client application, Azure Attestation, a web
API, data storage, and a runtime. The medical platform and diagnostic provider also contain confidential nodes and
AKS clusters. Arrows show the flow of data, and numbered callouts correspond to the steps that this article
describes after the diagram.
The diagram outlines the architecture. Throughout the system:
Network communication is TLS encrypted in transit.
Azure Monitor tracks component performance, and Azure Container Registry (ACR) manages the solution's
containers.
The solution involves the following steps:
1. A clerk for Lamna Hospital opens the hospital's web portal, an Azure Blob Storage static website, to enter
patient data.
2. The clerk enters data into the hospital's web portal, which Contoso Medical Platform Ltd. powers with a
Python Flask-based web API. A confidential node running the SCONE confidential computing software protects the
patient data. SCONE works within an AKS cluster that has Software Guard Extensions (SGX) enabled, which helps
run the container in an enclave.
3. The Python code is wrapped in SCONE SGX software. The solution deploys that code on an AKS confidential
node as a confidential container. The code stores the data in memory within a Redis cache (3a), using Azure
Attestation to establish trust (3b). (A sketch of this caching step follows the list of steps.)
4. If the code doesn't find the patient data, it prompts the clerk to enter the sensitive information on a form. The
code then stores that information in Redis.
5. The Contoso application sends the protected patient data from its AKS cluster to an enclave in the Open
Neural Network Exchange (ONNX) runtime server. Fabrikam Diagnostic Provider hosts this confidential
inferencing server in a confidential node of the Fabrikam AKS cluster.
6. The Fabrikam machine learning–based application obtains the diagnostic results from the confidential
inferencing ONNX runtime server. The app sends these results back to the confidential node in the Contoso
AKS cluster.
7. The Contoso application sends the diagnostic results from the confidential node back to Lamna's client
application.
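The following is a purely illustrative sketch of the caching behavior in steps 3 and 4, using the redis-py client from the Python Flask web API. The host, key format, and expiry are placeholders; in the actual solution, the Redis instance and this code run inside SCONE-protected confidential containers.

import json
from typing import Optional

import redis

cache = redis.Redis(host="localhost", port=6379, db=0)  # placeholder host

def get_or_store_patient(patient_id: str,
                         entered_data: Optional[dict] = None) -> Optional[dict]:
    key = f"patient:{patient_id}"

    # Step 3: return the record if it is already held in the in-memory cache.
    cached = cache.get(key)
    if cached:
        return json.loads(cached)

    # Step 4: nothing cached and nothing entered yet, so the portal prompts the clerk.
    if entered_data is None:
        return None

    # Store the newly entered sensitive information for later requests.
    cache.set(key, json.dumps(entered_data), ex=3600)  # keep for one hour
    return entered_data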
Components
Static website hosting in Blob Storage serves static content like HTML, CSS, JavaScript, and image files
directly from a storage container.
Azure Attestation is a unified solution that remotely verifies the trustworthiness of a platform. Azure
Attestation also remotely verifies the integrity of the binaries that run in the platform. Use Azure Attestation
to establish trust with the confidential application.
AKS Cluster simplifies the process of deploying a Kubernetes cluster.
Confidential computing nodes are hosted on a specific virtual machine series that can run sensitive
workloads on AKS within a hardware-based trusted execution environment (TEE) by allowing user-level code
to allocate private regions of memory, known as enclaves. Confidential computing nodes can support
confidential containers or enclave-aware containers.
SCONE platform is an Azure Partner independent software vendor (ISV) solution from Scontain.
Redis is an open-source, in-memory data structure store.
Secure Container Environment (SCONE) supports the execution of confidential applications in containers
that run inside a Kubernetes cluster.
Confidential Inferencing ONNX Runtime Server Enclave (ONNX RT - Enclave) is a host that restricts the ML
hosting party from accessing both the inferencing request and its corresponding response.
Azure Monitor collects and analyzes app telemetry, such as performance metrics and activity logs.
ACR is a service that creates a managed registry. ACR builds, stores, and manages container images and can
store containerized machine learning models.
Alternatives
You can use Fortanix instead of SCONE to deploy confidential containers to use with your containerized
application. Fortanix provides the flexibility you need to run and manage the broadest set of applications:
existing applications, new enclave-native applications, and pre-packaged applications.
Graphene is a lightweight, open-source guest OS. Graphene can run a single Linux application in an isolated
environment with benefits comparable to running a complete OS. It has good tooling support for converting
existing Docker container applications to Graphene Shielded Containers (GSC).
Considerations
Azure confidential computing virtual machines (VMs) are available in 2nd-generation D family sizes for general
purpose needs. These sizes are known collectively as the DCsv2 series. This scenario uses Intel SGX-
enabled DCsv2-series virtual machines with Gen2 operating system (OS) images. But you can only deploy certain
sizes in certain regions. For more information, see Quickstart: Deploy an Azure Confidential Computing VM in the
Marketplace and Products available by region.
Pricing
To explore the cost of running this scenario, use the Azure pricing calculator, which preconfigures all Azure services.
A sample cost profile is available for the Contoso Medical SaaS Platform, as pictured in the diagram. It includes the
following components:
System node pool and SGX node pool: no disks, all ephemeral
AKS Load Balancer
Azure Virtual Network: nominal
Azure Container Registry
Storage account for single-page application (SPA)
The profile doesn't include the following components:
Azure Attestation Service: free
Azure Monitor Logs: usage based
SCONE ISV licensing
Compliance services required for solutions working with sensitive data, including:
Azure Security Center and Azure Defender for Kubernetes
Azure DDoS Protection: standard
Azure Firewall
Azure Application Gateway and Azure Web Application Firewall
Next steps
Learn more about Azure confidential computing.
See the confidential ONNX inference server project on GitHub.
Related resources
Confidential containers on AKS.
Official ONNX runtime website.
Confidential ONNX inference server (GitHub sample).
MobileCoin use case with anonymized blockchain data.
A sample brain segmentation image for use with the delineation function that invokes the confidential
inferencing server.