Azure Cosmos DB
Technical Deep Dive
Andre Essing
Technology Solutions Professional
Microsoft Deutschland GmbH
Andre advises customers on topics all around the Microsoft Data Platform. Since version 7.0, Andre has been gathering experience with the SQL Server product family. Today Andre concentrates on working with data in the cloud, such as Modern Data Warehouse architectures, Artificial Intelligence, and new scalable database systems like Azure Cosmos DB.
andre.essing@microsoft.com · /aessing · @aessing · andreessing.de · aessing/Andre_Essing
WHAT IS NOSQL
NOSQL, BUILT FOR SIMPLE AND FAST APPLICATION DEVELOPMENT
NoSQL, most often read as “Non-SQL”, “Not Only SQL”, or “non-relational”, is a kind of database where data is modeled differently than in relational systems (a concrete example follows the list below).
• Different kinds available
• Document
• Key/Value
• Columnar
• Graph
• etc.
• Non-Relational
• Schema agnostic
• Built for scale and performance
• Different consistency model
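To make the schema-agnostic point concrete, two documents of different shape can live side by side in the same document store, and no schema change is needed when a new field appears. A minimal illustration; the fields are made up:

{ "id": "1", "type": "mug", "color": "Graphite", "capacity": "16oz" }
{ "id": "2", "type": "laptop", "color": "Gray", "cpu": "Core i7-6600U", "memory": "16GB" }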
DIFFERENT WAYS OF STORING DATA WITH YOUR MODERN APP
• Come as you are
• Data normalization
AZURE COSMOS DB
A globally distributed, massively scalable, multi-model database service
• Models: Document, Column-family, Key-value, Graph
• APIs: SQL, MongoDB, Table API
• Turnkey global distribution
• Elastic scale out of storage & throughput
• Guaranteed low latency at the 99th percentile
• Comprehensive SLAs
• Five well-defined consistency models
Leveraging Azure Cosmos DB to automatically scale
your data across the globe
This module will reference partitioning in the context
of all Azure Cosmos DB modules and APIs.
RESOURCE MODEL
[Diagram: the resource hierarchy; an Account contains Databases, a Database contains Containers, a Container contains Items]
ACCOUNT URI AND CREDENTIALS
URI: ********.azure.com
Key: IGeAvVUp …
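As a minimal sketch of how account, database, container, and item fit together, assuming the JavaScript SDK (@azure/cosmos); the account URI, key, and all ids below are placeholders:

const { CosmosClient } = require("@azure/cosmos");

// The account URI and primary key come from the portal (placeholders here)
const client = new CosmosClient({
  endpoint: "https://<account>.documents.azure.com",
  key: "<primary-key>"
});

async function main() {
  // Account → Database → Container → Item, mirroring the resource model
  const { database } = await client.databases.createIfNotExists({ id: "demo-db" });
  const { container } = await database.containers.createIfNotExists({
    id: "cities",
    partitionKey: { paths: ["/city"] }
  });
  await container.items.create({ id: "1", city: "Berlin", country: "Germany" });
}

main().catch(console.error);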
DATABASE REPRESENTATIONS
[Diagram: the same Account → Database → Container → Item hierarchy, with Users and their Permissions attached at the database level]
CONTAINER REPRESENTATIONS
[Diagram: a Container is surfaced as a Collection, a Graph, or a Table, depending on the API]
CONTAINER-LEVEL RESOURCES
[Diagram: Items live inside a Container alongside Sprocs, Triggers, UDFs, and Conflicts]
DEMO
SYSTEM TOPOLOGY (BEHIND THE SCENES)
Physical hierarchy: Planet Earth → Azure regions → Datacenters → Stamps → Fault domains → Cluster → Machine → Replica → Database engine → Container
Each machine hosts the database engine, language runtime(s), a resource manager, and various agents.
Database engine internals: admission control, transport, query processor, RSM, index manager (Bw-tree++/LLAMA++), log manager, IO manager, resource governor, …
RESOURCE HIERARCHY
CONTAINERS
Logical resources “surfaced” to APIs as tables, collections or graphs, which are made up of one or more physical partitions or servers.
RESOURCE PARTITIONS
• Consistent, highly available, and resource-governed coordination primitives
• Consist of replica sets, with each replica hosting an instance of the database engine
[Diagram: tenants map containers (tables, collections, graphs) onto resource partitions; each resource partition is a replica set with a leader, followers, and a forwarder relaying changes to remote resource partition(s)]
REQUEST UNITS
• Request Units (RUs) are a rate-based currency
• They abstract the physical resources (% memory, % CPU, % IOPS) consumed by requests
• Key to multi-tenancy, SLAs, and COGS efficiency
• Cover both foreground and background activities
REQUEST UNITS
Normalized across various access methods: 1 RU = 1 read of a 1 KB document. Each request, whether a GET, POST, PUT, or query, consumes a fixed number of RUs. Applies to reads, writes, queries, and stored procedure execution.
REQUEST UNITS
• Normalized across various access methods: 1 RU = 1 read of a 1 KB document from a single partition
• Each request consumes a fixed number of RUs; applies to reads, writes, queries, and stored procedure execution
• Provisioned in terms of RU/sec and metered hourly
• Rate limiting is based on the amount of throughput provisioned, which can be increased or decreased instantaneously
• Background processes like TTL expiration and index transformations are scheduled when the replica is quiescent
[Diagram: incoming requests plotted against min and max RU/sec; requests above the provisioned rate are rate limited, below it there is no rate limiting]
(A sketch of metering RU charges follows below.)
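Every response reports the RUs the operation consumed, which is the practical way to size RU/sec. A short sketch assuming the JavaScript SDK and the container created earlier; requestCharge and retryAfterInMs are the property names exposed by @azure/cosmos:

async function showCharges(container) {
  // A point read of a ~1 KB document costs about 1 RU
  const { requestCharge } = await container.item("1", "Berlin").read();
  console.log(`Point read consumed ${requestCharge} RUs`);

  try {
    await container.items.create({ id: "2", city: "Hamburg" });
  } catch (err) {
    if (err.code === 429) {
      // Provisioned RU/sec exhausted: the service rate limits and
      // suggests a back-off before retrying
      console.log(`Rate limited, retry after ${err.retryAfterInMs} ms`);
    }
  }
}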
ELASTIC SCALE OUT OF STORAGE AND THROUGHPUT
SCALES AS YOUR APPS’ NEEDS CHANGE
Independently and elastically scale storage and
throughput across regions – even during unpredictable
traffic bursts – with a database that adapts to your
app’s needs.
• Elastically scale throughput from 10 to
100s of millions of requests/sec across
multiple regions
• Support for requests/sec for different
workloads
• Pay only for the throughput and
storage you need
Leveraging Azure Cosmos DB to automatically scale
your data across the globe
This module will reference partitioning in the context
of all Azure Cosmos DB modules and APIs.
PARTITIONING
PARTITIONS
Cosmos DB Container (e.g. Collection)
Partition Key: City
Logical partitioning abstraction; behind the scenes: physical partition sets
hash(City)
Pseudo-random distribution of data over the range of possible hashed values
PARTITIONS
hash(City): pseudo-random distribution of data over the range of possible hashed values
[Diagram: cities such as Cologne, Hamburg, Munich, Stuttgart, Berlin, Leipzig, Bremen, Frankfurt, and Dresden hash into Partition 1, Partition 2, …, Partition n]
A frugal number of partitions is chosen based on actual storage and throughput needs (yielding scalability with low total cost of ownership).
PARTITIONS
What happens when partitions need to grow?
[Diagram: the same hashed distribution of cities across Partition 1, Partition 2, …, Partition n, with one partition reaching its capacity]
PARTITIONS
Partition ranges can be dynamically sub-divided to seamlessly grow the database as the application grows, while simultaneously maintaining high availability.
Partition management is fully managed by Azure Cosmos DB, so you don't have to write code or manage your partitions.
[Diagram: Partition x splits into Partition x1 and Partition x2, redistributing its hashed values (e.g. Cologne and Hamburg to one partition; Stuttgart, Berlin, Leipzig, and Dresden to the other)]
PARTITIONS
Best Practices: Design Goals for Choosing a Good Partition Key
• Distribute the overall request + storage volume
• Avoid “hot” partition keys
Steps for Success
• Ballpark scale needs (size/throughput)
• Understand the workload
• # of reads/sec vs. writes/sec
• Use the Pareto principle (80/20 rule) to help optimize the bulk of the workload
• For reads: understand the top 3-5 queries (look for common filters)
• For writes: understand transactional needs
General Tips
• Build a POC to strengthen your understanding of the workload and iterate (avoid analysis paralysis)
• Don’t be afraid of having too many partition keys
• Partition keys are logical
• More partition keys = more scalability
• The partition key is the scope for multi-record transactions and routing queries
• Queries can be intelligently routed via the partition key
• Omitting the partition key on a query requires a fan-out (see the sketch after this list)
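A short sketch of the routing point above, again assuming the JavaScript SDK and the container created earlier: supplying the partition key lets the service route a query to a single partition, while omitting it forces a fan-out across all of them.

// Routed: only the partition holding "Berlin" is consulted
const { resources: routed } = await container.items
  .query("SELECT * FROM c WHERE c.city = 'Berlin'", { partitionKey: "Berlin" })
  .fetchAll();

// Fan-out: every physical partition must be queried
const { resources: fannedOut } = await container.items
  .query("SELECT * FROM c WHERE c.country = 'Germany'")
  .fetchAll();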
DEMO
High Availability
• Automatic and manual failover
• Multi-homing API removes the need for app redeployment
Low Latency (anywhere in the world)
• Packets cannot move faster than the speed of light
• Sending a packet across the world under ideal network conditions takes hundreds of milliseconds
• You can cheat the speed of light by using data locality
• CDNs solved this for static content
• Azure Cosmos DB solves this for dynamic content
TURNKEY GLOBAL DISTRIBUTION
• Automatic and transparent replication worldwide
• Each partition hosts a replica set per region
• Customers can test end-to-end application availability by programmatically simulating failovers
• All regions are hidden behind a single global URI with multi-homing capabilities
• Customers can dynamically add / remove additional regions at any time
[Diagram: a container with partition key "airport" (values such as "LAX", "AMS", "MEL") is distributed locally via horizontal partitioning and globally as resource partitions; West US serves writes and reads, West Europe serves reads, each at 30K transactions/sec]
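The single global URI and multi-homing behavior are driven from the client. A sketch assuming the JavaScript SDK; preferredLocations tells the SDK which replicated regions to try first, and the region names are examples:

const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient({
  endpoint: "https://<account>.documents.azure.com", // single global URI
  key: "<primary-key>",
  connectionPolicy: {
    preferredLocations: ["West Europe", "West US"] // nearest region first
  }
});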
REPLICATING DATA GLOBALLY
DEMO
BREWER’S CAP THEOREM
It is impossible for a distributed data store to simultaneously provide more than 2 out of the following 3 guarantees:
• Consistency
• Availability
• Partition Tolerance
CONSISTENCY
[Diagram: replicas in West US, East US, and North Europe all hold Value = 5; an update 5 => 6 is applied and has reached only some replicas when a network partition is introduced]
What happens when a network partition is introduced?
Reader: What is the value?
• Should it see 5? (prioritize availability)
• Or does the system go offline until the network is restored? (prioritize consistency)
PACELC THEOREM
In the case of network
partitioning (P) in a
distributed computer system,
one has to choose between
availability (A) and
consistency (C) (as per the
CAP theorem), but else (E),
even when the system is
running normally in the
absence of partitions, one has
to choose between latency (L)
and consistency (C).
CONSISTENCY
[Diagram: three replicas holding Value = 5 while an update 5 => 6 propagates between regions]
Latency: a packet of information can travel only as fast as the speed of light, and replication between distant geographic regions can take hundreds of milliseconds.
CONSISTENCY
[Diagram: three replicas holding Value = 5 while an update 5 => 6 propagates; Reader A and Reader B query different replicas]
Reader A: What is the value?
Reader B: What is the value?
• Should Reader B see 5 immediately? (prioritize latency)
• Does it see the same result as Reader A? (quorum impacts throughput)
• Does it sit and wait for 5 => 6 to propagate? (prioritize consistency)
FIVE WELL-DEFINED CONSISTENCY MODELS
Strong · Bounded-staleness · Session · Consistent prefix · Eventual
CHOOSE THE BEST CONSISTENCY MODEL FOR YOUR APP
Five well-defined consistency models, overridable on a per-request basis, provide control over performance-consistency tradeoffs, backed by comprehensive SLAs. An intuitive programming model offering low latency and high availability for your planet-scale app. (A client-side configuration sketch follows the list below.)
CLEAR TRADEOFFS
• Latency
• Availability
• Throughput
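A sketch of picking a default consistency level on the client, assuming the JavaScript SDK; the option and level names follow @azure/cosmos, and per-request overrides can only relax, never strengthen, the account's default:

const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient({
  endpoint: "https://<account>.documents.azure.com",
  key: "<primary-key>",
  // One of "Strong", "BoundedStaleness", "Session", "ConsistentPrefix", "Eventual"
  consistencyLevel: "Session"
});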
DEMYSTIFYING CONSISTENCY MODELS
Strong consistency
Guarantees linearizability. Once an operation is complete, it will be visible to all readers in a strongly consistent manner across replicas.
Eventual consistency
Replicas are eventually consistent with any operations. There is a potential for out-of-order reads. Lowest cost and highest read performance of all consistency levels.
DEMYSTIFYING CONSISTENCY MODELS
Bounded-staleness
Consistent prefix. Reads lag behind writes by at most k prefixes or a time interval t. Similar properties to strong consistency, except within the staleness window.
Session
Consistent prefix. Within a session, reads and writes are monotonic. This is referred to as “read-your-writes” and “write-follows-reads”. Predictable consistency for a session; high read throughput and low latency outside of a session.
Consistent prefix
Reads will never see out-of-order writes.
DEMO
HANDLE ANY DATA WITH NO SCHEMA OR INDEXING REQUIRED
Azure Cosmos DB’s schema-less service automatically indexes all your data, regardless of the data model, to deliver blazing-fast queries.
Item            | Color    | Microwave safe | Liquid capacity | CPU                                 | Memory | Storage
Geek mug        | Graphite | Yes            | 16oz            | ???                                 | ???    | ???
Coffee Bean mug | Tan      | No             | 12oz            | ???                                 | ???    | ???
Surface Book    | Gray     | ???            | ???             | 3.4 GHz Intel Skylake Core i7-6600U | 16GB   | 1 TB SSD

• Automatic index management
• Synchronous auto-indexing
• No schemas or secondary indices needed
• Works across every data model
INDEXING JSON DOCUMENTS
{
"locations": [
{
"country": "Germany",
"city": "Berlin"
},
{
"country": "France",
"city": "Paris"
}
],
"headquarter": "Belgium",
"exports": [
{ "city": "Moscow" },
{ "city": "Athens" }
]
}
[Diagram: index tree for the document above: locations/0/country → Germany, locations/0/city → Berlin, locations/1/country → France, locations/1/city → Paris, headquarter → Belgium, exports/0/city → Moscow, exports/1/city → Athens]
INDEXING JSON DOCUMENTS
{
"locations": [
{
"country": "Germany",
"city": "Bonn",
"revenue": 200
}
],
"headquarter": "Italy",
"exports": [
{
"city": "Berlin",
"dealers": [
{ "name": "Hans" }
]
},
{ "city": "Athens" }
]
}
[Diagram: index tree for this document: locations/0/country → Germany, locations/0/city → Bonn, locations/0/revenue → 200, headquarter → Italy, exports/0/city → Berlin, exports/0/dealers/0/name → Hans, exports/1/city → Athens]
INDEXING JSON DOCUMENTS
[Diagram: the index trees of the two documents side by side, sharing the same top-level paths (locations, headquarter, exports)]
INVERTED INDEX
[Diagram: the two trees merged into a single inverted index; shared paths such as locations/0/country and exports/…/city now point to the values from both documents (Germany, Berlin/Bonn, France/Paris, Belgium/Italy, Moscow/Athens, revenue 200, dealer Hans)]
INDEX POLICIES
CUSTOM INDEXING POLICIES
Though all Azure Cosmos DB data is indexed by default, you
can specify a custom indexing policy for your collections.
Custom indexing policies allow you to design and customize
the shape of your index while maintaining schema flexibility.
• Define trade-offs between storage, write and query
performance, and query consistency
• Include or exclude documents and paths to and from the
index
• Configure various index types
{
"automatic": true,
"indexingMode": "Consistent",
"includedPaths": [{
"path": "/*",
"indexes": [{
"kind": "Hash",
"dataType": "String",
"precision": -1
}, {
"kind": "Range",
"dataType": "Number",
"precision": -1
}, {
"kind": "Spatial",
"dataType": "Point"
}]
}],
"excludedPaths": [{
"path": "/nonIndexedContent/*"
}]
}
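A policy like the one above is supplied when the collection is created. A sketch assuming the JavaScript SDK and the database handle from the earlier example; note that the current SDK's property shapes differ slightly from the older Hash/Range policy shown, and the ids and paths here are placeholders:

const { container } = await database.containers.createIfNotExists({
  id: "catalog",
  partitionKey: { paths: ["/city"] },
  indexingPolicy: {
    automatic: true,
    indexingMode: "consistent",
    includedPaths: [{ path: "/*" }],
    excludedPaths: [{ path: "/nonIndexedContent/*" }]
  }
});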
DEMO
SHORT-LIFETIME DATA
Some data produced by applications is only useful for a finite period of time:
• Machine-generated event data
• Application log data
• User session information
It is important that the database system systematically purges this data at pre-configured intervals.
TIME-TO-LIVE (TTL)
AUTOMATICALLY PURGE DATA
Azure Cosmos DB allows you to set the length of time for which documents live in the database before being automatically purged. A document's "time-to-live" (TTL) is measured in seconds from the last modification and can be set at the collection level, with an override on a per-document basis.
Expiry is computed relative to the _ts field, which exists on every document.
• The _ts field is a Unix-style epoch timestamp representing the date and time of the last modification; it is updated every time a document is modified.
Once TTL is set, Azure Cosmos DB will automatically remove documents once the configured period has elapsed since their last modification.
EXPIRING RECORDS USING TIME-TO-LIVE
TTL BEHAVIOR
The TTL feature is controlled by TTL properties at two levels: the collection level and the document level.
• DefaultTTL for the collection
• If missing (or set to null), documents are not deleted automatically
• If present and the value is "-1" (infinite), documents don’t expire by default
• If present and the value is some number "n", documents expire "n" seconds after their last modification
• TTL for the documents
• Applicable only if DefaultTTL is present on the parent collection
• Overrides the DefaultTTL value of the parent collection
The values are set in seconds and are treated as a delta from the _ts at which the document was last modified.
[Diagram: a document's own TTL takes precedence over the collection's Default TTL; a sketch follows below]
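A sketch of both levels, assuming the JavaScript SDK: the collection-level default is the defaultTtl property of the container, and the per-document override is a plain ttl field on the item:

// Collection level: documents expire one day after last modification
const { container } = await database.containers.createIfNotExists({
  id: "sessions",
  partitionKey: { paths: ["/userId"] },
  defaultTtl: 86400 // seconds
});

// Per-document override: this item expires after one hour instead
await container.items.create({ id: "s1", userId: "u1", ttl: 3600 });

// Per-document override: this item never expires despite the default
await container.items.create({ id: "s2", userId: "u2", ttl: -1 });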
MODERN REACTIVE APPLICATIONS
IoT, gaming, retail, and operational logging applications need to track and respond to tremendous amounts of data being ingested, modified, or removed from a globally-scaled database.
COMMON SCENARIOS
• Trigger notifications for new items
• Perform real-time analytics on streamed data
• Synchronize data with a cache, search engine, or data warehouse
CHANGE FEED
A persistent log of records within an Azure Cosmos DB container, presented in the order in which they were modified.
CHANGE FEED SCENARIOS
CHANGE FEED WITH PARTITIONS
Consumer parallelization: the change feed listens for changes in an Azure Cosmos DB collection and outputs the sorted list of documents that were changed, in the order in which they were modified.
The changes are persisted and can be processed asynchronously and incrementally. The change feed is available for each partition key range within the document collection, so its output can be distributed across one or more consumers, such as an event/stream processing app tier, for parallel processing.
[Diagram: partition key ranges fan out to Consumer 1, Consumer 2, and Consumer 3]
CHANGE FEED PROCESSOR LIBRARY
https://www.nuget.org/packages/Microsoft.Azure.DocumentDB.ChangeFeedProcessor/
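The trigger-notification scenario is also commonly wired up with an Azure Functions Cosmos DB trigger instead of hosting the processor library yourself. A minimal sketch of the binding (function.json) plus a JavaScript handler; the database, collection, and connection-setting names are placeholders:

{
  "bindings": [{
    "type": "cosmosDBTrigger",
    "name": "documents",
    "direction": "in",
    "connectionStringSetting": "CosmosConnection",
    "databaseName": "demo-db",
    "collectionName": "cities",
    "leaseCollectionName": "leases",
    "createLeaseCollectionIfNotExists": true
  }]
}

// index.js: invoked with batches of changed documents, in modification order
module.exports = async function (context, documents) {
  for (const doc of documents) {
    context.log(`Changed document: ${doc.id}`);
  }
};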
PROGRAMMING
Run native JavaScript server-side programming logic to perform atomic multi-record transactions.
This module will reference programming in the context of the SQL API.
CONTROL CONCURRENCY USING ETAGS
OPTIMISTIC CONCURRENCY
• The SQL API supports optimistic concurrency control (OCC) through HTTP entity tags, or ETags
• Every SQL API resource has an ETag system property, and the ETag value is generated on the server every time a document is updated
• If the ETag value stays constant, no other process has updated the document; if the ETag value unexpectedly mutates, another concurrent process has updated the document
• ETags can be used with the If-Match HTTP request header to allow the server to decide whether a resource should be updated; on a mismatch the server rejects the write with HTTP 412 (Precondition Failed)
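A sketch of the read-modify-write pattern, assuming the JavaScript SDK and the container from earlier; accessCondition maps to the If-Match header, and _etag is the system property mentioned above:

const { resource: doc } = await container.item("1", "Berlin").read();
doc.population = 3700000;

try {
  await container.item("1", "Berlin").replace(doc, {
    accessCondition: { type: "IfMatch", condition: doc._etag }
  });
} catch (err) {
  if (err.code === 412) {
    // Precondition failed: another writer updated the document first.
    // Re-read to get the fresh ETag, re-apply the change, and retry.
  }
}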
STORED PROCEDURES
BENEFITS
• Familiar programming language
• Atomic transactions
• Built-in optimizations
• Business logic encapsulation
MULTI-DOCUMENT TRANSACTIONS
DATABASE TRANSACTIONS
In a typical database, a transaction can be defined as a sequence of operations performed as a single logical unit of work. Each transaction provides ACID guarantees.
In Azure Cosmos DB, JavaScript is hosted in the same memory space as the database. Hence, requests made within stored procedures and triggers execute in the same scope as a database session: a single stored procedure can create new documents, query the collection, and update or delete existing documents, all within one transaction.
Stored procedures utilize snapshot isolation to guarantee that all reads within the transaction will see a consistent snapshot of the data.
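A minimal server-side sketch of such a transaction, using the documented stored-procedure API (getContext, createDocument); the document shapes are illustrative. Throwing anywhere aborts the stored procedure and rolls back every write:

function createPair(cityA, cityB) {
  var collection = getContext().getCollection();
  var link = collection.getSelfLink();

  // Both creates commit together or not at all
  var accepted = collection.createDocument(link, cityA, function (err) {
    if (err) throw err; // abort => whole transaction rolled back
    var acceptedInner = collection.createDocument(link, cityB, function (err2) {
      if (err2) throw err2;
      getContext().getResponse().setBody("both documents created");
    });
    if (!acceptedInner) throw new Error("request not accepted");
  });
  if (!accepted) throw new Error("request not accepted");
}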
TRANSACTION CONTINUATION MODEL
CONTINUING LONG-RUNNING TRANSACTIONS
• JavaScript functions can implement a continuation-based model to batch/resume execution
• The continuation value can be any value of your own choosing. This value can then be used by your applications to resume a transaction from a new “starting point”
[Diagram: bulk-create documents by trying to create each document in turn; when execution cannot continue, return a “pointer” to resume later; the client observes the return value and re-invokes until done]
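A sketch of the continuation pattern in the server-side API; the continuation value here is simply the array index to resume from, chosen for illustration:

function bulkCreate(docs, startIndex) {
  var collection = getContext().getCollection();
  var link = collection.getSelfLink();
  var i = startIndex || 0;

  function createNext() {
    if (i >= docs.length) {
      getContext().getResponse().setBody({ resumeAt: -1 }); // done
      return;
    }
    var accepted = collection.createDocument(link, docs[i], function (err) {
      if (err) throw err;
      i++;
      createNext();
    });
    if (!accepted) {
      // Execution budget exhausted: return a "pointer" to resume later
      getContext().getResponse().setBody({ resumeAt: i });
    }
  }
  createNext();
}

The client re-executes the stored procedure with the returned resumeAt until it comes back as -1.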