SQL Server Internals:
In-Memory OLTP
Inside the SQL Server 2014 Hekaton Engine
By Kalen Delaney
ISBN: 978-1-910035-02-3
The right of Kalen Delaney to be identified as the author of this book has been
asserted by Kalen Delaney in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored or introduced into a retrieval
system, or transmitted, in any form, or by any means (electronic, mechanical, photocopying, recording or
otherwise) without the prior written consent of the publisher. Any person who does any unauthorized act
in relation to this publication may be liable to criminal prosecution and civil claims for damages. This book
is sold subject to the condition that it shall not, by way of trade or otherwise, be lent, re-sold, hired out, or
otherwise circulated without the publisher's prior consent in any form other than that in which it is published and
without a similar condition including this condition being imposed on the subsequent purchaser.
In-memory OLTP is a game changer for relational databases, and OLTP systems in
particular. Processors are not getting faster, but the number of cores and the amount
of memory are increasing drastically. Machines with terabytes of memory are available
for under $100K. A new technology is needed to take advantage of the changing
hardware landscape, and Microsoft's In-Memory OLTP, codenamed Project Hekaton,
is that new technology.
Project Hekaton gives us an entirely new way to store and access our relational data,
using lock- and latch-free data structures that allow completely non-blocking data
processing operations. Everything you knew about how your SQL Server data is actually
stored and accessed is different in Hekaton. Everything you understood about how
multiple concurrent processes are handled needs to be reconsidered. All of your planning
for when code is recompiled and reused can be re-evaluated if you choose to use natively
compiled stored procedures to access your Hekaton data.
One of the best things about using this new technology is that it is not all or nothing.
Even if much of your processing is not OLTP, even if your total system memory is
nowhere near the terabyte range, you can choose one or more critical tables to migrate
to the new in-memory structures. You can choose one frequently run stored procedure
to recreate as a natively compiled procedure. And you can see measurable performance
improvements.
A lot of people are already writing and speaking about in-memory OLTP, on blog posts
and in conference sessions. People are using it and sharing what they've learned. For
those of you who want to know the complete picture about how in-memory OLTP works
and why exactly it's a game changer, and also peek at the deep details of how the data is
stored and managed, this book is for you, all in one place. Kalen Delaney has been writing
about SQL Server internals, explaining how things work inside the engine, for over
20 years. She started working with the Hekaton team at Microsoft over two years ago,
getting the inside scoop from the people who implemented this new technology. And in
this book, she's sharing it all with you.
She is one of the main editors for SQL Server Central's SQL Server Stairways Series,
http://www.sqlservercentral.com/stairway. Kalen blogs at www.sqlblog.com and
her personal website and schedule can be found at www.SQLServerInternals.com.
Acknowledgements
First of all, I would like to thank Kevin Liu of Microsoft, who brought me on board
with the Hekaton project at the end of 2012, with the goal of providing in-depth white
papers describing this exciting new technology. Under Kevin's guidance, I wrote two
white papers, which were published near the release date of each of the CTPs for
SQL Server 2014. As the paper got longer with each release, it became clear that a
white paper covering the final released product would be as long as a book. So, with
Kevin's encouragement, it
became the book that you are now reading.
I would also like to thank my devoted reviewers and question answerers at Microsoft,
without whom this work would have taken much longer: Sunil Agarwal, Jos de Bruijn,
and Mike Zwilling were always quick to respond and were very thorough in answering my
sometimes seemingly endless questions.
Others on the SQL Server team who also generously provided answers and/or technical
edits include Kevin Farlee, Craig Freedman, Mike Weiner, Cristian Diaconu, Pooja
Harjani, Paul Larson, and David Schwartz. Thank you for all your assistance and
support. And THANK YOU to the entire SQL Server Team at Microsoft for giving us this
incredible new technology!
INTRODUCTION
The original design of the SQL Server engine assumed that main memory was very
expensive, and so data needed to reside on disk except when it was actually needed for
processing. However, over the past thirty years, the sustained fulfillment of Moore's Law,
which predicts that transistor counts, and with them computing power, will double roughly
every two years, has rendered this assumption largely invalid.
Moore's law has had a dramatic impact on the availability and affordability of both large
amounts of memory and multiple-core processing power. Today one can buy a server
with 32 cores and 1 TB of memory for under $50K. Looking further ahead, it's entirely
possible that in a few years we'll be able to build distributed DRAM-based systems with
capacities of 1–10 petabytes at a cost of less than $5/GB. It is also only a question of time
before non-volatile RAM becomes viable as main-memory storage.
At the same time, the near-ubiquity of 64-bit architectures removes the previous 4 GB
limit on "addressable" memory and means that SQL Server has, in theory, near-limitless
amounts of memory at its disposal. This has helped to significantly drive down latency
time for read operations, simply because we can fit so much more data in memory. For
example, many, if not most, of the OLTP databases in production can fit entirely in 1 TB.
Even for the largest financial, online retail and airline reservation systems, with databases
between 500 GB and 5 TB in size, the performance-sensitive working dataset, i.e. the
"hot" data pages, is significantly smaller and could reside entirely in memory.
However, the fact remains that the traditional SQL Server engine is optimized for disk-
based storage, for reading specific 8 KB data pages into memory for processing, and
writing specific 8 KB data pages back out to disk after data modification, having first
"hardened" the changes to disk in the transaction log. Reading and writing 8 KB data
pages from and to disk can generate a lot of random I/O and incurs a higher latency cost.
In fact, given the amount of data we can fit in memory, and the high number of cores
available to process it, the end result has been that most current SQL Server systems are
I/O bound. In other words, the I/O subsystem struggles to "keep up," and many organi-
zations sink huge sums of money into the hardware that they hope will improve write
latency. Even when the data is in the buffer cache, SQL Server is architected to assume
that it is not, which leads to inefficient CPU usage, with latching and spinlocks. Assuming
all, or most, of the data will need to be read from disk also leads to unrealistic cost estima-
tions for the possible query plans and a potential for not being able to determine which
plans will really perform best.
As a result of these trends, and the limitations of traditional disk-based storage structures,
the SQL Server team at Microsoft began building a database engine optimized for large
main memories and many-core CPUs, driven by the recognition that systems designed for
a particular class of workload can frequently outperform more general purpose systems
by a factor of ten or more. Most specialized systems, including those for Complex Event
Processing (CEP), Data Warehousing and Business Intelligence (DW/BI) and Online
Transaction Processing (OLTP), optimize data structures and algorithms by focusing on
in-memory structures.
The team set about building a specialized database engine specifically for in-memory
workloads, which could be tuned just for those workloads. The original concept was
proposed at the end of 2008, envisioning a relational database engine that was 100
times faster than the existing SQL Server engine. In fact, the codename for this feature,
Hekaton, comes from the Greek word hekaton (ἑκατόν), meaning 100.
Serious planning and design began in 2010, and product development began in 2011. At
that time, the team did not know whether the current SQL Server could support this new
concept, and the original vision was that it might be a separate product. Fortunately, it
soon became evident that, although the framework could support building stand-alone
processors (discussion of the framework is well beyond the scope of this book), it would
be possible to incorporate the "in-memory" processing engine into SQL Server itself.
The team then established four main goals as the foundation for further design
and planning:
1. Optimized for data that was stored completely in-memory but was also durable on
SQL Server restarts.
SQL Server In-Memory OLTP, formerly known and loved as Hekaton, meets all of these
goals, and in this book you will learn how it meets them. The focus will be on the features
that allow high performance for OLTP operations. As well as eliminating read latency,
since the data will always be in memory, fundamental changes to the memory-optimized
versions of tables and indexes, as well as changes to the logging mechanism, mean that
in-memory OLTP also offers greatly reduced latency when writing to disk.
The first four chapters of the book offer a basic overview of how the technology works
(Chapter 1), how to create in-memory databases and tables (Chapter 2), the basics of row
versioning and the new multi-version concurrency control model (Chapter 3), and how
memory-optimized tables and their indexes store data (Chapter 4).
Chapters in the latter half of the book focus on how the new in-memory engine delivers
the required performance boost, while still ensuring transactional consistency (ACID
compliance). In order to deliver on performance, the SQL Server team realized they
had to address some significant performance bottlenecks. Two major bottlenecks were
the traditional locking and latching mechanisms: if the new in-memory OLTP engine
retained these mechanisms, with the waiting and possible blocking that they could cause,
it could negate much of the benefit inherent in the vastly increased speed of in-memory
processing. Instead, SQL Server In-Memory OLTP delivers a completely lock- and latch-
free system, and true optimistic multi-version concurrency control (Chapter 5).
Other potential bottlenecks were the existing CHECKPOINT and transaction logging
processes. The need to write to durable storage still exists for in-memory tables, but in
SQL Server In-Memory OLTP these processes are adapted to be much more efficient, in
order to prevent them becoming performance limiting, especially given the potential to
support vastly increased workloads (Chapter 6).
The final bottleneck derives from the fact that the SQL Server query processor is essen-
tially an interpreter; it re-processes statements continually, at runtime. It is not a true
compiler. Of course, this is not a major performance concern, when the cost of physi-
cally reading data pages into memory from disk dwarfs the cost of query interpretation.
However, once there is no cost of reading pages, the difference in efficiency between
interpreting queries and running compiled queries can be enormous. Consequently,
the new SQL Server In-Memory OLTP engine component provides the ability to create
natively compiled procedures, i.e. machine code, for our most commonly executed data
processing operations (Chapter 7).
Finally, we turn our attention to tools for managing SQL Server In-Memory OLTP
structures, for monitoring and tuning performance, and finally, considerations for
migrating existing OLTP workloads over to in-memory (Chapter 8).
SQL Server In-Memory OLTP is a new technology and this is not a book specifically on
performance tuning and best practices. However, as you learn about how the Hekaton
engine works internally to process your queries, certain best practices and opportunities
for performance tuning will become obvious.
This book does not assume that you're a SQL Server expert, but I do expect that you have
basic technical competency and familiarity with the standard SQL Server engine, and
relative fluency with basic SQL statements.
You should have access to a SQL Server 2014 installation, even if it is the Evaluation
edition, available free from Microsoft:
http://technet.microsoft.com/en-gb/evalcenter/dn205290.aspx.
All examples have been verified on SQL Server 2014 RTM (12.0.2000.8). All of the
examples use custom-built example databases, as defined in the text.
Chapter 1: What's Special About In-Memory OLTP?
SQL Server 2014's In-Memory OLTP feature provides a suite of technologies for working
with memory-optimized tables, in addition to the disk-based tables which SQL Server
has always provided.
The SQL Server team designed the in-memory OLTP engine to be transparently acces-
sible through familiar interfaces such as T-SQL and SQL Server Management Studio
(SSMS). Therefore, during most data processing operations, users may be unaware that
they are working with memory-optimized tables rather than disk-based ones.
However, SQL Server works with the data very differently if it is stored in memory-
optimized tables. This chapter describes, at a high level, some of the fundamental differ-
ences between data storage structures and data operations, when working with memory-
optimized, rather than standard disk-based tables and indexes.
It will also discuss SQL Server In-Memory OLTP in the context of similar, competing
memory-optimized database solutions, and explain why the former is different.
In earlier versions, SQL Server allowed tables to be "pinned" in the buffer pool (via DBCC
PINTABLE), but these pinned tables were no different than any other disk-based tables. They required the
same amount of locking, latching and logging and they used the same index structures,
which also required locking and logging.
By contrast, as we'll discuss through this and subsequent chapters, the memory-
optimized tables in SQL Server In-Memory OLTP are completely different than SQL
Server disk-based tables. They use different data and index structures, and SQL Server
takes no locks or latches on these structures during reading or writing, so it can allow
concurrent access without blocking. Also, logging changes to memory-optimized tables
is usually much more efficient than logging changes to disk-based tables.
Figure 1-1 gives an overview of the SQL Server engine with the in-memory OLTP
components. On the left side, we have the memory optimized tables and indexes, added
as part of in-memory OLTP and, on the right we see the disk-based tables, which use the
data structures that SQL Server has always used, and which require writing and reading
8 KB data pages, as a unit, to and from disk.
[Figure 1-1 shows the client application connecting into SQL Server.exe: on the left, the in-memory
OLTP components (natively compiled SPs and schema, the in-memory OLTP compiler, and the
memory-optimized tables and indexes); on the right, the traditional components (parser, catalog,
optimizer, interpreted T-SQL query execution, and the buffer pool holding disk-based tables and
indexes); a query interop layer connects the two sides.]
Figure 1-1: The SQL Server engine including the in-memory OLTP components.
In-memory OLTP also supports natively compiled stored procedures, an object type
that is compiled to machine code by a new in-memory OLTP compiler and which has the
potential to offer a further performance boost beyond that available solely from the use of
memory-optimized tables. The standard counterpart is interpreted T-SQL stored proce-
dures, which is what SQL Server has always used. Natively compiled stored procedures
can reference only memory-optimized tables.
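To give a flavor of the syntax (covered fully in Chapter 7), here is a minimal sketch of a natively
compiled procedure; the procedure name is illustrative, and it assumes a memory-optimized table
dbo.T1 like the ones created in Chapter 2:

CREATE PROCEDURE dbo.usp_GetCityByName
    @Name varchar(32)
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC WITH
    (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'English')
    -- natively compiled procedures can reference only memory-optimized tables
    SELECT [Name], [City]
    FROM dbo.T1
    WHERE [Name] = @Name;
END;
GO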
Notice that the client application uses the same TDS Handler (Tabular Data Stream,
the underlying networking protocol that is used to communicate with SQL Server),
regardless of whether it is accessing memory-optimized tables or disk-based tables, or
calling natively compiled stored procedures or interpreted T-SQL.
Memory-optimized tables
This section takes a broad look at three of the key differences between memory-
optimized tables and their disk-based counterparts; subsequent chapters will fill in
the details.
The first and perhaps most fundamental difference when using memory-optimized tables
is that the whole table and its indexes are stored in memory all the time. Therefore,
when accessing in-memory data structures, user processes will always find the required
data in-memory. Concurrent data operations require no locking or latching whatsoever,
thanks to a new, truly optimistic concurrency model, which we'll get to shortly.
As user processes modify in-memory data, SQL Server still needs to perform some disk
I/O for any table that we wish to be durable, in other words where we wish a table to
retain the in-memory data in the event of a server crash or restart. We'll return to this a
little later in this chapter, in the Data durability and recovery section.
For disk-based tables, SQL Server organizes data rows into 8 KB units called data pages,
with space allocated from extents, on disk. The data page is the basic unit of storage on
disk and in memory. When SQL Server reads and writes data from disk, it reads
and writes the relevant data pages. A data page will only contain data from one table or
index. User processes modify rows on various data pages as required, and later, during a
CHECKPOINT process, SQL Server first hardens the log records to disk and then writes all
dirty pages to disk, the latter operation often causing a lot of "random" physical I/O.
For memory-optimized tables, there are no data pages, and no extents; there are just "data
rows," written to memory sequentially, in the order the transactions occurred, with each
row containing an index "pointer" to the next row. All "I/O" is then in-memory scanning
of these structures. It means there is no notion of data rows being written to a particular
location that "belongs" to a specified object. However, this is not to imply that memory-
optimized tables are stored as unorganized sets of data rows, like a disk-based heap. In
fact, every CREATE TABLE statement for a memory-optimized table must also create at
least one index that SQL Server can use to link together all the data rows for that table
(see the later section on Indexes on memory-optimized tables).
Each data row consists of two areas, the row header and then the payload, which is the
actual column data. We'll discuss this structure in much more detail in Chapter 3, but the
information stored in the row header includes the identity of the statement that created
the row, pointers for each index on the target table and, critically, some timestamp
values. There will be a timestamp recording the time a transaction inserted a row, and
another indicating the time a transaction deleted a row. SQL Server records updates by
inserting a new version of the row and marking the old version as "deleted." The actual
cleanup of row versions that are no longer required, which involves unlinking them from
index structures and removing them from memory, is a cooperative process involving
both user threads and a dedicated garbage collection thread (more on this in Chapter 5).
As this implies, many versions of the same row can coexist at any given time. This allows
concurrent access of the same row, during data modifications, with SQL Server displaying
the row version relevant to each transaction according to the time the transaction started
relative to the timestamps of the row version. This is the essence of the new multi-
version concurrency control (MVCC) mechanism for in-memory tables, which we'll
describe in a little more detail later in the chapter.
SQL Server holds in memory not only the table and index structures, but
also a set of DLLs for accessing and modifying these data structures. The table metadata
encodes into each DLL a set of native language algorithms that describe precisely the row
format for the table and how to traverse its indexes, thus providing highly efficient access
paths for the table data. This explains why we cannot alter a table, once created; if the
table were altered, SQL Server would have to regenerate all the DLLs for table operations.
These DLLs result in much faster data access than is possible via the traditional way of
using interpreted metadata. A big success of the implementation of in-memory OLTP is
to have made these operations "invisible" to the user.
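You can see evidence of these generated modules yourself; a quick check (this assumes the
documented sys.dm_os_loaded_modules DMV, which lists the modules loaded by the SQL Server
process) is:

-- List the DLLs that in-memory OLTP has generated and loaded for this instance
SELECT name, description
FROM sys.dm_os_loaded_modules
WHERE description = 'XTP Native DLL';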
However, there are limitations on the T-SQL language constructs that are allowed
inside a natively compiled stored procedure, compared to the rich feature set available
with interpreted code. In addition, natively compiled stored procedures can only access
memory-optimized tables and cannot reference disk-based tables. Chapter 7 discusses
natively compiled stored procedures in detail.
SQL Server 2005 introduced a "sort of" optimistic version of concurrency
control, using the snapshot-based isolation levels, and maintaining previous row versions
in a tempdb version store. Under this model, readers no longer acquire shared locks.
Instead of blocking, when one transaction needs to read rows that another transaction
is modifying, the reader retrieves, from the version store, the previously committed
values of the set of rows it needs. Therefore, SQL Server can preserve the ACID
properties without having readers block writers, and without writers blocking readers.
However, SQL Server still acquires locks during data modifications and so writers still
block other writers.
We won't discuss any further in this book the locking, latching or concurrency mechanisms
for disk-based tables. For full details, please refer to my book, SQL Server Concurrency:
Locking, Blocking and Row Versioning (https://www.simple-talk.com/books/sql-books/
sql-server-concurrency-locking,-blocking-and-row-versioning/).
In contrast, SQL Server In-Memory OLTP introduces a truly optimistic MVCC model.
It uses row versioning but its implementation bears little relation to the snapshot-based
model used for disk-based tables. When accessing memory-optimized tables and index
structures, SQL Server still supports the ACID properties of transactions, but it does so
without ever using locking or latching to provide transaction isolation. This means that
no transaction ever has, for lock-related reasons, to wait to read or modify a data row.
Readers never block writers, writers never block readers, and writers never block writers.
Transactions never acquire locks on memory-optimized tables, so they never have to wait to acquire
them. However, this does not mean there is never any waiting when working with memory-optimized
tables in a multi-user system. However, the waiting that does occur is usually of very short duration, such
as when SQL Server is waiting for dependencies to be resolved during the validation phase of transaction
processing (more on the validation phase in Chapters 3 and 5). Transactions might also need to wait for
log writes to complete although, since the logging required when making changes to memory-optimized
tables is much more efficient than logging for disk-based tables, the wait times will be much shorter.
No locks
Operations on disk-based tables implement the requested level of transaction isolation
by using locks to make sure that a transaction (Tx2) cannot change data that another
transaction (Tx1) needs to remain unchanged.
In a traditional relational database system, in which SQL Server needs to read pages from
disk before it can process them, the cost of acquiring and managing locks can be just a
fraction of the total wait time. Often, this cost is dwarfed by the overhead of waiting for
disk reads, and managing the pages in the buffer pool.
However, if SQL Server were to acquire locks on memory-optimized tables, then locking
waits would likely become the major overhead, since there is no cost at all for reading
pages from disk.
Instead, the team designed SQL Server In-Memory OLTP to be a totally lock-free system.
Fundamentally, this is possible because SQL Server never modifies any existing row, and
so there is no need to lock rows. Instead, an UPDATE operation creates a new version by
marking the previous version of the row as deleted, and then inserting a new version of
the row with new values. If a row is updated multiple times, there may be many versions
of the same row existing simultaneously. SQL Server presents the correct version of the
row to the requesting transaction by examining timestamps stored in the row header and
comparing them to the transaction start time.
No latches
Latches are lightweight synchronization mechanisms (often called primitives as they are
the smallest possible synchronization device), used by the SQL Server engine to guarantee
consistency of the data structures that underpin disk-based tables, including index and
data pages as well as internal structures such as non-leaf pages in a B-tree. Even though
latches are quite a bit lighter weight than locks, there can still be substantial overhead and
wait time involved in using latches.
When accessing disk-based tables, SQL Server must acquire a latch every time it reads
a page from disk, to make sure no other transaction writes to the page while it is being
read. It acquires a latch on the memory buffer into which it will read the page, to make
sure no other transaction uses that buffer. In addition, SQL Server acquires latches on
internal metadata, such as the internal table that keeps track of locks being acquired
and released.
One key improvement provided by SQL Server In-Memory OLTP is that there is no page
construct for memory-optimized tables. There is a page structure used for range indexes,
but the way the pages are managed is completely different than the way they are managed
for disk-based tables in a traditional database system. Not having to manage pages funda-
mentally changes the data operation algorithms from being disk optimized to being
memory and cache optimized.
SQL Server In-Memory OLTP does no reading from disk during data processing, stores
no data in buffers and applies no locks, so there is no reason for it to acquire latches for
operations on memory-optimized tables; this eliminates one more possible source of waiting.
Indexes on memory-optimized tables
With disk-based storage structures, there are data pages that combine sets of rows
into a single structure. With in-memory structures, there are no such pages and instead
SQL Server uses indexes to combine all the rows that belong to a table into a single
structure. This is why every memory-optimized table must have at least one index.
We create indexes as part of table creation; unlike for disk-based indexes, we cannot
use CREATE INDEX to create memory-optimized indexes. If we create a PRIMARY KEY
on a column, and durable memory-optimized tables must have a PRIMARY KEY, then
SQL Server automatically creates a unique index on that column (and this is the
only unique index allowed). We can create a maximum of eight indexes on a memory-
optimized table, including the PRIMARY KEY index.
Like tables, SQL Server stores memory-optimized indexes entirely in memory. However,
unlike for tables, SQL Server never logs operations on indexes, and never persists indexes
to the on-disk checkpoint files (covered shortly). SQL Server maintains indexes automati-
cally during all modification operations on memory-optimized tables, just like B-tree
indexes on disk-based tables, but in case of a restart, SQL Server rebuilds the indexes on
the memory-optimized tables as the data is streamed into memory.
Memory-optimized tables support two basic types of index, both of which are
non-clustered structures: hash indexes and range indexes.
A hash index is a new type of SQL Server index, specific to memory-optimized tables,
which is useful for performing lookups on specific values. A hash index, which is stored
as a hash table, is essentially an array of hash buckets, where each bucket points to the
location in memory of a data row. SQL Server applies a hash function to the index key
values, and maps each one to the appropriate bucket. In each bucket is a pointer to a
single row, the first row in the list of rows that hash to the same value. From that row,
all other rows in the hash bucket are joined in a singly-linked list (this will become
clearer when we get to see some diagrams in Chapter 4).
A non-clustered range index, useful for retrieving ranges of values, is more like the sort
of index we're familiar with when working with disk-based tables. However, again, the
structure is different. The memory-optimized counterparts use a special Bw-tree storage
structure.
A Bw-tree is similar to a disk-based B-tree index in that it has index pages organized into
a root page, a leaf level, and possibly intermediate-level pages. However, the pages of a
Bw-tree are very different structures from their disk-based counterparts. The pages can
be of varying sizes, and the pages themselves are never modified; new pages are created
when necessary, when the underlying rows are modified.
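As a simple illustration of when each index type helps (the table and column names here are
illustrative, matching the examples in Chapter 2), a hash index supports only equality seeks, while a
range index also supports range predicates and ordered scans:

-- An equality predicate can be satisfied by a hash index on Name
SELECT [City]
FROM dbo.T1
WHERE [Name] = 'Smith';

-- A range predicate needs a range (Bw-tree) index; a hash index cannot help here
SELECT [Name]
FROM dbo.T1
WHERE [Name] >= 'S' AND [Name] < 'T';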
We'll discuss this topic in much more detail in Chapter 6, but logging for in-memory tables is more
efficient than for disk-based tables essentially because, given the same workload, SQL Server will write far
fewer log records for an in-memory table than for its equivalent disk-based table. For example, it doesn't
log any changes to data in indexes. It will also never write log records associated with uncommitted
transactions, since SQL Server will never write dirty data to disk for in-memory tables. Also, rather than
write every atomic change as a single log record, in-memory OLTP will combine many changes into a
single log record.
SQL Server In-Memory OLTP also continuously persists the table data to disk in special
checkpoint files. It uses these files only for database recovery, and only ever writes to
them "offline," using a background thread. Therefore, when we create a database that will
use memory-optimized data structures, we must create, not only the data file (used only
for disk-based table storage) and the log file, but also a special MEMORY_OPTIMIZED_
DATA filegroup that will contain the checkpoint file pairs, each pair consisting of a data
checkpoint file and a delta checkpoint file (more on these in Chapter 2).
These checkpoint files are append-only and SQL Server writes to them strictly sequen-
tially, in the order of the transactions in the transaction log, to minimize the I/O cost. In
case of a system crash or server shutdown, SQL Server can recreate the rows of data in the
memory-optimized tables from the checkpoint files and the transaction log.
When we insert a data row into a memory-optimized table, the background thread (called
the offline checkpoint thread) will, at some point, append the inserted row to the corre-
sponding data checkpoint file. Likewise, when we delete a row, the thread will append a
reference to the deleted row to the corresponding delta checkpoint file. So, a "deleted"
row remains in the data file but the corresponding delta file records the fact that it was
deleted. As the checkpoint files grow, SQL Server will at some point merge them, so that
rows marked as deleted actually get deleted from the data checkpoint file, and create a
new file pair. Again, further details of how all this works come in Chapter 6.
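If you want a quick look at these files for a database, you can query the
sys.dm_db_xtp_checkpoint_files DMV (Chapter 6 examines the file states and the merge process
in detail):

-- Inspect the data and delta checkpoint files for the current database
SELECT *
FROM sys.dm_db_xtp_checkpoint_files;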
In-memory OLTP does provide the option to create a table that is non-durable, using
an option called SCHEMA_ONLY. As the option indicates, SQL Server will log the table
creation, so the table schema will be durable, but will not log any data manipulation
language (DML) on the table, so the data will not be durable. These tables do not
require any I/O operations during transaction processing, but the data is only available
in memory while SQL Server is running. These non-durable tables could be useful in
certain cases, for example as staging tables in ETL scenarios or for storing web server
session state.
We'll see how to create both durable and non-durable tables in Chapter 2.
For processing OLTP data, there are two types of specialized engines. The first type
is main-memory databases: Oracle has TimesTen, IBM has SolidDB, and there are
many others that primarily target the embedded database space. The second type is
application caches or key-value stores (for example, Velocity / AppFabric Cache and
Gigaspaces) that leverage application and middle-tier memory to offload work from
the database system. These caches continue to become more sophisticated and acquire
database capabilities, such as transactions, range indexing, and query capabilities
(Gigaspaces already has these, for example). At the same time, database systems are
acquiring cache capabilities, such as high-performance hash indexes and the ability to
scale across a cluster of machines (VoltDB is an example).
The in-memory OLTP engine is meant to offer the best of both of these types of engines,
providing all of the afore-mentioned features. One way to think of in-memory OLTP is
that it has the performance of a cache and the capability of a database. It supports storing
your tables and indexes in memory, so you can create an entire database to be a complete
in-memory system. It also offers high-performance indexes and logging as well as other
features to significantly improve query execution performance.
SQL Server In-Memory OLTP offers the following features that few, if any, of the
competing products provide:
natively compiled stored procedures to improve execution time for basic data
manipulation operations by orders of magnitude
Unlike in-memory OLTP, a lot of its competitors still use traditional page constructs,
even while the pages are forced to stay in memory. For example SAP HANA still uses
16 KB pages for its in-memory row-store, which would inherently suffer from page
latch contention in a high-performance environment.
The most notable difference in design of SQL Server In-Memory OLTP from competitors'
products is the "interop" integration. In a typical high-end OLTP workload, the perfor-
mance bottlenecks are concentrated in specific areas, such as a small set of tables and
stored procedures. It would be costly and inefficient to force the whole database to be
resident in memory. But, to date, the other main competitive products require such
an approach. In SQL Server's case, the high performance and high contention area can
be migrated to in-memory OLTP, then the operations (stored procedures) on those
memory-optimized tables can be natively compiled to achieve maximum business
processing performance.
This "interop" capability is possible because SQL Server In-Memory OLTP is fully
integrated in the SQL Server database engine, meaning you can use the same familiar
APIs, language, development, and administration tools; and, most importantly, you can
exploit the knowledge your organization has built up using SQL Server to also work with
in-memory OLTP. Some competitor products can act like a cache for relational data, but
are not integrated. Other products provide support only for in-memory tables, and any
disk-based tables must be managed through a traditional relational database.
Summary
This first chapter took a first, broad-brush look at the new SQL Server In-Memory OLTP
engine. Memory-optimized data structures are entirely resident in memory, so user
processes will always find the data they need by traversing these structures in memory,
without the need for disk I/O. Furthermore, the new MVCC model means that
SQL Server can mediate concurrent access of these data structures, and ensure ACID
transaction properties, without the use of any locks and latches; no user transactions
against memory-optimized data structures will ever be forced to wait to acquire a lock!
Natively compiled stored procedures provide highly efficient data access to these data
structures, offering a further performance boost. Even the logging mechanisms for
memory-optimized tables, to ensure transaction durability, are far more efficient than
for standard disk-based tables.
Combined, all these features make the use of SQL Server In-Memory OLTP a very
attractive proposition for many OLTP workloads. Of course, as ever, it is no silver bullet.
While it can and will offer substantial performance improvements to many applications,
its use requires careful planning, and almost certainly some redesign of existing tables
and procedures, as we'll discuss as we progress deeper into this book.
Additional Resources
As with any "v1" release of a new technology, the pace of change is likely to be rapid. We
plan to revise this book to reflect significant advances in subsequent releases, but in the
meantime it's likely that new online information about in-memory OLTP will appear with
increasing frequency.
As well as bookmarking the online documentation for in-memory OLTP (see below),
you should keep your eyes on whatever Microsoft has to say on the topic, on their
SQL Server website (http://www.microsoft.com/sqlserver), on the TechNet
TechCenter (http://technet.microsoft.com/en-us/sqlserver/) and on the MSDN
DevCenter (http://msdn.microsoft.com/en-us/sqlserver).
SQL Server 2014 online documentation – high-level information about SQL Server's
In-Memory OLTP:
http://msdn.microsoft.com/en-us/library/dn133186(v=sql.120).aspx.
The Path to In-Memory Database Technology – an excellent blog post about the
history of relational databases and the path that led to in-memory OLTP:
http://sqlblog.com/blogs/joe_chang/archive/2013/02/11/the-path-to-in-
memory-database-technology.aspx.
Chapter 2: Creating and Accessing In-Memory OLTP Databases and Tables
In-memory OLTP is an automatic and obligatory component of the SQL Server setup
process for any installation of a 64-bit Enterprise or Developer edition of SQL Server 2014
that includes the database engine components. In-memory OLTP is not available at all
with 32-bit editions.
Therefore, with no further setup, we can begin creating databases and data structures
that will store memory-optimized data.
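If you are unsure whether a given instance qualifies, a quick check (using the documented
IsXTPSupported server property) is:

-- Returns 1 if this instance supports In-Memory OLTP, 0 otherwise
SELECT SERVERPROPERTY('IsXTPSupported') AS IsXTPSupported;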
Creating Databases
Any database that will contain memory-optimized tables needs to have a single
MEMORY_OPTIMIZED_DATA filegroup containing at least one container, which stores
the checkpoint files needed by SQL Server to recover the memory-optimized tables.
These are the checkpoint data and delta files that we introduced briefly in Chapter 1.
SQL Server populates these files during CHECKPOINT operations, and reads them during
the recovery process, which we'll discuss in Chapter 6.
The syntax for creating a MEMORY_OPTIMIZED_DATA filegroup is almost the same as that
for creating a regular FILESTREAM filegroup, but it must specify the option CONTAINS
MEMORY_OPTIMIZED_DATA. Listing 2-1 provides an example of a CREATE DATABASE
statement for a database that can support memory-optimized tables (edit the path names
to match your system; if you create the containers on the same drive you'll need to
differentiate the two file names).
USE master
GO
IF EXISTS (SELECT * FROM sys.databases WHERE name='HKDB')
DROP DATABASE HKDB;
GO
CREATE DATABASE HKDB
ON
PRIMARY(NAME = [HKDB_data],
FILENAME = 'Q:\DataHK\HKDB_data.mdf', size=500MB),
FILEGROUP [HKDB_mod_fg] CONTAINS MEMORY_OPTIMIZED_DATA
(NAME = [HKDB_mod_dir],
FILENAME = 'R:\DataHK\HKDB_mod_dir'),
(NAME = [HKDB_mod_dir2],
FILENAME = 'S:\DataHK\HKDB_mod_dir')
-- the log file location below is illustrative; place the log on its own drive
LOG ON (NAME = [HKDB_log],
FILENAME = 'L:\LogHK\HKDB_log.ldf', size=500MB)
COLLATE Latin1_General_100_BIN2;
Listing 2-1: Creating a database with a MEMORY_OPTIMIZED_DATA filegroup.
In Listing 2-1, we create a regular data file (HKDB_data.mdf), used for disk-based table
storage only, and a regular log file (HKDB_log.ldf). In addition, we create a memory-
optimized filegroup, HKDB_mod_fg, with, in this case, two file containers, both pointing to
a directory named HKDB_mod_dir (on different drives). These containers host data and delta checkpoint file pairs to which
the CHECKPOINT process will write data, for use during database recovery. The data
checkpoint file stores inserted rows and the delta files reference deleted rows. The data
and delta file for each pair may be in the same or different containers, depending on the
number of containers specified. In this case, with two containers, one will contain the
data checkpoint files and the other the delta checkpoint files, for each pair. If we had only
one container, it would contain both the data and delta files.
Notice that we place the primary data file, each of the checkpoint file containers, and the
transaction log, on separate drives. Even though the data in a memory-optimized table is
never read from or written to disk "inline" during query processing, it can still be useful
to consider placement of your checkpoint files and log file for optimum I/O performance
during logging, checkpoint, and recovery.
To help ensure optimum recovery speed, you will want to put each of the containers in
the MEMORY_OPTIMIZED_DATA filegroup on a separate drive, with fast sequential I/O.
To reduce any log waits, and improve overall transaction throughput, it's best to place
the log file on a drive with fast random I/O, such as an SSD drive. As the use of memory-
optimized tables allows for a much greater throughput, we'll start to see a lot of activity
needing to be written to the transaction log although, as we'll see in Chapter 6, the
overall efficiency of the logging process is much higher for in-memory tables than for
disk-based tables.
Finally, notice that Listing 2-1 specifies a binary collation. In the current version of
in-memory OLTP, any indexes on character columns in memory-optimized tables
can only be on columns that use a Windows (non-SQL) BIN2 collation, and natively
compiled procedures only support comparisons, sorting and grouping on those same
collations. In this listing, we use the CREATE DATABASE command to define a default
binary collation for the entire database, which means that this collation will apply
to object names (i.e. the metadata) as well as user data, and so all table names will be
case sensitive. An object called SalesOrders will not be recognized if a query uses
salesorders. As such, unless every table in a database will be a memory-optimized
table, it is better to specify the collation for each of the character columns in any
memory-optimized table. We can also specify the collation in the query, for use in any
comparison, sorting or grouping operation.
If, instead of creating a new database, we want to allow an existing database to store
memory-optimized objects and data, we simply add a MEMORY_OPTIMIZED_DATA
filegroup to an existing database, and then add a container to that filegroup, as shown in
Listing 2-2.
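A minimal sketch of those two statements (the existing database name, AdventureWorks2014, and
the container path are placeholders) might look like this:

ALTER DATABASE AdventureWorks2014
    ADD FILEGROUP AW_mod_fg CONTAINS MEMORY_OPTIMIZED_DATA;
GO
ALTER DATABASE AdventureWorks2014
    ADD FILE (NAME = 'AW_mod_dir',
              FILENAME = 'R:\DataHK\AW_mod_dir')
    TO FILEGROUP AW_mod_fg;
GO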
Listing 2-2: Adding a filegroup and file for storing memory-optimized table data.
Creating Tables
The syntax for creating memory-optimized tables is almost identical to the syntax for
creating disk-based tables, but with a few required extensions, and a few restrictions on
the data types, indexes, constraints and other options that memory-optimized tables can
support.
USE HKDB;
GO
CREATE TABLE T1
(
[Name] varchar(32) not null PRIMARY KEY NONCLUSTERED HASH
WITH (BUCKET_COUNT = 100000),
[City] varchar(32) null,
[State_Province] varchar(32) null,
[LastModified] datetime not null,
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
Listing 2-3: Creating a memory-optimized table with the index definition inline.
Durability
We can define a memory-optimized table with one of two DURABILITY values: SCHEMA_
AND_DATA or SCHEMA_ONLY, with the former being the default. If we define a memory-
optimized table with DURABILITY=SCHEMA_ONLY, then SQL Server will not log changes
to the table's data, nor will it persist the data in the table to the checkpoint files, on disk.
However, it will still persist the schema (i.e. the table structure) as part of the database
metadata, so the empty table will be available after the database is recovered during a
SQL Server restart.
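As a brief sketch of the syntax (the table and column names are illustrative), a non-durable table
simply specifies DURABILITY = SCHEMA_ONLY in the WITH clause:

CREATE TABLE T_SessionState
(
    [SessionID] int not null PRIMARY KEY NONCLUSTERED HASH
        WITH (BUCKET_COUNT = 100000),
    [SessionData] varchar(2000) null,
    [LastModified] datetime not null
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);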
As discussed in Chapter 1, every memory-optimized table must have at least one index,
to join together the rows that belong to that table. We can satisfy this "at least one
index" requirement for in-memory tables by specifying a PRIMARY KEY constraint, since
SQL Server automatically creates an index to support this constraint. We must declare a
PRIMARY KEY on all tables except for those created with the SCHEMA_ONLY option.
In the previous example (Listing 2-3), we declared a PRIMARY KEY constraint on the Name
column, and specified the type of index that should be created to support the constraint,
in this case, a HASH index. When we create a hash index, we must also specify a bucket
count (i.e. the number of buckets in the hash index).
We'll cover hash indexes in Chapter 4, where we'll also discuss a few guidelines for
choosing a value for the BUCKET_COUNT attribute.
We can create any type of single-column index, hash or range, inline with the column
definition. For example, we might, alternatively, have specified a range index for the
PRIMARY KEY column and a hash index on the City column, as shown in Listing 2-4.
CREATE TABLE T1
(
[Name] varchar(32) not null PRIMARY KEY NONCLUSTERED,
[City] varchar(32) not null INDEX T1_hdx_c2 HASH
WITH (BUCKET_COUNT = 10000),
[State_Province] varchar(32) null,
[LastModified] datetime not null,
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
For non-PRIMARY KEY columns the NONCLUSTERED keyword is optional, but we have to
specify it explicitly when defining the PRIMARY KEY because otherwise SQL Server will
try to create a clustered index, the default for a PRIMARY KEY, and will generate an error
because clustered indexes are not allowed on memory-optimized tables.
For composite indexes, we create them after the column definitions. Listing 2-5 creates a
new table, T2, with the same hash index for the primary key on the Name column, plus a
range index on the City and State_Province columns.
If you're wondering why we created a new table, T2, rather than just adding the
composite index to the existing T1, it's because SQL Server stores the structure of
in-memory tables as part of the database metadata, and so we can't alter those tables
once created.
CREATE TABLE T2
(
[Name] varchar(32) not null PRIMARY KEY NONCLUSTERED HASH
WITH (BUCKET_COUNT = 100000),
[City] varchar(32) not null,
[State_Province] varchar(32) not null,
[LastModified] datetime not null,
INDEX T2_ndx_c2c3 NONCLUSTERED ([City],[State_Province]) -- index name is illustrative
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
Listing 2-5: Creating a memory-optimized table with the index definition specified separately.
In short, no schema changes are allowed once a table is created so, instead of using
ALTER TABLE, we must drop and recreate the table. Likewise, we cannot use procedure
sp_rename with memory-optimized tables, to change either the table name or any
column names.
Also note that there are no specific index DDL commands (i.e. CREATE INDEX,
ALTER INDEX, DROP INDEX). We always define indexes as part of the table creation.
There are a number of other restrictions and limitations around the use of indexes,
constraints and other properties during table creation. In particular, only a subset of
data types is supported; the allowed types include bit, the integer and money types, the
floating-point types, the date and time types (such as date, time, datetime and datetime2),
numeric and decimal, the non-LOB character and binary types, and uniqueidentifier.
None of the LOB data types are allowed; there can be no columns of type XML, or
common language runtime (CLR) data types, or any of the 'MAX' data types, and all row
lengths are limited to 8060 bytes with no off-row data. In fact, the 8060 byte limit is
enforced at table creation time, so unlike a disk-based table, a memory-optimized table
with two varchar(5000) columns could not be created.
In addition, even though most of the same data types that are allowed for disk-based
tables are also available for memory-optimized tables, in some cases the internal storage
size may be different. The main differences are for those data types that allow varying
precision values to be specified: numeric/decimal, time, date, and datetime2.
In Listing 2-1, we created the HKDB database with a binary (BIN2) collation, so every
table and column in the database will use this BIN2 collation, which means that the data,
and also the metadata, is case sensitive.
Listing 2-6 shows a few simple examples of working with memory-optimized tables.
USE HKDB;
GO
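-- Sample rows (assumed here for illustration) that produce the behavior
-- described in the comments accompanying the queries below
INSERT T1 VALUES ('Ampere', 'Lyon', 'Rhone-Alpes', GETDATE());
INSERT T1 VALUES ('Da Vinci', 'Florence', 'Tuscany', GETDATE());
INSERT T1 VALUES ('da Vinci', 'Vinci', 'Tuscany', GETDATE());
GO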
-- returns 1 row
SELECT *
FROM T1
WHERE Name = 'da Vinci';
GO
-- "da Vinci" appears last in the ordering because, with a BIN2 collation,
-- any upper-case characters sort before all lower-case characters
SELECT *
FROM T1
ORDER BY Name;
-- returns "expected" results, but SQL Server will perform a table scan
SELECT *
FROM T1
WHERE Name = 'Da Vinci' COLLATE Latin1_General_CI_AS;
GO
-- returns "expected" order but, as above, an index cannot be used to support
-- the ordering
SELECT *
FROM T1
ORDER BY Name COLLATE Latin1_General_CI_AS;
GO
The alternative, as discussed earlier, is to create each character column in a given table
with the BIN2 collation. For example, if we rerun Listing 2-1 without specifying the database
collation, then when recreating table T1 we would specify the collation for each character
column as part of the CREATE TABLE statement, and it would be obligatory to specify a
BIN2 collation on the Name column, since this column participates in the hash index.
USE HKDB;
GO
CREATE TABLE T1
(
[Name] varchar(32) COLLATE Latin1_General_100_BIN2 not null
PRIMARY KEY NONCLUSTERED HASH
WITH (BUCKET_COUNT = 100000),
[City] varchar(32) COLLATE Latin1_General_100_BIN2 null,
[State_Province] varchar(32) COLLATE Latin1_General_100_BIN2 null,
[LastModified] datetime not null,
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
Listing 2-7: Specifying collations at the column level, during table creation.
Rerunning the queries in Listing 2-6, we'll see that this eliminates the case sensitivity on
the table and column names, but the data case sensitivity remains.
Finally, remember that tempdb will use the collation for the SQL Server instance. If
the instance does not use the same BIN2 collation, then any operations that use
tempdb objects may encounter collation mismatch problems. One solution is to
use COLLATE database_default for the columns on any temporary objects.
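For example, a temporary table used alongside memory-optimized data might declare its character
columns with the database default collation; a minimal sketch:

-- Use the current database's collation rather than tempdb's instance collation
CREATE TABLE #NameList
(
    [Name] varchar(32) COLLATE database_default NOT NULL
);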
Interpreted T-SQL
When accessing memory-optimized tables using interpreted T-SQL, via the interop, we
have access to virtually the full T-SQL surface area (i.e. the full list of statements and
expressions). However, we should not expect the same performance as when we access
memory-optimized tables using natively compiled stored procedures (Chapter 7 shows a
performance comparison).
Use of interop is the appropriate choice when running ad hoc queries, or as an
intermediate step while migrating your applications to in-memory OLTP, before
converting the most performance-critical procedures to natively compiled form. Interpreted T-SQL
should also be used when you need to access both memory-optimized tables and disk-
based tables.
The only T-SQL features or capabilities not supported when accessing memory-
optimized tables using interop are the following:
TRUNCATE TABLE
dynamic and keyset cursors (these are automatically degraded to static cursors)
cross-database queries
cross-database transactions
linked servers
T-SQL in natively compiled procedures
Currently, there are many more limitations on the T-SQL that we can use in these
natively compiled procedures, as well as limitations on the data types and collations that
natively compiled procedures can access and process. The MSDN documentation (see
Additional Resources, below) provides a full list of supported and unsupported T-SQL
statements, data types and operators. The reason for the restrictions is that, internally, a
separate function must be created for each operation on each table. The interface will be
expanded in subsequent versions.
However, there are also a few T-SQL features that are supported only in natively
compiled stored procedures that are not available when using interpreted T-SQL code.
These include:
atomic blocks
SCHEMABINDING.
Chapter 7 will describe these features, as well as providing details of how natively
compiled stored procedures are processed.
Summary
This chapter covered the basics of creating databases, tables and indexes to store memory-
optimized data. In creating the database, we must define a special memory-optimized
filegroup, which is built on the FILESTREAM technology. When creating a memory-
optimized table, we just have to specify a special MEMORY_OPTIMIZED = ON clause, and
create at least one index on the table. It sounds simple, and it is, but we have to remember
that there are currently quite a number of restrictions on the data types, indexes,
constraints, and other options, that memory-optimized tables can support. Also, any
character column that participates in an index must use a BIN2 collation, which might
affect the results of queries against this column.
We can access memory-optimized data structures with T-SQL, either in interop mode or
via natively compiled stored procedures. In the former case, we can use more or less the
full T-SQL surface area, but in the latter case, there is a longer list of restrictions.
Additional Resources
Details of supported query constructs in natively compiled procedures:
http://msdn.microsoft.com/en-us/library/dn452279(v=sql.120).aspx.
White paper discussing SQL Server Filestream storage, explaining how files in the
filegroups containing memory-optimized data are organized and managed internally:
http://download.microsoft.com/download/D/2/0/D20E1C5F-72EA-4505-9F26-
FEF9550EFD44/FILESTREAMStorage.docx.
Chapter 3: Row Structure and Multi-Version Concurrency
In the previous two chapters, we discussed how the storage structures for in-memory
tables and indexes are very different from their disk-based counterparts. SQL Server
does not store the data rows on pages, nor does it pre-allocate space for these pages from
extents. Instead, it writes the data rows to memory sequentially, in the order the
transactions occurred, linked by pointers in an "index chain." SQL Server allocates space
for rows in in-memory tables on the fly, from memory structures called heaps, which are
different than the type of heaps SQL Server supports for disk-based tables.
SQL Server knows what rows belong to the same table because they are all connected
using the table's indexes. Each in-memory table must have at least one index, as this
index provides structure for each table. An in-memory table can have up to eight indexes
in total, comprising a mixture of both hash and range indexes (covered in Chapter 4),
both of which are non-clustered structures.
The structure of a data row within a memory-optimized data structure reflects the fact
that the in-memory OLTP engine supports a truly optimistic concurrency model, called
a multi-version concurrency control (MVCC) model, which is based on in-memory row
versioning. For memory-optimized tables, SQL Server never modifies any existing row.
Instead, any UPDATE operation is a two-step process that marks the current version of the
row as invalid and then creates a new version of the row. If a row is subject to multiple
updates, then many versions of the same row will coexist simultaneously. SQL Server
displays the correct version to each transaction that wishes to access a row by examining
timestamps stored in the row header, and comparing them to the time the accessing
transaction started.
In this chapter, we're going to explore the row structure that enables this row versioning,
and then take a high-level view of how the new MVCC model works.
Row Structure
The data rows that comprise in-memory tables have a structure very different than
the row structures used for disk-based tables. Each row consists of a row header, and a
payload containing the row attributes (the actual data). Figure 3-1 shows this structure, as
well as expanding on the content of the header area.
Row header
The row header for every data row consists of the following fields:
• Begin-Ts: the "insert" timestamp. It reflects the time that the transaction that inserted a row issued its COMMIT.
• End-Ts: the "delete" timestamp. It reflects the time that the transaction that deleted a row issued its COMMIT.
• StmtId: every statement within a transaction has a unique StmtId value, which identifies the statement that created the row. If the same row is then accessed again by the same statement, it can be ignored. This can provide Halloween protection within transactions on memory-optimized tables.
• Padding: extra bytes added (and not used) so that the row will be a multiple of 8 bytes in length.
• Index Pointers: these are C language pointers to the next row in the index chain. There is a pointer for each index on the table. It is the index pointers, plus the index data structures, that connect together the rows of a table. There are no structures other than the index pointers to combine rows into a table, which is why every table must have at least one index. We'll discuss this in more detail later in the chapter.
Payload area
The payload is the row data itself, containing the index key columns plus all the other
columns in the row, meaning that all indexes on a memory-optimized table can be
thought of as covering indexes. The payload format varies from table to table, depending
on the table's schema. As described in Chapter 1, the in-memory OLTP compiler
generates the DLLs for table operations. These DLLs contain code describing the payload
format, and so can also generate the appropriate commands for all row operations.
Each time a transaction modifies a row, the in-memory engine creates a new version
of the row. If a row is subject to multiple modifications, then multiple versions of the
row will exist within memory, concurrently. Note that the storage engine really has no
concept of "versions." It will simply "see" multiple rows using the index linkages and
return each row that is valid, depending on the timestamp, and that meets the query's
criteria. During query processing, the engine will determine which rows should be
visible to the accessing transaction, by comparing the time the transaction started to the
Begin-Ts and End-Ts timestamp values stored in the header of the row it is accessing.
SQL Server keeps track of each active transaction in an internal, global transaction
table. When a transaction starts, SQL Server increments the Transaction-ID counter,
and assigns a unique transaction ID to the transaction. When the transaction issues a
COMMIT, SQL Server generates a commit timestamp, which it stores in the internal table
initially, and then writes the value to the header of affected rows, once it validates the
transaction and hardens the changes to the log.
SQL Server also stores in the internal table a system timestamp that indicates the
transaction's start time; it is actually the timestamp for the last committed transaction at the
time this transaction started and indicates when the transaction began, relative to the
serialization order of the database. More than one transaction can have the same start
timestamp. This start timestamp is never used anywhere else, and is only in the internal
table while the transaction is active.
The commit timestamps of existing row versions and the start timestamp of the
transaction determine which row versions each active transaction can see. The version of a
row that an active transaction can see depends on the validity interval of the row version
compared to the logical read time of the active transaction. Generally speaking, the
logical read time of the active transaction will be the time the transaction started, and
we'll assume that here, but for the READ COMMITTED transaction isolation level (only
available to auto-commit transactions; more on this in Chapter 5), it will be the time
the actual statement executed. A transaction executing with logical read time RT must
only see the effects of transactions with a start time of less than or equal to RT. It can see
existing row versions with a commit timestamp for the INSERT (i.e. Begin-Ts) of less
than or equal to RT, and a commit timestamp for the DELETE (i.e. End-Ts) of greater
than RT.
For example, let's say an INSERT transaction issues a COMMIT at timestamp "5," inserting a
new row with the value "white." At timestamp "10" an UPDATE to the same row commits,
changing the value to "brown." Two versions of this row now coexist. At timestamp 10,
the "white" version of the row is marked for deletion, with a commit timestamp of "10" for
the delete, and the new "brown" row version is inserted, with a commit timestamp of "10"
for the INSERT. We'll assume no subsequent transaction has touched the row.
In this example, the "white" version of the row has a validity interval of 5 to 10 and the
"brown" row has a validity interval of 10 to infinity. An active transaction with a logical
read time greater than or equal to 5, and less than 10, such as 5 or 9, should see the "white"
row, whereas one that started at time 10 or higher should see the "brown" row version.
After a transaction issues a COMMIT, SQL Server performs some validation checks (more
on this shortly). Having determined the transaction is valid, it hardens it to disk and
writes the commit timestamp into the row header of all affected rows. If the transaction
was an INSERT, it writes the commit timestamp to Begin-Ts and, if it was a DELETE,
to End-Ts. An UPDATE is simply an atomic operation consisting of a DELETE followed
by an INSERT.
At the start of our example, we have two existing data rows. In the simplified depiction
in Figure 3-2, the first column shows only the Begin-Ts and End-Ts value from the row
header for each row, and the next two columns show the actual data values for the Name
and City columns in the payload.
We can see that a transaction inserted the rows <Greg, Beijing> and <Susan, Bogota> at
timestamp 20. Notice that SQL Server uses a special value, referred to as "infinity," for the
End-Ts value for rows that are active (not yet marked as invalid).
Let's see how the row versions and their values evolve during the three basic stages
(processing, validation, and post-processing) of this transaction, and how SQL Server
controls which rows are visible to other active concurrent transactions.
Processing phase
During the processing stage, SQL Server processes the transaction, creating new row
versions (and linking them into index structures covered in Chapter 4), and marking
rows for deletion as necessary, as shown in Figure 3-3.
Figure 3-3: Row versions during an in-flight data modification transaction, Tx1.
During processing, SQL Server uses the Transaction-ID for the Begin-Ts value of
any row it needs to insert, and for the End-Ts value for any row that it needs to mark for
deletion. SQL Server uses an extra bit flag to indicate to other transactions that these are
transaction ids, not timestamps.
So, to delete the <Susan, Bogota> row (remember, the row isn't actually removed during
processing; it's more a case of marking it as deleted), transaction Tx1 first locates the row,
via one of the indexes, and then sets the End-Ts value to the Transaction-ID for Tx1.
The update of <Greg, Beijing> occurs in an atomic step consisting of two separate
operations that will delete the original row, and insert a completely new row. Tx1
constructs the new row <Greg, Lisbon>, storing the transaction ID, Tx1, in
Begin-Ts, and then setting End-Ts to infinity. As part of the same atomic
action, Tx1 deletes the <Greg, Beijing> row, just as described previously. Finally,
it inserts the new <Jane, Helsinki> row.
At this stage our transaction, Tx1, issues a COMMIT. SQL Server generates the commit
timestamp, at 120, say, and stores this value in the global transaction table. This
timestamp identifies the point in the serialization order of the database where this trans-
action's updates have logically all occurred. It does not yet write it to the row header
because SQL Server has yet to validate the transaction (more on this shortly), and so
has not hardened the transaction to the log, on disk. As such, the transaction is not yet
"guaranteed;" it could still abort and roll back, and SQL Server will not acknowledge the
commit to the user until validation completes. However, SQL Server will optimistically
assume that the transaction will actually commit, and makes the rows available to other
transactions as soon as it receives the COMMIT.
First, let's assume Tx2 starts at timestamp 100, after Tx1 started but before Tx1 issues
a COMMIT. Tx2 will read the <Susan, Bogota> row, and find that End-Ts contains a
Transaction-ID, Tx1, indicating that the row may have been deleted. Tx2 locates
Tx1 in the global transaction table and checks its status to determine if the deletion
of <Susan, Bogota> is complete. In our example, Tx1 is still active and so Tx2 can access
the <Susan, Bogota> row.
When it accesses the <Greg, Lisbon> row version, Tx2 will see the transaction id in
Begin-Ts, check the global transaction table, find Tx1 is still active, follow the index
pointer in the header back to the previous row version, and instead return the row
version <Greg, Beijing>. Likewise, Tx2 will not return the row <Jane, Helsinki>.
However, what if, instead, we assume Tx2 started at timestamp 121, after Tx1 issued the
commit, but before SQL Server completed validation of Tx1? If Tx2 started at timestamp
121, then it will be able to access data rows that have a commit timestamp of less than or
equal to 121 for Begin-Ts and greater than 121 for End-Ts.
Tx2 reads the <Susan, Bogota> row, finds Tx1 in End-Ts indicating it may be deleted,
locates Tx1 in the global transaction table and checks the internal transaction table,
where this time it will find the commit timestamp (the "prospective" Begin-Ts value) of
120 for Tx1. The commit for Tx1 is issued but not confirmed (since it hasn't completed
validation), but SQL Server optimistically assumes that Tx1 will commit, and therefore
that the <Susan, Bogota> row is deleted, and Tx2 will not return this row. By a similar
argument, it will return the rows <Greg, Lisbon> and <Jane, Helsinki>, since their
prospective Begin-Ts values are 120 (<= 121) and their End-Ts values are infinity (> 121).
However, since SQL Server has yet to validate transaction Tx1, it registers a commit
dependency between Tx2 and Tx1. This means that SQL Server will not complete
validation of Tx2, nor acknowledge the commit of Tx2 to the user, until it completes
validation of Tx1.
In other words, while a transaction will never be blocked waiting to acquire a lock, it
may need to wait a short period for commit dependencies to resolve, during validation.
However, generally, any blocking waits that arise from the resolution of commit
dependencies will be minimal. Of course, a "problematic" (e.g. long-running) transaction in an
OLTP system is still going to cause some blocking, although never lock-related blocking.
Validation phase
Once our transaction Tx1 issues a commit, and SQL Server generates the commit
timestamp, the transaction enters the validation phase. While SQL Server will
immediately detect direct update conflicts, such as those discussed in the previous section,
it is not until the validation phase that it will detect other potential violations of the
properties specified by the transaction isolation level. So, for example, let's say Tx1 was
accessing the memory-optimized table in REPEATABLE READ isolation. It reads a row
value and then Tx2 updates that row value, which it can do because SQL Server acquires
no locks in the MVCC model, and issues a commit before Tx1 commits. When Tx1
enters the validation phase, it will fail the validation check and SQL Server will abort the
transaction. If there are no violations, SQL Server proceeds with other actions that will
culminate in guaranteeing the durability of the transaction.
The following summarizes the actions that SQL Server will take during the validation
phase (Chapter 5 discusses each of these actions in more detail).
• Validate the changes made by a transaction: for example, it will perform checks to ensure that there are no violations of the current transaction isolation level.
• Wait for any commit dependencies to resolve (i.e. for the dependency count to reduce to 0).
• Harden the transaction to disk: for durable tables only, SQL Server generates a "write set" of changes, basically a list of DELETE/INSERT operations with pointers to the version associated with each operation, and writes it to the transaction log, on disk.
Post-processing
In this stage, SQL Server writes the commit timestamp into the row header of all affected
rows (note this is the timestamp from when Tx1 first issued the commit). Therefore, our
final row versions look as shown in Figure 3-4.
As noted earlier, the storage engine has no real notion of row "versions." There is no
implicit or explicit reference that relates one version of a given row to another. There are
just rows, connected together by the table's indexes, as we'll see in the next chapter, and
visible to active transactions, or not, depending on the validity interval of the row version
compared to the logical read time of the accessing transaction.
In Figure 3-4, the rows <Greg, Beijing> and <Susan, Bogota> have a validity interval of
20 to 120 and so any user transaction with a starting timestamp greater than or equal
to 20 and less than 120, will still see those row versions. Any transaction with a starting
timestamp greater than 120 will see <Greg, Lisbon> and <Jane, Helsinki>.
Eventually, there will be no active transactions for which the rows <Greg, Beijing> and
<Susan, Bogota> are still valid, and so SQL Server can delete them permanently (remove
the rows from the index chains and de-allocate memory). These "stale" rows may be
removed by user threads or by a separate "garbage collection" thread (we'll cover this in
Chapter 5).
Summary
The SQL Server In-Memory OLTP engine supports true optimistic concurrency, via an
MVCC model based on in-memory row versioning. This chapter described the row structure
that underpins the MVCC model, and then examined how SQL Server maintains multiple
row versions, and determines the correct row version that each concurrent transaction
should access. This model means that SQL Server can avoid read-write conflicts without
the need for any locking or latching, and will raise write-write conflicts immediately,
rather than after a delay (i.e. rather than blocking for the duration of a lock-holding
transaction).
In the next chapter, we'll examine how SQL Server uses indexes to connect all rows that
belong to a single in-memory table, as well as to optimize row access.
Additional Resources
Hekaton: SQL Server's Memory-Optimized OLTP Engine, a white paper by
Microsoft Research:
http://research.microsoft.com/pubs/193594/Hekaton%20-%20Sigmod2013%20
final.pdf.
Chapter 4: Hash and Range Indexes
The previous chapter discussed data row structure, and how the in-memory OLTP engine
maintains row versions, as part of its optimistic MVCC system.
The row header for each data row contains a set of index pointers, one for each index on
the table to which the row belongs. Each pointer points to the next logical row in that
table, according to the key for that index. As such, it is these indexes that provide order to
the rows in a table. There are no other structures for combining rows into a table other
than to link them together with the index pointers; there are no data pages that combine
sets of data rows into a single structure, which means that all memory-optimized tables
must have at least one index.
Beyond this obligatory index, which connects the rows together, we can choose up to
seven additional indexes, for a maximum of eight indexes per table, consisting of both hash
and range indexes, in order to optimize access paths for that table.
In this chapter, we're going to explore, in a lot more detail, the storage structure of
in-memory indexes. We'll start by discussing hash indexes, how SQL Server can use such
an index to join together and organize the rows of a table, and then we'll look at the
tactical use of these indexes for query optimization.
We'll then move on to discuss, in depth, the range index and its new Bw-tree internal
structure, and the internal maintenance that SQL Server performs on these structures to
maintain optimum query performance.
Index Basics
To summarize briefly some of what we've discussed about the "rules" governing the use
of indexes on memory-optimized tables:
• a maximum of 8 indexes per table, including the index supporting the PRIMARY KEY
• we can't alter a table after creating it, so we must define all indexes at the time we create the memory-optimized table; SQL Server writes the number of index pointers, and therefore the number of indexes, into the row header on table creation
• during database recovery, SQL Server recreates all indexes based on the index definitions (we'll go into detail in Chapter 6, Logging, Checkpoint, and Recovery).
With a maximum limit of 8 indexes, all of which we must define on table creation,
we must exert even more care than usual to choose the correct and most useful set
of indexes.
We discussed earlier in the book how data rows are not stored on pages, so there is no
collection of pages or extents, and there are no partitions or allocation units. Similarly,
although we do refer to index pages in in-memory range indexes, they are very different
structures from their disk-based counterparts.
Hash Indexes
A hash index, which is stored as a hash table, consists of an array of pointers, where
each element of the array is called a hash bucket and stores a pointer to the location in
memory of a data row. When we create a hash index on a column, SQL Server applies
a hash function to the value in the index key column in each row and the result of the
function determines which bucket will contain the pointer for that row.
More on hashing
Hashing is a well-known search algorithm, which stores data based on a hash key generated by applying
a hash function to the search key (in this case, the index key). A hash table can be thought of as an array
of "buckets," one for each possible value that the hash function can generate, and each data element (in
this case, each data row) is added to the appropriate bucket based on its index key value. When searching,
the system will apply the same hash function to the value being sought, and will only have to look in a
single bucket. For more information about what hashing and hash searching are all about, take a look at
the Wikipedia article at: http://en.wikipedia.org/wiki/Hash_function.
Let's say we insert the first row into a table and the index key value hashes to the
value 4. SQL Server stores a pointer to that row in hash bucket "4" in the array. If a
transaction inserts a new row into the table, where the index key value also hashes to "4,"
it becomes the first row in the chain of rows accessed from hash bucket 4 in the index,
and the new row will have a pointer to the original row.
In other words, the hash index accesses, from the same hash bucket, all key values
that hash to the same value (have the same result from the hash function), with
subsequent rows linked together in a chain, with one row pointing to the next. If there is
duplication of key values, the duplicates will always generate the same function result
and thus will always be in the same chain. Ideally, there shouldn't be more than one key
value in each hash chain. If two different key values hash to the same value, which means
they will end up in the same hash bucket, or they end up in the same bucket because we
specified fewer buckets than there are possible hash values for the column data, then this
is a hash collision.
Row organization
As discussed previously, SQL Server stores these index pointers in the index pointer array
in the row header. Figure 4-1 shows two rows in a hash index on a name column. For this
example, assume there is a very simple hash function that results in a value equal to the
length of the string in the index key column. The first value of Jane will then hash to 4,
and Susan to 5, and so on. In this simplified illustration, different key values (Jane and
Greg, for example) will hash to the same bucket, which is a hash collision. Of course, the
real hash function is much more random and unpredictable, but I am using the length
example to make it easier to illustrate.
The figure shows the pointers from the 4 and 5 entries in the hash index to the rows
containing Jane and Susan, respectively. Neither row points to any other rows, so the
index pointers in each of the row headers are NULL.
Figure 4-1: A hash index on Name, with buckets 4 and 5 pointing to the <Jane, Helsinki> and <Susan, Vienna> rows.
In Figure 4-1, we can see that the <Jane, Helsinki> and <Susan, Vienna> rows have a
Begin-Ts timestamp of 50 and 70 respectively, and each is the current, active version
of that row.
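As a quick way to picture this simplified, length-based "hash" (the real hash function is nothing like this), the bucket is just the length of the name:

-- Illustrative only: the figures use string length as a stand-in for the hash function
SELECT v.Name ,
       LEN(v.Name) AS bucket
FROM   ( VALUES ( 'Jane' ), ( 'Susan' ), ( 'Greg' ) ) AS v ( Name );
-- Jane -> 4, Susan -> 5, Greg -> 4; Jane and Greg collide in bucket 4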
In Figure 4-2, a transaction, which committed at timestamp 100, has added to the
same table a row with a name value of Greg. Using our string length hash function, Greg
hashes to 4, and so maps to the same bucket as Jane, and the row is linked into the same
chain as the row for Jane. The <Greg, Beijing> row has a pointer to the <Jane, Helsinki>
row and SQL Server updates the hash index to point to Greg. The <Jane, Helsinki> row
needs no changes.
Figure 4-2: The <Greg, Beijing> row, inserted at timestamp 100, is linked into the chain for hash bucket 4, ahead of the <Jane, Helsinki> row.
Finally, what happens if another transaction, which commits at timestamp 200, updates
<Greg, Beijing> to <Greg, Lisbon>? The new version of Greg's row is simply linked in as
any other row, and will be visible to active transactions depending on their timestamps,
as described in Chapter 3. Every row has at least one pointer to it, either directly from the
hash index bucket or from another row. In this manner, each index provides an access
path to every row in the table, in the form of a singly-linked list joining every row in
the table.
Figure 4-3: The hash index after <Greg, Beijing> is updated to <Greg, Lisbon> at timestamp 200.
Of course, this is just a simple example with one index, in this case a hash index, which
is the minimum required to link the rows together. However, for query performance
purposes, we may want to add other hash indexes (as well as range indexes).
For example, if equality searches on the City column are common, and if it were quite a
selective column (small number of repeated values), then we might decide to add a hash
index to that column, too. This creates a second index pointer field. Each row in the table
now has two pointers pointing to it, and can point to two rows, one for each index. The
first pointer in each row points to the next value in the chain for the Name index; the
second pointer points to the next value in the chain for the City index.
Figure 4-4 shows the same hash index on Name, this time with three rows that hash to 4,
and two rows that hash to 5, which uses the second bucket in the Name index. The second
index on the City column uses three buckets. The bucket for 6 has three values in the
chain, the bucket for 7 has one value in the chain, and the bucket for 8 also has one value.
Figure 4-4: Two hash indexes, on Name and on City, linking the same data rows.
Now we have another access path through the rows, using the second hash index.
Hash indexes become less effective on columns that have lots of duplicate values, unless
whenever you're querying on that column, you really do want to return all the rows that
have that particular value. If SQL Server has to search many duplicate values in a hash
chain, but a query only needs a few of the rows that have that value, then performance
will be adversely affected due to the cost of traversing the long hash chains. This can
happen if a query has filters on multiple columns, but the index is based on only one of
the filtered columns. For example, suppose you have both lastname and firstname
columns in your table, but an index on lastname only. If you are searching for all the
rows where lastname = 'Smith' and firstname = 'Sue', SQL Server will have
to search the entire chain of 'Smith' values, and inspect each row to determine if the
firstname value is the desired one. The index would be more useful if we needed to
return all the rows with the lastname value 'Smith'.
When defining a hash index, bear in mind that the hash function used is based on all
the key columns. This means that if you have a hash index on the columns: lastname,
firstname in an employees table, a row with the values <Harrison, Josh> will have a
different value returned from the hash function than <Harrison, John>. This means that a
query that just supplies a lastname value, i.e. Harrison, will not be able to use the index
at all, since Harrison may appear in many hash buckets. Therefore, in order to "seek" on
hash indexes the query needs to provide equality predicates for all of the key columns.
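To make this concrete, here is a sketch using a hypothetical Employees table (all names here are illustrative) with a two-column hash index on (lastname, firstname):

CREATE TABLE dbo.Employees
(
    [EmpID]     int NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 100000),
    [lastname]  varchar(32) COLLATE Latin1_General_100_BIN2 NOT NULL,
    [firstname] varchar(32) COLLATE Latin1_General_100_BIN2 NOT NULL,

    INDEX IX_Name NONCLUSTERED HASH ( [lastname], [firstname] )
        WITH (BUCKET_COUNT = 100000)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

-- Equality predicates on both key columns: the whole key can be hashed,
-- so this query can seek directly into one bucket of IX_Name.
SELECT EmpID FROM dbo.Employees
WHERE  lastname = 'Harrison' AND firstname = 'Josh';

-- Only part of the key is supplied: the hash value cannot be computed,
-- so this query cannot seek on IX_Name.
SELECT EmpID FROM dbo.Employees
WHERE  lastname = 'Harrison';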
We specify the number of buckets for a hash index using the BUCKET_COUNT option when we declare the index, as in the following table definition:
CREATE TABLE T1
(
[Name] varchar(32) not null PRIMARY KEY NONCLUSTERED HASH
WITH (BUCKET_COUNT = 100000),
[City] varchar(32) null,
[State_Province] varchar(32) null,
[LastModified] datetime not null
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
SQL Server rounds up the number we supply for the BUCKET_COUNT to the next power
of two, so it will round up a value of 50,000 to 65,536.
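One way to see the rounded-up value is to query the sys.hash_indexes catalog view, which reports the bucket count actually allocated (a minimal sketch):

SELECT OBJECT_NAME(object_id) AS table_name ,
       name                   AS index_name ,
       bucket_count
FROM   sys.hash_indexes;
-- For T1, declared with BUCKET_COUNT = 100000, bucket_count is reported as 131072 (2^17)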
The number of buckets for each hash index should be determined based on the
characteristics of the column on which we are building the index. It is recommended to choose
a number of buckets equal to or greater than the expected cardinality (i.e. the number of
unique values) of the index key column, so that there will be a greater likelihood that each
bucket's chain will point to rows with the same value for the index key column. In other
words, we want to try to make sure that two different values will never end up in the
same bucket. If there are fewer buckets than possible values, multiple values will have to
use the same bucket, i.e. a hash collision.
This can lead to long chains of rows and significant performance degradation of all DML
operations on individual rows, including SELECT and INSERT. On the other hand, be
careful not to choose a number that is too big because each bucket uses memory (8 bytes
per bucket). Having extra buckets will not improve performance but will simply waste
memory. As a secondary concern, it might also reduce the performance of index scans,
which will have to check each bucket for rows.
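The DMV being referred to below is sys.dm_db_xtp_hash_index_stats; a query along these lines returns the empty-bucket counts and chain-length statistics for each hash index in the current database:

SELECT OBJECT_NAME(hs.object_id) AS table_name ,
       i.name                    AS index_name ,
       hs.total_bucket_count ,
       hs.empty_bucket_count ,
       hs.avg_chain_length ,
       hs.max_chain_length
FROM   sys.dm_db_xtp_hash_index_stats AS hs
       JOIN sys.indexes AS i ON i.object_id = hs.object_id
                            AND i.index_id = hs.index_id;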
If this DMV returns a large average chain length, it indicates that many rows are hashed
to the same bucket. This could happen for the following reasons:
If the number of empty buckets is low or the average and maximum chain lengths are
similar, it is likely that the total BUCKET_COUNT is too low. This causes many different
index keys to hash to the same bucket.
If the BUCKET_COUNT is high or the maximum chain length is high relative to the
average chain length, it is likely that there are many rows with duplicate index key
values or there is a skew in the key values. All rows with the same index key value hash
to the same bucket, hence there is a long chain length in that bucket.
Conversely, short chain lengths along with a high empty bucket count are an indication of
a BUCKET_COUNT that is too high.
Range Indexes
Hash indexes are useful for relatively unique data that we can query with equality
predicates. However, if you don't know the cardinality, and so have no idea of the number
of buckets you'll need for a particular column, or if you know you'll be searching your
data based on a range of values, you should consider creating a range index instead of a
hash index.
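For illustration, here is a sketch of a table that declares range (NONCLUSTERED) indexes at creation time; the table and index names are hypothetical, and note that an indexed character column must use a BIN2 collation:

CREATE TABLE dbo.T_Orders
(
    [OrderID]   int NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    [OrderDate] datetime NOT NULL
        INDEX IX_OrderDate NONCLUSTERED,    -- range index for date-range predicates
    [CustName]  varchar(32) COLLATE Latin1_General_100_BIN2 NOT NULL
        INDEX IX_CustName NONCLUSTERED      -- range index on a BIN2 character column
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);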
Range indexes connect together all the rows of a table at their leaf level. Every row in a
table will be accessible by a pointer in the leaf. Range indexes are implemented using a
new data structure called a Bw-tree, originally envisioned and described by Microsoft
Research in 2011. A Bw-tree is a lock- and latch-free variation of a B-tree.
The Bw-tree
The general structure of a Bw-tree is similar to SQL Server's regular B-trees, except that
the index pages are not a fixed size, and once they are built they cannot be changed. Like
a regular B-tree page, each index page contains a set of ordered key values, and for each
value there is a corresponding pointer. At the upper levels of the index, on what are called
the internal pages, the pointers point to an index page at the next level of the tree, and
at the leaf level, the pointers point to a data row. Just like for in-memory OLTP hash
indexes, multiple data rows can be linked together. In the case of range indexes, rows that
have the same value for the index key will be linked.
One big difference between Bw-trees and SQL Server's B-trees is that, in the former,
a page pointer is a logical page ID (PID), instead of a physical page address. The PID
indicates a position in a mapping table, which connects each PID with a physical
memory address. Index pages are never updated; instead, they are replaced with a new
page and the mapping table is updated so that the same PID indicates a new physical
memory address.
Figure 4-5 shows the general structure of a Bw-tree, plus the Page Mapping Table.
Each index row in the internal index pages contains a key value, and a PID of a page at
the next level down. The index pages show the key values that the index references. Not
all the PID values are indicated in Figure 4-5, and the mapping table does not show all the
PID values that are in use.
The key value is the highest value possible on the page referenced. Note this is different
from a regular B-tree index, for which the index row stores the minimum value on the
page at the next level down. The leaf level index pages also contain key values but, instead
of a PID, they contain an actual memory address of a data row, which could be the first in
a chain of data rows, all with the same key value (these are the same rows that might also
be linked using one or more hash indexes).
Figure 4-5: The general structure of a Bw-tree, plus the page mapping table.
Another big difference between Bw-trees and SQL Server's B-trees is that, at the leaf
level, SQL Server keeps track of data changes using a set of delta values. As noted above,
index pages are never updated, they are just replaced with a new page. However, the leaf
pages themselves are not replaced for every change. Instead, each update to a leaf-level
index page, which can be an insert or delete of a key value on that page, produces a page
containing a delta record describing the change.
An UPDATE is represented by two new delta records, one for the DELETE of the original
value, and one for the INSERT of the new value. When SQL Server adds each delta record,
it updates the mapping table with the physical address of the page containing the newly
added delta record for the INSERT or DELETE operation.
Figure 4-6 illustrates this behavior. The mapping table is showing only a single page with
logical address P. The physical address in the mapping table originally was the memory
address of the corresponding leaf level index page, shown as page P. After we insert a new
row into the table, with index key value 50 (which we'll assume did not already occur in
the table's data), in-memory OLTP adds a delta record linked to page P, indicating the
insert of the new key, and the physical address of page P is updated to indicate the address
of this first delta record page.
Assume, then, that a separate transaction deletes from the table the only row with index
key value 48. In-memory OLTP must then remove the index row with key 48, so it creates
another delta record, and once again updates the physical address for page P.
Figure 4-6: The page mapping table entry for page P now points to the chain of delta records.
When searching through a range index, SQL Server must combine the delta records with
the base page, making the search operation a bit more expensive. However, not having to
completely replace the leaf page for every change gives us performance savings. As we'll
see in the later section, Consolidating delta records, eventually SQL Server will combine
the original page and chain of delta pages into a new base page.
Range index pages for memory-optimized tables all have a header area which contains
the following information:
• Right PID: the PID of the page to the right of the current page.
• Height: the vertical distance from the current page to the leaf.
• Page statistics: the count of delta records plus the count of records on the page.
In addition, both leaf and internal pages contain two or three fixed length arrays:
• Values: this is really an array of pointers. Each entry in the array is 8 bytes long. For internal pages, each entry contains the PID of a page at the next level and, for a leaf page, the entry contains the memory address for the first row in a chain of rows having equal key values. (Note that, technically, the PID could be stored in 4 bytes, but to allow the same values structure to be used for all index pages, the array allows 8 bytes per entry.)
• Keys: this is the array of key values. If the current page is an internal page, the key represents the first value on the page referenced by the PID. If the current page is a leaf page, the key is the value in the chain of rows.
• Offsets: this array exists only for pages of indexes with variable length keys. Each entry is 2 bytes and contains the offset where the corresponding key starts in the key array on the page.
The smallest pages are typically the delta pages, which have a header containing most
of the same information as in an internal or leaf page. However, delta page headers
don't have the arrays described for leaf or internal pages. A delta page contains only an
operation code (insert or delete) and a value, which is the memory address of the first
row in a chain of records. Finally, the delta page will also contain the key value for the
current delta operation. In effect, we can think of a delta page as being a mini-index page
holding a single element, whereas the regular index pages store an array of N elements.
Remember that the Bw-tree leaf pages contain only pointers to the first row of an index
chain, for each key value, not a pointer to every row in the table.
The statistics information in the page header for a leaf page keeps track of how much
space would be required to consolidate the delta records, and that information is adjusted
as each new delta record is added. The easiest way to visualize how a page split occurs is
to walk through an example. Figure 4-7 shows a representation of the original structure,
where Ps is the page to be split into pages P1 and P2, and Pp is its parent page, with a row
that points to Ps. Keep in mind that a split can happen at any level of an index, so it is not
specified whether Ps is a leaf page or an internal page. It could be either.
Figure 4-7: Attempting to insert a new row into a full index page.
Assume we have executed an INSERT statement that inserts a row with key value of 5 into
this table, so that 5 now needs to be added to the range index. The first entry in page Pp is
a 5, which means 5 is the maximum value that could occur on the page to which Pp points,
which is Ps. Page Ps doesn't currently have a value 5, but page Ps is where the 5 belongs.
However, the page Ps is full, so it is unable to add the key value 5 to the page, and it has
to split. The split operation occurs in one atomic operation consisting of two steps, as
described in the next two sections.
Figure 4-8: The rows from page Ps, plus the new key value 5, are split across two new pages, P1 and P2.
In the same atomic operation as splitting the page, SQL Server updates the page mapping
table to change the pointer to point to P1 instead of Ps. After this operation, page Pp
points directly to page P1; there is no pointer to page Ps, as shown in Figure 4-9.
Figure 4-9: The pointer from the parent points to the first new child page.
To create a pointer from Pp to page P2, SQL Server allocates a new parent page, Ppp, copies
into it all the rows from page Pp, and adds a new row pointing to page P1, holding the
maximum key value of the rows on P1, which is 3, as shown in Figure 4-10.
Figure 4-10: A new parent page, Ppp, is created, containing a new row that points to page P1.
In the same atomic operation as creating the new pointer, SQL Server then updates the
page mapping table to change the pointer from Pp to Ppp, as shown in Figure 4-11.
Figure 4-11: After the split is complete.
Again, to illustrate how a page merge works, we'll walk through a simple example, which assumes
we'll be merging a page, P, with its left neighbor, page Pln, that is, the one with smaller values.
Figure 4-12 shows a representation of the original structure where page Pp, the parent
page, contains a row that points to page P. Page Pln has a maximum key value of 8,
meaning that the row in page Pp that points to page Pln contains the value 8. We will
delete from page P the row with key value 10, leaving only one row remaining, with the
key value 9.
Figure 4-12: The original pages before the merge; the parent page Pp has rows pointing to Pln and to P.
The merge operation occurs in three atomic steps, as described over the
following sections.
In the first atomic step, SQL Server creates a new delta page recording the deletion of the row with key value 10, and a merge-delta page, DPm, indicating that page P is to be merged with its left neighbor, Pln.
In the same atomic step, SQL Server updates the pointer to page P in the page mapping
table to point to DPm. After this step, the entry for key value 10 in parent page Pp now
points to DPm.
Figure 4-13: The delta page and the merge-delta page are added to indicate a deletion.
In the second atomic step, SQL Server creates a new non-leaf page, Pp2, containing the rows from Pp except the entry that pointed to page Pln, since Pln will no longer exist once the merge completes. Once this is done, in the same atomic step, SQL Server updates the page mapping table entry pointing to page Pp to point to page Pp2. Page Pp is no longer reachable.
Figure 4-14: Pointers are adjusted to get ready for the merge.
Figure 4-15: After the merge is complete; the new page, Pnew, holds the remaining rows from Pln and P.
Summary
Memory-optimized tables comprise individual rows connected together by indexes. This
chapter described the two index structures available: hash indexes and range indexes.
Hash indexes have a fixed number of buckets, each of which holds a pointer to a chain of
rows. Ideally, all the rows in a single bucket's chain will have the same key value, and the
correct choice for the number of buckets, which is declared when the table is created, can
help ensure this.
Range indexes are stored as Bw-trees, which are similar to SQL Server's traditional B-trees
in some respects, but very different in others. The internal pages in Bw-trees contain
key values and pointers to pages at the next level. The leaf level of the index contains
pointers to chains of rows with matching key values. Just like for our data rows, index
pages are never updated in place. If an index page needs to add or remove a key value, a
new page is created to replace the original.
When choosing the correct set of indexes for a table at table creation time, evaluate each
indexed column to determine the best type of index. If the column stores lots of duplicate
values, or queries need to search the column by a range of values, then a range index is
the best choice. Otherwise, choose a hash index.
In the next chapter we'll look at how concurrent operations are processed and how
transactions are managed and logged.
Additional Resources
Guidelines for Using Indexes on Memory-Optimized Tables:
http://msdn.microsoft.com/en-gb/library/dn133166.aspx.
Chapter 5: Transaction Processing
Regardless of whether we access disk-based tables or memory-optimized tables,
SQL Server must manage concurrent transactions against these tables in a manner
that preserves the ACID properties of every transaction. Every transaction runs in a
particular transaction isolation level, which determines the degree to which it is isolated
from the effects of changes made by the concurrent transactions of other users.
Transaction Scope
SQL Server supports several different types of transaction, in terms of how we define the
beginning and end of the transaction, and, when accessing memory-optimized tables, the
transaction type can affect the isolation levels that SQL Server supports. The two default
types of transaction are explicit transactions, delimited by BEGIN TRANSACTION and
COMMIT (or ROLLBACK), and autocommit transactions, single statements that are not run
inside an explicit transaction.
To support higher transaction isolation levels, SQL Server must be able to guarantee additional properties for a transaction, including:
• Read stability: a transaction, TxA, reads some version, v1, of a record during processing. To achieve read stability, SQL Server must guarantee that v1 is still the version visible to TxA as of the end of the transaction; that is, v1 has not been replaced by another committed version, v2. SQL Server enforces read stability either by acquiring a shared lock on v1, to prevent changes, or by validating that no other transaction updated v1 before TxA committed.
• Phantom avoidance: to avoid phantoms, SQL Server must guarantee that repeating any scan performed by the transaction would not return new versions added before the end of the transaction. It can enforce this either by locking the scanned range, or by rescanning before the transaction commits to check for new versions.
If these properties are not enforced, certain read phenomena can occur, such as
non-repeatable reads and phantoms. In some situations, dirty reads can also occur,
although dirty reads are not possible when working with memory-optimized tables in
SQL Server.
Isolation levels are defined in terms of read phenomena. In other words, transaction TxA's
isolation level determines which read phenomena are acceptable and, therefore, what
measures SQL Server must take to prevent changes made by other transactions from
introducing these phenomena into the results of transaction TxA.
In a pessimistic concurrency model, such as when accessing disk-based tables, SQL Server
acquires locks to prevent "interference" between concurrent transactions, in this way
avoiding these read phenomena. Generally speaking, the more restrictive the isolation
level (i.e. the fewer read phenomena it allows) the more restrictive SQL Server's locking
regime will be, and the higher the risk of blocking, as sessions wait to acquire locks.
By contrast, SQL Server regulates all access of data in memory-optimized tables using
completely optimistic MVCC. SQL Server does not use locking or latching to provide
transaction isolation, and so data operations never wait to acquire locks. Instead, SQL
Server assumes that concurrent transactions won't interfere and then performs validation
checks once a transaction issues a commit to ensure that it obeys the required isolation
properties. If it does, then SQL Server will confirm the commit.
SQL Server still supports multiple levels of transaction isolation when accessing memory-
optimized tables, but there are differences in the way the isolation levels are guaranteed
when accessing disk-based versus memory-optimized tables.
First, for comparative purposes, let's review briefly the transaction isolation levels that
SQL Server supports when accessing disk-based tables, and then contrast that to the
isolation levels we can use with memory-optimized tables and how they work.
• READ UNCOMMITTED: allows dirty reads, non-repeatable reads and phantom reads (this level is not recommended and we won't discuss it further).
• READ COMMITTED: prevents dirty reads, but allows non-repeatable reads and phantom reads. To support the standard implementation of this isolation level, for disk-based tables, transactions must acquire a shared read lock to read a row, and release it as soon as the read is complete (although the transaction as a whole may still be incomplete), and so can't perform dirty reads. Transactions hold exclusive locks until the end of the transaction.
• SNAPSHOT: guarantees that data read by any statement in a transaction will be the transactionally-consistent version of the data that existed at the start of the transaction. In other words, the statements in a snapshot transaction see a snapshot of the committed data as it existed at the start of the transaction; any modifications made after that are invisible to it. It does not prevent non-repeatable reads or phantoms, but they won't appear in the results, so this level has the outward appearance of SERIALIZABLE. For disk-based tables, SQL Server implements this level using row versioning in tempdb.
• REPEATABLE READ: prevents dirty reads and non-repeatable reads, but allows phantom reads. Transactions take shared locks and exclusive locks until the end of the transaction to guarantee read stability.
• SERIALIZABLE: prevents all read phenomena. To avoid phantoms, SQL Server adopts a special locking mechanism, using key-range locks, and holds all locks until the end of the transaction, so that other transactions can't insert new rows into those ranges.
In contrast, these are the isolation levels supported when accessing memory-optimized tables:
• SNAPSHOT: a transaction running in snapshot isolation will always see the most recent committed data. It does not guarantee read stability or phantom avoidance, though queries running in this isolation level won't see any non-repeatable reads or phantoms.
• REPEATABLE READ: includes the guarantees given by SNAPSHOT, plus read stability. Every read operation in the transaction is repeatable up to the end of the transaction.
• SERIALIZABLE: includes the guarantees given by REPEATABLE READ, plus phantom avoidance, so the transaction behaves as if it had run in complete isolation.
Since in-memory OLTP uses a completely optimistic concurrency model, SQL Server
implements each of the levels very differently than it does for disk-based table access,
without using any locks or latches.
When accessing memory-optimized tables, SQL Server ensures read stability if required
by the isolation level, by validating that a row version read by a query in transaction T1,
during processing, has not been modified by another transaction, before T1 committed.
It ensures phantom avoidance, as required, by rescanning during transaction validation,
to check for new "phantom" row versions that were inserted before T1 committed. We'll
discuss what happens when violations occur, a little later in the chapter.
All of the examples in this chapter will access memory-optimized tables from
interpreted T-SQL.
Any transaction that we execute from interpreted T-SQL can access both disk-based and
memory-optimized tables (whereas a natively compiled stored procedure can only access
memory-optimized tables), and so we refer to it as a cross-container transaction.
There are strict rules that govern the isolation level combinations we can use when
accessing disk-based and memory-optimized tables, in order that SQL Server can
continue to guarantee transactional consistency. Most of the restrictions relate to the fact
that operations on disk-based tables and operations on memory-optimized tables each
have their own transaction sequence number, even if they are accessed in the same T-SQL
transaction. You can think of a cross-container transaction as having two sub-transactions
within the larger transaction: one sub-transaction is for the disk-based tables and
one is for the memory-optimized tables.
Table 5-1 lists which isolation levels can be used together in a cross-container transaction.
Table 5-1: Isolation levels in cross-container transactions.

Disk-based tables    Memory-optimized tables            Recommendations
READ COMMITTED       REPEATABLE READ / SERIALIZABLE     This combination can be used during data
                                                        migration and for memory-optimized table
                                                        access in interop mode (not in a natively
                                                        compiled procedure).
The following sections will explain the restrictions, and the reasons for them,
with examples.
USE HKDB
GO
IF EXISTS (SELECT * FROM sys.objects WHERE name='T1')
DROP TABLE [dbo].[T1]
GO
CREATE TABLE T1
(
[Name] varchar(32) not null PRIMARY KEY NONCLUSTERED HASH
WITH (BUCKET_COUNT = 100000),
[City] varchar(32) null,
[State_Province] varchar(32) null,
[LastModified] datetime not null
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
GO
Listing 5-1: Creating the memory-optimized table, T1, in the HKDB database.
Open a new query window in SSMS, and start an explicit transaction accessing a
memory-optimized table, as shown in Listing 5-2.
USE HKDB;
BEGIN TRAN;
SELECT *
FROM [dbo].[T1]
COMMIT TRAN;
By default, this transaction will run in the READ COMMITTED isolation level, which is
the standard isolation level for most SQL Server transactions, and guarantees that the
transaction will not read any dirty (uncommitted) data. If a transaction running under
this default isolation level tries to access a memory-optimized table, it will generate
the following error message, since READ COMMITTED is unsupported for memory-
optimized tables:
Accessing memory optimized tables using the READ COMMITTED isolation level is supported only
for autocommit transactions. It is not supported for explicit or implicit transactions.
Provide a supported isolation level for the memory optimized table using a table hint, such
as WITH (SNAPSHOT).
As the message suggests, the transaction needs to specify a supported isolation level,
using a table hint. For example, Listing 5-3 specifies the snapshot isolation level. This
combination, READ COMMITTED for accessing disk-based tables and SNAPSHOT for
memory-optimized, is the one that most cross-container transactions should use.
However, alternatively, we could also use the WITH (REPEATABLEREAD) or WITH
(SERIALIZABLE) table hints, if required.
USE HKDB;
BEGIN TRAN;
SELECT * FROM [dbo].[T1] WITH (SNAPSHOT);
COMMIT TRAN;
Listing 5-3: Explicit transaction using a table hint to specify snapshot isolation.
SQL Server does support READ COMMITTED isolation level for auto-commit (single-
statement) transactions, so we can run Listing 5-4, inserting three rows into our
table T1 successfully.
INSERT [dbo].[T1]
( Name, City, LastModified )
VALUES ( 'Jane', 'Helsinki', CURRENT_TIMESTAMP ),
( 'Susan', 'Vienna', CURRENT_TIMESTAMP ),
( 'Greg', 'Lisbon', CURRENT_TIMESTAMP );
Listing 5-4: READ COMMITTED isolation is supported only for auto-commit transactions.
Likewise, for cross-container transactions, SQL Server supports the snapshot
implementation of READ COMMITTED, i.e. READ_COMMITTED_SNAPSHOT, only for auto-commit
transactions, and then only if the query does not access any disk-based tables.
For snapshot isolation, all operations need to see the versions of the data that existed as
of the beginning of the transaction. For SNAPSHOT transactions, the beginning of the
transaction is considered to be when the first table is accessed. In a cross-container
transaction, however, since the sub-transactions can each start at a different time, another
transaction may have changed data between the start times of the two sub-transactions.
The cross-container transaction then will have no one point in time on which to base the
snapshot, so using transaction isolation level SNAPSHOT is not allowed.
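Explicit transactions run at session-level REPEATABLE READ or SERIALIZABLE are similarly restricted: the memory-optimized side must still be accessed under snapshot isolation. For example, something along these lines (a sketch, assuming the T1 table created in Listing 5-1) produces the error shown below:

USE HKDB;
GO
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
GO
BEGIN TRAN;
SELECT *
FROM [dbo].[T1];
COMMIT TRAN;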
Listing 5-6: Attempting to access a memory-optimized table using REPEATABLE READ isolation.
The following transactions must access memory optimized tables and natively compiled stored
procedures under snapshot isolation: RepeatableRead transactions, Serializable transactions,
and transactions that access tables that are not memory optimized in RepeatableRead or
Serializable isolation.
Table 5-2 shows an example of running the two cross-container transactions, Tx1
and Tx2 (both of which we can think of as having two "sub-transactions," one for
accessing disk-based and one for accessing memory-optimized tables). It illustrates why
such a transaction cannot be guaranteed to behave as if it had run in complete isolation across both types of table.
In Table 5-2, RHk# indicates a row in a memory-optimized table, and RSql# indicates
a row in a disk-based table. Transaction Tx1 reads a row from a memory-optimized table
first. SQL Server acquires no locks. Now assume the second transaction, Tx2, starts after
Tx1 reads RHk1. Tx2 reads and updates RSql1 and then reads and updates RHk1, then
commits. When Tx1 then reads the row from the disk-based table, it will have a
set of values for the two rows that could never have existed if the transactions had run
in isolation, i.e. if the transactions were truly serializable, and so this combination is
not allowed.
Table 5-2: Two concurrent cross-container transactions.

Time   Tx1                                      Tx2
1      BEGIN SQL/in-memory sub-transactions
2      Read RHk1
3                                               BEGIN SQL/in-memory sub-transactions
4                                               Read RSql1 and update it to RSql2
5                                               Read RHk1 and update it
6                                               COMMIT
7      Read RSql2
Since snapshot isolation is the recommended isolation level in most cases, a new
database property is available to automatically upgrade the isolation to SNAPSHOT,
for all operations on memory-optimized tables, if the T-SQL transaction is running in
a lower isolation level, i.e. READ COMMITTED, which is SQL Server's default (or READ
UNCOMMITTED, which is not recommended). Listing 5-8 shows an example of setting
this option.
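The option is MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT, set with an ALTER DATABASE statement; for the HKDB database used in these examples, that would look like this:

ALTER DATABASE HKDB
    SET MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT = ON;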
Listing 5-8: Setting the database option to elevate isolation level to SNAPSHOT.
We can verify whether this option has been set in two ways, shown in Listing 5-9,
either by inspecting the sys.databases catalog view or by querying the
DATABASEPROPERTYEX function.
SELECT is_memory_optimized_elevate_to_snapshot_on
FROM sys.databases
WHERE name = 'HKDB';
SELECT DATABASEPROPERTYEX('HKDB',
'IsMemoryOptimizedElevateToSnapshotEnabled');
Listing 5-9: Verifying if the database has been set to elevate the isolation level to SNAPSHOT.
Otherwise, as demonstrated earlier, simply set the required isolation level on the fly,
using a table hint. We should also consider that having accessed a table in a cross-
container transaction using an isolation level hint, a transaction should continue to use
that same hint for all subsequent access of the table, though this is not enforced. Using
different isolation levels for the same table, whether a disk-based table or memory-
optimized table, will usually lead to failure of the transaction.
Start two simple transactions, each doing an INSERT into a memory-optimized table (a sketch of such a setup follows below), and then run the query in Listing 5-10.
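For example, in two separate sessions, run something like the following and leave both transactions open (a sketch; the inserted values are arbitrary):

-- Session 1
USE HKDB;
BEGIN TRAN;
INSERT INTO [dbo].[T1] WITH ( SNAPSHOT )
        ( Name, City, LastModified )
VALUES  ( 'Bob', 'Basel', CURRENT_TIMESTAMP );

-- Session 2
USE HKDB;
BEGIN TRAN;
INSERT INTO [dbo].[T1] WITH ( SNAPSHOT )
        ( Name, City, LastModified )
VALUES  ( 'Kim', 'Kyoto', CURRENT_TIMESTAMP );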
SELECT xtp_transaction_id ,
transaction_id ,
session_id ,
begin_tsn ,
end_tsn ,
state_desc
FROM sys.dm_db_xtp_transactions
WHERE transaction_id > 0;
GO
The output should look similar to that shown in Figure 5-1, with two transactions.
When the first statement accessing a memory-optimized table is executed, SQL Server
obtains a transaction id for the T-SQL part of the transaction (transaction_id) and a
transaction id for the in-memory OLTP portion (xtp_transaction_id).
The starting timestamp (begin_tsn) of 240 indicates when the transaction began, relative to the serialization order
of the database. While active, it will only be able to access rows that have a Begin-Ts of
less than or equal to 240 and an End-Ts of greater than 240.
Open a window in SSMS and execute Listing 5-11 (don't commit the transaction yet).
USE HKDB;
BEGIN TRAN Tx1;
DELETE FROM dbo.T1 WITH ( SNAPSHOT )
WHERE Name = 'Greg';
UPDATE dbo.T1 WITH ( SNAPSHOT )
SET City = 'Perth'
WHERE Name = 'Jane';
-- COMMIT TRAN Tx1
During the processing phase, SQL Server links the new <Jane, Perth> row into the index
structure and marks the <Greg, Lisbon> and <Jane, Helsinki> rows as deleted. Figure 5-2 shows
what the rows will look like at this stage, within our index structure (with hash indexes on
Name and City; see Chapter 4).
Figure 5-2: The rows and their hash indexes on Name and City after Tx1's modifications; the <Greg, Lisbon> row now carries a Begin-Ts of 200 and Tx1's transaction-id in its End-Ts field.
I've just used Tx1 for the transaction-id, but you can use Listing 5-10 to find the real
values of xtp_transaction_id.
Write-write conflicts
What happens if another transaction, TxU, tries to update Jane's row (remember Tx1 is
still active)?
USE HKDB;
BEGIN TRAN TxU;
UPDATE dbo.T1 WITH ( SNAPSHOT )
SET City = 'Melbourne'
WHERE Name = 'Jane';
COMMIT TRAN TxU
Listing 5-12: TxU attempts to update a row while Tx1 is still uncommitted.
As discussed in Chapter 3, TxU sees Tx1's transaction-id in the <Jane, Helsinki> row
and, because SQL Server optimistically assumes Tx1 will commit, immediately aborts
TxU, raising a conflict error.
Read-write conflicts
If a query tries to update a row that has already been updated by an active transaction,
SQL Server generates an immediate "update conflict" error. However, SQL Server does
not catch most other isolation level errors until the transaction enters the validation
phase. Remember, no transaction acquires locks so it can't block other transactions from
accessing rows. We'll discuss the validation phase in more detail in the next section,
but it is during this phase that SQL Server will perform checks to make sure that any
changes made by concurrent transactions do not violate the specified isolation level. Let's
continue our example, and see the sort of violation that can occur.
Our original Tx1 transaction, which started at timestamp 240, is still active, and let's now
start two other transactions that will read the rows in table T1:
Tx2 an auto-commit, single-statement SELECT that simply reads the rows in table T1; it starts at a timestamp of 243.
Tx3 an explicit transaction that reads a row and then updates another row based on
the value it read in the SELECT; it starts at a timestamp of 246.
Tx2 starts before Tx1 commits, and Tx3 starts before Tx2 commits. Figure 5-3 shows the
rows that exist after each transaction commits.
Figure 5-3: Version visibility after each transaction ends, but before validation.
When Tx1 starts at timestamp 240, three rows are visible, and since Tx1 does not commit
until timestamp 250, after Tx2 and Tx3 have started, those are the rows all three of the
transactions see. After Tx1 commits, there will only be two rows visible, and the City
value for Jane will have changed. When Tx3 commits, it will attempt to change the City
value for Susan to Helsinki.
In a second query window in SSMS, we can run our auto-commit transaction, Tx2, which
simply reads the T1 table.
USE HKDB;
SELECT Name ,
City
FROM T1;
Tx2's session is running in the default isolation level, READ COMMITTED, but as described
previously, for a single-statement transaction accessing a memory-optimized table, we
can think of Tx2 as running in snapshot isolation level, which for a single-statement
SELECT will give us the same behavior as READ COMMITTED.
Tx2 started at timestamp 243, so it will be able to read rows that existed at that time.
It will not be able to access <Greg, Beijing>, for example, because that row was valid
between timestamps 100 and 200. The row <Greg, Lisbon> is valid starting at timestamp
200, so transaction Tx2 can read it, but it has a transaction-id in End-Ts because Tx1 is
currently deleting it. Tx2 will check the global transaction table and see that Tx1 has not
committed, so Tx2 can still read the row. <Jane, Perth> is the current version of the row
with "Jane," but because Tx1 has not committed, Tx2 follows the pointer to the previous
row version, and reads <Jane, Helsinki>.
Tx3 is an explicit transaction that starts at timestamp 246. It will run using REPEATABLE
READ isolation, and read one row and update another based on the value read, as shown
in Listing 5-14 (again, don't commit it yet).
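A sketch of what Tx3 might look like, assuming the same dbo.T1 table and the REPEATABLEREAD table hint (the local variable and the City column's length are illustrative):

USE HKDB;
BEGIN TRAN Tx3;
    DECLARE @City varchar(32);
    -- read the City value for Jane under REPEATABLE READ
    SELECT @City = City
    FROM dbo.T1 WITH ( REPEATABLEREAD )
    WHERE Name = 'Jane';
    -- update Susan's row with the value just read
    UPDATE dbo.T1 WITH ( REPEATABLEREAD )
    SET City = @City
    WHERE Name = 'Susan';
-- COMMIT TRAN Tx3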
Listing 5-14: Tx3 reads the value of City for "Jane" and updates the "Susan" row with this value.
In Tx3, the SELECT will read the row <Jane, Helsinki> because that row is still accessible
as of Tx3's starting timestamp of 246 (Tx1 has not yet committed). Tx3 will then delete the
<Susan, Bogota> row and insert the row <Susan, Helsinki>.
What happens next depends on which of Tx1 or Tx3 commits first. In the schedule shown
in Figure 5-3, Tx1 commits first. When Tx3 tries to commit after Tx1 has committed,
SQL Server will detect during the validation phase that the <Jane, Helsinki> row has been
updated by another transaction. This is a violation of the requested REPEATABLE READ
isolation, so the commit will fail and transaction Tx3 will roll back.
To see this in action, commit Tx1, and then try to commit Tx3. You should see error 41305:
"The current transaction failed to commit due to a repeatable read validation failure."
So Tx1 commits and Tx3 aborts and, at this stage, the only two rows visible will be
<Susan, Vienna> and <Jane, Perth>.
If Tx3 had committed before Tx1, then both transactions would succeed, and the final
rows visible would be <Jane, Perth> and <Susan, Helsinki>, as shown in Figure 5-3.
Let's now take a look in a little more detail at other isolation level violations that
may occur in the validation stage, and at the other actions SQL Server performs
during this phase.
Validation Phase
Once a transaction issues a commit and SQL Server generates the commit timestamp, but
prior to the final commit of transactions involving memory-optimized tables, SQL Server
performs a validation phase. As discussed briefly in Chapter 3, this phase consists broadly
of the following three steps:
1. Validate the changes made by Tx1, verifying that there are no isolation level violations.
2. Wait for any commit dependencies to clear, i.e. for the count of transactions on which Tx1 depends to drop to zero.
3. Log the changes.
Once it logs the changes (which are then guaranteed to be durable), SQL Server marks the
transaction as committed in the global transaction table, and then clears the dependencies of
any transactions that are dependent on Tx1.
Note that the only waiting that a transaction on memory-optimized tables will experience
is during this phase. There may be waiting for commit dependencies, which are usually
very brief, and there may be waiting for the write to the transaction log. Logging for
memory-optimized tables is much more efficient than logging for disk-based tables (as
we'll see in Chapter 6), so these waits can also be very short.
The following sections review each of these three steps in a little more detail.
During transaction processing, the in-memory OLTP engine will, depending on the
isolation level, keep track of the read-set and write-set for each transaction; these are sets
of pointers to the rows that have been read or written, respectively. SQL Server will use
the read-set to check for non-repeatable reads (we'll cover the write-set in the logging
section, shortly). Also, depending on the isolation level, it will keep track of a scan-set,
which is information about the predicate used to access a set of records. SQL Server can
use this to check for phantoms.
Table 5-3 summarizes which isolation levels require SQL Server to maintain a read-set or
scan-set, or both. Note that for snapshot isolation, it doesn't matter what happens to the
data a transaction has read, it only matters that our transaction sees the appropriate data,
as of the beginning of the transaction.
Isolation level      Read-set   Scan-set
SNAPSHOT             NO         NO
REPEATABLE READ      YES        NO
SERIALIZABLE         YES        YES
Table 5-3: Read-set and scan-set requirements for each isolation level.
As an example, consider two concurrent transactions that each insert a row with the same primary key value:
1. Tx1: BEGIN TRAN.
2. Tx1 inserts a row.
3. Tx2 starts and inserts a row with the same primary key value.
4. Tx2: COMMIT TRAN.
5. Tx1: COMMIT TRAN.
During validation, Error 41325 ("The current transaction failed to commit due to a serializable validation failure") is generated, because we can't have two rows with the same primary key value, and Tx1 is aborted and rolled back.
If a transaction running in REPEATABLE READ isolation has read a row that another transaction then updates and commits first, validation raises Error 41305: The current transaction failed to commit due to a repeatable read validation failure.
The transaction will abort. We saw an example of this earlier, in the section on
read-write conflicts.
A similar timeline illustrates a serializable validation failure: Tx1 begins (BEGIN TRAN), a concurrent transaction makes a change that violates Tx1's SERIALIZABLE semantics and commits, and when Tx1 then issues its COMMIT TRAN, Error 41325 is generated during validation and Tx1 is rolled back.
Since SQL Server assumes that transactions entering validation will actually commit, it generates the
logical end timestamps as soon as the commit is issued, which marks the start of the
validation phase. These rows are therefore visible to any transaction that started after
this time. No transaction will ever be able to see the effects of a transaction that has not
entered its validation phase.
If a transaction Tx1 reads rows that Tx2 has updated, and Tx2 is still in the validation
phase, then Tx1 will take a commit dependency on Tx2 and increment an internal
counter that keeps track of the number of commit dependencies for Tx1. In addition, Tx2
will add a pointer from Tx1 to a list of dependent transactions that Tx2 maintains.
In addition, result sets are not returned to the client until all dependencies have
cleared. This prevents the client from retrieving uncommitted data.
Error 41301: A previous transaction that the current transaction took a dependency on has
aborted, and the current transaction can no longer commit.
Note that Tx1 can only acquire a dependency on Tx2 when Tx2 is in the validation
or post-processing phase and, because these phases are typically extremely short,
commit dependencies will be quite rare in a true OLTP system. If you want to be able to
determine if you have encountered such dependencies, you can monitor two extended
events. The event dependency_acquiredtx_event will be raised when Tx1 takes a
dependency on Tx2, and the event waiting_for_dependenciestx_event will be
raised when Tx1 has explicitly waited for a dependency to clear.
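Before building an event session on these events, you can confirm their exact names and owning package on your build by querying the extended events metadata; a minimal sketch:

SELECT p.name AS package_name ,
       o.name AS event_name
FROM sys.dm_xe_objects AS o
     JOIN sys.dm_xe_packages AS p
         ON o.package_guid = p.guid
WHERE o.object_type = 'event'
  AND o.name IN ( 'dependency_acquiredtx_event',
                  'waiting_for_dependenciestx_event' );
GO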
Transactions track all of their changes in the write-set, which is basically a list of
DELETE/INSERT operations with pointers to the version associated with each operation.
This write-set forms the content of the log for the transaction. A transaction normally
generates only a single log record, containing its ID and commit timestamp and the
versions of all records it deleted or inserted. There will not be separate log records for
each row affected, as there are for disk-based tables. However, there is an upper limit on
the size of a log record, and if a transaction on memory-optimized tables exceeds the
limit, there can be multiple log records generated.
Figure 5-5 shows the write-set for transaction Tx1, from our previous example in this
chapter, in the green box.
Figure 5-5: The write-set for transaction Tx1 (begin timestamp 240, end timestamp 250); in the hash index chains on Name and City, the <Greg, Lisbon> row now shows a Begin-Ts of 200 and an End-Ts of 250.
The final step in the validation process is to go through the linked list of dependent
transactions and reduce their dependency counters by one. Once this validation phase is
finished, the only reason that this transaction might fail is due to a log write failure. Once
the log record has been hardened to storage, the state of the transaction is changed to
committed in the global transaction table.
Post-processing
The final phase is the post-processing, which is sometimes referred to as commit
processing, and is usually the shortest. The main operations are to update the timestamps
of each of the rows inserted or deleted by this transaction.
For a DELETE operation, set the row's End-Ts value to the commit timestamp of the
transaction, and clear the type flag on the row's End-Ts field to indicate it is really a
timestamp, and not a transaction-ID.
For an INSERT operation, set the row's Begin-Ts value to the commit timestamp of
the transaction and clear the type flag.
If the transaction failed or was explicitly rolled back, inserted rows will be marked as
garbage and deleted rows will have their end-timestamp changed back to infinity.
The actual unlinking and deletion of old row versions is handled by the garbage collection
system. This final step of removing any unneeded or inaccessible rows is not always done
immediately and may be handled either by user threads, once a transaction completes, or
by a completely separate garbage collection thread.
Chapter 6 covers cleanup of checkpoint files, a completely separate process that is also referred to as
"garbage collection."
The garbage collection process for stale row versions in memory-optimized tables is
analogous to the version store cleanup that SQL Server performs when transactions
use one of the snapshot-based isolation levels, when accessing disk-based tables. A big
difference though is that the cleanup is not done in tempdb because the row versions are
not stored there, but in the in-memory table structures themselves.
To determine which rows can be safely deleted, the in-memory OLTP engine keeps track
of the timestamp of the oldest active transaction running in the system, and uses this
value to determine which rows are potentially still needed. Any rows that are not valid as
of this point in time, in other words any rows with an End-Ts timestamp that is earlier
than this time, are considered stale. Stale rows can be removed and their memory can be
released back to the system.
If, while scanning an index during a data modification operation (all index access on
memory-optimized tables is considered to be scanning), a user thread encounters a stale
row version, it will either mark the row as expired, or unlink that version from the
current chain and adjust the pointers. For each row it unlinks, it will also decrement the
reference count in the row header area (reflected in the IdxLinkCount value).
When a user thread completes a transaction, it adds information about the transaction to
a queue of transactions to be processed by the idle worker thread. Each time the garbage
collection process runs, it processes the queue of transactions, and determines whether
the oldest active transaction has changed.
It moves the transactions that have committed into one or more "worker" queues, sorting
the transactions into "generations" according to whether they committed before or
after the oldest active transaction. (We can view the transactions in each generation using
the sys.dm_db_xtp_gc_cycle_stats DMV, for which see Chapter 8.) It groups the
rows associated with transactions that committed before the oldest active transaction
into "work items," each consisting of a set of 16 "stale" rows that are ready for removal.
The final act of a user thread, on completing a transaction, is to pick up one or more work
items from a worker queue and perform garbage collection, i.e. free the memory used by
the rows making up the work items.
The idle worker thread will dispose of any stale rows that were eligible for garbage
collection, but not accessed by a user transaction, as part of what is termed a "dusty
corner" scan. Every row starts with a reference value of 1, so the row can be referenced by
the garbage collection mechanism even if the row is no longer connected to any indexes.
The garbage collector process is considered the "owner" of the initial reference.
The garbage collection thread processes the queue of completed transactions about once
a minute, but the system can adjust the frequency internally, based on the number of
completed transactions waiting to be processed. As noted above, each work item it adds
to the worker queue currently consists of a set of 16 rows, but that number is subject
to change in future versions. These work items are distributed across multiple worker
queues, one for each CPU used by SQL Server.
The DMV sys.dm_db_xtp_index_stats has a row for each index on each memory-
optimized table, and the column rows_expired indicates how many rows have
been detected as being stale during scans of that index. There is also a column called
rows_expired_removed that indicates how many rows have been unlinked from
that index. As mentioned above, once rows have been unlinked from all indexes on
a table, they can be removed by the garbage collection thread. So you will not see the
rows_expired_removed value going up until the rows_expired counters have
been incremented for every index on a memory-optimized table.
The query in Listing 5-15 allows us to observe these values. It joins the sys.dm_db_xtp_
index_stats DMV with the sys.indexes catalog view to be able to return the name
of the index.
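A sketch of such a query, filtered here to the T1 sample table used earlier in this chapter (adjust the table name to suit):

SELECT i.name AS index_name ,
       s.index_id ,
       s.scans_started ,
       s.rows_returned ,
       s.rows_expired ,
       s.rows_expired_removed
FROM sys.dm_db_xtp_index_stats AS s
     JOIN sys.indexes AS i
         ON s.object_id = i.object_id
        AND s.index_id = i.index_id
WHERE OBJECT_NAME(s.object_id) = 'T1';
GO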
Depending on the volume of data changes and the rate at which new versions are
generated, SQL Server can be using a substantial amount of memory for old row versions
and we need to make sure that our system has enough memory available. I'll tell you
more about memory management for a database supporting memory-optimized tables in
Chapter 8.
Summary
This chapter contains a lot of detail on the transaction isolation levels that SQL Server
supports when accessing memory-optimized tables, and also on the valid combination of
levels for cross-container transactions, which can access both disk-based and memory-
optimized tables. In most cases, our cross-container transactions will use standard
READ COMMITTED for accessing disk-based tables, and SNAPSHOT isolation for memory-
optimized tables, set either via a table hint or using the MEMORY_OPTIMIZED_ELEVATE_
TO_SNAPSHOT database property for that database.
In the MVCC model, no transaction acquires locks, and no transaction can prevent
another transaction reading, or attempting to modify, rows that it is currently accessing.
Due to the optimistic model, SQL Server will raise an immediate conflict if one trans-
action tries to modify a row that another active transaction is already modifying.
However, it will only detect other read-write conflicts during a validation phase which
occurs after a transaction issues a commit. We investigated the sort of violations that
can occur during validation, depending on the isolation levels being used, and we also
considered what happens during other phases of the validation cycle, such as resolving
commit dependencies and hardening the log records to disk. Finally, we discussed the
cooperative garbage collection system that disposes of stale rows which are no longer
visible to any transactions.
We're now ready to take a closer look at the processes by which in-memory OLTP writes
to durable storage, namely the CHECKPOINT process and the transaction logging process.
Additional Resources
General background on isolation levels:
http://en.wikipedia.org/wiki/Isolation_(database_systems).
Chapter 6: Logging, Checkpoint, and Recovery
SQL Server must ensure transaction durability for memory-optimized tables, so that
it can guarantee to recover to a known state after a failure. In-memory OLTP achieves
this by having both the checkpoint process and the transaction logging process write to
durable storage.
The information that SQL Server writes to disk consists of transaction log streams and
checkpoint streams:
Log streams contain the changes made by committed transactions logged as insertion
and deletion of row versions.
Checkpoint streams come in two varieties: data streams, which contain all row versions
inserted during a given timestamp interval, and delta streams, which are associated with a
particular data stream and contain a list of integers indicating which versions in the
corresponding data stream have been deleted.
The combined contents of the transaction log and the checkpoint streams are sufficient
to allow SQL Server to recover the in-memory state of memory-optimized tables to a
transactionally-consistent point in time.
Although the overall requirement for the checkpoint and transaction logging process to
write to durable storage is no different than for normal disk-based tables, for in-memory
tables the mechanics of these processes are rather different, and often much more
efficient, as we'll discuss throughout this chapter.
Though not covered in this book, in-memory OLTP is also integrated with the
AlwaysOn Availability Group feature, and so supports fail-over and recovery to
highly available replicas.
Transaction Logging
The log streams contain information about all versions inserted and deleted by transac-
tions against in-memory tables. SQL Server writes the log streams to the regular
SQL Server transaction log, but in-memory OLTP's transaction logging is designed to
be more scalable and higher performance than standard logging for disk-based tables.
Standard logging for disk-based tables must write its log records, in strict Log Sequence
Number (LSN) order, into a single log stream, which can become a scaling bottleneck.
By contrast, in-memory OLTP relies solely on the transaction end timestamps (End-Ts,
see Chapter 3) to determine the serialization order, so it is designed to support multiple,
concurrently-generated log streams per database. In theory, this removes the potential
scaling bottleneck. However, for SQL Server 2014, the in-memory OLTP integration
with SQL Server makes use of only a single log stream per database, because SQL Server
supports only one log per database. In other words, SQL Server currently treats the log as
one logical file even if multiple physical log files exist. In current testing, this has not been
a problem because in-memory OLTP generates much less log data and fewer log writes
compared with operations on disk-based tables.
Another critical reason for this efficiency is that in-memory OLTP never writes to disk
any log records associated with uncommitted transactions. Let's look at this point in a
little more detail, and contrast it to the logging behavior for disk-based tables.
During a modification to a disk-based table, SQL Server constructs log records describing
the change as the modification proceeds, writing them to the log buffer in memory,
before it modifies the corresponding data pages in memory. When a transaction issues
a commit, SQL Server flushes the log buffer to disk, and this will also flush log records
relating to any concurrent, as-yet-uncommitted transactions. However, it doesn't write
the data pages till later, on checkpoint. If SQL Server crashes after a transaction, T1,
commits but before a checkpoint, the log contains the redo information needed to persist
the effects of T1 during recovery.
By contrast, for memory-optimized tables, the in-memory OLTP engine only constructs
the log records for a transaction at the point that it issues the commit, so it will never
flush to disk log records relating to uncommitted transactions. At commit, SQL Server
combines the changes into a few relatively large log records and writes them to disk.
In-memory OLTP does harden the log records to disk before any data is written to disk.
In fact, the checkpoint files get the rows to write from the hardened log. Therefore, it can
always redo the effects of a transaction, should SQL Server crash before the checkpoint
operation occurs.
When working with disk-based tables, a checkpoint flushes to disk all dirty pages in
memory. A dirty page is any page in the cache that has changed since SQL Server read it
from disk or since the last checkpoint, so that the page in cache is different from what's
on disk. This is not a selective flushing; SQL Server flushes out all dirty pages, regardless
of whether they contain changes associated with open (uncommitted) transactions.
However, the Write Ahead Logging (WAL) mechanism used by the log buffer manager
guarantees to write the log records to disk before it writes the dirty data pages to the
physical data files. Therefore if SQL Server crashes immediately after a checkpoint, during
recovery it can guarantee to "undo" the effects of any transactions for which there is no
"commit" log record.
By contrast, for in-memory OLTP, all data modifications are in-memory; it has no
concept of a "dirty page" that needs to be flushed to disk and, since it generates log
records only at commit time, checkpoint will never write to disk log records related to
uncommitted transactions. So, while the transaction log contains enough information
about committed transactions to redo the transaction, no undo information is written to
the transaction log, for memory-optimized tables.
USE master
GO
IF DB_ID('LoggingDemo') IS NOT NULL
DROP DATABASE LoggingDemo;
GO
CREATE DATABASE LoggingDemo ON
PRIMARY (NAME = [LoggingDemo_data],
FILENAME = 'C:\DataHK\LoggingDemo_data.mdf'),
FILEGROUP [LoggingDemo_FG] CONTAINS MEMORY_OPTIMIZED_DATA
(NAME = [LoggingDemo_container1],
FILENAME = 'C:\DataHK\LoggingDemo_container1')
LOG ON (name = [LoggingDemo_log],
Filename='C:\DataHK\LoggingDemo.ldf', size= 100 MB);
GO
Listing 6-2 creates one memory-optimized table, and the equivalent disk-based table, in
the LoggingDemo database.
USE LoggingDemo
GO
IF OBJECT_ID('t1_inmem') IS NOT NULL
DROP TABLE [dbo].[t1_inmem]
GO
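A sketch of the two table definitions, sized to match the INSERTs used in Listings 6-3 and 6-4 (an int key plus a 100-character column; the bucket count is purely illustrative):

CREATE TABLE dbo.t1_inmem
    (
      c1 INT NOT NULL
             PRIMARY KEY NONCLUSTERED HASH WITH ( BUCKET_COUNT = 1000 ) ,
      c2 CHAR(100) NOT NULL
    )
    WITH ( MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA );
GO
IF OBJECT_ID('t1_disk') IS NOT NULL
    DROP TABLE [dbo].[t1_disk]
GO
CREATE TABLE dbo.t1_disk
    (
      c1 INT NOT NULL
             PRIMARY KEY NONCLUSTERED ,
      c2 CHAR(100) NOT NULL
    );
GO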
Next, Listing 6-3 populates the disk-based table with 100 rows, and examines the contents
of the transaction log using the undocumented (and unsupported) function fn_dblog().
You should see 200 log records for operations on t1_disk.
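The population step, mirroring the loop used for the memory-optimized table in Listing 6-4, would look something like this:

BEGIN TRAN
DECLARE @i INT = 0
WHILE ( @i < 100 )
      BEGIN
            INSERT INTO t1_disk
            VALUES ( @i, REPLICATE('1', 100) )
            SET @i = @i + 1
      END
COMMIT
GO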
-- you will see that SQL Server logged 200 log records
SELECT *
FROM sys.fn_dblog(NULL, NULL)
WHERE PartitionId IN ( SELECT partition_id
FROM sys.partitions
WHERE object_id = OBJECT_ID('t1_disk') )
ORDER BY [Current LSN] ASC;
GO
Listing 6-3: Populate the disk-based table with 100 rows and examine the log.
Listing 6-4 runs a similar INSERT on the memory-optimized table. Note that, since the
partition_id is not shown in the output for memory-optimized tables, we cannot
filter based on the specific object. Instead, we need to look at the most recent log records,
so the query performs a descending sort based on the LSN.
BEGIN TRAN
DECLARE @i INT = 0
WHILE ( @i < 100 )
BEGIN
INSERT INTO t1_inmem
VALUES ( @i, REPLICATE('1', 100) )
SET @i = @i + 1
END
COMMIT
-- look at the log
SELECT *
FROM sys.fn_dblog(NULL, NULL)
ORDER BY [Current LSN] DESC;
GO
Listing 6-4: Examine the log after populating the memory-optimized tables with 100 rows.
You should see only three log records related to this transaction, as shown in Figure 6-1,
one marking the start of the transaction, one the commit, and then just one log record for
inserting all 100 rows.
Figure 6-1: SQL Server transaction log showing one log record for a 100-row transaction.
The output implies that all 100 inserts have been logged in a single log record, using an
operation of type LOP_HK, with LOP indicating a "logical operation" and HK being an
artifact from the project codename, Hekaton.
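To break that LOP_HK record apart into its individual row operations, there is a companion undocumented function, sys.fn_dblog_xtp. Since it is undocumented, its column list can vary between builds, so the sketch below simply selects everything and sorts by LSN, as in Listing 6-4:

SELECT *
FROM sys.fn_dblog_xtp(NULL, NULL)
ORDER BY [Current LSN] DESC;
GO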
The first few rows of output should look similar to those shown in Figure 6-2. It
should return 102 rows, including one *_INSERT_ROW operation for each of the
100 rows inserted.
Figure 6-2: Breaking apart the log record for the inserts on the memory-optimized table.
The single log record for the entire transaction on the memory-optimized table,
plus the reduced size of the logged information, can help to make transactions on
memory-optimized tables much more efficient. This is not to say, however, that trans-
actions on memory-optimized tables are always going to be more efficient, in terms of
logging, than operations on disk-based tables. For very short transactions particularly,
disk-based and memory-optimized will generate about the same amount of log. However,
transactions on memory-optimized tables should never be any less efficient than on their
disk-based counterparts.
Checkpoint
The two main purposes of the checkpoint operation, for disk-based tables, are to improve
performance by batching up I/O rather than continually writing a page to disk every time
it changes, and to reduce the time required to run recovery. If checkpoint ran only very
infrequently then, during recovery, there could be a huge number of data rows to which
SQL Server needs to apply redo, as a result of committed transactions hardened to the log
but where the data pages were not hardened to disk before SQL Server entered recovery.
Similarly, one of the main reasons for checkpoint operations, for memory-optimized
tables, is to reduce recovery time. The checkpoint process for memory-optimized tables is
designed to satisfy two important requirements:
Continuous checkpointing Checkpoint-related I/O occurs incrementally and continuously as transactional activity accumulates, rather than in periodic bursts that could disrupt throughput.
Streaming I/O Checkpointing relies on streaming, append-only I/O to the checkpoint files rather than random I/O, so the files can be written sequentially and efficiently.
Automatic checkpoint SQL Server runs the in-memory OLTP checkpoint when
the size of the log has grown by 512 MB since the last checkpoint. Note that this is not
dependent on the amount of work done on memory-optimized tables, only the size
of the transaction log. It's possible that there have been no transactions on memory-
optimized tables when a checkpoint event occurs.
Checkpoint data is stored in two types of checkpoint files: data files and delta files. These,
plus the log records for transactions affecting memory-optimized tables, are the only
physical storage associated with memory-optimized tables. Data and delta files are stored
in pairs, sometimes referred to as a Checkpoint File Pair, or CFP.
A checkpoint data file contains only inserted versions or rows, either new rows,
generated by INSERTs or new versions of existing rows, generated by UPDATE opera-
tions, as we saw in Chapter 3. Each file covers a specific timestamp range and will contain
all rows with a Begin-Ts timestamp value that falls within that range. Data files are
append-only while they are open and, once closed, they are strictly read-only.
A checkpoint delta file stores information about which rows contained in its partner data
file have been subsequently deleted. When we delete rows, the checkpoint thread will
append a reference to the deleted rows (their IDs) to the corresponding delta files. Delta
files are append-only for the lifetime of the data file to which they correspond. At recovery
time, the delta file is used as a filter to avoid reloading deleted versions into memory. The
valid row versions in the data files are reloaded into memory and re-indexed.
There is a 1:1 correspondence between delta files and data files, and both cover exactly
the same timestamp range. Since each data file is paired with exactly one delta file,
the smallest unit of work for recovery is a data/delta file pair. This allows the recovery
process to be highly parallelizable as transactions can be recovered from multiple
CFPs concurrently.
Let's take an initial look, at the file system level, at how SQL Server creates these CFPs.
First, create a CkptDemo database, with a single container, as shown in Listing 6-6.
USE master
GO
IF DB_ID('CkptDemo') IS NOT NULL
DROP DATABASE CkptDemo;
GO
CREATE DATABASE CkptDemo ON
PRIMARY (NAME = [CkptDemo_data], FILENAME = 'C:\DataHK\CkptDemo_data.mdf'),
FILEGROUP [CkptDemo_FG] CONTAINS MEMORY_OPTIMIZED_DATA
(NAME = [CkptDemo_container1],
FILENAME = 'C:\DataHK\CkptDemo_container1')
LOG ON (name = [CkptDemo_log],
Filename='C:\DataHK\CkptDemo.ldf', size= 100 MB);
GO
Next, Listing 6-7 turns on an undocumented trace flag, 9851, which inhibits the automatic
merging of checkpoint files. This will allow us to control when the merging occurs, and
observe the process of creating and merging checkpoint files. Only use this trace flag
during testing, not on production servers.
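One way to enable it globally, for test systems only, is a sketch like this:

DBCC TRACEON (9851, -1);
GO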
Listing 6-7: Turn on Trace Flag 9851 to inhibit automatic merging of checkpoint files.
At this point, you might want to look in the folder containing the memory-optimized
data files, in this example DataHK\CkptDemo_container1. Within that folder is
one subfolder called $FSLOG and another with a GUID for a name. If we had specified
multiple memory-optimized filegroups in Listing 6-6, then we'd see one GUID-named
folder for each filegroup.
Open the GUID-named folder, and in there is another GUID-named folder. Again, there
will be one GUID-named folder at this level for each file in the filegroup. Open up that
GUID-named folder, and you will find it is empty, and it will remain empty until we
create a memory-optimized table, as shown in Listing 6-8.
USE CkptDemo;
GO
-- create a memory-optimized table with each row of size > 8KB
CREATE TABLE dbo.t_memopt (
c1 int NOT NULL,
c2 char(40) NOT NULL,
c3 char(8000) NOT NULL,
CONSTRAINT [pk_t_memopt_c1] PRIMARY KEY NONCLUSTERED HASH (c1)
WITH (BUCKET_COUNT = 100000)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
GO
At this point, if we re-examine the previously empty folder, we'll find that it now contains
18 files (9 CFPs), as shown in Figure 6-3. The larger ones are the data files, and the smaller
ones are the delta files.
Figure 6-3: The data and delta files in the container for our memory-optimized tables.
A checkpoint event will close any currently open (UNDER CONSTRUCTION) data files, and
the state of these CFPs transitions to ACTIVE. For memory-optimized tables, we often
refer to the checkpoint as a "collection of files," referring to the set of data files that were
closed when the checkpoint event occurred, plus their corresponding delta files.
From this point, the continuous checkpointing process will no longer write new INSERTS
into the closed data files, but they are still very much active, since current rows in these
data files may be subject to DELETE and UPDATE operations, which will be reflected in
their corresponding delta files. For example, as the result of an UPDATE, the continuous
checkpointing process will mark the current row version as deleted in the delta file of the
appropriate ACTIVE CFP, and insert the new row version into the data file of the current
UNDER CONSTRUCTION CFP.
Every checkpoint event creates a new set of ACTIVE CFPs and so the number of files can
grow rapidly. As data modifications proceed, a "deleted" row remains in the data file but
the delta file records the fact that it was deleted. Therefore, over time, the percentage
of meaningful content in older ACTIVE data files falls, due to DELETEs and UPDATEs.
Eventually, SQL Server will merge adjacent data files, so that rows marked as deleted
actually get deleted from the checkpoint data file, and create a new CFP. We'll see how
this works in the later section, Merging checkpoint files.
At a certain point, there will be no open transactions that could possibly affect the
content of a closed but active CFP. This point is reached once the oldest transaction still
required by SQL Server (marking the start of the active log, a.k.a. the log truncation point)
is more recent than the time range covered by the CFP, and the CFP transitions into
non-active states. Finally, assuming a log backup (which requires at least one database
backup) has occurred, these CFPs are no longer required and can be removed by the
checkpoint file garbage collection process.
Let's take a more detailed look at the various states through which CFPs transition.
PRECREATED A small set of CFPs is created up front, before the space is needed, so that
checkpoint operations never have to wait for new files to be allocated. Generally,
PRECREATED files will be full-sized, with a data file size of 128 MB and a delta file
size of 8 MB. However, if the machine has less than 16 GB of memory, the data file will
be 16 MB and the delta file will be 1 MB. The number of PRECREATED CFPs is equal to the
number of logical processors (or schedulers), with a minimum of 8. This gives a fixed
minimum storage requirement in databases with memory-optimized tables.
UNDER CONSTRUCTION These CFPs are "open" and the continuous checkpointing
process writes to these CFPs any newly inserted and possibly deleted data rows, since
the last checkpoint.
ACTIVE These CFPs contain the inserted/deleted rows for the last checkpoint
event. The checkpoint event "closes" any CFPs with a current status of UNDER
CONSTRUCTION, and their status changes to ACTIVE. The continuous checkpointing
process will no longer write to the data files of ACTIVE CFPs, but will, of course, still
write any deletes to the ACTIVE delta files. During a database recovery operation
the ACTIVE CFPs contain all inserted/deleted rows that will be required to restore the
data, before applying the tail log backup.
In general, the combined size of the ACTIVE CFPs should be about twice the size
of the memory-optimized tables. However, there may be situations, in particular if
your database files are larger than 128 MB, in which the merging process (discussed
a little later) can lag behind the file creation operations, and the total size of your
ACTIVE CFPs may be more than twice the size of the memory-optimized tables.
MERGE TARGET When currently ACTIVE CFPs are chosen for merging, SQL Server
creates a new CFP, which consists of a data file that stores the consolidated rows from
the two adjacent data files of the CFPs that the merge policy identified, plus a new
(empty) delta file. The resulting new CFP will have the status MERGE TARGET until the
merge is installed, when it will transition into the ACTIVE state.
MERGED SOURCE Once the merge has taken place and the MERGE TARGET CFPs are
part of checkpoint, the MERGE TARGET CFPs transition to ACTIVE and the original
source CFPs transition to the MERGED SOURCE state.
Note that, although the merge policy evaluator may identify multiple possible
merges, a CFP can only participate in one merge operation.
Listing 6-9 shows how to interrogate the checkpoint file metadata, using the sys.dm_
db_xtp_checkpoint_files DMV. It returns one row for each file, along with property
information for each file.
SELECT file_type_desc ,
state_desc ,
internal_storage_slot ,
file_size_in_bytes ,
inserted_row_count ,
deleted_row_count ,
lower_bound_tsn ,
upper_bound_tsn ,
checkpoint_file_id ,
relative_file_path
FROM sys.dm_db_xtp_checkpoint_files
ORDER BY file_type_desc ,
state_desc ,
lower_bound_tsn;
GO
Listing 6-9 returns the following metadata columns (other columns are available; see the
documentation for a full list):
file_type_desc
Identifies the file as a data or delta file.
state_desc
The state of the file (see previous bullet list).
internal_storage_slot
This value is the pointer to an internal storage array (described below), but
is not populated until a file becomes ACTIVE.
file_size_in_bytes
Note that we have just two sizes so far; the DATA files are
16777216 bytes (16 MB) and the delta files are 1048576 bytes (1 MB).
inserted_row_count
This column is only populated for data files.
deleted_row_count
This column is only populated for delta files.
lower_bound_tsn
This is the timestamp for the earliest transaction covered by this checkpoint file.
upper_bound_tsn
This is the timestamp for the last transaction covered by this checkpoint file.
checkpoint_file_id
This is the internal identifier for the file.
relative_file_path
The location of the file relative to the checkpoint file container.
The metadata of all CFPs that exist on disk is stored in an internal array structure referred
to as the storage array. It is a fixed-sized array of 8192 entries, where each entry in the
array refers to a CFP, and the array provides support for a cumulative size of 256 GB for
durable memory-optimized tables in the database. The internal_storage_slot value
in the metadata refers to the location of an entry in this array.
(The figure depicts consecutive storage array entries, each covering a transaction range, low to high: 0 to 8, 8 to 20, 20 onwards.)
Figure 6-4: The storage array stores metadata for up to 8192 CFPs per database.
The CFPs referenced by the storage array, along with the tail of the log, represent all the
on-disk information required to recover the memory-optimized tables in a database.
Let's see an example of some of these checkpoint file state transitions in action. At this
stage, our CkptDemo database has one empty table, and we've seen that SQL Server
has created 9 CFPs. We'll take a look at the checkpoint file metadata, using the sys.
dm_db_xtp_checkpoint_files DMV. In this case, we just return the file type (DATA
or DELTA), the state of each file, and the relative path to each file.
SELECT file_type_desc ,
state_desc ,
relative_file_path
FROM sys.dm_db_xtp_checkpoint_files
ORDER BY file_type_desc
GO
Figure 6-5 shows that of the 9 CFPs (9 data files, 9 delta files), 8 CFPs have the state
PRECREATED, and the other 1 CFP has the state UNDER CONSTRUCTION.
The values in the relative_file_path column are a concatenation of the two GUID
folder names, plus the file names in the folder that was populated when we created the
table. These relative paths are of the general form GUID1\GUID2\FILENAME where
GUID1 is the GUID for the container, GUID2 is the GUID for the file in the container and
FILENAME is the name of the individual data or delta file. For example, the FILENAME
portion of the relative path for the third row in Figure 6-5 is 00000021-000000a7-0003,
which matches the name of the second file (the first data file) listed in my file browser
previously, in Figure 6-3.
Let's now put some rows into the t_memopt table, as shown in Listing 6-11. The script
also backs up the database so that we can make log backups later (although the backup
does not affect what we will shortly see in the metadata).
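A sketch of such a script, inserting 8000 rows into t_memopt and then taking a full database backup (the loop shape and the backup path are illustrative):

SET NOCOUNT ON;
DECLARE @i INT = 0;
WHILE ( @i < 8000 )
      BEGIN
            INSERT dbo.t_memopt
            VALUES ( @i, 'a', REPLICATE('b', 8000) );
            SET @i = @i + 1;
      END;
GO
-- the backup path is illustrative; adjust to your environment
BACKUP DATABASE CkptDemo
    TO DISK = 'C:\BackupsHK\CkptDemo.bak'
    WITH INIT;
GO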
Listing 6-11: Populate the memory-optimized tables with 8000 rows and back up the database.
If we peek again into the GUID-named subfolder in the file system browser, we should see
four additional CFPs.
Now let's return to look at the checkpoint file metadata in a little more detail by
rerunning the query in Listing 6-9. Figure 6-6 shows the 13 CFPs returned, and the
property values for each file.
Notice that there are no ACTIVE data files because there has been no checkpoint event
yet. However, we now have five UNDER CONSTRUCTION CFPs and, because of the
continuous checkpointing, the data files of these CFPs contain 8000 data rows (four files
have 1876 rows and one has 496, as we can see from the inserted_row_count column).
If SQL Server needed to recover this table's data at this point, it would do it completely
from the transaction log.
However, let's see what happens when a checkpoint event occurs, also referred to as closing
a checkpoint.
Closing a checkpoint
Let's now actually execute the checkpoint command in this database, manually, and then
rerun Listing 6-9 to interrogate the metadata in sys.dm_db_xtp_checkpoint_files.
CHECKPOINT;
GO
-- now rerun Listing 6-9
In the output, we'll see one or more CFPs (in this case, five) with the state ACTIVE and
with non-NULL values for the internal_storage_slot, as shown in Figure 6-7.
Notice that the five ACTIVE CFPs have consecutive internal_storage_slot values.
In fact, if we execute a checkpoint multiple times, we'll see that each checkpoint will
create additional ACTIVE CFPs, with contiguous values for internal_storage_slot.
What's happened here is that the checkpoint event takes a section of the transaction log
not covered by a previous checkpoint event, and converts all operations on memory-
optimized tables contained in that section of the log into one or more ACTIVE CFPs.
Once the checkpoint task finishes processing the log, the checkpoint is completed with
the following steps:
1. All buffered writes (changes that so far exist only in memory) are flushed to the data
and delta files.
2. A checkpoint inventory is constructed that includes descriptors for all files from the
previous checkpoint plus any files added by the current checkpoint. The inventory is
hardened to the transaction log.
3. The location of the inventory is stored in the transaction log so that it is available at
recovery time.
With a completed checkpoint (i.e. the ACTIVE CFPs that a checkpoint event creates),
combined with the tail of the transaction log, SQL Server can recover any memory-
optimized table. A checkpoint event has a timestamp, which indicates that the effects
of all transactions before the checkpoint timestamp are recorded in files created by the
checkpoint and thus the transaction log is not needed to recover them. Of course, just
as for disk-based tables, even though that section of the log has been covered by a
checkpoint, it still cannot be truncated until we have taken a log backup.
As discussed previously, the ACTIVE CFPs created by a checkpoint event are "closed" in
the sense that the continuous checkpointing process no longer writes new rows to these
data files, but it will still need to write to the associated delta files, to reflect deletion of
existing row versions.
Merging checkpoint files
The solution to this problem, of data files gradually filling up with deleted rows, is to
merge data files that are adjacent in terms of timestamp ranges, when their active content
(the percentage of undeleted versions in a data file) drops below a threshold. Merging two
data files, DF1 and DF2, results in a new data file,
DF3, covering the combined range of DF1 and DF2. All deleted versions identified in the
delta files for DF1 and DF2 are removed during the merge. The delta file for DF3 is empty
immediately after the merge, except for deletions that occurred after the merge operation
started.
Merging can also occur when two adjacent data files are each less than 50% full. Data
files can end up only partially full if a manual checkpoint has been run, which closes the
currently open (UNDER CONSTRUCTION) checkpoint data file and starts a new one.
Automatic merge
To identify a set of files to be merged, a background task periodically looks at all ACTIVE
data/delta file pairs and identifies zero or more sets of files that qualify.
Each set can contain two or more data/delta file pairs that are adjacent to each other such
that the resultant set of rows can still fit in a single data file of size 128 MB (or 16 MB for
machines with 16 GB memory or less). Table 6-1 shows some examples of files that will be
chosen to be merged under the merge policy.
Adjacent source files (% full)                     Files selected for merge
DF0 (30%), DF1 (50%), DF2 (50%), DF3 (90%)         (DF1, DF2)
Table 6-1: Examples of files that can be chosen for file merge operations.
It is possible that two adjacent data files are 60% full. They will not be merged and 40%
of storage is unused. So the total disk storage used for durable memory-optimized tables
is effectively larger than the corresponding memory-optimized size. In the worst case,
the size of storage space taken by durable tables could be two times larger than the
corresponding memory-optimized size.
Manual merge
In most cases, the automatic merging of checkpoint files will be sufficient to keep the
number of files manageable. However, in rare situations or for testing purposes, you
might want to use a manual merge. We can use the procedure sys.sp_xtp_merge_checkpoint_files
to force a manual merge of checkpoint files. To determine which files
might be eligible, we can look at the metadata in sys.dm_db_xtp_checkpoint_files.
Remember that earlier we turned off automatic merging of files using the undocumented
trace flag, 9851. Again, this is not recommended in a production system but, for the sake of
this example, it does allow us to explore more readily this metadata and how it evolves
during a merge operation.
In continuing our previous example, let's now delete half the rows in the t_memopt table
as shown in Listing 6-13.
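A sketch of such a delete, removing every other row (4000 of the 8000); the exact predicate the original listing used is an assumption, and the manual CHECKPOINT simply ensures the deletes are flushed to the delta files so they show up in the metadata:

-- delete half the rows (those with even c1 values)
DELETE FROM dbo.t_memopt
WHERE c1 % 2 = 0;
GO
CHECKPOINT;
GO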
The metadata will now look something like that shown in Figure 6-8, with one additional
CFP and with row counts in the deleted_rows column, since the table has only half as
many rows.
Figure 6-8: The checkpoint file metadata after deleting half the rows in the table.
The number of deleted rows, spread across five files adds up to 4000, as expected. From
this information, we can find adjacent files that are not full, or files that we can see have
a lot of their rows removed. Armed with the transaction_id_lower_bound from
the first file in the set, and the transaction_id_upper_bound from the last file, we
can call the sys.sp_xtp_merge_checkpoint_files procedure, as in Listing 6-14, to
force a manual merge. Note that this procedure will not accept a NULL as a parameter,
so if the transaction_id_lower_bound is NULL, we can use any value less than
transaction_id_upper_bound.
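A sketch of that call; the parameters are the database name, the lower bound, and the upper bound, and the bound values shown here are placeholders for the values read from the DMV:

EXEC sys.sp_xtp_merge_checkpoint_files 'CkptDemo', 1, 8000;
GO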
We can verify the state of the merge operation with another DMV,
sys.dm_db_xtp_merge_requests, as shown in Listing 6-15.
SELECT request_state_desc ,
lower_bound_tsn ,
upper_bound_tsn
FROM sys.dm_db_xtp_merge_requests;
GO
In the metadata, we should now see one new CFP in the MERGE TARGET state containing
all the 4000 remaining rows (from here on, I've filtered out the PRECREATED files).
Figure 6-10: Some of the checkpoint file metadata after a requested merge.
Now run another manual checkpoint and then, once the merge is complete, the
request_state_description column of sys.dm_db_xtp_merge_requests
should show a value of INSTALLED instead of PENDING. The metadata will now look
similar to Figure 6-11. Now the CFP in slot 5, containing the 4000 remaining rows, is
ACTIVE and once again the checkpoint creates a new ACTIVE CFP (slot 6). The original
6 CFPs (originally slots 0 to 5) have been merged and their status is MERGED SOURCE.
If any concurrent activity were occurring on the server, we'd also see new UNDER
CONSTRUCTION CFPs.
Figure 6-11: The checkpoint file metadata after forcing a merge operation.
Note that the six CFPs that are included in the merged transaction range are still visible,
but they have no internal_storage_slot number, which means they are no longer
used for any ongoing operations. We also see an ACTIVE data file that contains all 4000
remaining rows and covers the complete transaction range.
A file needs to go through three stages before it is actually removed by the garbage
collection process. Let us assume we are merging files A (data/delta), B (data/delta)
into C (data/delta). Following are the key steps.
Stage 1: Checkpoint
After the merge has completed, the in-memory engine cannot remove the original
MERGED SOURCE files (A and B) until a checkpoint event occurs that guarantees that data
in those files is no longer needed for recovery, in the event of a service interruption (of
course, they certainly will be required for restore and recovery after a disk failure, so we
need to be running backups).
The checkpoint directly after the forced merge produced the MERGED SOURCE CFPs, and
running another one now sees them transition to REQUIRED FOR BACKUP/HA.
Figure 6-12: The checkpoint file metadata after a merge, then another checkpoint.
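The second stage is a log backup of the database; a minimal sketch (the path is illustrative, and it assumes the full backup taken after Listing 6-11):

BACKUP LOG CkptDemo
    TO DISK = 'C:\BackupsHK\CkptDemo_log.bak'
    WITH INIT;
GO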
Having performed a log backup, SQL Server can mark the source files from the merge
operation with the LSN, and any files with an LSN lower than the log truncation point are
eligible for garbage collection. Normally, of course, the whole garbage collection process
is automatic, and does not require any intervention, but for our example we can force
the process manually using the sp_xtp_checkpoint_force_garbage_collection
system stored procedure, followed by another checkpoint. You may need to run
Listing 6-18 at least twice.
EXEC sp_xtp_checkpoint_force_garbage_collection;
GO
CHECKPOINT
GO
After this stage, the files may or may not still be visible to your in-memory database
engine through the sys.dm_db_xtp_checkpoint_files DMV, but they will be visible
on disk. In my example, I was still able to see the files, with the TOMBSTONE state, but at
some point they will become invisible to this DMV.
Before a CFP can be removed, the in-memory OLTP engine must ensure that it will not
be required. In general, the garbage collection process is automatic. However, there is an
option to force the garbage collection of unused checkpoint files.
After Stage 3, the files may no longer be visible through the operating system although,
depending on what else is happening on the system, this process may take a while. Keep
in mind, however, that performing any of this manual garbage collection should not
normally be necessary.
If you find you do need to implement this manual garbage collection of files, be sure
to account for these extra transaction log backups that were performed. You will need
to make sure any third-party backup solutions are aware of these log backup files.
Alternatively, you could perform a full database backup after performing this manual
garbage collection, so that subsequent transaction log backups would use that as their
starting point.
Recovery
Recovery on in-memory OLTP tables starts after the location of the most recent
checkpoint inventory has been recovered during a scan of the tail of the log. Once the
SQL Server host has communicated the location of the checkpoint inventory to the
in-memory OLTP engine, SQL Server and in-memory OLTP recovery proceed in parallel.
The global transaction timestamp is initialized during the recovery process with the
highest transaction timestamp found among the transactions recovered.
In-memory OLTP recovery itself is parallelized. Each delta file represents a filter to
eliminate rows that don't have to be loaded from the corresponding data file. This
data/delta file pair arrangement means that checkpoint file loading can proceed in
parallel across multiple I/O streams with each stream processing a single data file and
delta file. The in-memory OLTP engine creates one thread per core to handle parallel
insertion of the data produced by the I/O streams. The insert threads load into memory
all active rows in the data file after removing the rows that have been deleted. Using one
thread per core means that the load process is performed as efficiently as possible.
As the data rows are loaded they are linked into each index defined on the table the row
belongs to. For each hash index, the row is added to the chain for the appropriate hash
bucket. For each range index, the row is added to the chain for the row's key value, or
a new index entry is created if the key value doesn't duplicate one already encountered
during recovery of the table.
Finally, once the checkpoint file load process completes, the tail of the transaction log
is replayed from the timestamp of the last checkpoint, with the goal of bringing the
database back to the state that existed at the time of the crash.
Summary
In this chapter we looked at how the logging process for memory-optimized tables
is more efficient than that for disk-based tables, providing additional performance
improvement for your in-memory operations. We also looked at how your data changes
are persisted to disk using streaming checkpoint files, so that your data is persisted and
can be recovered when the SQL Server service is restarted, or your databases containing
memory-optimized tables are restored.
Additional Resources
A white paper describing FILESTREAM storage and management:
http://msdn.microsoft.com/en-us/library/hh461480.aspx.
Chapter 7: Native Compilation of Tables and Stored Procedures
In-memory OLTP introduces the concept of native compilation into SQL Server 2014.
In this version, SQL Server can natively compile stored procedures that access memory-
optimized tables and, in fact, memory-optimized tables themselves are natively compiled.
In many cases, native compilation allows faster data access and more efficient query
execution than traditional interpreted T-SQL.
The performance benefit of using a natively compiled stored procedure increases with
the number of rows and the complexity of the procedure's code. If a procedure needs to
process just a single row, it's unlikely to benefit from native compilation, but it will almost
certainly exhibit better performance, compared to an interpreted procedure, if it uses one or
more of the following:
aggregation
nested-loops joins
complex expressions
You should consider using natively compiled stored procedures for the most
performance-critical parts of your applications, including procedures that you
execute frequently, that contain logic such as that described above, and that need
to be extremely fast.
The T-SQL language consists of high-level constructs such as CREATE TABLE and
SELECT ... FROM. The in-memory OLTP compiler takes these constructs, and compiles
them down to native code for fast runtime data access and query execution. The
in-memory OLTP compiler in SQL Server 2014 takes the table and stored procedures
definitions as input. It generates C code, and leverages the Visual C compiler to generate
the native code. The result of the compilation of tables and stored procedures is DLLs
that are loaded into memory and linked into the SQL Server process.
SQL Server compiles both memory-optimized tables and natively compiled stored
procedures to native DLLs at the time of creation. Following a SQL Server instance
restart or a failover, table and stored procedure DLLs are recompiled on first access or
execution. The information necessary to recreate the DLLs is stored in the database
metadata; the DLLs themselves are not part of the database and are not included as part
of database backups.
Maintenance of DLLs
The DLLs for memory-optimized tables and natively compiled stored procedures are
stored in the file system, along with other generated files, which are kept for
troubleshooting and supportability purposes.
The query in Listing 7-1 shows all table and stored procedure DLLs currently loaded in
memory on the server.
SELECT name ,
description
FROM sys.dm_os_loaded_modules
WHERE description = 'XTP Native DLL'
Listing 7-1: Display the list of all table and procedure DLLs currently loaded.
Database administrators do not need to maintain the files that native compilation
generates. SQL Server automatically removes generated files that are no longer needed,
for example on table and stored procedure deletion and on dropping a database, but also
on server or database restart.
Consider the script in Listing 7-2, which creates a database and a single, memory-
optimized table, and then retrieves the path of the DLL for the table from the
sys.dm_os_loaded_modules DMV.
USE master
GO
CREATE DATABASE NativeCompDemo ON
PRIMARY (NAME = NativeCompDemo_Data,
FILENAME = 'c:\DataHK\NativeCompDemo_Data.mdf',
SIZE=500MB),
-- a memory-optimized filegroup is required before a memory-optimized table
-- can be created; the filegroup name and container path below are illustrative
FILEGROUP NativeCompDemo_mod CONTAINS MEMORY_OPTIMIZED_DATA
(NAME = NativeCompDemo_mod_dir,
FILENAME = 'c:\DataHK\NativeCompDemo_mod_dir')
LOG ON (NAME = NativeCompDemo_log,
FILENAME = 'c:\DataHK\NativeCompDemo_log.ldf',
SIZE=500MB);
GO
USE NativeCompDemo
GO
CREATE TABLE dbo.t1
(
c1 INT NOT NULL
PRIMARY KEY NONCLUSTERED ,
c2 INT
)
WITH (MEMORY_OPTIMIZED=ON)
GO
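A minimal sketch of the DMV query that accompanies this listing follows; it assumes
the xtp_t_<database_id>_<object_id> naming convention for table DLLs, which is an
assumption rather than something shown in the text above.

SELECT name, description
FROM sys.dm_os_loaded_modules
WHERE name LIKE '%xtp_t_' + CAST(DB_ID() AS VARCHAR(10)) + '_'
               + CAST(OBJECT_ID('dbo.t1') AS VARCHAR(10)) + '%';
GO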
The table creation results in the compilation of the table DLL, and in the loading of that
DLL into memory. The DMV query immediately after the CREATE TABLE statement retrieves
the path of the table DLL. My results are shown in Figure 7-1.
The table DLL for t1 understands the index structures and row format of the table. SQL
Server uses the DLL for traversing indexes and retrieving rows, as well as for determining
the data contents of the rows.
Consider the stored procedure in Listing 7-3, which inserts a million rows into the
table t1 from Listing 7-2.
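A sketch of what such a procedure could look like is shown below; the procedure name p1
comes from the discussion that follows, while the loop and the values inserted into c1 and
c2 are illustrative assumptions.

CREATE PROCEDURE dbo.p1
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC
    WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')
    -- insert a million rows; the values are placeholders
    DECLARE @i INT = 1;
    WHILE @i <= 1000000
    BEGIN
        INSERT dbo.t1 (c1, c2) VALUES (@i, @i);
        SET @i = @i + 1;
    END
END;
GO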
Use of WITH SCHEMABINDING guarantees that the tables accessed by the procedure
are not dropped. Normally, schemabinding also prevents the underlying objects
from being altered, but since memory-optimized tables cannot ever be altered, that
restriction is irrelevant for natively compiled procedures.
The DLL for the procedure p1 can interact directly with the DLL for the table t1, as well
as with the in-memory OLTP storage engine, to insert the rows very quickly.
The in-memory OLTP compiler leverages the query optimizer to create an efficient
execution plan for each of the queries in the stored procedure. Note that, for natively
compiled stored procedures, the query execution plan is compiled into the DLL.
SQL Server 2014 does not support automatic recompilation of natively compiled stored
procedures, so if we make changes to table data, we may need to drop and recreate certain
procedures to allow incorporation of new query plans into the stored procedure DLLs.
SQL Server recompiles natively compiled stored procedures on first execution, after
server restart, as well as after failover to an AlwaysOn secondary, meaning that the query
optimizer will create new query plans that are subsequently compiled into the stored
procedure DLLs.
As discussed in Chapter 2, there are limitations on the T-SQL constructs that can be
included in a natively compiled procedure. Natively compiled stored procedures are
intended for short, basic OLTP operations, so many of the complex query constructs
provided in the language are not allowed. In fact, there are so many restrictions that the
documentation lists the features that are supported, rather than those that are not. You
can find the list at this link: http://msdn.microsoft.com/en-us/library/dn452279.aspx.
Parameter Sniffing
Interpreted T-SQL stored procedures are compiled into intermediate physical execution
plans at first execution (invocation) time, in contrast to natively compiled stored proce-
dures, which are natively compiled at creation time. When interpreted stored procedures
are compiled at invocation, the values of the parameters supplied for this invocation are
used by the optimizer when generating the execution plan. This use of parameters during
compilation is called parameter sniffing.
SQL Server does not use parameter sniffing for compiling natively compiled stored
procedures. All parameters to the stored procedure are considered to have UNKNOWN
values.
2. The parser and algebrizer create the processing flow for the procedure, as well as
query trees for the T-SQL queries in the stored procedure.
3. The optimizer creates optimized query execution plans for all the queries in the
stored procedure.
4. The in-memory OLTP compiler takes the processing flow with the embedded
optimized query plans and generates a DLL that contains the machine code for
executing the stored procedure.
5. The generated DLL is loaded in memory and linked to the SQL Server process.
3. The in-memory OLTP runtime locates the DLL entry point for the stored procedure.
4. The DLL executes the procedure logic and returns the results to the client.
The formula that the optimizer uses for assessing the relative cost of operations on
memory-optimized tables is similar to the costing formula for operations on disk-based
tables, with only a few exceptions. However, because of differences in the way that
memory-optimized tables are organized and managed, the optimizer needs to be aware
of the different choices it may have to make, and of certain execution plan options
that are not available when working with memory-optimized tables. The following
subsections describe the most important differences between optimizing queries on
disk-based tables and optimizing queries on memory-optimized tables.
If the optimizer finds no index that it can use efficiently, it will choose a plan that
will effectively be a table scan, although there is really no concept of a table scan
with memory-optimized tables, because all data access is through indexes. The
compiled plan will indicate that one of the indexes is to be used, through which all
of the rows will be retrieved.
Note that for an interop plan, as opposed to a plan for a natively compiled procedure,
you may actually see a table scan in the estimated plan. With such a plan, the decision as
to which index to use to access all the rows is made by the execution engine at runtime.
The usual choice is the hash index with the fewest buckets, but that is not guaranteed.
In general, the optimizer will choose to use a hash index over a range index if the cost
estimations are the same.
Hash indexes
There are no ordered scans with hash indexes. If a query is looking for a range of values,
or requires that the results be returned in sorted order, a hash index will not be useful,
and the optimizer will not consider it.
The optimizer cannot use a hash index unless the query filters on all columns in the index
key. The hash index examples in Chapter 4 illustrated an index on just a single column.
However, just like indexes on disk-based tables, hash indexes on memory-optimized
tables can be composite, but the hash function used to determine to which bucket a row
belongs is based on all columns in the index. So if we had a hash index on (city, state),
a row for a customer from Springfield, Illinois would hash to a completely different
value than a row for a customer from Springfield, Missouri, and also would hash to a
completely different value than a row for a customer from Chicago, Illinois. If a query
only supplies a value for city, a hash value cannot be generated and the index cannot be
used, unless the entire index is used for a scan.
For similar reasons, a hash index can only be used if the filter is based on an equality. If
the query does not specify an exact value for one of the columns in the hash index key,
the hash value cannot be determined. So, if we have a hash index on city, and the query
is looking for city LIKE 'San%', a hash lookup is not possible.
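As an illustration, consider the hypothetical table below; the table and index names and the
bucket count are my own, and the BIN2 collations follow the requirement that string index key
columns on memory-optimized tables use a BIN2 collation. Only the first query can seek the
composite hash index.

CREATE TABLE dbo.Customers
(
    CustomerID INT NOT NULL PRIMARY KEY NONCLUSTERED,
    City  VARCHAR(32) COLLATE Latin1_General_100_BIN2 NOT NULL,
    State CHAR(2)     COLLATE Latin1_General_100_BIN2 NOT NULL,
    INDEX ix_city_state NONCLUSTERED HASH (City, State)
          WITH (BUCKET_COUNT = 100000)
) WITH (MEMORY_OPTIMIZED = ON);
GO
-- equality predicates on ALL index key columns: the hash index can be used
SELECT CustomerID FROM dbo.Customers
WHERE City = 'Springfield' AND State = 'IL';
-- only part of the key supplied: no hash value can be computed
SELECT CustomerID FROM dbo.Customers
WHERE City = 'Springfield';
-- not an equality predicate: the hash index cannot be used for a seek
SELECT CustomerID FROM dbo.Customers
WHERE City LIKE 'San%';
GO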
Range indexes
Range indexes cannot be scanned in reverse order. There is no concept of "previous
pointers" in a range index on a memory-optimized table. With on-disk indexes, if a query
requests the data to be sorted in DESC order, the on-disk index could be scanned in
reverse order to support this. With in-memory tables, an index would have to be created
as a descending index. In fact, it is possible to have two indexes on the same column,
one defined as ascending and one defined as descending. It is also possible to have
both a range and a hash index on the same column.
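For example, the following hypothetical table (the names and bucket count are assumptions)
creates both an ascending and a descending range index on the same column, so queries can be
supported in either sort order:

CREATE TABLE dbo.Orders
(
    OrderID   INT NOT NULL
              PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    OrderDate DATETIME2 NOT NULL,
    INDEX ix_date_asc  NONCLUSTERED (OrderDate ASC),
    INDEX ix_date_desc NONCLUSTERED (OrderDate DESC)
) WITH (MEMORY_OPTIMIZED = ON);
GO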
No Halloween protection
Halloween protection is not incorporated into the query plans. Halloween protection
provides guarantees against accessing the same row multiple times during query
processing. Operations on disk-based tables use spooling operators to make sure rows are
not accessed repeatedly, but this is not necessary for plans on memory-optimized tables.
No parallel plans
Currently, parallel plans are not produced for operations on memory-optimized tables.
The XML plan for the query will indicate that the reason for no parallelism is because the
table is a memory-optimized table.
No auto-update of statistics
SQL Server In-Memory OLTP does not keep any row modification counters, and does not
automatically update statistics on memory-optimized tables. One of the reasons for not
updating the statistics is so there will be no chance of dependency failures due to waiting
for statistics to be gathered.
You'll need to make sure you set up a process for regularly updating statistics on memory-
optimized tables using the UPDATE STATISTICS command, which can be used to update
statistics on just one index, or on all the indexes of a specified table. Alternatively, you
can use the procedure sp_updatestats, which updates all the statistics on all indexes
in a database. For disk-based tables, this procedure only updates statistics on tables which
have been modified since the last time statistics were updated, but for memory-optimized
tables, the procedure will also recreate statistics. Make sure you have loaded data and
updated statistics on all tables accessed in a natively compiled procedure before the
procedure is created, since the plan is created at the time of procedure creation, and it
will be based on the existing statistics.
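A minimal sketch of that maintenance, using the table from Listing 7-2, is shown below; it
assumes the SQL Server 2014 requirement that statistics on memory-optimized tables be
updated with a full scan together with NORECOMPUTE.

-- refresh statistics on one memory-optimized table
UPDATE STATISTICS dbo.t1 WITH FULLSCAN, NORECOMPUTE;
GO
-- or refresh statistics for every table in the database
EXEC sys.sp_updatestats;
GO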
Natively compiled procedure plans will never be recompiled on the fly; the only way to
get a new plan is to drop and recreate the procedure (or restart the server).
Performance Comparisons
Since the very first versions of SQL Server, stored procedures have been described as
being stored in a compiled form. The process of coming up with a query plan for a
batch is also frequently described as compilation. However, until SQL Server 2014 and
in-memory OLTP, what was described as compilation wasn't really true compilation.
SQL Server stored query plans in an internal form, after they had been parsed and
normalized, but they were not truly compiled. When executing a plan, the execution
engine walks the query tree and interprets each operator as it is encountered, calling the
appropriate database functions. This is far more expensive than executing a true compiled
plan, composed of machine language calls to actual CPU instructions.
When processing a query, the runtime costs include locking, latching, and disk I/O, and
the relatively small cost and overhead associated with interpreted code, compared to
compiled code, gets "lost in the noise." However, in true performance tuning method-
ology, there is always a bottleneck; once we remove one, another becomes apparent.
Once we remove the overhead of locking, latching, and disk I/O, the cost of interpreted
code becomes a major component, and a potential bottleneck.
The only way to substantially speed up processing time is to reduce the number of
CPU instructions executed per transaction. Assume that in our system each transaction
uses one million CPU instructions, which results in 100 transactions per second (TPS). To
achieve a ten-times performance improvement, to 1,000 TPS, we would have to decrease
the number of instructions per transaction to 100,000, which is a 90% reduction.
To satisfy the original vision for Hekaton, and achieve a 100-times performance
improvement, to 10,000 TPS, would mean reducing the number of instructions per
transaction to 10,000, or a 99% reduction! A reduction of this magnitude would be
impossible with SQL Server's existing interpretive query engine or any other existing
interpretive engine.
With natively compiled code, SQL Server In-Memory OLTP has reduced the number of
instructions executed per transaction by well over 90%, achieving in some cases an
improvement in performance of 30 times or more.
Table 7-1 shows the number of CPU cycles needed to SELECT a number of random rows
in a single transaction, and the final column shows the percentage improvement for the
memory-optimized table.
The more rows are read, the greater the performance benefit until, by the time 10,000
rows are read in a transaction, the improvement is more than 20-fold.
Table 7-2 shows the number of CPU cycles needed to UPDATE a number of random rows
in the same 10,000,000-row table with one hash index. Log I/O was disabled for this test
by creating the memory-optimized tables with the SCHEMA_ONLY durability option.
Again, the more rows being processed, the greater the performance benefit when using
memory-optimized tables. For UPDATE operations, Microsoft was able to realize a 30-fold
gain with 10,000 operations per transaction, achieving a throughput of 1.9 million updates
per second, per core.
Finally, Table 7-3 gives some performance comparisons for a mixed environment, with
both INSERT and SELECT operations. The workload consists of 50% INSERT transactions
that append a batch of 100 rows in each transaction, and 50% SELECT transactions that
read the most recently inserted batch of rows.
Table 7-3 shows how many TPS are achieved using first, a disk-based table, second,
a memory-optimized table being accessed through interop code, and finally, a
memory-optimized table being accessed through a natively compiled procedure.
With one CPU, using a natively compiled procedure increased the TPS by more than
four times, and for 12 CPUs, the increase was over six times.
After creating the database, the script creates a memory-optimized table called
bigtable_inmem. This is a SCHEMA_ONLY memory-optimized table so SQL Server
will log the table creation, but will not log any DML on the table, so the data will not
be durable.
USE master;
GO
-- the CREATE DATABASE statement is not reproduced in full here; like
-- Listing 7-2, the xtp_demo database needs a memory-optimized filegroup
USE xtp_demo;
GO
-- the column definitions are a sketch inferred from the INSERT in Listing 7-5;
-- the primary key index and its bucket count are assumptions
CREATE TABLE dbo.bigtable_inmem
(
    id            uniqueidentifier NOT NULL
                  PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 2097152),
    account_id    int NOT NULL,
    trans_type_id smallint NOT NULL,
    shop_id       int NOT NULL,
    trans_made    datetime NOT NULL,
    trans_amount  decimal(20,2) NOT NULL
) WITH ( MEMORY_OPTIMIZED=ON,
         DURABILITY=SCHEMA_ONLY );
GO
Listing 7-4: Creating the xtp_demo database, and a SCHEMA_ONLY memory-optimized table.
Next, Listing 7-5 creates an interop (not natively compiled) stored procedure called
ins_bigtable that inserts rows into bigtable_inmem. The number of rows to insert
is passed as a parameter when the procedure is called.
-- the procedure header and the last two VALUES entries are reconstructions;
-- the expressions used for trans_made and trans_amount are assumptions
CREATE PROCEDURE dbo.ins_bigtable @rows_to_INSERT int
AS
BEGIN
  SET NOCOUNT ON;
  DECLARE @i int = 1;
  DECLARE @newid uniqueidentifier;
  WHILE @i <= @rows_to_INSERT
  BEGIN
    SET @newid = newid();
    INSERT dbo.bigtable_inmem ( id, account_id, trans_type_id,
                                shop_id, trans_made, trans_amount )
    VALUES ( @newid,
             32767 * rand(),
             30 * rand(),
             100 * rand(),
             dateadd(second, @i, getdate()),   -- trans_made (assumed)
             (32767 * rand()) / 100. );        -- trans_amount (assumed)
    SET @i = @i + 1;
  END
END
GO
Listing 7-5: An interop procedure, ins_bigtable, to insert rows into the table.
Finally, Listing 7-6 creates the equivalent natively compiled stored procedure.
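The natively compiled version is not reproduced here in full; a sketch of it, mirroring the
body of Listing 7-5 and assuming the same expressions are valid inside a native module,
would look something like this:

CREATE PROCEDURE dbo.ins_native_bigtable @rows_to_INSERT int
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC
    WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')
    DECLARE @i int = 1;
    DECLARE @newid uniqueidentifier;
    WHILE @i <= @rows_to_INSERT
    BEGIN
        SET @newid = newid();
        INSERT dbo.bigtable_inmem ( id, account_id, trans_type_id,
                                    shop_id, trans_made, trans_amount )
        VALUES ( @newid, 32767 * rand(), 30 * rand(), 100 * rand(),
                 dateadd(second, @i, getdate()), (32767 * rand()) / 100. );
        SET @i = @i + 1;
    END
END;
GO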
Now we're going to run comparative tests for one-million row inserts into bigtable_
inmem, via the interop and natively compiled stored procedures. We'll delete all the rows
from the table before we insert the next million rows.
First, Listing 7-7 calls the interop procedure, with a parameter value of 1,000,000,
outside of a user-defined transaction, so each INSERT in the procedure is an auto-
commit transaction.
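The call itself, reconstructed from the description above, is simply:

EXEC ins_bigtable @rows_to_INSERT = 1000000;
GO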
Listing 7-7: Inserting a million rows into a memory-optimized table via ins_bigtable.
When I executed this call, it took 28 seconds, as indicated in the status bar in SQL Server
Management Studio. You might want to record the amount of time it takes on your
SQL Server instance.
Next, Listing 7-8 calls the interop procedure inside a transaction, so that all the INSERT
operations are a single transaction.
DELETE bigtable_inmem;
GO
BEGIN TRAN
EXEC ins_bigtable @rows_to_INSERT = 1000000;
COMMIT TRAN
Listing 7-8: Inserting a million rows via ins_bigtable, inside a single transaction.
When I executed the EXEC above, it took 14 seconds, which was half the time it took to
insert the same number of rows in separate transactions. The savings here are primarily
due to the reduction in the overhead of managing a million separate transactions.
Since this is an interop procedure, each transaction is both a regular SQL Server
transaction and an in-memory OLTP transaction, so there is a lot of overhead. The difference
is not due to any additional logging, because the memory-optimized table is a SCHEMA_
ONLY table and no logging is done at all.
Lastly, Listing 7-9 calls the natively compiled procedure called ins_native_bigtable
with a parameter of 1,000,000.
DELETE bigtable_inmem;
GO
EXEC ins_native_bigtable @rows_to_INSERT = 1000000;
GO
Listing 7-9: Inserting a million rows via the natively compiled procedure ins_native_bigtable.
Running this natively compiled procedure to insert the same 1,000,000 rows
took only 3 seconds, less than 25% of the time it took to insert the rows through
an interop procedure.
Of course, your results may vary depending on the kinds of operations you are
performing; and keep in mind that I was testing this on SCHEMA_ONLY tables. For this
example, I wanted to show you the impact that native compilation itself could have
without interference from the overhead of disk writes that the CHECKPOINT process is
doing, and any logging that the query thread would have to perform.
By default, SQL Server does not collect execution statistics for natively compiled stored
procedures, because collecting them would slow down the procedure, and natively compiled
procedures were designed not to use any more CPU cycles than absolutely necessary. If you
really need to gather this information, and you want it to be available through
sys.dm_exec_query_stats, you can run one of two stored procedures:
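A sketch of the two calls is shown below; the @new_collection_value parameter name is
an assumption, and passing 1 enables collection while 0 disables it.

-- collect statistics at the procedure level
EXEC sys.sp_xtp_control_proc_exec_stats @new_collection_value = 1;
-- collect statistics for every query inside every natively compiled procedure
EXEC sys.sp_xtp_control_query_exec_stats @new_collection_value = 1;
GO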
Listing 7-10: Syntax for procedures to enable statistics collection for natively compiled procedures.
As suggested, performance decreases when you enable statistics collection, but obviously
collecting statistics at the procedure level with sys.sp_xtp_control_proc_exec_
stats will be less expensive than using sys.sp_xtp_control_query_exec_stats
to gather statistics for every query within every procedure.
If we only need to troubleshoot one, or a few, natively compiled stored procedures, there
is a parameter for sys.sp_xtp_control_query_exec_stats to enable statistics
collection for a single procedure, so we can run sys.sp_xtp_control_query_exec_
stats once for each of those procedures.
Summary
This chapter discussed how to create natively compiled stored procedures to access
memory-optimized tables. These procedures generate far fewer CPU instructions for the
engine to execute than the equivalent interpreted T-SQL stored procedure, and can be
executed directly by the CPU, without the need for further compilation or interpretation.
There are some limitations in the T-SQL constructions allowed in natively compiled
procedures, and so certain transformations that the optimizer might have chosen are not
supported. In addition, because of differences in the way that memory-optimized tables
are organized and managed, the optimizer often needs to make different choices than
it would for a similar operation on a disk-based table. We reviewed some of the
main differences.
When we access memory-optimized tables, which are also compiled, from natively
compiled stored procedures, we have a highly efficient data access path, and the fastest
possible query processing. We examined some Microsoft-generated performance data
to get an idea of the potential size of the performance advantage to be gained from
the use of natively compiled procedures, and we looked at how to run our own perfor-
mance tests, and also how to collect performance diagnostic data from some Dynamic
Management Views.
Additional Resources
Architectural Overview of SQL Server 2014's In-Memory OLTP Technology:
http://blogs.technet.com/b/dataplatforminsider/archive/2013/07/22/architec-
tural-overview-of-sql-server-2014-s-in-memory-oltp-technology.aspx.
Chapter 8: SQL Server Support and
Manageability
SQL Server In-Memory OLTP is an integral part of SQL Server 2014 Enterprise and
Developer editions and it uses the same management tools, including SQL Server
Management Studio.
Most of the standard SQL Server features work seamlessly with memory-optimized
tables. This chapter will discuss feature support, including the new Native Compilation
Advisor, which will highlight unsupported features in any stored procedures that you
wish to convert to natively compiled procedures, and the Memory Optimization Advisor,
which will report on unsupported features in tables that you might want to convert to
memory-optimized tables. We'll then move on to discuss the metrics and metadata objects
added to SQL Server 2014 to help us manage these objects, and to track memory
usage and performance.
To round off the chapter, and the book, I'll summarize some of the key points to
remember when designing efficient memory-optimized tables and indexes, and then
review considerations for migrating existing tables and procedures over to SQL Server
In-Memory OLTP.
Feature Support
In-memory OLTP and databases containing memory-optimized tables support much,
though not all, of the SQL Server feature set. As we've seen throughout the book,
SQL Server Management Studio works seamlessly with memory-optimized tables,
filegroups and natively compiled procedures. In addition, we can use SQL Server Data
Tools (SSDT), Server Management Objects (SMO) and PowerShell to manage our
memory-optimized objects.
Database backup and restore are fully supported, as is log shipping. In terms of other
"High Availability" solutions, AlwaysOn components are supported, but database
mirroring is not, and replication support for memory-optimized tables is limited:
a memory-optimized table can be a subscriber in transactional replication, but not a publisher.
For the full list of supported and unsupported features, please refer to the SQL Server In-Memory OLTP
documentation: http://msdn.microsoft.com/en-us/library/dn133181(v=sql.120).aspx.
In this first version of in-memory OLTP, natively compiled stored procedures support
only a limited subset of the full T-SQL "surface area." Fortunately, SQL Server
Management Studio for SQL Server 2014 includes a tool called Native Compilation
Advisor, shown in Figure 8-1, which will highlight any constructs of an existing stored
procedure that are incompatible with natively compiled procedures.
The Native Compilation Advisor will generate a list of unsupported features used in the
existing procedure, and can generate a report, like the one shown in Figure 8-2.
Another feature that works similarly to the Native Compilation Advisor is the Memory
Optimization Advisor, available from SQL Server Management Studio 2014 when you
right-click on a disk-based table. This tool will report on table features that are
unsupported, such as LOB columns, and IDENTITY columns with an increment other than 1.
This tool will also provide information such as the estimated memory requirement for
the table if it is converted to be memory optimized. Finally, the Memory Optimization
Advisor can actually convert the table to a memory-optimized table, as long as it doesn't
contain unsupported features.
If SQL Server runs short of memory, and you're not able to increase the amount of memory
available to it, you may be forced to drop some of the memory-optimized tables to free
up memory space. This is why it's so important to understand your memory allocation
requirements for memory-optimized tables before beginning to migrate them.
The in-memory OLTP memory manager is fully integrated with the SQL Server memory
manager, and will react to memory pressure, when possible, by becoming more aggressive
in cleaning up old row versions.
When working with SQL Server In-Memory OLTP, remember that it is not necessary for
the whole database to fit in memory; we can work with disk-based tables right alongside
memory-optimized tables.
When trying to predict the amount of memory required for memory-optimized tables,
a rule of thumb is to allow two times the amount of memory needed for the data.
Beyond this, the total memory requirement depends on the workload; if there are a lot
of data modifications due to OLTP operations, you'll need more memory for the row
versions. If the workload comprises mainly reading existing data, there might be less
memory required.
Planning space requirements for hash indexes is straightforward. Each bucket requires
8 bytes, so the memory required is simply the number of buckets times 8 bytes. Planning
space for range indexes is slightly trickier. The size for a range index depends on both
the size of the index key and the number of rows in the table. We can assume each index
row is 8 bytes plus the size of the index key (assume K bytes), so the maximum number of
rows that fit on a page would be 8176 / (K+8). Divide that result into the expected number
of rows to get an initial estimate. Remember that not all index pages are 8 KB, and not
all pages are completely full. As SQL Server needs to split and merge pages, it will need
to create new pages and we need to allow space for them, until the garbage collection
process removes them.
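As a rough, illustrative calculation (the row count and key size below are assumptions, not
figures from the book), the arithmetic can be sketched in T-SQL:

DECLARE @rows BIGINT = 10000000,        -- expected number of rows (assumed)
        @K    INT    = 8;               -- range index key size in bytes (assumed)
DECLARE @rows_per_page INT = 8176 / (@K + 8);          -- 511 index rows per 8 KB page
SELECT CEILING(1.0 * @rows / @rows_per_page)               AS leaf_pages,  -- about 19,570
       CEILING(1.0 * @rows / @rows_per_page) * 8192 / 1048576. AS approx_MB; -- about 153 MB
-- for a hash index, the estimate is simply buckets * 8 bytes,
-- e.g. 1,048,576 buckets * 8 bytes = 8 MB
GO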
The first step is to create a resource pool for the in-memory OLTP database, specifying a
MIN_MEMORY_PERCENT and MAX_MEMORY_PERCENT of the same value. This specifies
the percentage of the SQL Server memory which may be allocated to memory-optimized
tables in databases associated with this pool. Listing 8-1, for example, creates a resource
pool called HkPool and allocates to it 50% of available memory.
Listing 8-1: Create a resource pool for a database containing memory-optimized tables.
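A sketch of such a pool definition follows; the pool name and the 50% figure come from the
text above, and the RECONFIGURE step is the usual way to make a Resource Governor change
effective.

CREATE RESOURCE POOL HkPool
    WITH (MIN_MEMORY_PERCENT = 50, MAX_MEMORY_PERCENT = 50);
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO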
Next, we need to bind the databases that we wish to manage to their respective pools,
using the procedure sp_xtp_bind_db_resource_pool. Note that one pool may
contain many databases, but a database is only associated with one pool at any point
in time.
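A sketch of the binding call is shown here; the database name HkDB is illustrative.

EXEC sp_xtp_bind_db_resource_pool 'HkDB', 'HkPool';
GO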
Listing 8-3: Taking a database offline and then online to allow memory to be associated
with the new resource pool.
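The offline/online cycle itself might look like the following sketch, again with an
illustrative database name.

USE master;
GO
ALTER DATABASE HkDB SET OFFLINE WITH ROLLBACK IMMEDIATE;
ALTER DATABASE HkDB SET ONLINE;
GO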
We can remove the binding between a database and a pool using the procedure
sp_xtp_unbind_db_resource_pool, as shown in Listing 8-4. For example, we
may wish to move the database to a different pool, or to delete the pool entirely, to
replace it with some other pool or pools.
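The unbind call itself is a one-liner, sketched here with the same illustrative database name.

EXEC sp_xtp_unbind_db_resource_pool 'HkDB';
GO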
Listing 8-4: Remove the binding between a database and a resource pool.
The Management Studio report on memory usage by memory-optimized objects shows you
the space used by the table rows and the indexes, as well as
the small amount of space used by the system. Remember that hash indexes will have
memory allocated for the declared number of buckets as soon as they're created, so this
report will show memory usage for those indexes before any rows are inserted. For range
indexes, memory will not be allocated until rows are added, and the memory requirement
will depend on the size of the index keys and the number of rows, as discussed previously.
Several catalog views have been extended for in-memory OLTP. For example, sys.tables
has new columns that identify memory-optimized tables and their durability setting:
•  durability (0 or 1)
•  durability_desc (SCHEMA_AND_DATA or SCHEMA_ONLY)
•  is_memory_optimized (0 or 1).
As a simple example, the query in Listing 8-5 reports which databases on a SQL Server
instance could support memory-optimized tables, based on the requirement of having
a memory-optimized filegroup that contains at least one file. It uses the procedure
sp_MSforeachdb to loop through all databases, and prints a message for each database
that meets the requirements.
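A sketch of such a check is shown below; it assumes that memory-optimized filegroups are
reported with type 'FX' in sys.filegroups, and the message text is my own.

EXEC sp_MSforeachdb
 'USE [?];
  IF EXISTS (SELECT 1
             FROM sys.filegroups AS fg
             JOIN sys.database_files AS f
               ON fg.data_space_id = f.data_space_id
             WHERE fg.type = ''FX'')
      PRINT ''The database [?] can contain memory-optimized tables.'';';
GO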
A new catalog view, sys.hash_indexes, has been added to support hash indexes.
This view is based on sys.indexes, so it has the same columns as that view, with one
extra column added. The bucket_count column shows the number of hash buckets
specified for the index; this value cannot be changed without dropping and
recreating the index.
In addition, there are several new dynamic management objects that provide information
specifically for memory-optimized tables.
sys.dm_db_xtp_checkpoint_stats
Returns statistics about the in-memory OLTP checkpoint operations in the current
database. If the database has no in-memory OLTP objects, returns an empty result set.
sys.dm_db_xtp_checkpoint_files
Displays information about checkpoint files, including file size, physical location and
the transaction ID. For the current checkpoint that has not yet closed, the state column
of this DMV will display UNDER CONSTRUCTION for new files. A checkpoint closes
automatically when the transaction log has grown by 512 MB since the last checkpoint,
or if you issue the CHECKPOINT command.
sys.dm_xtp_merge_requests
Tracks database merge requests. The merge request may have been
generated by SQL Server or the request could have been made by a user,
with sys.sp_xtp_merge_checkpoint_files.
sys.dm_xtp_gc_stats
Provides information about the current behavior of the in-memory OLTP garbage
collection process. The parallel_assist_count represents the number of rows
processed by user transactions and the idle_worker_count represents the rows
processed by the idle worker.
sys.dm_xtp_gc_queue_stats
Provides details of activity on each garbage collection worker queue on the server
(one queue per logical CPU). As described in Chapter 5, the garbage collection thread
adds "work items" to this queue, consisting of groups of "stale" rows, eligible for
garbage collection. By taking regular snapshots of these queue lengths, we can make
sure garbage collection is keeping up with the demand. If the queue lengths remain
steady, garbage collection is keeping up. If the queue lengths are growing over time,
this is an indication that garbage collection is falling behind (and you may need to
allocate more memory).
sys.dm_db_xtp_gc_cycle_stats
For the current database, outputs a ring buffer of garbage collection cycles containing
up to 1024 rows (each row represents a single cycle). As discussed in Chapter 5, to
spread out the garbage collection work, the garbage collection thread arranges
transactions into "generations" according to when they committed, compared to the oldest
active transaction. They are grouped into units of 16 transactions across 16 generations
as follows:
•  Generation 0: stores all transactions that committed earlier than the oldest active
   transaction; the row versions generated by them can be immediately garbage collected.
•  Generations 1–14: store transactions with a timestamp greater than the oldest active
   transaction, meaning that their row versions can't yet be garbage collected. Each
   generation can hold up to 16 transactions, so a total of 224 (14 * 16) transactions
   can exist in these generations.
•  Generation 15: stores the remainder of the transactions with a timestamp greater
   than the oldest active transaction. As with generation 0, there is no limit to the
   number of transactions in generation 15.
sys.dm_db_xtp_hash_index_stats
Provides information on the number of buckets and hash chain lengths for hash
indexes on a table, useful for understanding and tuning the bucket counts (see
Chapter 4). If there are large tables in your database, queries against
sys.dm_db_xtp_hash_index_stats may take a long time, since it needs to scan the entire table.
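For example, a query along the following lines (joining to sys.indexes for the index name)
returns the bucket counts and chain lengths for every hash index in the current database:

SELECT OBJECT_NAME(hs.object_id) AS table_name,
       i.name                    AS index_name,
       hs.total_bucket_count,
       hs.empty_bucket_count,
       hs.avg_chain_length,
       hs.max_chain_length
FROM sys.dm_db_xtp_hash_index_stats AS hs
JOIN sys.indexes AS i
  ON hs.object_id = i.object_id
 AND hs.index_id  = i.index_id;
GO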
sys.dm_db_xtp_nonclustered_index_stats
Provides information about consolidation, split, and merge operations on the
Bw-tree indexes.
sys.dm_db_xtp_index_stats
Contains statistics about index accesses collected since the last database restart.
Provides details of expired rows eligible for garbage collection, detected during index
scans (see Chapter 5).
sys.dm_db_xtp_object_stats
Provides information about the write conflicts and unique constraint violations on
memory-optimized tables.
sys.dm_xtp_system_memory_consumers
Reports system-level memory consumers for in-memory OLTP. The memory for
these consumers comes either from the default pool (when the allocation is in the
context of a user thread) or from the internal pool (if the allocation is in the context
of a system thread).
sys.dm_db_xtp_table_memory_stats
Returns memory usage statistics for each in-memory OLTP table (user and system)
in the current database. The system tables have negative object IDs and are used to
store runtime information for the in-memory OLTP engine. Unlike user objects,
system tables are internal and only exist in memory, therefore they are not visible
through catalog views. System tables are used to store information such as metadata
for all data/delta files in storage, merge requests, watermarks for delta files to filter
rows, dropped tables, and relevant information for recovery and backups. Given that
the in-memory OLTP engine can have up to 8,192 data and delta file pairs, for large
in-memory databases the memory taken by system tables can be a few megabytes.
sys.dm_db_xtp_memory_consumers
Reports the database-level memory consumers in the in-memory OLTP database
engine. The view returns a row for each memory consumer that the engine uses.
sys.dm_xtp_transaction_stats
Reports accumulated statistics about transactions that have run since
the server started.
sys.dm_db_xtp_transactions
Reports the active transactions in the in-memory OLTP database engine
(covered in Chapter 5).
Extended events
The in-memory OLTP engine provides three extended event packages to help in
monitoring and troubleshooting. Listing 8-6 reveals the package names and the number
of events in each package.
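A sketch of the kind of query Listing 8-6 describes follows; it assumes the in-memory OLTP
packages are the ones whose names start with 'Xtp'.

SELECT p.name   AS package_name,
       COUNT(*) AS event_count
FROM sys.dm_xe_packages AS p
JOIN sys.dm_xe_objects  AS o
  ON p.guid = o.package_guid
WHERE o.object_type = 'event'
  AND p.name LIKE 'Xtp%'
GROUP BY p.name;
GO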
Listing 8-6: Retrieve package information for in-memory OLTP extended events.
Listing 8-7 returns the names of all the extended events currently available in the
in-memory OLTP packages.
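Again as a sketch, the previous query can be extended to list the individual event names:

SELECT p.name AS package_name,
       o.name AS event_name
FROM sys.dm_xe_packages AS p
JOIN sys.dm_xe_objects  AS o
  ON p.guid = o.package_guid
WHERE o.object_type = 'event'
  AND p.name LIKE 'Xtp%'
ORDER BY p.name, o.name;
GO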
Listing 8-7: Retrieve the names of the in-memory OLTP extended events.
Performance counters
The in-memory OLTP engine provides performance counters to help in monitoring and
troubleshooting. Listing 8-8 returns the performance counters currently available.
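A sketch of such a query, assuming the in-memory OLTP counter objects carry 'XTP' in
their names:

SELECT object_name, counter_name
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%XTP%'
ORDER BY object_name, counter_name;
GO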
Listing 8-8: Retrieve the names of the in-memory OLTP performance counters.
My results show 51 counters in six different categories. The categories are listed and
described in Table 8-1.
The following are some key points to remember when designing efficient memory-optimized
tables and indexes:
•  Use the COLLATE clause at the column level, specifying a BIN2 collation for every
   character column in a table you want to memory-optimize, rather than at the database
   level, because use at the database level will affect every table and every column in the
   database. Or, specify the COLLATE clause in your queries, where it can be used for any
   comparison, sorting, or grouping operation.
•  Do not over- or underestimate the bucket count for hash indexes if at all possible.
   The bucket count should be at least equal to the number of distinct values for the
   index key columns.
•  For very low-cardinality columns, create range indexes instead of hash indexes.
•  Statistics are not updated automatically, and there are no automatic recompiles of any
   queries on memory-optimized tables.
•  Memory-optimized table variables behave the same as regular table variables, but
   are stored in your database's memory space, not in tempdb. You can consider using
   memory-optimized table variables anywhere, as they are not transactional and can
   help relieve tempdb contention.
USE HKDB;
CREATE TYPE SalesOrderDetailType_inmem
AS TABLE
(
    OrderQty       smallint NOT NULL,
    ProductID      int      NOT NULL,
    SpecialOfferID int      NOT NULL,
    LocalID        int      NOT NULL,
    -- a memory-optimized table type needs at least one index; this index
    -- definition and its bucket count are reconstructions
    INDEX IX_ProductID NONCLUSTERED HASH (ProductID) WITH (BUCKET_COUNT = 131072)
)
WITH (MEMORY_OPTIMIZED = ON);
GO
-- declare a table variable of the new type
DECLARE @SalesDetail SalesOrderDetailType_inmem;
GO
Listing 8-9: Creating a memory-optimized table variable using a memory-optimized table type.
In-memory OLTP is still a new technology and, as of this writing, there are only a few
applications using memory-optimized tables in a production environment (later in the
chapter, I list a few such applications). As more and more applications are deployed and
monitored, best practices will be discovered.
Transaction logging
Log I/O can be another bottleneck with disk-based tables since, in most cases for
OLTP operations, SQL Server writes to the transaction log on disk a separate log
record describing every table and index row modification. In-memory OLTP allows
us to create SCHEMA_ONLY tables that do not require any logging and, even for tables
that must be fully durable, it logs changes far more efficiently than SQL Server does for
disk-based tables.
With disk-based tables, a high volume of concurrent INSERTs that target the end of an
index forces all the new rows onto the same pages, in a particular order. Even if row locks
are being used, there are still latches acquired on the page, and for very high volumes this
can be problematic. Also, the logging required for the inserted rows, and for the index rows
created for each inserted data row, can cause performance degradation if the INSERT
volume is high.
SQL Server In-Memory OLTP addresses these problems by eliminating the need for locks
and latches. Logging overhead is reduced because operations on memory-optimized
tables log their changes more efficiently. In addition, the changes to the indexes are not
logged at all. If the application is such that the INSERT operations initially load data into
a staging table, then creating the staging table to be SCHEMA_ONLY will also remove any
overhead associated with logging the table rows.
Finally, the code to process the INSERTs must be run repeatedly for each row inserted,
and when using interop T-SQL this imposes a lot of overhead. If the code to process the
INSERTs meets the criteria for creating a natively compiled procedure, executing the
INSERTs through compiled code can improve performance dramatically, as demonstrated
in Chapter 7.
Bear in mind, however, that the assumption here is of a typical OLTP workload
consisting of many concurrent SELECTs, each reading a small amount of data. If your
workload consists of SELECTs that each process a large number of rows, then this is more
problematic, since operations on memory-optimized tables are always executed on a
single thread; there is no support for parallel operations.
If you do have datasets that would benefit from parallelism, you can consider moving the
relevant data to a separate disk-based table, where the query optimizer can consider use
of parallelism. Potentially, however, the mere act of separating the data into its own table
may reduce the number of rows that need to be scanned to the point where that table
becomes viable for migration to in-memory OLTP. If the code for processing these rows
can be executed in a natively compiled procedure, the speed improvement for compiled
code can sometimes outweigh the cost of having to run the queries single threaded.
CPU-intensive operations
A common requirement is to load large volumes of data, as discussed previously, but then
to process the data in some way, before it is available for reading by the application. This
processing can involve updating or deleting some of the data, if it is deemed
inappropriate,
priate, or it can involve computations to put the data into the proper form for use.
The biggest bottleneck that the application will encounter in this case is the locking and
latching as the data is read for processing, and then the CPU resources required once
processing is invoked, which will vary depending on the complexity of the code executed.
As discussed, in-memory OLTP can provide a solution for all of these bottlenecks.
In-memory OLTP solves these problems by providing a lock- and latch-free environment.
Also, the ability to run the code in a truly compiled form, with no further compilation or
interpretation required at execution time, can give an enormous performance boost for
these kinds of applications.
In general, we can maintain session state information in the database system, but typically
at a high cost. The state information is usually very dynamic, with each user's information
changing very frequently, and with the need to maintain state for multiple concurrent
users. It also can involve lookup queries for each user to gather other information the
system might be keeping for that user, such as past activity. Although the data maintained
might be minimal in size, the number of requests to access that data can be large, leading
to extreme locking and latching requirements, and resulting in very noticeable bottle-
necks and serious slowdowns in responses to user requests.
In-memory OLTP is the perfect solution for this application requirement, since a small
memory-optimized table can handle an enormous number of concurrent lookups and
modifications. In addition, a session state table is almost always transient and does not
need to be preserved across server restarts, so a SCHEMA_ONLY table can be used to
improve the performance even further.
In most applications, at least some parts will be better served by traditional disk-based
tables, or will see no improvement with in-memory OLTP. If your application meets any
of the following criteria, you may need to reconsider whether in-memory OLTP is the
right choice.
Memory limitations
Memory-optimized tables must reside completely in memory. If the size of the tables
exceeds what SQL Server In-Memory OLTP or a particular machine supports, you will
not be able to have all the required data in memory. Of course, you can have some
memory-optimized tables and some disk-based tables, but you'll need to analyze the
workload carefully to identify those tables that will benefit most from migration to
memory-optimized tables.
Non-OLTP workload
In-memory OLTP, as the name implies, is designed to be of most benefit to
Online Transaction Processing operations. It may offer benefits to other types of
processing, such as reporting and data warehousing, but those are not the design
goals of the feature. If you are working with processing that is not OLTP in nature,
you should carefully test all operations to verify that in-memory OLTP provides
measurable improvements.
Current applications
As noted earlier, there are currently relatively few applications using memory-optimized
tables in a production environment, but the list is growing rapidly. When considering a
migration, you might want to review the published information regarding the types of
application that are already benefiting from running SQL Server In-Memory OLTP;
Microsoft has published several such case studies.
You could choose to just convert one or two critical tables that experience an excessive
number of locks or latches, or long durations on waits for locks or latches, or both.
Convert tables before stored procedures, since natively compiled procedures will only be
able to access memory-optimized tables.
Coverage of these topics is well beyond the scope of this book, but you can take a look at this page
in the SQL Server 2014 documentation to get several pointers on performing this kind of analysis:
http://tinyurl.com/nzzet9b.
Consider the following list of steps as a guide, as you work through a migration to
in-memory OLTP:
3. Address the constructs in the table DDLs that are not supported for memory-
optimized tables. The Memory Optimization Advisor can tell you what constructs are
unsupported for memory-optimized tables.
6. Address the T-SQL limitations in the code. If the code is in a stored procedure, you
can use the Native Compilation Advisor. Recreate the code in a natively compiled
procedure.
You can think of this as a cyclical process. Start with a few tables and convert them, then
convert the most critical procedures that access those tables, convert a few more tables,
and then a couple more procedures. You can repeat this cycle as needed, until you reach
the point where the performance gains are minimal.
You can also consider using a tool called Analysis, Migration and Reporting (AMR),
provided with SQL Server 2014, to help with the performance analysis prior to migrating
to in-memory OLTP.
AMR uses the Management Data Warehouse (MDW) and the data collector, and produces
reports which we can access by right-clicking on the MDW database and choosing
Reports | Management Data Warehouse. You will then have the option to choose
Transaction Performance Analysis Overview.
One of the reports will describe which tables are prime candidates for conversion to
memory-optimized tables, as well as providing an estimate of the size of the effort
required to perform the conversion, based on how many unsupported features the table
currently uses. For example, it will point out unsupported data types and constraints
used in the table.
Another report will contain recommendations on which procedures might benefit from
being converted to natively compiled procedures for use with memory-optimized tables.
Based on recommendations from the MDW reports, you can start converting tables into
memory-optimized tables one at a time, starting with the ones that would benefit most
from the memory-optimized structures. As you start seeing the benefit of the conversion
to memory-optimized tables, you can continue to convert more of your tables, but access
them using your normal T-SQL interface, with very few application changes.
Once your tables have been converted, you can then start planning a rewrite of the code
into natively compiled stored procedures, again starting with the ones that the MDW
reports indicate would provide the most benefit.
Summary
Using SQL Server In-Memory OLTP, we can create and work with tables that are
memory-optimized and extremely efficient to manage, often providing performance
optimization for OLTP workloads. They are accessed with true multi-version optimistic
concurrency control requiring no locks or latches during processing. All in-memory
OLTP memory-optimized tables must have at least one index, and all access is via indexes.
In-memory OLTP memory-optimized tables can be referenced in the same transactions
as disk-based tables, with only a few restrictions. Natively compiled stored procedures are
the fastest way to access your memory-optimized tables and to perform business logic
computations.
If most, or all, of an application's data is able to be entirely memory resident, the costing
rules that the SQL Server optimizer has used since the very first version become almost
completely obsolete, because the rules assume all pages accessed can potentially require
a physical read from disk. If there is no reading from disk required, the optimizer can
use a different costing algorithm. In addition, if there is no wait time required for disk
reads, other wait statistics (such as waiting for locks to be released, waiting for latches to
be available, or waiting for log writes to complete) can become disproportionately large.
In-memory OLTP addresses all these issues. It removes the issues involved in waiting for
locks to be released, using a new type of multi-version optimistic concurrency control.
It also reduces the delays of waiting for log writes by generating far less log data, and
needing fewer log writes.
Additional Resources
Managing Memory for In-Memory OLTP:
http://msdn.microsoft.com/en-us/library/dn465872.aspx.
Using the Resource Governor, an extensive white paper written when the feature was
introduced in SQL Server 2008:
http://bit.ly/1sHhaPQ.
Resource Governor in SQL Server 2012, which covers significant changes in this release:
http://msdn.microsoft.com/en-us/library/jj573256.aspx.
Extended Events: the best place to get a start on working with extended events is
the SQL Server documentation:
http://msdn.microsoft.com/en-us/library/bb630282(v=sql.120).aspx.