SQL Server In-Memory OLTP
Inside the SQL Server 2014 Hekaton Engine
By Kalen Delaney
The right of Kalen Delaney to be identified as the author of this book has been asserted by Kalen Delaney in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored or introduced into a retrieval system, or transmitted, in any form, or by any means (electronic, mechanical, photocopying, recording or otherwise) without the prior written consent of the publisher. Any person who does any unauthorized act in relation to this publication may be liable to criminal prosecution and civil claims for damages. This book is sold subject to the condition that it shall not, by way of trade or otherwise, be lent, re-sold, hired out, or otherwise circulated without the publisher's prior consent in any form other than that in which it is published and without a similar condition including this condition being imposed on the subsequent purchaser.
Cover Image: Andy Martin
Typeset: Peter Woodhouse and Gower Associates
Table of Contents

Chapter 1: What's Special About In-Memory OLTP?
  Isn't In-Memory OLTP Just an Improved DBCC PINTABLE?
  The New In-Memory OLTP Component
    Memory-optimized tables
    Natively compiled stored procedures
    Concurrency improvements: the new MVCC model
    Indexes on memory-optimized tables
    Data durability and recovery
  SQL Server In-Memory OLTP in Context
  Summary
  Additional Resources
Index Basics
Hash Indexes
  Row organization
  Choosing hash indexes
  Determining the number of hash buckets
Range Indexes
  The Bw-tree
  Index page structures
  Internal index maintenance operations
Summary
Additional Resources
Validation Phase
  Validation phase, Step 1: Check for isolation level violations
  Validation phase, Step 2: Commit dependencies
  Validation phase, Step 3: Logging
Post-processing
Garbage Collection of Rows in Memory
Summary
Additional Resources
Foreword
By David J. DeWitt
Microsoft Technical Fellow
Director, Jim Gray Systems Lab, Madison WI
John P. Morgridge Professor of Computer Sciences, Emeritus
University of Wisconsin-Madison
In-memory OLTP is a game changer for relational databases, and OLTP systems in
particular. Processors are not getting faster, but the number of cores and the amount of memory are increasing drastically. Machines with terabytes of memory are available
for under $100K. A new technology is needed to take advantage of the changing
hardware landscape, and Microsoft's In-Memory OLTP, codenamed Project Hekaton,
is that new technology.
Project Hekaton gives us an entirely new way to store and access our relational data,
using lock- and latch-free data structures that allow completely non-blocking data
processing operations. Everything you knew about how your SQL Server data is actually
stored and accessed is different in Hekaton. Everything you understood about how
multiple concurrent processes are handled needs to be reconsidered. All of your planning
for when code is recompiled and reused can be re-evaluated if you choose to use natively
compiled stored procedures to access your Hekaton data.
One of the best things about using this new technology is that it is not all or nothing.
Even if much of your processing is not OLTP, even if your total system memory is
nowhere near the terabyte range, you can choose one or more critical tables to migrate
to the new in-memory structures. You can choose one frequently run stored procedure
to recreate as a natively compiled procedure. And you can see measurable performance
improvements.
A lot of people are already writing and speaking about in-memory OLTP, in blog posts and conference sessions. People are using it and sharing what they've learned. For
those of you who want to know the complete picture about how in-memory OLTP works
and why exactly it's a game changer, and also peek at the deep details of how the data is
stored and managed, this book is for you, all in one place. Kalen Delaney has been writing
about SQL Server internals, explaining how things work inside the engine, for over
20 years. She started working with the Hekaton team at Microsoft over two years ago,
getting the inside scoop from the people who implemented this new technology. And in
this book, she's sharing it all with you.
Acknowledgements
First of all, I would like to thank Kevin Liu of Microsoft, who brought me on board
with the Hekaton project at the end of 2012, with the goal of providing in-depth white
papers describing this exciting new technology. Under Kevin's guidance, I wrote two
white papers, which were published near the release date of each of the CTPs for SQL Server 2014. As the paper got longer with each release, it became clear that a new white paper for the final released product would be as long as a book. So, with Kevin's encouragement, it became the book that you are now reading.
I would also like to thank my devoted reviewers and question answerers at Microsoft,
without whom this work would have taken much longer: Sunil Agarwal, Jos de Bruijn,
and Mike Zwilling were always quick to respond and were very thorough in answering my
sometimes seemingly endless questions.
Others on the SQL Server team who also generously provided answers and/or technical
edits include Kevin Farlee, Craig Freedman, Mike Weiner, Cristian Diaconu, Pooja
Harjani, Paul Larson, and David Schwartz. Thank you for all your assistance and
support. And THANK YOU to the entire SQL Server Team at Microsoft for giving us this
incredible new technology!
INTRODUCTION
The original design of the SQL Server engine assumed that main memory was very
expensive, and so data needed to reside on disk except when it was actually needed for
processing. However, over the past thirty years, the sustained fulfillment of Moore's Law, which predicts that the number of transistors on a chip, and with it the available computing power, will double roughly every two years, has rendered this assumption largely invalid.
Moore's law has had a dramatic impact on the availability and affordability of both large
amounts of memory and multiple-core processing power. Today one can buy a server
with 32 cores and 1 TB of memory for under $50K. Looking further ahead, it's entirely possible that in a few years we'll be able to build distributed DRAM-based systems with capacities of 1–10 petabytes at a cost of less than $5/GB. It is also only a question of time
before non-volatile RAM becomes viable as main-memory storage.
At the same time, the near-ubiquity of 64-bit architectures removes the previous 4 GB
limit on "addressable" memory and means that SQL Server has, in theory, near-limitless
amounts of memory at its disposal. This has helped to significantly drive down latency
time for read operations, simply because we can fit so much more data in memory. For
example, many, if not most, of the OLTP databases in production can fit entirely in 1 TB.
Even for the largest financial, online retail and airline reservation systems, with databases
between 500 GB and 5 TB in size, the performance-sensitive working dataset, i.e. the
"hot" data pages, is significantly smaller and could reside entirely in memory.
However, the fact remains that the traditional SQL Server engine is optimized for disk-based storage, for reading specific 8 KB data pages into memory for processing, and
writing specific 8 KB data pages back out to disk after data modification, having first
"hardened" the changes to disk in the transaction log. Reading and writing 8 KB data
pages from and to disk can generate a lot of random I/O and incurs a higher latency cost.
In fact, given the amount of data we can fit in memory, and the high number of cores
available to process it, the end result has been that most current SQL Server systems are
I/O bound. In other words, the I/O subsystem struggles to "keep up," and many organizations sink huge sums of money into the hardware that they hope will improve write
latency. Even when the data is in the buffer cache, SQL Server is architected to assume
that it is not, which leads to inefficient CPU usage, with latching and spinlocks. Assuming
all, or most, of the data will need to be read from disk also leads to unrealistic cost estimations for the possible query plans and a potential for not being able to determine which
plans will really perform best.
As a result of these trends, and the limitations of traditional disk-based storage structures,
the SQL Server team at Microsoft began building a database engine optimized for large
main memories and many-core CPUs, driven by the recognition that systems designed for
a particular class of workload can frequently outperform more general purpose systems
by a factor of ten or more. Most specialized systems, including those for Complex Event
Processing (CEP), Data Warehousing and Business Intelligence (DW/BI) and Online
Transaction Processing (OLTP), optimize data structures and algorithms by focusing on
in-memory structures.
The team set about building a specialized database engine specifically for in-memory
workloads, which could be tuned just for those workloads. The original concept was
proposed at the end of 2008, envisioning a relational database engine that was 100
times faster than the existing SQL Server engine. In fact, the codename for this feature,
Hekaton, comes from the Greek word hekaton (ἑκατόν), meaning 100.
Serious planning and design began in 2010, and product development began in 2011. At
that time, the team did not know whether the current SQL Server could support this new
concept, and the original vision was that it might be a separate product. Fortunately, it
soon became evident that, although the framework could support building stand-alone
processors (discussion of the framework is well beyond the scope of this book), it would
be possible to incorporate the "in-memory" processing engine into SQL Server itself.
The team then established four main goals as the foundation for further design
and planning:
1. Optimized for data that was stored completely in-memory but was also durable on
SQL Server restarts.
2. Fully integrated into the existing SQL Server engine.
3. Very high performance for OLTP operations.
4. Architected for modern CPUs (e.g. use of complex atomic instructions).
SQL Server In-Memory OLTP, formerly known and loved as Hekaton, meets all of these
goals, and in this book you will learn how it meets them. The focus will be on the features
that allow high performance for OLTP operations. As well as eliminating read latency,
since the data will always be in memory, fundamental changes to the memory-optimized
versions of tables and indexes, as well as changes to the logging mechanism, mean that
in-memory OLTP also offers greatly reduced latency when writing to disk.
The first four chapters of the book offer a basic overview of how the technology works
(Chapter 1), how to create in-memory databases and tables (Chapter 2), the basics of row
versioning and the new multi-version concurrency control model (Chapter 3), and how
memory-optimized tables and their indexes store data (Chapter 4).
Chapters in the latter half of the book focus on how the new in-memory engine delivers
the required performance boost, while still ensuring transactional consistency (ACID
compliance). In order to deliver on performance, the SQL Server team realized they
had to address some significant performance bottlenecks. Two major bottlenecks were
the traditional locking and latching mechanisms: if the new in-memory OLTP engine
retained these mechanisms, with the waiting and possible blocking that they could cause,
it could negate much of the benefit inherent in the vastly increased speed of in-memory
processing. Instead, SQL Server In-Memory OLTP delivers a completely lock- and latch-free system, and true optimistic multi-version concurrency control (Chapter 5).
Other potential bottlenecks were the existing CHECKPOINT and transaction logging
processes. The need to write to durable storage still exists for in-memory tables, but in
SQL Server In-Memory OLTP these processes are adapted to be much more efficient, in
order to prevent them becoming performance limiting, especially given the potential to
support vastly increased workloads (Chapter 6).
The final bottleneck derives from the fact that the SQL Server query processor is essentially an interpreter; it re-processes statements continually, at runtime. It is not a true
compiler. Of course, this is not a major performance concern, when the cost of physically reading data pages into memory from disk dwarfs the cost of query interpretation.
However, once there is no cost of reading pages, the difference in efficiency between
interpreting queries and running compiled queries can be enormous. Consequently,
the new SQL Server In-Memory OLTP engine component provides the ability to create
natively compiled procedures, i.e. machine code, for our most commonly executed data
processing operations (Chapter 7).
Finally, we turn our attention to tools for managing SQL Server In-Memory OLTP structures, for monitoring and tuning performance, and to considerations for migrating existing OLTP workloads over to in-memory (Chapter 8).
SQL Server In-Memory OLTP is a new technology and this is not a book specifically on
performance tuning and best practices. However, as you learn about how the Hekaton
engine works internally to process your queries, certain best practices and opportunities
for performance tuning will become obvious.
This book does not assume that you're a SQL Server expert, but I do expect that you have
basic technical competency and familiarity with the standard SQL Server engine, and
relative fluency with basic SQL statements.
You should have access to a SQL Server 2014 installation, even if it is the Evaluation
edition, available free from Microsoft:
http://technet.microsoft.com/en-gb/evalcenter/dn205290.aspx.
Figure 1-1: The SQL Server engine including the in-memory OLTP components: memory-optimized tables and indexes, the in-memory OLTP compiler, natively compiled stored procedures and their schema, the query interop layer, and the existing parser, catalog, optimizer, and interpreted T-SQL query execution.
In-memory OLTP also supports natively compiled stored procedures, an object type
that is compiled to machine code by a new in-memory OLTP compiler and which has the
potential to offer a further performance boost beyond that available solely from the use of
memory-optimized tables. The standard counterpart is interpreted T-SQL stored procedures, which is what SQL Server has always used. Natively compiled stored procedures
can reference only memory-optimized tables.
The Query Interop component allows interpreted T-SQL to reference memory-optimized
tables. If a transaction can reference both memory-optimized tables and disk-based
tables, we refer to it as a cross-container transaction.
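For illustration, here is a minimal sketch of a cross-container transaction written in interpreted T-SQL. The table names dbo.OrdersArchive (disk-based) and dbo.Orders (memory-optimized) are hypothetical, and the SNAPSHOT table hint used for the memory-optimized access is explained in Chapter 5.

-- Hypothetical cross-container transaction: one disk-based table and
-- one memory-optimized table referenced in a single explicit transaction.
BEGIN TRAN;

    -- Disk-based access runs under the session's isolation level (READ COMMITTED by default).
    SELECT COUNT(*) FROM dbo.OrdersArchive;

    -- Memory-optimized access inside an explicit transaction needs an isolation hint,
    -- typically SNAPSHOT (see Chapter 5).
    SELECT COUNT(*) FROM dbo.Orders WITH (SNAPSHOT);

COMMIT TRAN;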
Memory-optimized tables
This section takes a broad look at three of the key differences between memory-optimized tables and their disk-based counterparts; subsequent chapters will fill in
the details.
In contrast, SQL Server In-Memory OLTP introduces a truly optimistic MVCC model.
It uses row versioning but its implementation bears little relation to the snapshot-based
model used for disk-based tables. When accessing memory-optimized tables and index
structures, SQL Server still supports the ACID properties of transactions, but it does so
without ever using locking or latching to provide transaction isolation. This means that
no transaction ever has, for lock-related reasons, to wait to read or modify a data row.
Readers never block writers, writers never block readers, and writers never block writers.
Waits on memory-optimized tables
Transactions never acquire locks on memory-optimized tables, so they never have to wait to acquire
them. However, this does not mean there is never any waiting when working with memory-optimized tables in a multi-user system. The waiting that does occur is usually of very short duration, such
as when SQL Server is waiting for dependencies to be resolved during the validation phase of transaction
processing (more on the validation phase in Chapters 3 and 5). Transactions might also need to wait for
log writes to complete although, since the logging required when making changes to memory-optimized
tables is much more efficient than logging for disk-based tables, the wait times will be much shorter.
No locks
Operations on disk-based tables implement the requested level of transaction isolation
by using locks to make sure that a transaction (Tx2) cannot change data that another
transaction (Tx1) needs to remain unchanged.
In a traditional relational database system, in which SQL Server needs to read pages from
disk before it can process them, the cost of acquiring and managing locks can be just a
fraction of the total wait time. Often, this cost is dwarfed by the overhead of waiting for
disk reads, and managing the pages in the buffer pool.
However, if SQL Server were to acquire locks on memory-optimized tables, then locking
waits would likely become the major overhead, since there is no cost at all for reading
pages from disk.
Instead, the team designed SQL Server In-Memory OLTP to be a totally lock-free system.
Fundamentally, this is possible because SQL Server never modifies existing rows, and so there is no need to lock them. Instead, an UPDATE operation creates a new version by
marking the previous version of the row as deleted, and then inserting a new version of
the row with new values. If a row is updated multiple times, there may be many versions
of the same row existing simultaneously. SQL Server presents the correct version of the
row to the requesting transaction by examining timestamps stored in the row header and
comparing them to the transaction start time.
No latches
Latches are lightweight synchronization mechanisms (often called primitives as they are
the smallest possible synchronization device), used by the SQL Server engine to guarantee
consistency of the data structures that underpin disk-based tables, including index and
data pages as well as internal structures such as non-leaf pages in a B-tree. Even though
latches are quite a bit lighter weight than locks, there can still be substantial overhead and
wait time involved in using latches.
SQL Server In-Memory OLTP also continuously persists the table data to disk in special
checkpoint files. It uses these files only for database recovery, and only ever writes to
them "offline," using a background thread. Therefore, when we create a database that will
use memory-optimized data structures, we must create, not only the data file (used only
for disk-based table storage) and the log file, but also a special MEMORY_OPTIMIZED_
DATA filegroup that will contain the checkpoint file pairs, each pair consisting of a data
checkpoint file and a delta checkpoint file (more on these in Chapter 2).
Summary
This first chapter took a first, broad-brush look at the new SQL Server In-Memory OLTP
engine. Memory-optimized data structures are entirely resident in memory, so user
processes will always find the data they need by traversing these structures in memory,
without the need for disk I/O. Furthermore, the new MVCC model means that
SQL Server can mediate concurrent access of these data structures, and ensure ACID
transaction properties, without the use of any locks and latches; no user transactions
against memory-optimized data structures will ever be forced to wait to acquire a lock!
Natively compiled stored procedures provide highly efficient data access to these data
structures, offering a further performance boost. Even the logging mechanisms for
memory-optimized tables, to ensure transaction durability, are far more efficient than
for standard disk-based tables.
Combined, all these features make the use of SQL Server In-Memory OLTP a very
attractive proposition for many OLTP workloads. Of course, as ever, it is no silver bullet.
While it can and will offer substantial performance improvements to many applications,
its use requires careful planning, and almost certainly some redesign of existing tables
and procedures, as we'll discuss as we progress deeper into this book.
Additional Resources
As with any "v1" release of a new technology, the pace of change is likely to be rapid. We
plan to revise this book to reflect significant advances in subsequent releases, but in the
meantime it's likely that new online information about in-memory OLTP will appear with
increasing frequency.
Creating Databases
Any database that will contain memory-optimized tables needs to have a single
MEMORY_OPTIMIZED_DATA filegroup containing at least one container, which stores
the checkpoint files needed by SQL Server to recover the memory-optimized tables.
These are the checkpoint data and delta files that we introduced briefly in Chapter 1.
SQL Server populates these files during CHECKPOINT operations, and reads them during
the recovery process, which we'll discuss in Chapter 6.
The syntax for creating a MEMORY_OPTIMIZED_DATA filegroup is almost the same as that
for creating a regular FILESTREAM filegroup, but it must specify the option CONTAINS
MEMORY_OPTIMIZED_DATA. Listing 2-1 provides an example of a CREATE DATABASE
statement for a database that can support memory-optimized tables (edit the path names
to match your system; if you create the containers on the same drive you'll need to
differentiate the two file names).
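A sketch along the lines of what Listing 2-1 describes follows; the drive letters and file sizes are placeholders and should be adjusted to your system.

CREATE DATABASE HKDB
ON PRIMARY
    ( NAME = N'HKDB_data', FILENAME = N'Q:\data\HKDB_data.mdf', SIZE = 500MB ),
-- The memory-optimized filegroup, with two containers (directories) on separate drives.
FILEGROUP HKDB_mod_fg CONTAINS MEMORY_OPTIMIZED_DATA
    ( NAME = N'HKDB_mod_dir',  FILENAME = N'R:\data\HKDB_mod_dir' ),
    ( NAME = N'HKDB_mod_dir2', FILENAME = N'S:\data\HKDB_mod_dir' )
LOG ON
    ( NAME = N'HKDB_log', FILENAME = N'L:\log\HKDB_log.ldf', SIZE = 500MB );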
Listing 2-1:
Creating a database.
In Listing 2-1, we create a regular data file (HKDB_data.mdf), used for disk-based table
storage only, and a regular log file (HKDB_log.ldf). In addition, we create a memory-optimized filegroup, HKDB_mod_fg with, in this case, two file containers each called
HKDB_mod_dir. These containers host data and delta checkpoint file pairs to which
the CHECKPOINT process will write data, for use during database recovery. The data
checkpoint file stores inserted rows and the delta files reference deleted rows. The data
and delta file for each pair may be in the same or different containers, depending on the
number of containers specified. In this case, with two containers, one will contain the
data checkpoint files and the other the delta checkpoint files, for each pair. If we had only
one container, it would contain both the data and delta files.
Notice that we place the primary data file, each of the checkpoint file containers, and the
transaction log, on separate drives. Even though the data in a memory-optimized table is
never read from or written to disk "inline" during query processing, it can still be useful
to consider placement of your checkpoint files and log file for optimum I/O performance
during logging, checkpoint, and recovery.
Listing 2-2:
Creating Tables
The syntax for creating memory-optimized tables is almost identical to the syntax for
creating disk-based tables, but with a few required extensions, and a few restrictions on
the data types, indexes, constraints and other options that memory-optimized tables can
support.
To specify that a table is a memory-optimized table, we use the MEMORY_OPTIMIZED =
ON clause. Apart from that, and assuming we're using only the supported data types and
other objects, the only other requirement is that we include at least one index, as part of
the CREATE TABLE statement. Listing 2-3 shows a basic example.
USE HKDB;
GO
CREATE TABLE T1
(
    [Name] varchar(32) not null PRIMARY KEY NONCLUSTERED HASH
        WITH (BUCKET_COUNT = 100000),
    [City] varchar(32) null,
    [State_Province] varchar(32) null,
    [LastModified] datetime not null
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
Listing 2-3: Creating a memory-optimized table.
Durability
We can define a memory-optimized table with one of two DURABILITY values: SCHEMA_
AND_DATA or SCHEMA_ONLY, with the former being the default. If we define a memory-optimized table with DURABILITY = SCHEMA_ONLY, then SQL Server will not log changes
to the table's data, nor will it persist the data in the table to the checkpoint files, on disk.
However, it will still persist the schema (i.e. the table structure) as part of the database
metadata, so the empty table will be available after the database is recovered during a
SQL Server restart.
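A sketch along the lines of Listing 2-4, creating a non-durable table; the table name and columns here are hypothetical, chosen to suggest the kind of transient, session-style data that a SCHEMA_ONLY table suits.

-- Only the schema of this table is persisted; its data is lost on restart.
CREATE TABLE SessionCache
(
    [SessionID] int NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 100000),
    [UserName]  varchar(32) NULL,
    [LoginTime] datetime NOT NULL
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);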
Listing 2-4:
For non-PRIMARY KEY columns the NONCLUSTERED keyword is optional, but we have to
specify it explicitly when defining the PRIMARY KEY because otherwise SQL Server will
try to create a clustered index, the default for a PRIMARY KEY, and will generate an error
because clustered indexes are not allowed on memory-optimized tables.
For composite indexes, we create them after the column definitions. Listing 2-5 creates a
new table, T2, with the same hash index for the primary key on the Name column, plus a
range index on the City and State_Province columns.
If you're wondering why we created a new table, T2, rather than just adding the
composite index to the existing T1, it's because SQL Server stores the structure of
in-memory tables as part of the database metadata, and so we can't alter those tables
once created.
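A sketch along the lines of Listing 2-5; the index name and collation are assumptions, and the City and State_Province columns are declared NOT NULL with a BIN2 collation because, in SQL Server 2014, index key columns cannot be nullable and indexed character columns must use a BIN2 collation.

CREATE TABLE T2
(
    [Name] varchar(32) COLLATE Latin1_General_100_BIN2 NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 100000),
    [City] varchar(32) COLLATE Latin1_General_100_BIN2 NOT NULL,
    [State_Province] varchar(32) COLLATE Latin1_General_100_BIN2 NOT NULL,
    [LastModified] datetime NOT NULL,

    -- Composite range index, declared inline after the column definitions.
    INDEX T2_ndx_City_StateProvince NONCLUSTERED ( [City], [State_Province] )
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);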
Listing 2-5:
In short, no schema changes are allowed once a table is created so, instead of using
ALTER TABLE, we must drop and recreate the table. Likewise, we cannot use procedure
sp_rename with memory-optimized tables, to change either the table name or any
column names.
Also note that there are no specific index DDL commands (i.e. CREATE INDEX,
ALTER INDEX, DROP INDEX). We always define indexes as part of the table creation.
There are a few other restrictions and limitations around the use of indexes, constraints
and other properties, during table creation, as follows:
no FOREIGN KEY or CHECK constraints
IDENTITY columns can only be defined with SEED and INCREMENT of 1
no UNIQUE indexes other than for the PRIMARY KEY
a maximum of 8 indexes, including the index supporting the PRIMARY KEY.
Also note that we can't create DML triggers on a memory-optimized table.
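The queries below are a sketch of the kind of statements Listing 2-6 contains, assuming the HKDB database itself was created with a BIN2 collation and that T1 holds a row with Name = 'Greg'; the exact statements are assumptions.

-- Succeeds: object name, column names, and literal all match the stored case exactly.
SELECT [Name], [City]
FROM   T1
WHERE  [Name] = 'Greg';

-- Fails to resolve the object name under a database-wide BIN2 collation,
-- because identifier matching is also case sensitive ('t1' <> 'T1').
SELECT [Name], [City]
FROM   t1
WHERE  [Name] = 'Greg';

-- Resolves, but returns no rows: the data comparison is case sensitive,
-- so 'greg' does not match the stored value 'Greg'.
SELECT [Name], [City]
FROM   T1
WHERE  [Name] = 'greg';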
Listing 2-6:
The alternative, as discussed earlier, is to create each character column in a given table with a BIN2 collation. For example, if we rerun Listing 2-1 without specifying a database collation then, when recreating table T1, we would specify the collation for each character column as part of the CREATE TABLE statement, and it would be obligatory to specify a BIN2 collation on the Name column, since this column participates in the hash index.
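A sketch along the lines of Listing 2-7: the same T1 definition as in Listing 2-3, but with a BIN2 collation specified explicitly on the Name column; the particular collation name is an assumption.

CREATE TABLE T1
(
    -- The hash index key column gets an explicit BIN2 collation.
    [Name] varchar(32) COLLATE Latin1_General_100_BIN2 NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 100000),
    [City] varchar(32) NULL,
    [State_Province] varchar(32) NULL,
    [LastModified] datetime NOT NULL
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);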
Listing 2-7:
Rerunning the queries in Listing 2-6, we'll see that this eliminates the case sensitivity on
the table and column names, but the data case sensitivity remains.
Finally, remember that tempdb will use the collation for the SQL Server instance. If
the instance does not use the same BIN2 collation, then any operations that use
tempdb objects may encounter collation mismatch problems. One solution is to
use COLLATE database_default for the columns on any temporary objects.
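For example, a temporary table column can be declared with COLLATE database_default so that it follows the user database's collation rather than tempdb's; the table and column names here are hypothetical.

-- Avoids collation-mismatch errors when joining #NameList to the memory-optimized tables.
CREATE TABLE #NameList
(
    [Name] varchar(32) COLLATE database_default NOT NULL
);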
Interpreted T-SQL
When accessing memory-optimized tables using interpreted T-SQL, via the interop, we
have access to virtually the full T-SQL surface area (i.e. the full list of statements and
expressions). However, we should not expect the same performance as when we access
memory-optimized tables using natively compiled stored procedures (Chapter 7 shows a
performance comparison).
Summary
This chapter covered the basics of creating databases, tables and indexes to store memory-optimized data. In creating the database, we must define a special memory-optimized filegroup, which is built on the FILESTREAM technology. When creating a memory-optimized table, we just have to specify a special MEMORY_OPTIMIZED = ON clause, and
create at least one index on the table. It sounds simple, and it is, but we have to remember
that there are currently quite a number of restrictions on the data types, indexes,
constraints, and other options, that memory-optimized tables can support. Also, any
character column that participates in an index must use a BIN2 collation, which might
affect the results of queries against this column.
We can access memory-optimized data structures with T-SQL, either in interop mode or
via natively compiled stored procedures. In the former case, we can use more or less the
full T-SQL surface area, but in the latter case, there is a longer list of restrictions.
Additional Resources
Details of supported query constructs in natively compiled procedures:
http://msdn.microsoft.com/en-us/library/dn452279(v=sql.120).aspx.
Details of T-SQL Constructs Not Supported by In-Memory OLTP:
http://msdn.microsoft.com/en-us/library/dn246937(v=sql.120).aspx.
White paper discussing SQL Server Filestream storage, explaining how files in the
filegroups containing memory-optimized data are organized and managed internally:
http://download.microsoft.com/download/D/2/0/D20E1C5F-72EA-4505-9F26FEF9550EFD44/FILESTREAMStorage.docx.
Row Structure
The data rows that comprise in-memory tables have a structure very different than
the row structures used for disk-based tables. Each row consists of a row header, and a
payload containing the row attributes (the actual data). Figure 3-1 shows this structure, as
well as expanding on the content of the header area.
Figure 3-1: The structure of a row in a memory-optimized table: a row header containing the Begin-Ts and End-Ts timestamps (8 bytes each), a 4-byte StmtId, a 2-byte IdxLinkCount, padding, and the index pointers, followed by the payload containing the row data.
Row header
The row header for every data row consists of the following fields:
Begin-Ts – the "insert" timestamp. It reflects the time that the transaction that inserted a row issued its COMMIT.
End-Ts – the "delete" timestamp. It reflects the time that the transaction that deleted a row issued its COMMIT.
Payload area
The payload is the row data itself, containing the index key columns plus all the other
columns in the row, meaning that all indexes on a memory-optimized table can be
thought of as covering indexes. The payload format can vary depending on the table, and
based on the table's schema. As described in Chapter 1, the in-memory OLTP compiler
generates the DLLs for table operations. These DLLs contain code describing the payload
format, and so can also generate the appropriate commands for all row operations.
Figure 3-2: Two row versions, <Greg, Beijing> and <Susan, Bogota>, each with a Begin-Ts of 20 and an End-Ts of infinity.
We can see that a transaction inserted the rows <Greg, Beijing> and <Susan, Bogota> at
timestamp 20. Notice that SQL Server uses a special value, referred to as "infinity," for the
End-Ts value for rows that are active (not yet marked as invalid).
We're now going to assume that a user-defined transaction, with a Transaction-ID of
Tx1, starts at timestamp 90 and will:
delete <Susan, Bogota>
update <Greg, Beijing> to <Greg, Lisbon>
insert <Jane, Helsinki>.
Let's see how the row versions and their values evolve during the three basic stages
(processing, validation, and post-processing) of this transaction, and how SQL Server
controls which rows are visible to other active concurrent transactions.
Processing phase
During the processing stage, SQL Server processes the transaction, creating new row
versions (and linking them into index structures covered in Chapter 4), and marking
rows for deletion as necessary, as shown in Figure 3-3.
Figure 3-3: Row versions during the processing phase: the existing <Greg, Beijing> and <Susan, Bogota> versions have their End-Ts set to the Transaction-ID Tx1, while the new versions <Greg, Lisbon> and <Jane, Helsinki> have Tx1 as their Begin-Ts.
During processing, SQL Server uses the Transaction-ID for the Begin-Ts value of
any row it needs to insert, and for the End-Ts value for any row that it needs to mark for
deletion. SQL Server uses an extra bit flag to indicate to other transactions that these are
transaction ids, not timestamps.
So, to delete the <Susan, Bogota> row (remember, the row isn't actually removed during
processing; it's more a case of marking it as deleted), transaction Tx1 first locates the row,
via one of the indexes, and then sets the End-Ts value to the Transaction-ID for Tx1.
Validation phase
Once our transaction Tx1 issues a commit, and SQL Server generates the commit
timestamp, the transaction enters the validation phase. While SQL Server will immediately detect direct update conflicts, such as those discussed in the previous section,
it is not until the validation phase that it will detect other potential violations of the
properties specified by the transaction isolation level. So, for example, let's say Tx1 was
accessing the memory-optimized table in REPEATABLE READ isolation. It reads a row
value and then Tx2 updates that row value, which it can do because SQL Server acquires
no locks in the MVCC model, and issues a commit before Tx1 commits. When Tx1
enters the validation phase, it will fail the validation check and SQL Server will abort the
transaction. If there are no violations, SQL Server proceeds with other actions that will
culminate in guaranteeing the durability of the transaction.
Post-processing
In this stage, SQL Server writes the commit timestamp into the row header of all affected
rows (note this is the timestamp from when Tx1 first issued the commit). Therefore, our
final row versions look as shown in Figure 3-4.
Figure 3-4: The final row versions after post-processing: <Greg, Beijing> and <Susan, Bogota> have a validity interval of 20 to 120, while <Greg, Lisbon> and <Jane, Helsinki> have a Begin-Ts of 120.
As noted earlier, the storage engine has no real notion of row "versions." There is no
implicit or explicit reference that relates one version of a given row to another. There are
just rows, connected together by the table's indexes, as we'll see in the next chapter, and
visible to active transactions, or not, depending on the validity interval of the row version
compared to the logical read time of the accessing transaction.
In Figure 3-4, the rows <Greg, Beijing> and <Susan, Bogota> have a validity interval of
20 to 120 and so any user transaction with a starting timestamp greater than or equal
to 20 and less than 120, will still see those row versions. Any transaction with a starting
timestamp greater than 120 will see <Greg, Lisbon> and <Jane, Helsinki>.
Summary
The SQL Server In-Memory OLTP engine supports true optimistic concurrency, via an MVCC model based on in-memory row versioning. This chapter described the row structure
that underpins the MVCC model, and then examined how SQL Server maintains multiple
row versions, and determines the correct row version that each concurrent transaction
should access. This model means that SQL Server can avoid read-write conflicts without
the need for any locking or latching, and will raise write-write conflicts immediately,
rather than after a delay (i.e. rather than blocking for the duration of a lock-holding
transaction).
In the next chapter, we'll examine how SQL Server uses indexes to connect all rows that
belong to a single in-memory table, as well as to optimize row access.
Additional Resources
Hekaton: SQL Server's Memory-Optimized OLTP Engine, a white paper by Microsoft Research:
http://research.microsoft.com/pubs/193594/Hekaton%20-%20Sigmod2013%20final.pdf.
Table and Row Size in Memory-Optimized Tables:
http://msdn.microsoft.com/en-us/library/dn205318(v=sql.120).aspx.
Index Basics
To summarize briefly some of what we've discussed about the "rules" governing the use
of indexes on memory-optimized tables:
all memory-optimized tables must have at least one index
a maximum of 8 indexes per table, including the index supporting the PRIMARY KEY
no UNIQUE indexes other than for the PRIMARY KEY
we can't alter a table after creating it, so we must define all indexes at the time we create the memory-optimized table – SQL Server writes the number of index pointers, and therefore the number of indexes, into the row header on table creation
indexes on memory-optimized tables are entirely in-memory structures – SQL Server never logs any changes made to data rows in indexes during data modification
during database recovery, SQL Server recreates all indexes based on the index definitions. We'll go into detail in Chapter 6, Logging, Checkpoint, and Recovery.
With a maximum limit of 8 indexes, all of which we must define on table creation,
we must exert even more care than usual to choose the correct and most useful set
of indexes.
We discussed earlier in the book how data rows are not stored on pages, so there is no
collection of pages or extents, and there are no partitions or allocation units. Similarly,
although we do refer to index pages in in-memory range indexes, they are very different
structures from their disk-based counterparts.
In disk-based indexes, the pointers locate physical, fixed-size pages on disk. As we
modify data rows, we run into the problem of index fragmentation, as gaps appear in
pages during DELETEs, and page splits occur during INSERTs and UPDATEs. Once this
fragmentation occurs, the I/O overhead associated with reads and writes grows.
Hash Indexes
A hash index, which is stored as a hash table, consists of an array of pointers, where
each element of the array is called a hash bucket and stores a pointer to the location in
memory of a data row. When we create a hash index on a column, SQL Server applies
a hash function to the value in the index key column in each row and the result of the
function determines which bucket will contain the pointer for that row.
More on hashing
Hashing is a well-known search algorithm, which stores data based on a hash key generated by applying
a hash function to the search key (in this case, the index key). A hash table can be thought of as an array
of "buckets," one for each possible value that the hash function can generate, and each data element (in
this case, each data row) is added to the appropriate bucket based on its index key value. When searching,
the system will apply the same hash function to the value being sought, and will only have to look in a
single bucket. For more information about what hashing and hash searching are all about, take a look at
the Wikipedia article at: http://en.wikipedia.org/wiki/Hash_function.
Let's say we insert the first row into a table and the index key value hashes to the
value 4. SQL Server stores a pointer to that row in hash bucket "4" in the array. If a
transaction inserts a new row into the table, where the index key value also hashes to "4,"
it becomes the first row in the chain of rows accessed from hash bucket 4 in the index,
and the new row will have a pointer to the original row.
Row organization
As discussed previously, SQL Server stores these index pointers in the index pointer array
in the row header. Figure 4-1 shows two rows in a hash index on a name column. For this
example, assume there is a very simple hash function that results in a value equal to the
length of the string in the index key column. The first value of Jane will then hash to 4,
and Susan to 5, and so on. In this simplified illustration, different key values (Jane and
Greg, for example) will hash to the same bucket, which is a hash collision. Of course, the
real hash function is much more random and unpredictable, but I am using the length
example to make it easier to illustrate.
The figure shows the pointers from the 4 and 5 entries in the hash index to the rows
containing Jane and Susan, respectively. Neither row points to any other rows, so the
index pointer in each of the row headers is NULL.
Figure 4-1: A hash index on Name, with buckets 4 and 5 pointing to the rows <Jane, Helsinki> (Begin-Ts 50) and <Susan, Vienna> (Begin-Ts 70) respectively; the index pointer in each row header is NULL.
In Figure 4-1, we can see that the <Jane, Helsinki> and <Susan, Vienna> rows have a
Begin-Ts timestamp of 50 and 70 respectively, and each is the current, active version
of that row.
In Figure 4-2, a transaction, which committed at timestamp 100, has added to the
same table a row with a name value of Greg. Using our string length hash function, Greg
hashes to 4, and so maps to the same bucket as Jane, and the row is linked into the same
chain as the row for Jane. The <Greg, Beijing> row has a pointer to the <Jane, Helsinki>
row and SQL Server updates the hash index to point to Greg. The <Jane, Helsinki> row
needs no changes.
Figure 4-2: After a transaction committing at timestamp 100 inserts <Greg, Beijing>, the bucket for hash value 4 points to the Greg row, whose index pointer in turn links to <Jane, Helsinki>.
Finally, what happens if another transaction, which commits at timestamp 200, updates
<Greg, Beijing> to <Greg, Lisbon>? The new version of Greg's row is simply linked in as
any other row, and will be visible to active transactions depending on their timestamps,
as described in Chapter 3. Every row has at least one pointer to it, either directly from the
hash index bucket or from another row. In this manner, each index provides an access
path to every row in the table, in the form of a singly-linked list joining every row in
the table.
Figure 4-3: After an update committing at timestamp 200, the new version <Greg, Lisbon> heads the chain for bucket 4 and points to <Greg, Beijing> (now with End-Ts 200), which still points to <Jane, Helsinki>.
Of course, this is just a simple example with one index, in this case a hash index, which
is the minimum required to link the rows together. However, for query performance
purposes, we may want to add other hash indexes (as well as range indexes).
For example, if equality searches on the City column are common, and if it were quite a
selective column (small number of repeated values), then we might decide to add a hash
index to that column, too. This creates a second index pointer field. Each row in the table
now has two pointers pointing to it, and can point to two rows, one for each index. The
first pointer in each row points to the next value in the chain for the Name index; the
second pointer points to the next value in the chain for the City index.
Figure 4-4: With a second hash index added on City, each row header contains two index pointers: one for the chain belonging to the Name index and one for the chain belonging to the City index.
Now we have another access path through the rows, using the second hash index.
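A sketch along the lines of Listing 4-1, declaring a hash index with an explicit BUCKET_COUNT of 50,000; the table and column names here are hypothetical.

CREATE TABLE T3
(
    [OrderID] int NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 50000),
    [OrderDate] datetime NOT NULL
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);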
Listing 4-1:
SQL Server rounds up the number we supply for the BUCKET_COUNT to the next power
of two, so it will round up a value of 50,000 to 65,536.
The number of buckets for each hash index should be determined based on the characteristics of the column on which we are building the index. It is recommended to choose
a number of buckets equal to or greater than the expected cardinality (i.e. the number of
unique values) of the index key column, so that there will be a greater likelihood that each
bucket's chain will point to rows with the same value for the index key column. In other
words, we want to try to make sure that two different values will never end up in the
same bucket. If there are fewer buckets than possible values, multiple values will have to
use the same bucket, i.e. a hash collision.
This can lead to long chains of rows and significant performance degradation of all DML
operations on individual rows, including SELECT and INSERT. On the other hand, be
careful not to choose a number that is too big because each bucket uses memory (8 bytes
per bucket). Having extra buckets will not improve performance but will simply waste
memory. As a secondary concern, it might also reduce the performance of index scans,
which will have to check each bucket for rows.
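Once a table has been populated, one way to judge how well the chosen BUCKET_COUNT fits the data is to query the sys.dm_db_xtp_hash_index_stats DMV, as in the sketch below; long average chains suggest too few buckets (or many duplicate key values), while a large proportion of empty buckets suggests too many.

-- Inspect bucket counts and chain lengths for all hash indexes in the current database.
SELECT  OBJECT_NAME(hs.object_id) AS [table_name],
        i.name                    AS [index_name],
        hs.total_bucket_count,
        hs.empty_bucket_count,
        hs.avg_chain_length,
        hs.max_chain_length
FROM    sys.dm_db_xtp_hash_index_stats AS hs
JOIN    sys.indexes AS i
        ON  i.object_id = hs.object_id
        AND i.index_id  = hs.index_id;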
Range Indexes
Hash indexes are useful for relatively unique data that we can query with equality predicates. However, if you don't know the cardinality, and so have no idea of the number
of buckets you'll need for a particular column, or if you know you'll be searching your
data based on a range of values, you should consider creating a range index instead of a
hash index.
Range indexes connect together all the rows of a table at their leaf level. Every row in a
table will be accessible by a pointer in the leaf. Range indexes are implemented using a
new data structure called a Bw-tree, originally envisioned and described by Microsoft
Research in 2011. A Bw-tree is a lock- and latch-free variation of a B-tree.
The Bw-tree
The general structure of a Bw-tree is similar to SQL Server's regular B-trees, except that
the index pages are not a fixed size, and once they are built they cannot be changed. Like
a regular B-tree page, each index page contains a set of ordered key values, and for each
value there is a corresponding pointer. At the upper levels of the index, on what are called
the internal pages, the pointers point to an index page at the next level of the tree, and
at the leaf level, the pointers point to a data row. Just like for in-memory OLTP hash
indexes, multiple data rows can be linked together. In the case of range indexes, rows that
have the same value for the index key will be linked.
One big difference between Bw-trees and SQL Server's B-trees is that, in the former,
a page pointer is a logical page ID (PID), instead of a physical page address. The PID
indicates a position in a mapping table, which connects each PID with a physical
memory address. Index pages are never updated; instead, they are replaced with a new
page and the mapping table is updated so that the same PID indicates a new physical
memory address.
Figure 4-5 shows the general structure of a Bw-tree, plus the Page Mapping Table.
Each index row in the internal index pages contains a key value, and a PID of a page at
the next level down. The index pages show the key values that the index references. Not
all the PID values are indicated in Figure 4-5, and the mapping table does not show all the
PID values that are in use.
The key value is the highest value possible on the page referenced. Note this is different from a regular B-tree index, in which the index row stores the minimum value on the page at the next level down. The leaf level index pages also contain key values but, instead
of a PID, they contain an actual memory address of a data row, which could be the first in
a chain of data rows, all with the same key value (these are the same rows that might also
be linked using one or more hash indexes).
Figure 4-5: The general structure of a Bw-tree: a root page and non-leaf (internal) pages containing key values and PIDs, leaf pages whose entries point to data rows, and the page mapping table that translates each PID into a physical memory address.
Another big difference between Bw-trees and SQL Server's B-trees is that, at the leaf
level, SQL Server keeps track of data changes using a set of delta values. As noted above,
index pages are never updated, they are just replaced with a new page. However, the leaf
pages themselves are not replaced for every change. Instead, each update to a leaf-level
index page, which can be an insert or delete of a key value on that page, produces a page
containing a delta record describing the change.
An UPDATE is represented by two new delta records, one for the DELETE of the original
value, and one for the INSERT of the new value. When SQL Server adds each delta record,
it updates the mapping table with the physical address of the page containing the newly
added delta record for the INSERT or DELETE operation.
Figure 4-6: Delta records for an insert (of record 50) and a delete (of record 48) prepended to base page P; the mapping table entry for P's PID now holds the physical address of the most recently added delta record.
When searching through a range index, SQL Server must combine the delta records with
the base page, making the search operation a bit more expensive. However, not having to
completely replace the leaf page for every change gives us performance savings. As we'll
see in the later section, Consolidating delta records, eventually SQL Server will combine
the original page and chain of delta pages into a new base page.
Figure 4-7: Parent page Pp, with entries for key values 5 and 10; the entry for 5 points to leaf page Ps, the full page into which the new key value 5 must be inserted.
Assume we have executed an INSERT statement that inserts a row with key value of 5 into
this table, so that 5 now needs to be added to the range index. The first entry in page Pp is
a 5, which means 5 is the maximum value that could occur on the page to which Pp points,
which is Ps. Page Ps doesn't currently have a value 5, but page Ps is where the 5 belongs.
However, the page Ps is full, so it is unable to add the key value 5 to the page, and it has
to split. The split operation occurs in one atomic operation consisting of two steps, as
described in the next two sections.
Figure 4-8: The first step of the split: the contents of the full page Ps are divided between two new pages, P1 and P2.
In the same atomic operation as splitting the page, SQL Server updates the page mapping
table to change the pointer to point to P1 instead of Ps. After this operation, page Pp
points directly to page P1; there is no pointer to page Ps, as shown in Figure 4-9.
Figure 4-9: The pointer from the parent page Pp now points to the first new child page, P1 (which links to P2); there is no longer any pointer to Ps.
Figure 4-10: The second step of the split: a new parent page is created, containing pointers to both P1 and P2.
In the same atomic operation as creating the new pointer, SQL Server then updates the
page mapping table to change the pointer from Pp to Ppp, as shown in Figure 4-11.
Figure 4-11: After the page mapping table is updated, the new parent page Ppp (with pointers to P1 and P2) takes the place of Pp.
Figure 4-12: The pages involved in a merge: parent page Pp points to leaf page Pln and its neighbor.
The merge operation occurs in three atomic steps, as described over the
following sections.
Figure 4-13: The delta page (DP 10) and the merge-delta page (DPm) are added to indicate a deletion.
Figure 4-14: The second step of the merge: a new parent page, Pp2, is created, with the entry for the removed key value taken out.
Figure 4-15: The third step of the merge: a new leaf page, Pnew, containing the merged contents, replaces the original pages via the page mapping table.
Summary
Memory-optimized tables comprise individual rows connected together by indexes. This
chapter described the two index structures available: hash indexes and range indexes.
Hash indexes have a fixed number of buckets, each of which holds a pointer to a chain of
rows. Ideally, all the rows in a single bucket's chain will have the same key value, and the
correct choice for the number of buckets, which is declared when the table is created, can
help ensure this.
Range indexes are stored as Bw-trees, which are similar to SQL Server's traditional B-trees
in some respects, but very different in others. The internal pages in Bw-trees contain
key values and pointers to pages at the next level. The leaf level of the index contains
pointers to chains of rows with matching key values. Just like for our data rows, index
pages are never updated in place. If an index page needs to add or remove a key value, a
new page is created to replace the original.
When choosing the correct set of indexes for a table at table creation time, evaluate each
indexed column to determine the best type of index. If the column stores lots of duplicate
values, or queries need to search the column by a range of values, then a range index is
the best choice. Otherwise, choose a hash index.
In the next chapter we'll look at how concurrent operations are processed and how transactions are managed and logged.
Additional Resources
Guidelines for Using Indexes on Memory-Optimized Tables:
http://msdn.microsoft.com/en-gb/library/dn133166.aspx.
The Bw-Tree: A B-tree for New Hardware Platforms:
http://research.microsoft.com/pubs/178758/bw-tree-icde2013-final.pdf.
Transaction Scope
SQL Server supports several different types of transaction, in terms of how we define the
beginning and end of the transaction; and when accessing memory-optimized tables the
transaction type can affect the isolation levels that SQL Server supports. The two default
types of transactions are:
explicit transactions – use the BEGIN TRANSACTION statement to indicate the beginning of the transaction, and either a COMMIT TRANSACTION or a ROLLBACK TRANSACTION statement to indicate the end. In between, the transaction can include any number of statements.
autocommit transactions – any single data modification operation. In other words, any INSERT, UPDATE or DELETE statement (as well as others, such as MERGE and BULK INSERT), by itself, is automatically a transaction. If we modify one row, or one million rows, in a single UPDATE statement, SQL Server will consider the UPDATE operation to be an atomic operation, and will modify either all the rows or none of them. With an auto-commit transaction, there is no way to force a rollback manually; a rollback will only occur when there is a system failure.
Table 5-1: Isolation levels for cross-container transactions: each row pairs an isolation level for disk-based table access (READ COMMITTED, REPEATABLE READ / SERIALIZABLE, or SNAPSHOT) with one for memory-optimized table access (SNAPSHOT or REPEATABLE READ / SERIALIZABLE), together with a recommendation for that combination.
The following sections will explain the restrictions, and the reasons for them,
with examples.
Listing 5-1:
Open a new query window in SSMS, and start an explicit transaction accessing a
memory-optimized table, as shown in Listing 5-2.
USE HKDB;
BEGIN TRAN;
SELECT  *
FROM    [dbo].[T1];
COMMIT TRAN;
Listing 5-2: Attempting to access a memory-optimized table in a READ COMMITTED explicit transaction.
By default, this transaction will run in the READ COMMITTED isolation level, which is
the standard isolation level for most SQL Server transactions, and guarantees that the
transaction will not read any dirty (uncommitted) data. If a transaction running under
this default isolation level tries to access a memory-optimized table, it will generate an error, since READ COMMITTED is not supported for memory-optimized tables in explicit transactions.
As the message suggests, the transaction needs to specify a supported isolation level,
using a table hint. For example, Listing 5-3 specifies the snapshot isolation level. This
combination, READ COMMITTED for accessing disk-based tables and SNAPSHOT for
memory-optimized, is the one that most cross-container transactions should use.
However, alternatively, we could also use the WITH (REPEATABLEREAD) or WITH
(SERIALIZABLE) table hints, if required.
USE HKDB;
BEGIN TRAN;
SELECT * FROM [dbo].[T1] WITH (SNAPSHOT);
COMMIT TRAN;
Listing 5-3: Accessing a memory-optimized table with the SNAPSHOT table hint.
SQL Server does support READ COMMITTED isolation level for auto-commit (single-statement) transactions, so we can run Listing 5-4, inserting three rows into our
table T1 successfully.
INSERT  [dbo].[T1]
        ( Name, City, LastModified )
VALUES  ( 'Jane', 'Helsinki', CURRENT_TIMESTAMP ),
        ( 'Susan', 'Vienna', CURRENT_TIMESTAMP ),
        ( 'Greg', 'Lisbon', CURRENT_TIMESTAMP );
Listing 5-4: Inserting rows into T1 in an auto-commit transaction.
Listing 5-5:
Listing 5-6:
Table 5-2 shows an example of running two concurrent cross-container transactions, Tx1 and Tx2 (both of which we can think of as having two "sub-transactions," one for accessing disk-based and one for accessing memory-optimized tables).
Table 5-2: Two concurrent cross-container transactions: Tx1 (running under SERIALIZABLE) reads a memory-optimized row, RHk1, and a disk-based row, RSql2, within its SQL and in-memory sub-transactions, while Tx2 runs concurrently.
Listing 5-7:
Listing 5-8:
SELECT  is_memory_optimized_elevate_to_snapshot_on
FROM    sys.databases
WHERE   name = 'HKDB';

SELECT  DATABASEPROPERTYEX('HKDB',
            'IsMemoryOptimizedElevateToSnapshotEnabled');
Listing 5-9:
Verifying if the database has been set to elevate the isolation level to SNAPSHOT.
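The option being checked here can be enabled with ALTER DATABASE, as sketched below; with it turned on, memory-optimized table access in a READ COMMITTED transaction is automatically elevated to SNAPSHOT, so no table hints are required.

-- Automatically elevate memory-optimized table access to SNAPSHOT isolation.
ALTER DATABASE HKDB
    SET MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT = ON;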
Otherwise, as demonstrated earlier, simply set the required isolation level on the fly, using a table hint. We should also consider that, having accessed a table in a cross-container transaction using an isolation level hint, a transaction should continue to use that same hint for all subsequent access of the table, though this is not enforced. Using different isolation levels for the same table, whether a disk-based or memory-optimized table, will usually lead to failure of the transaction.
SELECT  xtp_transaction_id ,
        transaction_id ,
        session_id ,
        begin_tsn ,
        end_tsn ,
        state_desc
FROM    sys.dm_db_xtp_transactions
WHERE   transaction_id > 0;
GO

Listing 5-10: Viewing active transactions in sys.dm_db_xtp_transactions.
The output should look similar to that shown in Figure 5-1, with two transactions.
Figure 5-1: Output of sys.dm_db_xtp_transactions showing the two active transactions.
When the first statement accessing a memory-optimized table is executed, SQL Server
obtains a transaction id for the T-SQL part of the transaction (transaction_id) and a
transaction id for the in-memory OLTP portion (xtp_transaction_id).
The xtp_transaction_id values, generated by the Transaction-ID counter (see
Chapter 3) are consecutive. It is this value that SQL Server inserts into End-Ts for rows
that an active transaction is deleting, and into Begin-Ts for rows that an active transaction is inserting. We can also see that both of these transactions have the same value
for begin_tsn, which is the current timestamp for the last committed transaction at
the time the transaction started. Since both transactions are still active, there is no value
for the end_tsn timestamp. The begin_tsn timestamp is only important while the
transaction is running and is never saved in row versions, whereas the end_tsn, upon
commit, is the value written into the Begin-Ts and End-Ts for the affected rows.
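Listing 5-11 runs Tx1's data modifications. As a rough sketch of what such a transaction looks like (the SNAPSHOT hints and exact predicates are assumptions; only the effects described below are taken from the text), Tx1 deletes the "Greg" row, moves "Jane" from Helsinki to Perth, and is deliberately left uncommitted:

USE HKDB;
GO
BEGIN TRAN;  -- this is the transaction referred to as Tx1
    DELETE dbo.T1 WITH ( SNAPSHOT )
    WHERE  Name = 'Greg';
    UPDATE dbo.T1 WITH ( SNAPSHOT )
    SET    City = 'Perth'
    WHERE  Name = 'Jane';
-- do not commit yet; Tx1 stays active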
Listing 5-11: Tx1 deletes the "Greg" row and updates "Jane" to Perth, leaving the transaction open.
During the processing phase, SQL Server links the new <Jane, Perth> row into the index
structure and marks the <Greg, Lisbon> and <Jane, Helsinki> rows as deleted. Figure 5-2 shows
what the rows will look like at this stage, within our index structure (with hash indexes on
Name and City; see Chapter 4).
Figure 5-2: The row versions and hash index chains (hash indexes on Name and City) after Tx1's processing phase: <Jane, Perth> has been inserted, and <Greg, Lisbon> and <Jane, Helsinki> carry Tx1's transaction-id in their End-Ts field.
I've just used Tx1 for the transaction-id, but you can use Listing 5-10 to find the real
values of xtp_transaction_id.
Write-write conflicts
What happens if another transaction, TxU, tries to update Jane's row (remember Tx1 is
still active)?
USE HKDB;
BEGIN TRAN TxU;
    UPDATE dbo.T1 WITH ( SNAPSHOT )
    SET    City = 'Melbourne'
    WHERE  Name = 'Jane';
COMMIT TRAN TxU;
Listing 5-12: TxU attempts to update a row while Tx1 is still uncommitted.
As discussed in Chapter 3, TxU sees Tx1's transaction-id in the <Jane, Helsinki> row
and, because SQL Server optimistically assumes Tx1 will commit, immediately aborts
TxU, raising a conflict error.
Read-write conflicts
If a query tries to update a row that has already been updated by an active transaction,
SQL Server generates an immediate "update conflict" error. However, SQL Server does
not catch most other isolation level errors until the transaction enters the validation
phase. Remember, no transaction acquires locks so it can't block other transactions from
accessing rows. We'll discuss the validation phase in more detail in the next section,
but it is during this phase that SQL Server will perform checks to make sure that any
changes made by concurrent transactions do not violate the specified isolation level. Let's
continue our example, and see the sort of violation that can occur.
Our original Tx1 transaction, which started at timestamp 240, is still active, and let's now
start two other transactions that will read the rows in table T1:
• Tx2: an auto-commit, single-statement SELECT that starts at timestamp 243.
• Tx3: an explicit transaction that reads a row and then updates another row based on the value it read in the SELECT; it starts at a timestamp of 246.
Figure 5-3: The row versions visible to Tx1 (starting at timestamp 240), Tx2 (starting at 243), and Tx3 (starting at 246).
When Tx1 starts at timestamp 240, three rows are visible, and since Tx1 does not commit
until timestamp 250, after Tx2 and Tx3 have started, those are the rows all three of the
transactions see. After Tx1 commits, there will only be two rows visible, and the City
value for Jane will have changed. When Tx3 commits, it will attempt to change the City
value for Susan to Helsinki.
In a second query window in SSMS, we can run our auto-commit transaction, Tx2, which
simply reads the T1 table.
USE HKDB;
SELECT Name ,
       City
FROM   T1;
Listing 5-13: Tx2, an auto-commit SELECT that reads the T1 table.
Figure 5-4: The rows returned by Tx2.
Tx2's session is running in the default isolation level, READ COMMITTED, but as described
previously, for a single-statement transaction accessing a memory-optimized table, we
can think of Tx2 as running in snapshot isolation level, which for a single-statement
SELECT will give us the same behavior as READ COMMITTED.
Tx2 started at timestamp 243, so it will be able to read rows that existed at that time.
It will not be able to access <Greg, Beijing>, for example, because that row was valid
between timestamps 100 and 200. The row <Greg, Lisbon> is valid starting at timestamp
200, so transaction Tx2 can read it, but it has a transaction-id in End-Ts because Tx1 is
currently deleting it. Tx2 will check the global transaction table and see that Tx1 has not
committed, so Tx2 can still read the row. <Jane, Perth> is the current version of the row
with "Jane," but because Tx1 has not committed, Tx2 follows the pointer to the previous
row version, and reads <Jane, Helsinki>.
Tx3 is an explicit transaction that starts at timestamp 246. It will run using REPEATABLE
READ isolation, and read one row and update another based on the value read, as shown
in Listing 5-14 (again, don't commit it yet).
DECLARE @City NVARCHAR(32);
BEGIN TRAN TX3
    SELECT @City = City
    FROM   T1 WITH ( REPEATABLEREAD )
    WHERE  Name = 'Jane';
    UPDATE T1 WITH ( REPEATABLEREAD )
    SET    City = @City
    WHERE  Name = 'Susan';
COMMIT TRAN -- commits at timestamp 260
Listing 5-14: Tx3 reads the value of City for "Jane" and updates the "Susan" row with this value.
If we now commit Tx1 (which gets the commit timestamp 250) and then try to commit Tx3,
Tx3 fails its validation, because the row it read under REPEATABLEREAD was changed by Tx1
after Tx3 read it. So Tx1 commits and Tx3 aborts and, at this stage, the only two rows visible
will be <Susan, Vienna> and <Jane, Perth>.
If Tx3 had committed before Tx1, then both transactions would succeed, and the final
rows visible would be <Jane, Perth> and <Susan, Helsinki>, as shown in Figure 5-3.
Let's now take a look in a little more detail at other isolation level violations that
may occur in the validation stage, and at the other actions SQL Server performs
during this phase.
Validation Phase
Once a transaction issues a commit and SQL Server generates the commit timestamp, but
prior to the final commit of transactions involving memory-optimized tables, SQL Server
performs a validation phase. As discussed briefly in Chapter 3, this phase consists broadly
of the following three steps:
1. Validate the changes made by Tx1, verifying that there are no isolation
level violations.
2. Wait for any commit dependencies to reduce the dependency count to 0.
3. Log the changes.
Once it logs the changes (which are therefore guaranteed), SQL Server marks the transaction as committed in the global transaction table, and then clears the dependencies of
any transactions that are dependent on Tx1.
Note that the only waiting that a transaction on memory-optimized tables will experience
is during this phase. There may be waiting for commit dependencies, which are usually
very brief, and there may be waiting for the write to the transaction log. Logging for
memory-optimized tables is much more efficient than logging for disk-based tables (as
we'll see in Chapter 6), so these waits can also be very short.
The following sections review each of these three steps in a little more detail.
Isolation level     Read-set    Scan-set
SNAPSHOT            NO          NO
REPEATABLE READ     YES         NO
SERIALIZABLE        YES         YES

Table 5-3: Read-set and scan-set validations performed for each isolation level.
Table 5-4: Two concurrent transactions, Tx1 and Tx2, that each insert a row with the same primary key value. During validation, Error 41325 is generated, because we can't have two rows with the same primary key value, and Tx1 is aborted and rolled back.
The transaction will abort. We saw an example of this earlier, in the section on
read-write conflicts.
Table 5-5: Transactions resulting in a SERIALIZABLE isolation failure. Tx1 runs under SERIALIZABLE; while it is still active, Tx2 (at any isolation level) runs INSERT INTO Person VALUES ('Charlie', 'Perth') and commits. During Tx1's validation, Error 41325 is generated and Tx1 is rolled back.
Figure 5-5: The rows and hash index chains (indexes on Name and City) after transaction Tx1, with begin timestamp 240 and end timestamp 250, commits: the commit timestamp 250 has been written into the Begin-Ts and End-Ts fields of the inserted and deleted row versions.
The final step in the validation process is to go through the linked list of dependent
transactions and reduce their dependency counters by one. Once this validation phase is
finished, the only reason that this transaction might fail is due to a log write failure. Once
the log record has been hardened to storage, the state of the transaction is changed to
committed in the global transaction table.
Post-processing
The final phase is the post-processing, which is sometimes referred to as commit
processing, and is usually the shortest. The main operations are to update the timestamps
of each of the rows inserted or deleted by this transaction.
• For a DELETE operation, set the row's End-Ts value to the commit timestamp of the transaction, and clear the type flag on the row's End-Ts field to indicate it is really a timestamp, and not a transaction-ID.
• For an INSERT operation, set the row's Begin-Ts value to the commit timestamp of the transaction and clear the type flag.
If the transaction failed or was explicitly rolled back, inserted rows will be marked as
garbage and deleted rows will have their end-timestamp changed back to infinity.
The actual unlinking and deletion of old row versions is handled by the garbage collection
system. This final step of removing any unneeded or inaccessible rows is not always done
immediately and may be handled either by user threads, once a transaction completes, or
by a completely separate garbage collection thread.
The garbage collection process for stale row versions in memory-optimized tables is
analogous to the version store cleanup that SQL Server performs when transactions
use one of the snapshot-based isolation levels, when accessing disk-based tables. A big
difference though is that the cleanup is not done in tempdb because the row versions are
not stored there, but in the in-memory table structures themselves.
To determine which rows can be safely deleted, the in-memory OLTP engine keeps track
of the timestamp of the oldest active transaction running in the system, and uses this
value to determine which rows are potentially still needed. Any rows that are not valid as
of this point in time, in other words any rows with an End-Ts timestamp that is earlier
than this time, are considered stale. Stale rows can be removed and their memory can be
released back to the system.
The garbage collection system is designed to be non-blocking, cooperative and scalable.
Of particular interest is the "cooperative" attribute. Although there is a dedicated system
thread for the garbage collection process, called the idle worker thread, user threads
actually do most of the work.
If, while scanning an index during a data modification operation (all index access on
memory-optimized tables is considered to be scanning), a user thread encounters a stale
row version, it will either mark the row as expired, or unlink that version from the
current chain and adjust the pointers. For each row it unlinks, it will also decrement the
reference count in the row header area (reflected in the IdxLinkCount value).
When a user thread completes a transaction, it adds information about the transaction to
a queue of transactions to be processed by the idle worker thread. Each time the garbage
collection process runs, it processes the queue of transactions, and determines whether
the oldest active transaction has changed.
SELECT name AS 'index_name' ,
       s.index_id ,
       scans_started ,
       rows_returned ,
       rows_expired ,
       rows_expired_removed
FROM   sys.dm_db_xtp_index_stats s
       JOIN sys.indexes i ON s.object_id = i.object_id
                         AND s.index_id = i.index_id
WHERE  OBJECT_ID('<memory-optimized table name>') = s.object_id;
GO
Listing 5-15: Examining index scan and row expiry statistics for a memory-optimized table.
Depending on the volume of data changes and the rate at which new versions are
generated, SQL Server can be using a substantial amount of memory for old row versions
and we need to make sure that our system has enough memory available. I'll tell you
more about memory management for a database supporting memory-optimized tables in
Chapter 8.
Summary
This chapter contains a lot of detail on the transaction isolation levels that SQL Server
supports when accessing memory-optimized tables, and also on the valid combination of
levels for cross-container transactions, which can access both disk-based and memory-optimized tables. In most cases, our cross-container transactions will use standard
READ COMMITTED for accessing disk-based tables, and SNAPSHOT isolation for memory-optimized tables, set either via a table hint or using the MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT database property for that database.
Additional Resources
General background on isolation levels:
http://en.wikipedia.org/wiki/Isolation_(database_systems).
A Critique of ANSI SQL Isolation Levels:
http://research.microsoft.com/apps/pubs/default.aspx?id=69541.
Understanding Transactions on Memory-Optimized Tables:
http://msdn.microsoft.com/en-us/library/dn479429.aspx.
Transaction Logging
The log streams contain information about all versions inserted and deleted by transactions against in-memory tables. SQL Server writes the log streams to the regular
SQL Server transaction log, but in-memory OLTP's transaction logging is designed to
be more scalable, and to deliver higher performance, than standard logging for disk-based tables.
CREATE DATABASE LoggingDemo ON
PRIMARY (NAME = [LoggingDemo_data],                  -- primary file; name and path assumed
         FILENAME = 'C:\DataHK\LoggingDemo_data.mdf'),
FILEGROUP [LoggingDemo_FG] CONTAINS MEMORY_OPTIMIZED_DATA
        (NAME = [LoggingDemo_container1],
         FILENAME = 'C:\DataHK\LoggingDemo_container1')
LOG ON (NAME = [LoggingDemo_log],
        FILENAME = 'C:\DataHK\LoggingDemo.ldf', SIZE = 100MB);
GO
Listing 6-1: Creating the LoggingDemo database with a memory-optimized filegroup.
Listing 6-2 creates one memory-optimized table, and the equivalent disk-based table, in
the LoggingDemo database.
USE LoggingDemo
GO
IF OBJECT_ID('t1_inmem') IS NOT NULL
DROP TABLE [dbo].[t1_inmem]
GO
-- create a simple memory-optimized table
CREATE TABLE [dbo].[t1_inmem]
( [c1] int NOT NULL,
[c2] char(100) NOT NULL,
CONSTRAINT [pk_index91] PRIMARY KEY NONCLUSTERED HASH ([c1])
WITH(BUCKET_COUNT = 1000000)
) WITH (MEMORY_OPTIMIZED = ON,
DURABILITY = SCHEMA_AND_DATA);
GO
IF OBJECT_ID('t1_disk') IS NOT NULL
DROP TABLE [dbo].[t1_disk]
GO
-- create a similar disk-based table
CREATE TABLE [dbo].[t1_disk]
( [c1] int NOT NULL,
[c2] char(100) NOT NULL)
GO
CREATE UNIQUE NONCLUSTERED INDEX t1_disk_index on t1_disk(c1);
GO
Listing 6-2: Creating a memory-optimized table and an equivalent disk-based table.
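As a hedged sketch of the kind of script Listing 6-3 uses (the exact statements and batch structure are assumptions), the following inserts 100 rows into t1_disk in a single transaction and then reads the log records belonging to that table's partitions:

SET NOCOUNT ON;
BEGIN TRAN;
DECLARE @i INT = 1;
WHILE @i <= 100
BEGIN
    INSERT dbo.t1_disk VALUES ( @i, REPLICATE('a', 100) );
    SET @i += 1;
END;
COMMIT TRAN;
GO
-- examine the log records generated for t1_disk's partitions
SELECT [Current LSN], Operation, Context, [Log Record Length], AllocUnitName
FROM   sys.fn_dblog(NULL, NULL)
WHERE  PartitionId IN ( SELECT partition_id
                        FROM   sys.partitions
                        WHERE  object_id = OBJECT_ID('t1_disk') )
ORDER BY [Current LSN] ASC;
GO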
Listing 6-3:
Populate the disk-based table with 100 rows and examine the log.
Listing 6-4 runs a similar INSERT on the memory-optimized table. Note that, since the
partition_id is not shown in the output for memory-optimized tables, we cannot
filter based on the specific object. Instead, we need to look at the most recent log records,
so the query performs a descending sort based on the LSN.
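A sketch of the equivalent memory-optimized test (again, exact statements are assumptions): insert 100 rows into t1_inmem in one transaction, then list the most recent log records, newest first:

SET NOCOUNT ON;
BEGIN TRAN;
DECLARE @i INT = 1;
WHILE @i <= 100
BEGIN
    INSERT dbo.t1_inmem VALUES ( @i, REPLICATE('a', 100) );
    SET @i += 1;
END;
COMMIT TRAN;
GO
-- the most recent log records, sorted descending by LSN
SELECT TOP (10) [Current LSN], Operation, Context, [Log Record Length]
FROM   sys.fn_dblog(NULL, NULL)
ORDER BY [Current LSN] DESC;
GO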
Listing 6-4:
Examine the log after populating the memory-optimized tables with 100 rows.
You should see only three log records related to this transaction, as shown in Figure 6-1,
one marking the start of the transaction, one the commit, and then just one log record for
inserting all 100 rows.
Figure 6-1:
SQL Server transaction log showing one log record for a 100-row transaction.
The output implies that all 100 inserts have been logged in a single log record, using an
operation of type LOP_HK, with LOP indicating a "logical operation" and HK being an
artifact from the project codename, Hekaton.
We can use another undocumented, unsupported function to break apart a LOP_HK
record, as shown in Listing 6-5 (replace the current LSN value with the LSN for your
LOP_HK record).
SELECT [current lsn] ,
       [transaction id] ,
       operation ,
       operation_desc ,
       tx_end_timestamp ,
       total_size ,
       OBJECT_NAME(table_id) AS TableName
FROM   sys.fn_dblog_xtp(NULL, NULL)
WHERE  [Current LSN] = '00000020:00000157:0005';
Listing 6-5: Using sys.fn_dblog_xtp to break apart the LOP_HK log record.
The first few rows of output should look similar to those shown in Figure 6-2. It
should return 102 rows, including one *_INSERT_ROW operation for each of the
100 rows inserted.
Figure 6-2:
Breaking apart the log record for the inserts on the memory-optimized table.
The single log record for the entire transaction on the memory-optimized table,
plus the reduced size of the logged information, can help to make transactions on
memory-optimized tables much more efficient. This is not to say, however, that transactions on memory-optimized tables are always going to be more efficient, in terms of
logging, than operations on disk-based tables. For very short transactions particularly,
disk-based and memory-optimized will generate about the same amount of log. However,
transactions on memory-optimized tables should never be any less efficient than on their
disk-based counterparts.
Checkpoint
The two main purposes of the checkpoint operation, for disk-based tables, are to improve
performance by batching up I/O rather than continually writing a page to disk every time
it changes, and to reduce the time required to run recovery. If checkpoint ran only very
infrequently then, during recovery, there could be a huge number of data rows to which
SQL Server needs to apply redo, as a result of committed transactions hardened to the log
but where the data pages were not hardened to disk before SQL Server entered recovery.
Similarly, one of the main reasons for checkpoint operations, for memory-optimized
tables, is to reduce recovery time. The checkpoint process for memory-optimized tables is
designed to satisfy two important requirements:
• Continuous checkpointing: checkpoint-related I/O operations occur incrementally and continuously as transactional activity accumulates. This is in contrast to the hyper-active checkpoint scheme for disk-based tables, defined as checkpoint processes which sleep for a while, after which they wake up and work as hard as possible to finish up the accumulated work, and which can potentially be disruptive to overall system performance.
• Streaming I/O: on disk-based tables, the checkpoint operation generates random I/O as it writes dirty pages to disk. For memory-optimized tables, checkpointing relies, for most of its operations, on streaming I/O (which is always sequential) rather than random I/O. Even on SSD devices, random I/O is slower than sequential I/O and can incur more CPU overhead due to smaller individual I/O requests.
Since checkpointing is a continuous process for memory-optimized tables, when we talk
about a checkpoint "event," we're actually talking about the closing of a checkpoint. The
later section, Closing a checkpoint, describes exactly what happens during the checkpoint
closing process.
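Listing 6-6 creates a database, CkptDemo, with a memory-optimized filegroup whose container lives in the DataHK folder referenced below. A minimal sketch of such a statement, with file names and sizes assumed, looks like this:

CREATE DATABASE CkptDemo ON
PRIMARY (NAME = [CkptDemo_data],                    -- primary file; name and path assumed
         FILENAME = 'C:\DataHK\CkptDemo_data.mdf'),
FILEGROUP [CkptDemo_FG] CONTAINS MEMORY_OPTIMIZED_DATA
        (NAME = [CkptDemo_container1],
         FILENAME = 'C:\DataHK\CkptDemo_container1')
LOG ON (NAME = [CkptDemo_log],
        FILENAME = 'C:\DataHK\CkptDemo.ldf', SIZE = 100MB);
GO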
Listing 6-6: Creating the CkptDemo database with a memory-optimized filegroup.
Next, Listing 6-7 turns on an undocumented trace flag, 9851, which inhibits the automatic
merging of checkpoint files. This will allow us to control when the merging occurs, and
observe the process of creating and merging checkpoint files. Only use this trace flag
during testing, not on production servers.
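Enabling a trace flag globally is done with DBCC TRACEON; a sketch of the statement (using -1 to apply it server-wide) is:

DBCC TRACEON (9851, -1);
GO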
Listing 6-7: Turning on trace flag 9851 to inhibit automatic merging of checkpoint files.
At this point, you might want to look in the folder containing the memory-optimized
data files, in this example DataHK\CkptDemo_container1. Within that folder is
one subfolder called $FSLOG and another with a GUID for a name. If we had specified
multiple memory-optimized filegroups in Listing 6-6, then we'd see one GUID-named
folder for each filegroup.
Open the GUID-named folder, and in there is another GUID-named folder. Again, there
will be one GUID-named folder at this level for each file in the filegroup. Open up that
GUID-named folder, and you will find it is empty, and it will remain empty until we
create a memory-optimized table, as shown in Listing 6-8.
USE CkptDemo;
GO
-- create a memory-optimized table with each row of size > 8KB
CREATE TABLE dbo.t_memopt (
c1 int NOT NULL,
c2 char(40) NOT NULL,
c3 char(8000) NOT NULL,
CONSTRAINT [pk_t_memopt_c1] PRIMARY KEY NONCLUSTERED HASH (c1)
WITH (BUCKET_COUNT = 100000)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
GO
Listing 6-8: Creating a memory-optimized table, t_memopt, with rows larger than 8 KB.
Figure 6-3:
The data and delta files in the container for our memory-optimized tables.
SELECT file_type_desc ,
       state_desc ,
       internal_storage_slot ,
       file_size_in_bytes ,
       inserted_row_count ,
       deleted_row_count ,
       lower_bound_tsn ,
       upper_bound_tsn ,
       checkpoint_file_id ,
       relative_file_path
FROM   sys.dm_db_xtp_checkpoint_files
ORDER BY file_type_desc ,
         state_desc ,
         lower_bound_tsn;
GO
Listing 6-9: Examining the checkpoint file metadata in sys.dm_db_xtp_checkpoint_files.
Listing 6-9 returns the following metadata columns (other columns are available; see the documentation for a full list):
• file_type_desc: identifies the file as a data or delta file.
• state_desc: the state of the file (see previous bullet list).
• internal_storage_slot: the pointer to an internal storage array (described below); it is not populated until a file becomes ACTIVE.
Figure 6-4: The storage array, with ACTIVE CFPs covering transaction ranges 0 to 8 and 8 to 20, and an UNDER CONSTRUCTION CFP covering transactions from 20 onwards. The storage array stores metadata for up to 8192 CFPs per database.
The CFPs referenced by the storage array, along with the tail of the log, represent all the
on-disk information required to recover the memory-optimized tables in a database.
Let's see an example of some of these checkpoint file state transitions in action. At this
stage, our CkptDemo database has one empty table, and we've seen that SQL Server
has created 9 CFPs. We'll take a look at the checkpoint file metadata, using the
sys.dm_db_xtp_checkpoint_files DMV. In this case, we just return the file type (DATA
or DELTA), the state of each file, and the relative path to each file.
SELECT file_type_desc ,
       state_desc ,
       relative_file_path
FROM   sys.dm_db_xtp_checkpoint_files
ORDER BY file_type_desc;
GO
Figure 6-5 shows that, of the 9 CFPs (9 data files, 9 delta files), 8 CFPs have the state
PRECREATED, and the remaining CFP has the state UNDER CONSTRUCTION.
Figure 6-5: The file type, state, and relative file path of the initial CFPs.
The values in the relative_file_path column are a concatenation of the two GUID
folder names, plus the file names in the folder that was populated when we created the
table. These relative paths are of the general form GUID1\GUID2\FILENAME where
GUID1 is the GUID for the container, GUID2 is the GUID for the file in the container and
FILENAME is the name of the individual data or delta file. For example, the FILENAME
portion of the relative path for the third row in Figure 6-5 is 00000021-000000a7-0003,
which matches the name of the second file (the first data file) listed in my file browser
previously, in Figure 6-3.
Let's now put some rows into the t_memopt table, as shown in Listing 6-11. The script
also backs up the database so that we can make log backups later (although the backup
does not affect what we will shortly see in the metadata).
-- INSERT 8000 rows.
-- This should load 5 16MB data files on a machine with <= 16GB of memory.
SET NOCOUNT ON;
DECLARE @i INT = 0
WHILE ( @i < 8000 )
BEGIN
INSERT t_memopt
VALUES ( @i, 'a', REPLICATE('b', 8000) )
SET @i += 1;
END;
GO
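The backup portion of the script would be along the following lines (the backup file name and path are assumptions; the BackupsHK folder appears later in the chapter for the log backups):

BACKUP DATABASE CkptDemo
    TO DISK = N'C:\BackupsHK\CkptDemo_data.bak'
    WITH INIT;
GO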
Listing 6-11: Populate the memory-optimized tables with 8000 rows and back up the database.
If we peek again into the GUID-named subfolder in the file system browser, we should see
four additional CFPs.
Now let's return to look at the checkpoint file metadata in a little more detail by
rerunning the query in Listing 6-9. Figure 6-6 shows the 13 CFPs returned, and the
property values for each file.
Figure 6-6: The checkpoint file metadata after inserting 8000 rows.
Closing a checkpoint
Let's now actually execute the checkpoint command in this database, manually, and then
rerun Listing 6-11 to interrogate the metadata in sys.dm_db_xtp_checkpoint_files.
CHECKPOINT;
GO
-- now rerun Listing 6-11
In the output, we'll see one or more CFPs (in this case, five) with the state ACTIVE and
with non-NULL values for the internal_storage_slot, as shown in Figure 6-7.
Figure 6-7: The checkpoint file metadata after a manual checkpoint, showing five ACTIVE CFPs.
Notice that the five ACTIVE CFPs have consecutive internal_storage_slot values.
In fact, if we execute a checkpoint multiple times, we'll see that each checkpoint will
create additional ACTIVE CFPs, with contiguous values for internal_storage_slot.
What's happened here is that the checkpoint event takes a section of the transaction log
not covered by a previous checkpoint event, and converts all operations on memory-optimized tables contained in that section of the log into one or more ACTIVE CFPs.
Once the checkpoint task finishes processing the log, the checkpoint is completed with
the following steps:
1. All buffered writes (all writes that are currently only present in the in-memory table) are flushed to the data and delta files.
Automatic merge
To identify a set of files to be merged, a background task periodically looks at all ACTIVE
data/delta file pairs and identifies zero or more sets of files that qualify.
Each set can contain two or more data/delta file pairs that are adjacent to each other such
that the resultant set of rows can still fit in a single data file of size 128 MB (or 16 MB for
machines with 16 GB memory or less). Table 6-1 shows some examples of files that will be
chosen to be merged under the merge policy.
Table 6-1: Examples of adjacent source files (% full) and the merge selection made under the merge policy (for example, the pair DF1, DF2).
Manual merge
In most cases, the automatic merging of checkpoint files will be sufficient to keep the
number of files manageable. However, in rare situations or for testing purposes, you
might want to use a manual merge. We can use the procedure sp_xtp_merge_checkpoint_files to force a manual merge of checkpoint files. To determine which files
might be eligible, we can look at the metadata in sys.dm_db_xtp_checkpoint_files.
Remember that earlier we turned off automatic merging of files using the undocumented
trace flag, 9851. Again, this is not recommended in a production system but, for the sake of
this example, it does allow us to explore this metadata more readily, and to see how it evolves
during a merge operation.
Continuing our previous example, let's now delete half the rows in the t_memopt table,
as shown in Listing 6-13.
SET NOCOUNT ON;
DECLARE @i INT = 0;
WHILE ( @i <= 8000 )
BEGIN
DELETE t_memopt
WHERE
c1 = @i;
SET @i += 2;
END;
GO
CHECKPOINT;
GO
Listing 6-13: Deleting half the rows in the t_memopt table, followed by a manual checkpoint.
Figure 6-8:
The checkpoint file metadata after deleting half the rows in the table.
The number of deleted rows, spread across five files, adds up to 4000, as expected. From
this information, we can find adjacent files that are not full, or files that we can see have
a lot of their rows removed. Armed with the transaction_id_lower_bound from
the first file in the set, and the transaction_id_upper_bound from the last file, we
can call the sys.sp_xtp_merge_checkpoint_files procedure, as in Listing 6-14, to
force a manual merge. Note that this procedure will not accept a NULL as a parameter,
so if the transaction_id_lower_bound is NULL, we can use any value less than
transaction_id_upper_bound.
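A sketch of such a call follows; the bound values shown are placeholders, and the parameter order assumed here is the database name followed by the transaction lower and upper bounds read from the metadata:

EXEC sys.sp_xtp_merge_checkpoint_files
     'CkptDemo' ,   -- database name
     1 ,            -- transaction_id_lower_bound (use the value from the first file in the set)
     8000;          -- transaction_id_upper_bound (use the value from the last file in the set)
GO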
We can verify the state of the merge operation with another DMV,
sys.dm_db_xtp_merge_requests, as shown in Listing 6-15.
SELECT request_state_desc ,
       lower_bound_tsn ,
       upper_bound_tsn
FROM   sys.dm_db_xtp_merge_requests;
GO
Listing 6-15: Checking the state of the merge request in sys.dm_db_xtp_merge_requests.
Figure 6-9: The output of sys.dm_db_xtp_merge_requests.
In the metadata, we should now see one new CFP in the MERGE TARGET state, containing
all 4000 remaining rows (from here on, I've filtered out the PRECREATED files).
Figure 6-10: Some of the checkpoint file metadata after a requested merge.
Now run another manual checkpoint and then, once the merge is complete, the
request_state_desc column of sys.dm_db_xtp_merge_requests
should show a value of INSTALLED instead of PENDING. The metadata will now look
similar to Figure 6-11. Now the CFP in slot 5, containing the 4000 remaining rows, is
ACTIVE and once again the checkpoint creates a new ACTIVE CFP (slot 6). The original
6 CFPs (originally slots 0 to 5) have been merged and their status is MERGED SOURCE.
If any concurrent activity were occurring on the server, we'd also see new UNDER
CONSTRUCTION CFPs.
Figure 6-11: The checkpoint file metadata after the merge has been installed.
Stage 1: Checkpoint
After the merge has completed, the in-memory engine cannot remove the original
MERGED SOURCE files (A and B) until a checkpoint event occurs that guarantees that data
in those files is no longer needed for recovery, in the event of a service interruption (of
course, they certainly will be required for restore and recovery after a disk failure, so we
need to be running backups).
Figure 6-12:
BACKUP LOG CkptDemo                              -- the BACKUP LOG clause itself is reconstructed
    TO DISK = N'C:\BackupsHK\CkptDemo_log.bak'
    WITH NAME = N'CkptDemo-LOG Backup';
GO
Having performed a log backup, SQL Server can mark the source files from the merge
operation with the LSN, and any files with an LSN lower than the log truncation point are
eligible for garbage collection. Normally, of course, the whole garbage collection process
for these checkpoint files happens automatically, without any manual intervention.
After this stage, the files may or may not still be visible to your in-memory database
engine through the sys.dm_db_xtp_checkpoint_files DMV, but they will be visible
on disk. In my example, I was still able to see the files, with the TOMBSTONE state, but at
some point they will become invisible to this DMV.
Figure 6-13: The checkpoint file metadata showing the merged source files in the TOMBSTONE state.
After Stage 3, the files may no longer be visible through the operating system although,
depending on what else is happening on the system, this process may take a while. Keep
in mind, however, that normally performing any of this manual garbage collection
process should not be necessary.
If you find you do need to implement this manual garbage collection of files, be sure
to account for these extra transaction log backups that were performed. You will need
to make sure any third-party backup solutions are aware of these log backup files.
Alternatively, you could perform a full database backup after performing this manual
garbage collection, so that subsequent transaction log backups would use that as their
starting point.
Recovery
Recovery on in-memory OLTP tables starts after the location of the most recent
checkpoint inventory has been recovered during a scan of the tail of the log. Once the
SQL Server host has communicated the location of the checkpoint inventory to the
in-memory OLTP engine, SQL Server and in-memory OLTP recovery proceed in parallel.
The global transaction timestamp is initialized during the recovery process with the
highest transaction timestamp found among the transactions recovered.
In-memory OLTP recovery itself is parallelized. Each delta file represents a filter to
eliminate rows that don't have to be loaded from the corresponding data file. This
data/delta file pair arrangement means that checkpoint file loading can proceed in
parallel across multiple I/O streams with each stream processing a single data file and
delta file. The in-memory OLTP engine creates one thread per core to handle parallel
insertion of the data produced by the I/O streams. The insert threads load into memory
all active rows in the data file after removing the rows that have been deleted. Using one
thread per core means that the load process is performed as efficiently as possible.
As the data rows are loaded they are linked into each index defined on the table the row
belongs to. For each hash index, the row is added to the chain for the appropriate hash
bucket. For each range index, the row is added to the chain for the row's key value, or
a new index entry is created if the key value doesn't duplicate one already encountered
during recovery of the table.
Finally, once the checkpoint file load process completes, the tail of the transaction log
is replayed from the timestamp of the last checkpoint, with the goal of bringing the
database back to the state that existed at the time of the crash.
Summary
In this chapter we looked at how the logging process for memory-optimized tables
is more efficient than that for disk-based tables, providing additional performance
improvement for your in-memory operations. We also looked at how your data changes
are persisted to disk using streaming checkpoint files, so that your data is persisted and
can be recovered when the SQL Server service is restarted, or your databases containing
memory-optimized tables are restored.
Additional Resources
A white paper describing FILESTREAM storage and management:
http://msdn.microsoft.com/en-us/library/hh461480.aspx.
Durability for memory-optimized tables:
http://blogs.technet.com/b/dataplatforminsider/archive/2013/10/11/in-memory-oltp-how-durability-is-achieved-for-memory-optimized-tables.aspx.
State transitions during merging of checkpoint files:
http://blogs.technet.com/b/dataplatforminsider/archive/2014/01/23/state-transition-of-checkpoint-files-in-databases-with-memory-optimized-tables.aspx.
Maintenance of DLLs
The DLLs for memory-optimized tables and natively compiled stored procedures are
stored in the file system, along with other generated files, which are kept for troubleshooting and supportability purposes.
The query in Listing 7-1 shows all table and stored procedure DLLs currently loaded in
memory on the server.
SELECT name ,
       description
FROM   sys.dm_os_loaded_modules
WHERE  description = 'XTP Native DLL';
Listing 7-1:
Display the list of all table and procedure DLLs currently loaded.
Database administrators do not need to maintain the files that native compilation
generates. SQL Server automatically removes generated files that are no longer needed,
for example on table and stored procedure deletion and on dropping a database, but also
on server or database restart.
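A hedged sketch of the kind of script in Listing 7-2 follows; the table name, column, and database here are invented for illustration, and the DLL naming pattern xtp_t_<database_id>_<object_id> is used to pick out the newly generated module:

USE HKDB;
GO
CREATE TABLE dbo.t_dll_demo
( c1 int NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1024)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
GO
-- retrieve the path of the DLL generated and loaded for the new table
SELECT name ,
       description
FROM   sys.dm_os_loaded_modules
WHERE  description = 'XTP Native DLL'
  AND  name LIKE '%xtp_t_' + CAST(DB_ID() AS varchar(10)) + '_'
                + CAST(OBJECT_ID('dbo.t_dll_demo') AS varchar(10)) + '%';
GO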
Listing 7-2: Creating a memory-optimized table and retrieving the path of its DLL.
The table creation results in the compilation of the table DLL, and also loading that DLL
in memory. The DMV query immediately after the CREATE TABLE statement retrieves
the path of the table DLL. My results are shown in Figure 7-1.
Figure 7-1: The name and description of the table DLL in sys.dm_os_loaded_modules.
Parameter Sniffing
Interpreted T-SQL stored procedures are compiled into intermediate physical execution
plans at first execution (invocation) time, in contrast to natively compiled stored procedures, which are natively compiled at creation time. When interpreted stored procedures
are compiled at invocation, the values of the parameters supplied for this invocation are
used by the optimizer when generating the execution plan. This use of parameters during
compilation is called parameter sniffing.
SQL Server does not use parameter sniffing for compiling natively compiled stored
procedures. All parameters to the stored procedure are considered to have UNKNOWN
values.
Figure 7-2: The compilation pipeline for an interpreted T-SQL stored procedure: parser, algebrizer, and query optimizer.
Figure 7-3: The compilation and invocation pipeline for a natively compiled stored procedure: the procedure is parsed, algebrized, and compiled into a DLL at creation time, and at invocation the stored procedure DLL is executed directly with the supplied parameters.
Hash indexes
There are no ordered scans with hash indexes. If a query is looking for a range of values,
or requires that the results be returned in sorted order, a hash index will not be useful,
and the optimizer will not consider it.
The optimizer cannot use a hash index unless the query filters on all columns in the index
key. The hash index examples in Chapter 4 illustrated an index on just a single column.
However, just like indexes on disk-based tables, hash indexes on memory-optimized
tables can be composite, but the hash function used to determine to which bucket a row
belongs is applied to all the columns in the index key, so the index is only useful when a
query supplies equality predicates on every key column.
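As a hedged illustration (the table and index here are invented for this example), a composite hash index on (City, Name) cannot be used by a query that supplies only City, because the hash is computed over both key columns; a query with equality predicates on both columns can use it:

CREATE TABLE dbo.T_hash_demo
( -- index key string columns must use a BIN2 collation in SQL Server 2014
  Name varchar(32) COLLATE Latin1_General_100_BIN2 NOT NULL,
  City varchar(32) COLLATE Latin1_General_100_BIN2 NOT NULL,
  CONSTRAINT pk_T_hash_demo PRIMARY KEY NONCLUSTERED HASH (City, Name)
      WITH (BUCKET_COUNT = 100000)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);
GO
-- cannot seek on the composite hash index: only one of the two key columns is specified
SELECT Name, City FROM dbo.T_hash_demo WHERE City = 'Lisbon';
-- can seek on the composite hash index: equality predicates on both key columns
SELECT Name, City FROM dbo.T_hash_demo WHERE City = 'Lisbon' AND Name = 'Greg';
GO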
Range indexes
Range indexes cannot be scanned in reverse order. There is no concept of "previous
pointers" in a range index on a memory-optimized table. With on-disk indexes, if a query
requests the data to be sorted in DESC order, the on-disk index could be scanned in
reverse order to support this. With in-memory tables, an index would have to be created
as a descending index. In fact, it is possible to have two indexes on the same column,
one defined as ascending and one defined as descending. It is also possible to have
both a range and a hash index on the same column.
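For instance (a sketch using an invented table), to support both ascending and descending ordered retrieval on the same column, two range indexes can be declared inline:

CREATE TABLE dbo.T_range_demo
( c1        int      NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1024),
  OrderDate datetime NOT NULL,
  INDEX ix_OrderDate_asc  NONCLUSTERED (OrderDate ASC),
  INDEX ix_OrderDate_desc NONCLUSTERED (OrderDate DESC)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);
GO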
No Halloween protection
Halloween protection is not incorporated into the query plans. Halloween protection
provides guarantees against accessing the same row multiple times during query
processing. Operations on disk-based tables use spooling operators to make sure rows are
not accessed repeatedly, but this is not necessary for plans on memory-optimized tables.
No parallel plans
Currently, parallel plans are not produced for operations on memory-optimized tables.
The XML plan for the query will indicate that the reason for no parallelism is because the
table is a memory-optimized table.
No auto-update of statistics
SQL Server In-Memory OLTP does not keep any row modification counters, and does not
automatically update statistics on memory-optimized tables. One of the reasons for not
updating the statistics is so there will be no chance of dependency failures due to waiting
for statistics to be gathered.
You'll need to make sure you set up a process for regularly updating statistics on memory-optimized tables using the UPDATE STATISTICS command, which can be used to update
statistics on just one index, or on all the indexes of a specified table. Alternatively, you
can use the procedure sp_updatestats, which updates all the statistics on all indexes
in a database. For disk-based tables, this procedure only updates statistics on tables which
have been modified since the last time statistics were updated, but for memory-optimized
tables, the procedure will also recreate statistics. Make sure you have loaded data and
updated statistics on all tables accessed in a natively compiled procedure before the
procedure is created, since the plan is created at the time of procedure creation, and it
will be based on the existing statistics.
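For example, a sketch of a manual statistics refresh for the T1 table used in earlier chapters (the FULLSCAN and NORECOMPUTE options reflect what SQL Server 2014 generally expects for memory-optimized tables, where statistics are not updated automatically):

-- refresh statistics on all indexes of a single memory-optimized table
UPDATE STATISTICS dbo.T1 WITH FULLSCAN, NORECOMPUTE;
GO
-- or refresh statistics for every table in the current database
EXEC sys.sp_updatestats;
GO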
Natively compiled procedure plans will never be recompiled on the fly; the only way to
get a new plan is to drop and recreate the procedure (or restart the server).
Performance Comparisons
Since the very first versions of SQL Server, stored procedures have been described as
being stored in a compiled form. The process of coming up with a query plan for a
batch is also frequently described as compilation. However, until SQL Server 2014 and
in-memory OLTP, what was described as compilation wasn't really true compilation.
SQL Server stored query plans in an internal form, after they had been parsed and
normalized, but they were not truly compiled. When executing the plan, the execution
engine walks the query tree and interprets each operator as it is executed, calling appropriate database functions. This is far more expensive than for a true compiled plan,
composed of machine language calls to actual CPU instructions.
When processing a query, the runtime costs include locking, latching, and disk I/O, and
the relatively small cost and overhead associated with interpreted code, compared to
compiled code, gets "lost in the noise." However, in true performance tuning methodology, there is always a bottleneck; once we remove one, another becomes apparent.
Once we remove the overhead of locking, latching, and disk I/O, the cost of interpreted
code becomes a major component, and a potential bottleneck.
The only way to substantially speed up processing time is to reduce the number of
internal CPU instructions executed. Assume that in our system we use one million CPU
instructions per transaction, which results in 100 transactions per second (TPS). To
achieve a 10-times performance improvement, to 1,000 TPS, we would have to decrease
the number of instructions per transaction to 100,000, which is a 90% reduction.
To satisfy the original vision for Hekaton, and achieve a 100-times performance
improvement, to 10,000 TPS, would mean reducing the number of instructions per
transaction to 10,000, a 99% reduction! A reduction of this magnitude would be
impossible with SQL Server's existing interpretive query engine or any other existing
interpretive engine.
# of rows    Disk-based table    Memory-optimized table    Improvement
1            0.734               0.040                     10.8 times
10           0.937               0.051                     18.4 times
100          2.72                0.150                     18.1 times
1,000        20.1                1.063                     18.9 times
10,000       201                 9.85                      20.4 times

Table 7-1: Performance comparison for read operations against a disk-based and a memory-optimized table.
# of rows    Disk-based table    Memory-optimized table    Improvement
1            0.910               0.045                     20.2 times
10           1.38                0.059                     23.4 times
100          8.17                0.260                     31.4 times
1,000        41.9                1.50                      27.9 times
10,000       439                 14.4                      30.5 times

Table 7-2: Performance comparison for UPDATE operations against a disk-based and a memory-optimized table.
Again, the more rows being processed, the greater the performance benefit when using
memory-optimized tables. For UPDATE operations, Microsoft was able to realize a 30-fold
gain with 10,000 operations, achieving a throughput of 1.9 million updates per core.
Finally, Table 7-3 gives some performance comparisons for a mixed environment, with
both SELECT and UPDATE operations. The workload consists of 50% INSERT transactions
that append a batch of 100 rows in each transaction, and 50% SELECT transactions that
read the more recently inserted batch of rows.
# of CPUs    Disk-based table    Memory-optimized table      Memory-optimized table accessed
                                 accessed using interop      using natively compiled procedure
1            985                 1,450                       4,493
4            2,157               3,066                       15,127
8            4,211               6,195                       30,589
12           5,834               8,679                       37,249

Table 7-3: Throughput (TPS) for the mixed workload, by number of CPUs.
With one CPU, using a natively compiled procedure increased the TPS by more than
four times, and for 12 CPUs, the increase was over six times.
WITH ( MEMORY_OPTIMIZED=ON,
       DURABILITY=SCHEMA_ONLY );
Listing 7-4: Creating the bigtable_inmem memory-optimized table with SCHEMA_ONLY durability.
Next, Listing 7-5 creates an interop (not natively compiled) stored procedure called
ins_bigtable that inserts rows into bigtable_inmem. The number of rows to insert
is passed as a parameter when the procedure is called.
------- Create the procedure -------
CREATE PROC ins_bigtable ( @rows_to_INSERT int )
AS
BEGIN
    SET nocount on;
    DECLARE @i int = 1;
    DECLARE @newid uniqueidentifier
    WHILE @i <= @rows_to_INSERT
    BEGIN
        SET @newid = newid()
        INSERT dbo.bigtable_inmem ( id, account_id, trans_type_id,
                                    shop_id, trans_made, trans_amount )
        VALUES( @newid,
                32767 * rand(),
                30 * rand(),
                100 * rand(),
                dateadd( second, @i, cast( '20130410' AS datetime ) ),
                ( 32767 * rand() ) / 100. ) ;
        SET @i = @i + 1;
    END
END
GO
Listing 7-5: Creating the interop stored procedure ins_bigtable.
Finally, Listing 7-6 creates the equivalent natively compiled stored procedure.
CREATE PROC ins_native_bigtable ( @rows_to_INSERT int )
with native_compilation, schemabinding, execute AS owner
AS
BEGIN ATOMIC WITH
( TRANSACTION ISOLATION LEVEL = SNAPSHOT,
LANGUAGE = N'us_english')
DECLARE @i int = 1;
DECLARE @newid uniqueidentifier
WHILE @i <= @rows_to_INSERT
BEGIN
SET @newid = newid()
INSERT dbo.bigtable_inmem ( id, account_id, trans_type_id,
shop_id, trans_made, trans_amount )
VALUES( @newid,
32767 * rand(),
30 * rand(),
100 * rand(),
dateadd( second, @i, cast( '20130410' AS datetime ) ),
( 32767 * rand() ) / 100. ) ;
SET @i = @i + 1;
END
END
GO
Listing 7-6: Creating the natively compiled stored procedure ins_native_bigtable.
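A sketch of the call in Listing 7-7, matching the one-million row count used in the timings below; with no enclosing transaction, each INSERT inside the procedure's loop commits as its own transaction:

EXEC ins_bigtable @rows_to_INSERT = 1000000;
GO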
Listing 7-7: Calling the interop procedure to insert one million rows, one transaction per row.
When I executed the EXEC above, it took 28 seconds, as indicated in the status bar in
SQL Server Management Studio. You might want to record the amount of time it took on
your SQL Server instance.
Next, Listing 7-8 calls the interop procedure inside a transaction, so that all the INSERT
operations are a single transaction.
DELETE bigtable_inmem;
GO
BEGIN TRAN
    EXEC ins_bigtable @rows_to_INSERT = 1000000;
COMMIT TRAN
Listing 7-8: Calling the interop procedure to insert one million rows in a single transaction.
When I executed the EXEC above, it took 14 seconds, which was half the time it took to
insert the same number of rows in separate transactions. The savings here are primarily
due to the reduction in the overhead of managing a million separate transactions.
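A sketch of the equivalent call to the natively compiled procedure (again clearing the table first, and wrapping the call in a single transaction):

DELETE bigtable_inmem;
GO
BEGIN TRAN
    EXEC ins_native_bigtable @rows_to_INSERT = 1000000;
COMMIT TRAN
GO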
Listing 7-9: Calling the natively compiled procedure to insert one million rows in a single transaction.
Running this natively compiled procedure to insert the same 1,000,000 rows
took only 3 seconds, less than 25% of the time it took to insert the rows through
an interop procedure.
Of course, your results may vary depending on the kinds of operations you are
performing; and keep in mind that I was testing this on SCHEMA_ONLY tables. For this
example, I wanted to show you the impact that native compilation itself could have
without interference from the overhead of disk writes that the CHECKPOINT process is
doing, and any logging that the query thread would have to perform.
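A hedged sketch of enabling the collection (the parameter names, and the use of ins_native_bigtable as the target procedure, are assumptions):

-- enable statistics collection for all natively compiled procedures on the instance
EXEC sys.sp_xtp_control_proc_exec_stats @new_collection_value = 1;
GO
-- enable query-level statistics collection for a single natively compiled procedure
DECLARE @dbid  int = DB_ID() ,
        @objid int = OBJECT_ID('dbo.ins_native_bigtable');
EXEC sys.sp_xtp_control_query_exec_stats @new_collection_value = 1 ,
                                         @database_id = @dbid ,
                                         @xtp_object_id = @objid;
GO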
Listing 7-10: Syntax for procedures to enable statistics collection for natively compiled procedures.
As suggested, performance decreases when you enable statistics collection, but obviously
collecting statistics at the procedure level with sys.sp_xtp_control_proc_exec_stats
will be less expensive than using sys.sp_xtp_control_query_exec_stats
to gather statistics for every query within every procedure.
If we only need to troubleshoot one, or a few, natively compiled stored procedures, there
is a parameter for sys.sp_xtp_control_query_exec_stats to enable statistics
collection for a single procedure, so we can run sys.sp_xtp_control_query_exec_stats
once for each of those procedures.
Summary
This chapter discussed how to create natively compiled stored procedures to access
memory-optimized tables. These procedures generate far fewer CPU instructions for the
engine to execute than the equivalent interpreted T-SQL stored procedure, and can be
executed directly by the CPU, without the need for further compilation or interpretation.
There are some limitations in the T-SQL constructions allowed in natively compiled
procedures, and so certain transformations that the optimizer might have chosen are not
supported. In addition, because of differences in the way that memory-optimized tables
are organized and managed, the optimizer often needs to make different choices than
it would make from a similar operation on a disk-based table. We reviewed some of the
main differences.
When we access memory-optimized tables, which are also compiled, from natively
compiled stored procedures, we have a highly efficient data access path, and the fastest
possible query processing. We examined some Microsoft-generated performance data
to get an idea of the potential size of the performance advantage to be gained from
the use of natively compiled procedures, and we looked at how to run our own performance tests, and also how to collect performance diagnostic data from some Dynamic
Management Views.
Additional Resources
Architectural Overview of SQL Server 2014's In-Memory OLTP Technology:
http://blogs.technet.com/b/dataplatforminsider/archive/2013/07/22/architectural-overview-of-sql-server-2014-s-in-memory-oltp-technology.aspx.
A peek inside the in-memory OLTP engine:
http://blogs.msdn.com/b/igorpag/archive/2014/01/15/sql-server-2014-inside-hekaton-natively-compiled-stored-procedures.aspx.
Hekaton: SQL Server's Memory-Optimized OLTP Engine:
http://research.microsoft.com/pubs/193594/Hekaton%20-%20Sigmod2013%20final.pdf.
Feature Support
In-memory OLTP and databases containing memory-optimized tables support much,
though not all, of the SQL Server feature set. As we've seen throughout the book,
SQL Server Management Studio works seamlessly with memory-optimized tables,
filegroups and natively compiled procedures. In addition, we can use SQL Server Data
Tools (SSDT), Server Management Objects (SMO) and PowerShell to manage our
memory-optimized objects.
Database backup and restore are fully supported, as is log shipping. In terms of other
"High Availability" solutions, AlwaysOn components are supported, but database
mirroring and replication of memory-optimized tables are unsupported; a memory-optimized table can be a subscriber in transactional replication, but not a publisher.
In-memory OLTP feature support
For the full list of supported and unsupported features, please refer to the SQL Server In-Memory OLTP
documentation: http://msdn.microsoft.com/en-us/library/dn133181(v=sql.120).aspx.
In this first version of in-memory OLTP, natively compiled stored procedures support
only a limited subset of the full T-SQL "surface area." Fortunately, SQL Server
Management Studio for SQL Server 2014 includes a tool called Native Compilation
Advisor, shown in Figure 8-1, which will highlight any constructs of an existing stored
procedure that are incompatible with natively compiled procedures.
Figure 8-1: The Native Compilation Advisor in SQL Server Management Studio.
The Native Compilation Advisor will generate a list of unsupported features used in the
existing procedure, and can generate a report, like the one shown in Figure 8-2.
180
Figure 8-2: A Native Compilation Advisor report listing unsupported constructs in a stored procedure.
Another feature, that works similarly to the Native Compilation Advisor, is the Memory
Optimization Advisor, available from SQL Server Management Studio 2014 when you
right-click on a disk-based table. This tool will report on table features that are unsupported, such as LOB columns, and IDENTITY columns with increment other than 1.
This tool will also provide information such as the estimated memory requirement for
the table if it is converted to be memory optimized. Finally, the Memory Optimization
Advisor can actually convert the table to a memory-optimized table, as long as it doesn't
contain unsupported features.
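Listing 8-1 creates the resource pool that the database is bound to in Listing 8-2. A sketch of such a pool definition (the 50% memory cap is an arbitrary example value):

CREATE RESOURCE POOL HkPool
    WITH ( MAX_MEMORY_PERCENT = 50 );
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO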
Listing 8-1: Creating a resource pool to manage the memory used by memory-optimized tables.
Next, we need to bind the databases that we wish to manage to their respective pools,
using the procedure sp_xtp_bind_db_resource_pool. Note that one pool may
contain many databases, but a database is only associated with one pool at any point
in time.
EXEC sp_xtp_bind_db_resource_pool 'HkDB', 'HkPool';
Listing 8-2: Binding the HkDB database to the HkPool resource pool.
A new catalog view, sys.hash_indexes, has been added to support hash indexes.
This view is based on sys.indexes, so it has the same columns as that view, with one
extra column added. The bucket_count column shows a count of the number of hash
buckets specified for the index and the value cannot be changed without dropping and
recreating the index.
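For example, a quick way to review all hash indexes in the current database, and their bucket counts, is a query along these lines:

SELECT OBJECT_NAME(object_id) AS table_name ,
       name AS index_name ,
       index_id ,
       bucket_count
FROM   sys.hash_indexes;
GO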
In addition, there are several new dynamic management objects that provide information
specifically for memory-optimized tables.
Extended events
The in-memory OLTP engine provides three extended event packages to help in
monitoring and troubleshooting. Listing 8-6 reveals the package names and the number
of events in each package.
SELECT
p.name AS PackageName ,
COUNT(*) AS NumberOfEvents
FROM
sys.dm_xe_objects o
JOIN sys.dm_xe_packages p ON o.package_guid = p.guid
WHERE
p.name LIKE 'Xtp%'
GROUP BY p.name;
GO
Listing 8-6: The in-memory OLTP extended event packages and the number of events in each.
SELECT p.name AS PackageName ,
       o.name AS EventName ,
       o.description AS EventDescription
FROM   sys.dm_xe_objects o
       JOIN sys.dm_xe_packages p ON o.package_guid = p.guid
WHERE  p.name LIKE 'Xtp%';
GO
Listing 8-7: Listing the extended events in the in-memory OLTP packages.
Performance counters
The in-memory OLTP engine provides performance counters to help in monitoring and
troubleshooting. Listing 8-8 returns the performance counters currently available.
SELECT object_name AS ObjectName ,
       counter_name AS CounterName
FROM   sys.dm_os_performance_counters
WHERE  object_name LIKE 'XTP%';
GO
Listing 8-8: The in-memory OLTP performance counters.
Table 8-1: The in-memory OLTP performance counter objects (XTP Cursors, XTP Garbage Collection, XTP Phantom Processor, XTP Storage, XTP Transaction Log, and XTP Transactions) and their descriptions.
)
WITH (MEMORY_OPTIMIZED = ON );
GO
DECLARE @SalesDetail SalesOrderDetailType_inmem;
GO
Listing 8-9: Creating a memory-optimized table type and declaring a table variable of that type.
In-memory OLTP is still a new technology and, as of this writing, there are only a few
applications using memory-optimized tables in a production environment (later in the
chapter, I list a few such applications). As more and more applications are deployed and
monitored, best practices will be discovered.
CPU-intensive operations
A common requirement is to load large volumes of data, as discussed previously, but then
to process the data in some way, before it is available for reading by the application. This
processing can involve updating or deleting some of the data, if it is deemed inappropriate, or it can involve computations to put the data into the proper form for use.
The biggest bottleneck that the application will encounter in this case is the locking and
latching as the data is read for processing, and then the CPU resources required once
processing is invoked, which will vary depending on the complexity of the code executed.
As discussed, in-memory OLTP can provide a solution for all of these bottlenecks.
Current applications
As noted earlier, there are currently relatively few applications using memory-optimized
tables in a production environment, but the list is growing rapidly. When considering a
migration, you might want to review the published information regarding the types of
application that are already benefiting from running SQL Server In-Memory OLTP;
Microsoft has published a number of customer case studies describing such applications.
Consider the following list of steps as a guide, as you work through a migration to
in-memory OLTP:
1. Capture baseline performance metrics by running queries against the existing tables.
2. Identify the tables with the biggest bottlenecks.
Summary
Using SQL Server In-Memory OLTP, we can create and work with tables that are
memory-optimized and extremely efficient to manage, often providing performance
optimization for OLTP workloads. They are accessed with true multi-version optimistic
concurrency control requiring no locks or latches during processing. All in-memory
OLTP memory-optimized tables must have at least one index, and all access is via indexes.
In-memory OLTP memory-optimized tables can be referenced in the same transactions
as disk-based tables, with only a few restrictions. Natively compiled stored procedures are
the fastest way to access your memory-optimized tables and perform business logic
computations.
If most, or all, of an application's data is able to be entirely memory resident, the costing
rules that the SQL Server optimizer has used since the very first version become almost
completely obsolete, because the rules assume all pages accessed can potentially require
a physical read from disk.
Additional Resources
Managing Memory for In-Memory OLTP:
http://msdn.microsoft.com/en-us/library/dn465872.aspx.
Using the Resource Governor, an extensive white paper written when the feature was
introduced in SQL Server 2008:
http://bit.ly/1sHhaPQ.
Resource Governor in SQL Server 2012, which covers significant changes in this release:
http://msdn.microsoft.com/en-us/library/jj573256.aspx.
Extended Events; the best place to get a start on working with extended events is in
the SQL Server documentation:
http://msdn.microsoft.com/en-us/library/bb630282(v=sql.120).aspx.
Common Workload Patterns and Migration Considerations, covering the types of bottlenecks
and workloads that are most suited to in-memory OLTP:
http://msdn.microsoft.com/en-us/library/dn673538.aspx.
Transact-SQL Constructs Not Supported by In-Memory OLTP, with recommended
workarounds for the current limitations in support for the T-SQL surface area:
http://msdn.microsoft.com/en-us/library/dn246937(v=sql.120).aspx.