SQL Server Always On Compute Engine
March 2017
Table of contents
Introduction
About the architecture
Introduction
This white paper shows you how to build a Microsoft SQL Server Always On Availability Group
on Google Compute Engine. This paper might be of interest if you're a database administrator,
Windows administrator, or developer and you want to build a Microsoft SQL Server environment
that's highly available by using Always On Availability Groups.
This paper skips a lot of common SQL Server setup tasks that don't change when installing on a
virtual machine (VM) in the cloud.
You'll use PowerShell to accomplish some of the setup and configuration tasks, but you don't
need to be familiar with PowerShell in order to follow along.
To keep it simple, you're not going to deal with network security, VPNs, application servers, and
so on, and you're not going to expose SQL Server to the Internet.
Always On Availability Groups (AGs) can make sense in the cloud for a few reasons:
Automatic repair of corrupt data pages. Each replica has its own copy of the
database's data pages. Only logged transactions are sent from one replica to another,
not data pages. When a replica encounters a corrupt data page, it can request a clean
copy of the page from another replica and heal itself live. This doesn't protect from every
kind of corruption, but it can be a real lifesaver in many situations. Learn more about
automatic page repair in SQL Server Books Online.
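As a quick check, you can see whether automatic page repair has kicked in on a replica by querying a couple of standard system views (a sketch; exact columns vary slightly by SQL Server version):

```sql
-- Recent automatic page repair attempts on this replica
SELECT DB_NAME(database_id) AS database_name,
       file_id,
       page_id,
       error_type,
       page_status,
       modification_time
FROM sys.dm_hadr_auto_page_repair;

-- Pages SQL Server has flagged as suspect
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages;
```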
Potentially lower downtime for patching. For minimal downtime, patch the secondary
replica, make sure it's working successfully, then take a brief outage to fail over from the
primary replica to the newly patched secondary. After that, you can patch the former
primary instance. If something goes wrong when patching either replica, simply build a
new one to replace it and join it into the AG.
Possibly easier scalability with lower downtime. When you want to switch to a bigger
instance type, you can provision a new one, add it into the AG, fail over to it, and then
remove the previous instance. You can scale up and down to handle load with this
approach, although this isn't commonly done on a scripted basis to handle daily peaks
and valleys. It's more of a seasonal approach.
Our 2-node Availability Group design can come with a few drawbacks:
Higher cost. Having a warm standby means basically doubling the Compute Engine
costs.
More complexity. This is a little harder to manage than a conventional single SQL
Server VM and requires Windows clustering knowledge.
No disaster recovery. If you want the ability to fail over to another zone or region, you
need to add additional VMs, but you can do that later.
Before you begin
This tutorial makes the following assumptions:
You're using SQL Server 2016 for this project. Setup is different for prior versions.
You're using Windows Server 2016.
Domain controllers and DNS are already set up.
Application servers run on Compute Engine. App servers in other places would require
more complex security and network configuration that's out of scope for this white paper.
Generally:
If your SQL Server needs are unpredictable, consider using pay-per-use licensing.
If your SQL Server needs are predictable for at least a year, consider buying licenses and
bringing your own.
This white paper covers both licensing approaches. If you're not sure which one to use, try
pay-per-use because it's easier to get started. Just remember to shut down your instances
when you don't need them.
Building a 2-node availability group on Compute
Engine
You're going to build two VMs, each running Microsoft SQL Server, but it's a bit more
complicated than what you might be used to doing in an on-premises setup.
You'll use the following computer names, each a Windows VM running SQL Server:
BOU-SQL1, referred to as node1, server1, or the first replica in this paper, depending on
the context.
BOU-SQL2, referred to as node2, server2, or the second replica in this paper, depending
on the context.
Both of those VMs will be in a Windows cluster. If you've used Windows failover clustering in the
past, you might have used a cluster that required a shared quorum drive, such as the Q drive,
that was used for voting to determine which node was the primary at any given time.
Always On Availability Groups is built on Windows clustering. Even if you don't have a shared
drive (you won't in this example), Windows still needs to determine which node is the primary
and which nodes are the readable secondaries. Clustering is outside the scope of this white
paper, but there's one thing you need to know: clustering requires some networking and virtual
server name infrastructure.
Later in this white paper, you'll use a couple of other names and IP addresses:
bou-cluster1 is a virtual name that points to the node that owns the cluster
infrastructure (the Windows plumbing that determines which node is the primary).
TestListener is a virtual name that points to the node that owns the databases: the
writable primary.
These two names will fail over back and forth between nodes.
If you've built labs at home before, you might have put your two SQL Server VMs right next to
each other in the same subnet. For example, maybe you used 192.168.1.10 and
192.168.1.11. Due to some complex networking issues, that doesn't fly here.
However, for this failover process to work in Compute Engine, your two SQL Server virtual
machines have to be in different network subnets. Note the differences in the IP addresses and
subnets when you create the VMs. You will need to:
Create the VMs in different subnets.
Set up network rules between the subnets, so that Google can route traffic between
them.
Test to make sure that the communication works.
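The subnet and routing steps above can be sketched with the Cloud SDK. The network and subnet names and IP ranges here are illustrative assumptions, not values prescribed by this paper, and flag names can differ between SDK versions:

```shell
# Create a custom-mode network with one subnet per SQL Server node
gcloud compute networks create wsfcnet --subnet-mode custom
gcloud compute networks subnets create wsfcsubnet1 --network wsfcnet \
    --region us-central1 --range 10.0.0.0/24
gcloud compute networks subnets create wsfcsubnet2 --network wsfcnet \
    --region us-central1 --range 10.1.0.0/24

# Allow traffic between the subnets, plus RDP in from your environment
gcloud compute firewall-rules create allow-internal --network wsfcnet \
    --allow tcp,udp,icmp --source-ranges 10.0.0.0/24,10.1.0.0/24
gcloud compute firewall-rules create allow-rdp --network wsfcnet \
    --allow tcp:3389
```

After creating the rules, test connectivity (for example, by pinging across subnets) before moving on.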
This white paper provides you with an exact design, complete with IP addresses, that you'll
need to follow to get this tutorial to work. In the real world, you'll want to work closely with your
network team to pick the right IP addresses, subnets, and network security rules so that your
app servers and end users can connect to SQL Server.
For your first test AG, don't worry too much about the zone. When it's time to design
your production infrastructure, you'll want to put your database servers in the same zone
as your application servers in order to keep latency low. To learn more about what's
available in each zone, head over to Google's list of regions and zones, which lists what
machine types are available in each location.
4. For Machine type, we don't recommend going any smaller than 4 vCPUs with 15 GB of
memory (n1-standard-4), and it's likely that you'll need larger for production. Later on
in this white paper, you'll see how to choose your production instance size based on your
current workloads.
5. Change Boot disk to a custom image by clicking Change and then either OS images or
Custom images, depending on your licensing.
You will use SQL Server 2016 on Windows 2016 with a 200 GB SSD. Smaller boot
drives run the risk of filling up due to Windows Updates or other additional files.
Important: Picking your licensing model is important because you can't change this quickly later
on. If you pick the wrong licensing model, you'll need to build new SQL Server VMs using the
correct OS image, join them into the AG, fail over to the new VMs, and then decommission
the old VMs.
Bring your own licenses:
In this same screen, you'll configure the type and size of your SSD:
This paper keeps the storage simple. When performance tuning, you'll need to think about how
to design the disk layouts for your data, logs, and TempDB.
6. Expand Management, disk, networking, SSH keys and then select Networking.
7. Specify the Network you created and which Subnetwork this VM should be in. Use a
custom, unused Internal IP address with no External IP to avoid exposing SQL Server
to the Internet.
8. Enable IP forwarding.
9. Click Create.
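The console steps above can also be sketched as a single gcloud command. The zone, subnet, and internal IP below are illustrative assumptions; for pay-per-use licensing, the preinstalled SQL Server images live in the windows-sql-cloud image project:

```shell
gcloud compute instances create bou-sql1 \
    --zone us-central1-f \
    --machine-type n1-standard-4 \
    --image-project windows-sql-cloud \
    --image-family sql-ent-2016-win-2016 \
    --boot-disk-size 200GB \
    --boot-disk-type pd-ssd \
    --subnet wsfcsubnet1 \
    --private-network-ip 10.0.0.4 \
    --no-address \
    --can-ip-forward
```

The --no-address flag keeps the VM off the public Internet, and --can-ip-forward corresponds to the Enable IP forwarding checkbox.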
Creating the second VM
Now create your next VM that uses a different subnet. Use the following screenshots to guide
you.
Bring your own licenses:
Your VM Instances should now look similar to the following screenshot:
This paper doesn't walk you through installing the SDK, but if you're going to do a lot of cloud
work long term, consider doing that. Long term, it's easier to use your local command prompt
and run a few scripts to configure your cloud resources. If you're just experimenting, use the
Google Cloud Shell in your browser and you won't have to worry about network connectivity and
firewalls.
Bring your own licenses:
1. First, create a disk, and then from the new disk create a custom Windows Server image
with MULTI_IP_SUBNET enabled in the guestOSFeatures parameter.
To RDP to a VM, you'll need to download the RDP file and then use that to connect. Mac users
can open this file at no charge with the Microsoft Remote Desktop app from the Mac App Store.
Note: If you restart the server, you might need to download a new file because the external IP
might have changed.
Double-click the downloaded RDP file and use the local account you created to connect.
If you get a timeout error, most likely you're dealing with a network firewall somewhere between
you and your cherished virtual machines. Check with your local sysadmins to make sure port
3389 is open out of your own environment. Then, if another team is managing your Cloud
Platform networking, make sure they haven't blocked 3389 there.
Server1:
Server2:
4. Join the servers to the domain by executing the following command on both servers:
netdom join %computername% /domain:dbeng.com
/userd:Administrator /passwordd:* /reboot
5. In Google Cloud Shell, run the following commands to add the routes needed for
Availability Groups.
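The general shape of those commands is a set of static routes that tell Google's network to deliver the cluster IP and the listener IP to the node that currently owns them. A sketch, with placeholder names and addresses (use the values from your own design):

```shell
# Route the cluster-management IP to node1 (repeat per node with its own IPs)
gcloud compute routes create cluster-route-node1 --network wsfcnet \
    --destination-range 10.0.0.5/32 --next-hop-instance bou-sql1 \
    --next-hop-instance-zone us-central1-f --priority 1

# Route the listener IP to node1
gcloud compute routes create listener-route-node1 --network wsfcnet \
    --destination-range 10.0.0.6/32 --next-hop-instance bou-sql1 \
    --next-hop-instance-zone us-central1-f --priority 1
```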
1. Add the clustering tools to both servers by running the following command in an admin
PowerShell window:
Install-WindowsFeature Failover-Clustering -IncludeManagementTools
2. RDP to one of the servers using a domain account that has administrative privileges.
3. Open Failover Cluster Manager.
4. Right click on Failover Cluster Manager and select Create Cluster.
5. Click Next to go to the Select Servers page and add both servers.
6. Click Next and keep the option to run configuration validation tests.
7. Click Next to get to the Validate a Configuration Wizard screen.
8. On the Testing Options page, select Run only tests I select, and then click Next.
9. Unselect Storage on the Test Selection page, because the Storage option will fail in
Compute Engine (as it would for separate standalone physical servers). It is not needed
for Availability Groups; it's for traditional failover-cluster instances (FCIs), where every
node needs to see the shared storage locations where data and log files reside.
10. Click Next twice to run the tests. Make sure none of the tests have failed. A few
warnings are common:
Only one network path between replicas. In the old days, you would build a
separate cluster heartbeat network. That's not the case today in the cloud, and
you can ignore this one.
Windows Updates not the same on both replicas. If you configured them to apply
updates automatically, one of them might have applied updates that the other
hasn't downloaded yet. Generally speaking, you don't want to automatically apply
updates to SQL Server.
Pending reboot. You've made changes to one of the servers, and it needs a
reboot to apply them. Don't ignore this one.
Important: Before you click Finish, you really want these tests to succeed. Whenever
you look at a cluster, check Failover Cluster Validation to see if the tests have passed. If
they haven't, start to ask questions about whether the tests have ever been run.
13. Click Next twice to create the cluster and then Finish to complete the wizard.
Achieving quorum
Set up a file-share witness to achieve quorum.
1. First create a file share on a server that is not part of this cluster. For example, use a
domain controller: \\DC-WINDOWS\Witness.
2. Right click on the cluster, select More Actions and then Configure Cluster Quorum
Settings.
3. Click Next.
4. Select the option for Select the quorum witness and then click Next.
5. Select the option for Configure a file share witness.
6. Input the path to the file share and then click Next.
7. Click Next after confirming the settings.
8. Click Finish to end the wizard.
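The same witness can also be configured from an admin PowerShell window on a cluster node. A sketch, assuming the share path from step 1:

```powershell
# Point the cluster quorum at the file share on the domain controller
Set-ClusterQuorum -FileShareWitness "\\DC-WINDOWS\Witness"
```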
2. Run through the wizard, ensuring all tests passed.
At this point, you've built the Windows clustering plumbing. Windows can now manage voting to
understand which nodes own a particular service or function, including AGs.
There can be additional complexities of Windows clustering. For production, you'd also want to
consider things like:
Witness design. This Books Online page discusses Windows Server 2012, but the
basics still apply to Windows Server 2016:
https://technet.microsoft.com/library/jj612870.aspx.
Distributed transactions. Newly supported with Always On Availability Groups in
Windows Server 2016 and SQL Server 2016, but there's an issue that you'll see later in this article.
4. Change the account to dbeng\sql.service and provide the password.
5. On the AlwaysOn High Availability tab, select the checkbox for Enable AlwaysOn
Availability Groups.
6. Click OK and then click Yes in the restart warning dialog box.
7. Click OK in the warning dialog box about saving changes.
8. Restart the SQL Server service.
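If you prefer to script the AlwaysOn step, the SQL Server PowerShell module has an equivalent cmdlet. A sketch, run for each node (the -Force switch restarts the SQL Server service for you):

```powershell
# Enable the Always On feature on both instances; restarts each service
Enable-SqlAlwaysOn -ServerInstance "BOU-SQL1" -Force
Enable-SqlAlwaysOn -ServerInstance "BOU-SQL2" -Force
```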
1. For this white paper, use C:\SQLBackup for simplicity on node2 and share it as
SQLBackup. Use the following screenshots to guide you:
There are, of course, some issues with this simple demonstration approach, so you
wouldn't use it for production servers. For this tutorial, though, it keeps things simple
and moving forward.
2. Run the following commands in an elevated command prompt on both nodes to open up
ports 1433 and 5022 if you are using Windows Firewall (recommended).
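The firewall rules typically look like the following sketch (rule names are arbitrary):

```shell
netsh advfirewall firewall add rule name="SQL Server" dir=in action=allow protocol=TCP localport=1433
netsh advfirewall firewall add rule name="SQL AG Endpoint" dir=in action=allow protocol=TCP localport=5022
```

Port 1433 is the default SQL Server port, and 5022 is the default Availability Group endpoint port.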
3. On node1, create a test database or restore your database. The recovery model must be
set to Full. Perform a FULL backup as this is a prerequisite.
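For a quick test database, the T-SQL might look like this sketch (the database name and backup path are assumptions; the SQLBackup share is created on node2 in the earlier step):

```sql
CREATE DATABASE TestDB;
ALTER DATABASE TestDB SET RECOVERY FULL;

-- The first full backup is the prerequisite for joining the database to an AG
BACKUP DATABASE TestDB
TO DISK = N'\\BOU-SQL2\SQLBackup\TestDB.bak' WITH INIT;
```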
1. Right click on AlwaysOn High Availability and select New Availability Group Wizard.
2. On the Specify Name page, specify a name for your AG and select the check box for
Database Health Detection and then click Next. In SQL Server 2012 and 2014, a
failover would not occur if the database became inaccessible. In SQL Server 2016, a
failover will occur if the database becomes inaccessible and if you have enabled this
feature.
3. Select the database you want to add to the AG and then click Next.
Notice that the prerequisites have not been met for ReportServer and
ReportServerTempDB.
A full backup is required in order to put a database into an AG. SQL Server doesn't
consider a database to be in full recovery until the first full backup has been taken.
Automatic Failover
Synchronous Commit
Readable Secondary
You can use Yes or Read-intent only for Readable Secondary. Read-intent only
allows only read-intent connections, whereas Yes allows all connections. Yes is needed
if you have applications using database drivers that do not support the
ApplicationIntent=ReadOnly connection parameter.
Don't change anything on the tabs for Endpoints, Backup Preferences, and Listener.
5. Click Next.
6. Select Full and specify the path to the backup share on node2.
If you restored your own database and it is sizable, you could opt to use the previous full
backup and the log backup chain. If you did this, you would restore the database and log
chain and then select the Join only option.
SQL Server 2016 also adds Direct Seeding, which is the ability to seed replicas
automatically without taking backups. You can read about some caveats here:
https://www.brentozar.com/archive/tag/directseeding/.
7. Click Next and make sure all tests passed with Success, with the exception of the
listener test which will show Warning. You can ignore the warning as you will create the
listener later. If any tests failed, resolve those issues so that you can continue.
8. Click Next and then Finish. You could also save the script from this final page by using
the Script drop-down list.
9. After the AG is successfully created, click Close.
Adding the listener
You can now add the listener.
1. Expand the AG, right click on Availability Group Listeners, and select Add Listener.
2. Give the listener a name. This will be the name that you use in connection strings, so
keep that in mind when naming it. You can change it later if needed.
Tip: In production systems, you might consider using your old server's name as your
listener name. For example, if you're migrating away from SQL2012B, then when you
decommission that old server, you could add a new listener with that name. Your
applications wouldn't even know that their databases had moved to an AG. However, if
you do that, you won't be able to build and test the listener ahead of time.
Instead, use a new name for the listener, but when it's time to decommission
SQL2012B, turn that name into a DNS CNAME. Rename the old SQL2012B to a different
name (assuming that you have to keep the box around for a while for safekeeping), and
then work with your network admins to create a new DNS CNAME for SQL2012B that
points to your new listener name. This way, if someone forgets to change the connection
string in their application, it'll still work.
4. Click OK to create the listener.
Testing your Availability Group
Whenever you implement a high-availability feature, you need to test four things:
1. A planned failover with zero data loss. How you'll deal with patching, or a planned
migration from one instance type to another.
2. An unplanned failover with zero data loss. For example, when the primary replica
goes offline.
3. An unplanned failover with data loss. When replication from the primary to the
secondary falls behind, and then the primary fails.
4. A planned failback with zero data loss. After scenario #2 happens, and you bring the
former primary back online, you need to re-establish synchronization and deal with the
data differences between the replicas.
Test #1: Failing over manually - A planned failover with zero data
loss
We recommend that you don't use Failover Cluster Manager to perform failovers for AGs. The
manager is not aware of the synchronization status of the replicas, and using it can lead to an
outage. Make sure all system admins are aware of this issue. It can be especially tricky because
it seems to work (it doesn't throw an error), but that doesn't mean SQL Server won't have
issues when your systems fail over under duress.
To perform failovers, use the Fail Over Availability Group wizard or TSQL. You can initiate the
wizard from a primary replica or a secondary replica.
2. On the Select New Primary Replica page, verify that the secondary replica is selected
as the server to fail over to, and then click Next.
3. Connect to the secondary server by using the Connect button, then click Connect, and
then click Next.
4. Click Finish to perform the failover and then Close when it finishes.
When there's a planned, graceful failover, SQL Server automatically starts synchronizing data
over to the other replica (the former primary). You can fail over back and forth, without data loss,
and the replicas are automatically kept in sync.
Using TSQL
1. Connect to the secondary replica that you want to fail over to and then run the following
command.
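The command is the standard manual-failover statement; this sketch assumes the AG is named TestAG:

```sql
-- Run on the secondary replica you are failing over TO
ALTER AVAILABILITY GROUP [TestAG] FAILOVER;
```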
2. Refresh Object Explorer for the server you failed over to and see that the role has
changed.
2. Use :CONNECT syntax to run the commands underneath it on another server.
Notice that the Query Editor window is connected to BOU-SQL1 but that the failover
command will run on BOU-SQL2.
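The :CONNECT directive requires SQLCMD mode (Query > SQLCMD Mode in Management Studio). A sketch, again assuming the AG is named TestAG:

```sql
:CONNECT BOU-SQL2
ALTER AVAILABILITY GROUP [TestAG] FAILOVER;
GO
```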
3. Execute both commands and then refresh Object Explorer.
Important: Remember that all sysadmins should know not to use Failover Cluster Manager to
perform an AG failover, and should instead know how to do failovers in Management Studio.
This is really important when an organization has a DBA team and a different team that
performs reboots, such as for Microsoft security patching. Server admins with clustering
experience know to fail over SQL Server instances off the server that they are going to reboot,
so be sure that they know how to do an AG failover.
Test #2: Failing over automatically - An unplanned failover with
zero data loss
1. Initiate an automatic failover by stopping the SQL Server service on the primary replica.
2. Wait 15-30 seconds, and then check Object Explorer for the new primary replica.
Assuming everything went well, the AG failed over without data loss because the replica
it failed over to is configured for synchronous-commit with automatic failovers.
In this scenario, the failover wasn't planned. As long as the two replicas were in synchronous-
commit mode, no data is lost. When the stopped SQL Server service is started back up again, it
automatically starts fetching data from the new primary in a matter of minutes, and catches back
up to sync and allow automatic failovers.
However, while one of the two replicas is down, you have no safety net.
Some people aim for zero data loss by designing a two-node AG with synchronous commits.
This works as long as both replicas are up. SQL Server doesn't automatically shut down when
you drop below a minimum number of replicas. Your users might keep inserting data into the
one remaining replica, thinking that their data is safe, when it's not. Here you have a single
point of failure.
This is one reason why monitoring is very important. By default, SQL Server doesn't tell you that
you're down to a single point of failure, or that a replica has stopped synchronizing with other
replicas.
Test #3: Failing over manually - An unplanned failover with data
loss
You could use the two servers to test out a manual failover with data loss by switching node2 to
asynchronous-commit and then stopping SQL Server on node1. Instead, take it one step
further and add a third asynchronous server to the mix.
1. Create a third VM, join it to the domain, add it to the cluster (use Add Node in Failover
Cluster Manager), enable the AG feature and then add it to the AG by right clicking on
the AG and then clicking Add Replica.
2. On the Connect to Replicas page, connect to the secondary replica, which is now
BOU-SQL1 because BOU-SQL2 is primary.
3. Click Connect twice, once in the wizard and once in the Connect to Server dialog box,
and then click Next.
4. On the Specify Replicas page, click Add Replica, and enter the third server.
5. Click Connect in the Connect to Server dialog box.
6. Configure the new replica as asynchronous-commit with manual failovers. There
isn't an option to use automatic failovers with asynchronous-commit. You need to
initiate the failover yourself if there could be data loss, because AGs will never
automatically fail over in a way that loses data.
7. Decide how you'd like to synchronize the data. If your test database is small or empty, a
Full backup is easy. Point it to the backup share if it's not already filled in and then click
Next.
Note: If you encountered any errors along the way for this third VM, go back in the doc
and make sure you ran all the necessary commands. Run the same commands, with the
necessary changes, and follow the same steps as you did for the other two VMs.
Failing over
1. Stop the SQL Server service on both synchronous-commit replicas (the first two VMs).
This step partially simulates a region or zone failure where all of your synchronous
replicas have gone down. It isn't a perfect simulation, because Windows is still up.
2. Connect to the third VM in Management Studio. Notice that the database is in a Not
Synchronizing state because it's unable to reach the other two replicas.
3. Run the ALTER AVAILABILITY GROUP command to fail the AG over to the third
server, but use the option to allow data loss. This replica is asynchronous and might not
be completely up to date with all of the data changes that occurred on the primary replica
when both synchronous replicas went down. Failing over to it simulates what would
happen in the case of a disaster where you lose the primary data center and need to get
production online at the Disaster Recovery site.
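The forced failover statement looks like the following sketch (AG name assumed):

```sql
-- Run on the third server; explicitly accepts the possibility of data loss
ALTER AVAILABILITY GROUP [TestAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```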
The Not Synchronizing state for BOU-SQL1 and BOU-SQL2 isn't temporary: after you perform
a manual failover with data loss, SQL Server knows that there's data on SQL1 and SQL2 that
you might want to recover.
Microsoft has a great checklist for tasks you need to perform after a forced failover:
https://msdn.microsoft.com/en-us/library/ff877957.aspx#FollowUp. It talks about fixing a cluster
quorum, creating database snapshots in case you want to recover the lost data from the former
primaries, enabling synchronization from the new primary, and what might happen if the offline
nodes restart with their own quorum. This information is important if you plan on implementing a
multi-subnet AG.
Test #4: Failing back manually - A planned failback with zero data
loss
There are at least two ways to get the first two replicas back to a healthy state without rebuilding
the AG:
1. Remove the first two servers from the AG, restore the LOG backup chain, and then add
the first two servers back to the AG.
2. Remove the first two servers from the AG, restore a FULL backup and the LOG backup
chain, and then add the first two servers back to the AG.
There can be other ways, depending on factors like how far behind the AG was at the time of
the failover, how small the databases are, how fast the network is between sites, and so on.
This is where SQL Server 2016's Direct Seeding can come in handy, but you have to be careful
with large databases flooding your network connection, unchecked.
Here, you use option 1 for the first server and option 2 for the second server.
1. Remove the first two servers from the AG by using the UI or the TSQL command.
UI:
TSQL:
On the third server, use the REMOVE REPLICA option of the ALTER AVAILABILITY
GROUP command to remove a replica.
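A sketch of the removal, assuming the AG name and the replica names used in this paper:

```sql
-- Run on the third server (the current primary)
ALTER AVAILABILITY GROUP [TestAG] REMOVE REPLICA ON N'BOU-SQL1';
ALTER AVAILABILITY GROUP [TestAG] REMOVE REPLICA ON N'BOU-SQL2';
```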
2. Refresh Object Explorer on the third server to confirm that only the third replica is in the
AG.
3. Take a log backup on the third server and then restore it on the first server. If any other
log backups have run on the third server since the first two servers went offline,
restore those log backups first.
4. Use the WITH NORECOVERY option when restoring the LOG backups.
Third server:
First server (copy the .trn file to the first server or use a UNC path to reach the file on the third
server):
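A sketch of the backup and restore pair (the database name, file name, and the share on the third server are assumptions):

```sql
-- Third server (current primary):
BACKUP LOG TestDB TO DISK = N'C:\SQLBackup\TestDB_log.trn';

-- First server; NORECOVERY leaves the database ready to rejoin the AG
RESTORE LOG TestDB
FROM DISK = N'\\BOU-SQL3\SQLBackup\TestDB_log.trn' WITH NORECOVERY;
```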
5. Set the third server to synchronous-commit by right clicking on the replica and selecting
Properties.
6. Change Availability mode to Synchronous commit.
7. Add the first server back to the AG by right clicking on the AG and selecting Add
Replica.
8. On the Specify Replicas page, click Add Replica.
9. Specify the first server and click Connect.
10. Check the box for it to be a synchronous-commit replica, as you are going to fail over
to it without further data loss. You'll go back and switch the third server to
asynchronous-commit after the first two servers are back to synchronizing with each
other.
11. Click Next, select Join only on the Select Data Synchronization page and then click
Next again.
12. Ignore the warning that says "Checking the listener configuration", and then click Next.
13. Click Finish to add the server as a replica to the AG.
14. Click Close to close out the Add Replica to Availability Group wizard.
15. Refresh Object Explorer to see that server1 and server3 are replicas in the AG.
16. Run a full backup and a log backup on server3 and then restore those on server2 using
WITH NORECOVERY.
Server3:
Server2:
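A sketch of option 2 for server2 (names and paths again assumed):

```sql
-- Server3 (current primary):
BACKUP DATABASE TestDB TO DISK = N'C:\SQLBackup\TestDB_full.bak' WITH INIT;
BACKUP LOG TestDB TO DISK = N'C:\SQLBackup\TestDB_log2.trn';

-- Server2:
RESTORE DATABASE TestDB
FROM DISK = N'\\BOU-SQL3\SQLBackup\TestDB_full.bak' WITH NORECOVERY;
RESTORE LOG TestDB
FROM DISK = N'\\BOU-SQL3\SQLBackup\TestDB_log2.trn' WITH NORECOVERY;
```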
17. Add server2 as a replica to the AG by using the same steps you used for server1.
18. Fail over to server1.
19. Set server3 to asynchronous-commit by right clicking on the AG from the new
primary replica, server1, and then selecting Properties.
20. Set server3 to Asynchronous commit for Availability Mode, and switch to Automatic
for Failover Mode for servers 1 and 2.
21. Click OK.
The AG is now back to the original settings from before the first two servers were taken offline.
Testing recap
When you build a production AG, you'll want to step through these same four tests while taking
screenshots and documenting the steps. That way, if disaster strikes, you or any member of your
team can better handle the emergency by reading through your checklist.
Replacing a server with one of a different size
You can replace a server in Compute Engine when using an AG. You might want to do this
because of one of these scenarios:
Your AG already has three servers in it, assuming you followed the steps in the previous
section. Now, replace server1 with server3. Here are the steps:
1. Fail over to server2.
2. Set server3 to synchronous-commit with automatic failovers.
3. Remove server1 from the AG.
That's it!