SQL Server: System Availability

Ensuring the availability of your instance and databases is one of the most important issues today. There are several techniques that you can use to ensure their availability, which can be divided in two groups: those that are components of the Database Engine and those that are not implemented in the database server. The following two techniques are not part of the Database Engine:

Using a standby server
Using RAID technology

The following proprietary techniques belong to the Database Engine:

Database mirroring
Failover clustering
Log shipping
AlwaysOn
Replication

The following sections describe these techniques, other than replication, which is discussed in Chapter 18.

1. Using a Standby Server

A standby server is just what its name implies—another server that is standing by in case something happens to the production server (also called the primary server). The standby server contains files, databases (system and user-defined), and user accounts identical to those on the production server.

A standby server is implemented by initially restoring a full database backup of the database and applying transaction log backups to keep the database on the standby server synchronized with the production server. To set up a standby server, set the read only database option to TRUE. This option prevents users from performing any write operations in the database.

The general steps to use a copy of a production database are as follows:

Restore the production database using the RESTORE DATABASE statement with the STANDBY clause.
Apply each transaction log to the standby server using the RESTORE LOG statement with the STANDBY clause.
When applying the final transaction log backup, use the RESTORE LOG statement with the RECOVERY clause. (This final statement recovers the database without creating a file with before images, making the database available for write operations, too.)

After the database and transaction logs are restored, users can work with an exact copy of the production database. Only the noncommitted transactions at the time of failure will be permanently lost.

NOTE If the production server fails, user processes are not automatically brought to the standby server. Additionally, all user processes need to restart any tasks with the uncommitted transactions due to the failure of the production server.

2. Using RAID Technology

RAID (redundant array of inexpensive disks) is a special disk configuration in which multiple disk drives build a single logical unit. This process allows files to span multiple disk devices. RAID technology provides improved reliability at the cost of performance decrease. Generally, there are six RAID levels, 0 through 5. Only three of these levels, levels 0, 1, and 5, are significant for database systems.

RAID can be hardware or software based. Hardware-based RAID is more costly (because you have to buy additional disk controllers), but it usually performs better. Software-based RAID can be supported usually by the operating system. Windows operating systems provide RAID levels 0, 1, and 5. RAID technology has impacts on the following features:

Fault tolerance
Performance

The benefits and disadvantages of each RAID level in relation to these two features are explained next.

RAID provides protection from hard disk failure and accompanying data loss with three methods: disk striping, mirroring, and parity. These three methods correspond to RAID levels 0, 1, and 5, respectively.

2.1. RAID 0 (Disk Striping)

RAID 0 specifies disk striping without parity. Using RAID 0, the data is written across several disk drives in order to allow data access more readily, and all read and write operations can be speeded up. For this reason, RAID 0 is the fastest RAID configuration. The disadvantage of disk striping is that it does not offer fault tolerance at all. This means that if one disk fails, all the data on that array become inaccessible.

2.2. RAID 1 (Mirroring)

RAID 1 uses the space on a disk drive to maintain a duplicate copy of all files. Therefore, RAID 1, which specifies disk mirroring, protects data against media failure by maintaining a copy of the database (or a part of it) on another disk. If there is a drive loss with RAID 1 in place, the files for the lost drive can be rebuilt by replacing the failed drive and rebuilding the damaged files. The hardware configurations of RAID 1 are more expensive, but they provide additional speed. (Also, hardware configurations of RAID 1 implement some caching options that provide better throughput.) The advantage of the Windows solution for RAID 1 is that it can be configured to mirror disk partitions, while the hardware solutions are usually implemented on the entire disk.

In contrast to RAID 0, RAID 1 is much slower, but the reliability is higher. Also, RAID 1 costs much more than RAID 0 because each mirrored disk drive must be doubled. It can sustain at least one failed drive and may be able to survive failure of up to half the drives in the set of mirrored disks without forcing the system administrator to shut down the server and recover from file backup. (RAID 1 is the best-performing RAID option when fault tolerance is required.)

RAID 1 also has performance impacts in relation to read and write operations. When RAID 1 is used, write operations decrease performance, because each such operation costs two disk I/O operations, one to the original and one to the mirrored disk drive. On the other hand, RAID 1 increases performance of read operations, because the system will be able to read from either disk drive, depending on which one is least busy at the time.

2.3. RAID 5 (Parity)

Parity is implemented by calculating recovery information about data written to disk and writing that parity information on the other drives that form the RAID array. If a drive fails, a new drive is inserted into the RAID array and the data on that failed drive is recovered by taking the recovery information (parity) written on the other drives and using this information to regenerate the data from the failed drive.

The advantage of parity is that you need one additional disk drive to protect any number of existing disk drives. The disadvantages of parity concern performance and fault tolerance. Due to the additional costs associated with calculating and writing parity, additional disk I/O operations are required. (Read I/O operation costs are the same for RAID 1 and parity.) Also, using parity, you can sustain only one failed drive before the array must be taken offline and recovery from backup media must be performed. Generally, RAID 5 requires four disk I/O operations, whereas RAID 0 requires only one operation and RAID 1 two operations.

3. Database Mirroring

As you already know, mirroring can be supported through hardware or software. The advantage of the software support for mirroring is that it can be configured to mirror disk partitions, while the hardware solutions are usually implemented on the entire disk. This section discusses the Windows solution for database mirroring and how you can set it up.

To set up database mirroring, use two servers with a database that will be mirrored from one server to the other. The former is called the principal server, while the latter is called the mirrored server. (The copy of the database on the mirrored server is called the mirrored database.)

Database mirroring allows continuous streaming of the transaction log from the principal server to the mirrored server. The copy of the transaction log activity is written to the log of the mirrored database, and the transactions are executed on it. If the principal server becomes unavailable, applications can reconnect to the database on the mirrored server without waiting for recovery to finish. Unlike failover clustering, the mirrored server is fully cached and ready to accept workloads because of its synchronized state. It is possible to implement up to four mirrored backup sets. (To implement mirroring, use the MIRROR TO option of either the BACKUP DATABASE statement or the BACKUP LOG statement.)

There is also a third server, called the witness server. It determines which server is the principal server and which is the mirrored server. This server is only needed when automatic failover is required. (To enable automatic failover, you must turn on the synchronous operating mode—that is, set the SAFETY option of the ALTER DATABASE statement to FULL.)

Another performance issue in relation to database mirroring is the possibility to automatically compress the data sent to the mirror. The Database Engine compresses the stream data if at least a 12.5 percent compression ratio can be achieved. That way, the system reduces the consumption of log data that is sent from the principal server to mirrored server(s).

4. Failover Clustering

Failover clustering is a process in which the operating system and database system work together to provide availability in the event of failures. A failover cluster consists of a group of redundant servers, called nodes, that share an external disk system. When a node within the cluster fails, the instance of the Database Engine on that machine shuts down. Microsoft Cluster Service transfers resources from a failing machine to an equally configured target node automatically. The transfer of resources from one node to the other node in a cluster occurs very quickly.

The advantage of failover clustering is that it protects your system against hardware failures, because it provides a mechanism to automatically restart the database system on another node of the cluster. On the other hand, this technology has a single point of failure in the set of disks, which cluster nodes share and cannot protect from data errors. Another disadvantage of this technology is that it does not increase performance or scalability. In other words, an application cannot scale any further on a cluster than it can on one node.

In summary, failover clustering provides server redundancy, but it doesn’t provide data file redundancy. (See also the section “Comparison of High-Availability Components” later in this chapter.)

5. Log Shipping

Log shipping allows the transaction logs from one database to be constantly sent and used by another database. This allows you to have a warm standby server and also provides a way to offload data from the source machine to read-only destination computers. The target database is an exact copy of the primary database, because the former receives all changes from the latter. You have the ability to make the target database a new primary database if the primary server, which hosts the original database, becomes unavailable. When the primary server becomes available again, you can reverse the server roles again.

Log shipping does not support automatic failover. Therefore, if the source database server fails, you must recover the target database yourself, either manually or through custom code.

In summary, log shipping is similar to database mirroring in that it provides database redundancy. (See also the upcoming section “Comparison of High-Availability Components”)

6. AlwaysOn

Database mirroring as a technique to achieve high availability has several drawbacks:

Read-only queries cannot be executed on the mirror.
Database mirroring can be applied only on two instances of the Database Engine.
Database mirroring mirrors only objects inside the database; objects such as logins cannot be protected using mirroring.

To overcome these drawbacks of database mirroring, the Database Engine supports a technique called AlwaysOn, which allows you to maximize availability for your databases.

AlwaysOn is based upon three concepts: availability groups, replicas, and modes, which will be discussed next.

NOTE AlwaysOn is the most sophisticated and hence the most complex technique to support high availability of Database Engine databases. For this reason, this chapter gives you just an introductory description of AlwaysOn. For all other topics concerning AlwaysOn, such as configuration of server instances and creation and configuration of availability groups, see Microsoft Docs.

6.1. Availability Groups, Replicas, and Modes

An availability group comprises a set of failover servers called availability replicas. Each availability replica has a local copy of each of the databases in the availability group. One of these replicas, called the primary replica, maintains the primary copy of each database.

The primary replica makes these databases, called primary databases, available to users for read-write access. For each primary database, up to eight other availability replicas, known as secondary replicas, maintain a failover copy of the database, known as a secondary database.

An availability replica provides redundancy only at the database level, for the set of databases in one availability group. The primary replica makes the primary databases available for read-write connections from clients. Also, in a process known as data synchronization, which occurs at the database level, the primary replica sends transaction log records of each primary database to every secondary database.

Every secondary replica stores the transaction log records on the disk and, after that, applies them to its corresponding secondary database(s). Data synchronization occurs between the primary database and each connected secondary database, independently of the other databases. Optionally, you can configure one or more secondary replicas to support read-only access to secondary databases, and you can configure any secondary replica to permit backups on secondary databases.

Availability replicas can be hosted only by instances that reside on Windows Server Failover Clustering (WSFC) nodes. Instances can be either failover cluster instances or stand-alone instances. The instances that host availability replicas for a given availability group must reside on separate WSFC nodes. The server instance on which the primary replica is located is known as the primary location. An instance on which a secondary replica is located is known as a secondary location.

The availability mode is a property that is set independently for each availability replica. The availability mode of a secondary replica determines whether the primary replica waits to commit transactions on a database until the secondary replica has written the records in the corresponding transaction logs to disk. Therefore, AlwaysOn supports two availability modes:

Asynchronous-commit mode
Synchronous-commit mode

Asynchronous-commit mode means that the primary replica does not wait for any of the secondary replicas to store the log to disk. In other words, after writing the log record to the local log file, the primary replica sends the transaction confirmation to the client. The primary replica runs with minimum transaction latency in relation to a secondary replica that is configured for asynchronous-commit mode.

Synchronous-commit mode emphasizes high availability over performance, at the cost of increased transaction latency. Under synchronous-commit mode, transactions concerning the primary database wait to send the transaction confirmation to the client until the secondary replica has stored the log to disk. When data synchronization begins on a secondary database, the secondary replica begins applying incoming log records from the corresponding primary database. As soon as every log record has been permanently stored, the secondary database enters the SYNCHRONIZED state.

Besides these three concepts, there is also a notion of a primary role and a secondary role. The primary and secondary roles of availability replicas are interchangeable in a process called failover. Three forms of failover exist: automatic failover (without data loss), planned manual failover (without data loss), and forced manual failover (with possible data loss), typically called forced failover.

NOTE Starting with SQL Server 2016, AlwaysOn has been included with the Standard Edition of SQL Server. (Earlier, only Enterprise Edition supported this technique.) AlwaysOn in Standard Edition will have several restrictions. The most important one is that only two nodes (one primary and one secondary) can be used.

7. Comparison of High-Availability Components

The Database Engine supports several proprietary components that are used to enhance the availability of your instance of the Database Engine and databases:

Failover clustering
Log shipping
Mirroring
AlwaysOn

In relation to these techniques, there are three important issues:

Server redundancy
Database redundancy
Data file redundancy

Server redundancy means that an application runs on two or more servers in such a way as to provide fault tolerance. (Clustering is one of the most important server redundancy technologies.) Database redundancy means that a fault tolerance is guaranteed for a database with all its applications. (Data file redundancy is defined similarly.)

Failover clustering provides server redundancy, but doesn’t provide database and data file redundancy. Log shipping provides database redundancy, but doesn’t provide server redundancy. The disadvantage of log shipping is that it doesn’t provide automatic failover.

Database mirroring doesn’t provide server redundancy, but provides database redundancy and data file redundancy. (This component belongs to deprecated features and won’t be supported in one of the future versions of the Database Engine.)

AlwaysOn is similar to database mirroring, but additionally supports clustering. AlwaysOn provides server redundancy as well as database and data file redundancy. The primary goal of AlwaysOn is to support database availability while also giving you the benefits of disaster recovery.

Source: Petkovic Dusan (2020), Microsoft SQL Server 2019: A Beginner’s Guide, Seventh Edition-McGraw-Hill Education.