SQL Server: Internal Storage of Columnstore Indices

Columnstore indices are based on Microsoft’s xVelocity, an advanced in-memory storage and compression technology that Microsoft originally incorporated in PowerPivot and the Analysis Services Tabular model, and later adapted for SQL Server. Because xVelocity provides both storage and compression techniques, I describe these two features in the following subsections.

1. Index Storage

To explain index storage for columnstore indices, we’ll start with a table that includes a PRIMARY KEY constraint: the employee table introduced in Chapter 1 and used throughout the book. Because the emp_no column of the employee table is specified as the primary key, the Database Engine implicitly creates a clustered index based on that column. As you already know, the leaf nodes of a clustered index store data by rows, and each row’s column values are stored together in its leaf node. Additionally, the data is spread across one or more data pages.
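The starting point can be sketched as follows. Only the emp_no column and its PRIMARY KEY constraint are given in the text; the remaining column names and types are illustrative assumptions about the book's sample table:

```sql
-- Sketch of the employee table from Chapter 1.
-- Columns other than emp_no are assumptions for illustration.
CREATE TABLE employee
  (emp_no    INT      NOT NULL PRIMARY KEY,  -- implicitly creates a clustered index
   emp_fname CHAR(20) NOT NULL,
   emp_lname CHAR(20) NOT NULL,
   dept_no   CHAR(4)  NULL);
```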

When the Database Engine processes a table scan on the employee table, it moves all data pages into memory, fetching the entire table even though most of the values of its columns are not needed. In doing so, the system wastes I/O and memory resources to retrieve unnecessary data. Now let’s look at what happens when we create a columnstore index on the table.
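As a minimal sketch (the index name and column list are assumptions, not from the text), a nonclustered columnstore index could be created like this; a query that touches only one or two columns can then read just those columns' data instead of every page of the table:

```sql
-- Create a columnstore index covering selected columns of the table.
CREATE NONCLUSTERED COLUMNSTORE INDEX cs_employee
ON employee (emp_no, emp_lname, dept_no);

-- This aggregation needs only the dept_no column, so the engine can
-- read dept_no's storage alone rather than scanning whole rows.
SELECT dept_no, COUNT(*) AS emp_count
FROM employee
GROUP BY dept_no;
```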

The basic storage technique for columnstore indices consists of dividing the table’s rows into row groups of approximately one million rows each. Within a row group, the values of each column are stored separately, in a unit called a segment; each segment contains values from one column only, which allows each column’s values to be accessed independently. A column therefore spans multiple segments (one per row group), and each segment can be made up of multiple data pages. Data is transferred from disk to memory by segment, not by page. The data within each column’s segment matches row by row, so that the rows can always be assembled correctly, if needed. (This is an important performance issue when the system has to assemble data from a column store back into the corresponding rows.)

All data within a segment is encoded using an internal algorithm. Additionally, for columns whose values benefit from it, a further dictionary encoding step is applied. The Database Engine uses two general forms of dictionaries: a global dictionary associated with an entire column, and local dictionaries associated with individual segments.
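The dictionaries themselves can be inspected through the sys.column_store_dictionaries catalog view. A hedged sketch (the interpretation of dictionary_id = 0 as the global dictionary reflects common documentation of this view, not this text):

```sql
-- List dictionaries for all columnstore indexes in the database.
-- dictionary_id = 0 is generally the global dictionary of a column;
-- other values identify local, per-segment dictionaries.
SELECT column_id, dictionary_id, type, entry_count, on_disk_size
FROM sys.column_store_dictionaries;
```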

NOTE The data in a column store is not sorted, not even within a column segment.

The metadata for columnstore indices is internally stored in directories. The metadata includes information about the allocation status of segments and dictionaries. It contains additional metadata about the number of rows, the row size, and the minimum and maximum values inside each segment. (You can retrieve information concerning directories using the sys.column_store_segments DMV, described later in the chapter in the section “Editing Information Concerning Columnstore Indices.”)
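A minimal sketch of such a metadata query, joining to sys.partitions to restrict the output to one table (the table name is the assumed employee example; column availability may vary slightly across SQL Server versions):

```sql
-- Per-segment metadata: row counts, min/max value ids, and on-disk size.
SELECT s.column_id, s.segment_id, s.row_count,
       s.min_data_id, s.max_data_id, s.on_disk_size
FROM sys.column_store_segments AS s
JOIN sys.partitions AS p
  ON s.partition_id = p.partition_id
WHERE p.object_id = OBJECT_ID('employee');
```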

2. Compression

Before I discuss the compression technique of xVelocity, let me explain why compression is so important for a column store. In the case of a row store, compressing data is generally suboptimal. The reason is that the columns of a table have many different data types and forms: some are numeric, while others are strings or dates. Most compression algorithms are based on similarities within a group of values. When data is stored by rows, the opportunity to exploit similarity among the values of different columns is thus limited. By contrast, data items from a single column have the same data type and are stored contiguously. Usually, there is repetition and similarity among the values within a column. All these factors allow the system to apply compression very effectively to a column store.

As you already know, each set of rows is divided into segments of about one million rows each. Each segment is then compressed, independently and in parallel, using an internally supported compression technique. The result is one compressed column segment for each column included in the columnstore index.

Besides the form of compression just explained, the Database Engine supports an additional form called archival compression. This form of compression can be applied to already compressed data to further reduce the amount of storage space. In data warehouse systems, there is usually a group of data that is queried more often than other groups of data. For instance, data may be partitioned by date, and the most recent data (say, the data from the last two years) may be accessed much more frequently than older data. In such a case, the older data can benefit from additional compression at the cost of slower query performance.
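As a hedged sketch of applying archival compression at index creation time (the table and index names are hypothetical; the COLUMNSTORE_ARCHIVE keyword is the documented value of the DATA_COMPRESSION option):

```sql
-- Build a clustered columnstore index with archival compression,
-- trading query speed for additional space savings.
CREATE CLUSTERED COLUMNSTORE INDEX cci_sales
ON sales
WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);
```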

Archival compression is best suited to archival data that you keep for regulatory or compliance reasons and query only rarely. Because the compression mode can be set per partition, you can partition a table with a columnstore index and then apply archival compression only to the individual partitions that hold the older data.
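A sketch of the per-partition scenario, assuming a date-partitioned table named sales whose partitions 1 through 3 hold older data (the names and partition numbers are illustrative assumptions):

```sql
-- Rebuild the columnstore, keeping recent data on regular columnstore
-- compression while archiving the older partitions.
ALTER TABLE sales
REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE ON PARTITIONS (1 TO 3),
      DATA_COMPRESSION = COLUMNSTORE ON PARTITIONS (4));
```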

NOTE The compression form can be specified using the DATA_COMPRESSION clause of the CREATE CLUSTERED COLUMNSTORE INDEX statement (see Example 27.1 in the next section).

After the system finishes the process of encoding and compression, the segments and the dictionaries are converted to large objects (LOBs) and stored inside the Database Engine. (If a LOB is larger than 8KB, the regular LOB storage mechanisms are used.)

Source: Petkovic, Dusan (2020), Microsoft SQL Server 2019: A Beginner’s Guide, Seventh Edition. McGraw-Hill Education.
