Warehouse Performance in SQL

The performance of a data warehouse is one of the keys to its usefulness. If business analysis queries take too long, people tend not to use the warehouse on an ad hoc basis for decision making. If it takes too long to load data into the warehouse, the corporate IS organization will probably resist frequent updates, and stale data may make the warehouse less useful. Achieving a good balance between load performance and runtime performance is one of the keys to successful warehouse deployment.

1. Load Performance

The process of loading a warehouse can be very time-consuming. It’s common for warehouse data loads to take hours or even days for very large warehouses. Load processing typically involves these operations:

  • Data extraction. The data to be loaded into the warehouse database typically comes from several different operational data sources. Some may be relational databases that support OLTP applications.
  • Data cleansing. Operational data tends to be “dirty” in the sense that it contains significant errors. For example, older transaction-processing systems may not have strong integrity checks, permitting the entry of incorrect
    customer numbers or product numbers. The warehouse-loading process typically includes data integrity and data sanity checks.
  • Data cross-checking. In many companies, the data processing systems that support various business operations have been developed at different times and are not integrated. Changes that are processed by one system (for example, adding new product numbers to an order-processing application) may not automatically be reflected in other systems (for example, the inventory control system), or there may be delays in propagating changes. When data from these nonintegrated systems arrives at the warehouse, it must be checked for internal consistency.
  • Data reformatting. Data formats in the operational data stores may differ considerably from the warehouse database. Character data may need transformation from a mainframe’s EBCDIC encoding to ASCII. Zoned decimal or packed decimal data may need reformatting. Date and time formats are another source of differences. Beyond these simple data format differences, data from one OLTP data source row may have to be broken apart into multiple warehouse tables, while data from multiple OLTP tables or files may have to be combined to create a warehouse table.
  • Data insertion/update. After the preprocessing, actual bulk loading of data into a warehouse database tends to be a specialized operation. High-volume data loaders typically operate in a batch-oriented mode, without transaction logic and with specialized recovery. Row loading or update rates of hundreds of megabytes per hour may be required.
  • Index creation/update. The specialized indexes used by the warehouse must be modified to reflect the revised warehouse contents. As with the actual data insertion and update, specialized handling is typically applied. In some cases, it is more efficient to rapidly re-create an entire index than to modify it incrementally as data rows are inserted or updated. Other index structures permit more incremental updates.

These tasks are typically performed by specialized warehouse-building programs on a batch-processing basis. Ad hoc query access to the warehouse is turned off during the update/refresh processing, allowing it to proceed at maximum speed without competition for DBMS cycles. Despite these optimizations, warehouse load times tend to grow as the amount of accumulated data grows, so the load-time versus runtime performance trade-off must be made on an ongoing basis. Warehouses with many indexes or precomputed summary values may offer much better runtime performance but at the expense of unacceptably long load times. Simpler structures with less loading work may increase the time required for ad hoc queries beyond an acceptable level. In practice, the warehouse administrator must find a good balance between loading and runtime query performance.

2. Query Performance

Database vendors focused on warehousing have invested considerable energy in optimizing their DBMS products to maximize query performance. As a result, warehousing performance has improved dramatically over the last several years. The growth in the size and complexity of warehouses has prevented some of this performance gain from actually being translated into perceived end-user benefit.

Several different techniques have evolved to maximize the performance of business analysis queries in a warehouse, including:

  • Specialized indexing schemes. Typical business analysis queries involve a subset of the data in the warehouse, selected on the basis of dimension values. For example, a comparison of this month’s and last month’s results involves only two of the 36 months of data in the example warehouse. Specialized indexing schemes have been developed to allow very rapid selection of the appropriate rows from the fact table and joining to the dimension tables. Several of these involve bitmap techniques, where the individual possible values for a dimension (or a combination of dimensions) are each assigned a single bit in an index value. Rows meeting a selection criteria can be very rapidly identified by bitwise logical operations, which a computer system can perform more rapidly than value comparisons.
  • Parallel processing techniques. Business analysis queries can often be broken up into parts that can be carried out in parallel, to reduce the overall time required to produce the final results. In a query joining four warehouse tables, for example, the DBMS might take advantage of a two-processor system by joining two of the tables in one process and two others in another. The results of these intermediate joins are then combined. Alternatively, the workload of processing a single table in the query might be split and carried out in parallel— for example, assigning rows for specific month ranges to specific processes. The use of multiprocessor systems in these cases is quite different than for OLTP databases. For OLTP, the focus of multiprocessor operations is to increase overall throughput. For warehousing, the focus is usually the improvement in overall execution time in response to a single complex query.
  • Specialized optimizations. When faced with a complex database query involving selection criteria and joins, the DBMS has many different sequences in which it can carry out the query. The optimizer for an OLTP database tends to benefit from the assumption that foreign key/primary key relationships should be exercised early in its processing, since they tend to cut down dramatically on the number of rows of intermediate results. The optimizer for a warehousing database may make a quite different decision, based on information accumulated during the load process about the distribution of data values within the database.

As with load-time performance, maximizing the runtime performance of a warehouse is an ongoing task for the database administrator. Newer revisions of DBMS software often provide performance benefits, as do higher-performance processors or more processors.

Source: Liang Y. Daniel (2013), Introduction to programming with SQL, Pearson; 3rd edition.

Leave a Reply

Your email address will not be published. Required fields are marked *