Database Architecture for Warehousing in SQL

The structure (schema) of a warehouse database is typically designed to make the information easy to analyze, since that is the major focus of its use. The structure must make it easy to slice and dice the data along various dimensions. For example, one day a business analyst may want to look at sales by product category by region, to compare the performance of different products in different areas of the country. The next day, the same analyst may want to look at sales trends over time by region, to see which regions are growing and which are not. The structure of the database must lend itself to this type of analysis along several different dimensions.

1. Fact Cubes

In most cases, the data stored in a warehouse can be accurately modeled as an N-dimensional cube (N-cube) of historical business facts. A simple three-dimensional cube of sales data is shown in Figure 21-2 to illustrate the structure. The fact in each cell of the cube is a dollar sales amount. Along one edge of the cube, one of the dimensions is the month during which the sales took place. Another dimension is the region where the sales occurred. The third dimension is the type of product that was sold. Each cell in the cube represents the sales for one combination of these three dimensions. The $50,475 amount in the upper-left front cell represents the sales amount for January, for clothing, in the East region.

Figure 21-2 shows a simple three-dimensional cube, but in many warehousing applications, there will be a dozen dimensions or more. Although a twelve­dimensional cube is difficult to visualize, the principles are the same as for the three-dimensional example. Each dimension represents some variable on which the data may be analyzed. Each combination of dimension values has one associated fact value, which is typically the historical business result obtained for that collection of dimension values.

To illustrate the database structures typically used in warehousing applications, we use a warehouse that might be found in a distribution company. The company distributes different types of products, made by various suppliers, to several hundred customers located in various regions of the country, through the efforts of its sales force. The company wants to analyze historical sales data along these dimensions, to discover trends and gain insights that will help it better manage its business. The underlying model for this analysis will be a five-dimensional fact cube with these dimensions:

  • Category. The category of product that was sold, with values such as clothing, linens, accessories, and shoes. The warehouse has about two dozen product categories.
  • Supplier. The supplier who manufactures the particular product sold. The company might distribute products from 50 different suppliers.
  • Customer. The customer who purchased the products. The company has several hundred customers. Some of the larger customers purchase products centrally and are serviced by a single salesperson; others purchase on a local basis and are served by local salespeople.
  • Region. The region of the country where the products were sold. Some of the company’s customers operate in only one region of the country; others operate in two or more regions.
  • Month. The month when the products were sold. For comparison purposes, the company has decided to maintain 36 months (three years) of historical sales data in the warehouse.

With these characteristics, each of the five dimensions is relatively independent of the others. Sales to a particular customer may be concentrated in a single region or in multiple regions. A specific category of product may be supplied by one or many different suppliers. The fact in each cell of the five-dimensional cube is the sales amount for that particular combination of dimension values. With the attributes just described, the fact cube contains over 35 million cells (24 categories x 50 suppliers x 300 customers x 3 regions x 36 months).

2. Star Schemas

In most data warehouses, the most effective way to model the N-dimensional fact cube is with a star schema. A star schema for the distributor warehouse in the previous example is shown in Figure 21-3. Each dimension of the cube is represented by a dimension table. There are five of them in the figure, named CATEGORIES, SUPPLIERS, CUSTOMERS, REGIONS, and MONTHS. There is one row in each dimension table for each possible value of that dimension. The MONTHS table has 36 rows, one for each month of sales history being stored. Three regions produce a three-row REGIONS table.

Dimension tables in a star schema often contain columns with descriptive text information or other attributes associated with that dimension (such as the name of the buyer for a customer, or the customer’s address and phone number, or the purchasing terms for a supplier). These columns may be displayed in reports generated from the database. A dimension table always has a primary key that contains the value of the dimension. If the values of a dimension are numbers (such as a clothing size) or short text strings (such as a city name), the primary key may be this dimension value itself. It’s more common for dimension values to be expressed in some type of code value. Three-letter airport codes and customer numbers are typical examples. In the sample warehouse of Figure 21-3, we assume that actual values are used as primary keys for REGIONS (East, West, and so on), CATEGORIES (clothing, shoes, and so on), and MONTHS. The other two dimensions use coded values (CUST_CODE for CUSTOMERS, SUPP_CODE for SUPPLIERS).

The largest table in the database is the fact table in the center of the schema. This table is named SALES in Figure 21-3. The fact table contains a column with the data values that appear in the cells of the N-cube in Figure 21-2. In addition, the fact table contains a column (or columns) that forms a foreign key for each of the dimension tables. In this example, there are five such foreign key columns. With this structure, each row represents the data for one cell of the N-cube. The foreign keys link the row to the corresponding dimension table rows for its position in the cube.

The fact table typically contains only a few columns, but many rows—hundreds of thousands or even millions of rows are not unusual in a production data warehouse. The fact column almost always contains numeric values, such as currency amounts, units shipped or received, or pounds processed. Virtually all reports from the warehouse involve summary data—totals, averages, high or low values, percentages—based on arithmetic computations on this numeric value.

The schema structure of Figure 21-3 is called a star schema for obvious reasons. The fact table is at the center of a star of data relationships. The dimension tables form the points of the star. The relationships created by the foreign keys in the fact table connect the center to the points. With the star-schema structure, most business analysis questions turn into queries that join the central fact table with one or more dimension tables. Here are some examples:

Show the total sales for clothing in January, by region.

SELECT SALES_AMOUNT, REGION

  FROM SALES, REGIONS

 WHERE MONTH = 01/1999

   AND PROD_TYPE = “Clothing”

   AND SALES.REGION = REGIONS.REGION

 ORDER BY REGION

Show the average sales for each CUSTOMER, by SUPPLIER, for each month.

SELECT AVG(SALES_AMOUNT), CUST_NAME, SUPPLIER_NAME, MONTH

FROM SALES, CUSTOMERS, SUPPLIERS

WHERE SALES.CUST_CODE = CUSTOMERS.CUST_CODE

AND SALES.SUPP_CODE = SUPPLIERS.SUPP_CODE

GROUP BY CUST_NAME, SUPP_NAME

ORDER BY CUST_NAME, SUPP_NAME, MONTH

3. Multilevel Dimensions

In the star-schema structure of Figure 21-3, each of the dimensions has only one level. In practice, multilevel dimensions are quite common. For example:

  • Sales data may in fact be accumulated for each sales office. Each office is a part of a sales district, and each district is a part of a sales region.
  • Sales data is accumulated by month, but it may also be useful to look at quarterly sales results. Each month is a part of a particular quarter.
  • Sales data may be accumulated for individual products ordered, and the products are associated with a particular supplier.

Multilevel dimensions such as these complicate the basic star schema, and in practice, there are several ways to deal with them:

  • Additional data in the dimension tables. The geographic dimension table might contain information about individual offices, but also include columns indicating the district and region to which the office belongs. Aggregate data for these higher levels of the geographic dimension can then be obtained by summary queries that join the fact table to the dimension table and selected based on the district or region columns. This approach is conceptually simple, but it means that all aggregate (summary) data must be calculated query by query. This likely produces unacceptably poor performance.
  • Multiple levels within the dimension tables. The geographic dimension table might be extended to include rows for offices, districts, and regions. Rows containing summary (total) data for these higher-level dimensions are added to the fact table when it is updated. This solves the runtime query performance problem by precalculating aggregate (summary) data. However, it complicates the queries considerably. Because every sale is now included in three separate fact table rows (one each for office, district, and region), any totals must be computed very carefully. Specifically, the fact table must usually contain a level column to indicate the level of data summarization provided by that row, and every query that computes totals or other statistics must include a search condition that restricts it to rows at only a specific level.
  • Precomputed summaries in the dimension tables. Instead of complicating the fact table, summary data may be precomputed and stored in the dimension tables (for example, summary sales for a district stored in the district’s row of the geographic dimension table). This solves the duplicate facts problem of the previous solution, but it works only for very simple precomputed amounts. The precalculated totals don’t help with queries about products by district or district results by month, for example, without further complicating the dimension tables.
  • Multiple fact tables at different levels. Instead of complicating the fact table, this approach creates multiple fact tables for different levels of summary data. To support cross-dimension queries (for example, district -results -by month), specialized fact tables that summarize data on this basis are needed. The resulting pattern of dimension tables and fact tables tends to have many interrelationships,
    creating a pattern resembling a snowflake; hence, this type of schema is often referred to as a snowflake schema. This approach solves the runtime performance problem and eliminates the possibility of erroneous data from a single fact table, but it can add significant complexity to the warehouse database design, making it harder to understand.

In practice, finding the right schema and architecture for a particular warehouse is a complicated decision, driven by the specifics of the facts and dimensions, the types of queries frequently performed, and other considerations. Many companies use specialized consultants to help them design data warehouses and deal with exactly these issues.

4. SQL Extensions for Data Warehousing

With a star-schema structure, a relational database conceptually provides a good foundation for managing data for business analysis. The capability to freely relate information within the database based solely on data values is a good match for the ad hoc, unstructured queries that typify business intelligence applications. But there are some serious mismatches between typical business intelligence queries and the capabilities of the core SQL language. For example:

  • Data ordering. Many business intelligence queries deal explicitly or implicitly with data ordering—they pose questions like “What is the top 10 percent?” “What are the top 10?” or “Which are the worst performing?” As a set-oriented language, SQL manipulates unordered sets of rows. The only support for sorting and ordering data within standard SQL is the ORDER BY clause in the SELECT statement, which is applied only at the end of all other set-oriented processing.
  • Time series. Many business intelligence queries compare values based on time—contrasting this year’s results to last year’s, or this month’s results to the same month last year, or computing year-over-year growth rates, for example.

It is very hard, and sometimes impossible, to get side-by-side comparisons of data from different time periods within a single row of standard SQL query results, depending on the structure of the underlying database.

  • Comparison to aggregate values. Many business intelligence queries compare values for individual entities (for example, office sales results) to an overall total, or to subtotals (such as regional results). These comparisons are difficult to express in standard SQL. A report format showing line-item detail, subtotals, and totals is impossible to generate directly from SQL, since all rows of query results must have the same column structure.

To deal with these issues, DBMS products on data warehousing have tended to extend the core SQL language. For example, the DBMS from Red Brick, one of the data warehousing pioneers and now a part of Informix’s product line (which has, in turn,
been acquired by IBM), features these extensions as part of its Red Brick Intelligent SQL (RISQL) language:

  • Ranking. Supports queries that ask for the top 10 and similar requests.
  • Moving totals and averages. Supports queries that smooth raw data for time series analysis.
  • Running totals and averages. Allows query responses that show results for individual months plus year-to-date totals, and similar requests.
  • Ratios. Allows queries that very simply express the ratio of individual values to a total or subtotal without the use of complex subqueries.
  • Decoding. Simplifies the translation of dimension-value codes (like the supplier-id in the example warehouse) into understandable names.
  • Subtotals. Allow production of query results that combine detailed and summary data values, at various levels of summarization.

Other warehousing vendors provide similar extensions in their SQL implementations or provide the same capabilities built into their data analysis products. As with extensions in other areas of the SQL language, although the conceptual capabilities provided by several different DBMS brands may be similar, the specifics of the implementation differ substantially.

Source: Liang Y. Daniel (2013), Introduction to programming with SQL, Pearson; 3rd edition.

Leave a Reply

Your email address will not be published. Required fields are marked *