Introduction to Parallel Databases

A parallel database system seeks to improve performance by parallel implementation of various operations such as loading the data, building indexes, and evaluating the queries. The basic idea behind parallel databases is to carry out evaluation steps in parallel to improve the performance.

1. Parallel Database Architecutres

The three main architectures for parallel databases are as follows:

Shared Memory Architecture.
Shared Disk Architecture.
Shared Nothing Architecture.

Note: In the figures, M denotes memory and C denotes CPU.

1.1. Shared Memory Architecture

In shared memory architecture, the CPU’s and disks have access to a common memory through a bus or an interconnection network. This architecture is attractive for achieving moderate parallelism since limited number of CPU’s can be exploited. The memory contention becomes a bottleneck as the number of CPU’s increases. The shared memory architecture is shown in Figure 12.1.

Advantages of Shared Architecture : The major advantages of shared Architectures are:

Communication overheads are low since main memory can be used for this purpose.
Suitable to achieve moderate parallelism.
CPU-to-CPU communication is very efficient since a CPU can send data to other CPU with the speed of memory write.

Disadvantages of Shared Architecture : The major disadvantages of shared Architectures are:

The architecture is not scalable beyond 32 or 64 CPU’s since the bus or interconnection network becomes a bottleneck.
Existing CPU’s get slowed down, as more CPU’s are added due to contention for memory access and network bandwidth.

1.2. Shared Disk Architecture

In shared disk architecture, all CPU’s have a private memory and they can access all disks directly through the interconnection network. The shared disk architecture is shown in Figure 12.2.

Advantages of Shared Disk Architecture : The major advantages of shared disk architecture are:

The memory bus is not a bottleneck, since each CPU’s has its own memory
It offers a degree of fault tolerance e., if a CPU or its memory fails, the other CPU can take over its tasks.
It can scale to somewhat large number of CPU’s.

Disadvantages of Shared Disk Architecture : The major disadvantages of shared disk architecture are:

CPU to CPU Communication is slower, since it has to go through a communication network.
The architecture is not scalable, since the interconnection to disk subsystem is now the bottleneck.
It also faces the problem of interference and disk contention as the number of CPU’s increases.

1.3. Shared Nothing Architecture

In shared nothing architecture, each CPU has local main memory and disk space, but no two CPU’s can access the same storage space. All communication between CPU’s is through the network connection. The shared nothing architecture is shown in Figure 12.3. Here, each CPU has its own copy of operating system, its own copy of the DBMS and own data. All communication between CPU’s is through the high-speed interconnection network.

Advantages of Shared Nothing Architecture : The major advantages of shared Nothing architectures are:

The interconnection networks for shared nothing systems are designed to be scalable so that their transmission capacity increases as more nodes are added.
It has provide linear speed-up i.e., the time taken for operation decreases in proportion to the increase in the number of CPU’s and disks and linear scale up e., the performance is sustained if the number of CPU’s and disks are increased in proportion to the amount of data.

Disadvantages of Shared Nothing Architecture : The major disadvantages of shared Nothing architectures are:

CPU to CPU communication is very slow.
The costs of communication and no-local disk access are higher than shared memory and shared disk.
Shared nothing architectures are difficult to load balance.

2. The Key Elements of Parallel Database Processing

The following are the key elements of parallel database processing:

Speed-up
Scale-up
Synchronisation
Locking
Messaging

Speed-up : It is the property in which the time taken for operations decreases in proportion to the increase in the number of CPU’s and disks in Speed-up means executing a given task in less time by increasing the degree of parallelism of hardware. The speed-up can be measured by using the following formula:

Speed-up = Actual processing time/Processing time in parallel

Scale-up : It is defined as the property in which the performance of the parallel database is sustained if the number of CPU’s and disks are increased in proportion to the amount of Scale-up means ability of the parallel database to handle larger tasks by increasing the degree of parallelism in the same time as the original system. The scale-up can be measured by using the following formula:

Scale-up = Transaction volume processed in parallel/Transaction volume processed originally

Synchronisation : It is defined as the process of coordinating the concurrent The synchronisation is required for correctness of the database. The successful parallel processing divide up the tasks in such a way that very little synchronisation is required. The less the synchronisation required, the better the speed-up and scale-up values.
Locking : It is a technique to synchronise the concurrent tasks. Many different locking mechanisms are used to synchronise the tasks of parallel Two types of Locking mechanisms are available. These are external locking mechanisms and internal locking mechanism to the database. For external locking, a distributed lock manager (DLM) is used.
Messaging : Parallel database processing needs fast and efficient communication between nodes to send

3. Query Parallelism

The major challenge in parallel databases is how to achieve query parallelism? That is, the question is how to design an architecture that allow parallel execution of multiple queries. This goal can be achieved very easily in the share-nothing parallel database architectures.

Some of the query parallelism techniques are as follows :

I/O Parallelism
Inter-query Parallelism
Intra-query Parallelism
Intra-operation Parallelism
Inter-operation Parallelism

3.1. I/O Parallelism

It is the simplest form of parallelism that reduces the time required to access the relations from disk by partitioning the relations on several disks. In this, the data is partitioned and stored on different disks and then each partition is processed in parallel and results are combined later on to produce the final result. There are several partitioning strategies. Some of them are as follows.

Round Robin partitioning
Hash partitioning
Range

Round Robin partitioning : In this technique, the relation is scanned in any order and the ith row of the relation is sent to the disk number Di MOD n. This technique, partitions the relation tuples equally in all the disks e., the tuples are distributed evenly among the various disks.

Advantages : The advantages of round-robin partitioning are:

1. It is well suited for the applications that read the entire relation sequentially for each query.
2. The distribution of tuples among the disks are query.

Disadvantages : The disadvantages of round-robin partitioning are:

Both point queries (where specified value required) and range queries (value lies within range) are very difficult to process, since all the n disks are required to process the query.

Hash partitioning : This technique selects one or more attributes from given relation’s attributes and designate them as partitioning A hash function is defined whose range is {0, 1, …, n – 1}. Now every tuple of the original relation is hashed on the designated partitioning attributes. The tuple t_k is placed on disk D_i, if the hash function returns the value i.

Advantages : The advantages of hash partitioning are:

Best suited for point queries that are based on partitioning attributes
It is useful for sequential scans of the entire relation.
If the hash function is a good randomizing function, then the tuples are evenly distributed among the multiple disks.

Disadvantages : The disadvantages of hash partitioning are:

It is not well suited for point queries that are not based on partitioning attributes
It is also not well suited for range queries since all the disks need to be scanned to give the answer.

Range partitioning : This technique distributes contiguous attribute-value ranges to each Thus the attribute-values within a certain range are to be placed on a certain disks. Here the tuples are sorted before distribution.

Advantages : The advantages of Range partitioning are:

Best suited for point and rarge queries that are based on partitioning attributes.
It gives higher throughput while maintaining good response time since different disks can be used to answer different queries.

Disadvantages : The disadvantages of range partitioning are:

If there are many tuples in the queried range, many tuples have to be retrieved from small number of disks, causing I/O bottleneck at those disks. This situation is called execution skew.

3.2. Inter-query Parallelism

In inter-query parallelism, multiple queries (Transactions) execute in parallel at different CPU’s. This type of parallelism increases the transaction throughput. The response time of individual transactions is almost same as if they were run in isolation. The basic reason of using the inter-query parallelism is to scale up a transaction processing system.

Advantages : The advantages of Inter-query parallelism are:

It is the easiest form of parallelism to support in a database system, particularly in shared memory parallel system.
It increases the transaction throughput.
It scales-up a transaction processing system to support a larger number of transactions per second.

Disadvantages : The disadvantages of Inter-query parallelism are:

The response time of individual transaction remains almost same as if the transactions were run in isolation.
It is more complicated to support in a shared disk or share nothing architecture.
It does not help in speeding up long running queries, since each query runs sequentially

3.3. Intra-query Parallelism

In Intra-query parallelism, a single-query is executed in parallel on multiple CPU’s and disks. It speeds-up the long running queries. The individual query can be parallized by parallelizing the individual operations involved in the query. There are two main ways by which a single query can be parallelzed. These are intraoperation parallelism and Interoperation parallelism. Both are discussed in the following sections.

Advantages : The advantages of Intra-query parallelism are:

It speeds up long running queries.
It is useful in decision support queries.

3.4. Intra-operation Parallelism

In Intra-operation parallelism, the execution of each individual operation such as sort, selection, projection and join of a query is parallelized. The degree of parallelism may be high, since the number of tuples in a relation can be large. This type of parallelism is natural in a database system.

Advantages : The advantages of Intra-operation parallelism are:

It speeds up the query processing by parallely executing the individual operations in the query.
It can scale better with increasing parallelism.

Disadvantages : The disadvantages of Intra-operation parallelism are:

The cost of parallel evaluation of operations is considerable.
A partitioned parallel evaluation is only as fast as the slowest of the parallel executions.
Any skew in the distribution of work across CPU’s greatly affects the performance.

3.5. Inter-operation Parallelism

In Inter-operation parallelism, the different operations in a query expression are executed in parallel to speed-up the processing of the query. There are two types of Interoperation parallelism. These are

Pipelined parallelism
Independent

Pipelined parallelism : In pipelined parallelism, the output tuples of one operation, A, acts as input to a second operation, B, even before the first operation has produced the entire set of tuples in its Thus it is possible to run operations A and B simultaneously on different CPU’s, so that the tuples produced by A is taken by B as input in parallel.

Advantages : The advantages of pipelined parallelism are:

1. It is useful with a small number of CPU’s.
2. Writing intermediate results to disk can be avoided.

Disadvantages : The disadvantages of pipelined parallelism are:

1. It does not scale-up well.
2. When the degree of parallelism is high, it is less important than partitioning.
3. Only marginal speed-up is obtained, when one operator’s cost is much higher than others.
4. It is not possible to pipeline relational operators that do not produce output until all inputs have been accessed g., set difference operation.

Independent parallelism : In independent parallelism, the operations in a query expression that do not depend on one another can be executed in Advantages : The main advantages of Independent parallelism are:

It is useful with a lower degree of parallelism.

Disadvantages : The main disadvantages of Independent parallelism are:

1. It does not provide a high degree of parallelism.
2. The operations in a query expression can be parallelized only if they are independent.

4. Advantages of Parallel Databases

These are many advantages of parallel databases. The important one are as follows:

Parallel databases has increased the throughput e., more number of tasks can be completed in a given time interval using parallel databases.
Parallel databases improve the response time e., the amount of time needed to complete a single task is reduced with the use of parallel databases.
Using several resources (CPU’s and disks) in parallel can significantly improve performance
Increased availability : If a site containing a relation goes down, the relation continues to be available if a copy is maintained at another site.
It is possible to serve large number of users.

5. Disadvantages of Parallel Databases

There are many disadvantage of parallel databases. The important one are as follows:

The start up costs are comparatively high.
Interference problem : Existing CPU’s get slow down, as more CPU’s are added, due to increased contention of memory and network bandwidth.
Data skew : Multiple disks contain partitions with widely varying number of tuples. Skew causes CPU’s dealing with large partitions to become performance bottleneck.

Source: Gupta Satinder Bal, Mittal Aditya (2017), Introduction to Basic Database Management System, 2nd Edition-University Science Press (2017)