Introduction to Parallel Databases

A parallel database system seeks to improve performance by parallel implementation of various operations such as loading the data, building indexes, and evaluating the queries. The basic idea behind parallel databases is to carry out evaluation steps in parallel to improve the performance.

1. Parallel Database Architectures

The three main architectures for parallel databases are as follows:

  1. Shared Memory Architecture.
  2. Shared Disk Architecture.
  3. Shared Nothing Architecture.

Note: In the figures, M denotes memory and C denotes CPU.

1.1. Shared Memory Architecture

In shared memory architecture, the CPUs and disks have access to a common memory through a bus or an interconnection network. This architecture is attractive for achieving moderate parallelism, since only a limited number of CPUs can be exploited. Memory contention becomes a bottleneck as the number of CPUs increases. The shared memory architecture is shown in Figure 12.1.

Advantages of Shared Memory Architecture : The major advantages of shared memory architecture are:

  1. Communication overheads are low since main memory can be used for this purpose.
  2. Suitable to achieve moderate parallelism.
  3. CPU-to-CPU communication is very efficient, since a CPU can send data to another CPU at the speed of a memory write.

Disadvantages of Shared Memory Architecture : The major disadvantages of shared memory architecture are:

  1. The architecture is not scalable beyond 32 or 64 CPUs, since the bus or interconnection network becomes a bottleneck.
  2. Existing CPUs slow down as more CPUs are added, due to contention for memory access and network bandwidth.

1.2. Shared Disk Architecture

In shared disk architecture, each CPU has its own private memory and can access all disks directly through the interconnection network. The shared disk architecture is shown in Figure 12.2.

Advantages of Shared Disk Architecture : The major advantages of shared disk architecture are:

  1. The memory bus is not a bottleneck, since each CPU has its own memory.
  2. It offers a degree of fault tolerance, i.e., if a CPU or its memory fails, another CPU can take over its tasks.
  3. It can scale to a somewhat large number of CPUs.

Disadvantages of Shared Disk Architecture : The major disadvantages of shared disk architecture are:

  1. CPU-to-CPU communication is slower, since it has to go through a communication network.
  2. Scalability is limited, since the interconnection to the disk subsystem becomes the bottleneck.
  3. It also faces interference and disk contention as the number of CPUs increases.

1.3. Shared Nothing Architecture

In shared nothing architecture, each CPU has its own local main memory and disk space, and no two CPUs can access the same storage space. Each CPU also has its own copy of the operating system, its own copy of the DBMS and its own data. All communication between CPUs is through the high-speed interconnection network. The shared nothing architecture is shown in Figure 12.3.

Advantages of Shared Nothing Architecture : The major advantages of shared nothing architecture are:

  1. The interconnection networks for shared nothing systems are designed to be scalable, so that their transmission capacity increases as more nodes are added.
  2. It provides linear speed-up, i.e., the time taken for an operation decreases in proportion to the increase in the number of CPUs and disks, and linear scale-up, i.e., performance is sustained if the number of CPUs and disks is increased in proportion to the amount of data.

Disadvantages of Shared Nothing Architecture : The major disadvantages of shared nothing architecture are:

  1. CPU-to-CPU communication is very slow.
  2. The costs of communication and non-local disk access are higher than in shared memory and shared disk architectures.
  3. Shared nothing architectures are difficult to load balance.

2. The Key Elements of Parallel Database Processing

The following are the key elements of parallel database processing:

  • Speed-up
  • Scale-up
  • Synchronisation
  • Locking
  • Messaging

  • Speed-up : It is the property in which the time taken for operations decreases in proportion to the increase in the number of CPUs and disks. Speed-up means executing a given task in less time by increasing the degree of parallelism of the hardware. The speed-up can be measured by using the following formula:

Speed-up = Actual processing time/Processing time in parallel

  • Scale-up : It is defined as the property in which the performance of the parallel database is sustained if the number of CPUs and disks is increased in proportion to the amount of data. Scale-up means the ability of the parallel database to handle larger tasks in the same time as the original system by increasing the degree of parallelism. The scale-up can be measured by using the following formula:

Scale-up = Transaction volume processed in parallel/Transaction volume processed originally

  • Synchronisation : It is defined as the process of coordinating the concurrent tasks. Synchronisation is required for correctness of the database. Successful parallel processing divides up the tasks in such a way that very little synchronisation is required. The less synchronisation required, the better the speed-up and scale-up values.
  • Locking : It is a technique to synchronise the concurrent tasks. Many different locking mechanisms are used to synchronise the tasks of parallel processing. Two types of locking mechanisms are available: locking mechanisms external to the database and locking mechanisms internal to the database. For external locking, a distributed lock manager (DLM) is used.
  • Messaging : Parallel database processing needs fast and efficient communication between nodes.
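The speed-up and scale-up formulas above can be tried out with a few lines of Python. This is only a minimal sketch: the function names and the sample measurements are hypothetical, and "actual processing time" is read here as the time taken on the original, non-parallel system, which is an assumption.

```python
def speed_up(actual_time: float, parallel_time: float) -> float:
    """Speed-up = actual processing time / processing time in parallel."""
    return actual_time / parallel_time


def scale_up(parallel_volume: float, original_volume: float) -> float:
    """Scale-up = volume processed in parallel / volume processed originally."""
    return parallel_volume / original_volume


# Hypothetical measurements, chosen only for illustration.
print(speed_up(actual_time=120.0, parallel_time=30.0))          # 4.0
print(scale_up(parallel_volume=4000.0, original_volume=1000.0))  # 4.0
```

If these figures came from a system that grew from one CPU to four, a value of 4.0 would correspond to the linear speed-up and linear scale-up described earlier.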

3. Query Parallelism

The major challenge in parallel databases is how to achieve query parallelism, that is, how to design an architecture that allows parallel execution of multiple queries. This goal can be achieved very easily in shared nothing parallel database architectures.

Some of the query parallelism techniques are as follows :

  1. I/O Parallelism
  2. Inter-query Parallelism
  3. Intra-query Parallelism
  4. Intra-operation Parallelism
  5. Inter-operation Parallelism

3.1.  I/O Parallelism

I/O parallelism is the simplest form of parallelism. It reduces the time required to access relations from disk by partitioning the relations across several disks. The data is partitioned and stored on different disks, each partition is processed in parallel, and the results are combined later to produce the final result. There are several partitioning strategies. Some of them are as follows:

  • Round Robin partitioning
  • Hash partitioning
  • Range partitioning

  • Round Robin partitioning : In this technique, the relation is scanned in any order and the ith row of the relation is sent to disk Dj, where j = i MOD n and n is the number of disks. This technique distributes the tuples of the relation evenly among the disks.

Advantages : The advantages of round-robin partitioning are:

    1. It is well suited for the applications that read the entire relation sequentially for each query.
    2. The tuples are distributed evenly among the disks.

Disadvantages : The disadvantages of round-robin partitioning are:

Both point queries (where a specified value is required) and range queries (where the value lies within a range) are costly to process, since all n disks must be searched to answer the query.
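Round-robin partitioning as described above can be sketched as follows. The function name and the sample data are hypothetical, and Python lists stand in for disks.

```python
def round_robin_partition(rows, n_disks):
    """Send the i-th row to disk number i MOD n_disks."""
    disks = [[] for _ in range(n_disks)]
    for i, row in enumerate(rows):
        disks[i % n_disks].append(row)
    return disks


rows = list(range(10))  # ten hypothetical tuples
disks = round_robin_partition(rows, 3)
print(disks)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

The partition sizes differ by at most one tuple, but nothing about a tuple's value says which disk holds it, which is why a point or range query has to look at every disk.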

  • Hash partitioning : This technique selects one or more attributes from the given relation's attributes and designates them as partitioning attributes. A hash function is defined whose range is {0, 1, …, n – 1}. Every tuple of the original relation is then hashed on the designated partitioning attributes. A tuple tk is placed on disk Di if the hash function returns the value i.

Advantages : The  advantages of  hash partitioning  are:

  1. Best suited for point queries that are based on the partitioning attributes.
  2. It is useful for sequential scans of the entire relation.
  3. If the hash function is a good randomizing function, then the tuples are evenly distributed among the multiple disks.

Disadvantages : The  disadvantages of  hash partitioning  are:

  1. It is not well suited for point queries that are not based on the partitioning attributes.
  2. It is also not well suited for range queries, since all the disks need to be scanned to give the answer.
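A minimal sketch of hash partitioning, and of why a point query on the partitioning attribute needs only one disk, is given below. The helper names and the employee rows are hypothetical, and the CRC-32 checksum from Python's zlib module is used merely as one convenient deterministic hash function; it is an assumption, not something the text prescribes.

```python
import zlib


def disk_for(key, n_disks):
    """Hash a partitioning-attribute value into the range 0 .. n_disks - 1."""
    return zlib.crc32(repr(key).encode("utf-8")) % n_disks


def hash_partition(rows, key_attr, n_disks):
    """Place each row on the disk chosen by hashing its partitioning attribute."""
    disks = [[] for _ in range(n_disks)]
    for row in rows:
        disks[disk_for(row[key_attr], n_disks)].append(row)
    return disks


def point_query(disks, key_attr, value):
    """A point query on the partitioning attribute reads exactly one disk."""
    target = disks[disk_for(value, len(disks))]
    return [row for row in target if row[key_attr] == value]


rows = [{"id": i, "name": f"emp{i}"} for i in range(20)]  # hypothetical relation
disks = hash_partition(rows, "id", 4)
print(point_query(disks, "id", 7))  # [{'id': 7, 'name': 'emp7'}]
```

A query on "name" or a range query on "id" could not use disk_for to narrow the search and would have to scan all four lists.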
  • Range partitioning : This technique distributes contiguous attribute-value ranges to each disk. Thus the tuples whose attribute values fall within a certain range are placed on a certain disk. Here the tuples are sorted before distribution.

Advantages : The  advantages of  Range partitioning  are:

  1. Best suited for point and range queries that are based on the partitioning attributes.
  2. It gives higher throughput while maintaining good response time, since different disks can be used to answer different queries.

Disadvantages : The  disadvantages of  range partitioning  are:

If there are many tuples in the queried range, many tuples have to be retrieved from a small number of disks, causing an I/O bottleneck at those disks. This situation is called execution skew.
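Range partitioning and the execution skew just described can be illustrated with a short sketch. The boundary values, the "salary" attribute and the function names are all hypothetical; the standard bisect module is used to locate the range that contains a key.

```python
import bisect


def range_partition(rows, key_attr, boundaries):
    """Place each row on the disk whose value range contains its key.

    `boundaries` is a sorted list of exclusive upper bounds, so k boundaries
    define k + 1 disks.
    """
    disks = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        disks[bisect.bisect_right(boundaries, row[key_attr])].append(row)
    return disks


def disks_for_range(boundaries, low, high):
    """Return the disk numbers that a query low <= key <= high must read."""
    first = bisect.bisect_right(boundaries, low)
    last = bisect.bisect_right(boundaries, high)
    return list(range(first, last + 1))


rows = [{"salary": s} for s in (50, 120, 130, 150, 180, 250)]
disks = range_partition(rows, "salary", [100, 200])
print([len(d) for d in disks])                # [1, 4, 1]
print(disks_for_range([100, 200], 110, 190))  # [1]
```

The range query touches a single disk, which is efficient, but that disk also holds most of the tuples, so all of the I/O for this query falls on it.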

3.2. Inter-query Parallelism

In inter-query parallelism, multiple queries (transactions) execute in parallel on different CPUs. This type of parallelism increases the transaction throughput. The response time of an individual transaction is almost the same as if it were run in isolation. The basic reason for using inter-query parallelism is to scale up a transaction processing system.

Advantages : The advantages of Inter-query parallelism are:

  1. It is the easiest form of parallelism to support in a database system, particularly in a shared memory parallel system.
  2. It increases the transaction throughput.
  3. It scales up a transaction processing system to support a larger number of transactions per second.

Disadvantages : The disadvantages of Inter-query parallelism are:

  1. The response time of an individual transaction remains almost the same as if the transactions were run in isolation.
  2. It is more complicated to support in a shared disk or shared nothing architecture.
  3. It does not help in speeding up long-running queries, since each query runs sequentially.
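The idea of inter-query parallelism can be sketched with a thread pool in which each worker runs one complete query. This is only an illustration under stated assumptions: the in-memory EMPLOYEES relation and the query functions are hypothetical, and Python threads merely stand in for separate CPUs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory relation standing in for a stored table.
EMPLOYEES = [{"id": i, "dept": i % 3, "salary": 1000 + 10 * i} for i in range(30)]


def count_in_dept(dept):
    """One complete query: count the employees of a department."""
    return sum(1 for row in EMPLOYEES if row["dept"] == dept)


def max_salary():
    """Another complete query: find the highest salary."""
    return max(row["salary"] for row in EMPLOYEES)


def run_queries_concurrently(queries, workers=3):
    """Hand each (function, args) query to a worker; each query runs sequentially."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fn, *args) for fn, args in queries]
        return [f.result() for f in futures]


results = run_queries_concurrently(
    [(count_in_dept, (0,)), (count_in_dept, (1,)), (max_salary, ())]
)
print(results)  # [10, 10, 1290]
```

More queries finish per unit of time, yet no single query is made faster, which matches the disadvantages listed above.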

3.3. Intra-query Parallelism

In intra-query parallelism, a single query is executed in parallel on multiple CPUs and disks. It speeds up long-running queries. An individual query can be parallelized by parallelizing the individual operations involved in the query. There are two main ways in which a single query can be parallelized: intra-operation parallelism and inter-operation parallelism. Both are discussed in the following sections.

Advantages : The  advantages of  Intra-query parallelism  are:

  1. It speeds up long  running queries.
  2. It is useful in decision support queries.

3.4. Intra-operation Parallelism

In Intra-operation parallelism, the execution of each individual operation such as sort, selection, projection and join of a query is parallelized. The degree of parallelism may be high, since the number of tuples in a relation can be large. This type of parallelism is natural in a database system.

Advantages : The  advantages of  Intra-operation parallelism  are:

  1. It speeds up query processing by parallelizing the execution of each individual operation in the query.
  2. It can scale better with increasing parallelism.

Disadvantages : The  disadvantages of  Intra-operation parallelism  are:

  1. The cost of parallel evaluation of operations is considerable.
  2. A partitioned parallel evaluation is only as fast as the slowest of the parallel executions.
  3. Any skew in the distribution of work across CPUs greatly affects the performance.
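Intra-operation parallelism can be sketched as one operation, here a selection followed by a sum, applied to every partition at once, with the partial results merged at the end. The function names and data are hypothetical, and threads stand in for CPUs that each own one partition.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(partition, threshold):
    """Selection plus aggregation on one partition only."""
    return sum(v for v in partition if v > threshold)


def parallel_sum_over(partitions, threshold):
    """Run the same operation on every partition, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(lambda p: partial_sum(p, threshold), partitions))
    return sum(partials)


values = list(range(1, 101))
partitions = [values[i::4] for i in range(4)]  # round-robin split over 4 workers
print(parallel_sum_over(partitions, 50))  # 3775
```

The merge step cannot start until every partial result is available, so the slowest partition determines the overall time.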

3.5. Inter-operation Parallelism

In inter-operation parallelism, the different operations in a query expression are executed in parallel to speed up the processing of the query. There are two types of inter-operation parallelism. These are:

  • Pipelined parallelism
  • Independent parallelism

  • Pipelined parallelism : In pipelined parallelism, the output tuples of one operation, A, act as input to a second operation, B, even before the first operation has produced the entire set of tuples in its output. Thus it is possible to run operations A and B simultaneously on different CPUs, so that the tuples produced by A are consumed by B as they are produced.

Advantages : The  advantages of  pipelined parallelism  are:

    1. It is useful with a small number of CPU’s.
    2. Writing intermediate results to disk can be avoided.

Disadvantages : The  disadvantages of  pipelined parallelism  are:

    1. It does not scale-up well.
    2. When the degree  of parallelism  is  high, it  is  less important  than partitioning.
    3. Only marginal speed-up is obtained, when one operator’s cost is much higher than others.
    4. It is not possible to pipeline relational operators that do not produce output until all inputs have been accessed, e.g., the set difference operation.
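Pipelined parallelism can be sketched with two threads connected by a small bounded queue: operation A puts each output tuple on the queue as soon as it is produced, and operation B consumes tuples while A is still running. The operations themselves (an even-number selection followed by a multiplication) and all names are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the tuple stream


def operation_a(rows, out_q):
    """Selection: emit each qualifying tuple as soon as it is found."""
    for row in rows:
        if row % 2 == 0:
            out_q.put(row)
    out_q.put(_DONE)


def operation_b(in_q, results):
    """Second operation: consumes tuples while A is still producing them."""
    while True:
        row = in_q.get()
        if row is _DONE:
            break
        results.append(row * 10)


def run_pipeline(rows):
    q = queue.Queue(maxsize=4)  # small buffer, so no full intermediate result is stored
    results = []
    a = threading.Thread(target=operation_a, args=(rows, q))
    b = threading.Thread(target=operation_b, args=(q, results))
    a.start()
    b.start()
    a.join()
    b.join()
    return results


print(run_pipeline(range(10)))  # [0, 20, 40, 60, 80]
```

The bounded queue reflects the point that intermediate results need not be written to disk; an operator such as set difference would instead have to read all of its input before putting anything on the queue.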
  • Independent parallelism : In independent parallelism, the operations in a query expression that do not depend on one another can be executed in parallel.

Advantages : The main advantages of independent parallelism are:

It is  useful with  a lower  degree of  parallelism.

Disadvantages : The main disadvantages of independent parallelism are:

    1. It does not provide a high degree of parallelism.
    2. The operations in a query expression can be parallelized only if they are independent.
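Independent parallelism can be sketched with two selections that do not depend on each other and are therefore submitted together, followed by a join that has to wait for both. The relations, predicates and function names are hypothetical, and threads stand in for CPUs.

```python
from concurrent.futures import ThreadPoolExecutor


def select_rows(relation, predicate):
    """Return the tuples of a relation that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def evaluate(r1, r2):
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The two selections are independent, so they can run at the same time.
        f1 = pool.submit(select_rows, r1, lambda row: row[1] > 10)
        f2 = pool.submit(select_rows, r2, lambda row: row[1] == "x")
        left, right = f1.result(), f2.result()
    # The join depends on both inputs, so it runs only after they finish.
    return [(a[0], a[1], b[1]) for a in left for b in right if a[0] == b[0]]


r1 = [(1, 5), (2, 20), (3, 30)]
r2 = [(2, "x"), (3, "y"), (4, "x")]
print(evaluate(r1, r2))  # [(2, 20, 'x')]
```

Only two operations run side by side here, which illustrates why this form of parallelism offers a low degree of parallelism.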

4. Advantages of Parallel Databases

There are many advantages of parallel databases. The important ones are as follows:

  1. Parallel databases increase throughput, i.e., a larger number of tasks can be completed in a given time interval.
  2. Parallel databases improve response time, i.e., the amount of time needed to complete a single task is reduced.
  3. Using several resources (CPUs and disks) in parallel can significantly improve performance.
  4. Increased availability : If a site containing a relation goes down, the relation continues to be available if a copy is maintained at another site.
  5. It is possible to serve a large number of users.

5. Disadvantages of Parallel Databases

There are many disadvantages of parallel databases. The important ones are as follows:

  1. The start-up costs are comparatively high.
  2. Interference problem : Existing CPUs slow down as more CPUs are added, due to increased contention for memory and network bandwidth.
  3. Data skew : Multiple disks contain partitions with widely varying numbers of tuples. Skew causes the CPUs dealing with large partitions to become performance bottlenecks.

Source: Gupta Satinder Bal, Mittal Aditya (2017), Introduction to Basic Database Management System, 2nd Edition-University Science Press (2017)
