Introduction to Distributed Databases

In a distributed database system, the data is stored across several systems, and each system is managed by a DBMS that can run independently of the others. More generally, a distributed system is defined as a collection of independent computers that appears to its users as a single computer. This definition has two aspects: the first, which concerns hardware, says that the machines are autonomous; the second, which concerns software, says that the users think of the system as a single computer. Distributed systems therefore need different software than centralized systems.

The major objectives of Distributed databases are as follows:

The first objective of distributed databases is to give users at many different locations easy access to data. For this, the distributed database system must provide location transparency, i.e., a user querying or updating data need not know where the data is located. Any request to retrieve or update data from any site is automatically forwarded by the system to the site or sites that can process the request.
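Location transparency can be illustrated with a minimal sketch. The catalog, site names and sample rows below are hypothetical, and a plain dictionary stands in for the network and the per-site DBMS; the point is only that the caller names a table, never a site.

```python
# Hypothetical catalog: maps each table name to the site that stores it.
CATALOG = {"EMPLOYEE": "site1", "DEPARTMENT": "site2"}

# Hypothetical per-site storage; in a real system each site runs its own DBMS.
SITES = {
    "site1": {"EMPLOYEE": [{"name": "Asha", "did": 10}]},
    "site2": {"DEPARTMENT": [{"did": 10, "dname": "Sales"}]},
}


def read_table(table):
    """Return all rows of `table`; the caller never names a site."""
    site = CATALOG[table]      # the system, not the user, resolves the location
    return SITES[site][table]  # stands in for forwarding the request to that site


rows = read_table("DEPARTMENT")
print(rows)
```

If DEPARTMENT were later moved to another site, only the catalog entry would change; the call `read_table("DEPARTMENT")` would stay the same.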

The second objective of distributed databases is local autonomy. This is the capability to administer a local database and to operate independently when connections to other sites have failed. With local autonomy, each site can control local data, administer security, log transactions, recover from local failures, and provide full access to local data to local users even when a central or coordinating site cannot operate.

1. Basic Concepts of Distributed Databases

A distributed database system (DDBS) is a collection of sites connected through a network. Each site is a full database system in its own right, and the different sites agree to work together so that a user can access data from any site without knowing how it is distributed, i.e., the distribution of data is transparent. A typical DDBS is shown in Figure 12.4.

Each site has its own local database, its own local users, local DBMS, transaction management software and local data communication manager. The users of the distributed system can use it without knowing anything about the distribution of data.

There are two types of distributed database systems:

Homogeneous distributed database system : In this system, the data is distributed but all systems run the same DBMS software, i.e., all clients and servers use identical software. The major characteristics of a homogeneous DDBS are:
- The data are distributed across all the nodes.
- The distributed DBMS manages all the data. This means there does not exist any exclusive local data.
- At each location, the same DBMS is used.
- The database is accessed through one global schema or data definition by all the users.
- The global schema is simply the union of all the local database schemas.

Heterogeneous distributed database system : In this system, the sites run different DBMSs that are connected so that data can be accessed from multiple sites, i.e., the clients and servers do not necessarily use identical software. The major characteristics of a heterogeneous DDBS are:
- The data are distributed across all the nodes.
- At each node, a different DBMS may be used.
- Users that require only local access to databases can work with only the local DBMS and schema.
- A global schema exists that allows local users to access remote data.

2. Distributed Database Management System (DDBMS)

It is a software program or group of programs that manages a distributed database while making the distribution transparent to the user. To achieve the advantages provided by the DDBS, the DDBMS performs the following functions in addition to those performed by a centralized DBMS:

Keeping track of data : The DDBMS keeps track of the data distribution, data replication and data fragmentation. This is achieved by expanding the catalog.
Replicated data management : The DDBMS decides which copy of a replicated data item to access. It also maintains consistency among the various copies of replicated data items.
Distributed transaction management : The DDBMS devises execution strategies for transactions and queries that access data from multiple sites. It also synchronizes access to distributed data and maintains the overall database integrity.
Distributed query processing : The DDBMS transmits queries and data among various sites and accesses remote sites through the communication network.
Distributed directory management : The information about data in the database is stored in the directory. The DDBMS provides two types of directories: a global one for the entire DDB and a local one for each individual site.
Distributed database recovery : The DDBMS can recover from individual site crashes and from other types of failures, such as the failure of a communication link.
Security : The DDBMS provides authorization or access privileges to the users so that distributed transactions are executed with proper management of the security of the data.
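The first two functions, keeping track of data and managing replicated data, can be sketched as an expanded catalog plus a copy-selection rule. Everything named here (`FragmentEntry`, `GLOBAL_CATALOG`, `choose_copy`, the fragment and site names) is a hypothetical illustration, and "prefer the local copy" is only one possible policy.

```python
from dataclasses import dataclass, field


@dataclass
class FragmentEntry:
    """One catalog entry: a fragment and the sites that hold a copy of it."""
    name: str
    sites: list = field(default_factory=list)


# Hypothetical global catalog for a relation split into two fragments.
GLOBAL_CATALOG = {
    "EMP_NORTH": FragmentEntry("EMP_NORTH", ["site1", "site3"]),
    "EMP_SOUTH": FragmentEntry("EMP_SOUTH", ["site2"]),
}


def choose_copy(fragment, local_site, up_sites):
    """Pick the copy to read: prefer the local site, else the first reachable replica."""
    candidates = [s for s in GLOBAL_CATALOG[fragment].sites if s in up_sites]
    if not candidates:
        raise RuntimeError(f"no reachable copy of {fragment}")
    return local_site if local_site in candidates else candidates[0]


print(choose_copy("EMP_NORTH", "site3", {"site1", "site3"}))
```

A user at a site without a copy is routed to a remote replica, and a fragment with no reachable copy raises an error, which is the situation replication is meant to make rare.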

3. Advantages of Distributed Databases

There are many advantages of distributed databases. Some of these are as follows:

Manages the distributed data with different levels of transparency : A DDBMS hides the details of where each file is physically stored within the system, i.e., a DDBMS is distribution transparent. The main kinds of transparency are as follows:
- Location transparency : Here, the command used to perform a task is independent of the location of data and the location of the system where the command was issued.
- Naming transparency : Here, once a name is specified, the named object can be accessed unambiguously without giving any additional details. Both location transparency and naming transparency are types of network transparency.
- Replication transparency : Here, replicated copies of data may be stored at various sites to obtain better performance, availability and reliability. The user does not know about the existence of multiple copies of data.
- Fragmentation transparency : Fragmentation transparency makes the user unaware of the existence of fragments of data. Fragmentation is of two types. Horizontal fragmentation: a relation is distributed into sets of tuples, and the attributes remain the same in the fragmented relations. Vertical fragmentation: a relation is distributed into more than one subrelation, and each subrelation contains a subset of the columns of the original relation.

Reliability and availability improve : Distributed databases improve reliability as well as availability. Reliability is defined as the probability that a system is running at a certain point of time, and availability is defined as the probability that the system is continuously available during a time interval. If one site fails in a distributed system, the others still continue to work; only the data held at the failed site cannot be accessed. This improves both reliability and availability.
Improvement in performance : There are many factors that help in improving the performance of the distributed database.

- Data localization : The DDBMS distributes the database among various sites in such a way that the data is placed closer to where it is needed most, called data localization. This reduces contention for CPU and I/O services.
- Since smaller databases exist at different sites, local queries and transactions accessing data have better performance.
- Compared with a centralized database, every site of a distributed database has a smaller number of transactions executing on it.
- Interquery and intraquery parallelism are possible by executing multiple queries at various sites or by breaking a query into subqueries that execute in parallel.

All these factors contribute to improved performance.
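Intraquery parallelism can be sketched as follows. The fragments, site names and the "sum of amounts" query are hypothetical, and threads on one machine stand in for independent sites, so this shows only the shape of the technique: one subquery per site, run concurrently, with the partial results combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fragments of a SALES relation, one per site.
FRAGMENTS = {
    "site1": [120, 80, 45],
    "site2": [200, 15],
    "site3": [60, 60, 60],
}


def local_sum(site):
    """Subquery run at one site: sum the amounts stored there."""
    return sum(FRAGMENTS[site])


def total_sales():
    """Run one subquery per site in parallel, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(FRAGMENTS)) as pool:
        partials = list(pool.map(local_sum, FRAGMENTS))
    return sum(partials)


print(total_sales())
```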

Scalability : Distributed database systems can be expanded very easily. The expansion may consist of adding more data, increasing the size of the database, or increasing the number of processors.
Site autonomy : It means that each system of a distributed database environment is administered independently from all other databases. It gives users tighter control over their own local databases.
Lower communication costs : Since data can be located closer to the point of use, communication costs are reduced.

4. Disadvantages of Distributed Databases

There are many disadvantages of distributed databases. Some of them are as follows:

Complexity : Distributed databases are more complex. The complexity arises because the distribution of the system must be hidden from the user. The increased complexity increases the acquisition and maintenance costs of the system.
Errors are harder to avoid : Errors are harder to avoid because of the parallel nature of distributed database systems, and they are very difficult to locate at the application level.
Communication overhead : Distributed database systems send messages between sites over the network. These messages sometimes congest the network, which causes communication overhead and badly affects system performance.
Lack of standards : There are no standard tools and methodologies available to the users to convert a centralized DBMS to a distributed DBMS.
Inexperience : With the current state of the art, it is hard to find a professional with much experience in designing, implementing and using distributed database systems.
Security : More security of data is required when the database is distributed, since unauthorised access and data corruption may occur when there is no centralized control over the data.
Slow response : If the data are not distributed properly as per the usage, or the queries are not formulated correctly, the response for data access is very slow.
Difficult to maintain integrity : Improper updating and data integrity problems are caused by the increased complexity and the need for coordination among the distributed data.

5. Data Distribution

The major goal of a distributed database system (DDBS) is to maintain better control of the organization’s data. The data is distributed at different sites based on the access patterns and costs. The best option can be selected by comparing the costs for different data allocation options. The various issues related to data distribution are data fragmentation, data allocation and data replication. These are discussed as follows:

5.1. Data Fragmentation

The decision regarding which portions of the database will be stored at which site is generally taken during distributed database design. The simplest unit of distribution is the relation: a whole relation can be stored at a particular site. A relation can also be divided into fragments, and there are several ways to fragment the database. These are:

Horizontal fragmentation
Vertical fragmentation
Hybrid fragmentation.

Horizontal Fragmentation : A horizontal fragment of a relation contains a subset of the tuples in that relation. Horizontal fragmentation divides the relation horizontally by grouping tuples to create subsets of tuples, where each subset has a certain logical meaning. These fragments can be allocated to various sites in the distributed system. The tuples of a horizontal fragment can be extracted from the relation by specifying a condition on one or more attributes of the relation. The horizontal fragmentation is shown in Figure 12.5.
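As a minimal sketch, a horizontal fragment is a selection on the relation, and the original relation is the union of its fragments. The EMPLOYEE rows and the `city` condition below are hypothetical sample data, not taken from Figure 12.5.

```python
EMPLOYEE = [
    {"eid": 1, "name": "Asha", "city": "Delhi"},
    {"eid": 2, "name": "Ravi", "city": "Mumbai"},
    {"eid": 3, "name": "Meena", "city": "Delhi"},
]


def horizontal_fragment(relation, predicate):
    """Selection: keep the tuples that satisfy the fragment's condition."""
    return [row for row in relation if predicate(row)]


emp_delhi = horizontal_fragment(EMPLOYEE, lambda r: r["city"] == "Delhi")
emp_other = horizontal_fragment(EMPLOYEE, lambda r: r["city"] != "Delhi")

# Reconstruction: the union of the fragments gives back the original tuples.
rebuilt = sorted(emp_delhi + emp_other, key=lambda r: r["eid"])
print(rebuilt == EMPLOYEE)
```

The two conditions are chosen to be complementary, so every tuple lands in exactly one fragment and the union loses nothing.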

The horizontal partitions for a distributed database have the following major advantages:

1. Efficiency: Data are stored close to where they are used and separate from other data used by other users or applications.
2. Local optimization: Data can be stored to optimize performance for local access.
3. Security: Data not relevant to usage at a particular site are not made available.
4. Ease of querying: Combining data across horizontal partitions is easy because rows are simply merged by unions across the partitions.

Thus, horizontal partitions are usually used when an organizational function is distributed, but each site is concerned with only a subset of the entity instances.

Horizontal partitions also have the following disadvantages:

1. Inconsistent access speed: When data from several partitions are required, the access time can be significantly different from local-only data access.
2. Backup vulnerability: When data at one site become inaccessible or damaged, users cannot switch to another site where a copy exists, because data are not replicated. Data may be lost if proper backup is not performed at each site.

Vertical Fragmentation : Vertical fragmentation divides the relation vertically, i.e., by columns. A vertical fragment of the relation keeps only certain attributes of the relation, but for all tuples of that relation.

The vertical fragmentation is called proper if the original relation can be obtained by combining the different fragments of that relation. This is possible only if every vertical fragment contains the primary key or a candidate key. A vertical fragment of a relation R can be specified by a projection operation in the relational algebra. The OUTER UNION operation is applied to the vertical fragments to obtain the original relation R when no horizontal fragmentation is used. The FULL OUTER JOIN operation can be applied to obtain the original relation when horizontal fragmentation is also used. The vertical fragmentation is shown in Figure 12.5.
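The role of the key in a proper vertical fragmentation can be seen in a small sketch. The EMPLOYEE rows and attribute names are hypothetical. The reconstruction here uses a plain key-matching join rather than OUTER UNION or FULL OUTER JOIN; that simplification gives the same result only because every key value appears in both fragments.

```python
EMPLOYEE = [
    {"eid": 1, "name": "Asha", "salary": 50000},
    {"eid": 2, "name": "Ravi", "salary": 42000},
]


def vertical_fragment(relation, attributes):
    """Projection: keep only the listed attributes of every tuple."""
    return [{a: row[a] for a in attributes} for row in relation]


# Both fragments carry the key `eid`, so the split can be undone.
emp_names = vertical_fragment(EMPLOYEE, ["eid", "name"])
emp_pay = vertical_fragment(EMPLOYEE, ["eid", "salary"])


def rejoin(left, right, key):
    """Rebuild tuples by matching the fragments on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


rebuilt = rejoin(emp_names, emp_pay, "eid")
print(rebuilt == EMPLOYEE)
```

If `eid` were dropped from `emp_pay`, there would be no way to tell which salary belongs to which name, which is why each fragment must include a key.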

The advantages and disadvantages of vertical partitions are identical to those for horizontal partitions, with the exception that combining data across vertical partitions is more difficult than across horizontal partitions. This difficulty arises from the need to match primary keys to join rows across partitions.

Hybrid Fragmentation : A hybrid fragmentation can be obtained by intermixing horizontal and vertical fragmentation. The (UNION and OUTER UNION) or (UNION and OUTER JOIN) operations are applied in the appropriate order to obtain the original relation R from the hybrid fragmented relations. The hybrid fragmentation is shown in Figure 12.5.
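A minimal sketch of "the appropriate order", under assumed sample data: the relation is first split horizontally, then one horizontal fragment is split vertically, and reconstruction undoes the steps in reverse. The rows, attribute names and the choice to split only the Delhi fragment are hypothetical, and a simple key join stands in for the outer operations.

```python
EMPLOYEE = [
    {"eid": 1, "name": "Asha", "city": "Delhi", "salary": 50000},
    {"eid": 2, "name": "Ravi", "city": "Mumbai", "salary": 42000},
    {"eid": 3, "name": "Meena", "city": "Delhi", "salary": 61000},
]


def select(relation, predicate):
    """Horizontal step: keep the tuples that satisfy the condition."""
    return [r for r in relation if predicate(r)]


def project(relation, attributes):
    """Vertical step: keep only the listed attributes."""
    return [{a: r[a] for a in attributes} for r in relation]


# Step 1: horizontal split by city. Step 2: vertical split of the Delhi fragment.
delhi = select(EMPLOYEE, lambda r: r["city"] == "Delhi")
others = select(EMPLOYEE, lambda r: r["city"] != "Delhi")
delhi_public = project(delhi, ["eid", "name", "city"])
delhi_pay = project(delhi, ["eid", "salary"])

# Reconstruction runs in the reverse order: join the vertical pieces, then union.
pay_by_eid = {r["eid"]: r for r in delhi_pay}
delhi_rebuilt = [{**r, **pay_by_eid[r["eid"]]} for r in delhi_public]
rebuilt = sorted(delhi_rebuilt + others, key=lambda r: r["eid"])
print(rebuilt == EMPLOYEE)
```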

6. Data Replication and Allocation

Data replication allows certain data to be stored at multiple sites, and allocation means storing the relations or their replicas at different sites. Both of these techniques are used during distributed database design.

6.1. Data Replication

The replication of data improves the performance, availability and reliability of the distributed database system. There are three types of data replication:

Full Replication
No Replication
Partial Replication.

There are many advantages of data replication. Some of them are as follows:

Reliability: If one or more sites containing the database fail, a copy of the database can still be found at another site without network traffic delays.
Fast Response: Every site that has a full copy of the database can process queries locally, so queries can be processed rapidly.
Possible Avoidance of Complicated Distributed Transaction Integrity Routines: Replicated databases are usually refreshed at scheduled intervals, so most forms of replication are used when some relaxing of synchronization across database copies is acceptable.
Node Decoupling: If some sites are down, busy, or disconnected, a transaction is still handled when the user desires. This is possible since each transaction may proceed without coordination across the network.
Reduced Network Traffic at Prime Time: In general, data is updated during prime business hours, when network traffic is highest and the demands for rapid response are greatest. With replication, the delayed updating of copies moves the network traffic for sending updates to other nodes to non-prime-time hours.

Replication has the following disadvantages:

Storage Requirements: Each site that has a full copy must have the same storage capacity as if the data were stored centrally. Each copy of the database needs to be updated on each site that holds a copy. This requires storage space and processing time.
Complexity and Cost of Updating: Whenever a database is updated, it must be updated at each site that holds a copy. Careful coordination is required in synchronizing the updating in near real time.

Full Replication : In full replication, a replica of the whole database is stored at every site in the distributed system. This means every relation is available to every user locally.

Advantages : The main advantages of full replication are:

The availability increases drastically, as the system continues to operate as long as at least one site is up.
The performance increases, since the result of every query can be obtained locally from any site.
Queries can be processed rapidly.

Disadvantages : The major disadvantages of full replication are:

It slows down the update operations drastically, since the update must be performed on every copy of the database to keep the copies consistent.
The concurrency control and recovery techniques become more expensive.
Since each site has a full copy, it must have the same storage capacity that would be required if the data were stored centrally, which is expensive.
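The trade-off in full replication, cheap local reads against writes that touch every copy, can be sketched with a toy key-value store. The class and site names are hypothetical, and the sketch deliberately ignores failures, concurrency control and atomic commit, which are exactly the parts that make real replicated updates expensive.

```python
class FullyReplicatedStore:
    """Toy key-value store with one full copy per site (hypothetical names)."""

    def __init__(self, site_names):
        self.copies = {site: {} for site in site_names}
        self.writes_performed = 0

    def read(self, site, key):
        """A read touches only the local copy."""
        return self.copies[site].get(key)

    def write(self, key, value):
        """A write must reach every copy to keep them consistent."""
        for copy in self.copies.values():
            copy[key] = value
            self.writes_performed += 1


store = FullyReplicatedStore(["site1", "site2", "site3"])
store.write("emp:1", "Asha")
print(store.read("site2", "emp:1"), store.writes_performed)
```

One logical update costs one physical write per site, so the update cost grows with the number of sites while the read cost does not.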

No Replication : In no replication, each fragment of the database is stored at exactly one site. Thus all fragments of the database are disjoint, except for the primary key, which is repeated in vertical fragments.

Advantages : The main advantages of no replication are:

Updating is very easy, since the data needs to be updated in only one place.
The concurrency control and recovery techniques are less expensive.
The storage requirement is much lower than with full replication.

Disadvantages : The main disadvantages of no replication are:

The availability decreases drastically.
The performance decreases, since not every query can be executed locally.

Partial Replication : In partial replication, some fragments of the database may be replicated whereas others may not. The number of copies of each fragment can range from one up to the total number of sites in the distributed system.

Advantages : The main advantages of partial replication are:

The availability of data is considerable.
The performance is good.
The queries are quite fast.

Disadvantages : The main disadvantages of partial replication are:

Updating is more complex than with no replication.
The concurrency control and recovery techniques are more expensive than with no replication.
The storage requirements are greater than with no replication.

6.2. Data Allocation

The process of assigning each fragment or its copy to a particular site in a distributed system is called data allocation. The choice of sites and the degree of replication depend on many factors, such as performance, availability, and the type and frequency of transactions submitted at each site.

A fully replicated database is better if the requirement is high availability, most transactions only retrieve data, and transactions can be submitted at any site.
A partially replicated database is better if data is accessed at multiple sites and many updates are performed.

Thus, finding an optimal solution to distributed data allocation is a very complex problem.
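One simple way to compare allocation options is to count how many requests would have to cross the network. The sketch below uses a hypothetical heuristic (place each fragment, without replication, at the site that accesses it most) and made-up access counts; it is not an optimal allocation algorithm and it ignores update costs, storage limits and availability.

```python
# Hypothetical access counts: ACCESSES[fragment][site] = requests per day.
ACCESSES = {
    "EMP_NORTH": {"site1": 900, "site2": 50, "site3": 50},
    "EMP_SOUTH": {"site1": 30, "site2": 700, "site3": 270},
}


def allocate_without_replication(accesses):
    """Place each fragment at the single site that uses it most."""
    return {frag: max(by_site, key=by_site.get) for frag, by_site in accesses.items()}


def remote_accesses(accesses, allocation):
    """Count requests that must cross the network under a given allocation."""
    return sum(
        count
        for frag, by_site in accesses.items()
        for site, count in by_site.items()
        if site != allocation[frag]
    )


plan = allocate_without_replication(ACCESSES)
print(plan, remote_accesses(ACCESSES, plan))
```

The same `remote_accesses` function can score any other candidate allocation, which is the comparison of options the text describes.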

7. Distributed DBMS Architectures

There are three types of distributed DBMS architectures. These are:

Client server architecture
Collaborating server architecture
Middleware architecture.

7.1. Client Server Architecture

In a client server architecture of a distributed DBMS, there are one or more client processes and one or more server processes. A client process can send a query to any one server process. The client acts as the user interface; it could run on a PC and send queries to a server. The server manages the data and executes transactions, and generally runs on a mainframe system.

While designing client-server applications, the boundary must be drawn between the client and the server so that the communication between them is set-oriented, rather than one tuple at a time.

Advantages : The main advantages of client server architecture are:

It is very simple to implement.
It clearly separates the functionality of client and server.
Servers can be fully utilized, since cheaper client machines are available for user interaction.
The graphical user interface (GUI) can be run on the client, which is easy to use and user friendly.

Disadvantages : The main disadvantages of client server architecture are:

The client server architecture does not allow a single query to span multiple servers.
To span multiple servers, the client process would have to be quite complex, since it must be able to break the query into subqueries and then combine the answers to these subqueries.
With this capability, the client process begins to overlap with the server, and the distinction between clients and servers becomes harder to draw.

7.2. Collaborating Server Architecture

To eliminate the disadvantages of client server architecture, there is an alternative architecture called collaborating server architecture. In this architecture, a collection of database servers is used, and each server can run a query or transaction on its local data. These servers can also execute transactions spanning multiple servers by cooperating with each other. When a server receives a query that needs data from other servers, it divides the query into subqueries, sends them to the other servers for execution, and combines the results to obtain the answer to the original query. This decomposition of the query should be optimized, taking into consideration the cost of communication and of local processing.

Advantages : It allows a single query to span multiple servers.

7.3. Middleware Architecture

Middleware architecture allows a single query to span multiple servers without requiring every database server to be capable of managing multisite execution strategies. Here, only one database server needs to be capable of managing queries and transactions spanning multiple servers. This server acts as a layer of software that coordinates the execution of queries and transactions across one or more independent database servers and is called middleware. All other database servers need to handle only local queries and transactions.

The middleware layer can execute joins and other relational operations on data accessed from other servers. This layer does not maintain any data by itself.
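The division of labour in the middleware architecture can be sketched as follows. `LocalServer` and `Middleware` are hypothetical classes, and the tables are made-up samples; the local servers only scan their own tables, while the middleware holds no data and performs the cross-server join itself.

```python
class LocalServer:
    """A database server that only answers queries on its own data."""

    def __init__(self, tables):
        self.tables = tables

    def scan(self, table):
        return list(self.tables[table])


class Middleware:
    """Coordinates multisite queries; stores no data of its own."""

    def __init__(self, servers):
        self.servers = servers  # maps table name -> server holding it

    def join(self, left, right, key):
        left_rows = self.servers[left].scan(left)
        right_rows = self.servers[right].scan(right)
        index = {}
        for row in right_rows:
            index.setdefault(row[key], []).append(row)
        return [{**l, **r} for l in left_rows for r in index.get(l[key], [])]


s1 = LocalServer({"EMPLOYEE": [{"name": "Asha", "did": 10}, {"name": "Ravi", "did": 20}]})
s2 = LocalServer({"DEPARTMENT": [{"did": 10, "dname": "Sales"}]})
mw = Middleware({"EMPLOYEE": s1, "DEPARTMENT": s2})
result = mw.join("EMPLOYEE", "DEPARTMENT", "did")
print(result)
```

In this sketch the middleware pulls whole tables before joining, which is the simplest strategy rather than the cheapest one in terms of data transfer.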

8. Comparison of DBMS and DDBMS

9. Query Processing in Distributed Databases

A query in a DDBMS generally requires data from more than one site. Fetching data from other sites means transmitting it over the network, which incurs communication costs. Query processing in a DDBMS differs from query processing in a centralized DBMS because of this cost of data transfer. The transmission cost is low when sites are connected through a high-speed network and is quite significant in other networks.

9.1. Costs (Transfer of Data) of Distributed Query Processing

The data transfer costs of distributed query processing involve the cost of transferring intermediate files to other sites for processing and the cost of transferring the final result files to the site where the result is required.

Let us assume that a user submits a query at site S₁ that requires data from its own site as well as from another site S₂. There are three strategies to process this query, as given below.

Transferring data from S₂ to S₁ and processing the query at S₁
Transferring data from S₁ to S₂ and processing the query at S₂
Transferring data from S₁ and S₂ to a third site S₃ and processing the query at S₃

The choice depends on many factors such as:

The size of relations and the results.
The communication costs between different sites, i.e., between S₁ and S₂, S₁ and S₃, S₂ and S₃, etc.
At which site the result will be utilized.

Generally, the data transfer cost is calculated in terms of the size of messages. The data transfer cost can be calculated using the formula.

Data Transfer Cost = C * Size.

where C is the cost per byte of transferring data and Size is the number of bytes transmitted.

Example. Consider the following relations EMPLOYEE and DEPARTMENT.

Query: Find the names of employees and their department names.

Determine the amount of data transferred to execute this query when the query is submitted at site 3.

Solution. The query is submitted at site 3, and neither of the two relations, i.e., EMPLOYEE and DEPARTMENT, resides at site 3. We have three strategies to execute this query.

1. Transfer both the relations, i.e., EMPLOYEE and DEPARTMENT, to site 3 and then join the relations there. The total cost in this case is 1000 × 40 + 50 × 25 = 40,000 + 1,250 = 41,250 bytes.

2. Transfer the relation EMPLOYEE to site 2, join the relations at site 2 and then transfer the result to site 3. The total cost is 40 × 1000 + 40 × 1000 = 80,000 bytes, since we have to transfer 1000 tuples having NAME and DNAME from site 2 to site 3 that are of 40 bytes each.

3. Transfer the relation DEPARTMENT to site 1, join the relations at site 1 and then transfer the result to site 3. The total cost is 25 × 50 + 40 × 1000 = 41,250 bytes, since we have to transfer 1000 tuples having NAME and DNAME from site 1 to site 3 that are of 40 bytes each.

We can choose strategy 1 or 3 if the optimization criterion is to minimize the amount of data transferred.
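The arithmetic of the three strategies can be checked with a short script. The tuple counts and sizes are read off the calculations in the solution above (EMPLOYEE as 1000 tuples of 40 bytes, DEPARTMENT as 50 tuples of 25 bytes, and a result of 1000 tuples of 40 bytes); the variable names are only illustrative.

```python
EMP_TUPLES, EMP_TUPLE_BYTES = 1000, 40        # EMPLOYEE, stored at site 1
DEPT_TUPLES, DEPT_TUPLE_BYTES = 50, 25        # DEPARTMENT, stored at site 2
RESULT_TUPLES, RESULT_TUPLE_BYTES = 1000, 40  # (NAME, DNAME) pairs

emp_bytes = EMP_TUPLES * EMP_TUPLE_BYTES
dept_bytes = DEPT_TUPLES * DEPT_TUPLE_BYTES
result_bytes = RESULT_TUPLES * RESULT_TUPLE_BYTES

strategies = {
    "1: ship both relations to site 3": emp_bytes + dept_bytes,
    "2: ship EMPLOYEE to site 2, result to site 3": emp_bytes + result_bytes,
    "3: ship DEPARTMENT to site 1, result to site 3": dept_bytes + result_bytes,
}

cheapest = min(strategies.values())
for name, cost in strategies.items():
    marker = "  <- minimum" if cost == cheapest else ""
    print(f"{name}: {cost} bytes{marker}")
```

Changing the sizes (for example, a much smaller result relation) shifts which strategy is cheapest, which is why the choice depends on the sizes of the relations and of the result.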

9.2. Using Semijoin in Distributed Query Processing

The semijoin operation is used in distributed query processing to reduce the number of tuples in a relation before transmitting it to another site. Reducing the number of tuples reduces the number and total size of the transmissions, which ultimately reduces the total cost of data transfer.

Let us assume that we have two relations R₁ and R₂ at sites S₁ and S₂. We send the joining column of one relation (say R₁) to the site where the other relation (say R₂) is located. This column is joined with R₂ at that site, and only the matching tuples of R₂, restricted to the join attributes and the attributes needed in the result, are shipped back to S₁ and joined with R₁. The decision as to whether to reduce R₁ or R₂ can only be made after comparing the advantages of reducing R₁ with those of reducing R₂. Thus, semijoin is an efficient way to minimize the data transfer in distributed query processing.
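A minimal in-memory sketch of the semijoin idea follows. The relations, attribute names and values are hypothetical, and both "sites" live in one process, so the sketch shows only which tuples would need to travel, not real network transfer.

```python
R1 = [  # stored at site S1
    {"did": 10, "name": "Asha"},
    {"did": 20, "name": "Ravi"},
    {"did": 30, "name": "Meena"},
]
R2 = [  # stored at site S2
    {"did": 10, "dname": "Sales"},
    {"did": 40, "dname": "Audit"},
]


def semijoin(left, right, key):
    """Return the tuples of `left` that have a matching `key` value in `right`."""
    keys = {row[key] for row in right}  # only this column needs to be shipped
    return [row for row in left if row[key] in keys]


# Ship R1's join column to S2, reduce R2 there, and ship only the survivors back.
r2_reduced = semijoin(R2, R1, "did")
joined = [{**a, **b} for a in R1 for b in r2_reduced if a["did"] == b["did"]]

# Swapping the operands yields tuples of R1 instead, so the two results differ.
r1_reduced = semijoin(R1, R2, "did")
print(r2_reduced, r1_reduced, joined)
```

Here only one of the two R₂ tuples has to be sent to S₁, and the final join is the same as joining the full relations.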

Note. The semijoin operation is not commutative, i.e., R₁ ⋉ R₂ ≠ R₂ ⋉ R₁.

Example. Determine the amount of data transferred to execute the query given in the previous example using semijoin. Assume that the query is submitted at site 3.

Solution. The following strategy can be used to execute the query.

Project the attributes of EMPLOYEE at site 1 and transfer them to site 3. We transfer π_{NAME, DID}(EMPLOYEE) and the size is 25 × 1000 = 25,000 bytes.

Transfer the relation DEPARTMENT to site 3 and join the projected attributes of EMPLOYEE with this relation. The size of the DEPARTMENT relation is 25 × 50 = 1,250 bytes.

Using the above strategy, the amount of data transferred to execute the query is 25,000 + 1,250 = 26,250 bytes.

Source: Gupta Satinder Bal, Mittal Aditya (2017), Introduction to Basic Database Management System, 2nd Edition-University Science Press (2017)