Introduction to Distributed Databases

In a distributed system, the data is stored across several systems and each system is managed by a DBMS that can run independently of the other systems. Also, a distributed system is defined as a collection of independent computers that appear to the users of the system as a single computer. This definition has two aspects—the first that deals with hardware says that the machines are autonomous; the second that deals with software says the users think of the system as a single computer. The distributed systems need different software than the centralized systems.

The major  objectives of  Distributed databases  are as  follows:

The first objective of distributed databases is to provide users at many different locations the ease of access to data. For this, the distributed database system must provide location transparency i.e. a user using data for querying or updating need not know the location of the data. Any request to retrieve or update data from any site is automatically forwarded by the  system to  the  site or  sites  related to  the  processing request.

The second objective of distributed databases is local autonomy. This is the capability to administer a local database and to operate independently when connections to other sites have failed. With local autonomy, each site has the capability to control local data, administer security, and log transactions and recover when local failures occur and to provide full access to local  data  to  local  users when  any  central  or  coordinating  site cannot  operate.

1. Basic Concepts of Distributed Databases

A distributed database system (DDBS) is a collection of sites connected through network. Each site is a full database system site and the different sites agreed to work together so that a user can access the data from any site without the knowledge of its distribution i.e., distribution of  data  is  transparent. A typical DDBS is  shown  in  Figure  12.4.

Each site has its own local database, its own local users, local DBMS, transaction management software and local data communication manager. The users of the distributed system can use it without knowing anything about the distribution of data.

There are two types of distributed database systems:

  • Homogeneous distributed database system : In this system, the data is distributed but all systems run the same DBMS software e., all clients and servers use the identical software. The major  characteristics  of homogeneous  DDBS are
    • The data are distributed across all the nodes.
    • The distributed DBMS manages all the This means there does not exist any exclusive local data.
    • At each location, the same DBMS is used.
    • The database is accessed through one global schema or data definition by all the users.
    • The global schema is simply the union of all the local database schemas.
  • Hetrogeneous distributed database system : In this system, different systems run by different DBMS’s that are connected to access data from multiple sites e., all clients and servers do not necessarity use the identical software. The major characteristics of hetrogeneous DDBS  are
    • The data are distributed across all the nodes.
    • At each node, the different DBMS’s may be used.
    • The users that require only local access to databases, can be accomplished by using only the local DBMS and schema.
    • A global schema  exists that  allows  local users  to  access remote data.

2. Distributed Database Management System (DDBMS)

It is a software program or group of programs that manage a distributed database while making the distribution transparent to the user. To achieve the advantages provided by the DDBS, the DDBMS performs the following additional functions then those performed by centralized DBMS.  These  are  as  follows  :

  • Keeping track of data : The DDBMS has the ability to keep track of the data distribution, data replication and data This is achieved by expanding the catalog.
  • Replicated data management : The DDBMS has the ability to decide which copy of the replicated data item to It also maintains the consistency among various copies of replicated  data  items.
  • Distributed transaction management : The DDBMS has the ability to devise execution strategies for transactions and queries that access data from multiple It also synchronize the access to distributed data and maintain the overall database integrity.
  • Distributed query processing : The DDBMS has the ability to transmit queries and data among various sites and access remote sites  through communication network.
  • Distributed directory management : The information about data in the database is stored in the The DDBMS provides two types of directories one is global for the entire  DDB  and  other  one  is local  for  each  individual  site.
  • Distributed database recovery : The DDBMS has the ability to recover from individual site crashes and from other types of failures like failure of a communication link.
  • Security : The DDBMS provide authorization or access privileges to the users so that distributed transactions must be executed with the proper management of the security of the data.

3. Advantages of Distributed Databases

There are many advantages of distributed databases. Some of these are as follows:

  1. Manages the distributed data with different levels of transparency : A DDBMS hides the details of where each file is physically stored within the system, e., a DDBMS is distribution transparent. There are many transparencies that are as follows :
    • Location transparency : Here, the command used to perform a task is independent of the location of data and the location of the system where the command was issued.
    • Naming transparency : Here, once a name is specified, the named object can be accessed unambiguously without giving any additional details. Both location transparency and naming transparency are types of network transparency.
    • Replication transparency : Here, the replicated copies of data may be stored at various sites to obtain better performance, availability and The user do not know about the existence of multiple copies of data.
    • Fragmentation transparency : Fragmentation transparency makes the user unaware about the existence of fragments of data. It is of two types. Horizontal fragmentationIt means distributing a relation into sets of tuples. The attributes remain the same in the fragmented Vertical fragmentationIt means distributing a relation into more than one subrelations and each subrelation contained a subset  of  the  columns  of  the original  relation.
  1. Reliability and availability improves : Distributed databases improves reliability as well as The reliability is defined as the probability that a system is running at certain point of time and availability is defined as the probability that the system is continuously available during a time interval. So, if one site fails in a distributed system others still continue to work. The data that is available at the failed site cannot be accessed. It improves reliability and availability.
  2. Improvement in performance : There are many factors that help in improving the performance of the distributed database.
    • Data localization : The DDBMS distributes the database among various sites in such a way that the data is placed closer to where it is needed most, called data This reduces contention for CPU and I/O services.
    • Since smaller databases exist at different sites hence local queries and transactions accessing data have better performance.
    • As compared to the transactions submitted to a centralized database, every site of a distributed databse has a smaller number of transactions executing on them.
    • Interquery and intraquery parallelism is possible by executing multiple queries at various sites or by breaking  the query  into  subqueries that  execute  in parallel.

All these factors  contribute in  improving the  performance.

  1. Scalability : The distributed database systems can be expanded very The expansion may be of adding more data, increasing the size of database or increasing the number of  processors.
  2. Site autonomy : It means that each system of a distributed database environment is administered independently from all other databases. It gives the user tighter control over their own local databases.
  3. Lower communication costs : Since data can be located closer to the point of use, the communication costs reduces.

4. Disadvantages of Distributed Databases

There are many disadvantages of distributed databases, some of them are as follows:

  1. Complexity : The distributed databases are more complex and The complexity is due to hidden distribution of the system from the user. The increased complexity increases the acquisition  and  maintenance  costs  of  the system.
  2. Errors are harder to avoid : The errors are harder to avoid due to parallel nature of the distributed database systems and are very difficult to locate at application level.
  3. Communication overhead : The distributed database systems send messages between sites over the These messages some time blocked the network and hence causes communication overhead and affects the system performance badly.
  4. Lack of standards : There are no standard tools and methodologies available to the users to convert a centralized DBMS to a distributed DBMS.
  5. Inexperience : With the current state-of-the-art, it is hard to find a professional with much experience in designing, implementing and using the distributed database systems
  6. Security : More security of data is required when the database is distributed since unauthorised access and data corruption may occur due to no centralized control over the data.
  7. Slow response : If the data are not distributed properly as per the usage or the queries are not formatted correctly, the response for data access is very slow.
  1. Difficult to maintain integrity : Improper updating and data integrity problems are caused by the increased complexity and need for coordination among the distributed data.

5. Data Distribution

The major goal of a distributed database system (DDBS) is to maintain better control of the organization’s data. The data is distributed at different sites based on the access patterns and costs. The best option can be selected by comparing the costs for different data allocation options. The various issues related to data distribution are data fragmentation, data allocation and data  replication.  These  are  discussed  as  follows:

5.1. Data Fragmentation

The decision regarding which portions of the database will be stored at which site are generally taken during the distributed database design. The most general and simplest unit of the database that are to be distributed is the relations. The whole relation can be stored at a particular site. There are many ways to distribute/fragment the database. These are:

  • Horizontal fragmentation
  • Vertical fragmentation
  • Hybrid fragmentation.

 

  • Horizontal Fragmentation : The horizontal fragments of a relation contains subsets of the tuples in that relation. The horizontal fragmentation divides the relation horizontally by grouping tuples to create subsets of tuples and each subset has a certain logical These fragments can be allocated to various sites in the distributed system. The tuples of a horizontal fragment can be extracted from the relation by specifying a condition on one or more attributes of the relation. The horizontal fragmentation is shown in Figure 12.5.

The horizontal partitions for a distributed database have the following major advantages:

    1. Efficiency: Data are stored close to where they are used and separate from other data used by other users or applications.
    2. Local optimization: Data  can  be stored  to  optimize  performance for  local  access.
    3. Security: Data not relevant to  usage at a  particular site  are not  made available.
    4. Ease of querying: Combining data across horizontal partitions is easy because rows are simply merged  by unions  across  the available.

Thus, horizontal partitions are usually used when an organizational function is distributed, but each site is concerned with only a subset of the entity instances.

Horizontal partitions also have the following disadvantages:

    1. Inconsistent access speed: When data from several partitions are required, the access time can be significantly different  from local-only  data access.
    2. Backup vulnerability: When data at one site become inaccessible or damaged, user cannot switch to another site where a copy exists, because data are not Data may be  lost  if  proper  backup  is not  performed  at  each  site.
  • Vertical Fragmentation : The vertical fragmentation divides the relation vertically e., by columns. This type of fragment of the relation keeps certain attributes of the relation and all tuples of  that  relation.

The vertical fragmentation is called proper if the original relation is obtained from the different fragments of that relation by combining them. This is possible only if every vertical fragment contains some primary key or candidate key. A vertical fragment on some relation R can be specified by a projection operation in the relational algebra. The OUTER UNION operation is applied on the vertical fragments to obtain the original relation R, when no horizontal fragmentation is used. The FULL OUTER JOIN operation can be applied to obtain the original relation, when horizontal fragmentation is used. The vertical fragmentation is shown in  Figure  12.5.

The advantages and disadvantages of vertical partitions are identical to those for horizontal partitions, with the exception that combining data across vertical partitions is more difficult than across horizontal partitions. This difficulty arises from the need to match primary keys to join  rows  across  partitions.

  • Hybrid Fragmentation : A hybrid fragmentation can be obtained by intermixing the horizontal and vertical The (UNION and OUTER UNION) or (UNION and OUTER JOIN) operations are applied in the appropriate order to obtain the original relation R from the hybrid fragmented relations. The hybrid fragmentation is shown in Figure 12.5.

6. Data Replication and Allocation

The data replication allow certain data to be stored at multiple sites and allocation means storing the relations or their replicas at different sites. Both of these techniques are used during the distributed data design.

6.1. Data Replication

The replication of data improves the performance, availability and reliability of the distributed database system. There are many types of data replications. These are as follows :

  • Full Replication
  • No Replication
  • Partial Replication.

There are many advantages of data replication. Some of them are as follows:

  1. Reliability: If one or more sites containing the database fail, the copy of the database can always be  found  at  another  site without  network  traffic delays.
  2. Fast Response: Every site that has a full copy of database can process queries locally, thus queries can be processed rapidly.
  3. Possible Avoidance of Complicated Distributed Transaction Integrity Routines: Replicated databases are usually refreshed at scheduled intervals, thus most forms of replication are used when some relaxing of synchronization across database copies is acceptable.
  4. Node Decoupling: If some sites are down, busy, or disconnected, a transaction is handled when the user desires. This is possible since each transaction may proceed without coordination across  the acceptable.
  5. Reduced Network Traffic at Prime Time: In general, the updation of data happens during prime business hours, and at this time the network traffic is highest and the demands for rapid response Due to replication, the delayed updating of copies of data moves network traffic for sending updates to other nodes to non– prime-time hours.

Replication has the following disadvantages:

  1. Storage Requirements: Each site that has a full copy must have the same storage capacity as if the data were stored centrally. Each copy of the database needs to be updated on each site that holds a This requires storage space and processing time.
  2. Complexity and Cost of Updating: Whenever a database is updated, it must be updated at each site that holds a copy. Careful coordination is required in synchronizing the updating in near real time.

  • Full Replication : In full replication, the replica of the whole database is stored at every site in the distributed This means every relation is available to every user locally.

Advantages : The main advantages of full replication are:

  1. The availability increases drastically as the system continue to operate as long as at least one site is up.
  2. The performance increases since result of every query can be obtained locally from any site.
  3. Queries can be processed rapidly.

Disadvantages : The major disadvantages of full replication are:

  1. It slows down the update operations drastically, since the update must be performed on every copy of the database to keep the copies consistent.
  2. The concurrency control and recovery techniques become more expensive.
  3. Since each site that has a full copy, it must have the same storage capacity that would be required if the data were stored expensive.
  • No Replication : In no replication, each fragment of the database is stored at exactly one Thus all fragments of the database are disjoint except the primary key.

Advantages : The main advantages of No-replication are:

  1. Updation is very easy since only at one place the data need to be updated.
  2. The concurrency control and recovery techniques are less expensive.
  3. Storage requirement is very-very less compared to full replication.

Disadvantages : The main disadvantages of No-replication are:

  1. The availability decreases drastically.
  2. The performance decreases since every query is not possible to execute locally.
  • Partial Replication : In partial replication, some fragments of the database may be replicated whereas others may The copies of each fragment varies from one to total number of sites in the distributed system.

Advantages : The main advantages of partial replication are:

  1. The availability of data is considerable.
  2. The performance is good.
  3. The queries are quite fast.

Disadvantages : The main disadvantages of partial replication are

  1. Updatation is more complex than no replication.
  2. The concurrency control and recovery techniques are more expensive than no replication.
  3. The storage requirements are replication.

6.2. Data Allocation

The process of assigning each fragment or its copy to a particular site in a distributed system is called data allocation. The choice of sites and the degree of replication depends on many factors like performance, availability and the type and frequency of transactions submitted at each site.

  • A fully replicated database is better if the requirement is high availability, most transactions are for retrieving the data and transactions can be submitted at any site.
  • A partial replicated database is better if data is accessed at multiple sites and many updates are performed.

Thus finding an optimal or best solution to distributed data allocation is very much complex.

7.  Distributed DBMS Architectures

There are three types of distributed DBMS architectures. These are:

  • Client server architecture
  • Collaborating server architecture
  • Middleware architecture.

7.1. Client Server Architecture

In a client server architecture of distributed DBMS, there are one or multiple client processes and one or multiple server processes. The client process can send query to any one server process. The clients acts as user-interface and could run on a PC and send queries to a server. The server manages the data and execute transactions and generally run on a mainframe system.

While designing the client-server applications, the boundary must be drawn between the client and the server so that the communication between them is set oriented.

Advantages : The main advantages of client server Architecture are:

  1. It is very simple to implement
  2. It clearly separate the functionality of client and server
  3. Server’s can be fully utilized as now cheaper client machines are available for user- interactions
  4. The graphical user interface (GUI) can be run on the client by the users, which is easy to use and  user friendly

Disadvantages : The main disadvantages of client server architecture are:

  1. The client server architecture does not allow a single query to span multiple servers.
  2. The client process is quite complex as it must have the capability to break the query into subqueries and then combining together the answers of these subqueries.
  3. Having the above capability, the client process begin to overlap with the server and distinction between clients and servers  become harder.

7.2. Collaborating Server Architecture

To eliminate the disadvantage of client server architecture, we have an alternative architecture called collaborating server architecture. In this architecture, a collection of database servers are used and each server has the capability to run the query or transaction on its local data. These servers can also execute transactions spanning on multiple servers by cooperating with each other. On receipt of a query, on the server, that needs data from other servers, the corresponding server divides the query into subqueries and send them on other servers for execution and combines the result to obtain the answer of the original query. This decomposition of the query must be optimal, taking into consideration the cost of communication  and  local  processing.

Advantages : It allows a single query to span multiple servers.

7.3. Middleware Architecture

Middleware architecture allows a single query to span multiple servers without requiring all database servers having the capability of managing multisite execution strategies. Here, we have one database server that is capable of managing queries and transactions spanning multiple servers. This server acts as a layer of software that coordinates the execution of queries and transactions across one or more independent database servers and is called middleware. All  other  database servers  need  to handle  only  local  queries and  transactions.

The middleware layer has the capability to execute joins and other operations (relational) on data  accessed  from  other  servers. This  layer  do  not  maintain  any data  by  itself.

8.Comparison of DBMS and DDBMS

9. Query Processing in Distributed Databases

A query in a DDBMS generally requires data from more than one site. This need of data from other sites means transmission of the data that causes communication costs. The query processing in DDBMS is different from query processing in centralized DBMS due to this communication cost of data transfer over the network. The transmission cost is low when sites are connected through high speed network and is quite  significant in other networks.

9.1. Costs (Transfer of Data) of Distributed Query Processing

The data transfer costs of distributed query processing involves cost of transferring intermediate files to other sites for processing and the cost of transferring the final result files to  the site  where  that result  is required.

Let us assume, that a user gives a query at site S1, that requires data from its own as well as another site S2. There are three strategies to process this query as given below.

  1. Transfering data from S2 to S1 and process for query
  2. Transfering data from S1 to S2 and process the query
  3. Transfering data from S1 and S2 to S3 and process the query

The choice depends on many factors such as:

  • The size of relations and the results.
  • The communication costs between different sites e., between S1 and S2, S1 and S3, S2 and S3 etc.
  • At which site the result  will be utilized.

Generally, the data transfer cost is calculated in terms of the size of messages. The data transfer cost can be calculated using the formula.

Data Transfer Cost = C * Size.

where C is the cost per byte of transferring data and Size is the number of bytes transmitted.

Example. Consider the following relations EMPLOYEE and DEPARTMENT.

QUERy find the name of employees and their department name.

Determine the amount of data transfered to execute this query when the query is submitted at SITE 3.

Solution. Since the query is submitted at SITE 3 and neither of the two relations i.e., EMPLOYEE and DEPARTMENT reside at site 3. We have three strategies to execute this query.

  1. Transfer both the relations e., EMPLOYEE and DEPARTMENT at site 3 and then join the relations there. The total cost in this case is 1000 * 40 + 50 * 25 = 40,000 + 1250 = 41,250 bytes.
  1. Transfer the relation EMPLOYEE to Site 2, join the relation at Site 2 and then transfer the result at Site The total cost is 40 × 1000 + 40 × 1000 = 80,000 bytes, since we have to transfer 1000 tuples having NAME and DNAME from site 2 to site 3 that are of 40  bytes  each.
  1. Transfer the relation DEPARTMENT to site 1, join the relation at site 1 and then transfer the result at site The total cost is 25 × 50 + 40 × 1000 = 41,250 bytes, since we have to transfer 1000 tuples having NAME and DNAME from site 1 to site 3 that  are  of  40  bytes  each.

We can choose strategy 1 or 3, if optimization criteria is to minimize the amount of data transfer.

9.2. Using Semijoin in Distributed Query Processing

The semijoin operation is used in distributed query processing to reduce the number of tuples in a relation before transmitting it to another site. This reduction in the number of tuples, reduces the number and total size of the transmission that ultimately reduces the total cost  of  data  transfer.

Let us assume, that we have two relations R1 and R2 on site S1 and S2. We will send the joining column of one relation (say R1) to the site where the other relation (say R2) is located. This column is joined with R2 at that site. The decision as to whether to reduce R1 or R2 can only be made after comparing the advantages of reducing R1 with that of reducing R2. Thus semijoin is an efficient solution to minimize the data transfer in distributed query processing.

Note.The semijoin operation is not commutative i.e., 

Example. Determine the amount of data transferred to execute the query given in the previous example  using  semijoin. Assume  that the query  is  submitted  at  site 3.

Solution. The following strategy can be used to execute the query.

  1. Project the attributes of EMPLOYEE at site 1 and transfer them to site We transfer πNAME, DID (EMPLOYEE) and the size is 25 × 1000 = 25,000 bytes.
  1. Transfer the relation DEPARTMENT to site 3 and join the projected attributes of EMPLOYEE with this The size of DEPARTMENT relation is 25 × 50 = 1250.

Using the above strategy, the amount of  data transferred to execute the query is  25000 + 1250 = 26250.

Source: Gupta Satinder Bal, Mittal Aditya (2017), Introduction to Basic Database Management System, 2nd Edition-University Science Press (2017)

Leave a Reply

Your email address will not be published. Required fields are marked *