SOLVED PROBLEMS
Problem 1. Given two very large tables R(A, B) and S(A, C) where all attributes are integers. The data in each table is randomly distributed across three servers N1, N2, and N3. Explain how a parallel DBMS can compute S in parallel using all three servers.
Solution. A common implementation of parallel join is through hashing. The algorithm works as follows:
- On each server N1 through N3, the local DBMS instance will read the local data for relation It will apply a hash function to the join attribute R.A and it will send the tuple to one of N1 though N3 based on the hash value.
For example, we could compute R.A%3 and send the tuple to N1 if the result is 0, N2 if the result is 1 and N3 is the result is 2.
- The local DBMS instance on each server will then perform the same operation on the local data for relation S.
- Now, all tuples from R and S with the same value of the join attribute are located on the same The DBMS instance on each server can thus compute a local join between the two relations.
Problem 2. The following data structure and constraints exist for a magazine publishing company.
- The company publishes one regional magazine each in Punjab (PB), Haryana (HR), Rajasthan (RJ), and Delhi (DL).
- The company has 300,000 customers (subscribers) distributed throughout the four states as given above.
- On the first of each month, an annual subscription INVOICE is printed and sent to each customer whose subscription is due for renewal. The INVOICE entity contains a REGION attribute to indicate the state (PB, HR, RJ, DL) in which the customer resides:
CUSTOMER (CUS_NUM, CUS_NAME, CUS_ADDRESS, CUS_CITY, CUS_STATE, CUS_ZIP, CUS_SUBSDATE)
INVOICE (INV_NUM, INV_REGION, CUS_NUM, INV_DATE, INV_TOTAL)
The company’s management has decided to decentralize the management of the subscriptions in its four regional subsidiaries. Each subscription site will handle its own customer and invoice data. The company’s management, however, wants to have access to customer and invoice data to generate annual reports and to issue ad hoc queries, such as:
- List all current customers by
- List all new customers by
- Report all invoices by customer and by
Given the above conditions and the requirements, answer the following questions:
- What recommendations will you make regarding the type and characteristics of the required database system?
- What type of data fragmentation is needed for each table?
- What must be the criteria used to partition each database?
- Design the database Show an example with node names, location, fragment names, attribute names, and demonstration data.
- What type of distributed database operations must be supported at each remote site?
- What type of distributed database operations must be supported at the headquarters site?
Solution.
(a) The Magazine Publishing Company requires a distributed system with distributed database The distributed system will be distributed among the company locations in Punjab, Haryana, Rajasthan and Delhi.
The DDBMS must be able to support distributed transparency features, such as fragmentation transparency, replica transparency, transaction transparency, and performance transparency. Heterogeneous capability is not a mandatory feature since we assume there is no existing DBMS in place and that the company wants to standardize on a single DBMS.
(b) The database must be horizontally partitioned, using the STATE attribute for the CUSTOMER table and the REGION attribute for the INVOICE table.
(c) The following fragmentation segments reflect the criteria used to partition each database:
Horizontal Fragmentation of the CUSTOMER Table by State
Horizontal Fragmentation of the INVOICE Table by Region
(d) Note the following fragments:
(e) To answer this question, you must first draw a map of the locations, the fragments at each location, and the type of transaction or request support required to access the data in the distributed database.
Given the problem’s specifications, you conclude that no interstate access of CUSTOMER or INVOICE data is required. Therefore, no distributed database access is required in the four nodes. For the headquarters, the manager wants to be able to access the data in all four nodes through a single SQL request. Therefore, the DDBMS must support distributed requests.
(f) See the answer for part (e).
Source: Gupta Satinder Bal, Mittal Aditya (2017), Introduction to Basic Database Management System, 2nd Edition-University Science Press (2017)