In cost-based query optimization, the optimizer estimates the cost of running all alternative execution plans of a query and chooses the optimum alternative. The alternative that uses the minimum resources has the minimum cost. The cost of a query operation mainly depends on its selectivity, *i.e.,* the proportion of the input relations that forms the output. The following are the main components used to determine the cost of execution of a query:

- **Access cost of secondary storage:** This consists of the cost of database manipulation operations, which include searching, writing, and reading of data blocks stored in secondary storage. The cost of searching depends upon the type of index (primary, secondary, hashed), the type of file structure, and the ordering of the relation, in addition to the physical storage location, i.e., whether file blocks are allocated contiguously on the same disk or scattered across the disk.
- **Storage cost:** The cost of storing intermediate results (tables or files) that are generated by the execution strategy for the query.
- **Computation cost:** The cost of performing in-memory operations during query execution, such as sorting of records in a file, merging of records, performing computations on field values, and searching of records. These are mainly performed on data buffers.
- **Memory usage cost:** The cost pertaining to the number of memory buffers needed during query execution.
- **Communication cost:** The cost of transferring the query and its result from the database location to the location of the terminal where the query originated.

Of all the above components, the most important is the access cost to secondary storage, because secondary storage is comparatively slower than other devices. The optimizer tries to minimize the computation cost for small databases, as most of the data files can be held in main memory. For a large database, it tries to minimize the access cost to secondary storage, and for distributed databases, it tries to minimize the communication cost because various sites are involved in data transfer.

To estimate the cost of execution strategies, the optimizer accesses statistical data stored in the DBMS catalog. The information stored in the DBMS catalog is given below:

- Number of records in relation X, given as R.
- Number of blocks required to store relation X, given as B.
- Blocking factor of relation X, given as BFR.
- Primary access method and primary access attributes for each file.
- Number of levels of each multi-level index on an attribute A, given as I_A.
- Number of first-level index blocks for an attribute A, given as B_AI1.
- Selection cardinality of attribute A in relation R, given as S_A, where S_A = R × SL_A and SL_A is the selectivity of the attribute.

**Cost Function for Selection Operation :** The selection operation works on a single relation in relational algebra and retrieves the records that satisfy the given condition. Depending upon the structure of the file, the available indexes, and the searching methods, the estimated costs of the strategies for the selection operation are as given below.

Now consider the example of the Employee table. Suppose there are 10,000 records stored in 2,000 blocks, and the following indexes are available.

- A secondary index on Emp-ID with 4 levels.
- A clustering index on Salary with 3 levels and an average selection cardinality of 30.
- A secondary index on the non-key attribute Age with 2 levels and 4 first-level index blocks. There are 200 distinct values for Age.

Consider the queries:

- Number of records (R) = 10,000
- Number of blocks (B) = 2,000
- Blocking factor (BFR) = 10,000/2,000 = 5

Now the cost of the above queries (assuming the records are stored in an unordered table):

- If linear search is used on a key attribute, the cost will be B/2 = 2000/2 = 1000, since on average half the blocks are searched.
- If linear search is used on a non-key attribute, the cost will be B = 2000, since all blocks must be searched.

- This query has a conjunctive selection condition. To estimate its cost, we can use either of the two components of the selection condition to retrieve the records, or a linear search. The linear search costs 2000. The condition salary > 9000 gives a cost estimate of I_salary + (B/2) = 2 + 1000 = 1002, and the cost of the condition Age = 20 is 30 + 2 = 32. So, the total cost is 1002 + 32 = 1034.
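As a sketch, the selection-cost arithmetic above can be reproduced in a few lines (the index levels and selection cardinalities are the ones assumed in the example):

```python
R = 10_000            # records in the Employee table
B = 2_000             # blocks
BFR = R // B          # blocking factor = 5

# Linear search on a key attribute: on average half the blocks are read.
cost_linear_key = B // 2             # 1000

# Linear search on a non-key attribute: every block must be read.
cost_linear_nonkey = B               # 2000

# Conjunctive selection (salary > 9000 AND Age = 20):
# index levels + half the blocks for the range condition on salary,
# selection cardinality + index levels for the equality condition on Age.
cost_salary = 2 + B // 2             # 1002
cost_age = 30 + 2                    # 32
total = cost_salary + cost_age       # 1034
```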

**Cost Function for Join Operation :** The join operation is the most time-consuming operation in databases, and an accurate cost function for the join operation depends upon an estimate of the size of the file (number of records) that results from the join. The estimated costs of the strategies for the join operation are given below:

*Note:* (R) and (S) are relations R and S, respectively.

**Example.** Suppose we have 600 records in the Department table. The BFR for the Department table is 60, so the number of blocks is 600/60 = 10.

For the Join operation,

| Type of Join | Cost |
| --- | --- |
| Block nested-loop join | 22,000 (buffer has only one block) |
| Block nested-loop join | 2,010 (if all blocks of Employee can be read into the database buffer) |
| Hash join | 6,010 (if the hash index is held in memory) |

**Example.** Given two unary relations (each containing only one attribute), R1 and R2:

| R1 | R2 |
| --- | --- |
| 7 | 8 |
| 2 | 4 |
| 9 | 2 |
| 8 | 1 |
| 3 | 3 |
| 9 | 2 |
| 1 | 7 |
| 3 | 3 |
| 6 | |
Show the result of joining R1 and R2 using each of the following algorithms. List the results in the order that would be output by the join algorithm.

- Nested loop join. Use R1 for the outer loop and R2 for the inner loop.
- Sort-merge join.

**Solution.** The result relation contains only one attribute, which is the common attribute between R1 and R2.

**Nested loop join:** For every tuple in R1, find the matches in R2. Result: 7, 2, 2, 8, 3, 3, 1, 3, 3.

**Sort-merge join:** The result appears in sorted order on the join attribute: 1, 2, 2, 3, 3, 3, 3, 7, 8.
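The two algorithms can be sketched directly on these unary relations (a minimal sketch; the tuple orderings match the hand-computed results above):

```python
R1 = [7, 2, 9, 8, 3, 9, 1, 3, 6]
R2 = [8, 4, 2, 1, 3, 2, 7, 3]

# Nested loop join: for every tuple of the outer R1, scan the inner R2.
nested_loop = [r for r in R1 for s in R2 if r == s]

def sort_merge(a, b):
    """Sort-merge join: sort both inputs, then merge, pairing equal runs."""
    a, b = sorted(a), sorted(b)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            v = a[i]
            ni, nj = i, j
            while ni < len(a) and a[ni] == v:
                ni += 1
            while nj < len(b) and b[nj] == v:
                nj += 1
            # emit one result tuple per pairing of the two equal runs
            out.extend([v] * ((ni - i) * (nj - j)))
            i, j = ni, nj
    return out

print(nested_loop)        # [7, 2, 2, 8, 3, 3, 1, 3, 3]
print(sort_merge(R1, R2)) # [1, 2, 2, 3, 3, 3, 3, 7, 8]
```

Note that the duplicated values (the two 3s in each relation) produce 2 × 2 = 4 result tuples under both algorithms; only the output order differs.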

**SOLVED** **PROBLEMS**

**Problem 1.** Consider two relations R and S. The number of tuples in R is 500,000, each of size 50 bytes. The number of tuples in S is 125,000, each of size 40 bytes. The page size is 5 KB. We have a buffer of size 250 KB. Answer the following questions, with detailed steps, for the following join query:

SELECT *

FROM R, S

WHERE R.sid=S.sid.

- Find the cost of page-oriented simple nested loops join.
- Find the cost of block nested loops join.
- Find the cost of sort-merge join.
- Can you make any refinement in the sort-merge join cost? Explain.
- Find the cost of hash join.

**Solution. **The given information is as follows:

Page size = 5 KB; buffer size = 250 KB = 250/5 = 50 pages.

For R, number of tuples = 500,000, size of one tuple = 50 bytes.

Size of R = 500,000 × 50 bytes = 25,000,000 bytes = 25,000,000/5,000 pages = 5,000 pages.

For S, number of tuples = 125,000, size of each tuple = 40 bytes.

Size of S = 125,000 × 40 bytes = 5,000,000 bytes = 5,000,000/5,000 pages = 1,000 pages.

Consider S to be the outer relation and R to be the inner relation.

**Note.** We considered S to be the outer relation because it is the smaller relation. You could also consider R to be the outer relation, but the cost would be higher.

Consider N = size of R = 5000 pages and M = size of S = 1000 pages.

###### 1. Page Oriented Simple Nested Loops Join.

Cost = M + M*N

(reading the outer relation once, plus reading the whole inner relation for each outer page)

Cost = 1000 + 1000*5000 = 5,001,000 I/Os

###### 2. Block Nested Loops Join.

Cost = (scan outer) + (scan inner)*(number of outer blocks)

(number of outer blocks) = ceiling of ((size of outer relation)/(block size)),

where (block size) = (buffer size) − 2 (one buffer page for scanning the inner relation and one for the output).

Cost = M + N*ceiling of (M/(B − 2))

Cost = 1000 + 5000*ceiling of (1000/48) = 1000 + 5000*21 = 106,000 I/Os.
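Both costs above follow mechanically from the formulas; as a sketch (using the same M, N, B as the text):

```python
import math

M, N, B = 1000, 5000, 50   # pages of S (outer), pages of R (inner), buffer pages

# Page-oriented simple nested loops: read the outer once, and rescan
# the whole inner relation for every outer page.
snl = M + M * N                        # 5,001,000 I/Os

# Block nested loops: hold B-2 pages of the outer at a time
# (one page for scanning the inner, one for the output).
bnl = M + N * math.ceil(M / (B - 2))   # 1000 + 5000*21 = 106,000 I/Os
```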

###### 3. Sort-Merge Join.

Cost = (cost of sorting both relations) + (cost of reading both relations).

Cost of sorting a relation = **2*(number of passes)*size** (2 for reading and writing in each pass).

Number of passes = ceiling of (log of (size/buffer) to the base B − 1) + 1.

The "+1" accounts for pass 0: after pass 0 you get ceiling of (size/buffer) sorted runs.

For Relation S, (# of passes)=2; by applying the above formula.

Cost of sorting S=2*2*1000 =4000.

**Explanation: **You can sort S in two passes.

In pass 0, you produce 20 (1000/50) sorted runs of size 50.

In pass 1, you merge these 20 sorted runs, and hence sort the whole relation.

For Relation R, (# of passes)=3; by applying the above formula.

Cost of sorting R=2*3*5000 =30,000.

**Explanation: **You can sort R in three passes.

In pass 0, you produce 100 (5000/50) sorted runs of size 50.

In pass 1, you merge these 100 sorted runs, 49 at a time, producing 3 sorted runs.

In pass 2, you merge these 3 sorted runs and hence sort the whole relation.

Final Cost=1000 + 5000 + 4000 + 30,000 = 40,000 I/O.

###### 4. Refinement of Sort-Merge Join.

To apply the refinement of combining the final merge pass of sorting with the join's merging phase, the buffer size must be greater than the square root of the size of the larger relation.

In this example, the size of the larger relation is 5000 pages and the buffer size is 50 pages. Since 50 < √5000 ≈ 71, the condition is not satisfied, so we cannot apply this refinement.

###### 5. Hash Join

Cost = 3*(M + N)

Cost = 3*6000 = 18,000 I/Os.
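The pass counts and totals for steps 3 and 5 can be checked with a short sketch (a direct transcription of the formulas above):

```python
import math

B = 50                        # buffer pages
pages = {"R": 5000, "S": 1000}

def n_passes(size, buf):
    # pass 0 produces ceil(size/buf) sorted runs;
    # each later pass merges buf-1 runs at a time
    runs = math.ceil(size / buf)
    return 1 + math.ceil(math.log(runs, buf - 1))

sort_cost = sum(2 * n_passes(sz, B) * sz for sz in pages.values())
merge_cost = sum(pages.values())           # read both sorted relations once
sort_merge_total = sort_cost + merge_cost  # 30,000 + 4,000 + 6,000 = 40,000

hash_total = 3 * sum(pages.values())       # 3*(M+N) = 18,000
```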

**Problem 2.** Consider the join between relations R and S on R.a = S.b, given the following information about the relations to be joined. The cost metric is the number of page I/Os unless otherwise noted, and the cost of writing out the result should be uniformly ignored.

- Relation R contains 10,000 tuples and has 10 tuples per page.
- Relation S contains 2,000 tuples and also has 10 tuples per page.
- Attribute b of relation S is the primary key for S.
- Both relations are stored as simple heap files.
- Neither relation has any indexes built on it.
- 52 buffer pages are available.
- What is the cost of joining R and S using a page-oriented simple nested loops join? What is the minimum number of buffer pages required for this cost to remain unchanged?
- What is the cost of joining R and S using a block-nested loops join? What is the minimum number of buffer pages required for this cost to remain unchanged?
- What is the cost of joining R and S using a sort-merge join? What is the minimum number of buffer pages required for this cost to remain unchanged?
- What is the cost of joining R and S using a hash join? What is the minimum number of buffer pages required for this cost to remain unchanged? Neglect the space needed for building a hash table.

**Solution.** Let M = 1000 be the number of pages in R, N = 200 be the number of pages in S, and B = 52 be the number of buffer pages available.

- The basic idea is to read each page of the outer relation and, for each page, scan the inner relation for matching tuples. The total cost is (#pages in outer) + (#pages in outer × #pages in inner), which is minimized by making the smaller relation the outer relation.

Total Cost = N + (N*M) = 200 + 200*1000 = 200,200.

The minimum number of buffer pages for this cost is 3 (one page for the outer, one for the inner, one for the output).

- This time, read the outer relation in blocks and, for each block, scan the inner relation for matching tuples. The outer relation is still read once, but the inner relation is scanned only once for each outer block, of which there are:

ceiling of (#pages in outer/(B − 2)) = ceiling of (200/50) = 4.

Total Cost = N + M*ceiling of (N/(B − 2)) = 200 + 1000*4 = 4,200.

If the number of buffer pages were less than 52, the number of scans of the inner relation would be more than 4, since ceiling of (200/49) is 5. The minimum number of buffer pages for this cost is therefore 52.
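The claim that 52 is the minimum buffer size for this cost can be checked with a small sketch (the function name is mine):

```python
import math

M, N = 1000, 200   # pages of R (inner) and S (outer)

def bnl_cost(buffer_pages):
    # Block nested loops: the outer S is read once; the inner R is read
    # once per outer block of (buffer_pages - 2) pages.
    return N + M * math.ceil(N / (buffer_pages - 2))

print(bnl_cost(52))  # 4200
print(bnl_cost(51))  # 5200 -- one fewer buffer page forces a fifth inner scan
```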

- Since B > √M > √N, we can use the refinement to sort-merge join discussed in the text (pages 254–255).

Total Cost = 3*(M + N) = 3*1200 = 3,600.

**Note.** If S.b were not a key, then the merging phase could require more than one pass over one of the relations, making the cost of merging M*N I/Os in the worst case.

The minimum number of buffer pages required is 25. With 25 buffer pages, the initial sorting pass will split R into 20 runs of size 50 and split S into 4 runs of size 50 (approximately; pass 0 can produce runs of about twice the buffer size using replacement selection). These 24 runs can then be merged in one pass, with one page left over to be used as an output buffer. With fewer than 25 buffer pages, the number of runs produced by the first pass over both relations would exceed the number of available pages, making a one-pass merge impossible.

- The cost of hash join is 3(M + N) if B > √N, where N is the number of pages in the smaller relation, S. Since √N ≈ 14 and B = 52, this condition is met. We also assume uniform partitioning by our hash function.

Total Cost = 3*(M + N) = 3,600.

For the minimum number of buffer pages required, a good guess is that we need B > √N, i.e., at least 15 buffer pages.

**Problem 3. **Consider the relations r1(A,B,C), r2(C,D,E), and r3(E, F), with primary keys A, C, and E, respectively. Assume that r1 has 1000 tuples, r2 has 1500 tuples, and r3 has 750 tuples.

- Give an efficient strategy (i.e., the join order) for computing r1 ⋈ r2 ⋈ r3.
- Estimate the size of the above join.

###### Solution.

- The order should be (r1 ⋈ r2) ⋈ r3. This is because the estimated size of r1 ⋈ r2 is 1000, the estimated size of r1 ⋈ r3 is 750,000 (a Cartesian product, since the two relations share no attribute), while that of r2 ⋈ r3 is 1500.
- The size of r1 ⋈ r2 ⋈ r3 is less than 1000.

Note that (*a*) and (*b*) are not necessarily related.
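The size estimates behind this ordering can be sketched as follows, using the rule that a join on the key of one relation yields at most one match per tuple of the other, while relations with no common attribute yield a Cartesian product:

```python
n_r1, n_r2, n_r3 = 1000, 1500, 750   # tuple counts from the problem

est_r1_r2 = n_r1          # C is the primary key of r2: at most one match per r1 tuple
est_r1_r3 = n_r1 * n_r3   # no common attribute: Cartesian product
est_r2_r3 = n_r2          # E is the primary key of r3: at most one match per r2 tuple

# r1 join r2 has the smallest estimate (1000), so compute it first
# and join the result with r3.
```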

**Problem 4.** Consider the tables WORKS_AT(SSN, gymID) and GYM(gymID, name). Notice that gymID is not a candidate key for the table GYM.

WORKS_AT(SSN, gymID) consists of N1 = 100,000 tuples and has

- V(SSN, WORKS_AT) = 50,000 distinct values of SSN
- V(gymID, WORKS_AT) = 20,000 distinct values of gymID

GYM(gymID, name) consists of N2 = 40,000 tuples and has

- V(gymID, GYM) = 20,000 distinct values of gymID
- V(name, GYM) = 30,000 distinct values of name.

(a) Estimate the number of qualifying tuples of the query:

SELECT *

FROM WORKS_AT

WHERE SSN = 123456789;

(b) Can SSN be a candidate key for the table WORKS AT? Give a short explanation for your answer

(c) Estimate the number of qualifying tuples of the query:

SELECT *

FROM GYM

WHERE name = 'Gym planet';

(d) Estimate the number of qualifying tuples of the query:

SELECT *

FROM WORKS_AT

WHERE SSN = 123456789 AND gymID=101;

(e) Notice that gymID is not a candidate key for the table GYM. Estimate the number of qualifying tuples of the query:

SELECT SSN, GYM.gymID, name

FROM WORKS_AT JOIN GYM

WHERE GYM.gymID = WORKS_AT.gymID;

(f) Estimate the number of qualifying tuples of the query:

SELECT SSN, WA2.SSN

FROM WORKS_AT AS WA1 JOIN WORKS_AT AS WA2

WHERE WA1.gymID = WA2.gymID;

###### Solution.

(*a*) N1/V(SSN, WORKS_AT) = 100,000/50,000 = 2

(*b*) No. WORKS_AT has 100,000 tuples but only 50,000 distinct SSN values, so SSN values are duplicated in the table.

(*c*) N2/V(name, GYM) = 40,000/30,000 ≈ 1.33

(*d*) N1/(V(SSN, WORKS_AT) × V(gymID, WORKS_AT)) = 100,000/(50,000 × 20,000) = 0.0001

(*e*) The two tables have V(gymID, WORKS_AT) = V(gymID, GYM) = 20,000 distinct gymID values, so the standard join-size estimate gives

N1 × N2/V(gymID, WORKS_AT) = 100,000 × 40,000/20,000 = 200,000

(*f*) gymID is not a key of WORKS_AT, so N1 × N1/V(gymID, WORKS_AT)

= 100,000 × 100,000/20,000 = 500,000
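The estimates above follow the standard uniform-distribution heuristics; as a minimal sketch (variable names are mine):

```python
N1, N2 = 100_000, 40_000                  # tuples in WORKS_AT and GYM
V_ssn, V_gym_wa = 50_000, 20_000          # distinct SSN / gymID in WORKS_AT
V_gym_gym, V_name = 20_000, 30_000        # distinct gymID / name in GYM

est_a = N1 / V_ssn                        # equality on SSN
est_c = N2 / V_name                       # equality on name
est_d = N1 / (V_ssn * V_gym_wa)           # conjunction of both conditions
est_e = N1 * N2 // max(V_gym_wa, V_gym_gym)   # join on gymID
est_f = N1 * N1 // V_gym_wa                   # self-join of WORKS_AT on gymID
```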

Source: Satinder Bal Gupta and Aditya Mittal, *Introduction to Basic Database Management System*, 2nd Edition, University Science Press, 2017.