Cost based query optimization

In cost based query optimization, optimizer estimates the cost of running of all alternatives of a query and choose the optimum alternative. The alternative which uses the minimum resources is having minimum cost. The cost of a query operation is mainly depend on its selectivity i.e., the proportion of the input relations that forms the output. Following are the main components used to determine the cost of execution of a query:

  1. Access cost of secondary storage : Access cost to secondary storage consists of cost of database manipulation operations which includes searching, writing, reading of data blocks stored in the secondary The cost of searching depends upon the type of indexes (primary, secondary, hashed), type of file structure, ordering of relation in addition to physical storage location like file blocks are allocated contiguously on the same disk  or  scattered  on  the  disk.
  2. Storage cost : Storage cost consists of cost of storing intermediate results (tables or files) that are generated by the execution strategy for the query.
  3. Computation cost : Computation cost consists of performing in-memory operations during query execution such as sorting of records in a file, merging of records, performing computations on field values, searching of These are mainly performed on data  buffers.
  4. Memory usage cost : It consists of cost of pertaining to the number of memory buffers needed during  query execution.
  5. Communication cost : It consists of the cost of transferring query and its result from database location to  the  location of  terminal  where  the  query is originated.

From all the above components, the most important is access cost to secondary storage because secondary storage is comparatively slower than other devices. Optimizer try to minimize computation cost for small databases as most of the data files are stored in main memory. For large database, it try to minimize the access cost to secondary storage and for distributed databases, it trys to minimize the communication cost because various sites are involved for  data  transfer.

To estimate the cost of execution strategies, optimizer access statistical data stored in DBMS catalog.  The  information  stored  in  DBMS  catalog is  given  below:

  1. Number of records in relation X, given as R.
  2. Number of blocks required to store relation X, given as B.
  3. Blocking factor of relation X,  given as BFR.
  4. Primary access method for each file and attributes for each file.
  5. Number of levels  for  each  multi-level  index  for  a  attribute A given  as  IA.
  6. Number of first-level index  blocks for  a  attribute A,  given as  BAI1.
  7. Selection cardinality of attribute A in relation R, given as SA, where SA = R × SLA, where SLA is the  selectivity  of  the attributes.

Cost Function for Selection Operation : Selection operation works on a single relation in relation algebra and retrieves the records that satisfy the given condition. Depending upon the structure of file, available indexes, searching methods, the estimated cost of strategies for selection  operation is  as  given below.

Now consider the example of Employee table. Consider, there are 10,000 records stored in 2000  blocks. Also  the  following  indexes  are  available.

  • A secondary index on the Emp-ID with 4 levels.
  • A clustering index on salary  with  3 levels and  average  selection cardinality  of 30.
  • A secondary index on non-key attributes Age with 2 levels and 4 first level index There are 200  distinct  values for Age.

Consider the queries:

  • Number of records (R) = 10,000
  • Number of blocks (B)= 2000
  • Blocking Factor (BFR) = 10000/2000 = 5

Now cost of the above queries (suppose records are in table)

  • If linear search is used then cost will be B/2 = 2000/2= 1000
  • If linear search is used then cost will be B = 2000.
  • This query has a conjunctive selection To estimate the cost of use using anyone of the two components of selection condition, to retrieve the records plus the linear search. The linear search costs 2000 and condition salary > 9000 first gives cost estimate of Isalary + (B/2)= 2 + 1000 = 1002 and cost of condition Age = 20 is 30 + 2 = 32 So, total cost is 1034.

Cost Function for Join Operation : Join operation is the most time consuming operation in databases and an accurate cost function for join operations depends upon the estimate of size of file (number of records) after the join operation. The estimated cost of strategies for join operation is given below:

Note: (R) and (S) are relations R and S respectively.

Example. Consider we have 600 records in Department table. BFR for department table is 60 and number of blocks are 600/60 = 10.

For the Join operation,                       

Type of Join                                    Cost

Block nested-loop joins               22000 (Buffer has only  one block)

  2010      (if all blocks of employee can be read into database buffer)

Hash Join                                    6010      (is hash index is held in memory)

Example. Given  two unary  relation  (contains only  one  attribute), R1  and  R2.

R1                         R2

7                            8

2                            4

9                            2
8                            1
3                            3
9                            2
1                            7

3                            3
6

Show the result of joining R1 and R2 using each of the following algorithms. List the results in the order that would be output by the join algorithm.

  1. Nested Loop Use R1 for outer Loop and R2 for the inner loop.
  2. Sort Merge Join.

Solution. The result relation contains only one attribute, which is the common attribute between R and S.

  1. Nested loop For every tuple in R1, find matches in R2. 7, 2, 2, 8, 3, 3, 1, 3, 3
  2. Sort merge The result appears in sorted order on the join attribute. 1, 2, 2, 3, 3, 3, 3, 7, 8

 

SOLVED PROBLEMS

Problem 1.  Consider two relations R and S. The number of tuples in R is 500,000 each of size 50 bytes. The number of tuples in S is 125,000 each of size 40 bytes. The page size is 5 KB. We have a buffer of size 250 KB. Answer the following questions with detail steps for following the join query:

SELECT *

FROM R, S

WHERE R.sid=S.sid.

  1. Find the cost of page oriented Simple Nested Loops
  2. Find the cost of Block Nested Loops
  3. Find the cost of Sort-Merge
  4. Can you do any refinement in the Sort-Merge Join Cost? Explain?
  5. Find the cost of Hash

Solution. The given information is as follows:

Page size = 5 K, Buffer size = 250 K = 250/50 = 50 pages.

For R, number of tuples = 500,000, size of one tupe = 50 bytes.

Size of R = 500,000 × 50 byte = 500,000×50/5000 pages = 5000 pages.

For S, number of tuples = 125,000, size of each tuple = 40 bytes.

Size of S = 125,000 × 40 byte = 1250,000 × 40/5000 pages = 1000 pages.

Consider S to  be the  outer relation and  R to  be the inner  relation.

Note. We considered S to be outer because it is the smaller relation. You can also consider R  to  be  outer  but  the  cost  will  be  more.

Consider N=size of R=5000 pages and M=size of S=1000 pages.

1.    Page Oriented Simple Nested Loops Join.

Cost=M+M*N

(Reading outer Relation once + for each outer page reading the whole inner relation)

Cost=1000+5000*1000=5,001,000 I/O

2.    Block Nested Loops Join.

Cost= (scan outer)+(scan inner)*(number of outer blocks)

(number of outer blocks)= ceiling of ((size of outer relation)/(Block size) ).

(block size)=(Buffer Size)-2. (one buffer for reading and one buffer for writing)

Cost=M+N*ceiling of (M/B-2).

Cost=1000+5000*21=106,000 I/O.

3.    Sort-Merge Join.

Cost=sorting both relations + reading both relations.

Cost of sorting relation = 2*(number of passes)*Size. (2 for reading and writing)

Number of passes= ceiling of (log of (size/buffer) to the base of B-1) + 1.

The one comes here for pass 0. After pass zero you get (Size/Buffer) sorted runs

For Relation S, (# of passes)=2; by applying the above formula.

Cost of sorting S=2*2*1000 =4000.

Explanation: You can sort S in two passes.

In pass 0, you produce 20 (1000/50) sorted runs of size 50.

In pass  1, you  can  sort these  20 sorted  runs,  and hence  the whole  relation.

For Relation R, (# of passes)=3; by applying the above formula.

Cost of sorting R=2*3*5000 =30,000.

Explanation: You can sort R in three passes.

In pass 0, you produce 100 (5000/50) sorted runs of size 50.

In pass 1, you sort these 100 sorted runs, and produce 2 sorted runs of size 50.

In pass  2,  you sort  these  2  sorted runs  and  hence the  whole  relation.

Final Cost=1000 + 5000 + 4000 + 30,000 = 40,000 I/O.

  1. To do the refinement of combing the merging phase in the sorting of R and S, size of buffer must be greater than  sqrt of (size  of Larger relation)

In this example, size of Larger relation = 5000 and buffer size = 50. Hence this condition is  not satisfied.  So we  cannot  do this  refinement.

5.    Hash Join

Cost= 3*(M+N)

Cost= 3*6000=18,000 I/O.

Problem 2. Consider the join between relations R and S on R.a=S.b, given the following information about the relations to be joined. The cost metric is the number of page I/Os unless otherwise noted, and the cost of writing out the result should be uniformly ignored.

  • Relation R contains 10,000 tuples and has 10 tuples per page.
  • Relation S contains 2,000 tuples and also has 10 tuples per page.
  • Attribute b of relation  S  is the  primary key  for S.
  • Both relations are stored as simple heap .les.
  • Neither relation has any indexes  built on it.
  • 52 buffer pages are available.
  • What is the cost of joining R and S using a page-oriented simple nested loops join? What is the minimum number of buffer pages required for this cost to remain unchanged?
  • What is the cost of joining R and S using a block-nested loops join? What is the minimum number of buffer pages required for this cost to remain unchanged?
  • What is the cost of joining R and S using a sort-merge join? What is the minimum number of buffer pages required for this cost to remain  unchanged?
  • What is the cost of joining R and S using a hash join? What is the minimum number of buffer pages required for this cost to remain unchanged? Neglect the size involved in building a hash table.

Solution. Let M = 1000 be the number of pages in R, N = 200 be the number of pages in S, and B = 52 be the number of bu.er pages available.

  • Basic idea is to read each page of the outer relation, and for each page scan the inner relation for matching Total cost would be #pages in outer + (#pages in outer #pages in inner) which is minimized by having the smaller relation be the  outer  relation.

Total Cost = N + (N*M) = 200, 200

The minimum  number of  buffer  pages for  this  cost is  3.

  • This time read the outer relation in blocks, and for each block scan the inner relation for matching tuples. So the outer relation is still read once, but the inner relation is scanned only once  for each outer  block, of  which there are:

ceiling of (#pagesinouter/ B-2] = 4.

Total Cost = N +M* ceiling (N/B-2) = 4, 200.

If the number of buffer pages is less than 52, the number of scans of the inner would be more than 4 since ceiling(200/49) is 5. The minimum number of buffer pages for  this  cost  is  therefore  52.

  • Since B > ÖM > ÖN we can use the refinement to Sort-Merge discussed on pages 254–255 in the

Total Cost = 3 (M + N) = 3, 600

Note. If S.b were not a key, then the merging phase could require more than one pass over one of the relations, making the cost of merging M N I/Os in the worst case.

The minimum number of buffer pages required is 25. With 25 buffer pages, the initial sorting pass will split R into 20 runs of size 50 and split S into 4 runs of size 50 (approximately). These 24 runs can then be merged in one pass, with one page left over to be used as an output buffer. With fewer than 25 buffer pages the number of runs produced by the first pass over both relations would exceed the  number  of available  pages,  making a  one-pass  merge impossible.

  • The cost of Hash Join is 3(M+N) if B > Ö N where N is the number of pages in the smaller relation, S. Since ÖN =14, this condition is met. We will also assume uniform partitioning from  our  hash

Total Cost = 3 (M + N) = 3, 600.

The minimum number of buffer pages required, and a good guess is that we need B >  N.

Problem 3. Consider the relations r1(A,B,C), r2(C,D,E), and r3(E, F), with primary keys A, C, and E, respectively. Assume that r1 has 1000 tuples, r2 has 1500 tuples, and r3 has 750 tuples.

  • Give an efficient strategy (in  which order) for  computing r1  1 r2  1 r3
  • Estimate the size of the above joins.
Solution.
  • the order shall be (r1 1 r2) 1 This is because the estimate size of r1 1 r2 is 1000, the estimate size of r1 1 r3 is 750,000, while that of r2 1 r3 is 1500.
  • less than 1000.

Note that (a) and (b) are not necessarily related.

Problem 4. Consider the tables WORKS AT(SSN, gymID) and GYM(gymID, name). Notice that gymID is not a candidate key for the table GYM.

WORKS AT(SSN, gymID) consists of N1 = 100,000 tuples and has

  • V(SSN,WORKS AT) = 50,000 distinct values of SSN
  • V(gymID,WORKS AT) = 20,000 distinct values of gymID

GYM(gymID, name) consists of N1 = 40,000 tuples and has

  • V(gymID,GYM) = 20,000 distinct values of gymID
  • V(name,GYM) = 30,000 distinct values of name.

(a) Estimate the number of qualifying tuples of the query:

SELECT *

FROM WORKS_AT

WHERE SSN = 123456789;

(b) Can SSN be a candidate key for the table WORKS AT? Give a short explanation for your answer

(c) Estimate the number of qualifying tuples of the query:

SELECT *

FROM GYM

WHERE name = “Gym planet”;

(d) Estimate the number of qualifying tuples of the query:

SELECT *

FROM WORKS_AT

WHERE SSN = 123456789 AND gymID=101;

(e) Notice that gymID is not a candidate key for the table Estimate the number of qualifying tuples of the query:

SELECT SSN, GYM.gymID, name

FROM WORKS_AT JOIN GYM

WHERE GYM.gymID = WORKS_AT.gymID;

(f) Estimate the number of qualifying tuples of the query:

SELECT SSN, WA2.SSN

FROM WORKS_AT AS WA1 JOIN WORKS_AT AS WA2

WHERE WA1.gymID = WA2.gymID;

Solution.

(a) N1/V (SSN, WORKS AT) = 100, 000/50, 000 = 2

(b) No, because it has multiple duplicates in the table. (c) N2/V (name, GYM) =  40, 000/30,  000 =  1.33

(d) N1/(V (SSN, WORKS AT)∗V (gymID, WORKS AT)) = 100, 000/50, 000/20, 000 =0.0001

(e) gymID is primary key in the table GYM and foreign key in the table WORKS So,

N1 ∗ N2/V (gymID, WORKS AT) = 100, 000 ∗ 40, 000/20, 000 = 200, 000

(f) gymID is not primary key in the table WORKS So, N1 ∗ N1/V (gymID, WORKSAT)

= 100, 000 ∗ 100, 000/20, 000 = 500,000

Source: Gupta Satinder Bal, Mittal Aditya (2017), Introduction to Basic Database Management System, 2nd Edition-University Science Press (2017)

Leave a Reply

Your email address will not be published. Required fields are marked *