Normalisation in Database

Normalisation is a process by which we can decompose or divide any relation into more than one  relation  to remove  anomalies  in relational  database.

It is a step by step process and each step is known as Normal Form.

Normalisation is  a  reversible  process.

1. Benefits of Normalisation

The benefits of normalisation include

  • Normalisation produces smaller tables with smaller rows, this means more rows per page and hence less logical I/O.
  • Searching, sorting, and creating indexes are faster, since tables are narrower, and more rows fit  on  a data
  • The normalisation produces more tables by splitting the original Thus there can be more clustered indexes and hence there is more flexibility in tuning the queries.
  • Index searching is generally faster as indexes tend to be narrower and shorter.
  • The more tables allow better use of segments to control physical placement of data.
  • There are fewer indexes per table and hence data modification commands are faster.
  • There are small number of null values and less redundant This makes the database more compact.
  • Data modification anomalies are reduced.
  • Normalization is conceptually cleaner and easier to maintain and change as the needs change.

2. Various Normal Forms

The different normal forms are as follows. Each of which has its importance and are more desirable than  the  previous  one.

2.1.  First Normal  Form  (1NF)

A relation is in first normal form if domain of each attribute contains only atomic values. It means atomicity must be present in relation.

Consider the relation Employee as shown in Figure 6.8. It is not in first normal form because attribute Name is not atomic. So, divide it into two attributes First Name and Last Name as  shown  in  Figure  6.9.

Now, relation Employee is in 1NF.

Anomalies in First Normal Form : First Normal form deals only with atomicity. Anomalies described earlier  are  also  applicable  here.

Example. Given the relation PART (Part_ID, Descr, Price, Comp_ID, No) having the following dependencies

Part_ID → Descr
Part_ID → Price
Part_ID, Comp_ID → No

Determine the highest  normal form  of this  relation.

Sol. There exists multi-value attributes. The attributes Comp_ID and No are not determined by the primary key. Hence the relation is not in INF.

2.2. Second Normal  Form  (2NF)

A relation is in second normal form if it is in 1NF and all non-primary key attributes must be fully functionally dependent upon primary key attributes.

Consider the relation Student as shown  in Figure 6.10(a) :

The Primary Key is (RollNo., Game). Each Student can participate in more than one game.

Relation Student is in 1NF but still contains anomalies.

  1. Deletion anomaly : Suppose you want to delete student Here you loose information about game Hockey because he is the only player participated in hockey.
  2. Insertion anomaly : Suppose you want to add a new game Basket Ball having no student participated in You cannot add this information unless there is a player for it.
  3. Updation anomaly : Suppose you want to change Fee of Cricket. Here, you have to search all the students participated in cricket and update fee individually otherwise it produces inconsistency

The solution of this problem is to separate Partial dependencies and Fully functional dependencies. So, divide Student relation into three relations Student(RollNo., Name), Games (Game, Fee) and Performance(RollNo., Game, Grade) as  shown in Figure  6.11.

Now, Deletion, Insertion and updation operations can be performed without causing inconsistency.

2.3. Third Normal  Form  (3NF)

A relation is in Third Normal Form if it is in 2NF and non-primary key attributes must be non-transitively dependent upon primary key attributes.

In other words a relation is in 3NF if it is in 2NF and having no transitive dependency. Consider the  relation  Student  as shown  in  Figure  6.12(a).

The Primary Key is (RollNo.). The condition is different Hostel is allotted for different semester. Student relation  is in  2NF but  still contains  anomalies.

  1. Deletion anomaly : If you want to delete student You loose information about Hostel H2 because he is the only student staying in hostel H2.
  2. Insertion anomaly : If you want to add a new Hostel H8 and this is not allotted to any You  cannot add  this information.
  3. Updation anomaly : If you want to change hostel of all students of first You have to search all the students of first semester and update them individually otherwise it causes inconsistency.

The solution of this problem is to divide relation Student into two relations Student(RollNo.Name, Semester) and Hostels(Semester, Hostel) as shown in Figure 6.13.

Now, deletion, insertion and updation operations can be performed without causing inconsistency.

Example. Given the relation BANK (Account#, Cust_No, Cust_Name, Balance). Determine whether the relation is in 1NF, 2NF, 3NF or unnormalizd. If it is not in 3NF, convert it into 3NF relations.

Sol. Since there does not exist multi-value attributes, hence the relation is at least 1NF.

  • There does not exist any partial dependency as the primary key has only one
  • There exists a transitive dependency

i.e. Cust_No → Cust_Name

Hence the relation is not in 3NF. To convert this relation into 3NF, make a new relation CUSTOMERS with attributes Cust_Name and Cust_No on which it depends. Therefore, the relations in 3NF are CUSTOMERS (Cust_No, Cust_Name) BANK (Account#, Cust_No, Balance).

Example. Given the dependency diagram shown in Figure 6.14. The primary key attributes are underlined.

  • Identify and discuss each of the indicated dependencies.
  • Create a database whose tables are at least in 3NF, showing dependency diagram for each table.
Sol.
  • There exists a partial dependency C1 → C2, since C2 depends only on C1 rather than on the  entire  primary  key  (C1,  C3).
    • There exists a transitive dependency C4 →C5, since C5 depends on the attribute C4, which is not a  part of  the primary
    • There exists a functional dependency C1, C3→ C2, C4, C5, since C2, C4, and C5 depend on the primary key (C1, C3).
  • The given relation is decomposed into 3NF and the obtained relations are as follows with their dependency diagrams.
    • R1 (C1, C2) with FD {C1 ® C2}. The functional dependency diagram is as follows.

  • R2 (C1, C3, C4) with FD {C1, C3 → C4}. The FD diagram is as

  • R3 (C4, C5) with FD {C4 → C5}. The FD diagram is as
2.3.1. Synthesis Algorithm for Third Normal Form Decomposition

We can always find a third normal form decomposition of a relation that is lossless and that preserves dependencies.  The algorithm  for the  third  normal form  decomposition is

Algorithm: To find a lossless join and FD-preserving 3NF decomposition of R

Input: Given a universal relation R and a set of functional dependencies, F, on R,

Output: A lossless  join  and  FD-preserving  3NF decomposition  of  R

  1. Calculate the minimal cover of F; call it G (Combine all FDs with same LHS into one FD using “union”  rule of  inference). Also compute  the  candidate keys  for R.
  2. For each FD  X → Y in  G, generate  a  relation  schema XY in  the decomposition.
  1. If there are some attributes, say Z, of R that do not appear in any decomposed schema, then create  a  separate  schema  in the  decomposition  for Z.
  2. If none of the decomposed schemes contain a candidate key, create a separate schema in the decomposition for one of the candidate keys K.

Example. Consider a relation schema with attributes R = ABCGWXYZ and the set of dependencies

𝐹 = {XZ → ZYB, YA → CG, C → W, B → G, XZ → G}.

  1. Find a minimal cover for 𝐹.
  2. Decompose R into a join-lossless 3NF database

Sol.

  1. The minimal cover for F is as follows: After right-reducing, we have

XZ → B
XZ → G
XZ → Y
YA → C
YA → G
C → W
B → G

It is easy to see that the left-hand sides XZ and YA (as well as C and B) cannot be reduced because there are no FDs of the form X → …, Y → …, Z → …, and A → … .

Further, XZ → G is redundant and none of the others. So, the minimal cover consists of the above set minus XZ → G.

  1. A join-lossless 3NF decomposition is as follows:

D = {XZBY, YACG, CW, BG, XZA} is a join-lossless 3NF database schema. Note that XZA is added because no other table in D that contains a key of R.

Example. Consider R(ABCDEF) and the following FDs

F = {A → BCE, B → DE, ABD → CF, AC → DE}.. Find a 3NF database for R under F.

Sol.
  1. The minimal cover for F is as follows: After Left Reduction:

Rule ABD → CF becomes A → CF
Rule AC → DE becomes A → DE
F = {A → BCE, B → DE, A → CF, A → DE}

After Right Reduction:

Rule A → BCE becomes A → B
Rule A → DE becomes A → E
F = {A → B, B → DE, A → CF, A → E}

After Removing Redundant Rule(s):

Rule A → E is redundant (Transitivity on A → B, B → E)
F = {A → B, B → DE, A → CF}.

Therefore, a minimal cover is

F = {A → BCF, B → DE}

  1. Thus, D = (ABCF, BDE) is a join-lossless 3NF database schema.

2.4.  Boyce Codd Normal Form (BCNF)

BCNF is a strict format of 3NF. A relation is in BCNF if and only if all determinants are candidate  keys.  BCNF  deals  with  multiple  candidate  keys.

Relations in 3NF also contains anomalies. Consider the relation Student as shown in Figure 6.15(a).

Assumptions:

  • Student can have more than  1 subject.
  • A Teacher can teach only 1 subject.
  • A subject can be  taught by  more than  1  teacher

There are two candidate keys (RollNo., Subject) and (RollNo., Teacher)

Relation Student is in 3NF but still contain anomalies.

  1. Deletion anomaly : If you delete student whose is 7. You will also loose information that Teacher T4 is  teaching the  subject VB.
  2. Insertion anomaly : If you want to add a new Subject VC++, you cannot do that until a student chooses subject VC++ and a teacher teaches subject VC++.
  3. Updation anomaly : Suppose you want to change Teacher for Subject You have to search all the students having subject C and update each record individually otherwise it causes inconsistency.

In relation Student, candidate key is overloaded. You can find Teacher by RollNo. and Subject. You can also find Subject by RollNo. and Teacher. Here RollNo. is overloaded. You can also find  Subject by Teacher.

The solution of this problem is to divide relation Student in two relations Stu-Teac and Teac-Sub as  shown in  Figure  6.17.

In this  solution all  determinants are  candidate keys.

2.4.1.  Decomposition Algorithm for BCNF with Lossless Join

It is always possible to find a decomposition of a relation that is Boyce-Codd Normal Form and that has the lossless join property. The process involves finding each violation of BCNF and removing it by decomposing the relation containing it into two relations. The process is repeated  until all  such  violations are  removed.  The algorithm  is

Given a  universal  relation R and  a  set of  functional  dependencies  on the  attributes  of  R:

  1. D ← R;
  2. while there is some relation schema S in D that is not already BCNF

{

    1. Find a functional dependency X→Y in S that violates BCNF
    2. Replace S by two  relation schemas  (S–Y) and  (X,Y)

}

Example. The relation is R (A, B, C, D, E) and the FD’s : A → E, BC → A and DE → B.  Decompose  the  relation  into  BCNF  with  lossless  join.

Sol. A you see that {A}+ = {A,E}, violating the BCNF condition. We split R to R1(A,E) and R2(A,B,C,D).

R1 satisfies BCNF now, but R2 not because of: {B,C}+ = {B,C,A}. Notice that the FD, D E → B has now disappeared and we don’t need to consider it! Split R2 to: R2(B,C,A) and R3(B,C,D).

Second Sol. You can split the R differently. Let’s try with the violation {B,C}+ = {B,C,A,E}. We initially split to R1(B,C,A,E) and R2(B,C,D). Now we need to resolve for R1 the violation {A}+ = {A,E}. So we split again R1 to R1(A,E) and R3(A,B,C). The same!

You can also start splitting by considering the BCNF violation {D,E}+ = {D,E,B}. The resulting BCNF  decomposition in  this  case will  be  a different  one.

2.5. Multivalued Dependencies and Fourth Normal Form

Definition: Let R be a relation having attributes or sets of attributes A, B, and C. There is a multivalued dependency of attribute B on attribute A if and only if the set of B values associated with a given A value is independent of the C values.

We write this as A →→ B and read it as A multidetermines B. If R has at least three attributes, A, B, and C then in R(A, B, C), if A →→ B, then A→→ C as well.

Alternate definition of Multivalued Dependency

More generally, if R is a relation with multivalued dependency A →→ B

then in any table for R, if two tuples, t1 and t2, have the same A value, then there must exist  two  other  tuples  t3  and  t4  obeying  these  rules

  1. t3 and t4 have the same A value as t1 and t2
  2. t3 has the same B value as t1
  3. t4 has the same B value as t2
  4. If R – B represents the attributes of R that are not in B, then the t2 and t3 have the same values for R – B and
  5. t1 and t4 have the same values for R – B

The dependency A →→ B is called a trivial multivalued dependency if B is a subset of A or A ∪ B  is  all of  R.  Now  we are  ready  to  consider fourth  normal  form.

Definition: A relation is in fourth normal form (4NF) if and only if it is in BoyceCodd normal form  and  there are  no  nontrivial  multivalued dependencies.

Fourth Normal  Form  (4NF)

A relation is in 4NF if it is in BCNF and for all Multivalued Functional Dependencies (MVD) of the form X →→ Y either X → Y is a trival MVD or X is a super key of relation.

Relations in BCNF also contains anomalies. Consider the relation Project-Work as shown in Figure  6.18.

 

Assumptions:

  • A Programmer can work on any number of projects.
  • A project can have more than one module.

Relation Project-work  is in  BCNF but  still contains  anomalies.

  1. Deletion anomaly : If you delete project You will loose information about Programmer P3.
  2. Insertion anomaly : If you want to add a new project You cannot add this project until it is assigned  to  any programmer.
  3. Updation anomaly : If you want to change name of project Then you have to search all the programmers having project 1 and update them individually otherwise it causes inconsistency.

Dependencies in Relation Project-work are

Programmer →→ Project
Project →→ Module

The solution of this problem is to divide relation Project-Work into two relations Prog-Prj (Programmer, Project) and Prj-Module (Project, Module) as shown in Figure 6.19.

2.6. Projection Join Normal Form (5NF) and Join Dependency

Join Dependency : Let R be a given relation upto 4NF and it decompose (Projected or divided) into {R1, R2, R3, …, Rn}. The relation R satisfy the join dependency * {R1, R2, R3, … Rn} if and only if joining of R1 to Rn = R.

Consider the Relation XYZ with attributes X# (Customer_ID), Y# (Account_ID) and Z# (Branch_ID) as  shown  in  Figure  6.20.

First, decompose XYZ into three relations XY, YZ and ZX.

When joined, these three relations give original form of XYZ. So relation XYZ satisfy the join dependency.

Deflnition Projection Join Normal Form (5NF) : Let R is a relation and D is a set of all dependencies (Multivalued dependency, Functional dependency, Join dependency etc.) The relation R is in 5NF w.r.t. D if every Join Dependency is trivial.

Or

A table is in fifth normal form (5NF) or Project-Join Normal Form (PJNF) if it is in 4NF and it cannot have a lossless decomposition into any number of smaller tables.

Properties of 5NF: The following are the properties  of 5NF

  1. Anomalies can occur in relations in 4NF if the primary key has three or more fields.
  2. 5NF is based on the concept of join dependence – if a relation cannot be decomposed any further then it is  in 5NF.
  3. Pairwise cyclical dependency means that:
    • You always need to know two values (pairwise).
    • For any one you must know the other two (cyclical).

5NF is the ultimate Normal form. A relation in 5 NF is guaranteed to be free of anomalies. Consider the example in Figure 6.20, where relations XY, YZ and ZX are in 5NF.

Example. Consider the following Buying table. This table is used to track buyers, what they buy, and from whom they buy? Take the following sample data:

The problem with the above table is that if HCL starts to sell computers then how many records must you create to record this fact? The problem is there as there are pair- wise cyclical dependencies in the primary key. That is, in order to determine the item you must know the buyer and vendor, and to determine the vendor you must know the buyer and the  item,  and  finally to  know  the  buyer  you must  know  the  vendor  and the  item.

The solution is to break this one table into three tables; BuyerVendor table, BuyerItem table and VendorItem table. All these three tables are in the 5NF.

2.7. 6NF or Domain-Key Normal Form (DKNF)

A table is in sixth normal form (6NF) or Domain-Key normal form (DKNF) if it is in 5NF and if all constraints and dependencies that should hold on the relation can be enforced simply by enforcing the domain constraints and the key constraints specified on the relation.

The domain-key normal form (DKNF) is a theoretical statement of how relationships are formed between tables.

Source: Gupta Satinder Bal, Mittal Aditya (2017), Introduction to Basic Database Management System, 2nd Edition-University Science Press (2017)

Leave a Reply

Your email address will not be published. Required fields are marked *