This page looks best with JavaScript enabled

Database Normalization: Scratching the Database Surface

 ·   ·  ☕ 7 min read

Introduction

Database normalization is the process of structuring a database, in accordance with a series of normal forms in order to reduce data redundancy and improve integrity.

In other words, database normalization is basically breaking a table into two to prevent repetitive columns aka data redundancy. It entails organizing the columns (attributes) and tables (relations) of a database to ensure that their dependencies are properly enforced by database integrity constraints. It is accomplished by applying some formal rules either by a process of synthesis (creating a new database design) or decomposition (improving an existing database design).

That still does not makes sense, right? For this to make sense I would like to define some prerequisites.

Definitions

  1. Primary Key: Primary key is the key or column selected by database administrator to uniquely identify tuples (aka records) in a table. There can be only 1 primary key in a table.
  2. Foreign Key: Foreign key are primary key of a different table and is used to create relationship between two tables.
  3. Composite Key: If there is no single primary key, we select more than one columns whose combintion would act as a primary key. This is called composite key.

What are normal forms by the way?

We already defined what normalization is. Now there are a few ‘passes’ in which data can be normalized. One pass is known as a normal form.

So what are different normal forms?

1. First Normal Form

This is step 1 of the normalization process. At least follow 1st normal form or do not use databases. Scalable Table design which can be extended.

  1. There should be a primary key in the table.
  2. Each column should contain atomic values. Meaning only one value should be there, not a set or list of values.

A column should contain values that are of the same type. Do not inter-mix different types of values in any column. Most RDBMS will throw you an error if you pass another datatype than schema. Each column should have a unique name. Same names lead to confusion at the time of data retrieval.

Order in which data is saved doesn’t matter. Using SQL query, you can easily fetch data in any order from a table.

Break the list of two items into 2 rows. This will create redundancy, but let’s wait for the Second Normal Form. But first, let’s look at an example. Below is a table without 1st normal form applied. This dataset is of a car rental company where car owners can exibit their car for rent. And other party can book their car for a certain period of time.

Customer ID Customer Name Car Number Plate Car Name Date of Transaction Owner ID Owner Name
C12 Sachin CarQ1234 CarQ5436 Swift Thar 12/01/2020 18/01/2020 076 078 Dev Irfan
C46 Rahul CarQ3421 CarQ6534 CarQ3789 Baleno Honda City Swift 12/01/2020 14/01/2020 15/01/2020 054 065 086 Rohit Shikhar Irfan

Above data is not in 1st normal form because each customer has multiple car number plates in a single row.

Customer ID Customer Name Car Number Plate Car Name Date of Transaction Owner ID Owner Name
C12 Sachin CarQ1234 Swift 12/01/2020 076 Dev
C12 Sachin CarQ5436 Thar 18/01/2020 078 Irfan
C46 Rahul CarQ3421 Baleno 12/01/2020 054 Rohit
C46 Rahul CarQ6534 Honda City 14/01/2020 065 Shikhar
C46 Rahul CarQ3789 Swift 15/01/2020 086 Irfan

Is it in first normal form now? Definately not. It does not have a primary key.

If you look closely, none of the field is unique and every entry is prone to repeat, even Car Number Plate; which will repeat as the car is booked by a different person.

In a case where there is no single primary key, we have a select a composite key, which as we know is combination of two column which together can act unique among the row.

We can select Car Number Plate + Date of Transation as a composite key. Here we assume that one car can be booked by only one person in a day. By taking these two fields, we can say our table is in first normal form now.

2. Second Normal Form

To have a table in 2nd normal form:

  1. The table to be in 1st Normal Form.
  2. There should be no Partial Functional Dependencies.

What is Partial Functional Dependency?

Let us understand this by an example.

We have our composite key. By keeping that aside and looking at rest of the table, we can say that Car Name is dependent on one of the field from our composite key; Car Plate Number.

When field is only dependent on one of the composite fields, it is called that a particular field is having partial functional dependency. PFD only happens in the case of composite key.

Owner ID is also PFD on Car Number Plate. Same with Owner name.

We’ll have to break our dataset into two tables.

Car Number Plate Customer ID Customer Name Date of Transaction
CarQ1234 C12 Sachin 12/01/2020
CarQ5436 C12 Sachin 18/01/2020
CarQ3421 C46 Rahul 12/01/2020
CarQ6534 C46 Rahul 14/01/2020
CarQ3789 C46 Rahul 15/01/2020

The Car Number Plate in this table acts as a foreign key in the other table.

And this is our another table:

Car Number Plate Car Name Owner ID Owner Name
CarQ1234 Swift 076 Dev
CarQ5436 Thar 078 Irfan
CarQ3421 Baleno 054 Rohit
CarQ6534 Honda City 065 Shikhar
CarQ3789 Swift 086 Irfan

3. Third Normal Form

For a table to be in 3rd Normal Form:

  • It should be in 2nd Normal Form
  • There should be no transitive dependencies.

What is a Transitive Dependency?

Transitive Dependency can also be said as Indirect dependencies. Let us take an example again.

Look at table 2 from previous section. Owner Name depends on Owner ID and Owner ID dependens on Car Number Plate. You can’t find Owner Name directly from the Car Number Plate (there can be repeating name). The other two fields, Car Name and Owner ID directly depends on the Car Number Plate.

We now have to further split two tables into other 2. First table from previous section is now split into two.

Car Number Plate Customer ID Date of Transaction
CarQ1234 C12 12/01/2020
CarQ5436 C12 18/01/2020
CarQ3421 C46 12/01/2020
CarQ6534 C46 14/01/2020
CarQ3789 C46 15/01/2020

And this…

Customer ID Customer Name
C12 Sachin
C46 Rahul

Now the second table is also split into two:

Car Number Plate Car Name Owner ID
CarQ1234 Swift 076
CarQ5436 Thar 078
CarQ3421 Baleno 054
CarQ6534 Honda City 065
CarQ3789 Swift 086

An this…

Owner ID Owner Name
076 Dev
078 Irfan
054 Rohit
065 Shikhar
086 Irfan

Why we normalize tables?

Anamolies. There are different type of anamolies.

What are different type of anamolies?

  1. Updation Anamolies

  2. Deletion Anamolies

  3. Insertion Anamolies

We’ll get to know more about this with examples.

Updation Anamolies

So for setting the stage, let’s assume that our table is not normalized. Look at the table from 1st normal form.

Now assume that Dev has sold his car to Roushan. If you go ahead and replace the Owner Name in first row, you’ll still have N number of rows remaining to update.

We have to update fields at mulpitle place.

Deletion Anamolies

Although it’s not practicle in the table above. Assume that you have to remove a customer. If you remove a customer Sachin, you’ll be deleting all the data with it including Car Number Plate, Car Name, Owner ID/Name.

This is known as Deletion Anamolies.

Insertion Anamolies

In Insertion Anamolies, you can’t insert a new row without you have all the other data associated with that row. For example, you want to add a new car to the database. You also need a customer, date of transaction and so on.

This is known as Insertion Anamolies.

Why not to normalize database

If your system uses non-relational database, you might be against normalization. Although relation database can scale horizontally too, it’s not by design efficient at that.

If you have multiple table on multiple machines, you will have a network overload with each join. It does not scales well.

If you are using databases such as document based database, you’d want to have all related data on same document. So that when you shard, you don’t have to deal with network overload making the application slower.

Conclusion

This was my first post realted to databases in my blogging history. I have learn from it more than teaching from it. I hope this post was worth of your time too.

Cover image is (c) Integrify.com

Share on

Santosh Kumar
WRITTEN BY
Santosh Kumar
Santosh is a Software Developer currently working with NuNet as a Full Stack Developer.