Introduction
Database normalization is the process of structuring a database, in accordance with a series of normal forms in order to reduce data redundancy and improve integrity.
In other words, database normalization is basically breaking a table into two to prevent repetitive columns aka data redundancy. It entails organizing the columns (attributes) and tables (relations) of a database to ensure that their dependencies are properly enforced by database integrity constraints. It is accomplished by applying some formal rules either by a process of synthesis (creating a new database design) or decomposition (improving an existing database design).
That still does not makes sense, right? For this to make sense I would like to define some prerequisites.
Definitions
- Primary Key: Primary key is the key or column selected by database administrator to uniquely identify tuples (aka records) in a table. There can be only 1 primary key in a table.
- Foreign Key: Foreign key are primary key of a different table and is used to create relationship between two tables.
- Composite Key: If there is no single primary key, we select more than one columns whose combintion would act as a primary key. This is called composite key.
What are normal forms by the way?
We already defined what normalization is. Now there are a few ‘passes’ in which data can be normalized. One pass is known as a normal form.
So what are different normal forms?
1. First Normal Form
This is step 1 of the normalization process. At least follow 1st normal form or do not use databases. Scalable Table design which can be extended.
- There should be a primary key in the table.
- Each column should contain atomic values. Meaning only one value should be there, not a set or list of values.
A column should contain values that are of the same type. Do not inter-mix different types of values in any column. Most RDBMS will throw you an error if you pass another datatype than schema. Each column should have a unique name. Same names lead to confusion at the time of data retrieval.
Order in which data is saved doesn’t matter. Using SQL query, you can easily fetch data in any order from a table.
Break the list of two items into 2 rows. This will create redundancy, but let’s wait for the Second Normal Form. But first, let’s look at an example. Below is a table without 1st normal form applied. This dataset is of a car rental company where car owners can exibit their car for rent. And other party can book their car for a certain period of time.
Customer ID | Customer Name | Car Number Plate | Car Name | Date of Transaction | Owner ID | Owner Name |
---|---|---|---|---|---|---|
C12 | Sachin | CarQ1234 CarQ5436 | Swift Thar | 12/01/2020 18/01/2020 | 076 078 | Dev Irfan |
C46 | Rahul | CarQ3421 CarQ6534 CarQ3789 | Baleno Honda City Swift | 12/01/2020 14/01/2020 15/01/2020 | 054 065 086 | Rohit Shikhar Irfan |
Above data is not in 1st normal form because each customer has multiple car number plates in a single row.
Customer ID | Customer Name | Car Number Plate | Car Name | Date of Transaction | Owner ID | Owner Name |
---|---|---|---|---|---|---|
C12 | Sachin | CarQ1234 | Swift | 12/01/2020 | 076 | Dev |
C12 | Sachin | CarQ5436 | Thar | 18/01/2020 | 078 | Irfan |
C46 | Rahul | CarQ3421 | Baleno | 12/01/2020 | 054 | Rohit |
C46 | Rahul | CarQ6534 | Honda City | 14/01/2020 | 065 | Shikhar |
C46 | Rahul | CarQ3789 | Swift | 15/01/2020 | 086 | Irfan |
Is it in first normal form now? Definately not. It does not have a primary key.
If you look closely, none of the field is unique and every entry is prone to repeat, even Car Number Plate; which will repeat as the car is booked by a different person.
In a case where there is no single primary key, we have a select a composite key, which as we know is combination of two column which together can act unique among the row.
We can select Car Number Plate + Date of Transation as a composite key. Here we assume that one car can be booked by only one person in a day. By taking these two fields, we can say our table is in first normal form now.
2. Second Normal Form
To have a table in 2nd normal form:
- The table to be in 1st Normal Form.
- There should be no Partial Functional Dependencies.
What is Partial Functional Dependency?
Let us understand this by an example.
We have our composite key. By keeping that aside and looking at rest of the table, we can say that Car Name is dependent on one of the field from our composite key; Car Plate Number.
When field is only dependent on one of the composite fields, it is called that a particular field is having partial functional dependency. PFD only happens in the case of composite key.
Owner ID is also PFD on Car Number Plate. Same with Owner name.
We’ll have to break our dataset into two tables.
Car Number Plate | Customer ID | Customer Name | Date of Transaction |
---|---|---|---|
CarQ1234 | C12 | Sachin | 12/01/2020 |
CarQ5436 | C12 | Sachin | 18/01/2020 |
CarQ3421 | C46 | Rahul | 12/01/2020 |
CarQ6534 | C46 | Rahul | 14/01/2020 |
CarQ3789 | C46 | Rahul | 15/01/2020 |
The Car Number Plate in this table acts as a foreign key in the other table.
And this is our another table:
Car Number Plate | Car Name | Owner ID | Owner Name |
---|---|---|---|
CarQ1234 | Swift | 076 | Dev |
CarQ5436 | Thar | 078 | Irfan |
CarQ3421 | Baleno | 054 | Rohit |
CarQ6534 | Honda City | 065 | Shikhar |
CarQ3789 | Swift | 086 | Irfan |
3. Third Normal Form
For a table to be in 3rd Normal Form:
- It should be in 2nd Normal Form
- There should be no transitive dependencies.
What is a Transitive Dependency?
Transitive Dependency can also be said as Indirect dependencies. Let us take an example again.
Look at table 2 from previous section. Owner Name depends on Owner ID and Owner ID dependens on Car Number Plate. You can’t find Owner Name directly from the Car Number Plate (there can be repeating name). The other two fields, Car Name and Owner ID directly depends on the Car Number Plate.
We now have to further split two tables into other 2. First table from previous section is now split into two.
Car Number Plate | Customer ID | Date of Transaction |
---|---|---|
CarQ1234 | C12 | 12/01/2020 |
CarQ5436 | C12 | 18/01/2020 |
CarQ3421 | C46 | 12/01/2020 |
CarQ6534 | C46 | 14/01/2020 |
CarQ3789 | C46 | 15/01/2020 |
And this…
Customer ID | Customer Name |
---|---|
C12 | Sachin |
C46 | Rahul |
Now the second table is also split into two:
Car Number Plate | Car Name | Owner ID |
---|---|---|
CarQ1234 | Swift | 076 |
CarQ5436 | Thar | 078 |
CarQ3421 | Baleno | 054 |
CarQ6534 | Honda City | 065 |
CarQ3789 | Swift | 086 |
An this…
Owner ID | Owner Name |
---|---|
076 | Dev |
078 | Irfan |
054 | Rohit |
065 | Shikhar |
086 | Irfan |
Why we normalize tables?
Anamolies. There are different type of anamolies.
What are different type of anamolies?
Updation Anamolies
Deletion Anamolies
Insertion Anamolies
We’ll get to know more about this with examples.
Updation Anamolies
So for setting the stage, let’s assume that our table is not normalized. Look at the table from 1st normal form.
Now assume that Dev has sold his car to Roushan. If you go ahead and replace the Owner Name in first row, you’ll still have N number of rows remaining to update.
We have to update fields at mulpitle place.
Deletion Anamolies
Although it’s not practicle in the table above. Assume that you have to remove a customer. If you remove a customer Sachin, you’ll be deleting all the data with it including Car Number Plate, Car Name, Owner ID/Name.
This is known as Deletion Anamolies.
Insertion Anamolies
In Insertion Anamolies, you can’t insert a new row without you have all the other data associated with that row. For example, you want to add a new car to the database. You also need a customer, date of transaction and so on.
This is known as Insertion Anamolies.
Why not to normalize database
If your system uses non-relational database, you might be against normalization. Although relation database can scale horizontally too, it’s not by design efficient at that.
If you have multiple table on multiple machines, you will have a network overload with each join. It does not scales well.
If you are using databases such as document based database, you’d want to have all related data on same document. So that when you shard, you don’t have to deal with network overload making the application slower.
Conclusion
This was my first post realted to databases in my blogging history. I have learn from it more than teaching from it. I hope this post was worth of your time too.
Cover image is (c) Integrify.com