With technologies like AI and IoT becoming mainstream and more and more businesses developing customer-centric applications, we’re living in a data storm. Enterprises are collecting data on a large scale and using it to gain every bit of edge they can in competitive markets. By 2025, annual revenue from global big data analytics is expected to reach 68.09 billion USD, and around 181 zettabytes of data are expected to be created globally.
Big data is being used to improve everything from industrial systems (through technographic data collection) to the customer experience on a ride-hailing app. It feeds machine learning models and deep learning algorithms that can do everything from reading out your calendar entries to identifying tumors in X-rays.
But the constant refrain about data still applies: garbage in, garbage out. To get good results out of any system that uses data, you need good, clean, unbiased data. And one aspect of cleaning this data is data normalization.
In this article, we explore what data normalization is, how it works, what its benefits are, and who stands to benefit from it.
What is data normalization?
Simply put, data normalization is cleaning up collected data to make it clearer and more machine-readable.
Often, the data collected by different systems is in different formats; there may be duplicates and unnecessary data as well. You won’t get good results or insights if you try to visualize or analyze it in that state. Messy or cluttered data is difficult to understand, and redundant data creates unnecessary storage costs.
In data normalization, the data is made consistent, duplicates and other errors are removed, and everything is brought into a common format so that it’s easier to interpret.
Here’s a highly simplified explanation of data normalization:
Consider that you and your colleague go to conduct surveys at homes in a locality. To make the task easier, you take half the homes and your colleague takes the other half. You go to the first home and collect the data in the following format:
Name: John Doe
Date of birth: 23/7/1995
Job: Rocket scientist
Name: Jane Doe
Date of birth: 11/12/1996
Job: Neurosurgeon
In this format, you collect the data of the rest of the homes. Your colleague visits their set of homes and collects the data in the following format:
First Name: John
Last Name: Doe
Date of birth: 07/22/1989
Job: Rocket scientist
Kids: 2
First Name: Jane
Last Name: Doe
Date of birth: 09/24/1969
Job: Neurosurgeon
Kids: 3
You get back, look at the two data sets, and realize that they are in different formats. Not only that: in some cases, both of you have gone to the same homes. And your colleague has collected an additional, unnecessary data point.
In this case, the data is in different formats, there are duplicate entries, and there is unnecessary data present. With data normalization, the goal is to remove the duplicate and unnecessary entries, standardize the data, and make it easy to analyze.
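For a rough sense of what this clean-up can look like in practice, here is a minimal pandas sketch of the survey example. Everything in it is a hypothetical illustration rather than a prescription: the column names, the toy values, and the choice of Name as the deduplication key are all assumptions. Each surveyor’s dates are parsed with the format that set actually used, the extra data point is dropped, and duplicate entries are removed.

```python
import pandas as pd

# Hypothetical versions of the two surveyors' data sets.
yours = pd.DataFrame({
    "Name": ["John Doe", "Jane Doe"],
    "Date of birth": ["23/7/1995", "11/12/1996"],   # day/month/year
    "Job": ["Rocket scientist", "Neurosurgeon"],
})
colleagues = pd.DataFrame({
    "First Name": ["John", "Jane"],
    "Last Name": ["Doe", "Doe"],
    "Date of birth": ["07/22/1989", "09/24/1969"],  # month/day/year
    "Job": ["Rocket scientist", "Neurosurgeon"],
    "Kids": [2, 3],
})

# Bring both sets into the same shape: one Name column, no extra data point.
colleagues["Name"] = colleagues["First Name"] + " " + colleagues["Last Name"]
colleagues = colleagues.drop(columns=["First Name", "Last Name", "Kids"])

# Parse each set's dates with the format that set actually used.
yours["Date of birth"] = pd.to_datetime(yours["Date of birth"], format="%d/%m/%Y")
colleagues["Date of birth"] = pd.to_datetime(colleagues["Date of birth"], format="%m/%d/%Y")

# Stack the two sets and drop duplicate entries for homes you both visited.
# (A real key would be more robust than Name alone; this is just the idea.)
combined = pd.concat([yours, colleagues], ignore_index=True)
combined = combined.drop_duplicates(subset=["Name"])
print(combined)
```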
A similar situation happens when you use multiple systems that collect data. For example, let’s say you’re using a system that collects the number of calls answered by individual agents at your service desk every week, and another system that tracks the time agents spend talking to users every hour. Both are valuable data points, but the output from the first system covers a whole week, while the output from the second system covers every hour.
And if someone collects this data and puts it into a visualization system, it may not make much sense until the data is cleaned and made consistent.
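One common way to reconcile data like this is to bring everything to the same time granularity before combining it. The sketch below is a hypothetical illustration with made-up column names and numbers: the hourly talk-time records are rolled up to weekly totals per agent so they can sit next to the weekly call counts.

```python
import pandas as pd

# Hypothetical hourly talk-time log from the second system.
hourly = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 10:00", "2024-01-01 11:00",
        "2024-01-02 09:00", "2024-01-02 10:00", "2024-01-02 11:00",
    ]),
    "agent": ["A", "A", "B", "A", "B", "B"],
    "minutes_talking": [12, 30, 45, 20, 15, 50],
})

# Roll the hourly records up to one row per agent per week, matching the
# granularity of the weekly call-count system.
weekly_talk_time = (
    hourly.set_index("timestamp")
    .groupby("agent")["minutes_talking"]
    .resample("W")
    .sum()
    .reset_index()
    .rename(columns={"timestamp": "week"})
)

# Hypothetical weekly call counts from the first system.
weekly_calls = pd.DataFrame({
    "agent": ["A", "B"],
    "week": pd.to_datetime(["2024-01-07", "2024-01-07"]),
    "calls_answered": [120, 95],
})

# Both data sets now share the same granularity and can be joined directly.
combined = weekly_talk_time.merge(weekly_calls, on=["agent", "week"])
print(combined)
```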
How does data normalization work?
As we discussed, data normalization means removing redundancies from data and making it easy to store, visualize, and analyze. For databases, there are different levels of normalization, each with its own characteristics and rules. They’re referred to as first normal form (1NF), second normal form (2NF), third normal form (3NF), and so on.
Let’s consider a simple database showing customer details:
CustID | CustName | CustAge | Purchases
1 | Mark | 19 | Sugar, Rice
2 | Paul | 18 | Milk, Carrot
3 | Suzanne | 17 | Banana, Icecream
The above table is not normalized. The problem is that a single customer has more than one value in the Purchases cell, which makes the data hard to search. A similar situation can be observed if you’re recording the marks of students or the services offered by a company.
To make this a 1NF database, each cell should hold only a single value, and every record needs to be unique.
CustID | CustName | CustAge | Purchases
1 | Mark | 19 | Sugar
1 | Mark | 19 | Rice
2 | Paul | 18 | Milk
2 | Paul | 18 | Carrot
3 | Suzanne | 17 | Banana
3 | Suzanne | 17 | Icecream
As you can see, the database is now more searchable. But there’s a lot of redundant data: for every single item purchased, we’re storing the customer’s name and age again, even when it’s the same customer. To avoid this, we move on to 2NF.
For a database to be in 2NF, it first has to be in 1NF. In addition, all non-key columns (columns that can’t be used to identify a record) should be fully functionally dependent on the primary key (the column used to identify a record uniquely, in this case CustID). Sounds complicated, but it's easy enough to understand.
Here we split the above table into two.
CustID | CustName | CustAge
1 | Mark | 19
2 | Paul | 18
3 | Suzanne | 17

CustID | Purchases
1 | Sugar
1 | Rice
2 | Milk
2 | Carrot
3 | Banana
3 | Icecream
As you can see, as you normalize the data, it becomes easier to analyze, and you don’t have to store the same data many times. In the same way, data can be further normalized to 3NF, 4NF, and beyond.
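To make those two steps concrete, here is a rough pandas sketch of the same transformation on the toy customer data. It’s only an illustration, with in-memory DataFrames standing in for real database tables: the multi-valued Purchases cell is split into one row per item for 1NF, and the customer attributes are then separated from the purchases for a 2NF-style layout.

```python
import pandas as pd

# The unnormalized table: one row per customer, multiple items in one cell.
raw = pd.DataFrame({
    "CustID": [1, 2, 3],
    "CustName": ["Mark", "Paul", "Suzanne"],
    "CustAge": [19, 18, 17],
    "Purchases": ["Sugar, Rice", "Milk, Carrot", "Banana, Icecream"],
})

# 1NF: one value per cell. Split the Purchases string and give each item its own row.
first_nf = raw.assign(Purchases=raw["Purchases"].str.split(", ")).explode("Purchases")

# 2NF: CustName and CustAge depend only on CustID, so they move to their own table;
# the purchases table keeps just the CustID each item belongs to.
customers = first_nf[["CustID", "CustName", "CustAge"]].drop_duplicates()
purchases = first_nf[["CustID", "Purchases"]]

print(customers)
print(purchases)
```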
Who stands to benefit from data normalization?
In this age where organizations and enterprises are undergoing rapid digital transformation, collecting data about their digital processes, analyzing it, and making decisions using those insights is no longer just a good-to-have ability for an organization, but a must-have. Companies are using cutting-edge data science tools to improve their workflows, the products and services they deliver, employee and customer experience, and more.
And every one of these use cases depends on good, clean, non-redundant, and easy-to-analyze data. So organizations that work with large databases, such as those in healthcare, pharmaceuticals, customer care, marketing, and other sectors, have to use data normalization techniques to make sense of their data.
One of the most common and important use cases of data is in developing AI systems. Modern artificial intelligence systems use artificial neural networks to produce their output. And these neural networks, just like the human brain, learn from data.
For example, if you want to develop a system that can detect lung cancer from chest X-rays, you’ll have to feed the system a lot of X-rays of patients with and without lung cancer, labeled as such. The more data you have, the better the system.
But as mentioned in the beginning, garbage in, garbage out. Poor-quality data gives poor results: the system can generate a large number of false positives and false negatives.
The same applies to other AI systems as well. When it comes to AI and data, normalization is only part of the game; other factors, like data sources and representation within the data, also determine the final results. But normalization is still a huge part.
Organizations that use IoT, automation, predictive maintenance, inventory management systems, supply chain, and logistics solutions all deal with a large amount of data daily. These organizations stand to gain significantly by using data normalization techniques and standardizing their data.
What are the benefits of data normalization?
More storage space available
Normalization reduces redundancies and duplicates in databases. It’s apparent from the examples above that normalization significantly reduces the amount of data stored. It may not be obvious from the small tables here, with just a few rows and columns, but when you’re dealing with terabytes and petabytes of data, data normalization can bring significant monetary benefits to organizations.
Besides the savings from storage and maintenance costs, it can also reduce the bandwidth required for sharing and transferring data. When multiple parties have to examine or analyze the data, the large-scale distribution and access alone can prove to be costly. Here data normalization can have a not-so-small impact on the organization’s bottom line.
Easy to sort through data
Normalized data is easy to work with. Whether you want to find all customers who have made a purchase in a given month or you want to find the employees in your team with the most consistent performance, it’s much easier when you have a normalized database.
It’s also easier to spot patterns or trends once you’ve normalized your data. There are far fewer columns, which lets you view more records on a single page. And with less data to handle, your database, as well as any data analysis application you use, tends to be faster.
Improve user experience
A normalized database is easy to update and maintain. Without normalization, adding new entries or updating existing ones can be difficult, with the user dealing with multiple error messages. In the hands of a novice user, this can lead to a loss of integrity, with duplicate entries or missing data points in the database.
Once the database is normalized, users can easily access the data and ensure that there are no errors, duplicates, or redundancies in the database. With standardization and improved consistency, multiple users can easily read and understand the data.
Easier to work with multiple data analysis tools
A standardized and normalized database can be connected easily to data processing tools to visualize and analyze it. Without standardization, it will be tricky to work with these solutions, and they can even produce incorrect output, which can be costly. For example, duplicate customer purchase entries can grossly inflate your profit estimates, and incorrectly scaled customer purchase data can give you a wrong picture of shopping patterns and mess up your inventory plans.
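To get a feel for how much a few duplicate rows can distort a figure, here is a tiny hypothetical example (the order IDs and amounts are made up): summing a purchase amount before and after deduplicating on an order ID.

```python
import pandas as pd

# Hypothetical purchase records; order 1002 was accidentally loaded twice.
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003],
    "amount": [250.0, 400.0, 400.0, 150.0],
})

# Summing the raw data double-counts the duplicated order.
print(orders["amount"].sum())                                        # 1200.0, inflated
print(orders.drop_duplicates(subset=["order_id"])["amount"].sum())   # 800.0, actual total
```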
Data analysts can plug in normalized databases and produce insights more easily. In the long run, data normalization can help you save significant costs associated with using these tools.
Better output
As with the case of AI systems we discussed above, better, cleaner, standardized data produces better results. From analyzing the results of a clinical study to understanding the products that bring the best revenue to a supermarket, good data is key to good actionable insights. And data normalization is an important step for this.
Common mistakes during data normalization
Normalizing without understanding the metrics or goals
Before we go further, let’s make one thing very clear: normalize your data as much as you can, and keep normalization in mind when designing a database.
Having said that, it's important to understand why normalization is needed. Normalization improves data integrity, reduces disk space usage, and improves performance.
But in some cases, data normalization can make things unnecessarily complex. The more you normalize, the more tables you end up with, and the more difficult the database becomes to build and query. For example, sometimes you just want to try out a proof of concept or prototype, and the goal is to launch as soon as possible. Here, data normalization can take up extra resources and may not be ideal.
In some cases, the data breaks the rules of normalization, but for the specific business use case, fixing that may not be necessary.
And in some cases, data normalization can even slow things down, since queries have to join data spread across many tables.
So while building a database, you should lean towards normalization, but it’s also important to keep the business case in mind.
Using inconsistent or unreadable names
While normalizing a database, you’ll inevitably end up adding a lot of new columns and tables. So when naming these columns, it's important to use names that help another user or developer easily understand what’s in the column. You can borrow the naming conventions used for variables to create easy-to-understand names. Describe the contents of the column, but don't make the name too lengthy. It’s best not to use spaces in names, as they can create problems further down the line.
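As a small, purely hypothetical illustration of the same idea applied to a data set, here is what renaming columns to a consistent, descriptive, space-free convention might look like (the original and new names are made up):

```python
import pandas as pd

# Hypothetical table with inconsistent, space-laden column names.
df = pd.DataFrame(columns=["Cust ID", "Date of last purchase", "AMT"])

# Rename to short, descriptive, space-free names any developer can read.
df = df.rename(columns={
    "Cust ID": "customer_id",
    "Date of last purchase": "last_purchase_date",
    "AMT": "purchase_amount",
})
print(list(df.columns))
```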
Poor documentation and testing
As discussed above, the database and the data are meant to be used later by users and other developers. If at some point something breaks, or if someone doesn’t understand something, they should have proper documentation to refer to. After a while, even the people who developed the database may forget important aspects, and without proper documentation, troubleshooting or maintenance may be costly.
Another common issue is the lack of testing. Before delivery, it’s important to test the database thoroughly to make sure you haven’t missed anything. Poor design or implementation can be costly to the end users, both in terms of incorrect insights and the cost involved in fixing the issues.
Frequently asked questions
What is data normalization?
Data normalization is a set of techniques and rules used to improve the consistency of data, standardize it, and maintain its integrity. The process removes any redundancies or duplicates in the database and reduces the storage space requirements.
What are the benefits of data normalization?
Normalized data is easy to analyze and interpret. Since the size is vastly reduced, it's also easy to store, distribute, and maintain. And it’s easy to get insights from it.