
Data normalization

What a normalized database looks like and why table structure matters.

Data normalization is the process of structuring information in a database to cut down on redundancy and make that database more efficient. Think of normalization as a way to make sure that every field and table in your database is organized logically, so that you can avoid data anomalies when inserting, updating, or deleting records. This process is carried out according to specific rules that dictate how tables should be organized.

Normalization is one part of the larger data cleaning and standardization process, which also involves confirming that your data is accurate, complete, and free of duplicate records, as well as ensuring that you've selected the appropriate data types for your fields. If you're starting with denormalized tables, the normalization process will involve creating additional, smaller tables that can be joined to one another by a foreign key. Maybe you've become frustrated with having to update the same information in multiple places across your database after a single value changes, or you're finding that you lose valuable data when a record gets deleted. Normalizing your tables will help in both of these cases.

The principles we’ll cover in this lesson apply to relational database management systems (RDBMS). If you’re using a NoSQL or document-based database like MongoDB, the information below won’t apply.

Simplifying queries and reducing storage: benefits of normalized data

Normalization is all about making your data more efficient, so that your team can find and use the information they need. These benefits and rules may seem like common sense once you have some familiarity with how databases work, but it pays to know the explicit purpose of each table and field within your database. Benefits of normalized data include:

  • Simplifying transactional queries. With normalized data, a query for customer addresses only needs to look in the single field that stores those addresses. If you store customer addresses multiple times in different locations across your database or even keep multiple addresses within the same field, that query will take longer to execute (see the sketch after this list).

  • Decreasing the size of your database. If you repeat customer data in several locations across your database, that means you’ve made space to store that information several times. This may not be a major concern if your database only contains a few tables, but if you’re working on a larger scale, disk space can be at a premium. Reducing duplicate information means cutting storage costs, whether you’re running a local server or relying on a cloud-hosted database.

  • Making database maintenance easier. Think of that same customer data stored several times in your database. Each time a customer changes their address, it will need to be updated in every instance of a Customer Address field, which leaves a lot of room for error. If your data is normalized, you'll only have one Customer Address field, which joins to other relevant tables like Orders.
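
As a rough sketch of that first point, here's what the address lookup might look like in SQL. The table and column names are hypothetical, not taken from any particular schema:

    -- Normalized: every customer address lives in one field of one table
    SELECT address
    FROM Customers
    WHERE customer_id = 'C004';

    -- Denormalized: the same lookup has to check every place an address
    -- might be stored and reconcile the results
    SELECT shipping_address FROM Orders WHERE customer_id = 'C004'
    UNION
    SELECT billing_address FROM Invoices WHERE customer_id = 'C004';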

Data anomalies

Data anomalies are inconsistencies in how information is stored within a database. These flaws with how a database is structured become apparent whenever something goes wrong when a record is updated, added, or deleted. Fortunately, adhering to the rules of normalization can prevent these anomalies from happening in the first place.

Update anomaly

Update anomalies stem from data redundancy. For example, let’s say your database stores customer address information in fields across several tables. A customer changing their address may result in only one of those fields updating to include the new information, leaving you with inconsistent data.

Insertion anomaly

An insertion anomaly occurs when a record can't be created unless certain fields contain data, and that data may not exist yet. For example, a denormalized database may be structured so that a customer account can't be created unless that customer has placed an order. Normalizing that database would solve this problem through the creation of separate Orders and Customers tables, with no rule prohibiting null values.
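
Here's a minimal sketch of that fix in SQL. The table and column names are hypothetical:

    -- Denormalized: order fields are required, so a customer record
    -- can't exist until that customer has placed an order
    CREATE TABLE CustomerOrders (
        customer_id   TEXT NOT NULL,
        customer_name TEXT NOT NULL,
        order_id      TEXT NOT NULL,
        order_date    DATE NOT NULL
    );

    -- Normalized: customers and orders live in separate tables, so a
    -- new customer record needs no order data at all
    CREATE TABLE Customers (
        customer_id   TEXT PRIMARY KEY,
        customer_name TEXT NOT NULL
    );

    CREATE TABLE Orders (
        order_id    TEXT PRIMARY KEY,
        order_date  DATE NOT NULL,
        customer_id TEXT REFERENCES Customers (customer_id)
    );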

Deletion anomaly

A deletion anomaly results in unintentional information loss. Let's say a table in your database includes information about university courses and the students who take those courses. If one course were cancelled due to low enrollment, you might inadvertently lose valuable student information by removing that course record. As with insertion anomalies, breaking your data out into multiple, specific tables would prevent this issue.

Rules of normalization

The rules for normalizing data were first introduced by computer scientist Edgar F. Codd in the early 1970s. These rules are grouped in tiers called normal forms. Each tier builds upon the last: you can only apply the second tier of rules if your data already meets the first tier of rules, and so on. While there are several more normal forms beyond the three listed below, these first three are sufficient for most use cases.

As we covered in our introduction to databases, tables within a database should contain an entity key, also known as a primary key. This field distinguishes each row within a table according to a unique ID and is helpful when joining tables. Before we can even get into first normal form, your table needs to have an entity key field.
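
In SQL, the entity key is declared with a PRIMARY KEY constraint. A minimal example, with hypothetical names:

    CREATE TABLE Students (
        student_id   TEXT PRIMARY KEY,  -- entity key: a unique ID for each row
        student_name TEXT NOT NULL
    );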

First normal form (1NF)

The first normal form (1NF) dictates that each field within a table should only store one value, and that your table shouldn’t contain multiple fields that store similar information, like columns titled Address1 and Address2.

Here’s an example of a table that we’ll normalize according to first normal form. This table includes information about college courses and who teaches them.

Professor table

Professor ID   Professor name   Course name
P001           Gene Watson      Intro to Philosophy; Ethics
P002           Melissa King     Quantum Mechanics
P003           Errol Tyson      Macroeconomics
P004           Mary Jacobson    Graphic Novels

We notice that while our fields are distinct, one professor (Gene Watson, in the first row) is teaching two courses, and that information is currently stored within a single cell. If we normalize this table according to 1NF, we’ll need to break our data out into multiple tables:

Normalized professor table

Professor ID   Professor name
P001           Gene Watson
P002           Melissa King
P003           Errol Tyson
P004           Mary Jacobson

Normalized course table

Course ID   Course name           Professor ID
C001        Intro to Philosophy   P001
C002        Ethics                P001
C003        Quantum Mechanics     P002
C004        Macroeconomics        P003
C005        Graphic Novels        P004

Since one professor can teach more than one course, we’ve broken this data out into two tables. Now, our Professor table has a one-to-many relationship with our Course table. This new table structure meets first normal form, and joins the two tables via a foreign key, the Professor ID field.
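
Sketched as SQL table definitions (the names and types are assumptions, not a prescribed schema), the one-to-many structure and the foreign key look like this:

    CREATE TABLE Professor (
        professor_id   TEXT PRIMARY KEY,
        professor_name TEXT NOT NULL
    );

    CREATE TABLE Course (
        course_id    TEXT PRIMARY KEY,
        course_name  TEXT NOT NULL,
        professor_id TEXT REFERENCES Professor (professor_id)  -- foreign key
    );

    -- Reassemble the original view of the data with a join
    SELECT p.professor_name, c.course_name
    FROM Professor AS p
    JOIN Course AS c ON c.professor_id = p.professor_id;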

Second normal form (2NF)

Second normal form is about reducing redundancy and making sure that every field describes something about what the entity key identifies. To meet 2NF, every field in a table that isn't part of the entity key must be fully dependent on the table's entity key (which may be a composite key made up of two or more fields). Let's look at a new example: a table that includes information about your employees' birthdays.

Employee birthday table

Employee ID   Birthday      Department
E001          November 18   Accounting
E002          March 29      Sales
E003          June 1        Marketing
E004          February 7    Accounting

This table meets 1NF, because each column is distinct and only holds one value within each cell. However, this table has a composite key: Employee ID + Birthday combined make up the table’s entity key. This table does not meet 2NF in its current state, because the Department field only partially depends on the composite key, since an employee’s department doesn’t depend on their birthday, only on their employee ID. To fix this, we’ll break this information out into two tables:

Normalized employee birthday table

Employee ID   Birthday
E001          November 18
E002          March 29
E003          June 1
E004          February 7

Normalized employee department table

Employee ID   Department
E001          Accounting
E002          Sales
E003          Marketing
E004          Accounting
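
Here's one way to sketch that split in SQL, following this example's composite-key premise (the names are hypothetical):

    -- Before: the entity key is the composite (employee_id, birthday),
    -- but department depends on employee_id alone
    CREATE TABLE EmployeeBirthdaysDenormalized (
        employee_id TEXT,
        birthday    TEXT,
        department  TEXT,
        PRIMARY KEY (employee_id, birthday)
    );

    -- After: every non-key field depends on the whole entity key
    -- of its own table
    CREATE TABLE EmployeeBirthdays (
        employee_id TEXT PRIMARY KEY,
        birthday    TEXT
    );

    CREATE TABLE EmployeeDepartments (
        employee_id TEXT PRIMARY KEY,
        department  TEXT NOT NULL
    );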

Third normal form (3NF)

A table meets third normal form if (in addition to meeting 2NF) it doesn’t contain any transitive dependency. Transitive dependency happens when Column A depends on Column B, and Column B depends on the entity key. If you want to normalize according to 3NF, you’ll need to remove Column A from the table, since it does not depend on the entity key directly, and place it in a different table with its own entity key.

Orders table

Order ID   Order date   Customer ID   Customer zip code
R001       01/17/2021   C032          99702
R002       03/01/2021   C004          39204
R003       06/30/2021   C054          06505
R004       08/22/2021   C010          84098
R005       09/27/2021   C004          39204

This table isn’t in third normal form because the Customer zip code field is dependent on Customer ID, which is not this table’s entity key (the entity key here is Order ID). Our current structure could lead to unwanted information loss; if customer C032 returned their order and we needed to delete this record, we’d unintentionally lose their zip code information. If customer C004 ever moved and their zip code changed, we’d also have to update it in two places, since they’ve placed multiple orders. To bring this table into 3NF — you guessed it — we’re going to break it out into two tables.

Normalized orders table

Order ID   Order date   Customer ID
R001       01/17/2021   C032
R002       03/01/2021   C004
R003       06/30/2021   C054
R004       08/22/2021   C010
R005       09/27/2021   C004

Normalized customers table

Customer ID   Customer zip code
C032          99702
C004          39204
C054          06505
C010          84098
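
With this structure, a change of address touches exactly one row, and orders can still be reported with their zip codes through a join. A sketch, with hypothetical names:

    -- The zip code now lives in one place, so one update covers every order
    UPDATE Customers
    SET customer_zip = '39205'
    WHERE customer_id = 'C004';

    -- Orders can still be listed with their zip codes via a join
    SELECT o.order_id, o.order_date, c.customer_zip
    FROM Orders AS o
    JOIN Customers AS c ON c.customer_id = o.customer_id;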

Drawbacks to normalization: when to denormalize

Once you reach higher levels of normalization, your database may perform certain analytical queries at a slower rate — particularly those that need to grab a lot of data. Since normalized data demands that a database tap into several tables to perform a query, this can take longer, especially as your database grows in complexity. The tradeoff is that your normalized data takes up less space.
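
To make the tradeoff concrete, here's a sketch of the same report written both ways. The denormalized reporting table is hypothetical:

    -- Normalized: the report needs a join to reach the zip code
    SELECT c.customer_zip, COUNT(*) AS order_count
    FROM Orders AS o
    JOIN Customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.customer_zip;

    -- Denormalized: the same report reads a single table, at the cost
    -- of storing the zip code on every order row
    SELECT customer_zip, COUNT(*) AS order_count
    FROM OrdersReporting
    GROUP BY customer_zip;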
