Data normalization is the process of structuring information in a database to reduce redundancy and make the database more efficient. Think of normalization as a way to ensure that every field and table in your database is organized logically, so that you can avoid data anomalies when inserting, updating, or deleting records. This process is carried out according to specific rules that dictate how tables should be organized.

Normalization is one part of the larger data cleaning and standardization process, which also involves confirming that your data is accurate, complete, and free of duplicate records, as well as ensuring that you've selected the appropriate data types for your fields. If you're starting with denormalized tables, the normalization process will involve creating additional, smaller tables that can be joined to one another by a foreign key.

Maybe you've become frustrated with having to update the same information in multiple places across your database after a single value changes, or you're finding that you lose valuable data when a record gets deleted. Normalizing your tables will help in both of these cases.

The principles we'll cover in this lesson apply to relational database management systems (RDBMS). If you're using a NoSQL or document-based database like MongoDB, the information below won't apply.

Simplification and reducing storage: benefits of normalized data

Normalization is all about making your data more efficient, so that your team can find and use the information they need. These benefits and rules may seem like common sense once you have some familiarity with how databases work, but it pays to know the explicit purpose of each table and field within your database. Benefits of normalized data include a simpler, more logical table structure, less storage consumed by redundant values, and fewer anomalies when records are inserted, updated, or deleted.
Data anomalies

Data anomalies are inconsistencies in how information is stored within a database. These flaws in how a database is structured become apparent whenever something goes wrong when a record is updated, added, or deleted. Fortunately, adhering to the rules of normalization can prevent these anomalies from happening in the first place.

Update anomaly

Update anomalies stem from data redundancy. For example, let's say your database stores customer address information in fields across several tables. A customer changing their address may result in only one of those fields updating to include the new information, leaving you with inconsistent data.

Insertion anomaly

An insertion anomaly occurs when a record can't be created unless certain fields contain data, and that data may not exist yet. For example, a denormalized database may be structured so that a customer account can't be created unless that customer has placed an order. Normalizing that database would solve this problem through the creation of separate Orders and Customers tables, with no rule prohibiting null values.

Deletion anomaly

Unintentional information loss is the result of a deletion anomaly. Let's say a table in your database includes information about university courses and the students who take those courses. If one course was cancelled due to low enrollment, you might inadvertently lose valuable student information by removing that course record. As with insertion anomalies, breaking your data out into multiple, specific tables would prevent this issue.

Rules of normalization

The rules for normalizing data were first introduced in the early 1970s. These rules are grouped into tiers called normal forms. Each tier builds upon the last: you can only apply the second tier of rules if your data already meets the first tier, and so on. While there are several more normal forms beyond the three listed below, these first three are sufficient for most use cases.
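Before moving on to the normal forms, the update anomaly described above can be sketched with a small in-memory SQLite example. The table, names, and addresses here are hypothetical, chosen only to illustrate the problem:

```python
import sqlite3

# Hypothetical denormalized table: the customer's address is repeated
# on every order row, so it lives in multiple places at once.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_name TEXT,
        customer_address TEXT
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "Gene Watson", "12 Oak St"),
        (2, "Gene Watson", "12 Oak St"),
    ],
)

# The customer moves, but only one of the two rows gets updated --
# this is the update anomaly in action.
conn.execute(
    "UPDATE orders SET customer_address = '34 Elm St' WHERE order_id = 1"
)

addresses = {row[0] for row in conn.execute(
    "SELECT customer_address FROM orders WHERE customer_name = 'Gene Watson'"
)}
print(addresses)  # the same customer now has two conflicting addresses
```

Storing the address once, in its own Customers table, would make this inconsistency impossible: there would be exactly one row to update.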
As we covered in our introduction to databases, tables within a database should contain an entity key, also known as a primary key. This field distinguishes each row within a table according to a unique ID, and is helpful when joining tables. Before we can even get into first normal form, your table needs to have an entity key field.

First normal form (1NF)

The first normal form (1NF) dictates that each field within a table should store only one value, and that your table shouldn't contain multiple fields that store similar information, like columns titled Address1 and Address2. Here's an example of a table that we'll normalize according to first normal form. This table includes information about college courses and who teaches them.

Professor table
We notice that while our fields are distinct, one professor (Gene Watson, in the first row) is teaching two courses, and that information is currently stored within a single cell. If we normalize this table according to 1NF, we'll need to break our data out into multiple tables:

Normalized professor table
Normalized course table
Since one professor can teach more than one course, we've broken this data out into two tables. Now, our Professor table has a one-to-many relationship with our Course table. This new table structure meets first normal form and joins the two tables via a foreign key, the Professor ID field.

Second normal form (2NF)

Second normal form is about reducing redundancy and making sure that every field describes something about the entity that the key identifies. To meet 2NF, all fields in a table that aren't the entity key must be fully dependent on the table's entity key (which may be a composite key made up of two fields). Let's look at a new example: a table that includes information about your employees' birthdays.

Employee birthday table
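The 1NF structure above can be sketched in SQL, here run through Python's sqlite3 module. The professor name comes from the example; the course titles, IDs, and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# One table per entity, each with its own entity (primary) key.
conn.execute("""
    CREATE TABLE professor (
        professor_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    )
""")
# The foreign key on course.professor_id creates the
# one-to-many relationship: one professor, many courses.
conn.execute("""
    CREATE TABLE course (
        course_id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        professor_id INTEGER REFERENCES professor(professor_id)
    )
""")

# Gene Watson teaches two courses; each course gets its own row
# instead of two titles crammed into a single cell.
conn.execute("INSERT INTO professor VALUES (1, 'Gene Watson')")
conn.executemany(
    "INSERT INTO course VALUES (?, ?, ?)",
    [(101, "Intro to Statistics", 1), (102, "Linear Algebra", 1)],
)

# Joining on the foreign key recovers the combined view when needed.
rows = conn.execute("""
    SELECT p.name, c.title
    FROM professor p JOIN course c ON c.professor_id = p.professor_id
""").fetchall()
print(rows)
```

Note that adding a third course for the same professor now only means inserting one more row into the course table; the professor's record is untouched.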
This table meets 1NF, because each column is distinct and each cell holds only one value. However, this table has a composite key: Employee ID and Birthday combined make up the table's entity key. This table does not meet 2NF in its current state, because the Department field only partially depends on the composite key: an employee's department doesn't depend on their birthday, only on their employee ID. To fix this, we'll break this information out into two tables:

Normalized employee birthday table
Normalized employee department table
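One way to sketch this 2NF split, with hypothetical column names and values, is to key the department table on Employee ID alone so the partial dependency on the composite key disappears:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Composite entity key: Employee ID + Birthday together identify a row.
conn.execute("""
    CREATE TABLE employee_birthday (
        employee_id INTEGER,
        birthday TEXT,
        PRIMARY KEY (employee_id, birthday)
    )
""")
# Department depends only on employee_id, so it moves to a table
# whose entity key is employee_id alone -- full dependency, as 2NF requires.
conn.execute("""
    CREATE TABLE employee_department (
        employee_id INTEGER PRIMARY KEY,
        department TEXT NOT NULL
    )
""")

conn.execute("INSERT INTO employee_birthday VALUES (1, '1990-04-12')")
conn.execute("INSERT INTO employee_department VALUES (1, 'Marketing')")

# If the employee changes departments, exactly one row changes.
conn.execute(
    "UPDATE employee_department SET department = 'Sales' WHERE employee_id = 1"
)
dept = conn.execute(
    "SELECT department FROM employee_department WHERE employee_id = 1"
).fetchone()[0]
print(dept)
```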
Third normal form (3NF)

A table meets third normal form if (in addition to meeting 2NF) it doesn't contain any transitive dependencies. A transitive dependency happens when Column A depends on Column B, and Column B depends on the entity key. To normalize according to 3NF, you'll need to remove Column A from the table, since it does not depend on the entity key directly, and place it in a different table with its own entity key.

Orders table
This table isn't in third normal form because the Customer zip code field depends on Customer ID, which is not this table's entity key (the entity key here is Order ID). Our current structure could lead to unwanted information loss: if customer C032 returned their order and we needed to delete that record, we'd unintentionally lose their zip code information. If customer C004 ever moved and their zip code changed, we'd also have to update it in two places, since they've placed multiple orders. To bring this table into 3NF, we're going to (you guessed it) break it out into two tables.

Normalized orders table
Normalized customers table
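A minimal sketch of the 3NF split, showing that deleting customer C032's order no longer destroys their zip code. The customer ID comes from the example above; the zip code, order ID, and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Zip code now lives in the customers table, keyed by customer_id,
# removing the transitive dependency from the orders table.
conn.execute("""
    CREATE TABLE customers (
        customer_id TEXT PRIMARY KEY,
        zip_code TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id TEXT REFERENCES customers(customer_id)
    )
""")
conn.execute("INSERT INTO customers VALUES ('C032', '90210')")
conn.execute("INSERT INTO orders VALUES (500, 'C032')")

# Customer C032 returns their order; deleting the order record
# no longer deletes their zip code (no deletion anomaly).
conn.execute("DELETE FROM orders WHERE order_id = 500")
zip_code = conn.execute(
    "SELECT zip_code FROM customers WHERE customer_id = 'C032'"
).fetchone()[0]
print(zip_code)
```

The update case works the same way: if customer C004 moves, their zip code is stored in exactly one row of the customers table, however many orders they've placed.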
Drawbacks to normalization: when to denormalize

Once you reach higher levels of normalization, your database may perform certain analytical queries more slowly, particularly those that need to retrieve a lot of data. Since normalized data demands that a database tap into several tables to perform a query, this can take longer, especially as your database grows in complexity. The tradeoff is that your normalized data takes up less space.
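One common way to act on this tradeoff is selective denormalization: building a read-optimized reporting table from the normalized tables, so the join cost is paid once at build time rather than on every analytical query. This is only a sketch; the table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized source tables (hypothetical data).
    CREATE TABLE customers (customer_id TEXT PRIMARY KEY, zip_code TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id TEXT);
    INSERT INTO customers VALUES ('C004', '10001');
    INSERT INTO orders VALUES (1, 'C004'), (2, 'C004');

    -- Denormalized copy for analytics: the join happens here, once,
    -- instead of inside every reporting query. The cost is duplicated
    -- storage and the need to rebuild when the source tables change.
    CREATE TABLE order_report AS
        SELECT o.order_id, o.customer_id, c.zip_code
        FROM orders o JOIN customers c ON c.customer_id = o.customer_id;
""")

# Analytical reads now hit a single flat table, with no joins.
rows = conn.execute("SELECT * FROM order_report").fetchall()
print(rows)
```

The normalized tables remain the source of truth for writes; the flat copy exists purely to speed up reads.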