This documentation is archived and is not being maintained.

Data Normalization

Visual Studio .NET 2003

The task of a database designer is to structure the data in a way that eliminates unnecessary duplication and provides a rapid search path to all necessary information. The process of refining tables, keys, columns, and relationships to create an efficient database is called normalization. Normalizing is not just for relational files: it is also a common design activity for indexed files.

Normalization is a complex process with many specific rules and different levels of intensity. In its full definition, normalization is the process of discarding repeating groups, minimizing redundancy, eliminating composite keys for partial dependency, and separating non-key attributes. In simple terms, the rules for normalization can be summed up in a single phrase: "Each attribute (column) must be a fact about the key, the whole key, and nothing but the key." Each table should describe only one type of entity (such as a person, place, customer order, or product item).

Some of the benefits of normalization are:

  • Data integrity (because there is no redundant, neglected data).
  • Optimized queries (because normalized tables produce rapid, efficient joins).
  • Faster index creation and sorting (because the tables have fewer columns).
  • Faster UPDATE performance (because there are fewer indexes per table).
  • Improved concurrency resolution (because table locks will affect less data).

You can normalize most simple databases by following a simple rule of thumb: tables that contain repeated information should be divided into separate tables to eliminate the duplication.

For example, your new application is for a bookseller who must track information about each book, including the following data.

  • Author name.
  • Author address.
  • Author phone.
  • Title.
  • ISBN (International Standard Book Number).
  • Year published.
  • Publisher name.
  • Publisher address.
  • Publisher phone.

You could simply create a single table with a field for each of the data items listed. Looking closely at the data, however, it is obvious that such a table would contain many redundancies. For example, many authors have written more than one book, so the author and publisher information for each book title would be repeated many times. If you put all of these fields into a single table there would be many confusing and duplicate entries.

Using the principles of normalization, you might break the data into four groups: Authors, AuthorsTitles, Titles, and Publishers, as shown in the following table.

Authors table AuthorsTitles table Titles table Publishers table
au_id (key) au_id (foreign key) ti_isbn (key) pu_id (key)
au_name ti_isbn (foreign key) ti_title pu_name
au_address   ti_yearpublished pu_address
au_phone   pu_id (foreign key) pu_phone

The keys provide a means of establishing table relationships. For example, the AuthorsTitles table creates a many-to-many relationship between the Authors and Titles tables (an author may write many titles, and a single title may be written by several authors). Using the AuthorsTitles table, you can query for every book number an author wrote (using au_id), and also determine which author(s) wrote a certain book (using ti_isbn).

It is worth noting that instead of creating an AuthorsTitles table, a possible alternative approach would be to add the au_id attribute into the Titles table, but this option is viable only if you assume that each title has only one author. Take a closer look at another assumption: putting the pu_id attribute in the Titles table suggests that each title is by a single publisher. If the same title is published by more than one company, inserting additional Title rows into the Titles table for each separate publisher duplicates data and hence the table is not normalized. Such design alternatives require careful evaluation of what the business data means, what query types your application expects, possible multiuser concurrency issues, and the possible performance problems of many indexes on one table.

As you evaluate normalizing design alternatives, it is worth knowing that there are various techniques you can use to intentionally denormalize a database. Why would you do that? You would intentionally denormalize data because you might discover performance problems or want to simplify ad-hoc reporting. Performance problems derive from production queries that require too many time-consuming, disk intensive table joins. Ad-hoc reporting is about letting end users perform unstructured queries; untrained end users can be uncertain about how to get information from multiple related tables.

The techniques for denormalization include duplicating data, providing summary data, splitting tables into horizontal or vertical partitions, and creating denormalized views to simplify reporting — a clever alternative that leaves your normalized database intact.

As an example of denormalization, consider the case of customer addresses. A customer table would typically include these attributes: name, street, city, state, and zip code. While it is true that the city and state can be determined directly from the zip code, and therefore you could normalize the customer table by removing the city and state from each customer's data, it is a common practice to leave the address denormalized.

There are several reasons for denormalizing addresses:

  • Addresses are used in many places (queries, reports, envelopes, customer support screens), and denormalizing avoids adding a lot of address reconstruction code throughout the application.
  • Address-based queries use a much simpler SQL syntax.
  • Address errors are limited to single customers.

For more information on normalizing data, see the documentation for your database server. If you are using Microsoft SQL Server, see "Normalization" in SQL Server Books Online.

See Also

Data Integrity | Logical Database Design |