
Denormalization Guidelines
by Craig S. Mullins
Normalization is the process of putting one fact in one appropriate place.
This optimizes updates at the expense of retrievals. When a fact is stored in
only one place, retrieving many different but related facts usually requires
going to many different places. This tends to slow the retrieval process.
Updating is quicker, however, because the fact you're updating exists in only
one place.
It is generally recognized that all relational database designs should be
based on a normalized logical data model. With a normalized data model, one
fact is stored in one place, related facts about a single entity are stored
together, and every column of each entity refers non-transitively to only the
unique identifier for that entity. Although an in-depth discussion of
normalization is beyond the scope of this article, brief definitions of the
first three normal forms follow:
- In first normal form (1NF), all entities must have a unique
identifier, or key, that can be composed of one or more attributes. In
addition, all attributes must be atomic and non-repeating. (Atomic
means that the attribute must not be composed of multiple attributes. For
example, EMPNO should not be composed of social security number and last
name because these are separate attributes.)
- In second normal form (2NF), all attributes that are
not part of the key must depend on the entire key for that entity.
- In third normal form (3NF), all attributes that are not part of
the key must not depend on any other non-key attributes.
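As a brief illustration, the following sketch shows a simple design in
third normal form (the table and column names are illustrative only):

-- Each fact is stored in one place. DEPT_NAME depends only on DEPTNO,
-- the key of DEPT, so it is carried there rather than in EMP.
CREATE TABLE dept
(deptno    CHAR(6)     NOT NULL PRIMARY KEY,
 dept_name VARCHAR(30) NOT NULL)

CREATE TABLE emp
(empno     CHAR(6)     NOT NULL PRIMARY KEY,
 last_name VARCHAR(30) NOT NULL,
 deptno    CHAR(6)     NOT NULL REFERENCES dept (deptno))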
Frequently, however, performance needs dictate very quick retrieval
capability for data stored in relational databases. To accomplish this,
sometimes the decision is made to denormalize the physical implementation.
Denormalization is the process of putting one fact in numerous places. This
speeds data retrieval at the expense of data modification.
It is not the intention of this article to promote the concept of
denormalization. Of course, a normalized set of relational tables is the
optimal environment and should be implemented whenever possible. Yet, in
the real world, denormalization is sometimes necessary. Denormalization is not
necessarily a bad decision if implemented wisely. You should always consider
these issues before denormalizing:
- can the system achieve acceptable performance without
denormalizing?
- will the performance of the system after
denormalizing still be unacceptable?
- will the system be less reliable due to denormalization?
If the answer to any of these questions is "yes," then you
should avoid denormalization because any benefit that is accrued will not
exceed the cost. If, after considering these issues, you decide to
denormalize, be sure to adhere to the general guidelines that follow.
If enough disk space is available at your shop, create two sets of
tables: one set fully normalized and another denormalized. Populate the
denormalized versions by querying the data in the normalized tables and
loading or inserting it into the denormalized tables. Your application can
access the denormalized tables in a read-only fashion and achieve
performance gains. It is imperative that a controlled and scheduled
population function is maintained to keep the data in the denormalized and
normalized tables synchronized.
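For example, the population function might be nothing more than a batch
script that empties each denormalized table and reloads it from the
normalized tables. A minimal sketch, using illustrative table and column
names, follows:

-- Scheduled refresh: run in a batch window while the read-only
-- denormalized table is not being queried.
TRUNCATE TABLE emp_dept_denorm

INSERT INTO emp_dept_denorm (empno, last_name, deptno, dept_name)
SELECT e.empno, e.last_name, e.deptno, d.dept_name
FROM   emp e
JOIN   dept d
  ON   d.deptno = e.deptno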
If sufficient disk space is not available for two complete sets of
tables, then maintain the denormalized tables programmatically. Be sure to
update each denormalized table representing the same entity at the same
time or, alternatively, provide a rigorous schedule whereby the tables will be
synchronized. At any rate, all users should be informed of the implications
of inconsistent data if it is deemed impossible to avoid unsynchronized
data.
When updating any column that is replicated in many different tables,
always update it everywhere that it exists simultaneously, or as close to
simultaneously as possible given the physical constraints of your
environment. If the denormalized tables are ever out of sync with the
normalized tables, be sure to inform end-users that batch reports and on-line
queries may not contain sound data; if at all possible, this should be
avoided.
Finally, be sure to design the application so that it can be easily
converted from using denormalized tables to using normalized tables.
The Reason for Denormalization
Only one valid reason exists for denormalizing a relational design: to
enhance performance. However, there are several indicators which will help
to identify systems and tables which are potential denormalization
candidates. These are:
- Many critical queries and reports exist which rely upon
data from more than one table. Oftentimes these requests need to be
processed in an on-line environment.
- Repeating groups exist which need to be processed in a
group instead of individually.
- Many calculations need to be applied to one or many
columns before queries can be successfully answered.
- Tables need to be accessed in different ways by
different users during the same timeframe.
- Many large primary keys exist which are clumsy to query
and consume a large amount of disk space when carried as foreign key
columns in related tables.
- Certain columns are queried a large percentage of the
time causing very complex or inefficient SQL to be used.
Be aware that each new RDBMS release usually brings enhanced performance
and improved access options that may reduce the need for denormalization.
However, most of the popular RDBMS products on occasion will require
denormalized data structures. There are many different types of denormalized
tables which can resolve the performance problems caused when accessing
fully normalized data. The following topics will detail the different types
and give advice on when to implement each of the denormalization types.
Pre-Joined Tables
If two or more tables need to be joined on a regular basis by an
application, but the cost of the join is prohibitive, consider creating
tables of pre-joined data. The pre-joined tables should:
- contain no redundant columns
- contain only those columns absolutely necessary for the
application to meet its processing needs
- be created periodically using SQL to join the normalized
tables
The cost of the join will be incurred only once when the pre-joined
tables are created. A pre-joined table can be queried very efficiently
because new queries do not incur the overhead of the table join
process.
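In SQL Server, the pre-joined table can be materialized in a single
statement with SELECT INTO. The following is only a sketch, and the table
and column names are illustrative:

-- The join cost is paid once, here, rather than on every query.
-- Carry only the columns the application actually needs.
SELECT e.empno, e.last_name, d.deptno, d.dept_name
INTO   emp_dept_joined
FROM   emp e
JOIN   dept d
  ON   d.deptno = e.deptno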
However, pre-joined tables may quickly get out of sync with the
independent tables from which they were created. Pre-joined tables are more
useful for data that changes slowly than for highly dynamic data.
Report Tables
Oftentimes it is impossible to develop an end-user report using SQL
alone. These types of reports require special formatting or data
manipulation. If certain critical or highly visible reports of this nature
are required to be viewed in an on-line environment, consider creating a
table that represents the report. This table can then be queried using SQL
or a query tool or report facility. The report should be created using the
appropriate mechanism (application program, 4GL, SQL, etc.) in a batch
environment. The data then can be loaded into the report table in sequence.
The report table should:
- contain one column for every column of the report
- have a clustering index on the columns that provide the
reporting sequence
- not subvert relational tenets (such as 1NF and atomic data elements)
Report tables are ideal for carrying the results of multiple joins and
outer joins, correlated subqueries, or other complex SQL statements. If a
complex query is coded, run, and then loaded into a table, a simple SELECT
statement can be used to retrieve the results, instead of using the complex
(and slower) query that was used to populate the report table.
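One plausible shape for such a report table is sketched below, with
illustrative names; the complex query runs once in batch, and on-line
readers then issue only a trivial SELECT:

-- One column per report column.
CREATE TABLE monthly_sales_report
(region      VARCHAR(20)   NOT NULL,
 salesperson VARCHAR(30)   NOT NULL,
 total_sales DECIMAL(15,2) NOT NULL)

-- Clustering index on the columns that provide the reporting sequence.
CREATE CLUSTERED INDEX ix_report_seq
ON monthly_sales_report (region, salesperson)

-- On-line users retrieve the finished report with a simple SELECT.
SELECT region, salesperson, total_sales
FROM   monthly_sales_report
ORDER BY region, salesperson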
Mirror Tables
If an application system is very active, it may be necessary to split
processing into two (or more) distinct components. This requires the
creation of duplicate, or mirror tables. Consider an application system that
has very heavy on-line traffic during the morning and early afternoon hours.
This traffic consists of both querying and updating of data. Decision
support processing is also performed on the same application tables during
the afternoon. The production work in the afternoon always seems to disrupt
the decision support processing, causing frequent timeouts and deadlocks.
These disruptions could be corrected by creating mirror tables. A
foreground set of tables would exist for the production traffic and a
background set of tables would exist for the decision support reporting. A
mechanism to periodically migrate the foreground data to background tables
must be established to keep the application data synchronized. One such
mechanism could be a batch job executing UNLOAD and LOAD utilities. Another
possibility is to use the built-in replication and propagation services of
SQL Server. At any rate, the data synchronization should be done as often as
necessary to sustain the effectiveness of the decision support processing.
It is important to note that since the access needs of decision support
are often considerably different from those of the production environment,
different physical design decisions, such as indexing and clustering, may
be chosen for the mirror tables.
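A minimal sketch of such a migration, using illustrative table names,
follows; note that the background copy carries its own decision-support
index:

-- One-time definition: the mirror is indexed for range scans and
-- reporting rather than for single-row transaction processing.
CREATE INDEX ix_orders_bg_date
ON orders_background (order_date, region)

-- Periodic migration from the foreground (production) table to the
-- background (decision support) mirror.
TRUNCATE TABLE orders_background

INSERT INTO orders_background
SELECT * FROM orders_foreground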
Also, understand that simple mirror tables may not be sufficient for your
decision support needs. Perhaps you will need to create a full-blown data
warehouse environment. A data warehouse, really, is just a relational
database that is specifically denormalized for decision support and
analytical queries.
Split Tables
If separate pieces of one normalized table are accessed by different and
distinct groups of users or applications, consider splitting the table
into two (or more) denormalized tables: one for each distinct processing
group. The original table can also be maintained if other applications
exist that access the entire table. In this scenario the split tables
should be handled as a special case of mirror table. If an additional
table is not desired, a view joining the split tables could be provided
instead.
Tables can be split in one of two ways: vertically or horizontally. Refer
to Figure 1. A vertical split cuts a table column-wise, such that one group
of columns is placed into one new table and the remaining columns are placed
in another new table. A horizontal split cuts a table row-wise. To
split a table horizontally, rows are classified into groups via key ranges.
The rows from one key range are placed in one table, those from another key
range are placed in a different table, and so on.
Figure 1. Split Tables

Vertically split tables should be created by placing the primary key
columns for the old, normalized table into both of the split tables.
Designate one of the two new tables as the parent table for the purposes of
referential integrity, unless the original table still exists, in which
case the original table should be the parent table in all referential
constraints. If the original table remains and the split tables are read
only, do not set up referential integrity (RI) for the split tables: they
are derived from a referentially intact source, so RI would be redundant.
When a vertical split is being done, always include one row per primary
key in each split table. Do not eliminate rows from either of the two tables
for any reason. If rows are eliminated, the update process and any retrieval
process that must access data from both tables will be unnecessarily
complicated.
When a horizontal split is being done, split the rows so that no row is
duplicated across the new tables. This is done by splitting on the primary
key such that discrete key ranges are placed in separate split tables.
Simply stated, a UNION ALL of the horizontally split tables should return
exactly the rows of the original, unsplit table: no more and no fewer.
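The following sketch, using an illustrative EMP table, shows both kinds of
split:

-- Vertical split: both new tables carry the full primary key.
CREATE TABLE emp_payroll
(empno  CHAR(6) NOT NULL PRIMARY KEY,
 salary DECIMAL(15,2),
 bonus  DECIMAL(15,2))

CREATE TABLE emp_profile
(empno     CHAR(6) NOT NULL PRIMARY KEY,
 last_name VARCHAR(30),
 hire_date DATETIME)

-- Horizontal split: discrete key ranges, so that a UNION ALL of the
-- split tables returns exactly the rows of the original table.
-- (Assumes emp_a_to_m and emp_n_to_z were created with emp's columns.)
INSERT INTO emp_a_to_m SELECT * FROM emp WHERE last_name <  'N'
INSERT INTO emp_n_to_z SELECT * FROM emp WHERE last_name >= 'N'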
Combined Tables
If tables exist with a one-to-one relationship, consider combining them
into a single combined table. Sometimes even one-to-many relationships can
be combined into a single table, but the data update process will be
significantly complicated by the increase in redundant data.
For example, consider an application with two tables: DEPT (containing
department data) and EMP (containing employee data). The two tables could
be combined into a single table named, for example, EMP_WITH_DEPT. This new table
would contain all of the columns of both tables except for the redundant
DEPTNO column (the join criteria). So, in addition to all of the employee
information, all of the department information would also be contained on
each employee row. This will result in many duplicate instances of the
department data. Combined tables of this sort can be considered pre-joined
tables and treated accordingly.
Tables with one-to-one relationships should always be analyzed to
determine if combination is useful. Oftentimes combining two tables with a
one-to-one relationship is normalization, not denormalization.
Redundant Data
Sometimes one or more columns from one table are accessed whenever data
from another table is accessed. If these columns are accessed frequently
with tables other than those in which they were initially defined, consider
carrying those columns in those other tables as redundant data. By carrying
these additional columns, joins can be eliminated and the speed of data
retrieval can be enhanced. This should only be attempted if the normal
access does not perform sufficiently.
Consider, once again, the DEPT and EMP tables. If most of the employee
queries require the name of the employee's department, then the department
name column could be carried as redundant data in the EMP table. The column
should not be removed from the DEPT table, though carrying it redundantly
does mean EMP must also be updated whenever a department name changes.
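A minimal sketch of this change, assuming the department name is a
VARCHAR(30) column called DEPT_NAME, follows:

-- Add the redundant column; DEPT remains the authoritative source.
ALTER TABLE emp ADD dept_name VARCHAR(30)

-- Populate it, and re-run this UPDATE whenever a department is renamed.
UPDATE e
SET    e.dept_name = d.dept_name
FROM   emp e
JOIN   dept d
  ON   d.deptno = e.deptno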
In all cases columns that can potentially be carried as redundant data
should be characterized by the following attributes:
- only a few columns are necessary to support the redundancy
- the columns should be stable, being updated only infrequently
- the columns should be used by either a large number of users or a few
very important users
Repeating Groups
When repeating groups are normalized they are implemented as distinct
rows instead of distinct columns. This usually results in higher disk usage
and less efficient retrieval because there are more rows in the table and
more rows need to be read in order to satisfy queries that access the
repeating group.
Sometimes, denormalizing the data by storing it in distinct columns can
achieve significant performance gains. However, these gains come at the
expense of flexibility. For example, consider an application that is storing
repeating group information in the normalized table below:
CREATE TABLE periodic_balances
(customer_no CHAR(11) NOT NULL,
balance_period SMALLINT NOT NULL,
balance DECIMAL(15,2),
CONSTRAINT PK_cnbp PRIMARY KEY NONCLUSTERED
(customer_no, balance_period)
)
This table can store any number of balances per customer, limited only by
available storage and the limits of SQL Server. If the
decision were made to string the repeating group, BALANCE, out into columns
instead of rows, a limit would need to be set for the number of balances to
be carried in each row. An example of this after denormalization is shown
below:
CREATE TABLE periodic_balances
(customer_no CHAR(11) NOT NULL
CONSTRAINT PK_custno PRIMARY KEY NONCLUSTERED,
period1_balance DECIMAL(15,2),
period2_balance DECIMAL(15,2),
period3_balance DECIMAL(15,2),
period4_balance DECIMAL(15,2),
period5_balance DECIMAL(15,2),
period6_balance DECIMAL(15,2)
)
In this example, only six balances may be stored for any one customer. The
designer could just as easily have chosen to store eight or twelve balances. The
number six is not important, but the concept that the number of values is
limited is important. This reduces the flexibility of data storage and
should be avoided unless performance needs dictate otherwise.
Using the second design, all six balances can be retrieved by reading one
row. Using the first design, six rows would need to be retrieved to obtain
six balances.
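The difference is visible in the SQL itself. Assuming the customer number
shown, the two designs are queried as follows:

-- Normalized design: six rows come back, one per balance period.
SELECT balance_period, balance
FROM   periodic_balances
WHERE  customer_no = '00000000001'

-- Denormalized design: a single row carries all six balances.
SELECT period1_balance, period2_balance, period3_balance,
       period4_balance, period5_balance, period6_balance
FROM   periodic_balances
WHERE  customer_no = '00000000001'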
Before deciding to implement repeating groups as columns instead of rows,
be sure that the following criteria are met:
- the data is rarely or never aggregated, averaged, or compared within
the row
- the data occurs in a statistically well-behaved pattern
- the data has a stable number of occurrences
- the data is usually accessed collectively
- the data has a predictable pattern of insertion and deletion
If any of the above criteria are not met, SELECT statements may be
difficult to code, making the data less available due to inherently unsound
data modeling practices. This should be avoided because, in general, data is
denormalized only to make it more readily available.
Derivable Data
If the cost of deriving data using complicated formulae is prohibitive,
then consider storing the derived data in a column instead of calculating
it. For example, consider employee data that is scattered across multiple
tables. Perhaps the database contains three columns in three different
tables that store employee compensation data. These columns are SALARY,
BONUS, and COMMISSION. Furthermore, assume that more often than not, queries
require total compensation to be reported that is the sum of these three
columns. It might make sense to include a column in the main EMP table
called TOTAL_COMP that is the sum of SALARY, BONUS, and COMMISSION. This
avoids multi-table joins and calculations.
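One way to maintain such a column is a batch UPDATE that recalculates the
stored value from its sources. This is a sketch only; the placement of
SALARY in the EMP table and the names of the bonus and commission tables
are assumptions:

-- Refresh TOTAL_COMP (assumed to be a column added to EMP) from the
-- three underlying columns. This must run whenever SALARY, BONUS, or
-- COMMISSION changes (a trigger could do the same work synchronously),
-- or reported totals will be inconsistent.
UPDATE e
SET    e.total_comp = e.salary
                    + ISNULL(b.bonus, 0)
                    + ISNULL(c.commission, 0)
FROM   emp e
LEFT JOIN emp_bonus b
  ON   b.empno = e.empno
LEFT JOIN emp_commission c
  ON   c.empno = e.empno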
However, when the underlying values that comprise the calculated value
change, it is imperative that the stored derived data also be changed;
otherwise, inconsistent information could be reported. This will adversely
impact the effectiveness and reliability of the database.
Sometimes it is not possible to immediately update derived data elements
when the columns upon which they rely change. This can occur when the tables
containing the derived elements are off-line or being operated upon by a
utility. In this situation, time the update of the derived data such that it
occurs immediately when the table is made available for update. Under no
circumstances should outdated derived data be made available for reporting
and inquiry purposes.
Hierarchies
A hierarchy is a structure that is easy to support using a relational
database such as SQL Server, but from which it is difficult to retrieve
information efficiently. For this reason, applications that rely upon hierarchies very
often contain denormalized tables to speed data retrieval. Two examples of
these types of systems are the classic Bill of Materials application and a
Departmental Reporting system. A Bill of Materials application typically
records information about parts assemblies in which one part is composed of
other parts. A Departmental Reporting system typically records the
departmental structure of an organization indicating which departments
report to which other departments.
A very effective way to denormalize a hierarchy is to create what are
called "speed" tables. Figure 2 depicts a department hierarchy for
a given organization. The hierarchic tree is built such that the topmost
node is the entire corporation and each of the other nodes represents a
department at various levels within the corporation. In our example,
department 123456 is the entire corporation. Departments 1234 and 56 report
directly to 123456. Departments 12, 3, and 4 report directly to 1234 and
indirectly to department 123456. And so on.
The table shown under the tree in Figure 2 is the classic relational
implementation of a hierarchy. There are two department columns, one for the
parent and one for the child. This is an accurately normalized version of
this hierarchy containing everything that is represented in the diagram. The
complete hierarchy can be rebuilt with the proper data retrieval
instructions.
Figure 2. Classic Relational Implementation of a Department Hierarchy

              123456
             /      \
         1234        56
        /  |  \     /  \
      12   3   4   5    6
     /  \
    1    2
This tree structure represents a classic department hierarchy. Department
123456 represents the entire organization. Each of the other departments
reports to a higher department. For example, department 12 reports to
department 1234 which in turn reports to the corporate office, department
123456. This can be represented in a SQL Server table as follows:
Department Table

Deptno    Parent_Deptno   ... other columns
123456    ---
1234      123456
56        123456
12        1234
3         1234
4         1234
1         12
2         12
5         56
6         56
Even though the implementation effectively records the entire hierarchy,
building a query to report all of the departments under any given
department can be time-consuming to code and inefficient to process. The
sample query below returns all of the departments that report to the
corporate node 123456. However, this query can only be built if you know
in advance the total number of possible levels the hierarchy can achieve.
If there are n levels in the hierarchy, then you will need n-1 UNIONs.
To retrieve all dependent departments for Department 123456:
SELECT DEPTNO
FROM   DEPARTMENT
WHERE  PARENT_DEPTNO = '123456'
UNION
SELECT DEPTNO
FROM   DEPARTMENT
WHERE  PARENT_DEPTNO IN
       (SELECT DEPTNO
        FROM   DEPARTMENT
        WHERE  PARENT_DEPTNO = '123456')
UNION
SELECT DEPTNO
FROM   DEPARTMENT
WHERE  PARENT_DEPTNO IN
       (SELECT DEPTNO
        FROM   DEPARTMENT
        WHERE  PARENT_DEPTNO IN
               (SELECT DEPTNO
                FROM   DEPARTMENT
                WHERE  PARENT_DEPTNO = '123456'))
This is complex and probably quite slow to execute.
A "speed" table can be built such as shown below:
Department Speed Table

Parent_Deptno   Child_Deptno   Level   Detail   ... other columns
123456          1234           1       N
123456          56             1       N
123456          12             2       N
123456          1              3       Y
123456          2              3       Y
123456          3              2       Y
123456          4              2       Y
123456          5              2       Y
123456          6              2       Y
1234            12             1       N
1234            1              2       Y
1234            2              2       Y
1234            3              1       Y
1234            4              1       Y
3               3              1       Y
4               4              1       Y
12              1              1       Y
12              2              1       Y
1               1              1       Y
2               2              1       Y
56              5              1       Y
56              6              1       Y
5               5              1       Y
6               6              1       Y
A speed table is a denormalized version of a hierarchy. Data is
replicated within a speed table to increase the speed of data retrieval:
every parent department has a row for every department that reports to it
at any level, either directly or indirectly. Contrast this with the
previous table, which recorded only the immediate children of each parent.
A speed table also commonly contains other pertinent information that is
needed by the given application. Typical information includes the level
within the hierarchy for the given node, whether or not the given node is
a detail node (at the bottom of the tree), and, if ordering within a level
is important, the sequence of the nodes at that level.
After the "speed" table has been built, speedy queries can be
written against this implementation of a hierarchy. A "speed"
table is commonly built using a program written in Visual Basic or another
high-level language. SQL alone is usually either too inefficient to handle
the creation of a "speed" table or impractical because the number
of levels in the hierarchy is either unknown or constantly changing.
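Where a host-language program is impractical, one alternative is a
Transact-SQL loop that builds the table one level at a time, stopping when
a pass adds no new rows. The sketch below is based on the example above;
the table definition is an assumption, and the self-referencing rows shown
in the example table would need one additional INSERT:

-- Assumed shape of the speed table (DETAIL is filled in afterward).
CREATE TABLE DEPARTMENT_SPEED
(PARENT_DEPTNO CHAR(6)  NOT NULL,
 CHILD_DEPTNO  CHAR(6)  NOT NULL,
 LEVEL         SMALLINT NOT NULL,
 DETAIL        CHAR(1)  NULL)

DECLARE @level INT, @rows INT
SET @level = 1

-- Level 1: every direct parent/child pair from the normalized table.
INSERT INTO DEPARTMENT_SPEED (PARENT_DEPTNO, CHILD_DEPTNO, LEVEL)
SELECT PARENT_DEPTNO, DEPTNO, 1
FROM   DEPARTMENT
WHERE  PARENT_DEPTNO IS NOT NULL
SET @rows = @@ROWCOUNT

-- Each pass descends one more level until no new rows are added.
WHILE @rows > 0
BEGIN
    SET @level = @level + 1
    INSERT INTO DEPARTMENT_SPEED (PARENT_DEPTNO, CHILD_DEPTNO, LEVEL)
    SELECT S.PARENT_DEPTNO, D.DEPTNO, @level
    FROM   DEPARTMENT_SPEED S
    JOIN   DEPARTMENT D
      ON   D.PARENT_DEPTNO = S.CHILD_DEPTNO
    WHERE  S.LEVEL = @level - 1
    SET @rows = @@ROWCOUNT
END

-- Finally, flag the detail (leaf) departments: those that never appear
-- as a parent in the normalized table.
UPDATE DEPARTMENT_SPEED
SET    DETAIL = CASE WHEN CHILD_DEPTNO IN (SELECT PARENT_DEPTNO
                                           FROM   DEPARTMENT
                                           WHERE  PARENT_DEPTNO IS NOT NULL)
                     THEN 'N' ELSE 'Y' END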
The following SQL shows various informative queries that would have been
very inefficient to execute against the classic relational hierarchy but
work quite effectively against the speed table (called DEPARTMENT_SPEED
below). These queries work for any number of levels between the top and
bottom of the hierarchy.
To retrieve all dependent departments for Department 123456:

SELECT CHILD_DEPTNO
FROM   DEPARTMENT_SPEED
WHERE  PARENT_DEPTNO = '123456'

To retrieve only the bottom, detail departments for Department 123456:

SELECT CHILD_DEPTNO
FROM   DEPARTMENT_SPEED
WHERE  PARENT_DEPTNO = '123456'
AND    DETAIL = 'Y'

To return a complete department hierarchy for Department 123456:

SELECT PARENT_DEPTNO, CHILD_DEPTNO, LEVEL
FROM   DEPARTMENT_SPEED
WHERE  PARENT_DEPTNO = '123456'
ORDER BY LEVEL
Types of Denormalization
We have discussed nine different types of denormalization. The list below
summarizes the types of denormalization that are available, with a short
description of when each type is useful:
- Pre-Joined Tables: used when the cost of joining is prohibitive
- Report Tables: used when specialized critical reports are needed
- Mirror Tables: used when tables are required concurrently by two
different types of environments
- Split Tables: used when distinct groups use different parts of a table
- Combined Tables: used when one-to-one relationships exist
- Redundant Data: used to reduce the number of table joins required
- Repeating Groups: used to reduce I/O and (possibly) storage usage
- Derivable Data: used to eliminate calculations and algorithms
- Speed Tables: used to support hierarchies
Summary
The decision to denormalize should never be made lightly because it
involves a lot of administrative dedication. This dedication takes the form
of documenting the denormalization decisions, ensuring valid data,
scheduling of data migration, and keeping end users informed about the state
of the tables. In addition, there is one more category of administrative
overhead: periodic analysis.
Whenever denormalized data exists for an application, the data and
environment should be reviewed on an ongoing basis. Hardware,
software, and application requirements will evolve and change. This may
alter the need for denormalization. To verify whether or not denormalization
is still a valid decision, ask the following questions:
- Have the processing requirements changed for the application such that
the join criteria, timing of reports, and/or transaction throughput no
longer require denormalized data?
- Did a new release of SQL Server from Microsoft change performance
considerations? For example, did the introduction of a new join method or
performance technique eliminate the need for pre-joined tables?
- Did a new hardware release change performance considerations? For
example, does the upgrade to a new, faster processor provide additional
CPU resources such that denormalization is no longer necessary? Or did
the addition of memory enable faster data access so that data can be
physically normalized?
In general, periodically test whether the extra cost of processing
against fully normalized tables still justifies the benefits of
denormalization. You should measure the following criteria:
- I/O saved
- CPU saved
- complexity of update programming
- cost of returning to a normalized design
It is important to remember that denormalization was initially
implemented for performance reasons. If the environment changes, it is only
reasonable to re-evaluate the denormalization decision. Also, it is possible
that, given a changing hardware and software environment, denormalized
tables may be causing performance degradation instead of performance gains.
Simply stated, always monitor and periodically re-evaluate all
denormalized applications.
From SQL Server Update (Xephon), December 2000.
© 2000 Craig S. Mullins, All rights reserved.