# Database Normalization: theory and practice

### Intro

Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database using the defined relationships.

Informally, a relational database table is often described as normalized if it is in the Third Normal Form. Most 3NF tables are free of insertion, update, and deletion anomalies.

### 1st Normal Form

Every non-key attribute depends on the key

For instance, the following schema is not on 1st NF (keys are starred):

R(*author_id, author_name, post_name)

In order to be in 1st NF the following FDs must be met:

author_id -> author_name
author_id -> post_name

but author_id -> post_name can be easily violated, thus allowing duplicates on the relation and all sort of other problems.

To satisfy 1st NF we would have to split the relation in something like

R(*author_id, author_name), R(*post_id, post_name, author_id)

### 2nd Normal Form

Every non-key attribute depends on the whole key

Consider the following schema:

R(*author_id, author_name, *post_id, post_name)

The 1NF is met, since every non-prime attribute depends on the key

{ *author_id, *post_id } -> author_name
{ *author_id, *post_id } -> post_name

However, 2NF is not met since there is a partial dependency of some attributes, thus violating the whole key requirement; in particular:

author_id -> author_name
post_id -> post_name

### 3d Normal Form

Every non-key attribute depends on the key, the whole key and nothing but the key

In other words, for every FD of type X -> A, either X is a key or A is a prime attribute. Consider the following schema:

R(*tournament, *year, winner, winner_db)

The 2NF is met, since every non-prime attribute depends on the whole key

{ *tournament, *year } -> winner
{ *tournament, *year } -> winner_db

However, 3NF is not met since there is a FD that violets the “nothing but the key” part, namely

winner -> winner_db

in fact, winner is not a key and winner_db is not prime. Nothing is stopping us from inserting a new row where the same dude has a different dob, which is clearly not cool.

Another example of a schema that meets 2NF but not 3NF would be

R(*author_id, author_name, postcode, city)

Assuming we’re talking UK postcodes here, the following FDs hold:

*author_id -> author_name
*author_id -> postcode
*author_id -> city
postcode -> city

### Boyce-Codd Normal Form (BCNF)

Every non-key attribute depends on the key, the whole key and nothing but the key

…but for every FD of type X -> A, X is a super-key. For example, the schema

R(A, B, C, D, E)

ABCD -> E
E -> BCD

is in 3NF, since the only non-prime attribute (E) depends on the whole key (ABCD) and nothing but the key (1st FD), and BCD is prime (2nd FD). However, E -> BCD has a non-superkey as determinant, therefore it doesn’t meet BCNF. The relation must be decomposed this way:

R1(E, B, C, D), R2(A, E)

Note, however, that this decomposition does not preserve the FD ABCD -> E.

### Lossless Join Decomposition

Decomposing R into R1 and R2 will guarantee a losless-join if (surprise!), joining R1 with R2 gives R back.
Now, the key concept here to keep in mind is that you want to make JOINs on keys, cause if you join on some random attribute that allows duplicates you end up with phantom rows. When we split a table we want to make sure that:

1. Every relation maps some FD, thus having a key
2. There is a shared key between the relations that we can use to join them.

For example, given

R(A, B, C, D, E)
A -> B
C -> DE

A good decomposition is

R1(A, B, C), R2(C, D, E)

we split on C -> DE, thus making C a key and ensuring we don’t have duplicates on R2. C is also shared to allow joins.

R1(A, B, C, D), R2(D, E)

because there is no key in R2. Potentially, R2 could be made up all of the same value. Good luck joining that. This becomes ridiculously evident when you think in term of real stuff, fleshing the above in sth like

R(author_id, author_name, post_id, post_date, post_title)
author_id -> { author_name }
post_id -> { post_date, post_title }

The first decomposition would be like joining on post_id, which is perfectly fine; the second decomposition would be like joining on post_date.

Example 2:

R(A, B, C, D, E)

AB -> C
E -> A
C -> DE
(A, B, C), (C, D, E) // ok
(A, B, C), (C, D), (D, E) // not ok

because DE is not a FD

Example 3: split R in 3 tables

R(A, B, C, D, E, F)

AB -> CD
C -> E
F -> A

// splitting on C -> E
(A, B, C, D, F), (C, E)

// splitting on F -> A
(B, C, D, F), (C, E), (F, A)

### Decomposing a relation in 3NF

Putting things together: we have a big relation R and a set of FDs. To decompose this in 3NF tables, you must

1. Find out which FDs are not in 3NF
2. Split the relation on these FD

The idea is that you don’t want to leave any relation without a key, thus the FDs which are not about keys must be isolated so that their LHS becomes effectiveley a key.

Example 1:

R(A, B, C)
A -> B
B -> C

The only key is A, with B and C non-prime. This implies that B -> C violates 3NF, and we must split on that, resulting in

R1(A, B), R2(B, C)

Example 2:

R(A, B, C, D, E, F, G, H)
AB -> D
EF -> A
FG -> C
EG -> F
D -> EG
F -> BH

The candidate keys are AB, D, EG, EF, AF; thus non-prime attributes are C, H. The FDs that break 3NF are F -> H and FG -> C, so we have to decomopse on them, leaving

R1(A, B, D, E, F, G), R2(C, F, G), R3(F, H)

### Decomposing a relation in BCNF

However, this decomposition still doesn’t meet BCNF. In particular, F -> B breaks BCNF since F is not a superkey within R1, therefore we must decompose

R1(A, D, E, F, G), R2(C, F, G), R3(F, H), R4(F, B)

and since R3 and R4 share the same key we should combine them

R1(A, D, E, F, G), R2(C, F, G), R3(F, B, H)

### Preserving Functional Dependencies

A losless decomposition preserves FDs if all the FDs are represented in the decomposition. Any good 3NF decomposition must preserve FDs. For example

R(A, B, C, D, E, F)

A -> BCE
C -> D
BD -> F
EF -> B
BE -> A

R1(B, D, F), R2(A, B, C, D, E) //  nope, EF -> B is missing
R1(A, B, C, E, F), R2(C, D)  // nope, BD -> F is missing
R1(A, B, C, E, F), R2(C, D), R3(B, D, F) // yes

The important bit is not that the whole FD is present in the table, but that every FD can be found (i.e. that every LHS can be found). The problem with the first two decomposition is that there is no table where BD or EF are present. Compare this with

R(A, B, C, D, E)

AC -> DBE
BC -> DE
B -> A
E -> D

R1(B, C, E), R2(A, B, C), R3(D, E)

even though AC -> DBE is not in any table as a whole, R2 and R3 can be joined together and AC is in R2, thus the FDs is represented. BC -> DE is also represented since R1 and R3 can be joined together.

### Recap examples:

// R and FDs as above
// Keys = { A, BE, EF }
// Non-prime = { C, D }
R1(B, D, F), R2(A, B, C, D, E)
• Lossless decomposition: yes, split on BD -> F
• 3NF: no, it should be split on C -> D
• Preserves FDs: no, EF -> B is missing
R1(A, B, C, E, F), R2(C, D)
• Losless decomposition: yes, split on C -> D
• 3NF: yes, split on C -> D
• Preserve FDs: no, BF -> F is missing
R1(A, B, C, E, F), R2(C, D), R3(B, D, F)
• Losless decomposition: yes, BD is shared after R1 J R2
• 3NF: yes
• Preserve FDs:  yes
R1(B, E, F), R2(A, C, E), R3(C, D)
• Losless decomposition: no, cannot join R1 and R2
• 3NF: yes
• Preserve FDs: no, half of them are missing