# Database Normalization: theory and practice

- 1st Normal Form
- 2nd Normal Form
- 3d Normal Form
- Boyce-Codd Normal Form
- Losless Join Decomposition
- Decomposing a relation in 3NF
- Decomposing a relation in BCNF
- Preserving Fuctional Dependencies
- Recap examples

### Intro

Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database using the defined relationships.

Informally, a relational database table is often described as *normalized* if it is in the **Third Normal Form**. Most 3NF tables are free of insertion, update, and deletion anomalies.

### 1st Normal Form

*Every non-key attribute depends on the key*

For instance, the following schema is **not** on 1st NF (keys are starred):

R(*author_id, author_name, post_name)

In order to be in 1st NF the following FDs must be met:

author_id -> author_name author_id -> post_name

but *author_id -> post_name* can be easily violated, thus allowing duplicates on the relation and all sort of other problems.

To satisfy 1st NF we would have to split the relation in something like

R(*author_id, author_name), R(*post_id, post_name, author_id)

### 2nd Normal Form

**Every non-key attribute depends on the whole key**

Consider the following schema:

R(*author_id, author_name, *post_id, post_name)

The **1NF** is met, since every non-prime attribute depends on the key

{ *author_id, *post_id } -> author_name { *author_id, *post_id } -> post_name

However, **2NF** is **not** met since there is a partial dependency of some attributes, thus violating the *whole key* requirement; in particular:

author_id -> author_name post_id -> post_name

### 3d Normal Form

*Every non-key attribute depends on the key, the whole key and nothing but the key*

In other words, for every FD of type **X -> A**, either X is a **key** or A is a **prime** attribute. Consider the following schema:

R(*tournament, *year, winner, winner_db)

The **2NF** is met, since every non-prime attribute depends on the *whole* key

{ *tournament, *year } -> winner { *tournament, *year } -> winner_db

However, **3NF** is **not** met since there is a FD that violets the “nothing but the key” part, namely

winner -> winner_db

in fact, *winner* is not a key and *winner_db* is not prime. Nothing is stopping us from inserting a new row where the same dude has a different dob, which is clearly not cool.

Another example of a schema that meets 2NF but not 3NF would be

R(*author_id, author_name, postcode, city)

Assuming we’re talking UK postcodes here, the following FDs hold:

*author_id -> author_name *author_id -> postcode *author_id -> city postcode -> city

### Boyce-Codd Normal Form (BCNF)

*Every non-key attribute depends on the key, the whole key and nothing but the key*

…but for every FD of type X -> A, X is a **super-key.**** **For example, the schema

R(A, B, C, D, E) ABCD -> E E -> BCD

is in 3NF, since the only non-prime attribute (*E*) depends on the whole key (*ABCD*) and nothing but the key (1st FD), and *BCD* is prime (2nd FD). However, *E -> BCD* has a non-superkey as determinant, therefore it doesn’t meet **BCNF**. The relation must be decomposed this way:

R1(E, B, C, D), R2(A, E)

Note, however, that this decomposition does not preserve the FD *ABCD -> E*.

### Lossless Join Decomposition

Decomposing R into R1 and R2 will guarantee a losless-join if (surprise!), joining R1 with R2 gives R back.

Now, the key concept here to keep in mind is that **you want to make JOINs on** * keys*, cause if you join on some random attribute that allows duplicates you end up with phantom rows. When we split a table we want to make sure that:

- Every relation maps some FD, thus having a key
- There is a shared key between the relations that we can use to join them.

For example, given

R(A, B, C, D, E) A -> B C -> DE

A good decomposition is

R1(A, B, C), R2(C, D, E)

we split on *C -> DE*, thus making *C* a key and ensuring we don’t have duplicates on R2. *C* is also shared to allow joins.

Whereas a bad decomposition is

R1(A, B, C, D), R2(D, E)

because there is no key in R2. Potentially, R2 could be made up all of the same value. Good luck joining that. This becomes ridiculously evident when you think in term of real stuff, fleshing the above in sth like

R(author_id, author_name, post_id, post_date, post_title) author_id -> { author_name } post_id -> { post_date, post_title }

The first decomposition would be like joining on *post_id*, which is perfectly fine; the second decomposition would be like joining on *post_date*.

*Example 2:*

R(A, B, C, D, E) AB -> C E -> A C -> DE (A, B, C), (C, D, E) // ok (A, B, C), (C, D), (D, E) // not ok

because *DE* is not a FD

*Example 3*: split R in 3 tables

R(A, B, C, D, E, F) AB -> CD C -> E F -> A // splitting on C -> E (A, B, C, D, F), (C, E) // splitting on F -> A (B, C, D, F), (C, E), (F, A)

### Decomposing a relation in 3NF

Putting things together: we have a big relation R and a set of FDs. To decompose this in 3NF tables, you must

- Find out which FDs are not in 3NF
- Split the relation on these FD

The idea is that you don’t want to leave any relation without a key, thus the FDs which are not about keys must be isolated so that their LHS becomes effectiveley a key.

*Example 1:*

R(A, B, C) A -> B B -> C

The only key is A, with B and C non-prime. This implies that B -> C violates 3NF, and we must split on that, resulting in

R1(A, B), R2(B, C)

*Example 2:*

R(A, B, C, D, E, F, G, H) AB -> D EF -> A FG -> C EG -> F D -> EG F -> BH

The candidate keys are *AB, D, EG, EF, AF*; thus non-prime attributes are *C, H*. The FDs that break 3NF are *F -> H* and *FG -> C*, so we have to decomopse on them, leaving

R1(A, B, D, E, F, G), R2(C, F, G), R3(F, H)

### Decomposing a relation in BCNF

However, this decomposition still doesn’t meet BCNF. In particular, *F -> B* breaks BCNF since *F* is not a superkey **within R1**, therefore we must decompose

R1(A, D, E, F, G), R2(C, F, G), R3(F, H), R4(F, B)

and since R3 and R4 share the same key we should combine them

R1(A, D, E, F, G), R2(C, F, G), R3(F, B, H)

### Preserving Functional Dependencies

A losless decomposition preserves FDs if *all the FDs are represented in the decomposition*. Any good 3NF decomposition *must* preserve FDs. For example

R(A, B, C, D, E, F) A -> BCE C -> D BD -> F EF -> B BE -> A R1(B, D, F), R2(A, B, C, D, E) // nope, EF -> B is missing R1(A, B, C, E, F), R2(C, D) // nope, BD -> F is missing R1(A, B, C, E, F), R2(C, D), R3(B, D, F) // yes

The important bit is not that the *whole* FD is present in the table, but that *every* FD can be found (i.e. that every LHS can be found). The problem with the first two decomposition is that there is no table where *BD* or *EF* are present. Compare this with

R(A, B, C, D, E) AC -> DBE BC -> DE B -> A E -> D R1(B, C, E), R2(A, B, C), R3(D, E)

even though *AC -> DBE* is not in any table as a whole, R2 and R3 can be joined together and *AC* is in R2, thus the FDs is represented. *BC -> DE* is also represented since R1 and R3 can be joined together.

### Recap examples:

// R and FDs as above // Keys = { A, BE, EF } // Non-prime = { C, D } R1(B, D, F), R2(A, B, C, D, E)

- Lossless decomposition: yes, split on BD -> F
- 3NF: no, it should be split on C -> D
- Preserves FDs: no, EF -> B is missing

R1(A, B, C, E, F), R2(C, D)

- Losless decomposition: yes, split on C -> D
- 3NF: yes, split on C -> D
- Preserve FDs: no, BF -> F is missing

R1(A, B, C, E, F), R2(C, D), R3(B, D, F)

- Losless decomposition: yes, BD is shared after R1 J R2
- 3NF: yes
- Preserve FDs: yes

R1(B, E, F), R2(A, C, E), R3(C, D)

- Losless decomposition: no, cannot join R1 and R2
- 3NF: yes
- Preserve FDs: no, half of them are missing