I often see programmers make this mistake and sometimes I have to deal with the consequences, so I thought I’d write about it to help soothe my annoyance.
Instead of placing one piece of information into one storage location, some programmers insist on encoding more than one piece of information so that that one storage location can hold more.
Let’s use the canonical example:
| id | name |
|---|---|
| 1 | Jeremy |
| 2 | Chris |
Now let’s say you want to store a list of keys in the possession of each employee. In my opinion this calls for a couple more tables:
| id | name |
|---|---|
| 1 | Front door |
| 2 | Stockroom |
| employee_id | key_id |
|---|---|
| 1 | 1 |
| 2 | 1 |
| 2 | 2 |
Nice and simple, normalised and easy to query. Let’s say I want to know who has access to the stockroom.
SELECT employee_id
FROM keyholders
INNER JOIN keys
ON keyholders.key_id = keys.id
WHERE keys.id = 2
If we want to query by key name and get employee name back, that’s simple too:
SELECT employees.name
FROM employees
INNER JOIN keyholders
ON employees.id = keyholders.employee_id
INNER JOIN keys
ON keyholders.key_id = keys.id
WHERE keys.name = 'Stockroom'
If you’re dealing with a significant amount of data, you can place an index on keys.name to avoid table scans.
What I see all too often is a new column added to the employees table called ‘keys’, which holds a comma separated list of values, e.g.:
| id | name | keys |
|---|---|---|
| 1 | Jeremy | Front door |
| 2 | Chris | Front door, Stockroom |
Now if I want to know who has the stockroom key, I can’t use my simple joins. What I usually see in the code is something like this:
SELECT name
FROM employees
WHERE keys LIKE '%Stockroom%'
There are at least three things wrong with this:
There’s another form of this problem which I keep coming across: encoding two pieces of information and storing in one location.
For example: Let’s say you sell products to customers. If a customer buys a product, normally you’d remember this transaction by storing several pieces of data - perhaps the customer’s unique reference code, the code of the product they bought, how much it sold for, etc.
The unusual design I often see is the lumping together of e.g. customer code and product code, looking something like this: ‘0042000012’, where ‘0042’ is the customer code and ‘00001’2 is the product code.
Again, there’s more than one thing wrong with this design:
A third class of encoding problem is that of giving meaning to the range of a value.
Obviously, some classes of values really do have meaning when within or without certain ranges, for example: Water freezes at 0°C and boils at 100°C. If frozen water needs to be handled differently from liquid then we have to build logic to cope with this.
Arbitrarily defining ranges with their own meaning is just a headache in the making - with zero benefit.
I have dealt with a system where it had been decided that every item of type X should be given an unique number at creation in order to ensure it cannot be confused with any other X. This was a good decision. There follows an example where X has been replaced by 'music album'
| id | name |
|---|---|
| 1 | Superunknown |
| 2 | Spanners |
What was decided next, however, was not so good. Type X has some subtypes. What would normally be done with these is that each subtype would be given its own unique number. For example, with our music albums, we could have each belong to a musical genre:
| id | name |
|---|---|
| 1 | Rock |
| 2 | Electronic |
Assuming a 1:1 relationship (each album belongs to one genre), the Albums table could then be updated to reference a genre like this:
| id | name | genre_id |
|---|---|---|
| 1 | Superunknown | 1 |
| 2 | Spanners | 2 |
What I have seen is the encoding of the subtype into the unique identifier of X - so, in our example, the genre is combined with the album's unique number to form a number which still uniquely identifies the album but also allows you to find the genre without having to look at any other piece of information:
| id | name |
|---|---|
| 1000 | Superunknown |
| 2000 | Spanners |
In the example above, any numbers from 1000 to 1999 are in the 'Rock' genre, while any from 2000 to 2999 are in the 'Electronic' genre.
The drawbacks to this approach:
SELECT name
FROM genres
INNER JOIN albums
ON albums.genre_id = genre.id
WHERE albums.id = 1
Now it looks like this:
SELECT name
FROM genres
INNER JOIN albums
ON (albums.id / 1000) = genre.id
WHERE albums.id = 1
The direct mapping from albums.genre_id to genre.id has gone. This isn't
a massive problem immediately, but doing maths in joins is a sign you're
going down the wrong track. Again, it's now much easier for bugs to
creep in.
The advantages to this approach:
None.