Introduction

I often see programmers make this mistake and sometimes I have to deal with the consequences, so I thought I’d write about it to help soothe my annoyance.

Instead of placing one piece of information into one storage location, some programmers insist on encoding more than one piece of information so that that one storage location can hold more.

CSV

Let’s use the canonical example:

employees
idname
1Jeremy
2Chris

Now let’s say you want to store a list of keys in the possession of each employee. In my opinion this calls for a couple more tables:

keys
idname
1Front door
2Stockroom
keyholders
employee_idkey_id
11
21
22

Nice and simple, normalised and easy to query. Let’s say I want to know who has access to the stockroom.


  SELECT      employee_id
  FROM        keyholders
  INNER JOIN  keys
  ON          keyholders.key_id = keys.id
  WHERE       keys.id = 2

If we want to query by key name and get employee name back, that’s simple too:


  SELECT      employees.name
  FROM        employees
  INNER JOIN  keyholders
  ON          employees.id = keyholders.employee_id
  INNER JOIN  keys
  ON          keyholders.key_id = keys.id
  WHERE       keys.name = 'Stockroom'

If you’re dealing with a significant amount of data, you can place an index on keys.name to avoid table scans.

What I see all too often is a new column added to the employees table called ‘keys’, which holds a comma separated list of values, e.g.:

employees
idnamekeys
1JeremyFront door
2ChrisFront door, Stockroom

Now if I want to know who has the stockroom key, I can’t use my simple joins. What I usually see in the code is something like this:


  SELECT  name
  FROM    employees
  WHERE   keys LIKE '%Stockroom%'

There are at least three things wrong with this:

  1. If you have rooms named Office 1, Office 2... you can’t query using LIKE ‘%Office 1%’ because you’ll have a problem if you have ten offices. Perhaps you can get away with it in this example scenario, but often I see lists of numbers, e.g. ‘1,2,3,4,...42,43’.
  2. Altering or dispensing with a key name requires some unusual SQL.
  3. The database engine can’t use an index, so it has to perform a table scan. With designs where the equivalent of the ‘employe’e table has many rows, this can hurt significantly hurt performance.

Paired values

There’s another form of this problem which I keep coming across: encoding two pieces of information and storing in one location.

For example: Let’s say you sell products to customers. If a customer buys a product, normally you’d remember this transaction by storing several pieces of data - perhaps the customer’s unique reference code, the code of the product they bought, how much it sold for, etc.

The unusual design I often see is the lumping together of e.g. customer code and product code, looking something like this: ‘0042000012’, where ‘0042’ is the customer code and ‘00001’2 is the product code.

Again, there’s more than one thing wrong with this design:

  1. You’re quite limited in the number of customers and products you can refer to. Using 32-bit integers would let you have over 2 billion of each.
  2. There’s no separation between the customer and product codes. To begin with, while each code is a fixed length, you can split at a known position to get to each datum, but what if you decide you need more customer IDs and expand the field?
  3. Finally, the worst problem: You now have to write and test much, much more code simply to handle the fact that these two pieces of data have been combined into one.

Value ranges

A third class of encoding problem is that of giving meaning to the range of a value.

Obviously, some classes of values really do have meaning when within or without certain ranges, for example: Water freezes at 0°C and boils at 100°C. If frozen water needs to be handled differently from liquid then we have to build logic to cope with this.

Arbitrarily defining ranges with their own meaning is just a headache in the making - with zero benefit.

I have dealt with a system where it had been decided that every item of type X should be given an unique number at creation in order to ensure it cannot be confused with any other X. This was a good decision. There follows an example where X has been replaced by 'music album'

Albums
idname
1Superunknown
2Spanners

What was decided next, however, was not so good. Type X has some subtypes. What would normally be done with these is that each subtype would be given its own unique number. For example, with our music albums, we could have each belong to a musical genre:

Genres
idname
1Rock
2Electronic

Assuming a 1:1 relationship (each album belongs to one genre), the Albums table could then be updated to reference a genre like this:

Albums
idnamegenre_id
1Superunknown1
2Spanners2

What I have seen is the encoding of the subtype into the unique identifier of X - so, in our example, the genre is combined with the album's unique number to form a number which still uniquely identifies the album but also allows you to find the genre without having to look at any other piece of information:

Albums
idname
1000Superunknown
2000Spanners

In the example above, any numbers from 1000 to 1999 are in the 'Rock' genre, while any from 2000 to 2999 are in the 'Electronic' genre.

The drawbacks to this approach:

  1. There can't be more than 1000 albums in a genre. This can be 'fixed' by allowing extra space for a genre ('Rock' could cover 1000-1999 and 3000-3999) but this complicates the program code, making it much more likely that bugs will appear.
  2. Finding the genre of an album would normally require code like the following:

    SELECT      name
    FROM        genres
    INNER JOIN  albums
    ON          albums.genre_id = genre.id
    WHERE       albums.id = 1
      
    Now it looks like this:

    SELECT      name
    FROM        genres
    INNER JOIN  albums
    ON          (albums.id / 1000) = genre.id
    WHERE       albums.id = 1
      
    The direct mapping from albums.genre_id to genre.id has gone. This isn't a massive problem immediately, but doing maths in joins is a sign you're going down the wrong track. Again, it's now much easier for bugs to creep in.

The advantages to this approach:

None.

No comments