Normalizing Data

In my last post I mentioned that the shape of my reading data is generally a solved problem, and then I joined a Bookwyrm1 instance and it really does do all the things I'd hope for in tracking my reading (although I still don't understand why page count is part of a MARC record but word count is not; today, however, I will not start fights with librarians/cataloguers). I still think the data set is useful for the thought experiment of normalizing data, and I may even extrapolate what I think Bookwyrm might be doing to store it. I am also not going to dig deeply into the normal forms here, but if normalization is your goal, digging deep is worth the effort.

So my reading data to date is very simple: I plugged each book I read into a CSV with the title, author, genre, media type, and a start date and end date.

Here is a little sample of this year's data:

| Title | Author | Genre | Format | Started | Completed | Days Read (Computed) |
| --- | --- | --- | --- | --- | --- | --- |
| A Record of a Spaceborn Few | Becky Chambers | Science Fiction | Book | 1/1/2023 | 1/10/2023 | 10 |
| Fun Home | Alison Bechdel | Biography | Graphic Novel | 1/8/2023 | 1/8/2023 | 1 |
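The "Days Read" column is computed from the two dates, counting inclusively (a book started and finished on the same day counts as one day). A minimal sketch of that calculation in Python:

```python
from datetime import date

def days_read(started: date, completed: date) -> int:
    """Inclusive day count: a same-day start and finish counts as 1 day."""
    return (completed - started).days + 1

print(days_read(date(2023, 1, 1), date(2023, 1, 10)))  # 10
print(days_read(date(2023, 1, 8), date(2023, 1, 8)))   # 1
```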

This data is unnormalized and useful by itself, but from a database perspective there is a big storage problem: every time I want to record having read a book, I have to duplicate a lot of data. Let's say I had actually read Fun Home previously:

| Title | Author | Genre | Format | Started | Completed | Days Read (Computed) |
| --- | --- | --- | --- | --- | --- | --- |
| Fun Home | Alison Bechdel | Biography | Graphic Novel | 3/10/2022 | 3/15/2022 | 5 |
| A Record of a Spaceborn Few | Becky Chambers | Science Fiction | Book | 1/1/2023 | 1/10/2023 | 10 |
| Fun Home | Alison Bechdel | Biography | Graphic Novel | 1/8/2023 | 1/8/2023 | 1 |

I've duplicated the title, author, and so on, and the only new information I've added is the pair of start and end dates for the second reading. To store this data efficiently we want to reduce the amount of duplicate data so our data set can grow gracefully as we add more users and books (particularly since we are going to end up duplicating data in a different way when we add indexes to help our queries perform). This needs to be balanced against the desire to reduce the number of joins needed to retrieve data, so the database can return query results efficiently.

To accomplish our goal we need to start identifying the separate entities in our data. From this very simple data set we have at least three: the book, the author, and the act of reading the book itself. Additionally, if we are scaling this into a database, the reader themselves is a fourth entity.

Once we break this apart we see that if we keep the data set as is, there isn't any reason to separate out the author: we don't have a surrogate key for our entities, and without any added information about authors we don't gain anything by separating them out; we just add an unneeded table (author) to our database. But leaving the author in with the book itself isn't helpful either: what if a book has multiple authors? We want to be able to represent this without having more than one entry for the book, and we may want to add more information about our authors as well. Here is a similar set of entities with surrogate keys for our data, as well as the relationships between them.
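One way those entities and relationships might look as tables (this is my own sketch with illustrative table and column names, not necessarily what Bookwyrm does), expressed as SQLite DDL run through Python's stdlib `sqlite3` module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (
    author_id INTEGER PRIMARY KEY,   -- surrogate key
    name      TEXT NOT NULL
);

CREATE TABLE book (
    book_id INTEGER PRIMARY KEY,     -- surrogate key
    title   TEXT NOT NULL,
    genre   TEXT,
    format  TEXT
);

-- junction table: a book can have many authors, an author many books
CREATE TABLE book_author (
    book_id   INTEGER REFERENCES book(book_id),
    author_id INTEGER REFERENCES author(author_id),
    PRIMARY KEY (book_id, author_id)
);

CREATE TABLE reader (
    reader_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);

-- one row per act of reading; a reread is simply another row
CREATE TABLE reading (
    reading_id INTEGER PRIMARY KEY,
    reader_id  INTEGER REFERENCES reader(reader_id),
    book_id    INTEGER REFERENCES book(book_id),
    started    TEXT,
    completed  TEXT
);
""")
```

The `book_author` junction table is what lets a book carry multiple authors without repeating the book row itself.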

Here we end up with some intermediate entities that exist to define many-to-many relationships: books can have more than one author and authors can write more than one book, and more than one reader can read a book (and, when it's good, read it more than once).
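To see the payoff, here is a small self-contained demo (again with illustrative names, not Bookwyrm's actual schema): recording a reread adds only one row to the reading table, and a join reassembles the flat view we started with.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author      (author_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE book        (book_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE book_author (book_id INTEGER, author_id INTEGER,
                          PRIMARY KEY (book_id, author_id));
CREATE TABLE reader      (reader_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE reading     (reading_id INTEGER PRIMARY KEY, reader_id INTEGER,
                          book_id INTEGER, started TEXT, completed TEXT);

INSERT INTO author      VALUES (1, 'Alison Bechdel');
INSERT INTO book        VALUES (1, 'Fun Home');
INSERT INTO book_author VALUES (1, 1);
INSERT INTO reader      VALUES (1, 'me');
-- the reread is just a second row here; title and author are not repeated
INSERT INTO reading VALUES (1, 1, 1, '2022-03-10', '2022-03-15');
INSERT INTO reading VALUES (2, 1, 1, '2023-01-08', '2023-01-08');
""")

rows = conn.execute("""
    SELECT b.title, a.name, r.started, r.completed
    FROM reading r
    JOIN book b         ON b.book_id = r.book_id
    JOIN book_author ba ON ba.book_id = b.book_id
    JOIN author a       ON a.author_id = ba.author_id
    ORDER BY r.started
""").fetchall()

for row in rows:
    print(row)  # each reading appears as its own row in the flat view
```

The joins are the trade-off mentioned earlier: we pay a little at query time to avoid duplicating book and author data at write time.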

So that's our basic data, normalized in a way that should perform well no matter how many books, authors, readers, and individual readings we add. I'm finding myself running rather long, and there is some further data we might like to store that could go a number of different ways:

  • How do we add book or author metadata in a more flexible way?
  • Does it make sense to store book genre as a string in the book table?
  • What ways might we allow readers to organize books (i.e. books they own, books they’ve only read, or even sub-dividing these conceptual groups/containers)?

I think these considerations are enough for a post of their own.


  1. As long as you don't mind someone else hosting your reading metrics/data, Bookwyrm seems like a good way to do it, with the added bonus that book and author data will be added by all users or pulled in from datasets like OpenLibrary ↩︎