The basics is a series where I explain a basic programming concept or term for new computer programmers or those who need a refresher.
When you start learning programming you are building toy apps or just apps for yourself and it is easy to get in the habit of dropping your entire database and starting from scratch every time you need to change the database schema. Once you are working with real user’s data, though, it’s no longer an option to lose all of your data every time you change your mind about the proper structure for the database.
Your only options are to always get the database schema design perfect the first time (impossible!) or to have a strategy for rearranging a database table’s columns without losing anything, ideally without downtime. Obviously the second option is the only real one. Even if you always make the right decisions about the database, new information or feature requests might make you want to change it.
The steps needed to change the database schema without losing data or having downtime are pretty much always the same. Each of these code-change steps should be followed by a deploy.
- Create a new column and start writing to it.
- Backfill old data to have the right value in the new column.
- Start reading from the new column.
Let’s look at an example. Here’s a simple Rails blog app with posts and comments:
Now a new requirement comes in to allow users to comment on other comments in addition to posts. This means that comments need a polymorphic association to both comments and posts. The
post_id field will no longer cut it because you’ll need a
commentable_id column on the comments table.
Step 1 looks like this:
Here we add the new fields to comment and start writing to them by setting a comments
commentable. I just add some more logic to the
CommentsController to handle the fact that the comment might be on a comment or post (as an aside, this code in the
CommentsController is pretty bad and should probably be moved into its own class). At this point the code should be deployed to production.
Step 2 is a backfill:
Once this backfill is run in production we are ready to start reading from that new column in the
Once this is deployed the
post_id column can be deleted as it is no longer needed. While this is certainly more steps than just dropping the database and starting over, it isn’t too bad to do things this way and avoid losing any data or having any downtime. I hope you found this helpful and are not scared to refactor your database schema the same way you refactor your code.