A brain dump on schema updates

Some of the most dangerous and scary things a developer does involve changing the database, be it fixing some dodgy data in production or changing the database schema altogether. Users trust us to at least not break their data, and certainly not to lose it. SQL changes also tend to be somewhat of a “side thing” compared to normal development: with regular development we tend to have a lot of processes in place to verify what we do, while schema changes tend to be more of an afterthought.

Developers tend to be really good at their core languages. That means Java, C#, C++ and whatever frameworks come with them. This is their primary technology. SQL is often something which is put to the side, or something developers aren't as comfortable with. This is a shame, since poorly designed SQL schemas are often the root of many performance issues, and the schema is an integral part of the systems we deliver.

In this post we will discuss how to facilitate safe and reliable schema changes in our systems, for developers and users alike.

Glossary:

  • Database: Some database systems use the word database differently. As far as this post is concerned, we will use it as follows:

A database server - Contains one or more databases  
  ∟ A database contains one or more schemas  
    ∟ A schema contains one or more tables*  
      ∟ A table contains one or more columns

    *Yeah, I know, schemas can contain other things. Let’s keep it simple for now okay?

  • Class: Throughout this post I will use the word “class”. I often use object oriented languages, but it is not limited to “objects”. It could be a plain SQL query or some other structure used to query the database. A common word for it is model. When this post says “class”, think less OOP and more “something which holds a set of values, or expects a set of values to exist”.

  • Schema: By saying schema I refer to the structure of tables, views, columns etc.

  • Schema changes: Schema changes mean changes to tables, columns, anything which can be changed within a schema. Often this is related to adding or deleting a column, or changing some column requirements like size or null-ability.

  • System: A system is what the business creates. It is meant as a catch-all for web services, batch systems, desktop applications, apps and so forth.

Out of scope topics:

Making schema changes safe isn’t that hard, yet there’s a lot to talk about. This post touches upon a few topics which it won’t flesh out, so as not to increase the length more than needed.

Here are the topics which are out of scope for this post:

  • Migration from one database server to another. These are big migration events which rarely happen. It’s a completely different process than simply updating a table or column.

  • Quick fixes. We sometimes have to employ quick fixes to our database, either to fix corrupted data or to resolve some other issue. This should also be done in a safe manner, but this post is more concerned with structural changes to the schema to facilitate improvements and new features. That said, normal schema changes and quick fixes should be kept separate IMHO.

  • Generating test data. We will talk about having to test our schemas, which requires data. Going into detail about how such data is generated is not in scope for this post.

  • Big data? No. Not here. With extreme amounts of data, extreme measures must be taken. This is the 90% solution.

tl;dr - The rules of schema migrations

  1. System architecture should facilitate easy change

  2. Communication between the system and schema must have fully automated integration tests

  3. Test and production environments should have the same database server version

  4. Performance test critical systems

  5. Small incremental changes over large sweeping changes

  6. SQL is not for humans, use automated processes to execute schema changes

  7. Schema changes must be backwards compatible with the version found in production

  8. Scheduled downtime is a thing of the past - figure out a way to change the schema without taking down the system

The problem

Making changes to the schema can always be fatal. The data stored in our databases is the most sacred thing a business has, and changes to the schema can destroy that data in numerous ways. Rolling data back is expensive, time consuming and often a last resort. Even then we may still lose data. Introducing a bug in the application can be pretty bad, don’t get me wrong, but such bugs mostly affect the data which is processed while the bug is live. A funky SQL statement can destroy all historical data on a whim. From a bird’s-eye view, one is clearly riskier and more prone to failure than the other.

Even if we have good systems for backing up our databases, we are really bad at recovering them after the fact. We often have systems that we can fall back on when shit hits the fan, but we are often not trained in how to use them. We are really good at having automated backup systems for databases, but when we have to use a backup it’s a hassle getting in touch with the right people, passwords have expired, URLs have changed etc… I am not implying that we shouldn’t have these systems, we definitely should, they are critical! But we should consider that our usage and knowledge of these tools can be improved (out of scope for this post). At the same time our goal should be to never need these tools in the first place, as using them means we didn’t have a good enough process for detecting and handling issues to begin with.

Due to the reasons mentioned above we should take a look at our process, and how we can, hopefully, avoid disaster before it happens:

1. How do we verify our schema changes?

“Testing shows the presence, not the absence of bugs” - Edsger W. Dijkstra

This quote by Dijkstra is a fact of software development. We cannot conceivably prove that something doesn’t contain problems. That doesn’t mean that we shouldn’t try. After all, we have tons of processes to verify that our applications run and do their intended function. We will therefore look at what we can do to increase our confidence in the schema changes we want to employ.

2. How do we make the required schema changes? 

Any step going wrong can mean the system being down for hours, days or even longer. An error can also lead to loss of data, which may doom the business. Therefore we must find a way to alter schemas in a safe and controlled way.

3. How do we facilitate schema changes through architecture? 

We should assume that our systems need to change. All systems in use change over time. When building systems we must therefore take that into the equation and build systems which facilitate change. To make such systems we must also consider the schema in the overall architecture.

Testing… testing… testing…

Test environments should use the same database server versions as production 

I was once tasked with adding a new column to a table. The task itself wasn’t that complicated. Just add a column which contains a 35 character null-able string without a default value. This change looked innocent enough and the pull request was swiftly approved. As usual I put the application into our test environments and verified that everything was still working.

Eventually we decided to push this change to production whereupon our app refused to start up due to not being able to migrate the changes. So what had happened?

After some digging we realized that our test environment was running Oracle 19, which allows identifier names of up to 128 bytes, but the version in production was Oracle 12.1, which was limited to 30 bytes. The new column had a name 32 bytes long...
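To illustrate, here is a hedged reconstruction of the kind of statement that bit us. The real table and column names are long gone, so these are made up; the column name below is 32 characters:

```sql
-- Works on Oracle 12.2 and later, where identifiers may be up to 128 bytes:
ALTER TABLE customers ADD (customer_secondary_contact_phone VARCHAR2(35));

-- On Oracle 12.1 the same statement fails with:
-- ORA-00972: identifier is too long
```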

Luckily this migration failure did not impact production and didn’t cause any downtime, so our users weren’t affected. That is not always the case; other situations may be way more terrifying and serious. Therefore we should always make sure that our test and/or staging environments use the same database server version as production. If not, we can quickly run into compatibility issues which aren't discovered before it's too late.

There are times where it is not possible to have the same versions. In the example above it would take a lot of time to upgrade production from version 12.1 to 19, which is also a risky process. It’s not one of those things we want to do on a whim. We didn’t want to downgrade our test environments either, as downgrading can be a major pain as well. What we decided to do was to run our database related integration tests against Oracle 12.1 in its own environment, as an additional pipeline step, just to make sure that the migrations work correctly. This is not a great solution, but sometimes we have to lean towards pragmatism over purism.

The core of the issue is that we must be certain that our migrations work on the database server version which is running in production - and we can only be certain of that if we test using the same version.

Automated integration tests

Every system which has to interface with the rest of the world should have automated integration tests. All systems which require a database must interface with that database in some way, which warrants having automated integration tests.

We should verify that our classes/models are being loaded and persisted correctly, and that our system executes SQL which the database can understand and which works with the schema the system is built for.

Sometimes we have logic in our databases (ugh). Stuff like views, stored procedures, triggers and so forth. Database systems tend to be tricky to write automated tests for, especially due to their tight coupling with the systems using them. Another issue is that we have to build some sort of dataset to use. Therefore it makes sense to use integration tests where we verify all the various bits and pieces from the system’s point of view. It is also often more efficient and easier doing this from the software side rather than testing the database directly.

When the automated tests execute, the database should be migrated to the latest version automatically. There should not be any need for manual intervention to execute these tests, and they should not be optional. Automated tests should execute often, and they should notify the developer if something is wrong. Usually they would execute as part of a pipeline.

Automated performance tests

Database performance is a fragile beast. One day the performance is fine, and the next a bottleneck has emerged. Databases are marvels of engineering and the sheer amount of data they can deal with is staggering, but much of that power relies on our ability to design the schema within those databases. While some optimizations are automatically applied during runtime, the databases are still forced to work with our stupid designs and limitations. Therefore we must make sure that performance is maintained after a change. A minor change, which may seem trivial, can turn out to cripple the performance of our system.

Due to the potential of small changes having a huge impact on the stability of our system we should employ automated performance tests, at least for our most critical systems. I am not claiming that every little service needs a fully automated performance test suite (though it would be nice), but the critical ones should have one.

The first thing one should have in mind when creating automated performance tests is deciding on acceptable metrics. That is, how much time is acceptable for an operation on a dataset with a given number of entries? 0.1 seconds on a million rows? 1.0 seconds on 10 million? And so forth. Knowing what is acceptable will decide how big of a dataset is needed and how the tests should be written.
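As a minimal sketch of what such a test might exercise, assuming an Oracle database and made-up table and column names: generate a dataset of a known size, then time the query under budget.

```sql
-- Create a throwaway table and fill it with one million rows using
-- Oracle's CONNECT BY LEVEL row-generator trick (names are illustrative).
CREATE TABLE perf_orders (
  id          NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  customer_id NUMBER NOT NULL,
  created_at  DATE   NOT NULL
);

INSERT INTO perf_orders (customer_id, created_at)
SELECT MOD(LEVEL, 10000), SYSDATE - MOD(LEVEL, 365)
FROM dual
CONNECT BY LEVEL <= 1000000;
COMMIT;

-- The test harness times this query and fails the build if it exceeds
-- the agreed budget (e.g. 0.1 seconds on a million rows).
SELECT COUNT(*) FROM perf_orders WHERE customer_id = 42;
```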

Performance tests should also report their results to some sort of centralized logging/statistics system, which allows developers to spot declines (or improvements) in performance over time. Having data which spans over time allows developers to see the trajectory of the system’s performance and make adjustments based on those facts. Combining that with statistics from production, developers can avoid performance issues before the customer even experiences them.

One thing to note is that the performance test dataset should be significantly larger than the actual production data. By having a larger dataset in test environments we can be more confident that issues related to the volume of data are detected before anything gets pushed to production.

Some claim that the automated performance test environment must be identical to the one in production, but I’m not sure that is the case. What I think is important is seeing trends over time. Seeing whether the performance goes up or down. Sure, in an ideal world the test environment would be identical to production, but it is more important that the test environment remains consistent over time than that it matches production. Obviously we should try to get it reasonably close, but it does not need to be identical to provide great benefit.

The same is true for the dataset the tests use - it should largely be consistent over time. Any changes to the dataset might alter the statistics, and for developers to make proper decisions they need reliable data over time.

Maybe the biggest advantage of performance tests is that we can see those big spikes or dips in performance when we make changes. This allows us to spot whether we have improved something, or made a horrible mistake. Most importantly, this should notify us of such spikes before any changes reach production.

The other reason why we should have automated performance tests is that sometimes our schema changes can lock our tables for some time. Once I was tasked with adding a new column with a default value to a table. Sounds easy enough, right? I added the column with the default value, everything seemed to work, at least until it hit production. What I hadn’t considered is that these tables held millions upon millions of records. Adding a default value locked the whole table, and the system was inaccessible for a couple of minutes. Luckily we could just stop the process the second we realized what was going on, and the change wouldn’t commit. With a large enough test dataset we could have discovered this kind of problem before it affected production.
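One common way around this kind of long-lived lock, sketched below with made-up table and column names, is to split the change into steps so that no single statement has to rewrite the whole table at once. Whether this is needed depends heavily on the database version; newer Oracle versions can add a column with a default as a metadata-only change.

```sql
-- Step 1: add the column as null-able, without a default; this is a
-- quick metadata change and does not rewrite existing rows.
ALTER TABLE orders ADD (status VARCHAR2(20));

-- Step 2: backfill existing rows in small batches, committing between
-- batches, so no single transaction holds locks for long.
-- Repeat until no rows are updated.
UPDATE orders
SET status = 'UNKNOWN'
WHERE status IS NULL
  AND ROWNUM <= 10000;
COMMIT;

-- Step 3: once every row has a value, apply the default (and, if
-- required, the NOT NULL constraint) in one quick statement.
ALTER TABLE orders MODIFY (status DEFAULT 'UNKNOWN' NOT NULL);
```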

Make sure to have your pipelines in order

In this section we have talked about tests, and with automated tests we quickly get into the topics of continuous delivery and continuous integration. Both of these topics are huge in their own right, which means that I cannot do them justice here. I do want to note that there should be automated processes in place which run all the tests previously mentioned. A developer should not need to log into a separate server or system to execute any of these tests; rather they should just execute and notify the developer of issues before they affect our customers.

Many stories have been told of developers combining their hubris with changes to the schema. By enforcing these tests we are also making sure that the changes being made are more likely to be safe.

Use a trusted and verified library to handle schema updates

I once worked on an old project, a project that was about 25 years in the making. When the project was initially conceived there was no readily available library which executed your scripts and managed the schema version. In the absence of such a tool the lead developer had created his own little tool, which first read the database to find the current schema version, then looked into a specific folder, found all the missing SQL files and ran them against the schema. On paper this sounds like a great tool, but it wasn’t all perfect.

For one, this tool was something that was thrown together, probably in a fit of rage after someone had made a particularly nasty error when running one of the scripts manually. This meant that the code was difficult to read and arcane to everyone; even the lead developer himself told us that he did not dare to make changes to it.

The issue then becomes: what happens if something goes wrong? For example, the tool never gave us the whole error message from the database server, and never pointed to where in the SQL script the error occurred. Rather, it swallowed much of the error message and preferred its own exception. This meant that whenever something went wrong it was really difficult to figure out what and where the error was. Sometimes it was difficult to figure out whether it was the tool that had issues or the script itself.

There are two lessons in this story. One is that people make mistakes, which is what led to this homebrew tool in the first place. We simply cannot trust humans to execute SQL. At some point someone will execute a script which should delete some rows, but accidentally forget the “where” clause, which promptly wipes out everything in the table.

Lesson two is that we should use well known and supported libraries to update our schemas. They should act predictably and be well documented. When something goes wrong a developer should not have to question whether it was the library that suddenly exploded or something in the script that went awry.

Such a library should enforce a versioned way of upgrading and make sure that the schema is updated to the correct version. The version number should be stored in the schema itself. It should be easy to see what version a schema is on and what version it will be upgraded to.
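As a rough sketch of what that bookkeeping might look like (the real tables used by these libraries differ; Flyway, for instance, keeps a richer table called flyway_schema_history):

```sql
-- A simplified version table; each applied migration gets a row.
CREATE TABLE schema_version (
  version      NUMBER        NOT NULL,
  description  VARCHAR2(200),
  installed_on TIMESTAMP     DEFAULT SYSTIMESTAMP NOT NULL,
  success      NUMBER(1)     NOT NULL
);

-- The tool reads the highest successfully applied version, then runs
-- every migration script with a higher version number, in order.
SELECT MAX(version) FROM schema_version WHERE success = 1;
```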

Examples of such libraries would be Flyway, Entity Framework or even Django’s migration framework (and many more).

Avoid big changes

Most developers these days have embraced shorter release cycles. Two-week sprints have almost become the de facto standard for the industry. One reason for this is that it is easier to plan for a few weeks rather than several months. It’s just more digestible, and more accurate. The other reason is that bigger changes pose bigger risks, which is also true for SQL.

I once worked on a project which had quarterly releases. In itself this is not a crime, but it is far from ideal. The schema changes which accumulated during the months between releases were massive. We’re talking SQL scripts with several thousand lines of code. Scripts which contained cursors to iterate through tables and alter their contents. Migrations of whole tables into new tables. Massive alterations to the overall structure of the schema. The question becomes: how does one verify that all those changes are correct? The answer is, one can’t. It’s too big of a job.

Even if we had had automated testing, the scripts were so massive that their sheer size bred insecurity in the developers, and no developer could truly have ownership of a single update. At the time we had no form of automated tests available to us (strike one); rather we relied on manual testing which only kicked in a couple of weeks prior to release (strike two; it was a disaster every time).

A 10 line SQL script is much easier to verify than a 3000 line script. It is much easier to see what it does and what it tries to achieve. Smaller scripts are just less prone to error. This is why we should always favor smaller changes over large ones.

This does nudge us towards smaller and more frequent release cycles, which is a common way to work these days. However, there are teams which have special quarterly/semi-annual/whatever “slots” where they are allowed to make changes to the schema. The underlying idea is that these slots are scheduled during times when the system isn’t heavily used, which should provide a safer environment for schema changes. This is a terrible idea, as this approach forces developers to make huge schema changes. It forces developers to hold functionality back, which again creates massive releases. Another reason why this is a bad idea is that it is a process which expects downtime, rather than one which tries to mitigate it.

Having slots where we try to make downtime safe is an admission that the current processes aren’t providing enough safety. They simply aren’t good enough. With good processes we don’t need slots: we trust that the process works, and we fix it when it doesn’t.

Developers should be certain that something works, and if they are unsure they should employ tests which make them sure. Our job is to provide working software, and not providing the required processes betrays that ethos.

Zero downtime schema migrations

During normal schema changes we should never resort to downtime. It affects the business and our customers. In modern development customers expect systems to simply be working. Scheduled downtime is from a bygone age; as an industry we have come far enough to know how to make schema changes without resorting to downtime.

While zero downtime is ideal, it is also not straightforward. To make it happen we have to truly think through how we do certain operations, and how we deploy our software. 

Tools can’t save us (or at least they haven’t yet)

There are excellent libraries/frameworks out there, like the ones mentioned in the “Use a trusted and verified library to handle schema updates” part of this post. These tools will take your SQL scripts (or even generate them), execute them, and if something goes wrong the changes will be rolled back. This is great and we should use tools like this. We can even set them to migrate as part of starting up the application itself, which some do. This has the unfortunate side effect that developers become detached from the interactions that go on between our application and the database. What some seem to forget is that these tools cannot truly provide safe migrations by themselves.

For example let’s say we have two instances of the same application running:

1.png

A normal approach to upgrading such a system is to first introduce a new instance with the new version like so:

2.png

Before using instance 3 we often have automated processes which check whether or not it is healthy. If it is considered healthy we start routing traffic to instance 3, then we start taking down some of the older instances. However, for instance 3 to be healthy we have to upgrade the schema to version 2, but doing so may break compatibility with the instances which are still running application v1. Depending on the change from schema v1 to v2 it can, at worst, introduce downtime or seriously affect customers.

This means that we cannot blindly trust these tools to do the job for us, as they often live in a bigger ecosystem of applications. In the future we might see some solution to this, but as of writing I have not been able to identify an automated process which sorts this out for us. Maybe there’s something out there, but it hasn’t reached me yet.

Zero downtime migrations

Schema updates are always a multi-step process: first we update the database, then we update the system (or vice versa). We don’t live in a world where we can update both at the same time and be certain that both will succeed. If one goes wrong, both should fail together, and that cannot be guaranteed when updating them at the same time.

Let’s look at an example, one which I have experienced myself. We have a system which has a database. We decide that we need to change the schema in the database, so as the developer I decided to change the models and the tables to fit the new requirements. The service didn’t see much traffic and wasn’t business critical, so in my hubris I decided to just go for it and execute both the schema update and the system update at the same time.

The change in question was incompatible with the previous classes in the system, but I assumed that by updating both at roughly the same time nobody would notice. I started the SQL update and then triggered the system update. The database update completed successfully, great! But the system refused to update, as our deploy pipeline threw some nasty messages. That meant that the version currently in production started throwing errors at our customers. While it was a low traffic service, it took surprisingly little time before support contacted me.

The pipeline transferred updates to a specific remote folder, which it could not access for some reason. We never really figured out the cause and ended up assuming someone in operations had been messing around with our internal firewall, which prevented the pipeline from pushing the application to the correct place. Either way, we ended up not being able to update the system for about 20-30 minutes. During that half hour our system was inaccessible, and a few customers were rightfully upset with us.

The lesson from that story is that we should always remain backwards compatible. If we accept that there are two separate processes which update the schema and the system (even if we use automated tools), then we must acknowledge that either of those processes might fail, and we cannot predict which. The best way to avoid systems going down in this manner is to make sure that they are always backwards compatible with each other.

In the story above, had I not been lazy, I would have made sure that the database changes were compatible with the version running in production. That way, when the release failed, the system would still have been able to operate as if nothing had happened. We would have avoided upsetting our customers and I would have been more professional in my approach.

Note that we don’t need to be backwards compatible forever. That is often not feasible. We only need to be backwards compatible with the version which is currently running in production.

Deferring actions

One thing that will become clear throughout this section is that if we want to maintain compatibility we must postpone certain actions. Sometimes we cannot do every schema change immediately; rather we want to hold a breaking change back until we’re sure that the application is up and running properly. Sure, this means that we have to add steps to our upgrade process, but that is just the nature of working with systems with multiple independent parts which rely on each other.

Some tools will automatically migrate the first part of the database prior to the application starting up, then migrate the rest after the application starts. This can work great if one only ever has a single instance of the application, which is rarely the case in modern development.

Whether these steps are automated or manual doesn’t really matter, as things must happen in a certain order either way.

Adding columns

Adding columns is probably the easiest schema change, and it rarely breaks any existing code, as existing code doesn’t know about the new column (except when the column is not null-able, more on that later). It means that the current working version simply won’t use that column. Let’s take the following example: assume we have one class which represents one table in our database. Before the upgrade it looks like this:

3.png

Our job is to add a new column “field_3” to the table and the class. One thing that becomes clear is that we have to update the schema first. If we were to add “field_3” to the class prior to updating the schema it could potentially cause issues as “field_3” doesn’t yet exist in the schema. Therefore we have to update the schema as such:

4.png

As we see in the model above, v2 of the table remains backwards compatible with v1 of the class. Since class v1 doesn’t know, nor care, about this new field it will run just fine. All we then need to do is update our system and the upgrade is complete:

5.png
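In SQL, the schema half of this upgrade is a single statement (table and column names here are illustrative):

```sql
-- Schema v1 -> v2: add the new column as null-able so that class v1,
-- which never mentions field_3, keeps working against table v2.
ALTER TABLE example_table ADD (field_3 VARCHAR2(100));
```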

Deleting columns

Deleting columns is a bit more complicated than adding a column. Take the following table and class:

6.png

Our task is to remove “field_3” as it is deemed no longer useful. If we were simply to drop “field_3” from the database, that would cause issues since class v1 still expects that field to exist. What some do is mark the field as read only, but that can still cause issues if some old code attempts to write data to it.

When it comes to deleting columns we should defer the removal, i.e. make it two separate releases, which may even land on different dates. Usually one wants a single update per version number, but in this case we’re doing it with two updates:

7.png

First we update our application to version 2. This essentially removes all of the application’s knowledge about “field_3”. This makes “field_3” an unused column in our database with some legacy data.

When we have verified that all of our system instances have been upgraded we can then safely drop the field from the database.
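Expressed as SQL (again with illustrative names), the two releases look something like this:

```sql
-- Release 1 is purely an application change: ship application v2,
-- which no longer references field_3. The column stays in place,
-- unused, holding legacy data.

-- Release 2: once every running instance is on application v2,
-- the column can be dropped safely.
ALTER TABLE example_table DROP COLUMN field_3;
```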

Renaming columns

Much like deleting columns, renaming is also a multi-step process, especially when we’re not allowed to have downtime. Let’s take an example:

We have a table with a column called “old_name”, which we want to rename to “new_name”. Such an upgrade would look something like this:

8.png

In step 1 we add the new column to the table while making sure that the old column still exists, so we remain backwards compatible with class v1.

In step 2 we make sure to update our class so that it is using the new name for our column rather than the old one.

In step 3 the “old_name” column is unused, therefore we can safely remove it.

The only remaining question is: when do we actually copy the values from the old column to the new one?

Copying the old values can be done in step 1 or step 3, and either approach has pros and cons:

  • Migrating values in step 1: Copying values in step 1 makes sure that the application has the expected data when it is updated to v2. The downside is that data can change between step 1 and step 2, meaning some data in the old column isn’t transferred to the new one. In certain situations this risk is worth taking, as this approach is easier to execute. In situations where the risk is not acceptable, we can do a new copy in step 3 to make sure the last bits of data have been transferred correctly.

  • Migrating values in step 3: Copying in step 3 is only okay if the application can live without the data for a little while. That way we simply copy the values in step 3 and we don’t have to worry about migrating any leftover data.

There isn’t a “correct answer” in this case, though copying values in step 1 seems far more common than in step 3. Which approach is more suitable is fully situational.
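As a sketch, with illustrative names and copying done in step 1:

```sql
-- Step 1 (schema): add the new column and copy the existing values.
-- Class v1 keeps reading and writing old_name, so nothing breaks.
ALTER TABLE example_table ADD (new_name VARCHAR2(100));
UPDATE example_table SET new_name = old_name;

-- Step 2 (application): release class v2, which reads and writes
-- new_name instead of old_name.

-- Step 3 (schema): rows that v1 touched between step 1 and step 2
-- still only have fresh data in old_name; if that matters, re-copy
-- just those rows (e.g. identified by an updated_at timestamp)
-- before dropping the now-unused column.
ALTER TABLE example_table DROP COLUMN old_name;
```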

New column with calculated values/Non-null values

In our code we often have to calculate things based on existing data. This is generally fine, but there are instances where we want to store that calculation somewhere. Let’s say we have a new requirement: we must store a nickname for our users, and one must be set for every single user. We arrive at the solution that the default nickname for our users can be “firstname-lastname”, so the user Mikael Gundersen would get the default nickname “Mikael-Gundersen”. Users can change their nicknames later, but our requirements state that it must be a non-null value.

While this is an odd requirement, we decide to add a new column, but we have to populate it with a value for each user. Let’s not get into the variations of people’s names and keep it simple by assuming that everyone in the world has exactly one first name and one last name. We already have a “firstname” and a “lastname” column, so we decide to simply use those to generate new nicknames for everyone. The steps for this are very similar to how we rename a column:

9.png

Here we are faced with a similar dilemma about when to populate our new column with values. Between step 1 and step 2 there’s room for new users being added, or names being changed, which might leave some users without a nickname. If it is business critical for this new field to be populated correctly, we can do a new round of calculations in step 3.

Another issue is that if we mark this new column as not null-able we are breaking backwards compatibility. If user class v1 tries to do an insert on user table v2, the database server will throw an exception. This happens because user class v1 doesn’t know about the “nickname” field, and therefore doesn’t try to populate it. This teaches us that we should only apply restrictions when they don’t break compatibility.
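Put together, a hedged sketch of the whole sequence (illustrative names again; note that the NOT NULL constraint is deferred to the very end, once no v1 instance can still do an insert):

```sql
-- Step 1 (schema): add nickname as null-able and backfill it from
-- the existing name columns. Class v1 keeps working, as it never
-- touches the new column.
ALTER TABLE users ADD (nickname VARCHAR2(100));
UPDATE users SET nickname = firstname || '-' || lastname;

-- Step 2 (application): release user class v2, which sets nickname
-- on every insert and update.

-- Step 3 (schema): backfill any rows created between steps 1 and 2
-- (rows that still have no nickname), then tighten the constraint.
UPDATE users
SET nickname = firstname || '-' || lastname
WHERE nickname IS NULL;
ALTER TABLE users MODIFY (nickname NOT NULL);
```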

What about tables?

Tables follow many of the same rules as columns. Adding one won’t affect much. Delete or rename one and we might have an issue, and we can apply the same patterns as above. Doing full table migrations can be a big task, and maybe downtime is warranted for such an operation. Sometimes it will take a long time to execute, or it may simply not be practical to do it with zero downtime.

I would consider this to be a major schema change, and as such it falls outside the scope of day to day software development. 

What if I want to do {insert change here}?

There are probably a bunch of possible scenarios not covered in this section. The point of this post isn’t to go over every possible scenario, but rather to show how we can maintain a working application while doing discrete changes to its schema. Most issues can, as we have seen, be managed if we defer actions correctly. We can always employ tricks like deactivating certain triggers, removing some restrictions and so forth to make our update seem seamless to our customers. In modern development users expect systems to simply be up - downtime is a thing of the past and we, the developers, should embrace that.

Schema friendly architectures

Generally, one system should have only one schema, which belongs to that specific system. While that statement seems sensible on paper, in reality it is often not the case. Modern development likes to separate systems into more easily managed blocks. This is sensible, but if we don’t do the same for the schema we lose out on some of the benefits.

The database is a dependency

In his book “Clean Architecture”, Robert C. Martin makes the argument that the more things depend on a system, the more effort it takes to change that system. In essence it makes our software less “soft”. This is something I completely agree with. If nothing depends on a system one can go hog wild and change whatever without much consequence, but the second one, two or more systems depend on something, it becomes very difficult to make sweeping changes.

This rule is also true for schemas. As with our systems, a database is also just an application, and when we connect to a schema we are depending on that schema to contain certain tables, which again have certain columns and so forth. If we view databases as just another dependency in our overall architecture, it also means that the more systems share the same schema, the more effort we have to put in to change that schema.

The ideal situation is that every system has its own schema. This is most aligned with the micro-service architecture, but is not limited to it.

Who owns the schema?

When multiple systems depend on the same schema we must also consider who owns the schema. Modern development often keeps the SQL migration files within the source code of the system itself, while a library handles the migration. This has its own downsides and benefits, but it also raises the question: who owns the schema? Which component gets to execute the changes?

One approach is to make all systems responsible for the migration. That means we have a library which contains the migration files and which all relevant systems depend on. In essence we can then deploy whatever system we like, and the schema will be updated on startup. On paper this sounds like a good idea, but in practice it has issues. A system responsible for its own migration must have a relationship with the schema version. If the version is higher than the one the system was built for, it should be read as an invalid database, and the system shouldn’t start up. If the version is lower than the one the system knows about, the system should migrate the schema to the latest version. This means that if one component updates the schema, all the other components will be out of date - which can cause issues.
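A minimal sketch of the startup check this implies, assuming the simplified schema_version table from earlier and an application built for version 42:

```sql
-- Read the highest successfully applied version at startup.
SELECT MAX(version) FROM schema_version WHERE success = 1;

-- The application then compares the result against the version it
-- was built for (42 in this hypothetical example):
--   result > 42 -> invalid database, refuse to start
--   result = 42 -> start normally
--   result < 42 -> run the missing migrations, then start
```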

Another approach is making the database migration library its own system, or making a single system responsible for the migrations. A system which is solely responsible for migrating to the latest version gives us more freedom in when we initiate a database release. Each system that isn’t responsible for schema updates doesn’t need to know which version it’s working on; it can simply assume it’s correct. That solves one issue, but it creates a new trap - unexpected incompatibility.

We stated previously that “schema changes should always be backwards compatible”, but that statement doesn’t take multiple systems into consideration. Certain changes cannot be infinitely backwards compatible, and to execute them safely we must have a complete picture of every system version which interfaces with the schema at hand. An operation which deletes a column or table cannot be compatible with a system which expects those to exist. Changes to a schema can seriously affect all components which depend on it, therefore we must eventually compile, package and release new versions of every system which depends on that schema after a change. A common failure with this approach is that a developer knows that table A isn’t used in system B, so the developer makes the desired change to table A. All good, right? But what if someone else on the team had a task where they suddenly had to utilize table A? The developer must have complete knowledge of every piece of work going on in each of the systems which integrate with the schema, as well as knowing every system itself.

The issues outlined in this section can be solved, or at least mitigated, by having good integration tests. All that said, I believe there are safer and better ways to go about it.

Breaking up schema dependencies

As previously discussed, the database is just a dependency. If we treat it as such, it becomes clear that we should avoid too many systems interacting with the same schema. One issue is that it becomes harder to change the schema later, but it can also introduce other issues such as locking and performance problems. Therefore we should consider breaking these architectural patterns apart and modularizing the solution, making sure that our dependencies allow for easier schema changes.

Here are two common patterns which tend to use the same database:

1. Server -> Schema <- Server

10.png

In this pattern we have two or more server deployments which use the same schema. They are not instances of the same service; rather they are completely different. Only two are depicted, but there could be more.

2. Server -> Schema <- Batch

11.png

In this pattern we have server deployments combined with batch deployments, a pattern which I have encountered many times. Often this type of architecture comes from having one web service which is responsible for certain types of data, but for some reason one needs to transfer a file in a specific format, or maybe one receives a file which must be parsed in some way. Rather than engineering a solution which is more schema friendly, the developers decide to hook it up directly to the database and call it a day. No shame in that, I have done so myself. Often the server and batch deployments exist in the same project so they can share models - to keep things DRY.

Batch applications and server applications tend to be very different, both in functionality and code, but from a dependency point of view they are pretty much the same thing. Throughout this section we will use the “Server -> Schema <- Batch” pattern as an example, but it can easily be switched out with the “Server -> Schema <- Server” pattern, or any other combination of system patterns. We could theoretically have a “Batch -> Schema <- Batch” structure, which the patterns below will also work for.

Make one deployment depend on the other

12.png

This might be the simplest way to decouple the schema from multiple dependencies - and sometimes it can be an effective approach.

Doing it this way allows us to turn the batch deployment into a dumb file-read/file-write application which simply posts the contents of a file to the server, or reads from it. This can work great unless we’re working with huge datasets. The approach might also complicate the architecture a bit, especially if the batch deployment requires its own database with the same models, which would force us to maintain two similar schemas; in that case this might not be an optimal choice.

The benefit of such an approach is that we can now version the web service which the batch deployment uses, so we’ll have a much cleaner upgrade path.

As for server-to-server patterns, this can also be very viable. The good thing about this approach is that we’re splitting up our system into separate chunks which are responsible for different parts.

Some might look at the model above and claim that it is a bad pattern, especially since it leans closer to micro-service architecture. They would also claim that we are basically making a distributed monolith, which I won’t disagree with, but as always, it comes down to business requirements. If one system needs strong consistency between the various services, then it probably needs to make a synchronous call to the other system, so that it can be sure that everything is up to date before returning a response. The pattern above is one which allows this to happen. Distributed monolith or not, sometimes we have to stay synchronous.

Use a message broker

14.png

In the previous part we talked about strong consistency. This pattern is more in line with what one would expect from a micro-service architecture - eventual consistency. In this pattern we don’t need our batch deployment to know about an event at the exact moment it happens, we just need it to know what has happened since it last ran.

It works the other way as well. The batch deployment can read the file and post whatever results to the message broker, which the server deployment can pick up whenever it wants. It doesn’t need to happen synchronously, it just needs to happen eventually.

Using a message broker is also a common pattern for service-to-service communication, where one service does something, then sends out a message about what happened. This means that it can take some time for changes to propagate through the system, but eventually the system will be in a consistent state.

Use a gateway service

Sometimes data gets forced together, through legacy, business requirements or some other reason. These are situations where it may be tempting to put everything in a single schema which both services can use, or even to combine the two systems into one. As outlined previously this isn’t such a great idea, so to avoid it we can create a gateway service (server 3) which combines the results from the two separate services. This pattern brings its own set of challenges, but it does solve the problem of updating schemas, since they are kept separate.

What about read-only database replicas?

16.png

The model above is great for dealing with a large amount of data which more than one system needs. It is, however, not a solution that makes schema changes any easier. The problem is that when we change the schema, the next replica copy will also contain the schema changes. Every system relying on data from the replica must adhere to the same schema, so we haven’t solved the underlying issue.

I am personally not sold on read-only database replicas. I can see they have a use case when first splitting up a behemoth, but it is not a design pattern I would employ unless there was a really good reason for it. Database replicas do not solve the underlying schema dependency issue, and they encourage services to rely on data which does not belong to their domain. Is it better than direct access to the schema? Yeah, because then they can’t directly alter that data. But that’s a low bar to clear.

But, do we have to break up this dependency?

Breaking up such dependencies in existing systems might be an enormous job. I believe it will yield positive benefits in the long run, but it will hold back development for some time. I can easily think of several situations where it might not be worth doing, especially for older systems, or systems with an expiry date.

These architectural decisions should be made prior to the application being built, not after. Doing so later in the system’s life-cycle will only increase the time it takes to do the change, and it will increase the complexity of doing it.

These kinds of patterns will naturally evolve as businesses grow more comfortable with micro-services and distributed systems.

Conclusion

A schema change is simply executing some SQL that does something, and that’s it. Sounds easy enough, right? I hope this post has managed to shed some light on the underlying complexities of that simple action. As developers it is our responsibility to maintain the correctness and reliability of the data which our systems handle, and we can only do that by considering how we approach our data.

As with many facets of software development, something that initially seemed straightforward can turn into a pain point later as the business scales in both size and complexity. As always we should take care to design the architecture of our solutions and consider the various challenges it will face. One of them is how we facilitate schema changes. Sometimes we find that our architectural decisions have fallen short, which means we must reconsider our approach, even making sweeping changes to our solution to avoid building upon a broken foundation.

I’m pretty sure that most developers understand the importance of keeping their databases healthy, and I hope this post might point someone to a better way of doing it.
