I am pondering about performing a zero-downtime database migration and came up with minimum necessary steps.
By "migration" I mean any change in the same database that is not backward-compatible such as renaming, splitting or dropping a column.
Since I couldn't find much information elsewhere, I would like to validate my reasoning with someone having hands-on experience with this. Let's imagine we have a capability to perform rolling deployments, otherwise I don't believe that zero downtime DB migration is possible. So:
- Initial state: V1 is deployed in prod. It uses
table1.oldColumn - Goal: rename
table1.oldColumntotable1.newColumnwith zero downtime
Steps:
Create
table1.newColumn:ALTER TABLE table1 ADD COLUMN newColumn(...)Gradually deploy V2. The V2 code contains the following changes:
- SELECTs use oldColumn:
SELECT oldColumn FROM table1 WHERE userId = 1001. That's because onlyoldColumncontains full data for now whilenewColumncontains only a subset of it - UPDATEs use both, but when a new value is missing in newColumn, it's copied from
oldColumn. If we don't do that, we will chase constantly changingoldColumnforever - INSERTs use both columns:
INSERT INTO table1 (oldColumn, newColumn) VALUES ('abcd', 'abcd') - DELETEs are usually irrelevant because the delete remove the entire row:
DELETE FROM table1 WHERE userId = 1001- However, if the column is a UNIQUE KEY, then the oldColumn is used:
DELETE FROM table1 WHERE oldColumn = 'xyz'
- However, if the column is a UNIQUE KEY, then the oldColumn is used:
- SELECTs use oldColumn:
Now that all new data is always in sync, we still have a diff between
oldColumnandnewColumn. In order to liquidate difference betweenoldColumnandnewColumn, we run a background script copying values missing innewColumnfromoldColumnNow that columns are in sync, gradually deploy V3. V3 code contains the following changes: SELECTs, UPDATEs, INSERTs and DELETEs go to
newColumnnow.table1.oldColumnis not used anymoreDrop the unused
table1.oldColumn:ALTER table1 DROP COLUMN oldColumn
Note: steps 3 and 5 can be performed as part of the database migration during V2 and V3 startup
Recap:
- Initially
newColumnis empty and all data goes tooldColumn - While we gradually replace V1 with V2, data starts to flow into
oldColumnalongsidenewColumn. At this point some data still flows intooldColumnonly (because we are performing a rolling update so not all instances are V2) - As soon as V2 is deployed, data flows in both
oldColumnandnewColumn. We mirror updates and inserts to keep columns in sync - However, some data was inserted into
oldColumnbeforenewColumnwas devised and some data got there from remaining V1 instances that existed during the rolling update. We must get rid of this difference - When the script is run, data in
oldColumnmissing innewColumngets copied there