Planned Switchover

Take mysql-prod-primary down for maintenance with zero data loss and a brief write outage. Read all steps before starting.

1

Confirm replication is healthy

Do not proceed if Seconds_Behind_Source is non-zero or either replication thread is not running. Fix replication first.

SQL — run on mysql-prod-replica
SHOW REPLICA STATUS\G
-- Replica_IO_Running: Yes
-- Replica_SQL_Running: Yes
-- Seconds_Behind_Source: 0
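
The three fields above can also be checked mechanically instead of by eye. The following is a minimal sketch, not part of the original procedure: check_replica_status is a hypothetical helper name, and it assumes the SHOW REPLICA STATUS\G output is piped to it on stdin.

```shell
# check_replica_status (hypothetical helper): reads `SHOW REPLICA STATUS\G`
# output on stdin. Exits 0 only if both replication threads report "Yes"
# and Seconds_Behind_Source is exactly 0 (NULL or any other value fails).
check_replica_status() {
  awk -F': *' '
    { gsub(/^[ \t]+/, "", $1) }                  # drop the \G left padding
    $1 == "Replica_IO_Running"    { io  = $2 }
    $1 == "Replica_SQL_Running"   { sql = $2 }
    $1 == "Seconds_Behind_Source" { lag = $2 }
    END { exit !(io == "Yes" && sql == "Yes" && lag == "0") }
  '
}

# Usage (assumes client credentials come from an option file such as ~/.my.cnf):
#   mysql -h mysql-prod-replica -e 'SHOW REPLICA STATUS\G' | check_replica_status \
#     && echo "replication healthy" || echo "DO NOT PROCEED"
```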

2

Freeze writes on the primary

This starts the write outage. Move quickly through steps 3–5.

SQL — run on mysql-prod-primary
SET GLOBAL super_read_only = 1;
SET GLOBAL read_only = 1;

3

Wait for replica to fully catch up

This should be near-instant because writes are frozen. If the replica has not caught up within a few seconds, roll back using the "Roll back" section below.

SQL — run on mysql-prod-replica
SHOW REPLICA STATUS\G
-- Seconds_Behind_Source: 0
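
Seconds_Behind_Source is an estimate and can read 0 while the replica's I/O thread is still behind the source. Because this setup uses SOURCE_AUTO_POSITION = 1, GTIDs are enabled, so a stricter check is to compare executed GTID sets. The following is a sketch under assumptions: wait_for_catchup is a hypothetical helper, the mysql client is assumed to reach both hosts by name, and credentials are assumed to come from an option file.

```shell
# wait_for_catchup PRIMARY_HOST REPLICA_HOST [TIMEOUT_SECONDS]  (hypothetical helper)
# Succeeds once the replica has applied every GTID the frozen primary has executed.
wait_for_catchup() {
  primary=$1 replica=$2 timeout=${3:-10}
  # Long GTID sets are wrapped with newlines; strip them server-side.
  gtids=$(mysql -h "$primary" -N -B \
    -e "SELECT REPLACE(@@GLOBAL.gtid_executed, CHAR(10), '')")
  [ -n "$gtids" ] || return 1
  # WAIT_FOR_EXECUTED_GTID_SET returns 0 when the set is applied, 1 on timeout.
  result=$(mysql -h "$replica" -N -B \
    -e "SELECT WAIT_FOR_EXECUTED_GTID_SET('$gtids', $timeout)")
  [ "$result" = "0" ]
}

# Usage:
#   wait_for_catchup mysql-prod-primary mysql-prod-replica 10 \
#     && echo "replica caught up" || echo "not caught up: roll back"
```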

4

Promote the replica

SQL — run on mysql-prod-replica
STOP REPLICA;
RESET REPLICA ALL;
SET GLOBAL read_only = 0;
SET GLOBAL super_read_only = 0;

5

Redirect traffic

Get proxy IP
kubectl get pods -n tailscale \
  -l tailscale.com/parent-resource=mysql-replica \
  -o jsonpath='{.items[0].status.podIP}{"\n"}'

Patch EndpointSlice
kubectl patch endpointslice mysql --type=json \
  -p '[{"op":"replace","path":"/endpoints/0/addresses/0","value":"<REPLICA_PROXY_IP>"}]'

Write outage ends here. Apps reconnect within a few seconds.
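
To avoid pasting the IP by hand during the outage, the two commands in this step can be chained. This is a sketch only: build_endpoint_patch is a hypothetical helper, and it assumes the proxy pod has an IPv4 address.

```shell
# build_endpoint_patch IP  (hypothetical helper)
# Prints the JSON patch from step 5 with IP filled in. Rejects an empty value
# or anything containing characters other than digits and dots, so a failed
# lookup or a leftover <REPLICA_PROXY_IP> placeholder is never patched in.
build_endpoint_patch() {
  case $1 in
    ''|*[!0-9.]*) echo "not an IPv4 address: '$1'" >&2; return 1 ;;
  esac
  printf '[{"op":"replace","path":"/endpoints/0/addresses/0","value":"%s"}]\n' "$1"
}

# Usage:
#   ip=$(kubectl get pods -n tailscale \
#     -l tailscale.com/parent-resource=mysql-replica \
#     -o jsonpath='{.items[0].status.podIP}')
#   patch=$(build_endpoint_patch "$ip") \
#     && kubectl patch endpointslice mysql --type=json -p "$patch"
```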

6

Verify

kubectl run verify --rm -it --image=mysql:8.0 --restart=Never -- \
  mysql -h mysql.default.svc.cluster.local -u <user> -p<pass> \
  -e "SELECT @@server_id, @@read_only"
# @@read_only should be 0 and @@server_id should match mysql-prod-replica
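
The same check can be scripted if the query is run with -N -B so the result is a single tab-separated row. A minimal sketch: check_new_primary is a hypothetical helper, and the expected server_id (2 in the usage line) is an assumed value that must be replaced with the replica's real one.

```shell
# check_new_primary EXPECTED_SERVER_ID  (hypothetical helper)
# Reads one "server_id<TAB>read_only" row on stdin. Succeeds only if the
# server_id matches and read_only is 0.
check_new_primary() {
  expected_id=$1
  read -r server_id read_only || return 1
  [ "$server_id" = "$expected_id" ] && [ "$read_only" = "0" ]
}

# Usage (2 is an assumed server_id for mysql-prod-replica):
#   mysql -h mysql.default.svc.cluster.local -N -B \
#     -e "SELECT @@server_id, @@read_only" | check_new_primary 2 \
#     && echo "traffic reaches the promoted replica"
```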

7

Do your maintenance

mysql-prod-primary is now idle. Reboot, patch, or do whatever you need.

8

Rejoin the primary as a replica

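Once the machine is back up, connect to mysql-prod-primary and configure it as a replica. Values set with SET GLOBAL in step 2 do not survive a server restart, so the read-only flags are set again here.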

SQL — run on mysql-prod-primary
STOP REPLICA;
RESET REPLICA ALL;
SET GLOBAL read_only = 1;
SET GLOBAL super_read_only = 1;
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST     = 'mysql-prod-replica',
  SOURCE_USER     = 'repl',
  SOURCE_PASSWORD = '<REPL_PASSWORD>',
  SOURCE_AUTO_POSITION = 1,
  GET_SOURCE_PUBLIC_KEY = 1;
START REPLICA;

Verify
SHOW REPLICA STATUS\G
-- Replica_IO_Running: Yes
-- Replica_SQL_Running: Yes
-- Seconds_Behind_Source: 0

Roll back (before step 5 only)

If anything goes wrong before the EndpointSlice is patched, traffic is still going to the old primary. Undo the freeze:

SQL — run on mysql-prod-primary
SET GLOBAL super_read_only = 0;
SET GLOBAL read_only = 0;

If you already promoted the replica (step 4), reconfigure it as a replica again:

SQL — run on mysql-prod-replica
STOP REPLICA;
RESET REPLICA ALL;
SET GLOBAL read_only = 1;
SET GLOBAL super_read_only = 1;
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST     = 'mysql-prod-primary',
  SOURCE_USER     = 'repl',
  SOURCE_PASSWORD = '<REPL_PASSWORD>',
  SOURCE_AUTO_POSITION = 1,
  GET_SOURCE_PUBLIC_KEY = 1;
START REPLICA;

There is no rollback after step 5. Use the Restore Topology runbook instead.