Planned Switchover

Take mysql-prod-primary down for maintenance with zero data loss and a brief write outage. Read all steps before starting.

Confirm replication is healthy

Do not proceed if lag is non-zero or either replication thread is not running. Fix replication first.

SQL — run on mysql-prod-replica

SHOW REPLICA STATUS\G
-- Replica_IO_Running: Yes
-- Replica_SQL_Running: Yes
-- Seconds_Behind_Source: 0
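
The three checks above can be scripted so the go/no-go decision is not made by eye. This is a minimal sketch, assuming the output of SHOW REPLICA STATUS\G is piped in on stdin. The function name replica_healthy is hypothetical, and the mysql invocation in the usage comment assumes credentials come from a client option file.

```shell
# replica_healthy: reads `SHOW REPLICA STATUS\G` output on stdin.
# Exits 0 only if both replication threads report Yes and lag is exactly 0.
replica_healthy() {
  awk -F': *' '
    { gsub(/^[ \t]+/, "", $1) }                 # strip the left padding of \G output
    $1 == "Replica_IO_Running"    { io  = $2 }
    $1 == "Replica_SQL_Running"   { sql = $2 }
    $1 == "Seconds_Behind_Source" { lag = $2 }
    END { exit !(io == "Yes" && sql == "Yes" && lag == "0") }
  '
}

# Usage (assumed invocation):
#   mysql -h mysql-prod-replica -e 'SHOW REPLICA STATUS\G' | replica_healthy \
#     && echo "replication healthy" || echo "do not proceed"
```

A Seconds_Behind_Source of NULL (replication stopped) fails the check, because only the literal value 0 passes.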

Freeze writes on the primary

This starts the write outage. Move quickly through steps 3–5 (wait for catch-up, promote the replica, redirect traffic).

SQL — run on mysql-prod-primary

SET GLOBAL super_read_only = 1;
SET GLOBAL read_only = 1;

Wait for replica to fully catch up

This should be near-instant because writes are frozen. If lag has not reached 0 within a few seconds, roll back using the Rollback section below.

SQL — run on mysql-prod-replica

SHOW REPLICA STATUS\G
-- Seconds_Behind_Source: 0
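
Seconds_Behind_Source is an estimate. Because this runbook replicates with SOURCE_AUTO_POSITION = 1 (GTID), a stricter check is to make the replica wait until it has applied every GTID the frozen primary has executed. This is a sketch under stated assumptions: the mysql client can reach both hosts by name, credentials come from a client option file, and the function name wait_for_catchup and the 10 second default timeout are illustrative rather than part of the original procedure.

```shell
# wait_for_catchup: succeeds once the replica has applied every GTID the
# frozen primary has executed, and fails if the timeout (seconds) expires.
wait_for_catchup() {
  timeout_s="${1:-10}"
  # -N drops the column header; -r keeps raw newlines so tr can strip them.
  gtids=$(mysql -h mysql-prod-primary -N -r \
    -e 'SELECT @@GLOBAL.gtid_executed' | tr -d '\n')
  # WAIT_FOR_EXECUTED_GTID_SET returns 0 on success and 1 on timeout.
  result=$(mysql -h mysql-prod-replica -N \
    -e "SELECT WAIT_FOR_EXECUTED_GTID_SET('${gtids}', ${timeout_s})")
  [ "$result" = "0" ]
}

# Usage:
#   wait_for_catchup 10 && echo "replica caught up" || echo "roll back"
```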

Promote the replica

SQL — run on mysql-prod-replica

STOP REPLICA;
RESET REPLICA ALL;
SET GLOBAL read_only = 0;
SET GLOBAL super_read_only = 0;

Redirect traffic

Get proxy IP

kubectl get pods -n tailscale \
  -l tailscale.com/parent-resource=mysql-replica \
  -o jsonpath='{.items[0].status.podIP}{"\n"}'

Patch EndpointSlice

kubectl patch endpointslice mysql --type=json \
  -p '[{"op":"replace","path":"/endpoints/0/addresses/0","value":"<REPLICA_PROXY_IP>"}]'

Write outage ends here. Apps reconnect within a few seconds.
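
The lookup and the patch can be chained so the proxy IP is never copied by hand during the outage window. This is a minimal sketch that reuses the two kubectl commands above unchanged. The function names endpoint_patch and redirect_to_replica are hypothetical, and the empty-IP guard is an added safety check, not part of the original procedure.

```shell
# endpoint_patch: prints the JSON patch that points the first endpoint
# address at the given IP. Pure string building, no cluster access.
endpoint_patch() {
  printf '[{"op":"replace","path":"/endpoints/0/addresses/0","value":"%s"}]' "$1"
}

# redirect_to_replica: looks up the replica proxy pod IP and applies the patch.
# Refuses to patch if the lookup returned nothing.
redirect_to_replica() {
  ip=$(kubectl get pods -n tailscale \
    -l tailscale.com/parent-resource=mysql-replica \
    -o jsonpath='{.items[0].status.podIP}')
  [ -n "$ip" ] || { echo "no proxy pod IP found" >&2; return 1; }
  kubectl patch endpointslice mysql --type=json -p "$(endpoint_patch "$ip")"
}
```

Running endpoint_patch with a sample address first shows the exact patch that would be sent, without touching the cluster.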

Verify

kubectl run verify --rm -it --image=mysql:8.0 --restart=Never -- \
  mysql -h mysql.default.svc.cluster.local -u <user> -p<pass> \
  -e "SELECT @@server_id, @@read_only"
# read_only should be 0, server_id should match the replica

Do your maintenance

mysql-prod-primary is now idle. Reboot, patch, or do whatever you need.

Rejoin the primary as a replica

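Once the machine is back up, connect to mysql-prod-primary and configure it as a replica of the new primary, mysql-prod-replica. Values applied with SET GLOBAL do not survive a restart, so the commands below set read-only mode again.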

SQL — run on mysql-prod-primary

STOP REPLICA;
RESET REPLICA ALL;
SET GLOBAL read_only = 1;
SET GLOBAL super_read_only = 1;
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST     = 'mysql-prod-replica',
  SOURCE_USER     = 'repl',
  SOURCE_PASSWORD = '<REPL_PASSWORD>',
  SOURCE_AUTO_POSITION = 1,
  GET_SOURCE_PUBLIC_KEY = 1;
START REPLICA;

Verify

SHOW REPLICA STATUS\G
-- Replica_IO_Running: Yes
-- Replica_SQL_Running: Yes
-- Seconds_Behind_Source: 0

Rollback

Roll back (before step 5 only)

If anything goes wrong before the EndpointSlice is patched, traffic is still going to the old primary. Undo the freeze:

SQL — run on mysql-prod-primary

SET GLOBAL super_read_only = 0;
SET GLOBAL read_only = 0;

If you had already promoted the replica (step 4), demote it and point it back at mysql-prod-primary:

SQL — run on mysql-prod-replica

STOP REPLICA;
RESET REPLICA ALL;
SET GLOBAL read_only = 1;
SET GLOBAL super_read_only = 1;
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST     = 'mysql-prod-primary',
  SOURCE_USER     = 'repl',
  SOURCE_PASSWORD = '<REPL_PASSWORD>',
  SOURCE_AUTO_POSITION = 1,
  GET_SOURCE_PUBLIC_KEY = 1;
START REPLICA;

There is no rollback after step 5. Use the Restore Topology runbook instead.