Emergency Failover

The primary is down. Promote the replica and redirect traffic. Read all steps before starting.

1

Promote the replica

Connect to mysql-prod-replica with any MySQL client (a GUI tool or the CLI), or SSH onto the host and run the client there.

SQL — run on mysql-prod-replica
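STOP REPLICA;                    -- stop the replication I/O and SQL threads
RESET REPLICA ALL;               -- discard the old source connection settings
SET GLOBAL read_only = 0;        -- allow writes from ordinary accounts
SET GLOBAL super_read_only = 0;  -- allow writes from privileged accounts too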

Verify:

SELECT @@read_only, @@super_read_only;
-- both should be 0
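
The check above can also be scripted. This is a minimal shell sketch, not part of the original procedure: `is_writable` is a hypothetical helper name, and it assumes the query is run with `mysql -N -B` so that the two values arrive as one tab-separated line.

```shell
# Hypothetical helper: succeeds only if read_only and super_read_only are both 0.
# Reads one line of whitespace-separated values from stdin, as printed by
#   mysql -N -B -e "SELECT @@read_only, @@super_read_only"
is_writable() {
  read -r ro sro || return 1
  [ "$ro" = "0" ] && [ "$sro" = "0" ]
}

# Usage (assumes credentials come from ~/.my.cnf):
#   mysql -h mysql-prod-replica -N -B \
#     -e "SELECT @@read_only, @@super_read_only" | is_writable \
#     && echo "replica is writable"
```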

2

Redirect traffic

Fetch the Tailscale egress proxy pod IP for the replica, then patch the EndpointSlice.

Get proxy IP
kubectl get pods -n tailscale \
  -l tailscale.com/parent-resource=mysql-replica \
  -o jsonpath='{.items[0].status.podIP}{"\n"}'
Patch EndpointSlice
kubectl patch endpointslice mysql --type=json \
  -p '[{"op":"replace","path":"/endpoints/0/addresses/0","value":"<REPLICA_PROXY_IP>"}]'

App pods reconnect within a few seconds of the patch.
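
The two commands in this step can be chained so the proxy IP is never copied by hand. The sketch below is one possible way to do that, not the runbook's required method. The function names are hypothetical. It assumes pod IPs are IPv4 and that the EndpointSlice lives in the current namespace, as in the commands above.

```shell
# Hypothetical helpers; the names are illustrative and not from the runbook.

# Succeeds if $1 looks like a dotted-quad IPv4 address (format check only).
is_ipv4() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Prints the JSON patch that points the first endpoint address at $1.
build_patch() {
  printf '[{"op":"replace","path":"/endpoints/0/addresses/0","value":"%s"}]\n' "$1"
}

# Looks up the proxy pod IP and patches the EndpointSlice. Refuses to patch
# if the lookup returned nothing usable (for example, no matching pod).
redirect_to_replica() {
  ip=$(kubectl get pods -n tailscale \
        -l tailscale.com/parent-resource=mysql-replica \
        -o jsonpath='{.items[0].status.podIP}') || return 1
  is_ipv4 "$ip" || { echo "unexpected proxy IP: '$ip'" >&2; return 1; }
  kubectl patch endpointslice mysql --type=json -p "$(build_patch "$ip")"
}
```

Calling `redirect_to_replica` performs both steps. The format check is there so that an empty lookup result does not overwrite the endpoint address.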

3

Verify

kubectl run verify --rm -it --image=mysql:8.0 --restart=Never -- \
  mysql -h mysql.default.svc.cluster.local -u <user> -p<pass> \
  -e "SELECT @@server_id, @@read_only"
# read_only should be 0; server_id should be the replica's

4

After the primary recovers

Once mysql-prod-primary is back up, rejoin it as a replica. Run every statement in this step on mysql-prod-primary.

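The old primary may have committed transactions locally that never reached the replica before it crashed. With SOURCE_AUTO_POSITION = 1, the old primary tells the new source which GTIDs it has already executed and receives only the transactions it is missing. The orphaned transactions are not replayed on the new source, and they are not rolled back either: they stay in the old primary's data and in its gtid_executed set, so the two servers can differ in those rows. That divergence can surface as duplicate-key or row-not-found errors when later changes touch the same rows. To guard against this, start replication in IDEMPOTENT mode, verify that it catches up cleanly, then revert to STRICT.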
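
To see whether any orphaned transactions exist, the executed GTID sets of the two servers can be compared with MySQL's GTID_SUBTRACT function. The sketch below is an optional check, not part of the original procedure. The function names are hypothetical. It assumes the mysql CLI can reach both hosts and reads credentials from ~/.my.cnf.

```shell
# Hypothetical helpers; assume the mysql CLI reads credentials from ~/.my.cnf.

# Prints the executed GTID set of host $1 on a single line.
gtid_executed() {
  mysql -h "$1" -N -B -e "SELECT REPLACE(@@GLOBAL.gtid_executed, '\n', '')"
}

# Prints the GTIDs executed on $1 (old primary) that $2 (new primary) never saw.
# An empty line means no orphaned transactions were found.
errant_gtids() {
  old_set=$(gtid_executed "$1") || return 1
  new_set=$(gtid_executed "$2") || return 1
  mysql -h "$2" -N -B -e "SELECT GTID_SUBTRACT('$old_set', '$new_set')"
}

# Usage:
#   errant_gtids mysql-prod-primary mysql-prod-replica
```

Any GTIDs printed identify transactions that exist only on the old primary. They need manual review if the affected rows matter.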

1 — make read-only and set IDEMPOTENT mode
STOP REPLICA;
RESET REPLICA ALL;
SET GLOBAL read_only = 1;
SET GLOBAL super_read_only = 1;
SET GLOBAL replica_exec_mode = 'IDEMPOTENT';
2 — configure and start replication
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST     = 'mysql-prod-replica',
  SOURCE_USER     = 'repl',
  SOURCE_PASSWORD = '<REPL_PASSWORD>',
  SOURCE_AUTO_POSITION = 1,
  GET_SOURCE_PUBLIC_KEY = 1;
START REPLICA;

Verify replication is running and lag has drained to zero:

3 — verify
SHOW REPLICA STATUS\G
-- Replica_IO_Running: Yes
-- Replica_SQL_Running: Yes
-- Seconds_Behind_Source: 0
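
The same conditions can be checked by a script instead of by eye. This is a minimal sketch under stated assumptions: `replica_caught_up` is a hypothetical helper, and it expects the vertical (`\G`) output format on stdin.

```shell
# Hypothetical helper: reads SHOW REPLICA STATUS\G output on stdin and succeeds
# only if both threads are running, lag is 0, and Last_Error is empty.
replica_caught_up() {
  awk -F': ' '
    { sub(/^[ \t]+/, "", $1) }                 # strip the left padding
    $1 == "Replica_IO_Running"    { io  = $2 }
    $1 == "Replica_SQL_Running"   { sql = $2 }
    $1 == "Seconds_Behind_Source" { lag = $2 }
    $1 == "Last_Error"            { err = $2 }
    END { exit !(io == "Yes" && sql == "Yes" && lag == "0" && err == "") }
  '
}

# Usage (assumes credentials come from ~/.my.cnf):
#   mysql -h mysql-prod-primary -e 'SHOW REPLICA STATUS\G' | replica_caught_up \
#     && echo "caught up, safe to revert to STRICT"
```

A lag of NULL, which MySQL reports when the SQL thread is not running, fails the check because it is not equal to 0.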

Once lag is zero and no errors appear in Last_Error, revert to STRICT mode. Leaving IDEMPOTENT on permanently hides real replication errors.

4 — revert to STRICT
SET GLOBAL replica_exec_mode = 'STRICT';