DB cluster initialisation can leave async node offline
In rare cases (observed once in ~20 test runs today), the DB cluster initialization can end with a master and a sync replica, but with the async replica not running.
In this case, the dead node's patroni journal showed: postgres lock file "./.s.PGSQL.5432.lock" already exists
This is likely fixable by running a reinit on the failing database node. When a cluster has 3+ DB nodes up, we should therefore run post-install checks with a timeout to make sure all nodes become healthy, reinit any node that does not become healthy within the time limit, and then check again.
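As a rough illustration of the proposed check, here is a minimal sketch. It assumes the standard Patroni REST API on port 8008 and a hypothetical reinit_node callback; names, timeouts, and error handling are illustrative, not the actual cosmo_tester or installer implementation.

```python
import time

import requests  # assumed available; used only to keep the sketch short

CHECK_TIMEOUT = 600   # seconds to wait for all DB nodes to become healthy
POLL_INTERVAL = 15    # seconds between health polls


def node_is_healthy(node_ip):
    """Return True if the node's Patroni REST API reports postgres as running."""
    try:
        resp = requests.get('http://{0}:8008/health'.format(node_ip), timeout=5)
        return resp.status_code == 200
    except requests.RequestException:
        return False


def wait_for_healthy_cluster(node_ips, reinit_node):
    """Poll all DB nodes; reinit any node still unhealthy when the timeout expires.

    `reinit_node` is a hypothetical callback that performs the reinit on a
    single node (however the installer chooses to implement that).
    """
    deadline = time.time() + CHECK_TIMEOUT
    unhealthy = list(node_ips)
    while time.time() < deadline:
        unhealthy = [ip for ip in node_ips if not node_is_healthy(ip)]
        if not unhealthy:
            return
        time.sleep(POLL_INTERVAL)
    for ip in unhealthy:
        reinit_node(ip)
    # Check again after the reinit and fail loudly if nodes are still down.
    still_down = [ip for ip in node_ips if not node_is_healthy(ip)]
    if still_down:
        raise RuntimeError(
            'DB nodes still unhealthy after reinit: {0}'.format(still_down))
```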
Steps to Reproduce
pytest -s cosmo_tester/test_suites/cluster/db_management_test.py::test_db_set_master
This test will need to be run multiple times, as the failure rate seems to be ~5% (based on 20 test runs).
Why Propose Close?
Fix script created (see )
I’ve confirmed that running this script by hand fixes the problem when it has occurred. Having the script run automatically as part of the install also seems to make the problem much less likely to occur in the first place.
Restarting the patroni service manually fixes the problem, but this is not something that can be done from another node.
patronictl reinit and patronictl restart do not fix the issue.
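For reference, a minimal sketch of the kind of local remediation such a fix script could perform, assuming the stale lock file sits in the PostgreSQL data directory and Patroni runs under systemd; the paths, unit name, and the lock-file removal step are assumptions, not the confirmed contents of the actual script.

```python
import os
import subprocess

# Illustrative paths and unit name -- the real locations depend on the install.
PGDATA = '/var/lib/pgsql/data'
LOCK_FILE = os.path.join(PGDATA, '.s.PGSQL.5432.lock')
PATRONI_UNIT = 'patroni'


def remediate_stale_lock():
    """Remove a leftover postgres socket lock file (if any) and restart the
    local Patroni service so the node can start postgres and rejoin the cluster."""
    if os.path.exists(LOCK_FILE):
        os.remove(LOCK_FILE)
    # Per the notes above, the restart has to happen locally on the affected
    # node; it cannot be triggered from another node.
    subprocess.check_call(['systemctl', 'restart', PATRONI_UNIT])
```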
Reproduction on CircleCI seems to be more reliable.
This Jira could potentially become a blocker; I have added the blocker flag to make sure we do not forget it.