DB cluster initialization can leave the async node offline

Description

In rare cases (observed once in ~20 test runs today), the DB cluster initialization can end with a master and a sync replica, but with the async replica not running.

In this case, the dead node's Patroni journal showed: postgres lock file "./.s.PGSQL.5432.lock" already exists

This is likely fixable by running a reinit on the failing database node. When we have 3+ DB nodes up in a cluster, we should therefore run health checks with a timeout to make sure all nodes become healthy, reinit any node that does not become healthy within the time limit, and then check again (see the sketch below).
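
A minimal sketch of what such a check could look like, assuming patronictl is available, a config file at /etc/patroni/patroni.yml, a cluster named "postgres", and a 120-second time limit (all of these are assumptions, not values from this ticket). Note that a later comment on this ticket found patronictl reinit insufficient for this particular failure, so the reinit step may need to become a local Patroni service restart instead:

    #!/usr/bin/env bash
    # Sketch: wait for every cluster member to report a healthy state,
    # reinit any that have not within the time limit, then check again.
    CONFIG=/etc/patroni/patroni.yml   # assumed config path
    CLUSTER=postgres                  # assumed cluster name
    TIMEOUT=120                       # assumed time limit, in seconds

    unhealthy_members() {
        # Print member names whose State column is not running/streaming.
        # Table parsing is approximate; patronictl output varies by version.
        patronictl -c "$CONFIG" list |
            awk -F'|' 'NF > 5 && $2 !~ /Member/ && $5 !~ /running|streaming/ { gsub(/ /, "", $2); print $2 }'
    }

    deadline=$((SECONDS + TIMEOUT))
    while (( SECONDS < deadline )); do
        [ -z "$(unhealthy_members)" ] && exit 0   # all nodes healthy
        sleep 5
    done

    # Time limit reached: reinit the stragglers, then verify once more.
    for member in $(unhealthy_members); do
        patronictl -c "$CONFIG" reinit --force "$CLUSTER" "$member"
    done
    sleep 30
    [ -z "$(unhealthy_members)" ]   # exit non-zero if still unhealthy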

Steps to Reproduce

pytest -s cosmo_tester/test_suites/cluster/db_management_test.py::test_db_set_master

This test will need to be run multiple times, as the failure rate seems to be ~5% (based on 20 test runs); one way to loop it is shown below.
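
For example, a simple shell loop can collect failure statistics (the run count and log file name here are arbitrary):

    for i in $(seq 1 20); do
        pytest -s cosmo_tester/test_suites/cluster/db_management_test.py::test_db_set_master \
            || echo "run $i failed" >> repro_failures.log
    done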

Why Propose Close?

None

Activity

Barak Azulay
December 23, 2019, 10:04 AM

This Jira has the potential to be a blocker; I've added the blocker flag to make sure we do not forget it.

geokala
January 6, 2020, 3:51 PM

Repro on CircleCI seems more reliable.

geokala
January 6, 2020, 3:51 PM

Neither patronictl reinit nor patronictl restart fixes the node.

geokala
January 6, 2020, 4:01 PM

Restarting the patroni service manually fixes the problem, but this is not something that can be done from another node.
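
For reference, the manual fix has to be run on the affected node itself; assuming Patroni runs as a systemd unit named patroni (the unit name is an assumption):

    # On the affected node itself:
    sudo systemctl restart patroni
    # Confirm the member rejoins the cluster:
    patronictl -c /etc/patroni/patroni.yml list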

geokala
January 9, 2020, 4:04 PM

Fix script created (see )

I've confirmed that this script, when run by hand, fixes the problem if it has occurred. Setting the script to run automatically as part of the install also seems to make the problem much less likely to occur in the first place.
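
The script itself is not shown in this ticket, so the following is only a hypothetical sketch of the same idea, assuming a systemd unit named patroni and that the stale-lock symptom appears in the journal as quoted in the description:

    #!/usr/bin/env bash
    # Hypothetical sketch, not the actual fix script from this ticket:
    # if the Patroni journal recently showed the stale postgres lock file
    # error, restart the Patroni service on this node.
    if journalctl -u patroni --since "-10min" | grep -q 'already exists'; then
        echo "Stale postgres lock detected; restarting patroni"
        systemctl restart patroni
    fi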

Assignee

geokala

Reporter

geokala

Labels

None

Severity

Medium

Target Version

5.0.5

Premium Only

yes

Found In Version

5.0

QA Owner

geokala

Bug Type

new feature bug

Customer Encountered

No

Customer Name

None

Release Notes

no

Priority

Blocker

Epic Link

None