Quick successive failovers can leave the master in half-promoted state

Description

Acceptance test for this is system-tests' test_replication[3], but unfortunately this is somewhat random.

What can happen is:

  • a node starts trying to follow the master

  • the master dies before the follow process is completed

  • that node is then elected the new master

  • a retry of the previous follow runs at the same time or after the promote, breaking that node's nginx config

Relevant log (from "that node") - attached as "cloudify-cluster.log"

Activity

Show:
Łukasz Maksymczuk
August 24, 2017, 8:28 AM

Fixed by making sure that when handlers run, they run with the most recent data available.

Instead of retrying many times with the value the handler was originally called with, we refetch between handler calls, so that if the value in consul changes, subsequent retries use the new value.

This means that no longer should a handler be stuck trying to follow a master long after the master has already switched.

Fixed

Assignee

Łukasz Maksymczuk

Reporter

Łukasz Maksymczuk

Labels

Bug Type

None

Target Version

None

Severity

None

Sprint

None

Fix versions

Affects versions

Configure