Deployment pending while disable rabbitmq in a cluster

Description

A cloudify agent is connected to one rabbitmq in cluster mode.
I installed an agent , checked which queue it connected to.
I created a blueprint that adds one more node that runs a script on the agent side and used deployment update.
while performing update i disabled the queue that connected to the agent.
noticed 2 scenarios:
fast disable- the deployment is pending forever(even if i enable the queue).
disable after few seconds- Socket error is shown (attaching screenshot) but the the deployment update is pending , when i enable the queue it perform the update.

I tested another scenario with a blueprint of gcp that creates a network.
i invoke install workflow and immediately disabled the first queue. the deployment pending forever.

Steps to Reproduce

Environment:
OS (CLI), HA cluster, cloud provider
------------------------------------

Steps to reproduce:
------------------
1.
2.
3.

Expected result:
---------------

Actual result:
-------------

Why Propose Close?

None

Activity

Show:
Łukasz Maksymczuk
August 31, 2020, 11:48 AM
  • added a connection retry in both the mgmtworker and the agent

  • made the cluster client actually retry on downloads from the fileserver

  • checked the rabbitmq config, we replicate to "all", and we use confirm_delivery everywhere

doesn't seem we can do much more in 5.1. In the next version hopefully we can make it even more resilient in case of rabbitmq failovers but for now, the current state is all we can do

Done

Assignee

Łukasz Maksymczuk

Reporter

Adar Shaked

Labels

Severity

Low

Target Version

5.1

Premium Only

no

Found In Version

5.1

QA Owner

None

Bug Type

legacy bug

Customer Encountered

No

Customer Name

None

Release Notes

yes

Priority

Medium

Epic Link

Sprint

None

Priority

Medium
Configure