A cloudify agent is connected to one rabbitmq in cluster mode.
I installed an agent , checked which queue it connected to.
I created a blueprint that adds one more node that runs a script on the agent side and used deployment update.
while performing update i disabled the queue that connected to the agent.
noticed 2 scenarios:
fast disable- the deployment is pending forever(even if i enable the queue).
disable after few seconds- Socket error is shown (attaching screenshot) but the the deployment update is pending , when i enable the queue it perform the update.
I tested another scenario with a blueprint of gcp that creates a network.
i invoke install workflow and immediately disabled the first queue. the deployment pending forever.
Environment:
OS (CLI), HA cluster, cloud provider
------------------------------------
Steps to reproduce:
------------------
1.
2.
3.
Expected result:
---------------
Actual result:
-------------
added a connection retry in both the mgmtworker and the agent
made the cluster client actually retry on downloads from the fileserver
checked the rabbitmq config, we replicate to "all", and we use confirm_delivery everywhere
doesn't seem we can do much more in 5.1. In the next version hopefully we can make it even more resilient in case of rabbitmq failovers but for now, the current state is all we can do