Tasks are stuck after agent loses connection

Description

Steps to reproduce:
1. Run a workflow that has a long-running task:

  • an install workflow like the install.yaml example (the scenario originally reported by the customer), or

  • an execute_operation like the operation.yaml example (simpler to reproduce repeatedly during development)

2. While the remote agent is executing an operation, disconnect the agent from RabbitMQ, for example with iptables: `iptables -A INPUT -p tcp -s 10.0.1.46 -j DROP`
3. Once the operation should have finished (e.g. 2 minutes have passed and the task only waits 60 seconds), restore the connection, for example with `iptables -D INPUT 1`
4. The task is never marked as finished, and the next task never runs
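
To make steps 2 and 3 easier to repeat, the block and unblock can be wrapped in a small helper. This is only a sketch: the `drop_rule` and `blocked` names are made up for this ticket, it assumes it runs as root on the host where the iptables rule is applied, and it deletes the rule by its spec instead of by index (`-D INPUT 1`) so that it still removes the right rule when other rules exist in the chain.

```python
import subprocess
from contextlib import contextmanager


def drop_rule(source_ip):
    """Rule spec shared by the append (-A) and delete (-D) commands."""
    return ["INPUT", "-p", "tcp", "-s", source_ip, "-j", "DROP"]


@contextmanager
def blocked(source_ip, run=subprocess.check_call):
    """Drop incoming TCP traffic from source_ip while the with-block runs.

    `run` is injectable so the commands can be inspected without
    touching iptables (hypothetical helper, not part of the product).
    """
    spec = drop_rule(source_ip)
    run(["iptables", "-A"] + spec)
    try:
        yield
    finally:
        # Delete by rule spec, so the position in the chain does not matter.
        run(["iptables", "-D"] + spec)
```

For example, `with blocked("10.0.1.46"): time.sleep(120)` would keep the connection down for two minutes and restore it afterwards, even if the script is interrupted.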

This happens because we rely on Celery events to receive "task succeeded" messages. Events are designed for real-time use: if an event cannot be sent at the moment the operation finishes, it is never sent later.

Instead, we should use Celery results, because a result is still sent once the connection is restored.
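
The difference between the two delivery styles can be illustrated with a toy model. This is not Celery code and does not reflect how Celery is implemented; `Link`, `EventSender`, `ResultSender` and `simulate` are hypothetical names used only to show why a fire-and-forget message is lost during a disconnect while a stored result still arrives afterwards.

```python
class Link:
    """A connection to the broker that can be up or down."""

    def __init__(self):
        self.up = True
        self.delivered = []


class EventSender:
    """Fire-and-forget: a message that cannot be sent now is dropped."""

    def __init__(self, link):
        self.link = link

    def send(self, message):
        if self.link.up:
            self.link.delivered.append(message)

    def on_reconnect(self):
        # Nothing was kept, so there is nothing to resend.
        pass


class ResultSender:
    """Store-and-retry: unsent messages are kept until the link is back."""

    def __init__(self, link):
        self.link = link
        self.pending = []

    def send(self, message):
        self.pending.append(message)
        self.on_reconnect()

    def on_reconnect(self):
        if self.link.up:
            self.link.delivered.extend(self.pending)
            self.pending.clear()


def simulate(sender_cls):
    """Replay the scenario from this ticket and return what was delivered."""
    link = Link()
    sender = sender_cls(link)
    link.up = False                 # agent loses its connection
    sender.send("task-succeeded")   # operation finishes while disconnected
    link.up = True                  # connection is restored
    sender.on_reconnect()
    return link.delivered
```

In this model `simulate(EventSender)` returns an empty list, which matches the stuck task described above, while `simulate(ResultSender)` returns the "task-succeeded" message.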

This affects all versions (must also be fixed for 4.1.1.1) and all agents (Linux and Windows).

Status

Assignee

Łukasz Maksymczuk

Reporter

Łukasz Maksymczuk

Labels

None

Severity

None

Bug Type

legacy bug

Target Version

4.4

Fix versions

Affects versions

4.3
4.1.1.1