Gunicorn restarts REST workers every 1000 requests

Description

Gunicorn restarts each REST worker after it has handled 1000 requests. The parameter is named `GUNICORN_MAX_REQUESTS` and is defined here: `cloudify-manager-install/cfy_manager/components/restservice/config/cloudify-restservice`.
This causes failures in large snapshot restores. When a REST worker handles more than 1000 requests it restarts, and as part of restarting it reloads the REST config files (which were changed as part of the workflow). From that point on all REST calls fail because the REST token is no longer valid.
Possible solutions:
1. Disable the restart - a solid system shouldn't have to restart every 1000 requests.
2. Restart before `create_deployments_env` starts, and make sure we create a valid new REST token.
3. Disable the restart only for the time of the restore.
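
For context on option 1, here is a minimal sketch of how a Gunicorn Python config file could take its `max_requests` setting from the `GUNICORN_MAX_REQUESTS` variable. How the REST service actually passes the value to Gunicorn is not shown in this ticket, so the wiring, the helper name `max_requests_from_env`, and the fallback of 1000 are assumptions. In Gunicorn itself, `max_requests = 0` disables automatic worker restarts.

```python
import os


def max_requests_from_env(environ, default=1000):
    """Return the per-worker request limit; 0 disables automatic restarts.

    The default of 1000 mirrors the value described in this ticket and is
    an assumption about how the service is configured.
    """
    raw = environ.get("GUNICORN_MAX_REQUESTS", "")
    return int(raw) if raw.strip() else default


# Gunicorn reads module-level names from its config file.
# A value of 0 turns worker recycling off.
max_requests = max_requests_from_env(os.environ)
```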


Why Propose Close?

None

Activity

Isaac Shabtay
November 1, 2018, 2:34 PM

The MAX_REQUESTS is already in config.yaml.
I support option 1, but leave it in config.yaml, with the default of 0 (so it's disabled).

Regardless, Snapshot Restore should validate that this value is 0 before starting. This can easily be done by querying the environment variable.
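
A minimal sketch of such a pre-restore check, assuming the variable is `GUNICORN_MAX_REQUESTS` and that it is visible in the environment of the process running the restore. The function name `ensure_worker_restart_disabled` and the exception class are hypothetical, and treating an unset variable as a failure is a design choice of this sketch.

```python
import os


class WorkerRestartEnabledError(RuntimeError):
    """Raised when Gunicorn worker recycling is not confirmed to be off."""


def ensure_worker_restart_disabled(environ=None):
    """Raise unless GUNICORN_MAX_REQUESTS is explicitly set to 0."""
    environ = os.environ if environ is None else environ
    raw = environ.get("GUNICORN_MAX_REQUESTS")
    if raw is None:
        # Unset is treated as "unknown", so the check fails closed.
        raise WorkerRestartEnabledError("GUNICORN_MAX_REQUESTS is not set")
    try:
        value = int(raw)
    except ValueError:
        raise WorkerRestartEnabledError(
            "GUNICORN_MAX_REQUESTS is not an integer: %r" % raw
        )
    if value != 0:
        raise WorkerRestartEnabledError(
            "workers restart every %d requests; expected 0" % value
        )
```

The restore workflow would call `ensure_worker_restart_disabled()` as its first step and abort if it raises.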

geokala
November 1, 2018, 2:49 PM

I am not keen on us disabling the restart: we don't know whether gunicorn has (or will have in the future) problems without it. We don't know that our own code doesn't have problems without it. Given that it's a default option, I expect them to test with it set to that value and maybe not with it disabled.

Therefore I think a better approach would be to update the rest-security.conf atomically at the end of the restore workflow, e.g. by using mv from a temp location, rather than updating it during the restore workflow and hoping it doesn't cause any problems.
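
The atomic update suggested above could be sketched as follows: write the new content to a temporary file in the same directory, then rename it over the target. This is an illustration, not the workflow's actual code; the helper name `replace_file_atomically` is hypothetical. It relies on `os.replace`, which is atomic on POSIX when source and destination are on the same filesystem.

```python
import os
import tempfile


def replace_file_atomically(path, content):
    """Replace `path` with `content` so readers see the old or new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data is on disk before the rename
        os.replace(tmp_path, path)  # atomic rename over the target
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Note that `tempfile.mkstemp` creates the file with owner-only permissions, so a real implementation would likely need to restore the original file's mode and ownership before the rename.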

geokala
November 1, 2018, 2:49 PM

(to be clear, I think the initial issue here should be marked as NOTABUG even if it is a bit of a code smell)

Adi Grabow
November 6, 2018, 12:50 PM

Since this problem is only relevant to `snapshot restore`, we decided to address it directly and not change the Gunicorn restart policy. At the very beginning of the restore workflow we will create a new REST token using the new security config file, then update it in the CTX and restart the REST service. From that point forward all REST calls will be made using the new token, so if Gunicorn restarts our workers it should be fine.

Done

Assignee

Adi Grabow

Reporter

Adi Grabow

Labels

None

Severity

High

Target Version

4.5.5

Premium Only

no

Found In Version

4.5

QA Owner

None

Bug Type

legacy bug

Customer Encountered

No

Customer Name

None

Release Notes

yes

Priority

None

Epic Link

Sprint

None

Priority

Unprioritized