Gunicorn restarts each REST worker after it has handled 1000 requests. The parameter is `GUNICORN_MAX_REQUESTS`, and it is defined here: `cloudify-manager-install/cfy_manager/components/restservice/config/cloudify-restservice`.
This causes large snapshot restores to fail. Some of the REST workers handle more than 1000 requests during the restore, so they restart, and as part of restarting they reload the REST config files (which were changed as part of the workflow). From that point on, all REST calls fail because the REST token is no longer valid.
Possible solutions:
1. Disable the restart. A solid system shouldn't have to restart every 1000 requests.
2. Restart before `create_deployments_env` starts, and make sure we create a valid new REST token.
3. Disable the restart only for the duration of the restore.
`MAX_REQUESTS` is already exposed in config.yaml.
I support option 1, but we should keep the setting in config.yaml with a default of 0 (so the restart is disabled).
Regardless, snapshot restore should validate that this value is 0 before starting. This can easily be done by querying the environment variable.
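A minimal sketch of such a pre-flight check, assuming the limit is visible to the restore code as the `GUNICORN_MAX_REQUESTS` environment variable (it may in practice only be set in the REST service's own environment). The function name and error type are hypothetical, not existing Cloudify code. In Gunicorn, a `max_requests` value of 0 disables the automatic worker restarts.

```python
import os


class RestoreConfigError(Exception):
    """Raised when the environment is not safe for a snapshot restore."""


def assert_worker_restarts_disabled(environ=None):
    """Fail fast unless GUNICORN_MAX_REQUESTS is 0 (restarts disabled).

    `environ` defaults to os.environ; passing a dict makes the check testable.
    Treating an unset variable as an error is an assumption of this sketch.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("GUNICORN_MAX_REQUESTS")
    if raw is None:
        raise RestoreConfigError("GUNICORN_MAX_REQUESTS is not set")
    try:
        value = int(raw)
    except ValueError:
        raise RestoreConfigError(
            "GUNICORN_MAX_REQUESTS is not an integer: {0!r}".format(raw))
    if value != 0:
        raise RestoreConfigError(
            "GUNICORN_MAX_REQUESTS is {0}; expected 0 before restore".format(value))
```

Calling `assert_worker_restarts_disabled()` as the first step of the restore would abort before any config file is touched.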
I am not keen on us disabling the restart: we don't know whether Gunicorn has (or will have in the future) problems without it, and we don't know that our own code doesn't have problems without it. Given that it's a default option, I expect them to test with it set to that and maybe not with it disabled.
Therefore I think a better approach would be to update rest-security.conf atomically at the end of the restore workflow, e.g. by using `mv` from a temp location, rather than updating it during the restore workflow and hoping it doesn't cause any problems.
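For illustration, the "write to a temp location, then `mv`" idea could look like the sketch below. It assumes Python 3 (`os.replace`) and a POSIX filesystem; `atomic_write` is a hypothetical helper, not part of the restore code.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` (bytes); readers see the old or the new file,
    never a partially written one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem as the target,
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk first
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Note that `mkstemp` creates the file with mode 0600 and the current user as owner, so a real implementation would also need to restore the ownership and permissions that the REST service expects.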
(to be clear, I think the initial issue here should be marked as NOTABUG even if it is a bit of a code smell)
Since this problem is only relevant to `snapshot restore`, we decided to address it directly and not change the Gunicorn restart policy. At the very beginning of the restore workflow we will create a new REST token using the new security config file, then update it in the CTX and restart the REST service. From that point forward all REST calls will be made using the new token, so if Gunicorn restarts our workers it should be fine.
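The ordering described above can be sketched as follows. Everything here is a hypothetical stand-in: `ctx` is modelled as a plain mapping, and the two callables represent whatever the workflow actually uses to build a token from the new security config and to restart the REST service.

```python
def prepare_rest_access(ctx, create_token, restart_rest_service):
    """Switch the workflow to a token that stays valid across worker restarts.

    ctx                  -- mutable mapping holding the workflow's REST token
    create_token         -- callable returning a token built from the NEW
                            security config file
    restart_rest_service -- callable that restarts the REST service so every
                            worker loads the new config
    """
    new_token = create_token()      # 1. token derived from the new config
    ctx["rest_token"] = new_token   # 2. later REST calls use it
    restart_rest_service()          # 3. workers now accept the new token
    return new_token
```

Because the workers already run with the new config after step 3, a later restart triggered by the request limit reloads the same files and does not invalidate the token.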