When restoring a snapshot with many deployments, blueprints and plugins on a 4.4 manager, the REST API becomes unresponsive, probably because of a mismatch between the REST token created with the manager's hash salt and the new hash salt restored from the snapshot. When the gunicorn workers are restarted (after 1k requests), they still have the old token while the REST service is already using the new one. This is related to issue https://cloudifysource.atlassian.net/browse/CY-767 found in newer versions.
OS (CLI), HA cluster, cloud provider
Steps to reproduce:
That approach failed, we’ve ended up with
With the gunicorn worker restart disabled, workers are no longer restarted after serving a set number of requests.
Because we replace the REST security config partway through the install, if a worker does restart then anyone hitting that worker receives a 401 due to the hash mismatch. With the automatic-restart behaviour disabled this won't happen, so the snapshot restore can complete (after which the REST service is restarted anyway).
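As a sketch of what "disabling the worker restart" means in gunicorn terms (the actual Cloudify manager config file path and contents may differ; this is an assumption, not the manager's real config):

```python
# Hypothetical gunicorn configuration file sketch.
# gunicorn's max_requests setting restarts each worker after it has
# served that many requests; the reported behaviour corresponds to
# max_requests = 1000.
# Setting it to 0 (gunicorn's documented default) disables the
# request-count-based restart entirely, so workers keep the token
# they started with for the duration of the snapshot restore.
max_requests = 0

# max_requests_jitter randomizes the restart threshold per worker to
# avoid all workers restarting at once; it is only meaningful when
# max_requests > 0, so it is zeroed here as well.
max_requests_jitter = 0
```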
Side effects would only occur if there is some sort of memory or resource leak in gunicorn, uwsgi, our code, or any of our dependencies. In that case the OOM killer can be expected to kill the affected workers, which will then be resurrected; worst case, systemd will restart gunicorn itself.
Until the OOM killer kicks in, though, the system might suffer from slow behaviour.
Do we know the exact reason or bug that caused us to choose that approach (restarting every 1000 requests)?
We may not need it at all if the original issue has been fixed in the version of gunicorn we currently use.
This is true, depending primarily on paging configuration.
I'm not sure what the exact initial reason was, but I think it's a fairly standard approach for gunicorn. Changing it would require putting in some good-quality tests for detecting resource leaks in gunicorn, which would probably not be an efficient use of resources at the moment (compared to just leaving the restart enabled).