[4.4] When restoring a big snapshot, the REST API becomes unresponsive

Description

When restoring a snapshot that contains many deployments, blueprints and plugins on a 4.4 manager, the REST API becomes unresponsive, probably because of a mismatch between the REST token created with the manager's hash salt and the new hash salt restored from the snapshot. When the Gunicorn workers are restarted (after 1,000 requests), they still have the old token while the REST service is already using the new one. This is related to issue https://cloudifysource.atlassian.net/browse/CY-767, found in newer versions.
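A minimal sketch of the failure mode, assuming the REST token embeds a digest keyed on the manager's hash salt (the helper names below are hypothetical, not Cloudify's actual token code):

    import hashlib
    import hmac

    def sign_token(user_id, hash_salt):
        # The token carries a digest keyed on the manager's hash salt.
        digest = hmac.new(hash_salt.encode(), user_id.encode(), hashlib.sha256)
        return '{0}:{1}'.format(user_id, digest.hexdigest())

    def verify_token(token, hash_salt):
        user_id, digest = token.split(':', 1)
        expected = hmac.new(hash_salt.encode(), user_id.encode(), hashlib.sha256)
        return hmac.compare_digest(digest, expected.hexdigest())

    old_salt = 'salt-of-the-running-4.4-manager'
    new_salt = 'salt-restored-from-the-snapshot'

    token = sign_token('admin', old_salt)      # issued before the restore
    assert verify_token(token, old_salt)       # workers with the old salt: OK
    assert not verify_token(token, new_salt)   # recycled workers: 401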

Steps to Reproduce

Environment:
OS (CLI), HA cluster, cloud provider

Steps to reproduce:
1.
2.
3.

Expected result:

Actual result:

Why Propose Close?

None

Activity

geokala
September 4, 2019, 10:09 AM

That approach failed; we’ve ended up with

Barak Azulay
September 4, 2019, 1:08 PM

Please elaborate.

geokala
September 4, 2019, 1:19 PM

With gunicorn worker restart disabled, the workers won’t be restarted after serving a certain number of requests.

Given that we replace the REST security config partway through the install, if the workers do restart, anyone hitting a restarted worker will receive a 401 because of hash mismatches. With the automatic-restart behaviour disabled this won’t happen, so the snapshot restore will complete (and then the REST service will be restarted).

Side effects will only occur if there is some sort of memory/resource leak in gunicorn, uwsgi, our code or any of our dependencies. In that case, the OOM killer can be expected to kill the affected workers, which will then be resurrected. Worst case, systemd will restart gunicorn.
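For context, the restart-after-N-requests behaviour corresponds to gunicorn's max_requests setting; a minimal sketch of disabling it for the duration of the restore (the config file itself is hypothetical, but max_requests and max_requests_jitter are real gunicorn settings):

    # gunicorn_conf.py (hypothetical) -- REST service worker settings.
    workers = 4
    # max_requests = 1000 recycles each worker after 1,000 requests;
    # 0 disables recycling, so a worker keeps its in-memory security
    # config until the whole service is restarted.
    max_requests = 0
    max_requests_jitter = 0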

Barak Azulay
September 4, 2019, 2:52 PM

Until the OOM killer kicks in, the system might suffer from slow behaviour.

Do we know the exact reason or bug that caused us to choose that approach (restart every 1,000 requests)?

We may not need it at all if the issue was fixed in the current version of gunicorn we use.

geokala
September 4, 2019, 3:49 PM

This is true, depending primarily on paging configuration.

I’m not sure what the exact initial reason was, but I think it’s a fairly standard approach for gunicorn. Changing it would require putting in good-quality tests for detecting resource leaks in gunicorn, which would probably not be an efficient use of resources at the moment (compared to just leaving the restart enabled).
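The standard pattern referred to here pairs the request limit with jitter, so that leaking workers are recycled periodically but not all at the same moment; a sketch using gunicorn's real settings:

    # Recycle each worker after roughly 1,000 requests, with up to
    # 100 requests of per-worker jitter to stagger the restarts.
    max_requests = 1000
    max_requests_jitter = 100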

Assignee

geokala

Reporter

Jonathan Abramsohn

Severity

Medium

Target Version

4.4

Premium Only

yes

Found In Version

4.4

QA Owner

None

Bug Type

unknown

Customer Encountered

Yes

Customer Name

c478

Release Notes

yes

Priority

Unprioritized