Feature Request: Ability to Selectively Rollback Node Instances on Failure

Description

We have an urgent feature request from customer c496. From ticket #10257:

We currently have a set of nodes with dependencies between them like this:

Enterprise
Branch (depends on enterprise)
Domain (depends on enterprise)
Overlay (depends on specific branch and domain)
Topology (depends on specific domain and optionally a specific branch and overlay)

When the first branch is provisioned from the upstream system, it ends up creating the enterprise, at least one domain, the branch, one or more overlays, and a topology. The issue we have is that when one of these node instances past the enterprise fails, all of the node instances that were successfully created are rolled back in addition to the instance that failed. Creating these instances in the downstream system is not quick (around 7 minutes for some instances), so when they are rolled back it can take about as long to remove each one again. This greatly increases the time between when the upstream system fires off the request and when it receives a failure response. It also means that once the incorrect input (in most cases this is the problem) is fixed, the full set of node instances is deployed again, adding more wait time before the node instance that failed the previous time gets its chance to try again.

What would be nice is some way to flag a node instance, after its deploy lifecycle has completed, to keep it from being rolled back if another node instance fails while the workflow is running. This would mean that if an enterprise, domain, and branch have been deployed and two overlays are in progress, but one of them fails, then the only thing to roll back would be the failed overlay. All of the successful node instances should be retained and not rolled back.

Current Rollback:
enterprise (create and deploy success)(rolled back)
domain (create and deploy success)(rolled back)
branch (create and deploy success)(rolled back)
overlay1 (create and deploy success)(rolled back)
overlay2 (create success, deploy failure)(rolled back)
topology (not attempted)

Desired Rollback:
enterprise (create and deploy success)
domain (create and deploy success)
branch (create and deploy success)
overlay1 (create and deploy success)
overlay2 (create success, deploy failure)(rolled back)
topology (not attempted)
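
To make the desired behaviour concrete, here is a minimal Python sketch of the selection step that would drive such a rollback. It is illustrative only: NodeInstance, the state values, and the protect_from_rollback flag are assumptions made for the sake of the example, not the real workflow API.

    # Illustrative only: decide which node instances to roll back after a
    # failure, leaving successfully deployed (protected) instances untouched.
    from dataclasses import dataclass

    @dataclass
    class NodeInstance:
        id: str
        state: str                            # e.g. 'started', 'failed'
        protect_from_rollback: bool = False   # hypothetical flag set after a successful deploy

    def instances_to_roll_back(instances):
        # Only instances that did not finish deploying and are not protected
        # should be rolled back.
        return [i for i in instances
                if i.state != 'started' and not i.protect_from_rollback]

    instances = [
        NodeInstance('enterprise_1', 'started', True),
        NodeInstance('domain_1', 'started', True),
        NodeInstance('branch_1', 'started', True),
        NodeInstance('overlay_1', 'started', True),
        NodeInstance('overlay_2', 'failed'),   # deploy failure
    ]
    assert [i.id for i in instances_to_roll_back(instances)] == ['overlay_2']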

Activity

Mohammed Abuaisha
March 17, 2020, 8:21 AM

Great, I will test that once I reach that point, since I'm handling the backend stuff now. Thanks!

Mohammed Abuaisha
March 22, 2020, 11:33 AM

While I was working on this, I hit the following question about how we should apply the rollback when running the scale workflow with a scalable_entity_name that is a scaling group containing multiple members. We could have the following case:

  1. Apply the scale workflow against a scaling group called group_1 which contains 3 members [x, y, z].

  2. The dependencies between these members are z ----> y ----> x.

  3. The initial number of instances of that group is 1, which means that when the install workflow runs, 1 instance will be created for each member.

  4. Suppose we run the scale-out workflow with delta=2. In that case the number of current instances for that group will become 3, provided scaling passes for all group members without any issues.

  5. Suppose that both y and x scaled successfully and z failed. In that case we should roll back only the failed node instances, which here are only z's.

  6. The question is: the scaling was applied against the scaling group group_1, two of its members passed and only one failed, and the number of current instances for that group before scaling was 1. We need to update the number of current instances so that it reflects the number of actual instances for each member. How do you want to reflect the number of instances in this case?

  7. Updating the current instances number to 3 [delta + 1] for the group could cause issues if we later need to scale in, since not all members of the group have 3 instances; only the members that scaled successfully have 3, and the failed one still has 1 instance (illustrated in the sketch below).

  8. The same question applies to scaling in, when one of the members fails and the others pass.

Let me know what you think about this.
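
To illustrate the inconsistency raised in points 6 and 7, here is a small sketch using the numbers from the example above. It is illustrative only and not code from the scale workflow; the member names and counts just mirror the scenario described.

    # Illustrative only: per-member instance counts after a partially failed
    # scale-out of group_1 (initial instances = 1, delta = 2).
    initial = 1
    delta = 2
    member_counts = {
        'x': initial + delta,   # scaled successfully -> 3 instances
        'y': initial + delta,   # scaled successfully -> 3 instances
        'z': initial,           # scaling failed and was rolled back -> still 1 instance
    }

    # Whatever single number is stored as the group's "current instances"
    # (1 or delta + 1 = 3), it cannot match every member, which is why a
    # later scale-in could misbehave.
    group_current = initial + delta
    mismatched = [m for m, c in member_counts.items() if c != group_current]
    print(mismatched)   # ['z'] -> the group-level count does not match member z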


Your input here would also be helpful.

 

Łukasz Maksymczuk
March 24, 2020, 12:04 AM

I think we should merge it more or less as is. Then we can work on this later if needed. It's already such a big change that it's hard to understand.

Also, please add an integration test.

Mohammed Abuaisha
March 24, 2020, 7:45 AM

For now, if scaling is applied to a group and one of its members fails, I roll back all of the members, because it is a bit complicated and we cannot decide which number of current instances to set for that group when not all of its members succeeded. Moreover, leaving it partially scaled would generate inconsistencies when trying to run scaling again.

Mohammed Abuaisha
March 24, 2020, 3:19 PM

I tested with rollback_behaviour=none and did not encounter the extra-node-instances issue from our discussion today, because calling scaling again after that run raises an error: the deployment modification is still in the started state, so another modification cannot be started. rollback_behaviour=all works as expected, without issues.
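
For reference, here is one possible reading of the two settings as a small Python sketch. The function name and the exact semantics of 'none' and 'all' are assumptions based on the discussion above, not the actual implementation.

    # Hypothetical sketch only: how the two discussed rollback_behaviour
    # settings might map to the selection of instances to roll back.
    def instances_to_roll_back(instances, rollback_behaviour):
        if rollback_behaviour == 'none':
            # Assumed meaning: roll back nothing; failed instances are left
            # in place (e.g. for inspection or a later resume).
            return []
        if rollback_behaviour == 'all':
            # Assumed meaning: current behaviour, roll back everything the
            # workflow touched.
            return list(instances)
        raise ValueError('unknown rollback_behaviour: %r' % (rollback_behaviour,))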

The extra instances created after the first scaling run need more investigation; I think we can do that once we have a clear understanding of how we are going to support this option here, and also the resume request. For now, Ofer, I'll leave it to you to decide how we are going to support rollback and resume.

Let me know if you need anything else on this. BTW, I'm not sure if we are going to keep this Jira on this board or move it to the backlog.

Here is the list of PRs related to this Jira:

 

Assignee

Mohammed Abuaisha

Reporter

Eve Land

Labels

Target Version

5.1

QA Owner

None

Premium Only

no

Documentation Required

Yes

Why Blocked?

None

Release Notes

yes

Priority

None

Epic Link

Sprint

None
