Large snapshot restore failure on Clustered Postgres DB

Description

When restoring a large snapshot (for example, 2.8GB and above), following this procedure:
1- Leave one manager node running [stop the other nodes]
– We also tried turning off all non-essential connections to the DB [shutting off status-reporter on all nodes, and the amqp-postgres service]
2- Activate maintenance mode
3- Start the restore

Note that the

Some time into the process, the restore fails with the error "server closed the connection unexpectedly".

Logs :
Manager Log:

Database log :

When the exception is raised on the manager, the corresponding connection on the DB remains stuck in a rollback action.
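One way to confirm the stuck connection is to look at the standard PostgreSQL `pg_stat_activity` view while the restore is failing. The following is a minimal sketch, not something taken from this ticket: the query text, the constant `ACTIVE_SESSIONS_SQL`, and the helper `find_rollback_sessions` are all hypothetical, and the rows are assumed to be fetched as dicts through whichever DB driver is available.

```python
# Hypothetical sketch: the names and the query are illustrative assumptions,
# not part of the original report.

# pg_stat_activity is a standard PostgreSQL view; this lists non-idle
# sessions, oldest open transaction first.
ACTIVE_SESSIONS_SQL = """
SELECT pid, state, xact_start, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY xact_start
"""


def find_rollback_sessions(rows):
    """Return the rows whose current query is a ROLLBACK.

    `rows` is an iterable of dicts with at least 'pid' and 'query' keys,
    e.g. the result of running ACTIVE_SESSIONS_SQL through a DB driver.
    """
    stuck = []
    for row in rows:
        # 'query' can be NULL for some sessions, so guard against None.
        query = (row.get("query") or "").strip().lower()
        if query.startswith("rollback"):
            stuck.append(row)
    return stuck
```

Comparing `xact_start` of any returned row against the time of the manager exception would indicate whether the rollback belongs to the failed restore.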

Steps to Reproduce

Environment:
OS (CLI), HA cluster, cloud provider
------------------------------------

Steps to reproduce:
------------------
1.
2.
3.

Expected result:
---------------

Actual result:
-------------

Why Propose Close?

None

Activity

geokala
January 23, 2020, 3:30 PM

From testing a reasonably sized customer snapshot, it restored in ~15 seconds without issue.

Therefore, this is probably not a blocker, but is something we should consider for our future snapshot upgrades.

Ofer Yarom
January 23, 2020, 4:19 PM

Removed the blocker tag, as we were able to restore customer-sized snapshots on a cluster. Moving this to post-5.0.5.

Barak Azulay
February 17, 2020, 1:38 PM

Do you think this can/should be considered as a patch for 5.0.5?

geokala
February 17, 2020, 3:03 PM

I think this should be pushed back and made part of the reworking of snapshots that we already started discussing, as our current snapshot handling approach is going to make adaptations to deal with significantly large snapshots much more complex.

The problematic snapshot is significantly (more than an order of magnitude) larger than the largest snapshots we’re expecting to see, so expending effort on this complexity is likely not worthwhile.

Ofer Yarom
February 18, 2020, 7:53 AM

Pushed the target version to 5.2; keeping this tentative for 5.1 if we free up resources.

Assignee

geokala

Reporter

Ahmad Musa

Labels

None

Severity

Medium

Target Version

5.2

Premium Only

no

Found In Version

5.0

QA Owner

None

Bug Type

regression bug

Customer Encountered

No

Customer Name

None

Release Notes

yes

Priority

None

Priority

Critical