Operational Defect Database


VMware | 90818

Velero backups failing: Restic attempts to prune backups, but the prune process gets killed.

Last update date:

4/5/2024

Affected products:

Tanzu Kubernetes Grid

Tanzu Mission Control

Affected releases:

1.x

Fixed releases:

No fixed releases provided.

Description:

Symptoms

In Velero deployed through TMC (Data Protection), backups continuously fail and several backups are stuck in a deleting state across several clusters. Whenever Restic attempts to prune backups, the prune process gets killed, leaving behind several locks. The locks in turn prevent Restic backups from proceeding.

The velero backup describe command shows the following error:

Errors: error running command=restic forget --repo=s3:https://<S3>:9001/wld03/01GHDX9S5SVH2KXARM1JAVP2BZ/restic/tbi-loki --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic 92f09770, stdout=, stderr=unable to create lock in backend: repository is already locked exclusively by PID 1398984 on velero-68dbc9f6d9-586m8 by nonroot (UID 65532, GID 65532)
lock was created at 2023-02-01 13:21:51 (4m8.270026172s ago)
storage ID 01187448
the `unlock` command can be used to remove stale locks
: exit status 1

kubectl describe resticrepository.velero.io shows the following error:

Message: error running command=restic prune --repo=s3:https://<S3>:9001/wld03/01GHDX9S5SVH2KXARM1JAVP2BZ/restic/tbi-loki --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=loading indexes...
loading all snapshots...
finding data that is still in use for 100 snapshots
, stderr=: signal: killed

The Velero pod logs show the following errors:

time="2023-02-01T13:24:08Z" level=info msg="Running maintenance on restic repository" logSource="pkg/controller/restic_repository_controller.go:192" resticRepo=velero/tbi-loki-wld03-wvp6p
time="2023-02-01T13:24:08Z" level=info msg="Removing existing deletion requests for backup" backup=wld03-full-daily-20221227210027 controller=backup-deletion deletebackuprequest=velero/wld03-full-daily-20221227210027-7bxqv logSource="pkg/controller/backup_deletion_controller.go:414"
time="2023-02-01T13:24:08Z" level=info msg="Backup has expired" backup=velero/wld03-full-daily-20221204210056 controller=gc expiration="2023-01-03 21:00:56 +0000 UTC" logSource="pkg/controller/gc_controller.go:145"
time="2023-02-01T13:24:08Z" level=info msg="Creating a new deletion request" backup=velero/wld03-full-daily-20221204210056 controller=gc expiration="2023-01-03 21:00:56 +0000 UTC" logSource="pkg/controller/gc_controller.go:203"
time="2023-02-01T13:24:08Z" level=info msg="deletion request 'wld03-full-daily-20221227210027-6jpgk' removed." backup=wld03-full-daily-20221227210027 controller=backup-deletion deletebackuprequest=velero/wld03-full-daily-20221227210027-7bxqv logSource="pkg/controller/backup_deletion_controller.go:431"
time="2023-02-01T13:24:08Z" level=info msg="Backup has expired" backup=velero/wld03-full-daily-20221217210048 controller=gc expiration="2023-01-16 21:00:48 +0000 UTC" logSource="pkg/controller/gc_controller.go:145"
time="2023-02-01T13:24:08Z" level=info msg="Creating a new deletion request" backup=velero/wld03-full-daily-20221217210048 controller=gc expiration="2023-01-16 21:00:48 +0000 UTC" logSource="pkg/controller/gc_controller.go:203"
time="2023-02-01T13:24:08Z" level=warning msg="error pruning repository" error="error running command=restic prune --repo=s3:https://<S3>:9001/wld03/01GHDX9S5SVH2KXARM1JAVP2BZ/restic/tbi-loki --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=, stderr=unable to create lock in backend: repository is already locked exclusively by PID 1398984 on velero-68dbc9f6d9-586m8 by nonroot (UID 65532, GID 65532)\nlock was created at 2023-02-01 13:21:51 (2m17.8354312s ago)\nstorage ID 01187448\nthe `unlock` command can be used to remove stale locks\n: exit status 1" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/repository_manager.go:296" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*repositoryManager).exec" logSource="pkg/controller/restic_repository_controller.go:198" resticRepo=velero/tbi-loki-wld03-wvp6p

The problem persists after restarting both the velero and restic pods.
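
The errors above can be gathered with standard Velero and kubectl tooling; a minimal sketch, assuming the Velero CLI is available, the namespace is velero, and <backup-name> / <repo-name> are placeholders for your own object names:

# Backup-level errors (the failed restic forget / lock message)
$ velero backup describe <backup-name> --details

# Repository status and the prune failure message
$ kubectl -n velero get resticrepositories.velero.io
$ kubectl -n velero describe resticrepositories.velero.io <repo-name>

# Prune and lock errors in the Velero server log
$ kubectl -n velero logs deployment/velero | grep -iE "prun|lock"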

Cause

First, the Restic repository prune starts and creates a lock file. After a while the prune command fails: the command is killed, so it may not get a chance to clear the lock file. The next backup then fails because of the existing lock file:

message: |-
  running Restic backup, stderr=unable to create lock in backend: repository is already locked exclusively by PID 374329 on velero-68dbc9f6d9-ddmcf by nonroot (UID 65532, GID 65532)
  lock was created at 2023-01-08 21:01:10 (45.062813009s ago)
  storage ID 34c11c53
  the `unlock` command can be used to remove stale locks
  : exit status 1
phase: Failed

Not enough memory could be the reason, with the kubelet killing the prune command due to OOM (stderr=: signal: killed). See the related Restic issue: https://github.com/restic/restic/issues/3844

Velero already cleans up stale locks on its own, so manual cleanup will not help; the locks are left over because the prune command keeps getting killed. What appears to be happening is:

stale locks cleaned up --> prune creates new lock files and gets killed --> wait --> loop

So when a backup is in progress, the volume backups using the Restic repository fail because the lock files are present.
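
To check whether memory pressure is actually the cause, compare the Velero pod's usage against its configured limits and look for OOM events; a minimal sketch, assuming the deployment runs in the velero namespace and metrics-server is available for kubectl top:

# Configured requests/limits on the main Velero container
$ kubectl -n velero get deployment velero -o jsonpath='{.spec.template.spec.containers[0].resources}'

# Live memory consumption of the Velero pod (requires metrics-server)
$ kubectl -n velero top pod

# Any OOM-related events recorded in the cluster
$ kubectl get events --all-namespaces | grep -i oom

Note that restic prune runs as a child process inside the Velero container, so the container may not show an OOMKilled status even when the prune process itself is killed.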

Impact / Risks

Volume backups fail and backups get stuck in a deleting state.

Resolution

There is no known resolution yet; only the workaround below is available.

Workaround

The value of spec.maintenanceFrequency may not matter and may already be set to a higher value; since the prune commands are failing, it is not being respected anyway. You can confirm the value with:

$ kubectl -n velero get resticrepositories.velero.io -o yaml

(No value means the default of 7 days.)

Since the prune command is suspected of being killed due to OOM, increase the resources for the Velero deployment:

$ kubectl -n velero edit deployment velero

The current values are probably the recommended defaults:

VeleroCPURequest = "500m"
VeleroMemoryRequest = "128Mi"
VeleroCPULimit = "1000m"
VeleroMemoryLimit = "512Mi"

Raise the limits so that the main velero container looks like the following (kubectl describe format):

Containers:
  velero:
    Image:      extensions.aws-euw1.tmc.cloud.vmware.com/extensions/data-protection-images/velero-v1-9-5@sha256:f9930c7448f717a252895c266e9c662ce59f971f7a8849176f7f3ed75964d87d
    Port:       8085/TCP
    Host Port:  0/TCP
    Command:
      /velero
    Args:
      server
    Limits:
      cpu:     4
      memory:  16Gi
    Requests:
      cpu:     500m
      memory:  128Mi

The velero deployment has multiple init containers, so make sure you change the limits on the main container, or on both the init and main containers, since a limit is only a maximum anyway. Also note that this change is made to the velero deployment, not the restic daemonset.
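
If you prefer to apply the change non-interactively, the same limits can be set with kubectl set resources; a minimal sketch, assuming the main container is named velero and that 4 CPU / 16Gi are suitable for your cluster (adjust as needed):

# Raise limits on the main velero container only; requests stay at the defaults
$ kubectl -n velero set resources deployment/velero \
    --containers=velero \
    --requests=cpu=500m,memory=128Mi \
    --limits=cpu=4,memory=16Gi

# Verify the rollout and the new resource settings
$ kubectl -n velero rollout status deployment/velero
$ kubectl -n velero get deployment velero -o jsonpath='{.spec.template.spec.containers[0].resources}'

Because --containers=velero targets only the named container, the init containers keep their existing values; changing both is equally valid since a limit is just a ceiling.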
