Restic repos constantly getting locked, and piling up jobs until manually unlocked. #1042
Comments
@erenfro Do you see any of the mover pods get killed or fail prior to the lock getting left behind? FYI, you can run an unlock from a ReplicationSource if you don't want to do it manually.
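For reference, a minimal sketch of what that looks like (the resource, PVC, and Secret names here are illustrative, not taken from this issue):

```yaml
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: mealie                 # illustrative name
  namespace: selfhosted
spec:
  sourcePVC: mealie            # illustrative PVC name
  trigger:
    schedule: "0 * * * *"
  restic:
    repository: mealie-volsync-restic   # illustrative Secret name
    copyMethod: Snapshot
    # Setting or changing this value makes the mover run an unlock once
    # before the next sync; it won't unlock again until the value changes.
    unlock: unlock-1
```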
Per the documentation, that requires a string value, and if the value is the same as before it won't unlock again. So it seems like it would work once, but then never again unless the value is updated once more. Since this stale lock is happening multiple times a day, that wouldn't help.
That's correct, the unlock in the ReplicationSource will only unlock once per value. The reason we don't actively unlock the repo is that restic uses those locks to prevent corrupting the repository. The exclusive lock happens during the prune operation, so there's something about the environment that is leading to restic failing during that operation. Instead of unlocking aggressively, let's figure out what's causing the failures... Things that come to mind are:
- Is the mover pod getting OOM-killed or otherwise terminated?
- Is the restic cache volume large enough?
- Does the repository have enough free space?
This I very much do understand. I actually run resticprofile to back up all my Proxmox VE instances as well, on a nightly basis. Those resticprofile backups use the same MinIO S3 storage server running directly on my Synology NAS, and never have issues with locking.
I'm assuming you mean OOM kills managed by the kernel? No nodes show any OOM action taken. Furthermore, each of the K3s nodes is running at 70% or less memory utilization, which leaves roughly 6 GB of RAM available per node, often more.
My restic cache volumes are generally set up with 8Gi, but they're also using local-path, and those local-path volumes have 100Gi or more available. The snapshot volumes are set up with 8Gi, and my average origin volume right now is only 2Gi. My Nextcloud instance is the only one that was using more than 2Gi, and I've since moved it to direct NFS instead.
I've set up a script that uses s3cmd to generate a resticprofile configuration containing all of the pvc- backups, so I can easily run resticprofile against any of them.
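Just to illustrate the shape of it (a hypothetical sketch, not the actual generated file; the endpoint and profile names are made up):

```yaml
# profiles.yaml generated by the script; one profile per pvc- repository
default:
  repository: "s3:https://nas.example.lan:9000/volsync"   # hypothetical MinIO endpoint
  password-file: "restic-password.txt"

pvc-mealie:
  inherit: default
  repository: "s3:https://nas.example.lan:9000/volsync/pvc-mealie"
```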
The NAS server has 3 TiB of free space. This is why I'm scratching my head at all of this: I'm running restic without issue elsewhere. Granted, the main difference is that those are nightly backups rather than hourly, while volsync backs up hourly, often with little change between each backup.
Sounds like you've checked the obvious stuff. My next thought is that you're going to have to catch it when it actually happens. Probably the easiest way to do that is to use the manual trigger and capture the logs of the mover pod failure.
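Roughly like this (a sketch; the manual trigger field is part of the VolSync API, the names are illustrative):

```yaml
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: mealie              # illustrative
  namespace: selfhosted
spec:
  sourcePVC: mealie
  trigger:
    # Replaces the cron schedule; a sync runs each time this value changes,
    # so you can start a backup on demand and watch the mover pod as it runs.
    manual: debug-run-1
  restic:
    repository: mealie-volsync-restic
    copyMethod: Snapshot
```

You can then follow the mover pod's logs while the sync runs and see exactly where it fails.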
Describe the bug
I'm constantly seeing restic repos in a locked state: the logs show there is an exclusive lock on the repository, and yet no backups are running. I'm seeing upwards of 7 backup jobs in red (in k9s's view), all of whose logs show:
Starting container
VolSync restic container version: unknown
backup
restic 0.16.2 compiled with go1.21.3 on linux/amd64
Testing mandatory env variables
== Checking directory for content ===
== Initialize Dir =======
repo already locked, waiting up to 0s for the lock
unable to create lock in backend: repository is already locked exclusively by PID 48 on volsync-src-mealie-96mvk by (UID 0, GID 0)
lock was created at 2023-12-19 12:02:00 (5h24m42.719315996s ago)
storage ID b27ebdc3
the unlock command can be used to remove stale locks
ERROR: failure checking existence of repository
Stream closed EOF for selfhosted/volsync-src-mealie-vrl4b (restic)
Steps to reproduce
Set up volsync to back up using restic, and wait. That's all I did, and it started happening frequently. You can see my setup example here:
Template: https://github.com/erenfro/homelab-flux/tree/main/kubernetes/templates/volsync
Application: https://github.com/erenfro/homelab-flux/tree/main/kubernetes/apps/selfhosted/mealie
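For context, the relevant parts of that template boil down to something like this (a condensed, illustrative sketch rather than a verbatim copy; field values are approximate):

```yaml
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: mealie                # one ReplicationSource per application
  namespace: selfhosted
spec:
  sourcePVC: mealie
  trigger:
    schedule: "0 * * * *"     # hourly backups, often with little change between runs
  restic:
    repository: mealie-volsync-restic   # Secret pointing at the MinIO S3 bucket on the NAS
    copyMethod: Snapshot
    cacheCapacity: 8Gi
    # Prune takes the exclusive lock; if the prune run dies, that lock is left behind.
    pruneIntervalDays: 7
    retain:
      hourly: 24
      daily: 7
```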
Expected behavior
Locks either don't happen at all, or volsync at least waits longer than 0s for a lock to be cleared; if it isn't cleared, it should do some sanity checking and/or attempt an unlock on its own to remove a stale lock and try again, rather than failing outright.
Actual results
The repo becomes locked with a stale lock, backup jobs queue up and error out, and backups cease until the lock is manually cleared.
Additional context
Not sure how else to help, but I hope I've provided enough information to identify the problem itself.