action #152649
[alert] `rsnapshot@alpha.service` failed on `backup.qa.suse.de` size:M
Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-12-13
Due date:
% Done: 0%
Estimated time:
Tags:
Description
Observation
martchus@backup-vm:~> sudo systemctl status rsnapshot@alpha.service
rsnapshot@alpha.service - rsnapshot (alpha) backup
Loaded: loaded (/etc/systemd/system/rsnapshot@.service; static)
Active: failed (Result: exit-code) since Wed 2023-12-13 16:20:27 CET; 49min ago
TriggeredBy: rsnapshot-alpha.timer
Main PID: 14765 (code=exited, status=1/FAILURE)
Dec 13 16:03:36 backup-vm rsnapshot[14765]: WARNING: root@o3:/var/log/zypp/ skipped due to rollback plan
Dec 13 16:03:36 backup-vm rsnapshot[15411]: WARNING: root@o3:/var/log/zypp/ skipped due to rollback plan
Dec 13 16:03:36 backup-vm rsnapshot[14765]: WARNING: root@o3:/srv/tftpboot/ skipped due to rollback plan
Dec 13 16:03:36 backup-vm rsnapshot[15412]: WARNING: root@o3:/srv/tftpboot/ skipped due to rollback plan
Dec 13 16:19:08 backup-vm rsnapshot[14765]: WARNING: Rolling back "openqa.opensuse.org/"
Dec 13 16:19:08 backup-vm rsnapshot[18048]: WARNING: Rolling back "openqa.opensuse.org/"
Dec 13 16:20:27 backup-vm rsnapshot[18290]: /usr/bin/rsnapshot alpha: ERROR: /usr/bin/rsnapshot alpha: completed, but with some errors
Dec 13 16:20:27 backup-vm systemd[1]: rsnapshot@alpha.service: Main process exited, code=exited, status=1/FAILURE
Dec 13 16:20:27 backup-vm systemd[1]: rsnapshot@alpha.service: Failed with result 'exit-code'.
Dec 13 16:20:27 backup-vm systemd[1]: Failed to start rsnapshot (alpha) backup.
Suggestions
- The "alpha" snapshot is the only backup schedule that depends on the network.
- If the cause is only "sporadic network issues", either extend our monitoring so that we actually see those issues, or make the network connection attempts resilient enough to ride out such outages.
- The network communication crosses locations through an IPsec tunnel between NUE2 and PRG2, so we need to be more resilient anyway -> add retries.
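The "add retries" idea could be sketched roughly as follows. This is only an illustration, not what is deployed on the backup host: the `retry` function, its arguments, and the doubling delay are assumptions made for this example, and whether it is safe to simply re-run `rsnapshot alpha` after a partial failure depends on the rsnapshot configuration.

```shell
#!/bin/bash
# Hypothetical helper (not part of rsnapshot or systemd):
# retry <max_attempts> <initial_delay_seconds> <command> [args...]
# Runs the command until it succeeds or max_attempts is reached,
# doubling the delay after every failed attempt.
retry() {
    local attempts=$1 delay=$2
    shift 2
    local n=1
    while true; do
        # Stop as soon as the command succeeds.
        "$@" && return 0
        if (( n >= attempts )); then
            echo "retry: '$*' failed after $n attempts" >&2
            return 1
        fi
        echo "retry: attempt $n failed, sleeping ${delay}s" >&2
        sleep "$delay"
        delay=$(( delay * 2 ))
        n=$(( n + 1 ))
    done
}
```

A call such as `retry 3 60 /usr/bin/rsnapshot alpha` would then try the backup up to three times, waiting 60 and then 120 seconds between attempts, which may be enough to cover a short outage of the tunnel.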