action #123082
backup of o3 to storage.qa.suse.de has not been conducted by rsnapshot since 2021-12 size:M
Description
Observation
On storage.qa.suse.de, ls -ltra /storage/rsnapshot/*/openqa.opensuse.org/ shows:
/storage/rsnapshot/alpha.3/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root  6 Nov 19  2021 root
drwxr-xr-x 1 root root  8 Nov 22  2021 .
drwxr-xr-x 1 root root 38 Nov 22  2021 ..

/storage/rsnapshot/_delete.12732/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root  6 Dec 30  2021 root
drwxr-xr-x 1 root root  8 Dec 30  2021 .
drwxr-xr-x 1 root root 66 Dec 31  2021 ..

/storage/rsnapshot/beta.2/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root  0 Aug 25 00:00 .
drwxr-xr-x 1 root root 66 Aug 25 03:10 ..

/storage/rsnapshot/beta.1/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root  0 Sep 22 00:00 .
drwxr-xr-x 1 root root 66 Sep 22 03:38 ..

/storage/rsnapshot/beta.0/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root  0 Oct 24 00:00 .
drwxr-xr-x 1 root root 66 Oct 24 03:14 ..

/storage/rsnapshot/alpha.2/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root  0 Nov 28 00:00 .
drwxr-xr-x 1 root root 66 Nov 28 03:51 ..

/storage/rsnapshot/alpha.1/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root  0 Dec  1 00:00 .
drwxr-xr-x 1 root root 66 Dec  1 03:47 ..

/storage/rsnapshot/alpha.0/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root  0 Jan 12 00:00 .
drwxr-xr-x 1 root root 66 Jan 12 05:55 ..
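For quick triage, commands along these lines (run as root on storage.qa.suse.de; the log path is the one referenced in the comments below, and the exact success message may vary by rsnapshot version) show whether any snapshot still contains o3 data and when rsnapshot last finished cleanly:

# Per-snapshot size of the o3 backup directories
du -sh /storage/rsnapshot/*/openqa.opensuse.org/
# Last runs that rsnapshot itself considered successful
grep 'completed successfully' /var/log/rsnapshot | tail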
Acceptance criteria
- AC1: We are alerted if the backup cannot be conducted
- AC2: o3 is backed up again
Rollback steps
- Add storage.qa.suse.de back to salt
Related issues
- Related to action #121282: Recover storage.qa.suse.de size:S
History
#1
Updated by okurz 2 months ago
- Related to action #121282: Recover storage.qa.suse.de size:S added
#2
Updated by mkittler 2 months ago
- Assignee set to mkittler
The general (salt) setup still seems OK. However, it looks like the SSH connection couldn't be established:
/var/log/rsnapshot
[2023-01-12T00:00:01] /usr/bin/rsnapshot alpha: started
[2023-01-12T00:00:01] echo 26961 > /var/run/rsnapshot.pid
[2023-01-12T00:00:01] mv /storage/rsnapshot/alpha.2/ /storage/rsnapshot/_delete.26961/
[2023-01-12T00:00:01] mv /storage/rsnapshot/alpha.1/ /storage/rsnapshot/alpha.2/
[2023-01-12T00:00:01] mv /storage/rsnapshot/alpha.0/ /storage/rsnapshot/alpha.1/
[2023-01-12T00:00:01] mkdir -m 0755 -p /storage/rsnapshot/alpha.0/openqa.opensuse.org/
[2023-01-12T00:00:01] /usr/bin/rsync -a --delete --numeric-ids --relative --delete-excluded --rsh=/usr/bin/ssh root@openqa.opensuse.org:/var/lib/openqa/testresults /storage/rsnapshot/alpha.0/openqa.opensuse.org/root
[2023-01-12T00:00:01] ssh: connect to host openqa.opensuse.org port 22: Connection refused
[2023-01-12T00:00:01] rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
[2023-01-12T00:00:01] rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
(the ssh/rsync error triplet above repeats four more times)
[2023-01-12T00:00:01] /usr/bin/rsnapshot alpha: ERROR: /usr/bin/rsync returned 255 while processing root@openqa.opensuse.org:/var/lib/openqa/testresults
[2023-01-12T00:00:01] WARNING: root@openqa.opensuse.org:/var/lib/openqa/archive skipped due to rollback plan
[2023-01-12T00:00:01] WARNING: root@openqa.opensuse.org:/var/lib/openqa/share/factory/iso/fixed skipped due to rollback plan
[2023-01-12T00:00:01] WARNING: root@openqa.opensuse.org:/var/lib/openqa/share/factory/hdd/fixed skipped due to rollback plan
[2023-01-12T00:00:01] WARNING: root@openqa.opensuse.org:/var/lib/openqa/osc-plugin-factory skipped due to rollback plan
[2023-01-12T00:00:01] WARNING: root@openqa.opensuse.org:/var/lib/openqa/osc-plugin-factory-new skipped due to rollback plan
[2023-01-12T00:00:01] mkdir -m 0755 -p /storage/rsnapshot/alpha.0/openqa.suse.de/
[2023-01-12T00:00:01] /usr/bin/rsync -a --delete --numeric-ids --relative --delete-excluded --rsh=/usr/bin/ssh --link-dest=/storage/rsnapshot/alpha.1/openqa.suse.de/root root@openqa.suse.de:/var/lib/openqa/archive /storage/rsnapshot/alpha.0/openqa.suse.de/root
[2023-01-12T04:58:47] /usr/bin/rsync -a --delete --numeric-ids --relative --delete-excluded --hard-links --rsh=/usr/bin/ssh --link-dest=/storage/rsnapshot/alpha.1/openqa.suse.de/root root@openqa.suse.de:/var/lib/openqa/share/factory/iso/fixed /storage/rsnapshot/alpha.0/openqa.suse.de/root
[2023-01-12T04:58:48] /usr/bin/rsync -a --delete --numeric-ids --relative --delete-excluded --hard-links --rsh=/usr/bin/ssh --link-dest=/storage/rsnapshot/alpha.1/openqa.suse.de/root root@openqa.suse.de:/var/lib/openqa/share/factory/hdd/fixed /storage/rsnapshot/alpha.0/openqa.suse.de/root
[2023-01-12T05:55:24] touch /storage/rsnapshot/alpha.0/
[2023-01-12T05:55:24] rm -f /var/run/rsnapshot.pid
[2023-01-12T05:55:24] /usr/bin/rm -rf /storage/rsnapshot/_delete.26961
[2023-01-12T06:03:57] /usr/bin/rsnapshot alpha: ERROR: /usr/bin/rsnapshot alpha: completed, but with some errors
The "connection refused" error is reproducible by running the rsync command manually (or by just running ssh root@openqa.opensuse.org
). It works again after adding the usual SSH config required to access o3. After applying salt again via sudo salt storage.qa.suse.de state.apply
on OSD the config was reverted again. So this needs to be added to the salt states, SR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/782
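For reference, the fix essentially boils down to persisting an SSH client config for o3 on storage.qa.suse.de via salt. A purely illustrative sketch follows; the host name is real, but the port and jump host below are placeholders, not the actual o3 access values (those live in the SR above):

# Hypothetical example only - the real values come from the salt states in the SR
cat >> /root/.ssh/config <<'EOF'
Host openqa.opensuse.org
    Port 2222                          # placeholder, not the real port
    User root
    ProxyJump jump.example.suse.de     # placeholder jump host, may not be needed at all
EOF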
Since the errors are only visible in the log file (no service has failed because rsnapshot only runs via cron), it might make sense to use the logwarn approach from o3 for alerting.
#4
Updated by openqa_review 2 months ago
- Due date set to 2023-01-28
Setting due date based on mean cycle time of SUSE QE Tools
#5
Updated by mkittler 2 months ago
- Status changed from In Progress to Feedback
SR for using systemd-timers (instead of cron) so our existing monitoring for systemd-services will catch problems: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/783
I've already tested it (see the command in the SR). I've temporarily removed the storage host from salt so the old cron config won't reappear until the SR has been merged (otherwise rsnapshot would end up running twice, via cron and via the timer).
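A quick way to check the timer-based setup on the storage host (unit names assumed from the SR; they match the journal excerpts further down in this ticket):

# When will the rsnapshot timers fire next?
sudo systemctl list-timers 'rsnapshot-*'
# Result of the last run for the "alpha" interval
sudo systemctl status rsnapshot@alpha.service
sudo journalctl -u rsnapshot@alpha.service --since today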
#10
Updated by mkittler 2 months ago
The backup was interrupted by the weekly reboot but then re-attempted:
…
Jan 22 03:32:34 storage rsnapshot[15789]: rsync error: error in rsync protocol data stream (code 12) at io.c(228) [receiver=3.2.3]
Jan 22 03:32:34 storage rsnapshot[15789]: rsync: connection unexpectedly closed (74158612 bytes received so far) [generator]
Jan 22 03:32:34 storage rsnapshot[15789]: rsync error: unexplained error (code 255) at io.c(228) [generator=3.2.3]
Jan 22 03:32:34 storage rsnapshot[15789]: rsync: [generator] write error: Broken pipe (32)
Jan 22 03:32:34 storage rsnapshot[15789]: ssh: connect to host openqa.suse.de port 22: Connection refused
Jan 22 03:32:34 storage rsnapshot[15789]: rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
Jan 22 03:32:34 storage rsnapshot[15789]: rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
(the ssh/rsync error triplet above repeats two more times)
Jan 22 03:32:34 storage rsnapshot[15789]: ----------------------------------------------------------------------------
Jan 22 03:32:34 storage rsnapshot[15789]: rsnapshot encountered an error! The program was invoked with these options:
Jan 22 03:32:34 storage rsnapshot[15789]: /usr/bin/rsnapshot alpha
Jan 22 03:32:34 storage rsnapshot[15789]: ----------------------------------------------------------------------------
Jan 22 03:32:34 storage rsnapshot[15789]: ERROR: /usr/bin/rsync returned 255 while processing root@openqa.suse.de:/var/lib/openqa/archive
Jan 22 03:32:34 storage rsnapshot[15789]: WARNING: root@openqa.suse.de:/var/lib/openqa/share/factory/iso/fixed skipped due to rollback plan
Jan 22 03:32:34 storage rsnapshot[15789]: WARNING: root@openqa.suse.de:/var/lib/openqa/share/factory/hdd/fixed skipped due to rollback plan
Jan 22 03:32:34 storage rsnapshot[15789]: WARNING: Rolling back "openqa.suse.de/root"
Jan 22 03:32:34 storage rsnapshot[15789]: /usr/bin/rm -rf /storage/rsnapshot/alpha.0/openqa.suse.de/root
Jan 22 03:34:41 storage rsnapshot[15789]: /usr/bin/cp -al /storage/rsnapshot/alpha.1/openqa.suse.de/root \
Jan 22 03:34:41 storage rsnapshot[15789]:     /storage/rsnapshot/alpha.0/openqa.suse.de/root
Jan 22 03:51:11 storage rsnapshot[15789]: touch /storage/rsnapshot/alpha.0/
Jan 22 03:51:11 storage rsnapshot[15789]: rm -f /var/run/rsnapshot.pid
Jan 22 03:51:11 storage rsnapshot[15789]: /usr/bin/rm -rf /storage/rsnapshot/_delete.15789
Jan 22 04:01:32 storage systemd[1]: rsnapshot@alpha.service: Main process exited, code=exited, status=1/FAILURE
Jan 22 04:01:32 storage systemd[1]: rsnapshot@alpha.service: Failed with result 'exit-code'.
Jan 22 04:01:32 storage systemd[1]: Failed to start rsnapshot (alpha) backup.
Jan 23 00:00:00 storage systemd[1]: Starting rsnapshot (alpha) backup...
…
Currently the backup is still ongoing (after the interruption). I remember that the reboot was also a problem when setting up the backup initially so I guess it is expected. Let's see whether it'll be able to finish normally at some point.
#12
Updated by mkittler about 2 months ago
I temporarily removed storage.qa.suse.de from salt, stopped/disabled auto-update.timer and invoked rebootmgrctl cancel to avoid the nightly reboot of storage.qa.suse.de.
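Roughly the commands involved (a sketch; how exactly the host is taken out of salt depends on team practice, the salt-key call below is just one hypothetical option):

# On storage.qa.suse.de: stop automatic updates and cancel any scheduled reboot
sudo systemctl disable --now auto-update.timer
sudo rebootmgrctl cancel
# On the salt master (OSD): temporarily remove the minion - hypothetical, adjust to actual practice
sudo salt-key -d storage.qa.suse.de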
#13
Updated by cdywan about 2 months ago
- Due date changed from 2023-02-03 to 2023-02-10
Bumping due date due to hackweek.
#14
Updated by okurz about 2 months ago
- Description updated (diff)
The alert https://monitor.qa.suse.de/d/GDstorage/dashboard-for-storage?editPanel=65090&tab=alert&orgId=1 about the storage solution triggered, likely because the backup is now trying to save too much data. So I suggest looking into what is backed up and, if it's too much, excluding something from the backup. After that, ensure that the alert is good again and unpause it.
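To see where the space actually goes on the storage host, something along these lines should do (plain coreutils, run as root; note that du counts files hard-linked across snapshots only once per invocation):

# Overall usage of the backup filesystem
df -h /storage
# Usage per snapshot and per backed-up host within the newest snapshot
du -sh /storage/rsnapshot/*
du -sh /storage/rsnapshot/alpha.0/*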
#15
Updated by mkittler about 2 months ago
- Description updated (diff)
We still have 1.5 TiB left on /storage. That's normally considered a huge amount of disk space. The alert is firing because we have nevertheless used more than 85 % of the disk space (87 %).
I would change the alert because we don't need headroom on the scale of terabytes; we only have that much free because the disk is extraordinarily huge. That's actually not so easy because the threshold is specified in a quite generic way, so before putting a lot of effort into this it likely makes sense to discuss it in the team first.
However, it may make sense to exclude things anyway because the backup takes a very long time. I'm just not sure what to exclude. So far we back up test results (for OSD only archived jobs) and fixed assets.
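To get an overview of what is configured and how large the o3 sources are, a sketch (the rsnapshot config path is the distribution default and may differ in our salt-managed setup; the source paths are taken from the logs above):

# Which backup points and excludes are configured on storage.qa.suse.de
grep -E '^(backup|exclude)' /etc/rsnapshot.conf
# Size of the o3 sources we pull
ssh root@openqa.opensuse.org "du -sh /var/lib/openqa/testresults /var/lib/openqa/share/factory/iso/fixed /var/lib/openqa/share/factory/hdd/fixed"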
I have resumed the host-up alert as the storage server seems to be up just fine. Not sure why it was paused.
#16
Updated by cdywan about 2 months ago
mkittler wrote:
I have resumed the host-up alert as the storage server seems to be up just fine. Not sure why it was paused.
[Alerting] storage: host up alert came in at 13:51 Berlin time. It certainly looks like it was down at one point, or the alert isn't working properly.
#17
Updated by okurz about 2 months ago
- Due date changed from 2023-02-10 to 2023-02-24
Discussed in the weekly unblock meeting 2023-02-08. IMHO we can afford an empty 1.5 TiB as buffer and keep our alerting configuration simple. As discussed, please change the backup config of o3 to only back up archived openQA jobs, the same as is done for OSD. Then execute the backup jobs and monitor for some days.
#18
Updated by mkittler about 2 months ago
The alert likely fired because I removed the storage server from salt to avoid the weekly reboot (see #123082#note-12). I have added it back, which should resolve the alert soon.
#19
Updated by mkittler about 2 months ago
MR for only backing up the archive: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/791
#20
Updated by mkittler about 2 months ago
About the alert: the ping entry for the storage server was missing in /etc/telegraf/telegraf.d/telegraf-webui.conf on OSD. I invoked sudo salt openqa.suse.de state.apply to add it back immediately, which worked, and now the alert is no longer firing.
Note that the list of hosts to ping is generated via salt['mine.get']('*', 'network.interfaces').keys()|list, so presumably when the salt mine loses that data for a certain host the ping is removed from the configuration. So this behavior is "normal". I suppose improving it is out of scope for this ticket.
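To check whether the mine data for the storage host is available again, a sketch using the standard salt CLI on the master:

# Does the mine still know the storage host's network interfaces?
sudo salt openqa.suse.de mine.get '*' network.interfaces | grep storage.qa.suse.de
# If not, ask the minion to refresh its mine data
sudo salt storage.qa.suse.de mine.update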
#21
Updated by okurz about 2 months ago
#22
Updated by mkittler about 2 months ago
It is now also configured. I'll re-trigger another backup to test it. However, before doing that I'll also remove a leftover _delete directory. (Those directories are supposed to be deleted at the end of the rsnapshot run, but apparently that hasn't worked in some cases.)
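For reference, the cleanup boils down to something like this (the pid-file path is the one from the logs above; make sure no backup is running before removing anything):

# Check that no rsnapshot run is active (the pid file is removed at the end of a run)
sudo systemctl is-active rsnapshot@alpha.service rsnapshot@beta.service
ls -l /var/run/rsnapshot.pid 2>/dev/null
# If nothing is running, drop the leftover directories
sudo rm -rf /storage/rsnapshot/_delete.*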
#23
Updated by mkittler about 2 months ago
Looks like the backup isn't working:
martchus@storage:~> sudo systemctl status rsnapshot@alpha.service
× rsnapshot@alpha.service - rsnapshot (alpha) backup
     Loaded: loaded (/etc/systemd/system/rsnapshot@.service; static)
     Active: failed (Result: exit-code) since Thu 2023-02-09 06:32:14 CET; 5h 10min ago
TriggeredBy: ● rsnapshot-alpha.timer
    Process: 8728 ExecStart=/usr/bin/rsnapshot alpha (code=exited, status=2)
   Main PID: 8728 (code=exited, status=2)

Feb 09 05:19:32 storage rsnapshot[8728]: --hard-links --rsh=/usr/bin/ssh \
Feb 09 05:19:32 storage rsnapshot[8728]: --link-dest=/storage/rsnapshot/alpha.1/openqa.suse.de/root \
Feb 09 05:19:32 storage rsnapshot[8728]: root@openqa.suse.de:/var/lib/openqa/share/factory/hdd/fixed \
Feb 09 05:19:32 storage rsnapshot[8728]: /storage/rsnapshot/alpha.0/openqa.suse.de/root
Feb 09 05:27:48 storage rsnapshot[8728]: touch /storage/rsnapshot/alpha.0/
Feb 09 05:27:48 storage rsnapshot[8728]: rm -f /var/run/rsnapshot.pid
Feb 09 05:27:48 storage rsnapshot[8728]: /usr/bin/rm -rf /storage/rsnapshot/_delete.8728
Feb 09 06:32:14 storage systemd[1]: rsnapshot@alpha.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 09 06:32:14 storage systemd[1]: rsnapshot@alpha.service: Failed with result 'exit-code'.
Feb 09 06:32:14 storage systemd[1]: Failed to start rsnapshot (alpha) backup.
Well, it likely had worked but nevertheless returned a non-zero exit code. So maybe there's something wrong with the systemd unit.
#24
Updated by mkittler about 2 months ago
A look at the full log shows the problem:
…
Feb 09 00:04:58 storage rsnapshot[8728]: rsync: [sender] link_stat "/var/lib/openqa/osc-plugin-factory-new" failed: No such file or directory (2)
Feb 09 00:04:58 storage rsnapshot[8728]: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1835) [Receiver=3.2.3]
Feb 09 00:04:58 storage rsnapshot[8728]: rsync: [Receiver] write error: Broken pipe (32)
…
Feb 09 05:27:48 storage rsnapshot[8728]: touch /storage/rsnapshot/alpha.0/
Feb 09 05:27:48 storage rsnapshot[8728]: rm -f /var/run/rsnapshot.pid
Feb 09 05:27:48 storage rsnapshot[8728]: /usr/bin/rm -rf /storage/rsnapshot/_delete.8728
Feb 09 06:32:14 storage systemd[1]: rsnapshot@alpha.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 09 06:32:14 storage systemd[1]: rsnapshot@alpha.service: Failed with result 'exit-code'.
Feb 09 06:32:14 storage systemd[1]: Failed to start rsnapshot (alpha) backup.
So rsnapshot propagates a non-zero exit code because of rsync: [sender] link_stat "/var/lib/openqa/osc-plugin-factory-new" failed: No such file or directory (2), even though the subsequent commands worked. I suppose that's actually what we want (in terms of general error handling).
SR to fix the error: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/794
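To confirm the stale source and spot other backup points that might be gone, a quick check (default rsnapshot config path assumed):

# Does the configured source still exist on o3?
ssh root@openqa.opensuse.org "ls -ld /var/lib/openqa/osc-plugin-factory-new"
# Cross-check every configured source on the storage host
grep '^backup' /etc/rsnapshot.conf | awk '{print $2}'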
#25
Updated by mkittler about 1 month ago
- Status changed from Feedback to Resolved
The SR has been merged and the backup has now succeeded. The disk usage is now at 51 % (after also deleting a leftover _delete directory).
Note that it is currently not possible to find out whether the storage usage alert is paused (or now rather "silenced") or not. So I'll just leave it as-is to be handled by #122845.