action #123082

backup of o3 to storage.qa.suse.de was not conducted by rsnapshot since 2021-12 size:M

Added by okurz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category: -
Target version: -
Start date: 2023-01-13
Due date: 2023-02-24
% Done: 0%
Estimated time: -

Description

Observation

On storage.qa.suse.de ls -ltra /storage/rsnapshot/*/openqa.opensuse.org/ shows:

/storage/rsnapshot/alpha.3/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root  6 Nov 19  2021 root
drwxr-xr-x 1 root root  8 Nov 22  2021 .
drwxr-xr-x 1 root root 38 Nov 22  2021 ..

/storage/rsnapshot/_delete.12732/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root  6 Dec 30  2021 root
drwxr-xr-x 1 root root  8 Dec 30  2021 .
drwxr-xr-x 1 root root 66 Dec 31  2021 ..

/storage/rsnapshot/beta.2/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root  0 Aug 25 00:00 .
drwxr-xr-x 1 root root 66 Aug 25 03:10 ..

/storage/rsnapshot/beta.1/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root  0 Sep 22 00:00 .
drwxr-xr-x 1 root root 66 Sep 22 03:38 ..

/storage/rsnapshot/beta.0/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root  0 Oct 24 00:00 .
drwxr-xr-x 1 root root 66 Oct 24 03:14 ..

/storage/rsnapshot/alpha.2/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root  0 Nov 28 00:00 .
drwxr-xr-x 1 root root 66 Nov 28 03:51 ..

/storage/rsnapshot/alpha.1/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root  0 Dec  1 00:00 .
drwxr-xr-x 1 root root 66 Dec  1 03:47 ..

/storage/rsnapshot/alpha.0/openqa.opensuse.org/:
total 0
drwxr-xr-x 1 root root  0 Jan 12 00:00 .
drwxr-xr-x 1 root root 66 Jan 12 05:55 ..

Acceptance criteria

  • AC1: We are alerted if the backup cannot be conducted
  • AC2: o3 is backed up again

Rollback steps

  • Add storage.qa.suse.de back to salt

Related issues: 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #121282: Recover storage.qa.suse.de size:S (Resolved, nicksinger, 2022-12-01)

Actions #1

Updated by okurz over 1 year ago

Actions #2

Updated by mkittler over 1 year ago

  • Assignee set to mkittler

The general (salt) setup still seems to be ok. However, it looks like the SSH connection couldn't be established:

/var/log/rsnapshot
[2023-01-12T00:00:01] /usr/bin/rsnapshot alpha: started
[2023-01-12T00:00:01] echo 26961 > /var/run/rsnapshot.pid
[2023-01-12T00:00:01] mv /storage/rsnapshot/alpha.2/ /storage/rsnapshot/_delete.26961/
[2023-01-12T00:00:01] mv /storage/rsnapshot/alpha.1/ /storage/rsnapshot/alpha.2/
[2023-01-12T00:00:01] mv /storage/rsnapshot/alpha.0/ /storage/rsnapshot/alpha.1/
[2023-01-12T00:00:01] mkdir -m 0755 -p /storage/rsnapshot/alpha.0/openqa.opensuse.org/
[2023-01-12T00:00:01] /usr/bin/rsync -a --delete --numeric-ids --relative --delete-excluded --rsh=/usr/bin/ssh root@openqa.opensuse.org:/var/lib/openqa/testresults /storage/rsnapshot/alpha.0/openqa.opensuse.org/root
[2023-01-12T00:00:01] ssh: connect to host openqa.opensuse.org port 22: Connection refused^M
[2023-01-12T00:00:01] rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
[2023-01-12T00:00:01] rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
[2023-01-12T00:00:01] ssh: connect to host openqa.opensuse.org port 22: Connection refused^M
[2023-01-12T00:00:01] rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
[2023-01-12T00:00:01] rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
[2023-01-12T00:00:01] ssh: connect to host openqa.opensuse.org port 22: Connection refused^M
[2023-01-12T00:00:01] rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
[2023-01-12T00:00:01] rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
[2023-01-12T00:00:01] ssh: connect to host openqa.opensuse.org port 22: Connection refused^M
[2023-01-12T00:00:01] rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
[2023-01-12T00:00:01] rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
[2023-01-12T00:00:01] ssh: connect to host openqa.opensuse.org port 22: Connection refused^M
[2023-01-12T00:00:01] rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
[2023-01-12T00:00:01] rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
[2023-01-12T00:00:01] /usr/bin/rsnapshot alpha: ERROR: /usr/bin/rsync returned 255 while processing root@openqa.opensuse.org:/var/lib/openqa/testresults
[2023-01-12T00:00:01] WARNING: root@openqa.opensuse.org:/var/lib/openqa/archive skipped due to rollback plan
[2023-01-12T00:00:01] WARNING: root@openqa.opensuse.org:/var/lib/openqa/share/factory/iso/fixed skipped due to rollback plan
[2023-01-12T00:00:01] WARNING: root@openqa.opensuse.org:/var/lib/openqa/share/factory/hdd/fixed skipped due to rollback plan
[2023-01-12T00:00:01] WARNING: root@openqa.opensuse.org:/var/lib/openqa/osc-plugin-factory skipped due to rollback plan
[2023-01-12T00:00:01] WARNING: root@openqa.opensuse.org:/var/lib/openqa/osc-plugin-factory-new skipped due to rollback plan
[2023-01-12T00:00:01] mkdir -m 0755 -p /storage/rsnapshot/alpha.0/openqa.suse.de/
[2023-01-12T00:00:01] /usr/bin/rsync -a --delete --numeric-ids --relative --delete-excluded --rsh=/usr/bin/ssh --link-dest=/storage/rsnapshot/alpha.1/openqa.suse.de/root root@openqa.suse.de:/var/lib/openqa/archive /storage/rsnapshot/alpha.0/openqa.suse.de/root
[2023-01-12T04:58:47] /usr/bin/rsync -a --delete --numeric-ids --relative --delete-excluded --hard-links --rsh=/usr/bin/ssh --link-dest=/storage/rsnapshot/alpha.1/openqa.suse.de/root root@openqa.suse.de:/var/lib/openqa/share/factory/iso/fixed /storage/rsnapshot/alpha.0/openqa.suse.de/root
[2023-01-12T04:58:48] /usr/bin/rsync -a --delete --numeric-ids --relative --delete-excluded --hard-links --rsh=/usr/bin/ssh --link-dest=/storage/rsnapshot/alpha.1/openqa.suse.de/root root@openqa.suse.de:/var/lib/openqa/share/factory/hdd/fixed /storage/rsnapshot/alpha.0/openqa.suse.de/root
[2023-01-12T05:55:24] touch /storage/rsnapshot/alpha.0/
[2023-01-12T05:55:24] rm -f /var/run/rsnapshot.pid
[2023-01-12T05:55:24] /usr/bin/rm -rf /storage/rsnapshot/_delete.26961
[2023-01-12T06:03:57] /usr/bin/rsnapshot alpha: ERROR: /usr/bin/rsnapshot alpha: completed, but with some errors

The "connection refused" error is reproducible by running the rsync command manually (or by just running ssh root@openqa.opensuse.org). It works again after adding the usual SSH config required to access o3. After applying salt again via sudo salt storage.qa.suse.de state.apply on OSD the config was reverted again. So this needs to be added to the salt states, SR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/782

Since the errors are only visible in the log file (no service has failed because rsnapshot only runs via cron), it might make sense to use the logwarn approach from o3 for alerting.
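
A minimal sketch of that idea (not the actual logwarn configuration from o3, just a hypothetical check an alerting hook could run against the rsnapshot log):

#!/bin/sh
# fail (non-zero exit) if rsnapshot logged errors so monitoring can alert on it
if grep -q 'ERROR:' /var/log/rsnapshot; then
    echo "rsnapshot reported errors" >&2
    exit 1
fi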

Actions #3

Updated by mkittler over 1 year ago

  • Status changed from New to In Progress
Actions #4

Updated by openqa_review over 1 year ago

  • Due date set to 2023-01-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback

SR for using systemd-timers (instead of cron) so our existing monitoring for systemd-services will catch problems: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/783

I've already tested it (see command in the SR). I've temporarily removed the storage host from salt so the old cron config won't reappear until the SR has been merged (otherwise rsnapshot would be run twice).
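
For reference, a minimal sketch of what such a templated service and timer could look like (unit and timer names match the journal excerpts later in this ticket; the schedule and exact contents are assumptions, the real units come from the salt states in the SR):

# hypothetical minimal units installed under /etc/systemd/system
sudo tee /etc/systemd/system/rsnapshot@.service > /dev/null <<'EOF'
[Unit]
Description=rsnapshot (%i) backup

[Service]
Type=oneshot
ExecStart=/usr/bin/rsnapshot %i
EOF
sudo tee /etc/systemd/system/rsnapshot-alpha.timer > /dev/null <<'EOF'
[Unit]
Description=rsnapshot (alpha) backup timer

[Timer]
OnCalendar=daily
Unit=rsnapshot@alpha.service

[Install]
WantedBy=timers.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now rsnapshot-alpha.timer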

Actions #6

Updated by mkittler over 1 year ago

  • Description updated (diff)
Actions #7

Updated by mkittler over 1 year ago

The backup is still running but presumably works. I'll resolve the ticket when the alpha backup has concluded, as all SRs have already been merged.

EDIT: The backup is still running (2023-01-18); it has been going for almost 2 days now.

Actions #8

Updated by mkittler over 1 year ago

  • Subject changed from backup of o3 to storage.qa.suse.de was not conducted by rsnapshot since 2021-12 to backup of o3 to storage.qa.suse.de was not conducted by rsnapshot since 2021-12 size:M
Actions #9

Updated by mkittler over 1 year ago

The backup is still ongoing (iotop still shows traffic, so it isn't stuck).
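
For reference, the kind of check meant here (hypothetical invocation on storage.qa.suse.de):

# show only processes that are currently doing I/O, aggregated per process
sudo iotop -o -P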

Actions #10

Updated by mkittler over 1 year ago

The backup was interrupted by the weekly reboot but then re-attempted:

…
Jan 22 03:32:34 storage rsnapshot[15789]: rsync error: error in rsync protocol data stream (code 12) at io.c(228) [receiver=3.2.3]
Jan 22 03:32:34 storage rsnapshot[15789]: rsync: connection unexpectedly closed (74158612 bytes received so far) [generator]
Jan 22 03:32:34 storage rsnapshot[15789]: rsync error: unexplained error (code 255) at io.c(228) [generator=3.2.3]
Jan 22 03:32:34 storage rsnapshot[15789]: rsync: [generator] write error: Broken pipe (32)
Jan 22 03:32:34 storage rsnapshot[15789]: ssh: connect to host openqa.suse.de port 22: Connection refused
Jan 22 03:32:34 storage rsnapshot[15789]: rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
Jan 22 03:32:34 storage rsnapshot[15789]: rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
Jan 22 03:32:34 storage rsnapshot[15789]: ssh: connect to host openqa.suse.de port 22: Connection refused
Jan 22 03:32:34 storage rsnapshot[15789]: rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
Jan 22 03:32:34 storage rsnapshot[15789]: rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
Jan 22 03:32:34 storage rsnapshot[15789]: ssh: connect to host openqa.suse.de port 22: Connection refused
Jan 22 03:32:34 storage rsnapshot[15789]: rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
Jan 22 03:32:34 storage rsnapshot[15789]: rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
Jan 22 03:32:34 storage rsnapshot[15789]: ----------------------------------------------------------------------------
Jan 22 03:32:34 storage rsnapshot[15789]: rsnapshot encountered an error! The program was invoked with these options:
Jan 22 03:32:34 storage rsnapshot[15789]: /usr/bin/rsnapshot alpha
Jan 22 03:32:34 storage rsnapshot[15789]: ----------------------------------------------------------------------------
Jan 22 03:32:34 storage rsnapshot[15789]: ERROR: /usr/bin/rsync returned 255 while processing root@openqa.suse.de:/var/lib/openqa/archive
Jan 22 03:32:34 storage rsnapshot[15789]: WARNING: root@openqa.suse.de:/var/lib/openqa/share/factory/iso/fixed skipped due to rollback plan
Jan 22 03:32:34 storage rsnapshot[15789]: WARNING: root@openqa.suse.de:/var/lib/openqa/share/factory/hdd/fixed skipped due to rollback plan
Jan 22 03:32:34 storage rsnapshot[3295]: WARNING: root@openqa.suse.de:/var/lib/openqa/share/factory/hdd/fixed skipped due to rollback plan
Jan 22 03:32:34 storage rsnapshot[15789]: WARNING: Rolling back "openqa.suse.de/root"
Jan 22 03:32:34 storage rsnapshot[15789]: /usr/bin/rm -rf /storage/rsnapshot/alpha.0/openqa.suse.de/root
Jan 22 03:34:41 storage rsnapshot[15789]: /usr/bin/cp -al /storage/rsnapshot/alpha.1/openqa.suse.de/root \
Jan 22 03:34:41 storage rsnapshot[15789]:     /storage/rsnapshot/alpha.0/openqa.suse.de/root
Jan 22 03:51:11 storage rsnapshot[15789]: touch /storage/rsnapshot/alpha.0/
Jan 22 03:51:11 storage rsnapshot[15789]: rm -f /var/run/rsnapshot.pid
Jan 22 03:51:11 storage rsnapshot[15789]: /usr/bin/rm -rf /storage/rsnapshot/_delete.15789
Jan 22 04:01:32 storage systemd[1]: rsnapshot@alpha.service: Main process exited, code=exited, status=1/FAILURE
Jan 22 04:01:32 storage systemd[1]: rsnapshot@alpha.service: Failed with result 'exit-code'.
Jan 22 04:01:32 storage systemd[1]: Failed to start rsnapshot (alpha) backup.
Jan 23 00:00:00 storage systemd[1]: Starting rsnapshot (alpha) backup...
…

Currently the backup is still ongoing (after the interruption). I remember that the reboot was also a problem when setting up the backup initially, so I guess this is expected. Let's see whether it'll be able to finish normally at some point.

Actions #11

Updated by okurz over 1 year ago

  • Due date changed from 2023-01-28 to 2023-02-03

The backup just takes longer than expected. We are ok to wait longer.

Actions #12

Updated by mkittler over 1 year ago

I temporarily removed storage.qa.suse.de from salt, stopped/disabled auto-update.timer, and invoked rebootmgrctl cancel to avoid the nightly reboot of storage.qa.suse.de.
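
For reference, a sketch of those steps as commands on storage.qa.suse.de ("stopped/disabled" read as systemctl disable --now):

# stop and disable the update timer so no reboot gets scheduled
sudo systemctl disable --now auto-update.timer
# cancel a reboot that rebootmgr may already have scheduled
sudo rebootmgrctl cancel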

Actions #13

Updated by livdywan over 1 year ago

  • Due date changed from 2023-02-03 to 2023-02-10

Bumping due date due to hackweek.

Actions #14

Updated by okurz over 1 year ago

  • Description updated (diff)

The alert https://monitor.qa.suse.de/d/GDstorage/dashboard-for-storage?editPanel=65090&tab=alert&orgId=1 about the storage solution likely triggered because the backup now tries to save too much data. So I suggest looking into what is backed up and, if it's too much, excluding something from the backup. After that, ensure that the alert is good again and unpause it.
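
A hypothetical way to check which backup targets take up the most space (paths derived from the rsync calls in the logs above):

# per-target sizes of the most recent alpha snapshot, smallest to largest
sudo du -sh /storage/rsnapshot/alpha.0/*/root/* | sort -h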

Actions #15

Updated by mkittler over 1 year ago

  • Description updated (diff)

We still have 1.5 TiB left on /storage. That's normally considered a huge amount of disk space. The alert is firing because we have nevertheless used more than 85 % of the disk space (87 %).

I would change the alert because we don't need headroom on the scale of terabytes (just because we happen to have an extraordinarily huge disk). That's actually not so easy because the threshold is specified in a quite generic way. Before putting a lot of effort into this it likely makes sense to discuss it in the team first.

However, it may make sense to exclude things anyway because the backup takes a very long time. I'm just not sure what to exclude. So far we back up testresults (for OSD only archived jobs) and fixed assets.

I have resumed the host-up alert as the storage server seems to be up just fine. Not sure why it was paused.

Actions #16

Updated by livdywan over 1 year ago

mkittler wrote:

I have resumed the host-up alert as the storage server seems to be up just fine. Not sure why it was paused.

[Alerting] storage: host up alert came in at 13:51 Berlin time. It certainly looks like it was down at one point, or the alert isn't working properly.

Actions #17

Updated by okurz over 1 year ago

  • Due date changed from 2023-02-10 to 2023-02-24

Discussed in the weekly unblock meeting 2023-02-08. IMHO we can afford an empty 1.5 TiB as buffer and keep our alerting configuration simple. As discussed, please change the backup config of o3 to only back up archived openQA jobs, the same as is done for OSD. Then execute the backup jobs and monitor for some days.
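
For reference, this presumably amounts to backing up /var/lib/openqa/archive from o3 (as already done for OSD) instead of the full testresults tree; a hypothetical way to review the result on storage.qa.suse.de, assuming the default config path:

# list the configured backup sources; after the change o3 should only show
# the archive directory, like OSD
grep '^backup' /etc/rsnapshot.conf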

Actions #18

Updated by mkittler over 1 year ago

The alert is likely because I removed the storage server from salt to avoid the weekly reboot (see #123082#note-12). I have added it back which should resolve the alert soon.

Actions #20

Updated by mkittler over 1 year ago

About the alert: The ping entry for the storage server is missing in /etc/telegraf/telegraf.d/telegraf-webui.conf on OSD. I invoked sudo salt openqa.suse.de state.apply to add it back immediately, which worked, and now the alert is no longer firing.

Note that the list of hosts to ping is generated via salt['mine.get']('*', 'network.interfaces').keys()|list, so presumably when the salt mine loses that data for a certain host, the ping entry is removed from the configuration. So this behavior is "normal". I suppose improving it is out of scope for this ticket.
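
For reference, a hypothetical, more targeted way to get the missing host back into that generated list would be to refresh the mine data and then re-render the telegraf config (both run on the salt master):

# repopulate the mine data of the storage host
sudo salt 'storage.qa.suse.de' mine.update
# re-render the generated telegraf config on OSD
sudo salt 'openqa.suse.de' state.apply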

Actions #22

Updated by mkittler over 1 year ago

It is now also configured. I'll re-trigger another backup to test it. However, before doing that I'll also remove a leftover _delete directory. (Those directories are supposed to be deleted by the end of the rsnapshot run but apparently that hasn't worked in some cases.)
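
For reference, a sketch of that cleanup and the manual re-trigger (the exact name of the leftover directory is not known here, hence the glob):

# remove leftover working copies from aborted rsnapshot runs
sudo rm -rf /storage/rsnapshot/_delete.*
# start the alpha backup outside of the timer schedule
sudo systemctl start rsnapshot@alpha.service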

Actions #23

Updated by mkittler over 1 year ago

Looks like the backup isn't working:

martchus@storage:~> sudo systemctl status rsnapshot@alpha.service 
× rsnapshot@alpha.service - rsnapshot (alpha) backup
     Loaded: loaded (/etc/systemd/system/rsnapshot@.service; static)
     Active: failed (Result: exit-code) since Thu 2023-02-09 06:32:14 CET; 5h 10min ago
TriggeredBy: ● rsnapshot-alpha.timer
    Process: 8728 ExecStart=/usr/bin/rsnapshot alpha (code=exited, status=2)
   Main PID: 8728 (code=exited, status=2)

Feb 09 05:19:32 storage rsnapshot[8728]:     --hard-links --rsh=/usr/bin/ssh \
Feb 09 05:19:32 storage rsnapshot[8728]:     --link-dest=/storage/rsnapshot/alpha.1/openqa.suse.de/root \
Feb 09 05:19:32 storage rsnapshot[8728]:     root@openqa.suse.de:/var/lib/openqa/share/factory/hdd/fixed \
Feb 09 05:19:32 storage rsnapshot[8728]:     /storage/rsnapshot/alpha.0/openqa.suse.de/root
Feb 09 05:27:48 storage rsnapshot[8728]: touch /storage/rsnapshot/alpha.0/
Feb 09 05:27:48 storage rsnapshot[8728]: rm -f /var/run/rsnapshot.pid
Feb 09 05:27:48 storage rsnapshot[8728]: /usr/bin/rm -rf /storage/rsnapshot/_delete.8728
Feb 09 06:32:14 storage systemd[1]: rsnapshot@alpha.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 09 06:32:14 storage systemd[1]: rsnapshot@alpha.service: Failed with result 'exit-code'.
Feb 09 06:32:14 storage systemd[1]: Failed to start rsnapshot (alpha) backup.

Well, it likely had worked but nevertheless returned a non-zero exit code. So maybe there's something wrong with the systemd unit.

Actions #24

Updated by mkittler over 1 year ago

A look at the full log shows the problem:

…
Feb 09 00:04:58 storage rsnapshot[8728]: rsync: [sender] link_stat "/var/lib/openqa/osc-plugin-factory-new" failed: No such file or directory (2)
Feb 09 00:04:58 storage rsnapshot[8728]: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1835) [Receiver=3.2.3]
Feb 09 00:04:58 storage rsnapshot[8728]: rsync: [Receiver] write error: Broken pipe (32)
…
Feb 09 05:27:48 storage rsnapshot[8728]: touch /storage/rsnapshot/alpha.0/
Feb 09 05:27:48 storage rsnapshot[8728]: rm -f /var/run/rsnapshot.pid
Feb 09 05:27:48 storage rsnapshot[8728]: /usr/bin/rm -rf /storage/rsnapshot/_delete.8728
Feb 09 06:32:14 storage systemd[1]: rsnapshot@alpha.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 09 06:32:14 storage systemd[1]: rsnapshot@alpha.service: Failed with result 'exit-code'.
Feb 09 06:32:14 storage systemd[1]: Failed to start rsnapshot (alpha) backup.

So rsnapshot returns the error code from rsync: [sender] link_stat "/var/lib/openqa/osc-plugin-factory-new" failed: No such file or directory (2) even though further commands worked. I suppose that's actually what we want (in terms of general error handling).
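
The missing source can also be confirmed directly (hypothetical check; the rsnapshot config path is assumed to be the default):

# the directory no longer exists on o3, so every run trips over this entry
ssh root@openqa.opensuse.org 'ls -d /var/lib/openqa/osc-plugin-factory-new'
# find where it is still referenced in the local backup configuration
grep 'osc-plugin-factory-new' /etc/rsnapshot.conf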

SR to fix the error: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/794

Actions #25

Updated by mkittler over 1 year ago

  • Status changed from Feedback to Resolved

The SR has been merged and the backup has now succeeded. The disk usage is now at 51 % (after also deleting a leftover _delete directory).

Note that it is currently not possible to find out whether the storage usage alert is paused (or now rather "silenced") or not. So I'll just leave it as-is to be handled by #122845.
