action #175707
closed
coordination #161414: [epic] Improved salt based infrastructure management
OSD backups missing since 2024-11 on backup-vm.qe.nue2.suse.org size:S
Added by okurz 3 months ago.
Updated 28 days ago.
Category:
Regressions/Crashes
Description
Observation
# ls /home/rsnapshot/*/
/home/rsnapshot/alpha.0/:
jenkins.qa.suse.de localhost openqa.opensuse.org
/home/rsnapshot/alpha.1/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
/home/rsnapshot/alpha.2/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
/home/rsnapshot/alpha.3/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
/home/rsnapshot/alpha.4/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
/home/rsnapshot/alpha.5/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
/home/rsnapshot/beta.0/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
/home/rsnapshot/beta.1/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
/home/rsnapshot/beta.2/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
/home/rsnapshot/beta.3/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
/home/rsnapshot/beta.4/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
/home/rsnapshot/beta.5/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
/home/rsnapshot/beta.6/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
/home/rsnapshot/_delete.17191/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
/home/rsnapshot/_delete.17511/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org openqa.suse.de s.qa.suse.de
/home/rsnapshot/_delete.4193/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
/home/rsnapshot/delta.0/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org openqa.suse.de s.qa.suse.de
/home/rsnapshot/delta.1/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org openqa.suse.de s.qa.suse.de
/home/rsnapshot/delta.2/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org openqa.suse.de s.qa.suse.de
/home/rsnapshot/gamma.0/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
/home/rsnapshot/gamma.1/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
/home/rsnapshot/gamma.2/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
/home/rsnapshot/gamma.3/:
jenkins.qa.suse.de localhost openqa-monitor.qa.suse.de openqa.opensuse.org s.qa.suse.de
Acceptance criteria
- AC1: We have alerts when backups are missing
- AC2: There are again current backups from OSD on backup-vm
Suggestions
- Observe that openqa.suse.de is missing in all alpha, beta, gamma snapshots. Possibly related to CC-related firewall changes preventing direct ssh access.
- Also be aware of related #173674
- Config for rsnapshot is in https://gitlab.suse.de/qa-sle/backup-server-salt
- Consider also alerting
- Copied from action #175686: OSD webUI ended up with "502 Bad Gateway" from nginx on 2025-01-17, needed manual restart of openqa-webui added
- Subject changed from OSD backups missing since 2024-11 on backup-vm to OSD backups missing since 2024-11 on backup-vm.qe.nue2.suse.org
- Description updated (diff)
- Subject changed from OSD backups missing since 2024-11 on backup-vm.qe.nue2.suse.org to OSD backups missing since 2024-11 on backup-vm.qe.nue2.suse.org size:S
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from High to Urgent
@dheidler is currently creating a backup by setting up an SSH tunnel, so that we at least have a recent backup until the firewall issue is solved.
/etc/rsnapshot.conf
# osd
backup root@localhost:/etc/ openqa.suse.de/ ssh_args=-p2222
backup_exec ssh -p 2222 root@localhost "cd /tmp; sudo -u postgres ionice -c3 nice -n19 pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"
backup root@localhost:/var/lib/openqa/SQL-DUMPS/ openqa.suse.de/ ssh_args=-p2222
backup root@localhost:/var/log/zypp/ openqa.suse.de/ ssh_args=-p2222
# from osd
ssh -R 2222:localhost:22 backup.qa.suse.de
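The reverse tunnel above only lives as long as the interactive SSH session. A persistent variant could be run on OSD as a systemd unit wrapping autossh (a sketch; the unit name and autossh options are assumptions, not taken from this ticket):

```ini
# /etc/systemd/system/backup-tunnel.service  (hypothetical unit)
[Unit]
Description=Persistent reverse SSH tunnel to backup.qa.suse.de
After=network-online.target
Wants=network-online.target

[Service]
# -M 0 disables autossh's monitor port in favour of SSH keepalives;
# -N opens the tunnel without running a remote command
ExecStart=/usr/bin/autossh -M 0 -N \
    -o ServerAliveInterval=30 -o ServerAliveCountMax=3 \
    -R 2222:localhost:22 backup.qa.suse.de
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Enable with `systemctl enable --now backup-tunnel.service`.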
- Assignee set to dheidler
- Priority changed from Urgent to High
Decreasing the prio as we have a recent backup now.
@dheidler you mentioned a related ticket - can you link it here?
- Status changed from Workable to In Progress
- Status changed from In Progress to Blocked
In the meantime I added an alias to redirect mail for root@backup-vm to the osd-admins mailing list.
This should provide mails for backup issues.
Additionally I wrote a small bash script, run via a cronjob at 23:59 every day, that checks that all
machine folders are present in the *.0 backup folders.
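A minimal sketch of such a check (the function name and host list are illustrative, not the actual script); its message format matches the alert mails quoted further down in this ticket:

```shell
#!/bin/bash
# Sketch of the daily check (hypothetical; the real script may differ):
# verify that every expected host directory exists in each *.0 snapshot.
check_backups() {
    local root=$1; shift
    local missing=0
    for snap in "$root"/*.0; do
        for host in "$@"; do
            if [ ! -d "$snap/$host" ]; then
                echo "'$snap/$host' does not exist!"
                missing=1
            fi
        done
    done
    return $missing
}

# Example (host list mirrors the directories in the observation above):
# check_backups /home/rsnapshot jenkins.qa.suse.de localhost \
#     openqa-monitor.qa.suse.de openqa.opensuse.org openqa.suse.de s.qa.suse.de
```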
Now blocking on the SD ticket.
- Status changed from Blocked to Workable
This is not the case here - the exit code is correct:
# rsnapshot alpha
----------------------------------------------------------------------------
rsnapshot encountered an error! The program was invoked with these options:
/usr/bin/rsnapshot alpha
----------------------------------------------------------------------------
ERROR: /usr/bin/rsync returned 255 while processing root@openqa.suse.de:/etc/
backup-vm:/home/dheidler # echo $?
1
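Per the rsnapshot man page the exit codes are 0 (success), 1 (fatal error) and 2 (completed with warnings), so a monitoring wrapper should only treat 1 as a failure. A sketch (`run_snapshot` is a hypothetical helper, not part of rsnapshot):

```shell
#!/bin/bash
# Hypothetical wrapper distinguishing rsnapshot's exit codes so that
# alerting only fires on real failures, not on warnings.
run_snapshot() {
    "$@"
    local rc=$?
    case $rc in
        0) echo "backup ok" ;;
        2) echo "backup completed with warnings" >&2 ;;
        *) echo "backup FAILED (exit $rc)" >&2; return 1 ;;
    esac
}

# e.g.: run_snapshot rsnapshot alpha
```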
- Status changed from Workable to Blocked
zypper in systemd-status-mail
# /etc/default/systemd-status-mail
ADDRESS=osd-admins@suse.de
# /etc/systemd/system/rsnapshot@.service.d/override.conf
[Unit]
OnFailure=systemd-status-mail@%n.service
Now we should get an email when the service fails.
But we already monitor for failed systemd services. Isn't this redundant?
Then why do we have this ticket?
Also I much prefer an email with the actual error instead of having to look through some statistics first.
OK, but we shouldn't pile up too many custom solutions and should stay scalable.
I suppose the alert mails we've recently seen¹ are due to your tinkering.
¹ e.g.:
'/home/rsnapshot/gamma.0/openqa.suse.de' does not exist!
'/home/rsnapshot/delta.0/openqa.suse.de' does not exist!
- Status changed from Blocked to Feedback
Can we clarify whether this is actually blocking on SD-178756 or something we want to change?
- Status changed from Feedback to Blocked
livdywan wrote in #note-22:
Can we clarify whether this is actually blocking on SD-178756 or something we want to change?
Apparently we are waiting to clarify SD-175078 first.
- Status changed from Blocked to Workable
- Status changed from Workable to Blocked
- Priority changed from High to Low
I checked the current state of backup snapshots on backup-vm:
backup-vm:/home/rsnapshot # ls -ltra */openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14285 Feb 1 04:37 gamma.1/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 2 martchus root 14285 Feb 3 04:37 gamma.2/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 2 martchus root 14285 Feb 3 04:37 delta.0/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14285 Feb 8 03:37 gamma.0/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14285 Feb 10 04:37 beta.6/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 11 04:37 beta.5/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 12 04:37 beta.4/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 13 04:37 beta.3/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 14 04:37 beta.2/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 15 04:16 beta.1/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 16 04:36 beta.0/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 16 12:34 alpha.5/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 16 16:34 alpha.4/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 16 20:35 alpha.3/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 17 00:35 alpha.2/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 17 04:35 alpha.1/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 17 08:35 alpha.0/openqa.suse.de/etc/openqa/openqa.ini
So all good and recent. I don't know where the one additional byte was added, but that's OK :)
Unfortunately we are again (or still) blocked on https://sd.suse.com/servicedesk/customer/portal/1/SD-178756; responses are very sluggish and responsibilities unclear. For now we will just live with the lowered expectations and reduce this to low priority.
There are certain workarounds in place that ensure that we have current backups of osd on backup-vm.
The long term solution would be #173674.
- Status changed from Blocked to Resolved
I guess we can close this one then for now.
Help from infra on this matter is not to be expected and our workarounds are in place.
- Status changed from Resolved to Feedback
dheidler wrote in #note-29:
I guess we can close this one then for now.
Help from infra on this matter is not to be expected and our workarounds are in place.
So where are we at with the ACs? What you're saying sounds more like Rejected than Resolved.
AC1: We have alerts when backups are missing
AC2: There are again current backups from OSD on backup-vm
Do we have alerts and backups right now?
Yes for notifications - see attached image.

Also both ACs are fulfilled. Not sure why it would be rejected.