action #175707

open

coordination #161414: [epic] Improved salt based infrastructure management

OSD backups missing since 2024-11 on backup-vm.qe.nue2.suse.org size:S

Added by okurz about 2 months ago. Updated 18 days ago.

Status:
Blocked
Priority:
Low
Assignee:
Category:
Regressions/Crashes
Start date:
2025-01-17
Due date:
% Done:

0%

Estimated time:

Description

Observation

# ls /home/rsnapshot/*/
/home/rsnapshot/alpha.0/:
jenkins.qa.suse.de  localhost  openqa.opensuse.org

/home/rsnapshot/alpha.1/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

/home/rsnapshot/alpha.2/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

/home/rsnapshot/alpha.3/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

/home/rsnapshot/alpha.4/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

/home/rsnapshot/alpha.5/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

/home/rsnapshot/beta.0/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

/home/rsnapshot/beta.1/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

/home/rsnapshot/beta.2/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

/home/rsnapshot/beta.3/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

/home/rsnapshot/beta.4/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

/home/rsnapshot/beta.5/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

/home/rsnapshot/beta.6/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

/home/rsnapshot/_delete.17191/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

/home/rsnapshot/_delete.17511/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  openqa.suse.de  s.qa.suse.de

/home/rsnapshot/_delete.4193/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

/home/rsnapshot/delta.0/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  openqa.suse.de  s.qa.suse.de

/home/rsnapshot/delta.1/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  openqa.suse.de  s.qa.suse.de

/home/rsnapshot/delta.2/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  openqa.suse.de  s.qa.suse.de

/home/rsnapshot/gamma.0/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

/home/rsnapshot/gamma.1/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

/home/rsnapshot/gamma.2/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

/home/rsnapshot/gamma.3/:
jenkins.qa.suse.de  localhost  openqa-monitor.qa.suse.de  openqa.opensuse.org  s.qa.suse.de

Acceptance criteria

  • AC1: We have alerts when backups are missing
  • AC2: There are again current backups from OSD on backup-vm

Suggestions

  • Observe that openqa.suse.de is missing from all alpha, beta and gamma snapshots, possibly due to CC-related firewall changes preventing direct SSH access.
  • Also be aware of related #173674
  • Config for rsnapshot is in https://gitlab.suse.de/qa-sle/backup-server-salt
  • Consider also alerting

Related issues 2 (1 open, 1 closed)

Related to openQA Infrastructure (public) - action #173674: qamaster-independent backup size:S | Blocked | dheidler | 2024-12-03
Copied from openQA Infrastructure (public) - action #175686: OSD webUI ended up with "502 Bad Gateway" from nginx on 2025-01-17, needed manual restart of openqa-webui | Resolved | okurz | 2025-01-17
Actions #1

Updated by okurz about 2 months ago

  • Copied from action #175686: OSD webUI ended up with "502 Bad Gateway" from nginx on 2025-01-17, needed manual restart of openqa-webui added
Actions #2

Updated by okurz about 2 months ago

  • Subject changed from OSD backups missing since 2024-11 on backup-vm to OSD backups missing since 2024-11 on backup-vm.qe.nue2.suse.org
  • Description updated (diff)
Actions #3

Updated by okurz about 2 months ago

  • Subject changed from OSD backups missing since 2024-11 on backup-vm.qe.nue2.suse.org to OSD backups missing since 2024-11 on backup-vm.qe.nue2.suse.org size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz about 2 months ago

  • Priority changed from High to Urgent
Actions #5

Updated by tinita about 2 months ago · Edited

@dheidler is currently creating a backup by setting up an SSH tunnel, so that we at least have some recent backup until the firewall issue is solved.

/etc/rsnapshot.conf
# osd
backup  root@localhost:/etc/    openqa.suse.de/ ssh_args=-p2222
backup_exec     ssh -p 2222 root@localhost "cd /tmp; sudo -u postgres ionice -c3 nice -n19 pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"
backup  root@localhost:/var/lib/openqa/SQL-DUMPS/       openqa.suse.de/ ssh_args=-p2222
backup  root@localhost:/var/log/zypp/   openqa.suse.de/ ssh_args=-p2222
# from osd
ssh -R 2222:localhost:22 backup.qa.suse.de
Actions #6

Updated by tinita about 2 months ago · Edited

  • Assignee set to dheidler
  • Priority changed from Urgent to High

Decreasing the prio as we have a recent backup now.
@dheidler you mentioned a related ticket - can you link it here?

Actions #8

Updated by tinita about 2 months ago

Actions #9

Updated by okurz about 2 months ago

SSH to OSD was not enabled again during work on #170368, https://confluence.suse.com/display/qasle/Request+for+Change+(RFC)+-+Allow++specific+traffic+from+VLAN+ID+192+towards+openqa.suse.de+and+proxy.scc.suse.de and (not accessible) https://sd.suse.com/servicedesk/customer/portal/1/SD-174726. With the same reasoning as applied there, we should request that SSH be opened. Please open an SD ticket and request SSH access from NUE2 to OSD. Feel free to reference other two-month-old requests that are not being handled and the critical absence of backups.

Actions #10

Updated by dheidler about 2 months ago

  • Status changed from Workable to In Progress
Actions #12

Updated by dheidler about 2 months ago

  • Status changed from In Progress to Blocked

In the meantime I added an alias to redirect mail for root@backup-vm to osd-admins mailing list.
This should provide mails for backup issues.
Additionally, I wrote a small script, run by a cron job at 23:59 every day, that checks that all machine folders are present in the *.0 backup folders.

Now blocking on the SD ticket.
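
The actual script is not attached to this ticket. As a rough illustration only, the check described above could look like the following sketch. The function name `check_snapshot_hosts` and its argument layout are assumptions, not the deployed implementation.

```shell
#!/bin/bash
# Sketch only: NOT the script deployed on backup-vm.
# check_snapshot_hosts ROOT HOST...
# Looks at every ROOT/*.0 snapshot and reports each HOST directory that is absent.
# Returns 0 if everything is present, 1 otherwise.
check_snapshot_hosts() {
    local root=$1; shift
    local found=0 missing=0 snapshot host
    for snapshot in "$root"/*.0; do
        # If nothing matches, the glob stays literal; skip it.
        [ -d "$snapshot" ] || continue
        found=1
        for host in "$@"; do
            if [ ! -d "$snapshot/$host" ]; then
                echo "'$snapshot/$host' does not exist!"
                missing=1
            fi
        done
    done
    if [ "$found" -eq 0 ]; then
        echo "no *.0 snapshot found under '$root'"
        return 1
    fi
    return "$missing"
}
```

Called as `check_snapshot_hosts /home/rsnapshot openqa.suse.de openqa.opensuse.org` from a cron job, any output would be mailed to root by cron, and with the alias mentioned above it would reach the osd-admins list.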

Actions #13

Updated by okurz about 2 months ago

  • Status changed from Blocked to Workable

This looks related to https://github.com/rsnapshot/rsnapshot/issues/102
Can you look into rsnapshot not failing on errors?

Actions #14

Updated by dheidler about 2 months ago

This is not the case here - the exit code is correct:

# rsnapshot alpha
----------------------------------------------------------------------------
rsnapshot encountered an error! The program was invoked with these options:
/usr/bin/rsnapshot alpha
----------------------------------------------------------------------------
ERROR: /usr/bin/rsync returned 255 while processing root@openqa.suse.de:/etc/
backup-vm:/home/dheidler # echo $?
1
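
For context, the rsnapshot man page documents three exit values: 0 for success, 1 for a fatal error and 2 for a backup that finished with warnings. A minimal sketch of how a wrapper could turn the status shown above into a readable message follows. The helper name `classify_rsnapshot_exit` is hypothetical and not part of rsnapshot.

```shell
#!/bin/bash
# Sketch only: maps an rsnapshot exit status to a short description.
# classify_rsnapshot_exit STATUS
classify_rsnapshot_exit() {
    case $1 in
        0) echo "ok" ;;
        1) echo "fatal error" ;;
        2) echo "finished with warnings" ;;
        *) echo "unexpected exit status $1" ;;
    esac
}
```

It could be used as `rsnapshot alpha; classify_rsnapshot_exit $?`, which for the run above would print `fatal error`.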
Actions #15

Updated by dheidler about 2 months ago · Edited

  • Status changed from Workable to Blocked
zypper in systemd-status-mail

# /etc/default/systemd-status-mail
ADDRESS=osd-admins@suse.de

# /etc/systemd/system/rsnapshot@.service.d/override.conf
[Unit]
OnFailure=systemd-status-mail@%n.service

Now we should get an email when the service fails.

Actions #16

Updated by okurz about 2 months ago

But we already monitor for failed systemd services. Isn't this redundant?

Actions #17

Updated by dheidler about 2 months ago · Edited

Then why do we have this ticket?

Also I much prefer an email with the actual error instead of having to look through some statistics first.

Actions #18

Updated by okurz about 2 months ago

Ok, but we shouldn't pile up too many custom solutions and stay scalable.

Actions #20

Updated by mkittler about 1 month ago

I suppose the alert mails we've recently seen¹ are due to your tinkering.

¹ e.g.:

'/home/rsnapshot/gamma.0/openqa.suse.de' does not exist!
'/home/rsnapshot/delta.0/openqa.suse.de' does not exist!
Actions #21

Updated by dheidler about 1 month ago

yes

Actions #22

Updated by livdywan about 1 month ago · Edited

  • Status changed from Blocked to Feedback

Can we clarify whether this is actually blocking on SD-178756 or something we want to change?

Actions #23

Updated by livdywan about 1 month ago

  • Status changed from Feedback to Blocked

livdywan wrote in #note-22:

Can we clarify whether this is actually blocking on SD-178756 or something we want to change?

Apparently we are waiting to clarify SD-175078 first.

Actions #24

Updated by nicksinger about 1 month ago · Edited

  • Status changed from Blocked to Workable

livdywan wrote in #note-23:

livdywan wrote in #note-22:

Can we clarify whether this is actually blocking on SD-178756 or something we want to change?

Apparently we are waiting to clarify SD-175078 first.

#173674 is blocked by https://sd.suse.com/servicedesk/customer/portal/1/SD-175078 and IMHO just related to this topic here.
If anything, this ticket was waiting on feedback in https://sd.suse.com/servicedesk/customer/portal/1/SD-178756, which we have received by now. An urgency mitigation for all the annoying e-mails happened today with https://suse.slack.com/archives/C02AJ1E568M/p1738660626800719

Actions #25

Updated by dheidler about 1 month ago

  • Status changed from Workable to Blocked
Actions #26

Updated by livdywan about 1 month ago

dheidler wrote in #note-25:

Requested wireguard in https://sd.suse.com/servicedesk/customer/portal/1/SD-178756

Currently clarifying what approach we can take with cert.

Actions #27

Updated by okurz 27 days ago

  • Priority changed from High to Low

I checked the current state of backup snapshots on backup-vm

backup-vm:/home/rsnapshot # ls -ltra */openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14285 Feb  1 04:37 gamma.1/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 2 martchus root 14285 Feb  3 04:37 gamma.2/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 2 martchus root 14285 Feb  3 04:37 delta.0/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14285 Feb  8 03:37 gamma.0/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14285 Feb 10 04:37 beta.6/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 11 04:37 beta.5/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 12 04:37 beta.4/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 13 04:37 beta.3/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 14 04:37 beta.2/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 15 04:16 beta.1/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 16 04:36 beta.0/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 16 12:34 alpha.5/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 16 16:34 alpha.4/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 16 20:35 alpha.3/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 17 00:35 alpha.2/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 17 04:35 alpha.1/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 14286 Feb 17 08:35 alpha.0/openqa.suse.de/etc/openqa/openqa.ini

So all good and recent. I don't know what added one additional byte, but that's ok :)
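
The manual check above could in principle be automated to help with AC1. The following is a sketch under the assumption that the modification time of a chosen file reflects the time of the backup run, as it appears to in the listing. The function name `stale_files` and any age threshold are illustrative assumptions.

```shell
#!/bin/bash
# Sketch only: reports files that are missing or older than a given age.
# stale_files MAX_AGE_MINUTES FILE...
# Returns 0 if every FILE exists and was modified less than MAX_AGE_MINUTES ago.
stale_files() {
    local max_age=$1; shift
    local file stale=0
    for file in "$@"; do
        if [ ! -e "$file" ]; then
            echo "missing: $file"
            stale=1
        # find prints the path only if it was modified less than max_age minutes ago.
        elif [ -z "$(find "$file" -maxdepth 0 -mmin "-$max_age")" ]; then
            echo "stale: $file"
            stale=1
        fi
    done
    return "$stale"
}
```

For example, `stale_files 360 /home/rsnapshot/alpha.0/openqa.suse.de/etc/openqa/openqa.ini` would complain if the newest alpha snapshot were older than six hours. The 360-minute threshold is a guess based on the roughly four-hour spacing of the alpha snapshots listed above.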

Unfortunately we are again, or still, blocked on https://sd.suse.com/servicedesk/customer/portal/1/SD-178756; the response is very sluggish and responsibilities are unclear. For now we will just live with the lowered expectations and reduce this to low priority.

Actions #28

Updated by dheidler 18 days ago

There are certain workarounds in place that ensure that we have current backups of osd on backup-vm.

The long term solution would be https://progress.opensuse.org/issues/173674.
