action #134519

closed

QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA - coordination #131525: [epic] Up-to-date and usable LSG QE NUE1 machines

We were not notified that backup.qa.suse.de did not create backups size:M

Added by tinita 9 months ago. Updated 8 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-08-23
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

We did not notice e.g. through alerts that backups were not being updated since July 26.

See #134489

Acceptance criteria

  • AC1: Alerts are received when backup jobs fail

Suggestions

  • ~cron.service was failing~ The cron job was failing, but we were never notified about it. The systemd service does not fail when an individual job fails.
  • Use a systemd timer, so that a failed backup run shows up as a failed systemd service and triggers an alert

Out of scope

  • Try out a simple check for the existence of recent backups

% journalctl -u cron.service
Aug 20 12:00:01 backup-vm rsnapshot[15218]: /usr/bin/rsnapshot alpha: ERROR: Errors were found in /etc/rsnapshot.conf, rsnapshot can not continue.
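
The "simple check for the existence of recent backups" mentioned above could look roughly like the following. This is only a sketch, not something deployed on the backup host: it assumes snapshots live in directories named `<level>.0` (as in the `/home/rsnapshot` listing further down in this ticket) and that the directory mtime reflects the time of the last run. The function names and the age threshold are made up for illustration.

```python
import time
from pathlib import Path
from typing import Optional


def newest_snapshot_age(root: Path, level: str, now: Optional[float] = None) -> Optional[float]:
    """Return the age in seconds of <root>/<level>.0, or None if it does not exist."""
    snapshot = root / f"{level}.0"
    if not snapshot.is_dir():
        return None
    current = time.time() if now is None else now
    # Assumption: the directory mtime is updated whenever a snapshot run completes.
    return current - snapshot.stat().st_mtime


def is_backup_fresh(root: Path, level: str, max_age_seconds: float,
                    now: Optional[float] = None) -> bool:
    """True if the newest snapshot of the given level exists and is recent enough."""
    age = newest_snapshot_age(root, level, now)
    return age is not None and age <= max_age_seconds
```

For example, `is_backup_fresh(Path("/home/rsnapshot"), "alpha", 8 * 3600)` would report whether `alpha.0` is younger than eight hours. The eight-hour threshold is an assumption and would need to match the real interval. A check like this catches missing backups regardless of why the job did not run.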

Related issues 3 (0 open, 3 closed)

Related to openQA Project - action #134837: SLE test repo not updated on OSD, cron service was not running since 2023-08-29, fetchneedles not called size:M (Resolved, livdywan)

Related to openQA Infrastructure - action #136370: systemd service rsnapshot@beta on backup-vm.qe.nue2.suse.org failed due to process conflict (Resolved, okurz, 2023-09-23)

Copied from openQA Infrastructure - action #134489: backup.qa.suse.de does not create backups (Resolved, tinita, 2023-08-22)

Actions #1

Updated by tinita 9 months ago

  • Copied from action #134489: backup.qa.suse.de does not create backups added
Actions #2

Updated by livdywan 9 months ago

  • Subject changed from We were not notified that backup.qa.suse.de did not create backups to We were not notified that backup.qa.suse.de did not create backups size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by nicksinger 9 months ago

  • Assignee set to nicksinger
Actions #4

Updated by nicksinger 9 months ago

  • Status changed from Workable to In Progress
Actions #5

Updated by openqa_review 9 months ago

  • Due date set to 2023-09-09

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by okurz 8 months ago

  • Due date deleted (2023-09-09)
  • Status changed from In Progress to Workable
  • Assignee deleted (nicksinger)

Unassigning nicksinger as discussed in the daily. I recommend taking a look at https://mark.stosberg.com/2016-08-26-rsnapshot-and-systemd/

Actions #7

Updated by okurz 8 months ago

  • Related to action #134837: SLE test repo not updated on OSD, cron service was not running since 2023-08-29, fetchneedles not called size:M added
Actions #8

Updated by livdywan 8 months ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

I'm taking a look using systemd unit templates. Annoyingly I just spent some time remembering where the actual repo was because GitLab tried to convince me it couldn't find it anywhere...

Anyway https://gitlab.suse.de/qa-sle/backup-server-salt is where it's at.

Actions #9

Updated by livdywan 8 months ago

I ended up using systemd timer shorthands in place of Greek letters because that way the name can double as an interval: https://gitlab.suse.de/qa-sle/backup-server-salt/-/merge_requests/13

Actions #10

Updated by openqa_review 8 months ago

  • Due date set to 2023-09-15

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by okurz 8 months ago

https://gitlab.suse.de/qa-sle/backup-server-salt/-/merge_requests/13 merged. It failed in deployment, see https://gitlab.suse.de/qa-sle/backup-server-salt/-/merge_requests/13#note_535341, and was reverted in https://gitlab.suse.de/qa-sle/backup-server-salt/-/merge_requests/14 (merged) and accordingly on backup.qe.nue2.suse.org. The format for "OnCalendar" needs to be changed, see https://www.freedesktop.org/software/systemd/man/systemd.time.html#Calendar%20Events. Please feel welcome to try it out directly on backup.qe.nue2.suse.org before creating an MR.

Actions #12

Updated by livdywan 8 months ago

https://gitlab.suse.de/qa-sle/backup-server-salt/-/merge_requests/15 comes with updated intervals. I used systemd-analyze calendar to validate each interval.
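
For anyone repeating the validation, checking each interval with `systemd-analyze calendar` could be scripted roughly as below. This is a sketch under assumptions, not what the MR contains: the helper names are hypothetical, and the only behaviour relied on is that `systemd-analyze calendar <expression>` exits non-zero when it cannot parse the expression.

```python
import subprocess
from typing import Callable, List, Sequence


def calendar_command(expression: str) -> List[str]:
    """Build the systemd-analyze call for one OnCalendar expression."""
    return ["systemd-analyze", "calendar", expression]


def invalid_expressions(expressions: Sequence[str],
                        run: Callable[..., "subprocess.CompletedProcess"] = subprocess.run
                        ) -> List[str]:
    """Return the expressions that systemd-analyze rejects (non-zero exit code)."""
    rejected = []
    for expression in expressions:
        # `run` is injectable so the logic can be exercised without systemd installed.
        result = run(calendar_command(expression), capture_output=True, text=True)
        if result.returncode != 0:
            rejected.append(expression)
    return rejected
```

Calling `invalid_expressions(["*-*-* 00/4:00:00", "*-*-* 03:30:00", "Sat *-*-* 03:30:00", "*-*-01 02:00:00"])` on a host with systemd returns the subset that fails to parse. These four expressions are illustrative guesses for a four-hourly, daily, weekly and monthly schedule, not the values from the MR.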

Actions #13

Updated by okurz 8 months ago

https://gitlab.suse.de/qa-sle/backup-server-salt/-/merge_requests/15 merged. Will apply.

EDIT: From salt

Summary for local
-------------
Succeeded: 16 (changed=9)
Failed:     0
-------------
Total states run:     16
Total run time:   72.077 s

and

# systemctl list-timers | grep rsnapshot
Tue 2023-09-05 16:00:00 CEST 2h 20min left       n/a                          n/a           rsnapshot-alpha.timer        rsnapshot@alpha.service
Wed 2023-09-06 03:30:00 CEST 13h left            n/a                          n/a           rsnapshot-beta.timer         rsnapshot@beta.service
Sat 2023-09-09 03:30:00 CEST 3 days left         n/a                          n/a           rsnapshot-gamma.timer        rsnapshot@gamma.service
Sun 2023-10-01 02:00:00 CEST 3 weeks 4 days left n/a                          n/a           rsnapshot-delta.timer        rsnapshot@delta.service

Please monitor over the next few days to see whether backups are actually being created.

Actions #14

Updated by okurz 8 months ago

  • Due date deleted (2023-09-15)
  • Status changed from In Progress to Resolved

backup-vm:/home/rsnapshot # ls -ltra
total 8
drwxr-xr-x  7 root root  131 Apr 27 04:21 delta.2
drwxr-xr-x  7 root root  131 May 25 04:28 delta.1
drwxr-xr-x  7 root root  131 Jun 29 04:49 delta.0
drwxr-xr-x  7 root root  131 Jul  3 04:34 gamma.3
drwxr-xr-x  7 root root  131 Jul 13 04:48 gamma.2
drwxr-xr-x  7 root root  131 Jul 23 04:20 gamma.1
drwxr-xr-x 79 root root 4096 Aug  8 18:32 ..
drwxr-xr-x  8 root root  151 Aug 24 04:36 gamma.0
drwxr-xr-x  8 root root  151 Aug 30 04:17 beta.6
drwxr-xr-x  8 root root  151 Aug 31 04:16 beta.5
drwxr-xr-x  8 root root  151 Sep  1 04:16 beta.4
drwxr-xr-x  8 root root  151 Sep  2 04:19 beta.3
drwxr-xr-x  8 root root  151 Sep  3 04:22 beta.2
drwxr-xr-x  8 root root  151 Sep  4 04:17 beta.1
drwxr-xr-x  8 root root  151 Sep  5 04:17 beta.0
drwxr-xr-x  8 root root  151 Sep  5 12:18 alpha.5
drwxr-xr-x  8 root root  151 Sep  5 16:23 alpha.4
drwxr-xr-x  8 root root  151 Sep  5 20:24 alpha.3
drwxr-xr-x  8 root root  151 Sep  6 00:20 alpha.2
drwxr-xr-x  8 root root  151 Sep  6 04:18 alpha.1
drwxr-xr-x  8 root root  151 Sep  6 08:20 alpha.0

looks good

Actions #15

Updated by okurz 7 months ago

  • Related to action #136370: systemd service rsnapshot@beta on backup-vm.qe.nue2.suse.org failed due to process conflict added