action #134453

closed

QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA - coordination #131525: [epic] Up-to-date and usable LSG QE NUE1 machines

backup.qam.suse.de is Failed according to netbox and not creating backups size:M

Added by livdywan 9 months ago. Updated 8 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Description

Motivation

Netbox lists backup.qam.suse.de as Failed, yet we didn't get any emails about it.

Acceptance criteria

  • AC1: It is known what backup server(s) we should have in netbox
  • AC2: The failure has been resolved.

Suggestions


Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure - action #134051: Eng-Infra maintained DNS server for .qa.suse.de taking over from qanet size:M (Resolved, dheidler, 2023-08-09)

Copied from QA - action #131528: Bring backup.qam.suse.de up-to-date size:M (Resolved, okurz, 2023-06-28)

Copied to openQA Infrastructure - action #134489: backup.qa.suse.de does not create backups (Resolved, tinita, 2023-08-22)

Actions #1

Updated by livdywan 9 months ago

  • Copied from action #131528: Bring backup.qam.suse.de up-to-date size:M added
Actions #2

Updated by livdywan 9 months ago

  • Tags changed from infra, backup.qam.suse.de, machine, nue1, dct migration, next-maxtorhof-visit to infra, backup.qam.suse.de
  • Subject changed from backup.qam.suse.de is Failed according to netbox and not runnin backups size:M to backup.qam.suse.de is Failed according to netbox and not creating backups
  • Assignee deleted (okurz)
  • Priority changed from Normal to High
  • Start date deleted (2023-06-28)
Actions #3

Updated by livdywan 9 months ago

  • Description updated (diff)
Actions #4

Updated by tinita 9 months ago

  • Description updated (diff)
Actions #5

Updated by tinita 9 months ago

  • Description updated (diff)

Last backup is from July 26.

% journalctl -u cron.service
Aug 20 12:00:01 backup-vm rsnapshot[15218]: /usr/bin/rsnapshot alpha: ERROR: Errors were found in /etc/rsnapshot.conf, rsnapshot can not continue.
Actions #6

Updated by tinita 9 months ago

# rsnapshot configtest
----------------------------------------------------------------------------
rsnapshot encountered an error! The program was invoked with these options:
/usr/bin/rsnapshot configtest 
----------------------------------------------------------------------------
ERROR: /etc/rsnapshot.conf on line 42:
ERROR: backup>.root@s.qa:/srv/www/schort/data/links.sqlite s.qa.suse.de/ - \
         missing tabs to separate words - change spaces to tabs. 
ERROR: ---------------------------------------------------------------------
ERROR: Errors were found in /etc/rsnapshot.conf,
ERROR: rsnapshot can not continue. If you think an entry looks right, make
ERROR: sure you don't have spaces where only tabs should be.
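
The class of error reported above (fields separated by spaces where rsnapshot expects tabs) can be caught before a config is deployed. The following is a minimal sketch of such a check, not rsnapshot's own parser: the function name `find_tab_errors`, the field-count rule, and the sample config lines are all hypothetical, and `rsnapshot configtest` remains the authoritative check.

```python
import re

# Directives assumed to need three tab-separated fields
# (keyword, source, destination); everything else is assumed to need two.
THREE_FIELD_KEYWORDS = {"backup", "backup_script"}


def find_tab_errors(conf_text):
    """Return (line_number, line) pairs that have too few tab-separated fields."""
    problems = []
    for number, line in enumerate(conf_text.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines and comments carry no fields.
        if not stripped or stripped.startswith("#"):
            continue
        fields = re.split(r"\t+", stripped)
        keyword = stripped.split()[0]
        required = 3 if keyword in THREE_FIELD_KEYWORDS else 2
        if len(fields) < required:
            problems.append((number, line))
    return problems


# Hypothetical excerpt: line 3 uses a space between source and destination,
# line 5 uses spaces throughout.
sample = (
    "# hypothetical excerpt\n"
    "config_version\t1.2\n"
    "backup\troot@example:/data/file example/\n"
    "backup\troot@example:/data/file\texample/\n"
    "retain alpha 6\n"
)

for number, line in find_tab_errors(sample):
    print(f"line {number}: expected more tab-separated fields: {line!r}")
```

Running a check like this in the CI of the repository that holds the config would flag the problem at review time rather than at the next cron run.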
Actions #7

Updated by tinita 9 months ago

  • Status changed from New to In Progress
  • Assignee set to tinita

I think I repaired the config. The next cron.service run should be at 12:00 CEST, which is in 10 minutes. Let's see...

Actions #8

Updated by tinita 9 months ago

Somehow I cannot edit my own comments.
Just for the record, the config was edited on July 26:

-rw-r--r-- 1 root root 1701 Jul 26 21:40 /etc/rsnapshot.conf

I copied the broken file to /etc/rsnapshot.conf.bak

Actions #9

Updated by tinita 9 months ago

The backup is running. However, I realized that I'm working on backup.qa.suse.de while the ticket is about backup.qam.suse.de (which I cannot even connect to).

Actions #10

Updated by okurz 9 months ago

The Redmine comment issue is discussed in https://progress.opensuse.org/issues/133532

backup.qam.suse.de is now backup-qam.qe.nbg2.suse.org

Actions #11

Updated by livdywan 9 months ago

  • Related to action #134051: Eng-Infra maintained DNS server for .qa.suse.de taking over from qanet size:M added
Actions #12

Updated by tinita 9 months ago

How can I connect to backup.qam.suse.de?

Actions #13

Updated by tinita 9 months ago

How can I connect to backup-qam.qe.nbg2.suse.org?

ssh: Could not resolve hostname backup-qam.qe.nbg2.suse.org: Name or service not known
Actions #14

Updated by tinita 9 months ago

Why is the wiki still talking about backup.qa.suse.de then?
https://progress.opensuse.org/projects/openqav3/wiki/#Backup

Actions #15

Updated by tinita 9 months ago

And why did someone break the rsnapshot config instead of disabling the service? Highly confusing

Actions #16

Updated by tinita 9 months ago

In this comment
https://progress.opensuse.org/issues/132143#note-52
and the following we see related activity around the time the rsnapshot.conf was broken.
This MR https://gitlab.suse.de/qa-sle/backup-server-salt/-/merge_requests/11 is also related, and looking at /root/.ssh/config it has the same content as on backup.qa.suse.de.
So for now I assume I did the right thing, and we have a backup again.

It would be nice if someone could clarify whether backup.qa.suse.de is the correct backup machine or not. Oliver, your comment raised more questions than it answered. Basically only Liv and I are working today, and we are confused.

Then, as Liv suggested, we should investigate why no one was notified that backups weren't running.
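
One way to get notified independently of rsnapshot's own error handling is a freshness check on the newest snapshot. The following is a minimal sketch, assuming the modification time of the newest snapshot directory reflects the last successful run; the function names, the two-day threshold, and the temporary directory standing in for a snapshot directory are hypothetical.

```python
import os
import tempfile
import time
from datetime import timedelta


def snapshot_age(snapshot_dir, now=None):
    """Return how long ago snapshot_dir was last modified."""
    current = time.time() if now is None else now
    return timedelta(seconds=current - os.path.getmtime(snapshot_dir))


def is_stale(snapshot_dir, max_age, now=None):
    """Return True if the snapshot directory is older than max_age."""
    return snapshot_age(snapshot_dir, now) > max_age


# Demonstration: a temporary directory stands in for the newest snapshot
# and is back-dated by 25 days to simulate backups that stopped running.
with tempfile.TemporaryDirectory() as demo_dir:
    back_dated = time.time() - 25 * 86400
    os.utime(demo_dir, (back_dated, back_dated))
    if is_stale(demo_dir, max_age=timedelta(days=2)):
        age_days = snapshot_age(demo_dir).days
        print(f"WARNING: newest snapshot is {age_days} days old")
```

In a real deployment the directory would be the newest snapshot under the configured snapshot root (for the `alpha` interval seen in the cron log, rsnapshot names it `alpha.0`), and the warning would go to whatever alerting channel the team already monitors.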

Actions #17

Updated by tinita 9 months ago

Wow, looking at https://gitlab.suse.de/qa-sle/backup-server-salt/-/blob/master/rsnapshot/rsnapshot.conf#L42 this actually shows the broken config, but the last change of that file was July 2022??

Maybe this wasn't a problem in the past and rsnapshot got updated and is now more strict?

Actions #19

Updated by openqa_review 9 months ago

  • Due date set to 2023-09-05

Setting due date based on mean cycle time of SUSE QE Tools

Actions #20

Updated by livdywan 9 months ago

  • Description updated (diff)
Actions #21

Updated by tinita 9 months ago

  • Status changed from In Progress to Feedback

https://gitlab.suse.de/qa-sle/backup-server-salt/-/merge_requests/12 merged

We still don't know why we weren't notified.

Actions #22

Updated by tinita 9 months ago

  • Copied to action #134489: backup.qa.suse.de does not create backups added
Actions #23

Updated by tinita 9 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (tinita)
Actions #24

Updated by tinita 9 months ago

Created #134489 about backup.qa.suse.de.

Ignore my comments for this ticket

Actions #25

Updated by mkittler 9 months ago

The actual domain is backup-qam.qe.nue2.suse.org (and not backup-qam.qe.nbg2.suse.org and also not backup.qam.suse.de). I've updated the corresponding Confluence page: https://confluence.suse.com/display/maintenanceqa/Backup+Server

(This is a salt-controlled host, so a simple salt-key -L on OSD helps to find the FQDN.)
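
The lookup described above can be scripted once the output of `salt-key -L` has been captured. The following is a minimal sketch, assuming the default text output with section headers such as "Accepted Keys:"; the function names and the other hostnames in the sample are hypothetical.

```python
def accepted_minions(salt_key_output):
    """Return the minion IDs listed under the 'Accepted Keys:' section."""
    minions = []
    in_accepted = False
    for line in salt_key_output.splitlines():
        entry = line.strip()
        # Section headers are assumed to end in "Keys:".
        if entry.endswith("Keys:"):
            in_accepted = entry == "Accepted Keys:"
            continue
        if in_accepted and entry:
            minions.append(entry)
    return minions


def find_minions(salt_key_output, needle):
    """Return accepted minion IDs that contain the given substring."""
    return [m for m in accepted_minions(salt_key_output) if needle in m]


# Hypothetical captured output; only the backup host name comes from the ticket.
sample_output = """Accepted Keys:
backup-qam.qe.nue2.suse.org
worker-example.qe.nue2.suse.org
Denied Keys:
Unaccepted Keys:
old-backup.example.org
Rejected Keys:
"""

print(find_minions(sample_output, "backup"))
```

Restricting the search to accepted keys avoids picking up stale or unaccepted entries that happen to match the same substring.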

Actions #26

Updated by mkittler 9 months ago

  • Status changed from Workable to New
Actions #27

Updated by mkittler 9 months ago

  • Project changed from QA to openQA Infrastructure
Actions #28

Updated by livdywan 9 months ago

  • Subject changed from backup.qam.suse.de is Failed according to netbox and not creating backups to backup.qam.suse.de is Failed according to netbox and not creating backups size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #29

Updated by mkittler 9 months ago

  • Description updated (diff)
Actions #30

Updated by mkittler 9 months ago

  • Assignee set to mkittler
Actions #31

Updated by mkittler 9 months ago

  • Status changed from Workable to Feedback

I've just updated the netbox entry. I have also updated the management status to "Active" resolving AC2.

I've also updated the FQDN on https://confluence.suse.com/pages/viewpage.action?spaceKey=maintenanceqa&title=Backup+Server.

Now we only need to clarify whether this server is actually still used at all.

Actions #32

Updated by okurz 8 months ago

  • Due date deleted (2023-09-05)
  • Status changed from Feedback to Resolved

We double-checked the entries and fixed the FQDN in racktables. The racktables entry says "In Use", and the system is up, running, and controlled in salt, so all is good.
