action #173674

coordination #161414 (closed): [epic] Improved salt based infrastructure management

qamaster-independent backup size:S

Added by okurz 5 months ago. Updated 6 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Start date: 2024-12-03
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

During work on #170077 okurz found storage work on qamaster to be error-prone due to the hardware RAID with many old storage devices and an unusual configuration, e.g. RAID6 for the root device. Before we do more risky changes we should ensure we have a current backup of VMs and services, and for that we need to find out the best approach to store backups.

Acceptance criteria

  • AC1: We know where to store backups which are not on qamaster.qe.nue2.suse.org

Suggestions


Related issues 7 (4 open, 3 closed)

Related to openQA Infrastructure (public) - action #175707: OSD backups missing since 2024-11 on backup-vm.qe.nue2.suse.org size:S (Resolved, dheidler, 2025-01-17)
Blocks openQA Infrastructure (public) - action #168177: Migrate critical VM based services needing access to CC-services to CC areas (Resolved, okurz, 2024-09-19)
Copied from openQA Infrastructure (public) - action #173347: Ensure we have a current backup of qamaster VMs, VM config, jenkins data, data from backup-vm itself, etc. size:S (Resolved, gpathak)
Copied to openQA Infrastructure (public) - action #177513: Proper "project" name in op-prg2 (New)
Copied to openQA Infrastructure (public) - action #181151: Follow-up steps regarding backup.qa.suse.de (and backup.qe.prg2.suse.org) (New, okurz)
Copied to openQA Infrastructure (public) - action #181250: salt-master on separate VM (New)
Copied to openQA Infrastructure (public) - action #181256: Easier alert handling by opting out of backup_check for individual hosts (New)
Actions #1

Updated by okurz 5 months ago

  • Copied from action #173347: Ensure we have a current backup of qamaster VMs, VM config, jenkins data, data from backup-vm itself, etc. size:S added
Actions #2

Updated by dheidler 5 months ago

  • Status changed from Workable to Blocked
  • Assignee set to dheidler

Waiting on https://sd.suse.com/servicedesk/customer/portal/1/SD-175078 to get a network with firewall rules to save this on harvester.

Actions #3

Updated by dheidler 4 months ago · Edited

Pinged Matze about this SD ticket as there was no status update for 2 weeks.

Actions #4

Updated by okurz 3 months ago

There were recent updates. dheidler is responding with an update in the SD ticket.

Actions #5

Updated by tinita 3 months ago

  • Related to action #175707: OSD backups missing since 2024-11 on backup-vm.qe.nue2.suse.org size:S added
Actions #6

Updated by livdywan 3 months ago

dheidler wrote in #note-2:

Waiting on https://sd.suse.com/servicedesk/customer/portal/1/SD-175078 to get a network with firewall rules to save this on harvester.

https://sd.suse.com/servicedesk/customer/portal/1/SD-179453

Actions #7

Updated by okurz 2 months ago

Actions #8

Updated by dheidler 2 months ago

The network is there, but openplatform has issues with attaching (not creating, for whatever reason) a large volume (I tried 10TB) for storing backups to a VM.

I created https://sd.suse.com/servicedesk/customer/portal/1/SD-181328

Actions #9

Updated by okurz 2 months ago

  • Blocks action #168177: Migrate critical VM based services needing access to CC-services to CC areas added
Actions #11

Updated by okurz 2 months ago

Cool :)

Actions #13

Updated by dheidler about 2 months ago

The openplatform people told me that the largest disks they have fit around 5TB.
I guess the images can't span multiple disks.
So our backup storage capacity would be limited.
I thought the idea of "cloud" was that I as a user don't have to deal with that kind of problem :/

We could live with around 3-4TB of storage for now, but this might not be a nice and clean solution if openplatform doesn't scale up their disks.
As a workaround we could build a software RAID out of multiple volume images, but that wouldn't be the cleanest thing to do.
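
A minimal sketch of that workaround, assuming three ~5TB volumes attached to the VM as /dev/vdb, /dev/vdc and /dev/vdd (device names and mount point are assumptions):

# stripe the attached volumes into one larger array (illustrative only)
mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/vdb /dev/vdc /dev/vdd
mkfs.ext4 /dev/md0
mount /dev/md0 /srv/backup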

Actions #14

Updated by okurz about 2 months ago

Thomas Muntaner mentioned:

Please follow https://itpe.io.suse.de/core/open-platform/docs/docs/getting_started/requesting_access#nfs-access for a more reliable storage.

What about that?

Actions #15

Updated by dheidler about 2 months ago

I requested an NFS volume via https://sd.suse.com/servicedesk/customer/portal/1/SD-181710

Let's see what happens.

Actions #16

Updated by dheidler about 1 month ago

Hi Dominik Heidler,
The NFS share has been created. You can mount it from
10.144.128.242:/openqa_backup_storage

Well, we got an NFS share now, but of course they forgot to allow it in the firewall rules.

Actions #17

Updated by dheidler about 1 month ago

It turned out not to be the firewall; I had been given the wrong IP.

Now we have a different issue:

backup:~ # ping -c1 10.144.128.241
PING 10.144.128.241 (10.144.128.241) 56(84) bytes of data.
64 bytes from 10.144.128.241: icmp_seq=1 ttl=63 time=0.266 ms

--- 10.144.128.241 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.266/0.266/0.266/0.000 ms
backup:~ # mount 10.144.128.241:/openqa_backup_storage /mnt/
mount.nfs: mounting 10.144.128.241:/openqa_backup_storage failed, reason given by server: No such file or directory
Actions #18

Updated by dheidler about 1 month ago

• Status changed from Blocked to Workable

The export name uses dashes rather than underscores, so the working command is:

mount 10.144.128.241:/openqa-backup-storage /mnt/
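
To make the mount survive reboots, the share could also go into /etc/fstab; a minimal sketch (the mount point is an assumption):

# hypothetical /etc/fstab entry for mounting the share persistently
10.144.128.241:/openqa-backup-storage  /srv/backup  nfs  defaults,_netdev  0  0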

Actions #19

Updated by dheidler about 1 month ago

  • Status changed from Workable to In Progress
Actions #20

Updated by dheidler about 1 month ago

  • Status changed from In Progress to Blocked

Of course infra messed up setting up the firewall rules, so I can't reach the salt master.

Created https://sd.suse.com/servicedesk/customer/portal/1/SD-184162

Actions #21

Updated by dheidler 28 days ago

  • Assignee changed from dheidler to mgriessmeier

Assigning to Matthias while this is being escalated with infra.

Actions #23

Updated by dheidler 25 days ago

# services/ports to request in the firewall rules
echo NFS tftp ssh ftp http https 8080 rsync amqp zmq salt ICMP GRE
echo {9500..9599}
#echo {5990..6190}
# per worker instance i: command server port 20003+10*i and, presumably, VNC port 5990+i
for i in {1..100} ; do
    let "p=i*10+20003"
    let "v=i+5990"
    echo -n "$p "
    echo -n "$v "
done
echo
Actions #24

Updated by dheidler 25 days ago

I guess for workers the summand would be 20003, as 20002 is used by qemu internally, which should only talk to the command server.

Communication between the webui and the command server is on ports 20003+10*i.
Actions #25

Updated by dheidler 25 days ago

  • Status changed from Blocked to In Progress
  • Assignee changed from mgriessmeier to dheidler

https://build.opensuse.org/requests/1267145

The whole VLAN 2221 can now reach the openqa.suse.de salt master on ports 4505/4506.
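
A quick sanity check from a host in that VLAN could look like this (illustrative):

# verify the salt master ports are reachable
for port in 4505 4506; do
    nc -zv openqa.suse.de "$port"
done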

Actions #26

Updated by openqa_review 25 days ago

  • Due date set to 2025-04-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions #27

Updated by dheidler 21 days ago

  • Status changed from In Progress to Blocked
Actions #28

Updated by dheidler 20 days ago

  • Status changed from Blocked to In Progress

Connection to o3 enabled in the firewall.

Actions #29

Updated by dheidler 19 days ago

  • Status changed from In Progress to Feedback

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1432

Backups are running.

Let's add the backup check script, because until now nobody noticed that the o3 backups on the new backup server were failing until the firewall rules for the connection to ariel were sorted.
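
The real check lives in salt-states-openqa; as a rough sketch of what such a freshness check can look like (the backup location and the age threshold are assumptions):

#!/bin/bash
# hypothetical sketch, not the actual backup_check.sh
BACKUP_DIR=/srv/backup/o3    # assumed backup location
MAX_AGE_DAYS=2               # assumed freshness threshold
# if nothing was modified recently, complain so that cron mails root
if [ -z "$(find "$BACKUP_DIR" -mtime -"$MAX_AGE_DAYS" -print -quit)" ]; then
    echo "No backup newer than $MAX_AGE_DAYS days in $BACKUP_DIR"
    exit 1
fi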

Actions #30

Updated by livdywan 14 days ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1432

I realize I hadn't added my review. Hopefully easy to get sorted. Otherwise we can discuss it in the unblock today.

Actions #31

Updated by livdywan 13 days ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1438

Apparently there are several instances of postfix.set_main with very verbose output, effectively hiding whatever is causing the deployment to fail?

I'm wondering if we should have a separate ticket about this, as it doesn't seem like an immediate issue with this change 🤔

Actions #32

Updated by dheidler 13 days ago

I guess the best approach is to replace postfix.set_main with a salt-managed postfix main.cf file.
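
As a minimal sketch of that idea, salt's file.managed can be tried as a one-off from the master (the minion target and source path are assumptions; the real change would go into salt-states-openqa):

# hypothetical one-off illustration of a salt-managed main.cf
salt 'backup*' state.single file.managed name=/etc/postfix/main.cf source=salt://postfix/files/main.cf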

Actions #33

Updated by livdywan 13 days ago

livdywan wrote in #note-31:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1438

Correction: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1439. Sorry about that. I was trying to check other recent MRs to double-check what was causing the log issue and mixed them up without realizing it right away.

Actions #36

Updated by dheidler 12 days ago

  • Status changed from Feedback to Resolved
Actions #38

Updated by livdywan 12 days ago

  • Copied to action #181151: Follow-up steps regarding backup.qa.suse.de (and backup.qe.prg2.suse.org) added
Actions #39

Updated by okurz 9 days ago

  • Due date deleted (2025-04-19)
Actions #40

Updated by okurz 9 days ago

  • Status changed from Resolved to Workable

I didn't find any mention on https://gitlab.suse.de/suse/wiki/-/blob/main/qe_infrastructure.md. I doubt people will find the backup this way.

Actions #41

Updated by dheidler 8 days ago

  • Status changed from Workable to Resolved
Actions #42

Updated by okurz 7 days ago

Actions #43

Updated by livdywan 7 days ago

okurz wrote in #note-40:

I didn't find any mention on https://gitlab.suse.de/suse/wiki/-/blob/main/qe_infrastructure.md. I doubt people will find the backup this way.

You probably didn't see #181151 🙃 But thank you for checking anyway.

Actions #44

Updated by livdywan 7 days ago

  • Copied to action #181256: Easier alert handling by opting out of backup_check for individual hosts added
Actions #45

Updated by livdywan 6 days ago

  • Status changed from Resolved to In Progress
  • Assignee changed from dheidler to livdywan
  • Priority changed from Normal to High

Apparently we are still getting emails from cron about this:

from: (Cron Daemon) <root@backup-vm.qe.nue2.suse.org>
to: root@backup-vm.qe.nue2.suse.org <root@backup-vm.qe.nue2.suse.org>
folder: Inbox
date: Tue, 22 Apr 2025 23:59:01 +0000 (UTC)
subject: Cron <root@backup-vm> /usr/local/bin/backup_check.sh

Either this is a side-effect of #181175 reverting relevant changes, or the corresponding cron task is still in effect. I'm taking a look in any case.
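
A quick way to check whether the cron entry is still in place on the host (illustrative):

# look for the script in root's crontab and the cron drop-in directories
crontab -l -u root | grep backup_check
grep -r backup_check /etc/cron*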

Actions #46

Updated by livdywan 6 days ago

  • Assignee changed from livdywan to dheidler

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1447

Seems like the script wasn't re-introduced. Also adjusting to only check alpha backups now. Giving back to @dheidler from here.

Actions #47

Updated by livdywan 6 days ago

  • Status changed from In Progress to Feedback

livdywan wrote in #note-46:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1447

Seems like the script wasn't re-introduced. Also adjusting to only check alpha backups now. Giving back to @dheidler from here.

Merged.

Actions #48

Updated by dheidler 6 days ago

  • Status changed from Feedback to Resolved

I guess we can close it then.
