action #173674
closed - coordination #161414: [epic] Improved salt based infrastructure management
qamaster-independent backup size:S
0%
Description
Motivation
During work on #170077 okurz found work on storage for qamaster to be error-prone due to the hardware RAID with many old storage devices and an unusual configuration (e.g. RAID6 for the root device). Before we do more risky work we should ensure we have a current backup of VMs and services, and for that we need to find out the best approach to store backups.
Acceptance criteria
- AC1: We know where to store backups which are not on qamaster.qe.nue2.suse.org
Suggestions
- Determine required size from #173347
- Consider OpenPlatform, the host "storage" in PRG2, or currently unused QE machines
- Consider extending https://gitlab.suse.de/suse/wiki/-/blob/main/qe_infrastructure.md or another reasonable place
- Provide a hint in #173347 on how to follow up
Updated by okurz 6 months ago
- Copied from action #173347: Ensure we have a current backup of qamaster VMs, VM config, jenkins data, data from backup-vm itself, etc. size:S added
Updated by dheidler 6 months ago
- Status changed from Workable to Blocked
- Assignee set to dheidler
Waiting on https://sd.suse.com/servicedesk/customer/portal/1/SD-175078 to get a network with firewall rules so that we can store this on Harvester.
Updated by tinita 4 months ago
- Related to action #175707: OSD backups missing since 2024-11 on backup-vm.qe.nue2.suse.org size:S added
Updated by livdywan 4 months ago
dheidler wrote in #note-2:
Waiting on https://sd.suse.com/servicedesk/customer/portal/1/SD-175078 to get a network with firewall rules to save this on harvester.
Updated by okurz 3 months ago
- Copied to action #177513: Proper "project" name in op-prg2 size:S added
Updated by dheidler 3 months ago
The network is there, but OpenPlatform has issues attaching (not creating, for whatever reason) a large volume to a VM for storing backups; I tried 10 TB.
I created https://sd.suse.com/servicedesk/customer/portal/1/SD-181328
Updated by okurz 3 months ago
- Blocks action #168177: Migrate critical VM based services needing access to CC-services to CC areas added
Updated by dheidler 3 months ago
The OpenPlatform people told me that the largest disks they have hold around 5 TB.
I guess the images can't span multiple disks.
So our backup storage capacity would be limited.
I thought the idea of "cloud" was that I as a user don't have to deal with that kind of problem :/
We could live with around 3-4 TB of storage for now, but this might not be a nice and clean solution if OpenPlatform doesn't scale up their disks.
As a workaround we could build a software RAID out of multiple volume images, but that wouldn't be the cleanest thing to do.
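For illustration, a minimal sketch of that workaround, assuming three smaller volumes are attached to the backup VM and show up as /dev/vdb, /dev/vdc and /dev/vdd (device names and sizes are assumptions, not what OpenPlatform actually provides):
# combine the attached volumes into one striped array and mount it for backups
mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/vdb /dev/vdc /dev/vdd
mkfs.ext4 /dev/md0
mkdir -p /backup && mount /dev/md0 /backup
# make the array and the mount persistent across reboots
mdadm --detail --scan >> /etc/mdadm.conf
echo '/dev/md0 /backup ext4 defaults,nofail 0 2' >> /etc/fstab
An LVM volume group spanning the volumes would work just as well; either way it is an extra layer we would have to maintain ourselves.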
Updated by okurz 3 months ago
Thomas Muntaner mentioned:
Please follow https://itpe.io.suse.de/core/open-platform/docs/docs/getting_started/requesting_access#nfs-access for a more reliable storage.
What about that?
Updated by dheidler 3 months ago
I requested an NFS volume via https://sd.suse.com/servicedesk/customer/portal/1/SD-181710
Let's see what happens.
Updated by dheidler 2 months ago
It seems it was not the firewall after all; I was given the wrong IP.
Now we have a different issue:
backup:~ # ping -c1 10.144.128.241
PING 10.144.128.241 (10.144.128.241) 56(84) bytes of data.
64 bytes from 10.144.128.241: icmp_seq=1 ttl=63 time=0.266 ms
--- 10.144.128.241 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.266/0.266/0.266/0.000 ms
backup:~ # mount 10.144.128.241:/openqa_backup_storage /mnt/
mount.nfs: mounting 10.144.128.241:/openqa_backup_storage failed, reason given by server: No such file or directory
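If the export itself exists, it might help to check what the server announces and whether the path differs under the NFSv4 pseudo-root; two things worth trying (just a guess at the cause):
# list the exports the server announces (needs the server's rpcbind/mountd to be reachable)
showmount -e 10.144.128.241
# NFSv4 paths are relative to the server's pseudo-root, so forcing v3 can rule that out
mount -t nfs -o vers=3 10.144.128.241:/openqa_backup_storage /mnt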
Updated by dheidler 2 months ago
- Status changed from In Progress to Blocked
Of course infra messed up setting up the firewall rules, so I can't reach the salt master.
Created https://sd.suse.com/servicedesk/customer/portal/1/SD-184162
Updated by dheidler about 2 months ago
- Assignee changed from dheidler to mgriessmeier
Assigning to Matthias while this is being escalated with infra.
Updated by dheidler about 2 months ago
# services/protocols that need to be reachable through the firewall
echo NFS tftp ssh ftp http https 8080 rsync amqp zmq salt ICMP GRE
# presumably the openQA service port range (web UI, websockets, livehandler, cache service)
echo {9500..9599}
# VNC range, superseded by the computed list below
#echo {5990..6190}
# per worker instance i: command server port (20003 + 10*i) and VNC port (5990 + i)
for i in {1..100} ; do
    let "p=i*10+20003"
    let "v=i+5990"
    echo -n "$p "
    echo -n "$v "
done
echo
Updated by dheidler about 2 months ago
I guess for workers the summand would be 20003, as 20002 is used by QEMU internally and should only talk to the command server.
Communication between the web UI and the command server is on ports 20003 + 10*i, i.e. 20013 for the first worker instance, 20023 for the second, and so on.
Updated by dheidler about 2 months ago
- Status changed from Blocked to In Progress
- Assignee changed from mgriessmeier to dheidler
https://build.opensuse.org/requests/1267145
The whole VLAN 2221 can now reach the openqa.suse.de salt master on ports 4505/4506.
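For reference, a quick way to verify this from a VM in that VLAN (assuming netcat is installed and the host is already configured as a salt minion):
# check that the salt master's publish (4505) and return (4506) ports are reachable
for p in 4505 4506; do nc -z -v -w 3 openqa.suse.de "$p"; done
# or, with salt-minion configured, test the full round trip
salt-call test.ping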
Updated by openqa_review about 2 months ago
- Due date set to 2025-04-19
Setting due date based on mean cycle time of SUSE QE Tools
Updated by dheidler about 2 months ago
- Status changed from In Progress to Blocked
Missing firewall rules to reach o3: https://sd.suse.com/servicedesk/customer/portal/1/SD-185109
Updated by dheidler about 2 months ago
- Status changed from Blocked to In Progress
Connection to o3 is now enabled in the firewall.
Updated by dheidler about 2 months ago
- Status changed from In Progress to Feedback
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1432
Backups are running.
Let's add the backup check script, because so far nobody noticed that the o3 backups on the new backup server were failing until the firewall rules for the connection to ariel were sorted out.
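For context, a minimal sketch of what such a check could look like; the backup layout, age threshold and exact behaviour are assumptions, not the actual script:
#!/bin/bash
# exit non-zero so cron mails the output whenever a backup tree has no file newer than MAX_AGE_DAYS
set -u
BACKUP_ROOT=/backup      # assumed layout: one subdirectory per backed-up host
MAX_AGE_DAYS=2
status=0
for dir in "$BACKUP_ROOT"/*/; do
    if [ -z "$(find "$dir" -type f -mtime -"$MAX_AGE_DAYS" -print -quit)" ]; then
        echo "no backup newer than $MAX_AGE_DAYS days in $dir"
        status=1
    fi
done
exit $status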
Updated by livdywan about 1 month ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1432
I realize I hadn't added my review. Hopefully easy to get sorted. Otherwise we can discuss it in the unblock today.
Updated by livdywan about 1 month ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1438
Apparently there are several instances of postfix.set_main with very verbose output, effectively hiding whatever is causing the deployment to fail?
I'm wondering if we should have a separate ticket about this, as it doesn't seem like an immediate issue with this change.
Updated by dheidler about 1 month ago
I guess the best approach is to replace postfix.set_main with a salt-managed postfix main.cf file.
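Not the actual salt-states-openqa layout, just a minimal sketch of what a salt-managed main.cf could look like (state IDs and the source path are assumptions):
# ship the whole main.cf instead of patching individual settings via postfix.set_main
/etc/postfix/main.cf:
  file.managed:
    - source: salt://postfix/main.cf    # hypothetical location in the state tree
    - user: root
    - group: root
    - mode: '0644'

postfix:
  service.running:
    - enable: True
    - watch:
      - file: /etc/postfix/main.cf    # reload postfix whenever the file changes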
Updated by livdywan about 1 month ago
livdywan wrote in #note-31:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1438
Correction: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1439 Sorry about that. I was trying to check other recent MRs to double-check what was causing the log issue and mixed them up without realizing right away.
Updated by dheidler about 1 month ago
Updated by livdywan about 1 month ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1442 Looking better now.
Updated by livdywan about 1 month ago
Also: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1443 (already merged)
Updated by livdywan about 1 month ago
- Copied to action #181151: Follow-up steps regarding backup.qa.suse.de (and backup.qe.prg2.suse.org) added
Updated by okurz about 1 month ago
- Status changed from Resolved to Workable
I didn't find any mention on https://gitlab.suse.de/suse/wiki/-/blob/main/qe_infrastructure.md. I doubt people will find the backup this way.
Updated by dheidler about 1 month ago
- Status changed from Workable to Resolved
Updated by okurz about 1 month ago
- Copied to action #181250: salt-master on separate VM being able to connect to all OSD machines size:S added
Updated by livdywan about 1 month ago
okurz wrote in #note-40:
I didn't find any mention on https://gitlab.suse.de/suse/wiki/-/blob/main/qe_infrastructure.md. I doubt people will find the backup this way.
You probably didn't see #181151. But thank you for checking anyway.
Updated by livdywan about 1 month ago
- Copied to action #181256: Easier alert handling by opting out of backup_check for individual hosts added
Updated by livdywan about 1 month ago
- Status changed from Resolved to In Progress
- Assignee changed from dheidler to livdywan
- Priority changed from Normal to High
Apparently we are still getting emails from cron about this:
from: (Cron Daemon) <root@backup-vm.qe.nue2.suse.org>
to: root@backup-vm.qe.nue2.suse.org <root@backup-vm.qe.nue2.suse.org>
folder: Inbox
date: Tue, 22 Apr 2025 23:59:01 +0000 (UTC)
subject: Cron <root@backup-vm> /usr/local/bin/backup_check.sh
Either this is a side-effect of #181175 reverting relevant changes, or the corresponding cron task is still effective. I'm taking a look in any case.
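A quick way to check on backup-vm whether the cron entry and the script are still in place (just the obvious places to look):
# look for the cron entry in root's crontab and the system cron directories
crontab -l -u root | grep backup_check
grep -r backup_check /etc/cron* 2>/dev/null
# and whether the script itself is still installed
ls -l /usr/local/bin/backup_check.sh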
Updated by livdywan about 1 month ago
- Assignee changed from livdywan to dheidler
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1447
Seems like the script wasn't re-introduced. Also adjusting to only check alpha backups now. Giving back to @dheidler from here.
Updated by livdywan about 1 month ago
- Status changed from In Progress to Feedback
livdywan wrote in #note-46:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1447
Seems like the script wasn't re-introduced. Also adjusting to only check alpha backups now. Giving back to @dheidler from here.
Merged.
Updated by dheidler about 1 month ago
- Status changed from Feedback to Resolved
I guess we can close it then.