action #173674
closed coordination #161414: [epic] Improved salt based infrastructure management
qamaster-independent backup size:S
Description
Motivation
During work on #170077 okurz found work on storage for qamaster to be error-prone due to the hardware RAID with many old storage devices and an unusual configuration (RAID6 for the root device, etc.). Before we do more risky work we should ensure we have a current backup of VMs and services, and for that we need to find out the best approach for storing backups.
Acceptance criteria
- AC1: We know where to store backups which are not on qamaster.qe.nue2.suse.org
Suggestions
- Determine required size from #173347
- Consider OpenPlatform or the host "storage" in PRG2 or currently unused QE machines
- Consider extending https://gitlab.suse.de/suse/wiki/-/blob/main/qe_infrastructure.md or a reasonable other place
- Provide a hint in #173347 how to follow up
Updated by okurz 5 months ago
- Copied from action #173347: Ensure we have a current backup of qamaster VMs, VM config, jenkins data, data from backup-vm itself, etc. size:S added
Updated by dheidler 5 months ago
- Status changed from Workable to Blocked
- Assignee set to dheidler
Waiting on https://sd.suse.com/servicedesk/customer/portal/1/SD-175078 to get a network with firewall rules so we can store the backups on Harvester.
Updated by tinita 3 months ago
- Related to action #175707: OSD backups missing since 2024-11 on backup-vm.qe.nue2.suse.org size:S added
Updated by livdywan 3 months ago
dheidler wrote in #note-2:
Waiting on https://sd.suse.com/servicedesk/customer/portal/1/SD-175078 to get a network with firewall rules to save this on harvester.
Updated by okurz 2 months ago
- Copied to action #177513: Proper "project" name in op-prg2 added
Updated by dheidler 2 months ago
The network is there, but OpenPlatform has issues with attaching (not creating, for whatever reason) a large volume (I tried 10 TB) for storing backups to a VM.
I created https://sd.suse.com/servicedesk/customer/portal/1/SD-181328
Updated by okurz 2 months ago
- Blocks action #168177: Migrate critical VM based services needing access to CC-services to CC areas added
Updated by dheidler about 2 months ago
The OpenPlatform people told me that the largest disks they have are around 5 TB.
I guess the images can't span multiple disks, so our backup storage capacity would be limited.
I thought the idea of "cloud" was that I as a user don't have to deal with this kind of problem :/
We could live with around 3-4 TB of storage for now, but this would not be a nice and clean solution if OpenPlatform doesn't scale up their disks.
As a workaround we could build a software RAID over multiple volume images, but that wouldn't be the cleanest thing to do.
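A minimal sketch of that workaround, assuming three attached volumes show up as /dev/vdb, /dev/vdc and /dev/vdd (hypothetical device names, not from this ticket):

# assumption: three OpenPlatform volumes attached as /dev/vdb, /dev/vdc, /dev/vdd
mdadm --create /dev/md0 --level=linear --raid-devices=3 /dev/vdb /dev/vdc /dev/vdd
mkfs.ext4 /dev/md0                        # one filesystem spanning all volumes
mkdir -p /srv/backup && mount /dev/md0 /srv/backup
mdadm --detail --scan >> /etc/mdadm.conf  # persist the array layout across reboots

A linear array keeps the volumes concatenated, so a further volume could later be added with mdadm --grow instead of rebuilding the array.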
Updated by okurz about 2 months ago
Thomas Muntaner mentioned:
Please follow https://itpe.io.suse.de/core/open-platform/docs/docs/getting_started/requesting_access#nfs-access for a more reliable storage.
What about that?
Updated by dheidler about 2 months ago
I requested an NFS volume via https://sd.suse.com/servicedesk/customer/portal/1/SD-181710
Let's see what happens.
Updated by dheidler about 1 month ago
Hi Dominik Heidler,
The NFS share has been created. You can mount it from
10.144.128.242:/openqa_backup_storage
Well, we got an NFS share now, but of course they forgot to allow it in the firewall rules.
Updated by dheidler about 1 month ago
It turned out not to be the firewall; I had been given the wrong IP.
Now we have a different issue:
backup:~ # ping -c1 10.144.128.241
PING 10.144.128.241 (10.144.128.241) 56(84) bytes of data.
64 bytes from 10.144.128.241: icmp_seq=1 ttl=63 time=0.266 ms
--- 10.144.128.241 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.266/0.266/0.266/0.000 ms
backup:~ # mount 10.144.128.241:/openqa_backup_storage /mnt/
mount.nfs: mounting 10.144.128.241:/openqa_backup_storage failed, reason given by server: No such file or directory
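For reference, listing the server's exports would show the actual share path right away; a diagnostic sketch, assuming showmount from nfs-utils is available and the server answers mountd queries:

showmount -e 10.144.128.241   # prints the export list of the NFS server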
Updated by dheidler about 1 month ago
- Status changed from Blocked to Workable
It turned out the export path uses dashes instead of underscores, so the correct command is:
mount 10.144.128.241:/openqa-backup-storage /mnt/
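For a persistent mount, an /etc/fstab entry could look like this (a sketch; /mnt/backup is a hypothetical mount point, not necessarily the one actually used):

# /etc/fstab (sketch): mount the NFS backup share at boot, after the network is up
10.144.128.241:/openqa-backup-storage  /mnt/backup  nfs  defaults,_netdev  0  0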
Updated by dheidler about 1 month ago
- Status changed from Workable to In Progress
Updated by dheidler about 1 month ago
- Status changed from In Progress to Blocked
Of course infra messed up setting up the firewall rules, so I can't reach the salt master.
Created https://sd.suse.com/servicedesk/customer/portal/1/SD-184162
Updated by dheidler 25 days ago
- Status changed from Blocked to In Progress
- Assignee changed from mgriessmeier to dheidler
https://build.opensuse.org/requests/1267145
The whole VLAN 2221 can now reach the openqa.suse.de salt master at ports 4505/4506.
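A quick way to verify that from a machine in that VLAN (a sketch using standard tools, not commands from this ticket):

nc -zv openqa.suse.de 4505   # salt master publish port
nc -zv openqa.suse.de 4506   # salt master request/return port
salt-call test.ping          # end-to-end check once the minion key is accepted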
Updated by openqa_review 25 days ago
- Due date set to 2025-04-19
Setting due date based on mean cycle time of SUSE QE Tools
Updated by dheidler 21 days ago
- Status changed from In Progress to Blocked
Missing firewall rules to reach o3: https://sd.suse.com/servicedesk/customer/portal/1/SD-185109
Updated by dheidler 19 days ago
- Status changed from In Progress to Feedback
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1432
Backups are running.
Let's add the backup check script, because as of now nobody noticed that the o3 backups on the new backup server were failing until the firewall rules for the connection to ariel were sorted out.
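A minimal sketch of such a check, assuming the backups land under /mnt/backup and that anything older than 24 hours should alert (the path and threshold are assumptions, not the actual /usr/local/bin/backup_check.sh):

#!/bin/bash
# backup_check.sh (sketch): exit non-zero if no recent backup exists,
# so cron mails the output to root and the failure gets noticed
set -euo pipefail
BACKUP_DIR=/mnt/backup           # assumed backup location
MAX_AGE_MINUTES=$((24 * 60))     # alert if nothing is newer than 24 hours
if [ -z "$(find "$BACKUP_DIR" -mindepth 1 -mmin -"$MAX_AGE_MINUTES" -print -quit)" ]; then
    echo "No backup newer than ${MAX_AGE_MINUTES} minutes found in ${BACKUP_DIR}" >&2
    exit 1
fi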
Updated by livdywan 14 days ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1432
I realize I hadn't added my review. Hopefully easy to get sorted. Otherwise we can discuss it in the unblock today.
Updated by livdywan 13 days ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1438
Apparently there are several instances of postfix.set_main with very verbose output, effectively hiding whatever is causing the deployment to fail?
I'm wondering if we should have a separate ticket about this, as it doesn't seem like an immediate issue with this change.
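One way to cut down the noise and surface the actual failure would be to run the state apply with reduced output; a sketch using standard salt CLI options, not something from the MR:

salt-call state.apply --state-output=changes --state-verbose=false
# or from the master, targeting only the backup VM (hypothetical target pattern):
salt 'backup*' state.apply --state-output=changes --state-verbose=false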
Updated by livdywan 13 days ago
livdywan wrote in #note-31:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1438
Correction: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1439 Sorry about that. I was trying to check other recent MRs to double-check what was causing the log issue and mixed them up without realizing it right away.
Updated by livdywan 12 days ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1442 Looking better now.
Updated by livdywan 12 days ago
Also: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1443 (already merged)
Updated by livdywan 12 days ago
- Copied to action #181151: Follow-up steps regarding backup.qa.suse.de (and backup.qe.prg2.suse.org) added
Updated by okurz 9 days ago
- Status changed from Resolved to Workable
I didn't find any mention on https://gitlab.suse.de/suse/wiki/-/blob/main/qe_infrastructure.md. I doubt people will find the backup this way.
Updated by okurz 7 days ago
- Copied to action #181250: salt-master on separate VM added
Updated by livdywan 7 days ago
- Copied to action #181256: Easier alert handling by opting out of backup_check for individual hosts added
Updated by livdywan 6 days ago
- Status changed from Resolved to In Progress
- Assignee changed from dheidler to livdywan
- Priority changed from Normal to High
Apparently we are still getting emails from cron about this:
from: (Cron Daemon) <root@backup-vm.qe.nue2.suse.org>
to: root@backup-vm.qe.nue2.suse.org <root@backup-vm.qe.nue2.suse.org>
folder: Inbox
date: Tue, 22 Apr 2025 23:59:01 +0000 (UTC)
subject: Cron <root@backup-vm> /usr/local/bin/backup_check.sh
Either this is a side-effect of #181175 reverting relevant changes, or the corresponding cron task is still active. I'm taking a look in any case.
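Checking whether the cron task is still in place should only need a look at root's crontab and the system-wide cron drop-ins (a sketch):

crontab -l -u root | grep backup_check            # root's user crontab
grep -r backup_check /etc/cron.d /etc/crontab     # system-wide cron entries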
Updated by livdywan 6 days ago
- Status changed from In Progress to Feedback
livdywan wrote in #note-46:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1447
It seems the script wasn't re-introduced. I'm also adjusting it to only check alpha backups now. Giving this back to @dheidler from here.
Merged.