action #173674

coordination #161414 (closed): [epic] Improved salt based infrastructure management

qamaster-independent backup size:S

Added by okurz 5 months ago. Updated 6 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Start date: 2024-12-03
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

During work on #170077 okurz found storage work on qamaster to be error-prone due to the hardware RAID with many old storage devices and an unusual configuration, e.g. RAID6 for the root device. Before we do more risky changes we should ensure we have a current backup of VMs and services, and for that we need to find out the best approach to store backups.

Acceptance criteria

  • AC1: We know where to store backups which are not on qamaster.qe.nue2.suse.org

Suggestions


Related issues 7 (4 open, 3 closed)

Related to openQA Infrastructure (public) - action #175707: OSD backups missing since 2024-11 on backup-vm.qe.nue2.suse.org size:S (Resolved, dheidler, 2025-01-17)
Blocks openQA Infrastructure (public) - action #168177: Migrate critical VM based services needing access to CC-services to CC areas (Resolved, okurz, 2024-09-19)
Copied from openQA Infrastructure (public) - action #173347: Ensure we have a current backup of qamaster VMs, VM config, jenkins data, data from backup-vm itself, etc. size:S (Resolved, gpathak)
Copied to openQA Infrastructure (public) - action #177513: Proper "project" name in op-prg2 (New)
Copied to openQA Infrastructure (public) - action #181151: Follow-up steps regarding backup.qa.suse.de (and backup.qe.prg2.suse.org) (New, okurz)
Copied to openQA Infrastructure (public) - action #181250: salt-master on separate VM (New)
Copied to openQA Infrastructure (public) - action #181256: Easier alert handling by opting out of backup_check for individual hosts (New)
Actions #1

Updated by okurz 5 months ago

  • Copied from action #173347: Ensure we have a current backup of qamaster VMs, VM config, jenkins data, data from backup-vm itself, etc. size:S added
Actions #2

Updated by dheidler 5 months ago

  • Status changed from Workable to Blocked
  • Assignee set to dheidler

Waiting on https://sd.suse.com/servicedesk/customer/portal/1/SD-175078 to get a network with firewall rules to save this on harvester.

Actions #3

Updated by dheidler 4 months ago · Edited

Pinged Matze about this SD ticket as there was no status update for 2 weeks.

Actions #4

Updated by okurz 3 months ago

There were recent updates. dheidler is responding with an update in the SD ticket.

Actions #5

Updated by tinita 3 months ago

  • Related to action #175707: OSD backups missing since 2024-11 on backup-vm.qe.nue2.suse.org size:S added
Actions #6

Updated by livdywan 3 months ago

dheidler wrote in #note-2:

Waiting on https://sd.suse.com/servicedesk/customer/portal/1/SD-175078 to get a network with firewall rules to save this on harvester.

https://sd.suse.com/servicedesk/customer/portal/1/SD-179453

Actions #7

Updated by okurz 2 months ago

Actions #8

Updated by dheidler 2 months ago

The network is there, but openplatform has issues with attaching (not creating, for whatever reason) a large volume (I tried 10TB) for storing backups to a VM.

I created https://sd.suse.com/servicedesk/customer/portal/1/SD-181328

Actions #9

Updated by okurz 2 months ago

  • Blocks action #168177: Migrate critical VM based services needing access to CC-services to CC areas added
Actions #11

Updated by okurz 2 months ago

Cool :)

Actions #13

Updated by dheidler about 2 months ago

The openplatform people told me that the largest disks they have fit around 5TB.
I guess the images can't span multiple disks.
So our backup storage capacity would be limited.
I thought the idea of "cloud" was that I as a user don't have to deal with that kind of problem :/

We could live with around 3-4TB of storage for now, but this might not be a nice and clean solution if openplatform doesn't scale up their disks.
As a workaround we could build a software RAID out of multiple volume images, but that wouldn't be the cleanest thing to do.
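
A minimal sketch of that workaround, assuming three ~5TB volumes attached to the VM as /dev/vdb, /dev/vdc and /dev/vdd (device names and mount point are assumptions):

# stripe the attached volumes into one larger array (illustrative only)
mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/vdb /dev/vdc /dev/vdd
mkfs.ext4 /dev/md0
mount /dev/md0 /srv/backup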

Actions #14

Updated by okurz about 2 months ago

Thomas Muntaner mentioned:

Please follow https://itpe.io.suse.de/core/open-platform/docs/docs/getting_started/requesting_access#nfs-access for a more reliable storage.

What about that?

Actions #15

Updated by dheidler about 2 months ago

I requested an NFS volume via https://sd.suse.com/servicedesk/customer/portal/1/SD-181710

Let's see what happens.

Actions #16

Updated by dheidler about 1 month ago

Hi Dominik Heidler,
The NFS share has been created. You can mount it from
10.144.128.242:/openqa_backup_storage

Well, we got an NFS share now, but of course they forgot to allow it in the firewall rules.

Actions #17

Updated by dheidler about 1 month ago

It turned out not to be the firewall; I had been given the wrong IP.

Now we have a different issue:

backup:~ # ping -c1 10.144.128.241
PING 10.144.128.241 (10.144.128.241) 56(84) bytes of data.
64 bytes from 10.144.128.241: icmp_seq=1 ttl=63 time=0.266 ms

--- 10.144.128.241 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.266/0.266/0.266/0.000 ms
backup:~ # mount 10.144.128.241:/openqa_backup_storage /mnt/
mount.nfs: mounting 10.144.128.241:/openqa_backup_storage failed, reason given by server: No such file or directory
Actions #18

Updated by dheidler about 1 month ago

• Status changed from Blocked to Workable

The export name uses dashes rather than underscores, so the working command is:

mount 10.144.128.241:/openqa-backup-storage /mnt/
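
To make the mount survive reboots, the share could also go into /etc/fstab; a minimal sketch (the mount point is an assumption):

# hypothetical /etc/fstab entry for mounting the share persistently
10.144.128.241:/openqa-backup-storage  /srv/backup  nfs  defaults,_netdev  0  0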

Actions #19

Updated by dheidler about 1 month ago

  • Status changed from Workable to In Progress
Actions #20

Updated by dheidler about 1 month ago

  • Status changed from In Progress to Blocked

Of course infra messed up setting up the firewall rules, so I can't reach the salt master.

Created https://sd.suse.com/servicedesk/customer/portal/1/SD-184162

Actions #21

Updated by dheidler 28 days ago

  • Assignee changed from dheidler to mgriessmeier

Assigning to Matthias while this is being escalated with infra.

Actions #23

Updated by dheidler 25 days ago

# services/ports to request in the firewall rules
echo NFS tftp ssh ftp http https 8080 rsync amqp zmq salt ICMP GRE
echo {9500..9599}
#echo {5990..6190}
# per worker instance i: command server port 20003+10*i and, presumably, VNC port 5990+i
for i in {1..100} ; do
    let "p=i*10+20003"
    let "v=i+5990"
    echo -n "$p "
    echo -n "$v "
done
echo
Actions #24

Updated by dheidler 25 days ago

I guess for workers the summand would be 20003, as 20002 is used by qemu internally, which should only talk to the command server.

Communication between the webui and the command server is on ports 20003+10*i.
Actions #25

Updated by dheidler 25 days ago

  • Status changed from Blocked to In Progress
  • Assignee changed from mgriessmeier to dheidler

https://build.opensuse.org/requests/1267145

The whole VLAN 2221 can now reach the openqa.suse.de salt master on ports 4505/4506.
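
A quick sanity check from a host in that VLAN could look like this (illustrative):

# verify the salt master ports are reachable
for port in 4505 4506; do
    nc -zv openqa.suse.de "$port"
done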

Actions #26

Updated by openqa_review 25 days ago

  • Due date set to 2025-04-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions #27

Updated by dheidler 21 days ago

  • Status changed from In Progress to Blocked
Actions #28

Updated by dheidler 20 days ago

  • Status changed from Blocked to In Progress

Connection to o3 enabled in the firewall.

Actions #29

Updated by dheidler 19 days ago

  • Status changed from In Progress to Feedback

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1432

Backups are running.

Let's add the backup check script, because until now nobody noticed that the o3 backups on the new backup server were failing until the firewall rules for the connection to ariel were sorted.
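
The real check lives in salt-states-openqa; as a rough sketch of what such a freshness check can look like (the backup location and the age threshold are assumptions):

#!/bin/bash
# hypothetical sketch, not the actual backup_check.sh
BACKUP_DIR=/srv/backup/o3    # assumed backup location
MAX_AGE_DAYS=2               # assumed freshness threshold
# if nothing was modified recently, complain so that cron mails root
if [ -z "$(find "$BACKUP_DIR" -mtime -"$MAX_AGE_DAYS" -print -quit)" ]; then
    echo "No backup newer than $MAX_AGE_DAYS days in $BACKUP_DIR"
    exit 1
fi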

Actions #30

Updated by livdywan 14 days ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1432

I realize I hadn't added my review. Hopefully easy to get sorted. Otherwise we can discuss it in the unblock today.

Actions #31

Updated by livdywan 13 days ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1438

Apparently there are several instances of postfix.set_main with very verbose output, effectively hiding whatever is causing the deployment to fail?

I'm wondering if we should have a separate ticket about this, as it doesn't seem like an immediate issue with this change 🤔

Actions #32

Updated by dheidler 13 days ago

I guess the best approach is to replace postfix.set_main with a salt-managed postfix main.cf file.
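
As a minimal sketch of that idea, salt's file.managed can be tried as a one-off from the master (the minion target and source path are assumptions; the real change would go into salt-states-openqa):

# hypothetical one-off illustration of a salt-managed main.cf
salt 'backup*' state.single file.managed name=/etc/postfix/main.cf source=salt://postfix/files/main.cf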

Actions #33

Updated by livdywan 13 days ago

livdywan wrote in #note-31:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1438

Correction: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1439. Sorry about that. I was trying to check other recent MRs to double-check what was causing the log issue and mixed them up without realizing it right away.

Actions #36

Updated by dheidler 12 days ago

  • Status changed from Feedback to Resolved
Actions #38

Updated by livdywan 12 days ago

  • Copied to action #181151: Follow-up steps regarding backup.qa.suse.de (and backup.qe.prg2.suse.org) added
Actions #39

Updated by okurz 9 days ago

  • Due date deleted (2025-04-19)
Actions #40

Updated by okurz 9 days ago

  • Status changed from Resolved to Workable

I didn't find any mention on https://gitlab.suse.de/suse/wiki/-/blob/main/qe_infrastructure.md. I doubt people will find the backup this way.

Actions #41

Updated by dheidler 8 days ago

  • Status changed from Workable to Resolved
Actions #42

Updated by okurz 7 days ago

Actions #43

Updated by livdywan 7 days ago

okurz wrote in #note-40:

I didn't find any mention on https://gitlab.suse.de/suse/wiki/-/blob/main/qe_infrastructure.md. I doubt people will find the backup this way.

You probably didn't see #181151 🙃 But thank you for checking anyway.

Actions #44

Updated by livdywan 7 days ago

  • Copied to action #181256: Easier alert handling by opting out of backup_check for individual hosts added
Actions #45

Updated by livdywan 6 days ago

  • Status changed from Resolved to In Progress
  • Assignee changed from dheidler to livdywan
  • Priority changed from Normal to High

Apparently we are still getting emails from cron about this:

from: (Cron Daemon) <root@backup-vm.qe.nue2.suse.org>
to: root@backup-vm.qe.nue2.suse.org <root@backup-vm.qe.nue2.suse.org>
folder: Inbox
date: Tue, 22 Apr 2025 23:59:01 +0000 (UTC)
subject: Cron <root@backup-vm> /usr/local/bin/backup_check.sh

Either this is a side-effect of #181175 reverting relevant changes, or the corresponding cron task is still in effect. I'm taking a look in any case.
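
A quick way to check whether the cron entry is still in place on the host (illustrative):

# look for the script in root's crontab and the cron drop-in directories
crontab -l -u root | grep backup_check
grep -r backup_check /etc/cron*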

Actions #46

Updated by livdywan 6 days ago

  • Assignee changed from livdywan to dheidler

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1447

Seems like the script wasn't re-introduced. Also adjusting to only check alpha backups now. Giving back to @dheidler from here.

Actions #47

Updated by livdywan 6 days ago

  • Status changed from In Progress to Feedback

livdywan wrote in #note-46:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1447

Seems like the script wasn't re-introduced. Also adjusting to only check alpha backups now. Giving back to @dheidler from here.

Merged.

Actions #48

Updated by dheidler 6 days ago

  • Status changed from Feedback to Resolved

I guess we can close it then.
