action #169564

Configure wireguard tunnels on OSD production hosts needed for openQA located in the NUE2 server room size:S

Added by mkittler about 1 month ago. Updated 20 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Acceptance criteria

  • AC1: All OSD production hosts needed for openQA in the NUE2 server room that are managed via Salt have WireGuard setup via Salt so they can reach the CC area
  • AC2: The setup is reproducible

Suggestions


Related issues 5 (2 open, 3 closed)

Related to openQA Infrastructure (public) - action #169348: Custom, non-IT-provided wireguard tunnels to connect NUE2 OSD openQA workers to OSD (Rejected, okurz, 2024-10-24)

Related to openQA Infrastructure (public) - action #170338: No monitoring data from OSD since 2024-11-25 14:49Z size:M (Resolved, nicksinger, 2024-11-27)

Blocks openQA auto review - openqa-force-result #169834: [qe-core] Unschedule PowerKVM tests for Maintenance updates while keeping ppc64le architecture still running for PowerVM - auto_review:".*_EXIT_AFTER_SCHEDULE. Only evaluating test schedule":force_result:softfailed (In Progress, szarate)

Copied to openQA Infrastructure (public) - action #170041: Configure wireguard tunnels on hosts located in the NUE2 server room - at least one KVM@PowerNV host size:S (Resolved, okurz, 2024-11-08)

Copied to openQA Infrastructure (public) - action #170260: Help others (or ourselves) to configure wireguard tunnels on other hosts needing wireguard to PRG2 in the NUE2 server room size:M (Workable, 2024-11-26)
Actions #2

Updated by mkittler about 1 month ago

  • Parent task set to #166598
Actions #3

Updated by mkittler about 1 month ago

  • Description updated (diff)
Actions #4

Updated by okurz about 1 month ago

  • Priority changed from Normal to High
  • Target version set to Ready
Actions #5

Updated by mkittler about 1 month ago

  • Blocks action #169159: Allow variable expansion incorporating worker settings size:S added
Actions #6

Updated by mkittler about 1 month ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #7

Updated by mkittler about 1 month ago · Edited

MR: https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/5779
Updated SD ticket: https://sd.suse.com/servicedesk/customer/portal/1/SD-171369
Slack thread: https://suse.slack.com/archives/C029APBKLGK/p1731323391495109

I also added the monitoring host even though we probably have different long-term plans for it. The config might be useful until we have moved the host.
I also added the powered-off ARM worker because it might be useful if we decide to use it again.

I also installed wireguard-tools on all relevant hosts and added the authorized key as mentioned on the Confluence page. This needs to be done manually because Salt itself is affected. I nevertheless created a draft to still have the setup "documented" in Salt: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1304
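A minimal sketch of that manual step, assuming a zypper-based install and an illustrative host list (printed as a dry run rather than executed):

```shell
#!/bin/bash
# Dry-run sketch: print the per-host command for manually installing
# wireguard-tools instead of executing it. The host list is illustrative only.
hosts="sapworker1.qe.nue2.suse.org monitor.qe.nue2.suse.org"
for host in $hosts; do
  echo "would run: ssh $host 'sudo zypper --non-interactive install wireguard-tools'"
done
```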

Waiting for feedback from IT.

Actions #8

Updated by mkittler about 1 month ago

  • Status changed from In Progress to Feedback

We'll get feedback next week at the earliest.

Actions #9

Updated by okurz about 1 month ago

  • Subject changed from Configure wireguard tunnels on hosts located in the NUE2 server room to Configure wireguard tunnels on hosts located in the NUE2 server room size:S
  • Description updated (diff)
Actions #10

Updated by okurz about 1 month ago

  • Related to action #169348: Custom, non-IT-provided wireguard tunnels to connect NUE2 OSD openQA workers to OSD added
Actions #11

Updated by szarate about 1 month ago

  • Blocks openqa-force-result #169834: [qe-core] Unschedule PowerKVM tests for Maintenance updates while keeping ppc64le architecture still running for PowerVM - auto_review:".*_EXIT_AFTER_SCHEDULE. Only evaluating test schedule":force_result:softfailed added
Actions #12

Updated by mkittler about 1 month ago

  • Status changed from Feedback to Blocked

Blocked on getting feedback on the MR and the SD ticket.

Actions #13

Updated by okurz about 1 month ago

  • Copied to action #170041: Configure wireguard tunnels on hosts located in the NUE2 server room - at least one KVM@PowerNV host size:S added
Actions #14

Updated by okurz about 1 month ago

mkittler wrote:

Acceptance criteria

  • AC1: All hosts in the NUE2 server room that are managed via Salt have WireGuard setup via Salt so they can reach the CC area […]
  • When done, add affected workers back to Salt, e.g. via
    for key in petrol.qe.nue2.suse.org sapworker1.qe.nue2.suse.org diesel.qe.nue2.suse.org mania.qe.nue2.suse.org; do salt-key --accept="$key" --include-rejected --yes; done

Hi mkittler, AC1 says "All hosts in NUE2 […] managed via Salt", but the last suggestion only mentions openQA workers. That is a discrepancy, as there are more Salt-controlled hosts which are not OSD openQA workers, e.g. monitor, backup, etc. I suggest you create separate tickets for the corresponding groups. I just created #170041 for the KVM@PowerNV hosts diesel, petrol, mania. You could look into the other group of hosts depending on whether it turns out we actually need wireguard tunnels for them.

Actions #15

Updated by okurz 29 days ago

Nikolay, Lazaros and I had a call today about the last comments and open points:

  1. Priority is to have https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/5779 merged. Nikolay already provided comments. We will react.
  2. As the older ppc64le hosts on Linux 5.3 do not support the current wireguard package, we should focus on sapworker1.
  3. For bare-metal test hosts we should try to get them working using sapworker1 as the openQA control host, as before. As plan B we could follow up with setting up wireguard for those, but that would require test maintainers to adapt test code to install and set up wireguard as part of the tests.
  4. Certain problems regarding DNS resolution are expected which are likely less of a concern for openQA workers as they establish the connection to the openQA webUI.
Actions #16

Updated by okurz 28 days ago

I did

for i in \
  backup-qam.qe.nue2.suse.org backup-vm.qe.nue2.suse.org baremetal-support.qe.nue2.suse.org jenkins.qe.nue2.suse.org \
  monitor.qe.nue2.suse.org openqa-piworker.qe.nue2.suse.org osiris-1.qe.nue2.suse.org qamaster.qe.nue2.suse.org \
  schort-server.qe.nue2.suse.org tumblesle.qe.nue2.suse.org unreal6.qe.nue2.suse.org openqaworker1.qe.nue2.suse.org \
  diesel.qe.nue2.suse.org mania.qe.nue2.suse.org petrol.qe.nue2.suse.org sapworker1.qe.nue2.suse.org; do
  echo "### $i"
  ssh "$i" "sudo grep -q 'root@atlas$' /root/.ssh/authorized_keys || echo 'ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBOdQtABW5WPNpAtV0shvOTQi05M6SEUGrXLGuMByWApgwQpWEM41vjWeVIoKim7Y7x62rX99UvC5CiKvG4Do9CI= root@atlas' | sudo tee -a /root/.ssh/authorized_keys"
done

to deploy the ssh key https://confluence.suse.com/download/attachments/1593344189/wg-prg2-nue2.pub?version=1&modificationDate=1731513804592&api=v2 as suggested in https://sd.suse.com/servicedesk/customer/portal/1/SD-171369

Actions #17

Updated by mkittler 24 days ago

It looks like the setup works on sapworker1. I can reach OSD and download.suse.de via HTTP. The worker also appears as online and is picking up jobs.

So far test results don't look good, though: https://openqa.suse.de/tests/15987565
So we'll have to keep an eye on that.

An additional problem is that the salt-minion still cannot connect to OSD:

Nov 25 10:55:55 sapworker1 salt-minion[76940]: [ERROR   ] Failed to send msg SaltReqTimeoutError('Message timed out',)
Nov 25 10:55:55 sapworker1 salt-minion[76940]: [ERROR   ] Error while bringing up minion for multi-master. Is master at openqa.suse.de responding?

Of course I accepted the key on OSD and I have also restarted salt-minion.service.
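A hedged troubleshooting sketch for that situation, assuming the standard Salt master ports 4505/4506 and a working `wg` binary on the minion (the checks are printed here rather than executed):

```shell
#!/bin/bash
# Dry-run checklist for a salt-minion that cannot reach its master through the
# tunnel: tunnel state and last handshake, TCP reachability of the Salt ports,
# and a manual ping to the master. Commands are printed, not executed.
master=openqa.suse.de
checks="sudo wg show
nc -zv $master 4505
nc -zv $master 4506
sudo salt-call --master=$master test.ping"
while IFS= read -r check; do
  echo "would run: $check"
done <<< "$checks"
```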


I replied on the SD ticket to have the config applied on all hosts where it is possible.

Actions #18

Updated by okurz 24 days ago

Please put openqaworker-arm-1 out of production again and power it off. https://racktables.nue.suse.com/index.php?page=object&object_id=9886 correctly marks the machine as "unused" with a link to #167057. Priority should be given to machines that are currently in production use.

Actions #19

Updated by mkittler 24 days ago

  • Status changed from Blocked to In Progress

I think it makes sense to cover arm-1 here; it has now been configured, so I shut it down again.

I silenced the systemd alert because NFS is not working, so the mount units sometimes fail.

Actions #20

Updated by openqa_review 23 days ago

  • Due date set to 2024-12-10

Setting due date based on mean cycle time of SUSE QE Tools

Actions #21

Updated by gpathak 23 days ago

  • Tags changed from infra, cc, wireguard to infra, cc, wireguard, alert
Actions #22

Updated by mkittler 23 days ago

It looks good now on the relevant hosts. I guess I'll now have to work on AC2.

Actions #23

Updated by gpathak 23 days ago

  • Subtask #170257 added
Actions #24

Updated by okurz 23 days ago

  • Subject changed from Configure wireguard tunnels on hosts located in the NUE2 server room size:S to Configure wireguard tunnels on OSD production hosts located in the NUE2 server room size:S
  • Description updated (diff)
Actions #25

Updated by okurz 23 days ago

  • Subject changed from Configure wireguard tunnels on OSD production hosts located in the NUE2 server room size:S to Configure wireguard tunnels on OSD production hosts needed for openQA located in the NUE2 server room size:S
  • Description updated (diff)
Actions #26

Updated by okurz 23 days ago

  • Copied to action #170260: Help others (or ourselves) to configure wireguard tunnels on other hosts needing wireguard to PRG2 in the NUE2 server room size:M added
Actions #27

Updated by mkittler 23 days ago

I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/941 to avoid further test failures due to bare-metal hosts not reaching OSD for assets. We might need to create a follow-up ticket; so far I tracked it via #168097#note-29.

Actions #28

Updated by mkittler 23 days ago

  • Blocks deleted (action #169159: Allow variable expansion incorporating worker settings size:S)
Actions #29

Updated by mkittler 23 days ago

I updated https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1304. It now also contains a README section explaining the WireGuard setup so we can continue with other hosts more easily in the follow-up ticket.

I'm not sure whether it makes sense to add /etc/wireguard/prg2wg.conf to Salt. It contains the private key, so we would need to add that to the pillars first. It also contains a list of allowed IPs which differs between hosts, and I'm not sure how it is generated. Maybe we should skip this file considering it is configured by Eng-Infra. We could salt the configured systemd units, but they depend on the config file, so that alone doesn't make much sense. So I only added this information to the README (for the sake of troubleshooting).
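For context, a wg-quick style config of this kind typically has the following shape. This is a generic sketch with placeholder values, not the actual Eng-Infra-generated prg2wg.conf:

```ini
; Generic shape of a wg-quick config such as /etc/wireguard/prg2wg.conf.
; All values are placeholders; the real file is generated by Eng-Infra and
; contains the host's private key, which is why salting it would require
; adding the key to the pillars first.
[Interface]
PrivateKey = <host-private-key>
Address = <tunnel-ip>/32

[Peer]
PublicKey = <gateway-public-key>
Endpoint = <gateway-host>:51820
AllowedIPs = <per-host list of routed networks>
PersistentKeepalive = 25
```

The AllowedIPs line is the per-host part mentioned above: it controls which networks are routed through the tunnel and therefore differs between hosts.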

Actions #30

Updated by okurz 23 days ago

  • Subtask deleted (#170257)
Actions #31

Updated by openqa_review 22 days ago

  • Due date set to 2024-12-11

Setting due date based on mean cycle time of SUSE QE Tools

Actions #32

Updated by okurz 22 days ago

  • Related to action #170338: No monitoring data from OSD since 2024-11-25 1449Z size:M added
Actions #33

Updated by mkittler 22 days ago

  • Status changed from In Progress to Feedback
Actions #34

Updated by mkittler 21 days ago

Everything has been merged now, but I'll have to look into backing up the config in Salt pillars as mentioned in the daily.

Actions #35

Updated by mkittler 20 days ago

  • Status changed from Feedback to In Progress
Actions #36

Updated by mkittler 20 days ago

  • Status changed from In Progress to Feedback

I hope this simple MR suffices as a backup: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/949

The backup and the added documentation are hopefully enough to call the setup "reproducible" as per AC2.
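Such a pillar backup could look roughly like this. The key names below are purely illustrative, not the actual schema used in the MR:

```yaml
# Hypothetical sketch of per-host WireGuard data kept in salt-pillars-openqa
# as a backup; key names are illustrative, not the real pillar structure.
wireguard:
  hosts:
    sapworker1.qe.nue2.suse.org:
      tunnel_address: <tunnel-ip>/32
      public_key: <host-public-key>
      allowed_ips:
        - <routed-network>
```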

Actions #37

Updated by okurz 20 days ago

  • Due date deleted (2024-12-11)
  • Status changed from Feedback to Resolved

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/949 was merged. I agree that with this we should consider the ticket resolved. I guess we will find out in #170260 whether it's clear enough where to follow up for other hosts :)
