closed
Configure wireguard tunnels on OSD production hosts needed for openQA located in the NUE2 server room size:S
Acceptance criteria
- AC1: All OSD production hosts needed for openQA in the NUE2 server room that are managed via Salt have WireGuard set up via Salt so they can reach the CC area
- AC2: The setup is reproducible
- Follow steps on on one host and prepare a Salt change to apply this to other relevant hosts.
- Introduce a special role or add a condition based on worker classes to set up WireGuard only on hosts in the NUE2 server room.
- Take as inspiration for the Salt change.
- This involves letting IT do the final configuration manually. Supposedly that's also where the keypair is generated and the public key copied over to the WG gateway.
- Have a look at in case we get a response from IT after all.
- Talk to Beijing Colleagues who have already been through this.
- Record in Salt or in the documentation what needs to be done to reproduce the setup, e.g. put private keys into the Salt pillar repo
- When done, add affected workers back to Salt, e.g. via
for key in; do salt-key --accept="$key" --include-rejected --yes; done
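The loop above can be wrapped in a small helper so the host list is passed explicitly and the commands can be previewed first. This is only a sketch: the function name `accept_minion_keys` and the `DRY_RUN` switch are made up for illustration, and the host names in the usage note are placeholders, not the actual NUE2 workers. The `salt-key` flags are the ones already used in the ticket.

```shell
#!/bin/bash
# Hypothetical helper: re-accept Salt minion keys for the hosts given as arguments.
# With DRY_RUN=1 the salt-key commands are only printed, not executed.
accept_minion_keys() {
    local key
    for key in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "salt-key --accept=$key --include-rejected --yes"
        else
            salt-key --accept="$key" --include-rejected --yes
        fi
    done
}
```

For example, `DRY_RUN=1 accept_minion_keys worker-a.example.org worker-b.example.org` would print one `salt-key` command per placeholder host without touching the Salt master.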
Updated by mkittler 4 months ago · Edited
Updated SD ticket:
Slack thread:
I also added the monitoring host even though we probably have different long-term plans for this host. The config might be useful until we have moved the host.
I also added the powered-off arm worker because it might be useful if we decide to use it again.
I also installed wireguard-tools on all relevant hosts and added the authorized key as mentioned on the Confluence page. This needs to be done manually because Salt is also affected. I nevertheless created a draft to still have the setup "documented" in Salt:
Waiting for feedback from IT.
Updated by okurz 4 months ago
Updated by szarate 4 months ago
Updated by okurz 4 months ago
Updated by okurz 4 months ago
mkittler wrote:
Acceptance criteria
- AC1: All hosts in the NUE2 server room that are managed via Salt have WireGuard setup via Salt so they can reach the CC area […]
- When done, add affected workers back to Salt, e.g. via
for key in; do salt-key --accept="$key" --include-rejected --yes; done
Hi mkittler, AC1 says "All hosts in NUE2 […] managed via Salt" but the last suggestion only mentions openQA workers. That is a discrepancy, as there are more Salt-controlled hosts which are not OSD openQA workers, e.g. monitor, backup, etc. I suggest you create separate tickets for the respective groups. I just created #170041 for the KVM@PowerNV hosts diesel, petrol, mania. You could look into the other groups of hosts depending on whether it turns out that we actually need wireguard tunnels for them.
Updated by okurz 3 months ago
Nikolay, Lazaros and I had a call today about the last comments and open points:
- Priority is to have merged. Nikolay already provided comments. We will react.
- As the older Linux 5.3 on ppc64le does not support the current wireguard package, we should focus on sapworker1
- For bare-metal test hosts we should try to get them working using sapworker1 as openQA control host as before. As plan B we could follow up with setting up wireguard for those but that would need test maintainers to adapt test code to install and setup wireguard as part of the tests.
- Certain problems regarding DNS resolution are expected which are likely less of a concern for openQA workers as they establish the connection to the openQA webUI.
Updated by okurz 3 months ago
I did
for i in ; do echo "### $i" && ssh $i "sudo grep -q 'root@atlas$' /root/.ssh/authorized_keys || echo 'ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBOdQtABW5WPNpAtV0shvOTQi05M6SEUGrXLGuMByWApgwQpWEM41vjWeVIoKim7Y7x62rX99UvC5CiKvG4Do9CI= root@atlas' | sudo tee -a /root/.ssh/authorized_keys" ; done
to deploy the ssh key as suggested in
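The append-only-if-missing pattern in that one-liner can be shown in isolation. The sketch below is a local variant under stated assumptions: the function name `append_once` is hypothetical, it compares whole lines as fixed strings rather than matching the `root@atlas$` comment suffix as the command above does, and it writes to whatever file it is given without `sudo` or `ssh`.

```shell
#!/bin/bash
# Hypothetical helper: append LINE to FILE unless an identical line already exists.
# Usage: append_once FILE LINE
append_once() {
    local file=$1 line=$2
    # -q: no output, -x: match the whole line, -F: treat LINE as a fixed string
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

Running it twice with the same key leaves a single entry in the file, so the loop can be repeated over all hosts without creating duplicates.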
Updated by mkittler 3 months ago
It looks like the setup works on sapworker1. I can reach OSD and via HTTP. The worker also appears as online and is picking up jobs.
So far test results don't look good, though:
So we'll have to keep an eye on that.
An additional problem is that the salt-minion still cannot connect to OSD:
Nov 25 10:55:55 sapworker1 salt-minion[76940]: [ERROR ] Failed to send msg SaltReqTimeoutError('Message timed out',)
Nov 25 10:55:55 sapworker1 salt-minion[76940]: [ERROR ] Error while bringing up minion for multi-master. Is master at responding?
Of course I accepted the key on OSD and I have also restarted salt-minion.service.
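One way to narrow down such a timeout is to check from the minion whether the Salt master's TCP ports are reachable at all through the tunnel. The following is a rough sketch, not something that was run in this ticket: the function name `port_open` is made up, the master address has to be supplied by the caller, and 4505/4506 are Salt's default publish and return ports, which a given installation may have changed.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a TCP connection to HOST:PORT can be opened
# within 3 seconds, using bash's built-in /dev/tcp redirection.
port_open() {
    local host=$1 port=$2
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Report the state of both default Salt master ports for the given master.
check_salt_master() {
    local master=$1 port
    for port in 4505 4506; do
        if port_open "$master" "$port"; then
            echo "$master:$port reachable"
        else
            echo "$master:$port NOT reachable"
        fi
    done
}
```

Calling `check_salt_master <master-address>` on the worker would then show whether the problem is at the network level or in Salt itself.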
I replied on the SD ticket to have the config applied on all hosts where it is possible.
Updated by okurz 3 months ago
Please put openqaworker-arm-1 out of production again and power it off. has the machine correctly marked as "unused" with a link to #167057. Priority should be machines that are currently in production use.
Updated by openqa_review 3 months ago
Updated by okurz 3 months ago
Updated by mkittler 3 months ago
I created to avoid further test failures due to bare-metal hosts not reaching OSD for assets. We might need to create a follow-up ticket; so far I have tracked it via #168097#note-29.
Updated by mkittler 3 months ago
Updated by mkittler 3 months ago
I updated It now also contains a README section to explain the Wireguard setup so we can continue with other hosts more easily in the follow-up ticket.
Not sure whether it makes sense to add /etc/wireguard/prg2wg.conf to Salt. It contains the private key, so we would need to add that to the pillars first. It also contains a list of allowed IPs which differs between hosts and I'm not sure how it is generated. Maybe we should skip this file considering it is configured by Eng-Infra. We could salt the configured systemd units, but they depend on the config file, so that doesn't make much sense on its own. So I only added this information to the README (for the sake of troubleshooting).
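When sharing such a config for troubleshooting, the private key should not leak. Below is a minimal sketch for printing a config with the key masked; the function name `redact_wg_config` is hypothetical, and it assumes the file follows the usual WireGuard layout with a `PrivateKey = …` line and that the caller has read access (on a real host the file is typically only readable by root).

```shell
#!/bin/bash
# Hypothetical helper: print a WireGuard config file with the PrivateKey value masked.
# All other lines (Address, AllowedIPs, Endpoint, ...) are printed unchanged.
redact_wg_config() {
    sed -E 's/^([[:space:]]*PrivateKey[[:space:]]*=[[:space:]]*).*/\1(redacted)/' "$1"
}
```

The output keeps the host-specific allowed IPs visible, which is the part that differs between hosts, while hiding the secret.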
Updated by okurz 3 months ago
Updated by mkittler 3 months ago
- Status changed from In Progress to Feedback
I hope this simple MR suffices as a backup:
The backup and the added documentation are hopefully enough to call the setup "reproducible" as per AC2.
Updated by okurz 3 months ago
- Status changed from Feedback to Resolved
merged. I agree that with this we should consider this ticket resolved. I guess we will find out in #170260 if it's clear enough where to follow up for other hosts :)