action #159066
closed
openQA Project - coordination #105624: [saga][epic] Reconsider how openQA handles secrets
openQA Project - coordination #157537: [epic] Secure setup of openQA test machines with secure network+secure authentication
network-level firewall preventing direct ssh+vnc access to openQA test VMs size:M
Description
Motivation¶
In https://sd.suse.com/servicedesk/customer/portal/1/SD-150437 we are asked to handle "compromised root passwords in QA segments" including s390zl11…16. Because we failed to set up a firewall on the hypervisor hosts directly, see #158242, we should ask SUSE-IT to REJECT – please don't DROP, to not further confuse people – direct ssh access to the specific IP addresses of s390kvm VMs as managed in https://gitlab.suse.de/OPS-Service/salt/ from anything but the QE production networks like oqa.prg2.suse.org and qe.prg2.suse.org.
Acceptance criteria¶
- AC1: firewall on network level prevents direct ssh+vnc access from outside, i.e. normal office networks, to openQA test VMs, e.g. s390kvm080.oqa.prg2.suse.org…s390kvm099.oqa.prg2.suse.org
- AC2: openQA svirt jobs are still able to access ssh+vnc as necessary, e.g. from openQA workers in the same network OR openQA workers on the hypervisor hosts themselves
- AC3: Administrators can still access ssh+vnc of production machines within oqa.prg2.suse.org, e.g. openQA worker hosts and hypervisor hosts (but not test VMs)
Suggestions¶
- Take openQA svirt worker instances related to one hypervisor host, e.g. s390zl12, out of production for testing
- Create IT ticket according to https://progress.opensuse.org/projects/qa/wiki/Tools#SUSE-IT-ticket-handling and ask for the network-level firewall to block ssh+vnc to VMs running on s390zl12+13, e.g. s390kvm080.oqa.prg2.suse.org…s390kvm099.oqa.prg2.suse.org
- Allow traffic from other hosts in oqa.prg2.suse.org
- Ensure that openQA tests still work, e.g. the login to the target SUT VM in "boot_to_desktop", and use such tests for verification
- Ensure that the solution at least applies to s390kvm080.oqa.prg2.suse.org…s390kvm099.oqa.prg2.suse.org
Updated by okurz 7 months ago
- Copied from action #158242: Prevent ssh access to test VMs on svirt hypervisor hosts with firewall size:M added
Updated by okurz 7 months ago
- Copied to action #159069: network-level firewall preventing direct ssh+vnc access to all machines within the oqa.prg2.suse.org network if needed added
Updated by nicksinger 7 months ago
- Subject changed from network-level firewall preventing direct ssh+vnc access to openQA test VMs to network-level firewall preventing direct ssh+vnc access to openQA test VMs size:M
- Status changed from New to Workable
Updated by nicksinger 7 months ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
Updated by nicksinger 7 months ago
- Status changed from In Progress to Blocked
- Priority changed from High to Normal
Half of production was taken out with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/801 and I raised an SD ticket describing what we want: https://sd.suse.com/servicedesk/customer/portal/1/SD-155731
I'm lowering the prio because I don't see much more we can do right now.
Updated by mgriessmeier 7 months ago
Hi,
Would it also be possible to use the two machines of https://progress.opensuse.org/issues/159063 for the purpose of testing those firewall rules instead of production hardware?
I would like to avoid a shortage of s390 production workers when GMC hits end of next week.
Updated by nicksinger 7 months ago
mgriessmeier wrote in #note-7:
Hi,
Would it also be possible to use the two machines of https://progress.opensuse.org/issues/159063 for the purpose of testing those firewall rules instead of production hardware?
I would like to avoid a shortage of s390 production workers when GMC hits end of next week.
We discovered that only 10/20 zl12 workers were enabled. That was handled in https://progress.opensuse.org/issues/158170#note-20
As discussed in Slack we're monitoring the queue size and will see if 20 slots are good enough for now.
Updated by nicksinger 7 months ago
- Status changed from Blocked to In Progress
So apparently firewall rules on the network level would be way too complicated. Therefore I requested to close the linked SD ticket.
Looking into several options I finally made some progress with native nft rules in the "netdev" table. This table is very low level and sees incoming packets almost immediately once they reach the NIC. A basic rule looks something like this:
table netdev filtermacvtap {
    chain filterin_17 {
        type filter hook ingress device "macvtap16" priority filter; policy accept;
        ip saddr != 10.145.10.0/24 tcp dport 22 drop
    }
}
As these rules cannot exist before the interface is present, we have to use libvirtd's hooking mechanism to create these rules manually - I roughly followed https://serverfault.com/a/1147552 and created /etc/libvirt/hooks/qemu.d/block-ssh.sh:
#!/bin/sh
if [ "$2" = start ] && [ "$3" = begin ]; then
    XML=$(cat /dev/stdin)
    IFACE=$(echo "$XML" | xmlstarlet select -t -m 'domain/devices/interface[@type="direct"]' -v 'target/@dev')
    DOMID=$(echo "$XML" | xmlstarlet select -t -v 'domain/@id')
    nft "add chain netdev filtermacvtap filterin_${DOMID} { type filter hook ingress device $IFACE priority filter; policy accept; }"
    nft "add rule netdev filtermacvtap filterin_${DOMID} ip saddr != 10.145.10.42/24 tcp dport 22 reject"
fi
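For context, the hook receives the full domain XML on stdin; a minimal sketch of the part the two XPath expressions above match could look like this (the interface name and id are made-up example values, not taken from a real domain):

```xml
<domain type='kvm' id='17'>
  <devices>
    <interface type='direct'>
      <!-- target/@dev is the macvtap interface the nft chain hooks into -->
      <target dev='macvtap16'/>
    </interface>
  </devices>
</domain>
```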
This worked already quite well: https://openqa.suse.de/tests/overview?distri=sle&version=15-SP6&build=83.1-test-for-poo159066 - the 4 failing tests seem to fail earlier, already when connecting to zl12, so I'm not sure if this is really related to my changes. But there are still a few to-dos open:
- Add a rule for IPv6 (maybe simply adding
  nft "add rule netdev filtermacvtap filterin_${DOMID} ip6 saddr != [V6_SUBNET] tcp dport 22 reject"
  is already enough?)
- What about other ports? VNC? Should we maybe set the default policy to reject and just whitelist connections without specific ports?
- Salt this
- Implement it on zl13
Updated by nicksinger 7 months ago
nicksinger wrote in #note-9:
This worked already quite well: https://openqa.suse.de/tests/overview?distri=sle&version=15-SP6&build=83.1-test-for-poo159066 - the 4 failing tests seem to fail earlier, already when connecting to zl12, so I'm not sure if this is really related to my changes.
I found the issue and adjusted our sshd config for all workers: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1179
Updated by openqa_review 7 months ago
- Due date set to 2024-05-23
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 6 months ago · Edited
Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1184 with a first draft containing the script. It makes use of our "roles" grain which needs to contain two roles at the same time:
s390zl12:~ # cat /etc/salt/grains
passwordlogin: True
roles:
- libvirt
- worker
openqa:~ # salt -C 'G@roles:libvirt and G@roles:worker' test.ping
s390zl12.oqa.prg2.suse.org:
True
Updated by okurz 6 months ago
This caused a failure in https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2602994#L45
My proposal for a fix: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/807
Updated by okurz 6 months ago
Discussed with nicksinger: I removed the "worker" role again for now and am proposing to revert my https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/808. Instead we agreed to define a custom grain "external_openqa_hypervisor" with value "True" for both hosts which prevents the firewall. Or alternatively
external_openqa_hypervisor_passlist:
- 10.145.10.0/24
- 2a07:de40:b203:12:…/64
and iterate over each entry
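As a sketch, such a passlist grain could be placed in /etc/salt/grains on the hypervisor hosts, next to the entries already shown for s390zl12; note that the IPv6 prefix here is completed with the subnet used in the nft rules elsewhere in this ticket and is an assumption for this grain:

```yaml
# hypothetical /etc/salt/grains entry on the hypervisor hosts
external_openqa_hypervisor_passlist:
  - 10.145.10.0/24
  - 2a07:de40:b203:12::0/64  # assumption: same subnet as in the nft rules
```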
Updated by nicksinger 6 months ago
I tried to set the default policy to "drop" and only explicitly allow what is needed (tried with "everything" for now to get a test working at all). I ended up with the following rules:
nft "add table netdev filtermacvtap" # should not do anything if table is already present
nft "add chain netdev filtermacvtap filterin_${DOMID} { type filter hook ingress device $IFACE priority filter; policy drop; }" # drop everything by default (reject is not available here)
nft "add rule netdev filtermacvtap filterin_${DOMID} ip saddr 10.145.10.0/24 accept"
nft "add rule netdev filtermacvtap filterin_${DOMID} ip6 saddr 2a07:de40:b203:12::0/64 accept"
nft "add rule netdev filtermacvtap filterin_${DOMID} ip saddr 10.144.98.239/32 accept"
nft "add rule netdev filtermacvtap filterin_${DOMID} meta l4proto udp accept"
nft "add rule netdev filtermacvtap filterin_${DOMID} meta l4proto icmp accept"
nft "add rule netdev filtermacvtap filterin_${DOMID} meta l4proto ipv6-icmp accept"
nft "add rule netdev filtermacvtap filterin_${DOMID} meta l4proto ipv6 accept"
nft "add rule netdev filtermacvtap filterin_${DOMID} ether type arp accept"
nft "add rule netdev filtermacvtap filterin_${DOMID} log"
Unfortunately this fails because the reply packets of every outgoing connection of the VM count as new incoming traffic and get blocked unless explicitly whitelisted (the netdev table does not support stateful filtering).
As already discussed in the daily I will go ahead and just drop ports 22+59[00-99] explicitly. Not nice but should cover our use-case.
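For comparison, stateful filtering, which would have made the "drop by default" approach workable, needs a table family with conntrack support, e.g. the inet family. A minimal sketch of what that would look like (not usable here, since only the netdev ingress hook attaches to the per-VM macvtap device):

```
table inet filter_sut {
    chain input {
        type filter hook input priority filter; policy drop;
        ct state established,related accept
        ip saddr 10.145.10.0/24 accept
    }
}
```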
Updated by nicksinger 6 months ago
The major changes:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1184
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/810
Some cleanups:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1192
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1193
all have been merged. I added the external_openqa_hypervisor role to s390zl12 and the deploy-pipeline created /etc/libvirt/hooks/qemu.d/setup-sut-firewall.sh with the following content:
#!/bin/sh
# libvirt hook script (https://libvirt.org/hooks.html)
# receives the DOM XML via stdin and metadata from libvirt as arguments
if [ "$2" = start ] && [ "$3" = begin ]; then
    XML=$(cat /dev/stdin)
    IFACE=$(echo "$XML" | xmlstarlet select -t -m 'domain/devices/interface[@type="direct"]' -v 'target/@dev')
    DOMID=$(echo "$XML" | xmlstarlet select -t -v 'domain/@id')
    nft "add table netdev filtermacvtap" # should not do anything if table is already present
    nft "add chain netdev filtermacvtap filterin_${DOMID} { type filter hook ingress device $IFACE priority filter; policy accept; }"
    nft "add rule netdev filtermacvtap filterin_${DOMID} ip saddr != { 10.145.10.0/24 } tcp dport { 22, 5800-5899, 5900-5999 } reject comment \"reject global SUT access to specific ports\""
    nft "add rule netdev filtermacvtap filterin_${DOMID} ip6 saddr != { 2a07:de40:b203:12::0/64 } tcp dport { 22, 5800-5899, 5900-5999 } reject"
    #nft "add rule netdev filtermacvtap filterin_${DOMID} log" # helpful for debugging, messages can be found in `journalctl -ft kernel`
fi
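One thing the hook does not cover is removing chains again: every started VM leaves a filterin_<id> chain behind. A hypothetical companion cleanup for the "stopped" phase could look like the sketch below; NFT_BIN is only an indirection so the command can be dry-run, a real hook would call nft directly:

```shell
#!/bin/sh
# Hypothetical cleanup sketch (not part of the deployed hook): delete the
# per-domain chain again when a VM stops, so chains do not accumulate.
NFT_BIN="${NFT_BIN:-nft}"

delete_chain() {
    # remove the ingress filter chain for the given libvirt domain id
    "$NFT_BIN" "delete chain netdev filtermacvtap filterin_$1"
}

# libvirt passes: $1=domain name, $2=operation, $3=sub-operation
if [ "$2" = stopped ] && [ "$3" = end ]; then
    DOMID=$(xmlstarlet select -t -v 'domain/@id' < /dev/stdin)
    delete_chain "$DOMID"
fi
```

Whether the domain XML on the "stopped" phase still carries the id attribute would need checking; if not, the id would have to be remembered from the start phase, e.g. in a state file.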
Updated by nicksinger 6 months ago
- Status changed from In Progress to Resolved
zl12 is back in production with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/811 and the role was added to zl13. Machines seem to complete jobs successfully. The few "failed" ones I found have been failing for some time already so I assume this is not related to my change. I think that covers all ACs and we can consider this done.