action #159066

closed

openQA Project - coordination #105624: [saga][epic] Reconsider how openQA handles secrets

openQA Project - coordination #157537: [epic] Secure setup of openQA test machines with secure network+secure authentication

network-level firewall preventing direct ssh+vnc access to openQA test VMs size:M

Added by okurz 30 days ago. Updated 15 minutes ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2024-03-28
Due date:
% Done:

0%

Estimated time:

Description

Motivation

In https://sd.suse.com/servicedesk/customer/portal/1/SD-150437 we are asked to handle "compromised root passwords in QA segments" including s390zl11…16. Because we failed to set up a firewall on the hypervisor hosts directly, see #158242, we should ask SUSE-IT to REJECT – please don't DROP to not further confuse people – direct ssh access to the specific IP addresses of s390kvm VMs as managed in https://gitlab.suse.de/OPS-Service/salt/ from anything but the QE production networks like oqa.prg2.suse.org and qe.prg2.suse.org.

Acceptance criteria

  • AC1: firewall on network level prevents direct ssh+vnc access from outside, i.e. normal office networks, to openQA test VMs, e.g. s390kvm080.oqa.prg2.suse.org…s390kvm099.oqa.prg2.suse.org
  • AC2: openQA svirt jobs are still able to access ssh+vnc as necessary, e.g. from openQA workers in the same network OR openQA workers on the hypervisor hosts themselves
  • AC3: Administrators can still access ssh+vnc of production machines within oqa.prg2.suse.org, e.g. openQA worker hosts and hypervisor hosts (but not test VMs)

Suggestions


Related issues 2 (0 open, 2 closed)

Copied from openQA Infrastructure - action #158242: Prevent ssh access to test VMs on svirt hypervisor hosts with firewall size:M (Rejected, dheidler, 2024-03-28)

Copied to openQA Infrastructure - action #159069: network-level firewall preventing direct ssh+vnc access to all machines within the oqa.prg2.suse.org network if needed (Rejected, okurz, 2024-03-28)

Actions #1

Updated by okurz 30 days ago

  • Copied from action #158242: Prevent ssh access to test VMs on svirt hypervisor hosts with firewall size:M added
Actions #2

Updated by okurz 30 days ago

  • Description updated (diff)
Actions #3

Updated by okurz 30 days ago

  • Copied to action #159069: network-level firewall preventing direct ssh+vnc access to all machines within the oqa.prg2.suse.org network if needed added
Actions #4

Updated by nicksinger 28 days ago

  • Subject changed from network-level firewall preventing direct ssh+vnc access to openQA test VMs to network-level firewall preventing direct ssh+vnc access to openQA test VMs size:M
  • Status changed from New to Workable
Actions #5

Updated by nicksinger 16 days ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #6

Updated by nicksinger 16 days ago

  • Status changed from In Progress to Blocked
  • Priority changed from High to Normal

I took half of production out with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/801 and raised an SD ticket describing what we want: https://sd.suse.com/servicedesk/customer/portal/1/SD-155731

I'm lowering the prio because I don't see much more we can do right now.

Actions #7

Updated by mgriessmeier 9 days ago

Hi,
Would it also be possible to use the two machines of https://progress.opensuse.org/issues/159063 for testing those firewall rules instead of production hardware?
I would like to avoid a shortage of s390 production workers when the GMC hits at the end of next week.

Actions #8

Updated by nicksinger 9 days ago

mgriessmeier wrote in #note-7:

Hi,
Would it also be possible to use the two machines of https://progress.opensuse.org/issues/159063 for testing those firewall rules instead of production hardware?
I would like to avoid a shortage of s390 production workers when the GMC hits at the end of next week.

We discovered that only 10/20 zl12 workers were enabled. That was handled in https://progress.opensuse.org/issues/158170#note-20
As discussed in Slack, we're monitoring the queue size to see if 20 slots are good enough for now.

Actions #9

Updated by nicksinger 8 days ago

  • Status changed from Blocked to In Progress

So apparently firewall rules on the network level would be way too complicated. Therefore I requested to close the linked SD ticket.
Looking into several options, I finally made some progress with native nft rules in the "netdev" table. This table is very low-level and receives incoming packets almost immediately once they reach the NIC. A basic rule looks something like this:

table netdev filtermacvtap {
    chain filterin_17 {
        type filter hook ingress device "macvtap16" priority filter; policy accept;
        ip saddr != 10.145.10.0/24 tcp dport 22 drop
    }
}

As these rules cannot exist before the interface is present, we have to use libvirtd's hooking mechanism to create these rules manually - I roughly followed https://serverfault.com/a/1147552 and created /etc/libvirt/hooks/qemu.d/block-ssh.sh:

#!/bin/sh
if [ "$2" = start ] && [ "$3" = begin ]; then
    XML=$(cat /dev/stdin)
    IFACE=$(echo "$XML" | xmlstarlet select -t -m 'domain/devices/interface[@type="direct"]' -v 'target/@dev')
    DOMID=$(echo "$XML" | xmlstarlet select -t -v 'domain/@id')
    nft "add chain netdev filtermacvtap filterin_${DOMID} { type filter hook ingress device $IFACE priority filter; policy accept; }"
    nft "add rule netdev filtermacvtap filterin_${DOMID} ip saddr != 10.145.10.42/24 tcp dport 22 reject"
fi

This already worked quite well: https://openqa.suse.de/tests/overview?distri=sle&version=15-SP6&build=83.1-test-for-poo159066 - the 4 failing tests seem to fail earlier, already when connecting to zl12, so I'm not sure if this is really related to my changes. But there are still a few open todos:

  1. Add a rule for IPv6 (maybe simply adding nft "add rule netdev filtermacvtap filterin_${DOMID} ip6 saddr != [V6_SUBNET] tcp dport 22 reject" is already enough?)
  2. What about other ports? VNC? Should we maybe set the default policy to reject and just whitelist connections without specific ports?
  3. Salt this
  4. Implement it on zl13
Actions #10

Updated by nicksinger 8 days ago

nicksinger wrote in #note-9:

This already worked quite well: https://openqa.suse.de/tests/overview?distri=sle&version=15-SP6&build=83.1-test-for-poo159066 - the 4 failing tests seem to fail earlier, already when connecting to zl12, so I'm not sure if this is really related to my changes.

I found the issue and adjusted our sshd config for all workers: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1179

Actions #11

Updated by openqa_review 7 days ago

  • Due date set to 2024-05-23

Setting due date based on mean cycle time of SUSE QE Tools

Actions #12

Updated by nicksinger 3 days ago · Edited

Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1184 with a first draft containing the script. It makes use of our "roles" grain which needs to contain two roles at the same time:

s390zl12:~ # cat /etc/salt/grains
passwordlogin: True
roles:
 - libvirt
 - worker
openqa:~ # salt -C 'G@roles:libvirt and G@roles:worker' test.ping
s390zl12.oqa.prg2.suse.org:
    True
Actions #14

Updated by okurz 3 days ago

As discussed with nicksinger, I removed the "worker" role again for now and am proposing to revert my https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/808. Instead we agreed to define a custom grain "external_openqa_hypervisor" with the value "True" for both hosts which prevents the firewall. Or define

external_openqa_hypervisor_passlist:
  - 10.145.10.0/24
  - 2a07:de40:b203:12:…/64

and iterate over each entry
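For illustration, iterating over each passlist entry in the hook script could generate one accept rule per entry, which would then need a trailing reject rule after them. The following is a rough sketch only; the subnets, the port set and the rule layout are assumptions, not what was actually merged:

```shell
# Hypothetical sketch: expand a passlist (as it might arrive from the
# proposed external_openqa_hypervisor_passlist grain) into one nft accept
# rule per entry. All names and values here are illustrative.
PASSLIST="10.145.10.0/24 2a07:de40:b203:12::/64"
DOMID=17
RULES=""
for SRC in $PASSLIST; do
    case "$SRC" in
        *:*) FAM=ip6 ;;  # crude family detection: IPv6 entries contain a colon
        *)   FAM=ip ;;
    esac
    RULES="${RULES}add rule netdev filtermacvtap filterin_${DOMID} ${FAM} saddr ${SRC} tcp dport { 22, 5800-5999 } accept
"
done
printf '%s' "$RULES"
```

Each generated line could then be passed to nft, followed by one final reject rule for the same ports.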

Actions #15

Updated by nicksinger 2 days ago

I tried to set the default policy to "drop" and only explicitly allow what is needed (tried with "everything" for now to get a test working at all). I ended up with the following rules:

    nft "add table netdev filtermacvtap" #should not do anything if table is already present
    nft "add chain netdev filtermacvtap filterin_${DOMID} { type filter hook ingress device $IFACE priority filter; policy drop; }" #drop everything by default (reject is not available here)
    nft "add rule netdev filtermacvtap filterin_${DOMID} ip saddr 10.145.10.0/24 accept"
    nft "add rule netdev filtermacvtap filterin_${DOMID} ip6 saddr 2a07:de40:b203:12::0/64 accept"
    nft "add rule netdev filtermacvtap filterin_${DOMID} ip saddr 10.144.98.239/32 accept"
    nft add rule netdev filtermacvtap filterin_${DOMID} meta l4proto udp accept
    nft add rule netdev filtermacvtap filterin_${DOMID} meta l4proto icmp accept
    nft add rule netdev filtermacvtap filterin_${DOMID} meta l4proto ipv6-icmp accept
    nft add rule netdev filtermacvtap filterin_${DOMID} meta l4proto ipv6 accept
    nft add rule netdev filtermacvtap filterin_${DOMID} ether type arp accept
    nft add rule netdev filtermacvtap filterin_${DOMID} log

Unfortunately this fails because every outgoing connection of the VM causes new incoming reply packets which in turn get blocked if not explicitly whitelisted (the netdev table used here does not support stateful filtering).

As already discussed in the daily, I will go ahead and just drop ports 22+59[00-99] explicitly. Not nice, but it should cover our use case.
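As a side note on that limitation: stateful filtering with "ct state" is available in conntrack-capable table families such as inet. The fragment below is a purely hypothetical illustration of what that would look like; it was not usable here because traffic destined for a macvtap guest bypasses the host's IP stack and never reaches the inet input hook, which is why the netdev ingress hook was needed in the first place.

```
# Hypothetical conntrack-based variant (NOT applicable to macvtap guest
# traffic; shown only to illustrate what the netdev family lacks)
table inet sutfilter {
    chain input {
        type filter hook input priority filter; policy drop;
        ct state established,related accept
        ip saddr 10.145.10.0/24 tcp dport 22 accept
    }
}
```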

Actions #16

Updated by nicksinger 1 day ago

The major changes:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1184
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/810

Some cleanups:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1192
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1193

All have been merged. I added the external_openqa_hypervisor role to s390zl12 and the deploy pipeline created /etc/libvirt/hooks/qemu.d/setup-sut-firewall.sh with the following content:

#!/bin/sh
# libvirt hook script (https://libvirt.org/hooks.html)
# receives the DOM XML via stdin and metadata from libvirt as arguments

if [ "$2" = start ] && [ "$3" = begin ]; then
    XML=$(cat /dev/stdin)
    IFACE=$(echo "$XML" | xmlstarlet select -t -m 'domain/devices/interface[@type="direct"]' -v 'target/@dev')
    DOMID=$(echo "$XML" | xmlstarlet select -t -v 'domain/@id')
    nft "add table netdev filtermacvtap" #should not do anything if table is already present
    nft "add chain netdev filtermacvtap filterin_${DOMID} { type filter hook ingress device $IFACE priority filter; policy accept; }"
    nft "add rule netdev filtermacvtap filterin_${DOMID} ip saddr != { 10.145.10.0/24 } tcp dport { 22, 5800-5899, 5900-5999 } reject comment \"reject global SUT access to specific ports\""
    nft "add rule netdev filtermacvtap filterin_${DOMID} ip6 saddr != { 2a07:de40:b203:12::0/64 } tcp dport { 22, 5800-5899, 5900-5999 } reject"
    #nft add rule netdev filtermacvtap filterin_${DOMID} log # helpful for debugging, messages can be found in `journalctl -ft kernel`
fi
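One possible loose end: the hook only adds chains on domain start, so a stale per-domain chain could linger after a VM stops. A hypothetical cleanup counterpart for the stopped phase could look roughly like the sketch below. It is not part of the merged MRs; the NFT variable indirection exists only so the logic can be exercised without root and without a real ruleset.

```shell
#!/bin/sh
# Hypothetical cleanup counterpart (not from the merged MRs): remove the
# per-domain ingress chain when the VM stops so stale chains do not
# accumulate across domain ids.
NFT="${NFT:-nft}"  # overridable for testing; defaults to the real nft binary

delete_sut_chain() {
    # $1 = libvirt domain id whose ingress filter chain should be removed
    $NFT "delete chain netdev filtermacvtap filterin_$1"
}

# libvirt invokes qemu.d hooks as: <script> <domain> <operation> <sub-op> ...
if [ "$2" = stopped ] && [ "$3" = end ]; then
    DOMID=$(xmlstarlet select -t -v 'domain/@id' < /dev/stdin)
    delete_sut_chain "$DOMID"
fi
```

Whether the domain id is still present in the XML handed to the stopped phase would need to be verified against libvirt's hook documentation.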
Actions #17

Updated by nicksinger about 1 hour ago

  • Status changed from In Progress to Resolved

zl12 is back in production with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/811 and the role was added to zl13. The machines seem to complete jobs successfully. The few "failed" ones I found have been failing for some time already, so I assume this is not related to my change. I think that covers all ACs and we can consider this done.

Actions #18

Updated by okurz 15 minutes ago

  • Due date deleted (2024-05-23)
