tickets #161411
Dedicated networks for openSUSE GitHub Runners

Added by SchoolGuy 11 months ago. Updated 1 day ago.

Status: In Progress
Priority: Normal
Assignee: crameleon
Category: Network
Target version: -
Start date: 2024-06-03
Due date: -
% Done: 40%
Estimated time: -
Description

The SUSE Labs department will sponsor an unused old four-node chassis for use as GitHub Runners. The maintenance will be done by me (Enno Gotthold/SchoolGuy) during my work hours. One of the nodes will be used for the Cobbler org, but the other three can be freely integrated into the openSUSE GitHub org.

As GitHub Runners essentially execute untrusted code by design, they should be isolated as much as possible. I propose a VLAN for each GitHub org (one for Cobbler and one for openSUSE).

The idea is to use https://github.com/actions/actions-runner-controller on top of a k3s cluster to manage the runners. Furthermore, I would like to use MicroOS as the base OS.
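For reference, deploying ARC on such a cluster would roughly follow the upstream quickstart - a sketch along these lines (the namespaces, the installation name and the placeholder PAT are mine, nothing is decided yet):

# Controller, installed once per cluster
helm install arc \
  --namespace arc-systems --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller

# One runner scale set per GitHub org, e.g. for openSUSE
helm install arc-runner-set \
  --namespace arc-runners --create-namespace \
  --set githubConfigUrl="https://github.com/openSUSE" \
  --set githubConfigSecret.github_token="<PAT>" \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set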

The hosts are not yet configured with a static network configuration. The four nodes each have a dedicated BMC that only offers a Java Web Start based UI for machine access.


Related issues 1 (1 open, 0 closed)

Precedes openSUSE admin - tickets #161963: Prepare GitHub runner servers (In Progress, crameleon, 2024-06-04)

Actions #1

Updated by SchoolGuy 11 months ago · Edited

Of course, for each VLAN we will need a network. The machine currently has an outdated hostname in the SUSE-internal RackTables. I would propose naming the chassis "gh-runner-chassis-01" and the nodes "gh-runner--01" with ascending numbers.

Actions #2

Updated by crameleon 11 months ago

  • Category set to Network
  • Private changed from Yes to No
Actions #3

Updated by crameleon 11 months ago

Hi Enno,

I will try to configure the network soon. From reading your SUSE ticket, I should probably be able to find the physical connections in SUSE RackTables. Is there some networking already configured that I can use to connect to the BMCs, so I can then configure the correct addresses for our management network? If not, I could spawn a temporary DHCP server.

I understand why MicroOS would be a good candidate for this application. However, I had a terrible experience integrating it with our infrastructure in the past. A lot of the Salt states either do not support transactional operation at all, or require dirty hacks. Also, a lot of packages are not included in the base distribution, which required maintaining a separate project with various links: https://build.opensuse.org/project/show/openSUSE:infrastructure:Micro. It eventually led me to move the two servers I tried it with back to Leap and to give up on the effort to make it work.
Hence I suggest making your servers Leap-based as well, but confining the relevant services with systemd hardening and AppArmor.
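To give an idea of the kind of confinement I mean, here is a rough sketch of a systemd hardening drop-in (the directives are standard systemd.exec(5) options; which unit gets which of them would need to be decided per service):

# /etc/systemd/system/<service>.service.d/hardening.conf - sketch only
[Service]
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
ProtectKernelModules=yes
ProtectKernelTunables=yes
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
# plus an AppArmor profile for the service binary on top of this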

I have an AutoYaST profile we can use for deployment of the base OS (there's currently no network boot server in our infrastructure since we rarely ever have new hardware, hence I'd just load it with an image through the BMC, if possible).

The names are fine with me.

Actions #4

Updated by crameleon 11 months ago

On second thought, I wonder if the names shouldn't be something more generic.
I know we will only use these machines as GitHub runners for now, but I fear we might find a new purpose for them at some point in the future, making the names no longer make sense. ;-)

Actions #5

Updated by SchoolGuy 11 months ago

crameleon wrote in #note-3:

Hi Enno,

I will try to configure the network soon. From reading your SUSE ticket, I should probably be able to find the physical connections in SUSE RackTables. Is there some networking already configured that I can use to connect to the BMCs, so I can then configure the correct addresses for our management network? If not, I could spawn a temporary DHCP server.

I understand why MicroOS would be a good candidate for this application. However, I had a terrible experience integrating it with our infrastructure in the past. A lot of the Salt states either do not support transactional operation at all, or require dirty hacks. Also, a lot of packages are not included in the base distribution, which required maintaining a separate project with various links: https://build.opensuse.org/project/show/openSUSE:infrastructure:Micro. It eventually led me to move the two servers I tried it with back to Leap and to give up on the effort to make it work.
Hence I suggest making your servers Leap-based as well, but confining the relevant services with systemd hardening and AppArmor.

I have an AutoYaST profile we can use for deployment of the base OS (there's currently no network boot server in our infrastructure since we rarely ever have new hardware, hence I'd just load it with an image through the BMC, if possible).

The names are fine with me.

Feel free to go ahead with Leap - I just wanted to save myself a bit of maintenance. The BMCs should use DHCP, so spawning a temporary DHCP server should make them accessible. I will give you the username and password via the work messenger.

Actions #6

Updated by SchoolGuy 11 months ago

crameleon wrote in #note-4:

On second thought, I wonder if the names shouldn't be something more generic.
I know we will only use these machines as GitHub runners for now, but I fear we might find a new purpose for them at some point in the future, making the names no longer make sense. ;-)

I have no strong feelings about the names; it was just an idea of mine. I don't know if we have a naming scheme in the openSUSE infra, but if we do, feel free to apply it.

Actions #7

Updated by crameleon 11 months ago

Thanks, found the credentials. Will try them soon and let you know.

The naming scheme is sometimes service-related and sometimes just creative. For physical machines it is usually the latter (as I feel those are more involved to relabel down the line). What about apollo-chassis + apollo0{1,2,3,4}?

Actions #8

Updated by crameleon 11 months ago

Actions #9

Updated by crameleon 11 months ago

  • Status changed from New to In Progress
  • Assignee set to crameleon
  • % Done changed from 0 to 10

Ports on management switches configured.

Actions #10

Updated by crameleon 11 months ago · Edited

  • % Done changed from 10 to 20

Created network allocations:

2a07:de40:b27e:1207::/64 - Machine network for Cobbler runners
https://netbox.infra.opensuse.org/ipam/prefixes/35
with
VLAN 1207 openSUSE-GHR-Cobbler
https://netbox.infra.opensuse.org/ipam/vlans/33

2a07:de40:b27e:1208::/64 - Machine network for openSUSE runners
https://netbox.infra.opensuse.org/ipam/prefixes/36
with
VLAN 1208 openSUSE-GHR-openSUSE
https://netbox.infra.opensuse.org/ipam/vlans/34

2a07:de40:b27e:4003::/64 - K3S Cluster network for Cobbler runners
https://netbox.infra.opensuse.org/ipam/prefixes/37

2a07:de40:b27e:4004::/64 - K3S Service network for Cobbler runners
https://netbox.infra.opensuse.org/ipam/prefixes/38

2a07:de40:b27e:4005::/64 - K3S Cluster network for openSUSE runners
https://netbox.infra.opensuse.org/ipam/prefixes/39

2a07:de40:b27e:4006::/64 - K3S Service network for openSUSE runners
https://netbox.infra.opensuse.org/ipam/prefixes/40

For configuring K3S networking, https://docs.k3s.io/networking/basic-network-options#single-stack-ipv6-networking should be followed (we don't use router advertisements so the warning is not relevant).
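Based on that document and the allocations above, the k3s server configuration for, e.g., the Cobbler cluster could look roughly like this (the node address is only an example, and Kubernetes caps the size of the service range, so presumably only a slice of the allocated /64 would be handed to k3s):

# /etc/rancher/k3s/config.yaml - sketch for the Cobbler cluster
node-ip: "2a07:de40:b27e:1207::3"          # host address in the Cobbler machine network (example)
cluster-cidr: "2a07:de40:b27e:4003::/64"   # K3S cluster network for Cobbler runners
service-cidr: "2a07:de40:b27e:4004::/112"  # slice of the K3S service network for Cobbler runners
# note: with the default /64 pod prefix per node, this cluster-cidr only fits a single node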

Actions #11

Updated by crameleon 11 months ago

  • % Done changed from 20 to 30

Patch for routing configuration and firewall baseline submitted as https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/1917.

Actions #12

Updated by crameleon 11 months ago

  • % Done changed from 30 to 40

Configured VLANs and ports on switches. Prepared MC-LAG for the one working node.

Actions #13

Updated by SchoolGuy 11 days ago

I think that the networking for both the Cobbler and the openSUSE hosts can be identical. The tricky part is that ARC doesn't document what it needs itself, and the Actions workflows could, in theory, access any GitHub-related resource. All IP ranges can be found via the GitHub API and are IPv4 only, afaik.

Link: https://api.github.com/meta
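Pulling the ranges relevant for the firewall out of that endpoint should be straightforward, e.g. (the jq filters are just examples, the key names are the documented ones):

curl -s https://api.github.com/meta | jq -r '.actions[]'     # runner <-> GitHub Actions ranges
curl -s https://api.github.com/meta | jq -r '.packages[]'    # ghcr.io / packages ranges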

Furthermore, to set up k3s, I would like the hosts to be able to access https://get.k3s.io
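The install itself would just be the standard upstream flow, roughly as follows. Note that, as far as I understand, the installer script additionally resolves the release channel via update.k3s.io and downloads the k3s binary from the GitHub release assets, so those endpoints need to be reachable as well:

# run on each node, with the config.yaml from #note-10 already in place
curl -sfL https://get.k3s.io | sh -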

Actions #14

Updated by crameleon 11 days ago

Requested firewall rules implemented via https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/2421. The "packages" ranges and the other apollo nodes are still to be added soon.

Actions #15

Updated by SchoolGuy 10 days ago

Starting k3s is currently not possible because I forgot to mention that we also need docker.io access to pull the mirrored Rancher images.

Actions #16

Updated by crameleon 10 days ago

Right, I also wondered about that but forgot. Will add shortly (and registry.o.o for future action images might be useful too whilst I'm at it).
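Independent of the firewall, k3s could also be pointed at a mirror explicitly via its documented registries.yaml mechanism - a sketch, with registry.opensuse.org as a purely hypothetical mirror (only worth adding if the images actually get mirrored there):

# /etc/rancher/k3s/registries.yaml - sketch
mirrors:
  docker.io:
    endpoint:
      - "https://registry.opensuse.org"   # hypothetical mirror; as I understand it, docker.io itself is still tried as a fallback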

Actions #18

Updated by SchoolGuy 7 days ago

Either the MR didn't work as intended or the change didn't get deployed.

apollo01 (Cobbler GitHub Runner, K3S):~ # ping docker.io
PING docker.io(2600:1f18:2148:bc01:89b:94df:3759:2fb0 (2600:1f18:2148:bc01:89b:94df:3759:2fb0)) 56 data bytes
From 2a07:de40:b27e:1207::3 (2a07:de40:b27e:1207::3) icmp_seq=1 Destination unreachable: Administratively prohibited
Actions #19

Updated by crameleon 7 days ago

My bad, now it works. Note though that with ping, while you will no longer get "administratively prohibited" from our firewall, it might just get "stuck" because docker.io refuses echo packets - they do allow HTTPS though ;).
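For future checks, something along these lines is a more meaningful connectivity test than ping (the 401 from the registry's /v2/ endpoint only means "reachable, but authentication required", which is the expected answer):

curl -sI https://registry-1.docker.io/v2/ | head -n1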

Actions #20

Updated by SchoolGuy 6 days ago

Apparently docker.io just redirects to other, "hidden" registries. This page describes all URLs that need to be allowed for Docker Desktop: https://docs.docker.com/desktop/setup/allow-list/

I do believe we have to whitelist the URLs listed there.

According to this article, we will also need the hosts it lists: https://support.sonatype.com/hc/en-us/articles/115015442847-Whitelisting-Docker-Hub-Hosts-for-Firewalls-and-HTTP-Proxy-Servers

However, according to Docker Forums, we will run into issues with static whitelisting... - https://forums.docker.com/t/docker-registry-public-ip-addresses/10013/2

I don't believe hosting a proxy registry is a good idea since the effort to maintain this is quite large, IMHO. As our dear friends at Rancher don't build their stuff inside OBS (and it is not viable from an effort perspective to help them), the easiest option is to somehow broaden the scope of the firewall.

Reading up on https://docs.k3s.io/installation/airgap, it says we could download the images from the releases page and push them to a registry. I must say that doing this feels like a lot of effort in the long term...
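For completeness, the simpler variant from that page skips the registry and just drops the image tarball into k3s' images directory on each node - roughly like this (the release version is a placeholder, and the asset name may differ per architecture):

# sketch of the documented k3s airgap image import
mkdir -p /var/lib/rancher/k3s/agent/images/
curl -L -o /var/lib/rancher/k3s/agent/images/k3s-airgap-images-amd64.tar.zst \
  https://github.com/k3s-io/k3s/releases/download/<version>/k3s-airgap-images-amd64.tar.zst
# k3s imports everything found in this directory into containerd on startup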

Actions #21

Updated by crameleon 6 days ago

Thanks for the investigation. It's rather unfortunate that this cannot be deployed from in-house packages and containers.

Dynamic ACLs are a TODO of mine (also for other services which do not publish static IP addresses but only domain names), but I feel they would not add much reliability here, as the list of URLs you collected from different sources feels rather convoluted.

Given the circumstances I can agree to allow wider access for the host system, but I do not feel comfortable also having this reflected in the runner containers (else someone could use the CI for arbitrary download jobs). Can we filter host and container traffic separately? With https://progress.opensuse.org/issues/161411#note-10 the containers do have their own networks, but the linked article does not quite tell me whether K3S will do NAT or "proper" routing (which we would need in order to see the container network source addresses on the firewall) - https://docs.k3s.io/networking/basic-network-options suggests that with IPv6 it will not do masquerading/NAT by default (which sounds good).
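If K3S indeed routes the pod prefixes without NAT, telling the two apart on the firewall could be as simple as matching the source prefixes allocated in #note-10 - an illustrative nftables-style sketch, not our actual Salt-managed ruleset:

# illustrative only - forwarded traffic matched by source prefix
ip6 saddr 2a07:de40:b27e:1207::/64 tcp dport 443 accept   # Cobbler hosts: curated destinations
ip6 saddr 2a07:de40:b27e:4003::/64 counter drop           # Cobbler runner pods: keep restricted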

Actions #22

Updated by SchoolGuy 6 days ago · Edited

Yes, but both the GitHub Actions and the host will need Docker Hub access, as reusable GitHub Actions may use it. So while we can filter separately, it means we would just move the issue to a later point in time (i.e. once ARC is running).

Edit: The only way to achieve proper separation is to mirror the needed Docker Hub images and block Docker Hub. That would require a dedicated host and dedicated everything else. I have never attempted this, and it would mean that fully qualified registry paths (as used with OBS) would fail for Docker Hub-based images.

Actions #23

Updated by crameleon 4 days ago

It is weird for the containers to require access to registries; I would expect the host to pull the images.

Sorry, but I don't have a solution for this at the moment. I don't deem our current network setup ready to face arbitrary and unrestricted internet workloads.

I would consider some alternative container implementation - for GitLab CI, for example, we use Podman.

Actions #24

Updated by SchoolGuy 4 days ago

GitHub ARC is only suitable for k8s. We can instead try using https://fireactions.io/. That would mean that those ephemeral VMs would pull the images. Would that better suit our network infrastructure? It is a young project but as far as I can tell it is used in production by Hostinger.

Actions #25

Updated by crameleon 4 days ago

That sounds very interesting, though I am a bit biased because I like Firecracker VMs. ;)

I read through the documentation, but after several attempts I still can't quite figure out how the VMs play together with the containers - the runner service is part of the container image, but the host only runs a VM - so does the container then run inside the VM? Do all jobs use a pre-defined image? It seems network namespaces are used for the VMs, but it's not clear how that is reflected in the containers / whether those have separate networking, or whether we'll end up in the same situation just at a different layer.

Actions #26

Updated by SchoolGuy 2 days ago

My understanding is that the workloads are executed in the VMs. The VM image is, in this case, an OCI-compliant Docker image (yes, this is a thing: https://github.com/codecrafters-io/oci-image-executor).

So from a network perspective:

  • Host has a systemd service that executes Firecracker VMs
  • VM executes any workload that the GitHub Action decides to execute.

This means the host pulls the images from ghcr.io, and the ephemeral VMs download and execute whatever PyPI/Docker Hub/pkgs.go.dev/Rust crates/Node.js packages are needed to run the GitHub Actions workflow. So the VMs will need full internet access, and the host will need access to ghcr.io and whatever else the Ansible playbook asks for.

Actions #27

Updated by SchoolGuy 2 days ago · Edited

P.S.: Looking at the Ansible code, they are installing everything from source. While we can change this in the long run, I would still like to deploy the Ansible playbook as-is to make some progress. I would commit to opening an issue upstream and attempting to gain their support for installing as much as possible from openSUSE-packaged RPMs.

Source: https://github.com/hostinger/ansible-collection-fireactions
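Deploying it would then presumably boil down to something like the following - just a sketch; the exact role/entry point name inside the collection is an assumption on my part and needs to be checked against the repository:

ansible-galaxy collection install git+https://github.com/hostinger/ansible-collection-fireactions.git

# fireactions.yml - hypothetical playbook applying the collection's role to one node
- hosts: apollo01
  become: true
  roles:
    - hostinger.fireactions.fireactions   # assumed fully qualified role name, to be verified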

Actions #28

Updated by crameleon 1 day ago

Thanks for explaining.

So the VMs will get full internet access

Doesn't this put us in the same situation, where the CI workloads get unrestricted access to the internet as a result?

I would commit to opening an issue and attempting to gain support from them to install as much as possible from openSUSE packaged RPMs.

I'm not sure how much interest upstream would have to work on this, but I would be happy to do the packaging.

Actions #29

Updated by SchoolGuy 1 day ago

Yes, of course, but if you have ephemeral VMs running varying containers whose names are not known in advance, then how (and why) would you limit internet access? The VM is thrown away after the run, and the contributors are vetted by me.

Actions #30

Updated by crameleon 1 day ago

It's not so much about what's running in the container or VM, I trust we can implement good means of isolation, especially with the VM approach.

I care more about our current network setup not being ready to deal with arbitrary internet load from strangers.

Someone can cause either technical problems by saturating all of our uplink through big downloads, or cause us legal problems through problematic downloads.
Allowing outbound traffic to only selected resources on the internet would mitigate these concerns (even if not ruling them out completely).

I do want to implement basic bandwidth throttling and traffic observability in our infrastructure at some point, but we are not there yet.

contributors are vetted by me

It sounds like you would be OK with limiting who is allowed to start pipelines on these runners, instead of allowing arbitrary GitHub users to do so. I think this would definitely help until I have better protections in place (once that is the case, I would be happy to relax the restrictions so as to make it less annoying for external contributors).

Would you mind briefly elaborating on what this would look like in practice? I guess filtering by organization membership can be considered trustworthy for the ~25 people in Cobbler (https://github.com/orgs/cobbler/people), but I'm not confident about extending the same trust - of not accidentally running pipelines on malicious PRs - to the ~450 people in openSUSE (https://github.com/orgs/openSUSE/people). Or would it happen on a team or repository level?

Actions #31

Updated by SchoolGuy 1 day ago

The runners in the openSUSE org would be for anyone in the org to use, so yes, that is a much larger group of people. However, since there seems to be no need for org-wide access, we could create per-repository pools, which would limit it. With Cobbler, you have already guessed correctly that it is fewer than 30 people.
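In GitHub terms, such a pool would presumably be a runner group with repository-scoped visibility, roughly like this (the group name and repository ID are placeholders, and this assumes the org's plan supports runner groups):

gh api -X POST /orgs/openSUSE/actions/runner-groups \
  -f name='apollo-runners' \
  -f visibility='selected' \
  -F 'selected_repository_ids[]=<repo-id>'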

GitHub, by default, requires a maintainer to approve workflow runs for first-time contributors. Meaning, on a PR from someone who has not previously had a PR merged, the workflow is not executed automatically. This mechanism is already in place and is, in my eyes, enough to protect against abuse.
