action #132143

closed

Migration of o3 VM to PRG2 - 2023-07-19 size:M

Added by okurz over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Start date:
2023-06-29
Due date:
% Done:

0%

Estimated time:

Description

Motivation

The openQA webUI VM for o3 will move to PRG2. This will be conducted by Eng-Infra. We must support them.

Acceptance criteria

  • AC1: o3 is reachable from the new location for SUSE employees
  • AC2: Same as AC1 but for community members outside SUSE
  • AC3: o3 multi-machine jobs run successfully on o3 after the migration
  • AC4: We can still log in to the machine over ssh from outside the SUSE network
  • AC5: https://zabbix.nue.suse.com/ can still monitor o3
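AC1 and AC2 can be spot-checked with the same probe, run once from inside and once from outside the SUSE network. The following is a minimal sketch, not the check that was actually used; the function names and the 10 second timeout are assumptions, only the URL comes from this ticket.

```sh
#!/bin/sh
# Hypothetical helper names; only the URL is taken from the ticket.
O3_URL="${O3_URL:-https://openqa.opensuse.org}"

# Map an HTTP status code to a verdict: any 2xx or 3xx counts as reachable.
http_verdict() {
    case "$1" in
        2??|3??) echo reachable ;;
        *) echo unreachable ;;
    esac
}

# Fetch only the status code; curl reports 000 when no connection was made.
check_webui() {
    code=$(curl --silent --output /dev/null --max-time 10 \
        --write-out '%{http_code}' "$O3_URL") || true
    http_verdict "$code"
}
```

Calling `check_webui` prints a single word, which makes it easy to run from several locations and compare the results.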

Suggestions

  • DONE Track https://jira.suse.com/browse/ENGINFRA-2347 "DMZ-OpenQA implementation" (done) so that the o3 network is available
  • DONE Track https://jira.suse.com/browse/ENGINFRA-2155 "Install Additional links to DMZ-CORE from J12 - openQA-DMZ" (done), something about cabling
  • DONE Track https://jira.suse.com/browse/ENGINFRA-1742 "Build OpenQA Environment" for story of the o3 VM being migrated
  • DONE Inform affected users about planned migration on date 2023-07-19
  • DONE During migration work closely with Eng-Infra members conducting the actual VM migration

    1. DONE Join Jitsi and one thread in team-qa-tools and one thread in dct-migration
    2. DONE Wait for go-no-go meeting at 0700Z
    3. DONE Wait for mcaj to give the go from Eng-Infra side, then switch off the openQA scheduler on o3 and disable the authentication. I guess we can try to "break" the code by disabling any authenticated actions.
    4. DONE Also switch off other services like gru, scripts, investigation, etc.
    5. DONE Prepare old workers to connect over https as soon as o3 comes up again in prg2
    6. DONE Install more new machines in prg2 while waiting for the VM to come online -> installed worker21,22,24 though not yet activated for production. Rest to be continued in #132134
    7. DONE As soon as VM is ready in new place ensure that the webUI is good in read-only mode first
    8. DONE Update IP addresses on ariel where necessary in /etc/hosts, also crosscheck /etc/dnsmasq.d/openqa.conf
    9. DONE Ask Eng-Infra, mcaj, to switch off the DHCP/DNS/PXE server in the oqa dmz network
    10. DONE Try to reboot a worker from the PXE on o3
    11. 13. DONE Connect all old workers from NUE1 over https, in particular everything non-qemu-x86_64 for the time being, e.g. aarch64, ppc64le, s390x, bare-metal until we have such things directly from prg2
    12. DONE Test and monitor a lot of o3 tests
    13. DONE As soon as everything looks really stable announce it to users as a response to all the above announcements
  • DONE Ensure that o3 is reachable again after migration from the new location

    • DONE for SUSE employees
    • DONE for community members outside SUSE
    • DONE for o3 workers from at least one location (NUE1 or PRG2)
  • DONE Ensure that we can still log in to the machine over ssh from outside the SUSE network -> updated details on https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Accessing-the-o3-infrastructure

  • DONE Ensure that https://zabbix.nue.suse.com/ can still monitor o3

  • DONE Inform users as soon as migration is complete

  • DONE Rename /dev/vg0-new to /dev/vg0

  • Ensure IPv6 is fully working -> #133358

  • DONE Make wireguard+socat+ssh+routes from #132143-25 persistent Make ssh-tap-tunnel+routes+iptables persistent on new-ariel

  • DONE On o3 systemctl unmask --now openqa-auto-update openqa-continuous-update rebootmgr

  • DONE On o3 enable again o3 specific nginx tmp+log paths in /etc/nginx/vhosts.d/openqa.conf

  • DONE Update https://progress.opensuse.org/projects/openqav3/wiki/ where necessary

  • DONE Make sure we know what to watch out for in the later planned OSD VM migration

  • DONE As necessary also make sure that BuildOPS knows about caveats of migration as they plan to migrate OBS/IBS after us

  • DONE Make ssh-tap-tunnel+routes+iptables persistent on old-ariel

  • DONE Ensure backup to backup.qa.suse.de works

  • DONE Remove root ssh login on new-ariel

  • the openQA machine setting for "s390x-zVM-vswitch-l2" has REPO_HOST=192.168.112.100 and other references to 192.168.112. This needs to be changed as soon as zVM instances are able to reach new-ariel internally, e.g. over FTP -> #132152

  • Fix o3 bare metal hosts iPXE booting, see https://openqa.opensuse.org/tests/3446336#step/ipxe_install/2 -> #132647

  • 11. Enable workers to connect to o3 directly, not external https, and use testpoolserver with rsync instead -> #132134

  • 12. Enable production worker classes on new workers after tests look good -> #132134


Related issues 8 (0 open, 8 closed)

  • Related to openQA Project (public) - action #133232: o3 hook scripts are triggered but no comment shows up on job (Resolved, tinita, 2023-07-24)
  • Related to openQA Infrastructure (public) - action #150956: o3 cannot send e-mails via smtp relay size:M (Resolved, okurz, 2023-11-16)
  • Related to openQA Infrastructure (public) - action #159669: No new openQA data on metrics.opensuse.org since o3 migration to PRG2 size:S (Resolved, okurz, 2024-04-26)
  • Copied from openQA Infrastructure (public) - action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M (Resolved, dheidler, 2023-06-29)
  • Copied to openQA Infrastructure (public) - action #132647: Migration of o3 VM to PRG2 - bare-metal tests size:M (Resolved, okurz)
  • Copied to openQA Infrastructure (public) - action #133358: Migration of o3 VM to PRG2 - Ensure IPv6 is fully working (Resolved, okurz)
  • Copied to openQA Infrastructure (public) - action #133364: Migration of o3 VM to PRG2 - Decommission old-ariel in NUE1 as soon as we do not need it anymore (Resolved, okurz)
  • Copied to openQA Infrastructure (public) - action #133475: Migration of o3 VM to PRG2 - connection to rabbit.opensuse.org (Resolved, mkittler)
Actions #1

Updated by okurz over 1 year ago

  • Copied from action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M added
Actions #3

Updated by okurz over 1 year ago

  • Subject changed from Support migration of o3 VM to Support migration of o3 VM to PRG2
Actions #4

Updated by okurz over 1 year ago

  • Subject changed from Support migration of o3 VM to PRG2 to Support migration of o3 VM to PRG2 - 2023-07-19
Actions #5

Updated by okurz over 1 year ago

  • Subject changed from Support migration of o3 VM to PRG2 - 2023-07-19 to Support migration of o3 VM to PRG2 - 2023-07-19 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by okurz over 1 year ago

  • Priority changed from High to Urgent
Actions #7

Updated by nicksinger over 1 year ago

  • Status changed from Workable to Feedback
  • Assignee set to nicksinger
  • Priority changed from Urgent to Normal
Actions #8

Updated by okurz over 1 year ago

  • Priority changed from Normal to High

Let's keep it at least as "High" priority as all related work will need to happen this month anyway.

Actions #9

Updated by okurz over 1 year ago

  • Copied to action #132647: Migration of o3 VM to PRG2 - bare-metal tests size:M added
Actions #10

Updated by okurz over 1 year ago

  • Project changed from 46 to 115
  • Category deleted (Infrastructure)
Actions #11

Updated by okurz over 1 year ago

  • Project changed from 115 to openQA Infrastructure (public)
Actions #13

Updated by okurz over 1 year ago

https://suse.slack.com/archives/C04MDKHQE20/p1689678826354739

(Oliver Kurz) @John Ford @Moroni Flores @Martin Caj (CC @Matthias Griessmeier @Nick Singer) Given that we did not manage to set up any o3 worker so far, Nick Singer and I decided we should not go forward with the o3 VM migration. The next date suggested by us is Tuesday, 2023-07-25, as we expect to have at least one machine available by then. Please confirm.
(John Ford) Understood. My only concern with 25/07/23 is the proximity to the Build Service cutover on 28/07/23. Not from a technical dependency point of view, more from an IT resource/support availability perspective. I'll re-purpose the 'Go/No-Go' meeting for 15:30. Let's use that time to understand issues to date and next steps.
(Moroni Flores) I think it should be ok, however it will be important to understand what did not work and what we are doing to address it
(Oliver Kurz) What did not work is that nobody has yet managed to boot any reasonable installation system. Martin Caj, Nick Singer and I looked into multiple approaches and alternatives. We tried the "virtual media share" from BMCs but with no root access to the single machine within that network, oqa-jumpy, only Martin Caj can do that and so far we only got inconclusive error messages about that. What we are doing to address it: 1. This morning Martin Caj was (or is) working on setting up a DHCP/DNS/PXE server within that network, and this is also one of the ways forward which I am confident we can use within the next day. Alternatives: 2. Provide a VM with root access to SUSE QE Tools members so that we can try again with virtual media share using FTP/SMB; 3. Physically connect a USB thumb drive to a machine, install something there and use that host in the same way as suggested in "2." for a VM

This evening I will send an announcement which is either a reminder about the migration tomorrow or a notice about the shifted migration.

Actions #14

Updated by okurz over 1 year ago

  • Subject changed from Support migration of o3 VM to PRG2 - 2023-07-19 size:M to Migration of o3 VM to PRG2 - 2023-07-19 size:M
  • Status changed from Feedback to In Progress

I sent another reminder

All our pre-moving checks have succeeded so we will execute the move of https://openqa.opensuse.org tomorrow as planned.

as a response to all the above announcements.

Tomorrow we can go ahead. I have my day reserved for this. I would appreciate if more of you can support this work and at least stay on standby in case we need your help. I suggest we stay in a breakout-room in https://meet.jit.si/suse_qa_tools where possible. At best we have 1 person coordinating, 1 person acting on a main task, 1 person communicating to the outside and optional for side-tasks.

What we should do, roughly:

  1. Join Jitsi and one thread in team-qa-tools and one thread in dct-migration
  2. Wait for go-no-go meeting at 0700Z
  3. Wait for mcaj to give the go from Eng-Infra side, then switch off the openQA scheduler on o3 and disable the authentication. I guess we can try to "break" the code by disabling any authenticated actions.
  4. Also switch off other services like gru, scripts, investigation, etc.
  5. Prepare old workers to connect over https as soon as o3 comes up again in prg2
  6. Install more new machines in prg2 while waiting for the VM to come online
  7. As soon as VM is ready in new place ensure that the webUI is good in read-only mode first
  8. Update IP addresses on ariel where necessary in /etc/hosts, also crosscheck /etc/dnsmasq.d/openqa.conf
  9. Ask Eng-Infra, mcaj, to switch off the DHCP/DNS/PXE server in the oqa dmz network
  10. Try to reboot a worker from the PXE on o3
  11. Enable workers to connect to o3 directly, not external https, and use testpoolserver with rsync instead
  12. Enable production worker classes on new workers after tests look good
  13. Connect old workers from NUE1 over https, in particular everything non-qemu-x86_64 for the time being, e.g. aarch64, ppc64le, s390x, bare-metal until we have such things directly from prg2
  14. Test and monitor a lot of o3 tests
  15. As soon as everything looks really stable announce it to users as a response to all the above announcements
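
Steps 3 and 4 of the plan above could be scripted roughly as follows. This is only a sketch under assumptions: the unit list is a guess and likely incomplete (the scheduler and gru are named in the plan, the other three are the units that get unmasked again after the migration), and the function name and the DRY_RUN switch are hypothetical.

```sh
#!/bin/sh
# Assumed, probably incomplete list of units to pause during the move.
UNITS="openqa-scheduler openqa-gru openqa-auto-update openqa-continuous-update rebootmgr"

# Print (DRY_RUN=1, the default) or run the systemctl call for every unit.
# $1 is "mask" before the move and "unmask" afterwards.
toggle_units() {
    action="$1"
    for unit in $UNITS; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "systemctl $action --now $unit"
        else
            systemctl "$action" --now "$unit"
        fi
    done
}
```

Masking with `--now` both stops a unit and prevents it from being started again by a dependency or timer, which is why it fits a maintenance window better than a plain stop.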
Actions #15

Updated by openqa_review over 1 year ago

  • Due date set to 2023-08-02

Setting due date based on mean cycle time of SUSE QE Tools

Actions #16

Updated by favogt over 1 year ago

Prepare old workers to connect over https as soon as o3 comes up again in prg2

I assume you mean "public https" here?

Ask Eng-Infra, mcaj, to switch off the DHCP/DNS/PXE server in the oqa dmz network

Isn't that ariel itself?

Actions #17

Updated by nicksinger over 1 year ago

favogt wrote:

Prepare old workers to connect over https as soon as o3 comes up again in prg2
I assume you mean "public https" here?

yes

Ask Eng-Infra, mcaj, to switch off the DHCP/DNS/PXE server in the oqa dmz network
Isn't that ariel itself?

To break the chicken-and-egg problem, Martin set up a temporary host which enables us to boot/install our new workers. Once ariel is in the new location it will take over this task.

Actions #18

Updated by okurz over 1 year ago

  • Description updated (diff)

we ensured that our backup on backup.qa.suse.de is sufficiently up-to-date. nicksinger put a notice on the index page of https://openqa.opensuse.org informing about the pending maintenance work. We conducted the go-no-go meeting with others and decided to go ahead. I updated the description of this ticket with the points to do. mcaj will create a test-VM with a public facing webpage that we can check before doing stuff with ariel.

Actions #19

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #20

Updated by nicksinger over 1 year ago

We made another database dump just to be sure about consistency after the sync to Prague. It is stored on ariel's filesystem at /var/lib/openqa/SQL-DUMPS/2023-07-19.dump
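
How the dump was created is not recorded here. A sketch of one way to create and sanity-check such a dump follows; the database name "openqa", the custom dump format and all function names are assumptions, only the directory comes from this ticket.

```sh
#!/bin/sh
DUMP_DIR="/var/lib/openqa/SQL-DUMPS"   # directory named in the ticket

# Build the dump file name for a given ISO date (defaults to today).
dump_path() {
    echo "$DUMP_DIR/${1:-$(date +%F)}.dump"
}

# Assumption: the database is called "openqa"; run this as a user that may
# read the database and write to DUMP_DIR.
create_dump() {
    pg_dump --format=custom --file="$(dump_path "$1")" openqa
}

# A dump whose table of contents can be listed is at least structurally readable.
verify_dump() {
    pg_restore --list "$(dump_path "$1")" >/dev/null
}
```

For example, `create_dump 2023-07-19 && verify_dump 2023-07-19` would produce and check the file named above.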

Actions #21

Updated by okurz over 1 year ago

We are just out of a call with Eng-Infra and SUSE-IT project management. Status update regarding o3: The sync of storage devices in the background takes longer than expected so we will need to keep the system in the current read-only state. Switch-over to the next instance is expected to happen tomorrow during the day.

I also synced over the latest state of database dumps to my machine with rsync -aHP backup.qa.suse.de:/home/rsnapshot/alpha.0/openqa.opensuse.org/var/lib/openqa/SQL-DUMPS/ /home/okurz/local/tmp/o3-sql-dumps/.

mcaj will bring up the VM with just the root partition so that we can check the system coming up in general and update the network config and reachability from different locations. vdb is already synced over containing the database, vdc+vdd are in the process of being synced. As soon as all storage devices are synced over we will start the VM again with all those connected in a consistent state.

Actions #22

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #23

Updated by okurz over 1 year ago

temporary login with ssh root@10.150.2.10 for at least okurz/nsinger/mcaj/mgriessmeier. I disabled temporary log and tmp nginx paths in /etc/nginx/vhosts.d/openqa.conf, added to rollback steps

http://195.135.223.231/ is now accessible. However https://195.135.223.231/ does not work. Maybe that is something that was overlooked as originally we planned to use a proxy for that and now we wanted to serve 443/https directly, right? Written in https://suse.slack.com/archives/C04MDKHQE20/p1689850829492159?thread_ts=1689843691.492509&cid=C04MDKHQE20

Actions #24

Updated by okurz over 1 year ago

Syncing of data storage is still in progress and slower than expected. We are looking into alternatives.

My suggestion in https://suse.slack.com/archives/C04MDKHQE20/p1689922035366939?thread_ts=1689891521.574939&cid=C04MDKHQE20

Let's try an alternative: After the 5TB is copied – which is the more important one because it contains millions of test results – let's create a new empty >=6.6TB disk for new-ariel. We start with an empty one, you still copy over the snapshot of the old 6.6TB and mount it into new-ariel as soon as it's there, and we copy over stuff on-the-go, or we could even rsync parts of it, as part is a "cache" that can be recreated in the new space.

We will be working on a solution today. Expect openqa.opensuse.org to stay in read-only for a couple more hours.

Actions #25

Updated by okurz over 1 year ago

  • Description updated (diff)

The DNS entry is in the process of being changed to the new location

(Martin Caj) DNS change is in the process:
pdnsutil edit-zone opensuse.org
Checked 404 records of 'opensuse.org', 0 errors, 0 warnings.
Detected the following changes:
-opensuse.org 1800 IN SOA ns1.opensuse.org admin.opensuse.org 2023071301 7200 7200 604800 3600
+opensuse.org 1800 IN SOA ns1.opensuse.org admin.opensuse.org 2023072100 7200 7200 604800 3600
+openqa.opensuse.org 1800 IN A 195.135.223.231
-openqa.opensuse.org 1800 IN CNAME proxy-nue.opensuse.org

https://openqa.opensuse.org/ says "This is https://openqa.opensuse.org in the old location".
The new VM already says "This is new ariel!"
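
Because the old CNAME may stay in resolver caches for up to the record TTL of 1800 seconds, it helps to ask the authoritative server directly. A minimal sketch, with hypothetical function names; the IP address and ns1.opensuse.org are taken from the zone diff above.

```sh
#!/bin/sh
EXPECTED_IP="195.135.223.231"   # new A record from the zone diff

# Succeed only if the given resolver output contains the expected address
# on a line of its own.
points_to_new_location() {
    printf '%s\n' "$1" | grep -qxF "$EXPECTED_IP"
}

# Ask a specific resolver; defaults to the primary name server from the SOA.
query_a_record() {
    dig +short A openqa.opensuse.org "@${1:-ns1.opensuse.org}"
}
```

Usage: `points_to_new_location "$(query_a_record)" && echo "DNS updated"`. Passing another resolver as the first argument to `query_a_record` shows what clients behind that resolver currently see.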

And because o3 is now serving https directly we set httpsonly = 1 in the [openid] section of /etc/openqa/openqa.ini. Authentication works. The system is in read-write mode now.

Right now lsblk looks like this:

# lsblk 
NAME               MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
loop0                7:0    0    5G  0 loop /home
zram0              252:0    0  1.4G  0 disk [SWAP]
…
zram9              252:9    0  1.4G  0 disk [SWAP]
vda                254:0    0   20G  0 disk 
└─vda1             254:1    0   20G  0 part /
vdb                254:16   0    5T  0 disk 
└─vdb1             254:17   0    5T  0 part /srv/tftpboot/ipxe/images
                                            /var/lib/snapshot-changes
                                            /var/lib/openqa
                                            /space
vdc                254:32   0  100G  0 disk /var/lib/pgsql
vdd                254:48   0  6.6T  0 disk 
├─vg0--new-archive 253:0    0    1T  0 lvm  /space/openqa/archive
│                                           /var/lib/openqa/archive
│                                           /archive
└─vg0--new-assets  253:1    0  5.6T  0 lvm  /space/openqa/share
                                            /var/lib/openqa/share
                                            /assets
vde                254:64   0  6.6T  1 disk 
├─vg0-assets       253:2    0  5.6T  0 lvm  /mnt/old-vg0-assets
└─vg0-archive      253:3    0 1024G  0 lvm  /mnt/old-vg0-archive

vda (/), vdb (/space, /var/lib/openqa, …), vdc (/var/lib/pgsql) are all in sync and fully usable. vdd is a new storage device where we recreate the structure from vde. vde is a read-only mount of a read-only snapshot from "assets+archive".

What we did to copy over the content:

mkdir -p /mnt/old-vg0-{assets,archive}
vgchange -a y /dev/vg0
mount -o ro,norecovery /dev/vg0/assets /mnt/old-vg0-assets
mount -o ro,norecovery /dev/vg0/archive /mnt/old-vg0-archive
wipefs --all /dev/vdd
vgcreate vg0-new /dev/vdd
lvcreate -L 1024g -n archive /dev/vg0-new
lvcreate -l100%FREE -n assets /dev/vg0-new
for i in archive assets ; do mkfs.xfs /dev/vg0-new/$i; done
rsync -aHP /mnt/old-vg0-archive/ /var/lib/openqa/archive/ && rsync -aHP /mnt/old-vg0-assets/ /var/lib/openqa/share/
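
A cheap way to judge whether such a copy is complete is to compare file counts between the read-only source and the new target. This is a rough sketch with hypothetical function names; it does not compare file contents and can take a long time on trees with millions of files.

```sh
#!/bin/sh
# Count regular files below a directory without crossing filesystem boundaries.
count_files() {
    find "$1" -xdev -type f | wc -l | tr -d '[:space:]'
}

# Compare the number of files in source ($1) and destination ($2).
compare_trees() {
    src=$(count_files "$1")
    dst=$(count_files "$2")
    if [ "$src" = "$dst" ]; then
        echo "ok: $src files"
    else
        echo "incomplete: $dst of $src files"
    fi
}
```

For example, `compare_trees /mnt/old-vg0-archive /var/lib/openqa/archive`. Rerunning the rsync command with `-n` (dry run) added is a more thorough alternative because it also considers sizes and timestamps.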

In /etc/hosts mcaj and I adjusted all IPv4 addresses for o3 workers to have corresponding addresses in the new IP range. I talked with mcaj about the 192.168.7. range which we need for OBS access. We likely need to find a temporary solution to reach hosts while OBS is not yet migrated to PRG2:

cd /var/lib/openqa/ssh-tunnel-nue1
chown geekotest:nogroup /var/lib/openqa/ssh-tunnel-nue1/
sudo -u geekotest ssh-keygen -t ed25519 -f /var/lib/openqa/ssh-tunnel-nue1/id_ed25519 -N ''
sudo -u geekotest ssh -i /var/lib/openqa/ssh-tunnel-nue1/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh-tunnel-nue1/known_hosts" -p 2213 geekotest@gate.opensuse.org "ip a"
# ssh reads ~/.ssh/config, so the host alias goes into the config of geekotest
cat >> ~geekotest/.ssh/config <<EOF
Host old-ariel
        HostName gate.opensuse.org
        User geekotest
        Port 2213
        IdentityFile /var/lib/openqa/ssh-tunnel-nue1/id_ed25519
        UserKnownHostsFile /var/lib/openqa/ssh-tunnel-nue1/known_hosts
        LocalForward 62450 localhost:62450
EOF
# keep the port forwarding open without running a remote command
sudo -u geekotest ssh -NT old-ariel

And setting up a tunnel. On old-ariel following https://www.wireguard.com/quickstart/ to create a key pair and config:

zypper -n in wireguard-tools
cat > /etc/wireguard/ariel-link.conf <<EOF
[Interface]
PrivateKey = XXX
ListenPort = 62451

[Peer]
PublicKey = XXX
AllowedIPs = 192.168.7.0/24, 192.168.254.0/24, 172.16.0.0/24
EOF
wg setconf ariel-link ariel-link.conf
ip l s dev ariel-link up
…

and accordingly on new-ariel.

and socat on new ariel for UDP<->TCP

socat udp4-listen:62451,reuseaddr,fork tcp:localhost:62450

on old-ariel:

sysctl net.ipv4.conf.ariel-link.forwarding=1
iptables -t nat -A POSTROUTING -o ariel-link -j MASQUERADE
iptables -t nat -A POSTROUTING -o eth2 -j MASQUERADE
iptables -t nat -A POSTROUTING -o eth2 -j SNAT --to-source 192.168.7.15
iptables -I FORWARD 1 -i ariel-link -o eth2 -j ACCEPT
sysctl net.ipv4.ip_forward=1
sysctl net.ipv4.conf.all.forwarding=1

on new-ariel:

ip r a 192.168.7.0/24 via 172.16.0.1 dev ariel-link
ip r a 192.168.254.0/24 via 172.16.0.1 dev ariel-link
for i in $(sed -n 's/^\(192.168[^ ]*\).*$/\1/p' /etc/hosts); do ping -c1 $i; done
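
The ping loop above prints the full ping output for every host. A variant that only reports the hosts that do not answer could look like the following sketch; the function names are hypothetical and the `-W2` timeout assumes the Linux iputils version of ping.

```sh
#!/bin/sh
# Print every address in 192.168.0.0/16 found in the first column of a
# hosts file (defaults to /etc/hosts); comment lines are skipped implicitly.
internal_addresses() {
    awk '$1 ~ /^192\.168\./ { print $1 }' "${1:-/etc/hosts}"
}

# Ping each address once and list only those that did not answer.
report_unreachable() {
    for addr in $(internal_addresses "$1"); do
        ping -c1 -W2 "$addr" >/dev/null 2>&1 || echo "unreachable: $addr"
    done
}
```

Running `report_unreachable` with no output means every internal host from /etc/hosts answered through the tunnel.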

The tunnel is working. While this is being used to sync over new assets from obspublish we are connecting the physical machine workers over the internal network. To specify in more detail: Reaching to 192.168.7.0 works but not yet 192.168.254.0 or 192.168.47.0 because we need to route those over different networks:

# ip r
default via 192.168.254.4 dev eth0 
172.16.0.0/24 dev ariel-link proto kernel scope link src 172.16.0.1 
192.168.0.0/24 dev eth1 proto kernel scope link src 192.168.0.100 
192.168.7.0/24 dev eth2 proto kernel scope link src 192.168.7.15 
192.168.47.0/24 dev eth3 proto kernel scope link src 192.168.47.13 
192.168.112.0/24 dev eth1 proto kernel scope link src 192.168.112.100 
192.168.252.0/24 via 192.168.47.252 dev eth3 src 192.168.47.13 
192.168.253.0/24 via 192.168.47.252 dev eth3 src 192.168.47.13 
192.168.254.0/24 dev eth0 proto kernel scope link src 192.168.254.15 

So I added all three ranges to wireguard config on old-ariel and new-ariel 192.168.7.0/24, 192.168.47.0/24, 192.168.254.0/24. Then on old-ariel: iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE and ping from new-ariel to 192.168.254.44 works \o/ However it seems that the complete connection is not always 100% reliable. Sometimes pings get lost. Now 192.168.47.0. For that iptables -t nat -A POSTROUTING -o eth3 -j MASQUERADE. Oh, and of course we need the routes on new-ariel ip r a 192.168.47.0/24 via 172.16.0.1 dev ariel-link. Now ping also works \o/.
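
The per-range commands above follow one pattern, so they can be generated from a single list. The sketch below only prints the commands; the range-to-interface mapping is read from the `ip r` output on old-ariel, while the variable and function names are hypothetical.

```sh
#!/bin/sh
# Range-to-interface pairs as read from the routing table on old-ariel.
RANGES="192.168.7.0/24:eth2 192.168.47.0/24:eth3 192.168.254.0/24:eth0"

# Commands for old-ariel: masquerade tunnel traffic leaving each interface.
old_ariel_rules() {
    for pair in $RANGES; do
        echo "iptables -t nat -A POSTROUTING -o ${pair#*:} -j MASQUERADE"
    done
}

# Commands for new-ariel: route each range through the tunnel peer.
new_ariel_routes() {
    for pair in $RANGES; do
        echo "ip route add ${pair%%:*} via 172.16.0.1 dev ariel-link"
    done
}
```

The same ranges also have to appear in AllowedIPs of the wireguard config on both ends, otherwise wireguard drops the packets before any route or NAT rule applies.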

Actions #26

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #27

Updated by okurz over 1 year ago

Working with mkittler to connect workers. Because the external IP is apparently not directly reachable over https from the internal network, we put the internal IP of o3 into worker22:/etc/hosts:

10.150.1.11     openqa.opensuse.org

We unmasked and started the openQA scheduler on o3 and triggered a test job:

openqa-clone-job --within-instance https://openqa.opensuse.org/tests/3440744 TEST+=okurz BUILD= _GROUP=0 WORKER_CLASS=openqaworker22        

those were ending up incomplete also because tests weren't synced yet. So I did an explicit

rsync -aHP /mnt/old-vg0-assets/tests/ /var/lib/openqa/share/tests/

so eventually we had https://openqa.opensuse.org/tests/3440879 which looks better but fails in
https://openqa.opensuse.org/tests/3440879#step/openqa_webui/7 trying to download an asset from http://openqa.opensuse.org with timeout exceeded. Maybe local firewall?

Also openqa-scheduler is often "stuck". We attached over strace and found it to be very busy with reading "/var/lib/openqa/testresults".

So immediately over to multi-machine tests:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3439908 TEST+=worker22-poo132134 _GROUP=0 BUILD= WORKER_CLASS=openqaworker22

Cloning parents of microos-Tumbleweed-DVD-x86_64-Build20230718-remote_ssh_target@64bit
Cloning children of microos-Tumbleweed-DVD-x86_64-Build20230718-remote_ssh_controller@64bit
2 jobs have been created:

I should keep in mind that we reach the system over ssh at 10.150.2.10. We can now also use ssh -4 root@ariel.dmz-prg2.suse.org

In the meantime our ssh tunneling connection failed. Still running that in screen on new-ariel but now changed to

sudo -u geekotest AUTOSSH_POLL=20 autossh -M 0 -o ServerAliveInterval=15 -NT old-ariel

Right now in https://openqa.opensuse.org/tests/3440889#live we observed the problem that the openQA jobs try to reach openqa.opensuse.org over the wrong internal address 10.150.2.10. That is also what host openqa.opensuse.org on w22+w24 say even though /etc/hosts has 10.150.1.11 which is correct.
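
One possible explanation for the difference: the `host` command queries DNS servers directly and does not consult /etc/hosts, whereas `getent hosts` goes through NSS and therefore honours /etc/hosts. Anything that resolves via plain DNS would then still get the other address. A small sketch to make the two views comparable, with hypothetical function names:

```sh
#!/bin/sh
# Address that NSS (and thus /etc/hosts) yields for a name.
nss_address() {
    getent hosts "$1" | awk 'NR == 1 { print $1 }'
}

# Address that plain DNS yields, ignoring /etc/hosts.
dns_address() {
    host -t A "$1" | awk '/has address/ { print $NF; exit }'
}

# Compare two lookup results and describe the outcome.
compare_lookups() {
    if [ "$1" = "$2" ]; then
        echo "consistent: $1"
    else
        echo "mismatch: hosts/NSS=$1 DNS=$2"
    fi
}
```

Usage on a worker: `compare_lookups "$(nss_address openqa.opensuse.org)" "$(dns_address openqa.opensuse.org)"`.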

Status update about openqa.opensuse.org migration: https://openqa.opensuse.org is up, users can authenticate, jobs can be scheduled. A sync of assets is still in progress and expected to take a couple of hours more. New products can be scheduled and are already scheduled by openqa-trigger-from-obs. No jobs will be executed though as we do not have production workers connected. Currently the problem is that openQA qemu jobs internally fail to resolve the domain openqa.opensuse.org properly where needed, e.g. when trying to install packages over zypper. See one example https://openqa.opensuse.org/tests/3440879#step/openqa_webui/7 . Any hints would be appreciated to help with that. Due to the end of the work-week for most I expect that this will only be fixed next week. More details about the migration work in https://progress.opensuse.org/issues/132143

In the meantime ssh access with ssh -p 2214 gate.opensuse.org now also works so we can also consider using this for the NUE1-based workers. However I now connected openqaworker4 (NUE1) using TESTPOOLSERVER still pointing to old-ariel which also still runs fetchneedles every minute.

https://openqa.opensuse.org/tests/3440750

is a successful verification from openqaworker4.

And now /dev/vdb crashed with "Input/output error". I unmounted all the mount points relying on vdb directly or indirectly. Unmounting and remounting yields "Structure needs cleaning". I am currently running xfs_metadump -g /dev/vdb1 - | zstd -o vdb1_xfs_metadump-$(date -Is).dump.zstd before continuing with repair attempts. I tried xfs_repair but that failed so I will likely need to use xfs_repair -L. In https://suse.slack.com/archives/C04MDKHQE20/p1689959465196969 Anthony Iliopoulos mentioned that they know about such a bug and asked me to keep that metadata dump for them to check before continuing repair attempts.
Currently at 8516160 inodes, taking very long.

Actions #28

Updated by okurz over 1 year ago

After 1h and only 20% progress I aborted and instead ran xfs_repair -L /dev/vdb1 which took about 3h but eventually I had a storage device back which looks fine. So I remounted everything and restarted services and then on openqaworker4 started all failed services and we are back in business with actually passed tests: https://openqa.opensuse.org/tests/3440811# . I also tried needle creation in https://openqa.opensuse.org/tests/3440880/modules/epiphany/steps/6/edit which was rather slow but completed. I retriggered the job. I also retriggered incompletes with failed_since=2023-07-20 openqa-advanced-retrigger-jobs. Applied the same changes on openqaworker20 as on w4, i.e.:

  • add o3 entry in /etc/hosts
  • add openqa.opensuse.org entry in /etc/openqa/client.conf
  • add openqa.opensuse.org entry in /etc/openqa/workers.ini
  • restart worker services
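
Before restarting the worker services it can be useful to confirm that all three files really carry an o3 entry. A minimal sketch with hypothetical function names; it only checks that the host name is mentioned on a non-comment line, not that the entry is correct.

```sh
#!/bin/sh
# Succeed if the file mentions the o3 host name on a non-comment line.
mentions_o3() {
    grep -v '^[[:space:]]*#' "$1" 2>/dev/null | grep -q 'openqa\.opensuse\.org'
}

# Report which of the given files still lack an o3 entry.
missing_o3_entries() {
    for f in "$@"; do
        mentions_o3 "$f" || echo "missing: $f"
    done
}
```

Usage on a worker: `missing_o3_entries /etc/hosts /etc/openqa/client.conf /etc/openqa/workers.ini`. No output means all three files mention the host.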

For now I will monitor before adding more workers.

And then I also tried to trigger again some product sync and/or trigger in https://openqa.opensuse.org/admin/obs_rsync/openSUSE:Factory:ToTest, let's see the effect.

Actions #29

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #30

Updated by okurz over 1 year ago

Made autossh+socat persistent

cat > /etc/systemd/system/autossh-old-ariel.service <<EOF
[Unit]
Description=AutoSSH tunnel to connect to old-ariel, see https://progress.opensuse.org/issues/132143 for details
After=network.target

[Service]
User=geekotest
Environment="AUTOSSH_POLL=20"
ExecStart=/usr/bin/autossh -M 0 -o ServerAliveInterval=15 -NT old-ariel
# From my experience a process might hang with "connection refused". Trying to workaround with periodic restarts
RuntimeMaxSec=24h
Restart=always

[Install]
WantedBy=default.target
EOF
cat > /etc/systemd/system/socat-wireguard-to-ssh.service <<EOF
[Unit]
Description=socat forwarding the wireguard UDP socket to the TCP port forwarded to old-ariel by autossh-old-ariel.service, see https://progress.opensuse.org/issues/132143 for details
Wants=autossh-old-ariel.service
After=network.target

[Service]
ExecStart=/usr/bin/socat udp4-listen:62451,reuseaddr,fork tcp:localhost:62450

[Install]
WantedBy=default.target
EOF
systemctl daemon-reload && systemctl enable --now autossh-old-ariel socat-wireguard-to-ssh
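
Since the wireguard link depends on both units, a quick check that they are active can save debugging time. The following is a sketch; the function name and the overridable SYSTEMCTL variable are assumptions made so that the logic can be tried without systemd.

```sh
#!/bin/sh
SYSTEMCTL="${SYSTEMCTL:-systemctl}"   # overridable, e.g. for testing

# Print one line per unit that is not reported as active.
inactive_units() {
    for unit in "$@"; do
        state=$($SYSTEMCTL is-active "$unit" 2>/dev/null) || true
        [ "$state" = active ] || echo "$unit: ${state:-unknown}"
    done
}
```

Usage: `inactive_units autossh-old-ariel socat-wireguard-to-ssh`. No output means both units are active.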
  • Enabled NUE1 workers

    • rebel
    • power8-3 was powered off, switched on over http://qa-power8-3.qa.suse.de/. After it was pingable again I could login and update the config
    • worker6
    • worker7
    • worker19
  • TODO: aarch64, currently not reachable

Then I was trying to fix developer mode for NUE1 workers connected to PRG2 o3. I added 192.168.112.0/24 to wireguard on both and a route on new-ariel. But on old-ariel tcpdump shows IP 172.16.0.2 > qa-power8-3.openqanet.opensuse.org: ICMP echo request. Shouldn't that also show a rewritten source?

Regarding dnsmasq mcaj gave me what he had on the temporary PXE server

# Customized dhcp options
option tz-posix-string code 100 = string;
option tz-name code 101 = string;
option domain-name-search-list code 119 = string;
option dhcp6.bootfile-url code 59 = string;
option boot-menu code 150 = string;
option searchlist code 119 = string;
option option-176 code 176 = string;
option option-242 code 242 = string;
option arch code 93 = unsigned integer 16;
option ldap-server code 95 = text;
option proxy-auto code 252 = text;
# option definitions common to all supported networks...
default-lease-time 1296000;
max-lease-time 1296000;
one-lease-per-client True;
get-lease-hostnames True;
server-identifier oqa-pxe.oqa.dmz-prg2.suse.org;
server-name "oqa-pxe.oqa.dmz-prg2.suse.org";


subnet 10.150.1.0 netmask 255.255.255.0 {
  option broadcast-address 10.150.1.255;
  option domain-name-servers 8.8.8.8,4.4.4.4;
  option ntp-servers 0.opensuse.pool.ntp.org,1.opensuse.pool.ntp.org,2.opensuse.pool.ntp.org;
  option lpr-servers cups.prg2.suse.org;
  option irc-server irc.prg2.suse.org;
  option smtp-server relay.prg2.suse.org;
  option domain-name "oqa.dmz-prg2.suse.org";
  option domain-search "oqa.dmz-prg2.suse.org";
  option nis-domain "dmz-prg2.suse.org";
  option nis-servers wotan.prg2.suse.org,amor.prg2.suse.org;
#  filename "/EFI/BOOT/bootx64.efi";
  filename "http://10.150.1.10/EFI/BOOT/bootx64.efi";
  next-server 10.150.1.10;
  default-lease-time 64800;
  max-lease-time 64800;
  option routers 10.150.1.254;
  option vendor-class-identifier "HTTPClient";
  pool {
    range 10.150.1.100 10.150.1.200;
  }
}

and the apache config for a simple setup with the Leap 15.5 net medium:

I had a default apache there with a simple setting like this:

apache2/vhosts.d/oqa-pxe.conf

<VirtualHost *:80>
    ServerName oqa-pxe-oqa.dmz-prg2.suse.org
    ServerAlias 

    ServerAdmin webmaster@oqa-pxe-oqa.dmz-prg2.suse.org

    AllowEncodedSlashes Off

    LogLevel warn
    ErrorLog /var/log/apache2/oqa-pxe-error.log
    LogFormat "%h %l %u %t \"%r\" %>s"
    CustomLog /var/log/apache2/oqa-pxe-access.log  "%h %l %u %t \"%r\" %>s"

    DocumentRoot /srv/tftpboot

    <Directory "/srv/tftpboot">
        Options FollowSymLinks
        Require all granted
        AllowOverride None

    </Directory>
</VirtualHost>

And yes, into /srv/tftpboot/ I copied the NET ISO,
and the last part was that I had to adjust grub.cfg a bit in /srv/tftpboot/EFI/BOOT/grub.cfg

lang=en_US

menuentry 'Installation' --class opensuse --class gnu-linux --class gnu --class os {
  set gfxpayload=keep
  echo 'Loading kernel ...'
  linux /boot/x86_64/loader/linux
  echo 'Loading initial ramdisk ...'
  initrd /boot/x86_64/loader/initrd
}

menuentry 'Upgrade' --class opensuse --class gnu-linux --class gnu --class os {
  set gfxpayload=keep
  echo 'Loading kernel ...'
  linux /boot/x86_64/loader/linux upgrade=1
  echo 'Loading initial ramdisk ...'
  initrd /boot/x86_64/loader/initrd
}

submenu 'More ...' {

  menuentry 'Rescue System' --class opensuse --class gnu-linux --class gnu {
    set gfxpayload=keep
    echo 'Loading kernel ...'
    linux /boot/x86_64/loader/linux rescue=1
    echo 'Loading initial ramdisk ...'
    initrd /boot/x86_64/loader/initrd
  }
Actions #31

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #32

Updated by okurz over 1 year ago

  • Description updated (diff)

To fix the failed obs_rsync jobs as seen on https://openqa.opensuse.org/minion/jobs?state=failed, on new-ariel:

cat > /etc/systemd/system/openqa-gru.service.d/override.conf <<EOF
[Service]
Environment="OPENQA_LOGFILE=/var/log/openqa_gru"
Environment='RSYNC_CONNECT_PROG=ssh -i /var/lib/openqa/ssh-tunnel-nue1/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh-tunnel-nue1/known_hosts" -p 2213 geekotest@gate.opensuse.org nc %%H 873'
WorkingDirectory=/var/lib/openqa
EOF
cat > /etc/systemd/system/openqa-webui.service.d/override.conf <<EOF
[Service]
Environment='RSYNC_CONNECT_PROG=ssh -i /var/lib/openqa/ssh-tunnel-nue1/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh-tunnel-nue1/known_hosts" -p 2213 geekotest@gate.opensuse.org nc %%H 873'
EOF

and in /etc/nginx/vhosts.d/openqa.conf

server {
    listen       80 default_server;
    listen       [::1]:80 default_server;
    server_name  openqa.opensuse.org openqa.infra.opensuse.org;

    #include vhosts.d/openqa-locations.inc;
    …
    location / {
        return 301 https://$host$request_uri;
    }
}

server {
    listen       80;
    listen       [::1]:80;
    server_name  localhost;

    include vhosts.d/openqa-locations.inc;
}

and then

aa-complain /etc/apparmor.d/usr.share.openqa.script.openqa
systemctl daemon-reload && systemctl restart openqa-webui openqa-gru

and retriggered failed minion jobs.
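
A note on the %%H in the overrides above: in systemd unit files % introduces a specifier, so a literal % has to be written as %%. rsync then sees %H in RSYNC_CONNECT_PROG and replaces it with the target hostname. A minimal sketch of the escaping; unit_escape is a made-up helper name for illustration, not part of systemd or openQA:

```shell
# unit_escape STRING: double every "%" so that systemd passes it through
# literally instead of treating it as a specifier (hypothetical helper)
unit_escape() {
    printf '%s\n' "$1" | sed 's/%/%%/g'
}
```

For example, unit_escape 'nc %H 873' prints the form used in the Environment= lines above.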

We found https://openqa.opensuse.org/tests/3446020#step/bootloader_s390/9 showing "RETR Tumbleweed-oss-s390x-Snapshot20230719/suse.ins 550 Failed to open file." as s390x jobs try to reach the FTP server on 192.168.112.100, which is old-ariel. So what we did so far is

setcap 'cap_net_bind_service=+ep' $(which sshd)
systemctl restart sshd

and on new-ariel in /var/lib/openqa/.ssh/config

Host old-ariel
        HostName gate.opensuse.org
        User geekotest
        Port 2213
        IdentityFile /var/lib/openqa/ssh-tunnel-nue1/id_ed25519
        UserKnownHostsFile /var/lib/openqa/ssh-tunnel-nue1/known_hosts
        LocalForward 62450 localhost:62450
        RemoteForward 192.168.112.100:21 localhost:21

and systemctl autossh-old-ariel but "Warning: remote port forwarding failed for listen port 21". So that isn't solved yet. However we found that the machine for s390x in https://openqa.opensuse.org/admin/machines has hardcoded "REPO_HOST=192.168.112.100;S390_NETWORK_PARAMS=OSAMedium=eth OSAInterface=qdio InstNetDev=osa HostIP=192.168.112.@S390_HOST@/24 Hostname=s390linux@S390_HOST@ Gateway=192.168.112.254 Nameserver=192.168.112.100 Domain=opensuse.org PortNo=0 Layer2=1 ReadChannel=0.0.0800 WriteChannel=0.0.0801 DataChannel=0.0.0802 OSAHWAddr=" so maybe we can also change that after we ensure that new-ariel can reach NUE1 workers as well as that NUE1 workers can reach new-ariel.

ip r a 10.150.1.11/32 via 192.168.112.110 dev eth0

Actions #33

Updated by nicksinger over 1 year ago

# Generated by iptables-save v1.8.7 on Mon Jul 24 12:21:55 2023
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [1774195:3712632607]
:forward_dmz - [0:0]
:forward_ext - [0:0]
:forward_int - [0:0]
:input_dmz - [0:0]
:input_ext - [0:0]
:input_int - [0:0]
:reject_func - [0:0]
-A INPUT -i lo -j ACCEPT
-A INPUT -i ariel-link -j ACCEPT
-A INPUT -m conntrack --ctstate ESTABLISHED -j ACCEPT
-A INPUT -p icmp -m conntrack --ctstate RELATED -j ACCEPT
-A INPUT -i eth1 -j input_int
-A INPUT -i eth0 -j input_dmz
-A INPUT -j input_ext
-A INPUT -m limit --limit 3/min -j LOG --log-prefix "SFW2-IN-ILL-TARGET " --log-tcp-options --log-ip-options
-A INPUT -j DROP
-A FORWARD -m limit --limit 3/min -j LOG --log-prefix "SFW2-FWD-ILL-ROUTING " --log-tcp-options --log-ip-options
-A OUTPUT -o lo -j ACCEPT
-A input_dmz -m pkttype --pkt-type broadcast -j DROP
-A input_dmz -p icmp -m icmp --icmp-type 4 -j ACCEPT
-A input_dmz -p icmp -m icmp --icmp-type 8 -j ACCEPT
-A input_dmz -p tcp -m tcp --dport 22 -j ACCEPT
-A input_dmz -p tcp -m tcp --dport 80 -j ACCEPT
-A input_dmz -p tcp -m tcp --dport 443 -j ACCEPT
-A input_dmz -p tcp -m tcp --dport 5666 -j ACCEPT
-A input_dmz -p tcp -m tcp --dport 6556 -j ACCEPT
-A input_dmz -m comment --comment "sfw2.insert.pos" -m pkttype ! --pkt-type unicast -j DROP
-A input_dmz -p tcp -m limit --limit 3/min -m tcp --tcp-flags FIN,SYN,RST,ACK SYN -j LOG --log-prefix "SFW2-INdmz-DROP-DEFLT " --log-tcp-options --log-ip-options
-A input_dmz -p icmp -m limit --limit 3/min -j LOG --log-prefix "SFW2-INdmz-DROP-DEFLT " --log-tcp-options --log-ip-options
-A input_dmz -p udp -m limit --limit 3/min -m conntrack --ctstate NEW -j LOG --log-prefix "SFW2-INdmz-DROP-DEFLT " --log-tcp-options --log-ip-options
-A input_dmz -j DROP
-A input_ext -m pkttype --pkt-type broadcast -j DROP
-A input_ext -p icmp -m icmp --icmp-type 4 -j ACCEPT
-A input_ext -p icmp -m icmp --icmp-type 8 -j ACCEPT
-A input_ext -p tcp -m tcp --dport 113 -m conntrack --ctstate NEW -j reject_func
-A input_ext -p tcp -m tcp --dport 20048 -j ACCEPT
-A input_ext -p udp -m udp --dport 20048 -j ACCEPT
-A input_ext -s 192.168.254.4/32 -p tcp -m tcp --dport 22 -j ACCEPT
-A input_ext -s 192.168.254.44/32 -p tcp -m tcp --dport 22 -j ACCEPT
-A input_ext -s 192.168.254.46/32 -p tcp -m tcp --dport 22 -j ACCEPT
-A input_ext -s 192.168.254.4/32 -p tcp -m tcp --dport 80 -j ACCEPT
-A input_ext -s 192.168.254.4/32 -p tcp -m tcp --dport 5666 -j ACCEPT
-A input_ext -s 192.168.254.26/32 -p tcp -m tcp --dport 80 -j ACCEPT
-A input_ext -s 192.168.254.46/32 -p tcp -m tcp --dport 80 -j ACCEPT
-A input_ext -m comment --comment "sfw2.insert.pos" -m pkttype ! --pkt-type unicast -j DROP
-A input_ext -p tcp -m limit --limit 3/min -m tcp --tcp-flags FIN,SYN,RST,ACK SYN -j LOG --log-prefix "SFW2-INext-DROP-DEFLT " --log-tcp-options --log-ip-options
-A input_ext -p icmp -m limit --limit 3/min -j LOG --log-prefix "SFW2-INext-DROP-DEFLT " --log-tcp-options --log-ip-options
-A input_ext -p udp -m limit --limit 3/min -m conntrack --ctstate NEW -j LOG --log-prefix "SFW2-INext-DROP-DEFLT " --log-tcp-options --log-ip-options
-A input_ext -j DROP
-A input_int -j ACCEPT
-A reject_func -p tcp -j REJECT --reject-with tcp-reset
-A reject_func -p udp -j REJECT --reject-with icmp-port-unreachable
-A reject_func -j REJECT --reject-with icmp-proto-unreachable
COMMIT
# Completed on Mon Jul 24 12:21:55 2023
Actions #34

Updated by nicksinger over 1 year ago

To make FTP from new-ariel available on old workers we stopped atftpd and added an iptables redirect from old-ariel:21 -> ariel-link -> new-ariel:21 with the commands:

# iptables -t nat -A PREROUTING -p tcp --dport 21 -j DNAT --to-destination 10.150.1.11:21
# iptables -t nat -A POSTROUTING -p tcp -d 10.150.1.11 --dport 21 -j SNAT --to-source 192.168.112.100
Actions #35

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #36

Updated by nicksinger over 1 year ago

Added a few more rules to enable "extended passive mode", which is required by s390 hosts:

# iptables -t nat -A PREROUTING -p tcp --dport 30000:30100 -j DNAT --to-destination 10.150.1.11:30000-30100
# iptables -t nat -A POSTROUTING -p tcp -d 10.150.1.11 --dport 30000:30100 -j SNAT --to-source 192.168.112.100
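
The forwards for port 21 and for the passive port range follow the same DNAT plus SNAT pattern. A minimal sketch that only prints the commands instead of running them; print_forward and the two variables are names made up for illustration, the addresses are the ones used above:

```shell
# Addresses taken from the rules above
NEW_ARIEL=10.150.1.11
OLD_ARIEL=192.168.112.100

# print_forward PORTS: print the DNAT/SNAT pair that forwards a TCP port
# ("21") or port range ("30000:30100") from old-ariel to new-ariel.
# Nothing is executed here (hypothetical helper).
print_forward() {
    ports=$1
    # --dport takes "first:last" while --to-destination takes "first-last"
    dest_ports=$(printf '%s' "$ports" | tr ':' '-')
    echo "iptables -t nat -A PREROUTING -p tcp --dport $ports -j DNAT --to-destination $NEW_ARIEL:$dest_ports"
    echo "iptables -t nat -A POSTROUTING -p tcp -d $NEW_ARIEL --dport $ports -j SNAT --to-source $OLD_ARIEL"
}
```

The printed lines can be reviewed first and then run as root on old-ariel.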
Actions #37

Updated by nicksinger over 1 year ago

The test was able to continue further but failed with: "Illegal EPRT command".
We found that "port_promiscuous=YES" in vsftpd.conf (on new-ariel) disables some security checks which might interfere with NATing. Changed it and trying again.

Actions #38

Updated by favogt over 1 year ago

Next issue: For some reason the synced TW assets are not tracked properly and not listed on https://openqa.opensuse.org/admin/assets. The DVD iso got deleted and most tests failed due to a 404 during asset download. I forced a resync and it's there again, but still not visible on the assets page, so I suspect it'll be deleted soon.

Actions #39

Updated by okurz over 1 year ago

  • Related to action #133232: o3 hook scripts are triggered but no comment shows up on job added
Actions #40

Updated by favogt over 1 year ago

Some progress for s390 FTP downloads.

    >>>RETR Tumbleweed-oss-s390x-Snapshot20230719/boot/s390x/parmfile
    500 OOPS: vsf_sysutil_bind

I used strace on vsftpd and found that 500 OOPS: vsf_sysutil_bind was caused by a call to bind(::ffff:10.150.1.11 port 20) returning EACCES. Adding connect_from_port_20=NO to /etc/vsftpd.conf helped. Now it chooses a random port as source.

With that set, it got stuck sending initrd data:

[pid 32000] socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP <unfinished ...>
[pid 31998] read(3,  <unfinished ...>
[pid 32000] <... socket resumed>)       = 5
[pid 32000] setsockopt(5, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid 32000] bind(5, {sa_family=AF_INET6, sin6_port=htons(0), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:10.150.1.11", &sin6_addr), sin6_scope_id=0}, 28) = 0
[pid 32000] fcntl(5, F_GETFL)           = 0x2 (flags O_RDWR)
[pid 32000] fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 32000] connect(5, {sa_family=AF_INET6, sin6_port=htons(11580), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:192.168.112.9", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)

The host is not reachable over IPv6 that way, so I flipped vsftpd to listen on IPv4 only instead of IPv6 only.
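
To double-check which of the vsftpd options touched in the last notes are actually active, something like the following could be used. This is only a sketch: both function names are made up, and listen/listen_ipv6 are assumed to be the options behind the IPv4/IPv6 switch, since the note does not name them.

```shell
# vsftpd_opt FILE KEY: print the last value assigned to KEY in a
# "key=value" style config such as /etc/vsftpd.conf (hypothetical helper)
vsftpd_opt() {
    sed -n "s/^$2=//p" "$1" | tail -n 1
}

# show_ftp_nat_opts FILE: print the options relevant for FTP behind NAT
# that came up in this ticket; unset options show an empty value
show_ftp_nat_opts() {
    for key in connect_from_port_20 port_promiscuous listen listen_ipv6 pasv_min_port pasv_max_port; do
        echo "$key=$(vsftpd_opt "$1" "$key")"
    done
}
```

Usage would be show_ftp_nat_opts /etc/vsftpd.conf on new-ariel.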

With that, sending data appears to work, but it's too slow:

06:43:44.966985 IP 172.16.0.1.11608 > ariel.suse-dmz.opensuse.org.ftp: Flags [P.], seq 123:153, ack 298, win 32767, length 30: FTP: EPRT |1|192.168.112.9|11610|
06:43:44.967774 IP ariel.suse-dmz.opensuse.org.ftp > 172.16.0.1.11608: Flags [P.], seq 298:349, ack 153, win 507, length 51: FTP: 200 EPRT command successful. Consider using EPSV.
06:43:44.975142 IP 172.16.0.1.11608 > ariel.suse-dmz.opensuse.org.ftp: Flags [.], ack 349, win 32767, length 0
06:43:45.017710 IP 172.16.0.1.11608 > ariel.suse-dmz.opensuse.org.ftp: Flags [P.], seq 153:215, ack 349, win 32767, length 62: FTP: RETR Tumbleweed-oss-s390x-Snapshot20230719/boot/s390x/initrd
06:43:45.020081 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [S], seq 1618861255, win 64860, options [mss 1380,nop,nop,sackOK,nop,wscale 7], length 0
06:43:45.027382 IP zvm63b.11610 > ariel.suse-dmz.opensuse.org.45609: Flags [S.], seq 500541048, ack 1618861256, win 8192, options [mss 1460,wscale 1,nop], length 0
06:43:45.027412 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [.], ack 1, win 507, length 0
06:43:45.028115 IP ariel.suse-dmz.opensuse.org.ftp > 172.16.0.1.11608: Flags [P.], seq 349:468, ack 215, win 507, length 119: FTP: 150 Opening BINARY mode data connection for Tumbleweed-oss-s390x-Snapshot20230719/boot/s390x/initrd (68905600 bytes).
06:43:45.028477 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [P.], seq 1:2761, ack 1, win 507, length 2760
06:43:45.028496 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [P.], seq 2761:5521, ack 1, win 507, length 2760
06:43:45.028562 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [P.], seq 5521:8193, ack 1, win 507, length 2672
06:43:45.034775 IP 172.16.0.1.11608 > ariel.suse-dmz.opensuse.org.ftp: Flags [.], ack 468, win 32767, length 0
06:43:45.239207 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [.], seq 1:1381, ack 1, win 507, length 1380
06:43:45.347067 IP zvm63b.11610 > ariel.suse-dmz.opensuse.org.45609: Flags [.], ack 1381, win 28671, length 0
06:43:45.347089 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [.], seq 8193:9573, ack 1, win 507, length 1380
06:43:45.347101 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [P.], seq 9573:10953, ack 1, win 507, length 1380
06:43:45.775208 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [.], seq 1381:2761, ack 1, win 507, length 1380
Actions #41

Updated by okurz over 1 year ago

nicksinger is working with cbachmann on a direct UDP access for wireguard on new-ariel. @nicksinger please work closely with cbachmann to get that sorted.

In the meantime fvogt is looking into an alternative: setting up a tunnel over ssh with ssh -w …:

On old-ariel:
Add PermitTunnel yes to sshd_config and rcsshd reload.

sudo ip tuntap add mode tun user geekotest name tun5
sudo ip addr add 172.17.0.1/24 dev tun5
sudo ip link set tun5 up

On new-ariel:

sudo ip tuntap add mode tun user geekotest name tun5
sudo ip addr add 172.17.0.2/24 dev tun5
sudo ip link set tun5 up
sudo su geekotest -c 'ssh -w 5:5 -i /var/lib/openqa/ssh-tunnel-nue1/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh-tunnel-nue1/known_hosts" -p 2213 geekotest@gate.opensuse.org'
Actions #42

Updated by nicksinger over 1 year ago

Since we didn't manage to open a port for our wireguard tunnel we followed up on Fabian's idea to use the ssh tunnel mechanism. I adjusted routing and firewalls on both new-ariel and old-ariel to make use of this tunnel. To validate I used:

for i in $(cat /etc/hosts | cut -d " " -f 1 | grep -v "^10.150" | grep -v "^#" | grep -v "^$" | xargs); do ping -c1 -W 1 -q  $i > /dev/null || echo $i broken; done
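
The same check as a small reusable sketch, in case it is needed again for later migrations. list_hosts and check_hosts are made-up names; unlike the one-liner above this uses awk, so tab-separated entries are handled too:

```shell
# list_hosts FILE: print the address column of a hosts file, skipping
# comments, empty lines and the new 10.150.x.x addresses
list_hosts() {
    awk '!/^[ \t]*(#|$)/ && $1 !~ /^10\.150\./ { print $1 }' "$1"
}

# check_hosts [FILE]: ping every listed address once and report failures
check_hosts() {
    for ip in $(list_hosts "${1:-/etc/hosts}"); do
        ping -c1 -W 1 -q "$ip" > /dev/null 2>&1 || echo "$ip broken"
    done
}
```

Running check_hosts without arguments on new-ariel should print nothing when every old-network host answers over the tunnel.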

new-ariel

routing

ip r a 172.16.0.0/24 dev tun5
ip r a 192.168.112.0/24 dev tun5
ip r a 192.168.7.0/24 dev tun5
ip r a 192.168.254.0/24 dev tun5

iptables rules

iptables -I INPUT 1 -i tun5 -j ACCEPT

old-ariel

routing

ip r a 10.150.1.11 via 172.17.0.1 dev tun5

iptables rules

iptables -I INPUT 4 -i tun5 -j ACCEPT
# Compressed down without -o argument to just allow forwarding on *all* interfaces
#-A FORWARD -i ariel-link -o eth1 -j ACCEPT
#-A FORWARD -i ariel-link -o eth3 -j ACCEPT
#-A FORWARD -i ariel-link -o eth2 -j ACCEPT
iptables -I FORWARD 4 -i tun5 -j ACCEPT

# change incoming source ip to own interface ip to allow remote hosts to respond to a known ip
iptables -t nat -A POSTROUTING -o eth1 -j SNAT --to-source 192.168.112.100
iptables -t nat -A POSTROUTING -o eth2 -j SNAT --to-source 192.168.7.15
iptables -t nat -A POSTROUTING -o eth3 -j SNAT --to-source 192.168.47.13

# forward old-ariel:21 to new-ariel:21 to allow old workers to access the new ftp
iptables -t nat -A PREROUTING -p tcp -m tcp --dport 21 -j DNAT --to-destination 10.150.1.11:21
iptables -t nat -A POSTROUTING -d 10.150.1.11/32 -p tcp -m tcp --dport 21 -j SNAT --to-source 192.168.112.100
# used by passive mode of ftp (e.g. required by s390x test) and defined by "grep ^pasv /etc/vsftpd.conf"
iptables -t nat -A PREROUTING -p tcp -m tcp --dport 30000:30100 -j DNAT --to-destination 10.150.1.11:30000-30100
iptables -t nat -A POSTROUTING -d 10.150.1.11/32 -p tcp -m tcp --dport 30000:30100 -j SNAT --to-source 192.168.112.100
# used by zabbix-agent to connect from zabbix server (in the old opensuse network) to new ariel
iptables -t nat -A PREROUTING -p tcp -m tcp --dport 10049 -j DNAT --to-destination 10.150.1.11:10049
iptables -t nat -A POSTROUTING -d 10.150.1.11/32 -p tcp -m tcp --dport 10049 -j SNAT --to-source 192.168.112.100
# nat all traffic on all links
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
iptables -t nat -A POSTROUTING -o eth2 -j MASQUERADE
iptables -t nat -A POSTROUTING -o eth3 -j MASQUERADE
iptables -t nat -A POSTROUTING -o tun5 -j MASQUERADE
Actions #43

Updated by favogt over 1 year ago

Apparently the approach to use OpenSSH-backed tun devices works well enough for s390x tests to pass now: https://openqa.opensuse.org/tests/3451083

I made the SSH tunnel configuration persistent now.

On new-ariel, I adjusted /var/lib/openqa/.ssh/config:

        # For wireguard, meanwhile replaced by tun5 tunnel
        #LocalForward 62450 localhost:62450
        Tunnel yes
        TunnelDevice 5:5

Added /etc/sysconfig/network/ifcfg-tun5:

STARTMODE=auto
BOOTPROTO=static
IPADDR=172.17.0.2
NETMASK=255.255.255.0
TUNNEL=tun
TUNNEL_SET_GROUP=nogroup
TUNNEL_SET_OWNER=geekotest

and /etc/sysconfig/network/ifroute-tun5:

192.168.7.0/24
192.168.47.0/24
192.168.112.0/24
192.168.254.0/24

After stopping the existing tunnel and interface, running ifup tun5 and restarting autossh-old-ariel.service brought it back up and I could ping and ssh to workers again.

On old-ariel, only ifcfg-tun5 had to be added:

STARTMODE=auto
BOOTPROTO=static
IPADDR=172.17.0.1
NETMASK=255.255.255.0
TUNNEL=tun
TUNNEL_SET_GROUP=nogroup 
TUNNEL_SET_OWNER=geekotest
Actions #44

Updated by okurz over 1 year ago

The one thing missing was the iptables rule for tun5, so we created another systemd service file. We now have:

# cat > /etc/systemd/system/autossh-old-ariel.service <<EOF
[Unit]
Description=AutoSSH tunnel to connect to old-ariel, see https://progress.opensuse.org/issues/132143 for details
Requires=autossh-iptables.service
After=network.target

[Service]
User=geekotest
Environment="AUTOSSH_POLL=20"
ExecStart=/bin/sh -c "iptables-save | grep -q 'tun5 -j ACCEPT' || iptables -I INPUT 1 -i tun5 -j ACCEPT"
ExecStart=/usr/bin/autossh -M 0 -o ServerAliveInterval=15 -NT old-ariel
# From my experience a process might hang with "connection refused". Trying to workaround with periodic restarts
RuntimeMaxSec=24h
Restart=always

[Install]
WantedBy=default.target
EOF
cat > /etc/systemd/system/autossh-iptables.service <<EOF
[Unit]
Description=AutoSSH tunnel iptables
After=network.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c "iptables-save | grep -q 'tun5 -j ACCEPT' || iptables -I INPUT 1 -i tun5 -j ACCEPT"

[Install]
WantedBy=default.target
EOF
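
The "iptables-save | grep -q ... || iptables -I ..." guard above matches on a substring of the saved rules. An alternative sketch uses iptables -C, which checks for an exact rule. ensure_rule and the IPTABLES variable are made up for illustration; this is not what is deployed:

```shell
# Command to call; can be overridden, e.g. for a dry run
IPTABLES=${IPTABLES:-iptables}

# ensure_rule CHAIN RULE-SPEC...: insert the rule at position 1 unless
# "iptables -C" reports that an identical rule already exists
# (hypothetical helper, needs root when used with the real iptables)
ensure_rule() {
    chain=$1
    shift
    "$IPTABLES" -C "$chain" "$@" 2>/dev/null || "$IPTABLES" -I "$chain" 1 "$@"
}
```

The equivalent of the ExecStart line would be ensure_rule INPUT -i tun5 -j ACCEPT, which is safe to run repeatedly.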
Actions #45

Updated by nicksinger over 1 year ago

  • Description updated (diff)

I adjusted the following zabbix.conf entries on new-ariel:

Server=172.17.0.1
ListenPort=10049
ServerActive=zabbix-proxy-opensuse
Hostname=10.150.2.10

and added a port-forward from old-ariel:10049->new-ariel:10049 to make the zabbix server reach new-ariel (POST- and PREROUTING as described in https://progress.opensuse.org/issues/132143#note-42). https://zabbix.suse.de/zabbix.php?action=host.view now shows a new entry for "10.150.2.10" with checks being executed on new-ariel.

Actions #46

Updated by okurz over 1 year ago

Now looking into /etc/hosts and /etc/dnsmasq.d/openqa.conf, rebooted openqaworker21,22,24 and checking multi-machine again:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3450618 TEST+=worker24-poo132134 _GROUP=0 BUILD= WORKER_CLASS=openqaworker24

using firewalld-policy tests as suggested by fvogt.

3 jobs have been created:

but they failed trying to access the other machines. Likely we still have an inconsistency regarding the selection of zones.

Actions #47

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #48

Updated by okurz over 1 year ago

  • Description updated (diff)

Together with mcaj we removed the old read-only snapshot storage device from the VM and then renamed the LVM volume group

vgrename vg0-new vg0
sed -i 's/vg0-new/vg0/' /etc/fstab
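
Before running sed -i on /etc/fstab it can help to preview the change. A minimal sketch, with preview_rename being a made-up helper name:

```shell
# preview_rename FILE OLD NEW: print a diff of what "sed s/OLD/NEW/" would
# change in FILE without modifying it (hypothetical helper)
preview_rename() {
    # diff exits non-zero when the files differ, which is expected here
    sed "s/$2/$3/" "$1" | diff "$1" - || true
}
```

Usage would be preview_rename /etc/fstab vg0-new vg0. Entries written as /dev/mapper/... paths double the hyphen of the volume group name (vg0--new-...), so they would not be matched by this pattern and would need a separate check.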

We also found that the iptables rules for the tunnel to old-ariel were not applied. We suspect a race-condition with the firewall setup. We added After=SuSEfirewall2.service in autossh-iptables.service. Another reboot brought the tunnel up fine so I guess we are good. Updated description with the additional resolved tasks.

Actions #49

Updated by okurz over 1 year ago

  • Copied to action #133358: Migration of o3 VM to PRG2 - Ensure IPv6 is fully working added
Actions #50

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #51

Updated by okurz over 1 year ago

  • Copied to action #133364: Migration of o3 VM to PRG2 - Decommission old-ariel in NUE1 as soon as we do not need it anymore added
Actions #52

Updated by okurz over 1 year ago

  • Description updated (diff)
  • DONE Update https://progress.opensuse.org/projects/openqav3/wiki/ where necessary
  • DONE Make sure we know what to keep an eye out for for the later planned OSD VM migration -> lessons learned meeting scheduled for 2023-08-04
  • DONE As necessary also make sure that BuildOPS knows about caveats of migration as they plan to migrate OBS/IBS after us -> lessons learned meeting scheduled for 2023-08-04

Prepared https://gitlab.suse.de/qa-sle/backup-server-salt/-/merge_requests/11 for backup for "Ensure backup to backup.qa.suse.de works"

Actions #53

Updated by okurz over 1 year ago

  • Description updated (diff)
  • Due date deleted (2023-08-02)
  • Status changed from In Progress to Resolved

Together with nicksinger I made the config on old-ariel persistent:

cat > /etc/sysconfig/network/ifroute-tun5 <<EOF
10.150.1.11/32
EOF
cat > /etc/systemd/system/autossh-iptables_to_new_ariel.service <<EOF
[Unit]
Description=AutoSSH tunnel iptables
After=network.target SuSEfirewall2.service

[Service]
Type=oneshot
ExecStart=/etc/sysconfig/network/scripts/iptables_connection_to_new_ariel

[Install]
WantedBy=default.target
EOF
cat > /etc/sysconfig/network/scripts/iptables_connection_to_new_ariel <<EOF
#!/bin/sh
iptables -I INPUT 1 -i tun5 -j ACCEPT
# Compressed down without -o argument to just allow forwarding on *all* interfaces
#-A FORWARD -i ariel-link -o eth1 -j ACCEPT
#-A FORWARD -i ariel-link -o eth3 -j ACCEPT
#-A FORWARD -i ariel-link -o eth2 -j ACCEPT
iptables -I FORWARD 1 -i tun5 -j ACCEPT

# change incoming source ip to own interface ip to allow remote hosts to respond to a known ip
iptables -t nat -A POSTROUTING -o eth1 -j SNAT --to-source 192.168.112.100
iptables -t nat -A POSTROUTING -o eth2 -j SNAT --to-source 192.168.7.15
iptables -t nat -A POSTROUTING -o eth3 -j SNAT --to-source 192.168.47.13

# forward old-ariel:21 to new-ariel:21 to allow old workers to access the new ftp
iptables -t nat -A PREROUTING -p tcp -m tcp --dport 21 -j DNAT --to-destination 10.150.1.11:21
iptables -t nat -A POSTROUTING -d 10.150.1.11/32 -p tcp -m tcp --dport 21 -j SNAT --to-source 192.168.112.100
# used by passive mode of ftp (e.g. required by s390x test) and defined by "grep ^pasv /etc/vsftpd.conf"
iptables -t nat -A PREROUTING -p tcp -m tcp --dport 30000:30100 -j DNAT --to-destination 10.150.1.11:30000-30100
iptables -t nat -A POSTROUTING -d 10.150.1.11/32 -p tcp -m tcp --dport 30000:30100 -j SNAT --to-source 192.168.112.100
# used by zabbix-agent to connect from zabbix server (in the old opensuse network) to new ariel
iptables -t nat -A PREROUTING -p tcp -m tcp --dport 10049 -j DNAT --to-destination 10.150.1.11:10049
iptables -t nat -A POSTROUTING -d 10.150.1.11/32 -p tcp -m tcp --dport 10049 -j SNAT --to-source 192.168.112.100
# nat all traffic on all links
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
iptables -t nat -A POSTROUTING -o eth2 -j MASQUERADE
iptables -t nat -A POSTROUTING -o eth3 -j MASQUERADE
iptables -t nat -A POSTROUTING -o tun5 -j MASQUERADE
EOF
chmod +x /etc/sysconfig/network/scripts/iptables_connection_to_new_ariel
systemctl daemon-reload
systemctl enable --now autossh-iptables_to_new_ariel

and with that all tasks are done and ACs are covered. Ticket resolved \o/

Actions #54

Updated by okurz over 1 year ago

  • Copied to action #133475: Migration of o3 VM to PRG2 - connection to rabbit.opensuse.org added
Actions #55

Updated by favogt over 1 year ago

factory-package-news-web.service did not survive the last reboot because systemd did not reload units after /var/lib/openqa was available, which is needed for the unit symlink to work.

For now I replaced the symlink with a copy.

Actions #56

Updated by okurz over 1 year ago

favogt wrote:

factory-package-news-web.service did not survive the last reboot because systemd did not reload units after /var/lib/openqa was available, which is needed for the unit symlink to work.

For now I replaced the symlink with a copy.

I didn't quite get that. But regarding the setup of factory-package-news-web.service I wonder if we can write down more about that setup – and with "we" I mean "you" ;) Maybe more in https://github.com/openSUSE/opensuse-release-tools README?

Actions #57

Updated by favogt over 1 year ago

okurz wrote:

favogt wrote:

factory-package-news-web.service did not survive the last reboot because systemd did not reload units after /var/lib/openqa was available, which is needed for the unit symlink to work.

For now I replaced the symlink with a copy.

I didn't quite get that. But regarding the setup of factory-package-news-web.service I wonder if we can write down more about that setup – and with "we" I mean "you" ;) Maybe more in https://github.com/openSUSE/opensuse-release-tools README?

I'm not really familiar with the setup of that either. Something for @dimstar

Actions #58

Updated by favogt over 1 year ago

OBS was migrated to PRG2 meanwhile. IT added eth2 to reach the OBS backends like on old-ariel. I removed the 192.168.7.0/24 route from tun5, removed the RSYNC_CONNECT_PROG environment from openqa-webui and openqa-gru, and set /etc/apparmor.d/usr.share.openqa.script.openqa to enforcing again.

Actions #59

Updated by jbaier_cz about 1 year ago

  • Related to action #150956: o3 cannot send e-mails via smtp relay size:M added
Actions #60

Updated by okurz 8 months ago

  • Related to action #159669: No new openQA data on metrics.opensuse.org since o3 migration to PRG2 size:S added