action #132143 (closed)
QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
QA - coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo
Migration of o3 VM to PRG2 - 2023-07-19 size:M
0% done
Description
Motivation
The openQA webUI VM for o3 will move to PRG2. This will be conducted by Eng-Infra. We must support them.
Acceptance criteria
- AC1: o3 is reachable from the new location for SUSE employees
- AC2: Same as AC1 but for community members outside SUSE
- AC3: o3 multi-machine jobs run successfully on o3 after the migration
- AC4: We can still login into the machine over ssh from outside the SUSE network
- AC5: https://zabbix.nue.suse.com/ can still monitor o3
Suggestions
- DONE Track https://jira.suse.com/browse/ENGINFRA-2347 "DMZ-OpenQA implementation" (done) so that the o3 network is available
- DONE Track https://jira.suse.com/browse/ENGINFRA-2155 "Install Additional links to DMZ-CORE from J12 - openQA-DMZ" (done), something about cabling
- DONE Track https://jira.suse.com/browse/ENGINFRA-1742 "Build OpenQA Environment" for story of the o3 VM being migrated
- DONE Inform affected users about planned migration on date 2023-07-19
- DONE During migration work closely with Eng-Infra members conducting the actual VM migration
- DONE Join Jitsi and one thread in team-qa-tools and one thread in dct-migration
- DONE Wait for go-no-go meeting at 0700Z
- DONE Wait for mcaj to give the go from Eng-Infra side, then switch off the openQA scheduler on o3 and disable the authentication. I guess we can try to "break" the code by disabling any authenticated actions.
- DONE Also switch off other services like gru, scripts, investigation, etc.
- DONE Prepare old workers to connect over https as soon as o3 comes up again in prg2
- DONE Install more new machines in prg2 while waiting for the VM to come online -> installed worker21,22,24 though not yet activated for production. Rest to be continued in #132134
- DONE As soon as VM is ready in new place ensure that the webUI is good in read-only mode first
- DONE Update IP addresses on ariel where necessary in /etc/hosts, also crosscheck /etc/dnsmasq.d/openqa.conf
- DONE Ask Eng-Infra, mcaj, to switch off the DHCP/DNS/PXE server in the oqa dmz network
- DONE Try to reboot a worker from the PXE on o3
- DONE Connect all old workers from NUE1 over https, in particular everything non-qemu-x86_64 for the time being, e.g. aarch64, ppc64le, s390x, bare-metal, until we have such things directly from PRG2
- DONE Test and monitor a lot of o3 tests
- DONE As soon as everything looks really stable announce it to users in response to all the above announcements
- DONE Ensure that o3 is reachable again after migration from the new location
- DONE for SUSE employees
- DONE for community members outside SUSE
- DONE for o3 workers from at least one location (NUE1 or PRG2)
- DONE Ensure that we can still login into the machine over ssh from outside the SUSE network -> updated details on https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Accessing-the-o3-infrastructure
- DONE Ensure that https://zabbix.nue.suse.com/ can still monitor o3
- DONE Inform users as soon as migration is complete
- DONE Rename /dev/vg0-new to /dev/vg0
- DONE Ensure IPv6 is fully working -> #133358
- DONE Make wireguard+socat+ssh+routes from #132143-25 persistent / make ssh-tap-tunnel+routes+iptables persistent on new-ariel
- DONE On o3 systemctl unmask --now openqa-auto-update openqa-continuous-update rebootmgr
DONE On o3 enable again o3 specific nginx tmp+log paths in /etc/nginx/vhosts.d/openqa.conf
DONE Update https://progress.opensuse.org/projects/openqav3/wiki/ where necessary
DONE Make sure we know what to keep an eye out for for the later planned OSD VM migration
DONE As necessary also make sure that BuildOPS knows about caveats of migration as they plan to migrate OBS/IBS after us
DONE Make ssh-tap-tunnel+routes+iptables persistent on old-ariel
DONE Ensure backup to backup.qa.suse.de works
DONE Remove root ssh login on new-ariel
- The openQA machine setting for "s390x-zVM-vswitch-l2" has REPO_HOST=192.168.112.100 and other references to 192.168.112. This needs to be changed as soon as zVM instances are able to reach new-ariel internally, e.g. over FTP -> #132152
- Fix o3 bare metal hosts iPXE booting, see https://openqa.opensuse.org/tests/3446336#step/ipxe_install/2 -> #132647
- Enable workers to connect to o3 directly, not external https, and use testpoolserver with rsync instead -> #132134
- Enable production worker classes on new workers after tests look good -> #132134
Updated by okurz over 1 year ago
- Copied from action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M added
Updated by okurz over 1 year ago
- Copied to action #132146: Support migration of osd VM to PRG2 - 2023-08-29 size:M added
Updated by okurz over 1 year ago
- Subject changed from Support migration of o3 VM to Support migration of o3 VM to PRG2
Updated by okurz over 1 year ago
- Subject changed from Support migration of o3 VM to PRG2 to Support migration of o3 VM to PRG2 - 2023-07-19
Updated by okurz over 1 year ago
- Subject changed from Support migration of o3 VM to PRG2 - 2023-07-19 to Support migration of o3 VM to PRG2 - 2023-07-19 size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by nicksinger over 1 year ago
- Status changed from Workable to Feedback
- Assignee set to nicksinger
- Priority changed from Urgent to Normal
Updated by okurz over 1 year ago
- Priority changed from Normal to High
Let's keep it at least as "High" priority as all related work will need to happen this month anyway.
Updated by okurz over 1 year ago
- Copied to action #132647: Migration of o3 VM to PRG2 - bare-metal tests size:M added
Updated by okurz over 1 year ago
- Project changed from 46 to QA
- Category deleted (Infrastructure)
Updated by okurz over 1 year ago
- Project changed from QA to openQA Infrastructure
Updated by okurz over 1 year ago
- Description updated (diff)
I sent out announcements in https://matrix.to/#/!ilXMcHXPOjTZeauZcg:libera.chat/$iRY8O3FQ1B30wDjbIlki7wKsIPbDwVuMzMsE91dmoPY?via=libera.chat&via=matrix.org&via=opensuse.org , https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$QOEP1cCcZa2EDrCyw-3UI3HZYvOtvH5kfeZrqdNhnsU?via=opensuse.org&via=matrix.org&via=im.f3l.de, https://suse.slack.com/archives/C02CANHLANP/p1689279005496679 and by email to opensuse-factory@opensuse.org, openqa@suse.de, devel@suse.de, research@suse.de
Updated by okurz over 1 year ago
https://suse.slack.com/archives/C04MDKHQE20/p1689678826354739
(Oliver Kurz) @John Ford @Moroni Flores @Martin Caj (CC @Matthias Griessmeier @Nick Singer) Given that we did not manage to setup any o3 worker so far Nick Singer and me decided we should not go forward with the o3 VM migration. Next date suggested by us is Tuesday, 2023-07-25, as we expect that we have at least one machine available until then. Please confirm.
(John Ford) Understood. My only concern with 25/07/23 is the proximity to the Build Service cutover on 28/07/23. Not from a technical dependency point of view, more from an IT resource/support availability perspective. I'll re-purpose the 'Go/No-Go' meeting for 15:30. Let's use that time to understand issues to date and next steps.
(Moroni Flores) I think it should be ok, however it will be important to understand what did not work and what we are doing to address it
(Oliver Kurz) What did not work is that nobody yet could manage boot any reasonable installation system. Martin Caj, Nick Singer and me looked into multiple approaches and alternatives. We tried the "virtual media share" from BMCs but with no root access to the single machine within that network, oqa-jumpy, only Martin Caj can do that and so far we only got inconclusive error messages about that. What we are doing to address it: 1. Martin Caj today in the morning was (or is) working on setting up a DHCP/DNS/PXE server within that network and this is also one of the ways to go forward which I am confident that we can use within the next day. Alternatives: 2. Provide a VM with root access to SUSE QE Tools members so that we can try again with virtual media share using FTP/SMB; 3. Connect a USB thumb drive to a machine physically, install something there and use that host same as suggested in "2." for a VM
Today evening I will send an announcement which is either a reminder about the migration tomorrow or a notice about the shifted migration.
Updated by okurz over 1 year ago
- Subject changed from Support migration of o3 VM to PRG2 - 2023-07-19 size:M to Migration of o3 VM to PRG2 - 2023-07-19 size:M
- Status changed from Feedback to In Progress
I sent another reminder
All our pre-moving checks have succeeded so we will execute the move of https://openqa.opensuse.org tomorrow as planned.
as response to all the above announcements.
Tomorrow we can go ahead. I have my day reserved for this. I would appreciate if more of you can support this work and at least stay on standby in case we need your help. I suggest we stay in a breakout-room in https://meet.jit.si/suse_qa_tools where possible. At best we have 1 person coordinating, 1 person acting on a main task, 1 person communicating to the outside and optional for side-tasks.
What we should do, roughly:
- Join Jitsi and one thread in team-qa-tools and one thread in dct-migration
- Wait for go-no-go meeting at 0700Z
- Wait for mcaj to give the go from Eng-Infra side, then switch off the openQA scheduler on o3 and disable the authentication. I guess we can try to "break" the code by disabling any authenticated actions.
- Also switch off other services like gru, scripts, investigation, etc. (a sketch follows after this list)
- Prepare old workers to connect over https as soon as o3 comes up again in prg2
- Install more new machines in prg2 while waiting for the VM to come online
- As soon as VM is ready in new place ensure that the webUI is good in read-only mode first
- Update IP addresses on ariel where necessary in /etc/hosts, also crosscheck /etc/dnsmasq.d/openqa.conf
- Ask Eng-Infra, mcaj, to switch off the DHCP/DNS/PXE server in the oqa dmz network
- Try to reboot a worker from the PXE on o3
- Enable workers to connect to o3 directly, not external https, and use testpoolserver with rsync instead
- Enable production worker classes on new workers after tests look good
- Connect old workers from NUE1 over https, in particular everything non-qemu-x86_64 for the time being, e.g. aarch64, ppc64le, s390x, bare-metal until we have such things directly from prg2
- Test and monitor a lot of o3 tests
- As soon as everything looks really stable announce it to users in response to all the above announcements
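For the two "switch off" steps above, a minimal sketch using the stock openQA unit names; the o3-specific "scripts, investigation" jobs are timers/crons not named in this ticket:
# hedged sketch: stop scheduling and background processing on the webUI host
systemctl stop openqa-scheduler openqa-gru openqa-websockets openqa-livehandler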
Updated by openqa_review over 1 year ago
- Due date set to 2023-08-02
Setting due date based on mean cycle time of SUSE QE Tools
Updated by favogt over 1 year ago
Prepare old workers to connect over https as soon as o3 comes up again in prg2
I assume you mean "public https" here?
Ask Eng-Infra, mcaj, to switch off the DHCP/DNS/PXE server in the oqa dmz network
Isn't that ariel itself?
Updated by nicksinger over 1 year ago
favogt wrote:
Prepare old workers to connect over https as soon as o3 comes up again in prg2
I assume you mean "public https" here?
yes
Ask Eng-Infra, mcaj, to switch off the DHCP/DNS/PXE server in the oqa dmz network
Isn't that ariel itself?
To break the chicken-egg problem, martin set up a temporary host which enables us to boot/install our new workers. Once ariel is in the new location it will take over this task.
Updated by okurz over 1 year ago
- Description updated (diff)
we ensured that our backup on backup.qa.suse.de is sufficiently up-to-date. nicksinger put a notice on the index page of https://openqa.opensuse.org informing about the pending maintenance work. We conducted the go-no-go meeting with others and decided to go ahead. I updated the description of this ticket with the points to do. mcaj will create a test-VM with a public facing webpage that we can check before doing stuff with ariel.
Updated by nicksinger over 1 year ago
We made another database dump just to be sure about consistency after the sync to Prague. It is located on the filesystem of ariel in /var/lib/openqa/SQL-DUMPS/2023-07-19.dump
Updated by okurz over 1 year ago
We are just out of a call with Eng-Infra and SUSE-IT project management. Status update regarding o3: The sync of storage devices in the background takes longer than expected so we will need to keep the system in the current read-only state. Switch-over to the next instance is expected to happen tomorrow during the day.
I also synced over the latest state of database dumps to my machine with rsync -aHP backup.qa.suse.de:/home/rsnapshot/alpha.0/openqa.opensuse.org/var/lib/openqa/SQL-DUMPS/ /home/okurz/local/tmp/o3-sql-dumps/.
mcaj will bring up the VM with just the root partition so that we can check the system coming up in general and update the network config and reachability from different locations. vdb is already synced over containing the database, vdc+vdd are in the process of being synced. As soon as all storage devices are synced over we will start the VM again with all those connected in a consistent state.
Updated by okurz over 1 year ago
temporary login with ssh root@10.150.2.10 for at least okurz/nsinger/mcaj/mgriessmeier. I disabled temporary log and tmp nginx paths in /etc/nginx/vhosts.d/openqa.conf, added to rollback steps
http://195.135.223.231/ is now accessible. However https://195.135.223.231/ does not work. Maybe that is something that was overlooked as originally we planned to use a proxy for that and now we wanted to serve 443/https directly, right? Written in https://suse.slack.com/archives/C04MDKHQE20/p1689850829492159?thread_ts=1689843691.492509&cid=C04MDKHQE20
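For the record, a quick way to tell "nothing listening on 443" apart from a broken TLS/vhost config:
# from outside: a TCP connect failure means nothing listens or a firewall drops,
# while a TLS handshake error points at the vhost/certificate setup
curl -vk --connect-timeout 5 https://195.135.223.231/
# on the VM itself: check whether any process is bound to port 443
ss -tlnp | grep ':443'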
Updated by okurz over 1 year ago
Syncing of data storage is still in progress and slower than expected. We are looking into alternatives.
My suggestion in https://suse.slack.com/archives/C04MDKHQE20/p1689922035366939?thread_ts=1689891521.574939&cid=C04MDKHQE20
Let's try an alternative: After the 5TB is copied – which is the more important one because it contains millions of test results – let's create a new empty >=6.6TB disk for new-ariel, we start with an empty one, you still copy over the snapshot of the old 6.6TB, mount it into new-ariel as soon as it's there and we copy over stuff on-the-go or we could even rsync parts of it as part is a "cache" that can be recreated in the new space
We will be working on a solution today. Expect openqa.opensuse.org to stay in read-only for a couple more hours.
Updated by okurz over 1 year ago
- Description updated (diff)
The DNS entry is in the process of being changed to the new location
(Martin Caj) DNS change is in the process:
pdnsutil edit-zone opensuse.org
Checked 404 records of 'opensuse.org', 0 errors, 0 warnings.
Detected the following changes:
-opensuse.org 1800 IN SOA ns1.opensuse.org admin.opensuse.org 2023071301 7200 7200 604800 3600
+opensuse.org 1800 IN SOA ns1.opensuse.org admin.opensuse.org 2023072100 7200 7200 604800 3600
+openqa.opensuse.org 1800 IN A 195.135.223.231
-openqa.opensuse.org 1800 IN CNAME proxy-nue.opensuse.org
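Once the serial bump is live the switch can be watched with a plain lookup:
# should print 195.135.223.231 instead of the proxy-nue CNAME after propagation
dig +short openqa.opensuse.org A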
https://openqa.opensuse.org/ says "This is https://openqa.opensuse.org in the old location".
The new VM already says "This is new ariel!"
And because o3 is now serving https directly we set httpsonly = 1 in the [openid] section of /etc/openqa/openqa.ini. Authentication works. The system is in read-write mode now.
Right now lsblk looks like this:
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
loop0 7:0 0 5G 0 loop /home
zram0 252:0 0 1.4G 0 disk [SWAP]
…
zram9 252:9 0 1.4G 0 disk [SWAP]
vda 254:0 0 20G 0 disk
└─vda1 254:1 0 20G 0 part /
vdb 254:16 0 5T 0 disk
└─vdb1 254:17 0 5T 0 part /srv/tftpboot/ipxe/images
/var/lib/snapshot-changes
/var/lib/openqa
/space
vdc 254:32 0 100G 0 disk /var/lib/pgsql
vdd 254:48 0 6.6T 0 disk
├─vg0--new-archive 253:0 0 1T 0 lvm /space/openqa/archive
│ /var/lib/openqa/archive
│ /archive
└─vg0--new-assets 253:1 0 5.6T 0 lvm /space/openqa/share
/var/lib/openqa/share
/assets
vde 254:64 0 6.6T 1 disk
├─vg0-assets 253:2 0 5.6T 0 lvm /mnt/old-vg0-assets
└─vg0-archive 253:3 0 1024G 0 lvm /mnt/old-vg0-archive
vda (/), vdb (/space, /var/lib/openqa, …), vdc (/var/lib/pgsql) are all in sync and fully usable. vdd is a new storage device where we recreate the structure from vde. vde is a read-only mount of a read-only snapshot from "assets+archive".
To copy over the content, this is what we did:
mkdir -p /mnt/old-vg0-{assets,archive}
vgchange -a y /dev/vg0
mount -o ro,norecovery /dev/vg0/assets /mnt/old-vg0-assets
mount -o ro,norecovery /dev/vg0/archive /mnt/old-vg0-archive
wipefs --all /dev/vdd
vgcreate vg0-new /dev/vdd
lvcreate -L 1024g -n archive /dev/vg0-new
lvcreate -l100%FREE -n assets /dev/vg0-new
for i in archive assets ; do mkfs.xfs /dev/vg0-new/$i; done
rsync -aHP /mnt/old-vg0-archive/ /var/lib/openqa/archive/ && rsync -aHP /mnt/old-vg0-assets/ /var/lib/openqa/share/
In /etc/hosts mcaj and I adjusted all IPv4 addresses for o3 workers to have corresponding addresses in the new IP range. I talked with mcaj about the 192.168.7.0/24 range which we need for OBS access. We likely need to find a temporary solution to reach hosts while OBS is not yet migrated to PRG2:
cd /var/lib/openqa/ssh-tunnel-nue1
chown geekotest.nogroup /var/lib/openqa/ssh-tunnel-nue1/
sudo -u geekotest ssh-keygen -t ed25519 -f /var/lib/openqa/ssh-tunnel-nue1/id_ed25519 -N ''
sudo -u geekotest ssh -i /var/lib/openqa/ssh-tunnel-nue1/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh-tunnel-nue1/known_hosts" -p 2213 geekotest@gate.opensuse.org "ip a"
cat >> .ssh/config <<EOF
Host old-ariel
HostName gate.opensuse.org
User geekotest
Port 2213
IdentityFile /var/lib/openqa/ssh-tunnel-nue1/id_ed25519
UserKnownHostsFile /var/lib/openqa/ssh-tunnel-nue1/known_hosts
LocalForward 62450 localhost:62450
EOF
sudo -u geekotest ssh -NT old-ariel
to set up the tunnel. Then on old-ariel, following https://www.wireguard.com/quickstart/, create a key pair and config:
zypper -n in wireguard-tools
cat > /etc/wireguard/ariel-link.conf <<EOF
[Interface]
PrivateKey = XXX
ListenPort = 62451
[Peer]
PublicKey = XXX
AllowedIPs = 192.168.7.0/24, 192.168.254.0/24, 172.16.0.0/24
EOF
wg setconf ariel-link ariel-link.conf
ip l s dev ariel-link up
…
and accordingly on new-ariel.
and socat on new-ariel for UDP<->TCP:
socat udp4-listen:62451,reuseaddr,fork tcp:localhost:62450
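The TCP end of that forward presumably needs a matching converter on old-ariel, which is not shown here; a minimal sketch (note that a TCP stream has no datagram framing, which may contribute to the occasional lost pings mentioned below):
# hedged counterpart on old-ariel: turn the TCP stream arriving on the forwarded
# port 62450 back into UDP datagrams for the local wireguard ListenPort 62451
socat tcp4-listen:62450,reuseaddr,fork udp4:localhost:62451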
on old-ariel:
sysctl net.ipv4.conf.ariel-link.forwarding=1
iptables -t nat -A POSTROUTING -o ariel-link -j MASQUERADE
iptables -t nat -A POSTROUTING -o eth2 -j MASQUERADE
iptables -t nat -A POSTROUTING -o eth2 -j SNAT --to-source 192.168.7.15
iptables -I FORWARD 1 -i ariel-link -o eth2 -j ACCEPT
sysctl net.ipv4.ip_forward=1
sysctl net.ipv4.conf.all.forwarding=1
on new-ariel:
ip r a 192.168.7.0/24 via 172.16.0.1 dev ariel-link
ip r a 192.168.254.0/24 via 172.16.0.1 dev ariel-link
for i in $(sed -n 's/^\(192.168[^ ]*\).*$/\1/p' /etc/hosts); do ping -c1 $i; done
The tunnel is working. While this is being used to sync over new assets from obspublish we are connecting the physical machine workers over the internal network. To specify in more detail: reaching 192.168.7.0/24 works but not yet 192.168.254.0/24 or 192.168.47.0/24 because we need to route those over different networks:
# ip r
default via 192.168.254.4 dev eth0
172.16.0.0/24 dev ariel-link proto kernel scope link src 172.16.0.1
192.168.0.0/24 dev eth1 proto kernel scope link src 192.168.0.100
192.168.7.0/24 dev eth2 proto kernel scope link src 192.168.7.15
192.168.47.0/24 dev eth3 proto kernel scope link src 192.168.47.13
192.168.112.0/24 dev eth1 proto kernel scope link src 192.168.112.100
192.168.252.0/24 via 192.168.47.252 dev eth3 src 192.168.47.13
192.168.253.0/24 via 192.168.47.252 dev eth3 src 192.168.47.13
192.168.254.0/24 dev eth0 proto kernel scope link src 192.168.254.15
So I added all three ranges 192.168.7.0/24, 192.168.47.0/24, 192.168.254.0/24 to the wireguard config on old-ariel and new-ariel. Then on old-ariel: iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
and ping from new-ariel to 192.168.254.44 works \o/ However it seems that the complete connection is not always 100% reliable. Sometimes pings get lost. Now for 192.168.47.0/24: iptables -t nat -A POSTROUTING -o eth3 -j MASQUERADE. Oh, and of course we need the route on new-ariel: ip r a 192.168.47.0/24 via 172.16.0.1 dev ariel-link. Now ping also works \o/.
Updated by okurz over 1 year ago
Working with mkittler to connect workers. Because the outside IP is not directly reachable over https from internal (or something like that), we put the internal IP of o3 into worker22:/etc/hosts:
10.150.1.11 openqa.opensuse.org
We unmasked and started the openQA scheduler on o3 and triggered a test job:
openqa-clone-job --within-instance https://openqa.opensuse.org/tests/3440744 TEST+=okurz BUILD= _GROUP=0 WORKER_CLASS=openqaworker22
- openqa-Tumbleweed-dev-x86_64-Build:TW.21664-openqa_install+publish@64bit-2G -> https://openqa.opensuse.org/tests/3440877
those were ending up incomplete, also because tests weren't synced yet. So I did an explicit
rsync -aHP /mnt/old-vg0-assets/tests/ /var/lib/openqa/share/tests/
so eventually we had https://openqa.opensuse.org/tests/3440879 which looks better but fails in https://openqa.opensuse.org/tests/3440879#step/openqa_webui/7 trying to download an asset from http://openqa.opensuse.org with timeout exceeded. Maybe a local firewall?
Also openqa-scheduler is often "stuck". We attached strace and found it to be very busy reading "/var/lib/openqa/testresults".
So immediately over to multi-machine tests:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3439908 TEST+=worker22-poo132134 _GROUP=0 BUILD= WORKER_CLASS=openqaworker22
Cloning parents of microos-Tumbleweed-DVD-x86_64-Build20230718-remote_ssh_target@64bit
Cloning children of microos-Tumbleweed-DVD-x86_64-Build20230718-remote_ssh_controller@64bit
2 jobs have been created:
- microos-Tumbleweed-DVD-x86_64-Build20230718-remote_ssh_controller@64bit -> https://openqa.opensuse.org/tests/3440885
- microos-Tumbleweed-DVD-x86_64-Build20230718-remote_ssh_target@64bit -> https://openqa.opensuse.org/tests/3440884
I should keep in mind that we reach the system over ssh on 10.150.2.10. We can now also use ssh -4 root@ariel.dmz-prg2.suse.org
In the meantime our ssh tunneling connection failed. Still running that in screen on new-ariel but now changed to
sudo -u geekotest AUTOSSH_POLL=20 autossh -M 0 -o ServerAliveInterval=15 -NT old-ariel
Right now in https://openqa.opensuse.org/tests/3440889#live we observed the problem that the openQA jobs try to reach openqa.opensuse.org over the wrong internal address 10.150.2.10. That is also what host openqa.opensuse.org on w22+w24 says even though /etc/hosts has 10.150.1.11 which is correct.
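The host discrepancy has a simple explanation: host, like dig, queries DNS servers directly and never consults /etc/hosts, so it is the wrong tool for this check. getent resolves through the NSS stack like ordinary applications:
# honours nsswitch.conf and therefore /etc/hosts, unlike host/dig
getent hosts openqa.opensuse.org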
Status update about openqa.opensuse.org migration: https://openqa.opensuse.org is up, users can authenticate, jobs can be scheduled. A sync of assets is still in progress and expected to take a couple of hours more. New products can be scheduled and are already scheduled by openqa-trigger-from-obs. No jobs will be executed though as we do not have production workers connected. Currently the problem is that openQA qemu jobs internally fail to resolve the domain openqa.opensuse.org properly where needed, e.g. when trying to install packages over zypper. See one example https://openqa.opensuse.org/tests/3440879#step/openqa_webui/7 . Any hints would be appreciated to help with that. Due to the end of the work-week for most I expect that this will only be fixed next week. More details about the migration work in https://progress.opensuse.org/issues/132143
In the meantime ssh access with ssh -p 2214 gate.opensuse.org now also works so we can also consider using this for the NUE1-based workers. However I now connected openqaworker4 (NUE1) using TESTPOOLSERVER still pointing to old-ariel which also still runs fetchneedles every minute. https://openqa.opensuse.org/tests/3440750 is a successful verification from openqaworker4.
And now /dev/vdb crashed with "Input/output error". I unmounted all the mount points relying on vdb directly or indirectly; remounting yields "Structure needs cleaning". I am currently running xfs_metadump -g /dev/vdb1 - | zstd -o vdb1_xfs_metadump-$(date -Is).dump.zstd before continuing with repair attempts. I tried xfs_repair but that failed so I will likely need to use xfs_repair -L. In https://suse.slack.com/archives/C04MDKHQE20/p1689959465196969 Anthony Iliopoulos mentioned that they know about such a bug and asked me to keep that metadata dump for them to check before continuing repair attempts.
Currently at 8516160 inodes, taking very long.
Updated by okurz over 1 year ago
After 1h and only 20% progress I aborted and instead ran xfs_repair -L /dev/vdb1 which took about 3h but eventually I had a storage device back which looks fine. So I remounted everything and restarted services and then on openqaworker4 started all failed services and we are back in business with actually passed tests: https://openqa.opensuse.org/tests/3440811# . I also tried needle creation in https://openqa.opensuse.org/tests/3440880/modules/epiphany/steps/6/edit which was rather slow but completed. I retriggered the job. I also retriggered incompletes with failed_since=2023-07-20 openqa-advanced-retrigger-jobs. Applied the same changes on openqaworker20 as on w4 (sketched below), i.e.:
- add o3 entry in /etc/hosts
- add openqa.opensuse.org entry in /etc/openqa/client.conf
- add openqa.opensuse.org entry in /etc/openqa/workers.ini
- restart worker services
For now I will monitor before adding more workers.
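For reference, the per-worker changes listed above amount to roughly this sketch; API key/secret, the chosen IP and the instance numbers are placeholders:
# hedged sketch of the worker-side reconfiguration on a NUE1 host
echo '195.135.223.231 openqa.opensuse.org' >> /etc/hosts  # or the internal IP where applicable
cat >> /etc/openqa/client.conf <<'EOF'
[openqa.opensuse.org]
key = PLACEHOLDER_KEY
secret = PLACEHOLDER_SECRET
EOF
# point the worker at the webUI in /etc/openqa/workers.ini, [global] section:
# HOST = https://openqa.opensuse.org
systemctl restart openqa-worker@{1..10}  # adjust to the configured instances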
And then I also tried to trigger some product sync again in https://openqa.opensuse.org/admin/obs_rsync/openSUSE:Factory:ToTest, let's see the effect.
Updated by okurz over 1 year ago
Made autossh+socat persistent
cat > /etc/systemd/system/autossh-old-ariel.service <<EOF
[Unit]
Description=AutoSSH tunnel to connect to old-ariel, see https://progress.opensuse.org/issues/132143 for details
After=network.target
[Service]
User=geekotest
Environment="AUTOSSH_POLL=20"
ExecStart=/usr/bin/autossh -M 0 -o ServerAliveInterval=15 -NT old-ariel
# From my experience a process might hang with "connection refused". Trying to workaround with periodic restarts
RuntimeMaxSec=24h
Restart=always
[Install]
WantedBy=default.target
EOF
cat > /etc/systemd/system/socat-wireguard-to-ssh.service <<EOF
[Unit]
Description=socat forwarding the wireguard UDP socket to the TCP port forwarded to old-ariel by autossh-old-ariel.service, see https://progress.opensuse.org/issues/132143 for details
Wants=autossh-old-ariel.service
After=network.target
[Service]
ExecStart=/usr/bin/socat udp4-listen:62451,reuseaddr,fork tcp:localhost:62450
[Install]
WantedBy=default.target
EOF
systemctl daemon-reload && systemctl enable --now autossh-old-ariel socat-wireguard-to-ssh
Enabled NUE1 workers
- rebel
- power8-3 was powered off, switched on over http://qa-power8-3.qa.suse.de/. After it was pingable again I could log in and update the config
- worker6
- worker7
- worker19
TODO: aarch64, currently not reachable
Then I was trying to fix developer mode for NUE1 workers connected to PRG2 o3. I added 192.168.112.0/24 to wireguard on both and a route on new-ariel. But on old-ariel tcpdump shows IP 172.16.0.2 > qa-power8-3.openqanet.opensuse.org: ICMP echo request. Shouldn't that as well show a rewritten source?
Regarding dnsmasq, mcaj gave me what he had on the temporary PXE server:
# Customized dhcp options
option tz-posix-string code 100 = string;
option tz-name code 101 = string;
option domain-name-search-list code 119 = string;
option dhcp6.bootfile-url code 59 = string;
option boot-menu code 150 = string;
option searchlist code 119 = string;
option option-176 code 176 = string;
option option-242 code 242 = string;
option arch code 93 = unsigned integer 16;
option ldap-server code 95 = text;
option proxy-auto code 252 = text;
# option definitions common to all supported networks...
default-lease-time 1296000;
max-lease-time 1296000;
one-lease-per-client True;
get-lease-hostnames True;
server-identifier oqa-pxe.oqa.dmz-prg2.suse.org;
server-name "oqa-pxe.oqa.dmz-prg2.suse.org";
subnet 10.150.1.0 netmask 255.255.255.0 {
option broadcast-address 10.150.1.255;
option domain-name-servers 8.8.8.8,4.4.4.4;
option ntp-servers 0.opensuse.pool.ntp.org,1.opensuse.pool.ntp.org,2.opensuse.pool.ntp.org;
option lpr-servers cups.prg2.suse.org;
option irc-server irc.prg2.suse.org;
option smtp-server relay.prg2.suse.org;
option domain-name "oqa.dmz-prg2.suse.org";
option domain-search "oqa.dmz-prg2.suse.org";
option nis-domain "dmz-prg2.suse.org";
option nis-servers wotan.prg2.suse.org,amor.prg2.suse.org;
# filename "/EFI/BOOT/bootx64.efi";
filename "http://10.150.1.10/EFI/BOOT/bootx64.efi";
next-server 10.150.1.10;
default-lease-time 64800;
max-lease-time 64800;
option routers 10.150.1.254;
option vendor-class-identifier "HTTPClient";
pool {
range 10.150.1.100 10.150.1.200;
}
}
and the apache config for a simple setup with the Leap 15.5 NET medium:
I had the default apache there with a simple setting like this:
apache2/vhosts.d/oqa-pxe.conf
<VirtualHost *:80>
ServerName oqa-pxe-oqa.dmz-prg2.suse.org
ServerAlias
ServerAdmin webmaster@oqa-pxe-oqa.dmz-prg2.suse.org
AllowEncodedSlashes Off
LogLevel warn
ErrorLog /var/log/apache2/oqa-pxe-error.log
LogFormat "%h %l %u %t \"%r\" %>s"
CustomLog /var/log/apache2/oqa-pxe-access.log "%h %l %u %t \"%r\" %>s"
DocumentRoot /srv/tftpboot
<Directory "/srv/tftpboot">
Options FollowSymLinks
Require all granted
AllowOverride None
</Directory>
</VirtualHost>
and yes, in /srv/tftpboot/ I copied the NET iso, and the last part was that I had to adjust grub.cfg in /srv/tftpboot/EFI/BOOT/grub.cfg a bit:
lang=en_US
menuentry 'Installation' --class opensuse --class gnu-linux --class gnu --class os {
set gfxpayload=keep
echo 'Loading kernel ...'
linux /boot/x86_64/loader/linux
echo 'Loading initial ramdisk ...'
initrd /boot/x86_64/loader/initrd
}
menuentry 'Upgrade' --class opensuse --class gnu-linux --class gnu --class os {
set gfxpayload=keep
echo 'Loading kernel ...'
linux /boot/x86_64/loader/linux upgrade=1
echo 'Loading initial ramdisk ...'
initrd /boot/x86_64/loader/initrd
}
submenu 'More ...' {
menuentry 'Rescue System' --class opensuse --class gnu-linux --class gnu {
set gfxpayload=keep
echo 'Loading kernel ...'
linux /boot/x86_64/loader/linux rescue=1
echo 'Loading initial ramdisk ...'
initrd /boot/x86_64/loader/initrd
}
}
Updated by okurz over 1 year ago
- Description updated (diff)
To fix failed obs_rsync jobs as seen on https://openqa.opensuse.org/minion/jobs?state=failed, on new-ariel:
cat > /etc/systemd/system/openqa-gru.service.d/override.conf <<EOF
[Service]
Environment="OPENQA_LOGFILE=/var/log/openqa_gru"
Environment='RSYNC_CONNECT_PROG=ssh -i /var/lib/openqa/ssh-tunnel-nue1/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh-tunnel-nue1/known_hosts" -p 2213 geekotest@gate.opensuse.org nc %%H 873'
WorkingDirectory=/var/lib/openqa
EOF
cat > /etc/systemd/system/openqa-webui.service.d/override.conf <<EOF
[Service]
Environment='RSYNC_CONNECT_PROG=ssh -i /var/lib/openqa/ssh-tunnel-nue1/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh-tunnel-nue1/known_hosts" -p 2213 geekotest@gate.opensuse.org nc %%H 873'
EOF
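A hypothetical smoke test for the proxy command itself before relying on the overrides, with obspublish being the rsync host mentioned earlier in this ticket:
# list the rsync modules through the ssh+nc hop; %H is expanded by rsync itself
sudo -u geekotest \
  RSYNC_CONNECT_PROG='ssh -i /var/lib/openqa/ssh-tunnel-nue1/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh-tunnel-nue1/known_hosts" -p 2213 geekotest@gate.opensuse.org nc %H 873' \
  rsync rsync://obspublish/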
and in /etc/nginx/vhosts.d/openqa.conf
server {
listen 80 default_server;
listen [::1]:80 default_server;
server_name openqa.opensuse.org openqa.infra.opensuse.org;
#include vhosts.d/openqa-locations.inc;
…
location / {
return 301 https://$host$request_uri;
}
}
server {
listen 80;
listen [::1]:80;
server_name localhost;
include vhosts.d/openqa-locations.inc;
}
and then
aa-complain /etc/apparmor.d/usr.share.openqa.script.openqa
systemctl daemon-reload && systemctl restart openqa-webui openqa-gru
and retriggered failed minion jobs.
We found https://openqa.opensuse.org/tests/3446020#step/bootloader_s390/9 showing "RETR Tumbleweed-oss-s390x-Snapshot20230719/suse.ins 550 Failed to open file." as s390x jobs try to reach the ftp server on 192.168.112.100 which is old-ariel. So what we did so far is
setcap 'cap_net_bind_service=+ep' $(which sshd)
systemctl restart sshd
and on new-ariel in /var/lib/openqa/.ssh/config
Host old-ariel
HostName gate.opensuse.org
User geekotest
Port 2213
IdentityFile /var/lib/openqa/ssh-tunnel-nue1/id_ed25519
UserKnownHostsFile /var/lib/openqa/ssh-tunnel-nue1/known_hosts
LocalForward 62450 localhost:62450
RemoteForward 192.168.112.100:21 localhost:21
and systemctl restart autossh-old-ariel
but "Warning: remote port forwarding failed for listen port 21" (possibly because binding a non-loopback address for a remote forward additionally requires GatewayPorts=clientspecified in the remote sshd_config). So that isn't solved yet. However we found that the machine for s390x in https://openqa.opensuse.org/admin/machines has hardcoded "REPO_HOST=192.168.112.100;S390_NETWORK_PARAMS=OSAMedium=eth OSAInterface=qdio InstNetDev=osa HostIP=192.168.112.@S390_HOST@/24 Hostname=s390linux@S390_HOST@ Gateway=192.168.112.254 Nameserver=192.168.112.100 Domain=opensuse.org PortNo=0 Layer2=1 ReadChannel=0.0.0800 WriteChannel=0.0.0801 DataChannel=0.0.0802 OSAHWAddr=" so maybe we can also change that after we ensure that new-ariel can reach NUE1 workers as well as that NUE1 workers can reach new-ariel.
ip r a 10.150.1.11/32 via 192.168.112.110 dev eth0
Updated by nicksinger over 1 year ago
# Generated by iptables-save v1.8.7 on Mon Jul 24 12:21:55 2023
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [1774195:3712632607]
:forward_dmz - [0:0]
:forward_ext - [0:0]
:forward_int - [0:0]
:input_dmz - [0:0]
:input_ext - [0:0]
:input_int - [0:0]
:reject_func - [0:0]
-A INPUT -i lo -j ACCEPT
-A INPUT -i ariel-link -j ACCEPT
-A INPUT -m conntrack --ctstate ESTABLISHED -j ACCEPT
-A INPUT -p icmp -m conntrack --ctstate RELATED -j ACCEPT
-A INPUT -i eth1 -j input_int
-A INPUT -i eth0 -j input_dmz
-A INPUT -j input_ext
-A INPUT -m limit --limit 3/min -j LOG --log-prefix "SFW2-IN-ILL-TARGET " --log-tcp-options --log-ip-options
-A INPUT -j DROP
-A FORWARD -m limit --limit 3/min -j LOG --log-prefix "SFW2-FWD-ILL-ROUTING " --log-tcp-options --log-ip-options
-A OUTPUT -o lo -j ACCEPT
-A input_dmz -m pkttype --pkt-type broadcast -j DROP
-A input_dmz -p icmp -m icmp --icmp-type 4 -j ACCEPT
-A input_dmz -p icmp -m icmp --icmp-type 8 -j ACCEPT
-A input_dmz -p tcp -m tcp --dport 22 -j ACCEPT
-A input_dmz -p tcp -m tcp --dport 80 -j ACCEPT
-A input_dmz -p tcp -m tcp --dport 443 -j ACCEPT
-A input_dmz -p tcp -m tcp --dport 5666 -j ACCEPT
-A input_dmz -p tcp -m tcp --dport 6556 -j ACCEPT
-A input_dmz -m comment --comment "sfw2.insert.pos" -m pkttype ! --pkt-type unicast -j DROP
-A input_dmz -p tcp -m limit --limit 3/min -m tcp --tcp-flags FIN,SYN,RST,ACK SYN -j LOG --log-prefix "SFW2-INdmz-DROP-DEFLT " --log-tcp-options --log-ip-options
-A input_dmz -p icmp -m limit --limit 3/min -j LOG --log-prefix "SFW2-INdmz-DROP-DEFLT " --log-tcp-options --log-ip-options
-A input_dmz -p udp -m limit --limit 3/min -m conntrack --ctstate NEW -j LOG --log-prefix "SFW2-INdmz-DROP-DEFLT " --log-tcp-options --log-ip-options
-A input_dmz -j DROP
-A input_ext -m pkttype --pkt-type broadcast -j DROP
-A input_ext -p icmp -m icmp --icmp-type 4 -j ACCEPT
-A input_ext -p icmp -m icmp --icmp-type 8 -j ACCEPT
-A input_ext -p tcp -m tcp --dport 113 -m conntrack --ctstate NEW -j reject_func
-A input_ext -p tcp -m tcp --dport 20048 -j ACCEPT
-A input_ext -p udp -m udp --dport 20048 -j ACCEPT
-A input_ext -s 192.168.254.4/32 -p tcp -m tcp --dport 22 -j ACCEPT
-A input_ext -s 192.168.254.44/32 -p tcp -m tcp --dport 22 -j ACCEPT
-A input_ext -s 192.168.254.46/32 -p tcp -m tcp --dport 22 -j ACCEPT
-A input_ext -s 192.168.254.4/32 -p tcp -m tcp --dport 80 -j ACCEPT
-A input_ext -s 192.168.254.4/32 -p tcp -m tcp --dport 5666 -j ACCEPT
-A input_ext -s 192.168.254.26/32 -p tcp -m tcp --dport 80 -j ACCEPT
-A input_ext -s 192.168.254.46/32 -p tcp -m tcp --dport 80 -j ACCEPT
-A input_ext -m comment --comment "sfw2.insert.pos" -m pkttype ! --pkt-type unicast -j DROP
-A input_ext -p tcp -m limit --limit 3/min -m tcp --tcp-flags FIN,SYN,RST,ACK SYN -j LOG --log-prefix "SFW2-INext-DROP-DEFLT " --log-tcp-options --log-ip-options
-A input_ext -p icmp -m limit --limit 3/min -j LOG --log-prefix "SFW2-INext-DROP-DEFLT " --log-tcp-options --log-ip-options
-A input_ext -p udp -m limit --limit 3/min -m conntrack --ctstate NEW -j LOG --log-prefix "SFW2-INext-DROP-DEFLT " --log-tcp-options --log-ip-options
-A input_ext -j DROP
-A input_int -j ACCEPT
-A reject_func -p tcp -j REJECT --reject-with tcp-reset
-A reject_func -p udp -j REJECT --reject-with icmp-port-unreachable
-A reject_func -j REJECT --reject-with icmp-proto-unreachable
COMMIT
# Completed on Mon Jul 24 12:21:55 2023
Updated by nicksinger over 1 year ago
To make ftp from new-ariel available on old workers we stopped atftpd and added an iptables redirect from old-ariel:21 -> ariel-link -> new-ariel:21 with the commands:
# iptables -t nat -A PREROUTING -p tcp --dport 21 -j DNAT --to-destination 10.150.1.11:21
# iptables -t nat -A POSTROUTING -p tcp -d 10.150.1.11 --dport 21 -j SNAT --to-source 192.168.112.100
Updated by nicksinger over 1 year ago
Added a few more rules to enable "extended passive mode" which is required by s390 hosts:
# iptables -t nat -A PREROUTING -p tcp --dport 30000:30100 -j DNAT --to-destination 10.150.1.11:30000-30100
# iptables -t nat -A POSTROUTING -p tcp -d 10.150.1.11 --dport 30000:30100 -j SNAT --to-source 192.168.112.100
Updated by nicksinger over 1 year ago
The test was able to continue further but failed with "Illegal EPRT command".
We found that port_promiscuous=YES in vsftpd.conf (on new-ariel) disables some security checks which might interfere with NATing. Changed and trying again.
Updated by favogt over 1 year ago
Next issue: For some reason the synced TW assets are not tracked properly and not listed on https://openqa.opensuse.org/admin/assets. The DVD iso got deleted and most tests failed due to a 404 during asset download. I forced a resync and it's there again, but still not visible on the assets page, so I suspect it'll be deleted soon.
Updated by okurz over 1 year ago
- Related to action #133232: o3 hook scripts are triggered but no comment shows up on job added
Updated by favogt over 1 year ago
Some progress for s390 FTP downloads.
>>>RETR Tumbleweed-oss-s390x-Snapshot20230719/boot/s390x/parmfile
500 OOPS: vsf_sysutil_bind
I used strace on vsftpd and found that the 500 OOPS: vsf_sysutil_bind was caused by a call to bind(::ffff:10.150.1.11, port 20) returning EACCES. Adding connect_from_port_20=NO to /etc/vsftpd.conf helped. Now it chooses a random port as source.
With that set, it got stuck sending initrd data:
[pid 32000] socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP <unfinished ...>
[pid 31998] read(3, <unfinished ...>
[pid 32000] <... socket resumed>) = 5
[pid 32000] setsockopt(5, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid 32000] bind(5, {sa_family=AF_INET6, sin6_port=htons(0), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:10.150.1.11", &sin6_addr), sin6_scope_id=0}, 28) = 0
[pid 32000] fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32000] fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 32000] connect(5, {sa_family=AF_INET6, sin6_port=htons(11580), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:192.168.112.9", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
The host is not reachable over IPv6 that way, so I flipped vsftpd to listen on IPv4 only instead of IPv6 only.
With that, sending data appears to work, but it's too slow:
06:43:44.966985 IP 172.16.0.1.11608 > ariel.suse-dmz.opensuse.org.ftp: Flags [P.], seq 123:153, ack 298, win 32767, length 30: FTP: EPRT |1|192.168.112.9|11610|
06:43:44.967774 IP ariel.suse-dmz.opensuse.org.ftp > 172.16.0.1.11608: Flags [P.], seq 298:349, ack 153, win 507, length 51: FTP: 200 EPRT command successful. Consider using EPSV.
06:43:44.975142 IP 172.16.0.1.11608 > ariel.suse-dmz.opensuse.org.ftp: Flags [.], ack 349, win 32767, length 0
06:43:45.017710 IP 172.16.0.1.11608 > ariel.suse-dmz.opensuse.org.ftp: Flags [P.], seq 153:215, ack 349, win 32767, length 62: FTP: RETR Tumbleweed-oss-s390x-Snapshot20230719/boot/s390x/initrd
06:43:45.020081 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [S], seq 1618861255, win 64860, options [mss 1380,nop,nop,sackOK,nop,wscale 7], length 0
06:43:45.027382 IP zvm63b.11610 > ariel.suse-dmz.opensuse.org.45609: Flags [S.], seq 500541048, ack 1618861256, win 8192, options [mss 1460,wscale 1,nop], length 0
06:43:45.027412 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [.], ack 1, win 507, length 0
06:43:45.028115 IP ariel.suse-dmz.opensuse.org.ftp > 172.16.0.1.11608: Flags [P.], seq 349:468, ack 215, win 507, length 119: FTP: 150 Opening BINARY mode data connection for Tumbleweed-oss-s390x-Snapshot20230719/boot/s390x/initrd (68905600 bytes).
06:43:45.028477 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [P.], seq 1:2761, ack 1, win 507, length 2760
06:43:45.028496 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [P.], seq 2761:5521, ack 1, win 507, length 2760
06:43:45.028562 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [P.], seq 5521:8193, ack 1, win 507, length 2672
06:43:45.034775 IP 172.16.0.1.11608 > ariel.suse-dmz.opensuse.org.ftp: Flags [.], ack 468, win 32767, length 0
06:43:45.239207 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [.], seq 1:1381, ack 1, win 507, length 1380
06:43:45.347067 IP zvm63b.11610 > ariel.suse-dmz.opensuse.org.45609: Flags [.], ack 1381, win 28671, length 0
06:43:45.347089 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [.], seq 8193:9573, ack 1, win 507, length 1380
06:43:45.347101 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [P.], seq 9573:10953, ack 1, win 507, length 1380
06:43:45.775208 IP ariel.suse-dmz.opensuse.org.45609 > zvm63b.11610: Flags [.], seq 1381:2761, ack 1, win 507, length 1380
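Taken together, the vsftpd.conf changes from this and the previous comments amount to roughly the following settings on new-ariel; the passive port range is assumed to match the iptables rules above:
pasv_min_port=30000
pasv_max_port=30100
port_promiscuous=YES      # skip data-connection address checks, NAT sits in the path
connect_from_port_20=NO   # avoid bind() to privileged port 20 -> EACCES
listen=YES                # plain IPv4 socket ...
listen_ipv6=NO            # ... instead of IPv6-only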
Updated by okurz over 1 year ago
nicksinger is working with cbachmann on a direct UDP access for wireguard on new-ariel. @nicksinger please work closely with cbachmann to get that sorted.
In the meantime fvogt is looking into an alternative, setting up ssh with ssh -w …:
On old-ariel:
Add PermitTunnel yes to sshd_config and rcsshd reload.
sudo ip tuntap add mode tun user geekotest name tun5
sudo ip addr add 172.17.0.1/24 dev tun5
sudo ip link set tun5 up
On new-ariel:
sudo ip tuntap add mode tun user geekotest name tun5
sudo ip addr add 172.17.0.2/24 dev tun5
sudo ip link set tun5 up
sudo su geekotest -c 'ssh -w 5:5 -i /var/lib/openqa/ssh-tunnel-nue1/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh-tunnel-nue1/known_hosts" -p 2213 geekotest@gate.opensuse.org'
Updated by nicksinger over 1 year ago
Since we didn't manage to open a port for our wireguard tunnel we followed up on Fabian's idea to use the ssh tunnel mechanism. I adjusted routing and firewalls on both new and old ariel to make use of this tunnel. To validate I used:
for i in $(cat /etc/hosts | cut -d " " -f 1 | grep -v "^10.150" | grep -v "^#" | grep -v "^$" | xargs); do ping -c1 -W 1 -q $i > /dev/null || echo $i broken; done
new-ariel
routing
ip r a 172.16.0.0/24 dev tun5
ip r a 192.168.112.0/24 dev tun5
ip r a 192.168.7.0/24 dev tun5
ip r a 192.168.254.0/24 dev tun5
iptables rules
iptables -I INPUT 1 -i tun5 -j ACCEPT
old-ariel
routing
ip r a 10.150.1.11 via 172.17.0.1 dev tun5
iptables rules
iptables -I INPUT 4 -i tun5 -j ACCEPT
# Compressed down without -o argument to just allow forwarding on *all* interfaces
#-A FORWARD -i ariel-link -o eth1 -j ACCEPT
#-A FORWARD -i ariel-link -o eth3 -j ACCEPT
#-A FORWARD -i ariel-link -o eth2 -j ACCEPT
iptables -I FORWARD 4 -i tun5 -j ACCEPT
# change incoming source ip to own interface ip to allow remote hosts to respond to a known ip
iptables -A POSTROUTING -o eth1 -j SNAT --to-source 192.168.112.100
iptables -A POSTROUTING -o eth2 -j SNAT --to-source 192.168.7.15
iptables -A POSTROUTING -o eth3 -j SNAT --to-source 192.168.47.13
# forward old-ariel:21 to new-ariel:21 to allow old workers to access the new ftp
iptables -A PREROUTING -p tcp -m tcp --dport 21 -j DNAT --to-destination 10.150.1.11:21
iptables -A POSTROUTING -d 10.150.1.11/32 -p tcp -m tcp --dport 21 -j SNAT --to-source 192.168.112.100
# used by passive mode of ftp (e.g. required by s390x test) and defined by "grep ^pasv /etc/vsftpd.conf"
iptables -A PREROUTING -p tcp -m tcp --dport 30000:30100 -j DNAT --to-destination 10.150.1.11:30000-30100
iptables -A POSTROUTING -d 10.150.1.11/32 -p tcp -m tcp --dport 30000:30100 -j SNAT --to-source 192.168.112.100
# used by zabbix-agent to connect from zabbix server (in the old opensuse network) to new ariel
iptables -A PREROUTING -p tcp -m tcp --dport 10049 -j DNAT --to-destination 10.150.1.11:10049
iptables -A POSTROUTING -d 10.150.1.11/32 -p tcp -m tcp --dport 10049 -j SNAT --to-source 192.168.112.100
# nat all traffic on all links
iptables -A POSTROUTING -o eth0 -j MASQUERADE
iptables -A POSTROUTING -o eth2 -j MASQUERADE
iptables -A POSTROUTING -o eth3 -j MASQUERADE
iptables -A POSTROUTING -o tun5 -j MASQUERADE
Updated by favogt over 1 year ago
Apparently the approach to use OpenSSH-backed tun devices works well enough for s390x tests to pass now: https://openqa.opensuse.org/tests/3451083
I made the SSH tunnel configuration persistent now.
On new-ariel, I adjusted /var/lib/openqa/.ssh/config:
# For wireguard, meanwhile replaced by tun5 tunnel
#LocalForward 62450 localhost:62450
Tunnel yes
TunnelDevice 5:5
Added /etc/sysconfig/network/ifcfg-tun5:
STARTMODE=auto
BOOTPROTO=static
IPADDR=172.17.0.2
NETMASK=255.255.255.0
TUNNEL=tun
TUNNEL_SET_GROUP=nogroup
TUNNEL_SET_OWNER=geekotest
and /etc/sysconfig/network/ifroute-tun5:
192.168.7.0/24
192.168.47.0/24
192.168.112.0/24
192.168.254.0/24
After stopping the existing tunnel and interface, running ifup tun5 and restarting autossh-old-ariel.service brought it back up and I could ping and ssh to workers again.
On old-ariel, only ifcfg-tun5 had to be added:
STARTMODE=auto
BOOTPROTO=static
IPADDR=172.17.0.1
NETMASK=255.255.255.0
TUNNEL=tun
TUNNEL_SET_GROUP=nogroup
TUNNEL_SET_OWNER=geekotest
Updated by okurz over 1 year ago
The one thing missing was the iptables rule for tun5, so we created another systemd service file; we now have:
# cat > /etc/systemd/system/autossh-old-ariel.service <<EOF
[Unit]
Description=AutoSSH tunnel to connect to old-ariel, see https://progress.opensuse.org/issues/132143 for details
Requires=autossh-iptables.service
After=network.target
[Service]
User=geekotest
Environment="AUTOSSH_POLL=20"
ExecStart=/usr/bin/autossh -M 0 -o ServerAliveInterval=15 -NT old-ariel
# From my experience a process might hang with "connection refused". Trying to workaround with periodic restarts
RuntimeMaxSec=24h
Restart=always
[Install]
WantedBy=default.target
EOF
cat > /etc/systemd/system/autossh-iptables.service <<EOF
[Unit]
Description=AutoSSH tunnel iptables
After=network.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c "iptables-save | grep -q 'tun5 -j ACCEPT' || iptables -I INPUT 1 -i tun5 -j ACCEPT"
[Install]
WantedBy=default.target
EOF
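Presumably followed by a reload and restart along the lines of the earlier setup:
# assumed follow-up, mirroring the enable step further above
systemctl daemon-reload
systemctl enable --now autossh-iptables
systemctl restart autossh-old-ariel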
Updated by nicksinger over 1 year ago
- Description updated (diff)
I adjusted the following zabbix.conf entries on new-ariel:
Server=172.17.0.1
ListenPort=10049
ServerActive=zabbix-proxy-opensuse
Hostname=10.150.2.10
and added a port-forward from old-ariel:10049->new-ariel:10049 to make the zabbix server reach new-ariel (POST- and PREROUTING as described in https://progress.opensuse.org/issues/132143#note-42). https://zabbix.suse.de/zabbix.php?action=host.view now shows a new entry for "10.150.2.10" with checks being executed on new-ariel.
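A hedged way to verify the forward from the server side, assuming zabbix_get is available there and <old-ariel> stands for whichever of old-ariel's addresses the zabbix proxy reaches:
# should print 1 if the agent on new-ariel answers through the port-forward
zabbix_get -s <old-ariel> -p 10049 -k agent.ping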
Updated by okurz over 1 year ago
Now looking into /etc/hosts and /etc/dnsmasq.d/openqa.conf, rebooted openqaworker21,22,24 and checking multi-machine again:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/3450618 TEST+=worker24-poo132134 _GROUP=0 BUILD= WORKER_CLASS=openqaworker24
using firewalld-policy tests as suggested by fvogt.
3 jobs have been created:
- opensuse-Tumbleweed-DVD-x86_64-Build20230724-firewalld-policy-firewall@64bit -> https://openqa.opensuse.org/tests/3451166
- opensuse-Tumbleweed-DVD-x86_64-Build20230724-firewalld-policy-server@64bit -> https://openqa.opensuse.org/tests/3451164
- opensuse-Tumbleweed-DVD-x86_64-Build20230724-firewalld-policy-client@64bit -> https://openqa.opensuse.org/tests/3451165
but they failed trying to access the other machines. Likely we still have an inconsistency regarding the selection of zones.
Updated by okurz over 1 year ago
- Description updated (diff)
Together with mcaj we removed the old read-only snapshot storage device from the VM and then renamed the LVM volume group
vgrename vg0-new vg0
sed -i 's/vg0-new/vg0/' /etc/fstab
We also found that the iptables rules for the tunnel to old-ariel were not applied. We suspect a race condition with the firewall setup, so we added After=SuSEfirewall2.service in autossh-iptables.service. Another reboot brought the tunnel up fine so I guess we are good. Updated the description with the additional resolved tasks.
Updated by okurz over 1 year ago
- Copied to action #133358: Migration of o3 VM to PRG2 - Ensure IPv6 is fully working added
Updated by okurz over 1 year ago
- Copied to action #133364: Migration of o3 VM to PRG2 - Decommission old-ariel in NUE1 as soon as we do not need it anymore added
Updated by okurz over 1 year ago
- Description updated (diff)
- DONE Update https://progress.opensuse.org/projects/openqav3/wiki/ where necessary
- DONE Make sure we know what to keep an eye out for for the later planned OSD VM migration -> lessons learned meeting scheduled for 2023-08-04
- DONE As necessary also make sure that BuildOPS knows about caveats of migration as they plan to migrate OBS/IBS after us -> lessons learned meeting scheduled for 2023-08-04
Prepared https://gitlab.suse.de/qa-sle/backup-server-salt/-/merge_requests/11 for backup for "Ensure backup to backup.qa.suse.de works"
Updated by okurz over 1 year ago
- Description updated (diff)
- Due date deleted (2023-08-02)
- Status changed from In Progress to Resolved
Together with nicksinger I made the config on old-ariel persistent:
cat > /etc/sysconfig/network/ifroute-tun5 <<EOF
10.150.1.11/32
EOF
cat > /etc/systemd/system/autossh-iptables_to_new_ariel.service <<EOF
[Unit]
Description=AutoSSH tunnel iptables
After=network.target SuSEfirewall2.service
[Service]
Type=oneshot
ExecStart=/etc/sysconfig/network/scripts/iptables_connection_to_new_ariel
[Install]
WantedBy=default.target
EOF
cat > /etc/sysconfig/network/scripts/iptables_connection_to_new_ariel <<EOF
iptables -I INPUT 1 -i tun5 -j ACCEPT
# Compressed down without -o argument to just allow forwarding on *all* interfaces
#-A FORWARD -i ariel-link -o eth1 -j ACCEPT
#-A FORWARD -i ariel-link -o eth3 -j ACCEPT
#-A FORWARD -i ariel-link -o eth2 -j ACCEPT
iptables -I FORWARD 1 -i tun5 -j ACCEPT
# change incoming source ip to own interface ip to allow remote hosts to respond to a known ip
iptables -t nat -A POSTROUTING -o eth1 -j SNAT --to-source 192.168.112.100
iptables -t nat -A POSTROUTING -o eth2 -j SNAT --to-source 192.168.7.15
iptables -t nat -A POSTROUTING -o eth3 -j SNAT --to-source 192.168.47.13
# forward old-ariel:21 to new-ariel:21 to allow old workers to access the new ftp
iptables -t nat -A PREROUTING -p tcp -m tcp --dport 21 -j DNAT --to-destination 10.150.1.11:21
iptables -t nat -A POSTROUTING -d 10.150.1.11/32 -p tcp -m tcp --dport 21 -j SNAT --to-source 192.168.112.100
# used by passive mode of ftp (e.g. required by s390x test) and defined by "grep ^pasv /etc/vsftpd.conf"
iptables -t nat -A PREROUTING -p tcp -m tcp --dport 30000:30100 -j DNAT --to-destination 10.150.1.11:30000-30100
iptables -t nat -A POSTROUTING -d 10.150.1.11/32 -p tcp -m tcp --dport 30000:30100 -j SNAT --to-source 192.168.112.100
# used by zabbix-agent to connect from zabbix server (in the old opensuse network) to new ariel
iptables -t nat -A PREROUTING -p tcp -m tcp --dport 10049 -j DNAT --to-destination 10.150.1.11:10049
iptables -t nat -A POSTROUTING -d 10.150.1.11/32 -p tcp -m tcp --dport 10049 -j SNAT --to-source 192.168.112.100
# nat all traffic on all links
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
iptables -t nat -A POSTROUTING -o eth2 -j MASQUERADE
iptables -t nat -A POSTROUTING -o eth3 -j MASQUERADE
iptables -t nat -A POSTROUTING -o tun5 -j MASQUERADE
EOF
chmod +x /etc/sysconfig/network/scripts/iptables_connection_to_new_ariel
systemctl daemon-reload
systemctl enable --now autossh-iptables_to_new_ariel
and with that all tasks are done and ACs are covered. Ticket resolved \o/
Updated by okurz over 1 year ago
- Copied to action #133475: Migration of o3 VM to PRG2 - connection to rabbit.opensuse.org added
Updated by favogt over 1 year ago
factory-package-news-web.service did not survive the last reboot because systemd did not reload units after /var/lib/openqa was available, which is needed for the unit symlink to work.
For now I replaced the symlink with a copy.
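A sketch of that replacement, with the unit path being an assumption:
# resolve the symlink target and put a regular copy in its place so systemd
# can load the unit at boot before /var/lib/openqa is mounted
unit=/etc/systemd/system/factory-package-news-web.service
cp --remove-destination "$(readlink -f "$unit")" "$unit"
systemctl daemon-reload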
Updated by okurz over 1 year ago
favogt wrote:
factory-package-news-web.service did not survive the last reboot because systemd did not reload units after /var/lib/openqa was available, which is needed for the unit symlink to work.
For now I replaced the symlink with a copy.
I didn't quite get that. But regarding the setup of factory-package-news-web.service I wonder if we can write down more about that setup – and with "we" I mean "you" ;) Maybe more in https://github.com/openSUSE/opensuse-release-tools README?
Updated by favogt over 1 year ago
okurz wrote:
favogt wrote:
factory-package-news-web.service did not survive the last reboot because systemd did not reload units after /var/lib/openqa was available, which is needed for the unit symlink to work.
For now I replaced the symlink with a copy.
I didn't quite get that. But regarding the setup of factory-package-news-web.service I wonder if we can write down more about that setup – and with "we" I mean "you" ;) Maybe more in https://github.com/openSUSE/opensuse-release-tools README?
I'm not really familiar with the setup of that either. Something for @dimstar
Updated by favogt about 1 year ago
OBS was migrated to PRG2 in the meantime. IT added eth2 to reach the OBS backends like on old-ariel. I removed the 192.168.7.0/24 route from tun5 and the RSYNC_CONNECT_PROG environment from openqa-webui and openqa-gru, and set /etc/apparmor.d/usr.share.openqa.script.openqa to enforcing again.
Updated by jbaier_cz about 1 year ago
- Related to action #150956: o3 cannot send e-mails via smtp relay size:M added
Updated by okurz 7 months ago
- Related to action #159669: No new openQA data on metrics.opensuse.org since o3 migration to PRG2 size:S added