action #136133
closedQA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
QA - coordination #129280: [epic] Move from SUSE NUE1 (Maxtorhof) to new NBG Datacenters
Migrate aarch64.openqanet.opensuse.org to FC Basement size:M
Description
Motivation
Extracted from #134864
Acceptance criteria
- AC1: aarch64.openqanet.opensuse.org up and running in FC Basement again
- AC2: Inventory management systems, i.e. racktables and netbox, reflect the plan and current situation at all times
- AC3: On top of AC1 machine is successfully working on o3 jobs
Suggestions
- Read what had been done in #134864 (the machine is physically already in FC Basement, see https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=2034, it's connected to power, ipmi and system ethernet, ipmi should be reachable)
- DONE Wait for https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4043
- Ensure you can reach the machine over IPMI
- Update https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls as necessary
- Look up what we did for the "rsync over external ssh" forwarding on o3 workers and apply the same here so that the machine can get tests from o3 over the public network
- Trigger some test jobs and, if all is good, use production worker classes
- As necessary extend progress.opensuse.org/projects/openqav3/wiki/ for the machine which is likely not directly reachable by community members as it resides within SUSE internal network
- Ensure inventory management system(s), at least racktables, is up-to-date
- Ensure machines are usable again from new location, e.g. passing openQA tests accordingly
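The IPMI reachability check suggested above could look roughly like the following sketch. It assumes `ipmitool` is installed, that the service processor answers on the `lanplus` interface, and that the password is exported as `IPMI_PASSWORD` (which is what the `-E` flag reads). The function name and the `IPMITOOL` override are hypothetical and not taken from this ticket.

```shell
#!/bin/sh
# Hypothetical helper: query the chassis power state via IPMI-over-LAN.
# Assumption: the password is provided in the IPMI_PASSWORD environment variable.
# IPMITOOL can be overridden, e.g. IPMITOOL=echo for a dry run.
ipmi_power_status() {
    host=$1
    user=$2
    "${IPMITOOL:-ipmitool}" -I lanplus -H "$host" -U "$user" -E chassis power status
}
```

A call would then be `ipmi_power_status <sp-host> <user>`, with both arguments filled in from the local credentials store. A non-zero exit status suggests the SP is not reachable or the credentials are wrong.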
Rollback actions
- Re-enable https://gitlab.suse.de/openqa/monitor-o3/-/pipeline_schedules after aarch64.o3 is up again, see #135467
Updated by nicksinger 11 months ago
- Assignee deleted (nicksinger)
Putting back as I'm currently too involved with #136130
Updated by nicksinger 11 months ago
- Tags changed from infra, nue1, nue2, nue3 to infra, nue1, nue2, nue3, next-frankencampus-visit
SP cannot be reached at the moment. Resetting power to socket 9 on http://epdu-b4.qe.nue2.suse.org/ did not work. Adding next-frankencampus-visit to physically check connections and power cycle the machine if needed.
Updated by dheidler 11 months ago
- Related to action #138326: Extend openQA-in-openQA test to do a real multimachine run. added
Updated by dheidler 11 months ago
- Related to deleted (action #138326: Extend openQA-in-openQA test to do a real multimachine run.)
Updated by okurz 11 months ago
I physically checked the machine. After connecting a display I could see that it was stuck as reported before, trying and failing to boot, but it was at least powered on. Over the PDU local control panel I powered off outlet A9 and the correct machine went down. After powering on outlet A9 the management interface ethernet outlet showed activity again. The power supply stayed off for multiple minutes but eventually the green LED came back on and the display showed a message
W E L C O M E
CPU Type:
…
System will boot soon
Updated by okurz 11 months ago
nicksinger wrote in #note-7:
SP cannot be reached at the moment.
What address have you tried? workerconf.sls seems to reference an outdated entry that can not work.
The sp address should be aarch64-o3-sp.qe.nue2.suse.org but on walter1 I could not see any entries for
journalctl -u dhcpd --since "1h ago" | grep -i '\(00:18\|00:00\)'
I tried to enter the setup but the BIOS setup is password protected and needs a reboot after every three failed password attempts. I tried multiple usual passwords, all failed, argh
Maybe one can try from the grub menu "Enter firmware settings". I tried to add console=tty0 to the linux command line arguments but still nothing is shown on the display after "EFI stub: Exiting boot services..."
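The DHCP log check above matches the prefixes anywhere in a line, so it can also hit octets in the middle of an unrelated MAC address. A stricter filter might look like the sketch below. It assumes each prefix consists of exactly two octets (as in `00:18` and `00:00`) and that the log prints MAC addresses as six colon-separated hex octets. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: keep only log lines (read from stdin) that contain a
# MAC address starting with one of the given two-octet prefixes.
filter_mac_prefix() {
    # Join the prefixes into an alternation, e.g. "00:18|00:00".
    prefixes=$(printf '%s|' "$@")
    prefixes=${prefixes%|}
    # Prefix must start the MAC and be followed by exactly four more octets.
    grep -iE "(^|[^0-9a-f:])(${prefixes})(:[0-9a-f]{2}){4}([^0-9a-f:]|\$)"
}
```

Used on the DHCP server this would read `journalctl -u dhcpd --since "1h ago" | filter_mac_prefix 00:18 00:00`.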
Updated by okurz 11 months ago
- Status changed from In Progress to Workable
- Assignee changed from okurz to dheidler
Maybe one can try from the grub menu "Enter firmware settings". I tried to add console=tty0 to the linux command line arguments but still nothing shown on the display after "EFI stub: Exiting boot services..."
@dheidler back to you. I suggest trying the above with a local display and keyboard: experiment with the installed grub menu, or try booting over the network or from a local USB thumb drive.
Updated by dheidler 11 months ago
Fix aarch64 mac in DHCP: https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4296
Updated by dheidler 11 months ago
- Status changed from In Progress to Resolved
https://openqa.opensuse.org/tests/3680520
This is what I did after installing the OS:
zypper ar -p 95 -f 'http://download.opensuse.org/repositories/devel:openQA/$releasever' devel_openQA
zypper ar -p 90 -f 'http://download.opensuse.org/repositories/devel:openQA:Leap:$releasever/$releasever' devel_openQA_Leap
zypper --gpg-auto-import-keys ref
echo "requires:openQA-worker" > /etc/zypp/systemCheck.d/openqa.check
zypper -n in openQA-worker openQA-auto-update openQA-continuous-update os-autoinst-distri-opensuse-deps swtpm # openQA worker services plus dependencies for openSUSE distri or development repo if added previously
zypper -n in ffmpeg-4 # for using external video encoder as it is already configured on some machines like ow19, ow20 and power8
zypper -n in nfs-client # For /var/lib/openqa/share
zypper -n in bash-completion vim htop strace systemd-coredump iputils tcpdump bind-utils # for general tinkering
zypper -n in qemu-uefi-aarch32
# for use of nfs
# echo "openqa1-opensuse:/ /var/lib/openqa/share nfs4 noauto,nofail,retry=30,ro,x-systemd.automount,x-systemd.device-timeout=10m,x-systemd.mount-timeout=30m 0 0" >> /etc/fstab
# for use of cacheservice
# if connecting via ssh-rsync to o3
mkdir -p /var/lib/openqa/ssh/
chown _openqa-worker: /var/lib/openqa/ssh/
sudo -u _openqa-worker ssh-keygen -t ed25519 -N '' -f /var/lib/openqa/ssh/id_ed25519
# append /var/lib/openqa/ssh/id_ed25519.pub to ariel:/var/lib/openqa/.ssh/authorized_keys
sudo -u _openqa-worker ssh -i /var/lib/openqa/ssh/id_ed25519 -p 2214 -o "UserKnownHostsFile /var/lib/openqa/ssh/known_hosts" geekotest@gate.opensuse.org whoami
# note that ariel can be reached directly via ariel.dmz-prg2.suse.org from suse network
sed -i 's/\(solver.dupAllowVendorChange = \)false/\1true/' /etc/zypp/zypp.conf
instances=30 bash -x $(which os-autoinst-setup-multi-machine)
#[global]
#HOST = https://openqa.opensuse.org
#CACHEDIRECTORY = /var/lib/openqa/cache
##WORKER_CLASS = foo
#EXTERNAL_VIDEO_ENCODER_CMD=ffmpeg -y -hide_banner -nostats -r 24 -f image2pipe -vcodec ppm -i - -pix_fmt yuv420p -c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used 0
#[https://openqa.opensuse.org]
##TESTPOOLSERVER = -e 'ssh -p 2214 -i /var/lib/openqa/ssh/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh/known_hosts"' geekotest@gate.opensuse.org:/var/lib/openqa/tests
##TESTPOOLSERVER = -e 'ssh -i /var/lib/openqa/ssh/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh/known_hosts"' geekotest@ariel.dmz-prg2.suse.org:/var/lib/openqa/tests
#TESTPOOLSERVER = rsync://openqa.opensuse.org/tests
zypper -n in rsync sshpass
rsync -rP --rsh="sshpass -p opensuse ssh -o StrictHostKeyChecking=no" root@openqaworker24:/etc/openqa/ /etc/openqa/
rsync -rP --rsh="sshpass -p opensuse ssh -o StrictHostKeyChecking=no" root@openqaworker24:/etc/wicked/scripts/gre_tunnel_preup.sh /etc/wicked/scripts/
sed -i "s/openqaworker24/$HOSTNAME/g" /etc/openqa/workers.ini
# comment out the worker that we are on and enable all other entries
vi /etc/wicked/scripts/gre_tunnel_preup.sh
# use only hostname as worker class for testing
vi /etc/openqa/workers.ini
chown _openqa-worker /etc/openqa/client.conf
chmod 640 /etc/openqa/client.conf
# configure /etc/openqa/client.conf and /etc/openqa/workers.ini, then enable the desired number of worker slots, e.g.:
systemctl enable --now openqa-worker-auto-restart@{1..30}.service openqa-reload-worker-auto-restart@{1..30}.path openqa-auto-update.timer openqa-continuous-update.timer openqa-worker-cacheservice.service openqa-worker-cacheservice-minion.service
# on aarch64 do something about hugepages permissions (see https://progress.opensuse.org/issues/53234)
echo "hugetlbfs /dev/hugepages hugetlbfs defaults,gid=kvm,mode=1775,pagesize=1G" >> /etc/fstab
# add this to kernel cmdline via /etc/default/grub (& update-bootloader) on aarch64:
# showopts nospec spectre_v2=off pti=off kpti=off mitigations=off default_hugepagesz=1G hugepagesz=1G hugepages=64 crashkernel=167M enforcing=0
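After editing /etc/default/grub and rebooting, one way to confirm that the parameters listed above actually reached the running kernel is to compare them against /proc/cmdline. The following is a minimal sketch, not part of the original setup notes, and the helper name is hypothetical. Note that `hugepages=64` with `hugepagesz=1G` reserves 64 GiB of RAM for huge pages.

```shell
#!/bin/sh
# Hypothetical helper: print every expected parameter that is absent from a
# kernel command line. Usage: missing_kernel_params "<cmdline>" param...
missing_kernel_params() {
    cmdline=$1
    shift
    for param in "$@"; do
        case " $cmdline " in
            *" $param "*) ;;                 # present as a whole word
            *) printf '%s\n' "$param" ;;     # absent, report it
        esac
    done
}
```

For example, `missing_kernel_params "$(cat /proc/cmdline)" default_hugepagesz=1G hugepagesz=1G hugepages=64 mitigations=off` prints nothing when all four parameters are active.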
Updated by okurz 11 months ago
- Status changed from Resolved to Workable
Please ensure "AC3: On top of AC1 machine is successfully working on o3 jobs" is covered. https://openqa.opensuse.org/tests/3680520#step/patch_agama/6 looks like tests are executing too slow, maybe not optimized virtualization emulation. If in doubt please ask ggardet in irc://libera.chat/opensuse-factory
dheidler wrote in #note-18:
https://openqa.opensuse.org/tests/3680520
This is what I did after installing the OS:
[…]
Did you actually reinstall the system? If yes, why?
And I see that the system has worker classes "WORKER_CLASS = aarch64-o3,qemu_aarch64,qemu_aarch64_lse", didn't it have 32bit ones in before? See for example https://openqa.opensuse.org/admin/workers/158 , one instance from the same host before the move.
Updated by ggardet_arm 10 months ago
okurz wrote in #note-19:
And I see that the system has worker classes "WORKER_CLASS = aarch64-o3,qemu_aarch64,qemu_aarch64_lse", didn't it have 32bit ones in before? See for example https://openqa.opensuse.org/admin/workers/158 , one instance from the same host before the move.
That indeed looks wrong. If this is the D05 machine (as I understand it), it should have: d05,qemu_aarch64,qemu_aarch64_no_lse,qemu_aarch32,tap
Updated by dheidler 10 months ago
okurz wrote in #note-19:
Did you actually reinstall the system? If yes, why?
Because the unresponsive Num Lock LED indicated that the kernel was not booting up.
And I see that the system has worker classes "WORKER_CLASS = aarch64-o3,qemu_aarch64,qemu_aarch64_lse", didn't it have 32bit ones in before? See for example https://openqa.opensuse.org/admin/workers/158 , one instance from the same host before the move.
Added qemu-uefi-aarch32 and added worker class for 32bit and now 32bit jobs should run. Not adding tap due to networking constraints.
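To double-check which classes a worker host currently advertises, a small lookup against /etc/openqa/workers.ini may help. This is a sketch that assumes the usual `WORKER_CLASS = a,b,c` line format quoted in this ticket. The helper name is hypothetical, and the exact class list configured on the machine is not stated here.

```shell
#!/bin/sh
# Hypothetical helper: succeed if CLASS is listed in any active WORKER_CLASS
# line of the given workers.ini-style file (commented-out lines are ignored).
has_worker_class() {
    ini=$1
    class=$2
    grep -E '^[[:space:]]*WORKER_CLASS[[:space:]]*=' "$ini" |
        sed 's/^[^=]*=//' | tr ',' '\n' |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
        grep -qxF -- "$class"
}
```

For instance, `has_worker_class /etc/openqa/workers.ini qemu_aarch32 && echo "32bit class present"` reports whether the 32bit class is set, and the same call with `tap` should fail on this host given the networking constraints mentioned above.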
Updated by dheidler 10 months ago
- Status changed from Workable to Resolved
okurz wrote in #note-19:
Please ensure "AC3: On top of AC1 machine is successfully working on o3 jobs" is covered. https://openqa.opensuse.org/tests/3680520#step/patch_agama/6 looks like tests are executing too slow, maybe not optimized virtualization emulation. If in doubt please ask ggardet in irc://libera.chat/opensuse-factory
https://openqa.opensuse.org/tests/3677478#step/agama/3 already had the same error, and that machine executes the test only about twice as fast. Without KVM we would be much slower than just 50% of that speed.
This machine might have been overloaded as 30 worker slots were enabled.
I reduced the number of slots to 16, like it was the case before on this machine.
Also I did some checks and kvm seems to work. Qemu also doesn't complain about missing /dev/kvm - because it is present.
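Reducing the slot count from 30 to 16 amounts to disabling the per-slot units for slots 17 to 30. A sketch of how the unit names could be generated is shown below. It reuses the unit name templates from the setup notes in this ticket, while the helper itself is hypothetical and not necessarily how the change was made.

```shell
#!/bin/sh
# Hypothetical helper: print the per-slot systemd unit names for worker
# slots FIRST..LAST, one per line.
slot_units() {
    first=$1
    last=$2
    i=$first
    while [ "$i" -le "$last" ]; do
        printf 'openqa-worker-auto-restart@%s.service\n' "$i"
        printf 'openqa-reload-worker-auto-restart@%s.path\n' "$i"
        i=$((i + 1))
    done
}
```

The surplus slots could then be stopped with `systemctl disable --now $(slot_units 17 30)`.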
Updated by ggardet_arm 10 months ago
Despite KVM and huge pages, the system looks quite slow.
There are lots of failures on 32-bit tests, such as https://openqa.opensuse.org/tests/3687863#step/bootloader_uefi/3
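When investigating such slowness, it can be worth confirming at runtime that the huge page pool is actually populated. The sketch below only summarizes /proc/meminfo and does not by itself explain the 32-bit failures. The helper name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read /proc/meminfo-style text on stdin and print
# "<total> <free> <size_kB>" for the default huge page pool.
hugepage_summary() {
    awk '/^HugePages_Total:/ {t = $2}
         /^HugePages_Free:/  {f = $2}
         /^Hugepagesize:/    {s = $2}
         END {print t, f, s}'
}
```

Running `hugepage_summary < /proc/meminfo` on the worker should report a total of 64 pages of 1048576 kB if the kernel parameters from the setup notes are in effect, and `[ -c /dev/kvm ]` confirms that the KVM device node exists.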
Updated by okurz 10 months ago
- Status changed from Resolved to Feedback
Please show successful aarch32 jobs for verification. And https://gitlab.suse.de/openqa/monitor-o3/-/pipeline_schedules shows that the last job failed.
Updated by dheidler 10 months ago
- Status changed from Feedback to Resolved
Only the overall status. https://gitlab.suse.de/openqa/monitor-o3/-/jobs/1977357 is the only one that doesn't fail due to direct ping.
https://openqa.opensuse.org/tests/3719355
https://openqa.opensuse.org/tests/3719370
Updated by okurz 10 months ago
- Copied to action #150869: Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M added