action #124877
closedQA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
QA - coordination #116623: [epic] Migration of SUSE Nbg based openQA+QA+QAM systems to new security zones
Failing pipelines because of unreachable machine openqaworker-arm-1
Description
Observation
./ipmi-recover-worker fails in the grafana web hook actions pipeline.
> ssh openqaworker-arm-1
ssh: connect to host openqaworker-arm-1 port 22: Connection timed out
lost connection
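A minimal sketch for checking whether the SSH port answers at all before blaming the recovery script itself. The helper name `check_tcp` and its defaults are assumptions made for illustration; they are not part of the pipeline or of `ipmi-recover-worker`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): returns 0 if a TCP
# connection to host:port can be opened within the given number of
# seconds, and a non-zero status otherwise.
check_tcp() {
    local host=$1 port=${2:-22} wait=${3:-5}
    # bash's /dev/tcp pseudo-device opens a TCP connection;
    # `timeout` (coreutils) bounds how long the attempt may take.
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example call (commented out so the sketch has no side effects):
# check_tcp openqaworker-arm-1 22 5 && echo reachable || echo unreachable
```

A non-zero result here only says that port 22 did not accept a connection in time; it does not tell whether the machine is powered off, stuck in the bootloader, or cut off by the network.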
Acceptance criteria
- AC1:
Rollback steps
- If the worker can be recovered:
  - Add it back to salt
  - Remove silence for alert on automatic actions panel
Suggestions
- System gets stuck in the GRUB rescue shell after boot
- Look into vKVM session following https://progress.opensuse.org/projects/openqav3/wiki#Accessing-old-BMCs-with-Java-iKVM-Viewer-when-ipmitool-does-not-work-eg-imagetester
Updated by livdywan over 1 year ago
- Copied from action #124715: Failing pipelines because of unreachable machine openqaworker-arm-1 added
Updated by mkittler over 1 year ago
- Status changed from New to In Progress
- Assignee set to mkittler
(see #124715#note-3 and further comments for what I tried before)
I gave up too early with data.jnlp. With LANG=C _JAVA_AWT_WM_NONREPARENTING=1 javaws -nosecurity -jnlp data.jnlp I could mount an ISO (openSUSE TW network installer). Maybe I can fix the boot issues from there. However, now the boot is stuck at:
Loading driver at 0x03FF9B01000 EntryPoint=0x03FF9B1F000
Load Image() Return Status = EFI_SUCCESS
WatchDog Timer Status = EFI_SUCCESS
Unknown.Entry(3FF9B1F000)
Please press 't' to show the boot menu
on this console
The Virtual Media Session shows an upload speed of at most 12 Mbit/s, so maybe the image transfer is just slow (it is not even using the full upload bandwidth of my DSL connection).
If we're lucky then just the GRUB config is damaged (e.g. because the machine crashed when it was rewritten).
EDIT: The image download eventually worked. Now it is stuck at:
[ 34.420890][ T876] RPC: Registered udp transport module.
[ 34.428264][ T876] RPC: Registered tcp transport module.
[ 34.435618][ T876] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 36.268941][ C3] random: crng init done
[ 36.805012][ T904] No iBFT detected.
Framebuffer device detected - continuing installation on console /dev/tty1.
Use boot option 'switch_to_fb=0' to prevent this.
Loading basic drivers
Hardware detection
So I need to find a way to specify switch_to_fb=0 (I haven't seen GRUB, maybe I missed it) or gain access to the framebuffer.
Updated by openqa_review over 1 year ago
- Due date set to 2023-03-08
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 1 year ago
One can get into GRUB by repeatedly pressing the up key (or likely any other key) during boot; a prompt then appears where one can press t.
We could boot again after reinstalling kernel-default from a chroot env and also re-configuring GRUB. However, then the systemd service for our NVMe setup was stuck. At least another reboot still worked (so the NVMe setup doesn't override our GRUB setup or something similar). After that reboot it could even establish the NVMe setup again. Maybe it failed on the first attempt because I still had the TW netinstall mounted.
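The chroot repair described above could be sketched roughly as follows. The exact commands used are not recorded in this ticket, so everything here is an assumption: the device path, the mount point, the use of `zypper install --force` and `grub2-mkconfig`, and the helper names `run` and `recover_boot` are illustrative only. By default the sketch only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: prints the recovery commands unless DRY_RUN=0 is set.
# ROOT_DEV and MNT are hypothetical placeholders, not the real layout
# of openqaworker-arm-1.
ROOT_DEV=${ROOT_DEV:-/dev/sda2}
MNT=${MNT:-/mnt}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_boot() {
    # Mount the installed root filesystem from the rescue environment.
    run mount "$ROOT_DEV" "$MNT"
    # Make the kernel pseudo-filesystems visible inside the chroot.
    for fs in dev proc sys; do
        run mount --bind "/$fs" "$MNT/$fs"
    done
    # Reinstall the kernel package and regenerate the GRUB config.
    run chroot "$MNT" zypper install --force kernel-default
    run chroot "$MNT" grub2-mkconfig -o /boot/grub2/grub.cfg
}
```

Calling `recover_boot` with the defaults lists the six commands without touching the system. A real run would also need the EFI system partition and any separate BTRFS subvolumes mounted inside the chroot, which this sketch leaves out.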
Updated by mkittler over 1 year ago
- Status changed from In Progress to Feedback
After a long BTRFS balancing run the system is now able to boot. Let's see how it behaves once a few tests are running. I've added it back to salt.
My theory is that the GRUB config was damaged, maybe because the worker crashed while the config was written.
Updated by mkittler over 1 year ago
I've also "unsilenced" the silence (see rollback steps). It now shows as expired and could be re-created from the expired entry.
Updated by jbaier_cz over 1 year ago
- Copied from deleted (action #124715: Failing pipelines because of unreachable machine openqaworker-arm-1)
Updated by jbaier_cz over 1 year ago
- Is duplicate of action #124715: Failing pipelines because of unreachable machine openqaworker-arm-1 added
Updated by jbaier_cz over 1 year ago
Seems good so far, https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/609219 was successful and no new mail since yesterday.