action #124877

closed

QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA - coordination #116623: [epic] Migration of SUSE Nbg based openQA+QA+QAM systems to new security zones

Failing pipelines because of unreachable machine openqaworker-arm-1

Added by livdywan over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-02-08
Due date:
% Done: 0%
Estimated time:

Description

Observation

./ipmi-recover-worker fails in the grafana web hook actions pipeline.

> ssh openqaworker-arm-1
ssh: connect to host openqaworker-arm-1 port 22: Connection timed out                                                   
lost connection
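
A minimal sketch of how the reachability check above could be scripted so that a timeout is distinguished from a failing remote command. It relies on ssh exiting with status 255 when ssh itself fails (for example on a connection timeout). The function names and the 10-second timeout are assumptions for illustration and are not taken from ipmi-recover-worker.

```shell
#!/bin/sh
# Map an ssh exit status to a coarse result.
# ssh exits with 255 when ssh itself failed (e.g. connection timed out);
# otherwise it returns the exit status of the remote command.
classify_ssh_status() {
    case "$1" in
        0)   echo "reachable" ;;
        255) echo "unreachable" ;;
        *)   echo "reachable-but-command-failed" ;;
    esac
}

# Hypothetical helper: try a no-op command on the host given as $1.
# The 10 s ConnectTimeout is an assumed value.
check_worker() {
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$1" true >/dev/null 2>&1
    classify_ssh_status "$?"
}
```

Usage would be `check_worker openqaworker-arm-1`; with the state shown in the observation this should report the host as unreachable.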

Acceptance criteria

  • AC1:

Rollback steps

  • If the worker can be recovered
    • Add it back to salt
    • Remove silence for alert on automatic actions panel

Suggestions


Related issues: 1 (0 open, 1 closed)

Is duplicate of openQA Infrastructure - action #124715: Failing pipelines because of unreachable machine openqaworker-arm-1 (Rejected, 2023-02-08)

Actions #1

Updated by livdywan over 1 year ago

  • Copied from action #124715: Failing pipelines because of unreachable machine openqaworker-arm-1 added
Actions #2

Updated by mkittler over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

(see #124715#note-3 and further comments for what I tried before)

I gave up too early on data.jnlp. With LANG=C _JAVA_AWT_WM_NONREPARENTING=1 javaws -nosecurity -jnlp data.jnlp I could mount an ISO (openSUSE TW network installer). Maybe I can fix the boot issues from there. However, the boot is now stuck at:

Loading driver at 0x03FF9B01000 EntryPoint=0x03FF9B1F000
Load Image() Return Status = EFI_SUCCESS
WatchDog Timer Status = EFI_SUCCESS
Unknown.Entry(3FF9B1F000)
Please press 't' to show the boot menu
on this console

The Virtual Media Session shows an upload speed of at most 12 Mbit/s, so maybe it is just slow (that is not even the full upload bandwidth of my DSL connection).

If we're lucky then just the GRUB config is damaged (e.g. because the machine crashed when it was rewritten).


EDIT: The image download eventually worked. Now it is stuck at:

[   34.420890][  T876] RPC: Registered udp transport module.
[   34.428264][  T876] RPC: Registered tcp transport module.
[   34.435618][  T876] RPC: Registered tcp NFSv4.1 backchannel transport module.
[   36.268941][    C3] random: crng init done
[   36.805012][  T904] No iBFT detected.

Framebuffer device detected - continuing installation on console /dev/tty1.
Use boot option 'switch_to_fb=0' to prevent this.

Loading basic drivers
Hardware detection

So I need to find a way to specify switch_to_fb=0 (I haven't seen GRUB, maybe I missed it) or gain access to the framebuffer.

Actions #3

Updated by openqa_review over 1 year ago

  • Due date set to 2023-03-08

Setting due date based on mean cycle time of SUSE QE Tools

Actions #4

Updated by mkittler over 1 year ago

One can get into GRUB by spamming the up key (or likely any other key); a prompt then appears where one can press t.

We could boot again after reinstalling kernel-default from a chroot env and re-configuring GRUB. However, the systemd service for our NVMe setup was then stuck. At least another reboot still worked (so the NVMe setup doesn't override our GRUB setup or something similar). After that reboot it could even establish the NVMe setup again. Maybe it failed on the first attempt because I still had the TW netinstall mounted.
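
For reference, a rough sketch of what such a chroot repair typically looks like on an openSUSE system. The exact commands used on the worker are not recorded in this ticket, so the function below only prints a plausible sequence instead of running it. The root device and mount point are hypothetical placeholders, and a BTRFS installation with subvolumes or a separate EFI partition would need additional mounts before GRUB can be regenerated.

```shell
#!/bin/sh
# Print (do not execute) a typical rescue-chroot repair sequence.
# $1: root device of the installed system (hypothetical, e.g. /dev/sdX2)
# $2: mount point in the rescue environment (assumed default: /mnt)
print_chroot_repair() {
    root_dev=$1
    mnt=${2:-/mnt}
    echo "mount $root_dev $mnt"
    # Make kernel interfaces and devices visible inside the chroot
    for fs in dev proc sys; do
        echo "mount --bind /$fs $mnt/$fs"
    done
    # Reinstall the kernel package and regenerate the GRUB config
    echo "chroot $mnt zypper install --force kernel-default"
    echo "chroot $mnt grub2-mkconfig -o /boot/grub2/grub.cfg"
}
```

Calling `print_chroot_repair /dev/sdX2` lists the steps for review; they can then be run one at a time from the rescue system once the device name has been verified.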

Actions #5

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback

After a long BTRFS balancing the system is now able to boot. Let's see how it behaves once a few tests are running. I've added it back to salt.

My theory is that the GRUB config was damaged, maybe because the worker crashed while the config was written.

Actions #6

Updated by mkittler over 1 year ago

I've also "unsilenced" the silence (see rollback steps). It now shows as expired and could be re-created from the expired entry.

Actions #7

Updated by jbaier_cz over 1 year ago

  • Copied from deleted (action #124715: Failing pipelines because of unreachable machine openqaworker-arm-1)
Actions #8

Updated by jbaier_cz over 1 year ago

  • Is duplicate of action #124715: Failing pipelines because of unreachable machine openqaworker-arm-1 added
Actions #9

Updated by jbaier_cz over 1 year ago

Seems good so far, https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines/609219 was successful and no new mail since yesterday.

Actions #10

Updated by mkittler over 1 year ago

  • Status changed from Feedback to Resolved
Actions #11

Updated by okurz over 1 year ago

  • Due date deleted (2023-03-08)