action #120004

[alert] Host powerqaworker-qam-1.qa.suse.de is down size:M

Added by mkittler 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Start date: 2022-11-07
Due date: 2022-11-22
% Done: 0%
Estimated time:

Description

Leads to the OSD deployment pipeline failing, so the host has been removed from salt.

Rollback steps

  • Add the worker back to salt

History

#1 Updated by mkittler 3 months ago

  • Priority changed from Normal to Urgent

#2 Updated by mkittler 3 months ago

  • Target version set to Ready

#3 Updated by mkittler 3 months ago

I restarted the pipeline and it passed again (after removing the host from salt). So the pipeline failure was really just due to that.

#4 Updated by mkittler 3 months ago

  • Assignee set to mkittler

#5 Updated by mkittler 3 months ago

  • Status changed from New to In Progress

Similar to other power workers on that Leap version, the journal just ends and a power reset helps to recover the machine:

Nov 04 22:06:47 powerqaworker-qam-1 openqa-worker-cacheservice-minion[32693]: [32693] [i] Cache size of "/var/lib/openqa/cache" is 49 GiB, with limit 50 GiB
Nov 04 22:06:47 powerqaworker-qam-1 openqa-worker-cacheservice-minion[32693]: [32693] [i] Downloading "SLES-12-SP5-ppc64le-Installtest.qcow2" from "http://openqa.suse.de/tests/9874981/asset/hdd/SLES-12-SP5-ppc64le-Installtest.qcow2"
Nov 04 22:06:47 powerqaworker-qam-1 worker[32739]: [debug] [pid:32739] Uploading artefact exfat_gf07-2.txt
Nov 04 22:06:48 powerqaworker-qam-1 worker[32739]: [debug] [pid:32739] Uploading artefact exfat_gf08-10.txt
Nov 04 22:06:48 powerqaworker-qam-1 worker[32739]: [debug] [pid:32739] Uploading artefact LTP_lvm.local_exfat_gf09.txt
Nov 04 22:06:48 powerqaworker-qam-1 worker[32739]: [debug] [pid:32739] Uploading artefact exfat_gf10-9.txt
Nov 04 22:06:49 powerqaworker-qam-1 worker[121538]: [debug] [pid:121538] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9871957/status
Nov 04 22:06:49 powerqaworker-qam-1 worker[32739]: [debug] [pid:32739] Uploading artefact exfat_gf10-5.txt
Nov 04 22:06:49 powerqaworker-qam-1 worker[24183]: [debug] [pid:24183] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9873526/status
Nov 04 22:06:50 powerqaworker-qam-1 worker[121538]: [debug] [pid:121538] Upload concluded (at patch_sle)
Nov 04 22:06:50 powerqaworker-qam-1 worker[32739]: [debug] [pid:32739] Uploading artefact exfat_gf06-2.txt
Nov 04 22:06:50 powerqaworker-qam-1 worker[24183]: [debug] [pid:24183] Upload concluded (at install_service)
-- Boot a50959ad14194ac99539d27879c08056 --
Nov 07 16:12:15 powerqaworker-qam-1 kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 196608MB)
Nov 07 16:12:15 powerqaworker-qam-1 kernel: hash-mmu: Page sizes from device-tree:
Nov 07 16:12:15 powerqaworker-qam-1 kernel: hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
Nov 07 16:12:15 powerqaworker-qam-1 kernel: hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7
Nov 07 16:12:15 powerqaworker-qam-1 kernel: hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
Nov 07 16:12:15 powerqaworker-qam-1 kernel: hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1
Nov 07 16:12:15 powerqaworker-qam-1 kernel: hash-mmu: base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8
Nov 07 16:12:15 powerqaworker-qam-1 kernel: hash-mmu: base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0
Nov 07 16:12:15 powerqaworker-qam-1 kernel: hash-mmu: base_shift=34: shift=34, sllp=0x0120, avpnm=0x000007ff, tlbiel=0, penc=3
Nov 07 16:12:15 powerqaworker-qam-1 kernel: Enabling pkeys with max key count 32
Nov 07 16:12:15 powerqaworker-qam-1 kernel: Disabling hardware transactional memory (HTM)
Nov 07 16:12:15 powerqaworker-qam-1 kernel: Activating Kernel Userspace Access Prevention
Nov 07 16:12:15 powerqaworker-qam-1 kernel: Activating Kernel Userspace Execution Prevention
Nov 07 16:12:15 powerqaworker-qam-1 kernel: Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24
Nov 07 16:12:15 powerqaworker-qam-1 kernel: Using 1TB segments
Nov 07 16:12:15 powerqaworker-qam-1 kernel: hash-mmu: Initializing hash mmu with SLB
Nov 07 16:12:15 powerqaworker-qam-1 kernel: Linux version 5.14.21-150400.24.28-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-150100.7.37) #1 SMP Mon Oct 10 15:21:12 UTC 2022 (f82da2c)
Nov 07 16:12:15 powerqaworker-qam-1 kernel: Secure boot mode disabled
Nov 07 16:12:15 powerqaworker-qam-1 kernel: Found initrd at 0xc0000000035a0000:0xc000000004e44a73

So it is like on qa-power8-4-kvm.qa.suse.de and qa-power8-5-kvm.qa.suse.de.
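The "journal just ends" pattern can be checked mechanically. Below is a minimal sketch, not something that was run for this ticket: it reads journal text on stdin and, for every "-- Boot <id> --" marker, reports whether the last line before it looks like a clean stop. The function name check_unclean_boots is hypothetical, and treating a "Journal stopped" line as the sign of a clean shutdown is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: classify each boot marker in journal text read from stdin.
# Assumption: a clean shutdown leaves a "Journal stopped" line as the last
# message before the next "-- Boot <id> --" marker.
check_unclean_boots() {
    awk '
        /^-- Boot [0-9a-f]+ --$/ {
            # Only report if there were log lines before this marker.
            if (prev != "") {
                status = (prev ~ /Journal stopped/) ? "clean" : "unclean"
                print status ": " $3
            }
            next
        }
        { prev = $0 }
    '
}
```

On a worker this could be fed with something like `journalctl --no-pager | check_unclean_boots`; for the excerpt above it would flag boot a50959ad14194ac99539d27879c08056 as unclean.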

PRs to add automatic recovery for this worker: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/767, https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/25

#6 Updated by okurz 3 months ago

Same as we have on qa-power8-3 as well. The important action is to use the old kernel. Please also apply the same change consistently on all our power workers in production. Anyway, it's helpful to learn that other ppc64le hosts are affected by the same product issue. Could you please comment accordingly on the Bugzilla issue?

#7 Updated by openqa_review 3 months ago

  • Due date set to 2022-11-22

Setting due date based on mean cycle time of SUSE QE Tools

#8 Updated by mkittler 3 months ago

  • Status changed from In Progress to Feedback

This worker crashes with a much lower frequency than the other hosts. Maybe it is the same issue, maybe just the same symptom. Since we haven't really pinned down the source of the problem so far, it is hard to tell. This host doesn't crash very often (this is in fact the first time I needed to recover it), so I thought it would be better to keep it on the version we use everywhere else (downgrades should still be the exception) and to see how the situation develops. Of course, we can still downgrade the machine if it becomes too unstable.

#9 Updated by mkittler 3 months ago

It has crashed again. I've already recovered the machine, but this means the worker can now officially be considered unstable like the others. Still, to be sure, I'd like to keep it on Leap 15.4 for at least a few more days. If it stays unstable, then https://bugzilla.opensuse.org/show_bug.cgi?id=1202138 is the related product bug.

Note that there's again nothing interesting in the journal (it just ends, with no prior error messages or sign of an attempted shutdown).

#10 Updated by cdywan 3 months ago

  • Subject changed from [alert] Host powerqaworker-qam-1.qa.suse.de is down to [alert] Host powerqaworker-qam-1.qa.suse.de is down size:M
  • Description updated (diff)

#11 Updated by okurz 3 months ago

Based on the feedback in the bug report, it seems we can't expect more help for bare-metal OPAL installations. So I suggest rolling back the kernel version according to #119008#note-14.

So I did

zypper in --oldpackage \
  http://download.opensuse.org/update/leap/15.3/sle/ppc64le/kernel-default-5.3.18-150300.59.93.1.ppc64le.rpm \
  http://download.opensuse.org/update/leap/15.3/sle/ppc64le/util-linux-2.36.2-150300.4.23.1.ppc64le.rpm \
  && zypper rm kernel-default-5.14.21-150400.24.28 kernel-default-5.14.21-150400.24.21 \
  && for i in kernel-default util-linux; do zypper al --comment "poo#119008, kernel regression boo#1202138" "$i"; done \
  && reboot
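After the reboot, the result of such a downgrade can be verified. The following is only a sketch under assumptions: the helper name kernel_is_downgraded is hypothetical, and it assumes that the downgraded package reports a kernel release starting with "5.3.18-", in the same way the 15.4 kernel reports 5.14.21-150400.24.28-default in the journal above.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given kernel release string belongs to
# the 5.3.18 series installed by the downgrade (assumed release prefix).
kernel_is_downgraded() {
    case "$1" in
        5.3.18-*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On the host it could be used as `kernel_is_downgraded "$(uname -r)" && echo "old kernel is running"`, and `zypper ll` lists the package locks so the two entries added with `zypper al` can be checked by eye.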

#12 Updated by mkittler 3 months ago

  • Description updated (diff)

Ok, good. Note that qa-power8-4-kvm.qa.suse.de (the first host where we encountered the problem) was just set back to a previous snapshot by me at the time. So it is running under Leap 15.3. I'll boot a Leap 15.4 snapshot and perform the same downgrade there so all workers are on Leap 15.4 at least.

#13 Updated by mkittler 3 months ago

That's actually not easily possible anymore. The worker was on 15.3 for too long, so all 15.4 snapshots have already been cleaned up (cat /.snapshots/*/snapshot/etc/os-release | grep 15.4 shows nothing). Not sure whether it still makes sense to upgrade it to Leap 15.4 at this point (just to downgrade the kernel anyway). (All of this is out of the scope of this ticket anyway.)
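The grep above only tells whether any snapshot mentions 15.4. A slightly more detailed variant is sketched below; it prints one line per snapshot with its number and VERSION_ID. The function name list_snapshot_versions and the optional root argument are illustrative assumptions, and the sketch relies on each os-release file being a sourceable shell fragment that defines VERSION_ID.

```shell
#!/bin/bash
# Hypothetical helper: print "<snapshot number> <VERSION_ID>" for every
# snapshot below the given root (defaults to /.snapshots).
list_snapshot_versions() {
    local root="${1:-/.snapshots}"
    local f num
    for f in "$root"/*/snapshot/etc/os-release; do
        # Skip the unexpanded glob pattern and snapshots without os-release.
        [ -f "$f" ] || continue
        num="${f#"$root"/}"   # strip the root prefix
        num="${num%%/*}"      # keep only the snapshot number
        # Source os-release in a subshell so its variables do not leak.
        printf '%s %s\n' "$num" "$(. "$f" && printf '%s' "$VERSION_ID")"
    done
}
```

Running `list_snapshot_versions | grep ' 15\.4$'` would then show which snapshots, if any, still contain a 15.4 system.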

#14 Updated by cdywan 3 months ago

FYI the machine is currently up and running. Assuming the upgrade was not stable, in this context we can probably leave it as is and consider it working (upgrades are not part of this ticket).

#15 Updated by mkittler 3 months ago

  • Status changed from Feedback to Resolved

And powerqaworker-qam-1 itself seems to be stable under the downgraded kernel so I suppose this issue can be closed. (The host up alert for the worker has already been resumed as well.)
