action #68053

closed

powerqaworker-qam-1 fails to come up on reboot (repeatedly)

Added by okurz almost 4 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2020-06-14
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

system seems to be stuck in petitboot shell, how to continue?

Acceptance criteria

  • AC1: DONE powerqaworker-qam-1 is up and running (connects to web UI, ssh) at least once
  • AC2: DONE Machine has a defined owner
  • AC3: Grafana OSD Status Overview shows data
  • AC4: powerqaworker-qam-1 consistently comes up after reboots
  • AC5: powerqaworker-qam-1 is part of salt maintained infrastructure

Suggestions

debug over ipmi SOL connection.


Related issues 7 (1 open, 6 closed)

Related to openQA Infrastructure - action #80688: Upgrade IO firmware for powerqaworker-qam-1 (Resolved, okurz, 2020-12-03)
Related to openQA Project - coordination #76984: [epic] Automatically remove assets+results based on available free space (New, 2021-01-21)
Related to openQA Infrastructure - action #75238: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.2 (Resolved, livdywan)
Related to openQA Infrastructure - action #81058: [tracker-ticket] Power machines can't find installed OS. Automatic reboots disabled for now (Resolved, okurz, 2020-12-15, 2021-04-16)
Related to openQA Infrastructure - action #89050: OSD deployment blocked since two weeks - salt nodes failing (Resolved, okurz, 2021-02-24)
Related to openQA Infrastructure - action #89551: NFS mount fails after boot (reproducible on some OSD workers) (Resolved, mkittler, 2021-03-05, 2021-03-31)
Copied from openQA Infrastructure - action #68050: openqaworker3 fails to come up on reboot, openqa_nvme_format.service failed (Resolved, okurz, 2020-06-14, 2020-07-07)

Actions #1

Updated by okurz almost 4 years ago

  • Copied from action #68050: openqaworker3 fails to come up on reboot, openqa_nvme_format.service failed added
Actions #2

Updated by okurz almost 4 years ago

I have the suspicion that the kernel within petitboot with 4.4 is too old to read the current btrfs filesystem to chainload the kernel or so. Similar as described in #55607#note-6 I would try to kexec the current kernel but when within the petitboot shell trying to mount the root fs I get

$ mount /dev/sdb2 /var/petitboot/mnt/
mount: mounting /dev/sdb2 on /var/petitboot/mnt/ failed: Invalid argument

and dmesg shows

[65061.610075] BTRFS info (device sdb2): disk space caching is enabled
[65061.610081] BTRFS: has skinny extents
[65061.611689] BTRFS: failed to read chunk tree on sdb2
[65061.683029] BTRFS: open_ctree failed
Actions #3

Updated by nicksinger almost 4 years ago

okurz wrote:

I have the suspicion that the kernel within petitboot with 4.4 is too old to read the current btrfs filesystem to chainload the kernel or so. Similar as described in #55607#note-6 I would try to kexec the current kernel but when within the petitboot shell trying to mount the root fs I get

$ mount /dev/sdb2 /var/petitboot/mnt/
mount: mounting /dev/sdb2 on /var/petitboot/mnt/ failed: Invalid argument

and dmesg shows

[65061.610075] BTRFS info (device sdb2): disk space caching is enabled
[65061.610081] BTRFS: has skinny extents
[65061.611689] BTRFS: failed to read chunk tree on sdb2
[65061.683029] BTRFS: open_ctree failed

What you could try is to mount the FS read-only. BTRFS sometimes refuses to mount if the journal is not intact. The dmesg output, however, looks more like some metadata is broken. According to the docs, btrfs rescue chunk-recover seems to help here, but I'd be careful if you really suspect that the old kernel version of petitboot is a problem: it could kill the FS.
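A minimal sketch of the read-only attempt suggested above, assuming the device and mount point from the quoted commands. The helper name try_ro_mount is hypothetical; it tries each candidate partition with mount -o ro and stops at the first one that works.

```shell
#!/bin/sh
# Hypothetical helper: try each candidate partition read-only and
# stop at the first one that mounts.
# Arguments: <mount point> <device> [<device> ...]
try_ro_mount() {
    mnt=$1
    shift
    for dev in "$@"; do
        if mount -o ro "$dev" "$mnt" 2>/dev/null; then
            echo "mounted $dev read-only on $mnt"
            return 0
        fi
        echo "could not mount $dev" >&2
    done
    return 1
}

# Example (needs root, e.g. inside the petitboot shell):
# try_ro_mount /var/petitboot/mnt /dev/sdb2 /dev/sda2
```

If no candidate mounts, the function returns non-zero and dmesg is the next place to look.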

Actions #4

Updated by okurz almost 4 years ago

removed the worker machine's salt key for now to fix osd deployment

Actions #5

Updated by livdywan over 3 years ago

  • Status changed from New to In Progress
  • Assignee set to livdywan
  • No success tracking down the owner of the machine so far.
  • Updating the firmware was suggested but we don't have anyone capable of doing that.
  • This is now also being tracked as RT#172392.
Actions #6

Updated by okurz over 3 years ago

  • Target version set to Ready
Actions #7

Updated by okurz over 3 years ago

  • Priority changed from Urgent to High

ok, for the time being we can live without the machine. We have removed the salt key, paused alerts, etc.

@cdywan There was a question to you in https://infra.nue.suse.com/SelfService/Display.html?id=172392

if you do not know any more information please be a bit louder about it, e.g. ask in chat systems, various channels, mailing lists, research more, delegate, escalate, etc.

Actions #8

Updated by livdywan over 3 years ago

The conversation about fixing the machine is ongoing (planned for next week). Future maintenance is also being discussed (getting it tracked with QAM) so we don't end up in the same situation again.

Actions #9

Updated by livdywan over 3 years ago

No success locating the physical machine yet. Question of ownership is still up in the air. Further investigation pending.

Actions #10

Updated by okurz over 3 years ago

  • Priority changed from High to Normal

It is good to learn that no one saw the machine as a critical resource so we can reduce the priority. As long as the machine is powered on and sucks energy I would not regard this as "Low". As soon as someone powers off the machine we can set the prio to "Low".

Actions #11

Updated by livdywan over 3 years ago

FYI there is some delay due to people involved being on hols but we're getting closer to figuring out all the details around the machine.

Actions #12

Updated by livdywan over 3 years ago

  • Assignee changed from livdywan to nicksinger

Since the RT ticket is waiting on @nicksinger now it probably makes sense to pass this ticket on.

Actions #13

Updated by nicksinger over 3 years ago

Alright so I updated Toni that the machine should be maintained by QAM. I also gave him a rough (physical) location for the machine (somewhere in the labs according to my mac address traces). This was on 25.08.2020 and since then no update from him but I guess he would need to poke around too so I grabbed the machine myself and had a look:

The machine has 4 disks:

/tmp # ls -lah /dev/sd*
brw-rw----    1 root     disk        8,   0 Sep 11 10:25 /dev/sda
brw-rw----    1 root     disk        8,   1 Sep 11 10:25 /dev/sda1
brw-rw----    1 root     disk        8,   2 Sep 11 10:25 /dev/sda2
brw-rw----    1 root     disk        8,  16 Sep 11 10:25 /dev/sdb
brw-rw----    1 root     disk        8,  32 Sep 11 10:25 /dev/sdc
brw-rw----    1 root     disk        8,  33 Sep 11 10:25 /dev/sdc1
brw-rw----    1 root     disk        8,  34 Sep 11 10:25 /dev/sdc2
brw-rw----    1 root     disk        8,  48 Sep 11 10:25 /dev/sdd

Two of them (sda and sdc) have an identical-looking layout:

/tmp # fdisk -l /dev/sda

Disk /dev/sda: 1142.5 GB, 1142520020992 bytes
128 heads, 32 sectors/track, 544796 cylinders
Units = cylinders of 4096 * 512 = 2097152 bytes

   Device Boot      Start         End      Blocks  Id System
/dev/sda1               1        1028     2103296  82 Linux swap
Partition 1 does not end on cylinder boundary
/dev/sda2            1028      544796  1113637888  83 Linux
/tmp # fdisk -l /dev/sdc

Disk /dev/sdc: 1142.5 GB, 1142520020992 bytes
128 heads, 32 sectors/track, 544796 cylinders
Units = cylinders of 4096 * 512 = 2097152 bytes

   Device Boot      Start         End      Blocks  Id System
/dev/sdc1               1        1028     2103296  82 Linux swap
Partition 1 does not end on cylinder boundary
/dev/sdc2            1028      544796  1113637888  83 Linux
Partition 2 does not end on cylinder boundary

both "Linux" partitions (sda2 and sdc2) can be mounted properly inside petitboot so Oli's initial guess shouldn't be the reason.

However, the other two disks (sdb and sdd) don't have any layout at all:

/tmp # fdisk -l /dev/sdb

Disk /dev/sdb: 1142.5 GB, 1142520020992 bytes
128 heads, 32 sectors/track, 544796 cylinders
Units = cylinders of 4096 * 512 = 2097152 bytes

Disk /dev/sdb doesn't contain a valid partition table
/tmp # fdisk -l /dev/sdd

Disk /dev/sdd: 1142.5 GB, 1142520020992 bytes
128 heads, 32 sectors/track, 544796 cylinders
Units = cylinders of 4096 * 512 = 2097152 bytes

Disk /dev/sdd doesn't contain a valid partition table

Doing a kexec looks promising (booting atm). What I also saw is that all of the machine's NICs are down. So even if we get Linux booting again, the machine will most likely be pretty unusable ;)

Actions #14

Updated by nicksinger over 3 years ago

hah, the installed linux booted and inside it even eth4 is up. I currently let a btrfs scrub run to check for any FS errors (I doubt that was the cause) and will then check how to reinstall grub. For me this all looks like grub2 destroyed itself once again…

Actions #15

Updated by nicksinger over 3 years ago

Scrub found no errors in the FS

powerqaworker-qam-1:/var/log # btrfs scrub status /
scrub status for e29496d5-0080-4a01-9bde-b786944f4ba4
    scrub started at Fri Sep 11 12:52:51 2020 and finished after 01:24:50
    total bytes scrubbed: 1.16TiB with 0 errors
Actions #16

Updated by nicksinger over 3 years ago

I start to suspect the BTRFS raid here to cause problems. Trying to reboot showed the same problem (OS could not be found). If you try to mount /dev/sda2 immediately after that, it also fails to mount with "unknown argument". However, if you mount /dev/sdc2 first, /dev/sda2 becomes "available" again and petitboot can find the installation (and boot from it). I now started a full balance which will take some time - let's see if this helps…

Actions #17

Updated by okurz over 3 years ago

nicksinger wrote:

[…] I now started a full balance which will take some time - let's see if this helps…

Is the full balance finished after 28 days?

Actions #18

Updated by okurz over 3 years ago

  • Due date set to 2020-10-27

setting due-date as reminder for ticket exceeding expected cycle time. Currently I can't ping the machine. Over SoL I see the machine again stuck in petitboot. What did you do to make the kernel available in petitboot and boot Linux?

EDIT: 2020-10-23 21:47: I guess same as in https://progress.opensuse.org/issues/55607#note-6 we can do something like lookup kernel and initrd files in any available mount points and append the boot parameters found in …/boot/grub2/grub.cfg?
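A rough sketch of that idea, assuming the grub.cfg on the mounted root contains a plain "linux <kernel> <arguments>" line per menu entry. The helper names first_linux_entry, kernel_path and kernel_args are made up for illustration, and any grub variables in the line (such as ${...} references) would be passed through unexpanded.

```shell
#!/bin/sh
# Print the kernel path and arguments of the first "linux" line
# in a grub.cfg, with fields joined by single spaces.
first_linux_entry() {
    awk '$1 == "linux" { $1 = ""; sub(/^ +/, ""); print; exit }' "$1"
}

# First field of that line: the kernel path as grub sees it.
kernel_path() {
    first_linux_entry "$1" | cut -d' ' -f1
}

# Remaining fields: the kernel command line (empty if there is none).
kernel_args() {
    first_linux_entry "$1" | cut -s -d' ' -f2-
}

# Example inside the petitboot shell, with the root fs mounted:
# kernel_args /var/petitboot/mnt/boot/grub2/grub.cfg
```

The printed arguments still need a manual sanity check before they are handed to kexec.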

Actions #19

Updated by nicksinger over 3 years ago

Sorry for my late response. I was kind of upset about this ticket here - a too annoying issue :)

okurz wrote:

setting due-date as reminder for ticket exceeding expected cycle time. Currently I can't ping the machine. Over SoL I see the machine again stuck in petitboot. What did you do to make the kernel available in petitboot and boot Linux?

EDIT: 2020-10-23 21:47: I guess same as in https://progress.opensuse.org/issues/55607#note-6 we can do something like lookup kernel and initrd files in any available mount points and append the boot parameters found in …/boot/grub2/grub.cfg?

You should be able to kexec the kernel from the existing /boot partition without actually digging out parameters in grub.cfg with kexec -l /boot/… --reuse-cmdline && kexec -e

Actions #20

Updated by okurz over 3 years ago

Mounting /dev/sdb2 failed with "Invalid argument" and dmesg shows some problem about btrfs I guess.

Tried:

kexec -l /var/petitboot/mnt/boot/vmlinux-4.12.14-lp151.28.48-default --reuse-cmdline && kexec -e

kexec stated that --reuse-cmdline is not supported, although kexec --help lists it. So the best approach so far that should work is:

mount /dev/sda2 /var/petitboot/mnt/
mount /dev/sdb2 /var/petitboot/mnt/
kexec --load /var/petitboot/mnt/boot/vmlinux-4.12.14-lp151.28.48-default --initrd /var/petitboot/mnt/boot/initrd-4.12.14-lp151.28.48-default --append "root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M"
kexec --exec

the " around append arguments are important of course.
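The manual steps above could be wrapped in a small helper that only prints the kexec commands, so they can be reviewed before being typed on the console. This is a sketch under assumptions: print_kexec_cmds is a hypothetical name, and it expects the vmlinux-<version> and initrd-<version> naming seen in this ticket.

```shell
#!/bin/sh
# Print (do not run) the kexec commands for one kernel version.
# Arguments: <mounted root> <kernel version> <kernel command line>
print_kexec_cmds() {
    root=$1
    ver=$2
    args=$3
    kernel="$root/boot/vmlinux-$ver"
    initrd="$root/boot/initrd-$ver"
    [ -f "$kernel" ] || { echo "missing $kernel" >&2; return 1; }
    [ -f "$initrd" ] || { echo "missing $initrd" >&2; return 1; }
    # The command line is wrapped in double quotes so that it reaches
    # kexec as a single --append value.
    printf 'kexec --load %s --initrd %s --append "%s"\n' "$kernel" "$initrd" "$args"
    printf 'kexec --exec\n'
}

# Example:
# print_kexec_cmds /var/petitboot/mnt 4.12.14-lp151.28.48-default \
#     "root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec"
```

Printing instead of executing is a deliberate choice here: a wrong root= value is easier to spot on screen than after a failed boot.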

Machine booted up fine. As it was last active months ago I did systemctl stop openqa-worker.target telegraf salt-minion openqa-worker-cacheservice openqa-worker-cacheservice-minion first to ensure it does not pick up jobs and run into version incompatibilities.

After another power reset I tried to fix the petitboot configuration in "System Configuration": I explicitly cleared the boot order, switched back to "Any Device" and went all the way to the end to press "ok". Suddenly I could see the grub boot menu entries, and selecting the first entry correctly booted the system. After another power reset I could reproduce this. So I assume that petitboot either also suffers from the two disks not staying in a consistent order, or the disks "take some time" to be properly initialized and found. What I did then: after the boot menu entries from grub showed up in the main petitboot menu, I went back into "System Configuration", cleared all entries from "Boot Order" and added a device, selecting the new explicit entry "disk: sda2 [uuid: e29496d5-0080-4a01-9bde-b786944f4ba4]" with "Any Device" after that. But after yet another power reset this did not work, and I could not get the grub boot entries to show up in petitboot. It feels like hit and miss, maybe because petitboot is again struggling with inconsistent device ordering.

EDIT: I also added back the machine to salt as we at least have a good enough workaround with IPMI access and being able to manually trigger boots.

Actions #21

Updated by nicksinger over 3 years ago

I've also seen that behavior (hanging in petit for a few seconds before the disk can be found) on QA-Power8-5. So it seems quite "normal" for petitboot to drop you in the menu while it searches for boot-able disks in the background. I just wonder what exactly happened on qam-1 so this mechanism seems to time out (?) now.

Actions #22

Updated by nicksinger over 3 years ago

I'm currently trying to boot the machine for OSD deployment:

powernv-rng: Registering arch random hook.
opal-power: OPAL EPOW, DPO support detected.
HugeTLB registered 16 MB page size, pre-allocated 0 pages
HugeTLB registered 16 GB page size, pre-allocated 0 pages
iommu: Default domain type: Translated
vgaarb: loaded
Linux cec interface: v0.10
EDAC MC: Ver: 3.0.0
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4 CALIPSO
NetLabel:  unlabeled traffic allowed by default
clocksource: Switched to clocksource timebase
swapper/88 invoked oom-killer: gfp_mask=0x16040d0(GFP_TEMPORARY|__GFP_COMP|__GFP_NOTRACK), nodemask=(null),  order=0, oom_score_adj=0
swapper/88 cpuset=/ mems_allowed=0
CPU: 88 PID: 1 Comm: swapper/88 Not tainted 4.12.14-lp151.28.48-default #1 openSUSE Leap 15.1
Call Trace:
[c00000000e57f440] [c000000008af6ec4] dump_stack+0xb0/0xf0 (unreliable)
[c00000000e57f480] [c000000008af4644] dump_header+0xc8/0x26c
[c00000000e57f550] [c0000000082f1104] out_of_memory+0x3e4/0x700
[c00000000e57f5f0] [c0000000082f92d4] __alloc_pages_nodemask+0x1004/0x10b0
[c00000000e57f7e0] [c00000000838e4b8] cache_grow_begin+0xc8/0x7d0
[c00000000e57f8a0] [c00000000838f21c] fallback_alloc+0x1dc/0x2c0
[c00000000e57f910] [c000000008390658] kmem_cache_alloc+0x1b8/0x2f0
[c00000000e57f970] [c0000000083fc624] alloc_inode+0xe4/0x110
[c00000000e57f9a0] [c0000000083ff624] new_inode_pseudo+0x24/0x90
[c00000000e57f9d0] [c0000000083ff6c0] new_inode+0x30/0x60
[c00000000e57fa00] [c00000000853c9a4] tracefs_get_inode+0x24/0x80
[c00000000e57fa30] [c00000000853d1ac] tracefs_create_file+0x6c/0x1d0
[c00000000e57fa70] [c0000000082593f4] trace_create_file+0x24/0x70
[c00000000e57fae0] [c00000000826fd0c] event_create_dir+0x16c/0xa90
[c00000000e57fba0] [c000000008e11cf4] event_trace_init+0x2dc/0x394
[c00000000e57fc40] [c00000000800f654] do_one_initcall+0x64/0x1d0
[c00000000e57fd00] [c000000008de44c4] kernel_init_freeable+0x298/0x38c
[c00000000e57fdc0] [c0000000080102ec] kernel_init+0x2c/0x160
[c00000000e57fe30] [c00000000800b5e0] ret_from_kernel_thread+0x5c/0x7c
Mem-Info:
active_anon:0 inactive_anon:0 isolated_anon:0
 active_file:0 inactive_file:0 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 slab_reclaimable:113 slab_unreclaimable:189
 mapped:0 shmem:0 pagetables:0 bounce:0
 free:784 free_pcp:0 free_cma:512
Node 0 active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Node 0 DMA free:50176kB min:17472kB low:17728kB high:17984kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:505856kB managed:109248kB mlocked:0kB slab_reclaimable:7232kB slab_unreclaimable:12096kB kernel_stack:2512kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:32768kB
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 4*64kB (UM) 4*128kB (M) 3*256kB (UM) 5*512kB (M) 3*1024kB (UM) 1*2048kB (M) 2*4096kB (UM) 0*8192kB 2*16384kB (C) = 50176kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16384kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB
0 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 0kB
Total swap = 0kB
7904 pages RAM
0 pages HighMem/MovableOnly
6197 pages reserved
512 pages cma reserved
0 pages hwpoisoned
[ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Kernel panic - not syncing: Out of memory and no killable processes...

CPU: 88 PID: 1 Comm: swapper/88 Not tainted 4.12.14-lp151.28.48-default #1 openSUSE Leap 15.1
Call Trace:
[c00000000e57f480] [c000000008af6ec4] dump_stack+0xb0/0xf0 (unreliable)
[c00000000e57f4c0] [c000000008af301c] panic+0x144/0x31c
[c00000000e57f550] [c0000000082f1110] out_of_memory+0x3f0/0x700
[c00000000e57f5f0] [c0000000082f92d4] __alloc_pages_nodemask+0x1004/0x10b0
[c00000000e57f7e0] [c00000000838e4b8] cache_grow_begin+0xc8/0x7d0
[c00000000e57f8a0] [c00000000838f21c] fallback_alloc+0x1dc/0x2c0
[c00000000e57f910] [c000000008390658] kmem_cache_alloc+0x1b8/0x2f0
[c00000000e57f970] [c0000000083fc624] alloc_inode+0xe4/0x110
[c00000000e57f9a0] [c0000000083ff624] new_inode_pseudo+0x24/0x90
[c00000000e57f9d0] [c0000000083ff6c0] new_inode+0x30/0x60
[c00000000e57fa00] [c00000000853c9a4] tracefs_get_inode+0x24/0x80
[c00000000e57fa30] [c00000000853d1ac] tracefs_create_file+0x6c/0x1d0
[c00000000e57fa70] [c0000000082593f4] trace_create_file+0x24/0x70
[c00000000e57fae0] [c00000000826fd0c] event_create_dir+0x16c/0xa90
[c00000000e57fba0] [c000000008e11cf4] event_trace_init+0x2dc/0x394
[c00000000e57fc40] [c00000000800f654] do_one_initcall+0x64/0x1d0
[c00000000e57fd00] [c000000008de44c4] kernel_init_freeable+0x298/0x38c
[c00000000e57fdc0] [c0000000080102ec] kernel_init+0x2c/0x160
[c00000000e57fe30] [c00000000800b5e0] ret_from_kernel_thread+0x5c/0x7c
Actions #23

Updated by nicksinger over 3 years ago

disabled the worker for now with salt-key --include-accepted -r powerqaworker-qam-1

Actions #24

Updated by nicksinger over 3 years ago

[    2.324075] tg3 0001:03:00.2 enP1p3s0f2: renamed from eth2
[    2.404091] tg3 0001:03:00.3 enP1p3s0f3: renamed from eth3
[    2.494016] tg3 0003:09:00.1 enP3p9s0f1: renamed from eth5
[    2.534020] tg3 0003:09:00.0 enP3p9s0f0: renamed from eth4
[    2.574026] tg3 0003:09:00.3 enP3p9s0f3: renamed from eth7
[    2.614072] tg3 0003:09:00.2 enP3p9s0f2: renamed from eth6
[   12.915336] ipr 0001:04:00.0: Starting IOA initialization sequence.
[   12.916065] ipr 0001:04:00.0: Starting IOA initialization sequence.
[   12.918703] ipr 0001:04:00.0: Adapter firmware version: 17518000
[   12.920174] ipr 0001:04:00.0: IOA initialized.
[   12.920273] ipr 0001:04:00.0: 8150: Permanent IOA failure
[   12.920277] ipr: 00000000: 04448200 17518000 FFFFFFFF 15212204
[   12.920280] ipr: 00000010: 50050760 5EC5BB00 00000000 00000000
[   12.920283] ipr: 00000020: FEFFFFFF FFFFFFFF 00000000 00002D54
[   12.920285] ipr: 00000030: 00000000 00000000 00000000 00000000
[   12.920288] ipr: 00000040: 00000001 00000000 00000000 1521220E
[   12.920290] ipr: 00000050: 00000708 00000000 00020000 2005001B
[   12.920293] ipr: 00000060: 2005001E 80A8A6DE 00000000 00000000
[   12.920295] ipr: 00000070: 00000001 03002675 779D0AC8 534A4600
[   12.920298] ipr: 00000080: 10860000 BC0B0000 32010000 FFFFFFFF
[   12.920300] ipr: 00000090: FFFFFFFF FFFFFFFF 2004DC40 283BC705
[   12.920303] ipr: 000000A0: 2500FFFF FFFFFFFF FFFFFFFF FFFFFFFF
[   12.920305] ipr: 000000B0: FFFFFFFF 10004C00 B5025B02 0E00FFFF
[   12.920308] ipr: 000000C0: FFFFFFFF FFFFFFFF FFFF1C00 35303141
[   12.920310] ipr: 000000D0: 43424532 312E312E 302E3030 30303052
[   12.920313] ipr: 000000E0: 65763032 41475450 4C323846 30303030
[   12.920315] ipr: 000000F0: 30313230 30303052 65763031 30303030
[   12.920318] ipr: 00000100: 31333032 41474947 412D3032 30303030
[   12.920320] ipr: 00000110: 30303030 2EFF0000 00000000 00000000
[   12.920323] ipr: 00000120: 00000000 00000000 00000000 00000000
[   12.920325] ipr: 00000130: 00000000 00000000 00000000 00000000
[   12.920328] ipr: 00000140: 00000000 00000000 00000000 00000000
[   12.920330] ipr: 00000150: 00000000 00000000 00000000 00000000
[   12.920332] ipr: 00000160: 00000000 00000000 00000000 00000000
[   12.920335] ipr: 00000170: 00000000 00000000 00000000 00000000
[   12.920337] ipr: 00000180: 00000000 00000000 00000000 00000000
[   12.920339] ipr: 00000190: 00000000 00000000 00000000 00000000
[   12.920342] ipr: 000001A0: 00000000 00000000 00000000 00000000
[   12.920344] ipr: 000001B0: 00000000 00000000 00000000 00000000
[   12.920347] ipr: 000001C0: 00000000 00000000 00000000 00000000
[   12.920349] ipr: 000001D0: 00000000 00000000 00000000 00000000
[   12.920351] ipr: 000001E0: 00000000 00000000 00000000 00000000
[   12.920354] ipr: 000001F0: 00000000 00000000 00000000 00000000
[   12.920356] ipr: 00000200: 00000000 00000000 00000000 00000000
[   12.920358] ipr: 00000210: 00000000 00000000 00000000 00000000
[   12.920361] ipr: 00000220: 00000000 00000000 00000000 00000000
[   12.920363] ipr: 00000230: 00000000 00000000 00000000 00000000
[   12.920366] ipr: 00000240: 00000000 00000000 00000000 00000000
[   12.920368] ipr: 00000250: 00000000 00000000 00000000 00000000
[   12.920370] ipr: 00000260: 00000000 00000000 00000000 00000000
[   12.920373] ipr: 00000270: 00000000 00000000 00000000 00000000
[   12.920375] ipr: 00000280: 00000000 00000000 00000000 00000000
[   12.920378] ipr: 00000290: 00000000 00000000 00000000 00000000
[   12.920380] ipr: 000002A0: 00000000 00000000 00000000 00000000
[   12.920382] ipr: 000002B0: 00000000 00000000 00000000 00000000
[   12.920385] ipr: 000002C0: 00000000 00000000 00000000 00000000
[   12.920387] ipr: 000002D0: 00000000 00000000 00000000 00000000
[   12.920389] ipr: 000002E0: 00000000 00000000 00000000 00000000
[   12.920392] ipr: 000002F0: 00000000 00000000 00000000 00000000
[   12.920394] ipr: 00000300: 00000000 00000000 00000000 00000000
[   12.920396] ipr: 00000310: 00000000 00000000 00000000 00000000
[   12.920399] ipr: 00000320: 00000000 00000000 00000000 00000000
[   12.920401] ipr: 00000330: 00000000 00000000 00000000 00000000
[   12.920403] ipr: 00000340: 00000000 00000000 00000000 00000000
[   12.920406] ipr: 00000350: 00000000 00000000 00000000 00000000
[   12.920408] ipr: 00000360: 00000000 00000000 00000000 00000000
[   12.920410] ipr: 00000370: 00000000 00000000 00000000 00000000
[   12.920413] ipr: 00000380: 00000000 00000000 00000000 00000000
[   12.920415] ipr: 00000390: 00000000 00000000 00000000 00000000
[   12.920418] ipr: 000003A0: 00000000 00000000 00000000 00000000
[   12.920420] ipr: 000003B0: 00000000 00000000 00000000 00000000
[   12.920422] ipr: 000003C0: 00000000 00000000 00000000 00000000
[   12.920425] ipr: 000003D0: 00000000 00000000 00000000 00000000
[   12.920985] scsi 0:3:0:0: No Device         IBM      57DC001SISIOA    0150 PQ: 0 ANSI: 0
[   12.920992] scsi 0:3:0:0: Resource path: 0/FE
[   12.923542] scsi 0:0:0:0: Direct-Access     IBM      HUC101212CSS600  A5A0 PQ: 0 ANSI: 6
[   12.923547] scsi 0:0:0:0: Resource path: 0/00-0C-00
[   12.936575] scsi 0:0:1:0: Direct-Access     IBM      HUC101212CSS600  A5A0 PQ: 0 ANSI: 6
[   12.936580] scsi 0:0:1:0: Resource path: 0/00-0C-01
[   12.946130] scsi 0:0:2:0: Direct-Access     IBM      HUC101212CSS600  A5A0 PQ: 0 ANSI: 6
[   12.946136] scsi 0:0:2:0: Resource path: 0/00-0C-03
[   13.004584] scsi 0:0:3:0: Direct-Access     IBM      HUC101212CSS600  A5A0 PQ: 0 ANSI: 6
[   13.004589] scsi 0:0:3:0: Resource path: 0/00-0C-02
[   13.005458] scsi 0:1:0:0: No Device         IBM      IPR-0   5EC5BB00      PQ: 0 ANSI: 3
[   13.005463] scsi 0:1:0:0: Resource path: 0/FD-00
[   13.005943] scsi 0:1:1:0: No Device         IBM      IPR-0   5EC5BB00      PQ: 0 ANSI: 3
[   13.005948] scsi 0:1:1:0: Resource path: 0/FD-01
[   13.006447] scsi 0:1:2:0: No Device         IBM      IPR-0   5EC5BB00      PQ: 0 ANSI: 3
[   13.006452] scsi 0:1:2:0: Resource path: 0/FD-02
[   13.006976] scsi 0:1:3:0: No Device         IBM      IPR-0   5EC5BB00      PQ: 0 ANSI: 3
[   13.006981] scsi 0:1:3:0: Resource path: 0/FD-03
[   13.007490] scsi 0:2:0:0: Direct-Access     IBM      IPR-0   5EC5BB00      PQ: 0 ANSI: 3
[   13.007495] scsi 0:2:0:0: Resource path: 0/FC-03-00
[   13.007991] scsi 0:2:1:0: Direct-Access     IBM      IPR-0   5EC5BB00      PQ: 0 ANSI: 3
[   13.007996] scsi 0:2:1:0: Resource path: 0/FC-00-00
[   13.008491] scsi 0:2:2:0: Direct-Access     IBM      IPR-0   5EC5BB00      PQ: 0 ANSI: 3
[   13.008496] scsi 0:2:2:0: Resource path: 0/FC-01-00
[   13.008988] scsi 0:2:3:0: Direct-Access     IBM      IPR-0   5EC5BB00      PQ: 0 ANSI: 3
[   13.008993] scsi 0:2:3:0: Resource path: 0/FC-02-00
[   13.114659] ata1.00: ATAPI: IBM.    RMBO0140592, U51J, max UDMA/133
[   13.116035] ata1.00: configured for UDMA/133
[   13.116868] scsi 0:0:4:0: CD-ROM            IBM.     RMBO0140592      U51J PQ: 0 ANSI: 2
[   13.116874] scsi 0:0:4:0: Resource path: 0/00-0F
[   13.118242] scsi 0:0:5:0: Enclosure         IBM      PSBPD14M1 6GSAS  1007 PQ: 0 ANSI: 4
[   13.118247] scsi 0:0:5:0: Resource path: 0/00-08-18
[   13.119901] scsi 0:0:6:0: Enclosure         IBM      PSBPD14M1 6GSAS  1007 PQ: 0 ANSI: 4
[   13.119906] scsi 0:0:6:0: Resource path: 0/00-0C-18
[   21.057733] ipr 0004:01:00.0: Starting IOA initialization sequence.
[   21.058378] ipr 0004:01:00.0: Adapter firmware version: 0422003F
[   21.059001] ipr 0004:01:00.0: IOA initialized.
[   21.059593] scsi 1:255:255:255: No Device         IBM      57B3001SISIOA    0150 PQ: 0 ANSI: 0
[   21.071842] sd 0:2:0:0: [sda] Spinning up disk...
[   21.072232] sd 0:2:1:0: [sdb] Spinning up disk...
[   21.072621] sd 0:2:2:0: [sdc] Spinning up disk...
[   21.072772] sd 0:2:3:0: [sdd] Spinning up disk...
[   21.090134] sr 0:0:4:0: [sr0] scsi3-mmc drive: 24x/24x writer dvd-ram cd/rw xa/form2 cdda tray
[   21.090141] cdrom: Uniform CD-ROM driver Revision: 3.20
[   21.090386] sr 0:0:4:0: Attached scsi CD-ROM sr0
[   22.303954] .ready
[   22.304053] sd 0:2:0:0: [sda] 2231484416 512-byte logical blocks: (1.14 TB/1.04 TiB)
[   22.304057] sd 0:2:0:0: [sda] 4096-byte physical blocks
[   22.304255] sd 0:2:0:0: [sda] Write Protect is off
[   22.304259] sd 0:2:0:0: [sda] Mode Sense: 13 00 00 08
[   22.343958] .ready
[   22.344112] sd 0:2:1:0: [sdb] 2231484416 512-byte logical blocks: (1.14 TB/1.04 TiB)
[   22.344119] sd 0:2:1:0: [sdb] 4096-byte physical blocks
[   22.344325] sd 0:2:1:0: [sdb] Write Protect is off
[   22.344329] sd 0:2:1:0: [sdb] Mode Sense: 13 00 00 08
[   22.373950] .ready
[   22.374026] sd 0:2:2:0: [sdc] 2231484416 512-byte logical blocks: (1.14 TB/1.04 TiB)
[   22.374031] sd 0:2:2:0: [sdc] 4096-byte physical blocks
[   22.374237] sd 0:2:2:0: [sdc] Write Protect is off
[   22.374241] sd 0:2:2:0: [sdc] Mode Sense: 13 00 00 08
[   22.393950] .ready
[   22.394036] sd 0:2:3:0: [sdd] 2231484416 512-byte logical blocks: (1.14 TB/1.04 TiB)
[   22.394041] sd 0:2:3:0: [sdd] 4096-byte physical blocks
[   22.394247] sd 0:2:3:0: [sdd] Write Protect is off
[   22.394252] sd 0:2:3:0: [sdd] Mode Sense: 13 00 00 08
[   22.404023] sd 0:2:0:0: [sda] Cache data unavailable
[   22.404033] sd 0:2:0:0: [sda] Assuming drive cache: write through
[   22.444011] sd 0:2:1:0: [sdb] Cache data unavailable
[   22.444015] sd 0:2:1:0: [sdb] Assuming drive cache: write through
[   22.474013] sd 0:2:2:0: [sdc] Cache data unavailable
[   22.474019] sd 0:2:2:0: [sdc] Assuming drive cache: write through
[   22.494031] sd 0:2:3:0: [sdd] Cache data unavailable
[   22.494043] sd 0:2:3:0: [sdd] Assuming drive cache: write through
[   22.574293]  sda: sda1 sda2
[   22.617071]  sdb: sdb1 sdb2
[   22.619864] sd 0:2:1:0: [sdb] Attached SCSI disk
[   22.636860] sd 0:2:2:0: [sdc] Attached SCSI disk
[   22.656582] sd 0:2:3:0: [sdd] Attached SCSI disk
[   22.674105] sd 0:2:0:0: [sda] Attached SCSI disk
[   22.832865] device-mapper: ioctl: 4.34.0-ioctl (2015-10-28) initialised: dm-devel@redhat.com
[   23.013969] raid6: altivecx1 gen()  6466 MB/s
[   23.183965] raid6: altivecx2 gen() 13075 MB/s
[   23.353963] raid6: altivecx4 gen() 20001 MB/s
[   23.523952] raid6: altivecx8 gen() 28251 MB/s
[   23.693964] raid6: int64x1  gen()  3282 MB/s
[   23.863999] raid6: int64x1  xor()  1095 MB/s
[   24.033975] raid6: int64x2  gen()  5466 MB/s
[   24.203957] raid6: int64x2  xor()  2187 MB/s
[   24.373958] raid6: int64x4  gen() 10901 MB/s
[   24.543963] raid6: int64x4  xor()  3814 MB/s
[   24.713951] raid6: int64x8  gen()  6619 MB/s
[   24.883967] raid6: int64x8  xor()  2752 MB/s
[   24.883969] raid6: using algorithm altivecx8 gen() 28251 MB/s
[   24.883971] raid6: using intx1 recovery algorithm
[   24.884280] xor: measuring software checksum speed
[   24.983952]    8regs     : 19257.600 MB/sec
[   25.083960]    8regs_prefetch: 17376.000 MB/sec
[   25.183956]    32regs    : 20531.200 MB/sec
[   25.283949]    32regs_prefetch: 18240.000 MB/sec
[   25.383955]    altivec   : 29030.400 MB/sec
[   25.383958] xor: using function: altivec (29030.400 MB/sec)
[   25.394281] Btrfs loaded
[   25.395924] BTRFS: device fsid e29496d5-0080-4a01-9bde-b786944f4ba4 devid 2 transid 2012624 /dev/mapper/sda2
[   25.397280] BTRFS info (device dm-2): disk space caching is enabled
[   25.397286] BTRFS: has skinny extents
[   25.398071] BTRFS: failed to read the system array on dm-2
[   25.533957] BTRFS: open_ctree failed
[   25.725704] BTRFS: device fsid e29496d5-0080-4a01-9bde-b786944f4ba4 devid 1 transid 2012624 /dev/mapper/sdb2
[   25.726537] BTRFS info (device dm-2): disk space caching is enabled
[   25.726539] BTRFS: has skinny extents
[   25.732487] BTRFS: failed to read chunk tree on dm-2
[   25.853974] BTRFS: open_ctree failed

IBM says: "8150 Permanent IOA failure: Exchange the failing items in the Failing Items list one at a time.". Could explain why this problem appeared in the first place.
Checking now why btrfs tries to access dm-2 instead of the real devices. Maybe a race
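To compare boots, the failure signatures from the log above can be counted in a saved dmesg or SOL capture. A minimal sketch, assuming the log was saved to a file first; scan_boot_log is a hypothetical helper and the pattern list only covers the messages quoted in this ticket.

```shell
#!/bin/sh
# Count the failure signatures seen in this ticket in a saved log.
# Argument: <log file>
scan_boot_log() {
    log=$1
    for pattern in \
        "Permanent IOA failure" \
        "failed to read the system array" \
        "failed to read chunk tree" \
        "open_ctree failed"
    do
        # grep -c prints 0 and exits non-zero when nothing matches,
        # so the exit status is ignored on purpose.
        count=$(grep -cF -- "$pattern" "$log" || true)
        echo "$count  $pattern"
    done
}

# Example:
# dmesg > /tmp/boot.log && scan_boot_log /tmp/boot.log
```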

Actions #25

Updated by nicksinger over 3 years ago

  • Status changed from In Progress to Resolved

Alright, I think I figured something out. Apparently petitboot creates an md snapshot before attempting to boot. We don't use dm_raid, BUT for some reason this keeps the device (/dev/sda or /dev/sdb) in a busy state for too long. While in that busy state, btrfs tries to access all found devices. In that list, /dev/sda and /dev/sdb appear but are blocked by device mapper, which seems to map them for a short period to /dev/mapper/md2. btrfs then tries to access sda and sdb, and /dev/mapper/md2 gets detected as valid btrfs but is removed again by mdadm, which realizes it is not a raid device and destroys the node again. Therefore neither mdadm nor btrfs is able to find devices, and petitboot is left with nothing to boot. I tried several things to disable mdadm but didn't find anything suitable. But disabling the snapshots with nvram -p default --update-config "petitboot,snapshots?=false" seems to resolve the race well enough, and the system was able to successfully find and boot 3 times in a row. Resolving it for now. Please reopen if the same problem appears again.
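A small sketch for checking whether the setting is already in place before changing it. It assumes that nvram --print-config lists variables one per line as name=value and that the petitboot variable shows up in that dump; both are assumptions, not verified on this machine. snapshots_disabled is a hypothetical helper name.

```shell
#!/bin/sh
# Succeeds if the nvram configuration dump on stdin contains the
# line that disables petitboot snapshots.
# Assumes one name=value pair per line.
snapshots_disabled() {
    grep -qxF 'petitboot,snapshots?=false'
}

# Example on the Power host (as root):
# nvram --print-config | snapshots_disabled || \
#     nvram -p default --update-config "petitboot,snapshots?=false"
```

The update command in the example is the one quoted in the comment above; only the check around it is new.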

Actions #26

Updated by nicksinger over 3 years ago

I've created PPC ticket #179273 to address the "Permanent IOA failure" (I/O adapter seems broken according to https://www.ibm.com/support/knowledgecenter/TI0002C/p8ebk/urc_tables.htm)

Actions #27

Updated by okurz over 3 years ago

  • Due date changed from 2020-10-27 to 2020-11-11
  • Status changed from Resolved to Feedback

It's super awesome that you found this, but before calling it "Resolved" we should at the very least add the machine back to salt.

Also, as this ticket was causing us quite some work - or at least discussions - I think we should try to improve a bit further. Maybe we can write something useful on a wiki? Or is this too specific? Should we report an upstream bug about it?

Actions #28

Updated by livdywan over 3 years ago

  • Due date changed from 2020-11-11 to 2020-11-13
Actions #29

Updated by livdywan over 3 years ago

  • Due date changed from 2020-11-13 to 2020-11-20
Actions #30

Updated by favogt over 3 years ago

I saw that power8 was stuck in petitboot again after yesterday's power outage. /dev/sda3 was mounted properly, but there was no entry for booting from it, probably due to https://bugzilla.opensuse.org/show_bug.cgi?id=1174166.

I ran kexec --load /var/petitboot/mnt/dev/sda3/boot/vmlinux --initrd /var/petitboot/mnt/dev/sda3/boot/initrd --append "root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M"
first, but the UUID changed. As the partition layout changed, the system was apparently reinstalled...

After a reboot, I edited grub2.cfg from petitboot by hand to remove the problematic section and it found the proper entry. It's booting now.

Actions #31

Updated by favogt over 3 years ago

I deleted 95_textmode and locked grub2 to keep it from breaking again in the future. This should be undone once the fix is released.

Actions #32

Updated by okurz over 3 years ago

  • Priority changed from Normal to High

maybe we need the same on powerqaworker-qam-1 as well as it's currently not up again?

Actions #33

Updated by nicksinger over 3 years ago

  • Assignee changed from nicksinger to okurz

okurz wrote:

maybe we need the same on powerqaworker-qam-1 as well as it's currently not up again?

machine is up and running fine from what I see. 2 days uptime

Actions #34

Updated by okurz over 3 years ago

ok. I triggered another reboot to verify. /etc/grub.d/95_textmode is still there so the workaround on power8 is not applied here at least. I will collect from this ticket what I consider relevant information for the future and document on a wiki.

Actions #35

Updated by okurz over 3 years ago

  • Status changed from Feedback to Resolved

I have added some information to https://progress.opensuse.org/projects/openqav3/wiki/Wiki#PPC-specific-configurations

While working on this ticket (and during the waiting times in between) we learned that we as the SUSE QE Tools team can still try to solve some problems ourselves and help with fixing hardware-near problems. In general, though, we want to go back to what we have defined in https://progress.opensuse.org/projects/qa/wiki/Wiki#Out-of-scope which means: less operative work on hardware-near issues. I have updated https://progress.opensuse.org/projects/qa/wiki/Wiki#How-we-work-on-our-backlog with best practices, e.g.: "For tickets which are out of the scope of the team remove from backlog, delegate to corresponding teams or persons but be nice and supportive, e.g. SUSE-IT, EngInfra also see SLA, test maintainer, QE-LSG PrjMgr/mgmt". I think this is enough "improvement for the future" here :)

Actions #36

Updated by okurz over 3 years ago

  • Status changed from Resolved to Workable
  • Assignee deleted (okurz)

Seems something is still missing: https://monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 shows "No data" for powerqaworker-qam-1.qa.suse.de. So we are still not back to where we started from :)

Actions #37

Updated by okurz over 3 years ago

https://stats.openqa-monitor.qa.suse.de/d/WDpowerqaworker-qam-1/worker-dashboard-powerqaworker-qam-1?orgId=1&refresh=1m shows data so at least we receive some data from telegraf from powerqaworker-qam-1

Actions #38

Updated by livdywan over 3 years ago

  • Due date changed from 2020-11-20 to 2020-12-09
  • Assignee set to livdywan

okurz wrote:

https://stats.openqa-monitor.qa.suse.de/d/WDpowerqaworker-qam-1/worker-dashboard-powerqaworker-qam-1?orgId=1&refresh=1m shows data so at least we receive some data from telegraf from powerqaworker-qam-1

So it seems like the machine is available so we're not quite back to square one, but grafana doesn't see it - need to find out how exactly it reports the status I guess 🤔

Actions #39

Updated by nicksinger over 3 years ago

nicksinger wrote:

I've created PPC ticket #179273 to address the "Permanent IOA failure" (I/O adapter seems broken according to https://www.ibm.com/support/knowledgecenter/TI0002C/p8ebk/urc_tables.htm)
https://progress.opensuse.org/issues/80688 to address the steps necessary to resolve the infra/arch ticket.

@cdywan as I can see data in grafana again, can we consider this done?

Actions #41

Updated by okurz over 3 years ago

  • Related to action #80688: Upgrade IO firmware for powerqaworker-qam-1 added
Actions #42

Updated by livdywan over 3 years ago

  • Assignee deleted (livdywan)

I tried to ask repeatedly where this is defined and still couldn't get an answer so I'm unassigning. I wouldn't even know how to remove the broken status at this point.

Actions #43

Updated by mkittler over 3 years ago

  • Related to coordination #76984: [epic] Automatically remove assets+results based on available free space added
Actions #44

Updated by okurz over 3 years ago

I have removed the salt key for powerqaworker-qam-1 from osd for now as this blocked osd deployment.

Actions #45

Updated by okurz over 3 years ago

@mkittler can you explain where you see a relation to #76984?

Actions #46

Updated by livdywan over 3 years ago

  • Due date deleted (2020-12-09)
Actions #47

Updated by livdywan over 3 years ago

  • Description updated (diff)
Actions #48

Updated by livdywan over 3 years ago

  • Related to action #75238: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.2 added
Actions #49

Updated by livdywan about 3 years ago

For reference, current working commandline should be:

kexec --load /var/petitboot/mnt/dev/sdc2/boot/vmlinux --initrd /var/petitboot/mnt/dev/sdc2/boot/initrd --append "root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M" && kexec -e

Except it won't boot anything 🤔 Also, interestingly enough, the device changed between reboots, it was sda2 before.

Note: This machine is also affected by the grub parsing bug now. I applied the work-around of removing the problematic block.

Actions #50

Updated by okurz about 3 years ago

To work with the changing device names use globbing:

kexec --load /var/petitboot/mnt/dev/*/boot/vmlinux --initrd /var/petitboot/mnt/dev/*/boot/initrd --append "root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M" && kexec -e
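
The glob can expand to nothing or to several mounted partitions, in which case kexec would get unexpected arguments. A sketch of a guard, assuming a POSIX sh in the petitboot shell; the helper name single_match and the variable cmdline (standing for the --append string above) are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print the path if the glob expanded to exactly
# one existing file, fail if it matched nothing or several files.
single_match() {
    [ "$#" -eq 1 ] && [ -e "$1" ] || return 1
    printf '%s\n' "$1"
}

# Intended use in the petitboot shell (not executed here):
#   kernel=$(single_match /var/petitboot/mnt/dev/*/boot/vmlinux) &&
#   initrd=$(single_match /var/petitboot/mnt/dev/*/boot/initrd) &&
#   kexec --load "$kernel" --initrd "$initrd" --append "$cmdline" && kexec -e
```

An unmatched glob is passed through literally by sh, which is why the existence check is needed in addition to the argument count.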
Actions #51

Updated by okurz about 3 years ago

  • Subject changed from powerqaworker-qam-1 fails to come up on reboot to powerqaworker-qam-1 fails to come up on reboot (repeatedly)
  • Description updated (diff)
Actions #52

Updated by okurz about 3 years ago

  • Related to action #81058: [tracker-ticket] Power machines can't find installed OS. Automatic reboots disabled for now added
Actions #53

Updated by okurz about 3 years ago

And again powerqaworker-qam-1 is offline at the start of the week. I consider it a useless waste of time to handle that machine manually, so I paused the alert https://monitor.qa.suse.de/d/WDpowerqaworker-qam-1/worker-dashboard-powerqaworker-qam-1?editPanel=65105&tab=alert and have not applied any workaround. Please do not waste time on repeated manual workarounds; we should instead invest in either a proper fix or an automatic workaround. Related to #81058.

Actions #54

Updated by okurz about 3 years ago

  • Description updated (diff)

I have again removed the salt key for powerqaworker-qam-1, which should not have been added back as it was blocking OSD deployment, see #89050. Adding the salt key was also not mentioned in the comments above. Done on OSD with:

sudo salt-key -y -d powerqaworker-qam-1
Actions #55

Updated by okurz about 3 years ago

  • Related to action #89050: OSD deployment blocked since two weeks - salt nodes failing added
Actions #56

Updated by okurz about 3 years ago

  • Priority changed from High to Normal

I suggest we work on #89551 first which could be related to some problems on bootup

Actions #57

Updated by okurz about 3 years ago

  • Related to action #89551: NFS mount fails after boot (reproducible on some OSD workers) added
Actions #58

Updated by okurz almost 3 years ago

Looked into this with mkittler after #80544 proved to be reliable. To me it looks like the physical devices are in a hardware RAID. /proc/mdstat does not show anything and no process with "mdadm" runs in skiboot. mdadm --examine --brief --scan --config=partitions does not return anything. I just don't think there is any Linux software RAID running or interfering here.

Starting pb-console is the same as the "Petitboot" menu.

Checking that the workaround about "snapshots" as described by nicksinger is still active

# nvram --print-config | grep snapshots
petitboot,snapshots?=false

pb-config does not do anything useful AFAICS. In pb-console I changed the timeout from "10" to "60", maybe that helps.

In pb-console I could not get further. AFAIR after letting petitboot parse the grub config the menu should show a boot section that can be selected but that seems to stay empty, like this:

 Petitboot (v1.4.2-e1658ec)                                    8284-22A 100E5EA
 ──────────────────────────────────────────────────────────────────────────────
 *
  System information      
  System configuration    
  System status log       
  Language                
  Rescan devices          
  Retrieve config from URL
  Exit to shell           


 ──────────────────────────────────────────────────────────────────────────────
 Enter=accept, e=edit, n=new, x=exit, l=language, g=log, h=help

Note the empty line selected with "*".

Running the manual kexec command from #68053#note-50 booted the system. I then tried to forcefully restart the system with ipmi-fsp1-powerqaworker-qam.qa power reset. Interestingly, this did not seem to affect the machine's SOL, which still showed btrfs balance running. After roughly 1m the system started from scratch though. Maybe a power cycle makes a difference here. Anyway, after some seconds we are back in "pb-console" and no boot entry shows up. Exiting to shell, ps w | grep -v '\[' shows pb-discover -l /var/log/petitboot/pb-discover.log running, and that log file shows:

/ # cat /var/log/petitboot/pb-discover.log
--- pb-discover ---
lang: en_US.utf8
Detected platform type: powerpc
Running command:
 exe:  nvram
 argv: 'nvram' '--print-config' '--partition' 'common'
configuration:
 autoboot: enabled, 60 sec
 network configuration:
 dm-snapshots disabled
  interface 6c:ae:8b:69:21:74
   dhcp
  boot device 0: disk
  IPMI boot device 0x00
  Modifications allowed to disks: yes
  Default UI to boot on: /dev/hvc0 [IPMI / Serial]
 language: en_US.utf8
SKIP: sr0: no media present
SKIP: sda: no ID_FS_TYPE property
SKIP: sda1: ignore 'swap' filesystem
mounting device /dev/sda2 read-only
couldn't mount device /dev/sda2: mount failed: Invalid argument
SKIP: sdb: no ID_FS_TYPE property
SKIP: sdb1: ignore 'swap' filesystem
mounting device /dev/sdb2 read-only
trying parsers for sdb2
parse error: 237('{'): syntax error, unexpected '{', expecting elif or else or fi
SKIP: sdc: no ID_FS_TYPE property
SKIP: sdd: no ID_FS_TYPE property
SKIP: loop0: ignored (path=/devices/virtual/block/loop0)
SKIP: loop1: ignored (path=/devices/virtual/block/loop1)
SKIP: loop2: ignored (path=/devices/virtual/block/loop2)
SKIP: loop3: ignored (path=/devices/virtual/block/loop3)
SKIP: loop4: ignored (path=/devices/virtual/block/loop4)
SKIP: loop5: ignored (path=/devices/virtual/block/loop5)
SKIP: loop6: ignored (path=/devices/virtual/block/loop6)
SKIP: loop7: ignored (path=/devices/virtual/block/loop7)
Running command:
 exe:  /sbin/ip
 argv: '/sbin/ip' 'link' 'set' 'lo' 'up'
Running command:
 exe:  /sbin/ip
 argv: '/sbin/ip' 'link' 'set' 'enP1p3s0f0' 'up'
network: bringing up interface enP1p3s0f0
network: skipping enP1p3s0f1: manual config mode, but no config for this interface
network: skipping enP1p3s0f2: manual config mode, but no config for this interface
network: skipping enP1p3s0f3: manual config mode, but no config for this interface
network: skipping enP3p9s0f0: manual config mode, but no config for this interface
network: skipping enP3p9s0f1: manual config mode, but no config for this interface
network: skipping enP3p9s0f2: manual config mode, but no config for this interface
network: skipping enP3p9s0f3: manual config mode, but no config for this interface

with the error

mounting device /dev/sda2 read-only
couldn't mount device /dev/sda2: mount failed: Invalid argument
SKIP: sdb: no ID_FS_TYPE property
SKIP: sdb1: ignore 'swap' filesystem
mounting device /dev/sdb2 read-only
trying parsers for sdb2
parse error: 237('{'): syntax error, unexpected '{', expecting elif or else or fi

so first sda2 could not be mounted and for sdb2 the parser fails with a syntax error.
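
To pull just these lines out of a longer log, something like the following might do. It is a sketch that assumes the message wording seen above ("couldn't mount device", "parse error:") and a grep with -E support; pb_discover_errors is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper: print only mount failures and parser errors
# from pb-discover log text on stdin.
pb_discover_errors() {
    grep -E "^(couldn't mount device|parse error:)"
}

# Intended use in the petitboot shell (not executed here):
#   pb_discover_errors < /var/log/petitboot/pb-discover.log
```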

Executing pb-discover manually:

/ # pb-discover --verbose --dry-run
failure parsing nvram output
device-mapper: reload ioctl on sda2-base failed: Device or resource busy
(stuck here)

The best idea I have right now is to start kexec from the init.d script after pb-discover, parsing the UUID to pass as the root parameter:

cat - >/sbin/kexec-workaround-boot-poo68053 <<EOF
#!/bin/sh
uuid=\$(ls -l /dev/disk/by-uuid/| grep 'sd[ab]2' | awk '{ print \$9 }')
kexec --load /var/petitboot/mnt/dev/*/boot/vmlinux --initrd /var/petitboot/mnt/dev/*/boot/initrd --append "root=UUID=\$uuid nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M" && kexec -e
EOF
sed '/pb-discover/a\        (sleep 60 && sh /sbin/kexec-workaround-boot-poo68053) &' -i /etc/init.d/S15pb-discover

But a problem is that the rootfs that is mounted as / within skiboot is not persistent. After rebooting all the changes are lost.
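
Independent of the persistence problem, the UUID lookup in the script above depends on the column layout of ls -l. A sketch of an alternative that reads the symlink targets directly, assuming readlink is available in the petitboot shell; root_uuid is a hypothetical name and the directory is a parameter so it can be tried against a scratch directory:

```shell
#!/bin/sh
# Hypothetical helper: print the UUID whose by-uuid symlink points at a
# device named sda2 or sdb2, fail if there is none.
root_uuid() {
    dir=${1:-/dev/disk/by-uuid}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        case "$(readlink "$link")" in
            */sd[ab]2|sd[ab]2)
                basename "$link"
                return 0
                ;;
        esac
    done
    return 1
}

# Intended use (not executed here):
#   uuid=$(root_uuid) || exit 1
```

It stops at the first match, so it prints a single UUID even if more than one symlink points at sda2 or sdb2.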

But now I understand how https://bugzilla.opensuse.org/show_bug.cgi?id=1174166 is related. pb-discover parses the grub config from the installed Linux system to find the correct boot entries. And as the logfile says, pb-discover fails to parse the grub config in line 237, which in our system corresponds to line 241 as reported in https://bugzilla.opensuse.org/show_bug.cgi?id=1174166#c1 . The bug has been set to RESOLVED FIXED for Factory and SLE15-SP3 (and hence openSUSE Leap 15.3) only, not for openSUSE Leap 15.2 which we run. I will try to provide a maintenance request.

Added a comment to the bug report with https://bugzilla.opensuse.org/show_bug.cgi?id=1174166#c14
and created a maintenance request
https://build.opensuse.org/request/show/882581

Now let's see if I can apply a workaround from Linux hot-patching our grub as well:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/470

Actions #59

Updated by okurz almost 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
Actions #60

Updated by okurz almost 3 years ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/470 merged after mkittler and I could manually verify the fix on powerqaworker-qam-1. I could confirm that the file was removed from all our power workers. Next steps:

  • check in a loop that powerqaworker-qam-1 reboots fine
  • crosscheck other power workers -> #81058
  • ensure all power workers apply auto-reboot -> #81058
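
The first item could be scripted roughly like this. It is only a sketch: wait_for is a hypothetical helper, and the ssh target, retry counts and sleep times are assumptions, not what was actually run:

```shell
#!/bin/sh
# Hypothetical helper: run the given command up to $1 times with $2
# seconds pause in between, succeed as soon as the command succeeds.
wait_for() {
    tries=$1
    pause=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        sleep "$pause"
    done
    return 1
}

# Intended use from another machine (not executed here):
#   for run in 1 2 3 4 5; do
#       ssh powerqaworker-qam-1.qa.suse.de sudo reboot
#       sleep 120
#       wait_for 30 20 ssh -o ConnectTimeout=10 powerqaworker-qam-1.qa.suse.de true ||
#           { echo "run $run: host did not come back"; break; }
#   done
```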
Actions #61

Updated by okurz almost 3 years ago

  • Status changed from In Progress to Resolved

Rebooted the machine five times and it looks good. A more thorough analysis to follow in #81058

Actions #62

Updated by okurz almost 3 years ago

  • Status changed from Resolved to In Progress

oops, need to check grafana status and re-enable machine and such.

Actions #63

Updated by okurz almost 3 years ago

  • Status changed from In Progress to Resolved
Actions #64

Updated by okurz almost 3 years ago

https://monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 still shows "No Data" for powerqaworker-qam-1 but I think that's now better to be covered in a separate ticket as the "reboot" problem is fixed: #90875
