action #68053

powerqaworker-qam-1 fails to come up on reboot (repeatedly)

Added by okurz 7 months ago. Updated 1 day ago.

Status: Workable
Priority: High
Assignee: -
Target version:
Start date: 2020-06-14
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

system seems to be stuck in petitboot shell, how to continue?

Acceptance criteria

  • AC1: DONE powerqaworker-qam-1 is up and running (connects to web UI, ssh) at least once
  • AC2: DONE Machine has a defined owner
  • AC3: Grafana OSD Status Overview shows data
  • AC4: powerqaworker-qam-1 consistently comes up after reboots

Suggestions

debug over ipmi SOL connection.
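
A minimal sketch of opening such a SOL session with ipmitool, assuming the BMC speaks IPMI v2.0 (lanplus). The function name, the BMC host name and the user are hypothetical placeholders, not values taken from this ticket; the password is read from the IPMI_PASSWORD environment variable via -E.

```shell
# Open a serial-over-LAN console on the machine's BMC.
# $1: BMC host name (placeholder), $2: IPMI user (placeholder)
sol_connect() {
    ipmitool -I lanplus -H "$1" -U "$2" -E sol activate
}

# Usage (hypothetical host and user):
#   IPMI_PASSWORD=... sol_connect powerqaworker-qam-1-bmc.example.org admin
```

An active session is left with the ~. escape sequence; a stale session can be closed by running the same ipmitool command with sol deactivate.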


Related issues

Related to openQA Infrastructure - action #80688: Upgrade IO firmware for powerqaworker-qam-1 (Resolved, 2020-12-03)

Related to openQA Project - action #76984: Automatically remove assets+results based on available free space (Feedback)

Related to openQA Infrastructure - action #75238: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.2 (Resolved)

Copied from openQA Infrastructure - action #68050: openqaworker3 fails to come up on reboot, openqa_nvme_format.service failed (Resolved, 2020-06-14, 2020-07-07)

History

#1 Updated by okurz 7 months ago

  • Copied from action #68050: openqaworker3 fails to come up on reboot, openqa_nvme_format.service failed added

#2 Updated by okurz 7 months ago

I have the suspicion that the kernel within petitboot with 4.4 is too old to read the current btrfs filesystem to chainload the kernel or so. Similar as described in #55607#note-6 I would try to kexec the current kernel but when within the petitboot shell trying to mount the root fs I get

$ mount /dev/sdb2 /var/petitboot/mnt/
mount: mounting /dev/sdb2 on /var/petitboot/mnt/ failed: Invalid argument

and dmesg shows

[65061.610075] BTRFS info (device sdb2): disk space caching is enabled
[65061.610081] BTRFS: has skinny extents
[65061.611689] BTRFS: failed to read chunk tree on sdb2
[65061.683029] BTRFS: open_ctree failed

#3 Updated by nicksinger 7 months ago

okurz wrote:

I have the suspicion that the kernel within petitboot with 4.4 is too old to read the current btrfs filesystem to chainload the kernel or so. Similar as described in #55607#note-6 I would try to kexec the current kernel but when within the petitboot shell trying to mount the root fs I get

$ mount /dev/sdb2 /var/petitboot/mnt/
mount: mounting /dev/sdb2 on /var/petitboot/mnt/ failed: Invalid argument

and dmesg shows

[65061.610075] BTRFS info (device sdb2): disk space caching is enabled
[65061.610081] BTRFS: has skinny extents
[65061.611689] BTRFS: failed to read chunk tree on sdb2
[65061.683029] BTRFS: open_ctree failed

What you could try is to mount the FS read-only. BTRFS sometimes refuses to mount if the journal is not intact. The dmesg output, however, looks more like some metadata is broken. According to the docs, btrfs rescue chunk-recover seems to help here, but I'd be careful if you really suspect that the old kernel version of petitboot is a problem: it could kill the FS.
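
A rough sketch of the read-only attempt, assuming the petitboot shell provides mount, dmesg, grep and tail. The function name is made up for illustration; device and mount point are the ones quoted above. On failure it prints the last BTRFS kernel messages instead of modifying anything on disk.

```shell
# Try a read-only mount; on failure show the most recent BTRFS kernel messages.
# $1: block device, $2: mount point
try_ro_mount() {
    dev=$1
    mnt=$2
    if mount -o ro "$dev" "$mnt"; then
        echo "mounted $dev read-only on $mnt"
    else
        dmesg | grep BTRFS | tail -n 5
        return 1
    fi
}

# Usage:
#   try_ro_mount /dev/sdb2 /var/petitboot/mnt/
```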

#4 Updated by okurz 7 months ago

removed the worker machine's salt key for now to fix osd deployment

#5 Updated by cdywan 7 months ago

  • Status changed from New to In Progress
  • Assignee set to cdywan
  • No success tracking down the owner of the machine so far.
  • Updating the firmware was suggested but we don't have anyone capable of doing that.
  • This is now also being tracked as RT#172392.

#6 Updated by okurz 7 months ago

  • Target version set to Ready

#7 Updated by okurz 6 months ago

  • Priority changed from Urgent to High

ok, for the time being we can live without the machine. We have removed the salt key, paused alerts, etc.

cdywan There was a question to you in https://infra.nue.suse.com/SelfService/Display.html?id=172392

if you do not know any more information please be a bit louder about it, e.g. ask in chat systems, various channels, mailing lists, research more, delegate, escalate, etc.

#8 Updated by cdywan 6 months ago

The conversation about fixing the machine is ongoing (planned for next week). Future maintenance is also being discussed (getting it tracked with QAM) so we don't end up in the same situation again.

#9 Updated by cdywan 6 months ago

No success locating the physical machine yet. Question of ownership is still up in the air. Further investigation pending.

#10 Updated by okurz 6 months ago

  • Priority changed from High to Normal

It is good to learn that no one saw the machine as a critical resource, so we can reduce the priority. As long as the machine is powered on and sucks energy I would not regard this as "Low". As soon as someone powers off the machine we can set the prio to "Low".

#11 Updated by cdywan 5 months ago

FYI there is some delay due to people involved being on hols but we're getting closer to figuring out all the details around the machine.

#12 Updated by cdywan 5 months ago

  • Assignee changed from cdywan to nicksinger

Since the RT ticket is waiting on nicksinger now it probably makes sense to pass this ticket on.

#13 Updated by nicksinger 4 months ago

Alright so I updated Toni that the machine should be maintained by QAM. I also gave him a rough (physical) location for the machine (somewhere in the labs according to my mac address traces). This was on 25.08.2020 and since then no update from him but I guess he would need to poke around too so I grabbed the machine myself and had a look:

The machine has 4 disks:

/tmp # ls -lah /dev/sd*
brw-rw----    1 root     disk        8,   0 Sep 11 10:25 /dev/sda
brw-rw----    1 root     disk        8,   1 Sep 11 10:25 /dev/sda1
brw-rw----    1 root     disk        8,   2 Sep 11 10:25 /dev/sda2
brw-rw----    1 root     disk        8,  16 Sep 11 10:25 /dev/sdb
brw-rw----    1 root     disk        8,  32 Sep 11 10:25 /dev/sdc
brw-rw----    1 root     disk        8,  33 Sep 11 10:25 /dev/sdc1
brw-rw----    1 root     disk        8,  34 Sep 11 10:25 /dev/sdc2
brw-rw----    1 root     disk        8,  48 Sep 11 10:25 /dev/sdd

Two of them (sda and sdc) have an identical-looking layout:

/tmp # fdisk -l /dev/sda

Disk /dev/sda: 1142.5 GB, 1142520020992 bytes
128 heads, 32 sectors/track, 544796 cylinders
Units = cylinders of 4096 * 512 = 2097152 bytes

   Device Boot      Start         End      Blocks  Id System
/dev/sda1               1        1028     2103296  82 Linux swap
Partition 1 does not end on cylinder boundary
/dev/sda2            1028      544796  1113637888  83 Linux
/tmp # fdisk -l /dev/sdc

Disk /dev/sdc: 1142.5 GB, 1142520020992 bytes
128 heads, 32 sectors/track, 544796 cylinders
Units = cylinders of 4096 * 512 = 2097152 bytes

   Device Boot      Start         End      Blocks  Id System
/dev/sdc1               1        1028     2103296  82 Linux swap
Partition 1 does not end on cylinder boundary
/dev/sdc2            1028      544796  1113637888  83 Linux
Partition 2 does not end on cylinder boundary

both "Linux" partitions (sda2 and sdc2) can be mounted properly inside petitboot so Oli's initial guess shouldn't be the reason.

However, the other two disks (sdb and sdd) don't have any layout at all:

/tmp # fdisk -l /dev/sdb

Disk /dev/sdb: 1142.5 GB, 1142520020992 bytes
128 heads, 32 sectors/track, 544796 cylinders
Units = cylinders of 4096 * 512 = 2097152 bytes

Disk /dev/sdb doesn't contain a valid partition table
/tmp # fdisk -l /dev/sdd

Disk /dev/sdd: 1142.5 GB, 1142520020992 bytes
128 heads, 32 sectors/track, 544796 cylinders
Units = cylinders of 4096 * 512 = 2097152 bytes

Disk /dev/sdd doesn't contain a valid partition table

Doing a kexec looks promising (booting atm). What I also saw is that all of the machine's NICs are down. So even if we get Linux booting again, the machine will most likely be pretty unusable ;)

#14 Updated by nicksinger 4 months ago

Hah, the installed Linux booted, and inside it even eth4 is up. I am currently running a btrfs scrub to check for any FS errors (I doubt that was the cause) and will then check how to reinstall grub. To me this all looks like grub2 destroyed itself once again…

#15 Updated by nicksinger 4 months ago

Scrub found no errors in the FS

powerqaworker-qam-1:/var/log # btrfs scrub status /
scrub status for e29496d5-0080-4a01-9bde-b786944f4ba4
    scrub started at Fri Sep 11 12:52:51 2020 and finished after 01:24:50
    total bytes scrubbed: 1.16TiB with 0 errors

#16 Updated by nicksinger 4 months ago

I'm starting to suspect that the BTRFS RAID is causing the problems here. Trying to reboot showed the same problem (OS could not be found). If you try to mount /dev/sda2 immediately after that, it also fails to mount with "unknown argument". However, if you mount /dev/sdc2 first, /dev/sda2 becomes "available" again and petitboot can find the installation (and boot from it). I have now started a full balance, which will take some time - let's see if this helps…

#17 Updated by okurz 3 months ago

nicksinger wrote:

[…] I now started a full balance which will take some time - let's see if this helps…

Is the full balance finished after 28 days?

#18 Updated by okurz 3 months ago

  • Due date set to 2020-10-27

Setting due date as a reminder for a ticket exceeding the expected cycle time. Currently I can't ping the machine. Over SoL I see the machine again stuck in petitboot. What did you do to make the kernel available in petitboot and boot Linux?

EDIT: 2020-10-23 21:47: I guess same as in https://progress.opensuse.org/issues/55607#note-6 we can do something like lookup kernel and initrd files in any available mount points and append the boot parameters found in …/boot/grub2/grub.cfg?

#19 Updated by nicksinger 3 months ago

Sorry for my late response. I was kind of upset about this ticket here - a too annoying issue :)

okurz wrote:

Setting due date as a reminder for a ticket exceeding the expected cycle time. Currently I can't ping the machine. Over SoL I see the machine again stuck in petitboot. What did you do to make the kernel available in petitboot and boot Linux?

EDIT: 2020-10-23 21:47: I guess same as in https://progress.opensuse.org/issues/55607#note-6 we can do something like lookup kernel and initrd files in any available mount points and append the boot parameters found in …/boot/grub2/grub.cfg?

You should be able to kexec the kernel from the existing /boot partition without actually digging out parameters in grub.cfg with kexec -l /boot/… --reuse-cmdline && kexec -e

#20 Updated by okurz 3 months ago

Mounting /dev/sdb2 failed with "Invalid argument" and dmesg shows some problem about btrfs I guess.

Tried:

kexec -l /var/petitboot/mnt/boot/vmlinux-4.12.14-lp151.28.48-default --reuse-cmdline && kexec -e

kexec stated that --reuse-cmdline is not supported, and kexec --help confirms this. So the best approach so far that should work is:

mount /dev/sda2 /var/petitboot/mnt/
mount /dev/sdb2 /var/petitboot/mnt/
kexec --load /var/petitboot/mnt/boot/vmlinux-4.12.14-lp151.28.48-default --initrd /var/petitboot/mnt/boot/initrd-4.12.14-lp151.28.48-default --append "root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M"
kexec --exec

the " around append arguments are important of course.
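
Instead of typing the boot parameters by hand, they could be read from the grub configuration on the mounted root filesystem. The following is only a sketch: the function name is made up, it assumes that awk is available in the petitboot shell, that grub.cfg sits under the mount point used above, and that the first "linux" line in it belongs to the entry you want to boot.

```shell
# Print the kernel command line (everything after the kernel path)
# from the first "linux" line of a grub.cfg.
# $1: path to grub.cfg
grub_cmdline() {
    awk '$1 == "linux" { $1 = ""; $2 = ""; sub(/^ +/, ""); print; exit }' "$1"
}

# Usage (assumed path, kernel and initrd names as in the comment above):
#   cfg=/var/petitboot/mnt/boot/grub2/grub.cfg
#   kexec --load /var/petitboot/mnt/boot/vmlinux-4.12.14-lp151.28.48-default \
#         --initrd /var/petitboot/mnt/boot/initrd-4.12.14-lp151.28.48-default \
#         --append "$(grub_cmdline "$cfg")"
#   kexec --exec
```

Any grub variables on that line, such as ${extra_cmdline}, are printed literally and would have to be removed by hand before passing the result to kexec.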

Machine booted up fine. As it was last active months ago, I first ran systemctl stop openqa-worker.target telegraf salt-minion openqa-worker-cacheservice openqa-worker-cacheservice-minion to ensure it does not pick up jobs and run into version incompatibilities.

After another power reset I tried to fix the petitboot configuration in "System Configuration": I explicitly cleared the boot order, switched back to "Any Device", and went all the way to the end to press "ok". Suddenly I could see the grub boot menu entries, and selecting the first entry correctly booted the system. After another power reset I could reproduce this. So I assume that petitboot either also suffers from the two disks not staying in a consistent order, or the disks "take some time" to be properly initialized and found. What I did next: once the boot menu entries from grub showed up in the main petitboot menu, I went back into "System Configuration", cleared all entries from "Boot Order" and added a device, selecting the new explicit entry "disk: sda2 [uuid: e29496d5-0080-4a01-9bde-b786944f4ba4]" with "Any Device" after it. But after another power reset this did not work and I could not get the grub boot entries to show up in petitboot. It feels like hit and miss, maybe because petitboot is again struggling with inconsistent device ordering.

EDIT: I also added back the machine to salt as we at least have a good enough workaround with IPMI access and being able to manually trigger boots.

#21 Updated by nicksinger 3 months ago

I've also seen that behavior (hanging in petit for a few seconds before the disk can be found) on QA-Power8-5. So it seems quite "normal" for petitboot to drop you in the menu while it searches for boot-able disks in the background. I just wonder what exactly happened on qam-1 so this mechanism seems to time out (?) now.

#22 Updated by nicksinger 3 months ago

I'm currently trying to boot the machine for OSD deployment:

powernv-rng: Registering arch random hook.
opal-power: OPAL EPOW, DPO support detected.
HugeTLB registered 16 MB page size, pre-allocated 0 pages
HugeTLB registered 16 GB page size, pre-allocated 0 pages
iommu: Default domain type: Translated
vgaarb: loaded
Linux cec interface: v0.10
EDAC MC: Ver: 3.0.0
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4 CALIPSO
NetLabel:  unlabeled traffic allowed by default
clocksource: Switched to clocksource timebase
swapper/88 invoked oom-killer: gfp_mask=0x16040d0(GFP_TEMPORARY|__GFP_COMP|__GFP_NOTRACK), nodemask=(null),  order=0, oom_score_adj=0
swapper/88 cpuset=/ mems_allowed=0
CPU: 88 PID: 1 Comm: swapper/88 Not tainted 4.12.14-lp151.28.48-default #1 openSUSE Leap 15.1
Call Trace:
[c00000000e57f440] [c000000008af6ec4] dump_stack+0xb0/0xf0 (unreliable)
[c00000000e57f480] [c000000008af4644] dump_header+0xc8/0x26c
[c00000000e57f550] [c0000000082f1104] out_of_memory+0x3e4/0x700
[c00000000e57f5f0] [c0000000082f92d4] __alloc_pages_nodemask+0x1004/0x10b0
[c00000000e57f7e0] [c00000000838e4b8] cache_grow_begin+0xc8/0x7d0
[c00000000e57f8a0] [c00000000838f21c] fallback_alloc+0x1dc/0x2c0
[c00000000e57f910] [c000000008390658] kmem_cache_alloc+0x1b8/0x2f0
[c00000000e57f970] [c0000000083fc624] alloc_inode+0xe4/0x110
[c00000000e57f9a0] [c0000000083ff624] new_inode_pseudo+0x24/0x90
[c00000000e57f9d0] [c0000000083ff6c0] new_inode+0x30/0x60
[c00000000e57fa00] [c00000000853c9a4] tracefs_get_inode+0x24/0x80
[c00000000e57fa30] [c00000000853d1ac] tracefs_create_file+0x6c/0x1d0
[c00000000e57fa70] [c0000000082593f4] trace_create_file+0x24/0x70
[c00000000e57fae0] [c00000000826fd0c] event_create_dir+0x16c/0xa90
[c00000000e57fba0] [c000000008e11cf4] event_trace_init+0x2dc/0x394
[c00000000e57fc40] [c00000000800f654] do_one_initcall+0x64/0x1d0
[c00000000e57fd00] [c000000008de44c4] kernel_init_freeable+0x298/0x38c
[c00000000e57fdc0] [c0000000080102ec] kernel_init+0x2c/0x160
[c00000000e57fe30] [c00000000800b5e0] ret_from_kernel_thread+0x5c/0x7c
Mem-Info:
active_anon:0 inactive_anon:0 isolated_anon:0
 active_file:0 inactive_file:0 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 slab_reclaimable:113 slab_unreclaimable:189
 mapped:0 shmem:0 pagetables:0 bounce:0
 free:784 free_pcp:0 free_cma:512
Node 0 active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Node 0 DMA free:50176kB min:17472kB low:17728kB high:17984kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:505856kB managed:109248kB mlocked:0kB slab_reclaimable:7232kB slab_unreclaimable:12096kB kernel_stack:2512kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:32768kB
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 4*64kB (UM) 4*128kB (M) 3*256kB (UM) 5*512kB (M) 3*1024kB (UM) 1*2048kB (M) 2*4096kB (UM) 0*8192kB 2*16384kB (C) = 50176kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16384kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB
0 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 0kB
Total swap = 0kB
7904 pages RAM
0 pages HighMem/MovableOnly
6197 pages reserved
512 pages cma reserved
0 pages hwpoisoned
[ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Kernel panic - not syncing: Out of memory and no killable processes...

CPU: 88 PID: 1 Comm: swapper/88 Not tainted 4.12.14-lp151.28.48-default #1 openSUSE Leap 15.1
Call Trace:
[c00000000e57f480] [c000000008af6ec4] dump_stack+0xb0/0xf0 (unreliable)
[c00000000e57f4c0] [c000000008af301c] panic+0x144/0x31c
[c00000000e57f550] [c0000000082f1110] out_of_memory+0x3f0/0x700
[c00000000e57f5f0] [c0000000082f92d4] __alloc_pages_nodemask+0x1004/0x10b0
[c00000000e57f7e0] [c00000000838e4b8] cache_grow_begin+0xc8/0x7d0
[c00000000e57f8a0] [c00000000838f21c] fallback_alloc+0x1dc/0x2c0
[c00000000e57f910] [c000000008390658] kmem_cache_alloc+0x1b8/0x2f0
[c00000000e57f970] [c0000000083fc624] alloc_inode+0xe4/0x110
[c00000000e57f9a0] [c0000000083ff624] new_inode_pseudo+0x24/0x90
[c00000000e57f9d0] [c0000000083ff6c0] new_inode+0x30/0x60
[c00000000e57fa00] [c00000000853c9a4] tracefs_get_inode+0x24/0x80
[c00000000e57fa30] [c00000000853d1ac] tracefs_create_file+0x6c/0x1d0
[c00000000e57fa70] [c0000000082593f4] trace_create_file+0x24/0x70
[c00000000e57fae0] [c00000000826fd0c] event_create_dir+0x16c/0xa90
[c00000000e57fba0] [c000000008e11cf4] event_trace_init+0x2dc/0x394
[c00000000e57fc40] [c00000000800f654] do_one_initcall+0x64/0x1d0
[c00000000e57fd00] [c000000008de44c4] kernel_init_freeable+0x298/0x38c
[c00000000e57fdc0] [c0000000080102ec] kernel_init+0x2c/0x160
[c00000000e57fe30] [c00000000800b5e0] ret_from_kernel_thread+0x5c/0x7c

#23 Updated by nicksinger 3 months ago

disabled the worker for now with salt-key --include-accepted -r powerqaworker-qam-1

#24 Updated by nicksinger 3 months ago

[    2.324075] tg3 0001:03:00.2 enP1p3s0f2: renamed from eth2
[    2.404091] tg3 0001:03:00.3 enP1p3s0f3: renamed from eth3
[    2.494016] tg3 0003:09:00.1 enP3p9s0f1: renamed from eth5
[    2.534020] tg3 0003:09:00.0 enP3p9s0f0: renamed from eth4
[    2.574026] tg3 0003:09:00.3 enP3p9s0f3: renamed from eth7
[    2.614072] tg3 0003:09:00.2 enP3p9s0f2: renamed from eth6
[   12.915336] ipr 0001:04:00.0: Starting IOA initialization sequence.
[   12.916065] ipr 0001:04:00.0: Starting IOA initialization sequence.
[   12.918703] ipr 0001:04:00.0: Adapter firmware version: 17518000
[   12.920174] ipr 0001:04:00.0: IOA initialized.
[   12.920273] ipr 0001:04:00.0: 8150: Permanent IOA failure
[   12.920277] ipr: 00000000: 04448200 17518000 FFFFFFFF 15212204
[   12.920280] ipr: 00000010: 50050760 5EC5BB00 00000000 00000000
[   12.920283] ipr: 00000020: FEFFFFFF FFFFFFFF 00000000 00002D54
[   12.920285] ipr: 00000030: 00000000 00000000 00000000 00000000
[   12.920288] ipr: 00000040: 00000001 00000000 00000000 1521220E
[   12.920290] ipr: 00000050: 00000708 00000000 00020000 2005001B
[   12.920293] ipr: 00000060: 2005001E 80A8A6DE 00000000 00000000
[   12.920295] ipr: 00000070: 00000001 03002675 779D0AC8 534A4600
[   12.920298] ipr: 00000080: 10860000 BC0B0000 32010000 FFFFFFFF
[   12.920300] ipr: 00000090: FFFFFFFF FFFFFFFF 2004DC40 283BC705
[   12.920303] ipr: 000000A0: 2500FFFF FFFFFFFF FFFFFFFF FFFFFFFF
[   12.920305] ipr: 000000B0: FFFFFFFF 10004C00 B5025B02 0E00FFFF
[   12.920308] ipr: 000000C0: FFFFFFFF FFFFFFFF FFFF1C00 35303141
[   12.920310] ipr: 000000D0: 43424532 312E312E 302E3030 30303052
[   12.920313] ipr: 000000E0: 65763032 41475450 4C323846 30303030
[   12.920315] ipr: 000000F0: 30313230 30303052 65763031 30303030
[   12.920318] ipr: 00000100: 31333032 41474947 412D3032 30303030
[   12.920320] ipr: 00000110: 30303030 2EFF0000 00000000 00000000
[   12.920323] ipr: 00000120: 00000000 00000000 00000000 00000000
[   12.920325] ipr: 00000130: 00000000 00000000 00000000 00000000
[   12.920328] ipr: 00000140: 00000000 00000000 00000000 00000000
[   12.920330] ipr: 00000150: 00000000 00000000 00000000 00000000
[   12.920332] ipr: 00000160: 00000000 00000000 00000000 00000000
[   12.920335] ipr: 00000170: 00000000 00000000 00000000 00000000
[   12.920337] ipr: 00000180: 00000000 00000000 00000000 00000000
[   12.920339] ipr: 00000190: 00000000 00000000 00000000 00000000
[   12.920342] ipr: 000001A0: 00000000 00000000 00000000 00000000
[   12.920344] ipr: 000001B0: 00000000 00000000 00000000 00000000
[   12.920347] ipr: 000001C0: 00000000 00000000 00000000 00000000
[   12.920349] ipr: 000001D0: 00000000 00000000 00000000 00000000
[   12.920351] ipr: 000001E0: 00000000 00000000 00000000 00000000
[   12.920354] ipr: 000001F0: 00000000 00000000 00000000 00000000
[   12.920356] ipr: 00000200: 00000000 00000000 00000000 00000000
[   12.920358] ipr: 00000210: 00000000 00000000 00000000 00000000
[   12.920361] ipr: 00000220: 00000000 00000000 00000000 00000000
[   12.920363] ipr: 00000230: 00000000 00000000 00000000 00000000
[   12.920366] ipr: 00000240: 00000000 00000000 00000000 00000000
[   12.920368] ipr: 00000250: 00000000 00000000 00000000 00000000
[   12.920370] ipr: 00000260: 00000000 00000000 00000000 00000000
[   12.920373] ipr: 00000270: 00000000 00000000 00000000 00000000
[   12.920375] ipr: 00000280: 00000000 00000000 00000000 00000000
[   12.920378] ipr: 00000290: 00000000 00000000 00000000 00000000
[   12.920380] ipr: 000002A0: 00000000 00000000 00000000 00000000
[   12.920382] ipr: 000002B0: 00000000 00000000 00000000 00000000
[   12.920385] ipr: 000002C0: 00000000 00000000 00000000 00000000
[   12.920387] ipr: 000002D0: 00000000 00000000 00000000 00000000
[   12.920389] ipr: 000002E0: 00000000 00000000 00000000 00000000
[   12.920392] ipr: 000002F0: 00000000 00000000 00000000 00000000
[   12.920394] ipr: 00000300: 00000000 00000000 00000000 00000000
[   12.920396] ipr: 00000310: 00000000 00000000 00000000 00000000
[   12.920399] ipr: 00000320: 00000000 00000000 00000000 00000000
[   12.920401] ipr: 00000330: 00000000 00000000 00000000 00000000
[   12.920403] ipr: 00000340: 00000000 00000000 00000000 00000000
[   12.920406] ipr: 00000350: 00000000 00000000 00000000 00000000
[   12.920408] ipr: 00000360: 00000000 00000000 00000000 00000000
[   12.920410] ipr: 00000370: 00000000 00000000 00000000 00000000
[   12.920413] ipr: 00000380: 00000000 00000000 00000000 00000000
[   12.920415] ipr: 00000390: 00000000 00000000 00000000 00000000
[   12.920418] ipr: 000003A0: 00000000 00000000 00000000 00000000
[   12.920420] ipr: 000003B0: 00000000 00000000 00000000 00000000
[   12.920422] ipr: 000003C0: 00000000 00000000 00000000 00000000
[   12.920425] ipr: 000003D0: 00000000 00000000 00000000 00000000
[   12.920985] scsi 0:3:0:0: No Device         IBM      57DC001SISIOA    0150 PQ: 0 ANSI: 0
[   12.920992] scsi 0:3:0:0: Resource path: 0/FE
[   12.923542] scsi 0:0:0:0: Direct-Access     IBM      HUC101212CSS600  A5A0 PQ: 0 ANSI: 6
[   12.923547] scsi 0:0:0:0: Resource path: 0/00-0C-00
[   12.936575] scsi 0:0:1:0: Direct-Access     IBM      HUC101212CSS600  A5A0 PQ: 0 ANSI: 6
[   12.936580] scsi 0:0:1:0: Resource path: 0/00-0C-01
[   12.946130] scsi 0:0:2:0: Direct-Access     IBM      HUC101212CSS600  A5A0 PQ: 0 ANSI: 6
[   12.946136] scsi 0:0:2:0: Resource path: 0/00-0C-03
[   13.004584] scsi 0:0:3:0: Direct-Access     IBM      HUC101212CSS600  A5A0 PQ: 0 ANSI: 6
[   13.004589] scsi 0:0:3:0: Resource path: 0/00-0C-02
[   13.005458] scsi 0:1:0:0: No Device         IBM      IPR-0   5EC5BB00      PQ: 0 ANSI: 3
[   13.005463] scsi 0:1:0:0: Resource path: 0/FD-00
[   13.005943] scsi 0:1:1:0: No Device         IBM      IPR-0   5EC5BB00      PQ: 0 ANSI: 3
[   13.005948] scsi 0:1:1:0: Resource path: 0/FD-01
[   13.006447] scsi 0:1:2:0: No Device         IBM      IPR-0   5EC5BB00      PQ: 0 ANSI: 3
[   13.006452] scsi 0:1:2:0: Resource path: 0/FD-02
[   13.006976] scsi 0:1:3:0: No Device         IBM      IPR-0   5EC5BB00      PQ: 0 ANSI: 3
[   13.006981] scsi 0:1:3:0: Resource path: 0/FD-03
[   13.007490] scsi 0:2:0:0: Direct-Access     IBM      IPR-0   5EC5BB00      PQ: 0 ANSI: 3
[   13.007495] scsi 0:2:0:0: Resource path: 0/FC-03-00
[   13.007991] scsi 0:2:1:0: Direct-Access     IBM      IPR-0   5EC5BB00      PQ: 0 ANSI: 3
[   13.007996] scsi 0:2:1:0: Resource path: 0/FC-00-00
[   13.008491] scsi 0:2:2:0: Direct-Access     IBM      IPR-0   5EC5BB00      PQ: 0 ANSI: 3
[   13.008496] scsi 0:2:2:0: Resource path: 0/FC-01-00
[   13.008988] scsi 0:2:3:0: Direct-Access     IBM      IPR-0   5EC5BB00      PQ: 0 ANSI: 3
[   13.008993] scsi 0:2:3:0: Resource path: 0/FC-02-00
[   13.114659] ata1.00: ATAPI: IBM.    RMBO0140592, U51J, max UDMA/133
[   13.116035] ata1.00: configured for UDMA/133
[   13.116868] scsi 0:0:4:0: CD-ROM            IBM.     RMBO0140592      U51J PQ: 0 ANSI: 2
[   13.116874] scsi 0:0:4:0: Resource path: 0/00-0F
[   13.118242] scsi 0:0:5:0: Enclosure         IBM      PSBPD14M1 6GSAS  1007 PQ: 0 ANSI: 4
[   13.118247] scsi 0:0:5:0: Resource path: 0/00-08-18
[   13.119901] scsi 0:0:6:0: Enclosure         IBM      PSBPD14M1 6GSAS  1007 PQ: 0 ANSI: 4
[   13.119906] scsi 0:0:6:0: Resource path: 0/00-0C-18
[   21.057733] ipr 0004:01:00.0: Starting IOA initialization sequence.
[   21.058378] ipr 0004:01:00.0: Adapter firmware version: 0422003F
[   21.059001] ipr 0004:01:00.0: IOA initialized.
[   21.059593] scsi 1:255:255:255: No Device         IBM      57B3001SISIOA    0150 PQ: 0 ANSI: 0
[   21.071842] sd 0:2:0:0: [sda] Spinning up disk...
[   21.072232] sd 0:2:1:0: [sdb] Spinning up disk...
[   21.072621] sd 0:2:2:0: [sdc] Spinning up disk...
[   21.072772] sd 0:2:3:0: [sdd] Spinning up disk...
[   21.090134] sr 0:0:4:0: [sr0] scsi3-mmc drive: 24x/24x writer dvd-ram cd/rw xa/form2 cdda tray
[   21.090141] cdrom: Uniform CD-ROM driver Revision: 3.20
[   21.090386] sr 0:0:4:0: Attached scsi CD-ROM sr0
[   22.303954] .ready
[   22.304053] sd 0:2:0:0: [sda] 2231484416 512-byte logical blocks: (1.14 TB/1.04 TiB)
[   22.304057] sd 0:2:0:0: [sda] 4096-byte physical blocks
[   22.304255] sd 0:2:0:0: [sda] Write Protect is off
[   22.304259] sd 0:2:0:0: [sda] Mode Sense: 13 00 00 08
[   22.343958] .ready
[   22.344112] sd 0:2:1:0: [sdb] 2231484416 512-byte logical blocks: (1.14 TB/1.04 TiB)
[   22.344119] sd 0:2:1:0: [sdb] 4096-byte physical blocks
[   22.344325] sd 0:2:1:0: [sdb] Write Protect is off
[   22.344329] sd 0:2:1:0: [sdb] Mode Sense: 13 00 00 08
[   22.373950] .ready
[   22.374026] sd 0:2:2:0: [sdc] 2231484416 512-byte logical blocks: (1.14 TB/1.04 TiB)
[   22.374031] sd 0:2:2:0: [sdc] 4096-byte physical blocks
[   22.374237] sd 0:2:2:0: [sdc] Write Protect is off
[   22.374241] sd 0:2:2:0: [sdc] Mode Sense: 13 00 00 08
[   22.393950] .ready
[   22.394036] sd 0:2:3:0: [sdd] 2231484416 512-byte logical blocks: (1.14 TB/1.04 TiB)
[   22.394041] sd 0:2:3:0: [sdd] 4096-byte physical blocks
[   22.394247] sd 0:2:3:0: [sdd] Write Protect is off
[   22.394252] sd 0:2:3:0: [sdd] Mode Sense: 13 00 00 08
[   22.404023] sd 0:2:0:0: [sda] Cache data unavailable
[   22.404033] sd 0:2:0:0: [sda] Assuming drive cache: write through
[   22.444011] sd 0:2:1:0: [sdb] Cache data unavailable
[   22.444015] sd 0:2:1:0: [sdb] Assuming drive cache: write through
[   22.474013] sd 0:2:2:0: [sdc] Cache data unavailable
[   22.474019] sd 0:2:2:0: [sdc] Assuming drive cache: write through
[   22.494031] sd 0:2:3:0: [sdd] Cache data unavailable
[   22.494043] sd 0:2:3:0: [sdd] Assuming drive cache: write through
[   22.574293]  sda: sda1 sda2
[   22.617071]  sdb: sdb1 sdb2
[   22.619864] sd 0:2:1:0: [sdb] Attached SCSI disk
[   22.636860] sd 0:2:2:0: [sdc] Attached SCSI disk
[   22.656582] sd 0:2:3:0: [sdd] Attached SCSI disk
[   22.674105] sd 0:2:0:0: [sda] Attached SCSI disk
[   22.832865] device-mapper: ioctl: 4.34.0-ioctl (2015-10-28) initialised: dm-devel@redhat.com
[   23.013969] raid6: altivecx1 gen()  6466 MB/s
[   23.183965] raid6: altivecx2 gen() 13075 MB/s
[   23.353963] raid6: altivecx4 gen() 20001 MB/s
[   23.523952] raid6: altivecx8 gen() 28251 MB/s
[   23.693964] raid6: int64x1  gen()  3282 MB/s
[   23.863999] raid6: int64x1  xor()  1095 MB/s
[   24.033975] raid6: int64x2  gen()  5466 MB/s
[   24.203957] raid6: int64x2  xor()  2187 MB/s
[   24.373958] raid6: int64x4  gen() 10901 MB/s
[   24.543963] raid6: int64x4  xor()  3814 MB/s
[   24.713951] raid6: int64x8  gen()  6619 MB/s
[   24.883967] raid6: int64x8  xor()  2752 MB/s
[   24.883969] raid6: using algorithm altivecx8 gen() 28251 MB/s
[   24.883971] raid6: using intx1 recovery algorithm
[   24.884280] xor: measuring software checksum speed
[   24.983952]    8regs     : 19257.600 MB/sec
[   25.083960]    8regs_prefetch: 17376.000 MB/sec
[   25.183956]    32regs    : 20531.200 MB/sec
[   25.283949]    32regs_prefetch: 18240.000 MB/sec
[   25.383955]    altivec   : 29030.400 MB/sec
[   25.383958] xor: using function: altivec (29030.400 MB/sec)
[   25.394281] Btrfs loaded
[   25.395924] BTRFS: device fsid e29496d5-0080-4a01-9bde-b786944f4ba4 devid 2 transid 2012624 /dev/mapper/sda2
[   25.397280] BTRFS info (device dm-2): disk space caching is enabled
[   25.397286] BTRFS: has skinny extents
[   25.398071] BTRFS: failed to read the system array on dm-2
[   25.533957] BTRFS: open_ctree failed
[   25.725704] BTRFS: device fsid e29496d5-0080-4a01-9bde-b786944f4ba4 devid 1 transid 2012624 /dev/mapper/sdb2
[   25.726537] BTRFS info (device dm-2): disk space caching is enabled
[   25.726539] BTRFS: has skinny extents
[   25.732487] BTRFS: failed to read chunk tree on dm-2
[   25.853974] BTRFS: open_ctree failed

IBM says: "8150 Permanent IOA failure: Exchange the failing items in the Failing Items list one at a time.". Could explain why this problem appeared in the first place.
Checking now why btrfs tries to access dm-2 instead of the real devices. Maybe a race
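
To pull the relevant lines out of a long boot log like the one above, a filter along these lines can help. This is only a sketch: the function name and the saved log file are assumptions, and the patterns are simply the ones that match the ipr and BTRFS messages seen in this ticket.

```shell
# Print ipr adapter failures and BTRFS mount errors from a saved dmesg log.
# $1: path to the log file
storage_errors() {
    grep -E 'ipr .*(failure|error)|BTRFS: .*failed' "$1"
}

# Usage (hypothetical file name):
#   dmesg > /tmp/boot.log
#   storage_errors /tmp/boot.log
```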

#25 Updated by nicksinger 3 months ago

  • Status changed from In Progress to Resolved

Alright, I think I figured something out. Apparently petitboot creates a md snapshot before attempting to boot. We don't use dm_raid, BUT for some reason this keeps the device (/dev/sda or /dev/sdb) in a busy state for too long. While in that busy state btrfs tries to access all found devices. In that list, /dev/sda and /dev/sdb appear but are blocked by device mapper, which seems to map them for a short period to /dev/mapper/md2. btrfs then tries to access sda/sdb, and /dev/mapper/md2 gets detected as valid btrfs but is removed again by mdadm once it realizes it is no RAID device and destroys the node. Therefore neither mdadm nor btrfs is able to find devices and petitboot is left with nothing to boot. I tried several things to disable mdadm but didn't find anything suitable. But disabling the snapshots with nvram -p default --update-config "petitboot,snapshots?=false" seems to resolve the race well enough, and the system was able to successfully find and boot 3 times in a row. Resolving it for now. Please reopen if the same problem appears again.

#26 Updated by nicksinger 3 months ago

I've created PPC ticket #179273 to address the "Permanent IOA failure" (I/O adapter seems broken according to https://www.ibm.com/support/knowledgecenter/TI0002C/p8ebk/urc_tables.htm)

#27 Updated by okurz 2 months ago

  • Due date changed from 2020-10-27 to 2020-11-11
  • Status changed from Resolved to Feedback

It's super awesome that you found this, but before calling it "Resolved" we should at the very least add the machine back to salt.

Also, as this ticket was causing us quite some work - or at least discussions - I think we should try to improve a bit further. Maybe we can write something useful on a wiki? Or is this too specific? Should we report an upstream bug about it?

#28 Updated by cdywan 2 months ago

  • Due date changed from 2020-11-11 to 2020-11-13

#29 Updated by cdywan 2 months ago

  • Due date changed from 2020-11-13 to 2020-11-20

#30 Updated by favogt 2 months ago

I saw that power8 was stuck in petitboot again after yesterday's power outage. /dev/sda3 was mounted properly, but there was no entry for booting from it, probably due to https://bugzilla.opensuse.org/show_bug.cgi?id=1174166.

I ran kexec --load /var/petitboot/mnt/dev/sda3/boot/vmlinux --initrd /var/petitboot/mnt/dev/sda3/boot/initrd --append "root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M"
first, but the UUID changed. As the partition layout changed, the system was apparently reinstalled...

After a reboot, I edited grub2.cfg from petitboot by hand to remove the problematic section and it found the proper entry. It's booting now.

#31 Updated by favogt 2 months ago

I deleted 95_textmode and locked grub2 to keep it from breaking again in the future. This should be undone once the fix is released.

#32 Updated by okurz 2 months ago

  • Priority changed from Normal to High

maybe we need the same on powerqaworker-qam-1 as well as it's currently not up again?

#33 Updated by nicksinger about 2 months ago

  • Assignee changed from nicksinger to okurz

okurz wrote:

maybe we need the same on powerqaworker-qam-1 as well as it's currently not up again?

machine is up and running fine from what I see. 2 days uptime

#34 Updated by okurz about 2 months ago

ok. I triggered another reboot to verify. /etc/grub.d/95_textmode is still there so the workaround on power8 is not applied here at least. I will collect from this ticket what I consider relevant information for the future and document on a wiki.

#35 Updated by okurz about 2 months ago

  • Status changed from Feedback to Resolved

I have added some information to https://progress.opensuse.org/projects/openqav3/wiki/Wiki#PPC-specific-configurations

In general during the work on this ticket (or also during non-work waiting times) we learned that we as SUSE QE Tools team can still try to solve some problems ourselves and help with fixing hardware-near problems but in general we want to go back to what we have defined in https://progress.opensuse.org/projects/qa/wiki/Wiki#Out-of-scope which means: Less operative work with hardware-near issues. I have updated https://progress.opensuse.org/projects/qa/wiki/Wiki#How-we-work-on-our-backlog with best practices, e.g.: "For tickets which are out of the scope of the team remove from backlog, delegate to corresponding teams or persons but be nice and supportive, e.g. SUSE-IT, EngInfra also see SLA, test maintainer, QE-LSG PrjMgr/mgmt". I think this is enough "improvement for the future" here :)

#36 Updated by okurz about 2 months ago

  • Status changed from Resolved to Workable
  • Assignee deleted (okurz)

seems something is still missing, https://monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 does show "No data" for powerqaworker-qam-1.qa.suse.de . So we are still not back to where we started from :)

#37 Updated by okurz about 2 months ago

https://stats.openqa-monitor.qa.suse.de/d/WDpowerqaworker-qam-1/worker-dashboard-powerqaworker-qam-1?orgId=1&refresh=1m shows data so at least we receive some data from telegraf from powerqaworker-qam-1

#38 Updated by cdywan about 2 months ago

  • Due date changed from 2020-11-20 to 2020-12-09
  • Assignee set to cdywan

okurz wrote:

https://stats.openqa-monitor.qa.suse.de/d/WDpowerqaworker-qam-1/worker-dashboard-powerqaworker-qam-1?orgId=1&refresh=1m shows data so at least we receive some data from telegraf from powerqaworker-qam-1

So it seems like the machine is available so we're not quite back to square one, but grafana doesn't see it - need to find out how exactly it reports the status I guess 🤔

#39 Updated by nicksinger about 2 months ago

nicksinger wrote:

I've created PPC ticket #179273 to address the "Permanent IOA failure" (I/O adapter seems broken according to https://www.ibm.com/support/knowledgecenter/TI0002C/p8ebk/urc_tables.htm)

See https://progress.opensuse.org/issues/80688, which addresses the steps necessary to resolve the infra/arch ticket.

cdywan, as I can see data in grafana again, can we consider this done?

#41 Updated by okurz about 2 months ago

  • Related to action #80688: Upgrade IO firmware for powerqaworker-qam-1 added

#42 Updated by cdywan about 2 months ago

  • Assignee deleted (cdywan)

I tried to ask repeatedly where this is defined and still couldn't get an answer so I'm unassigning. I wouldn't even know how to remove the broken status at this point.

#43 Updated by mkittler about 1 month ago

  • Related to action #76984: Automatically remove assets+results based on available free space added

#44 Updated by okurz about 1 month ago

I have removed the salt key for powerqaworker-qam-1 from osd for now as this blocked osd deployment.

#45 Updated by okurz about 1 month ago

mkittler can you explain where you see a relation to #76984?

#46 Updated by cdywan about 1 month ago

  • Due date deleted (2020-12-09)

#47 Updated by cdywan about 1 month ago

  • Description updated (diff)

#48 Updated by cdywan about 1 month ago

  • Related to action #75238: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.2 added

#49 Updated by cdywan about 1 month ago

For reference, current working commandline should be:

kexec --load /var/petitboot/mnt/dev/sdc2/boot/vmlinux --initrd /var/petitboot/mnt/dev/sdc2/boot/initrd --append "root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M" && kexec -e

Except it won't boot anything 🤔 Also, interestingly enough, the device changed between reboots, it was sda2 before.

Note: This machine is also affected by the grub parsing bug now. I applied the work-around of removing the problematic block.

#50 Updated by okurz 1 day ago

To work with the changing device names use globbing:

kexec --load /var/petitboot/mnt/dev/*/boot/vmlinux --initrd /var/petitboot/mnt/dev/*/boot/initrd --append "root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M" && kexec -e
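
One caveat with plain globbing is that the pattern may match nothing, or more than one partition, in which case kexec would get unexpected arguments. The following is a minimal sketch of a guard for that case, written for a POSIX shell such as the one in petitboot. The names MNT_ROOT, resolve_one and build_kexec_cmd are hypothetical, and the helper only prints the command for review instead of running it.

```shell
#!/bin/sh
# Sketch: make sure a glob resolves to exactly one file before using it.
# MNT_ROOT, resolve_one and build_kexec_cmd are illustrative names.
MNT_ROOT="${MNT_ROOT:-/var/petitboot/mnt/dev}"

resolve_one() {
    # The caller passes an unquoted glob, so every match arrives as its
    # own argument. An unmatched glob arrives as the literal pattern,
    # which the -e check rejects.
    if [ "$#" -ne 1 ] || [ ! -e "$1" ]; then
        echo "expected exactly one match, got: $*" >&2
        return 1
    fi
    printf '%s\n' "$1"
}

build_kexec_cmd() {
    # $1 is the complete kernel command line to pass via --append.
    kernel=$(resolve_one "$MNT_ROOT"/*/boot/vmlinux) || return 1
    initrd=$(resolve_one "$MNT_ROOT"/*/boot/initrd) || return 1
    printf 'kexec --load %s --initrd %s --append "%s"\n' \
        "$kernel" "$initrd" "$1"
}
```

For example, build_kexec_cmd "root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec crashkernel=210M" prints the kexec --load line with the resolved paths, which can then be checked, run by hand and followed by kexec -e.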

#51 Updated by okurz 1 day ago

  • Subject changed from powerqaworker-qam-1 fails to come up on reboot to powerqaworker-qam-1 fails to come up on reboot (repeatedly)
  • Description updated (diff)
