action #68053
powerqaworker-qam-1 fails to come up on reboot (repeatedly)
0%
Description
Observation
system seems to be stuck in petitboot shell, how to continue?
Acceptance criteria
- AC1: DONE powerqaworker-qam-1 is up and running (connects to web UI, ssh) at least once
- AC2: DONE Machine has a defined owner
- AC3: Grafana OSD Status Overview shows data
- AC4: powerqaworker-qam-1 consistently comes up after reboots
Suggestions
debug over ipmi SOL connection.
Related issues
History
#1
Updated by okurz 7 months ago
- Copied from action #68050: openqaworker3 fails to come up on reboot, openqa_nvme_format.service failed added
#2
Updated by okurz 7 months ago
I have the suspicion that the kernel within petitboot with 4.4 is too old to read the current btrfs filesystem to chainload the kernel or so. Similar as described in #55607#note-6 I would try to kexec the current kernel but when within the petitboot shell trying to mount the root fs I get
$ mount /dev/sdb2 /var/petitboot/mnt/
mount: mounting /dev/sdb2 on /var/petitboot/mnt/ failed: Invalid argument
and dmesg shows
[65061.610075] BTRFS info (device sdb2): disk space caching is enabled
[65061.610081] BTRFS: has skinny extents
[65061.611689] BTRFS: failed to read chunk tree on sdb2
[65061.683029] BTRFS: open_ctree failed
#3
Updated by nicksinger 7 months ago
okurz wrote:
I have the suspicion that the kernel within petitboot with 4.4 is too old to read the current btrfs filesystem to chainload the kernel or so. Similar as described in #55607#note-6 I would try to kexec the current kernel but when within the petitboot shell trying to mount the root fs I get
$ mount /dev/sdb2 /var/petitboot/mnt/
mount: mounting /dev/sdb2 on /var/petitboot/mnt/ failed: Invalid argument
and dmesg shows
[65061.610075] BTRFS info (device sdb2): disk space caching is enabled
[65061.610081] BTRFS: has skinny extents
[65061.611689] BTRFS: failed to read chunk tree on sdb2
[65061.683029] BTRFS: open_ctree failed
What you could try is to mount the FS read-only. BTRFS sometimes refuses to mount if the journal is not intact. The dmesg output, however, looks more like some metadata is broken. According to the docs, btrfs rescue chunk-recover seems to help here, but I'd be careful if you really suspect that the old kernel version of petitboot is a problem: it could kill the FS.
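The read-only attempt suggested above could look like the following sketch. The device and mount point are the ones from the quoted comment. The helper name ro_mount_cmd is hypothetical, and it only prints the command so it can be reviewed before anything touches the disk.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a read-only mount command.
ro_mount_cmd() {
    dev=$1
    mnt=$2
    # "-o ro" requests a read-only mount.
    printf 'mount -o ro %s %s\n' "$dev" "$mnt"
}

# Device and mount point as used earlier in this ticket.
ro_mount_cmd /dev/sdb2 /var/petitboot/mnt/
```

Whether the petitboot kernel can read the filesystem at all is a separate question; this only avoids writing to it.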
#7
Updated by okurz 6 months ago
- Priority changed from Urgent to High
ok, for the time being we can live without the machine. We have removed the salt key, paused alerts, etc.
cdywan There was a question to you in https://infra.nue.suse.com/SelfService/Display.html?id=172392
if you do not know any more information please be a bit louder about it, e.g. ask in chat systems, various channels, mailing lists, research more, delegate, escalate, etc.
#10
Updated by okurz 6 months ago
- Priority changed from High to Normal
It is good to learn that no one saw the machine as a critical resource, so we can reduce the priority. As long as the machine is powered on and sucks energy I would not regard this as "Low". As soon as someone powers off the machine we can set the prio to "Low".
#12
Updated by cdywan 5 months ago
- Assignee changed from cdywan to nicksinger
Since the RT ticket is waiting on nicksinger now it probably makes sense to pass this ticket on.
#13
Updated by nicksinger 4 months ago
Alright so I updated Toni that the machine should be maintained by QAM. I also gave him a rough (physical) location for the machine (somewhere in the labs according to my mac address traces). This was on 25.08.2020 and since then no update from him but I guess he would need to poke around too so I grabbed the machine myself and had a look:
The machine has 4 disks:
/tmp # ls -lah /dev/sd*
brw-rw---- 1 root disk 8, 0 Sep 11 10:25 /dev/sda
brw-rw---- 1 root disk 8, 1 Sep 11 10:25 /dev/sda1
brw-rw---- 1 root disk 8, 2 Sep 11 10:25 /dev/sda2
brw-rw---- 1 root disk 8, 16 Sep 11 10:25 /dev/sdb
brw-rw---- 1 root disk 8, 32 Sep 11 10:25 /dev/sdc
brw-rw---- 1 root disk 8, 33 Sep 11 10:25 /dev/sdc1
brw-rw---- 1 root disk 8, 34 Sep 11 10:25 /dev/sdc2
brw-rw---- 1 root disk 8, 48 Sep 11 10:25 /dev/sdd
Two of them (sda and sdc) have an identical-looking layout:
/tmp # fdisk -l /dev/sda
Disk /dev/sda: 1142.5 GB, 1142520020992 bytes
128 heads, 32 sectors/track, 544796 cylinders
Units = cylinders of 4096 * 512 = 2097152 bytes
Device Boot Start End Blocks Id System
/dev/sda1 1 1028 2103296 82 Linux swap
Partition 1 does not end on cylinder boundary
/dev/sda2 1028 544796 1113637888 83 Linux
/tmp # fdisk -l /dev/sdc
Disk /dev/sdc: 1142.5 GB, 1142520020992 bytes
128 heads, 32 sectors/track, 544796 cylinders
Units = cylinders of 4096 * 512 = 2097152 bytes
Device Boot Start End Blocks Id System
/dev/sdc1 1 1028 2103296 82 Linux swap
Partition 1 does not end on cylinder boundary
/dev/sdc2 1028 544796 1113637888 83 Linux
Partition 2 does not end on cylinder boundary
both "Linux" partitions (sda2 and sdc2) can be mounted properly inside petitboot so Oli's initial guess shouldn't be the reason.
However, the other two disks (sdb and sdd) don't have any layout at all:
/tmp # fdisk -l /dev/sdb
Disk /dev/sdb: 1142.5 GB, 1142520020992 bytes
128 heads, 32 sectors/track, 544796 cylinders
Units = cylinders of 4096 * 512 = 2097152 bytes
Disk /dev/sdb doesn't contain a valid partition table
/tmp # fdisk -l /dev/sdd
Disk /dev/sdd: 1142.5 GB, 1142520020992 bytes
128 heads, 32 sectors/track, 544796 cylinders
Units = cylinders of 4096 * 512 = 2097152 bytes
Disk /dev/sdd doesn't contain a valid partition table
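To spot the disks without a partition table at a glance, the fdisk output could be filtered as in this sketch. The function name disks_without_table is hypothetical, and the pattern assumes the exact message wording shown in the fdisk output above; other fdisk versions may phrase it differently.

```shell
#!/bin/sh
# Hypothetical helper: read "fdisk -l" output on stdin and print every disk
# that fdisk reports as having no valid partition table.
disks_without_table() {
    sed -n "s/^Disk \(.*\) doesn't contain a valid partition table.*/\1/p"
}

# Example with the two messages from the output above.
printf '%s\n' \
    "Disk /dev/sdb doesn't contain a valid partition table" \
    "Disk /dev/sdd doesn't contain a valid partition table" |
    disks_without_table
```

On the machine itself the input would come from something like fdisk -l piped into the function.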
Doing a kexec looks promising (booting atm). What I also saw is that all of the machine's NICs are down. So even if we get Linux booting again, the machine will most likely be pretty unusable ;)
#14
Updated by nicksinger 4 months ago
Hah, the installed Linux booted and inside it even eth4 is up. I am currently letting a btrfs scrub run to check for any FS errors (I doubt that was the cause) and will then check how to reinstall grub. To me this all looks like grub2 destroyed itself once again…
#15
Updated by nicksinger 4 months ago
Scrub found no errors in the FS
powerqaworker-qam-1:/var/log # btrfs scrub status /
scrub status for e29496d5-0080-4a01-9bde-b786944f4ba4
scrub started at Fri Sep 11 12:52:51 2020 and finished after 01:24:50
total bytes scrubbed: 1.16TiB with 0 errors
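If the scrub result needs to be checked from a script, the error count can be pulled out of the summary line, roughly like this. The function name scrub_error_count is hypothetical, and the pattern only assumes the "with N errors" wording shown above; other btrfs-progs versions may format the status differently.

```shell
#!/bin/sh
# Hypothetical helper: read "btrfs scrub status" output on stdin and print
# the number from the "... with N errors" summary line.
scrub_error_count() {
    sed -n 's/.* with \([0-9][0-9]*\) errors.*/\1/p'
}

# Example using the summary line from the output above.
echo 'total bytes scrubbed: 1.16TiB with 0 errors' | scrub_error_count
```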
#16
Updated by nicksinger 4 months ago
I am starting to suspect that the BTRFS RAID is causing the problems here. Trying to reboot showed the same problem (OS could not be found). If you try to mount /dev/sda2 immediately after that, it also fails to mount with "unknown argument". However, if you mount /dev/sdc2 first, /dev/sda2 becomes "available" again and petitboot can find the installation (and boot from it). I have now started a full balance which will take some time. Let's see if this helps…
#18
Updated by okurz 3 months ago
- Due date set to 2020-10-27
Setting due date as a reminder for a ticket exceeding the expected cycle time. Currently I can't ping the machine. Over SoL I see the machine again stuck in petitboot. What did you do to make the kernel available in petitboot and boot Linux?
EDIT: 2020-10-23 21:47: I guess same as in https://progress.opensuse.org/issues/55607#note-6 we can do something like lookup kernel and initrd files in any available mount points and append the boot parameters found in …/boot/grub2/grub.cfg?
#19
Updated by nicksinger 3 months ago
Sorry for my late response. I was kind of upset about this ticket here - a too annoying issue :)
okurz wrote:
Setting due date as a reminder for a ticket exceeding the expected cycle time. Currently I can't ping the machine. Over SoL I see the machine again stuck in petitboot. What did you do to make the kernel available in petitboot and boot Linux?
EDIT: 2020-10-23 21:47: I guess same as in https://progress.opensuse.org/issues/55607#note-6 we can do something like lookup kernel and initrd files in any available mount points and append the boot parameters found in …/boot/grub2/grub.cfg?
You should be able to kexec the kernel from the existing /boot partition without actually digging out parameters in grub.cfg with kexec -l /boot/… --reuse-cmdline && kexec -e
#20
Updated by okurz 3 months ago
Mounting /dev/sdb2 failed with "Invalid argument" and dmesg shows some problem about btrfs I guess.
Tried:
kexec -l /var/petitboot/mnt/boot/vmlinux-4.12.14-lp151.28.48-default --reuse-cmdline && kexec -e
This stated that --reuse-cmdline is not supported, and kexec --help confirms this. So the best approach so far that should work is:
mount /dev/sda2 /var/petitboot/mnt/
mount /dev/sdb2 /var/petitboot/mnt/
kexec --load /var/petitboot/mnt/boot/vmlinux-4.12.14-lp151.28.48-default --initrd /var/petitboot/mnt/boot/initrd-4.12.14-lp151.28.48-default --append "root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M"
kexec --exec
The " around the --append arguments are important of course.
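To avoid retyping the long command over SoL, the load command could be assembled from its three variable parts, as in this sketch. The function name kexec_load_cmd and its parameters are hypothetical; it only prints the command and does not run kexec. The paths and kernel arguments are copied from the command above.

```shell
#!/bin/sh
# Hypothetical helper: print the kexec load command for a given mount point,
# kernel version and kernel command line. Nothing is executed here.
kexec_load_cmd() {
    mnt=$1
    version=$2
    cmdline=$3
    # The command line is wrapped in double quotes so that kexec receives it
    # as one single argument to --append.
    printf 'kexec --load %s/boot/vmlinux-%s --initrd %s/boot/initrd-%s --append "%s"\n' \
        "$mnt" "$version" "$mnt" "$version" "$cmdline"
}

kexec_load_cmd /var/petitboot/mnt 4.12.14-lp151.28.48-default \
    "root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M"
```

After reviewing the printed line, it can be pasted into the petitboot shell, followed by kexec --exec.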
Machine booted up fine. As it was last active months ago, I first ran systemctl stop openqa-worker.target telegraf salt-minion openqa-worker-cacheservice openqa-worker-cacheservice-minion to ensure it does not pick up jobs and run into version incompatibilities.
After another power reset, to try to fix the petitboot configuration in "System Configuration", I explicitly cleared the boot order and switched back to "Any Device", going all the way to the end to press "ok". I could then suddenly see the grub boot menu entries, and selecting the first entry correctly booted the system. After another power reset I could reproduce this. So I assume that petitboot either also suffers from the two disks not staying in a consistent order, or the disks "take some time" to be properly initialized and found. What I did now, after the boot menu entries from grub showed up in the main petitboot menu, was to go back into "System Configuration", clear all entries from "Boot Order" and add a device, selecting the new explicit entry "disk: sda2 [uuid: e29496d5-0080-4a01-9bde-b786944f4ba4]" with "Any Device" after that. But after another power reset this again did not work and I could not get the grub boot entries to show up in petitboot. It feels like hit and miss, maybe because petitboot is again struggling with inconsistent device ordering.
EDIT: I also added back the machine to salt as we at least have a good enough workaround with IPMI access and being able to manually trigger boots.
#21
Updated by nicksinger 3 months ago
I've also seen that behavior (hanging in petit for a few seconds before the disk can be found) on QA-Power8-5. So it seems quite "normal" for petitboot to drop you in the menu while it searches for boot-able disks in the background. I just wonder what exactly happened on qam-1 so this mechanism seems to time out (?) now.
#22
Updated by nicksinger 3 months ago
I'm currently trying to boot the machine for OSD deployment:
powernv-rng: Registering arch random hook.
opal-power: OPAL EPOW, DPO support detected.
HugeTLB registered 16 MB page size, pre-allocated 0 pages
HugeTLB registered 16 GB page size, pre-allocated 0 pages
iommu: Default domain type: Translated
vgaarb: loaded
Linux cec interface: v0.10
EDAC MC: Ver: 3.0.0
NetLabel: Initializing
NetLabel: domain hash size = 128
NetLabel: protocols = UNLABELED CIPSOv4 CALIPSO
NetLabel: unlabeled traffic allowed by default
clocksource: Switched to clocksource timebase
swapper/88 invoked oom-killer: gfp_mask=0x16040d0(GFP_TEMPORARY|__GFP_COMP|__GFP_NOTRACK), nodemask=(null), order=0, oom_score_adj=0
swapper/88 cpuset=/ mems_allowed=0
CPU: 88 PID: 1 Comm: swapper/88 Not tainted 4.12.14-lp151.28.48-default #1 openSUSE Leap 15.1
Call Trace:
[c00000000e57f440] [c000000008af6ec4] dump_stack+0xb0/0xf0 (unreliable)
[c00000000e57f480] [c000000008af4644] dump_header+0xc8/0x26c
[c00000000e57f550] [c0000000082f1104] out_of_memory+0x3e4/0x700
[c00000000e57f5f0] [c0000000082f92d4] __alloc_pages_nodemask+0x1004/0x10b0
[c00000000e57f7e0] [c00000000838e4b8] cache_grow_begin+0xc8/0x7d0
[c00000000e57f8a0] [c00000000838f21c] fallback_alloc+0x1dc/0x2c0
[c00000000e57f910] [c000000008390658] kmem_cache_alloc+0x1b8/0x2f0
[c00000000e57f970] [c0000000083fc624] alloc_inode+0xe4/0x110
[c00000000e57f9a0] [c0000000083ff624] new_inode_pseudo+0x24/0x90
[c00000000e57f9d0] [c0000000083ff6c0] new_inode+0x30/0x60
[c00000000e57fa00] [c00000000853c9a4] tracefs_get_inode+0x24/0x80
[c00000000e57fa30] [c00000000853d1ac] tracefs_create_file+0x6c/0x1d0
[c00000000e57fa70] [c0000000082593f4] trace_create_file+0x24/0x70
[c00000000e57fae0] [c00000000826fd0c] event_create_dir+0x16c/0xa90
[c00000000e57fba0] [c000000008e11cf4] event_trace_init+0x2dc/0x394
[c00000000e57fc40] [c00000000800f654] do_one_initcall+0x64/0x1d0
[c00000000e57fd00] [c000000008de44c4] kernel_init_freeable+0x298/0x38c
[c00000000e57fdc0] [c0000000080102ec] kernel_init+0x2c/0x160
[c00000000e57fe30] [c00000000800b5e0] ret_from_kernel_thread+0x5c/0x7c
Mem-Info:
active_anon:0 inactive_anon:0 isolated_anon:0
active_file:0 inactive_file:0 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:113 slab_unreclaimable:189
mapped:0 shmem:0 pagetables:0 bounce:0
free:784 free_pcp:0 free_cma:512
Node 0 active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Node 0 DMA free:50176kB min:17472kB low:17728kB high:17984kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:505856kB managed:109248kB mlocked:0kB slab_reclaimable:7232kB slab_unreclaimable:12096kB kernel_stack:2512kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:32768kB
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 4*64kB (UM) 4*128kB (M) 3*256kB (UM) 5*512kB (M) 3*1024kB (UM) 1*2048kB (M) 2*4096kB (UM) 0*8192kB 2*16384kB (C) = 50176kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16384kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB
0 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
7904 pages RAM
0 pages HighMem/MovableOnly
6197 pages reserved
512 pages cma reserved
0 pages hwpoisoned
[ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Kernel panic - not syncing: Out of memory and no killable processes...
CPU: 88 PID: 1 Comm: swapper/88 Not tainted 4.12.14-lp151.28.48-default #1 openSUSE Leap 15.1
Call Trace:
[c00000000e57f480] [c000000008af6ec4] dump_stack+0xb0/0xf0 (unreliable)
[c00000000e57f4c0] [c000000008af301c] panic+0x144/0x31c
[c00000000e57f550] [c0000000082f1110] out_of_memory+0x3f0/0x700
[c00000000e57f5f0] [c0000000082f92d4] __alloc_pages_nodemask+0x1004/0x10b0
[c00000000e57f7e0] [c00000000838e4b8] cache_grow_begin+0xc8/0x7d0
[c00000000e57f8a0] [c00000000838f21c] fallback_alloc+0x1dc/0x2c0
[c00000000e57f910] [c000000008390658] kmem_cache_alloc+0x1b8/0x2f0
[c00000000e57f970] [c0000000083fc624] alloc_inode+0xe4/0x110
[c00000000e57f9a0] [c0000000083ff624] new_inode_pseudo+0x24/0x90
[c00000000e57f9d0] [c0000000083ff6c0] new_inode+0x30/0x60
[c00000000e57fa00] [c00000000853c9a4] tracefs_get_inode+0x24/0x80
[c00000000e57fa30] [c00000000853d1ac] tracefs_create_file+0x6c/0x1d0
[c00000000e57fa70] [c0000000082593f4] trace_create_file+0x24/0x70
[c00000000e57fae0] [c00000000826fd0c] event_create_dir+0x16c/0xa90
[c00000000e57fba0] [c000000008e11cf4] event_trace_init+0x2dc/0x394
[c00000000e57fc40] [c00000000800f654] do_one_initcall+0x64/0x1d0
[c00000000e57fd00] [c000000008de44c4] kernel_init_freeable+0x298/0x38c
[c00000000e57fdc0] [c0000000080102ec] kernel_init+0x2c/0x160
[c00000000e57fe30] [c00000000800b5e0] ret_from_kernel_thread+0x5c/0x7c
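The memory figures in the log above can be cross-checked with a little arithmetic. This sketch assumes a 64 kB page size, which is not stated in the ticket but is suggested by the smallest block size in the "Node 0 DMA" line.

```shell
#!/bin/sh
# Assumption: 64 kB pages (not stated in the ticket).
page_kb=64
pages_ram=7904        # from "7904 pages RAM"
pages_reserved=6197   # from "6197 pages reserved"

# Total memory the kernel saw, and what was left after the reserved pages.
present_kb=$((pages_ram * page_kb))
managed_kb=$(((pages_ram - pages_reserved) * page_kb))

echo "present: ${present_kb} kB"
echo "managed: ${managed_kb} kB"
```

Under that assumption the results match the present:505856kB and managed:109248kB values in the log, so this kernel saw only about 494 MiB of RAM, of which roughly 107 MiB were usable.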
#23
Updated by nicksinger 3 months ago
disabled the worker for now with salt-key --include-accepted -r powerqaworker-qam-1
#24
Updated by nicksinger 3 months ago
[ 2.324075] tg3 0001:03:00.2 enP1p3s0f2: renamed from eth2
[ 2.404091] tg3 0001:03:00.3 enP1p3s0f3: renamed from eth3
[ 2.494016] tg3 0003:09:00.1 enP3p9s0f1: renamed from eth5
[ 2.534020] tg3 0003:09:00.0 enP3p9s0f0: renamed from eth4
[ 2.574026] tg3 0003:09:00.3 enP3p9s0f3: renamed from eth7
[ 2.614072] tg3 0003:09:00.2 enP3p9s0f2: renamed from eth6
[ 12.915336] ipr 0001:04:00.0: Starting IOA initialization sequence.
[ 12.916065] ipr 0001:04:00.0: Starting IOA initialization sequence.
[ 12.918703] ipr 0001:04:00.0: Adapter firmware version: 17518000
[ 12.920174] ipr 0001:04:00.0: IOA initialized.
[ 12.920273] ipr 0001:04:00.0: 8150: Permanent IOA failure
[ 12.920277] ipr: 00000000: 04448200 17518000 FFFFFFFF 15212204
[ 12.920280] ipr: 00000010: 50050760 5EC5BB00 00000000 00000000
[ 12.920283] ipr: 00000020: FEFFFFFF FFFFFFFF 00000000 00002D54
[ 12.920285] ipr: 00000030: 00000000 00000000 00000000 00000000
[ 12.920288] ipr: 00000040: 00000001 00000000 00000000 1521220E
[ 12.920290] ipr: 00000050: 00000708 00000000 00020000 2005001B
[ 12.920293] ipr: 00000060: 2005001E 80A8A6DE 00000000 00000000
[ 12.920295] ipr: 00000070: 00000001 03002675 779D0AC8 534A4600
[ 12.920298] ipr: 00000080: 10860000 BC0B0000 32010000 FFFFFFFF
[ 12.920300] ipr: 00000090: FFFFFFFF FFFFFFFF 2004DC40 283BC705
[ 12.920303] ipr: 000000A0: 2500FFFF FFFFFFFF FFFFFFFF FFFFFFFF
[ 12.920305] ipr: 000000B0: FFFFFFFF 10004C00 B5025B02 0E00FFFF
[ 12.920308] ipr: 000000C0: FFFFFFFF FFFFFFFF FFFF1C00 35303141
[ 12.920310] ipr: 000000D0: 43424532 312E312E 302E3030 30303052
[ 12.920313] ipr: 000000E0: 65763032 41475450 4C323846 30303030
[ 12.920315] ipr: 000000F0: 30313230 30303052 65763031 30303030
[ 12.920318] ipr: 00000100: 31333032 41474947 412D3032 30303030
[ 12.920320] ipr: 00000110: 30303030 2EFF0000 00000000 00000000
[ 12.920323] ipr: 00000120: 00000000 00000000 00000000 00000000
[ 12.920325] ipr: 00000130: 00000000 00000000 00000000 00000000
[ 12.920328] ipr: 00000140: 00000000 00000000 00000000 00000000
[ 12.920330] ipr: 00000150: 00000000 00000000 00000000 00000000
[ 12.920332] ipr: 00000160: 00000000 00000000 00000000 00000000
[ 12.920335] ipr: 00000170: 00000000 00000000 00000000 00000000
[ 12.920337] ipr: 00000180: 00000000 00000000 00000000 00000000
[ 12.920339] ipr: 00000190: 00000000 00000000 00000000 00000000
[ 12.920342] ipr: 000001A0: 00000000 00000000 00000000 00000000
[ 12.920344] ipr: 000001B0: 00000000 00000000 00000000 00000000
[ 12.920347] ipr: 000001C0: 00000000 00000000 00000000 00000000
[ 12.920349] ipr: 000001D0: 00000000 00000000 00000000 00000000
[ 12.920351] ipr: 000001E0: 00000000 00000000 00000000 00000000
[ 12.920354] ipr: 000001F0: 00000000 00000000 00000000 00000000
[ 12.920356] ipr: 00000200: 00000000 00000000 00000000 00000000
[ 12.920358] ipr: 00000210: 00000000 00000000 00000000 00000000
[ 12.920361] ipr: 00000220: 00000000 00000000 00000000 00000000
[ 12.920363] ipr: 00000230: 00000000 00000000 00000000 00000000
[ 12.920366] ipr: 00000240: 00000000 00000000 00000000 00000000
[ 12.920368] ipr: 00000250: 00000000 00000000 00000000 00000000
[ 12.920370] ipr: 00000260: 00000000 00000000 00000000 00000000
[ 12.920373] ipr: 00000270: 00000000 00000000 00000000 00000000
[ 12.920375] ipr: 00000280: 00000000 00000000 00000000 00000000
[ 12.920378] ipr: 00000290: 00000000 00000000 00000000 00000000
[ 12.920380] ipr: 000002A0: 00000000 00000000 00000000 00000000
[ 12.920382] ipr: 000002B0: 00000000 00000000 00000000 00000000
[ 12.920385] ipr: 000002C0: 00000000 00000000 00000000 00000000
[ 12.920387] ipr: 000002D0: 00000000 00000000 00000000 00000000
[ 12.920389] ipr: 000002E0: 00000000 00000000 00000000 00000000
[ 12.920392] ipr: 000002F0: 00000000 00000000 00000000 00000000
[ 12.920394] ipr: 00000300: 00000000 00000000 00000000 00000000
[ 12.920396] ipr: 00000310: 00000000 00000000 00000000 00000000
[ 12.920399] ipr: 00000320: 00000000 00000000 00000000 00000000
[ 12.920401] ipr: 00000330: 00000000 00000000 00000000 00000000
[ 12.920403] ipr: 00000340: 00000000 00000000 00000000 00000000
[ 12.920406] ipr: 00000350: 00000000 00000000 00000000 00000000
[ 12.920408] ipr: 00000360: 00000000 00000000 00000000 00000000
[ 12.920410] ipr: 00000370: 00000000 00000000 00000000 00000000
[ 12.920413] ipr: 00000380: 00000000 00000000 00000000 00000000
[ 12.920415] ipr: 00000390: 00000000 00000000 00000000 00000000
[ 12.920418] ipr: 000003A0: 00000000 00000000 00000000 00000000
[ 12.920420] ipr: 000003B0: 00000000 00000000 00000000 00000000
[ 12.920422] ipr: 000003C0: 00000000 00000000 00000000 00000000
[ 12.920425] ipr: 000003D0: 00000000 00000000 00000000 00000000
[ 12.920985] scsi 0:3:0:0: No Device IBM 57DC001SISIOA 0150 PQ: 0 ANSI: 0
[ 12.920992] scsi 0:3:0:0: Resource path: 0/FE
[ 12.923542] scsi 0:0:0:0: Direct-Access IBM HUC101212CSS600 A5A0 PQ: 0 ANSI: 6
[ 12.923547] scsi 0:0:0:0: Resource path: 0/00-0C-00
[ 12.936575] scsi 0:0:1:0: Direct-Access IBM HUC101212CSS600 A5A0 PQ: 0 ANSI: 6
[ 12.936580] scsi 0:0:1:0: Resource path: 0/00-0C-01
[ 12.946130] scsi 0:0:2:0: Direct-Access IBM HUC101212CSS600 A5A0 PQ: 0 ANSI: 6
[ 12.946136] scsi 0:0:2:0: Resource path: 0/00-0C-03
[ 13.004584] scsi 0:0:3:0: Direct-Access IBM HUC101212CSS600 A5A0 PQ: 0 ANSI: 6
[ 13.004589] scsi 0:0:3:0: Resource path: 0/00-0C-02
[ 13.005458] scsi 0:1:0:0: No Device IBM IPR-0 5EC5BB00 PQ: 0 ANSI: 3
[ 13.005463] scsi 0:1:0:0: Resource path: 0/FD-00
[ 13.005943] scsi 0:1:1:0: No Device IBM IPR-0 5EC5BB00 PQ: 0 ANSI: 3
[ 13.005948] scsi 0:1:1:0: Resource path: 0/FD-01
[ 13.006447] scsi 0:1:2:0: No Device IBM IPR-0 5EC5BB00 PQ: 0 ANSI: 3
[ 13.006452] scsi 0:1:2:0: Resource path: 0/FD-02
[ 13.006976] scsi 0:1:3:0: No Device IBM IPR-0 5EC5BB00 PQ: 0 ANSI: 3
[ 13.006981] scsi 0:1:3:0: Resource path: 0/FD-03
[ 13.007490] scsi 0:2:0:0: Direct-Access IBM IPR-0 5EC5BB00 PQ: 0 ANSI: 3
[ 13.007495] scsi 0:2:0:0: Resource path: 0/FC-03-00
[ 13.007991] scsi 0:2:1:0: Direct-Access IBM IPR-0 5EC5BB00 PQ: 0 ANSI: 3
[ 13.007996] scsi 0:2:1:0: Resource path: 0/FC-00-00
[ 13.008491] scsi 0:2:2:0: Direct-Access IBM IPR-0 5EC5BB00 PQ: 0 ANSI: 3
[ 13.008496] scsi 0:2:2:0: Resource path: 0/FC-01-00
[ 13.008988] scsi 0:2:3:0: Direct-Access IBM IPR-0 5EC5BB00 PQ: 0 ANSI: 3
[ 13.008993] scsi 0:2:3:0: Resource path: 0/FC-02-00
[ 13.114659] ata1.00: ATAPI: IBM. RMBO0140592, U51J, max UDMA/133
[ 13.116035] ata1.00: configured for UDMA/133
[ 13.116868] scsi 0:0:4:0: CD-ROM IBM. RMBO0140592 U51J PQ: 0 ANSI: 2
[ 13.116874] scsi 0:0:4:0: Resource path: 0/00-0F
[ 13.118242] scsi 0:0:5:0: Enclosure IBM PSBPD14M1 6GSAS 1007 PQ: 0 ANSI: 4
[ 13.118247] scsi 0:0:5:0: Resource path: 0/00-08-18
[ 13.119901] scsi 0:0:6:0: Enclosure IBM PSBPD14M1 6GSAS 1007 PQ: 0 ANSI: 4
[ 13.119906] scsi 0:0:6:0: Resource path: 0/00-0C-18
[ 21.057733] ipr 0004:01:00.0: Starting IOA initialization sequence.
[ 21.058378] ipr 0004:01:00.0: Adapter firmware version: 0422003F
[ 21.059001] ipr 0004:01:00.0: IOA initialized.
[ 21.059593] scsi 1:255:255:255: No Device IBM 57B3001SISIOA 0150 PQ: 0 ANSI: 0
[ 21.071842] sd 0:2:0:0: [sda] Spinning up disk...
[ 21.072232] sd 0:2:1:0: [sdb] Spinning up disk...
[ 21.072621] sd 0:2:2:0: [sdc] Spinning up disk...
[ 21.072772] sd 0:2:3:0: [sdd] Spinning up disk...
[ 21.090134] sr 0:0:4:0: [sr0] scsi3-mmc drive: 24x/24x writer dvd-ram cd/rw xa/form2 cdda tray
[ 21.090141] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 21.090386] sr 0:0:4:0: Attached scsi CD-ROM sr0
[ 22.303954] .ready
[ 22.304053] sd 0:2:0:0: [sda] 2231484416 512-byte logical blocks: (1.14 TB/1.04 TiB)
[ 22.304057] sd 0:2:0:0: [sda] 4096-byte physical blocks
[ 22.304255] sd 0:2:0:0: [sda] Write Protect is off
[ 22.304259] sd 0:2:0:0: [sda] Mode Sense: 13 00 00 08
[ 22.343958] .ready
[ 22.344112] sd 0:2:1:0: [sdb] 2231484416 512-byte logical blocks: (1.14 TB/1.04 TiB)
[ 22.344119] sd 0:2:1:0: [sdb] 4096-byte physical blocks
[ 22.344325] sd 0:2:1:0: [sdb] Write Protect is off
[ 22.344329] sd 0:2:1:0: [sdb] Mode Sense: 13 00 00 08
[ 22.373950] .ready
[ 22.374026] sd 0:2:2:0: [sdc] 2231484416 512-byte logical blocks: (1.14 TB/1.04 TiB)
[ 22.374031] sd 0:2:2:0: [sdc] 4096-byte physical blocks
[ 22.374237] sd 0:2:2:0: [sdc] Write Protect is off
[ 22.374241] sd 0:2:2:0: [sdc] Mode Sense: 13 00 00 08
[ 22.393950] .ready
[ 22.394036] sd 0:2:3:0: [sdd] 2231484416 512-byte logical blocks: (1.14 TB/1.04 TiB)
[ 22.394041] sd 0:2:3:0: [sdd] 4096-byte physical blocks
[ 22.394247] sd 0:2:3:0: [sdd] Write Protect is off
[ 22.394252] sd 0:2:3:0: [sdd] Mode Sense: 13 00 00 08
[ 22.404023] sd 0:2:0:0: [sda] Cache data unavailable
[ 22.404033] sd 0:2:0:0: [sda] Assuming drive cache: write through
[ 22.444011] sd 0:2:1:0: [sdb] Cache data unavailable
[ 22.444015] sd 0:2:1:0: [sdb] Assuming drive cache: write through
[ 22.474013] sd 0:2:2:0: [sdc] Cache data unavailable
[ 22.474019] sd 0:2:2:0: [sdc] Assuming drive cache: write through
[ 22.494031] sd 0:2:3:0: [sdd] Cache data unavailable
[ 22.494043] sd 0:2:3:0: [sdd] Assuming drive cache: write through
[ 22.574293] sda: sda1 sda2
[ 22.617071] sdb: sdb1 sdb2
[ 22.619864] sd 0:2:1:0: [sdb] Attached SCSI disk
[ 22.636860] sd 0:2:2:0: [sdc] Attached SCSI disk
[ 22.656582] sd 0:2:3:0: [sdd] Attached SCSI disk
[ 22.674105] sd 0:2:0:0: [sda] Attached SCSI disk
[ 22.832865] device-mapper: ioctl: 4.34.0-ioctl (2015-10-28) initialised: dm-devel@redhat.com
[ 23.013969] raid6: altivecx1 gen() 6466 MB/s
[ 23.183965] raid6: altivecx2 gen() 13075 MB/s
[ 23.353963] raid6: altivecx4 gen() 20001 MB/s
[ 23.523952] raid6: altivecx8 gen() 28251 MB/s
[ 23.693964] raid6: int64x1 gen() 3282 MB/s
[ 23.863999] raid6: int64x1 xor() 1095 MB/s
[ 24.033975] raid6: int64x2 gen() 5466 MB/s
[ 24.203957] raid6: int64x2 xor() 2187 MB/s
[ 24.373958] raid6: int64x4 gen() 10901 MB/s
[ 24.543963] raid6: int64x4 xor() 3814 MB/s
[ 24.713951] raid6: int64x8 gen() 6619 MB/s
[ 24.883967] raid6: int64x8 xor() 2752 MB/s
[ 24.883969] raid6: using algorithm altivecx8 gen() 28251 MB/s
[ 24.883971] raid6: using intx1 recovery algorithm
[ 24.884280] xor: measuring software checksum speed
[ 24.983952] 8regs : 19257.600 MB/sec
[ 25.083960] 8regs_prefetch: 17376.000 MB/sec
[ 25.183956] 32regs : 20531.200 MB/sec
[ 25.283949] 32regs_prefetch: 18240.000 MB/sec
[ 25.383955] altivec : 29030.400 MB/sec
[ 25.383958] xor: using function: altivec (29030.400 MB/sec)
[ 25.394281] Btrfs loaded
[ 25.395924] BTRFS: device fsid e29496d5-0080-4a01-9bde-b786944f4ba4 devid 2 transid 2012624 /dev/mapper/sda2
[ 25.397280] BTRFS info (device dm-2): disk space caching is enabled
[ 25.397286] BTRFS: has skinny extents
[ 25.398071] BTRFS: failed to read the system array on dm-2
[ 25.533957] BTRFS: open_ctree failed
[ 25.725704] BTRFS: device fsid e29496d5-0080-4a01-9bde-b786944f4ba4 devid 1 transid 2012624 /dev/mapper/sdb2
[ 25.726537] BTRFS info (device dm-2): disk space caching is enabled
[ 25.726539] BTRFS: has skinny extents
[ 25.732487] BTRFS: failed to read chunk tree on dm-2
[ 25.853974] BTRFS: open_ctree failed
IBM says: "8150 Permanent IOA failure: Exchange the failing items in the Failing Items list one at a time.". Could explain why this problem appeared in the first place.
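To find the affected adapter quickly in a long kernel log, the failure line can be filtered out as in this sketch. The function name ipr_failed_adapters is hypothetical, and the pattern only assumes the message format seen in the log above.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print the PCI
# address of every ipr adapter that reports a permanent IOA failure.
ipr_failed_adapters() {
    sed -n 's/.*ipr \([0-9a-f:.]*\): 8150: Permanent IOA failure.*/\1/p'
}

# Example using one line from the log above.
echo '[ 12.920273] ipr 0001:04:00.0: 8150: Permanent IOA failure' |
    ipr_failed_adapters
```

In the log above only the adapter at 0001:04:00.0 reports the failure; the second adapter at 0004:01:00.0 initializes without it.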
Checking now why btrfs tries to access dm-2 instead of the real devices. Maybe a race
#25
Updated by nicksinger 3 months ago
- Status changed from In Progress to Resolved
Alright, I think I figured something out. Apparently petitboot creates a md snapshot before attempting to boot. We don't use dm_raid BUT for some reason this keeps the device (/dev/sda or /dev/sdb) in a busy state for too long. While in that busy state btrfs tries to access all found devices. In that list, /dev/sda and /dev/sdb appear but are blocked by device mapper, which seems to map them for a short period to /dev/mapper/md2. btrfs then tries to access sda/sdb, and /dev/mapper/md2 gets detected as valid btrfs but is removed again by mdadm, which realizes it is not a RAID device and destroys the node again. Therefore neither mdadm nor btrfs is able to find devices and petitboot is left with nothing to boot. I tried several things to disable mdadm but didn't find anything suitable. But disabling the snapshots with nvram -p default --update-config "petitboot,snapshots?=false" seems to resolve the race well enough and the system was able to successfully find and boot 3 times in a row. Resolving it for now. Please reopen if the same problem appears again.
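A small wrapper around the nvram call above could make the setting easier to apply on other machines. This is only a sketch: the function name set_petitboot_snapshots and the APPLY switch are hypothetical, and accepting "true" as the opposite value is an assumption, since only "false" appears in this ticket.

```shell
#!/bin/sh
# Hypothetical wrapper around the nvram call from the comment above.
# It only prints the command unless APPLY=1 is set in the environment.
set_petitboot_snapshots() {
    value=$1
    case $value in
        true|false) ;;   # "true" is an assumption, only "false" is confirmed
        *) echo "usage: set_petitboot_snapshots true|false" >&2; return 1 ;;
    esac
    setting="petitboot,snapshots?=$value"
    if [ "${APPLY:-0}" = 1 ]; then
        nvram -p default --update-config "$setting"
    else
        printf 'nvram -p default --update-config "%s"\n' "$setting"
    fi
}

set_petitboot_snapshots false
```

Printing by default keeps the sketch safe to run on a machine where changing NVRAM is not intended.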
#26
Updated by nicksinger 3 months ago
I've created PPC ticket #179273 to address the "Permanent IOA failure" (I/O adapter seems broken according to https://www.ibm.com/support/knowledgecenter/TI0002C/p8ebk/urc_tables.htm)
#27
Updated by okurz 2 months ago
- Due date changed from 2020-10-27 to 2020-11-11
- Status changed from Resolved to Feedback
It's super awesome that you found this but before calling it "Resolved" we should at the very least add back to salt.
Also, as this ticket was causing us quite some work - or at least discussions - I think we should try to improve a bit further. Maybe we can write something useful on a wiki? Or is this too specific? Should we report an upstream bug about it?
#30
Updated by favogt 2 months ago
I saw that power8 was stuck in petitboot again after yesterday's power outage. /dev/sda3 was mounted properly, but there was no entry for booting from it, probably due to https://bugzilla.opensuse.org/show_bug.cgi?id=1174166.
I ran kexec --load /var/petitboot/mnt/dev/sda3/boot/vmlinux --initrd /var/petitboot/mnt/dev/sda3/boot/initrd --append "root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M" first, but the UUID changed. As the partition layout changed, the system was apparently reinstalled...
After a reboot, I edited grub2.cfg from petitboot by hand to remove the problematic section and it found the proper entry. It's booting now.
#33
Updated by nicksinger about 2 months ago
- Assignee changed from nicksinger to okurz
okurz wrote:
maybe we need the same on powerqaworker-qam-1 as well as it's currently not up again?
machine is up and running fine from what I see. 2 days uptime
#34
Updated by okurz about 2 months ago
ok. I triggered another reboot to verify. /etc/grub.d/95_textmode is still there so the workaround on power8 is not applied here at least. I will collect from this ticket what I consider relevant information for the future and document on a wiki.
#35
Updated by okurz about 2 months ago
- Status changed from Feedback to Resolved
I have added some information to https://progress.opensuse.org/projects/openqav3/wiki/Wiki#PPC-specific-configurations
In general during the work on this ticket (or also during non-work waiting times) we learned that we as SUSE QE Tools team can still try to solve some problems ourselves and help with fixing hardware-near problems but in general we want to go back to what we have defined in https://progress.opensuse.org/projects/qa/wiki/Wiki#Out-of-scope which means: Less operative work with hardware-near issues. I have updated https://progress.opensuse.org/projects/qa/wiki/Wiki#How-we-work-on-our-backlog with best practices, e.g.: "For tickets which are out of the scope of the team remove from backlog, delegate to corresponding teams or persons but be nice and supportive, e.g. SUSE-IT, EngInfra also see SLA, test maintainer, QE-LSG PrjMgr/mgmt". I think this is enough "improvement for the future" here :)
#36
Updated by okurz about 2 months ago
- Status changed from Resolved to Workable
- Assignee deleted (okurz)
seems something is still missing, https://monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 does show "No data" for powerqaworker-qam-1.qa.suse.de . So we are still not back to where we started from :)
#37
Updated by okurz about 2 months ago
https://stats.openqa-monitor.qa.suse.de/d/WDpowerqaworker-qam-1/worker-dashboard-powerqaworker-qam-1?orgId=1&refresh=1m shows data so at least we receive some data from telegraf from powerqaworker-qam-1
#38
Updated by cdywan about 2 months ago
- Due date changed from 2020-11-20 to 2020-12-09
- Assignee set to cdywan
okurz wrote:
https://stats.openqa-monitor.qa.suse.de/d/WDpowerqaworker-qam-1/worker-dashboard-powerqaworker-qam-1?orgId=1&refresh=1m shows data so at least we receive some data from telegraf from powerqaworker-qam-1
So it seems like the machine is available so we're not quite back to square one, but grafana doesn't see it - need to find out how exactly it reports the status I guess 🤔
#39
Updated by nicksinger about 2 months ago
nicksinger wrote:
I've created PPC ticket #179273 to address the "Permanent IOA failure" (I/O adapter seems broken according to https://www.ibm.com/support/knowledgecenter/TI0002C/p8ebk/urc_tables.htm)
See https://progress.opensuse.org/issues/80688, which addresses the steps necessary to resolve the infra/arch ticket.
cdywan as I can see data in grafana again, can we consider this done?
#40
Updated by okurz about 2 months ago
No, the problem is that https://monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?viewPanel=69&orgId=1 shows "No Data"
#41
Updated by okurz about 2 months ago
- Related to action #80688: Upgrade IO firmware for powerqaworker-qam-1 added
#42
Updated by cdywan about 2 months ago
- Assignee deleted (cdywan)
I tried to ask repeatedly where this is defined and still couldn't get an answer so I'm unassigning. I wouldn't even know how to remove the broken status at this point.
#43
Updated by mkittler about 1 month ago
- Related to action #76984: Automatically remove assets+results based on available free space added
#44
Updated by okurz about 1 month ago
I have removed the salt key for powerqaworker-qam-1 from osd for now as this blocked osd deployment.
#45
Updated by okurz about 1 month ago
#46
Updated by cdywan about 1 month ago
- Due date deleted (2020-12-09)
#47
Updated by cdywan about 1 month ago
- Description updated (diff)
#48
Updated by cdywan about 1 month ago
- Related to action #75238: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.2 added
#49
Updated by cdywan about 1 month ago
For reference, the current working command line should be:
kexec --load /var/petitboot/mnt/dev/sdc2/boot/vmlinux --initrd /var/petitboot/mnt/dev/sdc2/boot/initrd --append "root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M" && kexec -e
Except it won't boot anything 🤔 Also, interestingly enough, the device changed between reboots, it was sda2 before.
Note: This machine is also affected by the grub parsing bug now. I applied the work-around of removing the problematic block.
#50
Updated by okurz 1 day ago
To work with the changing device names use globbing:
kexec --load /var/petitboot/mnt/dev/*/boot/vmlinux --initrd /var/petitboot/mnt/dev/*/boot/initrd --append "root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M" && kexec -e
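One caveat with the glob: if more than one partition with a /boot directory is mounted, the pattern expands to several paths and kexec receives extra arguments. A guard like the following sketch could check for exactly one match first. The function name single_match is hypothetical, and the pattern must not contain spaces because it is expanded unquoted.

```shell
#!/bin/sh
# Hypothetical helper: print the single file matching a glob pattern, or fail
# if the pattern matches no file or more than one file.
single_match() {
    pattern=$1
    # Unquoted on purpose so that the shell expands the glob.
    set -- $pattern
    if [ "$#" -ne 1 ] || [ ! -e "$1" ]; then
        echo "expected exactly one match for $pattern" >&2
        return 1
    fi
    printf '%s\n' "$1"
}

# Example (commented out, only meaningful inside the petitboot shell):
# kernel=$(single_match '/var/petitboot/mnt/dev/*/boot/vmlinux') &&
# initrd=$(single_match '/var/petitboot/mnt/dev/*/boot/initrd') &&
# echo "kernel: $kernel initrd: $initrd"
```

The resolved paths can then be passed to kexec --load and --initrd in place of the globs.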