action #117631

Failed systemd service transactional-update on openqaworker1 - system is no longer reachable after reboot size:M

Added by okurz 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Start date: 2022-10-05
Due date: 2022-11-04
% Done: 0%
Estimated time:

Description

Observation

# systemctl --failed
systemctl: /lib64/libselinux.so.1: no version information available (required by /usr/lib/systemd/libsystemd-shared-249.so)
systemctl: /lib64/libselinux.so.1: no version information available (required by /usr/lib64/libmount.so.1)
  UNIT                         LOAD   ACTIVE SUB    DESCRIPTION          
* var-lib-openqa-share.mount   loaded failed failed /var/lib/openqa/share
* transactional-update.service loaded failed failed Update the system

So services fail and there are weird error messages when executing commands, but openQA tests seem to be running so far.

Acceptance criteria

  • AC1: No failed services on openqaworker1
  • AC2: openqaworker1 reboots cleanly multiple times

Suggestions

  • Check over IPMI why the machine does not boot; if necessary, use a physical display and keyboard
  • Check the filesystem; potentially the hardware needs to be replaced
  • Check the failed services

Out of scope

  • Trying to recover the usual IPMI connection, see #117625

Related issues

Related to openQA Project - action #119077: openQA infrastructure issues for s390x and PowerPC (Resolved, 2022-10-19)

Related to openQA Project - action #119713: Leap tests are failing because of failed log file uploading in multiple tests on s390x size:M (Resolved, 2022-11-01)

History

#2 Updated by okurz 2 months ago

  • Tags set to next-office-day

#3 Updated by okurz 2 months ago

Over IPMI SOL I found:

GRUB loading...                                                                 
Welcome to GRUB!                                                                

/init: error while loading shared libraries: libsystemd-shared-249.so: cannot open shared object file: No such file or directory
[    1.631441][    T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
[    1.639888][    T1] CPU: 2 PID: 1 Comm: init Not tainted 5.14.21-150400.24.21-default #1 SLE15-SP4 7550826c4c7e8c258239e300508e0c8b2a69bad2
[    1.653809][    T1] Hardware name: Quanta Computer Inc D51B-2U (dual 1G LoM)/S2B-MB (dual 1G LoM), BIOS S2B_3A19 05/15/2015
[    1.666358][    T1] Call Trace:
[    1.670912][    T1]  <TASK>
[    1.675128][    T1]  dump_stack_lvl+0x45/0x5b
[    1.680912][    T1]  panic+0x105/0x2dd
[    1.686086][    T1]  do_exit+0x811/0xba0
[    1.691432][    T1]  do_group_exit+0x3a/0xa0

[    1.697111][    T1]  __x64_sys_exit_group+0x14/0x20                          
[    1.702182][    T1]  do_syscall_64+0x58/0x80                                 
[    1.706525][    T1]  ? do_syscall_64+0x67/0x80                               
[    1.711040][    T1]  ? exit_to_user_mode_prepare+0xfc/0x230                  
[    1.716691][    T1]  ? syscall_exit_to_user_mode+0x18/0x40                   
[    1.722249][    T1]  ? do_syscall_64+0x67/0x80                               
[    1.726761][    T1]  ? syscall_exit_to_user_mode+0x18/0x40                   
[    1.732310][    T1]  ? do_syscall_64+0x67/0x80                               
[    1.736823][    T1]  entry_SYSCALL_64_after_hwframe+0x61/0xcb               
[    1.742647][    T1] RIP: 0033:0x7f6f09359d36
[    1.746983][    T1] Code: 90 90 90 90 89 fa 41 b8 e7 00 00 00 be 3c 00 00 00 eb 10 90 89 d7 89 f0 0f 05 48 3d 00 f0 ff ff 77 22 f4 89 d7 44 89 c0 0f 05 <48> 3d 00 f0 ff ff 76 e2 f7 d8 89 05 ba e9 20 00 eb d8 0f 1f 84 00
[    1.766541][    T1] RSP: 002b:00007ffe625e9918 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7
[    1.774907][    T1] RAX: ffffffffffffffda RBX: 00007f6f09362820 RCX: 00007f6f09359d36
[    1.782844][    T1] RDX: 000000000000007f RSI: 000000000000003c RDI: 000000000000007f
[    1.790789][    T1] RBP: 00007ffe625ea290 R08: 00000000000000e7 R09: ffffffffffffffff
[    1.798719][    T1] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000001
[    1.806649][    T1] R13: 0000000000000001 R14: 00007f6f09564720 R15: 00007f6f09564710
[    1.814582][    T1]  </TASK>
[    1.817592][    T1] Kernel Offset: 0x1c800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    1.832146][    T1] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00 ]---

I did not even manage to enter the GRUB menu.

#4 Updated by okurz 2 months ago

I physically connected a VGA monitor, a USB keyboard and my personal USB recovery thumbdrive with grml.org. I booted into the grml live system in UEFI mode (BIOS mode left me at a grub rescue prompt, possibly broken). From there I could assemble the RAIDs, chroot into the system and conduct a proper system upgrade, in the hope that this also fixes problems with the kernel or the initrd:

mdadm --assemble /dev/md1
mkdir -p /mnt/md1
mount /dev/md1 /mnt/md1
for i in sys proc dev dev/pts run ; do mount --bind /$i /mnt/md1/$i; done
chroot /mnt/md1 /bin/bash

Within the chroot, zypper said something about Leap 15.3 but /etc/os-release said 15.4, so I ran

zypper --releasever=15.4 ref && zypper --releasever=15.4 dup

and after that another zypper dup call for consistency. I also ran zypper in --force kernel-default and saw the kernel and initramfs being configured.

I also found an error message about a '+' in /usr/lib/plymouth/plymouth-populate-initrd. I fixed that, but it is likely a non-critical warning because I see the same locally. dheidler also mentioned that he saw it recently.

To check the installation I called

qemu-system-x86_64 -enable-kvm -nographic -snapshot -hda /dev/sdb -hdb /dev/sdc

and could observe the system showing the GRUB menu fine and also loading the initial ramdisk, but it did not progress from there within a minute, so I aborted and tried the full system boot again. The system now booted fine, and so did a second try. After boot I could see that var-lib-openqa-share.mount is listed as failed. But starting the mount unit manually works fine, so the problem seems to be that the system does not wait properly for the network to be fully up.

In the journal I can find:

Oct 06 13:54:57 openqaworker1 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 3162 (runc:[2:INIT])
Oct 06 13:54:57 openqaworker1 systemd[1]: Mounting /var/lib/openqa/share...
Oct 06 13:54:57 openqaworker1 kernel: new mount options do not match the existing superblock, will be ignored
Oct 06 13:54:57 openqaworker1 podman[2386]: time="2022-10-06T13:54:57+02:00" level=warning msg="Path \"/etc/SUSEConnect\" from \"/etc/containers/mounts.conf>
Oct 06 13:54:57 openqaworker1 podman[2386]: time="2022-10-06T13:54:57+02:00" level=warning msg="Path \"/etc/zypp/credentials.d/SCCcredentials\" from \"/etc/>
Oct 06 13:54:57 openqaworker1 systemd[1]: Started libcontainer container b68b0d43c4e80b3d0091bc79205b1a88c39636f79db0056835764d7e1c0bd8e8.
Oct 06 13:54:57 openqaworker1 kernel: new mount options do not match the existing superblock, will be ignored
Oct 06 13:54:57 openqaworker1 podman[2389]: time="2022-10-06T13:54:57+02:00" level=warning msg="Path \"/etc/SUSEConnect\" from \"/etc/containers/mounts.conf>
Oct 06 13:54:57 openqaworker1 podman[2389]: time="2022-10-06T13:54:57+02:00" level=warning msg="Path \"/etc/zypp/credentials.d/SCCcredentials\" from \"/etc/>
Oct 06 13:54:57 openqaworker1 systemd[1]: Started libcontainer container 939b483a9345e8d31495b6e95ac8da570e96f43cb775cdfa2b66ac78d9680cf0.
Oct 06 13:54:57 openqaworker1 kernel: new mount options do not match the existing superblock, will be ignored
Oct 06 13:54:57 openqaworker1 mount[3192]: mount.nfs4: Failed to resolve server openqa1-opensuse: Name or service not known
Oct 06 13:54:57 openqaworker1 systemd[1]: var-lib-openqa-share.mount: Mount process exited, code=exited, status=32/n/a
Oct 06 13:54:57 openqaworker1 systemd[1]: var-lib-openqa-share.mount: Failed with result 'exit-code'.
Oct 06 13:54:57 openqaworker1 systemd[1]: Failed to mount /var/lib/openqa/share.
…
Oct 06 13:54:58 openqaworker1 podman[2387]: Error: OCI runtime error: unable to start container "d1b423e9147fd70ec60588364bddf79e0ad41b21feb67659c4ec056c8ba>
Oct 06 13:54:58 openqaworker1 podman[3243]: 2022-10-06 13:54:58.189706039 +0200 CEST m=+0.572331437 container cleanup d1b423e9147fd70ec60588364bddf79e0ad41b>
Oct 06 13:54:58 openqaworker1 systemd[1]: container-openqaworker1_container_102.service: Control process exited, code=exited, status=125/n/a

So the automount triggers the mounting, which cannot yet succeed in this state of the system, and then the containers fail. I do not know how to continue.
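One way to mitigate this, sketched here as an idea rather than something applied to the machine, would be a systemd drop-in that orders the NFS mount after name resolution is expected to work. The drop-in path is derived from the unit name seen above; whether network-online.target behaves reliably here is exactly what is in question:

```ini
# /etc/systemd/system/var-lib-openqa-share.mount.d/wait-for-network.conf
# Hypothetical drop-in: delay the NFS mount until network-online.target and
# nss-lookup.target are reached, so "openqa1-opensuse" can be resolved.
[Unit]
After=network-online.target nss-lookup.target
Wants=network-online.target
```

Note that this only helps if network-online.target is actually trustworthy on this system.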

#5 Updated by openqa_review 2 months ago

  • Due date set to 2022-10-21

Setting due date based on mean cycle time of SUSE QE Tools

#6 Updated by okurz 2 months ago

  • Due date deleted (2022-10-21)
  • Status changed from In Progress to New
  • Assignee deleted (okurz)
  • Priority changed from Urgent to High

#7 Updated by cdywan about 2 months ago

  • Subject changed from Failed systemd service transactional-update on openqaworker1 - system is no longer reachable after reboot to Failed systemd service transactional-update on openqaworker1 - system is no longer reachable after reboot size:M
  • Description updated (diff)
  • Status changed from New to Workable

#8 Updated by okurz about 2 months ago

  • Priority changed from High to Urgent

With #119077 this became urgent.

#9 Updated by okurz about 2 months ago

  • Related to action #119077: openQA infrastructure issues for s390x and PowerPC added

#10 Updated by dheidler about 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler

#11 Updated by openqa_review about 2 months ago

  • Due date set to 2022-11-04

Setting due date based on mean cycle time of SUSE QE Tools

#12 Updated by okurz about 2 months ago

There are different problems:

  1. We cannot rely on the systemd target network-online.target. Custom scripting is necessary to wait until the network is actually available
  2. Do not rely on NFS mounts on openQA workers; use the cache service instead, or just clone what you need from git
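A minimal sketch of such custom scripting for point 1, using the NFS server name from the journal above; the helper name and the timeout value are made up for illustration:

```shell
#!/bin/sh
# wait_for_host: poll until a host name resolves, with a timeout in seconds.
# Hypothetical helper; getent is used so that /etc/hosts entries count too.
wait_for_host() {
    host=$1
    timeout=${2:-120}
    waited=0
    until getent hosts "$host" > /dev/null 2>&1; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# On the worker this would be along the lines of:
# wait_for_host openqa1-opensuse 120 && mount /var/lib/openqa/share
wait_for_host localhost 5 && echo "localhost resolvable"
```

Such a script could run from a oneshot service ordered before the mount unit instead of trusting network-online.target.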

#13 Updated by dheidler about 2 months ago

Marius in Slack:

About the s390x containers:

  1. For the problem of knowing what dependencies to install, one could actually rsync the directory with a filter to only look for install_deps. Or we create a specific container for the opensuse test distribution on OBS that has the required dependencies already installed. Likely that would be the cleanest solution, as it would also avoid running zypper manually at all.
  2. For the problem of providing tests/needles we should enable the cache service for those workers. For this to work we need to map port 9530 and the cache directory (by default /var/lib/openqa/cache/) into the container.

https://github.com/os-autoinst/openQA/pull/4854
https://github.com/os-autoinst/openQA/pull/4855

#14 Updated by dheidler about 2 months ago

So the idea is the following:

We created a new container that is based on the existing openQA worker container but is distri-opensuse specific and installs the dependencies RPM on startup (not at container build time, as the test distribution might change and we don't want to update the container base image in that case).

The new containers with that new base image will now use the cache service. The cache service on the host was configured to be reachable from containers, and the cache dir is mounted into the containers. The worker code within the container needs a PR that allows pointing it to a cache service that is not listening on 127.0.0.1.

All of the above can only be used once the PRs are accepted and an updated container image is available. Until then we manually fixed the current setup with NFS, which will work until the next reboot, so that openQA can process some jobs in the meantime.
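In workers.ini terms, the container-side configuration could look roughly like the sketch below. This is an assumption of how the setup looks once the PR is merged; the host address is a placeholder for however the container reaches the host:

```ini
# /etc/openqa/workers.ini inside the worker container (sketch)
[global]
# Cache directory that is bind-mounted from the host:
CACHEDIRECTORY = /var/lib/openqa/cache
# Cache service on the container host instead of 127.0.0.1 (placeholder IP):
CACHESERVICEURL = http://10.0.2.2:9530
```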

#16 Updated by okurz about 1 month ago

https://github.com/os-autoinst/openQA/pull/4862 merged.

In a related discussion https://suse.slack.com/archives/C02CANHLANP/p1666359531628739 fvogt provided related issue reports https://github.com/openSUSE/wicked/pull/836, https://bugzilla.suse.com/show_bug.cgi?id=1172684 and https://jira.suse.com/browse/PM-1982, confirming the hypothesis that wicked's handling of the systemd targets network.target and in particular network-online.target is unexpected, unreliable and not time-optimal.

I see two ways to go forward:

I prefer the latter

#17 Updated by dheidler about 1 month ago

  • Status changed from In Progress to Resolved

The containers are now coming up on boot and use the cache service of the host.

For the network stack replacement it would not be trivial, as there are very many tap devices to be configured.
Maybe that large number of devices to bring up is the reason why the network comes up so slowly in the first place.
Watching the bootup via the IPMI console, I could see that the network only comes up about half a minute after there is already a login prompt on the tty.

Anyway, that would be a topic for a different ticket.

#18 Updated by okurz about 1 month ago

  • Related to action #119713: Leap tests are failing because of failed log file uploading in multiple tests on s390x size:M added
