Project

General

Profile

action #109301

openqaworker14 + openqaworker15 sporadically get stuck on boot

Added by osukup 3 months ago. Updated 3 months ago.

Status:
Rejected
Priority:
Normal
Assignee:
Target version:
Start date:
2022-03-31
Due date:
% Done:

0%

Estimated time:

Description

OBSERVATION

on reboot time to time this workers fails to correctly boot ending in emergency mode:

bře 08 14:34:24 openqaworker14 kernel: Loading iSCSI transport class v2.0-870.
bře 08 14:34:24 openqaworker14 systemd[1]: Finished Create Volatile Files and Directories.
bře 08 14:34:24 openqaworker14 systemd[1]: Starting Security Auditing Service...
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1557]: NAME        MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1557]: nvme0n1     259:0    0  3.5T  0 disk
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1557]: ├─nvme0n1p1 259:1    0  512M  0 part
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1557]: ├─nvme0n1p2 259:2    0    1T  0 part  /
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1557]: └─nvme0n1p3 259:3    0  2.5T  0 part
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1557]:   └─md127     9:127  0  2.5T  0 raid0
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1552]: Stopping current RAID "/dev/md/openqa"
bře 08 14:34:24 openqaworker14 systemd[1]: Finished Flush Journal to Persistent Storage.
bře 08 14:34:24 openqaworker14 kernel: i40iw_open: i40iw_open completed
bře 08 14:34:24 openqaworker14 systemd[1]: Created slice Slice /system/rdma-load-modules.
bře 08 14:34:24 openqaworker14 systemd[1]: Starting Load RDMA modules from /etc/rdma/modules/iwarp.conf...
bře 08 14:34:24 openqaworker14 systemd[1]: Starting Load RDMA modules from /etc/rdma/modules/rdma.conf...
bře 08 14:34:24 openqaworker14 kernel: ixgbe 0000:d8:00.1: Multiqueue Enabled: Rx Queue count = 63, Tx Queue count = 63 XDP Queue count = 0
bře 08 14:34:24 openqaworker14 systemd[1]: Finished Load RDMA modules from /etc/rdma/modules/iwarp.conf.
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1559]: mdadm: stopped /dev/md/openqa
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1552]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1574]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1574]:        level=raid0 devices=1 ctime=Mon Mar  7 10:20:52 2022
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1574]: mdadm: unexpected failure opening /dev/md127
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1552]: Unable to create RAID, mdadm returned with non-zero code
bře 08 14:34:24 openqaworker14 kernel: i40iw_open: i40iw_open completed
bře 08 14:34:24 openqaworker14 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=1/FAILURE
bře 08 14:34:24 openqaworker14 systemd[1]: openqa_nvme_format.service: Failed with result 'exit-code'.
bře 08 14:34:24 openqaworker14 systemd[1]: Failed to start Setup NVMe before mounting it.
bře 08 14:34:24 openqaworker14 systemd[1]: Dependency failed for /var/lib/openqa.
bře 08 14:34:24 openqaworker14 systemd[1]: Dependency failed for openQA Worker #1.
bře 08 14:34:24 openqaworker14 systemd[1]: openqa-worker-auto-restart@1.service: Job openqa-worker-auto-restart@1.service/start failed with result 'dependency'.
bře 08 14:34:24 openqaworker14 systemd[1]: Dependency failed for var-lib-openqa-share.automount.
bře 08 14:34:24 openqaworker14 systemd[1]: var-lib-openqa-share.automount: Job var-lib-openqa-share.automount/start failed with result 'dependency'.
bře 08 14:34:24 openqaworker14 systemd[1]: Dependency failed for openQA Worker #3.
bře 08 14:34:24 openqaworker14 systemd[1]: openqa-worker-auto-restart@3.service: Job openqa-worker-auto-restart@3.service/start failed with result 'dependency'.
bře 08 14:34:24 openqaworker14 systemd[1]: Dependency failed for Prepare NVMe after mounting it.
bře 08 14:34:24 openqaworker14 systemd[1]: openqa_nvme_prepare.service: Job openqa_nvme_prepare.service/start failed with result 'dependency'.
bře 08 14:34:24 openqaworker14 systemd[1]: Dependency failed for Local File Systems.
bře 08 14:34:24 openqaworker14 systemd[1]: local-fs.target: Job local-fs.target/start failed with result 'dependency'.
bře 08 14:34:24 openqaworker14 systemd[1]: local-fs.target: Triggering OnFailure= dependencies.
bře 08 14:34:24 openqaworker14 systemd[1]: Dependency failed for openQA Worker #2.
bře 08 14:34:24 openqaworker14 systemd[1]: openqa-worker-auto-restart@2.service: Job openqa-worker-auto-restart@2.service/start failed with result 'dependency'.
bře 08 14:34:24 openqaworker14 systemd[1]: Dependency failed for openQA Worker #4.
bře 08 14:34:24 openqaworker14 systemd[1]: openqa-worker-auto-restart@4.service: Job openqa-worker-auto-restart@4.service/start failed with result 'dependency'.
bře 08 14:34:24 openqaworker14 systemd[1]: var-lib-openqa.mount: Job var-lib-openqa.mount/start failed with result 'dependency'.

Cause of problem is probably difference in hw configuration of this workers. Our standard workers have 1x HDD with OS and 1x name SSD with /dev/md/openQA. This workers have only one nvme SSD.
Configured as:

nvme0n1                                                                                                       
├─nvme0n1p1 vfat              FAT32                       9AED-277B                               506M     1% /boot/efi
├─nvme0n1p2 btrfs                                         5a405f4e-bd0c-46cb-a5ee-a0e976968be1 1016,5G     1% /
└─nvme0n1p3 linux_raid_member 1.2   openqaworker14:openqa 03972fdb-874d-cbec-4cb8-bca5412d90a2                
  └─md127   ext2              1.0                         4c30279b-d757-4a97-b636-539b18bc9e22    2,3T     0% /var/lib/openqa

Related issues

Related to openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:MResolved2022-01-17

History

#1 Updated by osukup 3 months ago

  • Related to action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:M added

#2 Updated by okurz 3 months ago

  • Status changed from New to Rejected
  • Assignee set to okurz
  • Target version set to Ready

this needs to be solved as part of #104970 as the two machines are not usable without. Please solve there.

Also available in: Atom PDF