action #104970

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

openQA Project - coordination #109659: [epic] More remote workers

Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:M

Added by okurz 4 months ago. Updated 14 days ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
2022-01-17
Due date:
% Done:

0%

Estimated time:

Description

Motivation

Related to #97862. Two machines have been ordered and are prepared in the SUSE CZ office location server room to be used within the OSD infrastructure. okurz suggested using these two machines as "generic" workers along with their specialization within the OSD infrastructure.

Acceptance criteria

  • AC1: openqaworker14+openqaworker15 are used by OSD production jobs
  • AC2: Both machines are maintained the same way as our other x86_64 OSD workers
  • AC3: No significant impact due to the remote connection between NBG and PRG server room locations
  • AC4: Alerts for those workers have been resumed and are not firing anymore

Suggestions

  • Follow https://sd.suse.com/servicedesk/customer/portal/1/SD-70400
  • Ensure that a valid OS is installed
  • Ensure that machines can be remote controlled
  • Deploy machines as part of generic salt-controlled OSD infrastructure
  • Configure openQA worker instances on these machines for testing purposes
  • Test out their operation and take special care due to the remoteness between NBG (osd webUI) and PRG (location of these two machines)
  • If all good include as generic workers as well as with special sap worker class settings
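For the last step, the worker class can be layered in the workers.ini of each machine. A minimal sketch, assuming a dedicated sap class on top of the generic one (on OSD this file is salt-managed, and the class name qemu_x86_64_sap is only an example):

```ini
# /etc/openqa/workers.ini on openqaworker14/15 (sketch; class names are examples)
[global]
HOST = http://openqa.suse.de
# comma-separated: generic x86_64 class plus a dedicated sap class
WORKER_CLASS = qemu_x86_64,qemu_x86_64_sap
```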

Related issues

Related to openQA Project - action #107497: [qe-tools] openqaworker14 and openqa15 - developer mode fails size:M (Workable, 2022-02-24)

Related to openQA Infrastructure - action #109301: openqaworker14 + openqaworker15 sporadically get stuck on boot (Rejected, 2022-03-31)

Related to openQA Infrastructure - action #110467: Establish reliable tap setup on ow14 (New, 2022-04-29)

Copied from openQA Infrastructure - action #97862: More openQA worker hardware for OSD size:M (Blocked)

History

#1 Updated by okurz 4 months ago

  • Copied from action #97862: More openQA worker hardware for OSD size:M added

#2 Updated by okurz 4 months ago

  • Status changed from New to Feedback
  • Assignee set to okurz

I provided my GPG key to mcaj for the credentials. Let's await that first.

#3 Updated by okurz 4 months ago

  • Status changed from Feedback to New
  • Assignee deleted (okurz)

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/381 includes the IPMI credentials, which can be a starting point for others to continue.

#4 Updated by okurz 4 months ago

Both machines have openSUSE Leap 15.3 preinstalled with our common, old, known root password for the time being.

#5 Updated by okurz 3 months ago

  • Priority changed from Normal to High

#6 Updated by okurz 3 months ago

  • Subject changed from Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing to Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:M
  • Status changed from New to Workable

#7 Updated by cdywan 3 months ago

  • Assignee set to osukup

osukup is looking forward to taking it :-D

#8 Updated by osukup 3 months ago

so ... the problems start:

  1. salt-minion: with the default /etc/salt/minion it refuses to connect to the master (it always tries localhost)
     • workaround: remove the minion config
  2. There is only one disk (3.5 TB NVMe, formatted as a single btrfs volume), so openqa-format_nvme.service can't really work
     • workaround: systemctl edit ....
  3. It is connected via a 1 Gbit interface (the server itself has 10G, not connected anywhere) and the connection to NUE is too slow for any openQA job :(

--> please move these two workers to NUE
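Instead of deleting the whole minion config, an alternative would be to point the minion at the master explicitly via salt's drop-in directory. A sketch, not tested on these machines:

```yaml
# /etc/salt/minion.d/master.conf (sketch): make the minion connect to OSD
# instead of the default localhost
master: openqa.suse.de
```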

#9 Updated by osukup 3 months ago

openqaworker14 didn't survive a reboot - reinstalling :( - maybe a better disk layout would help?

#10 Updated by osukup 3 months ago

so, after a "successful" state.apply the system goes into maintenance mode after reboot :(

culprit -> the openQA NVMe prepare service

#11 Updated by osukup 3 months ago

  • Status changed from Workable to In Progress

#12 Updated by openqa_review 3 months ago

  • Due date set to 2022-03-04

Setting due date based on mean cycle time of SUSE QE Tools

#13 Updated by okurz 3 months ago

as discussed please put in MAX_SETUP_TIME with something like 10h for the slow connection, change the worker class temporarily to not break production tests, and restart the tests that incompleted on this worker so that they end up on other machines
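For reference, MAX_SETUP_TIME is normally given in seconds, so a 10h limit would be a numeric value in the job or machine settings. A hedged sketch, assuming the seconds-based interpretation:

```ini
# sketch: raise the setup timeout for jobs on these workers
# (assuming MAX_SETUP_TIME is given in seconds; 10 h = 36000 s)
MAX_SETUP_TIME = 36000
```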

#14 Updated by okurz 3 months ago

The latter suggestion was done with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/388 . After the merge I called

for i in openqaworker14 openqaworker15; do failed_since=2022-02-17 result="result='incomplete'" host=openqa.suse.de WORKER=$i bash -ex openqa-advanced-retrigger-jobs; done

I propose a higher cache limit as discussed also in one of our meetings last week: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/389

#15 Updated by okurz 3 months ago

Currently the OSD deployment is failing because w14+w15 are not fully ready yet, so I temporarily removed them from the deployment again. osukup, you can add them back when you continue here.

#16 Updated by mkittler 3 months ago

  • Description updated (diff)

I paused the corresponding host-up alerts.

#17 Updated by osukup 3 months ago

Both hosts are deployed with salt, added to salt, and the osd-deployment works.

Due to the difference between standard openQA workers and these machines (only one NVMe disk), both were reinstalled with this disk layout:

nvme0n1     259:0    0  3.5T  0 disk  
├─nvme0n1p1 259:1    0  512M  0 part  /boot/efi
├─nvme0n1p2 259:2    0    1T  0 part  /var
└─nvme0n1p3 259:3    0  2.5T  0 part 

nvme0n1p3 was set up during installation as a raw volume with the Linux RAID partition type

after installation it was initialized with:

mdadm --create /dev/md/openqa --level=0 --raid-devices=1 --force  --run /dev/nvme0n1p3
mkfs.ext2 /dev/md/openqa

added this line to fstab:
/dev/md/openqa /var/lib/openqa/ ext2 defaults 0 0

created an /etc/mdadm.conf file with:

DEVICE containers partitions
ARRAY /dev/md/openqa name=openqaworker14:openqa

of course, on openqaworker15 the name is openqaworker15:openqa

then ran dracut --force

with these steps the host can be rebooted without problems (without mdadm.conf the reboot ends in maintenance mode because mdadm doesn't create the link /dev/md/openqa)

the rest was done according to gitlab/openqa/openqa-salt-pillars with these changes:

  1. remove /etc/salt/minion (with that config file in place, salt-minion refused to connect to openqa.suse.de)
  2. create /etc/salt/grains with roles: worker

test run on openqaworker{14,15}: ran 100 jobs on both machines. The first job usually takes a pretty long time to set up the cache (average download speed is below 5 MB/s), but the results are stable and without incomplete jobs

during testing I found a problem with the websockets service: https://progress.opensuse.org/issues/107497

NEXT -> set the tap class on the workers and ask qa-sap for multi-machine test jobs on the workers

#18 Updated by okurz 3 months ago

  • Related to action #107497: [qe-tools] openqaworker14 and openqa15 - developer mode fails size:M added

#19 Updated by okurz 3 months ago

Sounds good. It shouldn't be necessary to conduct this many manual steps. We already have machines that only have a single NVMe and no other storage device, e.g. openqaworker8. There we use the same salt states with the same NVMe setup scripts. As part of the salt states we also enable automatic updates with automatic reboots, so better check that reboots work consistently.

Regarding the bandwidth: running while true; do dd bs=100M count=20 if=/dev/zero | nc -l 42420; done on OSD and timeout 20 nc openqa.suse.de 42420 | dd of=/dev/null status=progress locally on my DSL-connected work notebook over VPN, I get 5.0 MB/s. On openqaworker14 I get 2.5 MB/s. So the SUSE inter-office connection is 2x slower than my home connection. That does not sound right. I suggest we crosscheck with SUSE-IT EngInfra what we can expect. You can do that as part of this ticket or create a dedicated ticket for it, because this would be tuning the performance, not establishing the general possibility to use these machines.

Please link progress tickets in the format #<id> so that a direct link with preview appears. Also, at best link such tickets directly as "related" tickets.

#20 Updated by okurz 2 months ago

  • Due date changed from 2022-03-04 to 2022-03-11

As discussed in the weekly osukup will contact EngInfra regarding network speed and continue with the salt states merge request regarding tty config.

#21 Updated by okurz 2 months ago

  • Due date deleted (2022-03-11)
  • Status changed from In Progress to Blocked

Setting to blocked on the ticket you created: https://sd.suse.com/servicedesk/customer/portal/1/SD-79330

#22 Updated by osukup about 2 months ago

The NUE-PRG bandwidth looks better now: instead of 1-2 MB/s we now see speeds of about 20 MB/s.

#23 Updated by osukup about 2 months ago

  • Related to action #109301: openqaworker14 + openqaworker15 sporadically get stuck on boot added

#24 Updated by osukup about 1 month ago

as mentioned in #109301 there are boot problems for these two workers:

-> the setup of /dev/md/openqa fails from time to time during boot

#25 Updated by okurz about 1 month ago

  • Status changed from Blocked to Workable

As commented in #109301 this needs to be solved as part of this ticket, as the two machines are not usable without it. Please solve it here.

#26 Updated by okurz about 1 month ago

  • Parent task set to #109659

#27 Updated by cdywan about 1 month ago

  • Due date set to 2022-04-15

I suggest we discuss this in the unblock since Ondřej is still fighting with some disk setup issues, I didn't see much conversation, and there's no update here.

#28 Updated by okurz about 1 month ago

  • Due date changed from 2022-04-15 to 2022-04-29
  • Assignee changed from osukup to mkittler
  • The workers get stuck formatting the NVMe
    • Error stopping the systemd service
    • osukup tried to reproduce the problems without rebooting, and everything works when starting the openQA NVMe format+preparation services on the running system. So the issues still need to be reproduced (they apparently only occur during boot)
    • We have what should be the same setup on openqaworker8 (and 9) with just a single NVMe and no other devices. There the services openqa_nvme_format+openqa_nvme_prepare work fine across reboots

Suggestion by okurz:

  • mkittler tries reboot tests on openqaworker14+15 where the systems get stuck, to fix the race condition that osukup identified

#29 Updated by mkittler about 1 month ago

I changed the kernel boot parameters so we'll get a tty/emergency shell and also some output via IPMI SOL.

This is how it looks in the error case:

Apr 13 12:39:50 openqaworker14 systemd[1]: Starting Setup NVMe before mounting it...
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1537]: Current mount points (printed for debugging purposes):
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: devtmpfs on /dev type devtmpfs (rw,nosuid,noexec,size=4096k,nr_inodes=1048576,mode=755)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: tmpfs on /run type tmpfs (rw,nosuid,nodev,size=211033536k,nr_inodes=819200,mode=755)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on / type btrfs (rw,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=34905)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /.snapshots type btrfs (rw,relatime,ssd,space_cache,subvolid=266,subvol=/@/.snapshots)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p1 on /boot/efi type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,utf8,errors=remount-ro)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /home type btrfs (rw,relatime,ssd,space_cache,subvolid=263,subvol=/@/home)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /opt type btrfs (rw,relatime,ssd,space_cache,subvolid=262,subvol=/@/opt)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/tmp)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /boot/grub2/i386-pc type btrfs (rw,relatime,ssd,space_cache,subvolid=265,subvol=/@/boot/grub2/i386-pc)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /usr/local type btrfs (rw,relatime,ssd,space_cache,subvolid=258,subvol=/@/usr/local)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /root type btrfs (rw,relatime,ssd,space_cache,subvolid=261,subvol=/@/root)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /var type btrfs (rw,relatime,ssd,space_cache,subvolid=257,subvol=/@/var)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /boot/grub2/x86_64-efi type btrfs (rw,relatime,ssd,space_cache,subvolid=264,subvol=/@/boot/grub2/x86_64-efi)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /srv type btrfs (rw,relatime,ssd,space_cache,subvolid=260,subvol=/@/srv)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1537]: Present block devices (printed for debugging purposes):
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]: NAME        MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]: nvme0n1     259:0    0  3.5T  0 disk
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]: ├─nvme0n1p1 259:1    0  512M  0 part  /boot/efi
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]: ├─nvme0n1p2 259:2    0    1T  0 part  /
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]: └─nvme0n1p3 259:3    0  2.5T  0 part
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]:   └─md127     9:127  0  2.5T  0 raid0
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1537]: Stopping current RAID "/dev/md/openqa"
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1542]: mdadm: stopped /dev/md/openqa
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1554]: Personalities : [raid0]
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1554]: unused devices: <none>
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1537]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1561]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1561]:        level=raid0 devices=1 ctime=Wed Apr 13 12:04:21 2022
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1561]: mdadm: unexpected failure opening /dev/md127
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1537]: Unable to create RAID, mdadm returned with non-zero code
Apr 13 12:39:51 openqaworker14 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=1/FAILURE
Apr 13 12:39:51 openqaworker14 systemd[1]: openqa_nvme_format.service: Failed with result 'exit-code'.
Apr 13 12:39:51 openqaworker14 systemd[1]: Failed to start Setup NVMe before mounting it.

On ow15 it currently fails every time with a syntax error in /var/local/bin/openqa-establish-nvme-setup. I assume that's from previous debugging attempts.

#30 Updated by mkittler about 1 month ago

I played around with mdadm:

martchus@openqaworker14:~> mdadm_args=(--create /dev/md/openqa --level=0 --force --assume-clean)
martchus@openqaworker14:~> mdadm_args+=(--raid-devices=1 --run /dev/nvme0n1p3)
martchus@openqaworker14:~> sudo systemctl stop var-lib-openqa.mount
martchus@openqaworker14:~> sudo mdadm "${mdadm_args[@]}"
mdadm: cannot open /dev/nvme0n1p3: Device or resource busy
martchus@openqaworker14:~> sudo mdadm --stop /dev/md/openqa
mdadm: stopped /dev/md/openqa
martchus@openqaworker14:~> sudo mdadm "${mdadm_args[@]}"
mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
       level=raid0 devices=1 ctime=Wed Apr 13 13:22:53 2022
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/openqa started.
martchus@openqaworker14:~> echo $?
0
martchus@openqaworker14:~> for i in {01..100}; do echo "try $i" && sudo mdadm --stop /dev/md/openqa && sudo mdadm "${mdadm_args[@]}" || break ; done
…
try 100
mdadm: stopped /dev/md/openqa
mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
       level=raid0 devices=1 ctime=Wed Apr 13 16:28:00 2022
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/openqa started.

So the message mdadm: /dev/nvme0n1p3 appears to be part of a raid array: also appears in the good case of mdadm: array /dev/md/openqa started. and is not necessarily related to the error mdadm: unexpected failure opening /dev/md127 we sometimes get. The problem is unfortunately not reproducible when the system is already up.

Considering

Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1537]: Stopping current RAID "/dev/md/openqa"
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1542]: mdadm: stopped /dev/md/openqa
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1554]: Personalities : [raid0]
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1554]: unused devices: <none>

we also can assume that the previous RAID has been stopped.

So the real question is what this "unexpected error" is. It is printed by mdopen.c when the preceding open_dev_excl() call fails, but unfortunately no further error details are propagated. Note that this function already has an internal retry with exponential backoff in case the error is EBUSY. It does 25 attempts with a sleep of min(2^i, 200) seconds, so it would wait about an hour¹. That should be more than enough to rule out EBUSY. We can also rule out that the device is already in use because that is checked beforehand (and it is unlikely that this changed in the meantime).

This still leaves any other error as a possibility. The function internally calls dev_open(), which basically calls open() with the device path. So we might run into any of https://man.archlinux.org/man/open.2#ERRORS.


¹

[...Array(25).keys()].reduce((a, b) => a + Math.min(Math.pow(2, b), 200)) / 60 / 60
1.015
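The same sum can be cross-checked in plain shell (total sleep of mdadm's internal retry, i.e. min(2^i, 200) seconds summed over i = 0..24):

```shell
#!/bin/sh
# total wait of mdadm's internal retry: 25 attempts, sleeping min(2^i, 200) s each
total=0
i=0
while [ "$i" -lt 25 ]; do
  s=$((1 << i))                 # 2^i seconds
  [ "$s" -gt 200 ] && s=200     # capped at 200 s per attempt
  total=$((total + s))
  i=$((i + 1))
done
echo "$total"                   # prints 3655 (seconds), i.e. slightly over an hour
```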

#31 Updated by mkittler about 1 month ago

  • Status changed from Workable to In Progress

I installed a custom build of mdadm. Now we see the specific error message:

Apr 13 17:35:16 openqaworker14 systemd[1]: Starting Setup NVMe before mounting it...
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Using mdadm from:
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1490]: /home/martchus/mdadm/mdadm
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Current mount points (printed for debugging purposes):
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: devtmpfs on /dev type devtmpfs (rw,nosuid,noexec,size=4096k,nr_inodes=1048576,mode=755)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: tmpfs on /run type tmpfs (rw,nosuid,nodev,size=211033536k,nr_inodes=819200,mode=755)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on / type btrfs (rw,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=57366)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /.snapshots type btrfs (rw,relatime,ssd,space_cache,subvolid=266,subvol=/@/.snapshots)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /boot/grub2/i386-pc type btrfs (rw,relatime,ssd,space_cache,subvolid=265,subvol=/@/boot/grub2/i386-pc)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /boot/grub2/x86_64-efi type btrfs (rw,relatime,ssd,space_cache,subvolid=264,subvol=/@/boot/grub2/x86_64-efi)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /home type btrfs (rw,relatime,ssd,space_cache,subvolid=263,subvol=/@/home)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /opt type btrfs (rw,relatime,ssd,space_cache,subvolid=262,subvol=/@/opt)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /srv type btrfs (rw,relatime,ssd,space_cache,subvolid=260,subvol=/@/srv)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /root type btrfs (rw,relatime,ssd,space_cache,subvolid=261,subvol=/@/root)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/tmp)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /usr/local type btrfs (rw,relatime,ssd,space_cache,subvolid=258,subvol=/@/usr/local)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /var type btrfs (rw,relatime,ssd,space_cache,subvolid=257,subvol=/@/var)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Present block devices (printed for debugging purposes):
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]: NAME        MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]: nvme0n1     259:0    0  3.5T  0 disk
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]: ├─nvme0n1p1 259:1    0  512M  0 part
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]: ├─nvme0n1p2 259:2    0    1T  0 part  /
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]: └─nvme0n1p3 259:3    0  2.5T  0 part
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]:   └─md127     9:127  0  2.5T  0 raid0
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Stopping current RAID "/dev/md/openqa"
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1494]: mdadm: stopped /dev/md/openqa
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1521]: Personalities : [raid0]
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1521]: unused devices: <none>
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1532]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1532]:        level=raid0 devices=1 ctime=Wed Apr 13 17:00:01 2022
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1532]: mdadm: unexpected failure opening /dev/md127: No such device or address
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Stopping current RAID "/dev/md/openqa"
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1494]: mdadm: stopped /dev/md/openqa
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1521]: Personalities : [raid0]
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1521]: unused devices: <none>
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1532]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1532]:        level=raid0 devices=1 ctime=Wed Apr 13 17:00:01 2022
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1532]: mdadm: unexpected failure opening /dev/md127: No such device or address
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Unable to create RAID, mdadm returned with non-zero code
Apr 13 17:35:16 openqaworker14 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=1/FAILURE
Apr 13 17:35:16 openqaworker14 systemd[1]: openqa_nvme_format.service: Failed with result 'exit-code'.
Apr 13 17:35:16 openqaworker14 systemd[1]: Failed to start Setup NVMe before mounting it.

So it apparently failed to create the device before, but that error was ignored, so we only get the error when trying to open the device.

Pressing Ctrl+D to leave the emergency shell helps in that situation. I'm not sure whether that was always the case, though. Assuming it is, we could implement a retry for the systemd service. It also means that my custom build of mdadm is generally working (and not just misleading us by introducing yet another problem; on ow15 my custom build also worked immediately after booting).
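The retry idea could look roughly like this (a minimal sketch, not the actual script; the function name, attempt count, and delay values are illustrative):

```shell
#!/bin/sh
# Hedged sketch of a retry wrapper for the NVMe setup script.
# `retry`, the attempt count, and the delay are made-up for illustration.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    max=$1; delay=$2; shift 2
    attempt=1
    until "$@"; do
        attempt=$((attempt + 1))
        if [ "$attempt" -gt "$max" ]; then
            echo "Giving up after $max attempts" >&2
            return 1
        fi
        echo "Command failed, waiting ${delay}s before attempt $attempt of $max"
        sleep "$delay"
    done
}

# The service script would then wrap the mdadm invocation, e.g.:
# retry 10 10 mdadm --create /dev/md/openqa --level=0 --raid-devices=1 --force /dev/nvme0n1p3
```

The log messages above ("waiting 10 before trying again", "attempt 2 of 10") suggest the real implementation follows a similar pattern.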

#32 Updated by mkittler about 1 month ago

I've added a retry to our script, for which I'll create an SR. I only needed one reboot to see it already making a difference:

Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /boot/grub2/i386-pc type btrfs (rw,relatime,ssd,space_cache,subvolid=265,subvol=/@/boot/grub2/i386-pc)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /.snapshots type btrfs (rw,relatime,ssd,space_cache,subvolid=266,subvol=/@/.snapshots)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /boot/grub2/x86_64-efi type btrfs (rw,relatime,ssd,space_cache,subvolid=264,subvol=/@/boot/grub2/x86_64-efi)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /home type btrfs (rw,relatime,ssd,space_cache,subvolid=263,subvol=/@/home)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /opt type btrfs (rw,relatime,ssd,space_cache,subvolid=262,subvol=/@/opt)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /root type btrfs (rw,relatime,ssd,space_cache,subvolid=261,subvol=/@/root)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /usr/local type btrfs (rw,relatime,ssd,space_cache,subvolid=258,subvol=/@/usr/local)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /srv type btrfs (rw,relatime,ssd,space_cache,subvolid=260,subvol=/@/srv)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/tmp)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /var type btrfs (rw,relatime,ssd,space_cache,subvolid=257,subvol=/@/var)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1514]: Present block devices (printed for debugging purposes):
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]: NAME        MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]: nvme0n1     259:0    0  3.5T  0 disk
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]: ├─nvme0n1p1 259:1    0  512M  0 part
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]: ├─nvme0n1p2 259:2    0    1T  0 part  /
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]: └─nvme0n1p3 259:3    0  2.5T  0 part
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]:   └─md127     9:127  0  2.5T  0 raid0
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1514]: Stopping current RAID "/dev/md/openqa"
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1522]: mdadm: stopped /dev/md/openqa
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1529]: Personalities : [raid0]
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1529]: unused devices: <none>
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1514]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1537]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1537]:        level=raid0 devices=1 ctime=Wed Apr 13 17:35:48 2022
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1537]: mdadm: unexpected failure opening /dev/md127: No such device or address
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1514]: Unexpected error opening device, waiting 10 before trying again (as retrying usually helps).
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1514]: Trying RAID0 creation again after timeout (attempt 2 of 10)
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1632]: Personalities : [raid0]
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1632]: unused devices: <none>
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1514]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1639]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1639]:        level=raid0 devices=1 ctime=Wed Apr 13 17:35:48 2022
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1639]: mdadm: Defaulting to version 1.2 metadata
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1639]: mdadm: array /dev/md/openqa started.
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1514]: Status for RAID0 "/dev/md/openqa"
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1658]: md127 : active raid0 nvme0n1p3[0]
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1660]: ARRAY /dev/md/openqa metadata=1.2 name=openqaworker14:openqa UUID=0669a47c:5201d4f3:f7f5420a:c0f7f4a7
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1514]: Creating ext2 filesystem on RAID0 "/dev/md/openqa"
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1662]: mke2fs 1.43.8 (1-Jan-2018)
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: [258B blob data]
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: Creating filesystem with 669084672 4k blocks and 167272448 inodes
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: Filesystem UUID: 8ae30c10-6d08-44a5-8913-85bf60940c17
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: Superblock backups stored on blocks:
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]:         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]:         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]:         102400000, 214990848, 512000000, 550731776, 644972544
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: [73B blob data]
Apr 14 13:35:05 openqaworker14 openqa-establish-nvme-setup[1662]: [378B blob data]
Apr 14 13:35:17 openqaworker14 openqa-establish-nvme-setup[1662]: [107B blob data]
Apr 14 13:35:17 openqaworker14 systemd[1]: openqa_nvme_format.service: Succeeded.
Apr 14 13:35:17 openqaworker14 systemd[1]: Finished Setup NVMe before mounting it.

I don't know whether it'll help in all cases, but it is certainly an improvement, so I'll try to reboot a few more times.

#34 Updated by mkittler about 1 month ago

And again:

Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1564]: Present block devices (printed for debugging purposes):
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]: NAME        MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]: nvme0n1     259:0    0  3.5T  0 disk
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]: ├─nvme0n1p1 259:1    0  512M  0 part
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]: ├─nvme0n1p2 259:2    0    1T  0 part  /
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]: └─nvme0n1p3 259:3    0  2.5T  0 part
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]:   └─md127     9:127  0  2.5T  0 raid0
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1564]: Stopping current RAID "/dev/md/openqa"
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1571]: mdadm: stopped /dev/md/openqa
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1582]: Personalities : [raid0]
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1582]: unused devices: <none>
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1564]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1590]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1590]:        level=raid0 devices=1 ctime=Thu Apr 14 13:34:47 2022
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1590]: mdadm: unexpected failure opening /dev/md127: No such device or address
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1564]: Unexpected error opening device, waiting 10 before trying again (as retrying usually helps).
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1564]: Trying RAID0 creation again after timeout (attempt 2 of 10)
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1618]: Personalities : [raid0]
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1618]: unused devices: <none>
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1564]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1625]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1625]:        level=raid0 devices=1 ctime=Thu Apr 14 13:34:47 2022
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1625]: mdadm: Defaulting to version 1.2 metadata
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1625]: mdadm: array /dev/md/openqa started.
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1564]: Status for RAID0 "/dev/md/openqa"
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1644]: md127 : active raid0 nvme0n1p3[0]
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1646]: ARRAY /dev/md/openqa metadata=1.2 name=openqaworker14:openqa UUID=1dc43f3d:fbecb084:33ff1543:61f84fb8
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1564]: Creating ext2 filesystem on RAID0 "/dev/md/openqa"
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1648]: mke2fs 1.43.8 (1-Jan-2018)
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: [220B blob data]
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: Creating filesystem with 669084672 4k blocks and 167272448 inodes
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: Filesystem UUID: b7ec8671-c15f-4887-9b78-157c1aa21715
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: Superblock backups stored on blocks:
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]:         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]:         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]:         102400000, 214990848, 512000000, 550731776, 644972544
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: [73B blob data]
Apr 14 13:47:13 openqaworker14 openqa-establish-nvme-setup[1648]: [400B blob data]
Apr 14 13:47:25 openqaworker14 openqa-establish-nvme-setup[1648]: [107B blob data]
Apr 14 13:47:25 openqaworker14 systemd[1]: openqa_nvme_format.service: Succeeded.
Apr 14 13:47:25 openqaworker14 systemd[1]: Finished Setup NVMe before mounting it.

And again:

Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1505]: Present block devices (printed for debugging purposes):
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]: NAME        MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]: nvme0n1     259:0    0  3.5T  0 disk
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]: ├─nvme0n1p1 259:1    0  512M  0 part
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]: ├─nvme0n1p2 259:2    0    1T  0 part  /
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]: └─nvme0n1p3 259:3    0  2.5T  0 part
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]:   └─md127     9:127  0  2.5T  0 raid0
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1505]: Stopping current RAID "/dev/md/openqa"
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1509]: mdadm: stopped /dev/md/openqa
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1505]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1562]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1562]:        level=raid0 devices=1 ctime=Wed Apr 13 17:53:38 2022
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1562]: mdadm: unexpected failure opening /dev/md127: No such device or address
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1505]: Unexpected error opening device, waiting 10 before trying again (as retrying usually helps).
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1505]: Trying RAID0 creation again after timeout (attempt 2 of 10)
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1505]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1654]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1654]:        level=raid0 devices=1 ctime=Wed Apr 13 17:53:38 2022
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1654]: mdadm: Defaulting to version 1.2 metadata
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1654]: mdadm: array /dev/md/openqa started.
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1505]: Status for RAID0 "/dev/md/openqa"
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1673]: md127 : active raid0 nvme0n1p3[0]
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1675]: ARRAY /dev/md/openqa metadata=1.2 name=openqaworker15:openqa UUID=e2f5086b:edbf62a8:14bb3a8a:b575f4d3
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1505]: Creating ext2 filesystem on RAID0 "/dev/md/openqa"
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1676]: mke2fs 1.43.8 (1-Jan-2018)
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: [220B blob data]
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: Creating filesystem with 669084672 4k blocks and 167272448 inodes
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: Filesystem UUID: d1e04ad2-cbc3-4356-bd70-36d16a9eef7e
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: Superblock backups stored on blocks:
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]:         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]:         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]:         102400000, 214990848, 512000000, 550731776, 644972544
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: [73B blob data]
Apr 14 13:48:31 openqaworker15 openqa-establish-nvme-setup[1676]: [400B blob data]
Apr 14 13:48:43 openqaworker15 openqa-establish-nvme-setup[1676]: [107B blob data]
Apr 14 13:48:43 openqaworker15 systemd[1]: openqa_nvme_format.service: Succeeded.
Apr 14 13:48:43 openqaworker15 systemd[1]: Finished Setup NVMe before mounting it.

So I'm quite sure that the retry helps and likely that's also all we need.

#35 Updated by mkittler about 1 month ago

  • Status changed from In Progress to Feedback

I did 8 more reboots (4 on each machine). I encountered 5 more good cases (no retry required) and 3 more handled bad cases (retry required, but it helped) but no more unhandled bad cases. Interestingly, all remaining bad cases happened on ow15, so it seems this host is more affected, but it could also just be a coincidence.

Once https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/671 has been merged, I can bring the workers back into salt (AC2) and remove the remaining workarounds.

Not sure whether it makes sense to rush this today because I'll only be back on Wednesday and some closer monitoring would be good (AC3).

#36 Updated by okurz 29 days ago

mkittler wrote:

I did 8 more reboots (4 on each machine). I encountered 5 more good cases (no retry required) and 3 more handled bad cases (retry required, but it helped) but no more unhandled bad cases. Interestingly, all remaining bad cases happened on ow15, so it seems this host is more affected, but it could also just be a coincidence.

When https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/671 has been merged …

Done.

I guess you mean https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/405
Before we enable this we should have sufficient tests showing that the workers are stable. Please trigger openQA tests using the existing worker class settings; that's what we have them for :)

Not sure whether it makes sense to rush this today because I'll only be back on Wednesday and some closer monitoring would be good (AC3).

Yes, no need to rush.

#37 Updated by mkittler 26 days ago

I've been re-triggering a variety¹ of jobs: https://openqa.suse.de/tests/overview?build=test-ow14 https://openqa.suse.de/tests/overview?build=test-ow15

¹ They're simply the 100 most recently passed qemu x86_64 jobs on OSD without parallel dependencies.

Looks like I've messed up the worker class, so jobs on both overviews might have been started on either of the workers.

#38 Updated by mkittler 26 days ago

All jobs incompleted early, likely due to an outdated os-autoinst version. Considering the workers are not in salt, that's expected, so I'll update them manually for now and re-trigger the jobs (this time with the correct worker classes).

#39 Updated by mkittler 26 days ago

The tap setup isn't working yet but it generally doesn't look too bad:

openqa=> select jobs.id, job_settings.value as worker, jobs.result, jobs.reason from jobs join job_settings on jobs.id = job_settings.job_id where state = 'done' and not (result = 'passed' or result = 'softfailed') and job_settings.key = 'WORKER_CLASS' and (job_settings.value = 'openqaworker14' or job_settings.value = 'openqaworker15') and t_started >= '2022-04-20' order by id desc;
   id    |     worker     |   result   |                                                                                   reason                                                                                   
---------+----------------+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 8582778 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 25' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
 8582777 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap11 18' failed: 'tap11' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
 8582776 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap4 47' failed: 'tap4' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
 8582774 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 25' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
 8582772 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap11 18' failed: 'tap11' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
 8582770 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 18' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
 8582768 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 18' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
 8582761 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap8 58' failed: 'tap8' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
 8582760 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 18' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
 8582759 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap14 18' failed: 'tap14' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
 8582757 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap9 18' failed: 'tap9' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
 8582755 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap8 58' failed: 'tap8' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
 8582753 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap14 25' failed: 'tap14' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
 8582751 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap9 18' failed: 'tap9' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
 8582749 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap8 58' failed: 'tap8' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
 8582747 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap14 60' failed: 'tap14' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
 8582739 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap8 18' failed: 'tap8' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
 8582613 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 60' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
(18 rows)

I also found some older jobs failing with timeout_exceeded (reason: timeout: test execution exceeded MAX_JOB_TIME) when doing some other SQL queries. Note that these jobs simply timed out while waiting for a parallel job, so they are likely also just a symptom of the broken tap setup (and not a symptom of slow downloads).

#40 Updated by mkittler 25 days ago

The tap setup wasn't actually broken. I had manually increased the number of worker slots (to 16) so the 100 jobs would be processed quicker and more like we'd eventually run them in production. After reducing the number of worker slots back to what's configured in salt, the tap jobs work (see https://openqa.suse.de/tests/8587752#dependencies and https://openqa.suse.de/tests/8587756#dependencies).

So I suppose https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/405 can be merged. (It'll also increase the number of worker slots, but when salt does it, the tap config is extended accordingly.)

#41 Updated by okurz 25 days ago

#42 Updated by okurz 25 days ago

I have unpaused the "host up" alert for both machines as it looks green right now.

#43 Updated by mkittler 25 days ago

I added the machines back to salt.

#44 Updated by mkittler 24 days ago

Salt is unfortunately broken on ow14 and 15: the minion always tries to connect to 127.0.0.1 instead of OSD. Setting master in /etc/salt/minion to OSD's IP doesn't help. Hacking around the issue via martchus@openqaworker14:~> ssh -L "4506:localhost:4506" -N -f openqa.suse.de lets it at least authenticate, but the minion still appears unresponsive from the master's side.

Note that on a non-broken worker, e.g. openqaworker5, the minion log looks like this:

[DEBUG   ] Connecting to master. Attempt 1 of 1
[DEBUG   ] Error while parsing IPv4 address: salt
[DEBUG   ] Expected 4 octets in 'salt'
[DEBUG   ] Error while parsing IPv6 address: salt
[DEBUG   ] At least 3 parts expected in 'salt'
[DEBUG   ] "salt" Not an IP address? Assuming it is a hostname.
[DEBUG   ] Master URI: tcp://10.160.0.207:4506
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'openqaworker5.suse.de', 'tcp://10.160.0.207:4506')
[DEBUG   ] Generated random reconnect delay between '1000ms' and '11000ms' (2303)
[DEBUG   ] Setting zmq_reconnect_ivl to '2303ms'
[DEBUG   ] Setting zmq_reconnect_ivl_max to '11000ms'
[TRACE   ] IPCServer: Handling connection to address: 
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'openqaworker5.suse.de', 'tcp://10.160.0.207:4506', 'clear')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://10.160.0.207:4506
[DEBUG   ] Trying to connect to: tcp://10.160.0.207:4506
[TRACE   ] Inserted key into loop_instance_map id 140315043651088 for key ('/etc/salt/pki/minion', 'openqaworker5.suse.de', 'tcp://10.160.0.207:4506', 'clear') and process 15063
[DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key
[DEBUG   ] salt.crypt._get_key_with_evict: Loading private key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG   ] salt.crypt.verify_signature: Loading public key
[DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG   ] salt.crypt.verify_signature: Verifying signature
[DEBUG   ] <salt.crypt.AsyncAuth object at 0x7f9da5715d68> Got new master aes key.
[DEBUG   ] Closing AsyncZeroMQReqChannel instance
[DEBUG   ] Connecting the Minion to the Master publish port, using the URI: tcp://10.160.0.207:4505
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'openqaworker5.suse.de', 'tcp://10.160.0.207:4506', 'aes')
[DEBUG   ] Re-using AsyncAuth for ('/etc/salt/pki/minion', 'openqaworker5.suse.de', 'tcp://10.160.0.207:4506')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://10.160.0.207:4506

On ow14 and 15 it looks very similar, but OSD's IP is replaced with 127.0.0.1:

[DEBUG   ] "salt" Not an IP address? Assuming it is a hostname.
…
…
[DEBUG   ] Trying to connect to: tcp://127.0.0.1:4506
[DEBUG   ] Closing AsyncZeroMQReqChannel instance
[ERROR   ] Error while bringing up minion for multi-master. Is master at 10.160.0.207 responding?
[DEBUG   ] Connecting to master. Attempt 1 of 1
[DEBUG   ] Master URI: tcp://127.0.0.1:4506
…
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://127.0.0.1:4506
[DEBUG   ] SaltReqTimeoutError, retrying. (1/7)

Curiously, it still says "Is master at 10.160.0.207 responding?", as if the correct host were generally configured but somehow not passed down to the underlying function.

The salt-minion version is the same as on other hosts.

#45 Updated by mkittler 24 days ago

If I also do martchus@openqaworker14:~> ssh -L "4505:localhost:4505" -N -f -l martchus openqa.suse.de (and not just for port 4506), I get a little further:

martchus@openqa:~> sudo salt openqaworker14.qa.suse.cz state.apply
openqaworker14.qa.suse.cz:
----------
          ID: states
    Function: no.None
      Result: False
     Comment: No Top file or master_tops data matches found. Please see master log for details.
     Changes:   

Summary for openqaworker14.qa.suse.cz
------------
Succeeded: 0
Failed:    1
------------
Total states run:     1
Total run time:   0.000 ms
ERROR: Minions returned with non-zero exit code

The minion then responds but shows this error message. Something must be misconfigured on ow14 and ow15, but all the config looks the same as on ow5, which has no issues. So I'm out of ideas.

#46 Updated by mkittler 21 days ago

Looks like somebody put file_client: local into the config, which I missed on Friday. After removing it, everything works. The worker class change to run normal production jobs is now effective, and Salt also extended the tap setup as expected:

----------
          ID: /etc/sysconfig/network/ifcfg-br1
    Function: file.managed
      Result: True
     Comment: File /etc/sysconfig/network/ifcfg-br1 updated
     Started: 10:57:33.346842
    Duration: 13.948 ms
     Changes:   
              ----------
              diff:
                  --- 
                  +++ 
                  @@ -14,5 +14,40 @@
                   OVS_BRIDGE_PORT_DEVICE_3='tap3'
                   OVS_BRIDGE_PORT_DEVICE_67='tap67'
                   OVS_BRIDGE_PORT_DEVICE_131='tap131'
                  +OVS_BRIDGE_PORT_DEVICE_4='tap4'
                  +OVS_BRIDGE_PORT_DEVICE_68='tap68'
                  +OVS_BRIDGE_PORT_DEVICE_132='tap132'
                  +OVS_BRIDGE_PORT_DEVICE_5='tap5'
                  +OVS_BRIDGE_PORT_DEVICE_69='tap69'
                  +OVS_BRIDGE_PORT_DEVICE_133='tap133'
                  +OVS_BRIDGE_PORT_DEVICE_6='tap6'
                  +OVS_BRIDGE_PORT_DEVICE_70='tap70'
                  +OVS_BRIDGE_PORT_DEVICE_134='tap134'
                  +OVS_BRIDGE_PORT_DEVICE_7='tap7'
                  +OVS_BRIDGE_PORT_DEVICE_71='tap71'
                  +OVS_BRIDGE_PORT_DEVICE_135='tap135'
                  +OVS_BRIDGE_PORT_DEVICE_8='tap8'
                  +OVS_BRIDGE_PORT_DEVICE_72='tap72'
                  +OVS_BRIDGE_PORT_DEVICE_136='tap136'
                  +OVS_BRIDGE_PORT_DEVICE_9='tap9'
                  +OVS_BRIDGE_PORT_DEVICE_73='tap73'
                  +OVS_BRIDGE_PORT_DEVICE_137='tap137'
                  +OVS_BRIDGE_PORT_DEVICE_10='tap10'
                  +OVS_BRIDGE_PORT_DEVICE_74='tap74'
                  +OVS_BRIDGE_PORT_DEVICE_138='tap138'
                  +OVS_BRIDGE_PORT_DEVICE_11='tap11'
                  +OVS_BRIDGE_PORT_DEVICE_75='tap75'
                  +OVS_BRIDGE_PORT_DEVICE_139='tap139'
                  +OVS_BRIDGE_PORT_DEVICE_12='tap12'
                  +OVS_BRIDGE_PORT_DEVICE_76='tap76'
                  +OVS_BRIDGE_PORT_DEVICE_140='tap140'
                  +OVS_BRIDGE_PORT_DEVICE_13='tap13'
                  +OVS_BRIDGE_PORT_DEVICE_77='tap77'
                  +OVS_BRIDGE_PORT_DEVICE_141='tap141'
                  +OVS_BRIDGE_PORT_DEVICE_14='tap14'
                  +OVS_BRIDGE_PORT_DEVICE_78='tap78'
                  +OVS_BRIDGE_PORT_DEVICE_142='tap142'
                  +OVS_BRIDGE_PORT_DEVICE_15='tap15'
                  +OVS_BRIDGE_PORT_DEVICE_79='tap79'
                  +OVS_BRIDGE_PORT_DEVICE_143='tap143'
                   PRE_UP_SCRIPT="wicked:gre_tunnel_preup.sh"
                  -ZONE=trusted
              mode:
                  0644
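For reference, the file_client setting mentioned above would look roughly like this in the minion configuration (a hypothetical drop-in file for illustration; with file_client: local the minion reads states from its own filesystem instead of contacting the master, which matches both the tcp://127.0.0.1 connection attempts and the "No Top file or master_tops data matches found" error):

```yaml
# Hypothetical /etc/salt/minion.d/local.conf that would cause this behaviour:
#file_client: local   # must be removed (the default, "remote", uses the master)

# Expected working configuration: point the minion at the OSD master
master: openqa.suse.de
```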

I'll check the incomplete/failure rate of the workers in a few days. If everything looks good the ticket can then be resolved (as all ACs are then covered).

#47 Updated by mkittler 20 days ago

So far, the fail/incomplete ratio looks ok to me compared to other worker hosts:

openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed' or result='incomplete') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where t_finished >= '2022-04-25 11:00' group by host order by ratio_failed_by_host desc;
        host         | ratio_failed_by_host | total 
---------------------+----------------------+-------
 openqaworker-arm-4  |                 88.1 |   168
 openqaworker2       |                64.09 |   362
 openqaworker-arm-3  |                33.94 |   327
 openqaworker14      |                30.92 |   414
 malbec              |                 26.6 |    94
 powerqaworker-qam-1 |                25.93 |   135
 openqaworker-arm-2  |                 25.6 |   336
 QA-Power8-5-kvm     |                24.36 |   156
 openqaworker-arm-1  |                24.19 |   186
 grenache-1          |                23.63 |   787
 openqaworker15      |                   20 |   405
 QA-Power8-4-kvm     |                 18.9 |   164
 openqaworker5       |                 16.8 |   899
 openqaworker6       |                16.22 |   709
 openqaworker3       |                11.98 |   459
 openqaworker9       |                11.24 |   347
 openqaworker10      |                 9.09 |   209
 openqaworker13      |                  8.5 |   341
 openqaworker8       |                 6.67 |   330
 openqa-piworker     |                    0 |     7
                     |                    0 |   325
(21 rows)

Note that ow15 is 10 percentage points better although it is the very same hardware using the very same connection. So I suppose that's the margin we have to account for anyway.
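The per-host ratio the SQL query computes can be sketched in Python (a simplified re-implementation over an in-memory job list with hypothetical sample data; the real query additionally joins jobs against workers and filters by t_finished):

```python
from collections import defaultdict

def fail_ratio_by_host(jobs):
    """Compute the failed/incomplete percentage per host, like the SQL query.

    `jobs` is an iterable of (host, result) tuples for finished jobs.
    Returns {host: ratio_in_percent} rounded to two decimals.
    """
    totals = defaultdict(int)
    bad = defaultdict(int)
    for host, result in jobs:
        totals[host] += 1
        if result in ("failed", "incomplete"):
            bad[host] += 1
    return {h: round(bad[h] * 100.0 / totals[h], 2) for h in totals}

# Hypothetical sample data, not real OSD numbers:
jobs = [
    ("openqaworker14", "passed"),
    ("openqaworker14", "failed"),
    ("openqaworker14", "incomplete"),
    ("openqaworker14", "passed"),
    ("openqaworker15", "passed"),
    ("openqaworker15", "failed"),
]
print(fail_ratio_by_host(jobs))
```

The later variant of the query that excludes syntax errors would correspond to one extra condition here, skipping jobs whose reason matches the '%main%syntax error%' pattern before counting them as bad.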

I also haven't seen any jobs running into timeouts (the setup timeout was a concern due to the workers' remote connection from Prague):

openqa=> select id, t_finished, result, reason from jobs where (select host from workers where id = assigned_worker_id) similar to 'openqaworker(14|15)' and (result = 'incomplete' or result = 'failed') and reason like '%timeout%' and t_finished >= '2022-04-25 11:00' order by t_finished;
 id | t_finished | result | reason 
----+------------+--------+--------
(0 rows)

#48 Updated by mkittler 20 days ago

I've noticed that there are lots of jobs that just failed due to syntax errors. When ignoring those failures for the sake of this investigation, the figures change considerably:

openqa=> with finished as (select result, reason, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where (result='failed' or result='incomplete') and reason not like '%main%syntax error%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where t_finished >= '2022-04-25 11:00' group by host order by ratio_failed_by_host desc;
        host         | ratio_failed_by_host | total 
---------------------+----------------------+-------
 openqaworker2       |                35.89 |   404
 openqaworker-arm-3  |                16.72 |   329
 openqaworker-arm-2  |                12.02 |   341
 openqaworker-arm-1  |                 9.47 |   190
 grenache-1          |                  6.1 |   787
 openqaworker5       |                 3.74 |   908
 powerqaworker-qam-1 |                 2.96 |   135
 QA-Power8-5-kvm     |                 2.56 |   156
 openqaworker13      |                 2.33 |   343
 QA-Power8-4-kvm     |                 1.83 |   164
 openqaworker-arm-4  |                 1.19 |   168
 openqaworker10      |                 0.94 |   212
 openqaworker6       |                  0.7 |   715
 openqaworker9       |                 0.57 |   351
 openqaworker14      |                 0.48 |   421
 openqaworker8       |                  0.3 |   332
 openqaworker15      |                 0.24 |   412
 openqaworker3       |                 0.21 |   466
                     |                    0 |   325
 openqa-piworker     |                    0 |     7
 malbec              |                    0 |    95
(21 rows)

#49 Updated by mkittler 19 days ago

The fail/incomplete ratio on 14/15 still looks ok:

openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed' or result='incomplete') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where t_finished >= '2022-04-25 11:00' group by host order by ratio_failed_by_host desc;
        host         | ratio_failed_by_host | total 
---------------------+----------------------+-------
 openqaworker2       |                78.41 |  1371
 openqaworker-arm-4  |                61.29 |   403
 openqaworker14      |                32.14 |   837
 grenache-1          |                31.72 |  1854
 openqaworker-arm-3  |                26.83 |   779
 openqaworker-arm-1  |                20.71 |   449
 openqaworker15      |                20.27 |   824
 openqaworker-arm-2  |                18.63 |   714
 QA-Power8-5-kvm     |                18.53 |   421
 powerqaworker-qam-1 |                16.04 |   424
 malbec              |                15.19 |   237
 openqaworker5       |                14.97 |  1637
 openqaworker13      |                14.11 |   659
 openqaworker6       |                13.92 |  1329
 QA-Power8-4-kvm     |                 13.4 |   530
 openqaworker9       |                12.41 |   709
 openqaworker3       |                10.78 |   872
 openqaworker10      |                10.55 |   455
 openqaworker8       |                 9.03 |   709
 openqa-piworker     |                    0 |     7
                     |                    0 |   621
(21 rows)

Excluding syntax errors:

openqa=> with finished as (select result, reason, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where (result='failed' or result='incomplete') and reason not like '%main%syntax error%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where t_finished >= '2022-04-25 11:00' group by host order by ratio_failed_by_host desc;
        host         | ratio_failed_by_host | total 
---------------------+----------------------+-------
 openqaworker2       |                50.11 |  1371
 openqaworker-arm-3  |                13.61 |   779
 openqaworker-arm-1  |                 8.91 |   449
 openqaworker-arm-2  |                 5.74 |   714
 openqaworker-arm-4  |                 5.46 |   403
 powerqaworker-qam-1 |                 4.25 |   424
 QA-Power8-5-kvm     |                 4.04 |   421
 grenache-1          |                 3.83 |  1853
 openqaworker13      |                 3.34 |   659
 QA-Power8-4-kvm     |                 2.65 |   529
 openqaworker5       |                 2.63 |  1636
 openqaworker10      |                  2.2 |   455
 openqaworker9       |                 1.41 |   709
 malbec              |                 1.27 |   237
 openqaworker14      |                 1.08 |   837
 openqaworker8       |                 0.85 |   709
 openqaworker6       |                  0.6 |  1329
 openqaworker15      |                 0.36 |   824
 openqaworker3       |                 0.34 |   872
 openqa-piworker     |                    0 |     7
                     |                    0 |   621
(21 rows)

#50 Updated by mkittler 19 days ago

  1. Maybe we need to remove the tap worker class: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/407 - see https://suse.slack.com/archives/C02CANHLANP/p1651052375642909
  2. Maybe we can add the qemu_x86_64-large-mem worker class because those workers have 1 TiB RAM: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/408

#51 Updated by mkittler 18 days ago

Looks like https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/407 couldn't be applied cleanly. However, the worker class "tap" is gone, and that was the main goal.

However, maybe this also gives a hint as to why the setup is broken:

          ID: wicked ifup br1
    Function: cmd.wait
      Result: False
     Comment: Command "wicked ifup br1" run
     Started: 12:32:03.474942
    Duration: 30412.6 ms
     Changes:   
              ----------
              pid:
                  42945
              retcode:
                  162
              stderr:
              stdout:
                  br1             setup-in-progress

At least on openqaworker14; on ow15 it looks good:

          ID: wicked ifup br1
    Function: cmd.wait
      Result: True
     Comment: Command "wicked ifup br1" run
     Started: 12:32:06.437255
    Duration: 31487.345 ms
     Changes:   
              ----------
              pid:
                  65768
              retcode:
                  0
              stderr:
              stdout:
                  br1             up

The bridge is now up on ow14 as well (without me doing anything):

martchus@openqaworker14:~> sudo wicked ifup br1
br1             up

#52 Updated by mkittler 17 days ago

After reverting and re-applying the tap setup via Salt, it works again on ow15, so I suppose the tap setup generally works. I also tested it with a parallel cluster where some jobs run on Nürnberg workers and some on Prague workers (see https://openqa.suse.de/tests/8656821). It also continued to work after rebooting the worker. Maybe the GRE tunnel isn't very good at coping with temporary connection problems and can get stuck in a disconnected state. We decided to create another ticket for that and leave out the tap worker class for now. Considering that curl -4 https://openqa.suse.de works while curl -6 https://openqa.suse.de doesn't (on both ow14 and ow15), it could be an IPv4 vs. IPv6 problem.

I also rebased https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/408 which we might want to merge as well. (I'll merge it after getting at least one approval.)

Now both workers are again back in salt without tap worker class. See suggestions in #110467 for how to continue with the tap setup.

Since the failure rate still looks generally good, I'm resolving the ticket, keeping both workers running production jobs (just not tap ones).

#53 Updated by mkittler 17 days ago

  • Related to action #110467: Establish reliable tap setup on ow14 added

#54 Updated by mkittler 17 days ago

  • Status changed from Feedback to Resolved

#55 Updated by okurz 14 days ago

  • Due date deleted (2022-04-29)
