action #104970 (closed)
openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
openQA Project - coordination #109659: [epic] More remote workers
Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:M
0%
Description
Motivation¶
Related to #97862. Two machines have been ordered and are prepared in the SUSE CZ office location server room to be used within the OSD infrastructure. okurz suggested using these two machines as "generic" workers alongside their specialization within the OSD infrastructure.
Acceptance criteria¶
- AC1: openqaworker14+openqaworker15 are used by OSD production jobs
- AC2: Both machines are maintained the same way as other x86_64 OSD workers
- AC3: No significant impact due to the remote connection between NBG and PRG server room locations
- AC4: Alerts for those workers have been resumed and are not firing anymore
Suggestions¶
- Follow https://sd.suse.com/servicedesk/customer/portal/1/SD-70400
- Ensure that a valid OS is installed
- Ensure that machines can be remote controlled
- Deploy machines as part of generic salt-controlled OSD infrastructure
- Configure openQA worker instances on these machines for testing purposes
- Test out their operation and take special care due to the remoteness between NBG (osd webUI) and PRG (location of these two machines)
- If all goes well, include them as generic workers as well as with the special SAP worker class settings
Updated by okurz almost 3 years ago
- Copied from action #97862: More openQA worker hardware for OSD size:M added
Updated by okurz almost 3 years ago
- Status changed from New to Feedback
- Assignee set to okurz
I provided my GPG key to mcaj for the credentials. Let's await that first
Updated by okurz almost 3 years ago
- Status changed from Feedback to New
- Assignee deleted (okurz)
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/381 including IPMI credentials which can be a start for others to continue.
Updated by okurz almost 3 years ago
Both machines have openSUSE Leap 15.3 preinstalled with our common, old, known root password for the time being.
Updated by okurz almost 3 years ago
- Subject changed from Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing to Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:M
- Status changed from New to Workable
Updated by livdywan almost 3 years ago
- Assignee set to osukup
@osukup is looking forward to taking it :-D
Updated by osukup almost 3 years ago
So ... the problems start:
1) salt-minion: with the default /etc/salt/minion it refuses to connect to the master (it always tries localhost)
- workaround: remove the minion config
2) it has only one disk (3.5TB NVMe, formatted as a single btrfs volume), so openqa-format_nvme.service can't really work
- workaround: override the unit via systemctl edit (see the sketch below)
3) it is connected via a 1Gbit interface (the server itself has 10G, but that is not connected anywhere) and the connection to NUE is too slow for any openQA job :(
--> please move these two workers to NUE
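A minimal sketch of the kind of drop-in override meant above, assuming the unit name openqa-format_nvme.service from the salt states and that skipping the stock formatting step is acceptable on this single-NVMe layout (the exact override applied on the workers was not recorded):
# hypothetical drop-in, as created via "systemctl edit openqa-format_nvme.service"
# clears ExecStart so the stock formatting script is not run on this disk layout
mkdir -p /etc/systemd/system/openqa-format_nvme.service.d
cat > /etc/systemd/system/openqa-format_nvme.service.d/override.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/bin/true
EOF
systemctl daemon-reload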
Updated by osukup almost 3 years ago
openqaworker14 didn't survive a reboot - reinstalling :( - maybe choose a better disk layout?
Updated by osukup almost 3 years ago
so, after "succesful' state.apply
... system ater reboot goes to maintenace mode :(
culprit -> prepare nvme openqa service ..
Updated by osukup almost 3 years ago
- Status changed from Workable to In Progress
Updated by openqa_review almost 3 years ago
- Due date set to 2022-03-04
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz almost 3 years ago
As discussed, please set MAX_SETUP_TIME to something like 10h to account for the slow connection (see the sketch below), change the worker class temporarily to not break production tests, and restart the tests that incompleted on these workers so that they end up on other machines.
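One way this could be expressed, assuming MAX_SETUP_TIME is accepted in the worker's workers.ini [global] section and is given in seconds like other openQA timeout settings (it could equally be set on the affected test suites instead):
# hypothetical /etc/openqa/workers.ini excerpt
[global]
WORKER_CLASS = openqaworker14
MAX_SETUP_TIME = 36000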
Updated by okurz almost 3 years ago
The latter suggestion was done with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/388. After the merge I called
for i in openqaworker14 openqaworker15; do failed_since=2022-02-17 result="result='incomplete'" host=openqa.suse.de WORKER=$i bash -ex openqa-advanced-retrigger-jobs; done
I propose a higher cache limit as discussed also in one of our meetings last week: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/389
Updated by okurz almost 3 years ago
Currently the OSD deployment is failing because w14+w15 are not fully ready yet, so I temporarily removed them from the deployment again. @osukup you can add them back when you continue here
Updated by mkittler over 2 years ago
- Description updated (diff)
I paused the corresponding host-up alerts.
Updated by osukup over 2 years ago
Both hosts are deployed with salt, added to salt, and osd-deployment works.
Due to the difference between standard openQA workers and these machines (only one NVMe disk),
both were reinstalled with this disk layout:
nvme0n1 259:0 0 3,5T 0 disk
├─nvme0n1p1 259:1 0 512M 0 part /boot/efi
├─nvme0n1p2 259:2 0 1T 0 part /var
└─nvme0n1p3 259:3 0 2,5T 0 part
nvme0n1p3 was set during installation as a raw volume with the Linux RAID partition type and
after installation initialized with:
mdadm --create /dev/md/openqa --level=0 --raid-devices=1 --force --run /dev/nvme0n1p3
mkfs.ext2 /dev/md/openqa
added this line to fstab:
/dev/md/openqa /var/lib/openqa/ ext2 defaults 0 0
created /etc/mdadm.conf with:
DEVICE containers partitions
ARRAY /dev/md/openqa name=openqaworker14:openqa
(on openqaworker15 the name is of course openqaworker15:openqa)
then ran dracut --force
With these steps the host can be rebooted without problems (without mdadm.conf the reboot ends in maintenance mode because mdadm doesn't create the link /dev/md/openqa).
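Put together, the manual setup described above amounts to roughly the following (a sketch reconstructed from the steps listed, shown for openqaworker14; adjust the array name for openqaworker15):
# reconstructed sketch of the manual steps above, run as root
mdadm --create /dev/md/openqa --level=0 --raid-devices=1 --force --run /dev/nvme0n1p3
mkfs.ext2 /dev/md/openqa
echo '/dev/md/openqa /var/lib/openqa/ ext2 defaults 0 0' >> /etc/fstab
printf 'DEVICE containers partitions\nARRAY /dev/md/openqa name=openqaworker14:openqa\n' > /etc/mdadm.conf
dracut --force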
The rest was done according to gitlab/openqa/openqa-salt-pillars with these changes:
- removed /etc/salt/minion (with that config file salt-minion refused to connect to openqa.suse.de)
- created /etc/salt/grains with roles: worker
Test run on openqaworker{14,15}: ran 100 jobs on both machines. The first job usually takes a pretty long time to set up the cache (average download speed is below 5MB/sec), but the results are stable and without incomplete jobs.
During testing I found a problem with the websockets service: https://progress.opensuse.org/issues/107497
NEXT -> set the tap class on the workers (see the sketch below) and ask qa-sap for multimachine test jobs on these workers
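A hedged example of what such a tap-enabled worker class could look like in workers.ini (the class names shown are illustrative; the authoritative values for these hosts live in salt-pillars-openqa):
# hypothetical /etc/openqa/workers.ini excerpt
[global]
WORKER_CLASS = qemu_x86_64,tap,openqaworker14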
Updated by okurz over 2 years ago
- Related to action #107497: [qe-tools] openqaworker14 (and openqa15) - developer mode fails size:M added
Updated by okurz over 2 years ago
Sounds good. It shouldn't be necessary to conduct this many manual steps. We already have machines that only have a single NVMe and no other storage device, e.g. openqaworker8. There we use the same salt states with the same NVMe setup scripts. As part of the salt states we also enable automatic updates with automatic reboots, so better check that reboots work consistently.
Regarding the bandwidth: running on OSD while true; do dd bs=100M count=20 if=/dev/zero | nc -l 42420; done
and doing timeout 20 nc openqa.suse.de 42420 | dd of=/dev/null status=progress
locally on my DSL-connected work notebook over VPN, I get 5.0 MB/s. On openqaworker14 I get 2.5 MB/s. So the SUSE inter-office connection is 2x slower than my home connection. That does not sound right. I suggest we crosscheck with SUSE-IT EngInfra what we can expect. You can do that as part of this ticket or create a dedicated ticket for it, because that would be about tuning the performance, not about the general possibility to use these machines.
Please link progress tickets in the format #<id> so that a direct link with preview appears. Also, ideally link such tickets directly as "related" tickets.
Updated by okurz over 2 years ago
- Due date changed from 2022-03-04 to 2022-03-11
As discussed in the weekly osukup will contact EngInfra regarding network speed and continue with the salt states merge request regarding tty config.
Updated by okurz over 2 years ago
- Due date deleted (2022-03-11)
- Status changed from In Progress to Blocked
Setting to blocked on the ticket you created: https://sd.suse.com/servicedesk/customer/portal/1/SD-79330
Updated by osukup over 2 years ago
Bandwidth NUE-PRG looks better now: instead of 1-2MB/s we now see speeds of about 20MB/s.
Updated by osukup over 2 years ago
- Related to action #109301: openqaworker14 + openqaworker15 sporadically get stuck on boot added
Updated by osukup over 2 years ago
As mentioned in #109301 (https://progress.opensuse.org/issues/109301) there are boot problems on these two workers:
-> the setup of /dev/md/openqa fails from time to time during boot ..
Updated by okurz over 2 years ago
- Status changed from Blocked to Workable
As commented in #109301 this needs to be solved as part of this ticket, as the two machines are not usable without it. Please solve it here.
Updated by livdywan over 2 years ago
- Due date set to 2022-04-15
I suggest we discuss this in the unblock meeting since Ondřej is still fighting with some disk setup issues, I didn't see much conversation, and there's no update here.
Updated by okurz over 2 years ago
- Due date changed from 2022-04-15 to 2022-04-29
- Assignee changed from osukup to mkittler
- The workers get stuck formatting the NVMe
- Error when stopping the systemd service
- osukup tried to reproduce the problems without rebooting: everything works when starting the openQA NVMe format+preparation services on a running system, so the issue still needs to be reproduced via actual reboots
- We have what should be the same setup on openqaworker8 (and 9) with just a single NVMe and no other devices. We have the services openqa_nvme_format+openqa_nvme_prepare on openqaworker8 and they work fine over reboots
Suggestion by okurz:
- mkittler tries reboot tests on openqaworker14+15 where the systems get stuck, to fix the race condition that osukup identified
Updated by mkittler over 2 years ago
I changed the kernel boot parameters so we'll get a tty/emergency shell and also some output via IPMI SOL.
That's how it looks in the error case:
Apr 13 12:39:50 openqaworker14 systemd[1]: Starting Setup NVMe before mounting it...
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1537]: Current mount points (printed for debugging purposes):
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: devtmpfs on /dev type devtmpfs (rw,nosuid,noexec,size=4096k,nr_inodes=1048576,mode=755)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: tmpfs on /run type tmpfs (rw,nosuid,nodev,size=211033536k,nr_inodes=819200,mode=755)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on / type btrfs (rw,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=34905)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /.snapshots type btrfs (rw,relatime,ssd,space_cache,subvolid=266,subvol=/@/.snapshots)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p1 on /boot/efi type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,utf8,errors=remount-ro)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /home type btrfs (rw,relatime,ssd,space_cache,subvolid=263,subvol=/@/home)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /opt type btrfs (rw,relatime,ssd,space_cache,subvolid=262,subvol=/@/opt)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/tmp)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /boot/grub2/i386-pc type btrfs (rw,relatime,ssd,space_cache,subvolid=265,subvol=/@/boot/grub2/i386-pc)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /usr/local type btrfs (rw,relatime,ssd,space_cache,subvolid=258,subvol=/@/usr/local)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /root type btrfs (rw,relatime,ssd,space_cache,subvolid=261,subvol=/@/root)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /var type btrfs (rw,relatime,ssd,space_cache,subvolid=257,subvol=/@/var)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /boot/grub2/x86_64-efi type btrfs (rw,relatime,ssd,space_cache,subvolid=264,subvol=/@/boot/grub2/x86_64-efi)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /srv type btrfs (rw,relatime,ssd,space_cache,subvolid=260,subvol=/@/srv)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1537]: Present block devices (printed for debugging purposes):
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]: NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]: nvme0n1 259:0 0 3.5T 0 disk
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]: ├─nvme0n1p1 259:1 0 512M 0 part /boot/efi
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]: ├─nvme0n1p2 259:2 0 1T 0 part /
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]: └─nvme0n1p3 259:3 0 2.5T 0 part
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]: └─md127 9:127 0 2.5T 0 raid0
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1537]: Stopping current RAID "/dev/md/openqa"
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1542]: mdadm: stopped /dev/md/openqa
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1554]: Personalities : [raid0]
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1554]: unused devices: <none>
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1537]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1561]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1561]: level=raid0 devices=1 ctime=Wed Apr 13 12:04:21 2022
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1561]: mdadm: unexpected failure opening /dev/md127
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1537]: Unable to create RAID, mdadm returned with non-zero code
Apr 13 12:39:51 openqaworker14 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=1/FAILURE
Apr 13 12:39:51 openqaworker14 systemd[1]: openqa_nvme_format.service: Failed with result 'exit-code'.
Apr 13 12:39:51 openqaworker14 systemd[1]: Failed to start Setup NVMe before mounting it.
On ow15 it currently fails every time on a syntax error in /var/local/bin/openqa-establish-nvme-setup. I assume that's from previous debugging attempts.
Updated by mkittler over 2 years ago
I played around with mdadm:
martchus@openqaworker14:~> mdadm_args=(--create /dev/md/openqa --level=0 --force --assume-clean)
martchus@openqaworker14:~> mdadm_args+=(--raid-devices=1 --run /dev/nvme0n1p3)
martchus@openqaworker14:~> sudo systemctl stop var-lib-openqa.mount
martchus@openqaworker14:~> sudo mdadm "${mdadm_args[@]}"
mdadm: cannot open /dev/nvme0n1p3: Device or resource busy
martchus@openqaworker14:~> sudo mdadm --stop /dev/md/openqa
mdadm: stopped /dev/md/openqa
martchus@openqaworker14:~> sudo mdadm "${mdadm_args[@]}"
mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
level=raid0 devices=1 ctime=Wed Apr 13 13:22:53 2022
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/openqa started.
martchus@openqaworker14:~> echo $?
0
martchus@openqaworker14:~> for i in {01..100}; do echo "try $i" && sudo mdadm --stop /dev/md/openqa && sudo mdadm "${mdadm_args[@]}" || break ; done
…
try 100
mdadm: stopped /dev/md/openqa
mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
level=raid0 devices=1 ctime=Wed Apr 13 16:28:00 2022
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/openqa started.
So the message "mdadm: /dev/nvme0n1p3 appears to be part of a raid array:" also appears in the good case of "mdadm: array /dev/md/openqa started." and is not necessarily related to the error "mdadm: unexpected failure opening /dev/md127" we sometimes get. The problem is unfortunately not reproducible when the system is already up.
Considering
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1537]: Stopping current RAID "/dev/md/openqa"
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1542]: mdadm: stopped /dev/md/openqa
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1554]: Personalities : [raid0]
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1554]: unused devices: <none>
we can also assume that the previous RAID has been stopped.
So the real question is what this "unexpected error" is. It is printed by mdopen.c when the preceding open_dev_excl() call fails, but unfortunately no further error details are propagated. Note that this function already has an internal retry with exponential back-off in case the error is EBUSY. It does 25 attempts with a sleep of min(2^i, 200) seconds, so it would wait about an hour¹. That should be more than enough to rule out EBUSY. We can also rule out that the device is already in use because that is checked beforehand (and it is unlikely that this changed in the meantime).
This still leaves any other error as a possibility. The function internally calls dev_open(), which basically calls open() with the device path. So we might run into any of https://man.archlinux.org/man/open.2#ERRORS.
¹ [...Array(25).keys()].reduce((a, b) => a + Math.min(Math.pow(2, b), 200)) / 60 / 60 ≈ 1.015 (hours)
Updated by mkittler over 2 years ago
- Status changed from Workable to In Progress
I installed a custom build of mdadm. Now we see the specific error message:
Apr 13 17:35:16 openqaworker14 systemd[1]: Starting Setup NVMe before mounting it...
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Using mdadm from:
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1490]: /home/martchus/mdadm/mdadm
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Current mount points (printed for debugging purposes):
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: devtmpfs on /dev type devtmpfs (rw,nosuid,noexec,size=4096k,nr_inodes=1048576,mode=755)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: tmpfs on /run type tmpfs (rw,nosuid,nodev,size=211033536k,nr_inodes=819200,mode=755)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on / type btrfs (rw,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=57366)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /.snapshots type btrfs (rw,relatime,ssd,space_cache,subvolid=266,subvol=/@/.snapshots)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /boot/grub2/i386-pc type btrfs (rw,relatime,ssd,space_cache,subvolid=265,subvol=/@/boot/grub2/i386-pc)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /boot/grub2/x86_64-efi type btrfs (rw,relatime,ssd,space_cache,subvolid=264,subvol=/@/boot/grub2/x86_64-efi)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /home type btrfs (rw,relatime,ssd,space_cache,subvolid=263,subvol=/@/home)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /opt type btrfs (rw,relatime,ssd,space_cache,subvolid=262,subvol=/@/opt)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /srv type btrfs (rw,relatime,ssd,space_cache,subvolid=260,subvol=/@/srv)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /root type btrfs (rw,relatime,ssd,space_cache,subvolid=261,subvol=/@/root)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/tmp)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /usr/local type btrfs (rw,relatime,ssd,space_cache,subvolid=258,subvol=/@/usr/local)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /var type btrfs (rw,relatime,ssd,space_cache,subvolid=257,subvol=/@/var)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Present block devices (printed for debugging purposes):
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]: NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]: nvme0n1 259:0 0 3.5T 0 disk
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]: ├─nvme0n1p1 259:1 0 512M 0 part
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]: ├─nvme0n1p2 259:2 0 1T 0 part /
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]: └─nvme0n1p3 259:3 0 2.5T 0 part
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]: └─md127 9:127 0 2.5T 0 raid0
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Stopping current RAID "/dev/md/openqa"
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1494]: mdadm: stopped /dev/md/openqa
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1521]: Personalities : [raid0]
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1521]: unused devices: <none>
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1532]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1532]: level=raid0 devices=1 ctime=Wed Apr 13 17:00:01 2022
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1532]: mdadm: unexpected failure opening /dev/md127: No such device or address
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Unable to create RAID, mdadm returned with non-zero code
Apr 13 17:35:16 openqaworker14 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=1/FAILURE
Apr 13 17:35:16 openqaworker14 systemd[1]: openqa_nvme_format.service: Failed with result 'exit-code'.
Apr 13 17:35:16 openqaworker14 systemd[1]: Failed to start Setup NVMe before mounting it.
So it apparently failed to create the device earlier, but that error was apparently ignored, and thus we only get the error when trying to open the device.
Pressing Ctrl+D to leave the emergency shell helps in that situation. I'm not sure whether that was always the case, though. Assuming it is, we could implement a retry in the systemd service (see the sketch below). It also means that my custom build of mdadm is generally working (and not just misleading us by introducing yet another problem; on ow15 my custom build also just worked immediately after booting).
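A minimal sketch of such a retry around the RAID creation inside the setup script, assuming up to 10 attempts with a 10 s pause as suggested by the log messages further below (the actual change was submitted separately as mentioned in the next comment):
# illustrative retry loop; messages and limits are placeholders
for attempt in {1..10}; do
  mdadm --create /dev/md/openqa --level=0 --raid-devices=1 --force --run /dev/nvme0n1p3 && break
  echo "Unable to create RAID (attempt $attempt of 10), waiting 10 s before trying again"
  sleep 10
done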
Updated by mkittler over 2 years ago
I've added a retry in our script, for which I'll create an SR. I only had to do one restart to see it already making a difference:
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /boot/grub2/i386-pc type btrfs (rw,relatime,ssd,space_cache,subvolid=265,subvol=/@/boot/grub2/i386-pc)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /.snapshots type btrfs (rw,relatime,ssd,space_cache,subvolid=266,subvol=/@/.snapshots)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /boot/grub2/x86_64-efi type btrfs (rw,relatime,ssd,space_cache,subvolid=264,subvol=/@/boot/grub2/x86_64-efi)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /home type btrfs (rw,relatime,ssd,space_cache,subvolid=263,subvol=/@/home)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /opt type btrfs (rw,relatime,ssd,space_cache,subvolid=262,subvol=/@/opt)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /root type btrfs (rw,relatime,ssd,space_cache,subvolid=261,subvol=/@/root)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /usr/local type btrfs (rw,relatime,ssd,space_cache,subvolid=258,subvol=/@/usr/local)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /srv type btrfs (rw,relatime,ssd,space_cache,subvolid=260,subvol=/@/srv)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/tmp)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /var type btrfs (rw,relatime,ssd,space_cache,subvolid=257,subvol=/@/var)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1514]: Present block devices (printed for debugging purposes):
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]: NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]: nvme0n1 259:0 0 3.5T 0 disk
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]: ├─nvme0n1p1 259:1 0 512M 0 part
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]: ├─nvme0n1p2 259:2 0 1T 0 part /
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]: └─nvme0n1p3 259:3 0 2.5T 0 part
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]: └─md127 9:127 0 2.5T 0 raid0
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1514]: Stopping current RAID "/dev/md/openqa"
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1522]: mdadm: stopped /dev/md/openqa
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1529]: Personalities : [raid0]
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1529]: unused devices: <none>
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1514]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1537]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1537]: level=raid0 devices=1 ctime=Wed Apr 13 17:35:48 2022
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1537]: mdadm: unexpected failure opening /dev/md127: No such device or address
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1514]: Unexpected error opening device, waiting 10 before trying again (as retrying usually helps).
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1514]: Trying RAID0 creation again after timeout (attempt 2 of 10)
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1632]: Personalities : [raid0]
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1632]: unused devices: <none>
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1514]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1639]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1639]: level=raid0 devices=1 ctime=Wed Apr 13 17:35:48 2022
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1639]: mdadm: Defaulting to version 1.2 metadata
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1639]: mdadm: array /dev/md/openqa started.
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1514]: Status for RAID0 "/dev/md/openqa"
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1658]: md127 : active raid0 nvme0n1p3[0]
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1660]: ARRAY /dev/md/openqa metadata=1.2 name=openqaworker14:openqa UUID=0669a47c:5201d4f3:f7f5420a:c0f7f4a7
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1514]: Creating ext2 filesystem on RAID0 "/dev/md/openqa"
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1662]: mke2fs 1.43.8 (1-Jan-2018)
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: [258B blob data]
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: Creating filesystem with 669084672 4k blocks and 167272448 inodes
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: Filesystem UUID: 8ae30c10-6d08-44a5-8913-85bf60940c17
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: Superblock backups stored on blocks:
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: 102400000, 214990848, 512000000, 550731776, 644972544
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: [73B blob data]
Apr 14 13:35:05 openqaworker14 openqa-establish-nvme-setup[1662]: [378B blob data]
Apr 14 13:35:17 openqaworker14 openqa-establish-nvme-setup[1662]: [107B blob data]
Apr 14 13:35:17 openqaworker14 systemd[1]: openqa_nvme_format.service: Succeeded.
Apr 14 13:35:17 openqaworker14 systemd[1]: Finished Setup NVMe before mounting it.
Whether it'll help in all cases I don't know but it is certainly an improvement. So I'll try to reboot a few more times.
Updated by mkittler over 2 years ago
And again:
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1564]: Present block devices (printed for debugging purposes):
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]: NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]: nvme0n1 259:0 0 3.5T 0 disk
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]: ├─nvme0n1p1 259:1 0 512M 0 part
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]: ├─nvme0n1p2 259:2 0 1T 0 part /
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]: └─nvme0n1p3 259:3 0 2.5T 0 part
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]: └─md127 9:127 0 2.5T 0 raid0
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1564]: Stopping current RAID "/dev/md/openqa"
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1571]: mdadm: stopped /dev/md/openqa
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1582]: Personalities : [raid0]
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1582]: unused devices: <none>
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1564]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1590]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1590]: level=raid0 devices=1 ctime=Thu Apr 14 13:34:47 2022
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1590]: mdadm: unexpected failure opening /dev/md127: No such device or address
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1564]: Unexpected error opening device, waiting 10 before trying again (as retrying usually helps).
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1564]: Trying RAID0 creation again after timeout (attempt 2 of 10)
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1618]: Personalities : [raid0]
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1618]: unused devices: <none>
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1564]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1625]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1625]: level=raid0 devices=1 ctime=Thu Apr 14 13:34:47 2022
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1625]: mdadm: Defaulting to version 1.2 metadata
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1625]: mdadm: array /dev/md/openqa started.
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1564]: Status for RAID0 "/dev/md/openqa"
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1644]: md127 : active raid0 nvme0n1p3[0]
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1646]: ARRAY /dev/md/openqa metadata=1.2 name=openqaworker14:openqa UUID=1dc43f3d:fbecb084:33ff1543:61f84fb8
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1564]: Creating ext2 filesystem on RAID0 "/dev/md/openqa"
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1648]: mke2fs 1.43.8 (1-Jan-2018)
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: [220B blob data]
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: Creating filesystem with 669084672 4k blocks and 167272448 inodes
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: Filesystem UUID: b7ec8671-c15f-4887-9b78-157c1aa21715
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: Superblock backups stored on blocks:
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: 102400000, 214990848, 512000000, 550731776, 644972544
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: [73B blob data]
Apr 14 13:47:13 openqaworker14 openqa-establish-nvme-setup[1648]: [400B blob data]
Apr 14 13:47:25 openqaworker14 openqa-establish-nvme-setup[1648]: [107B blob data]
Apr 14 13:47:25 openqaworker14 systemd[1]: openqa_nvme_format.service: Succeeded.
Apr 14 13:47:25 openqaworker14 systemd[1]: Finished Setup NVMe before mounting it.
And again:
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1505]: Present block devices (printed for debugging purposes):
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]: NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]: nvme0n1 259:0 0 3.5T 0 disk
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]: ├─nvme0n1p1 259:1 0 512M 0 part
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]: ├─nvme0n1p2 259:2 0 1T 0 part /
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]: └─nvme0n1p3 259:3 0 2.5T 0 part
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]: └─md127 9:127 0 2.5T 0 raid0
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1505]: Stopping current RAID "/dev/md/openqa"
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1509]: mdadm: stopped /dev/md/openqa
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1505]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1562]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1562]: level=raid0 devices=1 ctime=Wed Apr 13 17:53:38 2022
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1562]: mdadm: unexpected failure opening /dev/md127: No such device or address
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1505]: Unexpected error opening device, waiting 10 before trying again (as retrying usually helps).
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1505]: Trying RAID0 creation again after timeout (attempt 2 of 10)
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1505]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1654]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1654]: level=raid0 devices=1 ctime=Wed Apr 13 17:53:38 2022
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1654]: mdadm: Defaulting to version 1.2 metadata
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1654]: mdadm: array /dev/md/openqa started.
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1505]: Status for RAID0 "/dev/md/openqa"
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1673]: md127 : active raid0 nvme0n1p3[0]
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1675]: ARRAY /dev/md/openqa metadata=1.2 name=openqaworker15:openqa UUID=e2f5086b:edbf62a8:14bb3a8a:b575f4d3
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1505]: Creating ext2 filesystem on RAID0 "/dev/md/openqa"
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1676]: mke2fs 1.43.8 (1-Jan-2018)
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: [220B blob data]
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: Creating filesystem with 669084672 4k blocks and 167272448 inodes
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: Filesystem UUID: d1e04ad2-cbc3-4356-bd70-36d16a9eef7e
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: Superblock backups stored on blocks:
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: 102400000, 214990848, 512000000, 550731776, 644972544
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: [73B blob data]
Apr 14 13:48:31 openqaworker15 openqa-establish-nvme-setup[1676]: [400B blob data]
Apr 14 13:48:43 openqaworker15 openqa-establish-nvme-setup[1676]: [107B blob data]
Apr 14 13:48:43 openqaworker15 systemd[1]: openqa_nvme_format.service: Succeeded.
Apr 14 13:48:43 openqaworker15 systemd[1]: Finished Setup NVMe before mounting it.
So I'm quite sure that the retry helps and likely that's also all we need.
Updated by mkittler over 2 years ago
- Status changed from In Progress to Feedback
I did 8 more reboots (4 on each machine). I encountered 5 more good cases (no retry required) and 3 more handled bad cases (retry required but it helped), but no more unhandled bad cases. Interestingly, all remaining bad cases happened on ow15. So it seems this host is more affected, but it could also just be a coincidence.
When https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/671 has been merged I can bring the workers back into salt (AC2) and remove possible workarounds:
- Resume alerts (AC4)
- Run production jobs (AC1): https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/404
Not sure whether it makes sense to rush this today because I'll only be back on Wednesday and some closer monitoring would be good (AC3).
Updated by okurz over 2 years ago
mkittler wrote:
I did 8 more reboots (4 on each machine). I encountered 5 more good cases (no retry required) and 3 more handled bad cases (retry required but it helped), but no more unhandled bad cases. Interestingly, all remaining bad cases happened on ow15. So it seems this host is more affected, but it could also just be a coincidence.
When https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/671 has been merged …
Done.
- Run production jobs (AC1): https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/404
I guess you mean https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/405
Before we enable this we should have sufficient tests showing that the workers are stable. Please trigger openQA tests using the existing worker class settings, that's why we have them :)
Not sure whether it makes sense to rush this today because I'll only be back on Wednesday and some closer monitoring would be good (AC3).
yes, no need to rush.
Updated by mkittler over 2 years ago
I've been re-triggering a variety¹ of jobs: https://openqa.suse.de/tests/overview?build=test-ow14 https://openqa.suse.de/tests/overview?build=test-ow15
¹ They're simply the 100 most recently passed qemu x86_64 jobs on OSD without parallel dependencies.
Looks like I've messed up the worker class. So jobs on both overviews might be started on any of the workers.
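For reference, re-triggering a single passed job onto one of these workers with a dedicated build label could look roughly like this (a hypothetical invocation; job id, build label and worker class are placeholders, and the actual selection of the 100 jobs was done separately):
openqa-clone-job --within-instance https://openqa.suse.de 8582778 \
  BUILD=test-ow14 WORKER_CLASS=openqaworker14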
Updated by mkittler over 2 years ago
All jobs incompleted early, likely due to a too old os-autoinst version. Considering the workers are not in salt, that's expected, so I'll update them manually for now and re-trigger the jobs (this time with the correct worker classes).
Updated by mkittler over 2 years ago
The tap setup isn't working yet but it generally doesn't look too bad:
openqa=> select jobs.id, job_settings.value as worker, jobs.result, jobs.reason from jobs join job_settings on jobs.id = job_settings.job_id where state = 'done' and not (result = 'passed' or result = 'softfailed') and job_settings.key = 'WORKER_CLASS' and (job_settings.value = 'openqaworker14' or job_settings.value = 'openqaworker15') and t_started >= '2022-04-20' order by id desc;
id | worker | result | reason
---------+----------------+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
8582778 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 25' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582777 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap11 18' failed: 'tap11' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582776 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap4 47' failed: 'tap4' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582774 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 25' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582772 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap11 18' failed: 'tap11' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582770 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 18' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582768 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 18' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582761 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap8 58' failed: 'tap8' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582760 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 18' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582759 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap14 18' failed: 'tap14' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582757 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap9 18' failed: 'tap9' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582755 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap8 58' failed: 'tap8' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582753 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap14 25' failed: 'tap14' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582751 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap9 18' failed: 'tap9' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582749 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap8 58' failed: 'tap8' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582747 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap14 60' failed: 'tap14' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582739 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap8 18' failed: 'tap8' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582613 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 60' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
(18 rows)
I also found some older jobs failing with timeout_exceeded (reason: timeout: test execution exceeded MAX_JOB_TIME) when doing some other SQL queries. Note that these jobs simply timed out because they were waiting for a parallel job, so they are likely also just a symptom of the broken tap setup (and not a symptom of slow downloads).
Updated by mkittler over 2 years ago
The tap setup wasn't really broken. It was just that I had manually increased the number of worker slots (to 16) so the 100 jobs would be processed more quickly and more like how we'd eventually run them in production. When reducing the number of worker slots back to what's configured in salt, the tap jobs work (see https://openqa.suse.de/tests/8587752#dependencies and https://openqa.suse.de/tests/8587756#dependencies).
So I suppose https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/405 can be merged. (It will also increase the number of worker slots, but when salt does it, it takes care to extend the tap config accordingly.)
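For completeness, a quick way to verify that every tap device configured for br1 is actually attached to the bridge — a minimal sketch only, assuming the Open vSwitch CLI tools are installed on the worker and that the salt-generated ifcfg-br1 lists all expected taps:
# List every tap name referenced in the bridge config and check whether
# Open vSwitch knows it as a port of some bridge
for tap in $(grep -o 'tap[0-9]*' /etc/sysconfig/network/ifcfg-br1 | sort -u); do
    sudo ovs-vsctl port-to-br "$tap" >/dev/null 2>&1 || echo "$tap is not attached to any bridge"
done
With a mismatch like the one above, the taps of the extra worker slots would show up in this output immediately.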
Updated by okurz over 2 years ago
openQA verification jobs look nice, https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/405 merged
Updated by okurz over 2 years ago
I have unpaused the "host up" alert for both machines as everything looks green right now.
Updated by mkittler over 2 years ago
Salt is unfortunately broken on ow14 and 15. It always tries to connect to 127.0.0.1 instead of OSD. Setting master in /etc/salt/minion to OSD's IP doesn't help. Hacking around the issue via martchus@openqaworker14:~> ssh -L "4506:localhost:4506" -N -f openqa.suse.de lets it at least authenticate, but the Minion still appears unresponsive from the master's side.
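Before digging further into the minion internals it may help to rule out basic name resolution and connectivity problems. A minimal sketch, assuming nothing more than bash and getent on the worker (salt uses port 4505 for the publish channel and 4506 for returns):
# How do "salt" and the master hostname resolve from this worker?
getent hosts salt openqa.suse.de
# Are the two salt ports reachable at all?
for port in 4505 4506; do
    timeout 5 bash -c "cat < /dev/null > /dev/tcp/openqa.suse.de/$port" \
        && echo "port $port reachable" || echo "port $port NOT reachable"
done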
Note that on a non-broken worker, e.g. openqaworker5, the minion log looks like this:
[DEBUG ] Connecting to master. Attempt 1 of 1
[DEBUG ] Error while parsing IPv4 address: salt
[DEBUG ] Expected 4 octets in 'salt'
[DEBUG ] Error while parsing IPv6 address: salt
[DEBUG ] At least 3 parts expected in 'salt'
[DEBUG ] "salt" Not an IP address? Assuming it is a hostname.
[DEBUG ] Master URI: tcp://10.160.0.207:4506
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'openqaworker5.suse.de', 'tcp://10.160.0.207:4506')
[DEBUG ] Generated random reconnect delay between '1000ms' and '11000ms' (2303)
[DEBUG ] Setting zmq_reconnect_ivl to '2303ms'
[DEBUG ] Setting zmq_reconnect_ivl_max to '11000ms'
[TRACE ] IPCServer: Handling connection to address:
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'openqaworker5.suse.de', 'tcp://10.160.0.207:4506', 'clear')
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://10.160.0.207:4506
[DEBUG ] Trying to connect to: tcp://10.160.0.207:4506
[TRACE ] Inserted key into loop_instance_map id 140315043651088 for key ('/etc/salt/pki/minion', 'openqaworker5.suse.de', 'tcp://10.160.0.207:4506', 'clear') and process 15063
[DEBUG ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG ] Decrypting the current master AES key
[DEBUG ] salt.crypt.get_rsa_key: Loading private key
[DEBUG ] salt.crypt._get_key_with_evict: Loading private key
[DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG ] salt.crypt.verify_signature: Loading public key
[DEBUG ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG ] salt.crypt.verify_signature: Verifying signature
[DEBUG ] <salt.crypt.AsyncAuth object at 0x7f9da5715d68> Got new master aes key.
[DEBUG ] Closing AsyncZeroMQReqChannel instance
[DEBUG ] Connecting the Minion to the Master publish port, using the URI: tcp://10.160.0.207:4505
[DEBUG ] salt.crypt.get_rsa_key: Loading private key
[DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'openqaworker5.suse.de', 'tcp://10.160.0.207:4506', 'aes')
[DEBUG ] Re-using AsyncAuth for ('/etc/salt/pki/minion', 'openqaworker5.suse.de', 'tcp://10.160.0.207:4506')
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://10.160.0.207:4506
On ow14 and 15 it looks very similar, but OSD's IP is replaced with 127.0.0.1:
[DEBUG ] "salt" Not an IP address? Assuming it is a hostname.
…
…
[DEBUG ] Trying to connect to: tcp://127.0.0.1:4506
[DEBUG ] Closing AsyncZeroMQReqChannel instance
[ERROR ] Error while bringing up minion for multi-master. Is master at 10.160.0.207 responding?
[DEBUG ] Connecting to master. Attempt 1 of 1
[DEBUG ] Master URI: tcp://127.0.0.1:4506
…
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://127.0.0.1:4506
[DEBUG ] SaltReqTimeoutError, retrying. (1/7)
Curiously, it still says "Is master at 10.160.0.207 responding?" as if the correct host were generally configured but then somehow not passed down to some underlying function.
The salt-minion version is the same as on other hosts.
Updated by mkittler over 2 years ago
If I also do martchus@openqaworker14:~> ssh -L "4505:localhost:4505" -N -f -l martchus openqa.suse.de (and not just for port 4506), I get a little bit further:
martchus@openqa:~> sudo salt openqaworker14.qa.suse.cz state.apply
openqaworker14.qa.suse.cz:
----------
ID: states
Function: no.None
Result: False
Comment: No Top file or master_tops data matches found. Please see master log for details.
Changes:
Summary for openqaworker14.qa.suse.cz
------------
Succeeded: 0
Failed: 1
------------
Total states run: 1
Total run time: 0.000 ms
ERROR: Minions returned with non-zero exit code
The Minion is then responding but shows this error message. Something must be configured wrongly on ow14 and 15, but all the config looks the same as on ow5, which has no issues. So I'm out of ideas.
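In hindsight, dumping the configuration the minion actually uses, instead of comparing files by eye, would probably have surfaced the stray override found below directly. A sketch; salt-call ships with the salt-minion package and --local works without a master connection:
# Show the effective values the minion resolves for the suspicious settings
sudo salt-call --local config.get master
sudo salt-call --local config.get file_client
# And find where any override actually lives on disk
sudo grep -rn -e '^master' -e '^file_client' /etc/salt/minion /etc/salt/minion.d/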
Updated by mkittler over 2 years ago
Looks like somebody put file_client: local in the config, which I had missed on Friday. It works after removing that. Now the worker class change to run normal production jobs is effective. Salt also extended the tap setup as expected:
----------
ID: /etc/sysconfig/network/ifcfg-br1
Function: file.managed
Result: True
Comment: File /etc/sysconfig/network/ifcfg-br1 updated
Started: 10:57:33.346842
Duration: 13.948 ms
Changes:
----------
diff:
---
+++
@@ -14,5 +14,40 @@
OVS_BRIDGE_PORT_DEVICE_3='tap3'
OVS_BRIDGE_PORT_DEVICE_67='tap67'
OVS_BRIDGE_PORT_DEVICE_131='tap131'
+OVS_BRIDGE_PORT_DEVICE_4='tap4'
+OVS_BRIDGE_PORT_DEVICE_68='tap68'
+OVS_BRIDGE_PORT_DEVICE_132='tap132'
+OVS_BRIDGE_PORT_DEVICE_5='tap5'
+OVS_BRIDGE_PORT_DEVICE_69='tap69'
+OVS_BRIDGE_PORT_DEVICE_133='tap133'
+OVS_BRIDGE_PORT_DEVICE_6='tap6'
+OVS_BRIDGE_PORT_DEVICE_70='tap70'
+OVS_BRIDGE_PORT_DEVICE_134='tap134'
+OVS_BRIDGE_PORT_DEVICE_7='tap7'
+OVS_BRIDGE_PORT_DEVICE_71='tap71'
+OVS_BRIDGE_PORT_DEVICE_135='tap135'
+OVS_BRIDGE_PORT_DEVICE_8='tap8'
+OVS_BRIDGE_PORT_DEVICE_72='tap72'
+OVS_BRIDGE_PORT_DEVICE_136='tap136'
+OVS_BRIDGE_PORT_DEVICE_9='tap9'
+OVS_BRIDGE_PORT_DEVICE_73='tap73'
+OVS_BRIDGE_PORT_DEVICE_137='tap137'
+OVS_BRIDGE_PORT_DEVICE_10='tap10'
+OVS_BRIDGE_PORT_DEVICE_74='tap74'
+OVS_BRIDGE_PORT_DEVICE_138='tap138'
+OVS_BRIDGE_PORT_DEVICE_11='tap11'
+OVS_BRIDGE_PORT_DEVICE_75='tap75'
+OVS_BRIDGE_PORT_DEVICE_139='tap139'
+OVS_BRIDGE_PORT_DEVICE_12='tap12'
+OVS_BRIDGE_PORT_DEVICE_76='tap76'
+OVS_BRIDGE_PORT_DEVICE_140='tap140'
+OVS_BRIDGE_PORT_DEVICE_13='tap13'
+OVS_BRIDGE_PORT_DEVICE_77='tap77'
+OVS_BRIDGE_PORT_DEVICE_141='tap141'
+OVS_BRIDGE_PORT_DEVICE_14='tap14'
+OVS_BRIDGE_PORT_DEVICE_78='tap78'
+OVS_BRIDGE_PORT_DEVICE_142='tap142'
+OVS_BRIDGE_PORT_DEVICE_15='tap15'
+OVS_BRIDGE_PORT_DEVICE_79='tap79'
+OVS_BRIDGE_PORT_DEVICE_143='tap143'
PRE_UP_SCRIPT="wicked:gre_tunnel_preup.sh"
-ZONE=trusted
mode:
0644
I'll check the incomplete/failure rate of the workers in a few days. If everything looks good, the ticket can then be resolved (as all ACs will then be covered).
Updated by mkittler over 2 years ago
So far, the fail/incomplete ratio looks ok to me compared to other worker hosts:
openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed' or result='incomplete') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where t_finished >= '2022-04-25 11:00' group by host order by ratio_failed_by_host desc;
host | ratio_failed_by_host | total
---------------------+----------------------+-------
openqaworker-arm-4 | 88.1 | 168
openqaworker2 | 64.09 | 362
openqaworker-arm-3 | 33.94 | 327
openqaworker14 | 30.92 | 414
malbec | 26.6 | 94
powerqaworker-qam-1 | 25.93 | 135
openqaworker-arm-2 | 25.6 | 336
QA-Power8-5-kvm | 24.36 | 156
openqaworker-arm-1 | 24.19 | 186
grenache-1 | 23.63 | 787
openqaworker15 | 20 | 405
QA-Power8-4-kvm | 18.9 | 164
openqaworker5 | 16.8 | 899
openqaworker6 | 16.22 | 709
openqaworker3 | 11.98 | 459
openqaworker9 | 11.24 | 347
openqaworker10 | 9.09 | 209
openqaworker13 | 8.5 | 341
openqaworker8 | 6.67 | 330
openqa-piworker | 0 | 7
| 0 | 325
(21 rows)
Note that 15 is 10 percentage points better although it is the very same hardware using the very same connection. So I suppose that's the margin we have to consider anyway.
I also haven't seen any jobs running into timeouts (the setup timeout was a concern due to the workers' connection from Prague):
openqa=> select id, t_finished, result, reason from jobs where (select host from workers where id = assigned_worker_id) similar to 'openqaworker(14|15)' and (result = 'incomplete' or result = 'failed') and reason like '%timeout%' and t_finished >= '2022-04-25 11:00' order by t_finished;
id | t_finished | result | reason
----+------------+--------+--------
(0 rows)
Updated by mkittler over 2 years ago
I've noticed that there are lots of jobs that just failed due to syntax errors. When ignoring those failures for the sake of this investigation, the figures change considerably:
openqa=> with finished as (select result, reason, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where (result='failed' or result='incomplete') and reason not like '%main%syntax error%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where t_finished >= '2022-04-25 11:00' group by host order by ratio_failed_by_host desc;
host | ratio_failed_by_host | total
---------------------+----------------------+-------
openqaworker2 | 35.89 | 404
openqaworker-arm-3 | 16.72 | 329
openqaworker-arm-2 | 12.02 | 341
openqaworker-arm-1 | 9.47 | 190
grenache-1 | 6.1 | 787
openqaworker5 | 3.74 | 908
powerqaworker-qam-1 | 2.96 | 135
QA-Power8-5-kvm | 2.56 | 156
openqaworker13 | 2.33 | 343
QA-Power8-4-kvm | 1.83 | 164
openqaworker-arm-4 | 1.19 | 168
openqaworker10 | 0.94 | 212
openqaworker6 | 0.7 | 715
openqaworker9 | 0.57 | 351
openqaworker14 | 0.48 | 421
openqaworker8 | 0.3 | 332
openqaworker15 | 0.24 | 412
openqaworker3 | 0.21 | 466
| 0 | 325
openqa-piworker | 0 | 7
malbec | 0 | 95
(21 rows)
Updated by mkittler over 2 years ago
The fail/incomplete ratio on 14/15 still looks ok:
openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed' or result='incomplete') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where t_finished >= '2022-04-25 11:00' group by host order by ratio_failed_by_host desc;
host | ratio_failed_by_host | total
---------------------+----------------------+-------
openqaworker2 | 78.41 | 1371
openqaworker-arm-4 | 61.29 | 403
openqaworker14 | 32.14 | 837
grenache-1 | 31.72 | 1854
openqaworker-arm-3 | 26.83 | 779
openqaworker-arm-1 | 20.71 | 449
openqaworker15 | 20.27 | 824
openqaworker-arm-2 | 18.63 | 714
QA-Power8-5-kvm | 18.53 | 421
powerqaworker-qam-1 | 16.04 | 424
malbec | 15.19 | 237
openqaworker5 | 14.97 | 1637
openqaworker13 | 14.11 | 659
openqaworker6 | 13.92 | 1329
QA-Power8-4-kvm | 13.4 | 530
openqaworker9 | 12.41 | 709
openqaworker3 | 10.78 | 872
openqaworker10 | 10.55 | 455
openqaworker8 | 9.03 | 709
openqa-piworker | 0 | 7
| 0 | 621
(21 rows)
Excluding syntax errors:
openqa=> with finished as (select result, reason, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where (result='failed' or result='incomplete') and reason not like '%main%syntax error%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where t_finished >= '2022-04-25 11:00' group by host order by ratio_failed_by_host desc;
host | ratio_failed_by_host | total
---------------------+----------------------+-------
openqaworker2 | 50.11 | 1371
openqaworker-arm-3 | 13.61 | 779
openqaworker-arm-1 | 8.91 | 449
openqaworker-arm-2 | 5.74 | 714
openqaworker-arm-4 | 5.46 | 403
powerqaworker-qam-1 | 4.25 | 424
QA-Power8-5-kvm | 4.04 | 421
grenache-1 | 3.83 | 1853
openqaworker13 | 3.34 | 659
QA-Power8-4-kvm | 2.65 | 529
openqaworker5 | 2.63 | 1636
openqaworker10 | 2.2 | 455
openqaworker9 | 1.41 | 709
malbec | 1.27 | 237
openqaworker14 | 1.08 | 837
openqaworker8 | 0.85 | 709
openqaworker6 | 0.6 | 1329
openqaworker15 | 0.36 | 824
openqaworker3 | 0.34 | 872
openqa-piworker | 0 | 7
| 0 | 621
(21 rows)
Updated by mkittler over 2 years ago
- Maybe we need to remove the tap worker class: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/407 - see https://suse.slack.com/archives/C02CANHLANP/p1651052375642909
- Maybe we can add the qemu_x86_64-large-mem worker class because those workers have 1 TiB RAM: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/408 (a quick check of the effective worker classes is sketched below)
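As a sanity check after such worker class changes are applied via salt, one can look at what the salt-managed worker config ends up containing. A sketch only; the values come from the pillar and are not meant to be edited by hand:
# Show the worker classes currently configured on the host
sudo grep -rn WORKER_CLASS /etc/openqa/workers.ini
# After MR 408 is applied, the output for ow14/15 should include qemu_x86_64-large-mem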
Updated by mkittler over 2 years ago
Looks like https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/407 couldn't be cleanly applied. However, the worker class "tap" is gone and that's the main goal.
Maybe this also gives a hint about why the setup is broken:
ID: wicked ifup br1
Function: cmd.wait
Result: False
Comment: Command "wicked ifup br1" run
Started: 12:32:03.474942
Duration: 30412.6 ms
Changes:
----------
pid:
42945
retcode:
162
stderr:
stdout:
br1 setup-in-progress
That is at least the case for openqaworker14; on 15 it looks good:
ID: wicked ifup br1
Function: cmd.wait
Result: True
Comment: Command "wicked ifup br1" run
Started: 12:32:06.437255
Duration: 31487.345 ms
Changes:
----------
pid:
65768
retcode:
0
stderr:
stdout:
br1 up
The bridge is up now on ow14 as well (without me doing anything):
martchus@openqaworker14:~> sudo wicked ifup br1
br1 up
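To tell whether br1 is really up or still stuck in setup, querying wicked and Open vSwitch directly is usually enough. A sketch; both tools are already in use on these hosts judging from the output above:
# Show wicked's view of the bridge state and the OVS bridge/port layout
sudo wicked ifstatus br1
sudo ovs-vsctl show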
Updated by mkittler over 2 years ago
After reverting and applying the tap setup again via salt, it works again on ow15. So I suppose the tap setup generally works. I also tested it with a parallel cluster where some jobs are running on Nürnberg workers and some on Prague workers (see https://openqa.suse.de/tests/8656821). It also continued to work after rebooting the worker. Maybe the GRE tunnel isn't very good at coping with temporary connection problems and can get stuck in a disconnected state. We decided to create another ticket for that and leave the tap worker class out for now. Considering that curl -4 https://openqa.suse.de works and curl -6 https://openqa.suse.de doesn't (on both ow14 and ow15), it could be an IPv4 vs. IPv6 problem.
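A small check along those lines, purely as a sketch (assumes curl on the worker; the exact IPv6 failure mode would still need separate investigation):
# Compare IPv4 and IPv6 reachability of the webUI from the worker
for flag in -4 -6; do
    curl "$flag" -s -o /dev/null --max-time 10 \
        -w "curl $flag: HTTP %{http_code}\n" https://openqa.suse.de \
        || echo "curl $flag: failed"
done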
I also rebased https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/408 which we might want to merge as well. (I'll merge it after getting at least one approval.)
Now both workers are back in salt again, without the tap worker class. See suggestions in #110467 for how to continue with the tap setup.
Since the failure rate still looks generally good, I'm resolving the ticket, keeping both workers running production jobs (just not tap ones).
Updated by mkittler over 2 years ago
- Related to action #110467: Establish reliable tap setup on ow14 added
Updated by okurz over 2 years ago
- Copied to action #113366: Add three more Prague located OSD workers size:M added