https://progress.opensuse.org/ openSUSE Project Management Tool openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=480663 2022-01-17T13:17:45Z okurz okurz@suse.com
<ul><li><strong>Copied from</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed child parent" href="/issues/97862">action #97862</a>: More openQA worker hardware for OSD size:M</i> added</li></ul> openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4806752022-01-17T13:54:39Zokurzokurz@suse.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Feedback</i></li><li><strong>Assignee</strong> set to <i>okurz</i></li></ul><p>I provided my GPG key to mcaj for the credentials. Let's await that first</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4835672022-01-25T10:48:58Zokurzokurz@suse.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>New</i></li><li><strong>Assignee</strong> deleted (<del><i>okurz</i></del>)</li></ul><p><a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/381" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/381</a> including IPMI credentials which can be a start for others to continue.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4845032022-01-27T08:42:28Zokurzokurz@suse.com
<ul></ul><p>Both machines have an openSUSE Leap 15.3 preinstalled with our common, old, known root password for the time being.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4900542022-02-11T10:55:43Zokurzokurz@suse.com
<ul><li><strong>Priority</strong> changed from <i>Normal</i> to <i>High</i></li></ul> openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4900572022-02-11T11:00:01Zokurzokurz@suse.com
<ul><li><strong>Subject</strong> changed from <i>Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing</i> to <i>Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:M</i></li><li><strong>Status</strong> changed from <i>New</i> to <i>Workable</i></li></ul> openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4914582022-02-16T09:34:53Zlivdywanliv.dywan@suse.com
<ul><li><strong>Assignee</strong> set to <i>osukup</i></li></ul><p><a class="user active user-mention" href="https://progress.opensuse.org/users/21806">@osukup</a> is looking forward to taking it :-D</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4920102022-02-17T08:44:02Zosukup
<ul></ul><p>so ... the problems start:</p>
<ol>
<li>salt-minion with the default <code>/etc/salt/minion</code> refuses to connect to the master (it always tries localhost)
<ul><li>workaround: remove the minion config</li></ul></li>
<li>it has only one disk (3.5TB NVMe, formatted as a single btrfs volume), so <code>openqa-format_nvme.service</code> can't really work
<ul><li>workaround: systemctl edit ....</li></ul></li>
<li>it is connected via a 1Gbit interface (the server itself has 10G, but that is not connected anywhere) and the connection to NUE is too slow for any openQA job :(</li>
</ol>
<p>--> please move these two workers to NUE </p>
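<p>The "systemctl edit ...." workaround mentioned above could look roughly like the following sketch: a drop-in override that turns the format service into a no-op on a machine whose single NVMe is already partitioned. The unit name and the override content are assumptions for illustration, not taken from the actual hosts; the sketch writes to a temporary directory, while on a real host the drop-in would live under <code>/etc/systemd/system/</code>.</p>

```shell
# Hypothetical drop-in override: replace ExecStart with a no-op so the
# format service does not touch the single, already-partitioned NVMe.
# Written to a temp dir for illustration; on a real host the file would be
# /etc/systemd/system/openqa-format_nvme.service.d/override.conf
dropin_dir=$(mktemp -d)
cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
ExecStart=
ExecStart=/bin/true
EOF
cat "$dropin_dir/override.conf"
# on a real host, follow up with: systemctl daemon-reload
```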
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4920492022-02-17T09:57:29Zosukup
<ul></ul><p>openqaworker14 didn't survive a reboot - reinstalling :( - maybe a better disk layout would help?</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4922802022-02-17T11:53:41Zosukup
<ul></ul><p>so, after a "successful" <code>state.apply</code> ... the system goes to maintenance mode after reboot :(</p>
<p>culprit -> the openqa nvme prepare service ..</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4923702022-02-17T14:21:43Zosukup
<ul><li><strong>Status</strong> changed from <i>Workable</i> to <i>In Progress</i></li></ul> openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4926552022-02-18T04:11:19Zopenqa_reviewopenqa-review@suse.de
<ul><li><strong>Due date</strong> set to <i>2022-03-04</i></li></ul><p>Setting due date based on mean cycle time of SUSE QE Tools</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4928082022-02-18T10:36:29Zokurzokurz@suse.com
<ul></ul><p>As discussed: please set MAX_SETUP_TIME to something like 10h to account for the slow connection, change the worker class temporarily so as not to break production tests, and restart the tests that incompleted on this worker so that they end up on other machines.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4930242022-02-20T12:40:51Zokurzokurz@suse.com
<ul></ul><p>The latter suggestion was done with <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/388" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/388</a> . After the merge I called</p>
<pre><code>for i in openqaworker14 openqaworker15; do failed_since=2022-02-17 result="result='incomplete'" host=openqa.suse.de WORKER=$i bash -ex openqa-advanced-retrigger-jobs; done
</code></pre>
<p>I propose a higher cache limit as discussed also in one of our meetings last week: <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/389" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/389</a></p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4931442022-02-21T08:08:52Zokurzokurz@suse.com
<ul></ul><p>Currently the OSD deployment is failing because w14+w15 are not fully ready for deployment yet, so I temporarily removed them again from the deployment. <a class="user active user-mention" href="https://progress.opensuse.org/users/21806">@osukup</a> you can add them back when you continue here</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4943142022-02-23T10:44:16Zmkittlermarius.kittler@suse.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/494314/diff?detail_id=467470">diff</a>)</li></ul><p>I paused the corresponding host-up alerts.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4949022022-02-24T09:04:28Zosukup
<ul></ul><p>both hosts are deployed with salt, added to salt, and osd-deployment works.</p>
<p>Due to the difference between a standard openqaworker and these machines (only one NVMe disk), both were reinstalled with this disk layout:</p>
<pre><code>nvme0n1 259:0 0 3,5T 0 disk
├─nvme0n1p1 259:1 0 512M 0 part /boot/efi
├─nvme0n1p2 259:2 0 1T 0 part /var
└─nvme0n1p3 259:3 0 2,5T 0 part
</code></pre>
<p>nvme0n1p3 was set up during installation as a <em>raw</em> volume with the Linux RAID partition type.</p>
<p>After installation it was initialized with:</p>
<pre><code>mdadm --create /dev/md/openqa --level=0 --raid-devices=1 --force --run /dev/nvme0n1p3
mkfs.ext2 /dev/md/openqa
</code></pre>
<p>added this line to fstab:<br>
<code>/dev/md/openqa /var/lib/openqa/ ext2 defaults 0 0</code></p>
<p>created <code>/etc/mdadm.conf</code> file with:</p>
<pre><code>DEVICE containers partitions
ARRAY /dev/md/openqa name=openqaworker14:openqa
</code></pre>
<p>On openqaworker15 the name is of course <code>openqaworker15:openqa</code>.</p>
<p>Then run <code>dracut --force</code>.</p>
<p>With these steps the host can be rebooted without problems (without <em>mdadm.conf</em> the reboot ends in maintenance mode because mdadm doesn't create the link <code>/dev/md/openqa</code>).</p>
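<p>The per-host <code>mdadm.conf</code> can also be generated instead of hand-edited on each worker. A minimal sketch, assuming the short hostname matches the array name recorded in the RAID metadata (as it does for openqaworker14/15):</p>

```shell
# Generate the mdadm.conf content with the per-host array name.
# Assumption: the short hostname equals the name stored in the metadata.
host=${host:-$(hostname -s)}
conf=$(printf 'DEVICE containers partitions\nARRAY /dev/md/openqa name=%s:openqa\n' "$host")
echo "$conf"
# on a real host: echo "$conf" > /etc/mdadm.conf && dracut --force
```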
<p>The rest was done according to gitlab/openqa/openqa-salt-pillars, with these changes:</p>
<ol>
<li>remove /etc/salt/minion (with the config file present, salt-minion refused to connect to openqa.suse.de)</li>
<li>create <code>/etc/salt/grains</code> with <code>roles: worker</code></li>
</ol>
<p>Test run on openqaworker{14,15}: ran 100 jobs on both machines. The first job usually takes a pretty long time to set up the cache (average download speed is below 5MB/s), but the results are stable and without incomplete jobs.</p>
<p>During testing I found a problem with the websockets service: <a href="https://progress.opensuse.org/issues/107497" class="external">https://progress.opensuse.org/issues/107497</a></p>
<p><em>NEXT</em> -> set the <code>tap</code> class on the workers and ask <u>qa-sap</u> for multimachine test jobs on the workers</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4956762022-02-25T05:51:14Zokurzokurz@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed child" href="/issues/107497">action #107497</a>: [qe-tools] openqaworker14 (and openqa15) - developer mode fails size:M</i> added</li></ul> openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4958592022-02-25T09:07:27Zokurzokurz@suse.com
<ul></ul><p>Sounds good. It shouldn't be necessary to conduct this many manual steps. We already have machines that only have a single NVMe and no other storage device, e.g. openqaworker8. There we use the same salt states with the same NVMe setup scripts. As part of the salt states we also enable automatic updates with automatic reboots, so better check that reboots work consistently.</p>
<p>Regarding the bandwidth. Running on OSD <code>while true; do dd bs=100M count=20 if=/dev/zero | nc -l 42420; done</code> and doing <code>timeout 20 nc openqa.suse.de 42420 | dd of=/dev/null status=progress</code> locally on my DSL connected work notebook over VPN I get 5.0 MB/s. On openqaworker14 I get 2.5 MB/s. So the SUSE inter-office connection is 2x slower than my home connection. That does not sound right. I suggest we crosscheck what we can expect with SUSE-IT EngInfra. You can do that as part of this ticket or create a dedicated ticket for that because this would be tuning the performance, not the general possibility to use these machines.</p>
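<p>To put these rates into perspective, a rough transfer-time calculation for the measured speeds (the ~10 GB asset volume is an illustrative assumption, not a measurement from this ticket):</p>

```shell
# Back-of-the-envelope: minutes needed to download an assumed ~10 GB of
# assets at each measured rate (2.5 MB/s on ow14, 5.0 MB/s over home DSL,
# and the 20 MB/s seen later after the network fix).
for rate in 2.5 5.0 20.0; do
  awk -v r="$rate" 'BEGIN {
    size_mb = 10 * 1024                      # assumed asset volume in MB
    printf "%5.1f MB/s -> %6.1f min\n", r, size_mb / (r * 60)
  }'
done
```

At 2.5 MB/s this lands well over an hour per cold cache, which is why a generous MAX_SETUP_TIME was suggested earlier in the thread.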
<p>Please link progress tickets in the format <code>#&lt;id&gt;</code> so that a direct link with preview appears. Also, ideally link such tickets directly as "related" tickets.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4978932022-03-04T10:40:09Zokurzokurz@suse.com
<ul><li><strong>Due date</strong> changed from <i>2022-03-04</i> to <i>2022-03-11</i></li></ul><p>As discussed in the weekly osukup will contact EngInfra regarding network speed and continue with the salt states merge request regarding tty config.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=4981902022-03-07T09:57:37Zokurzokurz@suse.com
<ul><li><strong>Due date</strong> deleted (<del><i>2022-03-11</i></del>)</li><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Blocked</i></li></ul><p>Setting to blocked on the ticket you created: <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-79330" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-79330</a></p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5067772022-03-31T08:18:23Zosukup
<ul></ul><p>The bandwidth NUE-PRG looks better now: instead of 1-2MB/s we now see speeds of about 20MB/s.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5068312022-03-31T09:15:55Zosukup
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-6 priority-4 priority-default closed" href="/issues/109301">action #109301</a>: openqaworker14 + openqaworker15 sporadically get stuck on boot</i> added</li></ul> openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5072182022-04-01T11:13:40Zosukup
<ul></ul><p>As mentioned in <a class="issue tracker-4 status-6 priority-4 priority-default closed" title="action: openqaworker14 + openqaworker15 sporadically get stuck on boot (Rejected)" href="https://progress.opensuse.org/issues/109301">#109301</a> there are boot problems for these two workers.</p>
<p>-> the setup of <code>/dev/md/openqa</code> fails from time to time during boot ..</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5081012022-04-05T19:06:37Zokurzokurz@suse.com
<ul><li><strong>Status</strong> changed from <i>Blocked</i> to <i>Workable</i></li></ul><p>As commented in <a class="issue tracker-4 status-6 priority-4 priority-default closed" title="action: openqaworker14 + openqaworker15 sporadically get stuck on boot (Rejected)" href="https://progress.opensuse.org/issues/109301">#109301</a> this needs to be solved as part of this ticket as the two machines are not usable without. Please solve here.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5090672022-04-07T14:28:32Zokurzokurz@suse.com
<ul><li><strong>Parent task</strong> set to <i>#109659</i></li></ul> openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5105082022-04-12T13:12:10Zlivdywanliv.dywan@suse.com
<ul><li><strong>Due date</strong> set to <i>2022-04-15</i></li></ul><p>I suggest we discuss this in the unblock meeting since Ondřej is still fighting with some disk setup issues and there hasn't been much conversation or any update here.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5108832022-04-13T09:19:15Zokurzokurz@suse.com
<ul><li><strong>Due date</strong> changed from <i>2022-04-15</i> to <i>2022-04-29</i></li><li><strong>Assignee</strong> changed from <i>osukup</i> to <i>mkittler</i></li></ul><ul>
<li>The workers get stuck formatting the nvme
<ul>
<li>Error stopping the systemd service</li>
<li>osukup tried to reproduce the problems without rebooting, and everything works when the openQA NVMe format+preparation services are started on a running system. So the issue still needs to be reproduced via actual reboots</li>
<li>We have what should be the same setup on openqaworker8 (and 9) with just a single NVMe and no other devices. The services openqa_nvme_format+openqa_nvme_prepare on openqaworker8 work fine across reboots</li>
</ul></li>
</ul>
<p>Suggestion by okurz:</p>
<ul>
<li>mkittler tries reboot tests on openqaworker14+15, where the systems get stuck, to fix the race condition that osukup identified</li>
</ul>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5112212022-04-13T12:32:12Zmkittlermarius.kittler@suse.com
<ul></ul><p>I changed the kernel boot parameters so we'll get a tty/emergency shell and also some output via IPMI SOL.</p>
<p>This is how it looks in the error case:</p>
<pre><code>Apr 13 12:39:50 openqaworker14 systemd[1]: Starting Setup NVMe before mounting it...
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1537]: Current mount points (printed for debugging purposes):
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: devtmpfs on /dev type devtmpfs (rw,nosuid,noexec,size=4096k,nr_inodes=1048576,mode=755)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: tmpfs on /run type tmpfs (rw,nosuid,nodev,size=211033536k,nr_inodes=819200,mode=755)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on / type btrfs (rw,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=34905)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /.snapshots type btrfs (rw,relatime,ssd,space_cache,subvolid=266,subvol=/@/.snapshots)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p1 on /boot/efi type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,utf8,errors=remount-ro)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /home type btrfs (rw,relatime,ssd,space_cache,subvolid=263,subvol=/@/home)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /opt type btrfs (rw,relatime,ssd,space_cache,subvolid=262,subvol=/@/opt)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/tmp)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /boot/grub2/i386-pc type btrfs (rw,relatime,ssd,space_cache,subvolid=265,subvol=/@/boot/grub2/i386-pc)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /usr/local type btrfs (rw,relatime,ssd,space_cache,subvolid=258,subvol=/@/usr/local)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /root type btrfs (rw,relatime,ssd,space_cache,subvolid=261,subvol=/@/root)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /var type btrfs (rw,relatime,ssd,space_cache,subvolid=257,subvol=/@/var)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /boot/grub2/x86_64-efi type btrfs (rw,relatime,ssd,space_cache,subvolid=264,subvol=/@/boot/grub2/x86_64-efi)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1540]: /dev/nvme0n1p2 on /srv type btrfs (rw,relatime,ssd,space_cache,subvolid=260,subvol=/@/srv)
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1537]: Present block devices (printed for debugging purposes):
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]: NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]: nvme0n1 259:0 0 3.5T 0 disk
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]: ├─nvme0n1p1 259:1 0 512M 0 part /boot/efi
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]: ├─nvme0n1p2 259:2 0 1T 0 part /
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]: └─nvme0n1p3 259:3 0 2.5T 0 part
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1541]: └─md127 9:127 0 2.5T 0 raid0
Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1537]: Stopping current RAID "/dev/md/openqa"
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1542]: mdadm: stopped /dev/md/openqa
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1554]: Personalities : [raid0]
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1554]: unused devices: <none>
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1537]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1561]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1561]: level=raid0 devices=1 ctime=Wed Apr 13 12:04:21 2022
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1561]: mdadm: unexpected failure opening /dev/md127
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1537]: Unable to create RAID, mdadm returned with non-zero code
Apr 13 12:39:51 openqaworker14 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=1/FAILURE
Apr 13 12:39:51 openqaworker14 systemd[1]: openqa_nvme_format.service: Failed with result 'exit-code'.
Apr 13 12:39:51 openqaworker14 systemd[1]: Failed to start Setup NVMe before mounting it.
</code></pre>
<p>On ow15 it currently fails every time on a syntax error in <code>/var/local/bin/openqa-establish-nvme-setup</code>. I assume that's from previous debugging attempts.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5112452022-04-13T14:26:16Zmkittlermarius.kittler@suse.com
<ul></ul><p>I played around with <code>mdadm</code>:</p>
<pre><code>martchus@openqaworker14:~> mdadm_args=(--create /dev/md/openqa --level=0 --force --assume-clean)
martchus@openqaworker14:~> mdadm_args+=(--raid-devices=1 --run /dev/nvme0n1p3)
martchus@openqaworker14:~> sudo systemctl stop var-lib-openqa.mount
martchus@openqaworker14:~> sudo mdadm "${mdadm_args[@]}"
mdadm: cannot open /dev/nvme0n1p3: Device or resource busy
martchus@openqaworker14:~> sudo mdadm --stop /dev/md/openqa
mdadm: stopped /dev/md/openqa
martchus@openqaworker14:~> sudo mdadm "${mdadm_args[@]}"
mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
level=raid0 devices=1 ctime=Wed Apr 13 13:22:53 2022
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/openqa started.
martchus@openqaworker14:~> echo $?
0
martchus@openqaworker14:~> for i in {01..100}; do echo "try $i" && sudo mdadm --stop /dev/md/openqa && sudo mdadm "${mdadm_args[@]}" || break ; done
…
try 100
mdadm: stopped /dev/md/openqa
mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
level=raid0 devices=1 ctime=Wed Apr 13 16:28:00 2022
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/openqa started.
</code></pre>
<p>So the message <code>mdadm: /dev/nvme0n1p3 appears to be part of a raid array:</code> also appears in the good case of <code>mdadm: array /dev/md/openqa started.</code> and is not necessarily related to the error <code>mdadm: unexpected failure opening /dev/md127</code> we sometimes get. Unfortunately, the problem is not reproducible when the system is already up.</p>
<p>Considering</p>
<pre><code>Apr 13 12:39:50 openqaworker14 openqa-establish-nvme-setup[1537]: Stopping current RAID "/dev/md/openqa"
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1542]: mdadm: stopped /dev/md/openqa
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1554]: Personalities : [raid0]
Apr 13 12:39:51 openqaworker14 openqa-establish-nvme-setup[1554]: unused devices: <none>
</code></pre>
<p>we can also assume that the previous RAID has been stopped.</p>
<p>So the real question is what this "unexpected error" is. It is printed by <code>mdopen.c</code> when the preceding <code>open_dev_excl()</code> call fails, but unfortunately no further error details are propagated. Note that this function already has an internal retry with exponential backoff in case the error is <code>EBUSY</code>: it does 25 attempts with a sleep of <code>min(2^i, 200)</code> seconds, so it would wait about an hour¹. That should be more than enough to rule out <code>EBUSY</code>. We can also rule out that the device is already in use because that is checked beforehand (and it is unlikely that this changed in the meantime).</p>
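<p>The "about an hour" figure is quick to verify (same arithmetic as the JavaScript footnote at the end of this comment):</p>

```shell
# Sum the retry sleeps: 25 attempts, each sleeping min(2^i, 200) seconds.
# 2^0..2^7 contribute 255 s, the remaining 17 attempts are capped at 200 s.
total=0
for i in $(seq 0 24); do
  s=$(( 1 << i ))
  if [ "$s" -gt 200 ]; then s=200; fi
  total=$(( total + s ))
done
echo "$total seconds"   # 3655 s, i.e. roughly 1.02 hours
```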
<p>This still leaves any other error as a possibility. Internally the function calls <code>dev_open()</code>, which basically calls <code>open()</code> on the device path. So we might run into any of <a href="https://man.archlinux.org/man/open.2#ERRORS">https://man.archlinux.org/man/open.2#ERRORS</a>.</p>
<hr>
<p>¹</p>
<pre><code>[...Array(25).keys()].reduce((a, b) => a + Math.min(Math.pow(2, b), 200)) / 60 / 60
1.015
</code></pre> openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5112492022-04-13T15:40:42Zmkittlermarius.kittler@suse.com
<ul><li><strong>Status</strong> changed from <i>Workable</i> to <i>In Progress</i></li></ul><p>I installed a <a href="https://martchus.no-ip.biz/gitea/Martchus/mdadm/commit/1e67af6cf818d08a7ae889df9c93e02ead902795" class="external">custom build</a> of mdadm. Now we see the specific error message:</p>
<pre><code>Apr 13 17:35:16 openqaworker14 systemd[1]: Starting Setup NVMe before mounting it...
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Using mdadm from:
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1490]: /home/martchus/mdadm/mdadm
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Current mount points (printed for debugging purposes):
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: devtmpfs on /dev type devtmpfs (rw,nosuid,noexec,size=4096k,nr_inodes=1048576,mode=755)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: tmpfs on /run type tmpfs (rw,nosuid,nodev,size=211033536k,nr_inodes=819200,mode=755)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on / type btrfs (rw,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=57366)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /.snapshots type btrfs (rw,relatime,ssd,space_cache,subvolid=266,subvol=/@/.snapshots)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /boot/grub2/i386-pc type btrfs (rw,relatime,ssd,space_cache,subvolid=265,subvol=/@/boot/grub2/i386-pc)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /boot/grub2/x86_64-efi type btrfs (rw,relatime,ssd,space_cache,subvolid=264,subvol=/@/boot/grub2/x86_64-efi)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /home type btrfs (rw,relatime,ssd,space_cache,subvolid=263,subvol=/@/home)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /opt type btrfs (rw,relatime,ssd,space_cache,subvolid=262,subvol=/@/opt)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /srv type btrfs (rw,relatime,ssd,space_cache,subvolid=260,subvol=/@/srv)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /root type btrfs (rw,relatime,ssd,space_cache,subvolid=261,subvol=/@/root)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/tmp)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /usr/local type btrfs (rw,relatime,ssd,space_cache,subvolid=258,subvol=/@/usr/local)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1491]: /dev/nvme0n1p2 on /var type btrfs (rw,relatime,ssd,space_cache,subvolid=257,subvol=/@/var)
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Present block devices (printed for debugging purposes):
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]: NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]: nvme0n1 259:0 0 3.5T 0 disk
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]: ├─nvme0n1p1 259:1 0 512M 0 part
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]: ├─nvme0n1p2 259:2 0 1T 0 part /
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]: └─nvme0n1p3 259:3 0 2.5T 0 part
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1492]: └─md127 9:127 0 2.5T 0 raid0
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Stopping current RAID "/dev/md/openqa"
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1494]: mdadm: stopped /dev/md/openqa
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1521]: Personalities : [raid0]
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1521]: unused devices: <none>
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1532]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1532]: level=raid0 devices=1 ctime=Wed Apr 13 17:00:01 2022
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1532]: mdadm: unexpected failure opening /dev/md127: No such device or address
Apr 13 17:35:16 openqaworker14 openqa-establish-nvme-setup[1486]: Unable to create RAID, mdadm returned with non-zero code
Apr 13 17:35:16 openqaworker14 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=1/FAILURE
Apr 13 17:35:16 openqaworker14 systemd[1]: openqa_nvme_format.service: Failed with result 'exit-code'.
Apr 13 17:35:16 openqaworker14 systemd[1]: Failed to start Setup NVMe before mounting it.
</code></pre>
<p>So it apparently failed to create the device earlier, but that error was ignored, and we only get the error when trying to open the device.</p>
<p>Pressing Ctrl+D to leave the emergency shell helps in that situation. I'm not sure whether that was always the case, though. Assuming it is, we could implement a retry in the systemd service. It also means that my custom build of mdadm is generally working (and not just misleading us by introducing yet another problem; on ow15 my custom build also worked immediately after booting).</p>
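<p>The retry idea could be sketched as a small shell wrapper like the one below. This is only an illustration of the approach, not the actual <code>openqa-establish-nvme-setup</code> script, and the mdadm invocation shown in the comment is an assumed example of how it might be applied:</p>
<pre><code>#!/bin/sh
# Sketch of a retry loop as it could be added to the NVMe setup script.
# Illustration only; the real script and its wording may differ.

# retry MAX_ATTEMPTS DELAY CMD [ARGS...]: run CMD until it succeeds,
# waiting DELAY seconds between attempts, giving up after MAX_ATTEMPTS.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Giving up after $attempt attempts" >&2
            return 1
        fi
        echo "Attempt $attempt of $max_attempts failed, waiting ${delay}s before trying again" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Hypothetical usage for the RAID creation step (device and options assumed):
# retry 10 10 mdadm --create /dev/md/openqa --level=0 --force --raid-devices=1 --run /dev/nvme0n1p3
</code></pre>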
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5114402022-04-14T11:39:45Zmkittlermarius.kittler@suse.com
<ul></ul><p>I've added a retry to our script, for which I'll create an SR. I only had to do one restart to see it already making a difference:</p>
<pre><code>Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /boot/grub2/i386-pc type btrfs (rw,relatime,ssd,space_cache,subvolid=265,subvol=/@/boot/grub2/i386-pc)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /.snapshots type btrfs (rw,relatime,ssd,space_cache,subvolid=266,subvol=/@/.snapshots)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /boot/grub2/x86_64-efi type btrfs (rw,relatime,ssd,space_cache,subvolid=264,subvol=/@/boot/grub2/x86_64-efi)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /home type btrfs (rw,relatime,ssd,space_cache,subvolid=263,subvol=/@/home)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /opt type btrfs (rw,relatime,ssd,space_cache,subvolid=262,subvol=/@/opt)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /root type btrfs (rw,relatime,ssd,space_cache,subvolid=261,subvol=/@/root)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /usr/local type btrfs (rw,relatime,ssd,space_cache,subvolid=258,subvol=/@/usr/local)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /srv type btrfs (rw,relatime,ssd,space_cache,subvolid=260,subvol=/@/srv)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/tmp)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1519]: /dev/nvme0n1p2 on /var type btrfs (rw,relatime,ssd,space_cache,subvolid=257,subvol=/@/var)
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1514]: Present block devices (printed for debugging purposes):
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]: NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]: nvme0n1 259:0 0 3.5T 0 disk
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]: ├─nvme0n1p1 259:1 0 512M 0 part
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]: ├─nvme0n1p2 259:2 0 1T 0 part /
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]: └─nvme0n1p3 259:3 0 2.5T 0 part
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1520]: └─md127 9:127 0 2.5T 0 raid0
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1514]: Stopping current RAID "/dev/md/openqa"
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1522]: mdadm: stopped /dev/md/openqa
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1529]: Personalities : [raid0]
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1529]: unused devices: <none>
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1514]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1537]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1537]: level=raid0 devices=1 ctime=Wed Apr 13 17:35:48 2022
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1537]: mdadm: unexpected failure opening /dev/md127: No such device or address
Apr 14 13:34:37 openqaworker14 openqa-establish-nvme-setup[1514]: Unexpected error opening device, waiting 10 before trying again (as retrying usually helps).
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1514]: Trying RAID0 creation again after timeout (attempt 2 of 10)
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1632]: Personalities : [raid0]
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1632]: unused devices: <none>
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1514]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1639]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1639]: level=raid0 devices=1 ctime=Wed Apr 13 17:35:48 2022
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1639]: mdadm: Defaulting to version 1.2 metadata
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1639]: mdadm: array /dev/md/openqa started.
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1514]: Status for RAID0 "/dev/md/openqa"
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1658]: md127 : active raid0 nvme0n1p3[0]
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1660]: ARRAY /dev/md/openqa metadata=1.2 name=openqaworker14:openqa UUID=0669a47c:5201d4f3:f7f5420a:c0f7f4a7
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1514]: Creating ext2 filesystem on RAID0 "/dev/md/openqa"
Apr 14 13:34:47 openqaworker14 openqa-establish-nvme-setup[1662]: mke2fs 1.43.8 (1-Jan-2018)
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: [258B blob data]
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: Creating filesystem with 669084672 4k blocks and 167272448 inodes
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: Filesystem UUID: 8ae30c10-6d08-44a5-8913-85bf60940c17
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: Superblock backups stored on blocks:
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: 102400000, 214990848, 512000000, 550731776, 644972544
Apr 14 13:34:51 openqaworker14 openqa-establish-nvme-setup[1662]: [73B blob data]
Apr 14 13:35:05 openqaworker14 openqa-establish-nvme-setup[1662]: [378B blob data]
Apr 14 13:35:17 openqaworker14 openqa-establish-nvme-setup[1662]: [107B blob data]
Apr 14 13:35:17 openqaworker14 systemd[1]: openqa_nvme_format.service: Succeeded.
Apr 14 13:35:17 openqaworker14 systemd[1]: Finished Setup NVMe before mounting it.
</code></pre>
<p>Whether it'll help in all cases I don't know, but it is certainly an improvement, so I'll try to reboot a few more times.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5114412022-04-14T11:49:21Zmkittlermarius.kittler@suse.com
<ul></ul><p>SR: <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/671" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/671</a></p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5114442022-04-14T11:52:20Zmkittlermarius.kittler@suse.com
<ul></ul><p>And again:</p>
<pre><code>Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1564]: Present block devices (printed for debugging purposes):
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]: NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]: nvme0n1 259:0 0 3.5T 0 disk
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]: ├─nvme0n1p1 259:1 0 512M 0 part
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]: ├─nvme0n1p2 259:2 0 1T 0 part /
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]: └─nvme0n1p3 259:3 0 2.5T 0 part
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1570]: └─md127 9:127 0 2.5T 0 raid0
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1564]: Stopping current RAID "/dev/md/openqa"
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1571]: mdadm: stopped /dev/md/openqa
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1582]: Personalities : [raid0]
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1582]: unused devices: <none>
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1564]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1590]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1590]: level=raid0 devices=1 ctime=Thu Apr 14 13:34:47 2022
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1590]: mdadm: unexpected failure opening /dev/md127: No such device or address
Apr 14 13:46:45 openqaworker14 openqa-establish-nvme-setup[1564]: Unexpected error opening device, waiting 10 before trying again (as retrying usually helps).
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1564]: Trying RAID0 creation again after timeout (attempt 2 of 10)
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1618]: Personalities : [raid0]
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1618]: unused devices: <none>
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1564]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1625]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1625]: level=raid0 devices=1 ctime=Thu Apr 14 13:34:47 2022
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1625]: mdadm: Defaulting to version 1.2 metadata
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1625]: mdadm: array /dev/md/openqa started.
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1564]: Status for RAID0 "/dev/md/openqa"
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1644]: md127 : active raid0 nvme0n1p3[0]
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1646]: ARRAY /dev/md/openqa metadata=1.2 name=openqaworker14:openqa UUID=1dc43f3d:fbecb084:33ff1543:61f84fb8
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1564]: Creating ext2 filesystem on RAID0 "/dev/md/openqa"
Apr 14 13:46:55 openqaworker14 openqa-establish-nvme-setup[1648]: mke2fs 1.43.8 (1-Jan-2018)
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: [220B blob data]
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: Creating filesystem with 669084672 4k blocks and 167272448 inodes
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: Filesystem UUID: b7ec8671-c15f-4887-9b78-157c1aa21715
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: Superblock backups stored on blocks:
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: 102400000, 214990848, 512000000, 550731776, 644972544
Apr 14 13:46:58 openqaworker14 openqa-establish-nvme-setup[1648]: [73B blob data]
Apr 14 13:47:13 openqaworker14 openqa-establish-nvme-setup[1648]: [400B blob data]
Apr 14 13:47:25 openqaworker14 openqa-establish-nvme-setup[1648]: [107B blob data]
Apr 14 13:47:25 openqaworker14 systemd[1]: openqa_nvme_format.service: Succeeded.
Apr 14 13:47:25 openqaworker14 systemd[1]: Finished Setup NVMe before mounting it.
</code></pre>
<p>And again:</p>
<pre><code>Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1505]: Present block devices (printed for debugging purposes):
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]: NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]: nvme0n1 259:0 0 3.5T 0 disk
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]: ├─nvme0n1p1 259:1 0 512M 0 part
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]: ├─nvme0n1p2 259:2 0 1T 0 part /
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]: └─nvme0n1p3 259:3 0 2.5T 0 part
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1508]: └─md127 9:127 0 2.5T 0 raid0
Apr 14 13:48:02 openqaworker15 openqa-establish-nvme-setup[1505]: Stopping current RAID "/dev/md/openqa"
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1509]: mdadm: stopped /dev/md/openqa
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1505]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1562]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1562]: level=raid0 devices=1 ctime=Wed Apr 13 17:53:38 2022
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1562]: mdadm: unexpected failure opening /dev/md127: No such device or address
Apr 14 13:48:03 openqaworker15 openqa-establish-nvme-setup[1505]: Unexpected error opening device, waiting 10 before trying again (as retrying usually helps).
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1505]: Trying RAID0 creation again after timeout (attempt 2 of 10)
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1505]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1654]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1654]: level=raid0 devices=1 ctime=Wed Apr 13 17:53:38 2022
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1654]: mdadm: Defaulting to version 1.2 metadata
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1654]: mdadm: array /dev/md/openqa started.
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1505]: Status for RAID0 "/dev/md/openqa"
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1673]: md127 : active raid0 nvme0n1p3[0]
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1675]: ARRAY /dev/md/openqa metadata=1.2 name=openqaworker15:openqa UUID=e2f5086b:edbf62a8:14bb3a8a:b575f4d3
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1505]: Creating ext2 filesystem on RAID0 "/dev/md/openqa"
Apr 14 13:48:13 openqaworker15 openqa-establish-nvme-setup[1676]: mke2fs 1.43.8 (1-Jan-2018)
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: [220B blob data]
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: Creating filesystem with 669084672 4k blocks and 167272448 inodes
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: Filesystem UUID: d1e04ad2-cbc3-4356-bd70-36d16a9eef7e
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: Superblock backups stored on blocks:
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: 102400000, 214990848, 512000000, 550731776, 644972544
Apr 14 13:48:16 openqaworker15 openqa-establish-nvme-setup[1676]: [73B blob data]
Apr 14 13:48:31 openqaworker15 openqa-establish-nvme-setup[1676]: [400B blob data]
Apr 14 13:48:43 openqaworker15 openqa-establish-nvme-setup[1676]: [107B blob data]
Apr 14 13:48:43 openqaworker15 systemd[1]: openqa_nvme_format.service: Succeeded.
Apr 14 13:48:43 openqaworker15 systemd[1]: Finished Setup NVMe before mounting it.
</code></pre>
<p>So I'm quite sure the retry helps, and likely that's also all we need.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5114652022-04-14T12:50:09Zmkittlermarius.kittler@suse.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul><p>I did 8 more reboots (4 on each machine). I encountered 5 more good cases (no retry required) and 3 more handled bad cases (retry required, but it helped), and no more unhandled bad cases. Interestingly, all remaining bad cases happened on ow15, so it seems this host is more affected, but it could also just be a coincidence.</p>
<p>When <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/671" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/671</a> has been merged I can bring the workers back into salt (AC2) and remove possible workarounds:</p>
<ul>
<li>Resume alerts (AC4)</li>
<li>Run production jobs (AC1): <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/404" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/404</a></li>
</ul>
<p>Not sure whether it makes sense to rush this today because I'll only be back on Wednesday and some closer monitoring would be good (AC3).</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5116752022-04-17T12:50:20Zokurzokurz@suse.com
<ul></ul><p>mkittler wrote:</p>
<blockquote>
<p>I did 8 more reboots (4 on each machine). I encountered 5 more good cases (no retry required) and 3 more handled bad cases (retry required, but it helped), and no more unhandled bad cases. Interestingly, all remaining bad cases happened on ow15, so it seems this host is more affected, but it could also just be a coincidence.</p>
<p>When <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/671" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/671</a> has been merged …</p>
</blockquote>
<p>Done.</p>
<blockquote>
<ul>
<li>Run production jobs (AC1): <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/404" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/404</a></li>
</ul>
</blockquote>
<p>I guess you mean <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/405" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/405</a><br>
Before we enable this, we should have sufficient tests showing that the workers are stable. Please trigger openQA tests using the existing worker class settings; that's why we have them :)</p>
<blockquote>
<p>Not sure whether it makes sense to rush this today because I'll only be back on Wednesday and some closer monitoring would be good (AC3).</p>
</blockquote>
<p>Yes, no need to rush.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5123082022-04-20T12:41:46Zmkittlermarius.kittler@suse.com
<ul></ul><p>I've been re-triggering a variety¹ of jobs: <a href="https://openqa.suse.de/tests/overview?build=test-ow14" class="external">https://openqa.suse.de/tests/overview?build=test-ow14</a> <a href="https://openqa.suse.de/tests/overview?build=test-ow15" class="external">https://openqa.suse.de/tests/overview?build=test-ow15</a></p>
<p>¹ They're simply the 100 most recently passed qemu x86_64 jobs on OSD without parallel dependencies.</p>
<p>Looks like I messed up the worker class, so jobs on both overviews might have been started on any of the workers.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5123552022-04-20T14:14:47Zmkittlermarius.kittler@suse.com
<ul></ul><p>All jobs incompleted early, likely due to an outdated os-autoinst version. Considering the workers are not in salt that's expected, so I'll update them manually for now and re-trigger the jobs (this time with the correct worker classes).</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5123742022-04-20T16:49:46Zmkittlermarius.kittler@suse.com
<ul></ul><p>The tap setup isn't working yet but it generally doesn't look too bad:</p>
<pre><code>openqa=> select jobs.id, job_settings.value as worker, jobs.result, jobs.reason from jobs join job_settings on jobs.id = job_settings.job_id where state = 'done' and not (result = 'passed' or result = 'softfailed') and job_settings.key = 'WORKER_CLASS' and (job_settings.value = 'openqaworker14' or job_settings.value = 'openqaworker15') and t_started >= '2022-04-20' order by id desc;
id | worker | result | reason
---------+----------------+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
8582778 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 25' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582777 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap11 18' failed: 'tap11' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582776 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap4 47' failed: 'tap4' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582774 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 25' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582772 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap11 18' failed: 'tap11' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582770 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 18' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582768 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 18' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582761 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap8 58' failed: 'tap8' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582760 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 18' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582759 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap14 18' failed: 'tap14' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582757 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap9 18' failed: 'tap9' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582755 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap8 58' failed: 'tap8' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582753 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap14 25' failed: 'tap14' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582751 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap9 18' failed: 'tap9' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582749 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap8 58' failed: 'tap8' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582747 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap14 60' failed: 'tap14' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582739 | openqaworker14 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap8 18' failed: 'tap8' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
8582613 | openqaworker15 | incomplete | backend died: Open vSwitch command 'set_vlan' with arguments 'tap15 60' failed: 'tap15' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
(18 rows)
</code></pre>
<p>I also found some older jobs failing with <code>timeout_exceeded</code> (reason: <code>timeout: test execution exceeded MAX_JOB_TIME</code>) when running some other SQL queries. Note that these jobs simply timed out because they were waiting for a parallel job, so they are likely also just a symptom of the broken tap setup (and not of slow downloads).</p>
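<p>The <code>set_vlan</code> failures above all come down to the tap device not being attached to <code>br1</code>; a quick manual check is <code>ovs-vsctl list-ports br1</code>. A minimal sketch of that check in Python (the helper names are mine, not from os-autoinst):</p>

```python
import subprocess

def ports_of_bridge(bridge, output=None):
    """Return the set of ports attached to an Open vSwitch bridge.

    If `output` is given it is parsed instead of invoking ovs-vsctl,
    which keeps the parsing logic testable without a running OVS."""
    if output is None:
        output = subprocess.run(
            ["ovs-vsctl", "list-ports", bridge],
            check=True, capture_output=True, text=True,
        ).stdout
    return {line.strip() for line in output.splitlines() if line.strip()}

def tap_connected(tap, bridge="br1", output=None):
    """True if the tap device shows up among the bridge's ports."""
    return tap in ports_of_bridge(bridge, output)
```

<p>For example, <code>tap_connected("tap15")</code> checks roughly the precondition the backend error message complains about.</p>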
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5126382022-04-21T13:13:55Zmkittlermarius.kittler@suse.com
<ul></ul><p>The tap setup wasn't really broken after all. I had manually increased the number of worker slots (to 16) so the 100 jobs would be processed quicker and more like we'd eventually run them in production. When reducing the number of worker slots back to what's configured in salt, the tap jobs work (see <a href="https://openqa.suse.de/tests/8587752#dependencies" class="external">https://openqa.suse.de/tests/8587752#dependencies</a> and <a href="https://openqa.suse.de/tests/8587756#dependencies" class="external">https://openqa.suse.de/tests/8587756#dependencies</a>).</p>
<p>So I suppose <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/405" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/405</a> can be merged. (It will also increase the number of worker slots, but since the change goes through salt, salt will take care of extending the tap config accordingly.)</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5126472022-04-21T13:49:06Zokurzokurz@suse.com
<ul></ul><p>openQA verification jobs look nice, <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/405" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/405</a> merged</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5126502022-04-21T13:54:13Zokurzokurz@suse.com
<ul></ul><p>I have unpaused the "host up" alert for both machines as it looks green right now.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5126712022-04-21T15:11:18Zmkittlermarius.kittler@suse.com
<ul></ul><p>I added the machines back to salt.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5129822022-04-22T12:52:48Zmkittlermarius.kittler@suse.com
<ul></ul><p>Salt is unfortunately broken on ow14 and 15. It always tries to connect to 127.0.0.1 instead of OSD. Setting <code>master</code> in <code>/etc/salt/minion</code> to OSD's IP doesn't help. Hacking around the issue via <code>martchus@openqaworker14:~> ssh -L "4506:localhost:4506" -N -f openqa.suse.de</code> lets it at least authenticate, but the Minion still appears unresponsive from the master's side.</p>
<p>Note that on a non-broken worker, e.g. openqaworker5 the minion log looks like this:</p>
<pre><code>[DEBUG ] Connecting to master. Attempt 1 of 1
[DEBUG ] Error while parsing IPv4 address: salt
[DEBUG ] Expected 4 octets in 'salt'
[DEBUG ] Error while parsing IPv6 address: salt
[DEBUG ] At least 3 parts expected in 'salt'
[DEBUG ] "salt" Not an IP address? Assuming it is a hostname.
[DEBUG ] Master URI: tcp://10.160.0.207:4506
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'openqaworker5.suse.de', 'tcp://10.160.0.207:4506')
[DEBUG ] Generated random reconnect delay between '1000ms' and '11000ms' (2303)
[DEBUG ] Setting zmq_reconnect_ivl to '2303ms'
[DEBUG ] Setting zmq_reconnect_ivl_max to '11000ms'
[TRACE ] IPCServer: Handling connection to address:
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'openqaworker5.suse.de', 'tcp://10.160.0.207:4506', 'clear')
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://10.160.0.207:4506
[DEBUG ] Trying to connect to: tcp://10.160.0.207:4506
[TRACE ] Inserted key into loop_instance_map id 140315043651088 for key ('/etc/salt/pki/minion', 'openqaworker5.suse.de', 'tcp://10.160.0.207:4506', 'clear') and process 15063
[DEBUG ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG ] Decrypting the current master AES key
[DEBUG ] salt.crypt.get_rsa_key: Loading private key
[DEBUG ] salt.crypt._get_key_with_evict: Loading private key
[DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG ] salt.crypt.verify_signature: Loading public key
[DEBUG ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG ] salt.crypt.verify_signature: Verifying signature
[DEBUG ] <salt.crypt.AsyncAuth object at 0x7f9da5715d68> Got new master aes key.
[DEBUG ] Closing AsyncZeroMQReqChannel instance
[DEBUG ] Connecting the Minion to the Master publish port, using the URI: tcp://10.160.0.207:4505
[DEBUG ] salt.crypt.get_rsa_key: Loading private key
[DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'openqaworker5.suse.de', 'tcp://10.160.0.207:4506', 'aes')
[DEBUG ] Re-using AsyncAuth for ('/etc/salt/pki/minion', 'openqaworker5.suse.de', 'tcp://10.160.0.207:4506')
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://10.160.0.207:4506
</code></pre>
<p>On ow14 and 15 it looks very similar, but OSD's IP is replaced with 127.0.0.1:</p>
<pre><code>[DEBUG ] "salt" Not an IP address? Assuming it is a hostname.
…
…
[DEBUG ] Trying to connect to: tcp://127.0.0.1:4506
[DEBUG ] Closing AsyncZeroMQReqChannel instance
[ERROR ] Error while bringing up minion for multi-master. Is master at 10.160.0.207 responding?
[DEBUG ] Connecting to master. Attempt 1 of 1
[DEBUG ] Master URI: tcp://127.0.0.1:4506
…
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://127.0.0.1:4506
[DEBUG ] SaltReqTimeoutError, retrying. (1/7)
</code></pre>
<p>Curiously, it still says "Is master at 10.160.0.207 responding?" as if the correct host were generally configured but then somehow not passed down to some underlying function.</p>
<p>The salt-minion version is the same as on other hosts.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5129852022-04-22T13:46:46Zmkittlermarius.kittler@suse.com
<ul></ul><p>If I also do <code>martchus@openqaworker14:~> ssh -L "4505:localhost:4505" -N -f -l martchus openqa.suse.de</code> (and not just for port 4506) I get a little bit further:</p>
<pre><code>martchus@openqa:~> sudo salt openqaworker14.qa.suse.cz state.apply
openqaworker14.qa.suse.cz:
----------
ID: states
Function: no.None
Result: False
Comment: No Top file or master_tops data matches found. Please see master log for details.
Changes:
Summary for openqaworker14.qa.suse.cz
------------
Succeeded: 0
Failed: 1
------------
Total states run: 1
Total run time: 0.000 ms
ERROR: Minions returned with non-zero exit code
</code></pre>
<p>The Minion is then responding but shows this error message. Something must be wrongly configured on ow14 and 15, but all the config looks the same as on ow5, which has no issues. So I'm out of ideas.</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5133962022-04-25T09:02:03Zmkittlermarius.kittler@suse.com
<ul></ul><p>Looks like somebody put <code>file_client: local</code> in the config, which I had missed on Friday. It works after removing that. Now the worker class change to run normal production jobs is effective. Salt also extended the tap setup as expected:</p>
<pre><code>----------
ID: /etc/sysconfig/network/ifcfg-br1
Function: file.managed
Result: True
Comment: File /etc/sysconfig/network/ifcfg-br1 updated
Started: 10:57:33.346842
Duration: 13.948 ms
Changes:
----------
diff:
---
+++
@@ -14,5 +14,40 @@
OVS_BRIDGE_PORT_DEVICE_3='tap3'
OVS_BRIDGE_PORT_DEVICE_67='tap67'
OVS_BRIDGE_PORT_DEVICE_131='tap131'
+OVS_BRIDGE_PORT_DEVICE_4='tap4'
+OVS_BRIDGE_PORT_DEVICE_68='tap68'
+OVS_BRIDGE_PORT_DEVICE_132='tap132'
+OVS_BRIDGE_PORT_DEVICE_5='tap5'
+OVS_BRIDGE_PORT_DEVICE_69='tap69'
+OVS_BRIDGE_PORT_DEVICE_133='tap133'
+OVS_BRIDGE_PORT_DEVICE_6='tap6'
+OVS_BRIDGE_PORT_DEVICE_70='tap70'
+OVS_BRIDGE_PORT_DEVICE_134='tap134'
+OVS_BRIDGE_PORT_DEVICE_7='tap7'
+OVS_BRIDGE_PORT_DEVICE_71='tap71'
+OVS_BRIDGE_PORT_DEVICE_135='tap135'
+OVS_BRIDGE_PORT_DEVICE_8='tap8'
+OVS_BRIDGE_PORT_DEVICE_72='tap72'
+OVS_BRIDGE_PORT_DEVICE_136='tap136'
+OVS_BRIDGE_PORT_DEVICE_9='tap9'
+OVS_BRIDGE_PORT_DEVICE_73='tap73'
+OVS_BRIDGE_PORT_DEVICE_137='tap137'
+OVS_BRIDGE_PORT_DEVICE_10='tap10'
+OVS_BRIDGE_PORT_DEVICE_74='tap74'
+OVS_BRIDGE_PORT_DEVICE_138='tap138'
+OVS_BRIDGE_PORT_DEVICE_11='tap11'
+OVS_BRIDGE_PORT_DEVICE_75='tap75'
+OVS_BRIDGE_PORT_DEVICE_139='tap139'
+OVS_BRIDGE_PORT_DEVICE_12='tap12'
+OVS_BRIDGE_PORT_DEVICE_76='tap76'
+OVS_BRIDGE_PORT_DEVICE_140='tap140'
+OVS_BRIDGE_PORT_DEVICE_13='tap13'
+OVS_BRIDGE_PORT_DEVICE_77='tap77'
+OVS_BRIDGE_PORT_DEVICE_141='tap141'
+OVS_BRIDGE_PORT_DEVICE_14='tap14'
+OVS_BRIDGE_PORT_DEVICE_78='tap78'
+OVS_BRIDGE_PORT_DEVICE_142='tap142'
+OVS_BRIDGE_PORT_DEVICE_15='tap15'
+OVS_BRIDGE_PORT_DEVICE_79='tap79'
+OVS_BRIDGE_PORT_DEVICE_143='tap143'
PRE_UP_SCRIPT="wicked:gre_tunnel_preup.sh"
-ZONE=trusted
mode:
0644
</code></pre>
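<p>The added lines follow a regular pattern: worker slot <code>i</code> gets three tap devices, <code>tap{i}</code>, <code>tap{i+64}</code> and <code>tap{i+128}</code>. A sketch reproducing how salt appears to render these entries (the helper names are mine, and the stride-of-64 reading is inferred from the diff above):</p>

```python
def tap_devices(slot, networks=3, stride=64):
    """Tap device names for one worker slot, following the pattern
    visible in the ifcfg-br1 diff: slot i gets tap{i}, tap{i+64}, tap{i+128}."""
    return [f"tap{slot + n * stride}" for n in range(networks)]

def ifcfg_lines(first_slot, last_slot):
    """OVS_BRIDGE_PORT_DEVICE_* lines as they appear in ifcfg-br1;
    the variable index matches the tap number."""
    return [
        f"OVS_BRIDGE_PORT_DEVICE_{slot + n * 64}='tap{slot + n * 64}'"
        for slot in range(first_slot, last_slot + 1)
        for n in range(3)
    ]
```

<p>For slots 4 through 15 this yields exactly the 36 lines added by the diff.</p>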
<p>I'll check the incomplete/failure rate of the workers in a few days. If everything looks good the ticket can then be resolved (as all ACs are then covered).</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5137212022-04-26T10:10:33Zmkittlermarius.kittler@suse.com
<ul></ul><p>So far, the fail/incomplete ratio looks ok to me compared to other worker hosts:</p>
<pre><code>openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed' or result='incomplete') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where t_finished >= '2022-04-25 11:00' group by host order by ratio_failed_by_host desc;
host | ratio_failed_by_host | total
---------------------+----------------------+-------
openqaworker-arm-4 | 88.1 | 168
openqaworker2 | 64.09 | 362
openqaworker-arm-3 | 33.94 | 327
openqaworker14 | 30.92 | 414
malbec | 26.6 | 94
powerqaworker-qam-1 | 25.93 | 135
openqaworker-arm-2 | 25.6 | 336
QA-Power8-5-kvm | 24.36 | 156
openqaworker-arm-1 | 24.19 | 186
grenache-1 | 23.63 | 787
openqaworker15 | 20 | 405
QA-Power8-4-kvm | 18.9 | 164
openqaworker5 | 16.8 | 899
openqaworker6 | 16.22 | 709
openqaworker3 | 11.98 | 459
openqaworker9 | 11.24 | 347
openqaworker10 | 9.09 | 209
openqaworker13 | 8.5 | 341
openqaworker8 | 6.67 | 330
openqa-piworker | 0 | 7
| 0 | 325
(21 rows)
</code></pre>
<p>Note that ow15 is 10 percentage points better although it is the very same hardware using the very same connection. So I suppose that's the margin we have to consider anyway.</p>
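<p>For cross-checking the SQL, the same per-host aggregation can be sketched outside the database (the function name is mine):</p>

```python
def fail_ratio_by_host(jobs, bad=("failed", "incomplete")):
    """Per-host fail/incomplete percentage, mirroring the SQL above.

    `jobs` is an iterable of (host, result) pairs for finished jobs."""
    totals, failed = {}, {}
    for host, result in jobs:
        totals[host] = totals.get(host, 0) + 1
        if result in bad:
            failed[host] = failed.get(host, 0) + 1
    return {
        host: round(100 * failed.get(host, 0) / total, 2)
        for host, total in totals.items()
    }
```

<p>This makes it easy to replay the comparison on an exported job list, e.g. with different result filters.</p>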
<p>I also haven't seen any jobs running into timeouts (the setup timeout was a concern due to the workers' connection from Prague):</p>
<pre><code>openqa=> select id, t_finished, result, reason from jobs where (select host from workers where id = assigned_worker_id) similar to 'openqaworker(14|15)' and (result = 'incomplete' or result = 'failed') and reason like '%timeout%' and t_finished >= '2022-04-25 11:00' order by t_finished;
id | t_finished | result | reason
----+------------+--------+--------
(0 rows)
</code></pre> openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5137242022-04-26T10:15:58Zmkittlermarius.kittler@suse.com
<ul></ul><p>I've noticed that there are lots of jobs that failed merely due to syntax errors. When ignoring those failures for the sake of this investigation, the figures change considerably:</p>
<pre><code>openqa=> with finished as (select result, reason, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where (result='failed' or result='incomplete') and reason not like '%main%syntax error%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where t_finished >= '2022-04-25 11:00' group by host order by ratio_failed_by_host desc;
host | ratio_failed_by_host | total
---------------------+----------------------+-------
openqaworker2 | 35.89 | 404
openqaworker-arm-3 | 16.72 | 329
openqaworker-arm-2 | 12.02 | 341
openqaworker-arm-1 | 9.47 | 190
grenache-1 | 6.1 | 787
openqaworker5 | 3.74 | 908
powerqaworker-qam-1 | 2.96 | 135
QA-Power8-5-kvm | 2.56 | 156
openqaworker13 | 2.33 | 343
QA-Power8-4-kvm | 1.83 | 164
openqaworker-arm-4 | 1.19 | 168
openqaworker10 | 0.94 | 212
openqaworker6 | 0.7 | 715
openqaworker9 | 0.57 | 351
openqaworker14 | 0.48 | 421
openqaworker8 | 0.3 | 332
openqaworker15 | 0.24 | 412
openqaworker3 | 0.21 | 466
| 0 | 325
openqa-piworker | 0 | 7
malbec | 0 | 95
(21 rows)
</code></pre> openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5140242022-04-27T09:59:32Zmkittlermarius.kittler@suse.com
<ul></ul><p>The fail/incomplete ratio on 14/15 still looks ok:</p>
<pre><code>openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed' or result='incomplete') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where t_finished >= '2022-04-25 11:00' group by host order by ratio_failed_by_host desc;
host | ratio_failed_by_host | total
---------------------+----------------------+-------
openqaworker2 | 78.41 | 1371
openqaworker-arm-4 | 61.29 | 403
openqaworker14 | 32.14 | 837
grenache-1 | 31.72 | 1854
openqaworker-arm-3 | 26.83 | 779
openqaworker-arm-1 | 20.71 | 449
openqaworker15 | 20.27 | 824
openqaworker-arm-2 | 18.63 | 714
QA-Power8-5-kvm | 18.53 | 421
powerqaworker-qam-1 | 16.04 | 424
malbec | 15.19 | 237
openqaworker5 | 14.97 | 1637
openqaworker13 | 14.11 | 659
openqaworker6 | 13.92 | 1329
QA-Power8-4-kvm | 13.4 | 530
openqaworker9 | 12.41 | 709
openqaworker3 | 10.78 | 872
openqaworker10 | 10.55 | 455
openqaworker8 | 9.03 | 709
openqa-piworker | 0 | 7
| 0 | 621
(21 rows)
</code></pre>
<p>Excluding syntax errors:</p>
<pre><code>openqa=> with finished as (select result, reason, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where (result='failed' or result='incomplete') and reason not like '%main%syntax error%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where t_finished >= '2022-04-25 11:00' group by host order by ratio_failed_by_host desc;
host | ratio_failed_by_host | total
---------------------+----------------------+-------
openqaworker2 | 50.11 | 1371
openqaworker-arm-3 | 13.61 | 779
openqaworker-arm-1 | 8.91 | 449
openqaworker-arm-2 | 5.74 | 714
openqaworker-arm-4 | 5.46 | 403
powerqaworker-qam-1 | 4.25 | 424
QA-Power8-5-kvm | 4.04 | 421
grenache-1 | 3.83 | 1853
openqaworker13 | 3.34 | 659
QA-Power8-4-kvm | 2.65 | 529
openqaworker5 | 2.63 | 1636
openqaworker10 | 2.2 | 455
openqaworker9 | 1.41 | 709
malbec | 1.27 | 237
openqaworker14 | 1.08 | 837
openqaworker8 | 0.85 | 709
openqaworker6 | 0.6 | 1329
openqaworker15 | 0.36 | 824
openqaworker3 | 0.34 | 872
openqa-piworker | 0 | 7
| 0 | 621
(21 rows)
</code></pre> openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5140772022-04-27T13:36:51Zmkittlermarius.kittler@suse.com
<ul></ul><ol>
<li>Maybe we need to remove the tap worker class: <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/407" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/407</a> - see <a href="https://suse.slack.com/archives/C02CANHLANP/p1651052375642909" class="external">https://suse.slack.com/archives/C02CANHLANP/p1651052375642909</a></li>
<li>Maybe we can add the <code>qemu_x86_64-large-mem</code> worker class because those workers have 1 TiB RAM: <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/408" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/408</a></li>
</ol>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5143622022-04-28T10:48:49Zmkittlermarius.kittler@suse.com
<ul></ul><p>Looks like <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/407" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/407</a> couldn't be cleanly applied. However, the worker class "tap" is gone and that's the main goal.</p>
<p>Maybe this also gives a hint as to why the setup is broken:</p>
<pre><code> ID: wicked ifup br1
Function: cmd.wait
Result: False
Comment: Command "wicked ifup br1" run
Started: 12:32:03.474942
Duration: 30412.6 ms
Changes:
----------
pid:
42945
retcode:
162
stderr:
stdout:
br1 setup-in-progress
</code></pre>
<p>At least on openqaworker14, because on ow15 it looks good:</p>
<pre><code> ID: wicked ifup br1
Function: cmd.wait
Result: True
Comment: Command "wicked ifup br1" run
Started: 12:32:06.437255
Duration: 31487.345 ms
Changes:
----------
pid:
65768
retcode:
0
stderr:
stdout:
br1 up
</code></pre>
<p>The bridge is now up on ow14 as well (without me doing anything):</p>
<pre><code>martchus@openqaworker14:~> sudo wicked ifup br1
br1 up
</code></pre> openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5147262022-04-29T11:35:12Zmkittlermarius.kittler@suse.com
<ul></ul><p>After reverting and applying the tap setup again via salt it works again on ow15. So I suppose the tap setup generally works. I also tested it with a parallel cluster where some jobs run on Nürnberg workers and some on Prague workers (see <a href="https://openqa.suse.de/tests/8656821" class="external">https://openqa.suse.de/tests/8656821</a>). It also continued to work after rebooting the worker. Maybe the gre tunnel isn't very good at coping with temporary connection problems and can get stuck in a disconnected state. We decided to create another ticket for that and leave the tap worker class out for now. Considering that <code>curl -4 https://openqa.suse.de</code> works and <code>curl -6 https://openqa.suse.de</code> doesn't (on both ow14 and ow15), it could be an IPv4 vs. IPv6 problem.</p>
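<p>The <code>curl -4</code> vs. <code>curl -6</code> discrepancy can be narrowed down by first checking which address families a name resolves to at all; a sketch (the function is my own illustration, and an actual per-family connect test would still be needed on top):</p>

```python
import socket

def addrs_by_family(host, port=443, infos=None):
    """Group resolved addresses by family to see whether a host offers
    IPv4 (AF_INET), IPv6 (AF_INET6), or both.

    `infos` can be injected (e.g. for testing) instead of calling
    socket.getaddrinfo."""
    if infos is None:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    grouped = {}
    for family, _type, _proto, _canon, sockaddr in infos:
        grouped.setdefault(family, set()).add(sockaddr[0])
    return grouped
```

<p>If a host only resolves to one family, or resolves to both but one family never connects, that points at a routing or firewall issue rather than the gre tunnel itself.</p>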
<p>I also rebased <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/408" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/408</a> which we might want to merge as well. (I'll merge it after getting at least one approval.)</p>
<p>Now both workers are again back in salt without tap worker class. See suggestions in <a class="issue tracker-4 status-1 priority-3 priority-lowest child" title="action: Establish reliable tap setup on ow14 (New)" href="https://progress.opensuse.org/issues/110467">#110467</a> for how to continue with the tap setup.</p>
<p>Since the failure rate still looks generally good, I'm resolving the ticket, keeping both workers running production jobs (just not tap ones).</p>
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5147322022-04-29T11:35:53Zmkittlermarius.kittler@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-1 priority-3 priority-lowest child" href="/issues/110467">action #110467</a>: Establish reliable tap setup on ow14</i> added</li></ul> openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5147382022-04-29T11:36:05Zmkittlermarius.kittler@suse.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul> openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5152182022-05-02T20:04:08Zokurzokurz@suse.com
<ul><li><strong>Due date</strong> deleted (<del><i>2022-04-29</i></del>)</li></ul> openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=5347642022-07-07T15:20:29Zokurzokurz@suse.com
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-4 status-3 priority-6 priority-high2 closed child" href="/issues/113366">action #113366</a>: Add three more Prague located OSD workers size:M</i> added</li></ul> openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:Mhttps://progress.opensuse.org/issues/104970?journal_id=6079342023-03-01T21:11:50Zokurzokurz@suse.com
<ul><li><strong>Tags</strong> set to <i>infra</i></li></ul>