openSUSE Project Management Tool | https://progress.opensuse.org/ | 2021-05-07T09:52:28Z
openQA Infrastructure - action #92302: NFS mount var-lib-openqa-share.mount often fails after boot of some workers | https://progress.opensuse.org/issues/92302?journal_id=405043 | 2021-05-07T09:52:28Z | okurz (okurz@suse.com)
<ul><li><strong>Copied from</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed behind-schedule" href="/issues/89551">action #89551</a>: NFS mount fails after boot (reproducible on some OSD workers)</i> added</li></ul>
openQA Infrastructure - action #92302: NFS mount var-lib-openqa-share.mount often fails after boot of some workers | https://progress.opensuse.org/issues/92302?journal_id=409391 | 2021-05-21T06:56:12Z | Xiaojing_liu (xliu1@suse.com)
<ul></ul><p>openqaworker-arm-2 can't be reached using <code>ping</code>, so the OSD deployment failed on 2021-05-21, but I didn't receive an alert email.</p>
openQA Infrastructure - action #92302: NFS mount var-lib-openqa-share.mount often fails after boot of some workers | https://progress.opensuse.org/issues/92302?journal_id=409598 | 2021-05-23T07:42:31Z | okurz (okurz@suse.com)
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed behind-schedule" href="/issues/64941">action #64941</a>: after every reboot openqaworker7 is missing var-lib-openqa-share.mount , check dependencies of service with openqaworker1</i> added</li></ul>
openQA Infrastructure - action #92302: NFS mount var-lib-openqa-share.mount often fails after boot of some workers | https://progress.opensuse.org/issues/92302?journal_id=409601 | 2021-05-23T07:49:35Z | okurz (okurz@suse.com)
<ul><li><strong>Subject</strong> changed from <i>NFS mount likely fails after boot of ARM workers</i> to <i>NFS mount var-lib-openqa-share.mount often fails after boot of some workers</i></li><li><strong>Priority</strong> changed from <i>Normal</i> to <i>High</i></li></ul><p><a href="https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services" class="external">https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services</a> shows</p>
<p>Currently failing services<br>
Last update | Host | Failing units | # failed services<br>
2021-05-23 05:48:00 | openqaworker-arm-1 | var-lib-openqa-share.mount, os-autoinst-openvswitch | 2<br>
2021-05-23 03:40:00 | openqaworker13 | var-lib-openqa-share.mount | 1<br>
2021-05-23 03:39:00 | grenache-1 | var-lib-openqa-share.mount, os-autoinst-openvswitch | 2<br>
2021-05-22 04:13:00 | openqaworker-arm-2 | var-lib-openqa-share.mount | 1<br>
2021-05-22 01:47:00 | openqaworker-arm-3 | var-lib-openqa-share.mount, os-autoinst-openvswitch | 2</p>
<p>os-autoinst-openvswitch tracked in <a class="issue tracker-4 status-3 priority-5 priority-high3 closed" title="action: Failing service os-autoinst-openvswitch after boot of some workers (Resolved)" href="https://progress.opensuse.org/issues/92969">#92969</a></p>
openQA Infrastructure - action #92302: NFS mount var-lib-openqa-share.mount often fails after boot of some workers | https://progress.opensuse.org/issues/92302?journal_id=409604 | 2021-05-23T07:54:19Z | okurz (okurz@suse.com)
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed" href="/issues/92969">action #92969</a>: Failing service os-autoinst-openvswitch after boot of some workers</i> added</li></ul>
openQA Infrastructure - action #92302: NFS mount var-lib-openqa-share.mount often fails after boot of some workers | https://progress.opensuse.org/issues/92302?journal_id=409655 | 2021-05-23T14:13:35Z | okurz (okurz@suse.com)
<ul><li><strong>Status</strong> changed from <i>Workable</i> to <i>In Progress</i></li><li><strong>Assignee</strong> set to <i>okurz</i></li></ul><p><a class="user active user-mention" href="https://progress.opensuse.org/users/22072">@mkittler</a>, can you comment on your suggestions in <a class="issue tracker-4 status-3 priority-5 priority-high3 closed behind-schedule" title="action: NFS mount fails after boot (reproducible on some OSD workers) (Resolved)" href="https://progress.opensuse.org/issues/89551#note-8">#89551#note-8</a>? Have you tried them? Should we try again?</p>
<p>Trying on openqaworker-arm-3 with:</p>
<pre><code>cat - > /etc/systemd/system/fix_mounts.service <<EOF
[Unit]
Description=Fix failed mounts by explicit mount command early (https://progress.opensuse.org/issues/92302)
Before=var-lib-openqa-share.mount
DefaultDependencies=no
[Service]
Type=oneshot
ExecStart=/usr/bin/mount -a
[Install]
WantedBy=multi-user.target
EOF
</code></pre>
openQA Infrastructure - action #92302: NFS mount var-lib-openqa-share.mount often fails after boot of some workers | https://progress.opensuse.org/issues/92302?journal_id=409658 | 2021-05-23T14:51:27Z | okurz (okurz@suse.com)
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul><p><a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/496" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/496</a></p>
openQA Infrastructure - action #92302: NFS mount var-lib-openqa-share.mount often fails after boot of some workers | https://progress.opensuse.org/issues/92302?journal_id=409673 | 2021-05-23T20:07:17Z | okurz (okurz@suse.com)
<ul><li><strong>Assignee</strong> deleted (<del><i>okurz</i></del>)</li></ul><p><code>fix_mounts.service</code> fails after boot because it is also started too early:</p>
<pre><code>May 23 19:47:44 openqaworker-arm-3 os-autoinst-openvswitch[2684]: Waiting for IP on bridge 'br1', 247s left ...
May 23 19:47:45 openqaworker-arm-3 mount[1147]: mount.nfs: Resource temporarily unavailable
May 23 19:47:45 openqaworker-arm-3 systemd[1]: fix_mounts.service: Main process exited, code=exited, status=64/n/a
May 23 19:47:45 openqaworker-arm-3 systemd[1]: Failed to start Fix failed mounts by explicit mount command early (https://progress.opensuse.org/issues/92302).
May 23 19:47:45 openqaworker-arm-3 systemd[1]: fix_mounts.service: Unit entered failed state.
May 23 19:47:45 openqaworker-arm-3 systemd[1]: fix_mounts.service: Failed with result 'exit-code'.
May 23 19:47:45 openqaworker-arm-3 systemd[1]: Mounting /var/lib/openqa/share...
May 23 19:47:45 openqaworker-arm-3 systemd[1]: var-lib-openqa-share.mount: Mount process exited, code=exited status=32
May 23 19:47:45 openqaworker-arm-3 systemd[1]: Failed to mount /var/lib/openqa/share.
…
</code></pre>
<p>Trying to introduce additional dependencies so that the unit waits for <code>sockets.target</code>.</p>
openQA Infrastructure - action #92302: NFS mount var-lib-openqa-share.mount often fails after boot of some workers | https://progress.opensuse.org/issues/92302?journal_id=409676 | 2021-05-23T20:07:25Z | okurz (okurz@suse.com)
<ul><li><strong>Assignee</strong> set to <i>okurz</i></li></ul>
openQA Infrastructure - action #92302: NFS mount var-lib-openqa-share.mount often fails after boot of some workers | https://progress.opensuse.org/issues/92302?journal_id=409967 | 2021-05-24T12:36:51Z | okurz (okurz@suse.com)
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/409967/diff?detail_id=389465">diff</a>)</li></ul><p>Tried an increased timeout and more retries in /etc/fstab:</p>
<pre><code># extended NFS retry https://progress.opensuse.org/issues/89551
openqa.suse.de:/var/lib/openqa/share /var/lib/openqa/share nfs ro,x-systemd.mount-timeout=30m,retry=30 0 0
</code></pre>
<p>However, this does not prevent the mount point from showing up as failed for the first five minutes after boot. The journal of the specific mount unit looks like this:</p>
<pre><code># journalctl -f -u var-lib-openqa-share.mount
May 24 14:23:49 openqaworker-arm-3 systemd[1]: Mounting /var/lib/openqa/share...
May 24 14:23:49 openqaworker-arm-3 systemd[1]: var-lib-openqa-share.mount: Mount process exited, code=exited status=32
May 24 14:23:49 openqaworker-arm-3 systemd[1]: Failed to mount /var/lib/openqa/share.
May 24 14:23:49 openqaworker-arm-3 systemd[1]: var-lib-openqa-share.mount: Unit entered failed state.
May 24 14:28:14 openqaworker-arm-3 systemd[1]: Mounting /var/lib/openqa/share...
May 24 14:28:16 openqaworker-arm-3 systemd[1]: Mounted /var/lib/openqa/share.
</code></pre>
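<p>For reference, the gap between the failed attempt and the eventually successful mount in the journal excerpt above can be worked out with a small shell sketch. The helper names <code>to_seconds</code> and <code>elapsed_seconds</code> are made up for illustration, and the sketch assumes both timestamps are in <code>HH:MM:SS</code> form and fall on the same day:</p>

```shell
# Convert an HH:MM:SS timestamp (as printed by journalctl) to seconds since midnight.
to_seconds() {
    t=$1
    h=${t%%:*}; rest=${t#*:}; m=${rest%%:*}; s=${rest#*:}
    # Strip one leading zero so values like "08" are not read as octal.
    h=${h#0}; m=${m#0}; s=${s#0}
    echo $(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
}

# Print the number of seconds between two timestamps on the same day.
elapsed_seconds() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Example with the timestamps from the journal above:
#   elapsed_seconds 14:23:49 14:28:16   # prints 267
```

<p>267 seconds is roughly four and a half minutes, which matches the observation that the unit shows up as failed for about the first five minutes after boot.</p>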
<p>Trying again with <code>fix_mounts.service</code> on top:</p>
<pre><code># /etc/systemd/system/fix_mounts.service
[Unit]
Description=Fix failed mounts by explicit mount command early (https://progress.opensuse.org/issues/92302)
After=sockets.target
Wants=sockets.target
Before=var-lib-openqa-share.mount
DefaultDependencies=no
[Service]
Type=oneshot
ExecStart=/usr/bin/mount -a
[Install]
WantedBy=multi-user.target
</code></pre>
<p>Both combined seem to work well:</p>
<pre><code>run: 01, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 14:37:26 up 0:15, 1 user, load average: 1.09, 1.15, 0.93
running
Connection to openqaworker-arm-3 closed by remote host.
run: 02, openqaworker-arm-3: ping .. ok, ssh .. run: 03, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: "System is booting up. See pam_nologin(8)"
Connection closed by 10.160.0.85 port 22
run: 04, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 14:48:56 up 0:05, 0 users, load average: 1.41, 1.41, 0.71
starting
run: 05, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 14:51:25 up 0:08, 0 users, load average: 0.80, 1.17, 0.72
running
Connection to openqaworker-arm-3 closed by remote host.
run: 06, openqaworker-arm-3: ping .. ok, ssh .. run: 07, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 15:03:12 up 0:05, 0 users, load average: 1.84, 1.47, 0.72
starting
run: 08, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 15:05:40 up 0:08, 0 users, load average: 0.31, 0.98, 0.65
running
Connection to openqaworker-arm-3 closed by remote host.
run: 09, openqaworker-arm-3: ping .. ok, ssh .. run: 10, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 15:16:48 up 0:05, 0 users, load average: 2.09, 1.47, 0.69
starting
run: 11, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 15:19:17 up 0:07, 0 users, load average: 1.22, 1.28, 0.73
running
Connection to openqaworker-arm-3 closed by remote host.
run: 12, openqaworker-arm-3: ping .. ok, ssh .. run: 13, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: "System is booting up. See pam_nologin(8)"
Connection closed by 10.160.0.85 port 22
run: 14, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 15:30:47 up 0:05, 0 users, load average: 1.28, 1.33, 0.67
starting
run: 15, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 15:33:17 up 0:08, 0 users, load average: 0.68, 1.09, 0.68
running
Connection to openqaworker-arm-3 closed by remote host.
run: 16, openqaworker-arm-3: ping .. ok, ssh .. run: 17, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 15:44:22 up 0:05, 0 users, load average: 1.81, 1.37, 0.66
starting
run: 18, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 15:46:51 up 0:07, 0 users, load average: 1.04, 1.14, 0.67
running
Connection to openqaworker-arm-3 closed by remote host.
run: 19, openqaworker-arm-3: ping .. ok, ssh .. run: 20, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 15:57:58 up 0:05, 0 users, load average: 2.06, 1.40, 0.64
starting
run: 21, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 16:00:27 up 0:07, 1 user, load average: 1.27, 1.32, 0.73
running
</code></pre>
<p>So at least 14 reboots succeeded; the other runs were either aborted or reached the system while it was still booting. Tried again with <code>fix_mounts.service</code> disabled, and the fstab options alone are not enough:</p>
<pre><code>echo "### disable fix_mounts.service again, only timeout and retry options in fstab"; export host=openqaworker-arm-3; for run in {01..30}; do for host in $host; do echo -n "run: $run, $host: ping .. " && timeout -k 5 600 sh -c "until ping -c30 $host >/dev/null; do :; done" && echo -n "ok, ssh .. " && timeout -k 5 600 sh -c "until nc -z -w 1 $host 22; do :; done" && echo -n "ok, uptime/reboot: " && ssh $host "uptime && sudo systemctl is-system-running && sudo reboot || sudo systemctl --failed --no-legend" && sleep 120 || break; done || break; done
### disable fix_mounts.service again, only timeout and retry options in fstab
run: 01, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 16:21:40 up 0:15, 1 user, load average: 0.81, 0.84, 0.72
running
Connection to openqaworker-arm-3 closed by remote host.
run: 02, openqaworker-arm-3: ping .. ok, ssh .. run: 03, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 16:33:20 up 0:05, 0 users, load average: 1.58, 1.48, 0.76
starting
var-lib-openqa-share.mount loaded failed failed /var/lib/openqa/share
run: 04, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 16:35:49 up 0:08, 0 users, load average: 0.56, 1.13, 0.73
running
Connection to openqaworker-arm-3 closed by remote host.
run: 05, openqaworker-arm-3: ping .. ok, ssh .. run: 06, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 16:46:50 up 0:05, 0 users, load average: 2.04, 1.50, 0.71
starting
var-lib-openqa-share.mount loaded failed failed /var/lib/openqa/share
run: 07, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 16:49:20 up 0:07, 0 users, load average: 0.75, 1.15, 0.70
running
Connection to openqaworker-arm-3 closed by remote host.
run: 08, openqaworker-arm-3: ping .. ok, ssh .. run: 09, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 17:00:22 up 0:05, 0 users, load average: 1.72, 1.36, 0.65
starting
var-lib-openqa-share.mount loaded failed failed /var/lib/openqa/share
run: 10, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 17:02:50 up 0:07, 0 users, load average: 0.88, 1.18, 0.69
running
Connection to openqaworker-arm-3 closed by remote host.
run: 11, openqaworker-arm-3: ping .. ok, ssh .. run: 12, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 17:13:51 up 0:05, 0 users, load average: 1.75, 1.39, 0.67
starting
var-lib-openqa-share.mount loaded failed failed /var/lib/openqa/share
run: 13, openqaworker-arm-3: ping .. ok, ssh .. ok, uptime/reboot: 17:16:21 up 0:07, 0 users, load average: 0.86, 1.19, 0.71
running
Connection to openqaworker-arm-3 closed by remote host.
</code></pre>
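<p>Output like the above can be tallied with a short filter. This is only a sketch: the function name <code>summarize_reboot_log</code> is hypothetical, and it assumes the log contains the bare <code>systemctl is-system-running</code> states (<code>running</code>, <code>starting</code>) on their own lines and failed units in the <code>loaded failed failed</code> form shown above:</p>

```shell
# Read a reboot-loop log on stdin and count the observed system states
# and the number of failed-unit lines.
summarize_reboot_log() {
    awk '
        $0 == "running"   { ok++ }        # system fully up, no failed units
        $0 == "starting"  { booting++ }   # system was still booting when checked
        / loaded failed / { failed++ }    # a line from "systemctl --failed --no-legend"
        END { printf "running=%d starting=%d failed_units=%d\n", ok, booting, failed }
    '
}

# Usage, assuming the loop output was saved to a file named reboot.log:
#   summarize_reboot_log < reboot.log
```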
<p>The failing unit is <code>var-lib-openqa-share.mount</code>. The systemd documentation at <a href="https://www.freedesktop.org/software/systemd/man/systemd.mount.html#">https://www.freedesktop.org/software/systemd/man/systemd.mount.html#</a> explains: "Network mount units automatically acquire After= dependencies on remote-fs-pre.target, network.target and network-online.target, and gain a Before= dependency on remote-fs.target unless nofail mount option is set. Towards the latter a Wants= unit is added as well." I can confirm this. However, <a href="https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/">https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/</a> states: "By default all remote mounts defined in /etc/fstab pull this service in, in order to make sure the network is up before it is attempted to connect to a network share." This does not seem to work in our case. I suspect an actual bug in wicked. But as I assume no one is willing or able to fix that without a small and easy reproducer from us, I am not going to bother reporting it elsewhere.</p>
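<p>One possible way to check the automatically acquired ordering is to inspect the <code>After=</code> property of the mount unit. The following is a sketch only; <code>ordered_after</code> is a made-up helper that reads the <code>After=...</code> line printed by <code>systemctl show -p After</code> on stdin:</p>

```shell
# Succeed if the unit named in $1 appears in an "After=..." property line on stdin.
ordered_after() {
    wanted=$1
    while IFS= read -r line; do
        # Pad with spaces so only whole unit names match.
        case " ${line#After=} " in
            *" $wanted "*) return 0 ;;
        esac
    done
    return 1
}

# Usage on a worker:
#   systemctl show -p After var-lib-openqa-share.mount | ordered_after network-online.target \
#       && echo "ordered after network-online.target"
```

<p>Note that this only shows the ordering dependency exists; it says nothing about whether <code>network-online.target</code> is actually reached only once the network is usable, which is the part that seems not to hold here.</p>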
<p>As alternatives I should try other options in fstab, e.g. <code>nofail,x-systemd.device-timeout=600,noauto,x-systemd.automount</code></p>
<p>New experiment with this fstab entry:</p>
<pre><code># extended NFS retry https://progress.opensuse.org/issues/89551
openqa.suse.de:/var/lib/openqa/share /var/lib/openqa/share nfs ro,noauto,nofail,retry=30,x-systemd.mount-timeout=30m,x-systemd.device-timeout=10m,x-systemd.automount 0 0
</code></pre>
<p>and no <code>fix_mounts.service</code>.</p>
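<p>Before starting a reboot loop it can help to confirm that the entry on the host really carries the intended options. A minimal sketch, with the hypothetical helpers <code>fstab_options</code> and <code>has_option</code>, assuming a whitespace-separated fstab where the options are the fourth field:</p>

```shell
# Print the options field of the fstab entry for the given mount point.
# $1: mount point, $2: fstab file (defaults to /etc/fstab)
fstab_options() {
    mountpoint=$1
    file=${2:-/etc/fstab}
    awk -v mp="$mountpoint" '$1 !~ /^#/ && $2 == mp { print $4 }' "$file"
}

# Succeed if the comma-separated option list $1 contains exactly the option $2.
has_option() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   opts=$(fstab_options /var/lib/openqa/share)
#   has_option "$opts" x-systemd.automount && has_option "$opts" nofail && echo "entry looks as intended"
```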
<p>Running test with</p>
<pre><code>echo -e "### disable fix_mounts.service again, fstab entry noauto,nofail,retry=30,x-systemd.mount-timeout=30m,x-systemd.device-timeout=10m,x-systemd.automount"; export host=openqaworker-arm-3; for run in {01..30}; do for host in $host; do echo -e -n "\nrun: $run, $host: ping .. " && timeout -k 5 600 sh -c "until ping -c30 $host >/dev/null; do :; done" && echo -n "ok, ssh .. " && timeout -k 5 600 sh -c "until nc -z -w 1 $host 22; do :; done" && echo -n "ok, uptime/reboot: " && ssh $host "uptime && sudo systemctl is-system-running && test -d /var/lib/openqa/share/factory/ && sudo reboot || sudo systemctl --failed --no-legend" && echo -n " .. ok" && sleep 120 || break; done || break; done
</code></pre>
<p>20 reboots were successful; <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/496">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/496</a> is ready.</p>
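<p>The "retry until reachable, but give up after a timeout" pattern used in the test loops above could also be factored into a small helper. This is a sketch, not what was actually run; <code>wait_for</code> is a hypothetical name, and it assumes a <code>date</code> that supports <code>+%s</code>:</p>

```shell
# Retry a command once per second until it succeeds or the timeout expires.
# $1: timeout in seconds, remaining arguments: the command to run
wait_for() {
    timeout_s=$1; shift
    deadline=$(( $(date +%s) + timeout_s ))
    until "$@"; do
        # Give up once the deadline has passed.
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep 1
    done
    return 0
}

# Usage, mirroring the checks from the loop above:
#   wait_for 600 ping -c1 "$host" && wait_for 600 nc -z -w 1 "$host" 22
```

<p>Unlike the busy <code>until …; do :; done</code> loops, this sleeps between attempts, at the cost of up to one second of extra latency per check.</p>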
openQA Infrastructure - action #92302: NFS mount var-lib-openqa-share.mount often fails after boot of some workers | https://progress.opensuse.org/issues/92302?journal_id=409991 | 2021-05-24T21:52:31Z | okurz (okurz@suse.com)
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/409991/diff?detail_id=389483">diff</a>)</li></ul>
openQA Infrastructure - action #92302: NFS mount var-lib-openqa-share.mount often fails after boot of some workers | https://progress.opensuse.org/issues/92302?journal_id=411054 | 2021-05-27T12:31:23Z | okurz (okurz@suse.com)
<ul><li><strong>Due date</strong> set to <i>2021-06-11</i></li></ul><p>Will check after the next weekend(s) whether this prevents further alerts.</p>
openQA Infrastructure - action #92302: NFS mount var-lib-openqa-share.mount often fails after boot of some workers | https://progress.opensuse.org/issues/92302?journal_id=414784 | 2021-06-09T20:57:36Z | okurz (okurz@suse.com)
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul><p><a href="https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=now-30d&to=now" class="external">https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=now-30d&to=now</a> shows no failed NFS mount units for a long time.</p>