action #136004
closed
[qe-core] test fails in autofs_client - NFS restarts can sometimes fail due to a slow SUT
Description
Observation
While debugging https://openqa.suse.de/tests/12181788 I found out that a service can sometimes take longer to shut down, and an immediate restart can then fail.
Sep 19 08:14:21 server rpc.mountd[3031]: Caught signal 15, un-registering and exiting.
Sep 19 08:14:21 server systemd[1]: nfs-mountd.service: Succeeded.
Sep 19 08:14:21 server systemd[1]: Stopped NFS Mount Daemon.
Sep 19 08:14:21 server systemd[1]: nfs-idmapd.service: Succeeded.
Sep 19 08:14:21 server systemd[1]: Stopped NFSv4 ID-name mapping service.
Sep 19 08:14:21 server systemd[1]: Starting NFSv4 ID-name mapping service...
Sep 19 08:14:21 server systemd[1]: Starting NFS Mount Daemon...
Sep 19 08:14:21 server systemd[1]: Started NFSv4 ID-name mapping service.
Sep 19 08:14:21 server kernel: nfsd: last server has exited, flushing export cache
Sep 19 08:14:21 server systemd[1]: Started NFS Mount Daemon.
Sep 19 08:14:21 server systemd[1]: Starting NFS server and services...
Sep 19 08:14:21 server rpc.mountd[3098]: Version 2.1.1 starting
Sep 19 08:14:21 server exportfs[3099]: exportfs: /etc/exports [2]: Neither 'subtree_check' or 'no_subtree_check' specified for export "*:/tmp/nfs/server".
Sep 19 08:14:21 server exportfs[3099]: Assuming default behaviour ('no_subtree_check').
Sep 19 08:14:21 server exportfs[3099]: NOTE: this default has changed since nfs-utils version 1.0.x
Sep 19 08:14:21 server exportfs[3099]: exportfs: /etc/exports [3]: Neither 'subtree_check' or 'no_subtree_check' specified for export "*:/home/tux".
Sep 19 08:14:21 server exportfs[3099]: Assuming default behaviour ('no_subtree_check').
Sep 19 08:14:21 server exportfs[3099]: NOTE: this default has changed since nfs-utils version 1.0.x
Sep 19 08:14:21 server rpc.nfsd[3100]: rpc.nfsd: unable to bind AF_INET TCP socket: errno 98 (Address already in use)
In this case it failed at this line: https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/5251193c0ed3c8098771b9cfd5dc76c06abb0a53/tests/network/autofs_server.pm#L75
Acceptance Criteria
- AC1: When SUTs are under high load (stress-ng can be used to simulate it), service restarts still work as expected
- AC2: A new ticket exists for implementing whatever from the Notes below is not implemented as part of this ticket
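As a rough sketch of how AC1 could be exercised: restart the unit repeatedly while the system is under load (started separately, e.g. with stress-ng), counting how many restarts fail. This is illustrative Python, not the openQA test code; the function name and the injectable `runner` parameter are assumptions made here so the loop can be exercised without a real systemd.

```python
import subprocess

# Hypothetical AC1 sketch: restart a unit several times while the SUT is
# under load (e.g. `stress-ng --cpu 4 --timeout 300` started beforehand)
# and count how many restarts fail. `runner` defaults to subprocess.run
# but can be replaced with a fake for testing.
def count_restart_failures(unit, attempts, runner=subprocess.run):
    failures = 0
    for _ in range(attempts):
        result = runner(["systemctl", "restart", unit])
        if result.returncode != 0:
            failures += 1
    return failures
```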
Notes
One idea that comes to mind is to have Utils::Systemd::systemctl accept two extra subroutine references, one for a pre-check and one for a post-check, and use a small check to verify that the port/socket is open.
One example of how this could look:
systemctl 'restart nfs-server', pre => sub { say 'hello world' }, post => \&check_nfs_port;
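A minimal sketch of what such a post-check could do, written in Python for illustration (the real hook would live in Perl in Utils::Systemd): poll until something accepts TCP connections on the NFS port, giving a slow SUT time to finish the restart. `wait_for_port` is a hypothetical name; 2049 is the standard NFS TCP port.

```python
import socket
import time

# Hypothetical post-check: poll until host:port accepts a TCP connection,
# returning False if the deadline passes. On a slow SUT this absorbs the
# window where rpc.nfsd has not yet bound its socket.
def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(interval)
            if sock.connect_ex((host, port)) == 0:
                return True
        time.sleep(interval)
    return False

# e.g. as a post hook after 'systemctl restart nfs-server':
#     wait_for_port("localhost", 2049)
```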
Updated by szarate over 1 year ago
- Copied from action #131291: [qe-core] test fails in autofs_client added
Updated by amanzini over 1 year ago
My 2c: if a service does not restart properly even on a busy system, e.g. it reports as started when it is not yet ready, or it does not release all its resources at shutdown (is the socket option SO_REUSEADDR set?), it could well be a product bug.
Updated by szarate over 1 year ago
- Status changed from New to Workable
amanzini wrote in #note-3:
My 2c: if a service does not restart properly even on a busy system, e.g. it reports as started when it is not yet ready, or it does not release all its resources at shutdown (is the socket option SO_REUSEADDR set?), it could well be a product bug.
Could you investigate further? And in the meantime, please unschedule these tests from maintenance.
Updated by amanzini over 1 year ago
- Tags changed from bugbusters to bugbusters, qe-core-coverage
Updated by amanzini over 1 year ago
- Status changed from Workable to In Progress
Unscheduled with MR https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/610; trying to reproduce the bug in a separate environment.
Updated by okurz over 1 year ago
- Related to action #135884: [qe-core] test fails in autofs multimachine added
Updated by amanzini over 1 year ago
Some investigation into reproducing this bug:
When nfs-server is restarted, it reports an "active" status but a code=exited, status=1/FAILURE; nfsd is not able to bind its socket.
Upon another restart, it succeeds.
Updated by amanzini over 1 year ago
This impacts 15-SP2 and 15-SP3, but seems fixed in 15-SP4: https://openqa.suse.de/tests/12216411#
Updated by amanzini over 1 year ago
With an explicit nfs-server stop; nfs-server start, the tests pass:
https://openqa.suse.de/tests/12222304 passed
https://openqa.suse.de/tests/12222307 passed
https://openqa.suse.de/tests/12223032 passed
https://openqa.suse.de/tests/12223033 passed
https://openqa.suse.de/tests/12224845 passed
https://openqa.suse.de/tests/12224846 passed
So the flaw is likely in the 'restart' logic. I tried to reproduce the bug manually in a separate VM with a loop of "systemctl restart nfs-server", but the service always comes up without any error.
My suggestion to improve the test is to look for the status= message, wdyt?
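One way the suggested check could look, as an illustrative Python sketch (the real check would be Perl in the test module): after the restart, inspect the systemctl status output for the status=1/FAILURE message seen in the observation above. The function name and the injectable `runner` parameter are assumptions made here so the logic can be exercised without a real systemd.

```python
import subprocess

# Hypothetical sketch of the suggested check: restart the unit, then look
# for the 'status=1/FAILURE' message in `systemctl status` output. `runner`
# defaults to subprocess.run but can be replaced with a fake for testing.
def restart_reported_failure(unit, runner=subprocess.run):
    runner(["systemctl", "restart", unit], check=True)
    status = runner(["systemctl", "status", unit],
                    capture_output=True, text=True)
    return "status=1/FAILURE" in status.stdout
```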
Updated by amanzini about 1 year ago
- Status changed from In Progress to Feedback
Updated by szarate about 1 year ago
amanzini wrote in #note-11:
With an explicit nfs-server stop; nfs-server start, the tests pass:
https://openqa.suse.de/tests/12222304 passed https://openqa.suse.de/tests/12222307 passed https://openqa.suse.de/tests/12223032 passed https://openqa.suse.de/tests/12223033 passed https://openqa.suse.de/tests/12224845 passed https://openqa.suse.de/tests/12224846 passed
So the flaw is likely in the 'restart' logic. I tried to reproduce the bug manually in a separate VM with a loop of "systemctl restart nfs-server", but the service always comes up without any error.
Yeah, this is similar to the conclusion Dee reached: she couldn't reproduce it locally, only on OSD.
My suggestion to improve the test is to look for the status= message, wdyt?
Give it a try; however, it would be good to report the bug and catch it when it happens (by soft-failing). I would prefer to keep the restart in this case, as that's clearly not good behavior.
@acarvajal, your issues with NFS were different, right? Do you have a ticket for those already?
Updated by amanzini about 1 year ago
- Status changed from Feedback to In Progress
Updated by amanzini about 1 year ago
Since it was unscheduled, I edited a development job group to get a verification run for the multi-machine jobs:
---
defaults:
  x86_64:
    machine: 64bit
    priority: 50
  s390x:
    machine: s390x-kvm-sle12
    priority: 35
  aarch64:
    machine: aarch64-virtio
    priority: 50
products:
  sle-15-SP2-Server-DVD-Updates-aarch64:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP2
  sle-15-SP2-Server-DVD-Updates-x86_64:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP2
  sle-15-SP2-Server-DVD-Updates-s390x:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP2
  sle-15-SP3-Server-DVD-Updates-aarch64:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP3
  sle-15-SP3-Server-DVD-Updates-x86_64:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP3
  sle-15-SP3-Server-DVD-Updates-s390x:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP3
  sle-15-SP4-Desktop-DVD-Updates-x86_64:
    distri: sle
    flavor: Desktop-DVD-Updates
    version: 15-SP4
  sle-15-SP4-Server-DVD-Updates-aarch64:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP4
  sle-15-SP4-Server-DVD-Updates-x86_64:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP4
  sle-15-SP4-Server-DVD-Updates-TERADATA-x86_64:
    distri: sle
    flavor: Server-DVD-Updates-TERADATA
    version: 15-SP4
  sle-15-SP4-Server-DVD-Updates-s390x:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP4
  sle-15-SP5-Desktop-DVD-Updates-x86_64:
    distri: sle
    flavor: Desktop-DVD-Updates
    version: 15-SP5
  sle-15-SP5-Server-DVD-Updates-aarch64:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP5
  sle-15-SP5-Server-DVD-Updates-x86_64:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP5
  sle-15-SP5-Server-DVD-Updates-s390x:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP5
scenarios:
  x86_64:
    sle-15-SP2-Server-DVD-Updates-x86_64:
      - mau-autofs-client_amanzini:
          testsuite: mau-autofs-client
          settings:
            CASEDIR: 'https://github.com/ilmanzo/os-autoinst-distri-opensuse#poo136004_nfs_restarts'
            PARALLEL_WITH: mau-autofs-server_amanzini
      - mau-autofs-server_amanzini:
          testsuite: mau-autofs-server
          settings:
            CASEDIR: 'https://github.com/ilmanzo/os-autoinst-distri-opensuse#poo136004_nfs_restarts'
    sle-15-SP3-Server-DVD-Updates-x86_64:
      - mau-autofs-client_amanzini:
          testsuite: mau-autofs-client
          settings:
            CASEDIR: 'https://github.com/ilmanzo/os-autoinst-distri-opensuse#poo136004_nfs_restarts'
            PARALLEL_WITH: mau-autofs-server_amanzini
      - mau-autofs-server_amanzini:
          testsuite: mau-autofs-server
          settings:
            CASEDIR: 'https://github.com/ilmanzo/os-autoinst-distri-opensuse#poo136004_nfs_restarts'
    sle-15-SP4-Server-DVD-Updates-x86_64:
      - mau-autofs-client_amanzini:
          testsuite: mau-autofs-client
          settings:
            CASEDIR: 'https://github.com/ilmanzo/os-autoinst-distri-opensuse#poo136004_nfs_restarts'
            PARALLEL_WITH: mau-autofs-server_amanzini
      - mau-autofs-server_amanzini:
          testsuite: mau-autofs-server
          settings:
            CASEDIR: 'https://github.com/ilmanzo/os-autoinst-distri-opensuse#poo136004_nfs_restarts'
    sle-15-SP5-Server-DVD-Updates-x86_64:
      - mau-autofs-client_amanzini:
          testsuite: mau-autofs-client
          settings:
            CASEDIR: 'https://github.com/ilmanzo/os-autoinst-distri-opensuse#poo136004_nfs_restarts'
            PARALLEL_WITH: mau-autofs-server_amanzini
      - mau-autofs-server_amanzini:
          testsuite: mau-autofs-server
          settings:
            CASEDIR: 'https://github.com/ilmanzo/os-autoinst-distri-opensuse#poo136004_nfs_restarts'
Then I ran the ISO POST with a helper script:
#!/bin/sh
MYID="amanzini-136004"

# Usage: $0 [ARCH] [BUILD] [SLE_VER] [SKIPDEP]
# Defaults match this verification run; pass positional args to override.
ARK="${1:-x86_64}"
BLD="${2:-20230924-1}"
VER="${3:-15-SP3}"
SKIPDEP="${4:-1}"

# FL="Full"
# FL="Full-QR"
# FL="Online"
# FL="Online-QR"
FL="Server-DVD-Updates"

# test devel job group
GID="487"

/usr/bin/openqa-cli api -X post isos --osd \
    _SKIP_POST_FAIL_HOOKS=1 _SKIP_CHAINED_DEPS="$SKIPDEP" \
    _GROUP_ID="$GID" DISTRI=sle VERSION="$VER" FLAVOR="$FL" BUILD="$BLD" ARCH="$ARK" \
    _GROUP="$MYID"
Updated by amanzini about 1 year ago
- Status changed from In Progress to Feedback