action #136004

closed

[qe-core] test fails in autofs_client - NFS restarts can sometimes fail during restarts due to a slow SUT

Added by szarate over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Start date:
2023-06-23
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

While debugging https://openqa.suse.de/tests/12181788 I found out that sometimes a service can take longer and an immediate restart could cause an error.

Sep 19 08:14:21 server rpc.mountd[3031]: Caught signal 15, un-registering and exiting.
Sep 19 08:14:21 server systemd[1]: nfs-mountd.service: Succeeded.
Sep 19 08:14:21 server systemd[1]: Stopped NFS Mount Daemon.
Sep 19 08:14:21 server systemd[1]: nfs-idmapd.service: Succeeded.
Sep 19 08:14:21 server systemd[1]: Stopped NFSv4 ID-name mapping service.
Sep 19 08:14:21 server systemd[1]: Starting NFSv4 ID-name mapping service...
Sep 19 08:14:21 server systemd[1]: Starting NFS Mount Daemon...
Sep 19 08:14:21 server systemd[1]: Started NFSv4 ID-name mapping service.
Sep 19 08:14:21 server kernel: nfsd: last server has exited, flushing export cache
Sep 19 08:14:21 server systemd[1]: Started NFS Mount Daemon.
Sep 19 08:14:21 server systemd[1]: Starting NFS server and services...
Sep 19 08:14:21 server rpc.mountd[3098]: Version 2.1.1 starting
Sep 19 08:14:21 server exportfs[3099]: exportfs: /etc/exports [2]: Neither 'subtree_check' or 'no_subtree_check' specified for export "*:/tmp/nfs/server".
Sep 19 08:14:21 server exportfs[3099]:   Assuming default behaviour ('no_subtree_check').
Sep 19 08:14:21 server exportfs[3099]:   NOTE: this default has changed since nfs-utils version 1.0.x
Sep 19 08:14:21 server exportfs[3099]: exportfs: /etc/exports [3]: Neither 'subtree_check' or 'no_subtree_check' specified for export "*:/home/tux".
Sep 19 08:14:21 server exportfs[3099]:   Assuming default behaviour ('no_subtree_check').
Sep 19 08:14:21 server exportfs[3099]:   NOTE: this default has changed since nfs-utils version 1.0.x
Sep 19 08:14:21 server rpc.nfsd[3100]: rpc.nfsd: unable to bind AF_INET TCP socket: errno 98 (Address already in use)

In this case it failed at this line: https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/5251193c0ed3c8098771b9cfd5dc76c06abb0a53/tests/network/autofs_server.pm#L75
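A minimal sketch of how the symptom in the log above could be detected automatically. The function name `has_bind_failure` is made up for illustration; it only matches the `errno 98` message exactly as it appears in the log excerpt.

```shell
# Hypothetical helper: reads journal text on stdin and succeeds if it
# contains the "Address already in use" bind failure shown above.
has_bind_failure() {
    grep -Eq 'unable to bind .* socket: errno 98 \(Address already in use\)'
}

# Possible use on the SUT (not executed here):
#   journalctl -u nfs-server -b --no-pager | has_bind_failure && echo "bind failure seen"
```

The check is deliberately narrow, so other nfsd start-up errors would not be caught by it.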

Acceptance Criteria

  • AC1 When SUTs are under high load (stress-ng can be used to simulate this), service restarts still work as expected
  • AC2 A new ticket exists for implementing whatever from the notes on this ticket is not implemented as part of it
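For AC1, a rough sketch of a reproducer that restarts a service in a loop while stress-ng keeps the CPUs busy. The function names, the 120 second timeout and the `DRY_RUN` switch are assumptions made for this example, not part of the existing test code.

```shell
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical reproducer: $1 = service name, $2 = number of restarts.
restart_under_load() {
    rul_service=$1
    rul_count=$2
    # --cpu 0 starts one CPU stressor per online CPU; give up after 120 seconds
    run stress-ng --cpu 0 --timeout 120s &
    rul_stress_pid=$!
    rul_i=1
    rul_rc=0
    while [ "$rul_i" -le "$rul_count" ]; do
        if ! run systemctl restart "$rul_service"; then
            echo "restart $rul_i of $rul_service failed" >&2
            rul_rc=1
            break
        fi
        rul_i=$((rul_i + 1))
    done
    # Stop the load generator if it is still running
    kill "$rul_stress_pid" 2>/dev/null
    wait "$rul_stress_pid" 2>/dev/null
    return "$rul_rc"
}

# Possible use on the SUT (not executed here):
#   restart_under_load nfs-server 20
```

This only looks at the exit code of `systemctl restart`, so it would likely need an additional check of the service state to catch a restart that reports success while nfsd failed to bind.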

Notes

One idea that comes to my mind is to have Utils::Systemd::systemctl take two extra subroutines, one for a pre-check and one for a post-check, and to use a small check to verify that the port/socket is open.

One example of what this could look like:

  • systemctl 'restart nfs-server', pre => sub { say 'hello_world' }, post => \&check_nfs_port;
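To illustrate the post-check part of this idea outside of the Perl test API, here is a small shell sketch that polls until a TCP port is listening. `wait_until` and `port_is_listening` are hypothetical names, and port 2049 (the standard NFS port) is an assumption about what the check would look at; this is not how Utils::Systemd works today.

```shell
# Hypothetical helper: run a check command until it succeeds.
# $1 = maximum attempts, $2 = delay in seconds, remaining args = the command.
wait_until() {
    wu_attempts=$1
    wu_delay=$2
    shift 2
    wu_try=1
    while [ "$wu_try" -le "$wu_attempts" ]; do
        "$@" && return 0
        sleep "$wu_delay"
        wu_try=$((wu_try + 1))
    done
    return 1
}

# Hypothetical check: succeeds if something listens on TCP port $1.
port_is_listening() {
    ss -Htln "sport = :$1" | grep -q .
}

# Possible use on the SUT (not executed here):
#   systemctl restart nfs-server && wait_until 10 1 port_is_listening 2049
```

A pre-check could reuse the same loop with an inverted condition, waiting until the port is free before the service is started again.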

Related issues 2 (0 open, 2 closed)

Related to openQA Tests (public) - action #135884: [qe-core] test fails in autofs multimachine (Rejected, 2023-09-18)
Copied from openQA Tests (public) - action #131291: [qe-core] test fails in autofs_client (Resolved, rfan1, 2023-06-23)

Actions #1

Updated by szarate over 1 year ago

  • Copied from action #131291: [qe-core] test fails in autofs_client added
Actions #2

Updated by szarate over 1 year ago

  • Description updated (diff)
Actions #3

Updated by amanzini over 1 year ago

my 2c: if a service does not restart properly even on a busy system (e.g. it reports as started when it is not yet ready, or it does not release all its resources at shutdown; is the socket option SO_REUSEADDR set?), this is likely a product bug

Actions #4

Updated by szarate over 1 year ago

  • Status changed from New to Workable

amanzini wrote in #note-3:

my 2c: if a service does not restart properly even on a busy system (e.g. it reports as started when it is not yet ready, or it does not release all its resources at shutdown; is the socket option SO_REUSEADDR set?), this is likely a product bug

Could you investigate further? In the meantime, please unschedule these tests from maintenance.

Actions #5

Updated by amanzini over 1 year ago

  • Assignee set to amanzini
Actions #6

Updated by amanzini over 1 year ago

  • Tags changed from bugbusters to bugbusters, qe-core-coverage
Actions #7

Updated by amanzini over 1 year ago

  • Status changed from Workable to In Progress

Unscheduled with MR https://gitlab.suse.de/qa-maintenance/qam-openqa-yml/-/merge_requests/610 and now trying to reproduce the bug in a separate environment.

Actions #8

Updated by okurz over 1 year ago

  • Related to action #135884: [qe-core] test fails in autofs multimachine added
Actions #9

Updated by amanzini over 1 year ago

some investigations on reproducing this bug


when nfs-server is restarted, it reports an "active" status but a code=exited, status=1/FAILURE; nfsd is not able to bind socket.

upon another restart, it succeeds

Actions #10

Updated by amanzini over 1 year ago

this impacts 15-SP2 and 15-SP3, but seems fixed in 15-SP4: https://openqa.suse.de/tests/12216411#

Actions #11

Updated by amanzini over 1 year ago

with an explicit nfs-server stop; nfs-server start, the test passes:

https://openqa.suse.de/tests/12222304              passed
https://openqa.suse.de/tests/12222307              passed
https://openqa.suse.de/tests/12223032              passed
https://openqa.suse.de/tests/12223033              passed
https://openqa.suse.de/tests/12224845              passed
https://openqa.suse.de/tests/12224846              passed

so the flaw should be in the 'restart' logic. I tried to manually reproduce the bug in a separate VM with a loop of "systemctl restart nfs-server", but the service always comes up without any error.

My suggestion to improve the test is to look for status= message, wdyt ?
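As a sketch of what looking for the status= message could mean in practice: the function below (a hypothetical name) reads `systemctl status` text on stdin and prints every non-zero `status=` value found next to `code=exited`. The sample line in the comment is illustrative and not copied from a real job.

```shell
# Hypothetical helper: prints non-zero exit statuses found in
# `systemctl status` output; returns non-zero if there are none.
failed_exit_statuses() {
    sed -n 's/.*code=exited, status=\([0-9][0-9]*\)\/.*/\1/p' | grep -v '^0$'
}

# Illustrative input line:
#   Process: 3100 ExecStart=/usr/sbin/rpc.nfsd (code=exited, status=1/FAILURE)
# Possible use on the SUT (not executed here):
#   systemctl status nfs-server --no-pager | failed_exit_statuses && echo "nfsd failed to start"
```

This would flag the case described above where the unit shows as active although the process exited with status=1/FAILURE.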

Actions #12

Updated by amanzini about 1 year ago

  • Status changed from In Progress to Feedback
Actions #13

Updated by szarate about 1 year ago

amanzini wrote in #note-11:

with an explicit nfs-server stop; nfs-server start, the test passes:

https://openqa.suse.de/tests/12222304              passed
https://openqa.suse.de/tests/12222307              passed
https://openqa.suse.de/tests/12223032              passed
https://openqa.suse.de/tests/12223033              passed
https://openqa.suse.de/tests/12224845              passed
https://openqa.suse.de/tests/12224846              passed

so the flaw should be in the 'restart' logic. I tried to manually reproduce the bug in a separate VM with a loop of "systemctl restart nfs-server", but the service always comes up without any error.

Yeah, this is similar to the conclusion Dee reached: she couldn't reproduce it locally, only on OSD.

My suggestion to improve the test is to look for status= message, wdyt ?

Give it a try. However, it would be good to report the bug and catch it when it happens (by soft-failing). I would prefer to keep the restart in this case, as that's clearly not good behavior.

@acarvajal your issues with NFS were different, right? Do you have a ticket for those already?

Actions #15

Updated by amanzini about 1 year ago

since it was unscheduled, to get a Verification Run for a multi-machine job I edited a development job group:

---
defaults:
  x86_64:
    machine: 64bit
    priority: 50
  s390x:
    machine: s390x-kvm-sle12
    priority: 35
  aarch64:
    machine: aarch64-virtio
    priority: 50
products:
  sle-15-SP2-Server-DVD-Updates-aarch64:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP2
  sle-15-SP2-Server-DVD-Updates-x86_64:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP2
  sle-15-SP2-Server-DVD-Updates-s390x:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP2
  sle-15-SP3-Server-DVD-Updates-aarch64:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP3
  sle-15-SP3-Server-DVD-Updates-x86_64:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP3
  sle-15-SP3-Server-DVD-Updates-s390x:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP3
  sle-15-SP4-Desktop-DVD-Updates-x86_64:
    distri: sle
    flavor: Desktop-DVD-Updates
    version: 15-SP4
  sle-15-SP4-Server-DVD-Updates-aarch64:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP4
  sle-15-SP4-Server-DVD-Updates-x86_64:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP4
  sle-15-SP4-Server-DVD-Updates-TERADATA-x86_64:
    distri: sle
    flavor: Server-DVD-Updates-TERADATA
    version: 15-SP4
  sle-15-SP4-Server-DVD-Updates-s390x:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP4
  sle-15-SP5-Desktop-DVD-Updates-x86_64:
    distri: sle
    flavor: Desktop-DVD-Updates
    version: 15-SP5
  sle-15-SP5-Server-DVD-Updates-aarch64:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP5
  sle-15-SP5-Server-DVD-Updates-x86_64:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP5
  sle-15-SP5-Server-DVD-Updates-s390x:
    distri: sle
    flavor: Server-DVD-Updates
    version: 15-SP5
scenarios:
  x86_64:
    sle-15-SP2-Server-DVD-Updates-x86_64:
      - mau-autofs-client_amanzini:
          testsuite: mau-autofs-client
          settings:
            CASEDIR: 'https://github.com/ilmanzo/os-autoinst-distri-opensuse#poo136004_nfs_restarts'
            PARALLEL_WITH: mau-autofs-server_amanzini
      - mau-autofs-server_amanzini:
          testsuite: mau-autofs-server
          settings:
            CASEDIR: 'https://github.com/ilmanzo/os-autoinst-distri-opensuse#poo136004_nfs_restarts'
    sle-15-SP3-Server-DVD-Updates-x86_64:
      - mau-autofs-client_amanzini:
          testsuite: mau-autofs-client
          settings:
            CASEDIR: 'https://github.com/ilmanzo/os-autoinst-distri-opensuse#poo136004_nfs_restarts'
            PARALLEL_WITH: mau-autofs-server_amanzini
      - mau-autofs-server_amanzini:
          testsuite: mau-autofs-server
          settings:
            CASEDIR: 'https://github.com/ilmanzo/os-autoinst-distri-opensuse#poo136004_nfs_restarts'
    sle-15-SP4-Server-DVD-Updates-x86_64:
      - mau-autofs-client_amanzini:
          testsuite: mau-autofs-client
          settings:
            CASEDIR: 'https://github.com/ilmanzo/os-autoinst-distri-opensuse#poo136004_nfs_restarts'
            PARALLEL_WITH: mau-autofs-server_amanzini
      - mau-autofs-server_amanzini:
          testsuite: mau-autofs-server
          settings:
            CASEDIR: 'https://github.com/ilmanzo/os-autoinst-distri-opensuse#poo136004_nfs_restarts'
    sle-15-SP5-Server-DVD-Updates-x86_64:
      - mau-autofs-client_amanzini:
          testsuite: mau-autofs-client
          settings:
            CASEDIR: 'https://github.com/ilmanzo/os-autoinst-distri-opensuse#poo136004_nfs_restarts'
            PARALLEL_WITH: mau-autofs-server_amanzini
      - mau-autofs-server_amanzini:
          testsuite: mau-autofs-server
          settings: 
            CASEDIR: 'https://github.com/ilmanzo/os-autoinst-distri-opensuse#poo136004_nfs_restarts'

then I ran the ISO POST with a helper script:

#!/bin/bash
# bash is needed because of the [[ ... ]] test below
#
MYID="amanzini-136004"
#
# Parameters for the ISO POST (hard-coded for this verification run)
ARK=x86_64
BLD=20230924-1
VER=15-SP3
SKIPDEP=1
#
# Refuse to post if any parameter is empty
if [[ -z "$ARK" || -z "$BLD" || -z "$VER" || -z "$SKIPDEP" ]]; then
  echo "Usage: $0 ARK BUILD SLE_VER SKIPDEP"
  exit 1
fi
#
#    #FL="Full"
#    #FL="Full-QR"
#    #FL="Online"
#    FL="Online-QR"
FL="Server-DVD-Updates"
#
# test devel
GID="487"
#

# Schedule the jobs on OSD, restricted to the development job group
/usr/bin/openqa-cli api -X post isos --osd \
     _SKIP_POST_FAIL_HOOKS=1 _SKIP_CHAINED_DEPS="$SKIPDEP" \
     _GROUP_ID="$GID" DISTRI=sle VERSION="$VER" FLAVOR="$FL" BUILD="$BLD" ARCH="$ARK" \
     _GROUP="$MYID"
Actions #16

Updated by amanzini about 1 year ago

  • Status changed from In Progress to Feedback
Actions #17

Updated by amanzini about 1 year ago

  • Status changed from Feedback to Resolved