Project

General

Profile

Actions

action #108266

open

openQA Tests - action #107062: Multiple failures due to network issues

grenache: script_run() commands randomly time out since server room move

Added by MDoucha over 2 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
Start date:
2022-03-14
Due date:
% Done:

0%

Estimated time:

Description

Observation

Since the NBG server room was moved, I'm seeing a lot of random script_run() command timeouts on grenache. I suspect network issues.
https://openqa.suse.de/tests/8320677#step/sighold02/12
https://openqa.suse.de/tests/8294410#step/fallocate06/8
https://openqa.suse.de/tests/8294334#step/boot_ltp/42

     Test died: command 'vmstat -w' timed out at /usr/lib/os-autoinst/testapi.pm line 1039.
     # Test died: Timed out waiting for LTP test case which may still be running or the OS may have crashed! at sle/tests/kernel/run_ltp.pm line 337.
# Test died: command 'rpm -qi kernel-default > /tmp/kernel-pkg.txt 2>&1' timed out at /usr/lib/os-autoinst/testapi.pm line 1039.

            main::init_backend() called at /usr/bin/isotovideo line 258
[2022-03-09T16:12:24.052826+01:00] [info] ::: consoles::serial_screen::read_until: Matched output from SUT in 1 loops & 0.00229895696975291 seconds: Use of uninitialized value $regexp in concatenation (.) or string at /usr/lib/os-autoinst/testapi.pm line 927.
        testapi::wait_serial(undef, undef, 0, "no_regex", 1) called at sle/tests/kernel/run_ltp.pm line 317
        run_ltp::run(run_ltp=HASH(0x1001999aee8), LTP::TestInfo=HASH(0x1001b24d630)) called at /usr/lib/os-autoinst/basetest.pm line 356

        cf. last good

        [2022-03-12T07:06:13.797172+01:00] [info] ::: consoles::serial_screen::read_until: Matched output from SUT in 1 loops & 0.00224426796194166 seconds: 
Use of uninitialized value $regexp in concatenation (.) or string at /usr/lib/os-autoinst/testapi.pm line 927.
        testapi::wait_serial(undef, undef, 0, "no_regex", 1) called at sle/tests/kernel/run_ltp.pm line 317
        run_ltp::run(run_ltp=HASH(0x1003570fb08), LTP::TestInfo=HASH(0x1003547afa8)) called at /usr/lib/os-autoinst/basetest.pm line 356
        eval {...} called at /usr/lib/os-autoinst/basetest.pm line 354

Acceptance criteria

  • AC1: Users no longer file complaints about script_run timing out

Suggestions

  • Find a reproducer or database query to identify recent cases e.g. ask Martin. EDIT: mdoucha responded that there is no special query available. Next suggestion: Just pick any recent job where the problem happened, trigger 1k jobs for investigation, e.g. according priority or over weekend, etc.
  • Look into warnings in logs
  • "Use of uninitialized value $regexp in concatenation (.) or string" is already fixed
  • last good: https://openqa.suse.de/tests/8315985
  • [debug] Current version is 4.6.1647014989.7540333c [interface v25]
    • Do git log --no-merges 7540333c..$first_bad
  • Investigate the timeout handling c.f. recent improvements to VNC connection code and handling former blocking code paths

    • We don't have a screenshot to compare the serial output to
    • Maybe we can check the serial logs for comparison?

All these occurences are on the same machine, which is s390x-kvm-sle12

One problem I see is that in https://openqa.suse.de/tests/8505116#step/shutdown_ltp/6 we have a serial terminal. If there would be VNC we would be able to see if the command was executed or not. I also don't see the commands in https://openqa.suse.de/tests/8505116/logfile?filename=serial_terminal.txt nor serial0.txt .

We should try to resolve the ambiguity if commands just never write to the serial terminal as they time out or if actual data is going missing from SUT to worker.

What would you say, what is the best way to reproduce the issue? If we have a reproducer we can try to make it as small as possible and then fix it, maybe just increase the timeout. Maybe ensure that we cath any console related processes in the background if they are still responsive.

Further suggestions from SUSE QE Tools unblock 2022-05-11

  • As suggested in #108266#note-22, similar as we do for openQA worker hosts there should be monitoring to critical components (out of scope for SUSE QE Tools, delegate to SUSE QE Core)
  • within the code called by script_run using ssh
    • retry
    • check if the ssh connection is still there at all
    • provide more details when failing
  • Add in the message on timeout how long we waited

Related issues 7 (1 open6 closed)

Related to openQA Infrastructure - action #107437: [alert] Recurring "no data" alerts with only few minutes of outages since SUSE Nbg QA labs move size:MResolvedokurz2022-02-23

Actions
Related to openQA Project - action #108503: Collect and prominently display warnings from autoinst.logNew2016-02-10

Actions
Related to openQA Infrastructure - action #108845: Network performance problems, DNS, DHCP, within SUSE QA network auto_review:"(Error connecting to VNC server.*qa.suse.*Connection timed out|ipmitool.*qa.suse.*Unable to establish)":retry but also other symptoms size:MResolvednicksinger2022-03-24

Actions
Related to openQA Tests - action #108953: [tools] Performance issues in some s390 workersResolvedokurz2022-03-25

Actions
Related to openQA Project - action #109620: os-autoinst: Improve unit-test code coverage for backend::svirt size:MResolvedosukup

Actions
Related to openQA Infrastructure - action #127337: Some s390x workers have been failing for all jobs since 11 months agoResolvedokurz2023-04-06

Actions
Copied to openQA Project - action #108323: Subroutine consoles::sshVirtsh::has redefined at .../Class/Accessor.pmResolvedmkittler2022-03-14

Actions
Actions

Also available in: Atom PDF