Project

General

Profile

action #108266

openQA Tests - action #107062: Multiple failures due to network issues

grenache: script_run() commands randomly time out since server room move

Added by MDoucha 3 months ago. Updated 17 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
Start date:
2022-03-14
Due date:
% Done:

0%

Estimated time:

Description

Observation

Since the NBG server room was moved, I'm seeing a lot of random script_run() command timeouts on grenache. I suspect network issues.
https://openqa.suse.de/tests/8320677#step/sighold02/12
https://openqa.suse.de/tests/8294410#step/fallocate06/8
https://openqa.suse.de/tests/8294334#step/boot_ltp/42

     Test died: command 'vmstat -w' timed out at /usr/lib/os-autoinst/testapi.pm line 1039.
     # Test died: Timed out waiting for LTP test case which may still be running or the OS may have crashed! at sle/tests/kernel/run_ltp.pm line 337.
# Test died: command 'rpm -qi kernel-default > /tmp/kernel-pkg.txt 2>&1' timed out at /usr/lib/os-autoinst/testapi.pm line 1039.

            main::init_backend() called at /usr/bin/isotovideo line 258
[2022-03-09T16:12:24.052826+01:00] [info] ::: consoles::serial_screen::read_until: Matched output from SUT in 1 loops & 0.00229895696975291 seconds: Use of uninitialized value $regexp in concatenation (.) or string at /usr/lib/os-autoinst/testapi.pm line 927.
        testapi::wait_serial(undef, undef, 0, "no_regex", 1) called at sle/tests/kernel/run_ltp.pm line 317
        run_ltp::run(run_ltp=HASH(0x1001999aee8), LTP::TestInfo=HASH(0x1001b24d630)) called at /usr/lib/os-autoinst/basetest.pm line 356

        cf. last good

        [2022-03-12T07:06:13.797172+01:00] [info] ::: consoles::serial_screen::read_until: Matched output from SUT in 1 loops & 0.00224426796194166 seconds: 
Use of uninitialized value $regexp in concatenation (.) or string at /usr/lib/os-autoinst/testapi.pm line 927.
        testapi::wait_serial(undef, undef, 0, "no_regex", 1) called at sle/tests/kernel/run_ltp.pm line 317
        run_ltp::run(run_ltp=HASH(0x1003570fb08), LTP::TestInfo=HASH(0x1003547afa8)) called at /usr/lib/os-autoinst/basetest.pm line 356
        eval {...} called at /usr/lib/os-autoinst/basetest.pm line 354

Acceptance criteria

  • AC1: Users no longer file complaints about script_run timing out

Suggestions

  • Find a reproducer or database query to identify recent cases e.g. ask Martin. EDIT: mdoucha responded that there is no special query available. Next suggestion: Just pick any recent job where the problem happened, trigger 1k jobs for investigation, e.g. according priority or over weekend, etc.
  • Look into warnings in logs
  • "Use of uninitialized value $regexp in concatenation (.) or string" is already fixed
  • last good: https://openqa.suse.de/tests/8315985
  • [debug] Current version is 4.6.1647014989.7540333c [interface v25]
    • Do git log --no-merges 7540333c..$first_bad
  • Investigate the timeout handling c.f. recent improvements to VNC connection code and handling former blocking code paths

    • We don't have a screenshot to compare the serial output to
    • Maybe we can check the serial logs for comparison?

All these occurences are on the same machine, which is s390x-kvm-sle12

One problem I see is that in https://openqa.suse.de/tests/8505116#step/shutdown_ltp/6 we have a serial terminal. If there would be VNC we would be able to see if the command was executed or not. I also don't see the commands in https://openqa.suse.de/tests/8505116/logfile?filename=serial_terminal.txt nor serial0.txt .

We should try to resolve the ambiguity if commands just never write to the serial terminal as they time out or if actual data is going missing from SUT to worker.

What would you say, what is the best way to reproduce the issue? If we have a reproducer we can try to make it as small as possible and then fix it, maybe just increase the timeout. Maybe ensure that we cath any console related processes in the background if they are still responsive.

Further suggestions from SUSE QE Tools unblock 2022-05-11

  • As suggested in #108266#note-22, similar as we do for openQA worker hosts there should be monitoring to critical components (out of scope for SUSE QE Tools, delegate to SUSE QE Core)
  • within the code called by script_run using ssh
    • retry
    • check if the ssh connection is still there at all
    • provide more details when failing
  • Add in the message on timeout how long we waited

Related issues

Related to openQA Infrastructure - action #107437: [alert] Recurring "no data" alerts with only few minutes of outages since SUSE Nbg QA labs move size:MResolved2022-02-23

Related to openQA Project - action #108503: Collect and prominently display warnings from autoinst.logNew2016-02-10

Related to openQA Infrastructure - action #108845: Network performance problems, DNS, DHCP, within SUSE QA network auto_review:"(Error connecting to VNC server.*qa.suse.*Connection timed out|ipmitool.*qa.suse.*Unable to establish)":retry but also other symptoms size:MResolved2022-03-242022-04-15

Related to openQA Tests - action #108953: [tools] Performance issues in some s390 workersResolved2022-03-25

Related to openQA Project - action #109620: os-autoinst: Improve unit-test code coverage for backend::svirt size:MResolved

Copied to openQA Project - action #108323: Subroutine consoles::sshVirtsh::has redefined at .../Class/Accessor.pmResolved2022-03-14

History

#1 Updated by MDoucha 3 months ago

  • Related to action #107437: [alert] Recurring "no data" alerts with only few minutes of outages since SUSE Nbg QA labs move size:M added

#2 Updated by okurz 3 months ago

  • Priority changed from Normal to High
  • Target version set to Ready

#3 Updated by okurz 3 months ago

  • Subject changed from grenache: script_run() commands randomly time out since server room move to grenache: script_run() commands randomly time out since server room move size:M
  • Description updated (diff)
  • Status changed from New to Workable

#4 Updated by okurz 3 months ago

  • Description updated (diff)

#5 Updated by MDoucha 3 months ago

Warnings about Use of uninitialized value $regexp in concatenation are unrelated to this issue. That is a bug in SSH serial console activation that does not seem to interfere with tests at all.

#6 Updated by tinita 3 months ago

  • Copied to action #108323: Subroutine consoles::sshVirtsh::has redefined at .../Class/Accessor.pm added

#7 Updated by tinita 3 months ago

MDoucha wrote:

Warnings about Use of uninitialized value $regexp in concatenation are unrelated to this issue. That is a bug in SSH serial console activation that does not seem to interfere with tests at all.

Might be true, but we stumbled upon this warning and actually many more from other places of the code, and warnings are often a hint that something is wrong, and even if not, they can distract from the important warnings, so our goal should be to eliminate such unrelated warnings.

#8 Updated by MDoucha 3 months ago

I've created a PR to fix the warnings about undefined regex in wait_serial():
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14491

#9 Updated by tinita 3 months ago

  • Related to action #108503: Collect and prominently display warnings from autoinst.log added

#10 Updated by tinita 3 months ago

Again regarding the warnings:
If read_until is called with an undefined regex, it seems it will loop until the timeout occurs, so it would be nice if the PR is merged and we can see a test without those warnings.

I see this 1000 thousand times in the log:

Use of uninitialized value $p in index at /usr/lib/os-autoinst/consoles/serial_screen.pm line 202.
Use of uninitialized value in substr at /usr/lib/os-autoinst/consoles/serial_screen.pm line 204.
Use of uninitialized value in integer addition (+) at /usr/lib/os-autoinst/consoles/serial_screen.pm line 206.

#11 Updated by mkittler 3 months ago

Considering the more recent jobs in the scenarios mentioned in the ticket description I suppose merging https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14491 helped. Anything left to do from our side?

#12 Updated by okurz 3 months ago

  • Related to action #108845: Network performance problems, DNS, DHCP, within SUSE QA network auto_review:"(Error connecting to VNC server.*qa.suse.*Connection timed out|ipmitool.*qa.suse.*Unable to establish)":retry but also other symptoms size:M added

#13 Updated by MDoucha 3 months ago

The problem is still present after the SSH serial terminal fix:
https://openqa.suse.de/tests/8399959#step/update_kernel/326

And it appears that grenache randomly drops VNC over SSH connection as well:
https://openqa.suse.de/tests/8399977#step/svirt_upload_assets/7

#14 Updated by jlausuch 3 months ago

  • Related to action #108953: [tools] Performance issues in some s390 workers added

#15 Updated by okurz 3 months ago

  • Status changed from Workable to Resolved
  • Assignee set to okurz
  • Parent task set to #107062

Checking the scenarios related to the two most recent mentioned jobs I find 10/10 stable jobs in https://openqa.suse.de/tests/8430546 since the fixes from #108845 are effective. Additionally warnings have been addressed in pull requests so resolving assuming the problem is fixed.

#16 Updated by MDoucha 3 months ago

  • Status changed from Resolved to New

We got two more cases of random command timeout on grenache yesterday:
https://openqa.suse.de/tests/8472027#step/shutdown_ltp/6
https://openqa.suse.de/tests/8472022#step/boot_ltp/46

#17 Updated by okurz 3 months ago

  • Assignee deleted (okurz)

#18 Updated by okurz 3 months ago

  • Subject changed from grenache: script_run() commands randomly time out since server room move size:M to grenache: script_run() commands randomly time out since server room move

#20 Updated by cdywan 2 months ago

  • Subject changed from grenache: script_run() commands randomly time out since server room move to grenache: script_run() commands randomly time out since server room move size:M
  • Description updated (diff)
  • Status changed from New to Workable
  • Assignee set to mkittler

#21 Updated by mkittler 2 months ago

@MDoucha Looking into https://openqa.suse.de/tests/8505116#step/shutdown_ltp/6 it isn't clear whether the server move is to blame or not - or in general what the cause for the timeout is. As stated in the updated ticket description's suggestions, we needed to investigate further why this is happening. If you have any ideas, please leave a comment.

#22 Updated by MDoucha 2 months ago

I have no proof that the isues are related to the server room move. But they first appeared a day or two after the move and look very much like connection loss between grenache and the s390x mainframe. For further investigation, I'd recommend checking dmesg log for network errors on grenache, s390zp18.suse.de and s390zp19.suse.de. Then testing the network between them for packet loss and looking at any network monitoring data for unusual network load. My best guess is that there's a bad network cable or a switch running in hub mode somewhere between them.

BTW, more examples since yesterday:
https://openqa.suse.de/tests/8539760#step/dio10/6
https://openqa.suse.de/tests/8539763#step/abort01/2
https://openqa.suse.de/tests/8540428#step/cve-2017-17052/6
https://openqa.suse.de/tests/8540480#step/fork13/6
https://openqa.suse.de/tests/8551205#step/boot_ltp/40 (s390zp18)
https://openqa.suse.de/tests/8551206#step/boot_ltp/42 (s390zp18)
https://openqa.suse.de/tests/8540526#step/abort01/2
https://openqa.suse.de/tests/8540527#step/abort01/2
https://openqa.suse.de/tests/8540539#step/cve-2011-0999/2
https://openqa.suse.de/tests/8540534#step/abort01/2
https://openqa.suse.de/tests/8540543#step/abort01/2
https://openqa.suse.de/tests/8540544#step/abort01/2

The vast majority of these used s390zp19.suse.de but there are two jobs which ran on s390zp18.suse.de (marked).

#23 Updated by mkittler 2 months ago

  • Assignee deleted (mkittler)

#24 Updated by okurz about 2 months ago

  • Related to action #109620: os-autoinst: Improve unit-test code coverage for backend::svirt size:M added

#25 Updated by cdywan about 2 months ago

  • Description updated (diff)
  • Status changed from Workable to Blocked

Discussed this in the unblock. We can block on #109620 for now since this is primarily svirt related.

#26 Updated by cdywan about 2 months ago

  • Subject changed from grenache: script_run() commands randomly time out since server room move size:M to grenache: script_run() commands randomly time out since server room move
  • Assignee set to cdywan

Assigning to myself to track it, and dropping the size so we can re-estimate this later

#27 Updated by okurz about 1 month ago

  • Status changed from Blocked to New
  • Assignee deleted (cdywan)

100% statement coverage reached within #109620. We can now look into this ticket again with proper unit test coverage as a safety net.

#28 Updated by cdywan 26 days ago

  • Subject changed from grenache: script_run() commands randomly time out since server room move to grenache: script_run() commands randomly time out since server room move size:M
  • Description updated (diff)
  • Status changed from New to Workable

#29 Updated by okurz 17 days ago

  • Description updated (diff)

#30 Updated by okurz 17 days ago

  • Subject changed from grenache: script_run() commands randomly time out since server room move size:M to grenache: script_run() commands randomly time out since server room move
  • Status changed from Workable to New
  • Priority changed from High to Normal
  • Target version changed from Ready to future

Discussed in weekly 2022-06-10. We see a too low failure ratio and no clear way to reproduce the issue. I suggest to use auto-review, i.e. coming up with a good enough regex expression how to identify the very same issue without causing false positives for valid timeouts of script_run. With this we decided we will not currently work on this.

Also available in: Atom PDF