action #106017

closed

[ipmi backend]VNC stalled, no update for 5.38 seconds

Added by Julie_CAO about 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
Start date:
2022-02-07
Due date:
2022-02-22
% Done:

0%

Estimated time:

Description

Most of the virtualization tests failed with this issue. Could you please investigate?
Job groups:
https://openqa.nue.suse.com/tests/overview?distri=sle&version=15-SP4&build=91.2&groupid=263
https://openqa.nue.suse.com/tests/overview?distri=sle&version=15-SP4&build=91.2&groupid=264

scc_registration failed in the statement "send_key 'alt-s' if check_screen('license-insert-disc-issue', 30);".

https://openqa.nue.suse.com/tests/8113373/logfile?filename=autoinst-log.txt

[2022-02-07T02:47:39.684525+01:00] [debug] no match: -0.1s, best candidate: license-insert-disc-issue-bsc1153326-20191114 (0.00)
[2022-02-07T02:47:40.039376+01:00] [debug] >>> testapi::_check_backend_response: match=license-insert-disc-issue timed out after 90 (check_screen)
[2022-02-07T02:47:40.070779+01:00] [debug] tests/installation/scc_registration.pm:25 called registration::fill_in_registration_data -> lib/registration.pm:572 called registration::process_modules -> lib/registration.pm:957 called registration::process_scc_register_addons -> lib/registration.pm:479 called testapi::check_screen
[2022-02-07T02:47:40.070984+01:00] [debug] <<< testapi::check_screen(mustmatch=[
    "import-untrusted-gpg-key"
  ], timeout=60)
[2022-02-07T02:47:41.115512+01:00] [warn] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 0.79 seconds for 44 candidate needles - make your needles more specific
[2022-02-07T02:47:41.115638+01:00] [debug] no match: 179.7s, best candidate: add_update_test_repo-import-untrusted-gpg-key-20210321 (0.00)
[2022-02-07T02:47:41.322902+01:00] [debug] no change: 178.7s
[2022-02-07T02:47:41.731741+01:00] [debug] no match: 178.7s, best candidate: add_update_test_repo-import-untrusted-gpg-key-20210321 (0.00)
[2022-02-07T02:47:42.323396+01:00] [debug] no change: 177.7s
[2022-02-07T02:47:42.722764+01:00] [debug] no match: 177.7s, best candidate: add_update_test_repo-import-untrusted-gpg-key-20210321 (0.00)
[2022-02-07T02:47:43.327645+01:00] [debug] no change: 176.7s
[2022-02-07T02:47:43.723201+01:00] [debug] no match: 176.7s, best candidate: add_update_test_repo-import-untrusted-gpg-key-20210321 (0.00)
[2022-02-07T02:47:49.614396+01:00] [warn] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 5.29 seconds for 44 candidate needles - make your needles more specific
[2022-02-07T02:47:49.614546+01:00] [debug] no match: 175.7s, best candidate: add_update_test_repo-import-untrusted-gpg-key-20210321 (0.00)
[2022-02-07T02:47:49.615018+01:00] [debug] considering VNC stalled, no update for 5.38 seconds
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":44401"
      after 28811 requests (28654 known processed) with 0 events remaining.
xterm: fatal IO error 11 (Resource temporarily unavailable) or KillClient on X server ":44401"
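
To see how often this happens in a given job, the stall messages can be pulled out of a downloaded autoinst-log.txt. The following is a minimal sketch, assuming the log was saved locally and that the message format matches the line quoted above. The function name `stalled_seconds` is made up for illustration.

```shell
#!/bin/sh
# Print the stall duration (in seconds) from every
# "considering VNC stalled" line of an os-autoinst log.
# Usage: stalled_seconds autoinst-log.txt
stalled_seconds() {
    sed -n 's/.*considering VNC stalled, no update for \([0-9.]*\) seconds.*/\1/p' "$1"
}
```

Lines that do not contain the message produce no output, so the result can be piped straight into `sort` or `wc -l`.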

Related issues 1 (0 open, 1 closed)

Related to openQA Project - action #105882: Test using svirt backend fails with auto_review:"Error connecting to VNC server.*localhost.*Connection refused" (Resolved, mkittler, 2020-10-30)

Actions #1

Updated by okurz about 2 years ago

  • Priority changed from High to Urgent
  • Target version set to Ready

So, ... not do the full screen refresh exactly where it might have helped most, on the IPMI backend?

Actions #2

Updated by okurz about 2 years ago

  • Related to action #105429: openQA's fullstack test fails in `shutdown` module added
Actions #3

Updated by okurz about 2 years ago

I am taking a look at openqaworker2:17. Today three jobs ended up incomplete. One of them is https://openqa.suse.de/tests/8114218 which failed after a runtime of 31m. I wonder how tests behave if I just disable full screen update requests in https://github.com/os-autoinst/os-autoinst/pull/1939/files#diff-ee84fe6541d1e2e971e9e0d4bf0187272df8fa4ac878d033b1c8c600325e9f6dR1036 . I did that on openqaworker2 in /usr/lib/os-autoinst/backend/baseclass.pm and retriggered the job with

openqa-clone-job --within-instance https://openqa.suse.de/8114218 TEST=prj2_host_upgrade_sles15sp3_to_developing_kvm_poo106017_investigation_okurz BUILD=X _GROUP=0

This created job #8114284: sle-15-SP4-Online-x86_64-Build91.2-prj2_host_upgrade_sles15sp3_to_developing_kvm@64bit-ipmi-large-mem -> https://openqa.suse.de/t8114284

Actions #4

Updated by mkittler about 2 years ago

  • Assignee set to mkittler

@okurz commented out the most relevant line¹ in /usr/lib/os-autoinst/backend/baseclass.pm on openqaworker2. The restarted job shows the same problem. I'm now commenting out more code to see whether the 2nd line makes a difference.


¹

    elsif ($n % FULL_UPDATE_REQUEST_FREQUENCY == 0) {
        #$self->request_screen_update({incremental => 0});
    }
Actions #5

Updated by okurz about 2 years ago

  • Related to action #105882: Test using svirt backend fails with auto_review:"Error connecting to VNC server.*localhost.*Connection refused" added
Actions #6

Updated by okurz about 2 years ago

  • Status changed from New to In Progress

The restarted job fails with what looks like the same issue. I wonder what the "last good" is here. Within the history of the job mentioned in the description I can find the last "passed" job https://openqa.nue.suse.com/tests/8030431 which is from 12 days ago, which is a lot.

The complete git log of os-autoinst between the last good and the current state, from git log1 --no-merges d319802b..origin/master:

a7103b16 Fix sporadic failures in openQA's fullstack test
45d6a872 Adapt configure_repositories step in OBS workflow
bc6eac58 Add unit test for requesting full screen updates when checking screen
f5d175d0 Speed up `t/23-baseclass.t` using `Time::Mock::Time`
c49e2ec5 Avoid division by zero when computing stopwatch data
79950eeb Force full screen update in intervals similar to full screen search
f2399bcc Force full screen update shortly before check screen would fail
0225b12b Simplify initialization of variables in `check_asserted_screen`
7608a54f (okurz/fix/svirt) Revert "svirt: Implement do_extract_assets"
531bbaf0 Revert "svirt: Store vmname early for use after test run"
26d946b7 (okurz/fix/tidy_include_rules) Fix include of extension-less perl files in tidy config
62576fe1 Tidy tools/check_coverage according to new os-autoinst rules
8c883703 git subrepo pull (merge) external/os-autoinst-common
772201f0 (kalikiana/svirt_vmname) svirt: Store vmname early for use after test run
ca120328 Run openQA full-stack test as part of os-autoinst CI tests
a2d9c194 Add new variables in backend/generalhw.pm
34c83a3c (okurz/feature/ikvm) Add simple test for backend::ikvm
aa99694f (kalikiana/svirt_do_extract_assets) svirt: Implement do_extract_assets
aa936732 (okurz/feature/amt_tests) Mark backend::amt as deprecated as it seems to be unused
95b8d8d9 Ensure 100% statement coverage for backend::amt
40ba8c80 backend::amt: Remove unsupported call to remove crash file

I made a mistake. The job I triggered was not actually running on openqaworker2 as I forgot to bind it to one host. So again, with openqa-clone-job --within-instance https://openqa.suse.de/8114218 TEST=prj2_host_upgrade_sles15sp3_to_developing_kvm_poo106017_investigation_okurz BUILD=X _GROUP=0 WORKER_CLASS=virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,openqaworker2

Created https://openqa.suse.de/tests/8114316
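
To avoid forgetting the host binding again, the clone command can be assembled by a small helper. This is only a sketch: it assumes, as in the command above, that each worker also advertises its host name as a worker class, and the function name `clone_on_host` is hypothetical. It prints the command instead of running it, so it can be checked before anything is scheduled.

```shell
#!/bin/sh
# Print (do not run) a clone command that pins the job to one worker host by
# appending the host name to the job's existing WORKER_CLASS list.
# Usage: clone_on_host JOB_URL TEST_NAME WORKER_CLASSES HOST
clone_on_host() {
    printf 'openqa-clone-job --within-instance %s TEST=%s BUILD=X _GROUP=0 WORKER_CLASS=%s,%s\n' \
        "$1" "$2" "$3" "$4"
}
```

For example, `clone_on_host https://openqa.suse.de/8114218 my_investigation 64bit-ipmi openqaworker2` prints a command ending in `WORKER_CLASS=64bit-ipmi,openqaworker2`.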

Actions #7

Updated by okurz about 2 years ago

https://openqa.suse.de/tests/8114316 has passed the step where tests previously failed.

@mkittler unless you have better ideas, how about for now we skip requesting full screen updates in the case of consoles that involve localXvnc?

Actions #8

Updated by mkittler about 2 years ago

PR with a workaround: https://github.com/os-autoinst/os-autoinst/pull/1948

It basically behaves as if the lines were commented out, but only for localXvnc.
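
The intended decision can be pictured roughly like this. It is a shell illustration only, not the Perl code from the PR: the function name, the console label `vnc` and the frequency value are all assumptions made up for the example.

```shell
#!/bin/sh
# Hypothetical stand-in for FULL_UPDATE_REQUEST_FREQUENCY; the real value is
# defined in os-autoinst and is not quoted in this ticket.
FULL_UPDATE_REQUEST_FREQUENCY=5

# Succeeds (exit status 0) if check iteration $2 on console type $1 should
# request a full, non-incremental screen update.
wants_full_update() {
    console=$1
    n=$2
    # Workaround: never force full updates for localXvnc consoles.
    [ "$console" = "localXvnc" ] && return 1
    # Otherwise request one every FULL_UPDATE_REQUEST_FREQUENCY iterations.
    [ $((n % FULL_UPDATE_REQUEST_FREQUENCY)) -eq 0 ]
}
```

With the assumed frequency of 5, `wants_full_update vnc 10` succeeds while `wants_full_update localXvnc 10` fails, which mirrors commenting out the request for localXvnc only.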

Actions #9

Updated by mkittler about 2 years ago

  • Description updated (diff)
Actions #10

Updated by mkittler about 2 years ago

Yet another idea to improve the situation: https://github.com/os-autoinst/os-autoinst/pull/1949 (although the other PR has been merged and seems to work in production).

Actions #11

Updated by openqa_review about 2 years ago

  • Due date set to 2022-02-22

Setting due date based on mean cycle time of SUSE QE Tools

Actions #12

Updated by Julie_CAO about 2 years ago

I saw the PR has been merged to GitHub master. Can it be deployed on OSD ASAP? The target date of Feb 22 is too late; our tests will be blocked for several builds.

Actions #13

Updated by mkittler about 2 years ago

  • Status changed from In Progress to Feedback

As mentioned in the chat it has already been deployed. Let me know whether it works.

I've been restarting remaining jobs via failed_since='2022-02-03' host=openqa.suse.de additional_filters="reason like '%Error connecting to VNC server%localhost%Connection refused%'" ./openqa-advanced-retrigger-jobs.

Actions #14

Updated by Julie_CAO about 2 years ago

  • Status changed from Feedback to Resolved

The verification test passed. Your restarting tool did not seem to work, but it does not matter, I've rerun all our failing tests. Thanks @mkittler & @okurz for your quick fix.

Actions #15

Updated by mkittler about 2 years ago

  • Related to deleted (action #105429: openQA's fullstack test fails in `shutdown` module)