action #155170

closed

[openqa-in-openqa] [sporadic] test fails in test_running: parallel_failed size:M

Added by tinita 3 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-02-08
Due date:
2024-02-29
% Done:

0%

Estimated time:

Description

Observation

openQA test in scenario openqa-Tumbleweed-dev-x86_64-openqa_install_multimachine@64bit-4G fails in
test_running.

Reproducible

Fails since (at least) Build :TW.26399 (current job)

Expected result

Last good: :TW.26398 (or more recent)

Suggestions

Further details

Always latest result in this scenario: latest


Related issues 6 (1 open, 5 closed)

Related to openQA Project - action #155173: [openqa-in-openqa] [sporadic] test fails in openqa_worker: os-autoinst-setup-multi-machine timed out size:M (Resolved, mkittler, 2024-02-08, 2024-03-01)

Related to openQA Project - action #138302: Ensure automated openQA tests verify that os-autoinst-setup-multi-machine sets up valid networking size:M (Resolved, dheidler, 2023-07-19, 2024-01-19)

Related to openQA Infrastructure - action #150956: o3 cannot send e-mails via smtp relay size:M (Resolved, okurz, 2023-11-16)

Related to openQA Tests - action #153766: [core][sporadic] Handle wild agetty better in tests/network/setup_multimachine.pm (New, 2024-01-17)

Related to openQA Tests - action #156067: [alert] test fails in setup_multimachine (Resolved, mkittler, 2024-02-26, 2024-03-12)

Related to openQA Project - action #156052: [alert] Scripts CI pipeline failing after logging multiple Job state of job ID 13603796: running, waiting size:S (Resolved, mkittler, 2024-02-26, 2024-03-13)

Actions #1

Updated by jbaier_cz 3 months ago

  • Tags set to openqa-in-openqa
  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #2

Updated by okurz 3 months ago

  • Related to action #155173: [openqa-in-openqa] [sporadic] test fails in openqa_worker: os-autoinst-setup-multi-machine timed out size:M added
Actions #3

Updated by okurz 3 months ago

  • Subject changed from [openqa-in-openqa] [sporadic] test fails in test_running: parallel_failed to [openqa-in-openqa] [sporadic] test fails in test_running: parallel_failed size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by dheidler 2 months ago

[2024-02-08T03:41:07.100527-05:00] [debug] [pid:21569] <<< testapi::type_string(string="(echo qQf4r; bash -eox pipefail /tmp/scriptqQf4r.sh ; echo SCRIPT_FINISHEDqQf4r-\$?-) | tee /dev/ttyS0\n", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2024-02-08T03:41:11.183413-05:00] [debug] [pid:21569] tests/network/setup_multimachine.pm:42 called mm_network::setup_static_mm_network -> lib/mm_network.pm:228 called mm_network::configure_static_dns -> lib/mm_network.pm:130 called testapi::script_output
[2024-02-08T03:41:11.183633-05:00] [debug] [pid:21569] <<< testapi::wait_serial(timeout=90, quiet=undef, record_output=1, regexp="SCRIPT_FINISHEDqQf4r-\\d+-", buffer_size=undef, expect_not_found=0, no_regex=0)
[2024-02-08T03:42:42.304376-05:00] [debug] [pid:21569] >>> testapi::wait_serial: SCRIPT_FINISHEDqQf4r-\d+-: fail
[2024-02-08T03:42:42.306857-05:00] [info] [pid:21569] ::: basetest::runtest: # Test died: script timeout: nmcli -t -f NAME c | grep -v ^lo: | head -n 1 at /usr/lib/os-autoinst/distribution.pm line 295.
        distribution::script_output(Distribution::Opensuse::Tumbleweed=HASH(0x556ed0a5c2d0), "nmcli -t -f NAME c | grep -v ^lo: | head -n 1", "timeout", undef, "quiet", undef, "proceed_on_failure", undef, ...) called at /usr/lib/os-autoinst/testapi.pm line 1100
        testapi::script_output("nmcli -t -f NAME c | grep -v ^lo: | head -n 1") called at opensuse/lib/mm_network.pm line 130
        mm_network::configure_static_dns(HASH(0x556ecda1f2d8), "is_nm", 1) called at opensuse/lib/mm_network.pm line 228
        mm_network::setup_static_mm_network("10.0.2.101/24") called at opensuse/tests/network/setup_multimachine.pm line 42
        setup_multimachine::run(setup_multimachine=HASH(0x556ed0ef6fa0)) called at /usr/lib/os-autoinst/basetest.pm line 352
        eval {...} called at /usr/lib/os-autoinst/basetest.pm line 346
        basetest::runtest(setup_multimachine=HASH(0x556ed0ef6fa0)) called at /usr/lib/os-autoinst/autotest.pm line 415
        eval {...} called at /usr/lib/os-autoinst/autotest.pm line 415
        autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 272
        eval {...} called at /usr/lib/os-autoinst/autotest.pm line 272
        autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 323
        autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x556ecd7f0cc0)) called at /usr/lib/perl5/vendor_perl/5.38.2/Mojo/IOLoop/ReadWriteProcess.pm line 329
        eval {...} called at /usr/lib/perl5/vendor_perl/5.38.2/Mojo/IOLoop/ReadWriteProcess.pm line 329
        Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x556ecd7f0cc0), CODE(0x556ed183d048)) called at /usr/lib/perl5/vendor_perl/5.38.2/Mojo/IOLoop/ReadWriteProcess.pm line 492
        Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x556ecd7f0cc0)) called at /usr/lib/os-autoinst/autotest.pm line 325
        autotest::start_process() called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 94
        OpenQA::Isotovideo::Runner::start_autotest(OpenQA::Isotovideo::Runner=HASH(0x556ecc6f2528)) called at /usr/bin/isotovideo line 192
        eval {...} called at /usr/bin/isotovideo line 181

[2024-02-08T03:42:42.311881-05:00] [debug] [pid:21569] l
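
For readers unfamiliar with the mechanism visible in the log above: the command is wrapped between a random marker (qQf4r here) and a SCRIPT_FINISHED<marker>-<exit code>- line written to the serial port, and wait_serial then waits for that line. The following is a minimal Python sketch of the matching step only, not os-autoinst's actual implementation. parse_script_output is a hypothetical helper, and the sample serial text is made up for illustration.

```python
import re


def parse_script_output(serial_text, marker):
    """Return (output, exit_code), or None if the end marker never showed up.

    Hypothetical helper for illustration. It mirrors the pattern seen in the
    log: "<marker>", then the command output, then
    "SCRIPT_FINISHED<marker>-<exit code>-".
    """
    pattern = re.compile(
        re.escape(marker)
        + r"\s*(?P<out>.*?)SCRIPT_FINISHED"
        + re.escape(marker)
        + r"-(?P<rc>\d+)-",
        re.DOTALL,
    )
    match = pattern.search(serial_text)
    if match is None:
        # This corresponds to the "wait_serial: ...: fail" case in the log:
        # the end marker did not arrive before the timeout.
        return None
    return match.group("out").strip(), int(match.group("rc"))


# Made-up serial content: a clean run and a run that never finished.
print(parse_script_output("qQf4r\nens4\nSCRIPT_FINISHEDqQf4r-0-\n", "qQf4r"))
print(parse_script_output("qQf4r\n", "qQf4r"))
```

Anything else written to the same serial port between the two markers ends up in the captured output, which is worth keeping in mind when reading the later notes in this ticket.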
Actions #5

Updated by okurz 2 months ago

  • Priority changed from Normal to Urgent
Actions #6

Updated by okurz 2 months ago

  • Related to action #138302: Ensure automated openQA tests verify that os-autoinst-setup-multi-machine sets up valid networking size:M added
Actions #7

Updated by okurz 2 months ago

  • Related to action #150956: o3 cannot send e-mails via smtp relay size:M added
Actions #8

Updated by okurz 2 months ago

I did not realize that https://openqa.opensuse.org/group_overview/24?limit_builds=50&limit_builds=100&limit_builds=400 looks so bad, bumping prio to "Urgent". I assume this is related to #138302 and possibly missing notifications due to #150956

Actions #9

Updated by ybonatakis 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to ybonatakis
Actions #10

Updated by openqa_review 2 months ago

  • Due date set to 2024-02-29

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by ybonatakis 2 months ago

  • Status changed from In Progress to Blocked

I prioritized the stability of the tests. Many failed before test_running, in the worker module. Unfortunately, after a day, I bumped into other issues, as I can't clone the job on o3 with some changes. I tried to do so with openqa-clone-custom-git-refspec https://github.com/iob/os-autoinst-distri-openQA/tree/test https://openqa.opensuse.org/tests/3937957 CASEDIR=openqa PRODUCTDIR=openqa TEST=$i TEST_GIT_HASH=2d6e861f8c228c999629ad262569e3c73e724d16, but the cloned tests keep using the TEST_GIT_HASH and TEST_GIT_URL from the initial job (even if I pass them explicitly to openqa-clone-custom-git-refspec).

Actions #12

Updated by ybonatakis 2 months ago

  • Status changed from Blocked to Workable
Actions #13

Updated by ybonatakis 2 months ago

  • Status changed from Workable to In Progress
Actions #14

Updated by okurz 2 months ago

  1. I suggest to find out the current fail ratio, e.g. use https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation to see the percentage of tests failing with this issue
  2. If you find any other issues then please make sure that those are explicitly handled, e.g. create other specific tickets for those, at best already with information about the fail ratio also there
  3. Follow the original suggestions from https://progress.opensuse.org/issues/155170#Suggestions
Actions #15

Updated by livdywan 2 months ago

from https://openqa.opensuse.org/tests/3951088/logfile?filename=test_running-autoinst-log.txt

[2024-02-20T22:27:21.329300-05:00] [info] [pid:28725] ::: basetest::runtest: # Test died: command 'nmcli connection modify 'Welcome to openSUSE Tumbleweed 20240211 - Kernel 6.7.4-1-default (ttyS0).

  ens4: 10.0.2.101 fe80::5054:ff:fe12:2


  susetest login: ens4' ipv4.dns '10.0.2.3'' failed at /usr/lib/os-autoinst/testapi.pm line 926.
    testapi::assert_script_run("nmcli connection modify 'Welcome to openSUSE Tumbleweed 20240"...) called at opensuse/lib/mm_network.pm line 132
    mm_network::configure_static_dns(HASH(0x557a8a64d888), "is_nm", 1) called at opensuse/lib/mm_network.pm line 228

Is this the primary issue? Or should it be split off into a specific test issue?

Actions #16

Updated by tinita 2 months ago

I found a susetest login: in the latest failures:
https://openqa.opensuse.org/tests/3951088/logfile?filename=test_running-autoinst-log.txt
https://openqa.opensuse.org/tests/3950495/logfile?filename=test_running-autoinst-log.txt
https://openqa.opensuse.org/tests/3949776/logfile?filename=test_running-autoinst-log.txt

[2024-02-20T22:27:21.329300-05:00] [info] [pid:28725] ::: basetest::runtest: # Test died: command 'nmcli connection modify 'Welcome to openSUSE Tumbleweed 20240211 - Kernel 6.7.4-1-default (ttyS0).

  ens4: 10.0.2.101 fe80::5054:ff:fe12:2


  susetest login: ens4' ipv4.dns '10.0.2.3'' failed at /usr/lib/os-autoinst/testapi.pm line 926.
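
To illustrate what the log excerpt above suggests: the connection name returned by the nmcli pipeline appears to have an agetty login banner mixed in, so the whole banner was passed to nmcli connection modify as the name. The following is a rough Python sketch of one way to sanitize such output, under the assumption that the real value is the last non-empty line and that any "login: " prompt prefix on it is noise. last_clean_line is a hypothetical helper and is not taken from the test code or from any fix discussed in this ticket.

```python
import re

# Assumption: a getty prompt looks like "<hostname> login: " at the start of
# the line that also carries the real command output.
LOGIN_PROMPT = re.compile(r"^.*\blogin: ")


def last_clean_line(raw_output):
    """Return the last non-empty line with a leading login prompt removed."""
    lines = [line.strip() for line in raw_output.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no output captured")
    return LOGIN_PROMPT.sub("", lines[-1])


# Output as quoted in the failure message above.
noisy = (
    "Welcome to openSUSE Tumbleweed 20240211 - Kernel 6.7.4-1-default (ttyS0).\n"
    "\n"
    "  ens4: 10.0.2.101 fe80::5054:ff:fe12:2\n"
    "\n"
    "\n"
    "  susetest login: ens4"
)
print(last_clean_line(noisy))
```

Filtering like this only hides the symptom; avoiding a console that agetty also writes to addresses the cause.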

In the following failure https://openqa.opensuse.org/tests/3949297#downloads there is no inner autoinst-log:
https://openqa.opensuse.org/tests/3949297/logfile?filename=autoinst-log.txt
It seems something is preventing the post_fail_hook from running.

Actions #17

Updated by tinita 2 months ago

  • Related to action #153766: [core][sporadic] Handle wild agetty better in tests/network/setup_multimachine.pm added
Actions #18

Updated by tinita 2 months ago

I think #153766 is related and might be blocking this?

Actions #19

Updated by tinita 2 months ago

But it would be good to find out why the post_fail_hook failed in https://openqa.opensuse.org/tests/3949297#downloads
Maybe that can be improved.

Actions #20

Updated by ybonatakis 2 months ago

  • Tags changed from openqa-in-openqa, reactive work to openqa-in-openqa

okurz wrote in #note-14:

  1. I suggest to find out the current fail ratio, e.g. use https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation to see the percentage of tests failing with this issue

Results of 100 jobs: https://openqa.opensuse.org/tests/overview?distri=openqa&build=poo32242_investigation&version=Tumbleweed

  2. If you find any other issues then please make sure that those are explicitly handled, e.g. create other specific tickets for those, at best already with information about the fail ratio also there
  3. Follow the original suggestions from https://progress.opensuse.org/issues/155170#Suggestions
Actions #21

Updated by ybonatakis 2 months ago

12/100 failures

I created https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18708 with some small improvements IMO
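
As a side note on reading a number like 12/100: with only 100 runs the true fail ratio is fairly uncertain. The following is a minimal Python sketch that puts a 95% Wilson score interval around the observed ratio, assuming the runs are independent. wilson_interval is a name chosen here for illustration and is not part of any openQA tooling.

```python
import math


def wilson_interval(failures, total, z=1.96):
    """Return (low, high) Wilson score bounds for the true fail ratio.

    z=1.96 corresponds to a 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = failures / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


low, high = wilson_interval(12, 100)
print(f"12/100 failures: plausible fail ratio between {low:.1%} and {high:.1%}")
```

For 12 failures in 100 runs this gives roughly 7% to 20%, so a later batch needs to land well below that range before a change can be called an improvement.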

Actions #22

Updated by okurz 2 months ago

  • Tags changed from openqa-in-openqa to openqa-in-openqa, reactive work
Actions #23

Updated by ybonatakis 2 months ago

  • Tags changed from openqa-in-openqa, reactive work to openqa-in-openqa
  • Status changed from In Progress to Feedback

I think I came up with something that works: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18713
This doesn't refactor tests/network/setup_multimachine.pm, as the results look good for now. That may be something to follow up on.

Actions #24

Updated by ybonatakis 2 months ago

  • Status changed from Feedback to In Progress

ybonatakis wrote in #note-23:

I think I came up with something that works: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18713

Still, 1 out of 100 failed with the same error. Back to In Progress.

This doesn't refactor tests/network/setup_multimachine.pm, as the results look good for now. That may be something to follow up on.
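
A quick sanity check on whether 1 failure in 100 runs could just be luck: the following Python sketch computes the probability of seeing at most 1 failure in 100 independent runs if the fail ratio were still the 12% measured earlier in this ticket. Independence of runs and the 12% baseline are assumptions, and the earlier 12 failures may not all have been this same error. prob_at_most is a hypothetical helper.

```python
from math import comb


def prob_at_most(k, n, p):
    """Binomial probability of at most k failures in n runs with fail ratio p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


chance = prob_at_most(1, 100, 0.12)
print(f"P(at most 1 failure in 100 runs at 12% fail ratio) = {chance:.2e}")
```

A very small probability here means the drop is unlikely to be chance, although a single remaining failure still shows the problem is not fully gone.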

Actions #25

Updated by ybonatakis 2 months ago

  • Status changed from In Progress to Feedback

A move to the serial terminal seems more stable: https://openqa.opensuse.org/tests/overview?distri=opensuse&build=b10n1k%2Fos-autoinst-distri-opensuse%2318713&version=Tumbleweed
The change is in the ping test in os-autoinst-distri-opensuse.
Once the PR is merged, test_running in os-autoinst-distri-openQA should also look OK.

Actions #26

Updated by livdywan 2 months ago

  • Priority changed from Urgent to High

So my understanding is some steps were lost here. The above PR still needs to be reviewed, however the fail ratio is 10%. Once that change is deployed it is expected to be resolved.
I assume the ticket can be High.

Actions #27

Updated by okurz 2 months ago

  • Priority changed from High to Urgent
Actions #28

Updated by okurz 2 months ago

  • Status changed from Feedback to In Progress
Actions #29

Updated by livdywan 2 months ago

Context: Bumped to urgent because this is causing multiple alert emails a day (about 2)

Actions #30

Updated by ybonatakis 2 months ago

  • Status changed from In Progress to Resolved

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18713 was merged. The issue is expected to be resolved. Feel free to reopen if there are any further issues.

Actions #31

Updated by mkittler 2 months ago

  • Related to action #156067: [alert] test fails in setup_multimachine added
Actions #32

Updated by mkittler 2 months ago

  • Status changed from Resolved to Workable

I think this caused a regression, see #156067#note-3.

Actions #33

Updated by mkittler 2 months ago

  • Status changed from Workable to Resolved

I guess I can handle it as part of the newly created ticket.

Actions #34

Updated by okurz about 2 months ago

  • Status changed from Resolved to Workable

I still see significant issues in "test_running", e.g. see https://openqa.opensuse.org/tests/3966834#step/test_running/6

Actions #35

Updated by okurz about 2 months ago

  • Related to action #156052: [alert] Scripts CI pipeline failing after logging multiple Job state of job ID 13603796: running, waiting size:S added
Actions #36

Updated by ybonatakis about 2 months ago

The issue is different now.
The logs now show the following:

[2024-02-27T10:10:57.987618-05:00] [info] [pid:28416] ::: basetest::runtest: # Test died: command 'until nmcli networking connectivity check | tee /dev/stderr | grep full; do sleep 10; done' timed out at /usr/lib/os-autoinst/testapi.pm line 926.
    testapi::assert_script_run("until nmcli networking connectivity check | tee /dev/stderr |"...) called at opensuse/lib/mm_network.pm line 240
    mm_network::restart_networking("is_nm", 1) called at opensuse/lib/mm_network.pm line 229
    mm_network::setup_static_mm_network("10.0.2.101/24") called at opensuse/tests/network/setup_multimachine.pm line 42
    setup_multimachine::run(setup_multimachine=HASH(0x557efeee3b00)) called at /usr/lib/os-autoinst/basetest.pm line 352
    eval {...} called at /usr/lib/os-autoinst/basetest.pm line 346
    basetest::runtest(setup_multimachine=HASH(0x557efeee3b00)) called at /usr/lib/os-autoinst/autotest.pm line 415
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 415
    autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 272
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 272
    autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 323
    autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x557eff7f87c0)) called at /usr/lib/perl5/vendor_perl/5.38.2/Mojo/IOLoop/ReadWriteProcess.pm line 329
    eval {...} called at /usr/lib/perl5/vendor_perl/5.38.2/Mojo/IOLoop/ReadWriteProcess.pm line 329
    Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x557eff7f87c0), CODE(0x557eff147048)) called at /usr/lib/perl5/vendor_perl/5.38.2/Mojo/IOLoop/ReadWriteProcess.pm line 492
    Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x557eff7f87c0)) called at /usr/lib/os-autoinst/autotest.pm line 325
    autotest::start_process() called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 94
    OpenQA::Isotovideo::Runner::start_autotest(OpenQA::Isotovideo::Runner=HASH(0x557efa6e1b50)) called at /usr/bin/isotovideo line 192
    eval {...} called at /usr/bin/isotovideo line 181
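
The shell loop in the log above retries every 10 seconds and, as far as the log shows, has no bound of its own, so it only ends when the surrounding assert_script_run times out. The following Python sketch shows the same polling pattern with an explicit deadline, which makes it easier to report how many attempts were made. wait_until and its parameters are hypothetical names, and the connectivity check is stubbed out rather than calling nmcli.

```python
import time


def wait_until(check, timeout, interval, sleep=time.sleep, clock=time.monotonic):
    """Call check() every `interval` seconds until it returns True.

    Returns the number of attempts made. Raises TimeoutError once another
    sleep would go past the deadline, so the caller sees a clear reason.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        if check():
            return attempts
        if clock() + interval > deadline:
            raise TimeoutError(f"condition not met after {attempts} attempts")
        sleep(interval)


# Stub standing in for "nmcli networking connectivity check": the state
# becomes "full" on the third poll. The states are made up for illustration.
states = iter(["none", "limited", "full"])
attempts = wait_until(lambda: next(states) == "full", timeout=90, interval=0)
print(f"connectivity reached after {attempts} attempts")
```

The sleep and clock parameters are injectable so the loop can be exercised without real waiting.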
Actions #37

Updated by ybonatakis about 2 months ago

  • Status changed from Workable to Resolved
Actions #38

Updated by livdywan about 2 months ago

We had a brief reflection of this ticket in the retro. Significant conversations ended up in Slack or in other tickets, namely #156052 and #156067 rather than here. This meant this ticket was effectively several tickets which would have diluted what different people thought to be an Urgent issue or one small part of a bigger one.

Superficially it looks like the ticket has many gaps. In practice we all agreed that the team collaborated well on getting to the bottom of the various problems that were found.
