action #59273 (closed)
module result missing for incompleting job
Description
Observation
The openQA test in scenario sle-15-SP1-Installer-DVD-QR-x86_64-btrfs@64bit-ipmi incompletes in
installation_overview without a module result. As a consequence label carry over does not work because no failed "module" can be identified.
Reproducible
Fails since (at least) Build 7.1 (current job)
https://openqa.suse.de/tests/3545493#next_previous shows the history: first some "passed" jobs, then jobs that are "incomplete" but still report the module "installation_overview", and after that jobs that are just incomplete without any failed module reported. A reported failed module is a prerequisite for label carry over to work.
Expected result
Last good: https://openqa.suse.de/tests/3539985#step/installation_overview/1
Problem
Last good:
** os-autoinst: 4.5.1571474599.7d873cb5
First bad:
** os-autoinst: 4.6.1572372410.447dab86
os-autoinst likely candidates for regression:
$ git log1 --no-merges 7d873cb5..447dab86
447dab86 Allow consoles to persist over reset (#1232)
327be131 myjsonrpc: Go back to incremental parsing (#1248)
73261e5b Use python3 by default (#1247)
3d51864a (perlpunk/num-queues-uninitialized) Avoid warning in comparison; num_queues might be undef
87f895f3 Improve here tag handling in script_output()
7d482890 (perlpunk/new-status) Increase version numbers
87276ac7 Add new status file that worker can read from
c52303a0 Consider tests with `tools/tidy --only-changed`
ab0554c2 (okurz/feature/container) spec: Fix missing, additional runtime requirements
a45e17fa Allow tidy to run only over local changes
37d31fc7 Improve 'check_ssh_serial'
55db2e0d Make start_serial_grab blocking
5266e820 Fix svirt backend's 100 % CPU usage
6ba53926 (okurz/fix/codecov) codecov: Adjust to current coverage target
Further details
Always latest result in this scenario: latest
Related to #59267 about the problem of the incomplete itself.
Please handle this with urgency: as long as label carry over is not working, human reviewers need to re-attach the label for every single job.
Updated by okurz about 5 years ago
- Related to action #59267: test incompletes in installation_overview - when trying to switch to root-ssh? added
Updated by coolo about 5 years ago
The job doesn't fail - it crashes. In which module it crashed is generally speaking not a good hint. We can't label installation_overview as failed module IMO. We shouldn't incomplete, but fail in this scenario.
Updated by coolo about 5 years ago
The backend is most likely not supposed to crash here but to report a failure to select_console, and as such die in testapi:
[2019-11-10T09:33:56.307 CET] [debug] <<< testapi::check_screen(mustmatch='ssh-open', timeout=1)
[2019-11-10T09:33:56.794 CET] [debug] no match: 2.5s, best candidate: installation_overview-ssh-open-20170911 (0.00)
[2019-11-10T09:33:57.801 CET] [debug] no match: 1.5s, best candidate: ssh-open_firewall_disabled-20190221 (0.29)
[2019-11-10T09:33:59.344 CET] [debug] >>> testapi::_handle_found_needle: found ssh-open-20190207, similarity 1.00 @ 401/542
[2019-11-10T09:33:59.344 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/installation/installation_overview.pm:83 called testapi::select_console
[2019-11-10T09:33:59.344 CET] [debug] <<< testapi::select_console(testapi_console='install-shell')
/usr/lib/os-autoinst/consoles/vnc_base.pm:58:{
'hostname' => 'localhost',
'port' => 58291,
'ikvm' => 0
}
[2019-11-10T09:34:00.445 CET] [debug] Connected to Xvnc - PID 30160
icewm PID is 30182
xterm PID is 30184
[2019-11-10T09:34:01.526 CET] [debug] led state 0 0 0 -261
[2019-11-10T09:34:01.537 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/installation/installation_overview.pm:83 called testapi::select_console
[2019-11-10T09:34:01.537 CET] [debug] <<< testapi::select_console(testapi_console='root-ssh')
/usr/lib/os-autoinst/consoles/vnc_base.pm:58:{
'ikvm' => 0,
'hostname' => 'localhost',
'port' => 49909
}
[2019-11-10T09:34:02.658 CET] [debug] Connected to Xvnc - PID 30187
icewm PID is 30207
xterm PID is 30209
[2019-11-10T09:34:03.690 CET] [debug] Could not connect to root@scooter-1.qa.suse.de, Retrying after some seconds...
[2019-11-10T09:34:13.691 CET] [debug] Could not connect to root@scooter-1.qa.suse.de, Retrying after some seconds...
[2019-11-10T09:34:23.692 CET] [debug] Could not connect to root@scooter-1.qa.suse.de, Retrying after some seconds...
[2019-11-10T09:34:33.693 CET] [debug] Could not connect to root@scooter-1.qa.suse.de, Retrying after some seconds...
[2019-11-10T09:34:43.694 CET] [debug] Could not connect to root@scooter-1.qa.suse.de, Retrying after some seconds...
[2019-11-10T09:34:53.699 CET] [debug] Backend process died, backend errors are reported below in the following lines:
Error connecting to <root@scooter-1.qa.suse.de>: Connection refused
[2019-11-10T09:34:53.748 CET] [debug] IPMI: Chassis Power Control: Down/Off
[2019-11-10T09:34:53.748 CET] [debug] flushing frames
[2019-11-10T09:34:53.822 CET] [debug] sending magic and exit
[2019-11-10T09:34:53.822 CET] [debug] received magic close
[2019-11-10T09:34:53.823 CET] [debug] THERE IS NOTHING TO READ 15 4 3
[2019-11-10T09:34:53.823 CET] [debug] killing command server 28319 because test execution ended
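The log above still shows which test module issued the last test API call before the backend died. As a rough illustration only, and not how openQA or os-autoinst determine the failed module, that information could be recovered from the log text like this. The function name `last_module_before_crash` is hypothetical, and the two patterns are assumptions derived solely from the lines quoted in this comment.

```python
import re

# Matches lines such as
# ".../tests/installation/installation_overview.pm:83 called testapi::select_console"
# (pattern derived only from the log excerpt in this ticket)
CALL_RE = re.compile(
    r"/tests/(?P<path>\S+?)\.pm:(?P<line>\d+) called (?P<func>testapi::\w+)"
)
# Marker text taken from the log excerpt in this ticket
DIED_MARKER = "Backend process died"


def last_module_before_crash(log_text):
    """Return (module, line, function) of the last test API call logged
    before the backend died, or None if there is no crash marker or no
    call was logged before it."""
    last_call = None
    for line in log_text.splitlines():
        match = CALL_RE.search(line)
        if match:
            # Keep only the file name without directories as module name
            module = match.group("path").rsplit("/", 1)[-1]
            last_call = (module, int(match.group("line")), match.group("func"))
        elif DIED_MARKER in line:
            return last_call
    return None
```

Applied to the excerpt above this would point at installation_overview, line 83, `testapi::select_console`. Whether that is a good hint for labelling is exactly what is being debated in this ticket.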
Updated by okurz about 5 years ago
I agree that a failed select_console should not fail the job. This is what #59267 is about. This ticket is about the regression that an incomplete triggered in a specific module no longer reports that module, which it did before.
Updated by coolo about 5 years ago
- Category changed from Regressions/Crashes to Feature requests
- Priority changed from Urgent to Normal
I don't believe that's the case.
Updated by okurz over 4 years ago
- Related to action #45062: Better visualization of incompletes - show module in which incomplete happens added
Updated by okurz over 4 years ago
#45062 describes that this did work in the past.
Updated by mkittler over 4 years ago
Whether the test module is marked as failed (and e.g. shown as failed module in the various tables) depends on os-autoinst writing the test module result accordingly. Considering the inconsistent error handling in os-autoinst it is likely not very consistent about marking the test module as failed. So it might work in some cases like in #45062 but not generally.
Note that openQA and the job result (e.g. incomplete) do not have much influence here. Next & Previous and other lists will even show failed modules for passed jobs (if you manage to get a passed job with failed modules into the system).
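To make the dependency on the written module results more concrete, here is a minimal sketch that lists the modules recorded as failed in a results directory. It assumes one file per module named `result-<module>.json` containing a JSON object whose `result` key is `fail` for a failed module. That layout, like the function name `failed_modules`, is an assumption for illustration and not taken from this ticket.

```python
import json
from pathlib import Path


def failed_modules(testresults_dir):
    """List modules whose result file records a failure.

    Assumed layout (not confirmed by this ticket): one file per module,
    named result-<module>.json, holding a JSON object with a "result" key.
    """
    failed = []
    for path in sorted(Path(testresults_dir).glob("result-*.json")):
        try:
            data = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            # Unreadable or truncated file, e.g. after a crash:
            # treat it the same as a missing module result
            continue
        if isinstance(data, dict) and data.get("result") == "fail":
            failed.append(path.stem[len("result-"):])
    return failed
```

If the backend dies before such a file is written for the running module, the list stays empty, which matches the situation described in this ticket.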
Updated by okurz over 4 years ago
- Priority changed from Normal to Low
- Target version set to future
Meanwhile we have much better reporting for incompletes with the "reason" that is accessible from the DB, API as well as UI, so this change for incomplete jobs is less important but still valid.
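As a small sketch of how the "reason" can stand in for a missing failed module, the following helper summarizes a job from its JSON representation. The field names `result`, `reason` and `modules` (with `name` and `result` per module, `failed` marking a failed module) are assumptions about the job JSON and the helper `summarize_job` is hypothetical.

```python
def summarize_job(job):
    """Build a one-line summary from a job dictionary.

    Assumed fields (not confirmed by this ticket): "result", "reason" and
    "modules", the latter a list of dicts with "name" and "result".
    """
    result = job.get("result") or "unknown"
    reason = job.get("reason")
    failed = [
        module.get("name")
        for module in job.get("modules") or []
        if module.get("result") == "failed"
    ]
    parts = [result]
    if reason:
        # The reason is available even when no module was marked as failed
        parts.append(f"reason: {reason}")
    parts.append("failed modules: " + (", ".join(failed) if failed else "none"))
    return "; ".join(parts)
```

For an incomplete job without any failed module the summary still carries the reason, which is why the missing module matters less than it did when this ticket was opened.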
Updated by okurz about 4 years ago
- Status changed from New to Resolved
- Assignee set to okurz
https://openqa.suse.de/tests/5148360#next_previous shows that we do have support for failed modules in incomplete jobs. It could be that this works only seldom or in limited cases, e.g. when the test module fails first and then an incomplete happens in the same module. However, I think this is good enough.