action #13276
closed[tools]'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected"
0%
Description
observation¶
There are problems in the testing infrastructure with stalls of the monitoring process for some seconds.
Initially referenced:
https://openqa.suse.de/tests/523174/file/autoinst-log.txt
but there have been corresponding tickets in before
reproducible¶
sometimes, seem to appear especially when SUT produce a lot of I/O, e.g. install_and_reboot
.
expected result¶
No timeouts in the test, no screens missed.
problem¶
os-autoinst already detects this
09:39:27.4559 34353 WARNING: There is some problem with your environment, we detected a stall for 12.8205699920654 seconds
and as a consequence already aborts the test
09:39:30.5158 34353 <<< bmwqemu::mydie(cause_of_death='assert_screen fails, but we detected a timeout in the process, so we abort')
DIE mydie at /usr/lib/os-autoinst/backend/baseclass.pm line 853.
but this appears often enough that we should care about to fix the underlying issue and also improve the feedback on detection.
Updated by oholecek over 8 years ago
Isn't this by design when there is CPU overcommit and process stall?
Updated by okurz over 8 years ago
Well, I don't complain about the test failing. I think the idea to prevent stuff like this is to inverse priorities to ensure the SUT can not stall the monitoring. I guess coolo did the groundwork with the messaging restructuring but needs some followup work. It could also mean the machine is actually overloaded and shouldn't be, i.e. reducing workers or so.
Updated by coolo over 8 years ago
who says the SUT is stalling the monitoring? there are 15 more workers on this box. If enough of them run with -cpu 2, there is little you can do in os-autoinst itself.
Updated by okurz about 8 years ago
- Assignee set to szarate
- Priority changed from Normal to High
Updated by okurz about 8 years ago
- Blocked by coordination #14972: [tools][epic] Improvements on backend to improve better handling of stalls added
Updated by okurz about 8 years ago
Updated by okurz about 8 years ago
- Has duplicate action #13644: test dies with "timeout in the process" added
Updated by asmorodskyi about 8 years ago
https://openqa.suse.de/tests/694144 - another example
Updated by asmorodskyi almost 8 years ago
another example https://openqa.suse.de/tests/716498
Updated by okurz almost 8 years ago
- Project changed from openQA Project (public) to openQA Tests (public)
- Description updated (diff)
- Category changed from 168 to Infrastructure
https://github.com/os-autoinst/os-autoinst/pull/713 prepared which would only abort on demand so not block tests where test reviewers do not care about a stall.
@szarate I updated the ticket description
I see the following tasks:
- short-term: Use "record_testresult" from
basetest.pm
to show a more immediate feedback to the user so that he does not need to take a look into the logfile - mid-term: Provide auto-commenting and labeling of the job in openQA
- long-term: Fix the underlying issue.
Updated by okurz almost 8 years ago
- Blocked by deleted (coordination #14972: [tools][epic] Improvements on backend to improve better handling of stalls)
Updated by okurz almost 8 years ago
- Related to coordination #14972: [tools][epic] Improvements on backend to improve better handling of stalls added
Updated by okurz almost 8 years ago
- Target version changed from Milestone 5 to Milestone 6
Updated by nicksinger almost 8 years ago
Same error appears on this test: https://openqa.suse.de/tests/795001
Does not seem like it is linked to the workload inside the VM but more to the general workload of the worker maybe?
Updated by okurz@suse.de almost 8 years ago
On Friday, 3 March 2017 14:21:19 CET you wrote:
Same error appears on this test: https://openqa.suse.de/tests/795001
Does not seem like it is linked to the workload inside the VM but more to
the general workload of the worker maybe?
exactly. The SUT is fine but the outside process running on the worker that is
checking the screens is the one that is slow here.
Updated by okurz almost 8 years ago
Same error appears on this test: https://openqa.suse.de/tests/795001
and the corresponding snippet from the logfile is:
12:05:31.8730 26644 WARNING: There is some problem with your environment, we detected a stall for 11.2174379825592 seconds
12:05:32.4946 26644 MATCH(GNOME-smartneedle-20150803:0.00)
…
12:05:38.9836 26644 MATCH(GNOME-piglit-unlock-20170130:0.00)
12:05:38.9862 26644 WARNING: check_asserted_screen took 7.10 seconds - make your needles more specific
12:05:38.9874 26644 DEBUG_IO:
NAME TIME CUMULATIVE PERCENTAGE
Searching for needles 0.000 0.000 0.000%
**++ search__: get image 0.000 0.000 0.001%
**++ search__: areas exclusion 0.000 0.000 0.000%
**++ tinycv::search_needle 5x746x166 0.177 0.177 2.497%
**++ tinycv::search_needle 59x18x59 0.131 0.309 1.850%
**++ tinycv::search_needle 60x118x58 0.128 0.436 1.797%
**++ tinycv::search_needle 846x42x141 0.173 0.610 2.442%
**++ mean_square_error 0.000 0.610 0.005%
** search_: GNOME-smartneedle-20150803 0.000 0.610 0.000%
**++ search__: get image 0.000 0.610 0.001%
**++ search__: areas exclusion 0.000 0.610 0.000%
**++ tinycv::search_needle 0x4x1024 0.125 0.735 1.764%
**++ tinycv::search_needle 0x711x216 0.120 0.856 1.696%
**++ mean_square_error 0.000 0.856 0.004%
** search_: GNOME-20141210 0.000 0.856 0.000%
**++ search__: get image 0.000 0.856 0.000%
**++ search__: areas exclusion 0.000 0.856 0.000%
**++ tinycv::search_needle 0x4x1024 0.123 0.979 1.728%
**++ mean_square_error 0.000 0.979 0.004%
** search_: GNOME-20140701 0.000 0.979 0.000%
**++ search__: get image 0.000 0.979 0.000%
**++ search__: areas exclusion 0.000 0.979 0.000%
**++ tinycv::search_needle 0x4x1024 0.120 1.099 1.689%
**++ mean_square_error 0.000 1.099 0.004%
** search_: GNOME-20160523 0.000 1.099 0.000%
**++ search__: get image 0.000 1.099 0.001%
**++ search__: areas exclusion 0.000 1.099 0.000%
**++ tinycv::search_needle 5x746x166 0.164 1.263 2.308%
**++ tinycv::search_needle 59x18x59 0.124 1.388 1.754%
**++ tinycv::search_needle 73x125x31 0.129 1.516 1.811%
**++ tinycv::search_needle 846x42x141 0.159 1.675 2.235%
**++ mean_square_error 0.000 1.675 0.006%
** search_: GNOME-smartneedle-20160211 0.000 1.676 0.000%
**++ search__: get image 0.000 1.676 0.001%
**++ search__: areas exclusion 0.000 1.676 0.001%
**++ tinycv::search_needle 846x42x141 0.164 1.840 2.314%
**++ tinycv::search_needle 0x0x153 0.162 2.002 2.281%
**++ mean_square_error 0.000 2.002 0.004%
** search_: GNOME-workaround-BSC978076-20160502 0.000 2.002 0.000%
**++ search__: get image 0.000 2.002 0.000%
**++ search__: areas exclusion 0.000 2.002 0.000%
**++ tinycv::search_needle 401x9x605 0.124 2.127 1.751%
**++ mean_square_error 0.000 2.127 0.005%
** search_: boot_to_desktop-generic-desktop-20160519 0.000 2.127 0.000%
**++ search__: get image 0.000 2.127 0.001%
**++ search__: areas exclusion 0.000 2.127 0.000%
**++ tinycv::search_needle 0x745x48 0.130 2.257 1.832%
**++ tinycv::search_needle 184x746x68 0.164 2.421 2.307%
**++ mean_square_error 0.000 2.421 0.004%
** search_: first_boot-generic-desktop-20160503 0.000 2.421 0.000%
**++ search__: get image 0.000 2.421 0.001%
**++ search__: areas exclusion 0.000 2.421 0.000%
**++ tinycv::search_needle 846x42x141 0.159 2.580 2.234%
**++ tinycv::search_needle 0x0x153 0.156 2.736 2.198%
**++ mean_square_error 0.000 2.736 0.004%
** search_: GNOME-workaround-BSC978076-20160504 0.000 2.736 0.000%
**++ search__: get image 0.000 2.736 0.000%
**++ search__: areas exclusion 0.000 2.736 0.000%
**++ tinycv::search_needle 846x42x141 0.157 2.893 2.212%
**++ tinycv::search_needle 19x32x153 0.157 3.051 2.217%
**++ tinycv::search_needle 2x721x139 0.139 3.189 1.952%
**++ mean_square_error 0.000 3.190 0.004%
** search_: GNOME-workaround-BSC978076-20160505 0.000 3.190 0.000%
**++ search__: get image 0.000 3.190 0.000%
**++ search__: areas exclusion 0.000 3.190 0.000%
**++ tinycv::search_needle 846x42x141 0.162 3.351 2.280%
**++ tinycv::search_needle 100x26x153 0.169 3.520 2.378%
**++ tinycv::search_needle 2x721x139 0.145 3.665 2.040%
**++ mean_square_error 0.001 3.666 0.007%
** search_: GNOME-workaround-BSC978076-20160506 0.000 3.666 0.001%
**++ search__: get image 0.000 3.666 0.001%
**++ search__: areas exclusion 0.000 3.666 0.000%
**++ tinycv::search_needle 846x42x141 0.158 3.824 2.232%
**++ tinycv::search_needle 88x32x153 0.154 3.978 2.170%
**++ tinycv::search_needle 2x721x139 0.140 4.119 1.976%
**++ mean_square_error 0.000 4.119 0.004%
** search_: GNOME-workaround-BSC978076-20160508 0.000 4.119 0.000%
**++ search__: get image 0.000 4.119 0.000%
**++ search__: areas exclusion 0.000 4.119 0.000%
**++ tinycv::search_needle 5x746x166 0.169 4.288 2.384%
**++ tinycv::search_needle 43x15x59 0.143 4.431 2.016%
**++ tinycv::search_needle 57x97x31 0.136 4.567 1.909%
**++ tinycv::search_needle 846x42x141 0.160 4.727 2.259%
**++ mean_square_error 0.000 4.728 0.004%
** search_: GNOME-smartneedle-20160510 0.000 4.728 0.000%
**++ search__: get image 0.000 4.728 0.001%
**++ search__: areas exclusion 0.000 4.728 0.000%
**++ tinycv::search_needle 846x42x141 0.161 4.889 2.274%
**++ tinycv::search_needle 0x10x112 0.180 5.069 2.539%
**++ tinycv::search_needle 2x721x139 0.142 5.211 1.997%
**++ mean_square_error 0.000 5.211 0.004%
** search_: GNOME-workaround-BSC979072-20160509 0.000 5.211 0.000%
**++ search__: get image 0.000 5.212 0.000%
**++ search__: areas exclusion 0.000 5.212 0.000%
**++ tinycv::search_needle 0x4x1024 0.122 5.334 1.718%
**++ mean_square_error 0.000 5.334 0.004%
** search_: GNOME-rollback-20160512 0.000 5.334 0.000%
**++ search__: get image 0.000 5.334 0.000%
**++ search__: areas exclusion 0.000 5.334 0.000%
**++ tinycv::search_needle 846x42x141 0.162 5.496 2.277%
**++ tinycv::search_needle 0x10x112 0.174 5.670 2.451%
**++ tinycv::search_needle 2x721x139 0.139 5.809 1.962%
**++ mean_square_error 0.000 5.809 0.004%
** search_: GNOME-workaround-BSC979072-20160520 0.000 5.809 0.000%
**++ search__: get image 0.000 5.809 0.000%
**++ search__: areas exclusion 0.000 5.809 0.000%
**++ tinycv::search_needle 5x746x166 0.170 5.979 2.392%
**++ tinycv::search_needle 37x18x59 0.143 6.122 2.010%
**++ tinycv::search_needle 51x115x31 0.130 6.252 1.831%
**++ tinycv::search_needle 846x42x141 0.158 6.409 2.223%
**++ mean_square_error 0.000 6.410 0.004%
** search_: GNOME-smartneedle-20160523 0.000 6.410 0.000%
**++ search__: get image 0.000 6.410 0.001%
**++ search__: areas exclusion 0.000 6.410 0.000%
**++ tinycv::search_needle 848x31x141 0.196 6.605 2.755%
**++ tinycv::search_needle 4x743x149 0.153 6.758 2.148%
**++ tinycv::search_needle 180x16x170 0.171 6.929 2.409%
**++ tinycv::search_needle 32x168x174 0.170 7.099 2.390%
**++ mean_square_error 0.000 7.099 0.005%
** search_: GNOME-piglit-unlock-20170130 0.000 7.099 0.000%
_stop_ 0.001 7.100 0.007%
12:05:38.9889 26644 <<< bmwqemu::mydie(cause_of_death='assert_screen fails, but we detected a timeout in the process, so we abort')
DIE mydie at /usr/lib/os-autoinst/backend/baseclass.pm line 857.
@szarate, can you help us to read the debug output? What can we see from there?
Updated by RBrownSUSE almost 8 years ago
- Subject changed from 'assert_screen fails, but we detected a timeout in the process, so we abort' to [tools]'assert_screen fails, but we detected a timeout in the process, so we abort'
Updated by asmorodskyi over 7 years ago
https://openqa.suse.de/tests/857360 example of stall detected
Updated by nicksinger over 7 years ago
https://openqa.suse.de/tests/861492 stall on QA-Power8-5-kvm:8
Updated by mgriessmeier over 7 years ago
also happens here:
https://openqa.suse.de/tests/871547
but is it really the same issue?
Updated by okurz over 7 years ago
- Priority changed from High to Normal
- Target version deleted (
Milestone 6)
yes, they are. We have the "stall detected" field in the job since https://github.com/os-autoinst/os-autoinst/pull/735 which makes the investigation way cheaper. So reducing prio. It would be nice though if we could have a reference to this ticket here automatically in these cases like a hook from the backend "record_info" into the test distribution to look for known issues.
Also, as currently I guess there are no further plans to really solve this I am removing the target version.
Updated by okurz over 7 years ago
- Priority changed from Normal to Low
after discussing with szarate and rbrown I guess "low" would better resemble that currently there is no plan to do anything about it as we have some mitigation and the workaround is good enough for now.
Updated by okurz over 7 years ago
- Subject changed from [tools]'assert_screen fails, but we detected a timeout in the process, so we abort' to [tools]'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected"
Updated by okurz over 7 years ago
- Has duplicate action #18782: test fails in login_test: stall detected added
Updated by szarate over 7 years ago
- Status changed from New to Feedback
- Assignee deleted (
szarate)
Okurz, os-autoinst will already stop testing when it detects a stall, what is your expectation on this issue in particular?, it's still not clear.
Updated by okurz over 7 years ago
szarate wrote:
Okurz, os-autoinst will already stop testing when it detects a stall, what is your expectation on this issue in particular?, it's still not clear.
As described in #13276#note-25 there is no plan to do anything about that issue right now in general as we have a good enough mitigation and workaround. So I don't expect anything to be done here in particular. The issue still persists though and should be used as a reporting end in any case of "stall detected" if it is not another more critical issue that just has the same symptoms, e.g. the current "why does power-worker-X behave different than power-worker-Y". I only updated the subject line #13276#note-27 to make the issue easier to find in case of "stall detected".
Updated by okurz over 7 years ago
recent example 64bit-ipmi: https://openqa.suse.de/tests/993230#step/grub_test/7
Updated by okurz over 7 years ago
Latest example: https://openqa.opensuse.org/tests/455293
What could we do to automatically label the job, record the finding here and restart the job?
Updated by okurz about 4 years ago
- Status changed from Feedback to Resolved
- Assignee set to okurz
By now any "stall detected" seems to happen only very seldomly and likely any failure can easily be linked to problems on individual worker hosts which can then be addressed specifically.