action #13276: [tools]'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected" - openQA Tests (public) - openSUSE Project Management Tool

Actions

Copy link

action #13276

closed

[tools]'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected"

Added by okurz over 8 years ago. Updated over 4 years ago.

Status:

Resolved

Priority:

Low

Assignee:

okurz

Category:

Infrastructure

Target version:

Start date:

2016-08-19

Due date:

% Done:

Estimated time:

Difficulty:

Description

observation¶

There are problems in the testing infrastructure with stalls of the monitoring process for some seconds.

Initially referenced:
https://openqa.suse.de/tests/523174/file/autoinst-log.txt

but there have been corresponding tickets in before

reproducible¶

sometimes, seem to appear especially when SUT produce a lot of I/O, e.g. install_and_reboot.

expected result¶

No timeouts in the test, no screens missed.

problem¶

os-autoinst already detects this

09:39:27.4559 34353 WARNING: There is some problem with your environment, we detected a stall for 12.8205699920654 seconds

and as a consequence already aborts the test

09:39:30.5158 34353 <<< bmwqemu::mydie(cause_of_death='assert_screen fails, but we detected a timeout in the process, so we abort')
DIE mydie at /usr/lib/os-autoinst/backend/baseclass.pm line 853.

but this appears often enough that we should care about to fix the underlying issue and also improve the feedback on detection.

Related issues 3 (0 open — 3 closed)

Actions

Copy link

Updated by oholecek over 8 years ago

Isn't this by design when there is CPU overcommit and process stall?

Actions

Copy link

Updated by okurz over 8 years ago

Well, I don't complain about the test failing. I think the idea to prevent stuff like this is to inverse priorities to ensure the SUT can not stall the monitoring. I guess coolo did the groundwork with the messaging restructuring but needs some followup work. It could also mean the machine is actually overloaded and shouldn't be, i.e. reducing workers or so.

Actions

Copy link

Updated by coolo over 8 years ago

who says the SUT is stalling the monitoring? there are 15 more workers on this box. If enough of them run with -cpu 2, there is little you can do in os-autoinst itself.

Actions

Copy link

Updated by okurz over 8 years ago

Assignee set to szarate
Priority changed from Normal to High

https://openqa.opensuse.org/tests/310192

Actions

Copy link

Updated by okurz over 8 years ago

Blocked by coordination #14972: [tools][epic] Improvements on backend to improve better handling of stalls added

Actions

Copy link

Updated by okurz over 8 years ago

https://openqa.suse.de/tests/642709
https://openqa.suse.de/tests/643031

Actions

Copy link

Updated by okurz over 8 years ago

Has duplicate action #13644: test dies with "timeout in the process" added

Actions

Copy link

Updated by okurz over 8 years ago

https://openqa.suse.de/tests/661841

Actions

Copy link

Updated by asmorodskyi over 8 years ago

https://openqa.suse.de/tests/694144 - another example

Actions

Copy link

#10

Updated by asmorodskyi about 8 years ago

another example https://openqa.suse.de/tests/716498

Actions

Copy link

#11

Updated by okurz about 8 years ago

Target version set to Milestone 5

Actions

Copy link

#12

Updated by okurz about 8 years ago

Project changed from openQA Project (public) to openQA Tests (public)
Description updated (diff)
Category changed from 168 to Infrastructure

https://github.com/os-autoinst/os-autoinst/pull/713 prepared which would only abort on demand so not block tests where test reviewers do not care about a stall.

@szarate I updated the ticket description

I see the following tasks:

short-term: Use "record_testresult" from basetest.pm to show a more immediate feedback to the user so that he does not need to take a look into the logfile
mid-term: Provide auto-commenting and labeling of the job in openQA
long-term: Fix the underlying issue.

Actions

Copy link

#13

Updated by okurz about 8 years ago

Blocked by deleted (coordination #14972: [tools][epic] Improvements on backend to improve better handling of stalls)

Actions

Copy link

#14

Updated by okurz about 8 years ago

Related to coordination #14972: [tools][epic] Improvements on backend to improve better handling of stalls added

Actions

Copy link

#15

Updated by okurz about 8 years ago

Target version changed from Milestone 5 to Milestone 6

Actions

Copy link

#16

Updated by nicksinger about 8 years ago

Same error appears on this test: https://openqa.suse.de/tests/795001
Does not seem like it is linked to the workload inside the VM but more to the general workload of the worker maybe?

Actions

Copy link

#17

Updated by okurz@suse.de about 8 years ago

On Friday, 3 March 2017 14:21:19 CET you wrote:

Same error appears on this test: https://openqa.suse.de/tests/795001
Does not seem like it is linked to the workload inside the VM but more to
the general workload of the worker maybe?

exactly. The SUT is fine but the outside process running on the worker that is
checking the screens is the one that is slow here.

Actions

Copy link

#18

Updated by okurz about 8 years ago

Same error appears on this test: https://openqa.suse.de/tests/795001

and the corresponding snippet from the logfile is:

12:05:31.8730 26644 WARNING: There is some problem with your environment, we detected a stall for 11.2174379825592 seconds
12:05:32.4946 26644 MATCH(GNOME-smartneedle-20150803:0.00)
…
12:05:38.9836 26644 MATCH(GNOME-piglit-unlock-20170130:0.00)
12:05:38.9862 26644 WARNING: check_asserted_screen took 7.10 seconds - make your needles more specific
12:05:38.9874 26644 DEBUG_IO: 
NAME                                                 TIME        CUMULATIVE      PERCENTAGE
 Searching for needles                                0.000       0.000           0.000%
 **++ search__: get image                             0.000       0.000           0.001%
 **++ search__: areas exclusion                       0.000       0.000           0.000%
 **++ tinycv::search_needle 5x746x166                 0.177       0.177           2.497%
 **++ tinycv::search_needle 59x18x59                  0.131       0.309           1.850%
 **++ tinycv::search_needle 60x118x58                 0.128       0.436           1.797%
 **++ tinycv::search_needle 846x42x141                0.173       0.610           2.442%
 **++ mean_square_error                               0.000       0.610           0.005%
 ** search_: GNOME-smartneedle-20150803               0.000       0.610           0.000%
 **++ search__: get image                             0.000       0.610           0.001%
 **++ search__: areas exclusion                       0.000       0.610           0.000%
 **++ tinycv::search_needle 0x4x1024                  0.125       0.735           1.764%
 **++ tinycv::search_needle 0x711x216                 0.120       0.856           1.696%
 **++ mean_square_error                               0.000       0.856           0.004%
 ** search_: GNOME-20141210                           0.000       0.856           0.000%
 **++ search__: get image                             0.000       0.856           0.000%
 **++ search__: areas exclusion                       0.000       0.856           0.000%
 **++ tinycv::search_needle 0x4x1024                  0.123       0.979           1.728%
 **++ mean_square_error                               0.000       0.979           0.004%
 ** search_: GNOME-20140701                           0.000       0.979           0.000%
 **++ search__: get image                             0.000       0.979           0.000%
 **++ search__: areas exclusion                       0.000       0.979           0.000%
 **++ tinycv::search_needle 0x4x1024                  0.120       1.099           1.689%
 **++ mean_square_error                               0.000       1.099           0.004%
 ** search_: GNOME-20160523                           0.000       1.099           0.000%
 **++ search__: get image                             0.000       1.099           0.001%
 **++ search__: areas exclusion                       0.000       1.099           0.000%
 **++ tinycv::search_needle 5x746x166                 0.164       1.263           2.308%
 **++ tinycv::search_needle 59x18x59                  0.124       1.388           1.754%
 **++ tinycv::search_needle 73x125x31                 0.129       1.516           1.811%
 **++ tinycv::search_needle 846x42x141                0.159       1.675           2.235%
 **++ mean_square_error                               0.000       1.675           0.006%
 ** search_: GNOME-smartneedle-20160211               0.000       1.676           0.000%
 **++ search__: get image                             0.000       1.676           0.001%
 **++ search__: areas exclusion                       0.000       1.676           0.001%
 **++ tinycv::search_needle 846x42x141                0.164       1.840           2.314%
 **++ tinycv::search_needle 0x0x153                   0.162       2.002           2.281%
 **++ mean_square_error                               0.000       2.002           0.004%
 ** search_: GNOME-workaround-BSC978076-20160502      0.000       2.002           0.000%
 **++ search__: get image                             0.000       2.002           0.000%
 **++ search__: areas exclusion                       0.000       2.002           0.000%
 **++ tinycv::search_needle 401x9x605                 0.124       2.127           1.751%
 **++ mean_square_error                               0.000       2.127           0.005%
 ** search_: boot_to_desktop-generic-desktop-20160519 0.000       2.127           0.000%
 **++ search__: get image                             0.000       2.127           0.001%
 **++ search__: areas exclusion                       0.000       2.127           0.000%
 **++ tinycv::search_needle 0x745x48                  0.130       2.257           1.832%
 **++ tinycv::search_needle 184x746x68                0.164       2.421           2.307%
 **++ mean_square_error                               0.000       2.421           0.004%
 ** search_: first_boot-generic-desktop-20160503      0.000       2.421           0.000%
 **++ search__: get image                             0.000       2.421           0.001%
 **++ search__: areas exclusion                       0.000       2.421           0.000%
 **++ tinycv::search_needle 846x42x141                0.159       2.580           2.234%
 **++ tinycv::search_needle 0x0x153                   0.156       2.736           2.198%
 **++ mean_square_error                               0.000       2.736           0.004%
 ** search_: GNOME-workaround-BSC978076-20160504      0.000       2.736           0.000%
 **++ search__: get image                             0.000       2.736           0.000%
 **++ search__: areas exclusion                       0.000       2.736           0.000%
 **++ tinycv::search_needle 846x42x141                0.157       2.893           2.212%
 **++ tinycv::search_needle 19x32x153                 0.157       3.051           2.217%
 **++ tinycv::search_needle 2x721x139                 0.139       3.189           1.952%
 **++ mean_square_error                               0.000       3.190           0.004%
 ** search_: GNOME-workaround-BSC978076-20160505      0.000       3.190           0.000%
 **++ search__: get image                             0.000       3.190           0.000%
 **++ search__: areas exclusion                       0.000       3.190           0.000%
 **++ tinycv::search_needle 846x42x141                0.162       3.351           2.280%
 **++ tinycv::search_needle 100x26x153                0.169       3.520           2.378%
 **++ tinycv::search_needle 2x721x139                 0.145       3.665           2.040%
 **++ mean_square_error                               0.001       3.666           0.007%
 ** search_: GNOME-workaround-BSC978076-20160506      0.000       3.666           0.001%
 **++ search__: get image                             0.000       3.666           0.001%
 **++ search__: areas exclusion                       0.000       3.666           0.000%
 **++ tinycv::search_needle 846x42x141                0.158       3.824           2.232%
 **++ tinycv::search_needle 88x32x153                 0.154       3.978           2.170%
 **++ tinycv::search_needle 2x721x139                 0.140       4.119           1.976%
 **++ mean_square_error                               0.000       4.119           0.004%
 ** search_: GNOME-workaround-BSC978076-20160508      0.000       4.119           0.000%
 **++ search__: get image                             0.000       4.119           0.000%
 **++ search__: areas exclusion                       0.000       4.119           0.000%
 **++ tinycv::search_needle 5x746x166                 0.169       4.288           2.384%
 **++ tinycv::search_needle 43x15x59                  0.143       4.431           2.016%
 **++ tinycv::search_needle 57x97x31                  0.136       4.567           1.909%
 **++ tinycv::search_needle 846x42x141                0.160       4.727           2.259%
 **++ mean_square_error                               0.000       4.728           0.004%
 ** search_: GNOME-smartneedle-20160510               0.000       4.728           0.000%
 **++ search__: get image                             0.000       4.728           0.001%
 **++ search__: areas exclusion                       0.000       4.728           0.000%
 **++ tinycv::search_needle 846x42x141                0.161       4.889           2.274%
 **++ tinycv::search_needle 0x10x112                  0.180       5.069           2.539%
 **++ tinycv::search_needle 2x721x139                 0.142       5.211           1.997%
 **++ mean_square_error                               0.000       5.211           0.004%
 ** search_: GNOME-workaround-BSC979072-20160509      0.000       5.211           0.000%
 **++ search__: get image                             0.000       5.212           0.000%
 **++ search__: areas exclusion                       0.000       5.212           0.000%
 **++ tinycv::search_needle 0x4x1024                  0.122       5.334           1.718%
 **++ mean_square_error                               0.000       5.334           0.004%
 ** search_: GNOME-rollback-20160512                  0.000       5.334           0.000%
 **++ search__: get image                             0.000       5.334           0.000%
 **++ search__: areas exclusion                       0.000       5.334           0.000%
 **++ tinycv::search_needle 846x42x141                0.162       5.496           2.277%
 **++ tinycv::search_needle 0x10x112                  0.174       5.670           2.451%
 **++ tinycv::search_needle 2x721x139                 0.139       5.809           1.962%
 **++ mean_square_error                               0.000       5.809           0.004%
 ** search_: GNOME-workaround-BSC979072-20160520      0.000       5.809           0.000%
 **++ search__: get image                             0.000       5.809           0.000%
 **++ search__: areas exclusion                       0.000       5.809           0.000%
 **++ tinycv::search_needle 5x746x166                 0.170       5.979           2.392%
 **++ tinycv::search_needle 37x18x59                  0.143       6.122           2.010%
 **++ tinycv::search_needle 51x115x31                 0.130       6.252           1.831%
 **++ tinycv::search_needle 846x42x141                0.158       6.409           2.223%
 **++ mean_square_error                               0.000       6.410           0.004%
 ** search_: GNOME-smartneedle-20160523               0.000       6.410           0.000%
 **++ search__: get image                             0.000       6.410           0.001%
 **++ search__: areas exclusion                       0.000       6.410           0.000%
 **++ tinycv::search_needle 848x31x141                0.196       6.605           2.755%
 **++ tinycv::search_needle 4x743x149                 0.153       6.758           2.148%
 **++ tinycv::search_needle 180x16x170                0.171       6.929           2.409%
 **++ tinycv::search_needle 32x168x174                0.170       7.099           2.390%
 **++ mean_square_error                               0.000       7.099           0.005%
 ** search_: GNOME-piglit-unlock-20170130             0.000       7.099           0.000%
 _stop_                                               0.001       7.100           0.007%

12:05:38.9889 26644 <<< bmwqemu::mydie(cause_of_death='assert_screen fails, but we detected a timeout in the process, so we abort')
DIE mydie at /usr/lib/os-autoinst/backend/baseclass.pm line 857.

@szarate, can you help us to read the debug output? What can we see from there?

Actions

Copy link

#19

Updated by okurz about 8 years ago

https://openqa.suse.de/tests/807663

Actions

Copy link

#20

Updated by RBrownSUSE about 8 years ago

Subject changed from 'assert_screen fails, but we detected a timeout in the process, so we abort' to [tools]'assert_screen fails, but we detected a timeout in the process, so we abort'

Actions

Copy link

#21

Updated by asmorodskyi about 8 years ago

https://openqa.suse.de/tests/857360 example of stall detected

Actions

Copy link

#22

Updated by nicksinger about 8 years ago

https://openqa.suse.de/tests/861492 stall on QA-Power8-5-kvm:8

Actions

Copy link

#23

Updated by mgriessmeier almost 8 years ago

also happens here:
https://openqa.suse.de/tests/871547

but is it really the same issue?

Actions

Copy link

#24

Updated by okurz almost 8 years ago

Priority changed from High to Normal
Target version deleted (~~Milestone 6~~)

yes, they are. We have the "stall detected" field in the job since https://github.com/os-autoinst/os-autoinst/pull/735 which makes the investigation way cheaper. So reducing prio. It would be nice though if we could have a reference to this ticket here automatically in these cases like a hook from the backend "record_info" into the test distribution to look for known issues.

Also, as currently I guess there are no further plans to really solve this I am removing the target version.

Actions

Copy link

#25

Updated by okurz almost 8 years ago

Priority changed from Normal to Low

after discussing with szarate and rbrown I guess "low" would better resemble that currently there is no plan to do anything about it as we have some mitigation and the workaround is good enough for now.

Actions

Copy link

#26

Updated by okurz almost 8 years ago

https://openqa.suse.de/tests/909289

Actions

Copy link

#27

Updated by okurz almost 8 years ago

Subject changed from [tools]'assert_screen fails, but we detected a timeout in the process, so we abort' to [tools]'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected"

Actions

Copy link

#28

Updated by okurz almost 8 years ago

Has duplicate action #18782: test fails in login_test: stall detected added

Actions

Copy link

#29

Updated by szarate almost 8 years ago

Status changed from New to Feedback
Assignee deleted (~~szarate~~)

Okurz, os-autoinst will already stop testing when it detects a stall, what is your expectation on this issue in particular?, it's still not clear.

Actions

Copy link

#30

Updated by okurz almost 8 years ago

szarate wrote:

Okurz, os-autoinst will already stop testing when it detects a stall, what is your expectation on this issue in particular?, it's still not clear.

As described in #13276#note-25 there is no plan to do anything about that issue right now in general as we have a good enough mitigation and workaround. So I don't expect anything to be done here in particular. The issue still persists though and should be used as a reporting end in any case of "stall detected" if it is not another more critical issue that just has the same symptoms, e.g. the current "why does power-worker-X behave different than power-worker-Y". I only updated the subject line #13276#note-27 to make the issue easier to find in case of "stall detected".

Actions

Copy link

#31

Updated by okurz almost 8 years ago

recent example 64bit-ipmi: https://openqa.suse.de/tests/993230#step/grub_test/7

Actions

Copy link

#32

Updated by okurz over 7 years ago

Latest example: https://openqa.opensuse.org/tests/455293

What could we do to automatically label the job, record the finding here and restart the job?

Actions

Copy link

#33

Updated by okurz over 4 years ago

Status changed from Feedback to Resolved
Assignee set to okurz

By now any "stall detected" seems to happen only very seldomly and likely any failure can easily be linked to problems on individual worker hosts which can then be addressed specifically.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public) » openQA Tests (public)

Tags

Custom queries

action #13276

[tools]'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected"

observation¶

reproducible¶

expected result¶

problem¶

Updated by oholecek over 8 years ago

Updated by okurz over 8 years ago

Updated by coolo over 8 years ago

Updated by okurz over 8 years ago

Updated by okurz over 8 years ago

Updated by okurz over 8 years ago

Updated by okurz over 8 years ago

Updated by okurz over 8 years ago

Updated by asmorodskyi over 8 years ago

Updated by asmorodskyi about 8 years ago

Updated by okurz about 8 years ago

Updated by okurz about 8 years ago

Updated by okurz about 8 years ago

Updated by okurz about 8 years ago

Updated by okurz about 8 years ago

Updated by nicksinger about 8 years ago

Updated by okurz@suse.de about 8 years ago

Updated by okurz about 8 years ago

Updated by okurz about 8 years ago

Updated by RBrownSUSE about 8 years ago

Updated by asmorodskyi about 8 years ago

Updated by nicksinger about 8 years ago

Updated by mgriessmeier almost 8 years ago

Updated by okurz almost 8 years ago

Updated by okurz almost 8 years ago

Updated by okurz almost 8 years ago

Updated by okurz almost 8 years ago

Updated by okurz almost 8 years ago

Updated by szarate almost 8 years ago

Updated by okurz almost 8 years ago

Updated by okurz almost 8 years ago

Updated by okurz over 7 years ago

Updated by okurz over 4 years ago