Project

General

Profile

Actions

action #13276

closed

[tools]'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected"

Added by okurz over 8 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
Infrastructure
Target version:
-
Start date:
2016-08-19
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

observation

There are problems in the testing infrastructure with stalls of the monitoring process for some seconds.

Initially referenced:
https://openqa.suse.de/tests/523174/file/autoinst-log.txt

but there have been corresponding tickets in before

reproducible

sometimes, seem to appear especially when SUT produce a lot of I/O, e.g. install_and_reboot.

expected result

No timeouts in the test, no screens missed.

problem

os-autoinst already detects this

09:39:27.4559 34353 WARNING: There is some problem with your environment, we detected a stall for 12.8205699920654 seconds

and as a consequence already aborts the test

09:39:30.5158 34353 <<< bmwqemu::mydie(cause_of_death='assert_screen fails, but we detected a timeout in the process, so we abort')
DIE mydie at /usr/lib/os-autoinst/backend/baseclass.pm line 853.

but this appears often enough that we should care about to fix the underlying issue and also improve the feedback on detection.


Related issues 3 (0 open3 closed)

Related to openQA Project (public) - coordination #14972: [tools][epic] Improvements on backend to improve better handling of stallsResolvedokurz2016-11-24

Actions
Has duplicate openQA Tests (public) - action #13644: test dies with "timeout in the process"Rejected2016-09-09

Actions
Has duplicate openQA Tests (public) - action #18782: test fails in login_test: stall detectedRejectedokurz2017-04-25

Actions
Actions #1

Updated by oholecek over 8 years ago

Isn't this by design when there is CPU overcommit and process stall?

Actions #2

Updated by okurz over 8 years ago

Well, I don't complain about the test failing. I think the idea to prevent stuff like this is to inverse priorities to ensure the SUT can not stall the monitoring. I guess coolo did the groundwork with the messaging restructuring but needs some followup work. It could also mean the machine is actually overloaded and shouldn't be, i.e. reducing workers or so.

Actions #3

Updated by coolo over 8 years ago

who says the SUT is stalling the monitoring? there are 15 more workers on this box. If enough of them run with -cpu 2, there is little you can do in os-autoinst itself.

Actions #4

Updated by okurz about 8 years ago

  • Assignee set to szarate
  • Priority changed from Normal to High
Actions #5

Updated by okurz about 8 years ago

  • Blocked by coordination #14972: [tools][epic] Improvements on backend to improve better handling of stalls added
Actions #7

Updated by okurz about 8 years ago

  • Has duplicate action #13644: test dies with "timeout in the process" added
Actions #11

Updated by okurz almost 8 years ago

  • Target version set to Milestone 5
Actions #12

Updated by okurz almost 8 years ago

  • Project changed from openQA Project (public) to openQA Tests (public)
  • Description updated (diff)
  • Category changed from 168 to Infrastructure

https://github.com/os-autoinst/os-autoinst/pull/713 prepared which would only abort on demand so not block tests where test reviewers do not care about a stall.

@szarate I updated the ticket description

I see the following tasks:

  • short-term: Use "record_testresult" from basetest.pm to show a more immediate feedback to the user so that he does not need to take a look into the logfile
  • mid-term: Provide auto-commenting and labeling of the job in openQA
  • long-term: Fix the underlying issue.
Actions #13

Updated by okurz almost 8 years ago

  • Blocked by deleted (coordination #14972: [tools][epic] Improvements on backend to improve better handling of stalls)
Actions #14

Updated by okurz almost 8 years ago

  • Related to coordination #14972: [tools][epic] Improvements on backend to improve better handling of stalls added
Actions #15

Updated by okurz almost 8 years ago

  • Target version changed from Milestone 5 to Milestone 6
Actions #16

Updated by nicksinger almost 8 years ago

Same error appears on this test: https://openqa.suse.de/tests/795001
Does not seem like it is linked to the workload inside the VM but more to the general workload of the worker maybe?

Actions #17

Updated by okurz@suse.de almost 8 years ago

On Friday, 3 March 2017 14:21:19 CET you wrote:

Same error appears on this test: https://openqa.suse.de/tests/795001
Does not seem like it is linked to the workload inside the VM but more to
the general workload of the worker maybe?

exactly. The SUT is fine but the outside process running on the worker that is
checking the screens is the one that is slow here.

Actions #18

Updated by okurz almost 8 years ago

Same error appears on this test: https://openqa.suse.de/tests/795001

and the corresponding snippet from the logfile is:

12:05:31.8730 26644 WARNING: There is some problem with your environment, we detected a stall for 11.2174379825592 seconds
12:05:32.4946 26644 MATCH(GNOME-smartneedle-20150803:0.00)
…
12:05:38.9836 26644 MATCH(GNOME-piglit-unlock-20170130:0.00)
12:05:38.9862 26644 WARNING: check_asserted_screen took 7.10 seconds - make your needles more specific
12:05:38.9874 26644 DEBUG_IO: 
NAME                                                 TIME        CUMULATIVE      PERCENTAGE
 Searching for needles                                0.000       0.000           0.000%
 **++ search__: get image                             0.000       0.000           0.001%
 **++ search__: areas exclusion                       0.000       0.000           0.000%
 **++ tinycv::search_needle 5x746x166                 0.177       0.177           2.497%
 **++ tinycv::search_needle 59x18x59                  0.131       0.309           1.850%
 **++ tinycv::search_needle 60x118x58                 0.128       0.436           1.797%
 **++ tinycv::search_needle 846x42x141                0.173       0.610           2.442%
 **++ mean_square_error                               0.000       0.610           0.005%
 ** search_: GNOME-smartneedle-20150803               0.000       0.610           0.000%
 **++ search__: get image                             0.000       0.610           0.001%
 **++ search__: areas exclusion                       0.000       0.610           0.000%
 **++ tinycv::search_needle 0x4x1024                  0.125       0.735           1.764%
 **++ tinycv::search_needle 0x711x216                 0.120       0.856           1.696%
 **++ mean_square_error                               0.000       0.856           0.004%
 ** search_: GNOME-20141210                           0.000       0.856           0.000%
 **++ search__: get image                             0.000       0.856           0.000%
 **++ search__: areas exclusion                       0.000       0.856           0.000%
 **++ tinycv::search_needle 0x4x1024                  0.123       0.979           1.728%
 **++ mean_square_error                               0.000       0.979           0.004%
 ** search_: GNOME-20140701                           0.000       0.979           0.000%
 **++ search__: get image                             0.000       0.979           0.000%
 **++ search__: areas exclusion                       0.000       0.979           0.000%
 **++ tinycv::search_needle 0x4x1024                  0.120       1.099           1.689%
 **++ mean_square_error                               0.000       1.099           0.004%
 ** search_: GNOME-20160523                           0.000       1.099           0.000%
 **++ search__: get image                             0.000       1.099           0.001%
 **++ search__: areas exclusion                       0.000       1.099           0.000%
 **++ tinycv::search_needle 5x746x166                 0.164       1.263           2.308%
 **++ tinycv::search_needle 59x18x59                  0.124       1.388           1.754%
 **++ tinycv::search_needle 73x125x31                 0.129       1.516           1.811%
 **++ tinycv::search_needle 846x42x141                0.159       1.675           2.235%
 **++ mean_square_error                               0.000       1.675           0.006%
 ** search_: GNOME-smartneedle-20160211               0.000       1.676           0.000%
 **++ search__: get image                             0.000       1.676           0.001%
 **++ search__: areas exclusion                       0.000       1.676           0.001%
 **++ tinycv::search_needle 846x42x141                0.164       1.840           2.314%
 **++ tinycv::search_needle 0x0x153                   0.162       2.002           2.281%
 **++ mean_square_error                               0.000       2.002           0.004%
 ** search_: GNOME-workaround-BSC978076-20160502      0.000       2.002           0.000%
 **++ search__: get image                             0.000       2.002           0.000%
 **++ search__: areas exclusion                       0.000       2.002           0.000%
 **++ tinycv::search_needle 401x9x605                 0.124       2.127           1.751%
 **++ mean_square_error                               0.000       2.127           0.005%
 ** search_: boot_to_desktop-generic-desktop-20160519 0.000       2.127           0.000%
 **++ search__: get image                             0.000       2.127           0.001%
 **++ search__: areas exclusion                       0.000       2.127           0.000%
 **++ tinycv::search_needle 0x745x48                  0.130       2.257           1.832%
 **++ tinycv::search_needle 184x746x68                0.164       2.421           2.307%
 **++ mean_square_error                               0.000       2.421           0.004%
 ** search_: first_boot-generic-desktop-20160503      0.000       2.421           0.000%
 **++ search__: get image                             0.000       2.421           0.001%
 **++ search__: areas exclusion                       0.000       2.421           0.000%
 **++ tinycv::search_needle 846x42x141                0.159       2.580           2.234%
 **++ tinycv::search_needle 0x0x153                   0.156       2.736           2.198%
 **++ mean_square_error                               0.000       2.736           0.004%
 ** search_: GNOME-workaround-BSC978076-20160504      0.000       2.736           0.000%
 **++ search__: get image                             0.000       2.736           0.000%
 **++ search__: areas exclusion                       0.000       2.736           0.000%
 **++ tinycv::search_needle 846x42x141                0.157       2.893           2.212%
 **++ tinycv::search_needle 19x32x153                 0.157       3.051           2.217%
 **++ tinycv::search_needle 2x721x139                 0.139       3.189           1.952%
 **++ mean_square_error                               0.000       3.190           0.004%
 ** search_: GNOME-workaround-BSC978076-20160505      0.000       3.190           0.000%
 **++ search__: get image                             0.000       3.190           0.000%
 **++ search__: areas exclusion                       0.000       3.190           0.000%
 **++ tinycv::search_needle 846x42x141                0.162       3.351           2.280%
 **++ tinycv::search_needle 100x26x153                0.169       3.520           2.378%
 **++ tinycv::search_needle 2x721x139                 0.145       3.665           2.040%
 **++ mean_square_error                               0.001       3.666           0.007%
 ** search_: GNOME-workaround-BSC978076-20160506      0.000       3.666           0.001%
 **++ search__: get image                             0.000       3.666           0.001%
 **++ search__: areas exclusion                       0.000       3.666           0.000%
 **++ tinycv::search_needle 846x42x141                0.158       3.824           2.232%
 **++ tinycv::search_needle 88x32x153                 0.154       3.978           2.170%
 **++ tinycv::search_needle 2x721x139                 0.140       4.119           1.976%
 **++ mean_square_error                               0.000       4.119           0.004%
 ** search_: GNOME-workaround-BSC978076-20160508      0.000       4.119           0.000%
 **++ search__: get image                             0.000       4.119           0.000%
 **++ search__: areas exclusion                       0.000       4.119           0.000%
 **++ tinycv::search_needle 5x746x166                 0.169       4.288           2.384%
 **++ tinycv::search_needle 43x15x59                  0.143       4.431           2.016%
 **++ tinycv::search_needle 57x97x31                  0.136       4.567           1.909%
 **++ tinycv::search_needle 846x42x141                0.160       4.727           2.259%
 **++ mean_square_error                               0.000       4.728           0.004%
 ** search_: GNOME-smartneedle-20160510               0.000       4.728           0.000%
 **++ search__: get image                             0.000       4.728           0.001%
 **++ search__: areas exclusion                       0.000       4.728           0.000%
 **++ tinycv::search_needle 846x42x141                0.161       4.889           2.274%
 **++ tinycv::search_needle 0x10x112                  0.180       5.069           2.539%
 **++ tinycv::search_needle 2x721x139                 0.142       5.211           1.997%
 **++ mean_square_error                               0.000       5.211           0.004%
 ** search_: GNOME-workaround-BSC979072-20160509      0.000       5.211           0.000%
 **++ search__: get image                             0.000       5.212           0.000%
 **++ search__: areas exclusion                       0.000       5.212           0.000%
 **++ tinycv::search_needle 0x4x1024                  0.122       5.334           1.718%
 **++ mean_square_error                               0.000       5.334           0.004%
 ** search_: GNOME-rollback-20160512                  0.000       5.334           0.000%
 **++ search__: get image                             0.000       5.334           0.000%
 **++ search__: areas exclusion                       0.000       5.334           0.000%
 **++ tinycv::search_needle 846x42x141                0.162       5.496           2.277%
 **++ tinycv::search_needle 0x10x112                  0.174       5.670           2.451%
 **++ tinycv::search_needle 2x721x139                 0.139       5.809           1.962%
 **++ mean_square_error                               0.000       5.809           0.004%
 ** search_: GNOME-workaround-BSC979072-20160520      0.000       5.809           0.000%
 **++ search__: get image                             0.000       5.809           0.000%
 **++ search__: areas exclusion                       0.000       5.809           0.000%
 **++ tinycv::search_needle 5x746x166                 0.170       5.979           2.392%
 **++ tinycv::search_needle 37x18x59                  0.143       6.122           2.010%
 **++ tinycv::search_needle 51x115x31                 0.130       6.252           1.831%
 **++ tinycv::search_needle 846x42x141                0.158       6.409           2.223%
 **++ mean_square_error                               0.000       6.410           0.004%
 ** search_: GNOME-smartneedle-20160523               0.000       6.410           0.000%
 **++ search__: get image                             0.000       6.410           0.001%
 **++ search__: areas exclusion                       0.000       6.410           0.000%
 **++ tinycv::search_needle 848x31x141                0.196       6.605           2.755%
 **++ tinycv::search_needle 4x743x149                 0.153       6.758           2.148%
 **++ tinycv::search_needle 180x16x170                0.171       6.929           2.409%
 **++ tinycv::search_needle 32x168x174                0.170       7.099           2.390%
 **++ mean_square_error                               0.000       7.099           0.005%
 ** search_: GNOME-piglit-unlock-20170130             0.000       7.099           0.000%
 _stop_                                               0.001       7.100           0.007%

12:05:38.9889 26644 <<< bmwqemu::mydie(cause_of_death='assert_screen fails, but we detected a timeout in the process, so we abort')
DIE mydie at /usr/lib/os-autoinst/backend/baseclass.pm line 857.

@szarate, can you help us to read the debug output? What can we see from there?

Actions #20

Updated by RBrownSUSE almost 8 years ago

  • Subject changed from 'assert_screen fails, but we detected a timeout in the process, so we abort' to [tools]'assert_screen fails, but we detected a timeout in the process, so we abort'
Actions #21

Updated by asmorodskyi over 7 years ago

https://openqa.suse.de/tests/857360 example of stall detected

Actions #22

Updated by nicksinger over 7 years ago

https://openqa.suse.de/tests/861492 stall on QA-Power8-5-kvm:8

Actions #23

Updated by mgriessmeier over 7 years ago

also happens here:
https://openqa.suse.de/tests/871547

but is it really the same issue?

Actions #24

Updated by okurz over 7 years ago

  • Priority changed from High to Normal
  • Target version deleted (Milestone 6)

yes, they are. We have the "stall detected" field in the job since https://github.com/os-autoinst/os-autoinst/pull/735 which makes the investigation way cheaper. So reducing prio. It would be nice though if we could have a reference to this ticket here automatically in these cases like a hook from the backend "record_info" into the test distribution to look for known issues.

Also, as currently I guess there are no further plans to really solve this I am removing the target version.

Actions #25

Updated by okurz over 7 years ago

  • Priority changed from Normal to Low

after discussing with szarate and rbrown I guess "low" would better resemble that currently there is no plan to do anything about it as we have some mitigation and the workaround is good enough for now.

Actions #27

Updated by okurz over 7 years ago

  • Subject changed from [tools]'assert_screen fails, but we detected a timeout in the process, so we abort' to [tools]'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected"
Actions #28

Updated by okurz over 7 years ago

  • Has duplicate action #18782: test fails in login_test: stall detected added
Actions #29

Updated by szarate over 7 years ago

  • Status changed from New to Feedback
  • Assignee deleted (szarate)

Okurz, os-autoinst will already stop testing when it detects a stall, what is your expectation on this issue in particular?, it's still not clear.

Actions #30

Updated by okurz over 7 years ago

szarate wrote:

Okurz, os-autoinst will already stop testing when it detects a stall, what is your expectation on this issue in particular?, it's still not clear.

As described in #13276#note-25 there is no plan to do anything about that issue right now in general as we have a good enough mitigation and workaround. So I don't expect anything to be done here in particular. The issue still persists though and should be used as a reporting end in any case of "stall detected" if it is not another more critical issue that just has the same symptoms, e.g. the current "why does power-worker-X behave different than power-worker-Y". I only updated the subject line #13276#note-27 to make the issue easier to find in case of "stall detected".

Actions #31

Updated by okurz over 7 years ago

Actions #32

Updated by okurz over 7 years ago

Latest example: https://openqa.opensuse.org/tests/455293

What could we do to automatically label the job, record the finding here and restart the job?

Actions #33

Updated by okurz about 4 years ago

  • Status changed from Feedback to Resolved
  • Assignee set to okurz

By now any "stall detected" seems to happen only very seldomly and likely any failure can easily be linked to problems on individual worker hosts which can then be addressed specifically.

Actions

Also available in: Atom PDF