coordination #14972: [tools][epic] Improvements on backend to improve better handling of stalls - openQA Project (public) - openSUSE Project Management Tool

So let's not try to solve problems here, but really only detect them and restart. From my experiments with overbooked workers (6 workers on one disk and 4 cores), actual overbooking situations are rare but rather easy to detect.
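As a rough illustration of "detect only", one possible heuristic is to compare the host's load average with its core count. This is only a sketch under assumptions: the function names and the 1.5 threshold are hypothetical, and the ticket does not say which signal was actually used in the experiments.

```python
import os


def is_overbooked(load_1min, cores, threshold=1.5):
    """Flag a host whose 1-minute load average exceeds threshold x core count.

    The threshold of 1.5 is an arbitrary assumption, not a value from the ticket.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_1min / cores > threshold


def host_is_overbooked(threshold=1.5):
    """Apply the check to the current machine (os.getloadavg is Unix only)."""
    load_1min, _, _ = os.getloadavg()
    return is_overbooked(load_1min, os.cpu_count() or 1, threshold)
```

A worker could call such a check before starting a job and, when it fires, mark the job for a restart instead of trying to compensate.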

#8

Updated by szarate over 8 years ago

File autoinst-log.txt added

Well, we have a winner here: 24 seconds. Looking at the logs on openqa.suse.de, there are 23.92, 23.72, 23.65, 13.84, 13.16 and 10.81, but in most cases (around 580 out of 9544 samples) these numbers are under 3 seconds, so we've got that going, which is nice.

WARNING: check_asserted_screen took 23.93 seconds - make your needles more specific
02:50:17.1607 17150 DEBUG_IO:

NAME                                                 TIME        CUMULATIVE      PERCENTAGE
Searching for needles                                0.000       0.000           0.000%
**++ search__: get image                             0.000       0.000           0.000%
**++ search__: areas exclusion                       0.000       0.000           0.000%
**++ tinycv::search_needle 5x746x166                 0.635       0.635           2.655%
**++ tinycv::search_needle 59x18x59                  0.646       1.282           2.701%
**++ tinycv::search_needle 60x118x58                 0.651       1.933           2.720%
**++ tinycv::search_needle 846x42x141                0.603       2.536           2.520%
**++ mean_square_error                               0.000       2.536           0.001%
** search_: GNOME-smartneedle-20150803               0.000       2.536           0.000%
**++ search__: get image                             0.000       2.536           0.000%
**++ search__: areas exclusion                       0.000       2.536           0.000%
**++ tinycv::search_needle 0x4x1024                  0.085       2.621           0.356%
**++ tinycv::search_needle 0x711x216                 0.535       3.156           2.234%
**++ mean_square_error                               0.000       3.156           0.001%
** search_: GNOME-20141210                           0.000       3.156           0.000%
**++ search__: get image                             0.000       3.156           0.000%
**++ search__: areas exclusion                       0.000       3.156           0.000%
**++ tinycv::search_needle 0x4x1024                  0.071       3.227           0.296%
**++ mean_square_error                               0.000       3.227           0.001%
** search_: GNOME-20140701                           0.000       3.227           0.000%
**++ search__: get image                             0.000       3.227           0.000%
**++ search__: areas exclusion                       0.000       3.227           0.000%
**++ tinycv::search_needle 0x4x1024                  0.071       3.299           0.298%
**++ mean_square_error                               0.000       3.299           0.001%
** search_: GNOME-20160523                           0.000       3.299           0.000%
**++ search__: get image                             0.000       3.299           0.000%
**++ search__: areas exclusion                       0.000       3.299           0.001%
**++ tinycv::search_needle 5x746x166                 0.652       3.951           2.724%
**++ tinycv::search_needle 59x18x59                  0.651       4.602           2.721%
**++ tinycv::search_needle 73x125x31                 0.664       5.266           2.773%
**++ tinycv::search_needle 846x42x141                0.602       5.867           2.514%
**++ mean_square_error                               0.000       5.868           0.001%
** search_: GNOME-smartneedle-20160211               0.000       5.868           0.000%
**++ search__: get image                             0.000       5.868           0.000%
**++ search__: areas exclusion                       0.000       5.868           0.000%
**++ tinycv::search_needle 846x42x141                0.602       6.470           2.517%
**++ tinycv::search_needle 0x0x153                   0.554       7.024           2.317%
**++ mean_square_error                               0.000       7.025           0.001%
** search_: GNOME-workaround-BSC978076-20160502      0.000       7.025           0.000%
**++ search__: get image                             0.000       7.025           0.000%
**++ search__: areas exclusion                       0.000       7.025           0.000%
**++ tinycv::search_needle 401x9x605                 0.076       7.101           0.319%
**++ mean_square_error                               0.000       7.101           0.001%
** search_: boot_to_desktop-generic-desktop-20160519 0.000       7.101           0.000%
**++ search__: get image                             0.000       7.101           0.000%
**++ search__: areas exclusion                       0.000       7.101           0.000%
**++ tinycv::search_needle 0x745x48                  0.692       7.793           2.892%
**++ tinycv::search_needle 184x746x68                0.725       8.519           3.032%
**++ mean_square_error                               0.000       8.519           0.001%
** search_: first_boot-generic-desktop-20160503      0.000       8.519           0.000%
**++ search__: get image                             0.000       8.519           0.000%
**++ search__: areas exclusion                       0.000       8.519           0.000%
**++ tinycv::search_needle 846x42x141                0.608       9.127           2.539%
**++ tinycv::search_needle 0x0x153                   0.546       9.673           2.283%
**++ mean_square_error                               0.000       9.673           0.001%
** search_: GNOME-workaround-BSC978076-20160504      0.000       9.673           0.000%
**++ search__: get image                             0.000       9.673           0.000%
**++ search__: areas exclusion                       0.000       9.673           0.000%
**++ tinycv::search_needle 846x42x141                0.598       10.272          2.501%
**++ tinycv::search_needle 19x32x153                 0.565       10.837          2.362%
**++ tinycv::search_needle 2x721x139                 0.623       11.460          2.604%
**++ mean_square_error                               0.000       11.460          0.001%
** search_: GNOME-workaround-BSC978076-20160505      0.000       11.460          0.000%
**++ search__: get image                             0.000       11.460          0.000%
**++ search__: areas exclusion                       0.000       11.460          0.000%
**++ tinycv::search_needle 846x42x141                0.603       12.063          2.518%
**++ tinycv::search_needle 100x26x153                0.442       12.505          1.849%
**++ tinycv::search_needle 2x721x139                 0.622       13.127          2.598%
**++ mean_square_error                               0.000       13.127          0.001%
** search_: GNOME-workaround-BSC978076-20160506      0.000       13.127          0.000%
**++ search__: get image                             0.000       13.127          0.000%
**++ search__: areas exclusion                       0.000       13.127          0.000%
**++ tinycv::search_needle 846x42x141                0.602       13.730          2.517%
**++ tinycv::search_needle 88x32x153                 0.462       14.191          1.929%
**++ tinycv::search_needle 2x721x139                 0.636       14.827          2.657%
**++ mean_square_error                               0.000       14.827          0.001%
** search_: GNOME-workaround-BSC978076-20160508      0.000       14.827          0.000%
**++ search__: get image                             0.000       14.827          0.000%
**++ search__: areas exclusion                       0.000       14.827          0.001%
**++ tinycv::search_needle 5x746x166                 0.653       15.480          2.728%
**++ tinycv::search_needle 43x15x59                  0.663       16.143          2.770%
**++ tinycv::search_needle 57x97x31                  0.696       16.839          2.908%
**++ tinycv::search_needle 846x42x141                0.616       17.455          2.574%
**++ mean_square_error                               0.000       17.455          0.001%
** search_: GNOME-smartneedle-20160510               0.000       17.455          0.000%
**++ search__: get image                             0.000       17.455          0.000%
**++ search__: areas exclusion                       0.000       17.455          0.000%
**++ tinycv::search_needle 846x42x141                0.617       18.072          2.578%
**++ tinycv::search_needle 0x10x112                  0.644       18.716          2.693%
**++ tinycv::search_needle 2x721x139                 0.627       19.343          2.618%
**++ mean_square_error                               0.000       19.343          0.001%
** search_: GNOME-workaround-BSC979072-20160509      0.000       19.343          0.000%
**++ search__: get image                             0.000       19.343          0.000%
**++ search__: areas exclusion                       0.000       19.343          0.000%
**++ tinycv::search_needle 0x4x1024                  0.083       19.426          0.346%
**++ mean_square_error                               0.000       19.426          0.001%
** search_: GNOME-rollback-20160512                  0.000       19.426          0.000%
**++ search__: get image                             0.000       19.426          0.000%
**++ search__: areas exclusion                       0.000       19.426          0.000%
**++ tinycv::search_needle 846x42x141                0.599       20.026          2.505%
**++ tinycv::search_needle 0x10x112                  0.649       20.675          2.714%
**++ tinycv::search_needle 2x721x139                 0.626       21.301          2.617%
**++ mean_square_error                               0.000       21.302          0.001%
** search_: GNOME-workaround-BSC979072-20160520      0.000       21.302          0.000%
**++ search__: get image                             0.000       21.302          0.000%
**++ search__: areas exclusion                       0.000       21.302          0.000%
**++ tinycv::search_needle 5x746x166                 0.653       21.955          2.729%
**++ tinycv::search_needle 37x18x59                  0.671       22.626          2.804%
**++ tinycv::search_needle 51x115x31                 0.688       23.313          2.874%
**++ tinycv::search_needle 846x42x141                0.616       23.929          2.575%
**++ mean_square_error                               0.000       23.930          0.001%
** search_: GNOME-smartneedle-20160523               0.000       23.930          0.000%
_stop_                                               0.000       23.930          0.001%
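To see which needles account for the 23.93 seconds, a dump like the one above can be summed per needle. The following is a minimal sketch that assumes, based only on the layout of this dump, that the "**++" step rows precede the "** search_: <needle>" row they belong to. The names ROW and time_per_needle are made up for this example.

```python
import re

# Matches rows such as
# "**++ tinycv::search_needle 5x746x166   0.635   0.635   2.655%"
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<time>\d+\.\d+)\s+(?P<cum>\d+\.\d+)\s+(?P<pct>\d+\.\d+)%\s*$"
)


def time_per_needle(profile_text):
    """Return a dict mapping needle name to the summed time of its steps."""
    totals = {}
    pending = 0.0  # time of step rows not yet attributed to a needle
    for line in profile_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, blank or unrelated log line
        name = match.group("name").strip()
        seconds = float(match.group("time"))
        if name.startswith("**++"):
            pending += seconds
        elif name.startswith("** search_:"):
            needle = name.split(":", 1)[1].strip()
            totals[needle] = totals.get(needle, 0.0) + pending + seconds
            pending = 0.0
    return totals
```

Sorting the result by value, for example with sorted(totals.items(), key=lambda kv: kv[1], reverse=True), lists the most expensive needles first.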

#9

Updated by szarate about 8 years ago

Precedes action #13242: WDYT: For every job that does not have a label or bugref, retrigger some times to see if it's sporadic. Like rescheduling on incomplete but on failed added

#10

Updated by okurz about 8 years ago

Target version set to Milestone 5

#11

Updated by szarate about 8 years ago

Related to action #14072: [tools]monitor our load added

#12

Updated by szarate about 8 years ago

Precedes action #16166: Log per test added

#13

Updated by szarate about 8 years ago

#16166 is somehow related to this one and should be addressed by https://github.com/os-autoinst/os-autoinst/pull/691 by using custom appenders configured by the test developer. I see no point in handling tons of single files per test.

#14

Updated by okurz about 8 years ago

Related to action #12064: Improved logging for debugging performance related issues added

#15

Updated by okurz about 8 years ago

Related to action #10418: worker: do not warn on expected problems added

#16

Updated by okurz about 8 years ago

Subject changed from Improvements on backend to improve better handling of stalls to [epic] Improvements on backend to improve better handling of stalls

#17

Updated by okurz about 8 years ago

Blocks deleted (action #13276: [tools]'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected")

#18

Updated by okurz about 8 years ago

Related to action #13276: [tools]'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected" added

#19

Updated by pgeorgiadis about 8 years ago

Observation

The vim tests have been failing from time to time in Maintenance for at least the last 3 months. After looking at the specifics, we identified the problem as being on the infrastructure side. Simply put, the worker has performance issues and cannot handle the load. As a result, qemu behaves unexpectedly (e.g. ignored keystrokes and screen glitches).

For example: openQA test in scenario sle-12-SP2-Desktop-DVD-Updates-x86_64-qam-allpatterns@64bit fails in vim. We assume the "ar" that appeared on the screen is there because typing was done too quickly, so only the last two characters of "cd && clear" showed up.

Reproducible

It is unlikely to be reproducible. The failure is random because the worker is struggling.
Fails since (at least) Build 20170222-1 (current job)

The vim test has also failed from time to time over (at least) the last 3 months with similar problems: second failed vim example and third failed vim example

#20

Updated by pgeorgiadis about 8 years ago

Firefox_urlsprotocols fails in maintenance because of this latest failure

<<< testapi::type_string(string='killall -9 firefox;rm -rf .moz* .config/iced* .cache/iced* .local/share/gnome-shell/extensions/*; firefox > firefox.log 2>&1 &
<<< testapi::assert_screen(mustmatch='firefox-launch', timeout=90)

Expected output:

testapi::_handle_found_needle: found firefox_localfiles-firefox-launch-20161122, similarity 0.98 @ 246/480

The needle couldn't be matched because of glitches in qemu. More precisely, only "og 2>&1 &" remained of the full command.

#21

Updated by pgeorgiadis about 8 years ago

firefox_flashplayer fails in Maintenance because, once again, (too fast?) keystrokes behave strangely on an overloaded worker.
The link of the failure can be found here.

WARNING: check_asserted_screen took 1.24 seconds

^ According to @szarate this should be around 500 ms, but it's about 2.5x that.
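To find such slow checks across a whole autoinst-log.txt, the warning lines can be scanned and compared with the roughly 500 ms mentioned above. This is a minimal sketch: the helper name slow_checks is hypothetical, and the 0.5 second default is taken from this comment, not from any os-autoinst setting.

```python
import re

# Matches e.g. "WARNING: check_asserted_screen took 1.24 seconds"
WARNING = re.compile(r"check_asserted_screen took (\d+(?:\.\d+)?) seconds")


def slow_checks(log_text, expected=0.5):
    """Return (seconds, ratio_to_expected) for every check slower than expected."""
    results = []
    for match in WARNING.finditer(log_text):
        seconds = float(match.group(1))
        if seconds > expected:
            results.append((seconds, seconds / expected))
    return results
```

For the 1.24 second warning above this gives a ratio of about 2.5, and for the 23.93 second warning from the earlier comment about 48.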

Exactly as in the example above, Firefox cannot start:

<<< testapi::type_string(string='killall -9 firefox;rm -rf .moz* .config/iced* .cache/iced* .local/share/gnome-shell/extensions/*; firefox > firefox.log 2>&1 &
<<< testapi::assert_screen(mustmatch='firefox-launch', timeout=90)

Expected output:

testapi::_handle_found_needle: found firefox_localfiles-firefox-launch-20161122, similarity 0.98 @ 246/480

The needle couldn't be matched because of glitches in qemu. More precisely, only "killall -9 firefox;r" remained of the full command.

#22

Updated by pgeorgiadis about 8 years ago

The same also happens to the firefox_pagesaving test in Maintenance. Look here.

#23

Updated by okurz about 8 years ago

Target version changed from Milestone 5 to Milestone 6

#24

Updated by mgriessmeier about 8 years ago

related:
https://openqa.suse.de/tests/792293

#25

Updated by szarate about 8 years ago

Related to action #17408: Webinterface does not show any information when a worker fails to write to disk added

#26

Updated by pgeorgiadis about 8 years ago

related https://openqa.suse.de/tests/796011#
firefox_passwd cannot even start

#27

Updated by asmorodskyi about 8 years ago

Interesting case where even a simple "rm -f" stalls: https://openqa.suse.de/tests/795598#step/zypper_lifecycle/48

#28

Updated by mkravec about 8 years ago

type_string "DISPLAY=:1 xev\n" typed "DISv" instead
https://openqa.suse.de/tests/802819#step/vnc_two_passwords/11
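The lost-keystroke reports in this ticket differ in which part of the command went missing. A small helper can classify them by comparing the intended string with what appeared on screen. This is only a sketch for triage: classify_typing_loss and its labels are made up for this example and are not part of testapi.

```python
def is_subsequence(observed, intended):
    """True if observed can be obtained by deleting characters from intended."""
    remaining = iter(intended)
    # "ch in remaining" consumes the iterator, so character order is enforced
    return all(ch in remaining for ch in observed)


def classify_typing_loss(intended, observed):
    """Describe which part of the intended string is missing on screen."""
    if observed == intended:
        return "complete"
    if intended.startswith(observed):
        return "tail lost"
    if intended.endswith(observed):
        return "head lost"
    if is_subsequence(observed, intended):
        return "keys dropped in between"
    return "unrelated"
```

With the examples from this ticket, "ar" out of "cd && clear" is classified as "head lost", and "DISv" out of "DISPLAY=:1 xev\n" as "keys dropped in between".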

#29

Updated by RBrownSUSE about 8 years ago

Subject changed from [epic] Improvements on backend to improve better handling of stalls to [tools][epic] Improvements on backend to improve better handling of stalls

#30

Updated by mkravec about 8 years ago

In build 0288:
https://openqa.suse.de/tests/826697#step/zypper_lifecycle/27

#31

Updated by szarate about 8 years ago

Related to action #18286: Catch CPU lockups added

#32

Updated by szarate almost 8 years ago

Target version changed from Milestone 6 to Milestone 8

#33

Updated by szarate almost 8 years ago

Target version deleted (~~Milestone 8~~)

I'm just removing the milestone from here, to be more realistic. I think we've done a lot to improve the stall handling, but this task needs to be groomed.

#34

Updated by asmorodskyi over 7 years ago

Blocked by action #25864: [tools][functional][u] stall detected in openqaworker-arm-1 through 3 sometimes - "worker performance issues" added

#35

Updated by coolo over 7 years ago

Assignee deleted (~~szarate~~)

Santi is not working on it, so removing assignee

#36

Updated by szarate about 7 years ago

Status changed from In Progress to New

#37

Updated by okurz almost 6 years ago

Category changed from 132 to Feature requests

#38

Updated by okurz over 4 years ago

Status changed from New to Resolved
Assignee set to okurz

Looking at all the acceptance criteria from our current point of view, I think we have them covered: we have improved the backend enough to handle stalls through better reporting, better handling in infrastructure (e.g. filesystem and mount point option choices), and improvements that prevent stalls in some more instances.

#39

Updated by szarate over 4 years ago

Tracker changed from action to coordination

#40

Updated by szarate over 4 years ago

For the reason for the tracker change, see: http://mailman.suse.de/mailman/private/qa-sle/2020-October/002722.html
