coordination #14972
closedaction #14804: encoder still stalls
[tools][epic] Improvements on backend to improve better handling of stalls
100%
Description
This is a macro task to group the following tasks (and possibly discussions)
User story¶
As a test infrastructure admin I would openQA to better handle stalls to use full worker performance capacity without failing all over
acceptance criteria¶
- Backend is using ppm files for the video encoding and generate last.png on demand
- Test results are still generated using png
- Have a threshold to allow isotovideo to choose when a SUT/worker is too slow and must be stopped
- Collect information to generate knowledge so that thresholds can be decided by the openQA admin or openQA on it's own
- The worker/isotovideo informs the webui that a job failed with a reason
- Jobs that have failed under known failures, can be retriggered automatically
tasks (subtasks by themselves)¶
- Move to writing ppms instead of png's [#14976]
- Have write_img to support png's and ppms (basically let opencv pick the format)
- Have the videoencoder to write last.png
- have the worker to tell the videoencoder if it really needs the last.png
- Solve the last.png file for the worker to look at
- Let the isotovideo decide when the job/SUT must be shutdown based on statistics/threshold
- Have the isotovideo populate statistics (name of the test that died, and information on what happened)
- Have the isotovideo to update into the database when a job is being slow or killed because of slowliness
- Have the isotovideo to decide when to die, based on a threshold/factor calculated by the webui when instantiating the worker
- Have the webui to be able to handle backend-informed failures
- Retrigger jobs with known failures
- Have webui/scheduler to veify when a job was marked as failed or incomplete due to a known failure (and mark it)
- Have the user (and later on) the scheduler to re-trigger jobs based on a number/factor that may be defined by the openqa administrator
further details¶
Acceptance criteria details:
AC 3: Has to collect statistics on:
- How many tests are failing due to AC 4
- Timing for AC 1
- Some more statistics yet to be defined (e.g Worker load when the job is being cancelled)
AC 5: Has to be accomplished using the backend field in the job table in json format.
AC 6: Has to be disabled by default, since this is a feature that only might be deployed on unsupervised environmetns (i.e. will not be used for test development)
More details on benchmarks and other discussions can be found in the following links:
Files
Updated by okurz about 8 years ago
- Blocks action #13276: [tools]'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected" added
Updated by szarate about 8 years ago
- Related to action #15118: Text is mistyped on aarch64 added
Updated by szarate about 8 years ago
- Related to action #12250: [sporadic]"QEMU: usb-kbd: warning: key event queue full" on ppc64le and aarch64 added
Updated by coolo about 8 years ago
So let's not to try problems here, but really only detect them and restart. From my experiments with overbooked workers (6 workers on one disk and 4 cores), the actual overbook situations are rare, but rather easy to detect.
Updated by szarate almost 8 years ago
- File autoinst-log.txt autoinst-log.txt added
Well, we have a winner here, 24 seconds... by looking at the logs on openqa.suse.de... there's 23.92, 23.72, 23.65, 13.84, 13.16, 10.81 but in most cases (around 580 out of 9544 samples) these numbers are under 3 seconds, so we got that going... which is nice.
WARNING: check_asserted_screen took 23.93 seconds - make your needles more specific
02:50:17.1607 17150 DEBUG_IO:
NAME TIME CUMULATIVE PERCENTAGE
Searching for needles 0.000 0.000 0.000%
**++ search__: get image 0.000 0.000 0.000%
**++ search__: areas exclusion 0.000 0.000 0.000%
**++ tinycv::search_needle 5x746x166 0.635 0.635 2.655%
**++ tinycv::search_needle 59x18x59 0.646 1.282 2.701%
**++ tinycv::search_needle 60x118x58 0.651 1.933 2.720%
**++ tinycv::search_needle 846x42x141 0.603 2.536 2.520%
**++ mean_square_error 0.000 2.536 0.001%
** search_: GNOME-smartneedle-20150803 0.000 2.536 0.000%
**++ search__: get image 0.000 2.536 0.000%
**++ search__: areas exclusion 0.000 2.536 0.000%
**++ tinycv::search_needle 0x4x1024 0.085 2.621 0.356%
**++ tinycv::search_needle 0x711x216 0.535 3.156 2.234%
**++ mean_square_error 0.000 3.156 0.001%
** search_: GNOME-20141210 0.000 3.156 0.000%
**++ search__: get image 0.000 3.156 0.000%
**++ search__: areas exclusion 0.000 3.156 0.000%
**++ tinycv::search_needle 0x4x1024 0.071 3.227 0.296%
**++ mean_square_error 0.000 3.227 0.001%
** search_: GNOME-20140701 0.000 3.227 0.000%
**++ search__: get image 0.000 3.227 0.000%
**++ search__: areas exclusion 0.000 3.227 0.000%
**++ tinycv::search_needle 0x4x1024 0.071 3.299 0.298%
**++ mean_square_error 0.000 3.299 0.001%
** search_: GNOME-20160523 0.000 3.299 0.000%
**++ search__: get image 0.000 3.299 0.000%
**++ search__: areas exclusion 0.000 3.299 0.001%
**++ tinycv::search_needle 5x746x166 0.652 3.951 2.724%
**++ tinycv::search_needle 59x18x59 0.651 4.602 2.721%
**++ tinycv::search_needle 73x125x31 0.664 5.266 2.773%
**++ tinycv::search_needle 846x42x141 0.602 5.867 2.514%
**++ mean_square_error 0.000 5.868 0.001%
** search_: GNOME-smartneedle-20160211 0.000 5.868 0.000%
**++ search__: get image 0.000 5.868 0.000%
**++ search__: areas exclusion 0.000 5.868 0.000%
**++ tinycv::search_needle 846x42x141 0.602 6.470 2.517%
**++ tinycv::search_needle 0x0x153 0.554 7.024 2.317%
**++ mean_square_error 0.000 7.025 0.001%
** search_: GNOME-workaround-BSC978076-20160502 0.000 7.025 0.000%
**++ search__: get image 0.000 7.025 0.000%
**++ search__: areas exclusion 0.000 7.025 0.000%
**++ tinycv::search_needle 401x9x605 0.076 7.101 0.319%
**++ mean_square_error 0.000 7.101 0.001%
** search_: boot_to_desktop-generic-desktop-20160519 0.000 7.101 0.000%
**++ search__: get image 0.000 7.101 0.000%
**++ search__: areas exclusion 0.000 7.101 0.000%
**++ tinycv::search_needle 0x745x48 0.692 7.793 2.892%
**++ tinycv::search_needle 184x746x68 0.725 8.519 3.032%
**++ mean_square_error 0.000 8.519 0.001%
** search_: first_boot-generic-desktop-20160503 0.000 8.519 0.000%
**++ search__: get image 0.000 8.519 0.000%
**++ search__: areas exclusion 0.000 8.519 0.000%
**++ tinycv::search_needle 846x42x141 0.608 9.127 2.539%
**++ tinycv::search_needle 0x0x153 0.546 9.673 2.283%
**++ mean_square_error 0.000 9.673 0.001%
** search_: GNOME-workaround-BSC978076-20160504 0.000 9.673 0.000%
**++ search__: get image 0.000 9.673 0.000%
**++ search__: areas exclusion 0.000 9.673 0.000%
**++ tinycv::search_needle 846x42x141 0.598 10.272 2.501%
**++ tinycv::search_needle 19x32x153 0.565 10.837 2.362%
**++ tinycv::search_needle 2x721x139 0.623 11.460 2.604%
**++ mean_square_error 0.000 11.460 0.001%
** search_: GNOME-workaround-BSC978076-20160505 0.000 11.460 0.000%
**++ search__: get image 0.000 11.460 0.000%
**++ search__: areas exclusion 0.000 11.460 0.000%
**++ tinycv::search_needle 846x42x141 0.603 12.063 2.518%
**++ tinycv::search_needle 100x26x153 0.442 12.505 1.849%
**++ tinycv::search_needle 2x721x139 0.622 13.127 2.598%
**++ mean_square_error 0.000 13.127 0.001%
** search_: GNOME-workaround-BSC978076-20160506 0.000 13.127 0.000%
**++ search__: get image 0.000 13.127 0.000%
**++ search__: areas exclusion 0.000 13.127 0.000%
**++ tinycv::search_needle 846x42x141 0.602 13.730 2.517%
**++ tinycv::search_needle 88x32x153 0.462 14.191 1.929%
**++ tinycv::search_needle 2x721x139 0.636 14.827 2.657%
**++ mean_square_error 0.000 14.827 0.001%
** search_: GNOME-workaround-BSC978076-20160508 0.000 14.827 0.000%
**++ search__: get image 0.000 14.827 0.000%
**++ search__: areas exclusion 0.000 14.827 0.001%
**++ tinycv::search_needle 5x746x166 0.653 15.480 2.728%
**++ tinycv::search_needle 43x15x59 0.663 16.143 2.770%
**++ tinycv::search_needle 57x97x31 0.696 16.839 2.908%
**++ tinycv::search_needle 846x42x141 0.616 17.455 2.574%
**++ mean_square_error 0.000 17.455 0.001%
** search_: GNOME-smartneedle-20160510 0.000 17.455 0.000%
**++ search__: get image 0.000 17.455 0.000%
**++ search__: areas exclusion 0.000 17.455 0.000%
**++ tinycv::search_needle 846x42x141 0.617 18.072 2.578%
**++ tinycv::search_needle 0x10x112 0.644 18.716 2.693%
**++ tinycv::search_needle 2x721x139 0.627 19.343 2.618%
**++ mean_square_error 0.000 19.343 0.001%
** search_: GNOME-workaround-BSC979072-20160509 0.000 19.343 0.000%
**++ search__: get image 0.000 19.343 0.000%
**++ search__: areas exclusion 0.000 19.343 0.000%
**++ tinycv::search_needle 0x4x1024 0.083 19.426 0.346%
**++ mean_square_error 0.000 19.426 0.001%
** search_: GNOME-rollback-20160512 0.000 19.426 0.000%
**++ search__: get image 0.000 19.426 0.000%
**++ search__: areas exclusion 0.000 19.426 0.000%
**++ tinycv::search_needle 846x42x141 0.599 20.026 2.505%
**++ tinycv::search_needle 0x10x112 0.649 20.675 2.714%
**++ tinycv::search_needle 2x721x139 0.626 21.301 2.617%
**++ mean_square_error 0.000 21.302 0.001%
** search_: GNOME-workaround-BSC979072-20160520 0.000 21.302 0.000%
**++ search__: get image 0.000 21.302 0.000%
**++ search__: areas exclusion 0.000 21.302 0.000%
**++ tinycv::search_needle 5x746x166 0.653 21.955 2.729%
**++ tinycv::search_needle 37x18x59 0.671 22.626 2.804%
**++ tinycv::search_needle 51x115x31 0.688 23.313 2.874%
**++ tinycv::search_needle 846x42x141 0.616 23.929 2.575%
**++ mean_square_error 0.000 23.930 0.001%
** search_: GNOME-smartneedle-20160523 0.000 23.930 0.000%
_stop_ 0.000 23.930 0.001%
Updated by szarate almost 8 years ago
- Precedes action #13242: WDYT: For every job that does not have a label or bugref, retrigger some times to see if it's sporadic. Like rescheduling on incomplete but on failed added
Updated by szarate almost 8 years ago
- Related to action #14072: [tools]monitor our load added
Updated by szarate almost 8 years ago
- Precedes action #16166: Log per test added
Updated by szarate almost 8 years ago
#16166 is somehow related to this one... and should be addressed by https://github.com/os-autoinst/os-autoinst/pull/691 by using custom appenders configured by the test developer, i see no point in handling tons of single files per test.
Updated by okurz almost 8 years ago
- Related to action #12064: Improved logging for debugging performance related issues added
Updated by okurz almost 8 years ago
- Related to action #10418: worker: do not warn on expected problems added
Updated by okurz almost 8 years ago
- Subject changed from Improvements on backend to improve better handling of stalls to [epic] Improvements on backend to improve better handling of stalls
Updated by okurz almost 8 years ago
- Blocks deleted (action #13276: [tools]'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected")
Updated by okurz almost 8 years ago
- Related to action #13276: [tools]'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected" added
Updated by pgeorgiadis almost 8 years ago
Observation¶
vim
tests are failing from time to time in Maintenance for the last 3 months at least. After looking at the specifics, we identified the problem being on the infrastructure side. Simply put, there are performance issues of the worker which cannot handle the load. As a result, qemu
responds with unexpected behavior (i.e. ignoring keystrokes and screen glitches).
For example: openQA test in scenario sle-12-SP2-Desktop-DVD-Updates-x86_64-qam-allpatterns@64bit fails in vim. We assume the the ar
appeared on the screen is because typing was done too quickly and from "cd && clear" only the last two characters showed up.
Reproducible¶
It's unlikely to reproduce it. This random because the worker is struggling.
Fails since (at least) Build 20170222-1 (current job)
vim
test also fails from time to time in the last 3 months (at least) for similar problems: second failed vim example and third failed vim example
Updated by pgeorgiadis almost 8 years ago
Firefox_urlsprotocols fails in maintenance because of this latest failure
<<< testapi::type_string(string='killall -9 firefox;rm -rf .moz* .config/iced* .cache/iced* .local/share/gnome-shell/extensions/*; firefox > firefox.log 2>&1 &
<<< testapi::assert_screen(mustmatch='firefox-launch', timeout=90)
Expected output:
testapi::_handle_found_needle: found firefox_localfiles-firefox-launch-20161122, similarity 0.98 @ 246/480
needle couldn't be matched because of glitches in qemu. More precisely, og 2>&1 &
was found as a remainder of the full command.
Updated by pgeorgiadis almost 8 years ago
firefox_flashplayer fails in maintenance because once again there's a strange behavior of (too fast?) keystrokes into an overloaded worker.
The link of the failure can be found here.
WARNING: check_asserted_screen took 1.24 seconds
^ According to @szarate this should be almost 500ms, but it's nearly 3x more.
Exactly as the above example, Firefox cannot start:
<<< testapi::type_string(string='killall -9 firefox;rm -rf .moz* .config/iced* .cache/iced* .local/share/gnome-shell/extensions/*; firefox > firefox.log 2>&1 &
<<< testapi::assert_screen(mustmatch='firefox-launch', timeout=90)
Expected output:
testapi::_handle_found_needle: found firefox_localfiles-firefox-launch-20161122, similarity 0.98 @ 246/480
needle couldn't be matched because of glitches in qemu. More precisely, killall -9 firefox;r
was found as a remainder of the full command.
Updated by pgeorgiadis almost 8 years ago
Same happens also to firefox_pagesaving test in Maintenance. Look here.
Updated by okurz almost 8 years ago
- Target version changed from Milestone 5 to Milestone 6
Updated by mgriessmeier almost 8 years ago
related:
https://openqa.suse.de/tests/792293
Updated by szarate almost 8 years ago
- Related to action #17408: Webinterface does not show any information when a worker fails to write to disk added
Updated by pgeorgiadis almost 8 years ago
related https://openqa.suse.de/tests/796011#
firefox_passwd cannot even start
Updated by asmorodskyi almost 8 years ago
interesting case when even simple "rm -f" stalls https://openqa.suse.de/tests/795598#step/zypper_lifecycle/48
Updated by mkravec almost 8 years ago
type_string "DISPLAY=:1 xev\n" typed "DISv" instead
https://openqa.suse.de/tests/802819#step/vnc_two_passwords/11
Updated by RBrownSUSE almost 8 years ago
- Subject changed from [epic] Improvements on backend to improve better handling of stalls to [tools][epic] Improvements on backend to improve better handling of stalls
Updated by mkravec almost 8 years ago
In build 0288:
https://openqa.suse.de/tests/826697#step/zypper_lifecycle/27
Updated by szarate over 7 years ago
- Related to action #18286: Catch CPU lockups added
Updated by szarate over 7 years ago
- Target version changed from Milestone 6 to Milestone 8
Updated by szarate over 7 years ago
- Target version deleted (
Milestone 8)
I'm just removing the milestone from here, to be more realistic. I think we've done a lot to improve the stall handling... But this task needs to be groomed
Updated by asmorodskyi about 7 years ago
- Blocked by action #25864: [tools][functional][u] stall detected in openqaworker-arm-1 through 3 sometimes - "worker performance issues" added
Updated by coolo about 7 years ago
- Assignee deleted (
szarate)
Santi is not working on it, so removing assignee
Updated by okurz over 5 years ago
- Category changed from 132 to Feature requests
Updated by okurz over 4 years ago
- Status changed from New to Resolved
- Assignee set to okurz
Looking at all the acceptance criteria from our current point of view I think we got them covered and we have improved the backend enough to handle stalls with all of better reporting, better handling in infrastructure (e.g. filesystem and mount point option choices) as well as improvements to prevent stalls in some more instances.
Updated by szarate about 4 years ago
- Tracker changed from action to coordination
Updated by szarate about 4 years ago
See for the reason of tracker change: http://mailman.suse.de/mailman/private/qa-sle/2020-October/002722.html