Project

General

Profile

Actions

coordination #14972

closed

action #14804: encoder still stalls

[tools][epic] Improvements on backend to improve better handling of stalls

Added by szarate over 7 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
-
Start date:
2016-11-24
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)

Description

This is a macro task to group the following tasks (and possibly discussions)

User story

As a test infrastructure admin I would openQA to better handle stalls to use full worker performance capacity without failing all over

acceptance criteria

  1. Backend is using ppm files for the video encoding and generate last.png on demand
  2. Test results are still generated using png
  3. Have a threshold to allow isotovideo to choose when a SUT/worker is too slow and must be stopped
  4. Collect information to generate knowledge so that thresholds can be decided by the openQA admin or openQA on it's own
  5. The worker/isotovideo informs the webui that a job failed with a reason
  6. Jobs that have failed under known failures, can be retriggered automatically

tasks (subtasks by themselves)

  1. Move to writing ppms instead of png's [#14976]
    • Have write_img to support png's and ppms (basically let opencv pick the format)
    • Have the videoencoder to write last.png
    • have the worker to tell the videoencoder if it really needs the last.png
    • Solve the last.png file for the worker to look at
  2. Let the isotovideo decide when the job/SUT must be shutdown based on statistics/threshold
    • Have the isotovideo populate statistics (name of the test that died, and information on what happened)
    • Have the isotovideo to update into the database when a job is being slow or killed because of slowliness
    • Have the isotovideo to decide when to die, based on a threshold/factor calculated by the webui when instantiating the worker
    • Have the webui to be able to handle backend-informed failures
  3. Retrigger jobs with known failures
    • Have webui/scheduler to veify when a job was marked as failed or incomplete due to a known failure (and mark it)
    • Have the user (and later on) the scheduler to re-trigger jobs based on a number/factor that may be defined by the openqa administrator

further details

Acceptance criteria details:

  • AC 3: Has to collect statistics on:

    • How many tests are failing due to AC 4
    • Timing for AC 1
    • Some more statistics yet to be defined (e.g Worker load when the job is being cancelled)
  • AC 5: Has to be accomplished using the backend field in the job table in json format.

  • AC 6: Has to be disabled by default, since this is a feature that only might be deployed on unsupervised environmetns (i.e. will not be used for test development)

More details on benchmarks and other discussions can be found in the following links:


Files

autoinst-log.txt (1.15 MB) autoinst-log.txt szarate, 2016-12-30 14:45

Subtasks 1 (0 open1 closed)

action #14976: Change the isotovideo backend to write ppm filesResolvedszarate2016-11-24

Actions

Related issues 11 (0 open11 closed)

Related to openQA Tests - action #15118: Text is mistyped on aarch64Rejectedzluo2016-11-29

Actions
Related to openQA Tests - action #12250: [sporadic]"QEMU: usb-kbd: warning: key event queue full" on ppc64le and aarch64Resolved2016-06-03

Actions
Related to openQA Project - action #14072: [tools]monitor our loadResolved2016-10-05

Actions
Related to openQA Project - action #12064: Improved logging for debugging performance related issuesResolvedokurz2016-05-19

Actions
Related to openQA Project - action #10418: worker: do not warn on expected problemsResolvedmkittler2016-01-25

Actions
Related to openQA Tests - action #13276: [tools]'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected"Resolvedokurz2016-08-19

Actions
Related to openQA Project - action #17408: Webinterface does not show any information when a worker fails to write to diskRejected2017-03-01

Actions
Related to openQA Project - action #18286: Catch CPU lockupsRejected2017-04-03

Actions
Blocked by openQA Tests - action #25864: [tools][functional][u] stall detected in openqaworker-arm-1 through 3 sometimes - "worker performance issues"Resolvedokurz2017-10-09

Actions
Precedes openQA Project - action #13242: WDYT: For every job that does not have a label or bugref, retrigger some times to see if it's sporadic. Like rescheduling on incomplete but on failedRejectedokurz2016-11-25

Actions
Precedes openQA Project - action #16166: Log per testRejected2017-01-24

Actions
Actions #1

Updated by szarate over 7 years ago

  • Description updated (diff)
Actions #2

Updated by szarate over 7 years ago

  • Description updated (diff)
Actions #3

Updated by okurz over 7 years ago

  • Description updated (diff)

updating the US as discussed with @szarate (I hope you don't mind)

Actions #4

Updated by okurz over 7 years ago

  • Blocks action #13276: [tools]'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected" added
Actions #5

Updated by szarate over 7 years ago

Actions #6

Updated by szarate over 7 years ago

  • Related to action #12250: [sporadic]"QEMU: usb-kbd: warning: key event queue full" on ppc64le and aarch64 added
Actions #7

Updated by coolo about 7 years ago

So let's not to try problems here, but really only detect them and restart. From my experiments with overbooked workers (6 workers on one disk and 4 cores), the actual overbook situations are rare, but rather easy to detect.

Actions #8

Updated by szarate about 7 years ago

Well, we have a winner here, 24 seconds... by looking at the logs on openqa.suse.de... there's 23.92, 23.72, 23.65, 13.84, 13.16, 10.81 but in most cases (around 580 out of 9544 samples) these numbers are under 3 seconds, so we got that going... which is nice.

WARNING: check_asserted_screen took 23.93 seconds - make your needles more specific
02:50:17.1607 17150 DEBUG_IO:

NAME                                                 TIME        CUMULATIVE      PERCENTAGE
Searching for needles                                0.000       0.000           0.000%
**++ search__: get image                             0.000       0.000           0.000%
**++ search__: areas exclusion                       0.000       0.000           0.000%
**++ tinycv::search_needle 5x746x166                 0.635       0.635           2.655%
**++ tinycv::search_needle 59x18x59                  0.646       1.282           2.701%
**++ tinycv::search_needle 60x118x58                 0.651       1.933           2.720%
**++ tinycv::search_needle 846x42x141                0.603       2.536           2.520%
**++ mean_square_error                               0.000       2.536           0.001%
** search_: GNOME-smartneedle-20150803               0.000       2.536           0.000%
**++ search__: get image                             0.000       2.536           0.000%
**++ search__: areas exclusion                       0.000       2.536           0.000%
**++ tinycv::search_needle 0x4x1024                  0.085       2.621           0.356%
**++ tinycv::search_needle 0x711x216                 0.535       3.156           2.234%
**++ mean_square_error                               0.000       3.156           0.001%
** search_: GNOME-20141210                           0.000       3.156           0.000%
**++ search__: get image                             0.000       3.156           0.000%
**++ search__: areas exclusion                       0.000       3.156           0.000%
**++ tinycv::search_needle 0x4x1024                  0.071       3.227           0.296%
**++ mean_square_error                               0.000       3.227           0.001%
** search_: GNOME-20140701                           0.000       3.227           0.000%
**++ search__: get image                             0.000       3.227           0.000%
**++ search__: areas exclusion                       0.000       3.227           0.000%
**++ tinycv::search_needle 0x4x1024                  0.071       3.299           0.298%
**++ mean_square_error                               0.000       3.299           0.001%
** search_: GNOME-20160523                           0.000       3.299           0.000%
**++ search__: get image                             0.000       3.299           0.000%
**++ search__: areas exclusion                       0.000       3.299           0.001%
**++ tinycv::search_needle 5x746x166                 0.652       3.951           2.724%
**++ tinycv::search_needle 59x18x59                  0.651       4.602           2.721%
**++ tinycv::search_needle 73x125x31                 0.664       5.266           2.773%
**++ tinycv::search_needle 846x42x141                0.602       5.867           2.514%
**++ mean_square_error                               0.000       5.868           0.001%
** search_: GNOME-smartneedle-20160211               0.000       5.868           0.000%
**++ search__: get image                             0.000       5.868           0.000%
**++ search__: areas exclusion                       0.000       5.868           0.000%
**++ tinycv::search_needle 846x42x141                0.602       6.470           2.517%
**++ tinycv::search_needle 0x0x153                   0.554       7.024           2.317%
**++ mean_square_error                               0.000       7.025           0.001%
** search_: GNOME-workaround-BSC978076-20160502      0.000       7.025           0.000%
**++ search__: get image                             0.000       7.025           0.000%
**++ search__: areas exclusion                       0.000       7.025           0.000%
**++ tinycv::search_needle 401x9x605                 0.076       7.101           0.319%
**++ mean_square_error                               0.000       7.101           0.001%
** search_: boot_to_desktop-generic-desktop-20160519 0.000       7.101           0.000%
**++ search__: get image                             0.000       7.101           0.000%
**++ search__: areas exclusion                       0.000       7.101           0.000%
**++ tinycv::search_needle 0x745x48                  0.692       7.793           2.892%
**++ tinycv::search_needle 184x746x68                0.725       8.519           3.032%
**++ mean_square_error                               0.000       8.519           0.001%
** search_: first_boot-generic-desktop-20160503      0.000       8.519           0.000%
**++ search__: get image                             0.000       8.519           0.000%
**++ search__: areas exclusion                       0.000       8.519           0.000%
**++ tinycv::search_needle 846x42x141                0.608       9.127           2.539%
**++ tinycv::search_needle 0x0x153                   0.546       9.673           2.283%
**++ mean_square_error                               0.000       9.673           0.001%
** search_: GNOME-workaround-BSC978076-20160504      0.000       9.673           0.000%
**++ search__: get image                             0.000       9.673           0.000%
**++ search__: areas exclusion                       0.000       9.673           0.000%
**++ tinycv::search_needle 846x42x141                0.598       10.272          2.501%
**++ tinycv::search_needle 19x32x153                 0.565       10.837          2.362%
**++ tinycv::search_needle 2x721x139                 0.623       11.460          2.604%
**++ mean_square_error                               0.000       11.460          0.001%
** search_: GNOME-workaround-BSC978076-20160505      0.000       11.460          0.000%
**++ search__: get image                             0.000       11.460          0.000%
**++ search__: areas exclusion                       0.000       11.460          0.000%
**++ tinycv::search_needle 846x42x141                0.603       12.063          2.518%
**++ tinycv::search_needle 100x26x153                0.442       12.505          1.849%
**++ tinycv::search_needle 2x721x139                 0.622       13.127          2.598%
**++ mean_square_error                               0.000       13.127          0.001%
** search_: GNOME-workaround-BSC978076-20160506      0.000       13.127          0.000%
**++ search__: get image                             0.000       13.127          0.000%
**++ search__: areas exclusion                       0.000       13.127          0.000%
**++ tinycv::search_needle 846x42x141                0.602       13.730          2.517%
**++ tinycv::search_needle 88x32x153                 0.462       14.191          1.929%
**++ tinycv::search_needle 2x721x139                 0.636       14.827          2.657%
**++ mean_square_error                               0.000       14.827          0.001%
** search_: GNOME-workaround-BSC978076-20160508      0.000       14.827          0.000%
**++ search__: get image                             0.000       14.827          0.000%
**++ search__: areas exclusion                       0.000       14.827          0.001%
**++ tinycv::search_needle 5x746x166                 0.653       15.480          2.728%
**++ tinycv::search_needle 43x15x59                  0.663       16.143          2.770%
**++ tinycv::search_needle 57x97x31                  0.696       16.839          2.908%
**++ tinycv::search_needle 846x42x141                0.616       17.455          2.574%
**++ mean_square_error                               0.000       17.455          0.001%
** search_: GNOME-smartneedle-20160510               0.000       17.455          0.000%
**++ search__: get image                             0.000       17.455          0.000%
**++ search__: areas exclusion                       0.000       17.455          0.000%
**++ tinycv::search_needle 846x42x141                0.617       18.072          2.578%
**++ tinycv::search_needle 0x10x112                  0.644       18.716          2.693%
**++ tinycv::search_needle 2x721x139                 0.627       19.343          2.618%
**++ mean_square_error                               0.000       19.343          0.001%
** search_: GNOME-workaround-BSC979072-20160509      0.000       19.343          0.000%
**++ search__: get image                             0.000       19.343          0.000%
**++ search__: areas exclusion                       0.000       19.343          0.000%
**++ tinycv::search_needle 0x4x1024                  0.083       19.426          0.346%
**++ mean_square_error                               0.000       19.426          0.001%
** search_: GNOME-rollback-20160512                  0.000       19.426          0.000%
**++ search__: get image                             0.000       19.426          0.000%
**++ search__: areas exclusion                       0.000       19.426          0.000%
**++ tinycv::search_needle 846x42x141                0.599       20.026          2.505%
**++ tinycv::search_needle 0x10x112                  0.649       20.675          2.714%
**++ tinycv::search_needle 2x721x139                 0.626       21.301          2.617%
**++ mean_square_error                               0.000       21.302          0.001%
** search_: GNOME-workaround-BSC979072-20160520      0.000       21.302          0.000%
**++ search__: get image                             0.000       21.302          0.000%
**++ search__: areas exclusion                       0.000       21.302          0.000%
**++ tinycv::search_needle 5x746x166                 0.653       21.955          2.729%
**++ tinycv::search_needle 37x18x59                  0.671       22.626          2.804%
**++ tinycv::search_needle 51x115x31                 0.688       23.313          2.874%
**++ tinycv::search_needle 846x42x141                0.616       23.929          2.575%
**++ mean_square_error                               0.000       23.930          0.001%
** search_: GNOME-smartneedle-20160523               0.000       23.930          0.000%
_stop_                                               0.000       23.930          0.001%
Actions #9

Updated by szarate about 7 years ago

  • Precedes action #13242: WDYT: For every job that does not have a label or bugref, retrigger some times to see if it's sporadic. Like rescheduling on incomplete but on failed added
Actions #10

Updated by okurz about 7 years ago

  • Target version set to Milestone 5
Actions #11

Updated by szarate about 7 years ago

Actions #12

Updated by szarate about 7 years ago

Actions #13

Updated by szarate about 7 years ago

#16166 is somehow related to this one... and should be addressed by https://github.com/os-autoinst/os-autoinst/pull/691 by using custom appenders configured by the test developer, i see no point in handling tons of single files per test.

Actions #14

Updated by okurz about 7 years ago

  • Related to action #12064: Improved logging for debugging performance related issues added
Actions #15

Updated by okurz about 7 years ago

  • Related to action #10418: worker: do not warn on expected problems added
Actions #16

Updated by okurz about 7 years ago

  • Subject changed from Improvements on backend to improve better handling of stalls to [epic] Improvements on backend to improve better handling of stalls
Actions #17

Updated by okurz about 7 years ago

  • Blocks deleted (action #13276: [tools]'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected")
Actions #18

Updated by okurz about 7 years ago

  • Related to action #13276: [tools]'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected" added
Actions #19

Updated by pgeorgiadis about 7 years ago

Observation

vim tests are failing from time to time in Maintenance for the last 3 months at least. After looking at the specifics, we identified the problem being on the infrastructure side. Simply put, there are performance issues of the worker which cannot handle the load. As a result, qemu responds with unexpected behavior (i.e. ignoring keystrokes and screen glitches).

For example: openQA test in scenario sle-12-SP2-Desktop-DVD-Updates-x86_64-qam-allpatterns@64bit fails in vim. We assume the the ar appeared on the screen is because typing was done too quickly and from "cd && clear" only the last two characters showed up.

Reproducible

It's unlikely to reproduce it. This random because the worker is struggling.
Fails since (at least) Build 20170222-1 (current job)

vim test also fails from time to time in the last 3 months (at least) for similar problems: second failed vim example and third failed vim example

Actions #20

Updated by pgeorgiadis about 7 years ago

Firefox_urlsprotocols fails in maintenance because of this latest failure

<<< testapi::type_string(string='killall -9 firefox;rm -rf .moz* .config/iced* .cache/iced* .local/share/gnome-shell/extensions/*; firefox > firefox.log 2>&1 &
<<< testapi::assert_screen(mustmatch='firefox-launch', timeout=90)

Expected output:

testapi::_handle_found_needle: found firefox_localfiles-firefox-launch-20161122, similarity 0.98 @ 246/480

needle couldn't be matched because of glitches in qemu. More precisely, og 2>&1 & was found as a remainder of the full command.

Actions #21

Updated by pgeorgiadis about 7 years ago

firefox_flashplayer fails in maintenance because once again there's a strange behavior of (too fast?) keystrokes into an overloaded worker.
The link of the failure can be found here.

WARNING: check_asserted_screen took 1.24 seconds

^ According to @szarate this should be almost 500ms, but it's nearly 3x more.

Exactly as the above example, Firefox cannot start:

<<< testapi::type_string(string='killall -9 firefox;rm -rf .moz* .config/iced* .cache/iced* .local/share/gnome-shell/extensions/*; firefox > firefox.log 2>&1 &
<<< testapi::assert_screen(mustmatch='firefox-launch', timeout=90)

Expected output:

testapi::_handle_found_needle: found firefox_localfiles-firefox-launch-20161122, similarity 0.98 @ 246/480

needle couldn't be matched because of glitches in qemu. More precisely, killall -9 firefox;r was found as a remainder of the full command.

Actions #22

Updated by pgeorgiadis about 7 years ago

Same happens also to firefox_pagesaving test in Maintenance. Look here.

Actions #23

Updated by okurz about 7 years ago

  • Target version changed from Milestone 5 to Milestone 6
Actions #25

Updated by szarate about 7 years ago

  • Related to action #17408: Webinterface does not show any information when a worker fails to write to disk added
Actions #26

Updated by pgeorgiadis about 7 years ago

related https://openqa.suse.de/tests/796011#
firefox_passwd cannot even start

Actions #27

Updated by asmorodskyi about 7 years ago

interesting case when even simple "rm -f" stalls https://openqa.suse.de/tests/795598#step/zypper_lifecycle/48

Actions #28

Updated by mkravec about 7 years ago

type_string "DISPLAY=:1 xev\n" typed "DISv" instead
https://openqa.suse.de/tests/802819#step/vnc_two_passwords/11

Actions #29

Updated by RBrownSUSE about 7 years ago

  • Subject changed from [epic] Improvements on backend to improve better handling of stalls to [tools][epic] Improvements on backend to improve better handling of stalls
Actions #31

Updated by szarate almost 7 years ago

Actions #32

Updated by szarate almost 7 years ago

  • Target version changed from Milestone 6 to Milestone 8
Actions #33

Updated by szarate over 6 years ago

  • Target version deleted (Milestone 8)

I'm just removing the milestone from here, to be more realistic. I think we've done a lot to improve the stall handling... But this task needs to be groomed

Actions #34

Updated by asmorodskyi over 6 years ago

  • Blocked by action #25864: [tools][functional][u] stall detected in openqaworker-arm-1 through 3 sometimes - "worker performance issues" added
Actions #35

Updated by coolo over 6 years ago

  • Assignee deleted (szarate)

Santi is not working on it, so removing assignee

Actions #36

Updated by szarate about 6 years ago

  • Status changed from In Progress to New
Actions #37

Updated by okurz over 4 years ago

  • Category changed from 132 to Feature requests
Actions #38

Updated by okurz over 3 years ago

  • Status changed from New to Resolved
  • Assignee set to okurz

Looking at all the acceptance criteria from our current point of view I think we got them covered and we have improved the backend enough to handle stalls with all of better reporting, better handling in infrastructure (e.g. filesystem and mount point option choices) as well as improvements to prevent stalls in some more instances.

Actions #39

Updated by szarate over 3 years ago

  • Tracker changed from action to coordination
Actions

Also available in: Atom PDF