coordination #14972


action #14804: encoder still stalls

[tools][epic] Backend improvements for better handling of stalls

Added by szarate about 7 years ago. Updated over 3 years ago.

Feature requests


This is a macro task to group the following tasks (and possibly discussions).

User story

As a test infrastructure admin, I would like openQA to handle stalls better, so that full worker performance capacity can be used without jobs failing all over.

acceptance criteria

  1. The backend uses ppm files for video encoding and generates last.png on demand
  2. Test results are still generated using png
  3. A threshold lets isotovideo decide when a SUT/worker is too slow and must be stopped
  4. Information is collected to build knowledge, so that thresholds can be decided by the openQA admin or by openQA on its own
  5. The worker/isotovideo informs the webui that a job failed, with a reason
  6. Jobs that failed due to known failures can be retriggered automatically
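
A minimal sketch of how the threshold in AC 3 could work, assuming the backend can measure how long each frame or check takes. The class name `StallMonitor` and the parameters `expected_interval`, `max_slowdown` and `window` are hypothetical illustrations and not part of isotovideo.

```python
from collections import deque


class StallMonitor:
    """Tracks how late recent frames were and flags a SUT/worker that is too slow."""

    def __init__(self, expected_interval, max_slowdown=3.0, window=5):
        # expected_interval: seconds one frame should take under normal load
        # max_slowdown: hypothetical threshold factor, e.g. chosen by the admin
        # window: number of recent samples averaged before deciding
        self.expected_interval = expected_interval
        self.max_slowdown = max_slowdown
        self.samples = deque(maxlen=window)

    def record(self, elapsed):
        """Store the time (in seconds) the latest frame actually took."""
        self.samples.append(elapsed)

    def slowdown(self):
        """Average slowdown factor over the window (1.0 means on time)."""
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples) / self.expected_interval

    def must_stop(self):
        """True only once the window is full and the average exceeds the threshold."""
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and self.slowdown() > self.max_slowdown


# Usage: feed measured durations and ask whether the job should be stopped.
monitor = StallMonitor(expected_interval=1.0, max_slowdown=3.0, window=3)
for elapsed in (1.0, 5.0, 6.0):
    monitor.record(elapsed)
print(monitor.must_stop())
```

Averaging over a window rather than reacting to a single slow sample is one possible design choice here. It avoids killing a job because of one short load spike, at the cost of reacting a few samples later.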

tasks (subtasks by themselves)

  1. Move to writing ppms instead of pngs [#14976]
    • Have write_img support pngs and ppms (basically let opencv pick the format)
    • Have the videoencoder write last.png
    • Have the worker tell the videoencoder whether it really needs last.png
    • Solve the last.png file for the worker to look at
  2. Let isotovideo decide when the job/SUT must be shut down based on statistics/threshold
    • Have isotovideo populate statistics (name of the test that died, and information on what happened)
    • Have isotovideo update the database when a job is slow or is killed because of slowness
    • Have isotovideo decide when to die, based on a threshold/factor calculated by the webui when instantiating the worker
    • Have the webui handle backend-informed failures
  3. Retrigger jobs with known failures
    • Have the webui/scheduler verify when a job was marked as failed or incomplete due to a known failure (and mark it)
    • Have the user (and later on the scheduler) retrigger jobs based on a number/factor that may be defined by the openQA administrator
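
The retrigger decision in task 3 could be sketched roughly as follows. This is only an illustration under assumptions: the `Job` record, the `known_failures` set and the `max_retries` limit are hypothetical names and do not reflect the openQA schema or scheduler API.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical, simplified job record; not the openQA database schema.
    result: str        # e.g. "passed", "failed" or "incomplete"
    reason: str        # failure reason reported by the backend
    retries: int = 0   # how many times this job was already retriggered


def should_retrigger(job, known_failures, max_retries, enabled=False):
    """Decide whether a job qualifies for an automatic retrigger."""
    if not enabled:
        # The feature stays off unless the administrator opts in.
        return False
    if job.result not in ("failed", "incomplete"):
        return False
    if job.reason not in known_failures:
        # Only failures already classified as known are retried.
        return False
    # Stop once the administrator-defined retry limit is reached.
    return job.retries < max_retries


# Usage: a stalled job is retried, but only when the feature is enabled.
known = {"stall detected"}
job = Job(result="incomplete", reason="stall detected", retries=1)
print(should_retrigger(job, known, max_retries=3, enabled=True))
```

Capping the retries keeps a permanently broken worker from retriggering the same job indefinitely.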

further details

Acceptance criteria details:

  • AC 3: Has to collect statistics on:

    • How many tests are failing due to AC 4
    • Timing for AC 1
    • Some more statistics yet to be defined (e.g. worker load when the job is being cancelled)
  • AC 5: Has to be accomplished using the backend field in the job table, in JSON format.

  • AC 6: Has to be disabled by default, since this feature might only be deployed in unsupervised environments (i.e. it will not be used for test development)
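
For AC 5, a failure report stored as JSON could look roughly like the sketch below. The key names (`reason`, `test`, `details`) and both helper functions are assumptions made for illustration; the ticket only specifies that the backend field of the job table holds JSON.

```python
import json


def build_failure_report(reason, test_name, details=None):
    """Serialize a backend-informed failure as a JSON string."""
    report = {"reason": reason, "test": test_name, "details": details or {}}
    # sort_keys gives a stable string, which makes stored values easier to compare.
    return json.dumps(report, sort_keys=True)


def parse_failure_report(raw):
    """Return the report dict, or None if the field is empty or not a JSON object."""
    if not raw:
        return None
    try:
        report = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return report if isinstance(report, dict) else None


# Usage: the worker side writes the report, the webui side reads it back.
raw = build_failure_report("stall detected", "example_test", {"worker_load": 12.5})
print(parse_failure_report(raw)["reason"])
```

Returning `None` for empty or malformed values lets the reading side treat jobs without a report the same way as jobs from before the field was populated.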

More details on benchmarks and other discussions can be found in the following links:


autoinst-log.txt (1.15 MB) szarate, 2016-12-30 14:45

Subtasks 1 (0 open, 1 closed)

action #14976: Change the isotovideo backend to write ppm files (Resolved, szarate, 2016-11-24)


Related issues 11 (0 open, 11 closed)

Related to openQA Tests - action #15118: Text is mistyped on aarch64 (Rejected, zluo, 2016-11-29)

Related to openQA Tests - action #12250: [sporadic] "QEMU: usb-kbd: warning: key event queue full" on ppc64le and aarch64 (Resolved, 2016-06-03)

Related to openQA Project - action #14072: [tools] monitor our load (Resolved, 2016-10-05)

Related to openQA Project - action #12064: Improved logging for debugging performance related issues (Resolved, okurz, 2016-05-19)

Related to openQA Project - action #10418: worker: do not warn on expected problems (Resolved, mkittler, 2016-01-25)

Related to openQA Tests - action #13276: [tools] 'assert_screen fails, but we detected a timeout in the process, so we abort' aka. "stall detected" (Resolved, okurz, 2016-08-19)

Related to openQA Project - action #17408: Webinterface does not show any information when a worker fails to write to disk (Rejected, 2017-03-01)

Related to openQA Project - action #18286: Catch CPU lockups (Rejected, 2017-04-03)

Blocked by openQA Tests - action #25864: [tools][functional][u] stall detected in openqaworker-arm-1 through 3 sometimes - "worker performance issues" (Resolved, okurz, 2017-10-09)

Precedes openQA Project - action #13242: WDYT: For every job that does not have a label or bugref, retrigger some times to see if it's sporadic. Like rescheduling on incomplete but on failed (Rejected, okurz, 2016-11-25)

Precedes openQA Project - action #16166: Log per test (Rejected, 2017-01-24)

