coordination #14972: [tools][epic] Improvements on backend to improve better handling of stalls - openQA Project (public) - openSUSE Project Management Tool

This is a macro task to group the following tasks (and possibly discussions).

 ## User story 
 As a test infrastructure admin, I would like openQA backend developers to optimize the backend to better handle stalls, so that the full worker performance capacity can be used without jobs failing all over, and so that SUTs with slow performance are stopped.


 ## acceptance criteria 
 1. The backend uses ppm files for the video encoding and generates last.png on demand.
 2. Test results are still generated using png.
 3. Have a threshold that allows isotovideo to decide when a SUT/worker is too slow and must be stopped.
 4. Collect information to generate knowledge so that thresholds can be decided by the openQA admin or by openQA on its own.
 5. The worker/isotovideo informs the webui that a job failed, with a reason.
 6. Jobs that have failed due to known failures can be retriggered automatically.
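
 A minimal sketch of the kind of threshold check that AC 3 describes, written in Python for illustration only. The class name `StallMonitor`, its fields, and the idea of counting consecutive slow samples are assumptions made for this example and are not taken from os-autoinst.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StallMonitor:
    """Hypothetical helper that decides when a SUT/worker is too slow."""

    expected_interval: float    # seconds one step is expected to take (assumed)
    threshold_factor: float     # e.g. 3.0 means "three times slower than expected"
    max_slow_samples: int = 3   # consecutive slow samples before giving up
    samples: List[float] = field(default_factory=list)
    _slow_streak: int = 0

    def record(self, elapsed: float) -> bool:
        """Record one measurement; return True if the job should be stopped."""
        self.samples.append(elapsed)  # kept so statistics can be reported later
        if elapsed > self.expected_interval * self.threshold_factor:
            self._slow_streak += 1
        else:
            self._slow_streak = 0     # a single fast sample resets the streak
        return self._slow_streak >= self.max_slow_samples
```

 Requiring several consecutive slow samples, rather than a single one, is one possible way to avoid stopping a SUT on a short hiccup; the recorded samples could feed the statistics mentioned in AC 4.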

 ## tasks (subtasks by themselves) 

 1. Move to writing ppms instead of pngs [#14976] 
	 - Have write_img support both pngs and ppms (basically let OpenCV pick the format) 
	 - Have the videoencoder write last.png 
	 - Have the worker tell the videoencoder whether it really needs last.png 
	 - Solve how last.png is provided for the worker to look at 
 2. Let isotovideo decide when the job/SUT must be shut down, based on statistics/threshold 
	 - Have isotovideo populate statistics (name of the test that died, and information on what happened) 
	 - Have isotovideo update the database when a job is being slow or is killed because of slowness 
	 - Have isotovideo decide when to die, based on a threshold/factor calculated by the webui when instantiating the worker 
	 - Have the webui be able to handle backend-informed failures 
 3. Retrigger jobs with known failures 
	 - Have the webui/scheduler verify when a job was marked as failed or incomplete due to a known failure (and mark it) 
	 - Have the user (and later on the scheduler) re-trigger jobs based on a number/factor that may be defined by the openQA administrator 
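
 To illustrate why task 1 aims at ppm for intermediate frames: a binary PPM (P6) file is just a short text header followed by the raw RGB bytes, so no compression work is needed when writing it. The Python sketch below shows that layout; the function name `encode_ppm` is hypothetical and this is not the os-autoinst implementation, which goes through OpenCV (whose `imwrite` picks the encoder from the file extension).

```python
def encode_ppm(width: int, height: int, rgb: bytes) -> bytes:
    """Return a binary PPM (P6) image for raw 8-bit RGB pixel data.

    Hypothetical helper for illustration: the header is plain ASCII and
    the pixel data is appended unmodified, with no compression step.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if len(rgb) != width * height * 3:
        raise ValueError("rgb must contain exactly width * height * 3 bytes")
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + rgb


# Usage: a 2x1 image with one red and one green pixel.
frame = encode_ppm(2, 1, bytes([255, 0, 0, 0, 255, 0]))
```

 The trade-off is file size: uncompressed frames are much larger than pngs, which fits the acceptance criteria of using ppm only for the video encoding path while test results stay png.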

 ## further details 

 Acceptance criteria details: 

 * AC 3: Has to collect statistics on:  
	 - How many tests are failing due to **AC 4** 
	 - Timing for **AC 1** 
	 - Some more statistics yet to be defined (e.g. worker load when the job is being cancelled) 

 * AC 5: Has to be accomplished using the backend field in the job table in json format. 
 * AC 6: Has to be disabled by default, since this is a feature that might only be deployed in unsupervised environments (i.e. it will not be used for test development)  
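
 The details for AC 5 and AC 6 could be sketched together as follows. This is a Python illustration under stated assumptions: the JSON layout (`failure`, `reason`, `test`, `details`), the reason identifier `sut_too_slow`, and both function names are invented for the example, since the ticket does not define a schema.

```python
import json

# Hypothetical identifiers for failures considered "known" (assumption).
KNOWN_FAILURES = {"sut_too_slow"}


def failure_report(reason: str, test_name: str, details: str) -> str:
    """Build the JSON text that could be stored in the job's backend field."""
    payload = {"failure": {"reason": reason, "test": test_name, "details": details}}
    return json.dumps(payload, sort_keys=True)


def should_retrigger(backend_json: str, retries_so_far: int, max_retries: int = 0) -> bool:
    """Decide whether a failed job may be retriggered automatically.

    max_retries defaults to 0, so automatic retriggering stays disabled
    unless an administrator explicitly raises the limit.
    """
    try:
        failure = json.loads(backend_json).get("failure")
    except (ValueError, AttributeError):
        return False  # not valid JSON, or not a JSON object
    if not isinstance(failure, dict):
        return False
    return failure.get("reason") in KNOWN_FAILURES and retries_so_far < max_retries
```

 With the default limit of zero the check always returns False, which matches the requirement that the feature be off unless deliberately enabled.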

 More details on benchmarks and other discussions can be found in the following links: 

 * https://progress.opensuse.org/issues/14804 
 * https://github.com/os-autoinst/os-autoinst/pull/648 
 * https://docs.google.com/spreadsheets/d/1QbV1VGe5bEpzKPRIJRsBzcJnWW_KTrWD-nXkjuWYXdo/edit?usp=sharing 
 * https://github.com/os-autoinst/os-autoinst/commit/2a7ac57ce0556114d62975d718a6ac039b5e93b6
