action #90332: have a reasonable timeout in lock API calls - openQA Project (public) - openSUSE Project Management Tool

action #90332

open

have a reasonable timeout in lock API calls

Added by okurz about 4 years ago. Updated about 4 years ago.

Status: Workable
Priority: Low
Assignee:
Category: Feature requests
Target version: QA (public) - future
Start date: 2021-03-19
Due date:
% Done:
Estimated time:

Description

Motivation

As brought up in the SUSE QE Tools workshop on 2021-03-19, multi-machine tests, especially during test development, can waste a lot of developer time and hardware resources because a wait on a barrier or mutex from https://github.com/os-autoinst/os-autoinst/blob/master/lockapi.pm never times out by itself. Instead, the complete openQA job is only aborted eventually, when it runs into MAX_JOB_TIME. Earlier timeouts could save some time and user confusion.

Acceptance criteria

AC1: waiting on mutex and barrier times out after a reasonable timeout
AC2: timeout can be configured when creating mutex and barrier

Suggestions

Research industry best practices for mutexes and barriers
What would be a good value for a "reasonable" timeout?
Add a timeout parameter with default value
Ensure it's covered in documentation
Inform about the change - can be implicit if your git commit message subjects are good enough :)
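
To make the "timeout parameter with default value" suggestion concrete, here is a minimal sketch using Python's standard threading primitives. It is only an in-process analogy, not the lockapi.pm API: the helper names make_barrier, wait_on_barrier and lock_with_timeout as well as the DEFAULT_TIMEOUT value are assumptions made up for illustration.

```python
import threading

# Hypothetical default; the ticket leaves the "reasonable" value open.
DEFAULT_TIMEOUT = 2.0  # seconds


def make_barrier(parties, timeout=DEFAULT_TIMEOUT):
    """Create a barrier whose wait() gives up after `timeout` seconds."""
    return threading.Barrier(parties, timeout=timeout)


def wait_on_barrier(barrier):
    """Return True if all parties arrived, False if the wait timed out."""
    try:
        barrier.wait()  # uses the timeout configured at creation
        return True
    except threading.BrokenBarrierError:
        # Raised when the wait times out (or the barrier was aborted).
        return False


def lock_with_timeout(mutex, timeout=DEFAULT_TIMEOUT):
    """Return True if the mutex was acquired within `timeout` seconds."""
    return mutex.acquire(timeout=timeout)
```

With this shape a caller that never passes a timeout still gets the default (AC1), while the value remains configurable at creation time (AC2).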

Updated by asmorodskyi about 4 years ago

This ticket was prompted by a problem that we (the QAC team) are facing in the wicked tests. We have barriers at the end of each module so that both instances wait for each other there. But if one instance hits an exception, it skips the code that is supposed to touch the barrier and proceeds to the next module, so both instances end up dead-locked, each waiting for the other in a place the other can never reach. Adding a timeout will certainly improve this situation slightly, but it still leaves room for failures: while node1 got an exception in module1 and proceeded to the barrier in module2, node2 is still in module1, and it is not straightforward which barrier will time out first. Because the timeouts are equal, the nodes may, instead of finally meeting, keep failing, timing out simultaneously in different barriers and running into new dead-locks further on :)

I wonder about a different approach here, though I am not sure it is technically possible: when node1 catches an exception in module1, openQA might detect that this is an MM job and that other jobs in the cluster are waiting for a mutex (this is the weakest part; I am not sure this is possible). If it does, openQA might auto-fail module1 for node2 so that the whole cluster can proceed.
