action #90332
have a reasonable timeout in lock API calls
Status: open, 0% done
Description
Motivation
As brought up in the SUSE QE Tools workshop on 2021-03-19, multi-machine tests, especially during test development, can waste a lot of developer time and hardware resources: a wait on a barrier or mutex from https://github.com/os-autoinst/os-autoinst/blob/master/lockapi.pm never times out by itself, and the complete openQA job is only aborted once it runs into MAX_JOB_TIME. Time and user confusion could be saved with earlier timeouts.
Acceptance criteria
- AC1: waiting on a mutex or barrier times out after a reasonable timeout
- AC2: the timeout can be configured when creating a mutex or barrier
Suggestions
- Research industry best practices for mutexes and barriers
- What would be a good choice for a "reasonable" timeout?
- Add a timeout parameter with a default value (see the sketch after this list)
- Ensure it's covered in documentation
- Inform about the change - can be implicit if your git commit message subjects are good enough :)
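
A rough sketch of what AC1/AC2 could look like from a test module's point of view. The `timeout` key, the hashref form used for it, and the concrete values are assumptions for illustration only, not the current lockapi.pm API:

```perl
use strict;
use warnings;
use lockapi;    # os-autoinst lock API (mutex_*, barrier_*)

# AC2: timeout configurable per call; the 'timeout' key is an assumed,
# not yet existing, parameter
barrier_create('module_done', 2);
barrier_wait({name => 'module_done', timeout => 1200});

# AC1: without an explicit value the call would still give up after a
# reasonable default instead of blocking until MAX_JOB_TIME
mutex_create('shared_resource');
mutex_lock('shared_resource');    # assumed to die after the default timeout
# ... critical section ...
mutex_unlock('shared_resource');
```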
Updated by asmorodskyi almost 4 years ago
The root cause for creating this ticket was a problem which we (the QAC team) are facing in the wicked tests: we have barriers at the end of each module so that both instances wait for each other at the end of the module. In case of an exception, however, one of the instances skips the code that is supposed to touch the barrier, which means that node proceeds to the next module and both end up dead-locked, waiting for each other in unreachable places. Adding a timeout will certainly improve this situation slightly, but it still leaves room for failures: while node1 got an exception in module1 and proceeded to the barrier in module2, node2 is still in module1, so it is not obvious which barrier will time out first. The nodes may never actually meet and instead keep failing, simultaneously timing out in different barriers and running into new dead-locks, because the timeouts are equal :)
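
For illustration, a minimal sketch of the failure mode described above (the module layout, helper and barrier names are made up):

```perl
# Both nodes run module1 and module2; each module ends on a barrier.
sub run {
    my ($self) = @_;
    do_module1_work();              # hypothetical helper; dies on node1
    barrier_wait('module1_done');   # never reached on node1 after the exception
}
# With a non-fatal module, node1 continues to module2 and waits on
# 'module2_done' while node2 is still waiting on 'module1_done'.
# With equal timeouts both waits can keep expiring "in step", so the
# nodes repeatedly time out in different barriers without ever meeting.
```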
I wonder about a different approach here, but I am not sure if it is technically possible: when node1 catches an exception in module1, openQA might detect that this is an MM job and that other jobs in the cluster are waiting for a mutex (this is the weakest part, I am not sure whether this is possible). If it does, openQA could auto-fail module1 for node2 as well, so the whole cluster can proceed further.