action #19432
[multimachine][scheduling] Fail of one multi-machine jobs cause restart all of them without checking state of others
Status: Closed
Description
- Create a multi-machine test suite (support_server / node1 / node2)
- Run it, but emulate the situation where one of the node* tests dies
==> this causes a restart of the whole test suite while the old instances of node2 and support_server are still running. The main trouble comes from the second instance of support_server, because it causes conflicts in the network configuration.
Expected:
Possible solutions:
- Do not restart multi-machine tests at all
- Define a back-end mutex that is released only after all multi-machine tests have finished, and restart the whole test suite only after that event
- If we do want to restart, force-kill the still-surviving tests first and only then restart (see the sketch below)
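A minimal sketch of the third option, written against the openQA REST API as I understand it: collect the parallel cluster of the failed job, cancel every member that is still running, and only then trigger the restart. The instance URL, the JSON field names ("parents"/"children"/"Parallel"/"state") and the omitted API-key authentication are assumptions for illustration, not a reviewed implementation.

#!/usr/bin/env python3
"""Sketch: before restarting a failed multi-machine job, cancel all parallel
cluster jobs that are still running, so no stale support_server keeps the
network/tap configuration busy. URL and field names are assumptions."""
import requests

OPENQA = "http://openqa.example.com/api/v1"  # assumed instance URL


def job_info(job_id):
    # GET /api/v1/jobs/<id> returns a document containing a "job" object
    return requests.get(f"{OPENQA}/jobs/{job_id}").json()["job"]


def parallel_cluster(job_id):
    # The failed job plus its parallel parents and children
    job = job_info(job_id)
    cluster = {job_id}
    cluster.update((job.get("parents") or {}).get("Parallel", []))
    cluster.update((job.get("children") or {}).get("Parallel", []))
    return cluster


def restart_cluster_cleanly(failed_job_id):
    # 1. force-kill every cluster job that has not finished yet
    for jid in parallel_cluster(failed_job_id):
        if job_info(jid)["state"] != "done":
            requests.post(f"{OPENQA}/jobs/{jid}/cancel")  # API-key auth omitted
    # 2. only now restart the whole cluster via the failed job
    requests.post(f"{OPENQA}/jobs/{failed_job_id}/restart")


if __name__ == "__main__":
    restart_cluster_cleanly(12345)  # hypothetical job id

In practice this ordering would belong in the scheduler itself rather than in an external script, but it illustrates the point: kill the survivors first, restart second.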
Updated by coolo over 7 years ago
Why is this high priority? Has this existed for a while, or is it a regression?
Updated by SLindoMansilla over 7 years ago
- Related to action #14334: job incomplete: "could not configure /dev/net/tun (tap00): Device or resource busy" added
Updated by SLindoMansilla over 7 years ago
For a workaround, see here: https://progress.opensuse.org/issues/14334#note-15
Updated by coolo about 7 years ago
- Subject changed from [tools][multimachine] Fail of one multi-machine jobs cause restart all of them without checking state of others to [multimachine] Fail of one multi-machine jobs cause restart all of them without checking state of others
- Category set to 122
- Status changed from New to Feedback
- Priority changed from High to Normal
I'm not really sure what causes the restart of the tests. Do we still have auto-restart code somewhere?
Updated by okurz over 5 years ago
- Subject changed from [multimachine] Fail of one multi-machine jobs cause restart all of them without checking state of others to [multimachine][scheduling] Fail of one multi-machine jobs cause restart all of them without checking state of others
- Category changed from 122 to Regressions/Crashes
- Assignee set to asmorodskyi
@asmorodskyi, does this still apply? I have never seen this problem myself.
Updated by asmorodskyi over 5 years ago
- Assignee changed from asmorodskyi to okurz
I haven't seen it for quite a long time. Also, it was noticed in HPC scenarios, and since I no longer maintain those I have no chance to see this anymore :)
Taking into account that this issue was reported BEFORE one (or even a few?) major scheduler rewrites, I think we can easily close it, but I am leaving this to @okurz.
Updated by okurz over 5 years ago
- Status changed from Feedback to Resolved
Yeah, that's what I assumed. Thanks for the clarification.