action #19432
closed
[multimachine][scheduling] Failure of one multi-machine job causes restart of all of them without checking the state of the others
Added by asmorodskyi over 7 years ago.
Updated over 5 years ago.
Category: Regressions/Crashes
Description
- Create a multi-machine test suite (support_server / node1 / node2)
- Invoke it, but emulate a situation where one of the node* tests dies
==> this causes a restart of the whole test suite while the old instances of node2 and support_server are still running. The main trouble comes from the second instance of the support server, because it causes conflicts in the network configuration.
Expected:
Possible solutions:
- Do not restart multi-machine tests at all
- Define a back-end mutex that is released only after all multi-machine tests have finished, and restart the whole test suite only after this event
- If we do want to restart, force-kill the still-surviving tests first and only then restart (see the sketch below)
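A minimal sketch of the third option, assuming the public openQA REST endpoints GET /api/v1/jobs/&lt;id&gt; and POST /api/v1/jobs/&lt;id&gt;/cancel; the instance URL, job IDs, helper names, and polling strategy are illustrative, not the actual scheduler code, and API-key authentication is omitted.

```python
# Illustrative sketch only: cancel every still-running job of a parallel
# cluster and wait until none is running before the cluster is restarted.
import time
import requests

OPENQA = "https://openqa.example.com"  # hypothetical instance URL


def job_state(job_id):
    """Return the current state ('running', 'done', ...) of a job."""
    r = requests.get(f"{OPENQA}/api/v1/jobs/{job_id}")
    r.raise_for_status()
    return r.json()["job"]["state"]


def cancel_cluster(job_ids, poll_seconds=5, timeout=300):
    """Cancel all jobs in the cluster and wait until none is running."""
    for job_id in job_ids:
        requests.post(f"{OPENQA}/api/v1/jobs/{job_id}/cancel")

    deadline = time.time() + timeout
    while time.time() < deadline:
        if all(job_state(j) != "running" for j in job_ids):
            return True
        time.sleep(poll_seconds)
    return False


# Only restart the whole test suite once no stale support_server or node*
# instance can still hold the tap devices / network configuration.
if cancel_cluster([1001, 1002, 1003]):  # hypothetical job IDs
    print("cluster stopped, safe to restart")
```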
- Description updated (diff)
Why is this high? This has existed for a while - or is it a regression?
Richard asked me to post it as High.
- Related to action #14334: job incomplete: "could not configure /dev/net/tun (tap00): Device or resource busy" added
- Subject changed from "[tools][multimachine] Failure of one multi-machine job causes restart of all of them without checking the state of the others" to "[multimachine] Failure of one multi-machine job causes restart of all of them without checking the state of the others"
- Category set to 122
- Status changed from New to Feedback
- Priority changed from High to Normal
I'm not really sure what causes the restart of the tests. Do we still have auto-restart code somewhere?
- Subject changed from "[multimachine] Failure of one multi-machine job causes restart of all of them without checking the state of the others" to "[multimachine][scheduling] Failure of one multi-machine job causes restart of all of them without checking the state of the others"
- Category changed from 122 to Regressions/Crashes
- Assignee set to asmorodskyi
@asmorodskyi does this still apply? I have never seen this problem myself.
- Assignee changed from asmorodskyi to okurz
Haven't seen it for quite a long time. Also, it was noticed in HPC scenarios, and since I don't maintain them anymore, I have no chance to see this anymore :)
Taking into account that this issue was reported BEFORE one (or even a few?) major scheduler rewrites, I think we can easily close it, but I'm leaving this to @okurz.
- Status changed from Feedback to Resolved
Yeah, that's what I assumed. Thanks for the clarification.