action #46295
Unrelated jobs needlessly cancelled as parallel_failed
Status: closed
Description
In Fedora tests, we use a 'support_server' model that I cribbed from SUSE tests some time ago. The 'support server' does stuff like running an iSCSI server, running an NFS server, etc.; various tests of the iSCSI client, NFS client etc. run in parallel with it.
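For illustration, the wiring is roughly like this in the test suite settings (the suite names below are simplified stand-ins, not necessarily our exact ones; PARALLEL_WITH is what ties the cluster together):

# parallel parent: the suite that runs the iSCSI target, NFS export, etc.
support_server:
    SUPPORT_SERVER_ROLES=iscsi,nfs    # roles shown here are illustrative

# parallel children: each client suite points at the same support server
install_iscsi:
    PARALLEL_WITH=support_server
install_nfs:
    PARALLEL_WITH=support_server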
In recent openQA, it seems like if any one of these parallel jobs fails, all the others immediately get cancelled as 'parallel_failed'. But that doesn't make any sense!
Case in point, these three jobs:
https://openqa.fedoraproject.org/tests/345457 (the support_server)
https://openqa.fedoraproject.org/tests/345479 (an iSCSI install test, which failed)
https://openqa.fedoraproject.org/tests/345522 (an NFS install test, which got cancelled)
It seems like as soon as the iSCSI install test failed, the support server test and the NFS install test were cancelled as 'parallel_failed'. But this is wrong. The iSCSI install test failing indicates that there's a bug in iSCSI installs. It doesn't mean there's anything wrong with the support server itself, or with NFS installs. There's no reason to cancel the other tests here.
Updated by AdamWill almost 6 years ago
I think this is a result of coolo's https://github.com/os-autoinst/openQA/pull/1666 , which was a response to https://progress.opensuse.org/issues/36565 - but I'm not sure if this consequence was intended.
Updated by coolo almost 6 years ago
This sounds like a user error. What you describe are 2 clusters with support_server+X each. If X fails, the support server is supposed to be stopped. That you combine these 2 into support_server+X+Y is a problem on your end.
Updated by AdamWill almost 6 years ago
Why should I run a separate support server for each test that uses a support server? That's just wasting a bunch of resources when it can perfectly well serve all of them. I'm pretty sure that's the model SUSE was using at the time we copied it. :P
I mean, I could do that, but it seems very wasteful. In this medium there are only two client tests, but for another medium we have like 7 or 8. Am I really supposed to run 7 or 8 instances of support_server in that medium? And call them what, support_server_1 through support_server_8?!
Updated by coolo almost 6 years ago
They can all be the same medium. But SUSE is using hpc_pdsh_supportserver, hpc_slurm_supportserver, hpc_pdsh_supportserver, ... These support servers sometimes specify different SUPPORT_SERVER_ROLES, but that's not important.
What's important is that each cluster forms a private network, and you generally want the dhcp4 and dhcp6 tests not to interfere with each other. So each private network gets its own support server serving DHCP.
Updated by coolo almost 6 years ago
That we spam the test suites with this strategy is discussed elsewhere :)
Updated by AdamWill almost 6 years ago
OK, well, I still don't agree. :P I guess I'll say the model is insufficient: it is assuming that all jobs which run parallel to each other are vital to each other's operation, which is simply not always true. We don't have any issues with client tests interfering with each other.
I can't believe I'm typing this, but maybe we need some sort of...dependency syntax? (The Scream emoji) So I can specify that all the client tests depend on the server test, but none of them depend on each other?
Updated by mkittler almost 6 years ago
Maybe just wrap the logic of _job_stop_child into an if-block so it won't be executed if disabled via openqa.ini.
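Something along these lines, as a rough sketch (the config key name and the sub body here are made up for illustration, not the actual code):

#!/usr/bin/env perl
# Sketch only, not the actual openQA code: gate the "cancel the rest of the
# cluster" logic behind a site-wide flag read from openqa.ini. The config key
# name here is invented for illustration.
use strict;
use warnings;
use feature 'say';

# pretend this hash came from parsing openqa.ini
my %config = (global => {cancel_cluster_on_parallel_failure => 0});

sub _job_stop_child {
    my ($job_id) = @_;
    # if the site-wide switch is off, never cancel related jobs as parallel_failed
    return unless $config{global}{cancel_cluster_on_parallel_failure};
    say "cancelling job $job_id as parallel_failed";
}

# with the flag off, the support server and the NFS job would be left alone
_job_stop_child($_) for 345457, 345522;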
Updated by AdamWill almost 6 years ago
That could work as a quick fix, but it still doesn't seem Truly Correct - it's not really a sitewide thing, any given site can have some cases where the whole cluster should be killed and some where it shouldn't. I mean, even in the case I'm using as an example, the correct behaviour is 'both'. If the support_server test fails, all the 'client' tests should be cancelled. But if any client test fails, none of the other tests needs to be cancelled.
Updated by AdamWill almost 6 years ago
So I've been staring at this all afternoon and it's actually not at all simple, not even to kludge it.
We can't actually use mkittler's suggestion because in the current implementation, _job_stop_child is badly misnamed; it's actually called on every job ID found by cluster_jobs, i.e. it really should be called _job_stop_cluster or something like that. It's not only called on children. So we can't just make calling it optional, because that would prevent any related cluster jobs being stopped on a cluster member failing or being cancelled, even the ones we all agree should be stopped.
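To spell out what I mean, the flow as I read it is roughly this (a paraphrase, not the actual source):

#!/usr/bin/env perl
# Paraphrase of how I read the current behaviour, not the real openQA code.
use strict;
use warnings;
use feature 'say';

sub _job_stop_child {
    my ($job_id) = @_;
    say "cancelling job $job_id as parallel_failed";
}

# when a job fails, every *other* job that cluster_jobs returned gets passed
# to _job_stop_child - parallel parents included, despite the sub's name
sub _job_stop {
    my ($failed_job, @cluster) = @_;
    _job_stop_child($_) for grep { $_ != $failed_job } @cluster;
}

# the iSCSI test (345479) failing takes down the support server (345457)
# and the unrelated NFS test (345522) with it
_job_stop(345479, 345457, 345479, 345522);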
It's also complicated by the fact we're using a single "find all jobs in cluster" sub - cluster_jobs - for various different purposes. I don't just want to unconditionally change its behaviour because the 'include parallel parents' behaviour is clearly correct in some cases, e.g. the 'duplicate' case; when we duplicate a parallel child, obviously we also want to duplicate its parent.
As a trial balloon, assuming I can implement it, how would you feel about this behaviour:
For 'stop' and 'cancel' (but NOT 'duplicate' or the other users), when cluster_jobs considers a parallel parent $p, do not call $p->cluster_jobs($jobs) if $p has pending children which are not in $jobs already.
Basically - don't kill parallel parents that have other children that we're not also killing.
The idea is this should satisfy both of us; in your case where you have lots of small neat "one parent, one child" parallel clusters, the child failing or not being scheduled due to a chained parent failure will cause the parallel parent to be cancelled, but in my case where we have "one parent, lots of children" clusters, a single child failing won't nuke the whole cluster.
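To make that concrete, here's a toy model of the rule (made-up job IDs and data structures, not the real openQA implementation):

#!/usr/bin/env perl
# Toy model of the proposed 'stop'/'cancel' rule, not the openQA implementation:
# only pull a parallel parent into the set being cancelled if it has no other
# pending children outside that set. IDs and states below are made up.
use strict;
use warnings;
use feature 'say';

# support server (1) is the parallel parent of three client jobs (2, 3, 4)
my %parallel_parents  = (2 => [1], 3 => [1], 4 => [1]);
my %parallel_children = (1 => [2, 3, 4]);
my %state             = (1 => 'running', 2 => 'failed', 3 => 'running', 4 => 'running');

sub is_pending { $state{$_[0]} eq 'scheduled' or $state{$_[0]} eq 'running' }

sub jobs_to_cancel {
    my ($job, $jobs) = @_;
    $jobs //= {};
    $jobs->{$job} = 1;
    for my $parent (@{$parallel_parents{$job} // []}) {
        next if $jobs->{$parent};
        # skip the parent while it still has pending children we're not cancelling
        next if grep { !$jobs->{$_} && is_pending($_) } @{$parallel_children{$parent} // []};
        jobs_to_cancel($parent, $jobs);
    }
    return $jobs;
}

# job 2 failing cancels only itself: the support server still has pending
# children 3 and 4, so it keeps running for them
say join ', ', sort keys %{jobs_to_cancel(2)};    # prints "2"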
Updated by AdamWill almost 6 years ago
As a bonus note, I actually have another fun case here: sometimes a chained parent should be included in cluster_jobs output (currently they never are). That case is when the child relies on an asset uploaded by the parent, and the asset has been cleaned up by the limit_assets minion task...
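A tiny sketch of the idea (everything here - IDs, asset names, structure - is made up for illustration):

#!/usr/bin/env perl
# Made-up illustration of the 'bonus' case, not openQA code: when restarting a
# child, also include its chained parent if an asset the child needs from that
# parent has already been cleaned up.
use strict;
use warnings;
use feature 'say';

my %asset_present   = ('disk_server.qcow2' => 0);      # already removed by asset cleanup
my %chained_parents = (1002 => [1001]);                 # child => chained parents (toy IDs)
my %parent_assets   = (1001 => ['disk_server.qcow2']);  # assets each parent uploads

sub jobs_to_restart {
    my ($child) = @_;
    my %jobs = ($child => 1);
    for my $parent (@{$chained_parents{$child} // []}) {
        # re-run the parent too if any asset the child needs from it is gone
        $jobs{$parent} = 1 if grep { !$asset_present{$_} } @{$parent_assets{$parent} // []};
    }
    return sort { $a <=> $b } keys %jobs;
}

say join ', ', jobs_to_restart(1002);    # prints "1001, 1002"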
Updated by AdamWill almost 6 years ago
Poking around in the git logs, I came across this PR:
https://github.com/os-autoinst/openQA/pull/1781
which provides this example of an interesting cluster:
+----------------+      +--------+
| Upgrade Node 1 +----->| Node 1 +-----+
+----------------+      +--------+     |
                                       |
                              +--------+------+
                              | Supportserver |
                              +--------+------+
                                       |
+----------------+      +--------+     |
| Upgrade Node 2 +----->| Node 2 +-----+
+----------------+      +--------+
This seems like a relevant example here, too. For your cases, if "Node 1" failed or was cancelled, would you also be expecting "Node 2" and "Supportserver" to be cancelled? And if so, is this denoted by e.g. having the "Node 1" and "Node 2" test suites marked as PARALLEL_WITH each other, or are they only marked as both being PARALLEL_WITH the support server?
Updated by coolo almost 6 years ago
Both nodes are indeed PARALLEL_WITH the supportserver, and this is an HA test, so we require all 3 nodes to be up for the test to pass.
Updated by AdamWill almost 6 years ago
Hmm, that's unfortunate then, because it makes it very hard to even design a 'works for everyone' solution to this...
Are 'Node 1' and 'Node 2' marked PARALLEL_WITH each other at all? Or is each only marked PARALLEL_WITH the server, but they have no directly stated relationship to each other?
If they don't, what would you think of a 'solution' to this problem which would mean Node 1 and Node 2 would have to have an explicit relationship to each other to ensure the other one and the support server got cancelled if either of them failed?
Updated by AdamWill almost 6 years ago
Since I wasn't really able to get anyone to commit to anything other than "stop having tests that work that way, it's inconvenient", I went ahead and made something that works for us, and sent a PR:
Updated by okurz over 5 years ago
- Category set to Feature requests
- Status changed from New to Resolved
- Assignee set to AdamWill
@AdamWill as your PR is merged and does everything just perfectly (documentation, tests, verification in staging+production), I would say we can call this resolved :) Reopen if I got something wrong.