action #46295: Unrelated jobs needlessly cancelled as parallel_failed - openQA Project (public) - openSUSE Project Management Tool

Unrelated jobs needlessly cancelled as parallel_failed

Added by AdamWill almost 6 years ago. Updated over 5 years ago.

Status: Resolved
Priority: High
Assignee: AdamWill
Category: Feature requests
Target version:
Start date: 2019-01-16
Due date:
% Done:
Estimated time:

Description

In Fedora tests, we use a 'support_server' model that I cribbed from SUSE tests some time ago. The 'support server' does stuff like running an iSCSI server, running an NFS server, etc; various tests of iSCSI client, NFS client etc. run in parallel with it.

In recent openQA, it seems like if any one of these parallel jobs fails, all the others immediately get cancelled as 'parallel_failed'. But that doesn't make any sense!

Case in point, these three jobs:

https://openqa.fedoraproject.org/tests/345457 (the support_server)
https://openqa.fedoraproject.org/tests/345479 (an iSCSI install test, which failed)
https://openqa.fedoraproject.org/tests/345522 (an NFS install test, which got cancelled)

It seems like as soon as the iSCSI install test failed, the support server test and the NFS install test were cancelled as 'parallel_failed'. But this is wrong. The iSCSI install test failing indicates that there's a bug in iSCSI installs. It doesn't mean there's anything wrong with the support server itself, or with NFS installs. There's no reason to cancel the other tests here.

Updated by AdamWill almost 6 years ago

I think this is a result of coolo's https://github.com/os-autoinst/openQA/pull/1666 , which was a response to https://progress.opensuse.org/issues/36565 - but I'm not sure if this consequence was intended.

Updated by coolo almost 6 years ago

This sounds like a user error. What you describe is two clusters, each consisting of support_server+X. If X fails, the support server is supposed to be stopped. Combining these two into support_server+X+Y is a problem on your end.

Updated by AdamWill almost 6 years ago

Why should I run a separate support server for each test that uses a support server? That's just wasting a bunch of resources when it can perfectly well serve all of them. I'm pretty sure that's the model SUSE was using at the time we copied it. :P

I mean, I could do that, but it seems very wasteful. In this medium there are only two client tests, but for another medium we have like 7 or 8. Am I really supposed to run 7 or 8 instances of support_server in that medium? And call them what, support_server_1 through support_server_8?!

Updated by coolo almost 6 years ago

They can all be in the same medium. But SUSE is using hpc_pdsh_supportserver, hpc_slurm_supportserver, hpc_pdsh_supportserver ... These support servers sometimes specify different SUPPORT_SERVER_ROLES, but that's not important.

What matters is that each cluster forms a private network, and you generally want dhcp4 and dhcp6 tests not to interfere with each other. So each private network gets its own support server serving dhcp.

Updated by coolo almost 6 years ago

That we spam the test suites with this strategy is discussed elsewhere :)

Updated by AdamWill almost 6 years ago

OK, well, I still don't agree. :P I guess I'll say the model is insufficient: it is assuming that all jobs which run parallel to each other are vital to each other's operation, which is simply not always true. We don't have any issues with client tests interfering with each other.

I can't believe I'm typing this, but maybe we need some sort of...dependency syntax? (The Scream emoji) So I can specify that all the client tests depend on the server test, but none of them depend on each other?

Updated by mkittler almost 6 years ago

Maybe just wrap the logic of _job_stop_child in an if-block so that it won't be executed if disabled via openqa.ini.

Updated by AdamWill almost 6 years ago

That could work as a quick fix, but it still doesn't seem Truly Correct - it's not really a sitewide thing, any given site can have some cases where the whole cluster should be killed and some where it shouldn't. I mean, even in the case I'm using as an example, the correct behaviour is 'both'. If the support_server test fails, all the 'client' tests should be cancelled. But if any client test fails, none of the other tests needs to be cancelled.
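To make the 'both' behaviour concrete, here is a minimal sketch in Python of dependency-aware cancellation. It is not openQA code: the function name jobs_to_cancel, the job names install_iscsi and install_nfs, and the parallel_parents mapping are all hypothetical, chosen only to model the support server example from this ticket.

```python
from collections import defaultdict


def jobs_to_cancel(failed, parallel_parents):
    """Return the jobs that should be cancelled when `failed` fails.

    `parallel_parents` maps each job name to the set of jobs it depends on
    (its parallel parents). Only jobs that depend on the failed job,
    directly or transitively, are cancelled. Illustrative model only.
    """
    # Invert the mapping: parent -> set of direct children.
    children = defaultdict(set)
    for job, parents in parallel_parents.items():
        for parent in parents:
            children[parent].add(job)

    to_cancel = set()
    stack = [failed]
    while stack:
        current = stack.pop()
        for child in children[current]:
            if child != failed and child not in to_cancel:
                to_cancel.add(child)
                stack.append(child)
    return to_cancel


# Hypothetical cluster: two client tests sharing one support server.
cluster = {
    "support_server": set(),
    "install_iscsi": {"support_server"},
    "install_nfs": {"support_server"},
}

print(sorted(jobs_to_cancel("support_server", cluster)))
print(sorted(jobs_to_cancel("install_iscsi", cluster)))
```

In this model a support_server failure takes both clients down, while a failing client takes nothing else with it, which is the asymmetry a sitewide on/off switch cannot express.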

Updated by AdamWill almost 6 years ago

So I've been staring at this all afternoon and it's actually not at all simple, not even to kludge it.

We can't actually use mkittler's suggestion because in the current implementation, _job_stop_child is badly misnamed; it's actually called on every job ID found by cluster_jobs, i.e. it really should be called _job_stop_cluster or something like that. It's not only called on children. So we can't just make calling it optional, because that would prevent any related cluster jobs being stopped on a cluster member failing or being cancelled, even the ones we all agree should be stopped.

It's also complicated by the fact we're using a single "find all jobs in cluster" sub - cluster_jobs - for various different purposes. I don't just want to unconditionally change its behaviour because the 'include parallel parents' behaviour is clearly correct in some cases, e.g. the 'duplicate' case; when we duplicate a parallel child, obviously we also want to duplicate its parent.

As a trial balloon, how would you feel about this behaviour, assuming I can implement it?:

For 'stop' and 'cancel' (but NOT 'duplicate' or the other users), when cluster_jobs considers a parallel parent $p, do not call $p->cluster_jobs($jobs) if $p has pending children which are not in $jobs already.

Basically - don't kill parallel parents that have other children that we're not also killing.

The idea is this should satisfy both of us; in your case where you have lots of small neat "one parent, one child" parallel clusters, the child failing or not being scheduled due to a chained parent failure will cause the parallel parent to be cancelled, but in my case where we have "one parent, lots of children" clusters, a single child failing won't nuke the whole cluster.

#10

Updated by AdamWill almost 6 years ago

As a bonus note, I actually have another fun case here: sometimes a chained parent should be included in cluster_jobs output (currently they never are). That case is when the child relies on an asset uploaded by the parent, and the asset has been cleaned up by the limit_assets minion task...

#11

Updated by AdamWill almost 6 years ago

Poking around in the git logs, I came across this PR:

https://github.com/os-autoinst/openQA/pull/1781

which provides this example of an interesting cluster:

+------------+      +----------------+
|  Upgrade   |      |   Node 1       |
|  Node1     +------>                +---+------------------+
+------------+      +----------------+   | Supportserver    |
                                     |   |                  |
+------------+      +--------------------+------------------+
| Upgrade    |      |   Node 2       |
| Node 2     +------>                |
+------------+      +----------------+

This seems like a relevant example here, too. For your cases, if "Node 1" failed or was cancelled, would you also expect "Node 2" and "Supportserver" to be cancelled? And if so, is this denoted by e.g. having the "Node 1" and "Node 2" test suites marked as PARALLEL_WITH each other, or are they only marked as both being PARALLEL_WITH the support server?

#12

Updated by coolo almost 6 years ago

Both nodes are indeed PARALLEL_WITH the supportserver, and this is an HA test, so we do require all 3 nodes to be up for the test to pass.

#13

Updated by AdamWill almost 6 years ago

Hmm, that's unfortunate then, because it makes it very hard to even design a 'works for everyone' solution to this...

Are 'Node 1' and 'Node 2' marked PARALLEL_WITH each other at all? Or is each only marked PARALLEL_WITH the server, but they have no directly stated relationship to each other?

If they don't, what would you think of a 'solution' to this problem which would mean Node 1 and Node 2 would have to have an explicit relationship to each other to ensure the other one and the support server got cancelled if either of them failed?

#14

Updated by AdamWill almost 6 years ago

Since I wasn't really able to get anyone to commit to anything other than "stop having tests that work that way it's inconvenient", I went ahead and made something that works for us, and sent a PR:

https://github.com/os-autoinst/openQA/pull/2017

#15

Updated by okurz over 5 years ago

Category set to Feature requests
Status changed from New to Resolved
Assignee set to AdamWill

@AdamWill as your PR is merged and does everything just perfectly (documentation, tests, verification in staging+production), I would say we can call this resolved :) Reopen if I got something wrong.
