action #32545

Catch multi-machine clusters misconfigured

Added by asmorodskyi about 2 years ago. Updated 4 months ago.

Status: New
Start date: 28/02/2018
Priority: Normal
Due date: -
Assignee: -
% Done: 0%
Category: Feature requests
Target version: Ready
Difficulty: -
Duration: -

Description

Support server
https://openqa.suse.de/tests/1510396/file/autoinst-log.txt:

[2018-02-28T11:51:39.0409 CET] [debug] mutex create 'support_server_ready'

Parallel job
https://openqa.suse.de/tests/1510394/file/autoinst-log.txt:

mutex lock 'support_server_ready' unavailable, sleeping 5s (repeated for 2 hours)
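
For context, this is the usual pattern behind those log lines, as a minimal sketch rather than the actual test code of these jobs (lockapi is the os-autoinst module providing the mutex functions):

    # Sketch of the os-autoinst lockapi pattern behind the log lines above.
    use lockapi;    # provides mutex_create, mutex_lock, mutex_unlock, ...

    # In the support server test module, once its services are up:
    mutex_create('support_server_ready');

    # In the parallel (client) test module, before using those services:
    mutex_lock('support_server_ready');    # retries every 5 s while unavailable
    mutex_unlock('support_server_ready');

Mutexes are only shared between jobs of the same parallel cluster, so a job that is not scheduled together with the support server keeps retrying like this until it times out.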


Related issues

Related to openQA Project - action #35140: Parallel job don't see mutex of sibling (Resolved, 18/04/2018)
Related to openQA Project - action #36727: job_grab does not cope with parallel cycles (Resolved, 04/06/2018)
Related to openQA Tests - action #39128: Misconfigured HA Cluster (Resolved, 03/08/2018)

History

#1 Updated by okurz about 2 years ago

  • Category set to Concrete Bugs

#2 Updated by pcervinka about 2 years ago

Were the jobs manually restarted? If so, hpc_mrsh_slave should be restarted to re-trigger all related tests.
If hpc_mrsh_master or hpc_mrsh_supportserver is restarted instead, not all tests are triggered and the relation between the jobs is lost.

For example, hpc_mrsh_master has a relation only to hpc_mrsh_slave after a manual restart: https://openqa.suse.de/tests/1510394#settings

The question is whether this is a re-triggering issue in openQA itself or a wrong test suite definition.

#3 Updated by asmorodskyi about 2 years ago

I retriggered them one more time; let's see if the issue reproduces.

#4 Updated by asmorodskyi about 2 years ago

The retriggered jobs failed in the same way, and they again have the broken dependencies mentioned by @pcervinka. Probably this is the cause of the problem? I will try an isos post to see whether it behaves the same.

#5 Updated by asmorodskyi about 2 years ago

After the isos post the jobs get correct relations. It might be that the issue only occurs with a manual retrigger, but let's wait until the jobs finish.

#6 Updated by asmorodskyi about 2 years ago

  • Priority changed from Urgent to Normal

Yes, the problem is related to the manual restart; lowering the priority.

#7 Updated by asmorodskyi about 2 years ago

  • Subject changed from Multi-machine job fail to detect that mutex is created to [sporadic] Multi-machine job fail to detect that mutex is created
  • Priority changed from Normal to Urgent

Looks like a sporadic issue; it hit the HPC job group again, and this time the jobs were started with an isos post. Increasing the priority.

#8 Updated by coolo about 2 years ago

  • Subject changed from [sporadic] Multi-machine job fail to detect that mutex is created to Catch multi-machine clusters misconfigured
  • Priority changed from Urgent to Normal
  • Target version set to Ready

The problem is that:

  • the client had parallel_with supportserver
  • the server had parallel_with supportserver,client

As such, the cluster client+supportserver had no relation to the server. This is a misconfiguration we need to catch while posting, but it's not urgent.

#10 Updated by asmorodskyi almost 2 years ago

  • Related to action #35140: Parallel job don't see mutex of sibling added

#11 Updated by szarate almost 2 years ago

  • Priority changed from Normal to Urgent

I think this will be needed

#12 Updated by szarate almost 2 years ago

  • Related to action #36727: job_grab does not cope with parallel cycles added

#13 Updated by coolo over 1 year ago

  • Assignee set to coolo

#14 Updated by coolo over 1 year ago

  • Target version changed from Ready to Current Sprint

#15 Updated by szarate over 1 year ago

#16 Updated by coolo over 1 year ago

  • Assignee deleted (coolo)
  • Priority changed from Urgent to Normal
  • Target version changed from Current Sprint to Ready

#17 Updated by mkittler 11 months ago

> The problem is that:
>
>   • the client had parallel_with supportserver
>   • the server had parallel_with supportserver,client
>
> As such, the cluster client+supportserver had no relation to the server. This is a misconfiguration we need to catch while posting, but it's not urgent.

Not sure whether I understand this correctly. Are server and supportserver different tests? What is the expected configuration? How would openQA know what the expected configuration is?

#18 Updated by asmorodskyi 11 months ago

If you open the HPC job group you will see several MM tests grouped in sets of three jobs:

  • hpc_mrsh_master, hpc_mrsh_slave, hpc_mrsh_supportserver
  • hpc_munge_master, hpc_munge_slave, hpc_munge_supportserver

or

  • hpc_ganglia_server, hpc_ganglia_client, hpc_ganglia_supportserver

I agree the terminology is confusing, but we have what we have :) If you still have questions, feel free to ping me on IRC.

#19 Updated by mkittler 11 months ago

Ok, so *_server and *_supportserver are different tests and are supposed to run in parallel; this is how it looks when the cluster is configured and displayed correctly: https://openqa.suse.de/tests/2837345#dependencies

But I'm still not sure what kind of misconfiguration we're looking for and how openQA would be able to detect it.

#20 Updated by coolo 11 months ago

Meanwhile both master and _slave have PARALLEL_WITH=supportserver. What was misconfigured was that _master had PARALLEL_WITH=supportserver and slave had PARALLEL_WITH=supportserver,master.

This led to two clusters being formed; one of them had no slave, and as _master already had a parent, _slave ran on its own. What is pretty hard to understand is that with PARALLEL_WITH you're setting a parent, and there can only be one. But it's possible that the new cluster code already eliminates this problem. To be checked, but it's still a settings smell.

#21 Updated by mkittler 11 months ago

The frontend code I once wrote for the dependency graph handles this situation by simply putting all those jobs into one big cluster: https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/WebAPI/Controller/Test.pm#L737

I'd say it should be like this (and that's how I implemented the graph):

  • If A has PARALLEL_WITH=X and B has PARALLEL_WITH=X,Y, that makes the following cluster: A,B,X,Y (see the sketch after this list).
  • So although A and Y are not explicitly specified to run in parallel, A and Y are part of the same cluster.
  • This is not a misconfiguration (if one was aiming for the big cluster A,B,X,Y).
  • If I understand the issue correctly, that is also the behavior @asmorodskyi was expecting. So if the scheduler behaved like this, this issue wouldn't have been created.
  • The terms parent and child are completely interchangeable for parallel dependencies. Regardless of which way you express the relation, the resulting cluster should be the same. So it makes no sense (from the user perspective) to say you're 'setting a parent'. That might happen behind the scenes (and I guess that's what you meant), but after all this is just about tying jobs together.
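
A standalone sketch of that merging rule (a hypothetical illustration, not the actual openQA scheduler or graph code; the job names and the %parallel_with hash are just the A/B/X/Y example from the list above), treating every PARALLEL_WITH entry as an undirected edge and computing connected components:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Example settings from the list above (illustrative only).
    my %parallel_with = (
        A => ['X'],
        B => ['X', 'Y'],
        X => [],
        Y => [],
    );

    # Union-find over jobs: every PARALLEL_WITH entry ties two jobs together,
    # no matter which side declared the relation.
    my %parent;

    sub find {
        my ($job) = @_;
        $parent{$job} //= $job;
        $parent{$job} = find($parent{$job}) if $parent{$job} ne $job;
        return $parent{$job};
    }

    sub union {
        my ($a, $b) = @_;
        my ($ra, $rb) = (find($a), find($b));
        $parent{$ra} = $rb if $ra ne $rb;
    }

    for my $job (keys %parallel_with) {
        union($job, $_) for @{ $parallel_with{$job} };
    }

    # Group jobs by the root of their component; prints one cluster: A B X Y
    my %cluster;
    push @{ $cluster{ find($_) } }, $_ for keys %parallel_with;
    print join(' ', sort @{$_}), "\n" for values %cluster;

Under these semantics the direction in which a relation is declared does not matter, which is the point of the last bullet above.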

If the scheduler behaves differently I would consider it a bug.

I can check how the scheduler code behaves as it is right now and maybe fix it. Of course the code which populates the job dependencies database table from the job settings might handle this incorrectly, too. So that's also a place to check.

#22 Updated by mkittler 11 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
  • Target version changed from Ready to Current Sprint

@coolo It seems I misread your example because _master is likely the same as master and the underscore is just from a failed formatting attempt. So the first bullet point would be:

  • If A has PARALLEL_WITH=X and B has PARALLEL_WITH=X,A that makes the following cluster: A,B,X

However, that shouldn't change the conclusion.

I briefly tested the creation of the job dependencies in the database and it seems to be correct: https://github.com/os-autoinst/openQA/pull/2067

The test exploits the route I created for the dependency graph. So at least in the graph it would actually show up as one big cluster. I now need to verify what the scheduler would do.

#23 Updated by coolo 11 months ago

I don't think the scheduler misbehaves either - but the config doesn't make sense and the job creation basically applies a workaround by ignoring half the settings.

#24 Updated by mkittler 11 months ago

> but the config doesn't make sense

Also in general?

> the job creation basically applies a workaround by ignoring half the settings

Not sure what workaround you mean. So settings are lost when creating such a cluster?

#25 Updated by mkittler 11 months ago

  • Status changed from In Progress to New
  • Assignee deleted (mkittler)

#26 Updated by coolo 7 months ago

  • Target version changed from Current Sprint to Ready

#27 Updated by okurz 7 months ago

To me it looks like this is blocked by #41066, as in: probably no one will pick it up until #41066 is done, e.g. by mkittler. So I suggest mkittler picks the ticket, adds the blocker and sets the status to Blocked. Agreed?

#28 Updated by coolo 7 months ago

I can't see how this is blocking it.

#29 Updated by okurz 7 months ago

well, it's not a hard "blocked" but rather a soft work schedule serialization :)

#30 Updated by okurz 7 months ago

  • Status changed from New to Blocked
  • Assignee set to okurz

#31 Updated by okurz 5 months ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)

I wonder if this is related to #32605#note-13. It seems an HPC job was scheduled on its own, not within the cluster as it should have been.

#32 Updated by mkittler 5 months ago

> So I suggest mkittler picks the ticket

Last time I looked at the ticket it seemed that the user just configured the cluster differently from what was required. But that "different" configuration didn't look generally invalid to me, so I didn't know how to proceed with the ticket. The question from my last comment is still unanswered, too. So I will not pick the ticket unless it is clear to me what to do.

#33 Updated by okurz 4 months ago

@coolo, can you clarify regarding #32545#note-24, please?

#34 Updated by coolo 4 months ago

  • Category changed from Concrete Bugs to Feature requests

I basically would like to see a warning if you create a one-node cluster. Alternatively, we could go with more forgiving parsing. This feature is purely about helping users get this right, not about the correctness of the current implementation.
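
A minimal sketch of what such a warning could look like (standalone Perl for illustration only, not the actual openQA validation code; the job names, the typo and the PARALLEL_WITH values are made up), flagging a posted job whose PARALLEL_WITH partners do not resolve to other jobs in the same post:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Jobs as they might arrive in one isos post; names are illustrative.
    my %jobs = (
        hpc_mrsh_supportserver => {},
        hpc_mrsh_master        => { PARALLEL_WITH => 'hpc_mrsh_supportserver' },
        hpc_mrsh_slave         => { PARALLEL_WITH => 'typo_supportserver' },    # misconfigured
    );

    for my $name (sort keys %jobs) {
        my $parallel_with = $jobs{$name}{PARALLEL_WITH} or next;
        my @partners = grep { length } split /\s*,\s*/, $parallel_with;
        my @missing  = grep { !exists $jobs{$_} } @partners;
        warn "$name: PARALLEL_WITH partner(s) not part of this post: @missing\n" if @missing;
        warn "$name: would form a one-node parallel cluster\n" if @missing == @partners;
    }

Whether such a check should be a hard error, a warning while posting, or replaced by more forgiving parsing is exactly the open question of this ticket.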
