openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2021-07-21T12:13:44Z</p> <ul><li><strong>Copied from</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" href="/issues/81859">action #81859</a>: openqa-investigate triggers incomplete sets for multi-machine scenarios</i> added</li></ul> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2021-07-21T12:39:10Z</p> <ul></ul><p>Maybe this can be handled more easily by providing the possibility for the restart API route to take optional test parameters, same as for openqa-clone-job</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2021-12-13T10:25:31Z</p> <ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" href="/issues/103425">action #103425</a>: Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M</i> added</li></ul> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2021-12-14T09:49:09Z</p> <ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed child" href="/issues/71809">action #71809</a>: Enable multi-machine jobs trigger without "isos post"</i> added</li></ul> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2021-12-14T14:25:51Z</p> <ul><li><strong>Description</strong> updated (<a title="View differences" 
href="/journals/473271/diff?detail_id=447768">diff</a>)</li></ul><p>fixed typo in description</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2021-12-14T14:47:12Z</p> <ul><li><strong>Parent task</strong> set to <i>#103971</i></li></ul> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-01-24T14:44:07Z</p> <ul><li><strong>Target version</strong> changed from <i>future</i> to <i>Ready</i></li></ul> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-01-24T15:32:34Z</p> <ul><li><strong>Assignee</strong> set to <i>mkittler</i></li></ul> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-01-26T12:05:02Z</p> <ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul><p>Draft PRs: <a href="https://github.com/os-autoinst/openQA/pull/4478" class="external">https://github.com/os-autoinst/openQA/pull/4478</a> <a href="https://github.com/os-autoinst/scripts/pull/127" class="external">https://github.com/os-autoinst/scripts/pull/127</a></p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-01-31T12:01:37Z</p> <ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul><p>All PRs for openQA have been merged.
I'm waiting for the deployment on o3 to test the investigation script.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-01T16:28:26Z</p> <ul></ul><p>I checked out <a href="https://github.com/os-autoinst/scripts/pull/127" class="external">https://github.com/os-autoinst/scripts/pull/127</a> on o3 15 minutes ago, so recent investigation runs have been using my changes. So far I couldn't spot any problems in the logs. Investigation jobs are also still being created, e.g. <a href="https://openqa.opensuse.org/tests/2170735" class="external">https://openqa.opensuse.org/tests/2170735</a>, <a href="https://openqa.opensuse.org/tests/2170567" class="external">https://openqa.opensuse.org/tests/2170567</a>, <a href="https://openqa.opensuse.org/tests/2170569" class="external">https://openqa.opensuse.org/tests/2170569</a>. They look good (name is correct, outside of group, origin referenced, job not shown as clone).
So I suppose it is safe to merge the PR.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-02T11:32:43Z</p> <ul></ul><p>I encountered a few problems when looking into the logs again today so I created <a href="https://github.com/os-autoinst/scripts/pull/128" class="external">https://github.com/os-autoinst/scripts/pull/128</a> and <a href="https://github.com/os-autoinst/scripts/pull/129" class="external">https://github.com/os-autoinst/scripts/pull/129</a> to address them.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-02T14:56:47Z</p> <ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>In Progress</i></li></ul><p>guillaume_g reported broken links in <a href="https://matrix.to/#/!ilXMcHXPOjTZeauZcg:libera.chat/$Vh0JxSRfc4-GGh6Znu2FtDA1mKew1_Wt3kitW7owOFM?via=libera.chat&via=matrix.org&via=gnome.org" class="external">https://matrix.to/#/!ilXMcHXPOjTZeauZcg:libera.chat/$Vh0JxSRfc4-GGh6Znu2FtDA1mKew1_Wt3kitW7owOFM?via=libera.chat&via=matrix.org&via=gnome.org</a> . I think we have a recent regression in openqa-investigate, see the broken links in <a href="https://openqa.opensuse.org/tests/2172673#comments" class="external">https://openqa.opensuse.org/tests/2172673#comments</a>, which should be absolute links. mkittler is informed and is on it.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-03T12:35:41Z</p> <ul></ul><p>The problem mentioned in the previous comment should be fixed now, while <a href="https://github.com/os-autoinst/scripts/pull/129" class="external">https://github.com/os-autoinst/scripts/pull/129</a> is still pending.</p> <hr> <p>I also got feedback.
There are multiple remaining issues:</p> <ol> <li>Since we don't consider investigation jobs as clones, our usual restriction of displaying only the latest jobs in the "clone/restart chain" in the dependency tab doesn't apply. Therefore we see one big dependency tree containing <em>all</em> the jobs. It doesn't mean there's a problem with restarting the dependency cluster in general but the way it is displayed is rather confusing. This problem is related to <a class="issue tracker-4 status-3 priority-3 priority-lowest closed child" title="action: Show dependency graph for cloned jobs (Resolved)" href="https://progress.opensuse.org/issues/69976">#69976</a> (and might be solved within that ticket).</li> <li>It is problematic if there are multiple failing "root" jobs (like here: <a href="https://openqa.suse.de/tests/8088091" class="external">https://openqa.suse.de/tests/8088091</a>). The investigate script then restarts one of these "root" jobs first and thereby also restarts the rest of the dependency tree which is reachable from the "root" going <em>down</em> the tree. However, it would not go up and also restart the other "root" job. The script would attempt to restart that one separately, which might lead to restarting some jobs down the dependency tree multiple times as they have already been restarted for the first "root" job. <ol> <li>I suppose we would also need to go up the tree when restarting the first "root" job if it has failed.</li> <li>If we follow the previous idea we would restart multiple failing jobs at the same time in the same way (although their "last good" jobs might differ). That's a limitation we'd likely have to accept.</li> </ol></li> <li>The investigation can restart <em>many</em> jobs. Especially when we are short on resources it seems like a waste to trigger so many jobs (e.g.
PowerPC jobs are problematic right now).</li> </ol> <p>I'm wondering whether I should add back the guard that avoids investigating MM jobs until these problems have been resolved.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-03T12:44:07Z</p> <ul></ul><p>mkittler wrote:</p> <blockquote> <p>I also got feedback. There are multiple remaining issues:<br> […]</p> </blockquote> <p>For now I tend to keep MM investigation jobs enabled. I understand "1." as something that is only visual and I assume that test reviewers will learn how to interpret the extended dependency trees correctly. I don't fully understand problem "2." regarding root jobs, but as you stated that you see it as an acceptable limitation, so do I for now :) . "3." shouldn't stop us. We should reward test maintainers of tests with a low fail ratio by giving them additional tools in the form of investigation jobs in case tests do fail. This means that test maintainers of unstable tests need to wait longer for all jobs to complete, which might be a good motivation for them to stabilize tests or optimize runtime :) In most cases test reviewers would likely restart jobs anyway in case of failures to check reproducibility.
This is wasteful and causes busy-waiting of reviewers when investigation jobs can already provide that information at the time of review, as they are triggered as soon as a job fails and not just when jobs are reviewed for the first time.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-03T16:41:28Z</p> <ul></ul><p>I thought about it a little bit further:</p> <ul> <li>If we implement <a class="issue tracker-4 status-3 priority-3 priority-lowest closed child" title="action: Show dependency graph for cloned jobs (Resolved)" href="https://progress.opensuse.org/issues/69976">#69976</a> that will not help with 1. because the implementation will rely on <code>clone_id</code> being set. (At least I don't see any other way.)</li> <li>It shouldn't be a problem, though: I suppose if 2. is fixed then 1. is fixed as well because then the "weird overlap"¹ between the new and old dependency tree is gone.</li> <li>For 3. we can likely come up with an exclude regex. (Maybe the existing exclude regexes are even already enough.)</li> </ul> <p><del>So I'll create a draft to show what 2.
would look like.</del> DONE: <a href="https://github.com/os-autoinst/openQA/pull/4498" class="external">https://github.com/os-autoinst/openQA/pull/4498</a></p> <hr> <p>¹ Jobs in the new dependency tree should <em>not</em> refer to any job in the old dependency tree anymore (such references effectively make it one big dependency tree) because all parents would be cloned. The problem here is that new children still have parents from the old cluster when the parent is not cloned as well.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-04T16:08:45Z</p> <ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul><p>Waiting for feedback on <a href="https://github.com/os-autoinst/openQA/pull/4498" class="external">https://github.com/os-autoinst/openQA/pull/4498</a></p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-08T13:38:19Z</p> <ul></ul><p>Will the PR also fix these lonely support servers? I assume they are created by openqa-investigate.<br> <a href="https://openqa.suse.de/tests/8119478" class="external">https://openqa.suse.de/tests/8119478</a><br> <a href="https://openqa.suse.de/tests/8119479" class="external">https://openqa.suse.de/tests/8119479</a></p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-08T22:06:56Z</p> <ul></ul><p>I added another test to openqa-investigate: <a href="https://github.com/os-autoinst/scripts/pull/133" class="external">https://github.com/os-autoinst/scripts/pull/133</a></p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-10T10:31:57Z</p> <ul><li><strong>Status</strong> changed from <i>Feedback</i> to
<i>In Progress</i></li></ul><p>Some problems have been found with this, e.g. as reported in <a href="https://suse.slack.com/archives/C02CANHLANP/p1644483914332169" class="external">https://suse.slack.com/archives/C02CANHLANP/p1644483914332169</a> . <a class="user active user-mention" href="https://progress.opensuse.org/users/22072">@mkittler</a> can you please look into <a href="https://openqa.suse.de/tests/8133359#comments" class="external">https://openqa.suse.de/tests/8133359#comments</a> to understand why there are 4 retry jobs and 6 last good build ones? As discussed we merged <a href="https://github.com/os-autoinst/scripts/pull/131" class="external">https://github.com/os-autoinst/scripts/pull/131</a> and <a href="https://github.com/os-autoinst/scripts/pull/134" class="external">https://github.com/os-autoinst/scripts/pull/134</a> to mitigate some problems. Please closely observe the impact.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-10T11:49:41Z</p> <ul></ul><p>With <a href="https://github.com/os-autoinst/scripts/pull/131" class="external">https://github.com/os-autoinst/scripts/pull/131</a> and <a href="https://github.com/os-autoinst/scripts/pull/134" class="external">https://github.com/os-autoinst/scripts/pull/134</a> being merged we're basically back where we were before. There are still multiple problems to address.</p> <ol> <li>I suppose the fix for 2. (from <a href="#note-14">#note-14</a>) worked but had the side-effect of restarting even more jobs and that can be problematic (see 3. from <a href="#note-14">#note-14</a>).</li> <li>We still accidentally restarted jobs twice because the tracking of already restarted jobs only works within the scope of one openqa-investigate run. We somehow need to ensure that further openqa-investigate invocations skip jobs which have already been investigated as a dependency.
<ol> <li>I suppose jobs like "08134285-sle-15-SP4-Online-ppc64le-RAID1:investigate:retry*:investigate:last_good_build*:91.2@ppc64le-hmc-4disk" (mind the 2nd ":investigate:…" suffix; <a href="https://openqa.suse.de/tests/8134285" class="external">https://openqa.suse.de/tests/8134285</a>) are also due to that but it could also be a problem within the general skipping logic and also with the skipping logic within one script invocation.</li> <li>Maybe the investigate script could record all restarted job IDs in a persistent text file to keep track of restarted jobs beyond the scope of one execution.</li> <li>Or we finally make it a real openQA feature - although doing the tracking within openQA itself would likely also be quite some work.</li> </ol></li> <li>Sometimes users seem to also restart the jobs manually, e.g. <a href="https://openqa.suse.de/tests/8133359" class="external">https://openqa.suse.de/tests/8133359</a>. In this case it is really annoying that <a class="issue tracker-4 status-3 priority-3 priority-lowest closed child" title="action: Show dependency graph for cloned jobs (Resolved)" href="https://progress.opensuse.org/issues/69976">#69976</a> is still unresolved because knowing the original dependency tree would be very useful.</li> <li>That we get <em>many</em> investigation jobs, e.g. as shown in <a href="https://openqa.suse.de/tests/8133359#comment-484933" class="external">https://openqa.suse.de/tests/8133359#comment-484933</a>, is unfortunately "normal" if we really want this feature. Of the 10 jobs in the mentioned comment 8 jobs are actually valid (and only two of them are cases as mentioned in 2.1).
If that is too much we would need to reconsider the whole feature.</li> </ol> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-10T14:37:01Z</p> <ul></ul><p>dzedro wrote:</p> <blockquote> <p>Will the PR also fix these lonely support servers? I assume they are created by openqa-investigate.<br> <a href="https://openqa.suse.de/tests/8119478" class="external">https://openqa.suse.de/tests/8119478</a><br> <a href="https://openqa.suse.de/tests/8119479" class="external">https://openqa.suse.de/tests/8119479</a></p> </blockquote> <p>These don't have "investigate" in the name, so to my understanding they couldn't have been.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-11T14:00:15Z</p> <ul><li><strong>Priority</strong> changed from <i>Low</i> to <i>Urgent</i></li></ul><p><a class="user active user-mention" href="https://progress.opensuse.org/users/22072">@mkittler</a> the situation seems to be severely broken now.</p> <p>I executed</p> <pre><code>echo https://openqa.suse.de/tests/8140400 | host=openqa.suse.de openqa-investigate </code></pre> <p>with the git commit 054cb9c of os-autoinst scripts, getting <a href="https://openqa.suse.de/tests/8140400#comment-485767" class="external">https://openqa.suse.de/tests/8140400#comment-485767</a> which looks complete with 4 investigation jobs triggered. With current master 3ae1f1a I got <a href="https://openqa.suse.de/tests/8140400#comment-485765" class="external">https://openqa.suse.de/tests/8140400#comment-485765</a> with only two jobs, the "retry" and "last_good_build".
So what we observed today in the review training session was a regression by one of your recent changes.</p> <p>An additional problem seems to be that we can't call openqa-investigate on a job that already has clones which is also a regression.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-11T15:39:41Z</p> <ul></ul><p>okurz wrote:</p> <blockquote> <p>a regression by one of your recent changes.</p> </blockquote> <p>Just to be clear, this is <em>only</em> preventing investigation jobs from running and it's not interfering with cloning or jobs outside of that? So an immediate revert or rollback may not be necessary (that would be my thought anyway).</p> <blockquote> <p>An additional problem seems to be that we can't call openqa-investigate on a job that already has clones which is also a regression.</p> </blockquote> <p>Does that mean the "is clone" flag is not set reliably? Maybe it'd be best to identify where this happens and extend test coverage before attempting a fix.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-11T16:15:41Z</p> <ul></ul><blockquote> <p>An additional problem seems to be that we can't call openqa-investigate on a job that already has clones which is also a regression.</p> </blockquote> <p>Right, the <code>clone=0</code> flag only means the restarted jobs won't be considered clones. But if the jobs have already been restarted it is no help.</p> <hr> <p>Maybe it is best to revert the changes to the scripts repo completely then. 
At least I likely won't manage to debug and fix the problems today.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-11T16:29:21Z</p> <ul></ul><p>I created <a href="https://github.com/os-autoinst/scripts/pull/135" class="external">https://github.com/os-autoinst/scripts/pull/135</a>.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-12T04:10:54Z</p> <ul><li><strong>Due date</strong> set to <i>2022-02-26</i></li></ul><p>Setting due date based on mean cycle time of SUSE QE Tools</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-14T09:42:11Z</p> <ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-3 priority-lowest closed child" href="/issues/69976">action #69976</a>: Show dependency graph for cloned jobs</i> added</li></ul> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-14T09:44:35Z</p> <ul><li><strong>Due date</strong> deleted (<del><i>2022-02-26</i></del>)</li><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Blocked</i></li><li><strong>Priority</strong> changed from <i>Urgent</i> to <i>Normal</i></li></ul><p>The urgency was addressed with <a href="https://github.com/os-autoinst/scripts/pull/135" class="external">https://github.com/os-autoinst/scripts/pull/135</a> merged and deployed.</p> <p>As discussed, we pull in <a class="issue tracker-4 status-3 priority-3 priority-lowest closed child" title="action: Show dependency graph for cloned jobs (Resolved)" href="https://progress.opensuse.org/issues/69976">#69976</a> first and see if that helps us.</p> </article> <article> <h1>openQA Project 
- action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-21T16:44:39Z</p> <ul><li><strong>Status</strong> changed from <i>Blocked</i> to <i>Feedback</i></li></ul><p><a class="issue tracker-4 status-3 priority-3 priority-lowest closed child" title="action: Show dependency graph for cloned jobs (Resolved)" href="https://progress.opensuse.org/issues/69976">#69976</a> has been resolved so I'm unblocking the issue.</p> <p>However, there are still open questions which should be answered before proceeding:</p> <ol> <li>Should I follow the "restart API approach"? I would need to change <code>clone=0</code> so it ignores whether a job has already been restarted (and restarts it anyway). Not sure what side-effects this will have.</li> <li>I could also implement the remaining ACs of <a class="issue tracker-6 status-3 priority-4 priority-default closed child parent" title="coordination: [epic] Easy *re*-triggering and cloning of multi-machine tests (Resolved)" href="https://progress.opensuse.org/issues/103971">#103971</a> and keep using the clone script here. I wouldn't have to care about the restart API anymore but it'll come with its own difficulties (like implementing those ACs in the first place, parsing the output of the clone script to keep track of all jobs which have been cloned).</li> <li>Regardless of whether I'd choose 1. or 2. the problem of keeping track of already restarted jobs between multiple invocations needs to be solved. Maybe it makes sense to simply store handled job IDs in a simple SQLite file? (SQLite would implement searching, updating and locking for us, that's why I bring it up. So it might even be simpler to use than writing to a text file but still just a single additional file.
The required SQL should be trivial.)</li> </ol> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-21T21:12:39Z</p> <ul></ul><p>mkittler wrote:</p> <blockquote> <p><a class="issue tracker-4 status-3 priority-3 priority-lowest closed child" title="action: Show dependency graph for cloned jobs (Resolved)" href="https://progress.opensuse.org/issues/69976">#69976</a> has been resolved so I'm unblocking the issue.</p> <p>However, there are still open questions which should be answered before proceeding:</p> <ol> <li>Should I follow the "restart API approach"? I would need to change <code>clone=0</code> so it ignores whether a job has already been restarted (and restarts it anyway). Not sure what side-effects this will have.</li> </ol> </blockquote> <p>Don't we already have a "force" flag which we use when assets are missing and then the user can force-restart if they still want to?</p> <blockquote> <ol> <li>I could also implement the remaining ACs of <a class="issue tracker-6 status-3 priority-4 priority-default closed child parent" title="coordination: [epic] Easy *re*-triggering and cloning of multi-machine tests (Resolved)" href="https://progress.opensuse.org/issues/103971">#103971</a> and keep using the clone script here. I wouldn't have to care about the restart API anymore but it'll come with its own difficulties (like implementing those ACs in the first place, parsing the output of the clone script to keep track of all jobs which have been cloned).</li> </ol> </blockquote> <p>How about giving it a try with "openqa-clone-job" for a limited time and see where we can go?</p> <blockquote> <ol> <li>Regardless of whether I'd choose 1. or 2. the problem of keeping track of already restarted jobs between multiple invocations needs to be solved. Maybe it makes sense to simply store handled job IDs in a simple SQLite file?
(SQLite would implement searching, updating and locking for us, that's why I bring it up. So it might even be simpler to use than writing to a text file but still just a single additional file. The required SQL should be trivial.)</li> </ol> </blockquote> <p>SQLite sounds like overkill but you might be right that it might be easier than parsing a text file in a custom format. However, so far we had stateless scripts, which is even easier if we can sustain that. As you mentioned I think it's a good idea to bring this ticket up in the unblock</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-23T12:01:48Z</p> <ul></ul><blockquote> <p>Don't we already have a "force" flag which we use when assets are missing and then the user can force-restart if they still want to?</p> </blockquote> <p>No, I don't think it already allows forcing this (only ignoring missing assets and some settings problematic for directly chained dependencies).</p> <blockquote> <p>How about giving it a try with "openqa-clone-job" for a limited time and see where we can go?</p> </blockquote> <p>The "openqa-clone-job" script won't behave very nicely. That's why we currently skip those jobs in the first place. I'll note down more details in the summary of the conversation from today.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-02-23T13:28:01Z</p> <ul></ul><p>Recap for those who aren't familiar with the topic:</p> <ol> <li>What is special about MM jobs and how does it relate to openqa-investigate? <ol> <li>Those jobs have parallel job dependencies - so the jobs are scheduled to run at the same time. The scheduler code itself works just fine so that's actually <em>not</em> a concern here.</li> <li>Points to improve are: <ol> <li>Posting jobs (via the jobs post API) which have those dependencies.
It is possible but only in a non-atomic way which is problematic as we end up with half-running parallel jobs.</li> <li>Thus the openqa-clone-job script which is using the jobs post API is affected by those problems.</li> <li>Thus the openqa-investigate script is affected by those problems. In addition, we need to take care not to create redundant investigation jobs.</li> </ol></li> </ol></li> <li>How does the restart API come into the picture? <ol> <li>The openqa-investigate script could utilize it instead of the openqa-clone-job script as it doesn't need to do inter-openQA-instance cloning.</li> <li>Note that the openqa-clone-job script is nevertheless used by users and 1.2.1 and 1.2.2 are also impairing users. So we still need to take care of openqa-clone-job - regardless of how we handle the investigation.</li> </ol></li> </ol> <hr> <p>Summary of today's discussion:</p> <ol> <li><p>Which dependent jobs do we want/need to restart when investigating?</p> <ol> <li>The general goal is to avoid producing results which are not needed.</li> <li>It depends on the dependency type: <ol> <li>If a <em>child</em> fails: <ol> <li>Chained parents don't need to be restarted. (Unless they failed, but then the child will be skipped and is thus not investigated anyway.)</li> <li>Directly chained parents need to be restarted so the chain is not broken. The whole direct chain of parents needs to be restarted (recursively).</li> <li>Parallel parents need to be restarted as e.g. the "client" job needs the "server" job to run. Presumably parallel siblings can affect each other so the parallel parent's other children and their parallel dependencies need to be restarted as well (resulting in restarting the whole parallel cluster).</li> </ol></li> <li>If a <em>parent</em> fails: <ol> <li>Chained children don't need to be restarted.
(We are mainly interested in finding out why the parent fails, not in producing some further results for the children.)</li> <li>Directly chained children don't need to be restarted. (The same applies as for regularly chained children.)</li> <li>Parallel children need to be restarted as e.g. a server crash can maybe only be reproduced if there's a client connecting to the server. Presumably nested parallel children are important as well so they need to be restarted as well (resulting in restarting the whole parallel cluster).</li> </ol></li> <li>If a "job in the middle" fails (a job which has parents and children at the same time): <ol> <li>Both previous points apply. So parents <em>and</em> children need (or don't need) to be restarted as explained in the previous points.</li> </ol></li> <li>Note that "chained" and "directly chained" (and "parallel") are distinct dependency types. A dependency only has <em>one</em> of these types and a directly chained dependency is <strong>not</strong> a chained dependency at the same time. So 1.2.1.1 and 1.2.1.2 don't contradict each other.</li> </ol></li> </ol></li> <li><p>Can we investigate each failure "in isolation"?</p> <ol> <li>First an example of what "in isolation" would mean: <ol> <li>Assume we have 2 failed parallel children within the same cluster.</li> <li>Assume we would create 4 investigation jobs per failed child <em>without considering dependencies</em>.</li> <li>For each investigation job we would clone the whole cluster as explained in 1.2.1, let's say 3 jobs.</li> <li>That would make 2 * 4 * 3 = 24 clones in total (number of failed jobs * number of investigation jobs per failed job * number of dependent jobs to be cloned per job).</li> </ol></li> <li>This might be acceptable in general …</li> <li>… but we need to think at least about making exceptions as well.</li> <li>Alternative: Somehow "merge" 2.1.2 and 2.1.3 so we would only have X investigation jobs per dependency tree.
<ol> <li>In the example from 2.1 we would end up with "only" 12 jobs (number of failed jobs * number of investigation jobs per failed job).</li> <li>To achieve that we would need to keep track of which jobs we have already investigated, which could be done on different levels: <ol> <li>The openqa-investigate script keeps track, e.g. <ol> <li>using a SQLite file as suggested in <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: Provide support for multi-machine scenarios handled by openqa-investigate size:M (Resolved)" href="https://progress.opensuse.org/issues/95783#note-30">#95783#note-30</a> (or some other persistent storage).</li> <li>by adding a special comment or job setting in <em>all</em> cloned jobs (basically utilizing openQA's database).</li> </ol></li> <li>openQA invokes the post-fail-hook per "dependency tree" and not per job. <ol> <li>So the post-fail-hook would receive a list of all failed job IDs within the cluster and not just a single job ID.</li> <li>The investigate script would then loop over these job IDs and skip jobs which have already been cloned in a previous iteration (or only create a comment there).</li> </ol></li> <li>For openQA the previous point boils down to invoking hooks only for dependency trees where all jobs have been cancelled/done. The problem here is that multiple jobs can end up cancelled/done at the same time. <ol> <li>Maybe Minion locks can help here. However, it would be very problematic to run only one <code>finalize_job_results</code> task at a time (e.g. <code>finalize_job_results</code> tasks will pile up because there's a blocker - we allow hook scripts to run for 5 minutes and it can sometimes indeed take a while in practice). So more fine-grained locking would be required.
Unfortunately we don't have the concept of a "dependency tree ID" in openQA (which could simply be used as lock name).</li> <li>Maybe there's a way to query whether a dependency tree is "pending" in a single SQL query to avoid the race condition.</li> <li>If the previous is not possible, we could use a database transaction to avoid the race condition. The following should do the trick, right? <code>$schema->storage->dbh->prepare('SET TRANSACTION ISOLATION LEVEL REPEATABLE READ READ ONLY DEFERRABLE;')->execute();</code></li> </ol></li> </ol></li> </ol></li> </ol></li> <li><p>What would be necessary to change within the openqa-clone script to implement 1.?</p> <ol> <li>Support for posting multiple jobs at once (so it happens atomically) and use the API in the clone script. <ol> <li>For parallel dependencies we <em>could</em> skip this relying on the scheduler's ability to repair half-scheduled clusters. However, that doesn't cover and might not work nicely as not enough worker slots might be available.</li> </ol></li> <li>Support for cloning only parallel children (for 1.2.2.3) but not any kind of chained children (for 1.2.2.1 and for 1.2.2.2). <ol> <li>There's already <code>--clone-children</code> but I suppose it affects all kinds of children. 
We'd likely need <code>--clone-parallel-children</code> in addition.</li> </ol></li> <li>Note that skipping chained parents (for 1.2.1.1) while still cloning directly chained and parallel parents (for 1.2.1.2 and 1.2.1.3) should already be possible by specifying <code>--skip-chained-deps</code>.</li> <li>Note that we could skip 3.2 at the cost of also cloning all kinds of child jobs (per investigation).</li> <li>For 2.4.2 we would need to implement a machine-readable output format to keep track of the cloned jobs unless we decide on 2.4.2.2.</li> </ol></li> <li><p>What would be necessary to change within the restart API to implement 1.?</p> <ol> <li>Rules for restarting dependent jobs are already <em>mostly</em> in line with 1.</li> <li>Add a flag to skip restarting chained and directly chained children (for 1.2.2.1 and for 1.2.2.2) which would effectively only restart parallel children (for 1.2.2.3).</li> <li>Add a flag to force the restart even though the job (or some other job in the cluster) has already been restarted.</li> <li>As per 1. we don't necessarily restart the full dependency tree. So we need to add a flag to avoid creating dependencies between restarted jobs and non-restarted jobs. This is to avoid a connection between the old and the restarted dependency tree making it effectively one big dependency tree.</li> <li>I suppose the previous point is only a display issue so we could skip it.</li> </ol></li> </ol> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-03-01T12:44:30Z</p> <ul></ul><p>Once <a href="https://github.com/os-autoinst/openQA/pull/4537" class="external">https://github.com/os-autoinst/openQA/pull/4537</a> has been merged, point 3.1 from my previous comment is done. This only leaves 3.2 and 3.5 (only if we decide on 2.4.2.1) which should be simple to implement. 
So maybe it actually makes most sense to stick with the clone-job script at this point.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-03-08T14:02:45Z</p> <ul></ul><p>PR for implementing <code>--clone-parallel-children</code> in <code>openqa-clone-job</code>: <a href="https://github.com/os-autoinst/openQA/pull/4551" class="external">https://github.com/os-autoinst/openQA/pull/4551</a></p> <p>With that the only thing left to use <code>openqa-clone-job</code> in <code>openqa-investigate</code> for parallel dependencies would be an additional hook within openQA that would only fire for the whole dependency cluster. (As explained in 2.4.2 there would be other alternatives but an additional hook seems like the best solution.)</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-03-10T12:24:34Z</p> <ul></ul><p>Looks like we currently invoke <code>openqa-clone-job</code> with <code>--skip-chained-deps</code>. That breaks 1.2.1.2 (parents of directly chained children need to be restarted). So I suppose <code>--skip-chained-deps</code> should be changed to only affect chained deps but not directly-chained deps (which it currently affects as well).</p> <p>There's one more problem to sort out: Even if openQA invokes the hook script for the entire dependency tree we need to find out which of those jobs should be the "root" job to clone. Or we simply use <code>--max-depth 0</code> to ensure a parallel cluster is fully cloned in any case and simply ignore jobs we've already cloned. 
That means we would need JSON output in <code>openqa-clone-job</code> after all to keep track of that (within one <code>openqa-investigate</code> run).</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-03-17T16:57:08Z</p> <ul></ul><p>PR for JSON output in <code>openqa-clone-job</code>: <a href="https://github.com/os-autoinst/openQA/pull/4564" class="external">https://github.com/os-autoinst/openQA/pull/4564</a></p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-03-18T09:14:32Z</p> <ul><li><strong>Due date</strong> set to <i>2022-03-25</i></li></ul> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-03-23T16:07:36Z</p> <ul><li><strong>Due date</strong> changed from <i>2022-03-25</i> to <i>2022-04-25</i></li></ul><p>Changing the due date since we wanted to take a break. Then the next steps would be:</p> <ol> <li>Have <code>openqa-clone-job</code>'s <code>--skip-chained-deps</code> option only affect chained dependencies but not <em>directly</em> chained dependencies. Possibly add <code>--skip-directly-chained-deps</code> in case that's wanted after all. (for 1.2.1.2)</li> <li>Add a hook in openQA to invoke a script once all jobs in a dependency tree are done.</li> <li>Make <code>openqa-investigate</code> use that hook and investigate all jobs that weren't successful. It should use <code>--max-depth 0</code> to ensure parallel clusters are always fully cloned (so it is not necessary to distinguish between parallel parents and children). 
It needs to keep track of handled job IDs to avoid investigating jobs multiple times as <code>openqa-clone-job</code> will already handle dependencies as needed (and therefore might already clone multiple jobs we need to investigate in one go). The tracking should be easy because <code>openqa-clone-job --json-output</code> has already been implemented.</li> <li>-> <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: Do NOT call job_done_hooks if requested by test setting (Resolved)" href="https://progress.opensuse.org/issues/110530">#110530</a> : add an opt-out (e.g. by specifying a certain test variable) so users who consider these tests a waste of time won't complain. I suppose making this configurable via a test variable should be sufficient. <ol> <li>We could also make it opt-in. So we'd keep the current behavior of skipping the investigation of jobs with parallel and directly chained dependencies <em>unless</em> a user specifies some test variable.</li> </ol></li> </ol> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-04-06T12:34:14Z</p> <ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed child" href="/issues/107014">action #107014</a>: trigger openqa-trigger-bisect-jobs from our automatic investigations whenever the cause is not already known size:M</i> added</li></ul> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-04-06T12:35:27Z</p> <ul></ul><p>I've added <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: trigger openqa-trigger-bisect-jobs from our automatic investigations whenever the cause is not al... 
(Resolved)" href="https://progress.opensuse.org/issues/107014">#107014</a> as related because I suppose everything applies to <code>openqa-trigger-bisect-jobs</code> as well if we run it as a hook script in production.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-04-26T09:26:00Z</p> <ul><li><strong>Due date</strong> changed from <i>2022-04-25</i> to <i>2022-05-02</i></li></ul><p>Since <a class="user active user-mention" href="https://progress.opensuse.org/users/17668">@okurz</a> gave the impulse to delay this I suppose he as product owner should decide when we want to resume here. So I'm setting the due date to next week so we can decide by then when we want to continue. Additionally, some feedback about my last comments is appreciated.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-04-26T09:43:52Z</p> <ul></ul><p>mkittler wrote:</p> <blockquote> <ol> <li>Maybe add an opt-out (e.g. by specifying a certain test variable) so users who consider these tests a waste of time won't complain. I suppose making this configurable via a test variable should be sufficient. <ol> <li>We could also make it opt-in. So we'd keep the current behavior of skipping the investigation of jobs with parallel and directly chained dependencies <em>unless</em> a user specifies some test variable.</li> </ol></li> </ol> </blockquote> <p>I'd suggest making it opt-out via a job setting provided we have follow-up plans for the "problematic" cases mentioned above:</p> <ul> <li>Multiple root jobs. We can consider that a future ticket for now.</li> <li>Spawning too many investigation jobs under high load. 
We could consider such jobs as low priority and drop them (user story, not technical definition).</li> </ul> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-05-02T08:58:03Z</p> <ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/514924/diff?detail_id=486880">diff</a>)</li><li><strong>Due date</strong> changed from <i>2022-05-02</i> to <i>2022-05-09</i></li></ul> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-05-02T13:45:06Z</p> <ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed child" href="/issues/110518">action #110518</a>: Call job_done_hooks if requested by test setting (not only openQA config as done so far) size:M</i> added</li></ul> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-05-02T13:49:55Z</p> <ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed child" href="/issues/110530">action #110530</a>: Do NOT call job_done_hooks if requested by test setting</i> added</li></ul> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-05-02T13:50:04Z</p> <ul><li><strong>Due date</strong> deleted (<del><i>2022-05-09</i></del>)</li><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Blocked</i></li></ul><p><a class="user active user-mention" href="https://progress.opensuse.org/users/22072">@mkittler</a> wait for <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: Do NOT call job_done_hooks if requested by test setting (Resolved)" 
href="https://progress.opensuse.org/issues/110530">#110530</a> first please</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-05-02T14:02:59Z</p> <ul></ul><p>okurz and mkittler discussed:</p> <p>mkittler wrote:</p> <blockquote> <p>Changing the due date since we wanted to take a break. Then the next steps would be:</p> <ol> <li>Have <code>openqa-clone-job</code>'s <code>--skip-chained-deps</code> option only affect chained dependencies but not <em>directly</em> chained dependencies. Possibly add <code>--skip-directly-chained-deps</code> in case that's wanted after all. (for 1.2.1.2)</li> </ol> </blockquote> <p>I suggest treating <em>directly</em> chained dependencies as unsupported for now. So in openqa-investigate itself, instead of skipping jobs with parallel and directly-chained deps, please <em>only</em> skip jobs with directly-chained deps as unsupported for now.</p> <blockquote> <ol> <li>Add a hook in openQA to invoke a script once all jobs in a dependency tree are done.</li> </ol> </blockquote> <p>Wait for <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: [spike solution] [timeboxed:10h] Restart hook script in delayed minion job based on exit code size:M (Resolved)" href="https://progress.opensuse.org/issues/110176">#110176</a> first</p> <blockquote> <ol> <li>Make <code>openqa-investigate</code> use that hook and investigate all jobs that weren't successful. It should use <code>--max-depth 0</code> to ensure parallel clusters are always fully cloned (so it is not necessary to distinguish between parallel parents and children). It needs to keep track of handled job IDs to avoid investigating jobs multiple times as <code>openqa-clone-job</code> will already handle dependencies as needed (and therefore might already clone multiple jobs we need to investigate in one go). 
The tracking should be easy because <code>openqa-clone-job --json-output</code> has already been implemented.</li> </ol> </blockquote> <p>This could be done right away but in general we had better wait for <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: Do NOT call job_done_hooks if requested by test setting (Resolved)" href="https://progress.opensuse.org/issues/110530">#110530</a></p> <blockquote> <ol> <li>-> <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: Do NOT call job_done_hooks if requested by test setting (Resolved)" href="https://progress.opensuse.org/issues/110530">#110530</a> : add an opt-out (e.g. by specifying a certain test variable) so users who consider these tests a waste of time won't complain. I suppose making this configurable via a test variable should be sufficient. <ol> <li>We could also make it opt-in. So we'd keep the current behavior of skipping the investigation of jobs with parallel and directly chained dependencies <em>unless</em> a user specifies some test variable.</li> </ol></li> </ol> </blockquote> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-05-02T14:03:21Z</p> <ul><li><strong>Subject</strong> changed from <i>Provide support for multi-machine scenarios handled by openqa-investigate</i> to <i>Provide support for multi-machine scenarios handled by openqa-investigate size:M</i></li></ul> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-06-09T15:26:57Z</p> <ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed child" href="/issues/110176">action #110176</a>: [spike solution] [timeboxed:10h] Restart hook script in delayed minion job based on exit code size:M</i> added</li></ul> </article> 
<article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-06-23T15:12:16Z</p> <ul><li><strong>Status</strong> changed from <i>Blocked</i> to <i>Workable</i></li></ul> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-06-24T12:17:11Z</p> <ul></ul><p>The next steps would be (obsoleting <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: Provide support for multi-machine scenarios handled by openqa-investigate size:M (Resolved)" href="https://progress.opensuse.org/issues/95783#note-39">#95783#note-39</a>):</p> <ol> <li><a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: [spike solution] [timeboxed:10h] Restart hook script in delayed minion job based on exit code size:M (Resolved)" href="https://progress.opensuse.org/issues/110176">#110176</a> has been implemented. That means we could now check within the hook script whether all dependencies have been finished and if not retry later.</li> <li>We should also check whether the current job is a parallel child. If it is then we'd just skip the job completely to avoid cloning jobs multiple times (cloning the parent will clone the children).</li> <li>When investigating the parallel parent I suppose the "worst" result within the cluster should be assumed (e.g. 
a passing parallel parent would be treated as if it had failed if at least one parallel child failed).</li> <li>Since we defined directly chained children out-of-scope we don't need to care about <code>--skip-chained-deps</code> interfering with them (that flag is currently used by the investigate script).</li> </ol> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-06-24T13:46:02Z</p> <ul></ul><p>Draft: <a href="https://github.com/os-autoinst/scripts/pull/170" class="external">https://github.com/os-autoinst/scripts/pull/170</a></p> <p>Unfortunately, considering the result (of the whole parallel cluster) isn't that easy. At the point we're currently handling dependencies the result is already evaluated.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-07-04T15:56:08Z</p> <ul><li><strong>Status</strong> changed from <i>Workable</i> to <i>In Progress</i></li></ul><p>I updated the draft <a href="https://github.com/os-autoinst/scripts/pull/170" class="external">https://github.com/os-autoinst/scripts/pull/170</a> to check the result accordingly. 
This change means we need to run the investigate script for all jobs (regardless of the result) and do the check for the result within the script (considering the whole job cluster).</p> <p>I have only tested the jq commands in my local shell so far.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-07-05T04:29:06Z</p> <ul><li><strong>Due date</strong> set to <i>2022-07-19</i></li></ul><p>Setting due date based on mean cycle time of SUSE QE Tools</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-07-05T15:53:22Z</p> <ul></ul><p>I now added some tests. The changes should basically work.</p> <p>To enable the hook script for all job results the following change could be done: <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/707" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/707</a></p> <p>Then I'd still need to implement skipping other results in the labeling script (as it would then also be called for all job results) and of course implement that way of configuring it in openQA upstream.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-07-06T09:08:17Z</p> <ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/534116/diff?detail_id=505226">diff</a>)</li></ul> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-07-06T10:30:21Z</p> <ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul><p>I've just learned that executing the hook script for all jobs is a no-go. 
I assumed we had agreed on doing this kind of logic in the hook script because <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: Make hook scripts restartable with a special exit code (Resolved)" href="https://progress.opensuse.org/issues/112523">#112523</a> was very much in line with that. Well, back to the drawing board.</p> <p>Note that the general problem we need to resolve here is synchronization (of the investigation of parallel jobs). The question is just where this synchronization is supposed to happen. If we don't synchronize it properly we could either accidentally miss or duplicate the effort.</p> <p>There are multiple approaches:</p> <ol> <li>Call the hook script for all job results and do the synchronization within the hook script. (This is the approach I would have taken.) <ol> <li>The hook script determines whether a cluster has finished (and postpones until then using <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: Make hook scripts restartable with a special exit code (Resolved)" href="https://progress.opensuse.org/issues/112523">#112523</a>) and whether a cluster contains a failed job.</li> <li>The hook script only considers parent jobs to avoid duplicated investigations and therefore needs to be called for all job results. <ol> <li>This might be problematic for clusters with multiple parents but could be solved by: <ol> <li>Ensuring we really find the top-level parent in the cluster. (We will fail to find the top-level parent in case of cyclic dependencies. It should be fine to not support it but we need to prevent any endless loops in our code.)</li> <li>If we clone the top-level parent with <code>--max-depth 0</code> we can ensure we're cloning the full cluster (0 means infinity here). 
Since we're using <code>--skip-chained-deps</code> and <em>not</em> <code>--clone-children</code> this should not lead to cloning any unwanted jobs outside the cluster.</li> </ol></li> </ol></li> <li>This is deemed too expensive. However, no other technicalities would prevent the approach.</li> </ol></li> <li>Call the hook script still only for failed jobs and abuse openQA's comment system to do the synchronization within the hook script. <ol> <li>[same as 1.1] The hook script determines whether a cluster has finished (and postpones until then using <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: Make hook scripts restartable with a special exit code (Resolved)" href="https://progress.opensuse.org/issues/112523">#112523</a>) and whether a cluster contains a failed job.</li> <li>The hook script switches to investigate the parallel parent if it is called for a parallel child. This means we would duplicate the effort if there are multiple failures within the same cluster so we need to synchronize: <ol> <li>The hook script writes the investigation comment before starting the investigation, e.g. "Spawning investigation jobs".</li> <li>The hook script checks whether another investigation comment has been created in the meantime and only proceeds if its own comment has the lower ID. Otherwise it deletes its comment and aborts.</li> <li>The hook script edits the comment with the actual contents after spawning the investigation jobs.</li> <li>Point 1.2.1 applies here as well.</li> </ol></li> <li>Not sure yet what technicalities will get in the way.</li> </ol></li> <li>Call the hook script still only for failed jobs and track already investigated jobs within the hook script (e.g. relying on an SQLite database). 
<ol> <li>No further details as we likely don't want that approach anyway.</li> </ol></li> <li>Call the hook script for the whole cluster providing the hook script with the appropriate job ID to focus the investigation on. This means the synchronization happens in openQA. <ol> <li>No further logic in the hook script is required but the additional openQA upstream feature might get a little involved.</li> <li>openQA's implementation would need to take 1.2.1 into account as well when providing the appropriate job ID. So we don't lose that complexity.</li> </ol></li> </ol> <hr> <p>I suppose we should go with approach 2 (still taking 1.2.1 into account of course).</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-07-06T15:36:39Z</p> <ul></ul><p>The idea to use <code>--max-depth 0</code> to fix 1.2.1 is actually very good because it has the nice side effect of cloning the full parallel cluster regardless of where we start. Normally the depth is limited to direct children so if we clone a parallel parent for its child then that's it. However, with <code>--max-depth 0</code> we will actually go down the tree again and basically clone the full cluster. I've just tested it locally and it really behaves as expected.</p> <p>So we could avoid finding the top-level parent. Actually we also should avoid that because otherwise the case when there are multiple roots would be problematic (as there is not <em>one</em> top-level parent). However, I suppose for (ab)using comments for synchronization (see approach 2. of the previous comment) we still need to decide on <em>one</em> job within the cluster to write the comment on. Otherwise, if different concurrent jobs would read/write comments on different jobs it wouldn't work. I would suggest to simply write the comment on the job with the lowest ID in the cluster. 
That job can be determined very easily and there is no ambiguity (unlike the top-level parent job which is harder to find, of which there might be multiple, and for which dependency cycles are problematic).</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-07-06T18:31:06Z</p> <ul></ul><p><a class="user active user-mention" href="https://progress.opensuse.org/users/22072">@mkittler</a> a very nice write-up indeed. I agree that we should go with approach 2. Eventually in the future we might need approach 4 for other reasons. That is ok as we consider os-autoinst/scripts also an experimental ground for potential later openQA built-in features.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-07-07T08:03:44Z</p> <ul></ul><p>One thought regarding 2.2.2 - maybe for better debugging it's worth leaving all comments, e.g. have something like "Triggered investigation jobs because of ..." where the reason it was spawned would still be visible in case it doesn't work as expected and the comment can provide some context</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-07-13T10:35:28Z</p> <ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>In Progress</i></li></ul><p><a href="https://github.com/os-autoinst/scripts/pull/170">https://github.com/os-autoinst/scripts/pull/170</a> was merged 2 days ago. I haven't received any feedback from users since then. 
(Before, they complained quite quickly if the investigate script had done a bad job dealing with parallel clusters.)</p> <p>So I've just checked parallel jobs being investigated on OSD myself via <code>select jobs.id from jobs join comments on jobs.id = comments.job_id join job_dependencies on jobs.id = job_dependencies.parent_job_id where job_dependencies.dependency = 2 and comments.text like '%investigation%' and t_finished > '2022-07-11' order by jobs.id desc limit 25;</code>.</p> <p>All jobs I've looked into look good, e.g. on <a href="https://openqa.suse.de/tests/9111048#comments">https://openqa.suse.de/tests/9111048#comments</a> and <a href="https://openqa.suse.de/tests/9105044#comments">https://openqa.suse.de/tests/9105044#comments</a> we can see that the parallel parent was selected correctly for the sync comment and the whole cluster was cloned for each investigation job. The same goes for <a href="https://openqa.suse.de/tests/9109613#dependencies">https://openqa.suse.de/tests/9109613#dependencies</a> which is also part of a bigger dependency tree and only jobs from its parallel cluster have been cloned (as expected).</p> <p>On o3 I only found the job <a href="https://openqa.opensuse.org/tests/2464678#dependencies">https://openqa.opensuse.org/tests/2464678#dependencies</a> (and its clones). This job failed because its parallel job hasn't been scheduled correctly. (I haven't investigated why.) The investigation was postponed once because not all dependencies were done:</p> <pre><code>Jul 12 10:25:02 ariel openqa-gru[4018]: Postponing to investigate job 2464678: waiting until pending dependencies have finished </code></pre> <p>Apparently postponing doesn't work because no automatic investigation was triggered later. I need to look into it because it also doesn't seem to work on OSD. Then the job was manually restarted by ggardet_arm. That didn't work either. 
The job ended up as parallel failed despite having not even a parallel dependency within the dependency tree. However, likely it is just a displaying issue because the investigation actually cloned the cluster correctly (see <a href="https://openqa.opensuse.org/tests/2465350">https://openqa.opensuse.org/tests/2465350</a>). The strangeness of that parallel dependency is maybe something to look into but out of the scope of this ticket.</p> <p>So I guess at least the investigation itself works as expected except for the postponing case. I'll need to look into that.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-07-13T12:10:28Z</p> <ul></ul><p>PR: <a href="https://github.com/os-autoinst/scripts/pull/171" class="external">https://github.com/os-autoinst/scripts/pull/171</a></p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-07-14T10:57:34Z</p> <ul></ul><p>The postponing now works. The job <a href="https://openqa.suse.de/tests/9126523" class="external">https://openqa.suse.de/tests/9126523</a> has been postponed¹ and was then investigated later. However, the job was actually postponed needlessly. 
Maybe it could still be optimized so jobs are only postponed if there are pending jobs within the same parallel cluster (and not just any pending jobs within the related dependency tree).</p> <hr> <p>¹</p> <pre><code>… Jul 14 11:12:22 openqa openqa-gru[5421]: Postponing to investigate job 9126523: waiting until pending dependencies have finished Jul 14 11:13:57 openqa openqa-gru[8560]: Postponing to investigate job 9126523: waiting until pending dependencies have finished Jul 14 11:15:22 openqa openqa-gru[10801]: Postponing to investigate job 9126523: waiting until pending dependencies have finished Jul 14 11:17:17 openqa openqa-gru[13377]: Postponing to investigate job 9126523: waiting until pending dependencies have finished </code></pre> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-07-14T11:41:18Z</p> <ul></ul><p>PR to implement the improvement mentioned in the previous comment: <a href="https://github.com/os-autoinst/scripts/pull/173" class="external">https://github.com/os-autoinst/scripts/pull/173</a></p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-07-15T10:11:53Z</p> <ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul><p>I've checked a few more jobs after the PR has been merged and it looks all good.</p> <p>Also the feedback comment is written on the right job: <a href="https://openqa.suse.de/tests/9105078#comment-565424" class="external">https://openqa.suse.de/tests/9105078#comment-565424</a></p> <p>So I suppose we can actually finally consider this ticket resolved.</p> </article> <article> <h1>openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M</h1> <p>2022-07-15T10:12:10Z</p> <ul><li><strong>Due date</strong> deleted 
(<del><i>2022-07-19</i></del>)</li></ul> </article> </main></body></html>