openSUSE Project Management Tool https://progress.opensuse.org/ 2021-01-07T12:42:09Z
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarios https://progress.opensuse.org/issues/81859?journal_id=361185 2021-01-07T12:42:09Z okurz okurz@suse.com
<ul></ul><p>As requested by dzedro, I accepted an MR to disable investigation jobs for OSD maintenance scenarios, although nothing here is specific to OSD or to maintenance, but maybe it makes him less grumpy :) Once we have improved how investigation jobs are triggered for multi-machine scenarios, we can re-enable openqa-investigate for this specific use case as well.</p>
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3613402021-01-08T09:44:05Zokurzokurz@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed child" href="/issues/81206">action #81206</a>: Trigger 'openqa-investigate' from within openQA when jobs fail on osd</i> added</li></ul> openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3613672021-01-08T10:49:33Zokurzokurz@suse.com
<ul><li><strong>Parent task</strong> set to <i>#80828</i></li></ul> openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3779032021-01-14T14:03:34Zokurzokurz@suse.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/377903/diff?detail_id=359045">diff</a>)</li><li><strong>Status</strong> changed from <i>New</i> to <i>Workable</i></li></ul><p>The openqa-clone-job parameter "--clone-children" has been mentioned. It likely comes with caveats. I assume either mkittler or Xiaojing_liu would be able to come up with a viable solution :)</p>
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3779092021-01-14T14:13:46Zdzedrojpupava@suse.com
<ul></ul><p>I'm not sure whether MM jobs with 3+ nodes can be cloned at all and, if so, whether the clone job needs special options or the cloned job must contain <code>PARALLEL_WITH</code> listing all nodes.<br>
I created these 3-node jobs to reproduce what was happening with the clone: <a href="https://openqa.suse.de/tests/5274167#dependencies" class="external">https://openqa.suse.de/tests/5274167#dependencies</a><br>
The result of 2x 2 nodes being restarted out of 3 happens with a "simple" clone_job, e.g. when clone_job is run on node1 and on node2.<br>
As a result, only 2 of the 3 nodes are cloned/restarted each time. <a href="https://openqa.suse.de/tests/5289094#dependencies" class="external">https://openqa.suse.de/tests/5289094#dependencies</a></p>
<pre><code>sudo -u geekotest /usr/share/openqa/script/clone_job.pl --from localhost --host localhost --skip-download --skip-chained-deps 5274167
Cloning dependencies of sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_node01@64bit
Created job #5289085: sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_support_server@64bit -> http://localhost/t5289085
Created job #5289086: sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_node01@64bit -> http://localhost/t5289086
sudo -u geekotest /usr/share/openqa/script/clone_job.pl --from localhost --host localhost --skip-download --skip-chained-deps --clone-children 5274167
Cloning dependencies of sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_node01@64bit
Created job #5289093: sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_support_server@64bit -> http://localhost/t5289093
Created job #5289094: sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_node01@64bit -> http://localhost/t5289094
</code></pre> openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3791982021-01-19T13:56:13Zmkittlermarius.kittler@suse.com
<ul></ul><p><a class="user active user-mention" href="https://progress.opensuse.org/users/11830">@dzedro</a> You need to clone the parallel parent, which is the job that does <strong>not</strong> have <code>PARALLEL_WITH</code> set but is referenced via <code>PARALLEL_WITH</code> in another job. So the job you need to clone is usually "the server", e.g. <code>script/openqa-clone-job … --clone-children https://openqa.suse.de/tests/5274166</code> clones 3 jobs here.</p>
<hr>
<p>I see the following problems:</p>
<ol>
<li>The clone job script operates non-atomically. So it can happen that one job of the cluster is cloned successfully and the next one fails. I'm not aware that the successfully cloned job would be discarded again in this case. Hence this looks like a recipe for ending up with half-scheduled clusters anyway.</li>
<li>openqa-investigate considers each job (that it possibly ends up cloning) individually. This way it might end up cloning the same cluster twice if multiple jobs within the same cluster are selected to be cloned. This also means that openqa-investigate possibly ends up cloning just the parallel child instead of the parent. Even with <code>--clone-children</code> this would not recreate the whole cluster, as mentioned before.</li>
<li>Using the clone job script also means that we're relying on the scheduler to repair half-assigned MM clusters and that directly chained dependencies are not supported at all.</li>
</ol>
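<p>As a rough illustration of problem 2, the following Python sketch maps each investigation candidate to its parallel parent and drops duplicates, so each cluster would be cloned only once and via its parent. It is not taken from openqa-investigate: the function names and the <code>parallel_parents</code> mapping are hypothetical, and following only the first parallel parent is a simplification.</p>
<pre><code>def parallel_root(job_id, parallel_parents):
    """Walk up parallel-parent links until reaching a job that has none."""
    visited = set()
    while parallel_parents.get(job_id):
        if job_id in visited:
            raise ValueError(f"dependency cycle at job {job_id}")
        visited.add(job_id)
        # Simplification (assumption): only the first parallel parent is followed.
        job_id = parallel_parents[job_id][0]
    return job_id


def clone_targets(candidates, parallel_parents):
    """Map candidates to their cluster roots, keeping each root only once."""
    targets = []
    for candidate in candidates:
        root = parallel_root(candidate, parallel_parents)
        if root not in targets:
            targets.append(root)
    return targets


# 5274166 and 5274167 appear in this ticket; 5274168 is a made-up ID for a third node.
parents = {5274167: [5274166], 5274168: [5274166]}
print(clone_targets([5274167, 5274168], parents))  # prints [5274166]
</code></pre>
<p>This would only address the duplicate and wrong-job selection; the non-atomic cloning from problem 1 remains.</p>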
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3792012021-01-19T14:06:33Zmkittlermarius.kittler@suse.com
<ul></ul><p>By the way, the restart API of openQA already does the right thing for each dependency type automatically and in an atomic way. If I remember correctly, it also returns the IDs of the jobs which have been restarted, so openqa-investigate could take these into account to avoid restarting the same cluster twice. So it seems tempting to simply use that API instead of the clone-job script. However, it has the following limitations so far:</p>
<ol>
<li>It only works within the same instance. That shouldn't be a problem here.</li>
<li>It does not allow changing settings. That's not much effort to implement and likely useful anyway.</li>
<li>The new job is always considered a clone of the original job, and a job can only be restarted if it has no clone yet. I suppose we would need a "detached" mode for the restart API to circumvent that. <em>Likely</em> not much effort to implement.</li>
</ol>
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3792102021-01-19T14:23:04Zokurzokurz@suse.com
<ul></ul><p>mkittler wrote:</p>
<blockquote>
<p>By the way, the restart API of openQA already does the right thing for each dependency type automatically and in an atomic way. If I remember correctly, it also returns the IDs of the jobs which have been restarted, so openqa-investigate could take these into account to avoid restarting the same cluster twice. So it seems tempting to simply use that API instead of the clone-job script. However, it has the following limitations so far:</p>
<ol>
<li>It only works within the same instance. That shouldn't be a problem here.</li>
<li>It does not allow changing settings. That's not much effort to implement and likely useful anyway.</li>
<li>The new job is always considered a clone of the original job, and a job can only be restarted if it has no clone yet. I suppose we would need a "detached" mode for the restart API to circumvent that. <em>Likely</em> not much effort to implement.</li>
</ol>
</blockquote>
<p>Would that mean that we move more functionality from the openqa-clone-job script into a lower layer and reuse it for the API and the clone-job script?</p>
<p>But another question: Would that fix the original issue in the best way? As an alternative, within openqa-investigate we could check whether the clone candidate has siblings <em>and</em> a parent, and then clone the parent instead of the clone candidate?</p>
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3792192021-01-19T14:32:44Zmkittlermarius.kittler@suse.com
<ul></ul><blockquote>
<p>Would that mean that we move more functionality from the openqa-clone-job script into a lower layer and reuse it for the API and the clone-job script?</p>
</blockquote>
<p>That's at least not what I meant in my previous comment. I thought that openqa-investigate would migrate to use the restart API via openqa-cli. I also don't think we can easily move any functionality from openqa-clone-job into the restart API because that script is supposed to work between multiple web UIs.</p>
<blockquote>
<p>Would that fix the original issue in the best way?</p>
</blockquote>
<p>That's the "cleanest" solution I can currently think of: we can reuse all the dependency handling already provided by the restart API, we don't have to do a lot of manual calls from the outside, and being able to change settings when restarting a job is beneficial anyway. Not sure whether it is the best™ solution, though.</p>
<blockquote>
<p>As an alternative, within openqa-investigate we could check whether the clone candidate has siblings and a parent, and then clone the parent instead of the clone candidate?</p>
</blockquote>
<p>We could do that. However, we would still need to keep track of the job IDs which have actually been cloned, and the output of the clone script isn't meant to be parsed (so far). We would also still not have solved problems "1." and "3." which I've mentioned in <a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" title="action: openqa-investigate triggers incomplete sets for multi-machine scenarios (Resolved)" href="https://progress.opensuse.org/issues/81859#note-7">#81859#note-7</a>. (Problem "3." is likely not so important.)</p>
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3796892021-01-21T10:54:34Zokurzokurz@suse.com
<ul></ul><p>as discussed in meeting:</p>
<ul>
<li>turn to epic</li>
<li>first step as subtask: Detect if there are any siblings for investigate candidate and abort early, optional debug log message about "unsupported multi-machine cluster"</li>
<li>next steps: Extend API to support an atomic operation for "list of jobs with dependencies", then potentially use that for openqa-investigate/client/openqa-clone-job</li>
</ul>
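<p>The first step could look roughly like the Python sketch below. It is not the actual implementation: the function names are hypothetical, and the shape of the job data (<code>parents</code>/<code>children</code> keyed by dependency type such as <code>Parallel</code>) is an assumption about what the job details look like. Plain chained dependencies are deliberately not treated as a cluster here.</p>
<pre><code># Dependency types treated as "part of a cluster" (assumption for this sketch).
CLUSTER_DEPENDENCY_TYPES = ("Parallel", "Directly chained")


def is_unsupported_cluster(job):
    """Return True if the job has any parallel or directly chained parents/children."""
    for relation in ("parents", "children"):
        dependencies = job.get(relation) or {}
        if any(dependencies.get(dep_type) for dep_type in CLUSTER_DEPENDENCY_TYPES):
            return True
    return False


def investigate(job):
    """Abort early for clusters; return whether investigation jobs would be triggered."""
    if is_unsupported_cluster(job):
        print(f"Job {job['id']}: unsupported multi-machine cluster, skipping investigation")
        return False
    # The actual cloning of investigation jobs would happen here.
    return True
</code></pre>
<p>Aborting before any clone call is made means a cluster is either left alone completely or, once proper support exists, handled as a whole.</p>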
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3801312021-01-25T10:47:56Zmkittlermarius.kittler@suse.com
<ul></ul><blockquote>
<p>Extend API to support an atomic operation for "list of jobs with dependencies"</p>
</blockquote>
<p>For the mere "listing" of dependencies we don't need an atomic operation. The job <em>creation</em> should be atomic in the sense that multiple jobs which belong to the same cluster are created by one API call internally using one DB transaction. So the problematic part is how to do the cloning/restarting. (See my comments <a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" title="action: openqa-investigate triggers incomplete sets for multi-machine scenarios (Resolved)" href="https://progress.opensuse.org/issues/81859#note-7">#81859#note-7</a> and <a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" title="action: openqa-investigate triggers incomplete sets for multi-machine scenarios (Resolved)" href="https://progress.opensuse.org/issues/81859#note-8">#81859#note-8</a>.)</p>
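<p>To illustrate what "atomic" means here, the following Python sketch creates all jobs of a cluster within one database transaction, so a failure on any job leaves no half-created cluster behind. It uses an in-memory SQLite database purely for illustration; the table layout and the function name are hypothetical and do not reflect the openQA schema or API.</p>
<pre><code>import sqlite3


def create_cluster(conn, job_names):
    """Insert all jobs of one cluster in a single transaction and return their IDs."""
    # The connection context manager commits on success and rolls back on any exception.
    with conn:
        job_ids = []
        for name in job_names:
            cursor = conn.execute("INSERT INTO jobs (name) VALUES (?)", (name,))
            job_ids.append(cursor.lastrowid)
        return job_ids


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)")
try:
    # The duplicate name violates the UNIQUE constraint on the third insert.
    create_cluster(conn, ["support_server", "node01", "node01"])
except sqlite3.IntegrityError:
    pass
# The two successful inserts were rolled back together with the failed one.
print(conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0])  # prints 0
</code></pre>
<p>This is the property the clone job script lacks when it issues one call per job.</p>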
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3813572021-02-01T14:25:24Zmkittlermarius.kittler@suse.com
<ul><li><strong>Assignee</strong> set to <i>mkittler</i></li></ul> openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3813592021-02-01T14:58:57Zmkittlermarius.kittler@suse.com
<ul></ul><blockquote>
<p>first step as subtask: Detect if there are any siblings for investigate candidate and abort early, optional debug log message about "unsupported multi-machine cluster"</p>
</blockquote>
<p>PR for that: <a href="https://github.com/os-autoinst/scripts/pull/67" class="external">https://github.com/os-autoinst/scripts/pull/67</a></p>
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3814172021-02-02T04:12:46Zopenqa_reviewopenqa-review@suse.de
<ul><li><strong>Due date</strong> set to <i>2021-02-16</i></li></ul><p>Setting due date based on mean cycle time of SUSE QE Tools</p>
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3814952021-02-02T09:57:27Zacarvajalacarvajal@suse.com
<ul></ul><p>Hello. I have been seeing the same issue in the HA job groups as well.</p>
<p>I have submitted <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/440" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/440</a> to temporarily disable it. I hope 'HA' is precise enough and does not accidentally disable this for other groups. A quick search through the group names makes me think 'HA' will be good enough, but I will wait for reviews.</p>
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3814972021-02-02T10:01:56Zacarvajalacarvajal@suse.com
<ul></ul><p>mkittler wrote:</p>
<blockquote>
<blockquote>
<p>first step as subtask: Detect if there are any siblings for investigate candidate and abort early, optional debug log message about "unsupported multi-machine cluster"</p>
</blockquote>
<p>PR for that: <a href="https://github.com/os-autoinst/scripts/pull/67" class="external">https://github.com/os-autoinst/scripts/pull/67</a></p>
</blockquote>
<p>Sorry, have not seen this message before submitting the MR.</p>
<p>I will keep an eye on the HA group today, and close the MR if I see that the incomplete re-triggers are gone.</p>
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3815332021-02-02T16:46:38Zmkittlermarius.kittler@suse.com
<ul></ul><p>Thanks, because admittedly I've only tested the PR locally and we have yet to see how well it works in production.</p>
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3815782021-02-03T09:23:57Zacarvajalacarvajal@suse.com
<ul></ul><p>mkittler wrote:</p>
<blockquote>
<p>Thanks, because admittedly I've only tested the PR locally and we have yet to see how well it works in production.</p>
</blockquote>
<p>Judging by the HA group yesterday and today, no jobs were automatically re-triggered with <code>:investigate:last_good_tests:</code>, so I think it's working.</p>
<p>The only odd thing I saw was that many jobs were cancelled as obsolete, but I don't believe it is related to this.</p>
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3821302021-02-10T08:40:15Zokurzokurz@suse.com
<ul><li><strong>Status</strong> changed from <i>Workable</i> to <i>Feedback</i></li></ul><p><a class="user active user-mention" href="https://progress.opensuse.org/users/22072">@mkittler</a> Once your changes are also effective on OSD, I can continue in #81868. As the ticket is phrased, it is enough to just prevent incomplete sets from being triggered; we do not necessarily need to fix that (now or ever). So, can you do a final check and resolve the ticket?</p>
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3822462021-02-11T15:23:00Zmkittlermarius.kittler@suse.com
<ul></ul><p>I've been checking jobs which finished within the last 7 days. There are still jobs cloned with chained dependencies, see:</p>
<pre><code>select jobs.id, child_job_id, parent_job_id, comments.text
from jobs
join comments on jobs.id = comments.job_id
join job_dependencies on jobs.id = job_dependencies.parent_job_id or jobs.id = job_dependencies.child_job_id
where jobs.id >= 5409964
  and comments.text like '%Automatic investigation jobs%'
  and dependency = 1;
</code></pre>
<p>But there are no more jobs cloned with other dependencies, see:</p>
<pre><code>select jobs.id, child_job_id, parent_job_id, comments.text
from jobs
join comments on jobs.id = comments.job_id
join job_dependencies on jobs.id = job_dependencies.parent_job_id or jobs.id = job_dependencies.child_job_id
where jobs.id >= 5409964
  and comments.text like '%Automatic investigation jobs%'
  and dependency != 1;
</code></pre>
<p>The investigate script still creates an empty comment on these jobs, which could be improved.</p>
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3822472021-02-11T15:27:18Zmkittlermarius.kittler@suse.com
<ul></ul><p>PR for avoiding the empty comment: <a href="https://github.com/os-autoinst/scripts/pull/68" class="external">https://github.com/os-autoinst/scripts/pull/68</a></p>
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3822482021-02-11T15:34:48Zmkittlermarius.kittler@suse.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul><blockquote>
<p>So, can you do a final check and resolve the ticket?</p>
</blockquote>
<p>The mentioned PR has been merged so I'd consider this done as well. We can decide later whether it makes sense to tackle the problems mentioned in <a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" title="action: openqa-investigate triggers incomplete sets for multi-machine scenarios (Resolved)" href="https://progress.opensuse.org/issues/81859#note-7">#81859#note-7</a>.</p>
openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=3927172021-03-18T13:14:05Zokurzokurz@suse.com
<ul><li><strong>Due date</strong> deleted (<del><i>2021-02-16</i></del>)</li></ul> openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarioshttps://progress.opensuse.org/issues/81859?journal_id=4292812021-07-21T12:13:44Zokurzokurz@suse.com
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed child" href="/issues/95783">action #95783</a>: Provide support for multi-machine scenarios handled by openqa-investigate size:M</i> added</li></ul>