openQA Project - coordination #96185: [epic] Multimachine failure rate increased
https://progress.opensuse.org/issues/96185

dzedro (jpupava@suse.com) wrote on 2021-07-28 13:00:
- Description updated

okurz (okurz@suse.com) wrote on 2021-07-29 08:09:
- Related to action #96191 added: Provide "fail-rate" of tests, especially multi-machine, in grafana size:M

okurz (okurz@suse.com) wrote on 2021-07-29 08:09:
- Related to action #95299 added: Tests timeout with reason 'setup exceeded MAX_SETUP_TIME' on osd ppc64le workers auto_review:"Result: timeout":retry size:M

okurz (okurz@suse.com) wrote on 2021-07-29 08:12:
- Related to action #95824 added: [qe-sap][ha][shap] test fails in register_system - unable to download license, likely network configuration problem in multi-machine cluster?

okurz (okurz@suse.com) wrote on 2021-07-29 08:12:
- Related to action #95801 added: [qe-sap][ha][css][shap] test fails in register_system of multi-machine HA tests, failing to access network

okurz (okurz@suse.com) wrote on 2021-07-29 08:12:
- Related to action #95788 added: [qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network

okurz (okurz@suse.com) wrote on 2021-07-29 08:14:
- Project changed from QA to openQA Project
- Description updated
- Category set to Regressions/Crashes
- Priority changed from Normal to High
- Target version set to Ready

Thanks for your ticket. Just yesterday I created the related #96191 ("Provide 'fail-rate' of tests, especially multi-machine, in grafana"). Last week I already found multiple network-related problems in multi-machine tests; I linked those as related.
However, the various kinds of multi-machine tests differ a lot, and this is again an area that SUSE QE Tools team members have little experience with, so I don't see that we can offer much help from the SUSE QE Tools side. Basically I would hope that the multi-machine test experts assemble and look into the problem together.

EDIT: I asked for help in https://chat.suse.de/channel/testing?msg=F6s78REbS5XRXjpQ3

asmorodskyi wrote on 2021-07-29 08:39:
I know that it does not help much, but I would remove the job https://openqa.suse.de/tests/6588107#step/boot_to_desktop/10 from the description. A boot_to_desktop failure is hardly likely to be related to MM infrastructure problems.

asmorodskyi wrote on 2021-07-29 08:44:
These two also look unrelated to MM:
https://openqa.suse.de/tests/6588108#step/2_sw_multipath_s_aa/1
https://openqa.suse.de/tests/6588254#step/installation/18

I think it is important to solve one problem at a time.

okurz (okurz@suse.com) wrote on 2021-07-29 09:30:
- Tracker changed from action to coordination
- Subject changed from "Multimachine failure rate increased" to "[epic] Multimachine failure rate increased"
- Status changed from New to Blocked
- Assignee set to okurz

Agreed. Making this an epic.

@asmorodskyi helped to identify one issue about a failed GRE tunnel creation: #96260 ("Failed to add GRE tunnel to openqaworker10 on most OSD workers").

I would be happy to see more feedback and problem analysis, but at the very least we should rule out that #96260 is the cause of many or most of the failed examples.
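To rule that out on a given worker, one could check whether the GRE ports actually exist on the Open vSwitch bridge. A minimal sketch, assuming the usual openQA multi-machine setup with a bridge named br1 (bridge and port names are site-specific):

```sh
# List the ports on the MM bridge; the GRE tunnels to the other workers
# normally show up as gre1, gre2, ... ports.
ovs-vsctl list-ports br1

# Show which remote endpoint each tunnel interface points to.
ovs-vsctl --columns=name,type,options list Interface
```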
dzedro (jpupava@suse.com) wrote on 2021-07-29 10:12:

asmorodskyi wrote:
> I know that it does not help much, but I would remove the job https://openqa.suse.de/tests/6588107#step/boot_to_desktop/10 from the description. A boot_to_desktop failure is hardly likely to be related to MM infrastructure problems.

No, it is an MM test: one node prepares a PXE server and the second node boots from it, so it can easily be an MM or network problem.

dzedro (jpupava@suse.com) wrote on 2021-07-29 13:19:
asmorodskyi wrote:
> These two also look unrelated to MM:
> https://openqa.suse.de/tests/6588108#step/2_sw_multipath_s_aa/1
> https://openqa.suse.de/tests/6588254#step/installation/18
>
> I think it is important to solve one problem at a time.

The first one could be anything; the second one is not MM, though it could be related to network issues. They can of course be removed.
Unfortunately, collecting new MM failures every day is no problem at all.
These were just examples of MM or network failures. We can restrict the list to MM failures only, but when even single-machine jobs fail due to the network, then MM jobs can fail due to the network, due to MM itself, or both.

dzedro (jpupava@suse.com) wrote on 2021-08-02 11:52:
Yesterday evening I restarted two MM clusters; node1 and node2 of this cluster run on openqaworker10. I would not read too much into that detail, I picked these tests because they always fail. I will also restart another cluster running on a random worker and collect tcpdump. Here are tcpdumps and some info from the workers at the time the failures happened: ftp://10.100.12.155/MM/ (see the capture sketch below).
I did a manual retry of qam_ha_rolling_upgrade_migration; the reason the test was failing in multiple places is a failure in name resolution (https://openqa.suse.de/tests/6628646#step/register_without_ltss/10). The same failure also happened during SCC registration; unfortunately there is no video to see it, but I had to go to the network setup and make the installation redo the DHCP setup, sometimes multiple times. Generally the MM/network/DNS was unable to resolve addresses like scc.suse.com or updates.suse.com. When I retried https://openqa.suse.de/tests/6628646#step/register_without_ltss/10, it passed.
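A capture like the ones above could be taken roughly as follows (a sketch; br1 is the typical openQA MM bridge name and the file paths are examples, adjust to the local setup):

```sh
# Capture DNS traffic on the multi-machine bridge while the cluster jobs run.
tcpdump -i br1 -w /tmp/mm-dns.pcap port 53

# Cross-check whether the worker itself can resolve the names that failed
# inside the SUTs.
dig +short scc.suse.com
dig +short updates.suse.com
```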
qam_alpha_cluster:
https://openqa.suse.de/tests/6628644
https://openqa.suse.de/tests/6628643
https://openqa.suse.de/tests/6628642

qam_ha_rolling_upgrade_migration:
https://openqa.suse.de/tests/6628647
https://openqa.suse.de/tests/6628646
https://openqa.suse.de/tests/6628645

okurz (okurz@suse.com) wrote on 2021-08-02 17:48:
Sorry, I did not understand your last comment. Did you want to report what you did, or what you plan to do? And do you think it would help to exclude openqaworker10 from all tests (not only multi-machine) and resolve #96260 first?

okurz (okurz@suse.com) wrote on 2021-08-03 12:13:
I tried to find the "fail ratio per worker" with SQL but so far have not found a good approach. Maybe something along the lines of:

```sql
select count(jobs.id),workers.host from jobs left join workers on jobs.assigned_worker_id = workers.id join (select count(j.id) as failed_jobs_count,host from jobs j join workers w on j.assigned_worker_id = w.id group by w.host) failed_jobs on workers.host = failed_jobs.host group by workers.host order by count desc;
```

might work. I guess I need to either turn the subquery into a join or the other way around.
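For what it's worth, PostgreSQL's FILTER clause can compute both counts in one pass, which avoids the subquery-vs-join question entirely; an untested sketch against the same tables:

```sql
-- Failed vs. total jobs and the resulting fail ratio, per worker host.
-- FILTER counts only the rows matching the condition within each group.
SELECT w.host,
       count(*) FILTER (WHERE j.result = 'failed') AS failed,
       count(*)                                    AS total,
       round(100.0 * count(*) FILTER (WHERE j.result = 'failed') / count(*), 2) AS fail_percent
FROM jobs j
LEFT JOIN workers w ON j.assigned_worker_id = w.id
GROUP BY w.host
ORDER BY fail_percent DESC;
```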
Now I did it semi-automatically. All failed jobs per worker host:

```
select count(jobs.id),host from jobs left join workers on jobs.assigned_worker_id = workers.id where result = 'failed' group by host order by count desc;
 count | host
-------+---------------------
 13660 |
 11749 | openqaworker5
 11173 | grenache-1
  8573 | openqaworker2
  7271 | openqaworker6
  5727 | openqaworker9
  5405 | openqaworker13
  5349 | openqaworker8
  4869 | openqaworker-arm-2
  4619 | openqaworker3
  3200 | openqaworker10
  3016 | openqaworker-arm-1
  3005 | openqaworker-arm-3
  2953 | QA-Power8-4-kvm
  2415 | QA-Power8-5-kvm
  2187 | powerqaworker-qam-1
  1304 | malbec
   360 | automotive-3
(18 rows)
```

and all jobs per worker host:

```
select count(jobs.id),host from jobs left join workers on jobs.assigned_worker_id = workers.id group by host order by count desc;
 count  | host
--------+---------------------
 182809 |
  76185 | openqaworker5
  58381 | openqaworker6
  44507 | openqaworker9
  41977 | openqaworker8
  38853 | grenache-1
  38732 | openqaworker-arm-2
  37838 | openqaworker3
  32992 | openqaworker13
  28221 | openqaworker-arm-3
  27405 | openqaworker2
  21533 | openqaworker10
  20928 | openqaworker-arm-1
  18804 | QA-Power8-5-kvm
  18454 | QA-Power8-4-kvm
  16240 | powerqaworker-qam-1
   9627 | malbec
   1397 | automotive-3
    287 | openqaworker11
(19 rows)
```

and all failed jobs regardless of worker host:

```
select count(jobs.id) from jobs where result = 'failed';
 count
-------
 96835
(1 row)
```

and all jobs regardless of worker host:

```
select count(jobs.id) from jobs;
 count
--------
 715170
(1 row)
```
So the total fail ratio is 96835/715170 = 13.54%, while the openqaworker10-specific fail ratio is 3200/21533 = 14.86%. An arbitrary other example is openqaworker5 with 15.42%; grenache-1 is at 28.76% (!). In conclusion, openqaworker10 does not show a significantly different fail ratio itself, hence I don't think we need to exclude it.
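Note these ratios cover all jobs, not only multi-machine ones. To focus on the MM subset, the query could be restricted to jobs participating in a parallel cluster; a sketch, assuming the openQA job_dependencies table where dependency = 2 marks a parallel dependency:

```sql
-- Fail ratio per worker host, counting only jobs that take part in a
-- parallel cluster (as parent or child of a parallel dependency).
SELECT w.host,
       count(*) FILTER (WHERE j.result = 'failed') AS failed,
       count(*)                                    AS total
FROM jobs j
LEFT JOIN workers w ON j.assigned_worker_id = w.id
WHERE j.id IN (SELECT child_job_id  FROM job_dependencies WHERE dependency = 2
               UNION
               SELECT parent_job_id FROM job_dependencies WHERE dependency = 2)
GROUP BY w.host
ORDER BY total DESC;
```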
dzedro (jpupava@suse.com) wrote on 2021-08-03 12:38:

okurz wrote:
> Sorry, I did not understand your last comment. Did you want to report what you did, or what you plan to do? And do you think it would help to exclude openqaworker10 from all tests (not only multi-machine) and resolve #96260 first?

I reported what I did. I don't think that only openqaworker10 has a problem. I am not sure how #96260 affects the issue: a GRE tunnel is either added or not, and if not, there is no connection between the workers at all, but the failures I see look more like random connection drops.

okurz (okurz@suse.com) wrote on 2021-08-27 09:54:
With #96260 resolved I have now added #96191 to the backlog.

dzedro (jpupava@suse.com) wrote on 2021-09-07 13:57:
I tried to install/use ovs-test (https://docs.openvswitch.org/en/latest/ref/ovs-test.8/) to debug the Open vSwitch setup.
But the openvswitch-test package looks broken: the script still uses Python 2 print statements, which fail under Python 3.

```
# ovs-test -h
  File "/usr/bin/ovs-test", line 45
    print "Node %s:%u " % (node[0], node[1])
        ^
SyntaxError: invalid syntax
```
There is also ovs-tcpdump (https://docs.openvswitch.org/en/latest/ref/ovs-tcpdump.8/) and other ovs-* tools, but I have not tried whether they work.
MM jobs are not failing as frequently anymore as when the ticket was created, but there are still MM failures. Maybe it's the network, I don't know.
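If ovs-tcpdump does work, it should allow capturing a SUT's traffic by mirroring its tap port without disturbing the flows; a rough sketch (the tap device name is hypothetical, and I am assuming extra options are passed through to tcpdump as the man page describes):

```sh
# Mirror the given ovs port onto a temporary mirror port and run tcpdump
# on it; unrecognized options such as -w are handed through to tcpdump.
ovs-tcpdump -i tap5 -w /tmp/mm-tap5.pcap
```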
okurz (okurz@suse.com) wrote on 2021-10-27 16:27:
- Description updated

I have monitored this topic over the past months. I can see that the "wicked" tests are very stable, so I doubt any generic problem with our backends or infrastructure is left. SAP-related tests are a different kind of problem, e.g. see #95458 ("[qe-sap][ha] SUT reboots unexpectedly, leading to tests failing in HA scenarios") and #95788 ("[qe-sap][ha][shap] test fails in iscsi_client or other modules in HA tests, missing network").

Based on graphs like https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=27&orgId=1&from=now-30d&to=now I can see that some openQA worker hosts are more likely to produce problems than others, e.g. openqaworker-arm-4 and openqaworker-arm-5. Those two are handled specifically in tickets like #101048 ("[epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3"), and they are not even enabled for multi-machine tests. The next in line is openqaworker2, which runs many "exotic" machines (vmware, hyperv, IPMI, s390x), which could explain its higher fail ratio. Over the period for which we have recorded data in the mentioned graph (about a month) I see a significant increase in fail ratio only for openqaworker-arm-4/5, as already mentioned; for the other hosts it stays the same or goes down.

Looking at https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=24&from=1632016393108&to=1635345865912 I can see no significant change over the reporting period of the last month. Most multi-machine tests are obsoleted; the next biggest section is "passed".

okurz (okurz@suse.com) wrote on 2021-10-27 16:30:
- Description updated

@dzedro do you agree that the situation has improved again, or are you aware of still-problematic areas besides the known SAP test scenarios?

okurz (okurz@suse.com) wrote on 2021-11-09 14:15:
- Status changed from Blocked to Resolved

No response; assuming the specific issue at hand is fixed. In any case we have better monitoring now, so we should have a better chance of detecting such issues in the near future. There are also other related tickets still open with more specific information; see the related tasks.

dzedro (jpupava@suse.com) wrote on 2021-11-09 14:19:
Sorry, I missed the comment. Yes, I agree that the MM failure rate is much better now.
Existing or new failures should be handled separately.