openSUSE Project Management Tool: Issues (https://progress.opensuse.org/, updated 2023-11-04T11:09:38Z)
QA - action #139097 (Resolved): Improve collaboration with Eng-Infra - Firewall management access... (https://progress.opensuse.org/issues/139097, 2023-11-04T11:09:38Z, okurz, okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>SUSE-IT relies heavily on a new firewall configuration that separates multiple zones, e.g. "QE" zones from other zones in R&D. In <a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" title="action: Improve collaboration with Eng-Infra - Firewall management access, potentially also DHCP+DNS size:M (Resolved)" href="https://progress.opensuse.org/issues/125450">#125450</a> some limited access to firewall logs was already provided. However, in many cases that does not help us, as in the recent migration of qam.suse.de to PRG2.</p>
<p>After the instance was moved to PRG2, GitLab runners repeatedly could not reach qam.suse.de, as visible in <a href="https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1956085" class="external">https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1956085</a>:</p>
<pre><code>urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='dashboard.qam.suse.de', port=80): Max retries exceeded with url: /api/incidents (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2730240780>: Failed to establish a new connection: [Errno 110] Connection timed out',))
</code></pre>
<p>While this GitLab CI job was running, I looked into the firewall logs that I have access to using<br>
qe-debug.suse.de, as documented on <a href="https://wiki.suse.net/index.php/OpenQA#Firewall_between_different_SUSE_network_zones" class="external">https://wiki.suse.net/index.php/OpenQA#Firewall_between_different_SUSE_network_zones</a>:</p>
<pre><code>tail -f /var/log/remote/gw-infra-log.suse.de.log | grep '\(10.145.0.26\|2a07:de40:b203:8:10:145:0:26\)'
</code></pre>
<p>This uses the IPv4 and IPv6 addresses of qam.suse.de and yields no results, so either the command is not constructed correctly or the log does not contain the relevant data. As we critically rely on whatever firewall affects all of our services, we should ensure that there is enough redundancy in access.</p>
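<p>One possible, unconfirmed reason for an empty result is that the log writes addresses in a different textual form than the one in the grep pattern, e.g. upper-case or zero-padded IPv6. The following is a minimal Python sketch, not part of the documented workflow, that compares parsed addresses instead of raw strings. The function name <code>mentions_any</code> and the example log lines are assumptions made for illustration; the real log format is not shown in this ticket.</p>

```python
import ipaddress
import re

# Candidate tokens: dotted IPv4 first, then anything that looks like
# colon-separated hex groups (timestamps match too and are rejected below).
_CANDIDATE = re.compile(
    r"(?:\d{1,3}(?:\.\d{1,3}){3})|(?:[0-9A-Fa-f]*:[0-9A-Fa-f:]+)"
)

# Addresses of qam.suse.de as given in the ticket.
WANTED = {
    ipaddress.ip_address("10.145.0.26"),
    ipaddress.ip_address("2a07:de40:b203:8:10:145:0:26"),
}


def mentions_any(line, wanted=WANTED):
    """Return True if the log line contains any of the wanted addresses."""
    for token in _CANDIDATE.findall(line):
        try:
            addr = ipaddress.ip_address(token)
        except ValueError:
            continue  # not a valid address, e.g. a timestamp like 11:09:38
        if addr in wanted:
            return True
    return False
```

<p>Because the comparison happens on parsed addresses, a hypothetical line containing <code>2A07:DE40:B203:0008:0010:0145:0000:0026</code> would match, whereas the literal grep pattern would not. To follow a live log, the function could be applied to each line read from <code>tail -f</code> via standard input.</p>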
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> We can ensure that 2+ persons within EMEA timezones have access to firewalls covering multiple Nbg+Prg locations which actually affect us</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Look into what was done in <a class="issue tracker-4 status-3 priority-5 priority-high3 closed child" title="action: Improve collaboration with Eng-Infra - Firewall management access, potentially also DHCP+DNS size:M (Resolved)" href="https://progress.opensuse.org/issues/125450">#125450</a> and <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-113832" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-113832</a></li>
<li>Ask Eng-Infra who has access, why qe-debug.suse.de does not provide the relevant firewall denied messages and what to do to improve</li>
<li>Ensure whatever we come up with is properly documented and known within the SUSE QE Tools team</li>
</ul>
openQA Tests - action #134372 (Resolved): Test for nginx container appears to be very unstable on... (https://progress.opensuse.org/issues/134372, 2023-08-17T09:51:52Z, clanig, clanig@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>openQA test in scenario sle-15-SP5-BCI-Updates-x86_64-nginx_on_SLES_12-SP5@64bit fails in<br>
<a href="https://openqa.suse.de/tests/11852936/modules/_root_BCI-tests_nginx_docker/steps/1" class="external">_root_BCI-tests_nginx_docker</a></p>
<p>The test fails more often than it passes in _root_BCI-tests_nginx_docker. Manual tests could not reproduce the issue.<br>
This is likely a timing issue, e.g. curl might be run too early.</p>
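<p>If the suspected timing issue is real, one common mitigation is to poll the service until it answers instead of issuing a single request right after container start. The following Python sketch illustrates that idea only; it is not taken from the BCI test code, and the names <code>wait_until_ready</code> and <code>http_ok</code> as well as the URL are hypothetical.</p>

```python
import time
import urllib.request


def wait_until_ready(probe, timeout=30.0, interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused/reset while the service is still starting;
            # urllib.error.URLError is also a subclass of OSError.
            pass
        if clock() >= deadline:
            return False
        sleep(interval)


def http_ok(url="http://localhost:8080/"):  # hypothetical URL and port
    """Probe that succeeds once the server answers with HTTP 200."""
    with urllib.request.urlopen(url, timeout=2) as response:
        return response.status == 200
```

<p>A test could then call <code>wait_until_ready(http_ok)</code> before the actual assertion. The <code>sleep</code> and <code>clock</code> parameters exist only so the helper can be exercised without real waiting.</p>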
<a name="Test-suite-description"></a>
<h2 >Test suite description<a href="#Test-suite-description" class="wiki-anchor">¶</a></h2>
<p>The base test suite is used for job templates defined in YAML documents. It has no settings of its own.</p>
<a name="Reproducible"></a>
<h2 >Reproducible<a href="#Reproducible" class="wiki-anchor">¶</a></h2>
<p>Fails since (at least) Build <a href="https://openqa.suse.de/tests/11840848" class="external">3.7_nginx-image</a></p>
<a name="Expected-result"></a>
<h2 >Expected result<a href="#Expected-result" class="wiki-anchor">¶</a></h2>
<p>Last good: (unknown) (or more recent)</p>
<a name="Further-details"></a>
<h2 >Further details<a href="#Further-details" class="wiki-anchor">¶</a></h2>
<p>Always latest result in this scenario: <a href="https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=BCI-Updates&machine=64bit&test=nginx_on_SLES_12-SP5&version=15-SP5" class="external">latest</a></p>
openQA Tests - action #134369 (Resolved): test fails in upload_image: Unavailable instance type f... (https://progress.opensuse.org/issues/134369, 2023-08-17T08:37:56Z, clanig, clanig@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>openQA test in scenario sle-micro-5.1-EC2-BYOS-HVM-aarch64-publiccloud_upload_img@64bit fails in<br>
<a href="https://openqa.suse.de/tests/11850907/modules/upload_image/steps/88" class="external">upload_image</a></p>
<p>Error msg:<br>
Your requested instance type (m6g.medium) is not supported in your requested Availability Zone (us-east-1e).</p>
<p>Compared with the previous build, the aarch64 test moved to its correct column in openQA instead of being listed<br>
under the x86_64 column. This might be related to the issue.</p>
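<p>The error message means that the chosen Availability Zone does not offer the requested instance type. One way to avoid this, sketched below under assumptions, is to select a zone from the list of zones that do offer the type. The function name <code>zones_supporting</code> is hypothetical and the sample data is invented for illustration; it does not reflect real AWS availability. The dictionary shape follows the response of the EC2 <code>DescribeInstanceTypeOfferings</code> API; whether the upload tooling used by this test can consume such a list is not stated in this ticket.</p>

```python
def zones_supporting(offerings, instance_type):
    """Return the sorted availability zones that offer `instance_type`.

    `offerings` is expected to look like the response of
    boto3's ec2.describe_instance_type_offerings(
        LocationType="availability-zone",
        Filters=[{"Name": "instance-type", "Values": [instance_type]}],
    )
    """
    return sorted(
        item["Location"]
        for item in offerings.get("InstanceTypeOfferings", [])
        if item.get("InstanceType") == instance_type
        and item.get("LocationType") == "availability-zone"
    )


# Invented sample data, only to show the expected structure.
SAMPLE = {
    "InstanceTypeOfferings": [
        {"InstanceType": "m6g.medium", "LocationType": "availability-zone",
         "Location": "us-east-1b"},
        {"InstanceType": "m6g.medium", "LocationType": "availability-zone",
         "Location": "us-east-1a"},
        {"InstanceType": "t3.micro", "LocationType": "availability-zone",
         "Location": "us-east-1e"},
    ]
}
```

<p>With this sample, <code>zones_supporting(SAMPLE, "m6g.medium")</code> returns the two zones that list the type and omits us-east-1e, so a caller could pick the first entry rather than relying on a default zone.</p>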
<a name="Test-suite-description"></a>
<h2 >Test suite description<a href="#Test-suite-description" class="wiki-anchor">¶</a></h2>
<p>The base test suite is used for job templates defined in YAML documents. It has no settings of its own.</p>
<a name="Reproducible"></a>
<h2 >Reproducible<a href="#Reproducible" class="wiki-anchor">¶</a></h2>
<p>Fails since (at least) Build <a href="https://openqa.suse.de/tests/11850907" class="external">0115</a> (current job)</p>
<a name="Expected-result"></a>
<h2 >Expected result<a href="#Expected-result" class="wiki-anchor">¶</a></h2>
<p>Last good: (unknown) (or more recent)</p>
<a name="Further-details"></a>
<h2 >Further details<a href="#Further-details" class="wiki-anchor">¶</a></h2>
<p>Always latest result in this scenario: <a href="https://openqa.suse.de/tests/latest?arch=aarch64&distri=sle-micro&flavor=EC2-BYOS-HVM&machine=64bit&test=publiccloud_upload_img&version=5.1" class="external">latest</a></p>
openQA Tests - action #131042 (Resolved): test fails due to missing qcow2 image (https://progress.opensuse.org/issues/131042, 2023-06-16T14:26:32Z, clanig, clanig@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>openQA test in scenario sle-micro-5.4-Container-Image-Updates-aarch64-Build5.4_4.2.47-sle_micro_toolbox_image@aarch64-virtio fails<br>
because the download of "/var/lib/openqa/cache/openqa.suse.de/SLE-Micro.aarch64-5.4.0-Default-GM-Updated.qcow2" failed: 404 Not Found.</p>
<a name="Test-suite-description"></a>
<h2 >Test suite description<a href="#Test-suite-description" class="wiki-anchor">¶</a></h2>
<p>The base test suite is used for job templates defined in YAML documents. It has no settings of its own.</p>
<a name="Reproducible"></a>
<h2 >Reproducible<a href="#Reproducible" class="wiki-anchor">¶</a></h2>
<p>Fails since (at least) Build <a href="https://openqa.suse.de/tests/11360877" class="external">5.4_4.2.47</a> (current job)</p>
<a name="Expected-result"></a>
<h2 >Expected result<a href="#Expected-result" class="wiki-anchor">¶</a></h2>
<p>Last good: <a href="https://openqa.suse.de/tests/11344447" class="external">5.4_4.2.45</a> (or more recent)</p>
<a name="Further-details"></a>
<h2 >Further details<a href="#Further-details" class="wiki-anchor">¶</a></h2>
<p>Always latest result in this scenario: <a href="https://openqa.suse.de/tests/latest?arch=aarch64&distri=sle-micro&flavor=Container-Image-Updates&machine=aarch64-virtio&test=sle_micro_toolbox_image&version=5.4" class="external">latest</a></p>
QA - action #125450 (Resolved): Improve collaboration with Eng-Infra - Firewall management access... (https://progress.opensuse.org/issues/125450, 2023-03-06T12:30:04Z, okurz, okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>Apparently, in many cases <a class="user active user-mention" href="https://progress.opensuse.org/users/15284">@rwawrig</a> can help best with issues spanning multiple locations, e.g. the firewall between NUE1 and NUE2, as in <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-113832" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-113832</a>, but the timezone difference is an obstacle. Could more people, such as SUSE QE Tools, be given access to firewalls, even if it is just read-only for investigation?</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> We can ensure that 2+ persons within EMEA timezones have access to firewalls covering multiple Nbg+Prg locations</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>See how <a class="user active user-mention" href="https://progress.opensuse.org/users/15284">@rwawrig</a> could help in <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-113832" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-113832</a>; due to the significant timezone difference, however, the reaction time is slow in both directions</li>
<li>Follow the discussion in <a href="https://sd.suse.com/servicedesk/customer/portal/1/SD-113959" class="external">https://sd.suse.com/servicedesk/customer/portal/1/SD-113959</a> regarding DHCP and apply the same solution for firewall if applicable, e.g. create a specific ticket with specific requirements and suggestions</li>
<li><em>Optional</em> also try to handle <a class="issue tracker-6 status-15 priority-4 priority-default child parent" title="coordination: [epic] Get management access to o3/osd and other QE related VMs (Blocked)" href="https://progress.opensuse.org/issues/121726">#121726</a> in the same ticket aka. "just get it done" :)</li>
</ul>
openQA Project - action #113078 (New): no investigation job triggered for one case, not even "ret... (https://progress.opensuse.org/issues/113078, 2022-06-27T11:33:51Z, okurz, okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://openqa.suse.de/tests/9029305#comments" class="external">https://openqa.suse.de/tests/9029305#comments</a> shows</p>
<pre><code>Automatic investigation jobs:
No test changes recorded, test regression unlikely. Skipping test regression investigation job. No test regression expected. Not triggered 'good build+test' as it would be the same as 3., good build + current test
</code></pre>
<p>so according to this comment no investigation jobs have been triggered at all. Shouldn't at least the retry job be triggered?</p>
openQA Infrastructure - action #77887 (Resolved): [tools][openqa] Enable automatic openQA investi... (https://progress.opensuse.org/issues/77887, 2020-11-14T15:46:47Z, okurz, okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>SUSE QEM is challenged by a relatively high false-positive rate of openQA tests. This is due to the objectively higher quality of released products in comparison to products in development, i.e. pre-GM SLE including Tumbleweed snapshots before release. We already use "openqa-investigate" for o3, where it has been running for multiple months, and I have received positive feedback. We can now extend the solution to osd.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Automatic investigation jobs are triggered and commented on new, unlabeled failures within osd production groups, e.g. not development groups</li>
<li><strong>AC2:</strong> Automatic investigation jobs run within a reasonable time to provide useful feedback to reviewers in their regular review routines</li>
<li><strong>AC3:</strong> No harmful performance impact on infrastructure due to too many automatic investigation jobs</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>There is a script based solution "openqa-investigate" which we already use automatically within o3, see <a href="https://gitlab.suse.de/openqa/auto-review/-/blob/master/.gitlab-ci.yml#L110" class="external">https://gitlab.suse.de/openqa/auto-review/-/blob/master/.gitlab-ci.yml#L110</a></li>
<li>Try out dry-runs of "openqa-investigate" against OSD and check for obvious things missing or going wrong, e.g. immediate errors or crashes or unreasonable results</li>
<li>Extending to osd can be just as simple as applying a similar block in .gitlab-ci.yml for osd</li>
<li>Once activated, monitor over a couple of days for usefulness and performance impact</li>
<li>Consider changing the schedule of the scheduled pipeline, e.g. trigger more often over the day, or even "continuous" :)</li>
<li>Optional: Ensure that auto-review walks first over all issues, potentially even failed ones to detect known issues, and if unknown to auto-review, only then trigger investigation jobs</li>
</ul>