openSUSE Project Management Tool: Issues (https://progress.opensuse.org/, updated 2023-08-30T08:46:38Z)
openQA Infrastructure - action #134816 (Resolved): [tools] grafana dashboard for `OpenQA Jobs tes...` (https://progress.opensuse.org/issues/134816, 2023-08-30T08:46:38Z, osukup)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Dashboard <a href="https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1" class="external">https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1</a></p>
<p>Data is missing in the graphs showing running tests since yesterday's migration.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> No missing data for osd on Grafana</li>
<li><strong>AC2:</strong> Alerts related to affected panels are functioning</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>In salt states in monitoring/telegraf/telegraf-webui.conf, instead of <code>grains['fqdn']</code> use something like <code>grains.get('primary_webui_domain', grains.get('fqdn'))</code>. Alternatively we could use the "id" grain in place of the FQDN</li>
<li>If the above does not work, use an OR expression in the queries, since we already have data with different domains in the database (or implement that anyway to cover the data from 2023-08-29 to today)</li>
<li>Also check whether the related alerts need to be adapted</li>
<li>As an alternative, could we change the FQDN of osd to point back to openqa.suse.de?
<ul>
<li>Apparently a bad idea according to mcaj (not sure why)</li>
</ul></li>
<li>See existing MR: <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/953" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/953</a></li>
</ul>
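The suggested grain fallback behaves like a chained dict lookup; a minimal sketch, using a plain dict as a stand-in for Salt's grains mapping (note that <code>primary_webui_domain</code> is the proposed custom grain, not an existing one):

```python
# Stand-in for Salt's grains dictionary; 'primary_webui_domain' is the
# hypothetical custom grain proposed above, not an existing Salt grain.
grains = {"fqdn": "openqa.suse.de"}

# Prefer the custom grain if defined, otherwise fall back to the FQDN:
domain = grains.get("primary_webui_domain", grains.get("fqdn"))
```

In the Jinja-templated telegraf config the same expression would be used inside <code>{{ ... }}</code>, since Salt exposes grains with the same <code>get</code> semantics.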
openQA Infrastructure - action #133154 (Resolved): osd-deployment failed because of unreachable workers (https://progress.opensuse.org/issues/133154, 2023-07-21T08:58:16Z, osukup)
<p><a href="https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/736743" class="external">https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/736743</a></p>
<p>from logs:</p>
<pre><code>sapworker1.qe.nue2.suse.org:
Minion did not return. [Not connected]
openqaworker1.qe.nue2.suse.org:
Minion did not return. [Not connected]
sapworker2.qe.nue2.suse.org:
Minion did not return. [Not connected]
sapworker3.qe.nue2.suse.org:
Minion did not return. [Not connected]
+++ kill %1
</code></pre>
<p>Tried to ping/ssh the hosts and none of them is reachable.<br>
IPMI is also not responding, and these hosts have corresponding "host up" alerts in Grafana.</p>
openQA Infrastructure - action #133127 (Resolved): Frankencampus network broken + GitlabCi failed... (https://progress.opensuse.org/issues/133127, 2023-07-20T17:34:02Z, osukup)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Job <a href="https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipelines/735816" class="external">https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipelines/735816</a></p>
<p>In reality the job passed, but the upload of artifacts failed.</p>
<p>from logs:</p>
<pre><code>WARNING: Uploading artifacts as "archive" to coordinator... 502 Bad Gateway id=1702329 responseStatus=502 Bad Gateway status=502 token=64_L_XM4
WARNING: Retrying... context=artifacts-uploader error=invalid argument
WARNING: Uploading artifacts as "archive" to coordinator... 502 Bad Gateway id=1702329 responseStatus=502 Bad Gateway status=502 token=64_L_XM4
WARNING: Retrying... context=artifacts-uploader error=invalid argument
WARNING: Uploading artifacts as "archive" to coordinator... 502 Bad Gateway id=1702329 responseStatus=502 Bad Gateway status=502 token=64_L_XM4
FATAL: invalid argument
Cleaning up project directory and file based variables
00:01
ERROR: Job failed: exit code 1
</code></pre>

openQA Infrastructure - action #133097 (Resolved): cron on OSD (date; fetch_openqa_bugs /etc/open... (https://progress.opensuse.org/issues/133097, 2023-07-20T07:45:15Z, osukup)
<pre><code>Exception occured while fetching boo#1115169
Traceback (most recent call last):
File "/usr/bin/fetch_openqa_bugs", line 62, in <module>
raise e
File "/usr/bin/fetch_openqa_bugs", line 55, in <module>
client.openqa_request("PUT", "bugs/%s" % bug_dbid, data=issue.get_dict())
File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 298, in openqa_request
return self.do_request(req, retries=retries, wait=wait, parse=True)
File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 238, in do_request
raise err
File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 213, in do_request
request.method, resp.url, resp.status_code
openqa_client.exceptions.RequestError: ('PUT', 'https://openqa.opensuse.org/api/v1/bugs/1021', 403)
</code></pre>
<p>It could be caused by the broken IdP login service: <a href="https://suse.slack.com/archives/C029APBKLGK/p1689838423782549" class="external">https://suse.slack.com/archives/C029APBKLGK/p1689838423782549</a></p>
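A 403 is an authorization failure rather than a transient error, so retrying cannot fix it. A small hypothetical helper to separate the two cases (the name and logic are illustrative, not part of openqa_client):

```python
def is_transient(status: int) -> bool:
    """Return True for HTTP statuses worth retrying.

    Rate limiting (429) and server-side errors (5xx) are transient.
    403 Forbidden signals an authorization problem (e.g. a broken IdP
    login), which needs fixing on the auth side instead of a retry.
    """
    return status == 429 or 500 <= status < 600
```

With such a check, the cron script could fail fast on the 403 seen in the traceback instead of burning retries.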
openQA Infrastructure - action #132926 (Workable): OSD cron -> (fetch_openqa_bugs)> /tmp/fetch_op... (https://progress.opensuse.org/issues/132926, 2023-07-18T07:56:34Z, osukup)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>OSD cron -> (fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log failed:</p>
<p>from traceback:</p>
<pre><code>requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='api.github.com', port=443): Max retries exceeded with url: /repos/SUSE/ha-sap-terraform-deployments/issues/857 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f7439e43b38>, 'Connection to api.github.com timed out. (connect timeout=10)'))
</code></pre>
<p><code>fetch_openqa_bugs</code> failed when fetching issues from GitHub.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> It is understood why the error occurred</li>
<li><strong>AC2:</strong> The error does not persist</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Make sure you can log in, see <a href="https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/id/openqa-service_qe_suse_de.sls#L11" class="external">https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/id/openqa-service_qe_suse_de.sls#L11</a> or ask dheidler/mkittler to do that for you</li>
<li>Assuming "host unavailable", check how long the script retried
<ul>
<li>Retry more often?</li>
<li>Wait longer between attempts?</li>
</ul></li>
<li><a href="https://github.com/os-autoinst/openqa_bugfetcher" class="external">https://github.com/os-autoinst/openqa_bugfetcher</a></li>
</ul>
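The retry suggestions above can be sketched as a generic retry-with-backoff wrapper. This is a hypothetical helper for illustration, not the actual fetch_openqa_bugs code:

```python
import time


def retry(fn, attempts=5, initial_delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call fn(), retrying on exceptions with exponentially growing delays.

    Sketch of "retry more often / wait longer between attempts";
    the real script may handle retries differently.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts, propagate the last error
            sleep(delay)
            delay *= backoff
```

For a connect timeout like the one in the traceback, a few attempts with growing delays would ride out a short network blip; a long outage still fails, which AC1 (understanding why) would then need to explain.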
openQA Infrastructure - action #132860 (Resolved): openqa-piworker is unstable and needs regular ... (https://progress.opensuse.org/issues/132860, 2023-07-17T08:39:49Z, osukup)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1694765" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1694765</a></p>
<p>The only thing found in the logs:<br>
salt_ping.log:</p>
<pre><code>Currently the following minions are down:
8d7
< "openqa-piworker.qa.suse.de"
===================
</code></pre>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> we are able to process openQA Raspberry Pi bare-metal jobs consistently over some days</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li><p>Identify the cause of the regression</p>
<ul>
<li>likely something related to the hardware RTC</li>
<li>check whether it just works with Leap 15.5, because we wanted to upgrade anyway</li>
<li>it could be a recent kernel update, so try downgrading</li>
</ul></li>
<li><p>If it is really necessary and you have exhausted all other remote-controllable options, go to the office, unplug the RTC, reinstall the system on the assumption that it was borked or corrupted, or whatever</p></li>
<li><p>As Plan Y (if options A to X failed), buy a wifi &amp; bluetooth adapter for an IPMI-controllable server and use that instead to connect to the Raspberry Pi bare-metal test instances</p></li>
</ul>
<a name="Rollback-steps"></a>
<h2 >Rollback steps<a href="#Rollback-steps" class="wiki-anchor">¶</a></h2>
<ul>
<li>Add back salt key with <code>ssh osd "sudo salt-key -y -a openqa-piworker.qa.suse.de"</code></li>
</ul>
openQA Infrastructure - action #130132 (Resolved): jenkins.qa.suse.de seems down (https://progress.opensuse.org/issues/130132, 2023-05-31T11:17:23Z, osukup)
<p>Jenkins got stuck in emergency mode again ... @nsinger booted the system using Ctrl-D.</p>
openQA Infrastructure - action #125228 (Rejected): Salt pillars deployment failed on storage.oqa.... (https://progress.opensuse.org/issues/125228, 2023-03-01T12:27:23Z, osukup)
<pre><code> ID: /root/.ssh/id_ed25519.backup_osd
Function: file.managed
Result: False
Comment: Pillar id_ed25519.backup_osd does not exist
Started: 13:09:31.581660
Duration: 2.844 ms
Changes:
</code></pre>

openQA Infrastructure - action #125132 (Resolved): [alert] logrotate failed on OSD (https://progress.opensuse.org/issues/125132, 2023-02-28T09:54:59Z, osukup)
<p>from journalctl:</p>
<pre><code>Feb 15 00:00:07 openqa logrotate[12569]: logrotate does not support parallel execution on the same set of logfiles.
Feb 15 00:00:07 openqa logrotate[12569]: error: state file /var/lib/misc/logrotate.status is already locked
Feb 15 00:00:00 openqa systemd[1]: Starting Rotate log files...
</code></pre>

openQA Infrastructure - action #114908 (Resolved): [tools] https://stats.openqa-monitor.qa.suse.d... (https://progress.opensuse.org/issues/114908, 2022-08-02T12:17:54Z, osukup)
<p>The Grafana overview page isn't responding.</p>
openQA Infrastructure - action #109301 (Rejected): openqaworker14 + openqaworker15 sporadically g... (https://progress.opensuse.org/issues/109301, 2022-03-31T09:07:53Z, osukup)
<a name="OBSERVATION"></a>
<h2 >OBSERVATION<a href="#OBSERVATION" class="wiki-anchor">¶</a></h2>
<p>On reboot, these workers occasionally fail to boot correctly, ending in emergency mode:</p>
<pre><code>bře 08 14:34:24 openqaworker14 kernel: Loading iSCSI transport class v2.0-870.
bře 08 14:34:24 openqaworker14 systemd[1]: Finished Create Volatile Files and Directories.
bře 08 14:34:24 openqaworker14 systemd[1]: Starting Security Auditing Service...
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1557]: NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1557]: nvme0n1 259:0 0 3.5T 0 disk
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1557]: ├─nvme0n1p1 259:1 0 512M 0 part
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1557]: ├─nvme0n1p2 259:2 0 1T 0 part /
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1557]: └─nvme0n1p3 259:3 0 2.5T 0 part
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1557]: └─md127 9:127 0 2.5T 0 raid0
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1552]: Stopping current RAID "/dev/md/openqa"
bře 08 14:34:24 openqaworker14 systemd[1]: Finished Flush Journal to Persistent Storage.
bře 08 14:34:24 openqaworker14 kernel: i40iw_open: i40iw_open completed
bře 08 14:34:24 openqaworker14 systemd[1]: Created slice Slice /system/rdma-load-modules.
bře 08 14:34:24 openqaworker14 systemd[1]: Starting Load RDMA modules from /etc/rdma/modules/iwarp.conf...
bře 08 14:34:24 openqaworker14 systemd[1]: Starting Load RDMA modules from /etc/rdma/modules/rdma.conf...
bře 08 14:34:24 openqaworker14 kernel: ixgbe 0000:d8:00.1: Multiqueue Enabled: Rx Queue count = 63, Tx Queue count = 63 XDP Queue count = 0
bře 08 14:34:24 openqaworker14 systemd[1]: Finished Load RDMA modules from /etc/rdma/modules/iwarp.conf.
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1559]: mdadm: stopped /dev/md/openqa
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1552]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1574]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1574]: level=raid0 devices=1 ctime=Mon Mar 7 10:20:52 2022
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1574]: mdadm: unexpected failure opening /dev/md127
bře 08 14:34:24 openqaworker14 openqa-establish-nvme-setup[1552]: Unable to create RAID, mdadm returned with non-zero code
bře 08 14:34:24 openqaworker14 kernel: i40iw_open: i40iw_open completed
bře 08 14:34:24 openqaworker14 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=1/FAILURE
bře 08 14:34:24 openqaworker14 systemd[1]: openqa_nvme_format.service: Failed with result 'exit-code'.
bře 08 14:34:24 openqaworker14 systemd[1]: Failed to start Setup NVMe before mounting it.
bře 08 14:34:24 openqaworker14 systemd[1]: Dependency failed for /var/lib/openqa.
bře 08 14:34:24 openqaworker14 systemd[1]: Dependency failed for openQA Worker #1.
bře 08 14:34:24 openqaworker14 systemd[1]: openqa-worker-auto-restart@1.service: Job openqa-worker-auto-restart@1.service/start failed with result 'dependency'.
bře 08 14:34:24 openqaworker14 systemd[1]: Dependency failed for var-lib-openqa-share.automount.
bře 08 14:34:24 openqaworker14 systemd[1]: var-lib-openqa-share.automount: Job var-lib-openqa-share.automount/start failed with result 'dependency'.
bře 08 14:34:24 openqaworker14 systemd[1]: Dependency failed for openQA Worker #3.
bře 08 14:34:24 openqaworker14 systemd[1]: openqa-worker-auto-restart@3.service: Job openqa-worker-auto-restart@3.service/start failed with result 'dependency'.
bře 08 14:34:24 openqaworker14 systemd[1]: Dependency failed for Prepare NVMe after mounting it.
bře 08 14:34:24 openqaworker14 systemd[1]: openqa_nvme_prepare.service: Job openqa_nvme_prepare.service/start failed with result 'dependency'.
bře 08 14:34:24 openqaworker14 systemd[1]: Dependency failed for Local File Systems.
bře 08 14:34:24 openqaworker14 systemd[1]: local-fs.target: Job local-fs.target/start failed with result 'dependency'.
bře 08 14:34:24 openqaworker14 systemd[1]: local-fs.target: Triggering OnFailure= dependencies.
bře 08 14:34:24 openqaworker14 systemd[1]: Dependency failed for openQA Worker #2.
bře 08 14:34:24 openqaworker14 systemd[1]: openqa-worker-auto-restart@2.service: Job openqa-worker-auto-restart@2.service/start failed with result 'dependency'.
bře 08 14:34:24 openqaworker14 systemd[1]: Dependency failed for openQA Worker #4.
bře 08 14:34:24 openqaworker14 systemd[1]: openqa-worker-auto-restart@4.service: Job openqa-worker-auto-restart@4.service/start failed with result 'dependency'.
bře 08 14:34:24 openqaworker14 systemd[1]: var-lib-openqa.mount: Job var-lib-openqa.mount/start failed with result 'dependency'.
</code></pre>
<p>The cause of the problem is probably the difference in the hardware configuration of these workers. Our standard workers have 1x HDD with the OS and 1x NVMe SSD with /dev/md/openqa. These workers have only one NVMe SSD,<br>
configured as:</p>
<pre><code>nvme0n1
├─nvme0n1p1 vfat FAT32 9AED-277B 506M 1% /boot/efi
├─nvme0n1p2 btrfs 5a405f4e-bd0c-46cb-a5ee-a0e976968be1 1016,5G 1% /
└─nvme0n1p3 linux_raid_member 1.2 openqaworker14:openqa 03972fdb-874d-cbec-4cb8-bca5412d90a2
└─md127 ext2 1.0 4c30279b-d757-4a97-b636-539b18bc9e22 2,3T 0% /var/lib/openqa
</code></pre>

openQA Infrastructure - action #106594 (Resolved): [tools] openqaworker-arm-3 periodically fails ... (https://progress.opensuse.org/issues/106594, 2022-02-10T11:36:16Z, osukup)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>from journalctl -xe -u os-autoinst-openvswitch</p>
<pre><code>úno 09 21:56:21 openqaworker-arm-3 os-autoinst-openvswitch[2924]: Waiting for IP on bridge 'br1', 300s left ...
úno 09 21:56:22 openqaworker-arm-3 os-autoinst-openvswitch[2924]: Waiting for IP on bridge 'br1', 299s left ...
....
úno 09 22:01:20 openqaworker-arm-3 os-autoinst-openvswitch[2924]: Waiting for IP on bridge 'br1', 3s left ...
úno 09 22:01:21 openqaworker-arm-3 os-autoinst-openvswitch[2924]: Waiting for IP on bridge 'br1', 2s left ...
úno 09 22:01:22 openqaworker-arm-3 os-autoinst-openvswitch[2924]: can't parse bridge local port IP at /usr/lib/os-autoinst/os-autoinst-openvswitch line 43.
úno 09 22:01:22 openqaworker-arm-3 os-autoinst-openvswitch[2924]: Waiting for IP on bridge 'br1', 1s left ...
úno 09 22:01:22 openqaworker-arm-3 systemd[1]: os-autoinst-openvswitch.service: Main process exited, code=exited, status=255/EXCEPTION
</code></pre>
<p>The default timeout is 60 seconds; on openqaworker-arm-3 it is now 5 minutes, but that still isn't enough after a system reboot.</p>
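The waiting loop visible in the log can be sketched as a generic poll-with-deadline. Here <code>get_ip</code> is a stand-in for however os-autoinst-openvswitch actually reads the bridge-local port IP, which is not modelled:

```python
def wait_for_ip(get_ip, timeout_s=300, poll_s=1, sleep=lambda s: None):
    """Poll get_ip() until it returns an address or the deadline passes.

    Sketch of the "Waiting for IP on bridge 'br1', Ns left" loop from the
    log above; get_ip is a hypothetical probe callback.
    """
    for remaining in range(timeout_s, 0, -poll_s):
        ip = get_ip()
        if ip:
            return ip
        # In the real service this is where "Waiting for IP ..." is logged.
        sleep(poll_s)
    raise TimeoutError("no IP on bridge within %ds" % timeout_s)
```

Whatever the timeout, the loop still fails if the bridge never gets an address after reboot, which matches the observation that raising it to 5 minutes did not help.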
<a name="Rollback-steps"></a>
<h2 >Rollback steps<a href="#Rollback-steps" class="wiki-anchor">¶</a></h2>
<ul>
<li>Unpause the alert "Failed systemd services alert (except openqa.suse.de)"</li>
</ul>
openQA Infrastructure - action #106365 (Resolved): Improve security for OSD worker credentials br... (https://progress.opensuse.org/issues/106365, 2022-02-09T10:25:15Z, osukup)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p><a href="https://progress.opensuse.org/issues/105405" class="external">https://progress.opensuse.org/issues/105405</a>: the changed visibility of salt-pillars-openqa broke the <code>deploy</code> stage of CI</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1</strong>: Working salt-states+salt-pillars pipelines in gitlab</li>
<li><strong>AC2:</strong> salt-pillars repo stays non-public</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Try out deploy tokens on OSD to fetch the git repo</li>
</ul>
openQA Infrastructure - action #106035 (Rejected): [qe-tools] dehydrated service fails on osd (https://progress.opensuse.org/issues/106035, 2022-02-07T08:09:49Z, osukup)
<p>OSD has systemd in a degraded state because the dehydrated system service ends up in a failed state:</p>
<pre><code>dehydrated.service - Certificate Update Runner for Dehydrated
Loaded: loaded (/usr/lib/systemd/system/dehydrated.service; static)
Active: failed (Result: exit-code) since Mon 2022-02-07 09:03:35 CET; 4min 58s ago
TriggeredBy: ● dehydrated.timer
Process: 26947 ExecStart=/usr/bin/dehydrated --cron (code=exited, status=1/FAILURE)
Main PID: 26947 (code=exited, status=1/FAILURE)
Feb 07 09:03:34 openqa systemd[1]: Starting Certificate Update Runner for Dehydrated...
Feb 07 09:03:34 openqa dehydrated[26947]: # INFO: Using main config file /etc/dehydrated/config
Feb 07 09:03:34 openqa dehydrated[26947]: # INFO: Using additional config file /etc/dehydrated/config.d/suse-ca.sh
Feb 07 09:03:34 openqa dehydrated[26947]: # INFO: Running /usr/bin/dehydrated as dehydrated/dehydrated
Feb 07 09:03:34 openqa sudo[26947]: root : PWD=/ ; USER=dehydrated ; GROUP=dehydrated ; COMMAND=/usr/bin/dehydrated --cron
Feb 07 09:03:35 openqa dehydrated[27267]: {}
Feb 07 09:03:35 openqa systemd[1]: dehydrated.service: Main process exited, code=exited, status=1/FAILURE
Feb 07 09:03:35 openqa systemd[1]: dehydrated.service: Failed with result 'exit-code'.
Feb 07 09:03:35 openqa systemd[1]: Failed to start Certificate Update Runner for Dehydrated.
</code></pre>

openQA Infrastructure - action #96719 (Resolved): recover imagetester with broken filesystem/hard... (https://progress.opensuse.org/issues/96719, 2021-08-10T14:36:00Z, osukup)
<p>During work on <a href="https://progress.opensuse.org/issues/96311" class="external">https://progress.opensuse.org/issues/96311</a>, we found that imagetester hadn't been updated for 2 months.</p>
<p>Investigate why the automatic transactional update wasn't working, and update imagetester.</p>
<p>Now blocked by <a href="https://infra.nue.suse.com/SelfService/Display.html?id=194271" class="external">https://infra.nue.suse.com/SelfService/Display.html?id=194271</a>, because the host didn't survive a reboot and has no remote management interface.</p>