action #162296
coordination #157969 (closed): [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6
openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S
Description
Observation
Observed on w31+w32, which upgraded themselves to Leap 15.6 and then crashed multiple times, 10-20 minutes after booting into kernel 6.4.
Acceptance criteria
- AC1: ssh o3 'hosts="openqaworker21 openqaworker22 openqaworker23 openqaworker24 openqaworker25 openqaworker26 openqaworker27 openqaworker28 openqaworker-arm21 openqaworker-arm22 qa-power8-3"; for i in $hosts; do echo "### $i" && ssh root@$i "zypper ll" ; done' lists no firewall package locks anymore
Suggestions
- Temporarily upgrade selected machines to Leap 15.6 while keeping the old kernel, or vice versa (stay on Leap 15.5 and only install kernel 6.4), and try to get the system to work in a stable manner (see the sketch after this list)
- Optional: Look into the crash files on w31 in /root/crash-2024-06-14/
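A minimal sketch of the first suggestion, keeping the old kernel across the dist-upgrade; the lock pattern is an assumption, and it presumes the repositories already use the $releasever variable:
# lock the currently running 15.5 kernel so the dist-upgrade keeps it
zypper addlock 'kernel-default*'
# upgrade the distribution to Leap 15.6 around the locked kernel
zypper --releasever=15.6 dup
# once the system has proven stable, lift the lock again
zypper removelock 'kernel-default*'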
Updated by okurz 9 months ago
- Copied from action #162293: SMART errors on bootup of worker31, worker32 and worker34 size:M added
Updated by openqa_review 8 months ago
- Due date set to 2024-07-23
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 8 months ago
- Related to action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M added
Updated by dheidler 8 months ago
Testing on worker36.
Opened https://bugzilla.suse.com/show_bug.cgi?id=1227616
Updated by okurz 6 months ago
- Related to action #157972: Upgrade o3 workers to openSUSE Leap 15.6 size:S added
Updated by okurz 6 months ago
- Related to action #163469: Upgrade a single o3 worker to openSUSE Leap 15.6 added
Updated by okurz 6 months ago
- Related to action #160095: Upgraded Leap 15.6 workers able to run s390x tests after #162683 size:M added
Updated by okurz 6 months ago
- Related to deleted (action #160095: Upgraded Leap 15.6 workers able to run s390x tests after #162683 size:M)
Updated by okurz 3 months ago
- Tags deleted (infra)
- Status changed from Blocked to In Progress
It seems nobody else could reproduce the problem yet. dheidler found a way to reproduce and will update https://bugzilla.suse.com/show_bug.cgi?id=1227616 with better steps to reproduce
Updated by openqa_review 3 months ago
- Due date set to 2024-12-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by dheidler 3 months ago
With the current software, at least on worker36, the kernel issues seem to be gone.
On boot it still takes some minutes until firewalld is done dealing with ~160 tap devices but the system eventually comes up.
I will now try openqaworker17.qa.suse.cz as well which is also running 15.6 already with a locked firewall package.
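A quick, generic way (not from the ticket) to quantify that firewalld startup delay and the tap device count on a worker:
# how much firewalld contributed to the last boot
systemd-analyze blame | grep firewalld
# how many tap devices it has to process
ip -o link show | grep -c ': tap'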
Updated by dheidler 3 months ago
- Related to action #157975: Upgrade osd workers to openSUSE Leap 15.6 size:S added
Updated by okurz 2 months ago
- Description updated (diff)
- Priority changed from Normal to High
It seems either you or ybonatakis removed the package lock on all other workers as well, causing issues on bootup of the machines: os-autoinst-openvswitch.service fails and triggers alerts, as observed on https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1
From w29: journalctl -u os-autoinst-openvswitch.service
-- Boot 1de0244017dc4cc3aa72fb4c6e93596a --
Dec 29 03:35:15 worker29 systemd[1]: Starting os-autoinst openvswitch helper...
Dec 29 03:35:15 worker29 systemd[1]: Started os-autoinst openvswitch helper.
Dec 29 03:35:16 worker29 os-autoinst-openvswitch[3741]: Waiting for IP on bridge 'br1', 1200s left ...
…
Dec 29 03:55:19 worker29 os-autoinst-openvswitch[3741]: Waiting for IP on bridge 'br1', 3s left ...
Dec 29 03:55:20 worker29 os-autoinst-openvswitch[3741]: Waiting for IP on bridge 'br1', 2s left ...
Dec 29 03:55:21 worker29 os-autoinst-openvswitch[3741]: can't parse bridge local port IP at /usr/lib/os-autoinst/script/os-autoinst-openvswitch li>
Dec 29 03:55:21 worker29 os-autoinst-openvswitch[3741]: Waiting for IP on bridge 'br1', 1s left ...
Dec 29 03:55:21 worker29 systemd[1]: os-autoinst-openvswitch.service: Main process exited, code=exited, status=255/EXCEPTION
Dec 29 03:55:21 worker29 systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
Dec 29 03:55:21 worker29 systemd[1]: os-autoinst-openvswitch.service: Consumed 6.047s CPU time.
Dec 29 04:57:38 worker29 systemd[1]: Starting os-autoinst openvswitch helper...
Dec 29 04:57:38 worker29 systemd[1]: Started os-autoinst openvswitch helper.
so apparently after 1.5h(!) the service ends up fine, probably being auto-restarted after other services in the network stack. Added a corresponding rollback step to the ticket.
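For manual recovery the same condition the helper waits for can be scripted; a sketch, not the actual rollback step added to the ticket:
# once br1 finally has an IPv4 address, restart the failed helper
until ip -4 addr show dev br1 | grep -q 'inet '; do sleep 5; done
systemctl restart os-autoinst-openvswitch.service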
Updated by dheidler 2 months ago · Edited
- Status changed from Workable to Feedback
Replace firewalld on osd workers with 20 lines of bash / systemd unit code:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1335
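The MR is internal; a minimal sketch of the general shape, assuming the stock nftables.service loading /etc/nftables.conf and the usual openQA multi-machine network behind br1 (path, subnet and interface names are assumptions):
# /etc/nftables.conf - static ruleset, no per-tap-device processing at boot
flush ruleset
table ip nat {
  chain postrouting {
    type nat hook postrouting priority srcnat; policy accept;
    # masquerade the multi-machine test network on its way out
    ip saddr 10.0.0.0/15 oifname != "br1" masquerade
  }
}
Unlike firewalld, such a static ruleset is loaded once at boot and does not need to enumerate the ~160 tap devices.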
Updated by nicksinger 2 months ago
- Description updated (diff)
As worker39 failed OSD deployment the second day in a row (https://gitlab.suse.de/openqa/osd-deployment/-/jobs/3631485) I now removed it from salt following https://progress.opensuse.org/projects/openqav3/wiki#Take-machines-out-of-salt-controlled-production - please make sure to revert before closing this ticket
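The wiki procedure essentially boils down to removing the minion key on OSD (FQDN assumed; the wiki has the authoritative steps):
# on OSD: stop salt from managing worker39 until it deploys cleanly again
sudo salt-key -y -d worker39.oqa.prg2.suse.org
# revert later by re-accepting the key
sudo salt-key -y -a worker39.oqa.prg2.suse.org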
Updated by dheidler about 2 months ago
I also applied it manually on o3 worker openqaworker21 and ran a test:
Updated by okurz about 2 months ago
discussed in infra daily:
- dheidler+nicksinger to decide if we should apply https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1335 for now or change to a different approach as workaround as long as the product issue persists
- ask in product issue what to try next
- try with one or multiple machines how long the reboot takes, e.g. using https://github.com/os-autoinst/scripts/blob/master/reboot-stability-check
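A one-off variant of what reboot-stability-check automates, to time a single reboot (hostname is a placeholder):
host=worker36.oqa.prg2.suse.org
start=$(date +%s)
ssh root@"$host" reboot || true
sleep 30  # give the host time to actually go down
until ssh -o ConnectTimeout=5 root@"$host" true 2>/dev/null; do sleep 10; done
echo "back after $(( $(date +%s) - start ))s"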
Updated by livdywan about 2 months ago
@dheidler Can you confirm where we are at with regard to verifying this on osd and also on o3?
Updated by dheidler about 2 months ago
I manually tested the systemd service on osd and o3 workers as listed above.
Also I just tested the salt states on openqaworker39 and fixed some remaining issues.
Currently waiting for the MR to be merged to be able to see if the salt states apply cleanly.
Updated by nicksinger about 2 months ago
- Status changed from Feedback to Workable
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1335 merged now. I applied several smaller fix-ups afterwards:
- https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1344
- https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1345
- https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1346
- https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1348
CI deployment may not have run on all hosts yet until https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1349 is merged, but salt '*' state.apply… on non-wireguard hosts should work.
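For example a dry run first, to preview what would change without applying anything (generic salt usage, not taken from the MRs):
salt --state-output=changes '*' state.apply test=True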
Updated by dheidler about 2 months ago
- Status changed from Workable to In Progress
Updated by gpuliti about 2 months ago
- Blocks action #175836: [alert][FIRING:1] (Broken workers alert Salt dZ025mf4z) due to osd reboot, many broken workers size:S added
Updated by livdywan about 2 months ago
- Tags set to infra
Arguably it's infra (and as before we let the assignee choose if it is unclear).
Updated by okurz about 2 months ago
- Copied to action #175956: openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 - OSD added
Updated by dheidler about 2 months ago
livdywan wrote in #note-46:
> okurz wrote in #note-45:
> > I now pulled out the changes regarding OSD into a separate ticket #175956
> Does the split help us remain within the due date? What's missing right now?
Not at all - this is just confusing as I was mainly working on the OSD machines as part of this (#162296) ticket.
Updated by dheidler about 2 months ago
- Status changed from In Progress to Resolved
ariel:~ # salt-ssh '*' cmd.run 'systemctl is-active nftables && systemctl is-enabled nftables'
openqaworker21:
active
enabled
openqaworker28:
active
enabled
openqaworker25:
active
enabled
openqaworker27:
active
enabled
openqaworker20:
active
enabled
openqaworker22:
active
enabled
openqaworker24:
active
enabled
openqaworker26:
active
enabled
openqaworker23:
active
enabled
openqaworker-arm21:
active
enabled
openqaworker-arm22:
active
enabled
ariel# salt-ssh '*' cmd.run 'zypper ll'
openqaworker26:
There are no package locks defined.
openqaworker20:
There are no package locks defined.
openqaworker21:
There are no package locks defined.
openqaworker22:
There are no package locks defined.
openqaworker28:
There are no package locks defined.
openqaworker24:
There are no package locks defined.
openqaworker25:
There are no package locks defined.
openqaworker27:
There are no package locks defined.
openqaworker23:
System management is locked by the application with pid 111491 (zypper).
Close this application before trying again.
openqaworker-arm21:
There are no package locks defined.
openqaworker-arm22:
# | Name | Type | Repository | Comment
--+------------------------+---------+------------+--------
1 | libply* | package | (any) |
2 | os-autoinst-devel | package | (any) |
3 | xdg-desktop-portal-gtk | package | (any) |
Updated by okurz about 2 months ago
- Status changed from Resolved to Feedback
What's with the two o3 machines in NUE2?
Updated by dheidler about 2 months ago
- Status changed from Feedback to Resolved
No idea. What are the hostnames?
Also they are not mentioned in AC1.
Updated by okurz about 2 months ago
- Status changed from Resolved to Feedback
dheidler wrote in #note-50:
> No idea. What are the hostnames?
Please see https://progress.opensuse.org/projects/openqav3/wiki/#Manual-command-execution-on-o3-workers, it's kerosene and aarch64-o3.
> Also they are not mentioned in AC1.
Yes, that was apparently an oversight. But it should be clear that we want a solution for all o3 hosts.
Updated by dheidler about 2 months ago
Moved kerosene.qe.nue2.suse.org and aarch64-o3.qe.nue2.suse.org to nftables as well.
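The per-host migration presumably amounts to something like this sketch (the exact steps aren't recorded in the ticket):
ssh root@kerosene.qe.nue2.suse.org '
  zypper removelock firewalld || true  # drop the old firewall package lock, if any
  zypper -n in nftables
  systemctl disable --now firewalld
  systemctl enable --now nftables
'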
Updated by dheidler about 2 months ago
- Status changed from Feedback to Resolved
Updated by livdywan about 2 months ago
- Status changed from Resolved to Feedback
dheidler wrote in #note-32:
What about this ticket? If we have no ticket blocking on it we won't be keeping track of it. And I don't see it being referenced in salt 🤔
Please ensure at least one of the two is true.
Updated by dheidler about 2 months ago
- Status changed from Feedback to Resolved
This ticket is the one about o3 - so there is no (official) salt.
As the new default is nftables, there is no issue anymore related to our work.
There is only an issue with firewalld, which we don't use anymore.
If you think we should track this, it would be better to create a new ticket about evaluating a switch back to firewalld, which could then be blocked on.
As the ACs of this ticket are fulfilled, there is no reason to block on anything.