action #162296

closed

coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6

openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S

Added by okurz 9 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-06-14
Due date:
% Done:

0%

Estimated time:

Description

Observation

Observed on w31+w32, which upgraded themselves to Leap 15.6 and then crashed multiple times, each time 10-20m after booting into kernel 6.4.

Acceptance criteria

  • AC1: ssh o3 'hosts="openqaworker21 openqaworker22 openqaworker23 openqaworker24 openqaworker25 openqaworker26 openqaworker27 openqaworker28 openqaworker-arm21 openqaworker-arm22 qa-power8-3"; for i in $hosts; do echo "### $i" && ssh root@$i "zypper ll" ; done' lists no firewall package locks anymore

Suggestions

  • Temporarily upgrade selected machines to Leap 15.6 with old kernel or vice versa, just kernel 6.4, try to get the system to work in a stable manner
  • Optional: Look into the crash files on w31 in /root/crash-2024-06-14/

Related issues: 7 (0 open, 7 closed)

  • Related to openQA Infrastructure (public) - action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M (Resolved, okurz, 2023-11-04)
  • Related to openQA Project (public) - action #157972: Upgrade o3 workers to openSUSE Leap 15.6 size:S (Resolved, gpathak)
  • Related to openQA Infrastructure (public) - action #163469: Upgrade a single o3 worker to openSUSE Leap 15.6 (Resolved, gpathak, 2024-07-08)
  • Related to openQA Infrastructure (public) - action #157975: Upgrade osd workers to openSUSE Leap 15.6 size:S (Resolved, ybonatakis)
  • Blocks openQA Infrastructure (public) - action #175836: [alert][FIRING:1] (Broken workers alert Salt dZ025mf4z) due to osd reboot, many broken workers size:S (Resolved, robert.richardson, 2025-01-20)
  • Copied from openQA Infrastructure (public) - action #162293: SMART errors on bootup of worker31, worker32 and worker34 size:M (Resolved, nicksinger, 2024-06-14)
  • Copied to openQA Infrastructure (public) - action #175956: openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 - OSD (Resolved, dheidler, 2024-06-14)
Actions #1

Updated by okurz 9 months ago

  • Copied from action #162293: SMART errors on bootup of worker31, worker32 and worker34 size:M added
Actions #2

Updated by okurz 9 months ago

  • Description updated (diff)
Actions #3

Updated by livdywan 9 months ago

  • Subject changed from openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 to openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz 9 months ago

  • Priority changed from High to Normal
Actions #5

Updated by dheidler 8 months ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #6

Updated by openqa_review 8 months ago

  • Due date set to 2024-07-23

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by okurz 8 months ago

Originally, all PRG2 x86_64 workers upgraded themselves automatically but inconsistently to Leap 15.6. I called snapper rollback on each, rebooted, and then ensured that openQA jobs were executed properly afterwards.

Actions #8

Updated by okurz 8 months ago

Unfortunately dmesg in /root/crash-*/crash/ is all empty. So I guess the next step should be to select a worker, upgrade it and check. I suggest using w36, which is currently offline.

Actions #9

Updated by okurz 8 months ago

  • Related to action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M added
Actions #10

Updated by dheidler 8 months ago

Actions #11

Updated by dheidler 8 months ago

  • Status changed from In Progress to Blocked

As we would have to use a 15.6 with both firewalld and kernel-default from 15.5,
I don't see much value in moving to 15.6 for now.

Let's block this ticket on the bugzilla issue.

Actions #12

Updated by okurz 8 months ago

  • Due date deleted (2024-07-23)
Actions #13

Updated by livdywan 7 months ago

Actions #14

Updated by livdywan 6 months ago

livdywan wrote in #note-13:

Opened https://bugzilla.suse.com/show_bug.cgi?id=1227616

No response so far

Still no update (no pun intended)

Actions #15

Updated by okurz 6 months ago

  • Related to action #157972: Upgrade o3 workers to openSUSE Leap 15.6 size:S added
Actions #16

Updated by okurz 6 months ago

  • Related to action #163469: Upgrade a single o3 worker to openSUSE Leap 15.6 added
Actions #17

Updated by okurz 6 months ago

  • Related to action #160095: Upgraded Leap 15.6 workers able to run s390x tests after #162683 size:M added
Actions #18

Updated by okurz 6 months ago

  • Related to deleted (action #160095: Upgraded Leap 15.6 workers able to run s390x tests after #162683 size:M)
Actions #19

Updated by okurz 6 months ago

  • Description updated (diff)
Actions #20

Updated by livdywan 5 months ago

Opened https://bugzilla.suse.com/show_bug.cgi?id=1227616

I pinged Denis in Slack as there's been no response for a while

Actions #21

Updated by okurz 4 months ago

livdywan wrote in #note-20:

Opened https://bugzilla.suse.com/show_bug.cgi?id=1227616

I pinged Denis in Slack as there's been no response for a while

As you didn't link a Slack conversation I have to ask you: Do you remember if there was a response?

Actions #22

Updated by livdywan 3 months ago

As you didn't link a Slack conversation I have to ask you: Do you remember if there was a response?

Yes. Denis was going to check it but I guess that didn't happen. Pinging again.

Actions #23

Updated by okurz 3 months ago

  • Tags deleted (infra)
  • Status changed from Blocked to In Progress

It seems nobody else could reproduce the problem yet. dheidler found a way to reproduce and will update https://bugzilla.suse.com/show_bug.cgi?id=1227616 with better steps to reproduce

Actions #24

Updated by openqa_review 3 months ago

  • Due date set to 2024-12-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #25

Updated by dheidler 3 months ago

With the current software, at least on worker36, the kernel issues seem to be gone.
On boot it still takes some minutes until firewalld is done dealing with ~160 tap devices, but the system eventually comes up.

I will now also try openqaworker17.qa.suse.cz, which is already running 15.6 with a locked firewall package.

Actions #27

Updated by dheidler 3 months ago · Edited

No issues seen in the kernel log of openqaworker17.qa.suse.cz so far.

Removed lock, installed updates and rebooted:

  • openqaworker16.qa.suse.cz
  • openqaworker18.qa.suse.cz
  • openqaworker14.qa.suse.cz
  • worker39.oqa.prg2.suse.org
Actions #28

Updated by dheidler 3 months ago

  • Related to action #157975: Upgrade osd workers to openSUSE Leap 15.6 size:S added
Actions #29

Updated by okurz 3 months ago

  • Description updated (diff)
  • Due date changed from 2024-12-26 to 2025-01-23
  • Status changed from In Progress to Workable
Actions #30

Updated by okurz 2 months ago

  • Description updated (diff)
  • Priority changed from Normal to High

It seems either you or ybonatakis also removed the package lock on all other workers, causing issues on bootup of the machines: os-autoinst-openvswitch.service fails and triggers alerts, as observed on https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1

From w29, journalctl -u os-autoinst-openvswitch:

-- Boot 1de0244017dc4cc3aa72fb4c6e93596a --
Dec 29 03:35:15 worker29 systemd[1]: Starting os-autoinst openvswitch helper...
Dec 29 03:35:15 worker29 systemd[1]: Started os-autoinst openvswitch helper.
Dec 29 03:35:16 worker29 os-autoinst-openvswitch[3741]: Waiting for IP on bridge 'br1', 1200s left ...
…
Dec 29 03:55:19 worker29 os-autoinst-openvswitch[3741]: Waiting for IP on bridge 'br1', 3s left ...
Dec 29 03:55:20 worker29 os-autoinst-openvswitch[3741]: Waiting for IP on bridge 'br1', 2s left ...
Dec 29 03:55:21 worker29 os-autoinst-openvswitch[3741]: can't parse bridge local port IP at /usr/lib/os-autoinst/script/os-autoinst-openvswitch li>
Dec 29 03:55:21 worker29 os-autoinst-openvswitch[3741]: Waiting for IP on bridge 'br1', 1s left ...
Dec 29 03:55:21 worker29 systemd[1]: os-autoinst-openvswitch.service: Main process exited, code=exited, status=255/EXCEPTION
Dec 29 03:55:21 worker29 systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
Dec 29 03:55:21 worker29 systemd[1]: os-autoinst-openvswitch.service: Consumed 6.047s CPU time.
Dec 29 04:57:38 worker29 systemd[1]: Starting os-autoinst openvswitch helper...
Dec 29 04:57:38 worker29 systemd[1]: Started os-autoinst openvswitch helper.

So apparently after 1.5h(!) the service ends up fine, probably auto-restarted after other services in the network stack came up. Added a corresponding rollback step to the ticket.
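The failure above indicates that br1 never presented a parseable IPv4 address within the 1200s timeout. Purely as an illustration (this helper is made up and not part of os-autoinst-openvswitch), the address the service waits for can be extracted from `ip -4 -o addr show dev br1` output like this:

```shell
# Sketch: print the first IPv4 address from `ip -4 -o addr show dev <dev>`
# output read on stdin, so it can be exercised without a real bridge device.
# Hypothetical helper, not part of os-autoinst-openvswitch.
bridge_ip() {
  # field 4 of the one-line output is "ADDR/PREFIX"; strip the prefix length
  awk '{ sub(/\/.*/, "", $4); print $4; exit }'
}

# On a live worker one would pipe the real command into it:
#   ip -4 -o addr show dev br1 | bridge_ip
```

If this prints nothing, the bridge has no IPv4 address yet, which matches the "can't parse bridge local port IP" error the service logs on timeout.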

Actions #31

Updated by dheidler 2 months ago

Hm - not sure what we could do about that.

Actions #33

Updated by dheidler 2 months ago · Edited

  • Status changed from Workable to Feedback

Replace firewalld on osd workers with 20 lines of bash / systemd unit code:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1335
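The MR is the authoritative source for those 20 lines; purely to illustrate the shape of such a replacement, a minimal oneshot unit loading a static ruleset might look like the following (unit name, paths and structure are assumptions for the sketch, not the MR's actual content):

```ini
# Sketch of /etc/systemd/system/nftables.service (assumed, not the MR content)
[Unit]
Description=Load static nftables ruleset
Wants=network-pre.target
Before=network-pre.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/nft -f /etc/nftables.conf
ExecReload=/usr/sbin/nft -f /etc/nftables.conf
ExecStop=/usr/sbin/nft flush ruleset

[Install]
WantedBy=multi-user.target
```

A static ruleset loaded once at boot avoids the per-interface work firewalld performs for each of the ~160 tap devices.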

Actions #34

Updated by nicksinger 2 months ago

  • Description updated (diff)

As worker39 failed the OSD deployment for the second day in a row (https://gitlab.suse.de/openqa/osd-deployment/-/jobs/3631485), I have now removed it from salt following https://progress.opensuse.org/projects/openqav3/wiki#Take-machines-out-of-salt-controlled-production - please make sure to revert this before closing the ticket

Actions #35

Updated by dheidler about 2 months ago

I also applied it manually on o3 worker openqaworker21 and ran a test:

https://openqa.opensuse.org/tests/4762456

Actions #36

Updated by okurz about 2 months ago

discussed in infra daily:

  1. dheidler+nicksinger to decide if we should apply https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1335 for now or change to a different approach as workaround as long as the product issue persists
  2. ask in product issue what to try next
  3. try with one or multiple machines how long the reboot takes, e.g. using https://github.com/os-autoinst/scripts/blob/master/reboot-stability-check
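The referenced reboot-stability-check script is the real tool for point 3; purely as a sketch of the measurement idea, a small helper that times how long it takes until a probe command succeeds again could look like this (function name and defaults are made up):

```shell
#!/bin/sh
# Sketch: measure seconds until a probe command succeeds, e.g. how long a
# worker needs after reboot until ssh answers again. Hypothetical helper;
# the real logic lives in os-autoinst/scripts/reboot-stability-check.
seconds_until_up() {
  cmd=$1
  timeout=${2:-600}
  start=$(date +%s)
  while ! $cmd >/dev/null 2>&1; do
    # give up once the timeout is exceeded
    [ $(( $(date +%s) - start )) -ge "$timeout" ] && return 1
    sleep 1
  done
  echo $(( $(date +%s) - start ))
}

# Example usage after triggering a reboot (hostname is an assumption):
#   seconds_until_up "ssh root@worker36 true" 1200
```

Running this across several reboots would quantify whether the minutes-long firewalld startup observed earlier is gone with nftables.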
Actions #37

Updated by livdywan about 2 months ago

@dheidler Can you confirm where we are at with regard to verifying this on osd and also on o3?

Actions #38

Updated by dheidler about 2 months ago

I manually tested the systemd service on osd and o3 workers as listed above.
I also just tested the salt states on openqaworker39 and fixed some remaining issues.
Currently waiting for the MR to be merged to be able to see if the salt states apply cleanly.

Actions #41

Updated by dheidler about 2 months ago

  • Status changed from Workable to In Progress
Actions #42

Updated by gpuliti about 2 months ago

  • Blocks action #175836: [alert][FIRING:1] (Broken workers alert Salt dZ025mf4z) due to osd reboot, many broken workers size:S added
Actions #43

Updated by livdywan about 2 months ago

  • Tags set to infra

Arguably it's infra (and as before we let the assignee choose if it is unclear).

Actions #44

Updated by okurz about 2 months ago

  • Copied to action #175956: openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 - OSD added
Actions #45

Updated by okurz about 2 months ago

  • Tags deleted (infra)
  • Description updated (diff)

I now pulled out the changes regarding OSD into a separate ticket #175956

Actions #46

Updated by livdywan about 2 months ago

okurz wrote in #note-45:

I now pulled out the changes regarding OSD into a separate ticket #175956

Does the split help us remain within the due date? What's missing right now?

Actions #47

Updated by dheidler about 2 months ago

livdywan wrote in #note-46:

okurz wrote in #note-45:

I now pulled out the changes regarding OSD into a separate ticket #175956

Does the split help us remain within the due date? What's missing right now?

Not at all - this is just confusing as I was mainly working on the OSD machines as part of this (#162296) ticket.

Actions #48

Updated by dheidler about 2 months ago

  • Status changed from In Progress to Resolved

ariel:~ # salt-ssh '*' cmd.run 'systemctl is-active nftables && systemctl is-enabled nftables'
openqaworker21:
    active
    enabled
openqaworker28:
    active
    enabled
openqaworker25:
    active
    enabled
openqaworker27:
    active
    enabled
openqaworker20:
    active
    enabled
openqaworker22:
    active
    enabled
openqaworker24:
    active
    enabled
openqaworker26:
    active
    enabled
openqaworker23:
    active
    enabled
openqaworker-arm21:
    active
    enabled
openqaworker-arm22:
    active
    enabled
ariel# salt-ssh '*' cmd.run 'zypper ll'
openqaworker26:

    There are no package locks defined.
openqaworker20:

    There are no package locks defined.
openqaworker21:

    There are no package locks defined.
openqaworker22:

    There are no package locks defined.
openqaworker28:

    There are no package locks defined.
openqaworker24:

    There are no package locks defined.
openqaworker25:

    There are no package locks defined.
openqaworker27:

    There are no package locks defined.
openqaworker23:
    System management is locked by the application with pid 111491 (zypper).
    Close this application before trying again.
openqaworker-arm21:

    There are no package locks defined.
openqaworker-arm22:

    # | Name                   | Type    | Repository | Comment
    --+------------------------+---------+------------+--------
    1 | libply*                | package | (any)      |
    2 | os-autoinst-devel      | package | (any)      |
    3 | xdg-desktop-portal-gtk | package | (any)      |
Actions #49

Updated by okurz about 2 months ago

  • Status changed from Resolved to Feedback

What's with the two o3 machines in NUE2?

Actions #50

Updated by dheidler about 2 months ago

  • Status changed from Feedback to Resolved

No idea. What are the hostnames?
Also they are not mentioned in AC1.

Actions #51

Updated by okurz about 2 months ago

  • Status changed from Resolved to Feedback

dheidler wrote in #note-50:

No idea. What are the hostnames?

Please see https://progress.opensuse.org/projects/openqav3/wiki/#Manual-command-execution-on-o3-workers, it's kerosene and aarch64-o3

Also they are not mentioned in AC1.

Yes, that was apparently an oversight. But it should be clear that we want a solution for all o3 hosts.

Actions #52

Updated by dheidler about 2 months ago

Moved kerosene.qe.nue2.suse.org and aarch64-o3.qe.nue2.suse.org to nftables as well.

Actions #53

Updated by dheidler about 2 months ago

  • Status changed from Feedback to Resolved
Actions #54

Updated by livdywan about 2 months ago

  • Status changed from Resolved to Feedback

dheidler wrote in #note-32:

Opened https://bugzilla.suse.com/show_bug.cgi?id=1235450

What about this ticket? If we have no ticket blocking on it, we won't be keeping track of it. And I don't see it being referenced in salt 🤔

Please ensure that one or the other is true.

Actions #55

Updated by dheidler about 2 months ago

  • Status changed from Feedback to Resolved

This ticket is the one about o3 - so there is no (official) salt.

As the new default is nftables, there is no issue anymore related to our work.
There is only an issue with firewalld, which we don't use anymore.

If you think we should track this, it would be better to create a new ticket about evaluating a switch back to firewalld, which could then be blocked on.

As the ACs of this ticket are fulfilled, there is no reason to block on anything.

Actions #56

Updated by okurz about 2 months ago

  • Due date deleted (2025-01-23)