action #131249

closed

[alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M

Added by okurz over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date: 2023-06-22
Due date:
% Done: 0%
Estimated time:

Description

Observation

From https://gitlab.suse.de/openqa/osd-deployment/-/jobs/1648467 and reproduced locally with sudo salt --no-color -C 'G@roles:worker' test.ping:

grenache-1.qa.suse.de:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20230622084232610255
worker5.oqa.suse.de:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20230622084232610255
worker2.oqa.suse.de:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20230622084232610255

Acceptance criteria

  • AC1: salt \* test.ping and salt \* state.apply succeed consistently for more than one day
  • AC2: our salt states and pillar and osd deployment pipelines are green and stable again

Suggestions

  • DONE It seems we might have had this problem for a while but never this severely. Now it looks like those machines can end up with "no response" again even after we trigger a reboot and restart salt-minion. Maybe we can revert some recent package updates? From /var/log/zypp/history there is
2023-06-22 03:01:12|install|python3-pyzmq|17.1.2-150000.3.5.2|x86_64||repo-sle-update|e2d9d07654cffc31e5199f40aa1ba9fee1e114c4ca5abd78f7fdc78b2e6cc21a|
  • DONE Debug the actual problem of hanging salt-minion. Maybe we can actually try to better trigger the problem, not prevent it?
  • DONE Research upstream, apply workarounds, potentially try upgrade Leap 15.5 if that might fix something
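The /var/log/zypp/history entry quoted above can also be extracted programmatically when hunting for suspect updates. A minimal sketch (the helper name is made up; it only assumes the standard pipe-separated history format):

```python
# Hypothetical helper: list packages installed on a given day according to
# a zypp history file, e.g. to spot updates like python3-pyzmq above.
# Fields in /var/log/zypp/history are pipe-separated:
# date|action|name|version|arch|...
def history_installs(path, day):
    hits = []
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            if len(fields) > 3 and fields[1] == "install" and fields[0].startswith(day):
                hits.append((fields[0], fields[2], fields[3]))
    return hits
```

Usage on a worker would be e.g. `history_installs("/var/log/zypp/history", "2023-06-22")`; reverting a suspect package would then still be a manual `zypper install --oldpackage` plus a package lock.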

Rollback steps

  • DONE on worker2,worker3,worker5,grenache-1,openqaworker-arm-2,openqaworker-arm-3 sudo mv /etc/systemd/system/auto-update.$i{.disabled_poo131249,} && sudo systemctl enable --now auto-update.timer && sudo systemctl start auto-update, remove manual override /etc/systemd/system/auto-update.service.d/override.conf, wait for upgrade to complete and reboot
  • DONE re-enable osd-deployment https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit
  • DONE remove silence https://stats.openqa-monitor.qa.suse.de/alerting/silences "alertname=Failed systemd services alert (except openqa.suse.de)"
  • DONE remove package locks for anything related to salt

Related issues 10 (0 open, 10 closed)

Related to openQA Infrastructure (public) - action #130835: salt high state fails after recent merge requests in salt pillars size:M (Resolved, okurz, 2023-06-14)
Related to openQA Project (public) - action #131447: Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines (Resolved, kraih, 2023-06-27)
Related to openQA Infrastructure (public) - action #107932: Handling broken RPM databases does not handle certain cases (Resolved, mkittler, 2022-03-07)
Related to openQA Infrastructure (public) - action #102942: Failed systemd services alert: snapper-cleanup on QA-Power8-4-kvm fails size:M (Resolved, mkittler, 2021-11-24)
Related to openQA Infrastructure (public) - action #132137: Setup new PRG2 openQA worker for osd size:M (Resolved, mkittler, 2023-06-29)
Related to openQA Infrastructure (public) - action #134906: osd-deployment failed due to openqaworker1 showing "No response" in salt size:M (Resolved, nicksinger, 2023-08-31 to 2023-09-23)
Related to openQA Infrastructure (public) - action #136325: salt deploy fails due to multiple offline workers in qe.nue2.suse.org+prg2.suse.org (Resolved, okurz, 2023-09-22)
Related to openQA Infrastructure (public) - action #150965: At least diesel+petrol+mania fail to auto-update due to kernel locks preventing patches size:M (Resolved, dheidler, 2023-11-16 to 2023-12-22)
Copied to openQA Infrastructure (public) - action #131540: openqa-piworker fails to upgrade many packages. vendor change is not enabled as our salt states so far only do that for openQA machines, not generic machines size:M (Resolved, mkittler)
Copied to openQA Infrastructure (public) - action #131543: We have machines with both auto-update&auto-upgrade deployed, we should have only one at a time size:M (Resolved, okurz)
Actions #1

Updated by okurz over 1 year ago

I could still log in to worker5 over ssh; the salt minion still runs there.

Actions #2

Updated by osukup over 1 year ago

on worker2:

worker2:/home/osukup # ps -aux | grep salt
root     17052  0.1  0.0      0     0 ?        Z    04:01   0:30 [salt-minion] <defunct>
root     24047  0.0  0.0  50452 27904 ?        Ss   03:01   0:00 /usr/bin/python3 /usr/bin/salt-minion
root     24054  0.0  0.0 786284 74092 ?        Sl   03:01   0:05 /usr/bin/python3 /usr/bin/salt-minion
Actions #3

Updated by mkittler over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #4

Updated by mkittler over 1 year ago

Jun 22 03:01:12 grenache-1 salt-minion[693500]: /usr/lib/python3.6/site-packages/salt/transport/client.py:81: DeprecationWarning: This module is deprecated. Please use salt.channel.client instead.
Jun 22 03:01:12 grenache-1 salt-minion[693500]:   "This module is deprecated. Please use salt.channel.client instead.",
Jun 22 03:01:12 grenache-1 salt-minion[693500]: [WARNING ] Got events for closed stream None
Jun 22 04:05:23 grenache-1 salt-minion[699658]: /usr/lib/python3.6/site-packages/salt/states/x509.py:214: DeprecationWarning: The x509 modules are deprecated. Please migrate to the replacement modules (x509_v2). They are the default from Salt 3008 (Argon) onwards.
Jun 22 04:05:23 grenache-1 salt-minion[699658]:   "The x509 modules are deprecated. Please migrate to the replacement "

At least the last deprecation warning we also get on other hosts, so those warnings are likely not the culprit. Otherwise nothing has been logged since the last restart of the service except for

2023-06-22 03:01:12,130 [tornado.general  :444 ][WARNING ][693500] Got events for closed stream None
Actions #5

Updated by mkittler over 1 year ago

Restarting the minion services on the affected hosts helped. This is not the first time I have seen salt-minion.service stuck; it is not clear yet what causes this from time to time.

Yesterday afternoon it definitely still worked on all workers so the relevant timeframe is quite small. However, I couldn't spot anything in the logs.

Actions #6

Updated by okurz over 1 year ago

  • Status changed from In Progress to Resolved

Yes, all salt minions seem to be reachable again. Next time, if reproduced, I suggest attaching to the blocked processes with strace or checking open file handles with lsof.
I restarted failed tests in
https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/710619
The deployment step has just started and all machines are reachable again, so we are good here.
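A cheap first step next time, before attaching strace or running lsof, is to read the process state straight from /proc. A small sketch (the function name is invented; Linux-only, and inspecting another user's process still needs root):

```python
import os

# Summarize a process's scheduler state, thread count and number of open
# file descriptors from /proc -- enough to tell a blocked process
# (sleeping state, unchanging fd count) from one still making progress.
def proc_state(pid):
    info = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("State:", "Threads:")):
                key, _, value = line.partition(":")
                info[key] = value.strip()
    try:
        info["open_fds"] = len(os.listdir(f"/proc/{pid}/fd"))
    except PermissionError:
        info["open_fds"] = None  # need root for other users' processes
    return info
```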

Actions #7

Updated by mkittler over 1 year ago

  • Status changed from Resolved to In Progress

Looks like it got stuck again on grenache-1:

martchus@grenache-1:~> sudo strace -p 731998
strace: Process 731998 attached
wait4(732003, 
^Cstrace: Process 731998 detached
 <detached ...>

martchus@grenache-1:~> sudo strace -p 732003
strace: Process 732003 attached
futex(0x7fff7c001c10, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY
^Cstrace: Process 732003 detached
 <detached ...>

martchus@grenache-1:~> ps aux | grep -i salt
root     731998  0.0  0.0  42560 37376 ?        Ss   12:36   0:00 /usr/bin/python3 /usr/bin/salt-minion
root     732003  0.1  0.0 560960 83328 ?        Sl   12:36   0:05 /usr/bin/python3 /usr/bin/salt-minion
root     738594  4.9  0.0      0     0 ?        Z    13:36   1:03 [salt-minion] <defunct>
martchus 743308  0.0  0.0   8192  1344 pts/1    S+   13:58   0:00 grep --color=auto -i salt

lsof -p … doesn't show anything special and the list is empty for the zombie process. It looks like one of the processes is stuck in a deadlock. There are no coredumps by the way.

Actions #8

Updated by mkittler over 1 year ago

Backtrace via gdb of the process stuck on the futex wait:

* 1    Thread 0x7fff9c5d49b0 (LWP 732003) "salt-minion" 0x00007fff9c117e0c in do_futex_wait.constprop () from /lib64/libpthread.so.0
  2    Thread 0x7fff9a11f180 (LWP 732004) "salt-minion" 0x00007fff9c00c888 in select () from /lib64/libc.so.6
  3    Thread 0x7fff935af180 (LWP 732007) "ZMQbg/0"     0x00007fff9c01a56c in epoll_wait () from /lib64/libc.so.6
  4    Thread 0x7fff92d9f180 (LWP 732008) "ZMQbg/1"     0x00007fff9c01a56c in epoll_wait () from /lib64/libc.so.6
  5    Thread 0x7fff9258f180 (LWP 732013) "ZMQbg/4"     0x00007fff9c01a56c in epoll_wait () from /lib64/libc.so.6
  6    Thread 0x7fff91d7f180 (LWP 732014) "ZMQbg/5"     0x00007fff9c01a56c in epoll_wait () from /lib64/libc.so.6
  7    Thread 0x7fff83fff180 (LWP 741516) "salt-minion" 0x00007fff9c01a56c in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fff9c117e0c in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007fff9c117fb8 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fff9c34cff4 in PyThread_acquire_lock_timed () from /usr/lib64/libpython3.6m.so.1.0
#3  0x00007fff9c3544ec in ?? () from /usr/lib64/libpython3.6m.so.1.0
#4  0x00007fff9c354700 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#5  0x00007fff9c244aac in _PyCFunction_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#6  0x00007fff9c2ed908 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#7  0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#8  0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#9  0x00007fff9c2ed060 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#10 0x00007fff9c2ed3c8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#11 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#12 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#13 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#14 0x00007fff9c2ed060 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#15 0x00007fff9c2ed3c8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#16 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#17 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#18 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#19 0x00007fff9c2ed060 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#20 0x00007fff9c2ed3c8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#21 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#22 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#23 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#24 0x00007fff9c208b2c in ?? () from /usr/lib64/libpython3.6m.so.1.0
#25 0x00007fff9c2e55ac in ?? () from /usr/lib64/libpython3.6m.so.1.0
#26 0x00007fff9c2449f4 in _PyCFunction_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#27 0x00007fff9c2ed908 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#28 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#29 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#30 0x00007fff9c2ed060 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#31 0x00007fff9c2ed3c8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#32 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#33 0x00007fff9c2f4558 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#34 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#35 0x00007fff9c208b2c in ?? () from /usr/lib64/libpython3.6m.so.1.0
#36 0x00007fff9c2e55ac in ?? () from /usr/lib64/libpython3.6m.so.1.0
#37 0x00007fff9c2449f4 in _PyCFunction_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#38 0x00007fff9c2ed908 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#39 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#40 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#41 0x00007fff9c2ed060 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#42 0x00007fff9c2ed3c8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#43 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#44 0x00007fff9c2f4558 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#45 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#46 0x00007fff9c2ed060 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#47 0x00007fff9c2ed3c8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#48 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#49 0x00007fff9c2f4558 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#50 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#51 0x00007fff9c208b2c in ?? () from /usr/lib64/libpython3.6m.so.1.0
#52 0x00007fff9c2e55ac in ?? () from /usr/lib64/libpython3.6m.so.1.0
#53 0x00007fff9c2449f4 in _PyCFunction_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#54 0x00007fff9c2ed908 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#55 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#56 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#57 0x00007fff9c2ed060 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#58 0x00007fff9c2ed3c8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#59 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#60 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#61 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#62 0x00007fff9c2ed060 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#63 0x00007fff9c2ed9f8 in PyEval_EvalCodeEx () from /usr/lib64/libpython3.6m.so.1.0
#64 0x00007fff9c215148 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#65 0x00007fff9c1d4ae8 in PyObject_Call () from /usr/lib64/libpython3.6m.so.1.0
#66 0x00007fff9c2f3e88 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#67 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#68 0x00007fff9c2ed060 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#69 0x00007fff9c2ed9f8 in PyEval_EvalCodeEx () from /usr/lib64/libpython3.6m.so.1.0
#70 0x00007fff9c215148 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#71 0x00007fff9c1d4ae8 in PyObject_Call () from /usr/lib64/libpython3.6m.so.1.0
#72 0x00007fff9c2f3e88 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#73 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#74 0x00007fff9c2ed060 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#75 0x00007fff9c2f8b4c in _PyFunction_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#76 0x00007fff9c1d4ed0 in _PyObject_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#77 0x00007fff9c1d5210 in _PyObject_Call_Prepend () from /usr/lib64/libpython3.6m.so.1.0
#78 0x00007fff9c1f43e8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#79 0x00007fff9c1d4ae8 in PyObject_Call () from /usr/lib64/libpython3.6m.so.1.0
#80 0x00007fff9c2f3e88 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
--Type <RET> for more, q to quit, c to continue without paging--c
#81 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#82 0x00007fff9c2ed060 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#83 0x00007fff9c2f8c5c in _PyFunction_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#84 0x00007fff9c1d4ed0 in _PyObject_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#85 0x00007fff9c381d1c in ?? () from /usr/lib64/libpython3.6m.so.1.0
#86 0x00007fff9c1d4d84 in _PyObject_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#87 0x00007fff9c2ed654 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#88 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#89 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#90 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#91 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#92 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#93 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#94 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#95 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#96 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#97 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#98 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#99 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#100 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#101 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#102 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#103 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#104 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#105 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#106 0x00007fff9c2ed060 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#107 0x00007fff9c2ed3c8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#108 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#109 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#110 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#111 0x00007fff9c2ed060 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#112 0x00007fff9c2ed9f8 in PyEval_EvalCodeEx () from /usr/lib64/libpython3.6m.so.1.0
#113 0x00007fff9c215148 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#114 0x00007fff9c1d4ae8 in PyObject_Call () from /usr/lib64/libpython3.6m.so.1.0
#115 0x00007fff9c2f3e88 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#116 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#117 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#118 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#119 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#120 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#121 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#122 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#123 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#124 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#125 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#126 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#127 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#128 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#129 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#130 0x00007fff9c2f8d4c in _PyFunction_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#131 0x00007fff9c1d4ed0 in _PyObject_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#132 0x00007fff9c1d5210 in _PyObject_Call_Prepend () from /usr/lib64/libpython3.6m.so.1.0
#133 0x00007fff9c1f43e8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#134 0x00007fff9c1d4ae8 in PyObject_Call () from /usr/lib64/libpython3.6m.so.1.0
#135 0x00007fff9c26c858 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#136 0x00007fff9c267230 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#137 0x00007fff9c1d4d84 in _PyObject_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#138 0x00007fff9c2ed654 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#139 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#140 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#141 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#142 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#143 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#144 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#145 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#146 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#147 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#148 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#149 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#150 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#151 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#152 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#153 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#154 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#155 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#156 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#157 0x00007fff9c2ed060 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#158 0x00007fff9c2ed1c0 in PyEval_EvalCode () from /usr/lib64/libpython3.6m.so.1.0
#159 0x00007fff9c32bc54 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#160 0x00007fff9c32ec58 in PyRun_FileExFlags () from /usr/lib64/libpython3.6m.so.1.0
#161 0x00007fff9c32eeb8 in PyRun_SimpleFileExFlags () from /usr/lib64/libpython3.6m.so.1.0
#162 0x00007fff9c350860 in Py_Main () from /usr/lib64/libpython3.6m.so.1.0
#163 0x000000012f9d0ea8 in main ()
(gdb) bt
#0  0x00007fff9c00c888 in select () from /lib64/libc.so.6
#1  0x00007fff9c39df54 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#2  0x00007fff9c24491c in _PyCFunction_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#3  0x00007fff9c2ed908 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#4  0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#5  0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#6  0x00007fff9c2ed060 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#7  0x00007fff9c2ed9f8 in PyEval_EvalCodeEx () from /usr/lib64/libpython3.6m.so.1.0
#8  0x00007fff9c215148 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#9  0x00007fff9c1d4ae8 in PyObject_Call () from /usr/lib64/libpython3.6m.so.1.0
#10 0x00007fff9c2f3e88 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#11 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#12 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#13 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#14 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#15 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#16 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#17 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#18 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#19 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#20 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#21 0x00007fff9c2f8d4c in _PyFunction_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#22 0x00007fff9c1d4ed0 in _PyObject_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#23 0x00007fff9c1d5210 in _PyObject_Call_Prepend () from /usr/lib64/libpython3.6m.so.1.0
#24 0x00007fff9c1f43e8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#25 0x00007fff9c1d4ae8 in PyObject_Call () from /usr/lib64/libpython3.6m.so.1.0
#26 0x00007fff9c2ee248 in PyEval_CallObjectWithKeywords () from /usr/lib64/libpython3.6m.so.1.0
#27 0x00007fff9c353f84 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#28 0x00007fff9c34ca60 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#29 0x00007fff9c109748 in start_thread () from /lib64/libpthread.so.0
#30 0x00007fff9c01a084 in clone () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fff9c01a56c in epoll_wait () from /lib64/libc.so.6
#1  0x00007fff9966c4d8 in ?? () from /usr/lib64/libzmq.so.5
#2  0x00007fff996b3ab0 in ?? () from /usr/lib64/libzmq.so.5
#3  0x00007fff9c109748 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fff9c01a084 in clone () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fff9c01a56c in epoll_wait () from /lib64/libc.so.6
#1  0x00007fff9966c4d8 in ?? () from /usr/lib64/libzmq.so.5
#2  0x00007fff996b3ab0 in ?? () from /usr/lib64/libzmq.so.5
#3  0x00007fff9c109748 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fff9c01a084 in clone () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fff9c01a56c in epoll_wait () from /lib64/libc.so.6
#1  0x00007fff9966c4d8 in ?? () from /usr/lib64/libzmq.so.5
#2  0x00007fff996b3ab0 in ?? () from /usr/lib64/libzmq.so.5
#3  0x00007fff9c109748 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fff9c01a084 in clone () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fff9c01a56c in epoll_wait () from /lib64/libc.so.6
#1  0x00007fff9966c4d8 in ?? () from /usr/lib64/libzmq.so.5
#2  0x00007fff996b3ab0 in ?? () from /usr/lib64/libzmq.so.5
#3  0x00007fff9c109748 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fff9c01a084 in clone () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fff9c01a56c in epoll_wait () from /lib64/libc.so.6
#1  0x00007fff9b7c28c0 in ?? () from /usr/lib64/python3.6/lib-dynload/select.cpython-36m-powerpc64le-linux-gnu.so
#2  0x00007fff9c244aac in _PyCFunction_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#3  0x00007fff9c2ed908 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#4  0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#5  0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#6  0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#7  0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#8  0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#9  0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#10 0x00007fff9c2ed060 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#11 0x00007fff9c2ed3c8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#12 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#13 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#14 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#15 0x00007fff9c2ed060 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#16 0x00007fff9c2f8b4c in _PyFunction_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#17 0x00007fff9c1d4ed0 in _PyObject_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#18 0x00007fff9c1d5210 in _PyObject_Call_Prepend () from /usr/lib64/libpython3.6m.so.1.0
#19 0x00007fff9c1f43e8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#20 0x00007fff9c1d4ae8 in PyObject_Call () from /usr/lib64/libpython3.6m.so.1.0
#21 0x00007fff9c2f3e88 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#22 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#23 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#24 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#25 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#26 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#27 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#28 0x00007fff9c2ed7a8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#29 0x00007fff9c2f1694 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#30 0x00007fff9c2ec3d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython3.6m.so.1.0
#31 0x00007fff9c2ed284 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#32 0x00007fff9c2f8d4c in _PyFunction_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#33 0x00007fff9c1d4ed0 in _PyObject_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#34 0x00007fff9c1d5210 in _PyObject_Call_Prepend () from /usr/lib64/libpython3.6m.so.1.0
#35 0x00007fff9c1f43e8 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#36 0x00007fff9c1d4ae8 in PyObject_Call () from /usr/lib64/libpython3.6m.so.1.0
#37 0x00007fff9c2ee248 in PyEval_CallObjectWithKeywords () from /usr/lib64/libpython3.6m.so.1.0
#38 0x00007fff9c353f84 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#39 0x00007fff9c34ca60 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#40 0x00007fff9c109748 in start_thread () from /lib64/libpthread.so.0
#41 0x00007fff9c01a084 in clone () from /lib64/libc.so.6

On other hosts none of the processes is stuck on a futex wait, so this is indeed likely not normal.

I created coredumps of both running processes. One can open them on grenache via sudo gdb --core=/home/martchus/core.732003, but unfortunately the symbol names are missing then.

I also tried to generate a Python backtrace but it is useless because debug info is missing. gdb says one should install it via zypper install python3-base-debuginfo-3.6.15-150300.10.48.1.ppc64le but that particular version doesn't exist and just installing python3-base-debuginfo doesn't help.
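For what it's worth, one classic way a Python daemon ends up permanently blocked in exactly this kind of futex wait is forking while another thread holds a lock: the child inherits the lock in its locked state but not the thread that would release it. Whether that is salt's actual bug here is unproven; the mechanism itself can be shown in a few self-contained lines:

```python
import os
import threading
import time

lock = threading.Lock()

def holder():
    with lock:
        time.sleep(1.0)  # hold the lock across the fork below

threading.Thread(target=holder).start()
time.sleep(0.2)  # make sure the lock is taken before forking

pid = os.fork()
if pid == 0:
    # Child: the lock was copied in its locked state and the holder thread
    # does not exist here, so a plain acquire() would hang in futex_wait
    # forever -- use a timeout just to demonstrate it without hanging.
    acquired = lock.acquire(timeout=0.3)
    os._exit(0 if acquired else 42)

_, status = os.waitpid(pid, 0)
exit_code = status >> 8
print("child exit code:", exit_code)  # 42: the child never got the lock
```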

Actions #9

Updated by mkittler over 1 year ago

Not sure how to make sense of this without diving deeply into salt's internals. It at least looks like we're not the only ones having trouble with salt being stuck:

Both issues mention futex_wait specifically.

Actions #10

Updated by okurz over 1 year ago

mkittler wrote:

Not sure how to make sense of this without diving deeply into salt's internals

I recommend:

  1. write "me too" with a reference to this ticket in at least one of the upstream ones, preferably with more details than just "me too" :)
  2. apply or at least document in this ticket a workaround that works for us, e.g. reboot machine or whatever
Actions #11

Updated by openqa_review over 1 year ago

  • Due date set to 2023-07-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #12

Updated by okurz over 1 year ago

  • Subject changed from [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt minion does not return to [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt minion does not return size:M
  • Description updated (diff)
Actions #13

Updated by okurz over 1 year ago

Again test.ping does not return for worker2, worker5 and grenache-1, so exactly the same machines that were problematic before. From worker2:

$ ssh worker2.oqa.suse.de 
Last login: Mon Jun 19 11:24:42 2023 from 2620:113:80c0:8360::107a
okurz@worker2:~> sudo ps auxf | grep minion
okurz    14681  0.0  0.0   8200   768 pts/0    S+   12:54   0:00              \_ grep --color=auto minion
root     25456  0.0  0.0  50452 27872 ?        Ss   Jun22   0:00 /usr/bin/python3 /usr/bin/salt-minion
root     25461  0.0  0.0 647540 73528 ?        Sl   Jun22   0:06  \_ /usr/bin/python3 /usr/bin/salt-minion
root     14984  0.0  0.0      0     0 ?        Z    Jun22   0:26      \_ [salt-minion] <defunct>
okurz@worker2:~> systemctl status salt-minion
● salt-minion.service - The Salt Minion
     Loaded: loaded (/usr/lib/systemd/system/salt-minion.service; enabled; vendor preset: disabled)
     Active: active (running) since Thu 2023-06-22 12:38:18 CEST; 24h ago
   Main PID: 25456 (salt-minion)
      Tasks: 9 (limit: 4915)
     CGroup: /system.slice/salt-minion.service
             ├─ 25456 /usr/bin/python3 /usr/bin/salt-minion
             └─ 25461 /usr/bin/python3 /usr/bin/salt-minion

Warning: some journal files were not opened due to insufficient permissions.
okurz@worker2:~> sudo strace -p 25461
strace: Process 25461 attached
futex(0x7f3f70000b50, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY^Cstrace: Process 25461 detached

maybe we have better luck debugging here.
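The futex wait in the strace above hides the Python-level state. A gdb command file along these lines might recover it (a sketch only: the file name is made up, and it assumes gdb plus matching python3 debuginfo packages are installed so that the `py-bt` extension is available):

```
# /tmp/minion-backtrace.gdb -- hypothetical command file, run as:
#   sudo gdb -p 25461 -batch -x /tmp/minion-backtrace.gdb
thread apply all bt    # native backtraces of all threads
py-bt                  # Python-level backtrace (needs python3 debuginfo)
detach
quit
```

The Python-level backtrace would show which salt code path is blocked on that futex.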

Actions #14

Updated by mkittler over 1 year ago

I installed debug packages but couldn't produce a backtrace. Maybe I can try again tomorrow if the issue happens again. Otherwise, the debug packages should be removed again before closing this ticket (zypper rm $(zypper se -i debuginfo | grep -i name | sed -e 's|.*name="\([^"]*\)".*|\1|')).

apply or at least document in this ticket a workaround that works for us, e.g. reboot machine or whatever

I guess the workaround is to simply restart the service. You've already come up with a restart loop, so maybe we can run something similar in a more automated way.
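Such an automated variant could be a pair of systemd units on each worker (a sketch only; the unit names, path, and hourly interval are made up, nothing like this is deployed yet, and it obviously just papers over the underlying hang):

```ini
# /etc/systemd/system/salt-minion-restart.service
[Unit]
Description=Restart salt-minion as workaround for poo#131249

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl try-restart salt-minion.service

# /etc/systemd/system/salt-minion-restart.timer
[Unit]
Description=Hourly salt-minion restart workaround

[Timer]
OnCalendar=hourly

[Install]
WantedBy=timers.target
```

Enabled with systemctl enable --now salt-minion-restart.timer, this would cap the time any minion stays unresponsive at roughly one hour.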

Actions #15

Updated by okurz over 1 year ago

I did

for i in worker5.oqa.suse.de openqaworker-arm-2.suse.de worker2.oqa.suse.de grenache-1.qa.suse.de openqaworker-arm-3.suse.de; do ssh $i "sudo systemctl restart salt-minion"; done && ssh osd "sudo salt \* test.ping"

and after that the same for openqaworker18.qa.suse.cz; after trying three more times, test.ping eventually returned ok for all currently salt-controlled machines. So at least that works as a workaround.

Next day, 2023-06-24, again w2,w5,w18,arm-2,arm-3,grenache-1 would not respond while the others are fine. Trying for i in openqaworker18.qa.suse.cz worker5.oqa.suse.de openqaworker-arm-2.suse.de worker2.oqa.suse.de grenache-1.qa.suse.de openqaworker-arm-3.suse.de; do ssh $i "sudo systemctl restart salt-minion"; done && ssh osd "sudo salt \* cmd.run 'uptime; rpm -q ffmpeg-4'". Immediately after restarting the salt-minion the commands are executed just fine.

Now I am trying timeout 7200 sh -c 'for i in {1..7200}; do echo "### Run $i -- $(date -Is)" && salt --no-color \* test.ping; done' to see when or how the responsiveness breaks down.

In run 917, after around 1h, the first minion failed to respond: w5. In run 919 w2 followed, in run 921 grenache-1.
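For the record, those run numbers can be pulled out of the loop's log mechanically. A sketch (the function name is made up; it assumes the log format produced by the loop above, i.e. "### Run N -- date" headers followed by per-minion result blocks):

```shell
# Report the run in which each minion first stopped responding, given the
# combined output of the test.ping loop on stdin.
first_failures() {
    awk '
        /^### Run / { run = $3 }                        # remember current run number
        /:$/        { minion = $1; sub(/:$/, "", minion) }  # current minion header
        /Minion did not return/ && !(minion in seen) {
            seen[minion] = run
            print minion, "failed first in run", run
        }
    '
}
```

Piping the saved loop output through first_failures yields one line per minion with the run in which it first went silent.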

In parallel, to find out whether I can make another machine break, I checked for pending updates on the machines not affected so far. On worker13 I found some pending updates, so I installed them now:

The following 20 packages are going to be upgraded:
  libatomic1                      
    12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15
    SUSE LLC <https://www.suse.com/>
  libgcc_s1                       
    12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15
    SUSE LLC <https://www.suse.com/>
  libgcc_s1-32bit                 
    12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15
    SUSE LLC <https://www.suse.com/>
  libgfortran5                    
    12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15
    SUSE LLC <https://www.suse.com/>
  libgomp1                        
    12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15
    SUSE LLC <https://www.suse.com/>
  libitm1                         
    12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15
    SUSE LLC <https://www.suse.com/>
  liblsan0                        
    12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15
    SUSE LLC <https://www.suse.com/>
  libquadmath0                    
    12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15
    SUSE LLC <https://www.suse.com/>
  libstdc++6                      
    12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15
    SUSE LLC <https://www.suse.com/>
  libstdc++6-32bit                
    12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15
    SUSE LLC <https://www.suse.com/>
  libstdc++6-pp                   
    12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15
    SUSE LLC <https://www.suse.com/>
  libstdc++6-pp-32bit             
    12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15
    SUSE LLC <https://www.suse.com/>
  openQA-client                   
    4.6.1687510203.8d9fc92-lp154.5910.1 -> 4.6.1687532073.e11feac-lp154.5912.1  x86_64  devel_openQA                                                
    obs://build.opensuse.org/devel:openQA
  openQA-common                   
    4.6.1687510203.8d9fc92-lp154.5910.1 -> 4.6.1687532073.e11feac-lp154.5912.1  x86_64  devel_openQA                                                
    obs://build.opensuse.org/devel:openQA
  openQA-worker                   
    4.6.1687510203.8d9fc92-lp154.5910.1 -> 4.6.1687532073.e11feac-lp154.5912.1  x86_64  devel_openQA                                                
    obs://build.opensuse.org/devel:openQA
  os-autoinst                     
    4.6.1687515905.8e765fc-lp154.1597.1 -> 4.6.1687532294.4a46169-lp154.1598.1  x86_64  devel_openQA                                                
    obs://build.opensuse.org/devel:openQA
  os-autoinst-devel               
    4.6.1687515905.8e765fc-lp154.1597.1 -> 4.6.1687532294.4a46169-lp154.1598.1  x86_64  devel_openQA                                                
    obs://build.opensuse.org/devel:openQA
  os-autoinst-distri-opensuse-deps
    1.1687520240.f97f61fa-lp154.12375.1 -> 1.1687528477.44bada55-lp154.12376.1  noarch  devel_openQA                                                
    obs://build.opensuse.org/devel:openQA
  os-autoinst-openvswitch         
    4.6.1687515905.8e765fc-lp154.1597.1 -> 4.6.1687532294.4a46169-lp154.1598.1  x86_64  devel_openQA                                                
    obs://build.opensuse.org/devel:openQA
  os-autoinst-swtpm               
    4.6.1687515905.8e765fc-lp154.1597.1 -> 4.6.1687532294.4a46169-lp154.1598.1  x86_64  devel_openQA                                                
    obs://build.opensuse.org/devel:openQA

But so far the host has not stopped responding.

Actions #16

Updated by okurz over 1 year ago

  • Description updated (diff)

I want to crosscheck if any recent package installations triggered this. On worker5:

# snapper ls
    # | Type   | Pre # | Date                     | User | Cleanup | Description           | Userdata     
------+--------+-------+--------------------------+------+---------+-----------------------+--------------
   0  | single |       |                          | root |         | current               |              
   1* | single |       | Fri Jan 13 10:36:01 2017 | root |         | first root filesystem |              
2895  | pre    |       | Sun May 14 03:37:22 2023 | root | number  | zypp(zypper)          | important=yes
2896  | post   |  2895 | Sun May 14 03:38:55 2023 | root | number  |                       | important=yes
2905  | pre    |       | Thu May 18 08:11:58 2023 | root | number  | zypp(zypper)          | important=yes
2906  | post   |  2905 | Thu May 18 08:13:06 2023 | root | number  |                       | important=yes
2921  | pre    |       | Thu May 25 07:57:26 2023 | root | number  | zypp(zypper)          | important=yes
2922  | post   |  2921 | Thu May 25 07:58:38 2023 | root | number  |                       | important=yes
2947  | pre    |       | Thu Jun  8 08:14:50 2023 | root | number  | zypp(zypper)          | important=yes
2948  | post   |  2947 | Thu Jun  8 08:16:17 2023 | root | number  |                       | important=yes
2955  | pre    |       | Tue Jun 13 07:22:58 2023 | root | number  | zypp(zypper)          | important=yes
2956  | post   |  2955 | Tue Jun 13 07:23:28 2023 | root | number  |                       | important=yes
2985  | pre    |       | Thu Jun 22 13:27:51 2023 | root | number  | zypp(zypper)          | important=no 
2986  | pre    |       | Thu Jun 22 13:28:06 2023 | root | number  | zypp(zypper)          | important=no 
2987  | pre    |       | Thu Jun 22 13:28:22 2023 | root | number  | zypp(zypper)          | important=no 
2988  | pre    |       | Thu Jun 22 13:28:37 2023 | root | number  | zypp(zypper)          | important=no 
2989  | pre    |       | Thu Jun 22 14:15:38 2023 | root | number  | zypp(zypper)          | important=no 
2990  | post   |  2989 | Thu Jun 22 14:15:46 2023 | root | number  |                       | important=no 
2991  | pre    |       | Fri Jun 23 14:38:32 2023 | root | number  | zypp(zypper)          | important=no 
2992  | post   |  2991 | Fri Jun 23 14:38:39 2023 | root | number  |                       | important=no 
2993  | pre    |       | Fri Jun 23 14:39:36 2023 | root | number  | zypp(zypper)          | important=no 
2994  | post   |  2993 | Fri Jun 23 14:40:10 2023 | root | number  |                       | important=no 
worker5:/home/okurz # snapper rollback 2955
Ambit is classic.
Creating read-only snapshot of current system. (Snapshot 2995.)
Creating read-write snapshot of snapshot 2955. (Snapshot 2996.)
Setting default subvolume to snapshot 2996.
# sudo systemctl disable --now auto-update.timer
# reboot

added rollback steps for worker5. Again running the experiment to restart salt-minion on worker2.oqa.suse.de grenache-1.qa.suse.de openqaworker-arm-2.suse.de openqaworker-arm-3.suse.de and then run a test.ping salt call in a loop.

https://github.com/saltstack/salt/issues/56467 looks related, last update in 2022-02, potentially also https://bugzilla.suse.com/show_bug.cgi?id=1135756

EDIT: 2023-06-25: hm, worker5 was unresponsive again, but the auto-update timer was also enabled again. Well, yesterday I couldn't mask it because auto-update.timer is a custom timer deployed by our salt states. So instead I will prevent the timer and service from being enabled again by

for i in service timer; do mv /etc/systemd/system/auto-update.$i{,.disabled_poo131249}; done
snapper rollback 2955
reboot

By the way, salt-run manage.status provides a good overview of which nodes are considered down from the OSD salt point of view. Right now this shows many down:

down:
    - grenache-1.qa.suse.de
    - openqaworker-arm-2.suse.de
    - openqaworker-arm-3.suse.de
    - openqaworker16.qa.suse.cz
    - openqaworker17.qa.suse.cz
    - openqaworker18.qa.suse.cz
    - worker2.oqa.suse.de
    - worker3.oqa.suse.de
    - worker5.oqa.suse.de

which I consider quite bad. Ok, worker5.oqa.suse.de is back for now due to my recovery. For the others I am restarting the salt-minion as before, then continuing my test.ping experiment.
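That manage.status output is also easy to consume from a script, e.g. to restart only the minions currently listed as down. A sketch (the function name is made up; it assumes the YAML-ish layout shown above):

```shell
# Print the hostnames listed under "down:" in `salt-run manage.status` output
# read from stdin.
list_down_minions() {
    awk '
        /^down:/        { in_down = 1; next }   # start of the down section
        /^[^[:space:]]/ { in_down = 0 }         # any other top-level key ends it
        in_down && /^[[:space:]]*- / { print $2 }
    '
}
```

On the master this could feed the restart loop from before, e.g. salt-run manage.status | list_down_minions | while read -r h; do ssh "$h" "sudo systemctl restart salt-minion"; done (untested sketch).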

After roughly 10h w5 at least is still reachable. Interestingly, so are w3,w16,w17,w18, though w2,g1,arm2,arm3 are still unreachable. On w2 I did for i in service timer; do sudo mv /etc/systemd/system/auto-update.$i{,.disabled_poo131249}; done && sudo snapper rollback 3031 && sudo reboot. Correspondingly on arm2: for i in service timer; do sudo mv /etc/systemd/system/auto-update.$i{,.disabled_poo131249}; done && sudo snapper rollback 2698 && sudo reboot and on arm3: for i in service timer; do sudo mv /etc/systemd/system/auto-update.$i{,.disabled_poo131249}; done && sudo snapper rollback 2426 && sudo reboot and on grenache-1: for i in service timer; do sudo mv /etc/systemd/system/auto-update.$i{,.disabled_poo131249}; done && sudo snapper rollback 1263 && sudo reboot. Waiting for all to reboot, then let's see if that helps. Running "test.ping" in a loop again.

EDIT: Eventually g1,arm2,arm3,w2,w5 were unresponsive again. But again auto-update was running at least on w5. I was again making mistakes: it's either salt enabling the auto-update service again, or it's because I disabled the auto-update service before rolling back. Instead I have now rolled back on worker5 and, after the reboot, used systemctl edit auto-update.service to replace ExecStart with an echo call instead of the real one; /etc/systemd/system/auto-update.service.d/override.conf is now:

[Service]
ExecStart=
ExecStart=/usr/bin/echo 'Not running auto-update, see https://progress.opensuse.org/issues/131249'
Actions #17

Updated by okurz over 1 year ago

  • Related to action #130835: salt high state fails after recent merge requests in salt pillars size:M added
Actions #18

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #19

Updated by okurz over 1 year ago

  • Assignee changed from mkittler to okurz

continuing my downgrade experiment as discussed with mkittler.

Actions #20

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #21

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #22

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #23

Updated by okurz over 1 year ago

on grenache-1

sudo snapper rollback 1263 && sudo reboot

wait for reboot and then

sudo mkdir -p /etc/systemd/system/auto-update.service.d && echo -e "[Service]\nExecStart=\nExecStart=/usr/bin/echo 'Not running auto-update, see https://progress.opensuse.org/issues/131249'" | sudo tee /etc/systemd/system/auto-update.service.d/override.conf && sudo systemctl daemon-reload

and now test.ping is failing with 'str' object has no attribute 'pop' most of the time. Ok, after some time it seems to work fine again.

EDIT: 2023-06-27: w2,arm2,arm3 stopped responding, the others still seem fine, supporting my hypothesis of a regression due to a package upgrade. From worker5, the full list of packages with pending upgrades:

  autoyast2                         4.4.43-150400.3.16.1 -> 4.4.45-150400.3.19.1                                noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  autoyast2-installation            4.4.43-150400.3.16.1 -> 4.4.45-150400.3.19.1                                noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  cups                              2.2.7-150000.3.43.1 -> 2.2.7-150000.3.46.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  cups-client                       2.2.7-150000.3.43.1 -> 2.2.7-150000.3.46.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  cups-config                       2.2.7-150000.3.43.1 -> 2.2.7-150000.3.46.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libatomic1                        12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libbluetooth3                     5.62-150400.4.10.3 -> 5.62-150400.4.13.1                                    x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libcups2                          2.2.7-150000.3.43.1 -> 2.2.7-150000.3.46.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libcups2-32bit                    2.2.7-150000.3.43.1 -> 2.2.7-150000.3.46.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libcupscgi1                       2.2.7-150000.3.43.1 -> 2.2.7-150000.3.46.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libcupsimage2                     2.2.7-150000.3.43.1 -> 2.2.7-150000.3.46.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libcupsmime1                      2.2.7-150000.3.43.1 -> 2.2.7-150000.3.46.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libcupsppdc1                      2.2.7-150000.3.43.1 -> 2.2.7-150000.3.46.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libgcc_s1                         12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libgcc_s1-32bit                   12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libgfortran5                      12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libgomp1                          12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libitm1                           12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libldap-2_4-2                     2.4.46-150200.14.11.2 -> 2.4.46-150200.14.14.1                              x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libldap-2_4-2-32bit               2.4.46-150200.14.11.2 -> 2.4.46-150200.14.14.1                              x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libldap-data                      2.4.46-150200.14.11.2 -> 2.4.46-150200.14.14.1                              noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  liblsan0                          12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libopenssl1_0_0                   1.0.2p-150000.3.76.1 -> 1.0.2p-150000.3.79.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libopenssl1_0_0-32bit             1.0.2p-150000.3.76.1 -> 1.0.2p-150000.3.79.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libpython3_6m1_0                  3.6.15-150300.10.45.1 -> 3.6.15-150300.10.48.1                              x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libquadmath0                      12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libsolv-tools                     0.7.24-150400.3.6.4 -> 0.7.24-150400.3.8.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libstdc++6                        12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libstdc++6-32bit                  12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libX11-6                          1.6.5-150000.3.27.1 -> 1.6.5-150000.3.30.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libX11-data                       1.6.5-150000.3.27.1 -> 1.6.5-150000.3.30.1                                  noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libX11-devel                      1.6.5-150000.3.27.1 -> 1.6.5-150000.3.30.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libX11-xcb1                       1.6.5-150000.3.27.1 -> 1.6.5-150000.3.30.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libyui16                          4.3.3-150400.1.5 -> 4.3.7-150400.3.3.1                                      x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libyui-ncurses16                  4.3.3-150400.1.5 -> 4.3.7-150400.3.3.1                                      x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libyui-ncurses-pkg16              4.3.3-150400.1.8 -> 4.3.7-150400.3.3.1                                      x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libzck1                           1.1.16-150400.3.2.1 -> 1.1.16-150400.3.4.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libzypp                           17.31.11-150400.3.25.2 -> 17.31.13-150400.3.32.1                            x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  openQA-client                     4.6.1686317795.57b586f-lp154.5868.1 -> 4.6.1687790479.74f3352-lp154.5916.1  x86_64  devel_openQA                                                  obs://build.opensuse.org/devel:openQA
  openQA-common                     4.6.1686317795.57b586f-lp154.5868.1 -> 4.6.1687790479.74f3352-lp154.5916.1  x86_64  devel_openQA                                                  obs://build.opensuse.org/devel:openQA
  openQA-worker                     4.6.1686317795.57b586f-lp154.5868.1 -> 4.6.1687790479.74f3352-lp154.5916.1  x86_64  devel_openQA                                                  obs://build.opensuse.org/devel:openQA
  os-autoinst                       4.6.1686321776.9b5f5e8-lp154.1588.1 -> 4.6.1687771504.520c460-lp154.1600.1  x86_64  devel_openQA                                                  obs://build.opensuse.org/devel:openQA
  os-autoinst-devel                 4.6.1686321776.9b5f5e8-lp154.1588.1 -> 4.6.1687771504.520c460-lp154.1600.1  x86_64  devel_openQA                                                  obs://build.opensuse.org/devel:openQA
  os-autoinst-distri-opensuse-deps  1.1686319656.79e363bc-lp154.12295.1 -> 1.1687792629.4b158c58-lp154.12382.1  noarch  devel_openQA                                                  obs://build.opensuse.org/devel:openQA
  os-autoinst-openvswitch           4.6.1686321776.9b5f5e8-lp154.1588.1 -> 4.6.1687771504.520c460-lp154.1600.1  x86_64  devel_openQA                                                  obs://build.opensuse.org/devel:openQA
  os-autoinst-swtpm                 4.6.1686321776.9b5f5e8-lp154.1588.1 -> 4.6.1687771504.520c460-lp154.1600.1  x86_64  devel_openQA                                                  obs://build.opensuse.org/devel:openQA
  python3                           3.6.15-150300.10.45.1 -> 3.6.15-150300.10.48.1                              x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-base                      3.6.15-150300.10.45.1 -> 3.6.15-150300.10.48.1                              x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-curses                    3.6.15-150300.10.45.1 -> 3.6.15-150300.10.48.1                              x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-dbm                       3.6.15-150300.10.45.1 -> 3.6.15-150300.10.48.1                              x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-ply                       3.10-1.27 -> 3.10-150000.3.3.4                                              noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-pyzmq                     17.1.2-3.3.1 -> 17.1.2-150000.3.5.2                                         x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-salt                      3004-150400.8.25.1 -> 3006.0-150400.8.34.2                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-simplejson                3.17.2-1.10 -> 3.17.2-150300.3.2.3                                          x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-solv                      0.7.24-150400.3.6.4 -> 0.7.24-150400.3.8.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-tk                        3.6.15-150300.10.45.1 -> 3.6.15-150300.10.48.1                              x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python-solv                       0.7.24-150400.3.6.4 -> 0.7.24-150400.3.8.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu                              6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-accel-qtest                  6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-accel-tcg-x86                6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-audio-spice                  6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-block-curl                   6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-block-iscsi                  6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-block-rbd                    6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-block-ssh                    6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-chardev-baum                 6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-chardev-spice                6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-hw-display-qxl               6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-hw-display-virtio-gpu        6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-hw-display-virtio-gpu-pci    6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-hw-display-virtio-vga        6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-hw-s390x-virtio-gpu-ccw      6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-hw-usb-host                  6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-hw-usb-redirect              6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-hw-usb-smartcard             6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-ipxe                         1.0.0+-150400.37.14.2 -> 1.0.0+-150400.37.17.1                              noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-ivshmem-tools                6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-ksm                          6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-kvm                          6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-microvm                      6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-seabios                      1.15.0_0_g2dd4b9b-150400.37.14.2 -> 1.15.0_0_g2dd4b9b-150400.37.17.1        noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-sgabios                      8-150400.37.14.2 -> 8-150400.37.17.1                                        noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-skiboot                      6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-tools                        6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-ui-curses                    6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-ui-gtk                       6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-ui-opengl                    6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-ui-spice-app                 6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-ui-spice-core                6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-vgabios                      1.15.0_0_g2dd4b9b-150400.37.14.2 -> 1.15.0_0_g2dd4b9b-150400.37.17.1        noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  qemu-x86                          6.2.0-150400.37.14.2 -> 6.2.0-150400.37.17.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  ruby-solv                         0.7.24-150400.3.6.4 -> 0.7.24-150400.3.8.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  salt                              3004-150400.8.25.1 -> 3006.0-150400.8.34.2                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  salt-bash-completion              3004-150400.8.25.1 -> 3006.0-150400.8.34.2                                  noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  salt-minion                       3004-150400.8.25.1 -> 3006.0-150400.8.34.2                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  systemd-rpm-macros                12-150000.7.30.1 -> 13-150000.7.33.1                                        noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  vim                               9.0.1443-150000.5.43.1 -> 9.0.1572-150000.5.46.1                            x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  vim-data                          9.0.1443-150000.5.43.1 -> 9.0.1572-150000.5.46.1                            noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  vim-data-common                   9.0.1443-150000.5.43.1 -> 9.0.1572-150000.5.46.1                            noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  yast2-network                     4.4.56-150400.3.18.1 -> 4.4.57-150400.3.21.1                                noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  yast2-pkg-bindings                4.4.5-150400.3.3.1 -> 4.4.6-150400.3.6.1                                    x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>

The following 2 NEW packages are going to be installed:
  python3-jmespath      0.9.3-150000.3.3.4  noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-looseversion  1.0.2-150100.3.3.1  noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>

From that list, the packages I suspect could be one or multiple of:

  libopenssl1_0_0                   1.0.2p-150000.3.76.1 -> 1.0.2p-150000.3.79.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libopenssl1_0_0-32bit             1.0.2p-150000.3.76.1 -> 1.0.2p-150000.3.79.1                                x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libpython3_6m1_0                  3.6.15-150300.10.45.1 -> 3.6.15-150300.10.48.1                              x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libquadmath0                      12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libsolv-tools                     0.7.24-150400.3.6.4 -> 0.7.24-150400.3.8.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libstdc++6                        12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libstdc++6-32bit                  12.2.1+git416-150000.1.7.1 -> 12.3.0+git1204-150000.1.10.1                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libyui16                          4.3.3-150400.1.5 -> 4.3.7-150400.3.3.1                                      x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libyui-ncurses16                  4.3.3-150400.1.5 -> 4.3.7-150400.3.3.1                                      x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libyui-ncurses-pkg16              4.3.3-150400.1.8 -> 4.3.7-150400.3.3.1                                      x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libzck1                           1.1.16-150400.3.2.1 -> 1.1.16-150400.3.4.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  libzypp                           17.31.11-150400.3.25.2 -> 17.31.13-150400.3.32.1                            x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3                           3.6.15-150300.10.45.1 -> 3.6.15-150300.10.48.1                              x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-base                      3.6.15-150300.10.45.1 -> 3.6.15-150300.10.48.1                              x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-dbm                       3.6.15-150300.10.45.1 -> 3.6.15-150300.10.48.1                              x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-ply                       3.10-1.27 -> 3.10-150000.3.3.4                                              noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-pyzmq                     17.1.2-3.3.1 -> 17.1.2-150000.3.5.2                                         x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-salt                      3004-150400.8.25.1 -> 3006.0-150400.8.34.2                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-simplejson                3.17.2-1.10 -> 3.17.2-150300.3.2.3                                          x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-solv                      0.7.24-150400.3.6.4 -> 0.7.24-150400.3.8.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-tk                        3.6.15-150300.10.45.1 -> 3.6.15-150300.10.48.1                              x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python-solv                       0.7.24-150400.3.6.4 -> 0.7.24-150400.3.8.1                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC 
  salt                              3004-150400.8.25.1 -> 3006.0-150400.8.34.2                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  salt-minion                       3004-150400.8.25.1 -> 3006.0-150400.8.34.2                                  x86_64  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC 
  python3-jmespath      0.9.3-150000.3.3.4  noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>
  python3-looseversion  1.0.2-150100.3.3.1  noarch  Update repository with updates from SUSE Linux Enterprise 15  SUSE LLC <https://www.suse.com/>

So, on the three failing machines, I am trying selective downgrades to fix the issue:

  • worker2: zypper install --force salt=3004-150400.8.25.1 salt-minion=3004-150400.8.25.1 salt-bash-completion=3004-150400.8.25.1 python3-salt=3004-150400.8.25.1

openqaworker-arm-2+3 are actually downgraded but still failed. That is an indication against my hypothesis, though. On arm2 I found that I am on snapshot 2698, the post-update snapshot of 2023-06-13. I am trying the pre-update snapshot from that day, snapshot 2697. The same happened on arm3, where I am going back to snapshot 2425, the pre-update snapshot of 2023-06-13.

I don't think we need cups on workers, so I removed it along with its dependencies on openqaworker-arm-3 and subsequently on all salt-controlled machines.

EDIT: My experiments caused #131447 so I shouldn't run salt jobs quite that often ;) Now I am running

systemctl start salt-master && for i in {1..7200}; do echo "### Run $i -- $(date -Is)" && salt --no-color \* test.ping ; df -i / ; salt-run jobs.list_jobs | wc -l && sleep 60; done | tee -a log_salt_test_ping_poo131249_$(date -Is).log
Actions #24

Updated by nicksinger over 1 year ago

Your continuous "test.ping" on OSD causes the salt job history to grow very quickly, exhausting the inodes on that machine. We stopped the master for now to mitigate the issue over the lunch period.
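A quick way to watch how close the root filesystem is to inode exhaustion is to parse the IUse% column of df -i (a generic sketch, not specific to OSD; on btrfs the column can show "-" since inodes are allocated dynamically):

```shell
# Read the inode-usage column (IUse%) for / from `df -i`, dropping the "%" sign
inode_pct=$(df -i / | awk 'NR==2 {gsub("%", "", $5); print $5}')
echo "inode usage on /: ${inode_pct}"
```

Watching this value alongside the job-cache size makes it obvious when frequent salt jobs are filling up the filesystem with small result files.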

Actions #25

Updated by kraih over 1 year ago

  • Related to action #131447: Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines added
Actions #26

Updated by okurz over 1 year ago

I am running a slightly adapted experiment due to #131249-24

for i in {1..7200}; do echo "### Run $i -- $(date -Is)" && salt --no-color \* test.ping ; df -i / ; salt-run jobs.list_jobs | wc -l && salt --no-color \* saltutil.kill_all_jobs && sleep 60 && rm -rf /var/cache/salt/master/jobs/*; done | tee -a log_salt_test_ping_poo131249_$(date -Is).log

So far, since downgrading all affected machines, I could not reproduce the error, though that might also be a limitation of my reproduction attempt. I will let this run overnight.

Actions #27

Updated by okurz over 1 year ago

  • Description updated (diff)

Good news everyone! w2 became unresponsive, others are still ok. I will check with zypper dup --dry-run --details. w2 has salt-3006.0-150400.8.34.2.x86_64, w5 downgraded has salt-3004-150400.8.25.1.x86_64.

changelog diff:

* Mon Jun 19 2023 pablo.suarezhernandez@suse.com
- Make master_tops compatible with Salt 3000 and older minions (bsc#1212516) (bsc#1212517)
- Added:
  * make-master_tops-compatible-with-salt-3000-and-older.patch

* Mon May 29 2023 yeray.gutierrez@suse.com
- Avoid failures due transactional_update module not available in Salt 3006.0 (bsc#1211754)
- Added:
  * define-__virtualname__-for-transactional_update-modu.patch

* Wed May 24 2023 pablo.suarezhernandez@suse.com
- Avoid conflicts with Salt dependencies versions (bsc#1211612)
- Added:
  * avoid-conflicts-with-dependencies-versions-bsc-12116.patch

* Fri May 05 2023 alexander.graul@suse.com
- Update to Salt release version 3006.0 (jsc#PED-4360)
  * See release notes: https://docs.saltproject.io/en/latest/topics/releases/3006.0.html
- Add missing patch after rebase to fix collections Mapping issues
- Add python3-looseversion as new dependency for salt
- Add python3-packaging as new dependency for salt
- Allow entrypoint compatibility for "importlib-metadata>=5.0.0" (bsc#1207071)
- Create new salt-tests subpackage containing Salt tests
- Drop conflictive patch dicarded from upstream
- Fix SLS rendering error when Jinja macros are used
- Fix version detection and avoid building and testing failures
- Prevent deadlocks in salt-ssh executions
- Require python3-jmespath runtime dependency (bsc#1209233)
- Added:
  * 3005.1-implement-zypper-removeptf-573.patch
  * control-the-collection-of-lvm-grains-via-config.patch
  * fix-version-detection-and-avoid-building-and-testing.patch
  * make-sure-the-file-client-is-destroyed-upon-used.patch
  * skip-package-names-without-colon-bsc-1208691-578.patch
  * use-rlock-to-avoid-deadlocks-in-salt-ssh.patch
- Modified:
  * activate-all-beacons-sources-config-pillar-grains.patch
  * add-custom-suse-capabilities-as-grains.patch
  * add-environment-variable-to-know-if-yum-is-invoked-f.patch
  * add-migrated-state-and-gpg-key-management-functions-.patch
  * add-publish_batch-to-clearfuncs-exposed-methods.patch
  * add-salt-ssh-support-with-venv-salt-minion-3004-493.patch
  * add-sleep-on-exception-handling-on-minion-connection.patch
  * add-standalone-configuration-file-for-enabling-packa.patch
  * add-support-for-gpgautoimport-539.patch
  * allow-vendor-change-option-with-zypper.patch
  * async-batch-implementation.patch
  * avoid-excessive-syslogging-by-watchdog-cronjob-58.patch
  * bsc-1176024-fix-file-directory-user-and-group-owners.patch
  * change-the-delimeters-to-prevent-possible-tracebacks.patch
  * debian-info_installed-compatibility-50453.patch
  * dnfnotify-pkgset-plugin-implementation-3002.2-450.patch
  * do-not-load-pip-state-if-there-is-no-3rd-party-depen.patch
  * don-t-use-shell-sbin-nologin-in-requisites.patch
  * drop-serial-from-event.unpack-in-cli.batch_async.patch
  * early-feature-support-config.patch
  * enable-passing-a-unix_socket-for-mysql-returners-bsc.patch
  * enhance-openscap-module-add-xccdf_eval-call-386.patch
  * fix-bsc-1065792.patch
  * fix-for-suse-expanded-support-detection.patch
  * fix-issue-2068-test.patch
  * fix-missing-minion-returns-in-batch-mode-360.patch
  * fix-ownership-of-salt-thin-directory-when-using-the-.patch
  * fix-regression-with-depending-client.ssh-on-psutil-b.patch
  * fix-salt-ssh-opts-poisoning-bsc-1197637-3004-501.patch
  * fix-salt.utils.stringutils.to_str-calls-to-make-it-w.patch
  * fix-the-regression-for-yumnotify-plugin-456.patch
  * fix-traceback.print_exc-calls-for-test_pip_state-432.patch
  * fixes-for-python-3.10-502.patch
  * include-aliases-in-the-fqdns-grains.patch
  * info_installed-works-without-status-attr-now.patch
  * let-salt-ssh-use-platform-python-binary-in-rhel8-191.patch
  * make-aptpkg.list_repos-compatible-on-enabled-disable.patch
  * make-setup.py-script-to-not-require-setuptools-9.1.patch
  * pass-the-context-to-pillar-ext-modules.patch
  * prevent-affection-of-ssh.opts-with-lazyloader-bsc-11.patch
  * prevent-pkg-plugins-errors-on-missing-cookie-path-bs.patch
  * prevent-shell-injection-via-pre_flight_script_args-4.patch
  * read-repo-info-without-using-interpolation-bsc-11356.patch
  * restore-default-behaviour-of-pkg-list-return.patch
  * return-the-expected-powerpc-os-arch-bsc-1117995.patch
  * revert-fixing-a-use-case-when-multiple-inotify-beaco.patch
  * run-salt-api-as-user-salt-bsc-1064520.patch
  * run-salt-master-as-dedicated-salt-user.patch
  * save-log-to-logfile-with-docker.build.patch
  * switch-firewalld-state-to-use-change_interface.patch
  * temporary-fix-extend-the-whitelist-of-allowed-comman.patch
  * update-target-fix-for-salt-ssh-to-process-targets-li.patch
  * use-adler32-algorithm-to-compute-string-checksums.patch
  * use-salt-bundle-in-dockermod.patch
  * x509-fixes-111.patch
  * zypperpkg-ignore-retcode-104-for-search-bsc-1176697-.patch
- Removed:
  * 3003.3-do-not-consider-skipped-targets-as-failed-for.patch
  * 3003.3-postgresql-json-support-in-pillar-423.patch
  * add-amazon-ec2-detection-for-virtual-grains-bsc-1195.patch
  * add-missing-ansible-module-functions-to-whitelist-in.patch
  * add-rpm_vercmp-python-library-for-version-comparison.patch
  * add-support-for-name-pkgs-and-diff_attr-parameters-t.patch
  * adds-explicit-type-cast-for-port.patch
  * align-amazon-ec2-nitro-grains-with-upstream-pr-bsc-1.patch
  * backport-syndic-auth-fixes.patch
  * batch.py-avoid-exception-when-minion-does-not-respon.patch
  * check-if-dpkgnotify-is-executable-bsc-1186674-376.patch
  * clarify-pkg.installed-pkg_verify-documentation.patch
  * detect-module.run-syntax.patch
  * do-not-crash-when-unexpected-cmd-output-at-listing-p.patch
  * enhance-logging-when-inotify-beacon-is-missing-pyino.patch
  * fix-62092-catch-zmq.error.zmqerror-to-set-hwm-for-zm.patch
  * fix-crash-when-calling-manage.not_alive-runners.patch
  * fixes-pkg.version_cmp-on-openeuler-systems-and-a-few.patch
  * fix-exception-in-yumpkg.remove-for-not-installed-pac.patch
  * fix-for-cve-2022-22967-bsc-1200566.patch
  * fix-inspector-module-export-function-bsc-1097531-481.patch
  * fix-ip6_interface-grain-to-not-leak-secondary-ipv4-a.patch
  * fix-issues-with-salt-ssh-s-extra-filerefs.patch
  * fix-jinja2-contextfuntion-base-on-version-bsc-119874.patch
  * fix-multiple-security-issues-bsc-1197417.patch
  * fix-salt-call-event.send-call-with-grains-and-pillar.patch
  * fix-salt.states.file.managed-for-follow_symlinks-tru.patch
  * fix-state.apply-in-test-mode-with-file-state-module-.patch
  * fix-test_ipc-unit-tests.patch
  * fix-the-regression-in-schedule-module-releasded-in-3.patch
  * fix-wrong-test_mod_del_repo_multiline_values-test-af.patch
  * fixes-56144-to-enable-hotadd-profile-support.patch
  * fopen-workaround-bad-buffering-for-binary-mode-563.patch
  * force-zyppnotify-to-prefer-packages.db-than-packages.patch
  * ignore-erros-on-reading-license-files-with-dpkg_lowp.patch
  * ignore-extend-declarations-from-excluded-sls-files.patch
  * ignore-non-utf8-characters-while-reading-files-with-.patch
  * implementation-of-held-unheld-functions-for-state-pk.patch
  * implementation-of-suse_ip-execution-module-bsc-10999.patch
  * improvements-on-ansiblegate-module-354.patch
  * include-stdout-in-error-message-for-zypperpkg-559.patch
  * make-pass-renderer-configurable-other-fixes-532.patch
  * make-sure-saltcacheloader-use-correct-fileclient-519.patch
  * mock-ip_addrs-in-utils-minions.py-unit-test-443.patch
  * normalize-package-names-once-with-pkg.installed-remo.patch
  * notify-beacon-for-debian-ubuntu-systems-347.patch
  * refactor-and-improvements-for-transactional-updates-.patch
  * retry-if-rpm-lock-is-temporarily-unavailable-547.patch
  * set-default-target-for-pip-from-venv_pip_target-envi.patch
  * state.apply-don-t-check-for-cached-pillar-errors.patch
  * state.orchestrate_single-does-not-pass-pillar-none-4.patch
  * support-transactional-systems-microos.patch
  * wipe-notify_socket-from-env-in-cmdmod-bsc-1193357-30.patch

zypper se --details --match-exact salt shows me that there is no intermediate version available. I am applying the mitigation on w2 by restarting the salt-minion, but on worker5 I am upgrading salt again, trying to provoke the breakage there.

On worker5 I called sudo zypper al --comment "poo#131249 - potential salt regression, unresponsive salt-minion" salt salt-minion salt-bash-completion python3-salt and am upgrading with sudo zypper dup --details.

salt --no-color -L 'worker3.oqa.suse.de,worker5.oqa.suse.de,openqaworker-arm-2.suse.de,openqaworker-arm-3.suse.de,grenache-1.qa.suse.de' cmd.run 'zypper al --comment "poo#131249 - potential salt regression, unresponsive salt-minion" salt salt-minion salt-bash-completion python3-salt && zypper -n dup --download-only && zypper -n dup'

and then for all affected machines

salt --no-color -L 'worker2.oqa.suse.de,worker3.oqa.suse.de,worker5.oqa.suse.de,openqaworker-arm-2.suse.de,openqaworker-arm-3.suse.de,grenache-1.qa.suse.de' cmd.run 'rm /etc/systemd/system/auto-update.service.d/override.conf && rm -f /etc/systemd/system/auto-update.*.disabled_poo131249 && systemctl daemon-reload && systemctl enable --now auto-update.timer && systemctl start auto-update'

With salt --no-color \* cmd.run 'zypper --no-refresh -n dup --dry-run' I am now checking the general state of updates. It seems some machines have problems that need manual fixing, which I will also try to do.

I enabled osd-deployment again and triggered a pipeline, monitoring https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/715992

I found that many updates are not installed. Checking the output of the auto-update and auto-upgrade services, I found that we run both nightly at the same time, so one always aborts because zypper is already running. Better to separate them:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/898
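A minimal sketch of such a separation (the actual change is in the linked MR; the drop-in path and times below are hypothetical): a systemd drop-in for one of the timers can move its start time so the two zypper runs no longer collide, e.g.

```ini
# Hypothetical /etc/systemd/system/auto-update.timer.d/override.conf
[Timer]
# Clear the inherited schedule, then set a staggered one
OnCalendar=
OnCalendar=*-*-* 03:30:00
RandomizedDelaySec=10m
```

After placing such a drop-in, a `systemctl daemon-reload` makes the new schedule effective.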

EDIT: Deployment succeeded: https://gitlab.suse.de/openqa/osd-deployment/-/jobs/1659740

Actions #28

Updated by okurz over 1 year ago

  • Copied to action #131540: openqa-piworker fails to upgrade many packages. vendor change is not enabled as our salt states so far only do that for openQA machines, not generic machines size:M added
Actions #29

Updated by okurz over 1 year ago

  • Related to action #107932: Handling broken RPM databases does not handle certain cases added
Actions #30

Updated by okurz over 1 year ago

  • Copied to action #131543: We have machines with both auto-update&auto-upgrade deployed, we should have only one at a time size:M added
Actions #31

Updated by okurz over 1 year ago

  • Description updated (diff)
  • Status changed from In Progress to Feedback

I reported the issue as
https://bugzilla.opensuse.org/show_bug.cgi?id=1212816
for now

And on worker2 I applied

zypper -n in --oldpackage --allow-downgrade salt=3004-150400.8.25.1 && zypper al --comment "poo#131249 - potential salt regression, unresponsive salt-minion" salt salt-minion salt-bash-completion python3-salt

After that, salt --no-color --state-output=changes \* state.apply was clean.

I will not enable alerts right now as we have quite a few firing already, so my plan is to check again explicitly in the next days.

Actions #32

Updated by mkittler over 1 year ago

After setting up sapworker3 the problem is now also reproducible on that host, see #128528#note-22. That means Leap 15.5 is also affected. The problem really looks like it is the same. It is also weird that sapworker1 and 2 are not affected while the problem could be reproduced on sapworker3 a few times in a relatively short timeframe. All of those 3 workers have the same software and the hardware also seems very similar.

As mentioned in #128528#note-22, restarting the minion helped. However, stopping the stuck instance took quite a while and it looks like it was not stopped cleanly:

martchus@sapworker3:~> sudo systemctl status salt-minion
● salt-minion.service - The Salt Minion
     Loaded: loaded (/usr/lib/systemd/system/salt-minion.service; enabled; vendor preset: disabled)
     Active: deactivating (stop-sigterm) since Thu 2023-06-29 14:45:07 CEST; 1min 28s ago
   Main PID: 3561 (salt-minion)
      Tasks: 6 (limit: 19660)
     CGroup: /system.slice/salt-minion.service
             ├─ 3561 /usr/bin/python3 /usr/bin/salt-minion
             └─ 3696 /usr/bin/python3 /usr/bin/salt-minion

Jun 29 14:45:07 sapworker3 salt-minion[3696]: The Salt Minion is shutdown. Minion received a SIGTERM. Exited.
Jun 29 14:45:07 sapworker3 salt-minion[3696]: The minion failed to return the job information for job req. This is often due to the master being shut down or overloaded. If the master is running, consider increasing the worker_threads value.
Jun 29 14:45:07 sapworker3 salt-minion[3696]: Future <salt.ext.tornado.concurrent.Future object at 0x7f36ad91ebe0> exception was never retrieved: Traceback (most recent call last):
Jun 29 14:45:07 sapworker3 salt-minion[3696]:   File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 309, in wrapper
Jun 29 14:45:07 sapworker3 salt-minion[3696]:     yielded = next(result)
Jun 29 14:45:07 sapworker3 salt-minion[3696]:   File "/usr/lib/python3.6/site-packages/salt/minion.py", line 2927, in handle_event
Jun 29 14:45:07 sapworker3 salt-minion[3696]:     self._return_pub(data, ret_cmd="_return", sync=False)
Jun 29 14:45:07 sapworker3 salt-minion[3696]:   File "/usr/lib/python3.6/site-packages/salt/minion.py", line 2267, in _return_pub
Jun 29 14:45:07 sapworker3 salt-minion[3696]:     log.trace("ret_val = %s", ret_val)  # pylint: disable=no-member
Jun 29 14:45:07 sapworker3 salt-minion[3696]: UnboundLocalError: local variable 'ret_val' referenced before assignment

Note that the salt master definitely was able to ping other minions at the time so I don't think it was generally overloaded.

I'll keep sapworker3 running for now as an additional machine to reproduce the issue. Right now all of these machines look good, though:

martchus@openqa:~> sudo salt -C 'G@nodename:sapworker1 or G@nodename:sapworker2 or G@nodename:sapworker3' -l error --state-output=changes test.ping
sapworker1.qe.nue2.suse.org:
    True
sapworker3.qe.nue2.suse.org:
    True
sapworker2.qe.nue2.suse.org:
    True

(In fact, right now all machines are pingable via salt.)

Actions #33

Updated by jbaier_cz over 1 year ago

It seems that it is not limited to the already mentioned workers:

sapworker2.qe.nue2.suse.org:
    Minion did not return. [Not connected]
openqaworker18.qa.suse.cz:
    Minion did not return. [Not connected]
worker8.oqa.suse.de:
    Minion did not return. [Not connected]
openqaworker17.qa.suse.cz:
    Minion did not return. [Not connected]
worker9.oqa.suse.de:
    Minion did not return. [Not connected]
openqaworker16.qa.suse.cz:
    Minion did not return. [Not connected]
sapworker3.qe.nue2.suse.org:
    Minion did not return. [Not connected]
worker3.oqa.suse.de:
    Minion did not return. [Not connected]

On one of the workers:

openqaworker16:~>  ps ax | grep salt
22469 ?        Ss     0:00 /usr/bin/python3 /usr/bin/salt-minion
22978 ?        Sl     0:05 /usr/bin/python3 /usr/bin/salt-minion
39136 ?        Z      0:14 [salt-minion] <defunct>
61089 pts/0    S+     0:00 grep --color=auto salt
Actions #34

Updated by okurz over 1 year ago

jbaier_cz wrote:

It seems, that it is not limited to the already mentioned workers:

sapworker2.qe.nue2.suse.org:
    Minion did not return. [Not connected]
openqaworker18.qa.suse.cz:
    Minion did not return. [Not connected]
worker8.oqa.suse.de:
    Minion did not return. [Not connected]
openqaworker17.qa.suse.cz:
    Minion did not return. [Not connected]
worker9.oqa.suse.de:
    Minion did not return. [Not connected]
openqaworker16.qa.suse.cz:
    Minion did not return. [Not connected]
sapworker3.qe.nue2.suse.org:
    Minion did not return. [Not connected]
worker3.oqa.suse.de:
    Minion did not return. [Not connected]

On one of the worker:

openqaworker16:~>  ps ax | grep salt
22469 ?        Ss     0:00 /usr/bin/python3 /usr/bin/salt-minion
22978 ?        Sl     0:05 /usr/bin/python3 /usr/bin/salt-minion
39136 ?        Z      0:14 [salt-minion] <defunct>
61089 pts/0    S+     0:00 grep --color=auto salt

OK, so the process list excerpt looks like the same problem. However, so far I would have considered only nodes with "No response" to suffer from this issue, not "Not connected", which can also happen when a host is down or deliberately disabled.
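The Z in the state column of the ps excerpt marks a zombie process. Filtering such listings for zombies can be sketched offline like this (the sample data is copied from the excerpt above):

```shell
# Filter a `ps ax`-style listing for zombie (Z state) processes
sample='22469 ?        Ss     0:00 /usr/bin/python3 /usr/bin/salt-minion
22978 ?        Sl     0:05 /usr/bin/python3 /usr/bin/salt-minion
39136 ?        Z      0:14 [salt-minion] <defunct>'
# Field 3 is the process state; keep entries whose state starts with Z
echo "$sample" | awk '$3 ~ /^Z/ {print "zombie PID: " $1}'
```

On a live system the same filter works against `ps -eo pid,tty,stat,time,args`; a zombie salt-minion child like PID 39136 above indicates the parent never reaped it.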

Actions #35

Updated by okurz over 1 year ago

Ok, I applied the workaround as well now on the affected Leap 15.4 machines:

for i in openqaworker16.qa.suse.cz openqaworker17.qa.suse.cz openqaworker18.qa.suse.cz worker3.oqa.suse.de worker8.oqa.suse.de worker9.oqa.suse.de; do echo "### $i" && ssh $i 'sudo zypper -n in --oldpackage --allow-downgrade salt=3004-150400.8.25.1 && sudo zypper al --comment "poo#131249 - potential salt regression, unresponsive salt-minion" salt salt-minion salt-bash-completion python3-salt'; done

For Leap 15.5 we need to look up the corresponding 15.4 package in the repos manually and force-install it. Found on http://download.opensuse.org/update/leap/15.4/sle/x86_64/?P=salt*
-> http://download.opensuse.org/update/leap/15.4/sle/x86_64/salt-3004-150400.8.25.1.x86_64.rpm

so on sapworker2 and sapworker3 I did:

sudo zypper -n in --oldpackage --allow-downgrade http://download.opensuse.org/update/leap/15.4/sle/x86_64/salt-3004-150400.8.25.1.x86_64.rpm http://download.opensuse.org/update/leap/15.4/sle/x86_64/salt-minion-3004-150400.8.25.1.x86_64.rpm http://download.opensuse.org/update/leap/15.4/sle/x86_64/python3-salt-3004-150400.8.25.1.x86_64.rpm

On worker3 the new salt package was installed despite the lock; likely I made a mistake there. I removed the lock, applied the downgrade again and re-applied the locks.

Retriggered jobs in https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/719953

I found that https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=now-1h&to=now shows multiple related alerts about "snapper-cleanup" failing on machines where we conducted a rollback. In #102942 we had snapper-cleanup failing as well, but that was due to docker subvolumes blocking the delete. /usr/lib/snapper/systemd-helper --cleanup says that it fails to delete a snapshot but does not state which one. https://www.opensuse-forum.de/thread/64330-snapper-cleanup-nach-rollback-nicht-mehr-m%C3%B6glich/ had an open question which I answered now, but I don't expect to receive any help there. How could we find out which snapshot the systemd-helper tries to delete? It turned out it is actually the very same problem as in #102942; I don't know why that problem did not show itself in /var/log/snapper.log when I looked earlier. On worker2 I now manually deleted the btrfs subvolumes that blocked the deletion with btrfs subvolume list -a / | grep containers and btrfs subvolume delete /.snapshots/1/…containers…. Maybe we need a script for automatic recovery.

Actions #36

Updated by okurz over 1 year ago

  • Related to action #102942: Failed systemd services alert: snapper-cleanup on QA-Power8-4-kvm fails size:M added
Actions #37

Updated by okurz over 1 year ago

  • Due date deleted (2023-07-07)
  • Status changed from Feedback to Blocked
  • Priority changed from Urgent to Normal

I am using sudo btrfs subvolume delete $(sudo btrfs subvolume list / | sed -n 's/^.*path @\(.*containers.*\)/\1/p') on all machines

sudo salt --no-color \* cmd.run "sudo btrfs subvolume delete \$(sudo btrfs subvolume list / | sed -n 's/^.*path @\(.*containers.*\)/\1/p') && sudo systemctl is-failed snapper-cleanup | grep -q failed && sudo systemctl restart snapper-cleanup"

Surely we could do that more safely :)
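One hedged sketch of a safer variant (helper names are made up for illustration): delete the matching container subvolumes one by one so a single failure doesn't abort the rest, support a dry run, and only restart snapper-cleanup when it is actually in failed state:

```shell
#!/bin/sh
# Sketch only: iterate over the matching subvolumes instead of expanding the
# whole list into a single "btrfs subvolume delete" call. DRY_RUN=1 just prints.

extract_container_subvols() {
    # stdin: output of "btrfs subvolume list /"; prints paths after "path @"
    sed -n 's/^.*path @\(.*containers.*\)/\1/p'
}

cleanup_container_subvols() {
    btrfs subvolume list / | extract_container_subvols | while read -r subvol; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "would delete: $subvol"
        else
            btrfs subvolume delete "$subvol" || echo "failed: $subvol" >&2
        fi
    done
    # restart only if the unit is actually failed
    if systemctl is-failed --quiet snapper-cleanup; then
        systemctl restart snapper-cleanup
    fi
}
```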

But with this, https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=now-1h&to=now no longer shows any failed snapper-cleanup.

We have workarounds in place. We provided more information in both the snapper cleanup upstream report as well as in a salt regression bug. Waiting for anything to happen there.

Blocking on https://bugzilla.opensuse.org/show_bug.cgi?id=1212816

Actions #38

Updated by nicksinger over 1 year ago

Applied your workaround/lock on openqaworker14.qa.suse.cz as well.

Actions #39

Updated by okurz over 1 year ago

  • Subject changed from [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt minion does not return size:M to [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M
Actions #40

Updated by okurz over 1 year ago

  • Related to action #132137: Setup new PRG2 openQA worker for osd size:M added
Actions #41

Updated by okurz over 1 year ago

  • Related to action #134906: osd-deployment failed due to openqaworker1 showing "No response" in salt size:M added
Actions #42

Updated by okurz about 1 year ago

  • Target version changed from Ready to Tools - Next
Actions #43

Updated by mkittler about 1 year ago

The workers worker-arm1 and worker-arm2 were stuck again:

martchus@openqa:~> sudo salt -C 'G@roles:worker' test.ping
…
worker-arm1.oqa.prg2.suse.org:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20230915145848949891
worker-arm2.oqa.prg2.suse.org:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20230915145848949891

They responded again after systemctl kill salt-minion followed by systemctl restart salt-minion (a plain restart didn't work; the processes were genuinely stuck, according to strace in some futex lock).
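A hedged sketch for applying that kill+restart recovery only to hosts that actually show "No response" (the parsing is an assumption based on the salt output wording above; the ssh step is illustrative):

```shell
#!/bin/sh
# Sketch: extract the unresponsive hosts from "salt --no-color ... test.ping"
# output, where a host line at column 0 ends with ":" and the failure text
# "Minion did not return" follows indented underneath it.

unresponsive_minions() {
    awk '/^[^ ].*:$/            { host = substr($0, 1, length($0) - 1) }
         /Minion did not return/ { print host }'
}

# Illustrative use (assumes root ssh access to the workers):
# sudo salt --no-color -C 'G@roles:worker' test.ping | unresponsive_minions \
#   | while read -r h; do
#       ssh "root@$h" 'systemctl kill -s SIGKILL salt-minion && systemctl restart salt-minion'
#     done
```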

Actions #44

Updated by okurz about 1 year ago

Maybe salt-minion-3005 is also affected and we should really go back to 3004.

Actions #45

Updated by okurz about 1 year ago

  • Related to action #136325: salt deploy fails due to multiple offline workers in qe.nue2.suse.org+prg2.suse.org added
Actions #46

Updated by okurz about 1 year ago

  • Status changed from Blocked to In Progress

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1865994#L185

sapworker2.qe.nue2.suse.org:
----------
          ID: lock_salt-bash-completion_pkg
    Function: cmd.run
        Name: zypper rl salt-bash-completion; (zypper -n in --oldpackage --allow-downgrade 'salt-bash-completion<=3005' || zypper -n in --oldpackage --allow-downgrade 'salt-bash-completion<=3005.1') && zypper al -m 'poo#131249 - potential salt regression, unresponsive salt-minion' salt-bash-completion
      Result: False
     Comment: Command "zypper rl salt-bash-completion; (zypper -n in --oldpackage --allow-downgrade 'salt-bash-completion<=3005' || zypper -n in --oldpackage --allow-downgrade 'salt-bash-completion<=3005.1') && zypper al -m 'poo#131249 - potential salt regression, unresponsive salt-minion' salt-bash-completion" run
     Started: 10:12:55.246902
    Duration: 3047.714 ms
     Changes:   
              ----------
              pid:
                  80582
              retcode:
                  4
              stderr:
                  No provider of 'salt-bash-completion<=3005' found.
              stdout:
                  No lock has been removed.
                  Loading repository data...
                  Reading installed packages...
                  'salt-bash-completion<=3005' not found in package names. Trying capabilities.
                  Loading repository data...
                  Reading installed packages...
                  Resolving package dependencies...

                  Problem: the to be installed salt-bash-completion-3005.1-150500.2.13.noarch requires 'salt = 3005.1-150500.2.13', but this requirement cannot be provided
                    not installable providers: salt-3005.1-150500.2.13.x86_64[distribution/leap/$releasever/repo/oss]
                   Solution 1: Following actions will be done:
                    remove lock to allow installation of salt-3005.1-150500.2.13.x86_64[distribution/leap/$releasever/repo/oss]
                    remove lock to allow installation of python3-salt-3005.1-150500.2.13.x86_64[distribution/leap/$releasever/repo/oss]
                    remove lock to allow removal of salt-3004-150400.8.25.1.x86_64
                    remove lock to allow removal of python3-salt-3004-150400.8.25.1.x86_64
                    remove lock to allow removal of salt-minion-3004-150400.8.25.1.x86_64
                   Solution 2: do not install salt-bash-completion-3005.1-150500.2.13.noarch
                   Solution 3: break salt-bash-completion-3005.1-150500.2.13.noarch by ignoring some of its dependencies

                  Choose from above solutions by number or cancel [1/2/3/c/d/?] (c): c
Summary for sapworker2.qe.nue2.suse.org
--------------
Succeeded: 453 (changed=1)
Failed:      1
Actions #47

Updated by okurz about 1 year ago

  • Status changed from In Progress to Blocked

I think I was able to solve that problem with a manual application of zypper al -m 'poo#131249 - potential salt regression, unresponsive salt-minion' salt-bash-completion after ensuring that salt-bash-completion is actually not installed at all. Retriggered the failed salt-pillars-openqa deploy job https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1870710

Actions #48

Updated by okurz about 1 year ago

  • Status changed from Blocked to In Progress
Actions #49

Updated by okurz about 1 year ago

  • Status changed from In Progress to Blocked

Fixed in the same way; back to blocked on https://bugzilla.opensuse.org/show_bug.cgi?id=1212816

Actions #50

Updated by okurz about 1 year ago

  • Status changed from Blocked to In Progress
  • Target version changed from Tools - Next to Ready

https://gitlab.suse.de/openqa/osd-deployment/-/jobs/1876545#L2899

That is sapworker1 showing problems with salt-3005; sapworker2+3 are fine with salt-3004.

Added a comment on https://bugzilla.opensuse.org/show_bug.cgi?id=1212816

We have observed that multiple machines running Leap 15.5 with salt-3005 eventually show the same "No response" problem. A forced install of the Leap 15.4 salt-3004 package on Leap 15.5 seems to work fine.

So following https://progress.opensuse.org/projects/openqav3/wiki/#Network-legacy-boot-via-PXE-and-OSworker-setup

I did

zypper -n rm salt-bash-completion
arch=$(uname -m)
sudo zypper -n in --oldpackage --allow-downgrade http://download.opensuse.org/update/leap/15.4/sle/$arch/salt-3004-150400.8.25.1.$arch.rpm http://download.opensuse.org/update/leap/15.4/sle/$arch/salt-minion-3004-150400.8.25.1.$arch.rpm http://download.opensuse.org/update/leap/15.4/sle/$arch/python3-salt-3004-150400.8.25.1.$arch.rpm && sudo zypper al --comment "poo#131249 - potential salt regression, unresponsive salt-minion" salt salt-minion salt-bash-completion python3-salt

sudo salt --no-color 'sapworker*' cmd.run 'rpm -qa | grep -i salt' looks better now:

sapworker1.qe.nue2.suse.org:
    salt-3004-150400.8.25.1.x86_64
    python3-salt-3004-150400.8.25.1.x86_64
    salt-minion-3004-150400.8.25.1.x86_64
sapworker2.qe.nue2.suse.org:
    salt-3004-150400.8.25.1.x86_64
    salt-minion-3004-150400.8.25.1.x86_64
    python3-salt-3004-150400.8.25.1.x86_64
sapworker3.qe.nue2.suse.org:
    salt-3004-150400.8.25.1.x86_64
    salt-minion-3004-150400.8.25.1.x86_64
    python3-salt-3004-150400.8.25.1.x86_64

retriggered https://gitlab.suse.de/openqa/osd-deployment/-/jobs/1876780

Actions #51

Updated by okurz about 1 year ago

  • Status changed from In Progress to Blocked
Actions #52

Updated by okurz about 1 year ago

  • Status changed from Blocked to Feedback

https://bugzilla.suse.com/show_bug.cgi?id=1212816#c6 suggests trying 3006.0-150400.8.44.1:

sudo salt --no-color '*' cmd.run 'zypper --no-refresh se --details salt-minion | grep -q 8.44 && zypper rl salt salt-minion salt-bash-completion python3-salt && zypper -n in salt salt-minion python3-salt'
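The version guard in that one-liner can be factored out; a minimal hedged sketch (the "8.44" match mirrors the command above and is an assumption about how the fixed build is identified):

```shell
#!/bin/sh
# Sketch: only unlock and upgrade when the repository already carries the
# fixed 3006.0-150400.8.44.1 build mentioned in bsc#1212816 comment 6.

has_fixed_build() {
    # stdin: output of "zypper --no-refresh se --details salt-minion"
    grep -q '8\.44'
}

# Illustrative use:
# if zypper --no-refresh se --details salt-minion | has_fixed_build; then
#     zypper rl salt salt-minion salt-bash-completion python3-salt
#     zypper -n in salt salt-minion python3-salt
# fi
```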

Current versions installed on all machines, via sudo salt --no-color --out txt '*' cmd.run 'rpm -q salt-minion' queue=True | sort:

backup-qam.qe.nue2.suse.org: salt-minion-3004-150400.8.25.1.x86_64
backup.qa.suse.de: salt-minion-3006.0-150400.8.44.1.x86_64
baremetal-support.qa.suse.de: salt-minion-3006.0-150400.8.44.1.x86_64
diesel.qe.nue2.suse.org: salt-minion-3006.0-150400.8.44.1.ppc64le
imagetester.qe.nue2.suse.org: salt-minion-3005.1-150500.2.13.x86_64
jenkins.qa.suse.de: salt-minion-3006.0-150400.8.44.1.x86_64
openqa-monitor.qa.suse.de: salt-minion-3006.0-150400.8.44.1.x86_64
openqa-piworker.qa.suse.de: salt-minion-3005.1-150500.2.13.aarch64
openqa.suse.de: salt-minion-3006.0-150400.8.44.1.x86_64
openqaw5-xen.qa.suse.de: salt-minion-3006.0-150400.8.44.1.x86_64
openqaworker14.qa.suse.cz: salt-minion-3006.0-150400.8.44.1.x86_64
openqaworker16.qa.suse.cz: salt-minion-3006.0-150400.8.44.1.x86_64
openqaworker17.qa.suse.cz: salt-minion-3006.0-150400.8.44.1.x86_64
openqaworker18.qa.suse.cz: salt-minion-3006.0-150400.8.44.1.x86_64
openqaworker1.qe.nue2.suse.org: salt-minion-3006.0-150400.8.44.1.x86_64
openqaworker-arm-2.suse.de: salt-minion-3006.0-150400.8.44.1.aarch64
openqaworker-arm-3.suse.de: salt-minion-3006.0-150400.8.44.1.aarch64
petrol.qe.nue2.suse.org: salt-minion-3006.0-150400.8.44.1.ppc64le
powerqaworker-qam-1.qa.suse.de: salt-minion-3006.0-150400.8.44.1.ppc64le
qamasternue.qa.suse.de: salt-minion-3006.0-150400.8.44.1.x86_64
qesapworker-prg4.qa.suse.cz: salt-minion-3004-150400.8.25.1.x86_64
qesapworker-prg5.qa.suse.cz: salt-minion-3004-150400.8.25.1.x86_64
qesapworker-prg6.qa.suse.cz: salt-minion-3004-150400.8.25.1.x86_64
qesapworker-prg7.qa.suse.cz: salt-minion-3004-150400.8.25.1.x86_64
sapworker1.qe.nue2.suse.org: salt-minion-3004-150400.8.25.1.x86_64
sapworker2.qe.nue2.suse.org: salt-minion-3004-150400.8.25.1.x86_64
sapworker3.qe.nue2.suse.org: salt-minion-3004-150400.8.25.1.x86_64
schort-server.qa.suse.de: salt-minion-3006.0-150400.8.44.1.x86_64
storage.oqa.suse.de: salt-minion-3006.0-150400.8.44.1.x86_64
tumblesle.qa.suse.de: salt-minion-3006.0-150400.8.44.1.x86_64
worker29.oqa.prg2.suse.org: salt-minion-3004-150400.8.25.1.x86_64
worker2.oqa.suse.de: salt-minion-3006.0-150400.8.44.1.x86_64
worker30.oqa.prg2.suse.org: salt-minion-3004-150400.8.25.1.x86_64
worker31.oqa.prg2.suse.org: salt-minion-3004-150400.8.25.1.x86_64
worker32.oqa.prg2.suse.org: salt-minion-3004-150400.8.25.1.x86_64
worker33.oqa.prg2.suse.org: salt-minion-3004-150400.8.25.1.x86_64
worker34.oqa.prg2.suse.org: salt-minion-3004-150400.8.25.1.x86_64
worker35.oqa.prg2.suse.org: salt-minion-3004-150400.8.25.1.x86_64
worker36.oqa.prg2.suse.org: salt-minion-3004-150400.8.25.1.x86_64
worker37.oqa.prg2.suse.org: salt-minion-3004-150400.8.25.1.x86_64
worker38.oqa.prg2.suse.org: salt-minion-3004-150400.8.25.1.x86_64
worker39.oqa.prg2.suse.org: salt-minion-3004-150400.8.25.1.x86_64
worker40.oqa.prg2.suse.org: salt-minion-3004-150400.8.25.1.x86_64
worker-arm1.oqa.prg2.suse.org: salt-minion-3004-150400.8.25.1.aarch64
worker-arm2.oqa.prg2.suse.org: salt-minion-3004-150400.8.25.1.aarch64
Actions #53

Updated by okurz about 1 year ago

Downgraded imagetester as it had 3005 and was showing "No response". sudo salt --no-color \* test.ping is good again.

Actions #54

Updated by okurz about 1 year ago

I have been running salt-minion with a fixed version, as mentioned in https://bugzilla.opensuse.org/show_bug.cgi?id=1212816, on multiple hosts for more than a week now, and no further problems were observed, so we can remove this workaround again:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1015

I will handle the removal of locks and upgrades manually.

Actions #55

Updated by okurz about 1 year ago

MR merged

sudo salt --state-output=changes -C \* cmd.run 'zypper rl salt salt-minion salt-bash-completion && zypper rl -t patch openSUSE-SLE-15.4-2023-2571 openSUSE-SLE-15.4-2023-3145 openSUSE-SLE-15.4-2023-3863 && zypper -n in salt-minion' | grep -av 'Result: Clean'                

From today:

openqa:~ # sudo salt --state-output=changes -C \* cmd.run 'rpm -q salt-minion' | grep -av 'Result: Clean'
s390zl13.oqa.prg2.suse.org:
    salt-minion-3006.0-150500.4.19.1.s390x
s390zl12.oqa.prg2.suse.org:
    salt-minion-3006.0-150500.4.19.1.s390x
worker36.oqa.prg2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
worker35.oqa.prg2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
worker33.oqa.prg2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
worker39.oqa.prg2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
worker34.oqa.prg2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
worker38.oqa.prg2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
worker32.oqa.prg2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
worker40.oqa.prg2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
worker31.oqa.prg2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
worker37.oqa.prg2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
backup-qam.qe.nue2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
worker29.oqa.prg2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
worker30.oqa.prg2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
sapworker3.qe.nue2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
worker-arm1.oqa.prg2.suse.org:
    salt-minion-3006.0-150500.4.19.1.aarch64
worker-arm2.oqa.prg2.suse.org:
    salt-minion-3006.0-150500.4.19.1.aarch64
sapworker1.qe.nue2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
sapworker2.qe.nue2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
openqaworker16.qa.suse.cz:
    salt-minion-3006.0-150400.8.44.1.x86_64
openqaworker17.qa.suse.cz:
    salt-minion-3006.0-150400.8.44.1.x86_64
openqaworker18.qa.suse.cz:
    salt-minion-3006.0-150400.8.44.1.x86_64
openqaworker1.qe.nue2.suse.org:
    salt-minion-3006.0-150400.8.44.1.x86_64
qesapworker-prg7.qa.suse.cz:
    salt-minion-3006.0-150500.4.19.1.x86_64
qesapworker-prg5.qa.suse.cz:
    salt-minion-3006.0-150500.4.19.1.x86_64
qesapworker-prg4.qa.suse.cz:
    salt-minion-3006.0-150500.4.19.1.x86_64
qesapworker-prg6.qa.suse.cz:
    salt-minion-3006.0-150500.4.19.1.x86_64
openqa.suse.de:
    salt-minion-3006.0-150400.8.44.1.x86_64
qamaster.qe.nue2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
openqaw5-xen.qa.suse.de:
    salt-minion-3006.0-150500.4.19.1.x86_64
openqaworker14.qa.suse.cz:
    salt-minion-3006.0-150400.8.44.1.x86_64
petrol.qe.nue2.suse.org:
    salt-minion-3006.0-150400.8.44.1.ppc64le
imagetester.qe.nue2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
monitor.qe.nue2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
jenkins.qe.nue2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
backup-vm.qe.nue2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
baremetal-support.qe.nue2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
diesel.qe.nue2.suse.org:
    salt-minion-3006.0-150400.8.44.1.ppc64le
openqa-piworker.qe.nue2.suse.org:
    salt-minion-3006.0-150500.4.19.1.aarch64
tumblesle.qe.nue2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64
schort-server.qe.nue2.suse.org:
    salt-minion-3006.0-150500.4.19.1.x86_64

All systems seem to have an up-to-date salt-minion and are responsive. No related alerts. Checking rollback steps and ACs.
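To audit such a listing mechanically rather than by eye, a small hedged awk sketch (assumes the two-line host/version layout shown above and that a 3006 build is the wanted version):

```shell
#!/bin/sh
# Sketch: print hosts whose salt-minion is not on a 3006 build, given the
# "host:" / "    salt-minion-<version>.<arch>" output of the salt cmd.run above.

outdated_minions() {
    awk '/^[^ ].*:$/                                      { host = $0 }
         /^ +salt-minion-/ && $1 !~ /^salt-minion-3006\./ { print host, $1 }'
}

# Illustrative use:
# sudo salt --state-output=changes -C \* cmd.run 'rpm -q salt-minion' | outdated_minions
```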

Actions #56

Updated by okurz about 1 year ago

  • Description updated (diff)
  • Status changed from Feedback to Resolved

All rollback steps and ACs fulfilled as well, done here

Actions #58

Updated by okurz about 1 year ago

  • Related to action #150965: At least diesel+petrol+mania fail to auto-update due to kernel locks preventing patches size:M added