action #131249
## Observation
From https://gitlab.suse.de/openqa/osd-deployment/-/jobs/1648467 and reproduced locally with `sudo salt --no-color -C 'G@roles:worker' test.ping`:
```
grenache-1.qa.suse.de:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20230622084232610255
worker5.oqa.suse.de:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20230622084232610255
worker2.oqa.suse.de:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20230622084232610255
```
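To narrow down which minions are actually unreachable at a given moment rather than just slow, the master's `manage` runner can be queried directly. A minimal sketch, assuming it is run as root on the OSD salt master:
```
# List which minions currently answer a ping from the master and which do not
salt-run manage.status
# Only the unresponsive minions, as a plain list
salt-run manage.down
# Retrieve the return data of the job id reported above once minions catch up
salt-run jobs.lookup_jid 20230622084232610255
```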
## Acceptance criteria
* **AC1:** `salt \* test.ping` and `salt \* state.apply` succeed consistently for more than one day (see the periodic check sketch below)
* **AC2:** our salt states and pillars as well as the osd-deployment pipelines are green and stable again
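
A minimal sketch of how AC1 could be checked over time, assuming it runs on the salt master and that a non-zero exit code of `salt` reliably signals missing minion returns (worth verifying for the installed Salt version):
```
# Ping all minions once per hour and log failures; AC1 holds if this log
# stays empty for more than a day
while true; do
    if ! salt --timeout=60 '*' test.ping > /dev/null; then
        echo "$(date -Is) salt test.ping failed" >> /var/log/salt-ping-check.log
    fi
    sleep 3600
done
```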
## Suggestions
* It seems we might have had this problem for a while but never really this severely. Now it looks like those machines can end up with "no response" again even if we trigger a reboot and restart salt-minion. Maybe we can revert some recent package updates? From /var/log/zypp/history there is
```
2023-06-22 03:01:12|install|python3-pyzmq|17.1.2-150000.3.5.2|x86_64||repo-sle-update|e2d9d07654cffc31e5199f40aa1ba9fee1e114c4ca5abd78f7fdc78b2e6cc21a|
```
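A minimal sketch of how that update could be inspected and rolled back on an affected worker; the exact previous version is a placeholder that would need to be looked up first:
```
# Which python3-pyzmq versions do the repositories still offer?
zypper se -s python3-pyzmq
# Correlate recently installed packages with when the hangs started
rpm -qa --last | head -n 20
# Downgrade to an earlier version (placeholder version string) and restart the minion
sudo zypper in --oldpackage python3-pyzmq=<previous-version>
sudo systemctl restart salt-minion
```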
* Debug the actual problem of the hanging salt-minion. Maybe we can try to reliably trigger the problem rather than only prevent it?
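
A minimal debugging sketch for a worker whose minion shows "no response"; `py-spy` is not part of the standard installation and is only an assumption about available tooling:
```
# On the affected worker: service state and recent minion log
systemctl status salt-minion
journalctl -u salt-minion --since "2 hours ago" | tail -n 100
# Ask the minion locally, bypassing the master, with debug logging
sudo salt-call --local -l debug test.ping
# If the process itself hangs, capture the Python stacks of all its threads
# (py-spy would need to be installed first)
sudo py-spy dump --pid "$(pidof -s salt-minion)"
```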
* Research upstream, apply workarounds, potentially try upgrading to Leap 15.5 if that might fix something
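
One candidate workaround to evaluate is letting the minion ping its master periodically so that a silently broken connection is detected and re-established; `ping_interval` is a standard minion option, but whether it addresses this particular hang is an open question and the interval below is only an assumption:
```
# On each worker: the minion pings the master every 2 minutes and reconnects
# if the connection turned out to be dead
cat <<'EOF' | sudo tee /etc/salt/minion.d/ping_interval.conf
ping_interval: 2
EOF
sudo systemctl restart salt-minion
```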