action #131249
Updated by okurz 11 months ago
## Observation
From https://gitlab.suse.de/openqa/osd-deployment/-/jobs/1648467 and reproduced locally with `sudo salt --no-color -C 'G@roles:worker' test.ping`:
```
grenache-1.qa.suse.de:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20230622084232610255
worker5.oqa.suse.de:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20230622084232610255
worker2.oqa.suse.de:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20230622084232610255
```
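To get just the affected hostnames out of such output, a small filter can help. This is a minimal sketch, not part of the existing tooling: the function name `unresponsive_minions` is hypothetical, and it assumes the plain-text output format shown above (a minion ID followed by a colon on its own line, then a "Minion did not return" line).

```shell
# Hypothetical helper: print the IDs of minions that did not respond,
# one per line. Reads plain-text `salt ... test.ping` output on stdin.
unresponsive_minions() {
    awk '
        # A minion header is a single token ending in ":" (no spaces),
        # so the multi-word "The minions may not ..." hint line is skipped.
        /^[^[:space:]]+:$/ { name = substr($0, 1, length($0) - 1); next }
        # Report the most recent header when its result is "no response".
        /Minion did not return/ && name != "" { print name; name = "" }
    '
}
```

Possible usage: `sudo salt --no-color -C 'G@roles:worker' test.ping 2>&1 | unresponsive_minions`.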
## Acceptance criteria
* **AC1:** `salt \* test.ping` and `salt \* state.apply` succeed consistently for more than one day
* **AC2:** our salt states and pillar and osd deployment pipelines are green and stable again
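One way to check AC1 over a longer period is to repeat the command at a fixed interval and stop at the first failure. The following is only a sketch under assumptions: the helper name `check_repeatedly` is hypothetical, and a run is treated as failed if the command exits non-zero or its output contains the "Minion did not return" text seen in the observation.

```shell
# Hypothetical helper: check_repeatedly COUNT INTERVAL CMD [ARGS...]
# Runs CMD up to COUNT times, sleeping INTERVAL seconds between runs.
# Returns 1 on the first run that exits non-zero or reports a minion
# that did not return; returns 0 if all runs pass.
check_repeatedly() {
    count=$1
    interval=$2
    shift 2
    i=1
    while [ "$i" -le "$count" ]; do
        if ! out=$("$@" 2>&1) ||
            printf '%s\n' "$out" | grep -q 'Minion did not return'; then
            echo "run $i failed at $(date -u +%Y-%m-%dT%H:%M:%SZ)" >&2
            return 1
        fi
        i=$((i + 1))
        # Do not sleep after the final run.
        [ "$i" -le "$count" ] && sleep "$interval"
    done
    return 0
}
```

For example, `check_repeatedly 25 3600 sudo salt --no-color \* test.ping` would cover slightly more than one day with hourly runs.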
## Suggestions
* It seems we have had this problem for a while, but never this severely. Now those machines can end up with "no response" again even after we trigger a reboot and restart salt-minion. Maybe we can revert some recent package updates? /var/log/zypp/history shows:
```
2023-06-22 03:01:12|install|python3-pyzmq|17.1.2-150000.3.5.2|x86_64||repo-sle-update|e2d9d07654cffc31e5199f40aa1ba9fee1e114c4ca5abd78f7fdc78b2e6cc21a|
```
* Debug the actual problem of the hanging salt-minion. Maybe we can try to trigger the problem more reliably instead of preventing it?
* Research upstream, apply workarounds, and potentially try upgrading to Leap 15.5 in case that fixes something
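To narrow down which package updates might be worth reverting, the zypp history can be filtered for recent installs. This is a rough sketch: the function name `installs_since` is hypothetical, and it assumes the pipe-separated layout of the history line quoted above (timestamp, action, package name, version, ...).

```shell
# Hypothetical helper: installs_since YYYY-MM-DD
# Reads zypp history on stdin and prints timestamp, package name and
# version for every "install" entry on or after the given date.
installs_since() {
    awk -F'|' -v since="$1" '
        # ISO dates compare correctly as strings; only the date part
        # (first 10 characters) of the timestamp is compared.
        $2 == "install" && substr($1, 1, 10) >= since { print $1, $3, $4 }
    '
}
```

Possible usage: `installs_since 2023-06-20 < /var/log/zypp/history`, which for the entry above would list `python3-pyzmq` with version `17.1.2-150000.3.5.2`.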
## Rollback steps
* on worker2, worker3, worker5, grenache-1, openqaworker-arm-2 and openqaworker-arm-3: run `sudo mv /etc/systemd/system/auto-update.$i{.disabled_poo131249,} && sudo systemctl enable --now auto-update.timer && sudo systemctl start auto-update`, remove the manual override /etc/systemd/system/auto-update.service.d/override.conf, wait for the upgrade to complete, then reboot
* re-enable osd-deployment https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit
* remove silence https://stats.openqa-monitor.qa.suse.de/alerting/silences "alertname=Failed systemd services alert (except openqa.suse.de)"