action #131249
Updated by okurz about 1 year ago
## Observation

From https://gitlab.suse.de/openqa/osd-deployment/-/jobs/1648467 and reproduced locally with `sudo salt --no-color -C 'G@roles:worker' test.ping`:

```
grenache-1.qa.suse.de:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20230622084232610255
worker5.oqa.suse.de:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20230622084232610255
worker2.oqa.suse.de:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20230622084232610255
```

## Acceptance criteria

* **AC1:** `salt \* test.ping` and `salt \* state.apply` succeed consistently for more than one day
* **AC2:** our salt states and pillar and osd deployment pipelines are green and stable again

## Suggestions

* *DONE* It seems we might have had this problem for a while, but never this severely. Now it looks like those machines can end up with "no response" again even if we trigger a reboot and restart salt-minion. Maybe we can revert some recent package updates? From /var/log/zypp/history there is
  ```
  2023-06-22 03:01:12|install|python3-pyzmq|17.1.2-150000.3.5.2|x86_64||repo-sle-update|e2d9d07654cffc31e5199f40aa1ba9fee1e114c4ca5abd78f7fdc78b2e6cc21a|
  ```
* *DONE* Debug the actual problem of hanging salt-minion. Maybe we can actually try to better trigger the problem, not prevent it?
* *DONE* Research upstream, apply workarounds, potentially try an upgrade to Leap 15.5 if that might fix something

## Rollback steps

* *DONE* on worker2,worker3,worker5,grenache-1,openqaworker-arm-2,openqaworker-arm-3 `sudo mv /etc/systemd/system/auto-update.$i{.disabled_poo131249,} && sudo systemctl enable --now auto-update.timer && sudo systemctl start auto-update`, remove manual override /etc/systemd/system/auto-update.service.d/override.conf, wait for upgrade to complete and reboot
* *DONE* re-enable osd-deployment https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit
* *DONE* remove silence https://stats.openqa-monitor.qa.suse.de/alerting/silences "alertname=Failed systemd services alert (except openqa.suse.de)"
* *DONE* remove package locks for anything related to salt
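
For checking AC1 over a longer period, it might help to reduce the `test.ping` output from the observation to just the minions that did not respond. The following is only a minimal sketch: the function name `unresponsive_minions` is hypothetical, and it assumes the output layout seen in this ticket, i.e. an unindented `<minion-id>:` line followed by a line containing "Minion did not return".

```shell
#!/bin/sh
# Hypothetical helper (not part of salt): read `salt ... test.ping` output
# on stdin and print the ID of every minion that did not return.
unresponsive_minions() {
    awk '
        # Assumption: a minion ID line has no whitespace and ends with ":".
        /^[^[:space:]]+:$/ { id = substr($0, 1, length($0) - 1); next }
        # Report the most recently seen ID when salt flags it as not returning.
        /Minion did not return/ && id != "" { print id; id = "" }
    '
}

# Possible usage, with the same command as in the observation:
#   sudo salt --no-color -C 'G@roles:worker' test.ping | unresponsive_minions
```

An empty result would then mean that all targeted minions answered. Running this repeatedly, e.g. from a loop or timer, could show whether the problem stays away for more than one day.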