action #164874

osd-deployment failed due to openqaworker-arm-1 salt-minion yielding "no response" but the machine is up and reachable over ssh

Added by okurz 5 months ago. Updated 4 months ago.

Status: New
Priority: Low
Assignee: -
Category: Regressions/Crashes
Target version:
Start date: 2024-08-02
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2902107#L9202 shows

openqaworker-arm-1.qe.nue2.suse.org:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20240802080801900813

This prevents the osd-deployment from continuing. The failure was reproducible until I ran systemctl restart salt-minion on openqaworker-arm-1, which was otherwise fully reachable and operational.
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2902185 shows the successful retry.

journalctl -u salt-minion shows

Aug 01 07:17:46 openqaworker-arm-1 [RPM][26364]: Transaction ID 66ab1a75 finished: 0
Aug 01 08:13:26 openqaworker-arm-1 sudo[33685]:     root : PWD=/root ; USER=root ; COMMAND=/usr/bin/timeout -v --kill-after=1m 5m telegraf --test --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d/
Aug 01 18:19:24 openqaworker-arm-1 sudo[29545]:     root : PWD=/root ; USER=root ; COMMAND=/usr/bin/timeout -v --kill-after=1m 5m telegraf --test --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d/
Aug 02 10:13:17 openqaworker-arm-1 systemd[1]: Stopping The Salt Minion...

So there is no log entry between 2024-08-01 18:19:24 and the point where I ran systemctl restart salt-minion (2024-08-02 10:13:17). If we can reproduce this issue, we should try attaching strace to the salt-minion process.
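If the hang shows up again, attaching strace could look roughly like the sketch below. It assumes root access on the affected worker, that strace is installed, and that the minion is still hung when the commands run. The helper names valid_pid and trace_salt_minion, the 60 second duration, and the output path are hypothetical choices for illustration, not something already in use here.

```shell
#!/bin/sh
# Sketch only: assumes root on the affected worker, strace installed,
# and that the salt-minion unit is still hung when this runs.

# Hypothetical helper: true only for a positive integer PID.
# systemd reports MainPID=0 when the unit has no running main process.
valid_pid() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -gt 0 ]
}

# Hypothetical helper: trace the minion for a bounded time.
# The output path and the 60 s default are arbitrary choices.
trace_salt_minion() {
    duration="${1:-60}"
    out="${2:-/tmp/salt-minion.strace}"
    pid=$(systemctl show --property=MainPID --value salt-minion)
    if ! valid_pid "$pid"; then
        echo "salt-minion has no main PID, nothing to trace" >&2
        return 1
    fi
    # -f follows threads and children, -tt adds wall-clock timestamps.
    timeout "$duration" strace -f -tt -p "$pid" -o "$out"
}
```

Calling trace_salt_minion 120 on the worker would write about two minutes of syscalls to /tmp/salt-minion.strace, which should be captured before restarting the service so the hung state is not lost.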

Workaround

systemctl restart salt-minion
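The workaround could be wrapped in a small check, sketched below under some assumptions: it runs on the salt master, the worker accepts ssh as root, and a test.ping that is not answered within 30 seconds counts as unresponsive. The function names minion_did_not_return and restart_if_unresponsive are hypothetical. The matched string is the one from the deployment log above.

```shell
#!/bin/sh
# Sketch only: assumes execution on the salt master with root ssh access
# to the worker. Function names and the 30 s timeout are illustrative.

# Reads salt CLI output on stdin; succeeds if a minion gave no response.
minion_did_not_return() {
    grep 'Minion did not return' >/dev/null
}

# Pings one minion and restarts its salt-minion service over ssh
# only when the ping went unanswered.
restart_if_unresponsive() {
    host="$1"
    if salt --timeout=30 "$host" test.ping 2>&1 | minion_did_not_return; then
        echo "restarting salt-minion on $host" >&2
        ssh "root@$host" systemctl restart salt-minion
    fi
}
```

For example, restart_if_unresponsive openqaworker-arm-1.qe.nue2.suse.org would leave a responsive minion untouched. Note that an automatic restart destroys the hung state, so it conflicts with collecting an strace from a reproduction.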
