action #164874

open

osd-deployment failed due to openqaworker-arm-1 salt-minion yielding "no response" but the machine is up and reachable over ssh

Added by okurz 5 months ago. Updated 4 months ago.

Status: New
Priority: Low
Assignee: -
Category: Regressions/Crashes
Target version: -
Start date: 2024-08-02
Due date: -
% Done: 0%
Estimated time: -
Description

Observation

https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2902107#L9202 shows

openqaworker-arm-1.qe.nue2.suse.org:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20240802080801900813

prevented the osd-deployment from continuing. That was reproducible until I ran systemctl restart salt-minion on openqaworker-arm-1, which was otherwise fully reachable and operational.
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2902185 shows the successful retry.
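
A minimal sketch of how to confirm the asymmetry described above (the host answers ssh, but the minion does not answer salt); the salt and ssh invocations are standard, only the target glob is an assumption:

    # on the salt master (OSD): the minion does not return
    sudo salt --timeout=30 'openqaworker-arm-1*' test.ping
    # yet plain ssh to the same host works fine
    ssh openqaworker-arm-1.qe.nue2.suse.org uptime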

journalctl -u salt-minion shows

Aug 01 07:17:46 openqaworker-arm-1 [RPM][26364]: Transaction ID 66ab1a75 finished: 0
Aug 01 08:13:26 openqaworker-arm-1 sudo[33685]:     root : PWD=/root ; USER=root ; COMMAND=/usr/bin/timeout -v --kill-after=1m 5m telegraf --test --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d/
Aug 01 18:19:24 openqaworker-arm-1 sudo[29545]:     root : PWD=/root ; USER=root ; COMMAND=/usr/bin/timeout -v --kill-after=1m 5m telegraf --test --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d/
Aug 02 10:13:17 openqaworker-arm-1 systemd[1]: Stopping The Salt Minion...

so there is no log entry after 2024-08-01 18:19:24 until I ran systemctl restart salt-minion. If we can reproduce this issue we should try attaching strace to the process.
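
A hedged sketch of that strace idea (untested; standard systemd and strace options, the output path is arbitrary):

    # find the PID of the running minion and attach, following forks (-f),
    # decoding fds (-yy), with timestamps, logging to a file for later inspection
    pid=$(systemctl show --property=MainPID --value salt-minion)
    sudo strace -f -yy -tt -o /tmp/salt-minion.strace -p "$pid"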

Workaround

systemctl restart salt-minion
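
If this keeps recurring, the workaround could be scripted from the master side; a rough sketch, assuming root ssh access to the worker as used for the manual restart:

    #!/bin/sh
    # restart the salt-minion on the worker whenever it stops answering pings
    minion=openqaworker-arm-1.qe.nue2.suse.org
    if ! sudo salt --timeout=30 "$minion" test.ping >/dev/null 2>&1; then
        ssh "root@$minion" systemctl restart salt-minion
    fi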

Actions #1

Updated by livdywan 4 months ago

https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2956062

openqaworker-arm-1.qe.nue2.suse.org:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20240816052647209352
ERROR: Minions returned with non-zero exit code
Actions #2

Updated by okurz 4 months ago

Today's osd deployment failed again due to this issue. I logged into openqaworker-arm-1 and investigated. tail /var/log/salt/minion at 2024-08-18 09:37:00Z shows

2024-08-18 03:40:00,724 [salt.channel.client:293 ][ERROR   ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:05,732 [salt.channel.client:293 ][ERROR   ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:10,740 [salt.channel.client:293 ][ERROR   ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:10,744 [salt.minion      :1145][ERROR   ][2366] Error while bringing up minion for multi-master. Is master at openqa.suse.de responding?
2024-08-18 03:40:25,795 [salt.channel.client:293 ][ERROR   ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:30,803 [salt.channel.client:293 ][ERROR   ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:35,810 [salt.channel.client:293 ][ERROR   ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:40,818 [salt.channel.client:293 ][ERROR   ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:45,826 [salt.channel.client:293 ][ERROR   ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:50,835 [salt.channel.client:293 ][ERROR   ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)

so there were timeouts around the time when OSD rebooted, but no messages afterwards. w shows that the machine itself has been up for 8h. journalctl -b -u salt-minion also shows the above messages plus RPM output whose origin I don't know. systemctl status salt-minion says

● salt-minion.service - The Salt Minion
     Loaded: loaded (/usr/lib/systemd/system/salt-minion.service; enabled; vendor preset: disabled)
     Active: active (running) since Sun 2024-08-18 03:36:59 CEST; 8h ago
   Main PID: 2163 (salt-minion)
      Tasks: 5 (limit: 7372)
     CGroup: /system.slice/salt-minion.service
             ├─ 2163 /usr/bin/python3 /usr/bin/salt-minion
             └─ 2366 /usr/bin/python3 /usr/bin/salt-minion

Aug 18 07:17:11 openqaworker-arm-1 [RPM][331]: install openQA-client-4.6.1723914098.c2bbc49b-lp155.7068.1.aarch64: success
Aug 18 07:17:12 openqaworker-arm-1 [RPM][331]: erase openQA-client-4.6.1723660485.b24c2671-lp155.7053.1.aarch64: success
Aug 18 07:17:12 openqaworker-arm-1 [RPM][331]: install openQA-client-4.6.1723914098.c2bbc49b-lp155.7068.1.aarch64: success
Aug 18 07:17:12 openqaworker-arm-1 [RPM][331]: Transaction ID 66c183d6 finished: 0
Aug 18 07:17:12 openqaworker-arm-1 [RPM][335]: Transaction ID 66c183d8 started
Aug 18 07:17:12 openqaworker-arm-1 [RPM][335]: erase openQA-worker-4.6.1723660485.b24c2671-lp155.7053.1.aarch64: success
Aug 18 07:17:15 openqaworker-arm-1 [RPM][335]: install openQA-worker-4.6.1723914098.c2bbc49b-lp155.7068.1.aarch64: success
Aug 18 07:17:17 openqaworker-arm-1 [RPM][335]: erase openQA-worker-4.6.1723660485.b24c2671-lp155.7053.1.aarch64: success
Aug 18 07:17:17 openqaworker-arm-1 [RPM][335]: install openQA-worker-4.6.1723914098.c2bbc49b-lp155.7068.1.aarch64: success
Aug 18 07:17:17 openqaworker-arm-1 [RPM][335]: Transaction ID 66c183d8 finished: 0
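
The two PIDs attached below come straight from the CGroup listing above; a generic way to pick them (hedged sketch, standard systemd/procps tools):

    # list the processes belonging to the unit, or derive the child from the main PID
    systemd-cgls --no-pager -u salt-minion.service
    main=$(systemctl show --property=MainPID --value salt-minion)
    pgrep -P "$main"   # children of the main salt-minion process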

strace -f -yy -p 2163 shows that the main salt process waits for the second one. strace -f -yy -p 2366 shows that process in a polling wait:

strace: Process 2366 attached with 4 threads
[pid 13526] epoll_pwait(16<anon_inode:[eventpoll]>,  <unfinished ...>
[pid 13525] epoll_pwait(14<anon_inode:[eventpoll]>,  <unfinished ...>
[pid  2368] pselect6(0, NULL, NULL, NULL, {tv_sec=4, tv_nsec=785892938}, NULL <unfinished ...>
[pid  2366] epoll_pwait(6<anon_inode:[eventpoll]>, [], 1023, 346, NULL, 8) = 0
[pid  2366] epoll_pwait(6<anon_inode:[eventpoll]>, [], 1023, 998, NULL, 8) = 0
[pid  2366] epoll_pwait(6<anon_inode:[eventpoll]>, [], 1023, 997, NULL, 8) = 0
[pid  2366] epoll_pwait(6<anon_inode:[eventpoll]>, [], 1023, 998, NULL, 8) = 0
[pid  2366] epoll_pwait(6<anon_inode:[eventpoll]>, [], 1023, 998, NULL, 8) = 0
[pid  2366] epoll_pwait(6<anon_inode:[eventpoll]>,  <unfinished ...>
[pid  2368] <... pselect6 resumed>)     = 0 (Timeout)
[pid  2368] getuid()                    = 0
[pid  2368] kill(2163, 0)               = 0
…

I cross-checked on another machine, worker29, and it looks the same there, so nothing extraordinary. If I call sudo salt \* test.ping, both processes react, reading a lot and showing that in the strace output. Also, all machines now show up in the output of test.ping, so the workaround of restarting the salt-minion does not seem to be necessary after all. I manually retriggered https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2958810 and other jobs in the pipeline, which succeeded.
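
Since the salt CLI typically exits non-zero when a minion does not return (compare the "ERROR: Minions returned with non-zero exit code" in #1), a pre-flight check before retriggering could look like this sketch:

    # confirm every minion answers before retriggering the deployment
    sudo salt --timeout=30 '*' test.ping || echo "at least one minion did not return"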
