action #174985
closed[alert] salt-states-openqa | Failed pipeline for master "salt.exceptions.SaltReqTimeoutError: Message timed out" size:S
Description
Observation
job: https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3609258
Further details
job error:
monitor.qe.nue2.suse.org:
The minion function caused an exception: Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/salt/minion.py", line 1912, in _thread_return
function_name, function_args, executors, opts, data
File "/usr/lib/python3.6/site-packages/salt/minion.py", line 1870, in _execute_job_function
return_data = self.executors[fname](opts, data, func, args, kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 149, in __call__
return self.loader.run(run_func, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1234, in run
return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/contextvars/__init__.py", line 38, in run
return callable(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1249, in _run_as
ret = _func_or_method(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/executors/direct_call.py", line 10, in execute
return func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 149, in __call__
return self.loader.run(run_func, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1234, in run
return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/contextvars/__init__.py", line 38, in run
return callable(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1249, in _run_as
ret = _func_or_method(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/modules/state.py", line 1161, in highstate
initial_pillar=_get_initial_pillar(opts),
File "/usr/lib/python3.6/site-packages/salt/state.py", line 4953, in __init__
initial_pillar=initial_pillar,
File "/usr/lib/python3.6/site-packages/salt/state.py", line 774, in __init__
self.opts["pillar"] = self._gather_pillar()
File "/usr/lib/python3.6/site-packages/salt/state.py", line 877, in _gather_pillar
return pillar.compile_pillar()
File "/usr/lib/python3.6/site-packages/salt/pillar/__init__.py", line 360, in compile_pillar
dictkey="pillar",
File "/usr/lib/python3.6/site-packages/salt/utils/asynchronous.py", line 112, in wrap
lambda: getattr(self.obj, key)(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
return future_cell[0].result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3.6/site-packages/salt/channel/client.py", line 172, in crypted_transfer_decode_dictentry
timeout=timeout,
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 920, in send
ret = yield self.message_client.send(load, timeout=timeout)
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 630, in send
recv = yield future
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out
Suggestions
- Confirm whether this is related to any recent work or known errors
- Look into the system load of monitor: https://stats.openqa-monitor.qa.suse.de/d/GDmonitor/dashboard-for-monitor?from=now-1h&to=now&timezone=browser&var-datasource=000000001&refresh=1m
- Possibly bump either the monitor VM's resources or the salt request timeout (and check that the GitLab job timeout matches the salt timeout; a rough sketch follows below)
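As a rough illustration of the last suggestion only, these are the two places where such timeouts would be configured. The job name, the salt invocation and both values are assumptions, not taken from the actual pipeline definition:

# hypothetical .gitlab-ci.yml excerpt, not the real salt-states-openqa pipeline
deploy:
  # GitLab-side job timeout; keep it comfortably above the salt timeout below
  timeout: 30m
  script:
    # --timeout is how long the salt CLI waits for minion returns; 300 s is an assumed value
    - salt --timeout=300 '*' state.apply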
Updated by okurz about 1 month ago
- Tags set to infra, reactive work, alert, salt, osd
- Category set to Regressions/Crashes
- Target version set to Ready
Updated by okurz about 1 month ago
- Related to action #174652: Ensure uniqueness of nodenames for generating configs on monitor size:M added
Updated by livdywan about 1 month ago
- Subject changed from [alert] salt-states-openqa | Failed pipeline for master to [alert] salt-states-openqa | Failed pipeline for master "salt.exceptions.SaltReqTimeoutError: Message timed out" size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by nicksinger 24 days ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
I don't think this is related to any recent changes because the error is buried deep within Salt code, but I will have a look whether the logs from the last 7 days show anything more.
Updated by nicksinger 24 days ago
- Status changed from In Progress to Rejected
So the machine itself is fine. A manual state.apply from OSD did complete (except for an unrelated issue in "systemctl start dehydrated"). The journal also doesn't contain more recent problematic entries, so for now I would treat this as a one-off occurrence. https://github.com/search?q=repo%3Asaltstack%2Fsalt+SaltReqTimeoutError&type=issues is full of similar examples, but most of them mention a huge number of minions (500-1000). Also, reading https://github.com/saltstack/salt/blob/9233e1cc3b6b072a61b445d285ba856fc642ef3b/salt/exceptions.py#L330, this seems to be rather about a slow master (master == OSD), so bumping the resources for monitor won't cut it.
https://github.com/saltstack/salt/issues/53147#issuecomment-1593518015 contains a hint regarding worker_threads and sock_pool_size - we can look into these settings (see the sketch below) if this happens more regularly.
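A minimal sketch of where those two options would go, assuming the master configuration on OSD is extended via a drop-in file; the file name and values are illustrative only (the Salt defaults are worker_threads: 5 and sock_pool_size: 1):

# /etc/salt/master.d/tuning.conf - illustrative values, not a tested recommendation
# Number of MWorker processes answering minion requests (e.g. pillar compilation);
# a master too busy to answer in time is what makes the minion raise SaltReqTimeoutError.
worker_threads: 10
# Size of the socket pool the ZeroMQ transport uses to avoid blocking writes.
sock_pool_size: 4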
Updated by nicksinger 24 days ago
- Related to action #175407: salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S added