action #174985

closed

[alert] salt-states-openqa | Failed pipeline for master "salt.exceptions.SaltReqTimeoutError: Message timed out" size:S

Added by gpuliti about 1 month ago. Updated 24 days ago.

Status: Rejected
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2025-01-03
Due date:
% Done: 0%
Estimated time:

Description

Observation

job: https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3609258

Further details

job error:

monitor.qe.nue2.suse.org:
    The minion function caused an exception: Traceback (most recent call last):
      File "/usr/lib/python3.6/site-packages/salt/minion.py", line 1912, in _thread_return
        function_name, function_args, executors, opts, data
      File "/usr/lib/python3.6/site-packages/salt/minion.py", line 1870, in _execute_job_function
        return_data = self.executors[fname](opts, data, func, args, kwargs)
      File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 149, in __call__
        return self.loader.run(run_func, *args, **kwargs)
      File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1234, in run
        return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
      File "/usr/lib/python3.6/site-packages/contextvars/__init__.py", line 38, in run
        return callable(*args, **kwargs)
      File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1249, in _run_as
        ret = _func_or_method(*args, **kwargs)
      File "/usr/lib/python3.6/site-packages/salt/executors/direct_call.py", line 10, in execute
        return func(*args, **kwargs)
      File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 149, in __call__
        return self.loader.run(run_func, *args, **kwargs)
      File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1234, in run
        return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
      File "/usr/lib/python3.6/site-packages/contextvars/__init__.py", line 38, in run
        return callable(*args, **kwargs)
      File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1249, in _run_as
        ret = _func_or_method(*args, **kwargs)
      File "/usr/lib/python3.6/site-packages/salt/modules/state.py", line 1161, in highstate
        initial_pillar=_get_initial_pillar(opts),
      File "/usr/lib/python3.6/site-packages/salt/state.py", line 4953, in __init__
        initial_pillar=initial_pillar,
      File "/usr/lib/python3.6/site-packages/salt/state.py", line 774, in __init__
        self.opts["pillar"] = self._gather_pillar()
      File "/usr/lib/python3.6/site-packages/salt/state.py", line 877, in _gather_pillar
        return pillar.compile_pillar()
      File "/usr/lib/python3.6/site-packages/salt/pillar/__init__.py", line 360, in compile_pillar
        dictkey="pillar",
      File "/usr/lib/python3.6/site-packages/salt/utils/asynchronous.py", line 112, in wrap
        lambda: getattr(self.obj, key)(*args, **kwargs)
      File "/usr/lib/python3.6/site-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
        return future_cell[0].result()
      File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
        raise_exc_info(self._exc_info)
      File "<string>", line 4, in raise_exc_info
      File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
        yielded = self.gen.throw(*exc_info)
      File "/usr/lib/python3.6/site-packages/salt/channel/client.py", line 172, in crypted_transfer_decode_dictentry
        timeout=timeout,
      File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
        value = future.result()
      File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
        raise_exc_info(self._exc_info)
      File "<string>", line 4, in raise_exc_info
      File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
        yielded = self.gen.throw(*exc_info)
      File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 920, in send
        ret = yield self.message_client.send(load, timeout=timeout)
      File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
        value = future.result()
      File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
        raise_exc_info(self._exc_info)
      File "<string>", line 4, in raise_exc_info
      File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
        yielded = self.gen.throw(*exc_info)
      File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 630, in send
        recv = yield future
      File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
        value = future.result()
      File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
        raise_exc_info(self._exc_info)
      File "<string>", line 4, in raise_exc_info
    salt.exceptions.SaltReqTimeoutError: Message timed out

Suggestions


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #174652: Ensure uniqueness of nodenames for generating configs on monitor size:M (Resolved, ybonatakis, 2024-12-20, 2025-02-04)

Related to openQA Infrastructure (public) - action #175407: salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S (Resolved, okurz)

Actions #1

Updated by okurz about 1 month ago

  • Tags set to infra, reactive work, alert, salt, osd
  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #2

Updated by okurz about 1 month ago

  • Related to action #174652: Ensure uniqueness of nodenames for generating configs on monitor size:M added
Actions #3

Updated by livdywan about 1 month ago

  • Subject changed from [alert] salt-states-openqa | Failed pipeline for master to [alert] salt-states-openqa | Failed pipeline for master "salt.exceptions.SaltReqTimeoutError: Message timed out" size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by nicksinger 24 days ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger

I don't think this is related to any recent changes, because the failure is buried deep within the Salt code, but I will check whether the logs from the last 7 days show anything more.

Actions #5

Updated by nicksinger 24 days ago

  • Status changed from In Progress to Rejected

So the machine itself is fine. A manual state.apply from OSD completed (except for an unrelated issue in "systemctl start dehydrated"). The journal also doesn't contain any more recent problematic entries. For now I would treat this as a one-off occurrence. https://github.com/search?q=repo%3Asaltstack%2Fsalt+SaltReqTimeoutError&type=issues is full of similar examples, but most of them mention a huge number of minions (500-1000). Also, reading https://github.com/saltstack/salt/blob/9233e1cc3b6b072a61b445d285ba856fc642ef3b/salt/exceptions.py#L330, this seems to be about a slow master (master == OSD), so bumping the resources for monitor won't cut it.

https://github.com/saltstack/salt/issues/53147#issuecomment-1593518015 contains a hint regarding worker_threads and sock_pool_size; we can look into these settings if this starts happening more regularly.
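
To illustrate the reasoning above (a request timeout depends on how quickly the master replies, not on the resources of the waiting minion), here is a minimal sketch using plain asyncio. It is not Salt code: `RequestTimeoutError`, `slow_master`, `minion_request` and `run` are hypothetical names, and the delays are arbitrary values chosen for the example.

```python
import asyncio


class RequestTimeoutError(Exception):
    """Hypothetical stand-in for salt.exceptions.SaltReqTimeoutError."""


async def slow_master(request: str, reply_delay: float) -> str:
    # Simulates a master that needs `reply_delay` seconds to answer.
    await asyncio.sleep(reply_delay)
    return f"pillar for {request}"


async def minion_request(request: str, reply_delay: float, timeout: float) -> str:
    # The minion only waits here; its own CPU or memory plays no role.
    try:
        return await asyncio.wait_for(slow_master(request, reply_delay), timeout)
    except asyncio.TimeoutError:
        raise RequestTimeoutError("Message timed out") from None


def run(reply_delay: float, timeout: float) -> str:
    # Returns the reply, or an error string if the master was too slow.
    try:
        return asyncio.run(minion_request("monitor", reply_delay, timeout))
    except RequestTimeoutError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    print(run(reply_delay=0.01, timeout=0.5))  # master answers in time
    print(run(reply_delay=0.5, timeout=0.05))  # master too slow for the timeout
```

In this toy model only a faster reply or a longer timeout avoids the error, which is consistent with looking at master-side settings first if the problem recurs.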

Actions #6

Updated by nicksinger 24 days ago

  • Related to action #175407: salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S added