action #178078
Updated by tinita about 2 months ago
## Observation
From https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3907598:
```
petrol.qe.nue2.suse.org:
The minion function caused an exception: Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/salt/minion.py", line 1912, in _thread_return
function_name, function_args, executors, opts, data
File "/usr/lib/python3.6/site-packages/salt/minion.py", line 1870, in _execute_job_function
return_data = self.executors[fname](opts, data, func, args, kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 149, in __call__
return self.loader.run(run_func, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1234, in run
return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/contextvars/__init__.py", line 38, in run
return callable(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1249, in _run_as
ret = _func_or_method(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/executors/direct_call.py", line 10, in execute
return func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 149, in __call__
return self.loader.run(run_func, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1234, in run
return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/contextvars/__init__.py", line 38, in run
return callable(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1249, in _run_as
ret = _func_or_method(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/modules/state.py", line 1161, in highstate
initial_pillar=_get_initial_pillar(opts),
File "/usr/lib/python3.6/site-packages/salt/state.py", line 4953, in __init__
initial_pillar=initial_pillar,
File "/usr/lib/python3.6/site-packages/salt/state.py", line 774, in __init__
self.opts["pillar"] = self._gather_pillar()
File "/usr/lib/python3.6/site-packages/salt/state.py", line 877, in _gather_pillar
return pillar.compile_pillar()
File "/usr/lib/python3.6/site-packages/salt/pillar/__init__.py", line 360, in compile_pillar
dictkey="pillar",
File "/usr/lib/python3.6/site-packages/salt/utils/asynchronous.py", line 112, in wrap
lambda: getattr(self.obj, key)(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
return future_cell[0].result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3.6/site-packages/salt/channel/client.py", line 172, in crypted_transfer_decode_dictentry
timeout=timeout,
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 920, in send
ret = yield self.message_client.send(load, timeout=timeout)
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 630, in send
recv = yield future
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out
```
We now since recently call salt with an explicit timeout value with `salt -t 3600` (60 minutes). Most salt nodes return after around 40s but from petrol we get a SaltReqTimeoutError. The complete pipeline ran for (only) 3m 54s. So after which time did the timeout actually happen and why?
## Acceptance Criteria
* **AC1:** salt states on petrol don't time out
## Suggestions
* This seems to be a generic symptom of a timeout e.g. [many things could file like so](https://github.com/saltstack/salt/issues?q=is%3Aissue%20state%3Aopen%20SaltReqTimeoutError)
* There's no specific module visible in tracebacks?
* Come up with a reproducer?
* Happens very often but only petrol because the machine is slow?
* Maybe it's the module timeout?
* Enable debugging that would help narrow down where this originates
* Can we split up the state execution e.g. apply networking, packages separately?