action #130835 (closed)
salt high state fails after recent merge requests in salt pillars size:M
Description
Observation
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1629763 fails. From https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1629763/raw
.worker3.oqa.suse.de:
Data failed to compile:
----------
Rendering SLS 'base:openqa.openvswitch' failed: Jinja error: argument of type 'StrictUndefined' is not iterable
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/salt/utils/templates.py", line 502, in render_jinja_tmpl
output = template.render(**decoded_context)
File "/usr/lib/python3.6/site-packages/jinja2/asyncsupport.py", line 76, in render
return original_render(self, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/jinja2/environment.py", line 1008, in render
return self.environment.handle_exception(exc_info, True)
File "/usr/lib/python3.6/site-packages/jinja2/environment.py", line 780, in handle_exception
reraise(exc_type, exc_value, tb)
File "/usr/lib/python3.6/site-packages/jinja2/_compat.py", line 37, in reraise
raise value.with_traceback(tb)
File "<template>", line 119, in top-level template code
TypeError: argument of type 'StrictUndefined' is not iterable
; line 119
---
[...]
{% if 'bridge_ip' in remote_conf %}
{% set remote_ip=remote_conf['bridge_ip'] %}
{% elif 'bridge_iface' in remote_conf %}
{% set remote_interfaces = salt['mine.get']("nodename:" + remote, 'ip4_interfaces', tgt_type='grain').values()|list %}
{% set remote_bridge_interface = remote_conf['bridge_iface'] %}
{% if remote_bridge_interface in remote_interfaces[0] %} <======================
{% set remote_ip = remote_interfaces[0][remote_bridge_interface][0] %}
{% endif %}
{% endif %}
- ovs-vsctl --may-exist add-port $bridge gre{{- loop.index }} -- set interface gre{{- loop.index }} type=gre options:remote_ip={{ remote_ip }}
{% endfor %}
[...]
---
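The failing membership test can be reproduced outside of Salt with a small jinja2 sketch (a hypothetical minimal reproduction, not the actual sls rendering): if remote_conf is never defined, the `'bridge_ip' in remote_conf` test on line 119 aborts the render.

```python
from jinja2 import Environment, StrictUndefined

# Minimal stand-in for the failing template logic: remote_conf is not
# passed to render(), so it becomes a StrictUndefined value and the
# 'in' membership test cannot be evaluated.
env = Environment(undefined=StrictUndefined)
template = env.from_string("{% if 'bridge_ip' in remote_conf %}has ip{% endif %}")

try:
    template.render()  # remote_conf intentionally missing
    outcome = "rendered"
except Exception as error:  # TypeError or UndefinedError depending on the jinja2 version
    outcome = type(error).__name__

print(outcome)
```

Depending on the jinja2 version the failure surfaces either as the TypeError seen in the job log or as a jinja2 UndefinedError; either way the render never completes.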
Via ssh on osd, 'sudo salt --no-color --state-output=changes \* state.apply'
reproduces the failure.
Suggestions
- Check on one worker and fix the error
- Consider improving CI checks e.g. using https://salt-lint.readthedocs.io/en/latest/
Updated by okurz 11 months ago
I just installed salt-lint locally in a python venv and found only one minor issue within salt-pillars, which is fixed by https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/543
sudo salt -l error --no-color --state-output=changes 'worker2*' state.apply
showed no failures. worker3 does show failures, related to openvswitch.
salt-call --local -l error --no-color --state-output=changes slsutil.renderer /srv/pillar/openqa/workerconf.sls 'jinja'
shows what looks like a sane document. I compared the documents before and after the two recent merge requests and found only correct changes in public cloud credentials, not related to openvswitch, so I assume the merge requests themselves are not the cause and something else changed. I assume some host is now not providing the necessary interfaces to construct all GRE tunnels.
I think the relevant code is:
{% set remote_interfaces = salt['mine.get']("nodename:" + remote, 'ip4_interfaces', tgt_type='grain').values()|list %}
{% set remote_bridge_interface = remote_conf['bridge_iface'] %}
{% if remote_bridge_interface in remote_interfaces[0] %}
I assume remote_interfaces[0] is invalid because remote_interfaces is empty, meaning that salt['mine.get']("nodename:" + remote, 'ip4_interfaces', tgt_type='grain').values() returns an empty result.
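That hypothesis is easy to check in plain Python (a hypothetical model of the mine data, not Salt itself): for a target that matches no minion, mine.get returns an empty dict, so the values()|list result is empty and indexing [0] cannot succeed.

```python
# Hypothetical model of salt['mine.get'](...) for a target that matches
# no minion (e.g. a worker whose salt key was deleted): an empty dict.
mine_result = {}

# Equivalent of the ".values()|list" in the template.
remote_interfaces = list(mine_result.values())

# Indexing the empty list is what fails; jinja surfaces this as an
# undefined value rather than a plain Python IndexError.
try:
    first = remote_interfaces[0]
except IndexError:
    first = None

assert remote_interfaces == []
assert first is None
```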
salt -l error --no-color --state-output=changes 'worker3*' slsutil.renderer /srv/salt/openqa/openvswitch.sls 'jinja'
fails with jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'bridge_ip'. Don't know how to handle that.
salt -l error --no-color --state-output=changes 'worker3*' state.sls openqa.openvswitch
reproduces the problem quicker.
I tried to change the jinja code to handle a missing ip4_interfaces entry gracefully but then it hit me: a machine is missing on purpose, worker6, see #129484. So likely our salt code struggles whenever a multi-machine configured worker is removed from the salt keys but is still present in the salt pillars. I commented out the entry in /srv/pillar/openqa/workerconf.sls on osd, and then the above command renders the sls file nicely and proceeds with execution.
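A defensive variant of the template would also have avoided the crash. The following is only a sketch of that idea with a hypothetical, heavily simplified template rendered via jinja2 directly, not the actual openqa.openvswitch sls: guard the index with an emptiness check so a removed worker simply produces no tunnel.

```python
from jinja2 import Environment, StrictUndefined

# Hypothetical simplification of the openvswitch template logic: only
# index into remote_interfaces when the mine actually returned data.
template_src = """\
{%- set remote_interfaces = mine_data.values() | list -%}
{%- if remote_interfaces and bridge_iface in remote_interfaces[0] -%}
remote_ip={{ remote_interfaces[0][bridge_iface][0] }}
{%- else -%}
no-tunnel
{%- endif -%}
"""

env = Environment(undefined=StrictUndefined)
template = env.from_string(template_src)

# Empty mine result (removed worker): renders the fallback instead of failing.
print(template.render(mine_data={}, bridge_iface="br1"))  # -> no-tunnel

# Populated mine result: the IP is resolved as before.
print(template.render(
    mine_data={"worker3.oqa.suse.de": {"br1": ["10.0.0.3"]}},
    bridge_iface="br1",
))  # -> remote_ip=10.0.0.3
```

Commenting out the pillar entry was the quicker workaround here, but a guard like this would keep the highstate rendering even when pillars and salt keys get out of sync again.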
Updated by okurz 11 months ago
Yet another improvement is https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/545 to use consistent jinja commenting.
Regarding salt-lint, there is https://software.opensuse.org/package/salt-lint?search_term=salt-lint in a 1.5-year-old version 0.8.0 while upstream is at 0.9.2. dheidler will create an SR for the new version to https://build.opensuse.org/package/show/systemsmanagement:saltstack/salt-lint, then to openSUSE:Factory and then maybe Leap.
Still monitoring the still-running deploy jobs.
EDIT: They seem to have succeeded in the end.
Updated by openqa_review 11 months ago
- Due date set to 2023-06-29
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 11 months ago
- Status changed from In Progress to Blocked
waiting for https://build.opensuse.org/request/show/1093128 first
Updated by okurz 11 months ago
- Due date changed from 2023-06-29 to 2023-07-07
- Status changed from Blocked to Feedback
SR accepted and in Factory since yesterday. Already trying to build a container that has salt-lint: https://build.opensuse.org/package/show/home:okurz:container/salt-lint
Updated by okurz 11 months ago
- Status changed from Feedback to Workable
https://build.opensuse.org/package/show/home:okurz:container/salt-lint is built now.
podman run --rm -it registry.opensuse.org/home/okurz/container/containers/tumbleweed:salt-lint salt-lint --help
works fine, so we can integrate that into the CI pipeline. I added an entrypoint in the Dockerfile so we can just call
podman run --rm -it registry.opensuse.org/home/okurz/container/containers/tumbleweed:salt-lint
to invoke salt-lint.
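With that entrypoint in place, the CI integration could look roughly like the following .gitlab-ci.yml fragment. This is a hypothetical sketch: the job name, stage and the find invocation are assumptions and would need adjusting to the actual repository layout; only the image reference is the one built above.

```yaml
# Hypothetical lint job for the salt-pillars-openqa pipeline; everything
# except the image reference is an assumption.
salt-lint:
  stage: test
  image:
    name: registry.opensuse.org/home/okurz/container/containers/tumbleweed:salt-lint
    entrypoint: [""]
  script:
    - find . -name '*.sls' -print0 | xargs -0 salt-lint
```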
Updated by okurz 11 months ago
- Related to action #131249: [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M added