action #167485
closed coordination #151582: [epic] Future improvements for QE infrastructure salt management
salt-states-openqa deploy pipeline can cause monitor.qe to fail to render dashboard templates correctly size:M
Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-09-26
Due date:
% Done: 0%
Estimated time:
Description
Observation
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1274 introduced a change to the network runtime configuration of workers. This caused the deployment pipeline of that MR to fail with a seemingly unrelated message:
monitor.qe.nue2.suse.org:
Data failed to compile:
----------
Rendering SLS 'base:monitoring.grafana' failed: Jinja variable No first item, sequence was empty.; line 192
---
[...]
file.managed:
- source: salt://monitoring/grafana/alerting-dashboard-GD.yaml.template
- mode: "0644"
- template: jinja
- generic_host: {{ genericname }}
- host_interface: {{ host_interface }} <======================
{% do provisioned_alerts.append('dashboard-GD' + genericname + '.yaml') %}
{% endfor %}
After looking into our states I came up with a hypothesis for why this happened and how it relates to the change:
- Marius' config was applied in the running highstate
- Eventually the highstate reached monitor and started rendering grafana.sls
- It grabs a list of workernames: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/grafana.sls#L4 (which included worker37 which was just changed)
- It tries to fetch a list of all network interfaces from each worker in the list: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/grafana.sls#L177
- This list was empty - presumably while at the point of worker37 in the for-loop but hard to tell from salt log-output
- => | first returned an empty item, which caused host_interface to be missing while trying to render the corresponding dashboard template
It boils down to the mine returning incomplete/no data which our templates can't handle. We had a similar case in the past with the generation of gre-endpoint-pairs already but I can't quite remember if we found a solution/workaround we could apply here.
Acceptance criteria
- AC1: Our alerting template render state can handle an empty host_interface variable
Suggestions
- Check if the above hypothesis can be validated easily without interrupting production
- Make grafana.sls robust against empty network interface lists returned by the mine
- Don't render if variable is not defined/empty -> eventual network changes will only be rendered/visible/monitored on next deployment
- What is this variable used for in our dashboards? Does it need to come from the mine? Is there a better way? Maybe in telegraf?
- This should be reproducible on a single machine e.g. by setting an empty value
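The "don't render if variable is not defined/empty" suggestion could look roughly like the following inside the per-worker loop of grafana.sls. This is only a sketch under assumptions: `interfaces` is a hypothetical name for the list the mine lookup returns for the current worker, and the state ID is made up for illustration because the real one is not visible in the log excerpt. Only the Jinja built-ins `default` and `first` are used; `genericname`, `host_interface` and `provisioned_alerts` are taken from the failing state shown above.

```jinja
{#- Sketch only: `interfaces` is a hypothetical name for the mine result of the current worker -#}
{#- default([], true) maps an undefined or None mine result to an empty list -#}
{#- first yields an undefined value on an empty list; default('', true) turns that into '' -#}
{% set host_interface = interfaces | default([], true) | first | default('', true) %}
{% if host_interface %}
{#- hypothetical state ID, the real one is elided in the log output -#}
example-alerting-dashboard-GD-{{ genericname }}:
  file.managed:
    - source: salt://monitoring/grafana/alerting-dashboard-GD.yaml.template
    - mode: "0644"
    - template: jinja
    - generic_host: {{ genericname }}
    - host_interface: {{ host_interface }}
{% do provisioned_alerts.append('dashboard-GD' + genericname + '.yaml') %}
{% endif %}
```

With a guard like this, a worker whose interface list is temporarily empty would simply be skipped in that highstate run, so its alert dashboard would only be rendered again on a later deployment once the mine returns data. Because the `provisioned_alerts.append` call sits inside the guard, anything that consumes that list later would also not see the skipped worker, which may or may not be desired.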