action #167485

closed

coordination #151582: [epic] Future improvements for QE infrastructure salt management

salt-states-openqa deploy pipeline can cause monitor.qe to fail to render dashboard templates correctly size:M

Added by nicksinger about 2 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-09-26
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1274 introduced a change to the network runtime configuration of workers. This caused the deployment pipeline of that MR to fail with a seemingly unrelated message:

monitor.qe.nue2.suse.org:
    Data failed to compile:
----------
    Rendering SLS 'base:monitoring.grafana' failed: Jinja variable No first item, sequence was empty.; line 192

---
[...]
  file.managed:
    - source: salt://monitoring/grafana/alerting-dashboard-GD.yaml.template
    - mode: "0644"
    - template: jinja
    - generic_host: {{ genericname }}
    - host_interface: {{ host_interface }}    <======================
{% do provisioned_alerts.append('dashboard-GD' + genericname + '.yaml') %}
{% endfor %}

After looking into our states, I came up with a hypothesis for why this happened and how it is related to the MR:

  1. Marius' config was applied during the running highstate
  2. Eventually the highstate reached monitor and started rendering grafana.sls

It boils down to the mine returning incomplete or no data, which our templates can't handle. We already had a similar case in the past with the generation of gre-endpoint-pairs, but I can't quite remember whether we found a solution or workaround that we could apply here.

Acceptance criteria

  • AC1: Our alerting template render state can handle an empty host_interface variable

Suggestions

  • Check if the above hypothesis can be validated easily without interrupting production
  • Make grafana.sls robust against empty network interface lists returned by the mine
    • Don't render if variable is not defined/empty -> eventual network changes will only be rendered/visible/monitored on next deployment
    • What is this variable used for in our dashboards? Does it need to come from the mine? Is there a better way? Maybe in telegraf?
  • This should be reproducible on a single machine e.g. by setting an empty value
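
One possible shape for the "don't render if empty" suggestion is sketched below. It is only a sketch under assumptions: `interfaces` is a hypothetical name for the list obtained from the mine, and the state ID (target path) is made up for illustration. The actual lookup and naming in grafana.sls may differ. The idea is that Jinja's `first` filter yields an undefined value for an empty sequence, which `default` can turn into an empty string so that the state is skipped instead of failing the whole render.

```jinja
{#- Assumption: `interfaces` is a hypothetical variable holding the (possibly
    empty or missing) interface list returned by the mine for this host. -#}
{%- set host_interface = interfaces | default([], true) | first | default('', true) %}
{%- if host_interface %}
{#- Hypothetical state ID; the real target path is not shown in this ticket. #}
/etc/grafana/provisioning/alerting/dashboard-GD{{ genericname }}.yaml:
  file.managed:
    - source: salt://monitoring/grafana/alerting-dashboard-GD.yaml.template
    - mode: "0644"
    - template: jinja
    - generic_host: {{ genericname }}
    - host_interface: {{ host_interface }}
{% do provisioned_alerts.append('dashboard-GD' + genericname + '.yaml') %}
{%- endif %}
```

With a guard like this, a host whose mine data is temporarily empty would simply get no alerting dashboard until the next deployment, which matches the trade-off noted in the suggestions above.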