action #167485: salt-states-openqa deploy pipeline can cause monitor.qe to fail to render dashboard templates correctly size:M - openQA Infrastructure - openSUSE Project Management Tool

action #167485

closed

coordination #151582: [epic] Future improvements for QE infrastructure salt management

salt-states-openqa deploy pipeline can cause monitor.qe to fail to render dashboard templates correctly size:M

Added by nicksinger about 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: High
Assignee: jbaier_cz
Category: Regressions/Crashes
Target version: openQA Project - Ready
Start date: 2024-09-26
Due date:
% Done:
Estimated time:
Tags: osd, salt, gitlab, infra, mine

Description

Observation

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1274 introduced a change to the network runtime configuration of workers. This caused the deployment pipeline of that MR to fail with a seemingly unrelated message:

monitor.qe.nue2.suse.org:
    Data failed to compile:
----------
    Rendering SLS 'base:monitoring.grafana' failed: Jinja variable No first item, sequence was empty.; line 192

---
[...]
  file.managed:
    - source: salt://monitoring/grafana/alerting-dashboard-GD.yaml.template
    - mode: "0644"
    - template: jinja
    - generic_host: {{ genericname }}
    - host_interface: {{ host_interface }}    <======================
{% do provisioned_alerts.append('dashboard-GD' + genericname + '.yaml') %}
{% endfor %}

After looking into our states, I have a hypothesis for why this happened and how it is related to the MR:

Marius' config was applied in the running highstate
Eventually the highstate reached monitor and started rendering grafana.sls
- It grabs a list of worker names: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/grafana.sls#L4 (which included worker37, which had just been changed)
- It tries to fetch a list of all network interfaces from each worker in the list: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/grafana.sls#L177
- This list was empty, presumably when the for-loop reached worker37, but that is hard to tell from the salt log output
- => | first had no item to return from the empty sequence, which left host_interface undefined while trying to render the corresponding dashboard template

It boils down to the mine returning incomplete or no data, which our templates can't handle. We already had a similar case in the past with the generation of gre-endpoint-pairs, but I can't quite remember whether we found a solution or workaround that we could apply here.
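The failure mode in the hypothesis can be illustrated with a minimal Python sketch. It is not the actual grafana.sls logic: the mine is modelled as a plain dict, worker36 and the interface name eth0 are made up for illustration, and first() raises immediately, whereas the Jinja filter yields an undefined value carrying that message which only fails once it is used.

```python
class EmptySequenceError(Exception):
    """Stand-in for the render failure reported by salt."""


def first(seq):
    # Take the first element, as the template does with "| first".
    for item in seq:
        return item
    raise EmptySequenceError("No first item, sequence was empty.")


def render_dashboards(workers, mine):
    # One dashboard per worker, keyed by worker name (simplified model).
    rendered = {}
    for worker in workers:
        interfaces = mine.get(worker, [])  # incomplete mine data -> []
        rendered[worker] = {"host_interface": first(interfaces)}
    return rendered


# Hypothetical mine content: worker37 was just reconfigured and reports nothing.
mine = {"worker36": ["eth0"], "worker37": []}

try:
    render_dashboards(["worker36", "worker37"], mine)
    failure = None
except EmptySequenceError as err:
    failure = str(err)

print(failure)
```

In this model a single empty list aborts the render for all workers, which would explain why a change to one worker's network configuration surfaces as a failure on monitor.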

Acceptance criteria

AC1: Our alerting template render state can handle an empty host_interface variable

Suggestions

Check if the above hypothesis can be validated easily without interrupting production
Make grafana.sls robust against empty network interface lists returned by the mine
- Don't render if the variable is not defined/empty -> eventual network changes will only be rendered/visible/monitored on the next deployment
- What is this variable used for in our dashboards? Does it need to come from the mine? Is there a better way? Maybe in telegraf?
This should be reproducible on a single machine, e.g. by setting an empty value
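One way to read the "don't render if empty" suggestion is sketched below in Python, as a model of the decision rather than the change that was eventually made: workers without interface data are skipped and reported instead of aborting the whole render. The function names, the dict-shaped mine and the interface names are assumptions for illustration. Passing an empty list for one worker plays the role of the single-machine reproduction.

```python
def pick_host_interface(interfaces):
    """Return the first interface name, or None if there is none."""
    # "interfaces or []" also covers a worker missing from the mine (None).
    return next(iter(interfaces or []), None)


def plan_dashboards(workers, mine):
    """Split workers into renderable dashboards and skipped workers."""
    to_render = {}
    skipped = []
    for worker in workers:
        interface = pick_host_interface(mine.get(worker))
        if interface is None:
            # No data yet: leave this dashboard for the next deployment.
            skipped.append(worker)
            continue
        to_render[worker] = interface
    return to_render, skipped


# Hypothetical mine content: worker37 reports an empty list,
# worker38 is missing from the mine entirely.
mine = {"worker36": ["eth0", "eth1"], "worker37": []}

to_render, skipped = plan_dashboards(["worker36", "worker37", "worker38"], mine)
print(to_render)
print(skipped)
```

The trade-off is the one named in the suggestion: a skipped worker has no alerting dashboard for its interface until a later deployment finds data in the mine, so the skipped list should probably be logged rather than dropped silently.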

Updated by nicksinger about 2 months ago

Updated by okurz about 2 months ago

Updated by livdywan about 2 months ago

Updated by okurz about 1 month ago

Updated by jbaier_cz about 1 month ago

Updated by jbaier_cz about 1 month ago

Updated by openqa_review about 1 month ago

Updated by okurz about 1 month ago

Updated by jbaier_cz about 1 month ago

Updated by jbaier_cz about 1 month ago

Updated by jbaier_cz about 1 month ago