action #167485

closed

coordination #151582: [epic] Future improvements for QE infrastructure salt management

salt-states-openqa deploy pipeline can cause monitor.qe to fail to render dashboard templates correctly size:M

Added by nicksinger about 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-09-26
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1274 changed the network runtime configuration of workers. The deployment pipeline of that MR then failed with a seemingly unrelated message:

monitor.qe.nue2.suse.org:
    Data failed to compile:
----------
    Rendering SLS 'base:monitoring.grafana' failed: Jinja variable No first item, sequence was empty.; line 192

---
[...]
  file.managed:
    - source: salt://monitoring/grafana/alerting-dashboard-GD.yaml.template
    - mode: "0644"
    - template: jinja
    - generic_host: {{ genericname }}
    - host_interface: {{ host_interface }}    <======================
{% do provisioned_alerts.append('dashboard-GD' + genericname + '.yaml') %}
{% endfor %}

After looking into our states I came up with a hypothesis for why this happened and how it is related to the change:

  1. Marius's config was applied in the running highstate
  2. Eventually the highstate reached monitor and started rendering grafana.sls

It boils down to the mine returning incomplete or no data, which our templates can't handle. We already had a similar case in the past with the generation of gre-endpoint-pairs, but I can't quite remember whether we found a solution or workaround that we could apply here.
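
For illustration: the error text "No first item, sequence was empty." is the message Jinja attaches to the undefined value that its built-in `first` filter returns for an empty sequence. A minimal sketch of that failure mode follows. The variable name `interfaces` is hypothetical and stands in for whatever list grafana.sls derives from the mine; the actual code at line 192 is not shown in this ticket.

```jinja
{# Hypothetical: `interfaces` stands in for the list returned by the mine. #}
{% set interfaces = [] %}
{# `first` on an empty sequence does not fail here yet. It yields an
   undefined value carrying "No first item, sequence was empty.". #}
{% set host_interface = interfaces | first %}
{# Rendering only fails once the undefined value is used, assuming the
   environment treats undefined values strictly. #}
host_interface: {{ host_interface }}
```

This would explain why the failure surfaces on monitor.qe at the point where `host_interface` is used, rather than on the worker whose network configuration changed.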

Acceptance criteria

  • AC1: Our alerting template render state can handle an empty host_interface variable

Suggestions

  • Check if the above hypothesis can be validated easily without interrupting production
  • Make grafana.sls robust against empty network interface lists returned by the mine
    • Don't render if variable is not defined/empty -> eventual network changes will only be rendered/visible/monitored on next deployment
    • What is this variable used for in our dashboards? Does it need to come from the mine? Is there a better way? Maybe in telegraf?
  • This should be reproducible on a single machine e.g. by setting an empty value
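
One way the "don't render if not defined/empty" suggestion could look is sketched below. This is only an assumption about the shape of a fix, not the change that was merged: `interfaces`, the state ID and the target path are hypothetical, while `genericname`, `provisioned_alerts` and the source path are taken from the excerpt above.

```jinja
{# Hypothetical: `interfaces` is whatever list the mine returned for this host. #}
{# Fall back to an empty string when the list is missing, None or empty. #}
{% set host_interface = interfaces | default([], true) | first | default('', true) %}
{% if host_interface %}
{# Hypothetical state ID and target path, for illustration only. #}
alerting-dashboard-GD-{{ genericname }}:
  file.managed:
    - name: /path/to/provisioning/dashboard-GD{{ genericname }}.yaml
    - source: salt://monitoring/grafana/alerting-dashboard-GD.yaml.template
    - mode: "0644"
    - template: jinja
    - generic_host: {{ genericname }}
    - host_interface: {{ host_interface }}
{# Only register the alert file when the state above was actually rendered. #}
{% do provisioned_alerts.append('dashboard-GD' + genericname + '.yaml') %}
{% endif %}
```

The trade-off is the one named in the suggestion: a host whose interface list is empty during a deployment gets no alerting dashboard until the next deployment in which the mine returns data.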
Actions #1

Updated by nicksinger about 2 months ago

  • Description updated (diff)
Actions #2

Updated by okurz about 2 months ago

  • Tags set to infra, salt, osd, mine, gitlab
  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #3

Updated by livdywan about 2 months ago

  • Subject changed from salt-states-openqa deploy pipeline can cause monitor.qe fail to render dashboard templates correctly to salt-states-openqa deploy pipeline can cause monitor.qe to fail to render dashboard templates correctly size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz about 1 month ago

  • Priority changed from Normal to High
Actions #5

Updated by jbaier_cz about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to jbaier_cz
Actions #6

Updated by jbaier_cz about 1 month ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1285 should skip defining the template when host_interface is empty.

Actions #7

Updated by openqa_review about 1 month ago

  • Due date set to 2024-10-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by okurz about 1 month ago

  • Parent task set to #151582
Actions #9

Updated by jbaier_cz about 1 month ago

  • Status changed from In Progress to Feedback
Actions #10

Updated by jbaier_cz about 1 month ago

I updated the MR: now only some parts of the dashboard are skipped when host_interface is not defined.

Actions #11

Updated by jbaier_cz about 1 month ago

  • Due date deleted (2024-10-25)
  • Status changed from Feedback to Resolved

The change is deployed and I still see some dashboards and panels, so I guess at least it is not worse. We need to look at it and reopen if the behavior turns out not to be what we want.
