action #161423


coordination #161414: [epic] Improved salt based infrastructure management

[timeboxed:10h] Incomplete config files on OSD due to salt - Improve salt state application from remotely accessible salt master size:S

Added by okurz about 1 month ago. Updated 12 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2024-06-03
Due date:
% Done:
0%

Estimated time:

Description

Motivation

See #161324. Why did the salt states pipeline end with success when the salt high state was never reported as successfully applied to the openqa.suse.de salt minion (openqa.suse.de is not mentioned in the list of minions where the state was applied, yet the pipeline still ended)? We do not know yet, but this ticket should help us spot such errors more quickly if similar problems return. Maybe the problem is related to how we run salt over ssh from that minion openqa.suse.de, and the exit code from salt was never propagated because the bash command ended prematurely? Research upstream best practices for applying a high state from a remotely accessible master and investigate this.
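If the exit code suspicion holds, one contributing factor could be that piping the salt output through another command hides salt's exit status from the CI job. A minimal sketch of a guard against that, assuming the deploy step is a bash script that filters salt's output (the grep filter and target are only illustrative, not the actual pipeline code):

#!/bin/bash
# Hypothetical deploy step: fail the CI job if salt itself fails, even though
# its output is piped through a filter.
# - 'set -e' aborts the script on any failing command
# - 'set -o pipefail' makes the pipeline return salt's exit code instead of grep's
set -euo pipefail

salt --no-color --state-output=changes '*' state.apply | grep -v Result

Without pipefail the exit status of the pipeline is that of grep, so a failing or prematurely aborted salt run can still leave the job looking green.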

Acceptance criteria

  • AC1: We know the best practice for applying a salt high state on a remotely accessible salt master while avoiding losing the ssh session in the process (see the sketch below)
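One possible direction for AC1, sketched here purely as an assumption and not as an agreed approach, is to detach the state application from the ssh session so that a dropped connection does not interrupt it, e.g. via a transient systemd unit on the master:

#!/bin/bash
# Hypothetical wrapper on the salt master: run the high state in a transient
# systemd unit so it survives a lost ssh session, then follow its log.
set -euo pipefail

sudo systemd-run --unit=salt-highstate --collect \
    salt --no-color --state-output=changes '*' state.apply
sudo journalctl -f -u salt-highstate

Running the same command inside tmux or screen would achieve a similar effect.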

Suggestions

  • Do a web search or a quick look around to see whether there are any best practices, known problems or instructions for running salt on a remote, ssh-reachable host
  • Look into how the salt states CI pipelines originally behaved in #161309 and how the results of the state application are missing for openqa.suse.de. Maybe we lost the connection to the salt master while the high state was being applied and the CI pipeline then ended with "success" even though we never received a response from openqa.suse.de?

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure - action #161324: Conduct "lessons learned" with Five Why analysis for "osd not accessible, 502 Bad Gateway" - Resolved - okurz - 2024-05-31

Related to openQA Infrastructure - action #162641: Prevent redundant salt state.apply actions that are executed in every call - openqa-trigger-from-ibs-plugin - Resolved - jbaier_cz - 2024-06-20

Copied to openQA Infrastructure - action #162377: incomplete config files on OSD due to salt - Prevent conflicting state applications on OSD "fstab" size:S - Resolved - okurz - 2024-06-03

Actions #1

Updated by okurz about 1 month ago

  • Related to action #161324: Conduct "lessons learned" with Five Why analysis for "osd not accessible, 502 Bad Gateway" added
Actions #2

Updated by okurz 30 days ago

  • Copied to action #162377: incomplete config files on OSD due to salt - Prevent conflicting state applications on OSD "fstab" size:S added
Actions #3

Updated by okurz 27 days ago · Edited

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version changed from future to Ready

For #162377 I looked into "deploy" jobs from https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs, in particular https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2737048, https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2737025 and https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2732901, and I did not find any state output mentioned for the minion openqa.suse.de, so I wonder if that is now always missing from the output.

I looked at https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines?page=20&scope=all, bisecting to find older deploy jobs where there is state output for OSD, so that I can then narrow down when that output stopped appearing. I found https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1821390 from 9 months ago, 2023-09-11, and looked for "Summary for openqa.suse.de" showing

Summary for openqa.suse.de
--------------
Succeeded: 350 (changed=30)
Failed:      0
--------------
Total states run:     350
Total run time:    55.980 s

Given that we found a non-contiguous series of missing output for the minion 'openqa.suse.de', I assume that we do not have a clear regression but rather a sporadic issue that could have been present for a longer time. Hence we should look for improvements.
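The manual bisection over old deploy jobs could also be semi-automated by fetching job logs via the GitLab API and checking for the OSD summary line. A rough sketch, assuming a personal access token and a hand-picked list of job IDs; nothing here is part of the existing pipeline:

#!/bin/bash
# Hypothetical helper: check a list of GitLab job IDs for the OSD summary line.
# GITLAB_TOKEN and the job IDs are assumed to be provided by the caller.
set -euo pipefail
project="openqa%2Fsalt-states-openqa"   # URL-encoded project path
for job_id in "$@"; do
    # /jobs/:id/trace returns the raw job log
    if curl -sf --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
        "https://gitlab.suse.de/api/v4/projects/$project/jobs/$job_id/trace" \
        | grep -q "Summary for openqa.suse.de"; then
        echo "$job_id: summary present"
    else
        echo "$job_id: summary MISSING"
    fi
done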

I also suspect that #162641 could actually be causing problems here. At least it is hindering us and is annoying.

My suggestions:

  1. Research upstream about known problems when applying a salt high state via an ssh-connected master
  2. Apply community best practices on how to trigger a multi-minion high state from an ssh-accessible master+minion
  3. Split the salt high state application into the non-OSD minions and the minion on OSD itself (see the sketch below)
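A rough idea of what suggestion 3 could look like, assuming the split is triggered from a shell on OSD itself; the compound target and the use of salt-call are illustrative assumptions, not an agreed implementation:

#!/bin/bash
# Hypothetical split of the high state application:
# 1. apply to all minions except openqa.suse.de via the master
# 2. apply locally on openqa.suse.de via salt-call, so the result for OSD does
#    not depend on the master's return channel or the ssh session staying alive
set -euo pipefail

sudo salt --no-color --state-output=changes -C 'not openqa.suse.de' state.apply
sudo salt-call --no-color --state-output=changes --retcode-passthrough state.apply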
Actions #4

Updated by okurz 27 days ago

  • Related to action #162641: Prevent redundant salt state.apply actions that are executed in every call - openqa-trigger-from-ibs-plugin added
Actions #5

Updated by openqa_review 26 days ago

  • Due date set to 2024-07-05

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by okurz 22 days ago

  • Due date deleted (2024-07-05)
  • Status changed from In Progress to Workable
Actions #7

Updated by okurz 14 days ago

  • Subject changed from incomplete config files on OSD due to salt - Improve salt state application from remotely accessible salt master to [timeboxed:10h] Incomplete config files on OSD due to salt - Improve salt state application from remotely accessible salt master size:S
  • Description updated (diff)
Actions #8

Updated by okurz 14 days ago

  • Status changed from Workable to In Progress

Running an experiment with

runs=400 count-fail-ratio sudo salt --no-color --state-output=changes 'openqa.suse.de' state.apply test=True | grep -v Result

to see if I can reproduce the issue of 'openqa.suse.de' results not showing up.

This yields

openqa.suse.de:
----------
          ID: auto-upgrade.service
    Function: service.dead
     Comment: Service auto-upgrade.service not present; if created in this state run, it would have been stopped
     Started: 15:56:31.939924
    Duration: 54.554 ms
     Changes:   
----------
          ID: auto-upgrade.timer
    Function: service.dead
     Comment: Service auto-upgrade.timer not present; if created in this state run, it would have been stopped
     Started: 15:56:31.995216
    Duration: 42.742 ms
     Changes:   

Summary for openqa.suse.de
--------------
Succeeded: 489 (unchanged=2)
Failed:      0
--------------
Total states run:     489
Total run time:    38.127 s

with the auto-upgrade.{service,timer} entries mentioned. To get rid of those I am running

runs=400 count-fail-ratio sudo salt --no-color --state-output=changes 'openqa.suse.de' state.apply | grep -v Result

If it is not reproducible this way, then I should try without test mode or with multiple nodes.
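To flag the specific symptom of a missing result rather than a failing state, the call could be wrapped in a check for the summary line. A small sketch, assuming a plain bash loop instead of the count-fail-ratio helper; the run count and log file names are arbitrary:

#!/bin/bash
# Hypothetical reproducer: run state.apply repeatedly and count runs where the
# summary for openqa.suse.de never shows up in the output.
set -uo pipefail
missing=0
for i in $(seq 1 400); do
    # drop test=True to actually apply states instead of a dry run
    out=$(sudo salt --no-color --state-output=changes 'openqa.suse.de' state.apply test=True) || true
    if ! grep -q "Summary for openqa.suse.de" <<<"$out"; then
        missing=$((missing + 1))
        echo "$out" > "missing-summary-run-$i.log"   # keep the evidence
    fi
done
echo "runs without a summary for openqa.suse.de: $missing/400"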

In the meantime I again looked backwards in time through the deploy jobs to check whether the summary for openqa.suse.de was missing: the corresponding job artifacts were already removed in GitLab, so we can not reproduce this in GitLab and have no logs for the older jobs. No point in continuing that route.

Actions #9

Updated by openqa_review 13 days ago

  • Due date set to 2024-07-18

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by okurz 12 days ago

  • Due date deleted (2024-07-18)
  • Status changed from In Progress to Resolved

runs=400 count-fail-ratio sudo salt --no-color --state-output=changes 'openqa.suse.de' state.apply | grep -v Result failed on some rare occasions due to other problems, e.g. xml.parsers.expat.ExpatError: syntax error: line 1, column 0 or a minion timeout, but never with a missing result. In the end I could not reproduce the problem anymore and we agreed that our changes to no longer write /etc/fstab from multiple salt states might possibly prevent the situation.
