action #176250

open

coordination #161414: [epic] Improved salt based infrastructure management

file corruption in salt controlled config files size:M

Added by okurz 2 months ago. Updated 20 days ago.

Status: Blocked
Priority: Low
Assignee:
Category: Regressions/Crashes
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

Multiple config files were somehow corrupted or incompletely written by salt. First #163790, then #175710, both on OSD. The same happened on monitor, see #176175. okurz first assumed that this might be related to too high load on OSD, which runs both the salt master and a salt minion. However, a similar problem appeared on monitor, which runs only a salt minion, so the salt master alone cannot be the cause. So far the problem has only happened on virtual machines (both OSD and monitor are VMs).

Acceptance Criteria

  • AC1: We have consistent and stable application of config files managed by salt
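
One way to check AC1 after each salt run is a small sanity check on the managed file. The following is only a minimal sketch: the function name `find_corruption` is hypothetical, and it assumes that the corruption shows up as an empty file, NUL bytes, invalid UTF-8 or a truncated line. Python's `configparser` is used as a rough stand-in and may not accept everything the real consumers of openqa.ini or the Grafana config accept.

```python
import configparser
from pathlib import Path
from typing import List


def find_corruption(path: Path) -> List[str]:
    """Return a list of human-readable problems found in an INI-style file.

    Hypothetical helper: the checks are assumptions about what "corrupted"
    looks like, not a description of the actual failures in the tickets.
    """
    problems = []
    raw = path.read_bytes()
    if not raw:
        problems.append("file is empty")
    if b"\x00" in raw:
        problems.append("contains NUL bytes")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # Stop here: the parser below needs decodable text.
        problems.append(f"invalid UTF-8 at byte {err.start}")
        return problems
    # strict=False tolerates duplicate sections/keys, interpolation=None
    # keeps '%' characters in values from being treated specially.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    try:
        parser.read_string(text, source=str(path))
    except configparser.Error as err:
        problems.append(f"not parseable as INI: {type(err).__name__}")
    return problems
```

An empty result only means that none of these checks triggered; it does not prove that the file content matches what salt intended to write.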

Suggestions

  • Try to reproduce the problem in a separate testing environment, e.g. a single VM from https://download.opensuse.org/distribution/leap/15.6/appliances/openSUSE-Leap-15.6-Minimal-VM.x86_64-kvm-and-xen.qcow2, and apply the local salt state from https://gitlab.suse.de/openqa/salt-states-openqa using the role webui and/or monitor while putting the VM under stress, e.g. with the application stress-ng
  • Read README from https://gitlab.suse.de/openqa/salt-states-openqa
  • Run salt repeatedly with a command like sudo nice env runs=300 count-fail-ratio salt --state-output=changes -C "*" state.apply queue=True | grep -v 'Result.*Clean' 2>&1 | tee -a salt_state.log
  • Give upstream research another try. So far okurz has not found anything related. Consider asking domain experts from the salt community.
  • Maybe the problem is related to the rather outdated python+salt stack within our infrastructure (as we run Leap). Once you can reproduce the problem in a clean environment, consider running an updated python and/or salt as applicable, e.g. check whether the problem also reproduces on Tumbleweed.
  • If the problem cannot be reproduced in a synthetic environment, then consider an idea from tinita: "adding another dummy.ini besides openqa.ini, with the same content, and in the loop calling salt.apply, the file is copied to a folder, so we have a list of files and can trace the changes between each call"
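
The idea from tinita could be sketched roughly as follows. This is a minimal sketch, not an agreed implementation: the names `snapshot`, `diff_snapshots` and `trace_runs` are hypothetical, and the actual salt invocation is passed in as the `apply` callable so that no specific salt command is assumed here.

```python
import difflib
import hashlib
import shutil
from pathlib import Path
from typing import Callable, List


def snapshot(src: Path, dest_dir: Path, run: int) -> Path:
    """Copy src into dest_dir under a run-numbered name and return the copy."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{src.name}.{run:04d}"
    shutil.copy2(src, dest)
    return dest


def digest(path: Path) -> str:
    """SHA-256 of the file content, used to detect any byte-level change."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def diff_snapshots(old: Path, new: Path) -> List[str]:
    """Unified diff of two snapshots; undecodable bytes are shown escaped."""
    def read(path: Path) -> List[str]:
        text = path.read_bytes().decode("utf-8", errors="backslashreplace")
        return text.splitlines(keepends=True)
    return list(difflib.unified_diff(read(old), read(new),
                                     fromfile=old.name, tofile=new.name))


def trace_runs(config: Path, dest_dir: Path, runs: int,
               apply: Callable[[], None]) -> List[int]:
    """Call apply() `runs` times and snapshot `config` after each call.

    Returns the run numbers after which the file content differed from the
    previous snapshot and prints the diff for each of them.
    """
    changed = []
    previous = snapshot(config, dest_dir, 0)  # baseline before the first run
    for run in range(1, runs + 1):
        apply()  # e.g. a wrapper around the salt state application
        current = snapshot(config, dest_dir, run)
        if digest(current) != digest(previous):
            changed.append(run)
            print("".join(diff_snapshots(previous, current)))
        previous = current
    return changed
```

Because every snapshot is kept, the folder also preserves the exact corrupted content for later inspection, even if a subsequent salt run repairs the file.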

Related issues 5 (3 open, 2 closed)

Related to openQA Infrastructure (public) - action #175989: Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S (Resolved, jbaier_cz, 2025-01-22)
Related to openQA Infrastructure (public) - action #177276: Make use of config files in openqa.ini.d for OSD specific settings size:S (Blocked, okurz)
Blocks openQA Infrastructure (public) - action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 (Blocked, okurz, 2024-07-10)
Blocks openQA Infrastructure (public) - action #176175: [alert] Grafana failed to start due to corrupted config file (Blocked, okurz, 2025-01-26)
Copied to openQA Project (public) - action #176421: Support for config files in openqa.d size:S (Resolved, mkittler)
