action #176250

open

coordination #161414: [epic] Improved salt based infrastructure management

file corruption in salt controlled config files size:M

Added by okurz 2 months ago. Updated 14 days ago.

Status:
Blocked
Priority:
Low
Assignee:
Category:
Regressions/Crashes
Start date:
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

Multiple config files were somehow corrupted by salt or incompletely written. First #163790, then #175710, both on OSD. Also on monitor, see #176175. okurz first assumed that this might be related to too high load on OSD while running both the salt master and the salt minion, but as a similar problem appeared on monitor, which is salt-minion only, the salt master alone cannot be the problem. So far the problem has only happened on virtual machines (both OSD and monitor are VMs).

Acceptance Criteria

  • AC1: We have consistent and stable application of config files managed by salt

Suggestions

  • Try to reproduce the problem in a separate testing environment, e.g. single VM from https://download.opensuse.org/distribution/leap/15.6/appliances/openSUSE-Leap-15.6-Minimal-VM.x86_64-kvm-and-xen.qcow2 and apply local salt state from https://gitlab.suse.de/openqa/salt-states-openqa using the role webui and/or monitor while putting the VM under stress, e.g. with the application stress-ng
  • Read README from https://gitlab.suse.de/openqa/salt-states-openqa
  • Run salt repeatedly with a command like sudo nice env runs=300 count-fail-ratio salt --state-output=changes -C "*" state.apply queue=True 2>&1 | grep -v 'Result.*Clean' | tee -a salt_state.log
  • Give another try at upstream research. So far okurz has not found anything related. Consider asking domain experts from the salt community.
  • Maybe the problem is related to our rather outdated python+salt stack within our infrastructure (as we run Leap). So after you can reproduce the problem in a clean environment, consider running an updated python and/or salt as applicable, e.g. try whether you can also reproduce the problem on Tumbleweed.
  • If the problem can not be reproduced in a synthetic environment then consider an idea from tinita: "adding another dummy.ini besides openqa.ini, with the same content, and in the loop calling salt.apply, the file is copied to a folder, so we have a list of files and can trace the changes between each call"
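The dummy.ini tracing idea from the last suggestion could be sketched roughly like this (paths and the archive directory are hypothetical, and the demo stubs out the real salt run with a throwaway file):

```shell
# Sketch of tinita's tracing idea: after every state.apply, archive the
# managed file with a timestamped name so consecutive snapshots can be
# diffed to pinpoint when corruption appears. Paths are hypothetical.
set -eu
snapshot() {  # snapshot FILE OUTDIR -- copy FILE into OUTDIR with a timestamped name
    mkdir -p "$2"
    cp -a "$1" "$2/$(basename "$1").$(date -Is)"
}
# Demo on a throwaway file instead of a real salt-managed one:
tmp=$(mktemp -d)
printf '[global]\n' > "$tmp/dummy.ini"
snapshot "$tmp/dummy.ini" "$tmp/trace"
ls "$tmp/trace"
# The real loop would be something like:
#   for i in $(seq 300); do
#       salt-call --local state.apply queue=True >/dev/null
#       snapshot /etc/openqa/openqa.ini /var/tmp/ini-trace
#   done
```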

Related issues (3 open, 2 closed)

Related to openQA Infrastructure (public) - action #175989: Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S (Resolved, jbaier_cz, 2025-01-22)

Related to openQA Infrastructure (public) - action #177276: Make use of config files in openqa.ini.d for OSD specific settings size:S (Blocked, okurz)

Blocks openQA Infrastructure (public) - action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 (Blocked, okurz, 2024-07-10)

Blocks openQA Infrastructure (public) - action #176175: [alert] Grafana failed to start due to corrupted config file (Blocked, okurz, 2025-01-26)

Copied to openQA Project (public) - action #176421: Support for config files in openqa.d size:S (Resolved, mkittler)

Actions #1

Updated by okurz 2 months ago

  • Copied from action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 added
Actions #2

Updated by okurz 2 months ago

  • Copied from deleted (action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17)
Actions #3

Updated by okurz 2 months ago

  • Blocks action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 added
Actions #4

Updated by okurz 2 months ago

  • Blocks action #176175: [alert] Grafana failed to start due to corrupted config file added
Actions #5

Updated by okurz 2 months ago

  • Subject changed from file corruption in salt controlled config files to file corruption in salt controlled config files size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by okurz 2 months ago

  • Description updated (diff)
Actions #7

Updated by okurz 2 months ago

  • Description updated (diff)
Actions #8

Updated by okurz 2 months ago

  • Description updated (diff)
Actions #9

Updated by ybonatakis 2 months ago

  • Assignee set to ybonatakis
Actions #10

Updated by ybonatakis 2 months ago

  • Status changed from Workable to In Progress
Actions #11

Updated by ybonatakis 2 months ago

  • Status changed from In Progress to Workable
  • Assignee deleted (ybonatakis)

Here is what I tried to bring up a testing environment.

qemu-system-x86_64 -enable-kvm -m 2048 -smp 2 -drive file=openSUSE-Leap-15.6-Minimal-VM.x86_64-kvm-and-xen.qcow2,format=qcow2

Then inside the VM

  • zypper in salt-minion git-core htop vim systemd-coredump stress-ng (apparently I don't need everything, but whatever)
  • reboot and then echo testing >{/etc/salt/minion_id,/etc/hostname}
  • touch /etc/salt/grains and echo "roles: worker" > /etc/salt/grains
  • systemctl enable --now salt-minion, but I got errors that the master hostname "salt" could not be resolved. Partly solved by running sed -i "s/localhost/& salt/" /etc/hosts, but there are still errors; something is missing from the setup.

I could run salt-call --local state.apply, but it was failing.

On other notes: I need to run the image with a proper network setup and a serial console, as I couldn't clone from gitlab.suse.de. I did learn about a public mirror, https://github.com/os-autoinst/salt-states-openqa/, but cloning it didn't solve the problem: salt-call still couldn't find the highstate.
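For reference, a masterless salt-call setup needs file_client: local and file_roots pointing at the checked-out states; that sidesteps the "master hostname salt not found" errors entirely. A minimal sketch, demoed on a scratch directory standing in for /etc/salt:

```shell
# Minimal masterless salt-call configuration: with file_client set to
# local, salt-call never contacts a master. The scratch directory here
# stands in for /etc/salt; file paths are salt's defaults.
set -eu
conf=$(mktemp -d)
mkdir -p "$conf/minion.d"
cat > "$conf/minion.d/local.conf" <<'EOF'
file_client: local
file_roots:
  base:
    - /srv/salt
EOF
cat "$conf/minion.d/local.conf"
# With the states checked out, e.g.
#   git clone https://github.com/os-autoinst/salt-states-openqa /srv/salt
# the highstate can then be applied masterless with:
#   salt-call --local state.apply
```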

Actions #12

Updated by okurz 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
Actions #13

Updated by okurz 2 months ago · Edited

On osiris I was installing a new VM using br0. That failed with "cannot execute binary /usr/lib/qemu-bridge-helper: Permission denied: Transport endpoint is not connected". Following https://www.reddit.com/r/openSUSE/comments/q9jcmy/tutorial_how_to_use_bridged_network_on_a_gnome/ I ran gpasswd -a okurz kvm

cd /var/lib/libvirt/images
https://download.opensuse.org/distribution/leap/15.6/appliances/openSUSE-Leap-15.6-Minimal-VM.x86_64-kvm-and-xen.qcow2
cp -al openSUSE-Leap-15.6-Minimal-VM.x86_64-kvm-and-xen.qcow2 okurz-poo176250.qcow2

Created VM "okurz-poo176250" using that qcow2 and br0.

Standard root QA testing password for manual tests s…g

Currently reachable as d4-147.qe.nue2.suse.org. Configured a local user account and added my SSH key so can now use ssh d4-147.qe.nue2.suse.org.

Configured screen, salt and now running

nice env runs=300 count-fail-ratio sh -c "salt --state-output=changes --no-color \* state.apply queue=True | grep -v 'Result.*Clean' 2>&1 | tee -a salt_state-$(date -Is).log && ls -l /etc/openqa/openqa.ini"

got

    Data failed to compile:
----------
    No matching sls found for 'openqa.server' in env 'base'
----------
    No matching sls found for 'openqa.openqa-trigger-from-ibs' in env 'base'
----------
    No matching sls found for 'certificates.dehydrated' in env 'base'

and corrupted btrfs?!?

Actions #14

Updated by openqa_review 2 months ago

  • Due date set to 2025-02-13

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by okurz 2 months ago · Edited

Recreated the VM and trying again.

Now running

nice env runs=300 count-fail-ratio sh -c "salt --state-output=changes --no-color \* state.apply queue=True | grep -v 'Result.*Clean' 2>&1 | tee -a salt_state-$(date -Is).log 2>&1 && cp -a /etc/openqa/openqa.ini{,.\$(date -Is)} && ls -l /etc/openqa/openqa.ini*"

and in parallel

stress-ng --cpu 8 --iomix 4 --vm 2 --vm-bytes 128M --fork 4

EDIT: 2025-01-30 14:23Z intermediate state

-rw-r--r-- 1 geekotest root 17232 Jan 30 14:18 /etc/openqa/openqa.ini
-rw-r--r-- 1 geekotest root 17232 Jan 30 12:18 /etc/openqa/openqa.ini.2025-01-30T12:19:05+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 12:44 /etc/openqa/openqa.ini.2025-01-30T12:19:27+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 12:54 /etc/openqa/openqa.ini.2025-01-30T12:57:26+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 12:59 /etc/openqa/openqa.ini.2025-01-30T13:01:53+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:03 /etc/openqa/openqa.ini.2025-01-30T13:06:21+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:08 /etc/openqa/openqa.ini.2025-01-30T13:10:42+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:12 /etc/openqa/openqa.ini.2025-01-30T13:15:03+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:16 /etc/openqa/openqa.ini.2025-01-30T13:19:28+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:21 /etc/openqa/openqa.ini.2025-01-30T13:23:47+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:25 /etc/openqa/openqa.ini.2025-01-30T13:28:08+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:30 /etc/openqa/openqa.ini.2025-01-30T13:32:36+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:34 /etc/openqa/openqa.ini.2025-01-30T13:36:51+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:38 /etc/openqa/openqa.ini.2025-01-30T13:41:22+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:43 /etc/openqa/openqa.ini.2025-01-30T13:45:45+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:47 /etc/openqa/openqa.ini.2025-01-30T13:50:13+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:52 /etc/openqa/openqa.ini.2025-01-30T13:54:40+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:56 /etc/openqa/openqa.ini.2025-01-30T13:59:10+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 14:00 /etc/openqa/openqa.ini.2025-01-30T14:03:35+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 14:05 /etc/openqa/openqa.ini.2025-01-30T14:07:57+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 14:09 /etc/openqa/openqa.ini.2025-01-30T14:12:27+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 14:14 /etc/openqa/openqa.ini.2025-01-30T14:16:50+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 14:18 /etc/openqa/openqa.ini.2025-01-30T14:21:14+00:00
## count-fail-ratio: Run: 20. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < 15.00%
## mean runtime: 264546±3954.77 ms

so all file sizes are correct while the system has a load average of 43.88 44.04 44.65

Actions #16

Updated by okurz 2 months ago

## count-fail-ratio: Run: 271. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < 1.10%
## mean runtime: 268072±5618.01 ms

with no differences in file sizes. That means salt on Leap 15.6 with our salt states can be very stable even under high system load. The differences vs. OSD are at least that some rules fail to apply there, and that OSD has other load on the system, e.g. openQA jobs or deployments running while salt is applying. What happens if openQA packages are upgraded, changing ini files, while salt is running? Trying an experiment.

Actions #17

Updated by okurz 2 months ago · Edited

from /var/log/salt/minion

OSError: [Errno 28] No space left on device
salt.exceptions.CommandExecutionError: Unable to write file '/etc/openqa/openqa.ini'. Exception: [Errno 28] No space left on device
2025-01-31 21:48:31,329 [salt.state       :327 ][ERROR   ][5219] An exception occurred in this state: OSError: [Errno 28] No space left on device

and

-rw-r--r-- 1 geekotest root 17232 Jan 31 21:25 /etc/openqa/openqa.ini.2025-01-31T21:28:43+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 31 21:30 /etc/openqa/openqa.ini.2025-01-31T21:34:04+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 31 21:36 /etc/openqa/openqa.ini.2025-01-31T21:39:34+00:00
-rw-r--r-- 1 geekotest root  4024 Jan 31 21:41 /etc/openqa/openqa.ini.2025-01-31T21:43:16+00:00
-rw-r--r-- 1 geekotest root  6801 Jan 31 21:45 /etc/openqa/openqa.ini.2025-01-31T21:49:50+00:00

so in the end the storage was depleted, which caused the incomplete file write. I assume the repeated zypper calls, with no chance for snapper to trigger a cleanup within a reasonable time, caused the problems here. Triggered systemctl start snapper-{timeline,boot} to clean up, but that did not clean up much. Now running

while sleep 20; do zypper -n in --force openQA; sleep 5; snapper cleanup number; done

and continuing the experiment, but without stress-ng as that mostly stalls btrfs cleanup.
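Given that ENOSPC caused the truncated write above, a simple guard before each state.apply run would turn a depleted filesystem into an explicit failure instead of a silently corrupted file. A sketch (the 1 GiB threshold is an arbitrary example, and /etc stands in for the filesystem holding the config files):

```shell
# Refuse to run state.apply when the filesystem holding the config files
# is nearly full, so an ENOSPC-truncated write cannot happen silently.
# Threshold and path are illustrative examples, not project policy.
set -eu
free_kb=$(df --output=avail /etc | tail -n1)
if [ "$free_kb" -lt 1048576 ]; then
    echo "less than 1 GiB free on /etc, refusing to run state.apply" >&2
    exit 1
fi
echo "ok: ${free_kb} KiB available"
```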

Actions #18

Updated by okurz 2 months ago

  • Copied to action #176421: Support for config files in openqa.d size:S added
Actions #19

Updated by okurz about 2 months ago

No real config file corruption has occurred so far

## count-fail-ratio: Run: 1515. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < .19%
## mean runtime: 125157±6567.69 ms

and nobody had a better idea of what to try besides again trying to break production, which I could consider doing over a weekend (or never).

Actions #20

Updated by okurz about 2 months ago

  • Due date deleted (2025-02-13)
  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)
  • Priority changed from High to Low
  • Target version changed from Ready to future
Actions #21

Updated by tinita about 2 months ago

  • Related to action #175989: Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S added
Actions #22

Updated by okurz 14 days ago

nicksinger had the great idea to use inotifywait to monitor file operations on /etc/openqa/openqa.ini. I assume only one of three actors should be writing to /etc/openqa/openqa.ini:

  1. salt from https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls?ref_type=heads#L37
  2. RPM package upgrades with zypper. But we use config(noreplace) so that shouldn't be a thing for us, or is it?
  3. human operators making manual changes, usually in that environment with vim. That would also manifest as .swp and .ini~ files being written temporarily, with openqa.ini then replaced in one go from the swap file

I was running while true; do inotifywait -e CLOSE_WRITE,MOVE,MOVE_SELF,CREATE,DELETE,DELETE_SELF --monitor --format '%f %e %T' --timefmt '%F-%T' /etc/openqa/ ; done and observed blocks that look like this

openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:04
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:04
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:04
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:04
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:04
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
database.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:06
database.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:06
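A small helper can condense such raw inotifywait logs by counting CLOSE_WRITE events per file and second, which makes write bursts like the 20-event block above easy to spot. This assumes the log format produced by the `--format '%f %e %T' --timefmt '%F-%T'` options used here:

```shell
# Summarize an inotifywait log: count CLOSE_WRITE events per file and
# per second, most frequent first. Field layout matches the log above:
# $1 = filename, $2 = event list, $3 = timestamp.
summarize() {  # summarize LOGFILE
    awk '/CLOSE_WRITE/ { n[$1 " " $3]++ } END { for (k in n) print n[k], k }' "$1" | sort -rn
}
# Demo on a fabricated two-line log:
tmp=$(mktemp)
printf 'openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05\nopenqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05\n' > "$tmp"
summarize "$tmp"
```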

which I can also trigger by calling salt 'openqa.suse.de' state.apply openqa.server, so I assume the salt ini module performs multiple individual open+close operations instead of a single write with all changes at once. This might be related to the corruption we observe, and is one more reason to consider moving to separate files in a config directory.
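For comparison, a write-then-rename pattern would produce exactly one event and never expose a partial file, since rename(2) is atomic within a filesystem. A generic sketch of that pattern (not salt's actual implementation):

```shell
# Atomic file replacement: write the new content to a temp file in the
# SAME directory, then rename over the destination. Readers either see
# the old complete file or the new complete file, never a partial one.
atomic_write() {  # atomic_write DEST -- write stdin to DEST atomically
    dest=$1
    tmp=$(mktemp "$(dirname "$dest")/.$(basename "$dest").XXXXXX")
    cat > "$tmp" && mv "$tmp" "$dest"
}
# Demo on a scratch directory:
d=$(mktemp -d)
printf '[global]\nkey = value\n' | atomic_write "$d/openqa.ini"
cat "$d/openqa.ini"
```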

Actions #23

Updated by okurz 14 days ago

  • Related to action #177276: Make use of config files in openqa.ini.d for OSD specific settings size:S added
Actions #24

Updated by okurz 14 days ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz
  • Target version changed from future to Ready

I found a way to prevent the above with #177276
