action #163790

closed

OSD openqa.ini is corrupted, invalid characters size:M

Added by okurz 5 months ago. Updated 3 months ago.

Status: Resolved
Priority: Low
Assignee:
Category: Regressions/Crashes
Start date: 2024-07-10
Due date:
% Done: 0%
Estimated time:
Description

Observation

I copied the corrupted config file to /etc/openqa/openqa.ini.corrupted-2024-07-11-okurz-poo163790

On backup-vm.qe.nue2.suse.org I see:

okurz@backup-vm:~> ls -la /home/rsnapshot/*/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 13056 Jul 11 19:32 /home/rsnapshot/alpha.0/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 13056 Jul 11 15:32 /home/rsnapshot/alpha.1/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 13056 Jul 11 12:32 /home/rsnapshot/alpha.2/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 2 martchus root 13056 Jul 11 07:32 /home/rsnapshot/alpha.3/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 2 martchus root 13056 Jul 11 07:32 /home/rsnapshot/alpha.4/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 13056 Jul 11 03:32 /home/rsnapshot/alpha.5/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 13056 Jul 10 03:32 /home/rsnapshot/beta.0/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 13056 Jul  9 03:32 /home/rsnapshot/beta.1/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 13056 Jul  8 03:32 /home/rsnapshot/beta.2/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 13056 Jul  7 03:32 /home/rsnapshot/beta.3/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 13056 Jul  6 03:32 /home/rsnapshot/beta.4/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 13056 Jul  5 03:32 /home/rsnapshot/beta.5/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 13056 Jul  4 03:32 /home/rsnapshot/beta.6/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 10259 Dec 17  2023 /home/rsnapshot/_delete.14764/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 10267 Jan 21 11:32 /home/rsnapshot/_delete.15309/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root  1976 May 31 09:32 /home/rsnapshot/delta.0/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 10312 Apr 26 03:33 /home/rsnapshot/delta.1/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 10312 Mar 29 03:32 /home/rsnapshot/delta.2/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 10463 Jun 28 03:11 /home/rsnapshot/gamma.0/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 10463 Jun 21 03:11 /home/rsnapshot/gamma.1/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 10463 Jun 14 03:32 /home/rsnapshot/gamma.2/openqa.suse.de/etc/openqa/openqa.ini
-rw-r--r-- 1 martchus root 10463 Jun  7 03:32 /home/rsnapshot/gamma.3/openqa.suse.de/etc/openqa/openqa.ini

Judging from the file sizes, the snapshot from 2024-06-28 (gamma.0) seems to be the last good one. I copied that config back to OSD with

ssh backup-vm.qe.nue2.suse.org "cat /home/rsnapshot/gamma.0/openqa.suse.de/etc/openqa/openqa.ini" | ssh osd "cat - | sudo tee /etc/openqa/openqa.ini"

and restarted the openqa-webui service.
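
File size alone is only a rough indicator of which snapshot is intact. As a minimal sketch, assuming the corruption shows up as bytes outside printable ASCII (the ticket mentions invalid, non-ASCII characters), the snapshot copies could also be scanned for such bytes. The function names and the byte whitelist are hypothetical, and a config that legitimately contains UTF-8 text would need a wider whitelist.

```python
from pathlib import Path

# Bytes accepted in a plain-text INI file: tab, newline, carriage return
# and printable ASCII (0x20-0x7E). This whitelist is an assumption.
ALLOWED = frozenset({0x09, 0x0A, 0x0D}) | frozenset(range(0x20, 0x7F))


def invalid_offsets(data: bytes, limit: int = 10) -> list[int]:
    """Return the offsets of the first `limit` bytes that are not in ALLOWED."""
    offsets = []
    for offset, byte in enumerate(data):
        if byte not in ALLOWED:
            offsets.append(offset)
            if len(offsets) >= limit:
                break
    return offsets


def scan_snapshots(root: str, pattern: str) -> dict[str, list[int]]:
    """Map every file matching `pattern` below `root` to its invalid offsets."""
    return {
        str(path): invalid_offsets(path.read_bytes())
        for path in sorted(Path(root).glob(pattern))
    }
```

On the backup host this could be called as `scan_snapshots("/home/rsnapshot", "*/openqa.suse.de/etc/openqa/openqa.ini")`, using the path layout from the listing above. An empty list for a snapshot means no suspicious bytes were found in that copy, which does not rule out missing content.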

Suggestions

  • Enable filesystem checksums (can be enabled for ext4) and check dmesg output in case of corruption
  • Ask around if there might be other options (especially if this has e.g. a big performance hit or version requirements we can't cope with)
  • Read: https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums
  • Check for any problematic configurations in our salt states
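
As the title of the linked page suggests, the ext4 feature covers metadata, so damaged file content might still go unnoticed. A user-space complement, shown here only as a minimal sketch and not as something this ticket implemented, would be to record a SHA-256 digest of the known-good config and compare it later. The function names are hypothetical, and a legitimate change by salt would of course also change the digest, so the baseline would need updating after every intended change.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()


def changed_since_baseline(path: str, baseline_digest: str) -> bool:
    """Return True if the file content no longer matches the recorded digest."""
    with open(path, "rb") as handle:
        return sha256_hex(handle.read()) != baseline_digest
```

The baseline would be taken once with `sha256_hex(open(path, "rb").read())` right after a verified deployment and stored outside the monitored filesystem.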

Out of scope

  • Write a filesystem driver

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure (public) - action #167584: grafana-server on monitor.qe.nue2.suse.org yields "502 Bad Gateway", fails to start since 2024-09-28 03:57Z (Resolved, okurz, 2024-09-29)

Related to openQA Project (public) - action #168721: OSD openqa.ini grossly incomplete (Resolved, okurz, 2024-10-22)

Copied from openQA Infrastructure (public) - action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M (Resolved, okurz, 2024-07-10)

Actions #1

Updated by okurz 5 months ago

  • Copied from action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M added
Actions #2

Updated by okurz 5 months ago

  • Description updated (diff)
Actions #3

Updated by okurz 5 months ago

  • Description updated (diff)
  • Due date deleted (2024-07-25)
  • Status changed from New to Blocked
  • Priority changed from Urgent to Normal

Repair applied, blocked on #163592.

Actions #4

Updated by okurz 5 months ago

  • Status changed from Blocked to In Progress

#163592 was resolved. I didn't see the same problem again. I assume that files can become corrupted if salt is called while the system is under heavy load. I am researching whether this is known elsewhere.

Actions #5

Updated by openqa_review 5 months ago

  • Due date set to 2024-08-10

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by okurz 5 months ago

I asked in #discuss-salt https://suse.slack.com/archives/C02JMF41G9E/p1722002950510019

hi, anyone ever had the case that salt would incompletely write managed files? We have observed already two times that in files either content is missing or invalid, non-ASCII characters are included in files managed by salt or changed by salt.

No response so far. In the meantime I am reading https://docs.saltproject.io/salt/user-guide/en/latest/

Actions #7

Updated by mkittler 5 months ago

  • Subject changed from OSD openqa.ini is corrupted, invalid characters to OSD openqa.ini is corrupted, invalid characters size:M
  • Description updated (diff)
Actions #8

Updated by okurz 5 months ago

  • Due date changed from 2024-08-10 to 2024-09-20
  • Priority changed from Normal to Low
Actions #9

Updated by okurz 5 months ago

I would like to work on #164427 first because I assume there is a chance that files are corrupted or incompletely written if the system is stalled due to #164427

Actions #10

Updated by okurz 5 months ago

  • Due date deleted (2024-09-20)
  • Status changed from In Progress to Blocked
  • Target version changed from Ready to Tools - Next

I would like to work on #164427 first because I assume there is a chance that files are corrupted or incompletely written if the system is stalled due to #164427

Actions #11

Updated by okurz 3 months ago

  • Related to action #167584: grafana-server on monitor.qe.nue2.suse.org yields "502 Bad Gateway", fails to start since 2024-09-28 03:57Z added
Actions #12

Updated by okurz 3 months ago

  • Status changed from Blocked to Resolved
  • Target version changed from Tools - Next to Ready

#164427 was resolved. No new corruptions were found on OSD, but an issue that looks closely related showed up in #167584.

Actions #13

Updated by okurz about 2 months ago
