Project

General

Profile

Actions

action #156460

closed

Potential FS corruption on osd due to 2 VMs accessing the same disk

Added by jbaier_cz 2 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2024-03-01
Due date:
% Done:

0%

Estimated time:

Description

Observation

Users noticed slowness of osd in https://suse.slack.com/archives/C02CANHLANP/p1709297645213609; openqa-monitor.qa.suse.de also show problem with availability.

Logs on osd shows potential problem with FS

Mar 01 14:29:14 openqa salt-master[25856]: [ERROR   ] Unable to remove /var/cache/salt/master/jobs/26/4669e8a06e5502583ba67b138a9c30b97efbfff1f8af0b92f937ad8b70035d: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #467326: comm salt-master: deleted inode referenced: 467329
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #467326: comm salt-master: deleted inode referenced: 467329
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #428053: comm salt-master: deleted inode referenced: 428056
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #428053: comm salt-master: deleted inode referenced: 428056
Mar 01 14:29:14 openqa salt-master[25856]: [ERROR   ] Unable to remove /var/cache/salt/master/jobs/08/96cf9ed4cc58d8c044fe257e5e977516e49383070eea5680e3f8d53fc31712: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #358221: comm salt-master: deleted inode referenced: 358225
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #358221: comm salt-master: deleted inode referenced: 358225
Mar 01 14:29:14 openqa salt-master[25856]: [ERROR   ] Unable to remove /var/cache/salt/master/jobs/eb/8843afe01ce61b501612957cc3df3a3d8371a9c2694ebd800b47d514066853: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa openqa-websockets-daemon[15372]: [debug] [pid:15372] Updating seen of worker 1951 from worker_status (free)

There might be a situation where two VMs were running with the same backing device according to https://suse.slack.com/archives/C02CANHLANP/p1709299401351479?thread_ts=1709297645.213609&cid=C02CANHLANP

The server was rebooted to get it to consistent state, but unfortunately due the FS corruption osd is currently in the maintenance mode and needs recovery.


Files

duplicate-ids.txt (2.72 KB) duplicate-ids.txt tinita, 2024-03-01 16:02

Related issues 4 (1 open3 closed)

Related to openQA Infrastructure - action #156481: cron -> (fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log failed / No route to host / openqa.suse.deResolvedlivdywan2023-07-18

Actions
Related to QA - action #132149: Coordinate with Eng-Infra to get simple management access to VMs (o3/osd/qa-jump.qe.nue2.suse.org) size:MBlockedokurz2023-06-29

Actions
Copied to openQA Infrastructure - action #156532: lessons learned about "Potential FS corruption on osd due to 2 VMs accessing the same disk" size:SResolvedokurz2024-03-01

Actions
Copied to openQA Infrastructure - action #156535: Handle unfinished SLE maintenance tests due to FS corruption on OSD 2024-03-01Resolveddheidler2024-03-012024-03-19

Actions
Actions

Also available in: Atom PDF