action #156460

closed

Potential FS corruption on osd due to 2 VMs accessing the same disk

Added by jbaier_cz about 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2024-03-01
Due date:
% Done: 0%
Estimated time:

Description

Observation

Users noticed slowness of osd (see https://suse.slack.com/archives/C02CANHLANP/p1709297645213609 ); openqa-monitor.qa.suse.de also shows availability problems.

Logs on osd show a potential problem with the FS:

Mar 01 14:29:14 openqa salt-master[25856]: [ERROR   ] Unable to remove /var/cache/salt/master/jobs/26/4669e8a06e5502583ba67b138a9c30b97efbfff1f8af0b92f937ad8b70035d: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #467326: comm salt-master: deleted inode referenced: 467329
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #467326: comm salt-master: deleted inode referenced: 467329
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #428053: comm salt-master: deleted inode referenced: 428056
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #428053: comm salt-master: deleted inode referenced: 428056
Mar 01 14:29:14 openqa salt-master[25856]: [ERROR   ] Unable to remove /var/cache/salt/master/jobs/08/96cf9ed4cc58d8c044fe257e5e977516e49383070eea5680e3f8d53fc31712: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #358221: comm salt-master: deleted inode referenced: 358225
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #358221: comm salt-master: deleted inode referenced: 358225
Mar 01 14:29:14 openqa salt-master[25856]: [ERROR   ] Unable to remove /var/cache/salt/master/jobs/eb/8843afe01ce61b501612957cc3df3a3d8371a9c2694ebd800b47d514066853: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa openqa-websockets-daemon[15372]: [debug] [pid:15372] Updating seen of worker 1951 from worker_status (free)
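
To see which inodes are affected without scrolling through repeated lines, the kernel messages above could be grouped per device and inode. The following is a minimal sketch, assuming the journal lines look exactly like the excerpt shown here; the regular expression, its group names and the function name are illustrative and not part of any openQA or kernel tooling.

```python
import re
from collections import defaultdict

# Pattern inferred only from the log excerpt in this ticket (assumption):
# "EXT4-fs error (device <dev>): <func>:<line>: inode #<n>: comm <proc>: <message>"
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\): "
    r"(?P<function>\w+):\d+: inode #(?P<inode>\d+): "
    r"comm (?P<comm>[^:]+): (?P<message>.*)$"
)


def summarize_ext4_errors(lines):
    """Map (device, inode) to the set of distinct messages reported for it."""
    summary = defaultdict(set)
    for line in lines:
        match = EXT4_ERROR.search(line)
        if match:  # lines from other services are skipped
            key = (match["device"], int(match["inode"]))
            summary[key].add(match["message"])
    return dict(summary)
```

Fed with the output of something like `journalctl -k`, this collapses the duplicated lines into one entry per inode; it only reads the log and does not replace a proper `fsck` of the device.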

According to https://suse.slack.com/archives/C02CANHLANP/p1709299401351479?thread_ts=1709297645.213609&cid=C02CANHLANP two VMs might have been running with the same backing device.

The server was rebooted to get it into a consistent state, but unfortunately, due to the FS corruption, osd is currently in maintenance mode and needs recovery.


Files

duplicate-ids.txt (2.72 KB) tinita, 2024-03-01 16:02

Related issues 4 (1 open, 3 closed)

Related to openQA Infrastructure - action #156481: cron -> (fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log failed / No route to host / openqa.suse.de (Resolved, livdywan, 2023-07-18)
Related to QA - action #132149: Coordinate with Eng-Infra to get simple management access to VMs (o3/osd/qa-jump.qe.nue2.suse.org) size:M (Blocked, okurz, 2023-06-29)
Copied to openQA Infrastructure - action #156532: lessons learned about "Potential FS corruption on osd due to 2 VMs accessing the same disk" size:S (Resolved, okurz, 2024-03-01)
Copied to openQA Infrastructure - action #156535: Handle unfinished SLE maintenance tests due to FS corruption on OSD 2024-03-01 (Resolved, dheidler, 2024-03-01, 2024-03-19)

Actions #1

Updated by jbaier_cz about 2 months ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
Actions #2

Updated by jbaier_cz about 2 months ago

  • Target version set to Ready
Actions #3

Updated by nicksinger about 2 months ago

  • Status changed from In Progress to Feedback
  • Priority changed from Immediate to Normal

We had to roll back the database and the root disk, so we lost data between 12:00 CET and the recovery at ~15:30 CET. OSD seems to be back up and running again. Keeping the ticket in Feedback to collect potential regressions/issues from testers.

Actions #4

Updated by nicksinger about 2 months ago

@gschlotter created a jira-card to remove duplicate/local VM configs in the future.

Actions #5

Updated by tinita about 2 months ago · Edited

Some stats about which test ids are duplicated in the testresults dir, because the autoincrement wasn't set:

% for i in 13646 13640 13629 13641 13634 13643 13644 13647 13633 13650 13645 13637 13651 13648 13638 13649 13652 13653 13639 13654 13655 13658 13659 13661 13662 13660 13657 13656; do ls  /var/lib/openqa/testresults/$i >>testresults; done
% cat testresults  | perl -nlwE'if (m/^(\d+)/) { my $id = $1; $count{$id}++ } END { for my $key (sort keys %count) { say "$key: $count{$key}" if $count{$key} > 1 } }' | wc -l
232
% cat testresults  | perl -nlwE'if (m/^(\d+)/) { my $id = $1; $count{$id}++ } END { for my $key (sort keys %count) { say "$key: $count{$key}" if $count{$key} > 1 } }' | head -1
13660451: 2
% cat testresults  | perl -nlwE'if (m/^(\d+)/) { my $id = $1; $count{$id}++ } END { for my $key (sort keys %count) { say "$key: $count{$key}" if $count{$key} > 1 } }' | tail -1
13661726: 2

I attached the list of duplicate ids.
The first duplicated testresult has a timestamp of Mar 1 12:42
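
The shell loop and Perl one-liner above can also be expressed as a small script, which is easier to rerun after cleanup. This is a minimal sketch under the same assumption the Perl regex makes, namely that each entry name in a testresults subdirectory starts with the numeric job id; the function names are hypothetical and not part of openQA.

```python
import re
from collections import Counter
from pathlib import Path

LEADING_ID = re.compile(r"^(\d+)")  # same match as m/^(\d+)/ in the one-liner


def list_entries(base, prefixes):
    """Yield entry names from base/<prefix> for every prefix directory that exists."""
    for prefix in prefixes:
        directory = Path(base) / prefix
        if directory.is_dir():
            for entry in directory.iterdir():
                yield entry.name


def duplicated_ids(entry_names):
    """Return {id: count} for ids that occur more than once, sorted by id string."""
    counts = Counter()
    for name in entry_names:
        match = LEADING_ID.match(name)
        if match:  # names without a leading number are ignored
            counts[match.group(1)] += 1
    return {key: n for key, n in sorted(counts.items()) if n > 1}


if __name__ == "__main__":
    # Two of the subdirectories from the loop above; extend the list as needed.
    names = list_entries("/var/lib/openqa/testresults", ["13660", "13661"])
    for job_id, count in duplicated_ids(names).items():
        print(f"{job_id}: {count}")
```

The output format mirrors the one-liner (`id: count`), so the result can be compared directly with the attached duplicate-ids.txt.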

Actions #6

Updated by livdywan about 2 months ago

  • Related to action #156481: cron -> (fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log failed / No route to host / openqa.suse.de added
Actions #7

Updated by jbaier_cz about 2 months ago

  • Related to action #132149: Coordinate with Eng-Infra to get simple management access to VMs (o3/osd/qa-jump.qe.nue2.suse.org) size:M added
Actions #8

Updated by pcervinka about 2 months ago

I'm checking results in the maintenance dashboard and I can see at http://dashboard.qam.suse.de/blocked?group_names=hpc&incident=32814 that jobs are either running or not finished. But the job groups in openQA are green and empty: https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=364 https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=434.

Actions #9

Updated by okurz about 2 months ago

  • Copied to action #156532: lessons learned about "Potential FS corruption on osd due to 2 VMs accessing the same disk" size:S added
Actions #10

Updated by okurz about 2 months ago

  • Copied to action #156535: Handle unfinished SLE maintenance tests due to FS corruption on OSD 2024-03-01 added
Actions #11

Updated by okurz about 2 months ago

pcervinka wrote in #note-8:

I'm checking results in the maintenance dashboard and I can see at http://dashboard.qam.suse.de/blocked?group_names=hpc&incident=32814 that jobs are either running or not finished. But the job groups in openQA are green and empty: https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=364 https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=434.

I created #156535 for that

Actions #12

Updated by tinita about 2 months ago

So the first job id after the database was recovered is 13660451:

openqa=> select id, t_created from jobs where t_created >= '2024-03-01 11:05:00' order by t_created asc limit 1;
    id    |      t_created      
----------+---------------------
 13660451 | 2024-03-01 14:35:30
(1 row)
Actions #13

Updated by okurz about 2 months ago

  • Status changed from Feedback to Resolved

As discussed in the infra daily, we clarified that we have two follow-ups and no other issues, so resolving.
