action #156460
Potential FS corruption on osd due to 2 VMs accessing the same disk
Status: closed
Description
Observation
Users noticed slowness of osd in https://suse.slack.com/archives/C02CANHLANP/p1709297645213609; openqa-monitor.qa.suse.de also shows problems with availability.
Logs on osd show a potential problem with the FS:
Mar 01 14:29:14 openqa salt-master[25856]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/26/4669e8a06e5502583ba67b138a9c30b97efbfff1f8af0b92f937ad8b70035d: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #467326: comm salt-master: deleted inode referenced: 467329
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #467326: comm salt-master: deleted inode referenced: 467329
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #428053: comm salt-master: deleted inode referenced: 428056
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #428053: comm salt-master: deleted inode referenced: 428056
Mar 01 14:29:14 openqa salt-master[25856]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/08/96cf9ed4cc58d8c044fe257e5e977516e49383070eea5680e3f8d53fc31712: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #358221: comm salt-master: deleted inode referenced: 358225
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #358221: comm salt-master: deleted inode referenced: 358225
Mar 01 14:29:14 openqa salt-master[25856]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/eb/8843afe01ce61b501612957cc3df3a3d8371a9c2694ebd800b47d514066853: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa openqa-websockets-daemon[15372]: [debug] [pid:15372] Updating seen of worker 1951 from worker_status (free)
According to https://suse.slack.com/archives/C02CANHLANP/p1709299401351479?thread_ts=1709297645.213609&cid=C02CANHLANP there might have been a situation where two VMs were running with the same backing device.
The server was rebooted to get it into a consistent state, but unfortunately, due to the FS corruption, osd is currently in maintenance mode and needs recovery.
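The recovery step itself is not recorded in this ticket; a typical approach for this class of ext4 damage is an offline filesystem check (a sketch, assuming standard e2fsprogs tooling, demonstrated on a throwaway image file rather than the real vda1):

```shell
# Create a small scratch ext4 image so the check can be demonstrated
# without touching a real disk (the image file is purely illustrative):
truncate -s 8M /tmp/demo.img
mkfs.ext4 -q -F /tmp/demo.img

# Force a full check and auto-repair; on osd the equivalent would run
# against the unmounted /dev/vda1 (e.g. from a rescue system):
fsck.ext4 -f -y /tmp/demo.img
```

On a root disk this has to happen from a rescue or maintenance environment, since the filesystem must not be mounted read-write while it is being checked.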
Files
Updated by nicksinger 10 months ago
- Status changed from In Progress to Feedback
- Priority changed from Immediate to Normal
We had to roll back the database and root disk, so we lost data between 12:00 CET and the recovery at ~15:30 CET. OSD seems to be back up and running again. Keeping it in Feedback to collect potential regressions/issues from testers.
Updated by nicksinger 10 months ago
@gschlotter created a Jira card to remove duplicate/local VM configs in the future.
Updated by tinita 10 months ago · Edited
- File duplicate-ids.txt added
Some stats about which test IDs are duplicated in the testresults dir because the autoincrement wasn't set:
% for i in 13646 13640 13629 13641 13634 13643 13644 13647 13633 13650 13645 13637 13651 13648 13638 13649 13652 13653 13639 13654 13655 13658 13659 13661 13662 13660 13657 13656; do ls /var/lib/openqa/testresults/$i >>testresults; done
% cat testresults | perl -nlwE'if (m/^(\d+)/) { my $id = $1; $count{$id}++ } END { for my $key (sort keys %count) { say "$key: $count{$key}" if $count{$key} > 1 } }' | wc -l
232
% cat testresults | perl -nlwE'if (m/^(\d+)/) { my $id = $1; $count{$id}++ } END { for my $key (sort keys %count) { say "$key: $count{$key}" if $count{$key} > 1 } }' | head -1
13660451: 2
% cat testresults | perl -nlwE'if (m/^(\d+)/) { my $id = $1; $count{$id}++ } END { for my $key (sort keys %count) { say "$key: $count{$key}" if $count{$key} > 1 } }' | tail -1
13661726: 2
I attached the list of duplicate IDs.
The first duplicated testresult has a timestamp of Mar 1 12:42.
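The same duplicate detection can also be done with plain coreutils instead of Perl (a sketch; the sample listing below is made up for illustration, whereas the real input is the collected testresults listing from above):

```shell
# Build a tiny stand-in for the collected testresults listing
# (hypothetical sample entries, not the real osd data):
printf '13660451-sle-micro\n13660451-sle-micro-retry\n13660452-other\n' > /tmp/testresults.sample

# Keep only the leading numeric job ID of each entry, then print every
# ID that occurs more than once:
grep -oE '^[0-9]+' /tmp/testresults.sample | sort | uniq -d
# prints: 13660451
```

Piping through `uniq -c | sort -rn` instead would additionally show how often each ID occurs, like the `$count{$id}` totals in the Perl version.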
Updated by livdywan 10 months ago
- Related to action #156481: cron -> (fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log failed / No route to host / openqa.suse.de added
Updated by pcervinka 10 months ago
I'm checking results in the maintenance dashboard and I can see at http://dashboard.qam.suse.de/blocked?group_names=hpc&incident=32814 that jobs are either running or not finished. But the job groups in openQA are green and empty: https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=364 https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=434
Updated by okurz 10 months ago
- Copied to action #156532: lessons learned about "Potential FS corruption on osd due to 2 VMs accessing the same disk" size:S added
Updated by okurz 10 months ago
- Copied to action #156535: Handle unfinished SLE maintenance tests due to FS corruption on OSD 2024-03-01 added
Updated by okurz 10 months ago
pcervinka wrote in #note-8:
I'm checking results in the maintenance dashboard and I can see at http://dashboard.qam.suse.de/blocked?group_names=hpc&incident=32814 that jobs are either running or not finished. But the job groups in openQA are green and empty: https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=364 https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=434
I created #156535 for that
Updated by okurz 7 months ago
- Related to action #161309: osd not accessible, 502 Bad Gateway added