action #156460: Potential FS corruption on osd due to 2 VMs accessing the same disk - openQA Infrastructure (public) - openSUSE Project Management Tool

action #156460

closed

Potential FS corruption on osd due to 2 VMs accessing the same disk

Added by jbaier_cz over 1 year ago. Updated about 1 year ago.

Status:

Resolved

Priority:

Normal

Assignee:

nicksinger

Category:

Target version:

openQA Project (public) - Ready

Start date:

2024-03-01

Due date:

% Done:

Estimated time:

Tags:

infra, reactive work

Description

Observation¶

Users noticed slowness of osd in https://suse.slack.com/archives/C02CANHLANP/p1709297645213609; openqa-monitor.qa.suse.de also shows availability problems.

Logs on osd show a potential problem with the FS:

Mar 01 14:29:14 openqa salt-master[25856]: [ERROR   ] Unable to remove /var/cache/salt/master/jobs/26/4669e8a06e5502583ba67b138a9c30b97efbfff1f8af0b92f937ad8b70035d: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #467326: comm salt-master: deleted inode referenced: 467329
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #467326: comm salt-master: deleted inode referenced: 467329
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #428053: comm salt-master: deleted inode referenced: 428056
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #428053: comm salt-master: deleted inode referenced: 428056
Mar 01 14:29:14 openqa salt-master[25856]: [ERROR   ] Unable to remove /var/cache/salt/master/jobs/08/96cf9ed4cc58d8c044fe257e5e977516e49383070eea5680e3f8d53fc31712: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #358221: comm salt-master: deleted inode referenced: 358225
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #358221: comm salt-master: deleted inode referenced: 358225
Mar 01 14:29:14 openqa salt-master[25856]: [ERROR   ] Unable to remove /var/cache/salt/master/jobs/eb/8843afe01ce61b501612957cc3df3a3d8371a9c2694ebd800b47d514066853: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa openqa-websockets-daemon[15372]: [debug] [pid:15372] Updating seen of worker 1951 from worker_status (free)
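
To see at a glance which inodes and processes are affected, log lines like the ones above can be grouped by device, inode and command. The following is only a minimal sketch: the regular expression is derived from the lines quoted in this ticket, and the function name `summarize_ext4_errors` is made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the kernel lines quoted in this ticket, e.g.
#   EXT4-fs error (device vda1): ext4_lookup:1855: inode #467326: comm salt-master: deleted inode referenced: 467329
# Other EXT4 error formats may exist and would simply not match.
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\): "
    r"(?P<function>\w+):\d+: inode #(?P<inode>\d+): "
    r"comm (?P<comm>[^:]+): (?P<message>.*)"
)


def summarize_ext4_errors(lines):
    """Count EXT4 errors per (device, inode, comm); non-matching lines are ignored."""
    counts = Counter()
    for line in lines:
        match = EXT4_ERROR.search(line)
        if match:
            key = (match["device"], int(match["inode"]), match["comm"])
            counts[key] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Example: journalctl -k | python3 this_script.py
    for (device, inode, comm), n in sorted(summarize_ext4_errors(sys.stdin).items()):
        print(f"{device} inode #{inode} ({comm}): {n}")
```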

According to https://suse.slack.com/archives/C02CANHLANP/p1709299401351479?thread_ts=1709297645.213609&cid=C02CANHLANP, two VMs may have been running with the same backing device.

The server was rebooted to get it into a consistent state, but unfortunately, due to the FS corruption, osd is currently in maintenance mode and needs recovery.

Files

duplicate-ids.txt (2.72 KB)

tinita, 2024-03-01 16:02

Related issues 4 (0 open — 4 closed)

Updated by jbaier_cz over 1 year ago

Status changed from New to In Progress
Assignee set to nicksinger

Updated by jbaier_cz over 1 year ago

Target version set to Ready

Updated by nicksinger over 1 year ago

Status changed from In Progress to Feedback
Priority changed from Immediate to Normal

We had to roll back the database and the root disk, so we lost data between 12:00 CET and the recovery at ~15:30 CET. OSD seems to be back up and running again. Keeping the ticket in Feedback to collect potential regressions/issues from testers.

Updated by nicksinger over 1 year ago

@gschlotter created a jira-card to remove duplicate/local VM configs in the future.
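
A check for duplicate VM configs could look roughly like the sketch below. It rests on assumptions this ticket does not confirm: that the VM definitions are available as libvirt-style domain XML, and that it is enough to compare the `file`/`dev` attribute of each disk `<source>` element. The function name `shared_disks` and the example paths are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def shared_disks(domain_xml_by_name):
    """Return {disk_path: [vm names]} for disk paths referenced by two or more VMs.

    Assumes libvirt-style domain XML (an assumption, not confirmed in this ticket).
    """
    users = defaultdict(set)
    for vm_name, xml_text in domain_xml_by_name.items():
        root = ET.fromstring(xml_text)
        for source in root.findall("./devices/disk/source"):
            # File-backed disks carry a "file" attribute, block devices a "dev" attribute.
            path = source.get("file") or source.get("dev")
            if path:
                users[path].add(vm_name)
    return {path: sorted(names) for path, names in users.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical definitions, e.g. collected from several hypervisor hosts.
    template = (
        "<domain><name>{name}</name><devices>"
        "<disk type='block' device='disk'><source dev='{dev}'/></disk>"
        "</devices></domain>"
    )
    configs = {
        "vm-a": template.format(name="vm-a", dev="/dev/example/root"),
        "vm-b": template.format(name="vm-b", dev="/dev/example/root"),
    }
    for path, names in shared_disks(configs).items():
        print(f"{path} is referenced by: {', '.join(names)}")
```

The sketch flags every shared path, including disks that are intentionally shared (read-only images, for example), so its output would still need a manual review.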

Updated by tinita over 1 year ago · Edited

File duplicate-ids.txt added

Some stats about which test ids are duplicated in the testresults dir, because the autoincrement wasn't set:

% for i in 13646 13640 13629 13641 13634 13643 13644 13647 13633 13650 13645 13637 13651 13648 13638 13649 13652 13653 13639 13654 13655 13658 13659 13661 13662 13660 13657 13656; do ls  /var/lib/openqa/testresults/$i >>testresults; done
% cat testresults  | perl -nlwE'if (m/^(\d+)/) { my $id = $1; $count{$id}++ } END { for my $key (sort keys %count) { say "$key: $count{$key}" if $count{$key} > 1 } }' | wc -l
232
% cat testresults  | perl -nlwE'if (m/^(\d+)/) { my $id = $1; $count{$id}++ } END { for my $key (sort keys %count) { say "$key: $count{$key}" if $count{$key} > 1 } }' | head -1
13660451: 2
% cat testresults  | perl -nlwE'if (m/^(\d+)/) { my $id = $1; $count{$id}++ } END { for my $key (sort keys %count) { say "$key: $count{$key}" if $count{$key} > 1 } }' | tail -1
13661726: 2

I attached the list of duplicate ids.
The first duplicated testresult has a timestamp of Mar 1 12:42.
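
For reference, the counting done by the perl one-liner above can also be sketched in Python. This is an illustrative equivalent, not the command that was actually run: the function name `duplicate_ids` and the sample entry names are made up, and ids are sorted numerically here, whereas the perl version sorts them as strings.

```python
import re
from collections import Counter

# Same idea as m/^(\d+)/ in the perl one-liner: take the leading digits as the job id.
LEADING_ID = re.compile(r"^(\d+)")


def duplicate_ids(entries):
    """Return {job_id: count} for ids seen more than once, in ascending numeric order."""
    counts = Counter()
    for entry in entries:
        match = LEADING_ID.match(entry)
        if match:
            counts[int(match.group(1))] += 1
    return {job_id: n for job_id, n in sorted(counts.items()) if n > 1}


if __name__ == "__main__":
    # Reads the listing collected into the "testresults" file above.
    with open("testresults") as listing:
        dupes = duplicate_ids(line.strip() for line in listing)
    print(len(dupes))
    for job_id, n in dupes.items():
        print(f"{job_id}: {n}")
```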

Updated by livdywan over 1 year ago

Related to action #156481: cron -> (fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log failed / No route to host / openqa.suse.de added

Updated by pcervinka about 1 year ago

I'm checking results in the maintenance dashboard and I can see at http://dashboard.qam.suse.de/blocked?group_names=hpc&incident=32814 that jobs are either running or not finished. But the job groups in openQA are green and empty: https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=364 https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=434.

Updated by okurz about 1 year ago

Copied to action #156532: lessons learned about "Potential FS corruption on osd due to 2 VMs accessing the same disk" size:S added

#10

Updated by okurz about 1 year ago

Copied to action #156535: Handle unfinished SLE maintenance tests due to FS corruption on OSD 2024-03-01 added

#11

Updated by okurz about 1 year ago

pcervinka wrote in #note-8:

I'm checking results in the maintenance dashboard and I can see at http://dashboard.qam.suse.de/blocked?group_names=hpc&incident=32814 that jobs are either running or not finished. But the job groups in openQA are green and empty: https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=364 https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=434.

I created #156535 for that

#12

Updated by tinita about 1 year ago

So the first job id after the database was recovered is 13660451:

openqa=> select id, t_created from jobs where t_created >= '2024-03-01 11:05:00' order by t_created asc limit 1;
    id    |      t_created      
----------+---------------------
 13660451 | 2024-03-01 14:35:30
(1 row)

#13

Updated by okurz about 1 year ago

Status changed from Feedback to Resolved

As discussed in the infra daily, we clarified that we have two follow-ups and no other issues. Resolving.

#14

Updated by okurz about 1 year ago

Related to action #161309: osd not accessible, 502 Bad Gateway added
