action #156460
Potential FS corruption on osd due to 2 VMs accessing the same disk
Status: closed
Description
Observation
Users noticed slowness of osd in https://suse.slack.com/archives/C02CANHLANP/p1709297645213609; openqa-monitor.qa.suse.de also shows problems with availability.
Logs on osd show a potential problem with the FS:
Mar 01 14:29:14 openqa salt-master[25856]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/26/4669e8a06e5502583ba67b138a9c30b97efbfff1f8af0b92f937ad8b70035d: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #467326: comm salt-master: deleted inode referenced: 467329
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #467326: comm salt-master: deleted inode referenced: 467329
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #428053: comm salt-master: deleted inode referenced: 428056
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #428053: comm salt-master: deleted inode referenced: 428056
Mar 01 14:29:14 openqa salt-master[25856]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/08/96cf9ed4cc58d8c044fe257e5e977516e49383070eea5680e3f8d53fc31712: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #358221: comm salt-master: deleted inode referenced: 358225
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #358221: comm salt-master: deleted inode referenced: 358225
Mar 01 14:29:14 openqa salt-master[25856]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/eb/8843afe01ce61b501612957cc3df3a3d8371a9c2694ebd800b47d514066853: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa openqa-websockets-daemon[15372]: [debug] [pid:15372] Updating seen of worker 1951 from worker_status (free)
According to https://suse.slack.com/archives/C02CANHLANP/p1709299401351479?thread_ts=1709297645.213609&cid=C02CANHLANP there might have been a situation where two VMs were running with the same backing device.
The server was rebooted to get it into a consistent state, but unfortunately, due to the FS corruption, osd is currently in maintenance mode and needs recovery.
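The recovery step itself is not recorded in this ticket; a typical approach for this class of ext4 damage is an offline filesystem check (a sketch, assuming standard e2fsprogs tooling, demonstrated on a throwaway image file rather than the real vda1):

```shell
# Create a small scratch ext4 image so the check can be demonstrated
# without touching a real disk (the image file is purely illustrative):
truncate -s 8M /tmp/demo.img
mkfs.ext4 -q -F /tmp/demo.img

# Force a full check and auto-repair; on osd the equivalent would run
# against the unmounted /dev/vda1 (e.g. from a rescue system):
fsck.ext4 -f -y /tmp/demo.img
```

On a root disk this has to happen from a rescue or maintenance environment, since the filesystem must not be mounted read-write while it is being checked.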
Files
Updated by nicksinger 10 months ago
- Status changed from In Progress to Feedback
- Priority changed from Immediate to Normal
We had to roll back the database and root disk, so we lost data between 12:00 CET and the recovery at ~15:30 CET. OSD seems to be back up and running again. Keeping it in Feedback to collect potential regressions/issues from testers.
Updated by nicksinger 10 months ago
@gschlotter created a Jira card to remove duplicate/local VM configs in the future.
Updated by tinita 10 months ago · Edited
- File duplicate-ids.txt added
Some stats about which test IDs are duplicated in the testresults dir because the autoincrement wasn't set:
% for i in 13646 13640 13629 13641 13634 13643 13644 13647 13633 13650 13645 13637 13651 13648 13638 13649 13652 13653 13639 13654 13655 13658 13659 13661 13662 13660 13657 13656; do ls /var/lib/openqa/testresults/$i >>testresults; done
% cat testresults | perl -nlwE'if (m/^(\d+)/) { my $id = $1; $count{$id}++ } END { for my $key (sort keys %count) { say "$key: $count{$key}" if $count{$key} > 1 } }' | wc -l
232
% cat testresults | perl -nlwE'if (m/^(\d+)/) { my $id = $1; $count{$id}++ } END { for my $key (sort keys %count) { say "$key: $count{$key}" if $count{$key} > 1 } }' | head -1
13660451: 2
% cat testresults | perl -nlwE'if (m/^(\d+)/) { my $id = $1; $count{$id}++ } END { for my $key (sort keys %count) { say "$key: $count{$key}" if $count{$key} > 1 } }' | tail -1
13661726: 2
I attached the list of duplicate IDs.
The first duplicated testresult has a timestamp of Mar 1 12:42.
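The same duplicate detection can also be done with plain coreutils instead of Perl (a sketch; the sample listing below is made up for illustration, whereas the real input is the collected testresults listing from above):

```shell
# Build a tiny stand-in for the collected testresults listing
# (hypothetical sample entries, not the real osd data):
printf '13660451-sle-micro\n13660451-sle-micro-retry\n13660452-other\n' > /tmp/testresults.sample

# Keep only the leading numeric job ID of each entry, then print every
# ID that occurs more than once:
grep -oE '^[0-9]+' /tmp/testresults.sample | sort | uniq -d
# prints: 13660451
```

Piping through `uniq -c | sort -rn` instead would additionally show how often each ID occurs, like the `$count{$id}` totals in the Perl version.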
Updated by livdywan 10 months ago
- Related to action #156481: cron -> (fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log failed / No route to host / openqa.suse.de added
Updated by pcervinka 10 months ago
I'm checking results in the maintenance dashboard and I can see at http://dashboard.qam.suse.de/blocked?group_names=hpc&incident=32814 that jobs are either running or not finished. But the job groups in openQA are green and empty: https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=364 https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=434
Updated by okurz 10 months ago
- Copied to action #156532: lessons learned about "Potential FS corruption on osd due to 2 VMs accessing the same disk" size:S added
Updated by okurz 10 months ago
- Copied to action #156535: Handle unfinished SLE maintenance tests due to FS corruption on OSD 2024-03-01 added
Updated by okurz 10 months ago
pcervinka wrote in #note-8:
I'm checking results in the maintenance dashboard and I can see at http://dashboard.qam.suse.de/blocked?group_names=hpc&incident=32814 that jobs are either running or not finished. But the job groups in openQA are green and empty: https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=364 https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=434
I created #156535 for that
Updated by okurz 7 months ago
- Related to action #161309: osd not accessible, 502 Bad Gateway added