action #156460
closed: Potential FS corruption on osd due to 2 VMs accessing the same disk
Description
Observation¶
Users noticed slowness of osd in https://suse.slack.com/archives/C02CANHLANP/p1709297645213609; openqa-monitor.qa.suse.de also shows problems with availability.
Logs on osd show a potential problem with the FS:
Mar 01 14:29:14 openqa salt-master[25856]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/26/4669e8a06e5502583ba67b138a9c30b97efbfff1f8af0b92f937ad8b70035d: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #467326: comm salt-master: deleted inode referenced: 467329
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #467326: comm salt-master: deleted inode referenced: 467329
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #428053: comm salt-master: deleted inode referenced: 428056
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #428053: comm salt-master: deleted inode referenced: 428056
Mar 01 14:29:14 openqa salt-master[25856]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/08/96cf9ed4cc58d8c044fe257e5e977516e49383070eea5680e3f8d53fc31712: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #358221: comm salt-master: deleted inode referenced: 358225
Mar 01 14:29:14 openqa kernel: EXT4-fs error (device vda1): ext4_lookup:1855: inode #358221: comm salt-master: deleted inode referenced: 358225
Mar 01 14:29:14 openqa salt-master[25856]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/eb/8843afe01ce61b501612957cc3df3a3d8371a9c2694ebd800b47d514066853: [Errno 117] Structure needs cleaning: '.min>
Mar 01 14:29:14 openqa openqa-websockets-daemon[15372]: [debug] [pid:15372] Updating seen of worker 1951 from worker_status (free)
There might have been a situation where two VMs were running with the same backing device, according to https://suse.slack.com/archives/C02CANHLANP/p1709299401351479?thread_ts=1709297645.213609&cid=C02CANHLANP
The server was rebooted to bring it back to a consistent state, but unfortunately, due to the FS corruption, osd is currently in maintenance mode and needs recovery.
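A check along the lines of the following sketch could surface such a situation on the VM host. This assumes the host is managed via libvirt with virsh available; nothing below is taken from the ticket, and domain names and image paths are purely illustrative.
#!/bin/bash
# Hedged sketch: report disk images that are attached to more than one running
# libvirt domain (a likely sign of two VMs sharing one backing device).
for dom in $(virsh list --name); do
    # domblklist prints a two-line header, then "target source" pairs
    virsh domblklist "$dom" | awk -v d="$dom" 'NR > 2 && NF >= 2 {print $2, d}'
done | sort |
awk '{domains[$1] = domains[$1] " " $2; count[$1]++}
     END {for (src in count) if (count[src] > 1) print src " is used by:" domains[src]}'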
Updated by jbaier_cz about 1 year ago
- Status changed from New to In Progress
- Assignee set to nicksinger
Updated by nicksinger about 1 year ago
- Status changed from In Progress to Feedback
- Priority changed from Immediate to Normal
We had to roll back the database and the root disk, so we lost data between 12:00 CET and the recovery at ~15:30 CET. OSD seems to be back up and running again. Keeping this in feedback to collect potential regressions/issues from testers.
Updated by nicksinger about 1 year ago
@gschlotter created a Jira card to remove duplicate/local VM configs in the future.
Updated by tinita about 1 year ago · Edited
- File duplicate-ids.txt added
Some stats about which test ids are duplicated in the testresults dir, because the autoincrement wasn't set:
% for i in 13646 13640 13629 13641 13634 13643 13644 13647 13633 13650 13645 13637 13651 13648 13638 13649 13652 13653 13639 13654 13655 13658 13659 13661 13662 13660 13657 13656; do ls /var/lib/openqa/testresults/$i >>testresults; done
% cat testresults | perl -nlwE'if (m/^(\d+)/) { my $id = $1; $count{$id}++ } END { for my $key (sort keys %count) { say "$key: $count{$key}" if $count{$key} > 1 } }' | wc -l
232
% cat testresults | perl -nlwE'if (m/^(\d+)/) { my $id = $1; $count{$id}++ } END { for my $key (sort keys %count) { say "$key: $count{$key}" if $count{$key} > 1 } }' | head -1
13660451: 2
% cat testresults | perl -nlwE'if (m/^(\d+)/) { my $id = $1; $count{$id}++ } END { for my $key (sort keys %count) { say "$key: $count{$key}" if $count{$key} > 1 } }' | tail -1
13661726: 2
I attached the list of duplicate ids.
The first duplicated testresult has a timestamp of Mar 1 12:42.
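The duplicates line up with the rollback: the testresults disk kept directories for jobs created after 12:00 CET while the restored database handed out those ids again. As a hedged illustration only (not something recorded in this ticket), the id sequence could be bumped past the highest job id already present on the testresults disk to avoid such collisions; the database name and testresults path are from this ticket, everything else is an assumption.
# Hedged sketch, not taken from the ticket: move the jobs id sequence past the
# highest job id that already has a result directory on disk.
max_id=$(ls /var/lib/openqa/testresults/*/ | grep -oE '^[0-9]+' | sort -n | tail -1)
psql openqa -c "SELECT setval(pg_get_serial_sequence('jobs', 'id'), ${max_id});"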
Updated by livdywan about 1 year ago
- Related to action #156481: cron -> (fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log failed / No route to host / openqa.suse.de added
Updated by pcervinka about 1 year ago
I'm checking results in the maintenance dashboard and I can see at http://dashboard.qam.suse.de/blocked?group_names=hpc&incident=32814 that jobs are either running or not finished. But the job groups in openQA are green and empty: https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=364 and https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=434.
Updated by okurz about 1 year ago
- Copied to action #156532: lessons learned about "Potential FS corruption on osd due to 2 VMs accessing the same disk" size:S added
Updated by okurz about 1 year ago
- Copied to action #156535: Handle unfinished SLE maintenance tests due to FS corruption on OSD 2024-03-01 added
Updated by okurz about 1 year ago
pcervinka wrote in #note-8:
I'm checking results in the maintenance dashboard and I can see at http://dashboard.qam.suse.de/blocked?group_names=hpc&incident=32814 that jobs are either running or not finished. But the job groups in openQA are green and empty: https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=364 and https://openqa.suse.de/tests/overview?build=%3A32814%3Apython3&groupid=434.
I created #156535 for that.
Updated by tinita about 1 year ago
So the first job id after the database was recovered is 13660451:
openqa=> select id, t_created from jobs where t_created >= '2024-03-01 11:05:00' order by t_created asc limit 1;
id | t_created
----------+---------------------
13660451 | 2024-03-01 14:35:30
(1 row)
Updated by okurz about 1 year ago
- Status changed from Feedback to Resolved
As discussed in the infra daily, we clarified that we have two follow-ups and no other issues, resolving.
Updated by okurz 10 months ago
- Related to action #161309: osd not accessible, 502 Bad Gateway added