action #156532
closed
lessons learned about "Potential FS corruption on osd due to 2 VMs accessing the same disk" size:S
Description
Motivation
#156460 had a significant impact so we should collect lessons learned. Also an RCA was promised to QE LSG mgmt by mgriessmeier so you can thank him for that :)
Acceptance criteria
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions
- Bring up in retro
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys (not as part of the retro)
Updated by okurz 10 months ago
- Copied from action #156460: Potential FS corruption on osd due to 2 VMs accessing the same disk added
Updated by okurz 10 months ago
- Subject changed from lessons learned about "Potential FS corruption on osd due to 2 VMs accessing the same disk" to lessons learned about "Potential FS corruption on osd due to 2 VMs accessing the same disk" size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by openqa_review 10 months ago
- Due date set to 2024-03-21
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 10 months ago · Edited
What happened
While investigating why PowerPC machines do not have full network connectivity (related to https://progress.opensuse.org/issues/155521), an IT engineer found a virtual machine definition related to openqa.suse.de that was not running and started that VM.
The discussion in
https://suse.slack.com/archives/C02CANHLANP/p1709299401351479?thread_ts=1709297645.213609&cid=C02CANHLANP
revealed that the openQA service itself was still partially working, hence no catastrophic event triggered more direct feedback.
Five-Why analysis
Why does IT click random buttons? Why was the virtual machine started at all?
A: Because IT does not have a clear overview of their infrastructure and could not verify whether a certain expected service was already running. Possibly the virtual machines have no clear names.
=> Access to the hypervisor is already planned with #132149, which would possibly lead to IT engineers asking us to make such decisions ourselves and would have prevented that situation.
Why could we not prevent IT from clicking that button?
A: Because we do not have access to the hypervisor and so far we are still denied, see #132149. On the other hand we appreciate that IT tries to fix a problem themselves and does not need to wait for our approval to recover an unavailable service, in the interest of low management overhead.
=> Again, already planned with #132149.
Why did we wait until users told us about the more severe impact before acting?
A: Monitoring had already shown problems earlier. File system corruption is hard to detect and it is random where symptoms will show first. Only after combining information from our monitoring with user feedback could we realize the severe impact of the issue, and then it was still mitigated soon enough. "Just slowness" is also something we observed in the past and we know that we can't prevent all those problems due to the limitations in the current infrastructure setup, e.g. CPU cores and memory assigned to the openqa.suse.de VM. Memory was last increased 2023-09 which is recent enough. Maybe more CPU cores could help?
=> Consider requesting more CPU cores for the openqa.suse.de VM again from IT. Discussed in infra daily 2024-03-08 and decided that more hardware resources should not be requested at the current time without investigating in more detail where we would actually benefit, e.g. faster storage for /results vs. more CPU cores, or whether nothing like that would help and we would need to plan for corresponding implementation changes.
Why is IT in charge of the VMs?
A: That is the desired setup. We appreciate that we do not need to maintain that lower level of the infrastructure.
Why was there a second VM in the first place?
A: Changes were conducted in libvirt by IT engineers and there was no clear coordination within IT about work on virtual machines. Possibly a manual backup of a machine definition was created by one IT engineer with no clear indication in the filenames or machine definition name, and another IT engineer was confused by that.
=> SUSE IT has already planned a task to get rid of redundant and confusing virtual machine definitions.
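As an illustration of how redundant machine definitions pointing at the same disk could be spotted before one of them is started, here is a minimal sketch. It assumes the libvirt domain definitions are available as XML text (for example as exported by IT); the domain names and disk paths in the example are made up, and this is not a tool that IT or we actually run.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def disk_sources(domain_xml):
    """Return (domain name, list of writable disk source paths) for one domain XML string."""
    root = ET.fromstring(domain_xml)
    name = root.findtext("name", default="<unnamed>")
    sources = []
    for disk in root.findall("./devices/disk"):
        # Ignore CD-ROMs and floppies; libvirt defaults 'device' to "disk".
        if disk.get("device", "disk") != "disk":
            continue
        # Disks marked read-only or shareable are intentionally safe to share.
        if disk.find("readonly") is not None or disk.find("shareable") is not None:
            continue
        source = disk.find("source")
        if source is None:
            continue
        # File-backed disks use the 'file' attribute, block devices use 'dev'.
        path = source.get("file") or source.get("dev")
        if path:
            sources.append(path)
    return name, sources


def find_shared_disks(domain_xmls):
    """Map each disk path used by more than one domain to the names of those domains."""
    users = defaultdict(list)
    for xml_text in domain_xmls:
        name, sources = disk_sources(xml_text)
        for path in set(sources):
            users[path].append(name)
    return {path: names for path, names in users.items() if len(names) > 1}


# Hypothetical definitions: a primary VM and a leftover copy of its definition.
EXAMPLE_VM = """<domain type='kvm'><name>example-vm</name><devices>
  <disk type='block' device='disk'><source dev='/dev/vg/example-root'/></disk>
</devices></domain>"""
EXAMPLE_COPY = EXAMPLE_VM.replace("example-vm", "example-vm-backup")

if __name__ == "__main__":
    for path, names in find_shared_disks([EXAMPLE_VM, EXAMPLE_COPY]).items():
        print(f"{path} is referenced by: {', '.join(names)}")
```

Such a check only reports definitions that reference the same path; it cannot tell which of them is the intended one, so the decision to delete or rename still needs coordination with whoever owns the hypervisor.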
What went well?
- It was quickly realized that two conflicting VMs were running, and the root cause was resolved by stopping the redundant secondary machine and deleting its configuration
- The backup of filesystems worked and we could revert to a working state, losing only some hours.
Derived ideas for the future
Why will it not happen again?
A: It probably will, until IT makes sure those extra config files are no longer created.
Resolving an inconsistency between the results filesystem and the openQA database was challenging, mostly due to the behaviour of dashboard.qam.suse.de, which remembered openQA jobs that no longer existed in openqa.suse.de. In theory one could prevent such situations by bringing up the virtual machine in a special mode to prevent openQA from overwriting existing older openQA job ids, e.g. start in emergency mode after recovering from backup. Still, here we would need #132149 to have control over the virtual machine.
Even then, dashboard.qam.suse.de would have older openQA job ids that would never update themselves again. We deleted all incident test data in the dashboard database for consistency, but that caused a multi-day delay in getting SLE maintenance tests covered again. Had the special-mode startup described above been possible, we could have uniquely identified the range of corrupted openQA job ids, removed those references from the dashboard.qam.suse.de database and recovered more quickly.
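To make the idea of "identifying the range of corrupted job ids" more concrete, here is a rough sketch. It assumes that the affected jobs are exactly those created between the restored backup and the rollback, and that both boundary ids are known; the function names and the plain list of ids are illustrative only and do not reflect the actual dashboard.qam.suse.de schema.

```python
def stale_job_id_range(last_id_in_backup, last_id_before_rollback):
    """Ids handed out after the backup was taken and lost by the rollback.

    Assumption: job ids increase monotonically, so every id above the
    highest id in the backup, up to the highest id seen before the
    rollback, refers to a job that no longer exists after restoring.
    """
    if last_id_before_rollback < last_id_in_backup:
        raise ValueError("rollback boundary is older than the backup boundary")
    return range(last_id_in_backup + 1, last_id_before_rollback + 1)


def split_references(referenced_job_ids, stale):
    """Split ids referenced by an external system into (keep, drop)."""
    keep = [job_id for job_id in referenced_job_ids if job_id not in stale]
    drop = [job_id for job_id in referenced_job_ids if job_id in stale]
    return keep, drop


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    stale = stale_job_id_range(last_id_in_backup=1000, last_id_before_rollback=1250)
    keep, drop = split_references([990, 1001, 1100, 1250], stale)
    print("keep:", keep)
    print("drop:", drop)
```

With such a range only the references in `drop` would need to be removed, instead of deleting all incident test data. This only works if openQA is prevented from handing out those ids again before the range has been recorded.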