action #161324
Conduct "lessons learned" with Five Why analysis for "osd not accessible, 502 Bad Gateway"
Status: closed
Description
Motivation
In #161309 OSD was down for multiple hours (still ongoing at the time of writing on 2024-05-31). We should learn what happened and find improvements for the future.
Acceptance criteria
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions
- Bring up in retro
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys (not as part of the retro)
Ideas
- #121726, e.g. https://sd.suse.com/servicedesk/customer/portal/1/SD-126446 "Kind request for (simple/restricted) management access to VMs"
Updated by okurz 7 months ago
- Copied from action #161309: osd not accessible, 502 Bad Gateway added
Updated by jbaier_cz 7 months ago
I already have one question for our 5-Whys:
- Why was the salt deploy job green when it (most likely) broke configuration?
Updated by okurz 7 months ago · Edited
- Status changed from Blocked to In Progress
Conducted the lessons learned meeting with the team
What happened
Monitoring as well as users reported "502 errors" from openqa.suse.de starting 2024-05-31 07:32Z. The initial problem seems to be that salt wrote configuration files inconsistently, leaving at least /etc/openqa/database.ini incorrect. That caused errors in the log files and prevented the openQA webUI service from restarting properly. The change was triggered by a salt-states CI pipeline merged by jlausuch. At first we assumed that package updates, also involving glibc, had caused the inconsistencies, before we understood that the configuration files had simply been left invalid. jbaier+okurz decided to trigger a reboot, which then left the machine unresponsive because /etc/fstab was completely absent. okurz created an SD ticket and also mentioned the issue in #help-it-ama. After some time, additional personal escalation, and reminders, mcaj from IT picked up the SD ticket okurz had created and looked into the issue. mcaj contacted okurz over a private Slack conversation. After some hours all problems were resolved. Most of the time was spent understanding why the VM would not boot, which in the end turned out to be the absent /etc/fstab. The second biggest share of time was spent running filesystem checks which were potentially not even necessary.
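A generic safeguard against this "half-written configuration file" failure mode is to never write the live file directly: render to a temporary file, validate it, and only then atomically move it into place. A minimal shell sketch of the idea (the paths and the validation check are hypothetical stand-ins, not the actual OSD deployment code):

```shell
# Hypothetical safeguard: render config to a temp file, validate it, and
# only replace the target on success, so a broken render never goes live.
cfg=/tmp/database.ini                # stand-in for /etc/openqa/database.ini
tmp=$(mktemp)
printf '[production]\ndbname = openqa\n' > "$tmp"   # stand-in for rendered content

# Validation step: here just "contains the expected section header"; a real
# check would parse the file with a proper ini parser.
if grep -q '^\[production\]' "$tmp"; then
    mv "$tmp" "$cfg"                 # atomic replace on the same filesystem
else
    rm -f "$tmp"
    echo "validation failed, keeping old $cfg" >&2
    exit 1
fi
```

The same pattern exists natively in salt via the `check_cmd` argument of `file.managed`, which runs a validation command against a temporary copy and only writes the destination when it exits 0.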
Five Whys
1. Why did the salt states pipeline end with success when the salt high state was never reported as successfully applied to the openqa.suse.de salt minion? (openqa.suse.de is not mentioned in the list of minions where the state was applied, but the pipeline still finished successfully.)
1.1. We do not know yet, but this should help us spot errors quicker in the future in case similar problems return. Maybe the problem is related to how we run salt over ssh from the minion openqa.suse.de itself, and the exit code from salt was never propagated because the command in bash just ended prematurely?
1.1.1. Separate ticket to research best practices for applying a high state from a remotely accessible master and to investigate this -> #161423
1.1.2. Create a ticket to introduce post-deploy monitoring steps like in osd-deployment, but in salt-states-openqa -> #161426
1.1.3. Create a ticket to add annotations in grafana at the times of osd-deployment as well as salt-states-openqa deployments -> #161429
2. Why did salt leave files in an inconsistent state instead of aborting clearly with a fatal error, retrying or reverting?
2.1. We usually see "result: false" and the run continues. Is this case different? Covered by the same research ticket as 1.1.1 -> #161423
3. Why did salt not manage to update the three configuration files in the first place?
3.1. Maybe this is due to the secondary, unlikely hypothesis of "filesystem corruption" we entertained. If not, we could reconsider how we write those three specific configuration files. okurz thinks that at least for /etc/fstab we have two salt states concerning the file; maybe it is better to combine them into one.
4. Why did we need to wait on an SD ticket to IT and wait for hours for a resolution?
4.1. We do not have hypervisor access. We already pushed for this multiple times and were denied; we should try again. See also the specific proposals in https://progress.opensuse.org/issues/121726 and https://sd.suse.com/servicedesk/customer/portal/1/SD-126446, which have not been followed up on for multiple months now. The handling of the issue was also sub-optimal: a direct-message conversation, no screen sharing, no video conference. So we did not see the boot process ourselves and could not help there. If we had had better access, at least temporarily, or at least screenshots, we likely would have made the mental link between "system does not boot" and "salt left /etc/fstab absent". Separate ticket to reach out to IT again to improve collaboration, give us hypervisor access, etc. -> #161324-4 and #132149
5. Why did we accept the multiple hours of waiting for filesystem checks to complete?
5.1. We remembered the recent filesystem corruption related to two VMs accessing the same storage and we trusted the IT members to make the right choices, so we accepted the waiting time. But apparently the ticket was not treated with high priority by IT: even filesystem checks that had already failed due to OOM went unnoticed for up to an hour or longer. The IT assignee did not ask us what could explain the missing /etc/fstab. Same as for 4: reach out to IT to improve collaboration, etc. -> #161324-4 and #132149
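Regarding 1.1, one common way a bash deployment step can report success despite a failing salt command is exit-code loss in a pipeline: without `set -o pipefail`, `$?` reflects only the last command in the pipe. A minimal, self-contained illustration of that hypothesis (using `false` as a stand-in for a failing salt invocation; this is not the confirmed pipeline bug):

```shell
# Without pipefail, $? reflects only the last command in the pipe (tee),
# so the failing first command (our stand-in for salt) is invisible.
false | tee /dev/null
rc_without=$?
echo "without pipefail: exit=$rc_without"   # prints 0: failure hidden

# With pipefail, the first non-zero exit status in the pipe wins.
set -o pipefail
rc_with=0
false | tee /dev/null || rc_with=$?
echo "with pipefail: exit=$rc_with"         # prints 1: failure visible
```

If the deployment runs `salt-call` on the minion, `--retcode-passthrough` additionally makes salt itself exit non-zero when any state fails; combined with `set -euo pipefail` in the CI script, a failed high state would then fail the pipeline.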
Updated by okurz 7 months ago
I created https://suse.slack.com/archives/C029APBKLGK/p1717409655107289?thread_ts=1717143302.275009&cid=C029APBKLGK to cover 4.+5.
(Oliver Kurz) To conclude here: The actual issue was resolved with good help from Martin. Thank you @Martin Caj! Still I think multiple things could have helped to reduce the resolution time from 7h to probably less than 1h:
- With direct management access to the VM we could have looked into why the machine did not boot up within minutes, and probably would have understood the problem within minutes as well, since it relates to knowledge that we as machine owners have and cannot expect anyone from IT to have, e.g. why a file like /etc/fstab would have gone missing. https://progress.opensuse.org/issues/132149 already mentions multiple ideas, we have had multiple talks about this going back years, and there is https://sd.suse.com/servicedesk/customer/portal/1/SD-126446 with explicit ideas which again have not been followed up since 2023-09. So I am very kindly but persistently asking again: can we please manage to get hypervisor management access sooner rather than later? Again, theoretical future ideas don't help real-world problems. If you just need a contract signed in blood that I will only touch one machine while having admin-level access on the hypervisor host, I can do that :)
- https://sd.suse.com/servicedesk/customer/portal/1/SD-158390 was picked up at 8:55L, which is 45m after my first message here, and only after the additional ping by @Matthias Griessmeier. So I assume there was a "backchannel assignment" between Moroni and members of the team, meaning that me setting the ticket to "Impact: High" and "Urgency: High" is still not treated as being as urgent as a personally triggered escalation?
- I suggested here as well as in the direct message to Martin to please use a video conferencing solution or at least a group chat. The sluggish conversation over text-based direct messages did not provide the reaction time that I assume should have applied to the problem considering its severity. As an alternative to 1., this would very likely have helped to resolve the issue much sooner, as for example the reason for the boot problem could have been apparent on a shared screen. I am hoping for your response.
Updated by okurz 7 months ago
- Related to action #161429: incomplete config files on OSD due to salt - create annotations in grafana on the time of the osd deployment as well as salt-states-openqa deployments added
Updated by okurz 7 months ago
- Related to action #161426: incomplete config files on OSD due to salt - introduce post-deploy monitoring steps like in osd-deployment but in salt-states-openqa added
Updated by okurz 7 months ago
- Related to action #161423: [timeboxed:10h] Incomplete config files on OSD due to salt - Improve salt state application from remotely accessible salt master size:S added