action #161324

closed

Conduct "lessons learned" with Five Why analysis for "osd not accessible, 502 Bad Gateway"

Added by okurz about 2 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Organisational
Target version:
Start date:
2024-05-31
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Motivation

In #161309 OSD (openqa.suse.de) was down for multiple hours (still ongoing at the time of writing on 2024-05-31). We should learn what happened and find improvements for the future.

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys (not as part of the retro)

Ideas


Related issues: 5 (3 open, 2 closed)

Related to QA - action #132149: Coordinate with Eng-Infra to get simple management access to VMs (o3/osd/qa-jump.qe.nue2.suse.org) size:M (Blocked, okurz, 2023-06-29)

Related to openQA Infrastructure - action #161429: incomplete config files on OSD due to salt - create annotations in grafana on the time of the osd deployment as well as salt-states-openqa deployments (New, 2024-06-03)

Related to openQA Infrastructure - action #161426: incomplete config files on OSD due to salt - introduce post-deploy monitoring steps like in osd-deployment but in salt-states-openqa (New, 2024-06-03)

Related to openQA Infrastructure - action #161423: [timeboxed:10h] Incomplete config files on OSD due to salt - Improve salt state application from remotely accessible salt master size:S (Resolved, okurz, 2024-06-03)

Copied from openQA Infrastructure - action #161309: osd not accessible, 502 Bad Gateway (Resolved, jbaier_cz, 2024-05-31)

#1

Updated by okurz about 2 months ago

  • Copied from action #161309: osd not accessible, 502 Bad Gateway added
#2

Updated by jbaier_cz about 2 months ago

I already have one question for our 5-Whys:

  • Why was the salt deploy job green when it (most likely) broke configuration?
#3

Updated by okurz about 1 month ago · Edited

  • Status changed from Blocked to In Progress

Conducted the lessons learned meeting with the team

What happened

Monitoring as well as users reported problems with "502 errors" from openqa.suse.de starting 2024-05-31 07:32Z. The initial problem seems to be that salt wrote configuration files inconsistently, which left at least /etc/openqa/database.ini incorrect, leading to errors in the logfiles and preventing the openQA webUI service from restarting properly. This was triggered by a salt-states CI pipeline merged by jlausuch. At first we assumed that a problem with package updates, also involving glibc, had caused inconsistencies, before we understood that the configuration files had been left invalid. jbaier_cz and okurz decided to trigger a reboot, which then left the machine unresponsive because /etc/fstab was completely absent. okurz created an SD ticket and also mentioned it in #help-it-ama. After some time, additional personal escalation and reminders, mcaj from IT picked up the SD ticket that okurz created and looked into the issue. mcaj contacted okurz in a private Slack conversation. After some hours all problems were resolved. Most of the time was spent understanding why the VM would not boot, which in the end turned out to be caused by the absent /etc/fstab. The second-largest amount of time was spent running filesystem checks which potentially were not even necessary.

Five Whys

  1. Why did the salt-states pipeline end with success when the salt high state was never reported as successfully applied to the openqa.suse.de salt minion? (openqa.suse.de is not mentioned in the list of minions where the state was applied, but the pipeline still ended with success.)
    1.1. We do not know yet, but answering this should help us spot errors more quickly in case similar problems return. Maybe the problem is related to how we run salt over ssh from that minion openqa.suse.de, and potentially the exit code from salt was never propagated but the bash command just ended prematurely? (See the first sketch after this list.)
    1.1.1. Separate ticket to research best practices for applying a high state from a remotely accessible master and to investigate this -> #161423
    1.1.2. Create ticket to introduce post-deploy monitoring steps like in osd-deployment, but in salt-states-openqa -> #161426 (see the second sketch after this list)
    1.1.3. Create ticket to create annotations in grafana at the time of osd-deployment as well as salt-states-openqa deployments -> #161429

  2. Why did salt leave files in an inconsistent state and not abort more clearly with a fatal error, retry, or revert?
    2.1. We usually see "result: false" and the run continues. Is this different? Separate ticket to research best practices for applying a high state from a remotely accessible master and to investigate this -> #161423

  3. Why did salt not manage to update three configuration files in the first place?
    3.1. Maybe this is due to the secondary, unlikely hypothesis of "filesystem corruption" that we might have had. If not, we could reconsider how we write those three specific configuration files. okurz thinks that at least for /etc/fstab we have two salt states concerning this file; maybe it is better to combine those into one.

  4. Why did we need to rely on an SD ticket to IT and wait for hours for a resolution?
    4.1. We do not have hypervisor access. We have already pushed for this multiple times and were denied; we should try again. See also the specific proposals in https://progress.opensuse.org/issues/121726 and https://sd.suse.com/servicedesk/customer/portal/1/SD-126446 which have not been followed up on for multiple months now. The handling of the issue was also sub-optimal: a direct-message conversation, no screen sharing, no video conference. So we did not see the boot process ourselves and could not help there. If we had had better access, at least temporarily, or at least screenshots or something similar, we would likely have made the mental link between "system does not boot" and "salt left /etc/fstab absent". Separate ticket to reach out to IT again to improve collaboration, give us hypervisor access, etc. -> #161324-4 and #132149

  5. Why did we accept the multiple hours of waiting for filesystem checks to complete?
    5.1. We remembered the recent filesystem corruption we had in the past, related to two VMs accessing the same storage, and we trusted the IT members to make the right choices, so we accepted the waiting time. But apparently the ticket was not treated with that high a priority by IT: even filesystem checks that had already failed due to OOM went unnoticed for up to an hour or longer. The IT assignee also did not ask us what could explain the missing /etc/fstab. Same as for 4.: reach out to IT to improve collaboration, etc. -> #161324-4 and #132149
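
A minimal sketch of the failure mode hypothesized in 1.1, assuming the CI job applies the high state over ssh and pipes salt's output through a logging command; the host name, salt options and log handling are illustrative assumptions, not the actual salt-states-openqa job script:

```bash
#!/bin/bash
# Hypothetical CI deploy step; host and salt invocation are assumptions.
set -euo pipefail  # abort on failing commands, including inside pipelines

# Apply the high state via the remotely accessible salt master.
# Hypothesis 1.1: with a plain 'cmd | tee' and no 'pipefail', the job
# status would be the exit code of 'tee', i.e. green even if salt failed
# or never reached the openqa.suse.de minion at all.
ssh osd-salt-master.example.org \
    "sudo salt --state-output=changes '*' state.highstate" | tee salt.log

# Post-check: fail loudly if the minion we care about never shows up in
# the reported state results (covers the "minion missing from the
# summary but pipeline still green" case from question 1).
grep -q "openqa.suse.de" salt.log || {
    echo "ERROR: no state results reported for openqa.suse.de" >&2
    exit 1
}
```

Whether this, or something else, is the actual cause of the green pipeline is exactly what #161423 is meant to investigate.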

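For the follow-ups 1.1.2 and 1.1.3, a rough sketch of a combined post-deploy check plus Grafana deployment annotation; the list of checked files, the health-check URL and the GRAFANA_URL/GRAFANA_API_TOKEN variables are assumptions for illustration only, the concrete steps are to be worked out in #161426 and #161429:

```bash
#!/bin/bash
# Hypothetical post-deploy step for salt-states-openqa; names are assumed.
set -euo pipefail

# 1. Verify that critical config files, which this incident showed can end
#    up missing or truncated, exist and are non-empty on the deployed host.
for f in /etc/fstab /etc/openqa/database.ini /etc/openqa/openqa.ini; do
    ssh openqa.suse.de "test -s $f" || {
        echo "ERROR: $f is missing or empty after deployment" >&2
        exit 1
    }
done

# 2. Verify the webUI answers instead of the reverse proxy's 502.
curl --fail --silent --show-error --output /dev/null https://openqa.suse.de/

# 3. Record the deployment as a Grafana annotation so dashboards show when
#    salt-states-openqa (or osd-deployment) was applied.
curl --fail --silent --show-error -X POST \
    -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"tags": ["deployment", "salt-states-openqa"], "text": "salt-states-openqa deployment finished"}' \
    "$GRAFANA_URL/api/annotations"
```
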
#4

Updated by okurz about 1 month ago

I created https://suse.slack.com/archives/C029APBKLGK/p1717409655107289?thread_ts=1717143302.275009&cid=C029APBKLGK to cover 4.+5.

(Oliver Kurz) To conclude here: The actual issue was resolved with good help from Martin. Thank you @Martin Caj! Still I think multiple things could have helped to reduce the resolution time from 7h to probably less than 1h:

  1. With direct management access to the VM we could have looked into why the machine did not boot up within minutes, and probably would also have understood the problem within minutes, as it is related to knowledge that we as machine owners have and that nobody from IT can be expected to have, namely why a file like /etc/fstab would have gone missing. For this, https://progress.opensuse.org/issues/132149 already mentions multiple ideas, we have had multiple talks about this going back years, and there is https://sd.suse.com/servicedesk/customer/portal/1/SD-126446 with explicit ideas which again have not been followed up on since 2023-09. So I am very kindly but persistently asking again: Can we please manage to give hypervisor management access rather sooner than later? Again, theoretical future ideas don't help real-world problems. If you just need a contract signed in blood stating that I will only touch one machine while having admin-level access on the hypervisor host, I can do that :)
  2. https://sd.suse.com/servicedesk/customer/portal/1/SD-158390 was picked up at 8:55L, which is 45m after my first message here and only after the additional ping by @Matthias Griessmeier, so I assume there was a "backchannel assignment" between Moroni and members of the team. So potentially me setting the ticket to "Impact: High" and "Urgency: High" is still not seen as being as urgent as a personally triggered escalation?
  3. I suggested here as well as in the direct message to Martin to please use a video conferencing solution or at least a group chat. The sluggish conversation over text-based direct messages did not provide the reaction time that I assume should have been applied to the problem considering its severity. As an alternative to 1., this would very likely have helped to resolve the issue much sooner, as for example the reason for the boot problem could have been apparent on a shared screen. I am hoping for your response.
#5

Updated by okurz about 1 month ago

  • Related to action #132149: Coordinate with Eng-Infra to get simple management access to VMs (o3/osd/qa-jump.qe.nue2.suse.org) size:M added
#6

Updated by okurz about 1 month ago

  • Related to action #161429: incomplete config files on OSD due to salt - create annotations in grafana on the time of the osd deployment as well as salt-states-openqa deployments added
#7

Updated by okurz about 1 month ago

  • Related to action #161426: incomplete config files on OSD due to salt - introduce post-deploy monitoring steps like in osd-deployment but in salt-states-openqa added
#8

Updated by okurz about 1 month ago

  • Related to action #161423: [timeboxed:10h] Incomplete config files on OSD due to salt - Improve salt state application from remotely accessible salt master size:S added
#9

Updated by okurz about 1 month ago

  • Status changed from In Progress to Resolved

All relevant follow-up tasks have been put into separate, related tickets.
