action #163610

closed

Conduct "lessons learned" with Five Why analysis for "[alert] (HTTP Response alert Salt tm0h5mf4k)"

Added by okurz 6 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date: 2024-07-10
Due date:
% Done: 0%
Estimated time:
Description

Motivation

In #163592 OSD was slow and sluggish and caused a lot of incomplete jobs on 2024-07-10 for multiple hours (ongoing at time of writing on 2024-07-10). We should learn what happened and find improvements for the future.

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys (not as part of the retro)

Related issues 2 (0 open, 2 closed)

Copied from openQA Infrastructure (public) - action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M (Resolved, okurz, 2024-07-10)

Copied to openQA Infrastructure (public) - action #163775: Conduct "lessons learned" with Five Why analysis about many alerts, e.g. alerts not silenced for known issues size:S (Resolved, livdywan, 2024-07-10)

Actions #1

Updated by okurz 6 months ago

  • Copied from action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M added
Actions #2

Updated by okurz 6 months ago

  • Copied to action #163775: Conduct "lessons learned" with Five Why analysis about many alerts, e.g. alerts not silenced for known issues size:S added
Actions #3

Updated by okurz 6 months ago

  • Status changed from Blocked to New

After #163592 resolved I propose to do a lessons learned meeting today in the mob session.

Actions #4

Updated by okurz 6 months ago · Edited

What happened

OSD was unresponsive before the start of our business day and also over the day of 2024-07-10. Several people looked into the problem soon after it was reported. After identifying one underlying problem related to the live view, a fix was prepared and a mitigation was put in place on OSD by disabling the live view. Auto-update was disabled as well, even though it is not related to the problem. osd-deployment was also disabled.

5 Whys

Q1. Why was OSD not responding in time to HTTP requests?

  • A: Because the openQA webUI or the nginx service could not respond to all requests within a reasonable time even though the system load was low.
    • A1-1: No action defined. We already have separate monitoring panels for the reaction of the internal openQA webUI daemon as well as the external facing HTTP proxy.

Q2. Why could the system not respond to requests even though the operating system was under low load and low usage?

  • A: Because there were too many long-running blocking requests to the openQA webUI. By strace'ing the openQA webUI daemon processes we identified these requests as belonging to the live view of currently running openQA tests.

Q3. Why were there more requests than our system can handle for live views at that time?

  • A: Because during a normal business day especially openQA test reviewers (human operators) who are also power users are likely to open multiple tabs to monitor the progress of openQA tests. The number of those open browser tabs likely then exceeded the number of available webUI worker processes.
    • A3-1: Access log statistics. Didn't we have that in the past with some apache related tooling? We should look into this again using nginx related tooling to gather statistics about accesses -> #164472
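
As a starting point for the access log statistics mentioned in A3-1, the following is a minimal sketch and not existing tooling. It assumes nginx writes its default "combined" log format; the function names `route_key` and `count_routes` are made up for illustration. It counts requests per route after collapsing numeric path segments, so that many live view requests for different jobs show up as one route.

```python
import re
from collections import Counter

# Matches the request and status part of nginx's default "combined" format,
# i.e. the quoted request line followed by a three-digit status code.
# Assumption: the log format has not been customized.
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def route_key(path: str) -> str:
    """Drop the query string and replace purely numeric segments by <id>."""
    path = path.split("?", 1)[0]
    return re.sub(r"/\d+(?=/|$)", "/<id>", path)


def count_routes(lines) -> Counter:
    """Count requests per (method, normalized route); skip unparsable lines."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[(match.group("method"), route_key(match.group("path")))] += 1
    return counts
```

Calling `count_routes(open(path))` on an access log and looking at `most_common(10)` would give a first impression of which routes dominate; the log file location depends on the host configuration.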

Q4. Why do we only allow a limited number of webUI worker processes?

  • A: Because we never cared to understand what the number of workers entails and how a value should be selected. The parameter is defined in script/openqa-webui-daemon. Before 029fbf5df it was defined in systemd/openqa-webui.service. After 2668e2598 the value was 30, before that it was 20. Before ce5e6e6a0 from 2017 the value was 10. Before ef51c9200 "Increase the default timeouts and worker number of prefork" from 2016 (https://github.com/os-autoinst/openQA/pull/586) the value was not set explicitly, so we likely used a mojo default. According to https://github.com/os-autoinst/openQA/pull/586/files#r54536427 the number of workers defaults to 4.
    • A4-1: Research the meaning and implications of the number of workers as a tuning parameter. Come up with an understanding on how to select the value, where the limits are, what suggestions to openQA admins are -> add that in openQA documentation -> #164493
    • A4-2: Follow-up on A4-1 but select a value dynamically based on nr. of requests vs. available system resources -> #164496
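
One possible way to reason about the worker count for A4-1 is Little's law: the average number of requests in flight equals the arrival rate times the mean request duration. The sketch below is only a rule-of-thumb estimate under that assumption, not how openQA selects the value. The function name `workers_needed` and the `target_utilization` headroom factor are hypothetical.

```python
import math


def workers_needed(requests_per_second: float,
                   mean_request_seconds: float,
                   target_utilization: float = 0.7) -> int:
    """Estimate how many blocking workers keep up with the given load.

    Little's law gives the average number of busy workers as
    rate * duration. Dividing by a target utilization below 1 leaves
    headroom for bursts (the 0.7 default is an arbitrary assumption).
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    busy_on_average = requests_per_second * mean_request_seconds
    return max(1, math.ceil(busy_on_average / target_utilization))
```

The estimate shows why long-running requests are costly: a request that blocks a worker for minutes contributes as much to the required worker count as thousands of short requests, which matches the observation that a moderate number of open live view tabs can occupy all workers while the system load stays low.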

Q5. Why did we only see a problem with the live view and not (yet) with other routes?

  • A: We did not investigate slow routes that eventually loaded. And we don't track available webUI daemon workers.
    • A5-1: See A3-1 for access logs, then plan improvements accordingly for the critical bottlenecks -> #164475
    • A5-2: Monitoring of idle/busy webUI/liveview handler workers -> #164478

Related questions regarding handling of the incident

Q6. Why were mitigations only put in place later than looking into the root cause of the issue?

  • A: We benefitted from keeping the system in the known broken state for longer to ease investigation while the issue was present.
    • A6-1: Discuss benefits vs. drawbacks about applying mitigations as early as possible vs. keeping system in broken state to ease investigation. Also do industry standards best practice research -> #164481

Q7. What do we do for recovery?

  • A: As our alert handling states we should prioritize urgency mitigation, i.e. handling the effect visible to the users first, e.g. restart services
    • A7-1: Investigation helper, e.g. commands in a bash script to collect useful logs, systemd journal, etc. -> #164484
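
A rough idea of what the investigation helper from A7-1 could look like, sketched in Python rather than bash. This is an illustration under assumptions: the unit name `openqa-webui` is taken from the service file mentioned in Q4, `nginx` is a guess for the proxy unit, and the helper names `build_commands` and `collect` are made up. It only uses `journalctl` and `systemctl` options that exist in systemd (`--no-pager`, `--since`, `-u`).

```python
import subprocess
from pathlib import Path

# Assumed unit names, adjust to the actual host.
UNITS = ("openqa-webui", "nginx")


def build_commands(since: str = "2 hours ago", units=UNITS) -> dict:
    """Return a mapping of output name to the command line to run."""
    commands = {"uptime": ["uptime"]}
    for unit in units:
        commands[f"journal-{unit}"] = [
            "journalctl", "--no-pager", "--since", since, "-u", unit,
        ]
        commands[f"status-{unit}"] = ["systemctl", "status", "--no-pager", unit]
    return commands


def collect(outdir: Path, since: str = "2 hours ago") -> None:
    """Run every command and store stdout and stderr in one file per command."""
    outdir.mkdir(parents=True, exist_ok=True)
    for name, argv in build_commands(since).items():
        try:
            result = subprocess.run(argv, capture_output=True, text=True, check=False)
            output = result.stdout + result.stderr
        except FileNotFoundError as error:
            # The tool is not installed on this host; record that instead of failing.
            output = f"could not run {argv[0]}: {error}\n"
        (outdir / f"{name}.txt").write_text(output)
```

Collecting the data first and restarting services afterwards would allow mitigating early without losing the evidence, which also touches on the trade-off discussed in Q6.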

Further ideas

  • Limit the number of live views handled and present a "busy" message to users when there are no more free liveview handler workers available -> #164499
  • Investigate if there are more potentially long running blocking requests which should be treated similar as the live view ones, i.e. being handled by a different process (group) -> #164502
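
The first idea, limiting the number of live views and answering "busy" instead of blocking, can be sketched with a non-blocking semaphore. This is a generic illustration in Python and not the openQA implementation (openQA is written in Perl). The class `LiveViewSlots`, the function `handle_live_view` and the response tuple are hypothetical.

```python
import threading
from contextlib import contextmanager


class LiveViewSlots:
    """Fixed pool of live view slots that never makes callers wait."""

    def __init__(self, max_live_views: int):
        self._slots = threading.BoundedSemaphore(max_live_views)

    @contextmanager
    def try_acquire(self):
        # blocking=False returns immediately with False when no slot is free.
        acquired = self._slots.acquire(blocking=False)
        try:
            yield acquired
        finally:
            if acquired:
                self._slots.release()


def handle_live_view(slots: LiveViewSlots, stream):
    """Serve the stream if a slot is free, otherwise report 'busy' right away."""
    with slots.try_acquire() as has_slot:
        if not has_slot:
            return 503, "Live view is busy, please try again later"
        return 200, stream()
```

The point of the design is that a request which cannot get a slot returns immediately, so it never occupies a worker that other routes need.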

Good

  • The actual issue about the liveview handler was quickly identified and fixed. Good teamwork in Jitsi with shared screen tmate/tmux/screen. Good collaboration among the software domain experts and infrastructure/OS experts.

Could be improved

  • B1: Alert was not silenced -> handled in #163775
  • B2: We notified the broader user group in #eng-testing about the problem and the mitigations we applied, however that was nearly 6h after the incident was reported. Also, at 09:42 CEST there were already user reports before any message from the tools team -> I2: https://progress.opensuse.org/projects/qa/wiki/Tools#Process already states the process, which we agree with, and we need to remind ourselves of it.

    • I2-1: Research about a status page or maintenance fallback mode information page which we could redirect to. That could be known issues on openQA itself as well as redirect on the level of the web proxy -> #164487
    • I2-2: We need to find a way to remind us about the process which we gave ourselves: How about a checklist that could be followed based on what's already mentioned in https://progress.opensuse.org/projects/qa/wiki/Tools#Process ? -> #164490
  • B3: Deep error investigation was conducted by a task force but there was no clear commitment who would do the outwards facing communication and mitigation

    • See I2-2
Actions #5

Updated by okurz 6 months ago

  • Status changed from New to Resolved

Planned follow-up tasks for all relevant open points.
