action #163610
Conduct "lessons learned" with Five Why analysis for "[alert] (HTTP Response alert Salt tm0h5mf4k)"
Status: closed
Description
Motivation
In #163592 OSD was slow, sluggish or caused a lot of incomplete jobs on 2024-07-10 for multiple hours (still ongoing at the time of writing on 2024-07-10). We should learn what happened and find improvements for the future.
Acceptance criteria
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions
- Bring up in retro
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys (not as part of the retro)
Updated by okurz 4 months ago
- Copied from action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M added
Updated by okurz 4 months ago
- Copied to action #163775: Conduct "lessons learned" with Five Why analysis about many alerts, e.g. alerts not silenced for known issues size:S added
Updated by okurz 4 months ago · Edited
What happened
OSD was unresponsive before our start of business day and also over the course of 2024-07-10. Multiple people looked into the problem soon after the issue was reported. After identifying one underlying problem related to the live view, a fix was prepared and a mitigation was put in place on OSD by disabling the live view; auto-update was disabled as well even though it is not related. osd-deployment was also disabled.
5 Whys
Q1. Why was OSD not responding in time to HTTP requests?
- A: Because the openQA webUI or the nginx service could not respond to all requests within a reasonable time even though the system load was low.
- A1-1: No action defined. We already have separate monitoring panels for the responsiveness of the internal openQA webUI daemon as well as the external-facing HTTP proxy.
Q2. Why could the system not respond to requests even though the operating system was under low load and low usage?
- A: Because there were too many long-running, blocking requests to the openQA webUI. With the help of strace'ing openQA webUI daemon processes we identified these requests to be connected to the live view of currently running openQA tests (see the sketch after this block).
- A2-1: Handled in #163757
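For reference, a minimal sketch of how such strace'ing could look, assuming the webUI prefork workers can be matched with a pgrep pattern like "openqa prefork" (the pattern and paths are assumptions):

```bash
# Sketch only: attach strace to all openQA webUI prefork workers for 30s
# to spot workers stuck in long blocking syscalls.
# The pgrep pattern is an assumption and may need adjusting on OSD.
for pid in $(pgrep -f 'openqa prefork'); do
    timeout 30 strace -f -tt -T -p "$pid" -o "/tmp/strace-webui-$pid.log" &
done
wait
# With -T each syscall gets a <duration> suffix; show the ones taking >= 1s
grep -E '<[1-9][0-9]*\.' /tmp/strace-webui-*.log | head
```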
Q3. Why were there more requests than our system can handle for live views at that time?
- A: Because during a normal business day especially openQA test reviewers (human operators), who are also power users, are likely to open multiple tabs to monitor the progress of openQA tests. The number of those open browser tabs then likely exceeded the number of available webUI worker processes.
- A3-1: Access log statistics. Didn't we have that in the past with some Apache-related tooling? We should look into this again using nginx-related tooling to gather statistics about accesses (a possible starting point is sketched below) -> #164472
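A possible starting point for such statistics with plain shell tooling, assuming nginx writes the default "combined" log format to /var/log/nginx/access.log (both are assumptions):

```bash
# Sketch only: count requests per route, query strings stripped.
# Assumes the default "combined" nginx log format where field 7 is the path.
awk '{print $7}' /var/log/nginx/access.log \
  | sed 's/?.*//' \
  | sort | uniq -c | sort -rn | head -n 20
```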
Q4. Why do we only allow a limited number of webUI worker processes?
- A: We have that parameter defined in script/openqa-webui-daemon. Before 029fbf5df we had it in systemd/openqa-webui.service. After 2668e2598 the value was 30, before that 20. Before ce5e6e6a0 from 2017 the value was 10. Before ef51c9200 "Increase the default timeouts and worker number of prefork" from 2016 (https://github.com/os-autoinst/openQA/pull/586) the value was not set explicitly, so we likely used a Mojolicious default which, according to https://github.com/os-autoinst/openQA/pull/586/files#r54536427, is 4 workers. In short, the limit stayed because we never cared to understand what the number of workers entails and how a value should be selected.
- A4-1: Research the meaning and implications of the number of workers as a tuning parameter. Come up with an understanding of how to select the value, where the limits are and what to suggest to openQA admins -> add that to the openQA documentation -> #164493
- A4-2: Follow-up on A4-1 but select a value dynamically based on the number of requests vs. available system resources (a rough sketch follows below) -> #164496
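As a rough illustration for A4-2, a wrapper could derive the worker count from the available hardware instead of hard-coding it. The thresholds and the "openqa prefork" invocation below are assumptions, not the current content of script/openqa-webui-daemon:

```bash
# Sketch only: pick a prefork worker count from CPU count and memory,
# clamped to an assumed range, then start the webUI daemon with it.
cpus=$(nproc)
mem_gb=$(awk '/MemTotal/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo)
workers=$(( cpus * 2 ))
(( workers > mem_gb )) && workers=$mem_gb   # assume roughly 1 GB per worker
(( workers < 4 ))  && workers=4
(( workers > 60 )) && workers=60
# Assumed install path and invocation; adjust to the real wrapper script
exec /usr/share/openqa/script/openqa prefork -m production -w "$workers"
```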
Q5. Why did we only see a problem with the live view and not (yet) with other routes?
- A: We did not investigate slow routes that eventually loaded, and we do not track the number of available webUI daemon workers.
Related questions regarding handling of the incident
Q6. Why were mitigations only put in place after we had already started looking into the root cause of the issue?
- A: We benefitted from keeping the system in the known broken state for longer to ease investigation while the issue was present.
- A6-1: Discuss the benefits vs. drawbacks of applying mitigations as early as possible vs. keeping the system in a broken state to ease investigation. Also research industry-standard best practices -> #164481
Q7. What do we do for recovery?
- A: As our alert handling process states, we should prioritize urgency mitigation, i.e. handle the effect visible to users first, e.g. by restarting services
- A7-1: Investigation helper, e.g. commands in a bash script to collect useful logs, the systemd journal, etc. (a minimal sketch follows below) -> #164484
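A minimal sketch of such an investigation helper, assuming the relevant systemd units are called openqa-webui and nginx (unit names and paths are assumptions):

```bash
# Sketch only: collect basic evidence into one tarball before restarting
# services, so that recovering by restart does not destroy the trail.
out=/tmp/osd-incident-$(date +%Y%m%d-%H%M%S)
mkdir -p "$out"
journalctl -u openqa-webui --since '2 hours ago' > "$out/openqa-webui.journal"
journalctl -u nginx --since '2 hours ago' > "$out/nginx.journal"
ps auxf > "$out/ps.txt"
uptime > "$out/uptime.txt"
free -h > "$out/free.txt"
ss -tanp > "$out/sockets.txt" 2>/dev/null
tar czf "$out.tar.gz" -C /tmp "$(basename "$out")"
echo "Collected evidence in $out.tar.gz"
```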
Further ideas
- Limit the number of live views handled and present a "busy" message to users when there are no more free live view handler workers available -> #164499
- Investigate if there are more potentially long-running blocking requests which should be treated similarly to the live view ones, i.e. handled by a different process (group); one possible way to look for them is sketched below -> #164502
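One possible way to look for further long-running requests, assuming we extend the nginx log format to append $request_time as the last field (that extension is an assumption, not the current setup):

```bash
# Sketch only: list the slowest requests by route.
# Assumes $request_time was appended as the last field of the access log.
awk '{print $NF, $7}' /var/log/nginx/access.log \
  | sed 's/?[^ ]*//' \
  | sort -rn | head -n 20
```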
Good
- The actual issue in the live view handler was quickly identified and fixed. Good teamwork in Jitsi with screen sharing via tmate/tmux/screen. Good collaboration among the software domain experts and infrastructure/OS experts.
Could be improved
- B1: Alert was not silenced -> handled in #163775
- B2: We notified the broader user group in #eng-testing about the problem and the mitigations we applied, however that was nearly 6h after the incident was reported. Also, at 09:42 CEST there were already user reports before any message from the tools team about it -> I2: https://progress.opensuse.org/projects/qa/wiki/Tools#Process already states the process, which we agree with, and we need to remind ourselves to follow it.
- I2-1: Research a status page or a maintenance fallback-mode information page which we could redirect to. That could be a known-issues page on openQA itself as well as a redirect on the level of the web proxy -> #164487
- I2-2: We need to find a way to remind ourselves about the process we gave ourselves: how about a checklist that could be followed, based on what is already mentioned in https://progress.opensuse.org/projects/qa/wiki/Tools#Process? -> #164490
- B3: A deep error investigation was conducted by a task force but there was no clear commitment about who would do the outward-facing communication and mitigation
- See I2-2