action #132218

closed

Conduct lessons learned for "openQA is not accessible" on 2023-07-02

Added by okurz 11 months ago. Updated 11 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
Start date: 2023-07-02
Due date:
% Done: 0%
Estimated time:

Description

Motivation

In #132200 an outage of o3 was reported; bmwiedemann also brought it to my attention. A further problem report in https://suse.slack.com/archives/C02CANHLANP/p1688348613953109 stated that no o3 jobs were being worked on. According to fvogt, the o3 workers kept retrying during the outage until they eventually gave up. The original problem was introduced by https://github.com/os-autoinst/openQA/pull/5231. okurz mitigated the original problem as described in #131024-11, and tinita fixed the "snapshot-changes" service. One day later, fvogt restarted the affected openQA worker instance services so that the workers would execute jobs again.

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and its results are documented
  • AC2: Improvements are planned

Suggestions

  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the Five-Whys analysis (not as part of the retro)

Related issues 4 (0 open, 4 closed)

Related to openQA Project - action #131024: Ensure both nginx+apache are properly covered in packages+testing+documentation size:S (Resolved, dheidler)

Related to openQA Infrastructure - action #131150: Add alarms for partition usage on o3 size:M (Resolved, livdywan, 2023-06-20)

Related to openQA Infrastructure - action #132278: Basic o3 http response alert on zabbix size:M (Resolved, jbaier_cz)

Copied from openQA Infrastructure - action #132200: openQA is not accessible (Resolved, tinita, 2023-07-02)
