action #132218
Conduct lessons learned for "openQA is not accessible" on 2023-07-02

Added by okurz 10 months ago. Updated 10 months ago.

Status: Resolved
Priority: Urgent
Assignee: okurz
Category: -
Target version:
Start date: 2023-07-02
Due date:
% Done: 0%
Estimated time:
Description

Motivation

In #132200 an outage of o3 was reported. This was also brought up to me directly by bmwiedemann, and in https://suse.slack.com/archives/C02CANHLANP/p1688348613953109 it was additionally reported that no o3 jobs were being worked on. According to fvogt, o3 workers kept retrying during the outage until they eventually gave up. The original problem was introduced by https://github.com/os-autoinst/openQA/pull/5231. okurz mitigated it as described in #131024-11, tinita fixed the "snapshot-changes" service, and one day later fvogt restarted the affected openQA worker instance services so that workers would pick up jobs again.

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys (not as part of the retro)

Related issues: 4 (0 open, 4 closed)

Related to openQA Project - action #131024: Ensure both nginx+apache are properly covered in packages+testing+documentation size:S (Resolved, dheidler)

Related to openQA Infrastructure - action #131150: Add alarms for partition usage on o3 size:M (Resolved, livdywan, 2023-06-20)

Related to openQA Infrastructure - action #132278: Basic o3 http response alert on zabbix size:M (Resolved, jbaier_cz)

Copied from openQA Infrastructure - action #132200: openQA is not accessible (Resolved, tinita, 2023-07-02)
Actions #1

Updated by okurz 10 months ago

Actions #2

Updated by okurz 10 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #3

Updated by okurz 10 months ago

Five Whys

  • Q1: Why was there no monitoring alert message to SUSE QE Tools? Or was there?
    • A1-1: There was an alert from https://gitlab.suse.de/openqa/monitor-o3 but "Failed pipeline for master" is not that obvious => R1-1-1: Try to make that more obvious
    • A1-2: No alert from zabbix => R1-2-1: Find the corresponding ticket where we want to make sure that monitoring works again (the agent has been failing for some months, etc.) -> #131150 and #132278
    • A1-3: openqa-review pipelines also failed, showing us that there was a problem
  • Q2: Why was nobody aware that there is already a potentially conflicting nginx config file on o3?
    • A2-1: During pull request review we assumed that the reviewee was aware and would include a complete replacement, but the package installs "vhosts.d/openqa.conf" while the file on o3 is "conf.d/openqa.conf", and the two also differ significantly => R2-1-1: We need to remind ourselves to be more explicit in code review and ask the reviewee for confirmation, e.g. how was this tested? Are you aware of the already existing file on o3? Don't just trust!
  • Q3: Why did no CI tests fail for that?
    • A3-1: Because no CI tests include the complete o3 nginx config
  • Q4: Why does the new vhosts.d/openqa.conf file differ from the o3 one so much?
    • A4-1: The file was already there in the git repo; dheidler just packaged it. The original file has been there since 2018 (a975ac9d2). It has received updates over time, but we never cross-checked what would be missing before deploying on o3
  • Q5: Why did some of us think it's a good idea to replace the o3 nginx config file in production when we already knew it differed that much?
    • A5-1: That was a big misunderstanding in the code review. The reviewers assumed the change was ok and had been verified by the reviewee, while the reviewee assumed it would not replace the file completely: the expectation was that the file, if already present, would only be deployed as an .rpmnew file, not affecting the already existing configuration =>
    • R5-1-1: We should come up with a way to structure the config so that there is a file from the package which no admin needs to change manually, and keep the instance-specific configuration in another layer (see the spec sketch below this list). Research RPM config handling and test locally, e.g. with a local container: build the package locally with osc build and install it manually within the test environment with manual rpm calls -> to be covered in #131024-15
    • R5-1-2: We need to explicitly test on o3 before merging, or closely monitor right afterwards -> to be covered in #131024-15
    • R5-1-3: The package on o3 seems to have been applied only on Sunday morning. Was the OBS package build delayed that much? tail -n 200 /var/log/zypp/history | grep 'install|openQA|' says
2023-06-30 09:06:45|install|openQA|4.6.1688114325.858536e-lp154.5927.1|x86_64||devel_openQA|77363349f8a1593ab6ce231b1f3aa2fa71ad9afb76b9d9f0e1f337866da3f71e|
2023-06-30 12:01:28|install|openQA|4.6.1688124489.7f4be1c-lp154.5929.1|x86_64||devel_openQA|c572f40efa2a3836cbefd014dbe770de4e5d01fe93cda0efafcb1f35822bd771|
2023-07-02 05:22:50|install|openQA|4.6.1688124489.7f4be1c-lp154.5930.1|x86_64||devel_openQA|7f7086a1efaac197813c42111d6ce645d37f0941e2af8e81ea3faf54aa892052|

7f4be1c is https://github.com/os-autoinst/openQA/pull/5231, so at 12:01 (likely 12:01Z) the change was already deployed, roughly only 30m after merge, not too bad. The problem actually only appeared at Jul 02 03:30:43 (ariel nginx[1806]: nginx: [emerg] duplicate upstream "webui" in /etc/nginx/vhosts.d/openqa.conf:3), so it is triggered by the scheduled reboot of the machine. We don't trigger a restart/reload of nginx (nor of apache) on package updates. Find out if we should reload nginx (and apache) on config file updates from package installations (see the scriptlet sketch below this list) -> #131024-15
    • R5-1-4: Consider shifting the scheduled reboot to Sunday evening
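
To illustrate R5-1-1, here is a minimal sketch of how the packaging could keep the vhost file out of the admin's way; the %config(noreplace) marking and the include-based layering are assumptions for discussion, not the current openQA packaging:

    # hypothetical excerpt of the openQA spec %files section
    %files
    %config(noreplace) %{_sysconfdir}/nginx/vhosts.d/openqa.conf
    # with %config(noreplace) an existing, locally modified openqa.conf stays
    # untouched on upgrade and the packaged version lands as openqa.conf.rpmnew instead
    # the packaged vhosts.d/openqa.conf itself would then only hold generic
    # directives plus an include for an admin-owned layer, e.g.
    #   include /etc/nginx/openqa.d/*.conf;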

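For the open question in R5-1-3 whether the web server should be reloaded when a package update touches its config, a possible RPM %post scriptlet is sketched below; this is only an illustration assuming systemd and the stock nginx.service/apache2.service unit names, and whether we actually want this on o3 is what #131024-15 should decide:

    %post
    # only reload/restart units that are already active; never start or enable
    # anything implicitly and never fail the package installation because of it
    if [ -x /usr/bin/systemctl ]; then
        systemctl try-reload-or-restart nginx.service apache2.service || :
    fi
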
Disclaimer

  1. Obviously we share responsibility and we do not blame any single person for any incident. Everyone is invited to submit changes as well as review changes, regardless of what impact that has or could cause in the future
  2. Nobody from SUSE QE Tools is expected to react to an incident happening outside usual business hours, e.g. any reaction on Sunday or even early Monday morning is purely voluntary
  3. Incidents like this will always happen in areas where both the reviewee and the reviewers are not very experienced. This is ok, and we embrace the additional learning opportunities that such limitations and potential incidents bring
Actions #4

Updated by openqa_review 10 months ago

  • Due date set to 2023-07-18

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by okurz 10 months ago

  • Related to action #131024: Ensure both nginx+apache are properly covered in packages+testing+documentation size:S added
Actions #6

Updated by okurz 10 months ago

  • Related to action #131150: Add alarms for partition usage on o3 size:M added
Actions #7

Updated by okurz 10 months ago

  • Related to action #132278: Basic o3 http response alert on zabbix size:M added
Actions #8

Updated by okurz 10 months ago

  • Due date deleted (2023-07-18)
  • Status changed from In Progress to Resolved

Added the corresponding ticket references, comments and follow-up tasks in related tickets. All done.
