action #132218
Conduct lessons learned for "openQA is not accessible" on 2023-07-02

Added by okurz 10 months ago. Updated 10 months ago.

Status: Resolved
Priority: Urgent
Assignee: okurz
Category: -
Target version:
Start date: 2023-07-02
Due date:
% Done: 0%
Estimated time:
Description

Motivation

In #132200 an outage of o3 was reported. This was also brought up to me directly by bmwiedemann, and in https://suse.slack.com/archives/C02CANHLANP/p1688348613953109 it was additionally reported that no o3 jobs were being worked on. According to fvogt, o3 workers kept retrying during the outage until they eventually gave up. The original problem was introduced by https://github.com/os-autoinst/openQA/pull/5231. okurz mitigated it as described in #131024-11, tinita fixed the "snapshot-changes" service, and one day later fvogt restarted the affected openQA worker instance services so that workers would pick up jobs again.

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys (not as part of the retro)

Related issues: 4 (0 open, 4 closed)

Related to openQA Project - action #131024: Ensure both nginx+apache are properly covered in packages+testing+documentation size:S (Resolved, dheidler)

Related to openQA Infrastructure - action #131150: Add alarms for partition usage on o3 size:M (Resolved, livdywan, 2023-06-20)

Related to openQA Infrastructure - action #132278: Basic o3 http response alert on zabbix size:M (Resolved, jbaier_cz)

Copied from openQA Infrastructure - action #132200: openQA is not accessible (Resolved, tinita, 2023-07-02)
Actions #1

Updated by okurz 10 months ago

Actions #2

Updated by okurz 10 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #3

Updated by okurz 10 months ago

Five Whys

  • Q1: Why was there no monitoring alert message to SUSE QE Tools? Or was there?
    • A1-1: There was an alert from https://gitlab.suse.de/openqa/monitor-o3 but "Failed pipeline for master" is not that obvious => R1-1-1: Try to make that more obvious
    • A1-2: No alert from zabbix => R1-2-1: Find the corresponding ticket where we want to make sure that monitoring works again (the agent has been failing for some months, etc.) -> #131150 and #132278
    • A1-3: openqa-review pipelines also failed, showing us that there was a problem
  • Q2: Why was nobody aware that there is already a potentially conflicting nginx config file on o3?
    • A2-1: During pull request review we assumed that the reviewee was aware and would include a complete replacement, but the package installs "vhosts.d/openqa.conf" while the file on o3 is "conf.d/openqa.conf", and the two also differ significantly => R2-1-1: We need to remind ourselves to be more explicit in code review and ask the reviewee for confirmation, e.g. how was this tested? Are you aware of the already existing file on o3? Don't just trust!
  • Q3: Why did no CI tests fail for that?
    • A3-1: Because no CI tests include the complete o3 nginx config
  • Q4: Why does the new vhosts.d/openqa.conf file differ from the o3 one so much?
    • A4-1: The file was already there in the git repo; dheidler just packaged it. The original file has been there since 2018 (a975ac9d2). It has received updates over time, but we never cross-checked what would be missing before deploying on o3
  • Q5: Why did some of us think it's a good idea to replace the o3 nginx config file in production when we already knew it differed that much?
    • A5-1: That was a big misunderstanding in the code review. The reviewers assumed the change was ok and had been verified by the reviewee, while the reviewee assumed it would not replace the file completely: the expectation was that the file, if already present, would only be deployed as an .rpmnew file, not affecting the already existing configuration =>
    • R5-1-1: We should come up with a way to structure the config so that there is a file from the package which no admin needs to change manually, and keep the instance-specific configuration in another layer (see the spec sketch below this list). Research RPM config handling and test locally, e.g. with a local container: build the package locally with osc build and install it manually within the test environment with manual rpm calls -> to be covered in #131024-15
    • R5-1-2: We need to explicitly test on o3 before merging, or closely monitor right afterwards -> to be covered in #131024-15
    • R5-1-3: The package on o3 seems to have been applied only on Sunday morning. Was the OBS package build delayed that much? tail -n 200 /var/log/zypp/history | grep 'install|openQA|' says
2023-06-30 09:06:45|install|openQA|4.6.1688114325.858536e-lp154.5927.1|x86_64||devel_openQA|77363349f8a1593ab6ce231b1f3aa2fa71ad9afb76b9d9f0e1f337866da3f71e|
2023-06-30 12:01:28|install|openQA|4.6.1688124489.7f4be1c-lp154.5929.1|x86_64||devel_openQA|c572f40efa2a3836cbefd014dbe770de4e5d01fe93cda0efafcb1f35822bd771|
2023-07-02 05:22:50|install|openQA|4.6.1688124489.7f4be1c-lp154.5930.1|x86_64||devel_openQA|7f7086a1efaac197813c42111d6ce645d37f0941e2af8e81ea3faf54aa892052|

7f4be1c is https://github.com/os-autoinst/openQA/pull/5231, so at 12:01 (likely 12:01Z) the change was already deployed, roughly only 30m after merge, not too bad. The problem actually only appeared at Jul 02 03:30:43 (ariel nginx[1806]: nginx: [emerg] duplicate upstream "webui" in /etc/nginx/vhosts.d/openqa.conf:3), so it is triggered by the scheduled reboot of the machine. We don't trigger a restart/reload of nginx (nor of apache) on package updates. Find out if we should reload nginx (and apache) on config file updates from package installations (see the scriptlet sketch below this list) -> #131024-15
    • R5-1-4: Consider shifting the scheduled reboot to Sunday evening
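
To illustrate R5-1-1, here is a minimal sketch of how the packaging could keep the vhost file out of the admin's way; the %config(noreplace) marking and the include-based layering are assumptions for discussion, not the current openQA packaging:

    # hypothetical excerpt of the openQA spec %files section
    %files
    %config(noreplace) %{_sysconfdir}/nginx/vhosts.d/openqa.conf
    # with %config(noreplace) an existing, locally modified openqa.conf stays
    # untouched on upgrade and the packaged version lands as openqa.conf.rpmnew instead
    # the packaged vhosts.d/openqa.conf itself would then only hold generic
    # directives plus an include for an admin-owned layer, e.g.
    #   include /etc/nginx/openqa.d/*.conf;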

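For the open question in R5-1-3 whether the web server should be reloaded when a package update touches its config, a possible RPM %post scriptlet is sketched below; this is only an illustration assuming systemd and the stock nginx.service/apache2.service unit names, and whether we actually want this on o3 is what #131024-15 should decide:

    %post
    # only reload/restart units that are already active; never start or enable
    # anything implicitly and never fail the package installation because of it
    if [ -x /usr/bin/systemctl ]; then
        systemctl try-reload-or-restart nginx.service apache2.service || :
    fi
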
Disclaimer

  1. Obviously we share responsibility and we do not blame any single person for any incident. Everyone is invited to submit changes as well as review changes, regardless of what impact that has or could cause in the future
  2. Nobody from SUSE QE Tools is expected to react to an incident happening outside usual business hours, e.g. any reaction on Sunday or even early Monday morning is purely voluntary
  3. Incidents like this will always happen in areas where both the reviewee and the reviewers are not very experienced. This is ok, and we embrace the additional learning opportunities that such limitations and potential incidents bring
Actions #4

Updated by openqa_review 10 months ago

  • Due date set to 2023-07-18

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by okurz 10 months ago

  • Related to action #131024: Ensure both nginx+apache are properly covered in packages+testing+documentation size:S added
Actions #6

Updated by okurz 10 months ago

  • Related to action #131150: Add alarms for partition usage on o3 size:M added
Actions #7

Updated by okurz 10 months ago

  • Related to action #132278: Basic o3 http response alert on zabbix size:M added
Actions #8

Updated by okurz 10 months ago

  • Due date deleted (2023-07-18)
  • Status changed from In Progress to Resolved

Added the corresponding ticket references, comments and follow-up tasks in related tickets. All done.
