coordination #109846: [epic] Ensure all our database tables accommodate enough data, e.g. bigint for ids
openQA Infrastructure - coordination #112718: [alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend
Conduct Five Whys for "[alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend"
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
- Bring up in retro
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys (not as part of the retro)
Updated by okurz over 1 year ago
Results from our lessons learned meeting
On 2022-04-12 we ran into an incident of exhausting available IDs for job module results, handled in #109836, with the follow-up to migrate relevant tables to "bigint". We already foresaw a potentially significant performance impact during the migration on rollout, especially on OSD, so we planned the migration to start on a Friday afternoon or later – because every good team deploys on a Friday afternoon :) – The migration was conducted on Saturday morning, first on O3 (openqa.opensuse.org); after no significant impact was observed there, the migration on OSD (openqa.suse.de) was started some hours later, 2022-06-18 1000Z. The services on OSD were generally reachable again 18 minutes afterwards, but with a performance impact causing high response times or timeouts.
Why didn't we see any bad impact directly after the migration?
- The migration itself finished, but the query planner statistics could not have been refreshed at this time. The webUI itself was up and reachable when okurz tested it, but we would likely have seen problems when looking at the HTTP responses over multiple tries. Also, there were minion job alerts
=> think about improving our alerts #112736
- It was the weekend and the load was low because nobody was scheduling or restarting jobs => It was a good idea to deploy explicitly during a time of low load
- Some performance regressions were expected directly after the migration (first autovacuum after migration) => plan a manual "vacuum analyze" for each table affected by the migration -> https://github.com/os-autoinst/openQA/pull/4716 => conduct an automatic "vacuum analyze" after migration -> #112718
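The planned manual "vacuum analyze" could look like the following sketch; the table names here are illustrative stand-ins for whichever tables had their id columns migrated to bigint:

```sql
-- Refresh planner statistics and clean up dead tuples right after the
-- migration instead of waiting for the first autovacuum.
-- Table names are examples, not the definitive migration list.
VACUUM (VERBOSE, ANALYZE) jobs;
VACUUM (VERBOSE, ANALYZE) job_modules;
```

Running this per table keeps each lock window short compared to a database-wide `VACUUM ANALYZE;`.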
Why did OSD have performance issues a few hours after the big migration?
- Initially, after the start of the services, everything was available, but the query planner had not yet optimized its plans, so significant problems could only show up over time => run pg_stat_statements_reset() and keep an eye on the slowest queries -> https://github.com/os-autoinst/openQA/pull/4716/files#r904941555
- We already have post-deployment checks in osd-deployment that run at various times after migration, e.g. 1h and 8h afterwards. These checks failed and signaled the problems, so they worked as intended
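The pg_stat_statements approach mentioned above can be sketched as follows (it assumes the pg_stat_statements extension is installed and loaded; the `mean_exec_time` column exists in PostgreSQL 13+, older versions call it `mean_time`):

```sql
-- Reset statement statistics right after the migration so only
-- post-migration behavior is measured.
SELECT pg_stat_statements_reset();

-- Later: look at the ten slowest queries by mean execution time.
SELECT query, calls, mean_exec_time, rows
  FROM pg_stat_statements
 ORDER BY mean_exec_time DESC
 LIMIT 10;
```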
Why did we not fix it sooner?
- We observed some problems over the weekend but expected that they might resolve themselves (the first autovacuum should have triggered an analyze and refreshed the query planner statistics). This is backed by our observation on O3, which never showed significant problems and improved by itself after two days
- Also, we did not want to make things worse, because the services were generally available
- It is a good idea to look at the monitoring data, not only alerts, after such exceptional deployments or changes
- We needed the time to check our database changes for consistency and to learn about "pg_stat", "explain analyze" and "vacuum analyze". Understanding our "rows" dashboard was really helpful. => It is a good idea to learn more about PostgreSQL, consider doing courses => #112946 Extend openQA documentation with best practices on what to do after a migration, e.g. look at pg_stats
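As a starting point for such post-migration checks, the standard statistics views already answer the key questions; the `SELECT * FROM jobs WHERE id = 4242` query below is just a hypothetical example of a suspect query:

```sql
-- Have the migrated tables been (auto)analyzed yet? NULLs first
-- surface tables the planner still has no fresh statistics for.
SELECT relname, last_analyze, last_autoanalyze, last_autovacuum
  FROM pg_stat_user_tables
 ORDER BY last_autoanalyze NULLS FIRST;

-- Inspect the actual plan and runtime of a suspect query.
EXPLAIN ANALYZE SELECT * FROM jobs WHERE id = 4242;
```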
Why was OSD looking completely fine after Monday morning and only became bad again in the late afternoon?
- Maybe people were using OSD less and less, and our mitigations, like stopping openqa-scheduler, were helping to lessen the load
- OSD was working on jobs, including the openqa-scheduler, but around 1k new jobs (nothing special) triggered a cascading effect of sequential scans from bad query plans
=> It is a good idea to switch off non-critical features and conduct emergency drills -> #112949
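Such a cascade of sequential scans is visible in the statistics views before it becomes an outage; a minimal check, assuming access to the OSD database, could be:

```sql
-- Tables where sequential scans read the most tuples: a strong hint
-- that the planner is choosing bad plans (e.g. missing fresh
-- statistics after a migration) instead of using indexes.
SELECT relname, seq_scan, seq_tup_read, idx_scan
  FROM pg_stat_user_tables
 ORDER BY seq_tup_read DESC
 LIMIT 10;
```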
Why was OSD so severely affected and unable to recover by itself, while other instances were not?
- Other instances have less data in the relevant tables (explain analyze showed the same bad query plans there)
- Other instances receive fewer bot requests -> prepare a blocklist/passlist for requests by user or user agent -> "Quality of Service"/rate limiting to deliberately slow down requests when the system is too loaded, e.g. rate limits for certain slow endpoints; there are likely reverse-proxy modules for this, e.g. for apache, nginx, haproxy -> #65271#note-101 -> labels/comments for SQL queries to be able to identify them -> #65271#note-101
- More performance headroom => get more hardware for OSD => move Postgres to a separate VM -> already existing ticket #99549