action #163592
Updated by okurz 5 months ago
https://progress.opensuse.org/issues/163592
[alert] (HTTP Response alert Salt tm0h5mf4k)
## Observation
We saw several alerts this morning:
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/tm0h5mf4k/view?orgId=1
```
Date: Wed, 10 Jul 2024 04:45:31 +0200
From: Grafana <osd-admins@suse.de>
To: osd-admins@suse.de
Subject: [FIRING:1] (HTTP Response alert Salt tm0h5mf4k)
1 firing alert instance
[IMAGE]
🔥 1 firing instances
Firing [stats.openqa-monitor.qa.suse.de]
HTTP Response alert
View alert [stats.openqa-monitor.qa.suse.de]
Values
B0=6.424936551249999
Labels
alertname
HTTP Response alert
grafana_folder
Salt
```
## Suggestions
* *DONE* Reduce global job limit
* *DONE* Consider mitigations for route causing the overload (real fix is in place now)
* Actively monitor while mitigations are applied
* Repeatedly: Retrigger affected jobs when applicable
* Handle communication with affected users and downstream services
* strace on relevant processes (of openqa-webui.service). That helped us to define #163757
* After #163757 continue to closely monitor the system and then slowly and carefully conduct the rollback actions
* Ensure that https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=78&orgId=1&from=now-7d&to=now stays stable
* Consider temporarily disabling pg_dump triggered by the regular backup
## Rollback actions
* *DONE* ~~Manual changes to `/usr/share/openqa/lib/OpenQA/WebAPI.pm` on osd~~ -> reverted back to original code with deployment of 2024-07-12 for #163757
* *DONE* ~~enable https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules again~~
* *DONE* `sudo systemctl start auto-update.timer` on osd
* in `/etc/openqa/openqa.ini` on OSD bump the reduced job limit again from 300 to 420 (or maybe do some middle ground between those limits?) (why would 420 not be supported anymore?)
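
A minimal sketch of how the job limit bump could be scripted instead of edited by hand. The section name `scheduler` and the key `max_running_jobs` are assumptions about where the global job limit lives in `openqa.ini` and should be verified against the deployed file. The helper `set_job_limit` is hypothetical. Note that `configparser` drops comments when rewriting, so this is only suitable for producing a diff to review, not for overwriting the file in place.

```python
import configparser
import io


def set_job_limit(ini_text: str, limit: int,
                  section: str = "scheduler",          # assumed section name
                  key: str = "max_running_jobs") -> str:  # assumed key name
    """Return a copy of the ini text with the job limit set to `limit`."""
    # Disable interpolation so literal '%' characters in other values are kept.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, key, str(limit))
    buf = io.StringIO()
    parser.write(buf)  # comments from the original text are not preserved
    return buf.getvalue()


if __name__ == "__main__":
    reduced = "[scheduler]\nmax_running_jobs = 300\n"
    print(set_job_limit(reduced, 420))
```

A middle-ground value between 300 and 420 can be tried the same way by passing a different `limit`.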
## Out of scope
* Fixing the problem that too many browsers on live view starve out other processes: #163757
* openQA jobs explicitly ending with "timestamp mismatch" are to be handled in #162038
* improving the error message about openQA jobs failing with "api failure" and no details: #163781