action #163592
Updated by livdywan 5 months ago
## Observation
We saw several alerts this morning:
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/tm0h5mf4k/view?orgId=1
```
Date: Wed, 10 Jul 2024 04:45:31 +0200
From: Grafana <osd-admins@suse.de>
To: osd-admins@suse.de
Subject: [FIRING:1] (HTTP Response alert Salt tm0h5mf4k)
1 firing alert instance
[IMAGE]
🔥 1 firing instances
Firing [stats.openqa-monitor.qa.suse.de]
HTTP Response alert
View alert [stats.openqa-monitor.qa.suse.de]
Values
B0=6.424936551249999
Labels
alertname
HTTP Response alert
grafana_folder
Salt
```
## Suggestions
* *DONE* Reduce the global job limit
* Consider mitigations for the route causing the overload
* Actively monitor while mitigations are applied
* Retrigger affected jobs (see the command sketch after this list)
* Handle communication with affected users and downstream services
## Mitigations
* Manual changes to `/usr/share/openqa/lib/OpenQA/WebAPI.pm` on osd
* ~~https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules switched off~~
* `sudo systemctl stop auto-update.timer` on osd
* Reduced the global job limit in `/etc/openqa/openqa.ini` on osd from 420 to 300 (see the config sketch after this list)
* okurz still assumes that the actions taken in #159396 might have an effect and suggests temporarily disabling the database backup to observe whether that helps.
## Rollback actions
* See mitigations: TODO for @livdywan to define rollback actions