action #163592
Updated by livdywan 5 months ago
## Observation

We saw several alerts this morning:
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/tm0h5mf4k/view?orgId=1

```
Date: Wed, 10 Jul 2024 04:45:31 +0200
From: Grafana <osd-admins@suse.de>
To: osd-admins@suse.de
Subject: [FIRING:1] (HTTP Response alert Salt tm0h5mf4k)

1 firing alert instance
[IMAGE]

🔥 1 firing instances

Firing [stats.openqa-monitor.qa.suse.de]
HTTP Response alert
View alert [stats.openqa-monitor.qa.suse.de]
Values
B0=6.424936551249999
Labels
alertname
HTTP Response alert
grafana_folder
Salt
```

## Suggestions

* *DONE* Reduce global job limit
* Consider mitigations for the route causing the overload
* Actively monitor while mitigations are applied
* Retrigger affected jobs
* Handle communication with affected users and downstream services

## Mitigations

* Manual changes to `/usr/share/openqa/lib/OpenQA/WebAPI.pm` on osd
* ~https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules switched off~
* `sudo systemctl stop auto-update.timer` on osd
* Reduced the job limit in `/etc/openqa/openqa.ini` on osd from 420 to 300
* okurz still assumes that the actions done in #159396 might have an effect. okurz suggests temporarily disabling the database backup to observe whether that helps.

## Rollback actions

* See mitigations: @livdywan TODO to define rollback actions
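
For reference, the job limit mitigation in `/etc/openqa/openqa.ini` would look roughly like the fragment below. This is a sketch, not a copy of the actual file on osd: the section and key name (`[scheduler]`, `max_running_jobs`) are assumed from openQA's configuration and should be checked against the deployed file. Only the values 420 and 300 come from this ticket.

```ini
# /etc/openqa/openqa.ini on osd (sketch; section and key name are assumptions)
[scheduler]
# Was 420 before the mitigation; restoring that value would be part of the rollback
max_running_jobs = 300
```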