action #163592

Updated by okurz about 1 month ago

## Observation 

 We saw several alerts this morning: 
 http://stats.openqa-monitor.qa.suse.de/alerting/grafana/tm0h5mf4k/view?orgId=1 
 Date: Wed, 10 Jul 2024 04:45:31 +0200 

 ## Suggestions 
 * *DONE* Reduce global job limit 
 * *DONE* Consider mitigations for the route causing the overload (the real fix is in place now) 
 * Actively monitor while mitigations are applied 
 * Repeatedly: Retrigger affected jobs when applicable 
 * Handle communication with affected users and downstream services 
 * Run `strace` on the relevant processes of `openqa-webui.service`; this helped us define #163757 
 * After #163757 is resolved, continue to closely monitor the system, then slowly and carefully conduct the rollback actions 
 * Ensure that https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=78&orgId=1&from=now-7d&to=now stays stable 
 * Consider temporarily disabling pg_dump triggered by the regular backup 


 ## Rollback actions 
 * *DONE* ~~Manual changes to `/usr/share/openqa/lib/OpenQA/WebAPI.pm` on OSD~~ -> reverted to the original code with the deployment of 2024-07-12 for #163757 
 * *DONE* ~~enable https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules again~~ 
 * *DONE* `sudo systemctl start auto-update.timer` on OSD 
 * Enable the backup via `pg_dump` on OSD again by running `sudo ln -fs /usr/lib/postgresql15/bin/pg_dump /etc/alternatives/pg_dump` on OSD (to revert the experiment done in #163592#note-41) 
 * In `/etc/openqa/openqa.ini` on OSD, raise the reduced job limit from 300 back to 420 (or settle on a value between the two? Why would 420 no longer be supported?) 
 * Remove notification policy override in https://monitor.qa.suse.de/alerting/routes 
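
 The job limit change in the rollback list could be scripted roughly as follows. This is a minimal sketch, not a tested procedure for OSD. It assumes the limit is stored under a key named `max_running_jobs` in `openqa.ini` (the ticket does not name the key), that the key is already present and uncommented, and that GNU `sed` is available. `set_job_limit` is a hypothetical helper, not part of openQA.

```shell
#!/bin/sh
# Hypothetical helper: set the job limit in an openqa.ini-style file.
# Assumption: the limit is stored as "max_running_jobs = N" on its own line.
set_job_limit() {
    ini_file=$1
    new_limit=$2
    # Keep a backup next to the file so the change is easy to undo.
    cp -p "$ini_file" "$ini_file.bak" &&
        # GNU sed: rewrite the value in place, leaving all other lines untouched.
        sed -i "s/^max_running_jobs[[:space:]]*=.*/max_running_jobs = $new_limit/" "$ini_file"
}

# Example on a scratch copy rather than the live config:
#   cp /etc/openqa/openqa.ini /tmp/openqa.ini
#   set_job_limit /tmp/openqa.ini 360   # 360 is the midpoint between 300 and 420
```

 Working on a copy first and keeping a `.bak` file makes it possible to diff the result before touching the live configuration.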

 


 ## Out of scope 
 * Fixing the problem that too many browsers on live view starve out other processes: #163757 
 * openQA jobs explicitly ending with "timestamp mismatch" are to be handled in #162038 
 * Improving the error message about openQA jobs failing with "api failure" and no details: #163781
