action #159396
openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project - coordination #108209: [epic] Reduce load on OSD
Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M
0%
Description
Motivation
People repeatedly report problems with the responsiveness of OSD, especially during European lunch time. During such times our monitoring also shows HTTP unresponsiveness. Similar situations can be seen shortly after European midnight. One hypothesis is that "pg_dump", triggered from https://gitlab.suse.de/qa-sle/backup-server-salt/-/blob/master/rsnapshot/rsnapshot.conf#L35, is slowing down operations and causing these symptoms. The schedule for this trigger is defined in https://gitlab.suse.de/qa-sle/backup-server-salt/-/blob/master/rsnapshot/init.sls#L28: it runs every 4h, which includes 00:00 CET/CEST as well as 12:00 CET/CEST.
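To make the overlap between the backup schedule and the reported problem windows explicit, here is a minimal sketch. It assumes the 4h interval starts at midnight local time, as the description implies; the function name `run_times` is hypothetical and not part of the salt states.

```python
from datetime import time


def run_times(interval_hours: int = 4) -> list[time]:
    """Wall-clock start times of a job firing every `interval_hours`,
    assuming the first run of the day is at midnight."""
    return [time(hour=h) for h in range(0, 24, interval_hours)]


starts = run_times()
print([t.strftime("%H:%M") for t in starts])

# The two windows named in the description (midnight and lunch time)
# both coincide with a scheduled run.
print(time(0) in starts, time(12) in starts)
```

The remaining four runs fall at times for which the ticket reports no problems, so the schedule alone does not prove the hypothesis.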
One related alert
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=78&from=now-24h&to=now
B0=15.654104274
Notably the graph shows 15.7s at the time of the alert, e.g. 00:36. The second highest value is 3.7s a couple of hours earlier, and otherwise values are around the 200ms mark (note the difference in units). There seems to be no obvious fallout or other coinciding alerts.
Acceptance criteria
- AC1: No HTTP unresponsiveness alerts or user reports linked to times when pg_dump was running
- AC2: We still have frequent database backups for OSD openQA content so that no more than 4h of work is lost
Suggestions
- DONE Investigate what jobs were running around the time of the alert -> pg_dump
- DONE Check if relevant changes e.g. changes to salt/worker slots had an impact here -> unlikely as we see those issues also during other instances
- Regarding pg_dump keep in mind that https://github.com/os-autoinst/sync-and-trigger/blob/main/dump-psql#L4 is used for o3 but OSD uses a custom call
- pg_dump might slow down overall performance. Consider options from
- https://serverfault.com/questions/349221/how-to-make-pg-dump-less-resource-greedy
- https://www.postgresql.org/docs/current/performance-tips.html
- https://www.linkedin.com/pulse/how-speed-up-pgdump-when-dumping-large-postgres-nikolay-samokhvalov/
- https://www.postgresql.org/docs/current/continuous-archiving.html
- try out in particular the -j option on pg_dump
- Consider if the datacenter migration is the underlying root cause impacting I/O performance on OSD and/or between the backup VM and OSD.
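As a starting point for experimenting with the options above, the following sketch only builds a command line and does not run anything. It is not the call OSD currently uses. The database name `openqa`, the output path and the helper name `throttled_pg_dump_cmd` are assumptions for illustration. It combines two ideas from the linked resources: lowering CPU and I/O priority with `nice` and `ionice`, and the `-j`/`--jobs` option, which pg_dump only supports together with the directory output format.

```python
import shlex


def throttled_pg_dump_cmd(dbname: str, out_dir: str, jobs: int = 2) -> list[str]:
    """Build a low-priority, parallel pg_dump command line (sketch only)."""
    return [
        "nice", "-n", "19",       # lowest CPU scheduling priority
        "ionice", "-c", "3",      # idle I/O class; effect depends on the I/O scheduler
        "pg_dump",
        "--format=directory",     # required for --jobs
        f"--jobs={jobs}",         # dump this many tables in parallel
        f"--file={out_dir}",      # target directory for the dump
        dbname,
    ]


# Hypothetical database name and path, for illustration only.
cmd = throttled_pg_dump_cmd("openqa", "/var/tmp/openqa-dump")
print(shlex.join(cmd))
```

Note that `--jobs` shortens the dump but increases the load on the database server while it runs, and `nice`/`ionice` only affect the host where pg_dump itself executes, so both would need to be measured against the HTTP response graph before adopting them.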
Updated by okurz 12 days ago
- Related to action #158059: OSD unresponsive or significantly slow for some minutes 2024-03-26 13:34Z added
Updated by okurz 12 days ago
- Status changed from In Progress to Resolved
Same as in #158059 and many tickets before, this seems to be related to how apache behaves together with openQA. What is seemingly different is that previously we had problems during "lunch time" and now it's midnight, when different user interaction and system processes can be observed. https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary shows that pg_dump was running before the outage. Maybe that is blocking quite a lot? We also run pg_dump in osd-deployment, so that may be something to look out for in the future. For now I don't think we can do more.
Updated by okurz 11 days ago
- Subject changed from HTTP Response alert for /tests briefly going up to 15.7s to HTTP Response alert for /tests briefly going up to 15.7s - potential detrimental impact of pg_dump
- Description updated (diff)
- Status changed from Resolved to New
- Assignee deleted (okurz)
Today during CEST lunch time another period of unresponsiveness was observed, and we stated the hypothesis that pg_dump might slow down overall performance. I recommend we look into some options. Description extended.
Updated by okurz 11 days ago
- Subject changed from HTTP Response alert for /tests briefly going up to 15.7s - potential detrimental impact of pg_dump to Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s)
- Description updated (diff)
Updated by livdywan 10 days ago
- Subject changed from Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) to Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M
- Status changed from New to Workable
Updated by livdywan 8 days ago
- Related to action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S added