action #159396

Updated by mkittler 8 months ago

## Motivation 
 People repeatedly report problems with the responsiveness of OSD, especially around European lunch time. During such times our monitoring also shows HTTP unresponsiveness. Similar situations can be seen shortly after European midnight. One hypothesis is that "pg_dump", triggered from https://gitlab.suse.de/qa-sle/backup-server-salt/-/blob/master/rsnapshot/rsnapshot.conf#L35, slows down operations and causes these symptoms. The schedule for this trigger is defined in https://gitlab.suse.de/qa-sle/backup-server-salt/-/blob/master/rsnapshot/init.sls#L28: it runs every 4h, which includes 00:00 CET/CEST as well as 12:00 CET/CEST. 

 One related alert 
 https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=78&from=now-24h&to=now 

     B0=15.654104274 

 Notably, the graph shows 15.7s at the time of the alert (e.g. 00:36). The second-highest value is 3.7s a couple of hours earlier, and otherwise values are around the 200ms mark (note the difference in units). There seems to be no obvious fallout or other coinciding alerts. 
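 To check whether an alert lines up with the backup schedule, a small helper can compute how long after the most recent 4h trigger the alert fired. This is only a sketch: it assumes triggers at local midnight and every 4h after that (as described above), and the 60-minute "dump window" is a hypothetical placeholder because the actual pg_dump runtime is not stated here. The date used in the example is arbitrary.

```python
from datetime import datetime, timedelta


def minutes_since_last_trigger(ts: datetime, interval_hours: int = 4) -> float:
    """Minutes elapsed between `ts` and the most recent scheduled trigger.

    Assumes triggers at local midnight and every `interval_hours` afterwards.
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = ts - midnight
    interval = timedelta(hours=interval_hours)
    return (elapsed % interval).total_seconds() / 60


def within_dump_window(ts: datetime, window_minutes: float = 60) -> bool:
    """True if `ts` falls within `window_minutes` after a trigger.

    `window_minutes` is a hypothetical stand-in for the real dump duration.
    """
    return minutes_since_last_trigger(ts) <= window_minutes


if __name__ == "__main__":
    alert = datetime(2024, 1, 1, 0, 36)  # arbitrary date, alert time 00:36
    print(minutes_since_last_trigger(alert), within_dump_window(alert))
```

 Feeding the timestamps of past alerts through such a check would give a rough indication of how many of them cluster shortly after a trigger, which is what AC1 is about.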

 ## Acceptance criteria 
 * **AC1:** No HTTP unresponsiveness alerts or user reports linked to times when pg_dump was running 
 * **AC2:** We still have frequent database backups for OSD openQA content so that at most 4h of work is lost 


 ## Suggestions 
 * *DONE* Investigate what jobs were running around the time of the alert -> pg_dump 
 * *DONE* Check whether relevant changes, e.g. changes to salt/worker slots, had an impact here -> unlikely, as we also see those issues in other instances 
 * Regarding pg_dump keep in mind that https://github.com/os-autoinst/sync-and-trigger/blob/main/dump-psql#L4 is used for o3 but OSD uses a custom call 
 * pg_dump might slow down overall performance. Consider options from 
   * https://serverfault.com/questions/349221/how-to-make-pg-dump-less-resource-greedy 
   * https://www.postgresql.org/docs/current/performance-tips.html 
   * https://www.linkedin.com/pulse/how-speed-up-pgdump-when-dumping-large-postgres-nikolay-samokhvalov/ 
   * https://www.postgresql.org/docs/8.3/continuous-archiving.html 
   * In particular, try out the `-j` option of pg_dump (parallel dump, which is only supported with the directory output format) 
 * Consider if the datacenter migration is the underlying root cause impacting I/O performance on OSD and/or between the backup VM and OSD.
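 As a starting point for experiments with the options above, the following sketch assembles a pg_dump invocation that uses `-j` (via `--jobs`) together with the directory format and optionally wraps it in `ionice`/`nice`. It is not the call OSD currently uses. The database name, output path and job count are hypothetical placeholders, and the function only builds the argument list without running anything.

```python
import shlex


def build_pg_dump_command(dbname, outdir, jobs=4, low_priority=True):
    """Return an argv list for a parallel, optionally low-priority pg_dump.

    All argument values are placeholders chosen by the caller.
    """
    cmd = []
    if low_priority:
        # Idle I/O class plus lowest CPU priority for the pg_dump client process.
        cmd += ["ionice", "-c", "3", "nice", "-n", "19"]
    cmd += [
        "pg_dump",
        "--format=directory",  # parallel dumps require the directory format
        f"--jobs={jobs}",
        f"--file={outdir}",
        dbname,
    ]
    return cmd


if __name__ == "__main__":
    # "openqa" and the path are assumptions for illustration only.
    print(shlex.join(build_pg_dump_command("openqa", "/tmp/openqa-dump", jobs=2)))
```

 Two caveats apply. `nice` and `ionice` only affect the pg_dump client process, so if the dump is started from the backup VM, the work done by the PostgreSQL server on OSD is not deprioritized. Also, `-j` shortens the dump but increases the load on the database server while it runs, so it may trade a longer slow period for a shorter, more intense one.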
