action #159396
openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project - coordination #108209: [epic] Reduce load on OSD
Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M
0%
Description
Motivation
People repeatedly report problems with the responsiveness of OSD, especially during European lunch time. During such times our monitoring also shows HTTP unresponsiveness. Similar situations can be seen shortly after European midnight. One hypothesis is that "pg_dump" triggered from https://gitlab.suse.de/qa-sle/backup-server-salt/-/blob/master/rsnapshot/rsnapshot.conf#L35 is slowing down operations, causing these symptoms. The schedule for this trigger is defined in https://gitlab.suse.de/qa-sle/backup-server-salt/-/blob/master/rsnapshot/init.sls#L28 so the dump is triggered every 4h, which includes 00:00 CET/CEST as well as 12:00 CET/CEST.
One related alert
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=78&from=now-24h&to=now
B0=15.654104274
Notably the graph shows 15.7s at the time of the alert, e.g. at 00:36. The second-highest value is 3.7s a couple of hours earlier, and otherwise values stay around the 200ms mark (note the difference in units). There seems to be no obvious fallout or other coinciding alerts?
Acceptance criteria
- AC1: No HTTP unresponsiveness alerts or user reports linked to times when pg_dump was running
- AC2: We still have frequent database backups for OSD openQA content so that not more work than 4h maximum is lost
Suggestions
- DONE Investigate what jobs were running around the time of the alert -> pg_dump
- DONE Check if relevant changes e.g. changes to salt/worker slots had an impact here -> unlikely as we see those issues also during other instances
- Regarding pg_dump keep in mind that https://github.com/os-autoinst/sync-and-trigger/blob/main/dump-psql#L4 is used for o3 but OSD uses a custom call
- pg_dump might slow down overall performance. Consider options from
- https://serverfault.com/questions/349221/how-to-make-pg-dump-less-resource-greedy
- https://www.postgresql.org/docs/current/performance-tips.html
- https://www.linkedin.com/pulse/how-speed-up-pgdump-when-dumping-large-postgres-nikolay-samokhvalov/
- https://www.postgresql.org/docs/current/continuous-archiving.html
- try out in particular the -j option on pg_dump
- Consider if the datacenter migration is the underlying root cause impacting I/O performance on OSD and/or between the backup VM and OSD.
Updated by okurz 2 months ago
- Related to action #158059: OSD unresponsive or significantly slow for some minutes 2024-03-26 13:34Z added
Updated by okurz 2 months ago
- Status changed from In Progress to Resolved
Same as in #158059 and many tickets before, this seems to be related to how apache behaves together with openQA. What is seemingly different is that we previously had problems during "lunch time" and now it's midnight, when different user interaction and system processes can be observed. https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary shows that before the outage pg_dump was running. Maybe that is blocking quite a lot? We also run pg_dump in osd-deployment so maybe that is something to look out for in the future. For now I don't think we can do more.
Updated by okurz 2 months ago
- Subject changed from HTTP Response alert for /tests briefly going up to 15.7s to HTTP Response alert for /tests briefly going up to 15.7s - potential detrimental impact of pg_dump
- Description updated (diff)
- Status changed from Resolved to New
- Assignee deleted (okurz)
Today during CEST lunch time another unresponsiveness was observed and we stated the hypothesis that pg_dump might slow down overall performance. I recommend we look into some options. Description extended.
Updated by okurz 2 months ago
- Subject changed from HTTP Response alert for /tests briefly going up to 15.7s - potential detrimental impact of pg_dump to Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s)
- Description updated (diff)
Updated by livdywan 2 months ago
- Subject changed from Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) to Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M
- Status changed from New to Workable
Updated by livdywan 2 months ago
- Related to action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S added
Updated by okurz about 2 months ago
- Related to action #130636: high response times on osd - Try nginx on OSD size:S added
Updated by okurz about 2 months ago
- Due date set to 2024-06-09
- Status changed from Blocked to Feedback
- Priority changed from High to Low
#130636 is done and I haven't seen the unresponsiveness as originally reported since then. Monitoring for longer.
Updated by jbaier_cz about 1 month ago
Could this alert https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1716147294259&to=1716175779373 be related?
Values
B0=6.840987906666666
Labels
alertname HTTP Response alert
grafana_folder Salt
rule_uid tm0h5mf4k
There is a visible decrease in CPU usage at that time and quite a big increase in inactive minion jobs.
There are also some interesting lines from the log during that time, not sure if related.
May 20 00:27:04 openqa telegraf[1347]: 2024-05-19T22:27:04Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while await>
May 20 00:27:04 openqa telegraf[1347]: 2024-05-19T22:27:04Z E! [agent] Error writing to outputs.influxdb: could not write any address
Updated by okurz about 1 month ago
- Due date deleted (2024-06-09)
- Status changed from Feedback to Workable
- Assignee deleted (okurz)
Yes, I assume that could be related, as pg_dump was likely running at the time the alert fired. So we can follow up with the original plan as it seems the problem still reproduces.
Updated by openqa_review 17 days ago
- Due date set to 2024-06-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by dheidler 17 days ago
Why do we have both?
openqa:/etc/cron.daily # cat /etc/cron.daily/dump-openqa
#!/bin/bash
#exit 0 # avoid further data loss
backup_dir="${backup_dir:-"/var/lib/openqa/backup"}"
date=$(date -Idate)
su - postgres -c "pg_dump -Fc openqa -f $backup_dir/$date.dump"
find $backup_dir/ -mtime +7 -print0 | xargs -0 rm -v
backup-vm:~ # grep openqa.suse.de /etc/rsnapshot.conf | grep backup_exec
backup_exec ssh root@openqa.suse.de "cd /tmp; sudo -u postgres pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"
Updated by dheidler 17 days ago
The -j option seems to do the opposite of what we want to achieve, and it is not even compatible with the output format we're using:
-j njobs
--jobs=njobs
Run the dump in parallel by dumping njobs tables
simultaneously. This option may reduce the time needed to
perform the dump but it also increases the load on the
database server. You can only use this option with the
directory output format because this is the only output
format where multiple processes can write their data at
the same time.
Updated by okurz 17 days ago
dheidler wrote in #note-21:
Why do we have both?
openqa:/etc/cron.daily # cat /etc/cron.daily/dump-openqa
#!/bin/bash
#exit 0 # avoid further data loss
backup_dir="${backup_dir:-"/var/lib/openqa/backup"}"
date=$(date -Idate)
su - postgres -c "pg_dump -Fc openqa -f $backup_dir/$date.dump"
find $backup_dir/ -mtime +7 -print0 | xargs -0 rm -v
backup-vm:~ # grep openqa.suse.de /etc/rsnapshot.conf | grep backup_exec
backup_exec ssh root@openqa.suse.de "cd /tmp; sudo -u postgres pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"
because we want to have daily snapshots while also storing remote backups. We have even more because we also trigger a database backup in https://gitlab.suse.de/openqa/osd-deployment/-/blob/master/.gitlab-ci.yml?ref_type=heads#L216
dheidler wrote in #note-22:
The -j option seems to do the opposite of what we want to achieve and also it is not even compatible with the output format we're using
but running in parallel might make the database very busy only for a potentially much shorter time so that overall operations are impacted less. If the current output format does not support that then consider changing the output format.
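A minimal sketch of what changing the output format could look like, assuming the dump is switched to the directory format. The helper only assembles and prints the command line; it does not run pg_dump. The function name parallel_dump_cmd, the job count and the target path are hypothetical, while -Fd, -j and -f are regular pg_dump options.

```shell
#!/bin/bash
# Sketch only: print a pg_dump command line that uses the directory output
# format (-Fd), the only format that accepts parallel jobs (-j).
# parallel_dump_cmd is a hypothetical helper, not part of any existing script.
parallel_dump_cmd() {
    local jobs=$1     # number of tables dumped simultaneously (assumed value)
    local outdir=$2   # with -Fd, -f names a directory instead of a single file
    printf 'pg_dump -Fd -j %s -f %s openqa\n' "$jobs" "$outdir"
}

# Example: show the command for 4 jobs without touching the database.
parallel_dump_cmd 4 "/var/lib/openqa/SQL-DUMPS/$(date -I).dir"
```

Whether a shorter but heavier dump really impacts OSD less than the current one would have to be measured. Anything that consumes the dumps, e.g. retention cleanup, would also need to handle directories instead of single files.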
Updated by dheidler 13 days ago
The bottleneck seems to be the CPU.
The backup can be sped up at the expense of size by disabling compression (default is gzip).
Note that another backup was running at the same time:
time sudo -u postgres ionice -c3 nice -n19 pg_dump --compress=0 -Fc openqa -f /var/lib/openqa/SQL-DUMPS/test.dump
real 4m11,817s
user 0m29,894s
sys 0m44,122s
-rw-r--r-- 1 postgres postgres 21G 17. Jun 16:13 /var/lib/openqa/SQL-DUMPS/test.dump
pg_dump --compress=0 -Fc openqa | lz4 | dd status=progress of=/var/lib/openqa/SQL-DUMPS/test.dump
6163796480 bytes (6.2 GB, 5.7 GiB) copied, 308 s, 20.0 MB/s
12055096+1 records in
12055096+1 records out
6172209629 bytes (6.2 GB, 5.7 GiB) copied, 308.394 s, 20.0 MB/s
Updated by okurz 12 days ago · Edited
After discussing I suggest the following
- In the daily cron script /etc/cron.daily/dump-openqa only run the backup if $name for the day does not already exist
- Put /etc/cron.daily/dump-openqa into salt as it is currently not salt-controlled
- In the daily cron script run the backup late in the day to give other scripts or pipelines a chance to already create the daily backup, so that we do not duplicate it
- Use a specific name for each pipeline and backup so that we don't have potential race conditions
- Use nice
- Block on the upgrade of OSD to Leap 15.6 (#157981), then upgrade PostgreSQL 15->16 so that we have a fully supported PostgreSQL 16, and then use pg_dump … --compress=METHOD[:DETAIL] for a more modern compression, e.g. lz4
- Consider flock /var/lib/openqa/backup/$name.lock pg_dump … in all three places: cron.daily, rsnapshot, and osd deployment and CI pipelines
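The suggestions above could be combined roughly as follows. This is a sketch under assumptions, not the deployed script: the function names run_dump and dump_once_per_day and the ".part" suffix are hypothetical, and it assumes that every backup trigger writes to the same directory so that the existence check is meaningful.

```shell
#!/bin/bash
# Sketch of a daily dump that skips work already done and serializes with
# other backup triggers. Names and defaults below are assumptions.

# Hypothetical wrapper around the existing pg_dump call, lowered in priority.
run_dump() {
    su - postgres -c "nice -n19 ionice -c3 pg_dump -Fc openqa -f '$1'"
}

dump_once_per_day() {
    local backup_dir=$1
    local name target rc=0
    name=$(date -Idate)
    target="$backup_dir/$name.dump"

    # Take the lock first so that two triggers cannot both pass the check below.
    exec 9>"$backup_dir/$name.lock"
    flock 9
    if [ -e "$target" ]; then
        echo "skip: $target already exists"
    else
        # Dump to a temporary name so an aborted run is not mistaken for a backup.
        run_dump "$target.part" && mv "$target.part" "$target" || rc=$?
    fi
    exec 9>&-   # closing the descriptor releases the lock
    return "$rc"
}

# Intended use from cron (not executed here):
#   dump_once_per_day "${backup_dir:-/var/lib/openqa/backup}"
```

Holding the lock around both the check and the dump is what prevents two triggers from starting the same backup at once. The temporary file name keeps a failed run from blocking the next attempt.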
Updated by okurz 10 days ago
- Related to action #162038: No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry added