action #159396
closedopenQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project - coordination #108209: [epic] Reduce load on OSD
Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M
0%
Description
Motivation¶
Repeatedly people report problems with responsiveness of OSD, especially during European lunch time. During such times our monitoring also shows HTTP unresponsiveness. Similar situations can be seen shortly after the European midnight. One hypothesis is that "pg_dump" triggered from https://gitlab.suse.de/qa-sle/backup-server-salt/-/blob/master/rsnapshot/rsnapshot.conf#L35 is slowing down the operations causing such symptoms. The schedule for this trigger is defined in https://gitlab.suse.de/qa-sle/backup-server-salt/-/blob/master/rsnapshot/init.sls#L28 so triggered every 4h meaning 00:00 CET/CEST just as well as 12:00 CET/CEST.
One related alert
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=78&from=now-24h&to=now
B0=15.654104274
Notably the graph says 15.7 at the time of the alert e.g. 00:36, the second highest is 3.7s a couple of hours earlier and otherwise around the 200ms mark (note the difference in units). There seems to be no obvious fallout or other alerts coinciding?
Acceptance criteria¶
- AC1: No HTTP unresponsiveness alerts or user reports linked to times when pg_dump was running
- AC2: We still have frequent database backups for OSD openQA content so that not more work than 4h maximum is lost
Suggestions¶
- DONE Investigate what jobs were running around the time of the alert -> pg_dump
- DONE Check if relevant changes e.g. changes to salt/worker slots had an impact here -> unlikely as we see those issues also during other instances
- Regarding pg_dump keep in mind that https://github.com/os-autoinst/sync-and-trigger/blob/main/dump-psql#L4 is used for o3 but OSD uses a custom call
- pg_dump might slow down overall performance. Consider options from
- https://serverfault.com/questions/349221/how-to-make-pg-dump-less-resource-greedy
- https://www.postgresql.org/docs/current/performance-tips.html
- https://www.linkedin.com/pulse/how-speed-up-pgdump-when-dumping-large-postgres-nikolay-samokhvalov/
- https://www.postgresql.org/docs/current/continuous-archiving.html
- try out in particular the
-j
option on pg_dump
- Consider if the datacenter migration is the underlying root cause impacting I/O performance on OSD and/or between the backup VM and OSD.
Updated by okurz 7 months ago
- Related to action #158059: OSD unresponsive or significantly slow for some minutes 2024-03-26 13:34Z added
Updated by okurz 7 months ago
- Status changed from In Progress to Resolved
Same as in #158059 and many tickets in before this seems to be related to how apache behaves together with openQA. What is seemingly different is that we had problems during "lunch time" and now it's midnight where different user interaction and system processes can be observed. https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary shows that before the outage pg_dump was running. Maybe that is blocking quite much? We also run pg_dump in osd-deployment so maybe something to look out in the future. For now I don't think we can do more.
Updated by okurz 7 months ago
- Subject changed from HTTP Response alert for /tests briefly going up to 15.7s to HTTP Response alert for /tests briefly going up to 15.7s - potential detrimental impact of pg_dump
- Description updated (diff)
- Status changed from Resolved to New
- Assignee deleted (
okurz)
today during CEST lunch time another unresponsiveness was observed and we stated the hypothesis that pg_dump might slow down overall performance. I recommend we look into some options. Description extended.
Updated by okurz 7 months ago
- Subject changed from HTTP Response alert for /tests briefly going up to 15.7s - potential detrimental impact of pg_dump to Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s)
- Description updated (diff)
Updated by livdywan 7 months ago
- Subject changed from Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) to Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M
- Status changed from New to Workable
Updated by livdywan 7 months ago
- Related to action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S added
Updated by okurz 6 months ago
- Related to action #130636: high response times on osd - Try nginx on OSD size:S added
Updated by jbaier_cz 6 months ago
Could this alert https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1716147294259&to=1716175779373 be related?
Values
B0=6.840987906666666
Labels
alertname HTTP Response alert
grafana_folder Salt
rule_uid tm0h5mf4k
There is a visible decrease in CPU usage at that time and quite big increase in inactive minion jobs.
Also some interesting lines from log during that time, not sure if related.
May 20 00:27:04 openqa telegraf[1347]: 2024-05-19T22:27:04Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while await>
May 20 00:27:04 openqa telegraf[1347]: 2024-05-19T22:27:04Z E! [agent] Error writing to outputs.influxdb: could not write any address
Updated by okurz 6 months ago
- Due date deleted (
2024-06-09) - Status changed from Feedback to Workable
- Assignee deleted (
okurz)
yes, I assume that could be related as during the time the alert fired the pg_dump was likely running. So with that we can follow up with the original plan as it seems the problem still reproduces.
Updated by openqa_review 5 months ago
- Due date set to 2024-06-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by dheidler 5 months ago
Why do we have both?
openqa:/etc/cron.daily # cat /etc/cron.daily/dump-openqa
#!/bin/bash
#exit 0 # avoid further data loss
backup_dir="${backup_dir:-"/var/lib/openqa/backup"}"
date=$(date -Idate)
su - postgres -c "pg_dump -Fc openqa -f $backup_dir/$date.dump"
find $backup_dir/ -mtime +7 -print0 | xargs -0 rm -v
backup-vm:~ # grep openqa.suse.de /etc/rsnapshot.conf | grep backup_exec
backup_exec ssh root@openqa.suse.de "cd /tmp; sudo -u postgres pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"
Updated by dheidler 5 months ago
The -j
option seems to do the opposite of what we want to achieve and also it is not even compatible with the output format we're using:
-j njobs
--jobs=njobs
Run the dump in parallel by dumping njobs tables
simultaneously. This option may reduce the time needed to
perform the dump but it also increases the load on the
database server. You can only use this option with the
directory output format because this is the only output
format where multiple processes can write their data at
the same time.
Updated by okurz 5 months ago
dheidler wrote in #note-21:
Why do we have both?
openqa:/etc/cron.daily # cat /etc/cron.daily/dump-openqa #!/bin/bash #exit 0 # avoid further data loss backup_dir="${backup_dir:-"/var/lib/openqa/backup"}" date=$(date -Idate) su - postgres -c "pg_dump -Fc openqa -f $backup_dir/$date.dump" find $backup_dir/ -mtime +7 -print0 | xargs -0 rm -v
backup-vm:~ # grep openqa.suse.de /etc/rsnapshot.conf | grep backup_exec backup_exec ssh root@openqa.suse.de "cd /tmp; sudo -u postgres pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"
because we want to have daily snapshots while also storing remote backups. We have even more because we also trigger a database backup in https://gitlab.suse.de/openqa/osd-deployment/-/blob/master/.gitlab-ci.yml?ref_type=heads#L216
dheidler wrote in #note-22:
The
-j
option seems to do the opposite of what we want to achieve and also it is not even compatible with the output format we're using
but running in parallel might make the database very busy only for a potentially much shorter time so that overall operations are impacted less. If the current output format does not support that then consider changing the output format.
Updated by dheidler 5 months ago
The bottleneck seems to be the CPU
Backup can be speed up at the expense of size, by disabling compression (default is gzip).
Note that another backup was running at the same time:
time sudo -u postgres ionice -c3 nice -n19 pg_dump --compress=0 -Fc openqa -f /var/lib/openqa/SQL-DUMPS/test.dump
real 4m11,817s
user 0m29,894s
sys 0m44,122s
-rw-r--r-- 1 postgres postgres 21G 17. Jun 16:13 /var/lib/openqa/SQL-DUMPS/test.dump
pg_dump --compress=0 -Fc openqa | lz4 | dd status=progress of=/var/lib/openqa/SQL-DUMPS/test.dump
6163796480 bytes (6.2 GB, 5.7 GiB) copied, 308 s, 20.0 MB/s
12055096+1 records in
12055096+1 records out
6172209629 bytes (6.2 GB, 5.7 GiB) copied, 308.394 s, 20.0 MB/s
Updated by okurz 5 months ago · Edited
After discussing I suggest the following
- DONE In the daily cron script /etc/cron.daily/dump-openqa only run the backup if $name for the day does not already exist -> https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1220
- DONE Put /etc/cron.daily/dump-openqa into salt as it is currently not salt-controlled -> https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1220
- DONE In the daily cron script run the backup late at the time to give other scripts or pipelines a chance to already create the daily backup and not duplicate -> https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1220
- DONE Use a specific name for each pipeline and backup so that we don't have potential race conditions -> https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/58
- DONE Use
nice
-> https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/58, https://gitlab.suse.de/qa-sle/backup-server-salt/-/merge_requests/19, https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1220 - block on upgrade of OSD to Leap 15.6 #157981, then upgrade PostgreSQL 15->16 so that we have a fully supported PostgreSQL 16 and then use pg_dump … --compress=METHOD[:DETAIL]` to use a more modern compression, e.g. lz4
- Consider
flock /var/lib/openqa/backup/$name.lock pg_dump …
in all three places cron.daily, rsnapshot and osd deployment and CI pipelines
Updated by okurz 5 months ago
- Related to action #162038: No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry size:S added
Updated by okurz 5 months ago
- Due date set to 2024-07-16
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1220 merged and subsequent fix as well. /etc/cron.d/dump-openqa was deployed successfully.
Please verify both ACs over the next day(s)
Updated by okurz 5 months ago
Regarding preventing concurrent runs of the OSD deployment pipeline take a look into https://docs.gitlab.com/ee/ci/resource_groups/ and https://gitlab.suse.de/qa-maintenance/bot-ng/-/blob/master/.gitlab-ci.yml?ref_type=heads#L72 where we already use that
Updated by dheidler 5 months ago
hm - even after that PR merged, I still get no resource groups for that project:
curl --header "PRIVATE-TOKEN: XXXXX" "https://gitlab.suse.de/api/v4/projects/3731/resource_groups"
[]
Without setting the process_mode
to oldest_first
we can pretty much still run into very odd issues.
Any ideas? The documentation doesn't mention any steps on creating resource_groups.
Updated by dheidler 5 months ago
Hm - the bot-ng repo does have them:
$ curl -s --header "PRIVATE-TOKEN: XXXX" "https://gitlab.suse.de/api/v4/projects/6096/resource_groups" | jq
[
{
"id": 56,
"key": "smelt",
"process_mode": "unordered",
"created_at": "2022-03-17T17:42:55.413Z",
"updated_at": "2022-03-17T17:42:55.413Z"
},
{
"id": 57,
"key": "ibs",
"process_mode": "unordered",
"created_at": "2022-03-21T09:47:32.296Z",
"updated_at": "2022-03-21T09:47:32.296Z"
},
{
"id": 58,
"key": "openqa",
"process_mode": "unordered",
"created_at": "2022-03-21T09:51:45.667Z",
"updated_at": "2022-03-21T09:51:45.667Z"
},
{
"id": 285,
"key": "amqp",
"process_mode": "unordered",
"created_at": "2024-04-16T12:21:04.222Z",
"updated_at": "2024-04-16T12:21:04.222Z"
}
]
Updated by dheidler 5 months ago
Over the night the resource_group appeared and I could modify it:
$ curl -s --request PUT --data "process_mode=oldest_first" --header "PRIVATE-TOKEN: XXXX" "https://gitlab.suse.de/api/v4/projects/3731/resource_groups/openqa" | jq
{
"id": 307,
"key": "openqa",
"process_mode": "oldest_first",
"created_at": "2024-07-04T05:03:08.226Z",
"updated_at": "2024-07-04T09:18:34.149Z"
}
Updated by dheidler 5 months ago
- Status changed from Feedback to Resolved
Already did:
dheidler@openqa:/var/lib/openqa/backup> l
insgesamt 33008900
drwxrwxrwx 2 geekotest nogroup 323 4. Jul 07:05 ./
drwxr-xr-x 13 root root 4096 4. Jul 07:16 ../
-rw-r--r-- 1 postgres postgres 3420044822 26. Jun 20:14 2024-06-26.dump
-rw-r--r-- 1 postgres postgres 3400441812 27. Jun 20:14 2024-06-27.dump
-rw-r--r-- 1 postgres postgres 3388550554 28. Jun 20:14 2024-06-28.dump
-rw-r--r-- 1 postgres postgres 3370466054 29. Jun 20:13 2024-06-29.dump
-rw-r--r-- 1 postgres postgres 3355850742 30. Jun 20:13 2024-06-30.dump
-rw-r--r-- 1 postgres postgres 3380505129 1. Jul 20:13 2024-07-01.dump
-rw-r--r-- 1 postgres postgres 3365370546 2. Jul 20:13 2024-07-02.dump
-rw-r--r-- 1 postgres postgres 3402784114 3. Jul 20:14 2024-07-03.dump
-rw-r--r-- 1 postgres postgres 3359971612 4. Jul 12:14 2024-07-04.dump
-rw-r--r-- 1 postgres postgres 3357114125 4. Jul 07:15 osd-pre-deployment.dump
-rw-r--r-- 1 postgres postgres 0 4. Jul 07:05 osd-pre-deployment.lock
Updated by dheidler 5 months ago
- Status changed from Resolved to Feedback
Added some documentation on resource groups: https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/61/diffs
Updated by okurz 4 months ago
- Related to action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M added