action #159396


openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #108209: [epic] Reduce load on OSD

Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M

Added by livdywan 2 months ago. Updated 4 days ago.

Status: Workable
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

People repeatedly report problems with the responsiveness of OSD, especially during European lunch time. During such times our monitoring also shows HTTP unresponsiveness. Similar situations can be seen shortly after European midnight. One hypothesis is that the "pg_dump" triggered from https://gitlab.suse.de/qa-sle/backup-server-salt/-/blob/master/rsnapshot/rsnapshot.conf#L35 is slowing down operations, causing these symptoms. The schedule for this trigger is defined in https://gitlab.suse.de/qa-sle/backup-server-salt/-/blob/master/rsnapshot/init.sls#L28, so it is triggered every 4h, i.e. at 00:00 CET/CEST as well as 12:00 CET/CEST.

One related alert:
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=78&from=now-24h&to=now

B0=15.654104274

Notably the graph shows 15.7s at the time of the alert, i.e. 00:36; the second highest value is 3.7s a couple of hours earlier, and otherwise values stay around the 200ms mark (note the difference in units). There seems to be no obvious fallout and no other coinciding alerts.

Acceptance criteria

  • AC1: No HTTP unresponsiveness alerts or user reports linked to times when pg_dump was running
  • AC2: We still have frequent database backups for OSD openQA content so that at most 4h of work is lost

Suggestions


Related issues 4 (1 open, 3 closed)

Related to openQA Infrastructure - action #158059: OSD unresponsive or significantly slow for some minutes 2024-03-26 13:34Z (Resolved, okurz)

Related to openQA Infrastructure - action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S (Resolved, dheidler, 2024-04-26)

Related to openQA Project - action #130636: high response times on osd - Try nginx on OSD size:S (Resolved, mkittler, 2024-05-17)

Related to openQA Infrastructure - action #162038: No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry (Blocked, okurz, 2024-06-10)
Actions #1

Updated by livdywan 2 months ago

  • Assignee deleted (okurz)
  • Start date deleted (2024-02-12)
Actions #2

Updated by okurz 2 months ago

  • Related to action #158059: OSD unresponsive or significantly slow for some minutes 2024-03-26 13:34Z added
Actions #3

Updated by okurz 2 months ago

  • Subject changed from HTTP Response alert for /tests briefly going up to 15.7s size:M to HTTP Response alert for /tests briefly going up to 15.7s
  • Category set to Regressions/Crashes
  • Status changed from New to In Progress
  • Assignee set to okurz
  • Parent task set to #108209
Actions #4

Updated by okurz 2 months ago

  • Status changed from In Progress to Resolved

Same as in #158059 and many tickets before, this seems to be related to how Apache behaves together with openQA. What is seemingly different is that we previously had problems during "lunch time" while now it's midnight, where different user interaction and system processes can be observed. https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary shows that pg_dump was running before the outage. Maybe that is blocking quite a lot? We also run pg_dump in osd-deployment, so that may be something to look out for in the future. For now I don't think we can do more.

Actions #5

Updated by okurz 2 months ago

  • Subject changed from HTTP Response alert for /tests briefly going up to 15.7s to HTTP Response alert for /tests briefly going up to 15.7s - potential detrimental impact of pg_dump
  • Description updated (diff)
  • Status changed from Resolved to New
  • Assignee deleted (okurz)

Today during CEST lunch time another unresponsiveness was observed, and we stated the hypothesis that pg_dump might slow down overall performance. I recommend we look into some options. Description extended.

Actions #6

Updated by okurz 2 months ago

  • Subject changed from HTTP Response alert for /tests briefly going up to 15.7s - potential detrimental impact of pg_dump to Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s)
  • Description updated (diff)
Actions #7

Updated by okurz 2 months ago

  • Description updated (diff)
Actions #8

Updated by livdywan 2 months ago

  • Subject changed from Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) to Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M
  • Status changed from New to Workable
Actions #9

Updated by livdywan 2 months ago

  • Related to action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S added
Actions #10

Updated by okurz 2 months ago

  • Priority changed from Normal to High
Actions #11

Updated by mkittler about 2 months ago

  • Description updated (diff)
Actions #12

Updated by mkittler about 2 months ago

  • Description updated (diff)
Actions #13

Updated by okurz about 2 months ago

  • Related to action #130636: high response times on osd - Try nginx on OSD size:S added
Actions #14

Updated by okurz about 2 months ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

Blocking on #130636 first, as we don't have super stable metrics about this topic and #130636 is already working on an area that can impact HTTP responsiveness.

Actions #15

Updated by okurz about 2 months ago

  • Due date set to 2024-06-09
  • Status changed from Blocked to Feedback
  • Priority changed from High to Low

#130636 is done and we haven't seen the unresponsiveness as originally reported. Monitoring for longer.

Actions #16

Updated by jbaier_cz about 1 month ago

Could this alert https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1716147294259&to=1716175779373 be related?

Values
B0=6.840987906666666 
Labels
alertname   HTTP Response alert
grafana_folder  Salt
rule_uid    tm0h5mf4k

There is a visible decrease in CPU usage at that time and a quite big increase in inactive minion jobs.

There are also some interesting lines from the log during that time; not sure if they are related.

May 20 00:27:04 openqa telegraf[1347]: 2024-05-19T22:27:04Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while await>
May 20 00:27:04 openqa telegraf[1347]: 2024-05-19T22:27:04Z E! [agent] Error writing to outputs.influxdb: could not write any address
Actions #17

Updated by okurz about 1 month ago

  • Due date deleted (2024-06-09)
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

Yes, I assume that could be related, as pg_dump was likely running during the time the alert fired. So with that we can follow up on the original plan, as it seems the problem still reproduces.

Actions #18

Updated by okurz about 1 month ago

  • Priority changed from Low to Normal
Actions #19

Updated by dheidler 18 days ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #20

Updated by openqa_review 17 days ago

  • Due date set to 2024-06-27

Setting due date based on mean cycle time of SUSE QE Tools

Actions #21

Updated by dheidler 17 days ago

Why do we have both?

openqa:/etc/cron.daily # cat /etc/cron.daily/dump-openqa
#!/bin/bash

#exit 0 # avoid further data loss
backup_dir="${backup_dir:-"/var/lib/openqa/backup"}"
date=$(date -Idate)
su - postgres -c "pg_dump -Fc openqa -f $backup_dir/$date.dump"
find $backup_dir/ -mtime +7 -print0 | xargs -0 rm -v
backup-vm:~ # grep openqa.suse.de /etc/rsnapshot.conf | grep backup_exec
backup_exec ssh root@openqa.suse.de "cd /tmp; sudo -u postgres pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"
Actions #22

Updated by dheidler 17 days ago

The -j option seems to do the opposite of what we want to achieve, and it is not even compatible with the output format we're using:

       -j njobs
       --jobs=njobs
           Run the dump in parallel by dumping njobs tables
           simultaneously. This option may reduce the time needed to
           perform the dump but it also increases the load on the
           database server. You can only use this option with the
           directory output format because this is the only output
           format where multiple processes can write their data at
           the same time.
Actions #23

Updated by okurz 17 days ago

dheidler wrote in #note-21:

Why do we have both?

openqa:/etc/cron.daily # cat /etc/cron.daily/dump-openqa
#!/bin/bash

#exit 0 # avoid further data loss
backup_dir="${backup_dir:-"/var/lib/openqa/backup"}"
date=$(date -Idate)
su - postgres -c "pg_dump -Fc openqa -f $backup_dir/$date.dump"
find $backup_dir/ -mtime +7 -print0 | xargs -0 rm -v
backup-vm:~ # grep openqa.suse.de /etc/rsnapshot.conf | grep backup_exec
backup_exec   ssh root@openqa.suse.de "cd /tmp; sudo -u postgres pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"

because we want to have daily snapshots while also storing remote backups. We have even more because we also trigger a database backup in https://gitlab.suse.de/openqa/osd-deployment/-/blob/master/.gitlab-ci.yml?ref_type=heads#L216

dheidler wrote in #note-22:

The -j option seems to do the opposite of what we want to achieve and also it is not even compatible with the output format we're using

But running in parallel might make the database very busy for a potentially much shorter time, so that overall operations are impacted less. If the current output format does not support that, then consider changing the output format.
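
If the output format were changed as suggested, a parallel dump could look like this (a sketch only; the job count, paths and invocation are illustrative assumptions, not a tested configuration):

```shell
# -j requires the directory output format (-Fd); pg_dump rejects it
# together with -Fc. The target directory must not exist yet.
sudo -u postgres pg_dump -Fd -j 4 openqa -f "/var/lib/openqa/SQL-DUMPS/$(date -I).dumpdir"

# A directory-format dump is restored with pg_restore, which can
# parallelize as well:
#   sudo -u postgres pg_restore -j 4 -d openqa "/var/lib/openqa/SQL-DUMPS/$(date -I).dumpdir"
```

Note the trade-off okurz mentions: the parallel dump loads the database harder, but for a shorter window.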

Actions #24

Updated by dheidler 13 days ago

The bottleneck seems to be the CPU.
The backup can be sped up at the expense of size by disabling compression (the default is gzip).
Note that another backup was running at the same time:

time sudo -u postgres ionice -c3 nice -n19 pg_dump --compress=0 -Fc openqa -f /var/lib/openqa/SQL-DUMPS/test.dump

real    4m11,817s
user    0m29,894s
sys     0m44,122s

-rw-r--r-- 1 postgres postgres 21G 17. Jun 16:13 /var/lib/openqa/SQL-DUMPS/test.dump

pg_dump --compress=0 -Fc openqa | lz4 | dd status=progress of=/var/lib/openqa/SQL-DUMPS/test.dump
6163796480 bytes (6.2 GB, 5.7 GiB) copied, 308 s, 20.0 MB/s
12055096+1 records in
12055096+1 records out
6172209629 bytes (6.2 GB, 5.7 GiB) copied, 308.394 s, 20.0 MB/s
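
For reference, a dump produced through the lz4 pipe above would have to be decompressed before restoring (a sketch; path and database name taken from the experiment above, not a tested command):

```shell
# lz4 -dc streams the decompressed custom-format dump into pg_restore,
# which can read it from stdin (non-parallel restore only)
lz4 -dc /var/lib/openqa/SQL-DUMPS/test.dump | sudo -u postgres pg_restore -d openqa
```
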
Actions #25

Updated by okurz 12 days ago · Edited

After discussing, I suggest the following:

  1. In the daily cron script /etc/cron.daily/dump-openqa, only run the backup if the dump for the day ($name) does not already exist
  2. Put /etc/cron.daily/dump-openqa into salt, as it is currently not salt-controlled
  3. Run the backup in the daily cron script late in the day, to give other scripts or pipelines a chance to create the daily backup first and avoid duplication
  4. Use a specific name for each pipeline and backup so that we don't have potential race conditions
  5. Use nice
  6. Block on the upgrade of OSD to Leap 15.6 (#157981), then upgrade PostgreSQL 15->16 so that we have a fully supported PostgreSQL 16, and then use `pg_dump … --compress=METHOD[:DETAIL]` to use a more modern compression, e.g. lz4
  7. Consider `flock /var/lib/openqa/backup/$name.lock pg_dump …` in all three places: cron.daily, rsnapshot and the osd-deployment CI pipeline
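
Suggestions 1, 5 and 7 above could be sketched as a small wrapper (a sketch under the assumption that `daily_dump`, its arguments and the `touch` stub are illustrative, not the actual salt-managed script):

```shell
#!/bin/bash
set -u

# daily_dump DIR CMD...: run CMD with today's dump path appended, but only
# if that dump does not exist yet, serialized via flock and deprioritized
# via nice. All names here are illustrative assumptions.
daily_dump() {
    local backup_dir="$1"; shift
    local out="$backup_dir/$(date -Idate).dump"
    if [ -e "$out" ]; then
        echo "already dumped: $out"        # suggestion 1: don't duplicate
        return 0
    fi
    mkdir -p "$backup_dir"
    # suggestion 7: one dump at a time across cron, rsnapshot and CI
    # suggestion 5: nice (dheidler's experiment additionally used ionice -c3)
    flock -n "$out.lock" nice -n 19 "$@" "$out"
}

# Demo with a stub; real use might look like (illustrative):
#   daily_dump /var/lib/openqa/backup sudo -u postgres pg_dump -Fc openqa -f
demo_dir=$(mktemp -d)
daily_dump "$demo_dir" touch    # first run creates today's dump
daily_dump "$demo_dir" touch    # second run is a no-op
```

The `flock -n` makes a concurrent second invocation fail fast instead of queueing up behind a long-running dump.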
Actions #26

Updated by okurz 10 days ago

  • Related to action #162038: No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry added
Actions #27

Updated by okurz 10 days ago

  • Priority changed from Normal to High

#162038 again shows symptoms of this. We need to apply a higher prio.

Actions #28

Updated by livdywan 9 days ago

  • Status changed from In Progress to Workable
Actions #29

Updated by livdywan 4 days ago

  • Due date deleted (2024-06-27)
