action #159396

closed

openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project (public) - coordination #108209: [epic] Reduce load on OSD

Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M

Added by livdywan 8 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Start date:
Due date:
% Done:

0%

Estimated time:

Description

Motivation

People repeatedly report problems with the responsiveness of OSD, especially during European lunch time. During such times our monitoring also shows HTTP unresponsiveness. Similar situations can be seen shortly after European midnight. One hypothesis is that "pg_dump" triggered from https://gitlab.suse.de/qa-sle/backup-server-salt/-/blob/master/rsnapshot/rsnapshot.conf#L35 is slowing down operations and causing such symptoms. The schedule for this trigger is defined in https://gitlab.suse.de/qa-sle/backup-server-salt/-/blob/master/rsnapshot/init.sls#L28 and runs every 4h, which includes 00:00 CET/CEST as well as 12:00 CET/CEST.

One related alert:
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=78&from=now-24h&to=now

B0=15.654104274

Notably the graph shows 15.7s at the time of the alert, i.e. 00:36; the second highest value is 3.7s a couple of hours earlier, and otherwise values stay around the 200ms mark (note the difference in units). There seems to be no obvious fallout or other coinciding alerts.

Acceptance criteria

  • AC1: No HTTP unresponsiveness alerts or user reports linked to times when pg_dump was running
  • AC2: We still have frequent database backups of OSD openQA content so that at most 4h of work is lost

Suggestions


Related issues 5 (0 open, 5 closed)

Related to openQA Infrastructure (public) - action #158059: OSD unresponsive or significantly slow for some minutes 2024-03-26 13:34Z - Resolved - okurz

Related to openQA Infrastructure (public) - action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S - Resolved - dheidler - 2024-04-26

Related to openQA Project (public) - action #130636: high response times on osd - Try nginx on OSD size:S - Resolved - mkittler

Related to openQA Infrastructure (public) - action #162038: No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry size:S - Resolved - nicksinger - 2024-06-10

Related to openQA Infrastructure (public) - action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M - Resolved - okurz - 2024-07-10
Actions #1

Updated by livdywan 8 months ago

  • Assignee deleted (okurz)
  • Start date deleted (2024-02-12)
Actions #2

Updated by okurz 8 months ago

  • Related to action #158059: OSD unresponsive or significantly slow for some minutes 2024-03-26 13:34Z added
Actions #3

Updated by okurz 8 months ago

  • Subject changed from HTTP Response alert for /tests briefly going up to 15.7s size:M to HTTP Response alert for /tests briefly going up to 15.7s
  • Category set to Regressions/Crashes
  • Status changed from New to In Progress
  • Assignee set to okurz
  • Parent task set to #108209
Actions #4

Updated by okurz 8 months ago

  • Status changed from In Progress to Resolved

Same as in #158059 and many tickets before, this seems to be related to how Apache behaves together with openQA. What is seemingly different is that previously we had problems during "lunch time" and now it is midnight, where different user interaction and system processes can be observed. https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary shows that pg_dump was running before the outage. Maybe that is blocking quite a lot? We also run pg_dump in osd-deployment, so maybe something to look out for in the future. For now I don't think we can do more.

Actions #5

Updated by okurz 8 months ago

  • Subject changed from HTTP Response alert for /tests briefly going up to 15.7s to HTTP Response alert for /tests briefly going up to 15.7s - potential detrimental impact of pg_dump
  • Description updated (diff)
  • Status changed from Resolved to New
  • Assignee deleted (okurz)

Today during CEST lunch time another period of unresponsiveness was observed and we stated the hypothesis that pg_dump might slow down overall performance. I recommend we look into some options. Description extended.

Actions #6

Updated by okurz 8 months ago

  • Subject changed from HTTP Response alert for /tests briefly going up to 15.7s - potential detrimental impact of pg_dump to Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s)
  • Description updated (diff)
Actions #7

Updated by okurz 8 months ago

  • Description updated (diff)
Actions #8

Updated by livdywan 8 months ago

  • Subject changed from Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) to Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M
  • Status changed from New to Workable
Actions #9

Updated by livdywan 8 months ago

  • Related to action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:S added
Actions #10

Updated by okurz 8 months ago

  • Priority changed from Normal to High
Actions #11

Updated by mkittler 8 months ago

  • Description updated (diff)
Actions #12

Updated by mkittler 8 months ago

  • Description updated (diff)
Actions #13

Updated by okurz 8 months ago

  • Related to action #130636: high response times on osd - Try nginx on OSD size:S added
Actions #14

Updated by okurz 8 months ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

Blocking on #130636 first as we don't have very stable metrics about this topic and #130636 is already working on an area that can impact HTTP responsiveness.

Actions #15

Updated by okurz 7 months ago

  • Due date set to 2024-06-09
  • Status changed from Blocked to Feedback
  • Priority changed from High to Low

#130636 is done and we haven't seen the unresponsiveness as originally reported. Monitoring for longer.

Actions #16

Updated by jbaier_cz 7 months ago

Could this alert https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1716147294259&to=1716175779373 be related?

Values
B0=6.840987906666666 
Labels
alertname   HTTP Response alert
grafana_folder  Salt
rule_uid    tm0h5mf4k

There is a visible decrease in CPU usage at that time and a quite big increase in inactive minion jobs.

Also some interesting lines from the log during that time; not sure if related:

May 20 00:27:04 openqa telegraf[1347]: 2024-05-19T22:27:04Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:8086/write?db=telegraf": context deadline exceeded (Client.Timeout exceeded while await>
May 20 00:27:04 openqa telegraf[1347]: 2024-05-19T22:27:04Z E! [agent] Error writing to outputs.influxdb: could not write any address
Actions #17

Updated by okurz 7 months ago

  • Due date deleted (2024-06-09)
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

Yes, I assume that could be related, as pg_dump was likely running during the time the alert fired. So with that we can follow up with the original plan, as it seems the problem still reproduces.

Actions #18

Updated by okurz 7 months ago

  • Priority changed from Low to Normal
Actions #19

Updated by dheidler 6 months ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #20

Updated by openqa_review 6 months ago

  • Due date set to 2024-06-27

Setting due date based on mean cycle time of SUSE QE Tools

Actions #21

Updated by dheidler 6 months ago

Why do we have both?

openqa:/etc/cron.daily # cat /etc/cron.daily/dump-openqa
#!/bin/bash

#exit 0 # avoid further data loss
backup_dir="${backup_dir:-"/var/lib/openqa/backup"}"
date=$(date -Idate)
su - postgres -c "pg_dump -Fc openqa -f $backup_dir/$date.dump"
find $backup_dir/ -mtime +7 -print0 | xargs -0 rm -v
backup-vm:~ # grep openqa.suse.de /etc/rsnapshot.conf | grep backup_exec
backup_exec ssh root@openqa.suse.de "cd /tmp; sudo -u postgres pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"
Actions #22

Updated by dheidler 6 months ago

The -j option seems to do the opposite of what we want to achieve, and it is not even compatible with the output format we're using:

       -j njobs
       --jobs=njobs
           Run the dump in parallel by dumping njobs tables
           simultaneously. This option may reduce the time needed to
           perform the dump but it also increases the load on the
           database server. You can only use this option with the
           directory output format because this is the only output
           format where multiple processes can write their data at
           the same time.
Actions #23

Updated by okurz 6 months ago

dheidler wrote in #note-21:

Why do we have both?

openqa:/etc/cron.daily # cat /etc/cron.daily/dump-openqa
#!/bin/bash

#exit 0 # avoid further data loss
backup_dir="${backup_dir:-"/var/lib/openqa/backup"}"
date=$(date -Idate)
su - postgres -c "pg_dump -Fc openqa -f $backup_dir/$date.dump"
find $backup_dir/ -mtime +7 -print0 | xargs -0 rm -v
backup-vm:~ # grep openqa.suse.de /etc/rsnapshot.conf | grep backup_exec
backup_exec   ssh root@openqa.suse.de "cd /tmp; sudo -u postgres pg_dump -Fc openqa -f /var/lib/openqa/SQL-DUMPS/$(date -I).dump"

Because we want to have daily snapshots while also storing remote backups. We have even more, because we also trigger a database backup in https://gitlab.suse.de/openqa/osd-deployment/-/blob/master/.gitlab-ci.yml?ref_type=heads#L216

dheidler wrote in #note-22:

The -j option seems to do the opposite of what we want to achieve and also it is not even compatible with the output format we're using

But running in parallel might make the database very busy for a potentially much shorter time, so that overall operations are impacted less. If the current output format does not support that, then consider changing the output format.
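
For illustration, a parallel dump with the directory output format could look roughly like this; a minimal sketch with a hypothetical target directory, not a tested command for OSD:

# Sketch only: -j requires the directory format (-Fd); the target
# directory below is a hypothetical example path
sudo -u postgres pg_dump -Fd -j 4 -f "/var/lib/openqa/SQL-DUMPS/openqa-$(date -I)" openqa
# a later restore could then also run in parallel:
# pg_restore -j 4 -d openqa "/var/lib/openqa/SQL-DUMPS/openqa-$(date -I)"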

Actions #24

Updated by dheidler 6 months ago

The bottleneck seems to be the CPU.
The backup can be sped up at the expense of size by disabling compression (default is gzip).
Note that another backup was running at the same time:

time sudo -u postgres ionice -c3 nice -n19 pg_dump --compress=0 -Fc openqa -f /var/lib/openqa/SQL-DUMPS/test.dump

real    4m11,817s
user    0m29,894s
sys     0m44,122s

-rw-r--r-- 1 postgres postgres 21G 17. Jun 16:13 /var/lib/openqa/SQL-DUMPS/test.dump

pg_dump --compress=0 -Fc openqa | lz4 | dd status=progress of=/var/lib/openqa/SQL-DUMPS/test.dump
6163796480 bytes (6.2 GB, 5.7 GiB) copied, 308 s, 20.0 MB/s
12055096+1 records in
12055096+1 records out
6172209629 bytes (6.2 GB, 5.7 GiB) copied, 308.394 s, 20.0 MB/s
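
Restoring such an lz4-piped dump would need the reverse pipeline; a minimal sketch assuming the test.dump file from above (which contains an lz4 stream despite its name):

# Sketch only: decompress the lz4 stream and feed the custom-format
# archive to pg_restore via stdin; parallel restore (-j) is not possible
# when reading from a pipe
lz4 -dc /var/lib/openqa/SQL-DUMPS/test.dump | sudo -u postgres pg_restore -d openqa
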
Actions #25

Updated by okurz 6 months ago · Edited

After discussing, I suggest the following:

  1. DONE In the daily cron script /etc/cron.daily/dump-openqa only run the backup if $name for the day does not already exist -> https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1220
  2. DONE Put /etc/cron.daily/dump-openqa into salt as it is currently not salt-controlled -> https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1220
  3. DONE In the daily cron script run the backup late in the day to give other scripts or pipelines a chance to already create the daily backup and avoid duplicates -> https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1220
  4. DONE Use a specific name for each pipeline and backup so that we don't have potential race conditions -> https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/58
  5. DONE Use nice -> https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/58, https://gitlab.suse.de/qa-sle/backup-server-salt/-/merge_requests/19, https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1220
  6. Block on the upgrade of OSD to Leap 15.6 #157981, then upgrade PostgreSQL 15->16 so that we have a fully supported PostgreSQL 16 and can use pg_dump … --compress=METHOD[:DETAIL] for a more modern compression method, e.g. lz4
  7. Consider flock /var/lib/openqa/backup/$name.lock pg_dump … in all three places: cron.daily, rsnapshot and the osd-deployment CI pipeline (see the sketch below)
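
For item 7 the flock wrapper could look roughly like this; a minimal sketch assuming the lock file path from the suggestion above, not the final salt-controlled implementation:

# Sketch only: serialize dump attempts from cron.daily, rsnapshot and the
# osd-deployment pipeline via a shared lock file; $name (here the date) is
# a placeholder taken from the suggestion above
name=$(date -I)
flock --nonblock "/var/lib/openqa/backup/$name.lock" \
    sudo -u postgres nice -n 19 ionice -c3 \
    pg_dump -Fc openqa -f "/var/lib/openqa/backup/$name.dump" \
    || echo "backup for $name is already being created elsewhere, skipping"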
Actions #26

Updated by okurz 6 months ago

  • Related to action #162038: No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry size:S added
Actions #27

Updated by okurz 6 months ago

  • Priority changed from Normal to High

#162038 again shows symptoms of this. We need to apply a higher prio.

Actions #28

Updated by livdywan 6 months ago

  • Status changed from In Progress to Workable
Actions #29

Updated by livdywan 6 months ago

  • Due date deleted (2024-06-27)
Actions #30

Updated by dheidler 6 months ago

  • Status changed from Workable to In Progress
Actions #33

Updated by dheidler 6 months ago

  • Status changed from In Progress to Feedback
Actions #35

Updated by okurz 6 months ago

  • Due date set to 2024-07-16

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1220 merged and subsequent fix as well. /etc/cron.d/dump-openqa was deployed successfully.

Please verify both ACs over the next day(s)

Actions #36

Updated by okurz 6 months ago

Regarding preventing concurrent runs of the OSD deployment pipeline, take a look at https://docs.gitlab.com/ee/ci/resource_groups/ and https://gitlab.suse.de/qa-maintenance/bot-ng/-/blob/master/.gitlab-ci.yml?ref_type=heads#L72 where we already use that.
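
In the .gitlab-ci.yml this boils down to giving the jobs that must not overlap the same resource_group; a minimal sketch with a hypothetical job name, not the actual osd-deployment pipeline:

# Sketch only: jobs sharing a resource_group value never run concurrently;
# "openqa" mirrors the group name already used in bot-ng
deploy-osd:
  stage: deploy
  resource_group: openqa
  script:
    - echo "run the OSD deployment here"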

Actions #38

Updated by dheidler 6 months ago

hm - even after that PR merged, I still get no resource groups for that project:

curl --header "PRIVATE-TOKEN: XXXXX" "https://gitlab.suse.de/api/v4/projects/3731/resource_groups"
[]

Without setting the process_mode to oldest_first we can pretty much still run into very odd issues.
Any ideas? The documentation doesn't mention any steps on creating resource_groups.

Actions #39

Updated by dheidler 6 months ago

Hm - the bot-ng repo does have them:

$ curl -s --header "PRIVATE-TOKEN: XXXX" "https://gitlab.suse.de/api/v4/projects/6096/resource_groups" | jq
[
  {
    "id": 56,
    "key": "smelt",
    "process_mode": "unordered",
    "created_at": "2022-03-17T17:42:55.413Z",
    "updated_at": "2022-03-17T17:42:55.413Z"
  },
  {
    "id": 57,
    "key": "ibs",
    "process_mode": "unordered",
    "created_at": "2022-03-21T09:47:32.296Z",
    "updated_at": "2022-03-21T09:47:32.296Z"
  },
  {
    "id": 58,
    "key": "openqa",
    "process_mode": "unordered",
    "created_at": "2022-03-21T09:51:45.667Z",
    "updated_at": "2022-03-21T09:51:45.667Z"
  },
  {
    "id": 285,
    "key": "amqp",
    "process_mode": "unordered",
    "created_at": "2024-04-16T12:21:04.222Z",
    "updated_at": "2024-04-16T12:21:04.222Z"
  }
]
Actions #40

Updated by dheidler 6 months ago

Overnight the resource_group appeared and I could modify it:

$ curl -s --request PUT --data "process_mode=oldest_first" --header "PRIVATE-TOKEN: XXXX" "https://gitlab.suse.de/api/v4/projects/3731/resource_groups/openqa" | jq
{
  "id": 307,
  "key": "openqa",
  "process_mode": "oldest_first",
  "created_at": "2024-07-04T05:03:08.226Z",
  "updated_at": "2024-07-04T09:18:34.149Z"
}
Actions #41

Updated by dheidler 6 months ago

  • Status changed from Feedback to Resolved
Actions #42

Updated by okurz 6 months ago

  • Status changed from Resolved to Feedback

Please ensure that we have consistent database backups for at least two days as well as in osd-deployment.

Actions #43

Updated by dheidler 6 months ago

  • Status changed from Feedback to Resolved

Already did:

dheidler@openqa:/var/lib/openqa/backup> l
insgesamt 33008900
drwxrwxrwx  2 geekotest nogroup         323  4. Jul 07:05 ./
drwxr-xr-x 13 root      root           4096  4. Jul 07:16 ../
-rw-r--r--  1 postgres  postgres 3420044822 26. Jun 20:14 2024-06-26.dump
-rw-r--r--  1 postgres  postgres 3400441812 27. Jun 20:14 2024-06-27.dump
-rw-r--r--  1 postgres  postgres 3388550554 28. Jun 20:14 2024-06-28.dump
-rw-r--r--  1 postgres  postgres 3370466054 29. Jun 20:13 2024-06-29.dump
-rw-r--r--  1 postgres  postgres 3355850742 30. Jun 20:13 2024-06-30.dump
-rw-r--r--  1 postgres  postgres 3380505129  1. Jul 20:13 2024-07-01.dump
-rw-r--r--  1 postgres  postgres 3365370546  2. Jul 20:13 2024-07-02.dump
-rw-r--r--  1 postgres  postgres 3402784114  3. Jul 20:14 2024-07-03.dump
-rw-r--r--  1 postgres  postgres 3359971612  4. Jul 12:14 2024-07-04.dump
-rw-r--r--  1 postgres  postgres 3357114125  4. Jul 07:15 osd-pre-deployment.dump
-rw-r--r--  1 postgres  postgres          0  4. Jul 07:05 osd-pre-deployment.lock
Actions #44

Updated by dheidler 6 months ago

  • Status changed from Resolved to Feedback
Actions #45

Updated by dheidler 6 months ago

  • Status changed from Feedback to Resolved

Files on osd still looking fine.
PR merged.

Actions #46

Updated by okurz 5 months ago

  • Related to action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M added
Actions #47

Updated by okurz 5 months ago

  • Due date deleted (2024-07-16)