action #159396
openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project - coordination #108209: [epic] Reduce load on OSD
Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M
Description
Motivation
People repeatedly report problems with the responsiveness of OSD, especially during European lunch time. During such times our monitoring also shows HTTP unresponsiveness. Similar situations can be seen shortly after European midnight. One hypothesis is that the "pg_dump" triggered from https://gitlab.suse.de/qa-sle/backup-server-salt/-/blob/master/rsnapshot/rsnapshot.conf#L35 is slowing down operations and causing these symptoms. The schedule for this trigger is defined in https://gitlab.suse.de/qa-sle/backup-server-salt/-/blob/master/rsnapshot/init.sls#L28, i.e. it runs every 4h, which includes 00:00 CET/CEST as well as 12:00 CET/CEST.
One related alert:
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=78&from=now-24h&to=now
B0=15.654104274
Notably, the graph shows 15.7 s at the time of the alert (00:36); the second-highest value is 3.7 s a couple of hours earlier, and otherwise the response time stays around the 200 ms mark (note the difference in units). There seems to be no obvious fallout and no other alerts coinciding.
Acceptance criteria
- AC1: No HTTP unresponsiveness alerts or user reports linked to times when pg_dump was running
- AC2: We still have frequent database backups of the OSD openQA content so that at most 4h of work is lost
Suggestions
- DONE Investigate what jobs were running around the time of the alert -> pg_dump
- DONE Check if relevant changes, e.g. changes to salt/worker slots, had an impact here -> unlikely, as we see these issues also at other times
- Regarding pg_dump keep in mind that https://github.com/os-autoinst/sync-and-trigger/blob/main/dump-psql#L4 is used for o3 but OSD uses a custom call
- pg_dump might slow down overall performance. Consider options from
- https://serverfault.com/questions/349221/how-to-make-pg-dump-less-resource-greedy
- https://www.postgresql.org/docs/current/performance-tips.html
- https://www.linkedin.com/pulse/how-speed-up-pgdump-when-dumping-large-postgres-nikolay-samokhvalov/
- https://www.postgresql.org/docs/current/continuous-archiving.html
- try out in particular the `-j` option of pg_dump
- Consider if the datacenter migration is the underlying root cause impacting I/O performance on OSD and/or between the backup VM and OSD.
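To make the suggestions above concrete, here is a minimal sketch of a less resource-greedy dump invocation, combining idle-priority scheduling with pg_dump's parallel directory-format output (`-j` requires `-Fd`). The database name, output path, and job count are assumptions for illustration, not OSD's actual configuration:

```shell
#!/bin/sh
# Sketch only: names and paths are placeholders, not the real OSD setup.
# nice/ionice keep the dump from competing with the webUI for CPU and I/O;
# -Fd writes a directory-format archive, which is what -j (parallel jobs)
# requires; -j 4 dumps up to four tables concurrently.
nice -n 19 ionice -c3 \
    pg_dump -Fd -j 4 -f /var/lib/backup/openqa-dump openqa
```

A directory-format dump also restores in parallel via `pg_restore -j`, so the trade-off against the current plain-text dump is mainly a change in the backup artifact's format.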