Project

General

Profile

Actions

action #159396

closed

openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #108209: [epic] Reduce load on OSD

Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M

Added by livdywan 3 months ago. Updated 12 days ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
Due date:
2024-07-16
% Done:

0%

Estimated time:

Description

Motivation

Repeatedly people report problems with responsiveness of OSD, especially during European lunch time. During such times our monitoring also shows HTTP unresponsiveness. Similar situations can be seen shortly after the European midnight. One hypothesis is that "pg_dump" triggered from https://gitlab.suse.de/qa-sle/backup-server-salt/-/blob/master/rsnapshot/rsnapshot.conf#L35 is slowing down the operations causing such symptoms. The schedule for this trigger is defined in https://gitlab.suse.de/qa-sle/backup-server-salt/-/blob/master/rsnapshot/init.sls#L28 so triggered every 4h meaning 00:00 CET/CEST just as well as 12:00 CET/CEST.

One related alert
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=78&from=now-24h&to=now

B0=15.654104274

Notably the graph says 15.7 at the time of the alert e.g. 00:36, the second highest is 3.7s a couple of hours earlier and otherwise around the 200ms mark (note the difference in units). There seems to be no obvious fallout or other alerts coinciding?

Acceptance criteria

  • AC1: No HTTP unresponsiveness alerts or user reports linked to times when pg_dump was running
  • AC2: We still have frequent database backups for OSD openQA content so that not more work than 4h maximum is lost

Suggestions


Related issues 4 (1 open3 closed)

Related to openQA Infrastructure - action #158059: OSD unresponsive or significantly slow for some minutes 2024-03-26 13:34ZResolvedokurz

Actions
Related to openQA Infrastructure - action #159639: [alert] "web UI: Too many 5xx HTTP responses alert" size:SResolveddheidler2024-04-26

Actions
Related to openQA Project - action #130636: high response times on osd - Try nginx on OSD size:SResolvedmkittler2024-05-17

Actions
Related to openQA Infrastructure - action #162038: No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry size:SIn Progressnicksinger2024-06-102024-07-30

Actions
Actions

Also available in: Atom PDF