action #163592

closed

[alert] (HTTP Response alert Salt tm0h5mf4k) size:M

Added by tinita about 1 month ago. Updated 23 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-07-10
Due date:
% Done: 0%
Estimated time:
Description

Observation

We saw several alerts this morning:
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/tm0h5mf4k/view?orgId=1
Date: Wed, 10 Jul 2024 04:45:31 +0200

Suggestions

  • DONE Reduce global job limit
  • DONE Consider mitigations for route causing the overload (real fix is in place now)
  • Actively monitor while mitigations are applied
  • Repeatedly: Retrigger affected jobs when applicable
  • Handle communication with affected users and downstream services
  • Run strace on the relevant processes of openqa-webui.service (this helped us define #163757)
  • After #163757, continue to closely monitor the system, then slowly and carefully conduct the rollback actions
  • Ensure that https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=78&orgId=1&from=now-7d&to=now stays stable
  • Consider temporarily disabling pg_dump triggered by the regular backup

Rollback actions

  • DONE Manual changes to /usr/share/openqa/lib/OpenQA/WebAPI.pm on osd -> reverted to the original code with the deployment of 2024-07-12 for #163757
  • DONE Re-enable https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules
  • DONE sudo systemctl start auto-update.timer on osd
  • DONE Enable the backup via pg_dump on OSD again by running sudo ln -fs /usr/lib/postgresql16/bin/pg_dump /etc/alternatives/pg_dump on OSD (to revert the experiment done in #163592#note-41)
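
The rollback state listed above could be double-checked on OSD with a short script. This is only a minimal sketch, not a command taken from the ticket: the helper name check_link is hypothetical, and the paths are assumed to be exactly those in the ln -fs command above.

```shell
#!/bin/sh
# Hypothetical helper (not from the ticket): succeed only if the
# symlink given as $1 points at the target given as $2.
check_link() {
    [ "$(readlink "$1")" = "$2" ]
}

# Possible use on OSD, assuming the paths from the rollback list:
#   check_link /etc/alternatives/pg_dump /usr/lib/postgresql16/bin/pg_dump \
#       && echo "pg_dump alternative points at the real binary"
#   systemctl is-active auto-update.timer   # reports whether the timer is running
```

readlink without options prints the stored link target, so the check also works when the target binary is absent from the machine running it.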

Out of scope

  • DONE Fixing the problem that too many browsers on live view starve out other processes: #163757
  • openQA jobs explicitly ending with "timestamp mismatch" are to be handled in #162038
  • Improving the error message about openQA jobs failing with "api failure" and no details: #163781

Related issues 11 (2 open, 9 closed)

  • Related to openQA Infrastructure - action #163595: [alert] (Web proxy Response Time alert Salt IeChie2He) [Resolved, okurz, 2024-07-10]
  • Related to openQA Infrastructure - action #162038: No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry size:S [Resolved, nicksinger, 2024-06-10]
  • Related to openQA Infrastructure - action #163622: Scripts CI pipeline failing with invalid numeric literal error size:S [Resolved, jbaier_cz]
  • Related to openQA Project - action #163757: Prevent live view viewers from making openQA unresponsive [Resolved, mkittler, 2024-07-11, 2024-07-30]
  • Related to openQA Infrastructure - action #163772: [openQA][ipmi][worker35:x] Assigned jobs hang and actually can not run size:M [Resolved, okurz, 2024-07-11]
  • Related to openQA Project - action #163931: OpenQA logreport for ariel.suse-dmz.opensuse.org Can't locate object method render_specific_not_found via package OpenQA::Shared::Controller::Running [Resolved, tinita, 2024-07-31]
  • Related to openQA Infrastructure - action #159396: Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M [Resolved, dheidler]
  • Has duplicate openQA Infrastructure - action #163928: [alert] Openqa HTTP Response lost on 15-07-24 size:S [Resolved, okurz, 2024-07-15]
  • Copied to openQA Infrastructure - action #163610: Conduct "lessons learned" with Five Why analysis for "[alert] (HTTP Response alert Salt tm0h5mf4k)" [Resolved, okurz, 2024-07-10]
  • Copied to openQA Infrastructure - action #163790: OSD openqa.ini is corrupted, invalid characters size:M [Blocked, okurz, 2024-07-10]
  • Copied to openQA Infrastructure - action #164427: HTTP response alert every Monday 01:00 CET/CEST due to fstrim size:M [Feedback, okurz, 2024-07-10, 2024-09-20]
