action #130633

closed

Better documentation on jenkins.qa.suse.de alerts and recovery

Added by livdywan about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

It seems the alert regarding "packet loss" is not very clear, and when there are many alerts it may not be obvious how to address it.

Acceptance criteria

  • AC1: The alert is understood by the team
  • AC2: There is documentation about how to recover Jenkins when it is down

Suggestions


Related issues: 1 (0 open, 1 closed)

Copied from openQA Infrastructure - action #128561: salt managed host being down does not trigger any alert (was: jenkins.qa.suse.de stuck in emergency mode but no alert) size:M | Resolved | dheidler | 2023-05-03 | 2023-07-04

Actions #1

Updated by livdywan about 1 year ago

  • Copied from action #128561: salt managed host being down does not trigger any alert (was: jenkins.qa.suse.de stuck in emergency mode but no alert) size:M added
Actions #2

Updated by okurz about 1 year ago

  • Tags set to infra
  • Due date deleted (2023-06-15)
  • Priority changed from High to Normal
  • Start date deleted (2023-05-03)
Actions #3

Updated by livdywan about 1 year ago

  • Description updated (diff)
Actions #4

Updated by okurz about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to okurz

I reviewed the description text of the monitoring panel and found one minor point that we can improve: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/879. I will look into the text of the actual alert later on.

Actions #5

Updated by okurz about 1 year ago

  • Due date set to 2023-06-23
  • Status changed from In Progress to Feedback

I added the wiki section https://wiki.suse.net/index.php/SUSE-Quality_Assurance/Labs#Additional_services describing qamaster as well as the important VMs running on it.

The alert text says "At least one host listed under required_external_networks in workerconf.sls in the pillars repository is not pingable from at least one openQA worker host. Check the panel associated with the alert. The legend table on the right shows the problematic hosts on top." I find that clear enough and would not extend it further.
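To illustrate how the alert text can be read, here is a minimal sketch of the logic it describes: a host counts as problematic as soon as at least one worker cannot reach it, and the worst hosts are listed first, as in the panel legend. This is not the actual Grafana query or salt state; the function name `problematic_hosts`, the input layout (worker name mapped to packet loss in percent per target host) and the sample hostnames are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def problematic_hosts(
    loss_by_worker: Dict[str, Dict[str, float]],
    threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Return (host, worst packet loss in %) pairs above threshold, worst first.

    loss_by_worker is a hypothetical structure: worker name -> {target host: loss %}.
    """
    worst: Dict[str, float] = {}
    for targets in loss_by_worker.values():
        for host, loss in targets.items():
            # One worker failing to reach the host is enough to flag it,
            # so keep the highest loss seen across all workers.
            worst[host] = max(worst.get(host, 0.0), loss)
    flagged = [(host, loss) for host, loss in worst.items() if loss > threshold]
    # Highest loss first, ties broken alphabetically.
    return sorted(flagged, key=lambda item: (-item[1], item[0]))


# Hypothetical sample data: worker and target names are placeholders.
sample = {
    "worker-a": {"jenkins.qa.suse.de": 100.0, "other.example": 0.0},
    "worker-b": {"jenkins.qa.suse.de": 0.0, "other.example": 0.0},
}
print(problematic_hosts(sample))
```

With this sample data only jenkins.qa.suse.de is flagged, because one of the two workers reports complete packet loss for it.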

Actions #6

Updated by okurz about 1 year ago

  • Due date deleted (2023-06-23)
  • Status changed from Feedback to Resolved

I added a link to the labs wiki in https://progress.opensuse.org/projects/qa/tools/wiki and, within the panel description, in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/879. The MR is merged and I verified the ticket status with cdywan.
