action #130633
closed
Better documentation on jenkins.qa.suse.de alerts and recovery
0%
Description
Motivation
The "packet loss" alert does not seem very clear, and when there are many alerts it may not be obvious how to address them.
Acceptance criteria
- AC1: The alert is understood by the team
- AC2: There's documentation about how to recover jenkins when it's down
Suggestions
- Write some documentation, or dig up existing docs
- Consider a little mob session on alert handling and recovery of machines
- Look at https://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1
Updated by livdywan over 1 year ago
- Copied from action #128561: salt managed host being down does not trigger any alert (was: jenkins.qa.suse.de stuck in emergency mode but no alert) size:M added
Updated by okurz over 1 year ago
- Tags set to infra
- Due date deleted (2023-06-15)
- Priority changed from High to Normal
- Start date deleted (2023-05-03)
Updated by okurz over 1 year ago
- Status changed from New to In Progress
- Assignee set to okurz
I reviewed the description text of the monitoring panel and found one minor point that we can improve: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/879. I will look into the text for the actual alert later on.
Updated by okurz over 1 year ago
- Due date set to 2023-06-23
- Status changed from In Progress to Feedback
I added the wiki section https://wiki.suse.net/index.php/SUSE-Quality_Assurance/Labs#Additional_services describing qamaster as well as the important VMs running on it.
The alert text says "At least one host listed under required_external_networks in workerconf.sls in the pillars repository is not pingable from at least one openQA worker host. Check the panel associated with the alert. The legend table on the right shows the problematic hosts on top." I find that clear enough and would not extend it further.
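For anyone who wants to reproduce the check by hand from a worker host, here is a minimal sketch of the idea behind the alert. It is not the actual monitoring implementation. It assumes the iputils `ping` summary line format ("N% packet loss"), and the helper names `parse_packet_loss`, `packet_loss` and `rank_hosts` are hypothetical. The host list has to be taken manually from required_external_networks in workerconf.sls.

```python
import re
import subprocess

# Assumption: iputils-style summary line, e.g.
# "3 packets transmitted, 3 received, 0% packet loss, time 2003ms"
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_packet_loss(ping_output):
    """Return the packet loss percentage from ping output, or None if absent."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def packet_loss(host, count=3, wait=5):
    """Ping a host and return its packet loss in percent.

    A missing summary line (for example an unresolvable name) or a
    failure to run ping at all is treated as 100% loss.
    """
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), "-W", str(wait), host],
            capture_output=True,
            text=True,
            timeout=count * wait + 5,
        )
    except (subprocess.TimeoutExpired, OSError):
        return 100.0
    loss = parse_packet_loss(result.stdout)
    return 100.0 if loss is None else loss


def rank_hosts(hosts):
    """Return (host, loss) pairs sorted with the most problematic hosts first."""
    measured = [(host, packet_loss(host)) for host in hosts]
    return sorted(measured, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_hosts` with the hosts copied from the pillar prints nothing by itself; iterating over the result lists the hosts with the highest loss first, similar to how the panel legend puts problematic hosts on top.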
Updated by okurz over 1 year ago
- Due date deleted (2023-06-23)
- Status changed from Feedback to Resolved
I added a link to the labs wiki in https://progress.opensuse.org/projects/qa/tools/wiki and in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/879 within the panel description. The MR is merged and I verified the ticket status with cdywan.