action #130633
closed
Better documentation on jenkins.qa.suse.de alerts and recovery
Description
Motivation
It seems the alert regarding "packet loss" is not very clear. And when there are many alerts, it may not be obvious how to address them.
Acceptance criteria
- AC1: The alert is understood by the team
- AC2: There's documentation about how to recover jenkins when it's down
Suggestions
- Write some documentation, or dig up existing docs
- Consider a little mob session on alert handling and recovery of machines
- Look at https://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1
Updated by livdywan 4 months ago
- Copied from action #128561: salt managed host being down does not trigger any alert (was: jenkins.qa.suse.de stuck in emergency mode but no alert) size:M added
Updated by okurz 4 months ago
- Status changed from New to In Progress
- Assignee set to okurz
I reviewed the description text of the monitoring panel and found one minor point that we can improve https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/879 . I will look into the text for the actual alert later on.
Updated by okurz 4 months ago
- Due date set to 2023-06-23
- Status changed from In Progress to Feedback
I added the wiki section https://wiki.suse.net/index.php/SUSE-Quality_Assurance/Labs#Additional_services describing qamaster as well as important VMs running on there.
The alert text says "At least one host listed under required_external_networks in workerconf.sls in the pillars repository is not pingable from at least one openQA worker host. Check the panel associated with the alert. The legend table on the right shows the problematic hosts on top." I find that clear enough and would not extend it further.
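The condition described in the alert text can also be checked by hand from a worker host. The following is only a minimal sketch, not the actual monitoring implementation: it assumes a Linux iputils `ping` is available, and the function names `is_pingable` and `unreachable_hosts` are hypothetical. The host list has to be supplied by the caller, since the layout of workerconf.sls is not reproduced here.

```python
import subprocess
from typing import Callable, Iterable, List


def is_pingable(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo request; assumes a Linux iputils `ping` in PATH."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_hosts(
    hosts: Iterable[str],
    ping: Callable[[str], bool] = is_pingable,
) -> List[str]:
    """Return the hosts that did not answer, in the order they were given."""
    return [host for host in hosts if not ping(host)]
```

Calling `unreachable_hosts([...])` with the host names taken from required_external_networks returns the subset that is not pingable from the machine the script runs on. The `ping` parameter exists only so that the check can be replaced with a stub when trying the function without network access.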
Updated by okurz 4 months ago
- Due date deleted (2023-06-23)
- Status changed from Feedback to Resolved
I added a link to the labs wiki in https://progress.opensuse.org/projects/qa/tools/wiki and in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/879 within the panel description. The MR is merged, and I verified the ticket status with cdywan.