Better documentation on jenkins.qa.suse.de alerts and recovery
It seems the alert regarding "packet loss" is not very clear. And when there are many alerts at once, it may not be obvious how to address them.
- AC1: The alert is understood by the team
- AC2: There's documentation about how to recover Jenkins when it's down
- Write some documentation, or dig up existing docs
- Consider a little mob session on alert handling and recovery of machines
- Look at https://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1
- Status changed from New to In Progress
- Assignee set to okurz
I reviewed the description text of the monitoring panel and found one minor point that we can improve: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/879 . I will look into the text of the actual alert later.
- Due date set to 2023-06-23
- Status changed from In Progress to Feedback
I added the wiki section https://wiki.suse.net/index.php/SUSE-Quality_Assurance/Labs#Additional_services describing qamaster as well as the important VMs running on it.
The alert text says "At least one host listed under workerconf.sls in the pillars repository is not pingable from at least one openQA worker host. Check the panel associated with the alert. The legend table on the right shows the problematic hosts on top." I find that clear enough and would not extend it further.
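For anyone who wants to reproduce the check by hand from a worker host, here is a minimal sketch of the idea behind the alert. It is not how the monitoring stack implements it. It assumes a Linux `ping` (iputils) that prints an "N% packet loss" summary line, and the function names and any host list you pass in are hypothetical. Hosts with loss are sorted worst first, similar to the legend table in the panel.

```python
import re
import subprocess
from typing import Callable, Iterable, List, Optional, Tuple

# Matches the summary line printed by iputils ping, e.g.
# "3 packets transmitted, 2 received, 33.3333% packet loss, time 2003ms"
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_packet_loss(ping_output: str) -> Optional[float]:
    """Return the packet loss percentage from ping output, or None if absent."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def packet_loss(host: str, count: int = 3) -> float:
    """Ping a host and return its packet loss in percent.

    Assumption: a missing summary line (for example an unresolvable
    name) is treated as 100% loss.
    """
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", "2", host],
        capture_output=True,
        text=True,
    )
    loss = parse_packet_loss(result.stdout)
    return 100.0 if loss is None else loss


def problematic_hosts(
    hosts: Iterable[str],
    probe: Callable[[str], float] = packet_loss,
) -> List[Tuple[str, float]]:
    """Return (host, loss) pairs for hosts with any loss, worst first."""
    losses = [(host, probe(host)) for host in hosts]
    return sorted(
        ((host, loss) for host, loss in losses if loss > 0),
        key=lambda item: -item[1],
    )
```

Calling `problematic_hosts([...])` with the host names taken from workerconf.sls would then list the hosts that are not fully reachable from the machine it runs on. The `probe` parameter exists only so the logic can be exercised without network access.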
- Due date deleted (
- Status changed from Feedback to Resolved
I added a link to the labs wiki in https://progress.opensuse.org/projects/qa/tools/wiki and, via https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/879 , within the panel description. MR merged; verified ticket status with cdywan.