action #130633
closed
Better documentation on jenkins.qa.suse.de alerts and recovery
Added by livdywan over 1 year ago.
Updated over 1 year ago.
Description
Motivation
It seems the alert regarding "packet loss" is not very clear, and when many alerts fire at once it is not obvious how to address it.
Acceptance criteria
- AC1: The alert is understood by the team
- AC2: There is documentation on how to recover jenkins when it is down
Suggestions
- Copied from action #128561: salt managed host being down does not trigger any alert (was: jenkins.qa.suse.de stuck in emergency mode but no alert) size:M added
- Tags set to infra
- Due date deleted (2023-06-15)
- Priority changed from High to Normal
- Start date deleted (2023-05-03)
- Description updated (diff)
- Status changed from New to In Progress
- Assignee set to okurz
- Due date set to 2023-06-23
- Status changed from In Progress to Feedback
I added the wiki section https://wiki.suse.net/index.php/SUSE-Quality_Assurance/Labs#Additional_services describing qamaster as well as the important VMs running on it.
The alert text says "At least one host listed under required_external_networks in workerconf.sls in the pillars repository is not pingable from at least one openQA worker host. Check the panel associated with the alert. The legend table on the right shows the problematic hosts on top." I find that clear enough and would not extend it further.
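To reproduce by hand what the alert describes, a minimal sketch could look like the following. It is not the monitoring implementation. It assumes a Linux worker host with the iputils `ping` command available, and the function names `ping_once` and `unreachable_hosts` are made up for illustration. The host list would have to be copied manually from required_external_networks in workerconf.sls.

```python
import subprocess
from typing import Callable, Iterable, List


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo request and report whether it was answered.

    Assumption: Linux iputils `ping`, where -c is the packet count and
    -W is the reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_hosts(
    hosts: Iterable[str],
    ping: Callable[[str], bool] = ping_once,
) -> List[str]:
    """Return the hosts that did not answer, in input order.

    `ping` is injectable so the logic can be exercised without network access.
    """
    return [host for host in hosts if not ping(host)]
```

Run on an openQA worker host, `unreachable_hosts([...])` with the hosts from the pillars repository should return the same hosts that the alert panel lists on top of its legend table, assuming the panel is based on a plain ping as the alert text suggests.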
- Due date deleted (2023-06-23)
- Status changed from Feedback to Resolved