action #80538

Updated by livdywan about 3 years ago

## Observation 

 On 2020-11-26 and 2020-11-27 there were the following alerts and automatic ok-messages: 

 * 2020-11-26 2045Z: [Alerting] openQA minion workers alert 
 * 2020-11-26 2046Z: [OK] openQA minion workers alert 
 * 2020-11-26 2106Z: [OK] openqaworker8: Minion Jobs alert 
 * 2020-11-27 0140Z: [Alerting] QA-Power8-4-kvm: Minion Jobs alert 
 * 2020-11-27 0255Z: [OK] QA-Power8-4-kvm: Minion Jobs alert 

 I assume no one took any action that caused these alerts to return to "OK". 

 The alert "openQA minion workers alert" has the description: 

 ``` 
 Minion workers down. Check systemd services on the openQA host  
 ``` 

 and the alert "$worker: Minion Jobs alert" has the description: 

 ``` 
 to remove all failed jobs on the machine ``` /usr/share/openqa/script/openqa-workercache eval 'my $jobs = app->minion->jobs({states => ["failed"]}); while (my $job = $jobs->next) { $job->remove }' ```  
 ``` 

 ## Problems 

 * meaning of alert is unclear 
 * alerts should be stable, i.e. not flaky 
 * the description does not explain enough for team members to understand what the alert means and what needs to be done 

 ## Acceptance criteria 
 * **AC1:** The intention of the alert is defined and documented, and the alert is renamed if needed, so that multiple team members can confirm that the updated name and/or description explains what to do 
 * **AC2:** At least one other team member can handle the alert based on the documentation 
 * **AC3:** The alert is stable: it triggers and persists when there's a problem 

 ## Suggestions 

 * Look at the [o3 minion dashboard](https://openqa.opensuse.org/minion) 
 * Locate the [alert definition](https://gitlab.suse.de/openqa/grafana-webhook-actions) or alert on [grafana](https://stats.openqa-monitor.qa.suse.de/) 
 * Understand what the data source for the alert is and how "minion jobs" are involved 
 * Research what the expected state of "minion jobs" is 
 * Extend the description to explain the alert; optionally reference this ticket 
 * Consider updating the name of the alert as well 
 * Stabilize alert
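
 One possible reading of "stable, i.e. not flaky" is that the alert should only fire once the condition has held for several consecutive evaluations, so that a one-minute blip like the 2045Z/2046Z pair above does not page anyone. The following is a minimal sketch of that idea and not how the existing Grafana alert is implemented; `DebouncedAlert`, `threshold`, `hold` and `evaluate` are hypothetical names, and the sample values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DebouncedAlert:
    """Report "Alerting" only after `hold` consecutive bad samples."""

    threshold: int  # hypothetical: highest tolerated number of failed minion jobs
    hold: int       # hypothetical: consecutive bad samples required before alerting
    _streak: int = 0

    def observe(self, failed_jobs: int) -> str:
        # Count how long the condition has been violated without interruption.
        if failed_jobs > self.threshold:
            self._streak += 1
        else:
            self._streak = 0
        return "Alerting" if self._streak >= self.hold else "OK"


def evaluate(samples, threshold=0, hold=3):
    """Run a fresh DebouncedAlert over a sequence of failed-job counts."""
    alert = DebouncedAlert(threshold=threshold, hold=hold)
    return [alert.observe(s) for s in samples]


if __name__ == "__main__":
    # A single bad sample never reaches the hold count.
    print(evaluate([0, 1, 0, 0]))
    # A sustained problem alerts on the third bad sample and clears on recovery.
    print(evaluate([1, 1, 1, 0]))
```

 With this approach the alert also persists for as long as the problem does, because the state only returns to "OK" after a good sample. If the alerting tool offers a built-in pending period on the rule, that would likely be preferable to custom logic.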
