Project

General

Profile

action #175311

Updated by mkittler about 20 hours ago

## Observation 

 ``` 
 Jan 11 05:14:53 openqa openqa-gru[29123]: Job 16418415 already has a clone, skipping investigation. Use the env variable 'force=true' to trigger investigation jobs 
 Jan 11 05:17:42 openqa openqa-gru[10806]: Job 16418422 already has a clone, skipping investigation. Use the env variable 'force=true' to trigger investigation jobs 
 Jan 11 05:19:38 openqa openqa-gru[20142]: timeout: 
 Jan 11 05:19:38 openqa openqa-gru[20274]: timeout: sending signal TERM to command ‘env’ 
 Jan 11 05:19:38 openqa systemd[1]: Stopping The openQA daemon for various background tasks like cleanup and saving needles... 
 Jan 11 05:19:38 openqa openqa-gru[20142]: sending signal TERM to command ‘env’ 
 Jan 11 05:19:38 openqa openqa-gru[18272]: timeout: sending signal TERM to command ‘env’ 
 Jan 11 05:19:39 openqa openqa-gru[20197]: /opt/os-autoinst-scripts/openqa-label-known-issues: line 34: 20450 Done                      echo "$issues" 
 Jan 11 05:19:39 openqa openqa-gru[20197]:        20451 Terminated                | ( while read -r issue; do 
 Jan 11 05:19:39 openqa openqa-gru[20197]:       read -r subject; read -r tracker; after=${subject#*\"}; search=${after%\"*}; force_result=''; label="poo#$issue $subject"; if [[ ${#search} -ge $min_search_term ]]; then 
 Jan 11 05:19:39 openqa openqa-gru[20197]:           if [[ $after =~ :force_result:([a-z_]+) ]]; then 
 Jan 11 05:19:39 openqa openqa-gru[20197]:               if [[ $tracker == "$force_result_tracker" ]]; then 
 Jan 11 05:19:39 openqa openqa-gru[20197]:                   force_result=${BASH_REMATCH[1]}; 
 Jan 11 05:19:39 openqa openqa-gru[20197]:               else 
 Jan 11 05:19:39 openqa openqa-gru[20197]:                   label="$label (ignoring force result for ticket which is not in tracker \"$force_result_tracker\")"; 
 Jan 11 05:19:39 openqa openqa-gru[20197]:               fi; 
 Jan 11 05:19:39 openqa openqa-gru[20197]:           fi; label-on-issue "$id" "$search" "$label" "${after//*\":retry*/1}" "$force_result" && break; 
 Jan 11 05:19:39 openqa openqa-gru[20197]:       fi; 
 Jan 11 05:19:39 openqa openqa-gru[20197]: done ) 
 Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: State 'stop-sigterm' timed out. Killing. 
 Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 1720 (openqa) with signal SIGKILL. 
 Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 18270 (openqa) with signal SIGKILL. 
 Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 20140 (openqa) with signal SIGKILL. 
 Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 20270 (openqa) with signal SIGKILL. 
 Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 21067 (openqa) with signal SIGKILL. 
 Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Main process exited, code=killed, status=9/KILL 
 Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Failed with result 'timeout'. 
 Jan 11 05:21:09 openqa systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles. 
 Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Consumed 8h 38min 30.827s CPU time. 
 ``` 

 This caused the systemd alert to fire, even though the time frame was quite short: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?from=2025-01-11T05%3A16%3A34.817Z&to=2025-01-11T05%3A31%3A42.538Z 

 What I don't understand is that the alert was eventually firing even though the number of failing systemd units went down to zero in the grace period: 

 1. 2025-01-11 05:21:10: one failed systemd unit, alert is PENDING 
 2. 2025-01-11 05:22:49: no failed systemd units anymore (so the unit was only failing for 1 min 39 s), anymore, alert still PENDING 
 3. 2025-01-11 05:24:10: still no failed systemd units anymore, alert is ALERTING 
 4. 2025-01-11 05:27:50: still no failed systemd units anymore, alert is OK 

 EDIT: Times from the state history (the above list is from the graph): 

 1. 2025-01-11 05:21:14: alert is PENDING 
 2. 2025-01-11 05:24:14: alert is FIRING 
 3. 2025-01-11 05:27:54: alert is OK 

 Here we see that the grace period is exactly 3 minutes which matches the expected grace period. 

 See #175311#note-7 for an explanation of the confusing part. 

 --- 

 Otherwise I'm not sure whether the service being killed by systemd is actually a bad/problematic thing. 

 ## Acceptance critera 
 * **AC1**: We know why the alert fired. 
 * **AC2**: Configured grace periods are reasonable. 

 ## Suggestions 
 * Investigate why an alert was triggered even though no failing systemd service remained after some time. Look into how the alert would go back to normal. Is it instant or also in a "grace period"? 
 * Research possible Grafana upstream bugs.

Back