action #175311
The hook script (in this case openqa-label-known-issues) might take too long to react to SIGTERM leading to alerts size:S
Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2025-01-13
Due date:
% Done: 0%
Estimated time:
Description
Observation
Jan 11 05:14:53 openqa openqa-gru[29123]: Job 16418415 already has a clone, skipping investigation. Use the env variable 'force=true' to trigger investigation jobs
Jan 11 05:17:42 openqa openqa-gru[10806]: Job 16418422 already has a clone, skipping investigation. Use the env variable 'force=true' to trigger investigation jobs
Jan 11 05:19:38 openqa openqa-gru[20142]: timeout:
Jan 11 05:19:38 openqa openqa-gru[20274]: timeout: sending signal TERM to command ‘env’
Jan 11 05:19:38 openqa systemd[1]: Stopping The openQA daemon for various background tasks like cleanup and saving needles...
Jan 11 05:19:38 openqa openqa-gru[20142]: sending signal TERM to command ‘env’
Jan 11 05:19:38 openqa openqa-gru[18272]: timeout: sending signal TERM to command ‘env’
Jan 11 05:19:39 openqa openqa-gru[20197]: /opt/os-autoinst-scripts/openqa-label-known-issues: line 34: 20450 Done echo "$issues"
Jan 11 05:19:39 openqa openqa-gru[20197]: 20451 Terminated | ( while read -r issue; do
Jan 11 05:19:39 openqa openqa-gru[20197]: read -r subject; read -r tracker; after=${subject#*\"}; search=${after%\"*}; force_result=''; label="poo#$issue $subject"; if [[ ${#search} -ge $min_search_term ]]; then
Jan 11 05:19:39 openqa openqa-gru[20197]: if [[ $after =~ :force_result:([a-z_]+) ]]; then
Jan 11 05:19:39 openqa openqa-gru[20197]: if [[ $tracker == "$force_result_tracker" ]]; then
Jan 11 05:19:39 openqa openqa-gru[20197]: force_result=${BASH_REMATCH[1]};
Jan 11 05:19:39 openqa openqa-gru[20197]: else
Jan 11 05:19:39 openqa openqa-gru[20197]: label="$label (ignoring force result for ticket which is not in tracker \"$force_result_tracker\")";
Jan 11 05:19:39 openqa openqa-gru[20197]: fi;
Jan 11 05:19:39 openqa openqa-gru[20197]: fi; label-on-issue "$id" "$search" "$label" "${after//*\":retry*/1}" "$force_result" && break;
Jan 11 05:19:39 openqa openqa-gru[20197]: fi;
Jan 11 05:19:39 openqa openqa-gru[20197]: done )
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: State 'stop-sigterm' timed out. Killing.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 1720 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 18270 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 20140 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 20270 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 21067 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Main process exited, code=killed, status=9/KILL
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Failed with result 'timeout'.
Jan 11 05:21:09 openqa systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Consumed 8h 38min 30.827s CPU time.
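One way to make the hook react faster to SIGTERM would be to run the long labelling pipeline in its own process group and forward the signal from a trap. The following is only a minimal, hypothetical sketch (function name, timeouts and the placeholder loop are assumptions, not the actual openqa-label-known-issues code):

```bash
#!/bin/bash
# Hypothetical sketch: forward SIGTERM to the whole worker process group so
# the hook exits quickly instead of running into systemd's stop timeout.
set -euo pipefail

terminate() {
    # Negative PID addresses the process group created by setsid below.
    [[ -n "${worker:-}" ]] && kill -TERM -- "-$worker" 2>/dev/null || true
    exit 143  # 128 + SIGTERM
}
trap terminate TERM INT

# setsid makes the worker its own process-group leader; GNU timeout's
# --kill-after escalates to SIGKILL if SIGTERM is ignored.
setsid timeout --kill-after=10s 2m bash -c '
    while read -r issue; do
        : "placeholder for the real per-issue work (label-on-issue, ...)"
    done < /dev/null
' &
worker=$!
wait "$worker"
```

With something along these lines the stop request at 05:19:38 would tear down the pipeline within seconds instead of running into the 'stop-sigterm' timeout.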
This SIGKILL left openqa-gru.service failed and caused the systemd alert to fire, even though the unit was only in the failed state for a short time: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?from=2025-01-11T05%3A16%3A34.817Z&to=2025-01-11T05%3A31%3A42.538Z
What I don't understand is that the alert eventually fired even though the number of failed systemd units went back down to zero within the grace period:
- 2025-01-11 05:21:10: one failed systemd unit, alert is PENDING
- 2025-01-11 05:22:49: no failed systemd units anymore (so the unit was only failing for 1 min 39 s), alert still PENDING
- 2025-01-11 05:24:10: still no failed systemd units anymore, alert is ALERTING
- 2025-01-11 05:27:50: still no failed systemd units anymore, alert is OK
EDIT: Times from the state history (the above list is from the graph):
- 2025-01-11 05:21:14: alert is PENDING
- 2025-01-11 05:24:14: alert is FIRING
- 2025-01-11 05:27:54: alert is OK
Here the delay between PENDING and FIRING is exactly 3 minutes, which matches the configured grace period.
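For reference, the deltas can be verified directly from the state-history timestamps with GNU date (timezone assumed to match the log):

```bash
pending=$(date -d '2025-01-11 05:21:14' +%s)
firing=$(date -d '2025-01-11 05:24:14' +%s)
ok=$(date -d '2025-01-11 05:27:54' +%s)
echo "pending -> firing: $((firing - pending)) s"   # 180 s = the 3 min grace period
echo "firing  -> ok:     $((ok - firing)) s"        # 220 s
```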
See #175311#note-7 for an explanation of the confusing part.
Otherwise I'm not sure whether the service being killed by systemd is actually a bad/problematic thing.
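Regarding the kill itself: the roughly 90 s between "Stopping ..." at 05:19:38 and the SIGKILL at 05:21:09 suggests the default TimeoutStopSec. If giving hook scripts more time on shutdown is preferred over faster signal handling, the stop timeout can be inspected and extended; the 5-minute value below is only an example, not a recommendation:

```bash
# Current stop timeout of the unit:
systemctl show -p TimeoutStopSec openqa-gru.service

# Extend it via a drop-in (systemctl edit reloads the daemon afterwards):
sudo systemctl edit openqa-gru.service
#   [Service]
#   TimeoutStopSec=5min
```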
Acceptance criteria
- AC1: We know why the alert fired.
- AC2: Configured grace periods are reasonable.
Suggestions
- Investigate why the alert was triggered even though no failing systemd service remained after some time. Look into how the alert goes back to normal: is it instant or also subject to a "grace period"? (For checking the configured pending period, see the sketch after this list.)
- Research possible Grafana upstream bugs.
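A possible starting point for checking the configured pending period, assuming the Grafana alerting provisioning API is enabled on monitor.qa.suse.de and a suitable service-account token is at hand (the rule-title filter is only a guess):

```bash
curl -sf -H "Authorization: Bearer $GRAFANA_TOKEN" \
    https://monitor.qa.suse.de/api/v1/provisioning/alert-rules \
  | jq -r '.[] | select(.title | test("systemd"; "i")) | "\(.title): for=\(.["for"])"'
```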