action #175311
The hook script (in this case openqa-label-known-issues) might take too long to react to SIGTERM leading to alerts size: S
Status: closed
Description
Observation
Jan 11 05:14:53 openqa openqa-gru[29123]: Job 16418415 already has a clone, skipping investigation. Use the env variable 'force=true' to trigger investigation jobs
Jan 11 05:17:42 openqa openqa-gru[10806]: Job 16418422 already has a clone, skipping investigation. Use the env variable 'force=true' to trigger investigation jobs
Jan 11 05:19:38 openqa openqa-gru[20142]: timeout:
Jan 11 05:19:38 openqa openqa-gru[20274]: timeout: sending signal TERM to command ‘env’
Jan 11 05:19:38 openqa systemd[1]: Stopping The openQA daemon for various background tasks like cleanup and saving needles...
Jan 11 05:19:38 openqa openqa-gru[20142]: sending signal TERM to command ‘env’
Jan 11 05:19:38 openqa openqa-gru[18272]: timeout: sending signal TERM to command ‘env’
Jan 11 05:19:39 openqa openqa-gru[20197]: /opt/os-autoinst-scripts/openqa-label-known-issues: line 34: 20450 Done echo "$issues"
Jan 11 05:19:39 openqa openqa-gru[20197]: 20451 Terminated | ( while read -r issue; do
Jan 11 05:19:39 openqa openqa-gru[20197]: read -r subject; read -r tracker; after=${subject#*\"}; search=${after%\"*}; force_result=''; label="poo#$issue $subject"; if [[ ${#search} -ge $min_search_term ]]; then
Jan 11 05:19:39 openqa openqa-gru[20197]: if [[ $after =~ :force_result:([a-z_]+) ]]; then
Jan 11 05:19:39 openqa openqa-gru[20197]: if [[ $tracker == "$force_result_tracker" ]]; then
Jan 11 05:19:39 openqa openqa-gru[20197]: force_result=${BASH_REMATCH[1]};
Jan 11 05:19:39 openqa openqa-gru[20197]: else
Jan 11 05:19:39 openqa openqa-gru[20197]: label="$label (ignoring force result for ticket which is not in tracker \"$force_result_tracker\")";
Jan 11 05:19:39 openqa openqa-gru[20197]: fi;
Jan 11 05:19:39 openqa openqa-gru[20197]: fi; label-on-issue "$id" "$search" "$label" "${after//*\":retry*/1}" "$force_result" && break;
Jan 11 05:19:39 openqa openqa-gru[20197]: fi;
Jan 11 05:19:39 openqa openqa-gru[20197]: done )
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: State 'stop-sigterm' timed out. Killing.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 1720 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 18270 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 20140 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 20270 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 21067 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Main process exited, code=killed, status=9/KILL
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Failed with result 'timeout'.
Jan 11 05:21:09 openqa systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Consumed 8h 38min 30.827s CPU time.
This caused the systemd alert to fire, even though the time frame was quite short: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?from=2025-01-11T05%3A16%3A34.817Z&to=2025-01-11T05%3A31%3A42.538Z
What I don't understand is that the alert eventually fired even though the number of failed systemd units went back down to zero within the grace period:
- 2025-01-11 05:21:10: one failed systemd unit, alert is PENDING
- 2025-01-11 05:22:49: no failed systemd units anymore (so the unit was only failing for 1 min 39 s), alert still PENDING
- 2025-01-11 05:24:10: still no failed systemd units anymore, alert is ALERTING
- 2025-01-11 05:27:50: still no failed systemd units anymore, alert is OK
EDIT: Times from the state history (the above list is from the graph):
- 2025-01-11 05:21:14: alert is PENDING
- 2025-01-11 05:24:14: alert is FIRING
- 2025-01-11 05:27:54: alert is OK
Here we see that the pending period is exactly 3 minutes, which matches the expected grace period.
See #175311#note-7 for an explanation of the confusing part.
Otherwise I'm not sure whether the service being killed by systemd is actually a bad/problematic thing.
Acceptance criteria
- AC1: We know why the alert fired.
- AC2: Configured grace periods are reasonable.
Suggestions
- Investigate why an alert was triggered even though no failing systemd service remained after some time. Look into how the alert would go back to normal. Is it instant or also in a "grace period"?
- Research possible Grafana upstream bugs.
Updated by robert.richardson 4 days ago
- Subject changed from The hook script (in this case openqa-label-known-issues) might take too long to react to SIGTERM leading to alerts to The hook script (in this case openqa-label-known-issues) might take too long to react to SIGTERM leading to alerts size: S
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz 4 days ago
- Category set to Regressions/Crashes
Related improvement: https://github.com/os-autoinst/openQA/pull/6113
Updated by dheidler 3 days ago
Is anything here left to do with the PR merged?
AC1 is pretty obvious, so I would count that already as done.
I'm not sure if there is an easy way to verify AC2, as the process didn't terminate on its own in the available log but was killed.
So we don't know how long it would actually have taken.
Updated by okurz 2 days ago
The PR is only related but does not fix the issue, which is that the alert triggered even though at that time the alert condition, i.e. a failed systemd service, was already gone. So we need to look into the Grafana alert condition. What we want is a grace period when a failed systemd service shows up, so that we do not alert immediately; on the other hand, if the failed systemd service vanishes, the pending period should end immediately.
Updated by mkittler about 18 hours ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Updated by mkittler about 17 hours ago · Edited
- Description updated (diff)
What I found confusing is that the alert stayed pending (and eventually fired) when no service was failing anymore. This is explained by the fact that the query always looks at the last 11 minutes and the condition is failed units > 0. This means the number of failing services must be stable at 0 for 11 minutes before the condition clears.
The configured grace period is only 3 minutes, so we only wait 3 minutes for a condition that needs 11 minutes of stability to clear, which will never be enough. Hence the alert will always fire 3 minutes after a service has failed, regardless of how briefly the service was failing.
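For illustration, the shape of such a rule in Grafana's alert provisioning YAML looks roughly like the following. This is a minimal sketch only: the title, refIds, datasource UID and the 11-minute window are assumptions based on the description above, not copied from the actual salt-states-openqa rule.

rules:
  - title: failed systemd services        # hypothetical name
    condition: B
    for: 3m                               # the 3-minute grace (pending) period
    data:
      - refId: A                          # query returning the number of failed units
        relativeTimeRange:
          from: 660                       # look 11 minutes into the past (assumed)
          to: 0
        datasourceUid: influxdb           # placeholder
        model: {}                         # actual query omitted
      - refId: B
        datasourceUid: __expr__
        model:
          type: classic_conditions
          conditions:
            - query:
                params: [A]
              reducer:
                type: avg                 # avg() over the whole window: a single
                                          # non-zero sample keeps the result above 0
              evaluator:
                type: gt
                params: [0]               # the "failed units > 0" condition
              operator:
                type: and

With avg() over the whole window the condition stays true for the full 11 minutes after a single failure sample, and for: 3m only delays the transition to firing, it never cancels it.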
We could change the use of avg() in the query condition to last() (and only look at e.g. the last few seconds; with last() the time interval doesn't matter anymore). Then the alert would have left the pending state immediately after the service was no longer failing and of course also wouldn't have fired.
Otherwise it would perhaps make sense to bring the configured grace period in line with the 11 minutes we look into the past: e.g. if we extended the grace period to 14 minutes we would only trigger an alert if a systemd unit failed for at least 3 minutes (because 14 - 11 = 3). In this case the alert wouldn't have triggered, as the service was only failing for under 2 minutes. I would prefer to simply use last() as it is more intuitive than using avg()/max() together with some duration we look into the past.
Note that avg() vs. max() doesn't make a difference here because the condition is > 0. So if we want to keep the behavior we have with avg() we could also switch to the simpler/cheaper max() function. On the other hand, if we wanted even more complicated behavior we could keep using avg() and change the condition to e.g. > 0.5. I will not do the math to give an example of what that would mean like in the previous paragraph.
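Sketched against the same hypothetical YAML as above, the preferred change would only swap the reducer; the surrounding query and the grace period could stay as they are:

              reducer:
                type: last                # was: avg; only the newest sample counts,
                                          # so the condition clears (and the pending
                                          # state ends) as soon as no unit is failing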
Updated by mkittler about 17 hours ago
- Status changed from In Progress to Feedback
With the previous explanation AC1 is fulfilled. I would go for the suggestion of simply using last() to implement AC2.
Updated by mkittler about 15 hours ago
- Status changed from Workable to Feedback
Looks like we actually only look 5 minutes into the past. That's at least what we have in YAML:
relativeTimeRange:
  from: 300
  to: 0
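  # from/to are offsets in seconds: the query covers the last 300 s (5 minutes) up to "now"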
I thought it was 11 minutes because that's what is displayed on https://monitor.qa.suse.de/alerting/grafana/UzAhcmBVk/view?tab=query. However, that's maybe just the time range selected for displaying the graph in the UI.
I guess it doesn't really matter, though.
MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1342
Updated by mkittler about 14 hours ago
- Status changed from Feedback to Resolved
The change was applied, see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1342#note_710389 for details.