action #175311
The hook script (in this case openqa-label-known-issues) might take too long to react to SIGTERM leading to alerts size: S
Status: closed
Description
Observation
Jan 11 05:14:53 openqa openqa-gru[29123]: Job 16418415 already has a clone, skipping investigation. Use the env variable 'force=true' to trigger investigation jobs
Jan 11 05:17:42 openqa openqa-gru[10806]: Job 16418422 already has a clone, skipping investigation. Use the env variable 'force=true' to trigger investigation jobs
Jan 11 05:19:38 openqa openqa-gru[20142]: timeout:
Jan 11 05:19:38 openqa openqa-gru[20274]: timeout: sending signal TERM to command ‘env’
Jan 11 05:19:38 openqa systemd[1]: Stopping The openQA daemon for various background tasks like cleanup and saving needles...
Jan 11 05:19:38 openqa openqa-gru[20142]: sending signal TERM to command ‘env’
Jan 11 05:19:38 openqa openqa-gru[18272]: timeout: sending signal TERM to command ‘env’
Jan 11 05:19:39 openqa openqa-gru[20197]: /opt/os-autoinst-scripts/openqa-label-known-issues: line 34: 20450 Done echo "$issues"
Jan 11 05:19:39 openqa openqa-gru[20197]: 20451 Terminated | ( while read -r issue; do
Jan 11 05:19:39 openqa openqa-gru[20197]: read -r subject; read -r tracker; after=${subject#*\"}; search=${after%\"*}; force_result=''; label="poo#$issue $subject"; if [[ ${#search} -ge $min_search_term ]]; then
Jan 11 05:19:39 openqa openqa-gru[20197]: if [[ $after =~ :force_result:([a-z_]+) ]]; then
Jan 11 05:19:39 openqa openqa-gru[20197]: if [[ $tracker == "$force_result_tracker" ]]; then
Jan 11 05:19:39 openqa openqa-gru[20197]: force_result=${BASH_REMATCH[1]};
Jan 11 05:19:39 openqa openqa-gru[20197]: else
Jan 11 05:19:39 openqa openqa-gru[20197]: label="$label (ignoring force result for ticket which is not in tracker \"$force_result_tracker\")";
Jan 11 05:19:39 openqa openqa-gru[20197]: fi;
Jan 11 05:19:39 openqa openqa-gru[20197]: fi; label-on-issue "$id" "$search" "$label" "${after//*\":retry*/1}" "$force_result" && break;
Jan 11 05:19:39 openqa openqa-gru[20197]: fi;
Jan 11 05:19:39 openqa openqa-gru[20197]: done )
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: State 'stop-sigterm' timed out. Killing.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 1720 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 18270 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 20140 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 20270 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 21067 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Main process exited, code=killed, status=9/KILL
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Failed with result 'timeout'.
Jan 11 05:21:09 openqa systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Consumed 8h 38min 30.827s CPU time.
This caused the systemd alert to fire, even though the time frame was quite short: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?from=2025-01-11T05%3A16%3A34.817Z&to=2025-01-11T05%3A31%3A42.538Z
What I don't understand is that the alert eventually fired even though the number of failed systemd units went back down to zero within the grace period:
- 2025-01-11 05:21:10: one failed systemd unit, alert is PENDING
- 2025-01-11 05:22:49: no failed systemd units anymore (so the unit was only failing for 1 min 39 s), alert still PENDING
- 2025-01-11 05:24:10: still no failed systemd units anymore, alert is ALERTING
- 2025-01-11 05:27:50: still no failed systemd units anymore, alert is OK
EDIT: Times from the state history (the above list is from the graph):
- 2025-01-11 05:21:14: alert is PENDING
- 2025-01-11 05:24:14: alert is FIRING
- 2025-01-11 05:27:54: alert is OK
Here we see that the pending period is exactly 3 minutes, which matches the expected grace period.
See #175311#note-7 for an explanation of the confusing part.
Otherwise I'm not sure whether the service being killed by systemd is actually a bad/problematic thing.
Acceptance criteria
- AC1: We know why the alert fired.
- AC2: Configured grace periods are reasonable.
Suggestions
- Investigate why an alert was triggered even though no failing systemd service remained after some time. Look into how the alert would go back to normal. Is it instant or also in a "grace period"?
- Research possible Grafana upstream bugs.
Updated by robert.richardson 4 days ago
- Subject changed from The hook script (in this case openqa-label-known-issues) might take too long to react to SIGTERM leading to alerts to The hook script (in this case openqa-label-known-issues) might take too long to react to SIGTERM leading to alerts size: S
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz 4 days ago
- Category set to Regressions/Crashes
Related improvement: https://github.com/os-autoinst/openQA/pull/6113
Updated by dheidler 3 days ago
Is anything here left to do with the PR merged?
AC1 is pretty obvious, so I would count that already as done.
I'm not sure if there is an easy way to verify AC2, as the process didn't terminate on its own in the available log but was killed.
So we don't know how long it would actually have taken.
Updated by okurz 2 days ago
The PR is only related but does not fix the issue, which is that the alert triggered even though at that time the alert condition, i.e. a failed systemd service, was already gone. So we need to look into the Grafana alert condition. What we want is a grace period when a failed systemd service shows up, so that we do not alert immediately; on the other hand, if the failed systemd service vanishes, the pending period should end immediately.
Updated by mkittler about 18 hours ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Updated by mkittler about 17 hours ago · Edited
- Description updated (diff)
What I found confusing is that the alert stayed pending (and eventually fired) when no service was failing anymore. This is explained by the fact that the query always looks at the last 11 minutes and the condition is failed units > 0. This means the number of failing services must be stable at 0 for 11 minutes before the condition clears.
The configured grace period is only 3 minutes, so we only wait 3 minutes for a condition that needs 11 minutes of stability to clear, which will never be enough. Hence the alert will always fire 3 minutes after a service has failed, regardless of how briefly the service was failing.
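For illustration, the shape of such a rule in Grafana's alert provisioning YAML looks roughly like the following. This is a minimal sketch only: the title, refIds, datasource UID and the 11-minute window are assumptions based on the description above, not copied from the actual salt-states-openqa rule.

rules:
  - title: failed systemd services        # hypothetical name
    condition: B
    for: 3m                               # the 3-minute grace (pending) period
    data:
      - refId: A                          # query returning the number of failed units
        relativeTimeRange:
          from: 660                       # look 11 minutes into the past (assumed)
          to: 0
        datasourceUid: influxdb           # placeholder
        model: {}                         # actual query omitted
      - refId: B
        datasourceUid: __expr__
        model:
          type: classic_conditions
          conditions:
            - query:
                params: [A]
              reducer:
                type: avg                 # avg() over the whole window: a single
                                          # non-zero sample keeps the result above 0
              evaluator:
                type: gt
                params: [0]               # the "failed units > 0" condition
              operator:
                type: and

With avg() over the whole window the condition stays true for the full 11 minutes after a single failure sample, and for: 3m only delays the transition to firing, it never cancels it.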
We could change the use of avg() in the query condition to last() (and only look at e.g. the last few seconds; with last() the time interval doesn't matter anymore). Then the alert would have left the pending state immediately after the service was no longer failing and of course also wouldn't have fired.
Otherwise it would perhaps make sense to bring the configured grace period in line with the 11 minutes we look into the past: e.g. if we extended the grace period to 14 minutes we would only trigger an alert if a systemd unit failed for at least 3 minutes (because 14 - 11 = 3). In this case the alert wouldn't have triggered, as the service was only failing for under 2 minutes. I would prefer to simply use last() as it is more intuitive than using avg()/max() together with some duration we look into the past.
Note that avg() vs. max() doesn't make a difference here because the condition is > 0. So if we want to keep the behavior we have with avg() we could also switch to the simpler/cheaper max() function. On the other hand, if we wanted even more complicated behavior we could keep using avg() and change the condition to e.g. > 0.5. I will not do the math to give an example of what that would mean like in the previous paragraph.
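Sketched against the same hypothetical YAML as above, the preferred change would only swap the reducer; the surrounding query and the grace period could stay as they are:

              reducer:
                type: last                # was: avg; only the newest sample counts,
                                          # so the condition clears (and the pending
                                          # state ends) as soon as no unit is failing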
Updated by mkittler about 17 hours ago
- Status changed from In Progress to Feedback
With the previous explanation AC1 is fulfilled. I would go for the suggestion of simply using last() to implement AC2.
Updated by mkittler about 15 hours ago
- Status changed from Workable to Feedback
Looks like we actually only look 5 minutes into the past. That's at least what we have in YAML:
relativeTimeRange:
  from: 300
  to: 0
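  # from/to are offsets in seconds: the query covers the last 300 s (5 minutes) up to "now"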
I thought it was 11 minutes because that's what is displayed on https://monitor.qa.suse.de/alerting/grafana/UzAhcmBVk/view?tab=query. However, that's maybe just the time range selected for displaying the graph in the UI.
I guess it doesn't really matter, though.
MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1342
Updated by mkittler about 14 hours ago
- Status changed from Feedback to Resolved
The change was applied, see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1342#note_710389 for details.