action #175311

The hook script (in this case openqa-label-known-issues) might take too long to react to SIGTERM leading to alerts size: S

Added by mkittler 5 days ago. Updated about 18 hours ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2025-01-13
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

Jan 11 05:14:53 openqa openqa-gru[29123]: Job 16418415 already has a clone, skipping investigation. Use the env variable 'force=true' to trigger investigation jobs
Jan 11 05:17:42 openqa openqa-gru[10806]: Job 16418422 already has a clone, skipping investigation. Use the env variable 'force=true' to trigger investigation jobs
Jan 11 05:19:38 openqa openqa-gru[20142]: timeout:
Jan 11 05:19:38 openqa openqa-gru[20274]: timeout: sending signal TERM to command ‘env’
Jan 11 05:19:38 openqa systemd[1]: Stopping The openQA daemon for various background tasks like cleanup and saving needles...
Jan 11 05:19:38 openqa openqa-gru[20142]: sending signal TERM to command ‘env’
Jan 11 05:19:38 openqa openqa-gru[18272]: timeout: sending signal TERM to command ‘env’
Jan 11 05:19:39 openqa openqa-gru[20197]: /opt/os-autoinst-scripts/openqa-label-known-issues: line 34: 20450 Done                    echo "$issues"
Jan 11 05:19:39 openqa openqa-gru[20197]:      20451 Terminated              | ( while read -r issue; do
Jan 11 05:19:39 openqa openqa-gru[20197]:     read -r subject; read -r tracker; after=${subject#*\"}; search=${after%\"*}; force_result=''; label="poo#$issue $subject"; if [[ ${#search} -ge $min_search_term ]]; then
Jan 11 05:19:39 openqa openqa-gru[20197]:         if [[ $after =~ :force_result:([a-z_]+) ]]; then
Jan 11 05:19:39 openqa openqa-gru[20197]:             if [[ $tracker == "$force_result_tracker" ]]; then
Jan 11 05:19:39 openqa openqa-gru[20197]:                 force_result=${BASH_REMATCH[1]};
Jan 11 05:19:39 openqa openqa-gru[20197]:             else
Jan 11 05:19:39 openqa openqa-gru[20197]:                 label="$label (ignoring force result for ticket which is not in tracker \"$force_result_tracker\")";
Jan 11 05:19:39 openqa openqa-gru[20197]:             fi;
Jan 11 05:19:39 openqa openqa-gru[20197]:         fi; label-on-issue "$id" "$search" "$label" "${after//*\":retry*/1}" "$force_result" && break;
Jan 11 05:19:39 openqa openqa-gru[20197]:     fi;
Jan 11 05:19:39 openqa openqa-gru[20197]: done )
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: State 'stop-sigterm' timed out. Killing.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 1720 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 18270 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 20140 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 20270 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 21067 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Main process exited, code=killed, status=9/KILL
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Failed with result 'timeout'.
Jan 11 05:21:09 openqa systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Consumed 8h 38min 30.827s CPU time.

This caused the systemd alert to fire, even though the time frame was quite short: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?from=2025-01-11T05%3A16%3A34.817Z&to=2025-01-11T05%3A31%3A42.538Z
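
For reference, the gap between "Stopping The openQA daemon …" (05:19:38) and "State 'stop-sigterm' timed out" (05:21:09) is about 90 s, i.e. the usual systemd default stop timeout (assuming openqa-gru.service does not override it). The effective values for the unit can be checked with standard systemctl properties:

  # Show the effective stop timeout and kill behaviour of openqa-gru.service
  systemctl show openqa-gru.service -p TimeoutStopUSec -p KillMode -p KillSignal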

What I don't understand is why the alert eventually fired even though the number of failing systemd units went back down to zero within the grace period:

  1. 2025-01-11 05:21:10: one failed systemd unit, alert is PENDING
  2. 2025-01-11 05:22:49: no failed systemd units anymore (so the unit was only failing for 1 min 39 s), alert still PENDING
  3. 2025-01-11 05:24:10: still no failed systemd units anymore, alert is ALERTING
  4. 2025-01-11 05:27:50: still no failed systemd units anymore, alert is OK

EDIT: Times from the state history (the above list is from the graph):

  1. 2025-01-11 05:21:14: alert is PENDING
  2. 2025-01-11 05:24:14: alert is FIRING
  3. 2025-01-11 05:27:54: alert is OK

Here we see that the alert was pending for exactly 3 minutes, which matches the expected grace period.
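
For completeness, the 3-minute pending window can be double-checked from the state-history timestamps with a throwaway GNU date snippet:

  pending=$(date -d '2025-01-11 05:21:14' +%s)
  firing=$(date -d '2025-01-11 05:24:14' +%s)
  ok=$(date -d '2025-01-11 05:27:54' +%s)
  echo "pending -> firing: $((firing - pending)) s"   # 180 s = 3 min
  echo "firing  -> ok:     $((ok - firing)) s"        # 220 s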

See #175311#note-7 for an explanation of the confusing part.


Otherwise I'm not sure whether the service being killed by systemd is actually a problem in itself.
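
If we do consider it a problem, one general shell pattern for reacting to SIGTERM quickly is to run the long-running part in the background and wait for it interruptibly, forwarding the signal to the child. This is only a sketch, not the actual openqa-label-known-issues code; investigate_job below is a hypothetical placeholder:

  #!/bin/bash
  # Sketch: forward SIGTERM to the background worker and exit promptly
  worker_pid=
  on_term() {
      [[ -n $worker_pid ]] && kill -TERM "$worker_pid" 2>/dev/null
      wait "$worker_pid" 2>/dev/null
      exit 143   # 128 + SIGTERM
  }
  trap on_term TERM INT

  investigate_job "$@" &   # hypothetical stand-in for the real work
  worker_pid=$!
  # 'wait' is interrupted by trapped signals, unlike a foreground command,
  # so on_term runs as soon as systemd (or timeout) sends TERM
  wait "$worker_pid"

Alternatively the stop window itself could be widened via a TimeoutStopSec= drop-in for openqa-gru.service, but that only papers over slow shutdowns instead of fixing them.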

Acceptance criteria

  • AC1: We know why the alert fired.
  • AC2: Configured grace periods are reasonable.

Suggestions

  • Investigate why an alert was triggered even though no failing systemd service remained after some time. Look into how the alert goes back to normal: is that instant, or does it also involve a "grace period"? (See the sketch after this list.)
  • Research possible Grafana upstream bugs.
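
As a starting point for the first suggestion, the underlying failed-unit count can be sampled directly on the host while reproducing the scenario, to correlate it with the alert state transitions. A rough sketch:

  # Print a timestamped count of failed units every 10 s
  while true; do
      printf '%s %s\n' "$(date -Is)" "$(systemctl list-units --state=failed --no-legend | wc -l)"
      sleep 10
  done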