action #175311

The hook script (in this case openqa-label-known-issues) might take too long to react to SIGTERM leading to alerts size: S

Added by mkittler 5 days ago. Updated about 18 hours ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2025-01-13
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

Jan 11 05:14:53 openqa openqa-gru[29123]: Job 16418415 already has a clone, skipping investigation. Use the env variable 'force=true' to trigger investigation jobs
Jan 11 05:17:42 openqa openqa-gru[10806]: Job 16418422 already has a clone, skipping investigation. Use the env variable 'force=true' to trigger investigation jobs
Jan 11 05:19:38 openqa openqa-gru[20142]: timeout:
Jan 11 05:19:38 openqa openqa-gru[20274]: timeout: sending signal TERM to command ‘env’
Jan 11 05:19:38 openqa systemd[1]: Stopping The openQA daemon for various background tasks like cleanup and saving needles...
Jan 11 05:19:38 openqa openqa-gru[20142]: sending signal TERM to command ‘env’
Jan 11 05:19:38 openqa openqa-gru[18272]: timeout: sending signal TERM to command ‘env’
Jan 11 05:19:39 openqa openqa-gru[20197]: /opt/os-autoinst-scripts/openqa-label-known-issues: line 34: 20450 Done                    echo "$issues"
Jan 11 05:19:39 openqa openqa-gru[20197]:      20451 Terminated              | ( while read -r issue; do
Jan 11 05:19:39 openqa openqa-gru[20197]:     read -r subject; read -r tracker; after=${subject#*\"}; search=${after%\"*}; force_result=''; label="poo#$issue $subject"; if [[ ${#search} -ge $min_search_term ]]; then
Jan 11 05:19:39 openqa openqa-gru[20197]:         if [[ $after =~ :force_result:([a-z_]+) ]]; then
Jan 11 05:19:39 openqa openqa-gru[20197]:             if [[ $tracker == "$force_result_tracker" ]]; then
Jan 11 05:19:39 openqa openqa-gru[20197]:                 force_result=${BASH_REMATCH[1]};
Jan 11 05:19:39 openqa openqa-gru[20197]:             else
Jan 11 05:19:39 openqa openqa-gru[20197]:                 label="$label (ignoring force result for ticket which is not in tracker \"$force_result_tracker\")";
Jan 11 05:19:39 openqa openqa-gru[20197]:             fi;
Jan 11 05:19:39 openqa openqa-gru[20197]:         fi; label-on-issue "$id" "$search" "$label" "${after//*\":retry*/1}" "$force_result" && break;
Jan 11 05:19:39 openqa openqa-gru[20197]:     fi;
Jan 11 05:19:39 openqa openqa-gru[20197]: done )
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: State 'stop-sigterm' timed out. Killing.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 1720 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 18270 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 20140 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 20270 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Killing process 21067 (openqa) with signal SIGKILL.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Main process exited, code=killed, status=9/KILL
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Failed with result 'timeout'.
Jan 11 05:21:09 openqa systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles.
Jan 11 05:21:09 openqa systemd[1]: openqa-gru.service: Consumed 8h 38min 30.827s CPU time.

This caused the systemd alert to fire, even though the time frame was quite short: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?from=2025-01-11T05%3A16%3A34.817Z&to=2025-01-11T05%3A31%3A42.538Z
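
For reference, the gap between "Stopping The openQA daemon …" (05:19:38) and "State 'stop-sigterm' timed out" (05:21:09) is about 90 s, i.e. the usual systemd default stop timeout (assuming openqa-gru.service does not override it). The effective values for the unit can be checked with standard systemctl properties:

  # Show the effective stop timeout and kill behaviour of openqa-gru.service
  systemctl show openqa-gru.service -p TimeoutStopUSec -p KillMode -p KillSignal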

What I don't understand is why the alert eventually fired even though the number of failing systemd units went back down to zero within the grace period:

  1. 2025-01-11 05:21:10: one failed systemd unit, alert is PENDING
  2. 2025-01-11 05:22:49: no failed systemd units anymore (so the unit was only failing for 1 min 39 s), alert still PENDING
  3. 2025-01-11 05:24:10: still no failed systemd units anymore, alert is ALERTING
  4. 2025-01-11 05:27:50: still no failed systemd units anymore, alert is OK

EDIT: Times from the state history (the above list is from the graph):

  1. 2025-01-11 05:21:14: alert is PENDING
  2. 2025-01-11 05:24:14: alert is FIRING
  3. 2025-01-11 05:27:54: alert is OK

Here we see that the alert was pending for exactly 3 minutes, which matches the expected grace period.
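
For completeness, the 3-minute pending window can be double-checked from the state-history timestamps with a throwaway GNU date snippet:

  pending=$(date -d '2025-01-11 05:21:14' +%s)
  firing=$(date -d '2025-01-11 05:24:14' +%s)
  ok=$(date -d '2025-01-11 05:27:54' +%s)
  echo "pending -> firing: $((firing - pending)) s"   # 180 s = 3 min
  echo "firing  -> ok:     $((ok - firing)) s"        # 220 s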

See #175311#note-7 for an explanation of the confusing part.


Otherwise I'm not sure whether the service being killed by systemd is actually a problem in itself.
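
If we do consider it a problem, one general shell pattern for reacting to SIGTERM quickly is to run the long-running part in the background and wait for it interruptibly, forwarding the signal to the child. This is only a sketch, not the actual openqa-label-known-issues code; investigate_job below is a hypothetical placeholder:

  #!/bin/bash
  # Sketch: forward SIGTERM to the background worker and exit promptly
  worker_pid=
  on_term() {
      [[ -n $worker_pid ]] && kill -TERM "$worker_pid" 2>/dev/null
      wait "$worker_pid" 2>/dev/null
      exit 143   # 128 + SIGTERM
  }
  trap on_term TERM INT

  investigate_job "$@" &   # hypothetical stand-in for the real work
  worker_pid=$!
  # 'wait' is interrupted by trapped signals, unlike a foreground command,
  # so on_term runs as soon as systemd (or timeout) sends TERM
  wait "$worker_pid"

Alternatively the stop window itself could be widened via a TimeoutStopSec= drop-in for openqa-gru.service, but that only papers over slow shutdowns instead of fixing them.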

Acceptance criteria

  • AC1: We know why the alert fired.
  • AC2: Configured grace periods are reasonable.

Suggestions

  • Investigate why an alert was triggered even though no failing systemd service remained after some time. Look into how the alert goes back to normal: is that instant, or does it also involve a "grace period"? (See the sketch after this list.)
  • Research possible Grafana upstream bugs.
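
As a starting point for the first suggestion, the underlying failed-unit count can be sampled directly on the host while reproducing the scenario, to correlate it with the alert state transitions. A rough sketch:

  # Print a timestamped count of failed units every 10 s
  while true; do
      printf '%s %s\n' "$(date -Is)" "$(systemctl list-units --state=failed --no-legend | wc -l)"
      sleep 10
  done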