action #138545
closedMunin - minion hook failed - opensuse.org :: openqa.opensuse.org size:S
Added by livdywan about 1 year ago. Updated about 1 year ago.
0%
Description
Observation¶
Email from 2023-10-24 21.45 CEST
opensuse.org :: openqa.opensuse.org :: hook failed
CRITICALs: rc_failed_per_5min is 17.00 (outside range [:10]).
Apparently the journal contains these messages (sudo journalctl -u openqa-gru
):
Oct 24 19:37:51 new-ariel openqa-gru[13130]: Connect timeout
Oct 24 19:37:51 new-ariel openqa-gru[13130]:
Oct 24 19:37:51 new-ariel openqa-gru[13128]: Connect timeout
Oct 24 19:37:51 new-ariel openqa-gru[13128]:
Oct 24 19:37:51 new-ariel openqa-gru[13129]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Oct 24 19:37:51 new-ariel openqa-gru[13066]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Oct 24 19:37:51 new-ariel openqa-gru[13127]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Oct 24 19:37:51 new-ariel openqa-gru[13077]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Suggestions¶
- Include the source of the alert in the email e.g. systemd journal for openqa-gru
- Investigate what's causing openqa-label-known-issues to abort
Files
munin-load.png (23.3 KB) munin-load.png | tinita, 2023-10-27 12:31 | ||
munin-hook-failed.png (16.9 KB) munin-hook-failed.png | tinita, 2023-10-27 12:31 |
Updated by livdywan about 1 year ago
I'm not sure what the impact is yet, going with Urgent by default
Updated by tinita about 1 year ago
Since it's sporadic and only about the investigation jobs, I don't think it urgent. This is happening regularly.
We could increase the retry count, however, judging from the logs, this went on for quite some time (19:37-19:46), so more retries wouldn't have helped.
Updated by livdywan about 1 year ago
- Subject changed from Munin - minion hook failed - opensuse.org :: openqa.opensuse.org to Munin - minion hook failed - opensuse.org :: openqa.opensuse.org size:S
- Status changed from New to In Progress
- Assignee set to livdywan
- Priority changed from Urgent to High
This may also be the same as #138551 since Connect timeout hints at DNS issues, see https://github.com/mojolicious/mojo/blob/a607e42d80ba169299375e911577fdf4abe28540/lib/Mojo/IOLoop/Client.pm#L36
It would be nice if the email could clarify what commands failed e.g. hook scripts but we only get a count here. Maybe it's worth documenting this in the wiki at least as this is a moving target.
Updated by livdywan about 1 year ago
- Related to action #138527: Zabbix agent on ariel.dmz-prg2.suse.org reported no data for 30m and there is nothing in the journal size:S added
Updated by livdywan about 1 year ago
- Related to action #138551: DNS outage of 2023-10-25, e.g. Cron <root@openqa-service> (date; fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log Max retries exceeded with url size:S added
Updated by livdywan about 1 year ago
- Status changed from In Progress to Feedback
Updated by livdywan about 1 year ago
More failures:
Oct 27 11:08:15 new-ariel openqa-gru[21111]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Maybe the web UI is overloaded? We couldn't easily find out what if anything is wrong with the web UI. It could be the amount of jobs, and it could be the load but we currently have no equivalent of job queue graphs in Munin.
Updated by livdywan about 1 year ago
- Related to action #138683: https://metrics.opensuse.org/d/osrt_openqa/osrt-openqa?orgId=1&from=now-7d&to=now should show current results … which apparently it doesn't added
Updated by tinita about 1 year ago
- File munin-load.png munin-load.png added
- File munin-hook-failed.png munin-hook-failed.png added
Looking at the munin graphs for hook failed and load it could be that the high load is causing connection timeouts.
http://localhost:8080/munin/opensuse.org/openqa.opensuse.org/openqa_minion_jobs_hook_rc_failed.html
http://localhost:8080/munin/opensuse.org/openqa.opensuse.org/load.html
I attached screenshots.
I also made a little statistic for the number of running jobs this morning per hour.
I used a query to calculate the total sum of time consumed by running jobs, which would give me the average number of running jobs that I would have seen when looking at the https://openqa.opensuse.org/tests page:
openqa=> select count(*), sum(least(t_finished, '2023-10-27 01:00:00') - greatest(t_started, '2023-10-27 00:00:00')) as dur from jobs where t_started <= '2023-10-27 01:00:00' and t_finished >= '2023-10-27 00:00:00';
count | dur
-------+----------
215 | 86:58:37
(1 row)
# and that for every hour...
00: 86:58
01: 49:50
02: 17:02
03: 04:37
04: 09:09
05: 10:27
06: 13:44
07: 69:07
08: 69:48
09: 89:46
10: 192:02
11: 134:31
That looks like it matches the load pretty good.
Updated by livdywan about 1 year ago
Based on @tinita's numbers I decided to add the following to /etc/openqa/openqa.ini:
# Specify how many jobs in total can run per webui instance. Defaults to -1 (no limit)
# See https://progress.opensuse.org/issues/134927 for value selection
max_running_jobs = 120
I confirmed that 120 jobs are running (limited by server config) is shown after scheduling a bunch of jobs and the web UI so far continues to be very responsive.
Updated by livdywan about 1 year ago
livdywan wrote in #note-10:
I confirmed that 120 jobs are running (limited by server config) is shown after scheduling a bunch of jobs and the web UI so far continues to be very responsive.
Let's try 135 and see how the load looks.
Updated by livdywan about 1 year ago
23-10-31T22.55+1
opensuse.org :: openqa.opensuse.org :: hook failed
OKs: rc_failed_per_5min is 4.00.
23-10-31T22.45+1
opensuse.org :: openqa.opensuse.org :: hook failed
WARNINGs: rc_failed_per_5min is 7.00 (outside range [:5]).
Unfortunately it would seem we're once again seeing failures.
I'm trying to confirm if this correlates with a too high load but for now stuck at guessing how to execute postgres correctly - sudo -u postgres postgres -D /var/lib/pgsql/
doesn't work and I'm not sure which folder it wants.
Edit: Apparently this is unrelated.
Updated by livdywan about 1 year ago
Let's try 135 and see how the load looks.
Let's try 175.
Updated by livdywan about 1 year ago
- Priority changed from High to Normal
opensuse.org :: openqa.opensuse.org :: hook failed - see openqa-gru service logs for details
CRITICALs: rc_failed_per_5min is 11.00 (outside range [:10]).
on Fri, 03 Nov 2023 04:45:09 +0000
Nov 03 04:42:58 new-ariel openqa-gru[24778]: Connect timeout
Nov 03 04:43:01 new-ariel openqa-gru[24778]:
Nov 03 04:43:01 new-ariel openqa-gru[24783]: Connect timeout
Nov 03 04:43:01 new-ariel openqa-gru[24783]:
Apparently that was too much. Going back to 120 135.
Updated by okurz about 1 year ago
135 is rather low. How about retries and longer timeouts in whatever is causing the "Connect timeout"?
Updated by livdywan about 1 year ago
okurz wrote in #note-15:
135 is rather low. How about retries and longer timeouts in whatever is causing the "Connect timeout"?
Wouldn't we further increase the load that way? I would rather increase or implement delay given the choice. Even that would likely contribute to an already high load, though.
My intention here is to wrap up this ticket with a reasonable amount, based on statistical observation and careful testing of different limits. And the real fix will be in other tickets that address the lack of responsiveness.
Updated by okurz about 1 year ago
livdywan wrote in #note-16:
okurz wrote in #note-15:
135 is rather low. How about retries and longer timeouts in whatever is causing the "Connect timeout"?
Wouldn't we further increase the load that way? I would rather increase or implement delay given the choice. Even that would likely contribute to an already high load, though.
I understood that gru jobs timeout whereas a higher load likely comes from too many concurrent openQA jobs trying to upload, isn't it? Anyway, I don't mind adding delays on busy responses.
My intention here is to wrap up this ticket with a reasonable amount, based on statistical observation and careful testing of different limits. And the real fix will be in other tickets that address the lack of responsiveness.
As often stated I am fine if you want the work to be followed up in another ticket if you can ensure that any other ticket has sufficient information to get the full picture. Often that can be harder to do than just continue while we are on a topic already.
Updated by tinita about 1 year ago
How many jobs were actually running at that time? Since it was a very short timeframe, one could also just look at one certain point of time.
Updated by tinita about 1 year ago
I set the limit to 170. Will see if it happens again and collect data.
I think we can increase OPENQA_CLI_RETRY_SLEEP_TIME_S
from 3 seconds to something higher.
Updated by tinita about 1 year ago
- Status changed from Feedback to In Progress
If I remember correctly I had a short look into the data when the timeouts happened last time.
Unfortunately the data in minion_jobs is now gone, but I think the problem happened in a timeframe of 5 minutes, and almost all minion hook scripts were failing in that time frame, so in that case an increase of the delay of just some seconds wouldn't have helped much.
This is the SQL that I have in the history:
select id, cast(args->1 as int) as job_id, task, started, finished, notes->'hook_rc' as hook_rc, finished-started from minion_jobs where task = 'hook_script' and started >= '2023-11-03 04:30:00' and started <= '2023-11-03 04:45:00' order by started;
I think I remember the failing scripts took something like 15-25 seconds. With a delay of 3 seconds and a retry number of 3 that means the connection timeout is only 2 seconds, right? 3 * (3+2)
I could see some of the requests appear in the nginx log; funnily more of them were actually getting their first request to http through but then the redirect to https resulted in a timeout apparently. Maybe we should just access https directly, then chances would be higher? :)
Updated by tinita about 1 year ago
Here's the corresponding log from Nov 3 btw:
Nov 03 04:39:30 new-ariel openqa-gru[21433]: Connect timeout
Nov 03 04:39:30 new-ariel openqa-gru[21433]:
Nov 03 04:39:30 new-ariel openqa-gru[21432]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:39:30 new-ariel openqa-gru[21403]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:39:41 new-ariel openqa-gru[21470]: Connect timeout
Nov 03 04:39:41 new-ariel openqa-gru[21470]:
Nov 03 04:39:41 new-ariel openqa-gru[21469]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:39:41 new-ariel openqa-gru[21440]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:39:51 new-ariel openqa-gru[21522]: Connect timeout
Nov 03 04:39:51 new-ariel openqa-gru[21522]:
Nov 03 04:39:51 new-ariel openqa-gru[21521]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:39:51 new-ariel openqa-gru[21492]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:40:16 new-ariel openqa-gru[22123]: Connect timeout
Nov 03 04:41:10 new-ariel openqa-gru[22123]:
Nov 03 04:41:10 new-ariel openqa-gru[23206]: Connect timeout
Nov 03 04:41:10 new-ariel openqa-gru[23206]:
Nov 03 04:41:10 new-ariel openqa-gru[23205]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:10 new-ariel openqa-gru[23152]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:10 new-ariel openqa-gru[23222]: Connect timeout
Nov 03 04:41:10 new-ariel openqa-gru[23222]:
Nov 03 04:41:10 new-ariel openqa-gru[23221]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:10 new-ariel openqa-gru[23140]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:10 new-ariel openqa-gru[23232]: Connect timeout
Nov 03 04:41:11 new-ariel openqa-gru[23232]:
Nov 03 04:41:11 new-ariel openqa-gru[23231]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:11 new-ariel openqa-gru[23136]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:34 new-ariel openqa-gru[23871]: Connect timeout
Nov 03 04:41:34 new-ariel openqa-gru[23871]:
Nov 03 04:41:34 new-ariel openqa-gru[23870]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:34 new-ariel openqa-gru[23841]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:34 new-ariel openqa-gru[23872]: Connect timeout
Nov 03 04:41:34 new-ariel openqa-gru[23872]:
Nov 03 04:41:34 new-ariel openqa-gru[23873]: Connect timeout
Nov 03 04:41:51 new-ariel openqa-gru[23873]:
Nov 03 04:41:51 new-ariel openqa-gru[23940]: Connect timeout
Nov 03 04:41:51 new-ariel openqa-gru[23940]:
Nov 03 04:41:51 new-ariel openqa-gru[23938]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:41:51 new-ariel openqa-gru[23539]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:41:54 new-ariel openqa-gru[23956]: Connect timeout
Nov 03 04:41:54 new-ariel openqa-gru[23956]:
Nov 03 04:41:54 new-ariel openqa-gru[23955]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:41:54 new-ariel openqa-gru[23584]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:41:58 new-ariel openqa-gru[23998]: Connect timeout
Nov 03 04:41:58 new-ariel openqa-gru[23998]:
Nov 03 04:41:58 new-ariel openqa-gru[23997]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:58 new-ariel openqa-gru[23968]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:42:11 new-ariel openqa-gru[24147]: Connect timeout
Nov 03 04:42:11 new-ariel openqa-gru[24147]:
Nov 03 04:42:11 new-ariel openqa-gru[24145]: Connect timeout
Nov 03 04:42:11 new-ariel openqa-gru[24145]:
Nov 03 04:42:11 new-ariel openqa-gru[24146]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:42:11 new-ariel openqa-gru[24083]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:42:11 new-ariel openqa-gru[24144]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:42:11 new-ariel openqa-gru[24097]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:42:58 new-ariel openqa-gru[24778]: Connect timeout
Nov 03 04:43:01 new-ariel openqa-gru[24778]:
Nov 03 04:43:01 new-ariel openqa-gru[24783]: Connect timeout
Nov 03 04:43:01 new-ariel openqa-gru[24783]:
Nov 03 04:43:01 new-ariel openqa-gru[24780]: Connect timeout
Nov 03 04:43:11 new-ariel openqa-gru[24780]:
Nov 03 04:43:11 new-ariel openqa-gru[24839]: Connect timeout
Nov 03 04:43:11 new-ariel openqa-gru[24839]:
Nov 03 04:43:11 new-ariel openqa-gru[24885]: Connect timeout
Nov 03 04:43:11 new-ariel openqa-gru[24885]:
Nov 03 04:43:11 new-ariel openqa-gru[24884]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:43:11 new-ariel openqa-gru[24209]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:43:12 new-ariel openqa-gru[24925]: Connect timeout
Nov 03 04:43:12 new-ariel openqa-gru[24925]:
Nov 03 04:43:12 new-ariel openqa-gru[24924]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:43:12 new-ariel openqa-gru[24381]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Updated by tinita about 1 year ago
Thanks to @kraih we know that the script doesn't retry in case of a connection timeout:
https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Command.pm#L127
Also he suggested we could increase MOJO_CONNECT_TIMEOUT from 10 to 30 here.
Will try to fix the retry code.
Updated by tinita about 1 year ago
https://github.com/os-autoinst/openQA/pull/5359 Make openqa-cli retry also on connection errors
Updated by tinita about 1 year ago
https://github.com/os-autoinst/scripts/pull/270 Wait longer and sleep longer
Updated by openqa_review about 1 year ago
- Due date set to 2023-11-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by tinita about 1 year ago
https://github.com/os-autoinst/openQA/pull/5363 Increase MOJO_CONNECT_TIMEOUT in openqa-cli
https://github.com/os-autoinst/scripts/pull/271 Retry openqa-cli requests
Updated by okurz about 1 year ago
- Status changed from Feedback to In Progress
Both MRs merged for long. What's the next step?
Updated by tinita about 1 year ago
- Status changed from In Progress to Resolved
Next step, like mentioned in comment 20: Checking how the limit of currently 170 is working out.
Just that we don't get emails, like I said yesterday.
I checked the munin dashboard and the journal; it looks fine for the last days.
I'll resolve it.
Updated by tinita 7 months ago
- Related to action #162521: Reconsider the global job limit on o3, try higher than 170 added