action #138545

Munin - minion hook failed - opensuse.org :: openqa.opensuse.org size:S

Added by livdywan 7 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Assignee: tinita
Category: -
Target version: -
Start date:
Due date: 2023-11-28
% Done: 0%
Estimated time:
Tags:

Description

Observation

Email from 2023-10-24 21:45 CEST

opensuse.org :: openqa.opensuse.org :: hook failed
   CRITICALs: rc_failed_per_5min is 17.00 (outside range [:10]).

Apparently the journal contains these messages (sudo journalctl -u openqa-gru):

Oct 24 19:37:51 new-ariel openqa-gru[13130]: Connect timeout
Oct 24 19:37:51 new-ariel openqa-gru[13130]: 
Oct 24 19:37:51 new-ariel openqa-gru[13128]: Connect timeout
Oct 24 19:37:51 new-ariel openqa-gru[13128]: 
Oct 24 19:37:51 new-ariel openqa-gru[13129]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Oct 24 19:37:51 new-ariel openqa-gru[13066]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Oct 24 19:37:51 new-ariel openqa-gru[13127]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Oct 24 19:37:51 new-ariel openqa-gru[13077]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117

Suggestions

  • Include the source of the alert in the email, e.g. the systemd journal for openqa-gru
  • Investigate what's causing openqa-label-known-issues to abort

Files

munin-load.png (23.3 KB) munin-load.png tinita, 2023-10-27 12:31
munin-hook-failed.png (16.9 KB) munin-hook-failed.png tinita, 2023-10-27 12:31

Related issues (1 open, 2 closed)

Related to openQA Infrastructure - action #138527: Zabbix agent on ariel.dmz-prg2.suse.org reported no data for 30m and there is nothing in the journal size:S (Resolved, livdywan, 2023-07-07)

Related to openQA Infrastructure - action #138551: DNS outage of 2023-10-25, e.g. Cron <root@openqa-service> (date; fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log Max retries exceeded with url size:S (Resolved, livdywan, 2023-10-23)

Related to openQA Infrastructure - action #138683: https://metrics.opensuse.org/d/osrt_openqa/osrt-openqa?orgId=1&from=now-7d&to=now should show current results … which apparently it doesn't (New, 2023-10-27)

Actions #1

Updated by livdywan 7 months ago

I'm not sure what the impact is yet, going with Urgent by default.

Actions #2

Updated by tinita 7 months ago

Since it's sporadic and only affects the investigation jobs, I don't think it's urgent. This has been happening regularly.
We could increase the retry count; however, judging from the logs, this went on for quite some time (19:37-19:46), so more retries wouldn't have helped.

Actions #3

Updated by livdywan 7 months ago

  • Subject changed from Munin - minion hook failed - opensuse.org :: openqa.opensuse.org to Munin - minion hook failed - opensuse.org :: openqa.opensuse.org size:S
  • Status changed from New to In Progress
  • Assignee set to livdywan
  • Priority changed from Urgent to High

This may also be the same as #138551 since Connect timeout hints at DNS issues, see https://github.com/mojolicious/mojo/blob/a607e42d80ba169299375e911577fdf4abe28540/lib/Mojo/IOLoop/Client.pm#L36

It would be nice if the email could clarify which commands failed, e.g. which hook scripts, but we only get a count here. Maybe it's worth documenting this in the wiki at least, as this is a moving target.

Actions #4

Updated by livdywan 7 months ago

  • Related to action #138527: Zabbix agent on ariel.dmz-prg2.suse.org reported no data for 30m and there is nothing in the journal size:S added
Actions #5

Updated by livdywan 7 months ago

  • Related to action #138551: DNS outage of 2023-10-25, e.g. Cron <root@openqa-service> (date; fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log Max retries exceeded with url size:S added
Actions #6

Updated by livdywan 7 months ago

  • Status changed from In Progress to Feedback
Actions #7

Updated by livdywan 6 months ago

More failures:

Oct 27 11:08:15 new-ariel openqa-gru[21111]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117

Maybe the web UI is overloaded? We couldn't easily find out what, if anything, is wrong with the web UI. It could be the number of jobs, or the load, but we currently have no equivalent of job queue graphs in Munin.

Actions #8

Updated by livdywan 6 months ago

  • Related to action #138683: https://metrics.opensuse.org/d/osrt_openqa/osrt-openqa?orgId=1&from=now-7d&to=now should show current results … which apparently it doesn't added

Actions #9

Updated by tinita 6 months ago

Looking at the munin graphs for hook failures and load, it could be that the high load is causing the connection timeouts.
http://localhost:8080/munin/opensuse.org/openqa.opensuse.org/openqa_minion_jobs_hook_rc_failed.html
http://localhost:8080/munin/opensuse.org/openqa.opensuse.org/load.html
I attached screenshots.

I also compiled a little statistic on the number of running jobs per hour this morning.
I used a query to calculate the total time consumed by running jobs, which gives the average number of running jobs one would have seen when looking at the https://openqa.opensuse.org/tests page:

openqa=> select count(*), sum(least(t_finished, '2023-10-27 01:00:00') - greatest(t_started, '2023-10-27 00:00:00')) as dur from jobs where t_started <= '2023-10-27 01:00:00' and t_finished >= '2023-10-27 00:00:00';
 count |   dur    
-------+----------
   215 | 86:58:37
(1 row)
# and that for every hour...
00: 86:58
01: 49:50
02: 17:02
03: 04:37
04: 09:09
05: 10:27
06: 13:44
07: 69:07
08: 69:48
09: 89:46
10: 192:02
11: 134:31

That looks like it matches the load pretty well.
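
For completeness, a sketch of how the same query could be scripted for all twelve hours instead of being run by hand (assuming local socket access to the openqa database as the postgres user; table and column names are taken from the query above):

for h in 00 01 02 03 04 05 06 07 08 09 10 11; do
  start="2023-10-27 ${h}:00:00"
  # same overlap calculation as above, clamped to the hour starting at $start
  sudo -u postgres psql -d openqa -t -A -c "
    select '${h}', count(*),
           sum(least(t_finished, timestamp '${start}' + interval '1 hour')
             - greatest(t_started, timestamp '${start}')) as dur
    from jobs
    where t_started <= timestamp '${start}' + interval '1 hour'
      and t_finished >= timestamp '${start}';"
done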

Actions #10

Updated by livdywan 6 months ago

Based on @tinita's numbers I decided to add the following to /etc/openqa/openqa.ini:

# Specify how many jobs in total can run per webui instance. Defaults to -1 (no limit)
# See https://progress.opensuse.org/issues/134927 for value selection
max_running_jobs = 120

I confirmed that "120 jobs are running (limited by server config)" is shown after scheduling a bunch of jobs, and the web UI so far continues to be very responsive.
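
For reference, a quick way to check the number of currently running jobs against that limit (a sketch; the jq filter assumes the usual /api/v1/jobs response shape):

# count jobs in state "running" on o3
openqa-cli api --host https://openqa.opensuse.org jobs state=running | jq '.jobs | length'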

Actions #11

Updated by livdywan 6 months ago

livdywan wrote in #note-10:

I confirmed that "120 jobs are running (limited by server config)" is shown after scheduling a bunch of jobs, and the web UI so far continues to be very responsive.

Let's try 135 and see how the load looks.

Actions #12

Updated by livdywan 6 months ago

2023-10-31T22:55+01
opensuse.org :: openqa.opensuse.org :: hook failed
    OKs: rc_failed_per_5min is 4.00.
2023-10-31T22:45+01
opensuse.org :: openqa.opensuse.org :: hook failed
    WARNINGs: rc_failed_per_5min is 7.00 (outside range [:5]).

Unfortunately it would seem we're once again seeing failures.

I'm trying to confirm whether this correlates with a too-high load, but for now I'm stuck guessing how to execute postgres correctly - sudo -u postgres postgres -D /var/lib/pgsql/ doesn't work and I'm not sure which folder it wants.

Edit: Apparently this is unrelated.
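
For the record: the postgres server is already running, so there is no need to start the binary by hand. Assuming the default openSUSE packaging, an interactive prompt on the openqa database can be opened with:

sudo -u postgres psql -d openqa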

Actions #13

Updated by livdywan 6 months ago

Let's try 135 and see how the load looks.

Let's try 175.

Actions #14

Updated by livdywan 6 months ago

  • Priority changed from High to Normal
opensuse.org :: openqa.opensuse.org :: hook failed - see openqa-gru service logs for details
    CRITICALs: rc_failed_per_5min is 11.00 (outside range [:10]).

on Fri, 03 Nov 2023 04:45:09 +0000

Nov 03 04:42:58 new-ariel openqa-gru[24778]: Connect timeout
Nov 03 04:43:01 new-ariel openqa-gru[24778]:
Nov 03 04:43:01 new-ariel openqa-gru[24783]: Connect timeout
Nov 03 04:43:01 new-ariel openqa-gru[24783]:

Apparently that was too much. Going back to 135.

Actions #15

Updated by okurz 6 months ago

135 is rather low. How about retries and longer timeouts in whatever is causing the "Connect timeout"?

Actions #16

Updated by livdywan 6 months ago

okurz wrote in #note-15:

135 is rather low. How about retries and longer timeouts in whatever is causing the "Connect timeout"?

Wouldn't we further increase the load that way? I would rather increase or implement a delay, given the choice. Even that would likely contribute to an already high load, though.

My intention here is to wrap up this ticket with a reasonable limit, based on statistical observation and careful testing of different limits. The real fix will come in other tickets that address the lack of responsiveness.

Actions #17

Updated by okurz 6 months ago

livdywan wrote in #note-16:

okurz wrote in #note-15:

135 is rather low. How about retries and longer timeouts in whatever is causing the "Connect timeout"?

Wouldn't we further increase the load that way? I would rather increase or implement a delay, given the choice. Even that would likely contribute to an already high load, though.

I understood that gru jobs time out, whereas a higher load likely comes from too many concurrent openQA jobs trying to upload, doesn't it? Anyway, I don't mind adding delays on busy responses.

My intention here is to wrap up this ticket with a reasonable limit, based on statistical observation and careful testing of different limits. The real fix will come in other tickets that address the lack of responsiveness.

As often stated, I am fine with the work being followed up in another ticket if you can ensure that the other ticket has sufficient information to get the full picture. Often that can be harder than just continuing while we are on the topic already.

Actions #18

Updated by tinita 6 months ago

How many jobs were actually running at that time? Since it was a very short timeframe, one could also just look at one specific point in time.

Actions #19

Updated by tinita 6 months ago

  • Assignee changed from livdywan to tinita
Actions #20

Updated by tinita 6 months ago

I set the limit to 170. Will see if it happens again and collect data.
I think we can increase OPENQA_CLI_RETRY_SLEEP_TIME_S from 3 seconds to something higher.
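
A sketch of how that could be wired up, assuming the hook scripts inherit their environment from the openqa-gru service (the drop-in path and the value 10 are illustrative, not decided yet):

# created via: sudo systemctl edit openqa-gru
# /etc/systemd/system/openqa-gru.service.d/override.conf
[Service]
Environment="OPENQA_CLI_RETRY_SLEEP_TIME_S=10"

followed by sudo systemctl restart openqa-gru.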

Actions #21

Updated by tinita 6 months ago

  • Status changed from Feedback to In Progress

If I remember correctly I had a short look into the data when the timeouts happened last time.
Unfortunately the data in minion_jobs is now gone, but I think the problem happened within a timeframe of 5 minutes, and almost all minion hook scripts were failing in that timeframe, so increasing the delay by just a few seconds wouldn't have helped much.
This is the SQL that I have in the history:

select id, cast(args->1 as int) as job_id, task, started, finished, notes->'hook_rc' as hook_rc, finished-started from minion_jobs where task = 'hook_script' and  started >= '2023-11-03 04:30:00' and started <= '2023-11-03 04:45:00' order by started;

I think I remember the failing scripts took something like 15-25 seconds. With a sleep of 3 seconds and 3 retries, that would mean the connection timeout is only about 2 seconds, right? 3 * (3 s sleep + 2 s timeout) = 15 s

I could see some of the requests appear in the nginx log; funnily enough, several of them were actually getting their first request through via http, but the redirect to https then apparently resulted in a timeout. Maybe we should just access https directly, then chances would be higher? :)
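
For the record, a variant of that query which counts failed hook scripts per minute, to line up with the rc_failed_per_5min metric (a sketch; it assumes a non-zero hook_rc note marks a failure):

sudo -u postgres psql -d openqa -c "
  select date_trunc('minute', started) as minute, count(*) as failed
  from minion_jobs
  where task = 'hook_script' and notes->>'hook_rc' <> '0'
    and started between '2023-11-03 04:30:00' and '2023-11-03 04:45:00'
  group by 1 order by 1;"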

Actions #22

Updated by tinita 6 months ago

Here's the corresponding log from Nov 3 btw:

Nov 03 04:39:30 new-ariel openqa-gru[21433]: Connect timeout
Nov 03 04:39:30 new-ariel openqa-gru[21433]: 
Nov 03 04:39:30 new-ariel openqa-gru[21432]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:39:30 new-ariel openqa-gru[21403]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:39:41 new-ariel openqa-gru[21470]: Connect timeout
Nov 03 04:39:41 new-ariel openqa-gru[21470]: 
Nov 03 04:39:41 new-ariel openqa-gru[21469]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:39:41 new-ariel openqa-gru[21440]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:39:51 new-ariel openqa-gru[21522]: Connect timeout
Nov 03 04:39:51 new-ariel openqa-gru[21522]: 
Nov 03 04:39:51 new-ariel openqa-gru[21521]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:39:51 new-ariel openqa-gru[21492]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:40:16 new-ariel openqa-gru[22123]: Connect timeout
Nov 03 04:41:10 new-ariel openqa-gru[22123]: 
Nov 03 04:41:10 new-ariel openqa-gru[23206]: Connect timeout
Nov 03 04:41:10 new-ariel openqa-gru[23206]: 
Nov 03 04:41:10 new-ariel openqa-gru[23205]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:10 new-ariel openqa-gru[23152]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:10 new-ariel openqa-gru[23222]: Connect timeout
Nov 03 04:41:10 new-ariel openqa-gru[23222]: 
Nov 03 04:41:10 new-ariel openqa-gru[23221]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:10 new-ariel openqa-gru[23140]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:10 new-ariel openqa-gru[23232]: Connect timeout
Nov 03 04:41:11 new-ariel openqa-gru[23232]: 
Nov 03 04:41:11 new-ariel openqa-gru[23231]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:11 new-ariel openqa-gru[23136]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:34 new-ariel openqa-gru[23871]: Connect timeout
Nov 03 04:41:34 new-ariel openqa-gru[23871]: 
Nov 03 04:41:34 new-ariel openqa-gru[23870]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:34 new-ariel openqa-gru[23841]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:34 new-ariel openqa-gru[23872]: Connect timeout
Nov 03 04:41:34 new-ariel openqa-gru[23872]: 
Nov 03 04:41:34 new-ariel openqa-gru[23873]: Connect timeout
Nov 03 04:41:51 new-ariel openqa-gru[23873]: 
Nov 03 04:41:51 new-ariel openqa-gru[23940]: Connect timeout
Nov 03 04:41:51 new-ariel openqa-gru[23940]: 
Nov 03 04:41:51 new-ariel openqa-gru[23938]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:41:51 new-ariel openqa-gru[23539]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:41:54 new-ariel openqa-gru[23956]: Connect timeout
Nov 03 04:41:54 new-ariel openqa-gru[23956]: 
Nov 03 04:41:54 new-ariel openqa-gru[23955]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:41:54 new-ariel openqa-gru[23584]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:41:58 new-ariel openqa-gru[23998]: Connect timeout
Nov 03 04:41:58 new-ariel openqa-gru[23998]: 
Nov 03 04:41:58 new-ariel openqa-gru[23997]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:58 new-ariel openqa-gru[23968]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:42:11 new-ariel openqa-gru[24147]: Connect timeout
Nov 03 04:42:11 new-ariel openqa-gru[24147]: 
Nov 03 04:42:11 new-ariel openqa-gru[24145]: Connect timeout
Nov 03 04:42:11 new-ariel openqa-gru[24145]: 
Nov 03 04:42:11 new-ariel openqa-gru[24146]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:42:11 new-ariel openqa-gru[24083]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:42:11 new-ariel openqa-gru[24144]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:42:11 new-ariel openqa-gru[24097]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:42:58 new-ariel openqa-gru[24778]: Connect timeout
Nov 03 04:43:01 new-ariel openqa-gru[24778]: 
Nov 03 04:43:01 new-ariel openqa-gru[24783]: Connect timeout
Nov 03 04:43:01 new-ariel openqa-gru[24783]: 
Nov 03 04:43:01 new-ariel openqa-gru[24780]: Connect timeout
Nov 03 04:43:11 new-ariel openqa-gru[24780]: 
Nov 03 04:43:11 new-ariel openqa-gru[24839]: Connect timeout
Nov 03 04:43:11 new-ariel openqa-gru[24839]: 
Nov 03 04:43:11 new-ariel openqa-gru[24885]: Connect timeout
Nov 03 04:43:11 new-ariel openqa-gru[24885]: 
Nov 03 04:43:11 new-ariel openqa-gru[24884]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:43:11 new-ariel openqa-gru[24209]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:43:12 new-ariel openqa-gru[24925]: Connect timeout
Nov 03 04:43:12 new-ariel openqa-gru[24925]: 
Nov 03 04:43:12 new-ariel openqa-gru[24924]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:43:12 new-ariel openqa-gru[24381]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Actions #23

Updated by tinita 6 months ago

Thanks to @kraih we know that the script doesn't retry in case of a connection timeout:
https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Command.pm#L127
He also suggested we could increase MOJO_CONNECT_TIMEOUT from 10 to 30 seconds here.
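
Since Mojolicious reads MOJO_CONNECT_TIMEOUT from the environment at runtime, a longer timeout can also be tried ad hoc before touching any code (illustrative invocation):

# one-off test with a 30 s connect timeout instead of the default 10 s
MOJO_CONNECT_TIMEOUT=30 openqa-cli api --host https://openqa.opensuse.org jobs state=running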

Will try to fix the retry code.

Actions #24

Updated by tinita 6 months ago

https://github.com/os-autoinst/openQA/pull/5359 Make openqa-cli retry also on connection errors

Actions #25

Updated by tinita 6 months ago

Actions #26

Updated by openqa_review 6 months ago

  • Due date set to 2023-11-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #27

Updated by tinita 6 months ago

https://github.com/os-autoinst/openQA/pull/5363 Increase MOJO_CONNECT_TIMEOUT in openqa-cli
https://github.com/os-autoinst/scripts/pull/271 Retry openqa-cli requests

Actions #28

Updated by tinita 6 months ago

  • Status changed from In Progress to Feedback
Actions #29

Updated by okurz 6 months ago

  • Status changed from Feedback to In Progress

Both PRs have been merged for a while now. What's the next step?

Actions #30

Updated by tinita 6 months ago

  • Status changed from In Progress to Resolved

Next step, as mentioned in comment #20: checking how the current limit of 170 is working out, i.e. that we don't get alert emails, like I said yesterday.
I checked the munin dashboard and the journal; it has looked fine for the last few days.
I'll resolve this.
