action #138545

Munin - minion hook failed - opensuse.org :: openqa.opensuse.org size:S

Added by livdywan 7 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Assignee: tinita
Category: -
Target version: -
Start date:
Due date: 2023-11-28
% Done: 0%
Estimated time:
Tags:

Description

Observation

Email from 2023-10-24 21:45 CEST

opensuse.org :: openqa.opensuse.org :: hook failed
   CRITICALs: rc_failed_per_5min is 17.00 (outside range [:10]).

Apparently the journal contains these messages (sudo journalctl -u openqa-gru):

Oct 24 19:37:51 new-ariel openqa-gru[13130]: Connect timeout
Oct 24 19:37:51 new-ariel openqa-gru[13130]: 
Oct 24 19:37:51 new-ariel openqa-gru[13128]: Connect timeout
Oct 24 19:37:51 new-ariel openqa-gru[13128]: 
Oct 24 19:37:51 new-ariel openqa-gru[13129]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Oct 24 19:37:51 new-ariel openqa-gru[13066]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Oct 24 19:37:51 new-ariel openqa-gru[13127]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Oct 24 19:37:51 new-ariel openqa-gru[13077]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117

Suggestions

  • Include the source of the alert in the email, e.g. the systemd journal for openqa-gru
  • Investigate what's causing openqa-label-known-issues to abort

Files

munin-load.png (23.3 KB) munin-load.png tinita, 2023-10-27 12:31
munin-hook-failed.png (16.9 KB) munin-hook-failed.png tinita, 2023-10-27 12:31

Related issues (1 open, 2 closed)

Related to openQA Infrastructure - action #138527: Zabbix agent on ariel.dmz-prg2.suse.org reported no data for 30m and there is nothing in the journal size:S (Resolved, livdywan, 2023-07-07)

Related to openQA Infrastructure - action #138551: DNS outage of 2023-10-25, e.g. Cron <root@openqa-service> (date; fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log Max retries exceeded with url size:S (Resolved, livdywan, 2023-10-23)

Related to openQA Infrastructure - action #138683: https://metrics.opensuse.org/d/osrt_openqa/osrt-openqa?orgId=1&from=now-7d&to=now should show current results … which apparently it doesn't (New, 2023-10-27)

Actions #1

Updated by livdywan 7 months ago

I'm not sure what the impact is yet, going with Urgent by default.

Actions #2

Updated by tinita 7 months ago

Since it's sporadic and only affects the investigation jobs, I don't think it's urgent. This has been happening regularly.
We could increase the retry count; however, judging from the logs, this went on for quite some time (19:37-19:46), so more retries wouldn't have helped.

Actions #3

Updated by livdywan 7 months ago

  • Subject changed from Munin - minion hook failed - opensuse.org :: openqa.opensuse.org to Munin - minion hook failed - opensuse.org :: openqa.opensuse.org size:S
  • Status changed from New to In Progress
  • Assignee set to livdywan
  • Priority changed from Urgent to High

This may also be the same as #138551 since Connect timeout hints at DNS issues, see https://github.com/mojolicious/mojo/blob/a607e42d80ba169299375e911577fdf4abe28540/lib/Mojo/IOLoop/Client.pm#L36

It would be nice if the email could clarify which commands failed, e.g. which hook scripts, but we only get a count here. Maybe it's worth documenting this in the wiki at least, as this is a moving target.

Actions #4

Updated by livdywan 7 months ago

  • Related to action #138527: Zabbix agent on ariel.dmz-prg2.suse.org reported no data for 30m and there is nothing in the journal size:S added
Actions #5

Updated by livdywan 7 months ago

  • Related to action #138551: DNS outage of 2023-10-25, e.g. Cron <root@openqa-service> (date; fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log Max retries exceeded with url size:S added
Actions #6

Updated by livdywan 7 months ago

  • Status changed from In Progress to Feedback
Actions #7

Updated by livdywan 6 months ago

More failures:

Oct 27 11:08:15 new-ariel openqa-gru[21111]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117

Maybe the web UI is overloaded? We couldn't easily find out what, if anything, is wrong with the web UI. It could be the number of jobs, or the load, but we currently have no equivalent of job queue graphs in Munin.

Actions #8

Updated by livdywan 6 months ago

  • Related to action #138683: https://metrics.opensuse.org/d/osrt_openqa/osrt-openqa?orgId=1&from=now-7d&to=now should show current results … which apparently it doesn't added

Actions #9

Updated by tinita 6 months ago

Looking at the munin graphs for hook failures and load, it could be that the high load is causing the connection timeouts.
http://localhost:8080/munin/opensuse.org/openqa.opensuse.org/openqa_minion_jobs_hook_rc_failed.html
http://localhost:8080/munin/opensuse.org/openqa.opensuse.org/load.html
I attached screenshots.

I also compiled a little statistic on the number of running jobs per hour this morning.
I used a query to calculate the total time consumed by running jobs, which gives the average number of running jobs one would have seen when looking at the https://openqa.opensuse.org/tests page:

openqa=> select count(*), sum(least(t_finished, '2023-10-27 01:00:00') - greatest(t_started, '2023-10-27 00:00:00')) as dur from jobs where t_started <= '2023-10-27 01:00:00' and t_finished >= '2023-10-27 00:00:00';
 count |   dur    
-------+----------
   215 | 86:58:37
(1 row)
# and that for every hour...
00: 86:58
01: 49:50
02: 17:02
03: 04:37
04: 09:09
05: 10:27
06: 13:44
07: 69:07
08: 69:48
09: 89:46
10: 192:02
11: 134:31

That looks like it matches the load pretty well.
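
For completeness, a sketch of how the same query could be scripted for all twelve hours instead of being run by hand (assuming local socket access to the openqa database as the postgres user; table and column names are taken from the query above):

for h in 00 01 02 03 04 05 06 07 08 09 10 11; do
  start="2023-10-27 ${h}:00:00"
  # same overlap calculation as above, clamped to the hour starting at $start
  sudo -u postgres psql -d openqa -t -A -c "
    select '${h}', count(*),
           sum(least(t_finished, timestamp '${start}' + interval '1 hour')
             - greatest(t_started, timestamp '${start}')) as dur
    from jobs
    where t_started <= timestamp '${start}' + interval '1 hour'
      and t_finished >= timestamp '${start}';"
done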

Actions #10

Updated by livdywan 6 months ago

Based on @tinita's numbers I decided to add the following to /etc/openqa/openqa.ini:

# Specify how many jobs in total can run per webui instance. Defaults to -1 (no limit)
# See https://progress.opensuse.org/issues/134927 for value selection
max_running_jobs = 120

I confirmed that "120 jobs are running (limited by server config)" is shown after scheduling a bunch of jobs, and the web UI so far continues to be very responsive.
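
For reference, a quick way to check the number of currently running jobs against that limit (a sketch; the jq filter assumes the usual /api/v1/jobs response shape):

# count jobs in state "running" on o3
openqa-cli api --host https://openqa.opensuse.org jobs state=running | jq '.jobs | length'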

Actions #11

Updated by livdywan 6 months ago

livdywan wrote in #note-10:

I confirmed that "120 jobs are running (limited by server config)" is shown after scheduling a bunch of jobs, and the web UI so far continues to be very responsive.

Let's try 135 and see how the load looks.

Actions #12

Updated by livdywan 6 months ago

2023-10-31T22:55+01
opensuse.org :: openqa.opensuse.org :: hook failed
    OKs: rc_failed_per_5min is 4.00.
2023-10-31T22:45+01
opensuse.org :: openqa.opensuse.org :: hook failed
    WARNINGs: rc_failed_per_5min is 7.00 (outside range [:5]).

Unfortunately it would seem we're once again seeing failures.

I'm trying to confirm whether this correlates with a too-high load, but for now I'm stuck guessing how to execute postgres correctly - sudo -u postgres postgres -D /var/lib/pgsql/ doesn't work and I'm not sure which folder it wants.

Edit: Apparently this is unrelated.
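
For the record: the postgres server is already running, so there is no need to start the binary by hand. Assuming the default openSUSE packaging, an interactive prompt on the openqa database can be opened with:

sudo -u postgres psql -d openqa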

Actions #13

Updated by livdywan 6 months ago

Let's try 135 and see how the load looks.

Let's try 175.

Actions #14

Updated by livdywan 6 months ago

  • Priority changed from High to Normal
opensuse.org :: openqa.opensuse.org :: hook failed - see openqa-gru service logs for details
    CRITICALs: rc_failed_per_5min is 11.00 (outside range [:10]).

on Fri, 03 Nov 2023 04:45:09 +0000

Nov 03 04:42:58 new-ariel openqa-gru[24778]: Connect timeout
Nov 03 04:43:01 new-ariel openqa-gru[24778]:
Nov 03 04:43:01 new-ariel openqa-gru[24783]: Connect timeout
Nov 03 04:43:01 new-ariel openqa-gru[24783]:

Apparently that was too much. Going back to 135.

Actions #15

Updated by okurz 6 months ago

135 is rather low. How about retries and longer timeouts in whatever is causing the "Connect timeout"?

Actions #16

Updated by livdywan 6 months ago

okurz wrote in #note-15:

135 is rather low. How about retries and longer timeouts in whatever is causing the "Connect timeout"?

Wouldn't we further increase the load that way? I would rather increase or implement a delay, given the choice. Even that would likely contribute to an already high load, though.

My intention here is to wrap up this ticket with a reasonable limit, based on statistical observation and careful testing of different limits. The real fix will come in other tickets that address the lack of responsiveness.

Actions #17

Updated by okurz 6 months ago

livdywan wrote in #note-16:

okurz wrote in #note-15:

135 is rather low. How about retries and longer timeouts in whatever is causing the "Connect timeout"?

Wouldn't we further increase the load that way? I would rather increase or implement a delay, given the choice. Even that would likely contribute to an already high load, though.

I understood that gru jobs time out, whereas a higher load likely comes from too many concurrent openQA jobs trying to upload, doesn't it? Anyway, I don't mind adding delays on busy responses.

My intention here is to wrap up this ticket with a reasonable limit, based on statistical observation and careful testing of different limits. The real fix will come in other tickets that address the lack of responsiveness.

As often stated, I am fine with the work being followed up in another ticket if you can ensure that the other ticket has sufficient information to get the full picture. Often that can be harder than just continuing while we are on the topic already.

Actions #18

Updated by tinita 6 months ago

How many jobs were actually running at that time? Since it was a very short timeframe, one could also just look at one specific point in time.

Actions #19

Updated by tinita 6 months ago

  • Assignee changed from livdywan to tinita
Actions #20

Updated by tinita 6 months ago

I set the limit to 170. Will see if it happens again and collect data.
I think we can increase OPENQA_CLI_RETRY_SLEEP_TIME_S from 3 seconds to something higher.
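
A sketch of how that could be wired up, assuming the hook scripts inherit their environment from the openqa-gru service (the drop-in path and the value 10 are illustrative, not decided yet):

# created via: sudo systemctl edit openqa-gru
# /etc/systemd/system/openqa-gru.service.d/override.conf
[Service]
Environment="OPENQA_CLI_RETRY_SLEEP_TIME_S=10"

followed by sudo systemctl restart openqa-gru.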

Actions #21

Updated by tinita 6 months ago

  • Status changed from Feedback to In Progress

If I remember correctly I had a short look into the data when the timeouts happened last time.
Unfortunately the data in minion_jobs is now gone, but I think the problem happened within a timeframe of 5 minutes, and almost all minion hook scripts were failing in that timeframe, so increasing the delay by just a few seconds wouldn't have helped much.
This is the SQL that I have in the history:

select id, cast(args->1 as int) as job_id, task, started, finished, notes->'hook_rc' as hook_rc, finished-started from minion_jobs where task = 'hook_script' and  started >= '2023-11-03 04:30:00' and started <= '2023-11-03 04:45:00' order by started;

I think I remember the failing scripts took something like 15-25 seconds. With a sleep of 3 seconds and 3 retries, that would mean the connection timeout is only about 2 seconds, right? 3 * (3 s sleep + 2 s timeout) = 15 s

I could see some of the requests appear in the nginx log; funnily enough, several of them were actually getting their first request through via http, but the redirect to https then apparently resulted in a timeout. Maybe we should just access https directly, then chances would be higher? :)
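
For the record, a variant of that query which counts failed hook scripts per minute, to line up with the rc_failed_per_5min metric (a sketch; it assumes a non-zero hook_rc note marks a failure):

sudo -u postgres psql -d openqa -c "
  select date_trunc('minute', started) as minute, count(*) as failed
  from minion_jobs
  where task = 'hook_script' and notes->>'hook_rc' <> '0'
    and started between '2023-11-03 04:30:00' and '2023-11-03 04:45:00'
  group by 1 order by 1;"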

Actions #22

Updated by tinita 6 months ago

Here's the corresponding log from Nov 3 btw:

Nov 03 04:39:30 new-ariel openqa-gru[21433]: Connect timeout
Nov 03 04:39:30 new-ariel openqa-gru[21433]: 
Nov 03 04:39:30 new-ariel openqa-gru[21432]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:39:30 new-ariel openqa-gru[21403]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:39:41 new-ariel openqa-gru[21470]: Connect timeout
Nov 03 04:39:41 new-ariel openqa-gru[21470]: 
Nov 03 04:39:41 new-ariel openqa-gru[21469]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:39:41 new-ariel openqa-gru[21440]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:39:51 new-ariel openqa-gru[21522]: Connect timeout
Nov 03 04:39:51 new-ariel openqa-gru[21522]: 
Nov 03 04:39:51 new-ariel openqa-gru[21521]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:39:51 new-ariel openqa-gru[21492]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:40:16 new-ariel openqa-gru[22123]: Connect timeout
Nov 03 04:41:10 new-ariel openqa-gru[22123]: 
Nov 03 04:41:10 new-ariel openqa-gru[23206]: Connect timeout
Nov 03 04:41:10 new-ariel openqa-gru[23206]: 
Nov 03 04:41:10 new-ariel openqa-gru[23205]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:10 new-ariel openqa-gru[23152]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:10 new-ariel openqa-gru[23222]: Connect timeout
Nov 03 04:41:10 new-ariel openqa-gru[23222]: 
Nov 03 04:41:10 new-ariel openqa-gru[23221]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:10 new-ariel openqa-gru[23140]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:10 new-ariel openqa-gru[23232]: Connect timeout
Nov 03 04:41:11 new-ariel openqa-gru[23232]: 
Nov 03 04:41:11 new-ariel openqa-gru[23231]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:11 new-ariel openqa-gru[23136]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:34 new-ariel openqa-gru[23871]: Connect timeout
Nov 03 04:41:34 new-ariel openqa-gru[23871]: 
Nov 03 04:41:34 new-ariel openqa-gru[23870]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:34 new-ariel openqa-gru[23841]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:34 new-ariel openqa-gru[23872]: Connect timeout
Nov 03 04:41:34 new-ariel openqa-gru[23872]: 
Nov 03 04:41:34 new-ariel openqa-gru[23873]: Connect timeout
Nov 03 04:41:51 new-ariel openqa-gru[23873]: 
Nov 03 04:41:51 new-ariel openqa-gru[23940]: Connect timeout
Nov 03 04:41:51 new-ariel openqa-gru[23940]: 
Nov 03 04:41:51 new-ariel openqa-gru[23938]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:41:51 new-ariel openqa-gru[23539]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:41:54 new-ariel openqa-gru[23956]: Connect timeout
Nov 03 04:41:54 new-ariel openqa-gru[23956]: 
Nov 03 04:41:54 new-ariel openqa-gru[23955]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:41:54 new-ariel openqa-gru[23584]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:41:58 new-ariel openqa-gru[23998]: Connect timeout
Nov 03 04:41:58 new-ariel openqa-gru[23998]: 
Nov 03 04:41:58 new-ariel openqa-gru[23997]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:41:58 new-ariel openqa-gru[23968]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:42:11 new-ariel openqa-gru[24147]: Connect timeout
Nov 03 04:42:11 new-ariel openqa-gru[24147]: 
Nov 03 04:42:11 new-ariel openqa-gru[24145]: Connect timeout
Nov 03 04:42:11 new-ariel openqa-gru[24145]: 
Nov 03 04:42:11 new-ariel openqa-gru[24146]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:42:11 new-ariel openqa-gru[24083]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:42:11 new-ariel openqa-gru[24144]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:42:11 new-ariel openqa-gru[24097]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 117
Nov 03 04:42:58 new-ariel openqa-gru[24778]: Connect timeout
Nov 03 04:43:01 new-ariel openqa-gru[24778]: 
Nov 03 04:43:01 new-ariel openqa-gru[24783]: Connect timeout
Nov 03 04:43:01 new-ariel openqa-gru[24783]: 
Nov 03 04:43:01 new-ariel openqa-gru[24780]: Connect timeout
Nov 03 04:43:11 new-ariel openqa-gru[24780]: 
Nov 03 04:43:11 new-ariel openqa-gru[24839]: Connect timeout
Nov 03 04:43:11 new-ariel openqa-gru[24839]: 
Nov 03 04:43:11 new-ariel openqa-gru[24885]: Connect timeout
Nov 03 04:43:11 new-ariel openqa-gru[24885]: 
Nov 03 04:43:11 new-ariel openqa-gru[24884]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:43:11 new-ariel openqa-gru[24209]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:43:12 new-ariel openqa-gru[24925]: Connect timeout
Nov 03 04:43:12 new-ariel openqa-gru[24925]: 
Nov 03 04:43:12 new-ariel openqa-gru[24924]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Nov 03 04:43:12 new-ariel openqa-gru[24381]: /opt/os-autoinst-scripts/_common: ERROR: line 145
Actions #23

Updated by tinita 6 months ago

Thanks to @kraih we know that the script doesn't retry in case of a connection timeout:
https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Command.pm#L127
He also suggested we could increase MOJO_CONNECT_TIMEOUT from 10 to 30 seconds here.
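
Since Mojolicious reads MOJO_CONNECT_TIMEOUT from the environment at runtime, a longer timeout can also be tried ad hoc before touching any code (illustrative invocation):

# one-off test with a 30 s connect timeout instead of the default 10 s
MOJO_CONNECT_TIMEOUT=30 openqa-cli api --host https://openqa.opensuse.org jobs state=running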

Will try to fix the retry code.

Actions #24

Updated by tinita 6 months ago

https://github.com/os-autoinst/openQA/pull/5359 Make openqa-cli retry also on connection errors

Actions #25

Updated by tinita 6 months ago

Actions #26

Updated by openqa_review 6 months ago

  • Due date set to 2023-11-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #27

Updated by tinita 6 months ago

https://github.com/os-autoinst/openQA/pull/5363 Increase MOJO_CONNECT_TIMEOUT in openqa-cli
https://github.com/os-autoinst/scripts/pull/271 Retry openqa-cli requests

Actions #28

Updated by tinita 6 months ago

  • Status changed from In Progress to Feedback
Actions #29

Updated by okurz 6 months ago

  • Status changed from Feedback to In Progress

Both PRs have been merged for a while now. What's the next step?

Actions #30

Updated by tinita 6 months ago

  • Status changed from In Progress to Resolved

Next step, as mentioned in comment #20: checking how the current limit of 170 is working out, i.e. that we don't get alert emails, like I said yesterday.
I checked the munin dashboard and the journal; it has looked fine for the last few days.
I'll resolve this.
