Project

General

Profile

Actions

action #134693

closed

Only one or two jobs running on osd for several hours

Added by tinita over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2023-08-28
Due date:
% Done:

0%

Estimated time:

Description

Observation

On Friday evening
https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=9&from=1692884789715&to=1692956673630
and over the last half day
https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=9&from=1693136775836&to=1693201042414
(click on "Running" only)
we see the problem that often only one new job gets assigned.
On friday we thought it is related to the rabbitmq server being down and some blocking amqp operation in the scheduler was the reason.

Now that it happens again we need to investigate what's the problem.

Additional Information


Related issues 1 (0 open1 closed)

Related to openQA Project (public) - coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alertResolvedokurz2023-09-07

Actions
Actions #1

Updated by tinita over 1 year ago

  • Priority changed from High to Urgent
Actions #2

Updated by tinita over 1 year ago

  • Description updated (diff)

In /var/log/openqa_scheduler we see messages like this during the period where only 1 job was assigned every 10 minutes:

[2023-08-28T04:46:24.396792+02:00] [warn] [pid:1635] Failed to send data to websocket server, reason: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.

[2023-08-28T04:46:24.397559+02:00] [warn] [pid:1635] Failed sending job(s) '11925043' to worker '2243': unknown error
[2023-08-28T04:46:24.417820+02:00] [debug] [pid:1635] Job 11925043 reset to state scheduled
[2023-08-28T04:46:24.453093+02:00] [debug] [pid:1635] Assigned job '11923322' to worker ID '2227'
[2023-08-28T04:46:24.457890+02:00] [debug] [pid:1635] [Job#11923322] Prepare for being processed by worker 2227
[2023-08-28T04:56:24.605208+02:00] [warn] [pid:1635] Failed to send data to websocket server, reason: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.

[2023-08-28T04:56:24.605571+02:00] [warn] [pid:1635] Failed sending job(s) '11923322' to worker '2227': unknown error
[2023-08-28T04:56:24.625365+02:00] [debug] [pid:1635] Job 11923322 reset to state scheduled
[2023-08-28T04:56:24.649886+02:00] [debug] [pid:1635] Assigned job '11923540' to worker ID '1054'
[2023-08-28T04:56:24.653866+02:00] [debug] [pid:1635] [Job#11923540] Prepare for being processed by worker 1054
[2023-08-28T05:06:24.796411+02:00] [warn] [pid:1635] Failed to send data to websocket server, reason: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.

[2023-08-28T05:06:24.796993+02:00] [warn] [pid:1635] Failed sending job(s) '11923540' to worker '1054': unknown error
[2023-08-28T05:06:24.820843+02:00] [debug] [pid:1635] Job 11923540 reset to state scheduled
[2023-08-28T05:06:24.849356+02:00] [debug] [pid:1635] Assigned job '11922879' to worker ID '886'
[2023-08-28T05:06:24.853225+02:00] [debug] [pid:1635] [Job#11922879] Prepare for being processed by worker 886
[2023-08-28T05:16:25.007610+02:00] [warn] [pid:1635] Failed to send data to websocket server, reason: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.

[2023-08-28T05:16:25.007856+02:00] [warn] [pid:1635] Failed sending job(s) '11922879' to worker '886': unknown error
[2023-08-28T05:16:25.025099+02:00] [debug] [pid:1635] Job 11922879 reset to state scheduled
[2023-08-28T05:16:25.051353+02:00] [debug] [pid:1635] Assigned job '11925044' to worker ID '2035'
[2023-08-28T05:16:25.055186+02:00] [debug] [pid:1635] [Job#11925044] Prepare for being processed by worker 2035
[2023-08-28T05:26:25.207289+02:00] [warn] [pid:1635] Failed to send data to websocket server, reason: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.
Actions #3

Updated by kraih over 1 year ago

  • Assignee set to kraih

These timeout errors appear to be caused by HTTP requests from the scheduler to the websocket service. So, purely localhost communication on the webui machine.

[2023-08-28T04:56:24.605208+02:00] [warn] [pid:1635] Failed to send data to websocket server, reason: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.

Looking at the openqa-websockets journal the service does not appear blocked between 2023-08-28T04:46:24 and 2023-08-28T04:56:24:

Aug 28 04:48:09 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2082 from worker_status (free)
Aug 28 04:48:09 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2088 from worker_status (free)
Aug 28 04:48:09 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2097 from worker_status (free)
Aug 28 04:48:09 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2475 from worker_status (free)
Aug 28 04:48:09 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2227 from worker_status (free)
Aug 28 04:48:09 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Rescheduling jobs assigned to worker 2227
Aug 28 04:48:09 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Job 11923322 reset to state scheduled
Aug 28 04:48:09 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2240 from worker_status (free)
Aug 28 04:48:09 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2080 from worker_status (free)
Aug 28 04:48:10 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2303 from worker_status (free)

I would have expected the websocket server not to do anything during this time, otherwise a timeout is very hard to explain.

Actions #4

Updated by kraih over 1 year ago

In the openqa-websockets log this appears to be the moment when things started working again at Aug 28 07:39:14:

Aug 28 07:39:10 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 912 from worker_status (free)
Aug 28 07:39:10 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2094 from worker_status (free)
Aug 28 07:39:10 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Rescheduling jobs assigned to worker 2094
Aug 28 07:39:10 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Job 11922888 reset to state scheduled
Aug 28 07:39:11 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2424 from worker_status (free)
Aug 28 07:39:11 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2075 from worker_status (free)
Aug 28 07:39:11 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 1293 from worker_status (free)
Aug 28 07:39:11 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 1043 from worker_status (free)
Aug 28 07:39:12 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2374 from worker_status (free)
Aug 28 07:39:12 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2352 from worker_status (free)
Aug 28 07:39:12 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2399 from worker_status (free)
Aug 28 07:39:12 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2018 from worker_status (free)
Aug 28 07:39:13 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2485 from worker_status (free)
Aug 28 07:39:13 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2297 from worker_status (free)
Aug 28 07:39:13 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 1063 from worker_status (free)
Aug 28 07:39:13 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 2168 from worker_status (working)
Aug 28 07:39:13 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 1207 from worker_status (free)
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [info] [pid:1636] Worker 2168 websocket connection closed - 1006
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: Use of uninitialized value $address in string eq at /usr/share/openqa/script/../lib/OpenQA/Shared/Plugin/SharedHelpers.pm line 70.
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: Use of uninitialized value $address in string eq at /usr/share/openqa/script/../lib/OpenQA/Shared/Plugin/SharedHelpers.pm line 70.
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Updating seen of worker 530 from worker_status (free)
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Started to send message to 2269 for job(s) 11922873
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [info] [pid:1636] Worker 2223 websocket connection closed - 1006
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: Use of uninitialized value $address in string eq at /usr/share/openqa/script/../lib/OpenQA/Shared/Plugin/SharedHelpers.pm line 70.
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: Use of uninitialized value $address in string eq at /usr/share/openqa/script/../lib/OpenQA/Shared/Plugin/SharedHelpers.pm line 70.
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [info] [pid:1636] Worker 2145 websocket connection closed - 1006
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Started to send message to 2107 for job(s) 11919044
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: Use of uninitialized value $address in string eq at /usr/share/openqa/script/../lib/OpenQA/Shared/Plugin/SharedHelpers.pm line 70.
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: Use of uninitialized value $address in string eq at /usr/share/openqa/script/../lib/OpenQA/Shared/Plugin/SharedHelpers.pm line 70.
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [info] [pid:1636] Worker 2269 accepted job 11922873 which was never assigned to it or has already finished
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [info] [pid:1636] Worker 2338 websocket connection closed - 1006
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [info] [pid:1636] Worker 2107 accepted job 11919044 which was never assigned to it or has already finished
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [info] [pid:1636] Worker 2112 websocket connection closed - 1006
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [info] [pid:1636] Worker 2106 websocket connection closed - 1006
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Unable to send command "cancel" to worker 2168: worker not connected
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: Use of uninitialized value $address in string eq at /usr/share/openqa/script/../lib/OpenQA/Shared/Plugin/SharedHelpers.pm line 70.
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: Use of uninitialized value $address in string eq at /usr/share/openqa/script/../lib/OpenQA/Shared/Plugin/SharedHelpers.pm line 70.
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: Use of uninitialized value $address in string eq at /usr/share/openqa/script/../lib/OpenQA/Shared/Plugin/SharedHelpers.pm line 70.
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: Use of uninitialized value $address in string eq at /usr/share/openqa/script/../lib/OpenQA/Shared/Plugin/SharedHelpers.pm line 70.
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Started to send message to 560 for job(s) 11922870
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Started to send message to 1956 for job(s) 11923310
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Started to send message to 1288 for job(s) 11924545
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Started to send message to 2315 for job(s) 11923340
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Started to send message to 530 for job(s) 11922881

Specifically the Started to send message to XXXX for job(s) XXXXXX messages are very important, since those indicate that a job has actually made it all the way from the scheduler to the worker via websocket. No such message was visible in the logs before this point for multiple hours.

These messages also seemed a bit odd:

Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Started to send message to 2269 for job(s) 11922873
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [debug] [pid:1636] Started to send message to 2107 for job(s) 11919044
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [info] [pid:1636] Worker 2269 accepted job 11922873 which was never assigned to it or has already finished
Aug 28 07:39:14 openqa openqa-websockets-daemon[1636]: [info] [pid:1636] Worker 2107 accepted job 11919044 which was never assigned to it or has already finished

The only place from where the Started to send message... messages could originate is here, which is immediately preceded by the transaction that assigns the job to the working in the database. But both jobs had the result cancelled based on job settings, so this flow of events seems plausible and probably harmless. The fact that the timeout triggered the very moment things started working again might be an interesting datapoint however.

On openqaworker18, one of the workers the jobs were assigned to, i found this in the logs (sudo journalctl --since="2023-08-28 04:00:23" -u openqa-worker-auto-restart@15):

Aug 28 04:02:12 openqaworker18 worker[28711]: [warn] [pid:28711] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/2312 - trying again in 10 seconds
Aug 28 04:02:22 openqaworker18 worker[28711]: [info] [pid:28711] Registering with openQA openqa.suse.de
Aug 28 04:02:22 openqaworker18 worker[28711]: [info] [pid:28711] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/2312
Aug 28 04:12:22 openqaworker18 worker[28711]: [warn] [pid:28711] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/2312 - trying again in 10 seconds
Aug 28 04:12:32 openqaworker18 worker[28711]: [info] [pid:28711] Registering with openQA openqa.suse.de
Aug 28 04:12:32 openqaworker18 worker[28711]: [info] [pid:28711] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/2312
Aug 28 04:22:32 openqaworker18 worker[28711]: [warn] [pid:28711] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/2312 - trying again in 10 seconds
Aug 28 04:22:42 openqaworker18 worker[28711]: [info] [pid:28711] Registering with openQA openqa.suse.de
...
Aug 28 07:05:28 openqaworker18 worker[28711]: [info] [pid:28711] Registering with openQA openqa.suse.de
Aug 28 07:05:28 openqaworker18 worker[28711]: [info] [pid:28711] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/2312
Aug 28 07:15:29 openqaworker18 worker[28711]: [warn] [pid:28711] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/2312 - trying again in 10 seconds
Aug 28 07:15:39 openqaworker18 worker[28711]: [info] [pid:28711] Registering with openQA openqa.suse.de
Aug 28 07:15:39 openqaworker18 worker[28711]: [info] [pid:28711] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/2312
Aug 28 07:25:39 openqaworker18 worker[28711]: [warn] [pid:28711] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/2312 - trying again in 10 seconds
Aug 28 07:25:49 openqaworker18 worker[28711]: [info] [pid:28711] Registering with openQA openqa.suse.de
Aug 28 07:25:49 openqaworker18 worker[28711]: [info] [pid:28711] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/2312
Aug 28 07:35:49 openqaworker18 worker[28711]: [warn] [pid:28711] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/2312 - trying again in 10 seconds
Aug 28 07:35:59 openqaworker18 worker[28711]: [info] [pid:28711] Registering with openQA openqa.suse.de
Aug 28 07:35:59 openqaworker18 worker[28711]: [info] [pid:28711] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/2312
Aug 28 07:40:35 openqaworker18 worker[28711]: [info] [pid:28711] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 2312
Aug 28 07:45:24 openqaworker18 worker[28711]: [warn] [pid:28711] Websocket connection to http://openqa.suse.de/api/v1/ws/2312 finished by remote side with code 1008, only one connection per worker allowed, finishing old >
Aug 28 07:45:34 openqaworker18 worker[28711]: [info] [pid:28711] Registering with openQA openqa.suse.de
Aug 28 07:45:34 openqaworker18 worker[28711]: [info] [pid:28711] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/2312
Aug 28 07:55:34 openqaworker18 worker[28711]: [warn] [pid:28711] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/2312 - trying again in 10 seconds
Aug 28 07:55:44 openqaworker18 worker[28711]: [info] [pid:28711] Registering with openQA openqa.suse.de
Aug 28 07:55:44 openqaworker18 worker[28711]: [info] [pid:28711] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/2312
Aug 28 07:59:50 openqaworker18 worker[28711]: [info] [pid:28711] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 2312
Aug 28 08:00:16 openqaworker18 worker[28711]: [debug] [pid:28711] Accepting job 11923723 from openqa.suse.de.
Aug 28 08:00:16 openqaworker18 worker[28711]: [debug] [pid:28711] Setting job 11923723 from openqa.suse.de up
Aug 28 08:00:16 openqaworker18 worker[28711]: [debug] [pid:28711] Preparing Mojo::IOLoop::ReadWriteProcess::Session

So it would seem there were actual network problems too. Other worker instances on the same host had no such problems however, but were also not assigned any jobs.

In the Apache logs there were some odd 406 responses for /api/v1/ws/2312:

10.100.96.76 - - [28/Aug/2023:04:32:52 +0200] "GET /api/v1/ws/2312 HTTP/1.1" 406 - "-" "Mojolicious (Perl)" 429970
10.100.96.76 - - [28/Aug/2023:04:43:02 +0200] "GET /api/v1/ws/2312 HTTP/1.1" 406 - "-" "Mojolicious (Perl)" 430028
10.100.96.76 - - [28/Aug/2023:04:53:13 +0200] "GET /api/v1/ws/2312 HTTP/1.1" 406 - "-" "Mojolicious (Perl)" 430187
10.100.96.76 - - [28/Aug/2023:05:03:23 +0200] "GET /api/v1/ws/2312 HTTP/1.1" 406 - "-" "Mojolicious (Perl)" 430216
10.100.96.76 - - [28/Aug/2023:05:13:36 +0200] "GET /api/v1/ws/2312 HTTP/1.1" 406 - "-" "Mojolicious (Perl)" 431307

Neither Mojolicious or openQA use 406 for anything. The 10 minute intervals match those from the direct scheduler -> websocket server communication very suspiciously.

Actions #5

Updated by kraih over 1 year ago

  • Status changed from New to Feedback

I've been unable to identify a specific root cause and am running out of things to check. So will have to keep an eye on this and hope for new data to pop up.

Actions #6

Updated by kraih over 1 year ago

With the OSD move to Prague it is probably not possible to investigate this further. Hopefully the problem does not return there.

Actions #7

Updated by kraih over 1 year ago

One early theory we had was that AMQP connection problems were causing hangups in the scheduler and/or websockets service. And the timing would match the data we see in the qem-dashboard AMQP watcher, which listens to all OSD job events.

Aug 24 18:55:27 qam2 dashboard[20098]: [20098] [i] 11912404: passed (softfailed)
Aug 25 01:04:17 qam2 systemd[1]: dashboard-amqp-watcher.service: Service reached runtime time limit. Stopping.
Aug 25 01:04:17 qam2 systemd[1]: dashboard-amqp-watcher.service: Failed with result 'timeout'.
Aug 25 01:04:18 qam2 systemd[1]: dashboard-amqp-watcher.service: Scheduled restart job, restart counter is at 1.
Aug 25 01:04:18 qam2 systemd[1]: Stopped QEM Dashboard message bus watcher.
Aug 25 01:04:18 qam2 systemd[1]: Started QEM Dashboard message bus watcher.
Aug 25 01:04:18 qam2 dashboard[20868]: [20868] [i] RabbitMQ watcher started
Aug 26 01:04:18 qam2 systemd[1]: dashboard-amqp-watcher.service: Service reached runtime time limit. Stopping.
Aug 26 01:04:18 qam2 systemd[1]: dashboard-amqp-watcher.service: Failed with result 'timeout'.
Aug 26 01:04:18 qam2 systemd[1]: Stopped QEM Dashboard message bus watcher.
Aug 26 01:04:18 qam2 systemd[1]: Started QEM Dashboard message bus watcher.
Aug 26 01:04:19 qam2 dashboard[31934]: [31934] [i] RabbitMQ watcher started
Aug 27 01:04:18 qam2 systemd[1]: Stopping QEM Dashboard message bus watcher...
Aug 27 01:04:18 qam2 systemd[1]: dashboard-amqp-watcher.service: Deactivated successfully.
Aug 27 01:04:18 qam2 systemd[1]: Stopped QEM Dashboard message bus watcher.
Aug 27 01:04:18 qam2 systemd[1]: Started QEM Dashboard message bus watcher.
Aug 27 01:04:18 qam2 dashboard[15000]: [15000] [i] RabbitMQ watcher started
Aug 28 01:04:16 qam2 systemd[1]: Stopping QEM Dashboard message bus watcher...
Aug 28 01:04:16 qam2 systemd[1]: dashboard-amqp-watcher.service: Deactivated successfully.
Aug 28 01:04:16 qam2 systemd[1]: Stopped QEM Dashboard message bus watcher.
Aug 28 01:04:16 qam2 systemd[1]: Started QEM Dashboard message bus watcher.
Aug 28 01:04:16 qam2 dashboard[23672]: [23672] [i] RabbitMQ watcher started
Aug 28 16:09:37 qam2 dashboard[23672]: [23672] [i] 11923513: passed (softfailed)

So there was a large gap in data from Aug 24 18:55:27 to Aug 28 16:09:37 where rabbit.suse.de might have caused all sorts of problems for our AMQP client.

Actions #8

Updated by tinita over 1 year ago

For the record, the webui had those errors in the journal:

Aug 29 12:01:11 openqa openqa[4353]: [error] [pid:4353] Publishing suse.openqa.job.create failed: Connect timeout (6 attempts left)
Aug 29 12:01:14 openqa openqa[4368]: [error] [pid:4368] Publishing suse.openqa.job.cancel failed: Connect timeout (3 attempts left)
Actions #9

Updated by kraih over 1 year ago

  • Priority changed from Urgent to Normal

Reducing priority since it is no longer a recurring problem.

Actions #10

Updated by kraih over 1 year ago

tinita wrote in #note-8:

For the record, the webui had those errors in the journal:

Aug 29 12:01:11 openqa openqa[4353]: [error] [pid:4353] Publishing suse.openqa.job.create failed: Connect timeout (6 attempts left)
Aug 29 12:01:14 openqa openqa[4368]: [error] [pid:4368] Publishing suse.openqa.job.cancel failed: Connect timeout (3 attempts left)

This is probably worth checking out in the Mojo::RabbitMQ::Client module, to see if the connect attempt would block the process while waiting for the timeout.

Actions #11

Updated by kraih over 1 year ago

This is probably worth checking out in the Mojo::RabbitMQ::Client module, to see if the connect attempt would block the process while waiting for the timeout.

I did just that and the results are interesting, because Mojo::RabbitMQ::Client simply reuses the standard Mojolicious socket abstraction. And that means that the socket connect itself is non-blocking with a timeout, and therefore wouldn't block the process. However, the DNS lookup before the socket connect is blocking, since openQA does not use the optional non-blocking resolver for Mojolicious (Net::DNS::Native). So if the system resolver had any kind of networking issue, that could have blocked the whole process for quite some time. And it's noteworthy that we create a fresh connection to RabbitMQ for every single message we send, there is no caching at all.

It might be worth looking into getting Net::DNS::Native packaged. We might benefit in other parts of openQA from this as well.

Actions #12

Updated by kraih over 1 year ago

  • Status changed from Feedback to Resolved
Actions #13

Updated by tinita over 1 year ago

Build succeeds https://build.opensuse.org/package/show/devel:languages:perl:CPAN-N/perl-Net-DNS-Native so it should be easy to get in Tumbleweed

Actions #14

Updated by tinita over 1 year ago

I just created https://build.opensuse.org/request/show/1108495 devel:languages:perl / perl-Net-DNS-Native

Actions #15

Updated by okurz over 1 year ago

  • Related to coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert added
Actions #16

Updated by okurz over 1 year ago

  • Status changed from Resolved to Feedback

#135122 looks like the very same issue still.

Actions #17

Updated by livdywan over 1 year ago

  • Status changed from Feedback to Resolved

We have no strace data here so this ticket isn't very useful to leave open.

Actions

Also available in: Atom PDF