action #107746

closed

coordination #103962: [saga][epic] Easy multi-machine handling: MM-tests as first-class citizens

Some directly chained jobs are skipped by openQA

Added by tonyyuan almost 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2022-03-01
Due date:
% Done:

0%

Estimated time:

Description

Our test consists of ten chained jobs: one parent and nine children. Sometimes the last several child jobs are cancelled/skipped by openQA. It seems to be a websocket problem. Below is the log of http://void.qam.suse.cz/tests/37703:

2022-03-01T11:01:25.603001+01:00 void oqaqambot[14875]: INFO: Project 'SUSE:Maintenance:22978' without jobs done - no worth to comment
2022-03-01T11:01:25.603026+01:00 void oqaqambot[14875]: INFO: Project 'SUSE:Maintenance:23083' without jobs done - no worth to comment
2022-03-01T11:01:25.603050+01:00 void oqaqambot[14875]: INFO: Project 'SUSE:Maintenance:23020' without jobs done - no worth to comment
2022-03-01T11:01:25.648117+01:00 void systemd[1]: openqabot-incident.service: Succeeded.
2022-03-01T11:01:25.648232+01:00 void systemd[1]: Finished Shedule incidents in qam openQA.
2022-03-01T11:01:27.036237+01:00 void openqa-websockets-daemon[4935]: [info] Worker 2 websocket connection closed - 1006
2022-03-01T11:01:27.036551+01:00 void worker[22430]: [warn] Websocket connection to http://void.qam.suse.cz/api/v1/ws/2 finished by remote side with code 1006, no reason - trying again in 10 seconds
2022-03-01T11:01:28.673888+01:00 void worker[22430]: [warn] Websocket connection to http://quasar.suse.cz/api/v1/ws/40 finished by remote side with code 1006, no reason - trying again in 10 seconds
2022-03-01T11:01:29.783464+01:00 void worker[30530]: [info] Registering with openQA http://void.qam.suse.cz
2022-03-01T11:01:29.889641+01:00 void worker[30530]: [info] Establishing ws connection via ws://void.qam.suse.cz/api/v1/ws/16
2022-03-01T11:01:29.894146+01:00 void worker[30530]: [info] Registered and connected via websockets with openQA host http://void.qam.suse.cz and worker ID 16
2022-03-01T11:01:33.834397+01:00 void openqa-websockets-daemon[4935]: [info] Worker 6 websocket connection closed - 1006
2022-03-01T11:01:33.835030+01:00 void worker[22464]: [warn] Websocket connection to http://void.qam.suse.cz/api/v1/ws/6 finished by remote side with code 1006, no reason - trying again in 10 seconds
2022-03-01T11:01:37.037559+01:00 void worker[22430]: [info] Registering with openQA http://void.qam.suse.cz
2022-03-01T11:01:37.072457+01:00 void worker[22430]: [info] Establishing ws connection via ws://void.qam.suse.cz/api/v1/ws/2
2022-03-01T11:01:37.076563+01:00 void worker[22430]: [info] Registered and connected via websockets with openQA host http://void.qam.suse.cz and worker ID 2
2022-03-01T11:01:38.675876+01:00 void worker[22430]: [info] Registering with openQA http://quasar.suse.cz
2022-03-01T11:01:38.720954+01:00 void worker[22430]: [info] Establishing ws connection via ws://quasar.suse.cz/api/v1/ws/40
2022-03-01T11:01:38.725879+01:00 void worker[22430]: [info] Registered and connected via websockets with openQA host http://quasar.suse.cz and worker ID 40
2022-03-01T11:01:43.835993+01:00 void worker[22464]: [info] Registering with openQA http://void.qam.suse.cz
2022-03-01T11:01:43.939580+01:00 void worker[22464]: [info] Establishing ws connection via ws://void.qam.suse.cz/api/v1/ws/6
2022-03-01T11:01:43.943940+01:00 void worker[22464]: [info] Registered and connected via websockets with openQA host http://void.qam.suse.cz and worker ID 6
2022-03-01T11:02:05.663842+01:00 void openqa-websockets-daemon[4935]: [info] Worker 14 websocket connection closed - 1006
2022-03-01T11:02:05.664624+01:00 void worker[30543]: [warn] Websocket connection to http://void.qam.suse.cz/api/v1/ws/14 finished by remote side with code 1006, no reason - trying again in 10 seconds
2022-03-01T11:02:15.664907+01:00 void worker[30543]: [info] Registering with openQA http://void.qam.suse.cz
2022-03-01T11:02:15.693299+01:00 void worker[30543]: [info] Establishing ws connection via ws://void.qam.suse.cz/api/v1/ws/14
2022-03-01T11:02:15.697032+01:00 void worker[30543]: [info] Registered and connected via websockets with openQA host http://void.qam.suse.cz and worker ID 14
2022-03-01T11:02:29.965741+01:00 void openqa-websockets-daemon[4935]: [info] Worker 16 websocket connection closed - 1006
2022-03-01T11:02:29.966036+01:00 void worker[30530]: [warn] Websocket connection to http://void.qam.suse.cz/api/v1/ws/16 finished by remote side with code 1006, no reason - trying again in 10 seconds
2022-03-01T11:02:32.609299+01:00 void worker[30530]: [info] Isotovideo exit status: 0
2022-03-01T11:02:32.695720+01:00 void worker[30530]: [info] +++ worker notes +++
2022-03-01T11:02:32.695818+01:00 void worker[30530]: [info] End time: 2022-03-01 10:02:32
2022-03-01T11:02:32.695860+01:00 void worker[30530]: [info] Result: done
2022-03-01T11:02:32.715730+01:00 void worker[15116]: [info] Uploading vars.json
2022-03-01T11:02:32.717124+01:00 void worker[15116]: [error] REST-API error (POST http://void.qam.suse.cz/api/v1/jobs/37703/status): Connection error: Premature connection close (remaining tries: 59)
2022-03-01T11:02:32.730554+01:00 void worker[15116]: [info] Uploading autoinst-log.txt
2022-03-01T11:02:32.752175+01:00 void worker[15116]: [info] Uploading worker-log.txt
2022-03-01T11:02:32.763516+01:00 void worker[15116]: [info] Uploading serial0.txt
2022-03-01T11:02:32.776117+01:00 void worker[15116]: [info] Uploading video_time.vtt
2022-03-01T11:02:32.789181+01:00 void worker[15116]: [info] Uploading serial_terminal.txt
2022-03-01T11:02:33.344372+01:00 void worker[30530]: [info] Accepting job 37695 from queue
2022-03-01T11:02:33.344467+01:00 void worker[30530]: [error] Unable to accept job 37695 because the websocket connection to http://void.qam.suse.cz has been lost.
2022-03-01T11:02:33.344538+01:00 void worker[30530]: [info] Skipping job 37696 from queue (parent failed with result api-failure)
2022-03-01T11:02:33.393739+01:00 void worker[30530]: [info] Skipping job 37697 from queue (parent failed with result skipped)
2022-03-01T11:02:33.443233+01:00 void worker[30530]: [info] Skipping job 37698 from queue (parent failed with result skipped)
2022-03-01T11:02:33.485169+01:00 void worker[30530]: [info] Skipping job 37699 from queue (parent failed with result skipped)
2022-03-01T11:02:37.145947+01:00 void openqa-websockets-daemon[4935]: [info] Worker 2 websocket connection closed - 1006
2022-03-01T11:02:37.146319+01:00 void worker[22430]: [warn] Websocket connection to http://void.qam.suse.cz/api/v1/ws/2 finished by remote side with code 1006, no reason - trying again in 10 seconds
2022-03-01T11:02:38.776193+01:00 void worker[22430]: [warn] Websocket connection to http://quasar.suse.cz/api/v1/ws/40 finished by remote side with code 1006, no reason - trying again in 10 seconds
2022-03-01T11:02:39.967168+01:00 void worker[30530]: [info] Registering with openQA http://void.qam.suse.cz
2022-03-01T11:02:40.078797+01:00 void worker[30530]: [info] Establishing ws connection via ws://void.qam.suse.cz/api/v1/ws/16
Actions #1

Updated by okurz almost 3 years ago

  • Project changed from openQA Infrastructure (public) to openQA Project (public)
  • Category set to Regressions/Crashes
  • Target version set to Ready

That's not a machine maintained by SUSE QE Tools, right? And I also think that, if anything, this is about openQA itself, not the infrastructure.

Actions #2

Updated by mkittler almost 3 years ago

  • Subject changed from Some chained jobs are skipped by openQA to Some directly chained jobs are skipped by openQA
  • Description updated (diff)
  • Assignee set to mkittler

Please at least take the time to properly format the ticket and, more importantly, state the context of the problem (e.g. the URL of the job the log belongs to). I've now adjusted the ticket description but it still lacks important details, e.g. how the jobs have been created in the first place¹ and what else you did after encountering the problem.

¹I did a little bit of reverse engineering of what you did here (but please just state that yourself in the ticket description next time). It looks like the skipped jobs have been created by posting an ISO/product, which makes sense as they're actually using directly chained dependencies (which makes an important difference). Then some jobs have been skipped due to a connection loss, as you correctly identified. I assume the connection was really lost, so this isn't a problem of the openQA worker per se. Then some of the jobs have been restarted, likely by a user. In case you're wondering about that: it is normal that partially restarting causes a dependency cluster that shows old and new jobs at the same time.
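
For illustration, the two dependency types are declared differently in the test suite settings; a minimal sketch with a made-up suite name (see https://open.qa/docs/ for the actual semantics):

START_AFTER_TEST=parent_suite            # regular chained: the child may run on any worker once the parent is done
START_DIRECTLY_AFTER_TEST=parent_suite   # directly chained: the child must run immediately after the parent on the same worker

Only directly chained children are kept in a worker-local queue, which is why a lost websocket connection can cause the whole remaining queue to be skipped, as in the log above.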

I can look into implementing a retry in case of a connection loss when picking the next job from the queue to make that a bit more robust. However, I suppose that's also all we can improve here. (Regardless of whether dependency handling comes into play or not, the worker will have to abort at some point after a connection loss.)
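
A minimal sketch of the proposed behaviour (illustrative only, not the actual worker code, which is Perl; names and numbers are made up):

import time

def accept_next_job(queue, is_connected, accept, attempts=10, delay=10):
    """Accept the next directly chained job from the worker-local queue,
    retrying while the websocket connection recovers instead of skipping
    the whole remaining queue on the first failure."""
    for _ in range(attempts):
        if is_connected():
            return accept(queue.pop(0))
        time.sleep(delay)  # the ws connection itself is retried every 10 s
    # all attempts exhausted: give up, the remaining children get skipped
    raise ConnectionError("websocket connection not re-established in time")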

By the way, the version of your openQA deployment is quite old (48945f1ba from last year). At least this means for sure that none of my recent changes have anything to do with that problem :-)

Actions #3

Updated by mkittler almost 3 years ago

  • Status changed from New to In Progress
Actions #4

Updated by openqa_review almost 3 years ago

  • Due date set to 2022-03-17

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by tonyyuan almost 3 years ago

Hello,
This is the overview of all the jobs: http://openqa.qam.suse.cz/tests/overview?groupid=156&version=15-SP3&build=%3A22881%3Alibvirt&distri=sle
Dependencies: http://openqa.qam.suse.cz/tests/37694#dependencies
I am not sure whether I restarted one or two of the jobs.

One thing worth noting is that each job needs 1-2 hours to complete, and the parent job needs 3-4 hours. The skipped jobs had already waited quite a long time (more than six hours) before they were skipped. Is there a timeout check, i.e. will a job be cancelled if its waiting time is too long?

Actions #6

Updated by tonyyuan almost 3 years ago

Actions #7

Updated by mkittler almost 3 years ago

Is there a timeout check, i.e. will a job be cancelled if its waiting time is too long?

A job will be stopped after MAX_JOB_TIME/MAX_SETUP_TIME, see https://open.qa/docs/#_changing_timeouts. These timeouts apply to each job individually, so the execution/setup time of directly chained parents is not taken into account. If that's not working as intended, that's a bug. Having a look at the code, I have no reason to believe it is broken, though. Note that the jobs mentioned in the ticket description have been skipped due to connection problems with the web socket server (as you've already identified) and are not a consequence of broken timeout handling.
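
For reference, these timeouts are ordinary test variables set per job, e.g. via the test suite settings; the values below are purely illustrative (the defaults are documented at the link above):

MAX_JOB_TIME=14400    # seconds; stop the job after 4 hours of execution
MAX_SETUP_TIME=3600   # seconds; stop the job if the setup phase exceeds 1 hour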

The skipped jobs had already waited quite a long time (more than six hours) before they were skipped.

I'm wondering whether it is possible to split things up. I'm also wondering whether it is necessary to test to this extent with a setup requiring directly chained dependencies.

But I agree that we should try a bit harder when accepting directly chained jobs, so the already mentioned PR will add a retry. Maybe it also makes sense to make it configurable. (Currently I'm waiting for reviewer feedback on that PR.)

Actions #8

Updated by tonyyuan almost 3 years ago

I'm wondering whether it is possible to split things up. I'm also wondering whether it is necessary to test to this extent with a setup requiring directly chained dependencies.

All of these jobs need to run on the same SUT one after another, in serial, so they have to wait in the queue and be chained. There are quite a few duplicate test modules (login, list_guest) in each job. Some jobs could be combined into one job to avoid duplication. We'll address that :)

But I agree that we should try a bit harder when accepting directly chained jobs, so the already mentioned PR will add a retry. Maybe it also makes sense to make it configurable. (Currently I'm waiting for reviewer feedback on that PR.)

Yes, great!

Actions #9

Updated by mkittler almost 3 years ago

  • Status changed from In Progress to Feedback

The PR has been merged. The worker now tries ten times; since the web socket connection is retried every 10 seconds, this amounts to a delay of up to 100 seconds. Not sure whether that's enough. You can set the environment variable OPENQA_WORKER_ACCEPT_ATTEMPTS to a different number of attempts. (The environment variable needs to be set for the worker service. If required I can also make it configurable via workers.ini, but that's not implemented yet.)
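
For example, assuming the worker runs as the usual openqa-worker@.service instances, the variable could be set via a systemd drop-in (path and value here are illustrative):

# /etc/systemd/system/openqa-worker@.service.d/accept-attempts.conf
[Service]
Environment=OPENQA_WORKER_ACCEPT_ATTEMPTS=30

followed by a systemctl daemon-reload and a restart of the worker instances.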

Actions #10

Updated by mkittler almost 3 years ago

  • Parent task set to #103962
Actions #11

Updated by mkittler almost 3 years ago

I'm resolving the issue for now. Let me know if it isn't good enough, e.g. if you need to configure it via the config file.

Actions #12

Updated by mkittler almost 3 years ago

  • Status changed from Feedback to Resolved
Actions #13

Updated by tonyyuan almost 3 years ago

Hello,

I am still verifying this fix.
I found /var/log/messages flooded with websocket log messages. Could the websocket logging be suppressed?

==================================
2022-03-16T13:14:29.295008+01:00 void worker[25212]: [info] Establishing ws connection via ws://void.qam.suse.cz/api/v1/ws/10
2022-03-16T13:14:29.300812+01:00 void worker[25212]: [info] Registered and connected via websockets with openQA host http://void.qam.suse.cz and worker ID 10
2022-03-16T13:14:36.249703+01:00 void worker[4949]: [info] Registering with openQA http://void.qam.suse.cz
2022-03-16T13:14:36.283530+01:00 void worker[4949]: [info] Establishing ws connection via ws://void.qam.suse.cz/api/v1/ws/14
2022-03-16T13:14:36.287028+01:00 void worker[4949]: [info] Registered and connected via websockets with openQA host http://void.qam.suse.cz and worker ID 14
2022-03-16T13:14:36.430606+01:00 void openqa-websockets-daemon[4984]: [info] Worker 2 websocket connection closed - 1006
2022-03-16T13:14:36.430822+01:00 void worker[4947]: [warn] Websocket connection to http://void.qam.suse.cz/api/v1/ws/2 finished by remote side with code 1006, no reason - trying again in 10 seconds
2022-03-16T13:14:38.935499+01:00 void openqa-websockets-daemon[4984]: [info] Worker 16 websocket connection closed - 1006

Actions #14

Updated by okurz over 2 years ago

  • Due date deleted (2022-03-17)

tonyyuan wrote:

Hello,

I am still verifying this fix.
I found /var/log/messages flooded with websocket log messages. Could the websocket logging be suppressed?

If you see this as necessary, you are welcome to report another ticket about it, but we will not be able to look into this anytime soon, if at all.
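
As a stopgap, such reconnect chatter can usually be filtered at the syslog level; a purely illustrative example for rsyslog, untested here, with match strings taken from the log excerpt above:

# e.g. /etc/rsyslog.d/10-openqa-ws-noise.conf
:msg, contains, "websocket connection closed - 1006" stop
:msg, contains, "finished by remote side with code 1006" stop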
