action #20544: [tools] Research/investigate ways to optimize scheduler grab_job - openQA Project (public) - openSUSE Project Management Tool

action #20544

closed

openQA Tests (public) - action #20378: [tools]Too many 502 on openqa

[tools] Research/investigate ways to optimize scheduler grab_job

Added by EDiGiacinto over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Urgent
Assignee: EDiGiacinto
Category: Feature requests
Target version: Milestone 9
Start date: 2017-07-18
Due date:
% Done: 100%
Estimated time:

Description

As discussed in the retrospective call: since we are going to have more workers in the future, we need to optimize how the scheduler assigns jobs.

Related issues 2 (0 open — 2 closed)

Updated by EDiGiacinto over 7 years ago

Subject changed from [tools] Research/investigate ways to optimize scheduler job_grab to [tools] Research/investigate ways to optimize scheduler grab_job
Category set to 132

Updated by RBrownSUSE over 7 years ago

Priority changed from Normal to High
Target version set to Milestone 9

Updated by EDiGiacinto over 7 years ago

First step towards this: https://github.com/os-autoinst/openQA/pull/1396

The plan is to move most blocking calls in the API onto the Mojo::IOLoop and make them asynchronous.
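
To illustrate the idea of taking a blocking call off the request path, here is a minimal Python sketch in which asyncio stands in for Mojo::IOLoop. The function slow_grab_job, its delay and the returned fields are hypothetical; this is not the code from the PR.

```python
import asyncio
import time


def slow_grab_job(worker_id: int) -> dict:
    """Hypothetical blocking call, e.g. a slow database query."""
    time.sleep(0.1)
    return {"worker_id": worker_id, "job_id": 1000 + worker_id}


async def handle_request(worker_id: int) -> dict:
    # Run the blocking call in a thread so the event loop stays responsive.
    return await asyncio.to_thread(slow_grab_job, worker_id)


async def serve(worker_ids: list[int]) -> list[dict]:
    # All requests are in flight at once instead of being served one by one.
    return await asyncio.gather(*(handle_request(w) for w in worker_ids))


if __name__ == "__main__":
    print(asyncio.run(serve([1, 2, 3])))
```

With three workers asking at once, the event loop is free to accept further requests while the slow calls run in the background.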

Updated by coolo over 7 years ago

Many workers are stuck in this loop:
Jul 20 07:24:24 openqaworker6 worker[19723]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("error getting ipc service: org.f"...) as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../li...mands.pm line 64.

The workers are getting a 502 from /api/v1/ws/X, so this looks like improper error handling: the worker code accesses $job->{URL} while $job is actually an error string.
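
A minimal Python sketch of the missing check, assuming the reply is either a job hash or a plain error string, as the log message suggests. The helper job_url and the sample values are hypothetical and only illustrate the guard, not the actual worker code.

```python
from typing import Optional, Union


def job_url(job: Union[dict, str, None]) -> Optional[str]:
    """Return the job's URL, or None if the reply is not a usable job."""
    if not isinstance(job, dict):
        # An error reply (a plain string here) must not be indexed like a job.
        return None
    return job.get("URL")


if __name__ == "__main__":
    print(job_url({"URL": "http://example.invalid/job"}))
    print(job_url("error getting ipc service"))
```

The caller can then log the error and retry later instead of crashing inside the I/O watcher.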

Updated by okurz over 7 years ago

Priority changed from High to Immediate

The whole of OSD is blocked now. If this ticket really is the one for the current problem, then please handle it immediately.

… OK, not the whole of OSD, three jobs are running ;-) -> https://openqa.suse.de/tests

Updated by EDiGiacinto over 7 years ago

There was no error handling before, so this was somewhat expected after last night's discussion with coolo. This should fix it:

PR: https://github.com/os-autoinst/openQA/pull/1399

Updated by EDiGiacinto over 7 years ago

Status changed from New to In Progress
% Done changed from 0 to 90

PR: https://github.com/os-autoinst/openQA/pull/1411 (and covers https://progress.opensuse.org/issues/20546 as well)

Updated by EDiGiacinto over 7 years ago

For reference: http://paste.suse.de/24318 shows the dbus library failures we observe.

Updated by EDiGiacinto over 7 years ago

Status changed from In Progress to Resolved
% Done changed from 90 to 100

The polling paradigm is now gone for good; grab_job is reduced to resolving priorities during scheduling.
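
As a rough illustration of resolving priorities on the scheduler side instead of letting workers poll, here is a minimal Python sketch. The Job fields, the assign_jobs helper, the worker class names and the rule that a lower priority value is served first are all assumptions made for the example, not the actual openQA implementation.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    priority: int      # assumption: lower value is scheduled first
    worker_class: str


def assign_jobs(jobs: list[Job], free_workers: dict[str, str]) -> dict[str, int]:
    """Map each free worker name to the best matching scheduled job id.

    free_workers maps worker name -> worker class.
    """
    assignments: dict[str, int] = {}
    remaining = dict(free_workers)
    # Resolve priorities once, centrally; ties go to the older job id.
    for job in sorted(jobs, key=lambda j: (j.priority, j.job_id)):
        match = next(
            (w for w, c in remaining.items() if c == job.worker_class), None
        )
        if match is not None:
            assignments[match] = job.job_id
            del remaining[match]
    return assignments


if __name__ == "__main__":
    jobs = [Job(1, 50, "qemu_x86_64"), Job(2, 40, "qemu_x86_64")]
    print(assign_jobs(jobs, {"w1": "qemu_x86_64"}))
```

Jobs without a matching free worker simply stay scheduled until the next round.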

#10

Updated by okurz over 7 years ago

Related to action #21836: [tools][sprint 201709.1] Many "A message received from unknown worker connection" log entries on openqa.suse.de added

#11

Updated by okurz over 7 years ago

Status changed from Resolved to In Progress
Priority changed from Immediate to Urgent

We still have quite a few problems in our infrastructure which I see as related to this task, i.e. it is not really done yet. Currently in the OSD infrastructure I can see in the jobs table https://openqa.suse.de/tests that many jobs are not being worked on, e.g. sle-15-Leanos-DVD-ppc64le-Build151.1-RAID5@ppc64le-no-tmpfs. Looking at the workers that last worked successfully on these scenarios, I find e.g. malbec:1, which shows:

sle-15-Leanos-DVD-ppc64le-Build151.1-RAID1@ppc64le-no-tmpfs not finished yet
sle-15-Leanos-DVD-ppc64le-Build151.1-ext4@ppc64le 0 about 23 hours ago

and it reports to be "working on" https://openqa.suse.de/tests/1119124, which is in state "assigned" but shows no further information besides the assigned worker.

Looking for logs with ssh malbec 'sudo journalctl --since=yesterday -u openqa-worker@1' reveals:

Aug 18 12:01:42 malbec worker[52555]: [ERROR] unable to connect to host yast-openqa.suse.cz, retry in 10s
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Commands.pm line 137.
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Commands.pm line 140.
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $host in pattern match (m//) at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 468.
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $OpenQA::Worker::Engines::isotovideo::current_host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 131.
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $OpenQA::Worker::Engines::isotovideo::current_host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 148.
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $OpenQA::Worker::Engines::isotovideo::current_host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 160.
Aug 18 12:01:42 malbec worker[52555]: [WARN] job is missing files, releasing job
Aug 18 12:01:42 malbec worker[52555]: Mojo::Reactor::Poll: I/O watcher failed: No worker id or webui host set! at /usr/share/openqa/script/../lib/OpenQA/Worker/Common.pm line 184.
Aug 18 12:01:52 malbec worker[52555]: [INFO] registering worker with openQA yast-openqa.suse.cz...
Aug 18 12:01:52 malbec worker[52555]: [DEBUG] Job 1119124 scheduled for next cycle
Aug 18 12:01:52 malbec worker[52555]: [INFO] got job 1119124: 01119124-sle-15-Leanos-DVD-ppc64le-Build151.1-RAID1@ppc64le-no-tmpfs
Aug 18 12:03:59 malbec worker[52555]: [ERROR] unable to connect to host yast-openqa.suse.cz, retry in 10s
Aug 18 12:04:09 malbec worker[52555]: [INFO] registering worker with openQA yast-openqa.suse.cz...
Aug 18 12:04:09 malbec worker[52555]: [DEBUG] Sending worker status to openqa.suse.de
[…]
Aug 19 10:31:47 malbec worker[52555]: [INFO] registering worker with openQA yast-openqa.suse.cz...
Aug 19 10:31:47 malbec worker[52555]: [DEBUG] Sending worker status to openqa.suse.de

It seems it's stuck there -> I will restart the worker
Lots of warnings -> should be worked on
No useful error message about what is wrong here -> some internal watchdog or monitoring should be worked on

I see this ticket as "urgent" because we currently have no better way than to look at the worker status manually and restart the workers by hand to ensure they are used.
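
The kind of watchdog asked for here could start as small as the following Python sketch, which flags workers whose job has sat in state "assigned" for too long. The function stale_assignments, the 10 minute limit and the input mapping are hypothetical; openQA offers no such helper as far as this ticket shows.

```python
from datetime import datetime, timedelta


def stale_assignments(
    assigned_since: dict[str, datetime],
    now: datetime,
    limit: timedelta = timedelta(minutes=10),  # assumed threshold
) -> list[str]:
    """Return workers whose job has been "assigned" for longer than limit."""
    return sorted(
        worker for worker, since in assigned_since.items() if now - since > limit
    )


if __name__ == "__main__":
    # Timestamps loosely modelled on the log above; the year is assumed.
    seen = {"malbec:1": datetime(2017, 8, 18, 12, 1, 52)}
    print(stale_assignments(seen, datetime(2017, 8, 19, 10, 31, 47)))
```

A monitoring job could run such a check periodically and alert or restart the flagged workers, replacing the manual inspection.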

#12

Updated by okurz over 7 years ago

After the restart, the worker malbec:1 is still not picking up jobs, although there are scheduled jobs that should match this worker class and are not "assigned". On OSD I see that the journal of openqa-websockets includes a whole bunch of template error messages (why should a websockets server try to render template files?). After about 10 minutes malbec:1 took one job and immediately incompleted it, with no autoinst-log.txt uploaded. Worker log:

Aug 19 10:45:19 malbec worker[63447]: [INFO] 64550: WORKING 1120781
Aug 19 10:45:21 malbec worker[63447]: [DEBUG] Sending IMMEDIATELY worker status to openqa.suse.de
Aug 19 10:45:21 malbec worker[63447]: [DEBUG] Sending worker status to openqa.suse.de
Aug 19 10:47:28 malbec worker[63447]: [ERROR] unable to connect to host yast-openqa.suse.cz, retry in 10s
Aug 19 10:47:29 malbec worker[63447]: [ERROR] 400 response: Bad Request (remaining tries: 2)
Aug 19 10:47:34 malbec worker[63447]: [ERROR] 400 response: Bad Request (remaining tries: 1)

Not convincing. As we apparently do not have a ticket for the fact that our SLE ppc64le workers currently do not seem able to even start a job properly, I will create another ticket. There is probably something else, even more serious, happening there -> #23476

#13

Updated by okurz over 7 years ago

Related to action #23476: Workers cannot share webUI with different versions. Was: SLE ppc64le workers incomplete immediately after starting jobs, no autoinst-log.txt uploaded. added

#14

Updated by EDiGiacinto over 7 years ago

Status changed from In Progress to Resolved

The problem is in the configuration: two different versions of the WebUI can't share the same worker, as we have changed quite a lot of things in the meantime. Closing this since it is no longer about scheduling.
