action #20544

closed

openQA Tests - action #20378: [tools]Too many 502 on openqa

[tools] Research/investigate ways to optimize scheduler grab_job

Added by EDiGiacinto almost 7 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Feature requests
Target version:
Start date: 2017-07-18
Due date:
% Done: 100%
Estimated time:

Description

As discussed in the retrospective call, since we are going to have more workers in the future, we need to optimize how the scheduler assigns jobs.


Related issues 2 (0 open, 2 closed)

Related to openQA Project - action #21836: [tools][sprint 201709.1] Many "A message received from unknown worker connection" log entries on openqa.suse.de (Resolved, EDiGiacinto, 2017-08-08)

Related to openQA Project - action #23476: Workers cannot share webUI with different versions. Was: SLE ppc64le workers incomplete immediately after starting jobs, no autoinst-log.txt uploaded. (Rejected, EDiGiacinto, 2017-08-19)

Actions #1

Updated by EDiGiacinto almost 7 years ago

  • Subject changed from [tools] Research/investigate ways to optimize scheduler job_grab to [tools] Research/investigate ways to optimize scheduler grab_job
  • Category set to 132
Actions #2

Updated by RBrownSUSE almost 7 years ago

  • Priority changed from Normal to High
  • Target version set to Milestone 9
Actions #3

Updated by EDiGiacinto almost 7 years ago

First step towards this: https://github.com/os-autoinst/openQA/pull/1396

The plan is to move most blocking calls in the API onto the Mojo::IOLoop and make them asynchronous.
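To illustrate the general idea of keeping blocking calls from stalling an event loop, here is a minimal sketch in Python's asyncio. It is only an analogy: openQA is written in Perl on Mojo::IOLoop, and the mechanism used in the PR may differ (this sketch offloads the blocking call to a thread). The names `blocking_grab_job`, `handle_request` and `serve_many` are hypothetical.

```python
import asyncio
import time


def blocking_grab_job(worker_id: int) -> dict:
    """Hypothetical stand-in for a slow, blocking call (e.g. a database query)."""
    time.sleep(0.1)
    return {"worker_id": worker_id, "job_id": 1000 + worker_id}


async def handle_request(worker_id: int) -> dict:
    # Run the blocking call in a thread so the event loop stays responsive.
    return await asyncio.to_thread(blocking_grab_job, worker_id)


async def serve_many(worker_ids: list[int]) -> list[dict]:
    # All requests are in flight concurrently instead of one after another.
    return await asyncio.gather(*(handle_request(w) for w in worker_ids))


if __name__ == "__main__":
    print(asyncio.run(serve_many([1, 2, 3, 4])))
```

With this shape, one slow request no longer delays the others, which is the effect the plan above is after.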

Actions #4

Updated by coolo almost 7 years ago

Many workers are stuck in this loop:
Jul 20 07:24:24 openqaworker6 worker[19723]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("error getting ipc service: org.f"...) as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../li...mands.pm line 64.

Getting 502 from /api/v1/ws/X - looks like improper error handling. The code in the worker accesses $job->{URL} while $job is an error.
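The kind of guard that is missing can be sketched as follows. This is a Python illustration, not the worker's Perl code; the function name `extract_job_url` is hypothetical, and the only detail taken from the report is that the reply can be an error string instead of a hash with a `URL` key.

```python
from typing import Any, Optional


def extract_job_url(job: Any) -> Optional[str]:
    """Return the job URL, or None if `job` is not a usable job description."""
    # The reply may be an error string instead of a mapping, so check first.
    if not isinstance(job, dict):
        print(f"ignoring unexpected reply: {job!r}")
        return None
    # A mapping without a URL is also treated as "no job".
    return job.get("URL")
```

Called with an error string, the function logs and returns `None` instead of raising, so the caller can retry rather than crash inside the I/O watcher.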

Actions #5

Updated by okurz almost 7 years ago

  • Priority changed from High to Immediate

The whole of OSD is blocked now. If this ticket really is the one for the current problem, then please handle it immediately.

… ok, not whole osd, three jobs are running ;-) -> https://openqa.suse.de/tests

Actions #6

Updated by EDiGiacinto almost 7 years ago

There was no error handling before, which is somewhat expected given last night's discussion with coolo. The following should fix it:

PR: https://github.com/os-autoinst/openQA/pull/1399

Actions #7

Updated by EDiGiacinto almost 7 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 90
Actions #8

Updated by EDiGiacinto almost 7 years ago

For reference: http://paste.suse.de/24318. We observe dbus library failures.

Actions #9

Updated by EDiGiacinto almost 7 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

The polling paradigm is now gone for good; grab_job is reduced to resolving priorities during scheduling.
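As a rough illustration of scheduling by priority without workers polling, here is a minimal Python sketch. It is not the openQA scheduler: the `Job` fields, the `assign_jobs` function, the worker class names and the rule that a lower priority value is scheduled first are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    priority: int  # assumption: lower value is scheduled first
    job_id: int
    worker_class: str = field(compare=False)


def assign_jobs(jobs: list[Job], free_workers: dict[str, str]) -> dict[int, str]:
    """Assign jobs to free workers in priority order.

    `free_workers` maps a worker name to its worker class.
    Returns a mapping of job id to worker name.
    """
    queue = list(jobs)
    heapq.heapify(queue)
    available = dict(free_workers)
    assignments: dict[int, str] = {}
    while queue and available:
        job = heapq.heappop(queue)
        # Pick any free worker whose class matches the job.
        match = next(
            (w for w, cls in available.items() if cls == job.worker_class), None
        )
        if match is None:
            continue  # no suitable worker right now; the job stays scheduled
        assignments[job.job_id] = match
        del available[match]
    return assignments
```

The scheduler decides and pushes the assignment; a worker never has to ask repeatedly whether a job is available.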

Actions #10

Updated by okurz almost 7 years ago

  • Related to action #21836: [tools][sprint 201709.1] Many "A message received from unknown worker connection" log entries on openqa.suse.de added
Actions #11

Updated by okurz almost 7 years ago

  • Status changed from Resolved to In Progress
  • Priority changed from Immediate to Urgent

We still have quite a few problems in our infrastructure which I see as related to this task, i.e. it is not really done yet. Currently in the osd infrastructure I can see in the jobs table https://openqa.suse.de/tests that many jobs are not being worked on, e.g. sle-15-Leanos-DVD-ppc64le-Build151.1-RAID5@ppc64le-no-tmpfs. Looking at workers that last worked successfully on these scenarios, I can find e.g. malbec:1, which has

  • sle-15-Leanos-DVD-ppc64le-Build151.1-RAID1@ppc64le-no-tmpfs not finished yet
  • sle-15-Leanos-DVD-ppc64le-Build151.1-ext4@ppc64le 0 about 23 hours ago

and it reports to be "working on" https://openqa.suse.de/tests/1119124, which is in state "assigned" but shows no further information besides the assigned worker.

Looking for logs with ssh malbec 'sudo journalctl --since=yesterday -u openqa-worker@1' reveals:

Aug 18 12:01:42 malbec worker[52555]: [ERROR] unable to connect to host yast-openqa.suse.cz, retry in 10s
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Commands.pm line 137.
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Commands.pm line 140.
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $host in pattern match (m//) at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 468.
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $OpenQA::Worker::Engines::isotovideo::current_host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 131.
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $OpenQA::Worker::Engines::isotovideo::current_host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 148.
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $OpenQA::Worker::Engines::isotovideo::current_host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 160.
Aug 18 12:01:42 malbec worker[52555]: [WARN] job is missing files, releasing job
Aug 18 12:01:42 malbec worker[52555]: Mojo::Reactor::Poll: I/O watcher failed: No worker id or webui host set! at /usr/share/openqa/script/../lib/OpenQA/Worker/Common.pm line 184.
Aug 18 12:01:52 malbec worker[52555]: [INFO] registering worker with openQA yast-openqa.suse.cz...
Aug 18 12:01:52 malbec worker[52555]: [DEBUG] Job 1119124 scheduled for next cycle
Aug 18 12:01:52 malbec worker[52555]: [INFO] got job 1119124: 01119124-sle-15-Leanos-DVD-ppc64le-Build151.1-RAID1@ppc64le-no-tmpfs
Aug 18 12:03:59 malbec worker[52555]: [ERROR] unable to connect to host yast-openqa.suse.cz, retry in 10s
Aug 18 12:04:09 malbec worker[52555]: [INFO] registering worker with openQA yast-openqa.suse.cz...
Aug 18 12:04:09 malbec worker[52555]: [DEBUG] Sending worker status to openqa.suse.de
[…]
Aug 19 10:31:47 malbec worker[52555]: [INFO] registering worker with openQA yast-openqa.suse.cz...
Aug 19 10:31:47 malbec worker[52555]: [DEBUG] Sending worker status to openqa.suse.de

so

  1. it seems it's stuck there -> I will restart the worker
  2. lots of warnings -> should be worked on
  3. no useful error message about what is wrong here -> some internal watchdog or monitoring should be worked on

I see this ticket as "urgent" because we currently have no better way than to look at the worker status and restart workers manually to ensure they are used.
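A watchdog of the kind suggested in point 3 could, in its simplest form, flag jobs that stay in state "assigned" for too long. The Python sketch below is only an illustration under assumptions: the record fields `id`, `state` and `t_assigned`, the function name `find_stuck_jobs` and the 10 minute threshold are hypothetical and not taken from openQA.

```python
from datetime import datetime, timedelta


def find_stuck_jobs(
    jobs: list[dict],
    now: datetime,
    max_assigned: timedelta = timedelta(minutes=10),
) -> list[int]:
    """Return ids of jobs that stayed in state "assigned" longer than `max_assigned`."""
    return [
        job["id"]
        for job in jobs
        if job["state"] == "assigned" and now - job["t_assigned"] > max_assigned
    ]
```

The returned ids could then be reported by monitoring or used to release the jobs and restart the affected workers automatically.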

Actions #12

Updated by okurz almost 7 years ago

The worker malbec:1 is still not picking up jobs after the restart, although there are scheduled jobs that should match this worker class and are not "assigned". I see on osd that the journal of openqa-websockets includes a whole bunch of template error messages (why should a websockets server try to render template files?). After about 10 minutes malbec:1 has now taken one job and immediately incompleted it, with no autoinst-log.txt uploaded. Worker log:

Aug 19 10:45:19 malbec worker[63447]: [INFO] 64550: WORKING 1120781
Aug 19 10:45:21 malbec worker[63447]: [DEBUG] Sending IMMEDIATELY worker status to openqa.suse.de
Aug 19 10:45:21 malbec worker[63447]: [DEBUG] Sending worker status to openqa.suse.de
Aug 19 10:47:28 malbec worker[63447]: [ERROR] unable to connect to host yast-openqa.suse.cz, retry in 10s
Aug 19 10:47:29 malbec worker[63447]: [ERROR] 400 response: Bad Request (remaining tries: 2)
Aug 19 10:47:34 malbec worker[63447]: [ERROR] 400 response: Bad Request (remaining tries: 1)

Not convincing. As we apparently do not have a ticket for the fact that our SLE ppc64le workers currently do not seem to be able to even start any job properly, I will create another ticket. There is probably something else even more serious happening there -> #23476

Actions #13

Updated by okurz almost 7 years ago

  • Related to action #23476: Workers cannot share webUI with different versions. Was: SLE ppc64le workers incomplete immediately after starting jobs, no autoinst-log.txt uploaded. added
Actions #14

Updated by EDiGiacinto almost 7 years ago

  • Status changed from In Progress to Resolved

The problem is in the configuration: two different versions of the WebUI can't share the same worker, as we have changed quite a lot of things in the meantime. Closing this since it is no longer about scheduling.
