action #20544
openQA Tests (public) - action #20378: [tools] Too many 502 on openqa (closed)
[tools] Research/investigate ways to optimize scheduler grab_job
Added by EDiGiacinto over 7 years ago. Updated over 7 years ago.
100%
Description
As discussed in the retrospective call, since we are going to have more workers in the future, we need to optimize how the scheduler assigns jobs.
Updated by EDiGiacinto over 7 years ago
- Subject changed from [tools] Research/investigate ways to optimize scheduler job_grab to [tools] Research/investigate ways to optimize scheduler grab_job
- Category set to 132
Updated by RBrownSUSE over 7 years ago
- Priority changed from Normal to High
- Target version set to Milestone 9
Updated by EDiGiacinto over 7 years ago
First step towards this: https://github.com/os-autoinst/openQA/pull/1396
The plan is to move most blocking calls in the API to the Mojo::IOLoop and make them async.
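To illustrate the general idea only (this is not the code from the PR, and openQA itself is written in Perl on top of Mojo::IOLoop), here is a minimal sketch in Python using asyncio as a stand-in event loop. The function `blocking_grab_job` and its return fields are hypothetical; it just simulates a slow, blocking call that is handed off so the loop can keep serving other requests.

```python
import asyncio
import time


def blocking_grab_job(worker_id):
    """Hypothetical stand-in for a slow, blocking scheduler call."""
    time.sleep(0.1)  # simulates a slow database or IPC round trip
    return {"worker_id": worker_id, "job_id": 1000 + worker_id}


async def handle_request(worker_id):
    """Run the blocking call in a thread so the event loop stays responsive."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_grab_job, worker_id)


async def serve(worker_ids):
    """Handle several requests concurrently instead of one after another."""
    return await asyncio.gather(*(handle_request(w) for w in worker_ids))


def run_demo():
    """Serve three simulated worker requests and return their results."""
    return asyncio.run(serve([1, 2, 3]))
```

With this shape, three requests that each block for 0.1s overlap instead of queueing behind each other, which is the effect we want for the API handlers.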
Updated by coolo over 7 years ago
Many workers are stuck in this loop:
Jul 20 07:24:24 openqaworker6 worker[19723]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("error getting ipc service: org.f"...) as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../li...mands.pm line 64.
Getting 502 from /api/v1/ws/X - looks like improper error handling. The code in the worker accesses $job->{URL} while $job is an error.
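The kind of guard that is missing could look roughly like the following. This is a sketch in Python rather than the actual worker code, and `extract_job_url` is a hypothetical helper: it assumes the reply is either a job record (a mapping with a `URL` key) or a plain error string, as the log message above suggests.

```python
def extract_job_url(job):
    """Return the job URL, or None when the reply is not a job record."""
    if not isinstance(job, dict):
        # The reply is an error message (a plain string), not a job record,
        # so there is nothing to dereference.
        return None
    return job.get("URL")
```

The point is to check what came back before treating it as a job, so an error reply is reported instead of crashing the I/O watcher.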
Updated by okurz over 7 years ago
- Priority changed from High to Immediate
Whole OSD is blocked now. If this ticket is really the one for the current problem, then please handle it immediately.
… ok, not whole osd, three jobs are running ;-) -> https://openqa.suse.de/tests
Updated by EDiGiacinto over 7 years ago
There was no error handling before, so this is somewhat expected after last night's discussion with coolo; that should fix it.
Updated by EDiGiacinto over 7 years ago
- Status changed from New to In Progress
- % Done changed from 0 to 90
PR: https://github.com/os-autoinst/openQA/pull/1411 (and covers https://progress.opensuse.org/issues/20546 as well)
Updated by EDiGiacinto over 7 years ago
For reference: http://paste.suse.de/24318 . We observe dbus lib failures.
Updated by EDiGiacinto over 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 90 to 100
The polling paradigm is now gone for good; grab_job is reduced to resolving priorities during scheduling.
Updated by okurz over 7 years ago
- Related to action #21836: [tools][sprint 201709.1] Many "A message received from unknown worker connection" log entries on openqa.suse.de added
Updated by okurz over 7 years ago
- Status changed from Resolved to In Progress
- Priority changed from Immediate to Urgent
We still have quite a few problems in our infrastructure that I see as related to this task, i.e. it is not really done yet. Currently in the osd infrastructure I can see in the jobs table https://openqa.suse.de/tests that many jobs are not being worked on, e.g. sle-15-Leanos-DVD-ppc64le-Build151.1-RAID5@ppc64le-no-tmpfs . Looking at the workers that last worked successfully on these scenarios, I find e.g. malbec:1, which has
- sle-15-Leanos-DVD-ppc64le-Build151.1-RAID1@ppc64le-no-tmpfs not finished yet
- sle-15-Leanos-DVD-ppc64le-Build151.1-ext4@ppc64le 0 about 23 hours ago
and it reports to be "working on" https://openqa.suse.de/tests/1119124 , which is in state "assigned" but shows no further information besides the assigned worker.
Looking for logs with ssh malbec 'sudo journalctl --since=yesterday -u openqa-worker@1'
reveals:
Aug 18 12:01:42 malbec worker[52555]: [ERROR] unable to connect to host yast-openqa.suse.cz, retry in 10s
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Commands.pm line 137.
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Commands.pm line 140.
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $host in pattern match (m//) at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 468.
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $OpenQA::Worker::Engines::isotovideo::current_host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 131.
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $OpenQA::Worker::Engines::isotovideo::current_host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 148.
Aug 18 12:01:42 malbec worker[52555]: Use of uninitialized value $OpenQA::Worker::Engines::isotovideo::current_host in hash element at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 160.
Aug 18 12:01:42 malbec worker[52555]: [WARN] job is missing files, releasing job
Aug 18 12:01:42 malbec worker[52555]: Mojo::Reactor::Poll: I/O watcher failed: No worker id or webui host set! at /usr/share/openqa/script/../lib/OpenQA/Worker/Common.pm line 184.
Aug 18 12:01:52 malbec worker[52555]: [INFO] registering worker with openQA yast-openqa.suse.cz...
Aug 18 12:01:52 malbec worker[52555]: [DEBUG] Job 1119124 scheduled for next cycle
Aug 18 12:01:52 malbec worker[52555]: [INFO] got job 1119124: 01119124-sle-15-Leanos-DVD-ppc64le-Build151.1-RAID1@ppc64le-no-tmpfs
Aug 18 12:03:59 malbec worker[52555]: [ERROR] unable to connect to host yast-openqa.suse.cz, retry in 10s
Aug 18 12:04:09 malbec worker[52555]: [INFO] registering worker with openQA yast-openqa.suse.cz...
Aug 18 12:04:09 malbec worker[52555]: [DEBUG] Sending worker status to openqa.suse.de
[…]
Aug 19 10:31:47 malbec worker[52555]: [INFO] registering worker with openQA yast-openqa.suse.cz...
Aug 19 10:31:47 malbec worker[52555]: [DEBUG] Sending worker status to openqa.suse.de
so
- it seems it's stuck there -> I will restart the worker
- lots of warnings -> should be worked on
- no useful error message about what is wrong here -> some internal watchdog or monitoring should be worked on
I see this ticket as "urgent" because we currently have no better way than to check the worker status by hand and restart the workers manually to ensure they are used.
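As a rough idea of what such a watchdog check could do, here is a small Python sketch. It is not tied to the openQA schema: the record fields `id`, `state` and `assigned_at` and the 10 minute threshold are assumptions made for illustration only.

```python
from datetime import timedelta


def find_stuck_jobs(jobs, now, max_assigned=timedelta(minutes=10)):
    """Return ids of jobs sitting in state 'assigned' longer than max_assigned.

    `jobs` is an iterable of dicts with the hypothetical keys
    'id', 'state' and 'assigned_at' (a datetime).
    """
    stuck = []
    for job in jobs:
        if job["state"] == "assigned" and now - job["assigned_at"] > max_assigned:
            stuck.append(job["id"])
    return stuck
```

A periodic task could run a check like this and log or alert on the result, instead of someone having to look at the worker table manually.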
Updated by okurz over 7 years ago
The worker malbec:1 after restart is still not picking up jobs. But there are scheduled jobs that should match this worker class which are not "assigned". I see on osd that the journal of openqa-websockets includes a whole bunch of template error messages (why should a websockets server try to render template files?). After about 10 minutes now malbec:1 took one job and immediately incompleted it, no autoinst-log.txt uploaded. Worker log:
Aug 19 10:45:19 malbec worker[63447]: [INFO] 64550: WORKING 1120781
Aug 19 10:45:21 malbec worker[63447]: [DEBUG] Sending IMMEDIATELY worker status to openqa.suse.de
Aug 19 10:45:21 malbec worker[63447]: [DEBUG] Sending worker status to openqa.suse.de
Aug 19 10:47:28 malbec worker[63447]: [ERROR] unable to connect to host yast-openqa.suse.cz, retry in 10s
Aug 19 10:47:29 malbec worker[63447]: [ERROR] 400 response: Bad Request (remaining tries: 2)
Aug 19 10:47:34 malbec worker[63447]: [ERROR] 400 response: Bad Request (remaining tries: 1)
Not convincing. As we apparently do not have a ticket for the fact that our SLE ppc64le workers currently do not seem able to properly start any job, I will create another ticket. There is probably something else even more serious happening there -> #23476
Updated by okurz over 7 years ago
- Related to action #23476: Workers cannot share webUI with different versions. Was: SLE ppc64le workers incomplete immediately after starting jobs, no autoinst-log.txt uploaded. added
Updated by EDiGiacinto over 7 years ago
- Status changed from In Progress to Resolved
The problem is in the configuration: two different versions of the WebUI can't share the same worker, as we changed quite a lot of things in the meantime. Closing this since it is no longer about scheduling.