action #60866

Periodic stale job detection keeps scheduler busy producing a massive number of incomplete jobs (was: Again many incomplete jobs with no logs at all)

Added by okurz 4 months ago. Updated 3 months ago.

Status: Resolved
Start date: 10/12/2019
Priority: High
Due date:
Assignee: mkittler
% Done: 0%
Category: Concrete Bugs
Target version: Done
Difficulty:
Duration:

Description

Observation

E.g. https://openqa.suse.de/tests/3687641 shows modules on the details page, but no result for any module and no logs at all. Only from the system journal can one get a bit of information:

Dec 10 17:11:37 openqaworker6 worker[5647]: [info] [pid:10183] 10183: WORKING 3687641
Dec 10 17:11:37 openqaworker6 worker[5647]: [debug] [pid:10183] +++ worker notes +++
Dec 10 17:11:37 openqaworker6 worker[5647]: [debug] [pid:10183] Start time: 2019-12-10 16:11:37
Dec 10 17:11:37 openqaworker6 worker[5647]: [debug] [pid:10183] Running on openqaworker6:10 (Linux 4.12.14-lp151.28.7-default #1 SMP Mon Jun 17 16:36:38 UTC 2019 (f8a1872) x86_64)
Dec 10 17:11:37 openqaworker6 worker[5647]: [info] [pid:5647] isotovideo has been started (PID: 10183)
Dec 10 17:11:37 openqaworker6 worker[5647]: [debug] [pid:5647] Running job 3687641 from openqa.suse.de: 03687641-sle-15-SP2-Online-x86_64-Build104.1-sles4sap_gnome_extended@64bit-2gbram.
Dec 10 17:11:37 openqaworker6 worker[5647]: [debug] [pid:5647] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3687641/status
Dec 10 17:11:47 openqaworker6 worker[5647]: [debug] [pid:5647] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3687641/status
Dec 10 17:11:58 openqaworker6 worker[5647]: [debug] [pid:5647] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3687641/status
Dec 10 17:11:58 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 2)
Dec 10 17:12:03 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 1)
Dec 10 17:12:08 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
Dec 10 17:12:08 openqaworker6 worker[5647]: [debug] [pid:5647] Stopping job 3687641 from openqa.suse.de: 03687641-sle-15-SP2-Online-x86_64-Build104.1-sles4sap_gnome_extended@64bit-2gbram - reason: api-failure
Dec 10 17:12:08 openqaworker6 worker[5647]: [error] [pid:5647] Aborting job because web UI doesn't accept new images anymore (likely considers this job dead)
Dec 10 17:12:08 openqaworker6 worker[5647]: [debug] [pid:5647] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3687641/status
Dec 10 17:12:08 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 2)
Dec 10 17:12:13 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 1)
Dec 10 17:12:18 openqaworker6 worker[5647]: [warn] [pid:5647] Can't open /var/lib/openqa/pool/10/testresults/result-bootloader.json for result upload - likely isotovideo could not be started or failed early. Error message: No such file or directory
Dec 10 17:12:18 openqaworker6 worker[5647]: [debug] [pid:5647] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3687641/status
Dec 10 17:12:18 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 2)
Dec 10 17:12:18 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
Dec 10 17:12:18 openqaworker6 worker[5647]: [info] [pid:5647] Trying to stop job gracefully by announcing it to command server via http://localhost:20103/uhMJRhgnkWGej2zu/broadcast
Dec 10 17:12:20 openqaworker6 worker[5647]: [info] [pid:5647] Isotovideo exit status: 0
Dec 10 17:12:20 openqaworker6 worker[5647]: [debug] [pid:5647] Registered process:10273
Dec 10 17:12:27 openqaworker6 worker[5647]: [info] [pid:5647] +++ worker notes +++
Dec 10 17:12:27 openqaworker6 worker[5647]: [info] [pid:5647] End time: 2019-12-10 16:12:27
Dec 10 17:12:27 openqaworker6 worker[5647]: [info] [pid:5647] Result: api-failure
Dec 10 17:12:27 openqaworker6 worker[5647]: [debug] [pid:5647] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3687641/set_done?result=incomplete
Dec 10 17:12:27 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 1)
Dec 10 17:12:27 openqaworker6 worker[5647]: [info] [pid:5647] Registering with openQA openqa.suse.de
Dec 10 17:12:27 openqaworker6 worker[5647]: [info] [pid:5647] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/972
Dec 10 17:12:27 openqaworker6 worker[5647]: [debug] [pid:5647] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3687641/set_done
Dec 10 17:12:27 openqaworker6 worker[5647]: [debug] [pid:5647] Websocket connection to http://openqa.suse.de/api/v1/ws/972 finished from our side.
Dec 10 17:12:27 openqaworker6 worker[5647]: [info] [pid:5647] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 972
Dec 10 17:12:27 openqaworker6 worker[5647]: [debug] [pid:5647] Job 3687641 from openqa.suse.de finished - reason: api-failure
Dec 10 17:12:27 openqaworker6 worker[5647]: [debug] [pid:5647] Cleaning up for next job
Dec 10 17:12:32 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
Dec 10 17:12:32 openqaworker6 worker[5647]: [error] [pid:5647] Unable to make final image uploads. Maybe the web UI considers this job already dead.

Reproducible

To find these jobs one can search the database with the SQL query:

select id,test from jobs where (result='incomplete' and t_finished >= (NOW() - interval '24 hour') and result_dir is null and id not in (select job_id from comments where id is null)) order by id desc;

Related issues

Related to openQA Project - action #55529: job incompletes when it can not reach the openqa webui ho... Resolved 14/08/2019
Related to openQA Project - action #61922: [epic] Incomplete jobs with no logs at all Resolved 03/02/2020

History

#1 Updated by okurz 4 months ago

  • Description updated (diff)

#2 Updated by okurz 4 months ago

  • Priority changed from High to Urgent

This seems to be very severe now. For the past 24h the mentioned SQL query returns 161 jobs on osd, but also 14 on o3, which could very well be just #55529 for the daily openQA upgrade.

#3 Updated by okurz 4 months ago

  • Related to action #55529: job incompletes when it can not reach the openqa webui host just for a single time aka. retry on 521 connect timeout in cache added

#4 Updated by coolo 4 months ago

But the job was autorestarted, no?

#5 Updated by okurz 4 months ago

It looks like it, but maybe https://openqa.suse.de/tests/3689370#next_previous shows the problem in a more obvious manner

#6 Updated by coolo 4 months ago

It appears to me that you're mixing issues again - because this link only shows timeout_exceeded jobs, not incompletes

#7 Updated by mkittler 4 months ago

That's a strange case: apparently the job had already been marked as incomplete within the second Dec 10 17:11:58:

Dec 10 17:11:58 openqaworker6 worker[5647]: [debug] [pid:5647] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3687641/status
Dec 10 17:11:58 openqaworker6 worker[5647]: [error] [pid:5647] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3687641/status): 400 response: Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete) (remaining tries: 2)

The behavior on the worker side looks as expected (retrying the status update, failing with the reason 'api-failure' and registering again). The question is why the web UI suddenly considered that job incomplete. The worker didn't reconnect at this time, so it is unlikely to be a bug in the incompletion code on worker reconnect.

#8 Updated by mkittler 4 months ago

  • Assignee set to mkittler
  • Target version set to Current Sprint

I've looked at the web UI log around the time:

[2019-12-10T17:11:57.0445 CET] [info] GRU: removing /var/lib/openqa/share/factory/repo/SLE-15-SP2-Module-Desktop-Applications-POOL-x86_64-BuildH.68.2-Media1
[2019-12-10T17:11:58.0092 CET] [error] Unable to inform scheduler that worker 822 reported back (connection error: Request timeout)
[2019-12-10T17:11:58.0092 CET] [error] response was:
[2019-12-10T17:11:58.0148 CET] [info] Got status update for job 3687641 with unexpected worker ID 972 (expected no updates anymore, job is done with result incomplete)
[2019-12-10T17:11:59.0753 CET] [info] Got status update for job 3686843 with unexpected worker ID 611 (expected no updates anymore, job is done with result incomplete)
  1. So there's a cleanup job running at the same time but that's likely unrelated.
  2. That the web socket server couldn't inform the scheduler that a different worker reported back is more interesting. That is a non-blocking HTTP call and it failed because the timeout of 1 minute was exceeded. So the scheduler was unresponsive for at least one minute.
  3. We see the log message corresponding to the failed status update. Apparently other workers/jobs had the same problem at this point.
  4. Earlier in the log we see the same pattern as well:

    [2019-12-10T17:11:22.0945 CET] [error] Unable to inform scheduler that worker 486 reported back (connection error: Request timeout)
    [2019-12-10T17:11:22.0946 CET] [error] response was:
    [2019-12-10T17:11:23.0990 CET] [info] Got status update for job 3686843 with unexpected worker ID 611 (expected no updates anymore, job is done with result incomplete)
    [2019-12-10T17:11:25.0275 CET] [info] Got status update for job 3686852 with unexpected worker ID 570 (expected no updates anymore, job is done with result incomplete)
    [2019-12-10T17:11:29.0038 CET] [info] Got status update for job 3686843 with unexpected worker ID 611 (expected no updates anymore, job is done with result incomplete)
    [2019-12-10T17:11:30.0446 CET] [info] Got status update for job 3686852 with unexpected worker ID 570 (expected no updates anymore, job is done with result incomplete)
    [2019-12-10T17:11:33.0283 CET] [error] Unable to inform scheduler that worker 481 reported back (connection error: Request timeout)
    [2019-12-10T17:11:33.0284 CET] [error] response was:
    [2019-12-10T17:11:34.0134 CET] [info] Got status update for job 3686843 with unexpected worker ID 611 (expected no updates anymore, job is done with result incomplete)
    [2019-12-10T17:11:35.0524 CET] [info] Got status update for job 3686852 with unexpected worker ID 570 (expected no updates anymore, job is done with result incomplete)
    [2019-12-10T17:11:39.0228 CET] [info] Got status update for job 3686843 with unexpected worker ID 611 (expected no updates anymore, job is done with result incomplete)
    [2019-12-10T17:11:40.0637 CET] [info] Got status update for job 3686852 with unexpected worker ID 570 (expected no updates anymore, job is done with result incomplete)
    

So the question is what kept the scheduler so busy. It seems to be unresponsive like this for minutes. There are no log messages like 'Dead job %d aborted and duplicated %d' but still lots of jobs have been incompleted. Maybe the scheduler logs to a different file, so I just don't see those messages.

#9 Updated by mkittler 4 months ago

Before the whole 'Got status update for job ... with unexpected worker ...' spam, the following message is logged:

[2019-12-10T16:33:55.0181 CET] [error] [24b15e6c] DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::JobLocks=HASH(0x560d4a94b0b0): row not found at /usr/share/openqa/script/../lib/OpenQA/Resource/Locks.pm line 151

And it seems the scheduler logs to a different file, indeed. And it was busy marking jobs as incomplete at this time:

[2019-12-10T16:59:20.0644 CET] [warn] Dead job 3687090 aborted as incomplete
[2019-12-10T16:59:21.0159 CET] [warn] Dead job 3686910 aborted and duplicated 3688470
[2019-12-10T16:59:23.0113 CET] [warn] Dead job 3686893 aborted and duplicated 3688471
[2019-12-10T16:59:23.0989 CET] [warn] Dead job 3687049 aborted and duplicated 3688472
[2019-12-10T16:59:24.0232 CET] [warn] Dead job 3685150 aborted and duplicated 3688473
[2019-12-10T16:59:24.0461 CET] [warn] Dead job 3685234 aborted and duplicated 3688474
[2019-12-10T16:59:24.0979 CET] [warn] Dead job 3687000 aborted and duplicated 3688475
[2019-12-10T16:59:25.0886 CET] [warn] Dead job 3686897 aborted and duplicated 3688476
[2019-12-10T16:59:32.0037 CET] [warn] Dead job 3686881 aborted and duplicated 3688477
[2019-12-10T17:01:59.0749 CET] [warn] Dead job 3687849 aborted and duplicated 3688491
[2019-12-10T17:02:01.0023 CET] [warn] Dead job 3688383 aborted and duplicated 3688492
[2019-12-10T17:02:02.0500 CET] [warn] Dead job 3686981 aborted and duplicated 3688495
[2019-12-10T17:02:03.0370 CET] [warn] Dead job 3688463 aborted and duplicated 3688496
[2019-12-10T17:02:03.0819 CET] [warn] Dead job 3688470 aborted and duplicated 3688498
[2019-12-10T17:02:06.0489 CET] [warn] Dead job 3687020 aborted and duplicated 3688500
[2019-12-10T17:02:08.0508 CET] [warn] Dead job 3687093 aborted and duplicated 3688506
[2019-12-10T17:02:10.0760 CET] [warn] Dead job 3688373 aborted and duplicated 3688507
[2019-12-10T17:02:13.0475 CET] [warn] Dead job 3687099 aborted and duplicated 3688513
[2019-12-10T17:02:14.0744 CET] [warn] Dead job 3687030 aborted and duplicated 3688514
[2019-12-10T17:02:17.0834 CET] [warn] Dead job 3687021 aborted and duplicated 3688516
[2019-12-10T17:02:18.0340 CET] [warn] Dead job 3688455 aborted and duplicated 3688520
[2019-12-10T17:02:19.0469 CET] [warn] Dead job 3687031 aborted and duplicate

#10 Updated by mkittler 4 months ago

Suggestions:

  • sub handle_worker_reported_back could be moved from the scheduler to the main web UI. That would scale better because the main web UI has preforking.
  • But all the 'Dead job %d aborted and duplicated' messages actually come from sub incomplete_and_duplicate_stale_jobs, which is executed by the scheduler every 2 minutes. It seems that it kept the scheduler quite busy:

    [2019-10-10T19:56:37.0486 CEST] [warn] Dead job 3462679 aborted and duplicated 3464155
    [2019-10-10T19:56:37.0941 CEST] [warn] Dead job 3462675 aborted and duplicated 3464156
    [2019-10-10T19:56:41.0006 CEST] [warn] Dead job 3462503 aborted and duplicated 3464157
    [2019-10-10T19:56:41.0635 CEST] [warn] Dead job 3462596 aborted and duplicated 3464158
    [2019-10-10T19:56:42.0334 CEST] [warn] Dead job 3462682 aborted and duplicated 3464159
    tons of similar messages and no other messages in between
    [2019-10-10T20:14:44.0108 CEST] [warn] Dead job 3464636 aborted as incomplete
    [2019-10-10T20:14:44.0635 CEST] [warn] Dead job 3464638 aborted and duplicated 3464671
    

    The incomplete detection requires quite some DB queries and possibly blocking IPC with the web socket server. Considering the log, processing a single job can even take more than a second. We cannot easily move that function to the web UI because it is triggered using a recurring timer on the IO loop. However, we could make the IPC non-blocking or outsource it to (yet another) daemon. Or we just interrupt the loop if it is taking too long.
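The last idea, interrupting the loop when it takes too long, can be sketched as a time-budgeted pass over the stale jobs. This is a minimal Python illustration of the approach, not openQA's actual Perl code; the function and parameter names are hypothetical:

```python
import time

# Hypothetical sketch (not openQA's actual code): process stale jobs until a
# time budget is exhausted, deferring the rest to the next periodic tick so
# the scheduler's event loop stays responsive.
def incomplete_stale_jobs(stale_jobs, handle_job, budget_seconds=10):
    """Process stale jobs until done or the budget runs out.

    Returns the jobs that were deferred to the next periodic run.
    """
    deadline = time.monotonic() + budget_seconds
    deferred = []
    for index, job in enumerate(stale_jobs):
        if time.monotonic() >= deadline:
            # budget exhausted: leave the remaining jobs for the next tick
            deferred = stale_jobs[index:]
            break
        handle_job(job)
    return deferred
```

On each recurring tick the scheduler would then handle only as many stale jobs as the budget allows, so a pile-up of dead jobs can no longer block the event loop for minutes.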

#11 Updated by okurz 4 months ago

coolo wrote:

It appears to me that you're mixing issues again - because this link only shows timeout_exceeded jobs, not incompletes

yes, sorry. Seems I have copied the wrong job link. This dark red for both incompletes and timeout_exceeded is something we could overhaul with better meta-results now :)

@mkittler your suggestion looks sound to me. Can we do any short-term changes to mitigate and see if your assumption is right? E.g. reduce the interval that this is triggered, increase the "stale detection time" or something?

#12 Updated by mkittler 4 months ago

I think the suggestion "Or we just interrupt the loop if it is taking too long." can be considered a short-term mitigation. And yes, we could also decrease the interval for the stale job detection. By the way, I'm still not quite certain which condition triggered the problem. Maybe there were just too many stale jobs (or at least more than usual) at the same time to be aborted and duplicated?

Note that in the chat it was discussed to remove the stale job detection altogether. This simplification would likely have a lot of implications which we should clarify beforehand, and it would make sense to have a separate ticket for that.

#13 Updated by mkittler 4 months ago

Another short-term improvement would be increasing WORKERS_CHECKER_THRESHOLD. So we wouldn't consider jobs dead after just 2 minutes but e.g. keep them around as 'running' for 15 minutes. (If a worker is restarted the aborted job would still immediately be incompleted when the worker reports back again.)
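The criterion behind WORKERS_CHECKER_THRESHOLD can be illustrated with a small sketch (Python for illustration; the constant itself exists in openQA's OpenQA::Constants, but this function is a hypothetical stand-in for the actual Perl logic):

```python
from datetime import datetime, timedelta

# Hypothetical stand-in for openQA's stale-job criterion: a running job is
# considered dead once the worker's last status update is older than the
# threshold. Raising it to 15 minutes makes the detection more patient.
WORKERS_CHECKER_THRESHOLD = timedelta(seconds=900)  # the proposed 15-minute value

def is_job_stale(last_status_update, now=None, threshold=WORKERS_CHECKER_THRESHOLD):
    """Return True if the worker has been silent longer than the threshold."""
    now = now or datetime.utcnow()
    return now - last_status_update > threshold
```

A worker that is merely delayed would then get up to 15 minutes to report back before its job is incompleted, while a restarted worker still gets its aborted job incompleted immediately on re-registration.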


I've just had a look at the recent scheduler log to see how bad it currently is. Luckily it isn't busy incompleting jobs anymore. However, the dead job detection of course runs every 2 minutes. With debug output enabled there are more interesting messages:

[2019-12-11T15:42:08.0797 CET] [error] [pid:15172] Failed dispatching message to websocket server over ipc for worker "openqaworker-arm-2:": Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 55.

[2019-12-11T15:52:08.0845 CET] [error] [pid:15172] Failed dispatching message to websocket server over ipc for worker "openqaworker-arm-2:": Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 55.

[2019-12-11T15:52:08.0880 CET] [debug] [pid:15172] Jobs to duplicate {
  3695052 => {
               chained_children          => [],
               chained_parents           => [3694658],
               directly_chained_children => [],
               directly_chained_parents  => [],
               parallel_children         => [3695053, 3695054],
               parallel_parents          => [],
             },
  3695053 => {
               chained_children          => [],
               chained_parents           => [],
               directly_chained_children => [],
               directly_chained_parents  => [],
               parallel_children         => [],
               parallel_parents          => [3695052],
             },
  3695054 => {
               chained_children          => [],
               chained_parents           => [],
               directly_chained_children => [],
               directly_chained_parents  => [],
               parallel_children         => [],
               parallel_parents          => [3695052],
             },
}
[2019-12-11T15:52:09.0466 CET] [debug] [pid:15172] new job 3695250
[2019-12-11T15:52:09.0469 CET] [warn] [pid:15172] Dead job 3695053 aborted and duplicated 3695250
[2019-12-11T15:52:09.0548 CET] [debug] [pid:15172] Jobs to duplicate {
  3695060 => {
               chained_children          => [],
               chained_parents           => [3694658],
               directly_chained_children => [],
               directly_chained_parents  => [],
               parallel_children         => [],
               parallel_parents          => [],
             },
  3695062 => {
               chained_children          => [],
               chained_parents           => [],
               directly_chained_children => [],
               directly_chained_parents  => [],
               parallel_children         => [],
               parallel_parents          => [3695060],
             },
}
[2019-12-11T15:52:09.0699 CET] [debug] [pid:15172] rollback duplicate: {UNKNOWN}: There is already a clone! at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 635. at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 849

[2019-12-11T15:52:09.0700 CET] [debug] [pid:15172] duplication failed
[2019-12-11T15:52:09.0700 CET] [warn] [pid:15172] Dead job 3695062 aborted as incomplete
[2019-12-11T15:52:09.0740 CET] [debug] [pid:15172] Jobs to duplicate {
  3695075 => {
    chained_children          => [],
    chained_parents           => [],
    directly_chained_children => [],
    directly_chained_parents  => [],
    parallel_children         => [],
    parallel_parents          => [],
  },
}
[2019-12-11T15:52:09.0906 CET] [debug] [pid:15172] new job 3695253
[2019-12-11T15:52:09.0910 CET] [warn] [pid:15172] Dead job 3695075 aborted and duplicated 3695253
[2019-12-11T15:52:09.0981 CET] [debug] [pid:15172] Jobs to duplicate {
  3695052 => {
               chained_children          => [],
               chained_parents           => [3694658],
               directly_chained_children => [],
               directly_chained_parents  => [],
               parallel_children         => [],
               parallel_parents          => [],
             },
  3695054 => {
               chained_children          => [],
               chained_parents           => [],
               directly_chained_children => [],
               directly_chained_parents  => [],
               parallel_children         => [],
               parallel_parents          => [3695052],
             },
}
[2019-12-11T15:52:10.0205 CET] [debug] [pid:15172] rollback duplicate: {UNKNOWN}: There is already a clone! at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 635. at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 849

So apparently timeouts happen when sending commands to the worker via the web socket service (part of the dead job detection).

And then there are also these 'There is already a clone!' messages from sub create_clone. The code looks like this is an expected condition. However, I'm wondering whether it is normal that it happens so often.

#14 Updated by okurz 4 months ago

  • Status changed from New to In Progress

Well, openqaworker-arm-2 is a special case as it is currently the most unstable worker, dying a lot, see #41882 for details. This sounds to me as if some checks regarding openqaworker-arm-2 cause delays and timeouts, which leave other workers unable to send updates to the web UI, causing chain effects.

You mentioned WORKERS_CHECKER_THRESHOLD but I understand you haven't done anything regarding this so far. I hotpatched the value in /usr/share/openqa/lib/OpenQA/Constants.pm on osd for now to 900 (15 minutes). Let's see if this makes any difference. This will be overwritten again on the next deployment of openQA, e.g. next Wednesday.

#15 Updated by mkittler 4 months ago

This sounds to me as if some checks regarding openqaworker-arm-2 would cause delays and timeouts which would lead to other workers being unable to send an update to the webui causing chain-effects.

Yes, it slows down the scheduler. It is generally not nice that a slow worker possibly slows down one of the global daemons - that's why I brought it up.

PR for increasing WORKERS_CHECKER_THRESHOLD: https://github.com/os-autoinst/openQA/pull/2591

I'm currently preparing another PR for moving handle_worker_reported_back to the main web UI so it is done when the worker registers via the API.

#16 Updated by okurz 4 months ago

As I manually increased WORKERS_CHECKER_THRESHOLD to 15m on osd, I checked today for related incomplete jobs. As auto-review tries to label all incompletes, we can query for the used label:

openqa=> select distinct jobs.id,t_finished from jobs, comments where (result='incomplete' and t_finished >= (NOW() - interval '96 hour') and jobs.id in (select job_id from comments where text ~ 'poo#60866')) order by t_finished desc limit 10;
   id    |     t_finished      
---------+---------------------
 3700775 | 2019-12-12 21:17:12
 3700343 | 2019-12-12 20:02:53
 3700339 | 2019-12-12 20:02:14
 3700544 | 2019-12-12 20:00:32
 3700755 | 2019-12-12 19:50:36
 3700727 | 2019-12-12 19:48:04
 3700701 | 2019-12-12 19:47:31
 3700749 | 2019-12-12 19:45:29
 3700611 | 2019-12-12 19:44:38
 3700748 | 2019-12-12 19:44:31

I don't have a graph of incomplete jobs over time but at least we can say for sure that the patch to WORKERS_CHECKER_THRESHOLD did not fix everything ;)

To get an overview of all incompletes by type, only the most severe from the last 96 hours:

openqa=> select left(text, 80), count(text) from jobs, comments where (result='incomplete' and t_finished >= (NOW() - interval '96 hour') and jobs.id in (select job_id from comments where id is not null)) and jobs.id = job_id group by text having count(text) > 5 order by count(text) desc;
                                       left                                       | count 
----------------------------------------------------------------------------------+-------
 poo#60866 Again many incomplete jobs with no logs at all                         |   998
 label:setup_failure, probably wrong settings or missing asset                    |   344
 poo#59852 auto_review:"Failed to find an available port: Address already in use" |    41
 poo#46742 premature deletion of files from cache                                 |    32
 poo#57620 job is incomplete if websockets server (or webui?) is unreachable for  |    11
 poo#60161 test incompletes in t20_teaming_ab_all_link                            |     8
 poo#60200 job incompletes just after VNC stall detection, no clear error         |     8
 poo#43631 Job ending up incomplete but everything else looks fine                |     8

#17 Updated by mkittler 4 months ago

  • Priority changed from Urgent to High

Both PRs have been merged: https://github.com/os-autoinst/openQA/pull/2591, https://github.com/os-autoinst/openQA/pull/2606

That means

  1. Job incompletion is now triggered with much more patience by the periodic check and therefore hopefully happens less massively.
  2. Job incompletion on worker re-registration is now handled directly within the web UI. That avoids IPC overhead and should scale better due to preforking.

This should defuse the situation, so I'm lowering the priority. I'll keep the ticket as 'In Progress' because we still have to see how well it works in production.

#18 Updated by okurz 4 months ago

Too early to be sure, but at least I did not see any degradation. What I did now for analysis: trigger a pipeline on https://gitlab.suse.de/openqa/auto-review/pipelines to mark the unreviewed incompletes from today, then run this SQL on osd:

select left(text, 80), count(text) from jobs, comments where (result='incomplete' and t_finished >= (NOW() - interval '13 hour') and jobs.id in (select job_id from comments where id is not null)) and jobs.id = job_id group by text having count(text) > 5 order by count(text) desc;
                                       left                                       | count 
----------------------------------------------------------------------------------+-------
 label:setup_failure, probably wrong settings or missing asset                    |    56
 poo#60443 job incomplete with auto_review:"(?s)isotovideo failed.*EXIT 1" but no |    22
 poo#60866 Again many incomplete jobs with no logs at all                         |    11

so (just) 11 jobs with no logs at all. Still something to look at, e.g. one from that list: https://openqa.suse.de/tests/3721672 . I looked up the service journal for the job at that time on openqaworker8:2 and found:

Dec 18 13:08:22 openqaworker8 worker[33797]: [debug] [pid:33797] Updating status so job 3721672 is not considered dead.
Dec 18 13:08:22 openqaworker8 worker[33797]: [debug] [pid:33797] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3721672/status
Dec 18 13:08:23 openqaworker8 worker[33797]: [debug] [pid:33797] Download of SLE-12-SP3-Server-DVD-x86_64-GM-DVD1.iso processed:
Dec 18 13:08:23 openqaworker8 worker[33797]: [info] [#220622] Cache size of "/var/lib/openqa/cache" is 48GiB, with limit 50GiB
Dec 18 13:08:23 openqaworker8 worker[33797]: [info] [#220622] Downloading "SLE-12-SP3-Server-DVD-x86_64-GM-DVD1.iso" from "openqa.suse.de/tests/3721672/asset/iso/SLE-12-SP3-Server-DVD-x86_64-GM-DVD1.iso"
Dec 18 13:08:23 openqaworker8 worker[33797]: [info] [#220622] Content of "/var/lib/openqa/cache/openqa.suse.de/SLE-12-SP3-Server-DVD-x86_64-GM-DVD1.iso" has not changed, updating last use
Dec 18 13:13:23 openqaworker8 worker[33797]: Mojo::Reactor::Poll: I/O watcher failed: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/CacheService/Client.pm line 57.
Dec 18 13:13:23 openqaworker8 worker[33797]: [error] [pid:33797] Stopping because a critical error occurred.
Dec 18 13:13:23 openqaworker8 worker[33797]: [debug] [pid:33797] Stopping job 3721672 from openqa.suse.de: 03721672-sle-12-SP3-Server-DVD-Updates-x86_64-Build20191218-2-qam-gnome@64bit - reason: exception
Dec 18 13:13:23 openqaworker8 worker[33797]: [debug] [pid:33797] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3721672/status
Dec 18 13:13:23 openqaworker8 worker[33797]: [error] [pid:33797] Another error occurred when trying to stop gracefully due to an error. Trying to kill ourself forcefully now.
Dec 18 14:04:50 openqaworker8 systemd[1]: Starting openQA Worker #2...

so the cache service client failed with an inactivity timeout and the worker instance terminated itself. I don't know what restarted it later at 14:04:50.

#19 Updated by kraih 4 months ago

The Mojo::Reactor::Poll: I/O watcher failed: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/CacheService/Client.pm line 57. error is interesting. It means that the cache service API server was unavailable longer than I expected, and we can probably resolve that issue by extending the default timeout a bit.

#20 Updated by mkittler 4 months ago

  • Subject changed from Again many incomplete jobs with no logs at all to Periodic stale job detection keeps scheduler busy producing a massive number of incomplete jobs (was: Again many incomplete jobs with no logs at all)

@okurz The incomplete job you have found is certainly caused by something other than the series of incompletes which led to creating this issue. I'm concretizing the ticket description because it makes no sense to handle every kind of incomplete job within this ticket.

@kraih I'm referencing your PR: https://github.com/os-autoinst/openQA/pull/2621

#21 Updated by okurz 4 months ago

mkittler wrote:

@okurz The incomplete job you have found is certainly caused by something other than the series of incompletes which led to creating this issue. I'm concretizing the ticket description because it makes no sense to handle every kind of incomplete job within this ticket.

Yes, you can do that, but https://github.com/os-autoinst/scripts/blob/master/openqa-label-known-issues#L68 will continue to use this ticket as a label unless we change the detection in there or at least update the ticket reference. The problem with "no logs at all", though, is that the detection of what is going on is pretty hard. This is why we should try really hard to avoid this generic problem of "no logs at all". And this is what this ticket was and is about, despite what we identified as just one part of a bigger problem area.

#22 Updated by okurz 4 months ago

What we can do is keep this ticket as epic and split out any identified sub-problem.

#23 Updated by mkittler 3 months ago

  • Status changed from In Progress to Resolved
  • Target version changed from Current Sprint to Done

Sorry, but if that was intended to be treated as a ticket about the general topic, the observation should never have contained such specific information. So I'm still considering this ticket to be about the specific problem we have observed.

Since we don't see the specific problem in production anymore, I assume the PRs helped. I'll still keep the ticket on feedback because I want to check the impact of https://github.com/os-autoinst/openQA/pull/2591.


Feel free to create a fresh ticket for the epic. It doesn't make much sense to turn this ticket into an epic because it already has a very specific log in the description and lots of comments about the specific problem. Note that such an epic would only serve as a category for tickets with this kind of symptom and not contain very specific information.

#24 Updated by mkittler 3 months ago

  • Status changed from Resolved to Feedback
  • Target version changed from Done to Current Sprint

#25 Updated by mkittler 3 months ago

  • Status changed from Feedback to Resolved
  • Target version changed from Current Sprint to Done

Seems like increasing the inactivity timeout for workers has not caused any problems. At least I haven't found any abandoned jobs still considered running on our production instances.

Due to the restructuring (PR) this concrete issue is not expected to happen again. Of course the change might have introduced other regressions/disadvantages but I'm closing the issue for now.

#26 Updated by okurz 3 months ago

@mkittler I don't have hard feelings about how you handle this ticket, but as stated in #60866#note-21 the ticket is still used as a label in https://github.com/os-autoinst/scripts/blob/master/openqa-label-known-issues#L68 . Closing this ticket does not change that; without changing the script, this closed ticket will still be used as a label on incomplete jobs.

#27 Updated by mkittler 3 months ago

  • Related to action #61922: [epic] Incomplete jobs with no logs at all added

#28 Updated by mkittler 3 months ago

For the record, the script has been changed to point to the related epic.
