
coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues

coordination #62420: [epic] Distinguish all types of incompletes

[epic] Incomplete jobs with no logs at all

Added by mkittler over 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version: -
Start date: 2020-02-03
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

These jobs need further investigation as the problem might have different causes.


Subtasks 1 (0 open, 1 closed)

action #62984: Fix problem with job-worker assignment resulting in API errors (Resolved, mkittler, 2020-02-03)


Related issues 3 (1 open, 2 closed)

Related to openQA Project - action #60866: Periodic stale job detection keeps scheduler busy producing a massive number of incomplete jobs (was: Again many incomplete jobs with no logs at all) (Resolved, mkittler, 2019-12-10)

Related to openQA Project - action #55529: job incompletes when it can not reach the openqa webui host just for a single time aka. retry on 521 connect timeout in cache (Resolved, kraih, 2019-08-14)

Related to openQA Project - action #62441: openqa-worker systemd service can timeout when stopping (New, 2020-01-21)

Actions #1

Updated by mkittler over 4 years ago

  • Related to action #60866: Periodic stale job detection keeps scheduler busy producing a massive number of incomplete jobs (was: Again many incomplete jobs with no logs at all) added
Actions #2

Updated by mkittler over 4 years ago

  • Related to action #55529: job incompletes when it can not reach the openqa webui host just for a single time aka. retry on 521 connect timeout in cache added
Actions #3

Updated by okurz over 4 years ago

  • Subject changed from EPIC: Incomplete jobs with no logs at all to [epic] Incomplete jobs with no logs at all
  • Category set to Regressions/Crashes

The common style for tags in the subject line is lower case surrounded by square brackets, if you don't mind.

Actions #4

Updated by okurz over 4 years ago

As visible in https://gitlab.suse.de/openqa/auto-review/-/jobs/160444 this is still the most common issue for incompletes on o3.

E.g. https://openqa.opensuse.org/tests/1150199 has no logs, was assigned to openqaworker4(.o.o):8 and incompleted at the time the worker was restarted; the journal entries from that time are:

Jan 21 02:17:27 openqaworker4 worker[1865]: [info] 32714: WORKING 1150035
…
Jan 21 03:10:59 openqaworker4 worker[1865]: [info] Test schedule has changed, reloading test_order.json
…
Jan 21 03:16:12 openqaworker4 worker[1865]: [info] Test schedule has changed, reloading test_order.json
Jan 21 03:30:00 openqaworker4 worker[1865]: [info] Quit due to signal TERM
Jan 21 03:30:00 openqaworker4 systemd[1]: Stopping openQA Worker #8...
Jan 21 03:30:02 openqaworker4 worker[1865]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20083/zIOPh5DyHpt7_crT/broadcast
Jan 21 03:30:04 openqaworker4 worker[1865]: [info] Isotovideo exit status: 0
Jan 21 03:30:11 openqaworker4 worker[1865]: [info] +++ worker notes +++
Jan 21 03:30:11 openqaworker4 worker[1865]: [info] End time: 2020-01-21 02:30:11
Jan 21 03:30:11 openqaworker4 worker[1865]: [info] Result: quit
Jan 21 03:30:12 openqaworker4 worker[1865]: [warn] Can't open /var/lib/openqa/pool/8/testresults/result-gimp.json for result upload - likely isotovideo could not be started or failed early. Error message: No such file or directory
Jan 21 03:30:29 openqaworker4 worker[1865]: [warn] Ignoring WS message from http://openqa1-opensuse with type grab_job but no job ID (currently running 1150035 for http://openqa1-opensuse):
Jan 21 03:30:29 openqaworker4 worker[1865]: {
Jan 21 03:30:29 openqaworker4 worker[1865]:   job  => {
…
Jan 21 03:30:29 openqaworker4 worker[1865]:             assigned_worker_id => 167,
Jan 21 03:30:29 openqaworker4 worker[1865]:             blocked_by_id      => undef,
Jan 21 03:30:29 openqaworker4 worker[1865]:             clone_id           => undef,
Jan 21 03:30:29 openqaworker4 worker[1865]:             group              => "openSUSE Tumbleweed",
Jan 21 03:30:29 openqaworker4 worker[1865]:             group_id           => 1,
Jan 21 03:30:29 openqaworker4 worker[1865]:             id                 => 1150199,
Jan 21 03:30:29 openqaworker4 worker[1865]:             name               => "opensuse-Tumbleweed-NET-x86_64-Build20200119-xfce\@64bit",
Jan 21 03:30:29 openqaworker4 worker[1865]:             origin_id          => 1149971,
Jan 21 03:30:29 openqaworker4 worker[1865]:             priority           => 49,
Jan 21 03:30:29 openqaworker4 worker[1865]:             result             => "none",
…
Jan 21 03:30:29 openqaworker4 worker[1865]:             state              => "scheduled",
Jan 21 03:30:29 openqaworker4 worker[1865]:             t_finished         => undef,
Jan 21 03:30:29 openqaworker4 worker[1865]:             t_started          => undef,
Jan 21 03:30:29 openqaworker4 worker[1865]:             test               => "xfce",
Jan 21 03:30:29 openqaworker4 worker[1865]:           },
Jan 21 03:30:29 openqaworker4 worker[1865]:   type => "grab_job",
Jan 21 03:30:29 openqaworker4 worker[1865]: }
Jan 21 03:30:35 openqaworker4 systemd[1]: Stopped openQA Worker #8.
-- Reboot --
Jan 21 03:34:41 openqaworker4 systemd[1]: Starting openQA Worker #8...
Jan 21 03:34:41 openqaworker4 systemd[1]: Started openQA Worker #8.
Jan 21 03:34:45 openqaworker4 worker[1861]: [info] worker 8:
Jan 21 03:34:45 openqaworker4 worker[1861]:  - config file:           /etc/openqa/workers.ini
…
Jan 21 03:34:45 openqaworker4 worker[1861]: [info] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa1-opensuse
Jan 21 03:34:45 openqaworker4 worker[1861]: [info] Project dir for host http://openqa1-opensuse is /var/lib/openqa/share
Jan 21 03:34:45 openqaworker4 worker[1861]: [info] Registering with openQA http://openqa1-opensuse
Jan 21 03:34:45 openqaworker4 worker[1861]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/167
Jan 21 03:34:45 openqaworker4 worker[1861]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 167
Jan 21 03:34:56 openqaworker4 worker[1861]: [info] +++ setup notes +++
Jan 21 03:34:56 openqaworker4 worker[1861]: [info] Start time: 2020-01-21 02:34:56
Jan 21 03:34:56 openqaworker4 worker[1861]: [info] Running on openqaworker4:8 (Linux 4.12.14-lp151.28.36-default #1 SMP Fri Dec 6 13:50:27 UTC 2019 (8f4a495) x86_64)

To me it sounds like:

  • the worker was working on 1150035
  • systemd service was asked to stop due to automatic nightly deployment
  • worker was shutting down until 03:30:12
  • worker was receiving a message from webui/scheduler at 03:30:29 to grab a job which the worker ignored
  • the webui didn't take the rejection lightly
  • worker stopped

Is this a race condition in the shutdown, so that the worker is still accepting messages while shutting down?

Actions #5

Updated by okurz over 4 years ago

  • Related to action #62441: openqa-worker systemd service can timeout when stopping added
Actions #6

Updated by okurz over 4 years ago

https://openqa.suse.de/tests/3817376 shows a different issue, but it is also related to the planned termination of the worker instance. The openQA job has no logs. The system journal has the following:

Jan 22 11:22:12 malbec worker[37377]: [info] [pid:43388] 43388: WORKING 3817376
Jan 22 11:22:12 malbec worker[37377]: [debug] [pid:43388] +++ worker notes +++
Jan 22 11:22:12 malbec worker[37377]: [debug] [pid:43388] Start time: 2020-01-22 10:22:12
Jan 22 11:22:12 malbec worker[37377]: [debug] [pid:43388] Running on malbec:1 (Linux 4.12.14-lp151.28.36-default #1 SMP Fri Dec 6 13:50:27 UTC 2019 (8f4a495>
Jan 22 11:22:12 malbec worker[37377]: [info] [pid:37377] isotovideo has been started (PID: 43388)
Jan 22 11:22:12 malbec worker[37377]: [debug] [pid:37377] Running job 3817376 from openqa.suse.de: 03817376-sle-15-Server-DVD-Incidents-Install-ppc64le-Buil>
Jan 22 11:22:12 malbec worker[37377]: [debug] [pid:37377] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3817376/status
…
Jan 22 11:37:20 malbec worker[37377]: [debug] [pid:37377] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3817376/status
Jan 22 11:37:30 malbec worker[37377]: [warn] [pid:37377] Websocket connection to http://openqa.suse.de/api/v1/ws/885 finished by remote side with code 1006,>
Jan 22 11:37:30 malbec worker[37377]: [debug] [pid:37377] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3817376/status
Jan 22 11:37:30 malbec worker[37377]: [error] [pid:37377] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3817376/status): 503 response: Service Unav>
Jan 22 11:37:40 malbec worker[37377]: [info] [pid:37377] Registering with openQA openqa.suse.de
Jan 22 11:37:41 malbec worker[37377]: [info] [pid:37377] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/885
Jan 22 11:37:43 malbec worker[37377]: [info] [pid:37377] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 885
Jan 22 11:38:04 malbec worker[37377]: [info] [pid:37377] Quit due to signal TERM
Jan 22 11:38:04 malbec worker[37377]: [debug] [pid:37377] Stopping job 3817376 from openqa.suse.de: 03817376-sle-15-Server-DVD-Incidents-Install-ppc64le-Bui>
Jan 22 11:38:04 malbec systemd[1]: Stopping openQA Worker #1...
Jan 22 11:38:30 malbec worker[37377]: [debug] [pid:37377] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3817376/status
Jan 22 11:38:30 malbec worker[37377]: [info] [pid:37377] Trying to stop job gracefully by announcing it to command server via http://localhost:20013/tCDEfK1>
Jan 22 11:38:35 malbec worker[37377]: [info] [pid:37377] Isotovideo exit status: 0
Jan 22 11:38:35 malbec worker[37377]: [debug] [pid:37377] Registered process:43434
Jan 22 11:38:36 malbec worker[37377]: [debug] [pid:37377] Registered process:43435
Jan 22 11:38:43 malbec worker[37377]: [info] [pid:37377] +++ worker notes +++
Jan 22 11:38:43 malbec worker[37377]: [info] [pid:37377] End time: 2020-01-22 10:38:43
Jan 22 11:38:43 malbec worker[37377]: [info] [pid:37377] Result: quit
Jan 22 11:38:43 malbec worker[37377]: [debug] [pid:37377] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3817376/status
Jan 22 11:38:43 malbec worker[37377]: [debug] [pid:37377] Job 3817376 stopped as incomplete
Jan 22 11:38:43 malbec worker[37377]: [debug] [pid:37377] duplicating job 3817376
Jan 22 11:38:43 malbec worker[37377]: [debug] [pid:37377] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3817376/duplicate?dup_type_auto=1
Jan 22 11:38:45 malbec worker[37377]: [debug] [pid:37377] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3817376/status
Jan 22 11:38:47 malbec worker[37377]: [error] [pid:47101] REST-API error (POST http://openqa.suse.de/api/v1/jobs/3817376/status): Connection error: Prematur>
Jan 22 11:38:47 malbec worker[37377]: [debug] [pid:47101] Optimizing /var/lib/openqa/pool/1/testresults/update_install-107.png
Jan 22 11:38:47 malbec worker[37377]: [debug] [pid:47101] Uploading artefact update_install-107.png as cf2bea1b84bfd651f015672b9f69c84d
…
Jan 22 11:39:25 malbec worker[37377]: [debug] [pid:47105] Uploading artefact update_install-167.txt
Jan 22 11:39:25 malbec worker[37377]: [debug] [pid:37377] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3817376/set_done?result=incomplete
Jan 22 11:39:25 malbec worker[37377]: [debug] [pid:37377] Job 3817376 from openqa.suse.de finished - reason: quit
Jan 22 11:39:25 malbec systemd[1]: Stopped openQA Worker #1.
Jan 22 11:39:25 malbec systemd[1]: Starting openQA Worker #1...
Jan 22 11:39:25 malbec systemd[1]: Started openQA Worker #1.
  • looks like at 11:37:30 the webUI restarted but that was handled gracefully by re-registering
  • At 11:38:04 the worker was asked to terminate
  • uploads were still ongoing until 11:39:25
  • worker stopped but never uploaded autoinst-log.txt or worker-log.txt
  • It does not look like any processes were receiving a KILL signal from systemd like in #62441

Open questions for me are:

  1. Why did it not upload the main log files?
  2. How to treat an orderly termination of openQA worker instances in general? Should we stop immediately without logs? Abort as incomplete/failed but upload all logs we have including autoinst-log.txt at the time? Give time as long as needed to finish the complete openQA job?
Actions #7

Updated by okurz over 4 years ago

  • Parent task set to #62420
Actions #8

Updated by mkittler over 4 years ago

  1. When receiving TERM the worker so far only marks the job as incomplete but doesn't upload results (unless it is already uploading as you have observed in https://progress.opensuse.org/issues/62441). It is of course open for discussion what the best behavior for the worker is when receiving SIGTERM which leads to the next point.
  2. I would say the worker should only try to mark the job as incomplete, as it does now. To avoid https://progress.opensuse.org/issues/62441 there should not be an exception when the job is already uploading. When we implement forwarding the reason for incomplete jobs (https://progress.opensuse.org/issues/62420) it will be clear why the job is incomplete.

Note that we could still implement a "softer" way to restart a worker (not triggered by SIGTERM). In this case we might even finish the current job.

Actions #9

Updated by okurz about 4 years ago

https://gitlab.suse.de/openqa/auto-review/-/jobs/165788 shows the current statistics of incompletes on osd, 333 incompletes referencing this ticket. But https://openqa.suse.de/tests/3848722 is an example where the "reason" now contains helpful information: "api failure: 400 response: Got status update for job 3848722 and worker 476 but there is not even a worker assigned to this job (job is scheduled)".

Now we can have more helpful queries:

openqa=> select id,state,result,left(result_dir, 40),logs_present,left(reason, 60) from jobs where (result='incomplete' and t_finished >= (NOW() - interval '12 hour'));
   id    | state |   result   |                   left                   | logs_present |                             left                             
---------+-------+------------+------------------------------------------+--------------+--------------------------------------------------------------
 3848522 | done  | incomplete |                                          | t            | api failure: 400 response: Got status update for job 3848522
 3848265 | done  | incomplete | 03848265-sle-15-SP2-Online-x86_64-Build1 | t            | setup failure: Failed to download SLES-15-SP2-x86_64-133.1@6
 3848304 | done  | incomplete | 03848304-sle-15-SP2-Online-x86_64-Build1 | t            | died
 3848722 | done  | incomplete |                                          | t            | api failure: 400 response: Got status update for job 3848722
 3849021 | done  | incomplete | 03849021-sle-15-SP2-Online-x86_64-Build1 | t            | died
 3849124 | done  | incomplete | 03849124-sle-15-SP2-Online-aarch64-Build | t            | setup failure: Failed to download upgrade_from-15-sle-15-SP2
 3848934 | done  | incomplete | 03848934-sle-15-SP2-Online-x86_64-Build1 | t            | died
 3848625 | done  | incomplete | 03848625-sle-15-SP2-Regression-on-Migrat | t            | died
 3849558 | done  | incomplete | 03849558-sle-12-SP5-Server-DVD-Updates-x | t            | setup failure: Failed to download SLES-12-SP5-x86_64-qam-dra
 3849051 | done  | incomplete | 03849051-sle-15-SP2-Online-x86_64-Build1 | t            | died
 3847619 | done  | incomplete | 03847619-sle-15-SP2-Online-ppc64le-Build | t            | died
 3849495 | done  | incomplete | 03849495-sle-12-SP5-Server-DVD-Updates-x | t            | setup failure: Failed to download SLES-12-SP5-x86_64-qam-dra
 3850098 | done  | incomplete | 03850098-sle-15-SP2-Online-G-Staging-x86 | t            | died
 3841921 | done  | incomplete |                                          | t            | api failure: 400 response: Got status update for job 3841921
 3841999 | done  | incomplete |                                          | t            | api failure: 400 response: Got status update for job 3841999
 3849123 | done  | incomplete | 03849123-sle-15-SP2-Online-aarch64-Build | t            | setup failure: Failed to download upgrade_from-15-sle-15-SP2
 3844436 | done  | incomplete | 03844436-sle-15-SP2-Online-x86_64-Build1 | t            | setup failure: Failed to download SLES-15-SP2-x86_64-131.1@6
 3847238 | done  | incomplete |                                          | t            | api failure: 400 response: Got status update for job 3847238
 3850294 | done  | incomplete | 03850294-sle-12-SP5-Server-DVD-Updates-x | t            | setup failure: Failed to download sle-12-SP5-x86_64-20200131
 3850280 | done  | incomplete | 03850280-sle-15-SP2-Online-x86_64-Build1 | t            | setup failure: Failed to download upgrade_from-12-SP5-sle-15
 3850260 | done  | incomplete | 03850260-sle-15-SP2-Online-x86_64-Build1 | t            | setup failure: Failed to download upgrade_from-15-SP1-sle-15
 3845760 | done  | incomplete | 03845760-sle-12-SP5-Server-DVD-HA-Incide | t            | setup failure: Failed to download sle-12-SP5-x86_64-Build:13
 3850274 | done  | incomplete | 03850274-sle-15-SP2-Online-aarch64-Build | t            | setup failure: Failed to download upgrade_from-15-SP1-sle-15
 3850339 | done  | incomplete | 03850339-sle-15-SP2-Full-ppc64le-Build13 | t            | setup failure: Failed to download upgrade_from-15-SP1-sle-15
 3850336 | done  | incomplete | 03850336-sle-15-SP2-Full-aarch64-Build13 | t            | setup failure: Failed to download upgrade_from-15-sle-15-SP2
 3850385 | done  | incomplete | 03850385-sle-15-SP2-Online-ppc64le-Build | t            | setup failure: Failed to download upgrade_from-12-SP5-sle-15
 3845089 | done  | incomplete | 03845089-sle-15-SP2-Online-x86_64-Build1 | t            | setup failure: Failed to download SLES-15-SP2-x86_64-131.1@6
 3847930 | done  | incomplete | 03847930-sle-15-SP2-Online-x86_64-Build1 | t            | api failure: 400 response: Got status update for job 3847930
 3850514 | done  | incomplete | 03850514-sle-15-SP2-Online-ppc64le-Build | t            | died
 3850523 | done  | incomplete | 03850523-sle-15-SP2-Online-ppc64le-Build | t            | died
(30 rows)

What we can also see is that logs_present is true everywhere, but result_dir is empty when we are talking about "no logs at all". So we can query with:

openqa=> select id,reason from jobs where result='incomplete' and result_dir is null and reason is not null;
   id    |                                                                            reason                                                                             
---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------
 3850726 | api failure: 400 response: Got status update for job 3850726 with unexpected worker ID 1200 (expected no updates anymore, job is done with result incomplete)
 3847609 | api failure: 400 response: Got status update for job 3847609 and worker 1048 but there is not even a worker assigned to this job (job is scheduled)
 3848522 | api failure: 400 response: Got status update for job 3848522 and worker 1030 but there is not even a worker assigned to this job (job is scheduled)
 3848722 | api failure: 400 response: Got status update for job 3848722 and worker 476 but there is not even a worker assigned to this job (job is scheduled)
 3848666 | api failure: 400 response: Got status update for job 3848666 and worker 1261 but there is not even a worker assigned to this job (job is scheduled)
 3849130 | api failure: 400 response: Got status update for job 3849130 and worker 955 but there is not even a worker assigned to this job (job is scheduled)
 3841921 | api failure: 400 response: Got status update for job 3841921 and worker 399 but there is not even a worker assigned to this job (job is scheduled)
 3841999 | api failure: 400 response: Got status update for job 3841999 with unexpected worker ID 1269 (expected no updates anymore, job is done with result incomplete)
 3847238 | api failure: 400 response: Got status update for job 3847238 and worker 534 but there is not even a worker assigned to this job (job is scheduled)
(9 rows)

To me this still looks like #60866.
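
As a side note, to keep an eye on how many of these "no logs at all" incompletes accumulate over time one could group them per day, e.g. with something along these lines (just a sketch against the same jobs table and columns used above):

-- count "no logs at all" incompletes per day (sketch)
select date_trunc('day', t_finished) as day, count(*)
  from jobs
 where result = 'incomplete' and result_dir is null and reason is not null
 group by day
 order by day desc;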

Actions #10

Updated by asmorodskyi about 4 years ago

@mkittler pointed me to this issue with this:

[2020-01-31T18:23:09.709 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/wicked/locks_init.pm:29 called testapi::record_info
[2020-01-31T18:23:09.709 CET] [debug] <<< testapi::record_info(title="barrier create", output="test_before_test_pre_run num_children: 2", result="ok")
[2020-01-31T18:23:09.709 CET] [debug] barrier create 'test_before_test_pre_run' for 2 tasks
[2020-01-31T18:23:12.577 CET] [warn] !!! lockapi::barrier_create: Unknown return code 403 for lock api
[2020-01-31T18:23:12.578 CET] [debug] barrier create 'test_before_test_post_run' for 2 tasks
[2020-01-31T18:23:14.643 CET] [warn] !!! lockapi::barrier_create: Unknown return code 403 for lock api
[2020-01-31T18:23:14.644 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/wicked/locks_init.pm:33 called testapi::record_info
[2020-01-31T18:23:14.644 CET] [debug] <<< testapi::record_info(title="barrier create", output="test_before_test_post_run num_children: 2", result="ok")
[2020-01-31T18:23:14.644 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/wicked/locks_init.pm:29 called testapi::record_info
[2020-01-31T18:23:14.644 CET] [debug] <<< testapi::record_info(title="barrier create", output="test_t01_bonding_balance_rr_pre_run num_children: 2", result="ok")
[2020-01-31T18:23:14.644 CET] [debug] barrier create 'test_t01_bonding_balance_rr_pre_run' for 2 tasks
[2020-01-31T18:23:15.420 CET] [warn] !!! lockapi::barrier_create: Unknown return code 403 for lock api
[2020-01-31T18:23:15.420 CET] [debug] barrier create 'test_t01_bonding_balance_rr_post_run' for 2 tasks
[2020-01-31T18:23:16.196 CET] [warn] !!! lockapi::barrier_create: Unknown return code 403 for lock api
[2020-01-31T18:23:16.196 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/wicked/locks_init.pm:33 called testapi::record_info
[2020-01-31T18:23:16.196 CET] [debug] <<< testapi::record_info(title="barrier create", output="test_t01_bonding_balance_rr_post_run num_children: 2", result="ok")
[2020-01-31T18:23:16.196 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/wicked/locks_init.pm:29 called testapi::record_info
[2020-01-31T18:23:16.196 CET] [debug] <<< testapi::record_info(title="barrier create", output="test_t02_bonding_ab_pre_run num_children: 2", result="ok")
[2020-01-31T18:23:16.197 CET] [debug] barrier create 'test_t02_bonding_ab_pre_run' for 2 tasks

see full log here https://openqa.suse.de/tests/3853129/file/autoinst-log.txt

Actions #11

Updated by mkittler about 4 years ago

@asmorodskyi After looking at the full log your issue seems different after all because in your case status updates were always accepted. So the relation between worker and job isn't completely broken.

Actions #12

Updated by mkittler about 4 years ago

  • Status changed from New to In Progress

I've been looking at incompletes on o3 and there are some with the reason "associated worker has not sent any status updates for too long". These are jobs hit by the stale job detection and are therefore restarted automatically. It seems like the worker was really just killed by systemd when restarting:

Feb 04 03:13:31 openqaworker1 worker[2718]: [info] Test schedule has changed, reloading test_order.json
Feb 04 03:30:00 openqaworker1 worker[2718]: [info] Quit due to signal TERM
Feb 04 03:30:00 openqaworker1 worker[2718]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20023/a6NUixhtP1TAtRhF/broadcast
Feb 04 03:30:00 openqaworker1 systemd[1]: Stopping openQA Worker #2...
Feb 04 03:30:01 openqaworker1 worker[2718]: [info] Isotovideo exit status: 1
Feb 04 03:30:06 openqaworker1 worker[2718]: [info] +++ worker notes +++
Feb 04 03:30:06 openqaworker1 worker[2718]: [info] End time: 2020-02-04 02:30:06
Feb 04 03:30:06 openqaworker1 worker[2718]: [info] Result: quit
Feb 04 03:30:08 openqaworker1 worker[2718]: [warn] Can't open /var/lib/openqa/pool/2/testresults/result-oomath.json for result upload - likely isotovideo could not be started or failed early. Error message: No such file or directory
Feb 04 03:31:30 openqaworker1 systemd[1]: openqa-worker@2.service: State 'stop-sigterm' timed out. Killing.
Feb 04 03:31:30 openqaworker1 systemd[1]: openqa-worker@2.service: Killing process 2718 (worker) with signal SIGKILL.
Feb 04 03:31:30 openqaworker1 systemd[1]: openqa-worker@2.service: Killing process 19304 (worker) with signal SIGKILL.
Feb 04 03:31:30 openqaworker1 systemd[1]: openqa-worker@2.service: Main process exited, code=killed, status=9/KILL
Feb 04 03:31:30 openqaworker1 systemd[1]: Stopped openQA Worker #2.
Feb 04 03:31:30 openqaworker1 systemd[1]: openqa-worker@2.service: Unit entered failed state.
Feb 04 03:31:30 openqaworker1 systemd[1]: openqa-worker@2.service: Failed with result 'timeout'.
-- Reboot --
Feb 04 03:35:07 openqaworker1 systemd[1]: Starting openQA Worker #2...
Feb 04 03:35:07 openqaworker1 systemd[1]: Started openQA Worker #2.
Feb 04 03:35:11 openqaworker1 worker[2738]: [info] worker 2:
Feb 04 03:35:11 openqaworker1 worker[2738]:  - config file:           /etc/openqa/workers.ini
Feb 04 03:35:11 openqaworker1 worker[2738]:  - worker hostname:       openqaworker1
Feb 04 03:35:11 openqaworker1 worker[2738]:  - isotovideo version:    19
Feb 04 03:35:11 openqaworker1 worker[2738]:  - websocket API version: 1
Feb 04 03:35:11 openqaworker1 worker[2738]:  - web UI hosts:          http://openqa1-opensuse
Feb 04 03:35:11 openqaworker1 worker[2738]:  - class:                 openqaworker1,qemu_x86_64,pool_is_hdd,heavyload,kubic_x86_64,tap
Feb 04 03:35:11 openqaworker1 worker[2738]:  - no cleanup:            no
Feb 04 03:35:11 openqaworker1 worker[2738]:  - pool directory:        /var/lib/openqa/pool/2
Feb 04 03:35:11 openqaworker1 worker[2738]: [info] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa1-opensuse
Feb 04 03:35:11 openqaworker1 worker[2738]: [info] Project dir for host http://openqa1-opensuse is /var/lib/openqa/share
Feb 04 03:35:11 openqaworker1 worker[2738]: [info] Registering with openQA http://openqa1-opensuse
Feb 04 03:35:11 openqaworker1 worker[2738]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/52
Feb 04 03:35:11 openqaworker1 worker[2738]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 52

It is still weird that the reason is not "associated worker re-connected but abandoned the job" because the worker reconnected very soon, but that should be fixed with https://github.com/os-autoinst/openQA/pull/2723. On the next deployment of o3 we should be able to see whether it worked (there shouldn't be jobs with "associated worker has not sent any status updates for too long" except when one really permanently disabled a worker slot and the running job got killed like in the log above).

Actions #13

Updated by okurz about 4 years ago

  • Assignee set to mkittler

@mkittler would you say that https://github.com/os-autoinst/openQA/pull/2723 helped to show that your assumptions were right? Is the specific problem gone?

Actions #14

Updated by mkittler about 4 years ago

TL;DR: We still sometimes have jobs with the reason "associated worker has not sent any status updates for too long" although the worker has just been restarted. This happens when the worker is just processing the job acceptance.


The following query returns currently only 38 results on o3:

select jobs.id, t_started, t_finished, workers.host as worker_host, workers.instance as worker_instance, reason from jobs join workers on assigned_worker_id=workers.id where result = 'incomplete' and reason like '%associated worker has not sent any status updates for too%' order by id desc limit 100;

The last job with the reason "associated worker has not sent any status updates for too long" was from 2 weeks ago and in general the number of jobs with that reason isn't very high. To say whether these jobs are handled correctly I need to look into the worker log.
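
A quick per-host breakdown for that reason can be obtained with a similar query (again just a sketch, reusing the join from the query above):

-- count jobs with that reason per worker host (sketch)
select workers.host as worker_host, count(*)
  from jobs join workers on assigned_worker_id=workers.id
 where result = 'incomplete' and reason like '%associated worker has not sent any status updates for too long%'
 group by workers.host
 order by count(*) desc;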

I just took one job as an example for now (https://openqa.opensuse.org/tests/1196814). The worker was restarted while the job acceptance was still ongoing. Likely the worker was able to tell the web UI that it was going offline. Then it took almost 4 minutes for the worker to register again. It is possible, and I'd say correct behavior, that the web UI might set the job to incomplete before the worker re-registers - especially when that takes minutes. So it is ok that we don't see "associated worker re-connected but abandoned the job" but "associated worker has not sent any status updates for too long" from the web UI's perspective. The message is of course misleading and normally we would expect that the reason is "quit" in such a case.

However, the log shows that the worker was just accepting the job here. After having a brief look at the code it seems possible that the reason is not set in this case and the worker just gives up the job. Likely the other cases are similar. So my PR doesn't entirely prevent "associated worker has not sent any status updates for too long", even though the worker came back before WORKERS_CHECKER_THRESHOLD. Not sure whether it makes sense to investigate the issue further. These cases are quite rare.

Actions #15

Updated by okurz about 4 years ago

Fine, I am ok with calling this ticket "Resolved" when you are. I am convinced we did the most important steps successfully.

Actions #16

Updated by mkittler about 4 years ago

  • Status changed from In Progress to Resolved

With the tracking of the "reason" we should at least be able to categorize these jobs. I suppose there's still room for improvement, but at least we are no longer completely clueless about these jobs (without having to dig into logs).
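
A simple way to do that categorization would be to group incompletes by the beginning of their reason, e.g. like this (just a sketch using the columns from the earlier queries):

-- rough categorization of incompletes by reason prefix (sketch)
select left(reason, 40) as reason_prefix, count(*)
  from jobs
 where result = 'incomplete' and reason is not null
 group by reason_prefix
 order by count(*) desc;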

Actions #17

Updated by szarate over 3 years ago

  • Tracker changed from action to coordination