Project

General

Profile

action #91902

Tests incomplete with reason "Failed modules: …"

Added by mkittler 3 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2021-04-28
Due date:
2021-05-13
% Done:

0%

Estimated time:
Difficulty:

Description

observation

Tests incomplete with reason "Failed modules: …". This doesn't happen very often but was observed a few times on o3 and OSD, e.g.:

openqa=> select id, t_finished, state, result, reason from jobs where reason like '%Failed modules:%' order by id desc limit 10;
   id    |     t_finished      | state |   result   |                        reason                         
---------+---------------------+-------+------------+-------------------------------------------------------
 1717042 | 2021-04-28 08:34:31 | done  | incomplete | api failure: 400 response: Failed modules: bootloader
 1716952 | 2021-04-28 03:38:04 | done  | incomplete | api failure: 400 response: Failed modules: bootloader
 1700877 | 2021-04-15 23:08:05 | done  | incomplete | api failure: 400 response: Failed modules: bootloader
(3 Zeilen)
openqa=> select id, t_finished, reason from jobs where reason like '%Failed modules:%' order by id desc limit 10;
   id    |     t_finished      |                                 reason                                  
---------+---------------------+-------------------------------------------------------------------------
 5900633 | 2021-04-27 02:13:00 | api failure: 400 response: Failed modules: bootloader_zkvm
 5853177 | 2021-04-20 07:31:29 | api failure: 400 response: Failed modules: boot_to_desktop
 5853172 | 2021-04-20 07:31:18 | api failure: 400 response: Failed modules: boot_to_desktop
 5845784 | 2021-04-19 13:29:29 | api failure: 400 response: Failed modules: boot_to_desktop
 5825710 | 2021-04-15 07:58:48 | api failure: 400 response: Failed modules: version_switch_origin_system
 5791697 | 2021-04-08 10:42:19 | api failure: 400 response: Failed modules: version_switch_origin_system
 5789446 | 2021-04-07 18:58:41 | api failure: 400 response: Failed modules: boot_windows
 5789444 | 2021-04-07 18:58:38 | api failure: 400 response: Failed modules: boot_windows
 5781982 | 2021-04-08 07:32:30 | api failure: 400 response: Failed modules: boot_to_desktop

Note that the worker log isn't very verbose:

Apr 28 08:34:25 imagetester worker[11572]: [info] Finished to rsync tests
Apr 28 08:34:25 imagetester worker[11572]: [info] Preparing cgroup to start isotovideo
Apr 28 08:34:25 imagetester worker[11572]: [info] Using cgroup /sys/fs/cgroup/systemd/openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@2.service/1717042
Apr 28 08:34:25 imagetester worker[11572]: [info] Starting isotovideo container
Apr 28 08:34:25 imagetester worker[11572]: [info] 14520: WORKING 1717042
Apr 28 08:34:25 imagetester worker[11572]: [info] isotovideo has been started (PID: 14520)
Apr 28 08:34:31 imagetester worker[11572]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1717042/status): 400 response: Failed modules: bootloader (remaining tries: 0)
Apr 28 08:34:31 imagetester worker[11572]: [error] Unable to make final image uploads. Maybe the web UI considers this job already dead.
Apr 28 08:34:31 imagetester worker[11572]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20023/8fm_RpciitjX9Cjo/broadcast
Apr 28 08:34:31 imagetester worker[11572]: [info] Isotovideo exit status: 1
Apr 28 08:34:31 imagetester worker[11572]: [info] +++ worker notes +++
Apr 28 08:34:31 imagetester worker[11572]: [info] End time: 2021-04-28 08:34:31
Apr 28 08:34:31 imagetester worker[11572]: [info] Result: api-failure

The jobs themselves actually show the test modules including bootloader (e.g. https://openqa.opensuse.org/tests/1717042). So uploading the test order did at least eventually work.

suggestions

  1. The exact error message has been introduced intentionally by https://github.com/os-autoinst/openQA/pull/3813 to track down upload issues. The error condition was previously silently ignored. Maybe it makes sense to allow the worker to retry here which would be a compromise between the new and the old behavior.
  2. Maybe there's also something wrong within the worker (or isotovideo) code. The expectation is that the test order is always posted before or at the same time any results are posted. Maybe that wasn't the case here and therefore posting the result here fails. In this case 1. would likely still be an easy fix (considering that the test order could eventually be uploaded).

Related issues

Related to openQA Project - action #90152: module results missing on quick job (on auto-restarting worker)Resolved2021-03-16

History

#1 Updated by mkittler 3 months ago

Suggestion 1. would actually be fairly simple to implement: https://github.com/os-autoinst/openQA/pull/3868

#2 Updated by okurz 3 months ago

  • Related to action #90152: module results missing on quick job (on auto-restarting worker) added

#3 Updated by okurz 3 months ago

  • Target version set to Ready

I think you could consider this still part of your work on #90152 , isn't it?

#4 Updated by openqa_review 3 months ago

  • Due date set to 2021-05-13

Setting due date based on mean cycle time of SUSE QE Tools

#5 Updated by mkittler 3 months ago

  • Status changed from New to Feedback

The PR implementing suggestion 1 has been merged. Let's whether such incompletes now disappear in production.

Since the ticket has been moved to ready I can also dig a little bit within the code to find out why it actually behaves like that and maybe we can find a better fix which actually tackles the root cause.

#6 Updated by mkittler 3 months ago

There are not further occurrences on OSD/o3. (There's actually one more occurrence on OSD but the PR has only been deployed on OSD since this morning and it happened before.)

Maybe the root cause is an error (which for some reason doesn't show up in the log) when posting the test order. The error could be on the client side or on the server side. Since we don't post the test order again and again on every single status update we might end up uploading results for test modules the web UI doesn't know about. Likely the re-try solution is nevertheless sufficient. I'll check OSD/o3 again in a few days to confirm.

#7 Updated by mkittler 3 months ago

There were 2 more jobs matching the query from the ticket description on o3.

Here it is pretty clear that posting the status didn't work. Since it is a 403 error we don't retry. The subsequent errors are only a consequence of the failing status post:

Mai 06 19:18:21 openqaworker7 worker[5293]: [info] isotovideo has been started (PID: 5711)
Mai 06 19:18:21 openqaworker7 worker[5293]: [info] 5711: WORKING 1729840
Mai 06 19:26:42 openqaworker7 worker[5293]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1729840/status): 403 response: timestamp mismatch (remaining tries: 0)
Mai 06 19:26:42 openqaworker7 worker[5293]: [error] Unable to make final image uploads. Maybe the web UI considers this job already dead.
Mai 06 19:30:15 openqaworker7 worker[5293]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20023/d_RCcPiPEhsSRaTG/broadcast
Mai 06 19:30:16 openqaworker7 worker[5293]: [info] Isotovideo exit status: 1
Mai 06 19:30:16 openqaworker7 worker[5293]: [info] +++ worker notes +++
Mai 06 19:30:16 openqaworker7 worker[5293]: [info] End time: 2021-05-06 17:30:16
Mai 06 19:30:16 openqaworker7 worker[5293]: [info] Result: api-failure
Mai 06 19:30:40 openqaworker7 worker[5293]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1729840/status): 490 response: Failed modules: bootloader, isosize (remaining tries: 59)
Mai 06 19:32:41 openqaworker7 worker[5293]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1729840/status): 490 response: Failed modules: bootloader, isosize (remaining tries: 58)
Mai 06 19:32:56 openqaworker7 worker[5293]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1729840/status): 490 response: Failed modules: bootloader, isosize (remaining tries: 57)
…
Mai 06 19:56:39 openqaworker7 worker[5293]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1729840/status): 490 response: Failed modules: bootloader, isosize (remaining tries: 2)
Mai 06 19:56:49 openqaworker7 worker[5293]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1729840/status): 490 response: Failed modules: bootloader, isosize (remaining tries: 1)
Mai 06 19:56:54 openqaworker7 worker[5293]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1729840/status): 490 response: Failed modules: bootloader, isosize (remaining tries: 0)
Mai 06 19:56:54 openqaworker7 worker[5293]: [error] Unable to make final image uploads. Maybe the web UI considers this job already dead.
Mai 06 19:56:54 openqaworker7 worker[5293]: [info] Registering with openQA http://openqa1-opensuse
Mai 06 19:56:55 openqaworker7 worker[5293]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/188
Mai 06 19:56:55 openqaworker7 worker[5293]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 188
Mai 06 19:56:55 openqaworker7 worker[5293]: Use of uninitialized value $reason in concatenation (.) or string at /usr/share/openqa/script/../lib/OpenQA/Worker/WebUIConnection.pm line 212.
Mai 06 19:56:55 openqaworker7 worker[5293]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/188 finished by remote side with code 1005,  - trying again in 10 seconds
Mai 06 19:56:55 openqaworker7 systemd[1]: openqa-worker-auto-restart@2.service: Service RestartSec=100ms expired, scheduling restart.

In the other occurrence posting the status failed as well but for a different reason (which is actually rather strange):

Mai 07 15:44:16 openqaworker7 worker[25509]: [info] isotovideo has been started (PID: 26578)
Mai 07 15:44:16 openqaworker7 worker[25509]: [info] 26578: WORKING 1730907
Mai 07 15:44:21 openqaworker7 worker[25509]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1730907/status): 400 response: can't mkdir /var/lib/openqa/testresults/01730/01730907-opensuse-Tumbleweed-JeOS-for-kvm-and-xen-x86_64-Build20210505-jeos-containers@64bit_virtio: File exists at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 1260.
Mai 07 15:44:21 openqaworker7 worker[25509]:  (remaining tries: 0)
Mai 07 15:44:21 openqaworker7 worker[25509]: [error] Unable to make final image uploads. Maybe the web UI considers this job already dead.
Mai 07 15:44:21 openqaworker7 worker[25509]: [error] REST-API error (POST http://openqa1-opensuse/api/v1/jobs/1730907/status): 490 response: Failed modules: boot_to_desktop (remaining tries: 59)
Mai 07 15:44:21 openqaworker7 worker[25509]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20083/_sdtYjp_UeMfsyEp/broadcast
Mai 07 15:44:22 openqaworker7 worker[25509]: [info] Isotovideo exit status: 1

Note that these problems are not a regression of https://github.com/os-autoinst/openQA/pull/3813 anymore. Not considering this an error wouldn't had prevented these jobs from being incompletes. However, it is problematic that the reason is overridden. At least in this particular situation it would make more sense if the first API error would be passed as reason.

There's also a Perl warning which is a recent regression of another change, see https://github.com/os-autoinst/openQA/pull/3893.

#8 Updated by okurz 2 months ago

Note that these problems are not a regression of https://github.com/os-autoinst/openQA/pull/3813 anymore. Not considering this an error wouldn't had prevented these jobs from being incompletes. However, it is problematic that the reason is overridden. At least in this particular situation it would make more sense if the first API error would be passed as reason.

There's also a Perl warning which is a recent regression of another change, see https://github.com/os-autoinst/openQA/pull/3893.

https://github.com/os-autoinst/openQA/pull/3893 is merged. What are your further plans now?

#9 Updated by okurz 2 months ago

  • Status changed from Feedback to Resolved

As discussed we consider ourselves done here. I crosschecked again that there are no new jobs with the original symptoms:

$ psql -d postgresql://XXX:XXX@openqa.nue.suse.com/openqa -c "select id, t_finished, state, result, reason from jobs where reason like '%Failed modules:%' order by id desc limit 10;"
   id    |     t_finished      | state |   result   |                                 reason                                  
---------+---------------------+-------+------------+-------------------------------------------------------------------------
 5924300 | 2021-04-30 07:56:29 | done  | incomplete | api failure: 400 response: Failed modules: bootloader
 5900633 | 2021-04-27 02:13:00 | done  | incomplete | api failure: 400 response: Failed modules: bootloader_zkvm
 5853177 | 2021-04-20 07:31:29 | done  | incomplete | api failure: 400 response: Failed modules: boot_to_desktop
 5853172 | 2021-04-20 07:31:18 | done  | incomplete | api failure: 400 response: Failed modules: boot_to_desktop
 5845784 | 2021-04-19 13:29:29 | done  | incomplete | api failure: 400 response: Failed modules: boot_to_desktop
 5825710 | 2021-04-15 07:58:48 | done  | incomplete | api failure: 400 response: Failed modules: version_switch_origin_system
 5791697 | 2021-04-08 10:42:19 | done  | incomplete | api failure: 400 response: Failed modules: version_switch_origin_system
 5789446 | 2021-04-07 18:58:41 | done  | incomplete | api failure: 400 response: Failed modules: boot_windows
 5789444 | 2021-04-07 18:58:38 | done  | incomplete | api failure: 400 response: Failed modules: boot_windows
 5781982 | 2021-04-08 07:32:30 | done  | incomplete | api failure: 400 response: Failed modules: boot_to_desktop
(10 rows)

Also available in: Atom PDF