action #94531

OpenQA worker randomly skips uploading artefacts for whole test modules

Added by MDoucha over 3 years ago. Updated about 2 years ago.

Status: New
Priority: Low
Assignee: -
Category: Regressions/Crashes
Target version:
Start date: 2021-06-23
Due date:
% Done: 0%
Estimated time:

Description

Observation

Once in a while, some jobs randomly end up with one or more modules missing all screenshots and console output. The boxes do appear in the OpenQA web UI but only show "Unable to read foo-123.txt." or a blank box instead of a screenshot:
https://openqa.suse.de/tests/6308588#step/cn_pec_sh/1

worker-log.txt shows that the missing files were not uploaded at all. The worker uploaded all artefacts for update_kernel and install_ltp, but there's not a single line with "Uploading artefact cn_pec_sh-*.txt" or "Uploading artefact shutdown_ltp-*.txt" (except for log assets uploaded by calling upload_logs() in the test itself):
https://openqa.suse.de/tests/6308588/logfile?filename=worker-log.txt

I've seen this happen randomly multiple times; usually it's just one module in the middle of a test run. Everything before and after it gets uploaded correctly.

Steps to reproduce

  • Look for jobs with test module steps that include "Unable to read"

Expected result

Another job in the same scenario as the original one shows the correct expected data in its steps; https://openqa.suse.de/tests/6312792#step/cn_pec_sh/1 shows:

# wait_serial expected: "# "
# Result:

# 

Acceptance criteria

  • AC1: test module steps are uploaded again for openqa.suse.de LTP cases in general

Suggestions

  • Find fail ratio
  • Try to reproduce problem locally
  • Find out what files the openQA worker should try to upload

Out of scope

  • If we find a problem in the custom LTP runner, don't try to fix it on the openQA side but have it fixed within os-autoinst-distri-opensuse

Files

scan_details.py (1.73 KB), MDoucha, 2021-06-30 11:24
filter_details.py (776 Bytes), MDoucha, 2021-06-30 13:07
joblist.txt (8.4 KB), MDoucha, 2021-07-01 13:58
missing_results-2021-08.txt (20.2 KB), Affected jobs in the last month, MDoucha, 2021-09-01 11:45
missing_results-2021-09.txt (22 KB), Affected jobs in September, MDoucha, 2021-09-30 09:52
results.txt (12 KB), mkittler, 2022-04-12 13:09

Related issues 2 (0 open, 2 closed)

Related to openQA Project - action #102332: Unable to read *.json: Can't open file in o3 openQA logs /var/log/openqa size:M (Resolved, kraih, 2021-11-12 - 2022-01-28)

Related to openQA Project - action #102786: Text results "unable to read" when showing result page during test execution, i.e. while an openQA test is running size:M (Resolved, mkittler, 2021-11-22)

Actions #1

Updated by okurz over 3 years ago

  • Priority changed from Normal to High
  • Target version set to Ready
Actions #2

Updated by tinita over 3 years ago

It could be helpful to post more links to tests with the same behaviour (could be that there is a pattern)

Actions #3

Updated by MDoucha over 3 years ago

tinita wrote:

It could be helpful to post more links to tests with the same behaviour (could be that there is a pattern)

I would post more links if I had any, but this is maybe the third time I'm seeing this bug and the last occurrence I saw happened months ago. But you could probably find a few dozen cases by scanning existing test results with a script. The details-*.json file will have text keys but no text_data in the details array.

Actions #4

Updated by livdywan over 3 years ago

Maybe this should be self-evident, but can you clarify that https://openqa.suse.de/tests/6308588#step/cn_pec_sh/1 is expected to pass even without the missing file? What is the expected result?

From the logs it seems the upload is completely missing so my first thought would be that the file wasn't created. But from a single result it's hard to tell. I guess that's also what Tina was wondering.

Actions #5

Updated by okurz over 3 years ago

To put the information into structured form I suggest following the ticket template from https://progress.opensuse.org/projects/openqav3/wiki/#Defects

Actions #6

Updated by MDoucha over 3 years ago

cdywan wrote:

Maybe this should be self-evident, but can you clarify that https://openqa.suse.de/tests/6308588#step/cn_pec_sh/1 is expected to pass even without the missing file? What is the expected result?

Yes, this is expected to pass. The command result (pass/fail) is reported to the OpenQA database first and the console output is uploaded much later. It is expected to see a green box with "Unable to read foo-123.txt" while the test is still running, but not after it finishes.

From the logs it seems the upload is completely missing so my first thought would be that the file wasn't created. But from a single result it's hard to tell. I guess that's also what Tina was wondering.

Like I said, the easiest way to find more examples is to scan existing test results with a script. A valid details-*.json file looks like this:

{"dents":0,"details":[{"result":"ok","text":"boot_ltp-4.txt","text_data":"# wait_serial expected: qr\/login:\\s*$\/ui\n\n# Result:\n\r\nsusetest login: \n","title":"wait_serial"},...],...}

But when it's affected by this issue, the text key will be present but the text_data key will be completely missing:

{"dents":0,"details":[{"result":"ok","text":"boot_ltp-4.txt","title":"wait_serial"},...],...}
Actions #7

Updated by livdywan over 3 years ago

More thoughts:

lib/OpenQA/Task/Job/FinalizeResults.pm#L26 uploads results in a minion job.
Maybe errors are eaten here but should appear on https://openqa.suse.de/minion

Actions #8

Updated by MDoucha over 3 years ago

cdywan wrote:

More thoughts:

lib/OpenQA/Task/Job/FinalizeResults.pm#L26 uploads results in a minion job.
Maybe errors are eaten here but should appear on https://openqa.suse.de/minion

No, this is a minion task which collects the already uploaded text files and inlines them into details-*.json file for faster webpage rendering. It does not upload anything but it can detect this problem and send an alert somehow. The uploading is done by the OpenQA worker which actually runs the test job.

Actions #9

Updated by livdywan over 3 years ago

MDoucha wrote:

cdywan wrote:

More thoughts:

lib/OpenQA/Task/Job/FinalizeResults.pm#L26 uploads results in a minion job.
Maybe errors are eaten here but should appear on https://openqa.suse.de/minion

No, this is a minion task which collects the already uploaded text files and inlines them into details-*.json file for faster webpage rendering. It does not upload anything but it can detect this problem and send an alert somehow. The uploading is done by the OpenQA worker which actually runs the test job.

Thank you for the correction!

FYI @kraih is helping out by improving related code: https://github.com/os-autoinst/openQA/pull/3982. Not a fix, but it could contribute to errors not being propagated.

Actions #10

Updated by mkittler over 3 years ago

This sounds like a bug in the uploading code of the openQA worker (code in lib/OpenQA/Worker/Job.pm). Hard to tell what's going on exactly (the VPN is down so I cannot even access the referenced jobs right now). (It could be that the results are already not generated correctly by os-autoinst, though. However, that's not that likely.)

It would of course be useful to see the pool directory of such a job. Is the test order of such jobs changed at some point during the test execution? Are the test details created in a special way (e.g. parsed from an external testsuite)?

Actions #11

Updated by okurz over 3 years ago

  • Subject changed from OpenQA worker randomly skips uploading artefacts for whole test modules to OpenQA worker randomly skips uploading artefacts for whole test modules size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #12

Updated by MDoucha over 3 years ago

Here's a list of other jobs with missing console output. I've also attached two scripts that'll find more. scan_details.py finds all jobs with missing console output; you need to run it directly on OSD. It'll also print files which it can't parse properly and show progress through the results directory on standard error. filter_details.py will filter out unfinished and incomplete jobs which would most likely be false positives. Simply feed it the output of scan_details.py on standard input; you can run this script on any machine that has HTTP access to OSD. Or just run both on OSD: scan_details.py | filter_details.py >results.txt.

Edit: I've replaced filter_details.py with an improved version. See next comment.
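
Conceptually, the filtering step just asks the openQA web API whether each job really finished; a rough sketch of that idea (not the actual filter_details.py; the input format and the use of the GET /api/v1/jobs/<id> route here are assumptions) could look like this:

#!/usr/bin/env python3
# Hedged sketch: read lines whose first field is a job ID from stdin and keep
# only jobs that finished normally; unfinished or incomplete jobs are likely
# false positives. Assumes the public openQA route GET /api/v1/jobs/<id>.
import json
import sys
import urllib.request

HOST = 'https://openqa.suse.de'

for line in sys.stdin:
    job_id = line.split()[0]
    try:
        with urllib.request.urlopen(f'{HOST}/api/v1/jobs/{job_id}') as resp:
            job = json.load(resp)['job']
    except Exception:
        continue  # job not found or API temporarily unavailable
    if job.get('state') == 'done' and job.get('result') not in ('incomplete', 'user_cancelled'):
        print(line, end='')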

Now the list of jobs (just a small fraction of all affected jobs). The string after the URL is a random test module that is missing at least one console output file. Some are caused by network errors which you'll find in worker-log.txt. Those are a different issue but I didn't filter them out.

https://openqa.suse.de/tests/6351096 wireshark
https://openqa.suse.de/tests/6351244 openvswitch
https://openqa.suse.de/tests/6351363 bootcode_options
https://openqa.suse.de/tests/6351379 nis_server
https://openqa.suse.de/tests/6351394 mem03
https://openqa.suse.de/tests/6351403 if6-mtu-change_ifconfig
https://openqa.suse.de/tests/6351405 shutdown_ltp
https://openqa.suse.de/tests/6351393 clock_nanosleep02
https://openqa.suse.de/tests/6351337 setup
https://openqa.suse.de/tests/6351474 before_test
https://openqa.suse.de/tests/6351381 bootloader_start
https://openqa.suse.de/tests/6351297 barrier_init
https://openqa.suse.de/tests/6351300 barrier_init
https://openqa.suse.de/tests/6351487 setup
https://openqa.suse.de/tests/6351489 consoletest_setup
https://openqa.suse.de/tests/6351517 docker
https://openqa.suse.de/tests/6351553 glxgears
https://openqa.suse.de/tests/6351883 repos
https://openqa.suse.de/tests/6350113 ssh_interactive_init
https://openqa.suse.de/tests/6350117 ssh_interactive_init
https://openqa.suse.de/tests/6345228 install_patterns
https://openqa.suse.de/tests/6345411 fcntl36
https://openqa.suse.de/tests/6345435 epoll01
https://openqa.suse.de/tests/6342668 hana_cluster
https://openqa.suse.de/tests/6333915 quota
https://openqa.suse.de/tests/6326404 alarm06
https://openqa.suse.de/tests/6325348 settimeofday02
https://openqa.suse.de/tests/6317236 filesystem_ro
https://openqa.suse.de/tests/6317274 logs_from_installation_system
https://openqa.suse.de/tests/6308588 shutdown_ltp
https://openqa.suse.de/tests/6308958 before_test
https://openqa.suse.de/tests/6302842 install_update
https://openqa.suse.de/tests/6294460 apache
https://openqa.suse.de/tests/6294750 http_srv
https://openqa.suse.de/tests/6294612 perf
https://openqa.suse.de/tests/6294467 btrfs_send_receive
https://openqa.suse.de/tests/6294753 snapper_cleanup
https://openqa.suse.de/tests/6294755 openssl_alpn
https://openqa.suse.de/tests/6294374 t10_vlan_ifup_all_ifdown_one_card
https://openqa.suse.de/tests/6288837 procps
https://openqa.suse.de/tests/6288840 io_getevents01
https://openqa.suse.de/tests/6288224 rpm
https://openqa.suse.de/tests/6288225 gpg

Actions #13

Updated by MDoucha over 3 years ago

  • File deleted (filter_details.py)
Actions #14

Updated by MDoucha over 3 years ago

Simplified and improved filter_details.py.

Actions #15

Updated by MDoucha over 3 years ago

I ran the scripts on OSD overnight and here's the complete list of affected jobs with incomplete/cancelled/restarted jobs filtered out.

Actions #16

Updated by mkittler about 3 years ago

  • Assignee set to mkittler
Actions #17

Updated by mkittler about 3 years ago

  • Assignee deleted (mkittler)

I still haven't found out much. If no one has taken the issue by then, I'll take it again next Tuesday when I'm back from vacation.

Actions #18

Updated by mkittler about 3 years ago

  • Assignee set to mkittler
Actions #19

Updated by mkittler about 3 years ago

Additional observations:

  1. Some jobs (at least https://openqa.suse.de/tests/6288225) also miss some of the screenshots.
  2. Previous and next step results are often nevertheless present; not the whole test module is affected.
    1. This is not true for the job from the ticket description, though. Here the steps for whole modules are gone.
  3. Worker logs show the upload problems at least in some jobs, e.g. [2021-06-18T15:20:50.0791 CEST] [debug] [pid:16668] Uploading artefact gpg-4.txt followed by [2021-06-18T15:20:50.0797 CEST] [error] [pid:16668] Uploading artefact gpg-4.txt failed: Service Unavailable, but apparently no further upload attempts were made (in this example these are the only log lines containing gpg-4.txt).
    1. The job from the ticket description does not show the upload problem within the worker log, though. Here it looks like the upload hasn't even been attempted once. Maybe we're seeing two different problems here.
Actions #20

Updated by MDoucha about 3 years ago

mkittler wrote:

  1. The job from the ticket description does not show the upload problem within the worker log, though. Here it looks like the upload hasn't even been attempted once. Maybe we're seeing two different problems here.

I also strongly suspect that these are two different bugs. But fixing the lack of retries first will still help us diagnose the other bug (if it exists).

Actions #21

Updated by mkittler about 3 years ago

Worker log for the job in the ticket description:

…
[2021-06-22T17:51:33.0494 CEST] [debug] [pid:5432] Uploading artefact install_ltp-205.txt
[2021-06-22T17:51:33.0719 CEST] [debug] [pid:5368] Uploading artefact install_ltp-215.txt
[2021-06-22T17:51:33.0899 CEST] [info] [pid:5436] Uploading connectors
[2021-06-22T17:51:33.0899 CEST] [info] [pid:5436] Uploading connectors using multiple chunks
[2021-06-22T17:51:33.0900 CEST] [info] [pid:5436] connectors: 1 chunks
[2021-06-22T17:51:33.0900 CEST] [info] [pid:5436] connectors: chunks of 1000000 bytes each
[2021-06-22T17:51:34.0091 CEST] [debug] [pid:5432] Uploading artefact install_ltp-144.txt
[2021-06-22T17:51:34.0307 CEST] [debug] [pid:5368] Uploading artefact install_ltp-145.txt
[2021-06-22T17:51:34.0513 CEST] [info] [pid:5436] connectors: Processing chunk 1/1, avg. speed ~0.056 KiB/s
[2021-06-22T17:51:34.0520 CEST] [info] [pid:5436] Uploading vars.json
[2021-06-22T17:51:34.0599 CEST] [debug] [pid:5432] Uploading artefact install_ltp-152.txt
[2021-06-22T17:51:34.0692 CEST] [debug] [pid:5368] Uploading artefact install_ltp-14.txt
[2021-06-22T17:51:34.0911 CEST] [info] [pid:5436] Uploading autoinst-log.txt
[2021-06-22T17:51:34.0961 CEST] [debug] [pid:5432] Uploading artefact install_ltp-215.txt
[2021-06-22T17:51:35.0075 CEST] [debug] [pid:5368] Uploading artefact install_ltp-100.txt
[2021-06-22T17:51:35.0305 CEST] [info] [pid:5436] Uploading worker-log.txt

This is how it is supposed to work (judging by the code and comparing to log files):

  1. Job starts
    1. Upload results (= step details) concurrently and continuously
  2. Job stops
    1. The upload from 1. might still be running; that's ok
    2. Set job status to uploading
    3. Upload logs/assets (including autoinst-log.txt and worker-log.txt)
  3. Logs/assets (2.3) have been uploaded
    1. Wait until possibly ongoing results upload (1.1) has concluded
    2. Start a "final" results upload
  4. Final result upload has been concluded
    1. Final job state/result is set

  • Point 2.1 explains the weird interleaving between Uploading [some logfile] lines and Uploading artefact [some step file] lines.
  • Point 2.3 explains why log lines for certain artefacts can be missing completely: worker-log.txt is simply uploaded before these lines would be logged in steps 3 and 4. (This is plausible because the test modules with missing artefacts are indeed at the end.)

So I would conclude that both cases (log lines for artefacts completely missing within the log file and errors logged for artefacts) are the same after all. It is just confusing (and hence problematic) that log messages from steps 3 and 4 are not part of worker-log.txt.

As a first step I'd just try to improve the error handling. It would make sense to retry artefact uploads in the same way other API calls to the web UI are retried. If they fail nevertheless, I assume the test should be marked as incomplete (with the relevant error message as reason).

As a second step I can try to move uploading autoinst-log.txt after step 4.
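
To make the ordering concrete, here is a small, runnable Python sketch of the sequence described above. It is purely illustrative (the real worker is Perl code in lib/OpenQA/Worker/Job.pm and these names are invented); it only demonstrates why worker-log.txt cannot contain the upload messages from steps 3 and 4:

# Illustrative model of the upload ordering, not the actual worker code.
import queue
import threading
import time

results = queue.Queue()
job_done = threading.Event()

def continuous_result_upload():
    # Step 1.1: runs concurrently while the job executes.
    while not (job_done.is_set() and results.empty()):
        try:
            name = results.get(timeout=0.1)
        except queue.Empty:
            continue
        print(f'Uploading artefact {name}')  # these lines still end up in worker-log.txt

uploader = threading.Thread(target=continuous_result_upload)
uploader.start()

for name in ('install_ltp-1.txt', 'install_ltp-2.txt'):  # the job produces step results
    results.put(name)
time.sleep(0.3)                       # step 2: the job stops

print('Uploading worker-log.txt')     # step 2.3: the log file is uploaded here
job_done.set()
uploader.join()                       # step 3.1: wait for the ongoing result upload
print('Uploading artefact shutdown_ltp-1.txt')  # step 3.2: final upload; this message
                                                # can no longer appear in worker-log.txt
print('Setting final job state')      # step 4.1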

Actions #22

Updated by MDoucha about 3 years ago

mkittler wrote:

As a first step I'd just try to improve the error handling. It would make sense to retry artefact uploads in the same way other API calls to the web UI are retried. If they fail nevertheless, I assume the test should be marked as incomplete (with the relevant error message as reason).

I'd recommend a different approach for artefact uploads: if an upload fails even after a retry, put it back in the upload queue and pause further uploads until the next trigger from the worker loop. If the network connection recovers, everything will be uploaded. If it doesn't recover, the main OpenQA server will terminate the job due to timeout either way. Only the final upload step would kill the job after several failed upload retries.
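
A rough Python sketch of that suggestion (purely illustrative; the worker is Perl and, as the next comment notes, has no such upload queue, so upload_once() here is a hypothetical stand-in):

# Hedged sketch of the proposed behaviour: retry a failed upload once, and if
# it still fails, re-queue the artefact and pause uploads until the worker
# loop triggers the next upload round.
from collections import deque

pending = deque()  # artefacts waiting to be uploaded

def process_uploads(upload_once, max_attempts=2):
    # upload_once(name) is a hypothetical stand-in for the real upload call.
    while pending:
        name = pending.popleft()
        for attempt in range(1, max_attempts + 1):
            try:
                upload_once(name)
                break
            except OSError as err:
                print(f'Upload of {name} failed (attempt {attempt}/{max_attempts}): {err}')
        else:
            # Still failing: put it back at the front and stop until the next
            # trigger from the worker loop. If the network never recovers, the
            # server terminates the job due to timeout anyway.
            pending.appendleft(name)
            return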

Actions #23

Updated by mkittler about 3 years ago

Makes sense; I'll make it only a fatal error on the final upload. Unfortunately the upload is not written using a queue (or something similar)¹, so I cannot simply put it back. I guess rewriting the code to use a queue is a bit out of scope here, so I'll simply resort to non-fatal retries during the execution plus the final upload to catch everything which may have been missed.


¹ although it would make a lot of sense of course

Actions #24

Updated by mkittler about 3 years ago

  • Status changed from Workable to In Progress
Actions #25

Updated by mkittler about 3 years ago

By the way, I'm not even sure whether it'll be a big win to mark such tests as incomplete. Maybe it is sufficient to add the upload error (after all attempts have been depleted during the final upload) as reason (unless there is a more relevant reason of course)?

Moving the upload of the worker log to the end would also help with that (without considering the test as incomplete).

Actions #26

Updated by MDoucha about 3 years ago

mkittler wrote:

By the way, I'm not even sure whether it'll be a big win to mark such tests as incomplete. Maybe it is sufficient to add the upload error (after all attempts have been depleted during the final upload) as reason (unless there is a more relevant reason of course)?

You can certainly try to send more descriptive error info to the central OpenQA server. But since you'll be doing it while the network is so horribly broken that asset uploads have exceeded the retry limit, I don't think you have a very high chance of success.

Actions #27

Updated by mkittler about 3 years ago

My current PR adds a retry mechanism for the artefact upload similar to what we already have for regular API calls. For regular API calls it already seems to improve the situation a lot, e.g. most of the jobs you've found have their final state/result set as expected (which means the API call to send this information eventually succeeded). Hence the PR should make it more robust. That means the situations when the upload ultimately fails but we are still able to set the final state/result (where also the reason is passed) are indeed very unlikely. So likely it doesn't make much sense putting further effort (and complicated code) into this (like changing the job result to incomplete or trying to upload the worker log later).

By the way, to avoid interference with https://github.com/os-autoinst/openQA/pull/4062 I haven't tried to streamline the retry behavior with regular API calls yet. I will do that after it has been merged if it makes sense.

Actions #28

Updated by openqa_review about 3 years ago

  • Due date set to 2021-08-05

Setting due date based on mean cycle time of SUSE QE Tools

Actions #29

Updated by mkittler about 3 years ago

PR for streamlining the retry-behavior with regular API calls (also for having less redundant/inconsistent code): https://github.com/os-autoinst/openQA/pull/4067

Actions #30

Updated by mkittler about 3 years ago

  • Status changed from In Progress to Feedback

All PRs have been merged. So now uploading those artefacts has the same retry behavior as regular API calls and therefore should be equally reliable.

Actions #31

Updated by MDoucha about 3 years ago

mkittler wrote:

All PRs have been merged. So now uploading those artefacts has the same retry behavior as regular API calls and therefore should be equally reliable.

Great. Setting my alarm clock to August 31st. When it goes off, I'll run another scan for jobs with missing console output.
Note to self: OSD is currently at job ID 6596263.

Actions #32

Updated by mkittler about 3 years ago

Sounds like a plan. (Especially because adding an automatic alert within our monitoring for this problem would be quite some work.)

Actions #33

Updated by livdywan about 3 years ago

  • Due date changed from 2021-08-05 to 2021-08-31

I'm updating the Due date accordingly after having checked with Marius

Actions #34

Updated by okurz about 3 years ago

  • Due date changed from 2021-08-31 to 2021-09-02

The analysis takes some time. Also, we should not be warned on the last day but only if we exceed it, so I'm increasing the due date by two days.

Actions #35

Updated by MDoucha about 3 years ago

The retry code still needs some work. My scanner script found 403 jobs with missing result data within the past month (some of which may be false positives; see the attachment for the full list). Here's an example where the worker printed a retry message to the log but failed to upload the text file anyway:

https://openqa.suse.de/tests/6892431#step/consoletest_setup/21
https://openqa.suse.de/tests/6892431/logfile?filename=worker-log.txt

[2021-08-19T07:36:19.0905 CEST] [debug] [pid:7101] Uploading artefact consoletest_setup-21.txt
[2021-08-19T07:36:20.0566 CEST] [warn] [pid:7101] Upload attempts remaining: 59/60 for consoletest_setup-21.txt
[2021-08-19T07:36:20.0743 CEST] [warn] [pid:4110] Websocket connection to http://openqa.suse.de/api/v1/ws/1258 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2021-08-19T07:36:30.0752 CEST] [info] [pid:4110] Registering with openQA openqa.suse.de
[2021-08-19T07:36:32.0334 CEST] [info] [pid:4110] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1258
[2021-08-19T07:36:33.0250 CEST] [info] [pid:4110] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1258
Actions #36

Updated by livdywan about 3 years ago

  • Due date changed from 2021-09-02 to 2021-09-10

Bumping the due date since there's new information, assuming @mkittler is planning to look into it next week.

Actions #37

Updated by mkittler about 3 years ago

You're right, it definitely still looks broken - as if it doesn't actually give it another try after logging Upload attempts remaining: 59/60 for….

Actions #38

Updated by mkittler about 3 years ago

PR for improving the related logging: https://github.com/os-autoinst/openQA/pull/4174

However, I still don't know what the problem is here. Judging by our code and unit tests it actually re-attempts the upload. Maybe the error isn't visible on the worker side? The error handling correctly checks for both connection and HTTP errors, though.

Actions #39

Updated by mkittler about 3 years ago

  • Status changed from Workable to Feedback

Looks like the web UI actually returns a 200 response in certain error cases. Maybe these are the remaining cases where the worker doesn't re-attempt the upload. PR: https://github.com/os-autoinst/openQA/pull/4179

Actions #40

Updated by mkittler about 3 years ago

Since not all changes of https://github.com/os-autoinst/openQA/pull/4179 were covered I've created https://github.com/os-autoinst/openQA/pull/4184. However, that means the code path which would return a 200 response despite an error was most likely dead code and didn't cause the problem.

I also ran Martin's script again on OSD but couldn't find any further occurrences since the logging changes have been deployed.

Actions #41

Updated by livdywan about 3 years ago

  • Due date changed from 2021-09-10 to 2021-10-15

Discussed in chat:

  • The logs of these tests showed that the retry doesn't work - either because of a bug in the retry code itself or because the web UI doesn't report an error although the request didn't work. Judging by the code and unit tests neither should be the case. So the uploading doesn't work in rare cases.
  • @mkittler couldn't find a further occurrence of the problem on OSD even though the changes for improved logging and error handling shouldn't have fixed it.
  • Since we can't reproduce it but also don't know if it's fixed, I suggested monitoring this a bit longer, with one of 3 possible outcomes:
    • If it happens again we can investigate
    • If someone finds evidence that it's already fixed we can document it retroactively and learn from that
    • If it simply no longer reproduces within a month, for better or worse we'll assume it's resolved
Actions #42

Updated by okurz about 3 years ago

  • Priority changed from High to Low

I agree with that. But also, for long-running tickets with a due date further in the future, we should treat them not as "High" but as "Low".

Actions #43

Updated by MDoucha about 3 years ago

I've run the result scanner again with similar results as last month: 448 jobs where OpenQA did not upload at least one result file. Again, most of the jobs which I've checked manually had a retry message in the log but the file was not uploaded anyway:

https://openqa.suse.de/tests/7259127#step/pread01_64/6
https://openqa.suse.de/tests/7259127/logfile?filename=worker-log.txt

[2021-09-29T07:37:44.0607 CEST] [debug] [pid:439939] Uploading artefact pread01_64-6.txt
[2021-09-29T07:37:44.0757 CEST] [warn] [pid:439939] Error uploading pread01_64-6.txt: 502 response: Proxy Error
[2021-09-29T07:37:44.0987 CEST] [warn] [pid:427545] Websocket connection to http://openqa.suse.de/api/v1/ws/1333 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2021-09-29T07:37:54.0990 CEST] [info] [pid:427545] Registering with openQA openqa.suse.de
[2021-09-29T07:37:55.0010 CEST] [warn] [pid:427545] Failed to register at openqa.suse.de - 503 response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<snip HTML error>
 - trying again in 10 seconds
[2021-09-29T07:38:05.0020 CEST] [info] [pid:427545] Registering with openQA openqa.suse.de
[2021-09-29T07:38:06.0187 CEST] [info] [pid:427545] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1333
[2021-09-29T07:38:08.0558 CEST] [info] [pid:427545] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1333
[2021-09-29T07:38:22.0255 CEST] [info] [pid:427545] Received signal HUP
[2021-09-29T07:38:44.0758 CEST] [warn] [pid:439939] Uploading artefact pread01_64-6.txt (attempts remaining: 59/60)

At least one job does not have any mention of the missing file in the worker-log.txt file at all (or any other udp6_ipsec_vti test result either):

https://openqa.suse.de/tests/7253418#step/udp6_ipsec_vti05/1
https://openqa.suse.de/tests/7253418/logfile?filename=worker-log.txt

Actions #44

Updated by mkittler about 3 years ago

In the first case when the upload is logged it looks like an additional attempt is actually made but then no further error handling (e.g. yet another attempt) happens:

[2021-09-29T07:37:44.0757 CEST] [warn] [pid:439939] Error uploading pread01_64-6.txt: 502 response: Proxy Error
[2021-09-29T07:37:44.0987 CEST] [warn] [pid:427545] Websocket connection to http://openqa.suse.de/api/v1/ws/1333 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2021-09-29T07:37:54.0990 CEST] [info] [pid:427545] Registering with openQA openqa.suse.de
[2021-09-29T07:37:55.0010 CEST] [warn] [pid:427545] Failed to register at openqa.suse.de - 503 response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.</p>
<p>Additionally, a 503 Service Unavailable
error was encountered while trying to use an ErrorDocument to handle the request.</p>
<hr>
<address>Apache/2.4.43 (Linux/SUSE) Server at openqa.suse.de Port 80</address>
</body></html>
 - trying again in 10 seconds
[2021-09-29T07:38:05.0020 CEST] [info] [pid:427545] Registering with openQA openqa.suse.de
[2021-09-29T07:38:06.0187 CEST] [info] [pid:427545] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1333
[2021-09-29T07:38:08.0558 CEST] [info] [pid:427545] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1333
[2021-09-29T07:38:22.0255 CEST] [info] [pid:427545] Received signal HUP
[2021-09-29T07:38:44.0758 CEST] [warn] [pid:439939] Uploading artefact pread01_64-6.txt (attempts remaining: 59/60)
[2021-09-29T07:38:44.0791 CEST] [debug] [pid:439939] Uploading artefact pread03_64-6.txt

It also looks like the web UI was overloaded at the time, e.g. even the re-registration didn't work.


In the other case it indeed looks like the worker upload completely ignored the file (without any error message) or the file hasn't been created in the first place (and we don't show errors in that case).


I'll have yet another look at the code (and tests). I also noticed that I ran your script incorrectly when checking it myself the last time.

Actions #45

Updated by okurz about 3 years ago

mkittler wrote:

In the first case when upload is logged it looks like an additional attempt is actually made but then no further error handling (e.g. yet another attempt) happens

so if I got you right then we conduct one retry but ignore if that succeeds? In that case I would go for the same approach that we usually do on retrying, e.g. a for loop with $configurable_retries // 7 with sleep periods in between and die with explicit error if we don't succeed even after the retries.

I'll have yet another look at the code (and tests).

Maybe just try to improve the logging in error cases for a start and collect more experience from production runs.

Actions #46

Updated by MDoucha about 3 years ago

mkittler wrote:

In the other case it indeed looks like the worker upload completely ignored the file (without any error message) or the file hasn't been created in the first place (and we don't show errors in that case).

There are dozens of udp6_ipsec_vti*.txt result files generated by that job. Almost all of them were correctly uploaded as you can see in the webUI but none of them is mentioned in the worker log. I thought that worker-log.txt should be the last uploaded file now?

Actions #47

Updated by mkittler about 3 years ago

I guess it is safe to exclude that the sub-process we spawn to do the upload is terminated prematurely (without proper error handling). Even if the main worker process receives SIGHUP, the upload process (in this case PID 25077) continues to run and only one single file is missing:

[2021-09-24T15:23:23.0472 CEST] [debug] [pid:25077] Uploading artefact sshd-402.txt
[2021-09-24T15:23:24.0709 CEST] [debug] [pid:25077] Uploading artefact sshd-111.txt
[2021-09-24T15:23:25.0219 CEST] [warn] [pid:15314] Websocket connection to http://openqa.suse.de/api/v1/ws/1313 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2021-09-24T15:23:25.0730 CEST] [warn] [pid:25077] Error uploading sshd-111.txt: 503 response: Service Unavailable
[2021-09-24T15:23:35.0230 CEST] [info] [pid:15314] Registering with openQA openqa.suse.de
[2021-09-24T15:23:38.0396 CEST] [info] [pid:15314] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1313
[2021-09-24T15:23:46.0746 CEST] [info] [pid:15314] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1313
[2021-09-24T15:24:04.0266 CEST] [info] [pid:15314] Received signal HUP
[2021-09-24T15:24:25.0731 CEST] [warn] [pid:25077] Uploading artefact sshd-111.txt (attempts remaining: 59/60)
[2021-09-24T15:24:27.0952 CEST] [debug] [pid:25077] Uploading artefact sshd-366.txt
[2021-09-24T15:24:29.0234 CEST] [debug] [pid:25077] Uploading artefact sshd-68.txt
[2021-09-24T15:24:31.0449 CEST] [debug] [pid:25077] Uploading artefact sshd-257.txt
[2021-09-24T15:24:32.0816 CEST] [debug] [pid:25077] Uploading artefact sshd-31.txt
[2021-09-24T15:24:34.0249 CEST] [debug] [pid:25077] Uploading artefact sshd-216.txt
[2021-09-24T15:24:35.0559 CEST] [debug] [pid:25077] Uploading artefact sshd-163.txt
[2021-09-24T15:24:36.0617 CEST] [debug] [pid:25077] Uploading artefact sshd-58.txt
[2021-09-24T15:24:38.0132 CEST] [debug] [pid:25077] Uploading artefact sshd-48.txt
[2021-09-24T15:24:39.0793 CEST] [debug] [pid:25077] Uploading artefact sshd-90.txt
Actions #48

Updated by mkittler about 3 years ago

so if I got you right then we conduct one retry but ignore if that succeeds? In that case I would go for the same approach that we usually do on retrying, e.g. a for loop with $configurable_retries // 7 with sleep periods in between and die with explicit error if we don't succeed even after the retries.

We have a loop which is supposed to work like this. The question is why this loop sometimes doesn't work. I was just describing the observed actual behavior, which goes against what is supposed to happen judging by how the code looks right now.

Maybe just try to improve the logging in error cases for a start and collect more experience from production runs.

That's actually what I already did, and it is the reason we see Error uploading pread01_64-6.txt: 502 response: Proxy Error. That we don't see a further error message after the next logged attempt actually means that no error occurred (from the perspective of the upload code). In reality files are missing, so either the server did not tell the worker that an error happened (although it did), or the worker's error handling is somehow wrong, or the file has been deleted for some reason afterwards¹.

¹ The code for finalizing jobs looks good and shouldn't lead to this problem even though it is interrupted in the middle.


There are dozens of udp6_ipsec_vti*.txt result files generated by that job. Almost all of them were correctly uploaded as you can see in the webUI but none of them is mentioned in the worker log. I thought that worker-log.txt should be the last uploaded file now?

That is a good point and it also bugs me. If these files were normal files under "ulogs" or normal files mentioned in a module result I suppose they should show up in the logs. Apparently they don't so maybe these particular files are treated differently as they are "External results" and still lack proper error handling with retries. However, your script also shows completely normal jobs like https://openqa.suse.de/tests/7220766#step/sshd/111 which are affected. So I guess the code for "External results" and the code for normal "ulogs" files are both still problematic.

Actions #49

Updated by MDoucha about 3 years ago

mkittler wrote:

There are dozens of udp6_ipsec_vti*.txt result files generated by that job. Almost all of them were correctly uploaded as you can see in the webUI but none of them is mentioned in the worker log. I thought that worker-log.txt should be the last uploaded file now?

That is a good point and it also bugs me. If these files were normal files under "ulogs" or normal files mentioned in a module result I suppose they should show up in the logs. Apparently they don't so maybe these particular files are treated differently as they are "External results" and still lack proper error handling with retries. However, your script also shows completely normal jobs like https://openqa.suse.de/tests/7220766#step/sshd/111 which are affected. So I guess the code for "External results" and the code for normal "ulogs" files are both still problematic.

No, those udp6_ipsec_vti*.txt files missing in the log are not external results or ulogs or anything of that sort. Those files are regular console output snippets captured automatically by wait_serial().

Actions #50

Updated by mkittler almost 3 years ago

  • Due date changed from 2021-10-15 to 2021-11-01

We wanted to look into it in a pair programming session but this hasn't happened so far because there were more important tickets to work on so I'm adjusting the due date.

Actions #51

Updated by livdywan almost 3 years ago

@mkittler, @cdywan, @kraih to meet tomorrow 13.00 CEST

Actions #52

Updated by livdywan almost 3 years ago

Some notes from the conversation (no verdict or result here):

The step shows "Unable to read sshd-111.txt."; this is in the log:

[2021-09-24T15:23:24.0709 CEST] [debug] [pid:25077] Uploading artefact sshd-111.txt
[2021-09-24T15:23:25.0219 CEST] [warn] [pid:15314] Websocket connection to http://openqa.suse.de/api/v1/ws/1313 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2021-09-24T15:23:25.0730 CEST] [warn] [pid:25077] Error uploading sshd-111.txt: 503 response: Service Unavailable
[2021-09-24T15:23:35.0230 CEST] [info] [pid:15314] Registering with openQA openqa.suse.de
[2021-09-24T15:23:38.0396 CEST] [info] [pid:15314] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1313
[2021-09-24T15:23:46.0746 CEST] [info] [pid:15314] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1313
[2021-09-24T15:24:04.0266 CEST] [info] [pid:15314] Received signal HUP
[2021-09-24T15:24:25.0731 CEST] [warn] [pid:25077] Uploading artefact sshd-111.txt (attempts remaining: 59/60)
  • why HUP after 30s?
    • only the main process receives the signal, meaning it shouldn't be relevant here?
  • Uploading artefact sshd-111.txt is expected to be delayed because of the 503 and should have succeeded since there's no error afterwards
  • Not seen in the journal?
  • Can we checksum results instead of logging every step of artifact creation?
  • Semi-related, but we see a lot of extra log messages due to "debug" level. A newer mojolicious version might get rid of some messages, e.g. 200 OK now at "strace" level?
    • kraih: This is a regression to be fixed in the next release
Actions #53

Updated by mkittler almost 3 years ago

I ran the script again and we checked more jobs. There are different types of jobs which miss txt results but show slightly different logging behavior:

  1. Some jobs have no traces of txt result uploads in the worker log at all. It isn't clear why that's the case. At least for the successfully uploaded txt results there should be a corresponding line in the log file. Note that the corresponding "Create artifact:" messages still show up on the server log, though (except for the few actually missing txt results).
  2. In some jobs the upload is logged but only for a file with a slightly different name. In this case the log message for the relevant upload is likely just completely absent because the present log name actually belongs to a different result which has a very similar name.
  3. In some jobs a failed upload attempt is actually logged but then it looks like the next attempt succeeded and hence no further retry comes into play. So retrying generally works. The problem is more that sometimes an upload which didn't work is still considered a success.
  4. In some jobs the upload is logged as if it has worked on the first try.

  • We have logging problems, at least concerning the (client-side) worker log.
  • 3. and 4. show that the error handling isn't reliable and sometimes failed uploads are treated as success (and hence the retry isn't effective).
  • 1. also shows that the finalize job is unlikely to cause the problem because it shows that the "Create artifact: …" log line was actually missing for the missing result.

Unfortunately we still couldn't find the source of these problems.

Actions #54

Updated by MDoucha almost 3 years ago

mkittler wrote:

  1. In some jobs the upload is logged but only for a file with a slightly different name. It isn't clear which test the file with the slightly different name belongs to. The missing log line with the expected file name could be the same as the previous case although in this case there were log messages for the successfully uploaded txt results.

Unsurprisingly, rt_sigsuspend01-1.txt belongs to the rt_sigsuspend01 test: https://openqa.suse.de/tests/7473737#step/rt_sigsuspend01/1

Actions #55

Updated by mkittler almost 3 years ago

Ah, makes sense. I must have made a typo when searching. I'll update my comment.

Actions #56

Updated by mkittler almost 3 years ago

  • Due date changed from 2021-11-01 to 2021-11-09
Actions #57

Updated by mkittler almost 3 years ago

  • Due date deleted (2021-11-09)

We're currently out of ideas so I'm removing the due date.

Actions #58

Updated by livdywan almost 3 years ago

  • Related to action #102332: Unable to read *.json: Can't open file in o3 openQA logs /var/log/openqa size:M added
Actions #59

Updated by okurz almost 3 years ago

  • Status changed from Feedback to Blocked

We should first work on #102332 which might be related

Actions #60

Updated by okurz almost 3 years ago

  • Related to action #102786: Text results "unable to read" when showing result page during test execution, i.e. while an openQA test is running size:M added
Actions #61

Updated by okurz almost 3 years ago

dheidler found another issue; maybe it's actually the same. Please see #102786, I added some fancy screenshots there. Maybe that helps :)

Actions #62

Updated by MDoucha almost 3 years ago

okurz wrote:

dheidler found another issue; maybe it's actually the same. Please see #102786, I added some fancy screenshots there. Maybe that helps :)

I see that a lot but simply waiting for the job to finish uploading and then reloading the page will fix it. My best guess is that the upload thread has unnecessary delays. That would be a design problem, not really a bug.

Actions #63

Updated by okurz over 2 years ago

  • Status changed from Blocked to Feedback

both #102332 and #102786 resolved, please check again.

Actions #64

Updated by mkittler over 2 years ago

As @MDoucha already stated, #102786 is completely unrelated. This is about artifacts missing persistently (on finished tests).

Maybe https://github.com/os-autoinst/openQA/pull/4412 and https://github.com/os-autoinst/openQA/pull/4427 for #102332 helped with that as well. I'm running @MDoucha's script again to check that.

Actions #65

Updated by mkittler over 2 years ago

The script is still running without output. I suppose that's not really the most efficient way to check for problematic jobs.

Actions #66

Updated by MDoucha over 2 years ago

mkittler wrote:

The script is still running without output. I suppose that's not really the most efficient way to check for problematic jobs.

That's because you're running the wrong script. scan_details.py is the main scanner script that reads job result data. filter_details.py reads the scan results from standard input and merely filters out bogus jobs by checking the OpenQA web API for "valid" causes of missing result files. Run both scripts like this, and you can kill them when the counter printed to the console drops under 8000:

scan_details.py | filter_details.py >results.txt
Actions #67

Updated by mkittler over 2 years ago

The issue hasn't been fixed by those commits, e.g. https://openqa.suse.de/tests/8498715#step/kcmp01/3 and https://openqa.suse.de/tests/8497388#step/sysinfo01/11. I've attached a full list of the results gathered by the script (but it seems to contain some false-positives).

Actions #68

Updated by MDoucha over 2 years ago

Yes, some of the jobs at the beginning of the list will be false positives because they were still uploading results while they got scanned but fully finished before they got filtered.

Actions #69

Updated by okurz over 2 years ago

  • Subject changed from OpenQA worker randomly skips uploading artefacts for whole test modules size:M to OpenQA worker randomly skips uploading artefacts for whole test modules
  • Status changed from Feedback to New
  • Assignee deleted (mkittler)
  • Target version changed from Ready to future

We have discussed this topic today in the SUSE QE Tools weekly 2022-05-06 and have found no good path to proceed. We do not know how to easily reproduce it. It is a real bug though. So I am keeping the ticket open but moving it out of our backlog as we are not currently doing anything on it anyway.

Actions #70

Updated by MDoucha about 2 years ago

One of my verification runs seems to have skipped the upload phase entirely:
https://openqa.suse.de/tests/9582368
