action #94531

OpenQA worker randomly skips uploading artefacts for whole test modules size:M

Added by MDoucha 3 months ago. Updated 4 days ago.

Status: Feedback
Priority: Low
Assignee:
Category: Concrete Bugs
Target version:
Start date: 2021-06-23
Due date: 2021-10-15
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

Once in a while, some jobs randomly end up with one or more modules missing all screenshots and console output. The boxes do appear in the OpenQA web UI but only show Unable to read foo-123.txt. or a blank box instead of a screenshot:
https://openqa.suse.de/tests/6308588#step/cn_pec_sh/1

Worker-log.txt shows that the missing files were not uploaded at all. The worker uploaded all artefacts for update_kernel and install_ltp but there's not a single line with Uploading artefact cn_pec_sh-*.txt or Uploading artefact shutdown_ltp-*.txt (except for log assets uploaded by calling upload_logs() in the test itself):
https://openqa.suse.de/tests/6308588/logfile?filename=worker-log.txt

I've seen this happen randomly multiple times; usually it's just one module in the middle of a test run. Everything before and after it gets uploaded correctly.

Steps to reproduce

  • Look for jobs with test module steps that include Unable to read

Expected result

Another job in the same scenario as the original one shows the correct expected data in its steps; https://openqa.suse.de/tests/6312792#step/cn_pec_sh/1 shows:

# wait_serial expected: "# "
# Result:

# 

Acceptance criteria

  • AC1: test module steps are uploaded again for openqa.suse.de LTP cases in general

Suggestions

  • Find fail ratio
  • Try to reproduce problem locally
  • Find out what files the openQA worker should try to upload

Out of scope

  • If we find a problem in the custom LTP runner, don't try to fix it on the openQA side; have it fixed within os-autoinst-distri-opensuse

Attachments

scan_details.py (1.73 KB), MDoucha, 2021-06-30 11:24
filter_details.py (776 Bytes), MDoucha, 2021-06-30 13:07
joblist.txt (8.4 KB), MDoucha, 2021-07-01 13:58
missing_results-2021-08.txt (20.2 KB), Affected jobs in the last month, MDoucha, 2021-09-01 11:45

History

#1 Updated by okurz 3 months ago

  • Priority changed from Normal to High
  • Target version set to Ready

#2 Updated by tinita 3 months ago

It could be helpful to post more links to tests with the same behaviour (could be that there is a pattern)

#3 Updated by MDoucha 3 months ago

tinita wrote:

It could be helpful to post more links to tests with the same behaviour (could be that there is a pattern)

I would post more links if I had any, but this is maybe the third time I've seen this bug, and the previous occurrence was months ago. You could probably find a few dozen cases by scanning existing test results with a script: the details-*.json file will have text keys but no text_data in the details array.

#4 Updated by cdywan 3 months ago

Maybe this should be self-evident, but can you clarify that https://openqa.suse.de/tests/6308588#step/cn_pec_sh/1 is expected to pass even without the missing file? What is the expected result?

From the logs it seems the upload is completely missing so my first thought would be that the file wasn't created. But from a single result it's hard to tell. I guess that's also what Tina was wondering.

#5 Updated by okurz 3 months ago

To put the information into structured form, I suggest following the ticket template from https://progress.opensuse.org/projects/openqav3/wiki/#Defects

#6 Updated by MDoucha 3 months ago

cdywan wrote:

Maybe this should be self-evident, but can you clarify that https://openqa.suse.de/tests/6308588#step/cn_pec_sh/1 is expected to pass even without the missing file? What is the expected result?

Yes, this is expected to pass. The command result (pass/fail) is reported to the OpenQA database first and the console output is uploaded much later. It is expected to see a green box with "Unable to read foo-123.txt" while the test is still running, but not after it finishes.

From the logs it seems the upload is completely missing so my first thought would be that the file wasn't created. But from a single result it's hard to tell. I guess that's also what Tina was wondering.

Like I said, the easiest way to find more examples is to scan existing test results with a script. A valid details-*.json file looks like this:

{"dents":0,"details":[{"result":"ok","text":"boot_ltp-4.txt","text_data":"# wait_serial expected: qr\/login:\\s*$\/ui\n\n# Result:\n\r\nsusetest login: \n","title":"wait_serial"},...],...}

But when it's affected by this issue, the text key will be present but the text_data key will be completely missing:

{"dents":0,"details":[{"result":"ok","text":"boot_ltp-4.txt","title":"wait_serial"},...],...}

#7 Updated by cdywan 3 months ago

More thoughts:

lib/OpenQA/Task/Job/FinalizeResults.pm#L26 uploads results in a minion job.
Maybe errors are eaten here but should appear on https://openqa.suse.de/minion

#8 Updated by MDoucha 3 months ago

cdywan wrote:

More thoughts:

lib/OpenQA/Task/Job/FinalizeResults.pm#L26 uploads results in a minion job.
Maybe errors are eaten here but should appear on https://openqa.suse.de/minion

No, this is a minion task which collects the already uploaded text files and inlines them into the details-*.json file for faster webpage rendering. It does not upload anything, but it could detect this problem and send an alert somehow. The uploading is done by the OpenQA worker which actually runs the test job.
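
(Conceptually, that finalize step does something like the following Python sketch; this is not the actual openQA code, and the file layout is an assumption.)

import json
from pathlib import Path

def finalize_results(job_result_dir):
    """Sketch only: inline each referenced text file into details-*.json so the
    web UI can render steps without extra requests. Entries whose text file was
    never uploaded keep only the "text" key - exactly the symptom in this ticket."""
    result_dir = Path(job_result_dir)
    for details_file in result_dir.glob("details-*.json"):
        data = json.loads(details_file.read_text())
        for detail in data.get("details", []):
            text_name = detail.get("text")
            if not text_name or "text_data" in detail:
                continue
            text_file = result_dir / text_name
            if text_file.exists():
                detail["text_data"] = text_file.read_text(errors="replace")
            # else: the worker never uploaded the file; this is where an alert
            # could be raised instead of silently omitting text_data.
        details_file.write_text(json.dumps(data))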

#9 Updated by cdywan 3 months ago

MDoucha wrote:

cdywan wrote:

More thoughts:

lib/OpenQA/Task/Job/FinalizeResults.pm#L26 uploads results in a minion job.
Maybe errors are eaten here but should appear on https://openqa.suse.de/minion

No, this is a minion task which collects the already uploaded text files and inlines them into the details-*.json file for faster webpage rendering. It does not upload anything, but it could detect this problem and send an alert somehow. The uploading is done by the OpenQA worker which actually runs the test job.

Thank you for the correction!

FYI, kraih is helping out by improving related code: https://github.com/os-autoinst/openQA/pull/3982. Not a fix, but it could contribute to errors not being propagated.

#10 Updated by mkittler 3 months ago

This sounds like a bug in the uploading code of the openQA worker (code in lib/OpenQA/Worker/Job.pm). It's hard to tell what's going on exactly (the VPN is down so I cannot even access the referenced jobs right now). It could also be that the results are already not generated correctly by os-autoinst, but that's not very likely.

It would of course be useful to see the pool directory of such a job. Is the test order of such jobs changed at some point during the test execution? Are the test details created in a special way (e.g. parsed from an external testsuite)?

#11 Updated by okurz 3 months ago

  • Subject changed from OpenQA worker randomly skips uploading artefacts for whole test modules to OpenQA worker randomly skips uploading artefacts for whole test modules size:M
  • Description updated (diff)
  • Status changed from New to Workable

#12 Updated by MDoucha 3 months ago

Here's a list of other jobs with missing console output. I've also attached two scripts that'll find more. scan_details.py finds all jobs with missing console output; you need to run it directly on OSD. It'll also print files it can't parse properly and show progress through the results directory on standard error. filter_details.py will filter out unfinished and incomplete jobs which would most likely be false positives. Simply feed it the output of scan_details.py on standard input; you can run this script on any machine that has HTTP access to OSD. Or just run both on OSD: scan_details.py | filter_details.py >results.txt.

Edit: I've replaced filter_details.py with an improved version. See next comment.

Now for the list of jobs (just a small fraction of all affected jobs). The string after each URL is a random test module that is missing at least one console output file. Some are caused by network errors which you'll find in worker-log.txt; those are a different issue, but I didn't filter them out.

https://openqa.suse.de/tests/6351096 wireshark
https://openqa.suse.de/tests/6351244 openvswitch
https://openqa.suse.de/tests/6351363 bootcode_options
https://openqa.suse.de/tests/6351379 nis_server
https://openqa.suse.de/tests/6351394 mem03
https://openqa.suse.de/tests/6351403 if6-mtu-change_ifconfig
https://openqa.suse.de/tests/6351405 shutdown_ltp
https://openqa.suse.de/tests/6351393 clock_nanosleep02
https://openqa.suse.de/tests/6351337 setup
https://openqa.suse.de/tests/6351474 before_test
https://openqa.suse.de/tests/6351381 bootloader_start
https://openqa.suse.de/tests/6351297 barrier_init
https://openqa.suse.de/tests/6351300 barrier_init
https://openqa.suse.de/tests/6351487 setup
https://openqa.suse.de/tests/6351489 consoletest_setup
https://openqa.suse.de/tests/6351517 docker
https://openqa.suse.de/tests/6351553 glxgears
https://openqa.suse.de/tests/6351883 repos
https://openqa.suse.de/tests/6350113 ssh_interactive_init
https://openqa.suse.de/tests/6350117 ssh_interactive_init
https://openqa.suse.de/tests/6345228 install_patterns
https://openqa.suse.de/tests/6345411 fcntl36
https://openqa.suse.de/tests/6345435 epoll01
https://openqa.suse.de/tests/6342668 hana_cluster
https://openqa.suse.de/tests/6333915 quota
https://openqa.suse.de/tests/6326404 alarm06
https://openqa.suse.de/tests/6325348 settimeofday02
https://openqa.suse.de/tests/6317236 filesystem_ro
https://openqa.suse.de/tests/6317274 logs_from_installation_system
https://openqa.suse.de/tests/6308588 shutdown_ltp
https://openqa.suse.de/tests/6308958 before_test
https://openqa.suse.de/tests/6302842 install_update
https://openqa.suse.de/tests/6294460 apache
https://openqa.suse.de/tests/6294750 http_srv
https://openqa.suse.de/tests/6294612 perf
https://openqa.suse.de/tests/6294467 btrfs_send_receive
https://openqa.suse.de/tests/6294753 snapper_cleanup
https://openqa.suse.de/tests/6294755 openssl_alpn
https://openqa.suse.de/tests/6294374 t10_vlan_ifup_all_ifdown_one_card
https://openqa.suse.de/tests/6288837 procps
https://openqa.suse.de/tests/6288840 io_getevents01
https://openqa.suse.de/tests/6288224 rpm
https://openqa.suse.de/tests/6288225 gpg

#13 Updated by MDoucha 3 months ago

  • File deleted (filter_details.py)

#14 Updated by MDoucha 3 months ago

Simplified and improved filter_details.py.

#15 Updated by MDoucha 3 months ago

I ran the scripts on OSD overnight and here's the complete list of affected jobs with incomplete/cancelled/restarted jobs filtered out.

#16 Updated by mkittler 2 months ago

  • Assignee set to mkittler

#17 Updated by mkittler 2 months ago

  • Assignee deleted (mkittler)

I still haven't found out much. If no one has taken the issue in the meantime, I'll take it again next Tuesday when I'm back from vacation.

#18 Updated by mkittler 2 months ago

  • Assignee set to mkittler

#19 Updated by mkittler 2 months ago

Additional observations:

  1. Some jobs (at least https://openqa.suse.de/tests/6288225) also miss some of the screenshots.
  2. Previous and next step results are often nevertheless present; not the whole test module is affected.
    1. This is not true for the job from the ticket description, though. Here the steps for whole modules are gone.
  3. Worker logs show the upload problems in at least some jobs, e.g.
     [2021-06-18T15:20:50.0791 CEST] [debug] [pid:16668] Uploading artefact gpg-4.txt
     [2021-06-18T15:20:50.0797 CEST] [error] [pid:16668] Uploading artefact gpg-4.txt failed: Service Unavailable
     but apparently no further upload attempts were made (in this example these are the only log lines containing gpg-4.txt).
    1. The job from the ticket description does not show the upload problem within the worker log, though. Here it looks like the upload hasn't even been attempted once. Maybe we're seeing two different problems here.

#20 Updated by MDoucha 2 months ago

mkittler wrote:

  1. The job from the ticket description does not show the upload problem within the worker log, though. Here it looks like the upload hasn't even been attempted once. Maybe we're seeing two different problems here.

I also strongly suspect that these are two different bugs. But fixing the lack of retries first will still help us diagnose the other bug (if it exists).

#21 Updated by mkittler 2 months ago

Worker log for the job in the ticket description:

…
[2021-06-22T17:51:33.0494 CEST] [debug] [pid:5432] Uploading artefact install_ltp-205.txt
[2021-06-22T17:51:33.0719 CEST] [debug] [pid:5368] Uploading artefact install_ltp-215.txt
[2021-06-22T17:51:33.0899 CEST] [info] [pid:5436] Uploading connectors
[2021-06-22T17:51:33.0899 CEST] [info] [pid:5436] Uploading connectors using multiple chunks
[2021-06-22T17:51:33.0900 CEST] [info] [pid:5436] connectors: 1 chunks
[2021-06-22T17:51:33.0900 CEST] [info] [pid:5436] connectors: chunks of 1000000 bytes each
[2021-06-22T17:51:34.0091 CEST] [debug] [pid:5432] Uploading artefact install_ltp-144.txt
[2021-06-22T17:51:34.0307 CEST] [debug] [pid:5368] Uploading artefact install_ltp-145.txt
[2021-06-22T17:51:34.0513 CEST] [info] [pid:5436] connectors: Processing chunk 1/1, avg. speed ~0.056 KiB/s
[2021-06-22T17:51:34.0520 CEST] [info] [pid:5436] Uploading vars.json
[2021-06-22T17:51:34.0599 CEST] [debug] [pid:5432] Uploading artefact install_ltp-152.txt
[2021-06-22T17:51:34.0692 CEST] [debug] [pid:5368] Uploading artefact install_ltp-14.txt
[2021-06-22T17:51:34.0911 CEST] [info] [pid:5436] Uploading autoinst-log.txt
[2021-06-22T17:51:34.0961 CEST] [debug] [pid:5432] Uploading artefact install_ltp-215.txt
[2021-06-22T17:51:35.0075 CEST] [debug] [pid:5368] Uploading artefact install_ltp-100.txt
[2021-06-22T17:51:35.0305 CEST] [info] [pid:5436] Uploading worker-log.txt

This is how it is supposed to work (judging by the code and comparing to log files):

  1. Job starts
    1. Upload results (= step details) concurrently and continuously
  2. Job stops
    1. The upload from 1. might still be running; that's ok
    2. Set job status to uploading
    3. Upload logs/assets (including autoinst-log.txt and worker-log.txt)
  3. Logs/assets (2.3) have been uploaded
    1. Wait until possibly ongoing results upload (1.1) has concluded
    2. Start a "final" results upload
  4. Final result upload has been concluded
    1. Final job state/result is set

  • Point 2.1 explains the weird interleaving between Uploading [some logfile] lines and Uploading artefact [some step file] lines.
  • Point 2.3 explains why log lines for certain artefacts can be missing completely: worker-log.txt is simply uploaded before these lines would be logged in step 3 and 4. (This is plausible because the test modules with missing artefacts are indeed at the end.)

So I would conclude that both cases (log lines for artefacts completely missing within the log file and errors logged for artefacts) are the same after all. It is just confusing (and hence problematic) that log messages from steps 3 and 4 are not part of worker-log.txt.
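
A purely schematic Python sketch of point 2.3 (not worker code; all names are made up): worker-log.txt is a snapshot taken in step 2.3, so nothing logged by the final upload in steps 3 and 4 can ever appear in it.

log_lines = []

def log(message):
    log_lines.append(message)

def continuous_upload(modules):            # step 1.1: runs while the job executes
    for module in modules[:-1]:
        log(f"Uploading artefact {module}.txt")

def upload_logs():                         # step 2.3: worker-log.txt is a snapshot of the log so far
    return list(log_lines)

def final_upload(modules):                 # step 3.2: logged only after the snapshot was taken
    log(f"Uploading artefact {modules[-1]}.txt")

modules = ["install_ltp-1", "cn_pec_sh-1"]
continuous_upload(modules)
uploaded_worker_log = upload_logs()
final_upload(modules)
assert "Uploading artefact cn_pec_sh-1.txt" not in uploaded_worker_log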

As a first step I'd just try to improve the error handling. It would make sense to retry artefact uploads in the same way other API calls to the web UI are retried. If they fail nevertheless, I assume the test should be marked as incomplete (with the relevant error message as reason).
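
For illustration, such a retry could look roughly like the following Python sketch (not the actual worker code; upload_artefact, the attempt count and the sleep interval are made-up placeholders):

import time

class UploadError(Exception):
    pass

def upload_artefact_with_retries(upload_artefact, name, data, attempts=5, sleep_seconds=5):
    """Retry a single artefact upload a few times, similar in spirit to how regular
    API calls to the web UI are retried. Raises after the last attempt so the caller
    can decide whether that is fatal (e.g. during the final upload)."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            upload_artefact(name, data)    # placeholder callback doing the actual HTTP POST
            return
        except UploadError as error:
            last_error = error
            print(f"Upload attempts remaining: {attempts - attempt}/{attempts} for {name}")
            time.sleep(sleep_seconds)
    raise UploadError(f"uploading {name} ultimately failed: {last_error}")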

As a second step I can try to move uploading autoinst-log.txt after step 4.

#22 Updated by MDoucha 2 months ago

mkittler wrote:

As a first step I'd just try to improve the error handling. It would make sense to retry artefact uploads in the same way other API calls to the web UI are retried. If they fail nevertheless, I assume the test should be marked as incomplete (with the relevant error message as reason).

I'd recommend a different approach for artefact uploads: if an upload fails even after retry, put it back in the upload queue and pause further uploads until the next trigger from the worker loop. If the network connection recovers, everything will be uploaded. If it doesn't recover, the main OpenQA server will terminate the job due to timeout either way. Only the final upload step would kill the job after several failed upload retries.
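
A rough Python sketch of that queue idea (assumed names, not an actual implementation): a failed upload goes back to the front of the queue and processing simply pauses until the worker loop triggers it again.

from collections import deque

class UploadQueue:
    """Sketch of the proposed behavior: on failure, re-queue the artefact and pause;
    the next trigger from the worker loop resumes from where processing stopped."""

    def __init__(self):
        self.pending = deque()

    def add(self, artefact):
        self.pending.append(artefact)

    def process(self, try_upload):
        # try_upload(artefact) -> bool is a placeholder callback for one upload attempt
        while self.pending:
            artefact = self.pending.popleft()
            if not try_upload(artefact):
                self.pending.appendleft(artefact)   # put it back ...
                return False                        # ... and pause until the next trigger
        return True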

#23 Updated by mkittler 2 months ago

Makes sense; I'll make it only a fatal error on the final upload. Unfortunately the upload is not written using a queue (or something similar)¹, so I cannot simply put it back. I guess rewriting the code to use a queue is a bit out of scope here, so I'll simply resort to non-fatal retries during the execution plus the final upload to catch everything that might have been missed.


¹ although it would make a lot of sense of course

#24 Updated by mkittler 2 months ago

  • Status changed from Workable to In Progress

#25 Updated by mkittler 2 months ago

By the way, I'm not even sure whether it'll be a big win to mark such tests as incomplete. Maybe it is sufficient to add the upload error (after all attempts have been depleted during the final upload) as reason (unless there is a more relevant reason of course)?

Moving the upload of the worker log to the end would also help with that (without considering the test as incomplete).

#26 Updated by MDoucha 2 months ago

mkittler wrote:

By the way, I'm not even sure whether it'll be a big win to mark such tests as incomplete. Maybe it is sufficient to add the upload error (after all attempts have been depleted during the final upload) as reason (unless there is a more relevant reason of course)?

You can certainly try to send more descriptive error info to the central OpenQA server. But since you'll be doing it while the network is so horribly broken that asset uploads have exceeded the retry limit, I don't think you have a very high chance of success.

#27 Updated by mkittler 2 months ago

My current PR adds a retry mechanism for the artefact upload similar to what we already have for regular API calls. For regular API calls it already seems to improve the situation a lot, e.g. most of the jobs you've found have their final state/result set as expected (which means the API call to send this information eventually succeeded). Hence the PR should make the artefact upload more robust as well. That means situations where the upload ultimately fails but we are still able to set the final state/result (which is also where the reason is passed) are very unlikely. So it likely doesn't make much sense to put further effort (and complicated code) into this (like changing the job result to incomplete or trying to upload the worker log later).

By the way, to avoid interference with https://github.com/os-autoinst/openQA/pull/4062 I haven't tried to streamline the retry behavior with regular API calls yet. I will do that after it has been merged, if it makes sense.

#28 Updated by openqa_review 2 months ago

  • Due date set to 2021-08-05

Setting due date based on mean cycle time of SUSE QE Tools

#29 Updated by mkittler 2 months ago

PR for streamlining the retry-behavior with regular API calls (also for having less redundant/inconsistent code): https://github.com/os-autoinst/openQA/pull/4067

#30 Updated by mkittler about 2 months ago

  • Status changed from In Progress to Feedback

All PRs have been merged. So now uploading those artefacts has the same retry behavior as regular API calls and should therefore be equally reliable.

#31 Updated by MDoucha about 2 months ago

mkittler wrote:

All PRs have been merged. So now uploading those artefacts has the same retry behavior as regular API calls and should therefore be equally reliable.

Great. Setting my alarm clock to August 31st. When it goes off, I'll run another scan for jobs with missing console output.
Note to self: OSD is currently at job ID 6596263.

#32 Updated by mkittler about 2 months ago

Sounds like a plan. (Especially because adding an automatic alert within our monitoring for this problem would be quite some work.)

#33 Updated by cdywan about 2 months ago

  • Due date changed from 2021-08-05 to 2021-08-31

I'm updating the Due date accordingly after having checked with Marius

#34 Updated by okurz 20 days ago

  • Due date changed from 2021-08-31 to 2021-09-02

The analysis takes some time. Also, we should not be warned on the last day but only if we exceed it, so I'm increasing the due date by two days.

#35 Updated by MDoucha 19 days ago

The retry code still needs some work. My scanner script found 403 jobs with missing result data within the past month (some of which may be false positives; see the attachment for the full list). Here's an example where the worker printed a retry message to the log but failed to upload the text file anyway:

https://openqa.suse.de/tests/6892431#step/consoletest_setup/21
https://openqa.suse.de/tests/6892431/logfile?filename=worker-log.txt

[2021-08-19T07:36:19.0905 CEST] [debug] [pid:7101] Uploading artefact consoletest_setup-21.txt
[2021-08-19T07:36:20.0566 CEST] [warn] [pid:7101] Upload attempts remaining: 59/60 for consoletest_setup-21.txt
[2021-08-19T07:36:20.0743 CEST] [warn] [pid:4110] Websocket connection to http://openqa.suse.de/api/v1/ws/1258 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2021-08-19T07:36:30.0752 CEST] [info] [pid:4110] Registering with openQA openqa.suse.de
[2021-08-19T07:36:32.0334 CEST] [info] [pid:4110] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1258
[2021-08-19T07:36:33.0250 CEST] [info] [pid:4110] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1258

#36 Updated by cdywan 17 days ago

  • Due date changed from 2021-09-02 to 2021-09-10

Bumping the due date since there's new information, assuming mkittler is planning to look into it next week.

#37 Updated by mkittler 14 days ago

You're right, it definitely still looks broken - as if it doesn't actually give it another try after logging Upload attempts remaining: 59/60 for….

#38 Updated by mkittler 14 days ago

PR for improving the related logging: https://github.com/os-autoinst/openQA/pull/4174

However, I still don't know what the problem is here. Judging by our code and unit tests it actually re-attempts the upload. Maybe the error isn't visible on the worker side? The error handling correctly checks for both connection and HTTP errors, though.

#39 Updated by mkittler 13 days ago

  • Status changed from Workable to Feedback

Looks like the web UI actually returns a 200 response in certain error cases. Maybe these are the remaining cases where the worker doesn't re-attempt the upload. PR: https://github.com/os-autoinst/openQA/pull/4179
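
For illustration, a client-side check would then have to treat three cases as retryable: a transport error, a non-2xx status, and a 200 response whose body nevertheless carries an error. A hedged Python sketch (using requests, not the actual worker code; the "error" field in the JSON body is an assumption):

import requests

def artefact_upload_failed(url, files):
    """Return an error string if the upload should be retried, else None."""
    try:
        response = requests.post(url, files=files, timeout=60)
    except requests.RequestException as error:        # connection-level failure
        return f"transport error: {error}"
    if response.status_code >= 400:                    # plain HTTP error
        return f"HTTP error: {response.status_code}"
    try:
        body = response.json()
    except ValueError:
        body = {}
    if isinstance(body, dict) and body.get("error"):   # 200 response, but error in the body
        return f"error despite 200 response: {body['error']}"
    return None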

#40 Updated by mkittler 10 days ago

Since not all changes of https://github.com/os-autoinst/openQA/pull/4179 were covered, I've created https://github.com/os-autoinst/openQA/pull/4184. However, that means the code path which would return a 200 response despite an error was most likely dead code and didn't cause the problem.

I also ran Martin's script again on OSD but couldn't find any further occurrences since the logging changes were deployed.

#41 Updated by cdywan 5 days ago

  • Due date changed from 2021-09-10 to 2021-10-15

Discussed in chat:

  • The logs of these tests showed that the retry doesn't work - either because of a bug in the retry code itself or because the web UI doesn't report an error although the request didn't work. Judging by the code and unit tests neither should be the case. So the uploading doesn't work in rare cases.
  • mkittler couldn't find a further occurrence of the problem on OSD even though the changes for improved logging and error handling shouldn't have fixed it.
  • Since we can't reproduce it but also don't know whether it's fixed, I suggested monitoring this a bit longer, with one of 3 possible outcomes:
    • If it happens again we can investigate
    • If someone finds evidence that it's already fixed we can document it retroactively and learn from that
    • If it simply no longer reproduces within a month, for better or worse we'll assume it's resolved

#42 Updated by okurz 4 days ago

  • Priority changed from High to Low

I agree with that. But for long-running tickets with a due date further in the future we should treat them not as "High" but as "Low".
