action #94531

OpenQA worker randomly skips uploading artefacts for whole test modules size:M

Added by MDoucha 3 months ago. Updated 4 days ago.

Status: Feedback
Priority: Low
Assignee:
Category: Concrete Bugs
Target version:
Start date: 2021-06-23
Due date: 2021-10-15
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

Once in a while, some jobs randomly end up with one or more modules missing all screenshots and console output. The boxes do appear in the OpenQA web UI but only show Unable to read foo-123.txt. or a blank box instead of a screenshot:
https://openqa.suse.de/tests/6308588#step/cn_pec_sh/1

Worker-log.txt shows that the missing files were not uploaded at all. The worker uploaded all artefacts for update_kernel and install_ltp but there's not a single line with Uploading artefact cn_pec_sh-*.txt or Uploading artefact shutdown_ltp-*.txt (except for log assets uploaded by calling upload_logs() in the test itself):
https://openqa.suse.de/tests/6308588/logfile?filename=worker-log.txt

I've seen this happen randomly multiple times; usually it's just one module in the middle of a test run. Everything before and after it gets uploaded correctly.

Steps to reproduce

  • Look for jobs with test module steps that include Unable to read

Expected result

Another job in the same scenario as the original one shows the correct expected data in its steps; https://openqa.suse.de/tests/6312792#step/cn_pec_sh/1 shows:

# wait_serial expected: "# "
# Result:

# 

Acceptance criteria

  • AC1: test module steps are uploaded again for openqa.suse.de LTP cases in general

Suggestions

  • Find fail ratio
  • Try to reproduce problem locally
  • Find out what files the openQA worker should try to upload

Out of scope

  • If we find a problem in the custom LTP runner, don't try to fix it on the openQA side; have it fixed within os-autoinst-distri-opensuse

Attachments

scan_details.py (1.73 KB), MDoucha, 2021-06-30 11:24
filter_details.py (776 Bytes), MDoucha, 2021-06-30 13:07
joblist.txt (8.4 KB), MDoucha, 2021-07-01 13:58
missing_results-2021-08.txt (20.2 KB), Affected jobs in the last month, MDoucha, 2021-09-01 11:45

History

#1 Updated by okurz 3 months ago

  • Priority changed from Normal to High
  • Target version set to Ready

#2 Updated by tinita 3 months ago

It could be helpful to post more links to tests with the same behaviour (could be that there is a pattern)

#3 Updated by MDoucha 3 months ago

tinita wrote:

It could be helpful to post more links to tests with the same behaviour (could be that there is a pattern)

I would post more links if I had any, but this is maybe the third time I've seen this bug, and the previous occurrence was months ago. You could probably find a few dozen cases by scanning existing test results with a script: the details-*.json file will have text keys but no text_data in the details array.

#4 Updated by cdywan 3 months ago

Maybe this should be self-evident, but can you clarify that https://openqa.suse.de/tests/6308588#step/cn_pec_sh/1 is expected to pass even without the missing file? What is the expected result?

From the logs it seems the upload is completely missing so my first thought would be that the file wasn't created. But from a single result it's hard to tell. I guess that's also what Tina was wondering.

#5 Updated by okurz 3 months ago

To put the information into structured form, I suggest following the ticket template from https://progress.opensuse.org/projects/openqav3/wiki/#Defects

#6 Updated by MDoucha 3 months ago

cdywan wrote:

Maybe this should be self-evident, but can you clarify that https://openqa.suse.de/tests/6308588#step/cn_pec_sh/1 is expected to pass even without the missing file? What is the expected result?

Yes, this is expected to pass. The command result (pass/fail) is reported to the OpenQA database first and the console output is uploaded much later. It is expected to see a green box with "Unable to read foo-123.txt" while the test is still running, but not after it finishes.

From the logs it seems the upload is completely missing so my first thought would be that the file wasn't created. But from a single result it's hard to tell. I guess that's also what Tina was wondering.

Like I said, the easiest way to find more examples is to scan existing test results with a script. A valid details-*.json file looks like this:

{"dents":0,"details":[{"result":"ok","text":"boot_ltp-4.txt","text_data":"# wait_serial expected: qr\/login:\\s*$\/ui\n\n# Result:\n\r\nsusetest login: \n","title":"wait_serial"},...],...}

But when it's affected by this issue, the text key will be present but the text_data key will be completely missing:

{"dents":0,"details":[{"result":"ok","text":"boot_ltp-4.txt","title":"wait_serial"},...],...}

#7 Updated by cdywan 3 months ago

More thoughts:

lib/OpenQA/Task/Job/FinalizeResults.pm#L26 uploads results in a minion job.
Maybe errors are eaten here but should appear on https://openqa.suse.de/minion

#8 Updated by MDoucha 3 months ago

cdywan wrote:

More thoughts:

lib/OpenQA/Task/Job/FinalizeResults.pm#L26 uploads results in a minion job.
Maybe errors are eaten here but should appear on https://openqa.suse.de/minion

No, this is a minion task which collects the already uploaded text files and inlines them into the details-*.json file for faster webpage rendering. It does not upload anything, but it could detect this problem and send an alert somehow. The uploading is done by the OpenQA worker which actually runs the test job.
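
(Conceptually, that finalize step does something like the following Python sketch; this is not the actual openQA code, and the file layout is an assumption.)

import json
from pathlib import Path

def finalize_results(job_result_dir):
    """Sketch only: inline each referenced text file into details-*.json so the
    web UI can render steps without extra requests. Entries whose text file was
    never uploaded keep only the "text" key - exactly the symptom in this ticket."""
    result_dir = Path(job_result_dir)
    for details_file in result_dir.glob("details-*.json"):
        data = json.loads(details_file.read_text())
        for detail in data.get("details", []):
            text_name = detail.get("text")
            if not text_name or "text_data" in detail:
                continue
            text_file = result_dir / text_name
            if text_file.exists():
                detail["text_data"] = text_file.read_text(errors="replace")
            # else: the worker never uploaded the file; this is where an alert
            # could be raised instead of silently omitting text_data.
        details_file.write_text(json.dumps(data))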

#9 Updated by cdywan 3 months ago

MDoucha wrote:

cdywan wrote:

More thoughts:

lib/OpenQA/Task/Job/FinalizeResults.pm#L26 uploads results in a minion job.
Maybe errors are eaten here but should appear on https://openqa.suse.de/minion

No, this is a minion task which collects the already uploaded text files and inlines them into the details-*.json file for faster webpage rendering. It does not upload anything, but it could detect this problem and send an alert somehow. The uploading is done by the OpenQA worker which actually runs the test job.

Thank you for the correction!

FYI, kraih is helping out by improving related code: https://github.com/os-autoinst/openQA/pull/3982. Not a fix, but it could contribute to errors not being propagated.

#10 Updated by mkittler 3 months ago

This sounds like a bug in the uploading code of the openQA worker (code in lib/OpenQA/Worker/Job.pm). It's hard to tell what's going on exactly (the VPN is down so I cannot even access the referenced jobs right now). It could also be that the results are already not generated correctly by os-autoinst, but that's not very likely.

It would of course be useful to see the pool directory of such a job. Is the test order of such jobs changed at some point during the test execution? Are the test details created in a special way (e.g. parsed from an external testsuite)?

#11 Updated by okurz 3 months ago

  • Subject changed from OpenQA worker randomly skips uploading artefacts for whole test modules to OpenQA worker randomly skips uploading artefacts for whole test modules size:M
  • Description updated (diff)
  • Status changed from New to Workable

#12 Updated by MDoucha 3 months ago

Here's a list of other jobs with missing console output. I've also attached two scripts that'll find more. scan_details.py finds all jobs with missing console output; you need to run it directly on OSD. It'll also print files it can't parse properly and show progress through the results directory on standard error. filter_details.py will filter out unfinished and incomplete jobs which would most likely be false positives. Simply feed it the output of scan_details.py on standard input; you can run this script on any machine that has HTTP access to OSD. Or just run both on OSD: scan_details.py | filter_details.py >results.txt.

Edit: I've replaced filter_details.py with an improved version. See next comment.

Now for the list of jobs (just a small fraction of all affected jobs). The string after each URL is a random test module that is missing at least one console output file. Some are caused by network errors which you'll find in worker-log.txt; those are a different issue, but I didn't filter them out.

https://openqa.suse.de/tests/6351096 wireshark
https://openqa.suse.de/tests/6351244 openvswitch
https://openqa.suse.de/tests/6351363 bootcode_options
https://openqa.suse.de/tests/6351379 nis_server
https://openqa.suse.de/tests/6351394 mem03
https://openqa.suse.de/tests/6351403 if6-mtu-change_ifconfig
https://openqa.suse.de/tests/6351405 shutdown_ltp
https://openqa.suse.de/tests/6351393 clock_nanosleep02
https://openqa.suse.de/tests/6351337 setup
https://openqa.suse.de/tests/6351474 before_test
https://openqa.suse.de/tests/6351381 bootloader_start
https://openqa.suse.de/tests/6351297 barrier_init
https://openqa.suse.de/tests/6351300 barrier_init
https://openqa.suse.de/tests/6351487 setup
https://openqa.suse.de/tests/6351489 consoletest_setup
https://openqa.suse.de/tests/6351517 docker
https://openqa.suse.de/tests/6351553 glxgears
https://openqa.suse.de/tests/6351883 repos
https://openqa.suse.de/tests/6350113 ssh_interactive_init
https://openqa.suse.de/tests/6350117 ssh_interactive_init
https://openqa.suse.de/tests/6345228 install_patterns
https://openqa.suse.de/tests/6345411 fcntl36
https://openqa.suse.de/tests/6345435 epoll01
https://openqa.suse.de/tests/6342668 hana_cluster
https://openqa.suse.de/tests/6333915 quota
https://openqa.suse.de/tests/6326404 alarm06
https://openqa.suse.de/tests/6325348 settimeofday02
https://openqa.suse.de/tests/6317236 filesystem_ro
https://openqa.suse.de/tests/6317274 logs_from_installation_system
https://openqa.suse.de/tests/6308588 shutdown_ltp
https://openqa.suse.de/tests/6308958 before_test
https://openqa.suse.de/tests/6302842 install_update
https://openqa.suse.de/tests/6294460 apache
https://openqa.suse.de/tests/6294750 http_srv
https://openqa.suse.de/tests/6294612 perf
https://openqa.suse.de/tests/6294467 btrfs_send_receive
https://openqa.suse.de/tests/6294753 snapper_cleanup
https://openqa.suse.de/tests/6294755 openssl_alpn
https://openqa.suse.de/tests/6294374 t10_vlan_ifup_all_ifdown_one_card
https://openqa.suse.de/tests/6288837 procps
https://openqa.suse.de/tests/6288840 io_getevents01
https://openqa.suse.de/tests/6288224 rpm
https://openqa.suse.de/tests/6288225 gpg

#13 Updated by MDoucha 3 months ago

  • File deleted (filter_details.py)

#14 Updated by MDoucha 3 months ago

Simplified and improved filter_details.py.

#15 Updated by MDoucha 3 months ago

I ran the scripts on OSD overnight and here's the complete list of affected jobs with incomplete/cancelled/restarted jobs filtered out.

#16 Updated by mkittler 2 months ago

  • Assignee set to mkittler

#17 Updated by mkittler 2 months ago

  • Assignee deleted (mkittler)

I still haven't found out much. If no one has taken the issue in the meantime, I'll take it again next Tuesday when I'm back from vacation.

#18 Updated by mkittler 2 months ago

  • Assignee set to mkittler

#19 Updated by mkittler 2 months ago

Additional observations:

  1. Some jobs (at least https://openqa.suse.de/tests/6288225) also miss some of the screenshots.
  2. Previous and next step results are often nevertheless present; not the whole test module is affected.
    1. This is not true for the job from the ticket description, though. Here the steps for whole modules are gone.
  3. Worker logs show the upload problems in at least some jobs, e.g.
     [2021-06-18T15:20:50.0791 CEST] [debug] [pid:16668] Uploading artefact gpg-4.txt
     [2021-06-18T15:20:50.0797 CEST] [error] [pid:16668] Uploading artefact gpg-4.txt failed: Service Unavailable
     but apparently no further upload attempts were made (in this example these are the only log lines containing gpg-4.txt).
    1. The job from the ticket description does not show the upload problem within the worker log, though. Here it looks like the upload hasn't even been attempted once. Maybe we're seeing two different problems here.

#20 Updated by MDoucha 2 months ago

mkittler wrote:

  1. The job from the ticket description does not show the upload problem within the worker log, though. Here it looks like the upload hasn't even been attempted once. Maybe we're seeing two different problems here.

I also strongly suspect that these are two different bugs. But fixing the lack of retries first will still help us diagnose the other bug (if it exists).

#21 Updated by mkittler 2 months ago

Worker log for the job in the ticket description:

…
[2021-06-22T17:51:33.0494 CEST] [debug] [pid:5432] Uploading artefact install_ltp-205.txt
[2021-06-22T17:51:33.0719 CEST] [debug] [pid:5368] Uploading artefact install_ltp-215.txt
[2021-06-22T17:51:33.0899 CEST] [info] [pid:5436] Uploading connectors
[2021-06-22T17:51:33.0899 CEST] [info] [pid:5436] Uploading connectors using multiple chunks
[2021-06-22T17:51:33.0900 CEST] [info] [pid:5436] connectors: 1 chunks
[2021-06-22T17:51:33.0900 CEST] [info] [pid:5436] connectors: chunks of 1000000 bytes each
[2021-06-22T17:51:34.0091 CEST] [debug] [pid:5432] Uploading artefact install_ltp-144.txt
[2021-06-22T17:51:34.0307 CEST] [debug] [pid:5368] Uploading artefact install_ltp-145.txt
[2021-06-22T17:51:34.0513 CEST] [info] [pid:5436] connectors: Processing chunk 1/1, avg. speed ~0.056 KiB/s
[2021-06-22T17:51:34.0520 CEST] [info] [pid:5436] Uploading vars.json
[2021-06-22T17:51:34.0599 CEST] [debug] [pid:5432] Uploading artefact install_ltp-152.txt
[2021-06-22T17:51:34.0692 CEST] [debug] [pid:5368] Uploading artefact install_ltp-14.txt
[2021-06-22T17:51:34.0911 CEST] [info] [pid:5436] Uploading autoinst-log.txt
[2021-06-22T17:51:34.0961 CEST] [debug] [pid:5432] Uploading artefact install_ltp-215.txt
[2021-06-22T17:51:35.0075 CEST] [debug] [pid:5368] Uploading artefact install_ltp-100.txt
[2021-06-22T17:51:35.0305 CEST] [info] [pid:5436] Uploading worker-log.txt

This is how it is supposed to work (judging by the code and comparing to log files):

  1. Job starts
    1. Upload results (= step details) concurrently and continuously
  2. Job stops
    1. The upload from 1. might still be running; that's ok
    2. Set job status to uploading
    3. Upload logs/assets (including autoinst-log.txt and worker-log.txt)
  3. Logs/assets (2.3) have been uploaded
    1. Wait until possibly ongoing results upload (1.1) has concluded
    2. Start a "final" results upload
  4. Final result upload has been concluded
    1. Final job state/result is set

  • Point 2.1 explains the weird interleaving between Uploading [some logfile] lines and Uploading artefact [some step file] lines.
  • Point 2.3 explains why log lines for certain artefacts can be missing completely: worker-log.txt is simply uploaded before these lines would be logged in step 3 and 4. (This is plausible because the test modules with missing artefacts are indeed at the end.)

So I would conclude that both cases (log lines for artefacts completely missing within the log file and errors logged for artefacts) are the same after all. It is just confusing (and hence problematic) that log messages from steps 3 and 4 are not part of worker-log.txt.
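
A purely schematic Python sketch of point 2.3 (not worker code; all names are made up): worker-log.txt is a snapshot taken in step 2.3, so nothing logged by the final upload in steps 3 and 4 can ever appear in it.

log_lines = []

def log(message):
    log_lines.append(message)

def continuous_upload(modules):            # step 1.1: runs while the job executes
    for module in modules[:-1]:
        log(f"Uploading artefact {module}.txt")

def upload_logs():                         # step 2.3: worker-log.txt is a snapshot of the log so far
    return list(log_lines)

def final_upload(modules):                 # step 3.2: logged only after the snapshot was taken
    log(f"Uploading artefact {modules[-1]}.txt")

modules = ["install_ltp-1", "cn_pec_sh-1"]
continuous_upload(modules)
uploaded_worker_log = upload_logs()
final_upload(modules)
assert "Uploading artefact cn_pec_sh-1.txt" not in uploaded_worker_log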

As a first step I'd just try to improve the error handling. It would make sense to retry artefact uploads in the same way other API calls to the web UI are retried. If they fail nevertheless, I assume the test should be marked as incomplete (with the relevant error message as reason).
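
For illustration, such a retry could look roughly like the following Python sketch (not the actual worker code; upload_artefact, the attempt count and the sleep interval are made-up placeholders):

import time

class UploadError(Exception):
    pass

def upload_artefact_with_retries(upload_artefact, name, data, attempts=5, sleep_seconds=5):
    """Retry a single artefact upload a few times, similar in spirit to how regular
    API calls to the web UI are retried. Raises after the last attempt so the caller
    can decide whether that is fatal (e.g. during the final upload)."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            upload_artefact(name, data)    # placeholder callback doing the actual HTTP POST
            return
        except UploadError as error:
            last_error = error
            print(f"Upload attempts remaining: {attempts - attempt}/{attempts} for {name}")
            time.sleep(sleep_seconds)
    raise UploadError(f"uploading {name} ultimately failed: {last_error}")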

As a second step I can try to move uploading autoinst-log.txt after step 4.

#22 Updated by MDoucha 2 months ago

mkittler wrote:

As a first step I'd just try to improve the error handling. It would make sense to retry artefact uploads in the same way other API calls to the web UI are retried. If they fail nevertheless, I assume the test should be marked as incomplete (with the relevant error message as reason).

I'd recommend a different approach for artefact uploads: if an upload fails even after retry, put it back in the upload queue and pause further uploads until the next trigger from the worker loop. If the network connection recovers, everything will be uploaded. If it doesn't recover, the main OpenQA server will terminate the job due to timeout either way. Only the final upload step would kill the job after several failed upload retries.
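
A rough Python sketch of that queue idea (assumed names, not an actual implementation): a failed upload goes back to the front of the queue and processing simply pauses until the worker loop triggers it again.

from collections import deque

class UploadQueue:
    """Sketch of the proposed behavior: on failure, re-queue the artefact and pause;
    the next trigger from the worker loop resumes from where processing stopped."""

    def __init__(self):
        self.pending = deque()

    def add(self, artefact):
        self.pending.append(artefact)

    def process(self, try_upload):
        # try_upload(artefact) -> bool is a placeholder callback for one upload attempt
        while self.pending:
            artefact = self.pending.popleft()
            if not try_upload(artefact):
                self.pending.appendleft(artefact)   # put it back ...
                return False                        # ... and pause until the next trigger
        return True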

#23 Updated by mkittler 2 months ago

Makes sense; I'll make it only a fatal error on the final upload. Unfortunately the upload is not written using a queue (or something similar)¹, so I cannot simply put it back. I guess rewriting the code to use a queue is a bit out of scope here, so I'll simply resort to non-fatal retries during the execution plus the final upload to catch everything that might have been missed.


¹ although it would make a lot of sense of course

#24 Updated by mkittler 2 months ago

  • Status changed from Workable to In Progress

#25 Updated by mkittler 2 months ago

By the way, I'm not even sure whether it'll be a big win to mark such tests as incomplete. Maybe it is sufficient to add the upload error (after all attempts have been depleted during the final upload) as reason (unless there is a more relevant reason of course)?

Moving the upload of the worker log to the end would also help with that (without considering the test as incomplete).

#26 Updated by MDoucha 2 months ago

mkittler wrote:

By the way, I'm not even sure whether it'll be a big win to mark such tests as incomplete. Maybe it is sufficient to add the upload error (after all attempts have been depleted during the final upload) as reason (unless there is a more relevant reason of course)?

You can certainly try to send more descriptive error info to the central OpenQA server. But since you'll be doing it while the network is so horribly broken that asset uploads have exceeded the retry limit, I don't think you have a very high chance of success.

#27 Updated by mkittler 2 months ago

My current PR adds a retry mechanism for the artefact upload similar to what we already have for regular API calls. For regular API calls it already seems to improve the situation a lot, e.g. most of the jobs you've found have their final state/result set as expected (which means the API call to send this information eventually succeeded). Hence the PR should make the artefact upload more robust as well. That means situations where the upload ultimately fails but we are still able to set the final state/result (which is also where the reason is passed) are very unlikely. So it likely doesn't make much sense to put further effort (and complicated code) into this (like changing the job result to incomplete or trying to upload the worker log later).

By the way, to avoid interference with https://github.com/os-autoinst/openQA/pull/4062 I haven't tried to streamline the retry behavior with regular API calls yet. I will do that after it has been merged, if it makes sense.

#28 Updated by openqa_review 2 months ago

  • Due date set to 2021-08-05

Setting due date based on mean cycle time of SUSE QE Tools

#29 Updated by mkittler 2 months ago

PR for streamlining the retry-behavior with regular API calls (also for having less redundant/inconsistent code): https://github.com/os-autoinst/openQA/pull/4067

#30 Updated by mkittler about 2 months ago

  • Status changed from In Progress to Feedback

All PRs have been merged. So now uploading those artefacts has the same retry behavior as regular API calls and should therefore be equally reliable.

#31 Updated by MDoucha about 2 months ago

mkittler wrote:

All PRs have been merged. So now uploading those artefacts has the same retry behavior as regular API calls and should therefore be equally reliable.

Great. Setting my alarm clock to August 31st. When it goes off, I'll run another scan for jobs with missing console output.
Note to self: OSD is currently at job ID 6596263.

#32 Updated by mkittler about 2 months ago

Sounds like a plan. (Especially because adding an automatic alert within our monitoring for this problem would be quite some work.)

#33 Updated by cdywan about 2 months ago

  • Due date changed from 2021-08-05 to 2021-08-31

I'm updating the Due date accordingly after having checked with Marius

#34 Updated by okurz 20 days ago

  • Due date changed from 2021-08-31 to 2021-09-02

The analysis takes some time. Also, we should not be warned on the last day but only if we exceed it, so I'm increasing the due date by two days.

#35 Updated by MDoucha 19 days ago

The retry code still needs some work. My scanner script found 403 jobs with missing result data within the past month (some of which may be false positives; see the attachment for the full list). Here's an example where the worker printed a retry message to the log but failed to upload the text file anyway:

https://openqa.suse.de/tests/6892431#step/consoletest_setup/21
https://openqa.suse.de/tests/6892431/logfile?filename=worker-log.txt

[2021-08-19T07:36:19.0905 CEST] [debug] [pid:7101] Uploading artefact consoletest_setup-21.txt
[2021-08-19T07:36:20.0566 CEST] [warn] [pid:7101] Upload attempts remaining: 59/60 for consoletest_setup-21.txt
[2021-08-19T07:36:20.0743 CEST] [warn] [pid:4110] Websocket connection to http://openqa.suse.de/api/v1/ws/1258 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2021-08-19T07:36:30.0752 CEST] [info] [pid:4110] Registering with openQA openqa.suse.de
[2021-08-19T07:36:32.0334 CEST] [info] [pid:4110] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1258
[2021-08-19T07:36:33.0250 CEST] [info] [pid:4110] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1258

#36 Updated by cdywan 17 days ago

  • Due date changed from 2021-09-02 to 2021-09-10

Bumping the due date since there's new information, assuming mkittler is planning to look into it next week.

#37 Updated by mkittler 14 days ago

You're right, it definitely still looks broken - as if it doesn't actually give it another try after logging Upload attempts remaining: 59/60 for….

#38 Updated by mkittler 14 days ago

PR for improving the related logging: https://github.com/os-autoinst/openQA/pull/4174

However, I still don't know what the problem is here. Judging by our code and unit tests it actually re-attempts the upload. Maybe the error isn't visible on the worker side? The error handling correctly checks for both connection and HTTP errors, though.

#39 Updated by mkittler 13 days ago

  • Status changed from Workable to Feedback

Looks like the web UI actually returns a 200 response in certain error cases. Maybe these are the remaining cases where the worker doesn't re-attempt the upload. PR: https://github.com/os-autoinst/openQA/pull/4179
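
For illustration, a client-side check would then have to treat three cases as retryable: a transport error, a non-2xx status, and a 200 response whose body nevertheless carries an error. A hedged Python sketch (using requests, not the actual worker code; the "error" field in the JSON body is an assumption):

import requests

def artefact_upload_failed(url, files):
    """Return an error string if the upload should be retried, else None."""
    try:
        response = requests.post(url, files=files, timeout=60)
    except requests.RequestException as error:        # connection-level failure
        return f"transport error: {error}"
    if response.status_code >= 400:                    # plain HTTP error
        return f"HTTP error: {response.status_code}"
    try:
        body = response.json()
    except ValueError:
        body = {}
    if isinstance(body, dict) and body.get("error"):   # 200 response, but error in the body
        return f"error despite 200 response: {body['error']}"
    return None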

#40 Updated by mkittler 10 days ago

Since not all changes of https://github.com/os-autoinst/openQA/pull/4179 were covered, I've created https://github.com/os-autoinst/openQA/pull/4184. However, that means the code path which would return a 200 response despite an error was most likely dead code and didn't cause the problem.

I also ran Martin's script again on OSD but couldn't find any further occurrences since the logging changes were deployed.

#41 Updated by cdywan 5 days ago

  • Due date changed from 2021-09-10 to 2021-10-15

Discussed in chat:

  • The logs of these tests showed that the retry doesn't work - either because of a bug in the retry code itself or because the web UI doesn't report an error although the request didn't work. Judging by the code and unit tests neither should be the case. So the uploading doesn't work in rare cases.
  • mkittler couldn't find a further occurrence of the problem on OSD even though the changes for improved logging and error handling shouldn't have fixed it.
  • Since we can't reproduce it but also don't know whether it's fixed, I suggested monitoring this a bit longer, with one of 3 possible outcomes:
    • If it happens again we can investigate
    • If someone finds evidence that it's already fixed we can document it retroactively and learn from that
    • If it simply no longer reproduces within a month, for better or worse we'll assume it's resolved

#42 Updated by okurz 4 days ago

  • Priority changed from High to Low

I agree with that. But for long-running tickets with a due date further in the future we should treat them not as "High" but as "Low".
