action #163784
closed[openqa-in-openqa] test fails in test_running on "retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config", probably due to transient network errors size:S
0%
Description
Observation¶
openQA test in scenario openqa-Tumbleweed-dev-x86_64-openqa_install+publish@64bit-2G fails in
test_running
on command
retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config
probably due to transient network errors. Same in https://openqa.opensuse.org/tests/4333739#step/test_running/8 but investigation jobs are all green so after a short time the error could not be reproduced anymore. The actual error is visible in https://openqa.opensuse.org/tests/4333739#step/test_running/6 in the middle of the JSON output: "Error detecting remote default branch name […]" which comes from https://github.com/os-autoinst/os-autoinst/blob/master/OpenQA/Isotovideo/Utils.pm#L84 https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Task/Git/Clone.pm#L60.
Acceptance criteria¶
- AC1: tests consistently pass even during intermittent outages during remote git command calls
Further details¶
Always latest result in this scenario: latest
Suggestions¶
* The $handle_output
does already die
if the git
command fails, so apparently the git command was succesful but the output didn't match the expected one. Add the actual output in that error message. The handle_output error message should output the actual returned output from the git command
* In my test I changed the git url to github.comm
, so the git clone should already fail, but still I got the same error message Error detecting remote default branch
in the test. Something to investigate
- Investigate why we didn't see this in unit tests and fix or add an according test case
- #163787 might help with investigation
Updated by okurz 3 months ago
- Related to action #163787: https://progress.opensuse.org/issues/163787 [openqa-in-openqa] autoinst-log.txt uploaded from within the test should always return non-empty content if jobs are actually executed size:S added
Updated by okurz 3 months ago
- Tags set to reactive work, openqa-in-openqa
- Tracker changed from coordination to action
- Subject changed from [epic][openqa-in-openqa] test fails in test_running on "retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config", probably due to transient network error to [openqa-in-openqa] test fails in test_running on "retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config", probably due to transient network error
- Priority changed from Normal to High
- Target version changed from future to Ready
Updated by robert.richardson 3 months ago
- Subject changed from [openqa-in-openqa] test fails in test_running on "retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config", probably due to transient network error to [openqa-in-openqa] test fails in test_running on "retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config", probably due to transient network errors size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz 2 months ago
mkittler is working on it. I am merely helping to upload the gru log as well: https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/190
Updated by openqa_review 2 months ago
- Due date set to 2024-08-14
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler 2 months ago
Looks like this isn't good enough:
- https://openqa.opensuse.org/tests/4371399#step/test_running/5
- https://openqa.opensuse.org/tests/4371397#step/test_running/6
In both cases the cloning is now retried as expected:
Jul 31 11:15:50 susetest openqa-gru[73699]: [info] Resetting all leftover Gru locks after restart
Jul 31 11:15:50 susetest openqa-gru[73699]: [info] Worker 73699 started
Jul 31 11:20:02 susetest openqa-gru[128327]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:20:03 susetest openqa-gru[128327]: [55B blob data]
Jul 31 11:20:03 susetest openqa-gru[128327]: [error] cmd returned 32256
Jul 31 11:20:39 susetest openqa-gru[149022]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:20:40 susetest openqa-gru[149022]: [55B blob data]
Jul 31 11:20:40 susetest openqa-gru[149022]: [error] cmd returned 32256
Jul 31 11:21:20 susetest openqa-gru[156570]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:21:21 susetest openqa-gru[156570]: [55B blob data]
Jul 31 11:21:21 susetest openqa-gru[156570]: [error] cmd returned 32256
Jul 31 11:22:01 susetest openqa-gru[156587]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:22:01 susetest openqa-gru[156587]: [55B blob data]
Jul 31 11:22:01 susetest openqa-gru[156587]: [error] cmd returned 32256
Jul 31 11:22:41 susetest openqa-gru[156597]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:22:42 susetest openqa-gru[156597]: [55B blob data]
Jul 31 11:22:42 susetest openqa-gru[156597]: [error] cmd returned 32256
Jul 31 11:23:17 susetest openqa-gru[156607]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:23:18 susetest openqa-gru[156607]: [55B blob data]
Jul 31 11:23:18 susetest openqa-gru[156607]: [error] cmd returned 32256
Jul 31 11:23:58 susetest openqa-gru[156617]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:23:59 susetest openqa-gru[156617]: [55B blob data]
Jul 31 11:23:59 susetest openqa-gru[156617]: [error] cmd returned 32256
Jul 31 11:24:34 susetest openqa-gru[156632]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:24:34 susetest openqa-gru[156632]: [55B blob data]
Jul 31 11:24:34 susetest openqa-gru[156632]: [error] cmd returned 32256
This blocks the job execution so the job isn't showing up at all before the timeout is reached (as the test only checks for jobs in the states running and done).
The retries were already spread over 4 minutes. Not sure whether it would have helped to retry longer. I'll also have to look into why "55B blob data" is logged (in both cases).
Updated by mkittler 2 months ago · Edited
- Status changed from Feedback to In Progress
The line [55B blob data]
¹ is probably coming from log_warning($stdout . $stderr);
within sub run_cmd_with_log_return_error
. So it is not related to the error handling I changed. Not sure whether we should improve this. After all, Git is not supposed to return any binary data here. It would be interesting to know what it actually returned, though. With my recent changes for the error handling this information (or at least stderr
) would be shown as reason but unfortunately the test stopped before the nested test could incomplete.
¹ Note that our code actually logs the binary data as-is and the square-brackets are actually a substitution by the systemd journal.
Updated by mkittler 2 months ago
- Status changed from In Progress to Feedback
PR to increase the number of retries and help debugging: https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/192
Updated by livdywan 2 months ago
https://openqa.opensuse.org/tests/4382572/logfile?filename=worker-openqa_minion_results.txt
This came back just containing []
🤔
Updated by tinita about 2 months ago
- Related to action #165408: Unreviewed issue (Group 24 openQA) test-running in openQA failing with error detecting default remote branch size:S added