action #163784: [openqa-in-openqa] test fails in test_running on "retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config", probably due to transient network errors size:S - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #163784

closed

[openqa-in-openqa] test fails in test_running on "retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config", probably due to transient network errors size:S

Added by okurz 11 months ago. Updated 10 months ago.

Status:

Resolved

Priority:

Normal

Assignee:

mkittler

Category:

Feature requests

Target version:

Ready

Start date:

2024-07-11

Due date:

% Done:

Estimated time:

Tags:

openqa-in-openqa, reactive work

Description

Observation¶

openQA test in scenario openqa-Tumbleweed-dev-x86_64-openqa_install+publish@64bit-2G fails in
test_running
on command

retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config

probably due to transient network errors. Same in https://openqa.opensuse.org/tests/4333739#step/test_running/8 but investigation jobs are all green so after a short time the error could not be reproduced anymore. The actual error is visible in https://openqa.opensuse.org/tests/4333739#step/test_running/6 in the middle of the JSON output: "Error detecting remote default branch name […]" which comes from ~~https://github.com/os-autoinst/os-autoinst/blob/master/OpenQA/Isotovideo/Utils.pm#L84~~ https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Task/Git/Clone.pm#L60.

Acceptance criteria¶

AC1: tests consistently pass even during intermittent outages during remote git command calls

Further details¶

Always latest result in this scenario: latest

Suggestions¶

* The $handle_output does already die if the git command fails, so apparently the git command was succesful but the output didn't match the expected one. Add the actual output in that error message. The handle_output error message should output the actual returned output from the git command
* In my test I changed the git url to github.comm, so the git clone should already fail, but still I got the same error message Error detecting remote default branch in the test. Something to investigate

Investigate why we didn't see this in unit tests and fix or add an according test case
#163787 might help with investigation

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by okurz 11 months ago

Subtask #163787 added

Actions

Copy link

Updated by tinita 11 months ago

We saw this again twice today:

Actions

Copy link

Updated by okurz 11 months ago

Subtask deleted (~~#163787~~)

Actions

Copy link

Updated by okurz 11 months ago

Related to action #163787: https://progress.opensuse.org/issues/163787 [openqa-in-openqa] autoinst-log.txt uploaded from within the test should always return non-empty content if jobs are actually executed size:S added

Actions

Copy link

Updated by okurz 11 months ago

Tags set to reactive work, openqa-in-openqa
Tracker changed from coordination to action
Subject changed from [epic][openqa-in-openqa] test fails in test_running on "retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config", probably due to transient network error to [openqa-in-openqa] test fails in test_running on "retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config", probably due to transient network error
Priority changed from Normal to High
Target version changed from future to Ready

Actions

Copy link

Updated by tinita 11 months ago · Edited

AC3 is handled in #163787, should it be removed here?

Actions

Copy link

Updated by okurz 11 months ago

Description updated (diff)

tinita wrote in #note-6:

AC3 is handled in #163787, should it be removed here?

yes, done

Actions

Copy link

Updated by robert.richardson 10 months ago

Subject changed from [openqa-in-openqa] test fails in test_running on "retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config", probably due to transient network error to [openqa-in-openqa] test fails in test_running on "retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config", probably due to transient network errors size:S
Description updated (diff)
Status changed from New to Workable

Actions

Copy link

Updated by livdywan 10 months ago

https://github.com/os-autoinst/os-autoinst/pull/2515

Actions

Copy link

#10

Updated by tinita 10 months ago

Description updated (diff)

Updated the description with my findings from #163787

Actions

Copy link

#11

Updated by mkittler 10 months ago

Status changed from Workable to In Progress
Assignee set to mkittler

Actions

Copy link

#12

Updated by okurz 10 months ago

mkittler is working on it. I am merely helping to upload the gru log as well: https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/190

Actions

Copy link

#13

Updated by mkittler 10 months ago

PRs:

Actions

Copy link

#14

Updated by openqa_review 10 months ago

Due date set to 2024-08-14

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

#15

Updated by mkittler 10 months ago

Status changed from In Progress to Feedback
Priority changed from High to Normal

So far https://openqa.opensuse.org/group_overview/24?limit_builds=20&limit_builds=50 looks good.

Actions

Copy link

#16

Updated by mkittler 10 months ago

Looks like this isn't good enough:

In both cases the cloning is now retried as expected:

Jul 31 11:15:50 susetest openqa-gru[73699]: [info] Resetting all leftover Gru locks after restart
Jul 31 11:15:50 susetest openqa-gru[73699]: [info] Worker 73699 started
Jul 31 11:20:02 susetest openqa-gru[128327]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:20:03 susetest openqa-gru[128327]: [55B blob data]
Jul 31 11:20:03 susetest openqa-gru[128327]: [error] cmd returned 32256
Jul 31 11:20:39 susetest openqa-gru[149022]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:20:40 susetest openqa-gru[149022]: [55B blob data]
Jul 31 11:20:40 susetest openqa-gru[149022]: [error] cmd returned 32256
Jul 31 11:21:20 susetest openqa-gru[156570]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:21:21 susetest openqa-gru[156570]: [55B blob data]
Jul 31 11:21:21 susetest openqa-gru[156570]: [error] cmd returned 32256
Jul 31 11:22:01 susetest openqa-gru[156587]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:22:01 susetest openqa-gru[156587]: [55B blob data]
Jul 31 11:22:01 susetest openqa-gru[156587]: [error] cmd returned 32256
Jul 31 11:22:41 susetest openqa-gru[156597]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:22:42 susetest openqa-gru[156597]: [55B blob data]
Jul 31 11:22:42 susetest openqa-gru[156597]: [error] cmd returned 32256
Jul 31 11:23:17 susetest openqa-gru[156607]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:23:18 susetest openqa-gru[156607]: [55B blob data]
Jul 31 11:23:18 susetest openqa-gru[156607]: [error] cmd returned 32256
Jul 31 11:23:58 susetest openqa-gru[156617]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:23:59 susetest openqa-gru[156617]: [55B blob data]
Jul 31 11:23:59 susetest openqa-gru[156617]: [error] cmd returned 32256
Jul 31 11:24:34 susetest openqa-gru[156632]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:24:34 susetest openqa-gru[156632]: [55B blob data]
Jul 31 11:24:34 susetest openqa-gru[156632]: [error] cmd returned 32256

This blocks the job execution so the job isn't showing up at all before the timeout is reached (as the test only checks for jobs in the states running and done).

The retries were already spread over 4 minutes. Not sure whether it would have helped to retry longer. I'll also have to look into why "55B blob data" is logged (in both cases).

Actions

Copy link

#17

Updated by mkittler 10 months ago · Edited

Status changed from Feedback to In Progress

The line [55B blob data]¹ is probably coming from log_warning($stdout . $stderr); within sub run_cmd_with_log_return_error. So it is not related to the error handling I changed. Not sure whether we should improve this. After all, Git is not supposed to return any binary data here. It would be interesting to know what it actually returned, though. With my recent changes for the error handling this information (or at least stderr) would be shown as reason but unfortunately the test stopped before the nested test could incomplete.

¹ Note that our code actually logs the binary data as-is and the square-brackets are actually a substitution by the systemd journal.

Actions

Copy link

#18

Updated by mkittler 10 months ago

Status changed from In Progress to Feedback

PR to increase the number of retries and help debugging: https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/192

Actions

Copy link

#19

Updated by livdywan 10 months ago

https://openqa.opensuse.org/tests/4382572/logfile?filename=worker-openqa_minion_results.txt

This came back just containing [] 🤔

Actions

Copy link

#20

Updated by okurz 10 months ago

Priority changed from Normal to High

Actions

Copy link

#21

Updated by tinita 10 months ago

Priority changed from High to Normal

https://openqa.opensuse.org/tests/4382572/logfile?filename=worker-openqa_minion_results.txt is not related.

Actions

Copy link

#22

Updated by mkittler 10 months ago

Yes, so far there are no related failures.

Actions

Copy link

#23

Updated by okurz 10 months ago

Due date deleted (~~2024-08-14~~)
Status changed from Feedback to Resolved

Ok, good enough. If there will be more problems we will be notified about it.

Actions

Copy link

#24

Updated by tinita 10 months ago

Related to action #165408: Unreviewed issue (Group 24 openQA) test-running in openQA failing with error detecting default remote branch size:S added

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #163784

[openqa-in-openqa] test fails in test_running on "retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config", probably due to transient network errors size:S

Observation¶

Acceptance criteria¶

Further details¶

Suggestions¶

Updated by okurz 11 months ago

Updated by tinita 11 months ago

Updated by okurz 11 months ago

Updated by okurz 11 months ago

Updated by okurz 11 months ago

Updated by tinita 11 months ago · Edited

Updated by okurz 11 months ago

Updated by robert.richardson 10 months ago

Updated by livdywan 10 months ago

Updated by tinita 10 months ago

Updated by mkittler 10 months ago

Updated by okurz 10 months ago

Updated by mkittler 10 months ago

Updated by openqa_review 10 months ago

Updated by mkittler 10 months ago

Updated by mkittler 10 months ago

Updated by mkittler 10 months ago · Edited

Updated by mkittler 10 months ago

Updated by livdywan 10 months ago

Updated by okurz 10 months ago

Updated by tinita 10 months ago

Updated by mkittler 10 months ago

Updated by okurz 10 months ago

Updated by tinita 10 months ago