Project

General

Profile

Actions

action #163784

closed

[openqa-in-openqa] test fails in test_running on "retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config", probably due to transient network errors size:S

Added by okurz 3 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2024-07-11
Due date:
% Done:

0%

Estimated time:

Description

Observation

openQA test in scenario openqa-Tumbleweed-dev-x86_64-openqa_install+publish@64bit-2G fails in
test_running
on command

retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config

probably due to transient network errors. Same in https://openqa.opensuse.org/tests/4333739#step/test_running/8 but investigation jobs are all green so after a short time the error could not be reproduced anymore. The actual error is visible in https://openqa.opensuse.org/tests/4333739#step/test_running/6 in the middle of the JSON output: "Error detecting remote default branch name […]" which comes from https://github.com/os-autoinst/os-autoinst/blob/master/OpenQA/Isotovideo/Utils.pm#L84 https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Task/Git/Clone.pm#L60.

Acceptance criteria

  • AC1: tests consistently pass even during intermittent outages during remote git command calls

Further details

Always latest result in this scenario: latest

Suggestions

* The $handle_output does already die if the git command fails, so apparently the git command was succesful but the output didn't match the expected one. Add the actual output in that error message. The handle_output error message should output the actual returned output from the git command
* In my test I changed the git url to github.comm, so the git clone should already fail, but still I got the same error message Error detecting remote default branch in the test. Something to investigate

  • Investigate why we didn't see this in unit tests and fix or add an according test case
  • #163787 might help with investigation

Related issues 2 (0 open2 closed)

Related to openQA Project - action #163787: https://progress.opensuse.org/issues/163787 [openqa-in-openqa] autoinst-log.txt uploaded from within the test should always return non-empty content if jobs are actually executed size:SResolvedtinita2024-07-112024-08-09

Actions
Related to openQA Project - action #165408: Unreviewed issue (Group 24 openQA) test-running in openQA failing with error detecting default remote branch size:SResolvedybonatakis2024-09-06

Actions
Actions #1

Updated by okurz 3 months ago

  • Subtask #163787 added
Actions #3

Updated by okurz 3 months ago

  • Subtask deleted (#163787)
Actions #4

Updated by okurz 3 months ago

  • Related to action #163787: https://progress.opensuse.org/issues/163787 [openqa-in-openqa] autoinst-log.txt uploaded from within the test should always return non-empty content if jobs are actually executed size:S added
Actions #5

Updated by okurz 3 months ago

  • Tags set to reactive work, openqa-in-openqa
  • Tracker changed from coordination to action
  • Subject changed from [epic][openqa-in-openqa] test fails in test_running on "retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config", probably due to transient network error to [openqa-in-openqa] test fails in test_running on "retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config", probably due to transient network error
  • Priority changed from Normal to High
  • Target version changed from future to Ready
Actions #6

Updated by tinita 3 months ago · Edited

AC3 is handled in #163787, should it be removed here?

Actions #7

Updated by okurz 3 months ago

  • Description updated (diff)

tinita wrote in #note-6:

AC3 is handled in #163787, should it be removed here?

yes, done

Actions #8

Updated by robert.richardson 3 months ago

  • Subject changed from [openqa-in-openqa] test fails in test_running on "retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config", probably due to transient network error to [openqa-in-openqa] test fails in test_running on "retry -s 10 -r 6 -e -- sh -c 'test -f /var/lib/openqa/share/tests/*/.git/config", probably due to transient network errors size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #10

Updated by tinita 3 months ago

  • Description updated (diff)

Updated the description with my findings from #163787

Actions #11

Updated by mkittler 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #12

Updated by okurz 2 months ago

mkittler is working on it. I am merely helping to upload the gru log as well: https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/190

Actions #14

Updated by openqa_review 2 months ago

  • Due date set to 2024-08-14

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by mkittler 2 months ago

  • Status changed from In Progress to Feedback
  • Priority changed from High to Normal
Actions #16

Updated by mkittler 2 months ago

Looks like this isn't good enough:

In both cases the cloning is now retried as expected:

Jul 31 11:15:50 susetest openqa-gru[73699]: [info] Resetting all leftover Gru locks after restart
Jul 31 11:15:50 susetest openqa-gru[73699]: [info] Worker 73699 started
Jul 31 11:20:02 susetest openqa-gru[128327]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:20:03 susetest openqa-gru[128327]: [55B blob data]
Jul 31 11:20:03 susetest openqa-gru[128327]: [error] cmd returned 32256
Jul 31 11:20:39 susetest openqa-gru[149022]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:20:40 susetest openqa-gru[149022]: [55B blob data]
Jul 31 11:20:40 susetest openqa-gru[149022]: [error] cmd returned 32256
Jul 31 11:21:20 susetest openqa-gru[156570]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:21:21 susetest openqa-gru[156570]: [55B blob data]
Jul 31 11:21:21 susetest openqa-gru[156570]: [error] cmd returned 32256
Jul 31 11:22:01 susetest openqa-gru[156587]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:22:01 susetest openqa-gru[156587]: [55B blob data]
Jul 31 11:22:01 susetest openqa-gru[156587]: [error] cmd returned 32256
Jul 31 11:22:41 susetest openqa-gru[156597]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:22:42 susetest openqa-gru[156597]: [55B blob data]
Jul 31 11:22:42 susetest openqa-gru[156597]: [error] cmd returned 32256
Jul 31 11:23:17 susetest openqa-gru[156607]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:23:18 susetest openqa-gru[156607]: [55B blob data]
Jul 31 11:23:18 susetest openqa-gru[156607]: [error] cmd returned 32256
Jul 31 11:23:58 susetest openqa-gru[156617]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:23:59 susetest openqa-gru[156617]: [55B blob data]
Jul 31 11:23:59 susetest openqa-gru[156617]: [error] cmd returned 32256
Jul 31 11:24:34 susetest openqa-gru[156632]: [info] Running cmd: env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git ls-remote --symref https://github.com/os-autoinst/os-autoinst-distri-example.git HEAD
Jul 31 11:24:34 susetest openqa-gru[156632]: [55B blob data]
Jul 31 11:24:34 susetest openqa-gru[156632]: [error] cmd returned 32256

This blocks the job execution so the job isn't showing up at all before the timeout is reached (as the test only checks for jobs in the states running and done).

The retries were already spread over 4 minutes. Not sure whether it would have helped to retry longer. I'll also have to look into why "55B blob data" is logged (in both cases).

Actions #17

Updated by mkittler 2 months ago · Edited

  • Status changed from Feedback to In Progress

The line [55B blob data]¹ is probably coming from log_warning($stdout . $stderr); within sub run_cmd_with_log_return_error. So it is not related to the error handling I changed. Not sure whether we should improve this. After all, Git is not supposed to return any binary data here. It would be interesting to know what it actually returned, though. With my recent changes for the error handling this information (or at least stderr) would be shown as reason but unfortunately the test stopped before the nested test could incomplete.


¹ Note that our code actually logs the binary data as-is and the square-brackets are actually a substitution by the systemd journal.

Actions #18

Updated by mkittler 2 months ago

  • Status changed from In Progress to Feedback

PR to increase the number of retries and help debugging: https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/192

Actions #20

Updated by okurz 2 months ago

  • Priority changed from Normal to High
Actions #21

Updated by tinita 2 months ago

  • Priority changed from High to Normal
Actions #22

Updated by mkittler 2 months ago

Yes, so far there are no related failures.

Actions #23

Updated by okurz 2 months ago

  • Due date deleted (2024-08-14)
  • Status changed from Feedback to Resolved

Ok, good enough. If there will be more problems we will be notified about it.

Actions #24

Updated by tinita about 2 months ago

  • Related to action #165408: Unreviewed issue (Group 24 openQA) test-running in openQA failing with error detecting default remote branch size:S added
Actions

Also available in: Atom PDF