action #75370

unstable/flaky/sporadic t/full-stack.t failing on master (circleCI) "worker did not propagate URL for os-autoinst cmd srv within 1 minute"

Added by cdywan 9 months ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Concrete Bugs
Target version:
Start date: 2020-10-27
Due date: 2020-11-20
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

https://app.circleci.com/pipelines/github/os-autoinst/openQA/4619/workflows/befb448a-59ed-46b7-b98d-dd4f3d2f035f/jobs/44126/steps

#   Failed test 'test 1 is running'
#   at t/full-stack.t line 128.

    #   Failed test 'worker did not propagate URL for os-autoinst cmd srv within 1 minute'
    #   at /home/squamata/project/t/lib/OpenQA/Test/FullstackUtils.pm line 195.

    #   Failed test 'developer console for test 1'
    #   at t/full-stack.t line 134.
    # Looks like you failed 2 tests of 3.
[02:46:51] t/full-stack.t .. 377/? 
#   Failed test 'wait until developer console becomes available'
#   at t/full-stack.t line 135.

Steps to reproduce

  • The failure is observed on CircleCI
  • To be confirmed: whether this can be reproduced locally with
make test STABILITY_TEST=1 RETRY=500 FULLSTACK=1 TESTS=t/full-stack.t
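For a reproduction attempt outside the Makefile, a stability run can be sketched as a plain loop that repeats a command and stops at the first failure. This is only an illustration of the idea: `run_until_failure` is a hypothetical helper, and it is an assumption that this mirrors what the `STABILITY_TEST` and `RETRY` variables do.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to N times, stop at the first failure.
# Usage: run_until_failure N command [args...]
run_until_failure() {
    runs=$1
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        if ! "$@"; then
            # Report which iteration exposed the sporadic failure.
            echo "failed on run $i of $runs"
            return 1
        fi
        i=$((i + 1))
    done
    echo "passed $runs of $runs runs"
}

# Possible usage (variables taken from the command above, without the
# Makefile-level repetition):
# run_until_failure 500 make test FULLSTACK=1 TESTS=t/full-stack.t
```

The run number printed on failure gives a rough idea of how often the test fails, which helps when comparing before and after a fix.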

Suggestions

  • Add retries back
  • Reproduce locally or within circleCI
  • Fix tests or production code
  • Ensure stability with enough runs, e.g. 500
  • Investigate regressions in latest dependencies
    • aspell-0.60.6.1 -> aspell-0.60.8
    • aspell-spell-0.60.6.1 -> aspell-spell-0.60.8
    • libaspell15-0.60.6.1 -> libaspell15-0.60.8
    • perl-IO-Socket-SSL-2.052 -> perl-IO-Socket-SSL-2.066
    • perl-Net-SSLeay-1.81 -> perl-Net-SSLeay-1.88
    • perl-PPIx-Regexp-0.058 -> perl-PPIx-Regexp-0.071
    • perl-Selenium-Remote-Driver-1.37 -> perl-Selenium-Remote-Driver-1.38
    • python3-pathspec-0.5.9 -> python3-pathspec-0.7.0
    • python3-yamllint-1.15.0 -> python3-yamllint-1.22.1
    • ShellCheck-0.6.0 -> ShellCheck-0.7.1

See also #75346 for a new failure on master in OBS.

Workaround

Retrigger as this seems to be "sporadic".


Related issues

Related to openQA Project - action #75346: t/api/08-jobtemplates.t started failing in OBS checks (Resolved, 2020-10-26)

Has duplicate openQA Project - action #76900: unstable/flaky/sporadic t/full-stack.t test failing in CircleCI "worker did not propagate URL for os-autoinst cmd srv within 1 minute" (Resolved)

History

#1 Updated by okurz 9 months ago

  • Related to action #75346: t/api/08-jobtemplates.t started failing in OBS checks added

#2 Updated by okurz 9 months ago

  • Status changed from New to Workable
  • Priority changed from Normal to High
  • Target version set to Ready

#3 Updated by okurz 9 months ago

  • Has duplicate action #76900: unstable/flaky/sporadic t/full-stack.t test failing in CircleCI "worker did not propagate URL for os-autoinst cmd srv within 1 minute" added

#4 Updated by okurz 9 months ago

  • Subject changed from t/full-stack.t failing on master to unstable/flaky/sporadic t/full-stack.t failing on master (circleCI) "worker did not propagate URL for os-autoinst cmd srv within 1 minute"
  • Description updated (diff)
  • Priority changed from High to Normal

Prepared https://github.com/os-autoinst/openQA/pull/3503 (merged), which adds retries back at the Makefile level for now. This reduces the priority for us a bit.

Integrated duplicate report #76900
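To illustrate why Makefile-level retries lower the urgency but can also hide a sporadic failure, here is a generic retry wrapper in shell. It is not the code from the PR: `retry` is a hypothetical helper, and treating the count as "attempts in total" is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: try a command up to N times in total and succeed
# as soon as one attempt passes.
# Usage: retry N command [args...]
retry() {
    attempts=$1
    shift
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            # A pass on attempt 2 or 3 means earlier failures were masked.
            echo "passed on attempt $n"
            return 0
        fi
        n=$((n + 1))
    done
    echo "failed all $attempts attempts"
    return 1
}
```

A test that fails twice and then passes is reported as a success by `retry 3`, so the attempt number is the only trace that the test is still flaky.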

#5 Updated by cdywan 9 months ago

  • Status changed from Workable to Feedback

This PR aims to address the issue with

  • a more predictable timeout (less specialized code)
  • a longer timeout
  • ajax waits to avoid refreshing the page too fast

https://github.com/os-autoinst/openQA/pull/3504

Naturally this will need to be monitored in future builds.
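The idea of a predictable timeout with paced checks can be sketched generically as polling a condition at a fixed interval until a deadline. This is a shell illustration under assumptions, not the Perl test code from the PR; `wait_for` and the marker file in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: poll a command until it succeeds or a deadline passes.
# Usage: wait_for TIMEOUT_SECONDS INTERVAL_SECONDS command [args...]
wait_for() {
    timeout=$1
    interval=$2
    shift 2
    # Fix the deadline once so the total wait does not depend on how long
    # each individual check takes.
    deadline=$(( $(date +%s) + timeout ))
    while ! "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "timed out after ${timeout}s"
            return 1
        fi
        # Pause between checks instead of re-checking as fast as possible.
        sleep "$interval"
    done
    echo "condition met"
}

# Possible usage with a hypothetical marker file:
# wait_for 120 2 test -e /tmp/ready
```

Computing the deadline up front bounds the total wait by wall-clock time, whereas counting iterations lets the total drift when individual checks are slow.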

#6 Updated by cdywan 9 months ago

  • Assignee set to cdywan

#7 Updated by cdywan 9 months ago

  • Status changed from Feedback to Resolved

#8 Updated by okurz 9 months ago

  • Due date set to 2020-11-11
  • Status changed from Resolved to Feedback

Hi cdywan, originally I assumed that the issue is linked to a change in dependencies or something that caused the test to fail much more often in the mentioned steps than before. Both from your tickets and your PR I don't see anything that explains whether you think this is just coincidence or whether there was really something in the dependencies that might have caused slightly different behaviour.

Also, we have added back RETRY=3 to the Makefile for t/full-stack.t, which we should remove before calling this done. I suggest following the suggestions in https://progress.opensuse.org/issues/75370#Suggestions with e.g. 500 runs to verify stability.

#9 Updated by cdywan 9 months ago

  • Due date changed from 2020-11-11 to 2020-11-20

okurz wrote:

Hi cdywan, originally I assumed that the issue is linked to a change in dependencies or something that caused the test to fail much more often in the mentioned steps than before. Both from your tickets and your PR I don't see anything that explains whether you think this is just coincidence or whether there was really something in the dependencies that might have caused slightly different behaviour.

I'm not positive it has to do with a dependency change either; my PR was about making the test more robust in general, i.e. maybe it "should have" been unreliable even before.

Also, we have added back RETRY=3 to the Makefile for t/full-stack.t, which we should remove before calling this done. I suggest following the suggestions in https://progress.opensuse.org/issues/75370#Suggestions with e.g. 500 runs to verify stability.

I think what we actually want is n passes on CircleCI. But dropping the RETRY again, yes.

#10 Updated by cdywan 8 months ago

cdywan wrote:

okurz wrote:

Also, we have added back RETRY=3 to the Makefile for t/full-stack.t, which we should remove before calling this done. I suggest following the suggestions in https://progress.opensuse.org/issues/75370#Suggestions with e.g. 500 runs to verify stability.

I think what we actually want is n passes on CircleCI. But dropping the RETRY again, yes.

https://github.com/os-autoinst/openQA/pull/3562

#11 Updated by cdywan 8 months ago

  • Status changed from Feedback to Resolved

As also mentioned on the PR, I checked that previous runs of the fullstack test on CircleCI succeeded on the first try (not counting a case of a PR failing to pull the container image). PR is merged, so I think we can call this resolved since the actual fix was already "in feedback" before.
