Project

General

Profile

Actions

action #95009

closed

unstable/flaky test t/ui/12-needle-edit.t size:M

Added by okurz over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2021-07-01
Due date:
% Done:

0%

Estimated time:

Description

Observation

At least in two places I have seen t/ui/12-needle-edit.t failing, e.g. https://app.circleci.com/pipelines/github/os-autoinst/openQA/6868/workflows/e88ba45b-b130-477f-8095-441095d446a0/jobs/64683/steps

[18:12:47] t/ui/12-needle-edit.t ...................... 4/? 
    #   Failed test 'sub process openqa-webapi terminated with exit code 10752'
    #   at /home/squamata/project/t/ui/../lib/OpenQA/Test/Utils.pm line 223.
A context appears to have been destroyed without first calling release().
Based on $@ it does not look like an exception was thrown (this is not always
a reliable test)

This is a problem because the global error variables ($!, $@, and $?) will
not be restored. In addition some release callbacks will not work properly from
inside a DESTROY method.

Here are the context creation details, just in case a tool forgot to call
release():
  File: t/ui/12-needle-edit.t
  Line: 415
  Tool: Test::More::subtest

Cleaning up the CONTEXT stack...

    #   Failed test 'no (unexpected) warnings (via END block)'
    #   at /usr/lib/perl5/5.26.1/Test/Builder.pm line 135.
    # Got the following unexpected warnings:
    #   1: A context appears to have been destroyed without first calling release().
    # Based on $@ it does not look like an exception was thrown (this is not always
    # a reliable test)
    # 
    # This is a problem because the global error variables ($!, $@, and $?) will
    # not be restored. In addition some release callbacks will not work properly from
    # inside a DESTROY method.
    # 
    # Here are the context creation details, just in case a tool forgot to call
    # release():
    #   File: t/ui/12-needle-edit.t
    #   Line: 415
    #   Tool: Test::More::subtest
    # 
    # Cleaning up the CONTEXT stack...
# Test ended with extra hubs on the stack!
    # Tests were run but no plan was declared and done_testing() was not seen.
[18:12:47] t/ui/12-needle-edit.t ......................      Dubious, test returned 255 (wstat 65280, 0xff00)
All 4 subtests passed

Acceptance criteria

  • AC1: t/ui/12-needle-edit.t stable over multiple runs in CI

Suggestions

  • Move test to unstable tests
  • Find out what is the failure ratio, e.g. take a look from past circleCI jobs
  • Reproduce instability in CI or locally or fix likely problem
  • Research in git logs if there are recent related changes
  • Ensure it's stable locally
  • Remove from unstable tests again

Related issues 1 (0 open1 closed)

Related to openQA Project - action #95024: openQA test t/ui/26-jobs_restart.t very unstable (already marked as unstable) size:MResolvedokurz2021-07-022021-07-22

Actions
Actions #1

Updated by okurz over 2 years ago

https://github.com/os-autoinst/openQA/pull/4001 for the first step to mark it unstable (not a fix per se)

Actions #2

Updated by okurz over 2 years ago

  • Related to action #95024: openQA test t/ui/26-jobs_restart.t very unstable (already marked as unstable) size:M added
Actions #3

Updated by okurz over 2 years ago

  • Subject changed from unstable test t/ui/12-needle-edit.t to unstable/flaky test t/ui/12-needle-edit.t size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by tinita over 2 years ago

The verbose output looks like that:

ok 4 - Saving needle when "taking matches" selected but no matches present
# Subtest: Saving needle when "taking matches" not selected
    ok 1 - Wait for jQuery successful
[debug] [zpoeCoUn393c] POST "/tests/99938/modules/logpackages/steps/1"
[debug] [zpoeCoUn393c] Routing to controller "OpenQA::Shared::Controller::Session" and action "ensure_operator"
[debug] [zpoeCoUn393c] Routing to controller "OpenQA::WebAPI::Controller::Step" and action "save_needle_ajax"
[debug] [zpoeCoUn393c] Template "step/save_needle_ajax.html.ep" not found
[debug] [zpoeCoUn393c] Nothing has been rendered, expecting delayed response
[debug] Inactivity timeout # <-------------------------------------- not seen in a successful test. This is a message from Mojolicious
[debug] [Qf1y73voCtD_] POST "/tests/99938/modules/logpackages/steps/1"
[debug] [Qf1y73voCtD_] Routing to controller "OpenQA::Shared::Controller::Session" and action "ensure_operator"
[debug] [Qf1y73voCtD_] Routing to controller "OpenQA::WebAPI::Controller::Step" and action "save_needle_ajax"
[debug] [Qf1y73voCtD_] Template "step/save_needle_ajax.html.ep" not found
[debug] [Qf1y73voCtD_] Nothing has been rendered, expecting delayed response

# Here we can see several seconds where nothing happens <----------------------------------------------

[debug] Process 248 is performing job "8" with task "save_needle"
[error] [zpoeCoUn393c] Transaction already destroyed at /usr/lib/perl5/vendor_perl/5.26.1/Mojolicious/Renderer.pm line 29.
[error] [pid:134] Stopping openqa-webapi process because a Perl warning occurred: Unhandled rejected promise: Transaction already destroyed at /usr/lib/perl5/vendor_perl/5.26.1/Mojolicious/Renderer.pm line 29.
 at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 129.
    not ok 2 - sub process openqa-webapi terminated with exit code 10752

    #   Failed test 'sub process openqa-webapi terminated with exit code 10752'
    #   at /home/squamata/project/t/ui/../lib/OpenQA/Test/Utils.pm line 223.
    1..2
A context appears to have been destroyed without first calling release().
Based on $@ it does not look like an exception was thrown (this is not always
a reliable test)
...

https://app.circleci.com/pipelines/github/perlpunk/openQA/203/workflows/4ae4a6ee-78db-4698-b4d6-ac54d4eaff9c/jobs/984

Here the output from a successful test:

# Subtest: Saving needle when "taking matches" not selected
    ok 1 - Wait for jQuery successful
[debug] [3sSUpOuWaEeJ] POST "/tests/99938/modules/logpackages/steps/1"
[debug] [3sSUpOuWaEeJ] Routing to controller "OpenQA::Shared::Controller::Session" and action "ensure_operator"
[debug] [3sSUpOuWaEeJ] Routing to controller "OpenQA::WebAPI::Controller::Step" and action "save_needle_ajax"
[debug] [3sSUpOuWaEeJ] Template "step/save_needle_ajax.html.ep" not found
[debug] [3sSUpOuWaEeJ] Nothing has been rendered, expecting delayed response
[debug] Process 24107 is performing job "8" with task "save_needle"
[debug] [3sSUpOuWaEeJ] 200 OK (0.518946s, 1.927/s)
    ok 2 - Wait for jQuery successful
[debug] [VCYOTJ7f9TEU] POST "/tests/99938/modules/logpackages/steps/1"
[debug] [VCYOTJ7f9TEU] Routing to controller "OpenQA::Shared::Controller::Session" and action "ensure_operator"
[debug] [VCYOTJ7f9TEU] Routing to controller "OpenQA::WebAPI::Controller::Step" and action "save_needle_ajax"
[debug] [VCYOTJ7f9TEU] Template "step/save_needle_ajax.html.ep" not found
[debug] [VCYOTJ7f9TEU] Nothing has been rendered, expecting delayed response
[debug] Process 24118 is performing job "9" with task "save_needle"
[debug] [VCYOTJ7f9TEU] 200 OK (0.5342s, 1.872/s)
[debug] Process 24121 is performing job "10" with task "scan_needles"
    ok 3 - Wait for jQuery successful
    ok 4 - highlight appears correct
    1..4
ok 5 - Saving needle when "taking matches" not selected
Actions #5

Updated by tinita over 2 years ago

I find it very interesting that the test label seems to be replaced with an error message:

not ok 2 - sub process openqa-webapi terminated with exit code 10752
Actions #6

Updated by okurz over 2 years ago

  • Priority changed from High to Urgent
Actions #7

Updated by tinita over 2 years ago

  • Assignee set to tinita
Actions #8

Updated by tinita over 2 years ago

I made a branch "circleci-invest" from https://github.com/os-autoinst/openQA/commit/b8a7c9d3d38c57589d33a4ad5db02db0a04f9e8a ("Merge pull request #3991 from Martchus/carry-over-2") from yesterday. Until then we didn't see the failures.

The test shows the same failures we see in current master:
https://app.circleci.com/pipelines/github/perlpunk/openQA/204/workflows/b6dfabfd-557f-437c-bd51-1ef0e3c64ee1

Actions #9

Updated by tinita over 2 years ago

Similar results from https://github.com/os-autoinst/openQA/commit/eda517ece98346c10afe43e4bfdef1cbba7fca73 which is before the last two dependency crons including perl-Mojo-SQLite and perltidy

https://app.circleci.com/pipelines/github/perlpunk/openQA/205/workflows/e222649a-ee57-42fb-9223-95918127a67a

Can we take that as a proof that no changes from openQA side are responsible?

One thing that could be responsible is package updates that are not tracked in our ci-packages.txt, that means the ones that are already installed in the container. (For example we had problems with a chromium update recently).

The other possibility is that CircleCI performance is pretty bad currently, so being very slow triggers all our explicitly set time limits, but also other timeouts like the Mojolicious one I mentioned above.

Actions #10

Updated by okurz over 2 years ago

tinita wrote:

Similar results from https://github.com/os-autoinst/openQA/commit/eda517ece98346c10afe43e4bfdef1cbba7fca73 which is before the last two dependency crons including perl-Mojo-SQLite and perltidy

https://app.circleci.com/pipelines/github/perlpunk/openQA/205/workflows/e222649a-ee57-42fb-9223-95918127a67a

Can we take that as a proof that no changes from openQA side are responsible?

openQA source code, yes.

One thing that could be responsible is package updates that are not tracked in our ci-packages.txt, that means the ones that are already installed in the container. (For example we had problems with a chromium update recently).

correct

The other possibility is that CircleCI performance is pretty bad currently, so being very slow triggers all our explicitly set time limits, but also other timeouts like the Mojolicious one I mentioned above.

Then research if other users report similar problems. Get in contact with the circleCI community and support.

Actions #11

Updated by tinita over 2 years ago

I didn't see any complaints from others in the CircleCI support forum.

It just seemed to be slower in general, and I assume most people don't have such timeouts in their tests as we have.

PR: https://github.com/os-autoinst/openQA/pull/4009

Although it looks like the performance is not as bad as friday anymore.

Actions #12

Updated by tinita over 2 years ago

  • Status changed from Workable to Feedback
Actions #13

Updated by tinita over 2 years ago

  • Status changed from Feedback to Resolved

This specific error didn't occur anymore, so the mojo timeout helped.
Also the other timeout bumps helped, but it looks like there might be a few more tests where we should increase the TimeLimit

Actions

Also available in: Atom PDF