Project

General

Profile

Actions

action #72196

closed

t/24-worker-jobs.t fails in OBS

Added by livdywan over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2020-10-02
Due date:
% Done:

0%

Estimated time:

Description

Observation

[  648s] 
[  648s]     #   Failed test 'no error waiting for 'job status changed to stopped''
[  648s]     #   at ./t/24-worker-jobs.t line 79.
[  648s]     #          got: ''job status changed to stopped' not happened after 15 seconds'
[  648s]     #     expected: undef
[  648s] 
[  648s]     #   Failed test 'no error waiting for 'job status changed to stopped''
[  648s]     #   at ./t/24-worker-jobs.t line 79.
[  648s] 
[  648s]     #   Failed test 'no error waiting for 'job status changed to stopped''
[  648s]     #   at ./t/24-worker-jobs.t line 79.
[  648s]     #          got: ''job status changed to stopped' not happened after 15 seconds'
[  648s]     #     expected: undef
[  648s]     #          got: ''job status changed to stopped' not happened after 15 seconds'
[  648s]     #     expected: undef
[  648s] Forked inside subtest, but subtest never finished!
[  648s] 
[  648s]     #   Failed test 'no error waiting for 'job status changed to stopped''
[  648s]     #   at ./t/24-worker-jobs.t line 79.
[  648s] Forked inside subtest, but subtest never finished!
[  648s]     #          got: ''job status changed to stopped' not happened after 15 seconds'
[  648s]     #     expected: undef
[  648s] Forked inside subtest, but subtest never finished!
[  648s] 
[  648s]     #   Failed test 'no error waiting for 'job status changed to stopped''
[  648s]     #   at ./t/24-worker-jobs.t line 79.
[  648s]     #          got: ''job status changed to stopped' not happened after 15 seconds'
[  648s]     #     expected: undef

Steps to reproduce

  • Inspect OBS build logs

Problem

All builds on OBS currently seem to fail in t/24-worker-jobs.t

Suggestion

  • Reproduce locally
  • Reproduce in CircleCI

Workaround


Related issues 2 (0 open2 closed)

Related to openQA Infrastructure - action #72139: openQA services on OSD failed to connect to databaseResolvedokurz2020-09-302020-10-20

Actions
Related to openQA Project - action #69160: Improve package and base OS version support: Enable tests for more reposResolvedokurz2020-07-21

Actions
Actions #1

Updated by okurz over 3 years ago

  • Target version set to Ready
Actions #2

Updated by okurz over 3 years ago

  • Related to action #72139: openQA services on OSD failed to connect to database added
Actions #3

Updated by okurz over 3 years ago

  • Related to action #69160: Improve package and base OS version support: Enable tests for more repos added
Actions #4

Updated by okurz over 3 years ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

Running env runs=200 count_fail_ratio prove -l t/24-worker-jobs.t locally does not fail the test. But after upgrading to perl-Mojolicious-8.60 (from 8.59) I get:

## Run 1
t/24-worker-jobs.t .. 1/? Mojo::Reactor::Poll: Timer failed: Can't call method "unsubscribe" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 98.
Mojo::Reactor::Poll: Timer failed: Can't call method "unsubscribe" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 98.
Mojo::Reactor::Poll: Timer failed: Can't call method "unsubscribe" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 98.
Mojo::Reactor::Poll: Timer failed: Can't call method "unsubscribe" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 98.
Mojo::Reactor::Poll: Timer failed: Can't call method "unsubscribe" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 98.
Mojo::Reactor::Poll: Timer failed: Can't call method "unsubscribe" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 98.
Mojo::Reactor::Poll: Timer failed: Can't call method "unsubscribe" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 98.
Mojo::Reactor::Poll: Timer failed: Can't call method "unsubscribe" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 98.
Mojo::Reactor::Poll: Timer failed: Can't call method "unsubscribe" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 98.
Mojo::Reactor::Poll: Timer failed: Can't call method "unsubscribe" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 98.
Mojo::Reactor::Poll: Timer failed: Can't call method "unsubscribe" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 98.
Mojo::Reactor::Poll: Timer failed: Can't call method "unsubscribe" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 98.
Mojo::Reactor::Poll: Timer failed: Can't call method "unsubscribe" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 98.
Mojo::Reactor::Poll: Timer failed: Can't call method "unsubscribe" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 98.
Mojo::Reactor::Poll: Timer failed: Can't call method "unsubscribe" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 98.
Forked inside subtest, but subtest never finished!
Use of uninitialized value in numeric ne (!=) at /home/okurz/local/os-autoinst/openQA/lib/OpenQA/Worker/Job.pm line 752.
Use of uninitialized value in numeric ne (!=) at /home/okurz/local/os-autoinst/openQA/lib/OpenQA/Worker/Job.pm line 752.
Use of uninitialized value in numeric ne (!=) at /home/okurz/local/os-autoinst/openQA/lib/OpenQA/Worker/Job.pm line 752.
Forked inside subtest, but subtest never finished!
Forked inside subtest, but subtest never finished!
Forked inside subtest, but subtest never finished!
Forked inside subtest, but subtest never finished!
Forked inside subtest, but subtest never finished!
Forked inside subtest, but subtest never finished!
Use of uninitialized value in numeric ne (!=) at /home/okurz/local/os-autoinst/openQA/lib/OpenQA/Worker/Job.pm line 752.
Forked inside subtest, but subtest never finished!
[ERROR] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31.

[ERROR] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31.

[ERROR] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31.

[ERROR] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31.

[ERROR] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31.

[ERROR] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31.

Use of uninitialized value in numeric ne (!=) at /home/okurz/local/os-autoinst/openQA/lib/OpenQA/Worker/Job.pm line 752.
Use of uninitialized value in numeric ne (!=) at /home/okurz/local/os-autoinst/openQA/lib/OpenQA/Worker/Job.pm line 752.
Use of uninitialized value in numeric ne (!=) at /home/okurz/local/os-autoinst/openQA/lib/OpenQA/Worker/Job.pm line 752.
Use of uninitialized value in numeric ne (!=) at /home/okurz/local/os-autoinst/openQA/lib/OpenQA/Worker/Job.pm line 752.
Mojo::Reactor::Poll: I/O watcher failed: Label not found for "last T2_SUBTEST_WRAPPER" at /usr/lib/perl5/vendor_perl/5.26.1/Test2/Hub/Subtest.pm line 67.
Use of uninitialized value in numeric ne (!=) at /home/okurz/local/os-autoinst/openQA/lib/OpenQA/Worker/Job.pm line 752.

    #   Failed test 'no error waiting for 'job status changed to stopped''
    #   at t/24-worker-jobs.t line 79.

    #   Failed test 'no error waiting for 'job status changed to stopped''
    #   at t/24-worker-jobs.t line 79.
    #          got: ''job status changed to stopped' not happened after 15 seconds'
    #     expected: undef
    #          got: ''job status changed to stopped' not happened after 15 seconds'
    #     expected: undef

    #   Failed test 'no error waiting for 'job status changed to stopped''
    #   at t/24-worker-jobs.t line 79.

    #   Failed test 'no error waiting for 'job status changed to stopped''
    #   at t/24-worker-jobs.t line 79.
    #          got: ''job status changed to stopped' not happened after 15 seconds'
    #     expected: undef
    #          got: ''job status changed to stopped' not happened after 15 seconds'
    #     expected: undef
Forked inside subtest, but subtest never finished!
Forked inside subtest, but subtest never finished!
Forked inside subtest, but subtest never finished!
Forked inside subtest, but subtest never finished!
[ERROR] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31.

Use of uninitialized value in numeric ne (!=) at /home/okurz/local/os-autoinst/openQA/lib/OpenQA/Worker/Job.pm line 752.

    #   Failed test 'no error waiting for 'job status changed to stopped''
    #   at t/24-worker-jobs.t line 79.
    #          got: ''job status changed to stopped' not happened after 15 seconds'
    #     expected: undef
Forked inside subtest, but subtest never finished!
[ERROR] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31.

[ERROR] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31.

[ERROR] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31.

Use of uninitialized value in numeric ne (!=) at /home/okurz/local/os-autoinst/openQA/lib/OpenQA/Worker/Job.pm line 752.
Use of uninitialized value in numeric ne (!=) at /home/okurz/local/os-autoinst/openQA/lib/OpenQA/Worker/Job.pm line 752.
Use of uninitialized value in numeric ne (!=) at /home/okurz/local/os-autoinst/openQA/lib/OpenQA/Worker/Job.pm line 752.

    #   Failed test 'no error waiting for 'job status changed to stopped''
    #   at t/24-worker-jobs.t line 79.
    #          got: ''job status changed to stopped' not happened after 15 seconds'
    #     expected: undef
Forked inside subtest, but subtest never finished!

    #   Failed test 'no error waiting for 'job status changed to stopped''
    #   at t/24-worker-jobs.t line 79.
    #          got: ''job status changed to stopped' not happened after 15 seconds'
    #     expected: undef
Forked inside subtest, but subtest never finished!
[ERROR] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31.

[ERROR] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31.

[ERROR] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31.


    #   Failed test 'no error waiting for 'job status changed to stopped''
    #   at t/24-worker-jobs.t line 79.
    #          got: ''job status changed to stopped' not happened after 15 seconds'
    #     expected: undef
    # Looks like you failed 1 test of 11

Probably tests within OBS would have shown this problem before deployment on o3 and/or osd if tests would have been enabled. Waiting for #72139 . As an alternative we could link an old package of perl-Mojolicious to devel:openQA

EDIT: Created https://build.opensuse.org/package/show/home:okurz:branches:devel:languages:perl/perl-Mojolicious with an update to the fixed version of perl-Mojolicious. I tried to fix https://build.opensuse.org/projects/devel:openQA:TestGithub:PR-3421/meta with

  <repository name="openSUSE_Factory">
    <path project="home:okurz:branches:devel:languages:perl" repository="pure_factory"/>
    <path project="devel:openQA" repository="openSUSE_Factory"/>
    <arch>x86_64</arch>
  </repository>

as well as with

  <repository name="openSUSE_Factory">
    <path project="home:okurz:openQA" repository="pure_factory"/>
    <arch>x86_64</arch>
  </repository>

with a new repo that links to the new package version as well as devel:openQA but newly triggered builds seem to still not pick up the new repo config. And unfortunately I don't know where the project meta config template resides when we create new pull requests.

Actions #5

Updated by okurz over 3 years ago

  • Description updated (diff)
Actions #6

Updated by mkittler over 3 years ago

But after upgrading to perl-Mojolicious-8.60 (from 8.59) I get:

I get the same when running the test locally. Updating to perl-Mojolicious 8.61 fixes it. Likely the "Fix fork-safety featureā€¦" commit is relevant for this test and not the other fix for connection leaks. So this test shows a different issue with Mojolicious 8.60 than the issue we've seen in production. (See https://github.com/mojolicious/mojo/compare/v8.60...v8.61 for the mentioned commits.)

Actions #7

Updated by kraih over 3 years ago

Aside from the problem with Mojolicious 8.60 there is more broken here in the test. It has quite a few fails, but ends up passing in a local dev environment. The reason is that all the failing tests happen in forked processes and the TAP output from those forked processes is for some reason redirected to STDERR. There the TAP parser (prove) can't see it and the test passes. Generating TAP from more than one process is usually a really bad idea anyway (bad TAP output) and we should probably avoid that. Easiest solution would be to remove the is check in wait_for_job. And then we could look into increasing the test coverage another way.

Actions #8

Updated by okurz over 3 years ago

  • Status changed from Blocked to Workable
  • Assignee changed from okurz to kraih
  • Priority changed from High to Normal

@kraih as we found a workaround and a fix is on the way I can give the ticket over to you as you suggested in chat that you could pick up the ticket to improve the test in the way you suggested in #72196#note-7

Actions #9

Updated by kraih over 3 years ago

  • Status changed from Workable to In Progress

I'm slowly getting closer to the root problem, but it's quite a bit more annoying than i expected. The test runs an event loop until a certain job event happens that it is waiting for and then stops the event loop. But while it is waiting the job forks one or more processes. And for some reason a bunch of those do not exit properly and end up returning back into the test. Now we have multiple processes running the remaining tests in parallel. This happens more than once, so in the end we have multiple forks of the main test process. Tracking it down is not so easy, since it all goes through the event loop without a call stack to follow.

Actions #10

Updated by kraih over 3 years ago

I did manage to locate the root problem, and it was another upstream issue in Mojolicious. https://github.com/mojolicious/mojo/compare/5fb9f808f043...2286d3c7eb82 An exception raised by the event loop reset caused the POSIX::_exit call to be bypassed. That allowed the event loop to recover with a warning and the process broke out of its subprocess context back into the main test file.

Actions #11

Updated by okurz over 3 years ago

This is promising news.

Actions #12

Updated by kraih over 3 years ago

  • Status changed from In Progress to Feedback

I expect this to be completely resolved with Mojolicious 8.63.

Actions #13

Updated by okurz over 3 years ago

right, so this needs to wait for https://build.opensuse.org/request/show/841736

Actions #14

Updated by livdywan over 3 years ago

  • Status changed from Feedback to Resolved

okurz wrote:

right, so this needs to wait for https://build.opensuse.org/request/show/841736

The request got accepted so I assume we're good

Actions

Also available in: Atom PDF