action #107998

closed

coordination #109668: [saga][epic] Stable and updated non-qemu backends for SLE validation

coordination #109740: [epic] Stable os-autoinst unit tests with good coverage

[sporadic][os-autoinst] t/29-backend-driver.t fails in "log output for backend driver creation"

Added by okurz about 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2022-03-08
Due date:
% Done:

0%

Estimated time:

Description

Observation

I think I have seen similar errors multiple times. From
https://github.com/os-autoinst/os-autoinst/runs/5463149078?check_suite_focus=true#step:3:613 :

7: #   Failed test 'log output for backend driver creation'
7: #   at ./29-backend-driver.t line 24.
7: # STDOUT & STDERR:
7: # hannel_in 20
7: # 
7: # don't match:
7: # (?^u:(Blocking SIGCHLD|channel_out))
7: # as expected
7: # Looks like you failed 1 test of 12.
7: [10:54:25] ./29-backend-driver.t .................... 
7: Dubious, test returned 1 (wstat 256, 0x100)

Locally I can reproduce the error. With https://github.com/okurz/scripts/blob/master/count_fail_ratio, runs=1000 count_fail_ratio prove -I. t/29-backend-driver.t failed (only) 6 times, i.e. a fail ratio of 0.6%. The failure happens in either line 24 or line 26, but at least it's fast and reproducible :)
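
For context, a rough sketch of the kind of check that fails (reconstructed from the failure output above; the exact constructor call is an assumption and the real code in t/29-backend-driver.t may differ):

#!/usr/bin/perl
use Test::More;
use strict;
use warnings;
use Test::Output qw(combined_from);
use backend::driver;

# Capture everything the driver logs while it is created, then assert on
# the expected phrases (the same regexes that show up in the failure above).
my $out = combined_from { backend::driver->new('null') };
like $out, qr/Blocking SIGCHLD|channel_out/, 'log output for backend driver creation';
like $out, qr/channel_out.+channel_in/s, 'driver channels are logged';
done_testing;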

Actions #1

Updated by okurz about 2 years ago

  • Description updated (diff)
Actions #2

Updated by okurz about 2 years ago

  • Description updated (diff)
Actions #3

Updated by livdywan about 2 years ago

  • Subject changed from [sporadic][os-autoinst] t/29-backend-driver.t fails in "log output for backend driver creation" to [sporadic][os-autoinst] t/29-backend-driver.t fails in "log output for backend driver creation" size:M
  • Status changed from New to Workable
Actions #4

Updated by kraih about 2 years ago

  • Assignee set to kraih
Actions #5

Updated by kraih about 2 years ago

  • Status changed from Workable to In Progress
Actions #6

Updated by openqa_review about 2 years ago

  • Due date set to 2022-04-08

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by okurz almost 2 years ago

  • Description updated (diff)

Discussed in the weekly unblock. For kraih it was reproduced in about 1 out of 10,000 runs. Restricting the runtime environment artificially, e.g. limiting it to one core, and running other things in the background seems to have increased the chance of failure.

We suggest that all team members try to reproduce with the above-mentioned command to see if we find an environment where it's more prone to fail. How about nice -n 19 ionice -c 3 ….
We still think that a potentially important issue is lurking there, so we are not (yet) ready to just disarm the test.

Actions #9

Updated by kraih almost 2 years ago

I still can't replicate the issue reliably, but I have a suspicion what it could be. The driver class dupes STDOUT/STDERR before and after forking a process, and that might be messing with the output buffers, perhaps with control characters from the log color highlighting overlapping in an unfortunate way. If that's the case, enabling autoflush on the log handle might resolve it. https://github.com/os-autoinst/os-autoinst/compare/k/backend_race
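
For illustration, a minimal sketch of that idea (an assumption on my part, not necessarily what the branch above does): make the duplicated log handle unbuffered so no partially written line can be flushed on both sides of the fork.

use strict;
use warnings;
use IO::Handle;

# Dup STDERR the way the driver does, then enable autoflush on the copy so
# every print is written out immediately instead of sitting in a buffer.
open(my $log_dup, '>&', \*STDERR) or die "cannot dup STDERR: $!";
$log_dup->autoflush(1);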

Actions #10

Updated by kraih almost 2 years ago

  • Status changed from In Progress to Feedback
Actions #11

Updated by kraih almost 2 years ago

  • Status changed from Feedback to Resolved

Ran the test 50k times with the change applied and got zero errors. So, fingers crossed that it's actually resolved.

Actions #12

Updated by okurz almost 2 years ago

  • Subject changed from [sporadic][os-autoinst] t/29-backend-driver.t fails in "log output for backend driver creation" size:M to [sporadic][os-autoinst] t/29-backend-driver.t fails in "log output for backend driver creation"
  • Status changed from Resolved to New
  • Assignee deleted (kraih)

That's annoying but it reappeared:

https://github.com/os-autoinst/os-autoinst/runs/5870487370?check_suite_focus=true#step:3:642

showing

7: #   Failed test 'log output for backend driver creation'
7: #   at ./29-backend-driver.t line 25.
7: # STDOUT & STDERR:
7: # 1, channel_in 20
7: # 
7: # don't match:
7: # (?^u:channel_out.+channel_in)
7: # as expected
Actions #13

Updated by okurz almost 2 years ago

  • Due date deleted (2022-04-08)
Actions #14

Updated by tinita almost 2 years ago

  • Status changed from New to In Progress
  • Assignee set to tinita
Actions #15

Updated by tinita almost 2 years ago

No new insights so far, but a smaller reproducer:

# runs=100 ./count_fail_ratio prove -I. t/29.t
% cat t/29.t
#!/usr/bin/perl
use Test::More;
use strict;
use warnings;
use Test::Output qw(combined_from stderr_from stdout_from);
use Mojo::IOLoop::ReadWriteProcess 'process';

for (1..30) {
    my ($out, $err);
    $out = stdout_from {
        $err = stderr_from { start() };
    };
#    diag "ERR: >>$err<<";
#    like $err, qr/Blocking SIGCHLD and SIGTERM/;
    like $err, qr/DEBUG.*channel_out.+channel_in/s, 'log output for backend driver creation'
        or do {
            diag "!!!!!!!!!!!! loop($_) OUT: >>$out<<";
            sleep 1;
        };
}

done_testing;

sub start {
    open(my $STDOUTPARENT, '>&', *STDOUT);
    open(my $STDERRPARENT, '>&', *STDERR);

    my $backend_process = process(
        sleeptime_during_kill => .1,
        total_sleeptime_during_kill => 30,
        max_kill_attempts => 1,
        kill_sleeptime => 0,
        blocking_stop => 1,
        separate_err => 0,
        subreaper => 1,
        code => sub {
            my $process = shift;
            open STDOUT, ">&", $STDOUTPARENT;
            open STDERR, ">&", $STDERRPARENT;
            print STDERR "Blocking SIGCHLD and SIGTERM\n"; # <-------- removing this makes it go away
            select undef, undef, undef, 0.05;
        });
    $backend_process->start;
    print STDERR "$$: DEBUG (STDERR)\nchannel_out channel_in\n";
    #print "$$: DEBUG (STDERR)\nchannel_out channel_in\n";
}

It's hard to tell how often it happens. It seems to me like I have to run it a few times, and then it starts happening more often.

The print to STDERR in the subprocess has something to do with it: parent and child seem to write to the same filehandle.

edit: also happens with a simple fork, without Mojo::IOLoop::ReadWriteProcess

Actions #16

Updated by openqa_review almost 2 years ago

  • Due date set to 2022-04-23

Setting due date based on mean cycle time of SUSE QE Tools

Actions #17

Updated by okurz almost 2 years ago

tinita wrote:

No new insights so far, but a smaller reproducer

This is awesome news!

It's hard to tell how often it happens. It seems to me like I have to run it a few times, and then it starts happening more often.

I also ran it locally with runs=100 count_fail_ratio prove and got "## count_fail_ratio: Run: 100. Fails: 13. Fail ratio 13%". I also observed two consecutive failures happening, but I would still not consider that significant.

print STDERR "Blocking SIGCHLD and SIGTERM\n"; # <-------- removing this makes it go away

I can confirm. So what would you say about just removing the print lines completely? Please see
https://github.com/os-autoinst/os-autoinst/pull/2022
and tell me what you think.

EDIT: So far 5000 runs of the above reproducer with the print line removed, without any failures.

Actions #18

Updated by okurz almost 2 years ago

  • Parent task set to #109740
Actions #19

Updated by tinita almost 2 years ago

Currently I would guess (hope) that this is only a problem when capturing output. The Capture::Tiny docs (the module Test::Output uses for capturing) clearly state that capturing output from parent and forked processes is not reliable.

Before taking any action I would first try to run the test without capturing, redirect the output into a file, and check whether the output there is complete.

If it's just a testing problem, the simplest solution could be to mock the diag function to collect the output in an array.

Actions #20

Updated by okurz almost 2 years ago

tinita wrote:

If it's just a testing problem, the simplest solution could just mock the diag function to collect output in an array.

Yes, that sounds like a good idea as well.

Actions #21

Updated by tinita almost 2 years ago

I double checked with this script:

#!/usr/bin/perl
use Test::More;
use strict;
use warnings;
use Test::Output qw(stderr_from stdout_from);

for (1..50) {
    my ($out, $err);
    if ($ENV{CAPTURE}) {
        # capture STDERR
        $out = stdout_from {
            $err = stderr_from { start() };
        };
        like $err, qr/DEBUG.*channel_out.+channel_in/s, 'log output for backend driver creation';
    }
    else {
        # Only capture STDOUT
        $out = stdout_from { $err = start() };
    }
}
pass "ok";
done_testing;

sub start {
    open(my $STDOUTPARENT, '>&', *STDOUT);
    open(my $STDERRPARENT, '>&', *STDERR);
    my $child = sub {
        open STDOUT, ">&", $STDOUTPARENT;
        open STDERR, ">&", $STDERRPARENT;
        for (1 .. 300) {
            print STDERR "($$) (loop $_) Blocking SIGCHLD and SIGTERM\n";
            select undef, undef, undef, 0.001;
        }
    };
    my $pid = fork();
    if ($pid) {}
    else { $child->(); exit; }
    print STDERR "(parent $$) DEBUG STDERR\nchannel_out channel_in\n";
}
% CAPTURE=1 runs=100 ./count_fail_ratio prove -I. t/29.t
...
## Run 89
t/29.t .. 1/?
#   Failed test 'log output for backend driver creation'
#   at t/29.t line 14.
#                   'l_in
# '
#     doesn't match '(?^s:DEBUG.*channel_out.+channel_in)'
# Looks like you failed 1 test of 51.
t/29.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/51 subtests
...
## count_fail_ratio: Run: 100. Fails: 7. Fail ratio 7%
% runs=100 ./count_fail_ratio prove -I. t/29.t >log 2>&1
% grep "DEBUG STDERR" log | wc -l
5000
% grep "channel_out channel_in" log | wc -l
5000

I ran this a few times, and when not capturing, all of the output is there as expected.
So I assume this is purely a capturing problem.

I will try to rewrite the test to mock diag and fctinfo.
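
A minimal sketch of what that rewrite could look like (my assumption only; "log" below is a placeholder for whichever module actually provides diag and fctinfo, and the final test will differ):

use Test::More;
use Test::MockModule;

# Collect log lines in an array instead of capturing STDERR, so the forked
# backend process can no longer interfere with what the assertion sees.
my @log_lines;
my $mock_log = Test::MockModule->new('log', no_auto => 1);    # no_auto: don't try to load the placeholder
$mock_log->mock($_ => sub { push @log_lines, join(' ', @_) }) for qw(diag fctinfo);

# ... create and start the backend driver as before ...

like join("\n", @log_lines), qr/channel_out.+channel_in/s,
    'log output for backend driver creation';
done_testing;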

Actions #22

Updated by tinita almost 2 years ago

  • Status changed from In Progress to Feedback
Actions #23

Updated by okurz almost 2 years ago

  • Due date deleted (2022-04-23)
  • Status changed from Feedback to Resolved

Nice! Your PR has been merged in the meantime as well. I think you applied a very diligent and properly statistics-driven approach.
