action #107998

closed

coordination #109668: [saga][epic] Stable and updated non-qemu backends for SLE validation

coordination #109740: [epic] Stable os-autoinst unit tests with good coverage

[sporadic][os-autoinst] t/29-backend-driver.t fails in "log output for backend driver creation"

Added by okurz about 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2022-03-08
Due date:
% Done:

0%

Estimated time:

Description

Observation

I think I have seen similar errors multiple times. From
https://github.com/os-autoinst/os-autoinst/runs/5463149078?check_suite_focus=true#step:3:613 :

7: #   Failed test 'log output for backend driver creation'
7: #   at ./29-backend-driver.t line 24.
7: # STDOUT & STDERR:
7: # hannel_in 20
7: # 
7: # don't match:
7: # (?^u:(Blocking SIGCHLD|channel_out))
7: # as expected
7: # Looks like you failed 1 test of 12.
7: [10:54:25] ./29-backend-driver.t .................... 
7: Dubious, test returned 1 (wstat 256, 0x100)

Locally I can reproduce the error. With https://github.com/okurz/scripts/blob/master/count_fail_ratio, runs=1000 count_fail_ratio prove -I. t/29-backend-driver.t failed (only) 6 times, i.e. a fail ratio of 0.6%. The failure happens in either line 24 or line 26, but at least it's fast and reproducible :)
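
For context, a rough sketch of the kind of check that fails (reconstructed from the failure output above; the exact constructor call is an assumption and the real code in t/29-backend-driver.t may differ):

#!/usr/bin/perl
use Test::More;
use strict;
use warnings;
use Test::Output qw(combined_from);
use backend::driver;

# Capture everything the driver logs while it is created, then assert on
# the expected phrases (the same regexes that show up in the failure above).
my $out = combined_from { backend::driver->new('null') };
like $out, qr/Blocking SIGCHLD|channel_out/, 'log output for backend driver creation';
like $out, qr/channel_out.+channel_in/s, 'driver channels are logged';
done_testing;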

Actions #1

Updated by okurz about 2 years ago

  • Description updated (diff)
Actions #2

Updated by okurz about 2 years ago

  • Description updated (diff)
Actions #3

Updated by livdywan about 2 years ago

  • Subject changed from [sporadic][os-autoinst] t/29-backend-driver.t fails in "log output for backend driver creation" to [sporadic][os-autoinst] t/29-backend-driver.t fails in "log output for backend driver creation" size:M
  • Status changed from New to Workable
Actions #4

Updated by kraih about 2 years ago

  • Assignee set to kraih
Actions #5

Updated by kraih about 2 years ago

  • Status changed from Workable to In Progress
Actions #6

Updated by openqa_review about 2 years ago

  • Due date set to 2022-04-08

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by okurz almost 2 years ago

  • Description updated (diff)

Discussed in the weekly unblock. For kraih it was reproduced in about 1 out of 10,000 runs. Restricting the runtime environment artificially, e.g. limiting it to one core, and running other things in the background seems to have increased the chance of failure.

We suggest that all team members try to reproduce with the above-mentioned command to see if we find an environment where it's more prone to fail. How about nice -n 19 ionice -c 3 ….
We still think that a potentially important issue is lurking there, so we are not (yet) ready to just disarm the test.

Actions #9

Updated by kraih almost 2 years ago

I still can't replicate the issue reliably, but I have a suspicion what it could be. The driver class dupes STDOUT/STDERR before and after forking a process, and that might be messing with the output buffers, perhaps with control characters from the log color highlighting overlapping in an unfortunate way. If that's the case, enabling autoflush on the log handle might resolve it. https://github.com/os-autoinst/os-autoinst/compare/k/backend_race
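
For illustration, a minimal sketch of that idea (an assumption on my part, not necessarily what the branch above does): make the duplicated log handle unbuffered so no partially written line can be flushed on both sides of the fork.

use strict;
use warnings;
use IO::Handle;

# Dup STDERR the way the driver does, then enable autoflush on the copy so
# every print is written out immediately instead of sitting in a buffer.
open(my $log_dup, '>&', \*STDERR) or die "cannot dup STDERR: $!";
$log_dup->autoflush(1);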

Actions #10

Updated by kraih almost 2 years ago

  • Status changed from In Progress to Feedback
Actions #11

Updated by kraih almost 2 years ago

  • Status changed from Feedback to Resolved

Ran the test 50k times with the change applied and got zero errors. So, fingers crossed that it's actually resolved.

Actions #12

Updated by okurz almost 2 years ago

  • Subject changed from [sporadic][os-autoinst] t/29-backend-driver.t fails in "log output for backend driver creation" size:M to [sporadic][os-autoinst] t/29-backend-driver.t fails in "log output for backend driver creation"
  • Status changed from Resolved to New
  • Assignee deleted (kraih)

That's annoying but it reappeared:

https://github.com/os-autoinst/os-autoinst/runs/5870487370?check_suite_focus=true#step:3:642

showing

7: #   Failed test 'log output for backend driver creation'
7: #   at ./29-backend-driver.t line 25.
7: # STDOUT & STDERR:
7: # 1, channel_in 20
7: # 
7: # don't match:
7: # (?^u:channel_out.+channel_in)
7: # as expected
Actions #13

Updated by okurz almost 2 years ago

  • Due date deleted (2022-04-08)
Actions #14

Updated by tinita almost 2 years ago

  • Status changed from New to In Progress
  • Assignee set to tinita
Actions #15

Updated by tinita almost 2 years ago

No new insights so far, but a smaller reproducer:

# runs=100 ./count_fail_ratio prove -I. t/29.t
% cat t/29.t
#!/usr/bin/perl
use Test::More;
use strict;
use warnings;
use Test::Output qw(combined_from stderr_from stdout_from);
use Mojo::IOLoop::ReadWriteProcess 'process';

for (1..30) {
    my ($out, $err);
    $out = stdout_from {
        $err = stderr_from { start() };
    };
#    diag "ERR: >>$err<<";
#    like $err, qr/Blocking SIGCHLD and SIGTERM/;
    like $err, qr/DEBUG.*channel_out.+channel_in/s, 'log output for backend driver creation'
        or do {
            diag "!!!!!!!!!!!! loop($_) OUT: >>$out<<";
            sleep 1;
        };
}

done_testing;

sub start {
    open(my $STDOUTPARENT, '>&', *STDOUT);
    open(my $STDERRPARENT, '>&', *STDERR);

    my $backend_process = process(
        sleeptime_during_kill => .1,
        total_sleeptime_during_kill => 30,
        max_kill_attempts => 1,
        kill_sleeptime => 0,
        blocking_stop => 1,
        separate_err => 0,
        subreaper => 1,
        code => sub {
            my $process = shift;
            open STDOUT, ">&", $STDOUTPARENT;
            open STDERR, ">&", $STDERRPARENT;
            print STDERR "Blocking SIGCHLD and SIGTERM\n"; # <-------- removing this makes it go away
            select undef, undef, undef, 0.05;
        });
    $backend_process->start;
    print STDERR "$$: DEBUG (STDERR)\nchannel_out channel_in\n";
    #print "$$: DEBUG (STDERR)\nchannel_out channel_in\n";
}

It's hard to tell how often it happens. It seems to me like I have to run it a few times, and then it starts happening more often.

The print to STDERR in the subprocess has something to do with it: parent and child seem to write to the same filehandle.

edit: also happens with a simple fork, without Mojo::IOLoop::ReadWriteProcess

Actions #16

Updated by openqa_review almost 2 years ago

  • Due date set to 2022-04-23

Setting due date based on mean cycle time of SUSE QE Tools

Actions #17

Updated by okurz almost 2 years ago

tinita wrote:

No new insights so far, but a smaller reproducer

This is awesome news!

It's hard to tell how often it happens. It seems to me like I have to run it a few times, and then it starts happening more often.

I also ran it locally with runs=100 count_fail_ratio prove and got "## count_fail_ratio: Run: 100. Fails: 13. Fail ratio 13%". I also observed two consecutive failures happening, but I would still not consider that significant.

print STDERR "Blocking SIGCHLD and SIGTERM\n"; # <-------- removing this makes it go away

I can confirm. So what would you say about just removing the print lines completely? Please see
https://github.com/os-autoinst/os-autoinst/pull/2022
and tell me what you think.

EDIT: So far 5000 runs of the above reproducer with the print line removed, without any failures.

Actions #18

Updated by okurz almost 2 years ago

  • Parent task set to #109740
Actions #19

Updated by tinita almost 2 years ago

Currently I would guess (hope) that this is only a problem when capturing output. The Capture::Tiny docs (the module Test::Output uses for capturing) clearly state that capturing output from parent and forked processes is not reliable.

Before taking any action I would first try to run the test without capturing, redirect the output into a file, and check whether the output there is complete.

If it's just a testing problem, the simplest solution could be to mock the diag function to collect the output in an array.

Actions #20

Updated by okurz almost 2 years ago

tinita wrote:

If it's just a testing problem, the simplest solution could just mock the diag function to collect output in an array.

Yes, that sounds like a good idea as well.

Actions #21

Updated by tinita almost 2 years ago

I double checked with this script:

#!/usr/bin/perl
use Test::More;
use strict;
use warnings;
use Test::Output qw(stderr_from stdout_from);

for (1..50) {
    my ($out, $err);
    if ($ENV{CAPTURE}) {
        # capture STDERR
        $out = stdout_from {
            $err = stderr_from { start() };
        };
        like $err, qr/DEBUG.*channel_out.+channel_in/s, 'log output for backend driver creation';
    }
    else {
        # Only capture STDOUT
        $out = stdout_from { $err = start() };
    }
}
pass "ok";
done_testing;

sub start {
    open(my $STDOUTPARENT, '>&', *STDOUT);
    open(my $STDERRPARENT, '>&', *STDERR);
    my $child = sub {
        open STDOUT, ">&", $STDOUTPARENT;
        open STDERR, ">&", $STDERRPARENT;
        for (1 .. 300) {
            print STDERR "($$) (loop $_) Blocking SIGCHLD and SIGTERM\n";
            select undef, undef, undef, 0.001;
        }
    };
    my $pid = fork();
    if ($pid) {}
    else { $child->(); exit; }
    print STDERR "(parent $$) DEBUG STDERR\nchannel_out channel_in\n";
}
% CAPTURE=1 runs=100 ./count_fail_ratio prove -I. t/29.t
...
## Run 89
t/29.t .. 1/?
#   Failed test 'log output for backend driver creation'
#   at t/29.t line 14.
#                   'l_in
# '
#     doesn't match '(?^s:DEBUG.*channel_out.+channel_in)'
# Looks like you failed 1 test of 51.
t/29.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/51 subtests
...
## count_fail_ratio: Run: 100. Fails: 7. Fail ratio 7%
% runs=100 ./count_fail_ratio prove -I. t/29.t >log 2>&1
% grep "DEBUG STDERR" log | wc -l
5000
% grep "channel_out channel_in" log | wc -l
5000

I ran this a few times, and when not capturing, all of the output is there as expected.
So I assume this is purely a capturing problem.

I will try to rewrite the test to mock diag and fctinfo.
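
A minimal sketch of what that rewrite could look like (my assumption only; "log" below is a placeholder for whichever module actually provides diag and fctinfo, and the final test will differ):

use Test::More;
use Test::MockModule;

# Collect log lines in an array instead of capturing STDERR, so the forked
# backend process can no longer interfere with what the assertion sees.
my @log_lines;
my $mock_log = Test::MockModule->new('log', no_auto => 1);    # no_auto: don't try to load the placeholder
$mock_log->mock($_ => sub { push @log_lines, join(' ', @_) }) for qw(diag fctinfo);

# ... create and start the backend driver as before ...

like join("\n", @log_lines), qr/channel_out.+channel_in/s,
    'log output for backend driver creation';
done_testing;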

Actions #22

Updated by tinita almost 2 years ago

  • Status changed from In Progress to Feedback
Actions #23

Updated by okurz almost 2 years ago

  • Due date deleted (2022-04-23)
  • Status changed from Feedback to Resolved

Nice! Your PR has been merged in the meantime as well. I think you applied a very diligent and properly statistics-driven approach.
