Project

General

Profile

action #108266

Updated by livdywan about 2 years ago

Since the NBG server room was moved, I'm seeing a lot of random script_run() command timeouts on grenache. I suspect network issues. 
 https://openqa.suse.de/tests/8320677#step/sighold02/12 
 https://openqa.suse.de/tests/8294410#step/fallocate06/8 
 https://openqa.suse.de/tests/8294334#step/boot_ltp/42 

 ``` 
      Test died: command 'vmstat -w' timed out at /usr/lib/os-autoinst/testapi.pm line 1039. 
      # Test died: Timed out waiting for LTP test case which may still be running or the OS may have crashed! at sle/tests/kernel/run_ltp.pm line 337. 
 # Test died: command 'rpm -qi kernel-default > /tmp/kernel-pkg.txt 2>&1' timed out at /usr/lib/os-autoinst/testapi.pm line 1039. 

             main::init_backend() called at /usr/bin/isotovideo line 258 
 [2022-03-09T16:12:24.052826+01:00] [info] ::: consoles::serial_screen::read_until: Matched output from SUT in 1 loops & 0.00229895696975291 seconds: Use of uninitialized value $regexp in concatenation (.) or string at /usr/lib/os-autoinst/testapi.pm line 927. 
         testapi::wait_serial(undef, undef, 0, "no_regex", 1) called at sle/tests/kernel/run_ltp.pm line 317 
         run_ltp::run(run_ltp=HASH(0x1001999aee8), LTP::TestInfo=HASH(0x1001b24d630)) called at /usr/lib/os-autoinst/basetest.pm line 356 
        
         cf. last good 
        
         [2022-03-12T07:06:13.797172+01:00] [info] ::: consoles::serial_screen::read_until: Matched output from SUT in 1 loops & 0.00224426796194166 seconds:  
 Use of uninitialized value $regexp in concatenation (.) or string at /usr/lib/os-autoinst/testapi.pm line 927. 
         testapi::wait_serial(undef, undef, 0, "no_regex", 1) called at sle/tests/kernel/run_ltp.pm line 317 
         run_ltp::run(run_ltp=HASH(0x1003570fb08), LTP::TestInfo=HASH(0x1003547afa8)) called at /usr/lib/os-autoinst/basetest.pm line 356 
         eval {...} called at /usr/lib/os-autoinst/basetest.pm line 354 
        
 ``` 

 ## Suggestions 
 - Look into warnings in logs 
 - Investigate undefined regex args in wait_serial 
 - last good: https://openqa.suse.de/tests/8315985 
 - [debug] Current version is 4.6.1647014989.7540333c [interface v25] 
   * Do `git log --no-merges 7540333c..$first_bad` 
  
   - We don't have a screenshot to compare the serial output to 
   - Maybe we can check the serial logs for comparison? 
  
 All these occurences are on the same machine, which is s390x-kvm-sle12 

 One problem I see is that in https://openqa.suse.de/tests/8505116#step/shutdown_ltp/6 we have a serial terminal. If there would be VNC we would be able to see if the command was executed or not. I also don't see the commands in https://openqa.suse.de/tests/8505116/logfile?filename=serial_terminal.txt nor serial0.txt . 

 We should try to resolve the ambiguity if commands just never write to the serial terminal as they time out or if actual data is going missing from SUT to worker. 

 What would you say, what is the best way to reproduce the issue? If we have a reproducer we can try to make it as small as possible and then fix it, maybe just increase the timeout. Maybe ensure that we cath any console related processes in the background if they are still responsive.

Back