action #53999

Updated by okurz almost 5 years ago

## Observation 

 From openqaworker13 (running Leap 15.0):  

 ``` 
            PID: 18220 (/usr/bin/isotov) 
            UID: 483 (_openqa-worker) 
            GID: 65534 (nogroup) 
         Signal: 11 (SEGV) 
      Timestamp: Mon 2019-07-08 16:10:19 CEST (18h ago) 
   Command Line: /usr/bin/isotovideo: backen 
     Executable: /usr/bin/perl 
  Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@4.service 
           Unit: openqa-worker@4.service 
          Slice: openqa-worker.slice 
        Boot ID: 1234ca1e5b18422d89f258275208b14f 
     Machine ID: 625985c3f939414a1676d1d05a732110 
       Hostname: openqaworker13 
        Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.483.1234ca1e5b18422d89f258275208b14f.18220.1562595019000000.lz4 
        Message: Process 18220 (/usr/bin/isotov) of user 483 dumped core. 

            PID: 14766 (/usr/bin/isotov) 
            UID: 483 (_openqa-worker) 
            GID: 65534 (nogroup) 
         Signal: 6 (ABRT) 
      Timestamp: Mon 2019-07-08 17:33:40 CEST (17h ago) 
   Command Line: /usr/bin/isotovideo: backen 
     Executable: /usr/bin/perl 
  Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@9.service 
           Unit: openqa-worker@9.service 
          Slice: openqa-worker.slice 
        Boot ID: 1234ca1e5b18422d89f258275208b14f 
     Machine ID: 625985c3f939414a1676d1d05a732110 
       Hostname: openqaworker13 
        Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.483.1234ca1e5b18422d89f258275208b14f.14766.1562600020000000.lz4 
        Message: Process 14766 (/usr/bin/isotov) of user 483 dumped core. 

            PID: 14989 (/usr/bin/isotov) 
            UID: 483 (_openqa-worker) 
            GID: 65534 (nogroup) 
         Signal: 6 (ABRT) 
      Timestamp: Mon 2019-07-08 17:34:05 CEST (17h ago) 
   Command Line: /usr/bin/isotovideo: backen 
     Executable: /usr/bin/perl 
  Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@6.service 
           Unit: openqa-worker@6.service 
          Slice: openqa-worker.slice 
        Boot ID: 1234ca1e5b18422d89f258275208b14f 
     Machine ID: 625985c3f939414a1676d1d05a732110 
       Hostname: openqaworker13 
        Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.483.1234ca1e5b18422d89f258275208b14f.14989.1562600045000000.lz4 
        Message: Process 14989 (/usr/bin/isotov) of user 483 dumped core. 
 ``` 

 Across all our workers we have 2070 coredumps of the same process, so I fear we are missing an important bug here. 
 As the excerpt above shows, the signals are mixed, but most of them are segfaults (2077 segfaults, 14 aborts across all our workers). 
 You can find one of these coredumps here: http://files.glados.qa.suse.de/example_coredump.tar.xz (unfortunately it is too big to upload to progress directly) or get one for yourself with `coredumpctl dump $DESIRED_PID -o outfile.dump`. 
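
 To get a per-signal tally like the one quoted above, something along these lines could work. This is only a sketch: it assumes the input is text in the same format as the excerpt in this ticket (one `Signal: <number> (<name>)` line per dump), and the function name `count_signals` is made up for illustration.

```shell
#!/bin/sh
# Tally core dumps per signal name from `coredumpctl info`-style text on stdin.
# Assumption: each dump contributes one line of the form
# "Signal: <number> (<name>)", as in the excerpt in this ticket.
count_signals() {
    awk '
        $1 == "Signal:" {
            name = $3
            gsub(/[()]/, "", name)   # "(SEGV)" -> "SEGV"
            count[name]++
        }
        END {
            for (name in count) print name, count[name]
        }
    ' | sort
}
```

 It could be fed from a worker with e.g. `coredumpctl info /usr/bin/perl | count_signals`, assuming the installed coredumpctl prints every matching entry for that match.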


 ## Steps to reproduce 

 On OSD, run `sudo salt -l error -C 'G@roles:worker' cmd.run 'coredumpctl list'`. The same also happens on o3. 

 To reproduce locally, the following seems to work: 

 * create a simple vars.json file, e.g. take one from openqa.opensuse.org 
 * start a test with `isotovideo -d` and let it run at least as far as calling the first functions from testapi, e.g. "assert_screen" 
 * kill the "autotest" process, e.g. with `pkill -f autotest` 
 * check if there is a core dump recorded, e.g. `coredumpctl --since=today`
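
 For the last step, comparing the number of recorded dumps before and after killing "autotest" makes it easier to see whether a new one appeared. A minimal sketch, assuming the dumps are listed under `/usr/bin/perl` (the "Executable" shown in the observation above); `count_perl_coredumps` is a hypothetical helper name:

```shell
#!/bin/sh
# Count the lines of `coredumpctl` list output (read from stdin) that
# mention /usr/bin/perl.
# Assumption: isotovideo dumps are listed with that executable, as in the
# "Executable:" field of the excerpt in this ticket.
count_perl_coredumps() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status clean in that case.
    grep -c '/usr/bin/perl' || true
}

# Intended use (not executed here):
#   before=$(coredumpctl --since=today | count_perl_coredumps)
#   pkill -f autotest
#   after=$(coredumpctl --since=today | count_perl_coredumps)
#   [ "$after" -gt "$before" ] && echo "new core dump recorded"
```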
