action #53999

open

openqa-worker (isotovideo) dumps core / segfaults quite often on several workers and distributions for cancelled jobs

Added by nicksinger over 5 years ago. Updated 2 months ago.

Status: Workable
Priority: Low
Assignee: -
Category: Regressions/Crashes
Target version:
Start date: 2019-07-09
Due date:
% Done: 0%
Estimated time:

Description

Observation

From openqaworker13 (running Leap 15.0):

           PID: 18220 (/usr/bin/isotov)
           UID: 483 (_openqa-worker)
           GID: 65534 (nogroup)
        Signal: 11 (SEGV)
     Timestamp: Mon 2019-07-08 16:10:19 CEST (18h ago)
  Command Line: /usr/bin/isotovideo: backen
    Executable: /usr/bin/perl
 Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@4.service
          Unit: openqa-worker@4.service
         Slice: openqa-worker.slice
       Boot ID: 1234ca1e5b18422d89f258275208b14f
    Machine ID: 625985c3f939414a1676d1d05a732110
      Hostname: openqaworker13
       Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.483.1234ca1e5b18422d89f258275208b14f.18220.1562595019000000.lz4
       Message: Process 18220 (/usr/bin/isotov) of user 483 dumped core.

           PID: 14766 (/usr/bin/isotov)
           UID: 483 (_openqa-worker)
           GID: 65534 (nogroup)
        Signal: 6 (ABRT)
     Timestamp: Mon 2019-07-08 17:33:40 CEST (17h ago)
  Command Line: /usr/bin/isotovideo: backen
    Executable: /usr/bin/perl
 Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@9.service
          Unit: openqa-worker@9.service
         Slice: openqa-worker.slice
       Boot ID: 1234ca1e5b18422d89f258275208b14f
    Machine ID: 625985c3f939414a1676d1d05a732110
      Hostname: openqaworker13
       Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.483.1234ca1e5b18422d89f258275208b14f.14766.1562600020000000.lz4
       Message: Process 14766 (/usr/bin/isotov) of user 483 dumped core.

           PID: 14989 (/usr/bin/isotov)
           UID: 483 (_openqa-worker)
           GID: 65534 (nogroup)
        Signal: 6 (ABRT)
     Timestamp: Mon 2019-07-08 17:34:05 CEST (17h ago)
  Command Line: /usr/bin/isotovideo: backen
    Executable: /usr/bin/perl
 Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@6.service
          Unit: openqa-worker@6.service
         Slice: openqa-worker.slice
       Boot ID: 1234ca1e5b18422d89f258275208b14f
    Machine ID: 625985c3f939414a1676d1d05a732110
      Hostname: openqaworker13
       Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.483.1234ca1e5b18422d89f258275208b14f.14989.1562600045000000.lz4
       Message: Process 14989 (/usr/bin/isotov) of user 483 dumped core.

Across all our workers we have 2070 coredumps of the same process, so I fear we're missing an important bug here.
As you can see, the signals are mixed, but most of them are segfaults (2077 segfaults, 14 aborts across all our workers).
You can find one of these coredumps here: http://files.glados.qa.suse.de/example_coredump.tar.xz (unfortunately it is too big to upload to progress directly), or get one yourself with coredumpctl dump $DESIRED_PID -o outfile.dump.
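
The per-signal split quoted above can be tallied from the "Signal:" lines of coredumpctl info output. This is a minimal sketch, assuming every entry carries a line in the form shown in the excerpts above (e.g. "Signal: 11 (SEGV)"); the function name count_signals is hypothetical and not part of any existing tooling.

```shell
# Hypothetical helper: count coredumps per signal name.
# Reads `coredumpctl info`-style text on stdin and assumes each entry
# contains a line such as "Signal: 11 (SEGV)".
count_signals() {
    # Extract the signal name inside the parentheses, then count duplicates
    # and print one "NAME COUNT" pair per line.
    sed -n 's/^[[:space:]]*Signal:[[:space:]]*[0-9][0-9]*[[:space:]]*(\([A-Z]*\)).*/\1/p' \
        | sort | uniq -c | awk '{ print $2, $1 }'
}
```

Possible usage on a single worker: coredumpctl info /usr/bin/perl | count_signals. Note that matching on the executable counts every Perl coredump on that host, not only those from isotovideo.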

Steps to reproduce

On OSD: sudo salt -l error -C 'G@roles:worker' cmd.run 'coredumpctl list'. The same happens on o3.

What seems to work to reproduce the problem locally:

  • create a simple vars.json file, e.g. get one from openqa.opensuse.org
  • start a test with isotovideo -d and let it run at least as far as calling the first functions from testapi, e.g. "assert_screen"
  • kill the "autotest" process, e.g. with pkill -f autotest
  • check if there is a core dump recorded, e.g. coredumpctl --since=today
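
The steps above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function name, the directory argument, and the fixed wait before killing "autotest" are all hypothetical, and it assumes isotovideo picks up vars.json from the directory it is started in. The wait time needed to reach the first testapi call depends on the test.

```shell
# Hypothetical helper wrapping the manual reproduction steps.
# $1: directory that already contains vars.json
# $2: seconds to wait before killing autotest (assumed default: 60)
reproduce_isotovideo_crash() {
    local pool_dir=$1
    local wait_seconds=${2:-60}

    # Step 1: vars.json has to be prepared beforehand.
    if [ ! -f "$pool_dir/vars.json" ]; then
        echo "missing $pool_dir/vars.json" >&2
        return 1
    fi

    # Step 2: start the test in debug mode in the background.
    (cd "$pool_dir" && exec isotovideo -d) &
    local isotovideo_pid=$!

    # Assumption: a fixed wait is long enough to reach the first testapi call.
    sleep "$wait_seconds"

    # Step 3: kill the "autotest" process.
    pkill -f autotest

    # Let isotovideo finish shutting down (or crashing); ignore its exit code.
    wait "$isotovideo_pid" || true

    # Step 4: check whether a core dump was recorded.
    coredumpctl --since=today
}
```

It would be called as, for example, reproduce_isotovideo_crash /path/to/testdir 120, where the path is a placeholder for a directory prepared as in the first step.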

Related issues 3 (1 open, 2 closed)

Related to openQA Project (public) - action #58379: isotovideo is slow to shutdown / error messages on proper shutdown (Resolved, okurz, 2019-10-04, 2020-04-14)

Related to openQA Project (public) - action #59926: test incompletes in middle of execution with auto_review:"Unexpected end of data 0":retry, system journal shows 'kernel: traps: /usr/bin/isotov[2300] general protection ip:7fd5ef11771e sp:7ffe066f2200 error:0 in libc-2.26.so[7fd5ef094000+1b1000]' (New, 2019-11-17)

Related to openQA Project (public) - action #60443: job incomplete with "(?s)process exited: 0.*isotovideo failed.*EXIT 1":retry but no further details what is wrong (Resolved, okurz, 2019-11-29)
