action #53999

openqa-worker (isotovideo) dumps core / segfaults quite often on several workers and distributions

Added by nicksinger 8 months ago. Updated 2 months ago.

Status: Workable
Start date: 09/07/2019
Priority: Low
Due date:
Assignee: -
% Done: 0%
Category: Concrete Bugs
Target version: Ready
Difficulty:
Duration:

Description

Observation

From openqaworker13 (running Leap 15.0):

           PID: 18220 (/usr/bin/isotov)
           UID: 483 (_openqa-worker)
           GID: 65534 (nogroup)
        Signal: 11 (SEGV)
     Timestamp: Mon 2019-07-08 16:10:19 CEST (18h ago)
  Command Line: /usr/bin/isotovideo: backen
    Executable: /usr/bin/perl
 Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@4.service
          Unit: openqa-worker@4.service
         Slice: openqa-worker.slice
       Boot ID: 1234ca1e5b18422d89f258275208b14f
    Machine ID: 625985c3f939414a1676d1d05a732110
      Hostname: openqaworker13
       Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.483.1234ca1e5b18422d89f258275208b14f.18220.1562595019000000.lz4
       Message: Process 18220 (/usr/bin/isotov) of user 483 dumped core.

           PID: 14766 (/usr/bin/isotov)
           UID: 483 (_openqa-worker)
           GID: 65534 (nogroup)
        Signal: 6 (ABRT)
     Timestamp: Mon 2019-07-08 17:33:40 CEST (17h ago)
  Command Line: /usr/bin/isotovideo: backen
    Executable: /usr/bin/perl
 Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@9.service
          Unit: openqa-worker@9.service
         Slice: openqa-worker.slice
       Boot ID: 1234ca1e5b18422d89f258275208b14f
    Machine ID: 625985c3f939414a1676d1d05a732110
      Hostname: openqaworker13
       Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.483.1234ca1e5b18422d89f258275208b14f.14766.1562600020000000.lz4
       Message: Process 14766 (/usr/bin/isotov) of user 483 dumped core.

           PID: 14989 (/usr/bin/isotov)
           UID: 483 (_openqa-worker)
           GID: 65534 (nogroup)
        Signal: 6 (ABRT)
     Timestamp: Mon 2019-07-08 17:34:05 CEST (17h ago)
  Command Line: /usr/bin/isotovideo: backen
    Executable: /usr/bin/perl
 Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@6.service
          Unit: openqa-worker@6.service
         Slice: openqa-worker.slice
       Boot ID: 1234ca1e5b18422d89f258275208b14f
    Machine ID: 625985c3f939414a1676d1d05a732110
      Hostname: openqaworker13
       Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.483.1234ca1e5b18422d89f258275208b14f.14989.1562600045000000.lz4
       Message: Process 14989 (/usr/bin/isotov) of user 483 dumped core.

Across all our workers we have 2070 coredumps of the same process, so I fear we are missing an important bug here.
As you can see, the signals are mixed, but most of them are segfaults (2077 segfaults, 14 aborts across all our workers).
One of these coredumps is available at http://files.glados.qa.suse.de/example_coredump.tar.xz (unfortunately it is too big to upload to progress directly). Alternatively, extract one yourself with coredumpctl dump $DESIRED_PID -o outfile.dump.
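
To re-check the split between segfaults and aborts, the "Signal:" lines of records in the format shown above can be tallied. The following is only a minimal sketch in Python: the function name count_signals and the idea of feeding it saved text are assumptions for illustration, not part of openQA or os-autoinst.

```python
import re
from collections import Counter

# Matches lines such as "        Signal: 11 (SEGV)" and captures the signal name.
SIGNAL_LINE = re.compile(r"^\s*Signal:\s*\d+\s*\((\w+)\)", re.MULTILINE)


def count_signals(info_text):
    """Count coredump records per signal name in text formatted like the records above."""
    return Counter(SIGNAL_LINE.findall(info_text))


# Shortened copy of the three records quoted in this ticket.
sample = """
           PID: 18220 (/usr/bin/isotov)
        Signal: 11 (SEGV)
           PID: 14766 (/usr/bin/isotov)
        Signal: 6 (ABRT)
           PID: 14989 (/usr/bin/isotov)
        Signal: 6 (ABRT)
"""

if __name__ == "__main__":
    for name, count in count_signals(sample).most_common():
        print(name, count)
```

Counting by signal name rather than by number keeps the output readable and does not rely on signal numbering being the same on every platform.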

Steps to reproduce

On OSD: sudo salt -l error -C 'G@roles:worker' cmd.run 'coredumpctl list'. The same also happens on o3.

To reproduce locally, the following seems to work:

  • create a simple vars.json file, e.g. one obtained from openqa.opensuse.org
  • start a test with isotovideo -d and let it run at least as far as calling the first functions from testapi, e.g. "assert_screen"
  • kill the "autotest" process, e.g. with pkill -f autotest
  • check if there is a core dump recorded, e.g. coredumpctl --since=today

Related issues

Related to openQA Project - action #58379: isotovideo is slow to shutdown / error messages on proper... Workable 04/10/2019

History

#1 Updated by coolo 8 months ago

OpenCV starts threads for hardware detection (even if they are not used). On exit these threads hit a Perl interpreter that is unprepared for them and already gone, so you get a SIGSEGV. You don't hear about this because isotovideo has long since finished and written a successful test report. It is just the backend process, which is a separate process for good reason, that is crashing.

#2 Updated by coolo 8 months ago

Without looking into it, my suspicion is this backtrace: https://github.com/os-autoinst/os-autoinst/pull/1032#discussion_r225661226

#3 Updated by okurz 8 months ago

  • Category set to Concrete Bugs

#4 Updated by okurz 8 months ago

I guess mkittler is yearning to dive into C/C++ so I would not immediately dismiss this ticket ;)

#5 Updated by mkittler 8 months ago

  • Category deleted (Concrete Bugs)

I wouldn't dismiss the ticket in general although the priority is likely quite low.

Not sure how to reproduce this. I assume it only happens when one of the OpenCV threads, rather than the Perl interpreter thread, receives the TERM signal.

@kraih Maybe you know how this is usually handled? I can't imagine that we're the first to use a native library from Perl code that may spawn other threads.

#6 Updated by mkittler 8 months ago

  • Category set to Concrete Bugs

Sorry for deleting the category.

By the way, if I understand the issue and the documentation of sigprocmask correctly, it is actually no surprise that the current attempt to fix it does not work:

            use POSIX ':signal_h';
            my $sigset = POSIX::SigSet->new(SIGTERM);
            unless (defined sigprocmask(SIG_BLOCK, $sigset, undef)) {
                die "Could not block SIGTERM\n";
            }
            require tinycv;

            sigprocmask(SIG_UNBLOCK, $sigset, undef);

This would only block SIGTERM in the current thread (which is the Perl thread) but not in OpenCV's threads. It also unblocks the signal again immediately after the require. But we would need to block the OpenCV threads from receiving SIGTERM, right?
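
One POSIX detail bears on this question: a new thread starts with a copy of its creator's signal mask, and later changes to the creator's mask do not reach threads that already exist. The sketch below shows that behaviour in Python (used purely for illustration, on a POSIX system; it is not the os-autoinst code). The names early_thread and late_thread are hypothetical stand-ins. Whether OpenCV creates its threads during require tinycv or only later is not established in this ticket.

```python
import signal
import threading


def sigterm_blocked_here():
    # SIG_BLOCK with an empty set changes nothing and returns the calling thread's mask.
    return signal.SIGTERM in signal.pthread_sigmask(signal.SIG_BLOCK, [])


def demo():
    results = {}
    creator_unblocked = threading.Event()

    def early_thread():
        # Created while SIGTERM is blocked; checks its own mask only after
        # the creating thread has unblocked SIGTERM again.
        creator_unblocked.wait()
        results["early"] = sigterm_blocked_here()

    def late_thread():
        results["late"] = sigterm_blocked_here()

    # Same order as the Perl snippet: block, create threads, unblock.
    signal.pthread_sigmask(signal.SIG_BLOCK, [signal.SIGTERM])
    early = threading.Thread(target=early_thread)
    early.start()  # stand-in for a thread spawned while the library loads
    signal.pthread_sigmask(signal.SIG_UNBLOCK, [signal.SIGTERM])
    creator_unblocked.set()

    late = threading.Thread(target=late_thread)
    late.start()  # stand-in for a thread spawned lazily, after the unblock
    early.join()
    late.join()
    results["creator"] = sigterm_blocked_here()
    return results


if __name__ == "__main__":
    print(demo())
```

In this sketch only the thread created between the block and the unblock keeps SIGTERM blocked. Under that assumption, any native thread created after the unblock would still be eligible to receive a process-directed SIGTERM.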

#7 Updated by coolo 6 months ago

  • Priority changed from Normal to Low
  • Target version set to Ready

While the core dumps are annoying in the log, they only affect the shutdown of jobs that are cancelled. So there is no effect on our results, at most on our performance due to the extra core files.

#8 Updated by okurz 3 months ago

  • Related to action #58379: isotovideo is slow to shutdown / error messages on proper shutdown added

#9 Updated by okurz 2 months ago

  • Description updated (diff)
  • Status changed from New to Workable

#10 Updated by okurz 2 months ago

  • Subject changed from openqa-worker (isotovideo) dumps core quite often on several workers and distributions to openqa-worker (isotovideo) dumps core / segfaults quite often on several workers and distributions
