action #53999
openopenqa-worker (isotovideo) dumps core / segfaults quite often on several workers and distributions for cancelled jobs
0%
Description
Observation¶
From openqaworker13 (running Leap 15.0):
PID: 18220 (/usr/bin/isotov)
UID: 483 (_openqa-worker)
GID: 65534 (nogroup)
Signal: 11 (SEGV)
Timestamp: Mon 2019-07-08 16:10:19 CEST (18h ago)
Command Line: /usr/bin/isotovideo: backen
Executable: /usr/bin/perl
Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@4.service
Unit: openqa-worker@4.service
Slice: openqa-worker.slice
Boot ID: 1234ca1e5b18422d89f258275208b14f
Machine ID: 625985c3f939414a1676d1d05a732110
Hostname: openqaworker13
Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.483.1234ca1e5b18422d89f258275208b14f.18220.1562595019000000.lz4
Message: Process 18220 (/usr/bin/isotov) of user 483 dumped core.
PID: 14766 (/usr/bin/isotov)
UID: 483 (_openqa-worker)
GID: 65534 (nogroup)
Signal: 6 (ABRT)
Timestamp: Mon 2019-07-08 17:33:40 CEST (17h ago)
Command Line: /usr/bin/isotovideo: backen
Executable: /usr/bin/perl
Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@9.service
Unit: openqa-worker@9.service
Slice: openqa-worker.slice
Boot ID: 1234ca1e5b18422d89f258275208b14f
Machine ID: 625985c3f939414a1676d1d05a732110
Hostname: openqaworker13
Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.483.1234ca1e5b18422d89f258275208b14f.14766.1562600020000000.lz4
Message: Process 14766 (/usr/bin/isotov) of user 483 dumped core.
PID: 14989 (/usr/bin/isotov)
UID: 483 (_openqa-worker)
GID: 65534 (nogroup)
Signal: 6 (ABRT)
Timestamp: Mon 2019-07-08 17:34:05 CEST (17h ago)
Command Line: /usr/bin/isotovideo: backen
Executable: /usr/bin/perl
Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@6.service
Unit: openqa-worker@6.service
Slice: openqa-worker.slice
Boot ID: 1234ca1e5b18422d89f258275208b14f
Machine ID: 625985c3f939414a1676d1d05a732110
Hostname: openqaworker13
Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.483.1234ca1e5b18422d89f258275208b14f.14989.1562600045000000.lz4
Message: Process 14989 (/usr/bin/isotov) of user 483 dumped core.
Over all our workers we've 2070 coredumps of the same process so I fear we're missing an important bug here.
As you can see, the signals are mixed but most of them segfault (2077 segfaults, 14 aborts over all our workers).
You can find one of these coredumps here: http://files.glados.qa.suse.de/example_coredump.tar.xz (unfortunately it is too big to upload to progress directly) or get one for yourself with coredumpctl dump $DESIRED_PID -o outfile.dump
.
Steps to reproduce¶
On OSD: sudo salt -l error -C 'G@roles:worker' cmd.run 'coredumpctl list'
. The same also happens on o3.
To reproduce locally what seems to work:
- create a simple vars.json file, e.g. get from openqa.opensuse.org
- start test with
isotovideo -d
which at least goes as far as calling the first functions from testapi, e.g. "assert_screen" - kill the "autotest" process, e.g. with
pkill -f autotest
- check if there is a core dump recorded, e.g.
coredumpctl --since=today
Updated by coolo over 5 years ago
opencv starts threads for hardware detection (even if not used) and on exit these threads hit an unprepared perl interpreter that is long gone and so you get a SIGSEGV. You don't hear about this as isotovideo is long gone and wrote a successful test report. It's just the backend process, that is for good reason a seperate process, that is crashing.
Updated by coolo over 5 years ago
Without looking int it, my suspicion is this backtrace: https://github.com/os-autoinst/os-autoinst/pull/1032#discussion_r225661226
Updated by okurz over 5 years ago
I guess mkittler is yearning to dive into C/C++ so I would not immediately dismiss this ticket ;)
Updated by mkittler over 5 years ago
- Category deleted (
Regressions/Crashes)
I wouldn't dismiss the ticket in general although the priority is likely quite low.
Not sure how to reproduce this. I assume it only happens when one of the OpenCV threads receives the TERM signal and not the Perl interpreter thread.
@kraih Maybe you know how this is usually handled? I can't imagine that we're the first who try to use a native library from Perl code which possibly spawns other threads.
Updated by mkittler over 5 years ago
- Category set to Regressions/Crashes
Sorry for deleting the category.
By the way, when I understand the issue and the documentation of sigprocmask
correctly it is actually not a surprise why the current attempt to fix it does not work:
use POSIX ':signal_h';
my $sigset = POSIX::SigSet->new(SIGTERM);
unless (defined sigprocmask(SIG_BLOCK, $sigset, undef)) {
die "Could not block SIGTERM\n";
}
require tinycv;
sigprocmask(SIG_UNBLOCK, $sigset, undef);
This would only block the current thread (which is the Perl thread) but not OpenCV's threads. And it also immediately unblocks again after the require
. But we needed to block the OpenCV threads from receiving SIGTERM
, right?
Updated by coolo over 5 years ago
- Priority changed from Normal to Low
- Target version set to Ready
While the core dumps are annoying in the log, it only affects shutdown in jobs that are cancelled. So no effect on our results - at most at our performance due to extra core files.
Updated by okurz about 5 years ago
- Related to action #58379: isotovideo is slow to shutdown / error messages on proper shutdown added
Updated by okurz about 5 years ago
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz about 5 years ago
- Subject changed from openqa-worker (isotovideo) dumps core quite often on several workers and distributions to openqa-worker (isotovideo) dumps core / segfaults quite often on several workers and distributions
Updated by okurz almost 5 years ago
Current journal entry on openqaworker7.o.o:
Mar 01 03:03:04 openqaworker7 systemd[1]: Started Process Core Dump (PID 402/UID 0).
Mar 01 03:03:05 openqaworker7 kernel: perl[35388]: segfault at 514 ip 00005570617a66a0 sp 00007f1af218c830 error 4 in perl[5570616e6000+1f9000]
Mar 01 03:03:05 openqaworker7 kernel: perl[35389]: segfault at 514 ip 00005570617a66a0 sp 00007f1af198b830 error 4
Mar 01 03:03:05 openqaworker7 systemd[1]: Started Process Core Dump (PID 404/UID 0).
Mar 01 03:03:05 openqaworker7 systemd-coredump[405]: Failed to generate stack trace: (null)
Mar 01 03:03:05 openqaworker7 systemd-coredump[405]: Process 35387 (/usr/bin/isotov) of user 486 dumped core.
Mar 01 03:03:05 openqaworker7 worker[3986]: [info] Isotovideo exit status: 1
Mar 01 03:03:06 openqaworker7 systemd-coredump[403]: Process 35407 (/usr/bin/isotov) of user 486 dumped core.
Stack trace of thread 35407:
#0 0x00007f1afdaeb160 raise (libc.so.6)
#1 0x00007f1afdaec741 abort (libc.so.6)
#2 0x00007f1afdb2e4a7 __libc_message (libc.so.6)
#3 0x00007f1afdb34cc3 malloc_printerr (libc.so.6)
#4 0x00007f1afdb351a2 malloc_consolidate (libc.so.6)
#5 0x00007f1afdb380fd _int_malloc (libc.so.6)
#6 0x00007f1afdb38c6b _int_memalign (libc.so.6)
#7 0x00007f1afdb39daf _mid_memalign (libc.so.6)
#8 0x00007f1afdb3b29a posix_memalign (libc.so.6)
#9 0x00007f1af6e87bfe _ZN2cv10fastMallocEm (libopencv_core.so.3.3)
#10 0x00007f1af6f9a72d _ZN2cv3Mat6createEiPKii (libopencv_core.so.3.3)
#11 0x00007f1af6f77253 n/a (libopencv_core.so.3.3)
#12 0x00007f1af730dfdc _Z15search_TEMPLATEPK5ImageS1_lllllRd (tinycv.so)
#13 0x00007f1af730ec69 _Z12image_searchP5ImageS0_lllllRd (tinycv.so)
#14 0x00007f1af7303da0 n/a (tinycv.so)
#15 0x00005570617c5781 Perl_pp_entersub (perl)
#16 0x00005570617bd5a6 Perl_runops_standard (perl)
#17 0x000055706173e7d7 perl_run (perl)
#18 0x0000557061715ef2 main (perl)
#19 0x00007f1afdad5f8a __libc_start_main (libc.so.6)
#20 0x0000557061715f3a _start (perl)
So similar to what #53999#note-1 describes. Can we ask the opencv/tinycv threads to stop before we get rid of the rest of the stack?
Updated by okurz over 4 years ago
- Related to action #59926: test incompletes in middle of execution with auto_review:"Unexpected end of data 0":retry, system journal shows 'kernel: traps: /usr/bin/isotov[2300] general protection ip:7fd5ef11771e sp:7ffe066f2200 error:0 in libc-2.26.so[7fd5ef094000+1b1000]' added
Updated by okurz over 4 years ago
- Related to action #60443: job incomplete with "(?s)process exited: 0.*isotovideo failed.*EXIT 1":retry but no further details what is wrong added
Updated by okurz over 4 years ago
- Subject changed from openqa-worker (isotovideo) dumps core / segfaults quite often on several workers and distributions to openqa-worker (isotovideo) dumps core / segfaults quite often on several workers and distributions for cancelled jobs
Updated by okurz over 4 years ago
I just managed to reproduce a lot of core dumps by calling systemctl stop openqa-worker.target
on openqaworker3 while multiple openQA worker instances were executing openQA jobs. Good to know how to reproduce :)
Updated by rfan1 over 4 years ago
okurz wrote:
I just managed to reproduce a lot of core dumps by calling
systemctl stop openqa-worker.target
on openqaworker3 while multiple openQA worker instances were executing openQA jobs. Good to know how to reproduce :)
Glad to know.
One quick question here, Oliver, once leap15.2 is out, can we install openqa software upon it?
BR//Richard.
Updated by mkittler over 4 years ago
To follow up on my previous comment: The assumption of the current attempt to fix this issue (mentioned in my previous comment) is likely that all additional threads created by the OpenCV library are created when require tinycv;
is executed so all additional threads would therefore inherit the mask. Considering that it does not work this assumption is likely not correct.
Can we ask the opencv/tinycv threads to stop before we get rid of the rest of the stack?
Maybe OpenCV provides a way to tear down itself gracefully. However, I'm wondering whether this would help fixing the problem with the signals which is outlined in a comment:
# opencv forks a lot of threads and the TERM signal we
# may get from the parent process would be delivered
# to an undefined thread. But as those threads do not
# have a perl interpreter, the perl signal handler
# (we set later) would crash. So we need to block
# the TERM signal in the forked processes before we
# set the signal handler of our choice
Also see my TL;DR conclusion.
I just managed to reproduce a lot of core dumps by calling systemctl stop openqa-worker.target on openqaworker3 while multiple openQA worker instances were executing openQA jobs.
The fact that we're currently hammering the process with SIGTERM signals via Mojo-IOLoop-ReadWriteProcess when stopping the job manually might be the reason why this way the chance to reproduce the issue is increased.
The archive with coredumps is not available anymore and the one mentioned by @okurz only shows the main thread.
So here a random example from openqaworker3:
PID: 27271 (/usr/bin/isotov)
UID: 489 (_openqa-worker)
GID: 65534 (nogroup)
Signal: 11 (SEGV)
Timestamp: Tue 2020-06-30 08:31:08 CEST (5h 43min ago)
Command Line: /usr/bin/isotovideo: autote
Executable: /usr/bin/perl
Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@5.service
Unit: openqa-worker@5.service
Slice: openqa-worker.slice
Boot ID: 44c538291e894bca9c65ae533c5419e7
Machine ID: d77a423928f046d6b0396c85a3b83ffd
Hostname: openqaworker3
Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.489.44c538291e894bca9c65ae533c5419e7.27271.1593498668000000.lz4
Message: Process 27271 (/usr/bin/isotov) of user 489 dumped core.
Stack trace of thread 27272:
#0 0x000056402a0399e0 Perl_csighandler (perl)
#1 0x00007f620dcd92d0 __restore_rt (libpthread.so.0)
#2 0x00007f620dcd487d futex_wait_cancelable (libpthread.so.0)
#3 0x00007f620507620b n/a (libopenblas_pthreads.so.0)
#4 0x00007f620dcce4f9 start_thread (libpthread.so.0)
#5 0x00007f620da06f2f __clone (libc.so.6)
Stack trace of thread 27271:
#0 0x000056402a047355 S_hv_auxinit (perl)
#1 0x000056402a048475 Perl_hv_iterinit (perl)
#2 0x00007f620ab17d25 encode_hv (XS.so)
#3 0x00007f620ab1a1fe encode_sv (XS.so)
#4 0x00007f620ab1ba7a encode_av (XS.so)
#5 0x00007f620ab1a006 encode_sv (XS.so)
#6 0x00007f620ab182c8 encode_hv (XS.so)
#7 0x00007f620ab1a1fe encode_sv (XS.so)
#8 0x00007f620ab1ba7a encode_av (XS.so)
#9 0x00007f620ab1a006 encode_sv (XS.so)
#10 0x00007f620ab182c8 encode_hv (XS.so)
#11 0x00007f620ab1a1fe encode_sv (XS.so)
#12 0x00007f620ab1c725 encode_json (XS.so)
#13 0x00007f620ab1cac4 XS_Cpanel__JSON__XS_encode (XS.so)
#14 0x000056402a058ac1 Perl_pp_entersub (perl)
#15 0x000056402a0508e6 Perl_runops_standard (perl)
#16 0x0000564029fc9cbf Perl_call_sv (perl)
#17 0x000056402a03e195 Perl_sighandler (perl)
#18 0x000056402a039ba7 Perl_despatch_signals (perl)
#19 0x000056402a0513ae Perl_pp_cond_expr (perl)
#20 0x000056402a0508e6 Perl_runops_standard (perl)
#21 0x0000564029fd17d7 S_run_body (perl)
#22 0x0000564029fa8ef2 main (perl)
#23 0x00007f620d92f34a __libc_start_main (libc.so.6)
#24 0x0000564029fa8f3a _start (perl)
Stack trace of thread 27274:
#0 0x00007f620dcd487d futex_wait_cancelable (libpthread.so.0)
#1 0x00007f620507620b n/a (libopenblas_pthreads.so.0)
#2 0x00007f620dcce4f9 start_thread (libpthread.so.0)
#3 0x00007f620da06f2f __clone (libc.so.6)
There are 13 more threads with the same stack trace as the last one. In this stack trace a thread from the internally used openblas library received a signal (likely SIGTERM) and therefore the signal handler registered by the Perl runtime is called within this thread which apparently caused the crash/SEGV. Most notably: It is not the backend which is crashing here, according to the command line it is the autotest process: Command Line: /usr/bin/isotovideo: autote
So the are annoyingly different types of crashes.
So let's look at another random crash:
PID: 17741 (/usr/bin/isotov)
UID: 489 (_openqa-worker)
GID: 65534 (nogroup)
Signal: 11 (SEGV)
Timestamp: Thu 2020-06-11 16:30:05 CEST (2 weeks 4 days ago)
Command Line: /usr/bin/isotovideo: backen
Executable: /usr/bin/perl
Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@15.service
Unit: openqa-worker@15.service
Slice: openqa-worker.slice
Boot ID: c4be57c72cd441e0b4f1ed20223781aa
Machine ID: d77a423928f046d6b0396c85a3b83ffd
Hostname: openqaworker3
Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.489.c4be57c72cd441e0b4f1ed20223781aa.17741.1591885805000000.lz4 (inaccessible)
Message: Process 17741 (/usr/bin/isotov) of user 489 dumped core.
Stack trace of thread 17838:
#0 0x0000559aa99a4680 Perl_csighandler (perl)
#1 0x00007f8a0b67c2d0 __restore_rt (libpthread.so.0)
#2 0x00007f8a0b3a47a9 syscall (libc.so.6)
#3 0x00007f8a02eda702 n/a (libtbb.so.2)
#4 0x00007f8a02eda729 n/a (libtbb.so.2)
#5 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#6 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17840:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17844:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17841:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17850:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17847:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17741:
#0 0x00007f8a0b3775a0 __GI___nanosleep (libc.so.6)
#1 0x00007f8a0b3774aa __sleep (libc.so.6)
#2 0x0000559aa9a18c08 Perl_pp_sleep (perl)
#3 0x0000559aa99bb586 Perl_runops_standard (perl)
#4 0x0000559aa9934cbf Perl_call_sv (perl)
#5 0x0000559aa99a8e35 Perl_sighandler (perl)
#6 0x0000559aa99a4847 Perl_despatch_signals (perl)
#7 0x0000559aa99bbb96 Perl_pp_and (perl)
#8 0x0000559aa99bb586 Perl_runops_standard (perl)
#9 0x0000559aa993c7d7 S_run_body (perl)
#10 0x0000559aa9913ef2 main (perl)
#11 0x00007f8a0b2d234a __libc_start_main (libc.so.6)
#12 0x0000559aa9913f3a _start (perl)
Stack trace of thread 17853:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17846:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17857:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17858:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17842:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17861:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17849:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17854:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17839:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17856:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17843:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17859:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17848:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17845:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17864:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17865:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17866:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17851:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17852:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17855:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17860:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17862:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17863:
#0 0x00007f8a0b3a47a9 syscall (libc.so.6)
#1 0x00007f8a02eda702 n/a (libtbb.so.2)
#2 0x00007f8a02eda729 n/a (libtbb.so.2)
#3 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#4 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17836:
#0 0x0000559aa99a4680 Perl_csighandler (perl)
#1 0x00007f8a0b67c2d0 __restore_rt (libpthread.so.0)
#2 0x00007f8a0b3a47a9 syscall (libc.so.6)
#3 0x00007f8a02eda702 n/a (libtbb.so.2)
#4 0x00007f8a02eda729 n/a (libtbb.so.2)
#5 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#6 0x00007f8a0b3a9f2f __clone (libc.so.6)
Stack trace of thread 17837:
#0 0x0000559aa99a4680 Perl_csighandler (perl)
#1 0x00007f8a0b67c2d0 __restore_rt (libpthread.so.0)
#2 0x00007f8a0b3a47a9 syscall (libc.so.6)
#3 0x00007f8a02eda702 n/a (libtbb.so.2)
#4 0x00007f8a02eda729 n/a (libtbb.so.2)
#5 0x00007f8a0b6714f9 start_thread (libpthread.so.0)
#6 0x00007f8a0b3a9f2f __clone (libc.so.6)
This time it is really the backend process. Apparently multiple threads are running Perl's signal handler but ironically not Perl's own thread. As previously mentioned we're sending SIGTERM possibly multiple times when stopping a job manually. That would explain why multiple threads are running the signal handler at the same time. By the way, I assume these threads are created by the libtbb2 library. This likely triggered somewhere within OpenCV.
I've checked a few further stack traces on openqaworker3 and they all look like one of the previous traces. In total there are 6 autotest crashes and 88 backend crashes on openqaworker3.
TL;DR:
- The problem is not strictly OpenCV related as it turns out that autotest has the same problem. This also seems like a problem which could possibly happen with any native library which spawns their own threads. I'm wondering whether we miss a generic solution here as we're likely not the only ones who are using Perl XS with non-trivial libraries.
- It really comes down to signals being received by the wrong thread. So I don't think we "can ask the opencv/tinycv threads to stop before we get rid of the rest of the stack" as @okurz suggested. When would we ask them?
Updated by okurz over 4 years ago
rfan1 wrote:
One quick question here, Oliver, once leap15.2 is out, can we install openqa software upon it?
Yes. Packages exist and are expected to work with no significant difference.
Updated by mkittler over 4 years ago
- Assignee set to mkittler
I've asked people on the #perl IRC channel. It seems there's nothing better than our current approach of trying to block the signals. That means one had to identify all places where possibly new threads are created and wrap them within block and unblock calls.
They also pointed out that threads.xs
does the same:
[…]
#ifdef THREAD_SIGNAL_BLOCKING
/* perl_clone() will leave us the new interpreter's context. This poses
* two problems for our signal handler. First, it sets the new context
* before the new interpreter struct is fully initialized, so our signal
* handler might find bogus data in the interpreter struct it gets.
* Second, even if the interpreter is initialized before a signal comes in,
* we would like to avoid that interpreter receiving notifications for
* signals (especially when they ought to be for the one running in this
* thread), until it is running in its own thread. Another problem is that
* the new thread will not have set the context until some time after it
* has started, so it won't be safe for our signal handler to run until
* that time.
*
* So we block most signals here, so the new thread will inherit the signal
* mask, and unblock them right after the thread creation. The original
* mask is saved in the thread struct so that the new thread can restore
* the original mask.
*/
S_block_most_signals(&thread->initial_sigmask);
#endif
[…]
/* Create the thread */
[…]
# if defined(HAS_PTHREAD_ATTR_SETSCOPE) && defined(PTHREAD_SCOPE_SYSTEM)
pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
# endif
rc_thread_create = pthread_create(&thread->thr,
&attr,
S_ithread_run,
(void *)thread);
[…]
#ifdef THREAD_SIGNAL_BLOCKING
/* Now it's safe to accept signals, since we're in our own interpreter's
* context and we have created the thread.
*/
S_set_sigmask(&thread->initial_sigmask);
#endif
(from https://metacpan.org/source/JDHEDDEN/threads-2.21/threads.xs#L834)
It was also mentioned that we should also better use pthread_sigmask
instead of sigprocmask
.
I'm assigning myself to do this as a side-ticket (priority is low). Let's see how far I'll come and whether it is worth the effort. I'm also curious where the OpenBLAS threads in autotest come from.
Updated by mkittler over 4 years ago
require tinycv
spawns 7 threads on my system (8 cores) and 15 threads onopenqaworker3
(32 cores); it should be already covered for the backend (see previous comments)$img = tinycv::new()
spawns no further threads$img = tinycv::read(…)
spawns no further threads$img->similarity(…)
spawns no further threads$img->search_needle($img, 0, 0, 1, 1, 0)
spawns 7 more threads on my system and 31 threads onopenqaworker3
when called the first time on an image loaded viatinycv::read
- It is strange that the coredumps from
openqaworker3
only show 32 threads in total while I get 47 in total (by simply runninguse tinycv; $img = tinycv::read('/var/lib/openqa/tests/opensuse/products/opensuse/needles/evolution_about-SSL-TW-20200429.png'); $img->search_needle($img, 0, 0, 1, 1, 0);
).
- It is strange that the coredumps from
I've tested a few other functions and they don't spawn further threads. So if we're lucky there are only 2 different thread pools.
A simple/stupid workaround for the backend crashes would be simply running the following code immediately after require tinycv;
to create the threads upfront (within the sigprocmask-wrapping):
my $dummy = tinycv::read('dummy.png');
$dummy->search_needle($dummy, 0, 0, 1, 1, 0);
But maybe it makes sense to create a dedicated function to create all threads upfront.
And I'm still wondering where the OpenBLAS threads in autotest come from.
Updated by mkittler over 4 years ago
- Status changed from Workable to In Progress
- Target version changed from Ready to Current Sprint
Draft PR which might help at least with the OpenCV threads: https://github.com/os-autoinst/os-autoinst/pull/1456
Updated by mkittler over 4 years ago
The mentioned PR might really help with most of the crashes because I found a reliable way to create OpenCV's threads upfront. However, we can not merge it because the CI tests won't pass anymore. One unit test seems to rely on a certain timing or otherwise unspecified behavior. I currently don't know how to fix it.
Updated by mkittler over 4 years ago
- Status changed from In Progress to Feedback
The PR has been merged. Let's see whether it works in production when deployed on OSD workers. (Unfortunately systemd-coredump
is missing on most o3 workers.)
Updated by mkittler over 4 years ago
There are no further coredumps on the o3 worker imagetester. (Other o3 workers don't have systemd-coredump
installed.)
I've checked out the OSD worker openqaworker-arm-1. On this machine we've seen coredumps on almost every day (excluding days when no jobs were executed). However, the last coredump is from the 14.07. Judging by the os-autoinst log of jobs executed on the worker the PR with my fix has been deployed on 15.07. Then I've checked on openqaworker5, openqaworker-arm-2 and openqaworker-arm-3 and unfortunately there are also coredumps after the 14.07. However, the number of coredumps is still quite low compared to before.
On openqaworker5 the coredumps I've seen look familiar (Perl_csighandler
called in TBB thread instead of Perl thread).
The coredumps on the ARM workers look like this:
PID: 75307 (/usr/bin/isotov)
UID: 482 (_openqa-worker)
GID: 65534 (nogroup)
Signal: 11 (SEGV)
Timestamp: Thu 2020-07-23 15:49:12 UTC (17h ago)
Command Line: /usr/bin/isotovideo: backen
Executable: /usr/bin/perl
Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@20.service
Unit: openqa-worker@20.service
Slice: openqa-worker.slice
Boot ID: 08fdadd14b0a45bb8dbe352c22b8b106
Machine ID: 528ee47edee40256340231ab5a8f291c
Hostname: openqaworker-arm-2
Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.482.08fdadd14b0a45bb8dbe352c22b8b106.75307.1595519352000000.lz4
Message: Process 75307 (/usr/bin/isotov) of user 482 dumped core.
Stack trace of thread 75321:
#0 0x0000aaaabeb10e68 Perl_csighandler (perl)
#1 0x0000ffff91757598 n/a (linux-vdso.so.1)
or even just:
Stack trace of thread 38593:
#0 0x0000aaaab75a2e68 Perl_csighandler (/usr/bin/perl)
So here the Perl signal handler is not invoked in a TBB thread and it crashes for a different reason.
I couldn't spot any autotest crashes anymore. (But they were not happening so often anyways.)
So it seems like the PR reduces the number of coredumps on all workers I've checked out so far but doesn't cover all cases yet. The coredumps we still see on ARM workers likely have a different cause.
Updated by okurz over 4 years ago
So it seems like the PR reduces the number of coredumps on all workers I've checked out so far but doesn't cover all cases yet. The coredumps we still see on ARM workers likely have a different cause.
That is great news and a very good evaluation by yourself. So, what do you plan as next steps? Still keep the ticket assigned to yourself? If not, could you give a suggestion what anyone else could look into as the next step?
Updated by mkittler over 4 years ago
There are still no coredumps on openqaworker-arm-1
since 14.07
. Maybe recording coredumps on this machine ceased to work? The other openqaworker-arm-*
hosts appear to be offline so I can not check.
On openqaworker5
there were only 3 coredumps recently so that's an improvement (still Perl_csighandler
called in TBB thread instead of Perl thread). The same counts for openqaworker3
, openqaworker8
, openqaworker9
and openqaworker13
. There are no coredumps on openqaworker2
and openqaworker13
.
So the conclusion from my last comment still holds.
So, what do you plan as next steps? If not, could you give a suggestion what anyone else could look into as the next step?
As stated, my PR seems to work but not 100 % of the time. So there might be a race condition within my code or it simply misses to cover certain cases where TBB threads are created. I'm currently out of ideas where the error might lie. If someone would review my code again that could uncover what it misses. If one comes to the conclusion that my code misses a case where TBB threads are created it would likely be a good approach to extend the test t/28-signalblocker.t
first by this case (and then try to fix it).
I'm keeping myself assigned because I might have an idea later. Nevertheless, anyone is welcome to take over.
Updated by okurz over 4 years ago
- Status changed from Feedback to Workable
- Assignee deleted (
mkittler) - Target version changed from Current Sprint to Ready
You actually made it workable for someone else nicely:
"If someone would review my code again that could uncover what it misses. If one comes to the conclusion that my code misses a case where TBB threads are created it would likely be a good approach to extend the test t/28-signalblocker.t first by this case (and then try to fix it)."
Updated by okurz about 4 years ago
- Target version changed from Ready to future
moving out of backlog due to limited team capacity.
Updated by AdamWill 2 months ago
we've come back to this at https://github.com/os-autoinst/os-autoinst/issues/2549 . From this ticket and the coredumpctl logs on Fedora it looks like this was never really entirely fixed; probably the "off by one thread" issue we're discussing there has been going on the entire time. It's gotten worse for me on Fedora lately, though, as it seems like this bug is now causing workers to actually get stuck in a "broken" state, blocked by the qemu process from the job that hit this bug. In the past, AFAICR, that wasn't happening - we had the crashes in coredumpctl history, and the failed jobs, but at least the workers weren't getting blocked. I'm not sure exactly what changed to make this have a bigger impact. It'd be interesting to know if it's happening on SUSE's workers also.