action #33037

dead children uncaught (again)

Added by coolo almost 2 years ago. Updated almost 2 years ago.

Status:ResolvedStart date:10/03/2018
Priority:UrgentDue date:
Assignee:EDiGiacinto% Done:

50%

Category:Concrete Bugs
Target version:Done
Difficulty:
Duration:

Description

https://openqa.suse.de/tests/1531020 left qemu behind and it wasn't caught. Too much refactoring :(


Related issues

Related to openQA Project - action #32293: [tools] Implement sub-reaper support in isotovideo New 26/02/2018
Related to openQA Project - action #39893: warning message about "Failed creating CGroup subtree 'mk... Resolved 17/08/2018

History

#1 Updated by EDiGiacinto almost 2 years ago

Don't think it's due to refactor.

[2018-03-09T20:36:03.0579 CET] [debug] QEMU: (process:27655): GLib-CRITICAL **: Source ID 62 was not found when attempting to remove it

Suggests qemu process stalled, so they should be killed - i guess it's not a zombie process (without further logs i can just assume) but it's still doing something.

#2 Updated by EDiGiacinto almost 2 years ago

  • Related to action #32293: [tools] Implement sub-reaper support in isotovideo added

#3 Updated by coolo almost 2 years ago

Yeah. And so the worker is supposed to kill all remains of isotovideo after the job is done. And this is what David implemented using the /proc parsing - and then it was refactored

#4 Updated by EDiGiacinto almost 2 years ago

Assuming also isotovideo should do the right thing, as it forks and spawns processes.

#5 Updated by coolo almost 2 years ago

No, don't assume anything - just clean up.

#6 Updated by coolo almost 2 years ago

for completness:

[2018-03-09T20:36:04.0860 CET] [debug] backend 27650 died
can_read received kill signal at /usr/lib/os-autoinst/myjsonrpc.pm line 89.
[2018-03-09T20:36:04.0860 CET] [debug] awaiting death of commands process
[2018-03-09T20:36:04.0861 CET] [info] Collected unknown process with pid 27651 and exit status: 1
[2018-03-09T20:36:04.0863 CET] [debug] commands process exited: 27645
[2018-03-09T20:36:04.0863 CET] [debug] awaiting death of testpid 27647
[2018-03-09T20:36:04.0866 CET] [debug] test process exited: 27647

#7 Updated by szarate almost 2 years ago

There are two coredumps about that time:

Fri 2018-03-09 20:34:55 CET 20917 489 65534 11 * /usr/bin/perl
Fri 2018-03-09 20:37:15 CET 27650 489 65534 11 * /usr/bin/perl

I think we're missing something somewhere :/

#8 Updated by EDiGiacinto almost 2 years ago

coolo wrote:

No, don't assume anything - just clean up.

Not sure how to parse this, if ACs for this tickets are cleanup reading /proc unsafely and depend on systemd(again) - just don't look at me, they both sounds like bad architecture design

#9 Updated by coolo almost 2 years ago

This is the reason isotovideo is crashing:

Thread 9 (Thread 0x7f731ab7a700 (LWP 20934)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f7320d4e74d in cv::ForThread::thread_body (this=0x84cc590) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:331
#2  cv::ForThread::thread_loop_wrapper (thread_object=0x84cc590) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:298
#3  0x00007f7326cfc724 in start_thread (arg=0x7f731ab7a700) at pthread_create.c:457
#4  0x00007f7326a3cc1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 8 (Thread 0x7f731bb7c700 (LWP 20931)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f7320d4e74d in cv::ForThread::thread_body (this=0x84cc410) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:331
#2  cv::ForThread::thread_loop_wrapper (thread_object=0x84cc410) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:298
#3  0x00007f7326cfc724 in start_thread (arg=0x7f731bb7c700) at pthread_create.c:457
#4  0x00007f7326a3cc1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 7 (Thread 0x7f7327855700 (LWP 20917)):
#0  0x0000000000442a26 in Perl_gv_fetchmeth_pvn (my_perl=my_perl@entry=0x1b22010, stash=stash@entry=0x5e99b78, name=name@entry=0x6f47a30 "_kill_children_processes", len=len@entry=24, level=level@entry=0, flags=flags@entry=768)
    at gv.c:766
#1  0x000000000044034f in Perl_gv_fetchmethod_pvn_flags (my_perl=my_perl@entry=0x1b22010, stash=stash@entry=0x5e99b78, name=0x6f47a30 "_kill_children_processes", len=<optimized out>, flags=768) at gv.c:1002
#2  0x00000000004407b5 in Perl_gv_fetchmethod_sv_flags (my_perl=my_perl@entry=0x1b22010, stash=0x5e99b78, namesv=namesv@entry=0x7b2cbd0, flags=<optimized out>, flags@entry=768) at gv.c:935
#3  0x00000000004a405e in S_method_common (my_perl=my_perl@entry=0x1b22010, meth=meth@entry=0x7b2cbd0, hashp=hashp@entry=0x7ffce349fd5c) at pp_hot.c:3124
#4  0x00000000004ac7e7 in Perl_pp_method_named (my_perl=0x1b22010) at pp_hot.c:3016
#5  0x00000000004a3cf6 in Perl_runops_standard (my_perl=0x1b22010) at run.c:42
#6  0x0000000000437068 in Perl_call_sv (my_perl=my_perl@entry=0x1b22010, sv=sv@entry=0x63d6cc8, flags=<optimized out>, flags@entry=12) at perl.c:2766
#7  0x00000000004937b4 in Perl_sighandler (sig=15, sip=0x0, uap=0x0) at mg.c:3143
#8  0x000000000048fb59 in Perl_despatch_signals (my_perl=0x1b22010) at mg.c:1422
#9  0x00000000004a466e in Perl_pp_and (my_perl=0x1b22010) at pp_hot.c:106
#10 0x00000000004a3cf6 in Perl_runops_standard (my_perl=0x1b22010) at run.c:42
#11 0x000000000043e112 in S_run_body (oldscope=1, my_perl=0x1b22010) at perl.c:2467
#12 perl_run (my_perl=0x1b22010) at perl.c:2383
#13 0x000000000041e963 in main (argc=3, argv=0x7ffce34a0408, env=0x7ffce34a0428) at perlmain.c:114

Thread 6 (Thread 0x7f7319377700 (LWP 20937)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f7320d4e74d in cv::ForThread::thread_body (this=0x84cc710) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:331
#2  cv::ForThread::thread_loop_wrapper (thread_object=0x84cc710) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:298
#3  0x00007f7326cfc724 in start_thread (arg=0x7f7319377700) at pthread_create.c:457
#4  0x00007f7326a3cc1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 5 (Thread 0x7f7312b7a700 (LWP 20933)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f7320d4e74d in cv::ForThread::thread_body (this=0x84cc510) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:331
#2  cv::ForThread::thread_loop_wrapper (thread_object=0x84cc510) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:298
#3  0x00007f7326cfc724 in start_thread (arg=0x7f7312b7a700) at pthread_create.c:457
#4  0x00007f7326a3cc1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 4 (Thread 0x7f7319b78700 (LWP 20936)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f7320d4e74d in cv::ForThread::thread_body (this=0x84cc690) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:331
#2  cv::ForThread::thread_loop_wrapper (thread_object=0x84cc690) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:298
#3  0x00007f7326cfc724 in start_thread (arg=0x7f7319b78700) at pthread_create.c:457
#4  0x00007f7326a3cc1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 3 (Thread 0x7f731b37b700 (LWP 20932)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
---Type <return> to continue, or q <return> to quit---
#1  0x00007f7320d4e74d in cv::ForThread::thread_body (this=0x84cc490) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:331
#2  cv::ForThread::thread_loop_wrapper (thread_object=0x84cc490) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:298
#3  0x00007f7326cfc724 in start_thread (arg=0x7f731b37b700) at pthread_create.c:457
#4  0x00007f7326a3cc1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 2 (Thread 0x7f731a379700 (LWP 20935)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f7320d4e74d in cv::ForThread::thread_body (this=0x84cc610) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:331
#2  cv::ForThread::thread_loop_wrapper (thread_object=0x84cc610) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:298
#3  0x00007f7326cfc724 in start_thread (arg=0x7f731a379700) at pthread_create.c:457
#4  0x00007f7326a3cc1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 1 (Thread 0x7f731c37d700 (LWP 20930)):
#0  Perl_csighandler (sig=15, sip=<optimized out>, uap=<optimized out>) at mg.c:1334
#1  <signal handler called>
#2  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#3  0x00007f7320d4e74d in cv::ForThread::thread_body (this=0x84cc390) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:331
#4  cv::ForThread::thread_loop_wrapper (thread_object=0x84cc390) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:298
#5  0x00007f7326cfc724 in start_thread (arg=0x7f731c37d700) at pthread_create.c:457
#6  0x00007f7326a3cc1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

#10 Updated by coolo almost 2 years ago

you don't need to rely on systemd - if the worker has a different mean to account for all children processes. And IMO cgroups are the only valid option.

#11 Updated by EDiGiacinto almost 2 years ago

coolo wrote:

you don't need to rely on systemd - if the worker has a different mean to account for all children processes. And IMO cgroups are the only valid option.

Wasn't me pushing for systemd ;) cgroups sounds fair enough, but guess won't get rid of those stacktraces - maybe is worth also profiling tinycv/opencv and how we propagate signals there in case we kill backend processes.

#12 Updated by coolo almost 2 years ago

yeah, getting rid of cv threads is a subtask - for now this ticket is about catching crashes in general though :)

#13 Updated by EDiGiacinto almost 2 years ago

coolo wrote:

This is the reason isotovideo is crashing:


Thread 9 (Thread 0x7f731ab7a700 (LWP 20934)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007f7320d4e74d in cv::ForThread::thread_body (this=0x84cc590) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:331
#2 cv::ForThread::thread_loop_wrapper (thread_object=0x84cc590) at /usr/src/debug/opencv-3.1.0/modules/core/src/parallel_pthreads.cpp:298
#3 0x00007f7326cfc724 in start_thread (arg=0x7f731ab7a700) at pthread_create.c:457
#4 0x00007f7326a3cc1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
....

Let me try to draw one hypothesis of a possible scenario from the stacktrace:

We try to kill isotovideo (presumibly job is completed) and we send a sigterm there - but meanwhile there was a tinycv operation in progress, the signal handler kicked in, suspending current execution, signaled other process to quit and then exited, leaving the tinycv pthreads in the cold (and the semaphore as well).

Then, QEMU process will be still running, since some of its resources can't be freed anymore, as we can see:

[2018-03-09T20:36:03.0579 CET] [debug] QEMU: (process:27655): GLib-CRITICAL **: Source ID 62 was not found when attempting to remove it

And i presume won't even fire a sig_chld, because execution will just keep going.

So we will see a qemu process that is still running, and its resources can't be freed cause of the sigsev.

#14 Updated by coolo almost 2 years ago

Well, what bothers me a little: we went from threads to processes in isotovideo to avoid this plaque. So isotovideo main process should hopefully not be affected by this SEGV and will get a sigchld that the backend died. And indeed it does: "backend 27650 died" - but the question is the same as with the cleanup in the worker: how to determine that there are processes left to kill? Process groups can only be used on one level (and I prefer the worker to be that level) - leaves cgroups (again).

#15 Updated by EDiGiacinto almost 2 years ago

coolo wrote:

Well, what bothers me a little: we went from threads to processes in isotovideo to avoid this plaque. So isotovideo main process should hopefully not be affected by this SEGV and will get a sigchld that the backend died. And indeed it does: "backend 27650 died" - but the question is the same as with the cleanup in the worker: how to determine that there are processes left to kill? Process groups can only be used on one level (and I prefer the worker to be that level) - leaves cgroups (again).

I think we should handle the termination gracefully on first place - even using internally other signals instead of sigterm. Because if we are performing low-level operation with pthreads inside tinycv ( as opencv is doing them ) and we have section of code that pre-empt this, we need to cover that case anyway, or we leave an inconsistent memory state (even i/o bounded).

#16 Updated by EDiGiacinto almost 2 years ago

EDiGiacinto wrote:

coolo wrote:

Well, what bothers me a little: we went from threads to processes in isotovideo to avoid this plaque. So isotovideo main process should hopefully not be affected by this SEGV and will get a sigchld that the backend died. And indeed it does: "backend 27650 died" - but the question is the same as with the cleanup in the worker: how to determine that there are processes left to kill? Process groups can only be used on one level (and I prefer the worker to be that level) - leaves cgroups (again).


I think we should handle the termination gracefully on first place - even using internally other signals instead of sigterm. Because if we are performing low-level operation with pthreads inside tinycv ( as opencv is doing them ) and we have section of code that pre-empt this, we need to cover that case anyway, or we leave an inconsistent memory state (even i/o bounded).

So if you are ok with that - i would go over implementing cgroups support and with it trying to trace this unclean shutdown.

#17 Updated by EDiGiacinto almost 2 years ago

  • Status changed from New to In Progress

Already working on cgroups https://github.com/mudler/Mojo-IOLoop-ReadWriteProcess/tree/cgroups_namespaces, PR for changes in openQA will follow soon

#18 Updated by mkittler almost 2 years ago

  • Assignee set to EDiGiacinto

So I assume you're working on this issue? If not, just unassign again.

#19 Updated by szarate almost 2 years ago

  • % Done changed from 0 to 50

#20 Updated by szarate almost 2 years ago

  • Status changed from In Progress to Resolved

This is merged already

#21 Updated by szarate almost 2 years ago

  • Target version changed from Current Sprint to Done

#22 Updated by EDiGiacinto over 1 year ago

  • Related to action #39893: warning message about "Failed creating CGroup subtree 'mkdir /sys/fs/cgroup//var: Read-only file system at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/File.pm line 72." added

Also available in: Atom PDF