action #96507
Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry
0%
Description
Two LTP jobs have failed recently in a similar way while waiting for needle match.
https://openqa.suse.de/tests/6638157
https://openqa.suse.de/tests/6625293
The first job has some interesting output in autoinst-log.txt:
[2021-08-03T07:24:58.610 CEST] [debug] no change: 0.5s [2021-08-03T07:25:03.632 CEST] [debug] WARNING: check_asserted_screen took 4.02 seconds for 35 candidate needles - make your needles more specific [2021-08-03T07:25:03.632 CEST] [debug] no match: -0.5s, best candidate: linux-login-20181005 (0.29) *** Error in `/usr/bin/isotovideo: backen': free(): invalid pointer: 0x00005560f6795e00 *** [2021-08-03T07:25:05.336 CEST] [debug] backend process exited: 0 [2021-08-03T07:25:05.336 CEST] [debug] stopping command server 30808 because test execution ended
The other job does not show any obvious error in the log:
[2021-07-31T07:21:58.852 CEST] [debug] no change: 1785.4s [2021-07-31T07:22:01.325 CEST] [debug] backend process exited: 0 [2021-07-31T07:22:01.326 CEST] [debug] stopping command server 21557 because test execution ended
Suggestion¶
- Reproduce locally
- Extend
t/01-test_needle.t
to reproduce this case - Have the test create and destroy multiple needs in a loop
- Try using an optimized build for tinycv (e.g. via cmake argument
-DCMAKE_BUILD_TYPE=RelWithDebInfo
, see https://cmake.org/cmake/help/latest/variable/CMAKE_BUILD_TYPE.html) if the problem is not reproducible - Provoke more threading in opencv
- Investigate memory-handling on the Perl side, passed to opencv, in baseclass (check os-autoinst readme)
History
#1
Updated by okurz 10 months ago
- Category set to Concrete Bugs
- Priority changed from Normal to High
- Target version set to Ready
as first step https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger should be used to detect the issue and automatically retrigger
#2
Updated by mkittler 10 months ago
- File stacktrace stacktrace added
- Subject changed from Job terminated prematurely during needle check to Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry
I've been adding the regex but it doesn't look like we're running auto_review
currently/anymore.
Luckily the stacktrace is still available so I've attached it. That's the unmangled version of the relevant thread:
(gdb) bt #0 0x00007fa5d9349420 in raise () from /lib64/libc.so.6 #1 0x00007fa5d934aa01 in abort () from /lib64/libc.so.6 #2 0x00007fa5d938c877 in __libc_message () from /lib64/libc.so.6 #3 0x00007fa5d9393093 in malloc_printerr () from /lib64/libc.so.6 #4 0x00007fa5d9394999 in _int_free () from /lib64/libc.so.6 #5 0x00007fa5d1659589 in cv::MatAllocator::unmap(cv::UMatData*) const () from /usr/lib64/libopencv_core.so.3.3 #6 0x00007fa5d26968a1 in image_destroy(Image*) () from /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/auto/tinycv/tinycv.so #7 0x00007fa5d269d32e in ?? () from /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/auto/tinycv/tinycv.so #8 0x00005560ecf47ac1 in Perl_pp_entersub (my_perl=0x5560ee4cb260) at pp_hot.c:4231 #9 0x00005560eceb8ca8 in Perl_call_sv (my_perl=my_perl@entry=0x5560ee4cb260, sv=0x5560f319f950, flags=flags@entry=45) at perl.c:2848 #10 0x00005560ecf4ca4e in S_curse (my_perl=my_perl@entry=0x5560ee4cb260, sv=sv@entry=0x5560f3237ef0, check_refcnt=check_refcnt@entry=true) at sv.c:6972 #11 0x00005560ecf4d2f8 in Perl_sv_clear (my_perl=my_perl@entry=0x5560ee4cb260, orig_sv=orig_sv@entry=0x5560f31f6058) at sv.c:6576 #12 0x00005560ecf4d7b1 in Perl_sv_free2 (my_perl=0x5560ee4cb260, sv=0x5560f31f6058, rc=<optimized out>) at sv.c:7073 #13 0x00005560ecf7fa5c in Perl_leave_scope (my_perl=my_perl@entry=0x5560ee4cb260, base=<optimized out>) at scope.c:1178 #14 0x00005560ecf83a08 in Perl_dounwind (my_perl=my_perl@entry=0x5560ee4cb260, cxix=cxix@entry=20) at pp_ctl.c:1552 #15 0x00005560ecf8eef1 in Perl_dounwind (cxix=20, my_perl=0x5560ee4cb260) at pp_ctl.c:2471 #16 Perl_pp_return (my_perl=0x5560ee4cb260) at pp_ctl.c:2477 #17 0x00005560ecf3f8e6 in Perl_runops_standard (my_perl=0x5560ee4cb260) at run.c:41 #18 0x00005560ecec07d7 in S_run_body (oldscope=<optimized out>, my_perl=<optimized out>) at perl.c:2524 #19 perl_run (my_perl=0x5560ee4cb260) at perl.c:2447 #20 0x00005560ece97ef2 in main (argc=<optimized out>, argv=<optimized out>, env=<optimized out>) at perlmain.c:123
Looks like Perl passed a bad pointer to the XS function for destroying the tinycv::Image
object so its destructor passes a bad pointer to the destructor of the cv::Mat
object.
#3
Updated by mkittler 10 months ago
- File stacktrace2 stacktrace2 added
- File stacktrace3 stacktrace3 added
- File stacktrace4 stacktrace4 added
There are 3 more stack traces, likely one of them corresponds to the 2nd job. So there are basically two crashes, one involving image_destroy
and one involving image_search
.
By the way, I haven't seen such stack traces before (e.g. when looking at backend stack traces for different crashes).
#5
Updated by cdywan 10 months ago
- Subject changed from Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry to Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry size:M
- Description updated (diff)
- Status changed from New to Workable
#7
Updated by mkittler 10 months ago
This test should run the relevant functions many times over and over again: https://github.com/Martchus/os-autoinst/pull/new/debug-segfault-2
However, so far I failed to reproduce the problem.
Since compiler warnings are often useful to find bugs I've been creating https://github.com/os-autoinst/os-autoinst/pull/1740. I couldn't find any clear problems, though.
#8
Updated by MDoucha 10 months ago
Here's another similar job failure with a slightly different error in the log:
https://openqa.suse.de/tests/6795622
[2021-08-10T20:26:11.866 CEST] [debug] led state 0 1 1 -261 [2021-08-10T20:26:12.353 CEST] [debug] no change: 1785.1s *** Error in `/usr/bin/isotovideo: backen': corrupted size vs. prev_size: 0x000055acfb6e0670 *** Unexpected end of data 0 [2021-08-10T20:26:13.726 CEST] [debug] backend process exited: 0 [2021-08-10T20:26:13.727 CEST] [debug] stopping command server 30489 because test execution ended
#9
Updated by mkittler 9 months ago
Strangely there's no coredump from the day this job ran on openqaworker10 where the job ran according to the database. However, there are several other coredumps which I've attached. Many have actually no tinycv.so
-frames within the stack trace; not sure whether they might even be bugs in Perl. One backtrace contains an image_scale
-frame which is new (so far I've only seen image_search
and image_destroy
). The TBB threads all look the same and unproblematic (so the issue that signal handlers are invoked in these threads doesn't seem to be re-occurring).
#10
Updated by mkittler 9 months ago
I extended my test which runs image functionality by a few other functions but still cannot reproduce the issue locally. I've also tested against OpenCV 3. I also still couldn't find anything wrong with our code. It could be improved here and there according to certain guidelines and to avoid certain warnings but there's nothing which should cause UB or memory corruption.
I'm currently running out of ideas.
#11
Updated by mkittler 9 months ago
I've been running https://github.com/Martchus/os-autoinst/pull/new/debug-segfault-2 in a Leap 15.2 container but still couldn't reproduce the issue.
#13
Updated by okurz 9 months ago
- Subject changed from Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry size:M to Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry
- Status changed from Workable to New
#15
Updated by okurz 9 months ago
- Priority changed from High to Low
- Target version changed from Ready to future
During the daily 2021-08-31 we decided that we don't have a good idea how to fix the actual problem. I see the following alternatives:
- With the above information provided and assuming that "auto_review" handles automatic retriggering at least on o3+osd move the ticket out of the backlog and wait for further information. Please anyone who sees the issue again or has any additional information comment on the ticket
- Ensure that openQA has a proper reason in an incomplete job so that openQA can automatically retrigger, see https://github.com/os-autoinst/openQA/blob/master/etc/openqa/openqa.ini#L76
- Ask perl experts if they can read out something better from the stack trace
#16
Updated by MDoucha 9 months ago
We have one more failed job due to this bug:
https://openqa.suse.de/tests/7095789
[2021-09-14T03:04:46.243 CEST] [debug] no change: 1386.7s *** Error in `/usr/bin/isotovideo: backen': corrupted size vs. prev_size: 0x000055ddc3398e70 *** [2021-09-14T03:04:50.611 CEST] [debug] backend process exited: 0 [2021-09-14T03:04:50.612 CEST] [debug] stopping command server 22002 because test execution ended
#17
Updated by MDoucha 8 months ago
We have another one:
https://openqa.suse.de/tests/7345106
[2021-10-08T10:15:12.457542+02:00] [debug] WARNING: check_asserted_screen took 4.09 seconds for 36 candidate needles - make your needles more specific [2021-10-08T10:15:12.457694+02:00] [debug] no match: -0.1s, best candidate: linux-login-20181005 (0.29) *** Error in `/usr/bin/isotovideo: backen': realloc(): invalid pointer: 0x000056449d7fdab0 *** [2021-10-08T10:15:14.621326+02:00] [debug] backend process exited: 0 [2021-10-08T10:15:14.622132+02:00] [debug] stopping command server 2326 because test execution ended
#18
Updated by MDoucha 8 months ago
And one more without any error message in the logs: https://openqa.suse.de/tests/7363250
#19
Updated by MDoucha 6 months ago
We have more examples. The first one is expecially interesting due to new error message:
https://openqa.suse.de/tests/7668281
[2021-11-13T09:42:19.094394+01:00] [debug] led state 0 1 1 -261 not enough bytes for zrle [2021-11-13T09:42:21.140436+01:00] [debug] backend process exited: 0 [2021-11-13T09:42:21.141354+01:00] [debug] stopping command server 3912 because test execution ended
Looks like OpenCV crashed while trying to load a PNG file generated by isotovideo
https://openqa.suse.de/tests/7668232
*** Error in `/usr/bin/isotovideo: backen': realloc(): invalid pointer: 0x0000561b1e635070 ***
https://openqa.suse.de/tests/7668284
(no error message)
#20
Updated by MDoucha 3 months ago
We have another example with a slightly different error message:
https://openqa.suse.de/tests/8190266
[2022-02-18T13:55:15.963440+01:00] [debug] no match: -1.8s, best candidate: linux-login-20181005 (0.29) realloc(): invalid pointer [2022-02-18T13:55:21.860752+01:00] [debug] backend process exited: 0 [2022-02-18T13:55:21.861512+01:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 1908 and exit status: 0 [2022-02-18T13:55:21.862025+01:00] [debug] stopping command server 523 because test execution ended
#21
Updated by MDoucha 3 months ago
One more example, this time with a completely different error message indicating a heap smash somewhere in needle processing:
https://openqa.suse.de/tests/8235310
[2022-02-26T03:35:28.028820+01:00] [debug] no match: 1781.5s, best candidate: linux-login-20181005 (0.29) malloc(): largebin double linked list corrupted (bk) [2022-02-26T03:35:33.426188+01:00] [debug] backend process exited: 0 [2022-02-26T03:35:33.427064+01:00] [debug] stopping command server 3839 because test execution ended