Project

General

Profile

action #96507

Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry

Added by MDoucha 10 months ago. Updated 3 months ago.

Status:
New
Priority:
Low
Assignee:
-
Category:
Concrete Bugs
Target version:
Start date:
2021-08-03
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Two LTP jobs have failed recently in a similar way while waiting for needle match.

https://openqa.suse.de/tests/6638157
https://openqa.suse.de/tests/6625293

The first job has some interesting output in autoinst-log.txt:

[2021-08-03T07:24:58.610 CEST] [debug] no change: 0.5s
[2021-08-03T07:25:03.632 CEST] [debug] WARNING: check_asserted_screen took 4.02 seconds for 35 candidate needles - make your needles more specific
[2021-08-03T07:25:03.632 CEST] [debug] no match: -0.5s, best candidate: linux-login-20181005 (0.29)
*** Error in `/usr/bin/isotovideo: backen': free(): invalid pointer: 0x00005560f6795e00 ***
[2021-08-03T07:25:05.336 CEST] [debug] backend process exited: 0
[2021-08-03T07:25:05.336 CEST] [debug] stopping command server 30808 because test execution ended

The other job does not show any obvious error in the log:

[2021-07-31T07:21:58.852 CEST] [debug] no change: 1785.4s
[2021-07-31T07:22:01.325 CEST] [debug] backend process exited: 0
[2021-07-31T07:22:01.326 CEST] [debug] stopping command server 21557 because test execution ended

Suggestion

  • Reproduce locally
  • Extend t/01-test_needle.t to reproduce this case
  • Have the test create and destroy multiple needs in a loop
  • Try using an optimized build for tinycv (e.g. via cmake argument -DCMAKE_BUILD_TYPE=RelWithDebInfo, see https://cmake.org/cmake/help/latest/variable/CMAKE_BUILD_TYPE.html) if the problem is not reproducible
  • Provoke more threading in opencv
  • Investigate memory-handling on the Perl side, passed to opencv, in baseclass (check os-autoinst readme)
stacktrace (13.1 KB) stacktrace mkittler, 2021-08-06 14:02
stacktrace2 (13 KB) stacktrace2 mkittler, 2021-08-06 14:21
stacktrace3 (13.2 KB) stacktrace3 mkittler, 2021-08-06 14:24
stacktrace4 (13.2 KB) stacktrace4 mkittler, 2021-08-06 14:26
backend-backtraces-openqaworker10.txt (18.1 KB) backend-backtraces-openqaworker10.txt mkittler, 2021-08-17 11:23

History

#1 Updated by okurz 10 months ago

  • Category set to Concrete Bugs
  • Priority changed from Normal to High
  • Target version set to Ready

#2 Updated by mkittler 10 months ago

  • File stacktrace stacktrace added
  • Subject changed from Job terminated prematurely during needle check to Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry

I've been adding the regex but it doesn't look like we're running auto_review currently/anymore.

Luckily the stacktrace is still available so I've attached it. That's the unmangled version of the relevant thread:

(gdb) bt
#0  0x00007fa5d9349420 in raise () from /lib64/libc.so.6
#1  0x00007fa5d934aa01 in abort () from /lib64/libc.so.6
#2  0x00007fa5d938c877 in __libc_message () from /lib64/libc.so.6
#3  0x00007fa5d9393093 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007fa5d9394999 in _int_free () from /lib64/libc.so.6
#5  0x00007fa5d1659589 in cv::MatAllocator::unmap(cv::UMatData*) const () from /usr/lib64/libopencv_core.so.3.3
#6  0x00007fa5d26968a1 in image_destroy(Image*) () from /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/auto/tinycv/tinycv.so
#7  0x00007fa5d269d32e in ?? () from /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/auto/tinycv/tinycv.so
#8  0x00005560ecf47ac1 in Perl_pp_entersub (my_perl=0x5560ee4cb260) at pp_hot.c:4231
#9  0x00005560eceb8ca8 in Perl_call_sv (my_perl=my_perl@entry=0x5560ee4cb260, sv=0x5560f319f950, flags=flags@entry=45) at perl.c:2848
#10 0x00005560ecf4ca4e in S_curse (my_perl=my_perl@entry=0x5560ee4cb260, sv=sv@entry=0x5560f3237ef0, check_refcnt=check_refcnt@entry=true) at sv.c:6972
#11 0x00005560ecf4d2f8 in Perl_sv_clear (my_perl=my_perl@entry=0x5560ee4cb260, orig_sv=orig_sv@entry=0x5560f31f6058) at sv.c:6576
#12 0x00005560ecf4d7b1 in Perl_sv_free2 (my_perl=0x5560ee4cb260, sv=0x5560f31f6058, rc=<optimized out>) at sv.c:7073
#13 0x00005560ecf7fa5c in Perl_leave_scope (my_perl=my_perl@entry=0x5560ee4cb260, base=<optimized out>) at scope.c:1178
#14 0x00005560ecf83a08 in Perl_dounwind (my_perl=my_perl@entry=0x5560ee4cb260, cxix=cxix@entry=20) at pp_ctl.c:1552
#15 0x00005560ecf8eef1 in Perl_dounwind (cxix=20, my_perl=0x5560ee4cb260) at pp_ctl.c:2471
#16 Perl_pp_return (my_perl=0x5560ee4cb260) at pp_ctl.c:2477
#17 0x00005560ecf3f8e6 in Perl_runops_standard (my_perl=0x5560ee4cb260) at run.c:41
#18 0x00005560ecec07d7 in S_run_body (oldscope=<optimized out>, my_perl=<optimized out>) at perl.c:2524
#19 perl_run (my_perl=0x5560ee4cb260) at perl.c:2447
#20 0x00005560ece97ef2 in main (argc=<optimized out>, argv=<optimized out>, env=<optimized out>) at perlmain.c:123

Looks like Perl passed a bad pointer to the XS function for destroying the tinycv::Image object so its destructor passes a bad pointer to the destructor of the cv::Mat object.

#3 Updated by mkittler 10 months ago

There are 3 more stack traces, likely one of them corresponds to the 2nd job. So there are basically two crashes, one involving image_destroy and one involving image_search.

By the way, I haven't seen such stack traces before (e.g. when looking at backend stack traces for different crashes).

#4 Updated by mkittler 10 months ago

It also happened on openqaworker13 (backtrace with image_search) so it is not specific openqaworker10.

#5 Updated by cdywan 10 months ago

  • Subject changed from Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry to Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry size:M
  • Description updated (diff)
  • Status changed from New to Workable

#6 Updated by mkittler 10 months ago

  • Description updated (diff)
  • Assignee set to mkittler

#7 Updated by mkittler 10 months ago

This test should run the relevant functions many times over and over again: https://github.com/Martchus/os-autoinst/pull/new/debug-segfault-2

However, so far I failed to reproduce the problem.


Since compiler warnings are often useful to find bugs I've been creating https://github.com/os-autoinst/os-autoinst/pull/1740. I couldn't find any clear problems, though.

#8 Updated by MDoucha 10 months ago

Here's another similar job failure with a slightly different error in the log:
https://openqa.suse.de/tests/6795622

[2021-08-10T20:26:11.866 CEST] [debug] led state 0 1 1 -261
[2021-08-10T20:26:12.353 CEST] [debug] no change: 1785.1s
*** Error in `/usr/bin/isotovideo: backen': corrupted size vs. prev_size: 0x000055acfb6e0670 ***
Unexpected end of data 0
[2021-08-10T20:26:13.726 CEST] [debug] backend process exited: 0
[2021-08-10T20:26:13.727 CEST] [debug] stopping command server 30489 because test execution ended

#9 Updated by mkittler 9 months ago

Strangely there's no coredump from the day this job ran on openqaworker10 where the job ran according to the database. However, there are several other coredumps which I've attached. Many have actually no tinycv.so-frames within the stack trace; not sure whether they might even be bugs in Perl. One backtrace contains an image_scale-frame which is new (so far I've only seen image_search and image_destroy). The TBB threads all look the same and unproblematic (so the issue that signal handlers are invoked in these threads doesn't seem to be re-occurring).

#10 Updated by mkittler 9 months ago

I extended my test which runs image functionality by a few other functions but still cannot reproduce the issue locally. I've also tested against OpenCV 3. I also still couldn't find anything wrong with our code. It could be improved here and there according to certain guidelines and to avoid certain warnings but there's nothing which should cause UB or memory corruption.

I'm currently running out of ideas.

#11 Updated by mkittler 9 months ago

I've been running https://github.com/Martchus/os-autoinst/pull/new/debug-segfault-2 in a Leap 15.2 container but still couldn't reproduce the issue.

#12 Updated by mkittler 9 months ago

  • Assignee deleted (mkittler)

Not sure what the problem could be and how to reproduce it.

#13 Updated by okurz 9 months ago

  • Subject changed from Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry size:M to Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry
  • Status changed from Workable to New

#14 Updated by cdywan 9 months ago

  • Description updated (diff)

#15 Updated by okurz 9 months ago

  • Priority changed from High to Low
  • Target version changed from Ready to future

During the daily 2021-08-31 we decided that we don't have a good idea how to fix the actual problem. I see the following alternatives:

  1. With the above information provided and assuming that "auto_review" handles automatic retriggering at least on o3+osd move the ticket out of the backlog and wait for further information. Please anyone who sees the issue again or has any additional information comment on the ticket
  2. Ensure that openQA has a proper reason in an incomplete job so that openQA can automatically retrigger, see https://github.com/os-autoinst/openQA/blob/master/etc/openqa/openqa.ini#L76
  3. Ask perl experts if they can read out something better from the stack trace

#16 Updated by MDoucha 9 months ago

We have one more failed job due to this bug:
https://openqa.suse.de/tests/7095789

[2021-09-14T03:04:46.243 CEST] [debug] no change: 1386.7s
*** Error in `/usr/bin/isotovideo: backen': corrupted size vs. prev_size: 0x000055ddc3398e70 ***
[2021-09-14T03:04:50.611 CEST] [debug] backend process exited: 0
[2021-09-14T03:04:50.612 CEST] [debug] stopping command server 22002 because test execution ended

#17 Updated by MDoucha 8 months ago

We have another one:
https://openqa.suse.de/tests/7345106

[2021-10-08T10:15:12.457542+02:00] [debug] WARNING: check_asserted_screen took 4.09 seconds for 36 candidate needles - make your needles more specific
[2021-10-08T10:15:12.457694+02:00] [debug] no match: -0.1s, best candidate: linux-login-20181005 (0.29)
*** Error in `/usr/bin/isotovideo: backen': realloc(): invalid pointer: 0x000056449d7fdab0 ***
[2021-10-08T10:15:14.621326+02:00] [debug] backend process exited: 0
[2021-10-08T10:15:14.622132+02:00] [debug] stopping command server 2326 because test execution ended

#18 Updated by MDoucha 8 months ago

And one more without any error message in the logs: https://openqa.suse.de/tests/7363250

#19 Updated by MDoucha 6 months ago

We have more examples. The first one is expecially interesting due to new error message:
https://openqa.suse.de/tests/7668281

[2021-11-13T09:42:19.094394+01:00] [debug] led state 0 1 1 -261
not enough bytes for zrle
[2021-11-13T09:42:21.140436+01:00] [debug] backend process exited: 0
[2021-11-13T09:42:21.141354+01:00] [debug] stopping command server 3912 because test execution ended

Looks like OpenCV crashed while trying to load a PNG file generated by isotovideo

https://openqa.suse.de/tests/7668232

*** Error in `/usr/bin/isotovideo: backen': realloc(): invalid pointer: 0x0000561b1e635070 ***

https://openqa.suse.de/tests/7668284
(no error message)

#20 Updated by MDoucha 3 months ago

We have another example with a slightly different error message:
https://openqa.suse.de/tests/8190266

[2022-02-18T13:55:15.963440+01:00] [debug] no match: -1.8s, best candidate: linux-login-20181005 (0.29)
realloc(): invalid pointer
[2022-02-18T13:55:21.860752+01:00] [debug] backend process exited: 0
[2022-02-18T13:55:21.861512+01:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 1908 and exit status: 0
[2022-02-18T13:55:21.862025+01:00] [debug] stopping command server 523 because test execution ended

#21 Updated by MDoucha 3 months ago

One more example, this time with a completely different error message indicating a heap smash somewhere in needle processing:
https://openqa.suse.de/tests/8235310

[2022-02-26T03:35:28.028820+01:00] [debug] no match: 1781.5s, best candidate: linux-login-20181005 (0.29)
malloc(): largebin double linked list corrupted (bk)
[2022-02-26T03:35:33.426188+01:00] [debug] backend process exited: 0
[2022-02-26T03:35:33.427064+01:00] [debug] stopping command server 3839 because test execution ended

Also available in: Atom PDF