Project

General

Profile

action #96507

Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry

Added by MDoucha 3 months ago. Updated 16 days ago.

Status:
New
Priority:
Low
Assignee:
-
Category:
Concrete Bugs
Target version:
Start date:
2021-08-03
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Two LTP jobs have failed recently in a similar way while waiting for needle match.

https://openqa.suse.de/tests/6638157
https://openqa.suse.de/tests/6625293

The first job has some interesting output in autoinst-log.txt:

[2021-08-03T07:24:58.610 CEST] [debug] no change: 0.5s
[2021-08-03T07:25:03.632 CEST] [debug] WARNING: check_asserted_screen took 4.02 seconds for 35 candidate needles - make your needles more specific
[2021-08-03T07:25:03.632 CEST] [debug] no match: -0.5s, best candidate: linux-login-20181005 (0.29)
*** Error in `/usr/bin/isotovideo: backen': free(): invalid pointer: 0x00005560f6795e00 ***
[2021-08-03T07:25:05.336 CEST] [debug] backend process exited: 0
[2021-08-03T07:25:05.336 CEST] [debug] stopping command server 30808 because test execution ended

The other job does not show any obvious error in the log:

[2021-07-31T07:21:58.852 CEST] [debug] no change: 1785.4s
[2021-07-31T07:22:01.325 CEST] [debug] backend process exited: 0
[2021-07-31T07:22:01.326 CEST] [debug] stopping command server 21557 because test execution ended

Suggestion

  • Reproduce locally
  • Extend t/01-test_needle.t to reproduce this case
  • Have the test create and destroy multiple needs in a loop
  • Try using an optimized build for tinycv (e.g. via cmake argument -DCMAKE_BUILD_TYPE=RelWithDebInfo, see https://cmake.org/cmake/help/latest/variable/CMAKE_BUILD_TYPE.html) if the problem is not reproducible
  • Provoke more threading in opencv
  • Investigate memory-handling on the Perl side, passed to opencv, in baseclass (check os-autoinst readme)
stacktrace (13.1 KB) stacktrace mkittler, 2021-08-06 14:02
stacktrace2 (13 KB) stacktrace2 mkittler, 2021-08-06 14:21
stacktrace3 (13.2 KB) stacktrace3 mkittler, 2021-08-06 14:24
stacktrace4 (13.2 KB) stacktrace4 mkittler, 2021-08-06 14:26
backend-backtraces-openqaworker10.txt (18.1 KB) backend-backtraces-openqaworker10.txt mkittler, 2021-08-17 11:23

History

#1 Updated by okurz 3 months ago

  • Category set to Concrete Bugs
  • Priority changed from Normal to High
  • Target version set to Ready

#2 Updated by mkittler 3 months ago

  • File stacktrace stacktrace added
  • Subject changed from Job terminated prematurely during needle check to Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry

I've been adding the regex but it doesn't look like we're running auto_review currently/anymore.

Luckily the stacktrace is still available so I've attached it. That's the unmangled version of the relevant thread:

(gdb) bt
#0  0x00007fa5d9349420 in raise () from /lib64/libc.so.6
#1  0x00007fa5d934aa01 in abort () from /lib64/libc.so.6
#2  0x00007fa5d938c877 in __libc_message () from /lib64/libc.so.6
#3  0x00007fa5d9393093 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007fa5d9394999 in _int_free () from /lib64/libc.so.6
#5  0x00007fa5d1659589 in cv::MatAllocator::unmap(cv::UMatData*) const () from /usr/lib64/libopencv_core.so.3.3
#6  0x00007fa5d26968a1 in image_destroy(Image*) () from /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/auto/tinycv/tinycv.so
#7  0x00007fa5d269d32e in ?? () from /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/auto/tinycv/tinycv.so
#8  0x00005560ecf47ac1 in Perl_pp_entersub (my_perl=0x5560ee4cb260) at pp_hot.c:4231
#9  0x00005560eceb8ca8 in Perl_call_sv (my_perl=my_perl@entry=0x5560ee4cb260, sv=0x5560f319f950, flags=flags@entry=45) at perl.c:2848
#10 0x00005560ecf4ca4e in S_curse (my_perl=my_perl@entry=0x5560ee4cb260, sv=sv@entry=0x5560f3237ef0, check_refcnt=check_refcnt@entry=true) at sv.c:6972
#11 0x00005560ecf4d2f8 in Perl_sv_clear (my_perl=my_perl@entry=0x5560ee4cb260, orig_sv=orig_sv@entry=0x5560f31f6058) at sv.c:6576
#12 0x00005560ecf4d7b1 in Perl_sv_free2 (my_perl=0x5560ee4cb260, sv=0x5560f31f6058, rc=<optimized out>) at sv.c:7073
#13 0x00005560ecf7fa5c in Perl_leave_scope (my_perl=my_perl@entry=0x5560ee4cb260, base=<optimized out>) at scope.c:1178
#14 0x00005560ecf83a08 in Perl_dounwind (my_perl=my_perl@entry=0x5560ee4cb260, cxix=cxix@entry=20) at pp_ctl.c:1552
#15 0x00005560ecf8eef1 in Perl_dounwind (cxix=20, my_perl=0x5560ee4cb260) at pp_ctl.c:2471
#16 Perl_pp_return (my_perl=0x5560ee4cb260) at pp_ctl.c:2477
#17 0x00005560ecf3f8e6 in Perl_runops_standard (my_perl=0x5560ee4cb260) at run.c:41
#18 0x00005560ecec07d7 in S_run_body (oldscope=<optimized out>, my_perl=<optimized out>) at perl.c:2524
#19 perl_run (my_perl=0x5560ee4cb260) at perl.c:2447
#20 0x00005560ece97ef2 in main (argc=<optimized out>, argv=<optimized out>, env=<optimized out>) at perlmain.c:123

Looks like Perl passed a bad pointer to the XS function for destroying the tinycv::Image object so its destructor passes a bad pointer to the destructor of the cv::Mat object.

#3 Updated by mkittler 3 months ago

There are 3 more stack traces, likely one of them corresponds to the 2nd job. So there are basically two crashes, one involving image_destroy and one involving image_search.

By the way, I haven't seen such stack traces before (e.g. when looking at backend stack traces for different crashes).

#4 Updated by mkittler 3 months ago

It also happened on openqaworker13 (backtrace with image_search) so it is not specific openqaworker10.

#5 Updated by cdywan 3 months ago

  • Subject changed from Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry to Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry size:M
  • Description updated (diff)
  • Status changed from New to Workable

#6 Updated by mkittler 3 months ago

  • Description updated (diff)
  • Assignee set to mkittler

#7 Updated by mkittler 3 months ago

This test should run the relevant functions many times over and over again: https://github.com/Martchus/os-autoinst/pull/new/debug-segfault-2

However, so far I failed to reproduce the problem.


Since compiler warnings are often useful to find bugs I've been creating https://github.com/os-autoinst/os-autoinst/pull/1740. I couldn't find any clear problems, though.

#8 Updated by MDoucha 3 months ago

Here's another similar job failure with a slightly different error in the log:
https://openqa.suse.de/tests/6795622

[2021-08-10T20:26:11.866 CEST] [debug] led state 0 1 1 -261
[2021-08-10T20:26:12.353 CEST] [debug] no change: 1785.1s
*** Error in `/usr/bin/isotovideo: backen': corrupted size vs. prev_size: 0x000055acfb6e0670 ***
Unexpected end of data 0
[2021-08-10T20:26:13.726 CEST] [debug] backend process exited: 0
[2021-08-10T20:26:13.727 CEST] [debug] stopping command server 30489 because test execution ended

#9 Updated by mkittler 2 months ago

Strangely there's no coredump from the day this job ran on openqaworker10 where the job ran according to the database. However, there are several other coredumps which I've attached. Many have actually no tinycv.so-frames within the stack trace; not sure whether they might even be bugs in Perl. One backtrace contains an image_scale-frame which is new (so far I've only seen image_search and image_destroy). The TBB threads all look the same and unproblematic (so the issue that signal handlers are invoked in these threads doesn't seem to be re-occurring).

#10 Updated by mkittler 2 months ago

I extended my test which runs image functionality by a few other functions but still cannot reproduce the issue locally. I've also tested against OpenCV 3. I also still couldn't find anything wrong with our code. It could be improved here and there according to certain guidelines and to avoid certain warnings but there's nothing which should cause UB or memory corruption.

I'm currently running out of ideas.

#11 Updated by mkittler 2 months ago

I've been running https://github.com/Martchus/os-autoinst/pull/new/debug-segfault-2 in a Leap 15.2 container but still couldn't reproduce the issue.

#12 Updated by mkittler 2 months ago

  • Assignee deleted (mkittler)

Not sure what the problem could be and how to reproduce it.

#13 Updated by okurz 2 months ago

  • Subject changed from Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry size:M to Job terminated prematurely during needle check auto_review:"isotovideo: backen.*: free.*: invalid pointer":retry
  • Status changed from Workable to New

#14 Updated by cdywan about 2 months ago

  • Description updated (diff)

#15 Updated by okurz about 2 months ago

  • Priority changed from High to Low
  • Target version changed from Ready to future

During the daily 2021-08-31 we decided that we don't have a good idea how to fix the actual problem. I see the following alternatives:

  1. With the above information provided and assuming that "auto_review" handles automatic retriggering at least on o3+osd move the ticket out of the backlog and wait for further information. Please anyone who sees the issue again or has any additional information comment on the ticket
  2. Ensure that openQA has a proper reason in an incomplete job so that openQA can automatically retrigger, see https://github.com/os-autoinst/openQA/blob/master/etc/openqa/openqa.ini#L76
  3. Ask perl experts if they can read out something better from the stack trace

#16 Updated by MDoucha about 1 month ago

We have one more failed job due to this bug:
https://openqa.suse.de/tests/7095789

[2021-09-14T03:04:46.243 CEST] [debug] no change: 1386.7s
*** Error in `/usr/bin/isotovideo: backen': corrupted size vs. prev_size: 0x000055ddc3398e70 ***
[2021-09-14T03:04:50.611 CEST] [debug] backend process exited: 0
[2021-09-14T03:04:50.612 CEST] [debug] stopping command server 22002 because test execution ended

#17 Updated by MDoucha 19 days ago

We have another one:
https://openqa.suse.de/tests/7345106

[2021-10-08T10:15:12.457542+02:00] [debug] WARNING: check_asserted_screen took 4.09 seconds for 36 candidate needles - make your needles more specific
[2021-10-08T10:15:12.457694+02:00] [debug] no match: -0.1s, best candidate: linux-login-20181005 (0.29)
*** Error in `/usr/bin/isotovideo: backen': realloc(): invalid pointer: 0x000056449d7fdab0 ***
[2021-10-08T10:15:14.621326+02:00] [debug] backend process exited: 0
[2021-10-08T10:15:14.622132+02:00] [debug] stopping command server 2326 because test execution ended

#18 Updated by MDoucha 16 days ago

And one more without any error message in the logs: https://openqa.suse.de/tests/7363250

Also available in: Atom PDF