action #53999
Updated by okurz about 5 years ago
## Observation

From openqaworker13 (running Leap 15.0):

```
PID: 18220 (/usr/bin/isotov)
UID: 483 (_openqa-worker)
GID: 65534 (nogroup)
Signal: 11 (SEGV)
Timestamp: Mon 2019-07-08 16:10:19 CEST (18h ago)
Command Line: /usr/bin/isotovideo: backen
Executable: /usr/bin/perl
Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@4.service
Unit: openqa-worker@4.service
Slice: openqa-worker.slice
Boot ID: 1234ca1e5b18422d89f258275208b14f
Machine ID: 625985c3f939414a1676d1d05a732110
Hostname: openqaworker13
Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.483.1234ca1e5b18422d89f258275208b14f.18220.1562595019000000.lz4
Message: Process 18220 (/usr/bin/isotov) of user 483 dumped core.

PID: 14766 (/usr/bin/isotov)
UID: 483 (_openqa-worker)
GID: 65534 (nogroup)
Signal: 6 (ABRT)
Timestamp: Mon 2019-07-08 17:33:40 CEST (17h ago)
Command Line: /usr/bin/isotovideo: backen
Executable: /usr/bin/perl
Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@9.service
Unit: openqa-worker@9.service
Slice: openqa-worker.slice
Boot ID: 1234ca1e5b18422d89f258275208b14f
Machine ID: 625985c3f939414a1676d1d05a732110
Hostname: openqaworker13
Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.483.1234ca1e5b18422d89f258275208b14f.14766.1562600020000000.lz4
Message: Process 14766 (/usr/bin/isotov) of user 483 dumped core.

PID: 14989 (/usr/bin/isotov)
UID: 483 (_openqa-worker)
GID: 65534 (nogroup)
Signal: 6 (ABRT)
Timestamp: Mon 2019-07-08 17:34:05 CEST (17h ago)
Command Line: /usr/bin/isotovideo: backen
Executable: /usr/bin/perl
Control Group: /openqa.slice/openqa-worker.slice/openqa-worker@6.service
Unit: openqa-worker@6.service
Slice: openqa-worker.slice
Boot ID: 1234ca1e5b18422d89f258275208b14f
Machine ID: 625985c3f939414a1676d1d05a732110
Hostname: openqaworker13
Storage: /var/lib/systemd/coredump/core.\x2fusr\x2fbin\x2fisotov.483.1234ca1e5b18422d89f258275208b14f.14989.1562600045000000.lz4
Message: Process 14989 (/usr/bin/isotov) of user 483 dumped core.
```

Across all our workers we have 2070 coredumps of the same process, so I fear we are missing an important bug here. As you can see, the signals are mixed, but most of them are segfaults (2077 segfaults, 14 aborts across all our workers).

You can find one of these coredumps here: http://files.glados.qa.suse.de/example_coredump.tar.xz (unfortunately it is too big to upload to progress directly), or get one yourself with `coredumpctl dump $DESIRED_PID -o outfile.dump`.

## Steps to reproduce

On OSD: `sudo salt -l error -C 'G@roles:worker' cmd.run 'coredumpctl list'`. The same also happens on o3.

What seems to work to reproduce this locally:

* create a simple vars.json file, e.g. get one from openqa.opensuse.org
* start a test with `isotovideo -d` and let it run at least as far as calling the first functions from testapi, e.g. "assert_screen"
* kill the "autotest" process, e.g. with `pkill -f autotest`
* check whether a core dump was recorded, e.g. with `coredumpctl --since=today`
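
To see which signals the recorded dumps died with, after a local reproduction or when comparing segfaults against aborts across workers, a small helper can tally the `Signal:` lines of `coredumpctl info` output. This is only a sketch: the function name `count_signals` is made up for illustration, and it assumes the `Signal: <number> (<name>)` line format shown in the observation above.

```shell
#!/bin/sh
# Hypothetical helper (not part of openQA or systemd): tally core dumps per
# signal from `coredumpctl info` output read on stdin.
# Assumes lines of the form "Signal: <number> (<name>)" as shown in the ticket.
count_signals() {
    awk '
        # awk ignores leading whitespace, so indented "Signal:" lines match too.
        $1 == "Signal:" { count[$2 " " $3]++ }
        # Print "<count> <number> (<name>)" for every signal seen.
        END { for (sig in count) print count[sig], sig }
    ' | sort -k1,1nr -k2,2n   # most frequent signal first
}
```

Possible usage on a single worker: `coredumpctl info /usr/bin/perl | count_signals`. Matching on `/usr/bin/perl` follows the `Executable:` field above and would also count crashes of other Perl programs, so the result is an upper bound for isotovideo. The same pipe should work behind the `salt ... cmd.run` call, since only the first field of each line is inspected.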