Project

General

Profile

action #59926

coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues

coordination #62420: [epic] Distinguish all types of incompletes

test incompletes in middle of execution with auto_review:"Unexpected end of data 0":retry, system journal shows 'kernel: traps: /usr/bin/isotov[2300] general protection ip:7fd5ef11771e sp:7ffe066f2200 error:0 in libc-2.26.so[7fd5ef094000+1b1000]'

Added by okurz over 1 year ago. Updated 24 days ago.

Status:
New
Priority:
Low
Assignee:
-
Category:
Concrete Bugs
Target version:
Start date:
2019-11-17
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

https://openqa.opensuse.org/tests/1087202/file/autoinst-log.txt is the log of an incomplete job. Log excerpt:

[2019-11-17T04:04:08.263 CET] [debug] no match: 71.8s, best candidate: generic-desktop-kde-after-upgrade-20180127 (0.29)
[2019-11-17T04:04:08.271 CET] [debug] no change: 69.5s
[2019-11-17T04:04:09.015 CET] [debug] considering VNC stalled, no update for 4.05 seconds
[2019-11-17T04:04:09.265 CET] [debug] no change: 68.5s
[2019-11-17T04:04:09.265 CET] [debug] pointer type 1 0 1024 768 -257
[2019-11-17T04:04:09.266 CET] [debug] led state 0 1 1 -261
[2019-11-17T04:04:09.877 CET] [debug] backend process exited: 0
[2019-11-17T04:04:09.878 CET] [debug] sysread failed: 
[2019-11-17T04:04:09.879 CET] [debug] THERE IS NOTHING TO READ 15 4 3
[2019-11-17T04:04:09.879 CET] [debug] killing command server 2239 because test execution ended
[2019-11-17T04:04:09.879 CET] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20163/mCvQTa5MDaid74RX/broadcast
[2019-11-17T04:04:10.892 CET] [debug] commands process exited: 0
Unexpected end of data 0
[2019-11-17T04:04:11.680 CET] [debug] Driver backend collected unknown process with pid 2778 and exit status: 1

so the test fails just out of the blue.

Further details

Prioritized as "Low" as I assume we will just live with an issue like this one if it is that seldom.


Related issues

Related to openQA Project - action #60443: job incomplete with "(?s)process exited: 0.*isotovideo failed.*EXIT 1":retry but no further details what is wrongResolved2019-11-29

Related to openQA Project - action #53999: openqa-worker (isotovideo) dumps core / segfaults quite often on several workers and distributions for cancelled jobsWorkable2019-07-09

History

#1 Updated by okurz over 1 year ago

  • Related to action #60443: job incomplete with "(?s)process exited: 0.*isotovideo failed.*EXIT 1":retry but no further details what is wrong added

#2 Updated by okurz over 1 year ago

  • Subject changed from [sporadic] test incompletes just in the middle of execution with "Unexpected end of data 0", system journal shows "kernel: traps: /usr/bin/isotov[2300] general protection ip:7fd5ef11771e sp:7ffe066f2200 error:0 in libc-2.26.so[7fd5ef094000+1b1000]" to test incompletes in middle of execution with auto_review:"Unexpected end of data 0", system journal shows "kernel: traps: /usr/bin/isotov[2300] general protection ip:7fd5ef11771e sp:7ffe066f2200 error:0 in libc-2.26.so[7fd5ef094000+1b1000]"

#3 Updated by okurz over 1 year ago

#4 Updated by okurz about 1 year ago

  • Related to action #53999: openqa-worker (isotovideo) dumps core / segfaults quite often on several workers and distributions for cancelled jobs added

#5 Updated by okurz 7 months ago

  • Parent task set to #62420

#6 Updated by okurz 2 months ago

  • Subject changed from test incompletes in middle of execution with auto_review:"Unexpected end of data 0", system journal shows "kernel: traps: /usr/bin/isotov[2300] general protection ip:7fd5ef11771e sp:7ffe066f2200 error:0 in libc-2.26.so[7fd5ef094000+1b1000]" to test incompletes in middle of execution with auto_review:"Unexpected end of data 0", system journal shows 'kernel: traps: /usr/bin/isotov[2300] general protection ip:7fd5ef11771e sp:7ffe066f2200 error:0 in libc-2.26.so[7fd5ef094000+1b1000]'

#7 Updated by okurz 2 months ago

Replaced the double ticks in the additional string to not confuse openqa-label-known-issue

#8 Updated by okurz 2 months ago

A recent occurence

[09/04/2021 15:41:35] <sysrich> https://openqa.opensuse.org/tests/1694730 anyone got any idea how I broke these tests with this commit? https://github.com/os-autoinst/os-autoinst-distri-opensuse/commit/f175bada79d2577d7cffca3d1b70a8c90e64c415 I'm at a clue as to how I broke it..but I am sure it's my fault :)
[09/04/2021 15:47:37] <guillaume_g> sysrich: is it reproducible?
[09/04/2021 15:48:32] <sysrich> guillaume_g, yes
[09/04/2021 15:48:40] <sysrich> see the other gnome create hdd test in G :)
[09/04/2021 15:48:46] <fvogt> All abort the same way?
[09/04/2021 15:48:50] <sysrich> yes
[09/04/2021 15:48:54] <sysrich> same way, same place,
[09/04/2021 15:50:56] <fvogt> The KDE one as well? That runs the same code, just not the send_key("esc")
[09/04/2021 15:51:37] <guillaume_g> sysrich: https://openqa.opensuse.org/tests/1694372 and https://openqa.opensuse.org/tests/1694730 are not failing on the same test, though.
[09/04/2021 15:51:54] <guillaume_g> first_boot vs opensuse_welcom
[09/04/2021 15:52:11] <fvogt> Your PR looks fine to me, only that I would've suggested to handle it differently
[09/04/2021 15:52:16] <sysrich> hmmmmm
[09/04/2021 15:52:22] <fvogt> This seems to be something GNOME login specific, not opensuse-welcome specific
[09/04/2021 15:52:24] <sysrich> ok then maybe I'm just blaming mself incorrectly
[09/04/2021 15:52:29] <sysrich> fvogt, how would you have handled it?
[09/04/2021 15:52:35] <fvogt> So it'll most likely break when it does a login after opensuse-welcome
[09/04/2021 15:53:04] <fvogt> Somewhere in ensure_unlocked_desktop
[09/04/2021 15:53:09] <sysrich> there's a KDE one failing in G?
[09/04/2021 15:53:29] <fvogt> Doesn't look like it
[09/04/2021 15:57:46] <fvogt> I suspect that qemu stopped
[09/04/2021 16:00:53] <fvogt> Almost:
[09/04/2021 16:00:55] <fvogt> [Fr Apr  9 13:52:03 2021] /usr/bin/isotov[19731]: segfault at 514 ip 0000559ddaa699e0 sp 00007f8fe7eb78f0 error 4 in perl[559dda9a9000+1fa000
[09/04/2021 16:01:24] <okurz> you found that in worker log? system log?
[09/04/2021 16:01:30] <fvogt> dmesg
[09/04/2021 16:02:01] <fvogt> Not entirely sure whether the timezone matches, but minute + second at least do
[09/04/2021 16:02:48] <okurz> k. It's similar to https://progress.opensuse.org/issues/80226 but different. we had isotovideo segfaulting in other cases since years. I suggest you create a specific ticket for that as the difference is that here the log still continues but the wait period for screen match just stopped in the middle
[09/04/2021 16:02:49] <|Anna|> job incomplete with autoinst-log.txt ending just in the middle in openQA Project (action for unassigned) [Workable] Created on: 2020-11-23 | 0% done.
[09/04/2021 16:03:24] <fvogt> If this is reproducible now that would be awesome
[09/04/2021 16:03:45] <okurz> exactly, we had been looking for such case for years :D
[09/04/2021 16:09:11] <fvogt> I thought that maybe the new needle somehow triggers this, but no:
[09/04/2021 16:09:27] <fvogt> https://openqa.opensuse.org/tests/1694479 is also on ow4:11 and runs the same code + needle just fine, but not Staging:G
[09/04/2021 16:17:47] <okurz> you might have found a qemu bug :)
[09/04/2021 16:21:16] <fvogt> I attached to the running isotovideo processes, but none of those crashed
[09/04/2021 16:21:57] <fvogt> So probably something recently spawned?
[09/04/2021 16:29:16] <fvogt> I started a local clone with MAKETESTSNAPSHOTS=1, maybe that'll help debugging
[09/04/2021 16:34:33] <sysrich> okurz, https://openqa.opensuse.org/tests/1694818 https://openqa.opensuse.org/tests/1694820
[09/04/2021 16:34:42] <sysrich> fvogt, okurz guillaume_g the probelms sure SEEM reproductable to me....
[09/04/2021 16:35:16] <okurz> you lucky guy :)
[09/04/2021 16:35:51] <sysrich> har...I  sure dont 'feel' lucky
[09/04/2021 16:35:52] <fvogt> Maybe send_key somehow triggers a crash, but that didn't even reach the log at that point
[09/04/2021 16:36:03] <fvogt> So I also guess that it's qemu
[09/04/2021 16:36:14] <okurz> the first job you referenced today was really cut off in the middle but the latter ones have "Unexpected end of data 0" in the logs, that's something that we could look for with auto-review
[09/04/2021 16:36:45] <fvogt> Which process pipes data to videoencoder? That's most likely the crashing one
[09/04/2021 16:39:15] <okurz> there was no qemu update since 2021-04-03
[09/04/2021 16:39:34] <fvogt> Apparently it's "isotovideo: backen"
[09/04/2021 16:40:11] <fvogt> My local qemu and os-autoinst here are ancient, maybe they're less "lucky" and won't crash
[09/04/2021 16:40:20] <okurz> yes, the "backend". I guess you would have guessed the complete name ;)
[09/04/2021 16:41:06] <fvogt> https://github.com/os-autoinst/os-autoinst/blob/a1b2e4bab594e1c37abab8045a8c3354ca1d03ce/videoencoder.cpp#L352 is the error
[09/04/2021 16:41:36] <fvogt> So isotovideo writes "E 01234" and then vanishes
[…]
[09/04/2021 16:57:13] <fvogt> "Tesseract Open Source OCR Engine v3.05.01 with Leptonica"
[09/04/2021 16:57:17] <fvogt> It's an OCR needle!
[09/04/2021 16:57:23] <fvogt> That would explain why it causes issues
[09/04/2021 16:57:41] <fvogt> So let's ,,(blame sysrich) after all :P
[09/04/2021 16:57:41] <|Anna|> When in doubt, blame sysrich.
[09/04/2021 16:58:56] <fvogt> It complains about "Empty page!!" and then crashes

#9 Updated by okurz 2 months ago

could reproduce the problem locally with the command

isotovideo -d casedir=~/local/os-autoinst/opensuse productdir=products/opensuse iso=~/local/tmp/openSUSE-Staging\ G-Tumbleweed-DVD-x86_64-Build476.1-Media.iso

and the os-autoinst+test+needles versions:

[2021-04-09T16:58:31.277 CEST] [debug] Current version is 9fd3ebac16015820f250696f402887d1a88593b3 [interface v23]
…
[2021-04-09T16:58:31.283 CEST] [debug] git hash in /home/okurz/local/os-autoinst/opensuse: b9a5a73988812119506a98fb490ee38fef9875c5
…
[2021-04-09T16:58:31.490 CEST] [debug] git hash in /home/okurz/local/os-autoinst/opensuse/products/opensuse/needles: c897156b36dbfdff77b7d95313398d169d8d2d4c

but takes quite some minutes. Could not yet reproduce with t/99-full-stack.t with replacing a "match" with "ocr" in a test needle but I guess this would be still a way to go.

#10 Updated by okurz 2 months ago

#11 Updated by openqa_review 24 days ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam-incidentinstall-ha
https://openqa.suse.de/tests/6083373

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed

#12 Updated by okurz 24 days ago

  • Subject changed from test incompletes in middle of execution with auto_review:"Unexpected end of data 0", system journal shows 'kernel: traps: /usr/bin/isotov[2300] general protection ip:7fd5ef11771e sp:7ffe066f2200 error:0 in libc-2.26.so[7fd5ef094000+1b1000]' to test incompletes in middle of execution with auto_review:"Unexpected end of data 0":retry, system journal shows 'kernel: traps: /usr/bin/isotov[2300] general protection ip:7fd5ef11771e sp:7ffe066f2200 error:0 in libc-2.26.so[7fd5ef094000+1b1000]'

Also available in: Atom PDF