Project

General

Profile

Actions

action #62417

closed

os-autoinst occasionally crashing on startup

Added by AdamWill about 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
-
Start date:
2020-01-21
Due date:
% Done:

0%

Estimated time:

Description

I recently updated Fedora's os-autoinst packages to commit d693abe082d6559721e5053d8fd6151753c73020, and upgraded our instances. However, I've realized (just after I upgraded prod, sigh) that it seems to be crashing on startup occasionally, which results in the test being incomplete. The crash is a segfault, and I'll attach the backtrace. This happens fairly infrequently - say 1 in 50 tests or so. It seems to be happening across multiple worker hosts, and on multiple arches.


Files

gdb.txt (111 KB) gdb.txt crash backtrace AdamWill, 2020-01-21 09:10
osautoinstlogs (15.5 KB) osautoinstlogs AdamWill, 2020-01-22 13:29

Related issues 1 (0 open1 closed)

Related to openQA Project - action #62984: Fix problem with job-worker assignment resulting in API errorsResolvedmkittler2020-02-03

Actions
Actions #1

Updated by AdamWill about 4 years ago

  • Assignee set to cfconrad

@cfconrad , okurz said to poke you for this.

Actions #2

Updated by okurz about 4 years ago

  • Related to action #62108: auto_review:"Can.t fcntl.*Operation not permitted at .*virtio_terminal.pm" test incompletes added
Actions #3

Updated by AdamWill about 4 years ago

Hum, it could be that, I guess? I don't seem to see that 'can't fcntl' error in our logs anywhere I can find, though.

Maybe I'll try applying the patch and see if it helps.

Actions #4

Updated by AdamWill about 4 years ago

So I backported the patch and deployed the new build on our staging instance. It didn't get any crashes for a couple of days, but then we just got one: https://openqa.stg.fedoraproject.org/tests/707115 . I'm attaching a log extract which I trimmed to contain only the messages related to this specific test (the box runs a lot of other workers too).

Actions #5

Updated by mkittler about 4 years ago

According to the attached log the worker's error handling aborts the job as expected because of API errors occurred:

Jan 22 10:14:15 qa09.qa.fedoraproject.org worker[3518949]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/707115/status): 400 response: Got status update for job 707115 with unexpected worker ID 94 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
Jan 22 10:14:15 qa09.qa.fedoraproject.org worker[3518949]: [error] Aborting job because web UI doesn't accept new images anymore (likely considers this job dead)
Jan 22 10:14:15 qa09.qa.fedoraproject.org worker[3518949]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/707115/status): 400 response: Got status update for job 707115 with unexpected worker ID 94 (expected no updates anymore, job is done with result incomplete) (remaining tries: 2)
Jan 22 10:14:20 qa09.qa.fedoraproject.org worker[3518949]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/707115/status): 400 response: Got status update for job 707115 with unexpected worker ID 94 (expected no updates anymore, job is done with result incomplete) (remaining tries: 1)
Jan 22 10:14:25 qa09.qa.fedoraproject.org worker[3518949]: [warn] Can't open /var/lib/openqa/pool/8/testresults/result-_graphical_wait_login.json for result upload - likely isotovideo could not be started or failed early. Error message: No such file or directory
Jan 22 10:14:25 qa09.qa.fedoraproject.org worker[3518949]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/707115/status): 400 response: Got status update for job 707115 with unexpected worker ID 94 (expected no updates anymore, job is done with result incomplete) (remaining tries: 2)
Jan 22 10:14:25 qa09.qa.fedoraproject.org worker[3518949]: [error] REST-API error (POST http://openqa-stg01.qa.fedoraproject.org/api/v1/jobs/707115/status): 400 response: Got status update for job 707115 with unexpected worker ID 94 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
Jan 22 10:14:25 qa09.qa.fedoraproject.org worker[3518949]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20083/SQItiv7mKn9e8yHj/broadcast
--
Jan 22 10:14:27 qa09.qa.fedoraproject.org audit[3738617]: ANOM_ABEND auid=4294967295 uid=992 gid=988 ses=4294967295 subj=system_u:system_r:unconfined_service_t:s0 pid=3738617 comm="/usr/bin/isotov" exe="/usr/bin/perl" sig=11 res=1
Jan 22 10:14:27 qa09.qa.fedoraproject.org kernel: /usr/bin/isotov[3738705]: segfault at 5a4 ip 00007f53fdab4ee8 sp 00007f53ee4a78f0 error 4 in libperl.so.5.30.1[7f53fda0e000+16e000]
Jan 22 10:14:27 qa09.qa.fedoraproject.org kernel: Code: ef 8b 38 e8 8a 03 f6 ff 8d 55 f9 83 e2 fb 74 05 83 fd 04 75 15 89 ef 48 8b 80 58 08 00 00 31 d2 31 f6 5d ff e0 0f 1f 44 00 00 <f6> 80 a4 05 00 00 01 75 e2 48 8b 90 38 06 00 00 48 85 d2 74 18 83
Jan 22 10:14:27 qa09.qa.fedoraproject.org worker[3518949]: [info] Isotovideo exit status: 0
Jan 22 10:14:27 qa09.qa.fedoraproject.org systemd[1]: Started Process Core Dump (PID 3739153/UID 0).
Jan 22 10:14:27 qa09.qa.fedoraproject.org audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@45-3739153-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 22 10:14:28 qa09.qa.fedoraproject.org worker[3518949]: [info] +++ worker notes +++
Jan 22 10:14:28 qa09.qa.fedoraproject.org worker[3518949]: [info] End time: 2020-01-22 10:14:28
Jan 22 10:14:28 qa09.qa.fedoraproject.org worker[3518949]: [info] Result: api-failure
Jan 22 10:14:28 qa09.qa.fedoraproject.org worker[3518949]: [info] Registering with openQA http://openqa-stg01.qa.fedoraproject.org
Jan 22 10:14:29 qa09.qa.fedoraproject.org systemd-coredump[3739154]: Process 3738617 (/usr/bin/isotov) of user 992 dumped core.

                                                                     Stack trace of thread 3738705:
                                                                     #0  0x00007f53fdab4ee8 Perl_csighandler (libperl.so.5.30)
                                                                     #1  0x00007f53fd8336b0 __restore_rt (libc.so.6)
--
                                                                     #3  0x00007f53fd7de4e2 start_thread (libpthread.so.0)
                                                                     #4  0x00007f53fd8f86d3 __clone (libc.so.6)
Jan 22 10:14:29 qa09.qa.fedoraproject.org systemd[1]: systemd-coredump@45-3739153-0.service: Succeeded.
Jan 22 10:14:29 qa09.qa.fedoraproject.org audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@45-3739153-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 22 10:14:29 qa09.qa.fedoraproject.org systemd[1]: systemd-coredump@45-3739153-0.service: Consumed 1.137s CPU time.

So to me it looks like isotovideo really just received the signal to terminate from the worker and the crash is just another instance of https://progress.opensuse.org/issues/53999.

The API errors itself look a little bit like https://progress.opensuse.org/issues/62015 but it is hard to tell without logs from the web UI side.

I don't see any connection to "Can.t fcntl.*Operation not permitted at .*virtio_terminal.pm" (at least not from reading Adam's log files).

Actions #6

Updated by AdamWill about 4 years ago

Yeah, I wasn't sure if the API messages indicated a fatal issue or just a timing blip or something. I'll try and take a look at server logs from a relevant timeframe tomorrow, and try the fix for https://progress.opensuse.org/issues/62015 too. Thanks!

Actions #7

Updated by AdamWill about 4 years ago

Here's server logs from the same timeframe:

Jan 22 10:13:30 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 and worker 89 but there is not even a worker assigned to this job (job is scheduled)
Jan 22 10:13:31 openqa-stg01.qa.fedoraproject.org openqa-websockets[1875]: [warn] Worker 89 accepted job 707115 which was never assigned to it or has already finished
Jan 22 10:13:34 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id 12d93efa-6d0b-4bfd-bc12-73155275c733)
Jan 22 10:13:34 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO OpenQAResultsDBReporter] reporting results for 707099
Jan 22 10:13:34 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id 12d93efa-6d0b-4bfd-bc12-73155275c733)
Jan 22 10:13:34 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id 12d93efa-6d0b-4bfd-bc12-73155275c733)
Jan 22 10:13:34 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id 12d93efa-6d0b-4bfd-bc12-73155275c733)
Jan 22 10:13:35 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 with unexpected worker ID 89 (expected 94, job is running)
Jan 22 10:13:40 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 with unexpected worker ID 89 (expected 94, job is running)
Jan 22 10:13:40 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id 1e618428-562b-4f09-8d6f-69de46c9094f)
Jan 22 10:13:40 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO OpenQAResultsDBReporter] reporting results for 707070
Jan 22 10:13:40 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id 1e618428-562b-4f09-8d6f-69de46c9094f)
Jan 22 10:13:40 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id 1e618428-562b-4f09-8d6f-69de46c9094f)
Jan 22 10:13:40 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 with unexpected worker ID 89 (expected 94, job is running)
Jan 22 10:13:40 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id 1e618428-562b-4f09-8d6f-69de46c9094f)
Jan 22 10:13:45 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 with unexpected worker ID 89 (expected 94, job is running)
Jan 22 10:13:48 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id a27585fb-bc12-41b3-b275-d4486c2e1cf6)
Jan 22 10:13:48 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO OpenQAResultsDBReporter] reporting results for 707086
Jan 22 10:13:48 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id a27585fb-bc12-41b3-b275-d4486c2e1cf6)
Jan 22 10:13:48 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id a27585fb-bc12-41b3-b275-d4486c2e1cf6)
Jan 22 10:13:49 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id a27585fb-bc12-41b3-b275-d4486c2e1cf6)
Jan 22 10:13:50 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 with unexpected worker ID 89 (expected 94, job is running)
Jan 22 10:13:50 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 with unexpected worker ID 89 (expected 94, job is running)
Jan 22 10:13:52 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Worker 1018395 stopped
Jan 22 10:13:52 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Worker 1191290 started
Jan 22 10:13:55 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id 5fad3d49-666d-4c52-a6e8-a0912f453d88)
Jan 22 10:13:55 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO OpenQAResultsDBReporter] reporting results for 707103
Jan 22 10:13:55 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id 5fad3d49-666d-4c52-a6e8-a0912f453d88)
Jan 22 10:13:55 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id 5fad3d49-666d-4c52-a6e8-a0912f453d88)
Jan 22 10:13:56 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id 5fad3d49-666d-4c52-a6e8-a0912f453d88)
Jan 22 10:13:58 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id c73c6065-cbe9-438e-8606-7b1e85582695)
Jan 22 10:13:58 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO OpenQAResultsDBReporter] reporting results for 707080
Jan 22 10:13:58 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id c73c6065-cbe9-438e-8606-7b1e85582695)
Jan 22 10:13:58 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id c73c6065-cbe9-438e-8606-7b1e85582695)
Jan 22 10:13:59 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id c73c6065-cbe9-438e-8606-7b1e85582695)
Jan 22 10:13:59 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 with unexpected worker ID 89 (expected 94, job is running)
Jan 22 10:13:59 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id 76d1b0d5-6565-4dfc-80b9-9749d9ebb653)
Jan 22 10:13:59 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO OpenQAResultsDBReporter] reporting results for 707115
Jan 22 10:13:59 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id 76d1b0d5-6565-4dfc-80b9-9749d9ebb653)
Jan 22 10:13:59 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id 76d1b0d5-6565-4dfc-80b9-9749d9ebb653)
Jan 22 10:14:00 openqa-stg01.qa.fedoraproject.org openqa-websockets[1875]: [info] Worker 89 websocket connection closed - 1005
Jan 22 10:14:00 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id 76d1b0d5-6565-4dfc-80b9-9749d9ebb653)
Jan 22 10:14:00 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id 68fdcd68-072f-45c5-a264-ca759ae11c4c)
Jan 22 10:14:00 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id 68fdcd68-072f-45c5-a264-ca759ae11c4c)
Jan 22 10:14:00 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id 68fdcd68-072f-45c5-a264-ca759ae11c4c)
Jan 22 10:14:00 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO OpenQAResultsDBReporter] reporting results for 707115
Jan 22 10:14:00 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id 68fdcd68-072f-45c5-a264-ca759ae11c4c)
Jan 22 10:14:04 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 with unexpected worker ID 89 (expected no updates anymore, job is done with result incomplete)
Jan 22 10:14:05 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 with unexpected worker ID 94 (expected no updates anymore, job is done with result incomplete)
Jan 22 10:14:10 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 with unexpected worker ID 94 (expected no updates anymore, job is done with result incomplete)
Jan 22 10:14:14 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Worker 1021193 stopped
Jan 22 10:14:14 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Worker 1191292 started
Jan 22 10:14:15 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 with unexpected worker ID 94 (expected no updates anymore, job is done with result incomplete)
Jan 22 10:14:15 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 with unexpected worker ID 94 (expected no updates anymore, job is done with result incomplete)
Jan 22 10:14:20 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 with unexpected worker ID 94 (expected no updates anymore, job is done with result incomplete)
Jan 22 10:14:25 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 with unexpected worker ID 94 (expected no updates anymore, job is done with result incomplete)
Jan 22 10:14:25 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 with unexpected worker ID 94 (expected no updates anymore, job is done with result incomplete)
Jan 22 10:14:27 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id 249c3e86-0f21-4c3d-8389-0b0e8c0f2950)
Jan 22 10:14:27 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id 249c3e86-0f21-4c3d-8389-0b0e8c0f2950)
Jan 22 10:14:27 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id 249c3e86-0f21-4c3d-8389-0b0e8c0f2950)
Jan 22 10:14:27 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO OpenQAResultsDBReporter] reporting results for 707062
Jan 22 10:14:27 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id 249c3e86-0f21-4c3d-8389-0b0e8c0f2950)
Jan 22 10:14:29 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id 4dcf065f-1313-47a8-8ab3-02ca4ccc1ef6)
Jan 22 10:14:29 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id 4dcf065f-1313-47a8-8ab3-02ca4ccc1ef6)
Jan 22 10:14:29 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id 4dcf065f-1313-47a8-8ab3-02ca4ccc1ef6)
Jan 22 10:14:29 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO OpenQAResultsDBReporter] reporting results for 707115
Jan 22 10:14:29 openqa-stg01.qa.fedoraproject.org openqa-websockets[1875]: [info] Worker 94 websocket connection closed - 1005
Jan 22 10:14:29 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id a801e0f2-6536-4b9a-8b4c-a4e60f52739d)
Jan 22 10:14:29 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id a801e0f2-6536-4b9a-8b4c-a4e60f52739d)
Jan 22 10:14:29 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id 4dcf065f-1313-47a8-8ab3-02ca4ccc1ef6)
Jan 22 10:14:29 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id a801e0f2-6536-4b9a-8b4c-a4e60f52739d)
Jan 22 10:14:29 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO OpenQAResultsDBReporter] reporting results for 707115
Jan 22 10:14:30 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id a801e0f2-6536-4b9a-8b4c-a4e60f52739d)
Jan 22 10:14:30 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 with unexpected worker ID 94 (expected no updates anymore, job is done with result incomplete)
Jan 22 10:14:35 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id c6e93b30-8510-40cf-9905-4fe76e61ea52)
Jan 22 10:14:35 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO OpenQAResultsDBReporter] reporting results for 707098
Jan 22 10:14:35 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id c6e93b30-8510-40cf-9905-4fe76e61ea52)
Jan 22 10:14:35 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id c6e93b30-8510-40cf-9905-4fe76e61ea52)
Jan 22 10:14:35 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id c6e93b30-8510-40cf-9905-4fe76e61ea52)
Jan 22 10:14:35 openqa-stg01.qa.fedoraproject.org openqa[1876]: [info] Got status update for job 707115 with unexpected worker ID 94 (expected no updates anymore, job is done with result incomplete)
Jan 22 10:14:44 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id d7d5c521-7d62-4779-a813-66a6fdf0c763)
Jan 22 10:14:44 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO OpenQAResultsDBReporter] reporting results for 707145
Jan 22 10:14:44 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Consuming message from topic org.fedoraproject.stg.openqa.job.done (message id d7d5c521-7d62-4779-a813-66a6fdf0c763)
Jan 22 10:14:44 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056657]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id d7d5c521-7d62-4779-a813-66a6fdf0c763)
Jan 22 10:14:44 openqa-stg01.qa.fedoraproject.org fedora-messaging[1056663]: [INFO fedora_messaging.twisted.protocol] Successfully consumed message from topic org.fedoraproject.stg.openqa.job.done (message id d7d5c521-7d62-4779-a813-66a6fdf0c763)
Actions #8

Updated by AdamWill about 4 years ago

So, I think that does look like https://progress.opensuse.org/issues/62015 , and I'll try the PR.

Actions #9

Updated by cfconrad about 4 years ago

Hi AdamWill,
I was on holidays... sorry for late response. I also don't see any relation to that pipe handling.

What is the outcome of trying https://github.com/os-autoinst/openQA/pull/2667 does it solve your problem?

Actions #10

Updated by AdamWill about 4 years ago

So far I don't think so :( I'm moving house right now so everything's kinda nuts, but I think I've seen a couple of early-incompletions still on staging. I'll take a look at logs and stuff for them when I can.

Actions #11

Updated by cfconrad about 4 years ago

Hm that's bad.

I was tried to get some logs of incompletes by searching for them with https://openqa.stg.fedoraproject.org/tests?&resultfilter=Incomplete
but these tests don't have logs. Do I miss something (I'm not logged in).

Actions #12

Updated by AdamWill about 4 years ago

No, it seems like nothing reaches the test logs when this happens (or rather, I think the server doesn't accept their upload from the worker, so they don't reach the server and hence the web UI). The log extracts I've posted have been from the server and worker system journals, which you can't get at I'm afraid :)

Actions #13

Updated by mkittler about 4 years ago

In this case 2 workers grabbed the same job but the web UI doesn't think any of these workers is supposed to work on it. It would be interesting what happened shortly before the first line in the log you've posted. Since it apparently still happens despite https://github.com/os-autoinst/openQA/pull/2667 it would be good to know what lead to the scheduling problem. The PR might have only fixed one problem but it is likely that there are more. When working on https://progress.opensuse.org/issues/62015 I only focused on one particular problematic job.

Actions #14

Updated by mkittler about 4 years ago

  • Assignee changed from cfconrad to mkittler

@cfconrad asked me to take over; this is related to the scheduling issues we see on OSD as well

Actions #15

Updated by mkittler about 4 years ago

  • Related to action #62984: Fix problem with job-worker assignment resulting in API errors added
Actions #16

Updated by mkittler about 4 years ago

  • Related to deleted (action #62108: auto_review:"Can.t fcntl.*Operation not permitted at .*virtio_terminal.pm" test incompletes)
Actions #17

Updated by livdywan about 4 years ago

  • Target version set to Current Sprint
Actions #18

Updated by mkittler about 4 years ago

  • Status changed from New to Resolved
  • Target version deleted (Current Sprint)

The scheduling problems are taken care of in https://progress.opensuse.org/issues/62984 and the crashes when the worker aborts isotovideo in https://progress.opensuse.org/issues/53999. So I'm closing this issue here as it is only mixing up these other issues.

Actions

Also available in: Atom PDF