action #10418

closed

worker: do not warn on expected problems

Added by okurz about 8 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2016-01-25
Due date:
% Done:

0%

Estimated time:

Description

observation

worker log output

worker[5235]: waitpid returned error: No child processes
worker[5235]: can't open /var/lib/openqa/pool/1/testresults/status.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 525.
worker[5235]: can't open /var/lib/openqa/pool/1/testresults/test_order.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 525.

suggestion

Improve the confusing error messages. The worker has already died before this point in this case, with the actual error message in autoinst-log.txt:

$ tail -n 30 /var/lib/openqa/testresults/$(ls -t /var/lib/openqa/testresults/ | head -1)/autoinst-log.txt
+++ worker notes +++
start time: 2016-01-25 16:56:35
running on okurz-test-for-openqa.openstack.local.p2.cloud.suse.de:1 (Linux 4.1.12-1-default #1 SMP PREEMPT Thu Oct 29 06:43:42 UTC 2015 (e24bad1) x86_64)
Can't locate /var/lib/openqa/share/tests/sle/lib/susedistribution.pm in @INC
[…]
result: died
uploading vars.json
uploading autoinst-log.txt
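The suggestion above amounts to: detect that isotovideo died early before trying to read its result files, and report that one root cause instead of a raw "can't open" warning per missing file. A minimal illustrative sketch (in Python rather than openQA's actual Perl; function and parameter names are hypothetical, not openQA's API):

```python
import os

def collect_results(pool_dir, isotovideo_exited_early):
    """Hypothetical sketch: read result files only when they can exist.

    If isotovideo died before producing any output, return one clear
    message pointing at autoinst-log.txt instead of emitting a
    confusing "can't open" warning for each missing JSON file.
    """
    if isotovideo_exited_early:
        return ["isotovideo failed early; see autoinst-log.txt for details"]
    messages = []
    for name in ("status.json", "test_order.json"):
        path = os.path.join(pool_dir, "testresults", name)
        if not os.path.exists(path):
            # file genuinely expected here, so a warning is warranted
            messages.append(f"expected result file missing: {path}")
    return messages
```

The point of the sketch is only the ordering: check the early-failure condition first, so the per-file warnings are emitted only when they are genuinely unexpected.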

Related issues: 1 (0 open, 1 closed)

Related to openQA Project - coordination #14972: [tools][epic] Improvements on backend to improve better handling of stalls (Resolved, okurz, 2016-11-24)

Actions #1

Updated by okurz about 7 years ago

  • Related to coordination #14972: [tools][epic] Improvements on backend to improve better handling of stalls added
Actions #2

Updated by szarate about 7 years ago

@okurz, this issue is already 1 year old. Can you explain a bit what a "confusing message" is?

Actions #3

Updated by okurz about 7 years ago

"confusing message"?:

There are messages in the worker log which do not point to the actual error, plus messages which seem unrelated, e.g.:

worker[5235]: can't open /var/lib/openqa/pool/1/testresults/status.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 525.
Actions #4

Updated by coolo over 6 years ago

  • Subject changed from improve worker log output in case of early failure to worker: do not warn on expected problems
  • Category changed from Feature requests to Regressions/Crashes
  • Target version set to Ready

The missing JSON files causing warnings/errors have made more than a couple of test developers nervous about whether their instance is fine. So I take that as a bug.

Actions #5

Updated by mkittler about 5 years ago

  • Status changed from New to Feedback

After almost 2 years I can no longer find the line which reads status.json. Instead it now only seems to query the command server. Maybe the error handling could be improved here too, but that's a different thing. test_order.json is now also only read conditionally, if the job is running according to the status. Hence I suppose that warning shouldn't occur anymore either (unless that file is really missing while the command server is still up and running).

So I'm not sure whether this ticket is still valid. Do we still see (similar) problems in production?
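The conditional read described above can be sketched as follows: ask the command server for the job status first, attempt to read test_order.json only while the job reports running, and treat a missing file as an error only in that case. This is an illustrative Python sketch of the pattern, not openQA's actual Perl implementation; all names are hypothetical:

```python
import json
import os

def read_test_order(pool_dir, query_status):
    """Hypothetical sketch of the conditional read: test_order.json is
    consulted only while the job reports running, so no spurious
    warning appears for jobs that never got that far."""
    status = query_status()  # e.g. the answer from the command server
    if status != "running":
        return None  # job not running: no read attempt, no warning
    path = os.path.join(pool_dir, "testresults", "test_order.json")
    if not os.path.exists(path):
        # only now is a missing file genuinely unexpected
        raise RuntimeError(f"test_order.json missing while job is running: {path}")
    with open(path) as fh:
        return json.load(fh)
```

With this ordering, a job that died before producing any results simply yields `None` rather than a "No such file or directory" warning in the worker log.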

Actions #6

Updated by okurz over 4 years ago

  • Status changed from Feedback to Workable

Let me give you examples from recent jobs on production:

  • https://openqa.suse.de/tests/3213464/file/autoinst-log.txt is a job that failed because a needle did not match. After the corresponding message there is a lot of noise, e.g. "[2019-08-06T11:09:14.610 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Request timeout at /usr/bin/isotovideo line 172." and also "[2019-08-06T11:09:15.611 CEST] [error] can_read received kill signal at /usr/lib/os-autoinst/myjsonrpc.pm line 91."
  • https://openqa.suse.de/tests/3213450/file/autoinst-log.txt is a passed job, yet its log contains a lot of backend-specific errors, e.g. "xterm: fatal IO error 11 (Resource temporarily unavailable) or KillClient on X server ":55829""

The originally reported behaviour seems to have changed since the time of reporting. We now have messages in the worker log like "Aug 2 13:04:53 openqa-worker@10: [warn] [pid:5405] Can't open /var/lib/openqa/pool/10/testresults/result-kdump_and_crash.json for result upload - likely isotovideo could not be started or failed early. Error message: No such file or directory", which isn't the best but probably ok-ish for the time being.

Actions #7

Updated by kraih almost 4 years ago

  • Assignee set to kraih

I'll take a look at the cases mentioned.

Actions #8

Updated by kraih almost 4 years ago

  • Target version changed from Ready to Current Sprint
Actions #9

Updated by okurz almost 4 years ago

Actions #10

Updated by livdywan almost 4 years ago

Can we clarify the goal of this ticket? The PR has been merged.

Actions #11

Updated by okurz almost 4 years ago

  • Status changed from Workable to Resolved
  • Assignee changed from kraih to mkittler

I checked current failures looking for any potentially confusing error messages and have not found instances of the aforementioned cases.

E.g. https://openqa.opensuse.org/tests/1219947 is a simple case of "needle failed to match". In https://openqa.opensuse.org/tests/1219947/file/autoinst-log.txt one sees

[2020-03-31T16:00:04.130 CEST] [debug] no candidate needle with tag(s) 'package-conflict-choice' matched

and then a rather clean shutdown of the stack ending with

[2020-03-31T16:18:22.806 CEST] [debug] BACKEND SHUTDOWN 
[2020-03-31T16:18:22.807 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2020-03-31T16:18:24.858 CEST] [debug] flushing frames
[2020-03-31T16:18:24.906 CEST] [debug] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 7547 (/usr/bin/isotovideo: backen)
[2020-03-31T16:18:24.907 CEST] [debug] sending magic and exit
[2020-03-31T16:18:24.908 CEST] [debug] received magic close
[2020-03-31T16:18:24.926 CEST] [debug] backend process exited: 0
[2020-03-31T16:18:25.927 CEST] [debug] stopping backend process 7547
[2020-03-31T16:18:25.927 CEST] [debug] done with backend process
7510: EXIT 0
[2020-03-31T16:18:25.0998 CEST] [info] Isotovideo exit status: 0
[2020-03-31T16:19:38.0007 CEST] [info] +++ worker notes +++
[2020-03-31T16:19:38.0007 CEST] [info] End time: 2020-03-31 14:19:38
[2020-03-31T16:19:38.0007 CEST] [info] Result: done

also the worker log is clean:

[2020-03-31T15:52:22.0723 CEST] [info] 7510: WORKING 1219947
[2020-03-31T15:52:22.0725 CEST] [info] isotovideo has been started (PID: 7510)
[2020-03-31T16:19:38.0029 CEST] [info] Uploading start_install-btrfs-filesystem-df-mnt.txt
…
[2020-03-31T16:19:40.0642 CEST] [info] Uploading worker-log.txt

As I did not see any changes mentioned by kraih and am not aware of any related changes by him, while the most relevant and recent change is by mkittler, I set the assignee to him and set the ticket to resolved accordingly.
