action #34042
closed[tools] Worker goes to infinite loop during upload of screenshots in case of writing failure
Description
This happened on loewe, the disk wasn't full during the upload of the screenshots:
Mar 29 13:51:08 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:51:08 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:58 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:58 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:40 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:40 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:28 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:27 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't call method "write" on an undefined value at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/UserAgent.pm line 362.
Mar 29 13:50:24 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:05 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:05 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:53 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:51 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:43 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:43 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:30 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:30 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:18 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:10 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:55 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:55 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:47 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't call method "write" on an undefined value at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/UserAgent.pm line 362.
Mar 29 13:48:45 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:21 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:21 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:17 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:17 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:12 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:12 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:46 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:46 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:39 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:39 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:11 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:11 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:07 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't call method "write" on an undefined value at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/UserAgent.pm line 362.
Mar 29 13:47:06 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:06 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:52 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:52 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:42 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:36 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:36 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:33 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:02 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:01 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:01 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:00 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: No space left on device at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: No space left on device at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: No space left on device at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: No space left on device at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: No space left on device at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: [info] Isotovideo exit status: 28
Aside from the disk full issue, imo the worker should not loop forever (maybe set a retrial limit?) in case of persistent error, as it prevented the job to be cleaned up, and left an inconsistent state in the worker.
Also possibly: implement catching errors in all our log_* functions, as they are relied in all other parts in the code and would prevent cleanup phases from being executed.
Updated by EDiGiacinto over 6 years ago
- Subject changed from [tools] Worker goes to infinite loop during upload phase in case of writing failure to [tools] Worker goes to infinite loop during upload of screenshots in case of writing failure
Updated by szarate over 6 years ago
- Related to action #34267: osd instance unresponsive (HTTP 502) added
Updated by szarate over 6 years ago
EDiGiacinto wrote:
Aside from the disk full issue, imo the worker should not loop forever (maybe set a retrial limit?) in case of persistent error, as it prevented the job to be cleaned up, and left an inconsistent state in the worker.
Also possibly: implement catching errors in all our log_* functions, as they are relied in all other parts in the code and would prevent cleanup phases from being executed.
+1
Updated by EDiGiacinto over 6 years ago
- Related to action #40220: Scheduler died due to Mojo::Log failing to write to log added
Updated by EDiGiacinto over 6 years ago
- Related to action #40862: Out of disk space killed the webui (on osd) added
Updated by mkittler about 6 years ago
- Status changed from New to In Progress
- Assignee set to mkittler
I suppose the worker can be restarted by systemd if it crashes. So I would just make it stop if an unhanded exception occurs (like most scripts/applications behave).
PR: https://github.com/os-autoinst/openQA/pull/1809
I tested this by adding a die on certain locations.
Updated by okurz about 6 years ago
It's not that "systemd crashes" but I guess systemd can be configured to restart crashing services :)
Updated by EDiGiacinto about 6 years ago
okurz wrote:
It's not that "systemd crashes" but I guess systemd can be configured to restart crashing services :)
In this case the worker went to infinite loop, and the process was still alive so systemd won't even catch that as it sees it as running
Updated by mkittler about 6 years ago
I suppose @okurz though "it" in "[...] the worker can be restarted by systemd if it crashes." refers to systemd. (But of course it refers to the worker.)
Updated by mkittler about 6 years ago
- Status changed from In Progress to Feedback
PR is merged. Let's see how well it works in production.
Updated by mkittler about 6 years ago
- Status changed from Feedback to Resolved
Seems like it didn't break anything. If logging fails on production again and my changes turn out to be insufficient we can reopen the ticket.
Updated by coolo about 6 years ago
- Target version changed from Current Sprint to Done