Project

General

Profile

action #34042

[tools] Worker goes to infinite loop during upload of screenshots in case of writing failure

Added by EDiGiacinto about 2 years ago. Updated 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2018-04-05
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Difficulty:
Duration:

Description

This happened on loewe, the disk wasn't full during the upload of the screenshots:

Mar 29 13:51:08 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:51:08 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:58 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:58 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:40 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:40 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:28 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:27 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't call method "write" on an undefined value at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/UserAgent.pm line 362.
Mar 29 13:50:24 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:05 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:05 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:53 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:51 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:43 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:43 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:30 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:30 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:18 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:10 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:55 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:55 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:47 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't call method "write" on an undefined value at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/UserAgent.pm line 362.
Mar 29 13:48:45 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:21 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:21 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:17 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:17 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:12 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:12 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:46 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:46 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:39 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:39 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:11 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:11 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:07 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't call method "write" on an undefined value at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/UserAgent.pm line 362.
Mar 29 13:47:06 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:06 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:52 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:52 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:42 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:36 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:36 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:33 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:02 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:01 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:01 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:00 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: No space left on device at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: No space left on device at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: No space left on device at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: No space left on device at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: No space left on device at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: [info] Isotovideo exit status: 28

Aside from the disk full issue, imo the worker should not loop forever (maybe set a retrial limit?) in case of persistent error, as it prevented the job to be cleaned up, and left an inconsistent state in the worker.

Also possibly: implement catching errors in all our log_* functions, as they are relied in all other parts in the code and would prevent cleanup phases from being executed.


Subtasks

action #34282: implement catching errors in all our log_* functionsResolvedmkittler

action #34285: Add retry limit on the worker for errors related to job handlingResolvedokurz


Related issues

Related to openQA Project - action #34267: osd instance unresponsive (HTTP 502)Resolved2018-04-04

Related to openQA Project - action #40220: Scheduler died due to Mojo::Log failing to write to logRejected2018-08-24

Related to openQA Project - action #40862: Out of disk space killed the webui (on osd)Rejected2018-09-11

History

#1 Updated by EDiGiacinto about 2 years ago

  • Subject changed from [tools] Worker goes to infinite loop during upload phase in case of writing failure to [tools] Worker goes to infinite loop during upload of screenshots in case of writing failure

#2 Updated by EDiGiacinto about 2 years ago

  • Priority changed from Normal to Low

#3 Updated by EDiGiacinto about 2 years ago

  • Description updated (diff)

#4 Updated by szarate about 2 years ago

  • Related to action #34267: osd instance unresponsive (HTTP 502) added

#5 Updated by EDiGiacinto about 2 years ago

  • Description updated (diff)

#6 Updated by szarate about 2 years ago

EDiGiacinto wrote:

Aside from the disk full issue, imo the worker should not loop forever (maybe set a retrial limit?) in case of persistent error, as it prevented the job to be cleaned up, and left an inconsistent state in the worker.

Also possibly: implement catching errors in all our log_* functions, as they are relied in all other parts in the code and would prevent cleanup phases from being executed.

+1

#7 Updated by EDiGiacinto almost 2 years ago

  • Related to action #40220: Scheduler died due to Mojo::Log failing to write to log added

#8 Updated by EDiGiacinto over 1 year ago

  • Related to action #40862: Out of disk space killed the webui (on osd) added

#9 Updated by coolo over 1 year ago

  • Target version set to Current Sprint

#10 Updated by mkittler over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

I suppose the worker can be restarted by systemd if it crashes. So I would just make it stop if an unhanded exception occurs (like most scripts/applications behave).

PR: https://github.com/os-autoinst/openQA/pull/1809

I tested this by adding a die on certain locations.

#11 Updated by okurz over 1 year ago

It's not that "systemd crashes" but I guess systemd can be configured to restart crashing services :)

#12 Updated by EDiGiacinto over 1 year ago

okurz wrote:

It's not that "systemd crashes" but I guess systemd can be configured to restart crashing services :)

In this case the worker went to infinite loop, and the process was still alive so systemd won't even catch that as it sees it as running

#13 Updated by mkittler over 1 year ago

I suppose okurz though "it" in "[...] the worker can be restarted by systemd if it crashes." refers to systemd. (But of course it refers to the worker.)

#14 Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback

PR is merged. Let's see how well it works in production.

#15 Updated by mkittler over 1 year ago

  • Status changed from Feedback to Resolved

Seems like it didn't break anything. If logging fails on production again and my changes turn out to be insufficient we can reopen the ticket.

#16 Updated by coolo over 1 year ago

  • Target version changed from Current Sprint to Done

Also available in: Atom PDF