Project

General

Profile

Actions

action #34042

closed

[tools] Worker goes to infinite loop during upload of screenshots in case of writing failure

Added by EDiGiacinto about 6 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2018-04-05
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)

Description

This happened on loewe, the disk wasn't full during the upload of the screenshots:

Mar 29 13:51:08 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:51:08 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:58 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:58 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:40 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:40 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:28 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:27 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't call method "write" on an undefined value at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/UserAgent.pm line 362.
Mar 29 13:50:24 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:05 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:50:05 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:53 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:51 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:43 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:43 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:30 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:30 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:18 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:49:10 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:55 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:55 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:47 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't call method "write" on an undefined value at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/UserAgent.pm line 362.
Mar 29 13:48:45 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:21 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:21 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:17 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:17 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:12 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:48:12 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:46 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:46 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:39 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:39 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:11 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:11 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:07 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't call method "write" on an undefined value at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/UserAgent.pm line 362.
Mar 29 13:47:06 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:47:06 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:52 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:52 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:42 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:36 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:36 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:33 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:02 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:01 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:01 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:46:00 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: No space left on device at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: No space left on device at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log:  at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: No space left on device at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: Timer failed: Can't write to log: No space left on device at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: Mojo::Reactor::Poll: I/O watcher failed: Can't write to log: No space left on device at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 311.
Mar 29 13:45:27 loewe worker[43184]: [info] Isotovideo exit status: 28

Aside from the disk full issue, imo the worker should not loop forever (maybe set a retrial limit?) in case of persistent error, as it prevented the job to be cleaned up, and left an inconsistent state in the worker.

Also possibly: implement catching errors in all our log_* functions, as they are relied in all other parts in the code and would prevent cleanup phases from being executed.


Subtasks 2 (0 open2 closed)

action #34282: implement catching errors in all our log_* functionsResolvedmkittler2018-04-05

Actions
action #34285: Add retry limit on the worker for errors related to job handlingResolvedokurz2018-04-05

Actions

Related issues 3 (0 open3 closed)

Related to openQA Project - action #34267: osd instance unresponsive (HTTP 502)ResolvedEDiGiacinto2018-04-04

Actions
Related to openQA Project - action #40220: Scheduler died due to Mojo::Log failing to write to logRejected2018-08-24

Actions
Related to openQA Project - action #40862: Out of disk space killed the webui (on osd)Rejected2018-09-11

Actions
Actions #1

Updated by EDiGiacinto about 6 years ago

  • Subject changed from [tools] Worker goes to infinite loop during upload phase in case of writing failure to [tools] Worker goes to infinite loop during upload of screenshots in case of writing failure
Actions #2

Updated by EDiGiacinto about 6 years ago

  • Priority changed from Normal to Low
Actions #3

Updated by EDiGiacinto about 6 years ago

  • Description updated (diff)
Actions #4

Updated by szarate about 6 years ago

  • Related to action #34267: osd instance unresponsive (HTTP 502) added
Actions #5

Updated by EDiGiacinto about 6 years ago

  • Description updated (diff)
Actions #6

Updated by szarate about 6 years ago

EDiGiacinto wrote:

Aside from the disk full issue, imo the worker should not loop forever (maybe set a retrial limit?) in case of persistent error, as it prevented the job to be cleaned up, and left an inconsistent state in the worker.

Also possibly: implement catching errors in all our log_* functions, as they are relied in all other parts in the code and would prevent cleanup phases from being executed.

+1

Actions #7

Updated by EDiGiacinto over 5 years ago

  • Related to action #40220: Scheduler died due to Mojo::Log failing to write to log added
Actions #8

Updated by EDiGiacinto over 5 years ago

  • Related to action #40862: Out of disk space killed the webui (on osd) added
Actions #9

Updated by coolo over 5 years ago

  • Target version set to Current Sprint
Actions #10

Updated by mkittler over 5 years ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

I suppose the worker can be restarted by systemd if it crashes. So I would just make it stop if an unhanded exception occurs (like most scripts/applications behave).

PR: https://github.com/os-autoinst/openQA/pull/1809

I tested this by adding a die on certain locations.

Actions #11

Updated by okurz over 5 years ago

It's not that "systemd crashes" but I guess systemd can be configured to restart crashing services :)

Actions #12

Updated by EDiGiacinto over 5 years ago

okurz wrote:

It's not that "systemd crashes" but I guess systemd can be configured to restart crashing services :)

In this case the worker went to infinite loop, and the process was still alive so systemd won't even catch that as it sees it as running

Actions #13

Updated by mkittler over 5 years ago

I suppose @okurz though "it" in "[...] the worker can be restarted by systemd if it crashes." refers to systemd. (But of course it refers to the worker.)

Actions #14

Updated by mkittler over 5 years ago

  • Status changed from In Progress to Feedback

PR is merged. Let's see how well it works in production.

Actions #15

Updated by mkittler over 5 years ago

  • Status changed from Feedback to Resolved

Seems like it didn't break anything. If logging fails on production again and my changes turn out to be insufficient we can reopen the ticket.

Actions #16

Updated by coolo over 5 years ago

  • Target version changed from Current Sprint to Done
Actions

Also available in: Atom PDF