
action #66988

Mojo::Reactor::Poll: I/O watcher failed: Can't use an undefined value as a HASH reference at /usr/share/openqa/script/../lib/OpenQA/CacheService/Response/Status.pm line 20.

Added by okurz 10 months ago. Updated 8 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2020-05-18
Due date:
% Done:

0%

Estimated time:
Difficulty:
medium

Description

Observation

https://openqa.opensuse.org/tests/1271018 has been stuck in "running" for 3 hours, assigned to openqa-aarch64:7, and is not moving forward. The worker log states:

May 18 08:10:56 openqa-aarch64 worker[11167]: [info] [#14377] Download of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-ARM-JeOS-efi.armv7l-2020.05.10-Snapshot20200517.raw.xz.sha256" successful, new cache size is 160GiB
May 18 08:10:56 openqa-aarch64 worker[11167]: [info] Downloading openSUSE-Tumbleweed-ARM-JeOS-efi.armv7l-2020.05.10-Snapshot20200517.raw.xz, request #14389 sent to Cache Service
May 18 08:11:06 openqa-aarch64 worker[11167]: Mojo::Reactor::Poll: I/O watcher failed: Can't use an undefined value as a HASH reference at /usr/share/openqa/script/../lib/OpenQA/CacheService/Response/Status.pm line 20.
May 18 08:11:06 openqa-aarch64 worker[11167]: [error] Stopping because a critical error occurred.
May 18 08:11:06 openqa-aarch64 worker[11167]: [error] Another error occurred when trying to stop gracefully due to an error. Trying to kill ourself forcefully now.

There has been no further output since 08:11; it is now 11:20. I had started the worker myself at 08:11:06, so just before that.

Problem

The line of code referenced in the error is

sub is_processed   { shift->data->{status} eq 'processed' }

For the hash dereference to fail, shift->data must be returning undef.
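For illustration, a defensive variant of the accessor would avoid the crash by treating missing data as "not processed". This is only a sketch of the idea, not necessarily the change that was eventually merged:

```perl
# Sketch only: a defensive variant of is_processed in
# OpenQA/CacheService/Response/Status.pm. Illustration of the idea,
# not the actual merged fix.
sub is_processed {
    my $self = shift;
    my $data = $self->data;

    # If the cache service returned no usable JSON, $data is undef and
    # the original one-liner dies with "Can't use an undefined value as
    # a HASH reference". Treat that case as "not processed" instead.
    return 0 unless ref $data eq 'HASH';
    return (($data->{status} // '') eq 'processed') ? 1 : 0;
}
```

With such a guard the worker would merely keep polling instead of hitting the fatal I/O watcher error, though it would still mask the underlying cause of the missing data.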

Workaround

Restarting the worker service helps.
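On a typical deployment the worker runs as a systemd template unit, so it can be restarted per instance. The instance number below is an example only; check the actual units on the host first:

```shell
# Unit names below are examples; list the real ones first:
systemctl list-units 'openqa-worker*'

# Restart a single worker instance (here: slot 7, matching openqa-aarch64:7):
sudo systemctl restart openqa-worker@7

# If the cache service itself is wedged, it can be restarted as well:
sudo systemctl restart openqa-worker-cacheservice openqa-worker-cacheservice-minion
```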


Related issues

Related to openQA Project - action #64776: [cache][worker] cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete auto_review:"setup failure: Cache service status error: Premature connection close":retry (Resolved, 2020-03-24)

History

#1 Updated by okurz 10 months ago

  • Description updated (diff)

Restarted the systemd service which helped for the immediate problem.

#2 Updated by mkittler 9 months ago

  • Assignee set to mkittler

The worker process should terminate in case a critical error happens. Considering the ticket description, that didn't work either (only "Trying to kill ourself forcefully now." has been logged). This is likely easy to fix, so I'll have a look.

#3 Updated by mkittler 9 months ago

  • Assignee deleted (mkittler)

This isn't as clear as expected.

I cannot reproduce that the worker doesn't terminate itself. I modified the code locally to produce the same error (sub is_processed { my $self = shift; $self->data(undef); $self->data->{status} eq 'processed'; }) and it terminates as expected:

[debug] [pid:28695] Found ASSET_1, caching Tumbleweed.x86_64-1.0-libvirt-Snapshot20200517.vagrant.libvirt.box
[info] [pid:28695] Downloading Tumbleweed.x86_64-1.0-libvirt-Snapshot20200517.vagrant.libvirt.box, request #75 sent to Cache Service
Mojo::Reactor::Poll: I/O watcher failed: Can't use an undefined value as a HASH reference at /hdd/openqa-devel/repos/openQA/script/../lib/OpenQA/CacheService/Response/Status.pm line 20.
[error] [pid:28695] Stopping because a critical error occurred.
[debug] [pid:28695] Stopping job 1359 from http://localhost:9526: 00001359-opensuse-Tumbleweed-DVD-x86_64-Build20200517-btrfs@64bit - reason: exception
[debug] [pid:28695] REST-API call: POST http://localhost:9526/api/v1/jobs/1359/status
[error] [pid:28695] Another error occurred when trying to stop gracefully due to an error. Trying to kill ourself forcefully now.
[martchus@linux-9lzf openQA]$ [(309e21e33...)]

But I'm wondering about the error itself as well. The only place in the code where the data is initialized is lib/OpenQA/CacheService/Client.pm:

my $data = $res ? $res->json : {};
return OpenQA::CacheService::Response::Status->new(data => $data, error => $err);

So the code takes care that data is always a hash reference, except when the response body wasn't actually valid JSON (in that case $res->json returns undef). But the controller for serving the data (sub status in lib/OpenQA/CacheService/Controller/API.pm) always renders JSON.
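The gap can be sketched as follows, under the assumption that the response body fails to decode as JSON. In Mojolicious, $res->json returns undef on a failed decode even when $res itself is true, so the existing ternary does not protect against it:

```perl
# Sketch: how $data can still be undef in lib/OpenQA/CacheService/Client.pm.
# $res->json returns undef when the body fails to decode as JSON (e.g. a
# truncated response or an HTML error page from a proxy), even though
# $res itself is true.
my $data = $res ? $res->json : {};    # may still be undef!

# A defensive variant (an illustration, not necessarily the merged fix):
my $json = $res ? $res->json : undef;
my $safe = ref $json eq 'HASH' ? $json : {};
return OpenQA::CacheService::Response::Status->new(data => $safe, error => $err);
```

That would explain why the error only shows up sporadically: the controller itself always renders JSON, so a malformed body would have to come from somewhere in between, such as a dropped or truncated connection.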

kraih You know the framework better than me, so do you have an idea under which condition data might be undef?

#4 Updated by kraih 9 months ago

mkittler wrote:

kraih You know the framework better than me, so do you have an idea under which condition data might be undef?

An exception thrown in the controller, perhaps; you'd find traces of that in the journal, though.

#5 Updated by kraih 9 months ago

This is probably related to #64776. I'll make the worker <-> cache service communication more reliable, with better exception handling.

#6 Updated by kraih 9 months ago

  • Assignee set to kraih

#7 Updated by kraih 9 months ago

  • Target version set to Current Sprint
  • Difficulty set to medium

#8 Updated by kraih 8 months ago

  • Status changed from New to In Progress

I've opened a PR with a fix for this problem (and many other cache service client error conditions). https://github.com/os-autoinst/openQA/pull/3211

#9 Updated by kraih 8 months ago

  • Status changed from In Progress to Feedback

The previous PR has been merged and there is now a follow-up. The original problem in this ticket should be resolved now.

#10 Updated by okurz 8 months ago

  • Status changed from Feedback to Resolved

Nothing broke in production. It's unlikely we would see the problem reproduce soon anyway. Thanks.

#11 Updated by okurz 7 months ago

  • Related to action #64776: [cache][worker] cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete auto_review:"setup failure: Cache service status error: Premature connection close":retry added
