action #96710

Error `Can't call method "write" on an undefined value` shows up in worker log leading to incompletes

Added by mkittler 6 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
Start date:
2021-08-10
Due date:
2021-08-31
% Done:

0%

Estimated time:

Description

observation

Since 2021-08-10T05:00 the number of incompletes has been increasing (see https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=14&orgId=1&from=1628546400000&to=now). The reason is e.g. "abandoned: associated worker openqaworker3:11 re-connected but abandoned the job". Judging by the worker log there might be a problem with the worker code:

Aug 10 12:01:46 openqaworker3 worker[13786]: [debug] [pid:13786] REST-API call: POST http://openqa.suse.de/api/v1/jobs/6789995/status
Aug 10 12:01:51 openqaworker3 worker[13786]: [debug] [pid:13786] Updating status so job 6789995 is not considered dead.
Aug 10 12:01:51 openqaworker3 worker[13786]: [debug] [pid:13786] REST-API call: POST http://openqa.suse.de/api/v1/jobs/6789995/status
Aug 10 12:01:54 openqaworker3 worker[13786]: [debug] [pid:13786] Stopping job 6789995 from openqa.suse.de: 06789995-sle-15-SP1-Server-DVD-Updates-x86_64-Build20210810-1-qam_kernel_multipath_supportserver@64bit - reason: abort
Aug 10 12:01:54 openqaworker3 worker[13786]: [debug] [pid:13786] REST-API call: POST http://openqa.suse.de/api/v1/jobs/6789995/status
Aug 10 12:01:54 openqaworker3 worker[13786]: [info] [pid:13786] +++ worker notes +++
Aug 10 12:01:54 openqaworker3 worker[13786]: [info] [pid:13786] End time: 2021-08-10 10:01:54
Aug 10 12:01:54 openqaworker3 worker[13786]: [info] [pid:13786] Result: abort
Aug 10 12:01:54 openqaworker3 worker[13786]: [debug] [pid:13786] Job 6789995 stopped as incomplete
Aug 10 12:01:54 openqaworker3 worker[13786]: [warn] [pid:13786] Unable to upload results of the job because no command server URL or worker ID have been set.
Aug 10 12:01:54 openqaworker3 worker[13786]: [debug] [pid:13786] Upload concluded (no current module)
Aug 10 12:01:54 openqaworker3 worker[13786]: [debug] [pid:13786] REST-API call: POST http://openqa.suse.de/api/v1/jobs/6789995/set_done?reason=abort&result=incomplete&worker_id=966
Aug 10 12:01:54 openqaworker3 worker[13786]: [debug] [pid:13786] Job 6789995 from openqa.suse.de finished - reason: abort
Aug 10 12:01:54 openqaworker3 worker[13786]: [debug] [pid:13786] Cleaning up for next job
Aug 10 12:01:54 openqaworker3 worker[13786]: [debug] [pid:13786] Informing openqa.suse.de that we are going offline
Aug 10 12:01:54 openqaworker3 worker[13786]: Mojo::Reactor::Poll: I/O watcher failed: Can't call method "write" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/UserAgent.pm line 350.
Aug 10 12:01:54 openqaworker3 worker[13786]: [error] [pid:13786] Another error occurred when trying to stop gracefully due to an error. Trying to kill ourself forcefully now.
Aug 10 12:01:54 openqaworker3 systemd[1]: openqa-worker-auto-restart@11.service: Service RestartSec=100ms expired, scheduling restart.
Aug 10 12:24:05 openqaworker3 worker[16563]: [info] [pid:16563] Output of rsync:
Aug 10 12:24:05 openqaworker3 worker[16563]: [info] [#4187] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Aug 10 12:24:05 openqaworker3 worker[16563]: receiving incremental file list
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/.git/
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/.git/FETCH_HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [116B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/.git/ORIG_HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [116B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/.git/index
Aug 10 12:24:05 openqaworker3 worker[16563]: [116B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/.git/logs/HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [116B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/.git/refs/heads/
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/products/opensuse/needles/.git/
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/products/opensuse/needles/.git/FETCH_HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [117B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/products/opensuse/needles/.git/ORIG_HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [117B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/products/opensuse/needles/.git/index
Aug 10 12:24:05 openqaworker3 worker[16563]: [117B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/products/opensuse/needles/.git/logs/HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [117B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/products/opensuse/needles/.git/refs/heads/
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/products/sle/needles/.git/
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/products/sle/needles/.git/FETCH_HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [117B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/products/sle/needles/.git/ORIG_HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [118B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/products/sle/needles/.git/index
Aug 10 12:24:05 openqaworker3 worker[16563]: [118B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/products/sle/needles/.git/logs/HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [118B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sle/products/sle/needles/.git/refs/heads/
Aug 10 12:24:05 openqaworker3 worker[16563]: sll/.git/
Aug 10 12:24:05 openqaworker3 worker[16563]: sll/.git/FETCH_HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [118B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sll/.git/ORIG_HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [118B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sll/.git/index
Aug 10 12:24:05 openqaworker3 worker[16563]: [118B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sll/.git/logs/HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [118B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sll/.git/refs/heads/
Aug 10 12:24:05 openqaworker3 worker[16563]: sll/products/sll/needles/.git/
Aug 10 12:24:05 openqaworker3 worker[16563]: sll/products/sll/needles/.git/FETCH_HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [118B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sll/products/sll/needles/.git/ORIG_HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [118B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sll/products/sll/needles/.git/index
Aug 10 12:24:05 openqaworker3 worker[16563]: [118B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sll/products/sll/needles/.git/logs/HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [118B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: sll/products/sll/needles/.git/refs/heads/
Aug 10 12:24:05 openqaworker3 worker[16563]: vmdp/.git/
Aug 10 12:24:05 openqaworker3 worker[16563]: vmdp/.git/FETCH_HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [118B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: vmdp/.git/ORIG_HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [118B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: vmdp/.git/index
Aug 10 12:24:05 openqaworker3 worker[16563]: [118B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: vmdp/.git/logs/HEAD
Aug 10 12:24:05 openqaworker3 worker[16563]: [118B blob data]
Aug 10 12:24:05 openqaworker3 worker[16563]: vmdp/.git/refs/heads/
Aug 10 12:24:05 openqaworker3 worker[16563]: sent 302,978 bytes  received 3,122,744 bytes  9,746.01 bytes/sec
Aug 10 12:24:05 openqaworker3 worker[16563]: total size is 14,122,452,144  speedup is 4,122.47
Aug 10 12:24:05 openqaworker3 worker[16563]: [info] [pid:16563] Finished to rsync tests
Aug 10 12:24:05 openqaworker3 worker[16563]: [debug] [pid:16563] Symlinked from "/var/lib/openqa/cache/openqa.suse.de/tests/sle" to "/var/lib/openqa/pool/11/sle"
Aug 10 12:24:05 openqaworker3 worker[16563]: [info] [pid:16563] Preparing cgroup to start isotovideo
Aug 10 12:24:05 openqaworker3 worker[16563]: [info] [pid:16563] Using cgroup /sys/fs/cgroup/systemd/openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@11.service/6790307
Aug 10 12:24:05 openqaworker3 worker[16563]: [info] [pid:16563] Starting isotovideo container
Aug 10 12:24:05 openqaworker3 worker[16563]: [debug] [pid:16563] Registered process:17981
Aug 10 12:24:05 openqaworker3 worker[16563]: [info] [pid:16563] isotovideo has been started (PID: 17981)
Aug 10 12:24:05 openqaworker3 worker[16563]: [info] [pid:17981] 17981: WORKING 6790307
Aug 10 12:24:05 openqaworker3 worker[16563]: [debug] [pid:17981] +++ worker notes +++
Aug 10 12:24:05 openqaworker3 worker[16563]: [debug] [pid:16563] Running job 6790307 from openqa.suse.de: 06790307-sle-15-SP3-Server-DVD-Updates-x86_64-Build20210810-1-mru-iscsi_server_normal_auth_backstore_hdd@64bit.
Aug 10 12:24:05 openqaworker3 worker[16563]: Mojo::Reactor::Poll: I/O watcher failed: Can't call method "write" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/UserAgent.pm line 350.
Aug 10 12:24:05 openqaworker3 worker[16563]: [error] [pid:16563] Stopping because a critical error occurred.
Aug 10 12:24:05 openqaworker3 worker[16563]: [error] [pid:16563] Another error occurred when trying to stop gracefully due to an error. Trying to kill ourself forcefully now.
Aug 10 12:24:05 openqaworker3 worker[16563]: [info] [pid:16563] Isotovideo exit status: 0
Aug 10 12:24:05 openqaworker3 worker[16563]: [debug] [pid:16563] Stopping job 6790307 from openqa.suse.de: 06790307-sle-15-SP3-Server-DVD-Updates-x86_64-Build20210810-1-mru-iscsi_server_normal_auth_backstore_hdd@64bit - reason: done
Aug 10 12:24:05 openqaworker3 worker[16563]: [debug] [pid:16563] REST-API call: POST http://openqa.suse.de/api/v1/jobs/6790307/status
Aug 10 12:24:06 openqaworker3 systemd[1]: openqa-worker-auto-restart@11.service: Service RestartSec=100ms expired, scheduling restart.

We're currently also facing #96557 and #96554, so there might be a relation, although these incompletes look quite different now.


Related issues

Related to openQA Infrastructure - coordination #96447: [epic] Failed systemd services and job age alerts (Blocked, 2021-08-04)

Related to openQA Infrastructure - action #97043: job queue hitting new record 14k jobs (Resolved, 2021-08-17)

Copied to openQA Infrastructure - action #96713: Slow grep in openqa-label-known-issues leads to high CPU usage (Resolved, 2021-09-10)

History

#1 Updated by cdywan 6 months ago

  • Priority changed from Normal to Urgent

We should probably consider this Urgent.

/usr/lib/perl5/vendor_perl/5.26.1/Mojo/UserAgent.pm:350 seems to be inside the _write method:

  $c->{ioloop}->stream($id)->write($chunk => sub { $self && $self->_write($id) });

So $c->{ioloop}->stream($id) must somehow be undef. But I'm not sure how we end up on that code path 🤔️
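The failure mode can be reproduced in isolation with a toy sketch (not openQA code; the hash and key names here are made up): a lookup that silently returns undef, followed by a method call, dies with exactly this message.

```perl
use strict;
use warnings;

# Toy registry standing in for Mojo::IOLoop's internal stream bookkeeping.
my %streams;
my $id = 'stale-connection-id';    # a connection the loop no longer tracks

# The lookup yields undef instead of a stream object ...
my $stream = $streams{$id};

# ... so the method call dies with:
# Can't call method "write" on an undefined value
$stream->write('chunk');
```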

#2 Updated by tinita 6 months ago

  • Copied to action #96713: Slow grep in openqa-label-known-issues leads to high CPU usage added

#3 Updated by mkittler 6 months ago

The line which produces the warning is the last line in

sub _write {
  my ($self, $id) = @_;

  # Protect from resume event recursion
  my $c = $self->{connections}{$id};
  return if !(my $tx = $c->{tx}) || $c->{writing};
  local $c->{writing} = 1;
  my $chunk = $tx->client_write;
  warn term_escape "-- Client >>> Server (@{[_url($tx)]})\n$chunk\n" if DEBUG;
  return unless length $chunk;

  weaken $self;
  $c->{ioloop}->stream($id)->write($chunk => sub { $self && $self->_write($id) });
}

(in /usr/lib/perl5/vendor_perl/5.26.1/Mojo/UserAgent.pm)


We're still at Mojolicious v9.19, so v9.20 cannot be the culprit. We likely end up with an invalid $id here. (If $self were undef we'd see a different warning earlier.)

#4 Updated by mkittler 6 months ago

I still don't know what's causing the problem. Maybe the following change helps: https://github.com/os-autoinst/openQA/pull/4116

#5 Updated by mkittler 6 months ago

  • Assignee set to mkittler

It is also happening on o3 but I couldn't reproduce it locally so far:

Aug 11 12:59:10 openqaworker4 worker[4590]: [info] Finished to rsync tests
Aug 11 12:59:10 openqaworker4 worker[4590]: [info] Preparing cgroup to start isotovideo
Aug 11 12:59:10 openqaworker4 worker[4590]: [info] Using cgroup /sys/fs/cgroup/systemd/openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@18.service/1873856
Aug 11 12:59:10 openqaworker4 worker[4590]: [info] Starting isotovideo container
Aug 11 12:59:10 openqaworker4 worker[4590]: [info] isotovideo has been started (PID: 12179)
Aug 11 12:59:10 openqaworker4 worker[4590]: [info] 12179: WORKING 1873856
Aug 11 12:59:10 openqaworker4 worker[4590]: Mojo::Reactor::Poll: I/O watcher failed: Can't call method "write" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/UserAgent.pm line 350.
Aug 11 12:59:10 openqaworker4 worker[4590]: [error] Stopping because a critical error occurred.
Aug 11 12:59:10 openqaworker4 worker[4590]: [error] Another error occurred when trying to stop gracefully due to an error. Trying to kill ourself forcefully now.
Aug 11 12:59:10 openqaworker4 worker[4590]: [info] Isotovideo exit status: 0
Aug 11 12:59:10 openqaworker4 systemd[1]: openqa-worker-auto-restart@18.service: Service RestartSec=100ms expired, scheduling restart.

#6 Updated by mkittler 6 months ago

This PR improves the exception handling of the worker, see the detailed commit message: https://github.com/os-autoinst/openQA/pull/4119

Of course it doesn't fix the cause of the exception itself. It would have been nice to print a stack trace but I'm not sure how to do it. I've tried Devel::StackTrace but calling it from the error handling is of course not very useful.

#7 Updated by cdywan 6 months ago

  • Status changed from New to In Progress

mkittler wrote:

Of course it doesn't fix the cause of the exception itself. It would have been nice to print a stack trace but I'm not sure how to do it. I've tried Devel::StackTrace but calling it from the error handling is of course not very useful.

Maybe $SIG{__DIE__} = \&Carp::confess;? At least that's a suggestion I found on Stack Overflow.
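A minimal sketch of that suggestion (toy subs, not worker code): installing Carp::confess as the __DIE__ handler turns any uncaught die into a full backtrace.

```perl
use strict;
use warnings;
use Carp ();

# Every die now reports the whole call chain instead of a single line.
$SIG{__DIE__} = \&Carp::confess;

sub inner { die "something went wrong" }
sub outer { inner() }

# Dies with a trace listing main::inner and main::outer.
outer();
```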

#8 Updated by mkittler 6 months ago

It was suggested to revert https://github.com/os-autoinst/openQA/pull/4067 on a worker, so I reverted the most suspicious commits on openqaworker3. The revert should be effective for the next jobs, and the next deployment will restore the original code.

So far there are 154 (around 12 per hour) abandoned jobs on openqaworker3 (since today):

openqa=> select host, count(id) as online_slots, (select array[count(distinct id), count(distinct id) / (extract(epoch FROM (timezone('UTC', now()) - '2021-08-12T00:00:00')) / 3600)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-08-12T00:00:00' and reason like '%abandoned: associated worker%') as recently_abandoned_jobs_total_and_per_hour from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recently_abandoned_jobs_total_and_per_hour desc;
        host         | online_slots | recently_abandoned_jobs_total_and_per_hour 
---------------------+--------------+--------------------------------------------
 openqaworker5       |           41 | {314,23.9842009127936}
 openqaworker6       |           29 | {233,17.7971936709583}
 openqaworker3       |           19 | {154,11.7629520400326}
 openqaworker8       |           16 | {21,1.60403891454989}
 openqaworker-arm-1  |           10 | {21,1.60403891454989}
 openqaworker9       |           16 | {18,1.37489049818562}
 openqaworker13      |           16 | {14,1.0693592763666}
 powerqaworker-qam-1 |            8 | {9,0.687445249092812}
 QA-Power8-4-kvm     |            8 | {6,0.458296832728541}
 openqaworker10      |           10 | {6,0.458296832728541}
 openqaworker2       |           34 | {6,0.458296832728541}
 malbec              |            4 | {4,0.305531221819027}
 QA-Power8-5-kvm     |            8 | {4,0.305531221819027}
 grenache-1          |           49 | {2,0.152765610909514}
 openqaworker-arm-2  |           20 | {0,0}
 openqaworker11      |           10 | {0,0}
 automotive-3        |            1 | {0,0}
 openqaworker-arm-3  |           20 | {0,0}
(18 rows)

Let's see how these figures develop.

#9 Updated by mkittler 6 months ago

cdywan Thanks. It looks like this works. But it is very hacky. I'll hot-patch it on the workers.

#10 Updated by mkittler 6 months ago

I've been hot-patching the line cdywan suggested on openqaworker5 and 6 (the ones where the problem is most apparent) by manually editing /usr/share/openqa/script/worker. It should be effective on the next jobs and the next deployment will restore it.

#11 Updated by mkittler 6 months ago

In production / in the long term we might want to use https://metacpan.org/pod/Devel::Confess or https://metacpan.org/pod/Carp::Always. These modules seem to just override the signal handler as well, but that appears to be the only way to do it.
PR: https://github.com/os-autoinst/openQA/pull/4120
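Both modules can also be enabled without touching the code at all; a sketch of the usual invocations (the script path is illustrative):

```perl
# Option 1: enable in the code itself.
use Devel::Confess;    # all dies and warns now carry stack traces

# Option 2: enable from the outside, without modifying the script:
#   perl -d:Confess script.pl
#   PERL5OPT=-MCarp::Always perl script.pl
```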

#12 Updated by mkittler 6 months ago

This is what the stack trace looks like:

Aug 12 16:26:49 openqaworker6 worker[35113]: Mojo::Reactor::Poll: I/O watcher failed: Can't call method "write" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/UserAgent.pm line 350.
Aug 12 16:26:49 openqaworker6 worker[35113]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/UserAgent.pm line 350.
Aug 12 16:26:49 openqaworker6 worker[35113]:         Mojo::UserAgent::_write(OpenQA::Client=HASH(0x55c1be367278), "d949caa2cc57cdb459c99b8821dd1d92") called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/UserAgent.pm line 241
Aug 12 16:26:49 openqaworker6 worker[35113]:         Mojo::UserAgent::__ANON__(Mojo::Transaction::WebSocket=HASH(0x55c1be3bcb90)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/EventEmitter.pm line 15
Aug 12 16:26:49 openqaworker6 worker[35113]:         Mojo::EventEmitter::emit(Mojo::Transaction::WebSocket=HASH(0x55c1be3bcb90), "resume") called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Transaction/WebSocket.pm line 134
Aug 12 16:26:49 openqaworker6 worker[35113]:         Mojo::Transaction::WebSocket::server_read(Mojo::Transaction::WebSocket=HASH(0x55c1be3bcb90), "\x{81}~\x{b} {\"job\":{\"assets\":{\"hdd\":[\"sle-15-SP2-x86_64-Build:20693:k"...) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Transaction/WebSocket.pm line 37
Aug 12 16:26:49 openqaworker6 worker[35113]:         Mojo::Transaction::WebSocket::client_read(Mojo::Transaction::WebSocket=HASH(0x55c1be3bcb90), "\x{81}~\x{b} {\"job\":{\"assets\":{\"hdd\":[\"sle-15-SP2-x86_64-Build:20693:k"...) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/UserAgent.pm line 275
Aug 12 16:26:49 openqaworker6 worker[35113]:         Mojo::UserAgent::_read(OpenQA::Client=HASH(0x55c1be367278), "d949caa2cc57cdb459c99b8821dd1d92", "\x{81}~\x{b} {\"job\":{\"assets\":{\"hdd\":[\"sle-15-SP2-x86_64-Build:20693:k"...) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/UserAgent.pm line 136
Aug 12 16:26:49 openqaworker6 worker[35113]:         Mojo::UserAgent::__ANON__(Mojo::IOLoop::Stream=HASH(0x55c1be3bfcd0)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/EventEmitter.pm line 15
Aug 12 16:26:49 openqaworker6 worker[35113]:         Mojo::EventEmitter::emit(Mojo::IOLoop::Stream=HASH(0x55c1be3bfcd0), "read", "\x{81}~\x{b} {\"job\":{\"assets\":{\"hdd\":[\"sle-15-SP2-x86_64-Build:20693:k"...) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Stream.pm line 109
Aug 12 16:26:49 openqaworker6 worker[35113]:         Mojo::IOLoop::Stream::_read(Mojo::IOLoop::Stream=HASH(0x55c1be3bfcd0)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Stream.pm line 57
Aug 12 16:26:49 openqaworker6 worker[35113]:         Mojo::IOLoop::Stream::__ANON__(Mojo::Reactor::Poll=HASH(0x55c1bb1cdb30)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 141
Aug 12 16:26:49 openqaworker6 worker[35113]:         eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 141
Aug 12 16:26:49 openqaworker6 worker[35113]:         Mojo::Reactor::Poll::_try(Mojo::Reactor::Poll=HASH(0x55c1bb1cdb30), "I/O watcher", CODE(0x55c1be3c7d38), 0) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 60
Aug 12 16:26:49 openqaworker6 worker[35113]:         Mojo::Reactor::Poll::one_tick(Mojo::Reactor::Poll=HASH(0x55c1bb1cdb30)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 101
Aug 12 16:26:49 openqaworker6 worker[35113]:         Mojo::Reactor::Poll::start(Mojo::Reactor::Poll=HASH(0x55c1bb1cdb30)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 134
Aug 12 16:26:49 openqaworker6 worker[35113]:         Mojo::IOLoop::start(Mojo::IOLoop=HASH(0x55c1bc97e728)) called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 412
Aug 12 16:26:49 openqaworker6 worker[35113]:         OpenQA::Worker::exec(OpenQA::Worker=HASH(0x55c1be35f808)) called at /usr/share/openqa/script/worker line 139

#13 Updated by mkittler 6 months ago

According to the stack trace all of the relevant calls happen inside Mojolicious' handling of the WebSocket connection. So it is still not easy to tell what the problem is.


The reverts on openqaworker3 haven't had any effect. It was unlikely anyway that the upload code is related because these jobs mostly fail directly during or after the setup. These are the current figures:

openqa=> select host, count(id) as online_slots, (select array[count(distinct id), count(distinct id) / (extract(epoch FROM (timezone('UTC', now()) - '2021-08-12T00:00:00')) / 3600)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-08-12T00:00:00' and reason like '%abandoned: associated worker%') as recently_abandoned_jobs_total_and_per_hour from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recently_abandoned_jobs_total_and_per_hour desc;
        host         | online_slots | recently_abandoned_jobs_total_and_per_hour 
---------------------+--------------+--------------------------------------------
 openqaworker5       |           41 | {363,23.1579486957244}
 openqaworker6       |           29 | {275,17.5439005270639}
 openqaworker3       |           19 | {195,12.4402203737362}
 openqaworker8       |           16 | {57,3.63637210924598}
 openqaworker9       |           16 | {48,3.06220809199661}
 openqaworker13      |           16 | {26,1.6586960498315}
 openqaworker-arm-1  |           10 | {21,1.33971604024852}
 openqaworker10      |           10 | {13,0.829348024915749}
 openqaworker2       |           34 | {12,0.765552022999153}
 powerqaworker-qam-1 |            8 | {9,0.574164017249365}
 QA-Power8-4-kvm     |            8 | {7,0.446572013416172}
 malbec              |            4 | {5,0.31898000958298}
 QA-Power8-5-kvm     |            8 | {5,0.31898000958298}
 grenache-1          |           49 | {2,0.127592003833192}
 openqaworker-arm-2  |           20 | {0,0}
 openqaworker11      |           10 | {0,0}
 automotive-3        |            1 | {0,0}
 openqaworker-arm-3  |           20 | {0,0}
(18 rows)

I've replaced the worker's delay function with a good old sleep on openqaworker5 (hot-patched). That means we no longer start/stop the event loop frequently in the setup code. Maybe that is what's causing the trouble, although it would then likely be a Mojolicious bug. This basically "reverts" the essence of https://github.com/os-autoinst/openQA/commit/734a8f688cc390f1c273eb43c2badd52bf6a8fcf, rendering features implemented by https://github.com/os-autoinst/openQA/pull/3894 (like MAX_SETUP_TIME) dysfunctional.
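The two waiting styles can be sketched as follows (simplified; not the actual worker code):

```perl
use strict;
use warnings;
use Mojo::IOLoop;

# Event-loop based delay (what the worker did): schedule a timer and run the
# reactor until it fires. This keeps timeout features like MAX_SETUP_TIME
# working, but starts and stops the event loop once per wait.
sub loop_delay {
    my ($seconds) = @_;
    Mojo::IOLoop->timer($seconds => sub { Mojo::IOLoop->stop });
    Mojo::IOLoop->start;
}

# Plain blocking sleep (the hot-patch): never touches the reactor, so it
# cannot leave the loop in an odd state, but nothing else runs while waiting.
sub blocking_delay { sleep $_[0] }
```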

#14 Updated by cdywan 6 months ago

#15 Updated by mkittler 6 months ago

As mentioned in my last comment, I've hot-patched openqaworker5. Today it is no longer the most problematic worker:

openqa=> select host, count(id) as online_slots, (select array[count(distinct id), count(distinct id) / (extract(epoch FROM (timezone('UTC', now()) - '2021-08-12T00:00:00')) / 3600)] from jobs join jobs_assets on jobs.id = jobs_assets.job_id where assigned_worker_id = any(array_agg(w.id)) and t_finished >= '2021-08-12T00:00:00' and reason like '%abandoned: associated worker%') as recently_abandoned_jobs_total_and_per_hour from workers as w where t_updated > (timezone('UTC', now()) - interval '1 hour') group by host order by recently_abandoned_jobs_total_and_per_hour desc;
        host         | online_slots | recently_abandoned_jobs_total_and_per_hour 
---------------------+--------------+--------------------------------------------
 openqaworker6       |           29 | {508,15.8046565253938}
 openqaworker5       |           41 | {391,12.1646076799783}
 openqaworker3       |           19 | {319,9.92457762126107}
 openqaworker8       |           16 | {112,3.48449120244903}
 openqaworker13      |           16 | {96,2.98670674495631}
 openqaworker9       |           16 | {95,2.95559521636301}
 openqaworker10      |           10 | {44,1.36890725810497}
 openqaworker2       |           34 | {32,0.995568914985436}
 openqaworker-arm-1  |           10 | {21,0.653342100459192}
 powerqaworker-qam-1 |            8 | {21,0.653342100459192}
 QA-Power8-4-kvm     |            8 | {19,0.591119043272603}
 QA-Power8-5-kvm     |            8 | {17,0.528895986086013}
 malbec              |            4 | {10,0.311115285932949}
 grenache-1          |           49 | {7,0.217780700153064}
 openqaworker-arm-3  |           20 | {2,0.0622230571865897}
 openqaworker-arm-2  |           20 | {0,0}
 openqaworker11      |           10 | {0,0}
 automotive-3        |            1 | {0,0}

Note that the hot-patch has been reverted with today's deployment.

I couldn't find any further occurrence of the issue in the time frame where the worker ran with the hot-patched code. After today's deployment (and the revert of my hot-patching) it happened again:

Aug 12 18:48:48 openqaworker5 worker[31895]: Mojo::Reactor::Poll: I/O watcher failed: Can't call method "write" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/UserAgent.pm line 350. # last job which finished before hot-patching was effective
# no error in between
Aug 13 10:46:33 openqaworker5 worker[37401]: Mojo::Reactor::Poll: I/O watcher failed: Can't call method "write" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/UserAgent.pm line 350. # first job with error after deployment
Aug 13 10:47:26 openqaworker5 worker[37855]: Mojo::Reactor::Poll: I/O watcher failed: Can't call method "write" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/UserAgent.pm line 350.

So maybe I should get rid of the delay function again and use a proper callback instead. If that's not easily possible I can resort to using sleep again but it'll leave all features implemented by https://github.com/os-autoinst/openQA/pull/3894 dysfunctional.

#16 Updated by cdywan 6 months ago

mkittler wrote:

So maybe I should get rid of the delay function again and use a proper callback instead. If that's not easily possible I can resort to using sleep again but it'll leave all features implemented by https://github.com/os-autoinst/openQA/pull/3894 dysfunctional.

Discussed it in the weekly. My vote is for either the simple sleep or a callback version by end of today. The feature doesn't outweigh the problems we have with the broken loop currently.

#17 Updated by mkittler 6 months ago

This would be a small PR for going back to using sleep: https://github.com/os-autoinst/openQA/pull/4123

I'm currently working on another PR to use proper callbacks. This change is a little more work, so I'm not sure whether I can finish it today. Of course the PR using proper callbacks might show the same error as the delay-function approach. If that's the case it could really be a Mojolicious bug, though. (And the sleep would just be working around that bug.)
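A rough idea of the callback style (hypothetical helper names; not the actual PR): instead of blocking inside a delay loop, the next attempt is scheduled on the already-running reactor and the function returns immediately.

```perl
use strict;
use warnings;
use Mojo::IOLoop;

sub setup_step_done { 0 }    # hypothetical placeholder for one setup attempt

# Retry a setup step without ever blocking the event loop: on failure,
# schedule the next attempt via a timer and hand the result to a callback.
sub retry_setup {
    my ($attempts_left, $cb) = @_;
    return $cb->(0) unless $attempts_left > 0;
    return $cb->(1) if setup_step_done();
    Mojo::IOLoop->timer(5 => sub { retry_setup($attempts_left - 1, $cb) });
}
```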

#18 Updated by mkittler 5 months ago

  • Target version set to Ready

#19 Updated by mkittler 5 months ago

I will hot-patch https://github.com/os-autoinst/openQA/pull/4124 (and https://github.com/os-autoinst/openQA/pull/4120 for better debugging) on openqaworker5 and closely monitor the incomplete rate and logs.

#20 Updated by openqa_review 5 months ago

  • Due date set to 2021-08-31

Setting due date based on mean cycle time of SUSE QE Tools

#21 Updated by mkittler 5 months ago

  • Status changed from In Progress to Feedback

Everything looks good (so openqaworker5 is also still hot-patched). I think we can merge the PR and I'm only waiting for reviews.

#22 Updated by cdywan 5 months ago

mkittler wrote:

I will hot-patch https://github.com/os-autoinst/openQA/pull/4124 (and https://github.com/os-autoinst/openQA/pull/4120 for better debugging) on openqaworker5 and closely monitor the incomplete rate and logs.

Both PRs have been reviewed and merged.

#23 Updated by mkittler 5 months ago

  • Status changed from Feedback to Resolved

Yes, and it looks good in production. The number of incompletes went down significantly. I've checked the logs of 5 OSD workers and couldn't find any further instance of this specific bug after the date when at least the workaround became effective. The PR to use proper callbacks should have been effective on all OSD workers since yesterday as well. So I consider the issue resolved.

We would have caught the issue much earlier if we had monitoring for worker logs, e.g. an alert firing when Can't call method ".*" on is encountered in the logs. The current alerts only informed us once a critical number of incomplete jobs had already been reached. We already have #95293 for this kind of improvement. (Of course here it would be needed on the worker hosts.)

#24 Updated by okurz 5 months ago

created #97247 regarding log monitoring on workers

#25 Updated by okurz 5 months ago

  • Related to action #97043: job queue hitting new record 14k jobs added
