action #40103

[o3] openqaworker4 not able to finish any jobs

Added by okurz about 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2018-08-22
Due date:
% Done:

0%

Estimated time:
Difficulty:

Related issues

Related to openQA Project - action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out (Resolved, 2018-08-15)

Related to openQA Project - action #39833: [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping (Resolved, 2018-08-16)

Related to openQA Project - action #40001: [negotiation:error] [pid 9953] [client <ip_of_openqaworker4>:35634] AH00690: no acceptable variant: /usr/share/apache2/error/HTTP_BAD_GATEWAY.html.var (Rejected, 2018-08-20)

History

#1 Updated by szarate about 2 years ago

  • Target version set to Current Sprint

#2 Updated by szarate about 2 years ago

Currently waiting for jobs:

https://openqa.opensuse.org/tests/740551 (from console)
https://openqa.opensuse.org/tests/740549 (from systemd unit)

Will enable all 16 instances later with the latest job that failed there and see what happens afterwards. The suspicion is that all of them were trying to update.

#3 Updated by okurz about 2 years ago

  • Related to action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out added

#4 Updated by okurz about 2 years ago

both jobs passed

#5 Updated by szarate about 2 years ago

As both jobs are passing, the next step is to try the 16 instances running the same job, to rule out that the cache is deadlocking itself :)

for i in $(seq 1 16); do
  ./script/clone_job.pl --skip-deps --skip-download --skip-chained-deps \
    --from https://openqa.opensuse.org 740549 --host https://openqa.opensuse.org \
    _GROUP="Development Tumbleweed" BUILD=1119.5:poo40103 \
    TEST=poo_40103_investigation_$i NAME=poo_40103_investigation_$i \
    WORKER_CLASS=openqaworker4
done

#6 Updated by szarate about 2 years ago

So, after looking closer: when all workers are started at the same time and pick up jobs at the same time, some jobs take just a bit too long, and the webUI decides to kill them (mostly because syncing the needles and the git repo takes a bit too long). I'm currently looking for possible solutions to this, since it's fairly easy to reproduce.

https://openqa.opensuse.org/tests/overview?build=1119.5%3Apoo40103&groupid=38&distri=opensuse&version=Staging%3AH

#7 Updated by szarate about 2 years ago

  • Status changed from In Progress to Feedback

Looks like the main problem is that, since we're syncing the whole needles and tests, workers might deadlock themselves from time to time. A shallow copy should work, but I think this is mostly fallout from previous problems. Setting to Feedback; let's monitor.
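
For illustration only: if "shallow copy" here means a git shallow clone (an assumption, the comment does not say how the sync is done), the difference can be sketched as below. The throwaway repository, its file names and the commit count are all made up for the example and stand in for a real needles or tests repository.

```shell
#!/bin/sh
# Sketch: compare a full clone with a shallow clone (history depth 1).
# Everything here is hypothetical demo data, not the o3 setup.
set -eu

work=$(mktemp -d)

# Build a small source repository with three commits.
git init -q "$work/src"
for n in 1 2 3; do
    echo "needle $n" > "$work/src/needle$n.json"
    git -C "$work/src" add .
    git -C "$work/src" -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "commit $n"
done

# --depth 1 transfers only the most recent commit. The file:// URL makes git
# use its regular transport; with a plain local path --depth is ignored.
git clone -q --depth 1 "file://$work/src" "$work/shallow"
git clone -q "file://$work/src" "$work/full"

# Count the commits reachable from HEAD in each clone.
shallow_count=$(git -C "$work/shallow" rev-list --count HEAD)
full_count=$(git -C "$work/full" rev-list --count HEAD)
echo "shallow=$shallow_count full=$full_count"

rm -rf "$work"
```

The shallow clone carries one commit of history instead of all of them, which is where any reduction in sync time would come from. Whether that is enough to avoid the timeouts described above is not something this sketch can show.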

#8 Updated by szarate about 2 years ago

  • Project changed from openQA Tests to openQA Project
  • Category changed from Infrastructure to 168
  • Priority changed from Urgent to Normal

#9 Updated by szarate about 2 years ago

  • Status changed from Feedback to Resolved

I haven't seen any incompletes due to abnormal situations in the worker. Please reopen if you find the same issues. A separate ticket will be opened to address the possible deadlock situation.

#10 Updated by szarate about 2 years ago

  • Related to action #39833: [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping added

#11 Updated by szarate about 2 years ago

  • Related to action #40001: [negotiation:error] [pid 9953] [client <ip_of_openqaworker4>:35634] AH00690: no acceptable variant: /usr/share/apache2/error/HTTP_BAD_GATEWAY.html.var added

#12 Updated by coolo about 2 years ago

  • Target version changed from Current Sprint to Done
