action #40103: [o3] openqaworker4 not able to finish any jobs - openQA Project (public) - openSUSE Project Management Tool

action #40103

closed

[o3] openqaworker4 not able to finish any jobs

Added by okurz over 6 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Assignee: szarate
Category: Feature requests
Target version: Done
Start date: 2018-08-22
Due date:
% Done:
Estimated time:

Related issues 3 (0 open — 3 closed)

Updated by szarate over 6 years ago

Target version set to Current Sprint

Updated by szarate over 6 years ago

Currently waiting for jobs:

https://openqa.opensuse.org/tests/740551 (from console)
https://openqa.opensuse.org/tests/740549 (from systemd unit)

I will enable all 16 instances later with the latest job that failed there and see what happens afterwards. My suspicion is that all of them were trying to update.

Updated by okurz over 6 years ago

Related to action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out added

Updated by okurz over 6 years ago

Both jobs passed.

Updated by szarate over 6 years ago

As both jobs are passing, the next step is to run the same job on all 16 instances, to rule out that the cache is deadlocking itself :)

for i in $(seq 1 16); do
    ./script/clone_job.pl --skip-deps --skip-download --skip-chained-deps \
        --from https://openqa.opensuse.org 740549 \
        --host https://openqa.opensuse.org \
        _GROUP="Development Tumbleweed" BUILD=1119.5:poo40103 \
        TEST="poo_40103_investigation_$i" NAME="poo_40103_investigation_$i" \
        WORKER_CLASS=openqaworker4
done

Updated by szarate over 6 years ago

So, after looking closer: when all workers are started at the same time and pick up jobs at the same time, some jobs take just a bit too long, and the webUI decides to kill those jobs (mostly because syncing the needles and the git repo takes a bit too long). I'm currently looking for possible solutions to this, since it is fairly easy to reproduce.

https://openqa.opensuse.org/tests/overview?build=1119.5%3Apoo40103&groupid=38&distri=opensuse&version=Staging%3AH

Updated by szarate over 6 years ago

Status changed from In Progress to Feedback

It looks like the main problem is that, since we're syncing the whole needles and tests, workers might deadlock themselves from time to time. A shallow copy should work, but I think this is mostly fallout from previous problems. Setting to Feedback; let's monitor.
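
For illustration only, here is a minimal sketch of what a "shallow copy" could look like, assuming the tests and needles are fetched from a plain git repository. The function name shallow_sync and its arguments are hypothetical, and this is not necessarily how the openQA worker or its cache syncs data; it only shows the standard git --depth option, which limits the transfer to the most recent commit.

```shell
#!/bin/sh
# Hypothetical helper (not part of openQA): keep a checkout at the tip of a
# branch while transferring only a single commit of history.
# Usage: shallow_sync <repo-url> <destination-dir> [branch]
shallow_sync() {
    repo_url=$1
    dest=$2
    branch=${3:-master}   # assumed default branch name
    if [ -d "$dest/.git" ]; then
        # Existing checkout: fetch only the tip of the branch, then move to it.
        # Note: reset --hard discards any local changes in the checkout.
        git -C "$dest" fetch --depth 1 origin "$branch" &&
            git -C "$dest" reset --hard FETCH_HEAD
    else
        # Fresh checkout: --depth 1 truncates history to one commit.
        git clone --depth 1 --branch "$branch" "$repo_url" "$dest"
    fi
}
```

Note that git ignores --depth for clones from a plain local path, so a local test would need a file:// URL. Whether this would actually avoid the suspected deadlock is untested here.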

Updated by szarate over 6 years ago

Project changed from openQA Tests (public) to openQA Project (public)
Category changed from Infrastructure to 168
Priority changed from Urgent to Normal

Updated by szarate over 6 years ago

Status changed from Feedback to Resolved

I haven't seen any incompletes due to abnormal situations in the worker. Please reopen if you see the same issues. A separate ticket will be opened to address the possible deadlock situation.

#10

Updated by szarate over 6 years ago

Related to action #39833: [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping added

#11

Updated by szarate over 6 years ago

Related to action #40001: [negotiation:error] [pid 9953] [client <ip_of_openqaworker4>:35634] AH00690: no acceptable variant: /usr/share/apache2/error/HTTP_BAD_GATEWAY.html.var added

#12

Updated by coolo over 6 years ago

Target version changed from Current Sprint to Done
