action #40103

closed

[o3] openqaworker4 not able to finish any jobs

Added by okurz over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2018-08-22
Due date:
% Done:

0%

Estimated time:

Related issues 3 (0 open, 3 closed)

Related to openQA Project - action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out (Resolved, okurz, 2018-08-15)

Related to openQA Project - action #39833: [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping (Resolved, EDiGiacinto, 2018-08-16)

Related to openQA Project - action #40001: [negotiation:error] [pid 9953] [client <ip_of_openqaworker4>:35634] AH00690: no acceptable variant: /usr/share/apache2/error/HTTP_BAD_GATEWAY.html.var (Rejected, 2018-08-20)

Actions #1

Updated by szarate over 5 years ago

  • Target version set to Current Sprint
Actions #2

Updated by szarate over 5 years ago

Currently waiting for jobs:

https://openqa.opensuse.org/tests/740551 (from console)
https://openqa.opensuse.org/tests/740549 (from systemd unit)

Will enable all 16 instances later with the latest job that failed there and see what happens afterwards. The suspicion is that all of them were trying to update.

Actions #3

Updated by okurz over 5 years ago

  • Related to action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out added
Actions #4

Updated by okurz over 5 years ago

both jobs passed

Actions #5

Updated by szarate over 5 years ago

As both jobs are passing, the next step is to run the same job on all 16 instances, to rule out that the cache is deadlocking itself :)

for i in `seq 1 16`; do ./script/clone_job.pl --skip-deps --skip-download --skip-chained-deps --from https://openqa.opensuse.org 740549 --host https://openqa.opensuse.org _GROUP="Development Tumbleweed" BUILD=1119.5:poo40103  TEST=poo_40103_investigation_$i NAME=poo_40103_investigation_$i WORKER_CLASS=openqaworker4; done
Actions #6

Updated by szarate over 5 years ago

So, on closer inspection: when all workers are started at the same time and pick up jobs at the same time, some jobs take slightly too long, and the webUI decides to kill them (mostly because syncing the needles and the git repo takes too long). I'm currently looking into possible solutions, since this is fairly easy to reproduce.

https://openqa.opensuse.org/tests/overview?build=1119.5%3Apoo40103&groupid=38&distri=opensuse&version=Staging%3AH
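
One way to keep the instances from all syncing at the same moment is to start them one after another with a pause in between. This is only a sketch, not what was deployed on openqaworker4: the helper names `run` and `stagger_start`, the `DRY_RUN` switch, and the 30 second delay are made up for illustration, and it assumes the instances run as `openqa-worker@N` systemd units.

```shell
#!/bin/bash
# Hypothetical helper: print a command when DRY_RUN=1 (the default),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: start worker instances 1..N one at a time, pausing
# DELAY seconds between starts so the initial syncs do not all overlap.
# Assumes the instances are systemd units named openqa-worker@<number>.
stagger_start() {
    local instances=$1 delay=$2 i
    for i in $(seq 1 "$instances"); do
        run systemctl start "openqa-worker@$i"
        # No pause is needed after the last instance.
        if [ "$i" -lt "$instances" ]; then
            run sleep "$delay"
        fi
    done
}
```

Calling `stagger_start 16 30` only prints the commands; `DRY_RUN=0 stagger_start 16 30` would actually run them. Whether a fixed delay is enough depends on how long the first sync takes, which this ticket does not quantify.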

Actions #7

Updated by szarate over 5 years ago

  • Status changed from In Progress to Feedback

It looks like the main problem is that we sync the complete needles and tests, so from time to time workers might deadlock themselves. A shallow copy should work, but I think this is mostly fallout from previous problems. Setting to Feedback; let's monitor.
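
To illustrate what a shallow copy could look like, here is a minimal sketch assuming the tests and needles are fetched with plain git. The ticket does not show how the worker cache actually syncs, so this is not the existing mechanism; `shallow_sync` and its arguments are hypothetical names.

```shell
#!/bin/bash
# Hypothetical helper: keep only the latest commit of a repository in DEST.
# Assumption: the content is available via git; the real worker sync
# mechanism is not shown in this ticket.
shallow_sync() {
    local repo_url=$1 dest=$2
    if [ -d "$dest/.git" ]; then
        # Existing checkout: fetch just the newest commit and move to it.
        git -C "$dest" fetch --quiet --depth 1 origin HEAD &&
            git -C "$dest" reset --quiet --hard FETCH_HEAD
    else
        # First run: clone without history.
        git clone --quiet --depth 1 "$repo_url" "$dest"
    fi
}
```

Usage would be `shallow_sync <repo_url> <target_dir>`. Note that `--depth` is ignored for clones from a plain local path, so a local source has to be given as a `file://` URL for the copy to be shallow.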

Actions #8

Updated by szarate over 5 years ago

  • Project changed from openQA Tests to openQA Project
  • Category changed from Infrastructure to 168
  • Priority changed from Urgent to Normal
Actions #9

Updated by szarate over 5 years ago

  • Status changed from Feedback to Resolved

I haven't seen any incompletes due to abnormal situations in the worker. Please reopen if you see the same issues. A separate ticket will be opened to address the possible deadlock situation.

Actions #10

Updated by szarate over 5 years ago

  • Related to action #39833: [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping added
Actions #11

Updated by szarate over 5 years ago

  • Related to action #40001: [negotiation:error] [pid 9953] [client <ip_of_openqaworker4>:35634] AH00690: no acceptable variant: /usr/share/apache2/error/HTTP_BAD_GATEWAY.html.var added
Actions #12

Updated by coolo over 5 years ago

  • Target version changed from Current Sprint to Done