action #40103

[o3] openqaworker4 not able to finish any jobs

Added by okurz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Start date: 22/08/2018
Priority: Normal
Due date:
Assignee: szarate
% Done: 0%
Category: Feature requests
Target version: Done
Difficulty:
Duration:

Related issues

Related to openQA Project - action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway ... Resolved 15/08/2018
Related to openQA Project - action #39833: [tools] When a worker is abruptly killed, jobs get blocke... Resolved 16/08/2018
Related to openQA Project - action #40001: [negotiation:error] [pid 9953] [client <ip_of_openqaworke... Rejected 20/08/2018

History

#1 Updated by szarate over 1 year ago

  • Target version set to Current Sprint

#2 Updated by szarate over 1 year ago

Currently waiting for jobs:

https://openqa.opensuse.org/tests/740551 (from console)
https://openqa.opensuse.org/tests/740549 (from systemd unit)

I will enable all 16 instances later, using the latest job that failed there, and see what happens afterwards. The suspicion is that all of them were trying to update at the same time.

#3 Updated by okurz over 1 year ago

  • Related to action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out added

#4 Updated by okurz over 1 year ago

both jobs passed

#5 Updated by szarate over 1 year ago

As both jobs are passing, the next step is to run the same job on all 16 instances, to rule out that the cache is deadlocking itself :)

for i in $(seq 1 16); do ./script/clone_job.pl --skip-deps --skip-download --skip-chained-deps --from https://openqa.opensuse.org 740549 --host https://openqa.opensuse.org _GROUP="Development Tumbleweed" BUILD=1119.5:poo40103 TEST=poo_40103_investigation_$i NAME=poo_40103_investigation_$i WORKER_CLASS=openqaworker4; done

#6 Updated by szarate over 1 year ago

So, after looking closer: when all workers are started at the same time and pick up jobs at the same time, some jobs take slightly too long and the webUI decides to kill them (mostly because syncing the needles and the git repo takes too long). I'm currently looking for possible solutions, since this is fairly easy to reproduce.

https://openqa.opensuse.org/tests/overview?build=1119.5%3Apoo40103&groupid=38&distri=opensuse&version=Staging%3AH

#7 Updated by szarate over 1 year ago

  • Status changed from In Progress to Feedback

It looks like the main problem is that, since we're syncing the whole needles and tests, workers might occasionally deadlock themselves. A shallow copy should work, but I think this is mostly fallout from previous problems. Setting to Feedback; let's monitor.
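
For illustration only: if "shallow copy" is read as a git shallow clone (an assumption, the comment above does not say how the sync is implemented), the difference from a full clone can be sketched as below. The throwaway repository, file names and paths are hypothetical stand-ins for the real needles and tests repositories.

```shell
# Sketch under assumptions: a temporary local repo stands in for the real
# needles/tests repositories; all names and paths here are hypothetical.
workdir=$(mktemp -d)
git init -q "$workdir/origin"
for i in 1 2 3; do
    echo "needle $i" > "$workdir/origin/needle_$i.json"
    git -C "$workdir/origin" add .
    git -C "$workdir/origin" -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "add needle $i"
done

# Full clone: transfers the complete history.
git clone -q "file://$workdir/origin" "$workdir/full"

# Shallow clone: --depth 1 transfers only the most recent commit.
# A file:// URL is used because git ignores --depth for plain local paths.
git clone -q --depth 1 "file://$workdir/origin" "$workdir/shallow"

full_commits=$(git -C "$workdir/full" rev-list --count HEAD)
shallow_commits=$(git -C "$workdir/shallow" rev-list --count HEAD)
echo "full clone: $full_commits commits, shallow clone: $shallow_commits commit(s)"
```

The shallow clone carries a single commit instead of the full history, which is why it would be expected to shorten the sync; whether that is enough to avoid the kills described above is not confirmed in this ticket.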

#8 Updated by szarate over 1 year ago

  • Project changed from openQA Tests to openQA Project
  • Category changed from Infrastructure to 168
  • Priority changed from Urgent to Normal

#9 Updated by szarate over 1 year ago

  • Status changed from Feedback to Resolved

I haven't seen any incompletes due to abnormal situations in the worker. Please reopen if you see the same issues. A separate ticket will be opened to address the possible deadlock situation.

#10 Updated by szarate over 1 year ago

  • Related to action #39833: [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping added

#11 Updated by szarate over 1 year ago

  • Related to action #40001: [negotiation:error] [pid 9953] [client <ip_of_openqaworker4>:35634] AH00690: no acceptable variant: /usr/share/apache2/error/HTTP_BAD_GATEWAY.html.var added

#12 Updated by coolo over 1 year ago

  • Target version changed from Current Sprint to Done
