action #40103
closed
[o3] openqaworker4 not able to finish any jobs
Updated by szarate over 6 years ago
Currently waiting for jobs:
https://openqa.opensuse.org/tests/740551 (from console)
https://openqa.opensuse.org/tests/740549 (from systemd unit)
I will enable all 16 instances later with the latest job that failed there and see what happens afterwards. The suspicion is that all of them were trying to update at the same time.
Updated by okurz over 6 years ago
- Related to action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out added
Updated by szarate over 6 years ago
As both jobs are passing, the next step is to run the same job on all 16 instances, to rule out that the cache is deadlocking itself :)
for i in $(seq 1 16); do
    ./script/clone_job.pl --skip-deps --skip-download --skip-chained-deps \
        --from https://openqa.opensuse.org 740549 --host https://openqa.opensuse.org \
        _GROUP="Development Tumbleweed" BUILD=1119.5:poo40103 \
        TEST=poo_40103_investigation_$i NAME=poo_40103_investigation_$i \
        WORKER_CLASS=openqaworker4
done
Updated by szarate over 6 years ago
So, after looking closer: when all workers are started at the same time and pick up jobs at the same time, some jobs take a bit too long, mostly because syncing the needles and the git repo is slow, and the webUI decides to kill those jobs. I'm currently looking for possible solutions, since this is fairly easy to reproduce.
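One possible mitigation (not something this ticket settled on) would be to stagger the worker starts so the instances do not all begin syncing at once. The sketch below only prints the commands it would run. The helper name `stagger_start_plan`, the unit name `openqa-worker@N` and the 30 second delay are assumptions for illustration and should be checked against the actual worker setup.

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) the commands that would start
# N worker instances with a fixed pause between them.
# Assumptions: unit name "openqa-worker@N", delay given in seconds.
stagger_start_plan() {
    instances=$1
    delay=$2
    i=1
    while [ "$i" -le "$instances" ]; do
        echo "systemctl start openqa-worker@$i"
        # No pause is needed after the last instance.
        if [ "$i" -lt "$instances" ]; then
            echo "sleep $delay"
        fi
        i=$((i + 1))
    done
}

# Show the plan for the 16 instances mentioned in this ticket.
stagger_start_plan 16 30
```

Printing the plan first makes it easy to review before anything is touched; piping the output to `sh` would then execute it.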
Updated by szarate over 6 years ago
- Status changed from In Progress to Feedback
Looks like the main problem is that, since we're syncing the whole needles and tests repositories, workers might deadlock themselves from time to time. A shallow copy should work around that, but I think this is mostly fallout from previous problems. Setting to Feedback; let's monitor.
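To illustrate what a shallow copy could look like, assuming the tests and needles are fetched with git: `git clone --depth 1` transfers only the most recent commit instead of the full history. This is a minimal sketch, not how the worker cache actually syncs; the function name `shallow_clone` and the repository URL in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: clone only the latest commit of a repository.
# --depth 1 truncates the history to a single commit, which reduces the
# amount of data each worker has to transfer.
shallow_clone() {
    repo_url=$1
    dest_dir=$2
    git clone --depth 1 "$repo_url" "$dest_dir"
}

# Usage (placeholder URL and path, adjust to the real repository):
#   shallow_clone https://example.invalid/tests.git /tmp/tests
```

A shallow clone reduces the time each worker spends syncing; whether that is enough to avoid the killed jobs would still need to be verified on openqaworker4.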
Updated by szarate over 6 years ago
- Project changed from openQA Tests (public) to openQA Project (public)
- Category changed from Infrastructure to 168
- Priority changed from Urgent to Normal
Updated by szarate over 6 years ago
- Status changed from Feedback to Resolved
I haven't seen any incompletes due to abnormal situations in the worker. Please reopen if you see the same issues again. A separate ticket will be opened to address the possible deadlock situation.
Updated by szarate over 6 years ago
- Related to action #39833: [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping added
Updated by szarate over 6 years ago
- Related to action #40001: [negotiation:error] [pid 9953] [client <ip_of_openqaworker4>:35634] AH00690: no acceptable variant: /usr/share/apache2/error/HTTP_BAD_GATEWAY.html.var added
Updated by coolo over 6 years ago
- Target version changed from Current Sprint to Done