action #39833
[tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping
Status: closed
Description
For example, this job: http://10.160.66.74/tests/787
Workaround
Delete the file /var/lib/openqa/cache/cache.sqlite and restart the workers.
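A minimal shell sketch of that workaround, assuming systemd-managed worker instances named openqa-worker@N (the instance numbers are an assumption, adjust them to the host):

# Stop the worker instances first so nothing keeps the cache database open
# (openqa-worker@{1..4} is an assumed range of instances on this host)
sudo systemctl stop openqa-worker@{1..4}
# Remove the cache database that still marks the asset as "being downloaded"
sudo rm /var/lib/openqa/cache/cache.sqlite
# Start the workers again; the cache database should be recreated on first use
sudo systemctl start openqa-worker@{1..4}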
Updated by EDiGiacinto over 6 years ago
From the link you gave, the webui is at Version 4.6.1534341932.1ded71aa.
Is there some particular reason to run such an old worker? It basically seems to be 1 year old: https://github.com/os-autoinst/openQA/commits/89b04ed8
Updated by SLindoMansilla over 6 years ago
- File openqa-resource-allocator.log added
- File openqa-scheduler added
- File openqa-websockets.log added
- File openqa-webui.log added
- File openqa-worker.log added
There is no reason; I have updated to the latest package version and rebooted the system. How can it be that the worker is so old?
sergio@sergio-latitude:~$ sudo zypper se -si openQA
Loading repository data...
Reading installed packages...
S | Name | Type | Version | Arch | Repository
---+----------------------+---------+-------------------------------+--------+--------------------------
i+ | openQA | package | 4.6.1534341932.1ded71aa-743.1 | noarch | devel-openQA
i | openQA-client | package | 4.6.1534341932.1ded71aa-743.1 | noarch | devel-openQA
i | openQA-common | package | 4.6.1534341932.1ded71aa-743.1 | noarch | devel-openQA
i | openQA-local-db | package | 4.6.1534341932.1ded71aa-743.1 | noarch | devel-openQA
i+ | openQA-worker | package | 4.6.1534341932.1ded71aa-743.1 | noarch | devel-openQA
i+ | python-openqa_review | package | 1.10.0-6.1 | noarch | openSUSE-Leap-42.3-Update
Updated by SLindoMansilla over 6 years ago
- Subject changed from With openQA and openQA-worker 4.4.1497257618.89b04ed8-1.1, workers pick jobs but do nothing to With openQA and openQA-worker 4.6.1534341932.1ded71aa-743.1, workers pick jobs but do nothing
Sorry, I copied the wrong version number.
Updated by EDiGiacinto over 6 years ago
Looks like one of the instances is downloading the ISO.
Aug 16 10:54:15 sergio-latitude worker[7316]: [info] [pid:7316] CACHE: Being downloaded by another worker, sleeping.
If the worker is actually not downloading it: did you by any chance brutally kill one of the worker instances while the download was in progress? If so, you might need to clean up the cache.
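Before wiping the whole cache, a quick check like the following can show whether the "being downloaded" marker is stale; this is only a sketch, assuming lsof and sqlite3 are available on the worker host:

# List processes that still hold files open under the cache directory;
# if nothing shows up, no download is actually in progress
sudo lsof +D /var/lib/openqa/cache
# Dump the cache database to inspect what is still recorded as being downloaded
sudo sqlite3 /var/lib/openqa/cache/cache.sqlite .dump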
Updated by SLindoMansilla over 6 years ago
- Subject changed from With openQA and openQA-worker 4.6.1534341932.1ded71aa-743.1, workers pick jobs but do nothing to [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping
- Description updated (diff)
It may be related to the shutdown of SRV2. My Loewe workers were down. That may have caused the issue.
Updated by okurz over 6 years ago
- Related to action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out added
Updated by okurz over 6 years ago
- Priority changed from Normal to High
Observed on production instance o3 as well now: #39743#note-27
Updated by EDiGiacinto over 6 years ago
- Related to action #39980: Cache locks assets when worker dies in critical section added
Updated by szarate over 6 years ago
- Related to action #40103: [o3] openqaworker4 not able to finish any jobs added
Updated by szarate over 6 years ago
Updating the cache database is a workaround for the problem, but it can still happen.
The worker cache needs to evolve and be decoupled so that only one thing is doing the download (the tests are a separate story)
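A minimal sketch of the "only one thing is doing the download" idea using flock(1); this is purely illustrative and not how openQA implements or plans to implement the cache service, and the cache path, lock file name, and asset URL are assumptions:

#!/bin/sh
# Single-downloader sketch: whoever gets the file lock downloads the asset,
# everyone else backs off and retries later instead of sleeping forever.
CACHE_DIR=/var/lib/openqa/cache
ASSET=$1

exec 9>"$CACHE_DIR/.download.lock"
if flock -n 9; then
    # We own the lock: fetch the asset into the shared cache
    wget -O "$CACHE_DIR/$ASSET" "http://openqa.example.com/assets/iso/$ASSET"
else
    echo "another downloader holds the lock, retrying later"
fi

The relevant property of a file lock like this is that the kernel releases it as soon as the holding process dies, so a SIGKILLed worker cannot leave the other workers sleeping on a stale "being downloaded" marker.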
Updated by szarate over 6 years ago
- Related to action #40004: worker continues to work on job which he as well as the webui considers dead added
Updated by szarate over 6 years ago
- Related to action #39905: Job trying to download worker local file "aavmf-aarch64-vars.bin" into cache and fails with 404 added
Updated by szarate over 6 years ago
Somehow this is happening even more often: https://openqa.suse.de/tests/1999655, even on workers that were not killed...
Updated by EDiGiacinto over 6 years ago
Was the cache cleaned up on ow6 since https://progress.opensuse.org/issues/39980? If not, then it might still be locked since then.
Updated by EDiGiacinto over 6 years ago
- Status changed from New to In Progress
- Assignee set to EDiGiacinto
Looking at it. We can maybe reduce (not fix) the impact of the bug temporarily with a workaround before moving the cache to a service (which requires more time).
Updated by szarate over 6 years ago
- Target version changed from Ready to Current Sprint
Updated by EDiGiacinto about 6 years ago
PR with a workaround (which was meant to be merged if the problem becomes more frequent): https://github.com/os-autoinst/openQA/pull/1764
But that won't prevent deadlocks when workers get SIGKILLed. For a long-term solution, which is still being worked on, see: https://github.com/os-autoinst/openQA/pull/1783 and #39980
Updated by EDiGiacinto about 6 years ago
- Status changed from In Progress to Feedback
https://github.com/os-autoinst/openQA/pull/1783 has been merged, setting to Feedback
Updated by coolo about 6 years ago
- Status changed from Feedback to Resolved
- Target version changed from Current Sprint to Done