action #68131
[cache] Cache service misses requests from workers
0%
Description
Sometimes, the cache service misses requests from workers.
Worker log shows request has been sent:
Jun 16 09:05:00 oss-apollo7004 worker[4316]: [info] Downloading openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso, request #526 sent to Cache Service
But cache service minion never received it.
Current problem is on oss-apollo7004
worker (Tumbleweed based), but I already noticed this problem on openqa-aarch64
worker (Leap 15.1 based) some days ago.
Related issues
History
#1
Updated by ggardet_arm 7 months ago
- Related to action #67975: Test running since 6 hours but not really started added
#2
Updated by ggardet_arm 7 months ago
When it happened on oss-apollo7004
workers, cache limit was: 47GB used on a limit of 50GB. Maybe we have a problem when limit is (nearly) reached?
I increased the size limit and rebooted the machine, but the same problem persists.
#4
Updated by ggardet_arm 7 months ago
/usr/share/openqa/script/openqa-workercache minion job 526
returns:
{ "args" => [ 1302670, "iso", "openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso", "https://openqa.opensuse.org" ], "attempts" => 1, "children" => [], "created" => "2020-06-16T09:05:00Z", "delayed" => "2020-06-16T09:05:00Z", "finished" => "2020-06-16T09:05:01Z", "id" => 526, "notes" => { "downloading_job" => 504, "lock" => "openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso.https://openqa.opensuse.org", "output" => "Asset \"openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso\" was downloaded by #504, details are therefore unavailable here" }, "parents" => [], "priority" => 0, "queue" => "default", "result" => undef, "retried" => undef, "retries" => 0, "started" => "2020-06-16T09:05:01Z", "state" => "finished", "task" => "cache_asset", "time" => "2020-06-16T12:03:31Z", "worker" => 22 }
And for #504:
{ "args" => [ 1302489, "iso", "openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso", "https://openqa.opensuse.org" ], "attempts" => 1, "children" => [], "created" => "2020-06-16T07:09:10Z", "delayed" => "2020-06-16T07:09:10Z", "finished" => undef, "id" => 504, "notes" => { "lock" => "openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso.https://openqa.opensuse.org" }, "parents" => [], "priority" => 0, "queue" => "default", "result" => undef, "retried" => undef, "retries" => 0, "started" => "2020-06-16T07:09:10Z", "state" => "active", "task" => "cache_asset", "time" => "2020-06-16T12:06:51Z", "worker" => 14 }
#5
Updated by kraih 7 months ago
That job information for #504
looks like an unclean service shutdown. As if the job was killed while it was still running. To leave the job in an active
state the manager process would have to be killed before the job process (otherwise the manager records the dead job process).
#6
Updated by ggardet_arm 7 months ago
As suggested on IRC by kraih
, I workarounded with:
sudo -u _openqa-worker /usr/share/openqa/script/openqa-workercache minion job -e cache_asset -a '[1302489,"iso","openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso","https://openqa.opensuse.org"]' -n '{"lock":"just_a_test"}'
Also needed to drop the lock:
sudo -u _openqa-worker /usr/share/openqa/script/openqa-workercache minion job -U 'cache_openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso.https://openqa.opensuse.org
#7
Updated by kraih 7 months ago
I've opened a PR to clean up locks on restart, that will help a bit with the problem. https://github.com/os-autoinst/openQA/pull/3177
Also found an upstream bug in Minion that might have been the main cause for unclean restarts. https://github.com/mojolicious/minion/compare/c8cf67e0deaa...bc6def3b3bc9
#11
Updated by ggardet_arm 6 months ago
- Status changed from Resolved to New
Happened today with https://openqa.opensuse.org/tests/1344801
Log from the worker:
juil. 28 12:26:55 siodtw01 worker[1770]: [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/259 juil. 28 12:26:56 siodtw01 worker[1770]: [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 259 juil. 28 12:26:56 siodtw01 worker[1773]: [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/273 finished by remote side with code 1006, no reason - trying again in 10 seconds juil. 28 12:26:56 siodtw01 worker[1709]: [info] Accepting job 1344802 from queue juil. 28 12:26:56 siodtw01 worker[1709]: [error] Unable to accept job 1344802 because the websocket connection to https://openqa.opensuse.org has been lost. juil. 28 12:26:56 siodtw01 worker[1709]: [info] Skipping job 1344803 from queue (parent faild with result api-failure) juil. 28 12:26:56 siodtw01 worker[1709]: [info] Skipping job 1344801 from queue (parent faild with result skipped) juil. 28 12:26:57 siodtw01 worker[1709]: [info] Skipping job 1344804 from queue (parent faild with result skipped) juil. 28 12:26:59 siodtw01 worker[1773]: [info] Registering with openQA http://192.168.0.28 juil. 28 12:26:59 siodtw01 worker[1770]: [info] Registering with openQA http://192.168.0.28 juil. 28 12:27:00 siodtw01 worker[1709]: [info] Registering with openQA https://openqa.opensuse.org juil. 28 12:27:01 siodtw01 worker[1709]: [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/258
#13
Updated by kraih 6 months ago
ggardet_arm wrote:
Happened today with https://openqa.opensuse.org/tests/1344801
Log from the worker:
juil. 28 12:26:55 siodtw01 worker[1770]: [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/259 juil. 28 12:26:56 siodtw01 worker[1770]: [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 259 juil. 28 12:26:56 siodtw01 worker[1773]: [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/273 finished by remote side with code 1006, no reason - trying again in 10 seconds juil. 28 12:26:56 siodtw01 worker[1709]: [info] Accepting job 1344802 from queue juil. 28 12:26:56 siodtw01 worker[1709]: [error] Unable to accept job 1344802 because the websocket connection to https://openqa.opensuse.org has been lost. juil. 28 12:26:56 siodtw01 worker[1709]: [info] Skipping job 1344803 from queue (parent faild with result api-failure) juil. 28 12:26:56 siodtw01 worker[1709]: [info] Skipping job 1344801 from queue (parent faild with result skipped) juil. 28 12:26:57 siodtw01 worker[1709]: [info] Skipping job 1344804 from queue (parent faild with result skipped) juil. 28 12:26:59 siodtw01 worker[1773]: [info] Registering with openQA http://192.168.0.28 juil. 28 12:26:59 siodtw01 worker[1770]: [info] Registering with openQA http://192.168.0.28 juil. 28 12:27:00 siodtw01 worker[1709]: [info] Registering with openQA https://openqa.opensuse.org juil. 28 12:27:01 siodtw01 worker[1709]: [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/258
The test you linked has no logs in its assets. And the logs pasted don't mention the cache service. How is this cache service related at all?
#14
Updated by ggardet_arm 6 months ago
hmm, looks like I mixed 2 tickets.
#15
Updated by okurz 6 months ago
- Subject changed from Cache service misses requests from workers to [cache] Cache service misses requests from workers
- Status changed from New to Resolved
as the last comment was about a mixup let's try again as resolved as done in #68131#note-10