action #68131
closed[cache] Cache service misses requests from workers
Description
Sometimes, the cache service misses requests from workers.
Worker log shows request has been sent:
Jun 16 09:05:00 oss-apollo7004 worker[4316]: [info] Downloading openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso, request #526 sent to Cache Service
But cache service minion never received it.
Current problem is on oss-apollo7004
worker (Tumbleweed based), but I already noticed this problem on openqa-aarch64
worker (Leap 15.1 based) some days ago.
Updated by ggardet_arm over 4 years ago
- Related to action #67975: Test running since 6 hours but not really started added
Updated by ggardet_arm over 4 years ago
When it happened on oss-apollo7004
workers, cache limit was: 47GB used on a limit of 50GB. Maybe we have a problem when limit is (nearly) reached?
I increased the size limit and rebooted the machine, but the same problem persists.
Updated by kraih over 4 years ago
- Assignee set to kraih
The cache service minion has to have received it, since it assigned the minion job id #526
. I'm currently working on a patch that will catch more cache service/worker communication exceptions, that might help with these cases.
Updated by ggardet_arm over 4 years ago
/usr/share/openqa/script/openqa-workercache minion job 526
returns:
{
"args" => [
1302670,
"iso",
"openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso",
"https://openqa.opensuse.org"
],
"attempts" => 1,
"children" => [],
"created" => "2020-06-16T09:05:00Z",
"delayed" => "2020-06-16T09:05:00Z",
"finished" => "2020-06-16T09:05:01Z",
"id" => 526,
"notes" => {
"downloading_job" => 504,
"lock" => "openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso.https://openqa.opensuse.org",
"output" => "Asset \"openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso\" was downloaded by #504, details are therefore unavailable here"
},
"parents" => [],
"priority" => 0,
"queue" => "default",
"result" => undef,
"retried" => undef,
"retries" => 0,
"started" => "2020-06-16T09:05:01Z",
"state" => "finished",
"task" => "cache_asset",
"time" => "2020-06-16T12:03:31Z",
"worker" => 22
}
And for #504:
{
"args" => [
1302489,
"iso",
"openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso",
"https://openqa.opensuse.org"
],
"attempts" => 1,
"children" => [],
"created" => "2020-06-16T07:09:10Z",
"delayed" => "2020-06-16T07:09:10Z",
"finished" => undef,
"id" => 504,
"notes" => {
"lock" => "openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso.https://openqa.opensuse.org"
},
"parents" => [],
"priority" => 0,
"queue" => "default",
"result" => undef,
"retried" => undef,
"retries" => 0,
"started" => "2020-06-16T07:09:10Z",
"state" => "active",
"task" => "cache_asset",
"time" => "2020-06-16T12:06:51Z",
"worker" => 14
}
Updated by kraih over 4 years ago
That job information for #504
looks like an unclean service shutdown. As if the job was killed while it was still running. To leave the job in an active
state the manager process would have to be killed before the job process (otherwise the manager records the dead job process).
Updated by ggardet_arm over 4 years ago
As suggested on IRC by kraih
, I workarounded with:
sudo -u _openqa-worker /usr/share/openqa/script/openqa-workercache minion job -e cache_asset -a '[1302489,"iso","openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso","https://openqa.opensuse.org"]' -n '{"lock":"just_a_test"}'
Also needed to drop the lock:
sudo -u _openqa-worker /usr/share/openqa/script/openqa-workercache minion job -U 'cache_openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso.https://openqa.opensuse.org
Updated by kraih over 4 years ago
I've opened a PR to clean up locks on restart, that will help a bit with the problem. https://github.com/os-autoinst/openQA/pull/3177
Also found an upstream bug in Minion that might have been the main cause for unclean restarts. https://github.com/mojolicious/minion/compare/c8cf67e0deaa...bc6def3b3bc9
Updated by kraih over 4 years ago
- Status changed from In Progress to Feedback
Together with the recent rework of the cache service client exception handling, i think this ticket should be resolved now. If not we should at least get actionable error messages in the future.
Updated by okurz over 4 years ago
- Status changed from Feedback to Resolved
I assume this problem is gone. From recent review of problems on production I have not seen this again but only more specific cases.
Updated by ggardet_arm over 4 years ago
- Status changed from Resolved to New
Happened today with https://openqa.opensuse.org/tests/1344801
Log from the worker:
juil. 28 12:26:55 siodtw01 worker[1770]: [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/259
juil. 28 12:26:56 siodtw01 worker[1770]: [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 259
juil. 28 12:26:56 siodtw01 worker[1773]: [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/273 finished by remote side with code 1006, no reason - trying again in 10 seconds
juil. 28 12:26:56 siodtw01 worker[1709]: [info] Accepting job 1344802 from queue
juil. 28 12:26:56 siodtw01 worker[1709]: [error] Unable to accept job 1344802 because the websocket connection to https://openqa.opensuse.org has been lost.
juil. 28 12:26:56 siodtw01 worker[1709]: [info] Skipping job 1344803 from queue (parent faild with result api-failure)
juil. 28 12:26:56 siodtw01 worker[1709]: [info] Skipping job 1344801 from queue (parent faild with result skipped)
juil. 28 12:26:57 siodtw01 worker[1709]: [info] Skipping job 1344804 from queue (parent faild with result skipped)
juil. 28 12:26:59 siodtw01 worker[1773]: [info] Registering with openQA http://192.168.0.28
juil. 28 12:26:59 siodtw01 worker[1770]: [info] Registering with openQA http://192.168.0.28
juil. 28 12:27:00 siodtw01 worker[1709]: [info] Registering with openQA https://openqa.opensuse.org
juil. 28 12:27:01 siodtw01 worker[1709]: [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/258
Updated by kraih over 4 years ago
ggardet_arm wrote:
Happened today with https://openqa.opensuse.org/tests/1344801
Log from the worker:
juil. 28 12:26:55 siodtw01 worker[1770]: [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/259 juil. 28 12:26:56 siodtw01 worker[1770]: [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 259 juil. 28 12:26:56 siodtw01 worker[1773]: [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/273 finished by remote side with code 1006, no reason - trying again in 10 seconds juil. 28 12:26:56 siodtw01 worker[1709]: [info] Accepting job 1344802 from queue juil. 28 12:26:56 siodtw01 worker[1709]: [error] Unable to accept job 1344802 because the websocket connection to https://openqa.opensuse.org has been lost. juil. 28 12:26:56 siodtw01 worker[1709]: [info] Skipping job 1344803 from queue (parent faild with result api-failure) juil. 28 12:26:56 siodtw01 worker[1709]: [info] Skipping job 1344801 from queue (parent faild with result skipped) juil. 28 12:26:57 siodtw01 worker[1709]: [info] Skipping job 1344804 from queue (parent faild with result skipped) juil. 28 12:26:59 siodtw01 worker[1773]: [info] Registering with openQA http://192.168.0.28 juil. 28 12:26:59 siodtw01 worker[1770]: [info] Registering with openQA http://192.168.0.28 juil. 28 12:27:00 siodtw01 worker[1709]: [info] Registering with openQA https://openqa.opensuse.org juil. 28 12:27:01 siodtw01 worker[1709]: [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/258
The test you linked has no logs in its assets. And the logs pasted don't mention the cache service. How is this cache service related at all?
Updated by okurz over 4 years ago
- Subject changed from Cache service misses requests from workers to [cache] Cache service misses requests from workers
- Status changed from New to Resolved
as the last comment was about a mixup let's try again as resolved as done in #68131#note-10