Project

General

Profile

Actions

action #68131

closed

[cache] Cache service misses requests from workers

Added by ggardet_arm over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2020-06-16
Due date:
% Done:

0%

Estimated time:

Description

Sometimes, the cache service misses requests from workers.
Worker log shows request has been sent:

Jun 16 09:05:00 oss-apollo7004 worker[4316]: [info] Downloading openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso, request #526 sent to Cache Service

But cache service minion never received it.

Current problem is on oss-apollo7004 worker (Tumbleweed based), but I already noticed this problem on openqa-aarch64 worker (Leap 15.1 based) some days ago.


Related issues 1 (0 open1 closed)

Related to openQA Project (public) - action #67975: Test running since 6 hours but not really startedResolvedokurz2020-06-11

Actions
Actions #1

Updated by ggardet_arm over 4 years ago

  • Related to action #67975: Test running since 6 hours but not really started added
Actions #2

Updated by ggardet_arm over 4 years ago

When it happened on oss-apollo7004 workers, cache limit was: 47GB used on a limit of 50GB. Maybe we have a problem when limit is (nearly) reached?
I increased the size limit and rebooted the machine, but the same problem persists.

Actions #3

Updated by kraih over 4 years ago

  • Assignee set to kraih

The cache service minion has to have received it, since it assigned the minion job id #526. I'm currently working on a patch that will catch more cache service/worker communication exceptions, that might help with these cases.

Actions #4

Updated by ggardet_arm over 4 years ago

/usr/share/openqa/script/openqa-workercache minion job 526 returns:

{
  "args" => [
    1302670,
    "iso",
    "openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso",
    "https://openqa.opensuse.org"
  ],
  "attempts" => 1,
  "children" => [],
  "created" => "2020-06-16T09:05:00Z",
  "delayed" => "2020-06-16T09:05:00Z",
  "finished" => "2020-06-16T09:05:01Z",
  "id" => 526,
  "notes" => {
    "downloading_job" => 504,
    "lock" => "openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso.https://openqa.opensuse.org",
    "output" => "Asset \"openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso\" was downloaded by #504, details are therefore unavailable here"
  },
  "parents" => [],
  "priority" => 0,
  "queue" => "default",
  "result" => undef,
  "retried" => undef,
  "retries" => 0,
  "started" => "2020-06-16T09:05:01Z",
  "state" => "finished",
  "task" => "cache_asset",
  "time" => "2020-06-16T12:03:31Z",
  "worker" => 22
}

And for #504:

{
  "args" => [
    1302489,
    "iso",
    "openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso",
    "https://openqa.opensuse.org"
  ],
  "attempts" => 1,
  "children" => [],
  "created" => "2020-06-16T07:09:10Z",
  "delayed" => "2020-06-16T07:09:10Z",
  "finished" => undef,
  "id" => 504,
  "notes" => {
    "lock" => "openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso.https://openqa.opensuse.org"
  },
  "parents" => [],
  "priority" => 0,
  "queue" => "default",
  "result" => undef,
  "retried" => undef,
  "retries" => 0,
  "started" => "2020-06-16T07:09:10Z",
  "state" => "active",
  "task" => "cache_asset",
  "time" => "2020-06-16T12:06:51Z",
  "worker" => 14
}
Actions #5

Updated by kraih over 4 years ago

That job information for #504 looks like an unclean service shutdown. As if the job was killed while it was still running. To leave the job in an active state the manager process would have to be killed before the job process (otherwise the manager records the dead job process).

Actions #6

Updated by ggardet_arm over 4 years ago

As suggested on IRC by kraih, I workarounded with:

sudo -u _openqa-worker /usr/share/openqa/script/openqa-workercache minion job -e cache_asset -a '[1302489,"iso","openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso","https://openqa.opensuse.org"]' -n '{"lock":"just_a_test"}'

Also needed to drop the lock:

sudo -u _openqa-worker /usr/share/openqa/script/openqa-workercache minion job -U 'cache_openSUSE-Tumbleweed-DVD-aarch64-Snapshot20200615-Media.iso.https://openqa.opensuse.org
Actions #7

Updated by kraih over 4 years ago

I've opened a PR to clean up locks on restart, that will help a bit with the problem. https://github.com/os-autoinst/openQA/pull/3177

Also found an upstream bug in Minion that might have been the main cause for unclean restarts. https://github.com/mojolicious/minion/compare/c8cf67e0deaa...bc6def3b3bc9

Actions #8

Updated by kraih over 4 years ago

  • Status changed from New to In Progress
Actions #9

Updated by kraih over 4 years ago

  • Status changed from In Progress to Feedback

Together with the recent rework of the cache service client exception handling, i think this ticket should be resolved now. If not we should at least get actionable error messages in the future.

Actions #10

Updated by okurz over 4 years ago

  • Status changed from Feedback to Resolved

I assume this problem is gone. From recent review of problems on production I have not seen this again but only more specific cases.

Actions #11

Updated by ggardet_arm over 4 years ago

  • Status changed from Resolved to New

Happened today with https://openqa.opensuse.org/tests/1344801

Log from the worker:

juil. 28 12:26:55 siodtw01 worker[1770]: [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/259
juil. 28 12:26:56 siodtw01 worker[1770]: [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 259
juil. 28 12:26:56 siodtw01 worker[1773]: [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/273 finished by remote side with code 1006, no reason - trying again in 10 seconds
juil. 28 12:26:56 siodtw01 worker[1709]: [info] Accepting job 1344802 from queue
juil. 28 12:26:56 siodtw01 worker[1709]: [error] Unable to accept job 1344802 because the websocket connection to https://openqa.opensuse.org has been lost.
juil. 28 12:26:56 siodtw01 worker[1709]: [info] Skipping job 1344803 from queue (parent faild with result api-failure)
juil. 28 12:26:56 siodtw01 worker[1709]: [info] Skipping job 1344801 from queue (parent faild with result skipped)
juil. 28 12:26:57 siodtw01 worker[1709]: [info] Skipping job 1344804 from queue (parent faild with result skipped)
juil. 28 12:26:59 siodtw01 worker[1773]: [info] Registering with openQA http://192.168.0.28
juil. 28 12:26:59 siodtw01 worker[1770]: [info] Registering with openQA http://192.168.0.28
juil. 28 12:27:00 siodtw01 worker[1709]: [info] Registering with openQA https://openqa.opensuse.org
juil. 28 12:27:01 siodtw01 worker[1709]: [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/258
Actions #12

Updated by okurz over 4 years ago

  • Target version set to Ready
Actions #13

Updated by kraih over 4 years ago

ggardet_arm wrote:

Happened today with https://openqa.opensuse.org/tests/1344801

Log from the worker:

juil. 28 12:26:55 siodtw01 worker[1770]: [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/259
juil. 28 12:26:56 siodtw01 worker[1770]: [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 259
juil. 28 12:26:56 siodtw01 worker[1773]: [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/273 finished by remote side with code 1006, no reason - trying again in 10 seconds
juil. 28 12:26:56 siodtw01 worker[1709]: [info] Accepting job 1344802 from queue
juil. 28 12:26:56 siodtw01 worker[1709]: [error] Unable to accept job 1344802 because the websocket connection to https://openqa.opensuse.org has been lost.
juil. 28 12:26:56 siodtw01 worker[1709]: [info] Skipping job 1344803 from queue (parent faild with result api-failure)
juil. 28 12:26:56 siodtw01 worker[1709]: [info] Skipping job 1344801 from queue (parent faild with result skipped)
juil. 28 12:26:57 siodtw01 worker[1709]: [info] Skipping job 1344804 from queue (parent faild with result skipped)
juil. 28 12:26:59 siodtw01 worker[1773]: [info] Registering with openQA http://192.168.0.28
juil. 28 12:26:59 siodtw01 worker[1770]: [info] Registering with openQA http://192.168.0.28
juil. 28 12:27:00 siodtw01 worker[1709]: [info] Registering with openQA https://openqa.opensuse.org
juil. 28 12:27:01 siodtw01 worker[1709]: [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/258

The test you linked has no logs in its assets. And the logs pasted don't mention the cache service. How is this cache service related at all?

Actions #14

Updated by ggardet_arm over 4 years ago

hmm, looks like I mixed 2 tickets.

Actions #15

Updated by okurz over 4 years ago

  • Subject changed from Cache service misses requests from workers to [cache] Cache service misses requests from workers
  • Status changed from New to Resolved

as the last comment was about a mixup let's try again as resolved as done in #68131#note-10

Actions

Also available in: Atom PDF