action #39833

[tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping

Added by SLindoMansilla over 1 year ago. Updated about 1 year ago.

Status: Resolved
Start date: 16/08/2018
Priority: High
Due date:
Assignee: EDiGiacinto
% Done: 0%
Category: Concrete Bugs
Target version: Done
Difficulty:
Duration:

Description

For example, this job: http://10.160.66.74/tests/787

Workaround

Delete the file /var/lib/openqa/cache/cache.sqlite and restart the workers.
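
A minimal sketch of that workaround as shell commands, assuming the workers run as the usual systemd template units (openqa-worker@N.service) and that the cache sits in the default location named above; the instance numbers are an assumption and need adjusting per host:

  # Stop all loaded worker instances so nothing holds the cache open
  sudo systemctl stop 'openqa-worker@*'
  # Remove the stale cache database; it should be recreated once the workers run again
  sudo rm /var/lib/openqa/cache/cache.sqlite
  # Start the worker instances again (adjust the instance numbers to this host)
  sudo systemctl start openqa-worker@{1..4}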

openqa-resource-allocator.log (548 Bytes) SLindoMansilla, 16/08/2018 09:00 am

openqa-scheduler (784 Bytes) SLindoMansilla, 16/08/2018 09:00 am

openqa-websockets.log (1.08 KB) SLindoMansilla, 16/08/2018 09:00 am

openqa-webui.log (1.92 KB) SLindoMansilla, 16/08/2018 09:00 am

openqa-worker.log (198 KB) SLindoMansilla, 16/08/2018 09:00 am


Related issues

Related to openQA Project - action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway ... Resolved 15/08/2018
Related to openQA Project - action #39980: Cache locks assets when worker dies in critical section Resolved 02/10/2018
Related to openQA Project - action #40103: [o3] openqaworker4 not able to finish any jobs Resolved 22/08/2018
Related to openQA Project - action #40004: worker continues to work on job which he as well as the w... Resolved 20/08/2018
Related to openQA Project - action #39905: Job trying to download worker local file "aavmf-aarch64-v... Resolved 17/08/2018

History

#1 Updated by EDiGiacinto over 1 year ago

From the link you gave, the webui is at version 4.6.1534341932.1ded71aa

Is there some particular reason to run such an old worker? It basically seems to be a year old: https://github.com/os-autoinst/openQA/commits/89b04ed8

#2 Updated by SLindoMansilla over 1 year ago

There is no reason; I have updated to the latest package version and rebooted the system. How can the worker be so old?

sergio@sergio-latitude:~$ sudo zypper se -si openQA
Loading repository data...
Reading installed packages...

S  | Name                 | Type    | Version                       | Arch   | Repository               
---+----------------------+---------+-------------------------------+--------+--------------------------
i+ | openQA               | package | 4.6.1534341932.1ded71aa-743.1 | noarch | devel-openQA             
i  | openQA-client        | package | 4.6.1534341932.1ded71aa-743.1 | noarch | devel-openQA             
i  | openQA-common        | package | 4.6.1534341932.1ded71aa-743.1 | noarch | devel-openQA             
i  | openQA-local-db      | package | 4.6.1534341932.1ded71aa-743.1 | noarch | devel-openQA             
i+ | openQA-worker        | package | 4.6.1534341932.1ded71aa-743.1 | noarch | devel-openQA             
i+ | python-openqa_review | package | 1.10.0-6.1                    | noarch | openSUSE-Leap-42.3-Update

#3 Updated by SLindoMansilla over 1 year ago

  • Subject changed from With openQA and openQA-worker 4.4.1497257618.89b04ed8-1.1, workers pick jobs but do nothing to With openQA and openQA-worker 4.6.1534341932.1ded71aa-743.1, workers pick jobs but do nothing

Sorry, I copied the wrong version number.

#4 Updated by EDiGiacinto over 1 year ago

Looks like one of the instances is downloading the ISO.

Aug 16 10:54:15 sergio-latitude worker[7316]: [info] [pid:7316] CACHE: Being downloaded by another worker, sleeping.

If the worker is actually not downloading it: did you by any chance brutally kill one of the worker instances while the download was in progress? If so, you might need to clean up the cache.
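
One way to check whether a stale download lock is what keeps the other instances sleeping is to look at the cache database directly. A rough read-only sketch; the 'assets' table name here is an assumption for illustration, not confirmed by this ticket:

  # Dump the cache database schema and contents to spot an asset stuck in a
  # downloading state ('assets' as the table name is an assumption)
  sqlite3 /var/lib/openqa/cache/cache.sqlite '.schema'
  sqlite3 /var/lib/openqa/cache/cache.sqlite 'SELECT * FROM assets;'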

#5 Updated by SLindoMansilla over 1 year ago

  • Subject changed from With openQA and openQA-worker 4.6.1534341932.1ded71aa-743.1, workers pick jobs but do nothing to [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping
  • Description updated (diff)

It may be related to the shutdown of the SRV2. My Loewe workers were down. That may have caused the issue.

#6 Updated by okurz over 1 year ago

  • Related to action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out added

#7 Updated by okurz over 1 year ago

  • Priority changed from Normal to High

Observed on production instance o3 as well now: #39743#note-27

#8 Updated by EDiGiacinto over 1 year ago

  • Related to action #39980: Cache locks assets when worker dies in critical section added

#9 Updated by szarate over 1 year ago

  • Related to action #40103: [o3] openqaworker4 not able to finish any jobs added

#10 Updated by szarate over 1 year ago

Updating the cache database is a workaround for the problem, but it can still happen.

The worker cache needs to evolve and be decoupled so that only one thing does the download (the tests are a separate story).

#11 Updated by szarate over 1 year ago

  • Target version set to Ready

#12 Updated by szarate over 1 year ago

  • Related to action #40004: worker continues to work on job which he as well as the webui considers dead added

#13 Updated by szarate over 1 year ago

  • Related to action #39905: Job trying to download worker local file "aavmf-aarch64-vars.bin" into cache and fails with 404 added

#14 Updated by szarate over 1 year ago

Somehow this is happening even more often: https://openqa.suse.de/tests/1999655, even on workers that were not killed...

#15 Updated by EDiGiacinto over 1 year ago

Was the cache cleaned up on ow6 since https://progress.opensuse.org/issues/39980? If not, it might still be locked since then.

#16 Updated by EDiGiacinto over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to EDiGiacinto

Looking at it. We can maybe reduce (not fix) the impact of the bug temporarily with a workaround before moving the cache to a service (which requires more time).

#17 Updated by szarate over 1 year ago

  • Target version changed from Ready to Current Sprint

#18 Updated by EDiGiacinto over 1 year ago

PR with a workaround (which was meant to be merged if the problem becomes more frequent): https://github.com/os-autoinst/openQA/pull/1764

But that won't prevent deadlocks when workers get SIGKILLed. For a long-term solution, which is still being worked on, see: https://github.com/os-autoinst/openQA/pull/1783 and #39980

#19 Updated by EDiGiacinto over 1 year ago

  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/1783 has been merged, setting to Feedback

#20 Updated by coolo about 1 year ago

  • Status changed from Feedback to Resolved
  • Target version changed from Current Sprint to Done
