action #39833 (closed)

[tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping

Added by SLindoMansilla over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: High
Assignee: EDiGiacinto
Category: Regressions/Crashes
Target version: Done
Start date: 2018-08-16
Due date:
% Done: 0%
Estimated time:

Description

For example, this job: http://10.160.66.74/tests/787

Workaround

Delete the file /var/lib/openqa/cache/cache.sqlite and restart the workers.
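
A rough sketch of those steps as shell commands, assuming systemd-managed worker instances named openqa-worker@N, the default cache location, and that cache.sqlite is recreated automatically; unit names, instance numbers and paths may differ per setup:

# stop the local worker instances (adjust the instance numbers to your setup)
sudo systemctl stop openqa-worker@1 openqa-worker@2
# remove the cache database that holds the stale "being downloaded" entry
sudo rm /var/lib/openqa/cache/cache.sqlite
# start the workers again
sudo systemctl start openqa-worker@1 openqa-worker@2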


Files

openqa-resource-allocator.log (548 Bytes) SLindoMansilla, 2018-08-16 09:00
openqa-scheduler (784 Bytes) SLindoMansilla, 2018-08-16 09:00
openqa-websockets.log (1.08 KB) SLindoMansilla, 2018-08-16 09:00
openqa-webui.log (1.92 KB) SLindoMansilla, 2018-08-16 09:00
openqa-worker.log (198 KB) SLindoMansilla, 2018-08-16 09:00

Related issues 5 (0 open, 5 closed)

Related to openQA Project - action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out (Resolved, okurz, 2018-08-15)
Related to openQA Project - action #39980: Cache locks assets when worker dies in critical section (Resolved, EDiGiacinto, 2018-10-02)
Related to openQA Project - action #40103: [o3] openqaworker4 not able to finish any jobs (Resolved, szarate, 2018-08-22)
Related to openQA Project - action #40004: worker continues to work on job which he as well as the webui considers dead (Resolved, mkittler, 2018-08-20)
Related to openQA Project - action #39905: Job trying to download worker local file "aavmf-aarch64-vars.bin" into cache and fails with 404 (Resolved, szarate, 2018-08-17)
Actions #1

Updated by EDiGiacinto over 5 years ago

From the link you gave, the webui is at version 4.6.1534341932.1ded71aa.

Is there a particular reason to run such an old worker? It basically seems to be a year old: https://github.com/os-autoinst/openQA/commits/89b04ed8

Actions #2

Updated by SLindoMansilla over 5 years ago

There is no reason; I have updated to the latest package version and rebooted the system. How can it be that the worker is so old?

sergio@sergio-latitude:~$ sudo zypper se -si openQA
Loading repository data...
Reading installed packages...

S  | Name                 | Type    | Version                       | Arch   | Repository               
---+----------------------+---------+-------------------------------+--------+--------------------------
i+ | openQA               | package | 4.6.1534341932.1ded71aa-743.1 | noarch | devel-openQA             
i  | openQA-client        | package | 4.6.1534341932.1ded71aa-743.1 | noarch | devel-openQA             
i  | openQA-common        | package | 4.6.1534341932.1ded71aa-743.1 | noarch | devel-openQA             
i  | openQA-local-db      | package | 4.6.1534341932.1ded71aa-743.1 | noarch | devel-openQA             
i+ | openQA-worker        | package | 4.6.1534341932.1ded71aa-743.1 | noarch | devel-openQA             
i+ | python-openqa_review | package | 1.10.0-6.1                    | noarch | openSUSE-Leap-42.3-Update
Actions #3

Updated by SLindoMansilla over 5 years ago

  • Subject changed from With openQA and openQA-worker 4.4.1497257618.89b04ed8-1.1, workers pick jobs but do nothing to With openQA and openQA-worker 4.6.1534341932.1ded71aa-743.1, workers pick jobs but do nothing

Sorry, I copied the wrong version number.

Actions #4

Updated by EDiGiacinto over 5 years ago

Looks like one of the instances is downloading the ISO.

Aug 16 10:54:15 sergio-latitude worker[7316]: [info] [pid:7316] CACHE: Being downloaded by another worker, sleeping.

If the worker is actually not downloading it: did you by any chance brutally kill one of the worker instances while the download was in progress? If so, you might need to clean up the cache.
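
A rough way to confirm this situation, as a sketch assuming systemd journal logging, default paths and the openqa-worker@N unit names (adjust to your setup):

# check whether a worker instance is stuck waiting on the cache lock
sudo journalctl -u openqa-worker@1 | grep 'CACHE: Being downloaded by another worker'
# look for partially downloaded assets left behind by a killed worker
ls -l /var/lib/openqa/cache/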

Actions #5

Updated by SLindoMansilla over 5 years ago

  • Subject changed from With openQA and openQA-worker 4.6.1534341932.1ded71aa-743.1, workers pick jobs but do nothing to [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping
  • Description updated (diff)

It may be related to the shutdown of SRV2. My Loewe workers were down; that may have caused the issue.

Actions #6

Updated by okurz over 5 years ago

  • Related to action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out added
Actions #7

Updated by okurz over 5 years ago

  • Priority changed from Normal to High

Observed on production instance o3 as well now: #39743#note-27

Actions #8

Updated by EDiGiacinto over 5 years ago

  • Related to action #39980: Cache locks assets when worker dies in critical section added
Actions #9

Updated by szarate over 5 years ago

  • Related to action #40103: [o3] openqaworker4 not able to finish any jobs added
Actions #10

Updated by szarate over 5 years ago

Updating the cache database is a workaround for the problem, but it can still happen.

The worker cache needs to evolve and be decoupled so that only one thing is doing the download (the tests are a separate story)

Actions #11

Updated by szarate over 5 years ago

  • Target version set to Ready
Actions #12

Updated by szarate over 5 years ago

  • Related to action #40004: worker continues to work on job which he as well as the webui considers dead added
Actions #13

Updated by szarate over 5 years ago

  • Related to action #39905: Job trying to download worker local file "aavmf-aarch64-vars.bin" into cache and fails with 404 added
Actions #14

Updated by szarate over 5 years ago

Somehow this is happening even more often: https://openqa.suse.de/tests/1999655, even on workers that were not killed...

Actions #15

Updated by EDiGiacinto over 5 years ago

Was the cache cleaned up on ow6 since https://progress.opensuse.org/issues/39980? If not, it might still be locked since then.

Actions #16

Updated by EDiGiacinto over 5 years ago

  • Status changed from New to In Progress
  • Assignee set to EDiGiacinto

Looking at it: we can maybe reduce (not fix) the impact of the bug temporarily with a workaround before moving the cache to a service (which requires more time).

Actions #17

Updated by szarate over 5 years ago

  • Target version changed from Ready to Current Sprint
Actions #18

Updated by EDiGiacinto over 5 years ago

PR with a workaround (which was meant to be merged if the problem becomes more frequent): https://github.com/os-autoinst/openQA/pull/1764

But that won't prevent deadlocks when workers get SIGKILLed; for a long-term solution, which is still being worked on, see https://github.com/os-autoinst/openQA/pull/1783 and #39980.

Actions #19

Updated by EDiGiacinto over 5 years ago

  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/1783 has been merged, setting to Feedback

Actions #20

Updated by coolo over 5 years ago

  • Status changed from Feedback to Resolved
  • Target version changed from Current Sprint to Done