action #39980

closed

Cache locks assets when worker dies in critical section

Added by EDiGiacinto over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2018-10-02
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

When the cache (which is now a bit more race-free) downloads an asset, it acquires a lock meant to prevent other instances on the same machine from starting to download the same asset. If a worker dies or crashes inside the critical section, the asset stays locked and the other instances keep waiting for the lock to be released.

See also: https://progress.opensuse.org/issues/39833

This has already happened once on OSD; openqaworker6 is now stuck on:

Aug 20 08:38:12 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:17 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:22 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:27 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:32 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:37 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:42 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:47 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:52 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.

As a result, jobs remain stuck in the running state.
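The failure mode can be illustrated with a minimal sketch. It assumes the lock is just a flag in an SQLite row; the table layout and the helper names (`open_cache_db`, `try_lock`, `unlock`) are hypothetical and not taken from the openQA cache code.

```python
import sqlite3


def open_cache_db(path=":memory:"):
    """Open a cache database with a hypothetical 'assets' table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS assets ("
        " name TEXT PRIMARY KEY,"
        " downloading INTEGER NOT NULL DEFAULT 0,"
        " holder_pid INTEGER)"
    )
    return db


def try_lock(db, asset, pid):
    """Return True if this caller managed to mark the asset as downloading."""
    with db:  # one transaction: commit on success, roll back on error
        db.execute("INSERT OR IGNORE INTO assets (name) VALUES (?)", (asset,))
        cur = db.execute(
            "UPDATE assets SET downloading = 1, holder_pid = ?"
            " WHERE name = ? AND downloading = 0",
            (pid, asset),
        )
        return cur.rowcount == 1


def unlock(db, asset):
    """Clear the flag; a worker that crashes never reaches this call."""
    with db:
        db.execute(
            "UPDATE assets SET downloading = 0, holder_pid = NULL WHERE name = ?",
            (asset,),
        )


if __name__ == "__main__":
    db = open_cache_db()
    # Worker A (PID 111 is made up) takes the lock and then "dies" without unlocking.
    try_lock(db, "example.qcow2", 111)
    # Worker B can never acquire it: the flag is plain data and outlives its owner.
    print(try_lock(db, "example.qcow2", 222))
```

Because the flag lives in the database rather than in the kernel, nothing clears it when the holder disappears, which matches the endless "sleeping" loop in the log above.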

My first experiments used IPC memory to acquire semaphores with SEM_UNDO, precisely to cover this case: if a process exits abnormally, the lock is released automatically (https://progress.opensuse.org/issues/34597#note-4). That approach was dropped in favor of keeping SQLite, so we now need to implement some monitoring checks that unlock the cache in such situations, e.g. checking that the PIDs that were downloading are still alive. IMHO that is racy, though: instances could unlock at the wrong time and potentially corrupt downloads.
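Both ideas can be sketched roughly as follows. Python's standard library has no System V semaphores, so this sketch swaps in `fcntl.flock` as a stand-in for SEM_UNDO: it is also a kernel-managed lock that goes away when the holding process dies. The helpers `pid_alive`, `try_flock` and `demo` are hypothetical names, the code assumes a Unix-like system, and none of it is the openQA implementation.

```python
import fcntl
import os
import subprocess
import sys
import tempfile


def pid_alive(pid):
    """Best-effort liveness probe. Racy: the PID may be reused by another process."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def try_flock(path):
    """Return an open file holding an exclusive lock, or None if someone else holds it."""
    fh = open(path, "a")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fh.close()
        return None
    return fh


def demo():
    """Start a lock holder, kill it, and report whether the lock was freed."""
    lock_path = os.path.join(tempfile.mkdtemp(), "asset.lock")
    holder_code = (
        "import fcntl, sys, time\n"
        "fh = open(sys.argv[1], 'a')\n"
        "fcntl.flock(fh, fcntl.LOCK_EX)\n"
        "print('locked', flush=True)\n"
        "time.sleep(60)\n"
    )
    holder = subprocess.Popen(
        [sys.executable, "-c", holder_code, lock_path], stdout=subprocess.PIPE
    )
    holder.stdout.readline()  # block until the child reports it holds the lock
    held_while_alive = try_flock(lock_path) is None
    holder.kill()  # simulate the worker dying inside the critical section
    holder.wait()
    holder.stdout.close()
    fh = try_flock(lock_path)  # the kernel dropped the lock with the process
    released_after_death = fh is not None
    if fh is not None:
        fh.close()
    return held_while_alive, released_after_death


if __name__ == "__main__":
    print(demo())
```

The `pid_alive` probe shows why the monitoring approach is fragile: between the check and the unlock, the holder may still be finishing its write, or its PID may already belong to an unrelated process. A kernel-managed lock avoids that window because the release is tied to the process itself.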


Subtasks 1 (0 open, 1 closed)

action #41891: Worker stalled during flock (Resolved, EDiGiacinto, 2018-10-02)

Related issues 3 (0 open, 3 closed)

Related to openQA Project - action #39833: [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping (Resolved, EDiGiacinto, 2018-08-16)

Related to openQA Project - action #34597: Race condition causing problems with the worker cache (Resolved, EDiGiacinto, 2018-05-11)

Related to openQA Project - action #40871: Asset removed during job runtime (Resolved, EDiGiacinto, 2018-09-11)
