action #39980: Cache locks assets when worker dies in critical section - openQA Project - openSUSE Project Management Tool

action #39980

closed

Cache locks assets when worker dies in critical section

Added by EDiGiacinto almost 6 years ago. Updated over 5 years ago.

Status:

Resolved

Priority:

High

Assignee:

EDiGiacinto

Category:

Regressions/Crashes

Target version:

Done

Start date:

2018-10-02

Due date:

% Done:

100%

Estimated time:

(Total: 0.00 h)

Description

When the cache downloads an asset (a process that is now a bit more race-free), it acquires a lock so that other worker instances on the same machine do not start downloading the same asset. If a worker dies or crashes inside the critical section, the asset stays locked and the other instances keep waiting for a lock that is never released.

See also: https://progress.opensuse.org/issues/39833

This has already happened once on osd; openqaworker6 is now stuck on:

Aug 20 08:38:12 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:17 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:22 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:27 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:32 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:37 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:42 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:47 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:52 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.

As a result, jobs remain stuck in the running state.
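To make the failure mode concrete, here is a minimal sketch, assuming the lock is simply a row in an SQLite table. The table name `locks`, its columns, and the helpers `try_lock` and `release` are hypothetical names for illustration; they are not taken from the openQA cache code.

```python
import sqlite3

# Hypothetical schema: one row per asset that is currently being downloaded.
SCHEMA = """
CREATE TABLE IF NOT EXISTS locks (
    asset TEXT PRIMARY KEY,
    pid   INTEGER NOT NULL
)
"""

def try_lock(db, asset, pid):
    """Return True if `pid` acquired the download lock for `asset`."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute("INSERT INTO locks (asset, pid) VALUES (?, ?)", (asset, pid))
        return True
    except sqlite3.IntegrityError:
        # The primary key already exists: another worker holds the lock.
        return False

def release(db, asset, pid):
    """Remove the lock row, but only if `pid` is the recorded owner."""
    with db:
        db.execute("DELETE FROM locks WHERE asset = ? AND pid = ?", (asset, pid))
```

In this sketch nothing ties the row to the lifetime of the process that inserted it. If the owner crashes before calling `release`, the row stays in the database and every later `try_lock` for that asset returns `False`, which matches the endless "sleeping" messages in the log above.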

My first experiments used IPC memory to acquire semaphores with SEM_UNDO, precisely to cover this case: if a process exits abnormally, the lock is released automatically (https://progress.opensuse.org/issues/34597#note-4). That approach was dropped in favor of keeping SQLite, so we now need to implement monitoring checks that unlock the cache in such situations, e.g. checking whether the PIDs that were downloading are still alive. IMHO that is racy, though: instances could unlock at the wrong time and potentially corrupt downloads.
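The PID liveness check mentioned above could look roughly like the following. This is a sketch under assumptions, not the fix that was implemented: the `locks` table and the helpers `pid_alive` and `reclaim_stale_locks` are hypothetical, and the signal-0 probe is POSIX-only.

```python
import os
import sqlite3

# Hypothetical schema: one row per asset that is currently being downloaded.
SCHEMA = """
CREATE TABLE IF NOT EXISTS locks (
    asset TEXT PRIMARY KEY,
    pid   INTEGER NOT NULL
)
"""

def pid_alive(pid):
    """Probe a PID with signal 0 (POSIX only; no signal is delivered)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    return True

def reclaim_stale_locks(db):
    """Delete lock rows whose recorded owner is gone; return the freed assets."""
    reclaimed = []
    with db:  # run the deletions in one transaction
        rows = db.execute("SELECT asset, pid FROM locks").fetchall()
        for asset, pid in rows:
            if not pid_alive(pid):
                cur = db.execute(
                    "DELETE FROM locks WHERE asset = ? AND pid = ?", (asset, pid)
                )
                if cur.rowcount:
                    reclaimed.append(asset)
    return reclaimed
```

The sketch does not remove the race described above. A PID can be reused by an unrelated process, so a dead owner may look alive, and two instances running the check at the same time may both conclude that they can restart the same download. A kernel-managed lock such as a SEM_UNDO semaphore avoids this because the kernel itself undoes the operation when the process exits.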

Subtasks 1 (0 open — 1 closed)

Related issues 3 (0 open — 3 closed)

