action #39980: Cache locks assets when worker dies in critical section - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #39980

closed

Cache locks assets when worker dies in critical section

Added by EDiGiacinto over 6 years ago. Updated about 6 years ago.

Status:

Resolved

Priority:

High

Assignee:

EDiGiacinto

Category:

Regressions/Crashes

Target version:

Done

Start date:

2018-10-02

Due date:

% Done:

100%

Estimated time:

(Total: 0.00 h)

Description

When an asset is being downloaded by the cache (since now is bit more race-free) it acquires a lock that is meant to avoid that other instances (in the same machine) start to download the same asset. If a worker dies or crashes in the critical section, the same asset is locked and the other instances will wait for the lock to be released.

See also: https://progress.opensuse.org/issues/39833

This happened once in osd already, openqaworker6 now is stuck on:

Aug 20 08:38:12 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:17 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:22 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:27 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:32 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:37 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:42 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:47 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:52 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.

And jobs will result stuck in running.

My first experiments where using IPC memory to acquire semaphores, and use them with SEM_UNDO exactly to cover this case, so if a process exits abnormally the lock is released automatically https://progress.opensuse.org/issues/34597#note-4. But that approach was not used in favor of keeping SQLite - now we need to implement some monitoring checks to unlock the cache in such situations (e.g. checking pid that were downloading are still alive, but imho it's racy as could lead to instances unlocking in wrong timings and potentially able to corrupt downloads)

Subtasks 1 (0 open — 1 closed)

Related issues 3 (0 open — 3 closed)

Actions

Copy link

Updated by EDiGiacinto over 6 years ago

Related to action #39833: [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping added

Actions

Copy link

Updated by EDiGiacinto over 6 years ago

Description updated (diff)

Actions

Copy link

Updated by EDiGiacinto over 6 years ago

Description updated (diff)

Actions

Copy link

Updated by EDiGiacinto over 6 years ago

Is still being worked in https://github.com/os-autoinst/openQA/pull/1783 but still not ready - the general idea we agreed upon: move the cache to a separate service, and the workers just filling downloads requests to it.

As in our codebase we use already Minion for other things (GRU), come up naturally to base off this work on it.

Currently the service is able to replace the cache component inside the workers, but further work is required to integrate it with the exiting codebase, make the worker code use it and also to adapt all cache tests that are already part of our testsuite.

Actions

Copy link

Updated by coolo over 6 years ago

Target version set to Current Sprint

this part of the current story

Actions

Copy link

Updated by EDiGiacinto over 6 years ago

Update:

I'm seeing races using minion locks with the sqlite backend while setting up stress tests:

ok 1 - Asset downloaded               
[INFO] Downloading sle-12-SP3-x86_64-0368-200_3@64bit.qcow2 from http://localhost:38075/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_3@64bit.qcow2
[Fri Sep 14 16:57:41 2018] [debug] GET "/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_3@64bit.qcow2" (3be6ac10)
[Fri Sep 14 16:57:41 2018] [debug] Routing to a callback                                 
[Fri Sep 14 16:57:41 2018] [debug] 200 OK (0.000446s, 2242.152/s)                         
[DEBUG] CACHE: Health: Real size: 0, Configured limit: 53687091200          
[INFO] CACHE: updating the t/cache.d/cache/sle-12-SP3-x86_64-0368-200_3@64bit.qcow2 with andi $a3, $t1, 41399 and 3
[DEBUG] CACHE: Asset download successful to t/cache.d/cache/sle-12-SP3-x86_64-0368-200_3@64bit.qcow2, Cache size is: 3
ok 2 - Asset downloaded                                                  
not ok 3 - Asset downloaded                       
#   Failed test 'Asset downloaded'                     
#   at t/25-cache-service.t line 120.      
[INFO] Downloading sle-12-SP3-x86_64-0368-200_4@64bit.qcow2 from http://localhost:38075/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_4@64bit.qcow2
[Fri Sep 14 16:57:47 2018] [debug] GET "/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_4@64bit.qcow2" (fcad800b)
[Fri Sep 14 16:57:47 2018] [debug] Routing to a callback        
[Fri Sep 14 16:57:47 2018] [debug] 200 OK (0.000439s, 2277.904/s)
[DEBUG] CACHE: Health: Real size: 0, Configured limit: 53687091200
[INFO] CACHE: updating the t/cache.d/cache/sle-12-SP3-x86_64-0368-200_4@64bit.qcow2 with andi $a3, $t1, 41399 and 4
[DEBUG] CACHE: Asset download successful to t/cache.d/cache/sle-12-SP3-x86_64-0368-200_4@64bit.qcow2, Cache size is: 4
not ok 4 - Asset downloaded                             
#   Failed test 'Asset downloaded'                                     
#   at t/25-cache-service.t line 120.               
not ok 5 - Asset downloaded                                               
#   Failed test 'Asset downloaded'                                                                                                                                                                            
#   at t/25-cache-service.t line 120.
[INFO] Downloading sle-12-SP3-x86_64-0368-200_8@64bit.qcow2 from http://localhost:38075/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_8@64bit.qcow2
[Fri Sep 14 16:57:52 2018] [debug] GET "/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_8@64bit.qcow2" (0762f4d6)
[Fri Sep 14 16:57:52 2018] [debug] Routing to a callback   
[Fri Sep 14 16:57:52 2018] [debug] 200 OK (0.000549s, 1821.494/s)
[DEBUG] CACHE: Health: Real size: 0, Configured limit: 53687091200
[INFO] CACHE: updating the t/cache.d/cache/sle-12-SP3-x86_64-0368-200_8@64bit.qcow2 with andi $a3, $t1, 41399 and 8
[DEBUG] CACHE: Asset download successful to t/cache.d/cache/sle-12-SP3-x86_64-0368-200_8@64bit.qcow2, Cache size is: 8
not ok 6 - Asset downloaded

To trigger it, i filed requests to the cache service with very small assets - it seems there is a race when we detect the task as completed or not, but still to dissect if comes from my changes or the Minion backend.

Actions

Copy link

Updated by EDiGiacinto over 6 years ago

Copied from the GH PR:

What is missing still:

Digging into failures of other tests (compile check, and warning that makes the cache-service fail, which passes locally) and finish adapting logging messages changes that aren't matching anymore in fullstack ( the cache messages now are in the minion worker, and fails for that reason)
Documentation of the modules
systemd services and package spec adjustments
Acceptance testing
Improve error messages reporting on worker side, but otoh the Cache service is exposing also Minion::Admin plugin
Bump cache limit size ( ? ) and polishing

Actions

Copy link

Updated by EDiGiacinto over 6 years ago

Related to action #34597: Race condition causing problems with the worker cache added

Actions

Copy link

Updated by coolo over 6 years ago

Category set to Regressions/Crashes
Status changed from New to In Progress
Assignee set to EDiGiacinto

part of the worker story being worked on

Actions

Copy link

#10

Updated by EDiGiacinto over 6 years ago

Related to action #40871: Asset removed during job runtime added

Actions

Copy link

#11

Updated by EDiGiacinto over 6 years ago

Update (what is missing still):

Investigate/Fix GRU tests that failed after rebase
Acceptance testing
Fix bugs that were met during testing ( e.g. Job should be flagged as incomplete when cache service is down, and on Leap42.3 the cache is failing in updating asset status, while on my TW this doesn't happen )

Actions

Copy link

#12

Updated by EDiGiacinto over 6 years ago

Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/1783 has been merged, setting to Feedback

Actions

Copy link

#13

Updated by coolo about 6 years ago

Status changed from Feedback to Resolved
Target version changed from Current Sprint to Done

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #39980

Cache locks assets when worker dies in critical section

Updated by EDiGiacinto over 6 years ago

Updated by EDiGiacinto over 6 years ago

Updated by EDiGiacinto over 6 years ago

Updated by EDiGiacinto over 6 years ago

Updated by coolo over 6 years ago

Updated by EDiGiacinto over 6 years ago

Updated by EDiGiacinto over 6 years ago

Updated by EDiGiacinto over 6 years ago

Updated by coolo over 6 years ago

Updated by EDiGiacinto over 6 years ago

Updated by EDiGiacinto over 6 years ago

Updated by EDiGiacinto over 6 years ago

Updated by coolo about 6 years ago