Project

General

Profile

Actions

action #39980

closed

Cache locks assets when worker dies in critical section

Added by EDiGiacinto over 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2018-10-02
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)

Description

When an asset is being downloaded by the cache (since now is bit more race-free) it acquires a lock that is meant to avoid that other instances (in the same machine) start to download the same asset. If a worker dies or crashes in the critical section, the same asset is locked and the other instances will wait for the lock to be released.

See also: https://progress.opensuse.org/issues/39833

This happened once in osd already, openqaworker6 now is stuck on:

Aug 20 08:38:12 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:17 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:22 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:27 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:32 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:37 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:42 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:47 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:52 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.

And jobs will result stuck in running.

My first experiments where using IPC memory to acquire semaphores, and use them with SEM_UNDO exactly to cover this case, so if a process exits abnormally the lock is released automatically https://progress.opensuse.org/issues/34597#note-4. But that approach was not used in favor of keeping SQLite - now we need to implement some monitoring checks to unlock the cache in such situations (e.g. checking pid that were downloading are still alive, but imho it's racy as could lead to instances unlocking in wrong timings and potentially able to corrupt downloads)


Subtasks 1 (0 open1 closed)

action #41891: Worker stalled during flockResolvedEDiGiacinto2018-10-02

Actions

Related issues 3 (0 open3 closed)

Related to openQA Project - action #39833: [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleepingResolvedEDiGiacinto2018-08-16

Actions
Related to openQA Project - action #34597: Race condition causing problems with the worker cacheResolvedEDiGiacinto2018-05-11

Actions
Related to openQA Project - action #40871: Asset removed during job runtimeResolvedEDiGiacinto2018-09-11

Actions
Actions #1

Updated by EDiGiacinto over 6 years ago

  • Related to action #39833: [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping added
Actions #2

Updated by EDiGiacinto over 6 years ago

  • Description updated (diff)
Actions #3

Updated by EDiGiacinto over 6 years ago

  • Description updated (diff)
Actions #4

Updated by EDiGiacinto about 6 years ago

Is still being worked in https://github.com/os-autoinst/openQA/pull/1783 but still not ready - the general idea we agreed upon: move the cache to a separate service, and the workers just filling downloads requests to it.

As in our codebase we use already Minion for other things (GRU), come up naturally to base off this work on it.

Currently the service is able to replace the cache component inside the workers, but further work is required to integrate it with the exiting codebase, make the worker code use it and also to adapt all cache tests that are already part of our testsuite.

Actions #5

Updated by coolo about 6 years ago

  • Target version set to Current Sprint

this part of the current story

Actions #6

Updated by EDiGiacinto about 6 years ago

Update:

I'm seeing races using minion locks with the sqlite backend while setting up stress tests:

ok 1 - Asset downloaded               
[INFO] Downloading sle-12-SP3-x86_64-0368-200_3@64bit.qcow2 from http://localhost:38075/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_3@64bit.qcow2
[Fri Sep 14 16:57:41 2018] [debug] GET "/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_3@64bit.qcow2" (3be6ac10)
[Fri Sep 14 16:57:41 2018] [debug] Routing to a callback                                 
[Fri Sep 14 16:57:41 2018] [debug] 200 OK (0.000446s, 2242.152/s)                         
[DEBUG] CACHE: Health: Real size: 0, Configured limit: 53687091200          
[INFO] CACHE: updating the t/cache.d/cache/sle-12-SP3-x86_64-0368-200_3@64bit.qcow2 with andi $a3, $t1, 41399 and 3
[DEBUG] CACHE: Asset download successful to t/cache.d/cache/sle-12-SP3-x86_64-0368-200_3@64bit.qcow2, Cache size is: 3
ok 2 - Asset downloaded                                                  
not ok 3 - Asset downloaded                       
#   Failed test 'Asset downloaded'                     
#   at t/25-cache-service.t line 120.      
[INFO] Downloading sle-12-SP3-x86_64-0368-200_4@64bit.qcow2 from http://localhost:38075/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_4@64bit.qcow2
[Fri Sep 14 16:57:47 2018] [debug] GET "/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_4@64bit.qcow2" (fcad800b)
[Fri Sep 14 16:57:47 2018] [debug] Routing to a callback        
[Fri Sep 14 16:57:47 2018] [debug] 200 OK (0.000439s, 2277.904/s)
[DEBUG] CACHE: Health: Real size: 0, Configured limit: 53687091200
[INFO] CACHE: updating the t/cache.d/cache/sle-12-SP3-x86_64-0368-200_4@64bit.qcow2 with andi $a3, $t1, 41399 and 4
[DEBUG] CACHE: Asset download successful to t/cache.d/cache/sle-12-SP3-x86_64-0368-200_4@64bit.qcow2, Cache size is: 4
not ok 4 - Asset downloaded                             
#   Failed test 'Asset downloaded'                                     
#   at t/25-cache-service.t line 120.               
not ok 5 - Asset downloaded                                               
#   Failed test 'Asset downloaded'                                                                                                                                                                            
#   at t/25-cache-service.t line 120.
[INFO] Downloading sle-12-SP3-x86_64-0368-200_8@64bit.qcow2 from http://localhost:38075/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_8@64bit.qcow2
[Fri Sep 14 16:57:52 2018] [debug] GET "/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_8@64bit.qcow2" (0762f4d6)
[Fri Sep 14 16:57:52 2018] [debug] Routing to a callback   
[Fri Sep 14 16:57:52 2018] [debug] 200 OK (0.000549s, 1821.494/s)
[DEBUG] CACHE: Health: Real size: 0, Configured limit: 53687091200
[INFO] CACHE: updating the t/cache.d/cache/sle-12-SP3-x86_64-0368-200_8@64bit.qcow2 with andi $a3, $t1, 41399 and 8
[DEBUG] CACHE: Asset download successful to t/cache.d/cache/sle-12-SP3-x86_64-0368-200_8@64bit.qcow2, Cache size is: 8
not ok 6 - Asset downloaded       

To trigger it, i filed requests to the cache service with very small assets - it seems there is a race when we detect the task as completed or not, but still to dissect if comes from my changes or the Minion backend.

Actions #7

Updated by EDiGiacinto about 6 years ago

Copied from the GH PR:

What is missing still:

  • Digging into failures of other tests (compile check, and warning that makes the cache-service fail, which passes locally) and finish adapting logging messages changes that aren't matching anymore in fullstack ( the cache messages now are in the minion worker, and fails for that reason)
  • Documentation of the modules
  • systemd services and package spec adjustments
  • Acceptance testing
  • Improve error messages reporting on worker side, but otoh the Cache service is exposing also Minion::Admin plugin
  • Bump cache limit size ( ? ) and polishing
Actions #8

Updated by EDiGiacinto about 6 years ago

  • Related to action #34597: Race condition causing problems with the worker cache added
Actions #9

Updated by coolo about 6 years ago

  • Category set to Regressions/Crashes
  • Status changed from New to In Progress
  • Assignee set to EDiGiacinto

part of the worker story being worked on

Actions #10

Updated by EDiGiacinto about 6 years ago

  • Related to action #40871: Asset removed during job runtime added
Actions #11

Updated by EDiGiacinto about 6 years ago

Update (what is missing still):

  • Investigate/Fix GRU tests that failed after rebase
  • Acceptance testing
  • Fix bugs that were met during testing ( e.g. Job should be flagged as incomplete when cache service is down, and on Leap42.3 the cache is failing in updating asset status, while on my TW this doesn't happen )
Actions #12

Updated by EDiGiacinto about 6 years ago

  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/1783 has been merged, setting to Feedback

Actions #13

Updated by coolo almost 6 years ago

  • Status changed from Feedback to Resolved
  • Target version changed from Current Sprint to Done
Actions

Also available in: Atom PDF