action #39980
closedCache locks assets when worker dies in critical section
100%
Description
When an asset is being downloaded by the cache (since now is bit more race-free) it acquires a lock that is meant to avoid that other instances (in the same machine) start to download the same asset. If a worker dies or crashes in the critical section, the same asset is locked and the other instances will wait for the lock to be released.
See also: https://progress.opensuse.org/issues/39833
This happened once in osd already, openqaworker6 now is stuck on:
Aug 20 08:38:12 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:17 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:22 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:27 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:32 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:37 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:42 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:47 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
Aug 20 08:38:52 openqaworker6 worker[13371]: [info] CACHE: Being downloaded by another worker, sleeping.
And jobs will result stuck in running.
My first experiments where using IPC memory to acquire semaphores, and use them with SEM_UNDO exactly to cover this case, so if a process exits abnormally the lock is released automatically https://progress.opensuse.org/issues/34597#note-4. But that approach was not used in favor of keeping SQLite - now we need to implement some monitoring checks to unlock the cache in such situations (e.g. checking pid that were downloading are still alive, but imho it's racy as could lead to instances unlocking in wrong timings and potentially able to corrupt downloads)
Updated by EDiGiacinto over 6 years ago
- Related to action #39833: [tools] When a worker is abruptly killed, jobs get blocked - CACHE: Being downloaded by another worker, sleeping added
Updated by EDiGiacinto over 6 years ago
Is still being worked in https://github.com/os-autoinst/openQA/pull/1783 but still not ready - the general idea we agreed upon: move the cache to a separate service, and the workers just filling downloads requests to it.
As in our codebase we use already Minion for other things (GRU), come up naturally to base off this work on it.
Currently the service is able to replace the cache component inside the workers, but further work is required to integrate it with the exiting codebase, make the worker code use it and also to adapt all cache tests that are already part of our testsuite.
Updated by coolo over 6 years ago
- Target version set to Current Sprint
this part of the current story
Updated by EDiGiacinto over 6 years ago
Update:
I'm seeing races using minion locks with the sqlite backend while setting up stress tests:
ok 1 - Asset downloaded
[INFO] Downloading sle-12-SP3-x86_64-0368-200_3@64bit.qcow2 from http://localhost:38075/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_3@64bit.qcow2
[Fri Sep 14 16:57:41 2018] [debug] GET "/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_3@64bit.qcow2" (3be6ac10)
[Fri Sep 14 16:57:41 2018] [debug] Routing to a callback
[Fri Sep 14 16:57:41 2018] [debug] 200 OK (0.000446s, 2242.152/s)
[DEBUG] CACHE: Health: Real size: 0, Configured limit: 53687091200
[INFO] CACHE: updating the t/cache.d/cache/sle-12-SP3-x86_64-0368-200_3@64bit.qcow2 with andi $a3, $t1, 41399 and 3
[DEBUG] CACHE: Asset download successful to t/cache.d/cache/sle-12-SP3-x86_64-0368-200_3@64bit.qcow2, Cache size is: 3
ok 2 - Asset downloaded
not ok 3 - Asset downloaded
# Failed test 'Asset downloaded'
# at t/25-cache-service.t line 120.
[INFO] Downloading sle-12-SP3-x86_64-0368-200_4@64bit.qcow2 from http://localhost:38075/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_4@64bit.qcow2
[Fri Sep 14 16:57:47 2018] [debug] GET "/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_4@64bit.qcow2" (fcad800b)
[Fri Sep 14 16:57:47 2018] [debug] Routing to a callback
[Fri Sep 14 16:57:47 2018] [debug] 200 OK (0.000439s, 2277.904/s)
[DEBUG] CACHE: Health: Real size: 0, Configured limit: 53687091200
[INFO] CACHE: updating the t/cache.d/cache/sle-12-SP3-x86_64-0368-200_4@64bit.qcow2 with andi $a3, $t1, 41399 and 4
[DEBUG] CACHE: Asset download successful to t/cache.d/cache/sle-12-SP3-x86_64-0368-200_4@64bit.qcow2, Cache size is: 4
not ok 4 - Asset downloaded
# Failed test 'Asset downloaded'
# at t/25-cache-service.t line 120.
not ok 5 - Asset downloaded
# Failed test 'Asset downloaded'
# at t/25-cache-service.t line 120.
[INFO] Downloading sle-12-SP3-x86_64-0368-200_8@64bit.qcow2 from http://localhost:38075/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_8@64bit.qcow2
[Fri Sep 14 16:57:52 2018] [debug] GET "/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_8@64bit.qcow2" (0762f4d6)
[Fri Sep 14 16:57:52 2018] [debug] Routing to a callback
[Fri Sep 14 16:57:52 2018] [debug] 200 OK (0.000549s, 1821.494/s)
[DEBUG] CACHE: Health: Real size: 0, Configured limit: 53687091200
[INFO] CACHE: updating the t/cache.d/cache/sle-12-SP3-x86_64-0368-200_8@64bit.qcow2 with andi $a3, $t1, 41399 and 8
[DEBUG] CACHE: Asset download successful to t/cache.d/cache/sle-12-SP3-x86_64-0368-200_8@64bit.qcow2, Cache size is: 8
not ok 6 - Asset downloaded
To trigger it, i filed requests to the cache service with very small assets - it seems there is a race when we detect the task as completed or not, but still to dissect if comes from my changes or the Minion backend.
Updated by EDiGiacinto over 6 years ago
Copied from the GH PR:
What is missing still:
- Digging into failures of other tests (compile check, and warning that makes the cache-service fail, which passes locally) and finish adapting logging messages changes that aren't matching anymore in fullstack ( the cache messages now are in the minion worker, and fails for that reason)
- Documentation of the modules
- systemd services and package spec adjustments
- Acceptance testing
- Improve error messages reporting on worker side, but otoh the Cache service is exposing also Minion::Admin plugin
- Bump cache limit size ( ? ) and polishing
Updated by EDiGiacinto over 6 years ago
- Related to action #34597: Race condition causing problems with the worker cache added
Updated by coolo about 6 years ago
- Category set to Regressions/Crashes
- Status changed from New to In Progress
- Assignee set to EDiGiacinto
part of the worker story being worked on
Updated by EDiGiacinto about 6 years ago
- Related to action #40871: Asset removed during job runtime added
Updated by EDiGiacinto about 6 years ago
Update (what is missing still):
- Investigate/Fix GRU tests that failed after rebase
- Acceptance testing
- Fix bugs that were met during testing ( e.g. Job should be flagged as incomplete when cache service is down, and on Leap42.3 the cache is failing in updating asset status, while on my TW this doesn't happen )
Updated by EDiGiacinto about 6 years ago
- Status changed from In Progress to Feedback
https://github.com/os-autoinst/openQA/pull/1783 has been merged, setting to Feedback
Updated by coolo about 6 years ago
- Status changed from Feedback to Resolved
- Target version changed from Current Sprint to Done