Project

General

Profile

action #34597

Race condition causing problems with the worker cache

Added by szarate over 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
-
Start date:
2018-05-11
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

The Cache implemented in the worker has some racy conditions, the toggle_lock method can be a good starting point for it, but a deeper look at how SQLite is handling the locks and how the concurrency is being managed in the code is needed.

From irc:

foursixnine | coolo: nope it doesn't, but it's what's being used when a download is triggered. So the etag is sent to webui and all the magic of "Content has not channged" happens... however now looking at the code... I think that the racy sqlite hit, The image was being downloaded by other workers at the same time... so most likely both thought "Hey It's my turn"... and stepped on eachother's toes :(.
coolo | not according to the logs
coolo | only 3611 downloaded the iso
coolo | but the code is indeed racy
worker-journal.log (11.5 KB) worker-journal.log szarate, 2018-04-09 16:13

Related issues

Related to openQA Project - action #26088: [tools] Worker's Cache database sporadically locksResolved2017-10-17

Related to openQA Project - action #35299: Potential corrupted file causing installation tests to failResolved2018-04-20

Related to openQA Project - action #39980: Cache locks assets when worker dies in critical sectionResolved2018-10-02

Related to openQA Tests - action #51743: [openqa] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64]Resolved2019-05-21

History

#1 Updated by szarate over 3 years ago

  • Description updated (diff)

#2 Updated by EDiGiacinto over 3 years ago

  • Related to action #26088: [tools] Worker's Cache database sporadically locks added

#3 Updated by EDiGiacinto over 3 years ago

  • Related to action #35299: Potential corrupted file causing installation tests to fail added

#4 Updated by EDiGiacinto over 3 years ago

  • Status changed from New to In Progress
  • Assignee set to EDiGiacinto
  • Start date changed from 2018-04-09 to 2018-05-11

I'm already looking into this since friday - from the last call, coolo suspects this is the source of download problems, and it could be related also to different disks type (magnetic, etc.) as this is hitting mostly o3 and the locking mechanism that is happening within the sqlite usage.

Updating the ticket with what have been done so far:

  • Looked over the current worker cache code to see how caching works
  • Experimented with in-memory locking: looked over at modules already available on CPAN / suse repos IPC::Semaphore, IPC::Msg, IPC::Queue (tried also IPC::ShareLite, but it was spurting sospicious warnings related to IO::Handle) the best fit would have been https://metacpan.org/pod/IPC::Semaphore::Concurrency but sadly it's missing almost totally a test suite
  • Started to write on the openqa codebase part relevant modules to deal with the locking using IPC::ShareLite, but tests revealed some memory corruption happening when testing with high number of concurrent processes, since i had already suspects with IPC::ShareLite, decided to give IPC::Semaphore::Concurrency a try
  • Since IPC::Semaphore::Concurrency comes with no tests and it's just a mere wrapper to IPC::Semaphore, i decided to roll own a module which behaves like IPC::Semaphore::Concurrency, so i could setup tests to see if this route is feasable and worthy
  • Locking in-memory works, but within my tests i got memory corrupted while using semaphores only when using the sem_undo option - that enlightened some problems when this option is given on semaphore creation - which would be more safe to avoid them to be inaccessible if the application crashes badly and critical section can't be unlocked for some reason

#5 Updated by EDiGiacinto over 3 years ago

@Coolo: A simple flock of a file to reserve db access on critical sections would probably suffer of the same problem i was trying to solve with semaphores with sem_undo option ( in case of hard crashes, the access could be locked permanently, or deadlocks), but i will setup tests and i already added the relative locking/unlocking functions with flock to see how does perform ( but i see it a bit as a hackish approach ).

Meanwhile i looked over the Minion::Backend::SQLite / Mojo::SQLite on how they provide locks features for concurrent workers, and it seems using Mojo::SQLite has its advantages while dealing with concurrent read/write.

For now i will test with flock and try to make the queries more atomic inside a single transaction to change the state and be less racy , and refactoring the cache module while at it, also wrapping all rollbacks call to avoid the workers to die (as with AutoCommit, they are actually ineffective). But we should maybe consider using Mojo::SQLite here for features like: https://metacpan.org/pod/Mojo::SQLite::Database#begin (transaction handling wrt AutoCommit, seems also a bit more compliant to http://sqlite.org/lockingv3.html) and wal support. Also to consider - enabling the wal pragma in our current implementations - http://sqlite.org/wal.html, but it has its own drawbacks.

The last (drastic, but not at all) option i see is the possibility to drop SQLite at all from the cache (and openQA) and just opt for in-memory (that is also what wal is using for the wal-index) as we use it to keep track of the assets to avoid to re-download them ( we cleanup them based on space) and to lock down the download for single instances in the same machine.

#6 Updated by szarate over 3 years ago

  • Target version changed from Ready to Current Sprint

Let's move it to the sprint as it was the actual ticket being worked on, I guess this could reduce the ammount of occurences of poo#35299.

#7 Updated by EDiGiacinto over 3 years ago

No big updates relevant to the race condition as for today: i started refactoring and make the cache class a bit more OOP-ish and adapted the cache tests https://github.com/mudler/openQA/commit/689e6dcf5ecaa0beb7ea02d29a5085cb5b5e5280

#8 Updated by EDiGiacinto over 3 years ago

I setted up a unit test to trigger race conditions (it's still flaky, as requires time/more executions to trigger the real problem), also isolated and locked all the sqlite queries with flock on top - just for sake of debugging - but seems still we are getting different pids trying to download the same image (notice, also the size is downsized, instead should just grow in this test, and database locks are not happening anymore as the test is also checking for that):

[2018-05-22T12:14:12.0840 CEST] [info] pid: 23532 | pid: 23532 | Downloading sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2 from http://127.0.0.1:32805/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2
[2018-05-22T12:14:12.0907 CEST] [info] pid: 23532 | pid: 23532 | CACHE: updating the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2 with andi $a3, $t1, 41399 and 1387
[2018-05-22T12:14:12.0907 CEST] [debug] pid: 23532 | pid: 23532 | CACHE: Asset download successful to /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2, Cache size is: 24044
[2018-05-22T12:14:14.0006 CEST] [info] pid: 23512 | pid: 23512 | Downloading sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2 from http://127.0.0.1:32805/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2
[2018-05-22T12:14:14.0083 CEST] [debug] pid: 23512 | pid: 23512 | CACHE: Content has not changed, not downloading the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2 but updating last use
[2018-05-22T12:14:15.0763 CEST] [info] pid: 23401 | pid: 23401 | Downloading sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2 from http://127.0.0.1:32805/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2
[2018-05-22T12:14:17.0370 CEST] [info] pid: 23401 | pid: 23401 | CACHE: updating the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2 with andi $a3, $t1, 41399 and 1387
[2018-05-22T12:14:17.0370 CEST] [debug] pid: 23401 | pid: 23401 | CACHE: Asset download successful to /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2, Cache size is: 4600
[2018-05-22T12:14:18.0657 CEST] [info] pid: 23496 | pid: 23496 | Downloading sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2 from http://127.0.0.1:32805/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2
[2018-05-22T12:14:18.0724 CEST] [debug] pid: 23496 | pid: 23496 | CACHE: Content has not changed, not downloading the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2 but updating last use
[2018-05-22T12:14:19.0745 CEST] [info] pid: 23536 | pid: 23536 | Downloading sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2 from http://127.0.0.1:32805/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2
[2018-05-22T12:14:19.0834 CEST] [debug] pid: 23536 | pid: 23536 | CACHE: Content has not changed, not downloading the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2 but updating last use
[2018-05-22T12:14:22.0176 CEST] [info] pid: 23446 | pid: 23446 | Downloading sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2 from http://127.0.0.1:32805/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2
[2018-05-22T12:14:22.0243 CEST] [debug] pid: 23446 | pid: 23446 | CACHE: Content has not changed, not downloading the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1387@64bit.qcow2 but updating last use

#9 Updated by EDiGiacinto over 3 years ago

Maybe here it's easier to follow, ( no size difference this time ):

[2018-05-22T13:20:26.0144 CEST] [info] [pid 3666] Downloading sle-12-SP3-x86_64-0368-200_1598@64bit.qcow2 from http://127.0.0.1:38903/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1598@64bit.qcow2
[2018-05-22T13:20:26.0236 CEST] [info] [pid 3664] Downloading sle-12-SP3-x86_64-0368-200_1598@64bit.qcow2 from http://127.0.0.1:38903/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1598@64bit.qcow2
[2018-05-22T13:20:26.0554 CEST] [info] [pid 3666] CACHE: updating the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1598@64bit.qcow2 with andi $a3, $t1, 41399 and 1598
[2018-05-22T13:20:26.0555 CEST] [debug] [pid 3666] CACHE: Asset download successful to /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1598@64bit.qcow2, Cache size is: 10319
[2018-05-22T13:20:28.0532 CEST] [info] [pid 3621] Downloading sle-12-SP3-x86_64-0368-200_1598@64bit.qcow2 from http://127.0.0.1:38903/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1598@64bit.qcow2
[2018-05-22T13:20:28.0609 CEST] [debug] [pid 3621] CACHE: Content has not changed, not downloading the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1598@64bit.qcow2 but updating last use
[2018-05-22T13:20:29.0131 CEST] [info] [pid 3665] Downloading sle-12-SP3-x86_64-0368-200_1598@64bit.qcow2 from http://127.0.0.1:38903/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1598@64bit.qcow2
[2018-05-22T13:20:31.0128 CEST] [debug] [pid 3665] CACHE: Content has not changed, not downloading the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1598@64bit.qcow2 but updating last use
[2018-05-22T13:20:31.0938 CEST] [info] [pid 3664] CACHE: updating the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1598@64bit.qcow2 with andi $a3, $t1, 41399 and 1598
[2018-05-22T13:20:31.0938 CEST] [debug] [pid 3664] CACHE: Asset download successful to /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1598@64bit.qcow2, Cache size is: 10319
[2018-05-22T13:20:36.0044 CEST] [info] [pid 3736] Downloading sle-12-SP3-x86_64-0368-200_1598@64bit.qcow2 from http://127.0.0.1:38903/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1598@64bit.qcow2
[2018-05-22T13:20:36.0111 CEST] [debug] [pid 3736] CACHE: Content has not changed, not downloading the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1598@64bit.qcow2 but updating last use

Updated (1ms of race):

[2018-05-22T16:36:55.0847 CEST] [info] [pid 23685] Downloading sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2 from http://127.0.0.1:55291/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2
[2018-05-22T16:36:55.0848 CEST] [info] [pid 23686] Downloading sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2 from http://127.0.0.1:55291/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2
[2018-05-22T16:36:56.0002 CEST] [info] [pid 23685] CACHE: updating the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2 with andi $a3, $t1, 41399 and 1890
[2018-05-22T16:36:56.0002 CEST] [debug] [pid 23685] CACHE: Asset download successful to /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2, Cache size is: 2152
[2018-05-22T16:36:56.0790 CEST] [info] [pid 23696] Downloading sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2 from http://127.0.0.1:55291/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2
[2018-05-22T16:36:56.0867 CEST] [debug] [pid 23696] CACHE: Content has not changed, not downloading the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2 but updating last use
[2018-05-22T16:36:57.0367 CEST] [info] [pid 23686] CACHE: updating the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2 with andi $a3, $t1, 41399 and 1890
[2018-05-22T16:36:57.0367 CEST] [debug] [pid 23686] CACHE: Asset download successful to /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2, Cache size is: 2152
[2018-05-22T16:37:01.0964 CEST] [info] [pid 23691] Downloading sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2 from http://127.0.0.1:55291/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2
[2018-05-22T16:37:05.0074 CEST] [debug] [pid 23691] CACHE: Content has not changed, not downloading the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2 but updating last use
[2018-05-22T16:37:07.0195 CEST] [info] [pid 23704] Downloading sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2 from http://127.0.0.1:55291/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2
[2018-05-22T16:37:07.0295 CEST] [debug] [pid 23704] CACHE: Content has not changed, not downloading the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2 but updating last use
[2018-05-22T16:37:07.0540 CEST] [info] [pid 23703] Downloading sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2 from http://127.0.0.1:55291/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2
[2018-05-22T16:37:07.0639 CEST] [debug] [pid 23703] CACHE: Content has not changed, not downloading the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2 but updating last use
[2018-05-22T16:37:08.0239 CEST] [info] [pid 23730] Downloading sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2 from http://127.0.0.1:55291/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2
[2018-05-22T16:37:08.0317 CEST] [debug] [pid 23730] CACHE: Content has not changed, not downloading the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2 but updating last use
[2018-05-22T16:37:09.0594 CEST] [info] [pid 23745] Downloading sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2 from http://127.0.0.1:55291/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2
[2018-05-22T16:37:09.0660 CEST] [debug] [pid 23745] CACHE: Content has not changed, not downloading the /home/ettore/_git/openQA/t/cache.d/cache/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2 but updating last use
[2018-05-22T16:37:09.0794 CEST] [info] [pid 23694] Downloading sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2 from http://127.0.0.1:55291/tests/922756/asset/hdd/sle-12-SP3-x86_64-0368-200_1890@64bit.qcow2

#10 Updated by EDiGiacinto over 3 years ago

PR: https://github.com/os-autoinst/openQA/pull/1662

Doing tests on staging, no issue noticed so far

#11 Updated by EDiGiacinto over 3 years ago

During tests (not staging ones) spotted another section where race condition can occur ( in update_asset ) and managed to trigger it, so still in WIP

#12 Updated by szarate about 3 years ago

While mudler is on holidays, I rebased his pr: https://github.com/os-autoinst/openQA/pull/1698

#13 Updated by szarate about 3 years ago

  • Status changed from In Progress to Feedback

PR is merged, will set to feedback for now

#14 Updated by szarate about 3 years ago

  • Target version changed from Current Sprint to Ready

#15 Updated by szarate about 3 years ago

PR has been merged, sending back to product backlog, waiting for feedback for the time being...

#16 Updated by coolo almost 3 years ago

  • Status changed from Feedback to In Progress
  • Target version changed from Ready to Current Sprint

Looks like we need to close some of these as DUPs

#17 Updated by EDiGiacinto almost 3 years ago

Yes, but this one should be closed as resolved instead, races are not there anymore, and this PR got merged - what we have now is deadlock, that was mentioned also in https://progress.opensuse.org/issues/34597#note-5 and is part of a new story, after we chose to stick with sqlite https://progress.opensuse.org/issues/39980

#18 Updated by EDiGiacinto almost 3 years ago

  • Related to action #39980: Cache locks assets when worker dies in critical section added

#19 Updated by coolo almost 3 years ago

  • Status changed from In Progress to Resolved
  • Target version deleted (Current Sprint)

then let's get over it

#20 Updated by rpalethorpe over 1 year ago

  • Related to action #51743: [openqa] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64] added

Also available in: Atom PDF