action #58484

jobs are stuck in downloading while cache service died

Added by coolo 6 months ago. Updated 3 months ago.

Status:ResolvedStart date:22/10/2019
Priority:NormalDue date:
Assignee:kraih% Done:

0%

Category:Concrete Bugs
Target version:Done
Difficulty:
Duration:

Description

https://openqa.suse.de/tests/3502420 was started at 11pm and is still in
preparation phase.

Oct 21 23:08:45 openqaworker2 worker[479]: [debug] [pid:479] Found ISO, caching SLE-12-SP3-Server-DVD-s390x-GM-DVD1.iso
Oct 21 23:08:45 openqaworker2 worker[479]: [debug] [pid:479] Downloading SLE-12-SP3-Server-DVD-s390x-GM-DVD1.iso - request sent to Cache Service.
Oct 21 23:08:50 openqaworker2 worker[479]: [debug] [pid:479] Updating status so job 3502420 is not considered dead.
Oct 21 23:08:50 openqaworker2 worker[479]: [debug] [pid:479] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3502420/status
Oct 21 23:08:55 openqaworker2 worker[479]: [debug] [pid:479] Updating status so job 3502420 is not considered dead.
Oct 21 23:08:55 openqaworker2 worker[479]: [debug] [pid:479] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3502420/status
Oct 21 23:09:00 openqaworker2 worker[479]: [debug] [pid:479] Updating status so job 3502420 is not considered dead.

Problem is though that the download failed as the minion service failed:

Oct 21 23:19:54 openqaworker2 openqa-workercache[10307]: DBD::SQLite::st execute failed: database is locked at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 381.
Oct 21 23:19:54 openqaworker2 openqa-workercache[10307]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 381.
Oct 21 23:19:54 openqaworker2 openqa-workercache[10307]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 26.
Oct 21 23:19:54 openqaworker2 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=11/n/a
Oct 21 23:21:24 openqaworker2 systemd[1]: openqa-worker-cacheservice-minion.service: State 'stop-sigterm' timed out. Killing.

-> no recovery possible

History

#1 Updated by coolo 6 months ago

on armworker-3 the situation looks the same:

Oct 21 19:00:10 openqaworker-arm-3 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Oct 21 19:00:10 openqaworker-arm-3 systemd[1]: Started OpenQA Worker Cache Service Minion.
Oct 21 19:00:12 openqaworker-arm-3 openqa-workercache[72868]: [INFO] OpenQA::Worker::Cache: loading database from /var/lib/openqa/cache/cache.sqlite
Oct 21 19:00:12 openqaworker-arm-3 openqa-workercache[72868]: [DEBUG] CACHE: Health: Real size: 53404216832, Configured limit: 53687091200
Oct 21 19:00:12 openqaworker-arm-3 openqa-workercache[72868]: [INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 53404216832
Oct 21 19:00:12 openqaworker-arm-3 openqa-workercache[72868]: [72868] [i] Worker 72868 started
Oct 22 00:08:50 openqaworker-arm-3 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=4/NOPERMISSION
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: openqa-worker-cacheservice-minion.service: State 'stop-sigterm' timed out. Killing.
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: openqa-worker-cacheservice-minion.service: Killing process 97732 (openqa-workerca) with signal SIGKILL.
Oct 22 00:10:22 openqaworker-arm-3 openqa-workercache[1895]: [INFO] OpenQA::Worker::Cache: loading database from /var/lib/openqa/cache/cache.sqlite
Oct 22 00:10:22 openqaworker-arm-3 openqa-workercache[1895]: [DEBUG] CACHE: Health: Real size: 52423787008, Configured limit: 53687091200
Oct 22 00:10:22 openqaworker-arm-3 openqa-workercache[1895]: [INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 52423787008
Oct 22 00:10:22 openqaworker-arm-3 openqa-workercache[1895]: [1895] [i] Worker 1895 started
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: openqa-worker-cacheservice-minion.service: Killing process 97740 (openqa-workerca) with signal SIGKILL.
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: openqa-worker-cacheservice-minion.service: Killing process 1236 (openqa-workerca) with signal SIGKILL.
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: openqa-worker-cacheservice-minion.service: Service RestartSec=100ms expired, scheduling restart.
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: Started OpenQA Worker Cache Service Minion.

#2 Updated by coolo 6 months ago

While shutting down the jobs it got worse :)

Oct 22 07:07:30 openqaworker-arm-3 worker[53458]: [debug] [pid:53458] Updating status so job 3502538 is not considered dead.
Oct 22 07:07:30 openqaworker-arm-3 worker[53458]: [debug] [pid:53458] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3502538/status
Oct 22 07:07:34 openqaworker-arm-3 systemd[1]: Stopping openQA Worker #24...
Oct 22 07:07:38 openqaworker-arm-3 worker[53458]: [info] [pid:53458] Quit due to signal TERM
Oct 22 07:07:38 openqaworker-arm-3 worker[53458]: [debug] [pid:53458] Stopping job 3502538 from openqa.suse.de: 03502538-sle-15-SP2-Installer-DVD-aarch64-Build67.1-minimal_x+uefi@aarch64 - reason: quit
Oct 22 07:07:38 openqaworker-arm-3 worker[53458]: [debug] [pid:53458] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3502538/status
Oct 22 07:07:38 openqaworker-arm-3 worker[53458]: Mojo::Reactor::Poll: I/O watcher failed: Connection refused at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Client.pm line 36.
Oct 22 07:07:38 openqaworker-arm-3 worker[53458]: [error] [pid:53458] Another error occurred when trying to stop gracefully due to an error. Trying to kill ourself forcefully now.
Oct 22 07:07:38 openqaworker-arm-3 worker[53458]: [error] [pid:53458] Websocket connection to http://openqa.suse.de/api/v1/ws/1303 finished by remote side with code 1006, no reason - trying again in 10 seconds
Oct 22 07:07:46 openqaworker-arm-3 systemd[1]: Stopped openQA Worker #24.

#3 Updated by kraih 6 months ago

  • Assignee set to kraih

#4 Updated by kraih 5 months ago

  • Target version changed from Ready to Current Sprint

#5 Updated by kraih 5 months ago

  • Status changed from New to In Progress

Since i can't replicate this problem with a test case, i've started refactoring the cache service, and will be adding better log features along the way to make debugging a bit easier.

#6 Updated by kraih 4 months ago

After refactoring the cache service i do have a much better understanding of the issue now. Exclusive locks are used very liberally in OpenQA::CacheService::Model::Cache, which makes deadlocks a big risk. There are multiple approaches for resolving this problem. For starters i would like to try simply increasing the SQLite busy timeout. If that doesn't help we might have to take a closer look to try and identify the slow query. It might also be worth it to use a separate SQLite database file for Minion, to improve general reliability.

#7 Updated by kraih 4 months ago

One part of this problem i actually already accidentally fixed with the refactoring. Failed Minion jobs no longer result in the download getting stuck.

#8 Updated by kraih 4 months ago

And i've opened a PR that should help with the rest of the described problems above. https://github.com/os-autoinst/openQA/pull/2559

#9 Updated by kraih 4 months ago

  • Status changed from In Progress to Feedback

PR has been merged, lets see how it goes.

#10 Updated by okurz 4 months ago

I am not sure if you consider this the same issue but please consider it:
On malbec I observed now incomplete jobs which do not have any log files uploaded, probably because the webui does not want to accept any updates anymore for the job it considered due to what looks like a too long attempt to download assets into the cache. The job in question is https://openqa.suse.de/tests/3665863#downloads , the according worker is malbec:4 which shows two subsequent jobs with the problem. I collected the system journal of the worker service and have it attached here.

#11 Updated by kraih 4 months ago

  • Status changed from Workable to Feedback

Thanks for attaching the worker journal, that really helps. Unfortunately i can say for certain from the cache service logs in the journal that it is an unrelated issue.

Your problem happens after the cache service is already done. It might be related to the scheduler problem Marius was working on last week. Where the scheduler got blocked every now and then. If that's the case the problem might already be resolved by the recent removal of that blocking scheduler API call.

#12 Updated by okurz 4 months ago

ok, could be that your recent changes to the shared logs help for further debugging to rule out that the cache is involved and the changes to the scheduler could fix it. We will see. Unless you see further things to do I would say that all your relevant changes did not cause any followup problems observable as of now so I suggest to resolve this ticket then.

#13 Updated by okurz 3 months ago

  • Status changed from Feedback to Resolved
  • Target version changed from Current Sprint to Done

Also available in: Atom PDF