action #58484: jobs are stuck in downloading while cache service died - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #58484

closed

jobs are stuck in downloading while cache service died

Added by coolo over 5 years ago. Updated about 5 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

kraih

Category:

Regressions/Crashes

Target version:

Done

Start date:

2019-10-22

Due date:

% Done:

Estimated time:

Description

https://openqa.suse.de/tests/3502420 was started at 11pm and is still in
preparation phase.

Oct 21 23:08:45 openqaworker2 worker[479]: [debug] [pid:479] Found ISO, caching SLE-12-SP3-Server-DVD-s390x-GM-DVD1.iso
Oct 21 23:08:45 openqaworker2 worker[479]: [debug] [pid:479] Downloading SLE-12-SP3-Server-DVD-s390x-GM-DVD1.iso - request sent to Cache Service.
Oct 21 23:08:50 openqaworker2 worker[479]: [debug] [pid:479] Updating status so job 3502420 is not considered dead.
Oct 21 23:08:50 openqaworker2 worker[479]: [debug] [pid:479] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3502420/status
Oct 21 23:08:55 openqaworker2 worker[479]: [debug] [pid:479] Updating status so job 3502420 is not considered dead.
Oct 21 23:08:55 openqaworker2 worker[479]: [debug] [pid:479] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3502420/status
Oct 21 23:09:00 openqaworker2 worker[479]: [debug] [pid:479] Updating status so job 3502420 is not considered dead.

Problem is though that the download failed as the minion service failed:

Oct 21 23:19:54 openqaworker2 openqa-workercache[10307]: DBD::SQLite::st execute failed: database is locked at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 381.
Oct 21 23:19:54 openqaworker2 openqa-workercache[10307]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 381.
Oct 21 23:19:54 openqaworker2 openqa-workercache[10307]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 26.
Oct 21 23:19:54 openqaworker2 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=11/n/a
Oct 21 23:21:24 openqaworker2 systemd[1]: openqa-worker-cacheservice-minion.service: State 'stop-sigterm' timed out. Killing.

-> no recovery possible

Files

openqa_worker_journal_malbec_worker_4_2019-12-05_job_incomplete_after_failed_download_to_cache.log (52.9 KB) openqa_worker_journal_malbec_worker_4_2019-12-05_job_incomplete_after_failed_download_to_cache.log

okurz, 2019-12-05 09:47

Actions

Copy link

Updated by coolo over 5 years ago

on armworker-3 the situation looks the same:

Oct 21 19:00:10 openqaworker-arm-3 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Oct 21 19:00:10 openqaworker-arm-3 systemd[1]: Started OpenQA Worker Cache Service Minion.
Oct 21 19:00:12 openqaworker-arm-3 openqa-workercache[72868]: [INFO] OpenQA::Worker::Cache: loading database from /var/lib/openqa/cache/cache.sqlite
Oct 21 19:00:12 openqaworker-arm-3 openqa-workercache[72868]: [DEBUG] CACHE: Health: Real size: 53404216832, Configured limit: 53687091200
Oct 21 19:00:12 openqaworker-arm-3 openqa-workercache[72868]: [INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 53404216832
Oct 21 19:00:12 openqaworker-arm-3 openqa-workercache[72868]: [72868] [i] Worker 72868 started
Oct 22 00:08:50 openqaworker-arm-3 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=4/NOPERMISSION
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: openqa-worker-cacheservice-minion.service: State 'stop-sigterm' timed out. Killing.
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: openqa-worker-cacheservice-minion.service: Killing process 97732 (openqa-workerca) with signal SIGKILL.
Oct 22 00:10:22 openqaworker-arm-3 openqa-workercache[1895]: [INFO] OpenQA::Worker::Cache: loading database from /var/lib/openqa/cache/cache.sqlite
Oct 22 00:10:22 openqaworker-arm-3 openqa-workercache[1895]: [DEBUG] CACHE: Health: Real size: 52423787008, Configured limit: 53687091200
Oct 22 00:10:22 openqaworker-arm-3 openqa-workercache[1895]: [INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 52423787008
Oct 22 00:10:22 openqaworker-arm-3 openqa-workercache[1895]: [1895] [i] Worker 1895 started
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: openqa-worker-cacheservice-minion.service: Killing process 97740 (openqa-workerca) with signal SIGKILL.
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: openqa-worker-cacheservice-minion.service: Killing process 1236 (openqa-workerca) with signal SIGKILL.
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: openqa-worker-cacheservice-minion.service: Service RestartSec=100ms expired, scheduling restart.
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Oct 22 00:10:20 openqaworker-arm-3 systemd[1]: Started OpenQA Worker Cache Service Minion.

Actions

Copy link

Updated by coolo over 5 years ago

While shutting down the jobs it got worse :)

Oct 22 07:07:30 openqaworker-arm-3 worker[53458]: [debug] [pid:53458] Updating status so job 3502538 is not considered dead.
Oct 22 07:07:30 openqaworker-arm-3 worker[53458]: [debug] [pid:53458] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3502538/status
Oct 22 07:07:34 openqaworker-arm-3 systemd[1]: Stopping openQA Worker #24...
Oct 22 07:07:38 openqaworker-arm-3 worker[53458]: [info] [pid:53458] Quit due to signal TERM
Oct 22 07:07:38 openqaworker-arm-3 worker[53458]: [debug] [pid:53458] Stopping job 3502538 from openqa.suse.de: 03502538-sle-15-SP2-Installer-DVD-aarch64-Build67.1-minimal_x+uefi@aarch64 - reason: quit
Oct 22 07:07:38 openqaworker-arm-3 worker[53458]: [debug] [pid:53458] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3502538/status
Oct 22 07:07:38 openqaworker-arm-3 worker[53458]: Mojo::Reactor::Poll: I/O watcher failed: Connection refused at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Client.pm line 36.
Oct 22 07:07:38 openqaworker-arm-3 worker[53458]: [error] [pid:53458] Another error occurred when trying to stop gracefully due to an error. Trying to kill ourself forcefully now.
Oct 22 07:07:38 openqaworker-arm-3 worker[53458]: [error] [pid:53458] Websocket connection to http://openqa.suse.de/api/v1/ws/1303 finished by remote side with code 1006, no reason - trying again in 10 seconds
Oct 22 07:07:46 openqaworker-arm-3 systemd[1]: Stopped openQA Worker #24.

Actions

Copy link

Updated by kraih over 5 years ago

Assignee set to kraih

Actions

Copy link

Updated by kraih over 5 years ago

Target version changed from Ready to Current Sprint

Actions

Copy link

Updated by kraih over 5 years ago

Status changed from New to In Progress

Since i can't replicate this problem with a test case, i've started refactoring the cache service, and will be adding better log features along the way to make debugging a bit easier.

Actions

Copy link

Updated by kraih over 5 years ago

After refactoring the cache service i do have a much better understanding of the issue now. Exclusive locks are used very liberally in OpenQA::CacheService::Model::Cache, which makes deadlocks a big risk. There are multiple approaches for resolving this problem. For starters i would like to try simply increasing the SQLite busy timeout. If that doesn't help we might have to take a closer look to try and identify the slow query. It might also be worth it to use a separate SQLite database file for Minion, to improve general reliability.

Actions

Copy link

Updated by kraih over 5 years ago

One part of this problem i actually already accidentally fixed with the refactoring. Failed Minion jobs no longer result in the download getting stuck.

Actions

Copy link

Updated by kraih over 5 years ago

And i've opened a PR that should help with the rest of the described problems above. https://github.com/os-autoinst/openQA/pull/2559

Actions

Copy link

Updated by kraih over 5 years ago

Status changed from In Progress to Feedback

PR has been merged, lets see how it goes.

Actions

Copy link

#10

Updated by okurz over 5 years ago

File openqa_worker_journal_malbec_worker_4_2019-12-05_job_incomplete_after_failed_download_to_cache.log openqa_worker_journal_malbec_worker_4_2019-12-05_job_incomplete_after_failed_download_to_cache.log added
Status changed from Feedback to Workable

I am not sure if you consider this the same issue but please consider it:
On malbec I observed now incomplete jobs which do not have any log files uploaded, probably because the webui does not want to accept any updates anymore for the job it considered due to what looks like a too long attempt to download assets into the cache. The job in question is https://openqa.suse.de/tests/3665863#downloads , the according worker is malbec:4 which shows two subsequent jobs with the problem. I collected the system journal of the worker service and have it attached here.

Actions

Copy link

#11

Updated by kraih over 5 years ago

Status changed from Workable to Feedback

Thanks for attaching the worker journal, that really helps. Unfortunately i can say for certain from the cache service logs in the journal that it is an unrelated issue.

Your problem happens after the cache service is already done. It might be related to the scheduler problem Marius was working on last week. Where the scheduler got blocked every now and then. If that's the case the problem might already be resolved by the recent removal of that blocking scheduler API call.

Actions

Copy link

#12

Updated by okurz over 5 years ago

ok, could be that your recent changes to the shared logs help for further debugging to rule out that the cache is involved and the changes to the scheduler could fix it. We will see. Unless you see further things to do I would say that all your relevant changes did not cause any followup problems observable as of now so I suggest to resolve this ticket then.

Actions

Copy link

#13

Updated by okurz about 5 years ago

Status changed from Feedback to Resolved
Target version changed from Current Sprint to Done

As per #58484#note-12

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #58484

jobs are stuck in downloading while cache service died

Updated by coolo over 5 years ago

Updated by coolo over 5 years ago

Updated by kraih over 5 years ago

Updated by kraih over 5 years ago

Updated by kraih over 5 years ago

Updated by kraih over 5 years ago

Updated by kraih over 5 years ago

Updated by kraih over 5 years ago

Updated by kraih over 5 years ago

Updated by okurz over 5 years ago

Updated by kraih over 5 years ago

Updated by okurz over 5 years ago

Updated by okurz about 5 years ago