action #80202

jobs incomplete with auto_review:"setup failure: No workers active in the cache service":retry

Added by okurz 8 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2020-11-18
Due date:
2020-12-19
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

https://openqa.opensuse.org/tests/1481492
shows the reason "setup failure: No workers active in the cache service".

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
for example, to look up this ticket #80202 call openqa-query-for-job-label poo#80202
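The auto_review label in this ticket's subject carries a pattern that is matched against each incomplete job's "reason" field. As a hypothetical sketch (the helper below is illustrative only and not part of the actual os-autoinst/scripts repository), the matching boils down to:

```python
import re

# Illustrative only: the pattern is taken from this ticket's subject line;
# "matches_ticket" is a made-up helper name, not a real script function.
AUTO_REVIEW_PATTERN = r"setup failure: No workers active in the cache service"

def matches_ticket(reason):
    """Return True if a job's reason matches this ticket's auto_review regex."""
    return re.search(AUTO_REVIEW_PATTERN, reason) is not None

print(matches_ticket("setup failure: No workers active in the cache service"))
print(matches_ticket("setup failure: Failed to download foo.iso"))
```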

Suggestions

  • Look up in the source code what this message could mean
  • We have never seen this before, so I am convinced it is a regression. Find out what change caused it and address that

Workaround

  • Retrigger jobs until they end up on other machines
  • Stop all worker-related services, clean up the cache dir /var/lib/openqa/cache/* and restart the worker-related services

Related issues

Related to openQA Project - action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry (Resolved, 2020-05-18)

Copied from openQA Project - action #78169: after osd-deploy 2020-11-18 incompletes with auto_review:"Cache service (status error from API|.*error 500: Internal Server Error)":retry (Resolved, 2020-11-18)

History

#1 Updated by okurz 8 months ago

  • Copied from action #78169: after osd-deploy 2020-11-18 incompletes with auto_review:"Cache service (status error from API|.*error 500: Internal Server Error)":retry added

#2 Updated by okurz 8 months ago

  • Description updated (diff)
  • Status changed from New to Workable

#3 Updated by mkittler 8 months ago

Look up in the source code what this message could mean

This means that the cache service itself is available but no Minion workers are registered with it. In particular, openqa-worker-cacheservice is up and running while openqa-worker-cacheservice-minion is not. One can easily reproduce this locally. However, one will most likely just end up with a worker declaring itself broken (with the very same error message) instead of an incomplete job. To end up with an incomplete job, the error condition (openqa-worker-cacheservice running but openqa-worker-cacheservice-minion not) must occur only after a job has already been started but before the check in isotovideo.pm.
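The condition described above can be sketched as a simple check: the service process answers requests, but its Minion backend reports zero registered workers. This is an illustrative sketch only; the function and message wording below mirror the error text but are not openQA's actual API.

```python
# Hypothetical sketch of the worker's availability check. "service_reachable"
# corresponds to openqa-worker-cacheservice answering requests;
# "active_minion_workers" to openqa-worker-cacheservice-minion being registered.
def cache_availability_error(service_reachable, active_minion_workers):
    """Return an error string mirroring the worker's check, or None if OK."""
    if not service_reachable:
        return "Cache service not reachable"
    if active_minion_workers == 0:
        return "No workers active in the cache service"
    return None

# Cache service up, but the Minion service down -> the error from this ticket:
print(cache_availability_error(True, 0))
```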


There are no log messages in the openqa-worker-cacheservice-minion journal after 02:11:27 (and before the regular restart):

Nov 23 02:11:27 openqaworker1 openqa-worker-cacheservice-minion[31276]: [12139] [i] [#58] Downloading: "openSUSE-Leap-15.2-DVD-x86_64.iso"
Nov 23 02:11:27 openqaworker1 openqa-worker-cacheservice-minion[31276]: [12139] [i] [#58] Cache size of "/var/lib/openqa/cache" is 19GiB, with limit 400GiB
Nov 23 02:11:27 openqaworker1 openqa-worker-cacheservice-minion[31276]: [12139] [i] [#58] Downloading "openSUSE-Leap-15.2-DVD-x86_64.iso" from "http://openqa1-opensuse/tests/1481490/asset/iso/openSUSE-Leap-15.2-DVD-x86_64.iso"
Nov 23 02:11:27 openqaworker1 openqa-worker-cacheservice-minion[31276]: [12139] [i] [#58] Content of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.2-DVD-x86_64.iso" has not changed, updating last use
Nov 23 02:11:27 openqaworker1 openqa-worker-cacheservice-minion[31276]: [12139] [i] [#58] Finished download
Nov 23 03:30:00 openqaworker1 systemd[1]: Stopping OpenQA Worker Cache Service Minion...
Nov 23 03:30:04 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] Worker 31276 stopped
Nov 23 03:30:04 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=248/n/a
Nov 23 03:30:04 openqaworker1 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Nov 23 03:30:04 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 23 03:30:04 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.

The job mentioned in the ticket description finished just after that last log message (02:12:08). I am not sure what caused openqa-worker-cacheservice-minion to be "silent" during that time. There are no helpful messages in the logs of openqa-worker-cacheservice and openqa-worker-cacheservice-minion in the relevant time frame. (I actually expected some SQLite errors again, but this time there were none.)

#4 Updated by mkittler 8 months ago

  • Assignee set to mkittler

There were actually quite a lot of such jobs during that time frame; one just has to check the worker history, e.g. https://openqa.opensuse.org/admin/workers/64. But it also looks like all worker slots of openqaworker1 are functioning again (after the reboot).

Sometimes the reason is also setup failure: Cache service status error from API: Specified job ID is invalid (e.g. https://openqa.opensuse.org/tests/1481460), which is likely related as well.

By the way, there are no failed Minion jobs in the cache service (and the job history goes back 12 hours).

#5 Updated by okurz 8 months ago

  • Status changed from Workable to In Progress
  • Assignee changed from mkittler to okurz

This is what I see in the journal of openqaworker1:

Nov 22 07:28:25 openqaworker1 openqa-worker-cacheservice-minion[4906]: [28488] [i] [#736] Sync: "rsync://openqa1-opensuse/tests" to "/var/lib/openqa/cache/openqa1-opensuse"
Nov 22 07:28:25 openqaworker1 openqa-worker-cacheservice-minion[4906]: [28488] [i] [#736] Calling: rsync -avHP rsync://openqa1-opensuse/tests/ --delete /var/lib/openqa/cache/openqa1-opensuse/tests/
Nov 22 07:28:26 openqaworker1 openqa-worker-cacheservice-minion[4906]: [28488] [i] [#736] Finished sync: 0
Nov 22 07:28:30 openqaworker1 worker[3093]: [info] Output of rsync:
Nov 22 07:28:30 openqaworker1 worker[3093]: [info] [#736] Calling: rsync -avHP rsync://openqa1-opensuse/tests/ --delete /var/lib/openqa/cache/openqa1-opensuse/tests/
Nov 22 07:28:30 openqaworker1 worker[3093]: receiving incremental file list
Nov 22 07:28:30 openqaworker1 worker[3093]: openqa/needles/.git/
Nov 22 07:28:30 openqaworker1 worker[3093]: opensuse/.git/
Nov 22 07:28:30 openqaworker1 worker[3093]: opensuse/.git/FETCH_HEAD
Nov 22 07:28:30 openqaworker1 worker[3093]: [116B blob data]
Nov 22 07:28:30 openqaworker1 worker[3093]: opensuse/products/opensuse/needles/.git/
Nov 22 07:28:30 openqaworker1 worker[3093]: opensuse/products/opensuse/needles/.git/FETCH_HEAD
Nov 22 07:28:30 openqaworker1 worker[3093]: [117B blob data]
Nov 22 07:28:30 openqaworker1 worker[3093]: sent 1,843 bytes  received 1,240,071 bytes  827,942.67 bytes/sec
Nov 22 07:28:30 openqaworker1 worker[3093]: total size is 4,655,699,870  speedup is 3,748.81
Nov 22 07:28:30 openqaworker1 worker[3093]: [info] Finished to rsync tests
Nov 22 07:28:30 openqaworker1 worker[3093]: [info] Preparing cgroup to start isotovideo
Nov 22 07:28:30 openqaworker1 worker[3093]: [info] Using cgroup /sys/fs/cgroup/systemd/openqa.slice/openqa-worker.slice/openqa-worker@11.service/1481271
Nov 22 07:28:30 openqaworker1 worker[3093]: [info] Starting isotovideo container
Nov 22 07:28:30 openqaworker1 worker[3093]: [info] 28503: WORKING 1481271
Nov 22 07:28:30 openqaworker1 worker[3093]: [info] isotovideo has been started (PID: 28503)
Nov 22 07:28:31 openqaworker1 kernel: kvm [28047]: vcpu0, guest rIP: 0xffffffffa9675434 disabled perfctr wrmsr: 0xc2 data 0xffff
Nov 22 07:30:27 openqaworker1 kernel: kvm [26199]: vcpu0, guest rIP: 0xffffffff90475434 disabled perfctr wrmsr: 0xc2 data 0xffff
Nov 22 07:30:36 openqaworker1 kernel: kvm [28429]: vcpu0, guest rIP: 0xffffffffa7675434 disabled perfctr wrmsr: 0xc2 data 0xffff
Nov 22 07:30:56 openqaworker1 kernel: kvm [28736]: vcpu0, guest rIP: 0xffffffffa047b9f4 disabled perfctr wrmsr: 0xc2 data 0xffff
Nov 22 07:33:11 openqaworker1 smartd[1469]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 206 to 214
Nov 22 07:35:59 openqaworker1 kernel: kvm [28047]: vcpu0, guest rIP: 0xffffffffb3475434 disabled perfctr wrmsr: 0xc2 data 0xffff
Nov 22 07:36:05 openqaworker1 kernel: kvm [26199]: vcpu0, guest rIP: 0xffffffffa1875434 disabled perfctr wrmsr: 0xc2 data 0xffff
Nov 22 07:36:50 openqaworker1 openqa-worker-cacheservice-minion[4906]: DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 301.
Nov 22 07:36:50 openqaworker1 openqa-worker-cacheservice-minion[4906]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 301.
Nov 22 07:36:50 openqaworker1 openqa-worker-cacheservice-minion[4906]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 26.
Nov 22 07:36:50 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=4/NOPERMISSION
Nov 22 07:36:50 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 22 07:36:50 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Nov 22 07:36:51 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Service RestartSec=100ms expired, scheduling restart.
Nov 22 07:36:51 openqaworker1 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Nov 22 07:36:51 openqaworker1 systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [e] [4MP8m9ei] Database integrity check found errors:
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [e] [4MP8m9ei] *** in database main ***
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: Page 15 is never used
…
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [e] [4MP8m9ei] Re-indexing and vacuuming broken database
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [e] [4MP8m9ei] Purging cache directory because database has been corrupted: DBD::SQLite::st execute failed: UNIQUE constraint failed: mojo_migrations.name at /usr/share/openqa/script/../lib/OpenQA/CacheS>
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]:  at /usr/share/openqa/script/../lib/OpenQA/CacheService/Model/Cache.pm line 66.
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] [4MP8m9ei] Creating cache directory tree for "/var/lib/openqa/cache"
Nov 22 07:36:56 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] [4MP8m9ei] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 400GiB
Nov 22 07:36:57 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] Resetting all leftover locks after restart
Nov 22 07:36:57 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] Worker 31276 started
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Isotovideo exit status: 0
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] +++ worker notes +++
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] End time: 2020-11-22 06:40:07
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Result: done
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Uploading video.ogv
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Uploading vars.json
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Uploading autoinst-log.txt
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Uploading worker-log.txt
Nov 22 07:40:08 openqaworker1 worker[3076]: [info] Uploading serial0.txt
Nov 22 07:40:08 openqaworker1 worker[3076]: [info] Uploading video_time.vtt
Nov 22 07:40:08 openqaworker1 worker[3076]: [info] Uploading serial_terminal.txt
Nov 22 07:40:12 openqaworker1 worker[3076]: [error] Worker cache not available: No workers active in the cache service

So at one point the database is detected as corrupted, re-indexing is triggered and the "workers" are stopped and restarted, but the next job that finishes its work and picks up a new one then shows the error "Worker cache not available: No workers active in the cache service".

openqa-worker-cacheservice itself never notices the problem until the whole machine is rebooted.

Also see the note in #67000#note-58 for more details.

https://github.com/os-autoinst/openQA/pull/3579 is our new attempt.

#6 Updated by mkittler 8 months ago

  • Assignee changed from okurz to mkittler

Yes, as we saw together, the problem started to emerge earlier than the time frame I checked in the logs. I've reviewed your PR (see that for further comments).

#7 Updated by okurz 8 months ago

  • Related to action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry added

#8 Updated by mkittler 8 months ago

Unfortunately the PR doesn't work as intended:

Nov 24 03:58:03 openqaworker4 openqa-workercache-daemon[3503]: [3503] [e] [zRyC9NaO] row 659 missing from index downloads_lock
Nov 24 03:58:03 openqaworker4 openqa-workercache-daemon[3503]: [3503] [e] [zRyC9NaO] row 816 missing from index downloads_lock
Nov 24 03:58:03 openqaworker4 openqa-workercache-daemon[3503]: [3503] [e] [zRyC9NaO] wrong # of entries in index downloads_lock
Nov 24 03:58:03 openqaworker4 openqa-workercache-daemon[3503]: [3503] [e] [zRyC9NaO] Purging cache directory because database has been corrupted: database integrity check failed
Nov 24 03:58:03 openqaworker4 openqa-workercache-daemon[3503]: [3503] [e] [zRyC9NaO] Killing all processes accessing the corrupted database file handles (including ourselves)
Nov 24 03:58:03 openqaworker4 openqa-worker-cacheservice-minion[3504]: Specified filename cache.sqlite* does not exist.
Nov 24 03:58:03 openqaworker4 openqa-workercache-daemon[3503]: Specified filename cache.sqlite* does not exist.
Nov 24 03:58:03 openqaworker4 openqa-workercache-daemon[3503]: Killing DB accessing processes failed when trying to cleanup at /usr/share/openqa/script/../lib/OpenQA/CacheService/Model/Cache.pm line 48.
Nov 24 03:58:03 openqaworker4 openqa-worker-cacheservice-minion[3504]: Killing DB accessing processes failed when trying to cleanup at /usr/share/openqa/script/../lib/OpenQA/CacheService/Model/Cache.pm line 48.
Nov 24 03:58:03 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=1/FAILURE
Nov 24 03:58:03 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 24 03:58:03 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Nov 24 03:58:03 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=1/FAILURE
Nov 24 03:58:03 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 24 03:58:03 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.

One would need to pass the full path to fuser, but even then it wouldn't work because the cache file is deleted before fuser is called.
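The ordering bug can be demonstrated without fuser at all: once the database file has been deleted, a wildcard lookup on "cache.sqlite*" matches nothing, so any "kill processes holding the file" step has no target left. A minimal sketch, using a temporary directory instead of the real cache dir:

```python
import glob
import os
import tempfile

# Illustrates the ordering bug: deleting the file before inspecting it
# leaves nothing for the wildcard (as used with fuser) to match.
with tempfile.TemporaryDirectory() as d:
    db = os.path.join(d, "cache.sqlite")
    open(db, "w").close()
    pattern = os.path.join(d, "cache.sqlite*")
    found_before = glob.glob(pattern)  # file present: holders could be found
    os.remove(db)                      # wrong order: delete first ...
    found_after = glob.glob(pattern)   # ... then the pattern matches nothing
print(len(found_before), len(found_after))
```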

#9 Updated by mkittler 8 months ago

PR to fix problem from previous comment: https://github.com/os-autoinst/openQA/pull/3584

#10 Updated by mkittler 8 months ago

  • Status changed from In Progress to Resolved

See #67000#note-66 for a status update. I'm closing this issue in favor of #67000 so comments are not spread across two different tickets. I also just checked with ./openqa-query-for-job-label poo#80202; the last occurrence was 2020-11-23. So our last PRs might have already "fixed" this particular symptom.

#11 Updated by okurz 8 months ago

  • Status changed from Resolved to Blocked

We still need this ticket to be open to handle cases where the issue happens, e.g. https://openqa.opensuse.org/tests/1483735

#12 Updated by mkittler 8 months ago

I had only openqa-worker-cacheservice (and not openqa-worker-cacheservice-minion) running for a short time frame on this o3 worker while switching to the DELETE journaling mode (see #67000#note-68). This job incompleted around the same time, so it is most likely just a result of my reconfiguration. I haven't seen any other instances besides this job.

I would also prefer to discuss the problem only in #67000 instead of having multiple discussions in different tickets which are actually just about different symptoms of the same underlying problem. And at this point it is clear that the SQLite database corruption addressed in #67000 causes these jobs as well.

#13 Updated by okurz 8 months ago

mkittler wrote:

I would also prefer to discuss the problem only in #67000 instead of having multiple discussions in different tickets which are actually just about different symptoms of the same underlying problem. And at this point it is clear that the SQLite database corruption addressed in #67000 causes these jobs as well.

Yes, I agree. We still need this ticket to track the specific error message. You can of course generalize the regex in #67000 if you want them all to be treated the same. I would just keep all tickets as "Blocked", solve #67000 first, and afterwards check for each symptom whether it is still happening and whether any additional, specific improvement is possible.

#14 Updated by mkittler 8 months ago

Looks like this isn't happening anymore; the following query doesn't return any results:

select state, t_started, (select host from workers where id = assigned_worker_id), reason from jobs where t_started > '2020-11-25 12:17:44' and result = 'incomplete' and reason like '%No workers active%';
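The same check can be sketched against an in-memory SQLite table (the real query of course runs against openQA's PostgreSQL jobs table; the sample row below is made up):

```python
import sqlite3

# Minimal, self-contained sketch of the query above: one unrelated
# incomplete job exists, and the "No workers active" filter matches nothing.
con = sqlite3.connect(":memory:")
con.execute("create table jobs (state text, result text, reason text)")
con.execute("insert into jobs values ('done', 'incomplete', "
            "'setup failure: Failed to download x.iso')")
rows = con.execute(
    "select reason from jobs where result = 'incomplete' "
    "and reason like '%No workers active%'"
).fetchall()
print(len(rows))
```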

#15 Updated by okurz 8 months ago

Even with the underlying issue #67000 fixed, I think we should improve how jobs are treated when an error like this one happens. Can we trigger automatic restarts or un-assign the job from the worker so that it is put back into the queue and picked up by another worker? Maybe automatic restarts require adjusting the specific "reason" to another category.

#16 Updated by mkittler 8 months ago

Can we trigger automatic restarts or un-assign the job from the worker so that it will be put back into the queue and picked by another worker?

We can trigger automatic restarts. As we've discussed in an earlier meeting, this is not the only case where that would make sense. However, I would not put the job back into the queue because these incomplete jobs already contain data from the execution (like the error message from the issue title), and setting them back to scheduled would mean we would have to delete that information again to avoid confusion and possible issues. The problem is that we likely need this information when investigating errors.

Maybe to automatically restart it is necessary to adjust the specific "reason" to be of another category.

Yes, it would make sense for cache-service-related issues to have the cache service: … category. Of course, if the cache service merely passed on a 404 response, that wouldn't count as a cache service issue but would still be treated as setup failure: …. We could also introduce a third category, asset failure: …, for asset-specific problems (which the cache service cannot be blamed for).
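The three-way split discussed above can be sketched as a prefix-based mapping. Note this is illustrative only: the prefixes follow the "cache failure: " and "asset failure: " wording that later queries in this ticket use, and the category descriptions are made up.

```python
# Hypothetical sketch of reason categorization; not openQA's actual code.
def categorize(reason):
    """Map a job reason string to a coarse failure category."""
    for prefix, category in (
        ("cache failure: ", "cache service problem"),
        ("asset failure: ", "asset problem, not the cache service's fault"),
        ("setup failure: ", "generic setup problem"),
    ):
        if reason.startswith(prefix):
            return category
    return "other"

print(categorize("cache failure: No workers active in the cache service"))
```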

#17 Updated by mkittler 8 months ago

  • Status changed from Blocked to In Progress
  • Priority changed from Urgent to Normal

PR for distinguishing the errors: https://github.com/os-autoinst/openQA/pull/3618

#18 Updated by openqa_review 8 months ago

  • Due date set to 2020-12-19

Setting due date based on mean cycle time of SUSE QE Tools

#19 Updated by mkittler 8 months ago

PR for restarting incompletes due to cache service problems: https://github.com/os-autoinst/openQA/pull/3624

#20 Updated by mkittler 8 months ago

  • Status changed from In Progress to Feedback

The PR has been merged. Let's see whether it works in production.

#21 Updated by mkittler 7 months ago

  • Status changed from Feedback to Resolved

There are still no further jobs with No workers active within the reason. However, the latest change was supposed to allow a better general distinction by introducing categories like cache service: …. This seems to have worked, e.g. on OSD the query

select state, t_started, (select host from workers where id = assigned_worker_id), reason from jobs where t_started > '2020-12-09 00:00:00' and result = 'incomplete' and reason like 'cache failure: %';

returns a lot of jobs (the same goes for asset failure: …). On the other hand, the query for the generic category

openqa=> select state, t_started, (select host from workers where id = assigned_worker_id), reason from jobs where t_started > '2020-12-09 00:00:00' and result = 'incomplete' and reason like 'setup failure: %';
 state |      t_started      |    host     |                                                                                         reason                                                                                          
-------+---------------------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 done  | 2020-12-17 14:03:39 | 10.160.0.85 | setup failure: Failed to download SLES-15-SP3-aarch64-102.1@aarch64-unregistered.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLES-15-SP3-aarch64-102.1@aarch64-unregistered.qcow2
 done  | 2020-12-17 06:31:14 | 10.160.0.85 | setup failure: Failed to download SLE-15-SP2-Online-aarch64-Build209.2-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP2-Online-aarch64-Build209.2-Media1.iso.sha256

now returns only 2 jobs. (10.160.0.85 is openqaworker-arm-3 and I would assume that this worker simply hadn't been updated yet at this point.)
