action #80202
jobs incomplete with auto_review:"setup failure: No workers active in the cache service":retry
Description
Observation
https://openqa.opensuse.org/tests/1481492
shows as reason "setup failure: No workers active in the cache service".
Steps to reproduce
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label,
for example, to look up this ticket #80202, call openqa-query-for-job-label poo#80202 (see the sketch below)
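A minimal shell sketch for fetching and running the script (how exactly it needs to be configured to reach the openQA instance is not covered here):
# download the helper script from os-autoinst/scripts and make it executable
curl -O https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label
chmod +x openqa-query-for-job-label
# list jobs labeled with this ticket
./openqa-query-for-job-label poo#80202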
Suggestions
- Look up in the source code what this message could mean
- We have never seen this before so I am convinced this is a regression. Find out what change caused this and address it
Workaround
- Retrigger jobs until they end up on other machines
- Stop all worker-related services, clean up the cache dir /var/lib/openqa/cache/* and restart the worker-related services (see the sketch below)
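A rough shell sketch of that second workaround, assuming the systemd unit names that appear in the logs below (openqa-worker@*, openqa-worker-cacheservice, openqa-worker-cacheservice-minion) and 10 worker slots:
# stop all worker slots and both cache service units
systemctl stop 'openqa-worker@*' openqa-worker-cacheservice-minion openqa-worker-cacheservice
# wipe the cache directory, including the (possibly corrupted) cache.sqlite
rm -rf /var/lib/openqa/cache/*
# start the cache service units and the worker slots again (adjust the slot count to the machine)
systemctl start openqa-worker-cacheservice openqa-worker-cacheservice-minion openqa-worker@{1..10}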
Updated by okurz about 4 years ago
- Copied from action #78169: after osd-deploy 2020-11-18 incompletes with auto_review:"Cache service (status error from API|.*error 500: Internal Server Error)":retry added
Updated by okurz about 4 years ago
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler about 4 years ago
Look up in the source code what this message could mean
This means that the cache service itself is available but no Minion workers are registered. In particular, it means that openqa-worker-cacheservice is up and running but openqa-worker-cacheservice-minion is not. One can actually reproduce this locally quite simply (see the sketch below). However, one will most likely just end up with a worker declaring itself broken (with the very same error message) instead of creating an incomplete job. To end up with an incomplete job, the error condition (openqa-worker-cacheservice up and running but openqa-worker-cacheservice-minion not) must only occur after a job has already been started but still before the check in isotovideo.pm.
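A simple way to provoke the broken-worker variant of this locally, assuming a worker host with the standard systemd units:
# keep the cache service itself running, but stop its Minion worker
systemctl stop openqa-worker-cacheservice-minion
systemctl is-active openqa-worker-cacheservice   # still "active"
# the worker slots should now report "No workers active in the cache service" and declare themselves broken
journalctl -f -u 'openqa-worker@*'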
There are no log messages in the openqa-worker-cacheservice-minion journal after 02:11:27 (and before the regular restart):
Nov 23 02:11:27 openqaworker1 openqa-worker-cacheservice-minion[31276]: [12139] [i] [#58] Downloading: "openSUSE-Leap-15.2-DVD-x86_64.iso"
Nov 23 02:11:27 openqaworker1 openqa-worker-cacheservice-minion[31276]: [12139] [i] [#58] Cache size of "/var/lib/openqa/cache" is 19GiB, with limit 400GiB
Nov 23 02:11:27 openqaworker1 openqa-worker-cacheservice-minion[31276]: [12139] [i] [#58] Downloading "openSUSE-Leap-15.2-DVD-x86_64.iso" from "http://openqa1-opensuse/tests/1481490/asset/iso/openSUSE-Leap-15.2-DVD-x86_64.iso"
Nov 23 02:11:27 openqaworker1 openqa-worker-cacheservice-minion[31276]: [12139] [i] [#58] Content of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.2-DVD-x86_64.iso" has not changed, updating last use
Nov 23 02:11:27 openqaworker1 openqa-worker-cacheservice-minion[31276]: [12139] [i] [#58] Finished download
Nov 23 03:30:00 openqaworker1 systemd[1]: Stopping OpenQA Worker Cache Service Minion...
Nov 23 03:30:04 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] Worker 31276 stopped
Nov 23 03:30:04 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=248/n/a
Nov 23 03:30:04 openqaworker1 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Nov 23 03:30:04 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 23 03:30:04 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
The job mentioned in the ticket description finished just after that last log message (at 02:12:08). Not sure what caused openqa-worker-cacheservice-minion to be "silent" during that time. There are no helpful messages in the logs of openqa-worker-cacheservice and openqa-worker-cacheservice-minion in the relevant time frame. (I actually expected some SQLite errors again but this time there were none.)
Updated by mkittler about 4 years ago
- Assignee set to mkittler
There were actually quite a lot of such jobs during that time frame; one just has to check the worker history, e.g. https://openqa.opensuse.org/admin/workers/64. But it also looks like all worker slots of openqaworker1 are functioning again (after the reboot).
Sometimes the reason is also setup failure: Cache service status error from API: Specified job ID is invalid (e.g. https://openqa.opensuse.org/tests/1481460), which is likely related as well.
By the way, there are no failed Minion jobs in the cache service (and the history of jobs goes back 12 hours).
Updated by okurz about 4 years ago
- Status changed from Workable to In Progress
- Assignee changed from mkittler to okurz
This is what I see in the journal of openqaworker1:
Nov 22 07:28:25 openqaworker1 openqa-worker-cacheservice-minion[4906]: [28488] [i] [#736] Sync: "rsync://openqa1-opensuse/tests" to "/var/lib/openqa/cache/openqa1-opensuse"
Nov 22 07:28:25 openqaworker1 openqa-worker-cacheservice-minion[4906]: [28488] [i] [#736] Calling: rsync -avHP rsync://openqa1-opensuse/tests/ --delete /var/lib/openqa/cache/openqa1-opensuse/tests/
Nov 22 07:28:26 openqaworker1 openqa-worker-cacheservice-minion[4906]: [28488] [i] [#736] Finished sync: 0
Nov 22 07:28:30 openqaworker1 worker[3093]: [info] Output of rsync:
Nov 22 07:28:30 openqaworker1 worker[3093]: [info] [#736] Calling: rsync -avHP rsync://openqa1-opensuse/tests/ --delete /var/lib/openqa/cache/openqa1-opensuse/tests/
Nov 22 07:28:30 openqaworker1 worker[3093]: receiving incremental file list
Nov 22 07:28:30 openqaworker1 worker[3093]: openqa/needles/.git/
Nov 22 07:28:30 openqaworker1 worker[3093]: opensuse/.git/
Nov 22 07:28:30 openqaworker1 worker[3093]: opensuse/.git/FETCH_HEAD
Nov 22 07:28:30 openqaworker1 worker[3093]: [116B blob data]
Nov 22 07:28:30 openqaworker1 worker[3093]: opensuse/products/opensuse/needles/.git/
Nov 22 07:28:30 openqaworker1 worker[3093]: opensuse/products/opensuse/needles/.git/FETCH_HEAD
Nov 22 07:28:30 openqaworker1 worker[3093]: [117B blob data]
Nov 22 07:28:30 openqaworker1 worker[3093]: sent 1,843 bytes received 1,240,071 bytes 827,942.67 bytes/sec
Nov 22 07:28:30 openqaworker1 worker[3093]: total size is 4,655,699,870 speedup is 3,748.81
Nov 22 07:28:30 openqaworker1 worker[3093]: [info] Finished to rsync tests
Nov 22 07:28:30 openqaworker1 worker[3093]: [info] Preparing cgroup to start isotovideo
Nov 22 07:28:30 openqaworker1 worker[3093]: [info] Using cgroup /sys/fs/cgroup/systemd/openqa.slice/openqa-worker.slice/openqa-worker@11.service/1481271
Nov 22 07:28:30 openqaworker1 worker[3093]: [info] Starting isotovideo container
Nov 22 07:28:30 openqaworker1 worker[3093]: [info] 28503: WORKING 1481271
Nov 22 07:28:30 openqaworker1 worker[3093]: [info] isotovideo has been started (PID: 28503)
Nov 22 07:28:31 openqaworker1 kernel: kvm [28047]: vcpu0, guest rIP: 0xffffffffa9675434 disabled perfctr wrmsr: 0xc2 data 0xffff
Nov 22 07:30:27 openqaworker1 kernel: kvm [26199]: vcpu0, guest rIP: 0xffffffff90475434 disabled perfctr wrmsr: 0xc2 data 0xffff
Nov 22 07:30:36 openqaworker1 kernel: kvm [28429]: vcpu0, guest rIP: 0xffffffffa7675434 disabled perfctr wrmsr: 0xc2 data 0xffff
Nov 22 07:30:56 openqaworker1 kernel: kvm [28736]: vcpu0, guest rIP: 0xffffffffa047b9f4 disabled perfctr wrmsr: 0xc2 data 0xffff
Nov 22 07:33:11 openqaworker1 smartd[1469]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 206 to 214
Nov 22 07:35:59 openqaworker1 kernel: kvm [28047]: vcpu0, guest rIP: 0xffffffffb3475434 disabled perfctr wrmsr: 0xc2 data 0xffff
Nov 22 07:36:05 openqaworker1 kernel: kvm [26199]: vcpu0, guest rIP: 0xffffffffa1875434 disabled perfctr wrmsr: 0xc2 data 0xffff
Nov 22 07:36:50 openqaworker1 openqa-worker-cacheservice-minion[4906]: DBD::SQLite::st execute failed: database disk image is malformed at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 301.
Nov 22 07:36:50 openqaworker1 openqa-worker-cacheservice-minion[4906]: at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 301.
Nov 22 07:36:50 openqaworker1 openqa-worker-cacheservice-minion[4906]: at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 26.
Nov 22 07:36:50 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=4/NOPERMISSION
Nov 22 07:36:50 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 22 07:36:50 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Nov 22 07:36:51 openqaworker1 systemd[1]: openqa-worker-cacheservice-minion.service: Service RestartSec=100ms expired, scheduling restart.
Nov 22 07:36:51 openqaworker1 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Nov 22 07:36:51 openqaworker1 systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [e] [4MP8m9ei] Database integrity check found errors:
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [e] [4MP8m9ei] *** in database main ***
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: Page 15 is never used
…
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [e] [4MP8m9ei] Re-indexing and vacuuming broken database
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [e] [4MP8m9ei] Purging cache directory because database has been corrupted: DBD::SQLite::st execute failed: UNIQUE constraint failed: mojo_migrations.name at /usr/share/openqa/script/../lib/OpenQA/CacheS>
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: at /usr/share/openqa/script/../lib/OpenQA/CacheService/Model/Cache.pm line 66.
Nov 22 07:36:51 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] [4MP8m9ei] Creating cache directory tree for "/var/lib/openqa/cache"
Nov 22 07:36:56 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] [4MP8m9ei] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 400GiB
Nov 22 07:36:57 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] Resetting all leftover locks after restart
Nov 22 07:36:57 openqaworker1 openqa-worker-cacheservice-minion[31276]: [31276] [i] Worker 31276 started
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Isotovideo exit status: 0
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] +++ worker notes +++
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] End time: 2020-11-22 06:40:07
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Result: done
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Uploading video.ogv
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Uploading vars.json
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Uploading autoinst-log.txt
Nov 22 07:40:07 openqaworker1 worker[3076]: [info] Uploading worker-log.txt
Nov 22 07:40:08 openqaworker1 worker[3076]: [info] Uploading serial0.txt
Nov 22 07:40:08 openqaworker1 worker[3076]: [info] Uploading video_time.vtt
Nov 22 07:40:08 openqaworker1 worker[3076]: [info] Uploading serial_terminal.txt
Nov 22 07:40:12 openqaworker1 worker[3076]: [error] Worker cache not available: No workers active in the cache service
So at one point the database is detected as corrupted, re-indexing is triggered and the Minion "workers" are stopped and restarted, but then the next job that finishes its work and picks up a new one shows the error "Worker cache not available: No workers active in the cache service".
openqa-worker-cacheservice itself never notices the problem until the whole machine is triggered for reboot.
Also see the note in #67000#note-58 for more details.
https://github.com/os-autoinst/openQA/pull/3579 is our new attempt.
Updated by mkittler about 4 years ago
- Assignee changed from okurz to mkittler
Yes, as we saw together, the problem started to emerge earlier than the point I went back to when checking the logs. I've reviewed your PR (see that for further comments).
Updated by okurz about 4 years ago
- Related to action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry added
Updated by mkittler about 4 years ago
Unfortunately the PR doesn't work as intended:
Nov 24 03:58:03 openqaworker4 openqa-workercache-daemon[3503]: [3503] [e] [zRyC9NaO] row 659 missing from index downloads_lock
Nov 24 03:58:03 openqaworker4 openqa-workercache-daemon[3503]: [3503] [e] [zRyC9NaO] row 816 missing from index downloads_lock
Nov 24 03:58:03 openqaworker4 openqa-workercache-daemon[3503]: [3503] [e] [zRyC9NaO] wrong # of entries in index downloads_lock
Nov 24 03:58:03 openqaworker4 openqa-workercache-daemon[3503]: [3503] [e] [zRyC9NaO] Purging cache directory because database has been corrupted: database integrity check failed
Nov 24 03:58:03 openqaworker4 openqa-workercache-daemon[3503]: [3503] [e] [zRyC9NaO] Killing all processes accessing the corrupted database file handles (including ourselves)
Nov 24 03:58:03 openqaworker4 openqa-worker-cacheservice-minion[3504]: Specified filename cache.sqlite* does not exist.
Nov 24 03:58:03 openqaworker4 openqa-workercache-daemon[3503]: Specified filename cache.sqlite* does not exist.
Nov 24 03:58:03 openqaworker4 openqa-workercache-daemon[3503]: Killing DB accessing processes failed when trying to cleanup at /usr/share/openqa/script/../lib/OpenQA/CacheService/Model/Cache.pm line 48.
Nov 24 03:58:03 openqaworker4 openqa-worker-cacheservice-minion[3504]: Killing DB accessing processes failed when trying to cleanup at /usr/share/openqa/script/../lib/OpenQA/CacheService/Model/Cache.pm line 48.
Nov 24 03:58:03 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=1/FAILURE
Nov 24 03:58:03 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Nov 24 03:58:03 openqaworker4 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Nov 24 03:58:03 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=1/FAILURE
Nov 24 03:58:03 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 24 03:58:03 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
One needed to pass the full path to fuser, but even then it wouldn't work because the cache file is deleted before fuser is called.
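A small shell sketch of the intended ordering, with the default cache database path from this ticket (the actual fix lives in the Perl code of Cache.pm, so this only illustrates the principle):
db=/var/lib/openqa/cache/cache.sqlite
# kill processes still holding the database (and its WAL/SHM side files) open, using full paths, *before* deleting anything
fuser -k "$db" "$db"-wal "$db"-shm 2>/dev/null
rm -f "$db" "$db"-wal "$db"-shm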
Updated by mkittler about 4 years ago
PR to fix problem from previous comment: https://github.com/os-autoinst/openQA/pull/3584
Updated by mkittler about 4 years ago
- Status changed from In Progress to Resolved
See #67000#note-66 for a status update. I'm closing this issue in favor of #67000 so comments are not spread across 2 different tickets. I also just checked with ./openqa-query-for-job-label poo#80202 and the last occurrence was 2020-11-23. So our last PRs might have already "fixed" that particular symptom.
Updated by okurz about 4 years ago
- Status changed from Resolved to Blocked
We still need the ticket to be open to handle cases when the issue happens, e.g. https://openqa.opensuse.org/tests/1483735
Updated by mkittler about 4 years ago
I had only openqa-worker-cacheservice (and not openqa-worker-cacheservice-minion) running for a short time frame on this o3 worker when switching to the DELETE journaling mode (see #67000#note-68). This job incompleted around the same time so it is likely just because of my reconfiguration. I haven't seen any other instances besides this job.
I would also prefer to discuss the problem only in #67000 instead of having multiple discussions in different tickets which are actually just about different symptoms of the same underlying problem. And at this point it is clear that the SQLite database corruption addressed in #67000 causes these jobs as well.
Updated by okurz about 4 years ago
mkittler wrote:
I would also prefer to discuss the problem only in #67000 instead of having multiple discussions in different tickets which are actually just about different symptoms of the same underlying problem. And at this point it is clear that the SQLite database corruption addressed in #67000 causes these jobs as well.
Yes, I agree. We still need this ticket to track the specific error message. You can of course generalize the regex in #67000 if you want them all to be treated the same. I would just keep all of them as "Blocked", solve #67000 first, and afterwards check for each symptom whether it is still happening and whether there is any additional, specific improvement to make.
Updated by mkittler about 4 years ago
Looks like this isn't happening anymore; this query doesn't show any results:
select state, t_started, (select host from workers where id = assigned_worker_id), reason from jobs where t_started > '2020-11-25 12:17:44' and result = 'incomplete' and reason like '%No workers active%';
Updated by okurz about 4 years ago
Even with the underlying issue #67000 fixed I think we should improve how jobs are treated if an error like this one happens. Can we trigger automatic restarts or un-assign the job from the worker so that it will be put back into the queue and picked up by another worker? Maybe for automatic restarts it would be necessary to adjust the specific "reason" to fall into another category.
Updated by mkittler about 4 years ago
Can we trigger automatic restarts or un-assign the job from the worker so that it will be put back into the queue and picked by another worker?
We can trigger automatic restarts. As we've discussed in an earlier meeting this is not the only case where it would make sense. However, I would not put it back into the queue because these incomplete jobs already contain data from the execution (like the error message from the issue title) and setting these jobs back to scheduled would mean we needed to delete that information again to avoid confusion and possible issues with that. The problem is that we likely need this information when investigating errors.
Maybe to automatically restart it is necessary to adjust the specific "reason" to be of another category.
Yes, it would make sense for cache service related issues to have the cache service: … category. Of course, if the cache service merely passes on a 404 response, that wouldn't count as a cache service issue but would still be treated as setup failure: …. We could of course also introduce a 3rd category asset failure: … for asset-specific problems (which we cannot blame the cache service for).
Updated by mkittler almost 4 years ago
- Status changed from Blocked to In Progress
- Priority changed from Urgent to Normal
PR for distinguishing the errors: https://github.com/os-autoinst/openQA/pull/3618
Updated by openqa_review almost 4 years ago
- Due date set to 2020-12-19
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler almost 4 years ago
PR for restarting incompletes due to cache service problems: https://github.com/os-autoinst/openQA/pull/3624
Updated by mkittler almost 4 years ago
- Status changed from In Progress to Feedback
The PR has been merged. Let's see whether it works in production.
Updated by mkittler almost 4 years ago
- Status changed from Feedback to Resolved
There are still no more jobs with No workers active within the reason. However, the latest change was supposed to allow a better general distinction by introducing categories like cache service: …. This seems to have worked, e.g. on OSD the query
select state, t_started, (select host from workers where id = assigned_worker_id), reason from jobs where t_started > '2020-12-09 00:00:00' and result = 'incomplete' and reason like 'cache failure: %';
returns a lot of jobs (the same goes for asset failure: …). And on the other hand the query for the generic category
openqa=> select state, t_started, (select host from workers where id = assigned_worker_id), reason from jobs where t_started > '2020-12-09 00:00:00' and result = 'incomplete' and reason like 'setup failure: %';
state | t_started | host | reason
-------+---------------------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
done | 2020-12-17 14:03:39 | 10.160.0.85 | setup failure: Failed to download SLES-15-SP3-aarch64-102.1@aarch64-unregistered.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLES-15-SP3-aarch64-102.1@aarch64-unregistered.qcow2
done | 2020-12-17 06:31:14 | 10.160.0.85 | setup failure: Failed to download SLE-15-SP2-Online-aarch64-Build209.2-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP2-Online-aarch64-Build209.2-Media1.iso.sha256
now returns only 2 jobs. (10.160.0.85 is openqaworker-arm-3 and I would assume that this worker simply hadn't been updated yet at this point.)